Detect Outliers in Data Wrangling: Examples and Use Cases

Data wrangling is a crucial step in the data science pipeline that involves cleaning, transforming, and preparing raw data for analysis. One of the most important aspects of data wrangling is identifying and handling outliers - data points that significantly differ from the rest of the dataset.

What are Outliers?

Outliers are data points that deviate significantly from the overall pattern of the data, and they come up in almost any data analysis task.

Before you examine a dataset, it helps to state your assumptions about the process that produced the data, because those assumptions determine which values count as unusual.

Outliers can be caused by:

Measurement errors during data collection
Data entry mistakes by humans
Natural variations in the data
Systematic errors in data processing
Legitimate extreme values that represent real phenomena

If a few data points are clearly erroneous in some way, treat them as outliers and correct or remove them as the situation requires.

Outliers can cause a statistical or machine learning model to perform poorly because they fall outside the normal range of values for that attribute.

Why Outlier Detection Matters

Outliers can have a significant impact on data analysis and modeling:

1. Statistical Analysis Impact

Skew mean and standard deviation calculations
Affect correlation coefficients
Distort regression analysis results
Impact hypothesis testing outcomes

2. Machine Learning Impact

Reduce model accuracy
Increase training time
Cause overfitting or underfitting
Lead to poor generalization

3. Business Impact

Misleading insights and reports
Poor decision-making
Inaccurate predictions
Wasted resources on false signals

Common Outlier Detection Methods

1. Statistical Methods

Z-Score Method

The Z-score measures how many standard deviations a data point is from the mean:

Z-score = (X - μ) / σ

Data points with |Z-score| > 3 are typically considered outliers.
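
As a minimal sketch of the formula above, the following uses only Python's standard library. The function name `zscore_outliers` and the sample `readings` are hypothetical, and the population standard deviation is an assumption (the sample standard deviation is an equally common choice).

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [x for x in values if abs((x - mu) / sigma) > threshold]

# Hypothetical sample: twenty readings near 10 and one extreme value.
readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
            11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 100]
outliers = zscore_outliers(readings)
print(outliers)
```

Note that the extreme value itself inflates the mean and standard deviation, so on very small samples a Z-score may never reach 3 no matter how extreme a point is.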

IQR Method (Interquartile Range)

Uses the 1st (Q1) and 3rd (Q3) quartiles of a numerical attribute to identify outliers:

IQR = Q3 - Q1
Lower Bound = Q1 - 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR

Any value that falls outside the normal range [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] for that attribute is an outlier.
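
The bounds can be sketched with the standard library's `statistics.quantiles`. The helper names and the `ages` sample are hypothetical. Quartile conventions differ between tools; the "inclusive" method used here is one assumption among several.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds of the normal range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the normal range."""
    lower, upper = iqr_bounds(values, k)
    return [x for x in values if x < lower or x > upper]

# Hypothetical sample with one implausible entry.
ages = [22, 25, 27, 28, 30, 31, 33, 35, 36, 95]
print(iqr_bounds(ages))
print(iqr_outliers(ages))
```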

Modified Z-Score

More robust to extreme values:

Modified Z-score = 0.6745 × (X - median) / MAD

Where MAD is the Median Absolute Deviation.
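
A short sketch of the modified Z-score follows. The cutoff of 3.5 is a commonly used convention and an assumption here, since the text above does not name a threshold. The function names and sample data are hypothetical.

```python
from statistics import median

def modified_zscores(values):
    """Compute 0.6745 * (x - median) / MAD for every value."""
    med = median(values)
    mad = median(abs(x - med) for x in values)  # Median Absolute Deviation
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return [0.6745 * (x - med) / mad for x in values]

def mad_outliers(values, threshold=3.5):
    """Return the values whose absolute modified Z-score exceeds the threshold."""
    scores = modified_zscores(values)
    return [x for x, z in zip(values, scores) if abs(z) > threshold]

# Hypothetical sample with one extreme value.
durations = [10, 12, 11, 13, 12, 11, 10, 12, 300, 11]
print(mad_outliers(durations))
```

Because the median and MAD barely move when a single extreme value is added, this score tends to stay usable on small samples where the ordinary Z-score is distorted.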

2. Density- and Isolation-Based Methods

Local Outlier Factor (LOF)

Identifies outliers based on local density:

Calculates density around each point
Compares local density with neighbors
Points with significantly lower density are outliers
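
The steps above could be tried with scikit-learn's `LocalOutlierFactor`, as in this sketch. The grid data and the choice of `n_neighbors=5` are assumptions made for illustration, not recommendations.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical data: a 5x4 grid of evenly spaced points plus one distant point.
grid = [[x, y] for x in range(5) for y in range(4)]
X = np.array(grid + [[20, 20]], dtype=float)

lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(X)              # -1 marks outliers, 1 marks inliers
scores = -lof.negative_outlier_factor_   # larger score = lower relative density

print(X[labels == -1])
```

The distant point sits in a much sparser neighborhood than its nearest neighbors, so it receives the largest score.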

Isolation Forest

Uses random partitioning to isolate outliers:

Outliers require fewer partitions to isolate
More efficient for large datasets
Works well with high-dimensional data
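
A minimal sketch with scikit-learn's `IsolationForest` is shown here. The synthetic data, the `contamination` value, and the fixed seeds are all assumptions; in practice the expected share of outliers is rarely known in advance.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: 200 points around the origin plus one distant point.
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X = np.vstack([cluster, [[12.0, 12.0]]])

model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = model.fit_predict(X)     # -1 marks outliers, 1 marks inliers
scores = model.score_samples(X)   # lower score = easier to isolate

print(X[labels == -1])
```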

3. Clustering-Based Methods

DBSCAN

Density-based clustering that can identify outliers:

Points not belonging to any cluster are outliers
Effective for spatial data
Handles clusters of varying shapes
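
As an illustration, scikit-learn's `DBSCAN` assigns the label -1 to points that belong to no cluster. The data and the `eps` and `min_samples` values below are assumptions chosen to fit this toy example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: a dense 5x4 grid of points plus one isolated point.
grid = [[x, y] for x in range(5) for y in range(4)]
X = np.array(grid + [[20, 20]], dtype=float)

# eps is the neighborhood radius; min_samples is the density requirement.
labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(X)

noise_points = X[labels == -1]  # points not assigned to any cluster
print(noise_points)
```

Results depend strongly on `eps`, so it usually has to be tuned to the scale of the data.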

Business Use Cases

Outlier detection methods are used across industries for purposes such as fraud detection, traffic monitoring, web fault detection, and building intrusion detection systems.

For example, a bank might provide its customer transaction data and ask us to identify customers who look suspicious based on their transaction records.

In this case, if we compute each customer's average transaction amount over a specific time span and plot the data, we can easily identify the customers whose average transaction amount looks quite off compared to the other customers.

Then, we can alert the bank about those customers so that they can make further detailed inquiries about their transaction details.
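
The bank scenario might be sketched as follows, replacing the visual check with the IQR rule. The customer IDs and amounts are made up, and a real review would use far more than six customers.

```python
from collections import defaultdict
from statistics import mean, quantiles

# Hypothetical (customer_id, amount) records for one time span.
transactions = [
    ("C1", 40), ("C1", 55), ("C2", 60), ("C2", 45), ("C3", 50), ("C3", 52),
    ("C4", 48), ("C4", 58), ("C5", 4200), ("C5", 3900), ("C6", 47), ("C6", 53),
]

# Group the amounts by customer and compute each customer's average.
amounts_by_customer = defaultdict(list)
for customer_id, amount in transactions:
    amounts_by_customer[customer_id].append(amount)
avg_amount = {cid: mean(vals) for cid, vals in amounts_by_customer.items()}

# Flag customers whose average falls outside the IQR-based normal range.
q1, _, q3 = quantiles(avg_amount.values(), n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
suspicious = [cid for cid, avg in avg_amount.items() if avg < lower or avg > upper]
print(suspicious)
```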

Standardization of Textual Data in Python

In many cases, we perform comparisons based on textual data. A common issue observed here is the lack of a standard data format for string-type attributes.

For example, an exact comparison will not match two copies of the same word if one starts with a capital letter and the other does not. To overcome this, we should standardize the strings.

We can convert all of them to upper case or lower case. Also, we should perform data cleaning (a significant part of data wrangling) of textual data to make sure that the same format is followed across different datasets.
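
A small sketch of this idea is shown below. The helper name `standardize` and the sample names are hypothetical. Besides lower-casing, it also trims and collapses whitespace, which is an extra assumption about what "same format" means.

```python
def standardize(text):
    """Trim the ends, collapse repeated whitespace, and convert to lower case."""
    return " ".join(text.split()).lower()

# Hypothetical values as they might appear in two different datasets.
names_a = ["Tom", "  ANNA ", "Lee  Chan"]
names_b = ["tom", "anna", "lee chan"]

raw_matches = [a == b for a, b in zip(names_a, names_b)]
clean_matches = [standardize(a) == standardize(b) for a, b in zip(names_a, names_b)]
print(raw_matches, clean_matches)
```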

Data cleaning example of two dummy datasets

Consider two dummy datasets. One contains the height records (in feet), and the other contains the Date of Birth (DoB) records for five different actors.

The values in the "Actor Name" column of the "Height" dataset look correct, but the same column in the DoB dataset contains unwanted characters.

So if we try to create a single dataset by joining these two datasets based on the "Actor Name" attribute, then we get an empty dataset. This is because the values in the "Actor Name" column need to match in the two datasets for the join to occur successfully.

To resolve this, we need to clean the values in that column in the DoB dataset and then try to join the two datasets. We will use the re library of Python to implement Regular Expressions and come up with a cleaned set of text.

We also use the string library of Python to get all the punctuation of the English language.

Next, we need to create a Python function that takes a text as input, cleans the text, and returns it. We then apply this function to the "Actor Name" column of the DoB dataset.

We also apply the same function to the "Actor Name" column of the "Height" dataset to make sure that the "Actor Name" columns in both datasets are standardized and the join occurs without any issue.

Finally, we join the two datasets based on the "Actor Name" column to get the final dataset. This time, we see that we get the desired result, and not an empty dataset as before.
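
The workflow above could look roughly like this sketch. It uses plain lists of dictionaries rather than DataFrames, which is an assumption made to keep the example dependency-free. The names, heights, and dates are invented placeholders, and `clean_text` and `inner_join` are hypothetical helpers.

```python
import re
import string

# Character class that matches any single punctuation character.
PUNCT_PATTERN = re.compile("[" + re.escape(string.punctuation) + "]")

def clean_text(text):
    """Remove punctuation, collapse whitespace, and lower-case the text."""
    text = PUNCT_PATTERN.sub("", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()

def inner_join(left, right, key):
    """Keep only rows whose key value appears in both datasets."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]}
            for row in left if row[key] in right_by_key]

# Hypothetical records standing in for the two dummy datasets.
heights = [{"Actor Name": "Alex Reed", "Height": 5.9},
           {"Actor Name": "Jamie Cole", "Height": 6.1},
           {"Actor Name": "Sam Park", "Height": 5.7}]
dobs = [{"Actor Name": "alex reed!!", "DoB": "1990-01-01"},
        {"Actor Name": "#Jamie Cole", "DoB": "1985-06-15"},
        {"Actor Name": " Sam Park??", "DoB": "1992-11-30"}]

joined_before = inner_join(heights, dobs, "Actor Name")  # no names match yet

# Apply the same cleaning function to the key column of both datasets.
for row in heights + dobs:
    row["Actor Name"] = clean_text(row["Actor Name"])

joined_after = inner_join(heights, dobs, "Actor Name")
print(len(joined_before), len(joined_after))
```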

Date Manipulation

Date manipulation is another critical part of data wrangling. Almost every dataset contains a date-time attribute that provides helpful information, so it is essential to use the various date-time utility functions to extract relevant information from it.

In the example dataset, the dates are stored in STRING (or textual) format and need to be converted to a Python DateTime format.

We create a Python function to convert the values of the date column from String to DateTime format. We apply that function and then check the data type of the Date column. Now the values have been converted into DateTime format from String format.
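
A conversion function of this kind might look like the sketch below. The day-month-year format string is an assumption, since the original date format is not shown in the text, and the sample values are made up.

```python
from datetime import datetime

def to_datetime(text, fmt="%d-%m-%Y"):
    """Parse a date string into a datetime object using the given format."""
    return datetime.strptime(text, fmt)

# Hypothetical string values from a "Date" column.
raw_dates = ["15-03-2021", "02-11-2020"]
parsed_dates = [to_datetime(value) for value in raw_dates]

print(type(raw_dates[0]).__name__, type(parsed_dates[0]).__name__)
```

`datetime.strptime` raises a `ValueError` when a string does not match the format, which also makes it a cheap way to surface malformed date entries.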

We can then extract the day, month, and year information and store it in separate columns in the dataset for further analysis.
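
For instance, assuming each record is a dictionary whose "Date" value is already a datetime object, the parts can be copied into new columns like this (the records are hypothetical):

```python
from datetime import datetime

# Hypothetical records whose "Date" values are already datetime objects.
records = [{"Date": datetime(2021, 3, 15)}, {"Date": datetime(2020, 11, 2)}]

# Store day, month, and year as separate columns on each record.
for row in records:
    row["Day"] = row["Date"].day
    row["Month"] = row["Date"].month
    row["Year"] = row["Date"].year

print(records[0])
```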

We can also find the difference between two dates as a number of days, which is useful for many purposes. For example, e-commerce websites use it to calculate and store customer recency information.
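
A recency calculation can be sketched as a simple date subtraction. The dates and the helper name `days_between` are hypothetical.

```python
from datetime import date

def days_between(earlier, later):
    """Return the number of whole days from earlier to later."""
    return (later - earlier).days

# Hypothetical values: a customer's last purchase and the reporting date.
last_purchase = date(2023, 1, 10)
report_date = date(2023, 3, 1)

recency_days = days_between(last_purchase, report_date)
print(recency_days)
```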

Practical Examples and Use Cases

1. Financial Services

Fraud Detection

Challenge: Identifying fraudulent transactions among millions of legitimate ones.

Approach: Use anomaly detection algorithms to flag unusual patterns:

Unusual transaction amounts
Geographic anomalies (transactions from unexpected locations)
Time-based anomalies (transactions at unusual hours)
Behavioral anomalies (unusual spending patterns)

Example: A credit card transaction for $50,000 from a foreign country when the cardholder typically makes $50 purchases locally.

Risk Assessment

Challenge: Identifying high-risk loans or investments.

Approach: Analyze credit scores, income levels, and payment histories to detect:

Unusually high debt-to-income ratios
Inconsistent income reporting
Suspicious payment patterns

2. Healthcare

Medical Diagnosis

Challenge: Identifying patients with unusual symptoms or test results.

Approach: Use statistical methods to detect:

Abnormal lab values
Unusual vital signs
Atypical patient responses to treatments

Example: A patient with blood pressure readings significantly higher than the normal range.

Drug Safety Monitoring

Challenge: Detecting adverse drug reactions in clinical trials.

Approach: Monitor patient responses to identify:

Unexpected side effects
Unusual drug interactions
Atypical patient outcomes

3. Manufacturing

Quality Control

Challenge: Identifying defective products on production lines.

Approach: Use sensor data to detect:

Products with measurements outside specifications
Unusual production parameters
Equipment malfunctions

Example: A car part with dimensions that deviate significantly from design specifications.

Predictive Maintenance

Challenge: Predicting equipment failures before they occur.

Approach: Monitor sensor data to identify:

Unusual vibration patterns
Abnormal temperature readings
Unexpected energy consumption

4. E-commerce

Customer Behavior Analysis

Challenge: Understanding normal vs. unusual customer behavior.

Approach: Analyze purchase patterns to identify:

Unusual buying sprees
Suspicious return patterns
Atypical browsing behavior

Example: A customer who typically spends $50 suddenly makes a $5,000 purchase.

Inventory Management

Challenge: Identifying unusual demand patterns.

Approach: Monitor sales data to detect:

Sudden spikes in demand
Unusual seasonal patterns
Unexpected product popularity

Implementation Best Practices

1. Data Understanding

Domain Knowledge: Understand what constitutes normal vs. abnormal in your context
Data Quality: Ensure data is clean and properly formatted
Feature Engineering: Create relevant features for outlier detection

2. Method Selection

Data Size: Choose methods appropriate for your dataset size
Data Type: Consider whether data is numerical, categorical, or mixed
Computational Resources: Balance accuracy with performance requirements

3. Validation

Cross-Checking: Use multiple methods to confirm outliers
Domain Expert Review: Have subject matter experts validate findings
Business Impact Assessment: Evaluate the consequences of removing outliers

4. Handling Strategies

Removal: Delete outliers if they're clearly errors
Capping: Limit extreme values to reasonable bounds
Transformation: Apply log or other transformations
Separate Analysis: Analyze outliers separately for insights
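
The first three strategies can be sketched as follows. The IQR bounds are used as the "reasonable bounds" purely as an assumption, and the function names and sample data are hypothetical.

```python
import math
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds of the IQR-based normal range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values):
    """Removal: drop values outside the bounds."""
    lower, upper = iqr_bounds(values)
    return [x for x in values if lower <= x <= upper]

def cap_outliers(values):
    """Capping: clip values to the bounds instead of dropping them."""
    lower, upper = iqr_bounds(values)
    return [min(max(x, lower), upper) for x in values]

def log_transform(values):
    """Transformation: compress large values with log(1 + x); needs x > -1."""
    return [math.log1p(x) for x in values]

# Hypothetical order values with one extreme entry.
order_values = [22, 25, 27, 28, 30, 31, 33, 35, 36, 95]
removed = remove_outliers(order_values)
capped = cap_outliers(order_values)
logged = log_transform(order_values)
print(removed, capped[-1], round(logged[-1], 2))
```

Capping keeps the row count intact, which matters when other columns in the same record are still useful.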

Tools and Technologies

1. Python Libraries

NumPy/SciPy: Statistical outlier detection methods
Pandas: Data manipulation and basic outlier detection
Scikit-learn: Machine learning-based outlier detection
PyOD: Comprehensive outlier detection toolkit

2. R Packages

outliers: Statistical outlier detection
mvoutlier: Multivariate outlier detection
DMwR: Data mining with R

3. Commercial Tools

Tableau: Built-in outlier detection capabilities
Power BI: Statistical outlier identification
SAS: Advanced statistical analysis tools

Challenges and Limitations

1. Context Dependency

What's an outlier in one context may be normal in another
Requires domain expertise to interpret results

2. High-Dimensional Data

Traditional methods become less effective
Curse of dimensionality affects performance

3. Dynamic Data

Outliers may change over time
Requires continuous monitoring and updating

4. False Positives/Negatives

Risk of removing legitimate extreme values
Risk of keeping actual outliers

Future Trends

1. Deep Learning

Neural networks for complex outlier detection
Autoencoders for unsupervised anomaly detection

2. Real-Time Detection

Streaming data analysis
Immediate outlier identification
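
One simple way to approach streaming detection is to keep a running mean and variance and score each new value as it arrives. The sketch below uses Welford's online algorithm. The class name, the warm-up length, and the threshold are assumptions, not a reference to any particular product.

```python
import math

class StreamingZScoreDetector:
    """Flags values that are far from the running mean of the stream so far."""

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold
        self.warmup = warmup  # number of values to observe before flagging
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0         # running sum of squared deviations

    def update(self, x):
        """Score x against the history, then fold it into the running stats."""
        is_outlier = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            is_outlier = std > 0 and abs(x - self.mean) / std > self.threshold
        # Welford's update: constant memory, one pass over the stream.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_outlier

# Hypothetical stream of sensor readings with one spike.
stream = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 50, 10]
detector = StreamingZScoreDetector()
flagged = [x for x in stream if detector.update(x)]
print(flagged)
```

This version folds flagged values into the running statistics as well. Skipping them is a reasonable alternative when outliers are known to be errors.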

3. Explainable AI

Understanding why a point is classified as an outlier
Building trust in automated systems

Conclusion

Outlier detection is a critical component of data wrangling that requires careful consideration of:

Business Context: Understanding what outliers mean in your domain
Method Selection: Choosing appropriate detection techniques
Validation: Confirming whether flagged points are errors or legitimate values before handling them
Action Planning: Deciding how to respond to identified outliers

By implementing robust outlier detection strategies, organizations can:

Improve data quality
Enhance model accuracy
Make better business decisions
Identify opportunities and risks

The key is to approach outlier detection systematically, using multiple methods and validating results with domain experts. Remember that outliers aren't always errors - they can also represent valuable insights or legitimate extreme cases that deserve special attention.

Given the amount of data generated almost every minute today, unless more ways of automating the data wrangling process evolve soon, much of the data the world produces will likely continue to sit idle and deliver no value to enterprises.