How To Detect Outliers In Data Wrangling: Examples and Use Cases
Outlier Detection: Detection of outliers is a very important part of the data wrangling process. Outliers are those values of an attribute which lie far off from the majority values present for that attribute. Outliers can cause a statistical or machine learning model to perform poorly because they do not fall within the normal range of values for that attribute.
For outlier detection methods we need to know the 1st (Q1) and+ 3rd (Q3) quartile of that numerical attribute.
Then we can calculate the Interquartile Range (IQR) as follows :
IQR = Q3-Q1.
If any value falls outside the normal range for that attribute, it is declared as an outlier. The normal range is given by:
[Q1 – (1.5 * IQR), Q3 + (1.5 * IQR)]
The Python Boxplot images show the difference between value ranges of “Price” (in USD) attribute, before and after the removal of outliers.
As we can see, that the “Maximum Price” value in the 1st graph was $8000 and the there were a lot of outliers present in the data.
After the removal of outliers, the “Maximum Price” value in the 2nd graph became $350. For this attribute, we had Q3 = $180, Q1 = $69 and IQR = $111.
Business Use Case
Outlier detection methods are used across industry for different purposes, namely fraud detection, traffic monitoring, web fault detection, building of intrusion detection systems, etc.
For example, banks can provide us with their customer transaction data, where we are supposed to identify customers who look suspicious based on their transaction records. In this case, if we observe the average transaction amount of the customers based on a certain time span and plot the data, we can easily identify the customers whose average transaction amount looks quite off as compared to the other customers. Then, we can alert the bank about those customers so that they can make further detailed inquiries about their transaction details.
Standardization Of Textual Data In Python
In many cases, we perform comparison based on textual data. A common issue observed here is lack of a standard data format for string type attributes. For example, the same words may not be detected if one starts with a capital case letter and the other does not. To overcome this, we should standardize the strings. We can convert all of them to upper case or lower case. Also, we should perform data cleaning (major part of data wrangling) of textual data to make sure that the same format is followed across different datasets.
Data cleaning example of two dummy datasets
Below is an example of two dummy datasets. One contains the records of height in feet and the other contains records of date Of birth for 5 different actors.
As we can see, the values in the “Actor Name” column in the “Height” dataset looks correct, but in the DoB dataset it contains unwanted characters. So if we try to create a single dataset by joining these two datasets based on the “Actor Name” attribute, then we get an empty dataset. This is because the values in the “Actor Name” column need to match in the two datasets for the join to occur successfully.
To resolve this, we need to clean the values in that column in the DoB dataset, and then try to join the two datasets. We will use the re library of Python to implement Regular Expressions and come up with a cleaned set of text. We also use the string library of Python to get all the punctuations of the English language.
Next, we need to create a Python function that takes a text as input, cleans the text and returns it. We then apply this function on the “Actor Name” column of the DoB dataset.
We also apply the same function to the “Actor Name” column of the “Height’ dataset to make sure that the “Actor Name” columns in both the datasets are standardized, and the connection occurs without any issue.
Finally, we join the two datasets based on the “Actor Name” column to get the final dataset. This time we see that we get the desired result, and not an empty dataset as before.
This is another important aspect of Data Wrangling use case. Almost every dataset contains an attribute related to date-time that provides us with useful information. So it is very important to use the different utility functions related to date-time and extract relevant information from it.
Below image shows how date-time information is stored in a dataset. As we can see, the dates are in STRING (or Textual) format and they need to be converted to Python DateTime format.
We create a Python function to convert the values of the date column from String to DateTime format. We apply that function and then check the data type of the Date column. Now the values got converted into DateTime format from String format.
We can then extract the day, month and year information and store it in separate columns in the dataset for further analysis.
Also, we can find the difference between two dates in terms of number of days. This operation is very useful and can be used for multiple purposes. Ex : Calculating and storing Customer Recency Information for e-commerce websites.
In conclusion: Given the amount of data being generated almost every minute today, if more ways of automating the data wrangling process are not found soon, there is a very high probability that much of the data the world produces shall continue to just sit idle, and not deliver any value to the enterprise at all.
Build sentiment analysis models with Oyster
Whatever be your business, you can leverage Express Analytics’ customer data platform Oyster to analyze your customer feedback. To know how to take that first step in the process, press on the tab below.
Liked This Article?
Gain more insights, case studies, information on our product, customer data platform