How to Find Outliers in Data using Machine Learning
No matter how careful you are during data collection, every data scientist has felt the frustration of finding outliers.
An outlier is a data point that is noticeably different from the rest. They represent errors in measurement, bad data collection, or simply show variables not considered when collecting the data.
Table of Contents
- Why You Shouldn’t Just Delete Outliers?
- Impact On Machine Learning Models
- Detecting Outliers In Statistics Normal Situations
- Detecting Outliers in Data using Machine Learning
Why You Shouldn’t Just Delete Outliers?
Many data analysts are tempted to delete outliers. However, this is sometimes the wrong choice. Occasionally, Like in conventional analytical models, in machine learning, too, you need to resist the urge to simply hit the delete button when you come across such an anomaly, to improve your model’s accuracy. So, rather than a knee-jerk reaction, one must tread with caution while handling outliers.
One cannot recognize outliers while collecting data; you won’t know what values are outliers until you begin analyzing the data.
Many statistical tests are sensitive to outliers and therefore, the ability to detect them is an important part of data analytics.
The interpretability of an outlier model is very important, and decisions seeking to tackle an outlier need some context or rationale.
In fact, outliers sometimes can be helpful indicators. For example, in some applications of data analytics like credit card fraud detection, outlier analysis becomes important because here, the exception rather than the rule may be of interest to the analyst.
Simplistically speaking, here are some options you have when you detect outliers: accept them, correct them or delete them. If there’s a chance that the outlier will not significantly alter the outcome, you may “accept” it.
Otherwise, you can either ‘correct’ it or delete it. However, you should reserve deletion only for data points that are definitely wrong.
Impact on Machine Learning Models
Machine learning algorithms, too, are at risk to the statistics and distribution of the input variables. In supervised models, outliers can deceive the training process resulting in prolonged training times, or leading to the development of less precise models.
According to Alvira Swalin, a data scientist at Uber, machine learning models, like linear & logistic regression are easily influenced by the outliers in the training data.
Some models even exist that hike the weights of misclassified points for every repetition of the training.
Detecting Outliers in Statistics Normal Situations
How to find the outliers? There is no one method to detect outliers because of the facts at the center of each dataset. One dataset is different from the other.
A rule-of-the-thumb could be that you, the domain expert, can inspect the unfiltered, basic observations and decide whether a value is an outlier or not.
There are more scientific methods, though. You can carry out two types of analysis to find outliers – uni-variate, which involves just one variable, and multi-variate. These are different outlier methods for outlier analysis:
In descriptive statistics, a box plot is a method for graphically depicting groups of numerical data through their quartiles.
Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers in machine learning will appear separate from the plot. (Source: Wikipedia)
A scatter plot is a chart type that is normally used to observe and visually display the relationship between variables. The values of the variables are represented by dots.
The positioning of the dots on the vertical and horizontal axis will inform the value of the respective data point; hence, scatter plots make use of Cartesian coordinates to display the values of the variables in a data set. Scatter plots are also known as scattergrams, scatter graphs, or scatter charts. (Source: CFI)
Z-score: A Z-score is a numerical measurement that describes a value’s relationship to the mean of a group of values. Z-score is measured in terms of standard deviations from the mean (Source: Investopedia
Z-score is a measure of a point’s relationship to the average of all points in the dataset. When scored, the values receive a positive or negative number. This number is the number of standard deviations above or below the average value.
Thus, when an analyst calculates z-scores and finds data points with a value above 1, he has found the outliers.
Detecting Outliers in Machine Learning
How to detect outliers in machine learning? How do you deal with outliers in predictive analytics? In machine learning, however, there’s one way to tackle outliers: it’s called “one-class classification” (OCC).
This involves fitting a model on the “normal” data, and then predicting whether the new data collected is normal or an anomaly.
However, one-class classifiers can only identify if the new data is ‘normal’ relative to the data it was initially fed. In other words, the OCC will give incorrect predictions if the training set has outliers.
Author Charu C Aggarwal, in his book “Outlier Analysis”, discusses many outlier detection methods. Some notable ones include:
Probabilistic and Statistical Models
You can use statistics to identify unlikely outcomes.
This model is interpreted as a linear combination of features. direct correlations are used to model the data into lower dimensions. As an example, principal component analysis and data with large residual errors may be outliers.
High-Dimensional Outlier Detection
High-dimensional data present a major challenge. Many of the current algorithms cannot address the problems of a large number of features.
A paper by Aggarwal and his colleague Philip S Yu states that, for effectiveness, high dimensional outlier detection algorithms must satisfy many properties, including the provision of interpretability in terms of the reasoning which creates the abnormality.
In machine learning, one cannot just “ignore” data outliers. They can impair the training process, and create cascading errors.
Increase your Sales and Conversions with Outliers
What are 3 Different Types of Outliers
There are 3 different categories of outliers in machine learning:
- Type 1: Global Outliers
- Type 2: Contextual Outliers
- Type 3: Collective Outliers
Global Outliers: Type 1
The Data point is measured as a global outlier if its value is far outside the entirety of the data in which it is contained.
Contextual or Conditional Outliers: Type 2
Contextual or conditional outliers are data sets whose value considerably diverges from other data points within a similar context. The “context” is approximate all the time temporal in time-series data sets, like the records of a detailed extent over time.
Collective Outliers: Type 3
A division of data points in a data set is measured abnormal if those values as a group deviate significantly from the whole data set, but the values of a single data point are not themselves abnormal in whichever contextual or global logic.
In time series data sets, one way this can be noticeable is as usual peaks and valleys happening outside of a time frame when that seasonal sequence is usual or as a grouping of time series that is in an outlier condition as a collection.
Reference: Machine Learning Mastery