Outlier detection is a crucial step in data analysis and science, helping us identify unusual patterns and anomalies in our data. This can be especially useful in fields like finance, healthcare, and quality control.
One common method for outlier detection is the Z-score method, which measures how many standard deviations a data point lies from the mean. Under this method, data points with an absolute Z-score greater than 3 are typically considered outliers.
The Z-score method is simple and effective, but it is sensitive to the distribution of the data. For example, if the data have a heavy tail, the mean and standard deviation are themselves inflated by extreme values, so the method may miss some outliers.
Types of Outliers
Outliers can be categorized into different types, each with its own characteristics. Point outliers are data points that behave unusually in a specific time instant when compared to other values in the time series or to their neighboring points.
There are three main types of outliers in a time series: global point outliers, local point outliers, and subsequence outliers. Global outliers are points that stand out from the rest of the data, while local outliers are points that are unusual only when compared to their neighbors. Subsequence outliers refer to consecutive points in time whose joint behavior is unusual, even if each observation individually is not necessarily a point outlier.
A time series can also be an outlier, but this can only be detected when the input data is a multivariate time series.
Isolation
Isolation is a powerful approach to detecting outliers, and it's based on a clever idea: outliers are easy to isolate from the rest of the data.
Isolation Forest is a specific algorithm that uses multiple trees to identify anomalies. It works by randomly picking a feature and a split value to build each tree.
The key insight behind Isolation Forest is that outliers require fewer splits to isolate them from the rest of the data. This means that abnormal samples will have shorter paths from the root to the leaf node in a tree.
Isolation Forest uses binary decision trees, which are constructed using randomly selected features and a random split value. This makes it robust and simple to optimize, with very few parameters to adjust.
The algorithm then combines the individual trees into a forest, averaging the path lengths across trees to calculate an outlier score for each data point. These scores range from 0 to 1, with values near 0 indicating normal samples and values near 1 indicating likely outliers.
Visualizing how Isolation Forest reaches its decisions can be complex, and on large datasets training can be a long and expensive process. However, the algorithm doesn't require feature scaling, making it a good choice when you can't make assumptions about value distributions.
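As a minimal sketch of how this might look in practice (assuming scikit-learn is available; the toy data and the contamination value are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_normal = rng.normal(loc=0, scale=1, size=(200, 2))     # bulk of the data
X_outliers = rng.uniform(low=-6, high=6, size=(10, 2))   # scattered anomalies
X = np.vstack([X_normal, X_outliers])

# contamination is the expected fraction of outliers (an assumption here)
iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
labels = iso.fit_predict(X)    # -1 = outlier, 1 = inlier
scores = iso.score_samples(X)  # lower (more negative) = more anomalous

print("flagged outliers:", int(np.sum(labels == -1)))
```

Note that scikit-learn reports -1/1 labels and a shifted score rather than the 0-to-1 anomaly score from the original paper, but the interpretation is the same: the easier a point is to isolate, the more anomalous it is.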
Quantile Based
Quantile-based outlier handling involves capping or flooring data points that fall outside chosen percentiles.
In this technique, values beyond the thresholds are pulled back to them: data points less than the 10th percentile are replaced with the 10th percentile value (flooring), and data points greater than the 90th percentile are replaced with the 90th percentile value (capping).
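A minimal sketch of this flooring and capping step, assuming a pandas Series of numeric values (the data and variable names are illustrative):

```python
import pandas as pd

values = pd.Series([3, 4, 5, 6, 7, 8, 9, 10, 95, -40])  # toy data with two extremes

p10 = values.quantile(0.10)  # 10th percentile: the floor
p90 = values.quantile(0.90)  # 90th percentile: the cap

# clip() replaces anything below the floor with p10 and anything above the cap with p90
capped = values.clip(lower=p10, upper=p90)
print(capped)
```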
Statistical Methods
Statistical Methods are a crucial part of outlier detection, and there are several techniques to choose from. Model-based detection methods consider a point an outlier if its distance to its expected value is higher than a predefined threshold.
There are two subcategories of model-based methods: estimation model-based and prediction model-based. Estimation model-based methods use past, current, and future data to obtain the expected value, while prediction model-based methods rely only on past data.
Density-based methods, on the other hand, consider points with fewer than k neighbors within a given distance as outliers. This can be done using sliding windows.
Here are some common statistical methods for outlier detection:
- Model-based methods (estimation model-based and prediction model-based)
- Density-based methods
- Histogramming
One popular statistical method is the Interquartile Range (IQR) method, which flags data points that lie more than 1.5 times the IQR above Q3 or below Q1 as outliers.
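A minimal sketch of the IQR rule, assuming NumPy and purely illustrative data:

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 17, 19, 107])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # anything below this is flagged
upper = q3 + 1.5 * iqr   # anything above this is flagged

outliers = data[(data < lower) | (data > upper)]
print("IQR bounds:", lower, upper)
print("outliers:", outliers)
```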
Standard Deviation
Standard deviation measures the spread of data around the mean. It captures how far away from the mean the data points are. Around 68.2% of the data will lie within one standard deviation from the mean, while around 95.4% and 99.7% of the data lie within two and three standard deviations from the mean, respectively.
The standard deviation of a distribution is denoted by σ, and the mean by μ. One approach to outlier detection is to set the lower limit to three standard deviations below the mean (μ - 3σ), and the upper limit to three standard deviations above the mean (μ + 3σ). Any data point that falls outside this range is detected as an outlier.
A z-score is a measure of how many standard deviations an observation is from the mean. It's calculated as Z = (X - μ) / σ, where X is the observation, μ is the mean, and σ is the standard deviation. A standard cut-off for flagging outliers is a z-score of +/- 3 or further from zero.
In a normally distributed population, z-score values more extreme than +/- 3 have a probability of 0.0027 (2*0.00135), which is about 1 in 370 observations. However, if your data don’t follow the normal distribution, this approach might not be accurate.
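A minimal sketch of the z-score rule described above (illustrative data; assumes NumPy):

```python
import numpy as np

rng = np.random.RandomState(0)
data = np.concatenate([rng.normal(loc=50, scale=5, size=1000), [95, 5, 120]])

mu = data.mean()
sigma = data.std()

z_scores = (data - mu) / sigma          # Z = (X - mu) / sigma
outliers = data[np.abs(z_scores) > 3]   # standard cut-off of +/- 3

print("number of outliers:", outliers.size)
print("most extreme high values:", np.sort(outliers)[-3:])
```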
Percentile Based
The percentile method is an extension of the interquartile range approach: instead of fixed quartiles, you choose your own cut-off percentiles, which can make outlier detection more flexible.
You can use the percentile function in NumPy to calculate the first and third quartiles (the 25th and 75th percentiles). It takes two arguments: an array (or dataframe column) and a list of the percentiles to compute.
The 10th and 90th percentile values can be used to cap outliers at the 90th percentile and floor them at the 10th percentile, as in the quantile-based technique described earlier.
To use the percentile method, you can define a custom range that keeps all data points lying between the 0.5th and 99.5th percentiles of the dataset.
Here's an example of how to use the percentile function to detect outliers:
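(The snippet below is a minimal sketch rather than the original article's code; it assumes a pandas DataFrame named df with a single numeric column called value, both of which are illustrative.)

```python
import numpy as np
import pandas as pd

# illustrative data: 200 records with two extreme values mixed in
rng = np.random.RandomState(1)
df = pd.DataFrame({"value": np.concatenate([rng.normal(100, 10, 198), [400, -250]])})

# custom, wider range: keep everything between the 0.5th and 99.5th percentiles
lower, upper = np.percentile(df["value"], [0.5, 99.5])
filtered = df[(df["value"] >= lower) & (df["value"] <= upper)]

print("outliers removed:", len(df) - len(filtered))
print("records remaining:", len(filtered))
```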
By using the percentile method, you can widen the range of permissible values to estimate outliers better, reducing the number of points discarded as outliers.
For example, if you define q = [0.5, 99.5] in the percentile function, you can filter the dataframe using the lower and upper limits obtained from the previous step.
This method is particularly useful when your observations have a wide distribution, since widening the permissible range keeps more legitimate points and improves the overall quality of your analysis.
For instance, if you use the percentile method to detect outliers, you may find that there are only two outliers in the dataset, and the filtered dataframe has 198 data records.
Methods
Outlier detection is a crucial process in data analysis, and various methods can be employed to identify anomalies. One such method is the Local Outlier Factor (LOF). When LOF is used for outlier detection, it has no predict method that can be applied to new data, which is why it shows no decision boundary in scikit-learn's comparison plots.
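A small illustrative sketch of this usage (assuming scikit-learn; the data and n_neighbors value are arbitrary choices): for outlier detection, LOF is applied with fit_predict on the data at hand rather than predict on new samples.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8, 8], [-7, 9]]])  # two clear anomalies

# fit_predict scores the training data itself; there is no predict() step for new samples here
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
labels = lof.fit_predict(X)                 # -1 = outlier, 1 = inlier
lof_scores = -lof.negative_outlier_factor_  # larger = more anomalous

print("flagged outlier indices:", np.where(labels == -1)[0])
```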
Several algorithms in scikit-learn can be used for outlier detection, including IsolationForest and LocalOutlierFactor, which perform reasonably well across a range of datasets. On the other hand, svm.OneClassSVM is known to be sensitive to outliers and thus may not perform very well for outlier detection.
Some algorithms, such as svm.OneClassSVM, may still be used for outlier detection but require fine-tuning of their hyperparameter nu to handle outliers and prevent overfitting. This can be a challenge, especially in high-dimensional data or when there are no assumptions about the distribution of the inlying data.
A comparison of different outlier detection algorithms is provided in the example "Comparing anomaly detection algorithms for outlier detection on toy datasets", which shows the performance of svm.OneClassSVM, IsolationForest, LocalOutlierFactor, and EllipticEnvelope.
Here are some common methodological approaches used for outlier detection:
- Statistical based methods
- Forecasting-based approaches, which flag a sample as an anomaly if the forecasted value is out of the confidence interval
- Neural Network Based Approaches
- Clustering Based Approaches, which assume outliers don't belong to any cluster or have their own clusters
- Proximity Based Approaches
- Tree Based Approaches
- Dimension Reduction Based Approaches
Autoencoders, a type of unsupervised neural network, can also be used for anomaly detection. They work by reconstructing a sample from its encoded features, and the reconstruction error is used to identify anomalies. The higher the reconstruction error, the more likely it is to be an anomaly.
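A minimal sketch of the reconstruction-error idea, using a single-hidden-layer MLP from scikit-learn as a stand-in autoencoder (the architecture, threshold, and data are all illustrative assumptions; in practice a deeper network built in a deep learning framework is more common):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(0, 1, size=(500, 8))
X[:5] += 6  # make the first five rows anomalous

X_scaled = StandardScaler().fit_transform(X)

# Train the network to reconstruct its own input through a small bottleneck
ae = MLPRegressor(hidden_layer_sizes=(3,), max_iter=2000, random_state=0)
ae.fit(X_scaled, X_scaled)

reconstruction = ae.predict(X_scaled)
errors = np.mean((X_scaled - reconstruction) ** 2, axis=1)  # per-sample reconstruction error

# Flag the samples with the largest reconstruction error (the 99th-percentile cut-off is arbitrary)
threshold = np.quantile(errors, 0.99)
print("suspected anomalies:", np.where(errors > threshold)[0])
```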
Handling Outliers
Quantile-based flooring and capping is a technique used to identify outliers and limit their influence rather than remove them outright. This method replaces data points that are significantly lower or higher than the norm with the 10th and 90th percentile values, respectively.
Data points that are less than the 10th percentile are replaced with the 10th percentile value, effectively flooring them. Conversely, data points that exceed the 90th percentile are capped at the 90th percentile value.
Imputation
Imputation is another way to handle outliers: instead of dropping them, the outlying values are replaced with a suitable substitute, much as missing or erroneous values are.
Because the mean is itself pulled around by extreme values, median imputation is usually the better choice. Replacing outliers with the median gives a more faithful representation of your data than replacing them with the mean.
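A minimal sketch of median imputation, assuming the IQR rule is used to decide which points to replace (both the data and the rule are illustrative choices):

```python
import pandas as pd

values = pd.Series([12, 13, 11, 14, 12, 120, 13, 15, 12, -80, 14])

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Replace values outside the IQR bounds with the median rather than the mean,
# since the mean itself is pulled around by the very outliers being treated
median = values.median()
imputed = values.where(values.between(lower, upper), median)
print(imputed)
```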
Flooring and Capping
In the quantile-based flooring and capping technique, extreme values are pulled back to chosen percentile thresholds.
Data points less than the 10th percentile are replaced with the 10th percentile value, effectively flooring them.
Data points greater than the 90th percentile are replaced with the 90th percentile value, effectively capping them.
This method helps to remove extreme values and bring the data into a more manageable range.
The 10th percentile and 90th percentile values serve as the threshold for flooring and capping, respectively.
By using these thresholds, you can prevent outliers from dominating your data and skewing your analysis.
Detection Techniques
A data scientist can use several techniques to identify outliers and decide if they are errors or novelties.
Boxplots are a useful visualization technique for detecting outliers in a dataset. They provide a graphical representation of the distribution of data, making it easier to spot unusual values.
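A quick way to produce such a plot (a minimal sketch assuming matplotlib; the data are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(3)
data = np.concatenate([rng.normal(60, 8, 150), [130, 5, 140]])  # a few extreme values

# Points drawn beyond the whiskers (Q1 - 1.5*IQR, Q3 + 1.5*IQR) are potential outliers
plt.boxplot(data)
plt.title("Boxplot for outlier inspection")
plt.ylabel("value")
plt.show()
```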
The Interquartile Range (IQR) is a mathematical technique used to detect outliers. It is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data.
You can use the IQR to identify outliers by calculating the lower and upper bounds: lower bound = (Q1 - 1.5*IQR) and upper bound = (Q3 + 1.5*IQR). Any data points that fall outside of these bounds are considered outliers.
The Z-score is another technique used to detect outliers. It measures how many standard deviations a data point is away from the mean.
Here are some common outlier detection techniques listed:
- Boxplots
- Z-score
- Interquartile Range (IQR)
These techniques can be used to identify outliers in a dataset and help data scientists make informed decisions about their data.
Challenges and Best Practices
Outlier detection can be a complex task, and it's essential to be aware of the challenges you may face.
Data sets with a large number of records can be difficult to manage, making it hard to correctly remove outliers while keeping valid data intact. Noise or outliers that are similar to valid data can make it tricky to distinguish between flawed and good data.
How frequently anomaly detection needs to run also plays a role. If you're dealing with near real-time data, you may need to adjust your approach accordingly.
The number of anomalies is another concern. Most anomaly detection algorithms have a scoring process internally, allowing you to tune the number of anomalies by selecting an optimum threshold.
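As an illustrative sketch of that tuning step (assuming scikit-learn's IsolationForest; the 2% cut-off is an arbitrary choice):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(7)
X = np.vstack([rng.normal(0, 1, size=(300, 3)), rng.uniform(-8, 8, size=(6, 3))])

iso = IsolationForest(random_state=7).fit(X)
scores = -iso.score_samples(X)  # larger = more anomalous

# Tune the number of reported anomalies by moving the score threshold,
# e.g. flag the top 2% of scores instead of relying on a fixed contamination value
threshold = np.quantile(scores, 0.98)
anomalies = np.where(scores > threshold)[0]
print("anomalies at this threshold:", anomalies.size)
```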
Some common challenges with outlier detection include data being over-pruned or removing genuine outliers that should be included in the data set.
To overcome these challenges, it's crucial to choose well-suited algorithms and reassess them regularly to ensure they remain accurate.
Sources
- https://scikit-learn.org/1.5/modules/outlier_detection.html
- https://s-ai-f.github.io/Time-Series/outlier-detection-in-time-series.html
- https://www.analyticsvidhya.com/blog/2021/05/detecting-and-treating-outliers-treating-the-odd-one-out/
- https://www.freecodecamp.org/news/how-to-detect-outliers-in-machine-learning/
- https://www.spotfire.com/glossary/what-is-outlier-detection