Outlier Detection Scikit Learn Techniques and Implementation


Outlier detection is a crucial step in data analysis, and Scikit-learn provides several techniques to identify unusual data points.

The Isolation Forest algorithm is one such technique. It isolates outliers by repeatedly choosing a random feature and a random split value; anomalous points end up isolated in far fewer splits than normal points. This approach is particularly effective for high-dimensional data.

By using the Isolation Forest algorithm, we can detect anomalies in our data and gain a deeper understanding of the underlying patterns. The algorithm's simplicity and speed make it a popular choice among data scientists.

One key benefit of the Isolation Forest algorithm is its ability to handle both univariate and multivariate data, making it a versatile tool for outlier detection.
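
As a minimal sketch (the contamination value is an illustrative assumption, not a recommendation), fitting scikit-learn's IsolationForest and flagging outliers takes only a few lines:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.normal(size=(200, 2))          # mostly "normal" points
X = np.vstack([X, [[6, 6], [-7, 5]]])  # two obvious outliers

iso = IsolationForest(contamination=0.01, random_state=42)
labels = iso.fit_predict(X)            # 1 = inlier, -1 = outlier
print(np.where(labels == -1)[0])       # indices of the flagged points
```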

One-Class SVM

The One-Class SVM is a robust machine learning model that's well suited to detecting anomalous observations. It's implemented in the Support Vector Machines module in Scikit-Learn and can handle high-dimensional data with ease.

One-Class SVMs determine the boundary that best separates normal data points from everything else by maximizing the margin between the data and the origin in feature space. Strictly speaking, it's not an outlier-detection method but a novelty-detection method: it should be trained on a set of data without outliers. This makes it a great option when outlier detection must work in high dimensions, or without assumptions on the distribution of the inlying data.

The One-Class SVM is particularly useful when the inlier distribution is strongly non-Gaussian, as it can still recover a reasonable approximation of the inlier region. However, it can overfit when the inlier distribution is bimodal, interpreting a region where some outliers are clustered as inliers.
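
Here's a minimal sketch of training svm.OneClassSVM on clean data and scoring new points; the nu and gamma values are illustrative, not tuned:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))  # assumed outlier-free (novelty detection)
X_new = np.array([[0.1, -0.2], [5.0, 5.0]])

# nu bounds the fraction of training errors; gamma shapes the RBF kernel
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.1).fit(X_train)
print(ocsvm.predict(X_new))            # 1 = inlier, -1 = novelty
print(ocsvm.decision_function(X_new))  # signed distance to the boundary
```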

Attributes

The attributes described here are those of scikit-learn's robust covariance estimator, covariance.EllipticEnvelope (covered in the next section), and they are what make it a powerful tool for identifying anomalies in data. The support_ attribute is an array-like mask of the observations used to compute robust estimates of location and shape.

This attribute is crucial in determining which data points are used to calculate the robust location and shape of the data. The location_ attribute returns the estimated robust location, an array-like with a shape of (n_features,).

The robust location is a key component in identifying anomalies, as it provides a central point around which the data is clustered. The covariance_ attribute returns the estimated robust covariance matrix, which is an array-like with a shape of (n_features, n_features).

This matrix is used to calculate the spread of the data and can help identify outliers. The precision_ attribute returns the estimated pseudo-inverse of the covariance matrix, an array-like with a shape of (n_features, n_features).

This precision matrix enters the Mahalanobis distance computation used to score points. The offset_ attribute is a float that is used to define the decision function from the raw scores.

Here's a summary of the attributes:

  • support_ — array-like; mask of the observations used to compute robust estimates of location and shape.
  • location_ — array-like of shape (n_features,); the estimated robust location.
  • covariance_ — array-like of shape (n_features, n_features); the estimated robust covariance matrix.
  • precision_ — array-like of shape (n_features, n_features); the estimated pseudo-inverse of the covariance matrix.
  • offset_ — float; defines the decision function from the raw scores.
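
A short sketch of inspecting these attributes after fitting covariance.EllipticEnvelope (the data here is synthetic, for illustration only):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
X = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=300)

env = EllipticEnvelope(contamination=0.05, random_state=0).fit(X)
print(env.location_)       # robust center estimate, shape (n_features,)
print(env.covariance_)     # robust covariance, shape (n_features, n_features)
print(env.precision_)      # pseudo-inverse of the covariance matrix
print(env.support_.sum())  # how many observations entered the robust fit
print(env.offset_)         # offset defining the decision function
```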

Elliptic Envelope

The Elliptic Envelope algorithm is a robust method for detecting outliers in a dataset. It fits a Gaussian distribution to the data points and scores how likely each point is under that distribution.

Data points with the lowest probability are considered outliers. This method assumes that the data follows a Gaussian distribution, which is a limitation of the Elliptic Envelope algorithm.

The Elliptic Envelope algorithm can be implemented using the `covariance.EllipticEnvelope` object in scikit-learn, making it easy to use in Python. This object fits a robust covariance estimate to the data, ignoring points outside the central mode.

The Mahalanobis distances obtained from this estimate are used to derive a measure of outlyingness. This strategy is illustrated in the scikit-learn documentation.
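
As a sketch of that strategy, the fitted estimator exposes a mahalanobis method, so the outlyingness measure can be computed directly (synthetic data, illustrative contamination value):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(1)
X = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])  # one planted outlier

env = EllipticEnvelope(contamination=0.01).fit(X)
dists = env.mahalanobis(X)        # squared Mahalanobis distance per point
print(np.argsort(dists)[-3:])     # indices of the most outlying points
print(env.predict([[8.0, 8.0]]))  # -1 flags an outlier
```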

Here are some key differences between the One-class SVM and the Elliptic Envelope algorithm:

  • The One-class SVM is a novelty-detection method, whereas the Elliptic Envelope is an outlier-detection method.
  • The One-class SVM is more flexible and can handle data with multiple modes, whereas the Elliptic Envelope assumes a unimodal distribution.
  • The One-class SVM can be more prone to overfitting, whereas the Elliptic Envelope is more robust to outliers.

Anomaly Detection Algorithms

Anomaly Detection Algorithms are a crucial part of outlier detection in scikit-learn. There are several algorithms available, each with its own strengths and weaknesses.

One of the most popular algorithms is the Isolation Forest, which uses an ensemble of decision trees to isolate outliers. It's efficient and can handle high-dimensional data, but it can be sensitive to noise.

Another efficient algorithm is the Local Outlier Factor (LOF), which computes a score reflecting the degree of abnormality of the observations. It measures the local density deviation of a given data point with respect to its neighbors, making it effective at detecting outliers with substantially lower density than their neighbors.

Here are some key parameters used by the LOF algorithm:

  • n_neighbors — the number of neighbors used to measure local density; 20 works well in most cases, but use more (such as 35) when the proportion of outliers is high.
  • contamination — the expected proportion of outliers in the dataset, which sets the decision threshold.

These parameters can be adjusted to suit the specific needs of your data and problem.

Applying ML Algorithms

The Algorithmic Approach uses the power of ML algorithms to detect outliers, addressing the limitations of simpler statistical methods.

There are many robust models for outlier detection, including Isolation Forest, Local Outlier Factor, One-Class Support Vector Machine, and Elliptic Envelope.

These models excel at understanding intricate patterns in the data and defining accurate decision boundaries.

Isolation Forest and Local Outlier Factor perform reasonably well on most datasets, while One-Class Support Vector Machine can be sensitive to outliers.

To handle outliers, One-Class Support Vector Machine requires fine-tuning of its hyperparameter nu to prevent overfitting.

The Local Outlier Factor (LOF) algorithm computes a score reflecting the degree of abnormality of the observations.

The LOF score of a point is the ratio of the average local density of its k-nearest neighbors to its own local density.
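
In symbols (a standard statement of the LOF score, where \(\mathrm{lrd}_k\) denotes local reachability density and \(N_k(p)\) the k-nearest neighbors of \(p\)):

\[ \mathrm{LOF}_k(p) = \frac{\frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)} \]

A score close to 1 means the point is about as dense as its neighbors, while a score well above 1 marks a candidate outlier.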

The number k of neighbors considered is typically chosen to be greater than the minimum number of objects a cluster has to contain, and in practice is usually set to 20.

However, when the proportion of outliers is high, the number of neighbors should be greater, such as 35.

The strength of the LOF algorithm is that it takes both local and global properties of datasets into consideration.

Here's a comparison of the four robust models for outlier detection:

  • Isolation Forest — efficient on high-dimensional data and performs reasonably well on most datasets, though it can be sensitive to noise.
  • Local Outlier Factor — compares each point's local density to that of its neighbors, taking both local and global properties of the dataset into account.
  • One-Class SVM — a novelty-detection method; sensitive to outliers in the training data and requires tuning of its nu hyperparameter to prevent overfitting.
  • Elliptic Envelope — fits a robust Gaussian estimate; assumes a unimodal, roughly Gaussian inlier distribution.
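
The snippet below is a compact sketch that fits all four models on the same synthetic data; the hyperparameters are illustrative defaults, not tuned values:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(300, 2)),          # inliers
               rng.uniform(-6, 6, size=(15, 2))])  # scattered outliers

models = {
    "IsolationForest": IsolationForest(contamination=0.05, random_state=0),
    "LocalOutlierFactor": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
    "OneClassSVM": OneClassSVM(nu=0.05, gamma=0.1),
    "EllipticEnvelope": EllipticEnvelope(contamination=0.05, random_state=0),
}

for name, model in models.items():
    labels = model.fit_predict(X)  # all four share the 1 / -1 convention
    print(f"{name}: {np.sum(labels == -1)} outliers flagged")
```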

Isolation Forest

The Isolation Forest is a powerful anomaly detection algorithm that excels in high-dimensional datasets. It's based on the principles of Random Forest and Ensemble Learning techniques.

To understand how it works, imagine a tree structure where each node represents a split in the data. The Isolation Forest isolates outliers by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

This process is repeated many times, and the average path length of data points in these trees is used to quantify an anomaly score for each data point. The shorter the path length, the more likely the data point is to be an outlier.

The implementation of ensemble.IsolationForest is based on an ensemble of tree.ExtraTreeRegressor, and the maximum depth of each tree is set to \(\lceil \log_2(n) \rceil\) where \(n\) is the number of samples used to build the tree.

The Isolation Forest supports warm_start=True, which allows you to add more trees to an already fitted model.
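
A minimal sketch of that pattern, growing an already-fitted forest (the tree counts are arbitrary):

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

X, _ = make_blobs(n_samples=500, centers=1, random_state=0)

forest = IsolationForest(n_estimators=100, warm_start=True, random_state=0)
forest.fit(X)                   # fit an initial ensemble of 100 trees

forest.n_estimators += 50       # request 50 additional trees
forest.fit(X)                   # only the new trees are fitted
print(len(forest.estimators_))  # 150
```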

Here's how the Isolation Forest compares with the other anomaly detection algorithms covered here:

  • Isolation Forest — tree-based isolation; scales well to high-dimensional data and large samples.
  • Local Outlier Factor — density-based; strongest when outliers have substantially lower density than their neighbors.
  • One-Class SVM — boundary-based novelty detection; best trained on outlier-free data.
  • Elliptic Envelope — covariance-based; assumes a Gaussian, unimodal inlier distribution.

The Isolation Forest is an efficient way to perform outlier detection in high-dimensional datasets, and its implementation in Python using Scikit-Learn is straightforward.

Local Outlier Factor

The local outlier factor is a crucial concept in anomaly detection, and the Local Outlier Factor (LOF) algorithm is one of the most effective ways to detect outliers with it. The algorithm works by evaluating the local neighborhood of each data point, calculating its density relative to its neighbors.

The LOF algorithm is based on the idea that outliers are more isolated in the feature space compared to their k nearest neighbors. This makes them detectable using the LOF algorithm. The algorithm measures the local density deviation of a given data point with respect to its neighbors.

The LOF algorithm uses a score called the local outlier factor to reflect the degree of abnormality of the observations. This score is calculated as the ratio of the average local density of the k-nearest neighbors to the local density of the data point itself.

To use the LOF algorithm, you need to choose the number of neighbors (n_neighbors) and the contamination coefficient. The number of neighbors affects how sensitive the algorithm is to changes in local density, while the contamination coefficient determines the expected proportion of outliers in the dataset.

Here are the key parameters to adjust when using the LOF algorithm:

  • n_neighbors — controls how sensitive the algorithm is to changes in local density.
  • contamination — the expected proportion of outliers in the dataset.

The LOF algorithm is a powerful tool for detecting outliers, and by adjusting these parameters, you can fine-tune its performance to suit your specific needs.
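
Here's a short sketch putting those two parameters to work; n_neighbors=20 and a 5% contamination are illustrative choices:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(200, 2)), [[5.0, 5.0], [-6.0, 4.0]]])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)             # 1 = inlier, -1 = outlier
scores = -lof.negative_outlier_factor_  # higher score = more anomalous
print(np.where(labels == -1)[0])        # indices of the flagged points
```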

One-Class SVM vs Elliptic Envelope

One-Class SVM is not an outlier-detection method, but a novelty-detection method, which means its training set should not be contaminated by outliers. It's particularly useful in high-dimensional data or without any assumptions on the inlier data distribution.

The Elliptic Envelope algorithm, on the other hand, fits a Gaussian distribution to the data points and predicts the probability of each point to be under this distribution. It's a simple yet effective method for outlier detection, but it assumes the data follows a Gaussian distribution.

In terms of performance, One-Class SVM works better on data with multiple modes, while the Elliptic Envelope degrades as the data is less and less unimodal. This is because the Elliptic Envelope learns an ellipse that fits well the inlier distribution, but struggles with bimodal or non-Gaussian distributions.

Here's a comparison of the two methods:

  • One-Class SVM — a novelty-detection method that needs a clean training set; works in high dimensions and on multimodal data, with no assumptions about the inlier distribution.
  • Elliptic Envelope — an outlier-detection method that is robust to contamination in the training data, but assumes a unimodal Gaussian distribution and degrades as the data becomes less unimodal.

Ultimately, the choice between One-Class SVM and Elliptic Envelope depends on the specific characteristics of your data and the assumptions you're willing to make. If your data is high-dimensional or has multiple modes, One-Class SVM might be a better choice. If your data follows a Gaussian distribution, the Elliptic Envelope could be a simpler and more effective solution.

Scaling and Fitting

One common way to perform outlier detection is to assume that regular data come from a known distribution, such as a Gaussian distribution.

The scikit-learn library provides an object covariance.EllipticEnvelope that fits a robust covariance estimate to the data, ignoring points outside the central mode. This object can estimate the inlier location and covariance in a robust way, without being influenced by outliers.

The implementation of an online linear version of the One-Class SVM scales linearly with the number of samples, making it suitable for large datasets. This is achieved through the linear_model.SGDOneClassSVM implementation, which can be used with a kernel approximation to approximate the solution of a kernelized svm.OneClassSVM.

Parameters

When working with data, precision is key. The `store_precision` parameter allows you to specify whether the estimated precision is stored, and it defaults to `True`.

You can adjust the `assume_centered` parameter to control how the robust location and covariance are computed. If set to `False`, it will use the FastMCD algorithm directly, whereas setting it to `True` will compute the support of the robust location and covariance.

The `support_fraction` parameter determines the proportion of points to be included in the support of the raw MCD estimates. It's a float value between 0 and 1.

Outliers can be a major issue in data analysis. The `contamination` parameter provides the proportion of outliers in the data set, and it defaults to 0.1.

If you need to shuffle your data, the `random_state` parameter comes in handy. It represents the seed of the pseudo random number generator used while shuffling the data.

Here's a quick rundown of the parameters:

  • store_precision (bool, default=True) — whether the estimated precision matrix is stored.
  • assume_centered (bool, default=False) — if False, the robust location and covariance are computed directly with the FastMCD algorithm; if True, the support of the robust estimates is computed without centering the data.
  • support_fraction (float between 0 and 1) — the proportion of points included in the support of the raw MCD estimates.
  • contamination (float, default=0.1) — the proportion of outliers in the data set.
  • random_state — the seed of the pseudo random number generator used while shuffling the data.
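
Putting the parameters together in a constructor call (a sketch; the values are illustrative, not recommendations):

```python
from sklearn.covariance import EllipticEnvelope

env = EllipticEnvelope(
    store_precision=True,   # keep the estimated precision matrix
    assume_centered=False,  # let FastMCD estimate the location directly
    support_fraction=0.8,   # use 80% of the points for the raw MCD estimate
    contamination=0.1,      # expected proportion of outliers
    random_state=42,        # seed for the shuffling
)
```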

Scaling One-Class SVM

Scaling One-Class SVM can be a challenge, especially when dealing with large datasets. An online linear version of the One-Class SVM is implemented in linear_model.SGDOneClassSVM, which scales linearly with the number of samples.

This implementation can be used with a kernel approximation to approximate the solution of a kernelized svm.OneClassSVM whose complexity is at best quadratic in the number of samples.

The scikit-learn example "One-Class SVM versus One-Class SVM using Stochastic Gradient Descent" illustrates this, showing how linear_model.SGDOneClassSVM combined with kernel approximation can approximate a kernelized One-Class SVM.

The linear_model.SGDOneClassSVM is a more efficient option for large datasets, and it's a good choice when you need to scale up your One-Class SVM implementation.

This implementation lives in Scikit-Learn's linear_model module, alongside the svm.OneClassSVM found in the Support Vector Machines module, and both make One-Class SVM easy to use in Python.
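
Here's a sketch of the pipeline described above: a Nystroem kernel approximation feeding linear_model.SGDOneClassSVM (the gamma, n_components, and nu values are illustrative):

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVM
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.normal(size=(10_000, 5))  # a large sample where linear scaling pays off

# Nystroem maps the data into an approximate RBF feature space, so the
# linear SGD one-class SVM approximates a kernelized svm.OneClassSVM
model = make_pipeline(
    Nystroem(gamma=0.1, n_components=100, random_state=0),
    SGDOneClassSVM(nu=0.05, random_state=0),
)
labels = model.fit(X).predict(X)  # 1 = inlier, -1 = outlier
print(np.sum(labels == -1))
```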

Elliptic Envelope Fitting

The Elliptic Envelope fitting algorithm is a powerful tool for outlier detection. It works by fitting a Gaussian distribution to the data points and scoring how likely each point is under that distribution.

This method can only be applied if the data is assumed to follow a Gaussian distribution, which is a common assumption in many data analysis tasks. The scikit-learn library provides an object called covariance.EllipticEnvelope that fits a robust covariance estimate to the data, ignoring points outside the central mode.

The Elliptic Envelope algorithm is particularly useful when the data is unimodal and well-centered, as it can learn the rotational symmetry of the inlier population. However, it can struggle with bimodal or non-Gaussian distributions, where it may overfit or completely fail.

Here are some key advantages and disadvantages of the Elliptic Envelope algorithm:

  • Advantages: Works well with unimodal and well-centered data, learns rotational symmetry.
  • Disadvantages: Struggles with bimodal or non-Gaussian distributions, may overfit.

The Elliptic Envelope algorithm is a robust and efficient method for outlier detection, but it requires careful consideration of the data distribution and assumptions. By understanding its strengths and limitations, data analysts can choose the right tool for the job and get the most out of their data.
