Novelty detection is a fascinating field that involves identifying unusual patterns or outliers in data. This can be particularly useful in applications such as fraud detection, where identifying unusual transactions can help prevent financial losses.
Deep learning techniques have proven to be highly effective in novelty detection, allowing us to identify anomalies that might have gone unnoticed with traditional methods. One such technique is the use of autoencoders, which can learn to compress data into a lower-dimensional representation, making it easier to identify anomalies.
By training an autoencoder on a large dataset, we can create a model that can detect anomalies in new, unseen data. For example, if we're trying to detect unusual credit card transactions, we can train an autoencoder on a dataset of normal transactions and then use it to identify transactions that are significantly different from the norm.
Novelty Detection Methods
Novelty detection asks whether a new observation is so different from the training data that we can doubt it is regular. The goal is to learn a rough frontier that delimits the contour of the distribution of the initial observations.
The One-Class SVM has been introduced for novelty detection and is implemented in the Support Vector Machines module in the svm.OneClassSVM object. It requires the choice of a kernel and a scalar parameter to define a frontier, with the RBF kernel being the default choice.
If a new observation lies within the frontier-delimited subspace, it is considered to come from the same population as the initial observations; if it lies outside, we can doubt it is regular. The One-Class SVM answers this question by learning such a frontier from the training data alone.
The nu parameter, also known as the margin of the One-Class SVM, corresponds to the probability of finding a new, but regular, observation outside the frontier. This parameter needs to be carefully tuned to achieve good results.
The key practical point when using the One-Class SVM for novelty detection is that the choice of kernel and the nu parameter is crucial to good results.
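As a minimal sketch of how svm.OneClassSVM is called (the synthetic data and parameter values here are illustrative assumptions, not recommendations):

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Train only on "normal" observations (synthetic 2-D data for illustration).
rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(100, 2)

# RBF kernel (the default) with nu=0.05: we accept that roughly 5% of
# regular observations may fall outside the learned frontier.
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="auto")
clf.fit(X_train)

# predict() returns +1 for inliers and -1 for novelties.
X_new = np.array([[0.1, 0.2], [4.0, 4.0]])
print(clf.predict(X_new))  # e.g. [ 1 -1]
```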
Outlier and Anomaly Detection
Outlier detection is closely related to novelty detection: the goal is to separate a core of regular observations from polluting ones, called outliers. The main difference is that in outlier detection the training data itself is contaminated, whereas in novelty detection the model is trained on clean data.
Outliers are events that deviate from the standard, happen rarely, and don't follow the rest of the pattern. Examples include large dips and spikes in the stock market, defective items in a factory, and contaminated samples in a lab.
These anomalies typically occur only 0.001-1% of the time, which makes them challenging to detect. Machine learning researchers have created algorithms such as Isolation Forests, One-class SVMs, Elliptic Envelopes, and Local Outlier Factor to help detect such events; a sketch of two of these estimators follows.
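For the two estimators not covered elsewhere in this article, here is a minimal scikit-learn sketch (the synthetic data and contamination rate are illustrative assumptions):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = 0.5 * rng.randn(200, 2)  # mostly regular observations
X[:4] += 5.0                 # a few injected outliers

# Elliptic Envelope: fits a robust Gaussian estimate of the data; points
# far from the fitted ellipse (per the contamination rate) are outliers.
ee = EllipticEnvelope(contamination=0.02)
print(ee.fit_predict(X)[:6])  # -1 marks outliers, +1 inliers

# Local Outlier Factor: compares each point's local density to that of
# its neighbors; isolated points are flagged as -1.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
print(lof.fit_predict(X)[:6])
```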
- Autoencoders can be used for anomaly detection by framing the problem correctly.
- One approach is to use the autoencoder to make predictions and measure the Mean Squared Error (MSE) between the original input images and reconstructions.
- The q-th quantile of the error can be used as a threshold to detect outliers (a full walkthrough with code appears in the Deep Learning Approaches section below).
By using these techniques, we can effectively detect anomalies and outliers in our data, which is essential for novelty detection.
Isolation Forest
Isolation Forest is an efficient way to perform outlier detection in high-dimensional datasets. It isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
This process can be represented by a tree structure, where the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node. The path length, averaged over a forest of such random trees, is a measure of normality.
Random partitioning produces noticeably shorter paths for anomalies, making it a reliable method for detecting outliers. The implementation of ensemble.IsolationForest is based on an ensemble of tree.ExtraTreeRegressor, with the maximum depth of each tree set to ⌈log2(n)⌉, where n is the number of samples used to build the tree.
The ensemble.IsolationForest supports warm_start=True, which allows you to add more trees to an already fitted model. This feature is useful for incrementally updating the model as new data becomes available.
To illustrate the use of IsolationForest, see the Isolation Forest example in the scikit-learn documentation; a minimal sketch of the API, including warm_start, follows.
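This sketch shows the basic fit/predict cycle and incremental growth with warm_start (the synthetic data and tree counts are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = 0.5 * rng.randn(500, 4)  # training data: mostly normal observations

# warm_start=True lets us grow the forest later instead of refitting it.
clf = IsolationForest(n_estimators=100, warm_start=True, random_state=0)
clf.fit(X)

# predict() returns +1 for inliers and -1 for anomalies; shorter average
# path lengths in the random trees translate to lower scores.
X_new = np.vstack([0.5 * rng.randn(3, 4), [[6.0, 6.0, 6.0, 6.0]]])
print(clf.predict(X_new))        # e.g. [ 1  1  1 -1]
print(clf.score_samples(X_new))  # lower = more anomalous

# Add 50 more trees to the already fitted model as new data arrives.
clf.n_estimators += 50
clf.fit(X)  # only the additional trees are fitted
```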
Anomalies
Anomalies appear across many domains, from stock markets to factory production lines to lab samples, and they typically occur only 0.001-1% of the time. This rarity makes the problem challenging, because the vast majority of data points represent valid events.
To detect anomalies, machine learning researchers have created algorithms such as Isolation Forests, One-class SVMs, Elliptic Envelopes, and Local Outlier Factor. These methods are rooted in traditional machine learning.
Deep learning can also be used for anomaly detection, but it requires framing the problem correctly.
Here are some examples of anomalies:
- Large dips and spikes in the stock market due to world events
- Defective items in a factory/on a conveyor belt
- Contaminated samples in a lab
Deep Learning Approaches
Deep learning approaches are particularly well-suited for novelty detection, as they can learn to recognize patterns in data that are difficult to define explicitly.
Autoencoders are a type of unsupervised neural network that can accept an input set of data, compress it into a latent-space representation, and reconstruct the input data from the latent representation.
The key to anomaly detection with autoencoders lies in the reconstruction loss, which measures the mean-squared-error (MSE) between the input image and the reconstructed image.
When the MSE is high, it's likely that the input image is an outlier.
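Concretely, the reconstruction loss for a single image can be computed as follows (a minimal sketch, assuming images are NumPy arrays scaled to [0, 1]):

```python
import numpy as np

def reconstruction_mse(original, reconstructed):
    # Mean squared error over all pixels of one image.
    return np.mean((original - reconstructed) ** 2)
```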
Autoencoders can be trained on unlabeled data, and the resulting model can be used to detect anomalies in new, unseen data.
To implement anomaly detection using an autoencoder, you'll need to load the pre-trained model, make predictions on the new data, and measure the MSE between the original input images and reconstructions.
The q-th quantile of the error can be used as a threshold to detect outliers, where any MSE with a value greater than or equal to the threshold is considered an anomaly.
Here's a step-by-step summary of the process (a code sketch follows the list):
- Load the pre-trained autoencoder model
- Make predictions on the new data
- Measure the MSE between the original input images and reconstructions
- Compute the q-th quantile of the error and use it as a threshold
- Identify anomalies as any MSE with a value greater than or equal to the threshold
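A minimal sketch of that pipeline, assuming a trained model saved as autoencoder.model (the file name and the use of MNIST test images here are illustrative assumptions):

```python
import numpy as np
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import load_model

# Load the pre-trained autoencoder (path is an assumption).
autoencoder = load_model("autoencoder.model")

# Example data: MNIST test digits scaled to [0, 1], shaped (N, 28, 28, 1).
(_, _), (testX, _) = mnist.load_data()
images = np.expand_dims(testX.astype("float32") / 255.0, axis=-1)

# Reconstruct the images and compute the per-image MSE.
decoded = autoencoder.predict(images)
errors = np.mean((images - decoded) ** 2, axis=(1, 2, 3))

# Use the q-th quantile of the errors as the anomaly threshold.
q = 0.999
thresh = np.quantile(errors, q)

# Any image whose reconstruction error meets or exceeds the threshold
# is flagged as an anomaly.
anomaly_idxs = np.where(errors >= thresh)[0]
print(f"{len(anomaly_idxs)} anomalies found with threshold {thresh:.6f}")
```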
Implementation and Training
To implement an autoencoder for anomaly detection, you'll need to start by building the autoencoder script. This involves creating a ConvAutoencoder class with a static method called build, which accepts five parameters: width, height, depth, filters, and latentDim. These parameters define the characteristics of the input images.
The build method then defines the input for the encoder, using Keras' functional API to loop over the filters and add layers of CONV, LeakyReLU, and BN.
The encoder is used to construct the latent-space representation, which is the compressed form of the data. This representation is then used to reconstruct the original input image.
To train the autoencoder, you'll need the Keras and TensorFlow libraries; the training script is launched from a terminal. Training takes around 2 minutes on a 3 GHz Intel Xeon processor, and the training history plot shows that training is stable.
The output of the training process includes a visualization file called recon_vis.png, which shows that the autoencoder has learned to correctly reconstruct the digit 1 from the MNIST dataset.
Here are the parameters that need to be defined when building the ConvAutoencoder class (a condensed sketch of build follows the list):
- width: Width of the input images.
- height: Height of the input images.
- depth: Number of channels in the images.
- filters: Tuple of filter counts that the encoder and decoder will learn, respectively.
- latentDim: Dimensionality of the latent-space representation.
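Below is a condensed sketch of build along these lines (layer sizes, kernel shapes, and defaults are assumptions rather than the tutorial's exact code):

```python
import numpy as np
from tensorflow.keras import backend as K
from tensorflow.keras.layers import (Activation, BatchNormalization, Conv2D,
                                     Conv2DTranspose, Dense, Flatten, Input,
                                     LeakyReLU, Reshape)
from tensorflow.keras.models import Model

class ConvAutoencoder:
    @staticmethod
    def build(width, height, depth, filters=(32, 64), latentDim=16):
        # Define the input to the encoder.
        inputs = Input(shape=(height, width, depth))
        x = inputs

        # Loop over the filters, stacking CONV => LeakyReLU => BN blocks.
        for f in filters:
            x = Conv2D(f, (3, 3), strides=2, padding="same")(x)
            x = LeakyReLU(alpha=0.2)(x)
            x = BatchNormalization()(x)

        # Flatten the volume and build the latent-space representation.
        volumeSize = K.int_shape(x)
        x = Flatten()(x)
        latent = Dense(latentDim)(x)
        encoder = Model(inputs, latent, name="encoder")

        # The decoder mirrors the encoder: expand the latent vector back
        # to the conv volume, then upsample with transposed convolutions.
        latentInputs = Input(shape=(latentDim,))
        x = Dense(np.prod(volumeSize[1:]))(latentInputs)
        x = Reshape((volumeSize[1], volumeSize[2], volumeSize[3]))(x)
        for f in filters[::-1]:
            x = Conv2DTranspose(f, (3, 3), strides=2, padding="same")(x)
            x = LeakyReLU(alpha=0.2)(x)
            x = BatchNormalization()(x)

        # Sigmoid output reconstructs the original input image.
        x = Conv2DTranspose(depth, (3, 3), padding="same")(x)
        outputs = Activation("sigmoid")(x)
        decoder = Model(latentInputs, outputs, name="decoder")

        # Chain encoder and decoder into the full autoencoder.
        autoencoder = Model(inputs, decoder(encoder(inputs)),
                            name="autoencoder")
        return (encoder, decoder, autoencoder)
```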
Sources
- https://scikit-learn.org/1.5/modules/outlier_detection.html
- https://www.mathworks.com/help/stats/unsupervised-anomaly-detection.html
- https://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/modules/outlier_detection.html
- https://pyimagesearch.com/2020/03/02/anomaly-detection-with-keras-tensorflow-and-deep-learning/
- https://www.javatpoint.com/novelty-detection