Time series outlier detection is a crucial step in understanding and analyzing data. It helps identify anomalies that can significantly impact business decisions.
Methods like the Z-score and Modified Z-score are common ways to detect outliers in time series data. The Z-score measures how far a data point sits from the mean, in units of standard deviation.
The Z-score method is simple and effective, but it's sensitive to non-normal distributions. The Modified Z-score is a variation that uses the median and the median absolute deviation instead of the mean and standard deviation, which makes it more robust.
Time series outlier detection is used in various applications, such as fraud detection and anomaly detection in sensor data. It can also be used to improve the accuracy of predictive models.
What is Time Series Outlier Detection?
Time series outlier detection is all about finding data points that don't fit the pattern. These are data points that are far away from the others, often due to simple measurement mistakes or random chance.
Outliers can be a problem because they can hide important information that needs to be addressed. Anomalies, a specific type of outlier, can be especially troublesome as they don't match the expected pattern and can signal big problems like cyberattacks or system breakdowns.
In time series data, spotting outliers and anomalies is key for catching issues early on.
What is Time Series Data?
Data is just a collection of information, but in the context of time series, it's a list of data points organized in time, each carrying some kind of measurement.
These measurements can be anything from stock prices to website visitors or sales numbers.
Data points can be recorded at different intervals, such as every day, week, or year, giving us a clear picture of how things change over time.
Here are some examples of time series measurements and the intervals at which they're typically recorded:
- Stock prices: change over days, weeks, or years
- Website visitors: by minute, hour, or day
- Sales numbers: every day or month
- Server performance: measured every second
- Sensor readings: from smart devices or industrial setups
By analyzing these data points, we can spot trends, patterns, and cycles, which is crucial for making predictions, spotting weird data points, and more.
The Basics of Outliers and Anomalies
Outliers and anomalies are basically data points that stick out because they're not like the rest. Spotting these odd ones out is key for catching issues early on.
Outliers are data points that don't follow the usual pattern. In time series data, they can be a major issue if left unchecked: they throw off calculations and make it hard to get an accurate picture of what's going on.
Outliers can be caused by a variety of things, such as measurement errors or unusual events, or they can be a sign of a more serious problem that needs to be investigated.
Spotting outliers early on can help prevent problems from getting out of hand. It's like catching a small fire before it spreads – it's much easier to put out a small fire than a big one!
Types of Outliers
Time series outlier detection is crucial to identify unusual patterns in data. Outliers can be categorized into two main types: point outliers and subsequence outliers.
Point outliers are one-off weird data points, like a sudden spike in website visitors. They don't fit in with the rest of the data.
Subsequence outliers are when a bunch of data points in a row look strange compared to the rest, like daily sales numbers being oddly low for a whole week.
Here's a breakdown of the two types of outliers:

- Point outlier: a single data point that deviates sharply from the rest, like a one-off spike in website visitors.
- Subsequence outlier: a consecutive run of points that looks abnormal as a group, like a week of oddly low daily sales.
Understanding these types of outliers helps you identify when something unusual is happening in your data, and where to look to fix problems.
Visualizing and Exploring Outliers
Visualizing outliers in a space-time cube can be done in 2D and 3D using specific tools.
The 2D feature output symbolizes the number of outliers at each location, displaying pop-up charts with time series and identified outliers. These charts show the time series and the identified outlier as a large point.
In 3D, the output features display the locations and times of the identified outliers, with space-time bins labeled as Above Fitted Value or Below Fitted Value, depending on whether they're above or below the fitted values of the forecast model.
Outliers above the fitted value display in purple, while those below the fitted value display in green. Space-time bins not identified as outliers are labeled Not an Outlier and display in light gray.
Data Properties
Time series data has some unique features that can affect how we analyze and visualize it. Understanding these properties is crucial for accurate predictions.
Trends in time series data can be upward or downward, like online store sales that increase each year. This means the data follows a pattern over time.
Seasonality is another key feature, where data shows regular changes based on the time of year, month, or day of the week. For example, a website might get fewer visitors on weekends.
Autocorrelation is a property that makes related data points connected, so what happened recently can help predict what happens next. This feature is useful for making predictions.
Non-stationarity is a tricky property where the basic statistics of the data, such as the mean and variance, shift over time. This can make it difficult to analyze and predict future values.
Here are the four main features of time series data in a quick rundown:
- Trends - Patterns of going up or down over time.
- Seasonality - Regular changes based on the time of year, month, or day of the week.
- Autocorrelation - Related data points connected to each other.
- Non-Stationarity - Basic stats shifting over time.
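To make autocorrelation concrete, here's a small sketch using NumPy. The synthetic series below (a linear trend plus a 12-step seasonal wave) is a made-up illustration, and the lag-1 statistic simply measures how strongly each point tracks its predecessor:

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation: near 1 means each point
    closely tracks the previous one, near 0 means no relationship."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

trend = np.linspace(0, 10, 100)                   # upward trend
season = np.sin(2 * np.pi * np.arange(100) / 12)  # 12-step seasonality
series = trend + season
print(round(lag1_autocorr(series), 2))            # strongly positive
```

A trending, seasonal series like this is highly autocorrelated, which is exactly why recent values help predict the next ones.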
Visualize and Explore
You can visualize and explore outliers using various tools and features. The output feature symbology shows information about the detected outliers.
The time series charts display the time series data and help identify outliers. You can also use the 2D or 3D visualization of the output space-time cube to explore outliers.
The Visualize Space Time Cube in 2D and Visualize Space Time Cube in 3D tools allow you to visualize outliers in 2D and 3D, respectively. These tools use the Time series outlier results option of the Display Theme parameter.
For 2D feature output, the output features are symbolized by the number of outliers at each location. The pop-up charts display the time series and identified outliers.
For 3D feature output, the output features display the locations and times of the identified outliers in a 3D scene. Space-time bins not identified as outliers are labeled Not an Outlier and display in light gray.
The 3D features include two charts: the Visualize in 3D Time-Series chart and the Count of Outliers Above or Below Fitted Value Over Time chart. The first chart displays a line plot of the average value of the time series across the time steps of the space-time cube.
The Count of Outliers Above or Below Fitted Value Over Time chart is a stacked bar chart displaying the total number of outliers above and below fit at each time step of the space-time cube. This allows you to identify important dates when many outliers occurred.
Detection Methods
The Generalized ESD test is a sequence of tests that checks for a specific number of outliers at each location of the space-time cube. For each candidate count, it calculates the residuals of each time step, then the mean and standard deviation of those residuals, and compares a test statistic to a critical value from the t-distribution.
The Generalized ESD test can identify different numbers of outliers at each location of the space-time cube, and the number of outliers at each location can be seen in the Number of Model Fit Outliers field of the output features.
Z-Score Anomaly Detection is another method that calculates a Z-score for each data point based on data averages and standard deviations, identifying any Z-score above a configurable threshold as an anomaly. This method can adapt to dynamic baselines even in an unsupervised setting.
A key benefit of the Z-Score Anomaly Detection method is that it can identify anomalies based on short-term rates of change, such as when water levels and sustained rain accumulations are rising fast enough to cause flooding.
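To sketch the idea (this is a generic illustration, not any particular product's implementation), a z-score can be computed against a trailing window so the baseline adapts as the data drifts; the window size and threshold below are illustrative defaults:

```python
import numpy as np

def rolling_zscore_anomalies(series, window=30, threshold=3.0):
    """Flag points whose z-score against a trailing window exceeds
    the threshold. Using a rolling baseline lets the notion of
    'normal' drift over time, unlike a single global mean."""
    series = np.asarray(series, dtype=float)
    flags = np.zeros(len(series), dtype=bool)
    for t in range(window, len(series)):
        hist = series[t - window:t]          # trailing baseline only
        mu, sigma = hist.mean(), hist.std()
        if sigma == 0:                       # flat history: skip
            continue
        flags[t] = abs(series[t] - mu) / sigma > threshold
    return flags

series = np.tile([0.0, 1.0], 25)   # a regular, alternating baseline
series[40] = 10.0                  # inject one anomalous spike
flags = rolling_zscore_anomalies(series, window=10)
print(np.where(flags)[0])          # only the spike stands out
```

Because each point is judged only against its recent history, a slowly rising water level shifts the baseline with it, while a sudden surge still triggers a flag.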
Generalized ESD Test
The Generalized ESD Test is a robust method for detecting outliers in time series data. It's a sequence of tests that checks for a specific number of outliers at each location of the space-time cube.
The test is based on the Grubbs' test, which is used to check for the presence of exactly one outlier in the dataset. It then continues to check for two, three, and up to a maximum number of outliers, which is 5 percent of the number of time steps, rounded down.
The test calculates the residuals of each time step by subtracting the value of the forecast model from the raw value. The mean and standard deviation of the residuals are then calculated.
The test statistic is calculated by dividing the maximum absolute deviation from the mean by the standard deviation. This is compared to a critical value, which is derived from the two-sided critical value of the t-distribution with T-i-1 degrees of freedom at a specific confidence level.
If the test statistic is larger than the critical value, the test for exactly i outliers is statistically significant. The value associated with the maximum absolute residual is then removed, and the test is repeated on all time steps that have not been previously removed.
Here's a summary of the steps involved in the Generalized ESD Test:
- Calculate the residuals of each time step.
- Calculate the mean and standard deviation of the residuals.
- Calculate the test statistic by dividing the maximum absolute deviation from the mean by the standard deviation.
- Compare the test statistic to the critical value.
- Remove the value associated with the maximum absolute residual and repeat the test.
The Generalized ESD Test returns the outliers associated with the largest statistically significant number of outliers.
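The steps above can be sketched in Python using SciPy's t-distribution. This is a simplified illustration: the residuals are assumed to be already computed from a forecast model, and the 0.05 significance level is an assumed default:

```python
import numpy as np
from scipy import stats

def generalized_esd(residuals, max_outliers, alpha=0.05):
    """Generalized ESD test: returns the indices belonging to the
    largest statistically significant number of outliers."""
    x = np.asarray(residuals, dtype=float)
    n = len(x)
    remaining = list(range(n))   # time steps not yet removed
    removed = []                 # index removed at each step
    num_significant = 0
    for i in range(1, max_outliers + 1):
        vals = x[remaining]
        mean, std = vals.mean(), vals.std(ddof=1)
        if std == 0:
            break
        dev = np.abs(vals - mean)
        j = int(np.argmax(dev))
        test_stat = dev[j] / std             # max |residual| / std
        m = len(remaining)                   # points still in play
        p = 1 - alpha / (2 * m)
        t = stats.t.ppf(p, df=m - 2)         # t critical, m-2 dof
        critical = (m - 1) * t / np.sqrt((m - 2 + t**2) * m)
        removed.append(remaining[j])
        del remaining[j]                     # remove and repeat
        if test_stat > critical:
            num_significant = i
    return removed[:num_significant]

residuals = [0.1, 0.2, -0.1, 0.0, 0.15, 10.0,
             -0.05, 0.1, 0.2, -0.2, 0.05, 9.5]
print(sorted(generalized_esd(residuals, max_outliers=3)))
```

Note how the loop keeps testing even after a non-significant step: with two large outliers, the first test can fail because the second outlier inflates the standard deviation, and only removing one reveals the other. That resistance to masking is the point of the generalized version.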
Forecasting-Based
Forecasting-based detection is a method that looks at past data to guess what should happen next. This approach works well when data changes in predictable ways.
If the real data is way off from our guess, it might be an anomaly. This highlights the importance of having a good understanding of the data and its patterns.
Using historical data to make predictions is a common practice in business forecasting, where future sales, web traffic, or customer loss are guessed based on past data.
This method can be particularly useful for businesses that experience regular fluctuations in demand or sales. By analyzing past trends, companies can make informed decisions about inventory, staffing, and resource allocation.
Here are some examples of how forecasting-based detection can be applied:
- Guessing future sales based on past data
- Figuring out when machines might break down using sensor data
- Predicting how much infrastructure is needed based on past use
By leveraging historical data, businesses can reduce risks and make more accurate predictions.
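A minimal sketch of the idea: forecast each point from a trailing average and flag points whose residual is unusually large. The moving-average "model" and the window/threshold values here are illustrative assumptions, standing in for whatever forecast model you actually use:

```python
import numpy as np

def forecast_anomalies(series, window=7, threshold=3.0):
    """Forecast each point as the mean of the previous `window`
    points; flag points whose residual exceeds `threshold`
    standard deviations of the residuals."""
    series = np.asarray(series, dtype=float)
    preds = np.array([series[t - window:t].mean()
                      for t in range(window, len(series))])
    resid = series[window:] - preds          # actual minus forecast
    z = (resid - resid.mean()) / resid.std()
    flags = np.zeros(len(series), dtype=bool)
    flags[window:] = np.abs(z) > threshold
    return flags

series = np.full(50, 100.0)   # steady baseline, e.g. daily sales
series[30] = 150.0            # one day is way off the forecast
flags = forecast_anomalies(series)
print(np.where(flags)[0])     # the off-forecast day is flagged
```

The same shape works with any forecaster: swap the trailing mean for a seasonal or machine-learning model, and the "is the residual too big?" check stays identical.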
Mean Absolute Deviation (MAD)
The Mean Absolute Deviation (MAD) is a robust metric that can help you detect outliers in your data. It's a great alternative to the Z-score when your data has outliers.
The MAD is calculated by finding the median of the absolute difference between the values of a sample and the median of the sample. This is a more robust metric than the Z-score because it's less affected by outliers.
The MAD is useful when your data is close to a normal distribution, but it can be problematic if more than 50% of the data has the same value, in which case the MAD will be zero.
By understanding the MAD and its limitations, you can use it as a tool to detect outliers in your data and make more informed decisions.
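Putting it together, a MAD-based modified z-score can be sketched like this. The 3.5 cutoff follows the common Iglewicz-Hoaglin convention, and the sample series is made up for illustration:

```python
import numpy as np

def mad_outliers(x, threshold=3.5):
    """Flag outliers via the MAD-based modified z-score."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))     # median absolute deviation
    if mad == 0:
        # Degenerate case noted above: >50% identical values.
        return np.zeros(len(x), dtype=bool)
    # 0.6745 makes the MAD comparable to the standard deviation
    # under a normal distribution.
    modified_z = 0.6745 * (x - med) / mad
    return np.abs(modified_z) > threshold

series = np.array([10.2, 10.4, 10.1, 10.3, 25.0, 10.2, 10.5])
print(np.where(mad_outliers(series))[0])   # only the 25.0 spike
```

Because both the center (median) and the spread (MAD) ignore extreme values, the spike can't inflate the yardstick used to judge it, which is exactly where the ordinary z-score falls down.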
Machine Learning Powered Detection
Machine learning powered detection can be a game-changer for time series outlier detection. By leveraging machine learning models, you can identify anomalies in real-time, even when the data is streaming in continuously.
Machine learning models can predict potential impact and outcomes of different corrective actions based on historical data and cluster behavior. This allows for seamless operations integration and minimizes the need for manual intervention.
With machine learning, you can use models like classification and regression trees (CART) to spot anomalies. CART can be used in two ways: supervised, where you teach the model what normal and weird data look like, or unsupervised, where a technique called Isolation Forest finds anomalies by seeing how easy it is to separate a data point from the rest.
Here are some key advantages of using machine learning for anomaly detection:
- Machine learning models are more adaptable and scalable, and they can uncover complex and evolving anomalies.
- Machine learning models can be more difficult to understand and interpret, but they offer a high degree of accuracy in anomaly identification.
By integrating machine learning with auto-scaling mechanisms, you can optimize resource allocation and ensure that your system is always running at peak performance.
Isolation Forest
Isolation Forest is a powerful tool for detecting anomalies in data. It's a tree-based algorithm that works by randomly selecting an attribute and a split value to partition the data.
The algorithm continues to partition the data many times until each point is isolated. This is where the magic happens – an outlier will take fewer partitions to be isolated than a normal point.
Here's how it works: imagine you're looking at a dataset with a bunch of points scattered around. The isolation forest algorithm will start by randomly selecting an attribute, like the x-coordinate, and then randomly selecting a split value within that attribute. This creates a partition, effectively splitting the data into two groups.
The algorithm continues to do this many times, each time creating a new partition. The point that takes the fewest partitions to be isolated is likely to be an anomaly.
To get a better understanding of how this works, let's take a look at an example. Suppose we have a dataset where one point is an outlier. If we use the isolation forest algorithm to isolate each point, we'll see that the outlier takes fewer partitions to be isolated than the normal points.
In fact, the number of partitions required to isolate a point is directly related to its likelihood of being an anomaly. If a point takes many partitions to be isolated, it's probably a normal point. But if it takes just a few partitions, it's likely to be an outlier.
This is a key insight that makes isolation forest so effective. By analyzing the number of partitions required to isolate each point, we can get a clear picture of which points are likely to be anomalies and which are not.
Here's a simple way to think about it: if a point is easy to isolate, it's probably an outlier. But if it's hard to isolate, it's probably a normal point.
In the next section, we'll explore another method for detecting anomalies, the local outlier factor. But for now, let's take a closer look at how to apply the isolation forest algorithm in practice.
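As a practical sketch, scikit-learn's IsolationForest can be applied like this. The synthetic data is made up for illustration, and the default parameters are used; `predict` returns -1 for anomalies and 1 for normal points:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic example: 200 ordinary points plus two extreme ones
# that should take very few random partitions to isolate.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 1))
X = np.vstack([normal, [[8.0], [-9.0]]])

clf = IsolationForest(random_state=0).fit(X)
labels = clf.predict(X)              # -1 = anomaly, 1 = normal
outlier_idx = np.where(labels == -1)[0]
print(outlier_idx)                   # includes indices 200 and 201
```

Depending on the random sample, a few points from the tails of the normal cluster may also be flagged, but the two injected extremes are reliably among them: they sit far from everything else, so a handful of random splits is enough to cut them off.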
Clustering-Based Techniques
Clustering-based techniques are a powerful tool in machine learning powered detection. They involve grouping data points that are similar, making it easier to identify anomalies.
Techniques like DBSCAN and k-Means are used for clustering, each with its own strengths. DBSCAN is particularly useful for identifying anomalies in dense regions of data.
Clustering-based techniques are used in various areas, including IT operations, application performance management, and intrusion detection systems. The key is finding the right tool for the job.
Finding the right tool can be tricky, but it's essential for effective anomaly detection. This is where machine learning models come in, offering a more flexible approach than traditional rule-based systems.
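Here's a minimal sketch with scikit-learn's DBSCAN; the points and the `eps`/`min_samples` values are made-up illustrations. Points that belong to no dense neighborhood get the noise label -1, which is what flags them as potential anomalies:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point far from both.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # cluster A
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],              # cluster B
              [10.0, -10.0]])                                  # lone point
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # dense points get cluster ids; the lone point gets -1
```

Unlike k-Means, DBSCAN doesn't force every point into a cluster, which is precisely what makes it handy for anomaly work: the leftovers are your candidates.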
Supervised vs Unsupervised
Supervised Anomaly Detection uses labeled training data to develop a classification model that can effectively flag points that deviate significantly from the learned norm across many variables.
This approach allows for high precision in anomaly identification, especially for multivariate use cases. However, it often requires multiple passes through the data for effective training, making it less ideal for the continuous nature of real-time data streams.
Manual data labeling introduces latency, hindering its use in situations requiring real-time detection. In contrast, Unsupervised Anomaly Detection works well in scenarios where streaming data must be analyzed in real time, labeled data is scarce, or the definition of "normal" is constantly evolving.
Unsupervised methods analyze the raw data on the fly, identifying inherent patterns and statistical properties. Data points that fall outside these patterns or statistical expectations by a significant margin are flagged as potential anomalies in real time.
Here's a comparison of the two approaches:

- Supervised: requires labeled training data; offers high precision, especially for multivariate use cases; labeling and multi-pass training add latency, making it less suited to real-time streams.
- Unsupervised: needs no labels; analyzes raw data on the fly; adapts as the definition of "normal" evolves; well suited to real-time streaming data.
Unsupervised Anomaly Detection is particularly well-suited for real-time anomaly detection, as it does not require pre-labeled data and can adapt to changing data trends. Techniques like Z-score and Interquartile Range (IQR) calculations can be used within unsupervised methods to identify outliers and potential anomalies in real time.
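For instance, an IQR check (Tukey's fences) can be sketched as below; the 1.5 multiplier is the conventional default and the sample data is made up:

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Tukey's fences: flag points more than k*IQR outside
    the first or third quartile."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (x < lower) | (x > upper)

data = np.array([10, 12, 11, 13, 12, 11, 60, 12, 10, 11])
print(np.where(iqr_outliers(data))[0])   # only the 60 is flagged
```

Because quartiles require no labels and no distributional assumptions, a check like this can run continuously over a stream with nothing but the recent data itself.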
Frequently Asked Questions
What is the 3 sigma rule for outlier detection?
The 3σ-rule detects outliers by identifying observations with residuals exceeding three times their standard deviation. This statistical method helps identify data points that significantly deviate from the norm.