Data drift is a silent killer of AI models: the underlying data distribution changes over time, quietly rendering the model less effective. This can happen due to changes in user behavior, new data being added, or even seasonal variations.
The consequences can be severe, from degraded model performance and biased predictions to real financial losses. One study reported that data drift caused roughly a 30% drop in model accuracy over time.
Data drift can occur due to various reasons, including changes in user demographics, new product releases, or even changes in external factors like weather or economic conditions. For instance, a model trained on data from a specific region may not perform well when deployed in a different region with a different climate.
Understanding data drift is crucial to maintaining the performance of AI models, and it's essential to monitor data distribution over time to catch any changes early on. By doing so, you can take corrective actions to prevent data drift from affecting your model's performance.
What Is AI Data Drift?
Data drift is a change in the statistical properties and characteristics of the input data that a machine learning model is trained on. It occurs when the data deviates from what the model was initially trained on or earlier production data.
This shift in input data distribution can lead to a decline in the model's performance. Most machine learning models are only good at making predictions on data similar to what they were trained on.
Detecting and addressing data drift is vital to maintaining ML model reliability in dynamic settings. A few terms related to data drift are concept drift, model drift, prediction drift, and training-serving skew.
Data drift can happen suddenly, like when a retailer's sales shift from physical stores to online channels. The model's forecast accuracy may drop significantly, affecting the ability to manage inventory effectively.
Types of Data Drift
Data drift in machine learning happens when there is a difference between the historical data used to train and validate the model and the live production data. This difference can arise from various factors, such as time, data integrity concerns, and seasonality.
Time is a crucial factor that leads to data drift, as there can be gaps between when the data is gathered and when it is consumed. For example, a photo of a location taken in the summer can look very different from a photo of the same location taken in winter under heavy snow.
Some common types of data drift include Covariate Shift, Label Shift, Domain Shift/Adaptation, and Sample Selection Bias. These types of drifts can significantly affect a model's performance.
Here are the key types of data drift:
- Covariate Shift: occurs when independent variables (features) shift between the training and production environment.
- Label Shift: a distribution drift that occurs when the distribution of labels changes over time.
- Domain Shift/Adaptation: occurs when the training data is not representative of the domain the model encounters in production.
- Sample Selection Bias: refers to the differences in the data collection process.
These types of data drift can have a significant impact on a model's performance, and it's essential to monitor and address them to ensure the model remains accurate and reliable.
Label Shift
Label shift is a type of distribution drift that occurs when the distribution of labels changes over time.
This can happen when the data is split into train and test sets, resulting in different proportions of the target label. For example, in an image classification project to classify cats and dogs, the test set may have more cats compared to the proportion in the training set.
The model's performance can be affected by this uneven distribution, making it harder to generalize accurately. Label shifts can occur in computer vision use cases with uneven data.
To detect label drift, data science teams can track metrics such as drift scores based on Earth Mover's Distance or the Population Stability Index (PSI), the bounding box area distribution, samples per class, and bounding boxes per image.
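As a minimal sketch of the idea, the snippet below compares the label distribution of training data with that of production data using PSI. The column contents and the 0.2 alert threshold are illustrative assumptions, not fixed rules.

```python
# Compare label distributions between training and production using PSI.
import numpy as np
import pandas as pd

def label_psi(train_labels: pd.Series, prod_labels: pd.Series, eps: float = 1e-6) -> float:
    """Population Stability Index between two categorical label distributions."""
    classes = sorted(set(train_labels) | set(prod_labels))
    p = train_labels.value_counts(normalize=True).reindex(classes, fill_value=0) + eps
    q = prod_labels.value_counts(normalize=True).reindex(classes, fill_value=0) + eps
    return float(np.sum((q - p) * np.log(q / p)))

train = pd.Series(["cat"] * 500 + ["dog"] * 500)   # balanced training labels
prod = pd.Series(["cat"] * 800 + ["dog"] * 200)    # production skews toward cats

score = label_psi(train, prod)
print(f"Label PSI: {score:.3f}")
if score > 0.2:  # a commonly used, but arbitrary, PSI alert level
    print("Possible label shift - investigate before trusting model metrics.")
```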
Change in Correlations
Change in Correlations is a way to spot concept drift by looking at the correlations between model features and predictions, as well as pairwise correlations between features. If there's a significant change in how they relate to each other, it might signal a meaningful pattern shift.
You can use correlation coefficients like Pearson's or Spearman's to evaluate the correlation strength. Visualize the relationships on a heatmap to make it easier to spot changes.
This method works best with smaller datasets, like those in a healthcare setting, where features are interpretable and there are known strong correlations. However, it can be too noisy in other scenarios.
Tracking individual feature correlations can be impractical, so it's best to run occasional checks to surface the most significant shifts in correlations.
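Here is an illustrative sketch of such a check: it compares pairwise feature correlations between a reference (training) dataset and current production data and flags the pairs that moved the most. The DataFrame contents and the 0.3 cutoff are assumptions for demonstration only; a heatmap of the delta matrix (e.g., with seaborn) makes the shifts easier to see.

```python
# Compare pairwise Spearman correlations between reference and current data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
reference = pd.DataFrame({
    "age": rng.normal(50, 10, 1000),
    "blood_pressure": rng.normal(120, 15, 1000),
})
reference["cholesterol"] = reference["age"] * 2 + rng.normal(0, 5, 1000)

current = reference.copy()
current["cholesterol"] = rng.normal(200, 20, 1000)  # correlation with age broke down

ref_corr = reference.corr(method="spearman")
cur_corr = current.corr(method="spearman")
delta = (cur_corr - ref_corr).abs()

# Report pairs whose correlation moved by more than 0.3 (arbitrary cutoff).
for i, col_a in enumerate(delta.columns):
    for col_b in delta.columns[i + 1:]:
        if delta.loc[col_a, col_b] > 0.3:
            print(f"{col_a} vs {col_b}: correlation shifted by {delta.loc[col_a, col_b]:.2f}")
```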
Tabular
Tabular data drift detection is a crucial aspect of ensuring data quality. For small datasets with less than or equal to 1000 observations, numerical columns with more than 5 unique values are checked using the two-sample Kolmogorov-Smirnov test.
Numerical columns with 5 or fewer unique values, as well as categorical columns, use the chi-squared test. Binary categorical features, with 2 or fewer unique values, are checked using the proportion difference test for independent samples based on Z-score.
The default confidence level for all of these tests is 0.95. For larger datasets (more than 1,000 observations), distance metrics are used instead: the Wasserstein Distance for numerical columns with more than 5 unique values, and the Jensen-Shannon divergence for categorical columns or numerical columns with 5 or fewer unique values.
You can always modify this drift detection logic to suit your needs. Evidently, the library whose defaults are described here, lets you select from a range of built-in statistical tests, specify custom thresholds, or pass a custom test.
Here are the default drift detection methods for tabular data:
- Small datasets (1,000 observations or fewer): two-sample Kolmogorov-Smirnov test for numerical columns with more than 5 unique values
- Small datasets: chi-squared test for categorical columns and numerical columns with 5 or fewer unique values
- Small datasets: proportion difference test for independent samples (Z-score) for binary categorical features
- Larger datasets (more than 1,000 observations): Wasserstein Distance for numerical columns with more than 5 unique values
- Larger datasets: Jensen-Shannon divergence for categorical columns and numerical columns with 5 or fewer unique values
These methods can be used as-is or modified to suit your specific needs.
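For intuition, here is a rough sketch of the test-selection logic for a single numerical column, implemented directly with SciPy. It mirrors the defaults described above (dataset size, p-value 0.05, Wasserstein threshold around 0.1) but is not the library's actual implementation; data values are made up.

```python
# Pick a drift method for a numerical column based on dataset size.
import pandas as pd
from scipy import stats

def numeric_drift(ref: pd.Series, cur: pd.Series) -> tuple[str, float, bool]:
    small_data = len(ref) <= 1000
    if small_data:
        # Two-sample Kolmogorov-Smirnov test; flag drift if p-value < 0.05.
        _, p_value = stats.ks_2samp(ref, cur)
        return "ks_2samp", p_value, p_value < 0.05
    # Larger data: normalized Wasserstein distance; flag drift above ~0.1.
    distance = stats.wasserstein_distance(ref, cur) / (ref.std() + 1e-9)
    return "wasserstein", distance, distance > 0.1

ref = pd.Series([10, 12, 11, 13, 12, 14, 11, 12] * 50)
cur = pd.Series([18, 20, 19, 21, 22, 20, 19, 18] * 50)
print(numeric_drift(ref, cur))  # ('ks_2samp', <tiny p-value>, True)
```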
Causes of Data Drift
A common cause of data drift is covariate shift, where the independent variables (features) shift between the training and production environments. This often happens when moving from a controlled offline environment to a live, dynamic one.
Other causes include changes in how the data is collected, data quality problems in the pipeline, and shifts in the population the model serves, such as new user demographics or new regions.
The subsections below explain why monitoring these causes matters, and look at covariate shift and data quality issues in more detail.
Why Is It Important?
Data drift is a crucial aspect of production machine learning because it's inevitable and can significantly impact model quality. Understanding that distribution drift can happen helps prepare us to maintain model quality.
Tracking data distribution drift can be a technique to monitor model quality in production when true labels are unavailable. This is especially useful in situations where ground truth is scarce.
Data drift analysis helps interpret and debug model quality drops by understanding changes in the model environment. It's a vital tool for identifying and addressing issues that affect model performance.
Covariate Shift
Covariate shift occurs when independent variables shift between the training and production environment. This can happen when moving from a controlled offline or local environment to a live dynamic one.
The features encountered in the offline environment might differ from those encountered in the real world. This is typical in computer vision models, where input data used to train the model might have different levels of lighting, contrast, or exposure.
For instance, a model trained with x-ray images to detect a specific disease might not represent real-world cases. If the disease is most prevalent in patients 40 years and older, but the training data was collected from patients between 20 and 30 years old, the model's performance can suffer significantly.
Covariate shift can also stem from changes in environmental conditions, such as lighting or exposure, which creates a mismatch between the training and production data and causes the model to perform poorly.
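One hedged way to catch this kind of covariate shift in a vision pipeline is to compare a simple image statistic, such as mean brightness, between the training set and recent production images. The folder paths, file format, and the choice of a KS test here are illustrative assumptions.

```python
# Compare mean image brightness between training and production images.
import numpy as np
from pathlib import Path
from PIL import Image
from scipy import stats

def mean_brightness(folder: str) -> np.ndarray:
    values = []
    for path in Path(folder).glob("*.png"):
        img = np.asarray(Image.open(path).convert("L"), dtype=np.float32)
        values.append(img.mean())
    return np.array(values)

train_brightness = mean_brightness("data/train_images")       # hypothetical path
prod_brightness = mean_brightness("data/production_images")   # hypothetical path

statistic, p_value = stats.ks_2samp(train_brightness, prod_brightness)
if p_value < 0.05:
    print(f"Brightness distribution shifted (KS statistic={statistic:.2f})")
```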
Quality
Data quality issues can lead to observed data drift, but they're not the same thing. Data quality concerns issues like missing values or errors in the data.
Data drift, on the other hand, refers to statistical changes in the data distribution, even if the data has high quality. You can often attribute data distribution shifts to data quality issues.
Data quality issues can be caused by pipeline bugs or data entry errors, which can result in corrupted and incomplete data. It's essential to verify data quality before applying data distribution checks.
Data drift detection techniques can often expose data quality issues, which is why it's crucial to separate the two groups of checks: first verify the data quality, and then apply data distribution checks to see if there's a statistical shift in the feature patterns.
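A minimal sketch of that ordering, "quality first, drift second", is shown below, so a pipeline bug is not mistaken for genuine drift. The file names, column names, and limits are illustrative assumptions.

```python
# Run basic integrity checks before any statistical drift comparison.
import pandas as pd
from scipy import stats

def quality_report(df: pd.DataFrame) -> dict:
    return {
        "missing_share": df.isna().mean().to_dict(),      # share of nulls per column
        "duplicate_rows": int(df.duplicated().sum()),
        "negative_prices": int((df["price"] < 0).sum()),  # domain-specific sanity check
    }

current = pd.read_csv("batch_2024_06.csv")     # hypothetical production batch
reference = pd.read_csv("training_data.csv")   # hypothetical training data

report = quality_report(current)
if max(report["missing_share"].values()) > 0.1 or report["negative_prices"] > 0:
    print("Fix data quality issues before interpreting drift:", report)
else:
    _, p_value = stats.ks_2samp(reference["price"], current["price"])
    print("Price drift p-value:", p_value)
```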
Detecting and Monitoring Data Drift
Data drift monitoring is a time-consuming task when done manually, so consider using open source or paid tools like Evidently AI or WhyLabs to make it easier.
You can detect data drift using statistical tests; keep in mind that parametric tests tend to be more sensitive to drift than non-parametric tests such as the chi-squared and Kolmogorov-Smirnov tests.
For streaming data, apply statistical tests over a window and compare a representative aggregate metric, such as the mean, mode, or median.
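A sketch of that window-based approach is below: keep a fixed reference sample and compare each completed window's distribution (and its mean) against it. The window size, the p-value threshold, and the simulated stream are assumptions.

```python
# Sliding-window drift check for a numeric stream using a KS test.
from collections import deque
import numpy as np
from scipy import stats

WINDOW = 500
reference = np.random.normal(0, 1, WINDOW)   # stand-in for historical data
current_window = deque(maxlen=WINDOW)

def on_new_value(x: float) -> None:
    current_window.append(x)
    if len(current_window) == WINDOW:
        sample = np.array(current_window)
        _, p_value = stats.ks_2samp(reference, sample)
        if p_value < 0.01:
            print(f"Drift suspected: window mean={sample.mean():.2f}, p={p_value:.4f}")
        current_window.clear()   # start the next window

for value in np.random.normal(1.5, 1, 2000):  # simulated shifted stream
    on_new_value(value)
```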
To spot data drift, you need to set up machine learning model monitoring that tracks how well your model is doing over time.
With monitoring in place, you can use different metrics to detect concept drift, including model quality metrics like accuracy, precision, recall, or F1-score.
A significant drop in these metrics over time can indicate the presence of concept drift.
Here are some common metrics to track when detecting concept drift:
- Accuracy
- Precision
- Recall
- F1-score
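As a hedged sketch, you could compute the metrics listed above over weekly windows once ground-truth labels arrive and flag a sustained drop. The log file, column names (binary labels assumed), window size, and the 10% drop threshold are illustrative assumptions.

```python
# Track model quality metrics per week and flag a relative F1 drop.
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

log = pd.read_csv("predictions_with_labels.csv", parse_dates=["timestamp"])  # hypothetical log

weekly = []
for week, group in log.groupby(pd.Grouper(key="timestamp", freq="W")):
    if len(group) == 0:
        continue
    weekly.append({
        "week": week,
        "accuracy": accuracy_score(group["label"], group["prediction"]),
        "precision": precision_score(group["label"], group["prediction"], zero_division=0),
        "recall": recall_score(group["label"], group["prediction"], zero_division=0),
        "f1": f1_score(group["label"], group["prediction"], zero_division=0),
    })

metrics = pd.DataFrame(weekly).set_index("week")
baseline = metrics["f1"].iloc[0]
if metrics["f1"].iloc[-1] < 0.9 * baseline:  # >10% relative drop from the first week
    print("F1 dropped relative to baseline - possible concept drift.")
print(metrics.tail())
```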
Data drift analysis can also be useful for model troubleshooting and debugging, as it helps understand and contextualize the changes in the model features and locate the source of the issue.
You can use proxy metrics and heuristics as early warning signs of concept drift when direct metrics are not available.
For example, in scenarios like spam detection, you can track user feedback, such as moving emails to spam or marking them as "not spam", to calculate model quality metrics.
Addressing Data Drift
Data drift is a natural occurrence in machine learning models, and it's essential to address it to maintain their performance. You can expect minor variations to accumulate over time, and even with no drastic changes, new products can appear, customer preferences can evolve, and market conditions can change.
Retraining your models on new data is a common strategy to combat concept drift. This involves designing a model update process and retraining schedule. A robust model monitoring setup is also necessary to provide visibility into the current model quality and ensure you can intervene in time.
Model quality metrics, such as accuracy, precision, recall, or F1-score, are the most direct and reliable reflection of concept drift. A significant drop in these metrics over time can indicate the presence of concept drift.
Automating data drift detection reduces tedious manual work and alerts teams when drift occurs so they can act. Tools like WhyLabs, Fiddler AI, Evidently AI, Deepchecks, and skmultiflow.drift_detection can help with this.
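One of the tools named above, skmultiflow, ships streaming change detectors such as ADWIN. Below is a minimal sketch (the API shown matches recent skmultiflow releases; the simulated error stream is made up for illustration).

```python
# Detect a change in the model's error rate on a stream with ADWIN.
import numpy as np
from skmultiflow.drift_detection import ADWIN

adwin = ADWIN()
# Feed per-prediction error indicators (1 = wrong, 0 = correct).
errors = np.concatenate([
    np.random.binomial(1, 0.10, 1000),  # model starts at ~10% error
    np.random.binomial(1, 0.35, 1000),  # error rate jumps, e.g. after drift
])

for i, err in enumerate(errors):
    adwin.add_element(err)
    if adwin.detected_change():
        print(f"Change in error rate detected around observation {i}")
```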
When you detect drift, investigate the problem rather than assuming the model itself is faulty. You can use data drift analysis to understand and contextualize the changes in the model features and locate the source of the issue.
Best Practices for Avoiding Data Drift
Being proactive is key to avoiding and resolving data drifts in Machine Learning projects. You have to understand the data collected, highlight metrics to monitor, and consider the timing between data collection and deployment.
To be proactive, data scientists should ensure data quality. This means monitoring data for errors, inconsistencies, and missing values.
There are two ways to be proactive: ensuring data quality and monitoring the ML system for data drift. This proactive approach can help you catch data drifts early on and make necessary adjustments.
Visualizing how the data distribution changes over time also helps: such a plot makes it easier to identify patterns and anomalies in your data.
By being proactive, you can avoid data drifts and ensure the accuracy and reliability of your Machine Learning models.
Data Drift in Specific Contexts
Data drift can occur in various contexts, and understanding these scenarios is crucial for effective model monitoring.
Covariate shift is a common issue when moving from a controlled offline environment to a live dynamic one, where the features encountered in the offline environment might differ from those in the real world.
Label shift, on the other hand, is a distribution drift that occurs when the distribution of labels changes over time, which can affect a model's ability to generalize accurately.
Domain shift/adaptation is also a significant concern, especially with computer vision models, which need ample input data to generalize accurately to the target domain.
A good rule of thumb is to have at least 1,000 sample images for each class, as seen in the original ImageNet challenge.
Domain Shift
Domain shift occurs when the input data used to train a model is not representative of the real-world data it will encounter in production. This can lead to a decline in model performance.
Convolutional Neural Networks (CNNs) are particularly prone to domain shift, requiring a rich source of data to generalize accurately. A good rule of thumb is to have 1,000 sample images for each class.
Transfer Learning can help mitigate domain shift by transferring the learned weights of a previously trained model to a related but different problem. This can be particularly useful when working with limited data.
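A hedged sketch of that idea: reuse a network pretrained on a large source domain and fine-tune only its final layer on data from the new domain. This uses the torchvision >= 0.13 API; the two-class setup and the (omitted) target-domain DataLoader are illustrative assumptions.

```python
# Transfer learning: freeze a pretrained backbone, retrain only the head.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="DEFAULT")  # weights learned on ImageNet

# Freeze the pretrained feature extractor so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for the new problem (e.g., 2 classes).
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training loop sketch: `target_domain_loader` would be a DataLoader over
# images from the new domain (not defined here).
# for images, labels in target_domain_loader:
#     optimizer.zero_grad()
#     loss = criterion(model(images), labels)
#     loss.backward()
#     optimizer.step()
```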
The problem domain can change significantly, requiring adjustments to the model's architecture and parameters to achieve optimal results. For example, a model built around images captured in one country might struggle to identify objects in another country.
Domain shift can occur due to various factors, including environmental changes, data preprocessing, and feature engineering. It's essential to analyze the training and test errors to identify bias and take corrective action.
Slice-based Learning can help understand what parts of the data the model works well on, providing insights into the domain shift. This can be particularly useful when dealing with complex data sets.
Embeddings
Drift in embeddings can be detected with several methods, each of which accepts custom thresholds and parameters.
Distance-based options include Euclidean distance and cosine similarity.
These methods can be combined with dimensionality reduction, which can help surface subtle changes in the data.
Whichever method you use, you need to specify thresholds and parameters that fit your setup.
Other options include Maximum Mean Discrepancy (MMD) and tracking the share of drifted embedding components.
Each method has its own strengths and weaknesses, and the right choice depends on the specific context and requirements of your project.
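The following is an illustrative sketch of the distance-based approach: compare the mean embedding vector of reference data with that of current data using Euclidean and cosine distance. The embeddings and the 0.2 cosine-distance threshold are assumptions for demonstration.

```python
# Distance between embedding centroids as a simple embedding drift signal.
import numpy as np
from scipy.spatial.distance import cosine, euclidean

rng = np.random.default_rng(42)
reference_embeddings = rng.normal(0.0, 1.0, size=(5000, 384))  # e.g., encoder outputs
current_embeddings = rng.normal(0.3, 1.0, size=(5000, 384))    # shifted production embeddings

ref_centroid = reference_embeddings.mean(axis=0)
cur_centroid = current_embeddings.mean(axis=0)

print("Euclidean distance between centroids:", euclidean(ref_centroid, cur_centroid))
print("Cosine distance between centroids:", cosine(ref_centroid, cur_centroid))

if cosine(ref_centroid, cur_centroid) > 0.2:
    print("Embedding drift suspected - inputs may have changed meaningfully.")
```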
Data Drift Metrics and Heuristics
Model quality metrics are the most direct and reliable reflection of concept drift, so it's essential to track them regularly. You can calculate metrics like accuracy, precision, recall, or F1-score for classification problems.
A significant drop in these metrics over time can indicate the presence of concept drift. For instance, if you notice a decrease in accuracy for a spam detection model, it might be a sign that the model is no longer effective.
You can also use proxy metrics and heuristics as early warning signs of concept drift, especially when direct metrics are not available. These can tell you about changes in the environment or model behavior that might precede a drop in model quality.
Developing heuristics that reflect the model quality or correlate with it is a good approach. For example, tracking the average share of clicks on a recommendation block can give you an idea of the recommendation system's quality.
If you notice a sudden drop in clicks, it might mean that the model is no longer showing relevant suggestions. Similarly, tracking the appearance of new categories in input features can indicate concept drift.
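A hedged sketch of the click-share heuristic: track the daily share of recommendation impressions that receive a click and alert when it falls well below its trailing average. The log schema, window, and thresholds are illustrative assumptions.

```python
# Rolling click-through rate as a proxy metric for recommendation quality.
import pandas as pd

events = pd.read_csv("recommendation_events.csv", parse_dates=["date"])  # hypothetical log
daily = events.groupby("date").agg(
    impressions=("impression_id", "count"),
    clicks=("clicked", "sum"),
)
daily["ctr"] = daily["clicks"] / daily["impressions"]

trailing = daily["ctr"].rolling(window=14).mean()  # two-week trailing average
latest = daily["ctr"].iloc[-1]
if latest < 0.7 * trailing.iloc[-1]:               # 30% relative drop (arbitrary)
    print(f"CTR dropped to {latest:.3f} - recommendations may be stale (possible concept drift).")
```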
It's a good idea to leverage input from domain experts to develop meaningful heuristics. They can provide valuable insights into the data and help you identify potential issues.
Data drift analysis is also a useful technique for model troubleshooting and debugging. By examining the underlying cause of a model quality drop, you can identify changes in the input data that might be contributing to the issue.
Running checks for per-feature distribution comparison can help you identify which features have shifted most significantly. This can give you a better understanding of the changes in the model features and help you locate the source of the issue.
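A simple sketch of that per-feature comparison: score every numerical feature with a two-sample KS test and list the ones that shifted most, to narrow down where a quality drop is coming from. The file names are assumptions.

```python
# Rank numerical features by how much their distributions have shifted.
import pandas as pd
from scipy import stats

reference = pd.read_csv("training_features.csv")   # hypothetical reference data
current = pd.read_csv("last_week_features.csv")    # hypothetical recent data

results = []
for column in reference.select_dtypes("number").columns:
    statistic, p_value = stats.ks_2samp(reference[column].dropna(), current[column].dropna())
    results.append({"feature": column, "ks_statistic": statistic, "p_value": p_value})

ranked = pd.DataFrame(results).sort_values("ks_statistic", ascending=False)
print(ranked.head(10))  # the most-shifted features are the first places to investigate
```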
By using data drift metrics and heuristics, you can stay on top of concept drift and make necessary adjustments to your models.
Sources
- https://docs.evidentlyai.com/reference/data-drift-algorithm
- https://www.deepchecks.com/data-drift-in-computer-vision-models/
- https://docs.evidentlyai.com/user-guide/customization/options-for-statistical-tests
- https://www.evidentlyai.com/ml-in-production/data-drift
- https://www.evidentlyai.com/ml-in-production/concept-drift