Concept Drift vs. Data Drift: Best Practices for Identifying and Addressing Them


Identifying concept drift requires monitoring changes in the underlying concept or relationship between variables, such as a shift in customer behavior or a change in market trends.

Data drift, on the other hand, refers to changes in the distribution of the data itself, such as a shift in the mean or variance of a variable.

Concept drift can occur due to changes in the underlying population or environment, such as a change in demographics or a shift in customer preferences.

Monitoring the performance of machine learning models over time is key to detecting concept drift, as changes in model performance can indicate underlying shifts in the data.

Data drift, however, can occur due to changes in data collection methods or data quality, such as a change in sensor calibration or an increase in missing values.

Regularly reviewing and updating data pipelines and collection methods can help mitigate data drift.


What Is Concept Drift?

Concept drift is one of the most undesirable properties of streaming data because it makes the data unpredictable. It degrades the performance of mining techniques such as classification and clustering, increasing the chance of misclassification. The root cause is that the relationship between the target variable and the input features changes over time.


Consider, for instance, an email spam classifier that predicts whether an email is spam based on its textual body. The machine learning model learns a relationship between the target variable and the set of keywords that appear in spam emails.

These sets of keywords are not constant; their pattern changes over time. As a result, a model built on the old set of emails no longer works on the new pattern of keywords.

Here are the three types of shifts that can lead to model decay:

  • Covariate Shift: Shift in the independent variables.
  • Prior Probability Shift: Shift in the target variable.
  • Concept Drift: Shift in the relationship between the independent variables and the target variable.

This is why it is necessary to identify such drifts in the data to keep the model's results accurate and reliable.
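To make the distinction between these three shifts concrete, here is a minimal, purely illustrative NumPy sketch; the distributions and decision rules are made up for demonstration, not taken from any real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Reference period: feature x ~ N(0, 1), and the "concept" is y = 1 when x > 0
x_ref = rng.normal(0.0, 1.0, 10_000)
y_ref = (x_ref > 0).astype(int)

# Covariate shift: P(x) changes, but the rule y = (x > 0) stays the same
x_cov = rng.normal(1.0, 1.0, 10_000)
y_cov = (x_cov > 0).astype(int)

# Prior probability shift: P(y) changes, e.g. positives become much rarer
keep = rng.random(10_000) < np.where(y_ref == 1, 0.2, 1.0)
x_prior, y_prior = x_ref[keep], y_ref[keep]

# Concept drift: P(y | x) changes -- same features, new decision rule
y_new_concept = (x_ref > 0.5).astype(int)

print("P(y=1) reference:      ", y_ref.mean())
print("P(y=1) covariate shift:", y_cov.mean())
print("P(y=1) prior shift:    ", y_prior.mean())
print("P(y=1) concept drift:  ", y_new_concept.mean())
```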

Types of Concept Drift

Concept drift refers to the changes in the underlying patterns or relationships within a dataset over time.

There are different types of concept drift patterns, one of which is gradual drift, where the change occurs slowly and steadily.


Another type is sudden drift, where the change happens abruptly and without warning.

Concept drift can also be caused by a change in the underlying distribution of the data, such as a shift in the mean or variance.

This can be seen in production, where the patterns or relationships within a dataset change due to various factors, such as changes in user behavior or external events.

Detection and Monitoring

To detect concept drift, you need to set up ML model monitoring, which helps track how well your machine learning model is doing over time.

Machine learning model monitoring involves running reports on model quality, gathering metrics from the machine learning service, and displaying them on a live dashboard. You can also set up alerts to warn you if something goes wrong.

There are several metrics you can track to detect concept drift, including performance scores, classification confidence, and data drift.

Monitoring data drift introduces many challenges, including defining thresholds, as different quantification methods can create different scales of values.


The ideal concept drift handling system should be able to monitor the performance of the model over a long time, track relevant custom metrics, and detect data drift.

Here are some common methods for monitoring concept drift:

  • Monitoring the performance of the model over a long time
  • Monitoring the classification confidence (applicable only to classification)
  • Tracking relevant custom metrics to ensure your model is drift-free and performance is driving value

By monitoring concept drift, you can avoid the negative consequences of inaccurate predictions, such as lost revenue, reduced customer satisfaction, and increased operational costs.
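For example, classification confidence, one of the signals listed above, can be tracked by comparing the model's average top-class probability on recent data against the level observed on reference data. This is a minimal sketch, assuming a scikit-learn-style classifier with predict_proba and a hypothetical tolerance value:

```python
import numpy as np

def confidence_dropped(model, X_reference, X_current, tolerance=0.10):
    """Alert when average top-class confidence falls noticeably below the reference level."""
    ref_conf = np.max(model.predict_proba(X_reference), axis=1).mean()
    cur_conf = np.max(model.predict_proba(X_current), axis=1).mean()
    return (ref_conf - cur_conf) > tolerance, ref_conf, cur_conf

# Example with hypothetical objects:
# alert, ref, cur = confidence_dropped(clf, X_validation, X_last_week)
```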

Methods for Detecting Concept Drift

Detecting concept drift is crucial to handling it effectively. Machine learning model monitoring helps track how well your machine learning model is doing over time.

You can track various metrics to detect concept drift, such as model performance, data quality, and alerts for unexpected changes. For example, if your model's performance drops below a certain level, you'll get a signal to retrain the model or dig deeper into the issue.

Evidently is an open-source Python library that helps implement testing and monitoring for production machine learning models. You can choose from 100+ pre-built checks and metrics to evaluate concept drift, including comparing model performance for past periods using reports like Regression Performance or Classification Performance.

How to Detect


Detecting concept drift is crucial to prevent it from affecting your machine learning model's performance. You can set up machine learning model monitoring to track how well your model is doing over time.

You can use simple reports and one-off checks for this, or set up real-time monitoring dashboards.

To detect concept drift, you can track various metrics, including model performance, data distribution, and correlations between features. For example, you can use the Kolmogorov-Smirnov test to determine whether two samples come from the same distribution.

The Kolmogorov-Smirnov test is non-parametric, so it makes no assumptions about the underlying distribution. It can be used to detect data drift by comparing the distribution of the training data with the distribution of more recent data, such as live production data.
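A minimal sketch of that check with SciPy's two-sample Kolmogorov-Smirnov test; the 0.05 significance level is a common convention, not a fixed rule:

```python
from scipy.stats import ks_2samp

def feature_drifted(reference_values, current_values, alpha=0.05):
    """Reject 'same distribution' when the two-sample KS p-value falls below alpha."""
    statistic, p_value = ks_2samp(reference_values, current_values)
    return p_value < alpha, statistic, p_value

# Example with a hypothetical numeric column:
# drifted, stat, p = feature_drifted(train_df["amount"], recent_df["amount"])
```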

Evidently, the open-source Python library mentioned earlier, can evaluate concept drift by comparing model performance for past periods using reports such as Regression Performance or Classification Performance.
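As a rough sketch of what such a check can look like with Evidently (the import paths and class names below match one family of Evidently releases, but the API has changed over time, so treat them as assumptions to verify against your installed version):

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Hypothetical file names: reference = data the model was trained/validated on,
# current = recent production data with the same columns.
reference_df = pd.read_csv("reference.csv")
current_df = pd.read_csv("current.csv")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)
report.save_html("drift_report.html")  # inspect in a browser or publish to a dashboard
```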


Some common metrics to track when detecting concept drift include:

  • Regression Performance
  • Classification Performance
  • Data Drift
  • Prediction Drift
  • Correlations between features

These metrics can help you identify changes in your data distribution over time, which can indicate concept drift. By tracking these metrics regularly, you can detect concept drift early on and take corrective action to prevent it from affecting your model's performance.

Heuristics

Heuristics are a great way to detect concept drift. You can develop heuristics that reflect the model quality or correlate with it.

Tracking metrics like an average share of clicks on a recommendation block can be a useful heuristic. This can indicate if the model no longer shows relevant suggestions.

Domain experts can provide valuable input in developing meaningful heuristics. This is especially true when dealing with complex models and data.

The appearance of new categories in input features can be a sign of concept drift. For example, if a model forecasting future sales starts encountering entirely new product categories, it could be a warning sign.
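A simple heuristic of that kind checks for categorical values in production that never appeared in training; a sketch with a hypothetical column name:

```python
import pandas as pd

def unseen_categories(train_series: pd.Series, prod_series: pd.Series) -> set:
    """Return categorical values present in production data but absent from training data."""
    return set(prod_series.dropna().unique()) - set(train_series.dropna().unique())

# new_cats = unseen_categories(train_df["product_category"], prod_df["product_category"])
# if new_cats:
#     print(f"Warning: {len(new_cats)} unseen categories appeared, e.g.", sorted(new_cats)[:5])
```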

It's essential to continuously monitor your model's performance and adjust your heuristics accordingly. This will help you catch concept drift before it's too late.

Addressing Concept Drift


Concept drift can't be avoided for complex phenomena, so it's essential to have strategies in place to manage it. One approach is to periodically retrain the model so that it learns from recent historical data: detect when the model needs retraining, then update it with fresh data.

Retraining a model can be costly, especially in a supervised setting, so it's often necessary to selectively create a sub-sample from the complete population and retrain on that. Feature dropping, covered in more detail below, is another simple but effective method: build multiple models that each use a single feature at a time and monitor the AUC-ROC response to determine whether a feature is drifting.

Other remedies for concept drift include business rules and policies, human-in-the-loop decision-making, and pausing or stopping the model. For example, you can modify decision thresholds for classification models or apply heuristics on top of the model output.

How to Address


Addressing concept drift requires a combination of strategies and techniques to manage its effects.

One approach is to detect concept drift through established model monitoring. This is crucial for knowing when to take action.

Sometimes, model retraining under concept drift is not possible due to a lack of new labeled data or other constraints. In these scenarios, business rules and policies can be modified to adjust model sensitivity. For example, decision thresholds can be changed to reduce false positives.
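For instance, raising the decision threshold of a binary classifier is a small, reversible way to trade some recall for fewer false positives; a sketch assuming a scikit-learn-style model with predict_proba:

```python
import numpy as np

def predict_with_threshold(model, X, threshold=0.5):
    """Apply a custom decision threshold instead of the default 0.5."""
    positive_scores = model.predict_proba(X)[:, 1]
    return (positive_scores >= threshold).astype(int)

# Under suspected drift, a stricter threshold (e.g. 0.7) reduces false positives
# at the cost of missing some true positives:
# y_pred = predict_with_threshold(model, X_recent, threshold=0.7)
```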

Human-in-the-loop decision-making can also be considered. This involves sending some or all of the data for manual decision-making when concept drift is detected.

Alternative models can be used to make decisions when the primary model is no longer reliable. This can include heuristics or other model types.

In some cases, it's best to simply pause or stop the model if its quality is unsatisfactory. Alternatively, you can choose to do nothing and accept a diminished performance.


Here are some possible remedies for preventing deterioration in prediction accuracy due to concept drift:

  • Retrain the model in reaction to a triggering mechanism, such as a change-detection test.
  • Use tracking solutions that continually update the model to track changes in the concept.
  • Use contextual information to better explain the causes of concept drift.
  • Periodically retrain the model to learn from historical data.

Periodic retraining is necessary for models that experience concept drift due to complex phenomena that are not governed by fixed laws of nature.
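The first remedy in the list above, retraining in reaction to a trigger, can be sketched as follows; the quality metric, the tolerance, and the retraining data window are all assumptions to tune per use case:

```python
from sklearn.metrics import f1_score

def maybe_retrain(model, X_labeled_window, y_labeled_window, refit_fn,
                  baseline_f1, tolerance=0.05):
    """Refit the model when F1 on the latest labeled window drops below the baseline."""
    current_f1 = f1_score(y_labeled_window, model.predict(X_labeled_window))
    if current_f1 < baseline_f1 - tolerance:
        model = refit_fn(X_labeled_window, y_labeled_window)  # e.g. retrain on recent data
    return model, current_f1
```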

Feature Dropping

Feature dropping is a simple yet effective method to deal with concept drift. It involves building multiple models, each using a different feature while keeping the target variable unchanged.

The idea is to monitor the AUC-ROC response after prediction on test data. If the AUC-ROC associated with a particular feature crosses a chosen threshold, that feature can be flagged as drifting.

This method is widely used in the industry, and it's a good starting point for addressing concept drift. A common threshold for AUC-ROC is 0.8, but this can vary depending on the specific use case.

By dropping the drifting feature, you can prevent it from negatively impacting your model's performance. This can be a good solution, especially if the feature is not essential to the model's functionality.
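One common way to implement a per-feature check like this is to train a small classifier per feature to tell reference rows apart from production rows; an AUC-ROC well above 0.5 (for example, above 0.8) means that feature alone separates the two periods and is likely drifting. The sketch below follows that interpretation; it is not necessarily the exact procedure described above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def per_feature_drift_auc(reference_df, production_df, features, auc_threshold=0.8):
    """Flag features whose values alone can separate reference rows from production rows."""
    labels = np.r_[np.zeros(len(reference_df)), np.ones(len(production_df))]
    combined = pd.concat([reference_df, production_df], ignore_index=True)
    drifting = {}
    for feature in features:
        X = combined[[feature]].fillna(combined[feature].median())
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, labels, test_size=0.3, random_state=0, stratify=labels)
        clf = LogisticRegression().fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
        if auc > auc_threshold:
            drifting[feature] = round(auc, 3)
    return drifting
```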

Projects


Concept drift is a common problem in machine learning and data analysis, and it requires innovative solutions to address effectively. One approach is to develop dynamic systems that can adapt to changing data distributions.

One such project was INFER, a computational intelligence platform for evolving and robust predictive systems, developed by Bournemouth University, Evonik Industries, and the Research and Engineering Centre between 2010 and 2014.

The INFER project was just one of many efforts to tackle concept drift. Another notable project was HaCDAIS, which focused on handling concept drift in adaptive information systems at Eindhoven University of Technology from 2008 to 2012.

Researchers at INESC Porto and the Laboratory of Artificial Intelligence and Decision Support in Portugal worked on KDUS, a project that explored knowledge discovery from ubiquitous streams.

These projects demonstrate the importance of addressing concept drift in various domains and the need for innovative solutions. ADEPT, a project from the University of Manchester and the University of Bristol, worked on adaptive dynamic ensemble prediction techniques.


In 2022, the GAENARI project developed a C++ incremental decision tree algorithm that aims to minimize concept drifting damage.

Here are some key projects that have tackled concept drift:

  • INFER: Computational Intelligence Platform for Evolving and Robust Predictive Systems (2010–2014)
  • HaCDAIS: Handling Concept Drift in Adaptive Information Systems (2008–2012)
  • KDUS: Knowledge Discovery from Ubiquitous Streams
  • ADEPT: Adaptive Dynamic Ensemble Prediction Techniques
  • ALADDIN: Autonomous Learning Agents for Decentralised Data and Information Networks (2005–2010)
  • GAENARI: C++ incremental decision tree algorithm

Best Practices for Dealing with Concept Drift

Dealing with concept drift requires a structured approach. Data collection and preprocessing is the first step, which involves handling missing values, outliers, and label encoding for categorical variables.

To effectively detect concept drift, it's essential to compare the data points of adjacent windows. Metrics such as accuracy, precision, recall, the AUC-ROC curve, and execution time are useful for this purpose.

Divide the data stream into a series of windows to facilitate data labeling. Assign a class label to individual data points based on the business context.

If concept drift is detected, it's crucial to follow an appropriate methodology to address it.
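A minimal sketch of the windowing and adjacent-window comparison described above, for a single numeric feature; the window size and significance level are assumptions:

```python
from scipy.stats import ks_2samp

def adjacent_window_drift(values, window_size=1000, alpha=0.05):
    """Return indices of stream windows whose distribution differs from the previous window."""
    windows = [values[i:i + window_size] for i in range(0, len(values), window_size)]
    flagged = []
    for i in range(1, len(windows)):
        if len(windows[i]) < window_size // 2:
            break  # skip a short trailing window
        _, p_value = ks_2samp(windows[i - 1], windows[i])
        if p_value < alpha:
            flagged.append(i)
    return flagged
```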

Machine Learning Challenges

Data drift can be a major challenge in machine learning, but it's not the only one. One of the biggest challenges is ensuring that the model remains accurate over time.


Continuous monitoring is key to addressing data drift. By regularly checking the data, organizations can detect changes as they occur and address them quickly.

Data drift can lead to decreased model performance, which can have serious consequences. For example, in healthcare, a model that's no longer accurate can lead to misdiagnoses and incorrect treatment.

To mitigate the effects of data drift, organizations can use data cleansing techniques such as deduplication, standardization, and validation.
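A small pandas sketch of those cleansing steps; the column names and valid ranges are hypothetical:

```python
import pandas as pd

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()                                       # deduplication
    df = df.assign(country=df["country"].str.strip().str.upper())   # standardization
    return df[df["age"].between(0, 120)]                            # validation of a numeric range
```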

Here are some strategies that can help address data drift:

  • Continuous monitoring
  • Data cleansing
  • Retraining
  • Ensemble models
  • Data augmentation

These strategies can help ensure that the machine learning model remains accurate and continues to perform well over time.

Measuring and Defining Concept Drift

Measuring data drift isn't straightforward. You need to understand which distribution you want to test and check if it's drifting relative to the distribution you choose as your reference distribution.

To define the right drift metrics, you'll need to decide how to quantify the distance between these two distributions. This is a crucial step in detecting concept drift.
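One widely used way to quantify that distance for a numeric feature is the Population Stability Index (PSI); a minimal implementation is shown below, where the bin count and the usual 0.1/0.25 alert levels are conventions rather than fixed rules:

```python
import numpy as np

def population_stability_index(reference, current, bins=10):
    """PSI between two numeric samples; ~0.1 is often read as moderate drift, ~0.25 as major."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)  # avoid log(0) and division by zero
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```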


Defining the tested period and the reference distribution should be customized on a case-by-case basis. You should consider what constitutes the tested period and what distribution to use as a reference for comparison.

The most common use case for drift is to test for a training-serving skew, which can be referred to as a type of uncertainty measure. However, in many cases, the training dataset may not reflect the real-life distribution.

Model quality metrics, such as accuracy, precision, recall, or F1-score, are the most direct and reliable reflection of concept drift. A significant drop in these metrics over time can indicate the presence of concept drift.

Defining the Reference Distribution

The reference distribution is a crucial aspect of measuring concept drift, and it's essential to choose it wisely. In many cases, the training dataset may not reflect the real-life distribution, so comparing production distribution to the test dataset, which was untouched and represents the actual distribution, may be a better approach.


The most common use case for drift is to test for a training-serving skew, where the model in production is working under different circumstances than what was assumed when it was being trained. This can be referred to as a type of uncertainty measure.

For example, if you've rebalanced classes, run stratified sampling, or used other normalization methods, the training dataset may not accurately represent the real-life distribution. In these cases, comparing production distribution to the test dataset is a better approach.

In some cases, comparing the tested distribution to a sliding window reference distribution can be useful. This can help highlight whether the underlying data distribution has been changing over time.

Here are some common use cases for defining the reference distribution:

  • Training-serving skew: compare the production distribution to the training dataset
  • Imbalanced or rebalanced training data: compare the production distribution to the untouched test dataset
  • Seasonality use case: compare tested distribution to the equivalent distribution a season ago

Ultimately, the choice of reference distribution depends on the specific use case and data.
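As a concrete example of the seasonality case from the list above, the reference can simply be the equivalent slice of data from one season earlier; the timestamp column name and the season length are assumptions:

```python
import pandas as pd

def seasonal_windows(df, timestamp_col="event_time", tested_days=7, season_days=365):
    """Return (reference, tested): the latest window and the same-length window one season earlier."""
    end = df[timestamp_col].max()
    tested = df[df[timestamp_col] > end - pd.Timedelta(days=tested_days)]
    ref_end = end - pd.Timedelta(days=season_days)
    reference = df[(df[timestamp_col] > ref_end - pd.Timedelta(days=tested_days))
                   & (df[timestamp_col] <= ref_end)]
    return reference, tested
```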

Quality Metrics

Model quality metrics are the most direct and reliable reflection of concept drift. A significant drop in accuracy, precision, recall, or F1-score over time can indicate the presence of concept drift.


For classification problems, you can track metrics like accuracy, precision, recall, or F1-score. In scenarios like spam detection, you can gather user feedback, such as moving emails to spam or marking them as "not spam."

However, in some situations, you can't get the labels easily. For instance, if you predict future sales, you can only know the forecast error once that period is over.

You can utilize proxy metrics and heuristics as early warning signs of concept drift. They can tell you about a change in the environment or model behavior that might precede a drop in model quality.


Note

Concept drift is a real challenge in data analysis, and it's essential to understand the methods used to investigate it.

For instance, researchers have identified various methods to investigate concept drift in big data streams, as mentioned in a reference paper.

The reference paper "Methods to Investigate Concept Drift in Big Data Streams" highlights the importance of understanding concept drift in real-world applications.


Investigating concept drift requires a combination of statistical and machine learning techniques.

The paper suggests that statistical methods, such as hypothesis testing, can be used to detect concept drift.

Machine learning algorithms, such as ensemble methods, can also be employed to identify concept drift.

Researchers have found that concept drift can occur due to changes in the underlying distribution of the data.

These changes can be caused by various factors, including changes in the population, environment, or data collection process.

In practice, concept drift can have significant consequences for the accuracy and reliability of machine learning models.

Univariate Approach

The univariate approach to detecting data drift is a straightforward method that can be easily implemented. You can choose from a variety of statistical measures that can be applied to a single univariate distribution.

One of the advantages of the univariate approach is that it's simple to drill down to the specific feature that's drifting, making it easy to interpret the results. You can see the drift score of each feature and decide how it contributed to the overall drift score.


Some of the benefits of the univariate approach include:

  • Simple to implement
  • Easy to drill down to drifting feature/s
  • Can easily be adjusted to weigh different features according to their importance

However, this approach can be impacted by redundancy, where correlated features can lead to multiple measurements of the same drift, making it less effective in high-dimensional datasets.

Univariate vs. Multivariate

In the univariate approach, you can compute the drift on a per-feature basis and then use some aggregation function to get a single aggregated drift metric. This is a common practice, but it has its drawbacks.

Computing drift on a per-feature basis can lead to an overload of alerts due to the granularity of the measurement method, especially in scenarios of high dimensionality. Many machine learning models leverage dozens, hundreds, or even thousands of different features.

You can use the average drift level in all of the features to come up with a single score, but this might not accurately reflect the importance of each feature. Normalizing this according to the importance of each feature can help, but this information might not be available.
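A sketch of that idea: compute a per-feature drift score (here the KS statistic) and aggregate with feature-importance weights when they are available, falling back to a plain average otherwise:

```python
import numpy as np
from scipy.stats import ks_2samp

def aggregated_drift_score(reference_df, current_df, features, importances=None):
    """Importance-weighted average of per-feature KS statistics (uniform weights by default)."""
    scores = np.array([ks_2samp(reference_df[f], current_df[f]).statistic for f in features])
    if importances is None:
        weights = np.ones(len(features))
    else:
        weights = np.array([importances[f] for f in features], dtype=float)
    return float(np.average(scores, weights=weights))
```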

Another option is to use a multivariate method from the get-go, but this approach has its own set of pros and cons.

Univariate Approach Advantages


The univariate approach has its advantages, and one of the biggest benefits is how simple it is to implement. You can choose from a variety of statistical measures that can be easily applied to a single univariate distribution.

One of the things I like about the univariate approach is how easy it is to drill down to the individual features that are causing a problem. This makes it easy to interpret the results and see exactly how each feature contributed to the overall drift score.

The univariate approach also allows you to easily adjust the weights of different features according to their importance. This means you can tailor the approach to the specific needs of your project.

Here are some of the key advantages of the univariate approach:

  • Simple to implement – You can choose from a variety of statistical measures that can be easily applied to a single univariate distribution.
  • Easy to drill down to drifting feature/s and, therefore, easy to interpret.
  • It can easily be adjusted to weigh different features according to their importance.

Software and Tools

There are several software tools available for detecting concept drift in machine learning systems, including Frouros, an open-source Python library specifically designed for drift detection.


Some notable tools for concept drift detection include NannyML, which can detect univariate and multivariate distribution drift, and MOA, a free open-source software for mining data streams with concept drift.

If you're looking for a tool that can handle both data stream mining and concept drift, you might want to consider RapidMiner, formerly known as YALE, which is a free open-source software for knowledge discovery and data mining.

Here are some popular software tools for concept drift detection:

  • Frouros
  • NannyML
  • RapidMiner
  • EDDM (Early Drift Detection Method)
  • MOA (Massive Online Analysis)

Software

There are some amazing open-source software options available for detecting concept drift in machine learning systems. One such option is Frouros, a Python library designed specifically for drift detection.

Frouros is a powerful tool that can help you identify changes in your data streams. It's free and open-source, making it a great choice for anyone looking to get started with concept drift detection.

NannyML is another popular open-source library that's perfect for detecting univariate and multivariate distribution drift. It's also a Python library, making it easy to integrate with your existing workflows.


For those who prefer a more comprehensive platform, RapidMiner is a great option. It's a free, open-source software that offers a range of features, including data stream mining and tracking drifting concepts.

If you're looking for something a bit more specialized, EDDM (Early Drift Detection Method) is a free open-source implementation of drift detection methods in Weka. It's a great choice for those who want to dive deeper into the technical details of concept drift detection.

Here are some of the software options mentioned in this section:

  • Frouros: An open-source Python library for drift detection in machine learning systems.
  • NannyML: An open-source Python library for detecting univariate and multivariate distribution drift and estimating machine learning model performance without ground truth labels.
  • RapidMiner (formerly Yet Another Learning Environment, YALE): free open-source software for knowledge discovery, data mining, and machine learning, also featuring data stream mining, learning time-varying concepts, and tracking drifting concepts.
  • EDDM (Early Drift Detection Method): free open-source implementation of drift detection methods in Weka.
  • MOA (Massive Online Analysis): free open-source software specific for mining data streams with concept drift.

Reviews

Reviews of software and tools for data stream learning are plentiful, but it's essential to consider the challenges of benchmarking these algorithms with real-world data, as noted by Souza et al. in their 2020 study.

One of the most significant challenges is concept drift, which occurs when the underlying distribution of the data changes over time. This can be seen in the work of Gama et al., who surveyed the field of ensemble learning for data stream analysis and found that concept drift is a major issue.


Krawczyk et al. also highlighted the importance of ensemble learning in their 2017 survey, noting that it can help to adapt to changing data distributions.

To give you a better idea of the tools available, here are some notable reviews:

  • Souza et al. (2020) highlighted the challenges of benchmarking stream learning algorithms with real-world data.
  • Gama et al. (2014) surveyed the field of ensemble learning for data stream analysis and found that concept drift is a major issue.
  • Krawczyk et al. (2017) noted the importance of ensemble learning in adapting to changing data distributions.
  • Dal Pozzolo et al. (2015) demonstrated the effectiveness of ensemble learning in credit card fraud detection.
  • Alippi (2014) discussed the challenges of learning in nonstationary and evolving environments.

