Distributional shift refers to the phenomenon where the underlying distribution of data changes over time, making it challenging for machine learning models to generalize well. This can happen for various reasons, such as changes in population demographics, new technologies, or shifts in user behavior.
One of the key challenges of distributional shift is that it can occur gradually, making it difficult to detect. For instance, one study found that the distribution of user behavior on a popular e-commerce platform shifted significantly over a period of six months, degrading the performance of its recommendation system.
The impact of distributional shift can be substantial, leading to decreased model performance and even economic losses. In one case, a company experienced a 20% decrease in sales due to a shift in user behavior on its platform.
To address distributional shift, it's essential to regularly monitor the distribution of data and update models accordingly.
Causes of Distributional Shift
Distribution shift can result from non-stationarity, where the data has a temporal component and its underlying distribution changes over time. A classic example is trying to predict future commodity prices from historical data.
Non-stationarity can also occur due to non-temporal changes, such as training a model on one geographical area or on a certain source of images before applying it elsewhere.
Interaction effects, a special kind of non-stationarity, can occur when the model has an impact on the systems it is interacting with, particularly in massive deployments.
Sampling bias is another cause of distribution shift, where a biased sampling process leads to a discrepancy between the training distribution and the deployment distribution.
Sample selection bias occurs when the training data is obtained through a biased method, not representing the operating environment where the classifier is to be deployed.
Non-stationary environments can also cause dataset shift, where the training environment is different from the test one, whether due to a temporal or spatial change.
Types of Distributional Shift
Distributional shift can manifest in several distinct ways, each with its own set of challenges. Covariate shift, a common challenge in real-world settings, occurs when the distribution of the covariates changes between model training and model deployment, while the way these covariates relate to the outcome remains the same. This can be due to changes in the operating environment, such as a factory receiving a large order and running its machines harder, which changes covariates like operating temperature and lubricant level.
Label shift is, in a sense, the reverse of covariate shift: the distribution of the output (the labels) changes, while the distribution of the covariates given the label remains the same. This can happen, for example, when the deployment environment produces a different mix of outcomes than the one represented in the training data.
Concept drift occurs when the relationship between the input and output changes, while the distribution of the input remains the same. This can be due to changes in the underlying process, such as a temporal process that was stationary becoming non-stationary.
In summary, the three main types of distributional shift are covariate shift, label shift, and concept drift.
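Writing the joint distribution as P(X, Y) = P(Y | X)P(X) = P(X | Y)P(Y), the three types can be summarized as follows (a standard formulation, added here for reference):

```latex
\begin{align*}
\text{Covariate shift:} \quad & P_{\text{train}}(X) \neq P_{\text{deploy}}(X),
  && P_{\text{train}}(Y \mid X) = P_{\text{deploy}}(Y \mid X) \\
\text{Label shift:} \quad & P_{\text{train}}(Y) \neq P_{\text{deploy}}(Y),
  && P_{\text{train}}(X \mid Y) = P_{\text{deploy}}(X \mid Y) \\
\text{Concept drift:} \quad & P_{\text{train}}(Y \mid X) \neq P_{\text{deploy}}(Y \mid X),
  && P_{\text{train}}(X) = P_{\text{deploy}}(X)
\end{align*}
```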
Sample Selection Bias
Sample selection bias is a systematic flaw in the process of data collection or labeling that causes nonuniform selection of training examples from a population, leading to biases during training.
This type of bias is not a flaw with any algorithm or handling of the data, but rather a result of how the data is collected or labeled. It's like trying to train a model on a dataset that only shows pictures of cats, when in reality the world has many more dogs than cats.
Sample selection bias is especially relevant in highly imbalanced classification, where the minority class is particularly sensitive to individual classification errors because of the small number of samples it contains.
In extreme cases, a single misclassified example of the minority class can cause a significant drop in measured performance. For example, if the data contains only 10 pictures of dogs, misclassifying just one of them as a cat already costs 10 percentage points of recall on the dog class.
This type of bias can be caused by many factors, such as biased data collection methods, sampling errors, or even intentional manipulation of the data. It's essential to be aware of these biases and take steps to mitigate them when working with datasets.
Here are some examples of sample selection bias:
- A face recognition algorithm trained predominantly on younger faces
- Predicting life expectancy with a training set that has very few samples of individuals who smoke
- Classifying images as either cats or dogs and omitting certain species from the training set that are seen in the test set
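As a minimal sketch of how a biased collection step alone degrades performance (the dataset, class ratio, and sampling rate below are synthetic and chosen only for illustration), a classifier can be trained on a selectively sampled subset and then evaluated on a representative test set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic "population": two classes, roughly balanced.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.5, 0.5], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Biased collection: keep almost all of class 0 but only a handful of class 1 examples.
keep = (y_train == 0) | (np.random.RandomState(0).rand(len(y_train)) < 0.02)
X_biased, y_biased = X_train[keep], y_train[keep]

model = LogisticRegression(max_iter=1000).fit(X_biased, y_biased)

# Accuracy looks worse on the representative test set, and recall on the
# under-sampled class suffers the most.
print("test accuracy:", model.score(X_test, y_test))
print("class-1 recall:", (model.predict(X_test)[y_test == 1] == 1).mean())
```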
Non-Stationary Environments
In real-world applications, it's common for data to not be stationary, meaning it can change over time or space. This type of non-stationarity can lead to distributional shifts, making it challenging for machine learning models to adapt.
Adversarial classification problems, such as spam filtering and network intrusion detection, are a prime example of non-stationary environments. An adversary can intentionally try to evade the classifier's learned concepts by manipulating the data, introducing various types of distributional shifts.
Covariate shift, in particular, can occur in these scenarios, where the distribution of covariates changes between the training and test sets, while the relationship between the covariates and the outcome remains the same.
Adversarial settings like these highlight the importance of monitoring covariate shift in dynamic environments. In high-dimensional data, tracking individual features can be impractical, and dimensionality reduction techniques like t-SNE can be useful for visualizing the data and detecting shifts.
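One way to build such a visualization, sketched below with scikit-learn and matplotlib on placeholder arrays (substitute samples of your real training and production features), is to embed both sets of points together and colour them by origin; clear separation between the two groups suggests a shift:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder data: swap in samples of your actual feature matrices.
X_train = np.random.normal(0.0, 1.0, size=(500, 20))
X_prod = np.random.normal(0.5, 1.2, size=(500, 20))   # slightly shifted

X_all = np.vstack([X_train, X_prod])
origin = np.array([0] * len(X_train) + [1] * len(X_prod))

# Project the high-dimensional features to 2-D for inspection.
embedding = TSNE(n_components=2, random_state=0).fit_transform(X_all)

plt.scatter(embedding[origin == 0, 0], embedding[origin == 0, 1], s=5, label="training")
plt.scatter(embedding[origin == 1, 0], embedding[origin == 1, 1], s=5, label="production")
plt.legend()
plt.title("t-SNE of training vs. production features")
plt.show()
```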
Internal covariate shift, a phenomenon specific to deep neural networks, refers to the distribution of a layer's inputs changing during training as the parameters of earlier layers are updated. This can impede the training of deep neural networks, but batch normalization helps by normalizing the inputs to hidden layers.
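As a small PyTorch illustration (the layer sizes are arbitrary, not taken from any particular architecture), a `BatchNorm1d` layer placed after a linear layer normalizes that layer's activations across each batch:

```python
import torch
import torch.nn as nn

# A tiny feed-forward network with batch normalization between layers.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # normalizes the 64 activations across the batch
    nn.ReLU(),
    nn.Linear(64, 2),
)

x = torch.randn(32, 20)   # batch of 32 examples, 20 features each
logits = model(x)         # BatchNorm keeps hidden activations well-scaled during training
print(logits.shape)       # torch.Size([32, 2])
```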
Detecting Distributional Shift
Data distribution shifts are only a problem if they cause your model's performance to degrade. So, the first idea might be to monitor your model's accuracy-related metrics in production to see whether they have changed.
Monitoring accuracy-related metrics is crucial, but it can be challenging without access to ground truth labels. In production, you don't always have access to ground truth, and even when you do, the labels often arrive with a delay.
The distributions of interest are the input distribution P(X), the label distribution P(Y), and the conditional distributions P(X|Y) and P(Y|X). Monitoring the input distribution is possible without ground truth labels, but monitoring the label distribution and the conditional distributions requires knowing Y.
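A lightweight way to monitor the input distribution, sketched below with SciPy on a single synthetic feature (in practice you would run this per feature, or on a summary statistic), is a two-sample Kolmogorov-Smirnov test comparing a training-time reference sample against recent production data:

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder arrays: values of one input feature at training time vs. in production.
feature_train = np.random.normal(0.0, 1.0, size=10_000)
feature_prod = np.random.normal(0.3, 1.0, size=2_000)

stat, p_value = ks_2samp(feature_train, feature_prod)
if p_value < 0.01:
    print(f"possible shift in this feature (KS statistic={stat:.3f}, p={p_value:.2e})")
```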
Not all types of shifts are equal – some are harder to detect than others. Abrupt changes are easier to detect than slow, gradual changes, and shifts can happen across two dimensions: spatial or temporal.
Temporal shifts happen over time, and a common approach is to treat input data to ML applications as time series data. The time scale window of the data affects the shifts we can detect – if your data has a weekly cycle, then a time scale of less than a week won't detect the cycle.
Detecting temporal shifts is hard when shifts are confounded by seasonal variation. Differentiating between cumulative and sliding statistics is essential – sliding statistics are computed within a single time scale window, while cumulative statistics are continually updated with more data.
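The difference is easy to see with pandas; in the sketch below the hourly series is synthetic, with a deliberate level change halfway through, so the two kinds of statistics can be compared:

```python
import numpy as np
import pandas as pd

# Synthetic hourly metric with a level change halfway through.
values = np.concatenate([np.random.normal(10, 1, 500), np.random.normal(13, 1, 500)])
series = pd.Series(values, index=pd.date_range("2024-01-01", periods=1000, freq="h"))

cumulative_mean = series.expanding().mean()          # uses all data seen so far
sliding_mean = series.rolling(window=24 * 7).mean()  # one-week window only

# The sliding mean reacts to the shift quickly; the cumulative mean lags
# because it is dominated by the older data.
print(cumulative_mean.iloc[-1], sliding_mean.iloc[-1])
```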
Concept drift, or posterior shift, is when the input distribution remains the same but the conditional distribution of the output given an input changes. This means that the same input can produce different outputs over time.
Concept drift can be cyclic or seasonal, such as rideshare prices fluctuating on weekdays versus weekends. Companies might have different models to deal with cyclic and seasonal drifts.
Handling and Correcting Shift
Handling data distribution shifts depends on the sophistication of your machine learning infrastructure setup. At one end of the spectrum, companies that have just started with ML might not have gotten to the point where data shifts are catastrophic yet.
Many companies assume that data shifts are inevitable and periodically retrain their models, but determining the optimal frequency is still a decision made based on gut feelings. We'll discuss more about retraining frequency in the lecture on Continual Learning.
There are two parts to handling data shifts in a more targeted way: detecting the shift and addressing the shift. If possible, retraining your model on data that reflects the new distribution is the most direct way to correct dataset shift.
If retraining is not possible, there are several techniques for correcting dataset shift, including feature removal and importance reweighting. Feature removal involves analyzing individual features, or running an ablation study, to determine which features are most responsible for the shift and then removing them from the dataset.
Importance reweighting involves upweighting training instances that are very similar to your test instances, essentially changing your training data set to look like it was drawn from the test data set.
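One common way to estimate such weights when the true density ratio is unknown is to train a "domain classifier" to distinguish training points from test points and convert its probabilities into a density-ratio estimate. The scikit-learn sketch below is one such approach, not the only one, and the function name is just illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(X_train, X_test):
    """Estimate p_test(x) / p_train(x) with a domain classifier."""
    X_all = np.vstack([X_train, X_test])
    domain = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])

    clf = LogisticRegression(max_iter=1000).fit(X_all, domain)
    p_test = clf.predict_proba(X_train)[:, 1]   # P(domain=test | x) for training points

    # The odds ratio approximates the density ratio (up to a constant
    # depending on the relative sample sizes).
    weights = p_test / (1.0 - p_test)
    return weights / weights.mean()             # normalize for stability

# The resulting weights can be passed to most estimators via `sample_weight`
# so that training emphasizes regions that look like the test data.
```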
Some of the most important methods for working under covariate shift include weighting the log-likelihood function, importance weighted cross-validation, integrated optimization problem, discriminative learning, kernel mean matching, adversarial search, and the Frank-Wolfe algorithm.
Here are some of the methods mentioned above, along with a brief description of each:
- Weighting the log-likelihood function: reweighting each training example's contribution to the log-likelihood by an estimate of the density ratio, so that the fit reflects the test distribution (a minimal sketch follows this list).
- Importance weighted cross-validation: applying the same importance weights when computing validation error, so that model selection is not misled by the shift.
- Integrated optimization problem: estimating the weights and fitting the model within a single optimization problem rather than in two separate steps.
- Discriminative learning: learning a discriminative model that directly accounts for the shift in the data.
- Kernel mean matching: choosing weights for the training points so that the weighted mean of their features, computed in a reproducing kernel Hilbert space, matches the mean of the test features.
- Adversarial search: treating the shift as an adversary and searching for a model that performs well under the worst-case shift it could induce.
- Frank-Wolfe algorithm: a projection-free convex optimization algorithm that can be used to solve the weighting problems that arise in the methods above.
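As a minimal sketch of the first method, assuming importance weights are already available (they are synthetic here, standing in for the output of a density-ratio estimator such as the domain-classifier sketch above), the weights can be passed to a likelihood-based fit through scikit-learn's `sample_weight` argument:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic training data plus hypothetical importance weights.
X_train = rng.normal(size=(1000, 5))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)
weights = np.exp(0.8 * X_train[:, 0])      # upweight the region the test data favours
weights /= weights.mean()

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train, sample_weight=weights)   # weighted log-likelihood fit
```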
Understanding and Addressing Shift
Distributional shift can occur due to changes in the relationship between covariates and labels over time. This is known as concept shift, where a model's predictions may become less accurate as the learned relationships become outdated in the face of new data dynamics.
A real-world example of concept shift is a maintenance routine for milling machines, where improved maintenance practices altered the relationship between usage time and machine health. Despite similar covariate distributions, the model's accuracy decreased from 98% to 84%.
Concept drift, also called posterior shift, is the same phenomenon viewed through this lens: the conditional distribution of the output given an input changes, but the input distribution remains the same. It also appears in time series analysis, where a stationary time series is easier to analyze than a non-stationary one.
Monitoring Accuracy-Related Metrics
Monitoring accuracy-related metrics is crucial to ensure your model's performance doesn't degrade over time. This involves logging and tracking user feedback, such as clicks, hides, purchases, upvotes, downvotes, favorites, bookmarks, and shares.
You should log all types of user feedback to calculate your model's accuracy-related metrics. This will help you identify whether your model's performance has improved or worsened.
Changes in accuracy-related metrics can be subtle, so it's essential to track them closely. For example, if your recommendation system is supposed to suggest videos to watch next on YouTube, you should track the click-through rate and completion rate.
If the click-through rate remains the same but the completion rate drops, it might indicate that your recommendation system is getting worse. This could be a sign of a shift in the underlying input distribution.
Monitoring accuracy-related metrics can help you detect changes in your model's performance, even if the feedback can't be used to infer natural labels directly.
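As a minimal sketch (the event-log schema and completion threshold below are invented for illustration), these two metrics can be computed from logged feedback and tracked per day:

```python
import pandas as pd

# Hypothetical feedback log: one row per recommendation impression.
events = pd.DataFrame({
    "date": pd.to_datetime(["2024-05-01", "2024-05-01", "2024-05-02", "2024-05-02"]),
    "clicked": [1, 0, 1, 1],
    "watch_fraction": [0.9, 0.0, 0.2, 0.35],   # fraction of the video watched
})

daily = events.groupby("date").agg(
    click_through_rate=("clicked", "mean"),
    completion_rate=("watch_fraction", lambda s: (s >= 0.9).mean()),
)
print(daily)
# A stable click-through rate paired with a falling completion rate is the
# pattern described above: users still click, but stop watching.
```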
Addressing Concept Drift
Concept drift is a sneaky phenomenon where the relationship between inputs and outputs changes over time, making your model's predictions less accurate. This can happen even if the input distribution remains the same.
Concept drift can be cyclic or seasonal, like rideshare prices fluctuating on weekdays versus weekends, or flight tickets rising during holiday seasons. Companies might need to have different models to deal with these types of drifts.
Detecting concept drift is challenging because it can appear gradually, making it hard to pinpoint when the model's predictions start deviating from expected outcomes. A general strategy to detect this form of distributional shift is to systematically monitor the performance of ML models over time.
You can also monitor the distribution of predictions, which serves as a proxy for shifts in the input distribution. Predictions are low-dimensional, so it is relatively cheap to run two-sample tests to detect whether the prediction distribution has shifted.
A good example of concept drift is house prices in San Francisco changing because of the COVID-19 pandemic, even though the distribution of house features remained the same. This highlights the importance of monitoring for concept drift to ensure your model's predictions remain accurate over time.
Changes in accuracy-related metrics might not become obvious for days or weeks, but a model predicting all False for 10 minutes can be detected immediately. This shows that monitoring predictions can help you catch concept drift before it's too late.
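A simple version of that check, sketched below with an invented class name and made-up thresholds, is to alert whenever the positive-prediction rate over a recent window drifts far from its historical level, which catches the all-False failure mode within a single window:

```python
from collections import deque

class PredictionRateMonitor:
    """Alert when the recent positive-prediction rate looks degenerate."""

    def __init__(self, window_size=1000, expected_rate=0.05, tolerance=0.04):
        self.recent = deque(maxlen=window_size)
        self.expected_rate = expected_rate      # assumed historical positive rate
        self.tolerance = tolerance

    def observe(self, prediction: int) -> bool:
        self.recent.append(prediction)
        if len(self.recent) < self.recent.maxlen:
            return False                        # not enough data yet
        rate = sum(self.recent) / len(self.recent)
        return abs(rate - self.expected_rate) > self.tolerance

# A stream of all-False predictions trips the alert as soon as the window fills:
monitor = PredictionRateMonitor()
alerts = [monitor.observe(0) for _ in range(1000)]
print(any(alerts))   # True
```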
It's essential to have different models to deal with cyclic and seasonal drifts, like having one model to predict rideshare prices on weekdays and another for weekends. This can help mitigate the effects of concept drift and keep your model's predictions accurate.
Prior Probability Shift
Prior probability shift occurs when the distribution of the class variable y changes, unlike covariate shift which focuses on changes in the feature x distribution. This type of shifting is essentially the reverse of covariate shift.
An unbalanced dataset is a great example of prior probability shift. If a training set has equal prior probabilities on spam and non-spam, we'd expect 50% of emails to be spam and 50% to be non-spam. But if, in deployment, 90% of incoming emails are spam, the prior probability of the class variable has changed.
This problem occurs in Y → X (generative) settings, which are commonly modeled with naive Bayes, the classic approach to spam filtering, making spam detection a natural setting in which to think about prior probability shift.
Prior probability shift is related to data sparsity and biased feature selection, which can also cause covariate shift; here, however, they influence the output distribution rather than the input distribution.
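When the new class priors are known or can be estimated, predicted probabilities can be re-weighted with Bayes' rule instead of retraining, since P(X|Y) is assumed unchanged under prior probability shift. The sketch below applies this adjustment to hypothetical spam probabilities:

```python
import numpy as np

def adjust_for_new_priors(probs, train_priors, new_priors):
    """Re-weight predicted class probabilities P(y|x) for shifted priors.

    p_new(y|x) is proportional to p_train(y|x) * p_new(y) / p_train(y),
    because P(x|y) is assumed unchanged under prior probability shift.
    """
    reweighted = probs * (np.asarray(new_priors) / np.asarray(train_priors))
    return reweighted / reweighted.sum(axis=1, keepdims=True)

# Hypothetical example: a model trained with 50/50 priors, deployed where
# roughly 90% of email is spam.
probs = np.array([[0.6, 0.4],    # P(not spam | x), P(spam | x) under training priors
                  [0.3, 0.7]])
print(adjust_for_new_priors(probs, train_priors=[0.5, 0.5], new_priors=[0.1, 0.9]))
```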
Degenerate Feedback Loop
A degenerate feedback loop can happen when the predictions themselves influence the feedback, which is then used to extract labels to train the next iteration of the model.
This type of loop can occur in tasks with natural labels from users, such as recommender systems and ads click-through-rate prediction. It's especially common in systems where users' interactions with the system are used as inputs to the same system.
Imagine building a system to recommend songs to users. The songs that are ranked high by the system are shown first to users, who then click on them more, making the system more confident that these recommendations are good.
In the beginning, the rankings of two songs might be only marginally different, but because one song was originally ranked a bit higher, it got clicked on more, which made the system rank it even higher. After a while, its ranking became much higher than the other song.
Degenerate feedback loops can cause models to perform suboptimally at best, and at worst, they can perpetuate and magnify biases embedded in data. For example, a resume-screening model might recommend resumes with certain features, which then gets reinforced by the company only hiring candidates with those features.
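A toy simulation makes the dynamic concrete: two equally appealing songs start with a marginal difference in observed click rate, exposure follows the ranking, and clicks feed back into the ranking. All numbers below are invented:

```python
import random

random.seed(0)

# Two songs with identical true appeal; one starts with a marginally higher click rate.
clicks = {"song_a": 1, "song_b": 0}
impressions = {"song_a": 2, "song_b": 2}
true_appeal = 0.5

for _ in range(10_000):
    # The system shows the song with the higher empirical click rate.
    ranked_first = max(clicks, key=lambda s: clicks[s] / impressions[s])
    impressions[ranked_first] += 1
    if random.random() < true_appeal:
        clicks[ranked_first] += 1

print(impressions)
# Nearly all exposure goes to the song that happened to start slightly higher,
# even though the two songs are equally appealing.
```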
Natural Labels and Feedback
Natural labels and feedback are crucial in machine learning, but they can be tricky to work with, especially when it comes to distributional shift.
Label shift occurs when the distribution of the labels changes between model training and model deployment, as we saw in the example of the milling machine that became more prone to breakdowns.
This change in label distribution can significantly impact the model's performance, as we saw when the Random Forest classifier achieved an accuracy of only 86% in the new scenario.
Regular examination of the class label distributions is a key approach to ensure they accurately represent the deployment environment.
The distribution of the covariates given the label remains constant in label shift, which can make it easier to detect, but it still requires careful monitoring to catch any changes.
In the example, the Vibration covariate showed a shift in the test data, which was assumed to be a symptom of machine breakdowns.
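One routine check, sketched below with a chi-square test on hypothetical class counts, is to compare the label proportions seen at training time against those observed in a recent production window:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical class counts: [healthy, breakdown] at training time vs. recently.
train_counts = np.array([9500, 500])
recent_counts = np.array([1800, 200])

stat, p_value, _, _ = chi2_contingency(np.vstack([train_counts, recent_counts]))
if p_value < 0.01:
    print(f"label distribution may have shifted (chi2={stat:.1f}, p={p_value:.3g})")
```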
Frequently Asked Questions
What is the difference between domain shift and covariate shift?
Domain shift and covariate shift are related concepts, but domain shift refers to a change in environment between the source and target domains, while covariate shift specifically refers to changes in the distribution of input features. Understanding the difference is crucial for developing effective machine learning models that can adapt to new environments.