Active learning is a machine learning approach that involves actively selecting the most informative data points to label, rather than passively labeling all available data. This approach can significantly reduce the amount of labeled data needed to achieve high accuracy.
By actively selecting the most informative data points, we can focus on the most uncertain or difficult examples, which can lead to a more efficient and effective learning process. This is particularly useful when working with large datasets or when labeling data is time-consuming or expensive.
The goal of active learning is to minimize the number of labeled examples required to achieve a desired level of accuracy, while also ensuring that the model is well-trained and reliable.
What is Active Learning
Active learning is a special case of machine learning in which a learning algorithm can interactively query a user to label data with the desired outputs.
In active learning, the algorithm itself chooses, from the pool of unlabeled data, the subset of examples to be labeled next.
The basic idea behind active learning is that if an ML algorithm could select the data it wants to learn from, it might be able to achieve a higher degree of accuracy with fewer training labels.
This approach is interactive, meaning the algorithm actively queries the user to label data, which can be a more efficient way to train a model compared to traditional machine learning methods.
Why We Need Active Learning
We need active learning because data is the fuel that powers modern deep learning models, and copious amounts of it are often required to train these neural networks.
The annotation process can be extensively time-consuming and expensive, especially for tasks like segmentation and motion detection. Labeling images and videos can be laborious, and some domains, such as medical imaging, require domain knowledge from experts with limited accessibility.
To make matters worse, not all data points are equally important for training a model. In fact, labeling a random subset of the data is likely to produce a model with sub-par performance, with lower accuracy and other diminished performance metrics.
Here are the benefits of active learning:
- Reduced Labeling Costs: By reducing the quantity of labeled data needed for training, active learning helps to cut down on labeling expenses.
- Enhanced Model Performance: Active learning frequently produces models with higher accuracy and generalization by concentrating on the most instructive examples.
- Optimal Resource Utilisation: When labeled data acquisition presents a bottleneck, active learning enables a more economical use of resources.
- Adaptability to Data Distribution: Because active learning allows the model to concentrate on difficult examples, it works well in situations where the distribution of data is uneven or unbalanced.
What Distinguishes Passive
What distinguishes active from passive learning is who chooses the training data.
In passive learning, the model is trained on a fixed, randomly chosen dataset. This can be limiting because the model may not be learning from the most relevant data.
Passive learning is like trying to learn a new language by listening to random conversations - you might pick up some phrases, but you won't be able to hold a conversation anytime soon.
The model in passive learning doesn't get to choose which instances to query for labels, which can lead to a less effective learning process.
The Motivation
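Not all data points are equally important for training a model. Picture two clusters of points, one per class, with a decision boundary in between: points deep inside a cluster add little once the model has seen a few of them, while points near the boundary determine where it is drawn.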
Just imagine having tens of thousands of unlabeled data points to learn from - it would be cumbersome or extremely expensive to label all those points manually.
Labeling a random subset of data can lead to a model with sub-par performance, with lower accuracies and other diminished performance metrics.
The idea of active learning is to select data points near the decision boundary, so that the model learns from the examples that matter most. Among all the samples in a given unlabeled dataset, these are usually the preferred ones to label.
This concept originated and evolved to mitigate the pain of labeling all data points, especially in scenarios like NLP training where getting relevant labeled datasets can be a challenge.
The costs involved in labeling data can be significant, both in terms of money and time, not to mention the possible carbon footprint.
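Figures like the following are often quoted for active learning. They vary widely with the task, dataset, and query strategy, so treat them as rough indications rather than guarantees: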
- Reducing labeling costs by 50% or more
- Improving model performance by 10-20% or more
- Optimizing resource utilization by 20-30% or more
- Adapting to data distribution with improved accuracy
How It Works
Active learning is a machine learning technique that selects the most informative data points from an unlabeled dataset and requests labels for these points. This approach aims to reduce the labeling cost and improve the model's performance with fewer labeled examples.
The algorithm selects the most valuable data instances, which could be edge cases, and requests them to be labeled by a human. These newly labeled instances are then added to the training set, and the model is retrained. The selection of these data instances happens through a process called querying.
Active learning scenarios fall into two broad families: query synthesis-based and sampling-based, with sampling further split into pool-based and stream-based. Pool-Based Sampling is the most well-known scenario, where the learning algorithm evaluates the entire pool of unlabeled data before selecting data points for labeling. Stream-Based Selective Sampling examines each consecutive unlabeled instance one at a time, and Membership Query Synthesis generates synthetic data from an underlying natural distribution.
This gives three scenarios of active learning:
- Pool-Based Sampling: selects data points from the entire dataset based on confidence scores.
- Stream-Based Selective Sampling: examines each unlabeled instance one at a time to decide whether to assign a label or query the teacher.
- Membership Query Synthesis: generates synthetic data from an underlying natural distribution to query the teacher.
The process of active learning involves a cycle of selecting the most valuable data instances, requesting them to be labeled by a human, adding the newly labeled instances to the training set, and retraining the model. This cycle repeats iteratively until the model achieves the desired level of accuracy.
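The cycle described above can be sketched in a few lines. This is only an illustration under loud assumptions: the data is one-dimensional, the "model" is a hypothetical threshold classifier (`fit_threshold`), uncertainty is taken to be closeness to that threshold, and `oracle` stands in for the human annotator. None of these names come from a particular library.

```python
import random

def fit_threshold(labeled):
    """Toy model (assumption): threshold at the midpoint of the two class means."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def uncertainty(x, threshold):
    """Points closer to the threshold are treated as more uncertain."""
    return -abs(x - threshold)

def oracle(x):
    """Stand-in for the human annotator (hypothetical ground truth)."""
    return int(x > 0.6)

def active_learning_loop(pool, seed_set, batch_size=2, rounds=5):
    labeled = list(seed_set)      # must contain at least one example per class
    unlabeled = list(pool)
    threshold = fit_threshold(labeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        # Query: rank the pool by uncertainty and take the top batch.
        unlabeled.sort(key=lambda x: uncertainty(x, threshold), reverse=True)
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        # Label the batch, add it to the training set, and retrain.
        labeled.extend((x, oracle(x)) for x in batch)
        threshold = fit_threshold(labeled)
    return threshold, labeled

rng = random.Random(0)
pool = [rng.random() for _ in range(200)]
seed_set = [(0.05, 0), (0.95, 1)]
threshold, labeled = active_learning_loop(pool, seed_set)
```

In practice the loop would stop on a validation metric or a labeling budget rather than a fixed number of rounds.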
The selection of data instances happens through a process called querying, which is a key component of active learning. The query formation plays a crucial role in developing an active learning algorithm.
Active learning scenarios fall into two broad families, query synthesis-based and sampling-based. Let's take a closer look at the two sampling-based scenarios:
Pool-based sampling is the most well-known scenario, where the algorithm initially trains on a labeled subset of the data and then selects, from the remaining pool, the instances for which it is least confident. Because the whole pool must be scored, this approach is memory-intensive and limited in handling enormous datasets, but it is more efficient in terms of teacher effort.
Stream-based methods, on the other hand, examine each consecutive unlabeled instance one at a time, but they do not have sufficient information early in the process to make a sound assign-label-vs-ask-teacher decision. This approach is likely to require more teacher effort in supplying labels.
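The assign-label-vs-ask-teacher decision in stream-based sampling might look like the sketch below. It assumes a binary task, a hypothetical `predict_proba` callable returning the probability of the positive class, and a fixed confidence cut-off `min_confidence`; the threshold value and all names are illustrative, not taken from the source.

```python
def stream_selective_sampling(stream, predict_proba, oracle, min_confidence=0.8):
    """Process instances one at a time; query the teacher only when unsure."""
    decisions = []
    for x in stream:
        p = predict_proba(x)              # probability of the positive class
        confidence = max(p, 1.0 - p)
        if confidence < min_confidence:
            # Not confident enough: ask the teacher for the label.
            decisions.append((x, oracle(x), "queried"))
        else:
            # Confident: assign the model's own label.
            decisions.append((x, int(p >= 0.5), "auto"))
    return decisions

# Toy stand-ins (assumptions) for the model and the teacher.
predict_proba = lambda x: min(max(x, 0.0), 1.0)
oracle = lambda x: int(x > 0.5)

decisions = stream_selective_sampling([0.1, 0.45, 0.9, 0.6], predict_proba, oracle)
sources = [source for _, _, source in decisions]
labels = [label for _, label, _ in decisions]
```

A fuller version would also retrain or update the model as queried labels arrive, which this sketch leaves out.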
Efficiency
Active learning is a process that can be quite lengthy, with multiple cycles of labeling and model training. This is why it's essential to make these iterations as efficient as possible.
We want to minimize waiting downtime, and the only component of the active learning cycle that requires manual intervention is the actual annotation. All model training and data selection processes should be automated, for example, as DAG-based workflows.
Time spent waiting for annotators should be minimized as well, and most data annotation services are set up to process large datasets as a single entity, not for fast turnaround times on smaller sequential batches.
The size of the labeling batches and the total number of iterations are important hyperparameters to consider. Larger batches mean fewer iterations, which reduces the overall time requirements of active learning, although the model then gets fewer chances to update its selection between batches.
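A rough back-of-the-envelope estimate makes the trade-off concrete. The sketch assumes, purely for illustration, that each iteration costs a fixed training time plus a fixed annotation turnaround regardless of batch size; `estimate_schedule` and its parameters are hypothetical.

```python
import math

def estimate_schedule(label_budget, batch_size, train_hours, annotation_hours):
    """Return (iterations, total wall-clock hours) for a given batch size."""
    iterations = math.ceil(label_budget / batch_size)
    # Assumption: per-iteration cost does not depend on the batch size.
    total_hours = iterations * (train_hours + annotation_hours)
    return iterations, total_hours

small_batches = estimate_schedule(10_000, 500, train_hours=2.0, annotation_hours=24.0)
large_batches = estimate_schedule(10_000, 2_000, train_hours=2.0, annotation_hours=24.0)
```

Under these assumptions, batches of 500 need 20 iterations and batches of 2,000 need 5, so the larger batches cut the wall-clock time to a quarter.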
Methods and Techniques
Active learning methods aim to select the most informative examples for labeling, and one way to do this is by using an acquisition function that scores each example based on its predicted usefulness for training.
The acquisition function can be thought of as a function that takes an unlabeled example and outputs its score, with the goal of selecting the top k-scoring samples for annotation.
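Selecting the top k-scoring samples can be sketched as follows. The pool format (dicts with an `id` and a `p_positive` field) and the scoring function `toy_uncertainty` are assumptions made for the example; any function that maps an unlabeled example to a number would fit.

```python
import heapq

def select_top_k(unlabeled, acquisition_fn, k):
    """Return the k unlabeled examples with the highest acquisition scores."""
    return heapq.nlargest(k, unlabeled, key=acquisition_fn)

def toy_uncertainty(example):
    """Illustrative acquisition function: 1 minus the confidence of a binary model."""
    p = example["p_positive"]
    return 1.0 - max(p, 1.0 - p)

pool = [
    {"id": "a", "p_positive": 0.95},
    {"id": "b", "p_positive": 0.52},
    {"id": "c", "p_positive": 0.30},
    {"id": "d", "p_positive": 0.45},
    {"id": "e", "p_positive": 0.10},
]
to_annotate = select_top_k(pool, toy_uncertainty, k=2)
selected_ids = [example["id"] for example in to_annotate]
```

Here the two examples whose predictions sit closest to 0.5, `b` and `d`, are the ones sent for annotation.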
Choosing the right acquisition function is crucial, as it will determine which examples are labeled next.
One approach is to use the query technique, which involves selecting cases from the unlabeled pool where the model is unsure or has low confidence.
This can include picking cases with the highest level of uncertainty, cases close to the decision boundary, or cases where models in an ensemble disagree.
Here are some common query techniques:
- Uncertainty-based sampling: Select cases where the model is most uncertain.
- Entropy-based sampling: Select cases whose predicted class distribution has the highest entropy.
- Ensemble-based sampling: Select cases where models in an ensemble disagree.
Active learning using sampling techniques involves labeling a subsample of data, training a model, and then selecting a new subsample based on the model's outputs. This process is repeated until the model approaches desired levels of performance.
Methods
Active learning methods fall into several families, each with its own strengths and weaknesses. One key family is uncertainty-based querying, which selects cases from the unlabeled pool where the model is unsure or has low confidence.
To choose the most informative examples, we can use acquisition functions, such as the least confident score, margin score, entropy, and BALD. These functions take an unlabeled example and output its score, indicating how useful it is for training the model.
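The first three of these scores can be written directly from a model's predicted class probabilities. The sketch below assumes `probs` is a list of probabilities summing to 1 and uses the convention that a higher score means a more useful example. BALD is left out because it needs several stochastic predictions per example.

```python
import math

def least_confident(probs):
    """1 minus the probability of the most likely class."""
    return 1.0 - max(probs)

def margin_score(probs):
    """1 minus the gap between the two most likely classes (small gap => high score)."""
    top, second = sorted(probs, reverse=True)[:2]
    return 1.0 - (top - second)

def entropy(probs):
    """Shannon entropy of the predicted distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = [0.90, 0.05, 0.05]
unsure = [0.40, 0.35, 0.25]
scores = {
    name: (fn(confident), fn(unsure))
    for name, fn in [("least_confident", least_confident),
                     ("margin", margin_score),
                     ("entropy", entropy)]
}
```

All three functions rank the `unsure` prediction above the `confident` one, but they can disagree on closer cases because they look at different parts of the distribution.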
In some cases, we may want to select examples based on their diversity, rather than uncertainty alone. This can be achieved with diversity-based methods such as Core Sets, or with hybrid acquisition functions such as BatchBALD and BADGE. The hybrid functions still prioritize uncertainty, but choose examples in a batch-aware manner to avoid issues like mode collapse within batches.
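Diversity-driven selection can be illustrated with a greedy k-center rule, which is one common way to build a core-set style batch. This is a simplified sketch, not the full published method: points are plain coordinate tuples, distances are Euclidean, and `k_center_greedy` is a hypothetical helper name.

```python
import math

def k_center_greedy(points, labeled_idx, k):
    """Greedily pick k points, each the farthest from everything selected so far."""
    # Distance from every point to its nearest already-labeled point.
    min_dist = [min(math.dist(p, points[j]) for j in labeled_idx) for p in points]
    chosen = []
    for _ in range(k):
        idx = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(idx)
        # The new pick now "covers" its neighbours, so shrink their distances.
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[idx]))
    return chosen

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
chosen = k_center_greedy(points, labeled_idx=[0], k=2)
```

With point 0 already labeled, the rule picks points 4 and 2. The near-duplicate of point 2 (point 3) is skipped, which is the batch-level redundancy that purely uncertainty-based selection does not guard against.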
Another technique is Committee based strategies, which involves building several models and sampling from their predictions. This can be done using voting or variance-based methods, or even based on the disagreement between the models.
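Disagreement between committee members is often measured with vote entropy. The sketch below assumes each committee member casts one hard label per example; the example IDs and class names are made up for illustration.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the committee's label votes; 0 means full agreement."""
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

committee_votes = {
    "x1": ["cat", "cat", "cat"],
    "x2": ["cat", "dog", "bird"],
    "x3": ["cat", "cat", "dog"],
}
most_disputed = max(committee_votes, key=lambda i: vote_entropy(committee_votes[i]))
```

The example on which every member votes differently (`x2`) gets the highest score and would be sent for labeling first.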
Here are some popular acquisition functions:
- Least Confident Score
- Margin Score
- Entropy
- BALD
- Core Sets
- BatchBALD
- BADGE
- Variational Adversarial Active Learning (VAAL)
- Wasserstein Adversarial Active Learning (WAAL)
These acquisition functions can be used in various active learning pipelines, such as the semi-automatic pipeline, which involves manual intervention at key stages of the process.
Testing and Debugging
Testing and debugging are crucial steps in the active learning process. You should thoroughly check your active learning workflow and training data platform before large-scale implementation.
Verify the outputs of each component to have confidence in your system. Monitoring is also essential for inspecting a model's performance at each iteration.
You need to monitor all of your usual model performance metrics after each training iteration. This includes metrics such as improvement in model performance from increasing the training set size.
If performance stops improving between iterations, the usual culprits relate to the labeling batch size, the hyperparameter settings, saturation of the training dataset, or the data selection process itself.
Record the difference in acquisition function scores and rankings after the training step in each iteration. You're looking for scores for already labeled examples to be much lower than unlabeled examples on average.
If scores for already labeled examples are similar to unlabeled examples, you might be selecting data points similar to already-labeled examples. This could indicate a problem with your model or the data selection process.
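This check can be automated by comparing average acquisition scores for labeled and unlabeled examples after each training step. The sketch assumes scores are kept in a dict keyed by example ID; `score_gap` is a hypothetical helper.

```python
from statistics import mean

def score_gap(scores, labeled_ids):
    """Mean acquisition score of unlabeled examples minus that of labeled ones."""
    labeled = [s for i, s in scores.items() if i in labeled_ids]
    unlabeled = [s for i, s in scores.items() if i not in labeled_ids]
    return mean(unlabeled) - mean(labeled)

scores = {"a": 0.05, "b": 0.10, "c": 0.60, "d": 0.70}
gap = score_gap(scores, labeled_ids={"a", "b"})
```

A clearly positive gap is the healthy case. A gap near zero suggests the selection step is picking points much like those already labeled.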
If model performance is not improving, check for the following issues:
- Your labeling batch size is too small to make a noticeable difference.
- Your hyperparameter settings no longer suit your model due to the increased size of your training dataset.
- Your training dataset is already huge and you've saturated performance gains from more data.
- There is an issue/bug in your data selection process causing you to select bad examples.
Does the set of highest-ranked data points change after training your model on newly-labeled samples? If not, updating the model is not changing the data selection.
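One way to answer this question numerically is to measure the overlap between the top-k sets before and after retraining, computed over the examples that are still unlabeled. The Jaccard overlap used here is an illustrative choice, and the helper names are hypothetical.

```python
def top_k_ids(scores, k):
    """IDs of the k highest-scoring examples."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def top_k_overlap(before, after, k):
    """Jaccard overlap of the top-k sets; 1.0 means the ranking did not change."""
    a, b = top_k_ids(before, k), top_k_ids(after, k)
    return len(a & b) / len(a | b)

before = {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.1}
after = {"a": 0.3, "b": 0.85, "c": 0.9, "d": 0.1}
overlap = top_k_overlap(before, after, k=2)
unchanged = top_k_overlap(before, before, k=2)
```

An overlap that stays at or near 1.0 across iterations indicates that retraining is not influencing which points get selected.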
Query Techniques
Query Techniques are a crucial part of Active Learning, and there are several strategies to choose from. Active Thompson Sampling (ATS) is a sequential algorithm that assigns a sampling distribution on the pool, samples one point from this distribution, and queries the oracle for this sample point label.
Query techniques can be broadly categorized into two main types: informativeness and representativeness. Informativeness-based query strategies assign an informative measure to each unlabeled instance individually, and the instance(s) with the highest measure will be selected. Representativeness-based strategies instead favor instances that reflect the overall structure of the unlabeled data.
Some common query techniques include Query-by-Committee, Margin Sampling, Uncertainty Sampling, and Diversity Sampling. These techniques direct the choice of cases for labeling according to variables like model disagreement, decision boundaries, and uncertainty.
Query synthesis is another approach, used when we have a very small dataset. This method can choose any uncertain point in the given n-dimensional space, whether or not that point actually exists in the dataset.
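One simple way to synthesize such points is to bisect the segment between a known positive and a known negative example, querying each synthetic midpoint. This is a sketch of that idea under stated assumptions, not the specific method the text has in mind: the task is binary, `oracle` is a hypothetical labeling function, and the midpoints need not correspond to any real data point.

```python
import math

def synthesize_boundary_queries(positive, negative, oracle, steps=8):
    """Bisect between opposite-labeled points, querying each synthetic midpoint."""
    queries = []
    for _ in range(steps):
        midpoint = tuple((p + n) / 2 for p, n in zip(positive, negative))
        label = oracle(midpoint)          # the point is generated, not drawn from data
        queries.append((midpoint, label))
        if label == 1:
            positive = midpoint
        else:
            negative = midpoint
    return positive, negative, queries

# Hypothetical ground truth: positive when the coordinates sum to more than 1.
oracle = lambda x: int(x[0] + x[1] > 1.0)

pos, neg, queries = synthesize_boundary_queries((1.0, 1.0), (0.0, 0.0), oracle)
remaining_gap = math.dist(pos, neg)
```

Each query halves the distance between the two points, so a handful of labels is enough to locate the boundary along that segment.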
Here are some common query strategies:
- Query-by-Committee: a variety of models are trained on the current labeled data, and vote on the output for unlabeled data; label those points for which the "committee" disagrees the most
- Margin Sampling: instances close to the classification boundary are more ambiguous, and getting those instances labeled will provide the most information to the learning model
- Uncertainty Sampling: label those points for which the current model is least certain as to what the correct output should be
- Diversity Sampling: selects instances from non-overlapping or minimally overlapping partitions for labeling
These query strategies can be used to select the most informative instances for labeling, and can be combined to create hybrid strategies that take into account both informativeness and representativeness.
Frequently Asked Questions
What is the difference between active learning and supervised learning?
Active learning differs from conventional (passive) supervised learning in that the model actively selects which data to have labeled for training, whereas passive supervised learning relies on a pre-defined set of labeled data. This shift enables more efficient and effective learning.