Multiclass Classification with Hugging Face: Computing Evaluation Metrics

Multiclass classification is a type of problem where we have more than two classes to predict. This is a common challenge in many real-world applications, such as image classification, sentiment analysis, and medical diagnosis.

For multiclass classification, the Hugging Face libraries can be used to compute evaluation metrics. They provide a wide range of pre-trained models and a simple interface for computing metrics.

One of the key metrics used in multiclass classification is the accuracy score, calculated as the number of correct predictions divided by the total number of predictions.

Metric Selection

Choosing the right metrics for multiclass classification tasks is crucial for evaluating model performance. F1 score, precision, and recall are common metrics used for multiclass classification.

For example, the F1 score can be calculated as the harmonic mean of precision and recall, giving equal weight to both. This is useful for imbalanced datasets where one class has a significantly larger number of instances.

In the context of Hugging Face's Transformers library, metrics such as accuracy and F1 score are available for multiclass classification tasks. Mean squared error is available as well, but it applies to regression rather than classification.

AUROC

When evaluating a machine learning model, it's essential to use metrics that reflect how well it actually predicts. The AUROC, or Area Under the Receiver Operating Characteristic curve, is one such metric.

The AUROC is computed from prediction scores, which are the probabilities estimated by the model. These scores are compared against the ground-truth labels to produce the AUROC.

To calculate the AUROC, you provide the estimated probabilities, the ground-truth labels, and optionally sample weights. The label of the positive class is given by pos_label.

Here's a brief overview of the required inputs:

  • pred: Estimated probabilities
  • target: Ground-truth labels
  • sample_weight: Optional sample weights
  • pos_label: The label for the positive class
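
As a rough illustration of what these inputs mean, the sketch below computes a binary AUROC in plain Python as the weighted fraction of (positive, negative) pairs that the scores rank correctly. The name auroc_sketch is hypothetical and is not a library function; it only mirrors the parameter names listed above.

```python
def auroc_sketch(pred, target, sample_weight=None, pos_label=1):
    """Binary AUROC as the weighted share of correctly ranked (pos, neg) pairs.

    Hypothetical helper for illustration; quadratic in the number of samples.
    """
    if sample_weight is None:
        sample_weight = [1.0] * len(pred)
    positives = [(p, w) for p, t, w in zip(pred, target, sample_weight) if t == pos_label]
    negatives = [(p, w) for p, t, w in zip(pred, target, sample_weight) if t != pos_label]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative sample")

    ranked, total = 0.0, 0.0
    for pos_score, pos_w in positives:
        for neg_score, neg_w in negatives:
            pair_w = pos_w * neg_w
            total += pair_w
            if pos_score > neg_score:
                ranked += pair_w          # positive ranked above negative
            elif pos_score == neg_score:
                ranked += 0.5 * pair_w    # ties count as half
    return ranked / total


score = auroc_sketch(pred=[0.1, 0.4, 0.35, 0.8], target=[0, 0, 1, 1])
print(score)  # 0.75
```

Three of the four positive/negative pairs are ordered correctly here, which gives 0.75.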

The AUROC is often reported alongside precision and recall, which are computed once the prediction scores have been thresholded into class labels. Precision and recall are useful for evaluating the model's performance on specific tasks.

Stat Scores

Stat scores are a crucial part of evaluating the performance of a model, allowing you to understand the nuances of its predictions.

The stat_scores function calculates the number of true positives, false positives, true negatives, and false negatives for a specific class, providing a detailed breakdown of the model's performance.

To use stat_scores, you'll need to provide a prediction tensor, a target tensor, a class index, and an argmax dimension.

Here's a summary of the required inputs:

  • pred (Tensor) - prediction tensor
  • target (Tensor) - target tensor
  • class_index (int) - class to calculate over
  • argmax_dim (int) - if pred is a tensor of probabilities, this indicates the axis the argmax transformation will be applied over

The stat_scores_multiple_classes function is a variation of stat_scores that calculates the number of true positives, false positives, true negatives, and false negatives for each class.

This function requires the same inputs as stat_scores, but also accepts an optional num_classes parameter if the number of classes is known.
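
The counting described above can be sketched in plain Python on lists of class labels. The helper names below are hypothetical stand-ins, not the library functions themselves, and they assume the predictions have already been converted to labels (for example by an argmax over the class axis).

```python
def stat_scores_sketch(pred, target, class_index):
    """Return (tp, fp, tn, fn) for one class, treating it as the positive class."""
    tp = fp = tn = fn = 0
    for p, t in zip(pred, target):
        if p == class_index and t == class_index:
            tp += 1
        elif p == class_index:
            fp += 1
        elif t == class_index:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def stat_scores_all_classes_sketch(pred, target, num_classes=None):
    """Return one (tp, fp, tn, fn) tuple per class."""
    if num_classes is None:
        # Infer the number of classes from the largest label seen.
        num_classes = max(max(pred), max(target)) + 1
    return [stat_scores_sketch(pred, target, c) for c in range(num_classes)]


pred = [0, 1, 2, 1, 0]
target = [0, 2, 2, 1, 1]
print(stat_scores_sketch(pred, target, class_index=1))  # (1, 1, 2, 1)
print(stat_scores_all_classes_sketch(pred, target))
```

For every class the four counts add up to the number of samples, which is a quick sanity check on the output.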

The stat_scores functions are used in various applications, including information retrieval, where they are used to evaluate the performance of models in retrieving relevant information.

Implement a Metric

When implementing a metric, you have two options: PyTorch metric or Numpy metric. It's recommended to use PyTorch metrics when possible, as they are faster.

You can use TensorMetric to implement native PyTorch metrics. This class handles automated DDP syncing and converts all inputs and outputs to tensors.

NumpyMetric is available for implementing numpy metrics, but it can slow down your training substantially, since every metric computation requires a GPU sync to convert tensors to numpy.

Precision (Sk)

Precision is a key metric to consider in machine learning, and it's computed as the ratio of true positives to the sum of true positives and false positives. This means that a high precision score indicates that the model is good at not labeling negative samples as positive.

The best value for precision is 1, and the worst value is 0. This makes sense, as a perfect model would have a precision score of 1, and a completely inaccurate model would have a precision score of 0.

A precision score of 1 means that the model is 100% accurate in its positive predictions, while a score of 0 means that the model has made no accurate positive predictions.

Average Precision (Sk)

Average precision summarizes the precision-recall curve as a single score: the weighted mean of the precision achieved at each threshold, with the increase in recall from the previous threshold used as the weight.

The average precision score can be calculated using the AveragePrecision function, which takes in three main parameters: pred (predicted labels), target (groundtruth labels), and sample_weight (optional weights per sample).

Here's a breakdown of what each parameter means:

  • pred: This is the tensor containing the predicted labels.
  • target: This is the tensor containing the groundtruth labels.
  • sample_weight: This is an optional parameter that allows you to specify weights for each sample.

By using the AveragePrecision function, you can get a clear understanding of your model's performance and make data-driven decisions to improve it.
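
To make the computation concrete, here is a minimal plain-Python sketch of average precision for a binary problem. It assumes pred holds scores (higher means more likely positive) and that no two scores are tied; average_precision_sketch is a hypothetical name, not the AveragePrecision function itself.

```python
def average_precision_sketch(pred, target, pos_label=1):
    """Mean of the precision values measured at the rank of each positive sample.

    Assumes pred contains scores with no ties.
    """
    n_pos = sum(1 for t in target if t == pos_label)
    if n_pos == 0:
        raise ValueError("average precision needs at least one positive sample")

    # Walk through the samples from highest to lowest score.
    order = sorted(range(len(pred)), key=lambda i: pred[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if target[i] == pos_label:
            hits += 1
            total += hits / rank  # precision at this cut-off
    return total / n_pos


ap = average_precision_sketch(pred=[0.1, 0.4, 0.35, 0.8], target=[0, 0, 1, 1])
print(round(ap, 4))  # 0.8333
```

The two positives sit at ranks 1 and 3, so the score is the mean of 1/1 and 2/3.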

Multiclass Metrics

Multiclass metrics are a crucial aspect of evaluating the performance of multiclass models. They provide a way to measure the accuracy, precision, and recall of a model across multiple classes. A key metric is the F1-score, which is the harmonic mean of precision and recall.

The F1-score can be computed using the `f1_score` function, which takes in estimated probabilities, ground-truth labels, and the number of classes as input. It can also be reduced using various methods, such as taking the mean or sum of the F1-scores across labels. The `f1_score` function returns a tensor containing the F1-score for each class.

To calculate the number of true positives, false positives, true negatives, and false negatives for a specific class, you can use the `stat_scores` function, which takes a prediction tensor, a target tensor, a class index, and an argmax dimension. To get these counts for every class at once, use `stat_scores_multiple_classes`.

Here are some common multiclass metrics:

  • F1-score: the harmonic mean of precision and recall
  • Precision: the fraction of instances predicted as a class that actually belong to it
  • Recall: the fraction of instances of a class that the model detects
  • True positives: the number of instances correctly classified as positive
  • False positives: the number of instances incorrectly classified as positive
  • True negatives: the number of instances correctly classified as negative
  • False negatives: the number of instances incorrectly classified as negative
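
The definitions in this list can be tied together in a short plain-Python sketch that computes precision, recall, and F1 per class and then averages the F1 scores across classes (macro averaging). The function names are hypothetical, and returning 0.0 when a denominator is zero is an assumption made for this example.

```python
def per_class_prf(pred, target, num_classes):
    """Return a (precision, recall, f1) tuple for each class."""
    results = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results.append((precision, recall, f1))
    return results


def macro_f1(pred, target, num_classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_prf(pred, target, num_classes)
    return sum(f1 for _, _, f1 in scores) / num_classes


target = [0, 1, 2, 0, 1, 2]
pred = [0, 2, 1, 0, 0, 1]
print(per_class_prf(pred, target, num_classes=3))
print(round(macro_f1(pred, target, num_classes=3), 4))  # 0.2667
```

Only class 0 is ever predicted correctly in this example, so classes 1 and 2 contribute an F1 of zero and pull the macro average down.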

Dice Score

The Dice Score is a popular metric used to evaluate the accuracy of multiclass classification models. It's a measure of similarity between two sets of data, in this case, predicted labels and ground truth labels.

The Dice Coefficient metric can optionally compute the Dice for the background class as well. This is a useful feature if you're working with images that have a background class.

To compute the Dice Score, you'll need to provide the predicted probability for each label and the ground truth labels. The predicted probability is typically a tensor, and the ground truth labels should also be a tensor.

The Dice Score can be computed with or without the background class. If you include the background class, its Dice is computed alongside the foreground classes.

You can also specify a reduction method to apply to the Dice Score. The available reduction methods are:

  • elementwise_mean: takes the mean of the scores
  • none: returns the scores unreduced
  • sum: adds the scores together

If a NaN occurs during computation, you can specify a score to return. This is useful if you're working with noisy or incomplete data.

If no foreground pixel is found in the target, you can also specify a score to return. This is useful if you're working with images that have a very small foreground region.

The reduction method, nan score, and no fg score are all optional parameters that can be specified when computing the Dice Score.
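
A minimal sketch of how these options could fit together is shown below. It works on plain lists of class labels rather than tensors of probabilities, assumes class 0 is the background, and uses the per-class formula 2·TP / (2·TP + FP + FN). The function name and the parameter names bg, nan_score, and no_fg_score are illustrative choices based on the description above, not a confirmed library signature.

```python
def dice_score_sketch(pred, target, num_classes, bg=False,
                      nan_score=0.0, no_fg_score=0.0,
                      reduction="elementwise_mean"):
    """Per-class Dice on label lists, reduced as requested.

    Assumes class 0 is the background; it is skipped unless bg is True.
    """
    first_class = 0 if bg else 1
    scores = []
    for c in range(first_class, num_classes):
        if not any(t == c for t in target):
            scores.append(no_fg_score)  # class never appears in the target
            continue
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else nan_score)

    if not scores:
        raise ValueError("no classes left to score")
    if reduction == "elementwise_mean":
        return sum(scores) / len(scores)
    if reduction == "sum":
        return sum(scores)
    if reduction == "none":
        return scores
    raise ValueError(f"unknown reduction: {reduction}")


pred = [0, 1, 1, 2]
target = [0, 1, 2, 2]
print(dice_score_sketch(pred, target, num_classes=3))           # foreground only
print(dice_score_sketch(pred, target, num_classes=3, bg=True))  # with background
```

Including the background raises the score in this example because the background class is predicted perfectly.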

Multiclass ROC

The Multiclass ROC is a powerful metric that helps us evaluate the performance of multiclass predictors. It's a Receiver Operating Characteristic (ROC) curve for each class, which plots the true-positive rate (tpr) against the false-positive rate (fpr) at different thresholds.

To compute the Multiclass ROC, we need to provide the predicted probabilities for each label, the groundtruth labels, and optional sample weights. This is done using the multiclass_roc function, which returns a tuple consisting of one tuple per class, holding the false positive rate, true positive rate, and thresholds.

The Multiclass ROC is a great way to visualize the performance of our model, and it's especially useful when we have multiple classes. By plotting the ROC curve for each class, we can see how well our model is performing on each class individually.

Here's a summary of the Multiclass ROC:

  • Input: predicted probabilities, groundtruth labels, optional sample weights
  • Output: tuple of tuples, each containing false positive rate, true positive rate, and thresholds for each class
  • Use: to evaluate the performance of multiclass predictors and visualize the ROC curve for each class
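
The input and output shapes summarized above can be illustrated with a one-vs-rest sketch in plain Python. The names binary_roc_sketch and multiclass_roc_sketch are hypothetical; the sketch ignores sample weights and does not add the conventional (0, 0) starting point to each curve.

```python
def binary_roc_sketch(scores, positives):
    """Return (fpr, tpr, thresholds) with one point per distinct score."""
    thresholds = sorted(set(scores), reverse=True)
    n_pos = sum(1 for p in positives if p)
    n_neg = len(positives) - n_pos
    fpr, tpr = [], []
    for threshold in thresholds:
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for s, p in zip(scores, positives) if s >= threshold and p)
        fp = sum(1 for s, p in zip(scores, positives) if s >= threshold and not p)
        tpr.append(tp / n_pos if n_pos else 0.0)
        fpr.append(fp / n_neg if n_neg else 0.0)
    return fpr, tpr, thresholds


def multiclass_roc_sketch(probs, target):
    """One-vs-rest ROC: a tuple holding one (fpr, tpr, thresholds) tuple per class."""
    num_classes = len(probs[0])
    curves = []
    for c in range(num_classes):
        class_scores = [row[c] for row in probs]
        is_positive = [t == c for t in target]
        curves.append(binary_roc_sketch(class_scores, is_positive))
    return tuple(curves)


probs = [[0.7, 0.2, 0.1],
         [0.1, 0.6, 0.3],
         [0.2, 0.3, 0.5],
         [0.5, 0.4, 0.1]]
target = [0, 1, 2, 1]
curves = multiclass_roc_sketch(probs, target)
fpr0, tpr0, thresholds0 = curves[0]
print(thresholds0)  # [0.7, 0.5, 0.2, 0.1]
print(tpr0)         # [1.0, 1.0, 1.0, 1.0]
```

Class 0 is separated perfectly here: the single positive has the highest score, so the true-positive rate is already 1.0 at the first threshold while the false-positive rate is still 0.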

Evaluation Metrics

Evaluation metrics are a crucial aspect of model evaluation, allowing you to assess how well your model is performing on unseen data. You can evaluate your model using metrics such as AUROC, which computes the area under the curve of the receiver operating characteristic (ROC).

The AUC_ROC score is a key metric to track during model training. Monitoring it shows whether your model is improving over time. Adjusting the learning rate or loss scaling, or using auto-loss scaling, can potentially improve the validation performance of your model.

For multiclass classification problems, metrics like multiclass_roc (F) come into play. This metric computes the Receiver Operating Characteristic (ROC) for multiclass predictors, taking into account the estimated probabilities and ground-truth labels.

Functional metrics such as accuracy (F) also accept a reduction method; the available methods are listed under each metric.

AUROC

AUROC is a metric that computes the area under the receiver operating characteristic (ROC) curve. It summarizes how well a model ranks positive examples above negative ones across all thresholds.

The tutorial reports its validation AUROC after 3 epochs of training, and it's a crucial metric to keep an eye on when evaluating your model's performance.

For multiclass problems, the ROC curves themselves come from MulticlassROC, which takes predicted probabilities and ground-truth labels as input and returns a tuple consisting of one tuple per class, holding the false positive rate, true positive rate, and thresholds. The AUROC for a class is the area under its curve.

MulticlassROC expects the predicted probability for each label (pred), the ground-truth labels (target), and optional sample weights (sample_weight).

Accuracy

Accuracy is a crucial metric in evaluating the performance of a model. It's calculated using the predicted labels and ground truth labels, which are represented as tensors.

To compute accuracy, you'll need to provide the predicted labels, target labels, and optionally, the number of classes. The number of classes is not always required, but it can be useful in certain situations.

The predicted labels and target labels are the core components of accuracy calculation. They're used to determine how well the model has performed on a given dataset.

The accuracy can be reduced using various methods. Taking the mean is the default, and other reduction methods are available if needed.

By understanding accuracy and its calculation, you can better evaluate the performance of your model and make informed decisions about its improvement.
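
As a small illustration, the sketch below turns per-class probabilities into predicted labels with an argmax and then computes accuracy as correct predictions divided by total predictions. Both helper names are hypothetical and work on plain Python lists rather than tensors.

```python
def argmax_labels(probs):
    """Pick the index of the highest probability in each row."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in probs]


def accuracy_sketch(pred, target):
    """Fraction of predictions that match the ground-truth labels."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and the same length")
    correct = sum(1 for p, t in zip(pred, target) if p == t)
    return correct / len(target)


probs = [[0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6],
         [0.2, 0.5, 0.3],
         [0.3, 0.4, 0.3]]
target = [0, 1, 1, 1]
pred = argmax_labels(probs)
print(pred)                           # [0, 2, 1, 1]
print(accuracy_sketch(pred, target))  # 0.75
```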

RMSE

The root mean squared error (RMSE) is a key metric in evaluating model performance. It is the square root of the average squared difference between predicted and actual values.

RMSE is calculated using the formula: RMSE = sqrt(MSE), where MSE is the mean squared error. In the context of machine learning, RMSE is often used to evaluate regression models.

RMSE can be computed using the rmse function, which takes in three inputs: predicted labels (pred), ground truth labels (target), and a reduction method (reduction). The reduction method determines how to aggregate the RMSE values across different samples.

The available reduction methods are listed below. The default is mean.

  • mean: takes the mean of the RMSE values across different samples
  • sum: sums up the RMSE values across different samples
  • none: does not aggregate the RMSE values

RMSE is a widely used metric in machine learning, and is often used in conjunction with other metrics such as mean absolute error (MAE) and mean squared error (MSE).
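
The relationship RMSE = sqrt(MSE) and the three reduction methods can be sketched in plain Python as follows. This sketch assumes the reduction is applied to the squared errors before the square root is taken; mse_sketch and rmse_sketch are hypothetical names, not the library's rmse and mse functions.

```python
import math


def mse_sketch(pred, target, reduction="mean"):
    """Squared errors, reduced with 'mean', 'sum', or left as a list with 'none'."""
    squared = [(p - t) ** 2 for p, t in zip(pred, target)]
    if reduction == "mean":
        return sum(squared) / len(squared)
    if reduction == "sum":
        return sum(squared)
    if reduction == "none":
        return squared
    raise ValueError(f"unknown reduction: {reduction}")


def rmse_sketch(pred, target, reduction="mean"):
    """Square root of the (reduced) squared errors."""
    value = mse_sketch(pred, target, reduction)
    if reduction == "none":
        return [math.sqrt(v) for v in value]
    return math.sqrt(value)


pred = [0.0, 1.0, 2.0, 3.0]
target = [0.0, 1.0, 2.0, 2.0]
print(mse_sketch(pred, target))   # 0.25
print(rmse_sketch(pred, target))  # 0.5
```

Only the last prediction is off by one, so the mean squared error is 1/4 and its square root is 0.5.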

Fbeta (Sk)

The Fbeta score is a metric that helps you balance precision and recall in your model's performance. It's particularly useful when you want to favor one over the other.

The beta parameter determines the weight of recall in the combined score. If beta is less than 1, precision gets more weight.

In contrast, if beta is greater than 1, recall gets more weight. In fact, if beta approaches infinity, the Fbeta score only considers recall.
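
The effect of beta is easy to see from the standard formula, F_beta = (1 + beta²) · P · R / (beta² · P + R). The short sketch below uses a hypothetical helper, fbeta_from_pr, to evaluate it for a model whose recall is higher than its precision.

```python
def fbeta_from_pr(precision, recall, beta):
    """F-beta from precision and recall; returns 0.0 when both are zero."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


precision, recall = 0.5, 1.0
for beta in (0.5, 1.0, 2.0):
    print(beta, round(fbeta_from_pr(precision, recall, beta), 4))
```

With precision 0.5 and recall 1.0, the score moves from about 0.56 at beta = 0.5 (closer to precision) through 0.67 at beta = 1 to about 0.83 at beta = 2 (closer to recall).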

Model Evaluation

Model evaluation is a crucial step in the machine learning process. You can evaluate your model's ability to predict labels of unseen data using a validation dataset.

There are several metrics you can use to evaluate your model, including the validation AUC_ROC score. This score shows how well your model is performing after a certain number of epochs.

To improve the accuracy of your model, you can try changing optimisers, learning rate, learning rate schedule, loss scaling or using auto-loss scaling. This can be done by exploring different options and seeing what works best for your specific model and dataset.

You can also write a compute_metrics() style function to evaluate your model during training. This allows you to compute metrics like the Matthews correlation coefficient or F1 score at each evaluation step.

By evaluating your model during training, you can catch any issues early on and make adjustments before it's too late. This can save you a lot of time and effort in the long run.
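
A compute_metrics() style function might look like the sketch below. It assumes the function receives a pair of (logits, labels), which is how the Hugging Face Trainer hands predictions to the function passed as its compute_metrics argument, and it is written without third-party dependencies so it can be read on its own. The metric names in the returned dictionary are arbitrary choices for this example.

```python
import math
from collections import Counter


def compute_metrics(eval_pred):
    """Return accuracy and the multiclass Matthews correlation coefficient."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    preds = [max(range(len(row)), key=lambda k: row[k]) for row in logits]
    labels = [int(t) for t in labels]

    total = len(labels)
    correct = sum(1 for p, t in zip(preds, labels) if p == t)
    pred_counts = Counter(preds)   # how often each class was predicted
    true_counts = Counter(labels)  # how often each class truly occurs
    classes = set(pred_counts) | set(true_counts)

    numerator = correct * total - sum(pred_counts[c] * true_counts[c] for c in classes)
    denominator = math.sqrt(
        (total ** 2 - sum(n ** 2 for n in pred_counts.values()))
        * (total ** 2 - sum(n ** 2 for n in true_counts.values()))
    )
    mcc = numerator / denominator if denominator > 0 else 0.0
    return {"accuracy": correct / total, "matthews_correlation": mcc}


logits = [[2.0, 1.0], [0.1, 0.9], [0.0, 3.0], [0.2, 0.8]]
labels = [0, 0, 1, 1]
metrics = compute_metrics((logits, labels))
print(metrics["accuracy"])                        # 0.75
print(round(metrics["matthews_correlation"], 4))  # 0.5774
```

In practice you would probably compute these values with numpy or an existing metrics library; the point of the sketch is only the shape of the function, which takes predictions and labels in and returns a dictionary of named scores.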

MSE

MSE is a common metric used to evaluate the performance of a model. It measures the average squared difference between predicted and actual values.

The mean squared error (MSE) is computed from the estimated labels and the ground-truth labels. You can compute it using the mse function, which takes three parameters: pred (estimated labels), target (ground-truth labels), and reduction (the method for reducing the MSE).

The reduction parameter is optional and defaults to taking the mean. The available reduction methods are:

  • mean
  • sum
  • none

The MSE function returns a tensor with the calculated mean squared error.

Model Training Evaluation

During model training, it's essential to evaluate its performance to ensure it's on the right track. You can write a compute_metrics() style function to evaluate the model during training, as seen in the example of computing the Matthews correlation coefficient.

Evaluating the model during training allows you to make adjustments before training finishes. To do so, pass a compute_BERT_classifier_matthews_correlation() function as an input parameter (the compute_metrics argument) when instantiating the Trainer class, so that the metric is reported at each evaluation.

Running the evaluation after training is also crucial. After training the model, you can use the validation dataset to evaluate its ability to predict labels of unseen data. This is where you can see the validation AUC_ROC score the tutorial achieves after 3 epochs.

To improve the accuracy of the model, you might need to explore different directions, such as longer training, changing optimizers, learning rate, learning rate schedule, loss scaling, or using auto-loss scaling.

Preprocessing

Preprocessing is a crucial step in preparing your dataset for use with Hugging Face Transformers. You'll need to ensure that the labels are in a form that can be processed by the library.

To do this, you can use the 🤗 Hugging Face Datasets library, which can handle the dataset splits for you. This is especially useful for the Intent Classification dataset, which requires specific preprocessing to work with Hugging Face Transformers.

By using the Datasets library, you can create a DatasetDict object that holds the dataset splits, making it easier to work with the data.

Model Preparation

To prepare a model for metric computation, we need to import it from Hugging Face and define a trainer using the IPUTrainer class. This class takes the same arguments as the Hugging Face Trainer along with an IPUConfig object, which specifies the behaviour for compilation and execution on the IPU.

We're going to use the Graphcore/vit-base-ipu configuration, which can be found at https://huggingface.co/Graphcore. This configuration gives control over all the parameters specific to Graphcore IPUs.

To use this model on the IPU, we need to load the IPU configuration, IPUConfig, and set our training hyperparameters using IPUTrainingArguments. IPUTrainingArguments subclasses the Hugging Face TrainingArguments class, adding parameters specific to the IPU and its execution characteristics.

Classifier Model Definition

In this section, we'll explore how to define a classifier model for our machine learning task.

To define a classifier model, we have two options: a specific BERT classifier model or a more general automodel classifier model. The former is tied to the BERT architecture, while the latter selects the architecture from the checkpoint it is given. Either one can be fine-tuned on a specific task.

The num_labels parameter is crucial when defining a classifier model. It specifies the number of distinct labels in our classification task. For example, in the Multi-Intent dataset, we have 7 distinct labels: 'PlayMusic', 'AddToPlayList', 'RateBook', 'SearchScreeningEvent', 'BookRestaurant', 'GetWeather', and 'SearchCreativeWork'.
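
Since the model works with integer class ids rather than label strings, a label mapping is usually built before training. The sketch below derives num_labels and the two lookup tables from the seven intent names above; the particular id assigned to each label is an arbitrary choice made for this example, and encode_labels is a hypothetical helper.

```python
# The seven intents from the Multi-Intent dataset, in an arbitrary fixed order.
intent_labels = [
    "PlayMusic",
    "AddToPlayList",
    "RateBook",
    "SearchScreeningEvent",
    "BookRestaurant",
    "GetWeather",
    "SearchCreativeWork",
]

label2id = {label: idx for idx, label in enumerate(intent_labels)}
id2label = {idx: label for label, idx in label2id.items()}
num_labels = len(intent_labels)


def encode_labels(labels):
    """Map label strings to the integer ids a classifier head expects."""
    return [label2id[label] for label in labels]


print(num_labels)                                  # 7
print(encode_labels(["GetWeather", "PlayMusic"]))  # [5, 0]
```

Keeping id2label around makes it easy to turn predicted class ids back into readable intent names when inspecting results.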

There are two options for instantiating a BERT-based classifier model, and the choice depends on the specific requirements of our task. If we need a more general entry point that can be pointed at different checkpoints and fine-tuned on a specific task, we can use the automodel classifier model. Otherwise, we can use the specific BERT classifier model.

In the case of the Multi-Intent dataset, we're using the automodel classifier model with a BERT checkpoint. The num_labels parameter is set to 7, which corresponds to the number of distinct labels in our classification task.

Preparing the Model

To prepare a model, you'll need to import it from Hugging Face, specifically the ViT model in this case. Importing the model is the first step in preparing it for use.

The IPUTrainer class is used to train the model on the IPU, and it takes the same arguments as the Hugging Face Trainer, along with an IPUConfig object. This object specifies the behavior for compilation and execution on the IPU.

Existing IPU configs can be found on the Hugging Face website, and in this case, we're using the Graphcore/vit-base-ipu configuration.

To set our training hyperparameters, we use the IPUTrainingArguments class, which subclasses the Hugging Face TrainingArguments class and adds parameters specific to the IPU and its execution characteristics.

To train the model, we define a trainer using the IPUTrainer class, which takes care of compiling the model to run on IPUs, and of performing training and evaluation. This class works similarly to the Hugging Face Trainer class, but with an additional ipu_config argument.

The IPUTrainer class is a powerful tool for training models on the IPU, and with the right configuration, it can help us achieve great results.

Sklearn Interface

Lightning supports the sklearn metrics module as a backend for calculating metrics. This is a great option if you want to use well-tested and robust metrics, but be aware that it may slow down your computations due to the conversion between PyTorch and NumPy.

To use the sklearn backend of metrics, simply import it. Each converted sklearn metric comes with the same interface as its original counterpart, so you can expect a seamless experience.

For example, accuracy takes the additional normalize keyword. Like the native Lightning metrics, these converted sklearn metrics also come with built-in distributed (DDP) support.

Keith Marchal

Senior Writer

Keith Marchal is a passionate writer who has been sharing his thoughts and experiences on his personal blog for more than a decade. He is known for his engaging storytelling style and insightful commentary on a wide range of topics, including travel, food, technology, and culture. With a keen eye for detail and a deep appreciation for the power of words, Keith's writing has captivated readers all around the world.
