Training Hugging Face Models with the Trainer API: A Tutorial

Posted Nov 11, 2024


Training with Hugging Face's Trainer API is a breeze once you understand the basics. Start by installing the library with `pip install transformers` in your terminal.

The Trainer API is a powerful tool for training models, and it's designed to work seamlessly with Hugging Face's pre-trained models. You can access these models through the Transformers library, which you've already installed.

To begin training a model, you'll need to import the Trainer class from the transformers library with `from transformers import Trainer` and create an instance of it. In practice, the Trainer needs at least a model and a TrainingArguments object to do anything useful.
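Here's a minimal sketch of what that instantiation can look like; the checkpoint name and output directory are placeholders, not anything prescribed by the library:

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Placeholder checkpoint -- substitute the model you actually want to fine-tune.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
args = TrainingArguments(output_dir="./results")

trainer = Trainer(model=model, args=args)  # datasets, collators, etc. come later
```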

Preparing the Data

Preparing the data involves several key steps. We need to transform the text labels into N-hot encoded arrays so they are compatible with the Hugging Face datasets library, because each image can carry more than one label.

To achieve this, we first identify the unique labels in the dataset, then transform each image's labels into an N-hot encoded array: a list of booleans with one entry per unique label. This multi-label representation is what the model will learn to predict for each image.

For our dataset, we have chosen to use a metadata.jsonl file to store the image file names and their associated labels. This approach is suitable because the images in our dataset can have multiple labels.


Preparing the Labels


Preparing the labels is a crucial step in training a model to predict disease from images. We're going to train the Graphcore Optimum ViT model to predict the disease defined by "Finding Label".

The "Finding Label" can be any one of 14 diseases or a "No Finding" label. This indicates that no disease was detected in the image.

We need to transform the text labels into N-hot encoded arrays to make them compatible with the Hugging Face datasets library. N-hot encoded arrays represent the labels as a list of booleans: true if the label corresponds to the image and false if not.

To do this, we first identify the unique labels in the dataset. This is a straightforward process that helps us understand what labels we're working with.

We then transform the labels into N-hot encoded arrays. This is a necessary step to prepare the labels for training the model.

The images in this dataset can have multiple labels, so we've chosen to use a metadata.jsonl file to store the image file names and their associated labels. This is a convenient way to store the metadata for our dataset.
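As an illustrative sketch of this encoding, assuming the metadata lives in a CSV with an image file-name column and a column of "|"-separated findings (the file and column names here are assumptions, not something fixed by the tutorial):

```python
import json
import pandas as pd

# Assumed layout: one row per image, findings separated by "|"
# (e.g. "Cardiomegaly|Effusion" or "No Finding").
df = pd.read_csv("Data_Entry_2017.csv")  # hypothetical metadata file

unique_labels = sorted({l for row in df["Finding Labels"] for l in row.split("|")})

def n_hot(findings: str) -> list:
    """Return one boolean per unique label: True if the finding applies to the image."""
    present = set(findings.split("|"))
    return [label in present for label in unique_labels]

# Write one JSON line per image for the datasets library to pick up.
with open("metadata.jsonl", "w") as f:
    for _, row in df.iterrows():
        record = {"file_name": row["Image Index"], "labels": n_hot(row["Finding Labels"])}
        f.write(json.dumps(record) + "\n")
```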


Create the Dataset


To create the dataset, we load the images with the Hugging Face Datasets library and split them into training and validation sets. This step converts the dataset to the Arrow file format, which allows data to be loaded quickly during training and validation.

Because we're fine-tuning a pre-trained model, the new dataset must have the same properties as the original dataset used for pre-training; these are provided in a preprocessing config file loaded with AutoFeatureExtractor. The X-ray images are resized to the correct resolution (224x224), converted from grayscale to RGB, and normalized across the RGB channels with a mean of (0.5, 0.5, 0.5) and a standard deviation of (0.5, 0.5, 0.5).

To fine-tune the model efficiently, images need to be batched. We'll define a batch_sampler function which returns batches of images and labels in a dictionary.
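A rough sketch of this preprocessing, assuming the images were loaded into "image" and "labels" columns and that we're fine-tuning a ViT checkpoint (the checkpoint name and column names are assumptions):

```python
import torch
from transformers import AutoFeatureExtractor

# Assumed checkpoint -- use the one whose preprocessing config matches your pre-trained model.
feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")

def batch_sampler(examples):
    """Turn a batch of grayscale X-rays into model-ready tensors."""
    images = [img.convert("RGB") for img in examples["image"]]        # grayscale -> RGB
    pixel_values = feature_extractor(images, return_tensors="pt")["pixel_values"]  # resize + normalize
    labels = torch.tensor(examples["labels"], dtype=torch.float32)    # N-hot targets
    return {"pixel_values": pixel_values, "labels": labels}

# Applied lazily when batches are read, e.g. dataset.set_transform(batch_sampler)
```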

Get Test Dataloader

To get the test dataloader, you'll need to specify the test dataset to use.


The test dataset should be a torch.utils.data.Dataset, and it's optional to provide it. If you do provide a test dataset, columns not accepted by the model.forward() method will be automatically removed.

A test dataset must implement the __len__ method.

Here's a summary of the relevant parameters. Note that `get_test_dataloader` itself only takes the test dataset; `ignore_keys` and `metric_key_prefix` are arguments of `Trainer.predict()`, which builds this dataloader internally:

  • test_dataset: the test dataset to use (torch.utils.data.Dataset, optional)
  • ignore_keys: a list of keys to ignore (Optional)
  • metric_key_prefix: a string prefix for metric keys (str = 'test')
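In practice you rarely call `get_test_dataloader` yourself. A short sketch, assuming a `trainer` and `test_dataset` from the earlier setup:

```python
# Build the dataloader directly (uncommon) ...
test_dataloader = trainer.get_test_dataloader(test_dataset)

# ... or, more typically, let predict() build it for you.
predictions = trainer.predict(test_dataset, metric_key_prefix="test")
print(predictions.metrics)   # metric keys are prefixed, e.g. "test_loss"
```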

Get Train Dataloader

The get_train_dataloader method is a crucial part of preparing your data for training. It will use no sampler if your train_dataset does not implement __len__, but a random sampler otherwise, which is adapted to distributed training if necessary.

To get the train dataloader, you can call this method, and it will automatically create a train dataloader for you. This method is very useful as it saves you the hassle of creating the dataloader manually.

However, if your train_dataset does not implement __len__, the method will not use a sampler. This is because a sampler needs to know the dataset's length in order to draw sample indices from it.


Here are the key arguments you can pass to the `TrainingArguments.set_dataloader` method to customize your dataloader (a sketch follows after the list):

  • `drop_last`: whether to drop the last incomplete batch or not
  • `num_workers`: the number of subprocesses to use for data loading
  • `pin_memory`: whether to pin memory in data loaders or not
  • `persistent_workers`: whether to keep the worker processes alive after the dataset has been consumed
  • `prefetch_factor`: the number of batches to load in advance by each worker
  • `auto_find_batch_size`: whether to find a batch size that will fit into memory automatically
  • `ignore_data_skip`: whether to skip the epochs and batches to get the data loading at the same stage as in the previous training
  • `sampler_seed`: the random seed to be used with data samplers

By using these arguments, you can customize your dataloader to suit your specific needs and improve the throughput of your training runs.
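For example, a hedged sketch of setting these options through `TrainingArguments.set_dataloader` (the values are placeholders, and the exact keyword set can vary slightly between transformers versions):

```python
from transformers import TrainingArguments

args = TrainingArguments(output_dir="./results")
args = args.set_dataloader(
    train_batch_size=16,
    eval_batch_size=32,
    drop_last=True,        # drop the last incomplete batch
    num_workers=4,         # subprocesses used for data loading
    pin_memory=True,       # pin host memory for faster GPU transfer
    prefetch_factor=2,     # batches loaded in advance by each worker
)
```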

Initializing a New Model

Initializing a new model is a crucial step in training a GPT-2 model. We'll start by loading a pretrained configuration for the small GPT-2 model, ensuring the tokenizer size matches the model vocabulary size and passing the bos and eos token IDs.

The configuration matches the small GPT-2 architecture, but its vocabulary size is set from the tokenizer so that the model can process the tokenized data correctly. The resulting model has 124M parameters that we'll have to tune.

In more detail, initializing the new model involves loading a pre-trained configuration and setting the model's vocabulary size. We'll be using the small GPT-2 configuration for our example.


The first step is to load the pre-trained configuration, which we'll use as a starting point for our new model. This configuration will serve as the foundation for our model's architecture.

We also need to make sure the model's vocabulary size is set to the size of the tokenizer, so that every token ID the tokenizer produces has a corresponding embedding. This is a crucial step to avoid indexing errors during training.

Our model has 124M parameters that we'll need to tune during training. This is a significant number of parameters, and we'll need to use a robust training approach to optimize them.
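A sketch of that initialization, following the standard configuration API; the "gpt2" tokenizer checkpoint here is a stand-in for the tokenizer you actually trained:

```python
from transformers import AutoConfig, AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder tokenizer

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),                      # match the tokenizer and model vocabulary
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

model = GPT2LMHeadModel(config)                     # fresh, randomly initialized weights
print(f"GPT-2 size: {model.num_parameters() / 1e6:.1f}M parameters")
```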

We'll use the DataCollatorForLanguageModeling collator to create batches and language model labels. This collator is specifically designed for language modeling tasks and will take care of creating the labels for us.

Note that DataCollatorForLanguageModeling supports both masked language modeling and causal language modeling, but by default, it prepares data for masked language modeling. We can switch to causal language modeling by setting the argument mlm=False.
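For causal language modeling, that looks roughly like this (continuing from the tokenizer above; GPT-2 defines no padding token, so reusing the end-of-sequence token is a common workaround):

```python
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token            # GPT-2 has no pad token by default
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM labels
```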

The Seq2SeqTrainer Class


The Seq2SeqTrainer extends the Trainer for sequence-to-sequence models, and its evaluation and prediction methods involve several key parameters.

The `eval_dataset` parameter allows you to override the default evaluation dataset, which must implement the `__len__` method.

You can also ignore certain keys in the output of your model by providing a list of keys in the `ignore_keys` parameter.

The `metric_key_prefix` parameter enables you to prefix metrics with a custom string, such as 'test'.

A maximum target length can be specified using the `max_length` parameter, which is useful when predicting with the generate method.

The `num_beams` parameter determines the number of beams for beam search, with 1 indicating no beam search.

The `gen_kwargs` parameter allows for additional generate-specific keyword arguments.

For prediction specifically, the `test_dataset` parameter is used to run predictions on a specific dataset, which must also implement the `__len__` method. The `ignore_keys` parameter can be used to ignore certain keys in the output of your model, just like in the `eval_dataset` case.
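A hedged sketch of evaluation and prediction with an already constructed Seq2SeqTrainer; the `trainer` and `test_dataset` variables are assumed from earlier, as is `predict_with_generate=True` in its training arguments:

```python
# Evaluate with generation-specific settings.
eval_metrics = trainer.evaluate(max_length=128, num_beams=4, metric_key_prefix="eval")

# Run predictions on a held-out test set.
test_output = trainer.predict(test_dataset, max_length=128, num_beams=4, metric_key_prefix="test")
print(test_output.metrics)   # keys are prefixed, e.g. "test_loss"
```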

Training Process


The training process with Hugging Face is made easy with their 'Trainer' class, which simplifies feeding batches of tokenized input data into your model, computing the loss, and updating the model's weights using backpropagation.

The Trainer class is a game-changer, as it takes care of the heavy lifting for you, allowing you to focus on fine-tuning your model.

To get started, you'll need to define your training and evaluation datasets, a TrainingArguments object, and optionally a compute_metrics function to evaluate your model's performance.
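Putting the pieces together, a hedged end-to-end sketch; the checkpoint, metric, and tokenized dataset variables are placeholders rather than anything prescribed by the library:

```python
import numpy as np
import evaluate
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return accuracy.compute(predictions=np.argmax(logits, axis=-1), references=labels)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    eval_strategy="epoch",            # "evaluation_strategy" in older transformers versions
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,      # assumed tokenized earlier
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()       # the training loop
trainer.evaluate()    # the evaluation loop
```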

Loop

The training process is made up of two key loops: the training loop and the evaluation loop. The training loop is where the magic happens, feeding batches of tokenized input data into your model, computing the loss, and updating the model's weights using backpropagation.

Hugging Face's 'Trainer' class simplifies this process, making it easier to get started with model training.

The evaluation loop, on the other hand, is where you test your model's performance on a separate dataset. This loop is shared by Trainer.evaluate() and Trainer.predict(), and it's what helps you gauge how well your model is doing.


Step


In the training process, there are two key steps: prediction and training.

The prediction step evaluates a model using inputs, returning a tuple with the loss, logits, and labels. This step is customizable, allowing you to inject your own behavior by subclassing and overriding it.

The training step is where the model actually learns from the data: it takes in the model to train and the inputs, and returns the tensor with the training loss on this batch.

There are a few things to keep in mind when working with these steps. First, the inputs dictionary will be unpacked before being fed to the model, so make sure you're passing in the correct arguments. Second, most models expect the targets under the argument labels, so check your model's documentation to make sure you're meeting its requirements.

Both steps receive the model and a dictionary of inputs; the prediction step additionally takes a flag indicating whether to return only the loss and an optional list of output keys to ignore.

By understanding these steps and their arguments, you'll be well on your way to training your model and achieving your goals.
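To make the unpacking point concrete, here's a small self-contained sketch of the contract the Trainer relies on; the checkpoint is a placeholder chosen only to illustrate it:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-uncased"                     # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# The Trainer calls model(**inputs), so the keys must match forward()'s
# arguments, with the targets stored under "labels".
inputs = tokenizer(["a short example"], return_tensors="pt")
inputs["labels"] = torch.tensor([1])

outputs = model(**inputs)    # the dictionary is unpacked into keyword arguments
loss = outputs.loss          # what the training step works with
logits = outputs.logits      # part of what the prediction step returns
```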

Compute Loss


The compute_loss function is a crucial part of the training process, and it's used to calculate the loss of the model. This function is defined by the Trainer class, which simplifies the training process.

By default, all Hugging Face models return the loss as the first element of their output when labels are provided, and compute_loss simply reads it from there.

The compute_loss function can be overridden to inject custom behavior into the training process. This allows developers to tailor the training process to their specific needs.

To compute the loss, the Trainer feeds the inputs to the model and uses the model's output, which contains the loss and the logits. The loss is always the first element of that output.

Overriding compute_loss is as simple as subclassing the Trainer, as the sketch below shows.
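A hedged example modeled on the subclassing pattern in the Trainer documentation; the class weights are purely illustrative:

```python
import torch
from torch import nn
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra arguments passed by some newer transformers versions.
        # Pull the targets out so the model only sees its forward() arguments.
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Illustrative choice: up-weight the second class for an imbalanced dataset.
        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 3.0], device=logits.device))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```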

The compute_loss function is an essential part of the training process, quantifying how far the model's predictions are from the targets. By understanding how it works, developers can fine-tune their models and achieve better results.

Model Optimization


We provide a reasonable default optimizer that works well, so you can focus on training your model without worrying about the details.

If you want to use a different optimizer, you can pass a tuple in the Trainer's init through optimizers, or subclass and override the create_optimizer method in a subclass.

The get_optimizer_cls_and_kwargs method returns the optimizer class and optimizer parameters based on the training arguments, which is useful for customizing your optimizer setup.

Hyperparameter search is a crucial step in model optimization: it automatically searches for the best combination of hyperparameters for your model, using a library such as Optuna, Ray Tune, or SigOpt.

You can launch a hyperparameter search using the hyperparameter_search function, which takes several parameters. The hp_space parameter defines the search space, and it defaults to a function that defines the default hyperparameter search space for your chosen backend.

The compute_objective parameter computes the objective to minimize or maximize, and it defaults to a function that returns the evaluation loss when no metric is provided. The n_trials parameter determines the number of trial runs to test, and it defaults to 100.


You can also specify the direction of optimization, which can be "minimize" or "maximize", depending on whether you're optimizing the validation loss or one or several metrics. The backend parameter determines the library to use for hyperparameter search, and it defaults to Optuna if all libraries are installed.

The available backends are Optuna, Ray Tune, SigOpt, and Weights & Biases, each with its own default hyperparameter search space.

The hyperparameter_search function returns the best run or best runs for multi-objective optimization. The experiment summary can be found in the run_summary attribute for Ray backend.
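A hedged sketch using the Optuna backend; the search ranges and checkpoint are arbitrary, and note that hyperparameter search requires the Trainer to be built with a `model_init` function rather than a fixed model:

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

def model_init():
    # Re-instantiated from scratch for every trial; the checkpoint is a placeholder.
    return AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
    }

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(output_dir="./hp_search"),
    train_dataset=train_dataset,   # assumed prepared earlier
    eval_dataset=eval_dataset,
)

best_run = trainer.hyperparameter_search(
    hp_space=hp_space,
    direction="minimize",          # minimize the evaluation loss
    backend="optuna",
    n_trials=20,
)
print(best_run.hyperparameters)
```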

Create Optimizer

Creating an optimizer is a crucial step in the model optimization process. You can use the default optimizer provided, but if you want something different, you can pass a tuple in the Trainer's init through optimizers.

The default optimizer works well, but you can also subclass and override the `create_optimizer` method in a subclass. This gives you the flexibility to experiment with different optimizers and find the one that works best for your model.


To use a different optimizer, you pass a tuple of an instantiated optimizer and learning-rate scheduler through `optimizers`, for example `optimizers=(optimizer, lr_scheduler)`. If you only want to select a built-in optimizer by name, set `optim="adamw_torch"` on your TrainingArguments instead.

Here are some common values for the `optim` training argument:

  • `adamw_torch`: PyTorch's AdamW implementation (the default in recent versions)
  • `adamw_torch_fused`: a fused AdamW that can be faster on recent GPUs
  • `adafactor`: a memory-efficient alternative to Adam
  • `sgd`: plain stochastic gradient descent
  • `adagrad`: the Adagrad optimizer

You can also use the `set_optimizer` method on TrainingArguments to specify the optimizer and its hyperparameters, for example `args.set_optimizer(name="adamw_torch", learning_rate=1e-4)`.
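A hedged sketch of both approaches; the model, dataset, and schedule length are placeholders assumed from earlier in the tutorial:

```python
import torch
from transformers import Trainer, TrainingArguments, get_linear_schedule_with_warmup

# Option 1: pick a built-in optimizer by name on the training arguments.
args = TrainingArguments(output_dir="./results")
args = args.set_optimizer(name="adamw_torch", learning_rate=1e-4)

# Option 2: build the optimizer and scheduler yourself and hand them to the Trainer.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)            # `model` assumed defined earlier
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=1_000           # placeholder schedule length
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,    # assumed prepared earlier
    optimizers=(optimizer, lr_scheduler),
)
```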
