Fine-Tuning Hugging Face Model with Custom Dataset for Downstream Applications

Fine-tuning a Hugging Face model with a custom dataset is a powerful way to adapt a pre-trained model to your specific use case. This approach allows you to leverage the strengths of the pre-trained model while also incorporating your unique dataset.

By using a custom dataset, you can tailor the model to perform well on tasks specific to your application. The Hugging Face model can be fine-tuned to recognize patterns in your dataset that were not present in the original training data.

To fine-tune the model, you'll need to prepare your custom dataset in the correct format. This involves splitting your data into training and validation sets, and then converting it into a format that the model can understand.

Installing Hugging Face Library

Installing the Hugging Face library is a straightforward process. You can install the transformers package from Hugging Face, which provides a PyTorch interface for working with BERT.

The Hugging Face library is the most widely used and powerful PyTorch interface for working with BERT. It also contains interfaces for other pretrained language models like OpenAI’s GPT and GPT-2.

You can use the library's pre-built modifications of these models suited to your specific task; this tutorial, for example, uses BertForSequenceClassification.

The library also includes task-specific classes for token classification, question answering, next sentence prediction, etc. Using these pre-built classes simplifies the process of modifying BERT for your purposes.
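As a rough sketch of what loading one of these pre-built classes looks like (the bert-base-uncased checkpoint and num_labels=2 are illustrative choices, not requirements):

```python
# pip install transformers torch

from transformers import BertForSequenceClassification, BertTokenizer

# Load pre-trained BERT with a sequence-classification head on top.
# num_labels=2 assumes a binary classification task.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)

# Load the matching WordPiece tokenizer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
```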

The run_glue.py utility is a helpful tool that lets you pick which GLUE benchmark task to run and which pre-trained model to use. It also supports running on the CPU, a single GPU, or multiple GPUs for additional speed-up.

Each training epoch takes around 5 minutes and 28 seconds when using a MAX_LEN of 128.

Data Preparation

Data preparation is a crucial step in fine-tuning a HuggingFace model with a custom dataset. To start, you'll need to preprocess your data with a tokenizer, which tokenizes the inputs, including converting tokens to their corresponding IDs in the pretrained vocabulary.

You can use the AutoTokenizer.from_pretrained method to get a tokenizer that corresponds to the model architecture you want to use, and download the vocabulary used when pretraining this specific checkpoint.

Sequences longer than the model's maximum length need to be truncated, so set the truncation argument to True when feeding the samples to the tokenizer.
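A minimal sketch of this step, assuming a bert-base-uncased checkpoint and a dataset column named "text" (both placeholders for your own setup):

```python
from transformers import AutoTokenizer

# Download the tokenizer that matches the pretrained checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess(examples):
    # Tokenize the text column; truncation=True cuts sequences that exceed
    # the model's maximum length.
    return tokenizer(examples["text"], truncation=True)
```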

Once you have the preprocessed data, you can split it into training and validation sets. The Datasets library can automatically pull images and classes from the dataset, but you'll need to decide what fraction of the training set to hold out for validation, for example by matching the size of the test set.

Here's a summary of the data preparation steps:

  • Load the tokenizer that matches your pretrained checkpoint with AutoTokenizer.from_pretrained
  • Tokenize the inputs with truncation=True so they fit within the model's maximum sequence length
  • Split the processed data into training and validation sets

By following these steps, you'll be well on your way to preparing your custom dataset for fine-tuning a HuggingFace model.

Data Split

Data splitting is a crucial step in preparing your data for training models. A common approach is to use 90% of your training set for training and the remaining 10% for validation.

Divide your dataset into training and validation sets using a ratio of 90% to 10%. This is a common approach in machine learning.

To create an iterator for your dataset, use the torch DataLoader class, which helps save on memory during training by not loading the entire dataset into memory at once.
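Here's a sketch of the 90/10 split and DataLoader setup; the dataset variable and the batch size of 32 are assumptions for illustration:

```python
from torch.utils.data import DataLoader, random_split

# `dataset` is assumed to be a torch-compatible dataset of encoded examples.
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

# DataLoaders stream batches, so the whole dataset is never held in memory at once.
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)
```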

Here's a breakdown of the data split:

  • Training set: 90% of the data, used to update the model's weights
  • Validation set: the remaining 10%, used to check performance during training

By splitting your data in this way, you can train your model on a large dataset and validate its performance on a smaller, separate dataset. This helps you avoid overfitting and ensures that your model generalizes well to new, unseen data.

Tokenization & Formatting

Tokenization & Formatting is a crucial step in preparing your data for model training. To tokenize your text, you need to split it into individual tokens and map them to their corresponding IDs in the tokenizer vocabulary.

The transformers library provides a helpful encode function to handle most of the parsing and data prep steps for us. This function can be used to perform multiple steps such as splitting the sentence into tokens, adding special tokens, mapping tokens to their IDs, and padding or truncating all sentences to the same length.

A popular choice is the BERT tokenizer, which splits text into subwords, or wordpieces. The tokenizer that ships with BERT can be downloaded and used directly to tokenize your text.

To format your data, you need to add special tokens to the start and end of each sentence, pad and truncate all sentences to a single constant length, and explicitly differentiate real tokens from padding tokens with the attention mask.

Here are the required formatting steps:

  • Add special tokens to the start and end of each sentence
  • Pad and truncate all sentences to a single constant length
  • Explicitly differentiate real tokens from padding tokens with the attention mask

The encode_plus function in the transformers library combines multiple steps for us, including tokenization, adding special tokens, mapping tokens to their IDs, padding or truncating, and creating attention masks. This function is a convenient way to prepare your data for model training.
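As a brief sketch of encode_plus in use (the sample sentence is made up, and max_length=128 simply matches the MAX_LEN mentioned earlier):

```python
# Encode one sentence; in practice you would loop over your dataset.
encoded = tokenizer.encode_plus(
    "Here is a sample sentence to encode.",
    add_special_tokens=True,      # add [CLS] and [SEP]
    max_length=128,
    padding="max_length",         # pad every sentence to the same length
    truncation=True,              # truncate anything longer than max_length
    return_attention_mask=True,   # 1 for real tokens, 0 for padding
    return_tensors="pt",          # return PyTorch tensors
)

input_ids = encoded["input_ids"]
attention_mask = encoded["attention_mask"]
```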

Fine-Tuning Hugging Face Model

Fine-tuning a Hugging Face model is a straightforward process. You can fine-tune with the Trainer API, which involves defining your training hyperparameters in TrainingArguments, passing them to a Trainer along with the model, dataset, tokenizer, and data collator, and calling Trainer.train() to fine-tune your model.
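As a sketch of the Trainer API, assuming model, train_dataset, val_dataset, tokenizer, and data_collator have already been created as described elsewhere in this article (the hyperparameter values are illustrative defaults, not tuned settings):

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # where checkpoints are written
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()  # fine-tune the model
```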

To fine-tune with TensorFlow, you'll need to batch the processed examples together with dynamic padding using the DataCollatorWithPadding function and convert your datasets to the tf.data.Dataset format with to_tf_dataset. You'll also need to set return_tensors="tf" to return tf.Tensor outputs instead of PyTorch tensors.
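A minimal TensorFlow sketch along those lines; the tokenized_dataset name, batch size, and learning rate are placeholders:

```python
import tensorflow as tf
from transformers import DataCollatorWithPadding, TFAutoModelForSequenceClassification

# Dynamic padding that returns tf.Tensor outputs instead of PyTorch tensors.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

# Convert the processed Hugging Face dataset to a tf.data.Dataset.
tf_train_dataset = tokenized_dataset["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask"],
    label_cols=["label"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(tf_train_dataset, epochs=3)
```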

One of the main differences when using Ray Train is that you need to define the training logic as a function (train_func) and pass this function to the TorchTrainer. You'll also need to initialize the model, metric, and tokenizer within the function to avoid serialization errors.

Fine-tuning the BERT model requires loading your input data, defining your training hyperparameters, and setting up an optimizer function and learning rate schedule. You'll also need to define a helper function for calculating accuracy and another for formatting elapsed times as hh:mm:ss.
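A sketch of that setup, assuming model, train_loader, and an epochs count already exist; the learning rate and warmup settings mirror common BERT fine-tuning defaults rather than anything prescribed here:

```python
import datetime

import numpy as np
import torch
from transformers import get_linear_schedule_with_warmup

# AdamW optimizer plus a linear learning-rate schedule with warmup.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, eps=1e-8)
total_steps = len(train_loader) * epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=total_steps
)

def flat_accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

def format_time(elapsed):
    """Format a duration in seconds as hh:mm:ss."""
    return str(datetime.timedelta(seconds=int(round(elapsed))))
```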

Here's a summary of the main steps involved in fine-tuning a Hugging Face model:

  • Load and preprocess your input data
  • Define your training hyperparameters
  • Set up an optimizer and learning rate schedule (or let the Trainer / TorchTrainer handle them)
  • Train the model and evaluate it on the validation set

These steps provide a general outline of the fine-tuning process, and you can refer to the specific examples above for more detailed instructions.

Preprocessing the Data

Preprocessing the data is a crucial step in fine-tuning a Hugging Face model with your custom dataset. You need to preprocess the data with a Hugging Face Transformers' Tokenizer, which tokenizes the inputs, converts tokens to their corresponding IDs, and puts them in a format the model expects.

To do this, instantiate your tokenizer with the AutoTokenizer.from_pretrained method, which ensures you get a tokenizer that corresponds to the model architecture you want to use and downloads the vocabulary used when pretraining this specific checkpoint.

The tokenizer should be called with the argument truncation=True so that sequences longer than the model's maximum length are truncated; padding each batch to its longest sequence is handled later by the data collator.

You can then write a function that preprocesses the samples, feeding them to the tokenizer with the argument truncation=True.

To preprocess the dataset, you need the names of the columns containing the sentence(s), which can be kept track of in a dictionary.

You can convert HF Dataset objects to Ray Data using the built-in from_huggingface() function, which is straightforward since Arrow tables back both of them.
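For example, a small sketch of that conversion (the "imdb" dataset here is just a stand-in for your own Hugging Face Dataset):

```python
import ray
from datasets import load_dataset

# Load any Hugging Face Dataset...
hf_dataset = load_dataset("imdb", split="train")

# ...and convert it to a Ray Dataset; both are backed by Arrow tables,
# so no copying or reformatting of the data is required.
ray_dataset = ray.data.from_huggingface(hf_dataset)
```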

Here's a summary of the preprocessing steps:

  • Instantiate the tokenizer with AutoTokenizer.from_pretrained
  • Identify the column(s) containing the sentence(s) to tokenize
  • Tokenize the samples with truncation=True
  • Optionally convert the Hugging Face Dataset to Ray Data with from_huggingface()

By following these steps, you'll be able to preprocess your data and get it ready for fine-tuning your Hugging Face model.

Training and Evaluation

Fine-tuning a Hugging Face model with a custom dataset involves a series of steps that require careful attention to detail. You'll need to define your training logic as a function (train_func) when using Ray Train, which will be passed to the TorchTrainer to execute on every Ray worker.

This function should initialize the model, metric, and tokenizer to avoid serialization errors. You can then instantiate the TorchTrainer, setting the scaling_config and datasets for training and evaluation.
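A rough sketch of that structure; the worker count, GPU flag, and the ray_train_ds/ray_eval_ds dataset names are assumptions for illustration:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from transformers import BertForSequenceClassification

def train_func():
    # Initialize the model, metric, and tokenizer *inside* the function so
    # they are created on each Ray worker instead of being serialized from
    # the driver (which would otherwise cause serialization errors).
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )
    # ... build the Hugging Face Trainer / training loop here and run it ...

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    datasets={"train": ray_train_ds, "evaluation": ray_eval_ds},
)
result = trainer.fit()
```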

To evaluate your model, you can use metrics like validation loss, which is a more precise measure than accuracy. Validation loss takes into account the exact output value, whereas accuracy only cares about which side of a threshold it falls on.

Reviewing the training summary for this run, you'll notice the training loss decreases with each epoch while the validation loss increases, which indicates over-fitting.

Training Loop

The training loop is a crucial part of the training process, where the model learns from the data and improves its performance. It's a repetitive process where the model is trained and validated multiple times.

In the training loop, we have a series of steps that happen for each pass. These steps include unpacking the data inputs and labels, loading the data onto the GPU for acceleration, clearing out the gradients calculated in the previous pass, doing a forward pass, backward pass, and updating the network parameters.

Here's a breakdown of the steps in the training loop:

  • Unpack our data inputs and labels
  • Load data onto the GPU for acceleration
  • Clear out the gradients calculated in the previous pass
  • Forward pass (feed input data through the network)
  • Backward pass (backpropagation)
  • Tell the network to update parameters with optimizer.step()
  • Track variables for monitoring progress

We also have a validation phase, which is similar to the training phase but uses validation data instead of training data. We unpack the data inputs and labels, load the data onto the GPU, do a forward pass, and compute the loss on the validation data.
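A condensed sketch of one pass through the training data followed by the validation pass, assuming each batch is an (input_ids, attention_mask, labels) tuple and that device, optimizer, and scheduler are already set up:

```python
import torch

# ----- Training phase -----
model.train()
for batch in train_loader:
    # Unpack the inputs and labels and load them onto the GPU.
    input_ids, attention_mask, labels = (t.to(device) for t in batch)

    # Clear out the gradients calculated in the previous pass.
    model.zero_grad()

    # Forward pass: the model returns the loss when labels are supplied.
    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss

    # Backward pass, then tell the network to update its parameters.
    loss.backward()
    optimizer.step()
    scheduler.step()

# ----- Validation phase -----
model.eval()
with torch.no_grad():
    for batch in val_loader:
        input_ids, attention_mask, labels = (t.to(device) for t in batch)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        val_loss = outputs.loss  # track this to monitor over-fitting
```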

The training loop is where we can monitor the progress of our model and make adjustments as needed. We can view the summary of the training process, which includes the training loss, validation loss, validation accuracy, training time, and validation time.

Test Set Evaluation

As you prepare to evaluate your model's performance, it's essential to consider the test set. The test set is used to get an unbiased estimate of your model's performance on unseen data.

With the test set prepared, you can apply your fine-tuned model to generate predictions on the test set. This will give you a sense of how well your model is doing on the entire test set.

Each batch in the test set has 32 sentences, except for the last batch, which has only 4 test sentences. This means you'll need to combine the results for all batches to calculate your final MCC score.
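A short sketch of that final step, assuming predictions (per-batch logits) and true_labels were collected while running the fine-tuned model over the test DataLoader:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

# Combine the per-batch results into single flat arrays.
flat_predictions = np.concatenate([np.argmax(logits, axis=1) for logits in predictions])
flat_true_labels = np.concatenate(true_labels)

# Matthews correlation coefficient over the whole test set.
mcc = matthews_corrcoef(flat_true_labels, flat_predictions)
print(f"Total MCC: {mcc:.3f}")
```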

In about half an hour, you can get a good score without doing any hyperparameter tuning. However, to maximize the score, it's recommended to remove the "validation set" and train on the entire training set.

The accuracy can vary significantly between runs, especially with small dataset sizes. This is why it's essential to evaluate your model's performance consistently.
