Fine-tuning a Hugging Face model with a custom dataset is a powerful way to adapt a pre-trained model to your specific use case: you keep the general language knowledge the model learned during pre-training while teaching it the patterns that matter for your own data.
By training on a custom dataset, the model learns to handle inputs and labels that were not present in its original training data, which typically improves performance on your particular task.
To fine-tune the model, you'll need to prepare your custom dataset in the correct format. This involves splitting your data into training and validation sets, and then converting it into a format that the model can understand.
Installing Hugging Face Library
Installing the Hugging Face library is straightforward: install the transformers package, which provides a PyTorch interface for working with BERT.
The transformers library is a widely used PyTorch interface for BERT, and it also includes interfaces for other pretrained language models such as OpenAI's GPT and GPT-2.
The library ships with pre-built, task-specific versions of these models; for example, this tutorial uses BertForSequenceClassification for sentence classification.
It also provides classes for token classification, question answering, next-sentence prediction, and more. Using these pre-built classes simplifies the process of adapting BERT to your task.
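As a small sketch of what loading one of those pre-built classes looks like (the checkpoint name and num_labels=2 are example choices, not requirements):

```python
# pip install transformers torch   (run once in your environment)
from transformers import BertForSequenceClassification, BertTokenizer

# Load a BERT encoder with a classification head on top.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",      # example pre-trained checkpoint
    num_labels=2,             # binary classification, e.g. acceptable / unacceptable
    output_attentions=False,
    output_hidden_states=False,
)

# The matching tokenizer for the same checkpoint.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
```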
The run_glue.py script is a helpful utility that lets you choose which GLUE benchmark task to run and which pre-trained model to use. It supports running on the CPU, a single GPU, or multiple GPUs for additional speed-up.
With a MAX_LEN of 128, each training epoch takes around 5 minutes and 28 seconds.
Data Preparation
Data preparation is a crucial step in fine-tuning a HuggingFace model with a custom dataset. To start, you'll need to preprocess your data with a tokenizer, which tokenizes the inputs, including converting tokens to their corresponding IDs in the pretrained vocabulary.
You can use the AutoTokenizer.from_pretrained method to get a tokenizer that corresponds to the model architecture you want to use, and download the vocabulary used when pretraining this specific checkpoint.
Set truncation=True when feeding the samples to the tokenizer so that inputs longer than the model's maximum length are cut down to a size the model can handle; padding each batch to its longest sequence is handled separately, for example by a data collator.
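Here's a minimal sketch of that step; the "bert-base-cased" checkpoint and the example sentences are placeholders for whatever model and data you are using:

```python
from transformers import AutoTokenizer

# Download the tokenizer and vocabulary that match the chosen checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# truncation=True cuts inputs longer than the model's maximum length;
# padding="longest" pads every sequence in this call to the longest one.
encoded = tokenizer(
    ["A short example sentence.", "A somewhat longer example sentence to tokenize."],
    truncation=True,
    padding="longest",
)

print(encoded["input_ids"])       # token IDs from the pretrained vocabulary
print(encoded["attention_mask"])  # 1 for real tokens, 0 for padding
```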
Once the data is preprocessed, split it into training and validation sets. For image datasets, the Datasets library can pull the images and class labels automatically, but you still need to decide what fraction of the data to hold out for validation.
Here's a summary of the data preparation steps:
- Load your dataset and instantiate a tokenizer with AutoTokenizer.from_pretrained
- Tokenize the inputs with truncation=True so they fit the model's maximum length
- Split the preprocessed data into training and validation sets
By following these steps, you'll be well on your way to preparing your custom dataset for fine-tuning a HuggingFace model.
Data Split
Data Split is a crucial step in preparing your data for training models. You can use 90% of your training set for training and 10% for validation, as demonstrated in Example 2.
A 90/10 split between training and validation data is a common starting point in machine learning, though the right ratio depends on how much data you have.
To create an iterator for your dataset, use the torch DataLoader class, which helps save on memory during training by not loading the entire dataset into memory at once.
Here's a breakdown of the data split:
- Training set: 90% of the examples, used to update the model's weights
- Validation set: the remaining 10%, used to measure performance during training
By splitting your data in this way, you can train your model on a large dataset and validate its performance on a smaller, separate dataset. This helps you avoid overfitting and ensures that your model generalizes well to new, unseen data.
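A minimal sketch of the split and the DataLoaders, assuming input_ids, attention_masks, and labels are tensors produced by the tokenization step; the batch size of 32 is just an example:

```python
import torch
from torch.utils.data import (TensorDataset, random_split, DataLoader,
                              RandomSampler, SequentialSampler)

# Assumed to exist from the tokenization step: input_ids, attention_masks, labels (tensors).
dataset = TensorDataset(input_ids, attention_masks, labels)

# 90/10 train/validation split.
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

batch_size = 32

# DataLoaders iterate over the data in batches rather than loading everything into memory at once.
train_dataloader = DataLoader(train_dataset,
                              sampler=RandomSampler(train_dataset),
                              batch_size=batch_size)
validation_dataloader = DataLoader(val_dataset,
                                   sampler=SequentialSampler(val_dataset),
                                   batch_size=batch_size)
```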
Tokenization & Formatting
Tokenization & Formatting is a crucial step in preparing your data for model training. To tokenize your text, you need to split it into individual tokens and map them to their corresponding IDs in the tokenizer vocabulary.
The transformers library provides a helpful encode function to handle most of the parsing and data prep steps for us. This function can be used to perform multiple steps such as splitting the sentence into tokens, adding special tokens, mapping tokens to their IDs, and padding or truncating all sentences to the same length.
A popular choice is the BERT tokenizer, which splits text into subword units (WordPieces). The tokenizer shipped with each BERT checkpoint can be downloaded and used directly to tokenize your text.
To format your data, you need to add special tokens to the start and end of each sentence, pad and truncate all sentences to a single constant length, and explicitly differentiate real tokens from padding tokens with the attention mask.
Here are the required formatting steps:
- Add special tokens to the start and end of each sentence
- Pad and truncate all sentences to a single constant length
- Explicitly differentiate real tokens from padding tokens with the attention mask
The encode_plus function in the transformers library combines multiple steps for us, including tokenization, adding special tokens, mapping tokens to their IDs, padding or truncating, and creating attention masks. This function is a convenient way to prepare your data for model training.
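Here's a rough sketch of a single encode_plus call, reusing the tokenizer from earlier; MAX_LEN = 128 mirrors the value mentioned above and the sentence is a placeholder:

```python
MAX_LEN = 128

encoded = tokenizer.encode_plus(
    "Here is a sentence to classify.",
    add_special_tokens=True,       # add [CLS] at the start and [SEP] at the end
    max_length=MAX_LEN,
    padding="max_length",          # pad every sentence to the same constant length
    truncation=True,
    return_attention_mask=True,    # 1 for real tokens, 0 for padding tokens
    return_tensors="pt",           # return PyTorch tensors
)

input_ids = encoded["input_ids"]
attention_mask = encoded["attention_mask"]
```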
Fine-Tuning Hugging Face Model
Fine-tuning a Hugging Face model is a straightforward process. You can fine-tune with the Trainer API, which involves defining your training hyperparameters in TrainingArguments, passing them to a Trainer along with the model, dataset, tokenizer, and data collator, and calling Trainer.train() to fine-tune your model.
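A minimal sketch of that flow, assuming the model, tokenizer, and a tokenized_datasets object with "train" and "validation" splits from the earlier steps; the hyperparameter values are placeholders, not recommendations:

```python
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

# Pads each batch to its longest sequence at training time.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()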
To fine-tune with TensorFlow, you'll need to batch the processed examples together with dynamic padding using the DataCollatorWithPadding function and convert your datasets to the tf.data.Dataset format with to_tf_dataset. You'll also need to set return_tensors="tf" to return tf.Tensor outputs instead of PyTorch tensors.
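As a rough sketch of the TensorFlow path (the split and column names such as "train", "validation", "input_ids", and "label" are assumptions about your tokenized dataset):

```python
import tensorflow as tf
from transformers import DataCollatorWithPadding, TFAutoModelForSequenceClassification

# return_tensors="tf" makes the collator emit tf.Tensor batches.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

tf_train = tokenized_datasets["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask"],
    label_cols=["label"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
tf_val = tokenized_datasets["validation"].to_tf_dataset(
    columns=["input_ids", "attention_mask"],
    label_cols=["label"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(tf_train, validation_data=tf_val, epochs=3)
```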
One of the main differences when using Ray Train is that you need to define the training logic as a function (train_func) and pass this function to the TorchTrainer. You'll also need to initialize the model, metric, and tokenizer within the function to avoid serialization errors.
Fine-tuning the BERT model requires loading your input data, defining your training hyperparameters, and setting up an optimizer function and learning rate schedule. You'll also need to define a helper function for calculating accuracy and another for formatting elapsed times as hh:mm:ss.
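Here's a hedged sketch of that setup; it assumes the model and train_dataloader from the earlier sketches, and the helper names flat_accuracy and format_time are illustrative rather than part of any library:

```python
import datetime
import numpy as np
import torch
from transformers import get_linear_schedule_with_warmup

epochs = 4  # illustrative value

# AdamW optimizer and a linear learning-rate schedule with warm-up.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, eps=1e-8)
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)

def flat_accuracy(preds, labels):
    """Accuracy of argmax predictions against the label IDs."""
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

def format_time(elapsed):
    """Format a duration in seconds as hh:mm:ss."""
    return str(datetime.timedelta(seconds=int(round(elapsed))))
```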
Here's a summary of the main steps involved in fine-tuning a Hugging Face model:
- Load and tokenize your dataset
- Define your training hyperparameters (and, for a manual loop, an optimizer and learning-rate schedule)
- Instantiate the model and a Trainer, a Keras model, or a custom training loop
- Train the model and evaluate it on the validation set
These steps provide a general outline of the fine-tuning process, and you can refer to the specific examples above for more detailed instructions.
Preprocessing the Data
Preprocessing the data is a crucial step in fine-tuning a Hugging Face model with your custom dataset. You need to preprocess the data with a Hugging Face Transformers' Tokenizer, which tokenizes the inputs, converts tokens to their corresponding IDs, and puts them in a format the model expects.
To do this, instantiate your tokenizer with the AutoTokenizer.from_pretrained method, which ensures you get a tokenizer that corresponds to the model architecture you want to use and downloads the vocabulary used when pretraining this specific checkpoint.
Call the tokenizer with truncation=True so that inputs longer than the model's maximum length are truncated; padding each batch to its longest sequence is usually handled afterwards by a data collator.
You can then write a function that preprocesses the samples, feeding them to the tokenizer with the argument truncation=True.
To preprocess the dataset, you need the names of the columns containing the sentence(s), which can be kept track of in a dictionary.
You can convert HF Dataset objects to Ray Data using the built-in from_huggingface() function, which is straightforward since Arrow tables back both of them.
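A minimal sketch of the whole step, using GLUE's CoLA task and its "sentence" column purely as an example and reusing the tokenizer from earlier:

```python
import ray.data
from datasets import load_dataset

# Keep the column names per task in a dictionary; CoLA has a single sentence column.
task_to_keys = {"cola": ("sentence", None)}
sentence1_key, sentence2_key = task_to_keys["cola"]

datasets = load_dataset("glue", "cola")

def preprocess_function(examples):
    # Feed one or two text columns to the tokenizer, truncating long inputs.
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

encoded = datasets.map(preprocess_function, batched=True)

# Convert the Hugging Face splits to Ray Data (both are backed by Arrow tables).
ray_train = ray.data.from_huggingface(encoded["train"])
ray_validation = ray.data.from_huggingface(encoded["validation"])
```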
Here's a summary of the preprocessing steps:
- Instantiate a tokenizer with AutoTokenizer.from_pretrained
- Write a preprocessing function that feeds the relevant text columns to the tokenizer with truncation=True
- Apply the function to the dataset (for example with map)
- Optionally convert the Hugging Face Dataset objects to Ray Data with from_huggingface()
By following these steps, you'll be able to preprocess your data and get it ready for fine-tuning your Hugging Face model.
Training and Evaluation
Fine-tuning a Hugging Face model with a custom dataset involves a series of steps that require careful attention to detail. You'll need to define your training logic as a function (train_func) when using Ray Train, which will be passed to the TorchTrainer to execute on every Ray worker.
This function should initialize the model, metric, and tokenizer to avoid serialization errors. You can then instantiate the TorchTrainer, setting the scaling_config and datasets for training and evaluation.
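A minimal sketch of that setup, assuming the Ray Datasets created earlier (ray_train and ray_validation) and using BERT on a GLUE-style task as placeholders; the body of the per-worker loop is elided:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    # Initialize everything inside the function so nothing has to be
    # serialized and shipped from the driver to the workers.
    import evaluate
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    metric = evaluate.load("glue", "cola")

    # ... build dataloaders from ray.train.get_dataset_shard(), run the
    # training loop, and report metrics back to Ray ...

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    datasets={"train": ray_train, "evaluation": ray_validation},
)
result = trainer.fit()
```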
To evaluate your model, you can use metrics like validation loss, which is a more precise measure than accuracy. Validation loss takes into account the exact output value, whereas accuracy only cares about which side of a threshold it falls on.
In the training summary produced after each run, notice how the training loss decreases with each epoch while the validation loss increases, which indicates over-fitting.
Training Loop
The training loop is a crucial part of the training process, where the model learns from the data and improves its performance. It's a repetitive process where the model is trained and validated multiple times.
In the training loop, a series of steps happens on each pass: we unpack the data inputs and labels, load the data onto the GPU for acceleration, clear out the gradients calculated in the previous pass, run a forward pass and a backward pass, and update the network parameters.
Here's a breakdown of the steps in the training loop:
- Unpack our data inputs and labels
- Load data onto the GPU for acceleration
- Clear out the gradients calculated in the previous pass
- Forward pass (feed input data through the network)
- Backward pass (backpropagation)
- Tell the network to update parameters with optimizer.step()
- Track variables for monitoring progress
We also have a validation phase, which is similar to the training phase but uses validation data instead of training data. We unpack the data inputs and labels, load the data onto the GPU, do a forward pass, and compute the loss on the validation data.
The training loop is where we can monitor the progress of our model and make adjustments as needed. We can view the summary of the training process, which includes the training loss, validation loss, validation accuracy, training time, and validation time.
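A condensed sketch of such a loop, reusing the model, DataLoaders, optimizer, scheduler, and helper functions introduced in the earlier sketches:

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(epochs):
    # ----- Training -----
    model.train()
    total_train_loss = 0
    t0 = time.time()

    for batch in train_dataloader:
        # Unpack the batch and load it onto the GPU.
        b_input_ids, b_input_mask, b_labels = (t.to(device) for t in batch)

        model.zero_grad()                      # clear gradients from the previous pass
        outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)
        loss = outputs.loss                    # forward pass
        total_train_loss += loss.item()

        loss.backward()                        # backward pass
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()                       # update the network parameters
        scheduler.step()                       # advance the learning-rate schedule

    # ----- Validation -----
    model.eval()
    total_eval_loss, total_eval_accuracy = 0, 0
    for batch in validation_dataloader:
        b_input_ids, b_input_mask, b_labels = (t.to(device) for t in batch)
        with torch.no_grad():
            outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)
        total_eval_loss += outputs.loss.item()
        total_eval_accuracy += flat_accuracy(outputs.logits.detach().cpu().numpy(),
                                             b_labels.cpu().numpy())

    print(f"Epoch {epoch + 1}: "
          f"train loss {total_train_loss / len(train_dataloader):.3f}, "
          f"val loss {total_eval_loss / len(validation_dataloader):.3f}, "
          f"val acc {total_eval_accuracy / len(validation_dataloader):.3f}, "
          f"epoch time {format_time(time.time() - t0)}")
```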
Test Set Evaluation
As you prepare to evaluate your model's performance, it's essential to consider the test set. The test set is used to get an unbiased estimate of your model's performance on unseen data.
With the test set prepared, you can apply your fine-tuned model to generate predictions on the test set. This will give you a sense of how well your model is doing on the entire test set.
Each batch in the test set has 32 sentences, except for the last batch, which has only 4 test sentences. This means you'll need to combine the results for all batches to calculate your final MCC score.
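A small sketch of that aggregation, assuming predictions and true_labels are lists of per-batch logit and label arrays collected while looping over the test DataLoader:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

# Combine the per-batch results into flat arrays.
flat_predictions = np.concatenate(predictions, axis=0)
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()
flat_true_labels = np.concatenate(true_labels, axis=0)

# Matthews correlation coefficient over the whole test set.
mcc = matthews_corrcoef(flat_true_labels, flat_predictions)
print(f"Total MCC: {mcc:.3f}")
```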
In about half an hour, you can get a good score without doing any hyperparameter tuning. However, to maximize the score, it's recommended to remove the "validation set" and train on the entire training set.
The accuracy can vary significantly between runs, especially with small dataset sizes. This is why it's essential to evaluate your model's performance consistently.
Sources
- https://docs.ray.io/en/latest/train/examples/transformers/huggingface_text_classification.html
- https://mccormickml.com/2019/07/22/BERT-fine-tuning/
- https://huggingface.co/transformers/v3.2.0/custom_datasets.html
- https://huggingface.co/docs/transformers/v4.15.0/en/custom_datasets
- https://huggingface.co/learn/cookbook/en/fine_tuning_vit_custom_dataset