Fine-tuning GPT-2 with Hugging Face Transformers is a powerful way to adapt the language model to your specific needs: you load a pre-trained GPT-2 checkpoint with the Transformers library and continue training it on your own dataset.
The library keeps this simple and efficient; with just a few lines of code you can load a pre-trained model and start fine-tuning it on your data.
To follow along, install the transformers package with pip and import it in your code.
With the right tools and a little practice, you can have a fine-tuning run going in a matter of minutes.
Getting Started
To get started, clone the repository used to download and train the GPT-2 small model. Fortunately, others have done the hard work of adding training code on top of the GPT-2 small model that OpenAI released.
We're going to use Docker from here on out, just because it's easier to manage the code and dependencies. The repository comes with a Dockerfile, so let's build the image.
To get to a shell, run a container from the image interactively. At this point, you can play with the base GPT-2 small model and generate some text.
Let's try it out with the prompt: "A pair of jumper cables walks into a bar." Once that works, the next step is to download the training data.
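If you'd rather test the same prompt straight from Python with Hugging Face Transformers instead of the repository's own scripts, a quick sanity check might look like this (the model name and sampling settings here are my own choices, not the repository's):

```python
from transformers import pipeline

# Quick test with the base gpt2 checkpoint; sampling settings are illustrative.
generator = pipeline("text-generation", model="gpt2")
result = generator("A pair of jumper cables walks into a bar.",
                   max_length=100, do_sample=True, top_k=40)
print(result[0]["generated_text"])
```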
Fine-Tuning a Large Model
Fine-tuning a pre-trained model like GPT-2 involves continuing training on a specialized corpus to adapt it to a specific task. This process enhances the model's capability in domain-specific tasks without starting from scratch.
There are three ways to fine-tune a pre-trained model: unsupervised, supervised, and reinforcement learning from human feedback. In this article, we'll focus on unsupervised fine-tuning, which involves training the model on a dataset of data without labels.
The resulting model is saved in the "model" folder and can be used for generating recipes. As a quick test, a pre-trained gpt2-medium model can generate text from a prompt such as "beef, salt, pepper".
What Is Fine-Tuning and Why Use It?
Fine-tuning a large model is a powerful technique that can help you achieve better results on a specific task. It starts from an existing pre-trained model and continues training on a specialized corpus, shifting the parameters toward a better loss on that task.
There are three ways to fine-tune a pre-trained model: unsupervised fine-tuning, supervised fine-tuning, and Reinforcement Learning from Human Feedback (RLHF). Unsupervised fine-tuning is used for tasks like text generation, such as cooking recipe generation.
You can choose one of three options for parameter training: retraining all parameters, transfer learning, or Parameter-Efficient Fine-Tuning (PEFT). In this article, we'll be using transfer learning, which freezes most of the pre-trained weights and updates only a subset of the model's parameters (typically the final layers).
Fine-tuning a model is a way to enhance its capability in domain-specific tasks without starting the training process from scratch, which can be extremely time-consuming and computationally expensive.
Here are the three ways to fine-tune a pre-trained model:
- Unsupervised fine-tuning: used for tasks like text generation, such as cooking recipe generation.
- Supervised fine-tuning: used for tasks like sentiment analysis, text summarization, and translation.
- Reinforcement Learning from Human Feedback (RLHF): used for tasks where human feedback is used to fine-tune the model.
Fine-Tuning a Large Model
Fine-tuning a large model is a technique used to adapt a pre-trained model to a specific task or dataset. This is done by continuing to train the model on a specialized corpus, which helps to shift the parameters and achieve better loss on the specific task.
The utility of a pre-trained model lies in its adaptability, allowing you to use it as is or employ transfer learning to fine-tune the model for a specific task. Hugging Face Transformers provides thousands of pre-trained models to perform tasks on text, vision, and audio.
GPT-2 pre-trained models are available in four different sizes: gpt2 (124M parameters), gpt2-medium (345M parameters), gpt2-large (774M parameters), and gpt2-xl (1558M parameters). These models can be used for tasks such as text generation and fine-tuning.
There are three ways to fine-tune a pre-trained model: unsupervised fine-tuning, supervised fine-tuning, and reinforcement learning from human feedback (RLHF). In this article, we will focus on unsupervised fine-tuning, which involves training the model on a dataset without labels.
The training process involves creating a dataset class that preprocesses the tokenized data into a dictionary with keys input_ids, attention_mask, and labels for each data sequence. The dataset is then split into train and validation sets, and the model is trained using the Trainer class or a standard PyTorch training loop.
The training config includes customizable parameters such as output_dir, learning_rate, per_device_train_batch_size, and num_train_epochs. The model is trained for a specified number of epochs, and the validation loss is calculated after each epoch.
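A minimal sketch of such a dataset class (the class name and details are my own, not the exact code from any of the tutorials), assuming the corpus has already been tokenized into equal-length sequences:

```python
import torch
from torch.utils.data import Dataset

class GPT2Dataset(Dataset):
    """Wraps pre-tokenized sequences as dicts of input_ids, attention_mask, labels."""
    def __init__(self, token_sequences):
        self.token_sequences = token_sequences  # list of equal-length token-ID lists

    def __len__(self):
        return len(self.token_sequences)

    def __getitem__(self, idx):
        ids = torch.tensor(self.token_sequences[idx], dtype=torch.long)
        return {
            "input_ids": ids,
            "attention_mask": torch.ones_like(ids),
            # For causal LM training the labels are a copy of input_ids;
            # Hugging Face shifts them by one position when computing the loss.
            "labels": ids.clone(),
        }
```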
Here are the different pre-trained model sizes and their parameters:
- gpt2: 124M parameters
- gpt2-medium: 345M parameters
- gpt2-large: 774M parameters
- gpt2-xl: 1558M parameters
The model is saved in the "model" folder after training, and can be used for generating recipes or other tasks.
Training and Setup
To train and fine-tune GPT-2 with Hugging Face Transformers Trainer, you'll need to start with a large amount of text, or text corpus, which is broken into sequences of uniform size, typically 1024 tokens each.
The model is trained to predict the next token at each step of the sequence, and the labels are identical to the input_ids, but shifted by one position to the left. The training process is demonstrated in the code train_gpt2_trainer1.py, which uses a small toy corpus.
You'll need to load the corpus, tokenize it, and break it into 511-token pieces plus the [END] token, which brings the sequence length to 512. This is done in the function break_text_to_pieces() in the code.
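The original train_gpt2_trainer1.py belongs to the it-jim tutorial, so the following is only a sketch of what break_text_to_pieces() might look like; here I use the tokenizer's end-of-text token as the [END] marker:

```python
def break_text_to_pieces(text, tokenizer, block_len=512):
    """Tokenize a corpus and cut it into (block_len - 1)-token pieces,
    appending the end-of-text token so every piece is exactly block_len long."""
    tokens = tokenizer.encode(text)
    piece_len = block_len - 1  # leave room for the [END] token
    return [
        tokens[i:i + piece_len] + [tokenizer.eos_token_id]
        for i in range(0, len(tokens) - piece_len + 1, piece_len)
    ]
```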
Setup
To get started with training a GPT-2 model, you'll need to install the transformers package from Hugging Face, which provides a PyTorch interface for working with pre-trained models. This package will give you the tools you need to work with GPT-2 models.
The transformers package can be installed using pip, and it's a good idea to install it along with some other libraries for data processing and model optimization using PyTorch. I recommend using the gpt2-medium variant, which has 345 million parameters, for a good balance between performance and computational resources.
If you have a computer equipped with an RTX 2080 or a Colab notebook with a CUDA-enabled GPU, you'll be able to run the code at a reasonable speed, because training runs on the GPU through PyTorch's CUDA support, which significantly speeds up the process.
You can also use the smaller base gpt2 variant if you have limited computational resources. However, keep in mind that the smaller model may not perform as well as the larger variant.
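In practice the setup boils down to installing the packages (for example, pip install transformers torch) and loading the checkpoint you want. Here's a minimal sketch assuming gpt2-medium:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# gpt2-medium offers a good balance of quality and resource needs;
# swap in "gpt2" if GPU memory is tight.
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium").to(device)
```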
Dataset and Collator
The Dataset and Collator is where the magic happens, folks! This is where we create the PyTorch Dataset and the DataLoader with its Data Collator, which together feed data into our model.
We use the MovieReviewsDataset class to create a PyTorch Dataset that returns texts and labels. Since the model needs numbers as input, a collator converts those texts and labels into tensors of IDs.
The collator takes the items output by the PyTorch Dataset and turns them into the batched sequences our model expects. This is a crucial step in preparing our data for the model.
I keep the tokenizer away from the PyTorch Dataset to make the code cleaner and better structured. This helps keep our code organized and easy to understand.
The data collator is used to format the PyTorch Dataset outputs to match the inputs needed for GPT2. This ensures that our data is in the right format for the model to learn from.
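A minimal sketch of such a collator, assuming the dataset returns dicts with 'text' and 'label' keys (the class name and max_length here are my assumptions, not the tutorial's exact code):

```python
import torch

class ClassificationCollator:
    """Turns a batch of {'text': str, 'label': int} items into model-ready tensors."""
    def __init__(self, tokenizer, max_length=128):
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __call__(self, batch):
        texts = [item["text"] for item in batch]
        labels = [item["label"] for item in batch]
        inputs = self.tokenizer(texts, padding=True, truncation=True,
                                max_length=self.max_length, return_tensors="pt")
        inputs["labels"] = torch.tensor(labels)
        return inputs
```

You pass an instance of this collator to the DataLoader via the collate_fn argument, so batching and tensor conversion stay out of the Dataset itself.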
Model Management
Saving and loading a fine-tuned model is a crucial part of the process, and it's essential to know how to do it efficiently.
The saved pytorch_model.bin has a size of about 1.3 GB, so make sure you have enough space on your device.
You'll need to save the trained model weights, configuration, and tokenizer for future use. This will allow you to load the model quickly and easily when you need it again, whether to generate text or to make further adjustments.
Loading the model is a straightforward process, and you'll be able to access all the settings and configurations you saved earlier.
This will save you a lot of time and effort in the long run, especially if you're working on a complex project.
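With Hugging Face Transformers, saving and loading come down to save_pretrained() and from_pretrained(). A short sketch, assuming model and tokenizer are the objects from training and using the "model" folder mentioned earlier:

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# After training: write the weights, config, and tokenizer files to disk.
model.save_pretrained("model")      # assumes `model` is the fine-tuned GPT2LMHeadModel
tokenizer.save_pretrained("model")

# Later: restore everything from the same folder.
model = GPT2LMHeadModel.from_pretrained("model")
tokenizer = GPT2TokenizerFast.from_pretrained("model")
```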
Data Preparation
Data Preparation is a crucial step in fine-tuning GPT-2, and it's where you'll want to start. I got out-of-memory errors with a batch_size above 2, so I process two sequences together in each iteration.
The maximum length of the sequences is 180 tokens, which is far below the 1024-token limit of gpt2-medium, but it is enough to capture the meaning of a generated recipe.
To handle the recipe data, I create a RecipeDataset class using the PyTorch Dataset, and I use tokenizer.encode_plus to tokenize the sentences of each recipe.
The tokenizer.encode_plus function adds special tokens to the start and end of the sentence, maps tokens to their integer IDs, and creates attention masks distinguishing real tokens from [PAD] tokens. With truncation=True, padding='max_length', and max_length=180, I get same-length inputs for the model.
Long texts get truncated to 180 tokens, and short texts have padding tokens added to bring them to 180 tokens. The first token ID of each sequence is 50257, the added bos_token, while 50256 is GPT-2's original eos_token.
The second tensor is the attention mask, with 1 for real tokens and 0 for padding tokens. To divide up the dataset, I use 90% for training and 10% for validation.
I create an iterator over the recipe dataset using the PyTorch DataLoader, which helps save memory during training: batches are produced on demand rather than the whole dataset being pushed through the model at once.
Our data is recipes, which are independent of each other, so the order is not important. Using RandomSampler or SequentialSampler to sample data does not make any difference in this project.
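Putting the pieces of this section together, a sketch of the RecipeDataset and the loaders might look like the following (the original code differs in its details; the tokenizer is assumed to already carry the added bos/eos/pad special tokens described above):

```python
from torch.utils.data import Dataset, DataLoader, RandomSampler, random_split

class RecipeDataset(Dataset):
    """Tokenizes each recipe text to a fixed length of 180 tokens."""
    def __init__(self, texts, tokenizer, max_length=180):
        self.encodings = [
            tokenizer.encode_plus(text, truncation=True, max_length=max_length,
                                  padding="max_length", return_tensors="pt")
            for text in texts
        ]

    def __len__(self):
        return len(self.encodings)

    def __getitem__(self, idx):
        enc = self.encodings[idx]
        return {"input_ids": enc["input_ids"].squeeze(0),
                "attention_mask": enc["attention_mask"].squeeze(0)}

# 90/10 train/validation split and batch_size=2 loaders, as described above.
def make_loaders(recipes, tokenizer):
    dataset = RecipeDataset(recipes, tokenizer)
    train_size = int(0.9 * len(dataset))
    train_set, val_set = random_split(dataset, [train_size, len(dataset) - train_size])
    train_loader = DataLoader(train_set, sampler=RandomSampler(train_set), batch_size=2)
    val_loader = DataLoader(val_set, batch_size=2)
    return train_loader, val_loader
```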
Using GPT-2 for Classification
Using GPT2 for text classification is a bit different from models like BERT, where we only care about the first token of the input sequence. With GPT2 we need to pad on the left instead of the right.
GPT2 is a decoder transformer, which means the last token of the input sequence contains all the information needed for the prediction. This is in contrast to BERT, where we use the first token embedding to make predictions.
Since we're using the last token for prediction, we'll need to configure the GPT2 Tokenizer to pad on the left. Luckily, HuggingFace Transformers has made it easy to do just that.
In this case, we're fine-tuning the GPT2 model for a custom dataset using the HuggingFace Transformers library. This allows us to leverage the strengths of GPT2 for text classification tasks.
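Configuring left padding is just two tokenizer settings. A minimal sketch for a two-class classifier (the base checkpoint and num_labels are assumptions for illustration):

```python
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.padding_side = "left"            # pad on the left so the last token is real
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id  # tell the model which token is padding
```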
GPT and Hugging Face
GPT and Hugging Face are two powerful tools that can be used together to fine-tune the GPT-2 model. The Hugging Face Transformers Trainer is a user-friendly interface that makes it easy to train GPT models, although you may eventually want more control than it offers.
GPT models are trained on a large amount of text, broken into sequences of uniform size, typically 1024 tokens each. The model predicts the next token at each step of the sequence.
The labels for GPT-2 training are identical to the input_ids, but shifted by one position to the left. This shift happens automatically when the loss is calculated in Hugging Face Transformers.
Training a GPT model with Hugging Face transformers involves creating a training config with parameters like learning rate, batch size, and number of epochs. The training config is used to create a trainer instance, which is then used to train the model.
A custom dataset class is used to wrap the training and validation data in PyTorch datasets. The dataset class preprocesses the tokenized data into a dict with keys input_ids, attention_mask, and labels for each data sequence.
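As a rough sketch, the config-plus-trainer flow looks like this (the hyperparameter values are illustrative, and model, train_dataset, and val_dataset are assumed to come from the earlier steps):

```python
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="model",
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,                  # the GPT2LMHeadModel loaded earlier
    args=args,
    train_dataset=train_dataset,  # items with input_ids, attention_mask, labels
    eval_dataset=val_dataset,
)
trainer.train()
```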
The Trainer class in Hugging Face Transformers can be used to train the model, but it may not provide enough control for serious use, since it hides much of the training loop; in that case you can fall back to a standard PyTorch training loop.
GPT Architecture
GPT Architecture is a key component of fine-tuning GPT-2. It's based on a multi-layer, unidirectional (causal) transformer decoder.
The decoder is made up of a series of identical blocks, each consisting of two sub-layers: a masked self-attention mechanism and a fully connected feed-forward network.
The self-attention mechanism allows the model to weigh the importance of different input tokens relative to each other, while the causal mask ensures each position can only attend to earlier tokens.
Each sub-layer is wrapped in a residual connection, meaning the sub-layer's input is added to its output, with layer normalization applied around it.
The output of the final decoder block is passed through a linear layer (tied to the token embeddings) and a softmax over the vocabulary to produce next-token probabilities.
The architecture is designed to handle long-range dependencies and can be fine-tuned for specific tasks by adjusting the weights of the existing layers or adding new ones.
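You can see these stacked decoder blocks directly in the model configuration; for instance, gpt2-medium reports 24 blocks, 16 attention heads, and 1024-dimensional embeddings:

```python
from transformers import GPT2Config

config = GPT2Config.from_pretrained("gpt2-medium")
print(config.n_layer, config.n_head, config.n_embd)  # 24 16 1024
```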
Pre-Trained Model
A pre-trained model is essentially a saved network that has undergone prior training on a substantial dataset.
You can use a pre-trained model as is or employ transfer learning to fine-tune it for a specific task. This is a powerful approach because it saves time and computational resources.
Hugging Face Transformers provides thousands of pre-trained models to perform tasks on text, vision, and audio. These models can be used for inference or as a basis for subsequent fine-tuning.
GPT-2 pre-trained models are available in four different sizes: gpt2, gpt2-medium, gpt2-large, and gpt2-xl. The gpt2-medium model has 345M parameters.
Using a pre-trained model like gpt2-medium can be a great starting point for fine-tuning a model for a specific task, as it has already been trained on a large dataset.
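If you want to verify the sizes yourself, each checkpoint can report its own parameter count (this downloads the weights, so only the two smaller checkpoints are shown here):

```python
from transformers import GPT2LMHeadModel

for name in ["gpt2", "gpt2-medium"]:
    model = GPT2LMHeadModel.from_pretrained(name)
    print(name, f"{model.num_parameters():,} parameters")
```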
Wrap-Up
Fine-tuning GPT-2 can be a bit tricky, and it's easy to overfit on the data.
I've found that if you're not selective in the examples you're choosing, you might end up with a model that's great at predicting the exact jokes you've seen before, but not so great at coming up with something new and creative.
The data from /r/jokes is particularly reflective of typical jokes, with lots of dog jokes and "what's the difference?" type questions.
After 500-1000 epochs of training, the models start to favor a little more generality and abstractness, which can be a nice change of pace.
Sources
- https://www.peterbaumgartner.com/blog/gpt2-jokes/
- https://www.it-jim.com/blog/training-and-fine-tuning-gpt-2-and-gpt-3-models-using-hugging-face-transformers-and-openai-api/
- https://pypi.org/project/transformers/
- https://tuanatran.medium.com/fine-tuning-large-language-model-with-hugging-face-pytorch-adce80dce2ad
- https://gmihaila.github.io/tutorial_notebooks/gpt2_finetune_classification/