Fine-tuning a Large Language Model (LLM) and taking it from data to deployment is a crucial part of creating a production-ready model. The process involves several key steps, including data preparation and model selection.
To begin, you'll need to select a pre-trained model from the Hugging Face model hub, which offers a wide range of pre-trained models, including the popular BERT and RoBERTa families.
Fine-tuning a pre-trained model involves adjusting the model's weights to fit your specific task, typically using a small dataset of labeled examples. Done well, it can significantly improve performance on that task, often by 5-10% or more.
The next step is to prepare your data for fine-tuning. This involves cleaning, tokenizing, and splitting your data into training and validation sets; the Hugging Face datasets library makes it easy to load and preprocess your data.
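As a minimal sketch (the dataset name, text column, and model checkpoint below are placeholders, not taken from the article), loading, tokenizing, and splitting might look like this:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder dataset and checkpoint; substitute your own task-specific data.
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Truncate long examples to the model's maximum input length.
    return tokenizer(batch["text"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

# Carve a validation set out of the training split.
splits = tokenized["train"].train_test_split(test_size=0.1, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
```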
Preparation
Preparing your dataset is a crucial step in fine-tuning your LLM. The quality of your dataset significantly impacts the effectiveness of your fine-tuned model.
You'll need to gather a dataset that is relevant to your task, structured in a way that allows the model to learn from it. The more specific and high-quality your dataset, the better the model will perform on the task at hand.
To create a dataset, you can use publicly available medical datasets such as Medical Meadow WikiDoc and MedQuAD, which consist of question-answer pairs. You can then use the Llama 3 Instruct template to create an instruction prompt from each pair.
Here's a step-by-step guide to preparing your dataset:
- Define the preprocessing for each dataset: renaming columns, dropping unused columns, removing duplicate or NaN rows, adding an instruction to each entry, and creating an instruct prompt for each entry.
- Write a script that triggers the preprocessing, builds instruction datasets, creates a Hugging Face dataset for each source, merges them into one bigger dataset, and carves out a smaller 2k-entry subset (see the sketch after this list).
- Use the SFTTrainer from the TRL library for supervised fine-tuning; it supports all the same features as the Trainer from the transformers library, including logging, evaluation, and checkpointing.
By following these steps, you'll be able to create a high-quality dataset that will help your fine-tuned model perform well on the task at hand.
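To make the prompt-creation step concrete, here is a minimal sketch. The QA rows are invented stand-ins for the Medical Meadow WikiDoc and MedQuAD entries, and the template string is a simplified rendering of the Llama 3 Instruct format; verify the exact tokens against the model card:

```python
from datasets import Dataset, concatenate_datasets

def to_llama3_prompt(example):
    # Simplified Llama 3 Instruct chat format; check the model card for exact tokens.
    example["text"] = (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{example['question']}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{example['answer']}<|eot_id|>"
    )
    return example

# Invented rows standing in for the real medical datasets.
wikidoc = Dataset.from_dict({"question": ["What is hypertension?"],
                             "answer": ["Persistently elevated arterial blood pressure."]})
medquad = Dataset.from_dict({"question": ["What causes iron-deficiency anemia?"],
                             "answer": ["Insufficient iron for hemoglobin production."]})

# Merge the instruction datasets, then carve out the smaller 2k-entry subset.
merged = concatenate_datasets([wikidoc.map(to_llama3_prompt),
                               medquad.map(to_llama3_prompt)]).shuffle(seed=42)
small = merged.select(range(min(2000, len(merged))))
```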
Configuration
Configuration is a crucial step in fine-tuning a Large Language Model (LLM) with Hugging Face. Using Hugging Face's training configuration tools, you provide the Trainer with the necessary parameters: metrics, a base model, and a training configuration.
Metrics can be added to the default loss metric that the Trainer computes, such as accuracy for text classification tasks. You can use AutoModelForSequenceClassification to load a base model for text classification, providing the number of classes and label mappings created during dataset preparation.
The TrainingArguments class allows you to specify the output directory, evaluation strategy, learning rate, and other parameters, while DataCollatorWithPadding batches and pads the inputs in the training and evaluation datasets, giving good baseline performance for text classification tasks.
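Put together, a configuration sketch might look like the following; `train_ds`, `val_ds`, `id2label`, and `label2id` are assumed to come from your dataset-preparation step, and the checkpoint name is a placeholder:

```python
import numpy as np
import evaluate
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

# Accuracy added on top of the default loss metric.
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return accuracy.compute(predictions=np.argmax(logits, axis=-1), references=labels)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(id2label),  # label mappings from dataset preparation
    id2label=id2label,
    label2id=label2id,
)

args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # renamed eval_strategy in recent versions
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorWithPadding(tokenizer),  # pads each batch dynamically
    compute_metrics=compute_metrics,
)
```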
Log In to the Hugging Face Hub
To log in to the Hugging Face Hub, you'll need to use a read/write access token. This is required to upload your fine-tuned model.
You can create a read/write access token by following the instructions provided by Hugging Face. This will give you the necessary credentials to access the Hub.
Creating the access token is a prerequisite for the rest of the workflow, so make sure to do it before proceeding.
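A minimal sketch of logging in from Python (the token string is a placeholder for your own read/write token):

```python
from huggingface_hub import login

# Paste a read/write token created at https://huggingface.co/settings/tokens
login(token="hf_...")  # in a notebook you can also use notebook_login()
```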
Creating the Config
Creating the config is a crucial step in fine-tuning a model, and it's where you decide the specifics of your configuration. It's like building with Legos: you start from a base model and add the details that make it unique.
The TrainingArguments class is what allows you to specify the output directory, evaluation strategy, learning rate, and other parameters. You can think of it like a blueprint for your model's training.
To create the config, you'll need to decide on the base model, the name of the new fine-tuned model, and the data type of the values of the matrices. This is where you get to customize your model to fit your specific needs.
The Hugging Face Transformers library makes this process much easier by offering Auto-Classes like AutoModel, AutoTokenizer, and AutoConfig. These auto-classes automate much of the setup, making it easier to get started.
Here are some key parameters to consider when creating your config:
- The base model to start from.
- The name of the new fine-tuned model.
- The data type (dtype) of the values of the weight matrices.
- The output directory, evaluation strategy, and learning rate passed to TrainingArguments.
By carefully considering these parameters, you can create a config that's tailored to your specific needs and helps you achieve the best possible results.
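As a sketch of how the Auto-Classes tie these choices together (the model name, new-model name, and dtype below are illustrative assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "meta-llama/Meta-Llama-3-8B-Instruct"  # the base model to start from
new_model = "llama-3-8b-medical-instruct"           # hypothetical name for the fine-tuned model

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,  # data type of the weight matrices
    device_map="auto",           # place layers on available devices
)
```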
Feature Encoder & Preprocessing
Setting trainable to false can speed up training during fine-tuning by over 50x when combined with embedding caching, because together they remove the forward pass of the encoder from the training workflow.
To take full advantage of this, you should set cache_encoder_embeddings to true in the preprocessing section of the feature config. This moves the forward pass of the encoder into the preprocessing portion of the training workflow.
By caching embeddings in the preprocessed data, you can reuse it for new model training experiments. This means that any subsequent training runs for the same dataset will require no forward passes on the pretrained model.
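A sketch of what this looks like in a Ludwig config, written as a Python dict (the feature and column names are invented; field placement follows the Ludwig docs):

```python
# Freeze the text encoder and cache its embeddings at preprocessing time.
config = {
    "input_features": [
        {
            "name": "review",  # hypothetical text column
            "type": "text",
            "encoder": {
                "type": "bert",
                "use_pretrained": True,
                "trainable": False,  # keep the pretrained weights fixed
            },
            "preprocessing": {
                # Run the encoder forward pass once, during preprocessing.
                "cache_encoder_embeddings": True,
            },
        }
    ],
    "output_features": [{"name": "sentiment", "type": "category"}],
}
```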
Training
Training is where the magic happens. You'll need to configure your training configuration using Hugging Face's training configuration tools, which require providing metrics, a base model, and a training configuration.
To configure evaluation metrics, you can add accuracy as a metric, as shown in the example. For text classification, use AutoModelForSequenceClassification to load a base model, providing the number of classes and label mappings created during dataset preparation.
The TrainingArguments class is used to specify the output directory, evaluation strategy, learning rate, and other parameters, and DataCollatorWithPadding again gives good baseline performance for text classification.
You can also use techniques like LoRA (Low-Rank Adaptation) via the peft library to reduce memory usage for large models, which is especially helpful if you later plan to serve the model with FastAPI and Uvicorn on limited hardware.
Before starting training, it's a good idea to check the memory usage statistics. This will help you understand how much memory your model is using before and after training.
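A quick way to snapshot GPU memory with PyTorch before kicking off training (a simple sketch; adjust the device index to your setup):

```python
import torch

gpu = torch.cuda.get_device_properties(0)
print(f"GPU: {gpu.name}, total memory: {gpu.total_memory / 1e9:.1f} GB")
# Peak memory reserved so far; compare against the same call after training.
print(f"Max memory reserved: {torch.cuda.max_memory_reserved() / 1e9:.2f} GB")
```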
Here are some key parameters to consider when defining the hyperparameters for your SFTTrainer:
- evaluation_strategy: "steps" or "epoch"
- eval_steps: 100
- do_eval: True
These parameters can affect the training time and resource usage, so be sure to experiment with different values to find the optimal configuration for your model.
SFTTrainer supports example packing, which can increase training efficiency by packing multiple short examples into a single input sequence. This can be a great way to speed up training, especially for large datasets.
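A sketch combining these hyperparameters with packing, using TRL's SFTConfig (available in recent TRL releases; `model`, `train_ds`, and `val_ds` are assumed from the earlier steps):

```python
from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    output_dir="./sft_output",
    evaluation_strategy="steps",  # evaluate every eval_steps instead of per epoch
    eval_steps=100,
    do_eval=True,
    packing=True,  # pack multiple short examples into one input sequence
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
)
trainer.train()
```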
Fine Tuning
Fine Tuning is a crucial step in adapting a pre-trained Large Language Model (LLM) to a specific task or domain. This process involves training the model further on a custom dataset that contains examples relevant to the desired application. Fine-tuning helps in several ways, including enhancing relevance, improving specificity, and contextual understanding.
There are two types of fine-tuning depending on which model weights are updated during the process: Full Fine-Tuning (Real Instruction Fine-Tuning) and Parameter Efficient Fine-Tuning (PEFT). Full Fine-Tuning involves updating all model weights, while PEFT involves updating only a subset of the model's parameters.
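For the PEFT route, here is a minimal LoRA sketch with the peft library (the checkpoint and target modules are illustrative; target module names vary by architecture):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

lora_config = LoraConfig(
    r=16,                # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trainable
```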
To fine-tune a model, you can use the HuggingFace encoders in Ludwig, which can be used for fine-tuning when `use_pretrained=true` in the encoder config (default). You can also use the auto_transformer encoder in conjunction with providing the model name in the `pretrained_model_name_or_path` parameter.
Here are some best practices to keep in mind when fine-tuning a model:
- Monitor Performance: Use real-world data to assess model accuracy.
- Feedback Loop: Implement a system where user feedback can improve the model.
- Regular Updates: Periodically retrain the model with new data.
- Version Control: Keep track of different model versions for easy rollback.
Fine-tuning a model can significantly improve its performance on a specific task or domain. However, it also imposes significant demands on memory and computational resources, similar to pre-training. To mitigate this, you can use techniques such as caching encoder embeddings, which can speed up training by over 50x.
Supported Features
Fine-tuning is a powerful technique, and Ludwig makes it surprisingly easy. All of the HuggingFace encoders in Ludwig can be used for fine-tuning when use_pretrained=true in the encoder config (default).
You can use any of the HuggingFace encoders in Ludwig without having to start from scratch. If you want to use a specific model that's not listed, you can use the auto_transformer encoder in conjunction with providing the model name in the pretrained_model_name_or_path parameter.
Ludwig also supports fine-tuning with Torchvision pretrained models. These models can be used for fine-tuning when use_pretrained=true in the encoder config (default).
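A sketch of pointing the auto_transformer encoder at an arbitrary Hub model (the model name is an example; field names follow the Ludwig docs):

```python
# Text input feature backed by any Hugging Face model via auto_transformer.
input_feature = {
    "name": "notes",  # hypothetical text column
    "type": "text",
    "encoder": {
        "type": "auto_transformer",
        "pretrained_model_name_or_path": "microsoft/deberta-v3-base",
        "trainable": True,  # fine-tune the encoder weights
    },
}
```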
Cache Encoder Embeddings
Setting trainable to false keeps the weights fixed during fine-tuning, but the encoder's forward pass still runs on every example, which can be slow. Setting cache_encoder_embeddings to true in the preprocessing section can speed up training by over 50x.
This is because embedding caching moves the forward pass of the encoder that generates text embeddings into the preprocessing portion of the Ludwig training workflow, removing this step entirely from training.
By caching preprocessed data, Ludwig can reuse it for new model training experiments, making subsequent training runs for the same dataset require no forward passes on the pretrained model whatsoever.
Train and Log to MLflow
You can use MLflow to log your fine-tuned model, and it integrates well with Hugging Face. To log your trained model, you must do it manually with `mlflow.transformers.log_model`.
To log your model, wrap your training in an MLflow run, construct a Transformers pipeline from the tokenizer and the trained model, and pass that pipeline to `mlflow.transformers.log_model`, which writes it to local disk. If you don't need to create a pipeline, you can instead submit the components used in training as a dictionary.
Here's a summary of the steps:
- Start an MLflow run around your training code.
- Build a pipeline from the fine-tuned model and tokenizer (or collect the components in a dictionary).
- Call `mlflow.transformers.log_model` to record the model.
Note that you must log the trained model yourself, as Hugging Face only automatically logs metrics during model training using the MLflowCallback.
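A sketch of the flow (assumes `model` and `tokenizer` are the fine-tuned objects; the pipeline task and artifact path are placeholders):

```python
import mlflow
from transformers import pipeline

pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

with mlflow.start_run():
    # Training would run inside this block; MLflowCallback logs metrics automatically.
    # The trained model itself must be logged manually:
    mlflow.transformers.log_model(
        transformers_model=pipe,  # or a dict like {"model": model, "tokenizer": tokenizer}
        artifact_path="fine_tuned_model",
    )
```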
Optimizer
When fine-tuning a pre-trained model, it's essential to choose the right optimizer to avoid catastrophic forgetting. AdamW is typically recommended over other optimizers due to its improved handling of weight decay.
AdamW is a robust optimizer that can handle the weight decay necessary during fine-tuning without compromising performance. This is particularly important when training a model on a specific dataset to avoid forgetting the knowledge it gained during pre-training.
If you're new to fine-tuning, you might be wondering why AdamW is the preferred choice. The reason is that AdamW decouples weight decay from the gradient update, handling it more effectively than traditional optimizers like SGD or Adam.
To get the most out of AdamW, it's recommended to use a very small learning rate, especially when trainable=True. This will help prevent catastrophic forgetting and ensure that the model adapts well to the new dataset.
As a rule of thumb, use a very small learning rate with AdamW: values in the 1e-5 to 5e-5 range are common starting points for full fine-tuning, and you can go somewhat higher when only a small classification head is trainable. By following these guidelines and using AdamW as your optimizer, you'll be well on your way to fine-tuning your pre-trained model and achieving strong results on your specific task or domain.
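A short sketch of wiring AdamW and a small learning rate into TrainingArguments (the specific values are common starting points, not prescriptions from the article):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./results",
    optim="adamw_torch",   # PyTorch's AdamW implementation
    learning_rate=2e-5,    # very small, to avoid catastrophic forgetting
    weight_decay=0.01,     # decoupled weight decay handled by AdamW
)
```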
Full Fine Tuning
Full fine-tuning is a comprehensive approach to enhancing a model's performance across diverse tasks by training it on guiding examples for responding to queries. This method involves updating all model weights, resulting in an optimized version.
The selection of the dataset is pivotal and should be tailored to the specific task at hand, be it summarization or translation. This is especially crucial for instruction fine-tuning.
Full fine-tuning can impose significant demands on memory and computational resources, akin to pre-training, necessitating robust infrastructure to manage storage and processing during training. This is particularly true for large models.
To give you a better idea of the process, here's a brief overview of the steps involved in full fine-tuning:
- Converting the model and tokenizer to use the conversational format (see the chat-template sketch below).
- Training the model on a custom dataset that contains examples relevant to the desired application.
- Updating all model weights to optimize the model for the specific task.
By following these steps and selecting the right dataset, you can achieve significant improvements in your model's performance and unlock new possibilities for your applications.
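For the conversational-format step, here is a sketch using the tokenizer's chat template (available on recent transformers tokenizers; the messages are invented and `tokenizer` is assumed to be loaded already):

```python
# Render a QA pair in the tokenizer's conversational format.
messages = [
    {"role": "user", "content": "What is hypertension?"},
    {"role": "assistant", "content": "Persistently elevated arterial blood pressure."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)  # the fully templated training string
```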
Saving Trainer Stats
Saving Trainer Stats is a crucial step in the fine-tuning process. After fine-tuning the model, we need to keep track of the trainer_stats, such as time required for the training job and training loss.
We'll save the fine-tuned model locally in the Google Colab notebook environment, and save the trainer_stats alongside it so we can monitor the model's performance and make adjustments as needed.
The saved trainer_stats include the time required for the training job, the training loss, and other relevant metrics, giving a clear picture of the model's performance and highlighting areas for improvement.
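A minimal sketch (assumes `trainer` is an already-configured SFTTrainer; `train()` returns a TrainOutput whose `.metrics` include the runtime and loss):

```python
import json

trainer_stats = trainer.train()

# Save the fine-tuned model and tokenizer locally (e.g., in the Colab filesystem).
trainer.save_model("fine_tuned_model")

# Persist the stats (train_runtime, train_loss, ...) alongside the model.
with open("fine_tuned_model/trainer_stats.json", "w") as f:
    json.dump(trainer_stats.metrics, f, indent=2)
```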
Train Transformer with RL
Train Transformer with RL is a powerful way to fine-tune your model. By leveraging reinforcement learning (RL) to optimize the transformer's parameters, you can achieve state-of-the-art results in various tasks.
RL can be used to optimize the transformer's encoder and decoder separately, or simultaneously. This allows you to focus on specific areas of the model that need improvement.
The RL algorithm can be designed to maximize the reward function, which is typically a function of the model's performance on a specific task. For example, in machine translation, the reward function might be the BLEU score.
To train a transformer with RL, you need to define the environment, agent, and reward function. The environment is the task or problem you're trying to solve, the agent is the transformer model, and the reward function is the metric you're using to evaluate the model's performance.
RL can be used to fine-tune a pre-trained transformer model, or to train a transformer model from scratch. Either way, the goal is to optimize the model's parameters to achieve the best possible performance on the task at hand.
The RL algorithm can be trained using a variety of methods, including policy gradient methods and actor-critic methods. These methods can be used to optimize the model's parameters in a way that maximizes the reward function.
RL can be used to fine-tune a transformer model on a specific task, such as machine translation or text classification, and can push performance beyond what supervised fine-tuning alone achieves; a concrete sketch follows below.
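To ground this, here is a heavily hedged sketch using the classic TRL PPO API (pre-0.12 releases; newer versions reorganized this interface). The model, prompt, and constant reward are all placeholders:

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# The agent: a causal LM with a value head for PPO's critic.
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(
    config=PPOConfig(batch_size=1, mini_batch_size=1),
    model=model,
    tokenizer=tokenizer,
)

# The "environment": a query the model responds to.
query_tensor = tokenizer.encode("Translate to French: hello ->", return_tensors="pt")
response_tensor = ppo_trainer.generate(
    list(query_tensor), return_prompt=False, max_new_tokens=8
)

# The reward function: a dummy constant here; in practice e.g. a BLEU-like score.
reward = [torch.tensor(1.0)]
stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)
```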
Sources
- https://ludwig.ai/latest/user_guide/distributed_training/finetuning/
- https://docs.databricks.com/en/machine-learning/train-model/huggingface/fine-tune-model.html
- https://mlops.community/budget-instruction-fine-tuning-of-llama-3-8b-instructon-medical-data-with-hugging-face-google-colab-and-unsloth/
- https://blog.stackademic.com/fine-tuning-large-language-models-with-hugging-face-a-comprehensive-guide-from-data-to-deployment-9484628893e8
- https://medium.com/@jayeshchouhan826/the-ultimate-guide-to-fine-tuning-large-language-models-with-hugging-face-c971e588bf02