Fine-tuning a Large Language Model (LLM) on custom data is a game-changer for businesses and individuals who need tailored language understanding.
By fine-tuning open models such as CodeLlama and Mistral, you can adapt an LLM to your specific needs and data, leading to improved performance and accuracy on your tasks.
To get started, you'll need to prepare your custom data, which for an LLM typically means a collection of text examples relevant to your task. This data will serve as the foundation for fine-tuning your LLM.
CodeLlama in particular is a flexible base model, available in several sizes, which makes it a practical choice for fine-tuning on custom data.
Preparing Your Data
You'll need to load and preprocess your custom data to fine-tune your LLM.
First, create a JSONL file in the src directory; Axolotl supports many dataset formats, and a simple JSONL file is an easy place to start. This will be the basis for your custom dataset.
To fine-tune BERT for Question-Answering, convert your data into SQuAD format, which consists of three fields: context, question, and answers. The context is the sentence or paragraph within which the model will search for the answer to the question.
You'll need to provide the desired answer under the answers field, which has two sub-components: text and answer_start. The text holds the answer string, while answer_start is the character index at which the answer begins in the context paragraph.
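For illustration only, a single record in this format might look like the following (the context, question, and answer are invented):

```python
# The answer text must occur verbatim in the context, and answer_start is
# the character offset at which it begins.
example = {
    "context": "The Eiffel Tower was completed in 1889 and is located in Paris.",
    "question": "When was the Eiffel Tower completed?",
    "answers": {"text": ["1889"], "answer_start": [34]},
}
```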
You can use the Haystack annotation tool to easily create this data, as it would take a lot of time to do it manually.
To load and preprocess your data, you can use the Dataset class from PyTorch's torch.utils.data module to define a custom class for your dataset. This involves initializing variables, loading the JSON file, and encoding the question and context into input tensors using the BERT tokenizer.
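Here is a minimal sketch of such a class; the class name, file layout, and maximum sequence length are assumptions for illustration rather than part of the original tutorial:

```python
import json

import torch
from torch.utils.data import Dataset
from transformers import BertTokenizerFast


class SquadStyleDataset(Dataset):
    """Loads SQuAD-style records and encodes question/context pairs for BERT."""

    def __init__(self, json_path, max_length=384):
        # Assumes the file contains a JSON list of records like the example above.
        with open(json_path) as f:
            self.records = json.load(f)
        self.tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
        self.max_length = max_length

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        record = self.records[idx]
        # Encode the question and context together into input tensors.
        encoding = self.tokenizer(
            record["question"],
            record["context"],
            truncation="only_second",
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt",
        )
        # Map the character-level answer_start to token positions here if you
        # need start/end labels for training (omitted for brevity).
        return {k: v.squeeze(0) for k, v in encoding.items()}
```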
To recap, each record needs a context, a question, and an answers field containing the answer text and its answer_start offset. With these pieces in place, your custom data is ready for fine-tuning your LLM.
Setting Up
To fine-tune an LLM on custom data, you'll need to set up a suitable environment. Start by opening a new Jupyter notebook, like a Kaggle notebook, which offers 30 hours of free GPU usage per week.
For this tutorial, we'll use the GPU P100 as the accelerator, but feel free to experiment with other options. This will provide the necessary computational power for our experiment.
To download models from HuggingFace, you'll need an Access Token, which you can generate from the settings section if you've already signed up.
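One common way to authenticate inside the notebook, assuming the huggingface_hub package is available, is to call its login helper with your token:

```python
from huggingface_hub import login

# Paste the Access Token generated in your HuggingFace account settings.
login(token="hf_xxx_your_token_here")
```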
Install Required Libraries
To install the required libraries, start with the bitsandbytes package, which provides a lightweight wrapper around custom CUDA functions (8-bit optimizers and quantization) so the model can be loaded as efficiently as possible.
We'll also need the transformers library from Hugging Face, which provides pre-trained models and training utilities for a range of natural language processing tasks.
Additionally, we'll install peft, another Hugging Face library that enables parameter-efficient fine-tuning, letting us fine-tune the model by training only a small number of additional parameters.
The accelerate library abstracts exactly and only the boilerplate code related to multi-GPU/TPU/fp16 training and leaves the rest of our code unchanged.
The datasets library provides easy access to a wide range of datasets and makes loading them straightforward.
Finally, we'll install einops, which simplifies tensor operations.
Here's a list of the libraries we'll need to install:
- Bitsandbytes
- Transformers
- Peft
- Accelerate
- Datasets
- Einops
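In a Kaggle or Jupyter notebook cell, you can install them all at once (a plain sketch; pin versions as needed):

```python
# Run in a notebook cell; -q keeps pip's output short.
!pip install -q bitsandbytes transformers peft accelerate datasets einops
```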
Set Up PEFT
To set up PEFT, define the LoRA config for fine-tuning the base model. The rank (r) hyper-parameter sets the rank/dimension of the low-rank adapter matrices to be trained, which controls the number of trainable parameters: a higher rank allows more expressivity, but at a compute cost.
The alpha hyper-parameter is the scaling factor for the learned weights: the low-rank update is scaled by alpha/r, so a higher value for alpha assigns more weight to the LoRA activations.
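A minimal PEFT setup along these lines might look like the sketch below; the rank, alpha, dropout, and target module names are example values, and the base model is simply the one used elsewhere in this guide:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base model name is an example; substitute the checkpoint you are fine-tuning.
model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")

lora_config = LoraConfig(
    r=8,                 # rank of the low-rank adapter matrices
    lora_alpha=16,       # scaling factor: the LoRA update is scaled by alpha / r
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows how few parameters are trainable
```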
Set Up Training Arguments
Setting up training arguments is a crucial step in fine-tuning a language model. `output_dir` defines the directory where the training results will be written.
`per_device_train_batch_size` defines the batch size for each GPU; you can increase or decrease it depending on the available GPU memory.
`gradient_accumulation_steps` is used to increase the effective batch size without increasing the GPU memory usage. It defines the number of update steps to accumulate the gradients before performing a backward/update pass.
The `optim` argument defines the optimizer used for training. In this case, `paged_adamw_32bit` is used: a paged variant of `AdamW` (from bitsandbytes) that keeps optimizer states in 32-bit precision and pages them to CPU memory to avoid out-of-memory spikes on the GPU.
`fp16` defines whether to use 16-bit floating-point precision during training, which can significantly reduce memory usage at a small potential cost in numerical accuracy.
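Pulling these arguments together, a minimal `TrainingArguments` setup might look like the following sketch (the numeric values are placeholders to adjust for your GPU and dataset):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",              # where checkpoints and logs are written
    per_device_train_batch_size=4,       # batch size per GPU
    gradient_accumulation_steps=4,       # effective batch size = 4 * 4 = 16
    optim="paged_adamw_32bit",           # paged AdamW from bitsandbytes
    learning_rate=2e-4,
    num_train_epochs=3,
    fp16=True,                           # 16-bit precision to save memory
    logging_steps=10,
)
```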
Training the Model
Training the Model is a crucial step in fine-tuning your Large Language Model (LLM) on custom data. You can use a pre-trained model like BertForQuestionAnswering, which is best suited for QA tasks.
To start, load the pre-trained weights by calling the from_pretrained function, which will download and initialize the bert-base-uncased model.
The next step is to choose the loss function and optimizer you'll use for training, for example a cross-entropy loss with the Adam optimizer.
To load your custom data, you can use PyTorch's DataLoader class, which serves the data in batches and can shuffle it to avoid any ordering bias in training.
Once you have your data loader defined, you can write the final training loop. During each iteration, each batch obtained from the data loader contains batch_size number of examples, on which forward and backward propagation is performed.
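A bare-bones version of that loop might look like the following sketch, assuming each batch from your DataLoader already contains input_ids, attention_mask, and the answer start/end token positions (the dataset variable and hyperparameters are placeholders):

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import BertForQuestionAnswering

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pre-trained weights of bert-base-uncased.
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased").to(device)
optimizer = AdamW(model.parameters(), lr=3e-5)

# 'train_dataset' is your custom Dataset; shuffling helps avoid ordering bias.
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

model.train()
for epoch in range(3):
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()
        # When start/end positions are supplied, the model computes the
        # cross-entropy loss over answer spans internally.
        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            start_positions=batch["start_positions"],
            end_positions=batch["end_positions"],
        )
        outputs.loss.backward()
        optimizer.step()
```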
To find the best set of weights, you then run the training job with your config and data, specifying the files with the --config and --dataset flags.
Here are the key hyperparameters you'll need to consider:
- learning rate and optimizer
- batch size and gradient accumulation steps
- number of training epochs
- LoRA rank (r), alpha, and dropout, if using PEFT
Remember to check that your fine-tuned model is stored properly in the training run folder using `modal volume ls`.
Fine-Tuning Methods
Fine-tuning a Large Language Model (LLM) involves a supervised learning process, where a dataset with labeled examples is used to adjust the model's weights, enhancing its proficiency in specific tasks.
There are two key methods employed in the fine-tuning process: Full Fine Tuning and Parameter Efficient Fine-Tuning (PEFT). Full Fine Tuning updates all model weights, creating a new version with improved capabilities, but demands significant computational resources.
Parameter Efficient Fine-Tuning (PEFT), on the other hand, is a more efficient approach that updates only a subset of parameters, effectively "freezing" the rest. This reduces the number of trainable parameters, making memory requirements more manageable and preventing catastrophic forgetting.
Here are the key differences between Full Fine Tuning and PEFT:
- Parameters updated: all model weights vs. a small subset (the rest are frozen)
- Compute and memory: significant resources, on the order of pre-training, vs. much more manageable requirements
- Catastrophic forgetting: a real risk vs. largely avoided, since the original weights are preserved
- Storage: a full model copy per task vs. a small adapter per task
This approach proves beneficial for handling storage issues when fine-tuning for multiple tasks, and is achieved through various methods, including Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA), which are the most widely used and effective.
The choice of dataset is crucial and should be tailored to the specific task, such as summarization or translation. Because full fine-tuning updates every weight, it demands memory and compute on a scale similar to pre-training: gradients, optimizer states, and other training components must be stored and processed for the entire model.
Parameter Efficient Fine-Tuning (PEFT), a form of instruction fine-tuning, avoids this cost: the original LLM weights stay frozen, so previously learned knowledge is preserved, and only a small set of adapter parameters needs to be trained and stored per task. This is especially helpful when fine-tuning the same base model for multiple tasks. Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA) are the most widely used and effective ways of achieving it.
LLM Fine-Tuning
LLM Fine-Tuning is a process of adjusting a pre-trained Large Language Model to perform a specific task, such as sentiment analysis or named entity recognition. This approach is significant because training a large language model from scratch is highly resource-intensive.
Fine-tuning involves selecting a pre-trained model, gathering a relevant dataset, preprocessing the data, and fine-tuning the model on the preprocessed dataset. The model's parameters are adjusted based on the new dataset, helping it better understand and generate content relevant to the specific task.
Fine-tuning can be done using various techniques, including full fine-tuning and parameter-efficient fine-tuning (PEFT). Full fine-tuning updates all model weights, creating a new version with improved capabilities, but it demands significant computational resources. PEFT, on the other hand, updates only a subset of parameters, reducing the number of trainable parameters and making memory requirements more manageable.
Some common methods used in fine-tuning include LoRA (Low-Rank Adaptation) and QLoRA (Quantization with Low-Rank Adaptation). LoRA decomposes the weight change matrix of an LLM into low-rank matrices, reducing the number of trainable parameters. QLoRA adds quantization to LoRA, reducing memory consumption without a significant reduction in performance.
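As a back-of-the-envelope illustration of why this helps (the matrix sizes below are made up, roughly the scale of a 7B model's attention projections):

```python
# LoRA replaces the full update to a d x k weight matrix W with two small
# factors B (d x r) and A (r x k), applied as W + (alpha / r) * B @ A.
d, k, r = 4096, 4096, 8

full_update_params = d * k       # parameters touched by full fine-tuning
lora_params = d * r + r * k      # parameters trained by the LoRA adapter

print(full_update_params)        # 16777216
print(lora_params)               # 65536  (256x fewer trainable parameters)
```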
Fine-tuning can be done using various open-source packages, such as PyTorch and Transformers. For example, the transformers library provides a BertTokenizer class specifically for tokenizing inputs to the BERT model.
Here are some benefits of fine-tuning:
- Saves time and resources compared to training from scratch
- Reduces data requirements, making it possible to achieve good performance with smaller amounts of data
- Customizes the model to specific domains or applications
- Enables continual learning, allowing the model to be updated with new data and knowledge
The choice of fine-tuning method depends on the specific task and requirements. For example, if the task is text generation, a model like GPT-3 or GPT-2 may be a better choice. If the task is text classification, question answering, or entity recognition, BERT may be a better option.
Configuring and Serving
Configuring and serving your fine-tuned model is a breeze. You can serve a model for inference with a single command, specifying which training run folder to load the model from via the --run-folder flag.
To customize your training parameters and options, you'll need to edit the config file. This is done by duplicating one of the example_configs to src/config.yml and modifying it as needed. The most important options to consider are model base_model, dataset, LoRA, and multi-GPU training.
Here are the key config options to consider:
- model base_model: codellama/CodeLlama-7b-Instruct-hf
- dataset: path to your local .jsonl file (see the Axolotl docs for all dataset options)
- LoRA: adapter, r, alpha, dropout, and target_modules
- multi-GPU training: use DeepSpeed for easy setup, and specify GPU_MEM and N_GPUS in train.py
Once you've configured your model, you can easily deploy it to production for serverless inference via Modal's web endpoint feature. Modal will handle all the auto-scaling for you, so you only pay for the compute you use.
Config
Configuring your Axolotl model is a breeze, thanks to the customizable config file. You can duplicate one of the example configs to src/config.yml and modify it as needed.
The config file is where you'll set up your training parameters and options. You can customize everything from the model to the dataset to the logging settings.
The model option `base_model: codellama/CodeLlama-7b-Instruct-hf` is a crucial one to consider: it determines the base model your fine-tune starts from.
You can also specify the dataset. By default it is a local .jsonl file uploaded from the src folder in completion format, but Axolotl supports many other dataset configurations:

```yaml
datasets:
  - path: my_data.jsonl
    ds_type: json
    type: completion
```
Another important option is the LoRA (Low-Rank Adaptation) adapter, which you can use to fine-tune your model. You can specify the adapter, r, alpha, dropout, and target modules. For example:

```yaml
adapter: lora  # for qlora, or leave blank for full finetune
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
```
For multi-GPU training, we recommend using DeepSpeed, which is easy to set up. You can specify which of the default DeepSpeed JSON configurations to use in your config.yml: `deepspeed: /root/axolotl/deepspeed/zero3.json`.
To track your training runs with Weights and Biases, attach your W&B secret to the Modal app:

```python
import modal

app = modal.App(
    "my_app",
    secrets=[
        modal.Secret.from_name("huggingface"),
        modal.Secret.from_name("my-wandb-secret"),
    ],
)
```

and set the wandb options in your config.yml:

```yaml
wandb_project: mistral-7b-samsum
wandb_watch: gradients
```
Here's a summary of the most important config options:
- Model: `base_model: codellama/CodeLlama-7b-Instruct-hf`
- Dataset: the `datasets` entry (path, ds_type, type)
- LoRA: `adapter`, `lora_r`, `lora_alpha`, `lora_dropout`, `lora_target_modules`
- Multi-GPU training: `deepspeed: /root/axolotl/deepspeed/zero3.json`
- Logging with Weights and Biases: `wandb_project` and `wandb_watch`, plus the `my-wandb-secret` Modal secret
Serving Your Model
You can serve a model for inference with a single command, loading the model from a training run folder via the --run-folder flag.
The run folder name is found in the training log output. For example, you can spawn a vLLM inference container for any pre-trained or fine-tuned model from a previous training job.
vLLM is used to speed up inference by up to 24x.
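As a rough sketch of what vLLM usage looks like (the model path and prompt are placeholders; in the Modal example the engine runs inside a container function):

```python
from vllm import LLM, SamplingParams

# Point this at your fine-tuned (or adapter-merged) model directory.
llm = LLM(model="/runs/your-run-folder/merged")
sampling = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(["Write a Python function that reverses a string."], sampling)
print(outputs[0].outputs[0].text)
```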
Evaluation and Tracking
Evaluation and Tracking is a crucial step in fine-tuning your LLM on custom data. ROUGE metric evaluation is a key tool in this process, allowing you to quantify the validity of summarizations produced by your model.
To use ROUGE metric evaluation, you compare your model's summarizations to a baseline summary created by a human; the difference can be expressed as a percentage improvement in summarization effectiveness, helping you gauge the impact of fine-tuning.
The ROUGE metric is a set of metrics and a software package used for evaluating automatic summarization and machine translation in natural language processing. It compares an automatically produced summary or translation against a reference (or a set of references) produced by a human.
Here are some key features of ROUGE metric evaluation:
- Quantifies the validity of summarizations produced by your model
- Compares to a baseline summary created by a human
- Gives a percentage increase in summarization effectiveness
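A quick way to compute ROUGE scores, assuming you install the Hugging Face evaluate package (not listed among the libraries above), is:

```python
import evaluate

rouge = evaluate.load("rouge")

predictions = ["the model's generated summary of the dialogue"]
references = ["the human-written baseline summary of the dialogue"]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # rouge1, rouge2, rougeL, rougeLsum values between 0 and 1
```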
In addition to ROUGE metric evaluation, you can also use Comet for experiment tracking. Comet allows you to inspect experiments, parameters, code, metrics, and other metadata fields and artifacts. It also provides charts and panels to monitor the fine-tuning process.
Evaluate Model with ROUGE Metric
The ROUGE metric is a powerful tool for evaluating the effectiveness of summarization models. It compares automatically produced summaries to human-created references, providing a quantitative measure of a model's performance.
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation, and it's a software package used for evaluating automatic summarization and machine translation software. The metrics it uses are based on n-gram overlaps between the model's output and the reference summary.
The ROUGE metric is not perfect, but it's a useful indicator of a model's improvement. For example, the PEFT model shows a significant improvement compared to the original model, with a noticeable increase in summarization effectiveness.
To use the ROUGE metric, you compare your model's output to a baseline summary created by a human. This helps you evaluate the overall effectiveness of your model's summarization capabilities.
Comet Experiment Tracking
Comet Experiment Tracking is a powerful tool for monitoring and comparing experiments. It allows you to track experiments on CometML and inspect the parameters, code, metrics, and other metadata fields and artifacts.
Upon selecting an experiment, you can view the model's graph definition (a summary of its layers and modules), the logged hyperparameters and metrics, system metrics (GPU and CPU usage while the experiment is running), code changes, and more. This detailed view helps you understand the experiment's performance and identify potential issues.
To enable Comet to log everything automatically by default, make sure you import comet_ml before importing torch in your script. This will save you time and effort in tracking your experiments.
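A minimal sketch of that import order (the project name is only an example):

```python
import comet_ml  # must come before torch so Comet can hook into it automatically
import torch

# Reads the COMET_API_KEY environment variable; the project name is illustrative.
experiment = comet_ml.Experiment(project_name="mistral-7b-finetuning")
experiment.log_parameter("lora_r", 8)
```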
Comet also offers a feature to compare multiple experiments, which is essential for identifying the key set of parameters and insights from the fine-tuning process. By selecting the desired experiments and clicking on Compare, you can overlap the experiments and provide a common view to spot key insights from the training process.
Here are the key components of the comparison feature:
- Charts and Panels to monitor the fine-tuning process
- Automatic logging of training loss and other metrics
- Code changes comparison using Code Diff
The Code Diff feature provides a git-like interface to compare code changes between experiments, making it easier to identify the differences and improvements. With all these features, CometML takes a top spot in the MLOps Lifecycle Modelling Stage.
Frequently Asked Questions
How much data to fine-tune LLM?
For fine-tuning a Large Language Model (LLM), you'll typically need thousands to tens of thousands of examples to achieve optimal results. The amount of data required may vary depending on the model's size and complexity.