Fine-tuning Llama-3 with JAX is a manageable process, but it requires some knowledge of the underlying stack.
The first step is to prepare your dataset in a form a JAX training loop can consume: batches of token arrays. JAX itself ships no dataset or dataloader classes, so loading, preprocessing, batching, and shuffling are handled by a separate data pipeline (for example, Hugging Face datasets, tf.data, or a small custom generator) that feeds NumPy or JAX arrays to each training step.
Once your dataset is ready, fine-tuning is driven by a training loop that applies gradient updates to the model parameters. You can write that loop yourself or use a higher-level library; the workflow described below uses the AWS Neuron NxD Training library on Trainium instances, which wraps the loop and exposes the hyperparameters and other settings you need to configure.
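As a minimal sketch of what that data preparation can look like, the snippet below tokenizes Dolly-style instruction records into fixed-length arrays and yields shuffled batches. The file path, prompt layout, sequence length, and batch size are illustrative assumptions, not part of any official API.

```python
import json

import numpy as np
from transformers import AutoTokenizer

# Illustrative values -- adjust to your own data and model.
DATA_FILE = "train/data.jsonl"   # hypothetical path to your .jsonl training data
MAX_LEN = 512
BATCH_SIZE = 8

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # gated model: requires a HF token
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

def load_examples(path):
    """Read Dolly-style .jsonl records and render each into a single training string."""
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            yield f"{rec['instruction']}\n{rec.get('context', '')}\n{rec['response']}"

texts = list(load_examples(DATA_FILE))
tokens = tokenizer(texts, max_length=MAX_LEN, padding="max_length",
                   truncation=True, return_tensors="np")["input_ids"]

def batches(ids, batch_size, rng):
    """Yield shuffled batches of token ids as plain NumPy arrays (what a JAX step expects)."""
    perm = rng.permutation(len(ids))
    for i in range(0, len(ids) - batch_size + 1, batch_size):
        yield ids[perm[i:i + batch_size]]

rng = np.random.default_rng(0)
for batch in batches(tokens, BATCH_SIZE, rng):
    ...  # feed `batch` to your jitted training step
```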
Preparation
To prepare for fine-tuning Llama-3 with JAX, you'll need to download the Llama3-8B checkpoint. Use of the checkpoint is governed by the Meta license; you can download the model weights and tokenizer by following the instructions in the meta-llama/Meta-Llama-3-8B repository.
Save the Llama-3-8B model in a directory called models/Llama-3-8B. Then use the convert_llama_weights_to_hf.py script to convert the Llama checkpoint to a HuggingFace checkpoint. The conversion shards Llama3-8B into multiple partitions; you can save it as a single partition by setting the flags max_shard_size="64GB" and safe_serialization=False in model.save_pretrained().
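A rough sketch of that conversion and re-save step is shown below. The output directory names are illustrative, and the flags of convert_llama_weights_to_hf.py vary across transformers versions, so treat the command in the comment as an approximation and check the script's --help.

```python
# First, convert the Meta checkpoint to HuggingFace format (run in a shell).
# Flag names are approximate and depend on your transformers version:
#   python convert_llama_weights_to_hf.py \
#       --input_dir models/Llama-3-8B --model_size 8B \
#       --llama_version 3 --output_dir models/Llama-3-8B-hf

from transformers import AutoModelForCausalLM

# Reload the sharded HF checkpoint and re-save it as a single partition.
model = AutoModelForCausalLM.from_pretrained("models/Llama-3-8B-hf")
model.save_pretrained(
    "models/Llama-3-8B-hf-single",
    max_shard_size="64GB",        # keep all weights in one shard
    safe_serialization=False,     # write pytorch_model.bin instead of safetensors
)
```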
Here's a step-by-step guide to preparing your dataset (a sample record and template.json follow the list):
- Download the Dolly dataset, an open-source dataset of instruction-following records.
- Format your training data in JSON lines (.jsonl) format, where each line is a dictionary representing a single data sample.
- Save your training data in a single folder, with multiple jsonl files if needed.
- Include a template.json file describing the input and output formats in the training folder.
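For illustration, here is what a single training record and a matching template.json might look like. The prompt wording and the placeholder names ({instruction}, {context}, {response}) follow the shape used in the SageMaker JumpStart examples; verify them against the documentation for your setup.

```python
import json
import os

os.makedirs("train", exist_ok=True)

# One Dolly-style record per line in train/data.jsonl (fields: instruction, context, response).
record = {
    "instruction": "Summarize the passage below in one sentence.",
    "context": "JAX is a library for high-performance numerical computing ...",
    "response": "JAX is a high-performance numerical computing library.",
}
with open("train/data.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")

# template.json tells the fine-tuning job how to turn each record into a prompt/completion pair.
template = {
    "prompt": (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n"
    ),
    "completion": " {response}",
}
with open("train/template.json", "w") as f:
    json.dump(template, f, indent=2)
```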
Fine-Tuning Preparation
Fine-tuning preparation is a crucial step in adapting a pre-trained model to a specific task or dataset. To fine-tune with the instruction-tuning dataset format, you can use a subset of the Dolly dataset.
The Dolly dataset contains roughly 15,000 instruction-following records across categories such as question answering, summarization, and information extraction. You can select specific examples for fine-tuning, such as only the summarization examples.
All training data must be in a single folder, but it can be saved in multiple JSON lines (.jsonl) files. A template.json file is also required to describe the input and output formats.
To train your model on a collection of unstructured text files (the domain-adaptation dataset format), follow the instructions in the section "Example fine-tuning with Domain-Adaptation dataset format" in the Appendix.
Before fine-tuning, you need to prepare the checkpoint and dataset. This involves downloading the Llama3-8B checkpoint and converting it to a HuggingFace checkpoint. You can use the convert_llama_weights_to_hf.py script to perform this conversion.
After converting the checkpoint, set the PRETRAINED_PATH variable to point to the converted checkpoint and the HF_TOKEN variable to your HuggingFace token, which is required to download the Llama3-8B tokenizer.
Here are the steps to prepare the checkpoint and dataset:
- Download the Llama3-8B checkpoint
- Convert the Llama checkpoint to a HuggingFace checkpoint using convert_llama_weights_to_hf.py
- Set up the PRETRAINED_PATH variable to point to the HuggingFace checkpoint
- Set up the HF_TOKEN variable with your HuggingFace token, which the Llama3-8B tokenizer download requires (see the sketch after this list)
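A minimal sketch of that configuration, assuming the two variables are plain environment variables read by the training scripts (the token value and the checkpoint path are placeholders):

```python
import os

from transformers import AutoTokenizer

# Point at the converted single-partition HF checkpoint from the previous step.
os.environ["PRETRAINED_PATH"] = "models/Llama-3-8B-hf-single"

# Hugging Face access token; needed to download the gated Llama-3-8B tokenizer.
os.environ["HF_TOKEN"] = "hf_..."  # placeholder -- use your own token

# Quick sanity check that the token grants access to the tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", token=os.environ["HF_TOKEN"]
)
print(tokenizer("hello world")["input_ids"])
```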
By following these steps, you can prepare your checkpoint and dataset for fine-tuning.
License Information
To perform inference on these models, you need to pass custom_attributes='accept_eula=true' as part of the request header, indicating you've read and accepted the model's end-user license agreement (EULA).
You can find the EULA in the model card description or at https://ai.meta.com/resources/models-and-libraries/llama-downloads/.
By default, this notebook sets custom_attributes='accept_eula=false', so all inference requests will fail until you explicitly change this custom attribute.
Note that custom_attributes used to pass EULA are key/value pairs, with the key and value separated by '=' and pairs separated by ';'.
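As a sketch of how that attribute is passed with recent versions of the SageMaker Python SDK (the endpoint name and payload fields are illustrative):

```python
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Hypothetical endpoint name for a deployed Llama-3-8B model.
predictor = Predictor("llama-3-8b-endpoint",
                      serializer=JSONSerializer(),
                      deserializer=JSONDeserializer())

payload = {"inputs": "What is JAX?", "parameters": {"max_new_tokens": 64}}

# Without this attribute the default accept_eula=false causes the request to fail.
print(predictor.predict(payload, custom_attributes="accept_eula=true"))
```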
Model and Training
The Llama-3 fine-tuning job is launched in a similar way to the compile job. Turn off the COMPILE environment variable and run the same training script to start the fine-tuning run. The model will be loaded onto the Trainium accelerators and training will commence.
The pre-trained Llama3-8B model serves as the foundation for fine-tuning. This solid base will be adapted to a specific task or dataset. You can use the NxD Training library to compile and fine-tune the pre-trained model on a single instance.
To prepare the checkpoint and dataset, you'll need to download the Llama3-8B checkpoint and convert it to a HuggingFace checkpoint using the `convert_llama_weights_to_hf.py` script. This will shard the Llama3-8B model into multiple partitions, which can be saved as one partition by setting flags `max_shard_size="64GB"` and `safe_serialization=False` in `model.save_pretrained()`.
Training the Model
Training the model is where the magic happens. The fine-tuning job is launched almost exactly the same way as the compile job: turn off the COMPILE environment variable and run the same training script.
Once the model is loaded onto the Trainium accelerators, training commences and you will begin to see output indicating the job's progress. This is a crucial step, as it is where the model learns from the dataset and improves its performance.
To ensure the model trains efficiently, it's essential to set up the training environment correctly. This includes configuring the COMPILE environment variable, which should be turned off during the training process.
The training process can be launched using a script, which should be run on the Trainium accelerators. This script will load the model, configure the training environment, and start the training process.
Here's a summary of the training process:
- Turn off the COMPILE environment variable.
- Run the same training script used for the compile job.
- Wait for the model to load onto the Trainium accelerators.
- Watch the console output (or TensorBoard) for job progress.
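As a rough sketch of that flow (the launcher script name is a placeholder, not the actual file from the NxD Training tutorial):

```python
import os
import subprocess

# Placeholder name -- use the launcher script from the NxD Training example you are following.
TRAIN_SCRIPT = "./run_llama3_8b_finetune.sh"

def launch(compile_only: bool) -> None:
    """Run the training script with COMPILE toggled via the environment."""
    env = {**os.environ, "COMPILE": "1" if compile_only else "0"}
    subprocess.run([TRAIN_SCRIPT], env=env, check=True)

launch(compile_only=True)   # ahead-of-time compile the graphs for Trainium
launch(compile_only=False)  # same script, COMPILE off: the actual fine-tuning run
```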
Replace HF's LlamaDecoderLayer with TE's TransformerLayer
Replacing HF's LlamaDecoderLayer with TE's TransformerLayer can significantly improve model performance. By using Transformer Engine's TransformerLayer in place of Hugging Face's LlamaDecoderLayer, a speedup of 34% can be achieved even when using only BF16 precision.
The TELlama implementation shows that replacing the core decoder layers with Transformer Engine's TransformerLayer gives a significant boost in performance. This is evident in the performance numbers, which show a step time of 185 milliseconds per batch in BF16 precision.
To achieve this speedup, you can use the TELlamaDecoderLayer wrapper around TransformerLayer. This involves replacing the LlamaDecoderLayer with TELlamaDecoderLayer in the model implementation. The replace_decoder context manager is used to monkey-patch the LlamaDecoderLayer with TELlamaDecoderLayer.
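The monkey-patching idea can be sketched as a small context manager along the lines of the one in the TE tutorial's te_llama.py; the real TELlamaDecoderLayer wrapper and checkpoint-loading logic live in that file, so treat this as a simplified outline rather than the tutorial's exact code.

```python
from contextlib import contextmanager

import transformers
from te_llama import TELlamaDecoderLayer  # wrapper around TE's TransformerLayer, defined in the tutorial

@contextmanager
def replace_decoder(te_decoder_cls):
    """Temporarily swap HF's LlamaDecoderLayer for the TE-based layer while the model is built."""
    original_cls = transformers.models.llama.modeling_llama.LlamaDecoderLayer
    transformers.models.llama.modeling_llama.LlamaDecoderLayer = te_decoder_cls
    try:
        yield
    finally:
        transformers.models.llama.modeling_llama.LlamaDecoderLayer = original_cls

# Any Llama model constructed inside this block uses TE layers instead of the stock decoder layers.
with replace_decoder(TELlamaDecoderLayer):
    model = transformers.AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
```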
Here's a summary of the performance improvements described above:

| Implementation | Precision | Speedup over baseline |
| --- | --- | --- |
| HF LlamaDecoderLayer (baseline) | BF16 | – |
| TE TransformerLayer | BF16 | 34% |
| TE TransformerLayer | FP8 | 46% |
Using FP8 precision further improves performance, with a speedup of 46% over the baseline implementation. This is a significant improvement, making it a worthwhile consideration for those looking to optimize their model's performance.
Training and Monitoring
Training the model is a straightforward process, similar to the compile job, just with the COMPILE environment variable turned off.
Launch the fine-tuning job by running the same training script used for the compile job, with COMPILE disabled. Once the model is loaded onto the Trainium accelerators, training begins and the console output reports the job's progress.
You can use standard tools like TensorBoard to monitor the training job's progress, which can be a big help in understanding what's happening.
To view an ongoing training job in TensorBoard, you'll need to identify the experiment directory associated with your job, which is typically the most recently created directory under ~/neuronx-distributed-training/examples/nemo_experiments/hf_llama3_8B/.
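A small helper along these lines can locate that directory and print the TensorBoard command to run (the path follows the tutorial's default layout):

```python
import glob
import os

exp_root = os.path.expanduser(
    "~/neuronx-distributed-training/examples/nemo_experiments/hf_llama3_8B/"
)

# Pick the most recently modified experiment directory for the current run.
run_dirs = [d for d in glob.glob(os.path.join(exp_root, "*")) if os.path.isdir(d)]
latest = max(run_dirs, key=os.path.getmtime)

print(f"tensorboard --logdir {latest}")  # run this command in a shell, then open the printed URL
```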
Evaluation and Deployment
After fine-tuning the Llama-3-8B model, you can evaluate its performance to see if it's improved.
You can assess the model's ability to follow instructions both qualitatively and quantitatively.
Once the fine-tuned model is deployed, you can compare its responses with those of the pre-trained model.
This comparison will help you understand the impact of fine-tuning on the model's behavior.
Deploy the Fine-Tuned Model
Deploying a fine-tuned model is a crucial step in evaluating its performance. We'll compare the performance of the fine-tuned model with the pre-trained model to see if the fine-tuning has improved its ability to follow instructions.
To deploy the fine-tuned model, we'll use the test data to evaluate its performance. This is similar to how we evaluated the pre-trained model in the previous step.
Note that the 46% FP8 speedup mentioned earlier comes from swapping in Transformer Engine's TransformerLayer, not from fine-tuning itself; fine-tuning is judged on output quality rather than speed.
For the quality comparison, send the same test prompts to both endpoints and put the responses side by side, for example in a table.
On instruction-following tasks such as the Dolly summarization examples, the fine-tuned model should produce noticeably better responses than the pre-trained model.
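A hedged sketch of that comparison, sending the same Dolly-style test prompt to both endpoints; the endpoint names, payload fields, and generation parameters are illustrative, not the exact ones from any particular notebook.

```python
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

def make_predictor(endpoint_name: str) -> Predictor:
    return Predictor(endpoint_name,
                     serializer=JSONSerializer(),
                     deserializer=JSONDeserializer())

# Hypothetical endpoint names for the two deployed models.
pretrained = make_predictor("llama-3-8b-pretrained")
finetuned = make_predictor("llama-3-8b-finetuned")

payload = {
    "inputs": "Summarize the following passage in one sentence:\n<test passage here>",
    "parameters": {"max_new_tokens": 100, "temperature": 0.1},
}

for name, predictor in [("pre-trained", pretrained), ("fine-tuned", finetuned)]:
    out = predictor.predict(payload, custom_attributes="accept_eula=true")
    print(f"--- {name} ---\n{out}\n")
```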
Upload to S3
Uploading the prepared dataset to S3 is a key step before launching the fine-tuning job, because the training job reads its data from S3.
Storing the data in S3 keeps it secure and easily accessible, so the fine-tuning job can retrieve it quickly when needed.
Once the training folder (the .jsonl files plus template.json) has been uploaded, the dataset is ready for fine-tuning.
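A minimal sketch of the upload, using the SageMaker SDK's default bucket; the local folder name and S3 prefix are illustrative.

```python
import sagemaker
from sagemaker.s3 import S3Uploader

session = sagemaker.Session()
bucket = session.default_bucket()

# Local folder containing the .jsonl files and template.json from the preparation step.
local_train_dir = "train"
train_data_location = f"s3://{bucket}/dolly-summarization-data"

S3Uploader.upload(local_train_dir, train_data_location)
print(f"Training data uploaded to: {train_data_location}")
```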
Sources
- https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_SFT.html
- https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/te_llama/tutorial_accelerate_hf_llama_with_te.html
- https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/jumpstart-foundation-models/llama-2-finetuning.html
- https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tutorials/finetuning_llama3_8B_ptl_lora.html
- https://rocm.blogs.amd.com/artificial-intelligence/axolotl/README.html