Hugging Face's RLHF training strategies are designed to produce real-world results by combining reinforcement learning with human feedback.
RLHF involves training models on human-annotated data, which allows them to learn from nuanced and context-dependent feedback.
This approach has been shown to improve model performance in areas such as conversational dialogue and text classification.
In one study, RLHF training resulted in a 25% increase in conversational accuracy.
Efficient Training Strategies
Training large language models requires a significant amount of memory: with standard full fine-tuning, a 7B parameter model needs roughly 70GB just to fit in memory (about 2 bytes per parameter for the bf16 weights plus roughly 8 bytes per parameter for Adam optimizer state, i.e. (2+8) x 7B bytes).
You can use more efficient optimizers or half-precision training to squeeze more into memory, but you'll eventually run out. Another option is to use Parameter-Efficient Fine-Tuning (PEFT) techniques, such as the peft library, which can perform Low-Rank Adaptation (LoRA) on a model loaded in 8-bit.
Loading the model in 8-bit reduces the memory footprint drastically, to roughly 7GB for a 7B parameter model (about one byte per parameter). On top of that, LoRA adds small adapter layers to specific layers (typically the attention layers), which drastically reduces the number of trainable parameters while the original weights stay frozen. A rule of thumb is to allocate ~1.2-1.4GB per billion parameters, depending on the batch size and sequence length, to fit the entire fine-tuning setup.
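To make this concrete, here is a minimal sketch of the setup described above: loading a model in 8-bit and attaching LoRA adapters with peft. The checkpoint name and LoRA hyperparameters are illustrative, and helper names such as prepare_model_for_kbit_training vary between peft versions.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 8-bit: roughly 1 byte per parameter instead of 2 (bf16) or 4 (fp32).
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",   # illustrative 7B checkpoint
    load_in_8bit=True,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # freeze base weights, cast norms/output head for stable training

# Attach small LoRA adapter layers on top of the attention projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable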
Supervised Fine-Tuning
Supervised fine-tuning is a crucial step in preparing a language model for downstream tasks.
The StackExchange dataset is enormous, with over 10 million instructions, making it an ideal resource for fine-tuning a model.
We can train the language model on a subset of the dataset, which is more efficient than using the entire dataset.
To use the data efficiently, we employ a technique called packing, which involves concatenating texts with an EOS token in between and cutting chunks of the context size to fill the batch without padding.
This approach makes training much more efficient, as each token passed through the model is also trained, unlike padding tokens which are usually masked from the loss.
The packing is handled by the ConstantLengthDataset, and we can then use the Trainer after loading the model with peft.
First, we load the model in int8, prepare it for training, and then add the LoRA adapters.
We train the model for a few thousand steps with the causal language modeling objective and save the model.
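A condensed sketch of the packing and fine-tuning flow is shown below. The dataset name, formatting function, and hyperparameters are illustrative (they follow the StackExchange setup described above), and the ConstantLengthDataset import path may differ between trl versions.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, Trainer, TrainingArguments
from trl.trainer import ConstantLengthDataset

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # illustrative checkpoint
data = load_dataset("lvwerra/stack-exchange-paired", data_dir="data/finetune", split="train")

def format_sample(example):
    # Turn one StackExchange record into a single question/answer training text.
    return f"Question: {example['question']}\n\nAnswer: {example['response_j']}"

# Packing: texts are concatenated with an EOS token in between and cut into
# constant-length chunks, so the batch is filled without any padding tokens.
train_dataset = ConstantLengthDataset(
    tokenizer,
    data,
    formatting_func=format_sample,
    seq_length=1024,
    infinite=True,
)

training_args = TrainingArguments(
    output_dir="./llama-se-sft",      # illustrative
    max_steps=4000,                   # "a few thousand steps"
    per_device_train_batch_size=4,    # illustrative
    learning_rate=1e-5,               # illustrative
    bf16=True,
)
trainer = Trainer(
    model=model,                      # the 8-bit + LoRA model prepared in the earlier sketch
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model(training_args.output_dir)
```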
Experiments
We ran experiments to validate the RLOO implementation, using the Pythia 1B and 6.9B models, and released the trained checkpoints for reference.
The SFT/RM models were taken directly from Huang et al., 2024. We used vLLM to load the checkpoints and GPT4 as a judge model to assess the generated TL;DR against the reference TL;DR.
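As a rough illustration of that evaluation setup, the sketch below loads a checkpoint with vLLM and samples TL;DR completions; the model id and prompt are placeholders, not the actual released checkpoint or dataset record.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/rloo-6.9b-checkpoint")        # placeholder checkpoint id
sampling = SamplingParams(temperature=0.0, max_tokens=128)

# TL;DR-style prompt from the summarization dataset (truncated placeholder).
prompts = ["SUBREDDIT: r/ask\nPOST: ...\nTL;DR:"]
outputs = llm.generate(prompts, sampling)

for out in outputs:
    generated_tldr = out.outputs[0].text
    # Each generated TL;DR is then compared against the reference TL;DR by the GPT4 judge,
    # which records which summary it prefers.
    print(generated_tldr)
```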
One of the key findings was that the 6.9B checkpoint achieved a 78.7% preferred rate using GPT4 as a judge, which even exceeded the best-reported performance of 77.9% (k=4) and 74.2% (k=2) in the original paper.
RLOO training also showed promise in terms of efficiency, using less memory and running faster than other methods.
Here are some key results from our experiments:
- Highly performant RLOO checkpoint: The 6.9B checkpoint gets a 78.7% (k=2) preferred rate using GPT4 as a judge.
- Less GPU memory and runs faster: RLOO training uses less memory and runs faster, making it a highly useful algorithm for online RL training.
Reward Modeling and Preferences
Reward modeling is a crucial step in Hugging Face's RLHF (Reinforcement Learning from Human Feedback) approach. It's used to fine-tune the model using human annotations, but this can be expensive and slow due to the number of training samples needed for convergence and human reading speed.
The trick to making this process more efficient is to train a reward model on human annotations collected before the RL loop. This reward model imitates how a human would rate a text, predicting the ranking of two examples for a given prompt.
One possible strategy to build a reward model is to predict the annotation, such as a rating score or a binary value for "good" or "bad". However, in practice, predicting the ranking of two examples works better.
The loss function used for training the reward model is a straightforward one, based on the ranking of two examples. It's defined as the negative expected value of the log of the sigmoid of the difference between the model's scores for the two examples.
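In code, that pairwise ranking loss is only a couple of lines. The sketch below is a minimal version, assuming the reward model returns one scalar score per (prompt, response) pair; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # loss = -E[ log( sigmoid( r(x, y_chosen) - r(x, y_rejected) ) ) ]
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```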
With a modest training batch size of 4, the LLaMA model can be trained using the Adam optimizer and BF16 precision. The LoRA peft adapter is used, and the training is logged via Weights & Biases.
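A hedged sketch of that training configuration is shown below; the batch size, precision, optimizer, and logging backend mirror the text, while everything else is illustrative.

```python
from transformers import TrainingArguments

reward_training_args = TrainingArguments(
    output_dir="./reward-model",        # illustrative output path
    per_device_train_batch_size=4,      # the modest batch size mentioned above
    bf16=True,                          # BF16 precision
    optim="adamw_torch",                # Adam-family optimizer
    report_to="wandb",                  # log training to Weights & Biases
    learning_rate=2e-5,                 # illustrative
    num_train_epochs=1,                 # illustrative
)
```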
The model achieves a final accuracy of 67%, which may sound low, but the task is also very hard, even for human annotators.
Learning from Human Feedback
Learning from Human Feedback is a crucial step in the RLHF pipeline. This phase involves training a reward model to classify responses as good or bad.
The reward model is trained on labeled data, where human experts label responses as thumbs up or thumbs down. This training process is a key component of the RLHF pipeline.
The RLHF fine-tuning phase uses the reward model to align the responses of the LLM. This is done by optimizing the LLM with reinforcement learning against the reward model, which was itself trained on (prompt, good_response, bad_response) data labeled by human experts.
A penalty is added to the reward to prevent the model from generating gibberish and exploiting the reward model. This penalty is calculated using the KL-divergence between the current policy and the reference model.
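A minimal sketch of that KL penalty follows; the function name and the beta coefficient are illustrative rather than the exact values used in the pipeline.

```python
import torch

def kl_penalized_reward(reward: torch.Tensor,
                        policy_logprobs: torch.Tensor,
                        ref_logprobs: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    # Per-token KL estimate between the current policy and the frozen reference model.
    # Subtracting it from the reward discourages the policy from drifting into gibberish
    # that merely exploits the reward model.
    kl = policy_logprobs - ref_logprobs
    return reward - beta * kl
```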
The RLHF pipeline consists of three phases: Domain Specific Pre-Training, Supervised fine-tuning, and RLHF. The RLHF phase includes Reward model training and RLHF fine-tuning.
Here's a summary of the RLHF pipeline:
- Domain Specific Pre-Training: Fine-tune a pre-trained LLM on raw text with a Causal Language Modelling Objective.
- Supervised fine-tuning: Fine-tune the domain-specific LLM on task-specific as well as domain-specific (prompt/instruction, response) pairs.
- Reward model training: Training a language model to classify responses as good or bad (thumbs up, thumbs down)
- RLHF fine-tuning: Using the reward model, trained on (prompt, good_response, bad_response) data labeled by human experts, to align the responses of the LLM
In this setup, the full RLHF pipeline was trained for around 20 hours on 3x8 A100-80GB GPUs.
Implementing the RLOO Trainer in TRL
Implementing the RLOO trainer was a key part of our work with TRL. We based it on our new experimental PPOv2Trainer, which is itself built on a 2024 research paper. Interestingly, the RLOO trainer still uses the PPO loss even though RLOO is a REINFORCE-style method; this works because the REINFORCE loss is a special case of the PPO loss, as shown in a separate paper. PPO itself is a reinforcement learning algorithm introduced by OpenAI in 2017, initially applied to 2D and 3D control problems, and it has since found a place in NLP, specifically in the RLHF pipeline.
By leveraging this existing work, we were able to speed up development of the RLOO trainer and focus on fine-tuning the model for our specific use case. The PPOv2Trainer provides a solid foundation for the implementation, allowing us to build on established research and techniques.
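For context, here is a hedged sketch of how the resulting RLOO trainer is typically driven. The model ids and the tiny prompt dataset are placeholders, and constructor argument names (for example, tokenizer versus processing_class) have changed across TRL versions, so treat this as indicative rather than exact.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import RLOOConfig, RLOOTrainer

base = "EleutherAI/pythia-1b-deduped"   # illustrative base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token

policy = AutoModelForCausalLM.from_pretrained(base)
ref_policy = AutoModelForCausalLM.from_pretrained(base)   # frozen reference for the KL penalty
reward_model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)

# Toy prompt dataset; in practice this is the tokenized TL;DR prompt dataset.
prompt_ids = tokenizer(["TL;DR the following post: ..."] * 4)["input_ids"]
prompts = Dataset.from_dict({"input_ids": prompt_ids})

trainer = RLOOTrainer(
    config=RLOOConfig(
        output_dir="./rloo",
        per_device_train_batch_size=2,
        total_episodes=8,
        rloo_k=2,   # number of online samples per prompt (the "k" in the results above)
    ),
    tokenizer=tokenizer,
    policy=policy,
    ref_policy=ref_policy,
    reward_model=reward_model,
    train_dataset=prompts,
    eval_dataset=prompts,
)
trainer.train()
```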
Numerical Instability
Numerical instability is a significant issue with RLOO, causing it to null the gradient of a substantial portion of the batch data.
The problem arises from the use of bf16 precision, which leads to slightly different logprobs during generation and training forward passes.
Under bf16, the probability ratio exp(forward_logprob - generation_logprob), which should be exactly 1 before any update, can become very unstable, causing PPO's clip coefficient to kick in and null the gradient of certain tokens.
This issue is more extreme for RLOO, which computes the ratio at the sequence level by summing logprobs over all tokens: small per-token differences accumulate, and the gradient for entire sequences can be nulled.
In practice, we observed that PPO nulls the gradient of approximately 3% of the batch data, while RLOO nulls about 20-40%.
RLOO should theoretically null 0% of the batch data when not using mini-batches, but in our experience, the clipping ratio remains significant even with increased gradient steps.
The use of mini-batches exacerbates the problem, making it more challenging to train RLOO effectively.
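The sketch below is an illustrative (not TRL's actual) computation of the clipped fraction, showing why precision drift nulls gradients: the probability ratio should be exactly 1 before any update, but bf16 round-off pushes it outside PPO's clip range.

```python
import torch

def clipped_fraction(forward_logprob: torch.Tensor,
                     generation_logprob: torch.Tensor,
                     clip_range: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(forward_logprob - generation_logprob)   # ~1.0 in exact arithmetic
    clipped = (ratio < 1.0 - clip_range) | (ratio > 1.0 + clip_range)
    # Clipped tokens (for PPO) or clipped whole sequences (for RLOO, where logprobs are
    # summed over the sequence before exponentiating) receive no gradient from the objective.
    return clipped.float().mean()
```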
Performance in Details
Mistral 7B significantly outperforms Llama 2 13B on all metrics, and is on par with Llama 34B.
We compared Mistral 7B to the Llama 2 family, and re-ran all model evaluations ourselves for fair comparison. This allowed us to accurately assess their performance on a wide range of benchmarks.
Here's a breakdown of the benchmarks Mistral 7B was compared against:
- Commonsense Reasoning: 0-shot average of Hellaswag, Winogrande, PIQA, SIQA, OpenbookQA, ARC-Easy, ARC-Challenge, and CommonsenseQA.
- World Knowledge: 5-shot average of NaturalQuestions and TriviaQA.
- Reading Comprehension: 0-shot average of BoolQ and QuAC.
- Math: Average of 8-shot GSM8K with maj@8 and 4-shot MATH with maj@4
- Code: Average of 0-shot Humaneval and 3-shot MBPP
- Popular aggregated results: 5-shot MMLU, 3-shot BBH, and 3-5-shot AGI Eval (English multiple-choice questions only)
Mistral 7B performs equivalently to a Llama 2 that would be more than 3x its size on reasoning, comprehension, and STEM reasoning, saving a significant amount of memory while gaining in throughput.
Tools and Resources
Open-source tools for RLHF have come a long way since OpenAI released their first code in TensorFlow in 2019.
TRL is a primary repository for RLHF in the Hugging Face ecosystem, designed to fine-tune pretrained LMs with PPO.
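The classic TRL PPO loop looked roughly like the sketch below; the API has changed across TRL versions, so the argument names are indicative only, and "gpt2" and the constant reward are placeholders.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")       # policy with a value head
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")   # frozen reference for the KL penalty
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1), model, ref_model, tokenizer)

query = tokenizer.encode("Explain RLHF in one sentence:", return_tensors="pt")[0]
response = ppo_trainer.generate(query, return_prompt=False, max_new_tokens=30)[0]

# The reward would normally come from a trained reward model; a constant stands in here.
reward = [torch.tensor(1.0)]
stats = ppo_trainer.step([query], [response], reward)
```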
TRLX is an expanded fork of TRL, built by CarperAI to handle larger models for online and offline training, and has an API capable of production-ready RLHF with PPO and ILQL.
RL4LMs offers building blocks for fine-tuning and evaluating LLMs with a wide variety of RL algorithms, including PPO, NLPO, A2C, and TRPO, and is easily customizable.
The library is well-tested and benchmarked on a broad range of tasks, with recent work amounting to 2000 experiments highlighting practical insights on data budget comparison and handling reward hacking.
TRLX is optimized for machine learning engineers with experience at scale, and is capable of interfacing with models up to 33 billion parameters, with future versions planned to support models up to 200B parameters.
RL4LMs' current plans include distributed training of larger models and new RL algorithms, making it a powerful tool for RLHF in the Hugging Face ecosystem.
Take It Step by Step
Breaking down complex tasks into manageable steps is essential for success.
The RLHF process involves four main steps: data annotation, model training, model evaluation, and deployment.
Taking it one step at a time helps to avoid feeling overwhelmed.
The first step, data annotation, requires human evaluators to review and label data to prepare it for the model.
Labeling data can be a time-consuming process, but it's crucial for the model's accuracy.
The next step, model training, involves training the model on the annotated data to learn patterns and relationships.
Model training can take several iterations to achieve optimal results.
With the model trained, the next step is evaluation to assess its performance and identify areas for improvement.
Model evaluation helps to refine the model and ensure it meets the desired standards.
The final step, deployment, involves integrating the model into a production-ready environment.
Deployment requires careful consideration of the model's limitations and potential biases.
Mistral 7B
Mistral 7B is a 7.3B parameter model that outperforms Llama 2 13B on all benchmarks.
It's impressive to see how Mistral 7B approaches CodeLlama 7B performance on code tasks while remaining good at English tasks.
Mistral 7B uses Grouped-query attention (GQA) for faster inference, which is a significant advantage for those who need to process large amounts of data quickly.
The model also uses Sliding Window Attention (SWA) to handle longer sequences at a smaller cost.
This means that users can expect faster and more efficient processing with Mistral 7B compared to other models.
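As a small illustration of those two mechanisms, the transformers MistralConfig exposes both directly; the values below are the published Mistral 7B settings.

```python
from transformers import MistralConfig

config = MistralConfig(
    num_attention_heads=32,
    num_key_value_heads=8,   # grouped-query attention: 8 KV heads shared by 32 query heads
    sliding_window=4096,     # sliding window attention: each token attends to the previous 4096 tokens
)
print(config.num_key_value_heads, config.sliding_window)
```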
Mistral 7B is released under the permissive Apache 2.0 license, so it can be used, modified, and deployed commercially without usage restrictions.
Here are the ways you can use Mistral 7B:
- Download it and use it anywhere (including locally) with our reference implementation
- Deploy it on any cloud (AWS/GCP/Azure), using vLLM inference server and skypilot
- Use it on HuggingFace
Fine-tuning Mistral 7B on any task is also easy, and we're providing a model fine-tuned for chat that outperforms Llama 2 13B chat.