The Huggingface Training Service is a game-changer for model development. It allows you to train, deploy, and manage models with ease, all in one place.
With the Huggingface Training Service, you can access a vast library of pre-trained models and fine-tune them for your specific use case. This can save you a significant amount of time and resources.
One of the key benefits of the Huggingface Training Service is its ability to handle large-scale model training: its distributed training support lets a job scale across multiple GPUs and nodes.
By using the Huggingface Training Service, you can streamline your model development process and get your models into production faster.
Training Options
You can train your model on completions only, which can be a great way to fine-tune your model on specific tasks. This approach uses the DataCollatorForCompletionOnlyLM.
To use this approach, you pass a response template and the tokenizer to the collator. For example, you can fine-tune opt-350m on completions only on the CodeAlpaca dataset this way.
Make sure the pad_token_id is different from the eos_token_id; if they are the same, the model may not learn to predict EOS tokens properly during generation.
Train on Completions
You can train your model on completions only using the DataCollatorForCompletionOnlyLM. This works only when packing is set to False.
To instantiate this collator for instruction data, pass a response template and the tokenizer; the sketch below shows this setup for fine-tuning opt-350m on completions only on the CodeAlpaca dataset. As noted above, keep the pad_token_id distinct from the eos_token_id.
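A minimal sketch of that setup, modeled on the pattern in the TRL documentation (the dataset identifier, model name, and prompt format are illustrative and may need adjusting for your data):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM

# Assumed dataset and model names; swap in your own as needed.
dataset = load_dataset("lucasmccabe-lmi/CodeAlpaca-20k", split="train")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

def formatting_prompts_func(example):
    # Combine the instruction and output columns into one prompt per row (batched).
    output_texts = []
    for i in range(len(example["instruction"])):
        output_texts.append(
            f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
        )
    return output_texts

# The collator masks everything before the response template,
# so the loss is computed on completions only.
response_template = " ### Answer:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    args=SFTConfig(output_dir="/tmp/completions-only"),
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)
trainer.train()
```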
Train with Pretrained Models
You can pass the keyword arguments of the from_pretrained() method through the SFTConfig to control how the pretrained model is loaded; in recent TRL versions this is done via the model_init_kwargs field.
All keyword arguments of from_pretrained() are supported, giving you flexibility in how you load your pretrained model, for example loading it in a different precision.
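A hedged sketch of loading the base model in bfloat16 this way (the model and dataset names are illustrative, and the dataset object is assumed to be defined elsewhere):

```python
import torch
from trl import SFTConfig, SFTTrainer

# from_pretrained() kwargs are forwarded to the model when SFTTrainer receives a string.
training_args = SFTConfig(
    output_dir="/tmp/bf16-run",
    model_init_kwargs={"torch_dtype": torch.bfloat16},
)

trainer = SFTTrainer(
    "facebook/opt-350m",     # illustrative model name
    train_dataset=dataset,   # assumes a dataset object defined elsewhere
    args=training_args,
)
```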
Model Customization
You can customize your model initialization by passing a model initializer to the SFTTrainer. This can be a callable function that returns a PreTrainedModel, such as a BERT or RoBERTa model.
The model initializer can be used to load a pre-trained model from a specific checkpoint, or to create a new model instance from scratch. For example, you can pass a function that loads a pre-trained BERT model from a specific checkpoint using the from_pretrained() method.
You can also pass a PeftConfig object to the SFTTrainer. The model is then converted to a PeftModel, so only the lightweight adapter parameters are trained while the base weights stay frozen.
In short, the model initialization options are a model_init callable, from_pretrained() keyword arguments, and an optional PeftConfig. By customizing model initialization, you can tailor the training process to your specific needs and achieve better results.
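For example, a minimal sketch of the PEFT path described above, assuming the peft library is installed (the LoRA hyperparameters and model name are illustrative, and the dataset object is assumed to exist):

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Passing a PeftConfig makes SFTTrainer wrap the base model as a PeftModel.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    "facebook/opt-350m",     # illustrative model name
    train_dataset=dataset,   # assumes a dataset object defined elsewhere
    args=SFTConfig(output_dir="/tmp/peft-run"),
    peft_config=peft_config,
)
trainer.train()
```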
Customize with Pretrained Models
You can customize the pretrained model that the trainer loads: when you pass the model as a string, the keyword arguments of from_pretrained() can be supplied through the SFTConfig.
The pretrained model can be loaded in a different precision, such as fp16 or bf16, by passing the corresponding torch_dtype.
You can also pass other from_pretrained() keyword arguments, such as revision, cache_dir, or use_auth_token, to customize the loaded model.
For example, code along the lines of `SFTConfig(model_init_kwargs={"torch_dtype": "bfloat16", "revision": "main", "cache_dir": "/tmp/sft_cache"})` (assuming a TRL version that exposes the model_init_kwargs field) controls the precision, revision, and cache directory of the loaded model.
The supported keyword arguments are everything from_pretrained() accepts, for example revision, torch_dtype, cache_dir, use_auth_token, and from_tf. By using these keyword arguments, you can customize the loaded model to suit your specific needs.
The SFTTrainer Class
The SFTTrainer class is a powerful tool for model customization.
It's a wrapper around the transformers.Trainer class and inherits all of its attributes and methods. This means you can use the SFTTrainer class to take care of tasks such as training, evaluation, and model saving.
The SFTTrainer class can handle a variety of models, including PreTrainedModels, torch.nn.Modules, and strings representing model names to load from cache or download.
You can also convert a model to a PeftModel if you pass a PeftConfig object to the peft_config argument.
The SFTTrainer class has several optional arguments, including args, data_collator, train_dataset, eval_dataset, processing_class, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics, peft_config, and formatting_func.
Here are some key arguments to consider:
- args: The arguments to tweak for training. If not provided, it will default to a basic instance of SFTConfig with the output_dir set to a directory named tmp_trainer in the current directory.
- data_collator: The data collator to use for training.
- train_dataset and eval_dataset: The datasets to use for training and evaluation, respectively. We recommend using trl.trainer.ConstantLengthDataset to create your dataset.
- processing_class: The processing class used to process the data. If provided, it will be used to automatically process the inputs for the model.
- model_init: The model initializer to use for training. If None is specified, the default model initializer will be used.
- compute_metrics: The function used to compute metrics during evaluation. If not specified, only the loss will be computed during evaluation.
- callbacks: The callbacks to use for training.
- optimizers: The optimizer and scheduler to use for training.
- preprocess_logits_for_metrics: The function to use to preprocess the logits before computing the metrics.
- peft_config: The PeftConfig object to use to initialize the PeftModel.
- formatting_func: The formatting function to be used for creating the ConstantLengthDataset.
By using the SFTTrainer class, you can easily customize your model and fine-tune it for your specific use case.
Model Performance
You can significantly enhance your model's performance using NEFTune, a technique that adds noise to the embedding vectors during training. This boosts the performance of chat models, as seen in the paper "NEFTune: Noisy Embeddings Improve Instruction Finetuning" from Jain et al.
Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. This is a substantial improvement, and NEFTune also boosts performance on modern instruction datasets.
To use NEFTune in the Hugging Face Training Service, simply pass neftune_noise_alpha when creating your SFTConfig instance. This will enable the technique and potentially lead to significant performance gains.
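A minimal sketch of enabling NEFTune (the dataset, model name, and alpha value are illustrative):

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("stanfordnlp/imdb", split="train")  # illustrative dataset

training_args = SFTConfig(
    output_dir="/tmp/neftune-run",
    neftune_noise_alpha=5,  # noise scale added to embeddings during training
)

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    args=training_args,
)
trainer.train()
```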
The amount of performance gain from NEFTune is dataset dependent, and applying it on synthetic datasets like UltraChat typically produces smaller gains. However, it's still worth trying, especially if you're working with a real-world dataset.
We've tested NEFTune on the OpenAssistant dataset and saw a performance boost of ~25% on MT Bench. This is a promising result, and it's worth exploring further to see how NEFTune can help improve your model's performance.
Here are some key statistics on the performance gains from NEFTune:
- 10% improvement on Evol-Instruct
- 8% improvement on ShareGPT
- 8% improvement on OpenPlatypus
- ~25% improvement on MT Bench (OpenAssistant dataset)
Model Evaluation
Model Evaluation is a crucial step in the Hugging Face Training Service. You can evaluate your model on a specific dataset by passing it to the `evaluate` method.
You can also override the default evaluation dataset by passing a new one to the `evaluate` method. If the new dataset is a dictionary, it will evaluate on each dataset separately.
The `ignore_keys` argument allows you to ignore certain keys in the output of your model when gathering predictions.
The `metric_key_prefix` argument is an optional prefix to be used as the metrics key prefix. For example, the metrics "bleu" will be named "eval_bleu" if the prefix is "eval".
Here are some possible values for the `strategy` argument in the `set_evaluate` method:
- "no"
- "steps"
- "epoch"
If you set the `strategy` to "steps", you can specify the number of update steps between two evaluations using the `steps` argument.
The `batch_size` argument determines the batch size per device used for evaluation, and defaults to 8.
The `accumulation_steps` argument determines the number of predictions steps to accumulate the output tensors for before moving the results to the CPU.
The `delay` argument determines the number of epochs or steps to wait for before the first evaluation can be performed, depending on the evaluation strategy.
The `loss_only` argument ignores all outputs except the loss, and defaults to False.
The `jit_mode` argument determines whether or not to use PyTorch jit trace for inference, and defaults to False.
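Putting those arguments together, here is a minimal sketch using the `set_evaluate` helper described above (the values are illustrative):

```python
from transformers import TrainingArguments

args = TrainingArguments(output_dir="out")

# Configure evaluation every 500 steps with a per-device batch size of 8.
args = args.set_evaluate(
    strategy="steps",
    steps=500,
    batch_size=8,
    delay=0,
    loss_only=False,
    jit_mode=False,
)
# A Trainer built with these arguments will then run evaluation on this schedule,
# and trainer.evaluate() can be called manually with an optional eval_dataset.
```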
Here are some possible use cases for the `evaluation_loop` method:
- Evaluating a model on a specific dataset
- Predicting on a dataset using a trained model
- Ignoring certain keys in the output of the model
- Using PyTorch jit trace for inference
Model Deployment
Model deployment is a key step in the Hugging Face workflow. Trained models can be pushed to the Hugging Face Hub and deployed from there to hosting options such as Hugging Face Inference Endpoints, AWS SageMaker, or Google Cloud, typically in just a few clicks.
The ecosystem supports a range of model formats, including PyTorch, TensorFlow, and ONNX, which keeps compatibility issues to a minimum.
You can also integrate deployed models into your own infrastructure using Hugging Face's client libraries, such as the huggingface_hub library for Python and huggingface.js for JavaScript.
Deployment options typically include model serving, which exposes your model behind an API, and monitoring of metrics on the model's usage and performance.
Integrations
The Hugging Face Training Service offers seamless integrations with various platforms to streamline your workflow.
You can integrate the Hugging Face Training Service with your existing continuous integration and continuous deployment (CI/CD) pipelines, such as GitHub Actions and CircleCI.
The service supports integrations with popular model hosting platforms like Hugging Face Model Hub and AWS SageMaker.
Hugging Face also provides integrations with other AI and machine learning tools, including TensorFlow and PyTorch.
The Training Service integrates with GitHub repositories, allowing you to access and utilize your models directly from your code.
This integration enables you to train and deploy models directly from your GitHub repository, making the entire process more efficient and convenient.
Model Optimization
Model Optimization is a crucial aspect of achieving top-notch performance with the Hugging Face Training Service.
One of the most exciting features is Liger Kernel, which is reported to increase multi-GPU training throughput by about 20% and reduce memory usage by about 60%. This means you can train larger models with more context.
Liger Kernel is a collection of Triton kernels designed specifically for LLM training, and it's fully compatible with Flash Attention, PyTorch FSDP, and Microsoft DeepSpeed.
Thanks to the memory savings, you can potentially turn off cpu_offloading or gradient checkpointing to boost performance further, which matters for large-scale training runs.
To use Liger-Kernel in SFTTrainer, you'll need to follow these simple steps:
- First, install Liger-Kernel.
- Once installed, set use_liger in your SFTConfig. No other changes are needed (see the sketch below).
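A minimal sketch of that configuration, assuming liger-kernel has been installed with pip and that your TRL version exposes the use_liger flag on SFTConfig:

```python
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="/tmp/liger-run",
    use_liger=True,  # patch the model with Liger's Triton kernels
)
```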
Model Configuration
You can directly pass the kwargs of the from_pretrained() method to the SFTConfig to control over the pretrained model. This allows you to load a model in a different precision, for example.
Note that all keyword arguments of from_pretrained() are supported, so you can experiment with different settings to find what works best for your project.
To ensure a smooth training process, pay attention to the following best practices:
- SFTTrainer pads sequences by default to the max_seq_length argument of the SFTTrainer, which can be a problem if your tokenizer doesn't provide a default value. Be sure to check the max_seq_length before training.
- When training adapters in 8bit, you might need to tweak the prepare_model_for_kbit_training method from PEFT. Consider using the prepare_in_int8_kwargs field or creating the PeftModel outside the SFTTrainer to avoid issues.
- Loading the base model in 8bit can make training more memory-efficient, and you can do this by adding the load_in_8bit argument when creating the SFTTrainer.
- If you create the model outside the trainer, do not also pass keyword arguments that are meant for the from_pretrained() method.
Class and Keyword Arguments
The SFTTrainer class is a wrapper around the transformers.Trainer class and inherits all of its attributes and methods. It takes care of properly initializing the PeftModel in case a user passes a PeftConfig object.
The SFTTrainer class has several keyword arguments that can be used to control the training process. These include args, data_collator, train_dataset, eval_dataset, processing_class, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics, peft_config, and formatting_func.
You can pass a PreTrainedModel, a torch.nn.Module, or a string with the model name to load from cache or download to the model argument. The model can also be converted to a PeftModel if a PeftConfig object is passed to the peft_config argument.
The args argument allows you to tweak the training arguments, and it will default to a basic instance of SFTConfig with the output_dir set to a directory named tmp_trainer in the current directory if not provided.
Some tokenizers like Llama 2 tokenize sequences differently depending on whether they have context or not. To solve this, you can tokenize the response_template with the same context as in the dataset, truncate it as needed and pass the token_ids directly to the response_template argument of the DataCollatorForCompletionOnlyLM class.
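A short sketch of that workaround, based on the pattern in the TRL documentation (the template text and the number of leading token ids to strip depend on your tokenizer and prompt format, and the tokenizer is assumed to be loaded already):

```python
from trl import DataCollatorForCompletionOnlyLM

# Tokenize the template with the same surrounding context as in the dataset,
# then drop the leading ids that change with context (here 2; adjust for your tokenizer).
response_template_with_context = "\n### Assistant:"
response_template_ids = tokenizer.encode(
    response_template_with_context, add_special_tokens=False
)[2:]

collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)
```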
The SFTTrainer always pads by default the sequences to the max_seq_length argument of the SFTTrainer. If none is passed, the trainer will retrieve that value from the tokenizer. Some tokenizers do not provide a default value, so there is a check to retrieve the minimum between 2048 and that value.
These optional arguments (train_dataset, eval_dataset, processing_class, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics, peft_config, and formatting_func) let you customize the training process; they are described in more detail in the list earlier in this article.
PyTorch on Mac
PyTorch on Mac offers a significant advantage for developers and researchers. With the release of PyTorch v1.12, Apple silicon GPUs can be used for faster model training.
You can install PyTorch with MPS support on your MacOS machine by following the instructions in the official document "Introducing Accelerated PyTorch Training on Mac". It's recommended to install PyTorch >= 1.13, which has major fixes related to model correctness and performance improvements for transformer-based models.
Using the "mps" device, you can take advantage of Apple's Metal Performance Shaders (MPS) as a backend for PyTorch. This will map computational graphs and primitives on the MPS Graph framework and tuned kernels provided by MPS.
To use the Apple Silicon GPU, pass the "--use_mps_device" argument when running your PyTorch script; for example, the official GLUE text classification example can be launched from the repository root with this flag.
Some operations have not been implemented in MPS and will throw an error. To get around this, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1, which will fallback to CPU for these operations.
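Inside a script, selecting the Apple Silicon GPU looks like this minimal PyTorch sketch:

```python
import torch

# Use the Apple Silicon GPU via the MPS backend when available, otherwise fall back to CPU.
# (For unsupported ops, set PYTORCH_ENABLE_MPS_FALLBACK=1 in the environment, as noted above.)
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

model = torch.nn.Linear(16, 4).to(device)
x = torch.randn(8, 16, device=device)
print(model(x).shape)  # torch.Size([8, 4])
```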
Here are the benefits of training and inference using Apple Silicon Chips:
- Enables users to train larger networks or batch sizes locally
- Reduces data retrieval latency and provides the GPU with direct access to the full memory store due to unified memory architecture
- Reduces costs associated with cloud-based development or the need for additional local GPUs
Hyperparameter Search
Hyperparameter search is a crucial step in training models. You can launch a hyperparameter search using Optuna, Ray Tune, or SigOpt.
The optimized quantity is determined by the compute_objective function, which defaults to a function returning the evaluation loss when no metric is provided, the sum of all metrics otherwise. This function is used to minimize or maximize the objective.
To use this method, you need to have provided a model_init when initializing your Trainer. This is because the model needs to be reinitialized at each new run, which is incompatible with the optimizers argument.
The hyperparameter search space is defined by the hp_space function, which defaults to default_hp_space_optuna() or default_hp_space_ray() or default_hp_space_sigopt() depending on your backend. You can also define a custom hp_space function if needed.
Common hyperparameters to tune include the learning rate, the number of training epochs, the per-device batch size, and the random seed.
You can also customize the hyperparameter search by passing additional keyword arguments to optuna.create_study or ray.tune.run.
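Here is a hedged sketch using the optuna backend; the search space below is illustrative rather than the library default, and it assumes a Trainer that was created with a model_init function:

```python
def hp_space(trial):
    # Illustrative search space; adjust ranges for your task.
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 5),
    }

best_run = trainer.hyperparameter_search(
    backend="optuna",
    direction="minimize",  # minimize the default objective (evaluation loss)
    hp_space=hp_space,
    n_trials=10,
)
print(best_run.hyperparameters)
```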
Learning Rates
The learning rate is a crucial aspect of model configuration; it is set on the optimizer and usually adjusted by a scheduler during training.
The get_learning_rates method returns the current learning rate of each parameter group from the optimizer.
A well-configured learning rate can make a big difference in the model's performance, and it's essential to find the right balance between exploration and convergence.
The learning rate is typically a small value, often in the range of 0.001 to 0.01 when training from scratch and usually smaller still (on the order of 1e-5 to 1e-4) when fine-tuning pretrained transformers; it determines how quickly the model learns from the data.
A high learning rate can lead to rapid convergence, but it may also cause the model to overshoot the optimal solution, while a low learning rate can lead to slow convergence, but it may also lead to better generalization.
Logging Configuration
Configuring logging is an important part of setting up a training run. You can query the effective logging level with the `get_process_log_level` method of TrainingArguments.
For the main process, the level defaults to `logging.WARNING` unless overridden by the `log_level` argument; for replica processes it defaults to `logging.WARNING` unless overridden by `log_level_replica`. The behavior distinguishes the main process of node 0, the main process of other nodes, and non-main processes.
You can also use the `set_logging` method to configure logging settings. This method allows you to set the logging strategy, log level, and other settings.
Here are some possible values for the logging level:
- `debug`
- `info`
- `warning`
- `error`
- `critical`
- `passive`
You can also use the `log_level_replica` argument to set the logging level for replicas.
In a multi-node distributed training environment, you can use the `on_each_node` argument to decide whether to log using `log_level` once per node or only on the main node.
Here's a summary of the possible values for the `report_to` argument:
- `azure_ml`
- `comet_ml`
- `mlflow`
- `neptune`
- `tensorboard`
- `clearml`
- `wandb`
- `all`
- `none`
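Putting the logging options together, here is a minimal sketch that configures them through TrainingArguments (the values are illustrative):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    log_level="info",             # main-process log level
    log_level_replica="warning",  # log level for replica processes
    logging_steps=50,
    report_to=["tensorboard"],
)
print(args.get_process_log_level())  # resolves the effective level for this process
```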
Multi-GPU
Multi-GPU training is a great way to speed up your model training process, but it does require some special considerations.
To use multi-GPU training with Trainer (and thus SFTTrainer), you'll need to launch your script with a specific command. This can be either `python -m torch.distributed.launch script.py` or `accelerate launch script.py`.
Using DDP is generally recommended for multi-GPU training, but it does require some extra setup.
To use DDP, you'll need to ensure that your model is placed on the correct device. This is crucial for DDP to work.
If you're using gradient checkpointing, you'll need to add `gradient_checkpointing_kwargs={'use_reentrant':False}` to your TrainingArguments.
Here are the key things to check for multi-GPU training with DDP:
- Use the correct launch command
- Place your model on the correct device
- Disable reentrant gradient checkpointing (if using)
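The gradient checkpointing tweak from the checklist above looks like this in code (a minimal sketch of the TrainingArguments involved, assuming a recent transformers version):

```python
from transformers import TrainingArguments

# Disable reentrant checkpointing so gradient checkpointing plays nicely with DDP.
args = TrainingArguments(
    output_dir="out",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
# Launch the script with `accelerate launch script.py`
# or `python -m torch.distributed.launch script.py`.
```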
Model Architecture
The Hugging Face Training Service offers a variety of model architectures to choose from, including DistilBERT, a smaller and more efficient version of the popular BERT model.
DistilBERT is distilled from BERT; it is roughly 40% smaller and significantly faster while retaining most of BERT's accuracy, which makes it easier to deploy in production environments.
One of the key benefits of the Hugging Face Training Service is its ability to support a wide range of model architectures, making it a versatile option for developers and researchers.
Class Transformers
The Trainer from the transformers library is the core building block here, and it plugs directly into Ray Train.
To prepare a transformers.Trainer for Ray, you pass it into Ray Train's prepare_trainer() function, which validates your configuration and enables Ray Data integration.
Beyond that step, you work with the Trainer the same way as in any other Transformers training script.
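A hedged sketch of that flow with Ray Train's TorchTrainer; the model and dataset setup is omitted and stands behind a hypothetical build_transformers_trainer() helper:

```python
import ray.train.huggingface.transformers as ray_transformers
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    # Build your datasets, model, and transformers.Trainer here as usual, then:
    trainer = build_transformers_trainer()  # hypothetical helper for brevity
    # Validate the configuration and enable Ray Data integration before training.
    trainer = ray_transformers.prepare_trainer(trainer)
    trainer.train()

ray_trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = ray_trainer.fit()
```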
Attention
Attention is a crucial component of model architecture, and Flash Attention is a major speed-up. Flash Attention 1 and Flash Attention 2 are two versions of this technology, and either can be enabled with minimal code changes.
SFTTrainer supports both out of the box through the transformers library; you only need to install the corresponding packages to get the latest attention features.
Note that Flash Attention currently works only on GPU devices and under a half-precision regime: when using adapters, the base model should be loaded in half precision, and your GPU must be supported.
You can use Flash Attention 1 with the BetterTransformer API, which allows you to force-dispatch the API to use the Flash Attention kernel. However, this requires the latest optimum package to be installed.
If your dataset contains padding tokens, you'll need to use Flash Attention 2 integration, as Flash Attention 1 doesn't support training with padding tokens.
The TRL documentation reports benchmark numbers for Flash Attention 1 on a single NVIDIA T4 16GB GPU across different batch sizes and sequence lengths.
To use Flash Attention 2, you'll need to install the latest flash-attn package and add attn_implementation="flash_attention_2" when calling from_pretrained. This will allow you to train your model on an arbitrary dataset that includes padding tokens.
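A minimal sketch of enabling Flash Attention 2, assuming a supported GPU, half precision, and an installed flash-attn package (the model name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",                      # illustrative; any FA2-supported architecture
    torch_dtype=torch.bfloat16,               # Flash Attention requires half precision
    attn_implementation="flash_attention_2",  # dispatch attention to the flash-attn kernels
)
```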
Model Data
Customizing your prompts is a breeze with the Hugging Face Training Service. You can combine multiple fields from your dataset using a formatting function passed to the trainer.
The data format is flexible, but it needs to be compatible with the custom collator you'll define later. A common approach is to use conversational data that includes both text and images.
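A short sketch of such a formatting function for a text-only instruction dataset (the column names and model are assumptions about your setup, and the dataset object is assumed to exist):

```python
from trl import SFTConfig, SFTTrainer

def formatting_func(example):
    # Combine two dataset fields into one training prompt.
    return f"### Question: {example['instruction']}\n### Answer: {example['response']}"

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,  # assumes "instruction" and "response" columns
    args=SFTConfig(output_dir="/tmp/formatted-run", packing=True),
    formatting_func=formatting_func,
)
```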
You can adjust the data format to accommodate both text and images by using a conversational format, as in TRL's vision-language example script (examples/scripts/sft_vlm.py); the output is formatted accordingly.
The ConstantLengthDataset can be customized further by passing arguments directly to the SFTConfig constructor; refer to the class signature for details.
Conversational data is a common way to prepare training data, especially when it mixes text and images, and the custom collator you define needs to match that format.
Model Migration
The Hugging Face Training Service offers a seamless migration path for your existing Transformers models.
Ray 2.1 introduced the TransformersTrainer, which exposed a trainer_init_per_worker interface to define transformers.Trainer.
With the unified TorchTrainer API, you now have better control over your native Transformers training code.
This API aligns more with standard Hugging Face Transformers scripts, ensuring a smoother transition for your models.
Previously, you defined your transformers.Trainer inside a trainer_init_per_worker function and Ray ran the training loop as a black box; the TorchTrainer API instead lets you write and control the training function yourself.
Sources
- example scripts (github.com)
- training_args.py (github.com)
- accelerate (github.com)
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He (arxiv.org)
- examples/scripts/sft_vlm.py (github.com)
- examples/scripts/sft.py (github.com)
- trl (github.com)
- slower than expected (github.com)
- Get Started with Distributed Training using Hugging Face ... (ray.io)