Distributed training with Hugging Face is a game-changer for achieving fast results. It allows you to scale up your training process by splitting your data and model across multiple machines.
This approach is particularly useful for large-scale models that require significant computational resources: by distributing the work, you can take advantage of multiple GPUs or machines to cut training time.
Hugging Face provides a range of tools and libraries to make distributed training easier. Its Transformers library is designed to work with distributed training frameworks and tools such as PyTorch's DistributedDataParallel, DeepSpeed, and Ray Train.
Getting Started
To get started with distributed training using Hugging Face, you'll first need to install the 🤗 Accelerate library. You can do this via pip with the command `pip install accelerate`.
To set up your environment for distributed training, use the `accelerate config` command. This command will guide you through the necessary configurations based on your hardware setup.
Here are the basic steps to modify your training loop for distributed training:
- Import the Accelerator from the accelerate library.
- Prepare your model, optimizer, and dataloader using the Accelerator.
- Launch the training script with `accelerate launch` (or, if you're orchestrating the job with Ray Train, wrap your training function in a TorchTrainer).
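Here's a minimal sketch of that pattern; the toy model, optimizer, and dataloader are placeholders for your own objects.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Toy model, optimizer, and data; swap in your own Transformers model and dataset.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))), batch_size=8)

accelerator = Accelerator()
# prepare() wraps each object so it runs on whatever devices `accelerate config` described.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for inputs, labels in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # replaces loss.backward() so gradients sync across processes
    optimizer.step()
```

Saved as a script, this loop can then be started across your configured devices with `accelerate launch train.py` (the filename is an example).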
On the Ray Train side, ScalingConfig defines the number of distributed training workers and whether to use GPUs. This is an essential part of setting up your environment for distributed training with Ray.
Configuring
Configuring your distributed training setup is a crucial step in optimizing your training process. When launching with `torchrun`, you can specify the number of GPUs to use per machine by setting the `nproc_per_node` argument, for example `torchrun --nproc_per_node=4 train.py`.
To scale your training across multiple devices with Ray Train, you'll want to consider the number of distributed training worker processes. This is where the num_workers parameter comes in, letting you control how many workers are started.
You can also configure each worker to use a GPU or CPU by setting the use_gpu parameter, which gives you flexibility in how you allocate your resources.
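If you're scaling out with Ray Train, those two settings live on its ScalingConfig. A minimal sketch, with an illustrative worker count:

```python
from ray.train import ScalingConfig

# Four worker processes, each pinned to one GPU; set use_gpu=False to run on CPUs instead.
scaling_config = ScalingConfig(num_workers=4, use_gpu=True)
```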
Here's a quick rundown of the key parameters to consider:
- nproc_per_node: the number of processes (GPUs) to start on each machine when launching with torchrun
- num_workers: the number of distributed training worker processes (Ray Train's ScalingConfig)
- use_gpu: whether each Ray Train worker is assigned a GPU or runs on CPU
For TensorFlow users, the MirroredStrategy is a great option for distributed training. This strategy automatically utilizes all available GPUs without requiring additional arguments in your training script.
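Here's a rough sketch of what that looks like with a 🤗 TensorFlow model; the checkpoint name is just an example.

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# MirroredStrategy replicates the model on every visible GPU automatically.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
    model.compile(optimizer="adam")  # the model's internal loss is used when none is given
```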
Training Process
Training in a distributed fashion allows you to train on multiple machines at once, which can significantly speed up the process.
In this setup, each machine runs its own process, and they communicate with each other to share information and updates.
Whether a process is the local main process is exposed by the Trainer's is_local_process_zero flag, which is True for the process with local rank 0 on its machine; is_world_process_zero identifies the single global main process across all machines.
These flags are useful for deciding which process should do one-off work such as printing progress or writing files, especially when working with multiple machines.
The global main process handles coordination tasks like logging metrics and saving checkpoints, while each local main process handles per-node work.
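A small sketch of how to guard such code with 🤗 Accelerate's equivalent flags:

```python
from accelerate import Accelerator

accelerator = Accelerator()

if accelerator.is_local_main_process:
    print("Main process on this machine (local rank 0).")
if accelerator.is_main_process:
    print("Global main process; handle logging and saving here.")
```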
Performance Optimization
To maximize the efficiency of your distributed training, adjust your batch size according to the number of GPUs: the effective batch size is the per-device batch size multiplied by the number of devices, so scaling it keeps every GPU well utilized.
Batch size is a crucial factor in distributed training. If memory is a constraint, gradient accumulation can simulate larger batch sizes without exceeding memory limits by accumulating gradients over several small batches before each optimizer step.
Monitoring training performance is essential. Utilize tools like TensorBoard to track your progress and make adjustments as necessary.
Gradient accumulation is particularly useful when working with limited hardware resources, since it effectively increases your batch size without running out of memory.
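A minimal sketch of both settings through TrainingArguments; the numbers are only illustrative.

```python
from transformers import TrainingArguments

# Effective batch size per device = 8 x 4 = 32, without the memory cost of a real batch of 32.
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    logging_dir="logs",        # TensorBoard event files are written here
    report_to="tensorboard",
)
```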
Here are some key considerations for performance optimization:
- Scale the batch size with the number of GPUs so every device stays busy.
- Use gradient accumulation when the desired batch size doesn't fit in memory.
- Monitor training with a tool such as TensorBoard and adjust as you go.
By applying these performance optimization techniques, you can significantly enhance the efficiency of your distributed training with Hugging Face models.
Distributed Training
Distributed training is a powerful technique for training large models on limited hardware. As noted above, you can specify the number of GPUs to use per machine with `torchrun`'s `nproc_per_node` argument, which is crucial for scaling your training across multiple devices.
To begin utilizing 🤗 Accelerate for distributed training, follow these steps: install the library via pip, run the `accelerate config` command to describe your hardware, and start your script with `accelerate launch`.
The 🤗 Accelerate library is designed to facilitate the training of 🤗 Transformers models across various distributed setups, whether utilizing multiple GPUs on a single machine or several machines. If you're orchestrating the job with Ray Train instead, you launch the distributed training job by wrapping your training function in a TorchTrainer.
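A rough sketch of the Ray Train path, assuming a train_func that contains your usual Transformers or Accelerate training loop (the body below is just a placeholder):

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    # Your usual Transformers / Accelerate training loop goes here;
    # Ray runs one copy of this function on every worker.
    pass

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()
```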
Here are the main differences in the inter-GPU communication overhead between DataParallel (DP) and DistributedDataParallel (DDP):
DP is ~10% slower than DDP with NVLink, but ~15% faster than DDP without NVLink. The real difference will depend on how much data each GPU needs to sync with the others.
Prepare Transformers Trainer
To prepare a Transformers Trainer for Ray, pass it into Ray Train's prepare_trainer() method, which validates your configurations and enables Ray Data integration. This is a crucial step in setting up distributed training of Hugging Face Transformers on Ray.
Here are the pieces you'll need inside your per-worker training function (see the sketch after this list):
- Your Transformers Trainer instance
- The RayTrainReportCallback, added to the Trainer so that metrics and checkpoints are reported back to Ray
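A minimal sketch of that pattern, assuming `trainer` is an already-constructed Transformers Trainer inside the function Ray runs on each worker:

```python
import ray.train.huggingface.transformers as ray_transformers

# Inside your per-worker training function, after building `trainer`:
trainer.add_callback(ray_transformers.RayTrainReportCallback())  # report metrics/checkpoints to Ray
trainer = ray_transformers.prepare_trainer(trainer)              # validate config, enable Ray Data integration
trainer.train()
```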
By following these steps, you'll be able to take advantage of the powerful features of 🤗 Accelerate and Hugging Face Transformers to train your models more efficiently.
Log
In a distributed environment, logging is done only for a process with rank 0.
You can log metrics in a specially formatted way using the `logs` argument, which is a dictionary of key-value pairs where the values are floats.
If you want to customize the log level, you can use the `get_process_log_level` method, which returns the log level to be used depending on whether this process is the main process of node 0, the main process of node non-0, or a non-main process.
To set the logging strategy, you can use the `set_logging` method, which takes several arguments, including the logging strategy, the number of update steps between two logs, the logger log level, and the list of integrations to report the results and logs to.
Here are some possible values for the logging strategy:
- no (no logging during training)
- epoch (log at the end of every epoch)
- steps (log every logging_steps updates)
You can also set the logger log level to one of the following values: "debug", "info", "warning", "error", "critical", or "passive".
If you want to report to all integrations installed, you can set the `report_to` argument to "all". If you want to report to no integrations, you can set it to "none".
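A short sketch of set_logging with illustrative values:

```python
from transformers import TrainingArguments

args = TrainingArguments(output_dir="outputs")
# Log every 100 optimizer steps at INFO level and send the results to TensorBoard.
args = args.set_logging(strategy="steps", steps=100, level="info", report_to="tensorboard")
```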
Seq2Seq Arguments
Seq2Seq training brings its own considerations to distributed training, most of them centred on how much data workers have to exchange and how batches are formed.
Gradient synchronization between workers is typically handled with a Ring AllReduce style algorithm, which keeps the per-worker communication volume roughly constant as more workers are added; tree-based reductions go further and cut the number of communication rounds from O(n) to O(log n), where n is the number of workers, which can noticeably shorten training time.
A key challenge for Seq2Seq models is the variable-length sequences they consume and produce. This requires careful consideration of how batches are padded and partitioned so that workers stay balanced and communication stays efficient.
By combining data parallelism with model parallelism where needed, Seq2Seq models can be trained efficiently on large datasets, with workers processing different shards of the data, and different parts of the model, in parallel.
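One practical way to handle those variable-length sequences is dynamic padding combined with length grouping; a sketch assuming you're on the Transformers Trainer stack (the checkpoint name is an example):

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # example checkpoint
# Pad each batch only to the length of its longest example rather than a global maximum.
collator = DataCollatorForSeq2Seq(tokenizer)
# group_by_length batches samples of similar length together, reducing wasted padding on every worker.
args = Seq2SeqTrainingArguments(output_dir="outputs", group_by_length=True)
```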
Push to Hub
Push to Hub is an essential feature in distributed training, allowing you to upload your model and processing class to the 🤗 model hub on the repo specified by self.args.hub_model_id.
You can customize the commit message, blocking behavior, token, and revision using the push_to_hub function. For example, you can set the commit message to a custom string, such as "My custom commit message".
The push_to_hub function also allows you to specify additional keyword arguments using the kwargs parameter, which are passed along to create_model_card().
To set up the push to hub feature, you need to specify the model_id, which can be a simple model ID or a whole repository name. You can also customize the strategy, token, private repo, and always_push behavior using the set_push_to_hub function.
Here are the possible values for the strategy parameter:
- end (push the model when save_model() is called)
- every_save (push on every model save)
- checkpoint (like every_save, but the latest checkpoint folder is also pushed so training can be resumed)
- all_checkpoints (push every checkpoint folder)
You can also set the token to use to push the model to the Hub, which will default to the token in the cache folder obtained with huggingface-cli login.
The set_push_to_hub function will set self.push_to_hub to True, which means the output_dir will begin a git directory synced with the repo determined by model_id, and the content will be pushed each time a save is triggered.
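A brief sketch, with a placeholder repo name:

```python
from transformers import TrainingArguments

args = TrainingArguments(output_dir="outputs")
# "your-username/your-model" is a placeholder repo id; the token from `huggingface-cli login` is used by default.
args = args.set_push_to_hub("your-username/your-model", strategy="every_save")
```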
Trainer and Optimizer
The Trainer and optimizer are crucial components in Hugging Face's distributed training. The Trainer provides a reasonable default optimizer that works well, but you can pass an (optimizer, scheduler) tuple through the Trainer's optimizers init argument, or subclass the Trainer and override its optimizer-creation method.
You're responsible for providing a function that computes metrics, as they are task-dependent; pass it to the Trainer's compute_metrics init argument.
To configure the default optimizer, you can use TrainingArguments' set_optimizer method, which lets you specify the optimizer name, learning rate, weight decay, beta1, beta2, epsilon, and any optional extra arguments.
The create_optimizer method sets up the optimizer, while the create_optimizer_and_scheduler method sets up both the optimizer and the learning rate scheduler.
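A small sketch of set_optimizer with illustrative hyperparameters:

```python
from transformers import TrainingArguments

args = TrainingArguments(output_dir="outputs")
# Configure the default optimizer; the values here are only examples.
args = args.set_optimizer(name="adamw_torch", learning_rate=5e-5, weight_decay=0.01)
```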
Logging and Metrics
Logging and metrics are crucial aspects of distributed training with Hugging Face. To persist checkpoints and monitor training progress, you can add a RayTrainReportCallback utility callback to your Trainer.
In a distributed environment, logging is only done for a process with rank 0, which means that only the main process logs metrics and reports checkpoints.
You can customize the logging strategy to adopt during training by using the set_logging method. This method allows you to specify the logging strategy, number of update steps between logs, logger log level, and more.
Here are some possible logging strategies:
- no (no logging during training)
- epoch (log at the end of every epoch)
- steps (log every logging_steps updates)
You can also specify the logging level to use on the main process and replicas. The possible choices are "debug", "info", "warning", "error", and "critical", or "passive" which doesn't set anything.
In multinode distributed training, you can choose to log using log_level once per node, or only on the main node. The default is to log once per node.
The log level to be used can be determined by the get_process_log_level method, which returns the log level to be used depending on whether this process is the main process of node 0, main process of node non-0, or a non-main process.
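A sketch of how these options fit together; the levels and log_on_each_node value are just examples.

```python
import logging
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",
    log_level="info",
    log_level_replica="warning",
    log_on_each_node=False,   # log once globally rather than once per node
)
# INFO on the main process of node 0, WARNING everywhere else.
logging.basicConfig(level=args.get_process_log_level())
```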
Hyperparameter Search
Hyperparameter Search is a crucial step in distributed training with Hugging Face. It allows you to search for the optimal hyperparameters for your model.
The hyperparameter search process is launched using the `hyperparameter_search` function, which takes several parameters, including `hp_space`, `compute_objective`, `n_trials`, `direction`, `backend`, `hp_name`, and `kwargs`. The `hp_space` parameter defines the hyperparameter search space, which can be a function that returns a dictionary of hyperparameters.
The `compute_objective` parameter defines the objective function to be optimized, which can be a function that takes a dictionary of hyperparameters and returns a float value. The `n_trials` parameter specifies the number of trial runs to test, which defaults to 100.
The `direction` parameter specifies the direction of optimization, which can be either "minimize" or "maximize". If it's single-objective optimization, the direction is a string; if it's multi-objective optimization, the direction is a list of strings.
Here's a summary of the available backends:
- Optuna
- Ray Tune
- SigOpt
If all three backends are installed, Optuna is used by default. The `hp_name` parameter can be used to define a custom trial/run name. Additional keyword arguments can be passed to optuna.create_study or ray.tune.run using the `kwargs` parameter.
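A sketch of the Optuna path; the search space and trial count are illustrative, and the Trainer is assumed to have been created with a model_init function so that each trial starts from a fresh model.

```python
def hp_space(trial):
    # Keys must match TrainingArguments field names.
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [16, 32]),
    }

best_run = trainer.hyperparameter_search(  # `trainer` built with model_init=...
    hp_space=hp_space,
    direction="minimize",
    backend="optuna",
    n_trials=20,
)
print(best_run.hyperparameters)
```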
Seq2Seq Trainer
Seq2Seq Trainer is a powerful tool for training sequence-to-sequence models. It's a part of the Hugging Face Transformers library, which we'll be using for distributed training.
Seq2Seq Trainer is designed to handle a variety of tasks, including machine translation, text summarization, and question answering.
You can use it to train models on large datasets, and it supports many popular sequence-to-sequence architectures, including encoder-decoder Transformers such as T5 and BART.
To use Seq2Seq Trainer, you'll need to define a dataset and a model, then pass them to the trainer. This can be done with just a few lines of code.
Seq2Seq Trainer also supports distributed training, which allows you to train your model on multiple GPUs or machines at the same time.
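A condensed sketch of that setup; the checkpoint name is an example, and train_dataset / eval_dataset are assumed to be already-tokenized datasets prepared elsewhere.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "t5-small"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

args = Seq2SeqTrainingArguments(
    output_dir="seq2seq-outputs",
    per_device_train_batch_size=8,
    predict_with_generate=True,  # use generate() during evaluation
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # assumed: tokenized training split
    eval_dataset=eval_dataset,    # assumed: tokenized validation split
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

Launched with `accelerate launch` or `torchrun`, the same script runs distributed across your GPUs without further code changes.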
Sources
- Get Started with Distributed Training using Hugging Face ... (ray.io)
- Distributed Training With Hugging Face | Restackio (restack.io)
- DDP (pytorch.org)
- torch.distributed (pytorch.org)
- Fairscale (github.com)
- DeepSpeed (deepspeed.ai)
- Varuna (github.com)
- Megatron-LM (github.com)
- FairScale (fairscale.readthedocs.io)
- Interleaved Pipeline (amazon.com)
- Efficient Large-Scale Language Model Training on GPU Clusters (arxiv.org)
- @anton-l (github.com)
- parallelformers (github.com)
- DeepSpeed (github.com)
- 3D parallelism: Scaling to trillion-parameter models (microsoft.com)
- Megatron-Deepspeed from BigScience (github.com)
- Megatron-DeepSpeed (github.com)
- Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model (arxiv.org)
- FlexFlow (github.com)
- “Beyond Data and Model Parallelism for Deep Neural Networks” by Zhihao Jia, Matei Zaharia, Alex Aiken (arxiv.org)
- transformers.utils.fx (github.com)
- Memory Centric Tiling (deepspeed.readthedocs.io)
- Liger (github.com)
- https://arxiv.org/abs/2403.03507 (arxiv.org)
- original paper (arxiv.org)
- https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group (pytorch.org)
- training_args.py (github.com)
- Deepspeed (github.com)
- Apex documentation (nvidia.github.io)
- Zero Redundancy Optimizer (ZeRO) (arxiv.org)
- DataParallel (pytorch.org)
- DistributedDataParallel (pytorch.org)