AMD AI training with DeepSpeed and ROCm is a powerful combination that's gaining traction in the industry. DeepSpeed is an open-source deep learning optimization library developed by Microsoft that's designed to accelerate training of large-scale models.
DeepSpeed can be used with ROCm, AMD's open-source software platform for GPU computing, to take advantage of AMD's high-performance GPUs. This combination allows for faster training times and efficient scaling to larger models.
By leveraging ROCm, DeepSpeed can tap into the full potential of AMD's Instinct accelerators, which are designed for high-performance computing. The result is a significant speedup compared to CPU-based training.
One of the key benefits of using DeepSpeed with ROCm is the ability to scale up to large model sizes and complex AI workloads. This is particularly useful for applications like natural language processing and computer vision.
DeepSpeed Configuration
To get started with DeepSpeed, you'll need to create a configuration file for your training run.
The configuration file is where you specify settings such as batch sizes, precision, and ZeRO optimization options. Save it as ds_config.json so DeepSpeed can pick it up at launch time.
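As a starting point, a minimal ds_config.json might look like the sketch below. The values are placeholders to tune for your model and hardware; note that DeepSpeed requires train_batch_size to equal the per-GPU micro-batch size times the gradient accumulation steps times the number of GPUs (here, 4 x 8 x 8 = 256 for an 8-GPU run).

```json
{
  "train_batch_size": 256,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 3e-4,
      "weight_decay": 0.01
    }
  }
}
```

Enabling the bf16 section is a natural fit for AMD Instinct accelerators such as the MI250, which support bfloat16 natively.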
DeepSpeed offers a range of system innovations that make large-scale deep learning training more effective, efficient, and easy to use. These innovations fall under the training pillar.
Some of the key innovations include ZeRO, 3D-Parallelism, DeepSpeed-MoE, and ZeRO-Infinity. These innovations can be used together to achieve significant performance gains.
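For example, ZeRO stage 3 with CPU offload (which ZeRO-Infinity extends further to NVMe) is enabled purely through the configuration file. Here's a sketch of the relevant fragment of ds_config.json, with placeholder values:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu"
    },
    "offload_param": {
      "device": "cpu"
    }
  }
}
```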
For a detailed example of training with DeepSpeed on an AMD accelerator or GPU, check out the ROCm Blogs post "Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs."
Running the Training
The launch command is a crucial step in setting up the training process, and it's essential to get it right: it ties together DeepSpeed, your training script, and kernel optimizations such as Flash Attention.
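As a sketch of what such a command typically looks like (train.py is a hypothetical script name, the --deepspeed and --deepspeed_config arguments assume the script uses DeepSpeed's standard argument parsing, and any Flash Attention switch is specific to your training script):

```bash
# Launch train.py on 8 local GPUs, pointing DeepSpeed at the config file.
deepspeed --num_gpus=8 train.py --deepspeed --deepspeed_config ds_config.json
```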
The training process involves using a shared initial weights checkpoint for all runs. This helps ensure consistency across different systems and hardware configurations.
Here are the key determinants that contribute to the consistency of the training process:
- StreamingDataset provides elastic determinism while streaming data from object storage.
- Composer's microbatching engine enables accurate gradient calculation regardless of the microbatch size used.
- We use a shared initial weights checkpoint for all runs.
- We don't use any non-deterministic operations (e.g. Dropout).
- PyTorch FSDP and BF16 autocast are consistent across both systems.
- ROCm-based FlashAttention and Triton-based FlashAttention are numerically close.
- CUDA and ROCm kernels are numerically close.
- NCCL and RCCL distributed collectives are numerically close.
Training on MI250
The AMD MI250 is a datacenter accelerator that outperforms the NVIDIA A100 in some key areas, including peak floating-point throughput (TFLOP/s) in FP16 or BF16, and HBM memory capacity, with 128GB compared to the A100's 80GB.
The MI250 traditionally comes in systems with 4 GPUs, whereas the A100 comes in systems with 8 GPUs. This means you need to buy twice as many AMD systems as NVIDIA systems to reach the same compute target.
One of the biggest advantages of the MI250 is its ability to hold larger models for training or inference due to its larger HBM memory capacity. This is particularly important for large language models (LLMs) that require a significant amount of memory to train.
In terms of power consumption, the max power consumption for a single MI250 is higher than for a single A100. However, when looking at system power consumption in a node, the power per GPU is about the same, or a little better for AMD.
Here's a comparison of the MI250 and A100:
- Peak FP16/BF16 TFLOP/s: MI250 > A100
- HBM memory capacity: MI250 (128GB) > A100 (80GB)
- Max power consumption (single accelerator): MI250 > A100
- Power per GPU at the system level: MI250 ≈ A100
In a real-world test, training an MPT-1B model on 1B tokens of the C4 dataset on a single node with 4 MI250-128GB GPUs produced loss curves nearly identical to those from an NVIDIA 8xA100-40GB system. This is a significant result, especially considering the differences in hardware.
PyTorch and Distributed Training
PyTorch distributed training is categorized into three main components: Distributed Data-Parallel training (DDP), RPC-Based distributed training (RPC), and Collective communication. The focus is on the distributed data-parallelism strategy, which is the most popular.
The DDP workflow on multiple accelerators or GPUs involves splitting the current global training batch into small local batches on each GPU, copying the model to every device, running a forward and backward pass, and synchronizing the local gradients computed by each device.
In DDP training, each process or worker owns a replica of the model and processes a batch of data, then the reducer uses allreduce to sum up gradients over different workers.
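As a rough illustration of that workflow, here is a minimal DDP sketch in PyTorch. The model, data, and hyperparameters are placeholders; the same code runs unchanged on ROCm builds of PyTorch, where the "nccl" backend is served by RCCL.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
dist.init_process_group(backend="nccl")  # backed by RCCL on ROCm
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)  # torch.cuda targets AMD GPUs on ROCm builds

model = torch.nn.Linear(1024, 1024).cuda()       # stand-in for a real model
ddp_model = DDP(model, device_ids=[local_rank])  # wraps this GPU's replica
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

# One step on this rank's local batch; DDP allreduces gradients during backward().
x = torch.randn(8, 1024, device="cuda")
loss = ddp_model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()

dist.destroy_process_group()
```

You would launch this with something like torchrun --nproc_per_node=8 train_ddp.py, where the script name is again hypothetical.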
The benefits of using PyTorch for distributed training include the fact that existing code requires no changes when switching from NVIDIA to AMD, and that advanced distributed training algorithms like Fully Sharded Data Parallel (FSDP) work seamlessly.
Accelerating Training
Distributed training solutions can convert single-GPU training code to run on multiple accelerators or GPUs, making it possible to train large models like GPT-2 or Llama 2 70B.
PyTorch offers distributed training solutions to facilitate this, allowing you to scale up your training process with ease.
To train large models, you'll need to use multiple accelerators or GPUs, as a single GPU can't store all the model parameters required for training.
By using distributed training solutions, you can take advantage of multiple GPUs to speed up your training process.
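FSDP is one such solution: it shards parameters, gradients, and optimizer state across ranks so that no single GPU has to hold the full model. Here is a minimal sketch of wrapping a model with FSDP and bfloat16 mixed precision (one of several ways to combine FSDP with BF16); the model and hyperparameters are placeholders, and the code is identical on NVIDIA and AMD.

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Keep parameters, buffers, and gradient reductions in bfloat16.
bf16 = MixedPrecision(param_dtype=torch.bfloat16,
                      reduce_dtype=torch.bfloat16,
                      buffer_dtype=torch.bfloat16)

model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16)  # stand-in model
# FSDP moves the module to this rank's GPU and shards it across all ranks.
fsdp_model = FSDP(model, mixed_precision=bf16, device_id=local_rank)
optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
```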
For example, training an MPT-1B model on 1B tokens of the C4 dataset can be done on an NVIDIA 8xA100-40GB system or an AMD 4xMI250-128GB system, with nearly identical loss curves over 1B tokens.
This consistency rests on the same determinism factors listed earlier: a shared initial weights checkpoint, no non-deterministic operations, consistent FSDP and BF16 autocast behavior, and numerically close kernels and collectives across CUDA/ROCm and NCCL/RCCL.
Automatic mixed precision (AMP) can also be used to reduce training time and memory usage, which is highly beneficial for large models.
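A minimal single-GPU AMP sketch with a stand-in model looks like this; the same torch.cuda.amp API works on ROCm builds of PyTorch.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()         # loss scaling guards against FP16 underflow

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad()
    # Forward pass runs in mixed precision; matmuls use the GPU's lower-precision matrix units.
    with torch.cuda.amp.autocast():
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```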
Frequently Asked Questions
What is the AMD AI strategy?
AMD's AI strategy focuses on creating a comprehensive ecosystem for AI innovation through developer tools, industry partnerships, and cloud integrations. This approach aims to accelerate AI adoption and market penetration.
Sources
- AMD ROCm GitHub (github.com)
- DeepSpeed (github.com)
- Collective communication (pytorch.org)
- RPC-Based distributed training (pytorch.org)
- automatic mixed precision (pytorch.org)
- rocm/dev-ubuntu-20.04:5.4.3-complete (docker.com)
- Training Large Vision Models (LVMs): Benchmarking AMD ... (landing.ai)