Amazon SageMaker Hugging Face is a powerful tool for streamlining machine learning workflows. It integrates the popular Hugging Face Transformers library with SageMaker's managed training and deployment capabilities.
With SageMaker Hugging Face, you can easily deploy pre-trained models to production environments, which removes much of the manual deployment work and reduces the risk of human error.
The integration also gives you access to a large catalog of pre-trained models and datasets, making it easy to get started with machine learning projects, particularly for natural language processing and computer vision tasks.
By leveraging the strengths of both Hugging Face and SageMaker, you can accelerate your machine learning development process and focus on more complex tasks.
What Is Hugging Face LLM Inference?
Hugging Face LLM Inference DLC is a purpose-built container for deploying Large Language Models (LLMs) in a secure and managed environment.
It's powered by Text Generation Inference (TGI), an open-source solution that enables high-performance text generation using Tensor Parallelism and dynamic batching.
TGI is already used by customers like IBM and Grammarly, and it's optimized for popular open-source LLMs such as StarCoder, BLOOM, GPT-NeoX, Llama, and T5.
The DLC implements various optimizations, including Tensor Parallelism and custom CUDA kernels, optimized transformers code for inference, quantization, and accelerated weight loading.
Here are some of the technologies used in TGI:
- Tensor Parallelism and custom CUDA kernels
- Optimized transformers code for inference using flash-attention
- Quantization with bitsandbytes
- Continuous batching of incoming requests
- Accelerated weight loading with safetensors
- Logits warpers (temperature scaling, topk, repetition penalty ...)
- Watermarking with A Watermark for Large Language Models
- Stop sequences, Log probabilities
- Token streaming using Server-Sent Events (SSE)
With the Hugging Face LLM Inference DLC on Amazon SageMaker, AWS customers can benefit from these technologies and deploy LLM models with high concurrency and low latency.
Installation and Setup
To get started with Amazon SageMaker HuggingFace, you'll first need to sign up for an AWS account.
You can start using SageMaker in one of three ways: SageMaker Studio, SageMaker notebook instance, or a local environment.
To train from a local environment, you'll need to set up an IAM role with the appropriate SageMaker permissions.
Setting up the SageMaker environment is straightforward, but note that the execution role (retrieved with get_execution_role()) is only available when running a notebook within SageMaker, so you'll need to supply the role ARN yourself if you're running your code elsewhere.
To install additional dependencies, create a requirements.txt file in the same directory as your training script and add the SageMaker SDK for Accelerate to it, along with any other dependencies you need.
Here are the three ways to get started with SageMaker:
- SageMaker Studio
- SageMaker notebook instance
- Local environment
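As a quick check of your setup, here's a minimal sketch that creates a session and resolves the execution role. The role ARN is a placeholder; swap in your own if you're working outside SageMaker.

```python
# Minimal setup sketch. Assumes the SageMaker Python SDK is installed
# (pip install sagemaker) and that an IAM role with SageMaker permissions exists.
import sagemaker

sess = sagemaker.Session()

try:
    # Works inside SageMaker Studio and notebook instances.
    role = sagemaker.get_execution_role()
except ValueError:
    # In a local environment, pass the ARN of your SageMaker execution role (placeholder below).
    role = "arn:aws:iam::111122223333:role/sagemaker-execution-role"

print(f"Region: {sess.boto_region_name}")
print(f"Default S3 bucket: {sess.default_bucket()}")
```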
Features & Benefits
With the Hugging Face DLCs, you can train cutting-edge Transformers-based NLP models in a single line of code. This is a game-changer for data science teams, allowing them to reduce the time required to set up and run experiments from days to minutes.
You can choose from multiple DLC variants, each optimized for either TensorFlow or PyTorch and for single-GPU, single-node multi-GPU, or multi-node cluster training. This flexibility lets you pick the training infrastructure with the best price/performance ratio for your workload.
The Hugging Face DLCs feature built-in performance optimizations for PyTorch and TensorFlow to train NLP models faster. This means you can get results faster without having to worry about the underlying infrastructure.
Hugging Face DLCs are fully integrated with SageMaker distributed training libraries to train models faster than ever. This is especially useful when working with large datasets or complex models.
You can use the Hugging Face DLCs with SageMaker's automatic model tuning to optimize your training hyperparameters and increase the accuracy of your models. This is a huge advantage, as it allows you to automate the process of finding the best model configuration.
Here are some key benefits of using Hugging Face DLCs with SageMaker:
- Train cutting-edge NLP models in a single line of code
- Choose from multiple DLC variants for optimal performance
- Built-in performance optimizations for PyTorch and TensorFlow
- Fully integrated with SageMaker distributed training libraries
- Automate model tuning with SageMaker's automatic model tuning
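To make the "single line of code" claim concrete, here is a hedged sketch of launching a managed training job with the Hugging Face Estimator. The script name, instance type, version pins, hyperparameters, and S3 paths are assumptions to adapt to your own project.

```python
# Sketch: launch a Hugging Face training job on SageMaker.
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# Placeholder S3 locations for preprocessed datasets (see the Data section below).
training_input_path = "s3://your-bucket/imdb/train"
test_input_path = "s3://your-bucket/imdb/test"

huggingface_estimator = HuggingFace(
    entry_point="train.py",          # your training script
    source_dir="./scripts",          # directory with train.py (and optional requirements.txt)
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.26",     # example version pins; pick a supported combination
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters={"epochs": 3, "train_batch_size": 32, "model_name": "distilbert-base-uncased"},
)

# A single call starts the managed training job; the channels map to S3 inputs.
huggingface_estimator.fit({"train": training_input_path, "test": test_input_path})
```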
Sample Notebooks
Amazon SageMaker offers a wide range of sample notebooks to help you get started with your projects. These notebooks cover various topics, including deep learning frameworks like PyTorch and TensorFlow.
You can find notebooks on PyTorch and TensorFlow specifically, such as "Getting Started with Pytorch" and "Getting Started with Tensorflow". These notebooks provide a great starting point for beginners and experienced users alike.
SageMaker also offers notebooks on distributed training, including "Distributed Training Data Parallelism" and "Distributed Training Model Parallelism". These notebooks help you learn how to scale your models and train them more efficiently.
Some notebooks focus on specific use cases, such as "Image Classification with Vision Transformer" and "Deploy one of the 10 000+ Hugging Face Transformers to Amazon SageMaker for Inference". These notebooks show you how to apply Hugging Face Transformers to real-world problems.
Here's a list of some of the sample notebooks available in SageMaker:
- All notebooks
- Getting Started with Pytorch
- Getting Started with Tensorflow
- Distributed Training Data Parallelism
- Distributed Training Model Parallelism
- Spot Instances and continue training
- SageMaker Metrics
- Distributed Training Data Parallelism Tensorflow
- Distributed Training Summarization
- Image Classification with Vision Transformer
- Deploy one of the 10 000+ Hugging Face Transformers to Amazon SageMaker for Inference
- Deploy a Hugging Face Transformer model from S3 to SageMaker for inference
Transformers
You can deploy a Transformers model trained in SageMaker in two ways: directly after training finishes, or at a later time from S3 by pointing the model_data argument at your stored model artifacts.
To deploy a model from S3 to SageMaker for inference, you can open the deploy_transformer_model_from_s3.ipynb notebook for an example.
Inference can become bottlenecked on a GPU-based instance when the serving pipeline runs on a single vCPU, which can prevent the GPU from being fully utilized for model inference.
You can learn more about this limitation and how to optimize your pipelines here.
After Training
After training, you can deploy your model directly, provided your training script saves all required files, including both the model and the tokenizer.
Make sure the tokenizer is saved along with the model, because it's needed at inference time.
You can pass your tokenizer as an argument to the Hugging Face Trainer, which will automatically save it when you call trainer.save_model().
This makes it easy to deploy your model without having to worry about saving individual components.
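The following excerpt sketches how this looks inside a training script. The model choice is an example, and the data is assumed to have been saved with datasets' save_to_disk so the "train" channel can be reloaded with load_from_disk.

```python
# Excerpt from a hypothetical train.py run by the Hugging Face Estimator.
import os
from datasets import load_from_disk
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# SageMaker mounts the "train" input channel here and expects the final model in SM_MODEL_DIR.
train_dataset = load_from_disk(os.environ["SM_CHANNEL_TRAIN"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir=os.environ.get("SM_MODEL_DIR", "/opt/ml/model")),
    train_dataset=train_dataset,
    tokenizer=tokenizer,   # passing the tokenizer means save_model() writes its files too
)
trainer.train()
trainer.save_model()       # saves model weights, config, and tokenizer to the model directory
```

Back in the notebook, once the training job has finished, calling huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge") creates an endpoint directly from the resulting artifacts.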
Data
When deploying from stored artifacts, the model_data argument specifies the Amazon S3 location of your tokenizer and model weights.
You can preprocess your data with the datasets library, as demonstrated with the imdb dataset, which consists of 25,000 training and 25,000 test examples of highly polar movie reviews.
After downloading and preprocessing, the dataset is uploaded to the current session's default S3 bucket, where the training job reads it.
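Here is a hedged sketch of that preparation step. The tokenizer choice, local paths, and S3 prefix are assumptions.

```python
# Sketch: download, tokenize, and upload the imdb dataset to the default bucket.
import sagemaker
from datasets import load_dataset
from transformers import AutoTokenizer
from sagemaker.s3 import S3Uploader

sess = sagemaker.Session()
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

dataset = load_dataset("imdb")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

# Save locally, then upload to the session's default bucket for the training job.
dataset["train"].save_to_disk("./imdb/train")
training_input_path = S3Uploader.upload("./imdb/train", f"s3://{sess.default_bucket()}/imdb/train")
print(training_input_path)
```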
Create an Artifact
To create an artifact, you'll need to package your model files into a single .tar.gz file. This file should contain the required files, such as pytorch_model.bin, tf_model.h5, tokenizer.json, and tokenizer_config.json.
A typical model.tar.gz file will include the following files:
- pytorch_model.bin
- tf_model.h5
- tokenizer.json
- tokenizer_config.json
You can also create a model.tar.gz file from a model on the 🤗 Hub by downloading it, packaging the files into the archive, uploading it to S3, and providing that S3 URI as the model_data argument.
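As a sketch of the packaging and deployment flow, the snippet below archives a local model directory, uploads it, and deploys it with HuggingFaceModel. The paths, task, and version pins are placeholders.

```python
# Sketch: package a saved model directory into model.tar.gz and deploy it from S3.
import tarfile
import sagemaker
from sagemaker.s3 import S3Uploader
from sagemaker.huggingface import HuggingFaceModel

sess = sagemaker.Session()

# my_model/ contains pytorch_model.bin (or tf_model.h5), config.json,
# tokenizer.json, tokenizer_config.json, and related files.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("my_model/", arcname=".")

model_data = S3Uploader.upload("model.tar.gz", f"s3://{sess.default_bucket()}/my-model")

huggingface_model = HuggingFaceModel(
    model_data=model_data,                      # S3 URI of the model.tar.gz
    role=sagemaker.get_execution_role(),
    env={"HF_TASK": "text-classification"},     # example task for the inference pipeline
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)
predictor = huggingface_model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```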
From the HUB
You can deploy a model from the 🤗 Hub to Amazon SageMaker with just a few steps.
To deploy a model directly from the 🤗 Hub, define two environment variables, HF_MODEL_ID and HF_TASK, when you create a HuggingFaceModel.
The process is straightforward, and you can find a detailed example in the deploy_transformer_model_from_hf_hub.ipynb notebook.
Open the notebook to see how to deploy a model from the 🤗 Hub to SageMaker for inference.
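The pattern looks roughly like the sketch below. The model ID, task, and instance type are examples; pick the ones that match your use case.

```python
# Sketch: deploy a model straight from the Hugging Face Hub.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

hub = {
    "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",  # model id from hf.co/models
    "HF_TASK": "text-classification",                                   # pipeline task
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=sagemaker.get_execution_role(),
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)

predictor = huggingface_model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
print(predictor.predict({"inputs": "I love using SageMaker with Hugging Face!"}))
```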
Run Batch Transform with Transformers
You can use SageMaker batch transform to perform inference with your trained model. This process accepts your inference data as an S3 URI and takes care of downloading the data, running the prediction, and uploading the results to S3.
The Hugging Face Inference DLC currently only supports .jsonl for batch transform due to the complex structure of textual data. Note that your inputs should fit the max_length of the model during preprocessing.
To create a transform job for a model based on the training job, call the transformer() method. This method is available if you trained a model using the Hugging Face Estimator.
If you want to run your batch transform job later or with a model from the Hugging Face Hub, create a HuggingFaceModel instance and then call the transformer() method.
Here's an example of how to run a batch transform job for inference:
- Open the sagemaker-notebook.ipynb notebook for an example of how to run a batch transform job for inference.
- Make sure that the latest version of SageMaker SDK is installed.
Once you're done experimenting, don't forget to delete the endpoint and the model resources.
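The sketch below pulls these steps together for the "model from S3 or the Hub" case, creating a HuggingFaceModel and calling transformer() on it. The S3 paths and task are placeholders, and the input file is expected to be .jsonl with one {"inputs": "..."} object per line.

```python
# Sketch: run a SageMaker batch transform job with a Hugging Face model.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    model_data="s3://your-bucket/my-model/model.tar.gz",  # placeholder model artifact
    role=sagemaker.get_execution_role(),
    env={"HF_TASK": "text-classification"},               # example task
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)

batch_job = huggingface_model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    strategy="SingleRecord",
)

batch_job.transform(
    data="s3://your-bucket/batch-input/data.jsonl",  # placeholder .jsonl input location
    content_type="application/json",
    split_type="Line",
)
batch_job.wait()
```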
Using TGI
Using TGI is a great way to deploy a high-performance serving container for LLMs. You can use the Hugging Face TGI container, which utilizes the Text Generation Inference library.
To get started, you'll need to import the SageMaker Python SDK and instantiate a sagemaker_session to find the current region and execution role. This will give you the necessary information to proceed with deploying your model.
Next, you'll need to retrieve the LLM image URI, which you can do using the helper function get_huggingface_llm_image_uri(). This function takes a required parameter backend, which specifies the type of backend to use for the model, and several optional parameters.
The backend you choose will determine how your model is deployed. In this case, we're using the Hugging Face TGI backend, which is a great option for LLMs.
To configure the model object, you'll need to specify a unique name, the image_uri for the managed TGI container, and the execution role for the endpoint. You'll also need to define several environment variables, including HF_MODEL_ID, which corresponds to the model from the Hugging Face Hub that will be deployed.
You should also define SM_NUM_GPUS, which specifies the tensor parallelism degree of the model. This is especially important when working with LLMs that are too big for a single GPU. For example, if you're using an instance type with 4 available GPUs, you should set SM_NUM_GPUS to 4.
One optional step you can take is to reduce the memory and computational footprint of the model by setting the HF_MODEL_QUANTIZE environment variable (for example, to bitsandbytes). However, keep in mind that quantization may affect the quality of the output for some models.
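Putting these steps together, here is a hedged sketch of the flow. The model ID, instance type, and timeout are example values; ml.g4dn.12xlarge has 4 GPUs, hence SM_NUM_GPUS is set to 4.

```python
# Sketch: deploy an LLM with the Hugging Face TGI container on SageMaker.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Retrieve the managed TGI container image URI for the Hugging Face backend.
llm_image = get_huggingface_llm_image_uri(backend="huggingface")

llm_model = HuggingFaceModel(
    name="my-tgi-llm",                      # unique model name (placeholder)
    image_uri=llm_image,
    role=role,
    env={
        "HF_MODEL_ID": "bigscience/bloom-7b1",   # example model from the Hugging Face Hub
        "SM_NUM_GPUS": "4",                      # tensor parallelism degree = GPUs on the instance
        # "HF_MODEL_QUANTIZE": "bitsandbytes",   # optional: reduce memory footprint
    },
)

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.12xlarge",
    container_startup_health_check_timeout=600,  # large models can take a while to load
)

print(llm.predict({"inputs": "What is Amazon SageMaker?"}))
```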
User Defined Code
User Defined Code is a powerful feature of the Hugging Face Inference Toolkit, allowing you to customize the behavior of the HuggingFaceHandlerService.
To start, you'll need to create a folder named code/ with an inference.py file inside. This file will contain your custom inference module, which can override several methods of the default HuggingFaceHandlerService.
You can override the model_fn method to load your model in a custom way. This method receives the model_dir as an argument, which is the path to your unzipped model.tar.gz.
The transform_fn method is another key one to override. This method allows you to implement your own preprocess, predict, and postprocess steps. It's worth noting that you can't combine this method with input_fn, predict_fn, or output_fn.
Here are the methods you can override in the custom inference module:
- model_fn(model_dir) - overrides the default method for loading a model
- transform_fn(model, data, content_type, accept_type) - overrides the default transform function with your custom implementation
- input_fn(input_data, content_type) - overrides the default method for preprocessing
- predict_fn(processed_data, model) - overrides the default method for predictions
- output_fn(prediction, accept) - overrides the default method for postprocessing
For example, you can create a custom inference module with only model_fn and transform_fn. This will allow you to load your model and implement your own transform function.
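A minimal sketch of such a module is shown below. The model type, task, and JSON handling are assumptions for the sake of the example.

```python
# code/inference.py -- custom inference module implementing only model_fn and transform_fn.
import json

from transformers import pipeline


def model_fn(model_dir):
    # model_dir is the path to the unzipped model.tar.gz; load it however you like.
    return pipeline("text-classification", model=model_dir, tokenizer=model_dir)


def transform_fn(model, data, content_type, accept_type):
    # Custom preprocess -> predict -> postprocess in one place.
    inputs = json.loads(data)["inputs"]
    prediction = model(inputs)
    return json.dumps(prediction)
```

The code/ folder is typically packaged inside the model.tar.gz alongside the model files, or provided through the entry_point and source_dir arguments when you create the HuggingFaceModel, so the toolkit can pick up inference.py.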
Frequently Asked Questions
Does Hugging Face run on AWS?
Yes, Hugging Face runs on AWS through SageMaker: you can launch training and inference jobs remotely from your local machine or from other AWS services. This integration enables seamless deployment of Hugging Face models in the cloud.
What is the purpose of Amazon SageMaker?
Amazon SageMaker is a cloud-based platform that helps you build, train, and deploy machine learning models for predictive analytics applications. It automates the process of creating a production-ready AI pipeline, saving you time and effort.
How to deploy Hugging Face models?
To deploy Hugging Face models, select a template for GPU or CPU, choose your instance type, number of instances, and optionally specify an endpoint and deployment name. Click "deploy" to start the process.
Sources
- notebooks repository (github.com)
- Text Generation Inference (TGI) (github.com)
- safetensors (github.com)
- Continuous batching of incoming requests (github.com)
- bitsandbytes (github.com)
- flash-attention (github.com)
- HuggingChat (hf.co)
- Amazon SageMaker (amazon.com)
- Identity and Access Management (amazon.com)
- in transit (amazon.com)
- encryption at rest (amazon.com)
- AWS and Hugging Face collaborate to simplify and accelerate adoption of natural language processing models (amazon.com)
- AWS: Embracing natural language processing with Hugging Face (amazon.com)
- SageMaker’s Distributed Model Parallel Library (amazon.com)
- SageMaker’s Distributed Data Parallel Library (amazon.com)
- Deep Learning Container (github.com)
- Python SDK SageMaker documentation for Hugging Face (sagemaker.readthedocs.io)
- Amazon SageMaker documentation for Hugging Face (amazon.com)
- Deploy a Hugging Face Transformer model from S3 to SageMaker for inference (github.com)
- Deploy one of the 10 000+ Hugging Face Transformers to Amazon SageMaker for Inference (github.com)
- Image Classification with Vision Transformer (github.com)
- Distributed Training Summarization (github.com)
- Distributed Training Data Parallelism Tensorflow (github.com)
- SageMaker Metrics (github.com)
- Distributed Training Model Parallelism (github.com)
- Distributed Training Data Parallelism (github.com)
- All notebooks (github.com)
- Inference Toolkit (github.com)
- SageMaker notebook instance (amazon.com)
- SageMaker Studio (amazon.com)
- IAM role (amazon.com)
- deploy_transformer_model_from_s3.ipynb notebook (github.com)
- deploy_transformer_model_from_hf_hub.ipynb notebook (github.com)
- SageMaker batch transform (amazon.com)
- sagemaker-notebook.ipynb notebook (github.com)
- Text Generation Inference (github.com)
- Amazon SageMaker (amazon.com)
- here (amazon.com)
- Estimator (sagemaker.readthedocs.io)
- Amazon SageMaker (amazon.com)
- SageMaker distributed training libraries (amazon.com)
- Getting Started Pytorch (github.com)
- SageMaker Data Parallelism Library (amazon.com)
- SageMaker Model Parallelism Library (amazon.com)
- fully-managed EC2 spot instances (amazon.com)
- checkpointing (amazon.com)
- training script that is stored in a GitHub (sagemaker.readthedocs.io)
- example script from the transformers repository (github.com)
- SageMaker Metrics (amazon.com)