Amazon SageMaker Hugging Face is a powerful tool for streamlining machine learning workflows. It integrates the popular Hugging Face Transformers library with SageMaker's managed training, tuning, and deployment capabilities.
With SageMaker Hugging Face, you can easily deploy pre-trained models to production environments. This eliminates the need for manual model deployment and reduces the risk of human error.
The platform also provides a range of pre-built models and datasets, making it easy to get started with machine learning projects. These models and datasets are specifically designed for natural language processing and computer vision tasks.
By leveraging the strengths of both Hugging Face and SageMaker, you can accelerate your machine learning development process and focus on more complex tasks.
What Is Hugging Face LLM Inference?
Hugging Face LLM Inference DLC is a purpose-built container for deploying Large Language Models (LLMs) in a secure and managed environment.
It's powered by Text Generation Inference (TGI), an open-source solution that enables high-performance text generation using Tensor Parallelism and dynamic batching.
TGI is already used by customers like IBM and Grammarly, and it's optimized for popular open-source LLMs such as StarCoder, BLOOM, GPT-NeoX, Llama, and T5.
The DLC implements various optimizations, including Tensor Parallelism and custom CUDA kernels, optimized transformers code for inference, quantization, and accelerated weight loading.
Here are some of the technologies used in TGI:
- Tensor Parallelism and custom CUDA kernels
- Optimized transformers code for inference using flash-attention
- Quantization with bitsandbytes
- Continuous batching of incoming requests
- Accelerated weight loading with safetensors
- Logits warpers (temperature scaling, top-k, repetition penalty, ...)
- Watermarking with A Watermark for Large Language Models
- Stop sequences, Log probabilities
- Token streaming using Server-Sent Events (SSE)
With the Hugging Face LLM Inference DLC on Amazon SageMaker, AWS customers can benefit from these technologies and deploy LLM models with high concurrency and low latency.
Installation and Setup
To get started with Amazon SageMaker Hugging Face, you'll first need to sign up for an AWS account.
You can start using SageMaker in one of three ways: SageMaker Studio, SageMaker notebook instance, or a local environment.
To train locally, you'll need to set up an appropriate IAM role.
The SageMaker environment requires setup, but don't worry, it's straightforward.
The execution role is only available when running a notebook within SageMaker, so be aware of this if you're running your code elsewhere.
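For example, a common pattern is to resolve the role differently depending on where your code runs. Here's a minimal sketch; the IAM role name is a hypothetical placeholder you'd replace with your own:

```python
import sagemaker


try:
    # Inside SageMaker Studio or a SageMaker notebook instance this resolves automatically.
    role = sagemaker.get_execution_role()
except ValueError:
    # Running locally: fall back to an IAM role you created for SageMaker.
    import boto3

    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]  # hypothetical role name
```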
To install the necessary dependencies, you'll need to create a requirements.txt file in the same directory as your training script.
Add Accelerate to this file, along with any other dependencies your training script needs.
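A minimal requirements.txt for an Accelerate-based training script might look like the following; the exact packages depend on what your script imports, so treat these as illustrative:

```
accelerate
datasets
evaluate
```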
Here are the three ways to get started with SageMaker:
- SageMaker Studio
- SageMaker notebook instance
- Local environment
Features & Benefits
With the Hugging Face DLCs, you can train cutting-edge Transformers-based NLP models in a single line of code. This is a game-changer for data science teams, allowing them to reduce the time required to set up and run experiments from days to minutes.
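To make that concrete, here's a minimal sketch of launching a training job with the Hugging Face estimator from the SageMaker Python SDK; the script name, hyperparameters, DLC versions, and input paths are assumptions you'd adapt to your project:

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

huggingface_estimator = HuggingFace(
    entry_point="train.py",          # your training script (hypothetical name)
    source_dir="./scripts",          # directory containing train.py and requirements.txt
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.26",     # assumed versions; pick a supported DLC combination
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters={"epochs": 3, "train_batch_size": 32, "model_name": "distilbert-base-uncased"},
)

# The single call that provisions infrastructure, runs training, and uploads the model to S3.
huggingface_estimator.fit({"train": training_input_path, "test": test_input_path})  # hypothetical S3 URIs
```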
You can choose from multiple DLC variants, each one optimized for TensorFlow and PyTorch, single-GPU, single-node multi-GPU, and multi-node clusters. This flexibility gives you the freedom to choose a training infrastructure that best aligns with the price/performance ratio for your workload.
The Hugging Face DLCs feature built-in performance optimizations for PyTorch and TensorFlow to train NLP models faster. This means you can get results faster without having to worry about the underlying infrastructure.
Hugging Face DLCs are fully integrated with SageMaker distributed training libraries to train models faster than ever. This is especially useful when working with large datasets or complex models.
You can use the Hugging Face DLCs with SageMaker's automatic model tuning to optimize your training hyperparameters and increase the accuracy of your models. This is a huge advantage, as it allows you to automate the process of finding the best model configuration.
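Here's a rough sketch of what that looks like with the SageMaker SDK's HyperparameterTuner; the metric name, regex, and search range are assumptions, and huggingface_estimator is the estimator from the training sketch above:

```python
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

# Search over the learning rate, minimizing the eval loss the training script prints to its logs.
tuner = HyperparameterTuner(
    estimator=huggingface_estimator,
    objective_metric_name="eval_loss",
    objective_type="Minimize",
    hyperparameter_ranges={"learning_rate": ContinuousParameter(1e-5, 5e-5)},
    metric_definitions=[{"Name": "eval_loss", "Regex": "'eval_loss': ([0-9\\.]+)"}],
    max_jobs=6,
    max_parallel_jobs=2,
)

tuner.fit({"train": training_input_path, "test": test_input_path})
```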
Here are some key benefits of using Hugging Face DLCs with SageMaker:
- Train cutting-edge NLP models in a single line of code
- Choose from multiple DLC variants for optimal performance
- Built-in performance optimizations for PyTorch and TensorFlow
- Fully integrated with SageMaker distributed training libraries
- Automate model tuning with SageMaker's automatic model tuning
Sample Notebooks
Amazon SageMaker offers a wide range of sample notebooks to help you get started with your projects. These notebooks cover various topics, including deep learning frameworks like PyTorch and TensorFlow.
You can find notebooks on PyTorch and TensorFlow specifically, such as "Getting Started with Pytorch" and "Getting Started with Tensorflow". These notebooks provide a great starting point for beginners and experienced users alike.
SageMaker also offers notebooks on distributed training, including "Distributed Training Data Parallelism" and "Distributed Training Model Parallelism". These notebooks help you learn how to scale your models and train them more efficiently.
Some notebooks focus on specific use cases, such as "Image Classification with Vision Transformer" and "Deploy one of the 10 000+ Hugging Face Transformers to Amazon SageMaker for Inference". These notebooks show you how to apply Hugging Face Transformers to real-world problems.
Here's a list of some of the sample notebooks available in SageMaker:
- All notebooks
- Getting Started with Pytorch
- Getting Started with Tensorflow
- Distributed Training Data Parallelism
- Distributed Training Model Parallelism
- Spot Instances and continue training
- SageMaker Metrics
- Distributed Training Data Parallelism Tensorflow
- Distributed Training Summarization
- Image Classification with Vision Transformer
- Deploy one of the 10 000+ Hugging Face Transformers to Amazon SageMaker for Inference
- Deploy a Hugging Face Transformer model from S3 to SageMaker for inference
Transformers
You can deploy a Transformers model trained in SageMaker in two ways: right after training has finished, or at a later time from S3 using the model_data argument.
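For the first path, you can call deploy() on the estimator as soon as training completes. This is a minimal sketch that assumes the huggingface_estimator from earlier; the instance type and test payload are illustrative:

```python
# Deploy the model produced by the training job to a real-time endpoint.
predictor = huggingface_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Run a quick test prediction, then clean up the endpoint when you're done experimenting.
print(predictor.predict({"inputs": "This movie was absolutely wonderful."}))
predictor.delete_endpoint()
```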
To deploy a model from S3 to SageMaker for inference, you can open the deploy_transformer_model_from_s3.ipynb notebook for an example.
Deployment can become a bottleneck on a GPU-based instance with a single vCPU, since that lone vCPU can prevent the GPU from being fully utilized for model inference.
You can learn more about this limitation and how to optimize your pipelines here.
After Training
After training, you can deploy your model directly, as long as your training script saves all required files, including the tokenizer as well as the model.
To do this, ensure that your tokenizer is saved along with the model, as it's often used for inference.
You can pass your tokenizer as an argument to the Hugging Face Trainer, which will automatically save it when you call trainer.save_model().
This makes it easy to deploy your model without having to worry about saving individual components.
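Here's a minimal sketch of that pattern inside a training script; the checkpoint is hypothetical, and the train/eval datasets are assumed to have been tokenized earlier:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

checkpoint = "distilbert-base-uncased"  # hypothetical starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# /opt/ml/model is the directory SageMaker archives and uploads to S3 after training.
training_args = TrainingArguments(output_dir="/opt/ml/model")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # assumes tokenized datasets prepared earlier
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,          # saved alongside the model by save_model()
)

trainer.train()
trainer.save_model()              # writes model weights, config, and tokenizer files to output_dir
```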
Data
Model artifacts are stored in Amazon S3, and you deploy them in SageMaker by pointing the model_data argument at the S3 location of your tokenizer and model weights.
You can preprocess your data using the datasets library, as demonstrated with the imdb dataset, which consists of 25,000 training and 25,000 test examples of highly polar movie reviews.
The datasets library downloads and preprocesses the imdb dataset, which is then uploaded to the current session's default S3 bucket for use within the training job.
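Here's a minimal sketch of that preprocessing and upload flow; the tokenizer checkpoint and S3 prefixes are assumptions:

```python
import sagemaker
from datasets import load_dataset
from sagemaker.s3 import S3Uploader
from transformers import AutoTokenizer

sess = sagemaker.Session()

# Download and tokenize the imdb dataset.
train_dataset, test_dataset = load_dataset("imdb", split=["train", "test"])
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # hypothetical checkpoint

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# Save locally, then upload to the session's default S3 bucket for the training job.
train_dataset.save_to_disk("./imdb_train")
test_dataset.save_to_disk("./imdb_test")

training_input_path = f"s3://{sess.default_bucket()}/datasets/imdb/train"
test_input_path = f"s3://{sess.default_bucket()}/datasets/imdb/test"
S3Uploader.upload("./imdb_train", training_input_path)
S3Uploader.upload("./imdb_test", test_input_path)
```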
Create an Artifact
To create an artifact, you'll need to package your model files into a single .tar.gz file. This file should contain the required files, such as pytorch_model.bin, tf_model.h5, tokenizer.json, and tokenizer_config.json.
A typical model.tar.gz file will include the following files:
- pytorch_model.bin
- tf_model.h5
- tokenizer.json
- tokenizer_config.json
You can also create a model.tar.gz file from a model on the 🤗 Hub by downloading the model files, packaging them into an archive, uploading it to S3, and providing that S3 URI to the model_data argument.
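Once the archive is in S3, deploying it is a short snippet. This is a minimal sketch; the S3 URI, DLC versions, and task are assumptions:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/path/to/model.tar.gz",  # hypothetical S3 URI of your artifact
    role=role,
    transformers_version="4.26",   # assumed versions; use a supported DLC combination
    pytorch_version="1.13",
    py_version="py39",
    env={"HF_TASK": "text-classification"},  # the pipeline task the endpoint should serve
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
```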
From the Hub
You can deploy a model from the 🤗 Hub to Amazon SageMaker with just a few steps.
To deploy a model directly from the 🤗 Hub, define two environment variables when you create a HuggingFaceModel: HF_MODEL_ID, the id of the model on the Hub, and HF_TASK, the pipeline task the endpoint should serve.
The process is straightforward, and you can find a detailed example in the deploy_transformer_model_from_hf_hub.ipynb notebook.
Open the notebook to see how to deploy a model from the 🤗 Hub to SageMaker for inference.
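As a minimal sketch, the deployment mirrors the S3 example above, except that the two environment variables replace model_data; the model id, versions, and instance type are assumptions:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

# The two environment variables: which Hub model to load and which pipeline task to serve.
hub = {
    "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",  # hypothetical Hub model id
    "HF_TASK": "text-classification",
}

huggingface_model = HuggingFaceModel(
    env=hub,                      # replaces model_data from the S3 example above
    role=role,
    transformers_version="4.26",  # assumed versions; use a supported DLC combination
    pytorch_version="1.13",
    py_version="py39",
)

predictor = huggingface_model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
print(predictor.predict({"inputs": "SageMaker plus Hugging Face is a great combination."}))
```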
Run Batch Transform with Transformers
You can use SageMaker batch transform to perform inference with your trained model. This process accepts your inference data as an S3 URI and takes care of downloading the data, running the prediction, and uploading the results to S3.
The Hugging Face Inference DLC currently only supports .jsonl for batch transform due to the complex structure of textual data. Note that your inputs should fit the max_length of the model during preprocessing.
To create a transform job for a model based on the training job, call the transformer() method. This method is available if you trained a model using the Hugging Face Estimator.
If you want to run your batch transform job later or with a model from the Hugging Face Hub, create a HuggingFaceModel instance and then call the transformer() method.
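A minimal sketch of both steps might look like this, assuming the huggingface_estimator from the training example; the input S3 URI and instance type are hypothetical:

```python
# Create a transform job from the trained Hugging Face estimator.
batch_job = huggingface_estimator.transformer(
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    strategy="SingleRecord",
)

# The input must be a .jsonl file with one JSON record per line.
batch_job.transform(
    data="s3://my-bucket/batch-input/reviews.jsonl",  # hypothetical input location
    content_type="application/json",
    split_type="Line",
)
batch_job.wait()
```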
To try it end to end:
- Open the sagemaker-notebook.ipynb notebook for a complete example of running a batch transform job for inference.
- Make sure the latest version of the SageMaker SDK is installed.
Once you're done experimenting, don't forget to delete the endpoint and the model resources.
Using TGI
You can deploy a high-performance serving container for LLMs using the Hugging Face TGI container, which is built on the Text Generation Inference library.
To get started, import the SageMaker Python SDK and instantiate a sagemaker_session to look up the current region and execution role. You'll use these to configure the model object and deploy it to SageMaker.
Next, retrieve the LLM image URI using the helper function get_huggingface_llm_image_uri(). This function takes a required backend parameter, which specifies the type of backend to use for the model, plus several optional parameters.
The backend you choose will determine how your model is deployed. In this case, we're using the Hugging Face TGI backend, which is a great option for LLMs.
To configure the model object, you'll need to specify a unique name, the image_uri for the managed TGI container, and the execution role for the endpoint. You'll also need to define several environment variables, including HF_MODEL_ID, which corresponds to the model from the Hugging Face Hub that will be deployed.
You should also define SM_NUM_GPUS, which specifies the tensor parallelism degree of the model. This is especially important when working with LLMs that are too big for a single GPU. For example, if you're using an instance type with 4 available GPUs, you should set SM_NUM_GPUS to 4.
One optional step you can take is to reduce the memory and computational footprint of the model by enabling quantization through the HF_MODEL_QUANTIZE environment variable (for example, setting it to bitsandbytes). However, keep in mind that this may affect the quality of the output for some models.
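Putting those pieces together, here's a minimal sketch of a TGI deployment; the model id, model name, and instance choice are assumptions you'd adjust to your own workload:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Retrieve the managed TGI container image for the current region.
llm_image = get_huggingface_llm_image_uri("huggingface")

llm_model = HuggingFaceModel(
    name="tgi-demo-model",  # hypothetical model name
    role=role,
    image_uri=llm_image,
    env={
        "HF_MODEL_ID": "tiiuae/falcon-7b-instruct",  # hypothetical Hub model to deploy
        "SM_NUM_GPUS": "4",                          # tensor parallelism degree; match the instance's GPU count
        # "HF_MODEL_QUANTIZE": "bitsandbytes",       # optional: quantize to cut the memory footprint
    },
    sagemaker_session=sess,
)

predictor = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.12xlarge",            # 4 GPUs, matching SM_NUM_GPUS above
    container_startup_health_check_timeout=600,  # give the container time to load the weights
)

print(predictor.predict({"inputs": "Explain tensor parallelism in one sentence."}))
```

If the model is too large for the selected GPUs, pick a bigger instance or enable quantization as described above.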
User Defined Code
User Defined Code is a powerful feature of the Hugging Face Inference Toolkit, allowing you to customize the behavior of the HuggingFaceHandlerService.
To start, you'll need to create a folder named code/ with an inference.py file inside. This file will contain your custom inference module, which can override several methods of the default HuggingFaceHandlerService.
You can override the model_fn method to load your model in a custom way. This method receives the model_dir as an argument, which is the path to your unzipped model.tar.gz.
The transform_fn method is another key one to override. This method allows you to implement your own preprocess, predict, and postprocess steps. It's worth noting that you can't combine this method with input_fn, predict_fn, or output_fn.
Here are the methods you can override in the custom inference module:
- model_fn(model_dir) - overrides the default method for loading a model
- transform_fn(model, data, content_type, accept_type) - overrides the default transform function with your custom implementation
- input_fn(input_data, content_type) - overrides the default method for preprocessing
- predict_fn(processed_data, model) - overrides the default method for predictions
- output_fn(prediction, accept) - overrides the default method for postprocessing
For example, you can create a custom inference module with only model_fn and transform_fn. This will allow you to load your model and implement your own transform function.
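Here's a minimal sketch of such an inference.py, assuming a text-classification model; the pipeline task is an assumption:

```python
# code/inference.py -- a minimal sketch of a custom inference module
import json

from transformers import pipeline


def model_fn(model_dir):
    # model_dir is the path to the unzipped model.tar.gz; load the model however you like.
    return pipeline("text-classification", model=model_dir, tokenizer=model_dir)


def transform_fn(model, data, content_type, accept_type):
    # Preprocess: parse the raw request body.
    inputs = json.loads(data)["inputs"]
    # Predict.
    predictions = model(inputs)
    # Postprocess: serialize the result for the response.
    return json.dumps(predictions)
```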
Frequently Asked Questions
Does Hugging Face run on AWS?
Yes. Hugging Face runs on AWS through SageMaker, which lets you launch remote training and inference jobs from your local machine or from other AWS services. This integration enables seamless deployment of Hugging Face models in the cloud.
What is the purpose of Amazon SageMaker?
Amazon SageMaker is a cloud-based platform that helps you build, train, and deploy machine learning models for predictive analytics applications. It automates the process of creating a production-ready AI pipeline, saving you time and effort.
How to deploy Hugging Face models?
To deploy Hugging Face models, select a template for GPU or CPU, choose your instance type, number of instances, and optionally specify an endpoint and deployment name. Click "deploy" to start the process.