Hugging Face examples are a treasure trove for NLP and vision tasks. They provide a wide range of pre-trained models and datasets that can be fine-tuned for specific tasks.
You can use the Transformers library to access Hugging Face models and perform tasks such as text classification, sentiment analysis, and question answering.
The library includes models like BERT, RoBERTa, and DistilBERT, which have achieved state-of-the-art results in various NLP tasks.
These models can be fine-tuned on a specific dataset to adapt to the task at hand, making them extremely versatile.
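As a quick illustration, here is a minimal sketch using the pipeline API for sentiment analysis; the exact output depends on the default checkpoint the pipeline downloads.

```python
from transformers import pipeline

# A sentiment-analysis pipeline downloads a default pre-trained model from the Hub
classifier = pipeline("sentiment-analysis")

result = classifier("Hugging Face examples make it easy to get started with NLP.")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```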
Getting Started
To get started with Hugging Face Transformers, you'll need MLflow 2.3. This version is a requirement for many of the examples and tutorials.
Any cluster with the Hugging Face transformers library installed can be used for batch inference; if the library is preinstalled on your cluster, you're good to go.
To use Hugging Face Transformers with Ray Train, you write a train_func: the Python code that executes on each distributed training worker. A ScalingConfig defines the number of distributed training workers and whether to use GPUs, and a TorchTrainer launches the distributed training job (see the Quickstart below).
Requirements
To get started, you'll need MLflow 2.3 installed.
Any cluster with the Hugging Face transformers library installed can be used for batch inference. The transformers library comes preinstalled on Databricks Runtime 10.4 LTS ML and above.
Many of the popular NLP models work best on GPU hardware, so unless you use a model specifically optimized for CPUs, recent GPU hardware will give you the best performance.
Quickstart
To get started with Ray Train, you'll want to understand the basics of how it works. Here's a quick rundown of the key components involved.
train_func is the Python code that executes on each distributed training worker; it contains the core training logic.
ScalingConfig defines the number of distributed training workers and whether to use GPUs. This configuration will impact the performance of your training job.
TorchTrainer launches the distributed training job. With Ray Train, you can easily scale your training process to multiple workers and GPUs.
Here's a comparison of a Hugging Face Transformers training script with and without Ray Train:
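Below is a condensed sketch of what the Ray Train version looks like, loosely following the Ray Train getting-started example; the Ray-specific additions are marked in comments, and everything else is a standard Transformers training script. The dataset, checkpoint, and worker counts are illustrative choices, not requirements.

```python
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

import ray.train.huggingface.transformers
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # Standard Transformers code: data, tokenization, model, and Trainer
    dataset = load_dataset("yelp_review_full")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def tokenize(batch):
        return tokenizer(batch["text"], padding="max_length", truncation=True)

    train_ds = dataset["train"].select(range(1000)).map(tokenize, batched=True)
    eval_ds = dataset["test"].select(range(1000)).map(tokenize, batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=5
    )

    metric = evaluate.load("accuracy")

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        return metric.compute(predictions=np.argmax(logits, axis=-1), references=labels)

    args = TrainingArguments(
        output_dir="bert_yelp", evaluation_strategy="epoch", report_to="none"
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        compute_metrics=compute_metrics,
    )

    # Ray Train additions: report metrics/checkpoints to Ray and prepare the
    # Trainer for distributed execution on this worker.
    trainer.add_callback(ray.train.huggingface.transformers.RayTrainReportCallback())
    trainer = ray.train.huggingface.transformers.prepare_trainer(trainer)
    trainer.train()


# ScalingConfig: how many distributed workers, and whether each uses a GPU
scaling_config = ScalingConfig(num_workers=2, use_gpu=True)

# TorchTrainer launches train_func on every distributed training worker
ray_trainer = TorchTrainer(train_func, scaling_config=scaling_config)
result = ray_trainer.fit()
```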
Data Preparation
Data Preparation is a crucial step in any machine learning (ML) project. It involves transforming the dataset to prepare it for modeling.
Hugging Face provides two basic classes for data processing: Tokenizers and feature extractors. Since we're dealing with images, we won't use a Tokenizer here.
The datasets library by Hugging Face provides a collection of ready-to-use datasets and evaluation metrics for Natural Language Processing (NLP). At the time of writing, the datasets hub counts over 900 different datasets.
To load a dataset, we need to import the load_dataset function and load the desired dataset. We can also configure it to use a custom script containing the loading functionality.
Here are some ways to query a dataset:
- A single row is dataset[3]
- A batch is dataset[3:6]
- A column is dataset['feature_1']
A Dataset object behaves like a Python list, so we can query it much as we would with NumPy or Pandas.
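As a minimal sketch (using the IMDB dataset purely as an example; any dataset name from the Hub works the same way):

```python
from datasets import load_dataset

# Download a dataset from the Hugging Face Hub and take its training split
dataset = load_dataset("imdb", split="train")

print(dataset[3])            # a single row, returned as a dict of features
print(dataset[3:6])          # a batch of rows
print(dataset["label"][:5])  # the first values of a single column
```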
Preparing the Dataset
Preparing the Dataset is a crucial step in any Machine Learning (ML) lifecycle. It involves transforming the dataset to make it suitable for your model. In our case, we need to preprocess the CIFAR10 images.
As in the previous section, since we're dealing with images we'll use a feature extractor rather than a Tokenizer for preprocessing.
To load a dataset, import the load_dataset function: with it you can download datasets from the Hugging Face Hub, read from a local file, or load from in-memory data.
A Dataset object can be converted into NumPy, pandas, PyTorch, or TensorFlow using datasets.Dataset.set_format().
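Putting this together for the CIFAR10 case, here is a minimal sketch; the checkpoint name is only an example, and in newer transformers releases ViTFeatureExtractor may be replaced by ViTImageProcessor.

```python
from datasets import load_dataset
from transformers import ViTFeatureExtractor

# Example ViT checkpoint; any checkpoint with a matching feature extractor works
checkpoint = "google/vit-base-patch16-224-in21k"
feature_extractor = ViTFeatureExtractor.from_pretrained(checkpoint)

# CIFAR10 on the Hub exposes an "img" column (PIL images) and a "label" column
train_ds = load_dataset("cifar10", split="train")

def preprocess(batch):
    # Resize and normalize the images into the "pixel_values" the model expects
    inputs = feature_extractor(batch["img"])
    inputs["label"] = batch["label"]
    return inputs

train_ds = train_ds.map(preprocess, batched=True)

# Return PyTorch tensors when the dataset is indexed
train_ds.set_format(type="torch", columns=["pixel_values", "label"])
```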
Data Collator
A data collator is an object that helps form batches from our dataset when training a model.
The default data collator provided by the library is usually enough for most cases.
Batching is an important step of the preprocessing pipeline, especially when training a model.
In our case, we will pass the data collator as an argument to the training loop.
Note that this currently works only for PyTorch.
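For the image setup above, a custom collate function is only a few lines. This is a sketch that assumes each example already carries a torch "pixel_values" tensor and an integer "label", as produced by the preprocessing step earlier:

```python
import torch

def collate_fn(examples):
    # Stack the per-example tensors into the batch dict the Trainer expects
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}
```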
Transformers
The transformers library by Hugging Face is a game-changer for building, training, and fine-tuning transformers. It comes with almost 10,000 pre-trained models that can be found on the Hub.
You can use these models in TensorFlow, PyTorch, or JAX, and anyone can upload their own model. This level of accessibility makes it easy for developers to get started with transformers.
For a more complete introduction to Hugging Face, check out the Natural Language Processing with Transformers book by three Hugging Face engineers.
Transformers Inference and MLflow Logging
Transformers inference and MLflow logging are a powerful combination for efficient, scalable text processing. An example notebook demonstrates this end to end for text summarization: a pre-trained Hugging Face Transformers pipeline handles the inference, while MLflow logs the runs so experiments are easy to track, reproduce, and compare.
The notebook is designed as an end-to-end example, so it contains everything you need to get started quickly with text summarization using Transformers inference and MLflow logging.
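A minimal sketch of the idea (not the notebook itself): the checkpoint name is just an example, and mlflow.transformers.log_model assumes MLflow 2.3 or later.

```python
import mlflow
from transformers import pipeline

# Example summarization checkpoint; the notebook may use a different one
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

article = (
    "Hugging Face Transformers pipelines wrap a tokenizer and a pre-trained model "
    "behind a single call, which makes batch inference straightforward. MLflow can "
    "track those runs so that experiments are easy to reproduce and compare."
)

with mlflow.start_run():
    summary = summarizer(article, max_length=60, min_length=10)[0]["summary_text"]
    mlflow.log_param("model_name", "sshleifer/distilbart-cnn-12-6")
    mlflow.log_text(summary, "summary.txt")
    # Log the whole pipeline with MLflow's transformers flavor (MLflow 2.3+)
    mlflow.transformers.log_model(transformers_model=summarizer, artifact_path="summarizer")
```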
Transformers Trainer Migration Guide
The Transformers Trainer migration guide is worth reading for anyone looking to upgrade their training setup.
Ray 2.1 introduced the TransformersTrainer, which exposes a trainer_init_per_worker interface to define transformers.Trainer.
With the introduction of the unified TorchTrainer API in Ray 2.7, you now have better control over your native Transformers training code.
This new API aligns more closely with standard Hugging Face Transformers scripts, making it easier to integrate your existing code.
The TorchTrainer API offers enhanced transparency, flexibility, and simplicity, making it a worthwhile upgrade from the TransformersTrainer.
Tokenizers
Tokenizers are a go-to solution in most NLP tasks.
A tokenizer is used to map text into tokens, which are then converted into numerical inputs that can be fed into the model. Each model comes with its own tokenizer based on the PreTrainedTokenizer class.
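For example, a minimal sketch with a BERT checkpoint (any checkpoint name works the same way):

```python
from transformers import AutoTokenizer

# Each checkpoint ships with a matching tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenizers map text into tokens and then into numerical inputs."
print(tokenizer.tokenize(text))       # the token strings
print(tokenizer(text)["input_ids"])   # the numerical inputs fed to the model
```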
Modeling
The transformers library forces all models to produce outputs that inherit the file_utils.ModelOutput class, which contains all the information returned by the model.
This data structure has many different subclasses depending on the task at hand, and typically includes the output of the model and optionally the hidden states. In many models, the attention weights are also provided.
The SequenceClassifierOutput is a subclass of ModelOutput, specifically designed for classification models, and provides the main output for these types of tasks.
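As an illustration, here is a small sketch with a sequence classification checkpoint (the model name is just an example); the forward pass returns a SequenceClassifierOutput whose fields can be accessed by attribute:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Example checkpoint; any sequence classification model behaves the same way
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

inputs = tokenizer("Hugging Face makes transformers easy to use.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

print(type(outputs).__name__)    # SequenceClassifierOutput, a ModelOutput subclass
print(outputs.logits)            # the raw classification scores
print(len(outputs.attentions))   # attention weights, one tensor per layer
```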
Function Setup
When setting up a training function, it's essential to update your code to support distributed training. This involves wrapping your code in a training function that each distributed training worker executes.
The training function should contain all the logic, including dataset construction and preprocessing, model initialization, and transformers trainer definition. Avoid passing large data objects through the Trainer’s train_loop_config to reduce serialization and deserialization overhead.
Instead, initialize large objects directly in the train_func, such as datasets and models. This will help prevent serialization errors while transferring objects to the workers.
If you're using Hugging Face Datasets or Evaluate, make sure to call datasets.load_dataset and evaluate.load inside the training function. Don't pass the loaded datasets and metrics from outside of the training function.
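In other words, the pattern looks like this sketch (the dataset and metric names are placeholders):

```python
import datasets
import evaluate

def train_func():
    # Load data and metrics inside the training function so every worker builds
    # its own copies, instead of receiving large objects via train_loop_config
    dataset = datasets.load_dataset("yelp_review_full")
    metric = evaluate.load("accuracy")
    # ... model initialization and transformers.Trainer definition follow here ...
```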
Model
You can load pretrained transformer models with from_pretrained('model_name'), which instantiates the selected architecture and loads its pretrained weights.
By default, the model is in evaluation mode, so you need to execute model.train() to train it.
Pretrained models can be used as a base for improved models, and you can modify the network as you want.
The Trainer class from Hugging Face is especially optimized for transformers and provides an API for both normal and distributed training.
You can define your training loop using the Trainer class, passing the model, the training dataset, the validation datasets, the data collator, and a few other critical things.
The compute_metrics function is used to calculate the metrics during evaluation and is a custom function that you can define according to your needs.
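Putting the pieces together, a minimal sketch of the training loop for the image classification example might look like the following; it assumes the train_ds, an analogous eval_ds, and the collate_fn from the earlier sketches, and the TrainingArguments values are arbitrary.

```python
import numpy as np
import evaluate
from transformers import ViTForImageClassification, TrainingArguments, Trainer

# Load a pretrained ViT backbone with a fresh 10-class classification head
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=10
)

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred bundles the model predictions and the reference labels
    logits, labels = eval_pred
    return accuracy.compute(predictions=np.argmax(logits, axis=-1), references=labels)

training_args = TrainingArguments(
    output_dir="vit-cifar10",          # arbitrary output directory
    per_device_train_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",
)

# train_ds, eval_ds, and collate_fn come from the data preparation sketches above
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
)
trainer.train()
```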
To set up a training function, you need to wrap your code in a function that can be executed by each distributed training worker.
Auto classes are an inspired way to alleviate some of the pain of finding the correct model or tokenizer for a specific problem, and they can simplify the process of creating an instance of a model.
You can use an auto class to automatically retrieve the relevant model and its appropriate weights from a checkpoint name, without needing to know the corresponding model type.
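For instance (the checkpoint name is just an example):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# The Auto classes read the checkpoint's config and pick the right architecture,
# so you don't need to know whether it is BERT, RoBERTa, DistilBERT, and so on
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
```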
Sources
- https://docs.databricks.com/ja/archive/machine-learning/train-model/model-inference-nlp.html
- https://theaisummer.com/hugging-face-vit/
- https://huggingface.co/learn/nlp-course/en/chapter3/3
- https://towardsdatascience.com/fine-tuning-pretrained-nlp-models-with-huggingfaces-trainer-6326a4456e7b
- https://docs.ray.io/en/latest/train/getting-started-transformers.html