Hugging Face Rust refers to the open-source Rust libraries in the Hugging Face ecosystem, such as hf-hub, tokenizers, candle, and rust-bert, that allow developers to build and deploy AI models with Rust, a systems programming language known for its performance and reliability.
Rust is particularly well-suited for building high-performance AI applications due to its ability to compile directly to machine code.
By using Rust, developers can create efficient AI models that can run on a wide range of hardware platforms, from small embedded systems to large-scale data centers.
Rust's strong focus on memory safety and concurrency makes it an attractive choice for building reliable and scalable AI systems.
Getting Started
To use Hugging Face with your Rust application, you'll need to add the hf_hub crate to your Cargo.toml. This crate is the foundation for using Hugging Face with Rust: it downloads and caches model files from the Hugging Face Hub.
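As a minimal sketch (assuming hf-hub's default blocking API and the anyhow crate for error handling; the repo and file names are only examples), downloading a file from the Hub looks roughly like this:

```rust
use hf_hub::api::sync::Api;

fn main() -> anyhow::Result<()> {
    // Build a client for the Hugging Face Hub (uses the default cache directory).
    let api = Api::new()?;
    // Point it at a model repository; the repo name here is just an example.
    let repo = api.model("mistralai/Mistral-7B-v0.1".to_string());
    // Download (and cache) a single file, getting back its local path.
    let tokenizer_path = repo.get("tokenizer.json")?;
    println!("tokenizer downloaded to {}", tokenizer_path.display());
    Ok(())
}
```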
Tensors are multidimensional arrays of numbers, and in the context of Hugging Face, they're crucial for modeling real-world data like images. A picture, for instance, can be represented as a 3D tensor with one dimension each for height, width, and color channels.
To deserialize the weight map file, you'll need a custom deserialization step, because the file can contain an arbitrary number of keys and values. The simplest approach is to deserialize it into a serde_json::Value first and pull the entries out of that.
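A minimal sketch of that approach, assuming serde_json and anyhow are available and that the index file follows the usual model.safetensors.index.json layout (the helper name is hypothetical):

```rust
use std::collections::HashSet;
use std::path::Path;

// Read the weight-map index and collect the distinct safetensors files it points to.
fn weight_files_from_index(index_path: &Path) -> anyhow::Result<Vec<String>> {
    let json: serde_json::Value = serde_json::from_reader(std::fs::File::open(index_path)?)?;
    let weight_map = json
        .get("weight_map")
        .and_then(|v| v.as_object())
        .ok_or_else(|| anyhow::anyhow!("no `weight_map` object in {index_path:?}"))?;
    // The map has one entry per tensor, but many tensors share a file, so dedupe.
    let files: HashSet<String> = weight_map
        .values()
        .filter_map(|v| v.as_str().map(str::to_string))
        .collect();
    Ok(files.into_iter().collect())
}
```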
Safetensors is a data serialization format for storing tensors safely, created by Hugging Face to replace the pickle format. The key point is that safetensors is designed to be safer than pickle, whose files can execute arbitrary code when they are loaded.
You'll need some space to store the model on your local machine, as the repository download is quite large, at around 13.4 GB. The Mistral-7B-v0.1 repository is hosted on the Hugging Face Hub, and you can find the latest version of any repository by opening it, clicking on Files and versions, and looking through the commit history.
Token Output Stream
Token Output Stream is a crucial part of working with Hugging Face's Rust library.
To create a token output stream, we first need to encode our input into tokens, which can represent words, subwords, or individual characters. Token ids are the representation the model actually operates on.
Most popular pre-trained models have billions or even trillions of tokens in their training data. That training is what gives words meaning to the model and enables it to produce sophisticated answers by relating new input to the patterns it has learned.
We'll create a struct that wraps the tokenizer and turns generated token ids into a stream of output text; a sketch of this struct follows the list below. It needs methods to decode tokens into UTF-8 strings and to advance an internal index, returning a string whenever there is new text.
We'll also need some auxiliary methods on this struct, which we'll use later on when generating our response text.
Here are some of the key methods we'll need:
- Decode tokens into UTF-8 strings
- Advance the internal index and return a string if there is any text
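Here's a minimal sketch of such a stream, assuming a recent version of the tokenizers crate (where decode takes a slice of token ids) and anyhow for error handling; the full version in the candle examples also guards against emitting partially decoded word pieces:

```rust
use tokenizers::Tokenizer;

pub struct TokenOutputStream {
    tokenizer: Tokenizer,
    tokens: Vec<u32>,
    decoded_len: usize, // number of bytes of text already returned to the caller
}

impl TokenOutputStream {
    pub fn new(tokenizer: Tokenizer) -> Self {
        Self { tokenizer, tokens: Vec::new(), decoded_len: 0 }
    }

    /// Decode a slice of token ids into a UTF-8 string.
    fn decode(&self, tokens: &[u32]) -> anyhow::Result<String> {
        self.tokenizer
            .decode(tokens, true) // true = skip special tokens
            .map_err(anyhow::Error::msg)
    }

    /// Advance the stream with the next token id and return any new text.
    pub fn next_token(&mut self, token: u32) -> anyhow::Result<Option<String>> {
        self.tokens.push(token);
        // Re-decoding the whole sequence keeps the sketch simple; a production
        // version tracks indices to avoid quadratic work.
        let text = self.decode(&self.tokens)?;
        match text.get(self.decoded_len..) {
            Some(new_text) if !new_text.is_empty() => {
                self.decoded_len = text.len();
                Ok(Some(new_text.to_string()))
            }
            _ => Ok(None),
        }
    }
}
```

Calling next_token inside the generation loop gives you streaming output instead of waiting for the full completion.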
Text Generation
Text generation is a crucial aspect of Hugging Face Rust, and it's where the magic happens. One of the main knobs you can turn is the temperature, which controls how much randomness goes into choosing each token.
The temperature is a key factor in text generation, and you can adjust it to balance reliability and diversity. A lower temperature makes the model more predictable and consistent, and less prone to hallucinated output, while a higher temperature allows for more creative and diverse responses.
To generate text, you need to create a struct for generating tokens from your input. This struct should include parameters such as top_k, top_p, repeat_penalty, and repeat_last_n, which control the sampling process and prevent repetitive output; a sketch of such a struct follows the parameter list below.
Here are the main parameters you need to consider when generating text:
- temp (or temperature) - controls the randomness of sampling; lower values make the output more deterministic
- top_k - limits the sampling to the top K tokens
- top_p - allows the model to choose from a subset of tokens whose combined probability reaches or exceeds a threshold p
- repeat_penalty - punishes repetitive or redundant output
- repeat_last_n - sets the context window size for the repeat penalty
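Since the later sections use the Candle framework, here's a sketch of what such a parameter struct might look like alongside candle's LogitsProcessor, which handles temperature and top-p sampling (the struct and its field values are illustrative; top_k handling is omitted here):

```rust
use candle_transformers::generation::LogitsProcessor;

/// Illustrative bundle of the sampling parameters listed above.
struct GenerationParams {
    seed: u64,
    temperature: Option<f64>,
    top_p: Option<f64>,
    repeat_penalty: f32,
    repeat_last_n: usize,
}

fn main() {
    let params = GenerationParams {
        seed: 299792458,
        temperature: Some(0.8), // lower = more deterministic output
        top_p: Some(0.95),      // nucleus sampling threshold
        repeat_penalty: 1.1,    // values above 1.0 discourage repetition
        repeat_last_n: 64,      // how far back the penalty looks
    };
    // candle's LogitsProcessor applies temperature and top-p when sampling;
    // the repeat penalty is applied to the logits separately (see the next section).
    let _processor = LogitsProcessor::new(params.seed, params.temperature, params.top_p);
    let _ = (params.repeat_penalty, params.repeat_last_n); // used during the sampling loop
}
```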
Text Generation from Tokens
Text generation from tokens is a powerful technique that allows us to create human-like text based on our input. It's a crucial part of many natural language processing (NLP) applications.
To generate text from tokens, we need to understand some key terminology. Temperature, or temp, controls how much randomness is used when sampling the next token; lower temperatures make the model more deterministic and less likely to hallucinate.
The top_k argument restricts sampling to the K most probable tokens, which are then sampled according to their probabilities; a lower K value makes the model more predictable and consistent. Top_p, on the other hand, allows the model to choose from the smallest subset of tokens whose combined probability reaches or exceeds a threshold p.
The repeat penalty and repeat_last_n are also important parameters to consider. The repeat penalty penalizes repetitive or redundant output, while repeat_last_n controls the size of the context window the penalty looks back over.
By understanding these parameters (temperature, top_k, top_p, repeat_penalty, and repeat_last_n) and how they work together, we can create more sophisticated and accurate text generation models. A sketch of a single sampling step that ties them together follows.
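The sketch below assumes candle_core and candle_transformers are in use; logits is the model's output for the last position and tokens holds the ids generated so far:

```rust
use candle_core::Tensor;
use candle_transformers::{generation::LogitsProcessor, utils::apply_repeat_penalty};

/// One sampling step: penalize recent repeats, then sample the next token id.
fn sample_next_token(
    logits: &Tensor,
    tokens: &[u32],
    logits_processor: &mut LogitsProcessor,
    repeat_penalty: f32,
    repeat_last_n: usize,
) -> candle_core::Result<u32> {
    // Only the last `repeat_last_n` tokens are considered for the penalty.
    let start_at = tokens.len().saturating_sub(repeat_last_n);
    let logits = if repeat_penalty == 1.0 {
        logits.clone()
    } else {
        apply_repeat_penalty(logits, repeat_penalty, &tokens[start_at..])?
    };
    // Temperature and top-p sampling happen inside the logits processor.
    logits_processor.sample(&logits)
}
```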
Generating the Embedding
To create the embedding file, we need to process multiple sets of keywords as a batch. We can do this by using the `encode_batch` function from the tokenizer, which takes a vector of strings rather than a single string.
The `encode_batch` function returns a batch of encodings whose token ids we stack into a tensor and feed to the model's `forward` function to get the embedding vectors. This is the main difference in how we use the `BertInferenceModel` compared to encoding a single string.
We can parallelize the embedding of the text file using the `rayon` crate, which allows us to speed up the process by using multiple CPU cores. Once we have embedded all the text, we can stack the results together to create a single tensor.
The embedding generator writes the embedding to disk in the `safetensors` format. The resulting file is an important asset in the pipeline, as it needs to be copied to every service instance.
Here's a simple outline of how to create the embedding file (a code sketch follows the list):
- Load the tokenizer and model
- Use `encode_batch` to process multiple sets of keywords as a batch
- Feed the tensor of token ids to the model's `forward` function to get the embedding vectors
- Parallelize the embedding using `rayon`
- Stack the results together to create a single tensor
- Write the embedding to disk using `safetensors`
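Putting those steps together, here's a sketch of the batch-embedding function. It assumes the tokenizer has padding enabled (so every encoding in the batch has the same length) and takes the model's forward pass as a closure, since the exact signature differs between candle versions:

```rust
use candle_core::{Device, Tensor};
use tokenizers::Tokenizer;

fn embed_batch<F>(
    tokenizer: &Tokenizer,
    forward: F, // wraps e.g. BertModel::forward for whichever candle version you use
    sentences: Vec<String>,
    device: &Device,
) -> anyhow::Result<Tensor>
where
    F: Fn(&Tensor, &Tensor) -> candle_core::Result<Tensor>,
{
    // Tokenize the whole batch at once rather than one string at a time.
    let encodings = tokenizer
        .encode_batch(sentences, true)
        .map_err(anyhow::Error::msg)?;
    // Stack the padded token ids into a single [batch, seq_len] tensor.
    let ids: Vec<Tensor> = encodings
        .iter()
        .map(|e| Tensor::new(e.get_ids(), device))
        .collect::<candle_core::Result<_>>()?;
    let token_ids = Tensor::stack(&ids, 0)?;
    let token_type_ids = token_ids.zeros_like()?;
    // Forward pass: [batch, seq_len, hidden]. Compress to [batch, hidden]
    // with max pooling, then L2-normalize each row.
    let embeddings = forward(&token_ids, &token_type_ids)?;
    let pooled = embeddings.max(1)?;
    let norm = pooled.sqr()?.sum_keepdim(1)?.sqrt()?;
    Ok(pooled.broadcast_div(&norm)?)
}
```

The stacked result can then be written to disk with something like pooled.save_safetensors("my_embedding", "embedding.safetensors") (candle's helper for single-tensor safetensors files), producing the file the serving layer later loads.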
Performance Considerations
Running Hugging Face's Rust implementation locally can be a bit slow, because CPU inference is far slower than running on a high-performance GPU.
GPU usage is gated behind the CUDA toolkit, which can be difficult to rely on for web deployments unless CUDA is preinstalled on the host.
To alleviate slow performance in debug mode, you can run in release mode using cargo shuttle run --release or cargo run --release. This builds and runs the release profile of your application, which is much more heavily optimized than the debug profile.
For simple pipelines like sequence classification and question answering, performance between Python and Rust is expected to be comparable, thanks to the shared implementation of the language model in the Torch backend.
Text generation tasks, however, can see significant benefits, with processing speeds up to 2 to 4 times faster depending on the input and application.
Model Serving
Model Serving is a crucial aspect of any machine learning project, and Hugging Face's Rust implementation has made it easier than ever.
The Candle library API provides the building blocks for model serving, and on top of it we can create a custom module. This module, called BertInferenceModel, will encapsulate the BERT model and tokenizer from the Hugging Face repository.
BertInferenceModel will consist of three main functions: model loading, inference (or sentence embedding), and vector search using Cosine similarity.
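As a rough sketch of what the model-loading step could look like with candle and hf-hub (the repository and file names are assumptions, not necessarily what the original code uses):

```rust
use candle_core::{DType, Device};
use candle_nn::VarBuilder;
use candle_transformers::models::bert::{BertModel, Config};
use hf_hub::api::sync::Api;
use tokenizers::Tokenizer;

fn load_model_and_tokenizer() -> anyhow::Result<(BertModel, Tokenizer)> {
    let device = Device::Cpu;
    let api = Api::new()?;
    // Example repo: a small sentence-embedding model with 384-dimensional output.
    let repo = api.model("sentence-transformers/all-MiniLM-L6-v2".to_string());

    let config_path = repo.get("config.json")?;
    let tokenizer_path = repo.get("tokenizer.json")?;
    let weights_path = repo.get("model.safetensors")?;

    let config: Config = serde_json::from_str(&std::fs::read_to_string(config_path)?)?;
    let tokenizer = Tokenizer::from_file(tokenizer_path).map_err(anyhow::Error::msg)?;

    // Memory-map the safetensors weights and build the model from them.
    let vb = unsafe { VarBuilder::from_mmaped_safetensors(&[weights_path], DType::F32, &device)? };
    let model = BertModel::load(vb, &config)?;
    Ok((model, tokenizer))
}
```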
Onnx Support (Optional)
Enabling ONNX Support is optional, but it can be a game-changer for model serving.
You can enable ONNX support by adding the optional onnx feature to your project. The rust-bert crate doesn't include any optional dependencies for ort, so you'll need to select the set of features that would be adequate for pulling the required onnxruntime C++ library.
The current recommended installation is to use dynamic linking by pointing to an existing library location. To do this, use the load-dynamic cargo feature for ort.
Set the ORT_DYLIB_PATH environment variable to point to the location of the downloaded onnxruntime library. The correct file depends on your operating system, so download it from the onnxruntime project's release page.
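As a rough sketch, the Cargo.toml additions might look like the following (crate versions are placeholders; check the rust-bert documentation for the version pairing it expects):

```toml
[dependencies]
# Enable rust-bert's optional ONNX support.
rust-bert = { version = "0.22", features = ["onnx"] }
# Use dynamic linking for onnxruntime; point ORT_DYLIB_PATH at the
# downloaded library before running the application.
ort = { version = "1.16", features = ["load-dynamic"] }
```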
Most architectures are supported, including encoders, decoders, and encoder-decoders. The library aims to keep compatibility with models exported using the Optimum library.
The set of files needed to create ONNX models differs by architecture: encoder-only models need an encoder file, decoder-only models need a decoder file (optionally with a decoder-with-past variant), and encoder-decoder models need both an encoder and a decoder. Note that computational efficiency will drop if the optional decoder-with-past file is not provided, since the model won't be able to use cached past keys and values for the attention mechanism.
Model Serving and Embedding
Model serving and embedding are crucial components of a model serving pipeline. In this section, we'll focus on the implementation of the module that serves as an abstraction layer on top of the Candle library API.
The module, named BertInferenceModel, will consist of three main functions: model loading, inference (or sentence embedding), and vector search using cosine similarity. The BertInferenceModel will encapsulate the BERT model and tokenizer that we download from the Hugging Face repository and will essentially wrap their functionality.
To implement the BertInferenceModel, we will use the Candle library API. The API provides a simple and efficient way to perform model loading, inference, and vector search.
Here are the main functions of the BertInferenceModel:
- Model loading: This function will load the pre-trained Bert model and tokenizer from the Hugging Face repository.
- Inference (or sentence embedding): This function will take a sentence as input, encode it using the tokenizer, and then use the embedding model to get the embedding vectors.
- Vector search using Cosine similarity: This function will take a query vector and search for the most similar vectors in the pre-loaded embedding.
The inference function is also fairly straightforward. We start by using the loaded tokenizer to encode the sentence, and then we wrap the array of token ids in a tensor that we can feed to our embedding model. The model returns a tensor of shape [128, 384], one 384-dimensional vector per token position, which we then compress into a single vector of shape [1, 384] using max pooling followed by L2 normalization.
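The vector-search step then reduces to a matrix multiplication followed by a sort. A sketch, assuming the pre-loaded embedding is an [n, 384] tensor and the query is an L2-normalized [1, 384] vector (so the dot product equals the cosine similarity):

```rust
use candle_core::Tensor;

/// Return the indices and scores of the `top_k` most similar embedding rows.
fn score_vector_similarity(
    embeddings: &Tensor, // [n, 384], L2-normalized rows
    query: &Tensor,      // [1, 384], L2-normalized
    top_k: usize,
) -> anyhow::Result<Vec<(usize, f32)>> {
    // [n, 384] x [384, 1] -> [n, 1] -> [n]
    let scores = embeddings.matmul(&query.t()?)?.squeeze(1)?.to_vec1::<f32>()?;
    let mut indexed: Vec<(usize, f32)> = scores.into_iter().enumerate().collect();
    // Highest similarity first.
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.truncate(top_k);
    Ok(indexed)
}
```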
Loading Model Weights
You can load pretrained model weights from Hugging Face's model hub using RemoteResources, which are defined in the library.
Pretrained models are available for various transformer-based models, including BERT, DistilBERT, RoBERTa, GPT, GPT2, and BART.
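A sketch of pulling the pretrained BERT resources with rust-bert's RemoteResource (the resource constants shown are the library's bundled BERT defaults):

```rust
use rust_bert::bert::{BertConfigResources, BertModelResources, BertVocabResources};
use rust_bert::resources::{RemoteResource, ResourceProvider};

fn main() -> anyhow::Result<()> {
    // Remote resources describe where to fetch the pretrained files from.
    let config_resource = RemoteResource::from_pretrained(BertConfigResources::BERT);
    let vocab_resource = RemoteResource::from_pretrained(BertVocabResources::BERT);
    let model_resource = RemoteResource::from_pretrained(BertModelResources::BERT);

    // get_local_path() downloads (and caches) each file and returns its local path.
    println!("config:  {:?}", config_resource.get_local_path()?);
    println!("vocab:   {:?}", vocab_resource.get_local_path()?);
    println!("weights: {:?}", model_resource.get_local_path()?);
    Ok(())
}
```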
To load pre-trained weights, the parameter naming convention has to be aligned with the Rust schema; otherwise the weights cannot be found in the weight files.
If this quality check needs to be skipped, you can use the load_partial method instead.
A conversion utility script is included in ./utils to convert Pytorch weights to a set of weights compatible with the library.
Compile to WASM and Wrap
To compile our adapted Tokenizers package to WASM, we need to install wasm-pack and use wasm-pack build to create a WASM package.
This process is straightforward: install wasm-pack, run wasm-pack build, and you'll have a WASM package ready to go. The Rust crate is now successfully ported to WASM, and we can push it to npm with wasm-pack publish.
We did this, and the package can be found on npm as tokenizers-wasm. On its own, though, this isn't enough to expose a complex library like Tokenizers: essential features are still missing, such as the ability to pull a pretrained tokenizer model from the Hugging Face Hub.
Web Service
As we explore the world of Hugging Face's Rust framework, let's take a closer look at how we can create a web service that leverages the power of large language models.
We can use the Candle library to implement a model serving and embedding system. This system will provide an abstraction layer on top of the Candle library API, making it easier to work with the Bert model and tokenizer.
The BertInferenceModel struct will be the core of our web service, encapsulating the Bert model and tokenizer that we download from the Hugging Face repository.
Here are some key components of the BertInferenceModel struct:
- Model loading: This function will handle the process of loading the Bert model and tokenizer from the Hugging Face repository.
- Inference (or sentence embedding): This function will take in a sentence and return its embedding using the Bert model.
- Vector search using Cosine similarity: This function will allow us to search for similar vectors using the Cosine similarity metric.
By using the Candle library and the BertInferenceModel struct, we can create a robust and efficient web service that leverages the power of large language models.
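To make that concrete, here's a minimal sketch of the web layer using axum (an assumption; any Rust web framework would do), with a stub standing in for the real BertInferenceModel:

```rust
use std::sync::Arc;

use axum::{extract::State, routing::post, Json, Router};
use serde::{Deserialize, Serialize};

// Stand-in for the real BertInferenceModel described above; it only exists so
// this sketch of the web layer compiles on its own.
struct BertInferenceModel;

impl BertInferenceModel {
    fn infer_sentence_embedding(&self, _sentence: &str) -> Vec<f32> {
        vec![0.0; 384] // placeholder embedding
    }
    fn score_vector_similarity(&self, _query: Vec<f32>, k: usize) -> Vec<(usize, f32)> {
        (0..k).map(|i| (i, 1.0)).collect() // placeholder results
    }
}

#[derive(Deserialize)]
struct SearchRequest {
    text: String,
    num_results: usize,
}

#[derive(Serialize)]
struct SearchResult {
    index: usize,
    score: f32,
}

async fn search(
    State(model): State<Arc<BertInferenceModel>>,
    Json(req): Json<SearchRequest>,
) -> Json<Vec<SearchResult>> {
    // Embed the query sentence, then look up the most similar stored vectors.
    let query = model.infer_sentence_embedding(&req.text);
    let hits = model
        .score_vector_similarity(query, req.num_results)
        .into_iter()
        .map(|(index, score)| SearchResult { index, score })
        .collect();
    Json(hits)
}

#[tokio::main]
async fn main() {
    let model = Arc::new(BertInferenceModel);
    let app = Router::new().route("/search", post(search)).with_state(model);
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```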
Codebase and Architecture
The codebase for Hugging Face Rust is designed to be scalable and efficient, with a clear separation of concerns between document processing and vector storage. The pipeline architecture is based on two main tasks: a read and embed task, and a write task.
These tasks are separated into different threads to avoid contention and back-pressure, with an MPSC channel connecting them for thread-safe and asynchronous communication. This design allows for efficient parallelization and scalability.
The main flow of the pipeline is simple, with the read and embed task sending vectors and IDs to the channel, and the write task reading from the channel, chunking vectors in memory, and flushing them when it reaches a certain size.
Make Codebase WASM Compatible
Making your codebase WASM compatible is a crucial step in the process of porting it to WebAssembly. Most Rust code is compatible by default, but there are some exceptions and pitfalls.
To make the Tokenizers codebase WASM-compatible, some features that use the filesystem had to be disabled because WASM can’t access the file system. This was a necessary step to ensure the codebase could be compiled to WASM.
A C dependency for regex was another issue that needed to be addressed. The solution was to use another regex library written in Rust instead.
Here are some changes that were made to the Tokenizers codebase to make it WASM-compatible:
- Disabled features that use the filesystem.
- Replaced the C dependency for regex with a Rust-written regex library.
The WASM-compatible version of Tokenizers is available through the unstable_wasm feature. It has been upstreamed on the Tokenizers repository and is now usable, though not yet thoroughly tested.
Pipeline Architecture
Our pipeline architecture is designed to handle the task of embedding text documents into vectors. It consists of two main tasks: a read and embed task, and a write task.
The read and embed task reads text from a text file and embeds it into a BERT vector using an embedding model. This task is mostly CPU bound, requiring multiple ML model operations, so it's separated from the write task to avoid contention and back-pressure.
The write task writes the embedding to the vector store, and it's mostly waiting on IO operations. To improve scalability, the two tasks are connected with an MPSC channel, enabling thread-safe and asynchronous communication between threads.
Each time the embedding task finishes embedding a text document, it sends the vector and its ID to the channel and continues to the next document. The write task continuously reads from the channel, chunks the vectors in memory, and flushes them when it reaches a certain size.
The main flow is easiest to follow in the main() function, which sets up the channel, spawns the write-task thread, and lists the files in the relevant directory. The read-and-embed task then uses Rayon to parallelize its processing of those files via the process_text_file() function, as sketched at the end of this section.
The pipeline design allows for efficient parallelization and scalability, primarily by orchestrating two main tasks: document processing and vector storage. The document processing task uses Rayon to parallelize file handling, maximizing the use of available system resources.
This separation of concerns simplifies the overall architecture and allows for independent optimization of each task. By separating the tasks, we can optimize the document processing task to use as many cores as available on the machine, while the storage task manages the efficient writing of embedded vectors to LanceDB.
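Here's a sketch of that two-task shape; the file names, the fake embedding, and the flush threshold are stand-ins that only illustrate how the channel connects the rayon-parallelized embedding work to the single writer thread:

```rust
use std::sync::mpsc;
use std::thread;

use rayon::prelude::*;

type EmbeddedChunk = (String, Vec<f32>); // (document id, embedding vector)

fn process_text_file(path: &str, sender: &mpsc::Sender<EmbeddedChunk>) {
    // In the real pipeline this reads the file and runs the embedding model.
    let fake_embedding = vec![0.0_f32; 384];
    sender
        .send((path.to_string(), fake_embedding))
        .expect("write task hung up");
}

fn main() {
    let (sender, receiver) = mpsc::channel::<EmbeddedChunk>();

    // Write task: read from the channel, buffer vectors, flush in chunks.
    let writer = thread::spawn(move || {
        let mut buffer: Vec<EmbeddedChunk> = Vec::new();
        for chunk in receiver {
            buffer.push(chunk);
            if buffer.len() >= 1000 {
                // flush `buffer` to the vector store (e.g. LanceDB) here
                buffer.clear();
            }
        }
        // final flush of whatever is left in `buffer` would go here
    });

    // Read-and-embed task: parallelized across files with rayon.
    let files = vec!["docs/doc_0001.txt".to_string(), "docs/doc_0002.txt".to_string()];
    files
        .par_iter()
        .for_each_with(sender, |s, path| process_text_file(path, s));

    // All sender clones are dropped once the parallel loop ends, which closes
    // the channel and lets the write task finish.
    writer.join().expect("write task panicked");
}
```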
Sources
- https://www.shuttle.dev/blog/2024/05/01/using-huggingface-rust
- https://blog.mithrilsecurity.io/porting-tokenizers-to-wasm/
- https://towardsdatascience.com/streamlining-serverless-ml-inference-unleashing-candle-frameworks-power-in-rust-c6775d558545
- https://lib.rs/crates/rust-bert
- https://towardsdatascience.com/scale-up-your-rag-a-rust-powered-indexing-pipeline-with-lancedb-and-candle-cc681c6162e8