Fine-Tune a Local LLM for Custom AI Applications


Posted Nov 17, 2024


Credit: pexels.com, an artist's illustration of artificial intelligence (AI) by Tim West, visualising the streams of data that large language models produce.

Fine-tuning a local LLM can significantly improve its performance in custom AI applications. This process involves adjusting the model's weights to better fit the specific needs of your project.

Because a fine-tuned model starts from broad pre-trained knowledge, it can generalize to new, unseen data better than a model trained from scratch on the same data, and with parameter-efficient techniques the risk of overfitting stays manageable. This is especially important for applications where data is limited.

To fine-tune a local LLM, you'll need a dataset that's relevant to your project and a machine with sufficient resources. In practice that usually means a desktop or laptop with a GPU that has enough memory for the model and the training overhead; a constrained device like a Raspberry Pi is generally not up to the task.


What Is an LLM?

LLMs, or Large Language Models, are pre-trained models that know a lot about a lot.

These models are like general-purpose experts, but for production, we need models that know a lot about a little. Fine-tuning is the process of taking these pre-trained models and further training them on smaller, specific datasets to refine their capabilities and improve performance in a particular task or domain.


Credit: youtube.com, EASIEST Way to Fine-Tune a LLM and Use It With Ollama

Fine-tuning is about turning general-purpose models into specialized models that can tackle specific tasks with ease.

The goal of fine-tuning is to adapt large pre-trained models to new tasks. This is where Parameter-Efficient Fine-Tuning (PEFT) comes in: a family of techniques designed to do exactly that with minimal computational overhead and memory usage.

PEFT reuses the pre-trained model's weights and trains only a small number of parameters on a smaller dataset, saving computational resources and time compared to training the entire model from scratch.

LLMs are incredibly powerful, but they can be resource-intensive. That's why techniques like PEFT and LoRA are so important – they help reduce the number of trainable parameters, making it easier to fine-tune models locally.

Fine Tuning Process

Fine-tuning a Large Language Model (LLM) can be a complex task, but thankfully, we have tools like HuggingFace's AutoTrain-Advanced to simplify the process.

AutoTrain-Advanced is a Python package from HuggingFace; in this walkthrough it runs inside Snowpark Container Services, so an LLM can be fine-tuned securely on Snowflake compute. Using pairs of inputs and outputs, we can fine-tune an open-source LLM to better align with our use case's expectations.

Credit: youtube.com, Fine Tune a model with MLX for Ollama

To fine-tune an LLM, we'll use the AutoTrain-Advanced package from HuggingFace. It takes care of the complexity of fine-tuning, allowing us to focus on other aspects of the project.
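For orientation, fine-tuning jobs with autotrain-advanced are typically launched from the command line. The invocation below is only a sketch: the project name, model, data path, and hyperparameters are assumptions, and exact flag names vary between autotrain-advanced releases, so check `autotrain llm --help` for your installed version.

```bash
# Sketch of an autotrain-advanced fine-tuning run; flag names vary across versions.
# Assumes a local ./data folder with a CSV/JSONL file containing a "text" column.
autotrain llm --train \
  --project-name offer-descriptions \
  --model meta-llama/Llama-2-7b-hf \
  --data-path ./data \
  --text-column text \
  --use-peft \
  --lr 2e-4 \
  --batch-size 4 \
  --epochs 3
```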

Here's a quick rundown of what we'll learn during the fine-tuning process:

  • Fine-tuning an LLM with HuggingFace's AutoTrain-Advanced module
  • How to persist LLMs in the file system using volume mounts

By the end of this process, we'll have successfully fine-tuned a Large Language Model to describe product offer metadata using Snowpark Container Services and HuggingFace's AutoTrain-Advanced module.

Fine Tuning Techniques

Fine-tuning is a crucial step in adapting large language models (LLMs) to specific tasks. It can be done using various techniques to make the process more efficient.

Parameter-Efficient Fine-Tuning (PEFT) methods aim to reduce the number of trainable parameters while maintaining performance. One popular method is Low-Rank Adaptation for Large Language Models (LoRA), which decomposes a large weight matrix into two smaller, low-rank matrices.

By using LoRA, you can make fine-tuning more efficient, reduce the number of trainable parameters, and keep the original pre-trained weights frozen. This approach can be applied to any subset of weight matrices in a neural network, but it's typically applied to attention blocks only in Transformer models.


Credit: youtube.com, Fine-tuning Large Language Models (LLMs) | w/ Example Code

Here are some techniques to make fine-tuning more efficient:

  • Packing: Concatenating texts with an End-Of-Sentence (EOS) token in between and cutting chunks of the context size to fill the batch without any padding.
  • Train on completion only: Computing the loss only on the completion (answer) portion of each example (prompt + answer), rather than on the whole sequence.

These techniques can be used with the SFTTrainer to perform supervised fine-tuning. With SFTTrainer, you can easily adapt the training to your hardware setup in one line of code!
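Below is a minimal sketch of how these options look with trl's SFTTrainer. The model name, dataset file, text column, and response template are placeholders, and argument names can differ between trl versions.

```python
# Sketch: supervised fine-tuning with trl's SFTTrainer.
# Model name, file names, and the response template are placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
dataset = load_dataset("json", data_files="train.jsonl", split="train")  # assumed local file

# Packing: concatenate examples separated by EOS tokens and cut context-sized chunks,
# so batches are filled without padding.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",   # column holding the "prompt + answer" strings
    max_seq_length=1024,
    packing=True,
)

# Train-on-completion-only (alternative to packing): mask the prompt so the loss is
# computed only on the answer. You would pass this collator via data_collator=...
# and set packing=False.
completion_collator = DataCollatorForCompletionOnlyLM(
    response_template="### Answer:", tokenizer=tokenizer
)

trainer.train()
```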

Parameter Efficient Methods

Parameter Efficient Fine-Tuning (PEFT) methods aim to drastically reduce the number of trainable parameters of a model while keeping the same performance as full fine-tuning.

PEFT methods can be differentiated by their conceptual framework, such as fine-tuning a subset of existing parameters, introducing new parameters, or introducing trainable prompts.

One of the most adopted PEFT methods is Low-Rank Adaptation for Large Language Models (LoRA), which works by attaching extra trainable parameters to a model: the update to a large weight matrix is decomposed into two smaller, low-rank matrices.

LoRA makes fine-tuning more efficient by drastically reducing the number of trainable parameters and keeping the original pre-trained weights frozen.

LoRA can be applied to any subset of weight matrices in a neural network to reduce the number of trainable parameters, but for simplicity and further parameter efficiency, it's typically applied to attention blocks only in Transformer models.

Credit: youtube.com, Fine-tuning Large Language Models (LLMs) | w/ Example Code

The resulting number of trainable parameters in a LoRA model depends on the size of the low-rank update matrices, which is determined mainly by the rank r and the shape of the original weight matrix.
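A short sketch of what this looks like with the peft library is below; the base model checkpoint and target modules are illustrative assumptions (attention projections in a Llama-style Transformer).

```python
# Sketch: wrapping a causal LM with LoRA adapters using the peft library.
# Base model and target_modules are illustrative; adjust for your architecture.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the LoRA updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # typically the attention projections only
)

model = get_peft_model(model, lora_config)

# For each adapted (d x k) weight matrix, LoRA adds roughly r * (d + k) parameters;
# the base weights stay frozen.
model.print_trainable_parameters()
```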

Here are some advantages of using LoRA:

  • LoRA makes fine-tuning more efficient by drastically reducing the number of trainable parameters.
  • The original pre-trained weights are kept frozen, making it possible to have multiple lightweight and portable LoRA models for various downstream tasks.
  • LoRA is orthogonal to many other parameter-efficient methods and can be combined with many of them.
  • The performance of models fine-tuned using LoRA is comparable to the performance of fully fine-tuned models.
  • LoRA does not add any inference latency when adapter weights are merged with the base model.

Two Answers

According to KurtMica, who edited the answer to include this information, Trainer.save_model() behaves essentially the same whether training runs in a single process or across multiple processes.

Credit: youtube.com, Fine Tune LLaMA 2 In FIVE MINUTES! - "Perform 10x Better For My Use Case"

The docs for save_model mention that it will only save from the main process; as KurtMica clarified, this refers to the case where multiple processes are used to train the model.

To save the model you can point the Trainer at an output dir, but with adapter-based fine-tuning this only saves the adapter layer, not the full fine-tuned model. To end up with a standalone full model, the adapter weights need to be merged back into the base model, as sketched below.
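A minimal sketch of that merge step with the peft library follows; the directory names and base model are placeholders.

```python
# Sketch: merging LoRA adapter weights back into the base model so a standalone,
# full model can be saved. Paths and model name are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "output_dir/checkpoint")  # adapter saved by the Trainer

merged = model.merge_and_unload()            # folds the adapter into the base weights
merged.save_pretrained("full-finetuned-model")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.save_pretrained("full-finetuned-model")
```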

Here's a summary of the key points:

  • Trainer.save_model() behaves the same in single-process and multi-process training.
  • The docs note that save_model only saves from the main process when multiple processes are used for training.
  • The output dir argument can be used to save the model, but with adapter-based fine-tuning it only saves the adapter layer.

Data and Experimentation

Data quality is crucial for fine-tuning a local LLM. If the data is not accurate, the model cannot deliver good results.

Note that the quality of the data is really important, and you typically need thousands of samples. If you don’t have enough, you might want to consider creating synthetic data.

The Data subsection below gives a rough idea of the structure your dataset might need.

Configure and Upload

To configure and upload your specification, start by navigating to the src/autotrain.yaml file in your project directory or create a new yaml file with the following content. Be sure to update the REPOSITORY_URL with your corresponding URL from the prior section.

Credit: youtube.com, Microsoft Azure Cloud - Machine Learning - Upload Data & Create Experiment - DIY-11-of-20

The specification file should have the selected model to fine-tune, which is meta-llama/Llama-2-7b-hf in the provided example. The fine-tuned model will be placed in Snowflake stage VOLUMES under the directory llama-2-ft.

Two endpoints are included in the specification file: jupyter and app. JupyterLab will be used to check the fine-tuning status, and the other endpoint will serve a quick chat-based test of the fine-tuned model.

You can request two GPUs to enable running both a fine-tuned and non-fine-tuned meta-llama/Llama-2-7b-hf on separate GPUs in the chat test.
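For orientation, a stripped-down spec along these lines might look like the sketch below. It follows the general shape of a Snowpark Container Services specification, but the image path, ports, mount path, and volume name are assumptions, so defer to the autotrain.yaml that ships with the quickstart.

```yaml
# Sketch of an autotrain.yaml service spec (illustrative values only).
spec:
  containers:
    - name: autotrain
      image: <REPOSITORY_URL>/autotrain:latest   # replace with your image repository URL
      volumeMounts:
        - name: llm-workspace
          mountPath: /workspace                  # fine-tuned model lands under llama-2-ft here
      resources:
        requests:
          nvidia.com/gpu: 2                      # one GPU for the fine-tuned, one for the base model
        limits:
          nvidia.com/gpu: 2
  endpoints:
    - name: jupyter                              # JupyterLab, to check fine-tuning status
      port: 8888
      public: true
    - name: app                                  # quick chat-based test of the fine-tuned model
      port: 7860
      public: true
  volumes:
    - name: llm-workspace
      source: "@VOLUMES"                         # Snowflake stage used to persist the model
```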

To save the file to the stage, you can use one of the following methods:

  • PUT file://autotrain.yaml @CONTAINER_HOL_DB.PUBLIC.SPECS AUTO_COMPRESS=FALSE OVERWRITE=TRUE; with SnowSQL
  • snow stage copy ./autotrain.yaml @specs --overwrite with the Snow CLI
  • Upload to stage CONTAINER_HOL_DB.PUBLIC.SPECS using the Snowsight GUI

Once you've uploaded the file, start the Snowpark Container Service by running the SQL commands in 03_start_service.sql using the Snowflake VSCode Extension or in a SQL worksheet.


QLoRA: A Core Contribution to AI Democratization

QLoRA is a groundbreaking approach that has revolutionized the field of AI by making it more accessible and efficient.

Credit: youtube.com, QLoRA Explained: Making Giant AI Models

By using Quantized model weights + Low-Rank Adapters, QLoRA has achieved a remarkable 90% reduction in fine-tuning memory footprint while maintaining 16-bit fine-tuning performance across all scales and models.

This is a game-changer for the AI community, as it allows for the fine-tuning of state-of-the-art models on consumer-grade hardware.

The LoRA (Low-Rank Adapters) component is pivotal in QLoRA, enabling both fine-tuning and the correction of residual quantization errors.

Through generous use of LoRA, QLoRA achieves performance equivalent to 16-bit full model fine-tuning, making it an attractive solution for those looking to democratize AI.

To achieve high-fidelity fine-tuning of 4-bit models, QLoRA employs three algorithmic tricks:

  1. 4-bit NormalFloat (NF4) quantization, which exploits the normal distribution of model weights and enhances information density.
  2. Double Quantization, which quantizes the quantization constants for further savings.
  3. Paged Optimizers, which prevent memory spikes during gradient checkpointing from causing out-of-memory errors.

By combining these refinements to the quantization process and generous use of LoRA, QLoRA compresses the model by over 90% while retaining full model performance without the usual quantization degradation.
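In practice, loading a base model in 4-bit NF4 with double quantization and attaching LoRA adapters looks roughly like the sketch below, using the transformers and peft libraries; the model name and hyperparameters are illustrative.

```python
# Sketch: QLoRA-style setup - a 4-bit NF4 quantized base model plus LoRA adapters.
# Model name and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NormalFloat quantization
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepares the quantized model for training (casts norms, enables input gradients, etc.).
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32,
                         target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```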

Data

Data plays a crucial role in experimentation, and it's essential to understand its importance. High-quality data is needed for fine-tuning, and it's not just about having a lot of data, but also about having accurate data.

Credit: youtube.com, Wayfair Data Science Explains It All: Experimentation

The structure of the data is also important, and it typically includes input and output_ground_truth columns. For example, you might have a table like this:
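The rows below are illustrative, in the spirit of the product-offer example, and are not taken from the original dataset:

| input | output_ground_truth |
| --- | --- |
| brand: Acme, category: running shoes, price: 79 USD | Lightweight Acme running shoes for everyday training, priced at 79 USD. |
| brand: Northwind, category: wool sweater, price: 120 USD | A warm Northwind wool sweater for cold weather, priced at 120 USD. |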

In this example, the quality of the data is really important, and if it's not accurate, the model cannot deliver good results. Typically, you need thousands of samples, so if you don't have enough, you might want to consider creating synthetic data.

Comet Experiment Tracking

CometML is a great tool for tracking experiments, allowing you to inspect the details of each experiment, including parameters, code, metrics, and metadata fields and artifacts.

You can select an experiment to view a detailed summary, including the model definition, hyperparameters, metrics, system metrics, code changes, and more.

The key components of Comet's experiment tracking are the Charts and Panels, which help you monitor the fine-tuning process.

To enable Comet to log everything automatically, make sure to import comet_ml before importing torch in your script.
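A minimal sketch of that import order and experiment setup is below, assuming your Comet API key is configured in the environment and using a placeholder project name.

```python
# Sketch: comet_ml must be imported before torch/transformers so it can hook into
# the training loop and auto-log metrics, code, and system stats.
import comet_ml                      # must come first
import torch
from transformers import Trainer    # or trl's SFTTrainer

experiment = comet_ml.Experiment(
    project_name="llm-fine-tuning",  # placeholder project name
)

# ... build model, dataset, and trainer here ...
# trainer.train()  # metrics such as validation_loss then show up as Comet charts/panels
```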


Credit: youtube.com, Futureproofing MLOps: CACE Principle and Tracking ML Experiments

By comparing multiple experiments, you can identify the key set of parameters and insights from the fine-tuning process.

Comet's Compare feature overlaps the experiments and provides a common view, making it easier to spot key insights from the training process.

You can add multiple panels to view different metrics, such as validation_loss, by selecting the Line Chart type and choosing the desired metric.

Comet's Code Diff feature offers a git-like interface to compare code changes between experiments.

Here are the key features of Comet's experiment tracking:

  • Model definition summary of layers and modules (Graph view)
  • Logged hyperparameters and metrics
  • System metrics (GPU and CPU usage during an active experiment run)
  • Code changes
  • Charts and Panels for monitoring the fine-tuning process

Example and Tutorial

Fine-tuning a local LLM requires splitting data into training and evaluation datasets. Tokenization is done via the tokenizer from FLAN T5.

To fine-tune a local LLM, you'll need to split your data into training and evaluation datasets. The script provided splits the data into 80% for training and 20% for testing.

Tokenization is a crucial step in preparing data for LLMs. The example script uses the tokenizer that ships with FLAN T5, so the tokenization matches the model being fine-tuned.


Credit: youtube.com, Local LLM Fine-tuning on Mac (M1 16GB)

The script uses the `AutoTokenizer` from the `transformers` library to tokenize the data. The `from_pretrained` method is used to load the pre-trained tokenizer model.

The `preprocess_features` function is used to preprocess the features of the data. This function is applied to each sample in the data using the `apply` method.

The script then uses the `DatasetDict` class from the `datasets` library to create a dictionary of datasets, one for training and one for testing. The `concatenate_datasets` function is used to concatenate the two datasets into a single dataset.

The script then maps the `preprocess_function` to each sample in the dataset. This function is used to preprocess the inputs and labels of the data.

The `tokenized_dataset` is then saved to disk as an Arrow file and a dataset info file. The `save_to_disk` method is used to save the dataset to disk.

The script also prints the amount of data for training and testing. The `sample` method is used to select 80% of the data for training and the rest for testing.
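Pulling those steps together, a condensed sketch of the data preparation might look like the code below; the column names, FLAN T5 checkpoint, file paths, and sequence lengths are assumptions based on the description above.

```python
# Sketch of the data preparation described above: 80/20 split, FLAN T5 tokenization,
# and saving the tokenized dataset to disk. Column names and paths are assumptions.
import pandas as pd
from datasets import Dataset, DatasetDict, concatenate_datasets
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")  # assumed FLAN T5 checkpoint

df = pd.read_csv("data.csv")                      # columns: input, output_ground_truth
train_df = df.sample(frac=0.8, random_state=42)   # 80% for training
test_df = df.drop(train_df.index)                 # remaining 20% for testing
print(f"Amount data for training: {len(train_df)}")
print(f"Amount data for testing: {len(test_df)}")

dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df.reset_index(drop=True)),
    "test": Dataset.from_pandas(test_df.reset_index(drop=True)),
})

# The tutorial also concatenates train and test, e.g. to compute statistics over all samples.
all_data = concatenate_datasets([dataset["train"], dataset["test"]])

def preprocess_function(samples):
    # Tokenize inputs and labels; truncation lengths are illustrative.
    model_inputs = tokenizer(samples["input"], max_length=512, truncation=True)
    labels = tokenizer(text_target=samples["output_ground_truth"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True,
                                remove_columns=dataset["train"].column_names)

# Saves Arrow files plus a dataset info file for later training runs.
tokenized_dataset.save_to_disk("tokenized_dataset")
```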

Here is an example of the output of the script:

```
Amount data for training: 80
Amount data for testing: 20
```

This indicates that the script has successfully split the data into training and testing datasets.

Learning and Results

Credit: youtube.com, "okay, but I want GPT to perform 10x for my specific use case" - Here is how

In this section, we'll dive into the learning and results of fine-tuning a local Large Language Model (LLM).

You'll learn the basic mechanics of how Snowpark Container Services works, which is essential for deploying a long-running service with a UI and persisting LLMs in the file system.

The process involves deploying a long-running service with a UI and using volume mounts to persist LLMs in the file system. This allows you to fine-tune an LLM with HuggingFace's AutoTrain-Advanced module.

Fine-tuning the LLM with HuggingFace's AutoTrain-Advanced module is the central step of the process, and you'll learn how to configure and run it.

Here's a summary of what you'll learn:

  • The basic mechanics of how Snowpark Container Services works
  • How to deploy a long-running service with a UI and use volume mounts to persist LLMs in the file system
  • How to fine-tune an LLM with HuggingFace's AutoTrain-Advanced module
  • How to deploy an LLM chat-interface with FastChat (optional)

With these skills, you'll be able to fine-tune a Large Language Model to describe product offer metadata using Snowpark Container Services and HuggingFace's AutoTrain-Advanced module.

Frequently Asked Questions

What is instruction fine-tuning in LLM?

Instruction fine-tuning is a machine learning technique that trains a model to respond correctly to queries by using examples of desired outputs. This process enhances a model's performance on various tasks by teaching it to follow specific instructions and guidelines.

How many examples to fine-tune LLM?

For effective fine-tuning, a minimum of 1,000 examples per task is recommended to avoid overfitting. However, having a larger dataset can lead to more accurate results.

Keith Marchal

Senior Writer

Keith Marchal is a passionate writer who has been sharing his thoughts and experiences on his personal blog for more than a decade. He is known for his engaging storytelling style and insightful commentary on a wide range of topics, including travel, food, technology, and culture. With a keen eye for detail and a deep appreciation for the power of words, Keith's writing has captivated readers all around the world.
