Fine Tune BERT for Text Classification and NLP

Fine-tuning BERT for text classification and NLP is a game-changer for many applications. BERT's pre-trained model can be adapted to various tasks with just a few tweaks.

The pre-training process involves masking 15% of the input tokens, which helps the model learn contextual relationships. This is a key aspect of BERT's success in NLP tasks.

To fine-tune BERT for text classification, you'll need to add a classification layer on top of the pre-trained model. This layer outputs a probability distribution over the possible classes.

Fine-tuning BERT can be done with just a few lines of code, making it a relatively simple process.
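
For example, a minimal sketch with the Hugging Face transformers library looks like this (the two-class setup here is an illustrative assumption, not part of the original tutorial):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# bert-base-uncased plus a randomly initialized classification head
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```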

Fine-Tuning BERT

Fine-tuning BERT is a process where a pre-trained model is adapted to a specific task by training a new layer on top of the pre-trained model. This process empowers the model to gain task-specific knowledge and enhance its performance on the target task.

The pre-trained BERT model is a generalized tool that understands language but isn't tailored for any specific task. Fine-tuning is the act of adapting this generalized tool for a specialized job.

In the lightest-weight setup, only the weights of the supplementary layer appended to the pre-trained BERT model are updated during fine-tuning; the pre-trained BERT weights themselves remain frozen. More aggressive strategies update some or all of the pre-trained weights as well.

Here are the different fine-tuning techniques (a short code sketch of all three follows the list):

  • Train the entire architecture – We can further train the entire pre-trained model on our dataset and feed the output to a softmax layer.
  • Train some layers while freezing others – Another way to use a pre-trained model is to train it partially by keeping the weights of initial layers frozen while retraining only the higher layers.
  • Freeze the entire architecture – We can even freeze all the layers of the model and attach a few neural network layers of our own and train this new model.
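
As a rough sketch, the three options map to toggling requires_grad on different parts of the model (which layers to freeze in option 2 is an illustrative choice):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Option 1: train the entire architecture -- all parameters stay trainable (the default).

# Option 2: freeze the lower encoder layers, retrain only the higher ones.
for layer in model.bert.encoder.layer[:8]:  # freezing the first 8 of 12 layers is illustrative
    for param in layer.parameters():
        param.requires_grad = False

# Option 3: freeze the whole BERT backbone and train only the new classification head.
for param in model.bert.parameters():
    param.requires_grad = False
```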

What Is Fine-Tuning?

Fine-tuning a pre-trained model like BERT adapts it to a specific downstream task by training a new layer on data from that task. This process gives the model task-specific knowledge and improves its performance on the target task.

In other words, the pre-trained model understands language in general but isn't yet specialized for any particular job.

Think of fine-tuning as a general medical practitioner (BERT) who then goes on to specialize in cardiology (sentiment analysis) — they can't become a cardiologist without additional, specific training, even though their general medical training forms a strong base for their specialization.

There are different fine-tuning techniques, including:

  • Train the entire architecture
  • Train some layers while freezing others
  • Freeze the entire architecture

The choice of fine-tuning technique depends on the specific task and the desired outcome. In this tutorial, we will use the third approach, freezing all the layers of BERT during fine-tuning and appending a dense layer and a softmax layer to the architecture.

Fine-tuning a pre-trained model like BERT requires careful selection of hyperparameters to get the best possible results. In this setup, both the batch size and the number of parallel processes are set to 32, and the base learning rate is set to 5e-5 (0.00005).
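
A minimal sketch of that third approach, assuming a 512-unit dense layer on top of the frozen [CLS] representation (both are illustrative choices, not the tutorial's exact architecture):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class BertClassifier(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        # Freeze every BERT parameter; only the new head below is trained.
        for param in self.bert.parameters():
            param.requires_grad = False
        self.dense = nn.Linear(self.bert.config.hidden_size, 512)
        self.relu = nn.ReLU()
        self.out = nn.Linear(512, num_classes)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_repr = outputs.last_hidden_state[:, 0]  # representation of the [CLS] token
        return self.softmax(self.out(self.relu(self.dense(cls_repr))))

model = BertClassifier()
# Only the parameters that still require gradients are optimized, at the base learning rate of 5e-5.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=5e-5
)
```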

Problem Statement

We have a collection of SMS messages that need to be analyzed. Some of these messages are spam, while the rest are genuine.

Our task is to build a system that would automatically detect whether a message is spam or not.

The dataset for this task can be downloaded, and it's available for anyone to use.

Importing Libraries

AutoTokenizer is a crucial library for tokenizing text data into a format BERT can understand.

AutoTokenizer is particularly useful because its "Auto" prefix means it can infer the appropriate tokenizer for whichever pre-trained model you specify, so the same code works across many architectures.

DataCollatorWithPadding ensures that our tokenized data is batched together with consistent lengths, adding padding where necessary.

This is crucial for training stability and efficiency.

AutoModelForSequenceClassification is a generic class that can instantiate model architectures tailored for sequence classification tasks.

Its versatility across various pre-trained models makes it a convenient choice.

TrainingArguments provides a convenient way to define the training configuration, such as the learning rate, batch size, and number of epochs.

Pipeline simplifies the process of applying models on data, making it a handy tool for post-training evaluations and predictions.
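
Putting those together, the imports for this workflow look roughly like this (the Trainer class is included on the assumption that it drives the training loop later on):

```python
from transformers import (
    AutoTokenizer,
    DataCollatorWithPadding,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    pipeline,
)
```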

Using Google Cloud AI Platform

To fine-tune a pre-trained BERT model on Google Cloud AI Platform, you'll need to set up the environment and prepare your data for training. This involves enabling the AI Platform and Compute Engine APIs for your project, installing necessary libraries like `transformers` and `tensorflow`, and formatting your data in a compatible format such as CSV or JSON.

You'll also need to upload your data to a Cloud Storage bucket, which will allow the training job to access the data. This can be done using the `gsutil` command, for example: `gsutil cp path/to/your/dataset.csv gs://your-bucket-name/dataset.csv`.

The best practices for configuring the training job include specifying hyperparameters, utilizing GPUs/TPUs, and handling model checkpoints. You can do this by modifying your training script to save checkpoints to a specified path, such as `gs://your-bucket-name/checkpoints`.

Here's a summary of the steps involved in fine-tuning a BERT model on Google Cloud AI Platform:

  • Enable the AI Platform and Compute Engine APIs for your project
  • Install necessary libraries like `transformers` and `tensorflow`
  • Format your data in a compatible format
  • Upload your data to a Cloud Storage bucket
  • Configure the training job with hyperparameters and GPU/TPU utilization
  • Handle model checkpoints by saving them to a specified path

By following these steps, you can fine-tune a pre-trained BERT model on Google Cloud AI Platform and achieve state-of-the-art results for your natural language processing project.

Note: Both the Google Cloud SDK (gcloud) and the Vertex AI SDK can launch training and fine-tuning jobs, but the Vertex AI SDK is usually the better choice thanks to its simpler, ML-focused interface and optimizations.
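
As a sketch of what launching such a job with the Vertex AI SDK might look like (the project ID, bucket, training script, container image, and machine configuration below are all placeholders to adapt to your own setup):

```python
from google.cloud import aiplatform

aiplatform.init(
    project="your-project-id",
    location="us-central1",
    staging_bucket="gs://your-bucket-name",
)

job = aiplatform.CustomTrainingJob(
    display_name="bert-finetune",
    script_path="train.py",  # your fine-tuning script
    container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13.py310:latest",  # placeholder image
    requirements=["transformers", "datasets", "evaluate"],
)

job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    args=["--checkpoint_dir", "gs://your-bucket-name/checkpoints"],
)
```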

Named Entity Recognition

Named Entity Recognition is a crucial task in natural language processing, and BERT makes it surprisingly easy. The BertForTokenClassification class from the Hugging Face transformers library is specifically designed for fine-tuning BERT for this purpose.

This class takes the input text and generates logits for each token, indicating its class. The BERT model is highly effective at identifying named entities, such as people and locations.
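
For instance, running inference with an already fine-tuned token-classification checkpoint can be as simple as the sketch below (dslim/bert-base-NER is a community checkpoint used here purely as an example):

```python
from transformers import pipeline

# Any BertForTokenClassification checkpoint works the same way.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

print(ner("Satya Nadella is the CEO of Microsoft, headquartered in Redmond."))
# e.g. [{'entity_group': 'PER', 'word': 'Satya Nadella', ...},
#       {'entity_group': 'ORG', 'word': 'Microsoft', ...}, ...]
```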

Preparing the BERT Model and Dataset

Preparing the BERT model is a crucial step in fine-tuning it for your specific classification task.

The BERT model is initialized from the AutoModelForSequenceClassification class, which is tailored for sequence classification tasks.

For this dataset, the number of classes is 11, and passing the label-id mappings (id2label and label2id) to the model gives more readable outputs during inference.

The final BERT model has 109 million parameters.
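
A sketch of that initialization (the class names in the mapping are placeholders for the actual category names of the dataset):

```python
from transformers import AutoModelForSequenceClassification

# Placeholder label names; replace them with the real category names.
id2label = {i: f"class_{i}" for i in range(11)}
label2id = {name: i for i, name in id2label.items()}

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=11,
    id2label=id2label,
    label2id=label2id,
)

# Roughly 109 million parameters for bert-base-uncased plus the classification head.
print(sum(p.numel() for p in model.parameters()))
```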

To fine-tune BERT, your dataset's text data must be converted into a format the model understands, which is called tokenization.

Using the AutoTokenizer class, you can initialize a tokenizer specific to your chosen pre-trained model (bert-base-uncased).

This tokenizer knows the vocabulary of the pre-trained model and its special tokens.

DataCollatorWithPadding ensures that all sentences in a batch have the same length, which is crucial for feeding batches of sentences into a model.

The tokenizer knows the appropriate padding token and max length to use, which makes the preprocessing steps even simpler.

The datasets library makes dataset related tasks a lot simpler, and the load_dataset function can fetch the arxiv-classification dataset from its hub.

The split argument allows you to load specific portions of this dataset, such as the training, validation, and test samples.

The dataset contains around 28000 training samples, 2500 validation samples, and 2500 test samples.
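
A sketch of loading the splits (the exact Hub id is an assumption; ccdv/arxiv-classification is one publicly available arxiv-classification dataset):

```python
from datasets import load_dataset

train_ds = load_dataset("ccdv/arxiv-classification", split="train")
valid_ds = load_dataset("ccdv/arxiv-classification", split="validation")
test_ds = load_dataset("ccdv/arxiv-classification", split="test")

print(len(train_ds), len(valid_ds), len(test_ds))  # roughly 28000, 2500, 2500
```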

Tokenization

Tokenization is a crucial step in fine-tuning BERT, as it converts text data into a format the model understands. This process is done using the AutoTokenizer class, which is specific to the chosen pre-trained model.

The AutoTokenizer class is designed to tokenize a batch of textual data, and truncation is set to True to ensure that if a text exceeds the model's max input length, it's truncated to fit. This can take some time depending on the processing power and cores used in the CPU.

To understand the tokenization process, let's take a look at a tokenized sample. The tokenizer returns a dictionary with two key-value pairs: input_ids and attention_mask. The input_ids are numerical identifiers for each token in the text, while the attention_mask identifies which tokens are padding and which aren't.

Here's a breakdown of the output:

  • input_ids: The numerical identifiers for each token in the text.
  • attention_mask: The mask that identifies which tokens are padding and which aren't.

The attention_mask helps the model differentiate between real tokens and padding, with a value of 1 indicating a token to attend to and a value of 0 indicating padding.

Tokenizing the Dataset

Tokenizing the dataset is a crucial step in preparing your text data for use with a pre-trained model like BERT. This process involves converting your text data into a format that the model can understand.

You'll need to initialize a tokenizer specific to your chosen pre-trained model. For example, you can use the AutoTokenizer class to initialize a tokenizer for the bert-base-uncased model. This tokenizer knows the vocabulary of the pre-trained model and its special tokens.

To tokenize a batch of textual data, you can use the tokenizer's function, which is designed to truncate text that exceeds the model's max input length.

DataCollatorWithPadding is a useful tool that ensures all sentences in a batch have the same length. It automatically pads shorter sentences with a special padding token.

Here's a quick rundown of the key players involved in tokenization:

  • AutoTokenizer: converts raw text into input_ids and an attention_mask using the pre-trained model's vocabulary and special tokens.
  • DataCollatorWithPadding: pads every sentence in a batch to the same length so batches can be fed to the model efficiently.
The tokenizer function may take some time to run, depending on the processing power and cores used in your CPU. But don't worry, it's a necessary step to get your text data ready for use with BERT.
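
A minimal sketch of those steps, reusing the splits loaded earlier and assuming the raw text lives in a "text" column:

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(batch):
    # Truncate anything longer than the model's maximum input length.
    return tokenizer(batch["text"], truncation=True)

# Tokenize every split in batches; this is the step that can take a while on CPU.
tokenized_train = train_ds.map(tokenize_function, batched=True)
tokenized_valid = valid_ds.map(tokenize_function, batched=True)

# Pads each batch to the length of its longest sequence at training time.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```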

Tokenized Sample

Tokenizing a sample from our train dataset is a great way to understand the process. The tokenizer returns a dictionary with two key-value pairs: input_ids and attention_mask.

Input IDs are numerical identifiers for each token in the text. For example, the ID 101 corresponds to the special [CLS] token, which BERT uses to indicate the beginning of a sequence.

The attention mask identifies which tokens are padding and which aren’t. This helps the model differentiate between real tokens and padding.

Here's what the attention mask looks like:

  • A value of 1 indicates a token that the model should attend to.
  • A value of 0 indicates a padding token that the model shouldn't attend to.
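
A small illustration (the sentence and the exact ids are only examples):

```python
sample = tokenizer("This paper studies graph neural networks.")
print(sample["input_ids"])       # e.g. [101, ..., 102] -- 101 is [CLS], 102 is [SEP]
print(sample["attention_mask"])  # all 1s here; 0s only appear once padding is added
```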

Training

Training a fine-tuned BERT model involves defining important hyperparameters to achieve optimal results.

The learning rate for the optimizer is a crucial hyperparameter, and a smaller learning rate implies slower convergence but potentially better generalization.

Batch size during training and evaluation is also important, with the batch size determining how many samples are processed at once.

The total number of times the training set will be iterated over is another key hyperparameter, known as the number of train epochs.

Regularization techniques are used to prevent overfitting, and weight decay is one such technique that adds a penalty to the magnitude of the model parameters.

Evaluation will be performed at the end of each epoch, and the model will be saved at the end of each epoch as well.

The model with the best evaluation metric will be reloaded after all training epochs, and only the last 3 model checkpoints will be kept to help manage storage.

Here are the important training hyperparameters:

  • learning_rate: The learning rate for the optimizer.
  • per_device_train_batch_size & per_device_eval_batch_size: Batch size during training and evaluation.
  • num_train_epochs: The total number of times the training set will be iterated over.
  • weight_decay: Regularization technique to prevent overfitting.

FP16 precision will be used for training, and everything will be logged to TensorBoard.
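
A sketch of how these settings come together with the Trainer API (the output directory and weight-decay value are assumptions; compute_metrics is defined in the next section):

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="bert-text-classifier",   # where checkpoints are written
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    weight_decay=0.01,                   # assumed value for the regularization penalty
    evaluation_strategy="epoch",         # evaluate at the end of each epoch
    save_strategy="epoch",               # save a checkpoint at the end of each epoch
    load_best_model_at_end=True,         # reload the best checkpoint after training
    save_total_limit=3,                  # keep only the last 3 checkpoints
    fp16=True,                           # train in FP16 precision
    report_to="tensorboard",             # log everything to TensorBoard
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
```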

Evaluate Accuracy

Evaluating the accuracy of your fine-tuned BERT model is crucial to understanding its performance. We'll use the validation dataset to evaluate the accuracy metric.

To evaluate the accuracy, you'll need to define a function using the evaluate library. This will give you a clear picture of how well your model is performing.
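
A minimal sketch of such a function with the evaluate library:

```python
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair produced by the Trainer.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)
```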

The best model should be used for evaluation on the test set. In one case, a model achieved almost 88% accuracy on the test set, which is a great starting point.

Fine-tuning BERT can lead to impressive performance improvements. For instance, after fine-tuning, BERT was able to correctly classify 91.5% of the reviews in the unseen validation dataset.

Increasing the number of fine-tuning examples from 1,000 to 2,000 doubled the training time required, but only resulted in a minuscule improvement in model accuracy. This highlights the importance of finding the right balance between training time and accuracy.

BERT and other models can learn enough about sentiment analysis to reach impressive levels of accuracy (80%–90%+) after just 500 examples for fine-tuning.

Inference on Unseen Data

We can fine-tune BERT to make predictions on unseen data, which is a crucial step in any machine learning project.

To test the model's performance, we pass ten unseen abstracts from Arxiv through the model and evaluate the results.

The model gets 9 of the 10 predictions right, making just one mistake, which is impressive after only 5 epochs of training.

This shows that the model has learned to generalize to new, unseen data, which is a key sign of a well-trained model.

Fine-tuning BERT allows us to adapt the model to new tasks and domains with ease, making it a powerful tool in the field of natural language processing.
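
A sketch of running such predictions with the pipeline API (the checkpoint path and the abstract text are placeholders):

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="bert-text-classifier",  # path to the fine-tuned checkpoint saved above
    tokenizer=tokenizer,
)

abstract = "We propose a new attention mechanism for long-document classification ..."
print(classifier(abstract, truncation=True))
# e.g. [{'label': 'class_3', 'score': 0.97}]
```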
