Long Text Summarization with HuggingFace Transformers


Long text summarization with HuggingFace and Transformers is a game-changer for anyone working with large amounts of text data.

The HuggingFace Transformers library provides a wide range of pre-trained models that can be fine-tuned for long text summarization tasks. These models are based on the Transformer architecture, which has been shown to be highly effective for natural language processing tasks.

By using these pre-trained models, you can save a significant amount of time and effort compared to training a model from scratch. Fine-tuning a pre-trained model can take anywhere from a few minutes to a few hours, depending on the model size and the amount of data available.

The HuggingFace library also provides a simple and intuitive interface for working with these models, making it easy to get started with long text summarization.

Hugging Face and Transformers

Hugging Face is renowned for its transformers library, which provides easy access to pre-trained models for various NLP tasks, including text summarization.

The T5 (Text-to-Text Transfer Transformer) model is a popular choice for this task, treating every NLP task as a text generation problem.

This makes the T5 model highly versatile and effective for tasks like text summarization.

By using pre-trained models like T5, developers can save time and effort in building and training their own models from scratch.

With Hugging Face's transformers library, users can easily access and fine-tune these pre-trained models for their specific needs.

The library's user-friendly interface and extensive documentation make it a go-to resource for NLP developers and researchers alike.
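As a quick, minimal sketch of how little code this takes (the checkpoint choice, input text, and length limits here are purely illustrative):

```python
from transformers import pipeline

# Load a summarization pipeline; t5-small is a lightweight example checkpoint.
summarizer = pipeline("summarization", model="t5-small")

article = (
    "Hugging Face's transformers library provides pre-trained models for many "
    "NLP tasks, including text summarization, and exposes them through a simple "
    "pipeline API that handles tokenization and generation for you."
)

# max_length/min_length bound the summary length in tokens.
print(summarizer(article, max_length=50, min_length=10)[0]["summary_text"])
```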


Text Summarization Process

Text summarization is a powerful tool that helps us quickly grasp the essence of long documents or articles. It's a sequence-to-sequence task that can be formulated as either extractive or abstractive.

The T5 model from Hugging Face is a popular choice for text summarization. We can implement text summarization using this model with the help of Gradio, which creates a simple user interface for summarizing text.

To get started, we need to finetune the T5 model on a specific dataset, such as the California state bill subset of the BillSum dataset. This will help our model learn to generate accurate and informative summaries.

Here are the two main types of summarization tasks:

  • Extractive: extract the most relevant information from a document.
  • Abstractive: generate new text that captures the most relevant information.

To achieve abstractive summarization, we can follow these steps:

  1. Finetune T5 on the California state bill subset of the BillSum dataset.
  2. Use the finetuned model for inference.

Model Preparation

Model Preparation is a crucial step in long text summarization using Hugging Face. This involves tokenizing the input text into a format that can be understood by the model.

The input text is preprocessed by splitting it into subword units using one of the Hugging Face tokenizers, such as a BPE (Byte-Pair Encoding) tokenizer.

The maximum input length is also capped, often at 512 tokens, a common limit for Hugging Face models. Truncating to a fixed length ensures the model can handle long texts without running out of memory.
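As a rough sketch of this step (the checkpoint and the 512-token cap are illustrative choices):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

long_text = "A very long bill text ..."  # placeholder for a long input document

# Split the text into subword ids, truncating anything beyond 512 tokens.
encoded = tokenizer(long_text, max_length=512, truncation=True)
print(len(encoded["input_ids"]))  # never more than 512
```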

Load Model and Tokenizer

Loading the T5 model and tokenizer is a crucial step in preparing your model for tasks like summarization. The T5 model and tokenizer are loaded from Hugging Face's model repository.

You have the option to use different variants of the T5 model, such as t5-small, t5-base, or t5-large, depending on your requirements and available resources. The t5-small variant is used here, but you can experiment with other variants to see what works best for you.

The T5 tokenizer is responsible for tokenizing the input text to a format that the T5 model can understand. The T5ForConditionalGeneration model architecture is specifically designed for tasks like summarization.

Here's a quick rundown of the key components:

  • T5Tokenizer: Tokenizes the input text to a format that the T5 model can understand.
  • T5ForConditionalGeneration: The T5 model architecture specialized for tasks like summarization.
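A minimal sketch of loading both components (swap t5-small for t5-base or t5-large as your resources allow):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# t5-small keeps memory requirements low; larger variants trade speed for quality.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
```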

Load BillSum Dataset

To prepare your model for the task at hand, you'll need to load the BillSum dataset. You can load the smaller California state bill subset from the 🤗 Datasets library.

The dataset can be split into a train and test set using the train_test_split method. This method will help you prepare your data for training and testing.
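Here's what that looks like with the 🤗 Datasets library (the 80/20 split is an illustrative choice):

```python
from datasets import load_dataset

# Load the smaller California state bill subset of BillSum.
billsum = load_dataset("billsum", split="ca_test")

# Split into train and test sets.
billsum = billsum.train_test_split(test_size=0.2)
print(billsum["train"][0].keys())  # includes 'text' and 'summary'
```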

The text of the bill will serve as the input to the model, while the summary will be the target output. This is a crucial distinction to keep in mind as you work with the data.

Here are the key components of the dataset:

  • text: the text of the bill
  • summary: a condensed version of the text

With the dataset loaded and split, you're ready to move on to loading the T5 model, for example with AutoModelForSeq2SeqLM (or TFAutoModelForSeq2SeqLM in TensorFlow).

Preprocess

To preprocess your data for summarization, you'll want to load a T5 tokenizer to process text and summary. This tokenizer will help you work with the model's input and output.

The next step is to prefix the input with a prompt so T5 knows this is a summarization task. This is especially important for models that can handle multiple NLP tasks, as they require prompting for specific tasks.

To process the dataset efficiently, use the 🤗 Datasets map method, which will apply the preprocessing function over the entire dataset. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once.

Here's a step-by-step guide to preprocessing your data:

  • Prefix the input with a prompt so T5 knows this is a summarization task.
  • Use the keyword text_target argument when tokenizing labels.
  • Truncate sequences to be no longer than the maximum length set by the max_length parameter.

By following these steps, you'll be able to prepare your data for the summarization task and get the best results from your model.
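Putting those steps together, a sketch of the preprocessing function might look like this (the checkpoint and the max_length values for inputs and labels are illustrative, and billsum is assumed to be the split dataset from earlier):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
prefix = "summarize: "

def preprocess_function(examples):
    # Prefix each document so T5 knows this is a summarization task.
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenize the summaries as labels via the text_target keyword.
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# batched=True processes multiple elements at once, speeding up map().
tokenized_billsum = billsum.map(preprocess_function, batched=True)
```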

Model Training and Evaluation

To train and evaluate a model for long text summarization using Hugging Face, you'll need to load an evaluation metric, such as the ROUGE metric, which can be quickly loaded with the 🤗 Evaluate library.

The ROUGE metric is an n-gram-based metric that evaluates n-gram overlaps between the system output and the reference summary, with R-1 (unigram), R-2 (bigram), and R-L (longest common subsequence) being the most commonly reported variants.

To compute the ROUGE metric, you'll create a function, called `compute_metrics`, that passes your predictions and labels to the `compute` method. This function plays a crucial role in evaluating your model's performance during training.

Here are the common loss functions used for optimization:

  • Cross-entropy
  • Mean squared error
  • Mean absolute error
  • KL divergence

The most common loss function for summarization is cross-entropy, but adding a summarization metric like ROUGE can provide more insights into your model's behavior.

Train

To train a model, you'll need to define your training hyperparameters in Seq2SeqTrainingArguments. This includes specifying where to save your model with the output_dir parameter.

The only required parameter is output_dir, which you'll also use to push your model to the Hub by setting push_to_hub=True. You'll need to be signed in to Hugging Face to upload your model.

There are three steps to train your model:

  1. Define your training hyperparameters in Seq2SeqTrainingArguments.
  2. Pass the training arguments to Seq2SeqTrainer along with the model, dataset, tokenizer, data collator, and compute_metrics function.
  3. Call train() to finetune your model.

Once training is completed, you can share your model on the Hub with the push_to_hub() method. This will allow everyone to use your model.
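A PyTorch sketch of those three steps, plus the final push to the Hub, building on the tokenized_billsum dataset from the preprocessing step and the compute_metrics function defined in the Evaluate section below (all hyperparameter values and the output directory name are illustrative):

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Dynamically pad inputs and labels to the longest sequence in each batch.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

training_args = Seq2SeqTrainingArguments(
    output_dir="my_billsum_model",   # also used as the repo name on the Hub
    eval_strategy="epoch",           # called evaluation_strategy in older releases
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    num_train_epochs=4,
    predict_with_generate=True,      # needed so ROUGE sees generated summaries
    push_to_hub=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.push_to_hub()
```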

Evaluate

Including a metric during training is often helpful for evaluating your model's performance, and you can quickly load an evaluation method with the 🤗 Evaluate library.

As described above, ROUGE is the standard evaluation metric for summarization: it measures n-gram overlap between the system output and the reference summary, with R-1, R-2, and R-L being the most commonly reported variants.

To compute the ROUGE metric, create a function that passes your predictions and labels to the compute method. This is the compute_metrics function referenced in the training setup above.
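A sketch of that function, assuming the same t5-small tokenizer used elsewhere in this walkthrough:

```python
import numpy as np
import evaluate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Labels use -100 for padding positions; restore the pad token before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )

    # Track the average generated length as an extra diagnostic.
    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions
    ]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
```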

For a more in-depth example of how to fine-tune a model for summarization, take a look at the corresponding PyTorch or TensorFlow notebook.

Model Inference and Deployment

Model inference is a crucial step in deploying a long text summarization model. After finetuning a model, you can use it for inference by prefixing your input text depending on the task, such as summarization.

To try out your finetuned model for inference, you can use the pipeline() method, which instantiates a pipeline for summarization with your model and passes your text to it. Alternatively, you can manually replicate the results of the pipeline using the generate() method.

To create a web interface for the summarization function, you can use Gradio, specifying the function to be called, the input type, and the output type. The key pieces are:

  1. gr.Interface(): Creates a Gradio interface.
  2. fn=summarize: Specifies the function to be called (the summarization function).
  3. inputs="text": Specifies that the input type is text.
  4. outputs="text": Specifies that the output type is text.
  5. iface.launch(): Launches the Gradio interface.

Inference

Now that you've finetuned a model, you can use it for inference.

To try out your finetuned model for inference, you can use it in a pipeline().

You can instantiate a pipeline for summarization with your model, and pass your text to it. This is the simplest way to get started with inference.
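For example, assuming the finetuned checkpoint was saved (or pushed) as a hypothetical my_billsum_model:

```python
from transformers import pipeline

# Prefix the input so T5 treats it as a summarization task.
text = "summarize: The people of the State of California do enact as follows ..."  # placeholder

summarizer = pipeline("summarization", model="my_billsum_model")
print(summarizer(text)[0]["summary_text"])
```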

To manually replicate the results of the pipeline, first tokenize the text and return the input_ids as tensors.

Then use the model's generate() method to produce the summary token ids.

Finally, decode the generated token ids back into text with the tokenizer.
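A PyTorch sketch of those three steps, again assuming a hypothetical my_billsum_model checkpoint and illustrative generation parameters:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "my_billsum_model"  # hypothetical finetuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

text = "summarize: The people of the State of California do enact as follows ..."  # placeholder

# 1. Tokenize the text and return the input_ids as PyTorch tensors.
inputs = tokenizer(text, return_tensors="pt").input_ids

# 2. Use generate() to create the summary token ids.
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

# 3. Decode the generated token ids back into text.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```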

For more details about the different text generation strategies and parameters for controlling generation, check out the Text Generation API.

Gradio Interface

Creating a Gradio interface is a straightforward process. You can use the `gr.Interface()` function to create a web interface for your model's summarization function.

Here's what you need to specify:

  • the function to be called, which is `summarize` in this case
  • the input type, which should be text
  • the output type, which should also be text

The `gr.Interface()` function takes these specifications as arguments, allowing you to create a custom interface for your model.
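Putting those pieces together, a minimal sketch (the checkpoint name and length limits are illustrative):

```python
import gradio as gr
from transformers import pipeline

summarizer = pipeline("summarization", model="my_billsum_model")  # hypothetical checkpoint

def summarize(text):
    # Return just the summary string from the pipeline output.
    return summarizer(text, max_length=128, min_length=30)[0]["summary_text"]

# gr.Interface() wires the summarize function to a text input and a text output.
iface = gr.Interface(fn=summarize, inputs="text", outputs="text")
iface.launch()
```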

Model Architecture and Implementation


Model Architectures

The 🤗 Transformers library provides a wide range of model architectures that are seamlessly integrated from the huggingface.co model hub, where checkpoints are uploaded directly by users and organizations.

To check whether each model has an implementation in Flax, PyTorch, or TensorFlow, or has an associated tokenizer backed by the 🤗 Tokenizers library, refer to the framework-support table in the Transformers documentation.

These implementations have been tested on several datasets and should match the performance of the original implementations.

Machine Learning for Jax, PyTorch and TensorFlow

🤗 Transformers provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio.

These models can be applied on text for tasks like text classification, information extraction, question answering, summarization, translation, and text generation, in over 100 languages.

Transformer models can also perform tasks on images for tasks like image classification, object detection, and segmentation.

They can also be applied to audio, for tasks like speech recognition and audio classification.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets, and then share them with the community on the Hugging Face model hub.

Putting It All Together

The code in the GitHub repository at https://github.com/MehwishFatimah/t5_finetune/blob/main/run_summarization_no_trainer.py is a great way to see the full fine-tuning process for pretrained abstractive summarization models with the Hugging Face library, laid out as a single well-organized script.

We've learned to train a pretrained model for a given dataset, covering the training setup, optimizer and learning rate scheduler configuration, evaluation criteria, and setting up logs and checkpoints.
