Multi Head Attention for Richer Interpretations in NLP

Keith Marchal

Posted Nov 1, 2024


Credit: pexels.com, an artist's illustration of artificial intelligence (AI) by Tim West, visualising the streams of data that large language models produce.

Multi-head attention is a game-changer in NLP, allowing models to capture complex relationships between input elements.

It does this by splitting the input into multiple attention heads, each of which can focus on different aspects of the input.

This approach enables models to attend to multiple positions in the input simultaneously, making them more robust and accurate.

Each attention head can be thought of as a mini-attention mechanism, working in parallel with the others to extract different types of information.

Transformer Architecture

The Transformer architecture is a popular neural network design that's widely used in natural language processing tasks. It's particularly well-suited for machine translation, but it can also be applied to other areas like text classification and sentiment analysis.

One of the key features of the Transformer architecture is its use of self-attention, which allows the model to weigh the importance of different words in a sentence. This is achieved through the use of three parameters: Query, Key, and Value, which are all similar in structure and represent each word in the sequence as a vector.
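
For reference, this weighting is usually written as the scaled dot-product attention formula from the original Transformer paper, where \(d_k\) is the dimension of the Key vectors:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V \]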

Credit: youtube.com, Illustrated Guide to Transformers Neural Network: A step by step explanation

The Transformer architecture has three main attention components: Encoder Self-Attention, Decoder Self-Attention, and Encoder-Decoder Attention. Each of these components uses attention to weigh the importance of different words in a sequence. The Encoder Self-Attention computes the interaction between each input word and the other input words, while the Decoder Self-Attention computes the interaction between each target word and the other target words.

The Transformer architecture also employs multi-head attention, which allows the model to capture multiple relationships and nuances for each word. This is achieved by splitting the Query, Key, and Value parameters N-ways and passing each split independently through a separate attention head. The outputs of the heads are then combined to produce the final attention result.
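
In equation form, this split-and-combine step is the multi-head attention definition from the original paper: each head applies its own learned projections \(W_i^Q\), \(W_i^K\), and \(W_i^V\), and the concatenated head outputs are mixed by an output projection \(W^O\):

\[ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V) \]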

Here's a breakdown of the Transformer architecture's attention components:

  • Encoder Self-Attention: relates each input word to the other input words
  • Decoder Self-Attention: relates each target word to the other target words
  • Encoder-Decoder Attention: relates each target word to the input words

The Transformer architecture is a powerful tool for natural language processing tasks, and its use of self-attention and multi-head attention makes it particularly well-suited for tasks like machine translation.

Transformer Architecture

Credit: youtube.com, Transformers, explained: Understand the model behind GPT, BERT, and T5

The Transformer architecture is a popular neural network model that's widely used for natural language processing tasks. It's essentially a stack of self-attention and feed-forward neural networks.

The Transformer architecture is based on the Transformer paper, "Attention Is All You Need", published in 2017 by Vaswani et al. (with Jakob Uszkoreit among the co-authors). The paper introduced a novel neural network architecture for language understanding that's capable of handling long-range dependencies in sequential data.

In the Transformer architecture, self-attention is used in three places: self-attention in the encoder, self-attention in the decoder, and encoder-decoder attention in the decoder. This allows the model to capture relationships between words within the input sequence, as well as between the input sequence and the target sequence.

The Transformer architecture uses multi-head attention, which splits the query, key, and value parameters into multiple attention heads. This allows the model to capture multiple relationships and nuances for each word. The outputs of the attention heads are then combined to produce the final attention result.

Credit: youtube.com, What are Transformers (Machine Learning Model)?

The Transformer architecture also uses a technique called masked self-attention, which prevents the decoder from "cheating" by looking at future tokens. This is done by masking out the future tokens in the sequence, forcing the decoder to rely on the information from past tokens and the current token itself when predicting the next word in the output sequence.

Here's a summary of the Transformer architecture's components:

  • Encoder: consists of self-attention and feed-forward neural networks
  • Decoder: consists of self-attention, feed-forward neural networks, and encoder-decoder attention
  • Multi-head attention: splits query, key, and value parameters into multiple attention heads
  • Masked self-attention: prevents decoder from looking at future tokens

The Transformer architecture has been widely adopted for various natural language processing tasks, including machine translation, text classification, and question answering. Its ability to capture long-range dependencies and multiple relationships makes it a powerful tool for many applications.
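
To make this summary concrete, here is a minimal sketch using PyTorch's built-in nn.Transformer module (assuming a reasonably recent PyTorch install; the dimensions are arbitrary example values, and the random inputs stand in for already-embedded sequences). It wires together the encoder, the decoder, and a causal mask for masked self-attention:

    import torch
    import torch.nn as nn

    # Example dimensions (assumed values for illustration).
    d_model, nhead, seq_len, batch = 512, 8, 10, 2

    model = nn.Transformer(d_model=d_model, nhead=nhead, batch_first=True)

    # In a real model these would come from the embedding + position encoding layers.
    src = torch.randn(batch, seq_len, d_model)  # encoder input
    tgt = torch.randn(batch, seq_len, d_model)  # decoder input

    # Causal mask: prevents the decoder from attending to future tokens.
    tgt_mask = model.generate_square_subsequent_mask(seq_len)

    out = model(src, tgt, tgt_mask=tgt_mask)
    print(out.shape)  # torch.Size([2, 10, 512])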

Input Layers

The Input Layers of the Transformer Architecture are a crucial part of the model, and they produce a matrix of shape (Number of Samples, Sequence Length, Embedding Size) which is fed to the Query, Key, and Value of the first Encoder in the stack.

The Input Embedding layer is responsible for converting the input words into numerical representations that the model can understand. This is done by learning an embedding for every possible input word, which is then used to generate a matrix of shape (Number of Samples, Sequence Length, Embedding Size).
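
A minimal sketch of this step in PyTorch (the vocabulary and dimension values below are assumptions for illustration) shows how integer token IDs become the (Number of Samples, Sequence Length, Embedding Size) matrix described above:

    import torch
    import torch.nn as nn

    vocab_size, embedding_size = 10_000, 512   # assumed example values
    batch, seq_len = 2, 10

    embedding = nn.Embedding(vocab_size, embedding_size)
    token_ids = torch.randint(0, vocab_size, (batch, seq_len))  # integer word IDs

    x = embedding(token_ids)
    print(x.shape)  # torch.Size([2, 10, 512]) -> (Samples, Sequence, Embedding)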

Credit: youtube.com, Transformers (how LLMs work) explained visually | DL5

The Position Encoding layer, on the other hand, adds position information to the input features, which is essential for tasks like language understanding where the position of words is important. This is achieved by using sine and cosine functions of different frequencies, which are added to the original input features.

The specific pattern chosen by Vaswani et al. is a set of sine and cosine functions of different frequencies, with wavelengths forming a geometric progression from 2π to 10000⋅2π. Because the encoding of a shifted position is a linear function of the encoding of the original position, this makes it easy for the model to attend to relative positions; without this added information, self-attention is permutation-equivariant and cannot distinguish word order.
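
The sine/cosine pattern can be sketched in a few lines of PyTorch, following the formulas in Vaswani et al.; max_len and d_model are assumed example values:

    import math
    import torch

    def sinusoidal_position_encoding(max_len: int, d_model: int) -> torch.Tensor:
        """Return a (max_len, d_model) matrix of position encodings."""
        position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float32)
            * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sine
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cosine
        return pe

    # Added to the embedded input features: x = x + pe[:seq_len]
    pe = sinusoidal_position_encoding(max_len=50, d_model=512)
    print(pe.shape)  # torch.Size([50, 512])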

How It Works

The Encoder-Decoder Attention takes its input from two sources, computing the interaction between each target word with each input word. This is different from the Encoder Self-Attention, which only computes interactions between input words.

In the Encoder-Decoder Attention, each cell of the resulting Attention Score relates one Q (target sequence) word to one K (input sequence) word, and the corresponding V (input sequence) words are weighted accordingly. This gives the model a more nuanced view of how the input and target sequences relate.
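
Here is a hedged sketch of this step using PyTorch's nn.MultiheadAttention module: the Query comes from the decoder (target) states, while the Key and Value both come from the encoder (input) states. The dimensions are assumed example values:

    import torch
    import torch.nn as nn

    embed_dim, num_heads = 512, 8           # assumed example values
    batch, src_len, tgt_len = 2, 12, 9

    cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    decoder_states = torch.randn(batch, tgt_len, embed_dim)  # Q source (target sequence)
    encoder_states = torch.randn(batch, src_len, embed_dim)  # K and V source (input sequence)

    out, attn_weights = cross_attn(
        query=decoder_states, key=encoder_states, value=encoder_states
    )
    print(out.shape)           # torch.Size([2, 9, 512])
    print(attn_weights.shape)  # torch.Size([2, 9, 12]) -> one score per (target, input) pair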

Captures Richer Interpretations

Credit: youtube.com, Vectoring Words (Word Embeddings) - Computerphile

The Embedding vector captures the meaning of a word, but it's not a one-size-fits-all solution.

In the case of Multi-head Attention, the Embedding vectors get logically split across multiple heads, allowing separate sections to learn different aspects of the meanings of each word.

This allows the Transformer to capture richer interpretations of the sequence, which is crucial during translation when the verb to be used depends on factors like gender and cardinality.

For instance, one section might capture the 'gender-ness' of a noun, while another might capture the 'cardinality' of a noun, making it possible to accurately translate sentences that rely on these factors.

This approach enables the Transformer to understand the nuances of language and produce more accurate translations.

Masked Self-Attention

Masked self-attention is a crucial component of the Transformer model, preventing the decoder from "cheating" by looking at future tokens.

This is similar to studying for an exam, where looking at the answers in advance would undermine the learning process. The decoder would not learn to generate the correct translation if it could refer to future tokens.

Credit: youtube.com, Transformers - Part 7 - Decoder (2): masked self-attention

The Transformer model uses a mask for scaled dot-product attention, which hides future tokens in the sequence. As a result, the decoder relies on past tokens and the current token itself when predicting the next word in the output sequence.

The attention scores for future tokens are masked out (set to negative infinity before the softmax), so their attention weights come out as zero, indicating that the decoder is not allowed to refer to them. This lets the decoder be trained in parallel while still behaving as if it generates the output sequence one token at a time, similar to RNN-based models.

Masking out future tokens also prevents the decoder from relying on information that hasn't been generated yet. This is essential for machine translation, where the decoder needs to generate the correct translation based on the input sequence.
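
A small sketch of the masking step (assuming PyTorch, with seq_len as an example value): scores at future positions are set to negative infinity before the softmax, so their attention weights come out as exactly zero:

    import torch

    seq_len = 5  # example value
    scores = torch.randn(seq_len, seq_len)  # raw (scaled) dot-product scores

    # Upper-triangular True entries mark the future positions for each query token.
    future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

    masked_scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(masked_scores, dim=-1)

    print(weights[0])  # only the first token is visible to position 0
    print(weights[2])  # positions 3 and 4 get exactly zero weight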

Implementation and Details

In multi-head attention, each of the query, key, and value tensors is passed through a linear layer and then split into multiple heads.

To avoid significant growth of computational cost and parameterization cost, we set \(p_q = p_k = p_v = p_o / h\). This allows us to compute each head in parallel by setting the number of outputs of linear transformations for the query, key, and value to \(p_q h = p_k h = p_v h = p_o\).

The number of outputs of linear transformations for the query, key, and value is determined by the argument numHiddens.

Implementation

Credit: youtube.com, Implementation Strategies

Implementation is where most of the complexity of attention mechanisms lies. As noted above, each of the query, key, and value tensors is passed through a linear layer and then split into multiple heads.

The query and key tensors are then combined in a dot product calculation, which is a fundamental operation in attention mechanisms. The result is scaled, optionally masked, and fed into a softmax to produce attention weights, which are finally applied to the value tensor.

In some implementations, the number of outputs of linear transformations for the query, key, and value is set to \(p_q h = p_k h = p_v h = p_o\), which allows for parallel computation of multiple heads. This is done to avoid significant growth of computational cost and parameterization cost.

To implement multi-head attention, a class implementation is used, which includes a function for scaled dot-product attention. This function takes in the query, key, and value tensors and returns the attention scores.
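
Below is a hedged sketch of such a class in PyTorch. It loosely follows the d2l.ai-style implementation the text alludes to, but the names (scaled_dot_product_attention, MultiHeadAttention, num_hiddens, split_heads) are illustrative assumptions rather than a canonical API. The reshaping moves the Head dimension next to the Batch dimension so that every head is computed in one batched matrix multiplication:

    import math
    import torch
    import torch.nn as nn

    def scaled_dot_product_attention(q, k, v, mask=None):
        """Compute softmax(q k^T / sqrt(d)) v, with an optional boolean mask of hidden positions."""
        d = q.shape[-1]
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d)
        if mask is not None:
            scores = scores.masked_fill(mask, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        return torch.matmul(weights, v)

    class MultiHeadAttention(nn.Module):
        def __init__(self, embed_dim, num_hiddens, num_heads):
            super().__init__()
            assert num_hiddens % num_heads == 0
            self.num_heads = num_heads
            # One linear layer each for Q, K, V; their num_hiddens outputs are
            # later split evenly across the heads (p_q = p_k = p_v = p_o / h).
            self.W_q = nn.Linear(embed_dim, num_hiddens)
            self.W_k = nn.Linear(embed_dim, num_hiddens)
            self.W_v = nn.Linear(embed_dim, num_hiddens)
            self.W_o = nn.Linear(num_hiddens, num_hiddens)

        def split_heads(self, x):
            # (Batch, Sequence, num_hiddens) -> (Batch, Head, Sequence, num_hiddens / Head)
            batch, seq_len, _ = x.shape
            return x.view(batch, seq_len, self.num_heads, -1).transpose(1, 2)

        def forward(self, query, key, value, mask=None):
            q = self.split_heads(self.W_q(query))
            k = self.split_heads(self.W_k(key))
            v = self.split_heads(self.W_v(value))
            out = scaled_dot_product_attention(q, k, v, mask)  # per-head outputs
            # (Batch, Head, Sequence, head_dim) -> (Batch, Sequence, num_hiddens)
            batch, _, seq_len, _ = out.shape
            out = out.transpose(1, 2).reshape(batch, seq_len, -1)
            return self.W_o(out)                               # combine the heads

    # Example usage with assumed dimensions.
    mha = MultiHeadAttention(embed_dim=512, num_hiddens=512, num_heads=8)
    x = torch.randn(2, 10, 512)
    print(mha(x, x, x).shape)  # torch.Size([2, 10, 512])

Note that \(p_q = p_k = p_v = p_o / h\) appears here implicitly: each num_hiddens-wide Linear layer is split evenly across the heads, which is what keeps the computational and parameter cost comparable to single-head attention.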

Linear Layers

Credit: youtube.com, Implementation of Linear Layers (FSE 2024)

In a multi-head attention implementation, there are three separate Linear layers for the Query, Key, and Value. Each Linear layer has its own weights, which are not shared with the other layers.

The input is passed through these Linear layers to produce the Q, K, and V matrices, which are then used for scaled dot-product attention.

The Linear layers are crucial in the attention mechanism: they project the input embeddings into the Query, Key, and Value spaces where attention is computed.

The weights of the Linear layers are logically partitioned per head, meaning that each head effectively has its own set of weights.

Credit: youtube.com, L4.5 A Fully Connected (Linear) Layer in PyTorch

In our example, the Query Size is calculated as Embedding Size / Number of heads, which is 6/2 = 3.

The Linear layer weights are partitioned uniformly across the Attention heads, allowing for efficient computation.

The Q, K, and V matrices output by the Linear layers are reshaped to include an explicit Head dimension, making it easier to visualize the attention mechanism.

Each 'slice' along the Head dimension corresponds to the Q, K, or V matrix for a single head.

The dimensions of Q are now (Batch, Head, Sequence, Query size) as a result of this reshaping.

For each head, Q is multiplied with the transpose of K, scaled, and passed through a Softmax to produce attention scores. A final matrix multiplication between the output of the Softmax and the V matrix then produces each head's attention output.
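
Using the numbers from this example (Embedding Size 6, 2 heads, Query Size 3), a quick PyTorch sketch makes the reshaping and the two matrix multiplications explicit; the batch and sequence lengths are assumed values:

    import torch

    batch, seq_len, embed_size, heads = 1, 4, 6, 2
    query_size = embed_size // heads  # 6 / 2 = 3

    # Outputs of the Q, K, V Linear layers (random stand-ins here).
    Q = torch.randn(batch, seq_len, embed_size)
    K = torch.randn(batch, seq_len, embed_size)
    V = torch.randn(batch, seq_len, embed_size)

    # Reshape to add an explicit Head dimension: (Batch, Head, Sequence, Query size).
    Q = Q.view(batch, seq_len, heads, query_size).transpose(1, 2)
    K = K.view(batch, seq_len, heads, query_size).transpose(1, 2)
    V = V.view(batch, seq_len, heads, query_size).transpose(1, 2)
    print(Q.shape)  # torch.Size([1, 2, 4, 3])

    # Q · K^T, scaled and passed through a Softmax, gives per-head attention scores.
    scores = torch.softmax(Q @ K.transpose(-2, -1) / query_size ** 0.5, dim=-1)

    # The second matrix multiplication with V produces the per-head outputs.
    out = scores @ V
    print(out.shape)  # torch.Size([1, 2, 4, 3])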

Frequently Asked Questions

What is the difference between self-attention and multihead attention?

Self-attention focuses on how each word relates to every other word, while multi-head attention breaks this down into multiple aspects, such as grammar and sentiment, to gain a deeper understanding. This allows models to capture more nuanced information and relationships within text.

What is masked multi-head attention?

Masked multi-head attention is multi-head attention with a causal mask applied in the decoder, so that no position can attend to the tokens that come after it. This keeps training consistent with left-to-right generation and preserves the order and coherence of the generated output.

What is multi-head self-attention?

Multi-head self-attention is a mechanism that allows a model to read a sentence multiple times, focusing on different aspects each time, such as grammar, relationships, or sentiment. This combination of focused readings enables a richer understanding of the sentence.

Keith Marchal

Senior Writer

Keith Marchal is a passionate writer who has been sharing his thoughts and experiences on his personal blog for more than a decade. He is known for his engaging storytelling style and insightful commentary on a wide range of topics, including travel, food, technology, and culture. With a keen eye for detail and a deep appreciation for the power of words, Keith's writing has captivated readers all around the world.
