Understanding Masked Multi-Head Attention in Transformer Models

Masked multi-head attention is a key concept in natural language processing. It's a technique used in transformer models to help them capture the relationships between different parts of a sentence or text.

The idea behind masked multi-head attention is to let the model focus on specific parts of the input while hiding others, which is achieved by masking out certain input elements, such as padding tokens or, in a decoder, future tokens the model shouldn't see yet. This prevents the model from relying on information it shouldn't have access to and encourages it to focus on the relationships between the parts of the input it is allowed to attend to.

By using masked multi-head attention, transformer models can learn to attend to the most relevant parts of the input and to ignore the less relevant parts. This helps the model better understand the context and relationships between different words and phrases in a sentence or text.

What Is BertViz?

BertViz is an open-source tool that visualizes the attention mechanism of transformer models at multiple scales.

It has been around since 2019, making it a veteran in the field of explainability tools.

The tool isn't limited to BERT models, despite its name; it supports many transformer language models, including GPT-2 and T5, as well as most HuggingFace models.

BertViz is a crucial tool in the AI space, especially given the growing importance of transformer architectures.
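As a rough, hedged sketch of how BertViz is typically wired up with a HuggingFace model (following the usage pattern in the bertviz README; the model name and sentence are just examples, and exact arguments may differ between versions):

```python
# Minimal sketch: load a HuggingFace model with attention outputs for BertViz.
# Assumes the `bertviz` and `transformers` packages are installed.
from transformers import AutoTokenizer, AutoModel
from bertviz import head_view, model_view

model_name = "bert-base-uncased"   # any HuggingFace model that returns attentions
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

sentence = "The dog had too many plans to go to the park"
inputs = tokenizer.encode(sentence, return_tensors="pt")
outputs = model(inputs)

attention = outputs[-1]                               # tuple: one attention tensor per layer
tokens = tokenizer.convert_ids_to_tokens(inputs[0])   # token strings for the labels
```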

Head View

The Head View is a powerful tool for understanding how attention flows between tokens within a transformer layer. In this view, tokens on the left are attending to tokens on the right, and attention is represented as a line connecting each token pair.

Colors correspond to attention heads, and line thickness represents the attention weight, making it easy to see which tokens are most relevant to each other. We can also choose which attention to display, such as selecting "Sentence A → Sentence B" to examine only the attention from the first sentence to the second, for example from a question to its answer.

By selecting the experiment and asset we'd like to visualize, we can choose which attention layer and which attention heads to display. This allows us to drill down into specific areas of interest and gain a deeper understanding of how the model is processing the input.

The Head View is a great way to uncover patterns between attention heads, and by examining the attention weights, we can see which tokens are most relevant to each other.
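Continuing the loading sketch from the BertViz section above (reusing the `attention` and `tokens` variables defined there), the head view is rendered with a single call; `sentence_b_start` is the bertviz argument used for sentence-pair inputs:

```python
# Render the interactive head view in a notebook, reusing `attention` and
# `tokens` from the loading sketch above.
head_view(attention, tokens)

# For sentence-pair inputs, pass the index where Sentence B begins so the
# "Sentence A -> Sentence B" filter described above becomes available:
# head_view(attention, tokens, sentence_b_start=sentence_b_start)
```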

Model View

The model view offers a bird's-eye perspective of attention across all layers and heads, allowing us to notice attention patterns across layers and how they evolve from input to output.

Each row of figures in the model view represents an attention layer, and each column represents individual attention heads. Clicking on a particular head can enlarge the figure for a closer look.
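The model view is rendered the same way, again reusing `attention` and `tokens` from the earlier sketch (a hedged example of the bertviz call):

```python
# One row per attention layer, one column per head; click a cell to enlarge it.
model_view(attention, tokens)
```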

We can't necessarily look at the same attention heads for the same patterns across experiment runs because each layer is initialized with separate, independent weights. This means the layers that focus on specific patterns for one sentence may focus on different patterns for another sentence.

The model view gives us some insight into what the model may be focusing on by identifying which layers attend to areas of interest for a given sentence. This is an inexact science, though, and there's a real risk of confirmation bias: if you go looking for a pattern, you will usually find one.

Examining the attention heads in the model view can reveal what the model is really focusing on, even if the example seems silly at first. For instance, when GPT-2 generated the last word in a sentence, the attention heads suggest it was treating the "park" as the thing that was "too busy."

Self-Attention

Self-attention is a mechanism that enhances the information content of an input embedding by including information about the input's context. It enables the model to weigh the importance of different elements in an input sequence and dynamically adjust their influence on the output.

The concept of self-attention has its roots in the effort to improve Recurrent Neural Networks (RNNs) for handling longer sequences or sentences. It was introduced to give access to all sequence elements at each time step, allowing the model to be selective and determine which words are most important in a specific context.

The transformer architecture, introduced in 2017, made a standalone self-attention mechanism the core building block, eliminating the need for RNNs altogether.

Patterns

Attention heads can focus on very unique patterns, and different heads learn to focus on different aspects of the input.

Each attention head learns a unique attention mechanism, and they don't share parameters with each other.

In some cases, attention heads focus on identical words in a sentence, such as the crossover where two instances of "the" intersect.

Other attention heads focus on specific words or patterns, like the next word in the sentence or the delimiter [SEP].

The attention heads also capture lexical patterns, like focusing on list items, verbs, or acronyms.

Introducing Self-Attention

Self-attention has become a cornerstone of many state-of-the-art deep learning models, particularly in NLP.

The concept of "attention" in deep learning has its roots in the effort to improve RNNs for handling longer sequences or sentences.

Translating a sentence word-by-word is usually not an option because it ignores the complex grammatical structures and idiomatic expressions unique to each language, leading to inaccurate or nonsensical translations.

In 2017, the transformer architecture introduced a standalone self-attention mechanism, eliminating the need for RNNs altogether.

Self-attention enables the model to weigh the importance of different elements in an input sequence and dynamically adjust their influence on the output.

This is especially important for language processing tasks, where the meaning of a word can change based on its context within a sentence or document.

The original scaled dot-product attention mechanism introduced in the Attention Is All You Need paper remains the most popular and widely used attention mechanism in practice.

Most papers still implement this mechanism since self-attention is rarely a computational bottleneck for most companies training large-scale transformers.
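To make this concrete, here is a minimal sketch of scaled dot-product attention in PyTorch. It illustrates the formula softmax(QKᵀ / √dk)·V and is not a drop-in replacement for any library's implementation (PyTorch also ships its own torch.nn.functional.scaled_dot_product_attention):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V, with an optional mask."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                   # normalized attention weights
    return weights @ v, weights
```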

Multi-Head Attention

The Transformer's Multi-Head Attention is a powerful tool that allows it to encode multiple relationships and nuances for each word. It does this by repeating the Attention module's computations multiple times in parallel, with each repetition called an Attention Head.

The Attention module splits its Query, Key, and Value parameters N ways and passes each split independently through a separate Head. This is what's called multi-head attention.

The Attention computation is repeated N times in parallel, with each Head producing its own weighted sum of the input sequence, and these results are then combined to produce a final Attention score.

The Transformer's use of Multi-Head Attention allows it to attend to different parts of the input sequence at different times, which is particularly useful for tasks like translation. By using multiple Attention Heads, the Transformer can capture complex relationships and nuances in the input sequence.

The Attention Heads can be visualized in the Attention Head View, where each token is connected to other tokens by lines representing attention weights. The line thickness represents the attention weight, and the color corresponds to the Attention Head.

In the Model View, we can see the attention patterns across all layers and heads, illustrating the evolution of attention patterns from input to output. This view can help us identify which layers may be focusing on areas of interest for a given sentence.

The Multi-Head Attention mechanism is a key component of the Transformer's architecture, and it's what allows the model to perform so well on a wide range of tasks. By using multiple Attention Heads, the Transformer can capture complex relationships and nuances in the input sequence, making it a powerful tool for natural language processing.
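To show how the splitting, per-head attention, and recombination fit together, here is a compact, hedged sketch of a multi-head attention module in PyTorch. It reuses the scaled_dot_product_attention helper from the sketch above and is an illustration of the idea rather than a reference implementation:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "head (query) size = d_model / num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)   # separate projections for Q, K, V
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # combines the heads' outputs

    def forward(self, x, mask=None):
        batch, seq, _ = x.shape

        def split(t):  # (batch, seq, d_model) -> (batch, head, seq, d_head)
            return t.view(batch, seq, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        out, _ = scaled_dot_product_attention(q, k, v, mask)   # from the earlier sketch
        out = out.transpose(1, 2).reshape(batch, seq, -1)      # merge the heads back
        return self.w_o(out)
```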

Computing the Weights

The self-attention mechanism involves three weight matrices, Wq, Wk, and Wv, which are adjusted as model parameters during training. These matrices serve to project the inputs into query, key, and value components of the sequence, respectively.

To obtain the respective query, key, and value sequences, matrix multiplication is performed between the weight matrices W and the embedded inputs x. Specifically, the query sequence q is obtained via matrix multiplication between the weight matrix Wq and the embedded inputs x: q = xWq.

The query and key vectors have to contain the same number of elements (dq = dk); for illustration we use a small toy value such as dq = dk = 2. The value vector v can have an arbitrary number of elements (dv), which determines the size of the resulting context vector.

The projection matrices Wq and Wk have a shape of d × dk, while Wv has the shape d × dv. The dimensions dq, dk, and dv are usually much larger, but we use small numbers here for illustration purposes.

The unnormalized attention weights ω are computed as the dot product between the query and key vectors: ωᵢⱼ = qᵢ · kⱼ. This can be generalized to compute the weights for all query-key pairs across the input sequence.

The unnormalized attention weights ω are then scaled by dividing by the square root of the query size (dk) before normalizing them through the softmax function. This helps prevent the attention weights from becoming too small or too large, which could lead to numerical instability or affect the model's ability to converge during training.

The normalized attention weights α are obtained by applying the softmax function to the scaled unnormalized attention weights ω.

Here's a summary of the weight matrices and their shapes:

  • Wq: d × dq (projects the inputs into queries)
  • Wk: d × dk (projects the inputs into keys)
  • Wv: d × dv (projects the inputs into values)
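Below is a small numerical sketch of these projections in PyTorch. The sizes d = 3, dq = dk = 2, and dv = 4 are toy values chosen for illustration, not taken from any real model:

```python
import torch

torch.manual_seed(123)
d, d_q, d_k, d_v = 3, 2, 2, 4          # toy sizes for illustration only
x = torch.randn(6, d)                   # six embedded input tokens

W_q = torch.nn.Parameter(torch.rand(d, d_q))
W_k = torch.nn.Parameter(torch.rand(d, d_k))
W_v = torch.nn.Parameter(torch.rand(d, d_v))

q = x @ W_q            # queries, shape (6, d_q)
k = x @ W_k            # keys,    shape (6, d_k)
v = x @ W_v            # values,  shape (6, d_v)

omega = q @ k.T                                     # unnormalized weights ω
alpha = torch.softmax(omega / d_k ** 0.5, dim=-1)   # normalized weights α
context = alpha @ v                                 # context vectors, shape (6, d_v)
```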

Linear Layers and Scores

In masked multi-head attention, three separate Linear layers are used to produce the Q, K, and V matrices.

Each Linear layer has its own weights, and the input is passed through these layers to produce the matrices.

The Q, K, and V matrices are then split across the heads, and a single head's computations can be imagined as getting 'repeated' for each head and for each sample in the batch.

To compute the Attention Score for each head, a matrix multiplication is performed between Q and K, followed by adding a Mask value to the result.

The Mask value is used to mask out the Padding values in the Encoder Self-attention, preventing them from participating in the Attention Score.

The result is then scaled by dividing by the square root of the Query size, and a Softmax is applied to it.

Another matrix multiplication is performed between the output of the Softmax and the V matrix, completing the Attention Score calculation.
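A hedged sketch of how a padding mask can be folded into this score computation, following the order of steps described above (the additive -inf convention is one common approach; names such as pad_mask are our own):

```python
import torch

def masked_attention_scores(q, k, v, pad_mask):
    """q, k, v: (batch, head, seq, d); pad_mask: (batch, seq), True = real token."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1)                      # matrix multiplication between Q and K
    mask = pad_mask[:, None, None, :]                     # broadcast over heads and queries
    scores = scores.masked_fill(~mask, float("-inf"))     # Mask: padded keys get zero weight
    scores = scores / d_k ** 0.5                          # scale by sqrt of the Query size
    weights = torch.softmax(scores, dim=-1)               # Softmax
    return weights @ v                                    # matrix multiplication with V
```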

Linear Layers

Linear Layers are a crucial component of many deep learning models, and they play a key role in producing the Q, K, and V matrices.

Each Linear layer has its own weights, which are used to transform the input into the desired output.

There are three separate Linear layers for the Query, Key, and Value, each with its own set of weights.

The input is passed through these Linear layers to produce the Q, K, and V matrices.

Reshaping the Q, K, and V matrices is a necessary step to prepare them for further calculations.

Compute Scores

Computing Attention Scores is a crucial step in many neural network architectures, including Transformers. To compute the Attention Score for each head, we start by performing a matrix multiplication between the Query (Q) and Key (K) matrices.

This matrix multiplication is a fundamental operation in linear algebra, and it's used to compute the dot product between the query and key sequences. The result of this operation is then scaled by dividing by the square root of the Query size.

A Mask value is added to the result to prevent padding values from participating in the Attention Score computation. This Mask value is specific to the Encoder Self-attention, Decoder Self-attention, and Decoder Encoder-Attention, each with its own unique mask.

A Softmax function is then applied to the masked, scaled result. The Softmax normalizes the Attention Score values so that, for each query, they all lie between 0 and 1 and sum to 1.

To compute the unnormalized attention weights, we take the dot product between the query and key vectors, ωᵢⱼ = qᵢ · kⱼ. This is done for each pair of input elements, and the resulting values are normalized later to obtain the actual attention weights.

To merge each Head's Attention Scores together, we reshape the result matrix to eliminate the Head dimension. This is done by swapping the Head and Sequence dimensions, and then collapsing the Head dimension by reshaping to (Batch, Sequence, Head * Query size).
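A short sketch of that merge, with toy shapes (the tensor names here are illustrative):

```python
import torch

batch, num_heads, seq_len, query_size = 1, 2, 4, 3
attn_out = torch.randn(batch, num_heads, seq_len, query_size)  # one result matrix per head

# Swap the Head and Sequence dimensions, then collapse the Head dimension.
merged = attn_out.transpose(1, 2).reshape(batch, seq_len, num_heads * query_size)
print(merged.shape)   # torch.Size([1, 4, 6])
```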

Here's a summary of the steps involved in computing the Attention Score:

  • Matrix multiplication between Q and K
  • Add Mask value to the result
  • Scale the result by dividing by the square root of the Query size
  • Apply Softmax function to the result
  • Perform another matrix multiplication between the output of Softmax and the V matrix

Applications and End-to-End

Masked multi-head attention has numerous applications in natural language processing, and it sits at the heart of end-to-end transformer models.

In the end-to-end flow, information passes through the linear projections, the per-head attention computations, and the final merge in a single pass, with no recurrence.

Putting it all together, this is what allows transformers to process large amounts of data efficiently and in parallel.

Model View Applications

The model view is a powerful tool for understanding how language models like GPT-2 work. It allows us to see which layers of the model are focusing on specific patterns in a sentence.

With the model view, we can identify which layers are focusing on areas of interest for a given sentence. This is especially useful when we're trying to understand why the model made a particular prediction.

Examining the attention heads of the model can reveal interesting insights into what the model is focusing on. For example, in the case of the sentence "The dog had too many plans to go to the park", the model was probably referring to "park" as "too busy."

This kind of insight is especially useful when we're trying to understand the nuances of language and how the model is processing it. It's a reminder that language is complex and multifaceted, and that the model view can help us better understand these complexities.

In the case of the model view, we can see that each layer is initialized with separate, independent weights, which means that the layers that focus on specific patterns for one sentence may focus on different patterns for another sentence.

End-to-End

The end-to-end flow of multi-head attention can also be traced visually, from the input tokens through every layer to the final output.

The model view provides a bird’s-eye perspective of attention across all layers and heads.

Each row of figures in the model view represents an attention layer and each column represents individual attention heads.

We can click on any particular head to enlarge the figure and see more details.

The same line pattern found in the head view can also be seen in the model view.

Data Splitting

Data Splitting is a crucial step in the Attention mechanism, allowing multiple heads to process data independently.

The data gets split across the multiple Attention heads, but this is a logical split only, not a physical one.

Each Attention head processes the data independently, but they share the same Linear layer.

The Linear layer weights are logically partitioned per head, making the computations more efficient.

To achieve this, the Query Size is chosen as the Embedding Size divided by the Number of heads.

For example, if the Embedding Size is 6 and the Number of heads is 2, the Query Size is 3.

This allows us to think of the layer weight as 'stacking together' the separate layer weights for each head.

The Q, K, and V matrices output by the Linear layers are reshaped to include an explicit Head dimension, making each 'slice' correspond to a matrix per head.

The final reshaped matrix has dimensions of (Batch, Head, Sequence, Query size), making it easier to visualize the data splitting process.
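Using the numbers from the text (Embedding Size 6, 2 heads, Query Size 3) as a toy example, the logical split looks roughly like this:

```python
import torch

batch, seq_len, embed_size, num_heads = 1, 4, 6, 2
query_size = embed_size // num_heads               # 6 / 2 = 3

q = torch.randn(batch, seq_len, embed_size)        # output of the Query Linear layer
q = q.view(batch, seq_len, num_heads, query_size)  # add an explicit Head dimension
q = q.transpose(1, 2)                              # (Batch, Head, Sequence, Query size)
print(q.shape)                                     # torch.Size([1, 2, 4, 3])
```

Merging the heads back together at the end of the attention computation is simply the reverse of this reshape, as shown in the earlier merge sketch.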
