In the Transformer model, multi-head attention is a key mechanism that allows the model to weigh the importance of different input elements. For each input element, the model computes a set of attention weights, which are then used to form a weighted sum of the input elements; this weighted sum is the output of the attention mechanism. The weights are derived from dot products between query and key vectors, passed through a softmax.
The query, key, and value vectors are produced by projections learned during training, and in multi-head attention each head's dimension is typically much smaller than the model dimension (usually d_model/h for h heads), so the total computational cost is comparable to single-head attention with the full dimensionality.
The number of attention heads is a hyperparameter that can be tuned; using several heads lets the model capture a wider range of dependencies in the input data (the original Transformer uses eight).
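For reference, the scaled dot-product attention and its multi-head extension from Vaswani et al. (2017) can be written as:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)
$$

where each head's projection matrices \(W_i^Q, W_i^K, W_i^V\) map the inputs down to dimension \(d_k = d_{model}/h\).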
Attention Mechanism Basics
Attention mechanisms are a crucial part of many machine learning models, particularly in natural language processing.
Bahdanau attention, proposed in 2014, was one of the first mechanisms to address the bottleneck in early encoder-decoder architectures, where the entire input sequence had to be compressed into a single fixed-length context vector c computed from the encoder's hidden states. Attention replaces this with a sequence of context vectors c_t, one per decoding step, each focused on a specific part of the input sequence and its surroundings, so the decoder can attend to the relevant inputs at each step.
If a bi-directional encoder is used, each hidden state h_i is computed by concatenating the forward and backward hidden states.
The context vector c_t is computed as a weighted sum of the hidden states h_1,...,h_T, with each hidden state weighted by an alignment score α_t,i.
The alignment score is computed as \(\alpha_{t,i} = \frac{\exp(\mathrm{score}(s_{t-1}, h_i))}{\sum^{n}_{i'=1} \exp(\mathrm{score}(s_{t-1}, h_{i'}))}\), where \(s_{t-1}\) is the hidden state of the decoder at time-step t-1.
The score function used by Bahdanau et al. is \(\mathrm{score}(s_t, h_i) = v_\alpha^{\top}\tanh(W_\alpha[s_t; h_i])\), where tanh is used as a non-linear activation function and \(v_\alpha\) and \(W_\alpha\) are the weight matrices to be learned by the alignment model.
This alignment score function is also known as "concat" or "additive attention", because s_t and h_i are concatenated just like the forward and backward hidden states.
Bahdanau's attention model is also called a global attention model, because it attends to every input in the sequence, and soft attention, because the attention weights are distributed smoothly over the whole input rather than selecting a single position.
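As a concrete illustration, here is a minimal NumPy sketch of the additive score and the resulting context vector for one decoding step; the dimensions and variable names are made up for the example, not taken from the original paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, attn_dim = 5, 8, 8, 16      # illustrative sizes

h = rng.normal(size=(T, enc_dim))                # encoder hidden states h_1..h_T
s_prev = rng.normal(size=dec_dim)                # previous decoder state s_{t-1}

# Parameters of the alignment model (learned in practice, random here)
W_a = rng.normal(size=(attn_dim, enc_dim + dec_dim))
v_a = rng.normal(size=attn_dim)

# Additive ("concat") score: v_a^T tanh(W_a [s_{t-1}; h_i]) for every i
scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_prev, h_i])) for h_i in h])

alpha = softmax(scores)        # alignment weights alpha_{t,i}
c_t = alpha @ h                # context vector: weighted sum of encoder states
print(alpha.shape, c_t.shape)  # (5,), (8,)
```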
Types of Attention
Several types of attention mechanisms appear in Transformer models, including self-attention, encoder-decoder attention, and multi-head attention.
Self-attention is a type of attention that allows the model to attend to different parts of the input sequence in parallel.
In self-attention, the query, key, and value vectors are all derived from the same input sequence.
The encoder-decoder attention mechanism is used in sequence-to-sequence models, where the encoder produces a sequence of vector representations of the input.
In the decoder's attention mechanism, these encoder outputs serve as the keys and values, while the queries come from the decoder's own hidden states.
Multi-head attention is a type of attention that allows the model to jointly attend to information from different representation subspaces at different positions.
The number of attention heads is typically a hyperparameter that needs to be tuned.
In multi-head attention, the query, key, and value vectors are linearly transformed before being used to compute the attention weights.
The attention weights are then normalized using a softmax function.
Attention Math
Attention math is a crucial aspect of multi-head attention. The attention mechanism revolves around the queries, keys, and values, which are used to determine how much focus each part of the sequence should have on every other part.
To compute the attention scores, we take the dot product between the queries and keys, which gives us scores that dictate this focus. This is also known as scaled dot-product attention, because the scores are divided by the square root of the dimension of the keys (dk) to prevent them from becoming too large and destabilizing training.
The scaled attention scores are computed as \(\frac{QK^\top}{\sqrt{d_k}}\), where \(Q\) and \(K\) are the query and key matrices and \(d_k\) is the dimension of the keys. The result is optionally masked and then fed into a softmax to produce the attention weights.
The dimension of the keys (dk) is used to scale the attention scores, which helps in maintaining stable gradients during backpropagation and ensures smoother learning. This scaling is essential, especially when the dimensionality of the keys is large.
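A short NumPy sketch of scaled dot-product attention as described above; the shapes and the causal mask are illustrative choices, not requirements:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k: (seq_len, d_k); v: (seq_len, d_v)."""
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)           # scale by sqrt(d_k) to keep gradients stable
    if mask is not None:
        scores = np.where(mask, scores, -1e9) # optional masking before the softmax
    weights = softmax(scores, axis=-1)        # attention weights, each row sums to 1
    return weights @ v, weights               # weighted sum of the values

q = np.random.randn(4, 8); k = np.random.randn(4, 8); v = np.random.randn(4, 8)
causal = np.tril(np.ones((4, 4), dtype=bool))  # e.g. a causal (decoder) mask
out, w = scaled_dot_product_attention(q, k, v, mask=causal)
print(out.shape, w.shape)                      # (4, 8), (4, 4)
```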
Computational Difference Between Luong and Bahdanau Attention
The Luong and Bahdanau attention mechanisms have a significant computational difference.
Luong et al. generalized the computation for the context vector \(c_t\), allowing for easier implementation of different score functions for the same attention mechanism.
In Bahdanau's version, the attention mechanism computes the variable length context vector first, which is then used as input for the decoder. This necessitates the use of the last decoder hidden state \(s_{t-1}\) as input for the computation of the context vector \(c_t\).
Luong et al. compute their context vector with the current decoder hidden state \(s_t\) and modify the decoder output with the context vector before it is processed by the last softmax layer.
The Bahdanau attention mechanism computes \(\alpha_{t,i} = \mathrm{align}(y_t, x_i) = \frac{\exp(\mathrm{score}(s_{t-1},h_i))}{\sum^{n}_{i'=1}\exp(\mathrm{score}(s_{t-1},h_{i'}))}\), whereas the corresponding score functions from Luong et al. are not covered here.
Readers who want to avoid implementing these details themselves can build on TensorFlow's AttentionWrapper class, which handles the specifics of the implementation.
Sparse Attention
Sparse attention mechanisms can significantly reduce computational complexity while preserving performance. This is achieved by limiting each position's attention to a specific subset of other positions.
The most basic patterns for sparse attention include Sliding Window Attention, Global Attention, and Random Attention. These patterns are all $\mathcal{O}(N)$ in complexity.
Sliding Window Attention restricts each query to attend only to its neighboring nodes, leveraging the inherent locality of most data. This approach is particularly useful for tasks where the input data has a natural spatial or temporal structure.
Global Attention introduces global nodes as hubs to facilitate efficient information propagation across nodes. This can be beneficial for tasks where the input data has a complex, non-local structure.
Random Attention enhances non-local interactions by randomly sampling a few edges for each query, fostering a broader exploration of relationships within the data. This approach can be useful for tasks where the input data has a high degree of variability.
The Sparse Transformer, used in the GPT-3 language model, leverages two distinct sparse attention patterns: strided and fixed. The complexity of the Sparse Transformer is $\mathcal{O}(N \sqrt{N})$ when the stride $s$ is chosen close to $\sqrt{N}$.
Here are the three basic sparse attention patterns:
- Sliding Window Attention: Restricts each query to attend only to its neighboring nodes.
- Global Attention: Introduces global nodes as hubs to facilitate efficient information propagation across nodes.
- Random Attention: Enhances non-local interactions by randomly sampling a few edges for each query.
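To make these patterns concrete, here is a rough NumPy sketch that builds a combined attention mask from the three patterns; a True entry means the query may attend to that key, and the window size, number of global tokens, and number of random edges are made-up parameters:

```python
import numpy as np

def sparse_attention_mask(n, window=2, n_global=1, n_random=2, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)

    # Sliding window: each query attends to its local neighbourhood
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True

    # Global attention: a few hub tokens attend to, and are attended by, everyone
    mask[:n_global, :] = True
    mask[:, :n_global] = True

    # Random attention: a few random edges per query for non-local interactions
    for i in range(n):
        mask[i, rng.choice(n, size=n_random, replace=False)] = True
    return mask

m = sparse_attention_mask(10)
print(m.sum(), "allowed pairs out of", m.size)   # far fewer than the dense n*n
```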
Linearized Attention
Linearized attention is a technique that reduces the computational complexity of Transformer models by transforming the softmax function and altering the calculation order.
The idea is to approximate the softmax attention \(\mathrm{softmax}(QK^\top)V\) with \(Q'(K'^\top V)\), where \(Q' = \phi(Q)\) and \(K' = \phi(K)\) for a suitable feature map \(\phi\).
Because matrix multiplication is associative, \(K'^\top V\) can be computed first, reducing the complexity from $\mathcal{O}(N^2 \cdot d_{model})$ to $\mathcal{O}(N \cdot d_{model}^2)$, which is linear in the sequence length N.
This technique is particularly useful for large sequence lengths, where the original softmax function can become computationally expensive.
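A minimal NumPy sketch of this reordering, using the \(\phi(x) = \mathrm{elu}(x) + 1\) feature map proposed in the Linear Transformer paper; the sequence length and dimensions are illustrative:

```python
import numpy as np

def phi(x):
    # elu(x) + 1: a positive feature map used by the Linear Transformer
    return np.where(x > 0, x + 1.0, np.exp(x))

N, d = 1024, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))

Qp, Kp = phi(Q), phi(K)          # Q' and K'

# Associativity: compute K'^T V (d x d) once, instead of the N x N matrix Q'K'^T
kv = Kp.T @ V                    # O(N * d^2)
z = Qp @ Kp.sum(axis=0)          # per-row normalizer, replaces the softmax denominator
out = (Qp @ kv) / z[:, None]     # O(N * d^2) overall, linear in N

print(out.shape)                 # (1024, 64)
```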
Here are some key references for linearized attention:
- Efficient Transformers: A Survey (Section 4.2)
- Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (Linear Transformer) (v1: 29.Jun.2020, v3: 31.Aug.2020)
- Rethinking Attention with Performers (Performer) (v1: 30.Sep.2020, v4: 19.Nov.2022)
- Random Feature Attention (v1: 3.Mar.2021, v2: 19.Mar.2021)
Multi-Query and Grouped-Query Attention
Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) are two techniques that aim to improve the efficiency of attention mechanisms in Transformer models while preserving quality.
Multi-Query Attention (MQA) was introduced in 2019 with the paper "Fast Transformer Decoding: One Write-Head is All You Need". It reduces data transfer per computation by utilizing the same key and value matrices for all attention heads within a single layer, leading to faster decoding with minimal quality degradation.
Several Large Language Models (LLMs) have adopted MQA, including Falcon and PaLM.
Grouped-Query Attention (GQA) was introduced in 2023 and expands upon MQA by using multiple, but not all, key-value head groups. This approach addresses the quality degradation observed in MQA while maintaining its efficiency benefits.
GQA has been adopted by LLMs such as LLaMa2 and Mistral 7B.
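A rough NumPy sketch of the core idea behind GQA, where each key/value head is shared by a group of query heads (MQA is the special case of a single key/value head); head counts and dimensions here are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d), n_q_heads % n_kv_heads == 0."""
    n_q, n_kv = q.shape[0], k.shape[0]
    group = n_q // n_kv
    # Each key/value head serves `group` consecutive query heads
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ v               # (n_q_heads, seq, d)

seq, d = 6, 16
q = np.random.randn(8, seq, d)               # 8 query heads
k = np.random.randn(2, seq, d)               # 2 key/value head groups (MQA would use 1)
v = np.random.randn(2, seq, d)
print(grouped_query_attention(q, k, v).shape)  # (8, 6, 16)
```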
Here are some key papers that have contributed to the development of Multi-Query and Grouped-Query Attention:
- Fast Transformer Decoding: One Write-Head is All You Need (MQA) (6.Nov.2019)
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (v1: 22.May.2023, v2: 24.Oct.2023)
These techniques have shown promising results in improving the efficiency and quality of attention mechanisms, and it will be interesting to see how they continue to evolve in the future.
Self-Attention
Self-attention is a powerful mechanism that allows an encoder to attend to other parts of the input during processing. It's like having a spotlight that shines on different parts of the input sequence, highlighting the most relevant information.
The self-attention mechanism is implemented using three weight matrices: \(\mathbf{W}_q\), \(\mathbf{W}_k\), and \(\mathbf{W}_v\). These matrices are used to project the inputs into query, key, and value components of the sequence, respectively.
To compute the attention weights, we take the dot product between the query and key vectors, using the formula \(\omega_{i, j}=\mathbf{q}^{(i)^{\top}} \mathbf{k}^{(j)}\). For example, the unnormalized attention weights for the second input element are obtained by taking the dot product of its query \(\mathbf{q}^{(2)}\) with every key \(\mathbf{k}^{(j)}\) in the sequence.
The dimension of the keys, \(d_k\), plays a crucial role in scaling the attention scores. To ensure stable gradients during backpropagation, the attention scores are multiplied by \(1/\sqrt{d_k}\), i.e., scaled by the inverse square root of the key dimension.
Here's a summary of the key components involved in self-attention:
- \(\mathbf{W}_q\), \(\mathbf{W}_k\), and \(\mathbf{W}_v\): the three weight matrices used to project the inputs into query, key, and value components
- \(\omega_{i, j}=\mathbf{q}^{(i)^{\top}} \mathbf{k}^{(j)}\): the formula for computing the dot product between the query and key vectors
- \(1/\sqrt{d_k}\): the scaling factor used to ensure stable gradients during backpropagation
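A compact NumPy sketch of these steps for a single query position (the second element, as in the example above); the embedding and projection sizes are made up for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(123)
T, d, d_q, d_v = 6, 16, 24, 28               # sequence length and (illustrative) dimensions

x = rng.normal(size=(T, d))                  # embedded input sequence
W_q = rng.normal(size=(d, d_q))              # the three learned projection matrices
W_k = rng.normal(size=(d, d_q))
W_v = rng.normal(size=(d, d_v))

queries, keys, values = x @ W_q, x @ W_k, x @ W_v

# Unnormalized attention weights omega_{2,j} for the second input element
omega_2 = queries[1] @ keys.T                # dot product of q^(2) with every key k^(j)

# Scale by 1/sqrt(d_k) and normalize with a softmax
alpha_2 = softmax(omega_2 / np.sqrt(d_q))

# Context vector for the second element: weighted sum of the value vectors
z_2 = alpha_2 @ values
print(omega_2.shape, alpha_2.shape, z_2.shape)   # (6,), (6,), (28,)
```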
Self-attention has been implemented in various architectures, including the Long Short-Term Memory-Network (LSTMN) and the Transformer. The Transformer architecture uses self-attention extensively to circumvent the drawbacks of traditional RNNs.
Transformer Encoder
The Transformer Encoder is a crucial component of the Transformer model, which is a type of neural network architecture that's particularly well-suited for natural language processing tasks.
It takes the input sequence of tokens and maps it to a sequence of vectors, where each vector is the corresponding token's embedding.
These vectors are then passed through a series of identical layers, known as encoder layers, which apply a combination of self-attention and feed-forward neural networks to transform the input sequence.
Each encoder layer consists of two sub-layers: a self-attention mechanism and a position-wise fully connected feed-forward network.
In the original Transformer, the encoder stack contains six of these identical layers; stacking them allows the model to capture both local and global dependencies in the input sequence, with multi-head attention letting each layer attend to the input from several perspectives at once.
The self-attention layer is where the magic happens, allowing the model to focus on specific parts of the input sequence that are relevant to the task at hand. This is particularly useful for tasks like machine translation, where the model needs to consider the context of the entire sentence.
The Transformer Encoder uses a positional encoding scheme to preserve the order of the input sequence. This is essential for tasks like machine translation, where the order of the words is crucial.
The Transformer Encoder is a highly parallelizable architecture, making it well-suited for large-scale NLP tasks. This is one of the reasons why it's become a popular choice for many NLP applications.
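To tie the two sub-layers together, here is a bare-bones NumPy sketch of one encoder layer; a single attention head is used for brevity, and the residual connections and layer normalization of the original Transformer are included, with all parameters as random stand-ins for learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def encoder_layer(x, W_q, W_k, W_v, W_o, W_1, b_1, W_2, b_2):
    # Sub-layer 1: (single-head, for brevity) self-attention with residual + layer norm
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
    x = layer_norm(x + attn @ W_o)

    # Sub-layer 2: position-wise feed-forward network with residual + layer norm
    ff = np.maximum(0, x @ W_1 + b_1) @ W_2 + b_2
    return layer_norm(x + ff)

d, d_ff, T = 16, 64, 10
rng = np.random.default_rng(0)
params = [rng.normal(scale=0.1, size=s) for s in
          [(d, d), (d, d), (d, d), (d, d), (d, d_ff), (d_ff,), (d_ff, d), (d,)]]
out = encoder_layer(rng.normal(size=(T, d)), *params)
print(out.shape)                             # (10, 16)
```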
Positional Encoding
Positional encoding is a way to add position information to input features in a sequence, which is important for tasks like language understanding where the position of words matters.
The Multi-Head Attention block is permutation-equivariant, meaning it can't distinguish between the order of input elements in a sequence.
To address this, positional encoding uses sine and cosine functions of different frequencies to represent position information.
These values, concatenated for all hidden dimensions, are added to the original input features.
The specific pattern chosen by Vaswani et al. uses sine and cosine functions of different frequencies, with wavelengths ranging from 2π to 10000⋅2π.
The intuition behind this encoding is that you can represent PE(pos+k,:) as a linear function of PE(pos,:), which might allow the model to easily attend to relative positions.
The positional encoding is implemented using sine and cosine waves with different wavelengths that encode the position in the hidden dimensions.
When the encodings are plotted per hidden dimension, dimensions 1 and 2 share a wavelength of 2π and differ only in their starting angle (sine versus cosine), so their pattern repeats after about six positions, while dimensions 2 and 3 have roughly twice that wavelength.
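A small NumPy sketch of this sinusoidal encoding, following the pattern \(PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})\) and \(PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})\) from Vaswani et al.:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                 # positions 0..seq_len-1
    i = np.arange(0, d_model, 2)[None, :]             # even hidden dimensions
    angles = pos / np.power(10000.0, i / d_model)     # wavelengths from 2*pi to 10000*2*pi
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                      # cosine on odd dimensions
    return pe

pe = positional_encoding(seq_len=50, d_model=32)
print(pe.shape)          # (50, 32); added element-wise to the input embeddings
```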
Implementation and Complexity
The Transformer architecture has a significant advantage over RNN-based models in terms of complexity: because every position is processed in parallel, it requires only a constant number of sequential operations regardless of the input length n, and because every position attends directly to every other, long-range dependencies can be modelled regardless of distance.
This fast modelling of long-range dependencies, together with its multiple attention heads, makes the Transformer a favourable choice for transfer learning. Models built on the Transformer architecture and pre-trained on large amounts of data have gained a deeper understanding of language and are state-of-the-art today.
The Transformer's complexity comes from the self-attention mechanism, which calculates an attention score between the currently processed input \(x_i\) and every other input \(x_j\), for all \(i, j \in \{1, \ldots, n\}\), i.e., on the order of \(n^2\) score computations. This limits the length of the context that Transformers can process and increases the time they need for training in practice.
A summary of the complexity per layer for different architectures is given in the table in the Complexity by Model section below.
Implementation
In multi-head attention implementation, each of the query, key, and value tensors is passed through a linear layer.
This process allows for more complex interactions between different parts of the input data.
The tensors are then split into multiple heads, enabling the model to focus on different aspects of the input.
This is particularly useful for tasks like language translation, where different heads can focus on different aspects of the sentence.
After performing dot product attention on each head, the results are concatenated.
This is a key step in combining the information from each head into a single output.
The concatenated results are then projected back to the original dimensionality, allowing the model to output a single vector.
This final projection is a crucial step in making the model's output interpretable.
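Pulling these steps together, here is a compact NumPy sketch of a multi-head attention forward pass (projection, head split, scaled dot-product attention per head, concatenation, output projection); the random weight matrices stand in for the learned linear layers:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):        # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    # 1. Linear projections, then split into heads
    q, k, v = (split_heads(x @ W) for W in (W_q, W_k, W_v))

    # 2. Scaled dot-product attention independently per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ v                          # (heads, seq, d_head)

    # 3. Concatenate the heads and project back to d_model
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

d_model, n_heads, seq_len = 32, 4, 10
rng = np.random.default_rng(0)
W_q, W_k, W_v, W_o = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(rng.normal(size=(seq_len, d_model)), W_q, W_k, W_v, W_o, n_heads)
print(out.shape)                                         # (10, 32)
```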
Complexity by Model
As we dive into the world of implementation and complexity, it's essential to understand the different models and their complexities. The Transformer architecture has a complexity per layer of O(n^2 * d), which can be a significant limitation for very long sequences.
The RNN model, on the other hand, has a complexity per layer of O(n * d^2) but requires O(n) sequential operations, since each time-step depends on the previous one; this prevents parallelization across the sequence.
The Transformer's sequential operations are O(1), thanks to its extensive parallelization, but it's not the only model with this advantage. The Convolutional model also has O(1) sequential operations, with a complexity per layer of O(k * n * d^2).
Here's a breakdown of the complexity per layer and the number of sequential operations for different models (the figures for the efficient variants are those commonly reported in the literature):

| Model | Complexity per layer | Sequential operations |
| --- | --- | --- |
| Transformer (self-attention) | O(n^2 * d) | O(1) |
| RNN (recurrent) | O(n * d^2) | O(n) |
| Convolutional | O(k * n * d^2) | O(1) |
| Sparse Transformer | O(n * sqrt(n)) | O(1) |
| Reformer | O(n * log n) | O(1) |
| Linformer | O(n) | O(1) |
| Linear Transformer | O(n) | O(1) |

As the table shows, the Sparse Transformer, Reformer, Linformer, and Linear Transformer all have sub-quadratic complexity per layer, making them more efficient than the standard Transformer for very long sequences.
Frequently Asked Questions
What is the difference between single head and multihead attention?
Single-head attention computes one set of attention weights over the sequence, so all information must pass through a single attention pattern, whereas multi-head attention lets the model attend to several positions and representation subspaces simultaneously, enhancing its ability to capture complex relationships in the input data. This difference is a major factor behind the success of the Transformer model.
What are the advantages of multi-head attention?
Multi-head attention allows models to capture diverse perspectives and complex relationships in the data, enabling a more accurate and comprehensive understanding.
Sources
- https://www.interdb.jp/dl/part04/ch17/sec01.html
- https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html
- https://slds-lmu.github.io/seminar_nlp_ss20/attention-and-self-attention-for-nlp.html
- https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html
- https://blog.gopenai.com/transformer-multi-head-attention-implementation-in-tensorflow-with-scaled-dot-product-6f391ccfa758