Transformer Encoder Explained: Multi-Head Attention (Part 3)

Vikas Srinivasa

This blog is Part 3 of our series on how transformers work. By the end of this post, you’ll have an intuitive understanding of Multi-Head Attention—a key mechanism that enhances the model’s ability to capture diverse relationships between tokens.

Self-attention has revolutionized deep learning models, enabling them to capture dependencies across an entire sequence, making them highly effective for NLP, computer vision, and other domains. However, a single self-attention mechanism might not be sufficient to capture multiple aspects of relationships within a sequence. This is where Multi-Head Attention (MHA) comes in.

Multi-Head Attention enhances the power of self-attention by allowing the model to attend to different parts of the input sequence simultaneously. This extension is a core component of the Transformer architecture, enabling models like BERT, GPT, and T5 to process information more effectively.

In this article, we’ll explore how Multi-Head Attention extends Self-Attention and why it is crucial in modern deep learning models.


Self-Attention in Transformers

Before diving into Multi-Head Attention, let’s briefly revisit Self-Attention.

Self-attention enables a model to weigh different words in a sequence based on their relevance to each other. Given an input sequence, the model computes Query ( \(Q\) ), Key ( \(K\) ), and Value ( \(V \) ) matrices and derives attention scores using the Scaled Dot-Product Attention formula:

$$\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V$$

This mechanism allows each word to dynamically attend to other words in the sequence, capturing relationships effectively.
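
To make the formula concrete, here is a minimal sketch of scaled dot-product attention in PyTorch. The function name and the tensor shapes (10 tokens, \(d_k = 64\)) are illustrative assumptions, not part of the original formulation.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (num_tokens, d_k) -- illustrative shapes
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # QK^T / sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)            # row-wise attention weights
    return weights @ V                                 # weighted sum of value vectors

# Example with arbitrary sizes: 10 tokens, d_k = 64
Q, K, V = (torch.randn(10, 64) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)     # torch.Size([10, 64])
```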

If you’d like a detailed explanation of how self-attention works, check out my blog post, Transformer Encoder Explained: A Deep Dive into Attention Scores (Part 2).

Now, let’s see how Multi-Head Attention builds upon this concept.


What is Multi-Head Attention?

While self-attention is powerful, it has a limitation: a single attention mechanism might not be enough to capture different aspects of word relationships. Consider a sentence like:

"The bank approved the loan."

  • "Bank" could refer to a financial institution or a riverbank depending on context.

  • A single self-attention mechanism might focus on one aspect, missing the broader context.

To address this, Multi-Head Attention (MHA) applies multiple self-attention mechanisms in parallel, allowing the model to attend to different information simultaneously.


How Multi-Head Attention Works

Instead of computing attention using a single set of Q, K, V, Multi-Head Attention splits the input into multiple heads and applies self-attention independently. The results from all attention heads are then concatenated and projected into the final output.

The process involves the following steps:

  1. Linear Projections: The input embeddings are transformed into multiple sets of Query ( \(Q\) ), Key ( \(K\)), and Value ( \(V\) ) matrices using different learned weight matrices.

    $$Q = X \cdot W_q \quad K = X \cdot W_k \quad V = X \cdot W_v$$

    where:

    • \(W_q, W_k, W_v\) are trainable weight matrices that create the respective projections.

    • These projections split the input into multiple attention heads, each having its own set of \(Q\), \(K\) and \(V\) matrices.

  2. Independent Self-Attention: Each head applies self-attention separately on its respective \(Q\), \(K\) and \(V\).

    $$\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V$$

    where:

    • \(QK^T\) computes similarity scores between tokens.

    • The softmax function normalizes these scores to determine attention weights.

    • The attention weights are used to scale the values \(V\), determining how much focus each token should receive.

Since there are \(h\) attention heads, each computes attention separately on its own transformed \(Q\), \(K\), and \(V\) matrices. A short code sketch of these first two steps follows below.
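
Here is a minimal PyTorch sketch of steps 1 and 2. The sizes (\(D = 512\), \(h = 8\)), the bias-free linear layers, and the `split_heads` helper are assumptions chosen for illustration, following the common convention \(d_k = D / h\) rather than any particular library's implementation.

```python
import math
import torch
import torch.nn as nn

D, h = 512, 8                       # assumed embedding dimension and head count
d_k = D // h                        # per-head dimension (D / h)

# Step 1: learned projections produce Q, K, V for all heads at once
W_q = nn.Linear(D, D, bias=False)
W_k = nn.Linear(D, D, bias=False)
W_v = nn.Linear(D, D, bias=False)

X = torch.randn(10, D)              # 10 tokens of dimension D (illustrative input)

def split_heads(T):
    # (N, D) -> (h, N, d_k): give each head its own slice of the projection
    return T.view(-1, h, d_k).transpose(0, 1)

Q, K, V = split_heads(W_q(X)), split_heads(W_k(X)), split_heads(W_v(X))

# Step 2: independent scaled dot-product attention in each head (batched over heads)
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)    # (h, N, N)
head_outputs = torch.softmax(scores, dim=-1) @ V     # (h, N, d_k)
print(head_outputs.shape)                            # torch.Size([8, 10, 64])
```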

  3. Concatenation: The outputs of all heads are concatenated by stacking them side by side (horizontally) into a single matrix; a short snippet at the end of this step illustrates the stacking.

    $$\text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h)$$

    • If there are \(h\) heads, each head produces an output matrix of shape \(N \times d_k\), where \(N\) is the number of tokens and \(d_k\) is the per-head dimension.

    • Horizontally stacking the attention outputs from all heads produces a matrix of shape \(N \times (h \cdot d_k)\). Since \(d_k = D / h\) in the standard setup, this is

      $$\text{Number of Tokens} \times \text{Embedding Dimension}$$

    • Why Concatenation Alone Isn't Enough

      At this stage, the stacked attention outputs from different heads are merely placed side by side, but they remain disjoint—each attention head has focused on different relationships independently, but there’s no mechanism yet to combine their insights into a unified representation.

      • Without a learned transformation, the model has no way to determine how much weight to assign to the contributions of each head.

      • Each head might capture different semantic aspects (e.g., positional dependencies, syntax, or meaning), but simply concatenating them does not blend these insights into a meaningful form for the next layer.
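
As a concrete illustration of the concatenation step, the snippet below simply stacks hypothetical head outputs side by side; no learning happens at this point, which is exactly why the projection that follows is needed. The shapes are assumed for illustration.

```python
import torch

h, N, d_k = 8, 10, 64                                    # assumed head count, tokens, head dim
head_outputs = [torch.randn(N, d_k) for _ in range(h)]   # stand-ins for the h attention outputs

# Horizontally stack the h matrices of shape (N, d_k) into one (N, h * d_k) matrix
concat = torch.cat(head_outputs, dim=-1)
print(concat.shape)   # torch.Size([10, 512]) -- number of tokens x embedding dimension
```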

To address this, we apply a final transformation using a trainable projection matrix \(W^O\).

  4. Final Linear Projection: To ensure consistency with the rest of the model, the concatenated vector is projected back to the original embedding dimension \(D\) using a weight matrix \(W^O\) of shape \(D \times D\):

    $$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h) W^O$$

    • \(W^O\) learns how to blend the outputs from all heads to create a final context-aware representation.

    • This final transformation ensures that the output remains compatible with the input dimensions for further processing in the model.

    • The concatenated output, of width \(h \times d_k\), is mapped back to the original embedding dimension \(D\) (in the standard setup, \(h \times d_k = D\)), keeping it consistent with the rest of the model.

By transforming the raw concatenated attention outputs, the model can assign appropriate importance to different attention heads, ensuring that the most relevant and meaningful information is propagated forward.
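
Putting all four steps together, the following is a minimal, self-contained sketch of Multi-Head Attention in PyTorch. The class name, default sizes, and bias-free layers are illustrative assumptions rather than a reference implementation.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal sketch: per-head projections, attention, concatenation, and W^O."""

    def __init__(self, D=512, h=8):
        super().__init__()
        assert D % h == 0, "embedding dimension must divide evenly across heads"
        self.h, self.d_k = h, D // h
        self.W_q = nn.Linear(D, D, bias=False)
        self.W_k = nn.Linear(D, D, bias=False)
        self.W_v = nn.Linear(D, D, bias=False)
        self.W_o = nn.Linear(D, D, bias=False)   # trainable output projection W^O

    def forward(self, X):                        # X: (batch, N, D)
        b, n, D = X.shape

        def split(T):                            # (b, N, D) -> (b, h, N, d_k)
            return T.view(b, n, self.h, self.d_k).transpose(1, 2)

        # Steps 1-2: project and attend independently in each head
        Q, K, V = split(self.W_q(X)), split(self.W_k(X)), split(self.W_v(X))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)     # (b, h, N, N)
        heads = torch.softmax(scores, dim=-1) @ V                  # (b, h, N, d_k)

        # Step 3: concatenate heads back into shape (b, N, h * d_k) = (b, N, D)
        concat = heads.transpose(1, 2).contiguous().view(b, n, D)

        # Step 4: blend the heads with the learned projection W^O
        return self.W_o(concat)

mha = MultiHeadAttention()
out = mha(torch.randn(2, 10, 512))
print(out.shape)    # torch.Size([2, 10, 512])
```

Note how the output shape matches the input shape; this is what lets the residual connections and feed-forward layers discussed later in this series plug in directly.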


Why Is This Important?

  1. Integration of Multiple Perspectives: Each attention head captures different aspects of the input sequence. The projection step ensures that these different perspectives are blended into a single, coherent representation.

  2. Consistency for Subsequent Layers: By projecting the concatenated output back to the embedding dimension, the transformer maintains dimensional consistency across layers, allowing seamless integration with feed-forward layers and residual connections.


Intuitive Analogy

Imagine a team of specialists analyzing a document:

  • Each specialist (attention head) focuses on a different aspect—grammar, semantics, tone, or structure.

  • They all produce individual reports (head outputs).

  • These reports are then merged and summarized into a single, cohesive document (the projection step) to ensure no insights are lost and the output is streamlined for further processing.


Conclusion: The Power of Multi-Head Attention

Multi-Head Attention is a crucial enhancement of the self-attention mechanism, allowing transformers to capture multiple perspectives in parallel. By applying independent self-attention mechanisms, concatenating the outputs, and then projecting them back into the original embedding dimension, the model learns to extract richer contextual representations. This process ensures that transformers can effectively understand complex relationships between tokens, making them highly effective in NLP, computer vision, and beyond.

But Multi-Head Attention is just one piece of the puzzle. In the next part of this series, we’ll explore the Feed-Forward Network (FFN), Residual Connections, and Layer Normalization—key components that further refine the transformer’s ability to learn deep representations while stabilizing training.

Stay tuned for Part 4, where we’ll break down these mechanisms in detail! 🚀


Written by

Vikas Srinivasa

My journey into AI has been unconventional yet profoundly rewarding. A back injury ended my professional cricket career and reshaped my path, reigniting my passion for technology. Seven years after my initial studies, I returned to complete my Bachelor of Technology in Computer Science, where I discovered my deep fascination with Artificial Intelligence, Machine Learning, and NLP, particularly its applications in the finance sector.