Transformer Encoder Explained: A Deep Dive into Attention Scores (Part 2)

Vikas Srinivasa

In Part 1 of this series, we explored how raw text is transformed into a numerical format through tokenization, embeddings, and positional encoding—laying the groundwork for understanding how transformers process data. If you haven't checked it out yet, you can read it here: How Transformers Work: Tokenization Embeddings and Positional Encoding Explained (Part 1).

Now, in Part 2, we dive into the self-attention mechanism, breaking down how transformers compute attention scores to determine which tokens in a sequence are most relevant to each other. Let’s get started! 🚀


What is an Attention Score?

An attention score is a numerical value that determines how much importance one token in a sequence should assign to another. These scores are computed as part of the self-attention mechanism in a transformer, helping the model understand contextual relationships between tokens.

How Are Attention Scores Computed?

The process of computing attention scores involves several steps:

  1. Generating Query (Q), Key (K), and Value (V) matrices

  2. Computing alignment scores to measure relationships between tokens

  3. Applying the softmax function to normalize these scores into probabilities

  4. Applying a linear transformation to refine the attention output

We’ll start by exploring how attention scores are calculated within a single attention head and then extend this understanding to multi-head attention, which enhances the model’s ability to capture diverse contextual relationships.


1.1 Computing the Query (Q), Key (K), and Value (V) Matrices

The attention layer in the encoder has multiple attention heads, and each head has three weight matrices, \(W_q\), \(W_k\), and \(W_v\), each of order \(D \times \frac{D}{\text{No. of Heads}}\) (where \(D\) denotes the dimension size). They are initialized randomly so that each head's weight matrices are unique, i.e., no two heads share the same weight values. These weight matrices are used to compute three matrices, namely the query, key, and value matrices, which in turn are used to compute the attention scores.

  1. Query (Q):

    1. A query is a vector that asks a question about the relationship between the current token and other tokens. For example, in a sentence, "The dog chased the cat," the query for "dog" might be asking, "Who or what is related to the dog?"

    2. It is computed as follows:

      $$Q= \text{Input Matrix} \times W_q$$

    3. The "questions" encoded by the Query vector (via the weight matrix \(W_q\)) are not explicitly predetermined. Instead, they are learned during the model's training process.

  2. Key (K):

    1. Think of it as metadata for each token, indicating what kind of information this word provides. For example, in the sentence "The dog chased the cat," the key for "cat" might encode information like "animal" or "object being acted upon."

    2. It is computed as:

$$K =\text{Input Matrix} \times W_k$$

  3. Value (V):

    1. What it represents:

      A value is the vector containing the actual information or content of a token that will be passed to the next layer if that token is deemed important. In the example "The dog chased the cat," the value for "cat" might contain its embedding and context-specific information about "cat."

    2. It is computed as:

$$V= \text{Input Matrix}\times W_v$$

In our use case, because the dimension size is small, we use only one attention head.

The weight matrices ( \(W_q\), \(W_k\), and \(W_v\) ) are initialized to random values. These randomly initialized weight matrices are given below:

$$\textbf{Weight Matrices:}$$

$$\begin{array}{c c c} {W_q} & {W_k} & {W_v} \\ \begin{bmatrix} 0.3745 & 0.9507 & 0.7320 \\ 0.5987 & 0.1560 & 0.1560 \\ 0.0581 & 0.8662 & 0.6011 \end{bmatrix} & \begin{bmatrix} 0.7081 & 0.0206 & 0.9699 \\ 0.8324 & 0.2123 & 0.1818 \\ 0.1834 & 0.3042 & 0.5248 \end{bmatrix} & \begin{bmatrix} 0.4320 & 0.2912 & 0.6119 \\ 0.1395 & 0.2921 & 0.3664 \\ 0.4561 & 0.7852 & 0.1997 \end{bmatrix} \\ \end{array}$$

The \(Q\), \(K\), and \(V\) matrices obtained by multiplying the input matrix with the corresponding weight matrices ( \(W_q\), \(W_k\), and \(W_v\) ) are given below:

$$\begin{array}{c c c} \textbf{Query Matrix } Q & \textbf{Key Matrix } K & \textbf{Value Matrix } V \\ \begin{bmatrix} 1.1904 & 0.1467 & 0.4669 \\ 2.0582 & 0.2445 & 1.6091 \\ 2.0105 & 0.6131 & 1.5091 \\ 1.4613 & 0.2865 & 1.2076 \\ 0.3440 & 0.1828 & -0.5350 \\ 0.4377 & 0.4853 & -0.5507 \\ 1.6470 & 0.3173 & 0.5242 \\ 2.0412 & 0.2458 & 1.6666 \\ 1.5460 & 0.2431 & 1.1800 \\ 1.6145 & 0.2871 & 1.5270 \\ 1.0075 & 0.4310 & 0.0520 \\ 0.9114 & 0.2611 & 0.1972 \end{bmatrix} & \begin{bmatrix} 1.1904 & 0.1467 & 0.4669 \\ 2.0582 & 0.2445 & 1.6091 \\ 2.0105 & 0.6131 & 1.5091 \\ 1.4613 & 0.2865 & 1.2076 \\ 0.3440 & 0.1828 & -0.5350 \\ 0.4377 & 0.4853 & -0.5507 \\ 1.6470 & 0.3173 & 0.5242 \\ 2.0412 & 0.2458 & 1.6666 \\ 1.5460 & 0.2431 & 1.1800 \\ 1.6145 & 0.2871 & 1.5270 \\ 1.0075 & 0.4310 & 0.0520 \\ 0.9114 & 0.2611 & 0.1972 \end{bmatrix} & \begin{bmatrix} 0.2001 & 0.1991 & 0.6198 \\ 0.7546 & 0.6788 & 1.3418 \\ 0.9585 & 1.3736 & 1.2265 \\ 0.6461 & 0.7333 & 0.9554 \\ -0.1936 & 0.0401 & -0.0691 \\ -0.0077 & 0.6065 & -0.0870 \\ 0.2914 & 0.4476 & 0.7973 \\ 0.7873 & 0.7108 & 1.3539 \\ 0.5923 & 0.6145 & 0.9882 \\ 0.7935 & 0.8368 & 1.1251 \\ 0.1999 & 0.6229 & 0.3577 \\ 0.1711 & 0.3754 & 0.3957 \end{bmatrix} \\ \end{array}$$

NOTE :

In actual practice, however, the dimension size will be much larger, such as 512, 768, or 1024. In those cases, the resultant \(Q\), \(K\), and \(V\) matrices would be of the order \(N \times \frac{D}{\text{No. of Heads}}\).

Since the example considered here has only one attention head, the \(Q\), \(K\), and \(V\) matrices are of the order \(N \times D\).
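If you'd like to reproduce this step yourself, here is a minimal NumPy sketch. The input matrix from Part 1 is not reprinted here, so a random placeholder `X` stands in for it (which means the \(Q\), \(K\), and \(V\) values it prints will differ from the matrices above); the weight matrices, however, match the first draws of NumPy's legacy RNG seeded with 42, which appears to be how the values shown above were generated:

```python
import numpy as np

np.random.seed(42)  # these draws match the W_q, W_k, W_v values printed above

D = 3          # toy model dimension
num_heads = 1  # one attention head, so each head's width is D // num_heads = 3
head_dim = D // num_heads

# Per-head weight matrices of order D x (D / No. of Heads)
W_q = np.random.rand(D, head_dim)
W_k = np.random.rand(D, head_dim)
W_v = np.random.rand(D, head_dim)

# Placeholder for the N x D input matrix from Part 1 (embeddings +
# positional encoding); the real values live in Part 1, so the Q, K, V
# computed here will not match the matrices printed above.
N = 12
X = np.random.rand(N, D)

Q = X @ W_q  # what each token is asking about
K = X @ W_k  # what each token offers to others
V = X @ W_v  # the content each token carries

print(Q.shape, K.shape, V.shape)  # (12, 3) each
```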

📌 A deeper intuition behind the Query, Key, and Value vectors goes beyond the scope of this blog. However, you can explore it in detail in Understanding the Role of Query, Key, and Value Matrices in Transformer Models.

1.2 Computing the Alignment Scores

In transformer models, the alignment score matrix, denoted as \(Score(Q,K)\), plays a pivotal role in determining the relevance or importance of each token to every other token in a sequence.

What is the Alignment Score Matrix?

The \(Score(Q, K)\) matrix represents how relevant or important each token is to every other token in the sequence. It determines the degree of alignment between the "questions" (Queries) being asked by a token and the "metadata" (Keys) provided by all other tokens. You can think of it as a "relevance map" that guides how much attention each token should pay to every other token.

Formula for Alignment Score

$$\text{Score}(Q, K) = \frac{QK^{T}}{\sqrt{\frac{D}{\text{No. of Heads}}}}$$

Here:

  • \(Q\): Query matrix

  • \(K\): Key matrix

  • \(D\): Dimensionality of the model

  • \(\text{No. of Heads}\) : Number of attention heads (in this case, only one)

Since there is only one head in our example, we have only one score matrix. The division by \(\sqrt{\frac{D}{\text{No. of Heads}}}\) keeps the dot products from growing with the dimension size; without this scaling, large scores would push the softmax in the next step into a saturated region with near-one-hot weights and tiny gradients.

How It Works?

A token's Query vector ( \(Q\) ) represents what it seeks to understand, while the Key vectors ( \(K\) ) of other tokens describe the information they offer. The dot product of \(Q\) and \(K\) produces an alignment score that measures how useful a token's information is for answering what the query seeks. This score determines how much attention each token pays to every other token.

How to Interpret the Score(Q, K) Matrix?

The \(\text{Score}(Q, K)\) matrix provides a map of token-to-token importance. Each element reflects:

  1. Semantic Relationship: The dot product between \(Q\) and \(K\) captures the similarity between the query token and key tokens. A higher score indicates stronger semantic alignment.

  2. Relative Importance: The scores reflect the raw importance of each key token with respect to the query token. These are used to compute attention weights.

  3. Contextual Focus: The alignment scores guide the query token in determining which tokens to "pay attention to" while constructing its representation.

Analogy: A Group Discussion

Think of a conversation where:

  • Each participant (token) asks questions ( \(Q\) ) to understand the context.

  • Every participant also shares their expertise ( \(K\) ) to provide useful information.

  • The \(\text{Score}(Q, K)\) matrix acts as a relevance chart, showing who best answers whose questions.

For example, if someone asks about timelines, a participant specializing in timelines will have a high alignment score, influencing the discussion more.

Example: Alignment Score Matrix in Practice

The alignment score matrix for our use case is given below:

$$\text{Score}(Q, K) = \begin{bmatrix} 0.6727 & 1.3030 & \dots & 0.5140 \\ 1.3371 & 2.8624 & \dots & 1.0407 \\ 1.3881 & 3.0150 & \dots & 1.0887 \\ \vdots & \vdots & \ddots & \vdots \\ 0.5179 & 0.9806 & \dots & 0.3962 \end{bmatrix}$$

Properties:

  • It is a square matrix of order \(N \times N\) where \(N\) is the number of tokens.

  • Each cell indicates the alignment score between a pair of tokens in the sequence.
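Continuing the sketch from section 1.1 (reusing `Q`, `K`, and `head_dim` from that snippet), the alignment score matrix is a single scaled matrix product:

```python
# Scaled dot-product alignment scores: an N x N "relevance map".
scores = (Q @ K.T) / np.sqrt(head_dim)
print(scores.shape)  # (12, 12): row i holds token i's scores against every token
```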


1.3 Normalizing the Alignment Scores to Obtain Attention Weights

The \(\text{Score}(Q, K)\) matrices computed for all heads are then normalized using a softmax function to obtain the attention weights, denoted by \(\alpha\). Applying softmax to the alignment scores normalizes them into a probability distribution, making the scores more interpretable and ensuring they can be treated as weights for aggregating the value vectors.

What is the Softmax Function?

The softmax function converts raw scores into probabilities, normalizes values within [0,1], and ensures they sum to 1.

The formula for the softmax activation function is:

$$\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}}$$

Where:

  • \(z_i\) is the i-th input element to the softmax function.

  • \(e^{z_i}\) is the exponential function applied to \(z_i\).

  • \(n\) is the total number of input elements.

  • \(\sum_{j=1}^n e^{z_j}\) is the sum of the exponential of all input elements, which normalizes the output.
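As a minimal sketch, here is a numerically stable implementation of this formula. Subtracting the row maximum before exponentiating is a standard trick that leaves the result unchanged, since softmax is shift-invariant:

```python
def softmax(z, axis=-1):
    # Subtract the max so np.exp cannot overflow; the common factor
    # cancels in the ratio, so the output is unchanged.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)
```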

How Softmax Enhances Alignment Scores

Before Normalization:

The raw alignment scores \(\text{Score}(Q,K)\) represent similarity between Query and Key vectors but:

  • Can have arbitrary values, including negatives or large magnitudes.

  • Lack a clear, interpretable scale for weighting Value vectors effectively.

After Applying Softmax:

Softmax transforms these raw scores into a probability distribution, making them more meaningful and useful for attention computation:

  1. Ensuring Interpretability

    • Converts raw scores into probabilities that sum to 1 for each Query.

    • Allows the model to understand relative importance, ensuring each token's influence is bounded and weighted appropriately.

  2. Determining Relative Importance

    • The exponential function amplifies significant scores while suppressing smaller ones, sharpening attention towards highly relevant tokens.

    • The resulting attention scores help the model focus on the most important tokens in the sequence.

    • This essentially implies that softmax exaggerates score differences, ensuring attention is concentrated on the most relevant tokens.

  3. Capturing Relevance

    • Normalization ensures that each Query token distributes attention effectively across Key tokens.

    • Higher scores indicate stronger relevance, guiding the model to pay more attention to critical tokens while minimizing distractions.

    • This essentially implies that softmax normalizes scores into a probability distribution, ensuring proper weighting of token relevance.

Example: Attention Weights in Practice

For our use case, the attention weights ( \(\alpha\) ) are derived from the softmax-normalized alignment scores. The resulting weights are shown below:

$$\text{Attention Weights } (\alpha) = \begin{bmatrix} 0.0661 & 0.1241 & 0.1276 & \dots & 0.0999 & 0.0593 & 0.0564 \\ 0.0368 & 0.1691 & 0.1970 & \dots & 0.1197 & 0.0278 & 0.0258 \\ 0.0324 & 0.1651 & 0.2142 & \dots & 0.1275 & 0.0234 & 0.0219 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0.0661 & 0.1196 & 0.1271 & \dots & 0.1016 & 0.0609 & 0.0581 \\ 0.0685 & 0.1159 & 0.1217 & \dots & 0.0996 & 0.0635 & 0.0609 \end{bmatrix}$$
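In code, this step is a single row-wise softmax over the score matrix from section 1.2, using the `softmax` helper sketched above; each row of \(\alpha\) then sums to 1:

```python
# Row-wise softmax: row i is the attention distribution of query token i.
alpha = softmax(scores, axis=-1)
print(alpha.sum(axis=-1))  # every row sums to 1.0
```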

Intuitive Analogy:

Imagine you’re picking relevant answers in a quiz:

  • The alignment scores are the raw relevance ratings (e.g., 7/10, 9/10, etc.) for each possible answer.

  • The softmax converts these ratings into percentages (e.g., 60%, 30%, 10%), showing how much weight you give to each answer when combining them to form your understanding.

This analogy illustrates how the softmax function emphasizes the most relevant scores while normalizing the overall distribution.


1.4 Computing the Final Attention Output and Applying a Linear Transformation

Attention output for each head is computed as follows:

$$\text{Attention Output} = \alpha \cdot V$$

Where:

  • \(\alpha\) : Normalized attention scores (from \(\text{Softmax}(\text{Score}(Q,K))\)).

  • \(V\): Value matrix containing the information content for each token.

Breaking Down the Computation

  1. Attention Weights (Softmax Scores) – Define how much focus a Query token assigns to each Key token, controlling how Value vectors influence its representation.

  2. Value Vectors ( \(V\) ) – Contain task-relevant semantic information from each token, essential for model understanding.

  3. Weighted Summation ( \(\alpha \cdot V \) ) – Aggregates Value vectors based on attention weights, ensuring relevant tokens contribute more while refining contextual representations.

Why Multiply Attention Weights with the Value Matrix?

1. Combining Relevant Information

  • Attention weights ( \(\alpha\) ) indicate how much importance each token assigns to others.

  • The value matrix ( \(V\)) contains the actual content of each token.

  • Multiplying \(\alpha\) with \(V\) aggregates relevant information, weighted by importance.

2. Weighted Summation

  • The final token representation is a weighted sum of value vectors.

  • This ensures the model focuses on the most relevant tokens while minimizing distractions.

3. Contextualization

  • Value vectors ( \(V\) ) provide content, while attention weights ( \(\alpha\) ) dictate importance.

  • This transformation enhances token embeddings, making them context-aware and capturing relationships in the sequence.

Analogy: Writing a Summary

Imagine summarizing a paragraph:

  • Attention weights decide how much focus each sentence gets.

  • Value vectors hold the actual information from the sentences.

  • Weighted summation merges the most relevant details into a coherent summary.

Understanding the Attention Output Matrix

  • Each row represents a token in the sequence.

  • The row’s values form a vector that encodes the token’s meaning after attending to other tokens.

  • This contextual representation allows the model to understand words in relation to the entire input sequence.

Matrix Dimensions

  • The attention output matrix from each attention head has dimensions \(N \times \frac{D}{\text{No. of Heads}}\).

  • With \(H\) attention heads, we get \(H\) attention output matrices of the same size.

  • These matrices are horizontally stacked to form a single attention output matrix of size \(N \times D\).

$$\text{Attention Output} = \begin{bmatrix} 0.2131 & 0.3482 & 0.3967 \\ 0.8221 & 1.0589 & 1.3860 \\ 0.9912 & 1.2546 & 1.6513 \\ \vdots & \vdots & \vdots \\ 0.6240 & 0.8280 & 1.0567 \\ 0.1669 & 0.2925 & 0.3192 \\ 0.1561 & 0.2777 & 0.2999 \end{bmatrix}$$
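To close out the sketch: the attention output is one more matrix product, and in the general multi-head case the stacked head outputs then pass through a learned linear projection, the "linear transformation" from step 4 of the overview. The projection matrix `W_o` below is a hypothetical, randomly initialized stand-in:

```python
# Weighted summation of value vectors: N x head_dim for this head.
attention_output = alpha @ V

# With H heads, the H outputs would be horizontally stacked to N x D and
# passed through a learned projection. With one head the stacking is
# trivial; W_o here is a hypothetical randomly initialized stand-in.
W_o = np.random.rand(D, D)
final_output = attention_output @ W_o  # N x D
print(final_output.shape)  # (12, 3)
```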


Now that we’ve explored how self-attention works within a single attention head, the next step is to understand how multi-head attention enhances this process—enabling the transformer to capture multiple aspects of meaning from different perspectives.

📌 Check out Part 3 of the series, where we’ll dive deep into Multi-Head Attention and how it allows transformers to understand context even better! 🚀


Written by

Vikas Srinivasa

My journey into AI has been unconventional yet profoundly rewarding. Transitioning from a professional cricket career, a back injury reshaped my path, reigniting my passion for technology. Seven years after my initial studies, I returned to complete my Bachelor of Technology in Computer Science, where I discovered my deep fascination with Artificial Intelligence, Machine Learning, and NLP, particularly its applications in the finance sector.