Transformer Decoder: Forward Pass Mechanism and Key Insights (part 5)

Table of contents
- Understanding the Decoder in the Training Phase
- Decoder Operations During Forward Propagation
- Preparing the Target Sequence for the Decoder
- Steps to Create the Shifted Target Sequence
- Converting to Vector Embeddings
- Adding Positional Encodings
- Masked Self-Attention in the Decoder
- What is Masked Self-Attention?
- How Does Masked Self-Attention Work?
- Computing Masked Self-Attention
- 3. Applying the Mask
- 4. Applying Softmax to Obtain Attention Weights
- 5. Computing the Masked Attention Output
- Conclusion

If you’re following along in our Transformer series, this is the fifth installment, where we dive deep into the decoder phase during forward propagation in training. Before proceeding, we highly recommend checking out Part 1, Part 2, Part 3, and Part 4 of this series to build a strong foundation on how encoders function within the Transformer architecture.
In a Transformer model, the decoder receives input from the previous decoder layer, refining and transforming it progressively through multiple layers. This process plays a crucial role in both the training phase and inference phase. While inference involves generating predictions step by step, training requires forward propagation to adjust model weights based on loss calculations.
In this blog, we will focus on the forward propagation mechanism during the training phase of the Transformer decoder. We will break down key operations, including self-attention, cross-attention, and feed-forward networks, to understand how data flows through the decoder during training.
Let’s get started!
Understanding the Decoder in the Training Phase
The Transformer decoder operates differently during the training phase compared to the inference phase. In training, it employs a strategy known as Teacher Forcing, which significantly influences how the model learns sequence relationships.
What is Teacher Forcing?
Teacher forcing is a training technique commonly used in sequence-to-sequence (Seq2Seq) models. Instead of using the model’s own predictions as input for the next step, the actual ground truth sequence (target sequence) is provided.
Why Is Teacher Forcing Used?
Stabilizes Training – Early in training, the model’s predictions are often inaccurate. By using the correct target sequence as input, teacher forcing prevents error accumulation and allows the model to learn effectively.
Faster Convergence – The model is guided by the correct sequence, helping it reach an optimal solution more quickly.
Focus on Learning Relationships – By eliminating errors from previous predictions, the model learns the relationships between input and output sequences rather than struggling with its own mistakes.
Decoder Operations During Forward Propagation
During training, the decoder follows a specific sequence of operations at each step:
Masked Self-Attention – Ensures the model attends only to previous tokens, preventing information leakage from future tokens.
Cross-Attention – Integrates the encoder’s output with the decoder’s partially generated sequence to align the input and output.
Feed-Forward Layer – Applies non-linear transformations to extract meaningful features.
Residual Connection – Aids in stable gradient flow, preventing vanishing gradients.
Layer Normalization – Stabilizes training by maintaining consistent activation distributions.
To fully grasp how the decoder processes input, let's first understand what inputs it receives and how they are prepared.
Preparing the Target Sequence for the Decoder
The decoder receives two key inputs:
The encoder output – Encapsulates the contextualized representation of the input sequence.
A partially generated ground truth sequence – A modified version of the target sequence that helps guide training.
Example: Input and Target Sequence
English Input (Source Sequence)
"My name is Vikas. I love cricket, finance, and AI."
French Translation (Target Sequence)
"Mon nom est Vikas. J'aime le cricket, la finance et l'IA."
However, the decoder does not directly use the target sequence as input. Instead, it shifts the target sequence to ensure an auto-regressive learning process.
Steps to Create the Shifted Target Sequence
To train the decoder properly, the target sequence is shifted to the right:
A special <start> (start-of-sequence) token is added at the beginning.
The last token is removed to maintain the same sequence length.
Example: Shifted Target Sequence
Original Target Sequence (Ground Truth)
["Mon", "nom", "est", "Vikas.", "J'aime", "le", "cricket,", "la", "finance", "et", "l'IA", "."]
Shifted Target Sequence (Decoder Input)
["<start>", "Mon", "nom", "est", "Vikas.", "J'aime", "le", "cricket,", "la", "finance", "et", "l'IA"]
Why Shift the Target Sequence?
This shifted input ensures that, at each decoding step, the model generates the next token using only the previous tokens.
When generating "Mon", the model sees only "<start>".
When generating "nom", the model sees "<start>" and "Mon".
When generating "est", the model sees "<start>", "Mon", and "nom".
This setup enforces an auto-regressive approach, mimicking how the decoder will function during inference while still leveraging teacher forcing for stable training.
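As a quick illustration, here is a minimal Python sketch of this shift (the token list is the illustrative one above, not the output of a real tokenizer):

```python
# Build the decoder input by shifting the ground-truth target sequence right.
target_tokens = ["Mon", "nom", "est", "Vikas.", "J'aime", "le",
                 "cricket,", "la", "finance", "et", "l'IA", "."]

# Prepend the start-of-sequence token and drop the last token so the decoder
# input keeps the same length as the prediction target.
decoder_input = ["<start>"] + target_tokens[:-1]

print(decoder_input)
# ['<start>', 'Mon', 'nom', 'est', 'Vikas.', "J'aime", 'le',
#  'cricket,', 'la', 'finance', 'et', "l'IA"]
```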
Converting to Vector Embeddings
Before the decoder can process the target sequence, it must be converted into vector embeddings with the same dimensionality as the encoder output. This transformation ensures that each token is represented in a high-dimensional space, enabling the model to capture semantic relationships.
The vector embedding of the shifted target sequence results in a matrix of order \(M \times D\), where:
\(M\) represents the number of tokens in the target sequence.
\(D\) is the embedding dimension.
This embedded target sequence is denoted as \(T_{input}\).
$$T_{\text{input}} = \begin{bmatrix} 0.45 & 0.67 & -0.12 \\ 0.80 & 0.05 & -0.25 \\ 0.60 & 0.90 & 0.02 \\ 0.85 & 0.10 & 0.95 \\ 0.12 & 0.22 & 0.55 \\ -0.10 & 0.03 & -0.07 \\ 0.40 & 0.75 & -0.20 \\ 0.15 & -0.01 & 0.04 \\ -0.20 & -0.10 & 0.04 \\ 0.30 & 0.40 & 0.08 \end{bmatrix}$$
📌 For a detailed explanation of how token embeddings are generated, refer to How Transformers Work: Tokenization Embeddings and Positional Encoding Explained (Part 1)
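A rough sketch of this lookup is shown below (NumPy, with a made-up vocabulary and randomly initialized values standing in for the learned embedding table behind the matrix above):

```python
import numpy as np

# Hypothetical vocabulary; a real model uses a trained tokenizer.
vocab = {"<start>": 0, "Mon": 1, "nom": 2, "est": 3, "Vikas.": 4, "J'aime": 5,
         "le": 6, "cricket,": 7, "la": 8, "finance": 9, "et": 10, "l'IA": 11}
D = 3                                              # embedding dimension in this example
embedding_table = np.random.randn(len(vocab), D)   # learned jointly with the model

decoder_input = ["<start>", "Mon", "nom", "est", "Vikas.", "J'aime",
                 "le", "cricket,", "la", "finance", "et", "l'IA"]
token_ids = [vocab[token] for token in decoder_input]

T_input = embedding_table[token_ids]               # shape (M, D): one row per token
```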
Adding Positional Encodings
Since Transformers do not have a built-in notion of word order, we must add positional encodings to the embedded target sequence. These positional encodings inject sequential information into the embeddings, allowing the model to understand the relative position of words in a sentence.
The positional encodings are generated using the same sinusoidal function-based method discussed in the encoder. Adding these encodings results in the partially generated target sequence, which is then fed into the decoder.
Below is the matrix representing the positional encodings used in this example, along with the matrix obtained after adding them to the embeddings.
$$PE = \begin{bmatrix} 0.0000 & 1.0000 & 0.0000 \\ 0.8415 & 0.9989 & 0.0022 \\ 0.9093 & 0.9957 & 0.0043 \\ 0.1411 & 0.9903 & 0.0065 \\ -0.7568 & 0.9828 & 0.0086 \\ -0.9589 & 0.9732 & 0.0108 \\ -0.2794 & 0.9615 & 0.0129 \\ 0.6570 & 0.9477 & 0.0151 \\ 0.9894 & 0.9318 & 0.0172 \\ 0.4121 & 0.9140 & 0.0194 \end{bmatrix} \quad T_{\text{input}} + PE = \begin{bmatrix} 0.45 & 1.67 & -0.12 \\ 1.6415 & 1.0489 & -0.2478 \\ 1.5093 & 1.8957 & 0.0243 \\ 0.9911 & 1.0903 & 0.9565 \\ -0.6368 & 1.2028 & 0.5586 \\ -1.0589 & 1.0032 & -0.0592 \\ 0.1206 & 1.7115 & -0.1871 \\ 0.8070 & 0.9377 & 0.0551 \\ 0.7894 & 0.8318 & 0.0572 \\ 0.7121 & 1.3140 & 0.0994 \end{bmatrix}$$
📌 To understand how positional encodings are computed and why they are necessary, refer to How Transformers Work: Tokenization Embeddings and Positional Encoding Explained (Part 1).
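For reference, a common NumPy implementation of the sinusoidal encoding is sketched below; depending on how the sine and cosine dimensions are paired, the exact values can differ slightly from the matrix shown above.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings: sine on even dimensions, cosine on odd."""
    positions = np.arange(seq_len)[:, np.newaxis]             # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                  # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                          # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# Add positional information to the embedded target sequence (M x D).
M, D = 10, 3
T_input = np.random.randn(M, D)      # placeholder for the embedding matrix above
decoder_input_with_pe = T_input + sinusoidal_positional_encoding(M, D)
```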
Masked Self-Attention in the Decoder
Now that we have embedded the target sequence and added positional encodings, the next step is to understand how the decoder processes this sequence using Masked Self-Attention. This mechanism ensures the model predicts tokens one step at a time without seeing future words, which is crucial for language generation tasks like translation and text generation.
What is Masked Self-Attention?
Masked self-attention is a specialized form of the attention mechanism, used exclusively in the decoder of a Transformer. It ensures that, when predicting a token, the model only attends to previous tokens and not future ones.
Without masking, the model could "cheat" during training by using future tokens, which wouldn't be available during inference. Masking ensures that the Transformer learns in an auto-regressive manner, just like during actual text generation.
🔍 Analogy: Think of reading a sentence one word at a time while covering the future words with your hand. This ensures that you rely only on the words you've already seen, rather than anticipating upcoming ones.
How Does Masked Self-Attention Work?
Masked attention works by using a masking matrix that prevents the decoder from looking at future tokens. This is done by setting certain positions in the attention score matrix to \(-\infty \) (or a very large negative number) before applying the softmax function.
When softmax is applied, these masked positions get a probability of \(0\), effectively excluding them from contributing to the final output.
This ensures that at any given step, the model only considers previous tokens when making a prediction.
Computing Masked Self-Attention
Just like in the encoder, masked attention follows a step-by-step computation process.
1. Compute Query, Key, and Value Matrices
To compute masked attention, we first generate the Query ( \(Q\) ), Key ( \(K\) ), and Value ( \(V\) ) matrices from the input sequence.
The query weight matrix \(W_Q\), key weight matrix \(W_K\), and value weight matrix \(W_V\) are randomly initialized at the beginning of training. These matrices help the model determine which tokens should focus on others and how much importance to assign.
$$\begin{array}{ccc} \textbf{Query Weight Matrix } (W_Q) & \textbf{Key Weight Matrix } (W_K) & \textbf{Value Weight Matrix } (W_V) \\ \begin{bmatrix} 0.3745 & 0.9507 & 0.7320 \\ 0.5987 & 0.1560 & 0.1560 \\ 0.0581 & 0.8662 & 0.6011 \end{bmatrix} & \begin{bmatrix} 0.7081 & 0.0206 & 0.9699 \\ 0.8324 & 0.2123 & 0.1818 \\ 0.1834 & 0.3042 & 0.5248 \end{bmatrix} & \begin{bmatrix} 0.4320 & 0.2912 & 0.6119 \\ 0.1395 & 0.2921 & 0.3664 \\ 0.4561 & 0.7852 & 0.1997 \end{bmatrix} \end{array}$$
The Query ( \(Q\) ), Key ( \(K\) ), and Value ( \(V\) ) matrices are then obtained by multiplying the input sequence with these weight matrices.
$$\begin{array}{ccc} \textbf{Query Matrix } (Q) & \textbf{Key Matrix } (K) & \textbf{Value Matrix } (V) \\ \begin{bmatrix} 1.1613 & 0.5844 & 0.5178 \\ 1.2283 & 1.5096 & 1.2162 \\ 1.7016 & 1.7517 & 1.4151 \\ 1.0795 & 1.9409 & 1.4705 \\ 0.5140 & 0.0661 & 0.0573 \\ 0.2005 & -0.9015 & -0.6542 \\ 1.0589 & 0.2196 & 0.2428 \\ 0.8668 & 0.9613 & 0.7701 \\ 0.7969 & 0.9298 & 0.7420 \\ 1.0591 & 0.9681 & 0.7860 \end{bmatrix} & \begin{bmatrix} 1.6868 & 0.3274 & 0.6771 \\ 1.9900 & 0.1811 & 1.6528 \\ 2.6512 & 0.4410 & 1.8213 \\ 1.7848 & 0.5429 & 1.6615 \\ 0.6528 & 0.4122 & -0.1058 \\ 0.0745 & 0.1732 & -0.8757 \\ 1.4758 & 0.3090 & 0.3300 \\ 1.3621 & 0.2325 & 0.9821 \\ 1.2619 & 0.2103 & 0.9469 \\ 1.6163 & 0.3239 & 0.9818 \end{bmatrix} & \begin{bmatrix} 0.3726 & 0.5247 & 0.8632 \\ 0.7423 & 0.5899 & 1.3392 \\ 0.9275 & 1.0125 & 1.6228 \\ 1.0164 & 1.3582 & 1.1968 \\ 0.1475 & 0.6045 & 0.1626 \\ -0.3444 & -0.0618 & -0.2922 \\ 0.2055 & 0.3882 & 0.6635 \\ 0.5045 & 0.5522 & 0.8483 \\ 0.4831 & 0.5178 & 0.7992 \\ 0.5362 & 0.6693 & 0.9369 \end{bmatrix} \end{array}$$
📌 For a detailed breakdown of how these matrices are computed, refer to Understanding the Role of Query, Key, and Value Matrices in Transformer Models .
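The sketch below reproduces this step in NumPy: the decoder input matrix (\(T_{input} + PE\)) is hard-coded from the values above, and drawing the three weight matrices with np.random.rand under seed 42 appears to give the same \(W_Q\), \(W_K\), and \(W_V\) listed above.

```python
import numpy as np

np.random.seed(42)

# Decoder input after embeddings + positional encodings (the T_input + PE matrix above).
X = np.array([
    [ 0.45,    1.67,   -0.12  ],
    [ 1.6415,  1.0489, -0.2478],
    [ 1.5093,  1.8957,  0.0243],
    [ 0.9911,  1.0903,  0.9565],
    [-0.6368,  1.2028,  0.5586],
    [-1.0589,  1.0032, -0.0592],
    [ 0.1206,  1.7115, -0.1871],
    [ 0.8070,  0.9377,  0.0551],
    [ 0.7894,  0.8318,  0.0572],
    [ 0.7121,  1.3140,  0.0994],
])
D = X.shape[1]

# Randomly initialized projection weights (learned during training).
W_Q = np.random.rand(D, D)
W_K = np.random.rand(D, D)
W_V = np.random.rand(D, D)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # each of shape (M, D)
print(np.round(Q, 4))                 # first row: [1.1613, 0.5844, 0.5178]
```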
2. Compute Alignment Scores
Once we have \(Q\), \(K\), and \(V\), the next step is to compute the alignment scores using the scaled dot-product attention formula:
$$Score(Q, K) = \frac{Q \cdot K^T}{\sqrt{\frac{D}{\text{No. of Heads}}}}$$
Here:
\(Q\): Query matrix
\(K\): Key matrix
\(D\): Dimensionality of the model
\(\text{No. of Heads}\) : Number of attention heads (in this example, we assume one head)
Since we are using only one attention head, we compute a single alignment score matrix.
$$\textbf{Score(Q, K)} = \begin{bmatrix} 1.44 & 1.88 & 2.47 & \dots & 1.48 \\ 1.95 & 2.72 & 3.54 & \dots & 2.11 \\ 2.54 & 3.48 & 4.53 & \dots & 2.71 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1.52 & 2.06 & 2.69 & \dots & 1.61 \end{bmatrix}$$
📌 For a deeper understanding of scaled dot-product attention, check out this blog where I explain how alignment scores are computed.
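A minimal sketch of this step, continuing from the Q and K computed above with a single head:

```python
import numpy as np

def alignment_scores(Q: np.ndarray, K: np.ndarray, num_heads: int = 1) -> np.ndarray:
    """Scaled dot-product alignment scores: (Q @ K^T) / sqrt(D / num_heads)."""
    d_head = Q.shape[-1] / num_heads
    return (Q @ K.T) / np.sqrt(d_head)

# scores = alignment_scores(Q, K)   # Q, K from the previous sketch -> (M, M) matrix
```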
3. Applying the Mask
Creating the Mask Matrix
A static triangular mask matrix is used during training to ensure that a token only attends to past and current tokens, preventing information leakage from future tokens.
Steps to Construct the Mask Matrix
Define Sequence Length ( \(M\) ) – Let \(M\) be the number of tokens in the target sequence.
Initialize a Full Matrix – Create an \(M \times M\) matrix filled with zeros.
Apply Masking to Future Tokens – Use an upper-triangular function to set all positions where \(j > i\) (i.e., future tokens) to \(-\infty\).
The resulting mask matrix ensures that the model only considers current and past tokens for attention.
$$\textbf{Mask Matrix} = \begin{bmatrix} 0 & -\infty & -\infty & \dots & -\infty \\ 0 & 0 & -\infty & \dots & -\infty \\ 0 & 0 & 0 & \dots & -\infty \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & 0 \end{bmatrix}$$
This mask is then added to the alignment score matrix, ensuring that future tokens receive an attention weight of zero when softmax is applied.
Computing the Masked Alignment Scores
Once the mask matrix is created, it is applied to the alignment score matrix as follows:
$$\text{Masked Score(Q, K)} = \text{Score(Q, K)} + \text{Mask Matrix}$$
This ensures that all future token positions contribute zero probability to the final attention output. The masked attention matrix in the decoder represents the relevance of each token in the target (ground truth) sequence with respect to every other token in the sequence, but only within the context of the tokens that are available for attention up to that point.
$$\textbf{Masked Score(Q, K)} = \begin{bmatrix} 1.44 & -\infty & -\infty & \dots & -\infty \\ 1.88 & 2.72 & -\infty & \dots & -\infty \\ 2.47 & 3.54 & 4.53 & \dots & -\infty \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1.48 & 2.11 & 2.71 & \dots & 1.61 \end{bmatrix}$$
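A sketch of both operations, building the triangular mask and adding it to the alignment scores, is shown below (NumPy; the helper name is my own):

```python
import numpy as np

def causal_mask(M: int) -> np.ndarray:
    """(M, M) mask: 0 for current/past positions, -inf for future positions."""
    mask = np.zeros((M, M))
    mask[np.triu_indices(M, k=1)] = -np.inf   # strictly upper triangle = future tokens
    return mask

print(causal_mask(4))
# Applying the mask is a simple element-wise addition:
# masked_scores = alignment_scores(Q, K) + causal_mask(Q.shape[0])
```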
4. Applying Softmax to Obtain Attention Weights
Now that we have the masked scores, the next step is to normalize them using the softmax function.
Softmax converts the masked alignment scores into a probability distribution that determines how much attention each token should receive. The formula used is:
$$\alpha = \text{Softmax}(\text{Masked Score}(Q, K))$$
Where:
\(\alpha\) represents the attention weights after applying softmax.
Higher attention scores indicate stronger relevance between tokens.
After this step, we obtain the attention weight matrix, which is used to compute the final masked attention output.
$$\textbf{Attention Weight Matrix} = \begin{bmatrix} 1.0000 & 0.0000 & 0.0000 & 0.0000 & \dots & 0.0000 \\ 0.3015 & 0.6985 & 0.0000 & 0.0000 & \dots & 0.0000 \\ 0.0845 & 0.2471 & 0.6684 & 0.0000 & \dots & 0.0000 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0.0796 & 0.1497 & 0.2727 & 0.1631 & \dots & 0.0905 \end{bmatrix}$$
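A row-wise softmax sketch that turns the masked scores into attention weights (the \(-\infty\) entries become exactly zero):

```python
import numpy as np

def softmax_rows(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax; masked (-inf) positions end up with weight 0."""
    shifted = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# attention_weights = softmax_rows(masked_scores)   # (M, M); each row sums to 1
```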
5. Computing the Masked Attention Output
The final masked attention output is obtained by multiplying the attention weight matrix ( \(\alpha\) ) with the Value ( \(V\) ) matrix:
$$\text{Output} = \alpha \cdot V$$
This multiplication ensures that the attention mechanism weights the Value matrix according to the computed attention scores, resulting in a context-aware representation for each token.
$$\textbf{Output} = \begin{bmatrix} 0.3726 & 0.5247 & 0.8632 \\ 0.6308 & 0.5702 & 1.1957 \\ 0.8348 & 0.8669 & 1.4885 \\ 0.8701 & 0.9780 & 1.3929 \\ 0.7138 & 0.8626 & 1.1544 \\ 0.3403 & 0.5605 & 0.6270 \\ 0.6688 & 0.8064 & 1.1348 \\ 0.7030 & 0.8258 & 1.1677 \\ 0.6746 & 0.7918 & 1.1219 \\ 0.6909 & 0.8052 & 1.1497 \end{bmatrix}$$
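A tiny toy example of this weighting (values chosen by hand, not taken from the walkthrough above) makes the effect visible: each output row is a mix of only the V rows that the causal mask lets that position attend to.

```python
import numpy as np

# Toy example with M = 3 tokens and D = 2 dimensions.
attention_weights = np.array([[1.0, 0.0, 0.0],
                              [0.3, 0.7, 0.0],
                              [0.1, 0.2, 0.7]])
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

attention_output = attention_weights @ V   # shape (M, D)
print(attention_output)
# [[1.  0. ]
#  [0.3 0.7]
#  [0.8 0.9]]
```

Note how the first output row equals the first row of V, just as the first row of the Output matrix above equals the first row of the Value matrix: the first token can only attend to itself.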
Final Output Projection
If multiple attention heads were used, the outputs from all heads would be concatenated. However, since our example uses only one attention head, the masked attention output matrix is passed directly to the projection step below.
To prepare this output for the next layers, it is projected using a learnable weight matrix \(W_{dO}\):
$$\text{Final Output} = \text{Masked Attention Output} \cdot W_{dO}$$
This projection layer is used to blend the outputs from multiple heads (if present) into a single cohesive unit, just like in the encoder.
The learned weight matrix used for the projection and the attention output after applying it are:
$$W_{dO} = \begin{bmatrix} 0.3745 & 0.9507 & 0.7320 \\ 0.5987 & 0.1560 & 0.1560 \\ 0.0581 & 0.8662 & 0.6011 \end{bmatrix} \qquad \text{Final Projected Output} = \begin{bmatrix} 0.5038 & 1.1838 & 0.8735 \\ 0.6471 & 1.7244 & 1.2695 \\ 0.9181 & 2.2183 & 1.6411 \\ 0.9922 & 2.1863 & 1.6267 \\ 0.8508 & 1.8131 & 1.3510 \\ 0.4994 & 0.9540 & 0.7134 \\ 0.7992 & 1.7446 & 1.2975 \\ 0.8255 & 1.8087 & 1.3454 \\ 0.7918 & 1.7366 & 1.2917 \\ 0.8076 & 1.7783 & 1.3224 \end{bmatrix}$$
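Pulling the five steps together, here is a compact single-head sketch of the whole masked self-attention sub-layer (function and argument names are my own, not from a specific library):

```python
import numpy as np

def masked_self_attention(X, W_Q, W_K, W_V, W_O, num_heads=1):
    """Single-head masked self-attention sub-layer, following the steps above."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                    # 1. Q, K, V projections
    d_head = X.shape[-1] / num_heads
    scores = (Q @ K.T) / np.sqrt(d_head)                   # 2. alignment scores
    M = X.shape[0]
    mask = np.zeros((M, M))
    mask[np.triu_indices(M, k=1)] = -np.inf                # 3. causal mask
    masked = scores + mask
    weights = np.exp(masked - masked.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)              # 4. softmax -> attention weights
    return (weights @ V) @ W_O                             # 5. weighted values + projection
```

Feeding it the decoder input matrix and weight matrices from the earlier sketches returns the \(M \times D\) matrix that is handed on to the next sub-layers of the decoder.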
Conclusion
In this blog, we explored how the Transformer decoder processes input during training, focusing on teacher forcing, masked self-attention, and target sequence preparation. We also broke down the step-by-step computation of masked attention, ensuring the model learns in an auto-regressive manner without peeking at future tokens.
In the next installment, we'll dive into encoder-decoder cross-attention, residual connections, and feed-forward networks—key components that refine the decoder's output. Stay tuned!
Written by

Vikas Srinivasa
My journey into AI has been unconventional yet profoundly rewarding. A back injury ended my professional cricket career and reshaped my path, reigniting my passion for technology. Seven years after my initial studies, I returned to complete my Bachelor of Technology in Computer Science, where I discovered my deep fascination with Artificial Intelligence, Machine Learning, and NLP, particularly its applications in the finance sector.