Transformer Decoders Explained: The Process of Backpropagation and Inference (Part 7)


In our previous blogs, we explored the decoder phase of the Transformer in detail, covering its architecture, attention mechanisms, and how it processes input sequences. If you haven’t read those yet, I highly recommend checking them out for a strong foundational understanding before diving into this one.
Check out my blog on the decoder phase of Transformers here - Transformer Decoder: Forward Pass Mechanism and Key Insights (Part 5).
Now, in this final installment of our series on Transformers, we go a step further—unveiling the mathematical underpinnings of backpropagation and how gradients guide learning in the training phase. We will also take a deep dive into the inference phase, where the Transformer generates text in an auto-regressive manner, one token at a time.
By the end of this blog, you will have a clear understanding of how Transformer decoders optimize learning and generate accurate sequences in real-world applications like machine translation, text generation, and beyond. Let’s get started! 🚀
Decoder - Backpropagation in the Training Phase
The weights and biases across all the various layers are learned during backpropagation. We will use the gradient descent algorithm for this. During the training phase, the transformer's output is a matrix of order \(M \times V_c\), where \(M\) is the number of tokens in the ground truth sequence and \(V_c\) is the total number of unique tokens in the target vocabulary. This matrix essentially represents each token position in the ground truth as a row and each unique token in the vocabulary as a column. The value in each cell therefore denotes the probability that the corresponding column's token is the correct token for that row's position in the ground truth sequence.
We compute how far off the transformer's predictions are from the ground truth using a loss function, and we use backpropagation to compute how much each weight and bias contributes to this loss by finding the gradient of the loss with respect to each of them. Gradient descent then uses these gradients to update the weights and biases. This is repeated until the loss is minimized to a desired value or for a set number of iterations called epochs.
Loss Function for Training Transformers for Machine Translation
The most commonly used loss function for training transformers on sequence-to-sequence tasks like machine translation is the Cross-Entropy Loss. This is a token-level loss that measures the difference between the predicted token probabilities and the ground truth tokens in the target sequence.
1. Cross-Entropy Loss
The loss for a single target sequence is computed as:
$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^T \sum_{i=1}^{|V_c|} y_t^{\text{true}}[i] \log(y_t^{\text{pred}}[i])$$
Where:
\(T\) : Length of the target sequence.
\(|V_c|\) : Size of the target vocabulary.
\(y_t^{\text{true}}\) : One-hot encoded true token at time step \(t\).
\(y_t^{\text{pred}}\) : Predicted probability distribution over the vocabulary for the t-th token (output of the softmax layer).
The prediction probabilities \(y_t^{\text{pred}}\) come from the decoder output after applying the softmax function.
For each time step \(t\), the loss penalizes the model for predicting probabilities far from the ground truth token's one-hot encoding.
The average loss over all tokens in the sequence ensures equal contribution regardless of sequence length.
Why Cross-Entropy for Machine Translation?
Multi-Class Output: At each time step, the decoder predicts the next token from a vocabulary of size \(|V_c|\), making it a multi-class classification problem.
Alignment with Likelihood Maximization: Cross-entropy loss directly corresponds to minimizing the negative log-likelihood of the correct sequence.
Efficient Backpropagation: Cross-entropy loss is differentiable, enabling gradient computation for weight updates during backpropagation.
How It Works for Each Position
At timestep \(t\):
The decoder generates \(y_t^{\text{pred}}\) , a probability distribution over the vocabulary using a softmax layer.
The ground truth token for timestep \(t\), \(y_t^{\text{true}}\), is a one-hot vector where the index corresponding to the correct token is 1, and the rest are 0.
The Cross-Entropy Loss computes:
$$\mathcal{L}_t = - \sum_{i=1}^{|V_c|} y_t^{\text{true}}[i] \log(y_t^{\text{pred}}[i])$$
Since \(y_t^{\text{true}}\) is one-hot encoded, this simplifies to:
$$\mathcal{L}_t = - \log(y_t^{\text{pred}}[\text{correct_index}])$$
Only the probability assigned to the correct token contributes to the loss.
The total loss over the sequence is the average of these token-level losses:
$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^T \log(y_t^{\text{pred}}[\text{correct_index}])$$
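To make this concrete, here is a minimal NumPy sketch of the same computation, using made-up softmax outputs for a three-token target sequence over a five-token vocabulary (both purely illustrative):

```python
import numpy as np

# Hypothetical softmax outputs of the decoder: one row per target position (T = 3),
# one column per vocabulary token (|Vc| = 5).
y_pred = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.05, 0.80, 0.05, 0.05, 0.05],
    [0.10, 0.10, 0.60, 0.10, 0.10],
])

# Index of the correct (ground-truth) token at each position.
correct_index = np.array([0, 1, 2])

# Per-token loss: L_t = -log(y_t_pred[correct_index_t]).
token_losses = -np.log(y_pred[np.arange(len(correct_index)), correct_index])

# Averaging over the sequence gives the total loss L.
loss = token_losses.mean()
print(token_losses, loss)
```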
Computing gradients
In a simplified Transformer model with a single encoder (one self-attention head and one feedforward layer), a single decoder (one masked attention head, one encoder-decoder attention layer, and one feedforward layer), and one final output layer, the gradients that need to be computed and the corresponding chain rule equations are listed below:
Gradients to Compute in Each Layer
1. Encoder
Multi-Head Self-Attention Layer:
Gradients for Query, Key, and Value weights ( \(W_Q, W_K, W_V\) ):
$$\frac{\partial \mathcal{L}}{\partial W_Q},\frac{\partial \mathcal{L}}{\partial W_K} , \frac{\partial \mathcal{L}}{\partial W_V}$$
Gradient for the attention projection weight ( \(W_O\) ):
$$\frac{\partial \mathcal{L}}{\partial W_O}$$
Feedforward Layer:
Gradients for the two feedforward weights ( \(W_1, W_2\)):
$$\frac{\partial \mathcal{L}}{\partial W_1}, \frac{\partial \mathcal{L}}{\partial W_2}$$
Gradients for biases ( \(b_1, b_2\)):
$$\frac{\partial \mathcal{L}}{\partial b_1},\frac{\partial \mathcal{L}}{\partial b_2}$$
2. Decoder
Masked Multi-Head Attention Layer:
Gradients for Query, Key, and Value weights ( \(W_Q^{\text{masked}}, W_K^{\text{masked}}, W_V^{\text{masked}}\)):
$$\frac{\partial \mathcal{L}}{\partial W_Q^{\text{masked}}},\frac{\partial \mathcal{L}}{\partial W_K^{\text{masked}}},\frac{\partial \mathcal{L}}{\partial W_V^{\text{masked}}}$$
Gradient for the attention projection weight ( \(W_O^{\text{masked}}\)):
$$\frac{\partial \mathcal{L}}{\partial W_O^{\text{masked}}}$$
Encoder-Decoder Attention Layer:
Gradients for Query, Key, and Value weights ( \(W_Q^{\text{enc-dec}}, W_K^{\text{enc-dec}}, W_V^{\text{enc-dec}}\)):
$$\frac{\partial \mathcal{L}}{\partial W_Q^{\text{enc-dec}}},\frac{\partial \mathcal{L}}{\partial W_K^{\text{enc-dec}}},\frac{\partial \mathcal{L}}{\partial W_V^{\text{enc-dec}}}$$
Gradient for the attention projection weight ( \(W_O^{\text{enc-dec}}\)):
$$\frac{\partial \mathcal{L}}{\partial W_O^{\text{enc-dec}}}$$
Feedforward Layer:
Gradients for the two feedforward weights ( \(W_1^{\text{dec}}, W_2^{\text{dec}}\)):
$$\frac{\partial \mathcal{L}}{\partial W_1^{\text{dec}}},\frac{\partial \mathcal{L}}{\partial W_2^{\text{dec}}}$$
Gradients for biases ( \(b_1^{\text{dec}}, b_2^{\text{dec}}\)):
$$\frac{\partial \mathcal{L}}{\partial b_1^{\text{dec}}},\frac{\partial \mathcal{L}}{\partial b_2^{\text{dec}}}$$
Final Output Layer:
Gradient for the output projection weight ( \(W_{\text{out}}\) ):
$$\frac{\partial \mathcal{L}}{\partial W_{\text{out}}}$$
Gradient for the output bias ( \(b_{\text{out}}\) ):
$$\frac{\partial \mathcal{L}}{\partial b_{\text{out}}}$$
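In practice these gradients are rarely derived by hand; automatic differentiation computes all of them from the loss. As a tiny, self-contained PyTorch illustration, a single linear layer stands in for any of the weight and bias pairs listed above, and its gradients appear on the .grad attributes after one backward pass:

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 3)                        # stands in for any (W, b) pair above
x = torch.randn(2, 4)                          # toy batch of 2 inputs
target = torch.tensor([0, 2])                  # toy ground-truth class indices

loss = nn.CrossEntropyLoss()(layer(x), target) # cross-entropy loss on the toy batch
loss.backward()                                # backpropagation fills dL/dW and dL/db

print(layer.weight.grad.shape)                 # gradient w.r.t. W, same shape as W
print(layer.bias.grad.shape)                   # gradient w.r.t. b, same shape as b
```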
Chain Rule Equations for Each Gradient
Final Output Layer Gradients:
Output weights ( \(W_{\text{out}}\) ):
$$\frac{\partial \mathcal{L}}{\partial W_{\text{out}}} = \frac{\partial \mathcal{L}}{\partial y_t^{\text{pred}}} \cdot \frac{\partial y_t^{\text{pred}}}{\partial z_t} \cdot \frac{\partial z_t}{\partial W_{\text{out}}}$$
Where:
- \(z_t\) : Logits before the softmax.
Output bias ( \(b_{\text{out}}\) ):
$$\frac{\partial \mathcal{L}}{\partial b_{\text{out}}} = \frac{\partial \mathcal{L}}{\partial y_t^{\text{pred}}} \cdot \frac{\partial y_t^{\text{pred}}}{\partial z_t} \cdot \frac{\partial z_t}{\partial b_{\text{out}}}$$
Decoder Feedforward Gradients:
First layer weights ( \(W_1^{\text{dec}}\) ):
$$\frac{\partial \mathcal{L}}{\partial W_1^{\text{dec}}} = \frac{\partial \mathcal{L}}{\partial h_2} \cdot \frac{\partial h_2}{\partial h_1} \cdot \frac{\partial h_1}{\partial W_1^{\text{dec}}}$$
Second layer weights ( \(W_2^{\text{dec}}\) ):
$$\frac{\partial \mathcal{L}}{\partial W_2^{\text{dec}}} = \frac{\partial \mathcal{L}}{\partial h_2} \cdot \frac{\partial h_2}{\partial W_2^{\text{dec}}}$$
Encoder-Decoder Attention Gradients:
Query weights ( \(W_Q^{\text{enc-dec}}\) ):
$$\frac{\partial \mathcal{L}}{\partial W_Q^{\text{enc-dec}}} = \frac{\partial \mathcal{L}}{\partial \alpha} \cdot \frac{\partial \alpha}{\partial Q} \cdot \frac{\partial Q}{\partial W_Q^{\text{enc-dec}}}$$
Similarly for \(W_K^{\text{enc-dec}}, W_V^{\text{enc-dec}},W_O^{\text{enc-dec}}\).
Masked Multi-Head Self-Attention Gradients:
- Same formulas as for the Encoder-Decoder Attention, but the queries, keys, and values all come from the decoder's own (target) input rather than the encoder output.
Encoder Feedforward and Attention Gradients:
- Identical to decoder feedforward and attention gradients, applied to the encoder's hidden states.
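One link in these chains is worth writing out explicitly: when the softmax and the cross-entropy loss are combined, the first two factors collapse to \(\frac{\partial \mathcal{L}_t}{\partial z_t} = y_t^{\text{pred}} - y_t^{\text{true}}\). The NumPy sketch below uses that shortcut to compute the output-layer gradients for a single timestep with made-up sizes; it illustrates the chain rule above rather than a full multi-timestep implementation:

```python
import numpy as np

D, Vc = 4, 5                        # hypothetical embedding and vocabulary sizes
h_t = np.random.randn(D)            # decoder hidden state feeding the output layer
W_out = np.random.randn(D, Vc)      # output projection weights
b_out = np.zeros(Vc)                # output projection bias

# Forward pass: logits, then a numerically stable softmax.
z_t = h_t @ W_out + b_out
y_pred = np.exp(z_t - z_t.max())
y_pred /= y_pred.sum()

# One-hot ground truth (say the correct token has index 2).
y_true = np.zeros(Vc)
y_true[2] = 1.0

# Softmax + cross-entropy shortcut: dL/dz_t = y_pred - y_true.
dL_dz = y_pred - y_true

# Chain rule for the output layer: z_t = h_t @ W_out + b_out.
dL_dW_out = np.outer(h_t, dL_dz)    # shape (D, Vc), same as W_out
dL_db_out = dL_dz                   # shape (Vc,), same as b_out
```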
Weight and Bias Update Rule
The weights and biases are updated using the gradient descent rule: for every learned parameter \(\theta\) (a weight or a bias),
$$\theta \leftarrow \theta - \eta \cdot \frac{\partial \mathcal{L}}{\partial \theta}$$
where \(\eta\) is the learning rate.
Minimizing loss
The Transformer computes the forward pass again with the updated weights and biases. The losses and gradients are recomputed and the weights are updated again. This process repeats until we minimize the loss to a desired value or for a fixed number of epochs.
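Putting the pieces together, frameworks such as PyTorch run this whole forward-loss-backward-update cycle in a few lines, since autograd derives every gradient listed earlier automatically. The following is a minimal, self-contained sketch with toy sizes and random token ids, built on PyTorch's nn.Transformer; it only shows the shape of the training loop, not a realistic translation setup:

```python
import torch
import torch.nn as nn

Vc, D, T = 100, 32, 8                           # toy vocabulary size, model dim, sequence length
embed = nn.Embedding(Vc, D)                     # shared toy embedding for source and target
transformer = nn.Transformer(d_model=D, nhead=4, num_encoder_layers=1,
                             num_decoder_layers=1, dim_feedforward=64, batch_first=True)
to_vocab = nn.Linear(D, Vc)                     # final output projection (W_out, b_out)

params = (list(embed.parameters()) + list(transformer.parameters())
          + list(to_vocab.parameters()))
optimizer = torch.optim.SGD(params, lr=1e-2)    # applies theta <- theta - lr * dL/dtheta
criterion = nn.CrossEntropyLoss()               # token-level cross-entropy

src = torch.randint(0, Vc, (4, T))              # fake source token ids (batch of 4)
tgt = torch.randint(0, Vc, (4, T))              # fake ground-truth target token ids

for _ in range(5):                              # a few epochs on the toy batch
    causal = transformer.generate_square_subsequent_mask(T - 1)
    out = transformer(embed(src), embed(tgt[:, :-1]), tgt_mask=causal)   # teacher forcing
    logits = to_vocab(out)                                               # (4, T-1, Vc)
    loss = criterion(logits.reshape(-1, Vc), tgt[:, 1:].reshape(-1))

    optimizer.zero_grad()                       # clear previous gradients
    loss.backward()                             # backpropagation through every layer
    optimizer.step()                            # gradient-descent weight and bias update
```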
Now that we have understood the training phase, let's look at how the transformer works in the Inference Phase.
Inference Phase
During inference, the encoder in the Transformer functions the same way as during training. It performs the following operations:
Computes the attention outputs.
Passes them through a feedforward network (FFN).
Adds residual connections and applies layer normalization.
The key difference between training and inference lies in how the decoder operates. Unlike training, during inference the decoder does not have access to the ground truth target sequence. Instead, it must generate the target sequence one token at a time, feeding each generated token back in as input. This mechanism is referred to as auto-regressive generation.
Auto-regressive generation is a step-by-step process where a model generates output one token at a time. Each newly generated token is fed back into the model as input, along with previous tokens, until the entire sequence is produced.
How the Decoder Functions During Inference:
Initialization:
The decoder begins with only the start-of-sequence (SOS) token as input.
Token Generation Cycle:
At each timestep, the decoder generates one token, appends it to the current sequence, and uses the updated sequence as the input for the next timestep.
This process repeats until the end-of-sequence (EOS) token is generated or a predefined maximum sequence length is reached.
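The generation cycle just described can be written as a short loop. In the sketch below, next_token_logits is a stand-in for the full encoder-decoder forward pass (here just a dummy that returns random logits), so treat it purely as an illustration of the auto-regressive control flow:

```python
import numpy as np

def greedy_decode(next_token_logits, sos_id, eos_id, max_len):
    """Auto-regressive greedy decoding: grow the sequence one token at a time."""
    seq = [sos_id]                           # start with only the SOS token
    for _ in range(max_len):
        logits = next_token_logits(seq)      # forward pass over the sequence so far
        next_id = int(np.argmax(logits))     # greedily pick the most probable token
        seq.append(next_id)                  # feed it back as input for the next cycle
        if next_id == eos_id:                # stop once EOS is generated
            break
    return seq

# Dummy stand-in for the model: random logits over a toy vocabulary of 10 tokens.
rng = np.random.default_rng(0)
dummy_model = lambda seq: rng.normal(size=10)
print(greedy_decode(dummy_model, sos_id=1, eos_id=2, max_len=8))
```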
Step-by-Step Breakdown Of How the Decoder Generates Text in Inference
Let’s illustrate how a Transformer decoder works during inference with an example:
Task: Translate "Varun loves cricket and finance." into French.
Assumption: The encoder has already processed the sentence and produced contextual embeddings.
Now, let’s go through the inference process step by step.
Step 1 (Decoder Initialization):
The decoder receives the SOS token as input, represented as a matrix of size \(1 \times D\), where \(D\) is the embedding dimension.
The masked attention layer processes this input. Since only the SOS token is present, no masking is applied at this stage.
The alignment scores are computed and normalized using the softmax function, and the result is passed through the subsequent decoder layers just as in the training phase.
The output layer projects the final representation over the vocabulary space to generate a \(1×V_c\) matrix, where \(V_c\) is the size of the vocabulary. This matrix contains the probabilities of all tokens in the vocabulary.
The token with the highest probability is selected as the first output token ( \(y_1\) ), which is then appended to the SOS token.
Example Output for Step 1:
The decoder receives the input [SOS] and generates the first token (e.g., "Varun"), which it appends to the input sequence. The updated sequence now becomes a matrix with:
First row: the [SOS] token.
Second row: the first generated token, e.g., "Varun".
Step 2 (Next Cycle):
Input: The decoder now receives the updated sequence \([SOS,y_1]\), represented as a matrix of size \(2×D\).
The decoder computes the query (Q), key (K), and value (V) matrices for this input and updates the mask matrix to enforce causal masking.
The updated mask matrix for two tokens looks like this:
$$\text{Mask} = \begin{bmatrix} 0 & -\infty \\ 0 & 0 \end{bmatrix}$$
The masked alignment scores are calculated, normalized, and passed through the remaining layers. The next token ( \(y_2\)) is predicted and appended to the sequence.
Example Output for Step 2:
The input to the decoder is now [SOS, "Varun"]. Masked attention processes both tokens, and the encoder-decoder attention attends to the encoder output. The decoder then predicts the second token, e.g., "aime".
The output at this step is now [SOS, "Varun", "aime"].
Subsequent Steps:
The process repeats, with the input sequence growing at each step ( \([SOS,y_1,y_2,…]\) ), and the mask matrix dynamically expands to ensure each token attends only to itself and previous tokens.
For example:
Cycle 3: The mask matrix for three tokens would be:
$$\text{Mask} = \begin{bmatrix} 0 & -\infty & -\infty \\ 0 & 0 & -\infty \\ 0 & 0 & 0 \end{bmatrix}$$
Given the input [SOS, "Varun", "aime"], the output of this cycle for our example will be [SOS, "Varun", "aime", "le"].
Termination:
The process continues until the decoder generates the EOS token or the maximum sequence length is reached.
The final output for the example we considered will be → "Varun aime le cricket et la finance." We obtain this output through a process called detokenization, wherein the generated tokens are combined and returned as the final text.
Result: The decoder outputs the French translation: "Varun aime le cricket et la finance."
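Detokenization itself is handled by whatever tokenizer produced the vocabulary; as a rough, word-level illustration with hypothetical tokens (real subword tokenizers have their own merging rules):

```python
# Hypothetical generated tokens for the example above.
tokens = ["[SOS]", "Varun", "aime", "le", "cricket", "et", "la", "finance", ".", "[EOS]"]
special = {"[SOS]", "[EOS]"}

# Drop the special tokens, join the rest, and tidy the space before punctuation.
text = " ".join(t for t in tokens if t not in special).replace(" .", ".")
print(text)  # Varun aime le cricket et la finance.
```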
Key Difference Between Training and Inference Phase Of The Decoder
During the inference phase, the key difference in the decoder lies in the absence of the ground truth target sequence. Instead, the decoder generates the target sequence token by token, using its own previously generated tokens as input at each decoding step.
Here’s how it differs from the training phase:
1. Input to the Decoder
Training Phase:
The decoder receives the entire ground truth target sequence (e.g., in French: "Varun aime le cricket et la finance.") as input for the Masked Multi-Head Attention.
Masking ensures that the decoder can only attend to tokens up to the current timestep.
Inference Phase:
The decoder starts with only the start-of-sequence (SOS) token as input.
At each timestep, the decoder generates a token, appends it to the sequence, and feeds the updated sequence back as input.
This process repeats until the end-of-sequence (EOS) token is generated or a predefined maximum sequence length is reached.
For example:
Step 1: [SOS]
Step 2: [SOS, "Varun"]
Step 3: [SOS, "Varun", "aime"], and so on.
2. Sequence Generation
Training Phase:
- The entire sequence is available at once, so the decoder can compute all tokens in parallel during a single forward pass.
Inference Phase:
Tokens are generated sequentially, one at a time, in an autoregressive manner.
At timestep \(t\), the decoder uses:
The encoder output.
The previously generated sequence \([SOS, y_1, y_2, \dots, y_{t-1}]\).
The attention mechanism to predict the next token \(y_t\).
3. Parallel vs. Sequential Processing
Training Phase:
- The decoder processes the target sequence in parallel, allowing for efficient computation.
Inference Phase:
- The decoder operates sequentially, generating one token at a time. This is computationally slower compared to training.
4. Predictions
Training Phase:
- The model predicts token probabilities for all positions simultaneously, compared against the ground truth using the Cross-Entropy Loss.
Inference Phase:
At each timestep, the decoder outputs a probability distribution over the vocabulary for the next token.
The token with the highest probability (or another strategy like beam search) is selected and added to the generated sequence.
5. Dynamic Masking
Training Phase:
- The model uses a static mask to compute the masked attention scores.
Inference Phase:
The mask matrix dynamically adjusts to ensure that each token can only attend to itself and the tokens before it (causal masking).
For example:
Cycle 1 (only SOS token):
$$\text{Mask} = [0]$$
Cycle 2 (SOS + 1st generated token):
$$\text{Mask} = \begin{bmatrix} 0 & -\infty \\ 0 & 0 \end{bmatrix}$$
Cycle 3 (SOS + 1st + 2nd generated tokens):
$$\text{Mask} = \begin{bmatrix} 0 & -\infty & -\infty \\ 0 & 0 & -\infty \\ 0 & 0 & 0 \end{bmatrix}$$
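A mask of this shape can be built for any current sequence length with a single upper-triangular fill. Here is a minimal NumPy sketch of how the mask grows with each cycle:

```python
import numpy as np

def causal_mask(n):
    """Additive causal mask for n tokens: 0 where attention is allowed,
    -inf above the diagonal so each token ignores future positions."""
    mask = np.zeros((n, n))
    mask[np.triu_indices(n, k=1)] = -np.inf
    return mask

print(causal_mask(1))   # cycle 1: [[0.]]
print(causal_mask(2))   # cycle 2: the 2x2 mask shown above
print(causal_mask(3))   # cycle 3: the 3x3 mask shown above
```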
Conclusion
And with that, we conclude our deep dive into Transformer decoders! Throughout this series, we have uncovered the inner workings of the Transformer, from how it processes input sequences to how it learns through backpropagation and generates outputs during inference.
Key Takeaways:
The training phase involves computing gradients and updating weights using backpropagation to minimize loss.
The inference phase follows an auto-regressive generation approach, where tokens are produced sequentially rather than in parallel.
Masking techniques ensure that the model processes only relevant context at each step, preventing "cheating" by looking ahead.
Transformers have revolutionized NLP, powering models like GPT, BERT, and T5. But as the field evolves, we’re seeing even more efficient architectures like Mixture of Experts (MoE) models, Retrieval-Augmented Generation (RAG), and multimodal AI systems shaping the future.
I hope this series has provided you with a solid intuitive and mathematical understanding of how Transformers work. If you have any thoughts, questions, or ideas for future deep dives, feel free to share them!