Diving into the Transformer: Attention Is All You Need!


The Transformer is a deep neural network architecture introduced in the paper "Attention Is All You Need". It is designed for sequence-to-sequence tasks and excels particularly in natural language processing (NLP). It eliminates recurrence (RNNs) and convolutions (CNNs) in favor of self-attention and feedforward layers, and it has become one of the state-of-the-art architectures in deep learning.
Source: https://arxiv.org/abs/1706.03762
1. Key Components of the Transformer
1. Encoder Architecture
The Transformer encoder extracts contextual meaning using the attention mechanism. The encoder consists of stacked layers, and each encoder layer consists of:
Multi-head self-attention
Feedforward network (FFN)
Add & layer normalization
2. Decoder Architecture
The Transformer decoder generates tokens in sequence, one token at a time. Each decoder layer has the same components as the encoder, plus an additional masked multi-head self-attention layer to prevent future information leakage. The decoder layer is also repeated multiple times. Each decoder layer consists of:
Masked multi-head self-attention
Encoder-decoder multi-head attention
Feedforward network
Add & layer normalization
Comparison: Transformer Encoder vs Decoder

| Component | Encoder | Decoder |
| --- | --- | --- |
| Self-Attention | Full | Masked |
| Encoder-Decoder Attention | ❌ | ✅ |
| Feed-Forward Network | ✅ | ✅ |
| Auto-Regressive | ❌ | ✅ |
| Final Softmax Output | ❌ | ✅ |
2. Step-by-Step Data Flow in the Transformer
1. Input Tokenization
The raw text is converted into tokens (words or subwords) using a subword tokenizer such as Byte Pair Encoding (used in GPT) or WordPiece (used in BERT).
Example:
Input sentence:
"I love AI"
Tokenized:
["I", "love", "AI"]
Then each token is mapped to its vocabulary index:
[45, 987, 210]
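As a rough illustration, here is a minimal word-level tokenization sketch in Python. The vocabulary and IDs are made up; real tokenizers work on subwords:

```python
# A minimal word-level tokenization sketch with a hypothetical vocabulary.
# Real models use subword tokenizers (BPE in GPT, WordPiece in BERT).
vocab = {"I": 45, "love": 987, "AI": 210}

def tokenize(sentence):
    tokens = sentence.split()          # split raw text into tokens
    ids = [vocab[t] for t in tokens]   # map each token to its vocabulary index
    return tokens, ids

tokens, ids = tokenize("I love AI")
print(tokens)  # ['I', 'love', 'AI']
print(ids)     # [45, 987, 210]
```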
2. Token Embedding
Each token ID is converted into a high-dimensional vector using an embedding matrix. The goal of this step is to turn token IDs into dense vectors that capture meaning.
$$X = E \cdot I$$
Where:
$$E \in \mathbb{R}^{V \times d} \text{ is the trainable embedding matrix}$$
V is the vocabulary size
I is the input token index
d is the embedding dimension
Example:
[101] → [0.1, 0.3, -0.2, 0.7, ...]
[2345] → [0.25, -0.12, 0.87, 0.56, ...]
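A minimal sketch of the lookup in NumPy, assuming a small random embedding matrix (in a real model the matrix is learned during training):

```python
import numpy as np

V, d = 1000, 8                        # vocabulary size and embedding dimension (illustrative)
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))           # embedding matrix; learned in a real model

token_ids = np.array([45, 987, 210])  # IDs for "I love AI"
X = E[token_ids]                      # row lookup: one dense vector per token
print(X.shape)                        # (3, 8)
```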
3. Positional Embedding
The Transformer does not process words sequentially, so positional information must be added. This information is generated with sine and cosine functions that produce a unique encoding for each position. The positional encoding formula is:
$$PE_{(pos, 2i)} = \sin \left( \frac{pos}{10000^{\frac{2i}{d}}} \right)$$
$$PE_{(pos, 2i+1)} = \cos \left( \frac{pos}{10000^{\frac{2i}{d}}} \right)$$
Where:
pos is the position in the sequence
d is the embedding dimension
i is the dimension index
The positional encoding is added to the word embedding. The formula becomes :
$$\text{Final Representation} = \text{Embedding} + \text{Positional Encoding}$$
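A small NumPy sketch of the sinusoidal positional encoding; the sequence length and embedding dimension are illustrative, and X stands in for the token embeddings from the previous step:

```python
import numpy as np

def positional_encoding(seq_len, d):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d // 2)[None, :]                # (1, d/2)
    angles = pos / np.power(10000.0, 2 * i / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions use cosine
    return pe

X = np.random.default_rng(0).normal(size=(3, 8))  # stand-in for the token embeddings
final_representation = X + positional_encoding(3, 8)
```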
4. Attention Mechanism Encoder
Once the input is converted into this final representation, it is passed to the Transformer encoder, which captures contextual meaning. Each encoder block consists of:
Multi-head self-attention
Add & layer normalization
Feedforward network
Add & layer normalization
4.1 Compute Queries (Q), Keys (K), and Values (V)
This step projects the embeddings using the learnable matrices
$$W_{Q}, \quad W_{K}, \quad W_{V}$$
$$Q = XW_{Q}, \quad K = XW_{K}, \quad V = XW_{V}$$
4.2 Compute Attention Scores
The attention scores are computed with the scaled dot-product attention mechanism:
$$\text{Attention Score} = \frac{Q K^{T}}{\sqrt{d_{k}}}$$
where d_k is the dimension of the key vectors. After computing the attention scores, we normalize them with softmax to obtain the attention weights.
4.3 Compute Weighted Sum of Values
This step multiplies the attention weights with the value (V) vectors:
$$\text{Final Attention Output} = \sum (\text{attention weight} \times V)$$
This step produces context-aware word representations.
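Putting steps 4.1-4.3 together, here is a minimal single-head attention sketch in NumPy; the weight matrices are random stand-ins for the learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v       # 4.1 project inputs to queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # 4.2 scaled dot-product attention scores
    weights = softmax(scores, axis=-1)        # 4.2 normalize scores with softmax
    return weights @ V                        # 4.3 weighted sum of the value vectors

rng = np.random.default_rng(0)
seq_len, d = 3, 8
X = rng.normal(size=(seq_len, d))                           # final representation from step 3
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3)) # random stand-ins for learned weights
out = self_attention(X, W_q, W_k, W_v)                      # (3, 8) context-aware vectors
```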
5. Feedforward Layers
This layer applies additional processing to the attention output. Each attention output is passed through a fully connected feedforward network:
$$\text{Output} = \text{ReLU}(X W_{1} + b_{1}) W_{2} + b_{2}$$
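A minimal sketch of the position-wise feedforward network, with an illustrative hidden dimension and random stand-in weights:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Expand to a hidden dimension, apply ReLU, then project back down.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d, d_ff = 8, 32                                     # hidden size is typically ~4x d
x = rng.normal(size=(3, d))                         # attention output
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
y = feed_forward(x, W1, b1, W2, b2)                 # (3, 8)
```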
6. Repeating transformer layers
Steps 4-5 are repeated N times (e.g., 12 layers in BERT-base, 96 layers in GPT-3).
7. Transformer Decoder (for GPT-like Model)
The Transformer decoder is responsible for generating the output sequence one token at a time, using the previously generated tokens and the encoder output. It consists of multiple layers (e.g., 6 layers in the original Transformer, 96 in GPT-3) and follows an auto-regressive generation process. Each decoder layer has three main components:
Masked multi-head self-attention (the decoder attends to past generated tokens)
Encoder-decoder multi-head attention (the decoder attends to the encoder outputs)
Feedforward network (processes token representations)
7.1 Initialize the decoder with a special token
This process is the same as in the encoder.
7.2 Compute token embeddings and positional encodings
This process is the same as in the encoder.
7.3 Masked Multi Head Self attention
At this step, the decoder applies self-attention to the previously generated tokens. A causal mask is applied to ensure that a token at position t cannot attend to future tokens at positions t + 1, t + 2, ….
The steps are similar to the encoder: compute queries, keys, and values, then compute the attention scores.
- Applying causal masking
To prevent the model from looking ahead, future positions are masked out by setting their attention scores to negative infinity, ensuring that only previous tokens influence the prediction.
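A minimal sketch of building and applying such a causal mask, assuming raw attention scores of shape (seq_len, seq_len):

```python
import numpy as np

def causal_mask(seq_len):
    # Boolean upper-triangular mask: True wherever position t would see a future token.
    return np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)

scores = np.random.default_rng(0).normal(size=(4, 4))  # raw decoder attention scores
scores[causal_mask(4)] = -np.inf                        # block future positions
# After softmax, the masked positions receive probability 0,
# so only previous tokens influence the prediction.
```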
7.4 Encoder-decoder multi head attention
This step allows the decoder to attend to the encoder's output. It involves:
Queries (Q) from the decoder
Keys (K) and Values (V) from the encoder output
The attention formula remains the same.
This step aligns the generated text with the input sentence:
"I love AI" → Encoder → Context Representation
"J' " → Decoder attends to encoded "I"
"I love AI" → Encoder → Context Representation "J' " → Decoder attends to encoded "I"
7.5 Applying the Feedforward Network (FFN), Residual Connections, and Normalization
This step applies the usual FFN, residual connections, and layer normalization found throughout modern deep learning architectures.
8. Final Output Prediction
This step converts the final vectors into words:
Apply a linear transformation to map the embedding to vocabulary logits
Apply softmax to get probabilities
Select the highest-probability token
Logits: [0.1, 0.6, 0.2, 0.9]
Softmax Probabilities: [17%, 28%, 18%, 37%]
Predicted word: "fast"
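A minimal sketch of this selection step, reusing the logits above; the vocabulary slice is hypothetical:

```python
import numpy as np

logits = np.array([0.1, 0.6, 0.2, 0.9])   # output of the linear layer
probs = np.exp(logits - logits.max())
probs /= probs.sum()                       # softmax: ~[0.17, 0.28, 0.18, 0.37]
vocab = ["the", "cat", "runs", "fast"]     # hypothetical vocabulary slice
predicted = vocab[int(np.argmax(probs))]   # "fast" (highest probability)
print(predicted)
```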
9. Post processing
This step converts the model's predictions into human-readable sentences:
Convert token IDs back to words
Handle punctuation, capitalization, and formatting
If generating text, repeat the steps until an end token ([EOS]) is reached
Example
"The cat runs fast."
Summary of the Data Flow:

| Step | Process | Purpose |
| --- | --- | --- |
| 1 | Tokenization | Convert text to numerical tokens |
| 2 | Embedding Layer | Map tokens to dense vectors |
| 3 | Positional Encoding | Add word order information |
| 4 | Self-Attention | Compute relationships between words |
| 5 | Feedforward Layer | Process attention outputs |
| 6 | Repeat Transformer Layers | Refine representations across multiple layers |
| 7 | Decoder (Optional) | Generate output step by step |
| 8 | Softmax & Word Selection | Predict final words |
| 9 | Post-Processing | Convert back to readable text |