Diving into the Transformer: Attention Is All You Need!

The Transformer is a deep neural network architecture introduced in "Attention Is All You Need". It is designed for sequence-to-sequence tasks and particularly excels in natural language processing (NLP). It eliminates recurrence (RNNs) and convolutions (CNNs) in favor of self-attention and feedforward layers, and has become one of the state-of-the-art architectures in deep learning.

Source: https://arxiv.org/abs/1706.03762

1. Key Components of the Transformer

1. Encoder Architecture

The Transformer encoder extracts contextual meaning using the attention mechanism. The encoder consists of stacked layers, and each encoder layer consists of (a minimal sketch follows the list):

  1. Multi-head self-attention

  2. Feedforward network (FFN)

  3. Add and Layer Normalization
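
To make these components concrete, here is a minimal sketch of one encoder layer in PyTorch. It is only an illustration: the hyperparameters (d_model=512, 8 heads, d_ff=2048) follow the original paper, and dropout is omitted.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: self-attention -> add & norm -> FFN -> add & norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # 1. Multi-head self-attention, then residual connection + layer norm
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # 2. Position-wise feedforward network, then residual connection + layer norm
        return self.norm2(x + self.ffn(x))

# A batch of one 3-token sequence, each token a 512-dim vector
out = EncoderLayer()(torch.randn(1, 3, 512))
print(out.shape)  # torch.Size([1, 3, 512])
```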

2. Decoder Architecture

The Transformer decoder generates tokens one at a time. Each decoder layer has the same components as the encoder layer, plus an additional masked multi-head self-attention layer that prevents future information leakage. The decoder layer is also repeated multiple times. Each decoder layer consists of:

  1. Masked Multi-Head Self-Attention

  2. Encoder-Decoder Multi-Head Attention

  3. Feedforward Network

  4. Add and Layer Normalization

Comparison: Transformer Encoder vs. Decoder

| Component | Encoder | Decoder |
| --- | --- | --- |
| Self-Attention | Full | Masked |
| Encoder-Decoder Attention | No | Yes |
| Feed-Forward Network | Yes | Yes |
| Auto-Regressive | No | Yes |
| Final Softmax Output | No | Yes |

2. Step-by-Step Data Flow in the Transformer

1. Input Tokenization

The raw text is converted into tokens (subwords or words) using an algorithm such as Byte Pair Encoding (used in GPT) or WordPiece (used in BERT).

Example:

Input sentence:

"I love AI"

Tokenized:

["I", "love", "AI"]

Then each token is mapped to its vocabulary index:

[45, 987, 210]
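
As a toy illustration of this step (the vocabulary and IDs below are invented, not taken from a real tokenizer), the mapping in Python might look like this:

```python
# Toy word-level tokenizer; real models use subword algorithms such as BPE or WordPiece.
vocab = {"I": 45, "love": 987, "AI": 210}   # hypothetical vocabulary

def encode(sentence):
    tokens = sentence.split()               # ["I", "love", "AI"]
    return tokens, [vocab[t] for t in tokens]

tokens, ids = encode("I love AI")
print(tokens, ids)                          # ['I', 'love', 'AI'] [45, 987, 210]
```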

2. Token Embedding

Each token ID is converted into a high-dimensional vector using an embedding matrix. The goal of this step is to map token IDs to dense vectors that capture meaning.

$$X = E \cdot I$$

Where:

  • E is the trainable embedding matrix of shape V × d

  • I is the one-hot vector of the input token index

  • V is the vocabulary size

  • d is the embedding dimension

Example:

[101] → [0.1, 0.3, -0.2, 0.7, ...]
[2345] → [0.25, -0.12, 0.87, 0.56, ...]
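
A minimal NumPy sketch of the lookup, with an invented vocabulary size and embedding dimension and a random (untrained) embedding matrix standing in for the learned one:

```python
import numpy as np

V, d = 10000, 8            # hypothetical vocabulary size and embedding dimension
E = np.random.randn(V, d)  # trainable embedding matrix (random stand-in here)

token_ids = [45, 987, 210]
X = E[token_ids]           # row lookup, equivalent to multiplying one-hot vectors by E
print(X.shape)             # (3, 8): one dense d-dimensional vector per token
```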

3. Positional Encoding

The Transformer does not process words sequentially, so positional information must be added. This information is generated with sine and cosine functions that produce a unique encoding for each position. The positional encoding formulas are:

$$PE_{(pos, 2i)} = \sin \left( \frac{pos}{10000^{\frac{2i}{d}}} \right)$$

$$PE_{(pos, 2i+1)} = \cos \left( \frac{pos}{10000^{\frac{2i}{d}}} \right)$$

Where:

  • pos is the position in the sequence

  • d is the embedding dimension

  • i is the dimension index

The positional encoding is added to the word embedding. The final representation becomes:

$$\text{Final Representation} = \text{Embedding} + \text{Positional Encoding}$$
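
A small NumPy sketch of the sinusoidal encoding defined above (sequence length and dimension are arbitrary, and d is assumed even):

```python
import numpy as np

def positional_encoding(seq_len, d):
    pos = np.arange(seq_len)[:, None]            # positions 0 .. seq_len-1
    i = np.arange(d // 2)[None, :]               # dimension index i
    angles = pos / np.power(10000, 2 * i / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cos
    return pe

X = np.random.randn(3, 8)                        # token embeddings from the previous step
final_repr = X + positional_encoding(3, 8)       # Embedding + Positional Encoding
```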

4. Attention Mechanism (Encoder)

After the input is converted into its final representation, it is passed to the Transformer encoder, which captures contextual meaning. Each encoder block consists of:

  • Multi-Head Self-Attention

  • Add and Layer Normalization

  • Feedforward Network

  • Add and Layer Normalization

4.1 Compute Queries (Q), Keys (K), and Values (V)

This step projects the embeddings using the learnable matrices

$$W_{Q}, \quad W_{K}, \quad W_{V}$$

$$Q = XW_{Q}, \quad K = XW_{K}, \quad V = XW_{V}$$

4.2 Compute Attention Scores

The attention scores are computed with the dot-product attention mechanism:

$$\text{Attention Score} = Q K^{T}$$

After computing the attention scores, they are divided by the square root of the key dimension (as in the original paper) and normalized with softmax to obtain the attention weights.

4.3 Compute Weighted Sum of Values

This step multiplies the attention weights by the value (V) vectors and sums the results:

$$\text{Final Attention Output} = \sum (\text{attention weight} \times V)$$

This step produces context-aware representations for each word.
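
Putting steps 4.1 to 4.3 together, here is a compact NumPy sketch for a single attention head. The projection matrices are random stand-ins for learned weights, and the scores are scaled by the square root of the key dimension as in the original paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d, d_k = 8, 8
X = np.random.randn(3, d)                   # final representations of 3 tokens

# 4.1 project the inputs with learnable matrices (random stand-ins here)
W_Q, W_K, W_V = (np.random.randn(d, d_k) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# 4.2 dot-product attention scores, scaled and normalized with softmax
weights = softmax(Q @ K.T / np.sqrt(d_k))

# 4.3 weighted sum of the value vectors
output = weights @ V
print(output.shape)                         # (3, 8): one context-aware vector per token
```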

5. Feedforward Layers

This layer applies additional processing to the attention output and is a standard component of modern deep learning architectures. Each position's attention output is passed through a fully connected feedforward network:

$$\text{Output} = \text{ReLU}(X W_{1} + b_{1}) W_{2} + b_{2}$$
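
A minimal NumPy version of this feedforward layer (the dimensions here are small and illustrative; the original paper uses d_model = 512 with an inner dimension of 2048):

```python
import numpy as np

d, d_ff = 8, 32                              # illustrative model and hidden dimensions
W1, b1 = np.random.randn(d, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d), np.zeros(d)

def ffn(x):
    # Output = ReLU(X W1 + b1) W2 + b2, applied to each position independently
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

out = ffn(np.random.randn(3, d))             # shape (3, 8)
```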

6. Repeating Transformer Layers

Steps 4-5 are repeated N times (e.g., 12 layers in BERT-base, 96 in GPT-3).

7. Transformer Decoder (for GPT-like Models)

The Transformer decoder is responsible for generating the output sequence one token at a time, using the previously generated tokens and the encoder output. It consists of multiple layers (e.g., 6 layers in the original Transformer, 96 in GPT-3) and follows an auto-regressive generation process. Each decoder layer has three main components:

  1. Masked Multi-Head Self-Attention (the decoder attends to previously generated tokens)

  2. Encoder-Decoder Multi-Head Attention (the decoder attends to the encoder outputs)

  3. Feedforward Network (processes token representations)

7.1 Initialize the Decoder with a Special Token

The decoder input begins with a special start-of-sequence token; it is then processed in the same way as the encoder input.

7.2 Compute token embeddings and positional encodings

This process is the same as in the encoder.

7.3 Masked Multi-Head Self-Attention

At this step, the decoder applies self-attention to the previously generated tokens. A causal mask ensures that a token at position t cannot attend to future tokens at positions t + 1, t + 2, ....

The decoder computes queries, keys, and values and then attention scores just as in the encoder.

  • Applying causal masking

To prevent the model from looking ahead, future positions are masked out by setting their attention scores to negative infinity before the softmax, ensuring that only previous tokens influence the prediction.
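
A small NumPy sketch of the causal mask: positions above the diagonal (future tokens) are set to negative infinity before the softmax, so their attention weights become zero.

```python
import numpy as np

def causal_mask(seq_len):
    # -inf above the diagonal (future positions), 0 on and below it
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.random.randn(4, 4)       # raw self-attention scores for 4 generated tokens
masked = scores + causal_mask(4)     # after softmax, row t only weights positions <= t
```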

7.4 Encoder-Decoder Multi-Head Attention

This step allows the decoder to attend to the encoder's output. It involves:

  • Queries (Q) from the decoder

  • Keys (K) and Values (V) from the encoder output

The attention formula remains the same.

This step aligns the generated text with the input sentence:

"I love AI" → Encoder → Context Representation
"J' " → Decoder attends to encoded "I"

"I love AI" → Encoder → Context Representation "J' " → Decoder attends to encoded "I"

7.5 Applying the Feedforward Network (FFN), Residual Connections, and Layer Normalization

This process applies the same feedforward network, residual connections, and layer normalization used in the encoder.

8. Final Output Prediction

This step converts the final vectors into words:

  • Apply a linear transformation to map the embeddings to vocabulary logits

  • Apply softmax to get probabilities

  • Select the highest probability token

Logits: [0.1, 0.6, 0.2, 0.9]
Softmax Probabilities: [17%, 28%, 18%, 37%]
Predicted word: "fast"
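
A minimal sketch of this step with a hypothetical four-word vocabulary (the logits match the example above):

```python
import numpy as np

vocab = ["the", "cat", "runs", "fast"]            # hypothetical tiny vocabulary
logits = np.array([0.1, 0.6, 0.2, 0.9])           # output of the final linear layer

probs = np.exp(logits) / np.exp(logits).sum()     # softmax
print(probs.round(2))                             # [0.17 0.28 0.18 0.37]
print(vocab[int(np.argmax(probs))])               # "fast"
```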

9. Post-Processing

This step converts the model's predictions into human-readable sentences:

  • Convert token IDs back to words

  • Handle punctuation, capitalization, and formatting

  • If generating text, repeat the steps until reaching an end token ([EOS]); a small sketch of this loop follows the example below

Example

"The cat runs fast."

Summary of the Data Flow:

| Step | Process | Purpose |
| --- | --- | --- |
| 1 | Tokenization | Convert text to numerical tokens |
| 2 | Embedding Layer | Map tokens to dense vectors |
| 3 | Positional Encoding | Add word order information |
| 4 | Self-Attention | Compute relationships between words |
| 5 | Feedforward Layer | Process attention outputs |
| 6 | Repeat Transformer Layers | Refine representations across multiple layers |
| 7 | Decoder (Optional) | Generate output step by step |
| 8 | Softmax & Word Selection | Predict final words |
| 9 | Post-Processing | Convert back to readable text |
Written by Fadhil Elrizanda