Getting started with Transformers


A brief overview of how transformers work
Introduction
In the ever-evolving landscape of Natural Language Processing (NLP), one architecture has captured the imagination of researchers, developers, and data scientists alike: the Transformer. Since its inception, the Transformer has become the cornerstone of modern NLP, revolutionizing the way machines understand, generate, and manipulate human language. In this article, we will look at how the basic building blocks used in a transformer work, and at the transformer architecture itself.
Recurrent Neural Network (RNN)
Traditional deep neural networks assume that inputs and outputs are independent of each other; in an RNN, however, the output from the previous step in the sequence can be used as input to the next step.
While feedforward networks have different weights at each node, an RNN shares the same weight parameters within each layer of the network. These shared weights allow the RNN to capture and maintain information about previous time steps in its hidden state. This enables the network to learn and represent temporal dependencies in the data, which is crucial for tasks that involve sequences. For a detailed understanding of RNNs, you can refer to this article.
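To make the shared-weight recurrence concrete, here is a minimal NumPy sketch of a single-layer RNN. The sizes (input dimension 3, hidden dimension 4) and variable names are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

# A minimal sketch of a single-layer RNN with toy sizes. The same weight
# matrices W_xh, W_hh and bias b_h are reused at every time step, which is
# how the network carries information forward in its hidden state.
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 4))   # input  -> hidden weights (shared)
W_hh = rng.normal(size=(4, 4))   # hidden -> hidden weights (shared)
b_h = np.zeros(4)

def rnn_forward(inputs):
    """inputs: sequence of vectors, shape (seq_len, 3)."""
    h = np.zeros(4)              # initial hidden state
    states = []
    for x_t in inputs:           # step through the sequence
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)      # hidden state at every time step

hidden_states = rnn_forward(rng.normal(size=(5, 3)))  # a 5-step toy sequence
print(hidden_states.shape)       # (5, 4)
```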
There are two major problems that can arise in an RNN (a rough numerical illustration follows the list):
- Vanishing gradient — when the gradients are too small, they keep shrinking as they are propagated back through time, the weight updates become negligible, and the model stops learning.
- Exploding gradient — when the gradients are too large, they keep growing until the values overflow and end up being represented as NaN.
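The intuition can be shown with a toy calculation (this is not a real backward pass, just an illustration of repeated multiplication by factors tied to the recurrent weights):

```python
# Repeatedly multiplying a gradient by a factor below or above 1 mimics what
# happens across many time steps in an RNN.
factor_small, factor_large = 0.5, 1.5
grad_small = grad_large = 1.0
for step in range(50):
    grad_small *= factor_small   # shrinks toward 0   -> vanishing gradient
    grad_large *= factor_large   # grows without bound -> exploding gradient
print(grad_small)   # ~8.9e-16, effectively zero
print(grad_large)   # ~6.4e+08, heading toward overflow / NaN
```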
Encoder-Decoder Network
It is a neural network consisting of two parts: an encoder and a decoder. The encoder takes the input sequence and creates a contextual representation (the context), and the decoder takes this context as input and generates the output sequence. When an RNN is used as the encoder, the final hidden state of the RNN chain can be used as the context representation.
Traditional RNN-based encoder-decoder (seq-to-seq) model
Each cell in the RNN decoder takes its own estimated output from the previous cell as input. One important drawback is that if the context is provided only to the first cell of the decoder, the context becomes less and less influential as decoding continues. To overcome this, the context can be made available to every decoding time step. For a detailed understanding of the encoder-decoder model, you can refer to this article.
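A minimal sketch of this idea, assuming toy sizes (input dimension 3, hidden dimension 4, output dimension 3) and hypothetical weight names; the encoder's final hidden state is used as the context and is fed to every decoder step:

```python
import numpy as np

rng = np.random.default_rng(1)
enc_Wx, enc_Wh = rng.normal(size=(3, 4)), rng.normal(size=(4, 4))
dec_Wy, dec_Wh, dec_Wc = rng.normal(size=(3, 4)), rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
out_W = rng.normal(size=(4, 3))

def encode(inputs):
    h = np.zeros(4)
    for x_t in inputs:
        h = np.tanh(x_t @ enc_Wx + h @ enc_Wh)
    return h                     # final hidden state used as the context

def decode(context, steps=4):
    h, y = np.zeros(4), np.zeros(3)
    outputs = []
    for _ in range(steps):
        # the context is fed to *every* decoder step, not just the first one
        h = np.tanh(y @ dec_Wy + h @ dec_Wh + context @ dec_Wc)
        y = h @ out_W            # each step's output becomes the next input
        outputs.append(y)
    return np.stack(outputs)

print(decode(encode(rng.normal(size=(5, 3)))).shape)  # (4, 3)
```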
Attention Mechanism
Attention is a technique that allows a neural network to focus on specific parts of an input sequence. This is done by assigning weights to different parts of the input sequence, with the more important parts receiving larger weights.
Attention model
An attention model differs from a traditional model in two ways:
- The encoder passes more data to the decoder: it passes the hidden states from all time steps rather than just the final hidden state, as in the traditional model.
- Before producing output, the decoder looks at the set of encoder hidden states it received, gives each hidden state a score, and then multiplies each hidden state by its softmax-normalized score.
This amplifies the hidden states with the highest scores and suppresses the hidden states with low scores. This model still has disadvantages: it is slow, and there is no guarantee that it captures the full context.
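Here is a minimal sketch of the scoring step described above, assuming simple dot-product scores between the decoder's current hidden state and every encoder hidden state (the shapes and scoring choice are illustrative; other scoring functions exist):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
encoder_states = rng.normal(size=(6, 4))   # one hidden state per input time step
decoder_state = rng.normal(size=4)         # current decoder hidden state

scores = encoder_states @ decoder_state    # one score per encoder hidden state
weights = softmax(scores)                  # high scores amplified, low scores suppressed
context = weights @ encoder_states         # weighted sum used by the decoder
print(weights.round(2), context.shape)     # context has shape (4,)
```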
Self-Attention Mechanism
This is an improvement over the basic attention mechanism because, unlike attention applied step by step, self-attention can understand how different words or elements in a sentence or sequence relate to each other without going through them one by one. When a sentence is fed to a computer, each word is treated as a token “t”, and each token has a word embedding “V”. But these word embeddings have no context on their own. The idea of self-attention is to apply some kind of weighting or similarity to obtain a final word embedding “Y” that has more context than the initial embedding V. In an embedding space, similar words appear closer together, i.e., they have similar embeddings.
Self-attention block
The self-attention mechanism in detail (a minimal sketch follows the list):
- Three vectors (query, key, and value) are created from each of the encoder’s input vectors by multiplying the embedding with three matrices that are learned during training.
- A score is calculated for each word of the input against the word we are computing self-attention for. The score determines how much focus to place on other parts of the input sentence, and is calculated by taking the dot product of the query vector with the key vector.
- The score is then divided by the square root of the dimension of the key vectors, which in the “Attention Is All You Need” paper is √64 = 8. This leads to more stable gradients. The result is then passed through a softmax operation, which normalizes the scores.
- Each value vector is multiplied by its normalized score. This keeps the values of the words we want to focus on intact and drowns out irrelevant words by multiplying them by tiny numbers.
- The weighted value vectors are summed up, producing the final, context-aware word embedding.
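A minimal NumPy sketch of these steps for a toy sentence of 5 tokens, with embedding size 8 and query/key/value size 4 (all sizes are hypothetical; the paper uses d_k = 64, hence its division by √64 = 8):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 8))                # one embedding per token
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))  # learned in practice

Q, K, V = X @ W_q, X @ W_k, X @ W_v        # step 1: query, key, value vectors
scores = Q @ K.T / np.sqrt(K.shape[-1])    # steps 2-3: dot-product scores, scaled
weights = np.exp(scores)                   # softmax over each row
weights /= weights.sum(axis=-1, keepdims=True)
Y = weights @ V                            # steps 4-5: weighted sum of values
print(Y.shape)                             # (5, 4): one context-aware vector per token
```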
Multi-Head Attention
Multi-head attention extends the basic attention mechanism by performing it multiple times in parallel, with a different set of learnable parameters for each “head.” This improves the performance of the attention layer in two ways:
- It expands the model’s ability to focus on different positions.
- It gives the attention layer multiple “representation subspaces”.
The subsequent feed-forward layer expects a single matrix (one vector per word). So the output matrices of all heads are concatenated and then multiplied by an additional weight matrix. For a detailed understanding, you can refer to this paper.
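A minimal sketch of multi-head attention with 2 heads, reusing the single-head attention idea above; the head count and sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
d_model, n_heads, d_head = 8, 2, 4
X = rng.normal(size=(5, d_model))          # 5 tokens

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

heads = []
for _ in range(n_heads):                   # each head has its own projections
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

W_o = rng.normal(size=(n_heads * d_head, d_model))
output = np.concatenate(heads, axis=-1) @ W_o   # concatenate, then project back
print(output.shape)                        # (5, 8): back to the model dimension
```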
Multi-head attention block
Transformers
Traditionally, attention mechanisms were used together with RNNs, but the transformer model relies mainly on the self-attention mechanism to understand the data. The line below from the “Attention Is All You Need” paper explains the use of self-attention in the transformer quite well.
“… the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence aligned RNNs or convolution” — Attention Is All You Need, 2017
The encoder-decoder structure of the transformer architecture
Both the encoder and the decoder consist of a stack of N identical layers.
Encoder:
Before the inputs are sent to the encoder, they pass through an input embedding, which is essentially a lookup table that learns a vector representation of each word in the input sentence. The positional encoding then adds a positional vector to the corresponding embedding vector, creating the positional input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. Finally, the encoder block takes this positional input embedding as input and produces a continuous vector representation of these inputs as output.
Note: Sine and cosine functions are used in positional encoding because they produce fixed patterns with varying frequencies and are independent of each other.
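A minimal sketch of the sinusoidal positional encoding from the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the sequence length and model dimension here are toy values:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]                  # even dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # sine on even indices
    pe[:, 1::2] = np.cos(angles)                           # cosine on odd indices
    return pe

pe = positional_encoding(seq_len=10, d_model=8)
# the encoding is simply added to the input embeddings:
# embeddings_with_position = word_embeddings + pe
print(pe.shape)                                            # (10, 8)
```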
The encoder is composed of two sublayers:
- The first sublayer implements a multi-head self-attention mechanism.
- The second sublayer is a fully connected feed-forward network consisting of two linear transformations with a ReLU activation in between.
Furthermore, each of these two sublayers has a residual connection around it, and each is followed by a normalization layer, which normalizes the sum of the sublayer input, x, and the output generated by the sublayer itself, Sublayer(x). The N layers of the Transformer encoder apply the same linear transformations to all the words in the input sequence, but each layer employs different weight and bias parameters to do so.
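A minimal sketch of this LayerNorm(x + Sublayer(x)) pattern, using the feed-forward sublayer as an example; the sizes and weight names are illustrative assumptions:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    # two linear transformations with a ReLU in between (position-wise FFN)
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(5)
x = rng.normal(size=(5, 8))                        # sublayer input
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)

out = layer_norm(x + feed_forward(x, W1, b1, W2, b2))   # LayerNorm(x + Sublayer(x))
print(out.shape)                                   # (5, 8)
```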
Decoder:
The decoder shares several similarities with the encoder. It is composed of three sublayers:
- The first sublayer is a masked multi-head attention, which receives the previous output of the decoder stack. While the encoder is designed to attend to all words in the input sequence regardless of their position, this multi-head attention is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -∞) before the softmax step in the multi-head attention calculation; a minimal sketch of such a mask is shown after this section. The masking makes the decoder unidirectional.
- The second sublayer implements a multi-head attention mechanism similar to the one implemented in the first sublayer of the encoder, except that it creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack.
- The third sublayer implements a fully connected feed-forward network, similar to the one implemented in the second sublayer of the encoder.
Furthermore, the three sublayers on the decoder side also have residual connections around them and are followed by a normalization layer, similar to the one in the encoder. Positional encodings are also added to the output embeddings of the decoder in the same manner as described for the encoder.
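As promised above, here is a minimal sketch of the causal mask used by the decoder's first sublayer; the toy attention scores are illustrative:

```python
import numpy as np

def causal_mask(seq_len):
    # -inf strictly above the diagonal, 0 on and below it
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.ones((4, 4))                           # toy attention scores
masked = scores + causal_mask(4)                   # future positions set to -inf
weights = np.exp(masked)                           # exp(-inf) = 0
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))
# row i has non-zero weights only for positions 0..i (earlier positions)
```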
For a detailed understanding of transformers, you can refer to this article.
References
- All of Recurrent Neural Networks
- NLP Theory and Code: Encoder-Decoder Models (Part 11/30)
- Encoders-Decoders, Sequence to Sequence Architecture
- Papers with code: Multi-Head Attention
- A Gentle Introduction to Positional Encoding in Transformer Models, Part 1
- Rasa Algorithm Whiteboard — Transformers & Attention 2: Keys, Values, Queries
- The Transformer Attention Mechanism
- The Transformer Model