Unlocking AI's Power: Attention Mechanism and RNN Secrets Revealed!
Ever wondered how the Attention Mechanism works in tasks like translation or image recognition?
Just like many of you, I was intrigued by the term "Attention Mechanism", and I've spent a lot of time researching the topic.
I want to save you that time. If you want to understand it, start by learning about RNNs.
Dive into the RNN Encoder-Decoder architecture to unlock the essence of modern AI.
Let's break it down in simple terms.
Introduction
If you want to learn about the Attention Mechanism, you should probably start by learning about the RNN Encoder-Decoder architecture.
This architecture is a building block of modern LLMs, and it was one of the first effective models for sequence-to-sequence modeling.
The idea is to encode the input sequence as a vector representation with the encoder.
Then, use it as input to the decoder to decode the output sequence in an iterative fashion.
Finally, the prediction head performs a multiclass classification task over all the possible words of the vocabulary, as sketched below.
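To make that flow concrete, here is a minimal, hypothetical sketch in Python. The names `encode`, `decode_step`, `bos`, and `eos` are placeholders for illustration, not a real library API:

```python
def translate(src_tokens, encode, decode_step, bos, eos, max_len=50):
    """Hypothetical seq2seq loop; encode and decode_step are placeholders."""
    state = encode(src_tokens)        # encoder compresses the input into a vector
    word, output = bos, []
    for _ in range(max_len):
        word, state = decode_step(word, state)  # predict next word, update state
        if word == eos:                          # stop at end-of-sequence token
            break
        output.append(word)
    return output
```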
RNN Encoder-Decoder Architecture
The RNN (Recurrent Neural Network) Encoder-Decoder architecture, introduced by Yoshua Bengio's team in 2014, was a significant advance in sequence-to-sequence (seq2seq) learning tasks such as machine translation.
Here’s how the architecture works:
Encoder
The encoder processes the input sequence (e.g., a sentence in the source language).
It’s a recurrent neural network that reads the input tokens one by one.
At each step, the RNN updates its hidden state based on the current input token and its previous hidden state.
The final hidden state of the RNN (after the last input token is processed) serves as a compressed representation of the entire input sequence, often called the context vector.
This is a dense, fixed-size representation: its dimensionality stays the same irrespective of the length of the input sentence, and it captures the semantic relationships between the words.
The encoder uses Gated Recurrent Units (GRUs), a type of RNN architecture. Each word’s embedding is fed into the GRUs in a sequential manner, corresponding to time steps.
For each word, the input embedding and the previous hidden state are processed through the layers.
By the end of the input sequence, the encoder generates a set of hidden states that represent the entire sequence. These hidden states serve as the context for the decoder.
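As a concrete illustration, here is a minimal PyTorch sketch of such a GRU encoder. The class name and the hyperparameters (`emb_dim=256`, `hidden_dim=512`) are illustrative choices, not values from the original paper:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Minimal GRU encoder sketch; hyperparameters are illustrative."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                   # src: (batch, src_len) token ids
        embedded = self.embedding(src)        # (batch, src_len, emb_dim)
        outputs, hidden = self.gru(embedded)
        # outputs: (batch, src_len, hidden_dim) -> one hidden state per token
        # hidden:  (1, batch, hidden_dim)      -> final state, the context vector
        return outputs, hidden
```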
Decoder
The decoder is another RNN that generates the output sequence (e.g., the translation of the input sentence).
It starts with the context vector generated by the encoder as its initial state.
The decoder, also composed of GRUs, takes this context as its starting point and produces the output sequence iteratively.
At each time step, based on the hidden state and the previously generated word, the decoder predicts the next word.
The process continues iteratively until a special end-of-sequence token is produced.
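A matching decoder sketch, with the same illustrative dimensions as the encoder above; it performs one decoding step at a time, and you would call it in a loop until the end-of-sequence token is produced:

```python
class Decoder(nn.Module):
    """Minimal GRU decoder sketch; dimensions match the Encoder above."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # the prediction head

    def forward(self, prev_token, hidden):
        # prev_token: (batch, 1) id of the previously generated word
        # hidden:     (1, batch, hidden_dim) state, initialized with the context
        embedded = self.embedding(prev_token)         # (batch, 1, emb_dim)
        output, hidden = self.gru(embedded, hidden)   # one decoding time step
        logits = self.out(output.squeeze(1))          # (batch, vocab_size)
        return logits, hidden
```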
Prediction Head
At each step in the decoding process, the decoder performs a multiclass classification task.
The classification is over the entire target vocabulary to predict the next token.
This prediction head is a dense layer followed by a softmax, which outputs a probability distribution over the possible tokens.
An argmax operation then picks the word with the highest probability as the predicted output.
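Continuing the hypothetical sketch above, a single prediction step could look like this (the start-token id and the zero context are stand-ins for illustration):

```python
import torch
import torch.nn.functional as F

vocab_size = 10_000
decoder = Decoder(vocab_size)                 # the sketch class defined above
hidden = torch.zeros(1, 1, 512)               # stand-in for the encoder context
prev_token = torch.tensor([[1]])              # stand-in start-of-sequence id

logits, hidden = decoder(prev_token, hidden)  # dense layer output: (1, vocab_size)
probs = F.softmax(logits, dim=-1)             # probability distribution over vocab
next_token = torch.argmax(probs, dim=-1)      # greedy pick of the next word
```

Note that argmax over the probabilities equals argmax over the raw logits, so the softmax matters mainly for training and for alternatives to greedy decoding such as sampling or beam search.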
Challenges and Limitations
The context vector becomes a bottleneck as it must contain all the information of the input sequence, leading to information loss, especially for longer sequences.
RNNs are difficult to parallelize due to their sequential nature, resulting in longer training times.
RNNs also suffer from issues like vanishing and exploding gradients.
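This last point can be made visible with a toy experiment: backpropagation through time multiplies many recurrent Jacobians together, so the gradient norm shrinks or grows exponentially with sequence length. The random matrix here is only a stand-in for those Jacobians:

```python
import torch

torch.manual_seed(0)
n, steps = 64, 50
for scale, label in [(0.05, "vanishing"), (0.20, "exploding")]:
    W = scale * torch.randn(n, n)   # stand-in for the recurrent Jacobian
    g = torch.ones(n)               # stand-in for the gradient at the last step
    for _ in range(steps):
        g = W.T @ g                 # repeated multiplication, as in BPTT
    print(f"{label}: gradient norm after {steps} steps = {g.norm().item():.3e}")
```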
Attention Mechanism
To mitigate the bottleneck problem of the context vector, the attention mechanism was introduced later in 2014 by Bahdanau, Cho, and Bengio.
Instead of using a single context vector, the attention mechanism allows the decoder to look at all the hidden states of the encoder (not just the last one) and selectively focus on parts of the input sequence during decoding.
Attention mechanisms in neural networks were designed to help sequence-to-sequence models focus on certain parts of the input when producing specific parts of the output, just as humans pay attention to specific portions of input when understanding or translating sentences.
Bahdanau Attention
The Bahdanau Attention is named after Dzmitry Bahdanau, one of the first researchers to introduce this concept in the context of neural machine translation.
Instead of only using the final hidden state of the encoder RNN as the context, Bahdanau Attention considers all the hidden states from the encoder.
For each word in the decoder, alignment scores are computed between the current decoder hidden state and all the encoder hidden states. This score determines the importance of words in the encoder sequence when predicting a particular word in the decoder sequence.
These alignment scores are then passed through a softmax layer to produce the attention weights.
The weighted sum of the encoder hidden states (using the attention weights) produces the context vector for each decoding time step.
The attention weights can change at each time step of the decoder, allowing the model to focus on different parts of the source sequence at different time steps of the decoding process.
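Here is a minimal PyTorch sketch of this additive scoring scheme; the layer names (`W_dec`, `W_enc`, `v`) and the dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BahdanauAttention(nn.Module):
    """Sketch of Bahdanau (additive) attention; dimensions are illustrative."""
    def __init__(self, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.W_dec = nn.Linear(hidden_dim, attn_dim)  # projects decoder state
        self.W_enc = nn.Linear(hidden_dim, attn_dim)  # projects encoder states
        self.v = nn.Linear(attn_dim, 1)               # scores each source position

    def forward(self, dec_hidden, enc_outputs):
        # dec_hidden:  (batch, hidden_dim)            current decoder state
        # enc_outputs: (batch, src_len, hidden_dim)   all encoder hidden states
        scores = self.v(torch.tanh(
            self.W_dec(dec_hidden).unsqueeze(1) + self.W_enc(enc_outputs)
        )).squeeze(-1)                        # alignment scores: (batch, src_len)
        weights = F.softmax(scores, dim=-1)   # attention weights sum to 1
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context, weights               # context: (batch, hidden_dim)
```

At each decoder step you would call this module with the current decoder state, then feed the resulting context vector into the decoder GRU alongside the embedding of the previous word.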
The key difference between Bahdanau’s approach and the earlier work from Bengio’s team is the method and granularity of attention application:
Granularity of Attention: the earlier work was broader, setting the stage for the idea that neural models could “focus” on different parts of input data. Bahdanau’s Attention made this concept precise by determining how sequence-to-sequence models could dynamically focus on different parts of an input sequence during the decoding process.
Dynamic Context Vector: While Bengio’s team introduced the foundational idea, Bahdanau Attention went a step further by dynamically creating a context vector for each decoder time step based on the attention weights, instead of a single fixed-size context vector.
Alignment Scores: Bahdanau introduced the concept of alignment scores between decoder and encoder states, leading to the attention weights, which weren’t explicitly present in the original RNN Encoder-Decoder.
Transformers
The Transformer architecture was introduced in 2017 by Vaswani et al. in the paper "Attention Is All You Need".
It uses a mechanism called "self-attention" to weigh the input tokens differently, and this is applied not only in the encoder-decoder context but also within the encoder and decoder independently.
The Transformer completely does away with recurrence, relying on the attention mechanism to draw global dependencies between input and output, making it highly parallelizable.
Key components
Self-Attention: At its core, the Transformer uses a mechanism called "self-attention", which weighs input tokens differently in relation to each other (see the sketch after this list).
Unlike the attention mechanisms we discussed earlier that focus on relationships between encoder and decoder sequences, self-attention focuses on relationships within a single sequence.
Multi-Head Attention: Instead of a single set of attention weights, the Transformer computes multiple sets, enabling the model to focus on different parts of the input for different tasks or reasons.
Positional Encoding: Since the Transformer doesn’t have any recurrence, it doesn’t know the positions of tokens. Positional encodings are added to embeddings to provide a notion of token position.
Feed-forward Networks: Each Transformer layer contains feed-forward networks that operate independently on each position.
Layer Normalization & Residual Connections: These help in training deep Transformer models.
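To make the core idea tangible, here is a sketch of scaled dot-product self-attention for a single head; multi-head attention runs several of these in parallel and concatenates the results. All sizes are illustrative:

```python
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one head (sizes are illustrative).
    x: (batch, seq_len, d_model); W_q, W_k, W_v: (d_model, d_k) projections."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v             # queries, keys, values
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # every token scores every token
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ V                              # weighted mix of value vectors

torch.manual_seed(0)
x = torch.randn(1, 5, 16)                           # 1 sentence, 5 tokens, d_model=16
W_q, W_k, W_v = (torch.randn(16, 8) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)              # (1, 5, 8)
```

Because all tokens are processed in a few matrix multiplications rather than one token at a time, this is what makes the Transformer highly parallelizable.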
Differences from previous mechanisms
Attention:
- Bahdanau Attention is focused on aligning parts of the source sequence to the target sequence in a seq2seq model.
- Transformers use self-attention to relate different positions of a single sequence, understanding the context in which a word appears.
Recurrence:
- Bahdanau Attention typically works alongside RNNs.
- Transformers completely discard recurrence, relying only on attention mechanisms.
Parallelization:
- RNNs, used in Bahdanau Attention, process sequences token by token, making parallelization challenging.
- Transformers, with their self-attention mechanism, can process all tokens in parallel, which can speed up training.
Similarities between these mechanisms
Attention Weights: Both architectures utilize attention weights derived from scoring mechanisms to weigh the significance of certain parts of the input.
Dynamic Context: Both architectures generate a dynamic context based on the input sequence, though the method and scope of generation differ.
Dependencies: Both methods aim to capture dependencies in input data, allowing models to focus on relevant parts of the input when generating output.
Conclusion
Understanding the RNN Encoder-Decoder framework and the attention mechanism is crucial for appreciating the development of modern LLMs.
These architectures have drastically improved the efficiency and effectiveness of sequence modeling tasks, from translation to text generation.
The Transformer model, in particular, has become the backbone of most state-of-the-art LLMs today.
If you like this article, share it with others ♻️
That would help a lot ❤️