Decoding AI Jargons

Encoder-
Encoder in AI terms is a component of the Transformer architecture, which was introduced by the Google Brain team in 2017 in their research paper "Attention Is All You Need". But here we are trying to decode it in simple terms using real-world examples. So think of yourself as a soldier sitting at the enemy's border at the time of war, and you are about to run out of ammo. In that situation you decide to communicate with the base team to send more ammo so that you don't lose your position. But in the back of your mind you remember that if the enemy gets wind of the idea, you are done and dusted. So you come up with a plan of sending an encoded message, just like in historical wars with the Enigma machine and Morse code. You write your message in a language and it is transmitted to the base in the encoded format. The base has a decoder present that decodes the message into a human-readable format for the personnel at the base to understand and perform the required actions. BERT models are encoder-only models, i.e. they only have an encoder and no decoder. You have been saved from the misery at the front lines, but what if the people at the base were never able to understand? Let's see what a decoder is.
Decoder-
A decoder is nothing but a converter, just like the encoder: the decoder understands the encoded language and maps each word to its meaning in natural language that a person can understand. For example, if you visit a foreign country like China and you do not know any Chinese, Google Translate can be your decoder, as it can translate a language that is not understandable by you (Chinese) to a language that you are familiar with (English/Hindi). But this is just the working of a decoder explained for a 7-year-old, and telling everything about the encoder or decoder at this stage doesn't make much sense. So we are going to break down the entire encoder and decoder architecture that was used in the original transformer model and understand it piece by piece; this will help us build a deeper understanding of the Transformer architecture as well as its subcomponents.
Transformer Architecture-
This extremely complicated architecture is called a transformer. Broadly, it has two parts: the Encoder and the Decoder. The Transformer architecture is the building block of the entire LLM and chatbot ecosystem. Let's start by diving deep into the Encoder part first.
Transformer’s Encoder
Think of entering the Encoder block as giving an input to a simple chat model like OpenAI's ChatGPT. You give it prompts in natural language, in English. But do machines understand English? No, right? They understand binary, or maths. So our primary task is to make the chat model understand what the user is saying.
That is what we call a user input query, or a prompt. A prompt is a string that is given to the model. Now we need to convert the input prompt into a language that is understandable by computers. That can be binary or tokens, but what are tokens?
Tokenization- It is the process of breaking down a sentence into words or sub-words, and the piece of code that is used to do that is called a Tokenizer. We can have different kinds of tokenizers: character-level, word-level and sub-word level. GPT uses a sub-word level tokenizer.
This is the tokenization of the sentence "hello my name is Satvik" by the GPT-4o model. As we have established, GPT uses a sub-word tokenizer, so "Sat" is considered one token and "vik" is considered another.
But this is still not a language that computers can understand, so we map each token to a particular number, i.e. a Token ID. These are usually integers and are used to map each unique token to a specific number. These are the token IDs for the above given sentence. But how big can these numbers be? In the GPT-4o model there are around 200k token IDs, and this 200k is called the vocabulary size of the model.
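If you want to try this yourself, here is a minimal sketch using the open-source tiktoken library (assuming it is installed and that your version ships the GPT-4o encoding); the exact IDs you see depend on the tokenizer, so treat the output as illustrative.

```python
# Tokenize a sentence the way GPT-4o does and inspect the token IDs.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")      # GPT-4o's sub-word tokenizer

token_ids = enc.encode("hello my name is Satvik")
print(token_ids)                                  # a short list of integers (the token IDs)
print([enc.decode([t]) for t in token_ids])       # the sub-words each ID maps back to
print(enc.n_vocab)                                # vocabulary size (~200k for GPT-4o)
```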
Embeddings- Now we have converted the words (sub-words actually) to token IDs, but now I want to find the relation/similarity between words, for example "King" and "Monarch", or "Happy" and "Joyful". So it means that we need to convert the token IDs into embeddings. Embeddings are nothing but the vector representation of each token ID in an n-dimensional space.
The token IDs are passed through an embedding model, and that model converts each token into a vector embedding. These vector embeddings are just projections of that sub-word/token ID into an n-dimensional space, based on the embedding model used.
Here, King and Monarch have similar semantic meaning, so their embedding vectors will be close to each other in the n-dimensional space. If we calculate the similarity score (cosine similarity) between these two words, it will be high, because embeddings store the essence/meaning of the data.
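To make the idea of "closeness" concrete, here is a toy sketch of cosine similarity. The 4-dimensional vectors below are made up for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
# Cosine similarity: close to 1 means the vectors point in the same direction, lower means less related.
import numpy as np

king    = np.array([0.8, 0.1, 0.6, 0.3])   # hypothetical embedding of "King"
monarch = np.array([0.7, 0.2, 0.5, 0.3])   # hypothetical embedding of "Monarch"
banana  = np.array([0.0, 0.9, 0.1, 0.7])   # hypothetical embedding of "Banana"

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(king, monarch))    # close to 1: similar meaning
print(cosine_similarity(king, banana))     # much lower: unrelated meaning
```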
Positional Encoding- We have created the embeddings of the words and now the model is able to understand the embedding vectors. But is it enough? Let's see with two examples-

1. The cat sat on the mat. / The mat sat on the cat.

2. River Bank / ICICI Bank

In the first example, both sentences contain exactly the same words, so they would end up with similar embeddings even though they mean very different things, because the position of the words is not captured. In the second example, the words around the word "Bank" influence its meaning. So we have established that we need some way to incorporate the nearby words and their position with reference to each other into the encoding vector.
Basically, the positional encoding step tells the model the current word's position in the string of words, i.e. how the words are arranged in a particular sentence.
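For the curious, here is a sketch of the sinusoidal positional encoding used in the original transformer paper, where each position gets a unique pattern of sines and cosines that is added to the token's embedding. The sequence length and embedding size below are small made-up numbers, just for illustration.

```python
# Sinusoidal positional encoding: one row per position, one column per embedding dimension.
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                         # positions 0 .. seq_len-1
    i = np.arange(d_model)[None, :]                           # embedding dimensions
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                      # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])                      # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=6, d_model=8)   # e.g. 6 words, 8-dimensional embeddings
print(pe.shape)                                   # (6, 8) - added element-wise to the embeddings
```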
Multi-head attention- To understand this, let's first understand what self-attention is. Self-attention is basically a conversation between the words of a sentence. Here the words communicate with each other and figure out how a particular word affects the meaning of some other word or of the entire sentence.
"The cat sat on the mat"
In the sentence above, both mat and cat depend on each other, so an attention matrix is created that represents that the word cat will give more attention to the word mat, and vice versa.
Multi-head attention is just self-attention done on multiple fronts: the same computation is run several times in parallel, with each head free to focus on a different kind of relationship between the words.
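Here is a simplified sketch of a single self-attention head in NumPy. In a real model the Q, K and V projection matrices are learned; the random matrices below are stand-ins just to show the flow of the computation, and multi-head attention simply runs several of these heads in parallel and concatenates their outputs.

```python
# Scaled dot-product self-attention for one head (illustrative, not a trained model).
import numpy as np

def self_attention(x):
    d = x.shape[-1]
    rng = np.random.default_rng(0)
    W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))   # stand-in projections
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(d)                  # how much each word attends to every other word
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability before softmax
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V                             # each word becomes a weighted mix of all words

x = np.random.randn(6, 8)          # 6 tokens ("The cat sat on the mat"), 8-dim embeddings
print(self_attention(x).shape)     # (6, 8): same shape as the input, now context-aware
```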
Addition and normalisation- Here the output of the attention heads is added back to the original embeddings (a residual connection) and the result is normalised. Layer normalisation is used because it helps us achieve stabilised training as well as faster convergence.
Feed Forward Neural Network- This type of neural network contains weights as well as biases, and these are optimised during backpropagation, which starts once we calculate the loss.
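As a rough picture, this feed-forward block is just two linear layers with a non-linearity in between, applied to each token's vector independently. The sizes below are made up for illustration.

```python
# Position-wise feed-forward block: expand, apply ReLU, project back down.
import numpy as np

d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)      # first layer's weights and biases
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)   # second layer's weights and biases

def feed_forward(x):
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU non-linearity
    return hidden @ W2 + b2               # project back to the model dimension

tokens = rng.standard_normal((6, d_model))   # 6 token vectors coming out of attention
print(feed_forward(tokens).shape)            # (6, 8): applied to each token independently
```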
After the completion of all these steps, plus another round of addition and normalisation, the output is finally ready to be passed to the Decoder.
Masked Multi-Head Attention- This is similar to multi-head attention but with a simple twist: the decoder is only able to look at the tokens it has already generated, i.e. while generating/predicting the next word, the decoder can only look at the output it has created so far and not at any future words.
The cat sat on ………..

In the example given above, the decoder is not able to look at the next word while predicting it, but it can look at the previous words/tokens that it has already generated.
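In practice this is done with a causal (look-ahead) mask: the attention scores for future positions are set to minus infinity, so after the softmax a token's attention over future tokens is exactly zero. A small sketch with made-up scores:

```python
# Causal mask: row i can only attend to positions 0..i.
import numpy as np

seq_len = 5                                        # "The cat sat on ..."
scores = np.zeros((seq_len, seq_len))              # pretend attention scores (all equal here)
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
scores[mask] = -np.inf                             # hide all future tokens

weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                        # each row spreads attention only over earlier tokens
```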
Multi-Head Attention (Encoder-Decoder attention) - This is the layer that attends to the encoder's output. Here the decoder can see all of the encoder's tokens, unlike in masked multi-head attention, so that the decoder knows at all times what it is generating with respect to the input it was given.
Feed-forward Neural Network - A similar 2-layer neural network is present here as well, applied independently to each token.
Linear Layer (Logits Layer) - Converts the hidden-dimension vector to the vocabulary size; the output is raw, unnormalised scores (logits) for each vocabulary token.
Softmax - A softmax function converts the output vectors/logits into a probability distribution, and we can pick the next token/word based on the temperature we have set. Temperature is basically the level of variance/creativity we are letting the model have in its next-word prediction task.
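Here is a small sketch of how temperature reshapes that probability distribution. The logits below are made-up scores for five candidate next tokens: a lower temperature sharpens the distribution (more deterministic), a higher one flattens it (more creative).

```python
# Softmax with temperature over hypothetical logits from the linear layer.
import numpy as np

logits = np.array([2.0, 1.0, 0.5, 0.2, -1.0])     # made-up raw scores for 5 candidate tokens

def softmax_with_temperature(logits, temperature):
    scaled = logits / temperature
    scaled = scaled - scaled.max()                 # for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

print(softmax_with_temperature(logits, temperature=0.5))   # peaky: almost always picks the top token
print(softmax_with_temperature(logits, temperature=1.5))   # flatter: more variety when sampling
```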
Here is a simplified Architecture for better understanding of Transformers-
Knowledge cutoff- GPT models are trained only on data available up to a particular date. For example, suppose you train a model today on the entire internet's data as of today. If I later ask who won an IPL trophy awarded after today, the model will not be able to answer, because its knowledge cutoff is today. Nowadays we provide LLMs with agentic tools that can search and fetch data from the internet in real time based on the user's query and return the latest results, but in a plain LLM use-case the model is not able to answer anything past its knowledge cutoff date.