Understanding Transformers and AI Jargon

Aadarsh Banjade
6 min read

The transformer is an architecture used by modern large language models like ChatGPT, Gemini, Llama, and so on, and it is responsible for revolutionizing Natural Language Processing (NLP).

It was first introduced in the research paper Attention Is All You Need, published by Google in 2017. It is different from other architectures like the Recurrent Neural Network and the Convolutional Neural Network. Let’s see how.

One thing that separates the transformer from architectures like the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN) is its attention mechanism. We will discuss it later; for now, understand it as a crucial part of the transformer.

Transformer architecture, as presented in the "Attention Is All You Need" research paper by Google

Let’s walk through the transformer, phase by phase.

Input Phase

Say we have a sentence, “My cat took my laptop”.

This sequence of words is first converted into tokens. Tokens are like the beads of a necklace. When we say “My cat took my laptop” is tokenized, we mean the sequence of text is split into individual units, in this case words:

[“My”, “cat”, “took”, “my”, “laptop”]

These are the tokens of our sentence “My cat took my laptop”. Tokens do not always mean words like the ones above; they can also be characters or sub-word pieces (depending on the text and the tokenization model).

Let’s see tokenization through code.

import tiktoken

# Load the tokenizer used by the gpt-4o model
encoder = tiktoken.encoding_for_model("gpt-4o")

text = "My cat took my laptop"
tokens = encoder.encode(text)  # convert the text into token IDs
print(tokens)

Output: a list of integers (token IDs).

What is this? Shouldn’t I get [“My”, “cat”, “took”, “my”, “laptop”] instead? Well, what we got here are token IDs. Basically, our words are mapped to numbers.

Why bother mapping to numbers? Because math is done with numbers, so we map our words to numbers. This is called encoding. Encoding means we converted the above text (which can be words, characters, etc.) into something that our model can understand: numbers.

In short, tokenization means converting some text into smaller, more manageable units like words or characters. Encoding means converting those units into a format that models can understand.
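Since encoding is just a mapping, it is reversible. Continuing the example above, tiktoken’s decode turns the token IDs back into text:

# Continuing from the tokenization example above
print(encoder.decode(tokens))  # My cat took my laptop

# We can also inspect which piece of text each token ID stands for
for token_id in tokens:
    print(token_id, "->", repr(encoder.decode([token_id])))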

1st Problem in input phase

Here is our first problem. Recall our input: “My cat took my laptop”. This is obvious to us; it’s easy for us to understand. However, that ease and meaning are not so easy for a model to get. Why? Because human language does not follow hard and fast rules like a formula. So we need a mathematical method to make sense of these words. We can’t just map words to numbers and expect a language model to generate text that makes sense. If we stopped at tokenization (and encoding), we would get horrible output (text) from the model, because it would not have any context for the words it is generating. So, the main problem is understanding the underlying context.

Vector Embeddings (Solution to 1st problem)

That’s where vector embeddings come into play. What do they do? They simply cluster words with closer meanings together in vector space. For example, “cat” is closer in meaning to “pet”, so these two words will be closer in vector space. If we can bring words together like this, our model gets semantic meaning. In other words, vector embeddings provide semantic meaning to our tokens. How do they do it? In basic terms, by storing each token as a vector. See the visualization below:

Source: TensorFlow Embedding Projector
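Here is a minimal sketch of that idea, using made-up 3-dimensional vectors (real embeddings are learned during training and have hundreds or thousands of dimensions). Closeness in vector space is commonly measured with cosine similarity:

import numpy as np

# Toy, hand-picked embeddings just for illustration (not real learned values)
embeddings = {
    "cat":    np.array([0.9, 0.8, 0.1]),
    "pet":    np.array([0.8, 0.9, 0.2]),
    "laptop": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; near 0 means unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["cat"], embeddings["pet"]))     # high (~0.99)
print(cosine_similarity(embeddings["cat"], embeddings["laptop"]))  # low (~0.30)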

2nd Problem in input phase

Wait, there is still one problem here. Recall our sentence: “My cat took my laptop”. Now let’s modify it a little: “My laptop took my cat”.

We know what happened here. But do you think our model can understand the difference? The tokens are the same, just in a different order, but meaning-wise, will it treat the two sentences differently? That’s our second problem. Our model needs some kind of mechanism to spot this kind of difference.

Positional encoding (Solution to 2nd problem)

If we look at it logically, we were able to spot the difference because of the positions of the words, right? So, if we can make our model see these positions, it may process “My cat took my laptop” and “My laptop took my cat” differently.

That’s where positional encoding comes into play. Basically, its job is to give each token a position. In other words, it helps preserve the order of the tokens. As a result, our model can tell the difference between “My cat took my laptop” and “My laptop took my cat”.
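The original paper implements this with sinusoidal positional encodings: each position gets a unique pattern of sine and cosine values, which is added to the token’s embedding vector. A minimal sketch:

import numpy as np

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]  # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

# 5 tokens ("My cat took my laptop"), toy embedding size 8
pe = positional_encoding(5, 8)
print(pe.shape)  # (5, 8)
# This matrix is added to the token embeddings, so the same word
# at a different position ends up with a different final vector.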

Steps in Input phase

  1. tokenization (converting sequence of text to tokens)

  2. vector embeddings (providing semantic meaning)

  3. positional encoding (position of tokens)

Attention mechanism and Feedforward Network

OK, we finally passed the input phase. Now, what we need to do next is predict the next word. How can we achieve that? Well, we can do so with a very big neural network trained to predict the next word. But there is a crucial step added here. If you recall, I said “attention” is a crucial part of the transformer architecture. This is where the attention mechanism comes into play: alongside the neural network (a feedforward network), there is the addition of an attention mechanism.
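Before we get to attention, a quick look at the feedforward part: in a transformer it is just two learned linear transformations with a non-linearity in between, applied to each token’s vector independently. A rough sketch with random (untrained) weights:

import numpy as np

d_model, d_hidden = 8, 32
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)

def feed_forward(x):
    # x: (seq_len, d_model) -> (seq_len, d_model), with a ReLU in between
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(5, d_model))  # 5 toy token vectors
print(feed_forward(x).shape)       # (5, 8)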

Understanding the attention mechanism is simple. First, let’s understand the problem. We applied vector embeddings, positional encoding, and what not to our input, right? Why did we do that? So that we could have better context for our sentence. While that solved some problems, there is still one left to solve. Consider the following examples:

“Let’s go to the bank of the river”

“Let’s go to the bank to deposit our money”

Notice the word “bank”. In the 1st sentence, we are referring to the edge of a river; in the 2nd, to a financial institution. Now, how would you make your model understand this? Same word, different meanings.

This is where the attention mechanism comes into play. But how does it work? Well, simply by allowing tokens to talk to each other. What?

Put this aside and think for yourself. How did you know which “bank” meant what? You read the other words too. When it said money, you thought “oh, that’s the institution bank”, and when you saw river, “oh, that’s the riverside”.

That is the same thing we are doing by allowing tokens to talk to each other. This is self-attention.

So, when the model sees money, it will move the word bank closer to the word money (in vector space). As a result, our model gets much better context. If you look at the architecture of the transformer, you will see multi-head attention.

In basic terms, multi-head attention is a much more powerful implementation of the attention mechanism: several attention “heads” run in parallel, each free to focus on a different kind of relationship between tokens.

Multi-head attention is a key player behind today’s state-of-the-art language models.
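For the curious, here is a minimal sketch of (single-head) scaled dot-product attention, the core computation described in the paper. Each token is projected into a query, a key, and a value; the query-key dot products decide how much each token “listens to” every other token. Multi-head attention simply runs several of these in parallel with different learned projection matrices:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model) token vectors (embeddings + positional encoding)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how much each token attends to every other token
    weights = softmax(scores)        # each row sums to 1
    return weights @ V               # each output is a weighted mix of all tokens

rng = np.random.default_rng(0)
d_model = 8
X = rng.normal(size=(5, d_model))  # 5 toy tokens
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)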

Softmax

Above, we saw a combination of a feedforward neural network and attention. Let’s call this a transformer block. Now, this transformer block will give some scores. What scores? Scores for all possible words or characters. Remember, what we are doing here is predicting the next word. So, our transformer block is trying to predict the next word, and to do so, it produces a score for every possible token (these raw scores are often called logits).

The role of softmax is very simple: turn those scores into probabilities that add up to 1.
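A quick illustration with three made-up candidate scores:

import numpy as np

scores = np.array([2.0, 1.0, 0.1])  # raw scores for three candidate tokens
probs = np.exp(scores) / np.exp(scores).sum()
print(probs)        # ~[0.659, 0.242, 0.099]
print(probs.sum())  # 1.0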

Temperature

Our transformer block (self-attention + feedforward network) gives scores, and softmax converts them into probabilities (adding up to 1). The word or character with the highest probability is the one that most likely suits the context. However, we can alter this behaviour with the temperature parameter.

As temperature increases, randomness in the selection goes up. In other words, the temperature determines how random, or “creative”, the language model’s output is.
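In practice, temperature is applied by dividing the scores by the temperature before the softmax. A small sketch, reusing the scores from above:

import numpy as np

def softmax_with_temperature(scores, temperature):
    scaled = scores / temperature
    e = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax_with_temperature(scores, 0.5))  # sharper: the top token dominates
print(softmax_with_temperature(scores, 1.0))  # the plain softmax from above
print(softmax_with_temperature(scores, 2.0))  # flatter: sampling gets more random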

Things we covered

We got an introduction to some common terminology of large language models:

  • Transformer

  • Tokenization

  • Encoding

  • Vector embeddings

  • Positional encoding

  • Self-attention

  • Softmax

  • Temperature

Note that we only discussed the transformer architecture at a surface level (without going into depth or the maths).
