Transformers 101: Unfolding ChatGPT & Attention Is All You Need


This is the starting point of a new series where I introduce you to AI and show you how to apply it in your workflow so we can build great products together.
So let's start with some terms that aren't strictly necessary but help us understand what happens behind the scenes when a model works. Let's go back to 2017, when Google published a paper called "Attention Is All You Need." The paper was originally about machine translation at Google, but it paved the way for a new generation of AI models.
The entire paper revolves around one big, complex diagram, shown below. Let me break it down for you.
To understand this diagram, we first need to understand what it's all about and where it leads us: towards generative AI.
What is GenAI 🤖
GenAI, or generative artificial intelligence, is a type of AI that creates content based on an input known as a prompt. The content is generated using the architecture in the diagram above and is partly random and partly probabilistic, which means it can be correct or incorrect depending on probability: the next word is always chosen based on what is most likely to follow.
For example, if you type "SIKE that's the wrong" and ask GPT to complete it, it will most likely continue with something like "number", because that is the most probable next word. It is this probability that allows GPT to guess the next word.
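If you want to see next-word prediction in action, here is a minimal sketch assuming the Hugging Face transformers library and the small GPT-2 model (my choice of tooling for illustration, not something from the paper):

```python
# A minimal sketch of next-word prediction, assuming the Hugging Face
# "transformers" library and the small GPT-2 model are available.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Ask the model to continue the prompt with a few more tokens (greedy, no sampling).
result = generator("SIKE that's the wrong", max_new_tokens=5, do_sample=False)
print(result[0]["generated_text"])
```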
Now let's come back to our diagram and see what's happening there, starting with transformers themselves.
Transformers
What are Transformers?
Transformers are a key part of NLP (Natural Language Processing). Before transformers, NLP models like RNNs and LSTMs processed one word at a time, which made them difficult to parallelize. Transformers changed this by allowing parallel processing, making them easier to scale with GPUs and large datasets. Unlike previous models, transformers process all the words in parallel using only the attention mechanism, so they don't carry a hidden state or need to finish one word before starting the next: there is no sequential word processing, and all word relationships are calculated at once using attention.
ChatGPT generates output in a slightly different way: it uses self-attention over all the content it has already generated to produce the next word, and it repeats this process over and over until the entire output is generated.
The transformer has several core components:
Encoder
The encoder is the starting component of the transformer. It converts the input, such as a sentence, into a numerical representation unique to each word. This representation is used in the next steps and is a crucial part of the process.
The encoder outputs a set of contextualized embeddings, one for each token. These embeddings capture both the meaning of the token and its relationship to other tokens in the sentence. This greatly aids the process by keeping related information close together in a multi-dimensional space.
You can think of the encoder as a translator that helps convert the meaning of a sentence into embeddings, which show how each token relates to other tokens.
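As a rough illustration of "sentence in, contextual embeddings out", here is a sketch assuming the Hugging Face transformers library and a pretrained BERT encoder (my choice of model for illustration):

```python
# A minimal sketch: turning a sentence into contextual embeddings with a
# pretrained encoder (BERT), assuming the Hugging Face transformers library.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize the sentence and run it through the encoder.
inputs = tokenizer("I love to sleep", return_tensors="pt")
outputs = model(**inputs)

# One contextual embedding vector per token (shape: 1 x num_tokens x 768).
print(outputs.last_hidden_state.shape)
```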
Decoder
Now that we've converted the initial input into contextual embeddings using the encoder, the model has a deep understanding of the input sequence.
But for Henry, or anyone else chatting with the chatbot, to understand the response, the output needs to be meaningful text. They can't work with a pile of numbers (the embeddings).
This is where the decoder comes in.
The decoder takes the encoder's output and turns it into an output sequence of text, one token at a time, effectively turning the learned representation back into understandable language.
You might be thinking, "Yagya, you mentioned things like vectors and embeddings. What are those?"
Let's talk about vectors first.
Vectors
Vectors are simply arrays of numbers that represent data in a way machines can understand and derive meaning from.
In transformers, vectors play an important role by representing words using embeddings, capturing their relationships, meanings, positions, and more.
Imagine you're describing a phone with a vector, which might look like this:
[0.12, 0.98, -0.44, 0.03, ..., 0.67]
A similar vector for a loan might look like this:
[0.10, 0.95, -0.40, 0.02, ..., 0.63]
These two vectors are numerically similar, so they sit close together in the vector space.
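To make "close together" concrete, here is a small sketch that measures how similar two vectors are using cosine similarity; the numbers are shortened made-up values based on the example above, and NumPy is assumed:

```python
# A small sketch of vector similarity using cosine similarity, with made-up
# numbers and assuming NumPy is installed.
import numpy as np

phone = np.array([0.12, 0.98, -0.44, 0.03, 0.67])
loan = np.array([0.10, 0.95, -0.40, 0.02, 0.63])

# Cosine similarity: 1.0 means identical direction, 0 means unrelated.
similarity = np.dot(phone, loan) / (np.linalg.norm(phone) * np.linalg.norm(loan))
print(similarity)  # close to 1.0, so these vectors are "close together"
```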
Embeddings
Embeddings are numerical representations of words, sentences, or even entire pages or documents, expressed as vectors in a continuous space.
You might wonder what the difference is between embeddings and vectors.
All embeddings are vectors, but not all vectors are embeddings.
Embeddings carry meaningful information according to the context. They hold semantic meanings and capture syntactical roles.
Think of embeddings as being like a Mercedes, while vectors are like a car.
Previously, I mentioned that we process each word or token in parallel, not in sequence. So, how do we ensure the sequence is maintained?
This is where positional encoding comes into play.
Positional Encoding
Positional Encoding helps a transformer determine the order of each token in a sentence, which the attention mechanism alone cannot do because we process in parallel, unlike previous methods like RNN or CNN.
Example: "I LOVE TO SLEEP" vs. "LOVE SLEEP I TO"
Without positional encoding, both sentences look the same to the transformer.
It comes in two types:
Fixed (Sinusoidal) Positional Encoding: This is what the original "Attention Is All You Need" paper uses. Each position gets a unique pattern of sine and cosine values that varies smoothly across positions, so the model can generalize to longer sequences.
Learnable Positional Embeddings:
Used by GPT and modern models
In this method, a separate embedding vector is learned for each position, similar to word embeddings. These are then added to the input token embeddings.
So, that's how positional encoding works.
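To make the sinusoidal variant concrete, here is a short sketch of the encoding idea from the paper; the NumPy implementation details are my own choices:

```python
# A sketch of fixed (sinusoidal) positional encoding, assuming NumPy.
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model):
    # One row per position, one column per embedding dimension.
    positions = np.arange(num_positions)[:, np.newaxis]      # (num_positions, 1)
    dims = np.arange(d_model)[np.newaxis, :]                  # (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates

    # Even dimensions use sine, odd dimensions use cosine.
    encoding = np.zeros((num_positions, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])
    encoding[:, 1::2] = np.cos(angles[:, 1::2])
    return encoding

# Encoding for a 4-token sentence ("I LOVE TO SLEEP") with 8 dimensions.
print(sinusoidal_positional_encoding(4, 8))
```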
Semantic Meanings
Previously, we also discussed semantic meanings when talking about embeddings. But what are semantic meanings? Are they only about the position of words in a sentence or how a word is spelled?
No, that's not it.
Semantic meaning is about the actual meaning of a word: what it refers to, the contexts it can be used in, how it relates to other words, and much more.
In NLP, when we say embeddings capture semantic meaning, we mean:
The model understands relationships:
“king” and “queen” are related by gender.
“hot” and “warm” have similar meanings.
“car” and “engine” are often linked.
In the image below, Sheetal is Aditi, Aditi is Nisha, and Nisha is Munni, suggesting they all share similar semantic meanings or represent the same underlying entity.
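If you want to poke at semantic relationships like these in real embeddings, here is a rough sketch assuming the gensim library and its downloadable GloVe vectors (my choice of tooling, not something from the paper):

```python
# A rough sketch of semantic relationships in pretrained word embeddings,
# assuming the gensim library and its downloadable GloVe vectors.
import gensim.downloader as api

# Small pretrained GloVe model (downloads on first use).
vectors = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman" should land near "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Words with similar meanings sit close together.
print(vectors.similarity("hot", "warm"))
```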
Self Attention
Self-attention helps the model not only focus on the current word or token but also look at the other words in the sentence. This helps determine the relationships between those words and how important each of them is to the current word.
Example: I LOVE TO SLEEP
Here, what is important to sleep?
Sleep what?
Love who?
Etc.
So instead of just seeing one word at a time, the model builds understanding based on the entire sentence.
To build this new representation for each word, the model does the following (a small sketch follows this list):
It calculates how much attention to give to every other word.
It multiplies those words' representations by their attention weights.
It adds them up to create a new, context-aware representation of the word.
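Here is a minimal sketch of those three steps as scaled dot-product self-attention, using NumPy and toy random matrices; real models learn query/key/value projections during training, which I skip here:

```python
# A minimal sketch of scaled dot-product self-attention, assuming NumPy.
# Real models learn the query/key/value projection matrices during training.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Q, K, V):
    d_k = K.shape[-1]
    # Step 1: how much attention each word pays to every other word.
    scores = softmax(Q @ K.T / np.sqrt(d_k))
    # Steps 2 and 3: weight the other words and add them up.
    return scores @ V

# Toy example: 4 tokens ("I LOVE TO SLEEP"), each with an 8-dim embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
# Skipping the learned projections: use X directly as Q, K, and V.
print(self_attention(X, X, X).shape)  # (4, 8): one new vector per token
```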
Softmax
Softmax is a mathematical function that turns a list of numbers into values between 0 and 1, which together add up to 1. For example, if the model generates multiple possible outputs, Softmax helps us find the probability of each one, ensuring their probabilities add up to one.
Let's say we ask the model: "What is a bat?"
It might generate two possible meanings:
a flying animal
a cricket bat
Softmax will assign probabilities to each:
Flying animal: 0.75
Cricket bat: 0.25
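Here is a tiny sketch of how softmax could turn raw scores into probabilities like these; the scores below are made-up numbers for illustration:

```python
# A tiny sketch of softmax turning raw scores into probabilities,
# assuming NumPy. The scores are made-up numbers for illustration.
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 0.9])  # raw scores for "flying animal", "cricket bat"
print(softmax(scores))          # roughly [0.75, 0.25], and they sum to 1
```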
Now, these probabilities are further used to determine the output, and we can control them, and ultimately which option gets chosen, using a setting called temperature.
Temperature
Temperature is a setting that influences how random or creative a model's output is.
How It Works: Temperature is applied to the model's raw scores before Softmax; the scores are divided by the temperature, and Softmax then converts them into probabilities:
Low temperature (< 1) → less randomness
The model is more confident, choosing high-probability words
Output is safer and more predictable
High temperature (> 1) → more randomness
The model takes more risks, trying less likely words
Output is more creative but might be less clear
This helps us build applications like an image-generation app, where we want more creative behavior, while a coding assistant like Cursor works well with a low temperature.
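Here is a small sketch of temperature scaling, reusing the same softmax and made-up scores to show how the distribution sharpens or flattens:

```python
# A small sketch of temperature scaling, assuming NumPy and made-up scores.
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

scores = np.array([2.0, 0.9])  # raw scores for "flying animal", "cricket bat"

for temperature in [0.5, 1.0, 2.0]:
    # Divide the raw scores by the temperature before applying softmax.
    probs = softmax(scores / temperature)
    print(temperature, probs)
# Low temperature -> sharper distribution (more predictable choice).
# High temperature -> flatter distribution (more randomness/creativity).
```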
Below is the Image of Bhupender (If you're not familiar with Bhupender, he is a popular meme personality known for always replying with his name).
According to this temperature concept, his temperature is set to a lower value, making him more predictable.
Multi Head Attention
Multi-Head Attention is a technique where the model runs several self-attention processes simultaneously and then combines the results.
Why? Because one attention head might focus on syntax, another on meaning, another on position, etc.
This allows for parallel execution and determines attention using multiple factors.
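As a rough sketch, PyTorch ships a ready-made multi-head attention layer; the sizes below (8-dimensional embeddings, 2 heads, 4 tokens) are arbitrary choices of mine:

```python
# A rough sketch of multi-head self-attention, assuming PyTorch.
# The sizes (8-dim embeddings, 2 heads, 4 tokens) are arbitrary.
import torch
import torch.nn as nn

attention = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

# One sentence of 4 tokens, each represented by an 8-dim embedding.
x = torch.randn(1, 4, 8)

# Self-attention: the same sequence is used as query, key, and value.
output, weights = attention(x, x, x)
print(output.shape)   # (1, 4, 8): a new context-aware vector per token
print(weights.shape)  # (1, 4, 4): attention weights averaged over the heads
```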
These are all components of the transformer that help generate output based on the given input.
Here is the structure of what happens in the transformer:
| Step | What happens |
| --- | --- |
| Input text | The text gets split → into individual tokens |
| Token embedding | Each token gets converted → into meaningful vectors |
| Positional encoding | Position info is added → so the model knows the order |
| Encoder layers | Tokens get context → they relate to each other |
| Decoder layers | Output tokens get generated → one after another |
| Attention | The model focuses → on important words across tokens |
| Softmax | Scores turn into probabilities → to pick the next word |
| Token selection | The next token is chosen → based on those probabilities |
| Repeat output | This repeats → until the entire sequence is done |
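To tie the table together, here is a toy sketch of that repeat-until-done loop; the tiny vocabulary and the random "model" below are stand-ins for a real transformer:

```python
# A toy sketch of the autoregressive generation loop, assuming NumPy.
# The "model" here just produces random scores over a tiny made-up vocabulary;
# a real transformer would compute these scores with the layers described above.
import numpy as np

vocab = ["<end>", "I", "love", "to", "sleep"]
rng = np.random.default_rng(42)

def model_scores(tokens):
    # Stand-in for the real transformer: one score per vocabulary word.
    return rng.normal(size=len(vocab))

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

tokens = [vocab.index("I")]                      # the prompt, as token ids
for _ in range(10):
    probs = softmax(model_scores(tokens))        # scores -> probabilities
    next_token = rng.choice(len(vocab), p=probs) # token selection
    tokens.append(next_token)
    if vocab[next_token] == "<end>":             # stop at the end token
        break

print(" ".join(vocab[t] for t in tokens))        # tokens -> readable text
```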
So, this is how we convert the user's input into output. It's a complex process; the paper's diagram describes the same steps in more detail and with mathematical notation, but I have laid it out in simpler terms. The sequence repeats until the final output is produced.
This was the foundational step laid in 2017 that helped build all the AI models we have today.
That's all from me. Please like and share the blog if you enjoyed it, and follow for more content. Thank you so much for your support.
If you want to discuss the blog or anything tech-related, you can reach out to me on X or book a meeting on cal.com. I'll provide my Linktree below.
Yagya Goel
Socials - link