Transformers 101: Unfolding ChatGPT & Attention Is All You Need


This is the starting point of a new series where I introduce you to AI and show you how to apply it in your workflow so we can build great products together.
So let's start with some terms that aren't strictly necessary but help us understand what happens behind the scenes when a model works. Let's go back to 2017, when Google published a paper called "Attention Is All You Need." The paper was originally about machine translation at Google, but it paved the way for a new generation of AI models.
The entire paper revolves around one big, complex diagram, shown below. Let me break it down for you.
To understand this diagram, we first need to understand what it's all about and where it leads us: towards generative AI.
What is GenAI 🤖
GenAI, or generative artificial intelligence, is a type of AI that creates content based on an input known as a prompt. The content is generated using the architecture in the diagram above and is partly random and partly probabilistic, which means it can be correct or incorrect depending on probability: the next word is always chosen based on what is most likely to follow.
For example, if you type "SIKE that's the wrong" and ask GPT to complete it, it will most likely continue with something like "number", because that is the most probable next word. It is this probability that allows GPT to guess the next word.
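If you want to see next-word prediction in action, here is a minimal sketch assuming the Hugging Face transformers library and the small GPT-2 model (my choice of tooling for illustration, not something from the paper):

```python
# A minimal sketch of next-word prediction, assuming the Hugging Face
# "transformers" library and the small GPT-2 model are available.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Ask the model to continue the prompt with a few more tokens (greedy, no sampling).
result = generator("SIKE that's the wrong", max_new_tokens=5, do_sample=False)
print(result[0]["generated_text"])
```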
Now let's come back to our diagram and see what's happening there, starting with transformers themselves.
Transformers
What are Transformers?
Transformers are a key part of NLP (Natural Language Processing). Before transformers, NLP models like RNNs and LSTMs processed one word at a time, which made them difficult to parallelize. Transformers changed this by allowing parallel processing, making them easier to scale with GPUs and large datasets. Unlike previous models, transformers process all the words in parallel using only the attention mechanism, so they don't carry a hidden state or need to finish one word before starting the next: there is no sequential word processing, and all word relationships are calculated at once using attention.
ChatGPT generates output in a slightly different way: it uses self-attention over all the content it has already generated to produce the next word, and it repeats this process over and over until the entire output is generated.
The transformer has several core components:
Encoder
The encoder is the starting component of the transformer. It converts the input, such as a sentence, into a numerical representation unique to each word. This representation is used in the next steps and is a crucial part of the process.
The encoder outputs a set of contextualized embeddings, one for each token. These embeddings capture both the meaning of the token and its relationship to other tokens in the sentence. This greatly aids the process by keeping related information close together in a multi-dimensional space.
You can think of the encoder as a translator that helps convert the meaning of a sentence into embeddings, which show how each token relates to other tokens.
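As a rough illustration of "sentence in, contextual embeddings out", here is a sketch assuming the Hugging Face transformers library and a pretrained BERT encoder (my choice of model for illustration):

```python
# A minimal sketch: turning a sentence into contextual embeddings with a
# pretrained encoder (BERT), assuming the Hugging Face transformers library.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize the sentence and run it through the encoder.
inputs = tokenizer("I love to sleep", return_tensors="pt")
outputs = model(**inputs)

# One contextual embedding vector per token (shape: 1 x num_tokens x 768).
print(outputs.last_hidden_state.shape)
```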
Decoder
Now that we've converted the initial input into contextual embeddings using the encoder, the model has a deep understanding of the input sequence.
But for Henry, or anyone else chatting with the chatbot, to understand the response, the output needs to be meaningful text. They can't work with a pile of numbers (the embeddings).
This is where the decoder comes in.
The decoder takes the encoder's output and turns it into an output sequence of text, one token at a time, effectively turning the learned representation back into understandable language.
You might be thinking, "Yagya, you mentioned things like vectors and embeddings. What are those?"
Let's talk about vectors first.
Vectors
Vectors are simply arrays of numbers that represent data in a way machines can understand and derive meaning from.
In transformers, vectors play an important role by representing words using embeddings, capturing their relationships, meanings, positions, and more.
Imagine you're describing a phone with a vector, which might look like this:
[0.12, 0.98, -0.44, 0.03, ..., 0.67]
A similar vector for a loan might look like this:
[0.10, 0.95, -0.40, 0.02, ..., 0.63]
These two vectors are numerically similar, so they sit close together in the vector space.
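To make "close together" concrete, here is a small sketch that measures how similar two vectors are using cosine similarity; the numbers are shortened made-up values based on the example above, and NumPy is assumed:

```python
# A small sketch of vector similarity using cosine similarity, with made-up
# numbers and assuming NumPy is installed.
import numpy as np

phone = np.array([0.12, 0.98, -0.44, 0.03, 0.67])
loan = np.array([0.10, 0.95, -0.40, 0.02, 0.63])

# Cosine similarity: 1.0 means identical direction, 0 means unrelated.
similarity = np.dot(phone, loan) / (np.linalg.norm(phone) * np.linalg.norm(loan))
print(similarity)  # close to 1.0, so these vectors are "close together"
```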
Embeddings
Embeddings are numerical representations of words, sentences, or even entire pages or documents, expressed as vectors in a continuous space.
You might wonder what the difference is between embeddings and vectors.
All embeddings are vectors, but not all vectors are embeddings.
Embeddings carry meaningful information according to the context. They hold semantic meanings and capture syntactical roles.
Think of embeddings as being like a Mercedes, while vectors are like a car.
Previously, I mentioned that we process each word or token in parallel, not in sequence. So, how do we ensure the sequence is maintained?
This is where positional encoding comes into play.
Positional Encoding
Positional Encoding helps a transformer determine the order of each token in a sentence, which the attention mechanism alone cannot do because we process in parallel, unlike previous methods like RNN or CNN.
Example: "I LOVE TO SLEEP" vs. "LOVE SLEEP I TO"
Without positional encoding, both sentences look the same to the transformer.
It comes in two types:
Fixed (Sinusoidal) Positional Encoding: This is what the original "Attention Is All You Need" paper uses. Each position gets a unique pattern of sine and cosine values that varies smoothly across positions, so the model can generalize to longer sequences.
Learnable Positional Embeddings:
Used by GPT and modern models
In this method, a separate embedding vector is learned for each position, similar to word embeddings. These are then added to the input token embeddings.
So, that's how positional encoding works.
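To make the sinusoidal variant concrete, here is a short sketch of the encoding idea from the paper; the NumPy implementation details are my own choices:

```python
# A sketch of fixed (sinusoidal) positional encoding, assuming NumPy.
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model):
    # One row per position, one column per embedding dimension.
    positions = np.arange(num_positions)[:, np.newaxis]      # (num_positions, 1)
    dims = np.arange(d_model)[np.newaxis, :]                  # (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates

    # Even dimensions use sine, odd dimensions use cosine.
    encoding = np.zeros((num_positions, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])
    encoding[:, 1::2] = np.cos(angles[:, 1::2])
    return encoding

# Encoding for a 4-token sentence ("I LOVE TO SLEEP") with 8 dimensions.
print(sinusoidal_positional_encoding(4, 8))
```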
Semantic Meanings
Previously, we also discussed semantic meanings when talking about embeddings. But what are semantic meanings? Are they only about the position of words in a sentence or how a word is spelled?
No, that's not it.
Semantic meaning is about the actual meaning of a word: what it refers to, the contexts it can be used in, how it relates to other words, and much more.
In NLP, when we say embeddings capture semantic meaning, we mean:
The model understands relationships:
“king” and “queen” are related by gender.
“hot” and “warm” have similar meanings.
“car” and “engine” are often linked.
In the image below, Sheetal is Aditi, Aditi is Nisha, and Nisha is Munni, suggesting they all share similar semantic meanings or represent the same underlying entity.
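If you want to poke at semantic relationships like these in real embeddings, here is a rough sketch assuming the gensim library and its downloadable GloVe vectors (my choice of tooling, not something from the paper):

```python
# A rough sketch of semantic relationships in pretrained word embeddings,
# assuming the gensim library and its downloadable GloVe vectors.
import gensim.downloader as api

# Small pretrained GloVe model (downloads on first use).
vectors = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman" should land near "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Words with similar meanings sit close together.
print(vectors.similarity("hot", "warm"))
```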
Self Attention
Self-attention helps the model not only focus on the current word or token but also look at the other words in the sentence. This helps determine the relationships between those words and how important each of them is to the current word.
Example: I LOVE TO SLEEP
Here, what is important to sleep?
Sleep what?
Love who?
Etc.
So instead of just seeing one word at a time, the model builds understanding based on the entire sentence.
To build this new representation for each word, the model does the following (a small sketch follows this list):
It calculates how much attention to give to every other word.
It multiplies those words' representations by their attention weights.
It adds them up to create a new, context-aware representation of the word.
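Here is a minimal sketch of those three steps as scaled dot-product self-attention, using NumPy and toy random matrices; real models learn query/key/value projections during training, which I skip here:

```python
# A minimal sketch of scaled dot-product self-attention, assuming NumPy.
# Real models learn the query/key/value projection matrices during training.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Q, K, V):
    d_k = K.shape[-1]
    # Step 1: how much attention each word pays to every other word.
    scores = softmax(Q @ K.T / np.sqrt(d_k))
    # Steps 2 and 3: weight the other words and add them up.
    return scores @ V

# Toy example: 4 tokens ("I LOVE TO SLEEP"), each with an 8-dim embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
# Skipping the learned projections: use X directly as Q, K, and V.
print(self_attention(X, X, X).shape)  # (4, 8): one new vector per token
```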
Softmax
Softmax is a mathematical function that turns a list of numbers into values between 0 and 1, which together add up to 1. For example, if the model generates multiple possible outputs, Softmax helps us find the probability of each one, ensuring their probabilities add up to one.
Let's say we ask the model: "What is a bat?"
It might generate two possible meanings:
a flying animal
a cricket bat
Softmax will assign probabilities to each:
Flying animal: 0.75
Cricket bat: 0.25
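Here is a tiny sketch of how softmax could turn raw scores into probabilities like these; the scores below are made-up numbers for illustration:

```python
# A tiny sketch of softmax turning raw scores into probabilities,
# assuming NumPy. The scores are made-up numbers for illustration.
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 0.9])  # raw scores for "flying animal", "cricket bat"
print(softmax(scores))          # roughly [0.75, 0.25], and they sum to 1
```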
Now, these probabilities are further used to determine the output, and we can control them, and ultimately which option gets chosen, using a setting called temperature.
Temperature
Temperature is a setting that influences how random or creative a model's output is.
How It Works: Temperature is applied to the model's raw scores before Softmax; the scores are divided by the temperature, and Softmax then converts them into probabilities:
Low temperature (< 1) → less randomness
The model is more confident, choosing high-probability words
Output is safer and more predictable
High temperature (> 1) → more randomness
The model takes more risks, trying less likely words
Output is more creative but might be less clear
This helps us build applications like an image-generation app, where we want more creative behavior, while a coding assistant like Cursor works well with a low temperature.
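Here is a small sketch of temperature scaling, reusing the same softmax and made-up scores to show how the distribution sharpens or flattens:

```python
# A small sketch of temperature scaling, assuming NumPy and made-up scores.
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

scores = np.array([2.0, 0.9])  # raw scores for "flying animal", "cricket bat"

for temperature in [0.5, 1.0, 2.0]:
    # Divide the raw scores by the temperature before applying softmax.
    probs = softmax(scores / temperature)
    print(temperature, probs)
# Low temperature -> sharper distribution (more predictable choice).
# High temperature -> flatter distribution (more randomness/creativity).
```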
Below is the Image of Bhupender (If you're not familiar with Bhupender, he is a popular meme personality known for always replying with his name).
According to this temperature concept, his temperature is set to a lower value, making him more predictable.
Multi Head Attention
Multi-Head Attention is a technique where the model runs several self-attention processes simultaneously and then combines the results.
Why? Because one attention head might focus on syntax, another on meaning, another on position, etc.
This allows for parallel execution and determines attention using multiple factors.
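As a rough sketch, PyTorch ships a ready-made multi-head attention layer; the sizes below (8-dimensional embeddings, 2 heads, 4 tokens) are arbitrary choices of mine:

```python
# A rough sketch of multi-head self-attention, assuming PyTorch.
# The sizes (8-dim embeddings, 2 heads, 4 tokens) are arbitrary.
import torch
import torch.nn as nn

attention = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

# One sentence of 4 tokens, each represented by an 8-dim embedding.
x = torch.randn(1, 4, 8)

# Self-attention: the same sequence is used as query, key, and value.
output, weights = attention(x, x, x)
print(output.shape)   # (1, 4, 8): a new context-aware vector per token
print(weights.shape)  # (1, 4, 4): attention weights averaged over the heads
```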
These are all components of the transformer that help generate output based on the given input.
Here is the structure of what happens in the transformer:
| Step | What happens |
| --- | --- |
| Input text | The text gets split → into individual tokens |
| Token embedding | Each token gets converted → into meaningful vectors |
| Positional encoding | Position info is added → so the model knows the order |
| Encoder layers | Tokens get context → they relate to each other |
| Decoder layers | Output tokens get generated → one after another |
| Attention | The model focuses → on important words across tokens |
| Softmax | Scores turn into probabilities → to pick the next word |
| Token selection | The next token is chosen → based on those probabilities |
| Repeat output | This repeats → until the entire sequence is done |
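To tie the table together, here is a toy sketch of that repeat-until-done loop; the tiny vocabulary and the random "model" below are stand-ins for a real transformer:

```python
# A toy sketch of the autoregressive generation loop, assuming NumPy.
# The "model" here just produces random scores over a tiny made-up vocabulary;
# a real transformer would compute these scores with the layers described above.
import numpy as np

vocab = ["<end>", "I", "love", "to", "sleep"]
rng = np.random.default_rng(42)

def model_scores(tokens):
    # Stand-in for the real transformer: one score per vocabulary word.
    return rng.normal(size=len(vocab))

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

tokens = [vocab.index("I")]                      # the prompt, as token ids
for _ in range(10):
    probs = softmax(model_scores(tokens))        # scores -> probabilities
    next_token = rng.choice(len(vocab), p=probs) # token selection
    tokens.append(next_token)
    if vocab[next_token] == "<end>":             # stop at the end token
        break

print(" ".join(vocab[t] for t in tokens))        # tokens -> readable text
```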
So, this is how we convert the user's input into output. It's a complex process; the paper's diagram describes the same steps in more detail and with mathematical notation, but I have laid it out in simpler terms. The sequence repeats until the final output is produced.
This was the foundational step laid in 2017 that helped build all the AI models we have today.
That's all from me. Please like and share the blog if you enjoyed it, and follow for more content. Thank you so much for your support.
If you want to discuss the blog or anything tech-related, you can reach out to me on X or book a meeting on cal.com. I'll provide my Linktree below.
Yagya Goel
Socials - link