How Transformers Work in Generative AI: Explained with Analogies and Google’s Attention Breakthrough


What is Gen AI?
First, we need to understand one thing: what exactly is Gen AI?
The AI that can generate anything is Gen AI. That’s the one-liner I heard when I started learning GenAI — and it stuck. But how does it actually generate something meaningful? Let me break it down in lemon terms.
The secret sauce behind most of these powerful tools is something called a Transformer.
So, what is a Transformer in AI?
Imagine you're trying to understand a book, but instead of reading one word at a time in order, your brain is smart enough to look at the entire sentence at once and figure out what matters most — even if the important word is at the very end.
That’s what a Transformer does. It looks at all the words together, finds relationships between them, and figures out what to pay attention to.
🧠 The name “Transformer” comes from its ability to transform input text into a meaningful output — like completing a sentence or answering a question.
Think of it like group chat on steroids.
Each word in your sentence is a person in the group. Everyone talks to everyone else to understand the full conversation before replying. That’s how transformers “understand” what you're trying to say.
The next question that arises is: where did Transformers come from?
Back in 2017, Google published a landmark research paper titled “Attention is All You Need”. This paper introduced the Transformer architecture, which revolutionized how machines understand and generate human language.
- 📌 Fun fact: Google originally created it to improve Google Translate, but today it powers models like ChatGPT, Gemini, Claude, Mistral, and more.
Before this, AI models relied on RNNs and LSTMs, which had trouble understanding long-range dependencies. Transformers solved that with a technique called Self-Attention, allowing the model to “focus” on the most relevant words — even if they were far apart in the sentence.
This is the core idea behind Generative AI today.
How Do Transformers Work in Gen AI?
🪙 Transformers Predict One Token at a Time
🎨 Analogy: Imagine a magical pen that tries to guess what you will write next — one letter or word at a time.
Let’s understand this with the help of an example.
Let’s say we start with:
Input: “The capital of France is”
The Transformer:
1. Predicts: “Paris”
2. Adds it to the input: “The capital of France is Paris”
3. Predicts the next token (if needed), like a loop — until the sentence is complete.
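Here is a minimal sketch of that predict-append loop in Python. The `predict_next_token` function is a hypothetical stand-in for a real Transformer (just a toy one-entry lookup), only to show the shape of the loop:

```python
# A toy sketch of greedy decoding: predict, append, repeat.
def predict_next_token(text: str) -> str:
    toy_model = {"The capital of France is": "Paris"}  # hypothetical one-entry "model"
    return toy_model.get(text, "<end>")

text = "The capital of France is"
while True:
    token = predict_next_token(text)  # 1. predict the next token
    if token == "<end>":              # 3. stop when the model decides it is done
        break
    text = f"{text} {token}"          # 2. append the prediction to the input

print(text)  # "The capital of France is Paris"
```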
The Pipeline (How Transformers Think)
Let’s walk through how the model processes the input text step-by-step:
1) Tokenization
🎨 Analogy: Imagine a whole lemon representing a piece of text, whether it’s a sentence, a paragraph, or an entire document.
Tokenization is like breaking that lemon (text) down into smaller pieces.
Text is split into tokens (words or sub-words). For computers to understand, these tokens are converted to numbers using a predefined vocabulary (a huge lookup table).
🧩 Example:
"Hello world" → ["Hello", "world"] → [1234, 5678]There is a website i.e Tiktokenizer , we can use that to visualize that how different llms convert the given input into the tokens.
Sample code for tokenization is given below:
```python
import tiktoken

# Load the tokenizer (vocabulary) used by GPT-4o
encoder = tiktoken.encoding_for_model("gpt-4o")

text = "Hello , I am Pragyan"

# Encode: text -> token IDs
tokens = encoder.encode(text)
print("Tokens:", tokens)  # [13225, 1366, 357, 939, 118421, 10134]

# Decode: token IDs -> text
decoded_text = encoder.decode(tokens)
print("Decoded Text:", decoded_text)
```
2) Vector Embeddings – Giving Tokens Meaning
🎨 Analogy: Each token becomes a color dot in a huge 3D color space where similar meanings have similar colors.
These token numbers are turned into embeddings — numerical vectors that represent meaning. "King" and "Queen" will be close in this space. But "Apple" (fruit) and "Apple" (company) may still look the same... for now 👀.
Sample code for vector embeddings is given below.
For this code, you need to configure the API key of OpenAI (or any other LLM provider):
```python
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()      # loads OPENAI_API_KEY from a .env file
client = OpenAI()

text = "Dog chases cat"

# Ask the embedding model to turn the text into a vector of numbers
response = client.embeddings.create(
    input=text,
    model="text-embedding-3-small"
)
print(response.data[0].embedding)  # a long list of floats representing the meaning
```
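To see what “close in this space” means in practice, here is a toy sketch of cosine similarity, the usual way of measuring closeness between embeddings. The 4-dimensional vectors below are made up for illustration; real embeddings like the one above have hundreds or thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way (similar meaning); near 0 means unrelated
    a, b = np.asarray(a), np.asarray(b)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 4-dimensional embeddings, made up for illustration
king  = [0.9, 0.8, 0.1, 0.0]
queen = [0.85, 0.82, 0.15, 0.05]
apple = [0.1, 0.0, 0.9, 0.8]

print(cosine_similarity(king, queen))  # high: related meanings
print(cosine_similarity(king, apple))  # low: unrelated meanings
```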
3) Positional Encoding – Adding Word Order
🎨 Analogy: Like giving each word its own spice so the chef (model) knows what came first, second, etc.
Without this, the model can’t tell "Dog bites man" from "Man bites dog".
Positional encoding adds this order info using sine/cosine patterns.
More formally: positional encoding is a technique that adds information about the position of words or tokens in a sequence.
It helps Transformer models understand the relationships and order of tokens.
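Here is a small NumPy sketch of the sinusoidal encoding from the original paper (the formula is the real one; the sequence length and dimension below are toy values):

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    # Sinusoidal positional encoding from "Attention Is All You Need"
    positions = np.arange(seq_len)[:, np.newaxis]  # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]       # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions get cosine
    return pe

# Each row is the "spice" added to the token embedding at that position
print(positional_encoding(seq_len=4, d_model=8).round(2))
```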
4) Multi-Head Self-Attention – Making Words Talk
🎨 Analogy: Like a group discussion where every word in the sentence looks at every other word and decides who’s important.
For example, take the sentence: "He went to the bank to deposit money."
Here, the word “bank” is ambiguous. But attention helps it look at nearby words like “deposit” and “money” and realize — Aha! It’s a finance bank, not a river bank.
Now the question arises: what is multi-head attention?
🎨 Analogy: Like putting on multiple lenses at the same time — one looks for grammar, another looks for topic, another for context.
Each "head" in multi-head attention processes the sentence differently and then combines their views, giving the model a rich understanding of the sentence.
🧠 Why? Because language has layers — meaning, tone, role of the word — and one lens isn’t enough.
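Here is a minimal single-head sketch of the scaled dot-product attention at the heart of this step; multi-head attention runs several of these in parallel with different weight matrices and concatenates the results. The matrices below are random toy values, not trained weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Scaled dot-product self-attention for a single head
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how strongly each word attends to every other
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # blend the values by attention weight

rng = np.random.default_rng(42)
X = rng.normal(size=(3, 4))                  # 3 tokens, embedding dimension 4
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (3, 4): one context-aware vector per token
```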
5) Feed Forward Neural Network – Local Thinking
🎨 Analogy: Like polishing each token individually after the discussion.
After attention, every word embedding goes through a small neural network to refine it further.
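A rough sketch of this step, assuming toy dimensions (3 tokens, model dimension 4, hidden dimension 16): each token’s vector is expanded, passed through a non-linearity, and projected back, independently of the other tokens:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward network, applied to each token on its own
    hidden = np.maximum(0, x @ W1 + b1)  # ReLU: expand and "polish" the representation
    return hidden @ W2 + b2              # project back down to the model dimension

rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 4))                  # 3 tokens, model dimension 4
W1, b1 = rng.normal(size=(4, 16)), np.zeros(16)   # expand 4 -> 16
W2, b2 = rng.normal(size=(16, 4)), np.zeros(4)    # shrink 16 -> 4
print(feed_forward(tokens, W1, b1, W2, b2).shape) # (3, 4): one refined vector per token
```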
6) Final Prediction - Output token
The final output is chosen using a softmax function, which assigns a probability to every word in the vocabulary; the most likely one is then picked.
And the cycle repeats until the model decides to stop.
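A toy illustration of this last step, with a made-up four-word vocabulary and made-up scores (a real vocabulary has tens of thousands of tokens):

```python
import numpy as np

vocab = ["London", "Paris", "Berlin", "Rome"]  # hypothetical tiny vocabulary
logits = np.array([2.1, 6.3, 1.7, 0.9])       # hypothetical raw scores from the model

probs = np.exp(logits) / np.exp(logits).sum() # softmax: scores -> probabilities summing to 1
print(dict(zip(vocab, probs.round(3))))       # {'London': 0.015, 'Paris': 0.971, ...}
print("Next token:", vocab[int(np.argmax(probs))])  # greedy pick: "Paris"
```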
How Do Transformers Learn? (Backpropagation in Simple Terms)
After all the steps like tokenization, embeddings, attention, etc., you might ask:
“But how does the model actually learn to give the right answer?”
That’s where training comes in — and the key concept is backpropagation.
Backpropagation (Explained in Lemon Terms)
- Let’s say you're training a model to answer basic math.
You give it:
Input: 2 + 2 = ?
The model predicts: 100
But we know the real answer is 4.
Now the model calculates a loss — a number measuring how far the prediction is from the correct answer. In this simplified case,
Loss = 100 - 4 = 96 (which is huge!). Real models use more sophisticated loss functions, but the idea is the same.
The model says:
“Oops! I was way off. Let me go back and fix how I think.”
That “going back and fixing” process is called backpropagation.
What Happens During Backpropagation?
The model:
Looks at all the weights (like knobs) it used to reach that wrong answer.
Adjusts those knobs slightly to reduce the error next time.
Tries again with updated weights.
This repeats millions of times until the loss becomes almost zero — and the model gets really good at answering correctly.
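Here is a tiny, self-contained sketch of that knob-adjusting loop: gradient descent on a single weight, using the 2 + 2 example from above. Everything is simplified (one knob instead of billions, squared error instead of a real loss function):

```python
# The model computes prediction = w * x; we want it to learn w = 2 (so that 2 * 2 = 4).
w = 50.0               # a badly initialized "knob" (predicts 100 at first)
x, target = 2.0, 4.0   # input and correct answer
lr = 0.01              # learning rate: how big each adjustment is

for step in range(200):
    prediction = w * x
    loss = (prediction - target) ** 2      # how far off we are
    grad = 2 * (prediction - target) * x   # derivative of the loss with respect to w
    w -= lr * grad                         # nudge the knob to reduce the error

print(f"Learned w = {w:.3f}, prediction = {w * x:.3f}")  # ≈ 2.000 and ≈ 4.000
```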
🔁 Two Phases of a Model
Training Phase: The model learns by making wrong predictions, measuring how far off it was (loss), and adjusting its weights through backpropagation.
Inference Phase: Once the model is smart enough (low loss), its weights are frozen and it is used to generate outputs — like ChatGPT answering your questions.
⚙️ Example: ChatGPT has already been trained. When you ask it something now, it’s in inference mode, not learning anymore — it’s just predicting based on its past training.