Caught Behind: The Hidden Layers of AI

Table of contents
- What is Generative AI?
- GPT & Transformers - The Core Engines behind the scenes
- Tokens & Tokenization: The Language Building Blocks
- Vector Embeddings: Giving Meaning to Words
- Positional Encoding: Teaching AI the Order of Words
- Self-Attention: The Model’s Inner Eye
- Multi-Head Attention: Diverse Perspectives in Parallel
- Backpropagation: The Learning Loop
- Conclusion
- What’s next?

Ever wondered how ChatGPT seems to “understand” you? Or how tools like Midjourney and Canva generate awesome art from just words? Welcome to the world of Generative AI — where machines don’t just think, they create.
• What is Generative AI?
‣ Generative AI refers to any model that can generate content based on user inputs (called “prompts”) – be it text, images, code snippets, or music – all of it learned from the data on which the model was trained. The more precise and detailed the prompt, the higher the quality of the generated content.
• GPT & Transformers - The Core Engines behind the scenes
‣ Most of us have used ChatGPT at least once in our lives. Have you ever wondered how it works? It seems quite magical how this technology answers our questions, generates content based on exactly what we need, and much more. Well, the magic lies behind the scenes.
Let’s first decode GPT: it stands for Generative Pre-Trained Transformer.
‣ Generative – which can generate/create.
‣ Pre-Trained – which learns patterns from the data fed to it.
‣ Transformer – the model architecture that allows the magic to happen.
• So, what exactly is a Transformer?
‣ Introduced in the 2017 paper “Attention Is All You Need”, the Transformer model was a paradigm shift in how machines understand and generate language. Transformers use a mechanism called self-attention to process all the inputs at once, in parallel, which makes them faster and smarter.
Imagine reading a novel and trying to understand the meaning of a sentence. We don’t just look at one word at a time; our brain quickly relates each word to the others in the sentence to get the full meaning. Transformers do something similar — they pay “attention” to every word, all at once.
Transformers do this by assigning attention scores/weights to all the words, helping the model focus on the most relevant ones.
This simultaneous processing is made possible by self-attention and is further enhanced with multi-head attention (where the model looks at different parts of the sentence from multiple perspectives).
In short, the Transformer is like a mind map that constantly updates itself to understand meaning based on context, not just order. And that is what makes it the foundation of modern Generative AI.
Now let’s make this easier to visualize. Here’s the original architecture diagram of the Transformer model from the legendary paper "Attention is All You Need."
Think of the Transformer as Two Friends Talking. The Transformer model is split into two sides:
• The Left Side: The Encoder (Listener)
This is like a person carefully listening and trying to fully understand what you're saying.
- Input Embedding + Positional Encoding:
It first turns your words into numbers. Then it tags the position of each word (because "I love you" ≠ "You love I").
- Self-Attention:
The model pays attention to every word in your sentence — all at once — to understand which words relate most closely. Example: In “She ate the cake because she was hungry,” the model learns that the second “she” refers to the same person.
- Feed Forward & Repeat:
It processes this understanding through mini-brain circuits to get even smarter.
This repeats N times to refine its understanding — like rereading a sentence multiple times.
• The Right Side: The Decoder (Speaker)
This side takes what the encoder understood and starts generating a response — word by word.
- Masked Attention:
It’s not allowed to see the future words — just the ones it’s already said. Like solving a puzzle one piece at a time.
- Attention Over Encoder’s Output:
It looks at what the encoder understood and thinks, “Based on what was said, what should I say next?”
- Feed Forward, Softmax & Predict:
It runs a prediction and picks the most likely next word… and then the next… and the next — until the full sentence is formed.
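To make the two sides a bit more concrete, here is a minimal, hedged sketch using PyTorch’s built-in nn.Transformer. The vocabulary size, dimensions, and token IDs are toy values chosen for illustration, positional encoding is skipped here (it gets its own section below), and this is not the exact GPT architecture, which is decoder-only.

# A minimal encoder-decoder sketch with PyTorch's built-in Transformer.
# (Hyperparameters and token IDs below are toy values, not from the paper.)
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64                 # toy vocabulary and embedding size
embed = nn.Embedding(vocab_size, d_model)      # turns token IDs into vectors
transformer = nn.Transformer(
    d_model=d_model, nhead=4,
    num_encoder_layers=2, num_decoder_layers=2,
    batch_first=True,
)
to_vocab = nn.Linear(d_model, vocab_size)      # projects back to vocabulary scores

src = torch.randint(0, vocab_size, (1, 5))     # the "listener" side: 5 source tokens
tgt = torch.randint(0, vocab_size, (1, 3))     # the "speaker" side: 3 tokens said so far

# Masked attention: the decoder may only look at words it has already produced.
tgt_mask = transformer.generate_square_subsequent_mask(tgt.size(1))

out = transformer(embed(src), embed(tgt), tgt_mask=tgt_mask)
logits = to_vocab(out)                         # scores for the next token at each position
print(logits.shape)                            # torch.Size([1, 3, 1000])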
Now, a question might arise – how does a model understand text inputs? The answer is right next.
• Tokens & Tokenization: The Language Building Blocks
‣ In the AI world, text is broken down into smaller chunks – usually words, sub-words, or even characters. These are known as “tokens”. Each token is then converted into a number so that the model can process it.
‣ Examples of Tokens:
Input text: “Virat scored a century”
Tokenized: [“Vir”,”at”,”scored”,”a”,”century”]
Input text: “Schrödinger”
Tokens: [“S”,”chrö”,”dinger”]
• Now, what is Tokenization?
‣ Tokenization is the process of splitting text into tokens and converting them into IDs for the model.
‣ Analogy – Breaking a T20 cricket match into overs and then further breaking each over into individual deliveries to analyse each one.
‣ Example of Tokenization:
Input text: “Virat scored a century”
Tokenized: [“Vir”,”at”,”scored”,”a”,”century”]
Converted to token IDs: [60985, 266, 27207, 261, 14015]
These IDs are then fed into the neural network.
‣ Note: Tokenization varies from model to model. Different models can produce different tokens (and token IDs) for the same word.
‣ Python code to tokenize texts :
import tiktoken
encoder = tiktoken.encoding_for_model("gpt-4o")
text = "Virat hits a century"
## Tokenize
tokens = encoder.encode(text)
print("Tokens: ",tokens)
‣ Output :
Tokens:  [60985, 266, 21571, 261, 14015]
‣ Python code to de-tokenize :
import tiktoken
encoder = tiktoken.encoding_for_model("gpt-4o")
## De-Tokenize
tokens = [60985, 266, 21571, 261, 14015]
text = encoder.decode(tokens)
print('Text: ',text)
‣ Output :
Text:  Virat hits a century
Now, you might wonder — how does a machine learning model understand the meaning of the input data? After all, it doesn’t know any human language, emotions, or context the way humans do.
That’s where vector embeddings come into play.
• Vector Embeddings: Giving Meaning to Words
‣ Embeddings are numerical representations of data as vectors in a high-dimensional space.
They capture the meaning of the input and map similar inputs to similar vectors.
‣ Analogy – Imagine every player in the IPL is represented by a dot in a stadium based on their playing style.
Virat Kohli and Joe Root might be close (classic batters).
MS Dhoni and Hardik Pandya may be closer (finisher + all-rounder role).
Jasprit Bumrah is far from all of them (fast bowler).
Here, each player is a vector embedding, with their position in a higher-dimensional space. These vector embeddings are usually stored in databases known as Vector Databases.
‣ Python code for vector-embedding texts :
- Using Sentence Transformers (Completely Free) :
### Using Sentence Transformers (Completely Free)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode("Dog chases cat.")
print(embedding)
- Using OpenAI API (Pay as you go) :
### Using OpenAI API
import openai
openai.api_key = "your-openai-key"
response = openai.embeddings.create(
model="text-embedding-3-small",
input="Dog chases cat."
)
embedding = response.data[0].embedding
print(embedding)
‣ Output : a long list of floating-point numbers – the embedding vector (384 dimensions for all-MiniLM-L6-v2, 1536 for text-embedding-3-small).
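To see the “similar inputs map to similar vectors” idea from the IPL analogy in action, here is a small follow-up sketch using the same free sentence-transformers model as above (the example sentences are mine, chosen to mirror the analogy):

# Quick check of "similar inputs map to similar vectors" with the same free model as above.
# (The example sentences are just illustrations of the IPL analogy.)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "Virat Kohli is a classic top-order batter.",
    "Joe Root is a classic top-order batter.",
    "Jasprit Bumrah is a fast bowler.",
]
embeddings = model.encode(sentences)

# Cosine similarity: values closer to 1.0 mean the meanings are closer.
print(util.cos_sim(embeddings[0], embeddings[1]))  # two batters: higher score
print(util.cos_sim(embeddings[0], embeddings[2]))  # batter vs bowler: lower score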
So far, we’ve seen how vector embeddings help models understand the meaning of individual words or tokens.
But there’s a catch…
How does the model know the order of the tokens?
That’s where positional encoding comes in.
• Positional Encoding: Teaching AI the Order of Words
‣ Since transformers don’t read left-to-right like humans, they need extra help understanding sequence. Positional Encoding adds unique values or weights to each token’s vector based on its position in the sentence.
For example:
There is a huge difference between “Virat hits a century” and “Century hits a Virat”. With positional encoding we add position information to each word’s vector so that the transformer knows who hit whom –
“Virat” is the first word
“hits” is the second
“a” is the third
“century” is the fourth
Therefore, the final input would be –
Word meaning (from vector embedding) + Word position (from positional encoding)
This allows the transformer to not just understand what the words mean, but also where they appear in context.
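Here is a short sketch of the sinusoidal positional encoding recipe from the paper; the sentence length and embedding size are toy values, and note that GPT-style models often learn their positional embeddings instead of using fixed sinusoids:

# Sinusoidal positional encoding as described in "Attention Is All You Need".
# (Toy sizes: 4 tokens, 8-dimensional embeddings.)
import torch

def positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)            # even dimension indices
    angle = pos / (10000 ** (i / d_model))                        # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                                # even dims get sine
    pe[:, 1::2] = torch.cos(angle)                                # odd dims get cosine
    return pe

token_embeddings = torch.randn(4, 8)   # stand-in vectors for "Virat", "hits", "a", "century"
final_input = token_embeddings + positional_encoding(4, 8)        # word meaning + word position
print(final_input.shape)               # torch.Size([4, 8])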
Now that the model knows what each word means and where it appears, there’s still one crucial question left:
How does the model know which words to focus on when processing a given token?
This is where the self-attention mechanism, introduced in the 2017 paper “Attention Is All You Need” by Vaswani et al., comes in.
• Self-Attention: The Model’s Inner Eye
“Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations”
Source - “Attention Is All You Need” 2017 paper
‣ In simple words: Self-attention allows each word in a sentence to “pay attention” to other relevant words when being processed.
‣ Analogy – Imagine you’re a fielder on the boundary who just picked up the ball. Before you throw it, you quickly analyse the positions of the wicketkeeper, the running non-striker, the bowler, and whoever is calling for the ball, and then you throw to the most relevant one.
Hurray! That is what self-attention actually means!
In a transformer – each token is a fielder, it looks at all other tokens to see which ones are more relevant, then makes its decision as an output vector based on those relationships.
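As a rough sketch of what each “fielder” computes, here is scaled dot-product self-attention on toy numbers, with random matrices standing in for the learned query, key, and value projections:

# Scaled dot-product self-attention on toy vectors.
# Q, K and V all come from the same sequence, which is what makes it *self*-attention.
import torch
import torch.nn.functional as F

seq_len, d_model = 4, 8                        # e.g. 4 tokens, 8-dimensional vectors
x = torch.randn(seq_len, d_model)              # token embeddings (+ positional encoding)

# Random matrices stand in for the learned query/key/value projections.
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / d_model ** 0.5              # how relevant each token is to every other token
weights = F.softmax(scores, dim=-1)            # attention weights, each row sums to 1
output = weights @ V                           # each token becomes a weighted mix of all tokens
print(weights.shape, output.shape)             # torch.Size([4, 4]) torch.Size([4, 8])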
• Multi-Head Attention: Diverse Perspectives in Parallel
“Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.”
Source - “Attention Is All You Need” 2017 paper
In simple terms – a single attention head (self-attention on its own) is good, but many are better. Multi-head attention lets the model look at different aspects of the input simultaneously.
‣ Analogy – Imagine multiple commentators covering the same delivery. One focuses on the bowler’s wrist position, another on the field placements, while yet another analyses the batsman’s footwork. Each gives a different perspective, and combining all of them gives a complete tactical understanding of the delivery.
That’s multi-head attention.
In transformers, one head might learn the meaning of the words, another might focus on their positions, and the other heads learn their own patterns – and, interestingly, all of them compute attention independently.
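A minimal illustration with PyTorch’s built-in nn.MultiheadAttention (toy sizes, purely for intuition):

# Multi-head attention: several attention "commentators" running in parallel.
# (Toy sizes; real models use far larger dimensions and more heads.)
import torch
import torch.nn as nn

d_model, num_heads, seq_len = 8, 2, 4
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)           # one sentence of 4 token vectors

# Self-attention: the same sequence is used as query, key, and value.
output, attn_weights = mha(x, x, x, average_attn_weights=False)
print(output.shape)        # torch.Size([1, 4, 8])
print(attn_weights.shape)  # torch.Size([1, 2, 4, 4]) -- one 4x4 attention map per head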
By now, you’ve seen how Transformers pay attention to the right words, process them from multiple perspectives, and assign meanings through embeddings. But there’s one big question left:
How does the model actually learn what’s right and what’s wrong?
Well, here enters Backpropagation…
• Backpropagation: The Learning Loop
It’s like this — imagine you’re learning to play cover drives in cricket. You try a shot, miss the timing, and your coach tells you: “Too early, wait for the ball!” Next time, you adjust. You try again. This loop continues until you master it.
That’s exactly what backpropagation does — it’s the AI model’s coach.
When a model generates an output (say, the next word in a sentence), it compares that prediction to the actual correct word. This difference is called the loss or error.
Backpropagation takes this error and works backward through the model, adjusting the internal “weights” to reduce future mistakes. It’s like reviewing the entire play, identifying where things went wrong, and correcting each decision.
And just like how a batsman practices hundreds of deliveries to get better, the model goes through thousands, even millions of data examples, each time refining its internal logic.
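Here is roughly what that coaching loop looks like in code: a toy sketch with a tiny PyTorch model and random data standing in for real training examples.

# The "coach" loop: predict, measure the error, and pass it backward to adjust the weights.
# (A toy classifier with random data standing in for real training examples.)
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
model = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, vocab_size))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, d_model)                   # 32 toy "contexts"
target = torch.randint(0, vocab_size, (32,))   # the correct next token for each context

for step in range(100):                        # each pass is one net session with the coach
    logits = model(x)                          # the model plays its shot (makes a prediction)
    loss = loss_fn(logits, target)             # how far off the timing was (the error)
    optimizer.zero_grad()
    loss.backward()                            # backpropagation: trace the error backward
    optimizer.step()                           # nudge the weights to do better next time

print(loss.item())                             # the loss shrinks as the model "practices"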
• Conclusion
‣ If you’ve made it this far, give yourself a pat on the back (or a chai break ☕).
We just broke down how Transformers actually understand stuff — not just memorize words but really capture context, meaning, and relationships.
From vector embeddings to positional encoding to self-attention and, finally, multi-head attention.
The “Attention Is All You Need” paper is not just hype — it really changed the entire AI game. It's the reason GPT, BERT, and all these mad models even work today.
Read the paper - “Attention Is All You Need” 2017 paper
• What’s next?
If you found this cool, wait till you dive into:
‣ How transformers are trained
‣ What happens during fine-tuning
‣ What are RAGs and how to build RAG applications
‣ AI Agents and Agentic AI
‣ Or how these ideas fuel real-world stuff like ChatGPT, Google Search, or even JARVIS-type AI assistants
Thanks for reading. If you liked the cricket analogies, drop a comment or share it with your ML squad. Until then — happy learning, and see you on the other side of the transformer. 🏏