Decoding AI Jargons with Chai ☕ & Oggy and the Cockroaches Drama 🍰

Let’s Start with a Small Story :)

Imagine a day in the Oggy universe. The sun is shining, Oggy’s about to enjoy his chocolate cake, and... BAM 💥 — the cockroaches have messed up the recipe card! Instead of reading:

"Oggy eats chocolate cake"

It says:

"Cake chocolate eats Oggy."

Oggy stares in confusion. "What kind of language is this?" he screams. That's when Professor Transformer steps in to help our blue cat restore order — using the power of AI Transformers.

A Transformer is a special type of model in AI that helps machines understand and generate human language (or other types of data like images, audio, etc.).

Transformers read input all at once, not one word at a time like older models.
They focus on the important parts of the input, using something called attention.

Think of it like reading an entire sentence and figuring out which words matter the most to understand the meaning.

Let’s have an example:

Let’s say you’re helping your friend prepare for an exam. They give you this sentence:

“Even though it was raining, she went to the park because she had promised her friend.”

You want to know why she went to the park.

A human brain immediately pays attention to:
👉 "because she had promised her friend"

A Transformer does something similar! It uses attention to focus on the important part of the sentence and gives the correct output.

🎬 Scene 1: Tokenization - Paplu’s Master Slicer

Paplu, the super sharp cockroach, slices up the sentence:

"Oggy eats chocolate cake" → ["Oggy", "eats", "chocolate", "cake"]

This process is called Tokenization. Tokenization is the process of converting a sentence (or any input text) into smaller units called tokens. These tokens are often words or sub-words, and they serve as the base units that Transformer models like GPT work with. We break the sentence into smaller parts (tokens) that the model can understand. Sometimes a token is even smaller than a word, like:

"chocolaty" → ["choco", "##laty"]

Because Transformers have a limited vocabulary size (usually 30k–50k), they split unknown words into known chunks.
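Here is a tiny sketch of that idea in Python: a greedy, WordPiece-style splitter that matches the longest known chunk it can. The vocabulary, the `##` continuation marker, and the example words are all made up for illustration; real tokenizers learn their vocabulary from huge amounts of text.

```python
# Toy WordPiece-style tokenizer: greedily match the longest known chunk.
# The vocabulary below is invented just for this example.
TOY_VOCAB = {"oggy", "eats", "chocolate", "cake", "choco", "##laty"}

def toy_tokenize(word: str) -> list[str]:
    """Split one lowercase word into known sub-word chunks."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in TOY_VOCAB:
                tokens.append(piece)
                break
            end -= 1
        if end == start:              # no known chunk matched at all
            return ["[UNK]"]
        start = end
    return tokens

print(toy_tokenize("chocolate"))   # ['chocolate']
print(toy_tokenize("chocolaty"))   # ['choco', '##laty']
```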

🔡 Scene 2: Vocabulary Lookup - Jhaplu’s Dictionary Dive

Each token is now turned into an ID using a vocabulary list. Transformer-based models like GPT cannot take raw text directly. Instead, the text needs to be tokenized into smaller pieces, and each piece (called a "token") is mapped to a unique integer ID using a vocabulary.

A vocabulary is a fixed-size list of all the tokens the model knows. Each token is associated with a unique number. For instance:

Token         ID
"Oggy"        1024
"eats"        872
"chocolate"   4012
"cake"        598

Example:

Say the sentence is:

"He devoured a cakelicious dessert."

The tokenizer might break it into:

["He", "devoured", "a", "cake", "##licious", "dessert", "."]

These tokens are then converted to numbers using the vocabulary:

[1001, 1420, 2001, 598, 9021, 3309, 102]

These integers are what the model understands. No raw words are passed into the model — only these numbers.
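A minimal sketch of that lookup, using a made-up vocabulary (the IDs here are invented, not taken from any real model):

```python
# Invented vocabulary: every known token maps to a fixed integer ID.
vocab = {"Oggy": 1024, "eats": 872, "chocolate": 4012, "cake": 598, "[UNK]": 0}

def encode(tokens: list[str]) -> list[int]:
    """Turn tokens into IDs, falling back to the unknown-token ID."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

print(encode(["Oggy", "eats", "chocolate", "cake"]))   # [1024, 872, 4012, 598]
print(encode(["Oggy", "eats", "pizza"]))               # [1024, 872, 0]  <- "pizza" is unknown here
```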

🧬 Scene 3: Embeddings - Taplu’s Funky Flavors

In natural language processing, embeddings are dense vector representations of words (or tokens) in a continuous vector space.

Instead of representing each word as a unique number (which loses meaning), embeddings allow similar words to have similar numerical representations.

"cake" → [0.21, -0.13, 0.88, ..., 0.02]

"cookie" → [0.22, -0.11, 0.85, ..., 0.01]

"rocket" → [0.99, 0.76, -0.45, ..., -0.60]

Notice how "cake" and "cookie" are numerically closer than "cake" and "rocket". This is how the model "knows" what kind of word it's dealing with — not just the word itself, but its meaning and context.
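You can check that "closeness" with cosine similarity. Here is a quick sketch using NumPy and made-up 4-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

# Made-up 4-dimensional embeddings (real models use hundreds of dimensions).
emb = {
    "cake":   np.array([0.21, -0.13, 0.88, 0.02]),
    "cookie": np.array([0.22, -0.11, 0.85, 0.01]),
    "rocket": np.array([0.99,  0.76, -0.45, -0.60]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: values near 1 mean the vectors point the same way, i.e. similar meaning."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["cake"], emb["cookie"]))  # close to 1 -> similar words
print(cosine(emb["cake"], emb["rocket"]))  # much lower (even negative) -> unrelated words
```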

🧁 Real-World Analogy:

Imagine each word is an ice cream flavor, and Taplu is the flavor scientist.

Instead of giving each word a number (like “banana = 1”, “strawberry = 2”), Taplu gives each flavor a full taste profile:

Banana: Sweet, fruity, yellow-ish, tropical

Strawberry: Sweet, fruity, red-ish, berry

Pickle: Sour, green-ish, tangy

Even if Taplu doesn’t know “mango-lime-blast”, he can say: “Hmm, it's sweet + tropical + sour... kinda like a mix of mango and lime”.

In the same way, even if a Transformer has never seen a word before, embeddings help it guess what that word might mean based on similar flavor-profiles (vectors).

⏳ Scene 4: Positional Encoding - Jack Builds Order

Since Transformers don't read words in order like humans do, they have no built-in sense of which word comes first or last. But the order of words matters.

— "Oggy eats cake" is very different from "Cake eats Oggy" 😱

So Jack, our orderly builder, comes to the rescue with Positional Encoding. He adds a unique pattern to each word's vector (embedding) based on its position in the sentence. This encoding is a vector that represents the position of the word in the sentence (like 1st, 2nd, 3rd...).

🌍 Real-World Analogy

Imagine you're baking a cake and someone gives you all the ingredients at once — eggs, flour, sugar, butter — but not the recipe order.

Without order, you'd probably mix and bake it wrong!

Now imagine Jack (from your story) writes down:

  1. Add flour

  2. Mix sugar

  3. Break eggs

  4. Bake at 180°C

That step number is your positional encoding! It helps you make sense of the ingredients in the right order.

Without it, even if you know the meaning of each item (like what "flour" is), you don’t know when to use it.

🧪 Let's have a Simple Example

Let’s say your model sees the sentence:

“Oggy eats cake”

Without positional encoding, it’s just:

["Oggy", "eats", "cake"]
↓
[Embedding1, Embedding2, Embedding3]

But with positional encoding added:

[Embedding1 + Pos0, Embedding2 + Pos1, Embedding3 + Pos2]

So now it knows that "Oggy" is first, "eats" is second, and "cake" is third — and the sentence makes sense!

If you reverse the order:

“Cake eats Oggy”

You’ll get different positional encodings, and thus a completely different meaning — just like you’d expect!
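If you want to see what those Pos0, Pos1, Pos2 vectors actually look like, here is a small NumPy sketch of the sinusoidal positional encoding from the original Transformer paper; the 3x8 "embeddings" are random stand-ins for the real ones:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding: a unique, position-dependent pattern per token."""
    pos = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    i = np.arange(d_model)[None, :]                      # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                # odd dimensions use cosine
    return pe

embeddings = np.random.randn(3, 8)                       # stand-ins for ["Oggy", "eats", "cake"]
with_position = embeddings + positional_encoding(3, 8)   # add position info to each token
print(with_position.shape)                               # (3, 8)
```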

🪞 Scene 5: Self-Attention - Olivia the Gossip Queen

Self-attention is a key mechanism in transformers that allows each word (token) in a sentence to look at all the other words in the sentence to understand the context.

💡 Why do we need it?

Language is contextual. For example:

"He poured water into the bank*."*
"He went to the bank to deposit money."

The word "bank" means very different things in these sentences. To understand which meaning is correct, the model needs to pay attention to the surrounding words — that’s what self-attention enables.

🔍 Real-World Analogy:

Imagine Olivia (self-attention mechanism) at a party.

  • She hears “Oggy eats chocolate cake.”

  • To understand "eats", she listens to "Oggy" (who eats?) and "cake" (what is being eaten?).

  • She assigns importance (weights) to each word based on how relevant it is.

So “Oggy” and “cake” get high attention scores, while “chocolate” might get a bit less, and any irrelevant word would get a low score.

Olivia then updates her understanding of the word "eats" using all this information.
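Under the hood, Olivia's "listening" is scaled dot-product attention. Here is a bare-bones NumPy sketch, with random weight matrices standing in for the learned ones:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X: np.ndarray, Wq: np.ndarray, Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention for a single head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how much each word "listens" to every other word
    weights = softmax(scores)                 # attention weights: each row sums to 1
    return weights @ V                        # context-aware representation of each word

d = 8
X = np.random.randn(4, d)                     # 4 tokens: "Oggy eats chocolate cake"
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (4, 8): one updated vector per token
```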

🍕 Scene 6: Multi-Head Attention - Chaos Party Begins

Multi-Head Attention is like giving your model multiple perspectives to understand the sentence better.

In Self-Attention, each word looks at all the other words in the sentence to figure out what matters most. But sometimes, one single view is not enough. That’s where Multi-Head Attention comes in — it allows the model to focus on different aspects of the sentence simultaneously.

For example:

Instead of one Olivia, imagine 8–12 mini Olivias each looking at different parts of the sentence:

  • One checks grammar

  • One checks meaning

  • One focuses on food (Taplu loves that)

Then all their findings are combined. That’s multi-head attention!
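If PyTorch is available, its built-in nn.MultiheadAttention runs all those mini Olivias for you. A quick sketch with random inputs standing in for real embeddings:

```python
import torch
import torch.nn as nn

# Minimal multi-head attention sketch: 4 heads, each with its own "view" of the sentence.
d_model, num_heads, seq_len = 16, 4, 4            # 4 tokens: "Oggy eats chocolate cake"
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)              # (batch, tokens, embedding dim)
out, attn_weights = mha(x, x, x)                  # query, key, and value all come from x

print(out.shape)           # torch.Size([1, 4, 16]) -> one combined vector per token
print(attn_weights.shape)  # torch.Size([1, 4, 4])  -> attention weights (averaged over heads)
```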

🔁 Scene 7: Encoder - Professor Transformer Understands It All

Now comes the moment where Professor Transformer — the wise and powerful AI — puts all the puzzle pieces together.

🔧 Technical Explanation: The Encoder is the part of the Transformer architecture that takes the input (like "Oggy eats chocolate cake") and processes it using multiple layers of attention and feed-forward networks. Its job is to build a contextual understanding of each word based on its surroundings.

At each layer:

  1. Self-Attention lets the word look at all others.

  2. Add & Norm normalizes the result.

  3. Feed-Forward Network (FFN) helps the model refine its knowledge.

  4. Stacking Layers (usually 6–12 times) allows the model to build deeper understanding at each step.

Every output vector from the encoder now contains meaning-rich representations of each word in context.

💡 Real-World Example: Think of the encoder like a super-smart teacher who hears the whole sentence and instantly understands who is doing what, to whom, and why — even if the grammar is weird. For example:

  • "Oggy eats cake" → understood as a basic subject-verb-object sentence.

  • "Cake eats Oggy" → same words, different meaning. The encoder catches that too!

In our story, Professor Transformer finally nods and says:

"Ah, I get it now — Oggy is the eater, not the cake!"

That’s the power of the encoder — understanding the full meaning from messy input.

In short: the Encoder takes the embeddings and attention outputs and turns them into a full understanding of the input, ready for the Decoder to use.
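For the curious, this is roughly what those stacked layers look like in code. A minimal sketch assuming PyTorch, with random vectors standing in for the token embeddings (plus positional encodings):

```python
import torch
import torch.nn as nn

# One encoder layer = self-attention + add & norm + feed-forward, exactly the steps above.
layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, dim_feedforward=64, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)   # stack 6 identical layers

tokens = torch.randn(1, 4, 16)   # (batch, 4 tokens of "Oggy eats chocolate cake", embedding size)
memory = encoder(tokens)         # context-rich vectors: the "understanding" the decoder will use
print(memory.shape)              # torch.Size([1, 4, 16])
```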

🔄 Scene 8: Decoder - Chef Oggy Decodes the Dish

Chef Oggy now wants to generate a new sentence:

Input: "Oggy eats..." Output: "chocolate cake."

The Decoder is like a chef preparing a sentence, one ingredient (word) at a time, based on what’s already on the plate (previous words) and what the encoder understood from the original sentence.

🔧 The Decoder uses three things:

  1. Masked Self-Attention – This means while generating a sentence, it only looks at past words (not future ones) to avoid cheating. If Oggy is at “eats,” he can’t peek at “cake” yet.

  2. Encoder-Decoder Attention – This lets the decoder look at what the encoder understood. It’s like asking: “What was the input sentence again?”

  3. Feed Forward Network – Helps refine the prediction before passing to the next layer.

These steps are repeated layer by layer (usually 6–12 layers), and once the final layer is done, the decoder predicts the next word. Then the whole process repeats for the word after that.

💡 Real-World Example: Imagine you're playing a game of Mad Libs where you have to complete a sentence based on context:

  • You read: “Oggy eats…”

  • You think: “What does Oggy usually eat?”

  • Based on memory and pattern, you guess: “chocolate cake.”

Just like that, the decoder picks the next best word based on previous ones and the encoder’s understanding. Word by word, it completes the sentence.

So, Chef Oggy isn't just guessing randomly — he’s cooking up words using logic, context, and a bit of learned taste!
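A matching sketch of the decoder side, again assuming PyTorch and using random vectors for the inputs. The causal mask is what stops Oggy from peeking at future words:

```python
import torch
import torch.nn as nn

# One decoder layer = masked self-attention + encoder-decoder attention + feed-forward.
layer = nn.TransformerDecoderLayer(d_model=16, nhead=4, dim_feedforward=64, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)

memory = torch.randn(1, 4, 16)      # encoder output for "Oggy eats chocolate cake"
so_far = torch.randn(1, 2, 16)      # embeddings of the words generated so far: "Oggy eats"

# Causal mask: -inf above the diagonal means "do not look at future positions".
causal_mask = torch.triu(torch.full((2, 2), float("-inf")), diagonal=1)

out = decoder(so_far, memory, tgt_mask=causal_mask)
print(out.shape)                    # torch.Size([1, 2, 16]) -> used to predict the next word
```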

🔮 Scene 9: Softmax - The Crystal Ball

Imagine you're a chef (like Chef Oggy 🍽️) and you're trying to decide what to serve for dessert.

You think of several options:

  • Chocolate cake

  • Cookie

  • Carrot

  • Ice cream

  • Pizza (wait, why is that here?)

Each of these choices gets a score from your brain based on how well they fit. Maybe the model gives them these logits (raw scores):

  • Cake → 5.4

  • Cookie → 2.1

  • Carrot → 1.2

  • Ice cream → 4.3

  • Pizza → 0.5

But these scores are just raw numbers — they don’t yet tell you how likely each one is.

🧠 Enter Softmax: The Probability Wizard

Softmax is like a magic spell that turns all those raw numbers into percentages — or probabilities. It answers the question:

“Out of all the options, how confident am I about each one?”

After applying Softmax, those raw scores become something like:

  • Cake → 80%

  • Ice cream → 15%

  • Cookie → 3%

  • Carrot → 1.5%

  • Pizza → 0.5%

Now, it’s clear: “Cake” is the most likely word to come next!
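Here is the same calculation in NumPy, using the raw scores from the story. (The exact percentages come out around 72% for cake and 24% for ice cream, so the numbers above are rounded for storytelling.)

```python
import numpy as np

# Chef Oggy's raw scores (logits) for each dessert option, taken from the story above.
logits = {"cake": 5.4, "cookie": 2.1, "carrot": 1.2, "ice cream": 4.3, "pizza": 0.5}

scores = np.array(list(logits.values()))
scores = scores - scores.max()                  # subtract the max for numerical stability
probs = np.exp(scores) / np.exp(scores).sum()   # the softmax formula

for word, p in zip(logits, probs):
    print(f"{word:>9}: {p:.1%}")
# "cake" comes out on top by a wide margin, so it gets picked as the next word.
```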

🧙 Think of Softmax Like a Crystal Ball

Chef Oggy looks into the Softmax crystal ball 🔮 and sees:

“Hmm… looks like there’s an 80% chance the next word should be ‘cake.’ That’s the winner!”

So the model picks "cake" as the next word in the sentence.

🍰 Real-World Example:

Say you're typing on your phone, and you write:

"I love chocolate..."

Your keyboard might suggest:

  • "cake" (most likely — based on data)

  • "ice cream"

  • "milk"

That’s Softmax working behind the scenes, deciding which word feels right next!

🌡️ Scene 10: Temperature - Bob’s Spice Meter

Temperature is a parameter that controls the randomness and creativity of the model's output. Lower values give more predictable, focused text, while higher values lead to more diverse and creative (but potentially less coherent) outputs.

Wanna spice things up? Temperature decides how creative the model gets:

Temp = 0.1 → Safe, boring

Temp = 1.0 → Balanced

Temp = 1.8 → Wild & wacky

Set it high and Oggy might say:

"Oggy eats a galaxy-flavored lava cake with time-travel sprinkles."
