🧠 Understanding AI Jargons in an Easy Way

🚀 Transformers — The Engine Behind Everything Smart in Modern AI

A transformer is a special kind of neural network — a machine learning model. You can use transformers to build different types of models, like text-to-speech, text-to-image, and more. Many of the exciting recent developments, like image generation and the Ghibli-style art trend, are made possible thanks to transformers.

Back in 2017, Google introduced a transformer model that was originally designed for language translation — for example, converting English to Hindi or English to Marathi. ChatGPT is another great example of a transformer in action. The transformer used in ChatGPT works by predicting the next word based on the context it receives.

For instance, if you write “The colour of the sky is ___,” the model will likely predict “blue” as the next word. That’s how it constructs full sentences — by continuously predicting what comes next.
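To make "predict the next word" concrete, here is a toy sketch in Python. The words and probabilities are made up for illustration; a real transformer computes these scores from its learned parameters.

```python
# Toy sketch of next-word prediction (made-up probabilities, not a real model).
next_word_probs = {
    "blue": 0.75,
    "grey": 0.15,
    "falling": 0.10,
}

# The model usually picks the most likely continuation.
prediction = max(next_word_probs, key=next_word_probs.get)
print(f"The colour of the sky is {prediction}")  # The colour of the sky is blue
```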


🔢 Tokenization - Like chopping veggies before cooking — smaller parts make it easier to process:

Before we talk about tokenization, let’s first get what tokens actually are.

🧩 Tokens - Like puzzle pieces — tiny chunks that help the machine understand the bigger picture

Tokens are just small pieces of whatever input you're giving to a machine — like text, images, or audio.
Let’s take an example:
“I love programming and developing real-world applications.”

If we split this into words (treating “real-world” as two words, “real” and “world”), we get 8 words. We also have 2 special characters: a hyphen (‘-’) and a period (‘.’).
Each of these pieces — the words and the special characters — is called a token.

Now, if the input is an image, the tokens could be small parts or chunks of the image.
And if it’s audio, the tokens could be tiny pieces of the sound — like short bits of a voice note.

So yeah, tokens are basically the tiny building blocks that help the machine understand what we’re giving it.

🛠️ Tokenization - Like a barcode scanner — it breaks things down and assigns numbers so machines can read them

Tokenization is just the process of turning those tokens into numbers — because machines don’t understand words or images the way we do.

Let’s say we assign a number to each letter:

  • a = 1

  • b = 2

  • d = 4

So, the word “add” becomes the token sequence [1, 4, 4].
(This is just a simple example — real models do it differently, but the idea is the same.)
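The letter-to-number idea above can be sketched in a few lines of Python (a toy mapping, nothing like a real tokenizer):

```python
# Toy letter-to-number mapping from the example above.
letter_ids = {"a": 1, "b": 2, "d": 4}

word = "add"
token_ids = [letter_ids[letter] for letter in word]
print(token_ids)  # [1, 4, 4]
```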

Basically:

  • First, we break the input into small parts (tokens).

  • Then, we convert those into numbers.

This way, the machine can read and work with the input — whether it’s text, image, or sound.

So in short: tokenization = breaking stuff down + turning it into numbers.

# Programming Language - Python
import tiktoken

encoder = tiktoken.encoding_for_model('gpt-4o')

text = 'I love programming.'

tokens = encoder.encode(text)
print(f"Token IDs: {tokens}")

decoded_tokens = [encoder.decode([token]) for token in tokens]
print(f"Decoded Tokens (what the tokenizer sees): {decoded_tokens}")

Output:
Token IDs: [40, 3047, 23238, 13]
Decoded Tokens (what the tokenizer sees): ['I', ' love', ' programming', '.']

📚 Vocab Size - Like a dictionary — it tells us how many unique words (or pieces) the model can recognize and work with

Now that you understand tokens and tokenization, here’s the next piece: vocabulary size. It simply means how many unique tokens your model can recognize.

Let’s say each letter gets a number:
a = 1, b = 2, ..., z = 26
A = 27, B = 28, ..., Z = 52
So if we only consider English letters, lowercase and uppercase, that gives us 52 unique tokens — and that’s our vocabulary size for this mini-example.

In real-world models, vocab size is much larger because it includes words, symbols, punctuation, emojis, and more!

# Programming Language - Python
import tiktoken

model_encodings = {
    "gpt-4o": "o200k_base",
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "gpt2": "gpt2"
}

vocab_sizes = {}

for model, encoding_name in model_encodings.items():
    try:
        encoder = tiktoken.get_encoding(encoding_name)
        vocab_sizes[model] = encoder.n_vocab
    except Exception as e:
        vocab_sizes[model] = f"Error loading encoding: {e}"

print("Vocabulary sizes for various GPT models:")
for model, vocab_size in vocab_sizes.items():
    print(f"{model}: {vocab_size}")

Output:
Vocabulary sizes for various GPT models:
gpt-4o: 200019
gpt-4: 100277
gpt-3.5-turbo: 100277
gpt2: 50257

📐 Vector Embedding - It’s like dropping every word into a 3D universe where similar ones hang out in the same neighborhood:

Before going into vector embeddings, let’s break the term down and learn the pieces one by one.

🧮 What is a Vector? - Think of it like a recipe card — each number is an ingredient that tells the machine what flavor (meaning) the word has

A vector is nothing but a list of numbers that represents something (a word, a sentence, an image, audio) so that a computer can understand what it is.
Say you want to describe an apple to a machine. You can’t just say “apple.” Instead, you might describe it like this:

  • Color = red → 1

  • Size = medium → 2

  • Taste = sweet → 3

So your “apple” becomes:
[1, 2, 3] — and that’s a vector!

Each number in the vector is like a feature or characteristic of the thing you’re describing.

(Just random example numbers — real ones are way longer and more complex)

Interesting Fact:

  • Vectors let the model do math with meaning, so the machine can understand relationships between words.

  • It can “measure” how close two ideas are (like love vs hate).

  • It turns human language into something a machine can actually work with.

🧠 What is Embedding? - Like teaching a robot slang — it helps the machine understand not just the word, but the vibe behind it

Embedding is nothing but turning stuff (like words) into meaningful vectors.
You give the computer a word — let’s say “cat” — and the computer turns it into a list of numbers like: [0.12, -0.98, 2.5, 0.3, ...]
That list of numbers is the embedding of the word “cat”.

It’s not random — it’s smart. The model learns these numbers in such a way that they actually mean something.

  • “Cat” and “Dog” will have embeddings that are close together

  • “Cat” and “Banana”? Probably far apart


Why is embedding necessary?

  • It lets the model understand the meaning of words.

  • It keeps track of context and relationships between words.

📐 Vector Embedding

Now that you know what a vector is and what an embedding is, let’s put them together.
Vector embedding is the process of turning stuff (like words, sentences, images, etc.) into vectors that capture their meaning.
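Here is a hand-made sketch of that idea. The numbers below are invented for illustration (real embeddings have hundreds or thousands of learned values), but they show how closeness in the vector space captures closeness in meaning:

```python
import math

# Made-up 4-number "embeddings" (real ones are much longer and are learned).
embeddings = {
    "cat":    [0.9, 0.8, 0.1, 0.0],
    "dog":    [0.8, 0.9, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9, 0.8],
}

def cosine_similarity(u, v):
    # Near 1.0 means the vectors point the same way; near 0 means unrelated.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))     # high: close in meaning
print(cosine_similarity(embeddings["cat"], embeddings["banana"]))  # low: far apart
```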


🧭 Positional Encoding - It’s GPS for words — so the model knows where each one sits in a sentence:

Positional encoding is basically applying a formula to the generated vectors (from embeddings) to update them — because the position of words in a sentence really matters.

Why? Because changing the position of words can change the entire meaning of a sentence.

For example:
Vector 1: “sky is blue” => [ [101], [204], [4099] ]

Vector 2: “blue is sky” => [ [4099], [204], [101] ]

You can see both have the same individual vectors (same words), but the order is different — and so is the meaning.

This is why we can’t just rely on the embeddings alone. The model needs to know:

  • Which word came first

  • Which came next

  • And so on…

So, positional encoding adds that extra “position info” to each word's vector so the model understands the sequence, not just the content.
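One common recipe for that "position info" is the sinusoidal encoding from the original 2017 Transformer paper: each position gets its own pattern of sine and cosine values, which is added to the word's embedding. A minimal sketch:

```python
import math

def positional_encoding(position, d_model):
    # Each dimension pair uses a different wavelength, so every position
    # gets a unique pattern of sine (even dims) and cosine (odd dims).
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Different positions -> different vectors, so "sky is blue"
# and "blue is sky" no longer look identical to the model.
print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))
```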


🔁 Self Attention — Like Reading a Sentence While Remembering the Whole Story:

Self-attention is where the model lets all the tokens talk to each other and update their vectors according to the given context. Positional encoding only fixed the position of the words; the words still didn’t know the context or their relationships to one another.
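A bare-bones sketch of single-head self-attention using NumPy. In a real transformer the queries, keys, and values come from learned weight matrices; here the token vectors are reused directly just to show the mechanics:

```python
import numpy as np

def self_attention(x):
    # Scores: how much each token should "look at" every other token.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    # Softmax over each row turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each token's new vector is a context-aware mix of all tokens.
    return weights @ x

# Three tokens, each a 4-number vector (made-up values).
tokens = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
])

out = self_attention(tokens)
print(out.shape)  # (3, 4): same shape, but every vector now carries context
```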

👁️‍🗨️ Multi-Head Attention — Seeing the same sentence through multiple lenses, then blending the views

We apply different sets of parameters (like different “questions” or “ways of thinking”) to the same vectors and run the same operation multiple times. Then we combine the results to get a more complete understanding of the input.

For example, let’s say the input is:
“I’m riding a bike.”

Now imagine we want to ask different things about this sentence, like:

  • When is this happening?

  • Where is it happening?

  • How is it happening?

  • What is the action?

Each of these questions becomes a different attention head: the same operation, but from a different angle. All of the heads process every vector, and each one generates a slightly different output.

Then we combine all those results, and the model uses the blended view for more accurate results.
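The "several questions at once" idea can be sketched like this. Random projections stand in for the learned per-head weight matrices of a real model:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

def multi_head_attention(x, num_heads):
    d = x.shape[-1]
    head_outputs = []
    for _ in range(num_heads):
        # Each head gets its own projections -- its own "question".
        # (Real models learn these matrices during training.)
        wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
        head_outputs.append(attention(x @ wq, x @ wk, x @ wv))
    # Combine the heads' views side by side.
    return np.concatenate(head_outputs, axis=-1)

tokens = rng.normal(size=(4, 8))  # 4 tokens, 8 numbers each (made-up)
out = multi_head_attention(tokens, num_heads=2)
print(out.shape)  # (4, 16): two heads' outputs, combined
```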


🎯 Softmax & 🌡️ Temperature — Softmax picks the favorite; temperature decides how wild it gets:

When we give some input to a model, it never settles on one fixed answer right away. It scores multiple possible outputs, and each of those gets a certain probability.

That’s where the linear layer and softmax function come into play.

🎯 Softmax - Where all options compete, and only one gets the spotlight

Softmax takes all the possible outputs and their scores, and turns them into a probability distribution.
Basically, it gives us the likelihood of each possible output — like:

for example
Input: “Sky is _____”

then

  • “blue” → 75%

  • “red” → 15%

  • “green” → 10%

Then, based on that, the model usually picks the most likely one as the final output — in this case, “blue”.
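Here is softmax in a few lines of Python. The raw scores are invented so the percentages land near the example above:

```python
import math

def softmax(scores):
    # Exponentiate every score, then divide by the total:
    # the results are positive and sum to 1, i.e. probabilities.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

words = ["blue", "red", "green"]
scores = [2.0, 0.4, 0.0]  # made-up raw scores from the model
probs = softmax(scores)

for word, p in zip(words, probs):
    print(f"{word}: {p:.0%}")  # blue: 75%, red: 15%, green: 10%
```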

🔥 Temperature - Low temp = Google Maps. High temp = Let’s just wander and see where we end up

Temperature is a setting that controls how confident or creative the model should be when picking the output.

  • Low temperature (like 0.1): The model only chooses very high-probability outputs. Results are more accurate but also predictable.

  • High temperature (like 1.0 or above): The model is more open to lower-probability outputs. This can make results more creative, random, or even a bit weird — but it’s interesting and fun!

Example:

Input: “The sky is…”

  • Low temperature → “blue” (predictable, common)

  • High temperature → “violet”, “full of balloons”, “not real” (creative or unexpected)
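Temperature is usually implemented by dividing the raw scores before applying softmax. A small sketch, reusing the made-up scores from the softmax example:

```python
import math

def softmax_with_temperature(scores, temperature):
    # Dividing by the temperature first: low T sharpens the
    # distribution, high T flattens it.
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 0.4, 0.0]  # made-up scores for "blue", "red", "green"

low = softmax_with_temperature(scores, 0.1)
high = softmax_with_temperature(scores, 2.0)

print(round(low[0], 3))   # ~1.0: almost always picks "blue"
print(round(high[0], 3))  # ~0.55: much more open to other words
```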


🧵 Wrapping It Up - Congrats! You just took a tour inside the brain of modern AI — and survived to tell the tale:

Alright, that was a lot — but now you’ve got a solid idea of how transformers actually work under the hood.

From breaking text into tiny tokens 🧩, turning them into numbers 📐, understanding word positions 🧭, to figuring out which word makes sense next 🎯 — it’s all part of what makes models like ChatGPT and others so smart.

Each piece we talked about is like one tool in a really powerful toolbox. Put them all together, and you get the brains behind modern AI.

Hope this helped clear things up!
Thanks for reading — and if you’re building something cool with AI, you’re already on the right track. 💻🚀

Written by Shriyash Parandkar