Understanding Tokenization: The First Step Inside an LLM Transformer
When people talk about Large Language Models (LLMs) like GPT, BERT, or others, they often skip the very first step: tokenization. Without tokenization, the model wouldn’t even understand what we are saying. So let’s break it down step by step in the simplest way possible. Think of this as me explaining it to a curious 10-year-old or someone who has never touched programming or AI before.
Why Do We Need Tokenization?
Computers don’t understand words like apple or mountain. They only deal with numbers.
Tokenization is like breaking a big sentence into small pieces (tokens) and then turning those pieces into numbers.
Example: If I say “I love pizza”, the model doesn’t see it as words. After tokenization, it might look like [1, 2, 3], or something a bit more complex.
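If you want to see this for yourself, here is a minimal sketch using the tiktoken library (my choice for illustration; any tokenizer library would do). The exact numbers depend on which model's vocabulary you load:

```python
# Minimal sketch: turn a sentence into token IDs with tiktoken (pip install tiktoken).
# The exact IDs depend on the encoding/vocabulary you pick.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # vocabulary used by newer GPT models

ids = enc.encode("I love pizza")
print(ids)              # a short list of integers, one per token
print(enc.decode(ids))  # "I love pizza" — decoding reverses the mapping
```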
Tokens Are Like Bricks
Imagine a huge Lego house. It’s built from many small Lego blocks.
Similarly, sentences are built from smaller chunks called tokens, and the model reads them as a sequence.
A token might be a full word (pizza), a piece of a word (piz), or even just one letter (a).
Why so flexible? Because the model needs to handle every possible word in the world, even brand new ones.
Words vs Sub-Words vs Characters
If tokenization were only by words, the model would need to store millions of words. Too heavy.
If it were only by characters, sequences would get far too long for the model to handle.
So LLMs use a middle ground: sub-words.
Example: The word “unbelievable” might become tokens like “un” + “believ” + “able”.
This way, even if the model has never seen unbelievable before, it can still piece it together.
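You can peek at sub-word splitting yourself. Here is a small sketch using Hugging Face's transformers tokenizer for GPT-2 (my pick for illustration; other models split differently, and the exact pieces depend on the learned vocabulary):

```python
# Sketch: inspect how a sub-word tokenizer splits a long word.
# Requires: pip install transformers. Exact pieces depend on GPT-2's learned vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

pieces = tokenizer.tokenize("unbelievable")
print(pieces)  # a few sub-word chunks, something like ['un', 'believ', 'able']
print(tokenizer.convert_tokens_to_ids(pieces))  # each chunk maps to an ID
```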
How Does Tokenization Actually Work?
Different models use different tricks, but let’s keep it simple.
One common method is Byte Pair Encoding (BPE).
It works like this:
1. Start with single letters.
2. Count which pairs of symbols appear next to each other most often.
3. Merge the most common pair into a bigger chunk.
4. Keep repeating until you’ve built a good dictionary of tokens.
Example: In English, letters t and h appear together often. So BPE merges them into “th”.
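Here is a toy sketch of that merge loop. It is a simplified illustration of the BPE idea, not how any production tokenizer is actually implemented (real ones work on bytes and huge corpora):

```python
# Toy Byte Pair Encoding sketch: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def bpe_merges(words, num_merges=5):
    # Start by representing each word as a list of single characters.
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus.
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # the most common pair, e.g. ('t', 'h')
        merges.append(best)
        merged = best[0] + best[1]
        # Replace that pair everywhere it occurs.
        new_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges(["the", "this", "that", "then"], num_merges=3)
print(merges)  # likely starts with ('t', 'h'), since "th" shows up in every word here
print(corpus)  # the words rebuilt from progressively bigger chunks
```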
Tokens Into Numbers
Once we have tokens, each one is assigned a unique ID number.
Think of it as a dictionary: “I” = 101, “love” = 200, “pizza” = 305.
The sentence “I love pizza” becomes [101, 200, 305]. This is the actual input the transformer receives.
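In code, that dictionary is just a lookup table. Here is a tiny sketch with made-up IDs (a real vocabulary has tens of thousands of entries, and the numbers come from the tokenizer’s training, not from a human assigning them):

```python
# Tiny sketch of the token -> ID lookup, with made-up IDs for illustration.
vocab = {"I": 101, "love": 200, "pizza": 305}
id_to_token = {i: t for t, i in vocab.items()}  # reverse map, used for decoding

tokens = "I love pizza".split()                 # pretend word-level tokenization
ids = [vocab[t] for t in tokens]
print(ids)                                      # [101, 200, 305] — what the model sees

print(" ".join(id_to_token[i] for i in ids))    # "I love pizza" — decoded back
```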
Think of It as Translating for a Friend
Imagine you have a friend who only understands numbers, not words.
You tell them: “I love pizza”. They stare blankly.
But then you pull out a codebook where “I = 101”, “love = 200”, “pizza = 305”.
Suddenly they get it: [101, 200, 305] means “I love pizza”. That’s exactly what tokenization does for LLMs.
Why This Is Important
If tokenization fails, everything else in the LLM breaks.
It sets the stage for embeddings, attention, and the entire transformer magic.
Think of it like chopping vegetables before cooking. If you don’t chop them right, the dish won’t cook properly.
So, tokenization is the unsung hero of transformers. It doesn’t get as much hype as “attention” or “embeddings”, but without it, nothing works. It’s the simple yet powerful step that converts our language into a form that LLMs can actually understand.
I hope you understand it better now. Apologies if some of the wording reads oddly; Gemini sometimes feels a bit drunk while producing language output. 😂