Understanding Tokenization: The First Step Inside an LLM Transformer
When people talk about Large Language Models (LLMs) like GPT, BERT, or others, they often skip the very first step: tokenization. Without tokenization, the model wouldn’t even understand what we are saying. So let’s break it down step by step in the simplest way possible. Think of this as me explaining it to a curious 10-year-old or someone who has never touched programming or AI before.
Why Do We Need Tokenization?
Computers don’t understand words like apple or mountain. They only deal with numbers.
Tokenization is like breaking a big sentence into small pieces (tokens) and then turning those pieces into numbers.
Example: If I say “I love pizza”, the model doesn’t see it as words. After tokenization, it might look like [1, 2, 3], or something a bit more complex.
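If you want to see this for yourself, here is a minimal sketch using the tiktoken library (my choice for illustration; any tokenizer library would do). The exact numbers depend on which model's vocabulary you load:

```python
# Minimal sketch: turn a sentence into token IDs with tiktoken (pip install tiktoken).
# The exact IDs depend on the encoding/vocabulary you pick.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # vocabulary used by newer GPT models

ids = enc.encode("I love pizza")
print(ids)              # a short list of integers, one per token
print(enc.decode(ids))  # "I love pizza" — decoding reverses the mapping
```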
Tokens Are Like Bricks
Imagine a huge Lego house. It’s built from many small Lego blocks.
Similarly, sentences are built from smaller chunks called tokens, and the model reads them as a sequence.
A token might be a full word (pizza), a piece of a word (piz), or even just one letter (a).
Why so flexible? Because the model needs to handle every possible word in the world, even brand new ones.
Words vs Sub-Words vs Characters
If tokenization were only by words, the model would need to store millions of words. Too heavy.
If it were only by characters, sequences would get far too long for the model to handle.
So LLMs use a middle ground: sub-words.
Example: The word “unbelievable” might become tokens like “un” + “believ” + “able”.
This way, even if the model has never seen unbelievable before, it can still piece it together.
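You can peek at sub-word splitting yourself. Here is a small sketch using Hugging Face's transformers tokenizer for GPT-2 (my pick for illustration; other models split differently, and the exact pieces depend on the learned vocabulary):

```python
# Sketch: inspect how a sub-word tokenizer splits a long word.
# Requires: pip install transformers. Exact pieces depend on GPT-2's learned vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

pieces = tokenizer.tokenize("unbelievable")
print(pieces)  # a few sub-word chunks, something like ['un', 'believ', 'able']
print(tokenizer.convert_tokens_to_ids(pieces))  # each chunk maps to an ID
```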
How Does Tokenization Actually Work?
Different models use different tricks, but let’s keep it simple.
One common method is Byte Pair Encoding (BPE).
It works like this:
1. Start with single letters.
2. Count which pairs of symbols appear next to each other most often.
3. Merge the most common pair into a bigger chunk.
4. Keep repeating until you’ve built a good dictionary of tokens.
Example: In English, letters t and h appear together often. So BPE merges them into “th”.
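Here is a toy sketch of that merge loop. It is a simplified illustration of the BPE idea, not how any production tokenizer is actually implemented (real ones work on bytes and huge corpora):

```python
# Toy Byte Pair Encoding sketch: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def bpe_merges(words, num_merges=5):
    # Start by representing each word as a list of single characters.
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus.
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # the most common pair, e.g. ('t', 'h')
        merges.append(best)
        merged = best[0] + best[1]
        # Replace that pair everywhere it occurs.
        new_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges(["the", "this", "that", "then"], num_merges=3)
print(merges)  # likely starts with ('t', 'h'), since "th" shows up in every word here
print(corpus)  # the words rebuilt from progressively bigger chunks
```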
Tokens Into Numbers
Once we have tokens, each one is assigned a unique ID number.
Think of it as a dictionary: “I” = 101, “love” = 200, “pizza” = 305.
The sentence “I love pizza” becomes [101, 200, 305]. This is the actual input the transformer receives.
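In code, that dictionary is just a lookup table. Here is a tiny sketch with made-up IDs (a real vocabulary has tens of thousands of entries, and the numbers come from the tokenizer’s training, not from a human assigning them):

```python
# Tiny sketch of the token -> ID lookup, with made-up IDs for illustration.
vocab = {"I": 101, "love": 200, "pizza": 305}
id_to_token = {i: t for t, i in vocab.items()}  # reverse map, used for decoding

tokens = "I love pizza".split()                 # pretend word-level tokenization
ids = [vocab[t] for t in tokens]
print(ids)                                      # [101, 200, 305] — what the model sees

print(" ".join(id_to_token[i] for i in ids))    # "I love pizza" — decoded back
```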
Think of It as Translating for a Friend
Imagine you have a friend who only understands numbers, not words.
You tell them: “I love pizza”. They stare blankly.
But then you pull out a codebook where “I = 101”, “love = 200”, “pizza = 305”.
Suddenly they get it: [101, 200, 305] means “I love pizza”. That’s exactly what tokenization does for LLMs.
Why This Is Important
If tokenization fails, everything else in the LLM breaks.
It sets the stage for embeddings, attention, and the entire transformer magic.
Think of it like chopping vegetables before cooking. If you don’t chop them right, the dish won’t cook properly.
So, tokenization is the unsung hero of transformers. It doesn’t get as much hype as “attention” or “embeddings”, but without it, nothing works. It’s the simple yet powerful step that converts our language into a form that LLMs can actually understand.
I hope you understand it better now. Apologies if some of the wording reads oddly; Gemini sometimes feels a bit drunk while producing language output. 😂