Explaining Tokenization to a Fresher

When you first step into the world of Natural Language Processing (NLP) or Large Language Models (LLMs), one of the first words you’ll hear is “tokenization.”
It sounds technical, but don’t worry. Let’s break it down in the simplest way possible.
What is Tokenization?
Imagine you’re reading a sentence:
👉 “I love learning AI.”
Now, for you, that’s just one sentence.
But for a computer, it’s too much to understand all at once. Computers don’t “see” words the way we do.
So, before processing, the sentence is broken into smaller pieces called tokens.
Tokens = Building Blocks 🧩
Think of tokens like LEGO blocks.
A word can be a token.
Sometimes, even part of a word can be a token.
Numbers, punctuation, and special characters are also tokens.
Example:
Sentence → “I love learning AI.”
Tokens → [“I”, “love”, “learning”, “AI”, “.”]
Now the computer can work with it step by step.
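To make this concrete, here is a minimal sketch in Python using a simple regex split. This is just for illustration; real tokenizers used by LLMs are far more sophisticated.

```python
import re

def simple_tokenize(text):
    # Grab runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("I love learning AI."))
# → ['I', 'love', 'learning', 'AI', '.']
```

Each piece of the sentence, including the final period, becomes its own token that the computer can handle one at a time.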
Why Do We Need Tokenization?
Computers are not good at dealing with huge chunks of text.
Breaking text into tokens helps:
Easier processing → Smaller pieces are simpler to handle.
Consistency → The same words get split the same way every time.
Model input → Most AI models (like GPT) only understand tokens, not raw text.
It’s like cutting a pizza into slices. You can’t eat the whole pizza at once 🍕.
Types of Tokenization
Word-level tokenization
Breaks text into words.
Example: “I’m happy” → [“I’m”, “happy”]
Subword tokenization
Splits words into smaller parts if needed.
Example: “Happiness” → [“Happi”, “ness”]
Character-level tokenization
Splits text into individual characters.
Example: “AI” → [“A”, “I”]
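The three types above can be sketched in a few lines of Python. The subword example uses a toy greedy longest-match over a made-up vocabulary; real tokenizers learn their vocabularies from data (for example with byte-pair encoding), but the idea is similar.

```python
# Word-level: split on whitespace
words = "I'm happy".split()   # ["I'm", "happy"]

# Character-level: every character is a token
chars = list("AI")            # ["A", "I"]

# Subword: greedy longest-match against a tiny, hand-made vocabulary
# (an assumption for illustration; real vocabularies are learned)
vocab = {"Happi", "ness", "Happ", "i"}

def subword_tokenize(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        # Find the longest vocabulary entry matching at this position
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            tokens.append(word[start])  # unknown character as its own token
            start += 1
    return tokens

print(subword_tokenize("Happiness", vocab))  # → ['Happi', 'ness']
```

Notice how “Happiness” never appears in the vocabulary, yet the tokenizer still handles it by combining pieces it does know. That is exactly why subword tokenization copes well with new or rare words.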
Modern AI models (like GPT) mostly use subword tokenization, because it’s more flexible with new or rare words.
Tokenization in Real Life (GPT Example)
When you type a sentence into ChatGPT, it doesn’t read it as full sentences. It turns your text into tokens first.
For example, the sentence:
👉 “Hello world!”
Might become 3 tokens: [“Hello”, “ world”, “!”]
The model then processes these tokens to generate a response.
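Under the hood, each token is also mapped to an integer ID, because the model works with numbers, not strings. Here is a toy sketch of that idea; the three-entry vocabulary below is made up for illustration, while real GPT models use learned vocabularies with tens of thousands of subword tokens.

```python
# Toy vocabulary mapping tokens to integer IDs (an illustrative assumption,
# not the real GPT vocabulary).
vocab = {"Hello": 0, " world": 1, "!": 2}
id_to_token = {i: t for t, i in vocab.items()}

tokens = ["Hello", " world", "!"]
ids = [vocab[t] for t in tokens]
print(ids)  # → [0, 1, 2]

# Decoding reverses the mapping, turning IDs back into text.
print("".join(id_to_token[i] for i in ids))  # → Hello world!
```

So the full pipeline is: text → tokens → token IDs → model, and then the same mapping in reverse to turn the model’s output back into text.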
Final Thoughts
For a fresher, remember this:
👉 Tokenization is just the process of breaking text into smaller pieces (tokens) so that computers can understand and process it.
Without tokenization, AI models would be lost in a jungle of words. With it, they can actually “read” and make sense of language.