Tokenization Unleashed

Stop! Before a Computer Can Read, It Needs to Learn Its ABCs.
Ever wondered how your phone knows you meant to type "definitely" when you mashed "definately"? Or how Google seems to read your mind, finishing your search query before you do? It’s not magic. It’s a fundamental, yet super simple, concept in artificial intelligence called Tokenization.
If you're new to the world of coding or AI and hear terms like this, it's easy to feel a little lost. But don't worry! By the end of this blog, you'll understand tokenization so well, you'll be explaining it to your friends. No coding, no scary jargon, just a simple breakdown of one of AI's coolest first steps.
The Secret Recipe: An Analogy
Imagine you're trying to teach a friend who has never cooked before how to make a fruit salad. You wouldn't just hand them a pile of whole fruits and a bowl, right? You’d say, "First, chop the apple into bite-sized pieces. Then, slice the banana. Then, separate the grapes."
In this analogy:
The recipe is the final goal (understanding the language).
The whole fruits are long sentences or paragraphs of text.
The chopping, slicing, and separating is Tokenization.
The bite-sized pieces of fruit are the tokens.
That’s it! Tokenization is simply the process of breaking down a chunk of text into smaller, meaningful pieces called tokens. For a computer, which can't understand language like we do, this is the most crucial first step. It's how we get the text ready for the computer to "cook" with.
Why Can't Computers Just Read?
Humans are amazing. When you read the sentence "The quick brown fox jumps," your brain instantly recognizes five separate words and understands the meaning they form together. You see spaces and punctuation as guides.
A computer, on the other hand, just sees a long string of characters: T-h-e- -q-u-i-c-k- -b-r-o-w-n...
It has no built-in idea of what a "word" is.
We need to teach it. Tokenization is that lesson. By chopping the text into tokens, we’re converting a messy, unstructured sentence into a neat, organized list that a machine can start to work with. It's the bridge from human language to machine understanding.
The Different Ways to Chop: Types of Tokenization
Just like you can chop fruits in different ways (diced, sliced, julienned), there are a few ways to tokenize text.
1. Word Tokenization
This is the most common and intuitive type. You simply split the text wherever there's a space (and usually peel punctuation off into its own token, as in the example below).
Sentence: "I am learning AI."
Tokens:
['I', 'am', 'learning', 'AI', '.']
Simple, right? This is great for many languages like English where words are usually separated by spaces.
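If you're curious what this looks like in code, here's a tiny Python sketch using nothing but the standard library. The naive version splits on spaces only; the slightly smarter version uses a regular expression to also peel punctuation off into its own token:

```python
import re

sentence = "I am learning AI."

# Naive approach: split on whitespace only.
print(sentence.split())
# ['I', 'am', 'learning', 'AI.']  <- the period stays glued to "AI"

# Slightly smarter: treat runs of letters/digits and punctuation marks
# as separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
# ['I', 'am', 'learning', 'AI', '.']
```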
2. Character Tokenization
Here, you go even smaller and break the text down into individual characters.
Sentence: "Hello"
Tokens:
['H', 'e', 'l', 'l', 'o']
This might seem too simple, but it's incredibly useful for catching typos or for analyzing languages with complex characters.
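In Python this one is almost a one-liner, since a string is already a sequence of characters:

```python
sentence = "Hello"

# Character tokenization: just turn the string into a list of its characters.
tokens = list(sentence)
print(tokens)
# ['H', 'e', 'l', 'l', 'o']
```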
3. Subword Tokenization (The Smartest Way)
This is the clever middle ground and what powers modern AI like ChatGPT. It breaks words down into smaller, meaningful sub-parts.
Think about the word "unhappily". Subword tokenization might break it down like this:
Tokens:
['un', 'happi', 'ly']
Why is this so smart? The AI can learn that "un-" usually means something is negative, and "-ly" makes a word an adverb. Now, if it sees a brand new word it’s never encountered before, like "un-friend-ly", it can make a good guess about its meaning because it recognizes the parts! It gives the AI flexibility and a much deeper understanding of language structure.
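Real subword tokenizers (like the ones behind ChatGPT) learn their vocabulary from enormous amounts of text using algorithms such as Byte Pair Encoding. But a toy version shows the idea: below is a sketch that greedily grabs the longest matching piece from a tiny, hand-made vocabulary. The vocabulary is invented purely for this example.

```python
# Toy subword tokenizer: repeatedly take the longest piece from a tiny,
# hand-made vocabulary that matches the start of what's left of the word.
# Real tokenizers learn their vocabulary from data; this one is made up.
VOCAB = {"un", "happi", "ly", "friend"}

def subword_tokenize(word, vocab=VOCAB):
    tokens = []
    while word:
        # Try the longest possible prefix first, then shorter ones.
        for end in range(len(word), 0, -1):
            piece = word[:end]
            if piece in vocab or end == 1:
                # Fall back to a single character if nothing matches.
                tokens.append(piece)
                word = word[end:]
                break
    return tokens

print(subword_tokenize("unhappily"))   # ['un', 'happi', 'ly']
print(subword_tokenize("unfriendly"))  # ['un', 'friend', 'ly']
```

Notice how "unfriendly" gets split into parts the tokenizer already knows, even though the whole word was never in its vocabulary. That's exactly the flexibility described above.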
The Tricky Bits (It's Not Always a Piece of Cake)
Tokenization seems easy, but language is messy!
Punctuation: What about "don't"? Should that be one token ('don't') or two ('do' and "n't")? Different tokenizers handle this differently.
Hyphens: Is "state-of-the-art" one word or four?
Languages Without Spaces: Languages like Chinese and Japanese don't put spaces between words. So how do you know where one word ends and the next begins? This is a huge challenge where smart tokenization is essential. For example, 東京都は日本の首都です (Tokyo is the capital of Japan) has no spaces at all.
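To see why these cases trip computers up, here's what the simple "split on spaces" approach from earlier does with each of them:

```python
# Naive whitespace splitting stumbles on all three tricky cases.
print("I don't know".split())
# ['I', "don't", 'know']  -> is "don't" one token or two ('do' + "n't")?

print("This is state-of-the-art tech".split())
# ['This', 'is', 'state-of-the-art', 'tech']  -> one token, or four?

print("東京都は日本の首都です".split())
# ['東京都は日本の首都です']  -> no spaces at all, so we get one giant "token"
```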
Tokenization is Everywhere!
You interact with the results of tokenization every single day:
Google Search: When you type "best pizza in Brooklyn," Google tokenizes your query into ['best', 'pizza', 'in', 'Brooklyn'] to find the most relevant web pages.
Siri & Alexa: When you say, "Hey Siri, what's the weather?" your voice is converted to text, which is then tokenized so the AI can figure out your intent.
Email Spam Filters: They tokenize the content of an email to look for spammy words or phrases (like 'free', 'money', or 'click here').
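Just for fun, here's a toy sketch of that spam-filter idea. Real filters are far more sophisticated; the word list and threshold below are completely made up for illustration:

```python
import re

# Invented for illustration: a tiny set of "spammy" tokens and a threshold.
SPAMMY_TOKENS = {"free", "money", "click", "winner"}
THRESHOLD = 2

def looks_spammy(email_text):
    # Tokenize: lowercase the text and pull out runs of letters/digits.
    tokens = re.findall(r"\w+", email_text.lower())
    hits = sum(1 for token in tokens if token in SPAMMY_TOKENS)
    return hits >= THRESHOLD

print(looks_spammy("Click here to claim your FREE money!"))  # True
print(looks_spammy("Lunch tomorrow?"))                       # False
```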
You Now Understand a Core Part of AI 🧠
And that's all there is to it. Tokenization isn't some high-level, gatekept secret of the AI world. It's a simple, powerful idea of chopping up text so a computer can begin to make sense of it.
So next time your phone suggests the perfect next word in a text or a chatbot understands your question, you can smile and know the secret ingredient that made it all possible: a little bit of chopping and slicing.