Tokenization in AI: A Fresher’s Guide (With Desi Examples 🍵)


🛑 Before We Start — Why Care About Tokenization?
If you’re a fresher stepping into the world of Natural Language Processing (NLP) or Generative AI, you might think:
“I just give AI my text, and it magically understands it… right?”
Well, the “magic” has a name — Tokenization.
It’s how your big sentence becomes small, bite-sized pieces the AI can process.
Think of it as:
Your Mom making parathas — she first breaks the dough into smaller balls.
A DJ mixing songs — cutting them into beats before remixing.
A teacher breaking chapters into topics — so you don’t panic seeing the whole syllabus at once.
Without tokenization, AI would choke on your giant paragraph like you trying to swallow a whole samosa in one bite. 🥴
💡 What is Tokenization?
Tokenization is the process of breaking text into smaller chunks (called tokens) before feeding it to an AI model.
These “tokens” could be:
Words ("I", "love", "chai")
Sub-words ("lov", "e", "chai")
Characters ("I", "l", "o", "v", "e")
Even special symbols (".", ",", "?")
🧠 Why Not Just Use Words?
You might ask — why can’t AI just read full words?
Because:
New words are invented daily — AI needs to handle unknown ones.
Different languages & slang — “bro”, “bhai”, “yaar” all mean friend.
Typos — AI should still understand "helo" as "hello".
By breaking into smaller units, AI can still make sense of unknown words by combining known pieces.
📦 Types of Tokenization
1. Word Tokenization
Breaking text at spaces.
Example:"I love chai"
→ [I, love, chai]
❌ Problem: Doesn’t handle unknown words well.
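Here's a tiny Python sketch of word tokenization, using a plain whitespace split (real NLP libraries like NLTK also handle punctuation properly):

```python
# Word tokenization: split the sentence at spaces.
sentence = "I love chai"
tokens = sentence.split()
print(tokens)  # ['I', 'love', 'chai']
```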
2. Character Tokenization
Breaking text into individual letters.
Example: "chai" → ["c", "h", "a", "i"]
✅ Good for unknown words.
❌ But long sentences mean a LOT of tokens → more processing cost.
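A quick sketch of character tokenization, and why the token count explodes:

```python
# Character tokenization: every single character becomes a token.
word = "chai"
print(list(word))  # ['c', 'h', 'a', 'i']

# Even a tiny sentence turns into a lot of tokens (spaces included).
print(len(list("I love chai")))  # 11
```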
3. Subword Tokenization (BPE / WordPiece)
The most common in AI models like GPT.
Breaks text into frequent chunks.
Example:"lovely"
→ ["lov", "ely"]
"love"
→ ["lov", "e"]
So AI only learns “lov” once and can combine it with “e” or “ely”.
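Here's a toy sketch of that idea (a hand-made vocabulary with greedy longest-match, not the actual BPE algorithm GPT uses) just to show how "lov" gets reused:

```python
# Toy illustration (not real BPE): match the longest known subword first.
vocab = {"lov", "e", "ely", "chai", "i"}  # made-up mini vocabulary

def subword_tokenize(word, vocab):
    tokens, i = [], 0
    while i < len(word):
        # try the longest matching piece starting at position i
        for j in range(len(word), i, -1):
            piece = word[i:j].lower()
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character falls back to itself
            i += 1
    return tokens

print(subword_tokenize("love", vocab))    # ['lov', 'e']
print(subword_tokenize("lovely", vocab))  # ['lov', 'ely']
```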
📊 Real Example: How GPT Tokenizes
Sentence: "I love samosas"
Tokens: [ "I", " love", " sam", "osas" ]
Notice: Spaces are also part of tokens.
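You can check this yourself with tiktoken, OpenAI's open-source tokenizer. This is a small sketch assuming tiktoken is installed; the exact pieces and IDs depend on the encoding, so they may differ slightly from the split shown above:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-3.5 / GPT-4
ids = enc.encode("I love samosas")
print(ids)                                    # each token becomes a number (ID)
print([enc.decode([i]) for i in ids])         # the text piece behind each ID
```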
🛠 How Tokenization Works in AI Models
Let’s simplify:
You: “I love samosas”
Tokenizer: Breaks it into tokens
Model: Converts each token into a number (ID) → embedding → AI brain processing.
Output: AI predicts next tokens → joins them → final text.
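A toy sketch of that flow (the vocabulary and IDs below are made up for illustration; real models map IDs to embeddings and use vocabularies with tens of thousands of entries):

```python
# Toy pipeline: text -> tokens -> IDs -> (model) -> IDs -> text
vocab = {"I": 0, " love": 1, " sam": 2, "osas": 3, "!": 4}   # made-up IDs
id_to_token = {i: t for t, i in vocab.items()}

tokens = ["I", " love", " sam", "osas"]       # step 1: tokenizer output
ids = [vocab[t] for t in tokens]              # step 2: tokens become numbers
print(ids)                                    # [0, 1, 2, 3]

predicted = ids + [4]                         # step 3: pretend the model predicted "!"
print("".join(id_to_token[i] for i in predicted))  # I love samosas!
```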
🍛 Desi Analogy: Chai Biscuit Tokenization
Imagine:
Sentence = A full packet of Parle-G biscuits.
Tokenization = Breaking biscuits into small pieces to dip in chai.
AI = Your mouth & brain enjoying and understanding each piece.
If you dump the whole packet in chai, it’s a mess. If you break it, you enjoy and understand each bite.
⚡ Why Tokenization Matters for Freshers
Performance → Fewer tokens = faster processing.
Cost → GPT API charges per token, not per word.
"Namaste"
as one token is cheaper than"Na", "ma", "ste"
as three.Understanding → Better tokenization = AI understands context better.
🔍 Pro Tip for Freshers
If you’re using OpenAI API or similar:
Check token limits (e.g., GPT-3.5 can handle ~4,096 tokens).
Use tools like OpenAI Tokenizer to see how your text breaks down.
Minimize extra spaces and fluff to save token costs.
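For example, here's a small helper sketch for counting tokens before you call the API (it assumes tiktoken is installed; the 4,096 limit and the encoding depend on which model you actually use):

```python
import tiktoken

def fits_in_context(text, limit=4096, encoding_name="cl100k_base"):
    """Count tokens in `text` and check it against a context limit."""
    enc = tiktoken.get_encoding(encoding_name)
    n_tokens = len(enc.encode(text))
    print(f"{n_tokens} tokens out of {limit}")
    return n_tokens <= limit

fits_in_context("I love samosas and chai.")
```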
🚀 Summary
Tokenization = breaking text into small chunks.
It’s the first step before AI processes your text.
Different methods: word, character, subword.
Impacts speed, cost, and accuracy.
🎯 Final Thought
If you understand tokenization, you’ve taken the first real step into the AI world.
It’s like learning how to cut vegetables before cooking — simple but absolutely essential.