Tokenization Explained for Freshers


Imagine you’re trying to teach a robot to understand human language. If you give it an entire paragraph at once, it might get confused. Instead, you break that paragraph into small pieces — these pieces are called tokens.
Tokenization is the process of splitting text into smaller, meaningful parts so that a computer can process and understand them.
Why Tokenization?
When you type a sentence like:
“I love programming.”
A computer doesn’t understand words directly, so tokenization turns that sentence into:
["I", "love", "programming", "."]
These tokens make it easier for algorithms to work on tasks like search, translation, or sentiment analysis.
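To see this in action, here is a minimal Python sketch that produces exactly those tokens. It uses a simple regular expression; a production tokenizer handles many more edge cases (contractions, URLs, emoji, and so on).

import re

# A minimal word-level tokenizer using a regular expression.
# \w+ matches a run of letters/digits; [^\w\s] matches a lone punctuation mark.
def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("I love programming."))
# ['I', 'love', 'programming', '.']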
Types of Tokenization
Word Tokenization – Splits text by words.
Example: "Machine learning is fun" → ["Machine", "learning", "is", "fun"]
Subword Tokenization – Breaks words into smaller parts to handle rare or unknown words.
Example: "unhappiness" → ["un", "happi", "ness"]
Character Tokenization – Splits text into individual characters.
Example: "Hi" → ["H", "i"]
Real-life Analogy
Think of tokenization like cutting a pizza into slices. The pizza is your sentence, and the slices are tokens. You can serve (process) them one at a time.
Where It’s Used
Search engines (finding the right results)
Chatbots (understanding your queries)
Spell checkers (identifying mistakes)
AI models like ChatGPT (understanding and generating language)
In short: Tokenization is the first step in teaching computers to understand language — without it, everything else falls apart.