Tokenization Explained for Freshers

Imagine you’re trying to teach a robot to understand human language. If you give it an entire paragraph at once, it might get confused. Instead, you break that paragraph into small pieces — these pieces are called tokens.

Tokenization is the process of splitting text into smaller, meaningful parts so that a computer can process and understand them.

Why Tokenization?

When you type a sentence like:

“I love programming.”
A computer doesn’t process a sentence as one unbroken string. Tokenization splits it into:

["I", "love", "programming", "."]

These tokens make it easier for algorithms to work on tasks like search, translation, or sentiment analysis.
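
To make this concrete, here is a minimal sketch in Python using only the standard re module (no NLP library needed). The regex below is just one simple way to separate words from punctuation, not the only one:

```python
import re

sentence = "I love programming."

# \w+ matches runs of word characters; [^\w\s] matches a single
# punctuation mark, so the period becomes its own token.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)  # ['I', 'love', 'programming', '.']
```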

Types of Tokenization

  1. Word Tokenization – Splits text by words.
    Example: "Machine learning is fun" → ["Machine", "learning", "is", "fun"]

  2. Subword Tokenization – Breaks words into smaller parts to handle rare or unknown words.
    Example: "unhappiness" → ["un", "happi", "ness"]

  3. Character Tokenization – Splits text into individual characters; all three types are sketched in code after this list.
    Example: "Hi" → ["H", "i"]

Real-life Analogy

Think of tokenization like cutting a pizza into slices. The pizza is your sentence, and the slices are tokens. You can serve (process) them one at a time.

Where It’s Used

  • Search engines (finding the right results)

  • Chatbots (understanding your queries)

  • Spell checkers (identifying mistakes)

  • AI models like ChatGPT (understanding and generating language; see the example below)
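
To make that last point concrete: GPT-style models use a learned subword tokenizer, and if you have OpenAI’s open-source tiktoken package installed (assumed here), you can inspect the tokens yourself. cl100k_base is one of its built-in encodings:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a built-in BPE encoding

ids = enc.encode("I love programming.")
print(ids)  # a short list of integer token IDs

# Decode each ID back to its text piece; note how some tokens
# carry a leading space, a common trait of BPE vocabularies.
print([enc.decode([i]) for i in ids])
```

The model itself never sees your raw text, only these integer IDs, which is exactly why tokenization comes first in the pipeline.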

In short: Tokenization is the first step in teaching computers to understand language — without it, everything else falls apart.
