Tokenization in AI: A Fresher’s Guide (With Desi Examples 🍵)

🛑 Before We Start — Why Care About Tokenization?

If you’re a fresher stepping into the world of Natural Language Processing (NLP) or Generative AI, you might think:

“I just give AI my text, and it magically understands it… right?”

Well, the “magic” has a name — Tokenization.
It’s how your big sentence becomes small, bite-sized pieces the AI can process.

Think of it as:

  • Your Mom making parathas — she first breaks the dough into smaller balls.

  • A DJ mixing songs — cutting them into beats before remixing.

  • A teacher breaking chapters into topics — so you don’t panic seeing the whole syllabus at once.

Without tokenization, AI would choke on your giant paragraph like you trying to swallow a whole samosa in one bite. 🥴

💡 What is Tokenization?

Tokenization is the process of breaking text into smaller chunks (called tokens) before feeding it to an AI model.

These “tokens” could be:

  • Words (I, love, chai)

  • Sub-words (lov, e, chai)

  • Characters (I, l, o, v, e)

  • Even special symbols (".", ",", "?")

🧠 Why Not Just Use Words?

You might ask — why can’t AI just read full words?
Because:

  1. New words are invented daily — AI needs to handle unknown ones.

  2. Different languages & slang — “bro”, “bhai”, “yaar” all mean friend.

  3. Typos — AI should still understand helo as hello.

By breaking text into smaller units, AI can still make sense of unknown words by combining known pieces.
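
To make this concrete, here's a tiny Python sketch. Both vocabularies are made up just for illustration: a word-level vocabulary has never seen "chaiwala", but a vocabulary of smaller pieces already covers it.

```python
# Toy sketch: a word-level vocabulary fails on a brand-new word,
# but smaller known pieces can still cover it. (Both vocabularies are made up.)
word_vocab = {"I", "love", "chai", "bhai", "yaar"}
piece_vocab = {"I", "love", "chai", "wala", "bhai", "yaar"}

print("chaiwala" in word_vocab)                      # False: never seen as a whole word
print("chai" in piece_vocab, "wala" in piece_vocab)  # True True: both pieces are known
```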

📦 Types of Tokenization

1. Word Tokenization

Breaking text at spaces.
Example:
"I love chai"[I, love, chai]
❌ Problem: Doesn’t handle unknown words well.
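
A minimal Python sketch of this, with a made-up vocabulary to show the unknown-word problem:

```python
# Word tokenization: simply split on whitespace.
sentence = "I love chai"
print(sentence.split())  # ['I', 'love', 'chai']

# The weakness: a word the model has never seen has no entry in its vocabulary.
vocab = {"I", "love", "chai"}
tokens = [t if t in vocab else "<UNK>" for t in "I love chaiwala".split()]
print(tokens)  # ['I', 'love', '<UNK>'] -- the new word is lost entirely
```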

2. Character Tokenization

Breaking into letters.
"chai"[c, h, a, i]
✅ Good for unknown words.
❌ But long sentences mean a LOT of tokens → more processing cost.
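
A quick sketch of the trade-off:

```python
# Character tokenization: every single character becomes a token.
print(list("chai"))  # ['c', 'h', 'a', 'i']

# The cost: even a short sentence produces far more tokens than word splitting.
sentence = "I love chai"
print(len(sentence.split()), "word tokens vs", len(list(sentence)), "character tokens")
```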

3. Subword Tokenization (BPE / WordPiece)

The most common approach in AI models like GPT.
It breaks text into chunks that appear frequently in the training text.
Example:
"lovely"["lov", "ely"]
"love"["lov", "e"]

So AI only learns “lov” once and can combine it with “e” or “ely”.
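
Real BPE/WordPiece tokenizers learn their pieces from huge amounts of text, but the core idea can be sketched with a toy piece vocabulary and greedy longest-match splitting (the pieces below are made up for illustration, not taken from any real model):

```python
# Toy subword splitter: not real BPE, just greedy longest-match over known pieces.
pieces = {"lov", "e", "ely", "chai", "wala"}

def to_subwords(word):
    out, i = [], 0
    while i < len(word):
        # take the longest known piece starting at position i, else a single character
        for j in range(len(word), i, -1):
            if word[i:j] in pieces:
                out.append(word[i:j])
                i = j
                break
        else:
            out.append(word[i])
            i += 1
    return out

print(to_subwords("lovely"))    # ['lov', 'ely']
print(to_subwords("love"))      # ['lov', 'e']
print(to_subwords("chaiwala"))  # ['chai', 'wala']: an unseen word, built from known pieces
```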

📊 Real Example: How GPT Tokenizes

Sentence: "I love samosas"
Tokens: [ "I", " love", " sam", "osas" ]
Notice: Spaces are also part of tokens.
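
You can check this yourself with OpenAI's open-source tiktoken library. The exact pieces depend on which encoding you load, so they may differ slightly from the example above:

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-3.5/GPT-4 era models
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("I love samosas")
pieces = [enc.decode([i]) for i in ids]

print(ids)     # the integer IDs the model actually sees
print(pieces)  # the text piece behind each ID -- note the leading spaces
```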

🛠 How Tokenization Works in AI Models

Let’s simplify it step by step (a toy code sketch follows the list):

  1. You: “I love samosas”

  2. Tokenizer: Breaks it into tokens

  3. Model: Converts each token into a number (ID) → embedding → AI brain processing.

  4. Output: AI predicts next tokens → joins them → final text.
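
Here's a toy version of that pipeline in Python. Everything in it (the vocabulary, the IDs, the tiny embedding vectors) is made up purely to show the shape of each step; real models learn these values during training:

```python
# Step 1-2: the sentence is split into tokens (reusing the GPT example above)
tokens = ["I", " love", " sam", "osas"]

# Step 3: each token becomes an ID, and each ID looks up an embedding vector
vocab = {"I": 0, " love": 1, " sam": 2, "osas": 3}   # toy vocabulary
embeddings = {0: [0.1, 0.3], 1: [0.7, 0.2],          # toy 2-dimensional vectors
              2: [0.4, 0.9], 3: [0.5, 0.5]}
ids = [vocab[t] for t in tokens]
vectors = [embeddings[i] for i in ids]

# Step 4: the model predicts new token IDs; the tokenizer turns them back into text.
# Here we simply echo the input IDs as a stand-in for the model's output.
id_to_token = {i: t for t, i in vocab.items()}
predicted_ids = ids
print("".join(id_to_token[i] for i in predicted_ids))  # I love samosas
```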

🍛 Desi Analogy: Chai Biscuit Tokenization

Imagine:

  • Sentence = A full packet of Parle-G biscuits.

  • Tokenization = Breaking biscuits into small pieces to dip in chai.

  • AI = Your mouth & brain enjoying and understanding each piece.

If you dump the whole packet in chai, it’s a mess. If you break it, you enjoy and understand each bite.

⚡ Why Tokenization Matters for Freshers

  • Performance → Fewer tokens = faster processing.

  • Cost → GPT API charges per token, not per word.
    "Namaste" as one token is cheaper than "Na", "ma", "ste" as three.

  • Understanding → Better tokenization = AI understands context better.

🔍 Pro Tip for Freshers

If you’re using OpenAI API or similar:

  • Check token limits (e.g., GPT-3.5 can handle ~4,096 tokens).

  • Use tools like OpenAI Tokenizer to see how your text breaks down.

  • Minimize extra spaces and fluff to save token costs (a small counting sketch follows below).
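
A small sketch of that workflow with tiktoken. The 4,096 limit below is just the GPT-3.5 figure mentioned above; check your model's actual context window:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
TOKEN_LIMIT = 4096  # illustrative; use your model's real context window

messy = "   Hello!!!   Please   please   explain    tokenization   to me...   "
clean = " ".join(messy.split())  # drop the extra whitespace

for prompt in (messy, clean):
    n = len(enc.encode(prompt))
    print(f"{n:>3} tokens | within limit: {n <= TOKEN_LIMIT} | {prompt!r}")
```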

🚀 Summary

  • Tokenization = breaking text into small chunks.

  • It’s the first step before AI processes your text.

  • Different methods: word, character, subword.

  • Impacts speed, cost, and accuracy.

🎯 Final Thought

If you understand tokenization, you’ve taken the first real step into the AI world.
It’s like learning how to cut vegetables before cooking — simple but absolutely essential.


Written by

Md Noorullah Raza