What is Tokenization? – A Beginner’s Guide for Freshers

Sabat Ali

Today I’m going to explain tokenization in the simplest way possible.
If you’re a fresher and have no idea what this term means, don’t worry: by the end of this article, you’ll know exactly what it is and why it matters.


1. Let’s Start with a Real-Life Example

Imagine you’re reading a sentence:

“I love playing cricket.”

Now, if I ask you to break this sentence into words, you will write:

I  
love  
playing  
cricket

That’s exactly what tokenization does — it breaks big text into smaller pieces called tokens.


2. What is a Token?

A token is just a small unit of text.
It could be:

  • A word → “love”

  • A part of a word → “play” from “playing”

  • A punctuation mark → “.” or “,”

Basically, tokens are the building blocks of any text for a computer.


3. Why Computers Need Tokenization

Humans can read full sentences easily.
But computers? They can’t just “read” text. They first need to break it into small pieces they can process.

Tokenization is like cutting a big cake into slices so it’s easier to eat.
Here, the “cake” is your text, and the “slices” are the tokens.
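
To make the “slices” idea concrete, here is a small sketch of what usually happens next: each token gets a number, because computers work with numbers rather than words. The tiny vocabulary below is built on the spot and is only an assumption for this example; real AI models use a fixed vocabulary prepared in advance.

```python
# Tokens from the example sentence (the full stop is kept as its own token).
tokens = ["I", "love", "playing", "cricket", "."]

# Build a tiny vocabulary: each new token gets the next free integer.
vocab = {}
for token in tokens:
    if token not in vocab:
        vocab[token] = len(vocab)

# Replace every token with its number.
token_ids = [vocab[token] for token in tokens]

print(vocab)      # {'I': 0, 'love': 1, 'playing': 2, 'cricket': 3, '.': 4}
print(token_ids)  # [0, 1, 2, 3, 4]
```

The numbers themselves carry no meaning here; they are just labels the computer can store and look up.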


4. How Tokenization Works

Let’s take the sentence:

AI is amazing!

After tokenization, it might look like this:

[ "AI", "is", "amazing", "!" ]

In some AI systems, it might even break “amazing” into smaller parts like:

[ "AI", "is", "amaz", "ing", "!" ]

Why? Because many AI models work with sub-words instead of full words: a small set of word pieces can be combined to cover rare or new words.
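
The first split above can be reproduced with a few lines of Python. This is only a minimal sketch using a regular expression; the function name `simple_tokenize` is made up for this example, and real AI systems use more advanced tokenizers.

```python
import re

def simple_tokenize(text):
    # \w+      matches a run of letters, digits, or underscores (a "word")
    # [^\w\s]  matches one character that is neither a word character
    #          nor whitespace (for example, a punctuation mark)
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("AI is amazing!"))  # ['AI', 'is', 'amazing', '!']
```

Notice that the exclamation mark becomes its own token, just like in the example.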


5. Types of Tokenization

a) Word Tokenization

Breaking text into words.
Example:

“I love cricket” → [ "I", "love", "cricket" ]
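
In Python, the simplest form of word tokenization is splitting on spaces. This sketch works for the example above, but note that punctuation would stay attached to the words, so it is a starting point rather than a complete tokenizer.

```python
sentence = "I love cricket"

# split() with no arguments cuts the text wherever there is whitespace.
word_tokens = sentence.split()

print(word_tokens)  # ['I', 'love', 'cricket']
```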

b) Subword Tokenization

Breaking words into smaller parts.
Example:

“Playing” → [ "Play", "ing" ]

This helps AI handle words it has never seen before, because a new word can be built from pieces it already knows.
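
Here is a toy sketch of the idea. It uses a "longest piece first" rule with a tiny hand-made vocabulary, which is an assumption for illustration only: real subword tokenizers learn their vocabulary from large amounts of text. The names `TOY_VOCAB` and `subword_tokenize` are made up for this example, and the word is lowercased first to keep things simple.

```python
# Toy vocabulary, invented for this example only.
TOY_VOCAB = {"play", "walk", "jump", "ing", "ed", "er"}

def subword_tokenize(word, vocab=TOY_VOCAB):
    word = word.lower()
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest possible piece first, then make it shorter.
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here, so fall back to one character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

print(subword_tokenize("Playing"))  # ['play', 'ing']
print(subword_tokenize("jumped"))   # ['jump', 'ed']
```

The word "jumped" is not in the vocabulary as a whole, yet it can still be split into two known pieces.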

c) Character Tokenization

Breaking text into single characters (letters, digits, and symbols).
Example:

“Cat” → [ "C", "a", "t" ]
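
Character tokenization is the easiest one to try in Python, because a string can be turned directly into a list of its characters:

```python
char_tokens = list("Cat")

print(char_tokens)  # ['C', 'a', 't']
```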


6. Why Tokenization is Important in AI & NLP

  • Search Engines → Find words quickly.

  • Chatbots → Understand what you mean.

  • Language Translation → Break text for accurate translation.

  • Speech-to-Text → Turn voice into text, which is then handled as tokens.

Without tokenization, AI would be like a person trying to read a full page with no spaces between words.


7. Tokenization in AI Models (Fun Fact)

If you use ChatGPT or any similar AI model, remember: every time you type, your text is tokenized before the AI reads it.
Most AI APIs even charge you based on the number of tokens.

Example:

  • “I love cricket” → about 3 tokens (the exact count depends on the model’s tokenizer)

  • Longer text = More tokens = More cost
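
The "more tokens, more cost" rule can be sketched as simple arithmetic. The price below is a made-up number for illustration only, not the rate of any real provider, and `estimate_cost` is a hypothetical helper name.

```python
# Hypothetical price, chosen only for this example.
# Always check your provider's pricing page for real numbers.
PRICE_PER_1000_TOKENS = 0.002

def estimate_cost(token_count, price_per_1000=PRICE_PER_1000_TOKENS):
    # Cost grows in direct proportion to the number of tokens.
    return token_count / 1000 * price_per_1000

short_text_cost = estimate_cost(3)      # a short sentence
long_text_cost = estimate_cost(1500)    # a much longer text

print(long_text_cost > short_text_cost)  # True
```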


8. Summary for Freshers

  • Tokenization = Splitting text into small parts (tokens).

  • Token = A word, part of a word, or a symbol.

  • Computers need tokenization to understand text.

  • Used in AI, chatbots, translation, search, and more.


In simple words:
Tokenization is like chopping your text into small pieces so the computer can “chew” and understand it better.
