Understanding Tokenization as a Fresher


What is Tokenization?
When a computer works with text, it can't directly understand sentences the way we do.
It needs to break the text into smaller pieces so it can process them step-by-step.
Those smaller pieces are called tokens.
Basically, tokenization is the process of splitting text into tokens.
What Do I Mean?
Example in Plain English
Think of a sentence:
I love samosas
When we tokenize it, we can break it up in more than one way.
First, word-level tokens:
["I", "love", "samosas"]
Now, character-level tokens:
["I", " ", "l", "o", "v", "e", " ", "s", "a", "m", "o", "s", "a", "s"]
Generally, in Machine Learning & AI, the tokenizer converts each token into a unique number (an ID) assigned to that exact token, because computers work far better with numbers than with raw text. Many tokenizers also normalize the input first (lowercasing it, for example), which reduces the confusion caused by inconsistent casing, and subword tokenizers can break a misspelled or unseen word into smaller known pieces instead of failing on it.
For example:
[46530, 4, 55530, 4, 82663]
Note: the IDs above come from Tea Tokenizer, a tool I built recently.
Link: https://teatokenizer.monc.space
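To get a feel for how the number mapping works, here is a toy sketch. The vocabulary and IDs below are made up for illustration; they have nothing to do with Tea Tokenizer's real vocabulary:

```python
# Toy vocabulary mapping tokens to IDs (the numbers are invented for this example).
vocab = {"I": 1, " ": 2, "love": 3, "samosas": 4}

def encode(tokens):
    # Look up each token's ID; unknown tokens get a reserved ID of 0.
    return [vocab.get(token, 0) for token in tokens]

tokens = ["I", " ", "love", " ", "samosas"]
print(encode(tokens))  # [1, 2, 3, 2, 4]
```

A real tokenizer works the same way, just with a vocabulary of tens of thousands of entries learned from huge amounts of text.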
Why is it Important?
Tokenization is like splitting a long message into smaller parts so the computer can read it one step at a time.
Without tokenization, the computer sees the entire sentence as one giant block of text and can’t figure out where words or parts of words start and end.
With tokenization, the text becomes small chunks (tokens) that the computer can store, search, and process efficiently.
In short, tokenization is cutting big text into small, meaningful chunks so a computer can handle it.
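In practice, you rarely write a tokenizer by hand; pretrained models ship with their own. If you have Hugging Face's transformers library installed, you can watch a production tokenizer split and number text like this (the exact pieces and IDs depend on the model's vocabulary, so your output may differ):

```python
# Requires: pip install transformers
# (downloads the model's vocabulary files on first run)
from transformers import AutoTokenizer

# Load the tokenizer that ships with a pretrained model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "I love samosas"
print(tokenizer.tokenize(sentence))  # subword pieces; rare words get split into chunks
print(tokenizer.encode(sentence))    # the matching token IDs, plus special tokens
```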