Explaining Tokenization For a Fresher

Kishore Gowda

What is Tokenization?

Tokenization is the process of breaking text into smaller pieces (tokens) that a computer can understand and work with.

Think of it like taking a sentence and cutting it into Lego blocks so you can store, count, or process them.
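Here's a minimal sketch of that idea in Python, using a naive whitespace split (real tokenizers use smarter rules, but the principle is the same):

```python
# Toy tokenization: cut a sentence into word tokens by splitting on spaces.
sentence = "The cat sat on the mat"
tokens = sentence.lower().split()
print(tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```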

Why do we need it?

Computers don't directly "understand" words; they understand numbers.
Tokenization provides a way to convert text to numbers and back.

For example:

"cat" → [3, 1, 20] [3, 1, 20] → "cat"

(These numbers are just example token IDs.)
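One way to reproduce exactly that round trip is a toy character-level tokenizer where each letter's ID is its position in the alphabet (a = 1, b = 2, ...). This numbering is just for illustration; real tokenizers learn their vocabularies from data:

```python
# Toy character-level tokenizer: each lowercase letter maps to its
# position in the alphabet (a=1, ..., z=26).
def encode(text):
    return [ord(ch) - ord("a") + 1 for ch in text]

def decode(ids):
    return "".join(chr(i + ord("a") - 1) for i in ids)

print(encode("cat"))       # [3, 1, 20]
print(decode([3, 1, 20]))  # cat
```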

Why is it important in AI/ML?

  • Models learn patterns in tokens, not raw text.

  • Tokenization controls vocabulary size (how many unique symbols you need to store); the sketch after this list shows the trade-off.

  • Better tokenization = faster models, smaller memory usage, better generalization.
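A rough sketch of that vocabulary-size trade-off, on a made-up toy corpus: character-level tokenization keeps the vocabulary tiny but produces long token sequences, while word-level tokenization shortens the sequences at the cost of a vocabulary that, on real text, can grow to hundreds of thousands of entries:

```python
# Compare two tokenization choices on the same toy corpus.
corpus = "the cat sat on the mat the cat ran"

char_tokens = list(corpus.replace(" ", ""))  # character-level
word_tokens = corpus.split()                 # word-level

print(len(set(char_tokens)), len(char_tokens))  # 10 unique symbols, 26 tokens
print(len(set(word_tokens)), len(word_tokens))  # 6 unique symbols, 9 tokens
```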
