Understanding Tokenization as a Fresher


What is Tokenization?
When a computer works with text, it can't directly understand sentences the way we do.
It needs to break the text into smaller pieces so it can process them step-by-step.
Those smaller pieces are called tokens.
Basically, tokenization is the process of splitting text into tokens.
What Do I Mean?
Example in Plain English
Think of a sentence:
I love samosas
When we tokenize it, we can break it up in more than one way.
First, word-level tokens:
["I", "love", "samosas"]
Now, character-level tokens:
["I", " ", "l", "o", "v", "e", " ", "s", "a", "m", "o", "s", "a", "s"]
Generally, in Machine Learning & AI, the tokenizer converts each token into a unique number (an ID) assigned to that exact token, because computers work far better with numbers than with raw text. Many tokenizers also normalize the input first (lowercasing it, for example), which reduces the confusion caused by inconsistent casing, and subword tokenizers can break a misspelled or unseen word into smaller known pieces instead of failing on it.
For example:
[46530, 4, 55530, 4, 82663]
Note: the IDs above come from Tea Tokenizer, a tool I built recently.
Link: https://teatokenizer.monc.space
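To get a feel for how the number mapping works, here is a toy sketch. The vocabulary and IDs below are made up for illustration; they have nothing to do with Tea Tokenizer's real vocabulary:

```python
# Toy vocabulary mapping tokens to IDs (the numbers are invented for this example).
vocab = {"I": 1, " ": 2, "love": 3, "samosas": 4}

def encode(tokens):
    # Look up each token's ID; unknown tokens get a reserved ID of 0.
    return [vocab.get(token, 0) for token in tokens]

tokens = ["I", " ", "love", " ", "samosas"]
print(encode(tokens))  # [1, 2, 3, 2, 4]
```

A real tokenizer works the same way, just with a vocabulary of tens of thousands of entries learned from huge amounts of text.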
Why is it Important?
Tokenization is like splitting a long message into smaller parts so the computer can read it one step at a time.
Without tokenization, the computer sees the entire sentence as one giant block of text and can’t figure out where words or parts of words start and end.
With tokenization, the text becomes small chunks (tokens) that the computer can store, search, and process efficiently.
In short, tokenization is cutting big text into small, meaningful chunks so a computer can handle it.
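In practice, you rarely write a tokenizer by hand; pretrained models ship with their own. If you have Hugging Face's transformers library installed, you can watch a production tokenizer split and number text like this (the exact pieces and IDs depend on the model's vocabulary, so your output may differ):

```python
# Requires: pip install transformers
# (downloads the model's vocabulary files on first run)
from transformers import AutoTokenizer

# Load the tokenizer that ships with a pretrained model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "I love samosas"
print(tokenizer.tokenize(sentence))  # subword pieces; rare words get split into chunks
print(tokenizer.encode(sentence))    # the matching token IDs, plus special tokens
```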