Tokenization — Explained for a Fresher

Raghav Goel

If you’re just starting with AI and LLMs (Large Language Models), you’ll hear the term tokenization a lot.
Let’s break it down step by step.


What is Tokenization?

Large Language Models (LLMs) don’t directly “understand” human language the way we do.
Instead, they break down your input into tokens — small pieces of text — that they can process.

Think of it like how Python code gets translated into lower-level instructions before your computer can actually run it.
Tokenization is that first step in translating human language into something an LLM can work with.


Tokenization as a Chemical Equation

Imagine your sentence is like a chemical equation with multiple elements.

Example:
Your query → "Is MERN dead?"

The tokenizer (a preprocessing step that runs before the model) splits it into smaller parts (tokens):
Is + MERN + dead + ?

Just like a chemist references the periodic table for each element, tokenization references its own token dictionary (also called a vocabulary).
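
To make the split above concrete, here is a minimal sketch in Python. It assumes a very simple rule (words and punctuation marks become separate tokens), and the function name simple_tokenize is made up for this illustration. Real LLM tokenizers use learned sub-word rules and will often split text differently.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters/digits (a "word");
    # [^\w\s] matches a single punctuation character.
    # Whitespace is skipped. This is a toy rule, not how a real LLM tokenizer works.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Is MERN dead?"))  # ['Is', 'MERN', 'dead', '?']
```

The output has the same four pieces as the "equation" above: Is, MERN, dead and ?.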


The Token Dictionary

The token dictionary is like the periodic table of words and sub-words.

  • If a token exists in the dictionary, the model finds its ID number.

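  • If it doesn’t exist, the tokenizer breaks the word into smaller known pieces or, in some tokenizers, maps it to a special “unknown” token. New ID numbers are not invented on the fly, because the dictionary is fixed once the tokenizer has been built.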

Example (the ID numbers here are made up for illustration):

  • "Is" → 101

  • "MERN" → 269

  • "dead" → 151

  • "?" → 234

So "Is MERN dead?" becomes:
101 269 151 234

Now the LLM can process this numeric representation efficiently.
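
The lookup step can be sketched in a few lines of Python. This is only an illustration: TOY_VOCAB reuses the example IDs from this section, the encode function is a hypothetical helper, and the splitting rule is a simple word-and-punctuation regex rather than a real tokenizer.

```python
import re

# Toy dictionary using the example IDs from this article.
# Real vocabularies hold tens of thousands of entries with different IDs.
TOY_VOCAB = {"Is": 101, "MERN": 269, "dead": 151, "?": 234}

def encode(text, vocab):
    # Step 1: split the text into words and punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Step 2: look up each token's ID in the dictionary.
    # A token missing from the dictionary raises KeyError in this sketch.
    return [vocab[token] for token in tokens]

print(encode("Is MERN dead?", TOY_VOCAB))  # [101, 269, 151, 234]
```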


Why Tokenization Matters

  • It standardizes language for the model.

  • It handles rare or unknown words by breaking them into smaller parts.

  • It makes processing faster and more memory-efficient, because the model works with a fixed set of integer IDs instead of raw text.

Without sub-word tokenization, the dictionary would need a separate entry for every possible word and variation, which isn’t practical.
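
Here is a rough sketch of how a rare word can be broken into smaller known pieces. It assumes a greedy "longest piece first" rule and a tiny made-up dictionary (SUBWORD_VOCAB). The function name split_into_subwords and the "[UNK]" marker are illustrative choices as well. Production tokenizers such as BPE or WordPiece learn their pieces from data and differ in the details.

```python
def split_into_subwords(word, vocab, unknown="[UNK]"):
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Try the longest possible piece first, then shrink it
        # one character at a time until it is found in the dictionary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts at this position: give up on the word.
            return [unknown]
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical sub-word dictionary, for illustration only.
SUBWORD_VOCAB = {"token", "ization", "un", "break", "able"}

print(split_into_subwords("tokenization", SUBWORD_VOCAB))  # ['token', 'ization']
print(split_into_subwords("unbreakable", SUBWORD_VOCAB))   # ['un', 'break', 'able']
print(split_into_subwords("xyz", SUBWORD_VOCAB))           # ['[UNK]']
```

Notice that neither "tokenization" nor "unbreakable" is in the dictionary as a whole word, yet both can still be represented using pieces that are.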


Conclusion: Tokenization is like turning your sentence into a series of “chemical elements” that the AI can understand — with a dictionary acting as its periodic table.
