Tokenization Explained

Sanskar Agarwal
4 min read

Understanding Tokenization – The First Step in Talking to AI

If you’ve read my previous blogs about Vector Embeddings and GPT, you already know that AI doesn’t understand words the way we do. We think in sentences, words, and meaning. AI thinks in numbers. And Tokenization is the bridge between our words and AI’s numbers.

Let’s break it down from scratch.

What is Tokenization?

In the simplest sense: Tokenization is the process of converting text into smaller chunks (tokens) that the AI model can understand and process.

Those tokens are not always words. They can be:

Whole words (“cat”)

Sub-words (“un” + “happy”)

Individual characters (“c”, “a”, “t”)

Or even punctuation marks (“,”, “?”)

Once the text is split into tokens, each token is assigned a unique number (its ID in the model’s vocabulary). This is what the model actually works with.
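To make this concrete, here's a toy sketch in Python. The tiny hand-made vocabulary and the greedy matching loop are purely illustrative (real tokenizers learn their vocabularies from huge amounts of text), but the idea is the same: pieces of text go in, IDs come out.

# Toy illustration: split text into known pieces and map each piece to an ID.
# The vocabulary below is made up for this example.
toy_vocab = {"un": 1, "happy": 2, "cat": 3, "!": 4, "<unk>": 0}

def toy_tokenize(text):
    ids, i = [], 0
    while i < len(text):
        # greedily try the longest matching piece first
        for piece in sorted(toy_vocab, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(toy_vocab[piece])
                i += len(piece)
                break
        else:
            ids.append(toy_vocab["<unk>"])  # anything unknown (here: the space)
            i += 1
    return ids

print(toy_tokenize("unhappy cat!"))  # [1, 2, 0, 3, 4]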

Why Not Just Use Words?

Because words are tricky.

For example:

“Running” and “runner” are different words but have closely related meanings.

“New York” is two words but represents one concept.

Misspellings like “goooood” should still be somewhat understood.

If AI stored every possible word variation as a separate thing, the vocabulary size would explode — making the model slow and memory-heavy.

Tokenization solves this by breaking text into manageable, reusable pieces.

Example of Tokenization

Let’s take the sentence:

"Hello, how are you?"

Depending on the tokenizer:

  1. Word-level tokenizer:

Tokens: ["Hello", ",", "how", "are", "you", "?"]

6 tokens

  2. Character-level tokenizer:

Tokens: ["H", "e", "l", "l", "o", ",", " ", "h", ...]

19 tokens

  3. Byte Pair Encoding (BPE) tokenizer (used in GPT-like models):

Tokens: ["Hello", ",", " how", " are", " you", "?"]

Still 6 tokens here, but in larger texts, it merges frequent patterns instead of always splitting by spaces.
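The first two styles need only a couple of lines of plain Python, as in this small sketch (a real BPE tokenizer needs a learned vocabulary, so that one comes up in the next sections):

import re

sentence = "Hello, how are you?"

# Word-level: split into words and punctuation
word_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(word_tokens)       # ['Hello', ',', 'how', 'are', 'you', '?']  -> 6 tokens

# Character-level: every character (including spaces) is a token
char_tokens = list(sentence)
print(len(char_tokens))  # 19 tokens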

GPT and Tokenization

GPT doesn’t “read” words. It reads token IDs like:

"Hello, how are you?" → [15496, 11, 703, 389, 345, 30]

These numbers correspond to entries in its vocabulary. The model’s training and prediction process happens entirely on these numbers, not on the letters themselves.
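You can reproduce numbers like these with OpenAI's tiktoken library. This is only a sketch, and the exact IDs depend on which encoding you load; the GPT-2-style encoding below is the kind that yields IDs like the ones above.

# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")

ids = enc.encode("Hello, how are you?")
print(ids)              # e.g. [15496, 11, 703, 389, 345, 30]
print(enc.decode(ids))  # "Hello, how are you?"

# inspect the piece of text each ID stands for
print([enc.decode([i]) for i in ids])  # e.g. ['Hello', ',', ' how', ' are', ' you', '?']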

Tokenization and Context Window

Every model has a maximum context length (how many tokens it can consider at once).

For example:

GPT-3.5: ~4,096 tokens (~3,000 words)

GPT-4: up to ~128,000 tokens (~96,000 words) in some versions

If your input + output exceed this limit, older tokens “fall out” of context and the model forgets them.

This is why token counting is important — especially in production apps using APIs where cost is also calculated per token.
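Here's a rough sketch of what that check can look like with tiktoken. The 4,096-token budget and the 500 tokens reserved for the reply are example numbers picked for illustration, not fixed rules.

import tiktoken

MAX_CONTEXT = 4096          # e.g. a GPT-3.5-sized context window
RESERVED_FOR_OUTPUT = 500   # leave room for the model's reply

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def fits_in_context(prompt):
    n_tokens = len(enc.encode(prompt))
    print(f"Prompt uses {n_tokens} tokens")
    return n_tokens + RESERVED_FOR_OUTPUT <= MAX_CONTEXT

long_document = "This is a very long report. " * 2000
if not fits_in_context(long_document):
    print("Trim the input or split it into chunks before sending.")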

Tokenization Process in Steps

Here’s what happens behind the scenes:

  1. Text normalization – Convert the text into a consistent format (for example, lowercasing or cleaning up special characters; exactly what happens depends on the tokenizer).

  2. Splitting – Break text into pieces (depends on tokenizer algorithm).

  3. Mapping – Assign each token an ID from the vocabulary.

  4. Embedding lookup – Convert token IDs into vector embeddings so the model can work with them.
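Here's a minimal end-to-end sketch of those four steps, with a made-up vocabulary and a random embedding matrix (a real model learns both during training):

import re
import numpy as np

vocab = {"<unk>": 0, "hello": 1, ",": 2, "how": 3, "are": 4, "you": 5, "?": 6}
embedding_matrix = np.random.rand(len(vocab), 8)   # one 8-dimensional vector per token

text = "Hello, how are you?"

# 1. Normalization
normalized = text.lower()

# 2. Splitting (a very naive word/punctuation split)
pieces = re.findall(r"\w+|[^\w\s]", normalized)

# 3. Mapping to IDs
ids = [vocab.get(p, vocab["<unk>"]) for p in pieces]

# 4. Embedding lookup
vectors = embedding_matrix[ids]

print(pieces)         # ['hello', ',', 'how', 'are', 'you', '?']
print(ids)            # [1, 2, 3, 4, 5, 6]
print(vectors.shape)  # (6, 8)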

Algorithms Used

Common tokenization algorithms in NLP include:

Word-level tokenization – Fast, but large vocabulary size.

Character-level tokenization – Smaller vocabulary, but sequences are longer.

Byte Pair Encoding (BPE) – Finds frequently used character combinations and merges them. Used by GPT.

SentencePiece – Works directly on raw text, so it handles languages without spaces (like Chinese and Japanese) well.

For GPT models, BPE is the standard choice because it balances vocabulary size and token length efficiently.
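To see the core idea behind BPE, here's a toy sketch that repeatedly finds the most frequent adjacent pair of symbols in a tiny word list and merges it into a new symbol. Real BPE (as used for GPT) runs over bytes and a massive corpus, but the merging loop is the same idea.

from collections import Counter

corpus = ["low", "lower", "lowest", "newest", "widest"]
words = [list(w) for w in corpus]   # start with individual characters

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    a, b = pair
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                out.append(a + b)   # merge the pair into one symbol
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

for step in range(5):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(f"merge {step + 1}: {pair} -> {words}")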

Why Tokenization Matters in Real Applications

Model efficiency – Fewer tokens mean faster processing and lower cost.

Cross-language handling – Tokenization helps in processing multiple languages with one model.

Handling unknown words – New slang or typos can be broken down into known sub-tokens (there's a quick sketch of this right after the list).

Search and retrieval – Tokenized text can be indexed more effectively.
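As a quick illustration of the unknown-words point, here's how a made-up spelling gets split into pieces the vocabulary already knows (a sketch using tiktoken's cl100k_base encoding; the exact split depends on the vocabulary):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("goooood")
print([enc.decode([i]) for i in ids])  # e.g. ['go', 'oo', 'ood'] (all known sub-tokens)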

Visualizing Tokens

If we plotted tokens in a high-dimensional space (using embeddings), similar tokens would cluster together:

“dog”, “dogs”, “puppy” would be close

“apple” (fruit) and “banana” would be close

“apple” (company) would be far from “banana” but closer to “Google”

This is where tokenization and vector embeddings meet — tokenization splits the text, embeddings give meaning to each token.
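If you want a feel for how that "closeness" is usually measured, here's a toy example using cosine similarity on hand-written 4-dimensional vectors. Real embeddings are learned and have hundreds or thousands of dimensions; these numbers are invented purely for illustration.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# made-up vectors, just to show the mechanics
dog   = np.array([0.9, 0.8, 0.1, 0.0])
puppy = np.array([0.85, 0.75, 0.2, 0.05])
apple = np.array([0.1, 0.0, 0.9, 0.8])

print(cosine(dog, puppy))  # high  -> close together in the embedding space
print(cosine(dog, apple))  # lower -> further apart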

Final Thoughts

Tokenization is like breaking a Lego model into individual bricks before building something new. Without it, an AI model can’t understand or generate text effectively.

Next time you chat with ChatGPT, remember: Before your words reach the model, they’ve already been chopped into tokens, turned into numbers, and mapped into vectors — all in milliseconds.

Pro tip if you’re building with AI APIs: Always check token counts before sending large text. It’ll save you both money and headaches.

