🔍 Zero to Hero: Tokenization in NLP – From Basics to Subword Models

Rahul Kumar
7 min read

✨ “How can machines read language?”

That's the fundamental question behind tokenization, the process of breaking down raw text into manageable, machine-readable pieces. If you're on your journey to becoming a data scientist or working with Natural Language Processing (NLP), tokenization is one of the most critical concepts you’ll master.

In this blog, we’ll walk from the ground level of tokenization to the heights of modern subword algorithms like Byte Pair Encoding (BPE), WordPiece, and SentencePiece.


📘 What is Tokenization?

Tokenization is the process of converting raw text into smaller units called tokens. These can be words, characters, or subwords. Once tokenized, we convert these tokens into numbers (indices or embeddings) for processing by a neural network.

Let’s look at a naive tokenization example:

# Naive tokenizer: split the raw text on whitespace and map each word to an index
with open('example.txt', 'r') as f:
    text = f.read()
words = text.split()
tokens = {word: idx for idx, word in enumerate(words)}

This simply maps each word to an index. While this is straightforward, it's quite limited—it doesn’t account for punctuation, inflections, or compound words. We need more sophisticated techniques to truly empower machines to "read".


🤔 Why Do We Need Tokenization?

The question isn’t just how to make machines read, but how to make them understand. Raw text is not useful until we break it into comprehensible pieces.

  • Humans understand language through sound, meaning, and context.

  • Machines don’t. They only understand tokens, which are then encoded into vectors.

A good tokenizer allows a model to:

  • Handle infrequent and compound words

  • Work across multiple languages, even those with no clear word boundaries (like Chinese)

  • Generalize well beyond its training vocabulary


🧱 Types of Tokenization

Let’s explore several approaches to tokenization, each with its pros and cons.

1️⃣ Word-Based Tokenization

This is the classic way—just split the text on whitespace.

"let's go home" → ["let's", "go", "home"]

⚠️ Problems:

  • Fails to generalize to unseen words ("football" is not recognized as "foot" + "ball")

  • Huge vocabulary requirement

  • Can't handle slang, compound words, or languages without spaces

2️⃣ Character-Based Tokenization

Instead of words, break the text into characters:

"hello" → ["h", "e", "l", "l", "o"]

✅ Pros:

  • Handles unseen words effortlessly

  • Language-independent

❌ Cons:

  • Very long input sequences

  • Higher compute cost

  • No inherent semantics (semantics arise only after extensive learning)
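
To feel this trade-off, tokenize the same sentence both ways and compare sequence lengths. This is a throwaway illustration, not a real tokenizer:

sentence = "let's go home because unhappily it is raining"

# Word-level: short sequence, but every distinct word needs its own vocabulary entry
word_tokens = sentence.split()

# Character-level: tiny vocabulary, but a much longer input sequence
char_tokens = list(sentence)

print(len(word_tokens), "word tokens vs.", len(char_tokens), "character tokens")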


🧩 Subword Tokenization: The Best of Both Worlds

Subword tokenization is a middle ground—it breaks words into meaningful units, like prefixes, suffixes, or roots.

Example:
"unhappily" → ["un", "happ", "ily"]

This way, even unseen words can be processed if the model knows their subparts.


🔁 Byte Pair Encoding (BPE)

Originally a data compression algorithm, BPE is now a popular subword tokenization technique.

🔧 How BPE Works:

  1. Add a word-end marker (</w>) to each word.

  2. Split all words into characters.

  3. Count all adjacent character pairs.

  4. Merge the most frequent pair.

  5. Repeat until you hit a limit (iterations or vocabulary size).
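
These steps fit in a few lines of Python. The sketch below is the classic toy BPE learning loop (counting and merging over a word-frequency table), not a production tokenizer; the merge budget num_merges stands in for a real vocabulary-size target:

import re
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across all words, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every standalone occurrence of the pair with its merged symbol
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

corpus = ("There is an 80% chance of rainfall today. "
          "We are pretty sure it is going to rain.").lower().split()

# Each word becomes space-separated characters plus the end-of-word marker </w>
vocab = Counter(" ".join(word) + " </w>" for word in corpus)

num_merges = 15
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print("merged:", best)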

🧪 BPE Example:

Let’s say we have:

"There is an 80% chance of rainfall today. We are pretty sure it is going to rain."

Step-by-step:

  • Words → characters: "rain" → ["r", "a", "i", "n", "</w>"]

  • Count pairs: ("r", "a"), ("a", "i"), etc.

  • Merge the most frequent ones: maybe ("r", "a") → "ra"

  • Repeat...

Eventually:

"rainfall" → ["rain", "fall"]
"unhappily" → ["un", "happ", "ily"]

🧠 Pros:

  • Efficient

  • Reduces tokens

  • Vocabulary is controllable

😵‍💫 Cons:

  • Greedy algorithm (might not find the global optimum)

  • Results vary based on iteration count

  • Deterministic: there is no way to sample alternative segmentations (unlike unigram-LM tokenizers)


🔡 WordPiece: Merging by Likelihood, Not Frequency

Developed by Google for BERT, WordPiece builds on BPE but with a twist: it chooses merges based on how much they increase the likelihood of the training data, not just raw pair frequency.

📌 How it works:

  • Uses a language model objective to decide merges

  • Tokens include special markers like ## to indicate subwords

Example:
"unhappily"["un", "##happi", "##ly"]

✅ Benefits:

  • Better handling of rare/unknown words

  • More robust than BPE

  • Language-specific patterns can emerge
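
To see WordPiece in action without training anything, you can load BERT's pre-trained tokenizer. A sketch assuming the Hugging Face transformers package is installed (it downloads the bert-base-uncased vocabulary on first use); the exact splits depend on that learned vocabulary:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Word-internal pieces come back prefixed with ##
print(tokenizer.tokenize("unhappily"))
print(tokenizer.tokenize("tokenization"))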


🧠 SentencePiece: Tokenization Without Spaces

Developed by Google (again), SentencePiece is different—it doesn’t assume pre-tokenized input. Instead, it treats the entire text as a raw stream of characters, including whitespace.

"Hello world" → Could become ["_Hello", "_world"], where _ represents space.

🔍 Key Features:

  • Works for languages with or without spaces (like Chinese or Japanese)

  • Can use BPE or Unigram LM

  • No need for external preprocessing

💡 SentencePiece + Unigram Language Model

This model:

  • Builds a large vocabulary of candidate subwords

  • Uses likelihood-based pruning to select best tokens

  • Allows sampling of tokenizations → great for data augmentation
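
Here is a minimal sketch of that workflow with the sentencepiece Python package, assuming it is installed and that example.txt is a plain-text corpus file of your own:

import sentencepiece as spm

# Train a unigram-LM tokenizer directly on raw text; no pre-tokenization needed
spm.SentencePieceTrainer.train(
    input="example.txt",
    model_prefix="spm_unigram",
    vocab_size=8000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="spm_unigram.model")

# Deterministic encoding; pieces carry the ▁ whitespace marker
print(sp.encode("Hello world", out_type=str))

# Sampled encoding (subword regularization), handy for data augmentation
print(sp.encode("Hello world", out_type=str,
                enable_sampling=True, alpha=0.1, nbest_size=-1))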


🧠 Summary Table

| Tokenizer     | Handles Unknowns | Language Agnostic | Vocabulary Size | Robustness |
|---------------|------------------|-------------------|-----------------|------------|
| Word-Based    | ❌               | ❌                | 🔺 Huge         | 🔻 Low     |
| Char-Based    | ✅               | ✅                | 🔻 Small        | ❌ (semantics lost) |
| BPE           | ✅               | ✅                | ⚖️ Controlled   | ✅         |
| WordPiece     | ✅               | ✅                | ⚖️ Controlled   | ✅✅       |
| SentencePiece | ✅✅             | ✅✅              | ⚖️ Controlled   | ✅✅✅     |


🧙 Understanding Special Tokens in NLP

In modern NLP models—especially those based on Transformers like BERT, GPT, T5, etc.—you’ll often encounter special tokens. These tokens aren’t part of natural language, but are added to help the model understand context, sequence boundaries, and task-specific information.

Let’s go through the most common special tokens:


🔸 <PAD> – Padding Token

When batching sequences for training, not all sentences are of equal length. To ensure uniform input dimensions, we pad the shorter sequences with a special <PAD> token.

Original:     ["I", "like", "pizza"]
Padded:       ["I", "like", "pizza", "<PAD>", "<PAD>"]
  • Used for: Sequence alignment in batches

  • Ignored in attention mechanisms via attention masks

  • Value in embeddings: Usually mapped to a vector of zeros or a learned embedding
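
In plain Python, padding and the matching attention mask look like this (the token ids are made up for illustration; libraries such as Hugging Face tokenizers build both for you when you pass padding=True):

PAD_ID = 0
batch = [[5, 23, 87], [5, 9]]   # hypothetical token ids for two sentences

max_len = max(len(seq) for seq in batch)
input_ids = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]
attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]

print(input_ids)        # [[5, 23, 87], [5, 9, 0]]
print(attention_mask)   # [[1, 1, 1], [1, 1, 0]]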


🔸 <CLS> – Classification Token

This token is added at the beginning of a sentence in models like BERT.

Sentence: "Transformers are powerful."
Tokenized: ["<CLS>", "Transformers", "are", "powerful", ".", "<SEP>"]
  • The embedding corresponding to <CLS> is often used as the aggregated representation of the entire sequence.

  • Used for tasks like sentence classification, entailment, or sentiment analysis.

In BERT, the final hidden state of the <CLS> token is passed to a classifier head.
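
A short sketch of pulling that representation out of a pre-trained BERT encoder (assumes the transformers and torch packages; note that BERT's vocabulary actually spells the token [CLS]):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transformers are powerful.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Position 0 is the [CLS] token; its final hidden state summarizes the sequence
cls_embedding = outputs.last_hidden_state[:, 0]
print(cls_embedding.shape)   # (1, hidden_size), e.g. (1, 768) for bert-base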


🔸 <SEP> – Separator Token

This token is used to separate multiple sentences or segments within a single input.

Input: ["<CLS>", "Sentence A", "<SEP>", "Sentence B", "<SEP>"]
  • Used in tasks like:

    • Next Sentence Prediction

    • Question-Answering (where question and context are separated)

  • Helps the model distinguish between segments
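
When you hand a tokenizer a sentence pair, it inserts the separators for you. A sketch with BERT's tokenizer (Hugging Face transformers; BERT spells the tokens [CLS] and [SEP]):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Where is the Eiffel Tower?", "It is in Paris.")

# [CLS] question tokens [SEP] context tokens [SEP]
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))

# token_type_ids mark which segment each token belongs to (0 = first, 1 = second)
print(encoded["token_type_ids"])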


🔸 <MASK> – Masking Token

Specific to masked language modeling, as used in BERT. This token hides a word in the input so the model learns to predict it.

Input: "The sky is <MASK>."
  • Trains the model to predict missing or corrupted tokens

  • Encourages deeper contextual understanding
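
You can watch masked-token prediction at work with a pre-trained model. A sketch using the Hugging Face fill-mask pipeline (BERT spells the token [MASK]):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Each prediction carries the candidate token and the model's confidence
for prediction in fill_mask("The sky is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))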


🔸 <UNK> – Unknown Token

When a tokenizer encounters a word that isn't in its vocabulary and can’t be broken down into known subwords, it assigns <UNK>.

  • Appears in word-level tokenizers or poorly trained subword tokenizers

  • Subword tokenization methods like BPE or WordPiece aim to reduce <UNK> usage


🔸 <BOS> / <EOS> – Beginning/End of Sequence Tokens

These are used in sequence generation tasks like translation, summarization, or text generation.

  • <BOS> = Beginning of Sequence

  • <EOS> = End of Sequence

Input: ["<BOS>", "Hello", "world", "<EOS>"]
  • In models like GPT, generation stops when <EOS> is predicted.

  • In seq2seq models (like T5), these help mark input/output boundaries.
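
A small generation sketch with GPT-2, whose tokenizer uses a single <|endoftext|> token for both roles (assumes the transformers and torch packages; the parameter values are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hello world", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    eos_token_id=tokenizer.eos_token_id,   # stop as soon as the EOS token appears
    pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no pad token, so reuse EOS
)
print(tokenizer.decode(output_ids[0]))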


🧠 Summary Table: Special Tokens

| Token  | Meaning               | Use Case                              |
|--------|-----------------------|---------------------------------------|
| <PAD>  | Padding token         | Batch alignment; ignored by attention |
| <CLS>  | Classification token  | Sentence-level tasks in BERT          |
| <SEP>  | Separator token       | Sentence-pair tasks, QA               |
| <MASK> | Masking token         | Masked language modeling              |
| <UNK>  | Unknown token         | Out-of-vocabulary handling            |
| <BOS>  | Beginning of sequence | Text generation, decoding start       |
| <EOS>  | End of sequence       | Text generation, decoding end         |

💡 Pro Tip:

When using pre-trained models from Hugging Face Transformers or TensorFlow Hub, tokenizers automatically handle these special tokens for you. But when you build custom models or train from scratch, you need to define and manage them carefully.


🧪 Final Thoughts

Tokenization might seem like a simple preprocessing step—but it's actually fundamental to a model's performance. Whether you're building a chatbot or a translation system, understanding the nuances of tokenization will give your models the edge they need.

Choose your tokenizer wisely:

  • Want speed and control? Go with BPE.

  • Want precision for language models? WordPiece is your friend.

  • Want flexibility and multilingual support? SentencePiece is your hero.
