🔍 Zero to Hero: Tokenization in NLP – From Basics to Subword Models


✨ “How can machines read language?”
That's the fundamental question behind tokenization, the process of breaking down raw text into manageable, machine-readable pieces. If you're on your journey to becoming a data scientist or working with Natural Language Processing (NLP), tokenization is one of the most critical concepts you’ll master.
In this blog, we’ll walk from the ground level of tokenization to the heights of modern subword algorithms like Byte Pair Encoding (BPE), WordPiece, and SentencePiece.
📘 What is Tokenization?
Tokenization is the process of converting raw text into smaller units called tokens. These can be words, characters, or subwords. Once tokenized, we convert these tokens into numbers (indices or embeddings) for processing by a neural network.
Let’s look at a naive tokenization example:
# Read the raw text and split it on spaces: the most naive tokenizer possible.
with open('example.txt', 'r') as f:
    text = f.read()
words = text.split(" ")
# Map each unique word to an integer index (duplicate words keep their last index).
tokens = {word: index for index, word in enumerate(words)}
This simply maps each word to an index. While this is straightforward, it's quite limited—it doesn’t account for punctuation, inflections, or compound words. We need more sophisticated techniques to truly empower machines to "read".
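For instance, a slightly less naive pass can at least separate punctuation from words with a regular expression. Here is a minimal sketch (the pattern below is just one reasonable choice, not a standard):
import re

text = "Let's go home. It's raining, isn't it?"
# Keep contractions together, split everything else into words or single punctuation marks.
tokens = re.findall(r"\w+'\w+|\w+|[^\w\s]", text)
print(tokens)
# ["Let's", 'go', 'home', '.', "It's", 'raining', ',', "isn't", 'it', '?']
# Build a vocabulary over the unique tokens, preserving first-seen order.
vocab = {tok: idx for idx, tok in enumerate(dict.fromkeys(tokens))}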
🤔 Why Do We Need Tokenization?
The question isn’t just how to make machines read, but how to make them understand. Raw text is not useful until we break it into comprehensible pieces.
Humans understand language through sound, meaning, and context.
Machines don’t. They only understand tokens, which are then encoded into vectors.
A good tokenizer allows a model to:
Handle infrequent and compound words
Work across multiple languages, even those with no clear word boundaries (like Chinese)
Generalize well beyond its training vocabulary
🧱 Types of Tokenization
Let’s explore several approaches to tokenization, each with its pros and cons.
1️⃣ Word-Based Tokenization
This is the classic way—just split the text on whitespace.
"let's go home" → ["let's", "go", "home"]
⚠️ Problems:
Fails to generalize to unseen words (football ≠ foot + ball)
Requires a huge vocabulary
Can't handle slang, compound words, or languages without spaces
2️⃣ Character-Based Tokenization
Instead of words, break the text into characters:
"hello" → ["h", "e", "l", "l", "o"]
✅ Pros:
Handles unseen words effortlessly
Language-independent
❌ Cons:
Very long input sequences
Higher compute cost
No inherent semantics (semantics arise only after extensive learning)
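To make the sequence-length cost concrete, compare both schemes on the same sentence (a trivial illustration):
sentence = "character tokenization makes sequences much longer"
word_tokens = sentence.split(" ")
char_tokens = list(sentence)
print(len(word_tokens))   # 6 tokens at the word level
print(len(char_tokens))   # 50 tokens at the character level, spaces included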
🧩 Subword Tokenization: The Best of Both Worlds
Subword tokenization is a middle ground—it breaks words into meaningful units, like prefixes, suffixes, or roots.
Example:
"unhappily"
→ ["un", "happ", "ily"]
This way, even unseen words can be processed if the model knows their subparts.
🔁 Byte Pair Encoding (BPE)
Originally a data compression algorithm, BPE is now a popular subword tokenization technique.
🔧 How BPE Works:
1. Add a word-end marker (</w>) to each word.
2. Split every word into its characters.
3. Count all adjacent symbol pairs.
4. Merge the most frequent pair into a new symbol.
5. Repeat until you hit a limit (number of merges or target vocabulary size).
🧪 BPE Example:
Let’s say we have:
"There is an 80% chance of rainfall today. We are pretty sure it is going to rain."
Step-by-step:
1. Words → characters: "rain" → ["r", "a", "i", "n", "</w>"]
2. Count pairs: ("r", "a"), ("a", "i"), etc.
3. Merge the most frequent one: maybe ("r", "a") → "ra"
4. Repeat...
Eventually:
"rainfall" → ["rain", "fall"]
"unhappily" → ["un", "happ", "ily"]
🧠 Pros:
Efficient
Produces far shorter sequences than character-level tokenization
Vocabulary size is controllable
😵💫 Cons:
Greedy algorithm (merges are locally, not globally, optimal)
Results depend on the number of merge iterations
Deterministic: a word always gets the same segmentation, with no sampling of alternatives
🔡 WordPiece: Likelihood-Driven Merges
Developed at Google and popularized by BERT, WordPiece builds on BPE with a twist: it chooses merges based on how much they increase the likelihood of the training data under a language model, not just on raw pair frequency.
📌 How it works:
Uses a language model objective to decide which merge to make at each step
Marks non-initial subword pieces with a special ## prefix
Example:
"unhappily" → ["un", "##happi", "##ly"]
✅ Benefits:
Better handling of rare/unknown words
More robust than BPE
Language-specific patterns can emerge
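The training objective above is what distinguishes WordPiece, but at tokenization time the learned vocabulary is usually applied with a greedy longest-match-first scan. A minimal sketch, using a tiny made-up vocabulary just to show the mechanics:
def wordpiece_tokenize(word, vocab, unk_token="<UNK>"):
    # Greedy longest-match-first: repeatedly take the longest prefix found in the
    # vocabulary, marking non-initial pieces with "##".
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]   # no segmentation possible for this word
        tokens.append(piece)
        start = end
    return tokens

# A made-up vocabulary, just for illustration.
toy_vocab = {"un", "##happi", "##ly", "happy"}
print(wordpiece_tokenize("unhappily", toy_vocab))   # ['un', '##happi', '##ly']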
🧠 SentencePiece: Tokenization Without Spaces
Developed by Google (again), SentencePiece is different—it doesn’t assume pre-tokenized input. Instead, it treats the entire text as a raw stream of characters, including whitespace.
"Hello world" → Could become
["_Hello", "_world"]
, where_
represents space.
🔍 Key Features:
Works for languages with or without spaces (like Chinese or Japanese)
Can use BPE or Unigram LM
No need for external preprocessing
💡 SentencePiece + Unigram Language Model
This model:
Builds a large vocabulary of candidate subwords
Uses likelihood-based pruning to select the best tokens
Allows sampling of tokenizations → great for data augmentation
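If you want to try this, the sentencepiece Python package exposes both training and sampling. A rough sketch, assuming a plain-text file corpus.txt and an arbitrary vocabulary size:
import sentencepiece as spm

# Train a unigram-LM SentencePiece model on raw text (no pre-tokenization needed).
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="spm_unigram",
    vocab_size=8000, model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="spm_unigram.model")
print(sp.encode("Hello world", out_type=str))          # one deterministic segmentation
# Sampling alternative segmentations is what enables subword regularization.
print(sp.encode("Hello world", out_type=str,
                enable_sampling=True, alpha=0.1, nbest_size=-1))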
🧠 Summary Table
Tokenizer | Handles Unknowns | Language Agnostic | Vocabulary Size | Robustness
--- | --- | --- | --- | ---
Word-Based | ❌ | ❌ | 🔺 Huge | 🔻 Low
Char-Based | ✅ | ✅ | 🔻 Small | ❌ (semantics lost)
BPE | ✅ | ✅ | ⚖️ Controlled | ✅
WordPiece | ✅ | ✅ | ⚖️ Controlled | ✅✅
SentencePiece | ✅✅ | ✅✅ | ⚖️ Controlled | ✅✅✅
🧙 Understanding Special Tokens in NLP
In modern NLP models—especially those based on Transformers like BERT, GPT, T5, etc.—you’ll often encounter special tokens. These tokens aren’t part of natural language, but are added to help the model understand context, sequence boundaries, and task-specific information.
Let’s go through the most common special tokens:
🔸 <PAD> – Padding Token
When batching sequences for training, not all sentences are of equal length. To ensure uniform input dimensions, we pad the shorter sequences with a special <PAD> token.
Original: ["I", "like", "pizza"]
Padded: ["I", "like", "pizza", "<PAD>", "<PAD>"]
Used for: Sequence alignment in batches
Ignored in attention mechanisms via attention masks
Value in embeddings: Usually mapped to a vector of zeros or a learned embedding
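A minimal sketch of how that usually looks in practice (pad_batch is a hypothetical helper, not a library function):
def pad_batch(batch, pad_token="<PAD>"):
    # Pad every sequence to the length of the longest one and build an attention mask
    # (1 = real token, 0 = padding) so the model can ignore the padded positions.
    max_len = max(len(seq) for seq in batch)
    padded, masks = [], []
    for seq in batch:
        pad_amount = max_len - len(seq)
        padded.append(seq + [pad_token] * pad_amount)
        masks.append([1] * len(seq) + [0] * pad_amount)
    return padded, masks

batch = [["I", "like", "pizza"], ["I", "like", "pizza", "with", "cheese"]]
padded, masks = pad_batch(batch)
print(padded[0])   # ['I', 'like', 'pizza', '<PAD>', '<PAD>']
print(masks[0])    # [1, 1, 1, 0, 0]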
🔸 <CLS> – Classification Token
This token is added at the beginning of a sentence in models like BERT.
Sentence: "Transformers are powerful."
Tokenized: ["<CLS>", "Transformers", "are", "powerful", ".", "<SEP>"]
The embedding corresponding to <CLS> is often used as the aggregated representation of the entire sequence.
Used for tasks like sentence classification, entailment, or sentiment analysis.
In BERT, the final hidden state of the <CLS> token is passed to a classifier head.
🔸 <SEP> – Separator Token
This token is used to separate multiple sentences or segments within a single input.
Input: ["<CLS>", "Sentence A", "<SEP>", "Sentence B", "<SEP>"]
Used in tasks like:
Next Sentence Prediction
Question-Answering (where question and context are separated)
Helps the model distinguish between segments
🔸 <MASK> – Masking Token
Specific to masked language modeling, as used in BERT. This token hides a word in the input so the model learns to predict it.
Input: "The sky is <MASK>."
Trains the model to predict missing or corrupted tokens
Encourages deeper contextual understanding
🔸 <UNK> – Unknown Token
When a tokenizer encounters a word that isn't in its vocabulary and can't be broken down into known subwords, it assigns <UNK>.
Appears in word-level tokenizers or poorly trained subword tokenizers
Subword tokenization methods like BPE or WordPiece aim to reduce <UNK> usage
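For a plain word-level tokenizer, the fallback is just a dictionary lookup with a default (toy vocabulary for illustration):
vocab = {"<PAD>": 0, "<UNK>": 1, "i": 2, "like": 3, "pizza": 4}
unk_id = vocab["<UNK>"]

sentence = ["i", "like", "sushi"]
# Any word missing from the vocabulary falls back to the <UNK> id.
ids = [vocab.get(word, unk_id) for word in sentence]
print(ids)   # [2, 3, 1]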
🔸 <BOS> / <EOS> – Beginning/End of Sequence Tokens
These are used in sequence generation tasks like translation, summarization, or text generation.
<BOS> = Beginning of Sequence
<EOS> = End of Sequence
Input: ["<BOS>", "Hello", "world", "<EOS>"]
In models like GPT, generation stops when <EOS> is predicted.
In seq2seq models (like T5), these help mark input/output boundaries.
🧠 Summary Table: Special Tokens
Token | Meaning | Use Case
--- | --- | ---
<PAD> | Padding token | Batch alignment, ignored by attention
<CLS> | Classification token | Sentence-level tasks in BERT
<SEP> | Separator token | Sentence pair tasks, QA
<MASK> | Masking token | Masked language modeling
<UNK> | Unknown token | Out-of-vocabulary handling
<BOS> | Beginning of sequence | Text generation, decoding start
<EOS> | End of sequence | Text generation, decoding end
💡 Pro Tip:
When using pre-trained models from Hugging Face Transformers or TensorFlow Hub, tokenizers automatically handle these special tokens for you. But when you build custom models or train from scratch, you need to define and manage them carefully.
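For example, with Hugging Face Transformers (the checkpoint name below is only an illustration), the [CLS] and [SEP] tokens are inserted for you:
from transformers import AutoTokenizer

# Any BERT-style checkpoint works here; the model name is just an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("Transformers are powerful.", "They scale well.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# The output starts with '[CLS]' and each segment ends with '[SEP]';
# the word pieces in between depend on the model's vocabulary.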
🧪 Final Thoughts
Tokenization might seem like a simple preprocessing step—but it's actually fundamental to a model's performance. Whether you're building a chatbot or a translation system, understanding the nuances of tokenization will give your models the edge they need.
Choose your tokenizer wisely:
Want speed and control? Go with BPE.
Want precision for language models? WordPiece is your friend.
Want flexibility and multilingual support? SentencePiece is your hero.