🔍 Zero to Hero: Tokenization in NLP – From Basics to Subword Models

Rahul Kumar
7 min read

✨ “How can machines read language?”

That's the fundamental question behind tokenization, the process of breaking down raw text into manageable, machine-readable pieces. If you're on your journey to becoming a data scientist or working with Natural Language Processing (NLP), tokenization is one of the most critical concepts you’ll master.

In this blog, we’ll walk from the ground level of tokenization to the heights of modern subword algorithms like Byte Pair Encoding (BPE), WordPiece, and SentencePiece.


📘 What is Tokenization?

Tokenization is the process of converting raw text into smaller units called tokens. These can be words, characters, or subwords. Once tokenized, we convert these tokens into numbers (indices or embeddings) for processing by a neural network.

Let’s look at a naive tokenization example:

# Naive tokenizer: split the raw text on whitespace and map each word to an index
with open('example.txt', 'r') as f:
    text = f.read()
words = text.split()
tokens = {word: idx for idx, word in enumerate(words)}

This simply maps each word to an index. While this is straightforward, it's quite limited—it doesn’t account for punctuation, inflections, or compound words. We need more sophisticated techniques to truly empower machines to "read".


🤔 Why Do We Need Tokenization?

The question isn’t just how to make machines read, but how to make them understand. Raw text is not useful until we break it into comprehensible pieces.

  • Humans understand language through sound, meaning, and context.

  • Machines don’t. They only understand tokens, which are then encoded into vectors.

A good tokenizer allows a model to:

  • Handle infrequent and compound words

  • Work across multiple languages, even those with no clear word boundaries (like Chinese)

  • Generalize well beyond its training vocabulary


🧱 Types of Tokenization

Let’s explore several approaches to tokenization, each with its pros and cons.

1️⃣ Word-Based Tokenization

This is the classic way—just split the text on whitespace.

"let's go home" → ["let's", "go", "home"]

⚠️ Problems:

  • Fails to generalize to unseen words ("football" is not recognized as "foot" + "ball")

  • Huge vocabulary requirement

  • Can't handle slang, compound words, or languages without spaces

2️⃣ Character-Based Tokenization

Instead of words, break the text into characters:

"hello" → ["h", "e", "l", "l", "o"]

✅ Pros:

  • Handles unseen words effortlessly

  • Language-independent

❌ Cons:

  • Very long input sequences

  • Higher compute cost

  • No inherent semantics (semantics arise only after extensive learning)
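
To feel this trade-off, tokenize the same sentence both ways and compare sequence lengths. This is a throwaway illustration, not a real tokenizer:

sentence = "let's go home because unhappily it is raining"

# Word-level: short sequence, but every distinct word needs its own vocabulary entry
word_tokens = sentence.split()

# Character-level: tiny vocabulary, but a much longer input sequence
char_tokens = list(sentence)

print(len(word_tokens), "word tokens vs.", len(char_tokens), "character tokens")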


🧩 Subword Tokenization: The Best of Both Worlds

Subword tokenization is a middle ground—it breaks words into meaningful units, like prefixes, suffixes, or roots.

Example:
"unhappily" → ["un", "happ", "ily"]

This way, even unseen words can be processed if the model knows their subparts.


🔁 Byte Pair Encoding (BPE)

Originally a data compression algorithm, BPE is now a popular subword tokenization technique.

🔧 How BPE Works:

  1. Add a word-end marker (</w>) to each word.

  2. Split all words into characters.

  3. Count all adjacent character pairs.

  4. Merge the most frequent pair.

  5. Repeat until you hit a limit (iterations or vocabulary size).
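
These steps fit in a few lines of Python. The sketch below is the classic toy BPE learning loop (counting and merging over a word-frequency table), not a production tokenizer; the merge budget num_merges stands in for a real vocabulary-size target:

import re
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across all words, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every standalone occurrence of the pair with its merged symbol
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

corpus = ("There is an 80% chance of rainfall today. "
          "We are pretty sure it is going to rain.").lower().split()

# Each word becomes space-separated characters plus the end-of-word marker </w>
vocab = Counter(" ".join(word) + " </w>" for word in corpus)

num_merges = 15
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print("merged:", best)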

🧪 BPE Example:

Let’s say we have:

"There is an 80% chance of rainfall today. We are pretty sure it is going to rain."

Step-by-step:

  • Words → characters: "rain" → ["r", "a", "i", "n", "</w>"]

  • Count pairs: ("r", "a"), ("a", "i"), etc.

  • Merge the most frequent ones: maybe ("r", "a") → "ra"

  • Repeat...

Eventually:

"rainfall" → ["rain", "fall"]
"unhappily" → ["un", "happ", "ily"]

🧠 Pros:

  • Efficient

  • Reduces tokens

  • Vocabulary is controllable

😵‍💫 Cons:

  • Greedy algorithm (might not find the global optimum)

  • Results vary based on iteration count

  • Deterministic: there is no way to sample alternative segmentations (unlike unigram-LM tokenizers)


🔡 WordPiece: Merging by Likelihood, Not Frequency

Developed by Google for BERT, WordPiece builds on BPE but with a twist: it chooses merges based on how much they increase the likelihood of the training data, not just raw pair frequency.

📌 How it works:

  • Uses a language model objective to decide merges

  • Tokens include special markers like ## to indicate subwords

Example:
"unhappily"["un", "##happi", "##ly"]

✅ Benefits:

  • Better handling of rare/unknown words

  • More robust than BPE

  • Language-specific patterns can emerge
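
To see WordPiece in action without training anything, you can load BERT's pre-trained tokenizer. A sketch assuming the Hugging Face transformers package is installed (it downloads the bert-base-uncased vocabulary on first use); the exact splits depend on that learned vocabulary:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Word-internal pieces come back prefixed with ##
print(tokenizer.tokenize("unhappily"))
print(tokenizer.tokenize("tokenization"))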


🧠 SentencePiece: Tokenization Without Spaces

Developed by Google (again), SentencePiece is different—it doesn’t assume pre-tokenized input. Instead, it treats the entire text as a raw stream of characters, including whitespace.

"Hello world" → Could become ["_Hello", "_world"], where _ represents space.

🔍 Key Features:

  • Works for languages with or without spaces (like Chinese or Japanese)

  • Can use BPE or Unigram LM

  • No need for external preprocessing

💡 SentencePiece + Unigram Language Model

This model:

  • Builds a large vocabulary of candidate subwords

  • Uses likelihood-based pruning to select best tokens

  • Allows sampling of tokenizations → great for data augmentation
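
Here is a minimal sketch of that workflow with the sentencepiece Python package, assuming it is installed and that example.txt is a plain-text corpus file of your own:

import sentencepiece as spm

# Train a unigram-LM tokenizer directly on raw text; no pre-tokenization needed
spm.SentencePieceTrainer.train(
    input="example.txt",
    model_prefix="spm_unigram",
    vocab_size=8000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="spm_unigram.model")

# Deterministic encoding; pieces carry the ▁ whitespace marker
print(sp.encode("Hello world", out_type=str))

# Sampled encoding (subword regularization), handy for data augmentation
print(sp.encode("Hello world", out_type=str,
                enable_sampling=True, alpha=0.1, nbest_size=-1))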


🧠 Summary Table

| Tokenizer     | Handles Unknowns | Language Agnostic | Vocabulary Size | Robustness |
|---------------|------------------|-------------------|-----------------|------------|
| Word-Based    | ❌               | ❌                | 🔺 Huge         | 🔻 Low     |
| Char-Based    | ✅               | ✅                | 🔻 Small        | ❌ (semantics lost) |
| BPE           | ✅               | ✅                | ⚖️ Controlled   | ✅         |
| WordPiece     | ✅               | ✅                | ⚖️ Controlled   | ✅✅       |
| SentencePiece | ✅✅             | ✅✅              | ⚖️ Controlled   | ✅✅✅     |


🧙 Understanding Special Tokens in NLP

In modern NLP models—especially those based on Transformers like BERT, GPT, T5, etc.—you’ll often encounter special tokens. These tokens aren’t part of natural language, but are added to help the model understand context, sequence boundaries, and task-specific information.

Let’s go through the most common special tokens:


🔸 <PAD> – Padding Token

When batching sequences for training, not all sentences are of equal length. To ensure uniform input dimensions, we pad the shorter sequences with a special <PAD> token.

Original:     ["I", "like", "pizza"]
Padded:       ["I", "like", "pizza", "<PAD>", "<PAD>"]
  • Used for: Sequence alignment in batches

  • Ignored in attention mechanisms via attention masks

  • Value in embeddings: Usually mapped to a vector of zeros or a learned embedding
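
In plain Python, padding and the matching attention mask look like this (the token ids are made up for illustration; libraries such as Hugging Face tokenizers build both for you when you pass padding=True):

PAD_ID = 0
batch = [[5, 23, 87], [5, 9]]   # hypothetical token ids for two sentences

max_len = max(len(seq) for seq in batch)
input_ids = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]
attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]

print(input_ids)        # [[5, 23, 87], [5, 9, 0]]
print(attention_mask)   # [[1, 1, 1], [1, 1, 0]]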


🔸 <CLS> – Classification Token

This token is added at the beginning of a sentence in models like BERT.

Sentence: "Transformers are powerful."
Tokenized: ["<CLS>", "Transformers", "are", "powerful", ".", "<SEP>"]
  • The embedding corresponding to <CLS> is often used as the aggregated representation of the entire sequence.

  • Used for tasks like sentence classification, entailment, or sentiment analysis.

In BERT, the final hidden state of the <CLS> token is passed to a classifier head.
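
A short sketch of pulling that representation out of a pre-trained BERT encoder (assumes the transformers and torch packages; note that BERT's vocabulary actually spells the token [CLS]):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transformers are powerful.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Position 0 is the [CLS] token; its final hidden state summarizes the sequence
cls_embedding = outputs.last_hidden_state[:, 0]
print(cls_embedding.shape)   # (1, hidden_size), e.g. (1, 768) for bert-base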


🔸 <SEP> – Separator Token

This token is used to separate multiple sentences or segments within a single input.

Input: ["<CLS>", "Sentence A", "<SEP>", "Sentence B", "<SEP>"]
  • Used in tasks like:

    • Next Sentence Prediction

    • Question-Answering (where question and context are separated)

  • Helps the model distinguish between segments
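
When you hand a tokenizer a sentence pair, it inserts the separators for you. A sketch with BERT's tokenizer (Hugging Face transformers; BERT spells the tokens [CLS] and [SEP]):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Where is the Eiffel Tower?", "It is in Paris.")

# [CLS] question tokens [SEP] context tokens [SEP]
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))

# token_type_ids mark which segment each token belongs to (0 = first, 1 = second)
print(encoded["token_type_ids"])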


🔸 <MASK> – Masking Token

Specific to masked language modeling, as used in BERT. This token hides a word in the input so the model learns to predict it.

Input: "The sky is <MASK>."
  • Trains the model to predict missing or corrupted tokens

  • Encourages deeper contextual understanding
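
You can watch masked-token prediction at work with a pre-trained model. A sketch using the Hugging Face fill-mask pipeline (BERT spells the token [MASK]):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Each prediction carries the candidate token and the model's confidence
for prediction in fill_mask("The sky is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))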


🔸 <UNK> – Unknown Token

When a tokenizer encounters a word that isn't in its vocabulary and can’t be broken down into known subwords, it assigns <UNK>.

  • Appears in word-level tokenizers or poorly trained subword tokenizers

  • Subword tokenization methods like BPE or WordPiece aim to reduce <UNK> usage


🔸 <BOS> / <EOS> – Beginning/End of Sequence Tokens

These are used in sequence generation tasks like translation, summarization, or text generation.

  • <BOS> = Beginning of Sequence

  • <EOS> = End of Sequence

Input: ["<BOS>", "Hello", "world", "<EOS>"]
  • In models like GPT, generation stops when <EOS> is predicted.

  • In seq2seq models (like T5), these help mark input/output boundaries.
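
A small generation sketch with GPT-2, whose tokenizer uses a single <|endoftext|> token for both roles (assumes the transformers and torch packages; the parameter values are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hello world", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    eos_token_id=tokenizer.eos_token_id,   # stop as soon as the EOS token appears
    pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no pad token, so reuse EOS
)
print(tokenizer.decode(output_ids[0]))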


🧠 Summary Table: Special Tokens

| Token  | Meaning               | Use Case                              |
|--------|-----------------------|---------------------------------------|
| <PAD>  | Padding token         | Batch alignment; ignored by attention |
| <CLS>  | Classification token  | Sentence-level tasks in BERT          |
| <SEP>  | Separator token       | Sentence-pair tasks, QA               |
| <MASK> | Masking token         | Masked language modeling              |
| <UNK>  | Unknown token         | Out-of-vocabulary handling            |
| <BOS>  | Beginning of sequence | Text generation, decoding start       |
| <EOS>  | End of sequence       | Text generation, decoding end         |

💡 Pro Tip:

When using pre-trained models from Hugging Face Transformers or TensorFlow Hub, tokenizers automatically handle these special tokens for you. But when you build custom models or train from scratch, you need to define and manage them carefully.


🧪 Final Thoughts

Tokenization might seem like a simple preprocessing step—but it's actually fundamental to a model's performance. Whether you're building a chatbot or a translation system, understanding the nuances of tokenization will give your models the edge they need.

Choose your tokenizer wisely:

  • Want speed and control? Go with BPE.

  • Want precision for language models? WordPiece is your friend.

  • Want flexibility and multilingual support? SentencePiece is your hero.
