BPE vs WordPiece vs SentencePiece: A Beginner-Friendly Guide to Subword Tokenization

Dhiya Adli
10 min read

Introduction

Machines can’t directly understand words; they only understand numbers. That’s why we need tokenization: a way to break text into smaller units (tokens) that can be mapped to numbers. There are three common levels of tokenization: word-level, character-level, and subword-level.

BPE, WordPiece, and SentencePiece are subword-level tokenization methods. Nowadays, most models use the subword level. Why? Because it:

  • Handles unknown words gracefully by breaking them into smaller chunks. In a word-level tokenizer, every whole word is a token; if a word doesn’t exist in the vocabulary, it becomes [UNK]. Subword tokenizers recover meaning by splitting new words into familiar chunks, which makes them work better for misspellings (crayzy) and new words (bitcoiners, unfriendable).

    For example, since the model has never seen litty (new slang) before, a word-level model fails to get any meaning from it. But because a subword model has seen lit and ty, it can still understand the meaning behind the word (see the sketch after this list).

  • Keeps vocabulary size manageable (compared to word-level).

    As you can see, a word-level vocabulary has to store every single inflected form separately: “jumping”, “jumped”, “walking”, “walked”, etc. But in a subword-level vocabulary, we only need a handful of root morphemes (“jump”, “walk”, “play”) plus a few suffixes (“ing”, “ed”). With these, the model can compose all the same words.

    This difference may look small here, but when you scale to millions of words in a language, the word-level vocabulary explodes in size, while the subword vocabulary stays compact and efficient.

  • More efficient and semantically meaningful than pure character-level. Subwords are short enough to cover any word (like characters) but still carry meaning (like words). If we go down to the character level, every word becomes a long sequence of mostly meaningless units, which makes training less efficient and semantics harder to capture.

    Take “jumping”: at the character level, tokens like "i" or "j" are too general and carry little meaning. But at the subword level, "jump" preserves a clear action meaning, while "ing" indicates tense/aspect. This makes subword tokens more efficient and meaningful compared to pure characters.
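To make the [UNK] point above concrete, here is a minimal, hypothetical sketch: the vocabularies and the greedy splitting rule are made up purely for illustration, not taken from any real tokenizer.

```python
# Hypothetical toy vocabularies, just to illustrate the [UNK] problem.
word_vocab = {"the", "party", "was", "lit"}
subword_vocab = {"the", "party", "was", "lit", "ty"}

def word_level(tokens):
    # Unknown whole words collapse to [UNK]: all meaning is lost.
    return [t if t in word_vocab else "[UNK]" for t in tokens]

def subword_level(word):
    # Greedy fallback: split an unknown word into known pieces.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):        # try the longest piece first
            if word[i:j] in subword_vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:                                     # no piece matched at all
            return ["[UNK]"]
    return pieces

print(word_level(["the", "party", "was", "litty"]))  # ['the', 'party', 'was', '[UNK]']
print(subword_level("litty"))                        # ['lit', 'ty']
```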

You might still have questions, like: are these the only three methods for the subword level? No. So why do we focus on these methods first? Because they are:

  • The most widely adopted in real NLP systems (BERT, GPT, T5).

  • The foundation for understanding newer, fancier approaches.

  • A clear illustration of the tradeoffs between vocab size, efficiency, and flexibility.

I think we’re ready to look at each method in detail.


Byte Pair Encoding (BPE)

BPE is a mechanism for creating subword tokens. It was originally used to reduce a word-level vocabulary down to subwords, and it also gives us a way to limit the number of tokens the model has.

The way this method works is by picking the most frequent pair of adjacent tokens in the corpus (the dataset we use to train the model, usually millions of sentences, paragraphs, etc. collected from multiple sources like Wikipedia, blogs, etc.), merging it into a new token, and iterating that process until we reach the desired number of tokens.

I know it’s hard to picture just by reading those sentences (I felt that too 😅), so let’s move on to an example of the process from start to finish.

Step-by-Step of How BPE Works During training

  1. Define the desired number of tokens and the initial tokens: For our case, let’s say we want to create a subword vocab (set of tokens) with a maximum size of 38 (in practice it’s 50k, 200k, or even millions, but we simplify). We also need to predefine our initial tokens. These could include predefined subwords (like "ing", "the", "##ed") if you want; this makes the model train faster and handles common morphemes better. But to explain BPE more clearly, we only use letters and digits.

  2. Make pair candidates: First, preprocess the input by splitting the sentences and paragraphs on punctuation and whitespace.

    After that, we convert the preprocessed data into tokens. Since the vocab currently contains only characters, each word is converted into a sequence of characters.

    Next, we make the candidates by pairing tokens that appear next to each other in the corpus.

  3. Count the frequency: count how many times each pair appears in the corpus to decide which candidate to keep.

    As we can see, “an” has the highest frequency, so we add that token to our vocab.

  4. Iterate until we reach the desired number of tokens: Since our maximum vocab size is 38, we need one more token. We get it by doing exactly the same process. Just a reminder for the next iteration: the pair combinations will already be different, because we have added a new token to our vocab. When re-tokenizing, this tokenizer uses a technique called maximal munch, also known as the greedy longest-match strategy.

    As you can see in the example above, since we already know the “an” token, the greedy longest-match technique picks it instead of “a” and “n” separately: it always takes the longest known token. And if, say, by iteration 100 we already know a longer token such as “nana”, we pick that instead. (A code sketch of the full merge loop follows right after this list.)
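To tie the steps together, here is a minimal sketch of the merge loop described above. The corpus and target size are made up for illustration; real implementations add byte-level handling, special tokens, and much larger corpora.

```python
from collections import Counter

# Toy corpus, already split on whitespace/punctuation (step 2).
corpus = ["banana", "band", "ban", "ana"]
# Each word starts as a list of characters (the initial vocab).
words = [list(w) for w in corpus]
vocab = {ch for w in words for ch in w}
target_vocab_size = 10  # in practice 50k+; tiny here for illustration

while len(vocab) < target_vocab_size:
    # Steps 2-3: collect adjacent pairs and count their frequency.
    pair_counts = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pair_counts[(a, b)] += 1
    if not pair_counts:
        break
    best = max(pair_counts, key=pair_counts.get)  # most frequent pair, e.g. ('a', 'n')
    vocab.add("".join(best))
    # Step 4: merge that pair everywhere, then repeat with the new vocab.
    new_words = []
    for w in words:
        merged, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                merged.append(w[i] + w[i + 1])
                i += 2
            else:
                merged.append(w[i])
                i += 1
        new_words.append(merged)
    words = new_words

print(sorted(vocab))  # includes merged tokens like 'an', 'ban', ...
```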

BPE On Inference

At inference time, BPE also uses the greedy longest-match strategy: it always picks the longest matching token.
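A minimal sketch of that greedy longest-match (maximal munch) rule, assuming we already have a trained vocab like the one built above:

```python
def greedy_tokenize(word, vocab):
    """Always grab the longest token in the vocab that matches at the current position."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # longest candidate first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Fall back to a single character (real tokenizers map this to an unknown token).
            tokens.append(word[i])
            i += 1
    return tokens

vocab = {"j", "u", "m", "p", "i", "n", "g", "ing", "jump"}
print(greedy_tokenize("jumping", vocab))  # ['jump', 'ing'] instead of seven single characters
```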


WordPiece

The difference between WordPiece and BPE is how each decides which token to add to the vocab. While BPE picks the most frequent pair, WordPiece chooses the candidate that most increases the likelihood of the corpus. This mechanism tends to make the vocabulary more efficient, because it prefers fewer tokens and better-justified tokens.

When does WordPiece stop during training? Unlike BPE, which stops the iteration only when it reaches the desired number of tokens, WordPiece also stops when no candidate pair has a better likelihood than the current vocab.

Step-by-Step of How WordPiece Works During training

  1. Define the maximum number of tokens and the initial vocab: we want to create a subword vocab with a desired maximum of 40 tokens (it’s usually 50k, 200k, or even more, but we keep it small to simplify the explanation). Our initial vocab is the combination of letters and digits.

  2. Preprocess and tokenize the dataset: as with BPE, split the sentences and paragraphs on punctuation and whitespace.

    Then we tokenize them using the current vocab.

  3. Make the pair candidates: pair up the tokens that appear next to each other in the corpus.

  4. Calculate the likelihood of the baseline: compute the likelihood of the corpus under the current vocab, which we call the baseline. This is what we compare each candidate against to see whether it is better or not.

  5. Calculate the log likelihood of each pair candidate

    So we have multiple versions of the corpus, one for each candidate. After getting the log-likelihood of each, we calculate its difference from the baseline (ΔL). Let’s just pick a few of them to keep things simple, but in an actual implementation we compute this for every pair candidate (see the sketch after this list).

    In this step, if no pair has a higher log-likelihood than the baseline (i.e., every ΔL is negative), the process stops.

  6. Pick the highest-scoring pair candidate: next, we pick the candidate with the highest likelihood. As we can see, the highest-scoring pair is "lo", so we add that token to our vocab.

  7. Repeat the process: since we need three more tokens to reach 40 (the desired maximum), we repeat from step 2. Just to remind you again: we can’t reuse the previous pair counts, because the new token means the pair candidates will always be different, and re-tokenizing also uses the maximal-munch method, just like BPE.
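Here is a minimal sketch of the likelihood-based selection in steps 4–6. Instead of recomputing full corpus log-likelihoods per candidate, it uses a common shortcut: score each pair as count(pair) / (count(first) × count(second)), which is proportional to the gain ΔL of merging that pair under a unigram language model. The tiny corpus below is made up and differs from the figures above.

```python
from collections import Counter

# Toy tokenized corpus: each word is a list of current tokens.
words = [["l", "o", "w"], ["l", "o", "w", "e", "r"], ["s", "l", "o", "w"]]

token_counts = Counter(t for w in words for t in w)
pair_counts = Counter((a, b) for w in words for a, b in zip(w, w[1:]))

def score(pair):
    # WordPiece-style score: frequent pairs made of otherwise-rare tokens win.
    a, b = pair
    return pair_counts[pair] / (token_counts[a] * token_counts[b])

for pair in sorted(pair_counts, key=score, reverse=True):
    print(pair, round(score(pair), 3))
# Here ('e', 'r') wins: it is less frequent than ('l', 'o'), but its parts never
# occur apart, so the heuristic says merging it raises the corpus likelihood most.
# The best-scoring pair is merged into a new token, and the whole procedure repeats
# until the target vocab size is reached or no pair improves the likelihood.
```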

WordPiece On Inference

At inference time, WordPiece also uses the greedy longest-match strategy, like BPE: it always picks the longest matching token, and it marks non-initial pieces with the "##" prefix.
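A minimal sketch of how that looks with the "##" continuation prefix. The vocab and words are made up; real tokenizers like BERT’s also normalize case, handle punctuation, and cap word length.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match: non-initial pieces must exist in the vocab with a '##' prefix."""
    pieces, i = [], 0
    while i < len(word):
        j, piece = len(word), None
        while j > i:
            candidate = word[i:j] if i == 0 else "##" + word[i:j]
            if candidate in vocab:
                piece = candidate
                break
            j -= 1
        if piece is None:
            return [unk]          # the whole word becomes [UNK] if any part can't be matched
        pieces.append(piece)
        i = j
    return pieces

vocab = {"jump", "walk", "play", "##ing", "##ed"}
print(wordpiece_tokenize("jumping", vocab))  # ['jump', '##ing']
print(wordpiece_tokenize("jumpz", vocab))    # ['[UNK]']
```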


SentencePiece

SentencePiece is a tokenizer + detokenizer framework developed by Google. Its goal is to make subword tokenization (BPE, WordPiece, Unigram LM) language-independent and self-contained.

The difference between SentencePiece and the two methods we talked about before is:

  • In BPE/WordPiece, text is usually pre-split into words by whitespace or punctuation:

    ["This", "is", "an", "example"].

  • In SentencePiece, text is treated as a raw stream of characters. Whitespace is not discarded but converted into a special marker (the "▁" symbol).
    "This is an example" → "▁This▁is▁an▁example"

As you can see, there is a "▁" before “This” even though it’s the first word. Why? Because SentencePiece assumes every word starts with a space: even the very first word in a sentence is “pretended” to be preceded by one, so it gets the marker too. This ensures a consistent rule (every token starting with "▁" marks the beginning of a word).
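A minimal sketch of just that preprocessing step (the real library also applies Unicode normalization, which is omitted here):

```python
# "▁" (U+2581, LOWER ONE EIGHTH BLOCK) is the conventional whitespace marker.
MARKER = "\u2581"

def to_raw_stream(text):
    # Pretend every word is preceded by a space, then make spaces visible in the stream.
    return MARKER + text.replace(" ", MARKER)

def detokenize(pieces):
    # Because whitespace was kept, joining pieces and swapping the marker back is lossless.
    return "".join(pieces).replace(MARKER, " ").lstrip()

print(to_raw_stream("This is an example"))                   # ▁This▁is▁an▁example
print(detokenize(["▁This", "▁is", "▁an", "▁ex", "ample"]))   # This is an example
```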

Why do we need SentencePiece? Many languages (Chinese, Japanese, Thai) don’t use spaces between words. WordPiece and BPE rely on whitespace-based preprocessing, which can make them fail badly on these languages. SentencePiece, by treating text as a raw stream, works for all languages (English, Chinese, mixed multilingual corpora, etc.).

Step-by-Step of How SentencePiece Works During training

For training, SentencePiece can use more than one algorithm: BPE (merge frequent pairs) and the Unigram LM (the default and most distinctive one), along with simple character- and word-level models. Since you already know BPE, and the likelihood idea from WordPiece, we focus on the Unigram LM, which the SentencePiece authors chose as the default mechanism.

Unlike the other two, where we build the vocab from scratch (starting from just characters), the Unigram LM works top-down: its initial vocab already contains a large number of characters, subwords, and so on. What this mechanism does is remove useless tokens instead of creating new combinations. “Useless” here means the tokens that contribute the least to the log-likelihood.

  1. Initialize the vocab and target size: start with lots of candidate tokens (e.g., characters, frequent substrings, numbers, punctuation).

  2. Preprocess input: replace spaces with "▁", so text is a continuous stream.

    "This is an example""▁This▁is▁an▁example"

  3. Segmentation: Instead of greedy longest-match (maximal munch), it considers all possible segmentations of the sentence.

    Use the Viterbi algorithm to pick the best segmentation out of all possible ones (see the sketch after this list).

  4. Evaluate tokens: For each token, compute its contribution to likelihood. Tokens that rarely appear in best segmentations (low likelihood) are considered useless.

  5. Prune tokens: Remove a fraction of the lowest-likelihood tokens.

  6. Repeat the process: repeat the segmentation + pruning until vocab reaches the target size.
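A minimal sketch of the Viterbi segmentation in step 3, assuming we already have per-token log-probabilities. The vocab and probabilities below are made up for illustration, and the EM step that re-estimates probabilities before each pruning round is omitted.

```python
import math

# Hypothetical unigram log-probabilities for a tiny vocab (made up for illustration).
logp = {"▁this": math.log(0.05), "▁th": math.log(0.02), "is": math.log(0.04),
        "▁": math.log(0.10), "t": math.log(0.03), "h": math.log(0.03),
        "i": math.log(0.03), "s": math.log(0.03)}

def viterbi_segment(text):
    """Best-scoring segmentation of `text` via dynamic programming."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i] = best log-prob of text[:i]
    back = [None] * (n + 1)        # back[i] = start index of the last piece ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    # Walk backwards through the pointers to recover the pieces.
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]

print(viterbi_segment("▁this"))  # (['▁this'], ...) beats (['▁th', 'i', 's'], ...) etc.
```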

On Inference

Like the other methods, SentencePiece segments new text at inference time with whichever model it was trained with: the BPE model behaves like the greedy longest-match strategy, while the Unigram model picks the most likely segmentation with Viterbi, as described in step 3 above.
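For completeness, this is roughly how it looks with Google’s sentencepiece Python package. Treat it as a hedged sketch: "corpus.txt" and "toy_sp" are placeholder names, the vocab size is arbitrary, and the exact splits depend on the training corpus and library version.

```python
import sentencepiece as spm

# Train a Unigram model on a plain-text corpus, one sentence per line ("corpus.txt" is a placeholder).
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="toy_sp", vocab_size=8000, model_type="unigram"
)

sp = spm.SentencePieceProcessor(model_file="toy_sp.model")
pieces = sp.encode("This is an example", out_type=str)
print(pieces)             # e.g. ['▁This', '▁is', '▁an', '▁ex', 'ample'], depending on the corpus
print(sp.decode(pieces))  # "This is an example" -- detokenization is lossless
```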


Summary

| Aspect | BPE (Byte Pair Encoding) | WordPiece | Unigram LM |
| --- | --- | --- | --- |
| How it works | Iteratively merges the most frequent character pairs until the vocab size is reached. | Similar to BPE, but uses a likelihood-based merge (maximizes the probability of the data under a language model). | Starts with a large set of subwords, then prunes them using likelihood until the desired vocab size is reached. |
| Greedy rule at inference | Maximal munch (longest match from vocab). | Maximal munch. | Probabilistic segmentation, in practice often decoded with greedy / Viterbi. |
| Advantages | Simple, fast to train; produces deterministic splits; works well with frequent character patterns. | Avoids weird merges BPE can make; better for rare words than BPE; balanced vocab efficiency. | Very flexible (can segment multiple ways); captures both common and rare subwords; often yields fewer tokens per sentence. |
| Disadvantages | Struggles with rare/unseen words (may fragment heavily); can create strange merges. | More complex than BPE; still deterministic, less flexible than Unigram. | Training is heavier (requires EM-like optimization); decoding is more complex if not greedy. |
| Vocabulary size | Typically 30k–50k. | Typically 30k–50k. | Similar or slightly smaller, 20k–50k. |
| Model usage | GPT-2, LLaMA, BLOOM, etc. | BERT, RoBERTa, ALBERT, DistilBERT, etc. | T5, XLNet, ByT5, some speech LLMs. |
| General performance | Good for generative LMs (fast, stable). | Good for bidirectional encoders (balanced). | Best when flexibility matters (translation, speech, multilingual). |

Tokenization is the very first step that determines how effectively a model can understand and generate text. In this post, we’ve only scratched the surface, covering the basics of word, subword, and character tokenization, along with their strengths and weaknesses.

Think of this as an introduction, enough to get you familiar with the ideas and trade-offs, but not the full story. If you want to dive deeper into the details, I strongly recommend going through the original research papers, as they explain the motivation and technical nuances far more thoroughly.

I’d love to hear your thoughts: do you prefer one tokenization strategy over another? Have you run into challenges with specific approaches? Let’s continue the discussion in the comments. 😉
