Tokenization

Everything You Need to Know About Tokenization for LLMs (with Hugging Face)
Before a Large Language Model (LLM) like GPT-2 or LLaMA can read or generate anything, the first step is tokenization — the silent translator that converts human text into machine-understandable numbers.
In this post, we’ll break down:
What is tokenization?
What is a token?
Why do we tokenize?
Types of tokenization
What tokenization looked like before Hugging Face
How Hugging Face made it easier with tokenizers
Let’s dive in.
🤔 What is Tokenization?
Tokenization is the process of converting raw text into smaller units called tokens, which are then mapped to numbers (token IDs) that a model can understand.
For example:
Input: "I love pizza."
Tokens: ["I", "love", "pizza", "."]
Token IDs: [40, 389, 22135, 4] # (these are example IDs)
These token IDs are what the model processes — not the actual text.
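Under the hood, a tokenizer is essentially a two-way lookup table between tokens and IDs. Here is a toy sketch (the vocabulary and IDs below are made up to match the example above):

```python
# Toy vocabulary: real tokenizers learn tens of thousands of entries
vocab = {"I": 40, "love": 389, "pizza": 22135, ".": 4}
id_to_token = {i: t for t, i in vocab.items()}

tokens = ["I", "love", "pizza", "."]
ids = [vocab[t] for t in tokens]                  # encode: tokens -> IDs
decoded = " ".join(id_to_token[i] for i in ids)   # decode: IDs -> tokens

print("Token IDs:", ids)    # [40, 389, 22135, 4]
print("Decoded:", decoded)  # naive join; real tokenizers restore exact spacing
```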
🔤 What is a Token?
A token is a unit of text. It can be:
A word ("love")
A subword ("pizz", "a")
A character ("l", "o", "v", "e")
Even a marked subword or byte-level token ("Ġlove" from GPT-2's byte-level BPE, "##ing" from BERT's WordPiece)
The size and shape of a token depend on the tokenization strategy and the model’s vocabulary.
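You can see this dependence directly by running the same word through two different tokenizers. Exact splits depend on each model's learned vocabulary, so your output may differ:

```python
from transformers import AutoTokenizer

word = "unhappiness"

# WordPiece (BERT) vs. byte-level BPE (GPT-2) on the same word
for name in ["bert-base-uncased", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {tok.tokenize(word)}")
```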
🧠 Why Do We Tokenize?
Transformers (and the LLMs built on them) only understand numbers, not text. Tokenization:
Prepares input for the model by converting text to numerical IDs.
Controls vocabulary size, which affects model size and training efficiency.
Handles rare words, typos, and out-of-vocabulary inputs via subwords or byte-pair strategies.
Decodes output token IDs back into human-readable text.
🔍 Types of Tokenization
1. Word-Level Tokenization
Splits text by spaces and punctuation (see the sketch after this list). Simple but struggles with:
Large vocabularies
Out-of-vocabulary words
Morphologically rich languages
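A minimal word-level tokenizer is a one-line regex. This sketch is purely illustrative; production tokenizers handle far more edge cases:

```python
import re

def word_tokenize(text: str) -> list[str]:
    # Words become tokens; each punctuation mark is its own token
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("I love pizza."))  # ['I', 'love', 'pizza', '.']
```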
2. Character-Level Tokenization
Treats each character as a token (see the sketch after this list). Pros and cons:
✅ Small vocabulary size
✅ No unknown words
❌ Very long sequences
❌ Loses word-level meaning
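Character-level tokenization is trivial to implement, which also makes its main drawback easy to see: sequences get long fast.

```python
text = "unhappiness"
tokens = list(text)  # every character becomes a token
print(tokens)        # ['u', 'n', 'h', 'a', 'p', 'p', 'i', 'n', 'e', 's', 's']
print(len(tokens))   # 11 tokens for a single word
```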
3. Subword Tokenization
The sweet spot for most modern NLP models:
Byte Pair Encoding (BPE): Merges most frequent character pairs
WordPiece: Google's approach, similar to BPE (used in BERT)
SentencePiece: A language-agnostic framework that trains BPE or Unigram models directly on raw text
There are many strategies, each with trade-offs:

| Type | Description | Example (unhappiness) |
| --- | --- | --- |
| Whitespace | Split on whitespace only | ["unhappiness"] |
| Word-level | Each word (and punctuation mark) is a token | ["unhappiness"] |
| Character-level | Each character is a token | ["u", "n", "h", "a", ...] |
| Subword (BPE) | Break into known parts | ["un", "happiness"] |
| Byte-level BPE | Tokenizes bytes; supports all UTF-8 text | ["Ġun", "happiness"] |
Most modern LLMs (like GPT-2, LLaMA) use Byte Pair Encoding (BPE) or Unigram-based subword tokenizers for flexibility and compactness.
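To build intuition for BPE, here is a toy sketch of its training loop: count adjacent symbol pairs and merge the most frequent one, repeatedly. Real implementations (e.g. Hugging Face's tokenizers library) are heavily optimized, but the core idea is the same:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    # Start with each word as a sequence of single characters
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus
        pairs = Counter()
        for symbols in corpus:
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        new_symbol = "".join(best)
        # Replace every occurrence of the best pair with the merged symbol
        for k, symbols in enumerate(corpus):
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(new_symbol)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            corpus[k] = out
    return merges

print(bpe_merges(["low", "lower", "lowest", "low"], num_merges=3))
# e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')] -- 'low' emerges as one unit
```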
⏳ Tokenization Before Hugging Face
Before Hugging Face, working with tokenizers meant:
Writing custom regex-based scripts.
Manually handling vocab files, encoders, and decoders.
Worrying about padding, truncation, and attention masks.
It was slow, inconsistent, and error-prone.
Popular pre-HF libraries included:
spaCy (rule-based word tokenization)
SentencePiece (used in T5, ALBERT, XLNet)
Moses tokenizer (used in early machine translation systems)
🚀 How Hugging Face Made It Easier with tokenizers
Hugging Face provides the transformers library with pre-built tokenizers for thousands of models. These tokenizers are:
Fast: Backed by the Rust-based tokenizers library for speed
Consistent: Same tokenization as the original model training
Feature-rich: Handle padding, truncation, and special tokens automatically (see the sketch after this list)
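As a concrete example, one call handles batching, padding, truncation, and attention masks (the model choice here is arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.is_fast)  # True -> backed by the Rust tokenizers library

batch = tokenizer(
    ["Hello world", "A much longer sentence that may get cut off"],
    padding=True,         # pad shorter sequences to the longest in the batch
    truncation=True,      # cut sequences that exceed max_length
    max_length=8,
    return_tensors="pt",  # PyTorch tensors
)
print(batch["input_ids"].shape)  # torch.Size([2, 8])
print(batch["attention_mask"])   # 1 = real token, 0 = padding
```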
Setting Up Your Environment
First, install the required packages:
```
pip install transformers torch datasets
```
For development, you might also want:
```
pip install jupyter notebook matplotlib seaborn
```
📌 What Does Tokenization Actually Look Like?
Here’s a step-by-step breakdown of how tokenization works in practice with Hugging Face:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "I love machine learning!"

# 1. Tokenize text into subwords
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
```
Output:
Tokens: ['I', 'Ġlove', 'Ġmachine', 'Ġlearning', '!']
✅ Notice the Ġ: it indicates a space before the word (used in GPT-2's byte-level BPE).

```python
# 2. Convert tokens into input IDs
ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", ids)
```
Output (illustrative IDs; exact values depend on the tokenizer's vocabulary):
Token IDs: [40, 389, 7594, 8945, 0]
```python
# 3. Decode back to text
decoded = tokenizer.decode(ids)
print("Decoded:", decoded)
```
Output:
Decoded: I love machine learning!
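In practice, steps 1 and 2 are usually collapsed into a single call: tokenizer.encode goes straight from text to IDs, and calling the tokenizer object itself also returns the attention mask:

```python
# One-step encoding: tokenize + convert_tokens_to_ids
# (plus any special tokens the model expects; GPT-2 adds none)
ids = tokenizer.encode("I love machine learning!")
print(ids)

# Calling the tokenizer directly returns input IDs and the attention mask
enc = tokenizer("I love machine learning!")
print(enc["input_ids"], enc["attention_mask"])
```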
Basic Tokenization Examples
Let's start with simple examples using popular models:
Example 1: BERT Tokenizer
```python
from transformers import AutoTokenizer

# Load BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Sample text
text = "Hello, how are you doing today?"

# Basic tokenization
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
# Output: ['hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']

# Convert to token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)

# Or do both in one step
encoding = tokenizer(text)
print("Full encoding:", encoding)
```
Example 2: GPT-2 Tokenizer
```python
from transformers import AutoTokenizer

# Load GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Artificial intelligence is transforming the world."

# Tokenize
tokens = tokenizer.tokenize(text)
print("GPT-2 Tokens:", tokens)
# Output: ['Art', 'ificial', 'Ġintelligence', 'Ġis', 'Ġtransforming', 'Ġthe', 'Ġworld', '.']
```
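One quick way to compare tokenizers is vocabulary size, which directly determines the size of a model's embedding matrix:

```python
from transformers import AutoTokenizer

for name in ["bert-base-uncased", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, "vocab size:", tok.vocab_size)
# bert-base-uncased: 30522, gpt2: 50257
```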
📌 Summary
Tokenization is the first and most critical step in NLP pipelines.
Tokens turn text into numbers — the language of models.
Hugging Face’s AutoTokenizer makes it simple, fast, and reliable.
Understanding tokenization helps you debug model behavior, input issues, and output errors.