Tanuj Rai
5 min read

Everything You Need to Know About Tokenization for LLMs (with Hugging Face)

Before a Large Language Model (LLM) like GPT-2 or LLaMA can read or generate anything, the first step is tokenization — the silent translator that converts human text into machine-understandable numbers.

In this post, we’ll break down:

  • What is tokenization?

  • What is a token?

  • Why do we tokenize?

  • Types of tokenization

  • What tokenization looked like before Hugging Face

  • How Hugging Face made it easier with tokenizers

Let’s dive in.


🤔 What is Tokenization?

Tokenization is the process of converting raw text into smaller units called tokens, which are then mapped to numbers (token IDs) that a model can understand.

For example:

Input: "I love pizza."
Tokens: ["I", "love", "pizza", "."]
Token IDs: [40, 389, 22135, 4]  # (these are example IDs)

These token IDs are what the model processes — not the actual text.


🔤 What is a Token?

A token is a unit of text. It can be:

  • A word ("love")

  • A subword ("pizz", "a")

  • A character ("l", "o", "v", "e")

  • Even a byte-level piece or a marked subword ("Ġlove" in GPT-2's byte-level BPE, "##ing" in BERT's WordPiece)

The size and shape of a token depend on the tokenization strategy and the model’s vocabulary.


🧠 Why Do We Tokenize?

Transformers (and other LLMs) only understand numbers, not text. Tokenization:

  1. Prepares input for the model by converting text to numerical IDs.

  2. Controls vocabulary size, which affects model size and training efficiency.

  3. Handles rare words, typos, and out-of-vocabulary inputs via subword or byte-pair strategies (see the sketch after this list).

  4. Decodes output token IDs back into human-readable text.
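
Points 3 and 4 are easy to see in practice. Here is a minimal sketch using the GPT-2 tokenizer from Hugging Face transformers (installation is covered later in this post); the exact subword split depends on the model's vocabulary:

  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")

  # A rare or made-up word is still encoded: it is broken into known subword
  # pieces, so there is no "unknown word" failure.
  print(tokenizer.tokenize("transmogrification"))

  # Encoding to IDs and decoding back are inverse operations.
  ids = tokenizer.encode("I love pizza.")
  print(ids)                    # a list of integer token IDs
  print(tokenizer.decode(ids))  # "I love pizza."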


🔍 Types of Tokenization

1. Word-Level Tokenization

Splits text by spaces and punctuation. Simple but struggles with:

  • Large vocabularies

  • Out-of-vocabulary words

  • Morphologically rich languages
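
A toy word-level tokenizer with a tiny, hypothetical vocabulary shows the out-of-vocabulary problem directly (plain Python, just for illustration):

  # Tiny fixed vocabulary; anything missing maps to <unk>.
  vocab = {"<unk>": 0, "i": 1, "love": 2, "pizza": 3, ".": 4}

  def word_tokenize(text):
      # Lowercase, split the final period off, then split on whitespace.
      words = text.lower().replace(".", " .").split()
      return [vocab.get(w, vocab["<unk>"]) for w in words]

  print(word_tokenize("I love pizza."))    # [1, 2, 3, 4]
  print(word_tokenize("I love lasagna."))  # [1, 2, 0, 4] -> "lasagna" is unknown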

2. Character-Level Tokenization

Treats each character as a token. Pros and cons:

  • ✅ Small vocabulary size

  • ✅ No unknown words

  • ❌ Very long sequences

  • ❌ Loses word-level meaning
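
A quick comparison makes the sequence-length trade-off concrete:

  text = "Tokenization converts text into numbers."

  print(len(text.split()))  # 5 word-level tokens
  print(len(list(text)))    # 40 character-level tokens -> much longer sequences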

3. Subword Tokenization

The sweet spot for most modern NLP models:

  • Byte Pair Encoding (BPE): Iteratively merges the most frequent adjacent symbol pairs

  • WordPiece: Used by BERT; similar to BPE, but chooses merges by likelihood rather than raw frequency

  • SentencePiece: A language-agnostic toolkit that trains BPE or Unigram models directly on raw text (used by T5 and LLaMA)

There are many strategies, each with trade-offs:

Type            | Description                                        | Example ("unhappiness")
----------------|----------------------------------------------------|---------------------------
Whitespace      | Split on spaces and punctuation                    | ["unhappiness"]
Word-level      | Each word is a token                               | ["unhappiness"]
Character-level | Each character is a token                          | ["u", "n", "h", "a", ...]
Subword (BPE)   | Break into known parts                             | ["un", "happiness"]
Byte-level BPE  | Tokenizes as bytes; supports all UTF-8 characters  | ["Ġun", "happiness"]

Most modern LLMs (like GPT-2, LLaMA) use Byte Pair Encoding (BPE) or Unigram-based subword tokenizers for flexibility and compactness.
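
The core of BPE fits in a few lines. Below is a toy sketch of the training-time merge loop (every word starts as a sequence of characters, and the most frequent adjacent pair is merged repeatedly); real tokenizers implement this far more efficiently, e.g. in Rust:

  from collections import Counter

  # Toy corpus: each word is a tuple of symbols, starting from characters.
  words = [
      ("u", "n", "h", "a", "p", "p", "y"),
      ("h", "a", "p", "p", "y"),
      ("h", "a", "p", "p", "i", "e", "r"),
  ]

  def most_frequent_pair(words):
      # Count every adjacent symbol pair across the corpus.
      pairs = Counter()
      for w in words:
          for a, b in zip(w, w[1:]):
              pairs[(a, b)] += 1
      return pairs.most_common(1)[0][0]

  def merge_pair(pair, words):
      # Replace every occurrence of the chosen pair with a single fused symbol.
      merged = []
      for w in words:
          out, i = [], 0
          while i < len(w):
              if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                  out.append(w[i] + w[i + 1])
                  i += 2
              else:
                  out.append(w[i])
                  i += 1
          merged.append(tuple(out))
      return merged

  for _ in range(4):  # learn 4 merges
      pair = most_frequent_pair(words)
      words = merge_pair(pair, words)
      print("merged", pair, "->", words[0])
  # Final first word: ('u', 'n', 'happy') -- frequent pieces become single tokens.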


⏳ Tokenization Before Hugging Face

Before Hugging Face, working with tokenizers meant:

  • Writing custom regex-based scripts.

  • Manually handling vocab files, encoders, and decoders.

  • Worrying about padding, truncation, and attention masks.

It was slow, inconsistent, and error-prone.

Popular pre-HF libraries included:

  • spaCy (rule-based word-level tokenization)

  • SentencePiece (used by T5 and ALBERT)

  • Moses tokenizer (used in early machine translation systems)


🚀 How Hugging Face Made It Easier with Tokenizers

Hugging Face provides the transformers library with pre-built tokenizers for thousands of models. These tokenizers are:

  • Fast: Implemented in Rust for speed

  • Consistent: Same tokenization as original model training

  • Feature-rich: Handle padding, truncation, and special tokens automatically (see Example 3 below)

Setting Up Your Environment

First, install the required packages:

  pip install transformers torch datasets

For development, you might also want:

  pip install jupyter notebook matplotlib seaborn

📌 What Does Tokenization Actually Look Like?

Here’s a step-by-step breakdown of how tokenization works in practice with Hugging Face:

  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")

  text = "I love machine learning!"

  # 1. Tokenize text into subwords
  tokens = tokenizer.tokenize(text)
  print("Tokens:", tokens)

Output:

  Tokens: ['I', 'Ġlove', 'Ġmachine', 'Ġlearning', '!']

✅ Notice the Ġ — it indicates a space before the word (used in GPT-2's Byte-Pair Encoding).

  # 2. Convert tokens into input IDs
  ids = tokenizer.convert_tokens_to_ids(tokens)
  print("Token IDs:", ids)

Output (will vary by model):

  Token IDs: [40, 389, 7594, 8945, 0]

  # 3. Decode back to text
  decoded = tokenizer.decode(ids)
  print("Decoded:", decoded)

Output:

  Decoded: I love machine learning!

Basic Tokenization Examples

Let's start with simple examples using popular models:

Example 1: BERT Tokenizer

  from transformers import AutoTokenizer

  # Load BERT tokenizer
  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

  # Sample text
  text = "Hello, how are you doing today?"

  # Basic tokenization
  tokens = tokenizer.tokenize(text)
  print("Tokens:", tokens)
  # Output: ['hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']

  # Convert to token IDs
  token_ids = tokenizer.convert_tokens_to_ids(tokens)
  print("Token IDs:", token_ids)

  # Or do it in one step (this also adds BERT's [CLS]/[SEP] special tokens
  # and returns an attention mask)
  encoding = tokenizer(text)
  print("Full encoding:", encoding)

Example 2: GPT-2 Tokenizer

  from transformers import AutoTokenizer

  # Load GPT-2 tokenizer
  tokenizer = AutoTokenizer.from_pretrained("gpt2")

  text = "Artificial intelligence is transforming the world."

  # Tokenize
  tokens = tokenizer.tokenize(text)
  print("GPT-2 Tokens:", tokens)
  # Output: ['Art', 'ificial', 'Ġintelligence', 'Ġis', 'Ġtransforming', 'Ġthe', 'Ġworld', '.']
    
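Example 3: Padding, Truncation, and Special Tokens

The conveniences mentioned earlier (padding, truncation, and special tokens) are handled for you when you call the tokenizer on a batch. A minimal sketch, where max_length=12 is an arbitrary choice for illustration and the exact IDs depend on the model:

  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

  batch = [
      "Short sentence.",
      "A much longer sentence that might need to be truncated.",
  ]

  # Pad to the longest sequence in the batch, truncate to max_length,
  # and add BERT's [CLS]/[SEP] special tokens automatically.
  encoding = tokenizer(
      batch,
      padding=True,
      truncation=True,
      max_length=12,
      return_tensors="pt",
  )

  print(encoding["input_ids"].shape)   # torch.Size([2, 12])
  print(encoding["attention_mask"])    # 1 = real token, 0 = padding
  print(tokenizer.decode(encoding["input_ids"][0]))  # shows [CLS], [SEP] and [PAD]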

📌 Summary

  • Tokenization is the first and most critical step in NLP pipelines.

  • Tokens turn text into numbers — the language of models.

  • Hugging Face’s AutoTokenizer makes it simple, fast, and reliable.

  • Understanding tokenization helps you debug model behavior, input issues, and output errors.
