Tanuj Rai
5 min read

Everything You Need to Know About Tokenization for LLMs (with Hugging Face)

Before a Large Language Model (LLM) like GPT-2 or LLaMA can read or generate anything, the first step is tokenization — the silent translator that converts human text into machine-understandable numbers.

In this post, we’ll break down:

  • What is tokenization?

  • What is a token?

  • Why do we tokenize?

  • Types of tokenization

  • What tokenization looked like before Hugging Face

  • How Hugging Face made it easier with tokenizers

Let’s dive in.


🤔 What is Tokenization?

Tokenization is the process of converting raw text into smaller units called tokens, which are then mapped to numbers (token IDs) that a model can understand.

For example:

Input: "I love pizza."
Tokens: ["I", "love", "pizza", "."]
Token IDs: [40, 389, 22135, 4]  # (these are example IDs)

These token IDs are what the model processes — not the actual text.


🔤 What is a Token?

A token is a unit of text. It can be:

  • A word ("love")

  • A subword ("pizz", "a")

  • A character ("l", "o", "v", "e")

  • Even a byte-level piece or a marked subword ("Ġlove" in GPT-2's byte-level BPE, "##ing" in BERT's WordPiece)

The size and shape of a token depend on the tokenization strategy and the model’s vocabulary.


🧠 Why Do We Tokenize?

Transformers (and other LLMs) only understand numbers, not text. Tokenization:

  1. Prepares input for the model by converting text to numerical IDs.

  2. Controls vocabulary size, which affects model size and training efficiency.

  3. Handles rare words, typos, and out-of-vocabulary inputs via subword or byte-pair strategies (see the sketch after this list).

  4. Decodes output token IDs back into human-readable text.
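
Points 3 and 4 are easy to see in practice. Here is a minimal sketch using the GPT-2 tokenizer from Hugging Face transformers (installation is covered later in this post); the exact subword split depends on the model's vocabulary:

  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")

  # A rare or made-up word is still encoded: it is broken into known subword
  # pieces, so there is no "unknown word" failure.
  print(tokenizer.tokenize("transmogrification"))

  # Encoding to IDs and decoding back are inverse operations.
  ids = tokenizer.encode("I love pizza.")
  print(ids)                    # a list of integer token IDs
  print(tokenizer.decode(ids))  # "I love pizza."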


🔍 Types of Tokenization

1. Word-Level Tokenization

Splits text by spaces and punctuation. Simple but struggles with:

  • Large vocabularies

  • Out-of-vocabulary words

  • Morphologically rich languages
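
A toy word-level tokenizer with a tiny, hypothetical vocabulary shows the out-of-vocabulary problem directly (plain Python, just for illustration):

  # Tiny fixed vocabulary; anything missing maps to <unk>.
  vocab = {"<unk>": 0, "i": 1, "love": 2, "pizza": 3, ".": 4}

  def word_tokenize(text):
      # Lowercase, split the final period off, then split on whitespace.
      words = text.lower().replace(".", " .").split()
      return [vocab.get(w, vocab["<unk>"]) for w in words]

  print(word_tokenize("I love pizza."))    # [1, 2, 3, 4]
  print(word_tokenize("I love lasagna."))  # [1, 2, 0, 4] -> "lasagna" is unknown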

2. Character-Level Tokenization

Treats each character as a token. Pros and cons:

  • ✅ Small vocabulary size

  • ✅ No unknown words

  • ❌ Very long sequences

  • ❌ Loses word-level meaning
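
A quick comparison makes the sequence-length trade-off concrete:

  text = "Tokenization converts text into numbers."

  print(len(text.split()))  # 5 word-level tokens
  print(len(list(text)))    # 40 character-level tokens -> much longer sequences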

3. Subword Tokenization

The sweet spot for most modern NLP models:

  • Byte Pair Encoding (BPE): Iteratively merges the most frequent adjacent symbol pairs

  • WordPiece: Used by BERT; similar to BPE, but chooses merges by likelihood rather than raw frequency

  • SentencePiece: A language-agnostic toolkit that trains BPE or Unigram models directly on raw text (used by T5 and LLaMA)

There are many strategies, each with trade-offs:

Type            | Description                                        | Example ("unhappiness")
----------------|----------------------------------------------------|---------------------------
Whitespace      | Split on spaces and punctuation                    | ["unhappiness"]
Word-level      | Each word is a token                               | ["unhappiness"]
Character-level | Each character is a token                          | ["u", "n", "h", "a", ...]
Subword (BPE)   | Break into known parts                             | ["un", "happiness"]
Byte-level BPE  | Tokenizes as bytes; supports all UTF-8 characters  | ["Ġun", "happiness"]

Most modern LLMs (like GPT-2, LLaMA) use Byte Pair Encoding (BPE) or Unigram-based subword tokenizers for flexibility and compactness.
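
The core of BPE fits in a few lines. Below is a toy sketch of the training-time merge loop (every word starts as a sequence of characters, and the most frequent adjacent pair is merged repeatedly); real tokenizers implement this far more efficiently, e.g. in Rust:

  from collections import Counter

  # Toy corpus: each word is a tuple of symbols, starting from characters.
  words = [
      ("u", "n", "h", "a", "p", "p", "y"),
      ("h", "a", "p", "p", "y"),
      ("h", "a", "p", "p", "i", "e", "r"),
  ]

  def most_frequent_pair(words):
      # Count every adjacent symbol pair across the corpus.
      pairs = Counter()
      for w in words:
          for a, b in zip(w, w[1:]):
              pairs[(a, b)] += 1
      return pairs.most_common(1)[0][0]

  def merge_pair(pair, words):
      # Replace every occurrence of the chosen pair with a single fused symbol.
      merged = []
      for w in words:
          out, i = [], 0
          while i < len(w):
              if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                  out.append(w[i] + w[i + 1])
                  i += 2
              else:
                  out.append(w[i])
                  i += 1
          merged.append(tuple(out))
      return merged

  for _ in range(4):  # learn 4 merges
      pair = most_frequent_pair(words)
      words = merge_pair(pair, words)
      print("merged", pair, "->", words[0])
  # Final first word: ('u', 'n', 'happy') -- frequent pieces become single tokens.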


⏳ Tokenization Before Hugging Face

Before Hugging Face, working with tokenizers meant:

  • Writing custom regex-based scripts.

  • Manually handling vocab files, encoders, and decoders.

  • Worrying about padding, truncation, and attention masks.

It was slow, inconsistent, and error-prone.

Popular pre-HF libraries included:

  • spaCy (rule-based word-level tokenization)

  • SentencePiece (used by T5 and ALBERT)

  • Moses tokenizer (used in early machine translation systems)


🚀 How Hugging Face Made It Easier with Tokenizers

Hugging Face provides the transformers library with pre-built tokenizers for thousands of models. These tokenizers are:

  • Fast: Implemented in Rust for speed

  • Consistent: Same tokenization as original model training

  • Feature-rich: Handle padding, truncation, and special tokens automatically (see Example 3 below)

Setting Up Your Environment

First, install the required packages:

  pip install transformers torch datasets

For development, you might also want:

  pip install jupyter notebook matplotlib seaborn

📌 What Does Tokenization Actually Look Like?

Here’s a step-by-step breakdown of how tokenization works in practice with Hugging Face:

  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")

  text = "I love machine learning!"

  # 1. Tokenize text into subwords
  tokens = tokenizer.tokenize(text)
  print("Tokens:", tokens)

Output:

  Tokens: ['I', 'Ġlove', 'Ġmachine', 'Ġlearning', '!']

✅ Notice the Ġ — it indicates a space before the word (used in GPT-2's Byte-Pair Encoding).

  # 2. Convert tokens into input IDs
  ids = tokenizer.convert_tokens_to_ids(tokens)
  print("Token IDs:", ids)

Output (will vary by model):

  Token IDs: [40, 389, 7594, 8945, 0]

  # 3. Decode back to text
  decoded = tokenizer.decode(ids)
  print("Decoded:", decoded)

Output:

  Decoded: I love machine learning!

Basic Tokenization Examples

Let's start with simple examples using popular models:

Example 1: BERT Tokenizer

  from transformers import AutoTokenizer

  # Load BERT tokenizer
  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

  # Sample text
  text = "Hello, how are you doing today?"

  # Basic tokenization
  tokens = tokenizer.tokenize(text)
  print("Tokens:", tokens)
  # Output: ['hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']

  # Convert to token IDs
  token_ids = tokenizer.convert_tokens_to_ids(tokens)
  print("Token IDs:", token_ids)

  # Or do it in one step (this also adds BERT's [CLS]/[SEP] special tokens
  # and returns an attention mask)
  encoding = tokenizer(text)
  print("Full encoding:", encoding)

Example 2: GPT-2 Tokenizer

  from transformers import AutoTokenizer

  # Load GPT-2 tokenizer
  tokenizer = AutoTokenizer.from_pretrained("gpt2")

  text = "Artificial intelligence is transforming the world."

  # Tokenize
  tokens = tokenizer.tokenize(text)
  print("GPT-2 Tokens:", tokens)
  # Output: ['Art', 'ificial', 'Ġintelligence', 'Ġis', 'Ġtransforming', 'Ġthe', 'Ġworld', '.']
    
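Example 3: Padding, Truncation, and Special Tokens

The conveniences mentioned earlier (padding, truncation, and special tokens) are handled for you when you call the tokenizer on a batch. A minimal sketch, where max_length=12 is an arbitrary choice for illustration and the exact IDs depend on the model:

  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

  batch = [
      "Short sentence.",
      "A much longer sentence that might need to be truncated.",
  ]

  # Pad to the longest sequence in the batch, truncate to max_length,
  # and add BERT's [CLS]/[SEP] special tokens automatically.
  encoding = tokenizer(
      batch,
      padding=True,
      truncation=True,
      max_length=12,
      return_tensors="pt",
  )

  print(encoding["input_ids"].shape)   # torch.Size([2, 12])
  print(encoding["attention_mask"])    # 1 = real token, 0 = padding
  print(tokenizer.decode(encoding["input_ids"][0]))  # shows [CLS], [SEP] and [PAD]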

📌 Summary

  • Tokenization is the first and most critical step in NLP pipelines.

  • Tokens turn text into numbers — the language of models.

  • Hugging Face’s AutoTokenizer makes it simple, fast, and reliable.

  • Understanding tokenization helps you debug model behavior, input issues, and output errors.
