Understanding BERT Tokens: Tokenization and Its Role in NLP

Introduction to BERT Tokens: A Beginner's Guide

BERT (Bidirectional Encoder Representations from Transformers) is a powerful machine learning model used for processing natural language, like understanding text or answering questions. At the heart of how BERT works is something called tokenization—a process that breaks down text into smaller pieces called tokens.

You can think of tokens as the "building blocks" of language that BERT uses to analyze and understand text. For example, the sentence "I love AI" would be split into individual words or subwords, which the model can process more effectively. BERT also adds special tokens (such as [CLS] and [SEP]) to give the input structure and context, making it easier to perform tasks like sentiment analysis or question answering.

This article will explain what BERT tokens are, how they’re created, and why they’re so important for helping BERT understand and process language. Whether you're curious about how BERT handles complex text or just want to know more about how tokenization works, this guide will give you the key insights you need.

Types of BERT Tokens

1. WordPiece Tokens

BERT uses a subword tokenization approach called WordPiece (a short code example follows this list).

  • Words are split into smaller units (subwords) to handle out-of-vocabulary words effectively.

  • Example:

    • Input: "unbelievable"

    • Tokens: ["un", "##believable"]

      • "##" indicates the subword is part of a previous word.

2. Special Tokens

BERT adds special tokens to inputs to provide additional context and structure (the snippet after this list shows how to inspect them):

  • [CLS] (Classification Token):

    • Placed at the beginning of every input sequence.

    • Used to aggregate information for classification tasks.

  • [SEP] (Separator Token):

    • Marks the end of one sentence and separates multiple sentences in a sequence.

    • Example: In sentence-pair tasks, [SEP] separates the two sentences.

  • [PAD] (Padding Token):

    • Added to make sequences the same length in a batch.
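
Each special token is stored on the tokenizer together with its vocabulary ID, so you can inspect them directly. A minimal sketch using bert-base-uncased:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Special tokens and their integer IDs in the bert-base-uncased vocabulary
print(tokenizer.cls_token, tokenizer.cls_token_id)  # [CLS] 101
print(tokenizer.sep_token, tokenizer.sep_token_id)  # [SEP] 102
print(tokenizer.pad_token, tokenizer.pad_token_id)  # [PAD] 0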

BERT Tokenization Workflow

  1. Text Cleaning: Input text is lowercased (for uncased models), whitespace is normalized, and punctuation is split off into separate tokens.

  2. Tokenization: Sentences are split into tokens using the WordPiece algorithm.

  3. Special Tokens: [CLS] and [SEP] tokens are added.

  4. Convert to IDs: Tokens are mapped to integer IDs using a predefined vocabulary.

  5. Padding and Truncation: Sequences are padded or truncated to match the maximum length.

Example:

Input: "Hello world!"

  • Tokenization: ["[CLS]", "hello", "world", "!", "[SEP]"]

  • IDs (using a vocab): [101, 7592, 2088, 999, 102]
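
This end-to-end workflow is what a single tokenizer call performs. A minimal sketch; the IDs in the comments are those of the bert-base-uncased vocabulary:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# One call cleans the text, applies WordPiece, adds [CLS]/[SEP], and maps to IDs
encoded = tokenizer("Hello world!")
print(encoded["input_ids"])  # [101, 7592, 2088, 999, 102]
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'hello', 'world', '!', '[SEP]']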


BERT Tokens in Practice

1. Single-Sentence Input:

Example: "I love AI."

  • Tokens: [CLS] I love AI . [SEP]

  • IDs: [101, 146, 1567, 7270, 1012, 102]

2. Sentence Pair Input:

Example: "What is AI?" / "Artificial Intelligence."

  • Tokens: [CLS] What is AI ? [SEP] Artificial Intelligence . [SEP]

  • IDs: [101, 2054, 2003, 7270, 1029, 102, 7844, 10392, 1012, 102]
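
Sentence pairs like the one above are encoded by passing both sentences to the tokenizer; the returned token_type_ids (segment IDs) record which sentence each token belongs to. A minimal sketch; the exact IDs depend on the vocabulary:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

pair = tokenizer("What is AI?", "Artificial Intelligence.")
print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))
# both sentences, prefixed with [CLS] and separated/terminated by [SEP]
print(pair["token_type_ids"])  # 0 for first-sentence tokens, 1 for second-sentence tokens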

3. Padding:

Sentences in a batch are padded to the same length.

  • Example:

    • Input 1: [101, 2054, 2003, 7270, 102, 0, 0]

    • Input 2: [101, 2154, 2731, 102, 0, 0, 0]
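
In practice, the tokenizer pads a batch for you when padding=True is passed; every sequence is extended to the length of the longest one. A minimal sketch with two illustrative sentences:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(["What is AI?", "Good morning"], padding=True)
print(batch["input_ids"])       # shorter sequences end with the [PAD] id (0)
print(batch["attention_mask"])  # 1 for real tokens, 0 for padding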


Token Embeddings in BERT

After tokenization, tokens are converted into embeddings that capture their contextual meaning:

  1. Token Embeddings: Represent the specific token.

  2. Segment Embeddings: Distinguish between sentences in sentence-pair tasks.

  3. Position Embeddings: Capture the order of tokens in the sequence.

The final embedding for each token is a combination of these three components.
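
This sum can be reproduced from the embedding layers inside the Hugging Face BertModel. The sketch below relies on the library's internal attribute names (model.embeddings.word_embeddings and friends), which are implementation details and may change between versions; the real model also applies LayerNorm and dropout after the sum.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

enc = tokenizer("Hello world!", return_tensors="pt")
input_ids = enc["input_ids"]                              # shape: (1, seq_len)
positions = torch.arange(input_ids.size(1)).unsqueeze(0)  # 0, 1, 2, ...

emb = model.embeddings
combined = (
    emb.word_embeddings(input_ids)                        # token embeddings
    + emb.position_embeddings(positions)                  # position embeddings
    + emb.token_type_embeddings(enc["token_type_ids"])    # segment embeddings
)
print(combined.shape)  # (1, seq_len, hidden_size), before LayerNorm and dropout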


Why Tokenization Matters in BERT

  1. Handles Out-of-Vocabulary Words: Breaking words into subwords reduces issues with rare words.

  2. Optimized Context Understanding: WordPiece tokens allow BERT to handle root words, prefixes, and suffixes effectively.

  3. Shared Vocabulary: A single fixed WordPiece vocabulary covers many domains, and multilingual variants of BERT (e.g., bert-base-multilingual-cased) share one vocabulary across many languages.


Tools for Tokenization

  1. Hugging Face Transformers:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Hello world!")
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)  # ['hello', 'world', '!']
print(token_ids)  # [7592, 2088, 999]

  2. TensorFlow or PyTorch Implementations: These framework ports often include built-in tokenizers compatible with BERT models; the Hugging Face tokenizer can also return framework-specific tensors directly, as shown below.
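
A minimal sketch: return_tensors="pt" yields PyTorch tensors and return_tensors="tf" yields TensorFlow tensors, so the same tokenizer plugs into either framework.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Swap "pt" for "tf" to get TensorFlow tensors instead of PyTorch tensors
inputs = tokenizer("Hello world!", return_tensors="pt")
print(inputs["input_ids"])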

Summary: Understanding BERT Tokens in NLP

BERT (Bidirectional Encoder Representations from Transformers) uses tokenization to process text, breaking down raw input into smaller units called tokens. It relies on the WordPiece tokenization technique, which splits words into subwords to handle out-of-vocabulary words. Special tokens like [CLS], [SEP], and [PAD] are added to structure input sequences for specific NLP tasks.

The tokenization process involves:

  1. Cleaning and splitting text

  2. Converting tokens into integer IDs

  3. Adding padding or truncation where necessary

These tokens are then embedded into vectors, combining:

  • Token embeddings

  • Segment embeddings

  • Position embeddings

This combination represents each token's contextual meaning. BERT's tokenization approach makes it possible to process large and varied text efficiently, which is essential for tasks like:

  • Text classification

  • Question answering

  • Sentence pair analysis

Tools like the Hugging Face Transformers library simplify tokenization and integration with BERT models for practical NLP applications.

