Tokenization in LLMs

Tokenization

Tokenization is a critical preprocessing step in LLMs that converts raw text into smaller units called tokens. These tokens can represent whole words, subwords, characters, or even byte-level chunks, depending on the tokenization strategy. LLMs process these tokens instead of raw text, enabling efficient handling of vast datasets and complex language patterns.

How Tokenization Works in LLMs

  1. Text Splitting:

    • The input text is segmented into tokens using predefined rules or algorithms. For example, "#ChaiCode" might be tokenized into ['#', 'Chai', 'Code'] or subword tokens like ['#', 'Ch', 'ai', 'Co', 'de'].
  2. Numerical Encoding:

    • Each token is mapped to a unique numerical identifier using a vocabulary built during the tokenizer's training phase. For instance:

      • "Chai" → 12345

      • "Code" → 67890

  3. Embedding:

    • Tokens are converted into dense vectors (embeddings) that capture semantic and syntactic information. These embeddings are fed into the transformer model for processing (a minimal sketch of steps 1, 2, and 3 follows this list).
  4. Training and Preprocessing:

    • Tokenizers are trained separately from the LLM. They learn common token patterns from large corpora to optimize vocabulary size and efficiency.
  5. Decoding:

    • After processing, the model generates a sequence of tokens, which are decoded back into human-readable text.
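
The minimal sketch below ties steps 1, 2, and 3 together. The vocabulary, token IDs, and 4-dimensional embedding table are toy values made up for illustration; real LLMs use vocabularies of tens of thousands of tokens and much higher-dimensional embeddings.

import numpy as np

# Toy vocabulary: token string -> numerical ID (illustrative values only)
vocab = {"#": 0, "Chai": 1, "Code": 2, "<unk>": 3}

# Toy embedding table: one 4-dimensional vector per token ID
rng = np.random.default_rng(seed=0)
embedding_table = rng.normal(size=(len(vocab), 4))

def encode(tokens):
    """Step 2: map each token string to its numerical ID."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

def embed(token_ids):
    """Step 3: look up the dense vector for each token ID."""
    return embedding_table[token_ids]

tokens = ["#", "Chai", "Code"]    # step 1: text splitting
token_ids = encode(tokens)        # [0, 1, 2]
vectors = embed(token_ids)        # array of shape (3, 4), fed to the transformer

print("Token IDs:", token_ids)
print("Embedding shape:", vectors.shape)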

Tokenization Techniques in LLMs

  1. Word-Level Tokenization:

    • Splits text into individual words.

    • Example: "I love Chaicode" → ['I', 'love', 'Chaicode'].

    • Limitation: Inefficient for morphologically rich languages.

  2. Subword-Level Tokenization:

    • Breaks words into smaller, meaningful units using algorithms such as Byte-Pair Encoding (BPE) or WordPiece.

    • Example: "running" → ['run', 'ning'].

    • Advantage: Handles out-of-vocabulary (OOV) words effectively.

  3. Character-Level Tokenization:

    • Splits text into individual characters.

    • Example: "Hello" → ['H', 'e', 'l', 'l', 'o'].

    • Used in niche applications, but the very long sequences it produces make it less efficient for large datasets.

  4. Byte-Level Tokenization:

    • Operates on raw byte sequences, making it language-agnostic (a short sketch comparing these granularities in plain Python follows this list).

    • Example: GPT-2 uses byte-level Byte-Pair Encoding with a fixed vocabulary of 50,257 tokens.
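
As a quick illustration of the word-, character-, and byte-level granularities, the sketch below uses only plain Python; the sample sentence is assumed for illustration. Subword splits require a trained tokenizer, which the tiktoken example in the next section demonstrates.

text = "I love ChaiCode"

# Word-level: split on whitespace
word_tokens = text.split()                # ['I', 'love', 'ChaiCode']

# Character-level: one token per character
char_tokens = list(text)                  # ['I', ' ', 'l', 'o', 'v', ...]

# Byte-level: operate on raw UTF-8 byte values (language-agnostic)
byte_tokens = list(text.encode("utf-8"))  # [73, 32, 108, 111, 118, ...]

print("Word tokens:", word_tokens)
print("Character tokens:", char_tokens)
print("Byte tokens:", byte_tokens)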

Code Example: Tokenization in LLMs

Using OpenAI's tiktoken package in Python:

import tiktoken

# Load the "cl100k_base" encoding (used by the GPT-3.5/GPT-4 model family)
tokenizer = tiktoken.get_encoding("cl100k_base")

text = "Large Language Models are revolutionizing AI!"

# Encode: text -> list of integer token IDs
tokens = tokenizer.encode(text)
print("Tokens:", tokens)

# Decode: token IDs -> the original text
decoded_text = tokenizer.decode(tokens)
print("Decoded Text:", decoded_text)

