What is Tokenization?
Tokenization is the process of breaking down a sequence of text into smaller units, called "tokens". These tokens could be words, subwords, characters, or even phrases, depending on the specific tokenization strategy used. Tokenization is a fundamental step in natural language processing (NLP) tasks, as it converts raw text data into a format that can be processed by machine learning models.
Different types of tokenization:
Word Tokenization: This is the most common tokenization strategy, where the text is split into individual words based on whitespace or punctuation. For example, the sentence "The quick brown fox jumps" would be tokenized into ["The", "quick", "brown", "fox", "jumps"].
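As a quick illustration (a minimal sketch using Python's built-in re module, not any particular NLP library), whitespace- and punctuation-based word tokenization can look like this:

```python
import re

sentence = "The quick brown fox jumps"

# \w+ grabs runs of word characters; [^\w\s] keeps punctuation marks as separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)  # ['The', 'quick', 'brown', 'fox', 'jumps']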
Subword Tokenization: In subword tokenization, words are broken down into smaller units, typically based on their frequency or morphology. This is particularly useful for handling out-of-vocabulary words and languages with complex morphology. Popular subword tokenization algorithms include Byte Pair Encoding (BPE) and SentencePiece.
Example of BPE:
Let's say we have the following corpus of words: ["low", "lower", "newest", "widest", "running", "lows"]
Initialization: Initially, each character that appears in the corpus is treated as a subword.
Vocabulary: {'l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', 'u', 'g'}
Merge Most Frequent Pair: Count how often each adjacent pair of subwords occurs across the corpus and merge the most frequent pair into a new subword.
Iteration 1: Merge 'l' and 'o' -> 'lo'. Vocabulary: {'l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', 'u', 'g', 'lo'}
Iteration 2: Merge 'lo' and 'w' -> 'low'. Vocabulary: {'l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', 'u', 'g', 'lo', 'low'}
Repeat Merging: Repeat the merging process for a fixed number of merges or until the vocabulary reaches a predefined size.
Iteration 3: Merge 'e' and 's' -> 'es'. Vocabulary: {'l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', 'u', 'g', 'lo', 'low', 'es'}
Iteration 4: Merge 'es' and 't' -> 'est'. Vocabulary: {'l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', 'u', 'g', 'lo', 'low', 'es', 'est'}
Tokenization: Tokenize new words using the learned subword vocabulary.
For example:
'lower' would be tokenized as ['low', 'e', 'r'].
A new word such as 'lowest', which never appeared in the corpus, would be tokenized as ['low', 'est'].
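To make the merge loop concrete, here is a minimal from-scratch sketch of the procedure above in plain Python (an illustration only, not the optimized algorithm used by production tokenizers; ties between equally frequent pairs are broken arbitrarily here):

```python
from collections import Counter

# Toy corpus from the walkthrough above; each word starts out as a list of characters.
corpus = ["low", "lower", "newest", "widest", "running", "lows"]
words = [list(word) for word in corpus]

def most_frequent_pair(words):
    """Count adjacent subword pairs across the corpus and return the most frequent one."""
    pairs = Counter()
    for word in words:
        for left, right in zip(word, word[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with the merged subword."""
    merged_corpus = []
    for word in words:
        merged, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                merged.append(word[i] + word[i + 1])
                i += 2
            else:
                merged.append(word[i])
                i += 1
        merged_corpus.append(merged)
    return merged_corpus

num_merges = 4  # stop after a fixed number of merges (or at a target vocabulary size)
for step in range(num_merges):
    pair = most_frequent_pair(words)
    print(f"Iteration {step + 1}: merge {pair[0]!r} + {pair[1]!r}")
    words = merge_pair(words, pair)

print(words)  # 'lower' ends up as ['low', 'e', 'r'] and 'widest' as ['w', 'i', 'd', 'est']
```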
Here's a Python example of BPE using the tokenizers library:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on a text file ("corpus.txt" is a placeholder path).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=1000, min_frequency=2)

encoded = tokenizer.encode("lower")
print(encoded.tokens)  # e.g. ['low', 'er']; the exact split depends on the training corpus
```
Character Tokenization: In character tokenization, each character in the text is treated as a separate token. This strategy is useful for tasks where character-level information is important, such as text generation or spelling correction. It is a good choice when you want to preserve the smallest units of text and word boundaries are not important.
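A minimal sketch in plain Python: character tokenization amounts to splitting the string into its individual characters.

```python
text = "fox jumps"

# Every character, including the space, becomes its own token.
tokens = list(text)
print(tokens)  # ['f', 'o', 'x', ' ', 'j', 'u', 'm', 'p', 's']
```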
Phrasal Tokenization: Phrasal tokenization involves identifying and tokenizing multi-word phrases or expressions as single units. This can be useful for preserving the meaning of idiomatic expressions or named entities.
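One way to sketch this is with NLTK's MWETokenizer, which merges a predefined list of multi-word expressions into single tokens (the phrase list below is just an illustration):

```python
from nltk.tokenize import MWETokenizer

# Phrases we want to keep together as single tokens (illustrative list).
tokenizer = MWETokenizer([("New", "York"), ("machine", "learning")], separator=" ")

tokens = tokenizer.tokenize("I study machine learning in New York".split())
print(tokens)  # ['I', 'study', 'machine learning', 'in', 'New York']
```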
Tokenization Libraries:
NLTK (Natural Language Toolkit)
SpaCy
Hugging Face's Tokenizers library
Stanford CoreNLP
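For a quick taste of the first two (assuming NLTK's punkt tokenizer data and spaCy's small English model have already been downloaded), basic usage looks roughly like this:

```python
# NLTK: requires nltk.download('punkt') beforehand.
from nltk.tokenize import word_tokenize
print(word_tokenize("The quick brown fox jumps."))
# ['The', 'quick', 'brown', 'fox', 'jumps', '.']

# spaCy: requires `python -m spacy download en_core_web_sm` beforehand.
import spacy
nlp = spacy.load("en_core_web_sm")
print([token.text for token in nlp("The quick brown fox jumps.")])
# ['The', 'quick', 'brown', 'fox', 'jumps', '.']
```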
Challenges with Tokenization:
Tokenization can be challenging, especially for languages with complex morphology, ambiguous word boundaries, or noisy text data. It requires careful handling of punctuation, special characters, and language-specific rules.
Each token is then represented by a unique numerical identifier
This step converts the textual tokens (words, subwords, or characters) into numerical values that can be processed by machine learning models.
In NLP tasks, textual data needs to be converted into numerical format because most machine learning algorithms operate on numerical data. Each token in the text is assigned a unique numerical identifier, typically an integer, which allows the model to understand and process the text.
Using the example: "The quick brown fox jumps."
After tokenization, the tokens might be represented as follows:
"The" → 1
"quick" → 2
"brown" → 3
"fox" → 4
"jumps" → 5
Each token in the sentence is mapped to a unique numerical identifier. During the encoding process, the model replaces each token with its corresponding numerical value, creating a sequence of numerical tokens that can be fed into the machine learning model for further processing.
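As a minimal sketch in plain Python (using the toy mapping above), the encoding step is essentially a dictionary lookup:

```python
# Toy vocabulary from the mapping above.
vocab = {"The": 1, "quick": 2, "brown": 3, "fox": 4, "jumps": 5}

tokens = ["The", "quick", "brown", "fox", "jumps"]

# Replace each token with its numerical identifier.
token_ids = [vocab[token] for token in tokens]
print(token_ids)  # [1, 2, 3, 4, 5]
```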
This numerical representation allows the model to learn patterns and relationships within the text data and make predictions or perform tasks such as classification, generation, or translation.