20 Hugging Face Tokenizers Concepts with Examples
Table of contents
- 1. Installing Hugging Face Tokenizers
- 2. Loading a Pre-trained Tokenizer
- 3. Tokenizing and Decoding Text
- 4. Handling Tokenizer Special Tokens
- 5. Batch Tokenization for Multiple Sentences
- 6. Adding Custom Tokens to a Pre-trained Tokenizer
- 7. Tokenizing Text with Attention Masks
- 8. Padding and Truncating Sequences to a Fixed Length
- 9. Saving and Loading Custom Tokenizers
- 10. Using Byte-Pair Encoding (BPE) Tokenization
- 11. Loading Tokenizer from Hub
- 12. Detokenizing: Converting IDs Back to Text
- 13. Fast Tokenizers: Speeding Up Tokenization
- 14. Padding Tokenized Inputs for Batch Processing
- 15. Custom Pre-tokenization Rules with Whitespace or Splitters
- 16. Training a New Tokenizer from Scratch
- 17. Subword Tokenization with WordPiece
- 18. Using Byte-Level BPE Tokenization
- 19. Post-processing: Adding Special Tokens Automatically
- 20. Managing Padding and Truncation Dynamically
1. Installing Hugging Face Tokenizers
Boilerplate Code:
pip install tokenizers
Use Case: Install the Hugging Face Tokenizers library to tokenize text data efficiently.
Goal: Set up the tokenizers library to quickly tokenize and process large text datasets.
Sample Code:
pip install tokenizers
Before Example: You manually tokenize text using regular expressions or basic string manipulations, which can be slow.
# Manually splitting text into words:
tokens = text.split(" ")
After Example: With Hugging Face tokenizers, tokenization is highly optimized and much faster.
Successfully installed tokenizers
# Hugging Face Tokenizers library installed and ready for use.
2. Loading a Pre-trained Tokenizer
Boilerplate Code:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
Use Case: Load a pre-trained tokenizer to tokenize text data according to a specific model (e.g., BERT).
Goal: Tokenize text using the same tokenizer that was used to train a pre-trained model.
Sample Code:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer("Hugging Face is great!")
print(tokens)
Before Example: You use simple tokenization methods that don't match the format needed for pre-trained models.
# Manually tokenizing text:
tokens = text.split(" ")
After Example: With a pre-trained tokenizer, the text is tokenized in a way that matches the model's expected input format.
{'input_ids': [101, 17662, 2227, 2003, 2307, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
# Text tokenized using the BERT tokenizer.
3. Tokenizing and Decoding Text
Boilerplate Code:
tokens = tokenizer.encode("I love NLP", return_tensors="pt")
decoded = tokenizer.decode(tokens[0], skip_special_tokens=True)
Use Case: Tokenize text into input IDs and later decode them back to the original text.
Goal: Convert text to model-readable token IDs and then decode them back into human-readable text.
Sample Code:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.encode("I love NLP", return_tensors="pt")
decoded = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(decoded)
Before Example: You manually map tokens to numbers and struggle with converting tokens back into text.
# Manually mapping words to IDs:
tokens = [word_to_id[word] for word in sentence]
After Example: With tokenizers, encoding and decoding text is handled automatically.
i love nlp
# Text decoded back into readable form; skip_special_tokens=True drops [CLS]/[SEP], and bert-base-uncased lowercases its input.
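Under the hood, encode bundles two steps that you can also run separately, which is handy for debugging; a minimal sketch of the pipeline:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Step 1: split the text into wordpiece tokens
pieces = tokenizer.tokenize("I love NLP")        # ['i', 'love', 'nlp']
# Step 2: map each token to its vocabulary ID
ids = tokenizer.convert_tokens_to_ids(pieces)    # [1045, 2293, 17953]
# And back: IDs -> tokens -> text
print(tokenizer.convert_ids_to_tokens(ids))      # ['i', 'love', 'nlp']
print(tokenizer.decode(ids))                     # i love nlp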
4. Handling Tokenizer Special Tokens
What are Special Tokens?
Special tokens are extra tokens that certain models like BERT require to perform tasks like text classification, sequence-to-sequence tasks, or sentence pair classification.
These tokens help models understand the structure of input sequences. The most common special tokens are:
[CLS]: Marks the beginning of a sequence. BERT uses this token for classification tasks.
[SEP]: Marks the end of a sequence or separates two sequences (for sentence pair tasks).
[PAD]: Used for padding shorter sequences to the same length during batching.
Why are they needed?
These tokens ensure that models like BERT can handle tasks like classification, question answering, or sentence pair tasks by marking the beginning and end of sequences.
Without these tokens, the model might not understand where the input starts or ends, which could lead to poor performance or even errors.
Analogy:
Imagine you're writing a formal letter. You need a greeting (like "Dear [Name]") to introduce the letter, and a closing (like "Sincerely, [Your Name]") to end it. Without these formal markers, your letter would seem incomplete or confusing.
[CLS] is like the greeting of your letter, marking the start.
[SEP] is like the closing, signaling the end.
[PAD] is like adding blank space at the bottom of the page to make sure all letters are the same length (for models that process multiple letters at once).
Special Tokens Example with Human-Friendly Output:
Imagine we are working with the sentence "I love NLP". The special tokens [CLS] and [SEP] are added automatically when tokenizing this input for a model like BERT.
Code:
from transformers import AutoTokenizer
# Load BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Tokenize a sentence with special tokens
tokens = tokenizer("I love NLP", add_special_tokens=True)
# Print special tokens and the resulting tokenized sequence
print("Special tokens:", tokenizer.cls_token, tokenizer.sep_token)
print("Tokenized output:", tokens)
Expected Output (with Analogy):
Special tokens: [CLS] [SEP]
Tokenized output:
{
'input_ids': [101, 1045, 2293, 17953, 102],
'token_type_ids': [0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1]
}
Explanation:
- [CLS] (101): Like the greeting of a letter, it marks the beginning of the sentence, so the model knows where the input starts.
- [SEP] (102): Like the closing of a letter, it marks the end of the sentence, so the model knows where it finishes.
- The numbers 1045, 2293, 17953 are the token IDs for the words "I love NLP".
- attention_mask: Each 1 means the token should be processed, helping the model focus on real words rather than padding.
Here's how the input looks:
Original sentence:
"I love NLP"
Formatted sentence with special tokens:
"[CLS] I love NLP [SEP]"
Using AutoTokenizer:
tokens = tokenizer("I love NLP", add_special_tokens=True)
Output:
{'input_ids': [101, 1045, 2293, 17953, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}
[CLS] and [SEP] are added automatically, so you won't accidentally forget the special tokens, and the model will correctly interpret where the sentence starts and ends.
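[SEP] matters most for sentence-pair tasks: pass two texts and the tokenizer inserts [SEP] between them and marks the segments with token_type_ids. A minimal sketch:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Tokenize a sentence pair: [CLS] A [SEP] B [SEP]
pair = tokenizer("I love NLP", "Transformers are amazing")
print(tokenizer.decode(pair["input_ids"]))
# [CLS] i love nlp [SEP] transformers are amazing [SEP]
print(pair["token_type_ids"])  # 0s for the first sentence, 1s for the second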
5. Batch Tokenization for Multiple Sentences
What is Batch Tokenization?
- Batch tokenization is the process of converting multiple sentences or paragraphs into token IDs simultaneously, while ensuring that all sequences are the same length. This is achieved through padding (adding extra tokens to shorter sequences) and truncation (shortening longer sequences).
Why is it needed?
- Models like BERT require all input sequences in a batch to be of equal length. Padding ensures that shorter sentences match the length of the longest one, while truncation shortens longer sequences to a specified limit. This way, the model can process all inputs efficiently in one go.
Boilerplate Code:
tokenized_batch = tokenizer(
["I love NLP", "Transformers are amazing"],
padding=True, truncation=True, return_tensors="pt"
)
- return_tensors="pt" means the output will be in PyTorch tensor format, which is useful for direct input into models.
Sample Code:
from transformers import AutoTokenizer
# Load BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Sentences to be tokenized
sentences = ["I love NLP", "Transformers are amazing"]
# Tokenize the batch of sentences with automatic padding and truncation
tokenized_batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
# Print the tokenized output
print(tokenized_batch)
Expected Output:
{
'input_ids': tensor([[  101,  1045,  2293, 17953,   102],
        [  101, 19081,  2024,  6429,   102]]),
'attention_mask': tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]])
}
input_ids: The token IDs for each sentence. 101 is the [CLS] token and 102 is the [SEP] token. Both sentences here happen to produce exactly five tokens, so padding=True adds no [PAD] tokens (it only pads shorter sequences up to the longest one in the batch).
attention_mask: This tells the model which tokens are actual content (1) and which are padding (0). Since no padding was needed, every position is marked 1.
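Because both sentences above happen to tokenize to the same length, padding=True had nothing to do. Here is a minimal sketch with genuinely uneven lengths (the extended second sentence is just for illustration), where the [PAD] tokens and mask zeros do appear:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(
    ["I love NLP", "Transformers are amazing and very flexible"],
    padding=True, return_tensors="pt",
)
print(batch["input_ids"].shape)    # both rows share the length of the longest sequence
print(batch["attention_mask"][0])  # trailing 0s mark the [PAD] positions of the shorter one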
6. Adding Custom Tokens to a Pre-trained Tokenizer
What is Adding Custom Tokens?
Adding custom tokens allows you to extend a pre-trained tokenizer to handle domain-specific words, jargon, or abbreviations that were not part of the original tokenizer's vocabulary. This is useful when working with specialized datasets (e.g., medical or legal documents).
Why is it needed?
In many cases, the pre-trained tokenizer may not recognize new or specific terms that are critical to your task. By adding these custom tokens, you ensure that your model can accurately process and understand these unique terms.
Boilerplate Code:
tokenizer.add_tokens(["newtoken1", "newtoken2"])
Sample Code:
from transformers import AutoTokenizer
# Load BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Add custom tokens
tokenizer.add_tokens(["nlp_token", "deep_learning"])
# Check that the tokens were added (add_tokens additions appear in the added vocab)
print(f"Added tokens: {tokenizer.get_added_vocab()}")
# Tokenize a sentence containing the new tokens
tokens = tokenizer("I love nlp_token and deep_learning", add_special_tokens=True)
# Print the tokenized output
print(tokens)
Expected Output:
Added tokens: {'nlp_token': 30522, 'deep_learning': 30523}
{
'input_ids': [101, 1045, 2293, 30522, 1998, 30523, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1]
}
input_ids: The token IDs for each word in the sentence. The custom tokens "nlp_token" and "deep_learning" are represented by newly assigned IDs (30522 and 30523, appended after BERT's original vocabulary of 30,522 entries).
attention_mask: All tokens are marked as important (1), meaning no padding was added.
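One caveat worth knowing: if you pass the extended tokenizer to a model, the model's embedding matrix must be resized to cover the new IDs, or lookups will index out of range. A minimal sketch (the BERT model here is just for illustration):
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
num_added = tokenizer.add_tokens(["nlp_token", "deep_learning"])
# Grow the embedding matrix so the new token IDs have vectors
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocabulary size: {len(tokenizer)}")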
7. Tokenizing Text with Attention Masks
What is Tokenizing with Attention Masks?
When tokenizing text, attention masks help the model understand which tokens should be attended to (actual words) and which ones should be ignored (like padding tokens). This is especially important when dealing with batches of sentences of different lengths, as some sentences will need padding.
Why is it needed?
In transformer-based models like BERT, attention masks allow the model to focus on the actual content and ignore any padded tokens that were added to match the sequence length. This prevents the model from "learning" patterns from the padding.
Boilerplate Code:
tokenized_output = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
attention_mask = tokenized_output['attention_mask']
Sample Code:
from transformers import AutoTokenizer
# Load BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Sentences to be tokenized
sentences = ["I love NLP", "Transformers are amazing"]
# Tokenize the batch of sentences with automatic padding and truncation, return attention masks
tokenized_output = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
# Print the tokenized output with attention masks
print(tokenized_output['input_ids'])
print(tokenized_output['attention_mask'])
Expected Output:
tensor([[  101,  1045,  2293, 17953,   102],
        [  101, 19081,  2024,  6429,   102]])
tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]])
input_ids: The token IDs for the sentences. Both sentences happen to be the same length here, so no padding tokens (0) were added.
attention_mask:
- 1: Pay attention to this token.
- 0: Ignore this token (padding).
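A quick way to convince yourself the mask is consistent: every 0 in the attention mask should sit exactly where the tokenizer placed its [PAD] token. A minimal sketch (the longer second sentence is just to force some padding):
import torch
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(
    ["I love NLP", "Transformers are amazing and very flexible"],
    padding=True, return_tensors="pt",
)
# Positions where the mask is 0 must be exactly the [PAD] positions
is_pad = batch["input_ids"] == tokenizer.pad_token_id
assert torch.equal(is_pad, batch["attention_mask"] == 0)
print("attention_mask zeros line up with [PAD] tokens")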
8. Padding and Truncating Sequences to a Fixed Length
Boilerplate Code:
tokenized_batch = tokenizer(sentences, padding='max_length', truncation=True, max_length=8, return_tensors="pt")
padding='max_length': Ensures that every sentence is padded to a specified maximum length.
truncation=True: Cuts off longer sequences to fit within the max length.
max_length=8: The specified maximum length for sequences.
Sample Code:
from transformers import AutoTokenizer
# Load BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Sentences to be tokenized
sentences = ["I love NLP", "Transformers are amazing but complex"]
# Tokenize the batch of sentences with padding and truncation to max length of 8 tokens
tokenized_batch = tokenizer(sentences, padding='max_length', truncation=True, max_length=8, return_tensors="pt")
# Print the tokenized output
print(tokenized_batch['input_ids'])
print(tokenized_batch['attention_mask'])
Expected Output:
tensor([[ 101, 1045, 2293, 17953, 102, 0, 0, 0],
[ 101, 19081, 2024, 6429, 2021, 3372, 102, 0]])
tensor([[1, 1, 1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 0]])
input_ids:
- The first sequence "I love NLP" was padded with three [PAD] tokens (0) to reach the max length of 8.
- The second sequence "Transformers are amazing but complex" produces seven tokens, so one [PAD] token was added; truncation=True would only cut it down if it exceeded 8 tokens.
attention_mask:
The 1s indicate tokens that the model should focus on (actual words and special tokens like [CLS] and [SEP]).
The 0s represent the padding tokens, which should be ignored by the model.
Both batch tokenization and padding/truncating sequences are closely related concepts, but they have subtle differences in emphasis. Let's compare them:
Key Differences:
Batch Tokenization: It's primarily about processing multiple inputs (a batch) and ensuring all inputs are padded/truncated based on the longest sequence in that batch. Example: in a batch where one sentence tokenizes to fewer tokens than another, the shorter one is padded to match the longer.
Padding/Truncation to a Fixed Length: The emphasis here is on ensuring every input (or batch) has a specific length (e.g., 8 tokens), regardless of the original sentence length. This might involve truncating longer sequences or padding shorter ones.
Similarities:
Both involve padding and truncation to ensure input sequences are of equal length.
Both use tokenizers to automatically manage these tasks.
Summary:
Batch Tokenization is about tokenizing multiple sequences and ensuring they are of equal length for batching.
Padding/Truncating to Fixed Length ensures that all sequences (even if single) are padded/truncated to a predefined maximum length.
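The difference is easiest to see side by side; a minimal sketch running both strategies on the same batch:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentences = ["I love NLP", "Transformers are amazing but complex"]
# Dynamic: pad to the longest sequence in this particular batch
dynamic = tokenizer(sentences, padding=True, return_tensors="pt")
# Fixed: pad (and truncate) every sequence to exactly 8 tokens
fixed = tokenizer(sentences, padding="max_length", truncation=True,
                  max_length=8, return_tensors="pt")
print(dynamic["input_ids"].shape)  # width = longest sequence in the batch
print(fixed["input_ids"].shape)    # width = 8, regardless of the batch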
9. Saving and Loading Custom Tokenizers
Boilerplate Code:
# Save tokenizer
tokenizer.save_pretrained('./my_custom_tokenizer')
# Load tokenizer
custom_tokenizer = AutoTokenizer.from_pretrained('./my_custom_tokenizer')
Sample Code:
from transformers import AutoTokenizer
# Load a pre-trained BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Add custom tokens to the tokenizer
tokenizer.add_tokens(["custom_token1", "custom_token2"])
# Save the modified tokenizer to a directory
tokenizer.save_pretrained('./my_custom_tokenizer')
# Load the custom tokenizer from the saved directory
custom_tokenizer = AutoTokenizer.from_pretrained('./my_custom_tokenizer')
# Check if the custom tokens were preserved
print(f"Custom tokens: {custom_tokenizer.get_added_vocab()}")
# Tokenize a sentence with the custom tokenizer
tokens = custom_tokenizer("I love custom_token1 and custom_token2", add_special_tokens=True)
print(tokens)
Expected Output:
Custom tokens: {'custom_token1': 30522, 'custom_token2': 30523}
{
'input_ids': [101, 1045, 2293, 30522, 1998, 30523, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1]
}
Custom tokens: The tokenizer preserved your custom tokens (custom_token1, custom_token2) after saving and loading.
input_ids: The custom tokens kept their assigned IDs (30522 and 30523), showing that the tokenizer still recognizes them after being reloaded.
attention_mask: All tokens are marked as important (1), meaning there's no padding.
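save_pretrained writes the tokenizer out as a handful of plain files (vocabulary, configuration, special-token map), which is what makes it easy to version and share; a quick way to inspect what was written:
import os
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained("./my_custom_tokenizer")
# Typically includes tokenizer_config.json, special_tokens_map.json and the
# vocabulary files; the exact set depends on the tokenizer type.
print(sorted(os.listdir("./my_custom_tokenizer")))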
10. Using Byte-Pair Encoding (BPE) Tokenization
What is Byte-Pair Encoding (BPE) Tokenization?
BPE Tokenization is a subword tokenization technique that splits words into subword units based on their frequency. This helps models handle rare words and out-of-vocabulary words by breaking them into smaller, more common subwords.
Example:
Let's take the word "unhappiness".
A regular tokenizer might treat this as a single token.
BPE tokenization splits it into:
"un" (prefix)
"happiness" (root word)
But if "happiness" is too rare, BPE might even split it further:
"hap"
"pi"
"ness"
This way, the model can process each smaller piece, even if it's never seen the whole word before.
Boilerplate Code:
from transformers import GPT2Tokenizer
# Load GPT-2 tokenizer (which uses BPE)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
Sample Code:
from transformers import GPT2Tokenizer
# Load GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Example sentence
sentence = "Tokenization is awesome, right?"
# Tokenize the sentence using BPE
tokenized_output = tokenizer(sentence, add_special_tokens=True)
# Print the tokens and their corresponding IDs
print("Tokens:", tokenizer.convert_ids_to_tokens(tokenized_output['input_ids']))
print("Token IDs:", tokenized_output['input_ids'])
Expected Output:
Tokens: ['Token', 'ization', 'Ġis', 'Ġawesome', ',', 'Ġright', '?']
Token IDs: [19204, 3034, 318, 10433, 11, 4283, 30]
Tokens:
- The word "Tokenization" is split into ['Token', 'ization'] by BPE.
- The Ġ prefix on tokens like Ġis and Ġawesome marks a leading space, i.e., the start of a new word.
- Note that GPT-2's tokenizer adds no special tokens by default; <|endoftext|> only appears when you insert it yourself to mark document boundaries.
Token IDs: The unique numbers assigned to each token, which are fed into the model for processing.
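To watch the frequency-driven splitting from the "unhappiness" example, compare a common word with a rarer one; the exact splits depend on GPT-2's learned merges, so treat the results as illustrative:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# A frequent word usually survives as a single token
print(tokenizer.tokenize("happiness"))
# A rarer word is broken into smaller, more frequent subwords
print(tokenizer.tokenize("unhappiness"))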
11. Loading Tokenizer from Hub
Boilerplate Code:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
Use Case: Load a tokenizer from Hugging Face's Model Hub to ensure compatibility with a specific model.
Goal: Download and initialize a tokenizer from a pre-trained model available on the Hugging Face Hub.
Sample Code:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer("Hugging Face is awesome!", return_tensors="pt")
print(tokens)
Output: The tokenizer is downloaded from the Hub on first use (and cached locally), then tokenizes the input with GPT-2's vocabulary.
{'input_ids': tensor([[...]]), 'attention_mask': tensor([[1, 1, ...]])}
# GPT-2 tokenizer loaded from the Hub; the exact token IDs depend on its vocabulary.
12. Detokenizing: Converting IDs Back to Text
Boilerplate Code:
decoded_text = tokenizer.decode(token_ids, skip_special_tokens=True)
Use Case: Convert token IDs back into human-readable text after tokenization and model inference.
Goal: Detokenize a sequence of token IDs back into a string of readable text.
Sample Code:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer("Hugging Face is awesome!", return_tensors="pt")
decoded_text = tokenizer.decode(tokens["input_ids"][0], skip_special_tokens=True)
print(decoded_text)
Output: With Hugging Face tokenizers, detokenization is handled automatically, producing clean text.
hugging face is awesome!
# Token IDs decoded back into readable text (lowercased, since bert-base-uncased lowercases its input).
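decode keeps special tokens unless you tell it not to, which is why skip_special_tokens=True is used above; a minimal sketch of the difference:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer("Hugging Face is awesome!")["input_ids"]
print(tokenizer.decode(ids))
# [CLS] hugging face is awesome! [SEP]
print(tokenizer.decode(ids, skip_special_tokens=True))
# hugging face is awesome!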
13. Fast Tokenizers: Speeding Up Tokenization
Boilerplate Code:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
Use Case: Use a fast, Rust-backed version of the tokenizer that dramatically speeds up tokenization.
Goal: Speed up tokenization using the Fast tokenizer implementation (note that AutoTokenizer already returns the fast version by default in recent versions of transformers, when one exists).
Sample Code:
from transformers import AutoTokenizer
# Use the fast tokenizer implementation
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
tokens = tokenizer("Hugging Face is great!", return_tensors="pt")
print(tokens)
Output: With Fast tokenizers, tokenization is significantly faster without losing accuracy.
{'input_ids': tensor([[  101, 17662,  2227,  2003,  2307,   999,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
# Fast tokenization of the input text using the optimized tokenizer.
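Beyond raw speed, fast tokenizers expose alignment information that the pure-Python tokenizers can't, such as the character span each token came from; a minimal sketch:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
# return_offsets_mapping is supported only by fast tokenizers
enc = tokenizer("Hugging Face is great!", return_offsets_mapping=True)
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
for token, span in zip(tokens, enc["offset_mapping"]):
    print(token, span)  # special tokens map to (0, 0)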
14. Padding Tokenized Inputs for Batch Processing
Boilerplate Code:
tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
Use Case: Automatically pad tokenized inputs to the maximum sequence length in a batch for easier batching and processing.
Goal: Ensure all tokenized sequences in a batch have the same length by adding padding tokens where necessary.
Sample Code:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
texts = ["I love NLP", "Transformers are amazing"]
tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
print(tokens["input_ids"], tokens["attention_mask"])
Output: Padding and attention masks are handled automatically. These two sentences happen to produce the same number of tokens, so no [PAD] tokens appear here; with uneven lengths, trailing 0s would show up in both tensors.
tensor([[  101,  1045,  2293, 17953,   102], [  101, 19081,  2024,  6429,   102]])
tensor([[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]])
# Sequences tokenized as a batch, with attention masks generated for processing.
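A related knob is pad_to_multiple_of, which rounds the padded width up to a multiple of a given number (some hardware, such as tensor cores, prefers such sizes); a minimal sketch:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
texts = ["I love NLP", "Transformers are amazing"]
# Pad to the longest sequence, then round the width up to a multiple of 8
tokens = tokenizer(texts, padding=True, pad_to_multiple_of=8, return_tensors="pt")
print(tokens["input_ids"].shape)  # sequence length is a multiple of 8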
15. Custom Pre-tokenization Rules with Whitespace or Splitters
Boilerplate Code:
from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()
Use Case: Apply custom pre-tokenization rules (e.g., splitting based on whitespace or punctuation) before applying subword tokenization.
Goal: Customize how the input text is split into tokens before further processing, enabling more control over tokenization.
Sample Code:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
# Create a custom tokenizer; a tiny [UNK]-only vocab plus added words keeps the example runnable
tokenizer = Tokenizer(WordPiece(vocab={"[UNK]": 0}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.add_tokens(["Hugging", "Face", "is", "awesome", "!"])
encoded = tokenizer.encode("Hugging Face is awesome!")
print(encoded.tokens)
With Hugging Face tokenizers, you can define pre-tokenization rules for greater flexibility.
['Hugging', 'Face', 'is', 'awesome', '!']
# Text pre-tokenized using whitespace splitting.
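You can also inspect a pre-tokenizer in isolation, before any model is involved; pre_tokenize_str returns the pieces along with their character offsets:
from tokenizers.pre_tokenizers import Whitespace
pre_tokenizer = Whitespace()
print(pre_tokenizer.pre_tokenize_str("Hugging Face is awesome!"))
# [('Hugging', (0, 7)), ('Face', (8, 12)), ('is', (13, 15)),
#  ('awesome', (16, 23)), ('!', (23, 24))]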
16. Training a New Tokenizer from Scratch
Boilerplate Code:
from tokenizers import Tokenizer, models, trainers
# Initialize tokenizer with BPE model
tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(vocab_size=30000, min_frequency=2)
# Train tokenizer on your dataset
tokenizer.train(files=["your_data.txt"], trainer=trainer)
Use Case: Train a brand new tokenizer from scratch using your own dataset.
Goal: Build a tokenizer that is specifically tailored to your dataset by training it from raw text.
Sample Code:
from tokenizers import Tokenizer, models, trainers
from tokenizers.pre_tokenizers import Whitespace
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = Whitespace()  # split on whitespace/punctuation before learning merges
trainer = trainers.BpeTrainer(vocab_size=30000, min_frequency=2)
# Assuming you have a file with your dataset
tokenizer.train(files=["your_data.txt"], trainer=trainer)
print(tokenizer.get_vocab_size())
With Hugging Face tokenizers, you can train a tokenizer on your own dataset, tailored to its vocabulary and structure.
30000
# Tokenizer trained on your data with a vocabulary of up to 30,000 tokens (fewer if the data can't support that many merges).
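Once trained, the tokenizer can be serialized to a single self-contained JSON file and reloaded anywhere; a minimal sketch continuing from the training code above:
from tokenizers import Tokenizer
# Persist the trained tokenizer to one file
tokenizer.save("my_tokenizer.json")
# Reload it later and tokenize new text
reloaded = Tokenizer.from_file("my_tokenizer.json")
print(reloaded.encode("Some new text").tokens)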
17. Subword Tokenization with WordPiece
Boilerplate Code:
from tokenizers import Tokenizer, models
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
Use Case: Tokenize text into subword units using the WordPiece algorithm, which is widely used in models like BERT.
Goal: Break words into smaller subwords or characters when they are not part of the tokenizer's vocabulary.
Sample Code:
from tokenizers import Tokenizer, models
from tokenizers.pre_tokenizers import Whitespace
# A tiny vocab containing only [UNK], so words outside the added tokens still encode
tokenizer = Tokenizer(models.WordPiece(vocab={"[UNK]": 0}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.add_tokens(["hugging", "face", "is", "awesome", "!"])
encoded = tokenizer.encode("Hugging Face is awesome!")
print(encoded.tokens)
With WordPiece, unknown or rare words fall back to subword units or, failing that, [UNK]. Added tokens are matched case-sensitively, so the capitalized "Hugging" and "Face" don't match the lowercase entries:
['[UNK]', '[UNK]', 'is', 'awesome', '!']
# Text tokenized using the WordPiece model; unknown words are represented by [UNK].
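BERT's own WordPiece vocabulary shows the characteristic '##' prefix that marks a subword continuing the previous piece; the exact split depends on the vocabulary, so the comment is illustrative:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Pieces after the first carry '##' to show they continue the same word
print(tokenizer.tokenize("unhappiness"))  # e.g. ['un', '##hap', '##pi', '##ness']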
18. Using Byte-Level BPE Tokenization
Boilerplate Code:
from tokenizers import ByteLevelBPETokenizer
# ByteLevelBPETokenizer is a ready-made tokenizer class, not a model to wrap in Tokenizer
tokenizer = ByteLevelBPETokenizer()
Use Case: Tokenize text at the byte level, enabling the tokenizer to handle any input text, even rare characters or non-standard inputs.
Goal: Use byte-level tokenization to create robust tokenizers that can process any text, including emojis, symbols, or different languages.
Sample Code:
from tokenizers import ByteLevelBPETokenizer
# Initialize the byte-level BPE tokenizer and train it on a tiny in-memory corpus
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(["Hugging Face is awesome!"], vocab_size=300, min_frequency=1)
encoded = tokenizer.encode("Hugging Face is awesome! 🚀")
print(encoded.tokens)
Byte-level BPE can represent any text, including special characters and emojis, because every possible byte is part of its base alphabet.
# Spaces appear as 'Ġ' in the byte-level representation, and the rocket emoji is
# spelled out as its UTF-8 byte tokens instead of becoming an unknown token.
19. Post-processing: Adding Special Tokens Automatically
Boilerplate Code:
from tokenizers import Tokenizer, processors
tokenizer.post_processor = processors.TemplateProcessing(
single="[CLS] $A [SEP]",
pair="[CLS] $A [SEP] $B:1 [SEP]:1",
special_tokens=[("[CLS]", 101), ("[SEP]", 102)]
)
Use Case: Automatically add special tokens (like [CLS], [SEP], etc.) after tokenization for sequence classification tasks.
Goal: Ensure that special tokens like [CLS] and [SEP] are included in tokenized outputs for models that require them.
Sample Code:
from tokenizers import Tokenizer, models, processors
from tokenizers.pre_tokenizers import Whitespace
# Initialize tokenizer (a tiny [UNK]-only vocab plus added words keeps the example runnable)
tokenizer = Tokenizer(models.WordPiece(vocab={"[UNK]": 0}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.add_tokens(["Hugging", "Face", "is", "awesome", "!"])
# Add post-processing for special tokens
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 101), ("[SEP]", 102)]
)
encoded = tokenizer.encode("Hugging Face is awesome!")
print(encoded.tokens)
With post-processing, special tokens are automatically added to every tokenized sequence.
['[CLS]', 'Hugging', 'Face', 'is', 'awesome', '!', '[SEP]']
# Special tokens `[CLS]` and `[SEP]` are automatically added.
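The pair template takes effect when you encode two sequences; continuing with the tokenizer configured above, the type IDs distinguish the two segments:
# Encode a sentence pair: [CLS] $A [SEP] $B:1 [SEP]:1
pair = tokenizer.encode("Hugging Face", "is awesome!")
print(pair.tokens)    # ['[CLS]', 'Hugging', 'Face', '[SEP]', 'is', 'awesome', '!', '[SEP]']
print(pair.type_ids)  # [0, 0, 0, 0, 1, 1, 1, 1]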
20. Managing Padding and Truncation Dynamically
Boilerplate Code:
tokens = tokenizer.encode("Hugging Face is great!", padding='max_length', truncation=True, max_length=10)
Use Case: Pad and truncate sequences to a specified maximum length (padding=True would instead pad dynamically to the longest sequence in a batch, and does nothing for a single input).
Goal: Ensure all tokenized sequences are of uniform length by padding shorter sequences and truncating longer ones.
Sample Code:
from transformers import AutoTokenizer
# Load pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Pad to max_length and truncate anything longer
tokens = tokenizer.encode("Hugging Face is great!", padding='max_length', truncation=True, max_length=10)
print(tokens)
With padding='max_length' and truncation=True, sequences are adjusted to exactly the desired length.
[101, 17662, 2227, 2003, 2307, 999, 102, 0, 0, 0]
# Sequence padded out to a length of 10 tokens.
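Truncation only changes anything when the input exceeds max_length; a minimal sketch with an input long enough to be cut:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = "Hugging Face tokenizers make preprocessing fast, flexible, and easy to use"
ids = tokenizer.encode(long_text, truncation=True, max_length=10)
print(len(ids))  # 10 -- longer inputs are cut down to max_length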