20 Hugging Face Datasets Concepts with Examples
Table of contents
- 1. Installing the Datasets Library
- 2. Loading a Dataset from the Hub
- 3. Inspecting Dataset Features and Splits
- 4. Processing Datasets with Map Function
- 5. Filtering Data with Datasets Library
- 6. Dataset Shuffling
- 7. Dataset Batching for Training
- 8. Saving Processed Datasets
- 9. Loading Datasets from Disk
- 10. Streaming Large Datasets
- 11. Dataset Concatenation and Merging
- 12. Dataset Sorting
- 13. Dataset Casting for Feature Types
- 14. Dataset Stratified Splitting
- 15. Applying Preprocessing Pipelines
- 16. Dataset Column Renaming
- 17. Dataset Column Removal
- 18. Dataset Bucketing and Binning
- 19. Dataset Imputation for Missing Values
- 20. Dataset Sampling
1. Installing the Datasets Library
Boilerplate Code:
pip install datasets
Use Case: Install the Hugging Face Datasets library to load, process, and analyze large datasets.
Goal: Set up the datasets library to access a variety of NLP and machine learning datasets.
2. Loading a Dataset from the Hub
Use Case: Load a dataset directly from the Hugging Face Hub.
Goal: Access popular datasets like IMDb, MNIST, or SQuAD with one line of code.
Sample Code:
from datasets import load_dataset
dataset = load_dataset("imdb")
print(dataset["train"][0])
Before Example: You manually search for, download, and format datasets for machine learning tasks.
# Manually fetching dataset files:
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
After Example: The datasets library automatically loads datasets, including metadata and splits.
{'text': 'I absolutely loved this movie...', 'label': 1}
# IMDb dataset loaded and ready for use.
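load_dataset can also fetch a single split, or just a slice of one, which is handy for quick experiments. A small optional variation on the example above (the slice size is an arbitrary choice):
from datasets import load_dataset
# Load only the first 1,000 training examples; "train[:1000]" is an arbitrary example slice.
small_train = load_dataset("imdb", split="train[:1000]")
print(len(small_train))  # 1000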
Where you can run this code:
- Google Colab: Works out of the box; if the datasets library is not already available in your runtime, install it in a cell with !pip install datasets.
- JupyterLab: Runs in a local JupyterLab setup after installing the library with !pip install datasets.
- Locally: Works in any local environment or notebook interface, as long as the datasets library is installed.
Can this be done on the Hugging Face website?
Yes, the Hugging Face Hub provides a Dataset Viewer where you can browse and explore datasets, but code execution (e.g., loading and manipulating datasets) happens outside the website, in environments like Colab, JupyterLab, or your local machine.
3. Inspecting Dataset Features and Splits
Boilerplate Code:
print(dataset["train"].features)
print(dataset["train"].split)
Use Case: Explore the structure of a dataset, including its features and available splits (e.g., train, test, validation).
Goal: Understand the dataset's structure before processing or training.
Sample Code:
from datasets import load_dataset
dataset = load_dataset("imdb")
print(dataset["train"].features)
print(dataset["train"].split)
Before Example: You manually inspect the dataset file formats, often needing custom code to explore features.
# Manually loading and inspecting data:
import pandas as pd
df = pd.read_csv("dataset.csv")
print(df.columns)
After Example: With the datasets library, the dataset's features and splits are automatically presented.
{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}
# Features and splits of the dataset displayed.
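Printing the DatasetDict itself is another quick way to see the available splits and their sizes; a short optional check on the same dataset:
from datasets import load_dataset
dataset = load_dataset("imdb")
# The DatasetDict repr lists every split with its columns and row counts.
print(dataset)
print(dataset["train"].num_rows)      # 25000
print(dataset["train"].column_names)  # ['text', 'label']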
4. Processing Datasets with Map Function
Preprocessing is the step that converts human-readable text into machine-readable numbers (token IDs) that the model can use for training or inference.
Boilerplate Code:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_dataset = dataset.map(tokenize_function, batched=True)
Use Case: Apply transformations or preprocessing to the entire dataset using the map() function.
Goal: Tokenize text or perform any custom processing on a dataset.
Sample Code:
from transformers import AutoTokenizer
from datasets import load_dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("imdb")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_dataset = dataset.map(tokenize_function, batched=True)
print(tokenized_dataset["train"][0])
Before Example: You write custom loops to preprocess datasets, which can be inefficient and harder to maintain.
# Manually tokenizing dataset:
tokenized_texts = [tokenizer(text) for text in texts]
After Example: The map() function applies the preprocessing efficiently across the dataset.
{'input_ids': [...], 'attention_mask': [...], 'label': 1}
# Dataset tokenized using the map function.
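On large corpora, map() can also parallelize the work across processes and drop the raw text column in the same pass. A minimal sketch, assuming the same tokenizer as above; the num_proc value is an arbitrary example, not a recommendation:
from transformers import AutoTokenizer
from datasets import load_dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("imdb")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
# batched=True processes chunks of rows, num_proc spreads them over worker processes,
# and remove_columns drops the raw text once it has been tokenized.
tokenized_dataset = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
print(tokenized_dataset["train"].column_names)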
5. Filtering Data with Datasets Library
Boilerplate Code:
filtered_dataset = dataset.filter(lambda example: example["label"] == 1)
Use Case: Filter datasets based on specific conditions (e.g., selecting positive sentiment examples).
Goal: Reduce dataset size by selecting only relevant examples.
Sample Code:
from datasets import load_dataset
dataset = load_dataset("imdb")
filtered_dataset = dataset.filter(lambda example: example["label"] == 1)
print(filtered_dataset["train"][0])
After Example: With the datasets library, filtering is simple and efficient, even for large datasets.
{'text': 'I absolutely loved this movie...', 'label': 1}
# Dataset filtered to only include examples with positive sentiment.
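filter() also accepts batched functions that return one boolean per example, which is usually faster on large datasets. A small sketch; the 1,000-character threshold is an arbitrary example:
from datasets import load_dataset
dataset = load_dataset("imdb", split="train")
# The batched lambda receives a dict of lists and returns a list of booleans.
long_reviews = dataset.filter(lambda batch: [len(text) > 1000 for text in batch["text"]], batched=True)
print(len(long_reviews))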
6. Dataset Shuffling
Boilerplate Code:
shuffled_dataset = dataset.shuffle(seed=42)
Use Case: Randomly shuffle the order of examples in a dataset.
Goal: Shuffle the dataset to ensure that the training examples are not in any particular order.
Sample Code:
from datasets import load_dataset
dataset = load_dataset("imdb")
shuffled_dataset = dataset.shuffle(seed=42)
print(shuffled_dataset["train"][0])
After Example: With the datasets library, you can easily shuffle large datasets with a single function.
{'text': 'A must-watch for movie lovers...', 'label': 1}
# Dataset shuffled and ready for training.
7. Dataset Batching for Training
Boilerplate Code:
batched_dataset = dataset["train"].with_format("torch").train_test_split(test_size=0.1)
Use Case: Set the dataset's format for PyTorch and carve out a held-out split; the actual mini-batches are then produced by a DataLoader during training.
Goal: Split the dataset into train and test sets, and prepare it for model training.
Sample Code:
from datasets import load_dataset
dataset = load_dataset("imdb").with_format("torch")
batched_dataset = dataset["train"].train_test_split(test_size=0.1)
print(batched_dataset["train"][0])
With datasets, splitting and format conversion take just a couple of lines; the batching itself happens in the DataLoader, as sketched below.
{'text': 'A great film that is a timeless classic...', 'label': 0}
# Data split into training and testing batches.
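The split above still yields whole datasets; the mini-batches a model actually consumes usually come from a PyTorch DataLoader wrapped around the torch-formatted split. A minimal sketch (the batch size is an arbitrary choice, and the raw text would still need tokenizing before reaching a model):
from datasets import load_dataset
from torch.utils.data import DataLoader
dataset = load_dataset("imdb", split="train").with_format("torch")
split = dataset.train_test_split(test_size=0.1)
# The DataLoader produces the actual mini-batches; batch_size=8 is an example value.
train_loader = DataLoader(split["train"], batch_size=8, shuffle=True)
batch = next(iter(train_loader))
print(batch["label"].shape)  # torch.Size([8])
print(len(batch["text"]))    # 8 raw review strings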
8. Saving Processed Datasets
Boilerplate Code:
dataset.save_to_disk("path/to/dataset")
Use Case: Save a preprocessed dataset to disk for later use.
Goal: Store datasets locally after processing them, so you can reload them without reprocessing.
Sample Code:
from datasets import load_dataset
dataset = load_dataset("imdb").shuffle(seed=42)
dataset.save_to_disk("path/to/dataset")
# Later load the dataset
from datasets import load_from_disk
loaded_dataset = load_from_disk("path/to/dataset")
print(loaded_dataset["train"][0])
After Example: With datasets, saving and reloading datasets is streamlined.
# Dataset saved and reloaded from disk.
{'text': 'Amazing movie with great performances...', 'label': 1}
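save_to_disk keeps the dataset in its native Arrow layout; if you need a plain file to hand to other tools, individual splits can also be exported. A short sketch; the file names are arbitrary examples:
from datasets import load_dataset
dataset = load_dataset("imdb")
# Export single splits to common interchange formats; the paths are example names.
dataset["train"].to_parquet("imdb_train.parquet")
dataset["test"].to_csv("imdb_test.csv")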
9. Loading Datasets from Disk
Boilerplate Code:
from datasets import load_from_disk
dataset = load_from_disk("path/to/dataset")
Use Case: Load a previously saved dataset from your local disk.
Goal: Reload a dataset from disk without having to reprocess or reload it from the Hugging Face Hub.
Sample Code:
from datasets import load_from_disk
dataset = load_from_disk("path/to/dataset")
print(dataset["train"][0])
After Example: With the datasets library, loading datasets from disk is simple and efficient.
{'text': 'Fantastic storyline with deep characters...', 'label': 0}
# Dataset loaded from disk, ready for use.
10. Streaming Large Datasets
Boilerplate Code:
dataset = load_dataset("imdb", split="train", streaming=True)
Use Case: Stream large datasets that do not fit into memory.
Goal: Load large datasets efficiently by streaming them instead of loading everything into memory at once.
Sample Code:
from datasets import load_dataset
dataset = load_dataset("imdb", split="train", streaming=True)
for example in dataset.take(5):
print(example)
Before Example: You struggle with large datasets that don't fit into memory, requiring custom code for data management.
# Manually streaming large datasets:
for chunk in pd.read_csv("large_dataset.csv", chunksize=1000):
    process(chunk)
After Example: With datasets, streaming large datasets is handled automatically, making it memory-efficient.
{'text': 'A beautiful movie with deep meaning...', 'label': 1}
# Streaming dataset processed efficiently.
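Streaming datasets also support lazy transformations: shuffle works over a fixed-size buffer (the full data never sits in memory), and map is applied on the fly as you iterate. A small sketch with an arbitrary buffer size:
from datasets import load_dataset
streamed = load_dataset("imdb", split="train", streaming=True)
# Shuffling a stream uses a buffer; buffer_size=1000 is an example value.
shuffled = streamed.shuffle(seed=42, buffer_size=1000)
# map() on a streaming dataset runs lazily, example by example.
with_length = shuffled.map(lambda ex: {"length": len(ex["text"].split())})
for example in with_length.take(2):
    print(example["label"], example["length"])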
11. Dataset Concatenation and Merging
Boilerplate Code:
from datasets import concatenate_datasets
concatenated_dataset = concatenate_datasets([dataset1, dataset2])
Use Case: Combine two datasets into one by concatenating or merging them.
Goal: Merge multiple datasets to create a larger dataset for training or evaluation.
Sample Code:
from datasets import load_dataset, concatenate_datasets
dataset1 = load_dataset("imdb", split="train[:50%]")
dataset2 = load_dataset("imdb", split="train[50%:]")
concatenated_dataset = concatenate_datasets([dataset1, dataset2])
print(len(concatenated_dataset))
Before Example: You manually join datasets by reading them into memory and concatenating them using custom code.
# Manually concatenating datasets:
combined_dataset = dataset1 + dataset2
After Example: With the datasets library, you can seamlessly concatenate datasets.
25000
# Two IMDb dataset splits merged into a single dataset.
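If you want to mix examples from several sources rather than append one after the other, the library also provides interleave_datasets. A minimal sketch reusing the two halves above:
from datasets import load_dataset, interleave_datasets
dataset1 = load_dataset("imdb", split="train[:50%]")
dataset2 = load_dataset("imdb", split="train[50%:]")
# Alternates examples from the two sources instead of appending them.
mixed = interleave_datasets([dataset1, dataset2])
print(mixed[0]["label"], mixed[1]["label"])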
12. Dataset Sorting
Boilerplate Code:
sorted_dataset = dataset.sort("label")
Use Case: Sort a dataset by a specific feature (e.g., sorting by label for classification tasks).
Goal: Sort dataset rows based on a feature to arrange them in a specific order.
Sample Code:
from datasets import load_dataset
dataset = load_dataset("imdb", split="train")
sorted_dataset = dataset.sort("label")
print(sorted_dataset[0])
Before Example: You manually sort data using pandas or other tools, which can be inefficient.
# Manually sorting dataset:
sorted_df = df.sort_values(by="label")
After Example: With the datasets library, sorting by any feature is quick and straightforward.
{'text': 'Worst movie ever...', 'label': 0}
# Dataset sorted by the "label" feature.
13. Dataset Casting for Feature Types
Boilerplate Code:
dataset = dataset.cast_column("label", ClassLabel(names=["negative", "positive"]))
Use Case: Change the type of a dataset's feature, such as converting labels to a categorical type.
Goal: Modify the data types of specific columns, e.g., converting integer labels to class labels.
Sample Code:
from datasets import load_dataset
from datasets import ClassLabel
dataset = load_dataset("imdb", split="train")
dataset = dataset.cast_column("label", ClassLabel(names=["negative", "positive"]))
print(dataset.features["label"])
Before Example: You manually modify data types using pandas or custom functions.
# Manually changing data types:
df['label'] = df['label'].astype('category')
After Example: With the datasets library, you can cast features to specific types with minimal code.
ClassLabel(names=['negative', 'positive'], id=None)
# Labels cast to class label type for easier manipulation.
14. Dataset Stratified Splitting
Boilerplate Code:
train_test_split = dataset.train_test_split(test_size=0.2, stratify_by_column="label")
Use Case: Split the dataset into training and test sets while maintaining class balance (stratified split).
Goal: Create train/test splits while ensuring that the distribution of labels is preserved.
Sample Code:
from datasets import load_dataset
dataset = load_dataset("imdb", split="train")
train_test_split = dataset.train_test_split(test_size=0.2, stratify_by_column="label")
print(train_test_split["train"][0])
Before Example: You write custom code to perform stratified splits, ensuring balanced label distribution.
# Manually creating a stratified split:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(dataset, stratify=dataset['label'])
After Example: With the datasets library, stratified splitting is simple and automatic.
{'text': 'A fascinating movie...', 'label': 1}
# Dataset split with balanced label distribution.
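To confirm that the stratified split really preserved the label balance, you can count the labels on each side; a quick sanity check that is not part of the original example:
from collections import Counter
from datasets import load_dataset
dataset = load_dataset("imdb", split="train")
splits = dataset.train_test_split(test_size=0.2, stratify_by_column="label")
# Both sides should show roughly the same neg/pos ratio as the full IMDb train split.
print(Counter(splits["train"]["label"]))
print(Counter(splits["test"]["label"]))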
15. Applying Preprocessing Pipelines
Boilerplate Code:
def preprocess_function(examples):
    return {"length": len(examples["text"].split())}
processed_dataset = dataset.map(preprocess_function)
Use Case: Apply custom preprocessing functions (e.g., tokenization, feature extraction) to the dataset.
Goal: Add new features or preprocess the dataset before training or evaluation.
Sample Code:
from datasets import load_dataset
dataset = load_dataset("imdb", split="train")
def preprocess_function(examples):
    return {"length": len(examples["text"].split())}
processed_dataset = dataset.map(preprocess_function)
print(processed_dataset[0])
Before Example: You manually preprocess datasets, adding features and applying transformations with loops.
# Manually adding a new feature:
df['length'] = df['text'].apply(lambda x: len(x.split()))
After Example: With datasets, preprocessing functions are applied efficiently to every row.
{'text': 'A great story with amazing actors...', 'label': 1, 'length': 6}
# Dataset processed with a new "length" feature added.
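The same pipeline can run in batched mode, where the function receives a dict of lists and returns a dict of lists; a minimal batched variant of the example above:
from datasets import load_dataset
dataset = load_dataset("imdb", split="train")
# Batched map: compute the length for every text in the batch at once.
def preprocess_batch(batch):
    return {"length": [len(text.split()) for text in batch["text"]]}
processed_dataset = dataset.map(preprocess_batch, batched=True)
print(processed_dataset[0]["length"])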
16. Dataset Column Renaming
Boilerplate Code:
renamed_dataset = dataset.rename_column("label", "sentiment")
Use Case: Rename dataset columns to make them more descriptive or easier to work with.
Goal: Change the name of a specific column in the dataset.
Sample Code:
from datasets import load_dataset
dataset = load_dataset("imdb", split="train")
renamed_dataset = dataset.rename_column("label", "sentiment")
print(renamed_dataset[0])
Before Example: You manually rename columns using pandas or other tools.
# Manually renaming columns in pandas:
df.rename(columns={"label": "sentiment"}, inplace=True)
After Example: With the datasets library, renaming columns is straightforward and can be done with a single command.
{'text': 'Amazing movie!', 'sentiment': 1}
# "label" column renamed to "sentiment".
17. Dataset Column Removal
Boilerplate Code:
dataset = dataset.remove_columns(["text"])
Use Case: Remove unnecessary columns from a dataset to focus on relevant features.
Goal: Drop specific columns from the dataset to reduce its size or complexity.
Sample Code:
from datasets import load_dataset
dataset = load_dataset("imdb", split="train")
reduced_dataset = dataset.remove_columns(["text"])
print(reduced_dataset[0])
Before Example: You manually remove columns using pandas or write custom code for column management.
# Manually dropping columns:
df.drop(columns=["text"], inplace=True)
After Example: With the datasets library, removing columns is quick and efficient.
{'label': 1}
# "text" column removed from the dataset.
18. Dataset Bucketing and Binning
Boilerplate Code:
def bucket_function(examples):
    return {"length_bucket": int(len(examples["text"].split()) / 10)}
binned_dataset = dataset.map(bucket_function)
Use Case: Group continuous values into buckets or bins (e.g., text length categories).
Goal: Create buckets to categorize data based on a feature (e.g., sentence length, age group).
Sample Code:
from datasets import load_dataset
dataset = load_dataset("imdb", split="train")
def bucket_function(examples):
    return {"length_bucket": int(len(examples["text"].split()) / 10)}
binned_dataset = dataset.map(bucket_function)
print(binned_dataset[0])
Before Example: You manually create bins or buckets for data based on continuous features.
# Manually binning data:
df['length_bucket'] = df['length'] // 10
After Example: With datasets, bucketing or binning data is easy and done with a simple function.
{'text': 'Great acting!', 'label': 1, 'length_bucket': 0}
# Dataset bucketed based on text length.
19. Dataset Imputation for Missing Values
Boilerplate Code:
def impute_function(examples):
    if examples["text"] is None:
        examples["text"] = "N/A"
    return examples
imputed_dataset = dataset.map(impute_function)
Use Case: Handle missing or null values in the dataset by filling them with default values.
Goal: Impute missing values to ensure clean and complete data.
Sample Code:
from datasets import load_dataset
dataset = load_dataset("imdb", split="train")
def impute_function(examples):
    if examples["text"] is None:
        examples["text"] = "N/A"
    return examples
imputed_dataset = dataset.map(impute_function)
print(imputed_dataset[0])
Before Example: You manually handle missing values, which can be tedious, especially for large datasets.
# Manually filling missing values:
df['text'].fillna("N/A", inplace=True)
After Example: The datasets library allows easy imputation of missing values in large datasets.
{'text': 'Fantastic!', 'label': 1}
# Missing values filled with default values.
20. Dataset Sampling
Boilerplate Code:
sampled_dataset = dataset.shuffle(seed=42).select(range(100))
Use Case: Sample a subset of data from a large dataset for quick testing or analysis.
Goal: Randomly sample a fixed number of examples from the dataset.
Sample Code:
from datasets import load_dataset
dataset = load_dataset("imdb", split="train")
sampled_dataset = dataset.shuffle(seed=42).select(range(100))
print(len(sampled_dataset))
Before Example: You manually sample data by writing custom sampling functions, which can be slow.
# Manually sampling data:
sampled_df = df.sample(n=100, random_state=42)
After Example: The datasets library provides efficient sampling functionality with built-in methods.
100
# Random sample of 100 examples selected from the dataset.
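Because select() accepts any sequence of indices, the same pattern also covers proportional samples; a small variation where the 1% fraction is an arbitrary example:
from datasets import load_dataset
dataset = load_dataset("imdb", split="train")
# Take a reproducible 1% sample; the fraction is an example value, not a recommendation.
sample_size = int(0.01 * len(dataset))
sampled_dataset = dataset.shuffle(seed=42).select(range(sample_size))
print(len(sampled_dataset))  # 250 of the 25,000 training examples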