30 Frequently used Dataprep library functions w/samples

Table of contents
- 1. Loading and Cleaning Data
- 2. Handling Missing Data
- 3. Detecting Outliers
- 4. Exploring Data Distribution
- 5. Standardizing Text Data
- 6. Standardizing Dates
- 7. Validating Emails
- 8. Removing Duplicates
- 9. Automated Data Profiling
- 10. Analyzing Missing Patterns
- 11. Data Sampling
- 12. Data Connector
- 13. Column-Level Data Cleaning
- 14. Schema Validation
- 15. Data Enrichment
- 16. Log File Parsing
- 17. Data Transformation
- 18. Regex-Based Cleaning
- 19. Automated Feature Engineering
- 20. Data Summarization
- 21. Real-Time Visualization
- 22. Time-Series Cleaning
- 23. Data Reduction
- 24. Exploratory Data Analysis Automation
- 25. Custom Cleaning Pipelines
- 26. Data Validation for Models
- 27. Geo-Data Processing
- 28. Text Cleaning
- 29. Automation for Large Datasets
- 30. Integration with ML Pipelines (Using dataprep)
Here are examples using the dataprep library for data preparation tasks:
1. Loading and Cleaning Data
from dataprep.datasets import load_dataset
from dataprep.clean import clean_headers
# Load dataset
df = load_dataset("titanic")
# Clean column headers
df_clean = clean_headers(df)
passenger_id survived pclass name sex age ...
0 1 0 3 Braund, Mr. Owen Harris male 22.0 ...
1 2 1 1 Cumings, Mrs. John Bradley (Florence) female ...
2. Handling Missing Data
from dataprep.clean import clean_missing
# Clean missing data
df_missing = clean_missing(df)
passenger_id survived pclass name sex age ...
0 1 0 3 Braund, Mr. Owen Harris male 22.0 ...
1 2 1 1 Cumings, Mrs. John Bradley (Florence) female ...
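For comparison, here is a minimal pandas-only sketch of the same idea: count the missing values per column, then fill them. The toy frame and the fill choices (median for numbers, a placeholder string for text) are assumptions for illustration, not a description of what clean_missing does internally.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in a numeric and a text column
toy = pd.DataFrame({"age": [22.0, np.nan, 26.0], "cabin": ["C85", None, None]})

# Count missing values per column
missing_counts = toy.isna().sum()

# Fill numeric gaps with the median and text gaps with a placeholder
filled = toy.assign(
    age=toy["age"].fillna(toy["age"].median()),
    cabin=toy["cabin"].fillna("unknown"),
)

print(missing_counts.to_dict())
print(filled)
```

Checking the counts first makes it easier to decide, column by column, whether filling or dropping is the better choice.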
3. Detecting Outliers
from dataprep.eda import create_report
# Generate a report to detect outliers
create_report(df)
- Interactive HTML report identifying potential outliers and anomalies in the data.
4. Exploring Data Distribution
from dataprep.eda import plot
# Plot distribution of a column
plot(df, "age")
- A visual histogram with descriptive statistics for the age column.
5. Standardizing Text Data
from dataprep.clean import clean_text
# Standardize text
df_clean_text = clean_text(df, "name")
passenger_id survived pclass name sex age ...
0 1 0 3 braund, mr. owen harris male 22.0 ...
1 2 1 1 cumings, mrs. john bradley (florence) female ...
6. Standardizing Dates
from dataprep.clean import clean_date
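# Standardize dates (note: the Titanic "embarked" column holds port codes, not dates; use a real date column in your own data)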
df_clean_date = clean_date(df, "embarked")
passenger_id survived pclass name sex age ...
0 1 0 3 Braund, Mr. Owen Harris male 22.0 ...
1 2 1 1 Cumings, Mrs. John Bradley (Florence) female ...
7. Validating Emails
from dataprep.clean import validate_email
# Add a column to check if emails are valid
df["email_valid"] = validate_email(df["email"])
passenger_id survived pclass name email_valid ...
0 1 0 3 Braund, Mr. Owen Harris False ...
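To make the idea of an email validity check concrete, here is a rough standard-library sketch. The pattern is a deliberately simple assumption (local part, "@", domain with at least one dot) and is not the rule set that validate_email applies.

```python
import re

# Deliberately simple pattern: local part, "@", domain with at least one dot.
# Real-world email validation is stricter (and messier) than this.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")


def looks_like_email(value):
    """Return True if value is a string that matches the simple pattern."""
    return isinstance(value, str) and EMAIL_RE.match(value.strip()) is not None


samples = ["alice@example.com", "bob@example", "not an email", None]
results = [looks_like_email(s) for s in samples]
print(results)
```

Missing values and non-string entries are treated as invalid here, which is usually what you want before filtering rows.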
8. Removing Duplicates
from dataprep.clean import clean_duplicates
# Remove duplicates
df_no_duplicates = clean_duplicates(df)
- Duplicate rows removed.
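If you only need exact-match deduplication, plain pandas covers it. A minimal sketch on a hypothetical toy frame, flagging the duplicates first and then dropping them:

```python
import pandas as pd

# Hypothetical toy frame where the third row repeats the first
toy = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "ticket": [101, 102, 101]})

# Mark every repeat of an earlier row
dup_mask = toy.duplicated(keep="first")

# Drop the repeats and rebuild a clean 0..n-1 index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(dup_mask.tolist())
print(deduped)
```

Inspecting the mask before dropping is a cheap way to confirm that the rows being removed really are duplicates.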
9. Automated Data Profiling
from dataprep.eda import plot_correlation
# Plot correlation matrix
plot_correlation(df)
- Correlation heatmap with insights on relationships between numerical features.
10. Analyzing Missing Patterns
from dataprep.eda import plot_missing
# Plot missing data patterns
plot_missing(df)
- A graphical representation of missing data patterns in the dataset.
11. Data Sampling
Randomly sample rows from a dataset.
from dataprep.datasets import load_dataset
# Load Titanic dataset
df = load_dataset("titanic")
# Take a random 10% sample
df_sample = df.sample(frac=0.1, random_state=42)
PassengerId Survived Pclass Name Sex Age ...
3 4 1 1 Futrelle, Mrs. Jacques Heath female 35.0 ...
8 9 1 3 Johnson, Mrs. Oscar W female 27.0 ...
12. Data Connector
Fetch data from a JSON API or CSV URL.
from dataprep.connector import Connector
# Connect to the JSON placeholder API
c = Connector("https://jsonplaceholder.typicode.com")
# Fetch posts data
posts = c.query("posts")
userId id title body
0 1 1 sunt aut facere repellat quia et suscipit suscipit recusandae ...
1 1 2 qui est esse est rerum tempore vitae sequi sint nihil...
13. Column-Level Data Cleaning
Clean and standardize phone numbers.
import pandas as pd
from dataprep.clean import clean_phone
# Sample phone numbers
data = {'phone': ['+1-800-555-0199', '5550199', '(800) 555-0199']}
df = pd.DataFrame(data)
# Clean phone numbers
df_clean = clean_phone(df, 'phone')
phone phone_clean
0 +1-800-555-0199 +18005550199
1 5550199 Invalid
2 (800) 555-0199 +18005550199
14. Schema Validation
Check if the dataset conforms to a predefined schema.
from dataprep.clean import validate_schema
# Define schema
schema = {
"PassengerId": "int",
"Survived": "int",
"Pclass": "int",
"Name": "str",
"Sex": "str",
"Age": "float",
}
# Validate the Titanic dataset
validation_results = validate_schema(df, schema)
{'valid': True, 'errors': []}
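The same kind of check can be sketched with pandas alone. The helper name check_schema and the mapping from type names to dtype checks are assumptions made for this illustration; the result dictionary simply mirrors the shape shown above.

```python
import pandas as pd
from pandas.api import types as ptypes

# Map schema type names to pandas dtype checks (the names are assumptions)
CHECKS = {
    "int": ptypes.is_integer_dtype,
    "float": ptypes.is_float_dtype,
    "str": ptypes.is_string_dtype,
}


def check_schema(df, schema):
    """Report missing columns and dtype mismatches against a simple schema."""
    errors = []
    for column, expected in schema.items():
        if column not in df.columns:
            errors.append(f"missing column: {column}")
        elif not CHECKS[expected](df[column]):
            errors.append(f"{column}: expected {expected}, got {df[column].dtype}")
    return {"valid": not errors, "errors": errors}


toy = pd.DataFrame(
    {"PassengerId": [1, 2], "Name": ["Braund", "Cumings"], "Age": [22.0, 38.0]}
)
result = check_schema(
    toy, {"PassengerId": "int", "Name": "str", "Age": "float", "Sex": "str"}
)
print(result)
```

Collecting every problem into a list, rather than stopping at the first one, gives a complete picture in a single pass.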
15. Data Enrichment
Add geographical information using IP addresses.
import pandas as pd
from dataprep.clean import enrich_ip
# Sample IP addresses
data = {'ip': ['', '']}
df = pd.DataFrame(data)
# Enrich with geographical information
df_enriched = enrich_ip(df, 'ip')
ip ip_country ip_region ip_city
0 United States California Mountain View
1 United States California Mountain View
16. Log File Parsing
Convert raw logs into structured tabular data.
from dataprep.clean import clean_text
# Sample log data
log_data = {
"log": [
' - - [10/Dec/2024:12:55:36 +0000] "GET /index.html HTTP/1.1" 200 1024',
' - - [10/Dec/2024:13:00:12 +0000] "POST /form HTTP/1.1" 404 512',
]
}
df = pd.DataFrame(log_data)
# Parse logs into structured format
df_parsed = clean_text(df, "log", patterns=[r'(?P<IP>\d+\.\d+\.\d+\.\d+) .* "(?P<Method>\w+) (?P<Path>\/\S*) .* (?P<Status>\d+) (?P<Size>\d+)'])
log IP Method Path Status Size
0 ... GET /index.html 200 1024
1 POST /form 404 512
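Log parsing of this kind can also be sketched with the standard library's re module and named groups. The IP addresses below are placeholder documentation addresses, and the pattern assumes the common "host ident user [time] "request" status size" layout; both are assumptions for illustration.

```python
import re

# One named group per field of a common-log-format style line
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

# Placeholder lines; the IPs are documentation-range examples
lines = [
    '192.0.2.1 - - [10/Dec/2024:12:55:36 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '192.0.2.2 - - [10/Dec/2024:13:00:12 +0000] "POST /form HTTP/1.1" 404 512',
]

rows = []
for line in lines:
    match = LOG_RE.match(line)
    if match:  # skip lines that do not fit the pattern
        row = match.groupdict()
        row["status"] = int(row["status"])
        row["size"] = int(row["size"])
        rows.append(row)

for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed straight to pd.DataFrame(rows) if a table is needed.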
17. Data Transformation
Convert wide-format data into long-format data.
import pandas as pd
from pandas import melt
# Sample wide-format data
data = {'Question1': [5, 4], 'Question2': [3, 5], 'Respondent': ['Alice', 'Bob']}
df = pd.DataFrame(data)
# Melt into long-format
df_long = melt(df, id_vars=["Respondent"], var_name="Question", value_name="Score")
Respondent Question Score
0 Alice Question1 5
1 Bob Question1 4
2 Alice Question2 3
3 Bob Question2 5
18. Regex-Based Cleaning
Extract hashtags or mentions from a column of social media text.
import pandas as pd
# Sample social media data
data = {'text': ['Loving the #DataScience vibes!', 'Follow @dataprep for updates!']}
df = pd.DataFrame(data)
# Extract hashtags and mentions
# expand=False returns a Series holding the first match per row
df['hashtags'] = df['text'].str.extract(r'(#\w+)', expand=False)
df['mentions'] = df['text'].str.extract(r'(@\w+)', expand=False)
text hashtags mentions
0 Loving the #DataScience vibes! #DataScience NaN
1 Follow @dataprep for updates! NaN @dataprep
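Note that str.extract keeps only the first match in each row. When a post can carry several tags, collecting every match is usually what you want; a minimal standard-library sketch (the first sample text is extended with a second hashtag for illustration):

```python
import re

# Sample posts; the extra "#Python" tag is added for illustration
texts = [
    "Loving the #DataScience and #Python vibes!",
    "Follow @dataprep for updates!",
]

# re.findall returns every non-overlapping match, or an empty list
hashtags = [re.findall(r"#\w+", text) for text in texts]
mentions = [re.findall(r"@\w+", text) for text in texts]

print(hashtags)
print(mentions)
```

In pandas, Series.str.findall applies the same idea column-wise and yields one list per row.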
19. Automated Feature Engineering
Generate interaction terms or log transformations.
import numpy as np
import pandas as pd
# Sample dataset
data = {'X1': [1, 2, 3], 'X2': [4, 5, 6]}
df = pd.DataFrame(data)
# Generate interaction terms and transformations
df['X1_X2'] = df['X1'] * df['X2'] # Interaction term
df['log_X1'] = np.log1p(df['X1']) # Log transformation: log(1 + x)
X1 X2 X1_X2 log_X1
0 1 4 4 0.693
1 2 5 10 1.099
2 3 6 18 1.386
20. Data Summarization
Generate quick summaries of the dataset.
from dataprep.datasets import load_dataset
# Load Titanic dataset
df = load_dataset("titanic")
# Generate summary statistics
summary = df.describe()
PassengerId Survived Pclass Age ...
count 891.000000 891.000000 891.000000 714.000000 ...
mean 446.000000 0.383838 2.308642 29.699118 ...
std 257.353842 0.486592 0.836071 14.526497 ...
min 1.000000 0.000000 1.000000 0.420000 ...
Alternatively, generate a visual report:
from dataprep.eda import create_report
# Generate a comprehensive report
create_report(df)
- Generates an interactive HTML report with insights on distribution, missing data, correlations, and more.
21. Real-Time Visualization
Visualize sales data trends dynamically.
import pandas as pd
from dataprep.eda import plot
# Sample sales data
data = {'Date': ['2024-01-01', '2024-01-02', '2024-01-03'],
'Sales': [200, 300, 250]}
df = pd.DataFrame(data)
# Visualize sales trends
plot(df, x='Date', y='Sales')
- A dynamically generated line chart showing Sales trends over Date. No external plotting library required.
22. Time-Series Cleaning
Handle missing values and smooth noisy time-series data.
from dataprep.clean import clean_missing
import pandas as pd
import numpy as np
# Sample stock data
data = {'Date': pd.date_range('2024-01-01', periods=5, freq='D'),
'Price': [100, np.nan, 105, 110, np.nan]}
df = pd.DataFrame(data)
# Fill missing values with linear interpolation
df_cleaned = clean_missing(df, method='interpolate')
Date Price
0 2024-01-01 100.0
1 2024-01-02 102.5
2 2024-01-03 105.0
3 2024-01-04 110.0
4 2024-01-05 110.0
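The title also mentions smoothing. Both steps can be sketched with pandas alone: linear interpolation for the gaps, then a rolling mean to damp noise. The window size of 3 is an arbitrary assumption for this example.

```python
import numpy as np
import pandas as pd

# Same sample prices, indexed by date
prices = pd.Series(
    [100, np.nan, 105, 110, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Linear interpolation fills interior gaps; a trailing gap takes the
# last known value because there is no later point to interpolate toward
filled = prices.interpolate(method="linear")

# 3-day rolling mean; min_periods=1 avoids NaN at the start of the series
smoothed = filled.rolling(window=3, min_periods=1).mean()

print(filled)
print(smoothed.round(2))
```

A wider window smooths more aggressively but also lags behind real changes in the series.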
23. Data Reduction
Reduce dimensions using PCA.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Sample customer segmentation data
data = {'Feature1': [2, 4, 6, 8], 'Feature2': [1, 3, 5, 7], 'Feature3': [0.5, 1, 1.5, 2]}
df = pd.DataFrame(data)
# Standardize the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(df)
# Apply PCA
pca = PCA(n_components=2)
data_reduced = pca.fit_transform(data_scaled)
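The three sample features are perfectly correlated, so all of the variance lands on the first component and the second is approximately zero (the sign of each component may flip depending on the library version):
[[-2.324, 0.0],
[-0.775, 0.0],
[0.775, 0.0],
[2.324, 0.0]]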
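A quick way to sanity-check a PCA result is to look at how much variance each component explains. The sketch below uses NumPy only and the same three sample features; computing PCA through an SVD of the standardized data is a standard equivalent formulation, not a claim about scikit-learn's internal steps.

```python
import numpy as np

# Same sample features as above: one row per customer
X = np.array([[2, 1, 0.5], [4, 3, 1.0], [6, 5, 1.5], [8, 7, 2.0]])

# Standardize each column (population standard deviation, as StandardScaler does)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

# Share of total variance carried by each component
explained = S**2 / np.sum(S**2)

# Projected data (component scores); signs are arbitrary
scores = U * S

print(explained.round(3))
print(np.abs(scores[:, 0]).round(3))
```

Here the first component carries essentially all of the variance because the three columns are exact linear functions of one another, so reducing to two dimensions adds nothing over reducing to one.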
24. Exploratory Data Analysis Automation
Generate interactive and shareable EDA reports.
from dataprep.datasets import load_dataset
from dataprep.eda import create_report
# Load Titanic dataset
df = load_dataset("titanic")
# Create an EDA report
create_report(df)
An interactive HTML report with:
- Missing data analysis
- Correlation heatmaps
- Categorical summaries
25. Custom Cleaning Pipelines
Combine multiple cleaning tasks into a reusable pipeline.
import pandas as pd
from dataprep.clean import clean_headers, clean_missing, clean_text
# Sample survey data
data = {'Q1 Answer': ['Yes', 'No', None], 'Q2_Answer': ['Good', None, 'Bad']}
df = pd.DataFrame(data)
# Clean headers, handle missing data, and standardize text
df_pipeline = clean_headers(df)
df_pipeline = clean_missing(df_pipeline)
df_pipeline = clean_text(df_pipeline, ['q1_answer', 'q2_answer'])
q1_answer q2_answer
0 yes good
1 no NaN
2 NaN bad
26. Data Validation for Models
Check skewness, outliers, or multicollinearity.
from dataprep.datasets import load_dataset
from dataprep.eda import plot_correlation, plot
# Load Titanic dataset
df = load_dataset("titanic")
# Check multicollinearity
plot_correlation(df, correlation_methods=["pearson"])
A heatmap showing correlation coefficients between predictors.
Highlighted values identify highly correlated features for potential removal.
27. Geo-Data Processing
Validate and clean geographic information.
import pandas as pd
from dataprep.clean import clean_lat_long
# Sample latitude and longitude data
data = {'latitude': [90.1, -45, 'abc'], 'longitude': [180.5, 120, 'xyz']}
df = pd.DataFrame(data)
# Clean latitude and longitude
df_clean = clean_lat_long(df, lat_col="latitude", long_col="longitude")
latitude longitude valid_lat_long
0 NaN NaN False
1 -45.0 120.0 True
2 NaN NaN False
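The validity rule behind this output is simple enough to sketch in plain Python: latitude must lie within [-90, 90] and longitude within [-180, 180], and anything that cannot be parsed as a number is invalid. The helper name parse_coordinate is hypothetical, and this is only an illustration of the rule, not of clean_lat_long's implementation.

```python
def parse_coordinate(value, limit):
    """Return value as a float if it lies within [-limit, limit], else None."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if -limit <= number <= limit else None


# Same sample pairs as above: (latitude, longitude)
pairs = [(90.1, 180.5), (-45, 120), ("abc", "xyz")]

cleaned = []
for lat, lon in pairs:
    lat_ok = parse_coordinate(lat, 90)
    lon_ok = parse_coordinate(lon, 180)
    valid = lat_ok is not None and lon_ok is not None
    # Blank out both values when either half of the pair is invalid
    cleaned.append((lat_ok if valid else None, lon_ok if valid else None, valid))

for row in cleaned:
    print(row)
```

Treating the pair as a unit avoids keeping a valid latitude next to a meaningless longitude.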
28. Text Cleaning
Remove stopwords, punctuation, and noise.
import pandas as pd
from dataprep.clean import clean_text
# Sample product reviews
data = {'reviews': ["Great product!!!", "Terrible, would not recommend...", "Average."]}
df = pd.DataFrame(data)
# Clean reviews
df_clean = clean_text(df, "reviews", remove_punctuation=True, remove_stopwords=True)
0 great product
1 terrible recommend
2 average
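To show what these cleaning steps amount to, here is a small standard-library sketch: lowercase the text, strip punctuation, and drop stopwords. The tiny stopword set is an assumption chosen for this example; real stopword lists are much longer.

```python
import string

# Tiny illustrative stopword set (an assumption; real lists are far longer)
STOPWORDS = {"would", "not", "the", "a", "an", "and", "is"}

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_review(text):
    """Lowercase, remove punctuation, and drop stopwords."""
    words = text.lower().translate(PUNCT_TABLE).split()
    return " ".join(word for word in words if word not in STOPWORDS)


reviews = ["Great product!!!", "Terrible, would not recommend...", "Average."]
cleaned = [clean_review(review) for review in reviews]
print(cleaned)
```

Dropping "not" as a stopword flips the meaning of some reviews, so for sentiment work it is often worth keeping negations.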
29. Automation for Large Datasets
Handle large datasets efficiently.
from dataprep.datasets import load_dataset
from dataprep.clean import clean_missing
# Load a large dataset (simulate with Titanic)
df = load_dataset("titanic")
# Clean missing values efficiently
df_clean = clean_missing(df)
# Inspect the cleaned result
df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
0 PassengerId 891 non-null int64
30. Integration with ML Pipelines (Using dataprep)
Prepare numeric and categorical data directly for machine learning models.
import pandas as pd
from dataprep.clean import clean_headers, clean_missing
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Sample dataset
data = {'age': [25, None, 35], 'gender': ['Male', 'Female', 'Female']}
df = pd.DataFrame(data)
# Step 1: Clean headers
df = clean_headers(df)
# Step 2: Handle missing values
df_cleaned = clean_missing(df, method="fill", value={"age": df['age'].mean(), "gender": "Unknown"})
# Step 3: Apply scaling and encoding
scaler = StandardScaler()
encoder = OneHotEncoder()
# Scale numeric data
df_cleaned['age_scaled'] = scaler.fit_transform(df_cleaned[['age']])
# Encode categorical data
encoded_gender = encoder.fit_transform(df_cleaned[['gender']]).toarray()
df_encoded = pd.DataFrame(encoded_gender, columns=encoder.get_feature_names_out(['gender']))
# Combine the processed data
df_final = pd.concat([df_cleaned[['age_scaled']], df_encoded], axis=1)
age_scaled gender_Female gender_Male
0 -1.224745 0.0 1.0
1 0.000000 1.0 0.0
2 1.224745 1.0 0.0
This example uses dataprep for cleaning and handling missing values, followed by feature scaling and encoding for seamless integration into ML pipelines.