Python Automation#1: 🧽Data Cleaning w/janitor, pandas-profiling, dataprep, pandas

Table of contents
- 1. Clean Messy Column Names (janitor.clean_names)
- 2. Remove Empty Rows/Columns (janitor.remove_empty)
- 3. Handle Missing Values (pandas.fillna)
- 4. Generate a Quick Data Report (pandas_profiling.ProfileReport)
- 5. Explore Data Structure Visually (dataprep.eda.create_report)
- 6. Complex Filtering/Grouping (pandas.query, pandas.groupby)
- 7. Add Computed Columns (janitor.add_column)
- 8. Merge or Join Datasets (pandas.merge)
- 9. Data Cleaning and Standardization (dataprep.clean)
- 10. Quick Insights into Distributions (dataprep.eda.plot)
1. Clean Messy Column Names (janitor.clean_names)
import pandas as pd
import janitor
# Sample DataFrame
df = pd.DataFrame({"Col 1 ": [1, 2], "COL@2": [3, 4]})
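# Clean column names: lowercase, spaces become underscores.
# strip_underscores=True drops the trailing "_" left by the trailing space;
# remove_special=True drops characters such as "@".
df = janitor.clean_names(df, strip_underscores=True, remove_special=True)
print(df)
Output:
   col_1  col2
0      1     3
1      2     4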
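If pyjanitor is not installed, similar cleanup can be approximated with plain pandas. This is a minimal sketch, not janitor's implementation: normalize_name is a hypothetical helper that lowercases each name, turns every run of non-alphanumeric characters into a single underscore, and trims underscores from both ends.

```python
import re

import pandas as pd


def normalize_name(name) -> str:
    """Hypothetical helper: lowercase, collapse non-alphanumerics to '_', trim '_'."""
    return re.sub(r"[^0-9a-z]+", "_", str(name).lower()).strip("_")


df = pd.DataFrame({"Col 1 ": [1, 2], "COL@2": [3, 4]})

# rename() applies the function to every column label
cleaned = df.rename(columns=normalize_name)
print(list(cleaned.columns))  # ['col_1', 'col_2']
```

Because the rule is a plain function, it is easy to adjust (for example, to keep dots or to cap the name length).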
2. Remove Empty Rows/Columns (janitor.remove_empty)
import pandas as pd
import janitor
# Adding an empty row and column
df = pd.DataFrame({"Col 1 ": [1, 2, None], "COL@2": [3, 4, None]})
df["Empty"] = None
# Remove empty rows and columns
df = janitor.remove_empty(df)
print(df)
Output:
   Col 1   COL@2
0     1.0    3.0
1     2.0    4.0
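The same result can be sketched with pandas alone by chaining two dropna calls, one per axis. This assumes "empty" means every value in the row or column is missing; the variable name trimmed is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Col 1 ": [1, 2, None], "COL@2": [3, 4, None]})
df["Empty"] = None

trimmed = (
    df.dropna(axis=0, how="all")   # drop rows where every value is missing
      .dropna(axis=1, how="all")   # drop columns where every value is missing
      .reset_index(drop=True)      # renumber the remaining rows from 0
)
print(trimmed)
```

Using how="all" matters here: the default how="any" would also drop rows or columns that are only partly empty.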
3. Handle Missing Values (pandas.fillna)
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({"A": [1, 2, None], "B": [None, 4, 5]})
# Handle missing values: fill each column with its own mean
df_filled = df.fillna(df.mean())
print(df_filled)
Output:
     A    B
0  1.0  4.5
1  2.0  4.0
2  1.5  5.0
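The mean only makes sense for numeric columns. When a table mixes types, fillna also accepts a dict that maps each column to its own fill value. The sketch below uses made-up column names (price, city) and an arbitrary "unknown" placeholder purely for illustration.

```python
import pandas as pd

# Hypothetical mixed-type data
df = pd.DataFrame({
    "price": [10.0, None, 30.0],
    "city": ["Oslo", None, "Rome"],
})

# One fill value per column: median for numbers, a placeholder for text
filled = df.fillna({"price": df["price"].median(), "city": "unknown"})
print(filled)
```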
4. Generate a Quick Data Report (pandas_profiling.ProfileReport)
import pandas as pd
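# Note: pandas-profiling is now published as ydata-profiling.
# On recent installs, use: from ydata_profiling import ProfileReport
from pandas_profiling import ProfileReport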
# Sample DataFrame
df = pd.DataFrame({"A": [1, 2, None], "B": [3, 4, 5]})
# Generate report
profile = ProfileReport(df, title="Quick Data Report")
profile.to_file("report.html")
Output: An HTML report (report.html) is generated with detailed insights.
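For a quick look without generating HTML, a small per-column summary can be assembled with pandas alone. This is only a rough sketch of a few statistics such a report typically includes (types, missing values, distinct counts), not a substitute for the full report; the column names of summary are my own choice.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None], "B": [3, 4, 5]})

# One row per column of df, one column per statistic
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing": df.isna().sum(),
    "missing_pct": df.isna().mean().mul(100).round(1),
    "unique": df.nunique(),
})
print(summary)
```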
5. Explore Data Structure Visually (dataprep.eda.create_report)
import pandas as pd
from dataprep.eda import create_report
# Sample DataFrame
df = pd.DataFrame({"A": [1, 2, None], "B": [3, 4, 5]})
# Generate visual report
report = create_report(df)
report.show_browser()
Output: A visual report is displayed in your browser.
6. Complex Filtering/Grouping (pandas.query, pandas.groupby)
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({"Category": ["A", "B", "A"], "Value": [10, 20, 30]})
# Group by category and calculate the mean of each group
grouped_df = df.groupby("Category").mean()
print(grouped_df)
Output:
          Value
Category
A          20.0
B          20.0
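The heading also mentions query, so here is a short sketch that filters first and groups second. The extra row and the cutoff of 8 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "B", "A", "B"],
    "Value": [10, 20, 30, 5],
})

# query() takes a boolean expression written against the column names
filtered = df.query("Value > 8")

# Several aggregations at once on the filtered rows
result = filtered.groupby("Category")["Value"].agg(["mean", "count"])
print(result)
```

Filtering before grouping keeps the small B value (5) out of the statistics, so B ends up with a count of 1.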
7. Add Computed Columns (janitor.add_column)
import pandas as pd
import janitor
# Sample DataFrame
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
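# Add a computed column. add_column takes the values themselves
# (a scalar or a sequence as long as the DataFrame), so pass the computed Series.
df = janitor.add_column(df, "C", df["A"] + df["B"])
print(df)
Output:
   A  B  C
0  1  3  4
1  2  4  6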
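Plain pandas covers the same ground with assign, which does accept a callable. A minimal sketch follows; the second column D is a made-up addition to show that later columns can build on earlier ones within the same call.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Each lambda receives the DataFrame as it stands at that point in the call,
# so D can refer to the freshly created C.
df = df.assign(
    C=lambda d: d["A"] + d["B"],
    D=lambda d: d["C"] * 2,
)
print(df)
```

assign returns a new DataFrame rather than modifying the original, which makes it convenient in method chains.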
8. Merge or Join Datasets (pandas.merge)
import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({"ID": [1, 2], "Value1": [10, 20]})
df2 = pd.DataFrame({"ID": [1, 2], "Value2": [30, 40]})
# Merge DataFrames
merged_df = pd.merge(df1, df2, on="ID")
print(merged_df)
Output:
   ID  Value1  Value2
0   1      10      30
1   2      20      40
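The example above has matching IDs on both sides. When the keys only partly overlap, the how and indicator parameters of pd.merge show what happened to each row. The sketch below uses made-up IDs and values.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Value1": [10, 20, 30]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "Value2": [200, 300, 400]})

# how="outer" keeps IDs from both sides;
# indicator=True adds a "_merge" column: left_only, right_only, or both
merged = pd.merge(df1, df2, on="ID", how="outer", indicator=True)
print(merged)
```

Rows marked left_only or right_only have NaN in the columns that came from the other table, which is a quick way to spot unmatched keys.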
9. Data Cleaning and Standardization (dataprep.clean)
import pandas as pd
from dataprep.clean import clean_headers
# Sample DataFrame
df = pd.DataFrame({"Col 1 ": [1, 2], "Col@2": [3, 4]})
# Clean headers
df = clean_headers(df)
print(df)
Output:
   col_1  col_2
0      1      3
1      2      4
10. Quick Insights into Distributions (dataprep.eda.plot)
import pandas as pd
from dataprep.eda import plot
# Sample DataFrame
df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]})
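# Plot distributions. In a notebook, plot(df) renders inline;
# from a script, call show_browser() on the returned object to open it.
plot(df).show_browser()
Output: A distribution plot is displayed in your browser.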
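When no browser is handy, a text-only look at the distributions is possible with pandas alone. This is a rough sketch, not what dataprep computes: it pulls a few summary statistics with describe and buckets one column with pd.cut (the choice of two bins is arbitrary).

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]})

# Minimum, median, and maximum for every numeric column
stats = df.describe().loc[["min", "50%", "max"]]
print(stats)

# A tiny text histogram: how many values of A fall into each of two equal-width bins
bins = pd.cut(df["A"], bins=2).value_counts().sort_index()
print(bins)
```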
Written by Anix Lynch
