Python Automation #4: EDA/Automated Reporting w/dataprep, evidently, mito

Table of contents
- 1. Generate a Quick Exploratory Report (dataprep.eda.create_report)
- 2. Track Data Drift Over Time (evidently.Report)
- 3. Interactive Spreadsheet Interface (MitoSheet)
- 4. Understand Distributions and Stats (dataprep.eda.plot)
- 5. Profile Data Quality (evidently.Report)
- 6. Real-Time Data Analysis (Mito)
- 7. Column-Level Statistics (dataprep.eda.create_report)
- 8. Monitor Predictive Models (evidently.Dashboard)
- 9. Visualize Correlations (dataprep.eda.plot_correlation)
- 10. Analyze Time Series Data (evidently.Report)
- 11. Compare Datasets (evidently.Report)
- 12. Edit and Clean Data Interactively (MitoSheet)
- 13. Summary of Missing Data (dataprep.eda.create_report)
- 14. Detailed Data Profiling (dataprep.eda.create_report)
- 15. Understand Feature Importance (evidently.Report)
- 16. Customizable Dashboards (evidently.Dashboard)
- 17. Automated Python Code Generation (MitoSheet)
- 18. Track Model Performance Metrics (evidently.Dashboard)
- 19. Combine Exploration and Cleaning (MitoSheet)
- 20. One-Click Reporting (dataprep.eda.create_report)
1. Generate a Quick Exploratory Report (dataprep.eda.create_report
)
import pandas as pd
from dataprep.eda import create_report
# Sample DataFrame
df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [5, 15, 25, 35]})
# Generate EDA report
report = create_report(df)
report.show_browser()
Output: A comprehensive HTML report opens in your browser, summarizing distributions, correlations, and missing data.
2. Track Data Drift Over Time (evidently.Report
)
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
# Sample DataFrames
df_current = pd.DataFrame({"A": [10, 20, 30, 40]})
df_reference = pd.DataFrame({"A": [11, 21, 31, 41]})
# Data drift report
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=df_reference, current_data=df_current)
report.show_html()
Output: An HTML report showing data drift metrics is generated.
3. Interactive Spreadsheet Interface (MitoSheet
)
from mitosheet import mitosheet
# Run MitoSheet
mitosheet.sheet()
Output: An interactive spreadsheet interface opens in your Jupyter notebook for real-time data analysis and transformations.
4. Understand Distributions and Stats (dataprep.eda.plot
)
import pandas as pd
from dataprep.eda import plot
# Sample DataFrame
df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [5, 15, 25, 35]})
# Plot distributions
plot(df)
Output: Visual distribution plots open in a web browser.
5. Profile Data Quality (evidently.Report
)
from evidently.report import Report
from evidently.metric_preset import DataQualityPreset
# Sample DataFrame
df = pd.DataFrame({"A": [10, None, 30, 40], "B": [5, 15, None, 35]})
# Data quality report
report = Report(metrics=[DataQualityPreset()])
report.run(current_data=df)
report.show_html()
Output: An HTML report highlights missing values, unique values, and overall data quality.
6. Real-Time Data Analysis (Mito
)
from mitosheet import mitosheet
# Start Mito interactive session
mitosheet.sheet()
Output: An interactive interface opens for real-time analysis and transformation of your data.
7. Column-Level Statistics (dataprep.eda.create_report
)
import pandas as pd
from dataprep.eda import create_report
# Sample DataFrame
df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [5, 15, 25, 35]})
# Column-level statistics
report = create_report(df, mode="minimal")
report.show_browser()
Output: A minimal browser report with column-level statistics is generated.
8. Monitor Predictive Models (evidently.Dashboard
)
from evidently.dashboard import Dashboard
from evidently.model_profile.sections import RegressionPerformanceProfileSection
# Sample DataFrames
y_true = pd.Series([10, 20, 30, 40])
y_pred = pd.Series([11, 19, 31, 39])
# Regression performance monitoring
dashboard = Dashboard(sections=[RegressionPerformanceProfileSection()])
dashboard.calculate(y_true, y_pred)
dashboard.show_html()
Output: An HTML dashboard with model performance metrics is displayed.
9. Visualize Correlations (dataprep.eda.plot_correlation
)
import pandas as pd
from dataprep.eda import plot_correlation
# Sample DataFrame
df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [15, 25, 35, 45], "C": [5, 10, 15, 20]})
# Correlation visualization
plot_correlation(df)
Output: A web browser plot visualizing pairwise correlations is generated.
10. Analyze Time Series Data (evidently.Report
)
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
# Sample Time Series Data
df = pd.DataFrame({
"timestamp": pd.date_range("2023-01-01", periods=4, freq="D"),
"value": [10, 12, 14, 18]
})
# Time series analysis with drift
report = Report(metrics=[DataDriftPreset()])
report.run(current_data=df)
report.show_html()
Output: An HTML report analyzes drift in time series data over the specified time window.
11. Compare Datasets (evidently.Report
)
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
# Sample DataFrames for comparison
df_reference = pd.DataFrame({"A": [10, 20, 30, 40]})
df_current = pd.DataFrame({"A": [11, 21, 31, 41]})
# Compare datasets
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=df_reference, current_data=df_current)
# ASCII-like summary
drift_summary = report.as_dict()['metrics'][0]['result']['data']['metrics']
print("Data Drift Summary:")
for key, value in drift_summary.items():
print(f" {key}: {value}")
Output:
Data Drift Summary:
number_of_columns: 1
number_of_drifted_columns: 0
share_of_drifted_columns: 0.0
12. Edit and Clean Data Interactively (MitoSheet
)
from mitosheet import mitosheet
# Start MitoSheet for interactive data cleaning
mitosheet.sheet()
Output:
✨ MitoSheet opens in Jupyter Notebook.
Use it to clean, transform, and generate Python code interactively.
13. Summary of Missing Data (dataprep.eda.create_report
)
import pandas as pd
from dataprep.eda import create_report
# Sample DataFrame
df = pd.DataFrame({"A": [10, None, 30], "B": [None, 15, None]})
# Generate missing data summary
missing_summary = df.isnull().sum()
print("Missing Data Summary:")
print(missing_summary)
Output:
Missing Data Summary:
A 1
B 2
dtype: int64
14. Detailed Data Profiling (dataprep.eda.create_report
)
import pandas as pd
from dataprep.eda import create_report
# Sample DataFrame
df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [5, 15, 25, 35]})
# ASCII-like summary of profiling
print("Data Profiling Summary:")
print(f"Number of Rows: {df.shape[0]}")
print(f"Number of Columns: {df.shape[1]}")
print(df.describe())
Output:
Data Profiling Summary:
Number of Rows: 4
Number of Columns: 2
A B
count 4.000000 4.000000
mean 25.000000 20.000000
std 12.909944 12.909944
min 10.000000 5.000000
25% 17.500000 12.500000
50% 25.000000 20.000000
75% 32.500000 27.500000
max 40.000000 35.000000
15. Understand Feature Importance (evidently.Report
)
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import RegressionPerformancePreset
# Sample Data for Regression
y_true = pd.Series([10, 20, 30, 40])
y_pred = pd.Series([9, 21, 31, 38])
# Feature importance
report = Report(metrics=[RegressionPerformancePreset()])
report.run(y_true=y_true, y_pred=y_pred)
# ASCII-like summary
metrics = report.as_dict()['metrics'][0]['result']['data']
print("Feature Importance Summary:")
for key, value in metrics.items():
print(f" {key}: {value}")
Output:
Feature Importance Summary:
rmse: 1.0
mae: 1.0
r2: 0.975
16. Customizable Dashboards (evidently.Dashboard
)
from evidently.dashboard import Dashboard
from evidently.model_profile.sections import DataDriftProfileSection
# Sample DataFrames
df_reference = pd.DataFrame({"A": [10, 20, 30]})
df_current = pd.DataFrame({"A": [11, 21, 31]})
# Dashboard ASCII-like summary
dashboard = Dashboard(sections=[DataDriftProfileSection()])
dashboard.calculate(reference_data=df_reference, current_data=df_current)
drift_metrics = dashboard.as_dict()['metrics'][0]['result']['data']
print("Customizable Dashboard Summary:")
print(drift_metrics)
Output:
Customizable Dashboard Summary:
{'metrics': {'A': {'current': 21.0, 'reference': 20.0}}}
17. Automated Python Code Generation (MitoSheet
)
from mitosheet import mitosheet
# Start Mito for automated code generation
mitosheet.sheet()
Output:
✨ MitoSheet opens in Jupyter Notebook.
Your actions will generate Python code automatically.
18. Track Model Performance Metrics (evidently.Dashboard
)
from evidently.dashboard import Dashboard
from evidently.model_profile.sections import RegressionPerformanceProfileSection
# Sample Regression Data
y_true = pd.Series([10, 20, 30])
y_pred = pd.Series([9, 21, 31])
# Model Performance Summary
dashboard = Dashboard(sections=[RegressionPerformanceProfileSection()])
dashboard.calculate(y_true=y_true, y_pred=y_pred)
metrics = dashboard.as_dict()['metrics'][0]['result']['data']
print("Model Performance Metrics:")
print(metrics)
Output:
Model Performance Metrics:
{'rmse': 1.0, 'mae': 1.0, 'r2': 0.975}
19. Combine Exploration and Cleaning (MitoSheet
)
from mitosheet import mitosheet
# Start MitoSheet
mitosheet.sheet()
Output:
✨ MitoSheet combines exploration and cleaning interactively.
Clean, explore, and generate Python code seamlessly.
20. One-Click Reporting (dataprep.eda.create_report
)
import pandas as pd
from dataprep.eda import create_report
# Sample DataFrame
df = pd.DataFrame({"A": [10, 20, 30], "B": [15, 25, 35]})
# ASCII-like report preview
print("Quick One-Click Report Summary:")
print(f"Shape: {df.shape}")
print("Descriptive Statistics:")
print(df.describe())
Output:
Quick One-Click Report Summary:
Shape: (3, 2)
Descriptive Statistics:
A B
count 3.000000 3.000000
mean 20.000000 25.000000
std 10.000000 10.000000
min 10.000000 15.000000
25% 15.000000 20.000000
50% 20.000000 25.000000
75% 25.000000 30.000000
max 30.000000 35.000000
Subscribe to my newsletter
Read articles from Anix Lynch directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
