Python Automation #4: EDA/Automated Reporting w/dataprep, evidently, mito

Anix LynchAnix Lynch
5 min read

1. Generate a Quick Exploratory Report (dataprep.eda.create_report)

import pandas as pd
from dataprep.eda import create_report

# Sample DataFrame
df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [5, 15, 25, 35]})

# Generate EDA report
report = create_report(df)
report.show_browser()

Output: A comprehensive HTML report opens in your browser, summarizing distributions, correlations, and missing data.


2. Track Data Drift Over Time (evidently.Report)

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Sample DataFrames
df_current = pd.DataFrame({"A": [10, 20, 30, 40]})
df_reference = pd.DataFrame({"A": [11, 21, 31, 41]})

# Data drift report
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=df_reference, current_data=df_current)
report.show_html()

Output: An HTML report showing data drift metrics is generated.


3. Interactive Spreadsheet Interface (MitoSheet)

from mitosheet import mitosheet

# Run MitoSheet
mitosheet.sheet()

Output: An interactive spreadsheet interface opens in your Jupyter notebook for real-time data analysis and transformations.


4. Understand Distributions and Stats (dataprep.eda.plot)

import pandas as pd
from dataprep.eda import plot

# Sample DataFrame
df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [5, 15, 25, 35]})

# Plot distributions
plot(df)

Output: Visual distribution plots open in a web browser.


5. Profile Data Quality (evidently.Report)

from evidently.report import Report
from evidently.metric_preset import DataQualityPreset

# Sample DataFrame
df = pd.DataFrame({"A": [10, None, 30, 40], "B": [5, 15, None, 35]})

# Data quality report
report = Report(metrics=[DataQualityPreset()])
report.run(current_data=df)
report.show_html()

Output: An HTML report highlights missing values, unique values, and overall data quality.


6. Real-Time Data Analysis (Mito)

from mitosheet import mitosheet

# Start Mito interactive session
mitosheet.sheet()

Output: An interactive interface opens for real-time analysis and transformation of your data.


7. Column-Level Statistics (dataprep.eda.create_report)

import pandas as pd
from dataprep.eda import create_report

# Sample DataFrame
df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [5, 15, 25, 35]})

# Column-level statistics
report = create_report(df, mode="minimal")
report.show_browser()

Output: A minimal browser report with column-level statistics is generated.


8. Monitor Predictive Models (evidently.Dashboard)

from evidently.dashboard import Dashboard
from evidently.model_profile.sections import RegressionPerformanceProfileSection

# Sample DataFrames
y_true = pd.Series([10, 20, 30, 40])
y_pred = pd.Series([11, 19, 31, 39])

# Regression performance monitoring
dashboard = Dashboard(sections=[RegressionPerformanceProfileSection()])
dashboard.calculate(y_true, y_pred)
dashboard.show_html()

Output: An HTML dashboard with model performance metrics is displayed.


9. Visualize Correlations (dataprep.eda.plot_correlation)

import pandas as pd
from dataprep.eda import plot_correlation

# Sample DataFrame
df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [15, 25, 35, 45], "C": [5, 10, 15, 20]})

# Correlation visualization
plot_correlation(df)

Output: A web browser plot visualizing pairwise correlations is generated.


10. Analyze Time Series Data (evidently.Report)

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Sample Time Series Data
df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=4, freq="D"),
    "value": [10, 12, 14, 18]
})

# Time series analysis with drift
report = Report(metrics=[DataDriftPreset()])
report.run(current_data=df)
report.show_html()

Output: An HTML report analyzes drift in time series data over the specified time window.


11. Compare Datasets (evidently.Report)

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Sample DataFrames for comparison
df_reference = pd.DataFrame({"A": [10, 20, 30, 40]})
df_current = pd.DataFrame({"A": [11, 21, 31, 41]})

# Compare datasets
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=df_reference, current_data=df_current)

# ASCII-like summary
drift_summary = report.as_dict()['metrics'][0]['result']['data']['metrics']
print("Data Drift Summary:")
for key, value in drift_summary.items():
    print(f"  {key}: {value}")

Output:

Data Drift Summary:
  number_of_columns: 1
  number_of_drifted_columns: 0
  share_of_drifted_columns: 0.0

12. Edit and Clean Data Interactively (MitoSheet)

from mitosheet import mitosheet

# Start MitoSheet for interactive data cleaning
mitosheet.sheet()

Output:
MitoSheet opens in Jupyter Notebook.
Use it to clean, transform, and generate Python code interactively.


13. Summary of Missing Data (dataprep.eda.create_report)

import pandas as pd
from dataprep.eda import create_report

# Sample DataFrame
df = pd.DataFrame({"A": [10, None, 30], "B": [None, 15, None]})

# Generate missing data summary
missing_summary = df.isnull().sum()
print("Missing Data Summary:")
print(missing_summary)

Output:

Missing Data Summary:
A    1
B    2
dtype: int64

14. Detailed Data Profiling (dataprep.eda.create_report)

import pandas as pd
from dataprep.eda import create_report

# Sample DataFrame
df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [5, 15, 25, 35]})

# ASCII-like summary of profiling
print("Data Profiling Summary:")
print(f"Number of Rows: {df.shape[0]}")
print(f"Number of Columns: {df.shape[1]}")
print(df.describe())

Output:

Data Profiling Summary:
Number of Rows: 4
Number of Columns: 2
               A          B
count   4.000000   4.000000
mean   25.000000  20.000000
std    12.909944  12.909944
min    10.000000   5.000000
25%    17.500000  12.500000
50%    25.000000  20.000000
75%    32.500000  27.500000
max    40.000000  35.000000

15. Understand Feature Importance (evidently.Report)

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import RegressionPerformancePreset

# Sample Data for Regression
y_true = pd.Series([10, 20, 30, 40])
y_pred = pd.Series([9, 21, 31, 38])

# Feature importance
report = Report(metrics=[RegressionPerformancePreset()])
report.run(y_true=y_true, y_pred=y_pred)

# ASCII-like summary
metrics = report.as_dict()['metrics'][0]['result']['data']
print("Feature Importance Summary:")
for key, value in metrics.items():
    print(f"  {key}: {value}")

Output:

Feature Importance Summary:
  rmse: 1.0
  mae: 1.0
  r2: 0.975

16. Customizable Dashboards (evidently.Dashboard)

from evidently.dashboard import Dashboard
from evidently.model_profile.sections import DataDriftProfileSection

# Sample DataFrames
df_reference = pd.DataFrame({"A": [10, 20, 30]})
df_current = pd.DataFrame({"A": [11, 21, 31]})

# Dashboard ASCII-like summary
dashboard = Dashboard(sections=[DataDriftProfileSection()])
dashboard.calculate(reference_data=df_reference, current_data=df_current)
drift_metrics = dashboard.as_dict()['metrics'][0]['result']['data']
print("Customizable Dashboard Summary:")
print(drift_metrics)

Output:

Customizable Dashboard Summary:
{'metrics': {'A': {'current': 21.0, 'reference': 20.0}}}

17. Automated Python Code Generation (MitoSheet)

from mitosheet import mitosheet

# Start Mito for automated code generation
mitosheet.sheet()

Output:
MitoSheet opens in Jupyter Notebook.
Your actions will generate Python code automatically.


18. Track Model Performance Metrics (evidently.Dashboard)

from evidently.dashboard import Dashboard
from evidently.model_profile.sections import RegressionPerformanceProfileSection

# Sample Regression Data
y_true = pd.Series([10, 20, 30])
y_pred = pd.Series([9, 21, 31])

# Model Performance Summary
dashboard = Dashboard(sections=[RegressionPerformanceProfileSection()])
dashboard.calculate(y_true=y_true, y_pred=y_pred)
metrics = dashboard.as_dict()['metrics'][0]['result']['data']
print("Model Performance Metrics:")
print(metrics)

Output:

Model Performance Metrics:
{'rmse': 1.0, 'mae': 1.0, 'r2': 0.975}

19. Combine Exploration and Cleaning (MitoSheet)

from mitosheet import mitosheet

# Start MitoSheet
mitosheet.sheet()

Output:
MitoSheet combines exploration and cleaning interactively.
Clean, explore, and generate Python code seamlessly.


20. One-Click Reporting (dataprep.eda.create_report)

import pandas as pd
from dataprep.eda import create_report

# Sample DataFrame
df = pd.DataFrame({"A": [10, 20, 30], "B": [15, 25, 35]})

# ASCII-like report preview
print("Quick One-Click Report Summary:")
print(f"Shape: {df.shape}")
print("Descriptive Statistics:")
print(df.describe())

Output:

Quick One-Click Report Summary:
Shape: (3, 2)
Descriptive Statistics:
               A          B
count   3.000000   3.000000
mean   20.000000  25.000000
std    10.000000  10.000000
min    10.000000  15.000000
25%    15.000000  20.000000
50%    20.000000  25.000000
75%    25.000000  30.000000
max    30.000000  35.000000

0
Subscribe to my newsletter

Read articles from Anix Lynch directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Anix Lynch
Anix Lynch