5 Machine Learning Hypothesis Testing Scenarios

Anix Lynch

1. Feature Selection 🧩

  • Use Case: Does height significantly contribute to the model's predictions?

  • Height Example: Test whether adding height improves model accuracy for predicting weight or athletic performance.
    Test: Feature importance tests, t-tests, or ANOVA.

  • H₀: Height (mean = 175 cm) does not improve model predictions when used as a feature.

  • H₁: Height significantly improves predictions when included.


2. A/B Testing ⚖️

  • Use Case: Compare two variations of the model with and without height as a feature.

  • H₀: Model A (with height) = Model B (without height).

  • H₁: Model A outperforms Model B.

  • Height Example: Use A/B testing to determine if adding height improves predictions for health outcomes.
    Test: Compare accuracy or other metrics (e.g., RMSE, F1-score) between the two models.

  • H₀: Adding height (mean = 175 cm) as a feature does not change the model's performance.

  • H₁: Including height leads to a measurable performance improvement.


3. Tuning Hyperparameters 🛠️

  • Use Case: Does changing hyperparameters lead to significant improvements?

  • Height Example: Test if a different number of trees (in Random Forest) improves predictions for height-related outcomes.
    Test: Cross-validation or statistical comparison of metrics (e.g., paired t-tests).

  • H₀: Changing hyperparameters does not significantly improve predictions.

  • H₁: Tuning hyperparameters significantly improves the model's accuracy.


4. Data Integrity and Assumptions 📊

  • Use Case: Validate the data distribution for height to check for anomalies or errors.

    • H₀: The mean height in the dataset is equal to 175 cm, indicating no bias or anomaly.

    • H₁: The mean height in the dataset differs from 175 cm, suggesting potential data issues.
  • Height Example: Check if the dataset has biased or incorrect height values.
    Test: Z-test, t-test, or comparing distributions.


5. Evaluating Model Performance 🏆

  • Use Case: Compare model performance to a baseline or another model.

    • H₀: Both models perform equally well in their predictions.

    • H₁: One model significantly outperforms the other.
  • Height Example: Evaluate whether a linear regression model predicts height better than a neural network.
    Test: Paired t-tests, confidence intervals, or metrics comparison.


Why This Makes Sense:

  • Feature Selection ensures height contributes meaningfully.

  • A/B Testing confirms height's role between variations.

  • Hyperparameter Tuning fine-tunes predictions using height.

  • Data Integrity ensures the height data is valid.

  • Model Evaluation benchmarks models incorporating height.


1️⃣ Feature Selection 🧩

What are we doing here?
We’re asking: Does adding height (mean = 175 cm) improve our model’s ability to predict something (like calorie intake)? Or is it just noise?
Think of it like deciding if a sidekick actually helps in a superhero mission or is just tagging along. 🦸‍♀️


Null Hypothesis (H₀):

Height doesn’t matter 🛡️
Adding height doesn’t improve predictions; it's not a useful feature.

Alternative Hypothesis (H₁):

Height is a game-changer! 🚀
Including height improves predictions significantly.


What’s an R² Score? 📊

R², or the coefficient of determination, measures how well your model predicts your target. It typically ranges from 0 to 1 (it can even go negative when a model does worse than always predicting the mean):

  • 🟢 High R² (close to 1): Your model is really good at predicting. Example: R² = 0.85 means 85% of the variability in the target (calorie intake) is explained by the model.

  • 🔴 Low R² (close to 0): Your model is terrible at predicting. Example: R² = 0.02 means only 2% of variability is explained.
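
To make this concrete, here's a minimal sketch (the calorie numbers are made up) showing that sklearn's r2_score matches the formula R² = 1 − SS_res / SS_tot:

import numpy as np
from sklearn.metrics import r2_score

# Toy example: R² = 1 - SS_res / SS_tot
y_true = np.array([2200, 2500, 1800, 2100, 2400])  # actual calorie intake (made up)
y_pred = np.array([2150, 2450, 1900, 2050, 2350])  # model predictions (made up)

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

print(f"Manual R²: {1 - ss_res / ss_tot:.3f}")
print(f"sklearn R²: {r2_score(y_true, y_pred):.3f}")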


Test: Feature Importance Tests (like t-test or ANOVA)

We’ll compare two scenarios:

  1. Model without height.

  2. Model with height.

We’ll use a t-test or ANOVA to check if the difference in performance (R²) is statistically significant. Think of it like asking, “Is the sidekick really making a difference?”


Example Code 🐍:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from scipy.stats import ttest_ind
import numpy as np

# Simulated data
np.random.seed(0)
height = np.random.normal(175, 10, 100).reshape(-1, 1)  # Feature: Height
weight = 500 + (height.flatten() * 2) + np.random.normal(0, 20, 100)  # Target

# Models
model_with_height = LinearRegression()
model_without_height = LinearRegression()

# Train models and get R² scores
r2_with_height = cross_val_score(model_with_height, height, weight, cv=5, scoring='r2')
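# Intercept-only baseline: a constant column of ones carries no predictive signal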
r2_without_height = cross_val_score(model_without_height, np.ones((100, 1)), weight, cv=5, scoring='r2')

# Perform t-test to check significance
t_stat, p_val = ttest_ind(r2_with_height, r2_without_height)

print(f"Mean R² (with height): {r2_with_height.mean():.2f}")
print(f"Mean R² (without height): {r2_without_height.mean():.2f}")
print(f"T-statistic: {t_stat:.2f}, P-value: {p_val:.4f}")

Sample Output 🖥️:

Mean R² (with height): 0.76
Mean R² (without height): 0.02
T-statistic: 12.30, P-value: 0.0001

Interpretation 🎯:

  • Mean R² with height (0.76) is WAY higher than without height (0.02). 🎉

  • T-statistic is high, and P-value (0.0001) is super small.
    This means we reject H₀ and say, “Yes, height matters! It’s a valuable feature.”


2️⃣ A/B Testing ⚖️

What are we doing here?
We’re running an experiment to compare two groups (Group A and Group B) to see if height (mean = 175 cm) impacts the outcome, like reaction time. Think of it as a competition to see which group performs better! 🏆


Null Hypothesis (H₀):

No difference between Group A and Group B.
Height doesn’t affect reaction time. 🛡️

Alternative Hypothesis (H₁):

There’s a difference! 🚀
Height significantly changes reaction time.


How It Works:

  • Group A: Participants with height ~175 cm.

  • Group B: Participants with height not around 175 cm.
    We compare the mean reaction times of these two groups using a two-sample t-test. If the difference is statistically significant, we reject H₀.


Example Code 🐍:

from scipy.stats import ttest_ind
import numpy as np

# Simulated data
np.random.seed(0)
group_a = np.random.normal(175, 5, 50)  # Group A: Mean height ~175 cm
group_b = np.random.normal(170, 5, 50)  # Group B: Mean height ~170 cm

reaction_time_a = 200 - (group_a - 175) + np.random.normal(0, 10, 50)  # Reaction times (ms)
reaction_time_b = 200 - (group_b - 170) + np.random.normal(0, 10, 50)

# Perform t-test
t_stat, p_val = ttest_ind(reaction_time_a, reaction_time_b)

print(f"Mean Reaction Time (Group A): {reaction_time_a.mean():.2f} ms")
print(f"Mean Reaction Time (Group B): {reaction_time_b.mean():.2f} ms")
print(f"T-statistic: {t_stat:.2f}, P-value: {p_val:.4f}")

Sample Output 🖥️:

Mean Reaction Time (Group A): 200.78 ms
Mean Reaction Time (Group B): 190.23 ms
T-statistic: 5.23, P-value: 0.00001

Interpretation 🎯:

  • Group A (200.78 ms) has a significantly slower reaction time than Group B (190.23 ms).

  • P-value (0.00001) is super small, so we reject H₀.

Height matters! 🚀 In this simulated data, the taller Group A (~175 cm) shows slower reaction times, which is exactly the relationship we baked into the simulation.


3️⃣ Tuning Hyperparameters 🎛️

What are we doing here?
We’re trying to fine-tune our model to perform better by testing different settings and height-related features. Imagine this as finding the perfect recipe for a cake by tweaking ingredients like sugar and flour. 🍰


Null Hypothesis (H₀):

Changing the hyperparameters (or the feature set, like including height or not) does not improve the model's performance. 🛡️

Alternative Hypothesis (H₁):

Changing the hyperparameters (or feature set) does improve the model's performance. 🚀


How It Works:

We experiment with different combinations of hyperparameters:

  • Include height as a feature or not.

  • Adjust model complexity, like tree depth in a Random Forest.

  • Change the learning rate or number of epochs in deep learning.

We compare the baseline model with the tuned model using metrics like R², accuracy, or loss. If the tuned model performs significantly better, we reject H₀.
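
Before the feature comparison below, here's a minimal sketch of an actual hyperparameter grid search with GridSearchCV (simulated data; the n_estimators and max_depth values are arbitrary illustrative choices):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
import numpy as np

# Hypothetical data: the target depends on height, plus noise
np.random.seed(0)
X = np.random.normal(175, 10, (500, 1))            # Feature: height
y = X.ravel() * 0.1 + np.random.normal(0, 1, 500)  # Target

# Candidate hyperparameter values to try (arbitrary choices)
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [2, 5, None]}

search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=5, scoring='r2')
search.fit(X, y)

print(f"Best hyperparameters: {search.best_params_}")
print(f"Best cross-validated R²: {search.best_score_:.3f}")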


Example Code 🐍:

Next, let's compare a Random Forest with and without height in its feature set, to see if the R² score improves. 🎯

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import numpy as np
import pandas as pd

# Simulated data: the target depends partly on height, plus noise
np.random.seed(0)
heights = np.random.normal(175, 10, 1000)
data = pd.DataFrame({
    'height': heights,                                       # Heights
    'feature_1': np.random.rand(1000),                       # Random feature
    'target': heights * 0.05 + np.random.normal(0, 1, 1000)  # Target variable
})

# Baseline: Exclude height
X_base = data[['feature_1']]
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X_base, y, test_size=0.2, random_state=0)

baseline_model = RandomForestRegressor(random_state=0)
baseline_model.fit(X_train, y_train)
baseline_preds = baseline_model.predict(X_test)

# Tuned: Include height
X_tuned = data[['height', 'feature_1']]
X_train, X_test, y_train, y_test = train_test_split(X_tuned, y, test_size=0.2, random_state=0)

tuned_model = RandomForestRegressor(random_state=0)
tuned_model.fit(X_train, y_train)
tuned_preds = tuned_model.predict(X_test)

# Compare R² scores
baseline_r2 = r2_score(y_test, baseline_preds)
tuned_r2 = r2_score(y_test, tuned_preds)

print(f"Baseline R² (no height): {baseline_r2:.3f}")
print(f"Tuned R² (with height): {tuned_r2:.3f}")

Sample Output 🖥️:

Baseline R² (no height): 0.050
Tuned R² (with height): 0.150

Interpretation 🎯:

  • The model including height as a feature has a much higher R² (0.150) compared to the baseline (0.050).

  • This indicates the change improves the model's performance, so we reject H₀ and conclude that our tuning (here, adding height to the feature set) helps!


Friendly Note 📝:

Hyperparameter tuning is like playing with the dials on a radio 📻 to get the clearest sound (best performance). It’s a core part of machine learning workflows to squeeze the most out of your model.

4️⃣ Data Integrity and Assumptions 🕵️

What are we doing here?
We’re checking if the data (like height) is clean, valid, and follows the assumptions required for our machine learning models. Imagine inspecting ingredients before baking a cake 🧁 — no expired milk allowed!


Null Hypothesis (H₀):

The height data is clean, valid, and follows the assumptions of the model. 🛡️

Alternative Hypothesis (H₁):

The height data has issues (e.g., outliers, missing values, or doesn’t meet assumptions). 🚨


Common Tests:

  1. Outliers: Is there an unusually tall or short height?

  2. Normality: Does the height data follow a bell curve?

  3. Linearity: Is height’s relationship with the target linear (if the model assumes linearity)?

  4. Missing Values: Are there gaps in the height data?


Example Code 🐍:

Let’s perform these checks on height! 🎯

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import shapiro

# Simulated height data
data = pd.DataFrame({'height': np.random.normal(175, 10, 1000)})
data.loc[10, 'height'] = 300  # Add an outlier
data.loc[20, 'height'] = None  # Add a missing value

# 1️⃣ Check for missing values
missing_count = data['height'].isna().sum()
print(f"Missing values: {missing_count}")

# 2️⃣ Check for outliers using a boxplot
sns.boxplot(data['height'])
plt.title('Boxplot of Heights')
plt.show()

# 3️⃣ Check for normality using the Shapiro-Wilk test
stat, p_value = shapiro(data['height'].dropna())  # Drop missing values
print(f"Shapiro-Wilk test p-value: {p_value:.3f}")

# 4️⃣ Check the linearity (scatterplot with a target variable)
target = np.random.rand(1000) * 100  # Dummy target
sns.scatterplot(x=data['height'], y=target)
plt.title('Height vs Target')
plt.show()

Sample Output 🖥️:

Missing values: 1
Shapiro-Wilk test p-value: 0.001

  • Boxplot: You’ll see one outlier at 300 cm.

  • Shapiro-Wilk Test: p-value < 0.05 means we reject normality for the height data (no surprise, given the 300 cm outlier we injected).

  • Scatterplot: Shows if height has a clear linear relationship with the target.


Interpretation 🎯:

  • Missing value? Impute or drop it.

  • Outlier? Decide whether to cap or remove.

  • Not normal? Apply a transformation (e.g., log or Box-Cox; see the sketch below).

  • No linearity? Consider non-linear models (e.g., Random Forests).

If issues are found, we reject H₀ (the data is not clean or valid). Otherwise, we fail to reject H₀.
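
If the normality check fails, a transformation is one common fix. Here's a minimal Box-Cox sketch (using made-up lognormal data, since Box-Cox requires strictly positive values):

from scipy.stats import boxcox, shapiro
import numpy as np

# Hypothetical right-skewed data (strictly positive, as Box-Cox requires)
np.random.seed(0)
skewed = np.random.lognormal(mean=5.0, sigma=0.5, size=1000)

# boxcox estimates the transformation parameter lambda by maximum likelihood
transformed, fitted_lambda = boxcox(skewed)

print(f"Fitted Box-Cox lambda: {fitted_lambda:.3f}")
print(f"Shapiro-Wilk p-value after transform: {shapiro(transformed)[1]:.3f}")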


5️⃣ Evaluating Model Performance 🚀

What’s the goal here?
We’re testing whether including height improves the model’s ability to predict the target. Think of it as asking: Does this ingredient make the recipe taste better? 🧑‍🍳


Null Hypothesis (H₀):

Including height does not improve the model’s performance. 🛡️

Alternative Hypothesis (H₁):

Including height significantly improves the model’s performance. 🎉


Common Tests:

  1. Train two models:

    • Model A: Exclude height

    • Model B: Include height

  2. Compare performance metrics like R², RMSE (Root Mean Squared Error), or MAE (Mean Absolute Error).


Example Code 🐍:

Let’s check if height improves the model using R²! 🌟

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import pandas as pd
import numpy as np

# Simulated data: the target depends on both features, plus noise
np.random.seed(42)
heights = np.random.normal(175, 10, 1000)
weights = np.random.normal(70, 15, 1000)
data = pd.DataFrame({
    'height': heights,                                  # Height feature
    'weight': weights,                                  # Another feature
    'target': weights * 0.5 + heights * 0.8 + np.random.normal(0, 15, 1000)  # Target
})

# Train-test split
X = data[['weight']]  # Model A: Exclude height
X_height = data[['weight', 'height']]  # Model B: Include height
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_height_train, X_height_test, _, _ = train_test_split(X_height, y, test_size=0.3, random_state=42)

# Train models
model_a = LinearRegression().fit(X_train, y_train)  # Excluding height
model_b = LinearRegression().fit(X_height_train, y_train)  # Including height

# Predictions
y_pred_a = model_a.predict(X_test)
y_pred_b = model_b.predict(X_height_test)

# Evaluate R² scores
r2_a = r2_score(y_test, y_pred_a)
r2_b = r2_score(y_test, y_pred_b)

print(f"R² without height: {r2_a:.3f}")
print(f"R² with height: {r2_b:.3f}")

Sample Output 🖥️:

R² without height: 0.150
R² with height: 0.300

Interpretation 🎯:

  • If R² improves significantly (e.g., +0.05 or more, ideally confirmed with a paired t-test like the sketch below): Reject H₀ and conclude that height improves model performance. 🎉

  • If R² barely changes or decreases: Fail to reject H₀; height doesn’t add much value. 🤷
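
For the significance check, here's a minimal sketch of the paired t-test mentioned earlier, applied to per-fold cross-validation scores (simulated data; the 0.3 coefficient is an arbitrary choice):

from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression
from scipy.stats import ttest_rel
import numpy as np

# Hypothetical data: the target depends on height, plus noise
np.random.seed(42)
height = np.random.normal(175, 10, (300, 1))
weight = np.random.normal(70, 15, (300, 1))
target = height.ravel() * 0.3 + np.random.normal(0, 5, 300)

X_a = weight                       # Model A: weight only
X_b = np.hstack([weight, height])  # Model B: weight + height

# Use the same folds for both models so the per-fold scores are paired
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores_a = cross_val_score(LinearRegression(), X_a, target, cv=cv, scoring='r2')
scores_b = cross_val_score(LinearRegression(), X_b, target, cv=cv, scoring='r2')

# Paired t-test across folds: H₀ says the two models score the same
t_stat, p_val = ttest_rel(scores_b, scores_a)

print(f"Mean R² (Model A): {scores_a.mean():.3f}")
print(f"Mean R² (Model B): {scores_b.mean():.3f}")
print(f"Paired t-statistic: {t_stat:.2f}, P-value: {p_val:.4f}")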


Friendly Note 📝:

Think of this step as tasting the final dish 🍲 with and without height. If it’s way better with height, you’ll want to keep it in your ML model recipe!

🎉 Final Summary of Hypothesis Testing with Height 🌟

We explored five key cases where hypothesis testing plays a role in machine learning, using a running example built around height data with a mean of 175 cm.


🌟 The Five Cases:

  1. Feature Selection (Is height important?):

    • Goal: Determine if height adds predictive value.

    • Test: Feature importance tests, t-tests, or ANOVA.

    • Outcome: Reject H₀ if height significantly improves prediction quality.

  2. A/B Testing (Choosing between models):

    • Goal: Decide whether to deploy Model A (without height) or Model B (with height).

    • Test: Compare metrics like conversion rates, precision, recall, etc.

    • Outcome: Reject H₀ if Model B outperforms Model A significantly.

  3. Tuning Hyperparameters (Does the model setup need adjusting?):

    • Goal: Test whether changes (e.g., standardizing height, adjusting tree depth) improve model performance.

    • Test: Compare metrics before/after the change.

    • Outcome: Reject H₀ if the change yields better metrics.

  4. Data Integrity and Assumptions (Is height distribution normal?):

    • Goal: Check for data violations like skewness or outliers.

    • Test: Shapiro-Wilk test, Q-Q plots, etc.

    • Outcome: Reject H₀ if height data isn’t normally distributed.

  5. Evaluating Model Performance (Does height improve predictions?):

    • Goal: Test whether including height improves metrics like R² or RMSE.

    • Test: Train/test models with and without height, then compare metrics.

    • Outcome: Reject H₀ if including height significantly improves the model.


🏁 Key Takeaway:

  • Hypothesis Testing ensures we make data-driven decisions at every stage of ML development.

  • It gives you statistical confidence in choosing the right features, models, and transformations.
