5 Machine Learning Hypothesis Testing Scenarios

Table of contents
- 1. Feature Selection 🧩
- 2. A/B Testing ⚖️
- 3. Tuning Hyperparameters 🎛️
- 4. Data Integrity and Assumptions 🕵️
- 5. Evaluating Model Performance 🚀
- Why This Makes Sense
- 🎉 Final Summary of Hypothesis Testing with Height 🌟
1. Feature Selection 🧩
Use Case: Does height significantly contribute to the model's predictions?
Height Example: Test whether adding height improves model accuracy for predicting weight or athletic performance.
Test: Feature importance tests, t-tests, or ANOVA.
H₀: The mean height is 175 cm, and it does not improve model predictions when used as a feature.
H₁: The mean height is 175 cm, but it significantly improves predictions when included.
2. A/B Testing ⚖️
Use Case: Compare two variations of the model with and without height as a feature.
H₀: Model A (with height) = Model B (without height).
H₁: Model A outperforms Model B.
Height Example: Use A/B testing to determine if adding height improves predictions for health outcomes.
Test: Compare accuracy or other metrics (e.g., RMSE, F1-score) between the two models.
H₀: The mean height is 175 cm, and adding height as a feature does not change the model's performance.
H₁: The mean height is 175 cm, but including it leads to a performance improvement.
3. Tuning Hyperparameters 🛠️
Use Case: Does changing hyperparameters lead to significant improvements?
Height Example: Test if a different number of trees (in Random Forest) improves predictions for height-related outcomes.
Test: Cross-validation or statistical comparison of metrics (e.g., paired t-tests).
H₀: The mean height is 175 cm, and changing hyperparameters does not significantly improve predictions.
H₁: The mean height is 175 cm, and tuning hyperparameters improves the model's accuracy.
4. Data Integrity and Assumptions 📊
Use Case: Validate the data distribution for height to check for anomalies or errors.
H₀: The mean height in the dataset is equal to 175 cm, indicating no bias or anomaly.
H₁: The mean height in the dataset differs from 175 cm, suggesting potential data issues.
Height Example: Check if the dataset has biased or incorrect height values.
Test: Z-test, t-test, or comparing distributions.
5. Evaluating Model Performance 🏆
Use Case: Compare model performance to a baseline or another model.
H₀: The mean height is 175 cm, and both models perform equally well in predictions.
H₁: The mean height is 175 cm, but one model outperforms the other.
Height Example: Evaluate whether a linear regression model predicts height better than a neural network.
Test: Paired t-tests, confidence intervals, or metrics comparison.
Why This Makes Sense:
Feature Selection ensures height contributes meaningfully.
A/B Testing confirms height's role between variations.
Hyperparameter Tuning fine-tunes predictions using height.
Data Integrity ensures the height data is valid.
Model Evaluation benchmarks models incorporating height.
1️⃣ Feature Selection 🧩
What are we doing here?
We’re asking: Does adding height (mean = 175 cm) improve our model’s ability to predict something (like calorie intake)? Or is it just noise?
Think of it like deciding if a sidekick actually helps in a superhero mission or is just tagging along. 🦸‍♀️
Null Hypothesis (H₀):
Height doesn’t matter 🛡️
Adding height doesn’t improve predictions; it's not a useful feature.
Alternative Hypothesis (H₁):
Height is a game-changer! 🚀
Including height improves predictions significantly.
What’s an R² Score? 📊
R², or the coefficient of determination, measures how well your model predicts your target. It typically ranges from 0 to 1 (out of sample it can even dip below 0, if a model does worse than simply predicting the mean):
🟢 High R² (close to 1): Your model is really good at predicting. Example: R² = 0.85 means 85% of the variability in the target (calorie intake) is explained by the model.
🔴 Low R² (close to 0): Your model is terrible at predicting. Example: R² = 0.02 means only 2% of variability is explained.
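To make this concrete, here’s a tiny sketch (with made-up calorie numbers, purely for illustration) that computes R² by hand and checks it against scikit-learn’s r2_score:
import numpy as np
from sklearn.metrics import r2_score
# Hypothetical true vs predicted calorie intakes (illustrative numbers only)
y_true = np.array([2100, 2300, 1900, 2500, 2200])
y_pred = np.array([2050, 2350, 1950, 2400, 2250])
# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(f"R² by hand: {1 - ss_res / ss_tot:.3f}")
print(f"R² via sklearn: {r2_score(y_true, y_pred):.3f}")
Both prints give 0.900 here: the predictions explain 90% of the variability in calorie intake.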
Test: Feature Importance Tests (like t-test or ANOVA)
We’ll compare two scenarios:
Model without height.
Model with height.
We’ll use a t-test or ANOVA to check if the difference in performance (R²) is statistically significant. Think of it like asking, “Is the sidekick really making a difference?”
Example Code 🐍:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from scipy.stats import ttest_ind
import numpy as np
# Simulated data
np.random.seed(0)
height = np.random.normal(175, 10, 100).reshape(-1, 1) # Feature: Height
weight = 500 + (height.flatten() * 2) + np.random.normal(0, 20, 100) # Target
# Models
model_with_height = LinearRegression()
model_without_height = LinearRegression()
# Train models and get R² scores
r2_with_height = cross_val_score(model_with_height, height, weight, cv=5, scoring='r2')
r2_without_height = cross_val_score(model_without_height, np.ones((100, 1)), weight, cv=5, scoring='r2') # Constant feature = an intercept-only "no height" baseline
# Perform t-test to check significance (note: CV fold scores aren't fully independent samples, so treat this p-value as a rough guide)
t_stat, p_val = ttest_ind(r2_with_height, r2_without_height)
print(f"Mean R² (with height): {r2_with_height.mean():.2f}")
print(f"Mean R² (without height): {r2_without_height.mean():.2f}")
print(f"T-statistic: {t_stat:.2f}, P-value: {p_val:.4f}")
Sample Output 🖥️:
Mean R² (with height): 0.76
Mean R² (without height): 0.02
T-statistic: 12.30, P-value: 0.0001
Interpretation 🎯:
Mean R² with height (0.76) is WAY higher than without height (0.02). 🎉
T-statistic is high, and P-value (0.0001) is super small.
This means we reject H₀ and say, “Yes, height matters! It’s a valuable feature.”
2️⃣ A/B Testing ⚖️
What are we doing here?
We’re running an experiment to compare two groups (Group A and Group B) to see if height (mean = 175 cm) impacts the outcome, like reaction time. Think of it as a competition to see which group performs better! 🏆
Null Hypothesis (H₀):
No difference between Group A and Group B.
Height doesn’t affect reaction time. 🛡️
Alternative Hypothesis (H₁):
There’s a difference! 🚀
Height significantly changes reaction time.
How It Works:
Group A: Participants with height ~175 cm.
Group B: Participants with height not around 175 cm.
We compare the mean reaction times of these two groups using a two-sample t-test. If the difference is statistically significant, we reject H₀.
Example Code 🐍:
from scipy.stats import ttest_ind
import numpy as np
# Simulated data
np.random.seed(0)
group_a = np.random.normal(175, 5, 50) # Group A: Mean height ~175 cm
group_b = np.random.normal(170, 5, 50) # Group B: Mean height ~170 cm
reaction_time_a = 200 - (group_a - 175) + np.random.normal(0, 10, 50) # Group A reaction times (ms), centered near 200
reaction_time_b = 190 - (group_b - 170) + np.random.normal(0, 10, 50) # Group B centered near 190, so the groups genuinely differ
# Perform t-test
t_stat, p_val = ttest_ind(reaction_time_a, reaction_time_b)
print(f"Mean Reaction Time (Group A): {reaction_time_a.mean():.2f} ms")
print(f"Mean Reaction Time (Group B): {reaction_time_b.mean():.2f} ms")
print(f"T-statistic: {t_stat:.2f}, P-value: {p_val:.4f}")
Sample Output 🖥️:
Mean Reaction Time (Group A): 200.78 ms
Mean Reaction Time (Group B): 190.23 ms
T-statistic: 5.23, P-value: 0.00001
Interpretation 🎯:
Group A (200.78 ms) has a significantly slower reaction time than Group B (190.23 ms).
P-value (0.00001) is super small, so we reject H₀.
Height matters! 🚀 People with heights closer to 175 cm have slower reaction times.
3️⃣ Tuning Hyperparameters 🎛️
What are we doing here?
We’re trying to fine-tune our model to perform better by testing different height-related features. Imagine this as finding the perfect recipe for a cake by tweaking ingredients like sugar and flour. 🍰
Null Hypothesis (H₀):
Changing the hyperparameters (e.g., including height or not) does not improve the model's performance. 🛡️
Alternative Hypothesis (H₁):
Changing the hyperparameters does improve the model's performance. 🚀
How It Works:
We experiment with different combinations of hyperparameters:
Include height as a feature or not.
Adjust model complexity, like tree depth in a Random Forest.
Change the learning rate or number of epochs in deep learning.
We compare the baseline model with the tuned model using metrics like R², accuracy, or loss. If the tuned model performs significantly better, we reject H₀.
Example Code 🐍:
Let’s try tuning a Random Forest to see if including height improves the model's R² score. 🎯
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import pandas as pd
import numpy as np
# Simulated data (seeded for reproducibility; the target depends partly on height, so including it can help)
np.random.seed(0)
data = pd.DataFrame({
    'height': np.random.normal(175, 10, 1000),  # Heights
    'feature_1': np.random.rand(1000),          # Random feature
})
data['target'] = 0.05 * data['height'] + data['feature_1'] + np.random.normal(0, 1, 1000)  # Target driven by height and feature_1
# Baseline: Exclude height
X_base = data[['feature_1']]
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X_base, y, test_size=0.2, random_state=0)
baseline_model = RandomForestRegressor(random_state=0)
baseline_model.fit(X_train, y_train)
baseline_preds = baseline_model.predict(X_test)
# Tuned: Include height
X_tuned = data[['height', 'feature_1']]
X_train, X_test, y_train, y_test = train_test_split(X_tuned, y, test_size=0.2, random_state=0)
tuned_model = RandomForestRegressor(random_state=0)
tuned_model.fit(X_train, y_train)
tuned_preds = tuned_model.predict(X_test)
# Compare R² scores
baseline_r2 = r2_score(y_test, baseline_preds)
tuned_r2 = r2_score(y_test, tuned_preds)
print(f"Baseline R² (no height): {baseline_r2:.3f}")
print(f"Tuned R² (with height): {tuned_r2:.3f}")
Sample Output 🖥️:
Baseline R² (no height): 0.050
Tuned R² (with height): 0.150
Interpretation 🎯:
The model including height as a feature has a much higher R² (0.150) compared to the baseline (0.050).
This indicates height improves the model's performance, so we reject H₀ and conclude that this change (including height) helps! Strictly speaking, adding a feature changes the feature set rather than a hyperparameter; tuning the model's own dials is sketched after the Friendly Note below.
Friendly Note 📝:
Hyperparameter tuning is like playing with the dials on a radio 📻 to get the clearest sound (best performance). It’s a core part of machine learning workflows to squeeze the most out of your model.
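To turn the model’s own dials in the narrower sense, a minimal sketch with scikit-learn’s GridSearchCV might look like this. It assumes the simulated data frame from the example above, and the grid values are illustrative choices, not recommendations:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
# Illustrative grid of hyperparameter values to try
param_grid = {
    'n_estimators': [50, 100, 200],  # Number of trees
    'max_depth': [3, 5, None],       # Maximum tree depth
}
# 5-fold cross-validated grid search, scored by R²
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=5, scoring='r2')
search.fit(data[['height', 'feature_1']], data['target'])
print(f"Best hyperparameters: {search.best_params_}")
print(f"Best cross-validated R²: {search.best_score_:.3f}")
If the best cross-validated score beats the baseline configuration by more than the fold-to-fold noise, that’s your evidence for rejecting H₀.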
4️⃣ Data Integrity and Assumptions 🕵️
What are we doing here?
We’re checking if the data (like height) is clean, valid, and follows the assumptions required for our machine learning models. Imagine inspecting ingredients before baking a cake 🧁 — no expired milk allowed!
Null Hypothesis (H₀):
The height data is clean, valid, and follows the assumptions of the model. 🛡️
Alternative Hypothesis (H₁):
The height data has issues (e.g., outliers, missing values, or doesn’t meet assumptions). 🚨
Common Tests:
Outliers: Is there an unusually tall or short height?
Normality: Does the height data follow a bell curve?
Linearity: Is height’s relationship with the target linear (if the model assumes linearity)?
Missing Values: Are there gaps in the height data?
Example Code 🐍:
Let’s perform these checks on height! 🎯
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import shapiro
# Simulated height data
np.random.seed(0) # Seeded for reproducibility
data = pd.DataFrame({'height': np.random.normal(175, 10, 1000)})
data.loc[10, 'height'] = 300 # Add an outlier
data.loc[20, 'height'] = None # Add a missing value
# 1️⃣ Check for missing values
missing_count = data['height'].isna().sum()
print(f"Missing values: {missing_count}")
# 2️⃣ Check for outliers using a boxplot
sns.boxplot(data['height'])
plt.title('Boxplot of Heights')
plt.show()
# 3️⃣ Check for normality using the Shapiro-Wilk test
stat, p_value = shapiro(data['height'].dropna()) # Drop missing values
print(f"Shapiro-Wilk test p-value: {p_value:.3f}")
# 4️⃣ Check the linearity (scatterplot with a target variable)
target = np.random.rand(1000) * 100 # Dummy target
sns.scatterplot(x=data['height'], y=target)
plt.title('Height vs Target')
plt.show()
Sample Output 🖥️:
Missing values: 1
Shapiro-Wilk test p-value: 0.001
Boxplot: You’ll see one outlier at 300 cm.
Shapiro-Wilk Test: p-value < 0.05 means height is not normally distributed.
Scatterplot: Shows if height has a clear linear relationship with the target.
Interpretation 🎯:
Missing value? Impute or drop it.
Outlier? Decide whether to cap or remove.
Not normal? Apply a transformation (e.g., log or Box-Cox).
No linearity? Consider non-linear models (e.g., Random Forests).
If issues are found, we reject H₀ (the data is not clean or valid). Otherwise, we fail to reject H₀. A short sketch of these fixes, plus a direct test of H₀: the mean height is 175 cm, follows below.
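Here is a minimal sketch of the fixes above, continuing from the simulated data frame in the example; the 3-standard-deviation cap is an illustrative choice, not a rule:
import numpy as np
from scipy.stats import ttest_1samp
# 1️⃣ Impute the missing value with the median height
data['height'] = data['height'].fillna(data['height'].median())
# 2️⃣ Cap outliers at mean ± 3 standard deviations (illustrative threshold)
mean, std = data['height'].mean(), data['height'].std()
data['height'] = data['height'].clip(mean - 3 * std, mean + 3 * std)
# 3️⃣ If the data were skewed, a log transform could help (heights are positive)
log_height = np.log(data['height'])
# 4️⃣ Directly test H₀: the mean height is 175 cm
t_stat, p_value = ttest_1samp(data['height'], 175)
print(f"One-sample t-test vs 175 cm: t = {t_stat:.2f}, p = {p_value:.3f}")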
5️⃣ Evaluating Model Performance 🚀
What’s the goal here?
We’re testing whether including height improves the model’s ability to predict the target. Think of it as asking: Does this ingredient make the recipe taste better? 🧑‍🍳
Null Hypothesis (H₀):
Including height does not improve the model’s performance. 🛡️
Alternative Hypothesis (H₁):
Including height significantly improves the model’s performance. 🎉
Common Tests:
Train two models:
Model A: Exclude height
Model B: Include height
Compare performance metrics like R², RMSE (Root Mean Squared Error), or MAE (Mean Absolute Error).
Example Code 🐍:
Let’s check if height improves the model using R²! 🌟
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import numpy as np
# Simulated data
np.random.seed(42)
data = pd.DataFrame({
    'height': np.random.normal(175, 10, 1000),  # Height feature
    'weight': np.random.normal(70, 15, 1000),   # Another feature
})
data['target'] = 0.5 * data['weight'] + 0.5 * data['height'] + np.random.normal(0, 15, 1000)  # Target depends on both features
# Train-test split
X = data[['weight']] # Model A: Exclude height
X_height = data[['weight', 'height']] # Model B: Include height
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_height_train, X_height_test, _, _ = train_test_split(X_height, y, test_size=0.3, random_state=42) # Same random_state, so the rows match the split above
# Train models
model_a = LinearRegression().fit(X_train, y_train) # Excluding height
model_b = LinearRegression().fit(X_height_train, y_train) # Including height
# Predictions
y_pred_a = model_a.predict(X_test)
y_pred_b = model_b.predict(X_height_test)
# Evaluate R² scores
r2_a = r2_score(y_test, y_pred_a)
r2_b = r2_score(y_test, y_pred_b)
print(f"R² without height: {r2_a:.3f}")
print(f"R² with height: {r2_b:.3f}")
Sample Output 🖥️:
R² without height: 0.150
R² with height: 0.300
Interpretation 🎯:
If R² improves significantly (e.g., +0.05 or more): Reject H₀ and conclude that height improves model performance. 🎉
If R² barely changes or decreases: Fail to reject H₀; height doesn’t add much value. 🤷
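A single train/test split can’t tell us whether that jump in R² is signal or luck. One way to put a p-value on it is a paired t-test over cross-validation folds, sketched below (reusing X, X_height, and y from the example above):
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from scipy.stats import ttest_rel
# R² per fold for each model on the same 5 folds (default unshuffled KFold, so the folds match)
scores_a = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')        # Model A: without height
scores_b = cross_val_score(LinearRegression(), X_height, y, cv=5, scoring='r2') # Model B: with height
# Paired t-test: compares the two models fold by fold
t_stat, p_val = ttest_rel(scores_b, scores_a)
print(f"Paired t-test on fold R² scores: t = {t_stat:.2f}, p = {p_val:.4f}")
With only five folds the test has limited power, so treat it as a sanity check rather than a final verdict.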
Friendly Note 📝:
Think of this step as tasting the final dish 🍲 with and without height. If it’s way better with height, you’ll want to keep it in your ML model recipe!
🎉 Final Summary of Hypothesis Testing with Height 🌟
We explored five key cases where hypothesis testing plays a role in machine learning, using the example: "The mean height is 175 cm" as the Null Hypothesis (H₀).
🌟 The Five Cases:
Feature Selection (Is height important?):
Goal: Determine if height adds predictive value.
Test: Feature importance tests, t-tests, or ANOVA.
Outcome: Reject H₀ if height significantly improves prediction quality.
A/B Testing (Choosing between models):
Goal: Decide whether to deploy Model A (without height) or Model B (with height).
Test: Compare metrics like conversion rates, precision, recall, etc.
Outcome: Reject H₀ if Model B outperforms Model A significantly.
Tuning Hyperparameters (Does height need transformation?):
Goal: Test whether transformations (e.g., standardizing height) improve model performance.
Test: Compare metrics before/after transformation.
Outcome: Reject H₀ if the transformation yields better metrics.
Data Integrity and Assumptions (Is height distribution normal?):
Goal: Check for data violations like skewness or outliers.
Test: Shapiro-Wilk test, Q-Q plots, etc. (a short Q-Q plot sketch follows this list).
Outcome: Reject H₀ if height data isn’t normally distributed.
Evaluating Model Performance (Does height improve predictions?):
Goal: Test whether including height improves metrics like R² or RMSE.
Test: Train/test models with and without height, then compare metrics.
Outcome: Reject H₀ if including height significantly improves the model.
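Q-Q plots are mentioned above but weren’t shown earlier, so here’s a minimal sketch using scipy.stats.probplot on freshly simulated heights:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import probplot
np.random.seed(0)
heights = np.random.normal(175, 10, 1000) # Simulated heights, for illustration
probplot(heights, dist='norm', plot=plt)  # Points hugging the diagonal suggest normality
plt.title('Q-Q Plot of Heights')
plt.show()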
🏁 Key Takeaway:
Hypothesis Testing ensures we make data-driven decisions at every stage of ML development.
It gives you statistical confidence in choosing the right features, models, and transformations.