Hypothesis Tests Every Data Scientist Should Know


Hypothesis tests are crucial for validating assumptions about data, offering a quantifiable measure of how accurate or inaccurate those assumptions may be. These tests should not be viewed as definitive proofs; instead, they serve as tools for decision-making and evaluating evidence in situations of uncertainty.
In this blog post, I will highlight the following tests:
t-Test
Chi-Square Test
One-way ANOVA
t-Test
A t-test is used to compare means between two groups. There are three types of t-tests:
One Sample t-Test
Independent Samples t-Test
Paired Samples t-Test
One Sample t-Test
Purpose: Compare the mean of a single sample to a known or hypothesized population mean.
Null Hypothesis (H0): The sample mean equals the population mean.
Alternative Hypothesis: The sample mean differs from the population mean.
import numpy as np
from scipy.stats import ttest_1samp
# Sample data
data = [12.9, 10.3, 11.2, 13.8, 9.6, 12.3, 14.1, 11.7, 10.9, 12.5]
# Hypothesized population mean
population_mean = 12.0
# Perform one-sample t-test
t_statistic, p_value = ttest_1samp(data, population_mean)
# Print results
print("T-statistic:", t_statistic)
print("P-value:", p_value)
# Interpretation
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
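For intuition, the one-sample t-statistic can also be computed by hand from its formula, t = (sample mean − hypothesized mean) / (s / √n), with a two-sided p-value from the t distribution. This sketch reproduces what `ttest_1samp` does above:

```python
import numpy as np
from scipy.stats import t

data = np.array([12.9, 10.3, 11.2, 13.8, 9.6, 12.3, 14.1, 11.7, 10.9, 12.5])
mu0 = 12.0  # hypothesized population mean
n = len(data)

# t = (sample mean - hypothesized mean) / standard error of the mean
t_stat = (data.mean() - mu0) / (data.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value = 2 * t.sf(abs(t_stat), df=n - 1)

print("t-statistic:", t_stat)
print("p-value:", p_value)
```

Note the `ddof=1` in the standard deviation: the test uses the sample (Bessel-corrected) standard deviation, not the population one.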
Independent Samples t-Test
Purpose: Compare means between two independent groups.
Null Hypothesis (H0): There is no difference between the means.
Alternative Hypothesis: The means of the groups differ.
from scipy.stats import ttest_ind
# Example data
group1 = [5.1, 4.8, 6.3, 5.5, 5.7]
group2 = [7.2, 6.9, 7.8, 7.4, 7.0]
# Perform independent t-test
t_stat, p_value = ttest_ind(group1, group2)
print("t-statistic:", t_stat)
print("p-value:", p_value)
# Decision
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
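By default, `ttest_ind` assumes the two groups have equal variances. When that assumption is doubtful, SciPy's `equal_var=False` option switches to Welch's t-test, which is often the safer choice:

```python
from scipy.stats import ttest_ind

group1 = [5.1, 4.8, 6.3, 5.5, 5.7]
group2 = [7.2, 6.9, 7.8, 7.4, 7.0]

# Welch's t-test: does not assume equal population variances
t_stat, p_value = ttest_ind(group1, group2, equal_var=False)

print("t-statistic:", t_stat)
print("p-value:", p_value)
```

With equal sample sizes and similar spreads the two variants give nearly identical results; they diverge when group sizes or variances differ substantially.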
Paired Samples t-Test
Purpose: Compare the means of two related groups. The test is appropriate when you have "before-and-after" measurements or when the same subjects are measured under two conditions.
Null Hypothesis: The mean difference is zero.
Alternative hypothesis: The mean difference is not zero.
from scipy.stats import ttest_rel
# Example data (pre-test and post-test scores)
pre_test = [85, 89, 78, 92, 88, 76, 95, 91]
post_test = [88, 90, 80, 94, 86, 79, 97, 93]
# Perform paired t-test
t_stat, p_value = ttest_rel(pre_test, post_test)
print("t-statistic:", t_stat)
print("p-value:", p_value)
# Decision
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
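A paired t-test is equivalent to a one-sample t-test on the per-subject differences against zero, which makes the "mean difference is zero" null hypothesis concrete:

```python
import numpy as np
from scipy.stats import ttest_1samp

pre_test = np.array([85, 89, 78, 92, 88, 76, 95, 91])
post_test = np.array([88, 90, 80, 94, 86, 79, 97, 93])

# Per-subject differences; the paired test is a one-sample test on these
differences = post_test - pre_test
t_stat, p_value = ttest_1samp(differences, 0.0)

print("t-statistic:", t_stat)
print("p-value:", p_value)
```

This is also why pairing matters: the test only "sees" the differences, so subject-to-subject variation cancels out.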
Chi-Square Test
Here, I am highlighting the chi-square test for independence.
This test is commonly used to check whether two categorical variables are related.
It can answer questions like:
Does gender influence the preference for a particular type of chocolate?
How does bike type preference vary among different age groups?
Consider the following contingency table:
Group\Category | Category A | Category B | Category C |
Group A        | 10         | 20         | 30         |
Group B        | 6          | 9          | 17         |
Variables: Group and Category
Null Hypothesis: The two variables are independent (no relationship exists).
Alternative Hypothesis: The two variables are not independent (there is an association).
Degrees of freedom: the number of cell counts that can vary freely once the row and column totals are fixed. For a contingency table:
$$df = (\text{number of rows} - 1) \times (\text{number of columns} - 1)$$
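For the 2×3 table above, the formula gives df = (2 − 1) × (3 − 1) = 2; a quick check:

```python
import numpy as np

observed = np.array([[10, 20, 30],
                     [6, 9, 17]])

rows, cols = observed.shape
df = (rows - 1) * (cols - 1)  # (2 - 1) * (3 - 1) = 2
print("Degrees of freedom:", df)
```

This matches the `dof` value that `chi2_contingency` returns below.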
import numpy as np
from scipy.stats import chi2_contingency
# Define the contingency table
observed = np.array([[10, 20, 30],
                     [6, 9, 17]])
# Perform the chi-square test
chi2, p, dof, expected = chi2_contingency(observed)
# Print the results
print("Chi-Square Statistic:", chi2)
print("P-Value:", p)
print("Degrees of Freedom:", dof)
print("Expected Frequencies:")
print(expected)
alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis: the variables are associated.")
else:
    print("Fail to reject the null hypothesis: the variables appear independent.")
One-Way ANOVA
While a t-test compares means between two groups, one-way ANOVA compares means across three or more groups.
Null Hypothesis: All group means are equal
Alternative Hypothesis: At least one group mean is different
import numpy as np
from scipy.stats import f_oneway
# Example data: Scores from three groups
group1 = [85, 90, 88, 92, 85]
group2 = [78, 82, 79, 81, 80]
group3 = [89, 91, 93, 94, 92]
# Perform One-Way ANOVA
stat, p_value = f_oneway(group1, group2, group3)
# Print results
print("F-Statistic:", stat)
print("P-Value:", p_value)
# Decision
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis: at least one group mean differs.")
else:
    print("Fail to reject the null hypothesis: no evidence that the means differ.")
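The F-statistic behind `f_oneway` is the ratio of between-group variance to within-group variance; this sketch reproduces it from the sums of squares:

```python
import numpy as np
from scipy.stats import f

groups = [np.array([85, 90, 88, 92, 85]),
          np.array([78, 82, 79, 81, 80]),
          np.array([89, 91, 93, 94, 92])]

all_data = np.concatenate(groups)
grand_mean = all_data.mean()
k = len(groups)        # number of groups
n = len(all_data)      # total observations

# Between-group sum of squares: how far group means sit from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
p_value = f.sf(f_stat, k - 1, n - k)

print("F-Statistic:", f_stat)
print("P-Value:", p_value)
```

A large F means the group means differ by more than the within-group noise would explain. Note that a significant ANOVA only says at least one mean differs; identifying which one requires a follow-up (post-hoc) comparison.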
Written by Sudarshan, Machine Learning Engineer