How to Analyze Relationships in Data Using Parametric Tests

Suyog Timalsina
7 min read

In the last blog post, we discussed hypothesis testing, confidence intervals, and understanding the p-value. In this blog, we’ll apply parametric tests to analyze a real-world dataset. In practical scenarios, data often consists of two types: categorical and numerical variables. For example, you might encounter a dataset containing company names (categorical) and car prices (numerical).
We often want to find out whether a numerical value depends on, or is related to, a categorical variable. To explore these relationships, we use parametric tests.

What is a Parametric Test?

A parametric test is a statistical test that makes specific assumptions about the population parameters and the underlying distribution of the data. When those assumptions hold, parametric tests are generally more powerful than non-parametric tests. In this blog, I will show how to use parametric tests when working with a real dataset.

Importing the dataset

I have used the Kaggle dataset [link here](https://www.kaggle.com/datasets/amjadzhour/car-price-prediction). In this blog, we won’t be performing feature engineering. Instead, we’ll focus on how to apply parametric tests to this dataset.

When working with data, it’s important to first identify which columns are numerical and which are categorical. This perspective helps us discover meaningful relationships and insights from the dataset.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("amjadzhour/car-price-prediction")

print("Path to dataset files:", path)
Path to dataset files: /kaggle/input/car-price-prediction

import pandas as pd

df = pd.read_csv('/kaggle/input/car-price-prediction/Car_Price_Prediction.csv')
df.head()

Assumptions of Parametric Tests

Before using parametric tests, it is essential to verify whether the data meets certain assumptions. The most important assumption is that the numerical data should be approximately normally distributed. This means the data should roughly follow a bell-shaped curve, which you can check using tests like the Shapiro-Wilk test, or visually using histograms and Q-Q plots.

Other key assumptions of parametric tests include:

  • Homogeneity of variances: The variance within each group should be similar (a quick check is sketched after this list).

  • Independence of observations: Data points should be independent of each other.
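To make these checks concrete, here is a minimal sketch (using the Price column and the Make grouping examined later in this post) that draws a Q-Q plot for normality and runs Levene's test for homogeneity of variances with scipy:

import matplotlib.pyplot as plt
from scipy import stats

# Q-Q plot: points lying close to the diagonal suggest approximate normality
stats.probplot(df['Price'], dist="norm", plot=plt)
plt.title('Q-Q Plot of Price')
plt.show()

# Levene's test: p > 0.05 suggests the group variances are similar
groups = [g['Price'].values for _, g in df.groupby('Make')]
lev_stat, lev_p = stats.levene(*groups)
print(f"Levene's test p-value = {lev_p:.3f}")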

import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot(df['Price'], kde=True)

Here, we can see that the Price column, which we want to predict, is approximately normal, so we can use parametric tests. To verify normality statistically, we use the Shapiro-Wilk test. In this blog, we will perform this test using the scipy library.

from scipy.stats import shapiro

stat, p = shapiro(df['Price'])
print(f"Shapiro-Wilk Test: Statistic = {stat:.3f}")
print(f"Shapiro-Wilk Test: p-value = {p:.3f}")

if p > 0.05:
    print("The data is likely drawn from a normal distribution.")
else:
    print("The data is likely not drawn from a normal distribution.")
Shapiro-Wilk Test: Statistic = 0.998
Shapiro-Wilk Test: p-value = 0.141
The data is likely drawn from a normal distribution.

Since Price appears to be normally distributed, we can reasonably use parametric tests. Now, let’s explore how Price relates to different categorical groups in our dataset.

# This prints information about each column of the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Make          1000 non-null   object 
 1   Model         1000 non-null   object 
 2   Year          1000 non-null   int64  
 3   Engine Size   1000 non-null   float64
 4   Mileage       1000 non-null   int64  
 5   Fuel Type     1000 non-null   object 
 6   Transmission  1000 non-null   object 
 7   Price         1000 non-null   float64
dtypes: float64(2), int64(2), object(4)
memory usage: 62.6+ KB

ANOVA Test:

One-Way ANOVA is used to examine the effect of one categorical variable (with two or more groups, though it is most useful with three or more) on a numerical variable. For example, testing whether different car brands (categorical) have different average prices (numerical).

Two-Way ANOVA is used when you want to study the effect of two categorical variables on a numerical variable, and also check if there is any interaction effect between the two categorical variables. For example, analyzing how both car brand and fuel type (two categorical variables) affect car price (numerical).

In short, two-way ANOVA examines relationships involving more than one categorical column and a single numerical target.
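The one-way case is demonstrated with scipy in the next code block. Two-way ANOVA is not available in scipy.stats, but as a rough sketch (not part of the original analysis) it could be run with the statsmodels formula API, using the Make and Fuel Type columns of this dataset:

# A minimal two-way ANOVA sketch using statsmodels (assumed library choice)
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Q("Fuel Type") quotes the column name because it contains a space
model = ols('Price ~ C(Make) + C(Q("Fuel Type")) + C(Make):C(Q("Fuel Type"))', data=df).fit()
two_way_table = sm.stats.anova_lm(model, typ=2)
print(two_way_table)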

# Testing the relationship between price and car company (Make)
from scipy.stats import f_oneway

grouped_prices = df.groupby('Make')['Price'].apply(list)
f_stat, p_val = f_oneway(*grouped_prices)
print(f"ANOVA p-value: {p_val:.4f}")

if p_val < 0.05:
    print("There is a significant difference in price between different car companies.")
else:
    print("There is no significant difference in price between different car companies.")
ANOVA p-value: 0.3064
There is no significant difference in price between different car companies.

The ANOVA p-value is 0.3064, which indicates that there is no significant difference in price between different car companies. From this, we can conclude that, in this dataset, the company name and price are not significantly related.

To better understand this visually, we can use a boxplot to compare the price distributions across car companies. Let’s see how to create this plot.

sns.boxplot(x='Make', y='Price', data=df)
plt.xticks(rotation=90)

From the boxplot, we can also observe that the price distributions are quite similar across the different car companies, which supports our ANOVA result that there is no significant difference in prices between the companies.

T-Test:

We use the t-test when we want to compare the means of a numerical variable between two groups defined by a categorical variable with exactly two categories (levels).

For example, if we want to see whether the average car price differs between cars with manual and automatic transmission (where the transmission type has only two categories), the t-test is appropriate.

from scipy.stats import ttest_ind

t_grouped_prices = df.groupby('Transmission')['Price'].apply(list)
t_stat, t_p_val = ttest_ind(*t_grouped_prices)
print(f"t-test p-value: {t_p_val:.4f}")

if t_p_val < 0.05:
    print("There is a significant difference in price between different transmission types.")
else:
    print("There is no significant difference in price between different transmission types.")
t-test p-value: 0.4190
There is no significant difference in price between different transmission types.

The t-test p-value is 0.4190, which indicates there is no significant difference in price between the transmission types. In other words, based on this dataset, the transmission type does not have a significant effect on car price.
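One detail worth noting: by default, scipy's ttest_ind assumes equal variances in the two groups (the homogeneity assumption discussed earlier). If that assumption is doubtful, Welch's t-test can be requested with equal_var=False; a minimal sketch, reusing the groups from above:

# Welch's t-test does not assume equal variances between the two groups
w_stat, w_p_val = ttest_ind(*t_grouped_prices, equal_var=False)
print(f"Welch's t-test p-value: {w_p_val:.4f}")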

We can verify this result visually using a boxplot, which shows the distribution of prices for each transmission type.

sns.boxplot(x='Transmission', y='Price', data=df)
plt.xticks(rotation=90)

Pearson Correlation:

The Pearson correlation coefficient is a statistical measure that evaluates the linear relationship between two numerical variables. It ranges from -1 to +1:

  • +1 indicates a perfect positive linear relationship,

  • 0 means no linear relationship,

  • -1 indicates a perfect negative linear relationship.

In this section, we use Pearson correlation to identify which numerical variables in our dataset are closely related. We’ll visualize the correlations using a heatmap to make patterns easier to spot.

num_df = df.select_dtypes(include=['int64', 'float64'])
corr_matrix = num_df.corr(method='pearson')

sns.heatmap(corr_matrix, annot=True)
plt.title('Correlation Matrix')
plt.show()
corr_matrix

The Pearson correlation matrix shows that Price is moderately positively correlated with Year (0.61) and weakly to moderately with Engine Size (0.38), meaning newer cars and those with larger engines tend to be more expensive. Mileage has a moderate negative correlation with Price (-0.56), indicating that cars with higher mileage generally cost less. Other relationships, such as Year vs Mileage (0.02) and Engine Size vs Mileage (-0.01), show almost no linear correlation, suggesting they are largely independent. This analysis helps us identify which numerical features are most influential in determining car prices.
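The correlation matrix gives the strength of each relationship but not a p-value. If we also want a significance test for an individual pair of numerical columns (analogous to the tests above), scipy's pearsonr provides one; a small sketch, using Year and Price as an example:

from scipy.stats import pearsonr

r, r_p_val = pearsonr(df['Year'], df['Price'])
print(f"Pearson r (Year vs Price) = {r:.2f}, p-value = {r_p_val:.4f}")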

Conclusion

In this blog, we explored how to apply parametric tests to real-life datasets. We used ANOVA to compare prices across different car brands and found no significant differences. Then, we applied a t-test to check for differences in car prices based on transmission types — again, no significant effect was found. Finally, using Pearson correlation, we discovered that newer cars and those with larger engines tend to be more expensive, while higher mileage negatively impacts price.

These statistical tools help us uncover patterns and relationships that might not be obvious at first glance. They are essential for making informed, data-driven decisions — especially when working with clean, well-structured numerical data that satisfies the assumptions required by parametric tests.

👉 In the next post, we’ll dive into non-parametric tests, which are incredibly useful when your data doesn’t meet the assumptions of normality or equal variance. We’ll walk through real examples and show how to apply tests like the Mann-Whitney U test, Wilcoxon signed-rank test, and Kruskal-Wallis test — so you can still draw meaningful insights from messy or non-normal datasets. Stay tuned!
