Non-Parametric Analysis of Titanic Data with Log Transformations

Suyog Timalsina
10 min read

Hello everyone, I’m back with another blog post! I know it’s been a little while since my last post on parametric tests, but I wanted to share some updates with you. I recently transferred to Weber State University, where I’m now studying Computational Statistics and Data Science. Because of the transfer process, I was a bit busy settling into college life, which is why it took me some time to get back to blogging.

What Is a Non-Parametric Test?

A non-parametric test is a type of statistical test that does not assume a specific distribution (like normal distribution) for the data.

  • Unlike parametric tests (e.g., t-test, Pearson correlation) that rely on assumptions about the population parameters, non-parametric tests are more flexible and can handle:

    • Skewed data

    • Outliers

    • Ordinal data or ranks
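To see why rank-based methods are the more forgiving choice, here is a tiny sketch (with made-up numbers, not Titanic data) comparing Pearson and Spearman on a strongly skewed but perfectly monotonic relationship:

import numpy as np
from scipy import stats

# A strongly skewed but perfectly monotonic relationship (illustrative data only)
x = np.arange(1, 11)
y = np.exp(x)

print(stats.pearsonr(x, y)[0])   # well below 1: the skew distorts Pearson
print(stats.spearmanr(x, y)[0])  # exactly 1.0: ranks capture the monotonic trend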

Common Uses of Non-Parametric Tests:

Scenario → Test

  • Categorical vs Categorical → Chi-Square Test

  • Numerical vs Numerical → Spearman Correlation

  • Numerical vs Binary Categorical → Mann-Whitney U Test

  • Numerical vs 3+ Categorical Groups → Kruskal-Wallis Test

Let’s Dive into the Real-World Use of Non-Parametric Tests

In this section, we will be using the famous Titanic dataset to demonstrate the use of non-parametric tests in real-world data analysis. First, let’s import the dataset and take a quick look at it.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"

df = pd.read_csv(url)
df.head()

Before we start our analysis, it’s important to check if the dataset has any missing values. Missing data can affect statistical tests, so we need to be aware of it.

df.isnull().sum()

We can see that there are missing values in the Age, Cabin, and Embarked columns. Since our focus here is on non-parametric analysis, we won’t dive deeply into all the methods of handling missing data.

For simplicity and effectiveness, we will use median imputation for the Age column. This is a common approach because the median is robust to outliers and skew, so imputing with it shifts the overall distribution less than imputing with the mean would.

Before imputing missing values, it’s helpful to see how the Age data is distributed. We can use a KDE (kernel density estimate) plot to visualize it.

sns.kdeplot(df['Age'])
plt.show()

After this, we can impute the missing values using the median and then plot the distribution again to see that the overall shape of the data remains mostly unchanged.

# median imputation for Age
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)
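To confirm that the overall shape is preserved (apart from a taller spike at the median), we can simply re-plot the same KDE after imputation:

# re-plot the Age distribution after median imputation
sns.kdeplot(df['Age'])
plt.title('Age distribution after median imputation')
plt.show()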

Both Embarked and Cabin are categorical columns, so we will use mode imputation to fill their missing values.

# mode imputation for Cabin
mode_cabin = df['Cabin'].mode()[0]
df['Cabin'] = df['Cabin'].fillna(mode_cabin)

# mode imputation for Embarked
mode_embarked = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(mode_embarked)

With the missing values handled, we are now ready to work with the clean data.

Chi-Square Test: Categorical vs Categorical

Now that our data is cleaned and missing values are handled, we are ready to perform the Chi-Square test. This test is used to determine whether there is a significant association between two categorical variables.

  • In the Titanic dataset, a classic example is Survived vs Sex.

  • The Chi-Square test helps us answer questions like: “Is survival dependent on the passenger’s gender?”

# chi-square test
from scipy.stats import chi2_contingency

# contingency table of Sex vs Survived
contingency_table = pd.crosstab(df['Sex'], df['Survived'])
print(contingency_table)

chi2, p, dof, expected = chi2_contingency(contingency_table)

print("Chi2 Statistic:", chi2)
print("p-value:", p)

if p <= 0.05:
    print("Result: Sex and Survived are dependent (significant association).")
else:
    print("Result: Sex and Survived are independent (no significant association).")

Output:

Survived    0    1
Sex               
female     81  233
male      468  109
Chi2 Statistic: 260.71702016732104
p-value: 1.1973570627755645e-58
Result: Sex and Survived are dependent (significant association).

sns.countplot(x='Sex', hue='Survived', data=df)
plt.show()

The p-value is extremely small, much less than 0.05, which means there is a significant association between Sex and Survived. In other words, survival on the Titanic was dependent on the passenger’s gender.

✅ Female passengers were far more likely to survive than male passengers, which aligns with historical accounts.
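If you are curious where the statistic comes from, here is a minimal sketch that recomputes it by hand from the contingency table above; under independence, each expected count is the row total times the column total divided by the grand total:

# expected counts under independence: E[i, j] = row_total[i] * col_total[j] / N
observed = contingency_table.values
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_counts = row_totals * col_totals / observed.sum()

chi2_manual = ((observed - expected_counts) ** 2 / expected_counts).sum()
print(chi2_manual)  # about 263; slightly higher than the reported 260.7 because
                    # chi2_contingency applies Yates' continuity correction to 2x2 tables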

Spearman Correlation: Numerical vs Numerical

Next, we look at the relationship between two numerical variables: Age and Fare. Since Age is not normally distributed, we use the Spearman correlation, a non-parametric method that measures the rank-based association between two variables.

Before applying Spearman correlation, it’s important to check whether the numerical variables are normally distributed. Since Spearman is a non-parametric test, it’s ideal for non-normal or skewed data. We can use a Q-Q plot to visualize normality.

sm.qqplot(df['Age'], line='s')
plt.title('Q-Q Plot of Age')
plt.show()
print('Skewness: %f' % df['Age'].skew())
print('Right skewed')

sm.qqplot(df['Fare'], line='s')
plt.title('Q-Q Plot of Fare')
plt.show()
print('Skewness: %f' % df['Fare'].skew())
print('Right skewed')

From the Q-Q plots, we can clearly see that the points deviate from the straight line, confirming that Age and Fare are not normally distributed. This verifies that using a non-parametric test, like Spearman correlation, is appropriate for analyzing their relationship.

# Spearman correlation between Age and Fare
correlation, p_value = stats.spearmanr(df['Age'], df['Fare'])
print("Spearman Correlation Coefficient:", correlation)
print("p-value:", p_value)

sns.scatterplot(x='Age', y='Fare', data=df)
plt.title('Scatter Plot of Age vs Fare')
plt.show()

Output:

Spearman Correlation Coefficient: 0.12600552124010062
p-value: 0.00016260974540267112

After performing the Spearman correlation between Age and Fare, we find a correlation coefficient of 0.126 and a p-value of 0.00016. This indicates a weak positive relationship between the two variables, meaning that as Age increases slightly, Fare tends to increase as well. The p-value is much smaller than 0.05, which shows that this weak correlation is still statistically significant. Even though the relationship is not strong, it confirms that there is a measurable association between Age and Fare in the Titanic dataset.
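A helpful way to think about Spearman is that it is simply the Pearson correlation computed on the ranks of the data; the short sketch below verifies this on the same two columns:

# Spearman = Pearson on ranks (average ranks are used for ties)
rank_corr, _ = stats.pearsonr(df['Age'].rank(), df['Fare'].rank())
spear_corr, _ = stats.spearmanr(df['Age'], df['Fare'])
print(rank_corr, spear_corr)  # both should be approximately 0.126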

Mann-Whitney U Test: Numerical vs Binary Categorical

Next, we analyze the relationship between a numerical variable (Fare) and a binary categorical variable (Survived). Since Fare is not normally distributed, we use the Mann-Whitney U test, a non-parametric test that compares the distributions of a numerical variable across two groups.

# Mann-Whitney U test: Fare by survival group
fare_0 = df[df['Survived'] == 0]['Fare']
fare_1 = df[df['Survived'] == 1]['Fare']

statistic, p_value = stats.mannwhitneyu(fare_0, fare_1, alternative='two-sided')

print("Mann-Whitney U Test Statistic:", statistic)
print("p-value:", p_value)

sns.boxplot(x='Survived', y='Fare', data=df)
plt.title("Fare Distribution by Survival")
plt.xlabel("Survived (0=No, 1=Yes)")
plt.ylabel("Fare")
plt.show()

Output:

Mann-Whitney U Test Statistic: 57806.5
p-value: 4.553477179250237e-22

After performing the Mann-Whitney U test on Fare versus Survived, we obtain a test statistic of 57806.5 and a p-value of 4.55 × 10⁻²². The extremely small p-value indicates a significant difference in the Fare distributions between passengers who survived and those who did not. In other words, Fare had a noticeable impact on survival, with passengers who paid higher fares generally having a better chance of surviving the Titanic disaster.
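Since rank-based tests compare whole distributions rather than means, the group medians and quartiles are a natural summary to report alongside the U statistic; here is a quick sketch using the same DataFrame:

# median and quartiles of Fare for non-survivors (0) and survivors (1)
print(df.groupby('Survived')['Fare'].describe()[['25%', '50%', '75%']])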

Kruskal-Wallis Test: Numerical vs 3+ Categorical Groups

Finally, we examine the relationship between a numerical variable (Age) and a categorical variable with more than two groups (Pclass). Since Age is not normally distributed and Pclass has three categories (1, 2, 3), we use the Kruskal-Wallis test, a non-parametric test that compares the distributions of a numerical variable across multiple groups.

# Kruskal-Wallis test: Age across the three passenger classes
age_1 = df[df['Pclass'] == 1]['Age']
age_2 = df[df['Pclass'] == 2]['Age']
age_3 = df[df['Pclass'] == 3]['Age']

stat, p_value = stats.kruskal(age_1, age_2, age_3)

print("Kruskal-Wallis H statistic:", stat)
print("p-value:", p_value)

sns.boxplot(x='Pclass', y='Age', data=df)
plt.title("Age Distribution by Passenger Class")
plt.xlabel("Passenger Class")
plt.ylabel("Age")
plt.show()

Output:

Kruskal-Wallis H statistic: 93.0106041417642
p-value: 6.353366830958113e-21

After performing the Kruskal-Wallis test on Age across the three Pclass groups, we get an H statistic of 93.01 and a p-value of 6.35 × 10⁻²¹. The extremely small p-value indicates a significant difference in Age distributions among the three passenger classes. This means that passenger class is associated with Age, with certain age groups more likely to book specific classes on the Titanic.
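As with the Mann-Whitney test, group medians make the result easier to interpret; the sketch below summarizes Age by class (note that Kruskal-Wallis only tells us that at least one class differs, not which ones):

# median Age per passenger class, to see where the differences lie
print(df.groupby('Pclass')['Age'].median())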

Log Transformation: Handling Non-Normal Data for Linear Models

Linear models like Linear Regression and Logistic Regression often perform better when heavily skewed features are brought closer to a symmetric, normal-like shape. Since real-world data is often skewed, we can apply log transformations to reduce that skew. In this section, I’ll show you how to perform a log transformation on skewed features to try to improve model performance.
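As a quick worked example of what the transformation does, log1p (that is, log(1 + x)) compresses large values while leaving small ones nearly unchanged and stays defined at zero; the fare values below are hypothetical, chosen only to illustrate the compression:

import numpy as np

# log1p compresses the right tail: a 500-unit fare ends up only about 3x the transformed value of a 7-unit fare
fares = np.array([0.0, 7.25, 80.0, 500.0])
print(np.log1p(fares))  # approximately [0.00, 2.11, 4.39, 6.22]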

For the modeling section, we will focus only on three columns: Age, Fare, and Survived. These variables are sufficient to demonstrate how transformations and non-parametric handling can improve model performance, while keeping the analysis simple and easy to follow.

Let’s import all the necessary Python libraries for modeling:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import FunctionTransformer
import numpy as np

data = df[['Age', 'Fare', 'Survived']]
data.head()

Before building our models, we need to split the data into training and testing sets. We’ll use 80% of the data for training and 20% for testing.

X = data.iloc[:,0:2]
y = data.iloc[:,2]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Since we want to predict which passengers survived the Titanic using only Age and Fare, this is a classification problem. To solve it, we will use a Logistic Regression model, which is ideal for predicting binary outcomes like Survived (0 = did not survive, 1 = survived).

clf = LogisticRegression()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

cm = confusion_matrix(y_test, y_pred)
cm_disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
cm_disp.plot()
plt.show()

Output:

Accuracy: 0.6480446927374302

Evaluating the Model with a Confusion Matrix

For classification problems like predicting survival, it’s important to measure how well the model is performing. One of the most useful tools for this is the confusion matrix, which shows the number of:

  • True Positives (TP): Correctly predicted survivors

  • True Negatives (TN): Correctly predicted non-survivors

  • False Positives (FP): Predicted survivor but actually did not survive

  • False Negatives (FN): Predicted non-survivor but actually survived

A confusion matrix gives a clear visual and numerical overview of the model’s accuracy, precision, and overall performance.

Interpretation:

  • 100 → True Negatives: Passengers correctly predicted as not survived

  • 16 → True Positives: Passengers correctly predicted as survived

  • 5 → False Positives: Predicted survived but actually did not survive

  • 58 → False Negatives: Predicted did not survive but actually survived
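From these four counts we can also derive the usual summary metrics by hand (a small sketch; scikit-learn’s precision_score and recall_score would give the same numbers):

# metrics derived from the confusion matrix counts above
TN, FP = 100, 5
FN, TP = 58, 16

accuracy = (TP + TN) / (TP + TN + FP + FN)   # ≈ 0.648, matching accuracy_score
precision = TP / (TP + FP)                   # ≈ 0.762
recall = TP / (TP + FN)                      # ≈ 0.216
print(accuracy, precision, recall)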

Now let’s apply the log transformation.

# log1p transform of Age and Fare
trf = FunctionTransformer(func=np.log1p)

X_train_transformed = trf.fit_transform(X_train)
X_test_transformed = trf.transform(X_test)   # only transform the test set

clf1 = LogisticRegression()
clf1.fit(X_train_transformed, y_train)

y_pred1 = clf1.predict(X_test_transformed)

accuracy1 = accuracy_score(y_test, y_pred1)
print("Accuracy after log transform:", accuracy1)

cm1 = confusion_matrix(y_test, y_pred1)
cm_disp1 = ConfusionMatrixDisplay(confusion_matrix=cm1, display_labels=clf1.classes_)
cm_disp1.plot()
plt.show()
print(cm1)

Output:

Accuracy after log transform: 0.6759776536312849

After applying a log transformation to the Age and Fare variables and retraining the Logistic Regression model, the confusion matrix becomes:

$$\begin{bmatrix} 94 & 11 \\ 47 & 27 \end{bmatrix}$$

  • 94 → True Negatives: correctly predicted non-survivors

  • 27 → True Positives: correctly predicted survivors

  • 11 → False Positives: predicted survived but did not

  • 47 → False Negatives: predicted did not survive but actually did

The overall accuracy of the model after log transformation is 0.676, which shows a slight improvement in capturing the patterns in skewed data.

✅ Applying log transformation helps normalize skewed features and can improve model performance in some cases, especially for numerical variables like Age and Fare.

Q-Q Plot of Fare Before and After Log Transformation

fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Q-Q plot before log transform
sm.qqplot(X_train['Fare'], line='s', ax=axes[0])
axes[0].set_title('Q-Q Plot of Fare Before Log Transform')

# Q-Q plot after log transform
sm.qqplot(X_train_transformed['Fare'], line='s', ax=axes[1])
axes[1].set_title('Q-Q Plot of Fare After Log Transform')

plt.tight_layout()
plt.show()

Conclusion

In this blog, we explored non-parametric tests and transformations using the Titanic dataset. We started by handling missing values with median and mode imputation and then analyzed relationships between variables:

  • Chi-Square Test showed a significant association between Sex and Survived.

  • Spearman Correlation revealed a weak but significant relationship between Age and Fare.

  • Mann-Whitney U Test indicated that Fare distributions differ significantly between survivors and non-survivors.

  • Kruskal-Wallis Test highlighted significant differences in Age across Pclass groups.

Finally, we applied log transformation to skewed numerical features (Age and Fare) and built a Logistic Regression model to predict survival. The transformation helped normalize the data and slightly improved model performance, as seen in the confusion matrix and accuracy.

✅ Overall, this workflow demonstrates how non-parametric tests and feature transformations can be effectively used to analyze real-world data and improve predictive modeling.

Coming Up in the Next Blog

In the next blog, I’ll cover how to visualize data effectively. You’ll learn:

  • How to think about data for graphing – choosing the right chart for the right type of variable

  • How to code graphs in Python using libraries like Matplotlib and Seaborn

  • Which graphs are most useful for exploring relationships and patterns in your dataset

Stay tuned to see step-by-step examples and learn how to turn raw data into insightful visualizations!
