Introduction

Exploratory Data Analysis (EDA) is often the very first step when working with a new dataset, and for good reason: it helps us understand the data before we build models or make decisions. As part of the Cyber Shujaa Program, I had the chance to dive into the Titanic dataset, a classic in data science.

At first glance, it’s a spreadsheet with numbers and text. But with proper EDA techniques, it becomes a story about people, decisions, and life-and-death outcomes aboard the ill-fated RMS Titanic.

This article breaks down my week-long EDA assignment: not just what I did, but why each step matters and what it teaches us about data analysis.

Why the Titanic Dataset?

The Titanic dataset is famous in data science because:

It contains a mix of numerical and categorical data (such as age, fare, gender, and class).
It has real-world significance: it captures a historical disaster with a wide range of passenger demographics.
The target variable is binary: Survived (1 = yes, 0 = no), making it ideal for classification.

More importantly, it pushes analysts to think: What factors influenced whether someone lived or died?

Step 1: Initial Data Exploration — "Know Thy Data"

Before jumping into visualizations or predictions, it’s important to answer basic questions:

How many rows and columns are there?
What types of data are we working with?
Are there any missing or duplicate values?

Using pandas functions like .head(), .info(), .describe() and .nunique(), I discovered:

The dataset has 891 rows and 12 columns.
Some fields like Age, Cabin, and Embarked had missing values.
The Cabin column had over 77% missing data, a strong indicator that it might not be useful in analysis.
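As a rough sketch of these first checks, the snippet below runs the same pandas calls on a tiny hand-made frame. The rows are hypothetical toy data that only borrow Titanic-style column names, and the `train.csv` filename in the comment is an assumption about how the Kaggle file is saved locally.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with Titanic-style column names (NOT the real data).
# With the Kaggle file you would load it instead, e.g.:
#   df = pd.read_csv("train.csv")   # assumed local filename
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", np.nan],
})

print(df.shape)       # (number of rows, number of columns)
print(df.head())      # first few rows
df.info()             # dtypes and non-null counts per column
print(df.describe())  # summary statistics for numeric columns
print(df.nunique())   # distinct values per column

# Percentage of missing values per column, largest first
missing_pct = df.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_pct)

# Number of fully duplicated rows
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)
```

Sorting the missing-value percentages puts columns like Cabin at the top, which makes the "mostly missing" candidates easy to spot.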

Why this matters:

You can't make solid inferences from dirty data. A good understanding of the dataset’s structure sets the stage for everything that follows.

Step 2: Handling Missing Values — "No Holes in the Ship"

The Problem:

Missing data leads to biased results, inaccurate visualizations, and broken models.
We had to deal with Age, Cabin, and Embarked.

The Strategy:

Drop columns when data is mostly missing and non-critical: Cabin was dropped.
Impute missing values when only part of a column is missing:
- Age was filled with the mean, a reasonable choice when a distribution is fairly symmetric (the median is the more robust option for skewed data).
- Embarked was filled with the mode, the most common value ('S').
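The strategy above could be sketched roughly as follows. The frame is a hypothetical toy example rather than the real dataset, and the variable names `age_mean` and `embarked_mode` are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (NOT the real data) with gaps in all three columns
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 25.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan, "S"],
})

# Drop the mostly-empty, non-critical column
df = df.drop(columns=["Cabin"])

# Fill missing ages with the mean of the observed ages
age_mean = df["Age"].mean()
df["Age"] = df["Age"].fillna(age_mean)

# Fill missing ports with the most frequent observed port
embarked_mode = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(embarked_mode)

# Confirm nothing is missing any more
print(df.isna().sum())
```

Swapping `.mean()` for `.median()` is a one-word change if the distribution turns out to be skewed.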

Why this matters:

Each method of handling missing data reflects an assumption. By imputing rather than deleting rows, I retained as much useful data as possible, at the cost of slightly narrowing the spread of Age, since every filled value sits exactly at the mean.

Step 3: Univariate Analysis — "One Variable at a Time"

Univariate analysis focuses on distributions and patterns within a single feature.

What I found:

Age was right-skewed — most passengers were younger.
Fare was heavily skewed — most paid low fares, but a few paid very high ones (first class).
Embarked: Most passengers boarded from Southampton.
Gender: About 65% male, 35% female — a heavily male population.

Techniques Used:

Histograms and KDE plots for continuous variables
Bar plots and pie charts for categorical variables
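The numbers behind those plots can be checked directly in pandas. This is a minimal sketch on hypothetical toy values (not the real data); the plotting calls are left as comments because they assume matplotlib is installed.

```python
import pandas as pd

# Hypothetical toy rows (NOT the real data)
df = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "male", "female"],
    "Fare": [7.25, 8.05, 7.90, 71.28, 13.00, 512.33],
})

# Continuous variable: summary statistics and skewness
print(df["Fare"].describe())
fare_skew = df["Fare"].skew()  # positive means a long right tail
print(fare_skew)

# Categorical variable: share of each category
sex_share = df["Sex"].value_counts(normalize=True)
print(sex_share)

# Plotting equivalents (assumes matplotlib is available):
#   df["Fare"].plot.hist(bins=30)
#   sex_share.plot.bar()
```

A positive skew value is the numeric counterpart of the long right tail seen in a fare histogram.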

Why this matters:

Understanding individual variables helps form hypotheses. For example, knowing that most passengers were male or that fares are skewed helps us later ask: Did being male or paying more impact survival?

Step 4: Bivariate Analysis — "Two Variables, One Story"

Bivariate analysis explores relationships between two variables.

Key Findings:

Pclass vs Fare: First-class passengers paid significantly more — a clear socioeconomic divide.
Age vs Survival: Children under 10 had a higher survival rate — consistent with evacuation priorities.
Embarked vs Survival: Passengers from Cherbourg had better survival odds, likely because many were first-class.

Visual Tools:

Boxplots, violin plots, scatter plots, and count plots.
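Alongside the plots, a grouped summary gives the same comparisons as numbers. The sketch below uses hypothetical toy rows (not the real data), and the `AgeGroup` column is a helper introduced here purely for illustration.

```python
import pandas as pd

# Hypothetical toy rows (NOT the real data)
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [71.28, 53.10, 13.00, 7.25, 7.92, 8.05],
    "Age": [38.0, 4.0, 8.0, 22.0, 26.0, 35.0],
    "Embarked": ["C", "S", "S", "S", "C", "S"],
    "Survived": [1, 1, 1, 0, 1, 0],
})

# Pclass vs Fare: typical fare per class
median_fare_by_class = df.groupby("Pclass")["Fare"].median()
print(median_fare_by_class)

# Age vs Survival: label each passenger as under 10 or not (illustrative helper column)
df["AgeGroup"] = df["Age"].lt(10).map({True: "under 10", False: "10 and over"})
survival_by_age_group = df.groupby("AgeGroup")["Survived"].mean()
print(survival_by_age_group)

# Embarked vs Survival: survival rate per port
survival_by_port = df.groupby("Embarked")["Survived"].mean()
print(survival_by_port)
```

Because Survived is coded 0/1, the mean of each group is simply its survival rate.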

Why this matters:

This stage builds understanding about which features might influence the target variable (in our case, Survived). It’s where insight begins to take shape from raw numbers.

Step 5: Multivariate Analysis — "Now Add a Third Ingredient"

Multivariate analysis looks at how multiple variables interact at once.

What I Explored:

Age + Fare + Class: Using 3D scatter plots, it became clear that younger, first-class passengers who paid higher fares had better survival rates.
Pclass + Embarked + Survival: Many Cherbourg passengers were 1st class and had higher survival. Southampton passengers were mostly 3rd class, with lower survival.
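The Pclass + Embarked + Survival view can also be expressed as a table instead of a plot. This sketch swaps the 3D scatter plot for a pivot table and a cross-tabulation, and it runs on hypothetical toy rows, not the real data.

```python
import pandas as pd

# Hypothetical toy rows (NOT the real data)
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3, 2],
    "Embarked": ["C", "C", "S", "S", "S", "S", "C", "S"],
    "Survived": [1, 1, 0, 0, 0, 1, 1, 1],
})

# Survival rate for every (class, port) combination;
# combinations with no passengers show up as NaN
survival_table = df.pivot_table(
    index="Pclass", columns="Embarked", values="Survived", aggfunc="mean"
)
print(survival_table)

# Class mix within each port (each row sums to 1)
class_mix = pd.crosstab(df["Embarked"], df["Pclass"], normalize="index")
print(class_mix)
```

Reading the two tables together helps separate "this port had better survival" from "this port simply had more first-class passengers".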

Why this matters:

Multivariate views show interdependencies — how variables reinforce or influence each other. It also prepares the data for predictive modeling, where interactions matter a lot.

Step 6: Outlier Detection — "Spotting the Extremes"

Outliers can distort mean values and cause misleading conclusions. But not all outliers are "bad."

What I Did:

Used boxplots, Z-scores, and IQR to find outliers in Age and Fare.
Found high outliers in Fare — expected from 1st-class tickets.
Kept the outliers because they were valid — rich passengers existed and their data told an important part of the story.
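The two numeric rules mentioned above, IQR and Z-score, can be sketched as follows. The fares are hypothetical toy values (not the real data), and the cut-offs of 1.5 times the IQR and an absolute Z-score above 3 are the common conventions, assumed here rather than taken from the notebook.

```python
import pandas as pd

# Hypothetical toy fares (NOT the real data), with one extreme value
fare = pd.Series(
    [7.25, 7.75, 7.90, 7.92, 8.05, 8.46, 10.50, 13.00, 13.00, 15.50,
     21.08, 26.00, 26.55, 30.00, 31.28, 35.50, 51.86, 53.10, 71.28, 512.33],
    name="Fare",
)

# IQR rule: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = fare[(fare < lower) | (fare > upper)]
print(iqr_outliers)

# Z-score rule: flag values more than 3 standard deviations from the mean
z_scores = (fare - fare.mean()) / fare.std()
z_outliers = fare[z_scores.abs() > 3]
print(z_outliers)
```

The two rules need not agree: the extreme value inflates the standard deviation, so the Z-score rule tends to flag fewer points than the IQR rule on skewed data like fares.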

Why this matters:

You must always ask: Is this outlier an error or a meaningful insight? Removing valid extremes can lead to sanitized but inaccurate analysis.

Step 7: Target Variable Analysis — "Who Survived and Why?"

The Survived column (our target) was analyzed deeply.

Discoveries:

Overall: Only about 38% survived, so the target is moderately imbalanced (roughly 38% vs. 62%).
Gender: Women had a much higher survival rate — reflecting the "women and children first" evacuation policy.
Pclass: First-class passengers were far more likely to survive.
Age: Survivors were generally younger.
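A quick way to reproduce this kind of breakdown is to look at the class balance first and then the survival rate per group. The rows below are hypothetical toy data (not the real dataset), so the rates they produce are illustrative only.

```python
import pandas as pd

# Hypothetical toy rows (NOT the real data)
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 2],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
})

# Overall survival rate and class balance of the target
overall_rate = df["Survived"].mean()
class_balance = df["Survived"].value_counts(normalize=True)
print(overall_rate)
print(class_balance)

# Survival rate broken down by demographic features
rate_by_sex = df.groupby("Sex")["Survived"].mean()
rate_by_class = df.groupby("Pclass")["Survived"].mean()
print(rate_by_sex)
print(rate_by_class)
```

Checking the class balance before modeling indicates whether plain accuracy will be a fair metric or whether the majority class would dominate it.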

Why this matters:

By examining the target variable across different demographics, we start to understand what features matter most and lay the groundwork for future modeling.

Conclusion: What I Learned

This Titanic EDA project taught me that data analysis is not about tools, but about thoughtful exploration.

I learned how to clean, visualize, and reason through a dataset.
I practiced asking questions like: Is this variable useful? Is this insight valid? Is this pattern meaningful?
I saw how history, context, and data merge to form a complete picture.

This was just one dataset, but it felt like a thousand stories told through numbers.

Want to Dive Deeper?

Check out the full notebook and code on Kaggle: https://www.kaggle.com/code/capwellmurimi/titanic-eda

Final Words

If you're just getting started with data analysis, I highly recommend tackling the Titanic dataset. And when you do, don’t just make pretty plots; ask tough questions, explain your reasoning, and challenge assumptions. That’s where real learning happens.

Feel free to connect with me if you're on your own data journey. I'm always happy to learn and share!

Deep Dive into Titanic Exploratory Data Analysis: What I Learned from the Data That Sank

Capwell Murimi
