Unsinkable Insights: Navigating My First Titanic Data Science Challenge

Sushant Pupneja
9 min read

In data science, every project is an experiment—a place where knowledge is created, improved, and shared with the community. My experience with the Titanic Shipwreck Problem is a great example of this experimental approach. Inspired by Andrew Ng's foundational course on Supervised Machine Learning, I approached this classic problem with curiosity, determination, and a readiness to learn from each challenge.

Building Knowledge Through Exploration

The Titanic dataset, with its mix of numerical and categorical features, served as the perfect playground for developing my skills. I started with Exploratory Data Analysis (EDA) to gain an in-depth understanding of the data. This phase involved:

  • Initial Inspection:
    Using methods like .head(), .info(), and .describe() gave me an immediate sense of the dataset's structure. This first look set the stage for deeper analysis and helped reveal data type inconsistencies and potential problems such as unexpected null values (a short snippet follows this list).

  • Data Visualization:
    I created histograms, box plots, and heat maps to identify trends, anomalies, and correlations. This visual exploration not only uncovered the underlying distribution of the features but also provided clues about relationships that could be leveraged during feature engineering.
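
To make the initial inspection concrete, here is a minimal sketch of that first look. It assumes the training file uses Kaggle's standard name, train.csv; adjust the path to your setup.

## Initial Inspection
import pandas as pd

df = pd.read_csv('train.csv')  # Kaggle Titanic training data (assumed filename)

print(df.head())      # first few rows
df.info()             # column dtypes and non-null counts (prints directly)
print(df.describe())  # summary statistics for the numeric columns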

## Histograms and Density Plots
""" Use these to understand the distribution of continuous variables like Age and Fare. 
Histograms show frequency counts, while density plots provide a 
smoothed-out view of the distribution."""

import matplotlib.pyplot as plt
df['Age'].hist(bins=30)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')
plt.show()

## Box Plots:
""" Useful for visualizing the spread of data and identifying outliers. 
For example, comparing age distributions across passenger classes can reveal differences and anomalies."""

import seaborn as sns
sns.boxplot(x='Pclass', y='Age', data=df)
plt.title('Age by Passenger Class')
plt.show()
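
Heat maps are mentioned above but not shown; here is a minimal sketch of a correlation heat map over the numeric columns, which surfaces the correlations referenced earlier.

## Correlation Heat Map
""" A quick look at pairwise correlations between numeric features,
e.g., how Fare and Pclass relate to Survived. """

corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Heat Map (numeric features)')
plt.show()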

  • Identifying Missing Values and Outliers: Uncovering Hidden Data Challenges

    Recognizing that features such as Age, Fare, and even Cabin required special attention was essential. By methodically identifying missing values and outliers, I was able to design strategies to handle them, ensuring the data was clean and reliable for further analysis.

  1. Missing Values
print(df.isnull().sum())
## Visualizing Missing Data:
import missingno as msno
msno.matrix(df)
plt.show()

Guiding Preprocessing Decisions:

  • Knowing which columns have missing values and their proportions will dictate your strategy—whether to impute, drop, or engineer new features.
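
For the proportions, a one-liner like the following (a small sketch) works alongside the count shown earlier:

# Share of missing values per column, as a percentage
print((df.isnull().mean() * 100).round(2).sort_values(ascending=False))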

Avoiding Bias:

  • Ignoring missing values or handling them improperly can bias your model or distort its predictions.
  2. Outliers
### Detecting Outliers

"""
## Box Plots:
# One of the most effective ways to spot outliers visually. Outliers typically appear as points outside the “whiskers” of the box plot.

## Statistical Methods:
# Use the Interquartile Range (IQR) method.

- Calculate the first (Q1) and third quartiles (Q3) and then the IQR (Q3 - Q1).
- Identify any data point that falls below Q1 - 1.5×IQR or above Q3 + 1.5×IQR as potential outliers.
"""

Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Fare'] < Q1 - 1.5 * IQR) | (df['Fare'] > Q3 + 1.5 * IQR)]
print(outliers)

Model Impact:

  • Outliers can distort statistical analyses and affect model performance, especially for algorithms sensitive to extreme values.

Data Quality Check:

  • Outliers might indicate data entry errors or unique subgroups in your data that need special treatment.
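
The strategy for handling detected outliers isn't spelled out above; one common option, shown here only as a sketch and left commented out so it doesn't alter the pipeline behind the scores below, is to cap values at the IQR fences rather than dropping rows:

# Cap Fare at the IQR fences (winsorizing) rather than dropping rows
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR
# df['Fare'] = df['Fare'].clip(lower=lower_fence, upper=upper_fence)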

Imputation Strategies: Filling the Gaps for Data Integrity

(techniques used to fill in missing data points in a dataset to maintain its integrity and improve model performance.)

  • Numerical Variables (e.g., Age):

    • Median Imputation: Replace missing ages with the median, which is robust to outliers (see the sketch after this list).

    • Group-based Imputation: For example, you might calculate the median age within groups defined by Pclass and Sex to better capture variations.

df['Age'] = df.groupby(['Pclass', 'Sex'])['Age'].transform(lambda x: x.fillna(x.median()))

  • Categorical Variables (e.g., Embarked):

    • Mode Imputation: Fill missing values with the most frequent category.

print(df['Embarked'].mode()[0])
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

  • High Missing-Rate Columns (e.g., Cabin): With many missing values, consider:

    • Dropping the column if it seems uninformative.

    • Extracting useful information (like the deck letter) if you suspect that even limited data could be valuable.
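
For the options above that have no code, here is a minimal sketch: plain median imputation for Age, and simply dropping a high-missing-rate column. In practice you would pick either this or the group-based imputation shown earlier, not both; extracting the deck letter from Cabin is covered in the feature engineering section below.

# Plain median imputation (robust to outliers); choose this OR the group-based version
df['Age'] = df['Age'].fillna(df['Age'].median())

# Dropping a high-missing-rate column, left commented out because Cabin is reused later
# df = df.drop(columns=['Cabin'])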

Why It’s Important:

Handling missing data correctly ensures that your model isn't misled by gaps in information. It helps maintain statistical integrity and prevents skewed or biased outcomes.

Data Type Adjustments: Ensuring Compatibility for Machine Learning

(Machine learning algorithms typically require numerical input; converting data types appropriately helps ensure that your algorithms process the information correctly.)

Techniques:

  • Converting Categorical Data:

    • Label Encoding: Useful when categories are ordinal or when you want to convert text labels to integers (a small sketch follows the encoding examples below).

    • One-Hot Encoding: Creates binary columns for each category, preventing the model from assuming any ordinal relationship between categories. For example:

    # One hot encoding.
    df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)

    # Ensuring Numeric Columns Are Numeric:
    df['Fare'] = pd.to_numeric(df['Fare'], errors='coerce')
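
    As a counterpart to the one-hot example, here is a small label-encoding sketch. It uses a tiny toy DataFrame because Sex and Embarked were already one-hot encoded above; for ordinal feature columns, scikit-learn's OrdinalEncoder (used for Deck later) is the usual choice.

    # Label encoding on a toy example, purely for illustration
    from sklearn.preprocessing import LabelEncoder

    sample = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})
    le = LabelEncoder()
    sample['Sex_encoded'] = le.fit_transform(sample['Sex'])
    print(sample)
    print(le.classes_)  # index in this array = encoded integer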

Why It’s Important

  • Correct data types allow you to apply mathematical operations, leverage efficient storage, and avoid errors during model training. They also ensure that preprocessing steps like scaling or encoding work as expected.

Refining the Process with Feature Engineering and Model Selection

Once the data was well understood, the next phase was to refine it through feature engineering. I delved into transforming raw data into informative features by:

  • Creating New Features:
    I combined SibSp and Parch to create a FamilySize feature and derived IsAlone to flag passengers traveling solo. Each new feature brought additional context and predictive power.
## Create a new feature called 'FamilySize' by combining 'SibSp' and 'Parch'.
## FamilySize:
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Add an 'IsAlone' feature to identify passengers traveling solo.
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
  • Handling the Cabin Feature with Ordinal Encoding:
    Recognizing the potential importance of cabin location, I first filled in missing cabin values with a placeholder (e.g., 'U' for unknown) before extracting the deck letter. By defining a natural order for the decks and applying ordinal encoding, I converted this categorical data into meaningful numeric values.
# Fill missing values in the 'Cabin' column with a placeholder, e.g., 'U'
df['Cabin'] = df['Cabin'].fillna('U')
df['Deck'] = df['Cabin'].str[0]
# df = pd.get_dummies(df, columns=['Deck'], drop_first=True)

deck_order = ['U', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'] ## U is for unknown.

from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder(categories=[deck_order])
df['DeckOrdinal'] = ordinal_encoder.fit_transform(df[['Deck']])

After engineering these features, I moved on to model selection and training. I started with a logistic regression model to establish a baseline, then experimented with more complex algorithms like Random Forest to capture non-linear patterns. Each model taught me valuable lessons about overfitting, model complexity, and the critical role of cross-validation in achieving robust performance.

Building a Baseline Model: Establishing a Strong Starting Point

Quick Benchmark:

  • The baseline model gives you a reference point for performance. It should be simple, interpretable, and easy to implement.

How to Proceed:

  • Logistic Regression:

    • This is the most common starting point for binary classification problems like predicting survival. Its simplicity lets you quickly see whether your features have any predictive power.
# load test data:
test_df = pd.read_csv("test.csv")
test_df.head()
# Fill missing values in the test data
test_df['Age'] = test_df.groupby(['Pclass', 'Sex'])['Age'].transform(lambda x: x.fillna(x.median()))
test_df['Embarked'] = test_df['Embarked'].fillna(test_df['Embarked'].mode()[0])
test_df = pd.get_dummies(test_df, columns=['Sex', 'Embarked'], drop_first=True)
test_df['Fare'] = pd.to_numeric(test_df['Fare'], errors='coerce')
test_df['Fare'] = test_df.groupby('Pclass')['Fare'].transform(lambda x: x.fillna(x.median()))

# Fill missing values in the 'Cabin' column with a placeholder, e.g., 'U'
test_df['Cabin'] = test_df['Cabin'].fillna('U')
test_df['Deck'] = test_df['Cabin'].str[0]
deck_order = ['U', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'] ## U is for unknown.

from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder(categories=[deck_order])
test_df['DeckOrdinal'] = ordinal_encoder.fit_transform(test_df[['Deck']])
# test_df = pd.get_dummies(test_df, columns=['Deck'], drop_first=True)

## add new features:
test_df['FamilySize'] = test_df['SibSp'] + test_df['Parch'] + 1
test_df['IsAlone'] = (test_df['FamilySize'] == 1).astype(int)

Implementation:

Use scikit-learn's LogisticRegression:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Drop columns that likely won't be useful for prediction (adjust as needed).
# Typically, you might drop columns like PassengerId, Name, Ticket, and Cabin.

X = df.drop(['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin', 'Deck'], axis=1, errors='ignore')
# X = df.drop(['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, errors='ignore')
y = df['Survived']

X_test = test_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Deck'], axis=1, errors='ignore')
# X_test = test_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, errors='ignore')


# Ensure that the test set has the same feature columns as the training set
X_test = X_test.reindex(columns=X.columns, fill_value=0)
""" Build and train the model """

# Initialize and train model
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Predict on the test set
predictions = model.predict(X_test)

# If desired, save the predictions along with PassengerId (common for Titanic submissions)
output = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': predictions
})
output.to_csv("submission.csv", index=False)
print("Predictions saved to submission.csv")

output.head()
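
The train_test_split and accuracy_score imports above go unused in this snippet; before submitting, a quick hold-out check along these lines (a sketch, separate from the submission runs) gives a local estimate of accuracy:

# Hold out 20% of the training data to sanity-check the baseline locally
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

val_model = LogisticRegression(max_iter=1000)
val_model.fit(X_train, y_train)
print("Validation accuracy:", accuracy_score(y_val, val_model.predict(X_val)))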

Making a Model with SVM

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# ---------------------
# Model Training with SVM
# ---------------------
# Create and train the SVM classifier (using an RBF kernel)
svm_model = SVC(kernel='rbf', random_state=42)
svm_model.fit(X, y)


predictions = svm_model.predict(X_test)

# Save the predictions along with PassengerId (useful for Kaggle submissions)
output = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': predictions
})
output.to_csv("submission.csv", index=False)
print("SVM predictions saved to submission.csv")

Making a Model with Random Forest

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


# ---------------------------
# Splitting Data for Validation
# ---------------------------
# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ---------------------------
# Model Training with Random Forest
# ---------------------------
# Create a RandomForestClassifier with 100 trees
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# ---------------------------
# Model Evaluation
# ---------------------------
# Predict on the validation set
y_pred = rf_model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print("Validation Accuracy:", accuracy)



# Predict using the trained Random Forest model
test_predictions = rf_model.predict(X_test)

# Save the predictions alongside PassengerId (for submission or further analysis)
output = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': test_predictions
})
output.to_csv("submission.csv", index=False)
print("Test predictions saved to submission.csv")

Summary of the Data Transformation Techniques I Used and the Corresponding Model Accuracy.

  • Baseline model: Logistic Regression with a few features. Score: 0.76076
  • Added new features: FamilySize, IsAlone, and Deck. Score: 0.75598
  • Removed the Deck and Cabin features entirely. Score: 0.76076
  • Trained the model using SVM. Score: 0.66746
  • Trained the model with Random Forest. Score: 0.76315
  • Added the Deck feature with ordinal encoding. Score: 0.74162

The score is the percentage of passengers whose survival you predict correctly. For instance, if your model correctly classifies 800 out of 1,000 passengers, your accuracy is 80%.

Conclusion: Reflecting on a Journey of Data Science Mastery

Navigating the Titanic Data Science Challenge was a journey of discovery and growth. Through meticulous exploratory data analysis, thoughtful handling of missing values and outliers, and strategic feature engineering, I transformed raw data into a robust foundation for model building. Experimenting with various models, from logistic regression to Random Forest, provided valuable insights into model performance and the importance of cross-validation. Each step reinforced the significance of data preprocessing and feature selection in achieving accurate predictions. This experience not only honed my technical skills but also deepened my understanding of the iterative nature of data science, where each challenge is an opportunity to learn and innovate.

Engage with Us: Join the Conversation and Share Your Insights

I’d love to hear from you! Have you tackled the Titanic Data Science Challenge or a similar project? What strategies did you find most effective in handling missing data or feature engineering? Share your experiences and insights in the comments below. Your contributions can help others in the community learn and grow. Let's start a conversation!
