Predicting Titanic Survivors with Machine Learning: A Beginner's Guide

Welcome back to my blog!

In my previous post, I explored the Titanic dataset using Python and visualized survival trends.
Now it’s time to take the next step — building a machine learning model that predicts whether a passenger would survive or not.

This is my first real ML project, and I’ll walk you through it step by step!


⚙️ Step 1: Import Libraries

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

📥 Step 2: Load the Dataset

df = sns.load_dataset('titanic')
df.head()
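
Before cleaning anything, it helps to see which columns actually have missing values (a quick extra check, not strictly required):

print(df.isnull().sum())   # 'deck' and 'age' have most of the gaps; 'embarked' has just a couple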

🧹 Step 3: Data Cleaning

# Drop columns we won’t use
df.drop(['deck', 'embark_town', 'alive', 'who', 'adult_male', 'class'], axis=1, inplace=True)

# Drop rows with missing values
df.dropna(inplace=True)
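
Heads up: dropna() removes every passenger with a missing age, which throws away a fair number of rows. If you would rather keep them, here is a sketch of an alternative to the dropna step above (fill the missing ages with the median instead):

# Alternative to dropping all rows with missing values:
# fill missing ages with the median age, then drop the few remaining gaps
df['age'] = df['age'].fillna(df['age'].median())
df.dropna(inplace=True)   # only the couple of rows missing 'embarked' are left to drop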

🔠 Step 4: Encoding Categorical Features

# Convert categorical columns to numeric using Label Encoding
le = LabelEncoder()
df['sex'] = le.fit_transform(df['sex'])        # female=0, male=1
df['embarked'] = le.fit_transform(df['embarked'])  # S=2, C=0, Q=1
df['alone'] = le.fit_transform(df['alone'])    # True=1, False=0
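
One small gotcha: because the same LabelEncoder object is reused, it only remembers the last mapping it learned (for 'alone'). If you want to check what number each category got, a variant like this (replacing the three lines above) keeps one encoder per column:

# Variant: one encoder per column, so each mapping stays inspectable
encoders = {}
for col in ['sex', 'embarked', 'alone']:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])
    print(col, list(encoders[col].classes_))   # the position of each class is its code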

🎯 Step 5: Define Features and Target

X = df.drop('survived', axis=1)
y = df['survived']

🧪 Step 6: Split the Data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
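
A quick sanity check on the split sizes, plus an optional stratify argument (which I didn't use above) that keeps the survivor ratio similar in both sets:

print(X_train.shape, X_test.shape)   # roughly 80% / 20% of the rows

# Optional: stratified split so train and test have a similar survival rate
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42, stratify=y)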

🧠 Step 7: Train a Logistic Regression Model

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
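
Because logistic regression is a linear model, peeking at its coefficients gives a rough idea of which features push the predicted survival probability up or down (just a quick look, not a formal analysis):

# Positive coefficients push toward 'survived' (1), negative toward 'did not survive' (0)
for name, coef in zip(X.columns, model.coef_[0]):
    print(f"{name:10s} {coef:+.3f}")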

✅ Step 8: Evaluate the Model

y_pred = model.predict(X_test)

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred))

📌 Sample Output (your exact numbers may vary slightly):

Accuracy: 0.85
Confusion Matrix:  
[[83 10]  
 [11 35]]

Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.89      0.89        93
           1       0.78      0.76      0.77        46
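
Since seaborn is already imported, the confusion matrix is much easier to read as a heatmap (an optional extra):

# Plot the confusion matrix as a labelled heatmap
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['did not survive', 'survived'],
            yticklabels=['did not survive', 'survived'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Titanic survival: confusion matrix')
plt.show()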

🧠 What I Learned

  • Logistic Regression is a simple but powerful algorithm for binary classification problems like survival prediction.

  • Encoding and cleaning the data correctly is crucial.

  • The model achieved over 80% accuracy on unseen data — not bad for a first ML model!


🚀 What’s Next?

In the next blog, I plan to try out a Random Forest classifier and compare results.
I’ll also show how to save the model and make predictions on new data.
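Here is a small preview of the saving-and-predicting part, sketched with joblib (the passenger values below are completely made up, just to show the shape of the input):

import joblib

# Save the trained model to disk and load it back
joblib.dump(model, 'titanic_logreg.joblib')
loaded = joblib.load('titanic_logreg.joblib')

# Predict for one made-up passenger, using the same encoded columns as X
new_passenger = pd.DataFrame([{
    'pclass': 3, 'sex': 1, 'age': 25.0, 'sibsp': 0,
    'parch': 0, 'fare': 7.25, 'embarked': 2, 'alone': 1
}], columns=X.columns)
print(loaded.predict(new_passenger))         # 0 = did not survive, 1 = survived
print(loaded.predict_proba(new_passenger))   # probability of each class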

Thanks for following my journey!
If you’re learning data science too, let’s grow together.


— Farsana | Aspiring Data Scientist | First ML Project Completed!

