Last winter, my cousin Maya started her dream online clothing store.

It was small at first, just a few T-shirts, hoodies, and eco-friendly tote bags.

Orders started rolling in, and she kept everything in a spreadsheet: names, addresses, items purchased, quantities, and payments.

Until one day, she looked at her data and said:
“Umm… why is the address missing for 14 people?”
“And why are some quantities empty?”

That’s when she asked me:
“Can Python help clean this up?”

The answer? Absolutely yes.

And that’s exactly what today’s post is about.

We’ll use Pandas, Python’s powerful data-handling library, to clean messy data and deal with missing values, just like we did for Maya’s online store.

🧺 Step 1: Let’s Load the Messy Dataset

Imagine Maya exported her customer orders into a CSV file named orders.csv.

Here’s how we’d start: import Pandas and read the file into a DataFrame with `df = pd.read_csv("orders.csv")`.

Let’s say this is what the dataset looks like:

Customer Name	Address	Item	Quantity	Price
Alice	123 Maple St	Hoodie	2	40
Ben		T-shirt	1	15
Cathy	456 Elm St			25
David	789 Oak Ave	Tote Bag	3	10
Emma	321 Pine Blvd	Hoodie		40
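If you don't have Maya's file handy, here is a minimal sketch that rebuilds the five rows above in memory and loads them the same way. The inline `csv_text` string is my stand-in for `orders.csv` (an assumption, not her real export); with an actual file you would pass the filename to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for orders.csv: the same five rows as the table above.
# Empty fields between commas are the missing values.
csv_text = """Customer Name,Address,Item,Quantity,Price
Alice,123 Maple St,Hoodie,2,40
Ben,,T-shirt,1,15
Cathy,456 Elm St,,,25
David,789 Oak Ave,Tote Bag,3,10
Emma,321 Pine Blvd,Hoodie,,40
"""

# With a real file this would be: df = pd.read_csv("orders.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```

Pandas reads the empty fields as `NaN`, its marker for a missing value.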

Yikes. Missing addresses, items, quantities.

But don’t worry. This is where data cleaning magic begins.

🔍 Step 2: Spot the Missing Values

Before cleaning anything, we need to see what’s missing.
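A small sketch of that first look, assuming the sample rows from Step 1 are rebuilt by hand here so the snippet runs on its own:

```python
import pandas as pd

# Sample rows from Step 1; None marks a missing value.
df = pd.DataFrame({
    "Customer Name": ["Alice", "Ben", "Cathy", "David", "Emma"],
    "Address": ["123 Maple St", None, "456 Elm St", "789 Oak Ave", "321 Pine Blvd"],
    "Item": ["Hoodie", "T-shirt", None, "Tote Bag", "Hoodie"],
    "Quantity": [2, 1, None, 3, None],
    "Price": [40, 15, 25, 10, 40],
})

# Same shape as df, but every cell is True (missing) or False (present).
missing = df.isnull()
print(missing)
```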

Calling `df.isnull()` gives you a full table showing which values are missing (True) and which aren’t (False).

If you want a quick summary, `df.isnull().sum()` counts the missing values in each column.
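Here is a sketch of that summary on the sample rows from Step 1 (rebuilt by hand so the snippet is self-contained):

```python
import pandas as pd

# Sample rows from Step 1; None marks a missing value.
df = pd.DataFrame({
    "Customer Name": ["Alice", "Ben", "Cathy", "David", "Emma"],
    "Address": ["123 Maple St", None, "456 Elm St", "789 Oak Ave", "321 Pine Blvd"],
    "Item": ["Hoodie", "T-shirt", None, "Tote Bag", "Hoodie"],
    "Quantity": [2, 1, None, 3, None],
    "Price": [40, 15, 25, 10, 40],
})

# True counts as 1 and False as 0, so summing gives a per-column tally.
counts = df.isnull().sum()
print(counts)
# Customer Name    0
# Address          1
# Item             1
# Quantity         2
# Price            0
# dtype: int64
```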

For the sample table, we now know:

1 missing address
1 missing item
2 missing quantities

🧹 Step 3: Cleaning the Data (Multiple Approaches)

Here’s where we decide what to do with the missing values.

Option A: Drop Rows with Missing Values

If the data is incomplete or unusable, you can drop those rows entirely.
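As a rough sketch on the sample rows from Step 1 (rebuilt here so the snippet runs by itself), `dropna()` keeps only the rows where every column has a value:

```python
import pandas as pd

# Sample rows from Step 1; None marks a missing value.
df = pd.DataFrame({
    "Customer Name": ["Alice", "Ben", "Cathy", "David", "Emma"],
    "Address": ["123 Maple St", None, "456 Elm St", "789 Oak Ave", "321 Pine Blvd"],
    "Item": ["Hoodie", "T-shirt", None, "Tote Bag", "Hoodie"],
    "Quantity": [2, 1, None, 3, None],
    "Price": [40, 15, 25, 10, 40],
})

# dropna() returns a new DataFrame; df itself is left unchanged.
complete = df.dropna()
print(complete)  # only Alice's and David's orders survive
```

Three of the five orders disappear, which is a steep price for a tidy table.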

But wait—do we really want to delete customer orders just because of a missing quantity? Probably not.

Option B: Fill Missing Values with Default Values

Let’s say we assume:

If quantity is missing, it’s 1 (default).
If item is missing, we mark it as "Unknown".
If address is missing, we flag it for follow-up.
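One way to sketch those three rules is a single `fillna` call with a dictionary of per-column defaults. The sample rows are rebuilt from Step 1, and the "To be confirmed" flag text is just a placeholder borrowed from the recap table; pick whatever wording suits your follow-up process.

```python
import pandas as pd

# Sample rows from Step 1; None marks a missing value.
df = pd.DataFrame({
    "Customer Name": ["Alice", "Ben", "Cathy", "David", "Emma"],
    "Address": ["123 Maple St", None, "456 Elm St", "789 Oak Ave", "321 Pine Blvd"],
    "Item": ["Hoodie", "T-shirt", None, "Tote Bag", "Hoodie"],
    "Quantity": [2, 1, None, 3, None],
    "Price": [40, 15, 25, 10, 40],
})

# One default per column; columns not listed are left untouched.
defaults = {
    "Quantity": 1,                  # assumed default order size
    "Item": "Unknown",              # placeholder for a missing item
    "Address": "To be confirmed",   # flag for follow-up (placeholder text)
}
filled = df.fillna(defaults)

# Missing values forced Quantity to floats; restore whole numbers.
filled["Quantity"] = filled["Quantity"].astype(int)

print(filled)
```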

This is often more helpful—we keep the data, but flag the issues.

Option C: Forward Fill / Backward Fill

If you're dealing with sequential data (like time series or grouped orders), you can fill missing values using nearby data. `df.ffill()` (forward fill) copies the last known value downward.
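A quick sketch of forward fill on the Step 1 sample rows. Maya's orders aren't really sequential, so treat this purely as a demonstration of the mechanics:

```python
import pandas as pd

# Sample rows from Step 1; None marks a missing value.
df = pd.DataFrame({
    "Customer Name": ["Alice", "Ben", "Cathy", "David", "Emma"],
    "Address": ["123 Maple St", None, "456 Elm St", "789 Oak Ave", "321 Pine Blvd"],
    "Item": ["Hoodie", "T-shirt", None, "Tote Bag", "Hoodie"],
    "Quantity": [2, 1, None, 3, None],
    "Price": [40, 15, 25, 10, 40],
})

# Each blank takes the value from the closest filled row above it.
forward = df.ffill()
print(forward)
# Ben's blank address is copied from Alice's row, which is clearly
# wrong for unrelated orders.
```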

Or use `df.bfill()` (backward fill), which copies the next known value upward.
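The backward-fill counterpart, sketched on the same Step 1 sample rows:

```python
import pandas as pd

# Sample rows from Step 1; None marks a missing value.
df = pd.DataFrame({
    "Customer Name": ["Alice", "Ben", "Cathy", "David", "Emma"],
    "Address": ["123 Maple St", None, "456 Elm St", "789 Oak Ave", "321 Pine Blvd"],
    "Item": ["Hoodie", "T-shirt", None, "Tote Bag", "Hoodie"],
    "Quantity": [2, 1, None, 3, None],
    "Price": [40, 15, 25, 10, 40],
})

# Each blank takes the value from the closest filled row below it.
backward = df.bfill()
print(backward)
# Emma is the last row, so there is nothing below her to copy from
# and her quantity stays missing.
```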

This fills in blanks based on the value just above or below it.

⚠️ Use with caution—only when you know the missing values logically follow a pattern.

🧽 Step 4: Double-Check Your Cleaned Data

Let’s make sure it worked.
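A minimal check, assuming we went with the Option B defaults (the "To be confirmed" flag is placeholder text): rerun the missing-value count on the cleaned table and confirm that nothing is left.

```python
import pandas as pd

# Sample rows from Step 1; None marks a missing value.
df = pd.DataFrame({
    "Customer Name": ["Alice", "Ben", "Cathy", "David", "Emma"],
    "Address": ["123 Maple St", None, "456 Elm St", "789 Oak Ave", "321 Pine Blvd"],
    "Item": ["Hoodie", "T-shirt", None, "Tote Bag", "Hoodie"],
    "Quantity": [2, 1, None, 3, None],
    "Price": [40, 15, 25, 10, 40],
})

# Apply the Option B defaults, then count what is still missing.
cleaned = df.fillna({"Quantity": 1, "Item": "Unknown", "Address": "To be confirmed"})
remaining = cleaned.isnull().sum()

print(remaining)               # every column should show 0
print(int(remaining.sum()))    # total missing cells across the table
```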

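When you rerun `df.isnull().sum()` on the cleaned table, every column should report 0.

Boom. Clean data. No missing values. Maya’s store records are now complete, and the placeholder values show her exactly which orders still need a follow-up.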

🧠 Why Cleaning Data Matters

Here’s what Maya learned (and you will too):

You can’t analyze what you don’t organize.
Dirty data leads to faulty conclusions—especially in business.
Pandas gives you superpowers to handle real-world messiness like a pro.

Whether you're dealing with customer orders, survey forms, health data, or school records—clean data is the foundation of smart decisions.

✅ Final Recap: Your Cleaning Toolkit

Task	Code Snippet
Check for missing values	`df.isnull().sum()`
Drop missing rows	`df.dropna()`
Fill with default value	`df["column"].fillna(value)`
Forward fill	`df.ffill()`
Replace with message	`df["column"].fillna("To be confirmed")`

💬 Ready to Try It Yourself?

Download a messy CSV or create your own and run through the steps above.

Then ask yourself:

What kind of missing data do I have?
Does it make more sense to delete, fill, or flag?
What assumptions am I making?

What a Messy Clothing Store Taught Me About Cleaning Data in Pandas (Yes, Even the Missing Ones!)

Kumkum Hirani
