What a Messy Clothing Store Taught Me About Cleaning Data in Pandas (Yes, Even the Missing Ones!)


Last winter, my cousin Maya started her dream online clothing store.
It was small at first, just a few T-shirts, hoodies, and eco-friendly tote bags.
Orders started rolling in, and she kept everything in a spreadsheet: names, addresses, items purchased, quantities, and payments.
Until one day, she looked at her data and said:
“Umm… why is the address missing for 14 people?”
“And why are some quantities empty?”
That’s when she asked me:
“Can Python help clean this up?”
The answer? Absolutely yes.
And that’s exactly what today’s post is about.
We’ll use Pandas, Python’s powerful data-handling library, to clean messy data and deal with missing values, just like we did for Maya’s online store.
🧺 Step 1: Let’s Load the Messy Dataset
Imagine Maya exported her customer orders into a CSV file named orders.csv.
We'd start by loading that file into a DataFrame with pd.read_csv("orders.csv").
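Here is a minimal sketch of that loading step. Since orders.csv itself isn't attached to this post, the sample rows are inlined as a string (the ORDERS_CSV name is just an illustrative stand-in); with the real file you would pass the path to pd.read_csv instead.

```python
import io

import pandas as pd

# Stand-in for orders.csv: Maya's sample rows, inlined so the snippet runs
# on its own. With the real file, use: df = pd.read_csv("orders.csv")
ORDERS_CSV = """Customer Name,Address,Item,Quantity,Price
Alice,123 Maple St,Hoodie,2,40
Ben,,T-shirt,1,15
Cathy,456 Elm St,,,25
David,789 Oak Ave,Tote Bag,3,10
Emma,321 Pine Blvd,Hoodie,,40
"""

# Empty fields in the CSV are read as NaN, pandas' marker for "missing".
df = pd.read_csv(io.StringIO(ORDERS_CSV))

print(df.head())
```

Because the Quantity column contains blanks, pandas reads it as floating-point numbers rather than integers.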
Let’s say this is what the dataset looks like:
Customer Name | Address | Item | Quantity | Price |
Alice | 123 Maple St | Hoodie | 2 | 40 |
Ben | | T-shirt | 1 | 15 |
Cathy | 456 Elm St | | | 25 |
David | 789 Oak Ave | Tote Bag | 3 | 10 |
Emma | 321 Pine Blvd | Hoodie | | 40 |
Yikes. Missing addresses, items, quantities.
But don’t worry. This is where data cleaning magic begins.
🔍 Step 2: Spot the Missing Values
Before cleaning anything, we need to see what’s missing.
Calling df.isnull() gives you a full table showing which values are missing (True) and which aren't (False).
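As a rough sketch, assuming the sample orders are inlined in place of the real orders.csv, the check could look like this:

```python
import io

import pandas as pd

# Inlined sample rows standing in for orders.csv (an assumption for this sketch).
ORDERS_CSV = """Customer Name,Address,Item,Quantity,Price
Alice,123 Maple St,Hoodie,2,40
Ben,,T-shirt,1,15
Cathy,456 Elm St,,,25
David,789 Oak Ave,Tote Bag,3,10
Emma,321 Pine Blvd,Hoodie,,40
"""
df = pd.read_csv(io.StringIO(ORDERS_CSV))

# Same shape as df, but every cell is True (missing) or False (present).
missing_mask = df.isnull()
print(missing_mask)

# Rows that have at least one missing value, handy for a quick look.
rows_with_gaps = df[missing_mask.any(axis=1)]
print(rows_with_gaps)
```

The second printout narrows things down to just the orders that need attention (Ben, Cathy, and Emma in this sample).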
If you want a quick summary, use df.isnull().sum(), which counts the missing values in each column.
For Maya's orders, we now know:
1 missing address
1 missing item
2 missing quantities
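Those counts can be reproduced with a short sketch like the one below, again assuming the sample rows are inlined rather than read from the real file:

```python
import io

import pandas as pd

# Inlined sample rows standing in for orders.csv (an assumption for this sketch).
ORDERS_CSV = """Customer Name,Address,Item,Quantity,Price
Alice,123 Maple St,Hoodie,2,40
Ben,,T-shirt,1,15
Cathy,456 Elm St,,,25
David,789 Oak Ave,Tote Bag,3,10
Emma,321 Pine Blvd,Hoodie,,40
"""
df = pd.read_csv(io.StringIO(ORDERS_CSV))

# isnull() marks missing cells as True; sum() counts the Trues per column.
missing_counts = df.isnull().sum()
print(missing_counts)
# Customer Name: 0, Address: 1, Item: 1, Quantity: 2, Price: 0
```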
🧹 Step 3: Cleaning the Data (Multiple Approaches)
Here’s where we decide what to do with the missing values.
Option A: Drop Rows with Missing Values
If the data is incomplete or unusable, you can drop those rows entirely.
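A minimal sketch of dropping rows, using the inlined sample orders as a stand-in for orders.csv. The variable names are only illustrative:

```python
import io

import pandas as pd

# Inlined sample rows standing in for orders.csv (an assumption for this sketch).
ORDERS_CSV = """Customer Name,Address,Item,Quantity,Price
Alice,123 Maple St,Hoodie,2,40
Ben,,T-shirt,1,15
Cathy,456 Elm St,,,25
David,789 Oak Ave,Tote Bag,3,10
Emma,321 Pine Blvd,Hoodie,,40
"""
df = pd.read_csv(io.StringIO(ORDERS_CSV))

# Drop every row that has at least one missing value.
# dropna() returns a new DataFrame; df itself is left unchanged.
complete_rows = df.dropna()
print(complete_rows)

# A gentler variant: only drop rows where a specific column is missing.
with_address = df.dropna(subset=["Address"])
print(with_address)
```

On this sample, the first call keeps only Alice's and David's orders, while the subset variant drops only Ben's.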
But wait—do we really want to delete customer orders just because of a missing quantity? Probably not.
Option B: Fill Missing Values with Default Values
Let’s say we assume:
If quantity is missing, it’s 1 (default).
If item is missing, we mark it as "Unknown".
If address is missing, we flag it for follow-up.
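The three assumptions above can be sketched as follows. The "To be confirmed" label is the flag text used for addresses, and the sample rows are inlined in place of the real orders.csv:

```python
import io

import pandas as pd

# Inlined sample rows standing in for orders.csv (an assumption for this sketch).
ORDERS_CSV = """Customer Name,Address,Item,Quantity,Price
Alice,123 Maple St,Hoodie,2,40
Ben,,T-shirt,1,15
Cathy,456 Elm St,,,25
David,789 Oak Ave,Tote Bag,3,10
Emma,321 Pine Blvd,Hoodie,,40
"""
df = pd.read_csv(io.StringIO(ORDERS_CSV))

# One default per column; columns not listed here are left untouched.
defaults = {
    "Quantity": 1,                 # assume a single item was ordered
    "Item": "Unknown",             # mark unknown items explicitly
    "Address": "To be confirmed",  # flag for follow-up with the customer
}
filled = df.fillna(defaults)

# Quantities no longer contain NaN, so they can be whole numbers again.
filled["Quantity"] = filled["Quantity"].astype(int)
print(filled)
```

Passing a dictionary to fillna keeps all the assumptions in one place, which makes them easier to review later.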
This is often more helpful—we keep the data, but flag the issues.
Option C: Forward Fill / Backward Fill
If you're dealing with sequential data (like time series or grouped orders), you can fill missing values using nearby data with df.ffill() (forward fill) or df.bfill() (backward fill).
These fill in blanks based on the value just above or below them.
⚠️ Use with caution—only when you know the missing values logically follow a pattern.
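To see why the caution matters, here is a small sketch that applies both fills to the sample orders (inlined here in place of orders.csv). These rows are not really sequential, so the results are deliberately a bad fit:

```python
import io

import pandas as pd

# Inlined sample rows standing in for orders.csv (an assumption for this sketch).
ORDERS_CSV = """Customer Name,Address,Item,Quantity,Price
Alice,123 Maple St,Hoodie,2,40
Ben,,T-shirt,1,15
Cathy,456 Elm St,,,25
David,789 Oak Ave,Tote Bag,3,10
Emma,321 Pine Blvd,Hoodie,,40
"""
df = pd.read_csv(io.StringIO(ORDERS_CSV))

# Forward fill: each blank takes the value from the row above it.
forward = df.ffill()
print(forward)

# Backward fill: each blank takes the value from the row below it.
backward = df.bfill()
print(backward)
```

With forward fill, Ben ends up with Alice's address, which is clearly wrong for customer orders. With backward fill, Emma's quantity stays empty because there is no row below her to copy from.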
🧽 Step 4: Double-Check Your Cleaned Data
Let's make sure it worked by running df.isnull().sum() again. After filling the gaps, every column should report 0.
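A quick sketch of that check, assuming the gaps were filled with default values (quantity 1, item "Unknown", address "To be confirmed") and using the inlined sample rows in place of orders.csv:

```python
import io

import pandas as pd

# Inlined sample rows standing in for orders.csv (an assumption for this sketch).
ORDERS_CSV = """Customer Name,Address,Item,Quantity,Price
Alice,123 Maple St,Hoodie,2,40
Ben,,T-shirt,1,15
Cathy,456 Elm St,,,25
David,789 Oak Ave,Tote Bag,3,10
Emma,321 Pine Blvd,Hoodie,,40
"""
df = pd.read_csv(io.StringIO(ORDERS_CSV))

# Fill the gaps with the assumed defaults.
cleaned = df.fillna(
    {"Quantity": 1, "Item": "Unknown", "Address": "To be confirmed"}
)

# Re-run the missing-value count; every column should now be 0.
remaining = cleaned.isnull().sum()
print(remaining)

# Count the orders that still need a follow-up on the address.
needs_follow_up = int((cleaned["Address"] == "To be confirmed").sum())
print(needs_follow_up)
```

Counting the flagged rows is a useful reminder that filled values are placeholders, not confirmed facts.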
Boom. Clean data. No missing values. Maya's store records are now complete, though the defaults we assumed and the flagged addresses still need confirming.
🧠 Why Cleaning Data Matters
Here’s what Maya learned (and you will too):
You can’t analyze what you don’t organize.
Dirty data leads to faulty conclusions—especially in business.
Pandas gives you superpowers to handle real-world messiness like a pro.
Whether you're dealing with customer orders, survey forms, health data, or school records—clean data is the foundation of smart decisions.
✅ Final Recap: Your Cleaning Toolkit
Task | Code Snippet |
Check for missing values | df.isnull().sum() |
Drop missing rows | df.dropna() |
Fill with default value | df["column"].fillna(value) |
Forward fill | df.ffill() |
Replace with message | df["column"].fillna("To be confirmed") |
💬 Ready to Try It Yourself?
Download a messy CSV or create your own and run through the steps above.
Then ask yourself:
What kind of missing data do I have?
Does it make more sense to delete, fill, or flag?
What assumptions am I making?
Written by Kumkum Hirani
