What a Messy Clothing Store Taught Me About Cleaning Data in Pandas (Yes, Even the Missing Ones!)

Kumkum Hirani

Last winter, my cousin Maya started her dream online clothing store.

It was small at first, just a few T-shirts, hoodies, and eco-friendly tote bags.

Orders started rolling in, and she kept everything in a spreadsheet: names, addresses, items purchased, quantities, and payments.

Until one day, she looked at her data and said:
“Umm… why is the address missing for 14 people?”
“And why are some quantities empty?”

That’s when she asked me:
“Can Python help clean this up?”

The answer? Absolutely yes.

And that’s exactly what today’s post is about.

We’ll use Pandas, Python’s powerful data-handling library, to clean messy data and deal with missing values, just like we did for Maya’s online store.

🧺 Step 1: Let’s Load the Messy Dataset

Imagine Maya exported her customer orders into a CSV file named orders.csv.

Here’s how we’d start:
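A minimal sketch of the loading step. To keep it runnable anywhere, the CSV text is embedded in the script as a stand-in for orders.csv (the embedded rows mirror the sample table in this post; the exact file layout is an assumption). With the real file on disk, the call would simply be pd.read_csv("orders.csv").

```python
import io

import pandas as pd

# Stand-in for Maya's orders.csv (assumed layout: five columns, comma-separated).
# Empty fields between commas are read by pandas as missing values (NaN).
CSV_TEXT = """Customer Name,Address,Item,Quantity,Price
Alice,123 Maple St,Hoodie,2,40
Ben,,T-shirt,1,15
Cathy,456 Elm St,,,25
David,789 Oak Ave,Tote Bag,3,10
Emma,321 Pine Blvd,Hoodie,,40
"""

# With the real file you would write: df = pd.read_csv("orders.csv")
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Peek at the first rows to confirm the load worked.
print(df.head())
```

One thing to notice: because Quantity contains blanks, pandas stores that column as floats (2.0, 1.0, NaN, ...) rather than integers.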

Let’s say this is what the dataset looks like:

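| Customer Name | Address | Item | Quantity | Price |
| --- | --- | --- | --- | --- |
| Alice | 123 Maple St | Hoodie | 2 | 40 |
| Ben | | T-shirt | 1 | 15 |
| Cathy | 456 Elm St | | | 25 |
| David | 789 Oak Ave | Tote Bag | 3 | 10 |
| Emma | 321 Pine Blvd | Hoodie | | 40 |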

Yikes. Missing addresses, items, quantities.

But don’t worry. This is where data cleaning magic begins.

🔍 Step 2: Spot the Missing Values

Before cleaning anything, we need to see what’s missing.
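Here is a small sketch of that check using isnull(). The DataFrame is rebuilt by hand from the sample rows so the snippet stands alone; in practice df would come from reading the CSV.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the sample orders; np.nan marks each blank cell.
df = pd.DataFrame({
    "Customer Name": ["Alice", "Ben", "Cathy", "David", "Emma"],
    "Address": ["123 Maple St", np.nan, "456 Elm St", "789 Oak Ave", "321 Pine Blvd"],
    "Item": ["Hoodie", "T-shirt", np.nan, "Tote Bag", "Hoodie"],
    "Quantity": [2, 1, np.nan, 3, np.nan],
    "Price": [40, 15, 25, 10, 40],
})

# isnull() returns a DataFrame of the same shape, filled with True/False.
missing_mask = df.isnull()
print(missing_mask)
```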

That’ll give you a full table showing which values are missing (True) and which aren’t (False).

If you want a quick summary, use:
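A sketch of the summary call, again on a hand-built copy of the sample rows. Chaining sum() onto isnull() counts the True values in each column.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the sample orders; np.nan marks each blank cell.
df = pd.DataFrame({
    "Customer Name": ["Alice", "Ben", "Cathy", "David", "Emma"],
    "Address": ["123 Maple St", np.nan, "456 Elm St", "789 Oak Ave", "321 Pine Blvd"],
    "Item": ["Hoodie", "T-shirt", np.nan, "Tote Bag", "Hoodie"],
    "Quantity": [2, 1, np.nan, 3, np.nan],
    "Price": [40, 15, 25, 10, 40],
})

# True counts as 1 and False as 0, so the sum is the number of missing cells per column.
missing_counts = df.isnull().sum()
print(missing_counts)
```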

Output:

    Customer Name    0
    Address          1
    Item             1
    Quantity         2
    Price            0
    dtype: int64

We now know:

  • 1 missing address

  • 1 missing item

  • 2 missing quantities


🧹 Step 3: Cleaning the Data (Multiple Approaches)

Here’s where we decide what to do with the missing values.

Option A: Drop Rows with Missing Values

If the data is incomplete or unusable, you can drop those rows entirely.
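A minimal sketch of dropping rows with dropna(), on a hand-built copy of the sample rows. The second call, which only looks at the Address column, is an extra variation added here for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the sample orders; np.nan marks each blank cell.
df = pd.DataFrame({
    "Customer Name": ["Alice", "Ben", "Cathy", "David", "Emma"],
    "Address": ["123 Maple St", np.nan, "456 Elm St", "789 Oak Ave", "321 Pine Blvd"],
    "Item": ["Hoodie", "T-shirt", np.nan, "Tote Bag", "Hoodie"],
    "Quantity": [2, 1, np.nan, 3, np.nan],
    "Price": [40, 15, 25, 10, 40],
})

# Drop every row that has at least one missing value.
# dropna() returns a new DataFrame; df itself is left unchanged.
complete_rows = df.dropna()
print(complete_rows)

# Gentler variation: only drop rows where the address is missing.
has_address = df.dropna(subset=["Address"])
print(has_address)
```

On this sample, the first call keeps only Alice and David, which shows how quickly a blanket drop can throw away orders.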

But wait—do we really want to delete customer orders just because of a missing quantity? Probably not.

Option B: Fill Missing Values with Default Values

Let’s say we assume:

  • If quantity is missing, it’s 1 (default).

  • If item is missing, we mark it as "Unknown".

  • If address is missing, we flag it for follow-up.
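Those three assumptions can be sketched with a single fillna() call that takes a dictionary of column names and fill values. The "To be confirmed" placeholder for addresses is one possible flag, chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the sample orders; np.nan marks each blank cell.
df = pd.DataFrame({
    "Customer Name": ["Alice", "Ben", "Cathy", "David", "Emma"],
    "Address": ["123 Maple St", np.nan, "456 Elm St", "789 Oak Ave", "321 Pine Blvd"],
    "Item": ["Hoodie", "T-shirt", np.nan, "Tote Bag", "Hoodie"],
    "Quantity": [2, 1, np.nan, 3, np.nan],
    "Price": [40, 15, 25, 10, 40],
})

# One default per column; columns not listed are left untouched.
defaults = {
    "Quantity": 1,                 # assume a single item was ordered
    "Item": "Unknown",             # mark the item as unknown
    "Address": "To be confirmed",  # placeholder that flags the row for follow-up
}
cleaned = df.fillna(defaults)

# Quantity was stored as float because of the NaNs; convert back to whole numbers.
cleaned["Quantity"] = cleaned["Quantity"].astype(int)

# Rows that still need a call or an email to the customer.
needs_follow_up = cleaned[cleaned["Address"] == "To be confirmed"]
print(cleaned)
print(needs_follow_up)
```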

This is often more helpful—we keep the data, but flag the issues.

Option C: Forward Fill / Backward Fill

If you're dealing with sequential data (like time series or grouped orders), you can fill missing values using nearby data:
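A small forward-fill sketch. The daily stock count below is a made-up example (it is not part of Maya's orders), picked because forward fill makes more sense for readings taken in sequence. Older tutorials often write df.fillna(method="ffill"); recent pandas releases deprecate that form in favor of ffill().

```python
import numpy as np
import pandas as pd

# Hypothetical daily stock count with gaps on days nobody recorded it.
stock = pd.DataFrame(
    {"Hoodies In Stock": [20, np.nan, np.nan, 17, np.nan]},
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Forward fill: each gap takes the last known value above it.
filled_forward = stock.ffill()
print(filled_forward)
```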

Or:
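A matching backward-fill sketch on the same made-up stock count (again hypothetical data, not Maya's orders).

```python
import numpy as np
import pandas as pd

# Hypothetical daily stock count with gaps on days nobody recorded it.
stock = pd.DataFrame(
    {"Hoodies In Stock": [20, np.nan, np.nan, 17, np.nan]},
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Backward fill: each gap takes the next known value below it.
filled_backward = stock.bfill()
print(filled_backward)

# The last row has nothing below it, so it stays missing.
still_missing = int(filled_backward["Hoodies In Stock"].isna().sum())
print(still_missing)
```

Note that a gap at the very end of the data has no later value to borrow, so it remains empty after a backward fill (and the same happens at the very start with a forward fill).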

This fills in blanks based on the value just above or below it.

⚠️ Use with caution—only when you know the missing values logically follow a pattern.

🧽 Step 4: Double-Check Your Cleaned Data

Let’s make sure it worked.
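A minimal sketch of the check, assuming the Option B defaults were the approach chosen. It rebuilds the sample rows by hand, applies those defaults, and counts the missing values again.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the sample orders; np.nan marks each blank cell.
df = pd.DataFrame({
    "Customer Name": ["Alice", "Ben", "Cathy", "David", "Emma"],
    "Address": ["123 Maple St", np.nan, "456 Elm St", "789 Oak Ave", "321 Pine Blvd"],
    "Item": ["Hoodie", "T-shirt", np.nan, "Tote Bag", "Hoodie"],
    "Quantity": [2, 1, np.nan, 3, np.nan],
    "Price": [40, 15, 25, 10, 40],
})

# Apply the Option B defaults (assumed here to be the chosen approach).
cleaned = df.fillna({"Quantity": 1, "Item": "Unknown", "Address": "To be confirmed"})

# Re-run the same summary used in Step 2; every count should now be zero.
remaining = cleaned.isnull().sum()
print(remaining)

# A single True/False answer is handy inside scripts.
has_missing = bool(cleaned.isnull().values.any())
print(has_missing)
```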

Output:

    Customer Name    0
    Address          0
    Item             0
    Quantity         0
    Price            0
    dtype: int64

Boom. Clean data. No missing values. Maya’s store records are now complete, and the placeholder entries show exactly which orders still need a follow-up with the customer.

🧠 Why Cleaning Data Matters

Here’s what Maya learned (and you will too):

  • You can’t analyze what you don’t organize.

  • Dirty data leads to faulty conclusions—especially in business.

  • Pandas gives you superpowers to handle real-world messiness like a pro.

Whether you're dealing with customer orders, survey forms, health data, or school records—clean data is the foundation of smart decisions.

✅ Final Recap: Your Cleaning Toolkit

| Task | Code Snippet |
| --- | --- |
| Check for missing values | df.isnull().sum() |
| Drop missing rows | df.dropna() |
| Fill with default value | df["column"].fillna(value) |
| Forward fill | df.ffill() |
| Replace with message | df["column"].fillna("To be confirmed") |

💬 Ready to Try It Yourself?

Download a messy CSV or create your own and run through the steps above.

Then ask yourself:

  • What kind of missing data do I have?

  • Does it make more sense to delete, fill, or flag?

  • What assumptions am I making?
