Clean Your Messy Data with Pandas

Let’s face it: data is rarely clean. Whether you're building dashboards, importing user spreadsheets, or doing analysis, data validation and cleaning are essential skills.

In this blog, you’ll learn how to use Python’s pandas library to:

  • Identify missing or invalid entries

  • Clean and transform data

  • Validate values and types

  • Prepare data for reliable use


📥 The Problem

Let’s say you get a CSV file from users like this:

csvCopyEditname,email,age,join_date
Alice,alice@example.com,30,2021-07-01
Bob,,27,not-a-date
,carol@example.com,twenty,2021-08-15
David,david@example.com,45,2021-06-25

Looks simple, but it’s full of issues:

  • Missing names and emails

  • Non-numeric age

  • Invalid date formats


🧪 Step 1: Load and Inspect

pythonCopyEditimport pandas as pd

df = pd.read_csv("users.csv")
print(df.info())
print(df.head())

🕵️ Step 2: Detect and Handle Missing Data

pythonCopyEditprint(df.isnull().sum())

# Drop rows with missing critical fields
df = df.dropna(subset=["name", "email"])

Or fill missing values:

pythonCopyEditdf["age"] = df["age"].fillna(0)

🔢 Step 3: Validate and Convert Types

Convert age to integers:

pythonCopyEditdef clean_age(value):
    try:
        return int(value)
    except:
        return None

df["age"] = df["age"].apply(clean_age)
df = df.dropna(subset=["age"])  # Drop rows where age couldn't be fixed

Parse dates:

pythonCopyEditdf["join_date"] = pd.to_datetime(df["join_date"], errors="coerce")
df = df.dropna(subset=["join_date"])

🧹 Step 4: Strip and Standardize

pythonCopyEditdf["name"] = df["name"].str.strip().str.title()
df["email"] = df["email"].str.lower()

✅ Final Cleaned Data

pythonCopyEditprint(df)
df.to_csv("cleaned_users.csv", index=False)

Now you have a clean dataset ready for storage, analysis, or machine learning.


⚠️ Bonus: Flag Invalid Rows Instead of Dropping

pythonCopyEditdf["valid"] = df["age"].apply(lambda x: isinstance(x, (int, float))) & \
              df["join_date"].notnull() & \
              df["name"].notnull() & df["email"].notnull()

You can export valid and invalid entries separately.


🧵 Wrapping Up

Cleaning and validating data is often 80% of the work in data projects. With pandas, you can write powerful and readable data cleanup pipelines that save time and headaches.

0
Subscribe to my newsletter

Read articles from Ashraful Islam Leon directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Ashraful Islam Leon
Ashraful Islam Leon

Passionate Software Developer | Crafting clean code and elegant solutions