Clean Data, Clear Insights: A Beginner's Guide to Data Cleaning

Saurabh Mahajan

Why Data Cleaning Matters

Imagine you’re working as a data analyst at a growing ride-sharing company. You’ve just received user data from the last three months, and your goal is to analyze customer trends. But the moment you open the file, it’s chaos:

  • Blank rows

  • Weird date formats

  • Duplicate entries

  • Typos and inconsistent spellings in city names, like "New Yrk" alongside "New York"

Welcome to the real world of raw data.
To uncover real insights, you need to clean your data first.

Our Example Dataset

Let’s say we have a CSV file from a fictional ride-sharing platform, rides_data.csv, with columns for the user and ride IDs, city, ride date, fare, distance, and rating.
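For illustration, here is a small hand-made sample with the same kinds of problems; the column names and values below are invented for this example, not taken from the real file:

import pandas as pd

# Purely illustrative sample; the real data would be loaded from the CSV
df = pd.DataFrame({
    "User ID ": [1001, 1002, 1002, None],
    "Ride ID": [501, 502, 502, 504],
    "City": ["New York", "new yrk", "new yrk", "new york "],
    "Ride Date": ["2024-01-12 08:45:00", "2024-01-13 19:02:00", "2024-01-13 19:02:00", "not a date"],
    "Fare USD": [23.50, 8.75, 8.75, 12.00],
    "Distance KM": [5.2, 310.0, 310.0, 3.1],
    "Rating": [4.5, None, None, 5.0],
})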

Let’s Clean It Step by Step (Using Pandas in Python)

import pandas as pd

# Load the dataset
df = pd.read_csv("rides_data.csv")

Step 1: Standardizing Column Names

This ensures consistency when referencing columns, since stray spaces and inconsistent casing often cause bugs. str.strip() removes any leading or trailing whitespace from the column names, str.lower() converts them all to lowercase, and str.replace(" ", "_") replaces any spaces with underscores.

df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
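As a quick sanity check, here is what the chained string methods do to a few hypothetical raw column names:

cols = pd.Index(["User ID ", "Ride ID", "Fare USD"])
print(cols.str.strip().str.lower().str.replace(" ", "_"))
# Index(['user_id', 'ride_id', 'fare_usd'], dtype='object')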

Step 2: Handling Missing Values

df.isnull() creates a DataFrame of True and False values, where True means the value is missing (NaN), and .sum() counts the number of True values (missing values) per column.

.dropna(subset=[...]) tells pandas to drop rows that have NaN values in the specified subset of columns ('user_id' and 'ride_id').

.fillna(df['rating'].mean()) takes the mean (average) of the existing ratings and replaces any missing (NaN) values in the rating column with that mean.

# Check how many nulls per column
print(df.isnull().sum())

# Drop rows where user_id or ride_id is missing
df = df.dropna(subset=['user_id', 'ride_id'])

# Optionally fill missing ratings with the average
df['rating'] = df['rating'].fillna(df['rating'].mean())
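If the rating distribution is skewed by a few extreme values, the median is often a safer fill value than the mean; this is a variation on the same idea, not something the steps above require:

# Alternative: fill missing ratings with the median, which is less
# affected by a handful of extreme ratings than the mean
df['rating'] = df['rating'].fillna(df['rating'].median())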

Step 3: Correcting Inconsistent Entries (Like City Names)

.replace() is used to substitute specific incorrect values with correct ones. Here, we're replacing:

'new york' → 'New York'

'new york ' → 'New York' (removes the trailing space)

'new yrk' → 'New York' (fixes the typo "yrk")

df['city'] = df['city'].str.strip().str.lower()

# Fix common typos manually or use fuzzy matching
df['city'] = df['city'].replace({
    'new york': 'New York',
    'new york ': 'New York',
    'new yrk': 'New York'
})
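The comment above mentions fuzzy matching. When there are many variants, a manual map doesn't scale, so here is a minimal sketch using Python's standard-library difflib; the list of known cities is an assumption for the example:

from difflib import get_close_matches

known_cities = ['new york', 'boston', 'chicago']  # assumed reference list

def match_city(name):
    # Return the closest known city (title-cased), or the original value if nothing is close enough
    matches = get_close_matches(str(name), known_cities, n=1, cutoff=0.8)
    return matches[0].title() if matches else name

df['city'] = df['city'].apply(match_city)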

Step 4: Fixing Date Formats

pd.to_datetime() parses various string formats into a proper datetime object like 2024-01-12 08:45:00. In most cases it infers the format automatically, so you don't need to specify it yourself. errors='coerce' tells pandas that if a value can't be parsed, it shouldn't crash; it just converts that value to NaT (Not a Time).

df['ride_date'] = pd.to_datetime(df['ride_date'], errors='coerce')
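A quick illustration with made-up values: valid timestamps are parsed, and anything unreadable becomes NaT instead of raising an error.

sample = pd.Series(['2024-01-12 08:45:00', '2024-02-03 14:10:00', 'not a date'])
print(pd.to_datetime(sample, errors='coerce'))
# The first two become proper datetimes; 'not a date' becomes NaT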

Step 5: Handling Duplicates

Duplicate rows can inflate metrics like total rides or revenue, so we remove them.

df = df.drop_duplicates()
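By default drop_duplicates() only removes rows that are identical across every column. If ride_id alone is supposed to be unique, you can deduplicate on that column instead, keeping the first occurrence (whether that is the right key is an assumption about this dataset):

# Keep only the first row for each ride_id
df = df.drop_duplicates(subset=['ride_id'], keep='first')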

Step 6: Ensuring Correct Data Types

df['user_id'].astype(int) converts the user_id column to integer type. This matters because IDs should be treated as whole numbers (e.g., 1001, 1002), and it prevents issues where IDs are mistakenly treated as text (which could interfere with filtering, sorting, or joining with other data).

Similarly, df['fare_usd'].astype(float) converts the fare_usd column to float type. Fare values typically include decimals (e.g., $23.50), so this enables accurate arithmetic (average fare, total revenue, etc.) and prevents calculation errors that could happen if the column were stored as text.

df['user_id'] = df['user_id'].astype(int)
df['fare_usd'] = df['fare_usd'].astype(float)
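One caveat: astype() raises an error if any value can't be converted. If you'd rather flag bad values than stop the script, pd.to_numeric with errors='coerce' is a more forgiving alternative:

# Coerce anything non-numeric to NaN instead of raising, then inspect the bad rows
df['fare_usd'] = pd.to_numeric(df['fare_usd'], errors='coerce')
print(df[df['fare_usd'].isnull()])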

Step 7: Outlier Detection (Optional but Powerful)

Outlier detection is about identifying and handling unusual data points that are significantly different from the rest of your dataset.

# For example, check if distance is unrealistic
df = df[df['distance_km'] < 100]  # Assuming no rides over 100 km are valid

Similarly, we can find outliers in fares with df['fare_usd'].quantile(0.99). This calculates the 99th percentile of the fare_usd column, meaning 99% of fare values are less than or equal to this number; anything above it is a potential outlier.

If these extreme values come from system or human error, you can remove them:

df = df[df['fare_usd'] <= df['fare_usd'].quantile(0.99)]
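Another widely used rule of thumb (not required by the steps above) is the interquartile-range fence, which flags values far outside the middle 50% of the distribution:

# IQR fence: keep fares within 1.5 * IQR of the middle 50%
q1, q3 = df['fare_usd'].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df['fare_usd'] >= q1 - 1.5 * iqr) & (df['fare_usd'] <= q3 + 1.5 * iqr)]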

Wrap It into a Function

If we know that a similar dataset will be used again, say next month for an internal team meeting, we can automate the cleanup with a small function.

def clean_ride_data(df):
    # Work on a copy so the caller's original DataFrame is not modified
    df = df.copy()
    # Standardize column names
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    # Drop rows missing key identifiers, fill missing ratings with the mean
    df = df.dropna(subset=['user_id', 'ride_id'])
    df['rating'] = df['rating'].fillna(df['rating'].mean())
    # Normalize city names and fix known typos
    df['city'] = df['city'].str.strip().str.lower().replace({
        'new york': 'New York',
        'new york ': 'New York',
        'new yrk': 'New York'
    })
    # Parse dates, coercing unreadable values to NaT
    df['ride_date'] = pd.to_datetime(df['ride_date'], errors='coerce')
    # Remove exact duplicate rows and unrealistic distances
    df = df.drop_duplicates()
    df = df[df['distance_km'] < 100]
    # Enforce correct data types
    df['user_id'] = df['user_id'].astype(int)
    df['fare_usd'] = df['fare_usd'].astype(float)
    return df
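Using it on next month's export then becomes a one-liner (the file name is assumed to follow the same pattern):

df = clean_ride_data(pd.read_csv("rides_data.csv"))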

Conclusion

Data cleaning might seem tedious—but it’s where real analysis begins. Without it, your insights are built on sand.

Clean data = trustworthy insights = better decisions.

So the next time you open up that messy Excel sheet from a digital platform, take a deep breath and remember: every insight starts with a good cleaning.
