Airbnb Data Cleaning: A Comprehensive Guide

📝 Introduction

It’s been a while since I last posted a blog, but I’m back — and this time with a fresh data cleaning project!

I wanted to work on something that wasn’t just a toy dataset but also gave me the chance to handle real-world messiness. So, I picked the Airbnb Open Data from Kaggle.

The goal? Take messy raw data and turn it into something consistent, clean, and ready for analysis.

When I first opened the dataset, I found:

Unnecessary columns
Missing values
Inconsistent text formatting
Duplicate rows
Categorical values needing standardization
Price and service fee stored as messy strings

This blog takes you through exactly what I did, step-by-step, so you can follow along.

📂 Step 1: Importing and Understanding the Data

I loaded the dataset into Pandas and did a quick scan:

import pandas as pd
df = pd.read_csv("Airbnb_Open_Data.csv")
df.head()
df.shape
df.info()
df.isnull().sum()

From this, I learned:

26 columns in total
Several columns with missing values
Some columns were clearly not useful for my analysis

🗑 Step 2: Dropping Unnecessary Columns

I removed extra fields like:

reviews per month
review rate number
calculated host listings count
availability 365
house_rules
license

columns_to_drop = ['reviews per month', 'review rate number',
                   'calculated host listings count', 'availability 365',
                   'house_rules', 'license']
df.drop(columns=columns_to_drop, inplace=True)

Columns before dropping unwanted columns

Columns after dropping unwanted columns

✏ Step 3: Making Column Names Consistent

To make my work easier, I converted all column names to lowercase:

df.columns = [col.lower() for col in df.columns]

🔍 Step 4: Removing Duplicate Rows

df.duplicated().sum()
df.drop_duplicates(inplace=True)

This ensured every listing was unique.

🧹 Step 5: Handling Missing Values

First, I dropped the last review column — too many missing values.
Then I removed all remaining rows with missing values:

df.drop(columns=["last review"], inplace=True)
df.dropna(inplace=True)

🔠 Step 6: Standardizing Text Data

The host_identity_verified column had mixed cases, so I converted all to uppercase:

df['host_identity_verified'] = df['host_identity_verified'].str.upper()

🔄 Step 7: Converting Boolean to Numeric

The instant_bookable column had True/False values. I converted them into 1 and 0:

df['instant_bookable'] = df['instant_bookable'].apply(lambda x: 1 if x==True else 0)

💵 Step 8: Cleaning Price and Service Fee

Both price and service fee were stored as strings with $, ,, and spaces. I removed those characters and converted them to integers:

to_remove = ['$', ',', ' ']
for char in to_remove:
    df['price'] = df['price'].str.replace(char, '')
    df['service fee'] = df['service fee'].str.replace(char, '')

df['price'] = df['price'].astype(int)
df['service fee'] = df['service fee'].astype(int)

📦 Step 9: Resetting Index and Saving

df.reset_index(drop=True, inplace=True)
df.to_csv("clean_airbnb_data.csv", index=False)

✅ Final Clean Dataset

The final dataset was:

Free of duplicates
Without missing values
Clean and standardized columns
Prices and service fees ready for numeric operations

📂 You can find my project files on GitHub: GitHub Repo Link

🔗Link to the dataset used in this project: Airbnb Open Data

📢 Stay tuned!
I’ll be sharing more new projects soon, and I’m also diving into Generative AI and projects, which I can’t wait to blog about next!

Cleaning Airbnb Open Data – A Step-by-Step Case Study