Cleaning Airbnb Open Data โ A Step-by-Step Case Study


๐ Introduction
Itโs been a while since I last posted a blog, but Iโm back โ and this time with a fresh data cleaning project!
I wanted to work on something that wasnโt just a toy dataset but also gave me the chance to handle real-world messiness. So, I picked the Airbnb Open Data from Kaggle.
The goal? Take messy raw data and turn it into something consistent, clean, and ready for analysis.
When I first opened the dataset, I found:
Unnecessary columns
Missing values
Inconsistent text formatting
Duplicate rows
Categorical values needing standardization
Price and service fee stored as messy strings
This blog takes you through exactly what I did, step-by-step, so you can follow along.
๐ Step 1: Importing and Understanding the Data
I loaded the dataset into Pandas and did a quick scan:
import pandas as pd
df = pd.read_csv("Airbnb_Open_Data.csv")
df.head()
df.shape
df.info()
df.isnull().sum()
From this, I learned:
26 columns in total
Several columns with missing values
Some columns were clearly not useful for my analysis
๐ Step 2: Dropping Unnecessary Columns
I removed extra fields like:
reviews per month
review rate number
calculated host listings count
availability 365
house_rules
license
columns_to_drop = ['reviews per month', 'review rate number',
'calculated host listings count', 'availability 365',
'house_rules', 'license']
df.drop(columns=columns_to_drop, inplace=True)
Columns before dropping unwanted columns
Columns after dropping unwanted columns
โ Step 3: Making Column Names Consistent
To make my work easier, I converted all column names to lowercase:
df.columns = [col.lower() for col in df.columns]
๐ Step 4: Removing Duplicate Rows
df.duplicated().sum()
df.drop_duplicates(inplace=True)
This ensured every listing was unique.
๐งน Step 5: Handling Missing Values
First, I dropped the last review
column โ too many missing values.
Then I removed all remaining rows with missing values:
df.drop(columns=["last review"], inplace=True)
df.dropna(inplace=True)
๐ Step 6: Standardizing Text Data
The host_identity_verified
column had mixed cases, so I converted all to uppercase:
df['host_identity_verified'] = df['host_identity_verified'].str.upper()
๐ Step 7: Converting Boolean to Numeric
The instant_bookable
column had True/False values. I converted them into 1
and 0
:
df['instant_bookable'] = df['instant_bookable'].apply(lambda x: 1 if x==True else 0)
๐ต Step 8: Cleaning Price and Service Fee
Both price
and service fee
were stored as strings with $
, ,
, and spaces. I removed those characters and converted them to integers:
to_remove = ['$', ',', ' ']
for char in to_remove:
df['price'] = df['price'].str.replace(char, '')
df['service fee'] = df['service fee'].str.replace(char, '')
df['price'] = df['price'].astype(int)
df['service fee'] = df['service fee'].astype(int)
๐ฆ Step 9: Resetting Index and Saving
df.reset_index(drop=True, inplace=True)
df.to_csv("clean_airbnb_data.csv", index=False)
โ Final Clean Dataset
The final dataset was:
Free of duplicates
Without missing values
Clean and standardized columns
Prices and service fees ready for numeric operations
๐ You can find my project files on GitHub: GitHub Repo Link
๐Link to the dataset used in this project: Airbnb Open Data
๐ข Stay tuned!
Iโll be sharing more new projects soon, and Iโm also diving into Generative AI and projects, which I canโt wait to blog about next!
Subscribe to my newsletter
Read articles from Avadhoot Kamble directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
