Cleaning Airbnb Open Data โ€“ A Step-by-Step Case Study

Avadhoot KambleAvadhoot Kamble
3 min read

๐Ÿ“ Introduction

Itโ€™s been a while since I last posted a blog, but Iโ€™m back โ€” and this time with a fresh data cleaning project!

I wanted to work on something that wasnโ€™t just a toy dataset but also gave me the chance to handle real-world messiness. So, I picked the Airbnb Open Data from Kaggle.

The goal? Take messy raw data and turn it into something consistent, clean, and ready for analysis.

When I first opened the dataset, I found:

  • Unnecessary columns

  • Missing values

  • Inconsistent text formatting

  • Duplicate rows

  • Categorical values needing standardization

  • Price and service fee stored as messy strings

This blog takes you through exactly what I did, step-by-step, so you can follow along.


๐Ÿ“‚ Step 1: Importing and Understanding the Data

I loaded the dataset into Pandas and did a quick scan:

import pandas as pd
df = pd.read_csv("Airbnb_Open_Data.csv")
df.head()
df.shape
df.info()
df.isnull().sum()

From this, I learned:

  • 26 columns in total

  • Several columns with missing values

  • Some columns were clearly not useful for my analysis

๐Ÿ—‘ Step 2: Dropping Unnecessary Columns

I removed extra fields like:

  • reviews per month

  • review rate number

  • calculated host listings count

  • availability 365

  • house_rules

  • license

columns_to_drop = ['reviews per month', 'review rate number',
                   'calculated host listings count', 'availability 365',
                   'house_rules', 'license']
df.drop(columns=columns_to_drop, inplace=True)

Columns before dropping unwanted columns

Columns after dropping unwanted columns

โœ Step 3: Making Column Names Consistent

To make my work easier, I converted all column names to lowercase:

df.columns = [col.lower() for col in df.columns]

๐Ÿ” Step 4: Removing Duplicate Rows

df.duplicated().sum()
df.drop_duplicates(inplace=True)

This ensured every listing was unique.

๐Ÿงน Step 5: Handling Missing Values

First, I dropped the last review column โ€” too many missing values.
Then I removed all remaining rows with missing values:

df.drop(columns=["last review"], inplace=True)
df.dropna(inplace=True)

๐Ÿ”  Step 6: Standardizing Text Data

The host_identity_verified column had mixed cases, so I converted all to uppercase:

df['host_identity_verified'] = df['host_identity_verified'].str.upper()

๐Ÿ”„ Step 7: Converting Boolean to Numeric

The instant_bookable column had True/False values. I converted them into 1 and 0:

df['instant_bookable'] = df['instant_bookable'].apply(lambda x: 1 if x==True else 0)

๐Ÿ’ต Step 8: Cleaning Price and Service Fee

Both price and service fee were stored as strings with $, ,, and spaces. I removed those characters and converted them to integers:

to_remove = ['$', ',', ' ']
for char in to_remove:
    df['price'] = df['price'].str.replace(char, '')
    df['service fee'] = df['service fee'].str.replace(char, '')

df['price'] = df['price'].astype(int)
df['service fee'] = df['service fee'].astype(int)

๐Ÿ“ฆ Step 9: Resetting Index and Saving

df.reset_index(drop=True, inplace=True)
df.to_csv("clean_airbnb_data.csv", index=False)

โœ… Final Clean Dataset

The final dataset was:

  • Free of duplicates

  • Without missing values

  • Clean and standardized columns

  • Prices and service fees ready for numeric operations

๐Ÿ“‚ You can find my project files on GitHub: GitHub Repo Link

๐Ÿ”—Link to the dataset used in this project: Airbnb Open Data

๐Ÿ“ข Stay tuned!
Iโ€™ll be sharing more new projects soon, and Iโ€™m also diving into Generative AI and projects, which I canโ€™t wait to blog about next!

0
Subscribe to my newsletter

Read articles from Avadhoot Kamble directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Avadhoot Kamble
Avadhoot Kamble