Analyzing Hotel Booking Trends

Overview

This is a personal project I worked on to explore hotel booking data for AtliQ Grands, a fictional hotel chain that operates in four Indian cities: Delhi, Mumbai, Bangalore, and Hyderabad. The business offers four different hotel brands — AtliQ Seasons, Exotica, Bay, and Palace, catering to luxury and business travelers.

The goal of this project wasn't to build dashboards or report KPIs — it was to put myself in the shoes of a data analyst: to get messy with raw data, understand the business situation, clean and transform datasets, and dig for meaningful insights that can guide action.

I wanted to simulate a real-world scenario and sharpen my data exploration and storytelling skills in the process.

Business Problem

Despite being in the hotel industry for two decades, AtliQ Grands has been facing a decline in market share and revenue. Competitors are pulling ahead, possibly due to smarter decisions, better use of data, or more strategic marketing. The company believes that lack of data-driven decision-making is part of the problem.

So, the idea was to deep dive into the booking data and see what could be discovered: occupancy rates, seasonal trends, city-wise performance, room category preferences, and revenue flows.

Datasets Used & Understanding the Schema

The project uses five structured CSV files following a simplified star schema — with dimension and fact tables. Here’s a quick overview of each dataset and its role:

1. `dim_date.csv` – Date Dimension

Gives various date-based perspectives (daily, monthly, weekly, weekday/weekend). Useful for trend analysis and aggregations over time.

date: Full calendar date.
mmm yy: Month and year in short format (e.g., Jan 22).
week no: Week number.
day_type: Weekday or Weekend.

2. `dim_hotels.csv` – Hotel Properties Dimension

Contains hotel metadata like type and city. Used for slicing the data by city or business category.

property_id: Unique hotel ID.
property_name: Hotel name.
category: Business/Luxury.
city: Location.

3. `dim_rooms.csv` – Room Categories Dimension

Describes room types. Helps compare metrics like occupancy or revenue by room class.

room_id: Room type ID.
room_class: Name of the room class (Standard, Premium, etc).

4. `fact_bookings.csv` – Detailed Booking Records

The most granular dataset with booking-level data: guest count, revenue, platform, etc.

Booking dates (booking_date, check_in_date, checkout_date)
no_guests, booking_platform, ratings_given
booking_status, revenue_generated, revenue_realized

5. `fact_aggregated_bookings.csv` – Daily Aggregated Bookings

High-level metrics of bookings and room availability per day per room class.

check_in_date, room_category, successful_bookings, capacity

Data Exploration

Before cleaning or transforming, I loaded the datasets using pandas and began exploring shapes, data types, missing values, and distributions.

import pandas as pd

df_bookings = pd.read_csv("fact_bookings.csv")
df_agg_bookings = pd.read_csv("fact_aggregated_bookings.csv")
df_hotels = pd.read_csv("dim_hotels.csv")
df_rooms = pd.read_csv("dim_rooms.csv")
df_date = pd.read_csv("dim_date.csv")

Key observations from df_bookings:

df_bookings.shape
df_bookings.info()
df_bookings.describe()
df_bookings["booking_platform"].value_counts()
df_bookings["room_category"].unique()
df_bookings["ratings_given"].isna().sum()

Over 100,000 rows.
∼77,000 missing values in ratings_given (expected, since not all guests rate).
Some bookings had negative no_guests — definitely incorrect.
Revenue values seemed unusually high for some entries.

Data Cleaning

1. Remove Negative Guest Counts

df_bookings = df_bookings[df_bookings["no_guests"] > 0]

Why? Negative guest numbers aren't realistic and likely reflect data entry issues. These would distort metrics like average guests per booking.

2. Remove Extreme Revenue Outliers

avg = df_bookings["revenue_generated"].mean()
std = df_bookings["revenue_generated"].std()
high_limit = avg + 3*std
df_bookings = df_bookings[df_bookings["revenue_generated"] <= high_limit]

Why? To make revenue insights more trustworthy. Some entries were in crores, far beyond average values. This method filters those out assuming a normal distribution.

3. Convert Date Columns

df_date["date"] = pd.to_datetime(df_date["date"])
df_bookings["check_in_date"] = pd.to_datetime(df_bookings["check_in_date"])

Why? Enables date filtering, grouping by month/week, and joining with dim_date.

Data Transformation

1. Calculate Occupancy Percentage

df_agg_bookings["occ_pct"] = df_agg_bookings["successful_bookings"] / df_agg_bookings["capacity"]
df_agg_bookings["occ_pct"] = df_agg_bookings["occ_pct"].apply(lambda x: round(x*100, 2))

Why? A core performance metric for hotels — tells us how well the available inventory is being used.

Insights and Analysis

Each of these analyses builds on the cleaned and merged datasets to answer specific business questions.

Average Occupancy Rate by Room Class

df = pd.merge(df_agg_bookings, df_rooms, left_on="room_category", right_on="room_id")
df.drop("room_id", axis=1, inplace=True)
df.groupby("room_class")["occ_pct"].mean().round(2)

Insight:

Presidential    59.30
Standard        58.22
Elite           58.04
Premium         58.03

All classes perform similarly. Presidential rooms do slightly better — perhaps due to exclusivity or fewer rooms available.

Average Occupancy Rate by City

df = pd.merge(df, df_hotels, on="property_id")
df.groupby("city")["occ_pct"].mean().round(2)

Insight:

Delhi        61.61
Hyderabad    58.14
Mumbai       57.94
Bangalore    56.59

Delhi has the highest occupancy. Indicates higher demand or better operations there.

Weekday vs Weekend Occupancy

df = pd.merge(df, df_date, left_on="check_in_date", right_on="date")
df.groupby("day_type")["occ_pct"].mean().round(2)

Insight:

weekday    50.90
weekend    72.39

Clear spike in demand during weekends. Hotels can optimize pricing and marketing accordingly.

June 2022 City-wise Occupancy

df_june = df[df["mmm yy"] == "Jun 22"]
df_june.groupby("city")["occ_pct"].mean().round(2)

Insight:

Delhi        62.47
Hyderabad    58.46
Mumbai       58.38
Bangalore    56.58

Delhi continues to lead, consistent with overall trends.

Revenue Realized by City

df_bookings_all = pd.merge(df_bookings, df_hotels, on="property_id")
df_bookings_all.groupby("city")["revenue_realized"].sum()

Insight:

Mumbai       668,569,251
Bangalore    420,383,550
Hyderabad    325,179,310
Delhi        294,404,488

Interestingly, Delhi — despite having high occupancy — has the lowest revenue. Possible reason: lower prices or fewer premium rooms.

Monthly Revenue Trends

df_date["date"] = pd.to_datetime(df_date["date"])
df_bookings_all["check_in_date"] = pd.to_datetime(df_bookings_all["check_in_date"])

df_bookings_all = pd.merge(df_bookings_all, df_date, left_on="check_in_date", right_on="date")

df_bookings_all.groupby("mmm yy")["revenue_realized"].sum()

Insight:

May 22    408,375,641
Jun 22    377,191,229
Jul 22    389,940,912

May 2022 was the best month in terms of revenue. This might relate to vacation season or events.

Revenue Realized by Hotel Type

revenue_per_hotel = df_bookings_all.groupby("category")["revenue_realized"].sum()
revenue_per_hotel

Insight:

Business    451,950,715
Luxury      723,557,067

Luxury hotels generated significantly more revenue (over ₹723 Cr) compared to Business hotels (₹452 Cr). This aligns with expectations as luxury properties typically have higher room rates and premium services driving revenue. The hotel category is a strong revenue driver.

Average Rating by City

avg_rating_per_city = df_bookings_all.groupby("city")["ratings_given"].mean().round(2).sort_values(ascending=False)
avg_rating_per_city

Insight:

Delhi        3.78
Hyderabad    3.66
Mumbai       3.64
Bangalore    3.40

Delhi hotels have the highest average customer ratings (3.78), indicating higher guest satisfaction or service quality, followed by Hyderabad and Mumbai. Bangalore trails with a lower rating (3.40), suggesting potential room for service improvements or guest experience enhancements there.

Average Booking Size by Hotel Type

df_checked_out = df_bookings_all[df_bookings_all["booking_status"] == "Checked Out"]

total_guests_per_category = df_checked_out.groupby("category")["no_guests"].sum()
total_bookings_per_category = df_checked_out.groupby("category")["booking_id"].count()

avg_booking_size = (total_guests_per_category / total_bookings_per_category).round(2)

Insight:

Business    1.67
Luxury      2.27

Luxury hotels tend to host larger groups on average (2.27 guests per booking) compared to Business hotels (1.67 guests). This could reflect luxury hotels catering more to families or groups, while business hotels might see more solo travelers or smaller groups.

Cancellation Rate by Platform

df_cancelled_bookings = df_bookings_all[df_bookings_all["booking_status"] == "Cancelled"]

cancelled_bookings_per_platform = df_cancelled_bookings.groupby("booking_platform")["booking_id"].count()
total_bookings_per_platform = df_bookings_all.groupby("booking_platform")["booking_id"].count()

cancellation_rate = round((cancelled_bookings_per_platform / total_bookings_per_platform) * 100, 2)

Insight:

direct offline    24.19
direct online     24.65
journey           24.83
logtrip           23.97
makeyourtrip      24.85
others            25.06
tripster          25.32

Cancellation rates are fairly consistent across all booking platforms, hovering around 24-25%. "Tripster" shows the highest cancellation rate (25.32%), while "logtrip" has the lowest (23.97%). This suggests cancellations are a widespread issue regardless of platform, possibly driven by external factors like guest uncertainty.

Revenue by Room Category

df_room_class = pd.merge(
    df,
    df_bookings_all[['property_id', 'room_category', 'revenue_realized']],
    on=['property_id', 'room_category'],
    how='left'
)

def formatted_revenue(x):
    if x >= 1e7:
        return f"{x / 1e7:.2f} Cr"
    elif x >= 1e5:
        return f"{x / 1e5:.2f} Lakh"
    else:
        return f"{x:,}"

revenue_by_room = df_room_class.groupby("room_class")["revenue_realized"].sum().sort_values(ascending=False)

new_revenue_by_room = revenue_by_room.apply(formatted_revenue_by_room)
new_revenue_by_room

Insight:

Elite           2503.40 Cr
Premium         2071.05 Cr
Presidential    1682.64 Cr
Standard        1383.72 Cr

Elite, Premium, and Presidential rooms are the top three revenue-generating categories, contributing ₹2503.4 Cr, ₹2071.05 Cr, and ₹1682.64 Cr respectively. This clearly shows that high-end room types account for a significant share of overall hotel revenue.

Premium rooms’ performance by city

premium_rooms = df_room_class[df_room_class['room_class'].isin(['Presidential', 'Elite'])]

def formatted_revenue_by_city_room(x):
    if x >= 1e7:
        return f"{x / 1e7:.2f} Cr"
    elif x >= 1e5:
        return f"{x / 1e5:.2f} Lakh"
    else:
        return f"{x:,}"

revenue_by_city_room = premium_rooms.groupby(['city', 'room_class'])['revenue_realized'].sum()

new_revenue = revenue_by_city_room.apply(formatted_revenue_by_city_room)

Insight:

Mumbai     Elite          1034.67 Cr
           Presidential    675.74 Cr
Bangalore  Elite           617.56 Cr
           Presidential    329.30 Cr
Delhi      Elite           440.13 Cr
           Presidential    272.09 Cr
Hyderabad  Elite           411.04 Cr
           Presidential    405.51 Cr

Mumbai leads premium room revenue. While all cities perform well, Bangalore and Delhi have lower premium room revenues, highlighting growth opportunities.

Distribution of booking lead time

df_bookings_all['booking_date'] = pd.to_datetime(df_bookings_all['booking_date'])
df_bookings_all['lead_time_days'] = (df_bookings_all['check_in_date'] - df_bookings_all['booking_date']).dt.days

valid_bookings = df_bookings_all[
    (df_bookings_all['booking_status'] == 'Checked Out') & 
    (df_bookings_all['lead_time_days'] >= 0)
]
valid_bookings.head()

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.histplot(valid_bookings['lead_time_days'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of Booking Lead Time')
plt.xlabel('Lead Time (Days)')
plt.ylabel('Number of Bookings')
plt.grid(True)
plt.show()

Insight:

Most guests book within 30 days of arrival, with a large share booking just 0–10 days in advance. This shows that last-minute bookings are common at AtliQ Grands. Small spikes at 25, 50, and 75 days suggest some early planners—possibly due to events or holidays. This can help with timing promotions and forecasting demand.

No-shows and cancellations per month

df_no_show_bookings = df_bookings_all[df_bookings_all['booking_status'].isin(['Cancelled', 'No Show'])]
cancellations_no_shows_per_month = df_no_show_bookings.groupby(['mmm yy', 'booking_status'])['booking_id'].count()

Insight:

Jul 22  Cancelled         7558
        No Show           1562
Jun 22  Cancelled         7441
        No Show           1423
May 22  Cancelled         7976
        No Show           1682

Cancellations greatly outnumber no-shows by roughly 4-5 times every month, indicating a persistent trend of booking instability.

No-show/Cancellation rates by City

# Total bookings per city
total_bookings_city = df_bookings_all.groupby('city')['booking_id'].count()
total_bookings_city

# Count cancellations and no-shows per city
cancelled_noshow_city = df_no_show_bookings.groupby('city')['booking_id'].count()

# Calculate cancellation/no-show rate per city
cancellation_rate_city = round((cancelled_noshow_city / total_bookings_city) * 100,2).sort_values(ascending=False)

Insight:

Bangalore    30.04
Mumbai       29.99
Delhi        29.76
Hyderabad    29.60

Bangalore and Mumbai have the highest cancellation/no-show rates (~30%), highlighting cities where retention and commitment strategies might need strengthening.

Revenue per Booking Platform

revenue_per_platform = df_bookings_all.groupby('booking_platform')['revenue_realized'].sum()

# Custom label function
def format_pie_label(pct, all_vals):
    absolute = int(round(pct / 100. * sum(all_vals)))
    if absolute >= 1e7:
        value_str = f"{absolute / 1e7:.2f} Cr"
    elif absolute >= 1e5:
        value_str = f"{absolute / 1e5:.2f} Lakh"
    else:
        value_str = f"{absolute:,}"
    return f"{pct:.1f}%\\n({value_str})"

# Plot pie chart
import matplotlib.pyplot as plt

plt.figure(figsize=(7.5, 7.5))
plt.pie(
    revenue_per_platform,
    labels=revenue_per_platform.index,
    autopct=lambda pct: format_pie_label(pct, revenue_per_platform),
    startangle=140,
)

plt.title('Revenue Realized per Booking Platform')
plt.axis('equal')  # Keeps the pie chart circular
plt.show()

Insight:

The platform labeled “others” brings in the highest revenue, contributing nearly ₹48 Cr (40.9 %), suggesting either a highly diversified or unclassified set of booking sources. Among named platforms, MakeYourTrip is the top performer with ₹23.3 Cr in revenue, followed by Logtrip and Direct Online channels. This implies that online travel agencies (OTAs) and digital channels are playing a dominant role in revenue generation.

Tools Used

Python (Pandas, Matplotlib, Seaborn, Jupyter Notebook)

Conclusion

This project gave me hands-on experience dealing with messy, real-world data. From understanding schemas to cleaning invalid values and deriving actionable insights — I got to simulate a complete data analysis workflow.

Along the way, I practiced:

Data cleaning and preprocessing using Pandas
Feature engineering and KPI creation
Exploratory analysis with groupby, merge, and custom aggregations
Interpreting patterns and anomalies with business logic in mind

This was a solid foundation, and there’s still potential to go deeper: customer segmentation, forecasting, churn analysis, and even dashboarding.

But most importantly — this helped me develop an analyst’s mindset: always ask why, and always deliver with context.

Hotel Booking Data Analysis | AtliQ Grands

Overview

Business Problem

Datasets Used & Understanding the Schema

1. `dim_date.csv` – Date Dimension

2. `dim_hotels.csv` – Hotel Properties Dimension

3. `dim_rooms.csv` – Room Categories Dimension

4. `fact_bookings.csv` – Detailed Booking Records

5. `fact_aggregated_bookings.csv` – Daily Aggregated Bookings

Data Exploration

Data Cleaning

1. Remove Negative Guest Counts

2. Remove Extreme Revenue Outliers

3. Convert Date Columns

Data Transformation

1. Calculate Occupancy Percentage

Insights and Analysis

Average Occupancy Rate by Room Class

Average Occupancy Rate by City

Weekday vs Weekend Occupancy

June 2022 City-wise Occupancy

Revenue Realized by City

Monthly Revenue Trends

Revenue Realized by Hotel Type

Average Rating by City

Average Booking Size by Hotel Type

Cancellation Rate by Platform

Revenue by Room Category

Premium rooms’ performance by city

Distribution of booking lead time

No-shows and cancellations per month

No-show/Cancellation rates by City

Revenue per Booking Platform

Tools Used

Conclusion

Subscribe to my newsletter

Preeti Priyadarsini

Preeti Priyadarsini

Hotel Booking Data Analysis | AtliQ Grands

Overview

Business Problem

Datasets Used & Understanding the Schema

1. dim_date.csv – Date Dimension

2. dim_hotels.csv – Hotel Properties Dimension

3. dim_rooms.csv – Room Categories Dimension

4. fact_bookings.csv – Detailed Booking Records

5. fact_aggregated_bookings.csv – Daily Aggregated Bookings

Data Exploration

Data Cleaning

1. Remove Negative Guest Counts

2. Remove Extreme Revenue Outliers

3. Convert Date Columns

Data Transformation

1. Calculate Occupancy Percentage

Insights and Analysis

Average Occupancy Rate by Room Class

Average Occupancy Rate by City

Weekday vs Weekend Occupancy

June 2022 City-wise Occupancy

Revenue Realized by City

Monthly Revenue Trends

Revenue Realized by Hotel Type

Average Rating by City

Average Booking Size by Hotel Type

Cancellation Rate by Platform

Revenue by Room Category

Premium rooms’ performance by city

Distribution of booking lead time

No-shows and cancellations per month

No-show/Cancellation rates by City

Revenue per Booking Platform

Tools Used

Conclusion

Subscribe to my newsletter

Preeti Priyadarsini

Preeti Priyadarsini

1. `dim_date.csv` – Date Dimension

2. `dim_hotels.csv` – Hotel Properties Dimension

3. `dim_rooms.csv` – Room Categories Dimension

4. `fact_bookings.csv` – Detailed Booking Records

5. `fact_aggregated_bookings.csv` – Daily Aggregated Bookings