Hotel Booking Data Analysis | AtliQ Grands


Overview
This is a personal project I worked on to explore hotel booking data for AtliQ Grands, a fictional hotel chain that operates in four Indian cities: Delhi, Mumbai, Bangalore, and Hyderabad. The business offers four different hotel brands — AtliQ Seasons, Exotica, Bay, and Palace, catering to luxury and business travelers.
The goal of this project wasn't to build dashboards or report KPIs — it was to put myself in the shoes of a data analyst: to get messy with raw data, understand the business situation, clean and transform datasets, and dig for meaningful insights that can guide action.
I wanted to simulate a real-world scenario and sharpen my data exploration and storytelling skills in the process.
Business Problem
Despite being in the hotel industry for two decades, AtliQ Grands has been facing a decline in market share and revenue. Competitors are pulling ahead, possibly due to smarter decisions, better use of data, or more strategic marketing. The company believes that lack of data-driven decision-making is part of the problem.
So, the idea was to deep dive into the booking data and see what could be discovered: occupancy rates, seasonal trends, city-wise performance, room category preferences, and revenue flows.
Datasets Used & Understanding the Schema
The project uses five structured CSV files following a simplified star schema — with dimension and fact tables. Here’s a quick overview of each dataset and its role:
1. dim_date.csv
– Date Dimension
Gives various date-based perspectives (daily, monthly, weekly, weekday/weekend). Useful for trend analysis and aggregations over time.
date
: Full calendar date.mmm yy
: Month and year in short format (e.g., Jan 22).week no
: Week number.day_type
: Weekday or Weekend.
2. dim_hotels.csv
– Hotel Properties Dimension
Contains hotel metadata like type and city. Used for slicing the data by city or business category.
property_id
: Unique hotel ID.property_name
: Hotel name.category
: Business/Luxury.city
: Location.
3. dim_rooms.csv
– Room Categories Dimension
Describes room types. Helps compare metrics like occupancy or revenue by room class.
room_id
: Room type ID.room_class
: Name of the room class (Standard, Premium, etc).
4. fact_bookings.csv
– Detailed Booking Records
The most granular dataset with booking-level data: guest count, revenue, platform, etc.
Booking dates (
booking_date
,check_in_date
,checkout_date
)no_guests
,booking_platform
,ratings_given
booking_status
,revenue_generated
,revenue_realized
5. fact_aggregated_bookings.csv
– Daily Aggregated Bookings
High-level metrics of bookings and room availability per day per room class.
check_in_date
,room_category
,successful_bookings
,capacity
Data Exploration
Before cleaning or transforming, I loaded the datasets using pandas
and began exploring shapes, data types, missing values, and distributions.
import pandas as pd
df_bookings = pd.read_csv("fact_bookings.csv")
df_agg_bookings = pd.read_csv("fact_aggregated_bookings.csv")
df_hotels = pd.read_csv("dim_hotels.csv")
df_rooms = pd.read_csv("dim_rooms.csv")
df_date = pd.read_csv("dim_date.csv")
Key observations from df_bookings
:
df_bookings.shape
df_bookings.info()
df_bookings.describe()
df_bookings["booking_platform"].value_counts()
df_bookings["room_category"].unique()
df_bookings["ratings_given"].isna().sum()
Over 100,000 rows.
∼77,000 missing values in
ratings_given
(expected, since not all guests rate).Some bookings had negative
no_guests
— definitely incorrect.Revenue values seemed unusually high for some entries.
Data Cleaning
1. Remove Negative Guest Counts
df_bookings = df_bookings[df_bookings["no_guests"] > 0]
Why? Negative guest numbers aren't realistic and likely reflect data entry issues. These would distort metrics like average guests per booking.
2. Remove Extreme Revenue Outliers
avg = df_bookings["revenue_generated"].mean()
std = df_bookings["revenue_generated"].std()
high_limit = avg + 3*std
df_bookings = df_bookings[df_bookings["revenue_generated"] <= high_limit]
Why? To make revenue insights more trustworthy. Some entries were in crores, far beyond average values. This method filters those out assuming a normal distribution.
3. Convert Date Columns
df_date["date"] = pd.to_datetime(df_date["date"])
df_bookings["check_in_date"] = pd.to_datetime(df_bookings["check_in_date"])
Why? Enables date filtering, grouping by month/week, and joining with dim_date
.
Data Transformation
1. Calculate Occupancy Percentage
df_agg_bookings["occ_pct"] = df_agg_bookings["successful_bookings"] / df_agg_bookings["capacity"]
df_agg_bookings["occ_pct"] = df_agg_bookings["occ_pct"].apply(lambda x: round(x*100, 2))
Why? A core performance metric for hotels — tells us how well the available inventory is being used.
Insights and Analysis
Each of these analyses builds on the cleaned and merged datasets to answer specific business questions.
Average Occupancy Rate by Room Class
df = pd.merge(df_agg_bookings, df_rooms, left_on="room_category", right_on="room_id")
df.drop("room_id", axis=1, inplace=True)
df.groupby("room_class")["occ_pct"].mean().round(2)
Insight:
Presidential 59.30
Standard 58.22
Elite 58.04
Premium 58.03
All classes perform similarly. Presidential rooms do slightly better — perhaps due to exclusivity or fewer rooms available.
Average Occupancy Rate by City
df = pd.merge(df, df_hotels, on="property_id")
df.groupby("city")["occ_pct"].mean().round(2)
Insight:
Delhi 61.61
Hyderabad 58.14
Mumbai 57.94
Bangalore 56.59
Delhi has the highest occupancy. Indicates higher demand or better operations there.
Weekday vs Weekend Occupancy
df = pd.merge(df, df_date, left_on="check_in_date", right_on="date")
df.groupby("day_type")["occ_pct"].mean().round(2)
Insight:
weekday 50.90
weekend 72.39
Clear spike in demand during weekends. Hotels can optimize pricing and marketing accordingly.
June 2022 City-wise Occupancy
df_june = df[df["mmm yy"] == "Jun 22"]
df_june.groupby("city")["occ_pct"].mean().round(2)
Insight:
Delhi 62.47
Hyderabad 58.46
Mumbai 58.38
Bangalore 56.58
Delhi continues to lead, consistent with overall trends.
Revenue Realized by City
df_bookings_all = pd.merge(df_bookings, df_hotels, on="property_id")
df_bookings_all.groupby("city")["revenue_realized"].sum()
Insight:
Mumbai 668,569,251
Bangalore 420,383,550
Hyderabad 325,179,310
Delhi 294,404,488
Interestingly, Delhi — despite having high occupancy — has the lowest revenue. Possible reason: lower prices or fewer premium rooms.
Monthly Revenue Trends
df_date["date"] = pd.to_datetime(df_date["date"])
df_bookings_all["check_in_date"] = pd.to_datetime(df_bookings_all["check_in_date"])
df_bookings_all = pd.merge(df_bookings_all, df_date, left_on="check_in_date", right_on="date")
df_bookings_all.groupby("mmm yy")["revenue_realized"].sum()
Insight:
May 22 408,375,641
Jun 22 377,191,229
Jul 22 389,940,912
May 2022 was the best month in terms of revenue. This might relate to vacation season or events.
Revenue Realized by Hotel Type
revenue_per_hotel = df_bookings_all.groupby("category")["revenue_realized"].sum()
revenue_per_hotel
Insight:
Business 451,950,715
Luxury 723,557,067
Luxury hotels generated significantly more revenue (over ₹723 Cr) compared to Business hotels (₹452 Cr). This aligns with expectations as luxury properties typically have higher room rates and premium services driving revenue. The hotel category is a strong revenue driver.
Average Rating by City
avg_rating_per_city = df_bookings_all.groupby("city")["ratings_given"].mean().round(2).sort_values(ascending=False)
avg_rating_per_city
Insight:
Delhi 3.78
Hyderabad 3.66
Mumbai 3.64
Bangalore 3.40
Delhi hotels have the highest average customer ratings (3.78), indicating higher guest satisfaction or service quality, followed by Hyderabad and Mumbai. Bangalore trails with a lower rating (3.40), suggesting potential room for service improvements or guest experience enhancements there.
Average Booking Size by Hotel Type
df_checked_out = df_bookings_all[df_bookings_all["booking_status"] == "Checked Out"]
total_guests_per_category = df_checked_out.groupby("category")["no_guests"].sum()
total_bookings_per_category = df_checked_out.groupby("category")["booking_id"].count()
avg_booking_size = (total_guests_per_category / total_bookings_per_category).round(2)
Insight:
Business 1.67
Luxury 2.27
Luxury hotels tend to host larger groups on average (2.27 guests per booking) compared to Business hotels (1.67 guests). This could reflect luxury hotels catering more to families or groups, while business hotels might see more solo travelers or smaller groups.
Cancellation Rate by Platform
df_cancelled_bookings = df_bookings_all[df_bookings_all["booking_status"] == "Cancelled"]
cancelled_bookings_per_platform = df_cancelled_bookings.groupby("booking_platform")["booking_id"].count()
total_bookings_per_platform = df_bookings_all.groupby("booking_platform")["booking_id"].count()
cancellation_rate = round((cancelled_bookings_per_platform / total_bookings_per_platform) * 100, 2)
Insight:
direct offline 24.19
direct online 24.65
journey 24.83
logtrip 23.97
makeyourtrip 24.85
others 25.06
tripster 25.32
Cancellation rates are fairly consistent across all booking platforms, hovering around 24-25%. "Tripster" shows the highest cancellation rate (25.32%), while "logtrip" has the lowest (23.97%). This suggests cancellations are a widespread issue regardless of platform, possibly driven by external factors like guest uncertainty.
Revenue by Room Category
df_room_class = pd.merge(
df,
df_bookings_all[['property_id', 'room_category', 'revenue_realized']],
on=['property_id', 'room_category'],
how='left'
)
def formatted_revenue(x):
if x >= 1e7:
return f"{x / 1e7:.2f} Cr"
elif x >= 1e5:
return f"{x / 1e5:.2f} Lakh"
else:
return f"{x:,}"
revenue_by_room = df_room_class.groupby("room_class")["revenue_realized"].sum().sort_values(ascending=False)
new_revenue_by_room = revenue_by_room.apply(formatted_revenue_by_room)
new_revenue_by_room
Insight:
Elite 2503.40 Cr
Premium 2071.05 Cr
Presidential 1682.64 Cr
Standard 1383.72 Cr
Elite, Premium, and Presidential rooms are the top three revenue-generating categories, contributing ₹2503.4 Cr, ₹2071.05 Cr, and ₹1682.64 Cr respectively. This clearly shows that high-end room types account for a significant share of overall hotel revenue.
Premium rooms’ performance by city
premium_rooms = df_room_class[df_room_class['room_class'].isin(['Presidential', 'Elite'])]
def formatted_revenue_by_city_room(x):
if x >= 1e7:
return f"{x / 1e7:.2f} Cr"
elif x >= 1e5:
return f"{x / 1e5:.2f} Lakh"
else:
return f"{x:,}"
revenue_by_city_room = premium_rooms.groupby(['city', 'room_class'])['revenue_realized'].sum()
new_revenue = revenue_by_city_room.apply(formatted_revenue_by_city_room)
Insight:
Mumbai Elite 1034.67 Cr
Presidential 675.74 Cr
Bangalore Elite 617.56 Cr
Presidential 329.30 Cr
Delhi Elite 440.13 Cr
Presidential 272.09 Cr
Hyderabad Elite 411.04 Cr
Presidential 405.51 Cr
Mumbai leads premium room revenue. While all cities perform well, Bangalore and Delhi have lower premium room revenues, highlighting growth opportunities.
Distribution of booking lead time
df_bookings_all['booking_date'] = pd.to_datetime(df_bookings_all['booking_date'])
df_bookings_all['lead_time_days'] = (df_bookings_all['check_in_date'] - df_bookings_all['booking_date']).dt.days
valid_bookings = df_bookings_all[
(df_bookings_all['booking_status'] == 'Checked Out') &
(df_bookings_all['lead_time_days'] >= 0)
]
valid_bookings.head()
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 6))
sns.histplot(valid_bookings['lead_time_days'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of Booking Lead Time')
plt.xlabel('Lead Time (Days)')
plt.ylabel('Number of Bookings')
plt.grid(True)
plt.show()
Insight:
Most guests book within 30 days of arrival, with a large share booking just 0–10 days in advance. This shows that last-minute bookings are common at AtliQ Grands. Small spikes at 25, 50, and 75 days suggest some early planners—possibly due to events or holidays. This can help with timing promotions and forecasting demand.
No-shows and cancellations per month
df_no_show_bookings = df_bookings_all[df_bookings_all['booking_status'].isin(['Cancelled', 'No Show'])]
cancellations_no_shows_per_month = df_no_show_bookings.groupby(['mmm yy', 'booking_status'])['booking_id'].count()
Insight:
Jul 22 Cancelled 7558
No Show 1562
Jun 22 Cancelled 7441
No Show 1423
May 22 Cancelled 7976
No Show 1682
Cancellations greatly outnumber no-shows by roughly 4-5 times every month, indicating a persistent trend of booking instability.
No-show/Cancellation rates by City
# Total bookings per city
total_bookings_city = df_bookings_all.groupby('city')['booking_id'].count()
total_bookings_city
# Count cancellations and no-shows per city
cancelled_noshow_city = df_no_show_bookings.groupby('city')['booking_id'].count()
# Calculate cancellation/no-show rate per city
cancellation_rate_city = round((cancelled_noshow_city / total_bookings_city) * 100,2).sort_values(ascending=False)
Insight:
Bangalore 30.04
Mumbai 29.99
Delhi 29.76
Hyderabad 29.60
Bangalore and Mumbai have the highest cancellation/no-show rates (~30%), highlighting cities where retention and commitment strategies might need strengthening.
Revenue per Booking Platform
revenue_per_platform = df_bookings_all.groupby('booking_platform')['revenue_realized'].sum()
# Custom label function
def format_pie_label(pct, all_vals):
absolute = int(round(pct / 100. * sum(all_vals)))
if absolute >= 1e7:
value_str = f"{absolute / 1e7:.2f} Cr"
elif absolute >= 1e5:
value_str = f"{absolute / 1e5:.2f} Lakh"
else:
value_str = f"{absolute:,}"
return f"{pct:.1f}%\\n({value_str})"
# Plot pie chart
import matplotlib.pyplot as plt
plt.figure(figsize=(7.5, 7.5))
plt.pie(
revenue_per_platform,
labels=revenue_per_platform.index,
autopct=lambda pct: format_pie_label(pct, revenue_per_platform),
startangle=140,
)
plt.title('Revenue Realized per Booking Platform')
plt.axis('equal') # Keeps the pie chart circular
plt.show()
Insight:
The platform labeled “others” brings in the highest revenue, contributing nearly ₹48 Cr (40.9 %), suggesting either a highly diversified or unclassified set of booking sources. Among named platforms, MakeYourTrip is the top performer with ₹23.3 Cr in revenue, followed by Logtrip and Direct Online channels. This implies that online travel agencies (OTAs) and digital channels are playing a dominant role in revenue generation.
Tools Used
Python (Pandas, Matplotlib, Seaborn, Jupyter Notebook)
Conclusion
This project gave me hands-on experience dealing with messy, real-world data. From understanding schemas to cleaning invalid values and deriving actionable insights — I got to simulate a complete data analysis workflow.
Along the way, I practiced:
Data cleaning and preprocessing using Pandas
Feature engineering and KPI creation
Exploratory analysis with groupby, merge, and custom aggregations
Interpreting patterns and anomalies with business logic in mind
This was a solid foundation, and there’s still potential to go deeper: customer segmentation, forecasting, churn analysis, and even dashboarding.
But most importantly — this helped me develop an analyst’s mindset: always ask why, and always deliver with context.
Subscribe to my newsletter
Read articles from Preeti Priyadarsini directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by

Preeti Priyadarsini
Preeti Priyadarsini
I, a data analytics enthusiast, am here to share my insights and learnings in the exciting field of data analytics. When I'm not working with data, I enjoy traveling solo and reading novels.