🔍 Overview

In this project, I analyze short-term rental investment opportunities in Madrid using Python, geospatial analytics, and Power BI.

By combining Airbnb listings, property pricing data from Idealista, and official geospatial boundaries, the analysis identifies high-yield neighborhoods, occupancy trends, and investment hotspots—all with a data-first mindset.

📂 GitHub repository

🎯 Objectives

Analyze available public data sources
Find insights that help understand the characteristics of the market in this city
Guide the investment team’s search efforts
Especially regarding the main focus areas: rental prices, occupancy levels, and purchase prices
Estimate and compare rental yield by district
Build a Power BI dashboard to explore key investment drivers
Identify undervalued districts with high potential returns

📌 Business Assumptions & KPIs

The profitability of short-term rentals relies primarily on three levers:

Rental price – the higher the nightly rate, the higher the income
Occupancy – the more days the property is rented, the better
Purchase price – the cheaper the acquisition, the higher the ROI

The KPIs we use in this analysis are:

Occupancy rate → computed as the % of the year the property is not available (proxy for booked days)
Nightly rental price → directly from Airbnb listings
Estimated property value → based on average €/m² from Idealista, adjusted with a 25% discount for negotiation leverage

Each KPI is engineered from raw data or external sources using Python.

🧠 Tools & Techniques

Python: pandas, numpy, matplotlib, seaborn, sqlalchemy, folium
SQLite database construction from raw CSVs
Geospatial analysis using latitude/longitude coordinates and distance calculations
Power BI for final business-friendly dashboard

📊 Executive Summary & Visual Highlights

The interactive dashboard summarizes over 10,000 listings and uncovers pricing, occupancy, and profitability patterns across Madrid’s 21 districts.

🧾 Download the full Power BI dashboard PDF

💡 Top Districts by Rental Yield (%)

Based on the dashboard summary:

Villaverde → 17.92%
Puente de Vallecas → 17.18%
San Blas – Canillejas → 16.36%

Top Districts by Yield

✅ These districts offer the highest rental profitability, combining low acquisition costs with solid demand signals.

📍 Location Impact — Does Centrality Guarantee Performance?

Proximity to Puerta del Sol does not guarantee higher occupancy or pricing.
Even well-located units underperform if poorly optimized.

Rental Prices vs Distance to Sol

Occupancy vs Distance to Sol

🧠 Location matters — but only when combined with quality, affordability, and guest fit.

💶 Occupancy vs Purchase Price by District

Lower-cost districts show better rental efficiency.

Occupancy vs Purchase Price by District

San Blas – Canillejas stands out due to event-driven demand (IFEMA, Wanda Stadium)

🧭 Property Type Insights — What Drives Occupancy?

💼 Guest Capacity and Price Dynamics

Listings optimized for 4 guests yield the highest total rental prices relative to purchase cost.

Rental vs Purchase Price by Guest Capacity

📈 Larger groups don’t always yield better ROI — the 4-guest configuration hits the sweet spot.

📌 Bedrooms & Beds & Accommodates & Room Type

1-bedroom apartments reach the highest occupancy rates, confirming demand from solo travelers and couples
Listings with 1–2 beds perform best. Adding more beds often reduces occupancy, likely due to inefficiencies in layout or pricing flexibility
Occupancy peaks at 1–3 guests accommodated
Private rooms achieve the highest utilization overall

Occupancy Drivers

🧭 Strategic Recommendation

Madrid presents a spectrum of opportunities:

High-end positioning in Salamanca (blue dot)
Mid-tier stability in Moncloa – Aravaca (orange dot)
High-potential emerging zone in San Blas – Canillejas (green dot)

→ Leverage rental volatility for short-term gains in event-active areas

Rental Price vs Purchase Price by District

🌍 Geospatial Analysis

Although a GeoJSON file was available, mapping was performed using listing-level coordinates (latitude and longitude). Distance from Puerta del Sol was used as a centrality proxy (see formula in the next section).

Listings Across Madrid by Yield Size

🧾 Methodology

🔗 1. Data Sources

Inside Airbnb - Madrid
Idealista - Housing Market Reports
Official GeoJSON for Madrid districts

🛠️ 2. Data Engineering & Datamart Creation

We created a local SQLite database (airbnb.db) from CSV files containing listings, availability, reviews, and price data.

We then:

Performed individual quality checks:
- Variable types
- Null handling via custom imputation functions
- Duplicate detection
- Categorical and numerical profiling
Built the analytical datamart by joining cleaned tables
Prepared analysis variables:
- price (validated per room type)
- occupancy derived from availability_365
- purchase_price estimated from Idealista’s €/m²

📈 3. Feature Engineering Examples

🛏️ Adjusted Total Price

def create_total_price(record):
    if (record.beds > 1) & ((record.room_type == 'Private room') | (record.room_type == 'Shared room')):
        return record.price * record.beds * 0.8
    return record.price

df['total_price'] = df.apply(create_total_price, axis=1)

📆 Occupancy Transformation

df['occupancy'] = ((365 - df.availability_365) / 365 * 100).astype('int')

🏢 Estimating Property Size via Bedrooms

conditions = [
    df.bedrooms == 1,
    df.bedrooms == 2,
    df.bedrooms == 3,
    df.bedrooms == 4,
    df.bedrooms > 4
]

results = [50,70,90,120,150]

df['m2'] = np.select(conditions, results, default = -999)

💰 Purchase Price Estimate

df['purchase_price'] = df.m2 * df.price_m2 * 0.75

📍 Distance to Puerta del Sol (Haversine)

from math import radians, cos, sin, asin, sqrt

def haversine(lat1, lon1, lat2, lon2):

    R = 6372.8 # In kilometers — if you're using miles, replace it with 3959.87433

    dLat = radians(lat2 - lat1)
    dLon = radians(lon2 - lon1)
    lat1 = radians(lat1)
    lat2 = radians(lat2)

    a = sin(dLat/2)**2 + cos(lat1)*cos(lat2)*sin(dLon/2)**2
    c = 2*asin(sqrt(a))

    return R * c

# The coordinates of Puerta del Sol will be lat1 and lon1
lat1 = 40.4167278
lon1 = -3.7033387

df['poi_sol'] = df.apply(lambda record: haversine(lat1,lon1,record.latitude,record.longitude),axis = 1)

🧭 4. Data Enrichment & Discretization

To better capture relationships, we discretized numerical features:

accommodates, beds, bedrooms, number_of_reviews

This allowed more robust exploratory visualizations and performance groupings.

📤 5. Output Dataset for Power BI

After data cleaning and feature engineering, we executed the notebook RE_Analysis_Madrid_Executive_Summary.ipynb, which:

Selects a subset of cleaned and enriched features from the local database
Generates the final CSV file: RE_Analysis_Madrid_dataset.csv

This dataset serves as the input for the Power BI dashboard and includes:

Cleaned rental-related variables
Geospatial features like poi_sol (distance to Sol)
All derived KPIs for investment analysis

📈 Business Dashboard

The dashboard was designed in Power BI to communicate key insights to business users in a simple, interactive format.

📊 Main Pages:

📈 Investment Outlook
🏠 Market and Property Drivers
📍 Location Impact (Distance to Puerta del Sol)
🛏️ Occupancy Drivers by Property and Price
💡 District-level Investment Outlook

📌 Example dashboard preview:

Dashboard Preview

🧾 Download the full PDF version

🔍 Insights & Recommendations

The most profitable opportunities arise in peripheral districts where acquisition costs remain low but demand remains solid — such as Villaverde, Puente de Vallecas, and San Blas – Canillejas
Simplicity outperforms luxury — listings with 1–2 beds and basic layouts achieve higher occupancy rates than oversized or overdesigned properties
Room type is a key differentiator: private rooms and 1-bedroom units consistently deliver higher utilization, especially among solo travelers and couples
Proximity to city center is not a guarantee of performance — distance metrics like proximity to Puerta del Sol must be paired with KPIs such as yield and occupancy to detect arbitrage zones
Geospatial analysis and pricing segmentation reveal patterns invisible to spreadsheets — enabling smarter, location-informed investment strategies

🧠 What I Learned

How to build a full data-to-insight pipeline for real estate — from raw public data to BI-ready dashboards
How to combine Python analytics and Power BI storytelling to deliver actionable insights for investment decisions
How to engineer domain-specific KPIs (e.g., yield, occupancy) and apply them in geospatial, financial, and behavioral contexts
How to translate complex datasets into clear narratives that bridge the gap between data science and business needs

💡 This project helped me sharpen my ability to deliver business-ready insights from raw, messy data — a skill at the core of real-world data science.

Madrid Real Estate Investment Analysis

Table of contents