Madrid Real Estate Investment Analysis

πŸ” Overview

In this project, I analyze short-term rental investment opportunities in Madrid using Python, geospatial analytics, and Power BI.

By combining Airbnb listings, property pricing data from Idealista, and official geospatial boundaries, the analysis identifies high-yield neighborhoods, occupancy trends, and investment hotspotsβ€”all with a data-first mindset.

πŸ“‚ GitHub repository


🎯 Objectives

  • Analyze available public data sources

  • Find insights that help understand the characteristics of the market in this city

  • Guide the investment team’s search efforts

  • Especially regarding the main focus areas: rental prices, occupancy levels, and purchase prices

  • Estimate and compare rental yield by district

  • Build a Power BI dashboard to explore key investment drivers

  • Identify undervalued districts with high potential returns


πŸ“Œ Business Assumptions & KPIs

The profitability of short-term rentals relies primarily on three levers:

  • Rental price – the higher the nightly rate, the higher the income

  • Occupancy – the more days the property is rented, the better

  • Purchase price – the cheaper the acquisition, the higher the ROI

The KPIs we use in this analysis are:

  • Occupancy rate β†’ computed as the % of the year the property is not available (proxy for booked days)

  • Nightly rental price β†’ directly from Airbnb listings

  • Estimated property value β†’ based on average €/mΒ² from Idealista, adjusted with a 25% discount for negotiation leverage

Each KPI is engineered from raw data or external sources using Python.


🧠 Tools & Techniques

  • Python: pandas, numpy, matplotlib, seaborn, sqlalchemy, folium

  • SQLite database construction from raw CSVs

  • Geospatial analysis using latitude/longitude coordinates and distance calculations

  • Power BI for final business-friendly dashboard


πŸ“Š Executive Summary & Visual Highlights

The interactive dashboard summarizes over 10,000 listings and uncovers pricing, occupancy, and profitability patterns across Madrid’s 21 districts.

🧾 Download the full Power BI dashboard PDF

πŸ’‘ Top Districts by Rental Yield (%)

Based on the dashboard summary:

  • Villaverde β†’ 17.92%

  • Puente de Vallecas β†’ 17.18%

  • San Blas – Canillejas β†’ 16.36%

Top Districts by Yield

βœ… These districts offer the highest rental profitability, combining low acquisition costs with solid demand signals.

πŸ“ Location Impact β€” Does Centrality Guarantee Performance?

Proximity to Puerta del Sol does not guarantee higher occupancy or pricing.
Even well-located units underperform if poorly optimized.

Rental Prices vs Distance to Sol

Occupancy vs Distance to Sol

🧠 Location matters β€” but only when combined with quality, affordability, and guest fit.

πŸ’Ά Occupancy vs Purchase Price by District

Lower-cost districts show better rental efficiency.

Occupancy vs Purchase Price by District

  • San Blas – Canillejas stands out due to event-driven demand (IFEMA, Wanda Stadium)

🧭 Property Type Insights β€” What Drives Occupancy?

πŸ’Ό Guest Capacity and Price Dynamics

Listings optimized for 4 guests yield the highest total rental prices relative to purchase cost.

Rental vs Purchase Price by Guest Capacity

πŸ“ˆ Larger groups don’t always yield better ROI β€” the 4-guest configuration hits the sweet spot.

πŸ“Œ Bedrooms & Beds & Accommodates & Room Type

  • 1-bedroom apartments reach the highest occupancy rates, confirming demand from solo travelers and couples

  • Listings with 1–2 beds perform best. Adding more beds often reduces occupancy, likely due to inefficiencies in layout or pricing flexibility

  • Occupancy peaks at 1–3 guests accommodated

  • Private rooms achieve the highest utilization overall

Occupancy Drivers

🧭 Strategic Recommendation

Madrid presents a spectrum of opportunities:

  • High-end positioning in Salamanca (blue dot)

  • Mid-tier stability in Moncloa – Aravaca (orange dot)

  • High-potential emerging zone in San Blas – Canillejas (green dot)

β†’ Leverage rental volatility for short-term gains in event-active areas

Rental Price vs Purchase Price by District


🌍 Geospatial Analysis

Although a GeoJSON file was available, mapping was performed using listing-level coordinates (latitude and longitude). Distance from Puerta del Sol was used as a centrality proxy (see formula in the next section).

Listings Across Madrid by Yield Size


🧾 Methodology

πŸ”— 1. Data Sources

πŸ› οΈ 2. Data Engineering & Datamart Creation

We created a local SQLite database (airbnb.db) from CSV files containing listings, availability, reviews, and price data.

We then:

  1. Performed individual quality checks:

    • Variable types

    • Null handling via custom imputation functions

    • Duplicate detection

    • Categorical and numerical profiling

  2. Built the analytical datamart by joining cleaned tables

  3. Prepared analysis variables:

    • price (validated per room type)

    • occupancy derived from availability_365

    • purchase_price estimated from Idealista’s €/mΒ²

πŸ“ˆ 3. Feature Engineering Examples

πŸ›οΈ Adjusted Total Price

def create_total_price(record):
    if (record.beds > 1) & ((record.room_type == 'Private room') | (record.room_type == 'Shared room')):
        return record.price * record.beds * 0.8
    return record.price

df['total_price'] = df.apply(create_total_price, axis=1)

πŸ“† Occupancy Transformation

df['occupancy'] = ((365 - df.availability_365) / 365 * 100).astype('int')

🏒 Estimating Property Size via Bedrooms

conditions = [
    df.bedrooms == 1,
    df.bedrooms == 2,
    df.bedrooms == 3,
    df.bedrooms == 4,
    df.bedrooms > 4
]

results = [50,70,90,120,150]

df['m2'] = np.select(conditions, results, default = -999)

πŸ’° Purchase Price Estimate

df['purchase_price'] = df.m2 * df.price_m2 * 0.75

πŸ“ Distance to Puerta del Sol (Haversine)

from math import radians, cos, sin, asin, sqrt

def haversine(lat1, lon1, lat2, lon2):

    R = 6372.8 # In kilometers β€” if you're using miles, replace it with 3959.87433

    dLat = radians(lat2 - lat1)
    dLon = radians(lon2 - lon1)
    lat1 = radians(lat1)
    lat2 = radians(lat2)

    a = sin(dLat/2)**2 + cos(lat1)*cos(lat2)*sin(dLon/2)**2
    c = 2*asin(sqrt(a))

    return R * c

# The coordinates of Puerta del Sol will be lat1 and lon1
lat1 = 40.4167278
lon1 = -3.7033387

df['poi_sol'] = df.apply(lambda record: haversine(lat1,lon1,record.latitude,record.longitude),axis = 1)

🧭 4. Data Enrichment & Discretization

To better capture relationships, we discretized numerical features:

  • accommodates, beds, bedrooms, number_of_reviews

This allowed more robust exploratory visualizations and performance groupings.

πŸ“€ 5. Output Dataset for Power BI

After data cleaning and feature engineering, we executed the notebook RE_Analysis_Madrid_Executive_Summary.ipynb, which:

  • Selects a subset of cleaned and enriched features from the local database

  • Generates the final CSV file: RE_Analysis_Madrid_dataset.csv

This dataset serves as the input for the Power BI dashboard and includes:

  • Cleaned rental-related variables

  • Geospatial features like poi_sol (distance to Sol)

  • All derived KPIs for investment analysis


πŸ“ˆ Business Dashboard

The dashboard was designed in Power BI to communicate key insights to business users in a simple, interactive format.

πŸ“Š Main Pages:

  • πŸ“ˆ Investment Outlook

  • 🏠 Market and Property Drivers

  • πŸ“ Location Impact (Distance to Puerta del Sol)

  • πŸ›οΈ Occupancy Drivers by Property and Price

  • πŸ’‘ District-level Investment Outlook

πŸ“Œ Example dashboard preview:

Dashboard Preview

🧾 Download the full PDF version


πŸ” Insights & Recommendations

  • The most profitable opportunities arise in peripheral districts where acquisition costs remain low but demand remains solid β€” such as Villaverde, Puente de Vallecas, and San Blas – Canillejas

  • Simplicity outperforms luxury β€” listings with 1–2 beds and basic layouts achieve higher occupancy rates than oversized or overdesigned properties

  • Room type is a key differentiator: private rooms and 1-bedroom units consistently deliver higher utilization, especially among solo travelers and couples

  • Proximity to city center is not a guarantee of performance β€” distance metrics like proximity to Puerta del Sol must be paired with KPIs such as yield and occupancy to detect arbitrage zones

  • Geospatial analysis and pricing segmentation reveal patterns invisible to spreadsheets β€” enabling smarter, location-informed investment strategies


🧠 What I Learned

  • How to build a full data-to-insight pipeline for real estate β€” from raw public data to BI-ready dashboards

  • How to combine Python analytics and Power BI storytelling to deliver actionable insights for investment decisions

  • How to engineer domain-specific KPIs (e.g., yield, occupancy) and apply them in geospatial, financial, and behavioral contexts

  • How to translate complex datasets into clear narratives that bridge the gap between data science and business needs

πŸ’‘ This project helped me sharpen my ability to deliver business-ready insights from raw, messy data β€” a skill at the core of real-world data science.


🌐 Explore the project

10
Subscribe to my newsletter

Read articles from Tayfur Akkaya Clavijo directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Tayfur Akkaya Clavijo
Tayfur Akkaya Clavijo

I'm an AI, data and finance enthusiast. Passionate about lifelong learning and cross-sector innovation, I’m always looking to collaborate on impactful projects at the intersection of data, strategy, and technology.