How I uncovered insights from messy real estate data using Python and discovered what really drives housing prices

Introduction: More Than Just Numbers

When I started the "Housing in Mexico" project through WorldQuant University's Applied Data Science Lab, I expected to work with clean datasets and follow straightforward tutorials. What I got instead was something far more valuable: a real-world crash course in handling messy data and extracting meaningful insights from it.

This wasn't about memorizing syntax or copying code examples. It was about getting my hands dirty with actual real estate data from Mexico and learning to think like a data scientist. By the end, I found myself asking better questions and seeing patterns I never would have noticed before.

The Challenge: Working with Real-World Data

Real estate data is notoriously messy, and this project didn't shy away from that reality. I worked with three separate CSV files, each with its own quirks and challenges:

Dataset 1: Price data with currency symbols and formatting issues
Dataset 2: Mexican peso prices that needed conversion to USD
Dataset 3: Location data stored in combined latitude-longitude strings

Data Cleaning: The Foundation of Good Analysis

The first lesson hit me immediately: you can't analyze what you can't read. Each dataset required different cleaning approaches.

import pandas as pd

# Loading the three datasets
df1 = pd.read_csv("mexico-real-estate-1.csv")
df2 = pd.read_csv("mexico-real-estate-2.csv") 
df3 = pd.read_csv("mexico-real-estate-3.csv")

print("df1 shape:", df1.shape)
print("df2 shape:", df2.shape)
print("df3 shape:", df3.shape)

Dataset 1 came with prices formatted as strings with dollar signs and commas. I had to strip these characters and convert to numeric values:

# Clean price data by removing symbols and converting to float
df1["price_usd"] = df1["price_usd"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)
df1.dropna(inplace=True)
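
A quick way to check this step is to try it on a handful of made-up values before running it on the full file. The sketch below uses invented sample prices (not rows from the project data) and swaps in `pd.to_numeric(errors="coerce")`, which is an assumption on my part about a more forgiving alternative to `astype(float)`: entries that can't be parsed become NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical sample values, invented for illustration (not from the project files)
sample = pd.DataFrame({"price_usd": ["$67,965.56", "$120,000.00", "not listed", None]})

# Strip the dollar sign and thousands separators as literal characters
cleaned = (
    sample["price_usd"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

# errors="coerce" turns anything unparseable into NaN instead of raising
sample["price_usd"] = pd.to_numeric(cleaned, errors="coerce")

# Drop only the rows whose price could not be parsed
sample = sample.dropna(subset=["price_usd"])
print(sample)
```

Passing `subset=` to `dropna` keeps rows that are missing some other column, which may or may not be what you want for a given analysis.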

Dataset 2 presented prices in Mexican pesos. I needed to convert these to USD using the 2014 exchange rate:

# Convert Mexican pesos to USD (19 pesos = 1 USD in 2014)
df2["price_usd"] = (df2["price_mxn"]/19).round(2)
df2.drop(columns=["price_mxn"], inplace=True)
df2.dropna(inplace=True)

Dataset 3 had the trickiest format: coordinates stored as combined strings and location names buried in pipe-separated structures:

# Split combined lat-lon strings into separate columns
df3[["lat", "lon"]] = df3['lat-lon'].str.split(',', expand=True).astype(float)

# Extract state names from hierarchical location strings
df3["state"] = df3["place_with_parent_names"].str.split("|", expand=True)[2]

# Clean up unnecessary columns
df3.drop(columns=["lat-lon", "place_with_parent_names"], inplace=True)
df3.dropna(inplace=True)
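
To see why the state sits at position 2 after the split, it helps to look at the intermediate result on a couple of rows. This is a minimal sketch on invented values; it assumes the location strings begin with a pipe character, which is what makes the first piece an empty string.

```python
import pandas as pd

# Hypothetical rows in the same shape as the third file (values invented for illustration)
sample = pd.DataFrame(
    {
        "lat-lon": ["19.52589,-99.151703", "18.91758,-99.23415"],
        "place_with_parent_names": [
            "|México|Distrito Federal|Gustavo A. Madero|",
            "|México|Morelos|Cuernavaca|",
        ],
    }
)

# The string starts with "|", so splitting yields an empty string at position 0,
# the country at position 1, and the state at position 2
parts = sample["place_with_parent_names"].str.split("|", expand=True)
print(parts)

sample["state"] = parts[2]
sample[["lat", "lon"]] = sample["lat-lon"].str.split(",", expand=True).astype(float)
print(sample[["lat", "lon", "state"]])
```

If the real strings did not start with a pipe, the state would be at position 1 instead, so printing `parts` once is a cheap safeguard.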

After cleaning each dataset individually, I combined them into one comprehensive dataset:

# Combine all three cleaned datasets, renumbering rows so index labels stay unique
df = pd.concat([df1, df2, df3], ignore_index=True)
print("Combined dataset shape:", df.shape)

# Save the cleaned data for future use
df.to_csv('mexico_real_estate_clean.csv', index=False)
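
One thing worth checking after a concatenation like this is whether the column names actually lined up. `pd.concat` aligns on column names and fills anything absent with NaN, so a misspelled or leftover column shows up as a block of missing values. The sketch below demonstrates this on two tiny invented frames; the frame names and values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature frames (invented values), one of them missing a column
a = pd.DataFrame({"state": ["Morelos"], "area_m2": [150.0], "price_usd": [67965.56]})
b = pd.DataFrame({"state": ["Puebla"], "price_usd": [52000.0]})  # no area_m2

# concat aligns on column names; absent columns are filled with NaN
combined = pd.concat([a, b], ignore_index=True)

# Count missing values per column to catch misaligned or misspelled columns
missing = combined.isna().sum()
print(missing)
```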

Exploratory Analysis: Finding Patterns in the Data

With clean data in hand, I could finally start asking interesting questions. The exploration phase taught me that data visualization isn't just about making pretty charts; it's about understanding your data's story.

Geographic Distribution: Where Are the Houses?

I started with a simple question: where are these properties located? Using Plotly's interactive mapping capabilities, I created a scatter plot showing all properties across Mexico:

import plotly.express as px

# Create an interactive map of all properties
fig = px.scatter_mapbox(
    df,
    lat='lat',
    lon='lon', 
    center={"lat": 19.43, "lon": -99.13},  # Centered on Mexico City
    width=600,
    height=600,
    hover_data=["price_usd"]
)
fig.update_layout(mapbox_style="open-street-map")
fig.show()

This visualization immediately revealed clustering patterns around major urban centers, with Mexico City showing the highest concentration of properties.

State-Level Analysis: Regional Differences

Next, I wanted to understand which states had the most properties and how prices varied by location:

# Find the most common states in our dataset
print(df['state'].value_counts(ascending=False))

# Calculate basic statistics for key variables
print(df[["area_m2", "price_usd"]].describe())

Understanding Property Sizes and Prices

To get a feel for the typical Mexican home, I examined the distribution of house sizes and prices:

import matplotlib.pyplot as plt

# Distribution of home sizes
plt.figure(figsize=(10, 6))
plt.hist(df['area_m2'], bins=50, edgecolor='black')
plt.xlabel("Area [sq meters]")
plt.ylabel("Frequency") 
plt.title("Distribution of Home Sizes")
plt.show()

# Box plot to identify outliers
plt.figure(figsize=(8, 6))
plt.boxplot(df['area_m2'])
plt.ylabel("Area [sq meters]")
plt.title("Distribution of Home Sizes")
plt.show()

The histogram and box plot revealed interesting patterns: most homes clustered around certain size ranges, but there were notable outliers that suggested luxury properties or data quality issues.
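
To list the outliers rather than just see them on a chart, one common option is Tukey's rule, the same 1.5 × IQR criterion that matplotlib's box plot uses by default to place its whiskers. The sketch below applies it to invented areas; it is an illustration of the rule, not the method used in the project.

```python
import pandas as pd

# Hypothetical areas in square meters (invented values, one obvious outlier)
areas = pd.Series([60, 75, 80, 90, 100, 120, 150, 2000], name="area_m2")

# Tukey's rule: flag points more than 1.5 * IQR beyond the quartiles
q1, q3 = areas.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = areas[(areas < lower) | (areas > upper)]
print(outliers)
```

Flagging is deliberately separate from dropping here: whether a 2,000 m² listing is a luxury estate or a typo is a judgment call that needs a look at the row itself.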

The Central Question: Location vs. Size

The heart of this project centered on one crucial question: What influences housing prices more, the size of the property or where it's located?

Price by State: Location Matters

I started by examining how average prices varied across different states:

# Calculate mean prices by state
mean_price_by_state = df.groupby("state")["price_usd"].mean().sort_values(ascending=False)

print("mean_price_by_state type:", type(mean_price_by_state))
print("mean_price_by_state shape:", mean_price_by_state.shape)

# Visualize the results
mean_price_by_state.plot(
    kind="bar",
    xlabel="State", 
    ylabel="Mean Price [USD]",
    title="Mean House Price by State",
    figsize=(12, 6)
)
plt.xticks(rotation=45)
plt.show()

The results were striking. Some states showed average prices significantly higher than others, suggesting that location plays a major role in pricing.

Price Per Square Meter: A More Nuanced View

Raw prices can be misleading because they don't account for property size. I calculated price per square meter to get a clearer picture:

# Create price per square meter column
df["price_per_m2"] = df["price_usd"] / df["area_m2"]

# Group by state and visualize
mean_price_per_m2 = df.groupby('state')['price_per_m2'].mean().sort_values(ascending=False)

mean_price_per_m2.plot(
    kind="bar",
    xlabel="State",
    ylabel="Mean Price per M² [USD]", 
    title="Mean House Price per M² by State",
    figsize=(12, 6)
)
plt.xticks(rotation=45)
plt.show()

This analysis revealed that some states with moderate absolute prices actually had very high costs per square meter, indicating sought-after locations where space comes at a premium.
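
A tiny worked example makes the reversal concrete. In the invented data below, state "A" has the higher mean price only because its homes are much larger; per square meter, state "B" is four times as expensive. The state labels and all values are hypothetical.

```python
import pandas as pd

# Hypothetical listings (invented values): large homes in A, small homes in B
toy = pd.DataFrame(
    {
        "state": ["A", "A", "B", "B"],
        "area_m2": [300, 500, 50, 70],
        "price_usd": [150000, 250000, 100000, 140000],
    }
)
toy["price_per_m2"] = toy["price_usd"] / toy["area_m2"]

# Compare the two rankings side by side
summary = toy.groupby("state")[["price_usd", "price_per_m2"]].mean()
print(summary)
```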

The Size-Price Relationship

To understand how property size affects pricing, I examined the correlation between area and price:

# Create scatter plot to visualize the relationship
plt.figure(figsize=(10, 6))
plt.scatter(x=df['area_m2'], y=df['price_usd'], alpha=0.6)
plt.xlabel("Area [square meters]")
plt.ylabel("Price [USD]")
plt.title("Price vs. Area Relationship")
plt.show()

# Calculate correlation coefficient
correlation = df['area_m2'].corr(df['price_usd'])
print(f"Correlation between area and price (all Mexico): {correlation:.3f}")

The scatter plot showed a positive relationship between size and price, but with significant variation that suggested other factors at play.

Deep Dive: Focusing on Morelos State

To get more granular insights, I focused on a single state, Morelos, to see how relationships might differ at a more local level:

# Filter data for Morelos state only
df_morelos = df[df["state"] == "Morelos"]

print("df_morelos type:", type(df_morelos))
print("df_morelos shape:", df_morelos.shape)

# Analyze the size-price relationship in Morelos
plt.figure(figsize=(10, 6))
plt.scatter(x=df_morelos['area_m2'], y=df_morelos['price_usd'])
plt.xlabel("Area [square meters]")
plt.ylabel("Price [USD]")
plt.title("Morelos: Price vs. Area")
plt.show()

# Calculate correlation for Morelos specifically
morelos_correlation = df_morelos['area_m2'].corr(df_morelos['price_usd'])
print(f"Correlation in Morelos: {morelos_correlation:.3f}")

This state-specific analysis revealed how local market conditions could create different patterns from the national average.
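
The same comparison can be run for every state at once instead of filtering one at a time. The sketch below computes the area-price correlation within each group on invented data; the column names match the ones used above, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical listings (invented values) standing in for the combined dataset
toy = pd.DataFrame(
    {
        "state": ["Morelos"] * 4 + ["Puebla"] * 4,
        "area_m2": [80, 120, 160, 200, 80, 120, 160, 200],
        "price_usd": [40000, 60000, 80000, 100000, 90000, 50000, 95000, 60000],
    }
)

# Pearson correlation between area and price within each state
corr_by_state = (
    toy.groupby("state")[["area_m2", "price_usd"]]
    .apply(lambda g: g["area_m2"].corr(g["price_usd"]))
    .sort_values(ascending=False)
)
print(corr_by_state)
```

States with only a few listings will give noisy correlations, so it is probably worth checking group sizes with `value_counts()` before reading much into any single number.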

Key Insights and Learning

Through this analysis, several important patterns emerged:

1. Location is a Major Price Driver: The variation in average prices between states was substantial, confirming that "location, location, location" holds true in Mexico's real estate market.

2. Price Per Square Meter Tells a Different Story: Some states with moderate absolute prices had very high costs per square meter, indicating urban areas where space is at a premium.

3. Size Matters, But It's Complicated: While larger properties generally cost more, the relationship isn't perfectly linear. Local market conditions, property type, and location within a state all influence this relationship.

4. Regional Markets Have Unique Characteristics: The focused analysis of Morelos showed that local markets can behave differently from national trends, emphasizing the importance of granular analysis.
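
One rough way to put the location-versus-size comparison on a common scale is to ask how much of the variation in price each factor accounts for on its own. The sketch below is not something the project itself did; it is a hedged illustration on invented data that compares r² for area (a straight-line fit) with eta squared for state (the share of variance explained by state means).

```python
import pandas as pd

# Hypothetical listings (invented values): prices differ far more by state than by area
toy = pd.DataFrame(
    {
        "state": ["A", "A", "B", "B"],
        "area_m2": [100, 200, 100, 200],
        "price_usd": [100000, 120000, 300000, 320000],
    }
)

price = toy["price_usd"]
total_ss = ((price - price.mean()) ** 2).sum()

# Share of price variance explained by a straight-line fit on area (r squared)
r_sq = toy["area_m2"].corr(price) ** 2

# Share of price variance explained by state means alone (eta squared)
state_means = toy.groupby("state")["price_usd"].transform("mean")
eta_sq = ((state_means - price.mean()) ** 2).sum() / total_ss

print(f"size (r^2): {r_sq:.3f}, location (eta^2): {eta_sq:.3f}")
```

Both numbers describe association only, and on real data size and location overlap (big homes cluster in certain states), so the two shares should be read as a rough comparison rather than a clean split.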

Technical Skills Developed

This project pushed me to develop several crucial data science competencies:

Data Manipulation with Pandas

Cleaning messy datasets with string operations and type conversions
Merging multiple datasets with different structures
Using groupby operations for aggregation and analysis
Handling missing data and outliers

Data Visualization

Creating informative charts with matplotlib
Building interactive maps with Plotly
Choosing appropriate visualization types for different data relationships
Designing clear, readable charts that communicate insights effectively

Statistical Analysis

Calculating and interpreting correlation coefficients
Understanding the difference between correlation and causation
Using descriptive statistics to understand data distributions
Recognizing when to dig deeper into specific subsets of data

The Bigger Picture: Real-World Applications

This project taught me that data science isn't just about running code; it's about asking the right questions and interpreting results in context. The insights from this analysis could inform:

Real Estate Investment Decisions: Understanding which states offer the best value per square meter
Market Pricing Strategies: How location and size should factor into pricing models
Urban Planning: Identifying areas where housing costs are disproportionately high
Economic Development: Understanding regional variations in housing markets

Challenges and Learning Moments

Every step of this project presented learning opportunities:

Data Quality Issues: Learning to spot and handle inconsistent formatting, missing values, and outliers taught me that real-world data is never perfect.

Choosing the Right Analysis: Deciding between absolute prices and price per square meter showed me how the same data can tell different stories depending on how you analyze it.

Correlation vs. Causation: Understanding that correlation doesn't imply causation helped me frame my conclusions more carefully.

Looking Forward

This project was more than an exercise in Python programming; it was training in thinking like a data scientist. I learned to:

Start with questions, not just data
Clean and prepare data systematically
Visualize results to find patterns
Interpret findings in real-world context
Recognize the limitations of my analysis

The skills I developed here (data cleaning, exploratory analysis, and statistical thinking) form the foundation for more advanced projects. Most importantly, I learned that every dataset has a story to tell, and finding that story requires both technical skills and curiosity.

As I move forward to analyze housing markets in Buenos Aires and beyond, I'm carrying these lessons with me. The goal isn't just to process data, but to uncover insights that could inform real decisions and solve real problems.

Final Thoughts

For anyone considering a similar path in data science, my advice is simple: embrace the messiness. Real-world projects like this one don't just teach you technical skills; they teach you how to think like a data scientist. The frustration of dealing with missing data, the satisfaction of finding meaningful patterns, and the excitement of uncovering unexpected insights are all part of what makes this field so rewarding.

The "Housing in Mexico" project wasn't just an academic exercise. It was a bridge between learning and doing, between theory and practice. And that bridge is exactly what every aspiring data scientist needs to cross.

From Raw Data to Real Insights: My Journey Through Mexico's Housing Market Analysis