Building My First Machine Learning Model: Predicting Buenos Aires Housing Prices

When I first opened that Jupyter notebook titled "Predicting Price with Property Size," I had no idea I was about to embark on one of the most rewarding learning experiences in data science. What started as a simple question about apartment sizes and prices in Buenos Aires turned into a comprehensive exploration of machine learning, data wrangling, and the art of making predictions from messy, real-world data.
The Foundation: Understanding the Data
The first thing that struck me about this project was how much work goes into preparing data before you can even think about building a model. The Buenos Aires real estate dataset wasn't just handed to me clean and ready to use. It came with all the messiness you'd expect from real-world data: missing values, outliers, and columns that seemed important but could actually mislead the model.
Here's the data wrangling function that became my best friend throughout this project:
import pandas as pd


def wrangle(filepath):
    # Read CSV file
    df = pd.read_csv(filepath)

    # Subset data: apartments in "Capital Federal" priced below 400,000 USD
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    mask_apt = df["property_type"] == "apartment"
    mask_price = df["price_aprox_usd"] < 400_000
    df = df[mask_ba & mask_apt & mask_price]

    # Subset data: remove outliers for "surface_covered_in_m2"
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]

    # Split "lat-lon" column into separate numeric columns
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)

    # Extract neighborhood name from "place_with_parent_names"
    df["neighborhood"] = df["place_with_parent_names"].str.split("|", expand=True)[3]
    df.drop(columns="place_with_parent_names", inplace=True)

    # Drop columns with more than half null values
    df.drop(columns=["floor", "expenses"], inplace=True)

    # Drop high- and low-cardinality categorical features
    df.drop(columns=["operation", "property_type", "currency", "properati_url"], inplace=True)

    # Drop leaky columns that encode the target
    df.drop(columns=["price", "price_aprox_local_currency", "price_per_m2", "price_usd_per_m2"], inplace=True)

    # Drop columns that introduce multicollinearity
    df.drop(columns=["surface_total_in_m2", "rooms"], inplace=True)

    return df
This function taught me so much about data preparation. Every line had a purpose, from filtering out luxury properties that might skew the model to removing outliers that could throw off the predictions. The most eye-opening part was learning about "leakage": those sneaky columns that seem helpful but actually give away the answer I was trying to predict.
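To make "leakage" concrete: columns such as price_per_m2 or price_aprox_local_currency are derived directly from the sale price, so a model that sees them is effectively being handed the answer. Here is a quick, hypothetical check on one raw file before wrangling (the filename is an assumption based on the glob pattern used in the next section, and this snippet is not part of the original notebook):

# Hypothetical sanity check on a raw file before wrangling (filename assumed)
raw = pd.read_csv("data/buenos-aires-real-estate-1.csv")

# A correlation near 1.0 means a column essentially contains the target already
leaky_cols = ["price", "price_aprox_local_currency", "price_per_m2", "price_usd_per_m2"]
print(raw[leaky_cols + ["price_aprox_usd"]].corr()["price_aprox_usd"])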
Bringing Multiple Files Together
One of the first technical challenges I faced was combining multiple CSV files into a single dataset. The solution was elegant in its simplicity:
from glob import glob

files = sorted(glob("data/buenos-aires-real-estate-*.csv"))
frames = [wrangle(file) for file in files]
df = pd.concat(frames, ignore_index=True)
The Art of Feature Selection
After all that data cleaning, I was left with a crucial decision: which features should I use to predict apartment prices? This is where the detective work really began. I created a correlation matrix to understand how different variables related to each other:
import seaborn as sns

corr = df.select_dtypes("number").drop(columns="price_aprox_usd").corr()
sns.heatmap(corr)
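The heatmap is great for a visual scan, but it can also help to pull out the strongly correlated pairs numerically. This is a small optional sketch (not in the original notebook) that reuses the corr matrix computed above; the 0.8 threshold is an arbitrary choice, and on the cleaned dataset there may be few pairs left, which is exactly what the earlier column drops were for:

# Keep only the upper triangle so each feature pair appears once
import numpy as np

upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong_pairs = (
    upper.stack()                   # long format: (feature_1, feature_2) -> correlation
    .loc[lambda s: s.abs() > 0.8]   # keep only strongly correlated pairs
    .sort_values(ascending=False)
)
print(strong_pairs)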
The heatmap revealed fascinating patterns. Some variables were so closely correlated that they were essentially telling the same story; that's multicollinearity, and it can confuse machine learning models. After careful consideration, I settled on four key features:
target = "price_aprox_usd"
features = ["surface_covered_in_m2", "lat", "lon", "neighborhood"]
y_train = df[target]
X_train = df[features]
Surface area made obvious sense: bigger apartments generally cost more. Latitude and longitude would capture location effects, while neighborhood would add that local flavor that makes Buenos Aires real estate so diverse.
Building the Model: From Simple to Sophisticated
Before diving into machine learning, I established a baseline. What if I just predicted every apartment would cost the average price?
from sklearn.metrics import mean_absolute_error

y_mean = y_train.mean()
y_pred_baseline = [y_mean] * len(y_train)
print("Mean apt price:", round(y_mean, 2))
print("Baseline MAE:", mean_absolute_error(y_train, y_pred_baseline))
This baseline gave me a reality check. Any model I built had to beat this simple average to be worth the effort. The mean absolute error of this naive approach became my benchmark for success.
Then came the exciting part: building the actual machine learning pipeline:
from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(),
    Ridge(),
)
model.fit(X_train, y_train)
This pipeline was a thing of beauty. The OneHotEncoder transformed neighborhood names into numerical data the model could understand. The SimpleImputer filled in any missing values. Finally, the Ridge regression model learned the complex relationships between size, location, and price.
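Once the pipeline is fitted, it can be illuminating to look inside and see what the Ridge step actually learned. The sketch below is not from the original notebook; it assumes the pipeline defined above and relies on make_pipeline naming each step after its lowercased class name:

# Peek inside the fitted pipeline (step names come from make_pipeline)
ridge = model.named_steps["ridge"]
encoder = model.named_steps["onehotencoder"]

# The category_encoders OneHotEncoder returns a DataFrame, so its output
# columns give feature names in the same order as the Ridge coefficients
feature_names = encoder.transform(X_train).columns
feat_imp = pd.Series(ridge.coef_, index=feature_names).sort_values()

print("Intercept:", round(ridge.intercept_, 2))
print(feat_imp.tail())  # encoded features with the largest positive coefficients

Because the target is in dollars, each coefficient can be read roughly as the change in predicted price per unit change in that encoded feature, which is part of what makes a linear model like Ridge so easy to interpret.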
The Moment of Truth: Making Predictions
After training the model, I could finally make predictions:
y_pred_training = model.predict(X_train)
print("Training MAE:", mean_absolute_error(y_train, y_pred_training))
The training error was significantly lower than my baseline; the model was actually learning something meaningful about Buenos Aires real estate! But the real test came with new data:
X_test = pd.read_csv("data/buenos-aires-test-features.csv")
y_pred_test = pd.Series(model.predict(X_test))
Watching those predictions populate was thrilling. Each number represented the model's best guess at what an apartment would cost based on everything it had learned.
Making It Interactive: The Fun Part
The most satisfying part of the project was creating an interactive prediction tool:
def make_prediction(area, lat, lon, neighborhood):
    data = {
        "surface_covered_in_m2": area,
        "lat": lat,
        "lon": lon,
        "neighborhood": neighborhood,
    }
    df = pd.DataFrame(data, index=[0])
    prediction = model.predict(df).round(2)[0]
    return f"Predicted apartment price: ${prediction}"
With Jupyter widgets, I could create sliders and dropdowns that let me explore different scenarios in real time. Want to see how much a 110-square-meter apartment in Villa Crespo might cost? Just adjust the sliders and watch the prediction update instantly.
from ipywidgets import Dropdown, FloatSlider, IntSlider, interact

interact(
    make_prediction,
    area=IntSlider(
        min=X_train["surface_covered_in_m2"].min(),
        max=X_train["surface_covered_in_m2"].max(),
        value=X_train["surface_covered_in_m2"].mean(),
    ),
    lat=FloatSlider(
        min=X_train["lat"].min(),
        max=X_train["lat"].max(),
        step=0.01,
        value=X_train["lat"].mean(),
    ),
    lon=FloatSlider(
        min=X_train["lon"].min(),
        max=X_train["lon"].max(),
        step=0.01,
        value=X_train["lon"].mean(),
    ),
    neighborhood=Dropdown(options=sorted(X_train["neighborhood"].unique())),
)
Lessons Learned: Beyond the Code
This project taught me that machine learning isn't just about algorithms and statistics; it's about understanding the domain you're working in. Every decision I made, from which outliers to remove to which features to include, required thinking about how people actually buy and sell apartments in Buenos Aires.
The technical skills were crucial, but equally important was learning to think like a data scientist. How do you balance model complexity with interpretability? How do you know when you're overfitting? How do you validate that your model will work on new data?
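One concrete way to start answering those questions is to hold out part of the training data and compare errors on data the model has never seen. This is a sketch of that idea, not part of the original workflow; it assumes the same X_train, y_train, and pipeline defined above:

from sklearn.base import clone
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hold out 20% of the training data as a validation set
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

# Clone the pipeline so the model used by the widget above stays untouched
check_model = clone(model)
check_model.fit(X_tr, y_tr)

print("Train MAE:     ", mean_absolute_error(y_tr, check_model.predict(X_tr)))
print("Validation MAE:", mean_absolute_error(y_val, check_model.predict(X_val)))
# A validation error far above the training error is a sign of overfitting.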
The Human Side of Housing Data
Working with real estate data felt deeply personal. Each row in my dataset represented someone's home, someone's investment, someone's dreams. The latitude and longitude coordinates weren't just numbers; they were addresses where people lived, worked, and built their lives.
This perspective kept me grounded throughout the technical challenges. Yes, I was building a predictive model, but I was also trying to understand what makes a neighborhood desirable, what drives people to pay premium prices for certain locations, and how the physical characteristics of an apartment translate into market value.
Looking Forward: What's Next?
Completing this first lesson has given me a solid foundation in the machine learning workflow: data preparation, feature selection, model training, and evaluation. But I know this is just the beginning. The subsequent lessons will undoubtedly introduce new challenges and more sophisticated techniques.
The interactive prediction tool I built works, but I can already see ways to improve it. What about adding more features? How might I handle the temporal aspect of real estate prices? Could I incorporate external data sources like crime statistics or school ratings?
Advice for Fellow Learners
If you're considering a similar project, my biggest piece of advice is to embrace the messiness. Real data is never clean, never perfect, and never exactly what you expect. That's what makes it interesting and valuable.
Start simple, as I did with this size-based prediction model. Get something working, understand why it works, and then gradually add complexity. Each step builds on the previous one, and before you know it, you'll have created something genuinely useful.
Most importantly, remember that behind every dataset are real people and real stories. Whether you're predicting housing prices in Buenos Aires or analyzing customer behavior for an e-commerce site, the human element should always guide your technical decisions.
The code is important, but understanding the context is what transforms a working model into a valuable tool. That's the real lesson I learned from my first foray into Buenos Aires real estate prediction, and it's one I'll carry with me throughout my data science journey.