Chapter 2: End-to-End ML with Housing Data

Khushi Rawat

Following my Chapter 1 post “Kicking Off My ML Journey with Aurélien Géron’s Book – Chapter 1: The Machine Learning Landscape”, where I laid the groundwork for understanding what Machine Learning is and the different learning paradigms, I’ve now completed Chapter 2, which marks the first real hands-on project in the book.

Why I’m Reading This Book

I started this book to go beyond surface-level ML and deeply understand Machine Learning, Deep Learning, and eventually Reinforcement Learning. So far, I’ve completed:

  • The Machine Learning A-Z™ 2025 course

  • My college-level Intro to ML

  • A club project on YOLOv8, which I’ll blog about soon

This book is my next serious step into the field, and Chapter 2 didn’t disappoint.

The Housing Price Prediction Project

In this chapter, I implemented an end-to-end ML project using California housing data. It’s a complete cycle—from loading the data to training, fine-tuning, and evaluating the model.

Major Steps Covered:

  1. Frame the Problem

    • Predict median housing prices in California districts (regression problem)
  2. Load & Explore the Data

    • Used pandas, matplotlib, and seaborn to analyse distributions, check for missing values, and identify feature correlations
  3. Create a Test Set

    • Learned the importance of consistent splits using train_test_split and stratified sampling
  4. Data Cleaning & Preparation

    • Handled missing values, categorical features, and feature scaling

    • Built data pipelines using scikit-learn’s Pipeline and ColumnTransformer

  5. Select and Train a Model

    • Trained Linear Regression, Decision Tree, and Random Forest models

    • Used cross-validation to evaluate and compare models

  6. Fine-Tune the Model

    • Used Grid Search and Randomized Search for hyperparameter tuning
  7. Evaluate on Test Set

    • Evaluated the final model on the held-out test set to get an unbiased estimate of its generalization error
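To make step 3 concrete, here is a minimal sketch of a stratified split. It uses synthetic stand-in data rather than the real housing CSV the chapter downloads, and the bin edges mirror the income categories Géron uses in the book:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the California housing data.
rng = np.random.default_rng(42)
housing = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, size=1000),
    "median_house_value": rng.uniform(15_000, 500_000, size=1000),
})

# Bin income into categories so the split preserves its distribution.
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# Stratify on the income category so the test set stays representative.
train_set, test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42
)
```

With `stratify=` set, each income category appears in the test set in almost exactly the same proportion as in the full dataset, which a plain random split does not guarantee.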
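The pipelines from step 4 can be sketched as follows. The column names here are illustrative (borrowed from the housing dataset's schema); the pattern is the point: impute and scale the numeric columns, one-hot encode the categorical one, and let `ColumnTransformer` route each column to the right sub-pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_attribs = ["median_income", "total_rooms"]
cat_attribs = ["ocean_proximity"]

# Numeric columns: fill missing values with the median, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Route numeric and categorical columns to their own transformers.
preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_attribs),
])

# Tiny frame with a missing value and a categorical column to exercise both paths.
df = pd.DataFrame({
    "median_income": [3.2, np.nan, 7.1],
    "total_rooms": [880.0, 1200.0, 540.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"],
})

prepared = preprocessing.fit_transform(df)
print(prepared.shape)  # (3, 4): 2 scaled numeric columns + 2 one-hot columns
```

Because the whole thing is a single transformer, the exact same preprocessing can be refit on the training set and reapplied to the test set with no copy-pasted cleaning code.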
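Step 5's model comparison looks roughly like this. I'm using `make_regression` as a stand-in dataset so the snippet runs on its own; with the real housing features the idea is identical:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Stand-in regression data.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

for name, model in [
    ("Linear Regression", LinearRegression()),
    ("Decision Tree", DecisionTreeRegressor(random_state=42)),
    ("Random Forest", RandomForestRegressor(n_estimators=50, random_state=42)),
]:
    # 5-fold cross-validation; scikit-learn reports negated RMSE, so flip the sign.
    scores = cross_val_score(
        model, X, y, scoring="neg_root_mean_squared_error", cv=5
    )
    rmse = -scores
    print(f"{name}: mean RMSE {rmse.mean():.1f} (std {rmse.std():.1f})")
```

Comparing the mean and spread of the fold scores, rather than one number from a single split, is what makes the comparison trustworthy.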
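And a minimal sketch of step 6's hyperparameter tuning with `GridSearchCV` (the grid values here are small placeholders chosen so the example runs quickly, not the ones from the book; `RandomizedSearchCV` is a drop-in alternative when the search space is large):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Stand-in regression data.
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=42)

# Try every combination in the grid with 3-fold cross-validation.
param_grid = {
    "n_estimators": [10, 30],
    "max_features": [2, 4],
}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

After fitting, `grid_search.best_estimator_` is the refit model with the best-scoring combination, ready to be evaluated on the test set.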

Key Learnings

  • Data pipelines are game-changers. Automating preprocessing ensures cleaner, reusable workflows.

  • Stratified sampling is essential when a key feature's distribution is skewed — it keeps the test set representative of the full dataset.

  • Cross-validation beats a single train/test split for comparing models, since averaging over folds reduces the variance of the evaluation.

  • Even basic models like Linear Regression can offer a good starting baseline.

Reflections

Chapter 2 truly felt like building something real. It connected the dots between theory and practice, especially around data handling, model evaluation, and pipeline automation. I found it rewarding to see how each small decision—from how I split the data to how I encoded features—affected final performance.

What’s Next?

I’ll be moving into the chapters on classification, training deep neural networks, and eventually Reinforcement Learning. I also have a separate post coming soon on my YOLOv8 club project that uses computer vision for retail analytics.

If you’re learning ML too or working on similar projects, I’d love to connect!
