Chapter 2: End-to-End ML with Housing Data

Following my Chapter 1 post “Kicking Off My ML Journey with Aurélien Géron’s Book – Chapter 1: The Machine Learning Landscape”, where I laid the groundwork for understanding what Machine Learning is and the different learning paradigms, I’ve now completed Chapter 2, which marks the first real hands-on project in the book.
Why I’m Reading This Book
I started this book to go beyond surface-level ML and deeply understand Machine Learning, Deep Learning, and eventually Reinforcement Learning. So far, I’ve completed:
- The Machine Learning A-Z™ 2025 course
- My college-level Intro to ML
- A club project on YOLOv8, which I’ll blog about soon
This book is my next serious step into the field, and Chapter 2 didn’t disappoint.
The Housing Price Prediction Project
In this chapter, I implemented an end-to-end ML project using California housing data. It’s a complete cycle—from loading the data to training, fine-tuning, and evaluating the model.
Major Steps Covered:
Frame the Problem
- Predict median housing prices in California districts (regression problem)
Load & Explore the Data
- Used pandas, matplotlib, and seaborn to analyse distributions, check for missing values, and identify feature correlations
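Here’s a minimal sketch of that exploration step, assuming the CSV from the book’s repo has already been downloaded to datasets/housing/housing.csv (the path is just my setup):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset (path assumes the CSV from the book's repo is saved locally)
housing = pd.read_csv("datasets/housing/housing.csv")

housing.info()                           # dtypes and non-null counts
print(housing.isnull().sum())            # missing values per column (total_bedrooms has some)
housing.hist(bins=50, figsize=(12, 8))   # distribution of every numerical attribute
plt.show()

# Correlation of each numeric feature with the target
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))
```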
Create a Test Set
- Learned the importance of consistent splits using `train_test_split` and stratified sampling (sketch below)
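Continuing from the snippet above, this is roughly what the book does: bin `median_income` into categories, then stratify the split on those bins so both sets share the same income distribution.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Bin median income so the split can preserve its distribution
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# stratify= keeps the income-category proportions identical in both sets
train_set, test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42
)

# The category column was only needed for splitting, so drop it afterwards
for set_ in (train_set, test_set):
    set_.drop("income_cat", axis=1, inplace=True)
```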
Data Cleaning & Preparation
- Handled missing values, categorical features, and feature scaling
- Built data pipelines using scikit-learn’s `Pipeline` and `ColumnTransformer` (sketch below)
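A sketch of that pipeline, continuing from the earlier snippets; the column names are the actual ones in the California housing dataset:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numerical columns: impute missing values with the median, then standardize
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

num_attribs = ["longitude", "latitude", "housing_median_age", "total_rooms",
               "total_bedrooms", "population", "households", "median_income"]
cat_attribs = ["ocean_proximity"]  # the only categorical feature

# Route each column group through the right transformer
preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_attribs),
])

housing_features = train_set.drop("median_house_value", axis=1)
housing_labels = train_set["median_house_value"].copy()
housing_prepared = preprocessing.fit_transform(housing_features)
```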
Select and Train a Model
- Trained Linear Regression, Decision Tree, and Random Forest models
- Used cross-validation to evaluate and compare models (sketch below)
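In sketch form, reusing `housing_prepared` and `housing_labels` from the pipeline above:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
}

# 10-fold CV; sklearn returns the score negated, so flip the sign to get RMSE
for name, model in models.items():
    scores = cross_val_score(model, housing_prepared, housing_labels,
                             scoring="neg_root_mean_squared_error", cv=10)
    print(f"{name}: RMSE {-scores.mean():,.0f} (+/- {scores.std():,.0f})")
```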
Fine-Tune the Model
- Used Grid Search and Randomized Search for hyperparameter tuning
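A sketch of the grid search over the Random Forest (the parameter values here are illustrative, not the book’s exact grid). `RandomizedSearchCV` has the same interface but samples parameter combinations instead of trying them all, which scales better to large search spaces.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative grid: 3 x 3 = 9 combinations, each cross-validated 5 times
param_grid = {
    "n_estimators": [30, 100, 300],
    "max_features": [4, 6, 8],
}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid, cv=5, scoring="neg_root_mean_squared_error",
)
grid_search.fit(housing_prepared, housing_labels)

print(grid_search.best_params_)
final_model = grid_search.best_estimator_  # already refit on the full training set
```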
Evaluate on Test Set
- Ran the chosen model once on the held-out test set for an unbiased performance estimate before deployment
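The last step in sketch form, reusing the fitted preprocessing pipeline and the tuned model from above. The key rule: transform the test set, never fit on it.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()

# Transform only: the pipeline's statistics were learned on the training set
X_test_prepared = preprocessing.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
print(f"Test RMSE: {final_rmse:,.0f}")
```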
Key Learnings
- Data pipelines are game-changers. Automating preprocessing ensures cleaner, reusable workflows.
- Stratified sampling is essential when dealing with imbalanced data distributions.
- Cross-validation beats a single train/test split: it reduces the variance of model evaluation.
- Even basic models like Linear Regression can offer a good starting baseline.
Reflections
Chapter 2 truly felt like building something real. It connected the dots between theory and practice, especially around data handling, model evaluation, and pipeline automation. I found it rewarding to see how each small decision—from how I split the data to how I encoded features—affected final performance.
What’s Next?
I’ll be moving into the chapters on classification, training deep neural networks, and eventually Reinforcement Learning. I also have a separate post coming soon on my YOLOv8 club project that uses computer vision for retail analytics.
If you’re learning ML too or working on similar projects, I’d love to connect!