Ridge, Lasso, and ElasticNet Regression: Understanding Regularization in Machine Learning


Introduction
Regression models, particularly Linear Regression, are powerful tools for predicting continuous values. However, they often suffer from overfitting, especially when dealing with high-dimensional datasets. To combat this, we use regularization techniques that introduce a penalty term to reduce model complexity and improve generalization.
In this blog, we will explore Ridge Regression, Lasso Regression, and ElasticNet, covering:
- The mathematics behind each method
- Comparison of results
By the end, you'll have a deep understanding of how these regularization techniques work and when to use them!
1. Linear Regression Recap
Before diving into regularization, let's briefly revisit linear regression. The goal is to find the optimal parameters that minimize the cost function:
$$J(\theta) = \frac{1}{2}\sum_{i}\left(y^{(i)} - h_\theta(x^{(i)})\right)^2$$
where:
- y is the actual output,
- x is the feature vector,
- θ is the parameter vector,
- h_θ(x) is the hypothesis function.

The superscript (i) indexes the i-th training example.
This approach works well, but it can lead to overfitting, particularly when features are correlated or high-dimensional. This is where regularization helps.
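To make the recap concrete, here is a minimal sketch of plain least-squares regression with scikit-learn. The synthetic dataset, variable names, and seeds below are illustrative assumptions for this post, not results from a real experiment:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 200 samples, 10 features, only the first 3 actually matter
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
true_theta = np.zeros(10)
true_theta[:3] = [3.0, -2.0, 1.5]
y = X @ true_theta + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ordinary least squares: minimizes the squared-error cost with no penalty
lin = LinearRegression().fit(X_train, y_train)
print("Train R^2:", lin.score(X_train, y_train))
print("Test R^2: ", lin.score(X_test, y_test))
```

The same train/test split is reused in the snippets that follow, so the three regularized models can be compared on identical data.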
2. Ridge Regression (L2 Regularization)
Mathematics
Ridge Regression modifies the cost function by adding an L2 penalty:
$$J(\theta) = \frac{1}{2}\sum_{i}\left(y^{(i)} - h_\theta(x^{(i)})\right)^2 + \frac{\lambda}{2}\sum_{j}\theta_j^2$$
where:
- λ is the regularization parameter that controls the penalty strength.
- The penalty term shrinks the parameters toward zero, reducing model complexity.
The solution to Ridge Regression is given by:
$$\theta = (X^TX + \lambda I)^{-1}X^TY$$
Here I is the identity matrix. Adding λI keeps X^TX + λI well-conditioned and invertible even when features are highly correlated, and it prevents large weight values, reducing overfitting.
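As a quick check, the closed-form solution can be computed directly with NumPy and compared against scikit-learn's Ridge, whose alpha parameter plays the role of λ. This is only a sketch: it reuses the illustrative X_train/y_train from the snippet above and sets fit_intercept=False so both solve the same problem:

```python
import numpy as np
from sklearn.linear_model import Ridge

lam = 1.0  # λ

# Closed form: theta = (X^T X + λI)^(-1) X^T y
n_features = X_train.shape[1]
theta_closed = np.linalg.solve(
    X_train.T @ X_train + lam * np.eye(n_features),
    X_train.T @ y_train,
)

# scikit-learn solves the same penalized least-squares problem
ridge = Ridge(alpha=lam, fit_intercept=False).fit(X_train, y_train)

print(np.allclose(theta_closed, ridge.coef_, atol=1e-6))  # expect True
```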
3. Lasso Regression (L1 Regularization)
Mathematics
Lasso Regression adds an L1 penalty, modifying the cost function to:
$$J(\theta) = \frac{1}{2}\sum_{i}\left(y^{(i)} - h_\theta(x^{(i)})\right)^2 + \lambda\sum_{j}|\theta_j|$$
Unlike Ridge, the L1 norm promotes sparsity, meaning it forces some weights to become exactly zero. This is useful for feature selection.
Unlike Ridge, Lasso has no closed-form solution: the L1 term is not differentiable at zero, so the problem is solved with iterative methods such as coordinate descent or subgradient descent.
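In practice we rely on a library solver. Below is a minimal sketch with scikit-learn's Lasso (which uses coordinate descent), reusing the illustrative data from above; note that scikit-learn scales the squared-error term by 1/(2n), so its alpha is not numerically identical to the λ in the formula:

```python
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1).fit(X_train, y_train)

# The L1 penalty drives the irrelevant coefficients to exactly zero
print("Coefficients: ", lasso.coef_)
print("Features kept:", int((lasso.coef_ != 0).sum()), "of", X_train.shape[1])
```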
4. ElasticNet Regression (L1 + L2 Regularization)
Mathematics
ElasticNet combines both Ridge and Lasso penalties:
$$J(\theta) = \frac{1}{2}\sum_{i}\left(y^{(i)} - h_\theta(x^{(i)})\right)^2 + \lambda_1\sum_{j}|\theta_j| + \lambda_2\sum_{j}\theta_j^2$$
where λ1 and λ2 control the strength of the L1 and L2 penalties, respectively (in practice they are often expressed as an overall strength plus a mixing ratio).
This provides:
- Feature selection (Lasso behavior)
- Weight shrinkage (Ridge behavior)
ElasticNet is especially useful when features are correlated, a situation where Lasso alone tends to pick just one feature from each correlated group and drop the rest.
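Here is a minimal ElasticNet sketch on the same illustrative data; in scikit-learn the two penalties are controlled jointly by alpha (overall strength) and l1_ratio (1.0 is pure Lasso, 0.0 is pure Ridge), and the values below are arbitrary choices:

```python
from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_train, y_train)

print("Coefficients:", enet.coef_)
print("Test R^2:    ", enet.score(X_test, y_test))
```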
5. Comparison & When to Use Which?
| Method | Regularization | Effect |
| --- | --- | --- |
| Ridge | L2 | Shrinks weights, keeps all features |
| Lasso | L1 | Shrinks some weights to zero (feature selection) |
| ElasticNet | L1 + L2 | Balances Ridge & Lasso, good for correlated features |
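To compare results on the illustrative data, we can print the fitted weights of the three models side by side (this assumes the ridge, lasso, and enet objects from the earlier snippets). On such data you would typically see Lasso and ElasticNet zero out most of the irrelevant weights, while Ridge keeps them all small but nonzero:

```python
import numpy as np

for name, model in [("Ridge", ridge), ("Lasso", lasso), ("ElasticNet", enet)]:
    print(f"{name:>10}: {np.round(model.coef_, 2)}")
```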
Key Takeaways
- Use Ridge when you have many correlated features and want to reduce model complexity.
- Use Lasso when you want to select the most important features.
- Use ElasticNet when you have correlated features and want the best of both worlds.
6. Conclusion
Regularization techniques are essential for improving the generalization of machine learning models. Ridge, Lasso, and ElasticNet provide different ways to prevent overfitting by penalizing large weights.
I hope this deep dive helped you understand these methods from a mathematical and implementation perspective. Try them on your datasets and experiment with different λ values!
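As a starting point for that experimentation, scikit-learn ships cross-validated variants that search over a grid of λ values for you; the grids below are arbitrary illustrative choices, reusing the same data as the earlier sketches:

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV

alphas = np.logspace(-3, 2, 30)

ridge_cv = RidgeCV(alphas=alphas).fit(X_train, y_train)
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X_train, y_train)
enet_cv = ElasticNetCV(alphas=alphas, l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X_train, y_train)

print("Best Ridge alpha:     ", ridge_cv.alpha_)
print("Best Lasso alpha:     ", lasso_cv.alpha_)
print("Best ElasticNet alpha:", enet_cv.alpha_, "l1_ratio:", enet_cv.l1_ratio_)
```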
Congratulations on making it this far!
If you enjoyed this article, please consider buying me a coffee.
Follow me on my socials and consider subscribing, as I'll be covering more Machine Learning algorithms "from scratch".