Ridge, Lasso, and ElasticNet Regression: Understanding Regularization in Machine Learning

Vamshi A

Introduction

Regression models, particularly Linear Regression, are powerful tools for predicting continuous values. However, they often suffer from overfitting, especially when dealing with high-dimensional datasets. To combat this, we use regularization techniques that introduce a penalty term to reduce model complexity and improve generalization.

In this blog, we will explore Ridge Regression, Lasso Regression, and ElasticNet, covering:

  • The mathematics behind each method

  • Comparison of results

By the end, you'll have a deep understanding of how these regularization techniques work and when to use them!


1. Linear Regression Recap

Before diving into regularization, let's briefly revisit linear regression. The goal is to find the optimal parameters that minimize the cost function:

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - h_\theta(x^{(i)})\right)^2$$

where:

  • y^{(i)} is the actual output for the i-th of m training examples,

  • x^{(i)} is its feature vector,

  • θ is the parameter vector,

  • h_θ(x) is the hypothesis function.

This approach works well, but it can lead to overfitting, particularly when features are correlated or high-dimensional. This is where regularization helps.
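
For reference, here is a minimal sketch of the unregularized fit using the normal equation. The toy data, the random seed, and all variable names are illustrative assumptions, not part of any particular dataset:

```python
import numpy as np

# Toy data: 100 samples, 3 features, generated from a known linear model plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

# Add a column of ones so the intercept is learned as part of theta.
X_b = np.hstack([np.ones((100, 1)), X])

# Ordinary least squares via the normal equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
print("OLS parameters (intercept first):", theta)
```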


2. Ridge Regression (L2 Regularization)

Mathematics

Ridge Regression modifies the cost function by adding an L2 penalty:

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - h_\theta(x^{(i)})\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2$$

where:

  • λ is the regularization parameter that controls the penalty strength.

  • The second term, summed over the n feature weights θ_j, shrinks the parameters and reduces model complexity.

The solution to Ridge Regression is given by:

$$\theta = (X^TX + \lambda I)^{-1}X^TY$$

This prevents large weight values, reducing overfitting.
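
To make this concrete, here is a small sketch that computes the closed-form Ridge solution above with NumPy and compares it against scikit-learn's Ridge estimator, where alpha plays the role of λ. The toy data and the chosen λ are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.1, size=100)

lam = 1.0  # regularization strength (lambda)

# Closed-form Ridge solution: theta = (X^T X + lambda * I)^{-1} X^T y
n_features = X.shape[1]
theta_closed = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# scikit-learn's Ridge; fit_intercept=False so it matches the closed form
# above, which has no separate bias term.
ridge = Ridge(alpha=lam, fit_intercept=False)
ridge.fit(X, y)

print("Closed-form theta: ", theta_closed)
print("sklearn Ridge coef:", ridge.coef_)
```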


3. Lasso Regression (L1 Regularization)

Mathematics

Lasso Regression adds an L1 penalty, modifying the cost function to:

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - h_\theta(x^{(i)})\right)^2 + \lambda\sum_{j=1}^{n}|\theta_j|$$

Unlike Ridge, the L1 norm promotes sparsity, meaning it forces some weights to become exactly zero. This is useful for feature selection.

Unlike Ridge, Lasso has no closed-form solution: the L1 penalty is not differentiable at zero, so the weights are found iteratively, typically with coordinate descent.
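
As a sketch, scikit-learn's Lasso solves this L1-penalized problem with coordinate descent; alpha again plays the role of λ. The toy data below (only two features carry signal) is an assumption made purely for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first and third features actually matter; the rest are noise.
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=100)

# Lasso with a moderate penalty; fit_intercept=False keeps the example minimal.
lasso = Lasso(alpha=0.1, fit_intercept=False)
lasso.fit(X, y)

# The L1 penalty typically drives the irrelevant coefficients to exactly zero,
# which is the feature-selection behavior described above.
print("Lasso coefficients:", lasso.coef_)
```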


4. ElasticNet Regression (L1 + L2 Regularization)

Mathematics

ElasticNet combines both Ridge and Lasso penalties:

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - h_\theta(x^{(i)})\right)^2 + \lambda_1\sum_{j=1}^{n}|\theta_j| + \lambda_2\sum_{j=1}^{n}\theta_j^2$$

where λ₁ controls the L1 (Lasso) part and λ₂ controls the L2 (Ridge) part.

This provides:

  • Feature selection (Lasso behavior)

  • Weight shrinkage (Ridge behavior)

ElasticNet is useful when features are correlated, since Lasso alone tends to arbitrarily pick just one feature from a correlated group.
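
Here is a minimal sketch with scikit-learn's ElasticNet, where alpha sets the overall penalty strength and l1_ratio blends the L1 and L2 parts. The correlated toy features below are an assumption chosen to illustrate the point about correlated groups:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
# Make the last feature strongly correlated with the first.
X[:, 3] = X[:, 0] + rng.normal(scale=0.01, size=100)
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# alpha is the overall penalty strength; l1_ratio=1.0 is pure Lasso, 0.0 is pure Ridge.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False)
enet.fit(X, y)

# With correlated features, ElasticNet tends to share weight across the group
# instead of arbitrarily keeping just one of them, as Lasso often does.
print("ElasticNet coefficients:", enet.coef_)
```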


5. Comparison & When to Use Which?

| Method | Regularization | Effect |
| --- | --- | --- |
| Ridge | L2 | Shrinks weights, keeps all features |
| Lasso | L1 | Shrinks some weights to zero (feature selection) |
| ElasticNet | L1 + L2 | Balances Ridge & Lasso, good for correlated features |

Key Takeaways

  • Use Ridge when you have many (possibly correlated) features and want to shrink all the weights without dropping any.

  • Use Lasso when you want to select the most important features.

  • Use ElasticNet when you have correlated features and want the best of both worlds.
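
To tie the comparison together, here is a small illustrative script that fits all four models on the same toy data so the coefficients can be compared side by side. The data, seeds, and hyperparameter values are assumptions, not tuned results:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Ground truth: only three of the six features carry signal.
y = X @ np.array([4.0, 0.0, -3.0, 0.0, 2.0, 0.0]) + rng.normal(scale=0.5, size=200)

models = {
    "Linear":     LinearRegression(fit_intercept=False),
    "Ridge":      Ridge(alpha=1.0, fit_intercept=False),
    "Lasso":      Lasso(alpha=0.1, fit_intercept=False),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False),
}

# Fit each model and print its coefficient vector for a side-by-side look.
for name, model in models.items():
    model.fit(X, y)
    print(f"{name:>10}: {np.round(model.coef_, 3)}")
```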


6. Conclusion

Regularization techniques are essential for improving the generalization of machine learning models. Ridge, Lasso, and ElasticNet provide different ways to prevent overfitting by penalizing large weights.

I hope this deep dive helped you understand these methods from a mathematical and implementation perspective. Try them on your datasets and experiment with different λ values!


Congratulations on making it this far!

If you enjoyed this article, please consider buying me a coffee.


Follow me on my socials and consider subscribing, as I'll be covering more Machine Learning algorithms "from scratch".
