Improving Model Accuracy with Lasso-Based Feature Selection


Objective
The goal is to use Lasso regression to perform feature selection. We want to distinguish true signal features from noise features by examining the Lasso coefficients and applying a threshold to eliminate irrelevant variables.
Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import (
    mean_absolute_error,
    root_mean_squared_error,
    r2_score,
    explained_variance_score,
    mean_squared_error,
)
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV
Utility Functions
Function to evaluate regression model performance:
def evaluate_regression_model(y_true, y_pred):
    print(f"Mean absolute error\t\t: {mean_absolute_error(y_true, y_pred):.3f}")
    print(f"Mean squared error\t\t: {mean_squared_error(y_true, y_pred):.3f}")
    print(f"Root mean squared error\t\t: {root_mean_squared_error(y_true, y_pred):.3f}")
    print(f"Explained variance score\t: {explained_variance_score(y_true, y_pred):.3f}")
    print(f"R2 score\t\t\t: {r2_score(y_true, y_pred):.3f}")
Generate Dataset
Create a synthetic regression dataset with noise and known coefficients:
X, y, coef = make_regression(noise=10, coef=True, random_state=42)
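With make_regression's defaults (100 samples, 100 features, 10 informative features), coef is a dense vector in which only the informative features have non-zero entries. A quick check of those assumptions:
print(f"X shape: {X.shape}, y shape: {y.shape}")                # expected: (100, 100) and (100,)
print(f"Non-zero true coefficients: {np.count_nonzero(coef)}")  # expected: 10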
Train the Model
Split the dataset and fit a LassoCV model:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
reg = LassoCV(alphas=np.logspace(-4, 2), random_state=42).fit(X_train, y_train)
y_pred = reg.predict(X_test)
print("Initial Result:")
evaluate_regression_model(y_test, y_pred)
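Before thresholding anything, it can also help to look at what LassoCV actually learned; a small optional check of the cross-validated regularization strength and the raw sparsity of the solution:
print(f"Chosen alpha: {reg.alpha_:.4f}")  # regularization strength picked by cross-validation
print(f"Non-zero Lasso coefficients: {np.count_nonzero(reg.coef_)} of {reg.coef_.size}")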
Feature Selection
Problem:
Lasso performs both regularization and feature selection by shrinking some coefficients to exactly zero, which makes it a powerful tool for identifying relevant features in a dataset. However, because of regularization and noise in the data, Lasso may also assign non-zero coefficients to irrelevant features (false positives).
It is therefore important to inspect the coefficients and apply a threshold to filter out these irrelevant features.
Solution:
We examine the learned coefficients from the Lasso model and compare them with the true coefficients of the data. We use a threshold to distinguish signal from noise.
By observing the first plot below, we can see:
The green line shows the Lasso-estimated coefficients.
The scatter plot shows the true coefficients.
Some features whose true coefficients are exactly 0 have been assigned small non-zero values by Lasso, due to noise and limitations of the optimization.
The first plot helps us pick a reasonable threshold that separates informative features from irrelevant ones.
The second plot (an elbow plot of sorted absolute Lasso coefficients) helps determine a reasonable threshold. A threshold of 4 is chosen here to separate signal from noise.
threshold = 4
lasso_coef = reg.coef_
x = np.arange(len(coef))
fig, axes = plt.subplots(2, 1, figsize=(12, 12))
# Plot true vs Lasso coefficients
axes[0].scatter(x, coef, alpha=0.5, label="True Coefficients")
axes[0].plot(x, lasso_coef, c="g", label="Lasso Coefficients")
axes[0].axhline(threshold, c="r", linestyle="--", label="Threshold")
axes[0].legend()
# Elbow plot to choose threshold
sorted_lasso_coef = np.sort(np.abs(lasso_coef))[::-1]
axes[1].plot(sorted_lasso_coef, label="Sorted Lasso Coefficients")
axes[1].axhline(threshold, c="r", linestyle="--", label="Threshold")
axes[1].legend()
plt.tight_layout()
plt.show()
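If reading the threshold off the plot feels too ad hoc, one possible heuristic (an addition to the original analysis, with illustrative variable names) is to place the cut-off in the middle of the largest gap between consecutive sorted coefficient magnitudes:
gaps = sorted_lasso_coef[:-1] - sorted_lasso_coef[1:]  # drop between consecutive magnitudes
largest_gap = np.argmax(gaps)                          # position of the biggest drop
suggested_threshold = (sorted_lasso_coef[largest_gap] + sorted_lasso_coef[largest_gap + 1]) / 2
print(f"Suggested threshold from largest gap: {suggested_threshold:.3f}")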
Evaluate Selected Features
After applying the threshold, we identify the indices of features selected by Lasso and compare them with the ground truth.
selected_features_indices = np.where(np.abs(lasso_coef) > threshold)[0]
print(f"Selected feature indices: {selected_features_indices}")
true_features_indices = np.where(np.abs(coef) > 0)[0]
print(f"True feature indices: {true_features_indices}")
correct_identifications = np.intersect1d(selected_features_indices, true_features_indices)
print(f"Correctly identified features: {len(correct_identifications)} out of {len(true_features_indices)}")
precision = len(correct_identifications) / len(selected_features_indices)
recall = len(correct_identifications) / len(true_features_indices)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}")
With the threshold set to 4, Lasso correctly identified 9 out of 10 true features and selected no irrelevant ones, achieving a precision of 1.000 and a recall of 0.900. This shows that, with an appropriate threshold, Lasso can be a highly effective tool for sparse feature selection.
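For reference, scikit-learn's SelectFromModel can apply the same kind of coefficient threshold without the manual np.where step; a minimal sketch reusing the fitted reg and the threshold chosen above:
from sklearn.feature_selection import SelectFromModel
selector = SelectFromModel(reg, threshold=threshold, prefit=True)  # reuse the already-fitted LassoCV
print(f"SelectFromModel indices: {selector.get_support(indices=True)}")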
Compare Performance With and Without Feature Selection
selected_features_X_train = X_train[:, selected_features_indices]
selected_features_X_test = X_test[:, selected_features_indices]
# Retrain using selected features only
reg_selected = LassoCV(alphas=np.logspace(-4, 2), random_state=42).fit(selected_features_X_train, y_train)
selected_features_y_pred = reg_selected.predict(selected_features_X_test)
print("Result with All Features:")
evaluate_regression_model(y_test, y_pred)
print("\nResult with Selected Features:")
evaluate_regression_model(y_test, selected_features_y_pred)
The model using selected features performs slightly better across all metrics. This shows that Lasso-based feature selection not only reduces model complexity but also improves generalization by removing noisy or irrelevant features.
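To probe the generalization claim a bit further, an optional check (not part of the original walkthrough) is to cross-validate both feature sets on the full dataset and compare the average R2:
from sklearn.model_selection import cross_val_score
cv_all = cross_val_score(LassoCV(alphas=np.logspace(-4, 2), random_state=42), X, y, cv=5, scoring="r2")
cv_selected = cross_val_score(
    LassoCV(alphas=np.logspace(-4, 2), random_state=42),
    X[:, selected_features_indices], y, cv=5, scoring="r2",
)
print(f"Mean CV R2 (all features)     : {cv_all.mean():.3f}")
print(f"Mean CV R2 (selected features): {cv_selected.mean():.3f}")
Note that selected_features_indices was chosen using the training split, so this comparison is a rough check rather than a fully leak-free evaluation.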
Conclusion
Lasso regression is a great tool for feature selection in noisy datasets. It automatically reduces the importance of irrelevant features by shrinking their coefficients, often all the way to zero. By adding a simple threshold (chosen by visualizing the coefficients), we can filter out the remaining small, spurious coefficients and further improve the selection.