Improving Model Accuracy with Lasso-Based Feature Selection


Objective
The goal is to use Lasso regression to perform feature selection. We want to distinguish true signal features from noise features by examining the Lasso coefficients and applying a threshold to eliminate irrelevant variables.
Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import (
    mean_absolute_error,
    root_mean_squared_error,
    r2_score,
    explained_variance_score,
    mean_squared_error,
)
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV
Utility Functions
Function to evaluate regression model performance:
def evaluate_regression_model(y_true, y_pred):
    print(f"Mean absolute error\t\t: {mean_absolute_error(y_true, y_pred):.3f}")
    print(f"Mean squared error\t\t: {mean_squared_error(y_true, y_pred):.3f}")
    print(f"Root mean squared error\t\t: {root_mean_squared_error(y_true, y_pred):.3f}")
    print(f"Explained variance score\t: {explained_variance_score(y_true, y_pred):.3f}")
    print(f"R2 score\t\t\t: {r2_score(y_true, y_pred):.3f}")
Generate Dataset
Create a synthetic regression dataset with noise and known coefficients:
X, y, coef = make_regression(noise=10, coef=True, random_state=42)
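With make_regression's defaults (100 samples, 100 features, 10 informative features), coef is a dense vector in which only the informative features have non-zero entries. A quick check of those assumptions:
print(f"X shape: {X.shape}, y shape: {y.shape}")                # expected: (100, 100) and (100,)
print(f"Non-zero true coefficients: {np.count_nonzero(coef)}")  # expected: 10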
Train the Model
Split the dataset and fit a LassoCV model:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
reg = LassoCV(alphas=np.logspace(-4, 2), random_state=42).fit(X_train, y_train)
y_pred = reg.predict(X_test)
print("Initial Result:")
evaluate_regression_model(y_test, y_pred)
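Before thresholding anything, it can also help to look at what LassoCV actually learned; a small optional check of the cross-validated regularization strength and the raw sparsity of the solution:
print(f"Chosen alpha: {reg.alpha_:.4f}")  # regularization strength picked by cross-validation
print(f"Non-zero Lasso coefficients: {np.count_nonzero(reg.coef_)} of {reg.coef_.size}")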
Feature Selection
Problem:
Lasso performs both regularization and feature selection by shrinking some coefficients to exactly zero, which makes it a powerful tool for identifying relevant features in a dataset. However, because of regularization and noise in the data, Lasso may also assign non-zero coefficients to irrelevant features (false positives).
It is therefore important to inspect the coefficients and apply a threshold to filter out these irrelevant features.
Solution:
We examine the learned coefficients from the Lasso model and compare them with the true coefficients of the data. We use a threshold to distinguish signal from noise.
By observing the first plot below, we can see:
The green line shows the Lasso-estimated coefficients.
The scatter plot shows the true coefficients.
Some features whose true coefficients are exactly 0 have been assigned small non-zero values by Lasso, due to noise and limitations of the optimization.
The first plot helps us pick a reasonable threshold that separates informative features from irrelevant ones.
The second plot (an elbow plot of sorted absolute Lasso coefficients) helps determine a reasonable threshold. A threshold of 4 is chosen here to separate signal from noise.
threshold = 4
lasso_coef = reg.coef_
x = np.arange(len(coef))
fig, axes = plt.subplots(2, 1, figsize=(12, 12))
# Plot true vs Lasso coefficients
axes[0].scatter(x, coef, alpha=0.5, label="True Coefficients")
axes[0].plot(x, lasso_coef, c="g", label="Lasso Coefficients")
axes[0].axhline(threshold, c="r", linestyle="--", label="Threshold")
axes[0].legend()
# Elbow plot to choose threshold
sorted_lasso_coef = np.sort(np.abs(lasso_coef))[::-1]
axes[1].plot(sorted_lasso_coef, label="Sorted Lasso Coefficients")
axes[1].axhline(threshold, c="r", linestyle="--", label="Threshold")
axes[1].legend()
plt.tight_layout()
plt.show()
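If reading the threshold off the plot feels too ad hoc, one possible heuristic (an addition to the original analysis, with illustrative variable names) is to place the cut-off in the middle of the largest gap between consecutive sorted coefficient magnitudes:
gaps = sorted_lasso_coef[:-1] - sorted_lasso_coef[1:]  # drop between consecutive magnitudes
largest_gap = np.argmax(gaps)                          # position of the biggest drop
suggested_threshold = (sorted_lasso_coef[largest_gap] + sorted_lasso_coef[largest_gap + 1]) / 2
print(f"Suggested threshold from largest gap: {suggested_threshold:.3f}")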
Evaluate Selected Features
After applying the threshold, we identify the indices of features selected by Lasso and compare them with the ground truth.
selected_features_indices = np.where(np.abs(lasso_coef) > threshold)[0]
print(f"Selected feature indices: {selected_features_indices}")
true_features_indices = np.where(np.abs(coef) > 0)[0]
print(f"True feature indices: {true_features_indices}")
correct_identifications = np.intersect1d(selected_features_indices, true_features_indices)
print(f"Correctly identified features: {len(correct_identifications)} out of {len(true_features_indices)}")
precision = len(correct_identifications) / len(selected_features_indices)
recall = len(correct_identifications) / len(true_features_indices)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}")
With the threshold set to 4, Lasso correctly identified 9 out of 10 true features and selected no irrelevant ones, achieving a precision of 1.000 and a recall of 0.900. This shows that, with an appropriate threshold, Lasso can be a highly effective tool for sparse feature selection.
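For reference, scikit-learn's SelectFromModel can apply the same kind of coefficient threshold without the manual np.where step; a minimal sketch reusing the fitted reg and the threshold chosen above:
from sklearn.feature_selection import SelectFromModel
selector = SelectFromModel(reg, threshold=threshold, prefit=True)  # reuse the already-fitted LassoCV
print(f"SelectFromModel indices: {selector.get_support(indices=True)}")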
Compare Performance With and Without Feature Selection
selected_features_X_train = X_train[:, selected_features_indices]
selected_features_X_test = X_test[:, selected_features_indices]
# Retrain using selected features only
reg_selected = LassoCV(alphas=np.logspace(-4, 2), random_state=42).fit(selected_features_X_train, y_train)
selected_features_y_pred = reg_selected.predict(selected_features_X_test)
print("Result with All Features:")
evaluate_regression_model(y_test, y_pred)
print("\nResult with Selected Features:")
evaluate_regression_model(y_test, selected_features_y_pred)
The model using selected features performs slightly better across all metrics. This shows that Lasso-based feature selection not only reduces model complexity but also improves generalization by removing noisy or irrelevant features.
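To probe the generalization claim a bit further, an optional check (not part of the original walkthrough) is to cross-validate both feature sets on the full dataset and compare the average R2:
from sklearn.model_selection import cross_val_score
cv_all = cross_val_score(LassoCV(alphas=np.logspace(-4, 2), random_state=42), X, y, cv=5, scoring="r2")
cv_selected = cross_val_score(
    LassoCV(alphas=np.logspace(-4, 2), random_state=42),
    X[:, selected_features_indices], y, cv=5, scoring="r2",
)
print(f"Mean CV R2 (all features)     : {cv_all.mean():.3f}")
print(f"Mean CV R2 (selected features): {cv_selected.mean():.3f}")
Note that selected_features_indices was chosen using the training split, so this comparison is a rough check rather than a fully leak-free evaluation.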
Conclusion
Lasso regression is a great tool for feature selection in noisy datasets. It automatically reduces the importance of irrelevant features by shrinking their coefficients, often all the way to zero. By adding a simple threshold (chosen by visualizing the coefficients), we can filter out the remaining small, spurious coefficients and further improve the selection.