Linear Regression in Python: A Practical Guide
Linear regression is one of the fundamental algorithms in machine learning and statistics. This guide will walk you through implementing and understanding linear regression using Python, NumPy, scikit-learn, and matplotlib.
What is Linear Regression?
Linear regression models the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a linear equation to the observed data. The simplest form is:
$$y = mx + b$$
where:
y: the predicted value
x: the feature value
m: the slope (coefficient)
b: the y-intercept
Assumptions of Linear Regression
Linear regression is not always the best choice for your data. Before using it, make sure the following assumptions hold (a quick residual check is sketched after the list):
Linearity: Relationship between X and y should be linear
Independence: Observations should be independent
Homoscedasticity: Constant variance in residuals
Normality: Residuals should be normally distributed
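As a quick way to eyeball the last two assumptions, the sketch below plots residuals against predictions and runs a normality test. It is only an illustration: it assumes you already have a fitted scikit-learn model and arrays X, y like the ones built in the next section.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Assumes `model`, `X`, `y` already exist (see the scikit-learn example below)
residuals = (y - model.predict(X)).ravel()

# Homoscedasticity: residuals should scatter evenly around zero with no pattern
plt.scatter(model.predict(X), residuals)
plt.axhline(0, color='red')
plt.xlabel('Predicted value')
plt.ylabel('Residual')
plt.show()

# Normality: a large p-value means normality of the residuals is not rejected
statistic, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")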
Implementation with Python
Let's look at different ways to implement linear regression:
1. Using scikit-learn
Scikit-learn is a popular, open-source machine learning library for Python, designed for simplicity and efficiency, which makes it a good fit for beginners and researchers working on data-driven projects. It provides a broad set of tools for data analysis and machine learning, including easy-to-use implementations of common algorithms. For linear regression, scikit-learn is especially handy because it streamlines model training, validation, and performance evaluation.
The following code shows a common use of Scikit-Learn for linear regression:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import matplotlib.pyplot as plt
# Generate sample data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Create and train the model
model = LinearRegression()
model.fit(X, y)
# Make predictions
y_pred = model.predict(X)
# Print results
print(f"Coefficient: {model.coef_[0][0]:.2f}")
print(f"Intercept: {model.intercept_[0]:.2f}")
print(f"R² Score: {r2_score(y, y_pred):.2f}")
print(f"Mean Squared Error: {mean_squared_error(y, y_pred):.2f}")
# Visualize results
plt.scatter(X, y, color='blue', label='Data')
plt.plot(X, y_pred, color='red', label='Prediction')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()
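Once trained, the same model can score unseen inputs. As a small, hypothetical example with a single new x value (predict() expects a 2-D array of shape (n_samples, n_features)):
# Hypothetical new observation
X_new = np.array([[1.5]])
print(f"Prediction for x = 1.5: {model.predict(X_new)[0][0]:.2f}")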
2. Implementing from Scratch
Implementing linear regression from scratch is a valuable exercise for deepening your understanding of this foundational algorithm, particularly how the slope and intercept are calculated. The code below defines a SimpleLinearRegression class that encapsulates the fitting and prediction process.
import numpy as np

class SimpleLinearRegression:
    def __init__(self):
        self.slope = None
        self.intercept = None

    def fit(self, X, y):
        # Calculate means
        x_mean = np.mean(X)
        y_mean = np.mean(y)
        # Calculate slope
        numerator = np.sum((X - x_mean) * (y - y_mean))
        denominator = np.sum((X - x_mean) ** 2)
        self.slope = numerator / denominator
        # Calculate intercept
        self.intercept = y_mean - self.slope * x_mean

    def predict(self, X):
        return self.slope * X + self.intercept
Key aspects:
Fitting the model (fit method): calculates the means of the input features (X) and the target variable (y), determines the slope as the ratio of the covariance of X and y to the variance of X, and calculates the intercept, which is where the regression line crosses the y-axis.
Making predictions (predict method): uses the calculated slope and intercept to make predictions for new input data.
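As a sanity check, here is a minimal usage sketch that fits this class on the synthetic data from the scikit-learn example above (flattened to 1-D) and compares the recovered parameters with the true slope 3 and intercept 4:
# Assumes X, y from the earlier scikit-learn example
simple_model = SimpleLinearRegression()
simple_model.fit(X.ravel(), y.ravel())
print(f"Slope: {simple_model.slope:.2f}")          # expected to be close to 3
print(f"Intercept: {simple_model.intercept:.2f}")  # expected to be close to 4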
Best Practices
Following best practices in machine learning is essential for achieving reliable and efficient results. This section highlights the most important ones for linear regression.
Feature scaling is important to normalize data ranges.
Splitting datasets into training and testing sets ensures effective evaluation.
Using cross-validation provides a more robust assessment.
Implementing these practices can significantly improve the accuracy of your machine learning projects. Below are code snippets demonstrating each of these essential techniques.
Feature Scaling: Scale features to similar ranges using StandardScaler or MinMaxScaler
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Train-Test Split: Always split data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Cross-Validation: Use cross-validation for more robust evaluation
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validation scores: {scores}")
Evaluating the Model
To assess how well your fitted model performs, you can use the following metrics:
R-squared (R²): Measures the proportion of variance in the dependent variable explained by the independent variables.
R² = 1 indicates perfect fit
R² = 0 indicates the model doesn't explain any variability
Mean Squared Error (MSE): Average squared difference between predicted and actual values.
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
mse = np.mean((y_true - y_pred) ** 2)
Root Mean Squared Error (RMSE): Square root of MSE, in same units as target variable.
rmse = np.sqrt(mse)
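Putting these together, and assuming y and y_pred from the scikit-learn example above, all three metrics can be computed in a few lines:
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

r2 = r2_score(y, y_pred)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
print(f"R²: {r2:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}")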
Handling Common Issues
If the evaluation results are unsatisfactory, check whether one of the following issues applies to your dataset.
1. Outliers
Outliers can significantly impact your regression line. Here are several approaches to handle them:
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression, HuberRegressor
import pandas as pd

# Z-score method
def remove_outliers_zscore(X, y, threshold=3):
    z_scores = np.abs(stats.zscore(y))
    mask = z_scores < threshold
    return X[mask], y[mask]

# IQR method
def remove_outliers_iqr(X, y):
    Q1 = np.percentile(y, 25)
    Q3 = np.percentile(y, 75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    mask = (y >= lower_bound) & (y <= upper_bound)
    return X[mask], y[mask]

# Using robust regression
def fit_robust_regression(X, y):
    robust_reg = HuberRegressor()
    return robust_reg.fit(X, y)
Key aspects:
remove_outliers_zscore(): calculates the Z-scores of the target variable y and removes data points whose absolute Z-score exceeds the specified threshold (default is 3).
remove_outliers_iqr(): calculates the first (Q1) and third (Q3) quartiles of y, computes the interquartile range (IQR), defines lower and upper bounds, and removes data points outside that range.
fit_robust_regression(): uses HuberRegressor from scikit-learn to fit a robust regression model that is less sensitive to outliers than standard regression methods.
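To see the effect of robust regression, here is a hypothetical sketch that injects a few outliers into synthetic data and compares the slope recovered by ordinary least squares with the Huber fit:
# Hypothetical demo data: y ≈ 4 + 3x with a handful of injected outliers
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 2, size=(100, 1))
y_demo = 4 + 3 * X_demo.ravel() + rng.normal(size=100)
y_demo[:5] += 30  # inject outliers

ols = LinearRegression().fit(X_demo, y_demo)
huber = fit_robust_regression(X_demo, y_demo)
print(f"OLS slope: {ols.coef_[0]:.2f}, Huber slope: {huber.coef_[0]:.2f}")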
2. Missing Values
Here are different strategies for handling missing values:
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

def handle_missing_values(df, strategy='mean'):
    if strategy == 'mean':
        # Mean imputation
        imputer = SimpleImputer(strategy='mean')
        df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    elif strategy == 'knn':
        # KNN imputation
        imputer = KNNImputer(n_neighbors=5)
        df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    elif strategy == 'interpolate':
        # Linear interpolation
        df_imputed = df.interpolate(method='linear')
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    return df_imputed
Key aspects:
handle_missing_values(df, strategy='mean'): accepts a DataFrame df and a strategy parameter that selects how missing values are handled.
Mean imputation: uses SimpleImputer with strategy='mean' to replace missing values with the mean of each column.
KNN imputation: uses KNNImputer with a specified number of neighbors (default is 5) to replace missing values based on the mean of the nearest neighbors.
Linear interpolation: estimates missing values from the surrounding existing data.
Output: returns an imputed DataFrame (df_imputed) with missing values handled according to the chosen strategy.
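A minimal usage sketch, using a small hypothetical DataFrame with one missing entry:
import numpy as np
import pandas as pd

# Hypothetical DataFrame with a missing value in column x1
df_demo = pd.DataFrame({'x1': [1.0, 2.0, np.nan, 4.0],
                        'x2': [10.0, 20.0, 30.0, 40.0]})
print(handle_missing_values(df_demo, strategy='mean'))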
3. Non-linearity
When relationships are non-linear, you can transform features or use polynomial regression:
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

def handle_nonlinearity(X, y, method='log', degree=2):
    if method == 'log':
        # Log transformation (log1p handles zero values)
        X_transformed = np.log1p(X)
    elif method == 'polynomial':
        # Polynomial features
        polynomial = PolynomialFeatures(degree=degree, include_bias=False)
        X_transformed = polynomial.fit_transform(X)
    elif method == 'box-cox':
        # Box-Cox transformation (requires strictly positive values)
        X_transformed, _ = stats.boxcox(X.flatten())
        X_transformed = X_transformed.reshape(-1, 1)
    # Fit model with transformed features
    model = LinearRegression()
    model.fit(X_transformed, y)
    return model, X_transformed
Key aspects:
handle_nonlinearity(): accepts feature data X, target variable y, a method parameter that selects how to address the nonlinearity, and a degree for polynomial features.
Log transformation: applies np.log1p(X), which handles zero values; useful for reducing skewness in the data.
Polynomial features: generates polynomial features of the specified degree using PolynomialFeatures from scikit-learn, expanding the feature space to capture non-linear relationships.
Box-Cox transformation: stabilizes variance and makes the data more normally distributed, then reshapes the transformed data to preserve its structure.
Model fitting: fits a linear regression model on the transformed features (X_transformed) against the target variable y.
Output: returns the fitted model and the transformed feature set (X_transformed).
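As a hypothetical illustration, here is a quadratic relationship handled with the polynomial option:
# Hypothetical quadratic data: y ≈ 1 + 2x² plus noise
rng = np.random.default_rng(1)
X_nl = rng.uniform(0, 3, size=(200, 1))
y_nl = 1 + 2 * X_nl.ravel() ** 2 + rng.normal(size=200)

poly_model, X_poly = handle_nonlinearity(X_nl, y_nl, method='polynomial', degree=2)
print(f"R² with polynomial features: {poly_model.score(X_poly, y_nl):.2f}")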
4. Multicollinearity
Detect and handle correlated features:
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def handle_multicollinearity(X, threshold=0.8):
    # Calculate correlation matrix
    df = pd.DataFrame(X)
    corr_matrix = df.corr().abs()
    # Find highly correlated features (upper triangle of the matrix)
    upper = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    to_drop = [column for column in upper.columns
               if any(upper[column] > threshold)]
    # Remove one of each pair of highly correlated features
    X_filtered = df.drop(columns=to_drop)
    return X_filtered

def process_multicollinearity(X, y):
    # Calculate VIF for each feature
    vif_data = pd.DataFrame()
    vif_data["Feature"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                       for i in range(X.shape[1])]
    # Remove features with high VIF (>5 is a common threshold)
    high_vif_features = vif_data[vif_data["VIF"] > 5]["Feature"]
    X_processed = X.drop(columns=high_vif_features)
    return X_processed, vif_data
Key aspects:
handle_multicollinearity(): accepts feature data X and a correlation threshold used to identify multicollinearity. It computes the absolute correlation matrix with df.corr(), masks the upper triangle to find highly correlated feature pairs, builds a list of features to drop based on the threshold, and returns a filtered dataset (X_filtered) with one feature of each highly correlated pair removed.
process_multicollinearity(): calculates the Variance Inflation Factor (VIF) for each feature to quantify the degree of multicollinearity. It builds a DataFrame of feature names and their VIF values, flags features with a VIF greater than 5 (a common indicator of high multicollinearity), drops them, and returns the processed feature set (X_processed) along with the VIF data for further analysis.
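A small hypothetical example with two nearly identical features shows the correlation-based filter in action:
# Hypothetical features: f1 and f2 are almost identical, f3 is independent
rng = np.random.default_rng(2)
f1 = rng.normal(size=200)
X_mc = pd.DataFrame({'f1': f1,
                     'f2': f1 + rng.normal(scale=0.01, size=200),
                     'f3': rng.normal(size=200)})
print(handle_multicollinearity(X_mc, threshold=0.8).columns.tolist())  # f2 is dropped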
5. Feature Scaling Issues
Proper scaling can improve model performance:
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

def scale_features(X_train, X_test, method='standard'):
    if method == 'standard':
        scaler = StandardScaler()
    elif method == 'robust':
        scaler = RobustScaler()  # Less sensitive to outliers
    elif method == 'minmax':
        scaler = MinMaxScaler()
    # Fit on training data, transform both training and test sets
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, scaler
Purpose: The function scales features of training and test datasets to improve the performance of machine learning models.
Parameters:
X_train: the training dataset.
X_test: the test dataset.
method: the scaling method to use ('standard', 'robust', or 'minmax').
Scaling Methods:
StandardScaler: Standardizes features by removing the mean and scaling to unit variance.
RobustScaler: Scales features using statistics that are robust to outliers (median and interquartile range).
MinMaxScaler: Scales features to a range between 0 and 1.
Process:
Creates an instance of the appropriate scaler based on the specified method.
Fits the scaler on the training data and transforms both the training and test data.
Returns:
X_train_scaled: the scaled training dataset.
X_test_scaled: the scaled test dataset.
scaler: the fitted scaler instance, kept for later use or inverse transformations.
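Assuming X_train and X_test from the train-test split shown earlier, a typical call looks like this; after standard scaling, the training columns have roughly zero mean and unit variance:
X_train_scaled, X_test_scaled, scaler = scale_features(X_train, X_test, method='standard')
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))  # approximately 0 and 1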
6. Complete Pipeline
Here's a complete pipeline that combines the helper functions defined above to handle multiple issues:
def preprocess_regression_data(X, y):
    # 1. Handle missing values
    X_clean = handle_missing_values(X, strategy='knn')
    # 2. Remove outliers
    X_clean, y_clean = remove_outliers_iqr(X_clean, y)
    # 3. Handle multicollinearity
    X_clean = handle_multicollinearity(X_clean)
    # 4. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X_clean, y_clean, test_size=0.2, random_state=42
    )
    # 5. Scale features
    X_train_scaled, X_test_scaled, _ = scale_features(
        X_train, X_test, method='robust'
    )
    # 6. Fit model
    model = LinearRegression()
    model.fit(X_train_scaled, y_train)
    # 7. Evaluate
    train_score = model.score(X_train_scaled, y_train)
    test_score = model.score(X_test_scaled, y_test)
    return model, (X_train_scaled, X_test_scaled), (train_score, test_score)
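Here is a hypothetical end-to-end run on a small synthetic DataFrame; the data is illustrative, but it exercises every step of the pipeline:
# Hypothetical synthetic data: two features and a linear target with noise
rng = np.random.default_rng(3)
X_df = pd.DataFrame({'a': rng.normal(size=200), 'b': rng.normal(size=200)})
y_vec = 2 * X_df['a'].to_numpy() - X_df['b'].to_numpy() + rng.normal(scale=0.5, size=200)

model, _, (train_r2, test_r2) = preprocess_regression_data(X_df, y_vec)
print(f"Train R²: {train_r2:.2f}, Test R²: {test_r2:.2f}")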
Conclusion
Linear regression is a powerful tool for predicting continuous values and understanding relationships between variables. While simple to implement, proper attention to assumptions and best practices is crucial for reliable results. The code examples provided above should give you a solid foundation for implementing linear regression in your own projects.
Key takeaways:
Always check and handle assumptions
Preprocess your data carefully
Use appropriate evaluation metrics
Consider the complete pipeline from data cleaning to model evaluation
Monitor for and address common issues like outliers and multicollinearity
Remember that while linear regression is one of the simpler machine learning algorithms, it can be very effective when used properly and its assumptions are met.