Linear Regression in Python: A Practical Guide


Linear regression is one of the fundamental algorithms in machine learning and statistics. This guide will walk you through implementing and understanding linear regression using Python, NumPy, scikit-learn, and matplotlib.

What is Linear Regression?

Linear regression models the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a linear equation to the observed data. The simplest form is:

$$y=mx+b$$

  • y: predicted value

  • x: feature value

  • m: slope (coefficient)

  • b: y-intercept
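
For example, with a slope of m = 3 and an intercept of b = 4, an input of x = 2 gives a prediction of y = 3 · 2 + 4 = 10.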

Assumptions of Linear Regression

Fitting your data with linear regression may not always be the best choice. Before using it, check that the following assumptions hold for your data (a short diagnostic sketch follows the list):

  1. Linearity: Relationship between X and y should be linear

  2. Independence: Observations should be independent

  3. Homoscedasticity: Constant variance in residuals

  4. Normality: Residuals should be normally distributed
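
A quick way to spot-check the last three assumptions is to inspect the residuals of a fitted model. The following is a minimal diagnostic sketch, assuming X is a 2-D feature array and y a 1-D target (both hypothetical here); it plots residuals against fitted values and runs a Shapiro-Wilk normality test.

import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Hypothetical data; replace X and y with your own arrays
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Linearity / homoscedasticity: residuals vs. fitted values should show no clear pattern
plt.scatter(model.predict(X), residuals)
plt.axhline(0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()

# Normality of residuals: a large p-value (e.g. > 0.05) is consistent with normality
statistic, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")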

Implementation with Python

Let's look at different ways to implement linear regression:

1. Using scikit-learn

Scikit-learn is a popular, open-source machine learning library for Python, designed for simplicity and efficiency, which makes it a good fit for beginners and researchers working on data-driven projects. It provides a broad set of tools for data analysis and machine learning, including easy-to-use implementations of common algorithms. For linear regression, scikit-learn is especially handy because it streamlines model training, validation, and performance evaluation.

The following code shows a typical use of scikit-learn for linear regression:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import matplotlib.pyplot as plt

# Generate sample data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Create and train the model
model = LinearRegression()
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)

# Print results
print(f"Coefficient: {model.coef_[0][0]:.2f}")
print(f"Intercept: {model.intercept_[0]:.2f}")
print(f"R² Score: {r2_score(y, y_pred):.2f}")
print(f"Mean Squared Error: {mean_squared_error(y, y_pred):.2f}")

# Visualize results
plt.scatter(X, y, color='blue', label='Data')
plt.plot(X, y_pred, color='red', label='Prediction')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()

2. Implementing from Scratch

Implementing linear regression from scratch is an invaluable exercise for deepening your understanding of this foundational algorithm. By coding the process manually, you see exactly how the slope and intercept are calculated. The snippet below defines a SimpleLinearRegression class that encapsulates fitting and prediction; a short usage example follows the class and its explanation.

class SimpleLinearRegression:
    def __init__(self):
        self.slope = None
        self.intercept = None

    def fit(self, X, y):
        # Calculate means
        x_mean = np.mean(X)
        y_mean = np.mean(y)

        # Calculate slope
        numerator = np.sum((X - x_mean) * (y - y_mean))
        denominator = np.sum((X - x_mean) ** 2)
        self.slope = numerator / denominator

        # Calculate intercept
        self.intercept = y_mean - self.slope * x_mean

    def predict(self, X):
        return self.slope * X + self.intercept

Key aspects:

  • Fitting the Model (fit method):

    • Calculates the means of the input features (X) and the target variable (y).

    • Determines the slope by computing the ratio of the covariance of X and y to the variance of X.

    • Calculates the intercept, which indicates where the regression line crosses the y-axis.

  • Making Predictions (predict method):

    • Uses the calculated slope and intercept to make predictions for new input data.
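
As a quick sanity check, here is a hypothetical usage of the class on synthetic data like that in the scikit-learn example above; the learned slope and intercept should land close to 3 and 4.

import numpy as np

# Synthetic data following y = 4 + 3x + noise
np.random.seed(42)
X = 2 * np.random.rand(100)
y = 4 + 3 * X + np.random.randn(100)

model = SimpleLinearRegression()
model.fit(X, y)
print(f"Slope: {model.slope:.2f}")          # expected near 3
print(f"Intercept: {model.intercept:.2f}")  # expected near 4
print(model.predict(np.array([0.5, 1.5])))  # predictions for new inputs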

Best Practices

Following best practices is essential for achieving reliable and efficient results. This section highlights three that matter most when fitting a regression model:

  1. Feature scaling is important to normalize data ranges.

  2. Splitting datasets into training and testing sets ensures effective evaluation.

  3. Using cross-validation provides a more robust assessment.

Implementing these practices can significantly improve the quality of your machine learning results. The snippets below demonstrate each technique.

  1. Feature Scaling: Scale features to similar ranges using StandardScaler or MinMaxScaler

     from sklearn.preprocessing import StandardScaler
     scaler = StandardScaler()
     X_scaled = scaler.fit_transform(X)
    
  2. Train-Test Split: Always split data into training and testing sets

     from sklearn.model_selection import train_test_split
     X_train, X_test, y_train, y_test = train_test_split(
         X, y, test_size=0.2, random_state=42
     )
    
  3. Cross-Validation: Use cross-validation for more robust evaluation

     from sklearn.model_selection import cross_val_score
     scores = cross_val_score(model, X, y, cv=5)
     print(f"Cross-validation scores: {scores}")
    

Evaluating the Model

To assess how well your fitted model performs, you can apply the following metrics (a short computation sketch follows the list):

  1. R-squared (R²): Measures the proportion of variance in the dependent variable explained by the independent variables.

    • R² = 1 indicates perfect fit

    • R² = 0 indicates the model doesn't explain any variability

  2. Mean Squared Error (MSE): Average squared difference between predicted and actual values.

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

     mse = np.mean((y_true - y_pred) ** 2)

  3. Root Mean Squared Error (RMSE): Square root of MSE, in the same units as the target variable.

     rmse = np.sqrt(mse)

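For reference, R² can also be computed directly from the residual and total sums of squares. Here is a minimal sketch with hypothetical y_true and y_pred arrays:

import numpy as np

# Hypothetical observed and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.9])

ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(f"R²: {r_squared:.3f}")
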
Handling Common Issues

If the evaluation results are unsatisfactory, check whether any of the following issues apply to your dataset.

1. Outliers

Outliers can significantly impact your regression line. Here are several approaches to handle them:

import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression, HuberRegressor
import pandas as pd

# Z-score method
def remove_outliers_zscore(X, y, threshold=3):
    z_scores = np.abs(stats.zscore(y))
    mask = z_scores < threshold
    return X[mask], y[mask]

# IQR method
def remove_outliers_iqr(X, y):
    Q1 = np.percentile(y, 25)
    Q3 = np.percentile(y, 75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    mask = (y >= lower_bound) & (y <= upper_bound)
    return X[mask], y[mask]

# Using robust regression
def fit_robust_regression(X, y):
    robust_reg = HuberRegressor()
    return robust_reg.fit(X, y)

Key aspects:

  • remove_outliers_zscore()

    • Calculates the Z-scores of the target variable y.

    • Removes data points where the absolute Z-score exceeds the specified threshold (default is 3).

  • remove_outliers_iqr(X, y)

    • Calculates the first (Q1) and third (Q3) quartiles of y.

    • Computes the Interquartile Range (IQR) and defines lower and upper bounds to filter outliers.

    • Removes data points outside the defined range.

  • fit_robust_regression()

    • Utilizes HuberRegressor from scikit-learn for fitting a robust regression model that is less sensitive to outliers compared to standard regression methods.
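
Putting these helpers together, here is a hypothetical usage sketch on synthetic data with a few injected outliers; the IQR filter drops the extreme points, while HuberRegressor keeps them but reduces their influence.

import numpy as np

# Hypothetical data with a few extreme target values
rng = np.random.default_rng(42)
X = rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)
y[:3] += 50  # inject outliers

X_clean, y_clean = remove_outliers_iqr(X, y)
print(f"Kept {len(y_clean)} of {len(y)} samples")

robust_model = fit_robust_regression(X, y)
print(f"Huber coefficient: {robust_model.coef_[0]:.2f}")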

2. Missing Values

Here are different strategies for handling missing values:

import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

def handle_missing_values(df, strategy='mean'):
    if strategy == 'mean':
        # Mean imputation
        imputer = SimpleImputer(strategy='mean')
        df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

    elif strategy == 'knn':
        # KNN imputation
        imputer = KNNImputer(n_neighbors=5)
        df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

    elif strategy == 'interpolate':
        # Linear interpolation
        df_imputed = df.interpolate(method='linear')

    return df_imputed

Key aspects:

  • Function: handle_missing_values(df, strategy='mean'):

    • Accepts a DataFrame df and a strategy parameter to handle missing values.
  • Imputation Strategies:

    • Mean Imputation:

      • Utilizes SimpleImputer with the strategy set to 'mean'.

      • Replaces missing values with the mean of each column.

    • KNN Imputation:

      • Uses KNNImputer with a specified number of neighbors (default is 5).

      • Replaces missing values based on the mean of the nearest neighbors.

    • Linear Interpolation:

      • Applies linear interpolation to estimate missing values based on existing data.
  • Output:

    • Returns an imputed DataFrame (df_imputed) with missing values handled according to the chosen strategy.
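
Here is a small, hypothetical example of the function in action on a DataFrame with a few missing entries (the column names are made up for illustration):

import numpy as np
import pandas as pd

# Hypothetical DataFrame with missing entries
df = pd.DataFrame({
    "area": [50.0, 60.0, np.nan, 80.0],
    "price": [150.0, np.nan, 210.0, 240.0],
})

print(handle_missing_values(df, strategy='mean'))
print(handle_missing_values(df, strategy='interpolate'))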

3. Non-linearity

When relationships are non-linear, you can transform features or use polynomial regression:

import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

def handle_nonlinearity(X, y, method='log', degree=2):
    if method == 'log':
        # Log transformation
        X_transformed = np.log1p(X)  # log1p handles zero values

    elif method == 'polynomial':
        # Polynomial features
        polynomial = PolynomialFeatures(degree=degree, include_bias=False)
        X_transformed = polynomial.fit_transform(X)

    elif method == 'box-cox':
        # Box-Cox transformation (requires strictly positive values)
        X_transformed, _ = stats.boxcox(X.flatten())
        X_transformed = X_transformed.reshape(-1, 1)

    # Fit model with transformed features
    model = LinearRegression()
    model.fit(X_transformed, y)

    return model, X_transformed

Key aspects:

  • Function: handle_nonlinearity():

    • Accepts feature data X, target variable y, a method parameter to address nonlinearity, and a degree for polynomial features.
  • Transformation Methods:

    • Log Transformation:

      • Applies a logarithmic transformation using np.log1p(X) to handle zero values.

      • Useful for reducing skewness in data.

    • Polynomial Features:

      • Generates polynomial features of a specified degree using PolynomialFeatures from scikit-learn.

      • Expands the feature space to capture non-linear relationships.

    • Box-Cox Transformation:

      • Applies the Box-Cox transformation to stabilize variance and make data more normally distributed.

      • Reshapes the transformed data to maintain its structure.

  • Model Fitting:

    • Fits a linear regression model using the transformed features (X_transformed) against the target variable y.
  • Output:

    • Returns the fitted model and the transformed feature set (X_transformed).
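
A hypothetical usage sketch, assuming a quadratic relationship between a single feature and the target:

import numpy as np

# Hypothetical data where y depends on X quadratically
rng = np.random.default_rng(42)
X = rng.uniform(0.5, 5.0, size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] ** 2 + rng.normal(size=100)

model, X_poly = handle_nonlinearity(X, y, method='polynomial', degree=2)
print(f"Coefficients: {model.coef_}")  # weights for x and x²
print(f"R² on transformed features: {model.score(X_poly, y):.3f}")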

4. Multicollinearity

Detect and handle correlated features:

def handle_multicollinearity(X, threshold=0.8):
    # Calculate correlation matrix
    df = pd.DataFrame(X)
    corr_matrix = df.corr().abs()

    # Find highly correlated features
    upper = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    to_drop = [column for column in upper.columns 
               if any(upper[column] > threshold)]

    # Remove one of each pair of highly correlated features
    X_filtered = df.drop(columns=to_drop)

    return X_filtered

def process_multicollinearity(X, y):
    # Calculate VIF for each feature
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    vif_data = pd.DataFrame()
    vif_data["Feature"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                       for i in range(X.shape[1])]

    # Remove features with high VIF (>5 is a common threshold)
    high_vif_features = vif_data[vif_data["VIF"] > 5]["Feature"]
    X_processed = X.drop(columns=high_vif_features)

    return X_processed, vif_data

Key aspects:

  • Function: handle_multicollinearity():

    • Accepts feature data X and a threshold for correlation to identify multicollinearity.

    • Correlation Matrix:

      • Computes the absolute correlation matrix of the features using df.corr().

      • Identifies pairs of highly correlated features by applying a mask to the upper triangle of the correlation matrix.

    • Feature Removal:

      • Creates a list of features to drop based on the threshold.

      • Filters the dataset (X_filtered) by removing one feature from each pair of highly correlated features.

  • Function: process_multicollinearity():

    • Calculates the Variance Inflation Factor (VIF) for each feature to quantify the degree of multicollinearity.

    • VIF Calculation:

      • Constructs a DataFrame containing the feature names and their corresponding VIF values.

      • Identifies features with a VIF greater than 5, which indicates high multicollinearity.

    • Feature Filtering:

      • Removes features with high VIF from the dataset.

      • Returns the processed feature set (X_processed) and the VIF data for further analysis.
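
A hypothetical usage sketch with three synthetic features, one of which is almost an exact copy of another (process_multicollinearity assumes X is a pandas DataFrame):

import numpy as np
import pandas as pd

# Hypothetical features: x3 is nearly identical to x1
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + rng.normal(scale=0.01, size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
y = 3 * x1 - 2 * x2 + rng.normal(size=200)

X_filtered = handle_multicollinearity(X, threshold=0.8)
print(f"Columns kept by correlation filter: {list(X_filtered.columns)}")

X_processed, vif_data = process_multicollinearity(X, y)
print(vif_data)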

5. Feature Scaling Issues

Proper scaling can improve model performance:

from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

def scale_features(X_train, X_test, method='standard'):
    if method == 'standard':
        scaler = StandardScaler()
    elif method == 'robust':
        scaler = RobustScaler()  # Less sensitive to outliers
    elif method == 'minmax':
        scaler = MinMaxScaler()

    # Fit on training data, transform both training and test
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    return X_train_scaled, X_test_scaled, scaler

Key aspects:

  • Purpose: The function scales features of training and test datasets to improve the performance of machine learning models.

  • Parameters:

    • X_train: The training dataset.

    • X_test: The test dataset.

    • method: The scaling method to use, which can be 'standard', 'robust', or 'minmax'.

  • Scaling Methods:

    • StandardScaler: Standardizes features by removing the mean and scaling to unit variance.

    • RobustScaler: Scales features using statistics that are robust to outliers (median and interquartile range).

    • MinMaxScaler: Scales features to a range between 0 and 1.

  • Process:

    • Creates an instance of the appropriate scaler based on the specified method.

    • Fits the scaler on the training data and transforms both the training and test data.

  • Returns:

    • X_train_scaled: Scaled training dataset.

    • X_test_scaled: Scaled test dataset.

    • scaler: The fitted scaler instance for potential future use or inverse transformations.
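
A brief, hypothetical usage sketch, assuming the data has already been split into training and test sets:

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and target
rng = np.random.default_rng(42)
X = rng.random((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_scaled, X_test_scaled, scaler = scale_features(X_train, X_test, method='robust')

print(X_train_scaled.mean(axis=0))  # features roughly centered after robust scaling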

6. Complete Pipeline

Here's a complete pipeline that handles multiple issues:

def preprocess_regression_data(X, y):
    # 1. Handle missing values
    X_clean = handle_missing_values(X, strategy='knn')

    # 2. Remove outliers
    X_clean, y_clean = remove_outliers_iqr(X_clean, y)

    # 3. Handle multicollinearity
    X_clean = handle_multicollinearity(X_clean)

    # 4. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X_clean, y_clean, test_size=0.2, random_state=42
    )

    # 5. Scale features
    X_train_scaled, X_test_scaled, _ = scale_features(
        X_train, X_test, method='robust'
    )

    # 6. Fit model
    model = LinearRegression()
    model.fit(X_train_scaled, y_train)

    # 7. Evaluate
    train_score = model.score(X_train_scaled, y_train)
    test_score = model.score(X_test_scaled, y_test)

    return model, (X_train_scaled, X_test_scaled), (train_score, test_score)
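
A hypothetical end-to-end usage sketch; the pipeline as written assumes X is a pandas DataFrame (so that handle_missing_values and handle_multicollinearity can work with column names) and y a NumPy array aligned with X.

import numpy as np
import pandas as pd

# Hypothetical dataset: two numeric features and a noisy linear target
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "size": rng.uniform(50, 200, 150),
    "rooms": rng.integers(1, 6, 150).astype(float),
})
y = 50 + 2.0 * X["size"].to_numpy() + 10.0 * X["rooms"].to_numpy() + rng.normal(scale=20, size=150)

model, (X_train_scaled, X_test_scaled), (train_score, test_score) = preprocess_regression_data(X, y)
print(f"Train R²: {train_score:.3f}")
print(f"Test R²: {test_score:.3f}")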

Conclusion

Linear regression is a powerful tool for predicting continuous values and understanding relationships between variables. While simple to implement, proper attention to assumptions and best practices is crucial for reliable results. The code examples provided above should give you a solid foundation for implementing linear regression in your own projects.

Key takeaways:

  1. Always check and handle assumptions

  2. Preprocess your data carefully

  3. Use appropriate evaluation metrics

  4. Consider the complete pipeline from data cleaning to model evaluation

  5. Monitor for and address common issues like outliers and multicollinearity

Remember that while linear regression is one of the simpler machine learning algorithms, it can be very effective when used properly and its assumptions are met.

