Understanding Linear Regression: Evaluation Metrics, Assumptions, and ML Pipeline

Manav Rastogi

Linear regression is a fundamental machine learning algorithm used for predicting continuous outcomes. To assess its performance and ensure its reliability, we evaluate various metrics, understand its assumptions, and implement it within a structured machine learning pipeline. This blog explores these aspects in detail, including evaluation metrics, model training considerations, and the ML pipeline.

1. Evaluation Metrics for Linear Regression

Evaluating a linear regression model involves quantifying how well the model explains the data and how close its predictions are to actual values. Key metrics include R-squared, Adjusted R-squared, and error-based metrics like MSE, MAE, and RMSE.

R-Squared (R²)

R-squared, also known as the coefficient of determination, measures the proportion of the total variation in the dependent variable (y) explained by the model.

  • Formula:
    R² = 1 - (RSS / TSS) = SSR / TSS (the second equality holds for ordinary least squares with an intercept, where TSS = SSR + RSS)
    Where:

    • TSS (Total Sum of Squares): Measures total variation in the dependent variable, calculated as Σ(y_actual - y_mean)².

    • SSR (Sum of Squares Regression): Represents the explained variation, calculated as Σ(y_pred - y_mean)².

    • RSS (Residual Sum of Squares) or SSE (Sum of Squared Errors): Represents the unexplained variation, calculated as Σ(y_actual - y_pred)².

  • Interpretation:

    • R² typically ranges from 0 to 1.

    • R² = 1 indicates a perfect model where SSR = TSS (all variation is explained).

    • R² = 0 indicates the model explains none of the variation.

    • Can R² be negative? Yes: R² is negative whenever the model fits worse than simply predicting the mean of y. This usually signals a badly misspecified model, or it can occur when R² is computed on out-of-sample data.

  • Purpose: R² quantifies the percentage of total variation explained by the model (SSR/TSS).
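
As a rough sketch of how these pieces fit together, the snippet below computes TSS, RSS, SSR, and R² by hand for a small set of actual and predicted values (the arrays are illustrative, not from any real dataset):

    import numpy as np

    # Illustrative actual and predicted values (hypothetical numbers)
    y_actual = np.array([3.0, 5.0, 7.0, 9.0])
    y_pred   = np.array([2.8, 5.3, 6.9, 9.2])

    tss = np.sum((y_actual - y_actual.mean()) ** 2)  # total variation
    rss = np.sum((y_actual - y_pred) ** 2)           # unexplained variation
    ssr = np.sum((y_pred - y_actual.mean()) ** 2)    # explained variation

    r2 = 1 - rss / tss
    print(r2)  # close to 1 because predictions track the actual values
    # Note: SSR / TSS equals this value only for an OLS fit with an intercept,
    # where TSS = SSR + RSS holds exactly.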

Adjusted R-Squared

R-squared never decreases when predictors are added, even if they are irrelevant. Adjusted R-squared penalizes the inclusion of unnecessary predictors.

  • Formula:
    Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - p - 1)]
    Where:

    • n = number of observations

    • p = number of predictors

  • Key Points:

    • Adjusted R² is always less than or equal to R².

    • As a common rule of thumb, the gap between R² and Adjusted R² should stay small (often quoted as under 5 percentage points); a larger gap may indicate overfitting or irrelevant predictors.

    • Adjusted R² accounts for the number of predictors, making it more reliable for models with multiple features.
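
A minimal sketch of the adjustment, assuming you already have R², the number of observations n, and the number of predictors p (the values below are hypothetical):

    def adjusted_r2(r2, n, p):
        # Penalize R² for the number of predictors p given n observations
        return 1 - (1 - r2) * (n - 1) / (n - p - 1)

    # Hypothetical values: R² = 0.85 with 100 observations and 5 predictors
    print(adjusted_r2(0.85, n=100, p=5))  # slightly below 0.85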

Error-Based Metrics

These metrics quantify the difference between predicted and actual values, focusing on the magnitude of errors.

Mean Squared Error (MSE)

  • Formula: MSE = (1/n) * Σ(y_actual - y_pred)²

  • Advantages:

    • Squaring errors prevents positive and negative errors from canceling each other out.

    • MSE is differentiable, making it suitable as a cost function for optimization.

    • It emphasizes larger errors due to squaring.

  • Disadvantages:

    • Not robust to outliers (large errors are magnified).

    • Units are squared (e.g., if y is in meters, MSE is in meters²), reducing interpretability.

Mean Absolute Error (MAE)

  • Formula: MAE = (1/n) * Σ|y_actual - y_pred|

  • Advantages:

    • Less sensitive to outliers since it uses absolute differences.

    • Same unit as the dependent variable, making it more interpretable.

    • Each error contributes equally to the total.

  • Disadvantages:

    • Non-differentiable at zero, which can complicate optimization.

    • Because the gradient of the absolute error has constant magnitude, gradient-based training can converge more slowly than with squared error.

Root Mean Squared Error (RMSE)

  • Formula: RMSE = √(MSE)

  • Advantages:

    • Same unit as the dependent variable, improving interpretability compared to MSE.

    • Differentiable and emphasizes larger errors (like MSE).

    • The square root brings the value back to the original scale, so a given large error is penalized less harshly than under MSE, though large errors are still emphasized.

  • Disadvantages:

    • Interpretation can be complex due to the emphasis on larger errors.

    • Still somewhat sensitive to outliers.

Units of Metrics

  • R² and Adjusted R²: Unitless (proportions).

  • MSE: Squared units of the dependent variable.

  • MAE and RMSE: Same units as the dependent variable.
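
Putting the error-based metrics together, here is a minimal NumPy sketch that computes MSE, MAE, and RMSE from the same kind of illustrative actual and predicted arrays used above:

    import numpy as np

    y_actual = np.array([3.0, 5.0, 7.0, 9.0])   # hypothetical targets
    y_pred   = np.array([2.8, 5.3, 6.9, 9.2])   # hypothetical predictions

    errors = y_actual - y_pred
    mse  = np.mean(errors ** 2)        # squared units of y
    mae  = np.mean(np.abs(errors))     # same units as y
    rmse = np.sqrt(mse)                # same units as y

    print(mse, mae, rmse)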

2. Time Consumption in Model Training

The time required to train a linear regression model depends on several factors:

  • Dataset Size: Larger datasets increase computation time due to matrix operations.

  • Number of Features: More features increase the complexity of solving the normal equations or optimizing the cost function.

  • Optimization Method: Analytical solutions (e.g., normal equations) are faster for small datasets, while iterative methods (e.g., gradient descent) may be slower but scale better for large datasets.

  • Hardware: Training on CPUs vs. GPUs or distributed systems affects speed.

  • Preprocessing: Feature scaling, handling missing values, or encoding categorical variables can add to the preprocessing time.

To minimize training time:

  • Use feature selection to reduce irrelevant predictors.

  • Standardize features to improve convergence in gradient-based methods.

  • Leverage optimized libraries like scikit-learn or NumPy for efficient computation.
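
To get a feel for these trade-offs, the sketch below times scikit-learn's closed-form LinearRegression against the iterative SGDRegressor on a synthetic dataset (the dataset size is purely illustrative, and actual timings depend on your hardware):

    import time
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression, SGDRegressor
    from sklearn.preprocessing import StandardScaler

    # Synthetic data: 100,000 rows, 50 features (illustrative sizes)
    X, y = make_regression(n_samples=100_000, n_features=50, noise=10.0, random_state=0)
    X = StandardScaler().fit_transform(X)  # scaling helps the iterative solver converge

    for model in (LinearRegression(), SGDRegressor(max_iter=1000)):
        start = time.perf_counter()
        model.fit(X, y)
        print(type(model).__name__, f"{time.perf_counter() - start:.3f}s")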

3. Assumptions of Linear Regression

Linear regression relies on the following assumptions to ensure valid results:

  1. Linearity: The relationship between independent variables (features) and the dependent variable is linear.

  2. Independence: Observations are independent of each other.

  3. Homoscedasticity: The variance of residuals is constant across all levels of the independent variables.

  4. Normality of Errors: Residuals (errors) are normally distributed (important for hypothesis testing and confidence intervals).

  5. No Multicollinearity: Features should not be highly correlated with each other, as this can destabilize coefficient estimates.

Violations of these assumptions may lead to biased or unreliable predictions. Diagnostic plots (e.g., residual plots, Q-Q plots) and tests (e.g., Durbin-Watson for independence) can help verify these assumptions.
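
As one way to check these assumptions in code, the sketch below uses statsmodels to compute the Durbin-Watson statistic on the residuals and variance inflation factors (VIF) for multicollinearity; a residual-vs-fitted plot (not shown) is the usual visual check for linearity and homoscedasticity. The data here is random and purely illustrative:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Assume X is a 2-D feature array and y the target (hypothetical data)
    X, y = np.random.rand(200, 3), np.random.rand(200)

    X_const = sm.add_constant(X)          # add intercept column
    ols = sm.OLS(y, X_const).fit()

    print("Durbin-Watson:", durbin_watson(ols.resid))  # ~2 suggests independent residuals

    # VIF per feature; values above roughly 5-10 suggest multicollinearity
    for i in range(1, X_const.shape[1]):  # skip the intercept column
        print(f"VIF feature {i}:", variance_inflation_factor(X_const, i))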

4. Machine Learning Pipeline

A robust ML pipeline ensures reproducibility, scalability, and ease of deployment. Below is a typical pipeline for linear regression:

Steps

  1. Data Collection and Cleaning:

    • Handle missing values, remove duplicates, and encode categorical variables.

  2. Feature Engineering:

    • Create new features, select relevant ones, and remove highly correlated features to avoid multicollinearity.

  3. Feature Standardization:

    • Scale features (e.g., using StandardScaler) to have zero mean and unit variance, improving model convergence.

    • Code Example:

        from sklearn.preprocessing import StandardScaler
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
      
  4. Model Training:

    • Train the linear regression model using a library like scikit-learn.

    • Code Example:

        from sklearn.linear_model import LinearRegression
        model = LinearRegression()
        model.fit(X_scaled, y)
      
  5. Model Evaluation:

    • Compute R², Adjusted R², MSE, MAE, and RMSE on training and test sets.

    • Code Example:

        from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
        # X_test_scaled and y_test come from a held-out test set,
        # scaled with the scaler fitted on the training data
        y_pred = model.predict(X_test_scaled)
        r2 = r2_score(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)
        mae = mean_absolute_error(y_test, y_pred)
        rmse = mse ** 0.5  # square root of MSE; works across scikit-learn versions
      
  6. Model Serialization:

    • Save the trained model and scaler for deployment using pickle or joblib.

    • Code Example:

        import pickle
        with open('model.pkl', 'wb') as f:
            pickle.dump(model, f)
        with open('scaler.pkl', 'wb') as f:
            pickle.dump(scaler, f)
      
  7. Model Deployment:

    • Load the model and scaler, standardize new data, and make predictions.

    • Code Example:

        with open('model.pkl', 'rb') as f:
            model = pickle.load(f)
        with open('scaler.pkl', 'rb') as f:
            scaler = pickle.load(f)
        X_new_scaled = scaler.transform(X_new)
        predictions = model.predict(X_new_scaled)
      

Why Standardization?

  • Ensures features are on the same scale, improving convergence in gradient-based optimization.

  • Prevents features with larger magnitudes from dominating the model.

Why Pickle?

  • Pickle serializes Python objects, allowing easy storage and retrieval of trained models and preprocessors.

  • Ensures consistency between training and deployment environments.
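
Since the serialization step above also mentions joblib, here is a minimal sketch of the same save-and-load flow with joblib, which is often preferred for objects holding large NumPy arrays (file names are illustrative):

    import joblib

    # Save the trained model and scaler
    joblib.dump(model, 'model.joblib')
    joblib.dump(scaler, 'scaler.joblib')

    # Load them back at deployment time
    model = joblib.load('model.joblib')
    scaler = joblib.load('scaler.joblib')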
