Building and Testing a Linear Regression Model from Scratch!

Anshuman Singh
5 min read

Introduction

Linear regression is often the first algorithm introduced to those entering the world of machine learning. But instead of just importing a model from scikit-learn, what if you built your own from scratch and deeply understood how it works under the hood?

In this blog, I walk through the complete mathematical foundation, the code implementation, and testing methodology for a Linear Regression model using NumPy, and finally compare it against scikit-learn's version.

Whether you're a student, an ML beginner, or someone trying to strengthen their intuition, this guide is meant for you.


Part 1: The Math Behind Linear Regression

Given a dataset with multiple input features \(X \in \mathbb{R}^{m \times n} \) and a target vector \(y \in \mathbb{R}^{m}\), the objective is to find coefficients \(\beta \in \mathbb{R}^{n}\) such that:

$$y=X\beta + \epsilon$$

The optimal coefficients minimize the residual sum of squares \(\|y - X\beta\|_2^2\). Setting its gradient with respect to \(\beta\) to zero yields the normal equations, whose closed-form solution is:

$$\hat{\beta} = (X^TX)^{-1}X^Ty$$
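Explicitly, differentiating the residual sum of squares and setting the result to zero gives

$$-2X^{T}(y - X\beta) = 0 \quad\Longrightarrow\quad X^{T}X\,\hat{\beta} = X^{T}y,$$

and solving for \(\hat{\beta}\) yields the closed form above whenever \(X^{T}X\) is invertible.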

If \(X^{T}X\) is not invertible (e.g., due to perfect multicollinearity), we fall back to the Moore-Penrose pseudo-inverse:

$$\hat{\beta} = X^{+}y$$

The pseudo-inverse \(X^{+}\) can be computed via Singular Value Decomposition (SVD).
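Concretely, if \(X = U\Sigma V^{T}\) is the thin SVD of \(X\), then

$$X^{+} = V\Sigma^{+}U^{T},$$

where \(\Sigma^{+}\) inverts the non-zero singular values and leaves the zero (or near-zero) ones at zero. This is exactly the fallback the implementation below uses.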


Part 2: Building the Model in Python

import numpy as np

class LinearRegression:
    def __init__(self):
        self.coef_ = None
        self.intercept_ = None

    def fit(self, X, Y):
        X = np.asarray(X, dtype=float)
        Y = np.asarray(Y, dtype=float)
        if X.ndim == 1:
            X = X.reshape(-1, 1)

        # Prepend a column of ones so the first coefficient is the intercept
        X = np.insert(X, 0, 1, axis=1)

        try:
            XT_X = X.T @ X
            if np.linalg.matrix_rank(XT_X) == XT_X.shape[0]:
                # Full rank: solve the normal equations directly
                beta = np.linalg.inv(XT_X) @ X.T @ Y
            else:
                raise np.linalg.LinAlgError
        except np.linalg.LinAlgError:
            # Rank-deficient: fall back to the SVD-based pseudo-inverse
            def stable_pinv(A, tol=1e-10):
                U, s, Vt = np.linalg.svd(A, full_matrices=False)
                s_inv = np.zeros_like(s)
                s_inv[s > tol] = 1.0 / s[s > tol]  # invert only non-negligible singular values
                return Vt.T @ np.diag(s_inv) @ U.T
            beta = stable_pinv(X) @ Y

        beta = beta.flatten()
        self.intercept_ = beta[0]
        self.coef_ = beta[1:]
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        if X.ndim == 1:
            X = X.reshape(-1, 1)
        return X @ self.coef_ + self.intercept_
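
Before running the structured test suite, a quick smoke test against scikit-learn is reassuring. This is a minimal sketch with made-up data, assuming the class above is already defined:

from sklearn.linear_model import LinearRegression as SklearnLR

# Toy data: y = 3*x0 - 2*x1 + 5 exactly
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, -2.0]) + 5.0

my_model = LinearRegression().fit(X, y)
skl_model = SklearnLR().fit(X, y)

print(my_model.coef_, my_model.intercept_)    # ≈ [ 3. -2.], 5.0
print(skl_model.coef_, skl_model.intercept_)  # should match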

Part 3: Testing on Synthetic Datasets

I created 10 small test cases, each targeting a different dataset behavior:

| Dataset Behavior | My Model's R² Score | Sklearn Model's R² Score |
| --- | --- | --- |
| Perfectly Linear Dataset | 1.0000 | 1.0000 |
| Perfect Multicollinearity | 1.0000 | 1.0000 |
| Noisy Linear Dataset | 1.0000 | 1.0000 |
| Under-determined System (more features than samples) | 1.0000 | 1.0000 |
| Over-determined System (more samples than features) | 1.0000 | 1.0000 |
| Linearly Non-Separable (non-linear function) | 0.9742 | 0.9742 |
| Constant Feature (zero variance in one column) | 1.0000 | 1.0000 |
| Negative Coefficients | 1.0000 | 1.0000 |
| Very Small Magnitudes | 1.0000 | 1.0000 |
| Very Large Magnitudes | 1.0000 | 1.0000 |

Each dataset was tested using both our model and scikit-learn's LinearRegression. The results were identical in terms of R² score, even under tough conditions like high noise or multicollinearity.
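
The full test harness isn't reproduced here, but as an illustration, here is a sketch of how one of the trickier cases (perfect multicollinearity) can be constructed; the exact generation code in my notebook may differ:

from sklearn.metrics import r2_score

# Perfect multicollinearity: the second column is exactly twice the first,
# so X^T X is singular and the SVD fallback kicks in
x0 = np.linspace(0, 10, 30)
X = np.column_stack([x0, 2 * x0])
y = 4 * x0 + 1

model = LinearRegression().fit(X, y)
print(r2_score(y, model.predict(X)))  # ≈ 1.0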


Part 4: Testing on Traditional Real-World Datasets

After confirming that the model matches scikit-learn across these behavioral test cases, it was time to put it into practice on more realistic data.

For this, I used 5 datasets, 3 of which come from scikit-learn's datasets module:

  1. load_diabetes(), a real-world dataset often modeled linearly

  2. make_regression(), a synthetic linear dataset generated by scikit-learn

  3. make_friedman1(), a synthetic non-linear benchmark

  4. A custom dataset simulating a polynomial trend

  5. A custom dataset simulating multicollinearity

As expected, the results for these datasets were again identical:

| Dataset | My Model's R² Score | Sklearn Model's R² Score |
| --- | --- | --- |
| load_diabetes() | 0.99 | 0.99 |
| make_regression() | 0.55 | 0.55 |
| make_friedman1() | 0.61 | 0.61 |
| Polynomial dataset | 0.88 | 0.88 |
| Multicollinearity | 0.68 | 0.68 |
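
The evaluation loop itself is simple. Here is a sketch using the three scikit-learn datasets; the split sizes, noise levels, and seeds are illustrative assumptions, so the scores won't exactly reproduce the table above:

from sklearn.datasets import load_diabetes, make_regression, make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

datasets = {
    "load_diabetes": load_diabetes(return_X_y=True),
    "make_regression": make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0),
    "make_friedman1": make_friedman1(n_samples=200, noise=1.0, random_state=0),
}

for name, (X, y) in datasets.items():
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_train, y_train)
    print(f"{name}: R² = {r2_score(y_test, model.predict(X_test)):.2f}")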

Part 5: Visualizing the Results

For each test dataset, I created the following plots:

  1. Prediction vs. Actual Output

  2. Error Distribution

  3. Model Fit Comparison between our implementation and scikit-learn

Each plot shows a side-by-side comparison of the performance of our model and the sklearn model.

Plot Code Snippet:

import os

import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression as SklearnLR
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def plot_comparison(X, y, my_model, title="Dataset"):
    if X.ndim == 1:
        X = X.reshape(-1, 1)

    # Split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

    # Fit both models on the training split
    my_model = my_model.fit(X_train, y_train)
    skl_model = SklearnLR().fit(X_train, y_train)

    # Predictions on the test split
    y_pred_my = my_model.predict(X_test)
    y_pred_skl = skl_model.predict(X_test)

    # Side-by-side scatter plots of predicted vs. true values
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    fig.suptitle(title, fontsize=14, fontweight='bold')

    for ax, preds, label in zip(axes, [y_pred_my, y_pred_skl], ['My Model', 'Sklearn Model']):
        ax.scatter(y_test, preds, alpha=0.6)
        ax.plot(y_test, y_test, 'r--')  # ideal y = x line
        ax.set_title(f"{label} (R² = {r2_score(y_test, preds):.2f})")
        ax.set_xlabel("True y")
        ax.set_ylabel("Predicted y")

    plt.tight_layout()
    os.makedirs("plots", exist_ok=True)  # ensure the output directory exists
    plt.savefig(f"plots/{title.replace(' ', '_')}.png")
    plt.show()
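
A call then looks like this (using the diabetes data as an illustrative example):

from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
plot_comparison(X, y, LinearRegression(), title="Diabetes Dataset")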

Key Learnings

  • Linear Regression is interpretable, fast, and powerful when assumptions are met.

  • Using SVD and the pseudo-inverse makes it robust to rank-deficient data.

  • Comparing with scikit-learn helps validate your understanding and your math.

  • Visualization is essential to showcase performance and detect edge cases.


Final Thoughts

You don’t truly understand a machine learning model until you build it from scratch.

In this blog, you've explored everything from the mathematical basics of linear regression to its real implementation and testing. I hope this knowledge helps you not only implement linear regression but also debug, improve, and trust it.

Want the full notebook and plots? Feel free to reach out or check out the GitHub repo.

Thank you for reading this blog.

Happy learning!


References

  • Muthukrishnan – Math behind Linear Regression with Python code

  • GeeksforGeeks – Singular Value Decomposition (SVD): https://www.geeksforgeeks.org/machine-learning/singular-value-decomposition-svd/

  • GeeksforGeeks – Moore-Penrose Pseudoinverse | Mathematics
