Building and Testing a Linear Regression Model from Scratch!

Introduction
Linear regression is often the first algorithm introduced to those entering the world of machine learning. But instead of just importing a model from scikit-learn, what if you built your own from scratch and deeply understood how it works under the hood?
In this blog, I walk through the complete mathematical foundation, the code implementation, and the testing methodology for a Linear Regression model using NumPy, and finally compare it against scikit-learn's version.
Whether you're a student, an ML beginner, or someone trying to strengthen their intuition, this guide is meant for you.
Part 1: The Math Behind Linear Regression
Given a dataset with multiple input features \(X \in \mathbb{R}^{m \times n} \) and a target vector \(y \in \mathbb{R}^{m}\), the objective is to find coefficients \(\beta \in \mathbb{R}^{n}\) such that:
$$y=X\beta + \epsilon$$
The optimal coefficients are those that minimize the residual sum of squares \(\lVert y - X\beta \rVert^{2}\); setting its gradient to zero yields the closed-form (normal equation) solution:
$$\hat{\beta} = (X^{T}X)^{-1}X^{T}y$$
If \(X^{T}X\) is non-invertible (e.g., due to multicollinearity), we fall back to the Moore-Penrose pseudo-inverse: \(\hat{\beta} = X^{+}y\), which can be computed via Singular Value Decomposition (SVD).
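Concretely, if \(X = U\Sigma V^{T}\) is the (thin) SVD of \(X\), the pseudo-inverse is obtained by transposing the factors and inverting only the non-zero singular values:
$$X^{+} = V\,\Sigma^{+}U^{T}, \qquad \Sigma^{+}_{ii} = \begin{cases} 1/\sigma_{i} & \text{if } \sigma_{i} > 0 \\ 0 & \text{otherwise} \end{cases}$$
This is the computation the SVD fallback in the implementation below performs, with a small tolerance used to treat near-zero singular values as zero.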
Part 2: Building the Model in Python
import numpy as np

class LinearRegression:
    def __init__(self):
        self.coef_ = None
        self.intercept_ = None

    def fit(self, X, Y):
        X = np.asanyarray(X)
        Y = np.asarray(Y)
        # Prepend a column of ones so the first coefficient acts as the intercept
        X = np.insert(X, 0, 1, axis=1)
        try:
            XT_X = X.T @ X
            if np.linalg.matrix_rank(XT_X) == XT_X.shape[0]:
                # Normal equation: beta = (X^T X)^{-1} X^T y
                beta = np.linalg.inv(XT_X) @ X.T @ Y
            else:
                raise np.linalg.LinAlgError("X^T X is rank-deficient")
        except np.linalg.LinAlgError:
            # Fall back to the Moore-Penrose pseudo-inverse via SVD
            def stable_pinv(A, tol=1e-10):
                U, s, Vt = np.linalg.svd(A, full_matrices=False)
                s_inv = np.zeros_like(s)
                s_inv[s > tol] = 1.0 / s[s > tol]
                return Vt.T @ np.diag(s_inv) @ U.T
            beta = stable_pinv(X) @ Y
        beta = beta.flatten()
        self.intercept_ = beta[0]
        self.coef_ = beta[1:]
        return self

    def predict(self, X):
        X = np.asarray(X)
        coef = np.ravel(self.coef_)
        return X @ coef + self.intercept_
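As a quick sanity check (a minimal sketch, not part of the test suite described below), the class can be fitted on noise-free data with known coefficients:

# Illustrative sanity check: recover known coefficients on noise-free data
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0]])
y = 4.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]   # y = 4 + 2*x1 - x2

model = LinearRegression().fit(X, y)
print(model.intercept_)   # expected to be close to 4.0
print(model.coef_)        # expected to be close to [2.0, -1.0]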
Part 3: Testing on Synthetic Datasets
I created 10 small test cases, each targeting different dataset behavior:
| Dataset Behavior | My Model's R² Score | Sklearn Model's R² Score |
|---|---|---|
| Perfectly Linear Dataset | 1.0000 | 1.0000 |
| Perfect Multicollinearity | 1.0000 | 1.0000 |
| Noisy Linear Dataset | 1.0000 | 1.0000 |
| Under-determined System (more features than samples) | 1.0000 | 1.0000 |
| Over-determined System (more samples than features) | 1.0000 | 1.0000 |
| Linearly Non-Separable (non-linear function) | 0.9742 | 0.9742 |
| Constant Feature (zero variance in one column) | 1.0000 | 1.0000 |
| Negative Coefficients | 1.0000 | 1.0000 |
| Very Small Magnitudes | 1.0000 | 1.0000 |
| Very Large Magnitudes | 1.0000 | 1.0000 |
Each dataset was tested using both our model and scikit-learn's LinearRegression. The results were identical in terms of R² score, even under tough conditions like high noise or multicollinearity.
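The exact generators for these test cases aren't reproduced here, but as an illustration, a sketch of the perfect-multicollinearity case might look like this (it assumes the LinearRegression class from Part 2 is in scope, with scikit-learn's estimator imported under an alias to avoid the name clash):

import numpy as np
from sklearn.linear_model import LinearRegression as SklearnLinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
x1 = rng.normal(size=50)
X = np.column_stack([x1, 2.0 * x1])   # second feature is an exact multiple of the first
y = 3.0 * x1 + 1.0                    # target depends only on the underlying signal

my_r2 = r2_score(y, LinearRegression().fit(X, y).predict(X))        # custom class from Part 2
skl_r2 = r2_score(y, SklearnLinearRegression().fit(X, y).predict(X))
print(my_r2, skl_r2)                  # both should be ~1.0000 despite the singular X^T X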
Part 4: Testing on Traditional Real-World Datasets
After confirming that the model performs identically across these synthetic behaviors, it was time to put it into practice with real-world data.
For this, I used 5 datasets, 3 of which are preloaded from scikit-learn's datasets module:
- load_diabetes(), simulating a clean linear dataset
- make_regression()
- make_friedman1(), simulating a non-linear dataset
- A dataset simulating a polynomial trend
- A dataset simulating multicollinearity
As expected, the results for these datasets were identical:
| Dataset | My Model's R² Score | Sklearn Model's R² Score |
|---|---|---|
| load_diabetes() | 0.99 | 0.99 |
| make_regression() | 0.55 | 0.55 |
| make_friedman1() | 0.61 | 0.61 |
| Polynomial dataset | 0.88 | 0.88 |
| Multicollinearity | 0.68 | 0.68 |
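For reference, here is a sketch of how one of these comparisons might be run on load_diabetes(); the split ratio and random seed are illustrative assumptions, not the original notebook settings:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression as SklearnLinearRegression
from sklearn.metrics import r2_score

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

my_model = LinearRegression().fit(X_train, y_train)        # custom class from Part 2
skl_model = SklearnLinearRegression().fit(X_train, y_train)

print(r2_score(y_test, my_model.predict(X_test)))
print(r2_score(y_test, skl_model.predict(X_test)))         # the two scores should match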
Part 5: Visualizing the Results
For each test dataset, I created the following plots:
- Prediction vs. Actual Output
- Error Distribution
- Model Fit Comparison between our implementation and scikit-learn
Each plot shows a side-by-side comparison of the performance of our model and the sklearn model.
Plot Code Snippet:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression as SklearnLinearRegression
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

def plot_comparison(X, y, my_model, title="Dataset"):
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    # Split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
    # Fit both models on the training split
    my_model = my_model.fit(X_train, y_train)
    skl_model = SklearnLinearRegression().fit(X_train, y_train)
    # Predictions on the test split
    y_pred_my = my_model.predict(X_test)
    y_pred_skl = skl_model.predict(X_test)
    # Plot predicted vs. true values side by side
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    fig.suptitle(title, fontsize=14, fontweight='bold')
    for ax, preds, label in zip(axes, [y_pred_my, y_pred_skl], ['My Model', 'Sklearn Model']):
        ax.scatter(y_test, preds, alpha=0.6)
        ax.plot(y_test, y_test, 'r--')  # ideal line: predicted == true
        ax.set_title(f"{label} (R² = {r2_score(y_test, preds):.2f})")
        ax.set_xlabel("True y")
        ax.set_ylabel("Predicted y")
    plt.tight_layout()
    plt.savefig(f"plots/{title.replace(' ', '_')}.png")  # assumes a plots/ directory exists
    plt.show()
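A call to this function might look like the following (illustrative only; the dataset and title are placeholders):

from sklearn.datasets import make_regression

# Hypothetical example: generate a noisy linear dataset and compare both fits
X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=42)
plot_comparison(X, y, LinearRegression(), title="Noisy Linear Dataset")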
Key Learnings
- Linear Regression is interpretable, fast, and powerful when its assumptions are met.
- Using SVD and the pseudo-inverse makes it robust to rank-deficient data.
- Comparing with scikit-learn helps validate your understanding and your math.
- Visualization is essential to showcase performance and detect edge cases.
Final Thoughts
You don’t truly understand a machine learning model until you build it from scratch.
In this blog, you've explored everything from the mathematical basics of linear regression to its real implementation and testing. I hope this knowledge helps you not only implement linear regression but also debug, improve, and trust it.
Want the full notebook and plots? Feel free to reach out or check out the GitHub repo.
Thank you for reading this blog.
Happy learning!
References
- Math behind Linear Regression with Python code – Muthukrishnan
- Singular Value Decomposition (SVD) – GeeksforGeeks: https://www.geeksforgeeks.org/machine-learning/singular-value-decomposition-svd/