Meet Linear Regression - Your First Step into ML


Imagine trying to predict how far your paper plane will fly based on how hard you throw it. You jot down throw strength vs. distance each time. Soon, a pattern emerges—but it’s not perfect. You wish there was a way to summarize all those messy attempts with a single, clear rule.
That’s what Linear Regression does: it finds the best straight line through the noise.
The Story Begins:
Long before “machine learning” became a buzzword, statisticians in the early 19th century were already trying to make sense of the world with numbers.
Can we find a mathematical relationship between variables?
In 1805, Adrien-Marie Legendre introduced a technique now known as the method of least squares—a way to draw the best-fitting line through a set of scattered data points.
Then in 1809, Carl Friedrich Gauss expanded it with a probabilistic interpretation.
Their goal?
If I have two variables—say, study time and exam scores—can I predict one from the other?
The Problem They Were Solving:
The world is full of patterns—but also noise. Before regression, prediction was guesswork. Linear regression answered this practical question:
“Given this input (X), what’s the most likely output (Y)?”
But the twist was that it didn’t just fit a line—it optimized the line.
The idea was simple yet powerful: Minimize the sum of squared differences between actual and predicted values.
This gave rise to one of the earliest and most fundamental tools in data science: Linear Regression.
So What Did They Do?
They introduced:
A line equation: $$y = mx + b$$
A loss function to quantify how wrong the model is (a tiny worked example follows this list): $$\text{Loss} = \sum (y_{\text{true}} - y_{\text{pred}})^2$$
An optimization process (minimizing that loss) to find the “best” m and b.
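To see what that loss measures, here is a tiny sketch in Python. The candidate line y = 2x + 1 and the three data points are made up purely for illustration:

```python
# Candidate line y = 2x + 1 and three made-up (x, y) observations.
points = [(1, 3.2), (2, 4.8), (3, 7.1)]
m, b = 2, 1

# Sum of squared differences between actual and predicted values.
loss = sum((y - (m * x + b)) ** 2 for x, y in points)
print(loss)  # 0.2^2 + 0.2^2 + 0.1^2 ≈ 0.09
```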
Eventually, this laid the groundwork for more advanced algorithms—like Gradient Boosting, Neural Networks, and beyond.
But it all started with one question:
“What’s the best straight line through my data?”
Linear Regression:
Linear Regression is a method to find the best-fitting straight line through a set of data points. It works by minimizing the total error (specifically the squared error) between the real values and the predicted ones.
Real-world analogy
You just baked 10 cookies 🍪 and guessed how much each one weighs.
But your friend measured the actual weights with a scale. Now your goal is:
Make your guesses match the real weights as closely as possible.
You’re trying to guess a pattern like this:
$$\hat{y} = w \cdot x + b$$
Where:
- x = cookie size
- w = how much each unit of size adds to the weight (the slope)
- b = your starting guess or magic number (the intercept)
- ŷ (written y_pred in the code below) = your prediction of the cookie's weight
- y = the actual weight
So then what do you do?
As you try different values for w and b, you ask:
“If I change w or b a little, does my guess get closer to the real answer, or further away?”
You’re not just guessing randomly — you’re learning how to guess better.
This is where gradient descent comes in:
👉 It's a smart way to nudge your guesses in the right direction.
Example
Suppose the cookie size is x = 5, and you try:
$$\hat{y} = 2 \cdot 5 + 0 = 10$$
But the real weight y is 13. Your guess was too low.
So you think:
“Maybe 2 is too small for w. Let me try w = 2.5.”
$$\hat{y} = 2.5 \cdot 5 + 0 = 12.5$$
Now it’s closer to 13 — awesome!
You keep adjusting, little by little, until you find the best value for w.
This adjusting process is exactly what gradient descent does automatically using math.
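That trial-and-error loop is easy to write down in code. Here is a minimal sketch of just the two manual guesses above (no gradient descent yet):

```python
x, y = 5, 13           # one cookie: size 5, actual weight 13

for w in (2.0, 2.5):   # the two guesses tried above
    y_hat = w * x + 0  # b stays at 0 for now
    print(f"w = {w}: prediction = {y_hat}, error = {y - y_hat}")
# w = 2.0: prediction = 10.0, error = 3.0
# w = 2.5: prediction = 12.5, error = 0.5
```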
Gradient Descent in 3 Steps
Step 1: Measure How Bad Your Guess Is (Loss Function)
For every cookie, we calculate the error:
$$\text{Error}= y - \hat{y}$$
Since errors can be negative or positive, we square them:
$$\text{Error}^2 = (y - \hat{y})^2$$
Now, we add them all up and average them:
$$\text{Loss} = \frac{1}{n} \sum (y - \hat{y})^2$$
This is called Mean Squared Error (MSE) — it’s your "badness score." The lower, the better.
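With numpy, the MSE is a one-liner. A small sketch, using made-up guesses for five cookies:

```python
import numpy as np

y      = np.array([13.0, 15.0, 9.0, 11.0, 14.0])  # actual weights (made up)
y_pred = np.array([10.0, 16.0, 8.0, 12.0, 12.0])  # your guesses

mse = np.mean((y - y_pred) ** 2)
print(mse)  # (9 + 1 + 1 + 1 + 4) / 5 = 3.2
```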
Step 2: Find Which Direction to Adjust (The Gradient)
You now ask the math:
"If I change w or b just a little, will the loss go up or down?"
This is where we use derivatives, which are just a way to find the slope of the loss:
$$\frac{\partial \text{Loss}}{\partial w} = \frac{-2}{n} \sum x(y - \hat{y})$$
$$\frac{\partial \text{Loss}}{\partial b} = \frac{-2}{n} \sum (y - \hat{y})$$
These tell us:
- How much the loss changes if you change w or b.
- Which direction to move to make the loss smaller.
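Those two formulas translate almost symbol-for-symbol into numpy. A sketch, reusing made-up cookie data and a candidate w and b:

```python
import numpy as np

x = np.array([5.0, 6.0, 4.0, 4.5, 5.5])      # cookie sizes (made up)
y = np.array([13.0, 15.0, 9.0, 11.0, 14.0])  # actual weights (made up)
w, b = 2.0, 0.0                              # current guesses
n = len(x)

y_hat = w * x + b
dw = (-2 / n) * np.sum(x * (y - y_hat))  # dLoss/dw
db = (-2 / n) * np.sum(y - y_hat)        # dLoss/db
print(dw, db)  # both negative here, so w and b should increase
```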
Step 3: Take a Small Step
Once you know the right direction, update your guess just a little:
$$w = w - \text{learning rate} \cdot \text{gradient of w}$$
$$b = b - \text{learning rate} \cdot \text{gradient of b}$$
The learning rate controls how big your steps are. Repeat this many times, and each time:
- Your guesses get better.
- Your loss gets smaller.
- Your line fits the data more closely!
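Put together, a single update step looks like this (a sketch with the same made-up data as in Step 2; the full training loop comes in the next section):

```python
import numpy as np

x = np.array([5.0, 6.0, 4.0, 4.5, 5.5])
y = np.array([13.0, 15.0, 9.0, 11.0, 14.0])
w, b = 2.0, 0.0
n = len(x)
learning_rate = 0.01

# Gradients from Step 2, then one small step against them.
y_hat = w * x + b
dw = (-2 / n) * np.sum(x * (y - y_hat))
db = (-2 / n) * np.sum(y - y_hat)
w = w - learning_rate * dw
b = b - learning_rate * db
print(w, b)  # 2.25, 0.048: both nudged toward better values
```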
Linear Regression Using Gradient Descent (with code + visualization)
Here is a step-by-step implementation of linear regression from scratch using gradient descent in Python.
Note:
This example uses perfectly clean data (no noise) to clearly show how Linear Regression and Gradient Descent work. In real-world projects, data usually contains noise, which makes the problem more complex.
```python
# Import the necessary libraries
import numpy as np               # numpy generates and holds the data
import matplotlib.pyplot as plt  # matplotlib plots the best-fitting line

# Step 1: Generate data. X is the independent variable used to predict
# the dependent variable y.
X = np.array([1, 2, 3, 4, 5])
y = np.array([3, 5, 7, 9, 11])

# Step 2: Initialize parameters. The slope w and intercept b from the
# equation are unknown, so we start both at zero.
w = 0.0
b = 0.0

# Step 3: Set hyperparameters. These control the learning process
# during training.
learning_rate = 0.01
epochs = 1000
n = len(X)

# Step 4: Gradient descent training
for epoch in range(epochs):
    # Predict
    y_pred = w * X + b

    # Compute gradients
    dw = (-2 / n) * np.sum(X * (y - y_pred))
    db = (-2 / n) * np.sum(y - y_pred)

    # Update parameters
    w = w - learning_rate * dw
    b = b - learning_rate * db

    # Print loss every 100 steps
    if epoch % 100 == 0:
        loss = np.mean((y - y_pred) ** 2)
        print(f"Epoch {epoch}: Loss = {loss:.4f}, w = {w:.4f}, b = {b:.4f}")

# Step 5: Output the learned parameters
print("\nFinal Learned Parameters:")
print(f"Slope (w): {w}")
print(f"Intercept (b): {b}")

# Plot the data points as a scatterplot, plus the learned line
plt.scatter(X, y, color='blue', label='Original Data')
plt.plot(X, w * X + b, color='red', label='Learned Line')
plt.title('Linear Regression with Gradient Descent')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.show()
```
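Because the five data points lie exactly on the line y = 2x + 1, the learned parameters should land very close to w ≈ 2 and b ≈ 1, and the red learned line should pass straight through every blue dot.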
If you're curious about the math behind gradient descent and how linear regression works, check out this video: