Can we predict things with a line?


So, linear regression. It's not flashy, it doesn't generate code or drive cars. It's just a really judgmental line - one that tries to predict the future based on a bunch of past data.
At its core, linear regression is about finding the simplest possible relationship between two things. Like:
Does more study time → better grades?
Does eating out more → more money saved? (no it does not, I can tell you from personal experience)
Does higher temperature → more ice cream sales?
It takes those dots - the scattered, messy, real-world data - and tries to draw the best straight line through them. A line that captures the trend.
This model doesn’t guess wildly. It learns by minimizing its own regret (read loss).
And what makes it powerful - despite its simplicity - is that it introduces you to everything that matters in machine learning:
Features (the input/independent variable)
Labels (the output/dependent variable)
Loss functions (how wrong your model is)
Optimization (how it gets better)
Generalization (whether it actually works on new data)
What is Linear Regression actually doing?
In linear regression, the model assumes there’s a linear relationship between the input (x) and the output (y). That means: as x increases, y changes in a way that can be captured by a straight line. It’s not always true - but for now, we pretend it is.
That relationship looks like this:
ŷ = w₁x + w₀
Looks familiar? Well, because it is - we know it as y = mx + c, and we've been looking at it since 8th grade. But in this slightly upgraded version of the equation, each variable has a significance to the model:
ŷ : This is the predicted value
x : This is the input (feature)
w₁ : The weight aka slope - this tells us how much ŷ changes when x increases by one unit. If w₁ is 3, that means every time x goes up by 1, ŷ goes up by 3.
w₀ : The bias aka intercept - this is the value of ŷ when x is zero - in other words, where the line crosses the y-axis. It gives the model a starting point. Even if x is zero, the model still needs to return some prediction, and that's w₀.
Training the model means: finding the best values for w₁ and w₀ - the ones that make the predicted values (ŷ) as close as possible to the actual values (y) in the training data.
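To make that concrete, here's a minimal sketch of what the model actually computes once it has its two numbers. The values of w₁ and w₀ below are made up purely for illustration - a real model would learn them from data.
# Hypothetical learned values, just for illustration (not from a real fit)
w1 = 3   # slope: ŷ goes up by 3 for every 1-unit increase in x
w0 = 2   # intercept: the prediction when x is 0

def predict(x):
    return w1 * x + w0   # ŷ = w₁x + w₀

print(predict(0))   # 2  -> just the bias
print(predict(1))   # 5
print(predict(4))   # 14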
Fitting a line with Scikit-learn
Scikit-learn is one of the most widely-used libraries for machine learning in Python. It is clean, efficient, and full of pre-built models that let you focus on learning the concepts, not rewriting math from scratch.
Here’s why scikit-learn is great, especially when you're starting out:
It gives you access to most of the classic ML models (linear regression, decision trees, SVMs, etc.) with just a few lines of code.
It handles a lot of the messy stuff under the hood: data formatting, fitting models, making predictions, evaluating performance.
It has a consistent API - so once you learn how to train one model, you can pretty much train them all the same way.
In short: it lets you focus on what the model is doing, not how to write the math from scratch every time.
(Which you can absolutely do later, but for now - let’s not suffer unnecessarily.)
Here's how you train a basic linear regression model using scikit-learn in just a few lines of code:
# Step 1: Import what we need
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Our toy dataset
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # input (independent variable)
y = np.array([3, 4, 2, 5, 6])                 # output (target)

# Step 2: Initialize the model
model = LinearRegression()

# Step 3: Train the model
model.fit(X, y)

# Step 4: Make predictions
y_pred = model.predict(X)

# Step 5: Visualize the result
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, y_pred, color='red', label='Predicted Line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression Fit')
plt.legend()
plt.show()
Why do we use reshape(-1, 1)?
Scikit-learn expects the input (X) to be in the shape of a 2D array - even if there’s only one feature.
In our case: [1, 2, 3, 4, 5] is a 1D array (shape: (5,))
But scikit-learn expects X to look like this:
[ [1],
[2],
[3],
[4],
[5] ]
which is a 2D array of shape (5, 1) → 5 rows, 1 column
That’s where .reshape(-1, 1) comes in. Here’s what it means:
The -1 means “figure out the number of rows automatically”
The 1 means “make sure there’s one column”
So .reshape(-1, 1) takes your flat array of 5 values and turns it into a 2D column vector, which is what scikit-learn expects for input features.
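If you want to see the difference for yourself, here's a quick sanity check using just NumPy:
import numpy as np

x = np.array([1, 2, 3, 4, 5])
print(x.shape)        # (5,)  -> 1D array, scikit-learn will complain about this
X = x.reshape(-1, 1)
print(X.shape)        # (5, 1) -> 2D column vector, exactly what .fit() expects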
How wrong is the model?
Because why be optimistic when you can cry about the worst case scenarios instead.
Just because our model drew a line doesn’t mean it’s a good line. It might look convincing on a plot - but we need to measure just how far off its predictions were from reality.
That’s where the concept of loss comes in.
Loss is basically: “How much did the model mess up?”
In the world of linear regression, we usually use something called Mean Squared Error (MSE) to calculate this. Here’s what it does:
For each data point, it calculates the difference between the actual value (y) and the predicted value (ŷ).
Then it squares that difference (so negative errors don’t cancel out).
Then it averages all those squared errors across the dataset.
The result is one number: your model’s average squared mistake.
The smaller the number, the better your model's doing. The bigger the number - well, looks like someone did not do their homework.
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y, y_pred)
print("Mean Squared Error:", mse)
So, where does this all fall apart?
Linear regression is cute. It’s honest. It tries. But it doesn’t always work in the real world.
Let’s talk about its limitations.
It assumes everything is linear
Linear regression thinks the world is a straight line. Which is adorable. And completely wrong.
If your data has curves, bends, or weird wiggly relationships, linear regression just… can’t. It’ll try to fit a line through it anyway, and the result will be way off.
It fails with pretty much anything in nature, economics, or real human behavior
Yes, you can add polynomial features (which we will talk about soon) - but at some point, you're duct taping a fork and calling it a spoon.
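You can see this failure mode with a tiny experiment. The data below is made up (y = x²), but it shows how a straight line fit to clearly curved data still happily returns a line - and a painful error to go with it:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Clearly non-linear toy data: y = x squared
X_curve = np.arange(1, 11).reshape(-1, 1)
y_curve = X_curve.ravel() ** 2

line = LinearRegression().fit(X_curve, y_curve)
y_line = line.predict(X_curve)
print("MSE of a straight line on curved data:", mean_squared_error(y_curve, y_line))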
It’s Sensitive to Outliers
Outliers, aka the dramatic data points that sit far away from the rest, have a seriously toxic relationship with the regression line. One y-value that's way higher or lower than the others - and suddenly the model freaks out and redraws the whole line to accommodate that rebel.
An outlier is simply a value that is significantly different from the rest of the data. Sometimes they’re caused by real-world anomalies (a celebrity buying a studio apartment for ₹15 crores), and sometimes they’re just errors or noise (someone typed an extra zero in a form).
Either way, linear regression doesn’t know the difference. Because it minimizes squared error, large mistakes get squared - and become huge.
Let’s say for most points, the model is off by 2. That’s fine: 2² = 4.
But if it’s off by 20 just once? 20² = 400. That single point has more influence than all the other data combined.
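Here's a minimal demonstration with a made-up dataset - a perfect y = 2x relationship, then the same data with one dramatic point at the end:
import numpy as np
from sklearn.linear_model import LinearRegression

X_demo = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y_clean = np.array([2, 4, 6, 8, 10])      # perfect y = 2x relationship
y_outlier = np.array([2, 4, 6, 8, 50])    # same data, one rebel at the end

slope_clean = LinearRegression().fit(X_demo, y_clean).coef_[0]
slope_outlier = LinearRegression().fit(X_demo, y_outlier).coef_[0]
print("Slope without the outlier:", slope_clean)    # 2.0
print("Slope with the outlier:", slope_outlier)     # 10.0 - one point dragged the whole line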
It cannot handle complex relationships
And unfortunately those are the only kind that exist :/
If your data has multiple inputs that interact in complicated ways, a straight line can’t capture that.
It’s like trying to make butter chicken but with just butter and chicken.
Real-world problems need models that can handle more features, deal with noise and bias, and understand non-linear relationships.
Now, this does not mean that Linear Regression is useless.
Governments, NGOs, and economists often use linear regression to model the relationship between income and education, inflation and interest rates, unemployment and GDP, etc.
Why? Because they need results that are explainable, reportable, and won't break if opened in Excel 2007.
Before all the ML magic, most housing price estimations still start with a good ol' linear model: price vs. area, location, number of bathrooms, etc.
When you’re exploring a new dataset, linear regression is often your first model - because it’s quick to build and gives you a sense of whether a relationship exists at all.
Whether it’s physics, biology, or psychology - linear regression is used to test hypotheses, measure correlations, and publish papers.
Linear Regression is a great intro - it’s just not the final boss.
Hey, I’m Saanvi 👋 This blog is mostly me talking to myself - trying to make sense of whatever I’m learning right now. It could be a bug I spent hours fixing, a concept that finally clicked, or just something cool I came across. Writing it down helps me remember, reflect, and sometimes even feel like I know what i’m doing. If you’re here and figuring things out too - welcome to the mess.