Unlocking Predictions: A Gentle Introduction to Linear Regression


Imagine you're playing a game. You quickly notice a simple rule: every time you throw a ball harder, it goes farther. Linear Regression can help you figure out just how far it might go if you throw it with a certain amount of force.
Linear Regression is like discovering these simple rules hidden within numbers. It helps us understand how one thing relates to another. It allows us to use what we've seen in the past to make educated guesses about what might happen in the future.
At its heart, Linear Regression is about finding a straight line that best represents the relationship between two things we are interested in. Think of it this way:
- It's like drawing a line through dots on paper that gets as close as possible to most of the points.
- This line shows us how two things might be related.
- The line might go up (more ice cream sales when the weather is hotter).
- Or it might go down (fewer sweaters sold as the temperature rises).
- Or it might be flat (your height doesn't affect how good you are at crossword puzzles).

The line helps us make educated guesses: if you know one thing, you can follow the line to guess the other. That's really it! Everything else is just math to find that "best fit" line.
Where Does Linear Regression Pop Up in Real Life?
Think about buying a house. Linear Regression can look at the data from houses that have been sold recently and find the general relationship between the size of the house and its price. It's like having a rule of thumb: for every extra square foot, the price tends to go up by a certain amount.
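As a sketch of that rule of thumb in code (the numbers here are made up for illustration, not taken from real housing data), the prediction is just a multiply and an add:

```python
# Hypothetical rule of thumb learned from past sales:
# price = slope * square_feet + base_price
slope = 150          # assumed: each extra square foot adds about $150 (made-up number)
base_price = 50_000  # assumed price "starting point" (made-up number)

square_feet = 1_200
predicted_price = slope * square_feet + base_price
print(f"Predicted price for {square_feet:,} sq ft: ${predicted_price:,}")
# Predicted price for 1,200 sq ft: $230,000
```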
Similarly, consider your electricity bill. The more electricity you use, the higher your bill tends to be.
If you're a student, you've probably noticed that the more you study for a test, the better your score is likely to be.
Businesses also use Linear Regression all the time. Companies look at their old ad spending and sales numbers to see the connection. The pattern helps them guess how much sales might jump if they put more money into advertising.
In all these examples, Linear Regression helps us to quantify the relationship between different things. It allows us to move beyond just guessing and to make more informed predictions based on the patterns we've seen in the data.
The Math That Makes It Work!
To really understand how Linear Regression works, we need to talk about the math behind that "best-fit" line. Don't worry, we'll keep it super simple! The equation for our straight line looks like this:
$$\text{Predicted Value} = \text{Change Factor} \times \text{Input Value} + \text{Starting Number}$$
To understand more deeply, recall the formula for a straight line:
$$y=m\times x+b$$
This simple equation has a few key parts:
- Starting Number (b, the intercept) is where the line crosses the vertical y-axis, i.e., the value of y when x = 0. Sometimes this starting number has no practical meaning: if we're predicting home prices based on square footage, a house with zero square feet makes no sense in reality.
- Change Factor (m, also known as the slope or gradient) is how steep the line is. If m = 2, then for every increase of 1 in x, y goes up by 2 (assuming b is 0). More generally, m tells us how much we expect y to change when x increases by 1.
- Input Value (x) is the value we are using to make our prediction.
- Finally, the Predicted Value (y) is the result we get after we plug in our input value and do the calculation.
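To make these parts concrete, here's a tiny Python sketch of the m = 2 example above (the numbers are purely illustrative):

```python
# The line y = m * x + b, with the example values from above
m = 2   # Change Factor (slope): y rises by 2 for every increase of 1 in x
b = 0   # Starting Number (intercept): the value of y when x = 0

x = 5                     # Input Value
predicted_y = m * x + b   # Predicted Value
print(predicted_y)        # prints 10
```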
Now, how do we find the "best" starting number and change factor for our line? We use a "Cost Checker" that measures how far each data point sits vertically from our line. We square each of these gaps (so that gaps below the line don't cancel out gaps above it) and average them over all the points. The smaller this number, the better our line fits the data - that's our goal.
In simpler terms, we're looking for the slope and starting point that make the line pass as close as possible to all our data points, where "close" means the squared gaps add up to as little as possible.
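This "Cost Checker" has a formal name: the mean squared error (MSE). For a line with slope m and starting number b, fit to n data points, it is
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (m\,x_i + b)\bigr)^2,$$
and Linear Regression simply picks the m and b that make this number as small as possible.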
Step-by-Step Math:
Let’s do a concrete example with three simple data points: (1, 2), (2, 3), (3, 5). We’ll find the regression line y = mx + b by hand.
Compute the means:
First, find the average (mean) of all the x-values and all the y-values.
$$x_{\text{mean}} = \frac{1+2+3}{3} = 2,\qquad y_{\text{mean}} = \frac{2+3+5}{3} = \frac{10}{3} \approx 3.33.$$
Compute the slope m:
$$m = \frac{\sum (x_i - x_{\text{mean}})(y_i - y_{\text{mean}})}{\sum (x_i - x_{\text{mean}})^2}.$$
Substituting our values:
$$\begin{aligned} m &= \frac{(1-2)(2-3.33) + (2-2)(3-3.33) + (3-2)(5-3.33)} {(1-2)^2 + (2-2)^2 + (3-2)^2} \\ &= \frac{(-1)(-1.33) + 0 + (1)(1.67)}{1 + 0 + 1} = \frac{1.33 + 1.67}{2} = \frac{3.0}{2} = 1.5. \end{aligned}$$
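Where do these formulas come from? Minimizing the sum of squared gaps with calculus (setting its derivatives with respect to b and m to zero) gives
$$b = y_{\text{mean}} - m\,x_{\text{mean}},\qquad m = \frac{\sum (x_i - x_{\text{mean}})(y_i - y_{\text{mean}})}{\sum (x_i - x_{\text{mean}})^2}.$$
The first equation also tells us that the best-fit line always passes through the mean point - exactly the fact we use in the next step.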
Compute the intercept b:
Use the fact that the line goes through the mean point (x_mean, y_mean):
$$b = y_{\text{mean}} - m \cdot x_{\text{mean}} = 3.33 - (1.5)\times 2 = 3.33 - 3.0 = 0.33.$$
Final line:
Now we have m = 1.5 and b = 0.33 (exactly 1/3). The regression line is
$$y = 1.5\times x + 0.33.$$
As a check, if we plug x = 1, 2, 3 into 1.5x + 0.33, we get approximately the values 1.83, 3.33, 4.83, which are close to the actual 2, 3, 5.
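We can even score this fit with the "Cost Checker" from earlier. The gaps between actual and predicted values are about 0.17, -0.33, and 0.17, so
$$\text{MSE} = \frac{0.17^2 + (-0.33)^2 + 0.17^2}{3} \approx 0.056,$$
and no other straight line achieves a smaller value on these three points - that's what makes it the "best fit".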
Make a prediction:
With our line, we can predict a new y from any x. For example, if x = 4:
$$y = 1.5\times 4 + 0.33 = 6.0 + 0.33 = 6.33.$$
So the model predicts y ≈ 6.33 when x = 4.
This manual example shows how we use the means of the data and simple sums of products to calculate the slope and intercept. In practice, you can follow exactly these steps to compute a linear regression line by hand for small datasets.
Python Code From Scratch…
Here’s how you might implement the same calculation in pure Python (without libraries like scikit-learn, NumPy, or pandas):
```python
# Example data points
data = [(1, 2), (2, 3), (3, 5)]

# Calculate means of x and y
x_vals = [x for x, y in data]
y_vals = [y for x, y in data]
x_mean = sum(x_vals) / len(x_vals)
y_mean = sum(y_vals) / len(y_vals)

# Calculate the slope (m)
numerator = 0
denominator = 0
for x, y in data:
    numerator += (x - x_mean) * (y - y_mean)
    denominator += (x - x_mean) ** 2
m = numerator / denominator

# Calculate the intercept (b)
b = y_mean - m * x_mean
print(f"Slope m = {m}, Intercept b = {b}")

# Use the line to make a prediction for x = 4
x_new = 4
y_pred = m * x_new + b
print(f"Predicted y for x = {x_new} is {y_pred}")
```
This code will output the slope m = 1.5 and the intercept b ≈ 0.33 (Python prints the full floating-point value, roughly 0.3333), matching our hand calculation. It then predicts y ≈ 6.33 for x = 4.
When we run this code, it first calculates the mean of the x values and the mean of the y values from our data points. Then it uses these means to find the slope (m) and the y-intercept (b) of the linear relationship. Finally, it prints the slope and intercept and predicts the y value for a new x value of 4.
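As an optional sanity check (assuming you have NumPy installed), NumPy can fit the same line in one call - fitting a degree-1 polynomial is exactly a straight-line fit:

```python
import numpy as np

# Fit a straight line (degree-1 polynomial) to the same three points
m, b = np.polyfit([1, 2, 3], [2, 3, 5], deg=1)
print(m, b)  # about 1.5 and 0.333, matching our from-scratch result
```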
Congratulations! You've now taken your first steps into the world of prediction with Linear Regression. We've explored the basic idea of finding a line to represent the relationship between two things, looked at some real-world examples, peeked at the simple math behind it, and even written our own code to make a prediction. Remember that Linear Regression is a powerful yet foundational tool in the field of machine learning. There's much more to discover, but you've now built a solid base to continue your learning journey! See you in the next post, where we'll explore Ridge and Lasso Regression...