Basic guide to simple linear regression


We've all been taught the classic equation of a straight line: \(y=mx+c\). But there's an application of this equation that has revolutionized economics, psychology, and many other scientific fields. Today we're going to learn about linear regression, which takes the equation of a straight line and flips it upside down.
What is linear regression?
We usually think of the equation of a straight line as a machine: you put in a value of \(x\), and it spits out a value of \(y\). To calculate \(y\), it uses \(m\) and \(c\). So you must already know the values of \(m\) and \(c\); someone then gives you the value of \(x\), and you give them the value of \(y\) back.
But what if you already know the values of \(x\) and \(y\) and you have to find the values of \(m\) and \(c\)? That is linear regression.
The official definition of linear regression is:
Linear regression is a statistical method used to model the relationship between a dependent variable (the \(y\)) and an independent variable (the \(x\)). The goal is to find the best-fitting straight line through a set of data points that allows us to predict the target variable based on the input features.
This just means finding the values of \(m\) and \(c\) to get the so-called "line of best fit". But this should raise a bunch of questions:
How do we already know the values of \(x\) and \(y\)?
How do we find the values of \(m\) and \(c\)?
How is this the "best-fit" line?
What do we use it for?
We'll go through each of these questions one by one. By the end of this blog, you should have a deep understanding of where, how, and why we use linear regression.
Where do we use linear regression?
Let's look at some data showing the relationship between years of experience in a particular field and salary. Data like this is collected through surveys: asking people about their years of experience and current job salary. Then we get two lists and plot them in a scatter plot:
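If you want to try this yourself before we load the real dataset, here's a minimal sketch of such a scatter plot in Python, using matplotlib and a few made-up survey answers (we'll plot the actual data at the end of this blog):
import matplotlib.pyplot as plt

experience = [1.1, 2.0, 3.2, 4.5, 5.9]        # made-up years of experience
salary = [39000, 43000, 54000, 61000, 67000]  # made-up salaries

plt.scatter(experience, salary)
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()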
You can clearly see that as the number of years increases, the person's salary also increases. And you could even imagine a straight line passing through all the points:
This line visually summarizes the data. That's the line of best fit. But we also want a computational model to answer questions like:
How much does salary increase as experience increases?
What's the starting salary?
What salary would you expect from a person with, let's say, 5.5 years of experience?
All of this is extremely important for someone in the job market—maybe to know the expected salary during negotiation or job searching. But how do we find all this?
We find a single equation that answers all 3 questions. This equation will be of the form:
$$\text{salary} = m \cdot \text{experience} + c$$
That's exactly the equation of a straight line. All we have to do is find the values of \(m\) and \(c\):
The value of \(m\) gives us the "slope": How much salary increases for every year increase in experience
The value of \(c\) gives us the "intercept": How much salary you get with 0 experience (starting salary)
And you can plug any value of experience into this equation and get the expected salary for that many years of experience.
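To see this "machine" in action, here's a minimal sketch in Python with hypothetical values of \(m\) and \(c\) (the real ones come out of the fitting procedure later in this blog):
m = 9000   # hypothetical slope: salary increase per extra year of experience
c = 26000  # hypothetical intercept: starting salary at 0 years

def expected_salary(experience):
    # Plug experience into y = m*x + c
    return m * experience + c

print(expected_salary(5.5))  # 9000 * 5.5 + 26000 = 75500.0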
Why is it called "simple" linear regression?
The word "simple" comes from the fact that we're only using one explanatory variable to model the target variable. In this case, we're only using years of experience to model salary. But in reality, salary depends on many things: education level, health, geographical location, etc. Multiple linear regression takes multiple explanatory variables to model the target variable.
Why is it called simple "linear" regression?
The word "linear" comes from the fact that we're assuming the relationship is of the form \(y=mx+c\), which is the equation of a straight line. If we used a quadratic equation like \(y=ax^2+bx+c\), then it would be simple quadratic regression.
Why is it called simple linear "regression"?
This relates to the idea of "regressing towards the mean". This basically means that all the values revolve around a single trend or constant value. This trend is the line of best fit shown in red in the diagram above.
Notation
To make our lives simple, I'll replace "years of experience" with \(x\) and "salary" with \(y\).
In our salary example, we had 40 data points—that is, 40 different pairs of experience and current salary. We need a way to represent each data point. I'll name each experience value with a subscript:
$$x_1, x_2, x_3, \ldots, x_{40}$$
Similarly, each salary will be represented with a subscript:
$$y_1, y_2, y_3, \ldots, y_{40}$$
Now we want to find a single value of \(m\) and \(c\) that best models the relationships:
$$\begin{align*} y_1 &= m \cdot x_1 + c \\ y_2 &= m \cdot x_2 + c \\ y_3 &= m \cdot x_3 + c \\ &\vdots \\ y_{40} &= m \cdot x_{40} + c \end{align*}$$
Notice it's the same value of \(m\) and \(c\) in each equation. That's because we want a single linear model that best explains all the data points together.
Using a tiny bit of linear algebra, we can simplify these equations by writing:
$$\begin{pmatrix}y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{40}\end{pmatrix} = \begin{pmatrix}m \cdot x_1 + c \\ m \cdot x_2 + c \\ m \cdot x_3 + c \\ \vdots \\ m \cdot x_{40} + c\end{pmatrix}$$
$$\begin{pmatrix}y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{40}\end{pmatrix} = m \cdot \begin{pmatrix}x_1 \\ x_2 \\ x_3\\ \vdots \\ x_{40}\end{pmatrix} + c$$
If we use vector notation \(\hat{y}\) and \(X\) to represent the \(y\) and \(x\) values, then we get:
$$\hat{y} = mX + c$$
Notice that this is exactly like the old equation of a straight line, just with a few decorations here and there.
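In NumPy, this vectorized equation is literally one line, because multiplying an array by a scalar and adding a scalar applies the operation to every element. A small sketch with made-up numbers:
import numpy as np

x = np.array([1.1, 1.2, 1.3])  # made-up experience values
m, c = 9000, 26000             # hypothetical slope and intercept

y_hat = m * x + c              # computes every prediction at once
print(y_hat)                   # [35900. 36800. 37700.]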
If you're wondering why \(y\) gets a "hat" while \(X\) gets a capital letter: the hat marks \(\hat{y}\) as the vector of values the model predicts, and it will always be a 1-dimensional vector. \(X\) is capitalized because in multiple linear regression it becomes a matrix where each explanatory variable has its own column:
$$X = \begin{pmatrix} x_{1,1} & x_{1,2} & x_{1,3} \\ x_{2,1} & x_{2,2} & x_{2,3} \\ x_{3,1} & x_{3,2} & x_{3,3}\\ \vdots & \vdots & \vdots \\ x_{40,1} & x_{40,2} & x_{40,3} \end{pmatrix}$$
If you look at just the first column, you get our original \(x\) vector back. Since we're going to move to multiple linear regression in the future, it's a good idea to get used to this notation now.
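If you'd like to see this matrix in code, NumPy's column_stack builds exactly this shape: each 1-dimensional array becomes one column. A sketch using hypothetical extra features:
import numpy as np

experience = np.array([1.1, 1.2, 1.3])  # our original x values (made up here)
education = np.array([16, 12, 18])      # hypothetical: years of education
age = np.array([24, 22, 30])            # hypothetical: age

X = np.column_stack([experience, education, age])
print(X.shape)  # (3, 3): one row per person, one column per feature
print(X[:, 0])  # the first column is the original experience vector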
How do we calculate m and c in Python?
We're not going to get into the mathematics of how to calculate the values of \(m\) and \(c\) (that's coming in the next blog), but we'll see how to calculate them using Python.
To do so, we're going to use the scikit-learn, pandas, and matplotlib libraries. Make sure you have them installed:
pip install scikit-learn pandas matplotlib
Loading the libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
Loading the data
df = pd.read_csv('https://github.com/ybifoundation/Dataset/raw/main/Salary%20Data.csv')
df.head()
You should see:
   Experience Years  Salary
0               1.1   39343
1               1.2   42774
2               1.3   46205
3               1.5   37731
4               2.0   43525
Linear Regression
x = df['Experience Years'].to_numpy()  # explanatory variable
y = df['Salary'].to_numpy()            # target variable
regressor = LinearRegression()
regressor.fit(x.reshape(-1, 1), y)     # this computes m and c
The fit function runs all the mathematics required to calculate the values of \(m\) and \(c\).
Why do we have x.reshape(-1, 1)?
Remember how we said linear regression thinks of \(X\) as a matrix and not a vector? Well, the x = df['Experience Years'].to_numpy() line gives us a flat array, which you can think of as a row vector:
$$X = \begin{pmatrix}x_1 & x_2 & x_3 & \ldots & x_{40}\end{pmatrix}$$
But scikit-learn expects the feature variables to be in columns, not rows, so we have to convert our row into a column. The reshape function transforms your array into whatever shape you give it. So instead of the vector being 1 row and 40 columns, we convert it into 40 rows and 1 column to get:
$$X = \begin{pmatrix}x_1 \\ x_2 \\ x_3\\ \vdots \\ x_{40}\end{pmatrix}$$
The -1 in reshape(-1, 1) means "figure out this dimension automatically" based on the total number of elements.
Getting the values of m and c
Once we've fitted our model, the values of \(m\) and \(c\) are found as:
print(regressor.coef_, regressor.intercept_)
The coef_ attribute is the slope and intercept_ is the intercept. The reason this is called a "coefficient" and not a slope is because in multiple linear regression, you have a slope for each variable, and at that point it doesn't make sense to call it a slope. So we generalize the naming to "coefficient".
If I run the above, I get:
[9523.65050742] 25673.015760530274
The coef value is an array because scikit-learn expects us to potentially have multiple explanatory variables, each with its own coefficient.
Our equation of the line is now:
$$\text{salary} \approx 9523 \cdot \text{experience} + 25673$$
So:
A 1-year increase in experience should increase your salary by about $9,523
The starting salary for this field is about $25,673
You can even calculate the expected salary for any experience level you want. The expected salary for 5.5 years of experience is:
$$9523 \cdot 5.5 + 25673 = 78049.5$$
So someone with 5.5 years of experience is expected to have a salary of about $78,050.
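You can also let the fitted model do this arithmetic for you with scikit-learn's predict method, passing the input in the same column shape that fit expects:
print(regressor.predict([[5.5]]))  # about [78053.09]
The small difference from our hand-computed 78,049.5 comes from predict using the unrounded values of \(m\) and \(c\).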
Getting the line of best fit
The line of best fit is exactly this line:
$$\text{salary} \approx 9523 \cdot \text{experience} + 25673$$
We can plot this along with the data:
plt.scatter(df['Experience Years'], df['Salary'])                      # the raw data points
plt.plot(x, regressor.coef_[0]*x + regressor.intercept_, color='red')  # the fitted line
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Salary vs Years of Experience')
plt.show()
And we should get the original graph I showed you:
And there you have it! You've just performed your first linear regression analysis. You now know how to find relationships in data, make predictions, and understand what those mysterious \(m\) and \(c\) values actually mean in the real world.