Understanding Linear Regression in Machine Learning

Table of contents
- Introduction
- 1. What is Linear Regression?
- 2. Types of Linear Regression
- 3. Assumptions of Linear Regression
- 4. Cost Function and Optimization
- 5. Evaluation Metrics
- 6. Regularization Techniques for Linear Models
- 7. Implementation in Python
- 8. How Linear Regression Actually Works
- 9. Real-world Applications of Linear Regression
- 10. Advantages and Limitations
- 11. Conclusion

Introduction
Linear Regression is one of the simplest and most widely used algorithms in Machine Learning. It models the relationship between a dependent variable (output) and one or more independent variables (inputs) by fitting a linear equation to observed data. The primary goal is to predict continuous values, making it ideal for applications like house price prediction, sales forecasting, and stock market analysis.
1. What is Linear Regression?
Linear Regression is a supervised learning algorithm used for predicting a continuous output. It assumes a linear relationship between the input (X) and output (Y).
In other words, linear regression predicts the value of a dependent variable (y) based on a given independent variable (x).
Best Fitting Line:
Our primary objective in linear regression is to locate the best-fit line, the line for which the error between the predicted and actual values is as small as possible.
The best-fit line equation defines a straight line that represents the relationship between the dependent and independent variables. The slope of the line indicates how much the dependent variable changes for a unit change in the independent variable(s).
Example Use Case:
Predicting house prices based on features like square footage, number of bedrooms, and location.
2. Types of Linear Regression
2.1 Simple Linear Regression
Simple Linear Regression involves one independent variable (X) and one dependent variable (Y). It aims to find the best-fitting straight line through the data points.
The equation for Simple Linear Regression is:
Y = β0 + β1X
Where:
Y = Predicted output (dependent variable)
X = Input feature (independent variable)
β0 = Intercept (Y-axis intercept)
β1 = Slope (coefficient)
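As a quick illustration, here is a minimal NumPy sketch that estimates β0 and β1 with the closed-form least-squares formulas. The square-footage and price numbers are made up purely for demonstration.
import numpy as np

# Hypothetical data: square footage vs. price (in thousands)
X = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
Y = np.array([200, 270, 330, 410, 480], dtype=float)

# Closed-form least-squares estimates for simple linear regression
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)  # slope
beta0 = Y.mean() - beta1 * X.mean()                                            # intercept

print(f"Y = {beta0:.2f} + {beta1:.4f} * X")
print("Predicted price for 1800 sq ft:", beta0 + beta1 * 1800)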
2.2 Multiple Linear Regression
Multiple Linear Regression involves two or more independent variables (X1, X2, ..., Xn) to predict the dependent variable (Y). It models complex relationships by considering the combined influence of multiple features.
The equation for Multiple Linear Regression is:
Y = β0 + β1X1 + β2X2 + ... + βnXn
Where:
X1,X2,...,Xn = Input features
β1,β2,...,βn = Coefficients for each feature
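Below is a small scikit-learn sketch of multiple linear regression with two hypothetical features (square footage and number of bedrooms); the numbers are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: [square footage, number of bedrooms] -> price (in thousands)
X = np.array([[1000, 2], [1500, 3], [2000, 3], [2500, 4], [3000, 4]])
y = np.array([200, 270, 330, 410, 480])

model = LinearRegression()
model.fit(X, y)
print("Intercept (beta0):", model.intercept_)
print("Coefficients (beta1, beta2):", model.coef_)
print("Prediction for a 1800 sq ft, 3-bedroom house:", model.predict([[1800, 3]])[0])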
3. Assumptions of Linear Regression
3.1 Assumptions of Simple Linear Regression:
Linearity: It assumes that there is a linear relationship between the independent and dependent variables. This means that changes in the independent variable lead to proportional changes in the dependent variable.
Independence: The observations should be independent of each other; that is, the error for one observation should not influence the error for another.
Homoscedasticity: Across all levels of the independent variable(s), the variance of the errors is constant. In other words, the value of the independent variable(s) has no impact on the variance of the errors. If the variance of the residuals is not constant, linear regression will not be an accurate model.
Normality: The residuals should be normally distributed. This means that the residuals should follow a bell-shaped curve. If the residuals are not normally distributed, then linear regression will not be an accurate model.
3.2 Assumptions for Multiple Linear Regression:
No Multicollinearity: The independent variables should not be highly correlated with each other. Multicollinearity occurs when two or more independent variables are highly correlated, which makes it difficult to determine the individual effect of each variable on the dependent variable. If there is strong multicollinearity, multiple linear regression will not be an accurate model.
Additivity: The model assumes that the effect of changes in a predictor variable on the response variable is consistent regardless of the values of the other variables. This assumption implies that there is no interaction between variables in their effects on the dependent variable.
Feature Selection: In multiple linear regression, it is essential to carefully select the independent variables that will be included in the model. Including irrelevant or redundant variables may lead to overfitting and complicate the interpretation of the model.
Overfitting: Overfitting occurs when the model fits the training data too closely, capturing noise or random fluctuations that do not represent the true underlying relationship between variables. This can lead to poor generalization performance on new, unseen data.
Violating these assumptions can lead to unreliable predictions and inaccurate interpretations.
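As a rough sketch of how these assumptions can be checked in practice (on synthetic data, and assuming statsmodels is installed for the VIF calculation): a residual plot helps spot non-linearity and heteroscedasticity, and variance inflation factors flag multicollinearity.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic data for demonstration
rng = np.random.default_rng(42)
X = pd.DataFrame({"sqft": rng.uniform(800, 3000, 100),
                  "bedrooms": rng.integers(1, 6, 100)})
y = 50 + 0.1 * X["sqft"] + 20 * X["bedrooms"] + rng.normal(0, 25, 100)

# Residuals should scatter randomly around zero; a funnel or curve
# suggests heteroscedasticity or a non-linear relationship.
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)
plt.scatter(model.predict(X), residuals)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Variance Inflation Factor: values well above ~5-10 suggest multicollinearity.
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif)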
4. Cost Function and Optimization
The goal of Linear Regression is to find the best-fitting line that minimizes the error between the predicted and actual values. This is achieved using the Cost Function.
The difference between the predicted value Ŷ and the true value Y is the error (residual). The cost function, or loss function, aggregates these errors over all observations into a single number that the model tries to minimize.
4.1 Mean Squared Error (MSE)
In Linear Regression, the Mean Squared Error (MSE) cost function is employed. It calculates the average of the squared errors between the predicted values ŷi and the actual values yi. The aim is to determine the optimal values of the intercept β0 and the coefficient β1 that provide the best-fit line for the given data points.
The MSE is calculated as:
MSE = (1/N) * Σ (yi - ŷi)²
Where:
N = Number of observations
yi = Actual value
ŷi = Predicted value
4.2 Gradient Descent
A gradient is a derivative that describes how the output of a function changes when its inputs are varied slightly.
Gradient Descent is used to minimize the cost function by updating the model parameters iteratively. Differentiating the cost function J with respect to each parameter βj gives the update rule:
βj := βj - α * ∂J/∂βj
Where:
α = Learning rate
∂J/∂βj = Partial derivative of the cost function
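Here is a minimal sketch of gradient descent for simple linear regression on tiny made-up data, using the MSE cost; the learning rate and iteration count are arbitrary choices for illustration.
import numpy as np

# Made-up data that follows y = 2x exactly
X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 6, 8, 10], dtype=float)

beta0, beta1 = 0.0, 0.0   # initial parameters
alpha = 0.05              # learning rate
N = len(X)

for _ in range(5000):
    error = (beta0 + beta1 * X) - y
    # Partial derivatives of J = (1/N) * sum(error^2) with respect to beta0 and beta1
    grad_beta0 = (2 / N) * np.sum(error)
    grad_beta1 = (2 / N) * np.sum(error * X)
    # Step each parameter against its gradient
    beta0 -= alpha * grad_beta0
    beta1 -= alpha * grad_beta1

print(f"Learned line: Y = {beta0:.2f} + {beta1:.2f} * X")   # approximately Y = 0.00 + 2.00 * X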
5. Evaluation Metrics
5.1 Mean Squared Error(MSE)
The Mean Squared Error (MSE), already used above as the cost function, also serves as an evaluation metric. It is the average of the squared errors between the predicted values ŷi and the actual values yi, and lower values indicate a better fit.
The MSE is calculated as:
MSE = (1/N) * Σ (yi - ŷi)²
Where:
N = Number of observations
yi = Actual value
ŷi = Predicted value
5.2 Mean Absolute Error(MAE)
Mean Absolute Error is an evaluation metric used to calculate the accuracy of a regression model. MAE measures the average absolute difference between the predicted values and actual values.
Mathematically, MAE is expressed as:
MAE = (1/n) * Σ |Yi - Ŷi|
Here,
n is the number of observations
Yi represents the actual values.
Ŷi represents the predicted values.
A lower MAE indicates better model performance. Because it uses absolute differences rather than squared ones, MAE is less sensitive to outliers than MSE.
5.3 Coefficient of Determination (R-Squared)
R-Squared is a statistic that indicates how much variation the developed model can explain or capture. It is always in the range of 0 to 1. In general, the better the model matches the data, the greater the R-squared number.
In mathematical notation, it can be expressed as:
R² = 1 - (RSS / TSS)
Residual Sum of Squares (RSS): The sum of the squared residuals over all data points. It measures the difference between the observed outputs and the model's predictions.
Total Sum of Squares (TSS): The sum of the squared differences between each observed value of the response variable and its mean.
5.4 Root Mean Squared Error(RMSE)
The Root Mean Squared Error is the square root of the average of the squared residuals. It describes how closely the observed data points match the predicted values, i.e. the model's absolute fit to the data.
In mathematical notation, it can be expressed as:
RMSE = sqrt( (1/n) * Σ (yi - ŷi)² )
Unlike R-squared, RMSE is not a normalized measure: its value depends on the units of the target variable, so it fluctuates when those units change and cannot be compared across targets measured on different scales.
5.5 Adjusted R-Squared
Adjusted R² measures the proportion of variance in the dependent variable that is explained by the independent variables in a regression model. Unlike plain R², it accounts for the number of predictors and penalizes the model for including irrelevant predictors that do not contribute significantly to explaining the variance in the dependent variable.
Mathematically, adjusted R² is expressed as:
Adjusted R² = 1 - [ (1 - R²) * (n - 1) / (n - k - 1) ]
Here,
n is the number of observations
k is the number of predictors in the model
R² is the coefficient of determination
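The sketch below computes all five metrics on a handful of made-up actual/predicted values, using scikit-learn for MSE, MAE, and R² and the formula above for adjusted R² (assuming k = 2 predictors).
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, 4.0, 2.0, 4.0, 5.0])
y_pred = np.array([3.0, 3.4, 3.8, 4.2, 4.6])
n, k = len(y_true), 2   # k = assumed number of predictors

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"MSE: {mse:.3f}, MAE: {mae:.3f}, RMSE: {rmse:.3f}")
print(f"R2: {r2:.3f}, Adjusted R2: {adj_r2:.3f}")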
6. Regularization Techniques for Linear Models
6.1 Lasso Regression (L1 Regularization)
Lasso Regression is a technique for regularizing a linear regression model. It adds a penalty term to the linear regression objective function to prevent overfitting.
The objective function after applying lasso regression is:
J(θ) = Σ (ŷi - yi)² + λ * Σ |θj|
The first term is the least squares loss, representing the squared difference between predicted and actual values.
The second term is the L1 regularization term; it penalizes the sum of the absolute values of the regression coefficients θj, with λ controlling the strength of the penalty.
6.2 Ridge Regression (L2 Regularization)
Ridge regression is a linear regression technique that adds a regularization term to the standard linear objective. Again, the goal is to prevent overfitting by penalizing large coefficients in the linear regression equation. It is useful when the dataset suffers from multicollinearity, i.e. when the predictor variables are highly correlated.
The objective function after applying ridge regression is:
J(θ) = Σ (ŷi - yi)² + λ * Σ θj²
The first term is the least squares loss, representing the squared difference between predicted and actual values.
The second term is the L2 regularization term; it penalizes the sum of the squares of the regression coefficients θj.
6.3 Elastic Net Regression
Elastic Net Regression is a hybrid regularization technique that combines both L1 and L2 regularization in the linear regression objective.
The objective function after applying elastic net regression is:
J(θ) = Σ (ŷi - yi)² + λ * [ α * Σ |θj| + (1 - α) * Σ θj² ]
The first term is the least squares loss.
The second term is the L1 (lasso) regularization term and the third is the L2 (ridge) regularization term.
λ is the overall regularization strength.
α controls the mix between L1 and L2 regularization.
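In practice these three regularized models are available directly in scikit-learn, as in the sketch below on synthetic data. Note that scikit-learn calls the regularization strength alpha and the L1/L2 mixing parameter l1_ratio, which correspond to λ and α in the notation above.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.datasets import make_regression

# Synthetic data: 10 features, only 5 of which are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       noise=10.0, random_state=42)

ridge = Ridge(alpha=1.0).fit(X, y)                      # L2 penalty
lasso = Lasso(alpha=1.0).fit(X, y)                      # L1 penalty
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)    # mix of L1 and L2

print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))  # L1 tends to drive some weights toward exactly 0
print("ElasticNet coefficients:", np.round(enet.coef_, 2))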
7. Implementation in Python
Let's implement Linear Regression using the popular Python library scikit-learn. We will predict house prices based on square footage.
7.1 Importing Libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
7.2 Loading the Dataset:
Link to the dataset: dataset
path = "/path/to/the/dataset"
df = pd.read_csv(path)
X = df[['SquareFootage']]
y = df['Price']
7.3 Plotting the Dataset:
df.plot(x="SquareFootage", y="Price", style="o")
plt.title("Square Footage vs Price")
plt.xlabel("SquareFootage")
plt.ylabel("Price")
plt.show()
7.4 Splitting Data into Training and Testing Sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
7.5 Training the Model:
model = LinearRegression()
model.fit(X_train, y_train)
7.6 Making Predictions:
y_pred = model.predict(X_test)
df1 = pd.DataFrame({'Actual': y_test.values, 'Predicted': y_pred})
print(df1)
7.7 Evaluating the Model:
print('Mean Absolute Error:', np.mean(np.abs(y_test - y_pred)))
print('Mean Squared Error:', np.mean((y_test - y_pred) ** 2))
print('Root Mean Squared Error:', np.sqrt(np.mean((y_test - y_pred) ** 2)))
7.8 Visualizing the Results:
df2 = df1.head(25)
df2.plot(kind='bar', figsize=(16, 10))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()
plt.scatter(X_test, y_test, color='gray')
plt.plot(X_test, y_pred, color='red', linewidth=2)
plt.show()
8. How Linear Regression Actually Works
To solve linear regression, we need the equation of the regression line:
Y = β0 + β1X
which is the same as y = mx + c, where m is the slope (β1) and c is the intercept (β0).
Consider a sample dataset, for example the five points below:
X: 1, 2, 3, 4, 5
Y: 3, 4, 2, 4, 5
When we plot the data on an X-Y chart, we get a scatter of points through which we want to draw the best-fit line.
The slope m is calculated as:
m = Σ (X - X̄)(Y - Ȳ) / Σ (X - X̄)²
where X̄ and Ȳ are the means of X and Y; here X̄ = 3 and Ȳ = 3.6.
Solving the above equation with these values gives m = 0.4.
To find the intercept c, substitute the means into y = mx + c:
c = Ȳ - m × X̄ = 3.6 - 0.4 × 3 = 2.4
Using these values:
For X=0 ==> Y=2.4
For X=1 ==> Y=2.8
For X=2 ==> Y=3.2
Plotting the line through these points on the chart gives the regression line.
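As a quick check of the worked example, the sketch below recomputes m and c from the sample points with the least-squares formulas and cross-checks the result with np.polyfit.
import numpy as np

# Sample points from the worked example above
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([3, 4, 2, 4, 5], dtype=float)

m = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
c = Y.mean() - m * X.mean()
print("m =", m, "c =", c)            # m = 0.4, c = 2.4

# Cross-check with NumPy's built-in least-squares polynomial fit
print(np.polyfit(X, Y, 1))           # [0.4, 2.4]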
9. Real-world Applications of Linear Regression
Finance: Stock price prediction, risk assessment, and financial forecasting.
Marketing: Predicting sales, customer lifetime value, and advertising effectiveness.
Healthcare: Predicting disease progression, patient outcomes, and medical costs.
Real Estate: Estimating property prices based on location, area, and amenities.
Economics: Demand and supply forecasting, GDP prediction, and economic analysis.
10. Advantages and Limitations
10.1 Advantages:
Easy to implement and interpret.
Fast and efficient for small to medium-sized datasets.
Works well when the relationship between input and output is linear.
10.2 Limitations:
Assumes a linear relationship, which may not always exist.
Sensitive to outliers, which can skew results.
Prone to overfitting with high-dimensional data.
Assumes no multicollinearity among independent variables.
11. Conclusion
Linear Regression is a powerful and easy-to-implement algorithm that provides a solid foundation for predictive modeling. By understanding its assumptions, mathematical formulation, and implementation, you can leverage Linear Regression for a wide range of applications.