Machine Learning - Supervised Learning 1


AI gives us a smarter and more efficient way to classify and predict things. In this article, I will take you through one of the most exciting and widely used branches of Machine Learning: Supervised Learning. Throughout my Machine Learning series, I will show you how to use Supervised Learning to build models that predict house prices and classify spam emails.
First Question - What’s supervised learning?
Supervised Learning is a type of Machine Learning where the model learns from labelled data (input-output pairs). The goal is to predict the output for new, unseen inputs.
Types of Supervised Learning
There are two types of Supervised Learning: Classification and Regression.
Classification:
Predicting discrete labels (e.g. Spam or Not Spam).
Examples: Spam classification, image recognition…
Regression:
Predicting continuous values (e.g. house prices, temperature).
Examples: Stock price prediction, weather forecasting…
Common Algorithms in Supervised Learning
Linear Regression:
Linear Regression is one of the simplest and most widely used statistical techniques for predictive modelling. It is used to model the relationship between a dependent variable (Y, often called the target or outcome) and one or more independent variables (X, often called features or predictors). The goal is to find a linear relationship between the inputs and the target variable.
Key Concepts
Dependent Variable (Y): The variable we are trying to predict or explain.
Independent Variable (X): The variable(s) used to predict or explain the dependent variable.
Linear Relationship: Assumes that the relationship between X and Y can be described by a straight line.
Residuals: The difference between the actual value (Y) and the predicted value (Ŷ).
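To make residuals concrete, here is a minimal sketch (the numbers are toy values, and the line's coefficients are ones I picked by hand for illustration):
import numpy as np
# Toy data: five observations of X and Y
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 5, 4, 5])
# A hand-picked candidate line: Y_hat = 2.2 + 0.6 * X
Y_hat = 2.2 + 0.6 * X
# Residuals: actual minus predicted
residuals = Y - Y_hat
print(residuals)  # [-0.8  0.6  1.  -0.6 -0.2]
Each entry shows how far the line misses the corresponding observation; a good fit keeps these gaps small.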
Assumptions of Linear Regression
Linearity: The relationship between X and Y is linear.
Independence: Observations are independent of each other.
Homoscedasticity: The variance of residuals is constant across all levels of X.
Normality: Residuals are normally distributed (especially important for inference).
No Multicollinearity: Independent variables are not highly correlated with each other (in multiple regression).
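A common way to sanity-check linearity and homoscedasticity is to plot residuals against fitted values and look for a flat, evenly spread band. Here is a minimal sketch on synthetic data I generated purely for illustration:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Synthetic data: a linear trend plus random noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
Y = 3.0 + 2.0 * X.ravel() + rng.normal(0, 1, size=100)
model = LinearRegression().fit(X, Y)
fitted = model.predict(X)
residuals = Y - fitted
# Residuals vs fitted values: look for a flat, evenly spread band around zero
plt.scatter(fitted, residuals)
plt.axhline(0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residual plot for checking linearity and homoscedasticity')
plt.show()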
Simple Linear Regression:
The Simple Linear Regression model can be expressed as:
$$Y = \beta_0 + \beta_1 X + \epsilon$$
where Y is the dependent variable we want to predict, X is the independent variable, β0 is the intercept (the value of Y when X = 0), β1 is the slope of the line, and ε is the error term, representing the difference between the actual value and the predicted value.
Multiple Linear Regression:
When there are multiple independent variables, the model becomes:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \epsilon$$
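As a minimal sketch (with toy numbers I made up for illustration), fitting a multiple regression in scikit-learn looks the same as the simple case; the feature matrix just gains more columns:
import numpy as np
from sklearn.linear_model import LinearRegression
# Hypothetical data: two features per observation (say, area and number of rooms)
X = np.array([[50, 1], [80, 2], [120, 3], [200, 4]])
Y = np.array([110, 170, 250, 400])
model = LinearRegression().fit(X, Y)
print(model.intercept_)  # estimate of β0
print(model.coef_)       # estimates of β1 and β2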
The Goal of the Linear Regression Model
The goal is to find the best-fitting line that minimizes the difference between the observed values (Y) and the predicted values (Ŷ). This is done by estimating the coefficients (β0, β1, …, βn).
Cost Function
The most common method to find the best-fitting line is Ordinary Least Squares (OLS), which minimizes the Sum of Squared Residuals (SSR):
$$SSR = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
where $Y_i$ is the actual value and $\hat{Y}_i$ is the predicted value.
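For simple linear regression, OLS even has a closed-form solution. Here is a minimal sketch (reusing the same toy numbers as Code Example 1 below) that computes the coefficients directly and the SSR they achieve:
import numpy as np
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 5, 4, 5])
# Closed-form OLS estimates for simple linear regression
beta_1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta_0 = Y.mean() - beta_1 * X.mean()
Y_hat = beta_0 + beta_1 * X
ssr = np.sum((Y - Y_hat) ** 2)
print(beta_0, beta_1, ssr)  # roughly 2.2, 0.6, and SSR ≈ 2.4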
Gradient Descent
For large datasets, we often use gradient descent to iteratively adjust the coefficients to minimize the cost function. You may still remember the Gradient Descent update rule from my previous series.
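As a minimal sketch (not the exact code from my earlier series), here is gradient descent fitting the same toy data by repeatedly stepping the coefficients against the gradient of the MSE:
import numpy as np
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 4, 5, 4, 5], dtype=float)
beta_0, beta_1 = 0.0, 0.0   # start from arbitrary coefficients
lr = 0.05                   # learning rate, chosen by hand for this toy data
for _ in range(5000):
    Y_hat = beta_0 + beta_1 * X
    error = Y_hat - Y
    # Partial derivatives of the MSE with respect to each coefficient
    grad_0 = 2 * error.mean()
    grad_1 = 2 * (error * X).mean()
    beta_0 -= lr * grad_0
    beta_1 -= lr * grad_1
print(beta_0, beta_1)  # converges to roughly 2.2 and 0.6, matching the OLS solution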
Steps to Perform Linear Regression
Data Collection: Gather data for the dependent and independent variables.
Data Preprocessing: Handle missing values, encode categorical variables, and normalize/standardize data if necessary.
Model Training: Use the training data to estimate the coefficients.
Prediction: Use the model to make predictions on new data.
Model Evaluation: Assess the model's performance using metrics like R-squared, Mean Squared Error (MSE), or Root Mean Squared Error (RMSE).
Code Example 1: Train and evaluate a Simple Linear Regression model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Data Collection
X = np.array([[1], [2], [3], [4], [5]])
Y = np.array([2, 4, 5, 4, 5])
# Data Preprocessing: split into training and test sets
# (test_size=0.4 keeps two test samples, so R² is defined)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=42)
# Model creating and training
model = LinearRegression()
model.fit(X_train, Y_train)
# Make predictions
Y_pred = model.predict(X_test)
# Model Evaluation
mse = mean_squared_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)
print(f"MSE: {mse}, R²: {r2}")
Running this prints the MSE and R² computed on the held-out test samples.
Evaluation Metrics
R-squared (R²): Measures the proportion of variance in Y explained by X. Ranges from 0 to 1 (higher is better).
Mean Squared Error (MSE): Average of the squared residuals (lower is better).
Root Mean Squared Error (RMSE): Square root of MSE (easier to interpret in the same units as Y).
Mean Absolute Error (MAE): Average of the absolute residuals (less sensitive to outliers).
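All four metrics are available in scikit-learn. Here is a quick sketch (the true values and predictions are just the toy numbers from the sketches above):
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
Y_true = np.array([2, 4, 5, 4, 5])
Y_pred = np.array([2.8, 3.4, 4.0, 4.6, 5.2])  # predictions from the fitted line above
mse = mean_squared_error(Y_true, Y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(Y_true, Y_pred)
r2 = r2_score(Y_true, Y_pred)
print(mse, rmse, mae, r2)  # ≈ 0.48, 0.69, 0.64, 0.6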
Code Example 2: Predict a house price from its size and visualize the result
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
# Generate some sample data (area in square meters and price in thousands of dollars in Melbourne)
area = np.array([300, 420, 100, 380, 600]).reshape(-1, 1)
price = np.array([60, 80, 45, 70, 120])
# Create a linear regression model
model = LinearRegression()
# Fit the model to the data
model.fit(area, price)
# Predict the price for the training data
predicted_price = model.predict(area)
# Calculate evaluation metrics
mse = mean_squared_error(price, predicted_price)
rmse = np.sqrt(mse)
r2 = r2_score(price, predicted_price)
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root - Mean - Squared Error (RMSE): {rmse}")
print(f"Coefficient of Determination (R2): {r2}")
# Predict the price for a new area (for example, an area of 200 square meters)
house_square_meters = 200
new_area = np.array([house_square_meters]).reshape(-1, 1)
predicted_price = model.predict(new_area)
print(f"Predicted price for an area of {house_square_meters} square meters: {predicted_price[0]} thousand dollars")
# Visualize the data and the regression line
plt.scatter(area, price)
plt.plot(area, model.predict(area), color='red')
plt.xlabel('House Area (square meters)')
plt.ylabel('House Price (thousands of dollars)')
plt.title('Linear Regression for House Price Prediction')
plt.show()
Running this prints the evaluation metrics, the predicted price for the 200-square-meter house, and displays a scatter plot of the data with the red regression line.
In today’s article, we embarked on an exciting journey into the world of Supervised Learning, focusing on Linear Regression, and worked through hands-on code examples for training and evaluating predictive models.
In the next article, I will continue the exploration of Supervised Learning by diving deeper into other powerful algorithms for Classification tasks.
By now you should understand what Supervised Learning is, how Linear Regression works, and how to train and evaluate Linear Regression models. You can apply this approach to future prediction tasks, such as forecasting stock prices :) And don’t forget to leave me any feedback or questions you have. Stay tuned for more AI adventures, and let’s enjoy this AI exploration journey together.