Diving Deeper into Machine Learning: A Detailed Exploration of Regression, Classification, and Clustering
Part 1
In this blog post, we will explore various regression models used in machine learning. Regression is a fundamental concept used to predict continuous outcomes. We'll dive into simple linear regression, multiple linear regression, polynomial regression, and other advanced techniques like Ridge, Lasso, and Support Vector Regression (SVR). We'll also cover decision trees and ensemble methods like random forests and gradient boosting. By the end of this post, you'll have a clear understanding of when and how to use these models.
1. Simple Linear Regression
Simple Linear Regression predicts a continuous target variable using a single independent variable. The relationship between the variables is represented as a straight line.
Equation:
[ y = mx + b ]
Where:
( y ) = target variable
( x ) = independent variable
( m ) = slope of the line (how much ( y ) changes with ( x ))
( b ) = intercept (value of ( y ) when ( x = 0 ))
Example:
Let’s say you want to predict house prices based on the square footage. The data looks like this:
| Square Footage (x) | Price (y) ($1000s) |
| --- | --- |
| 800 | 200 |
| 1000 | 250 |
| 1200 | 300 |
| 1400 | 350 |
| 1600 | 400 |
We can use linear regression to find the best-fit line between house size and price.
Visualizing Simple Linear Regression:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Sample data
x = np.array([800, 1000, 1200, 1400, 1600]).reshape(-1, 1)
y = np.array([200, 250, 300, 350, 400])
# Fit model
model = LinearRegression().fit(x, y)
# Plot
plt.scatter(x, y, color='blue')
plt.plot(x, model.predict(x), color='red')
plt.xlabel('Square Footage')
plt.ylabel('Price ($1000s)')
plt.title('Simple Linear Regression: House Price vs Square Footage')
plt.show()
Mathematics Behind:
The slope ( m ) and intercept ( b ) are calculated using these formulas:
[ m = \frac{N \sum xy - \sum x \sum y}{N \sum x^2 - (\sum x)^2} ]
[ b = \frac{\sum y - m \sum x}{N} ]
Where ( N ) is the number of data points. This ensures that the line is positioned to minimize the sum of squared differences between the observed data points and the predicted values.
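To see these formulas in action, here is a minimal sketch (using the house-price data from the example above) that computes the slope and intercept directly with NumPy; the result should match what scikit-learn's LinearRegression learns:
import numpy as np
# House-price data from the example above
x = np.array([800, 1000, 1200, 1400, 1600], dtype=float)
y = np.array([200, 250, 300, 350, 400], dtype=float)
N = len(x)
# Closed-form least-squares slope and intercept
m = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x ** 2) - np.sum(x) ** 2)
b = (np.sum(y) - m * np.sum(x)) / N
print(f"slope m = {m:.2f}, intercept b = {b:.2f}")  # for this data: m = 0.25, b = 0.00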
Strengths:
- Simple and easy to interpret.
- Works well when there is a linear relationship between the variables.
Weaknesses:
- Cannot model non-linear relationships.
- Sensitive to outliers.
2. Multiple Linear Regression
Multiple Linear Regression is used when you want to predict a target variable using more than one independent variable.
Equation:
[ y = b_0 + b_1x_1 + b_2x_2 + \dots + b_nx_n ]
Where ( x_1, x_2, \dots, x_n ) are the independent variables and ( b_0, b_1, \dots, b_n ) are the coefficients.
Example:
You want to predict house prices based on multiple features: square footage, number of bedrooms, and age of the house.
| Square Footage (x1) | Bedrooms (x2) | Age (x3) | Price (y) ($1000s) |
| --- | --- | --- | --- |
| 2000 | 3 | 10 | 300 |
| 1800 | 3 | 5 | 280 |
| 1500 | 2 | 20 | 200 |
| 2200 | 4 | 15 | 350 |
| 1200 | 2 | 30 | 180 |
Visualizing Multiple Linear Regression:
import numpy as np
from sklearn.linear_model import LinearRegression
# Sample data
x = np.array([[2000, 3, 10], [1800, 3, 5], [1500, 2, 20], [2200, 4, 15], [1200, 2, 30]])
y = np.array([300, 280, 200, 350, 180])
# Fit model
model = LinearRegression().fit(x, y)
# Predict price for a new house
new_house = np.array([[2100, 3, 12]])
predicted_price = model.predict(new_house)
print(f"Predicted price: {predicted_price[0]} $1000s")
Strengths:
- Models relationships involving multiple variables.
- Helps you understand the impact of each independent variable on the target variable.
Weaknesses:
- Assumes a linear relationship between the target and the independent variables.
- Sensitive to multicollinearity (high correlation between independent variables).
3. Polynomial Regression
Polynomial Regression is used when the relationship between the dependent and independent variables is non-linear. It introduces powers of the independent variable to model non-linear relationships.
Equation:
[ y = b_0 + b_1x + b_2x^2 + b_3x^3 + \dots + b_nx^n ]
Where ( x^2, x^3, \dots, x^n ) are higher-degree terms of the independent variable.
Example:
You want to model the relationship between temperature and ice cream sales, which isn’t necessarily linear.
| Temperature (°C) (x) | Ice Cream Sales (y) |
| --- | --- |
| 10 | 20 |
| 15 | 30 |
| 20 | 60 |
| 25 | 80 |
| 30 | 90 |
Visualizing Polynomial Regression:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Sample data
x = np.array([10, 15, 20, 25, 30]).reshape(-1, 1)
y = np.array([20, 30, 60, 80, 90])
# Transform to polynomial features
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)
# Fit model
model = LinearRegression().fit(x_poly, y)
# Plot
plt.scatter(x, y, color='blue')
plt.plot(x, model.predict(x_poly), color='red')
plt.xlabel('Temperature (°C)')
plt.ylabel('Ice Cream Sales')
plt.title('Polynomial Regression')
plt.show()
Strengths:
- Can model non-linear relationships.
Weaknesses:
- Higher-degree polynomials can lead to overfitting (see the sketch after this list).
- More complex to interpret than linear models.
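To make the overfitting risk concrete, here is a minimal sketch that compares a degree-2 and a degree-4 fit on the ice cream data above. With only five points, the degree-4 model can pass through every observation, which looks perfect on the training data but rarely generalizes:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
x = np.array([10, 15, 20, 25, 30]).reshape(-1, 1)
y = np.array([20, 30, 60, 80, 90])
for degree in (2, 4):
    x_poly = PolynomialFeatures(degree=degree).fit_transform(x)
    model = LinearRegression().fit(x_poly, y)
    # R^2 on the training data; a perfect 1.0 here is a warning sign, not a success
    print(f"degree {degree}: training R^2 = {model.score(x_poly, y):.4f}")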
4. Ridge and Lasso Regression
Ridge Regression and Lasso Regression add regularization terms to the cost function to prevent overfitting.
Ridge Regression (L2 Regularization):
[ \text{Cost Function} = \sum (y_i - \hat{y}_i)^2 + \lambda \sum \beta_j^2 ]
Where:
- ( \lambda ) controls the strength of the regularization.
- ( \beta_j ) are the model coefficients being penalized.
Use Case: Ridge regression is used when you have many features and want to prevent overfitting.
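As a minimal sketch (assuming scikit-learn and reusing the house data from the multiple regression example), Ridge is used much like LinearRegression, with the alpha parameter playing the role of ( \lambda ):
import numpy as np
from sklearn.linear_model import Ridge
x = np.array([[2000, 3, 10], [1800, 3, 5], [1500, 2, 20], [2200, 4, 15], [1200, 2, 30]])
y = np.array([300, 280, 200, 350, 180])
# alpha is the regularization strength; larger values shrink the coefficients more
model = Ridge(alpha=1.0).fit(x, y)
print("Coefficients:", model.coef_)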
Lasso Regression (L1 Regularization):
[ \text{Cost Function} = \sum (y_i - \hat{y}_i)^2 + \lambda \sum |\beta_j| ]
Use Case: Lasso regression is useful when you want feature selection to happen automatically during training, because the L1 penalty can shrink some coefficients exactly to zero.
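A minimal Lasso sketch on the same x and y as the Ridge example above. Standardizing the features first is good practice because the L1 penalty is sensitive to feature scale; whether any coefficients actually reach zero depends on the data and on alpha:
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Scale the features, then fit Lasso; larger alpha pushes more coefficients toward exactly zero
model = make_pipeline(StandardScaler(), Lasso(alpha=5.0)).fit(x, y)
print("Coefficients:", model.named_steps['lasso'].coef_)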
5. Support Vector Regression (SVR)
Support Vector Regression (SVR) applies the principles of Support Vector Machines to regression tasks. Instead of penalizing every error, it fits the best function it can within a margin of tolerance ( \epsilon ).
How it works:
SVR fits a function inside a tube of width ( \epsilon ) around the data: points whose errors are smaller than ( \epsilon ) are ignored, and only points outside the tube contribute to the loss, with a penalty parameter ( C ) controlling how heavily those violations are punished.
Use Case: Predicting stock prices, where the model tries to find the best trend within a tolerance range.
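A minimal SVR sketch (assuming scikit-learn) on a small toy trend; epsilon sets the width of the tolerance tube and C controls the penalty for points outside it. The kernel and hyperparameter values here are illustrative, not tuned:
import numpy as np
from sklearn.svm import SVR
# Toy trend data
x = np.array([[1], [2], [3], [4], [5], [6]], dtype=float)
y = np.array([3.0, 3.5, 4.1, 4.4, 5.0, 5.6])
# Errors smaller than epsilon are ignored; C penalizes points outside the tube
model = SVR(kernel='rbf', C=10.0, epsilon=0.2).fit(x, y)
print(model.predict([[3.5]]))  # prediction inside the observed range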
6. Decision Tree Regression
Decision Tree Regression uses a tree-like model where splits are made based on feature values. Each leaf node represents a predicted value.
Use Case: Predicting house prices based on multiple features like location, size, and number of rooms.
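A minimal sketch with scikit-learn's DecisionTreeRegressor, reusing the square footage / bedrooms / age data from earlier; max_depth is an illustrative setting that limits how many splits the tree can make:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
x = np.array([[2000, 3, 10], [1800, 3, 5], [1500, 2, 20], [2200, 4, 15], [1200, 2, 30]])
y = np.array([300, 280, 200, 350, 180])
# Limiting the depth keeps the tree from simply memorizing the training data
model = DecisionTreeRegressor(max_depth=3, random_state=42).fit(x, y)
print(model.predict([[2100, 3, 12]]))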
7. Random Forest Regression
Random Forest Regression is an ensemble method that builds multiple decision trees and averages their predictions to improve accuracy and reduce overfitting.
Use Case: Predicting product demand in a retail store.
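A minimal sketch with RandomForestRegressor, using the same x and y as the decision tree sketch above; n_estimators is the number of trees whose predictions are averaged:
from sklearn.ensemble import RandomForestRegressor
# Each tree is trained on a bootstrap sample of the data; their predictions are averaged
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(x, y)
print(model.predict([[2100, 3, 12]]))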
8. Gradient Boosting Regression
Gradient Boosting Regression builds models sequentially: each new tree is trained to correct the residual errors of the trees that came before it, and their predictions are added together to form the final model.
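A minimal sketch with GradientBoostingRegressor, again using the x and y from the decision tree sketch; learning_rate controls how strongly each new tree corrects the ensemble so far, and the values here are illustrative:
from sklearn.ensemble import GradientBoostingRegressor
# Each shallow tree is fit to the residual errors of the ensemble built so far
model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=2).fit(x, y)
print(model.predict([[2100, 3, 12]]))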