A Beginner's Guide to Simple Linear Regression in Python

Prasun Dandapat
5 min read

Introduction

Linear regression is one of the most fundamental and commonly used techniques in machine learning and statistics. In this guide, we'll explain the basics of simple linear regression, demonstrate how it works, and provide code examples using Python.

What is Simple Linear Regression?

Simple linear regression is a method to predict a dependent variable (or target) based on the value of an independent variable (or feature). The relationship is modeled by fitting a straight line to the data.

The general form of the simple linear regression equation is:

y = mx + c

Where:

  • y is the predicted output (dependent variable).

  • x is the input (independent variable).

  • m is the slope of the line (how much y changes when x increases by one unit).

  • c is the intercept (the value of y when x = 0). The short sketch below shows how m and c can be estimated from data.
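
To make the slope and intercept concrete, here is a minimal sketch that fits a straight line to a tiny made-up dataset using NumPy's polyfit (the numbers are purely illustrative and are not the house-price data used later):

# Minimal sketch: estimating m and c from a tiny made-up dataset
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 5, 7, 9, 11])  # these points lie exactly on y = 2x + 1

m, c = np.polyfit(x, y, deg=1)  # a degree-1 polynomial fit is a straight line
print(f"slope m = {m:.2f}, intercept c = {c:.2f}")  # slope m = 2.00, intercept c = 1.00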

Steps for Building a Simple Linear Regression Model:

  1. Understand the Data: We need data with an independent variable and a dependent variable.

  2. Fit a Linear Model: Find the slope and intercept of the straight line that best fits the data.

  3. Make Predictions: Use the fitted line to predict the target for new input values.

  4. Evaluate the Model: Measure how well the model fits using metrics like Mean Squared Error (MSE) or R-squared.

Let's jump into the coding part now!

Example: Simple Linear Regression with Python

We'll use a basic dataset containing information about house sizes (in square feet) and house prices (in thousands of dollars) to predict the price based on the size.

Step 1: Import Libraries

We'll start by importing the necessary libraries such as numpy, matplotlib, and scikit-learn.

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

Step 2: Prepare the Dataset

Next, we’ll prepare a small dataset for this example.

# Dataset: House sizes (in square feet) and their respective prices (in thousands of dollars)
# Independent variable (X): House size
# Dependent variable (Y): House price
X = np.array([500, 750, 1000, 1250, 1500, 1750, 2000, 2250, 2500]).reshape(-1, 1)  # Reshaping X to a 2D array
Y = np.array([150, 200, 250, 300, 350, 400, 450, 500, 550])

Step 3: Visualize the Data

It’s always a good practice to visualize the data to understand the relationship between the variables.

# Plotting the data points
plt.scatter(X, Y, color='blue')
plt.title('House Size vs Price')
plt.xlabel('Size (Square Feet)')
plt.ylabel('Price (in Thousands)')
plt.show()

Step 4: Split the Data into Training and Testing Sets

We'll split the data into training and testing sets so we can evaluate the model's performance on unseen data.

# Splitting the data into 80% training and 20% testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

Step 5: Train the Simple Linear Regression Model

Now, we’ll create a linear regression model and fit it to the training data.

# Create the linear regression model
model = LinearRegression()

# Train the model using the training data
model.fit(X_train, Y_train)
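
The fitted model stores the slope and intercept it learned, so we can relate them back to the y = mx + c form above. coef_ and intercept_ are the standard scikit-learn attribute names; for this dataset the slope should come out close to 0.2 (the price rises by about 0.2 thousand dollars per extra square foot) and the intercept close to 50.

# Inspect the learned parameters: coef_ holds the slope (m), intercept_ holds c
print(f"Slope (m): {model.coef_[0]:.2f}")
print(f"Intercept (c): {model.intercept_:.2f}")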

Step 6: Make Predictions

Once the model is trained, we can use it to make predictions on the test set.

# Predicting the prices for the test set
Y_pred = model.predict(X_test)
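
The trained model can also price a house size it has never seen. The 1,800-square-foot query below is just an example value chosen for illustration:

# Predict the price for a new, unseen house size (1800 sq ft is an example value)
new_size = np.array([[1800]])  # the input must be a 2D array, just like X
predicted_price = model.predict(new_size)
print(f"Predicted price for 1800 sq ft: {predicted_price[0]:.2f} thousand dollars")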

Step 7: Visualize the Fitted Line

Let’s visualize the fitted line over the data points to see how well the model fits.

# Plotting the regression line with the training data
plt.scatter(X_train, Y_train, color='blue')  # Plot the original data points
plt.plot(X_train, model.predict(X_train), color='red')  # Plot the regression line
plt.title('Simple Linear Regression Fit')
plt.xlabel('Size (Square Feet)')
plt.ylabel('Price (in Thousands)')
plt.show()

Step 8: Evaluate the Model

We can now evaluate the performance of our model using metrics such as Mean Squared Error (MSE) and R-squared.

# Calculate Mean Squared Error (MSE) and R-squared for the model
mse = mean_squared_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")

  • Mean Squared Error (MSE) measures the average squared difference between the predicted and the actual values; lower is better.

  • R-squared indicates how much of the variance in the target variable the model explains (values closer to 1 mean a better fit); see the short sketch after this list.
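
This sketch recomputes both metrics by hand with NumPy; the results should match mean_squared_error and r2_score. Because this toy dataset lies exactly on the line Y = 0.2X + 50, expect an MSE of essentially zero and an R-squared of 1.

# Computing MSE and R-squared by hand to see what the metrics measure
residuals = Y_test - Y_pred
mse_manual = np.mean(residuals ** 2)              # average squared prediction error
ss_res = np.sum(residuals ** 2)                   # residual sum of squares
ss_tot = np.sum((Y_test - np.mean(Y_test)) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot
print(f"Manual MSE: {mse_manual:.2f}")
print(f"Manual R-squared: {r2_manual:.2f}")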

Full Code:

Here is the entire code for simple linear regression:

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Dataset: House sizes (in square feet) and their respective prices (in thousands of dollars)
X = np.array([500, 750, 1000, 1250, 1500, 1750, 2000, 2250, 2500]).reshape(-1, 1)
Y = np.array([150, 200, 250, 300, 350, 400, 450, 500, 550])

# Visualizing the data
plt.scatter(X, Y, color='blue')
plt.title('House Size vs Price')
plt.xlabel('Size (Square Feet)')
plt.ylabel('Price (in Thousands)')
plt.show()

# Splitting the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Creating and training the model
model = LinearRegression()
model.fit(X_train, Y_train)

# Making predictions
Y_pred = model.predict(X_test)

# Plotting the regression line with the training data
plt.scatter(X_train, Y_train, color='blue')
plt.plot(X_train, model.predict(X_train), color='red')
plt.title('Simple Linear Regression Fit')
plt.xlabel('Size (Square Feet)')
plt.ylabel('Price (in Thousands)')
plt.show()

# Model evaluation
mse = mean_squared_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")

Summary:

In this tutorial, we’ve covered the basics of simple linear regression, including how to fit a model, make predictions, visualize results, and evaluate performance using Python. Simple linear regression is an essential concept that lays the foundation for more advanced topics in machine learning.

Written by

Prasun Dandapat

Prasun Dandapat is a Computer Science and Engineering graduate from the Academy of Technology, Hooghly, West Bengal. With a strong interest in AI and Machine Learning, Prasun is also skilled in frontend development and is an aspiring Software Development Engineer (SDE). Passionate about technology and innovation, he constantly seeks opportunities to broaden his expertise and contribute to impactful projects.