Introduction to Linear and Logistic Regression in Python

Machine learning often seems intimidating at first glance, but at its core, it builds upon mathematical concepts we're already familiar with. Let's demystify two fundamental regression techniques through their mathematical foundations.

Linear Regression: From High School Lines to Predictive Power

Remember the straight line equation from your math classes? That's exactly where linear regression begins! The basic equation is beautifully simple:

\( y = mx + b \)

In machine learning, we dress this up a bit more formally as:

$$y = \beta_0 + \beta_1x + \epsilon$$

Where:

  • \( y \) is our target variable (what we're trying to predict)

  • \( \beta_0 \) is the y-intercept (where our line crosses the y-axis)

  • \( \beta_1 \) is the slope (how much y changes for each unit change in x)

  • \( \epsilon \) is our error term (because real data isn't perfect!)

A Real-World Example

Imagine predicting house prices based on square footage. If our model learns that \( \beta_0 = 50,000 \) and \( \beta_1 = 100 \) , then:

$$\text{Price} = 50,000 + 100 \times \text{SquareFootage}$$

For a 2000 sq ft house, we'd predict:

$$\text{Price} = 50,000 + 100 \times 2000 = \$250,000$$

Logistic Regression: The S-Curve That Changes Everything

While linear regression helps with continuous predictions, logistic regression tackles binary classification using a special function called the sigmoid. This S-shaped curve is our gateway to probability predictions:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The beauty of this function lies in its properties:

  • It squeezes any input into a range between 0 and 1

  • When \( z = 0 \) , \( \sigma(z) = 0.5 \)

  • As \( z \) approaches ∞, \( \sigma(z) \) approaches 1

  • As \( z \) approaches -∞, \( \sigma(z) \) approaches 0

Putting It Into Practice

In logistic regression, \( z \) is our linear equation:

$$z = \beta_0 + \beta_1x$$

For example, let's predict email spam. If our model learns \( \beta_0 = -1.5 \) and \( \beta_1 = 0.8 \) , then for an email with 4 suspicious keywords:

  1. First calculate z: \( z = -1.5 + 0.8 \times 4 = 1.7 \)

  2. Then transform through sigmoid: \( P(\text{spam}) = \frac{1}{1 + e^{-1.7}} \approx 0.845 \)

This means there's an 84.5% chance the email is spam!
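Here is a minimal Python sketch of that calculation, using the illustrative coefficients above (nothing beyond the standard library is needed):

import math

beta_0, beta_1 = -1.5, 0.8      # illustrative coefficients from the example
suspicious_keywords = 4

z = beta_0 + beta_1 * suspicious_keywords   # linear part: -1.5 + 0.8 * 4 = 1.7
p_spam = 1 / (1 + math.exp(-z))             # sigmoid turns z into a probability
print(f"P(spam) = {p_spam:.3f}")            # ~0.845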

Motivation

Regression techniques are fundamental to machine learning because they are simple, interpretable, and incredibly versatile. Whether you're predicting house prices, determining exam outcomes, or diagnosing medical conditions, regression helps uncover relationships in data and make informed predictions.

Why Use Regression?

  1. Simplicity:
    Regression models are among the easiest to understand and implement. Their straightforward nature makes them ideal for beginners and professionals who need clear, explainable results. In fields like healthcare and finance, where understanding why a prediction was made is as important as the prediction itself, this simplicity is invaluable.

  2. Interpretable Insights:
    Regression models provide more than just predictions; they offer insights into how each input feature influences the outcome. For example:

    • In a model predicting house prices, you can see how factors like the number of rooms, location, or neighborhood crime rate affect the final price.

    • In a model predicting student performance, you can understand how study hours or attendance contribute to passing or failing an exam.

  3. Foundation for Advanced Models:
    Learning regression helps build a strong foundation for more advanced machine learning techniques. Many complex models, like neural networks and ensemble methods, are based on principles derived from regression. Understanding regression sets you up for success in tackling these more sophisticated methods later.

Linear Regression in Action

Imagine you’re a real estate agent who wants to predict the price of a house based on factors like the number of rooms and the crime rate in the area. By using linear regression, you can develop a model that learns from historical data to make accurate price predictions for new houses. This allows you to not only make informed decisions but also explain to your clients how each factor contributes to the overall price.

Logistic Regression in Action

Now consider a teacher who wants to predict whether a student will pass or fail an exam based on their study habits. Logistic regression can take data about study hours, attendance, and other factors and predict the likelihood of a student passing. This model can help teachers identify at-risk students and intervene early to provide additional support.

Why It Matters

Both linear and logistic regression are widely used because they help answer critical questions:

  • How much will a house cost based on its features?

  • Will a student pass or fail given their study habits?

  • Is an email spam or not based on its content?

When They Work Best

Linear Regression

Linear Regression is most effective when there is a linear relationship between the independent variable(s) (features) and the dependent variable (target). This means the target value increases or decreases proportionally with the change in the input values. Examples include:

  • Predicting house prices based on square footage.

  • Estimating a person's weight based on their height.

  • Forecasting sales revenue based on advertising spending.

In these cases, a straight line can capture the trend in the data accurately, making linear regression a reliable choice.

Logistic Regression

Logistic Regression works best for binary classification tasks, where the outcome can only belong to one of two categories (e.g., Yes/No, 0/1). The model performs well when the data points can be separated by a linear boundary. Examples include:

  • Classifying emails as spam or not spam.

  • Determining if a customer will purchase a product or not.

  • Predicting whether a student will pass or fail based on their study hours.

If a simple dividing line can effectively separate the two categories, logistic regression is a suitable and powerful choice for making these classifications.

Setting Up the Environment

Before we dive into building regression models, let’s set up the necessary environment and install the required libraries.

Installing Dependencies

You'll need a few essential Python libraries for data manipulation, visualization, and machine learning. Open your terminal (or Anaconda Prompt on Windows) and run the following command:

pip install numpy pandas matplotlib seaborn scikit-learn

Here’s a quick overview of what each library does:

  • numpy: For numerical computations and array operations.

  • pandas: For working with datasets in an easy-to-manage tabular format.

  • matplotlib: For creating plots and visualizing data.

  • seaborn: For creating more sophisticated and visually appealing plots.

  • scikit-learn: For building and evaluating machine learning models.

Launching Jupyter Lab

Once the dependencies are installed, you can launch Jupyter Lab, an interactive environment for writing and running Python code. In your terminal or command prompt, run:

jupyter lab

This will open Jupyter Lab in your default web browser, providing an interface to create and manage notebooks. You can start a new Python notebook to begin working on your regression models.
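Optionally, you can confirm the libraries installed correctly by running a short check in a notebook cell (a minimal sanity check; your version numbers will differ):

import numpy, pandas, sklearn
print(numpy.__version__, pandas.__version__, sklearn.__version__)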

Linear Regression with the California Housing Dataset

About the Dataset

The California Housing dataset is commonly used to predict house prices in various districts of California based on different features. It contains data collected from the 1990 California census.

Attributes

The dataset contains 8 numerical features, such as:

  • MedInc: Median income in the district (in tens of thousands of dollars).

  • HouseAge: Median age of the houses in the district.

  • AveRooms: Average number of rooms per household.

  • AveOccup: Average number of occupants per household.

  • Latitude: Latitude of the district.

  • Longitude: Longitude of the district.

Target

  • MedHouseVal: The median house value in the district (in hundreds of thousands of dollars).

This dataset is widely used to demonstrate regression techniques because it captures various socioeconomic and geographical factors influencing home prices.

Exploring the Dataset

Let’s load the dataset and explore its structure using Python.

from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load the dataset
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add the target column
df['MedHouseVal'] = data.target

# Display the first 5 rows
print(df.head())

Understanding the Dataset

Number of Rows and Columns

Check the dimensions of the dataset:

print(df.shape)

Output:

(20640, 9)

This means there are 20,640 rows (samples) and 9 columns (8 features + 1 target).

Feature Names

View the names of the features (columns):

print(df.columns)

Dataset Description

Get a detailed description of the dataset, including information about each feature:

print(data.DESCR)

This description provides insights into what each feature represents and their respective units.
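As a quick optional check, you can also inspect the column types and summary statistics directly from the DataFrame:

# Column types and non-null counts
df.info()

# Summary statistics (mean, std, min, max, quartiles) for each column
print(df.describe())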

Example EDA (Exploratory Data Analysis)

Let’s visualize the relationship between MedInc (median income) and the target MedHouseVal (median house value):

import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot of MedInc vs MedHouseVal
sns.scatterplot(x=df['MedInc'], y=df['MedHouseVal'])
plt.xlabel("Median Income (in tens of thousands of dollars)")
plt.ylabel("Median House Value (in hundreds of thousands of dollars)")
plt.title("Relationship Between Median Income and House Value")
plt.show()

Code Explanation

import seaborn as sns
import matplotlib.pyplot as plt
  1. Importing Libraries:

    • seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for creating visually appealing plots with less code.

    • matplotlib.pyplot is the Matplotlib module used for plotting. It provides functions for creating and customizing plots.

# Scatter plot of MedInc vs MedHouseVal
sns.scatterplot(x=df['MedInc'], y=df['MedHouseVal'])
  1. Creating a Scatter Plot:

    • sns.scatterplot() is a function from Seaborn for creating scatter plots.

    • x=df['MedInc']: Sets the x-axis to the values in the MedInc column of the DataFrame df.

      • MedInc stands for Median Income in tens of thousands of dollars for each district.
    • y=df['MedHouseVal']: Sets the y-axis to the values in the MedHouseVal column of the DataFrame df.

      • MedHouseVal stands for Median House Value in hundreds of thousands of dollars.

This creates a scatter plot where each point represents a district, with the median income on the x-axis and the median house value on the y-axis.

plt.xlabel("Median Income (in tens of thousands of dollars)")
  1. Setting the x-axis Label:

    • plt.xlabel() sets the label for the x-axis.

    • The label "Median Income (in tens of thousands of dollars)" helps users understand the scale and context of the x-axis values.

plt.ylabel("Median House Value (in hundreds of thousands of dollars)")
  1. Setting the y-axis Label:

    • plt.ylabel() sets the label for the y-axis.

    • The label "Median House Value (in hundreds of thousands of dollars)" clarifies what the y-axis represents.

plt.title("Relationship Between Median Income and House Value")
  1. Setting the Plot Title:

    • plt.title() adds a title to the plot.

    • The title "Relationship Between Median Income and House Value" summarizes what the plot shows, helping viewers quickly grasp the purpose of the visualization.

plt.show()
  1. Displaying the Plot:

    • plt.show() renders the plot and displays it in the output. This is necessary to ensure the plot appears when running the script or in a Jupyter Notebook.

Summary of the Plot

This code creates a scatter plot showing the relationship between median income and median house value in different districts:

  • Each point represents a district.

  • The x-axis shows the median income (in tens of thousands of dollars).

  • The y-axis shows the median house value (in hundreds of thousands of dollars).

  • The plot helps visualize if there is a trend or correlation between income and house prices.

Interpreting the Plot

The scatter plot helps visualize if there is a trend between the median income and house prices. Generally, as the median income increases, the house price tends to increase, suggesting a positive correlation between these two features.

This relationship makes MedInc a strong candidate for predicting house values.
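If you want to quantify that relationship rather than just eyeball it, a quick optional check is the Pearson correlation between the two columns:

# Correlation between median income and median house value (roughly 0.69 for this dataset)
print(df['MedInc'].corr(df['MedHouseVal']))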

Building the Linear Regression Model

Now that we’ve explored the California Housing dataset, let’s move on to building a linear regression model to predict house prices based on median income.

1. Import Libraries

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

2. Split the Data

We’ll use the MedInc feature (median income) as the predictor and MedHouseVal (median house value) as the target.

# Define the predictor and target
X = df[['MedInc']]
y = df['MedHouseVal']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Code Explanation

# Define the predictor and target
X = df[['MedInc']]
y = df['MedHouseVal']
  1. Defining the Predictor (Feature):

    • X = df[['MedInc']]:

      • X is the predictor variable (also known as the feature or input).

      • We select the 'MedInc' column from the DataFrame df and wrap it in double brackets [['MedInc']] so that X is a DataFrame (a 2-D table with one column) rather than a 1-D Series. scikit-learn expects features in a 2-D shape of (n_samples, n_features); a quick type check is shown after the summary below.

      • 'MedInc' represents the Median Income of households in each district (in tens of thousands of dollars).

  2. Defining the Target (Label):

    • y = df['MedHouseVal']:

      • y is the target variable (also known as the label or output).

      • We select the 'MedHouseVal' column from the DataFrame df.

      • 'MedHouseVal' represents the Median House Value for each district (in hundreds of thousands of dollars).

Summary:
X contains the input feature (median income), and y contains the corresponding house prices we want to predict.
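As a quick sanity check of the double-bracket point above, you can compare the types directly:

print(type(df['MedInc']))      # <class 'pandas.core.series.Series'> (1-D)
print(type(df[['MedInc']]))    # <class 'pandas.core.frame.DataFrame'> (2-D)
print(df[['MedInc']].shape)    # (20640, 1), the 2-D shape scikit-learn expects for X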

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  1. Splitting the Data:

    • Function: train_test_split() is a function from sklearn.model_selection used to split the dataset into training and testing subsets.
  2. Parameters:

    • X: The predictor variable (feature) DataFrame.

    • y: The target variable (labels).

    • test_size=0.2:

      • Specifies that 20% of the data will be used for the test set.

      • The remaining 80% will be used for the training set.

      • This ratio (80/20) is a common split for training and testing in machine learning.

    • random_state=42:

      • Sets a random seed to ensure the split is reproducible.

      • Using the same random_state each time ensures you get the same train-test split.

  3. Outputs:

    • X_train: The training subset of the predictor variable.

    • X_test: The testing subset of the predictor variable.

    • y_train: The training subset of the target variable.

    • y_test: The testing subset of the target variable.

Summary of the Split:

  • Training Set (80%): Used to train the model so it can learn patterns from the data.

  • Testing Set (20%): Used to evaluate the model and see how well it performs on unseen data.

Example Breakdown

Suppose the dataset has 1,000 rows:

  • Training Set: 80% of 1,000 → 800 rows used for training.

  • Testing Set: 20% of 1,000 → 200 rows used for testing.

Why Split the Data?:
Splitting ensures the model can be evaluated on data it hasn’t seen before, helping to detect overfitting (when the model performs well on training data but poorly on new data).

Why Use random_state?:
Setting a random_state ensures consistent results every time you run the code, which is important for debugging and reproducibility.
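A quick optional check confirms the sizes of the resulting splits (80% and 20% of the 20,640 rows):

print(X_train.shape, X_test.shape)   # (16512, 1) (4128, 1)
print(y_train.shape, y_test.shape)   # (16512,) (4128,)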

3. Train the Model

Create an instance of the LinearRegression model and fit it to the training data.

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
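After fitting, you can inspect the learned parameters, which correspond to the intercept and slope from the linear equation earlier (an optional check; your exact values may vary slightly):

# Learned intercept (beta_0) and slope for MedInc (beta_1)
print(f"Intercept: {model.intercept_:.3f}")
print(f"Coefficient for MedInc: {model.coef_[0]:.3f}")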

4. Make Predictions

Use the trained model to predict house prices on the test set.

# Predict on the test set
y_pred = model.predict(X_test)

Evaluating the Model

Let’s calculate essential metrics to evaluate the performance of the linear regression model.

Metrics to Evaluate

  1. Mean Squared Error (MSE): Measures the average of the squared differences between actual and predicted values.

  2. R² Score (Coefficient of Determination): Indicates how well the model explains the variability of the target.
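For reference, the standard definitions of these two metrics are:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

where \( y_i \) is an actual value, \( \hat{y}_i \) the corresponding prediction, and \( \bar{y} \) the mean of the actual values.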

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)

# Calculate R² Score
r2 = r2_score(y_test, y_pred)

# Print the metrics
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

Example Output

Mean Squared Error: 0.52  
R² Score: 0.47

Visualizing the Regression Line

Plot the regression line over the test data to visualize how well the model fits.

plt.scatter(X_test, y_test, color='blue', label='Actual Prices')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Regression Line')
plt.xlabel("Median Income (in tens of thousands of dollars)")
plt.ylabel("Median House Value (in hundreds of thousands of dollars)")
plt.title("Linear Regression: Median Income vs. House Price")
plt.legend()
plt.show()

Predicting New Values

Once your model is trained, you can use it to predict house prices for new data points.

For example, let’s predict the price of a house in a district where the median income is 6 (i.e., $60,000):

# Ensure new_data has the correct feature name 'MedInc'
new_data = pd.DataFrame([[6]], columns=['MedInc'])

# Make the prediction
predicted_price = model.predict(new_data)

# Display the predicted price
print(f"Predicted House Price for a Median Income of $60,000: ${predicted_price[0] * 100_000:.2f}")

Example Output

Predicted House Price for a Median Income of $60,000: $296062.83

Summary

In this section, you learned how to:

  1. Load and explore the California Housing dataset.

  2. Train a linear regression model to predict house prices based on median income.

  3. Evaluate the model using Mean Squared Error and R² Score.

  4. Visualize the regression line to see how the model fits the data.

  5. Make predictions for new data points.

Linear regression is a simple yet powerful tool for making predictions and gaining insights into your data. Keep experimenting with different features to see how they affect house prices!

Logistic Regression with the Iris Dataset

About the Dataset

The Iris dataset is a classic dataset for classification tasks in machine learning. It contains measurements of iris flowers from three species:

  • Setosa (label 0)

  • Versicolor (label 1)

  • Virginica (label 2)

Attributes (Features)

The dataset includes 4 numerical features:

  1. Sepal Length (cm)

  2. Sepal Width (cm)

  3. Petal Length (cm)

  4. Petal Width (cm)

Target (Class Label)

The goal is to classify the species of iris flowers. However, for logistic regression, we’ll simplify the dataset to classify:

  • Setosa (0)

  • Not Setosa (1)

Exploring the Dataset

Let’s load the dataset and take a closer look.

from sklearn.datasets import load_iris
import pandas as pd

# Load the dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add the target column
df['species'] = data.target

# Display the first 5 rows
print(df.head())

Understanding the Dataset

Number of Rows and Columns

print(df.shape)  # Output: (150, 5)

Feature Names

print(df.columns)

Dataset Description

print(data.DESCR)

This provides detailed information about the features, target classes, and context of the dataset.

Simplifying the Target for Binary Classification

We’ll filter the dataset to create a binary classification problem — distinguishing between Setosa (0) and Not Setosa (1):

# Remove Virginica (class 2) so only Setosa and Versicolor remain
df = df[df['species'] != 2].copy()

# Re-label the remaining classes: Setosa stays 0, Versicolor becomes 1 (Not Setosa)
df['species'] = df['species'].apply(lambda x: 0 if x == 0 else 1)

# Display first 5 rows after simplification
print(df.head())
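As a quick optional check, you can confirm the class balance after filtering (there should be 50 samples of each class):

# Count samples per class: 0 = Setosa, 1 = Not Setosa (Versicolor)
print(df['species'].value_counts())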

Example EDA (Exploratory Data Analysis)

Visualize the relationship between sepal length and sepal width to see how the classes are distributed:

import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x=df['sepal length (cm)'], y=df['sepal width (cm)'], hue=df['species'])
plt.xlabel("Sepal Length (cm)")
plt.ylabel("Sepal Width (cm)")
plt.title("Sepal Length vs Sepal Width (Setosa vs Not Setosa)")
plt.show()

Building the Logistic Regression Model

Now, let’s build a logistic regression model to classify the flowers.

1. Import Libraries

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

2. Split the Data

We’ll use sepal length and sepal width as predictors and species as the target:

# Define the predictor (features) and target
X = df[['sepal length (cm)', 'sepal width (cm)']]
y = df['species']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

3. Train the Model

Create an instance of the LogisticRegression model and fit it to the training data:

# Create and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

4. Make Predictions

Use the trained model to predict the class labels on the test set:

# Predict on the test set
y_pred = model.predict(X_test)

Evaluating the Model

Evaluating a logistic regression model is crucial to understanding how well it performs on classification tasks. We’ll use three key evaluation metrics: Accuracy Score, the Confusion Matrix, and the Classification Report. Each of these provides different insights into the model's performance.

1. Accuracy Score

Accuracy is the simplest evaluation metric. It measures the proportion of correctly predicted instances out of the total number of instances.

Formula for Accuracy

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$

Example Code

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Interpretation

  • An accuracy score of 0.90 means the model correctly classified 90% of the test data.

  • Accuracy is a good metric when the dataset is balanced (i.e., the number of samples in each class is roughly equal).

  • In cases of imbalanced datasets, accuracy can be misleading, so additional metrics are necessary.

2. Confusion Matrix

The Confusion Matrix is a table used to describe the performance of a classification model. It shows the number of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions.

Structure of the Confusion Matrix

                        Predicted Positive        Predicted Negative
Actual Positive         True Positive (TP)        False Negative (FN)
Actual Negative         False Positive (FP)       True Negative (TN)

  • True Positive (TP): The model correctly predicted a positive class.

  • True Negative (TN): The model correctly predicted a negative class.

  • False Positive (FP) (Type I Error): The model incorrectly predicted a positive class (a "false alarm").

  • False Negative (FN) (Type II Error): The model incorrectly predicted a negative class (missed a positive case).

Example Code

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

Example Output

Confusion Matrix:
[[18  2]
 [ 1 19]]

Interpretation of Example Output

Note that scikit-learn orders the rows and columns by class label (0 first, then 1), so with class 1 (Not Setosa) treated as the positive class the output reads [[TN, FP], [FN, TP]]:

  • 18 True Negatives (TN): 18 Setosa flowers were correctly classified as class 0.

  • 2 False Positives (FP): 2 Setosa flowers were incorrectly classified as class 1.

  • 1 False Negative (FN): 1 Not-Setosa flower was incorrectly classified as class 0.

  • 19 True Positives (TP): 19 Not-Setosa flowers were correctly classified as class 1.
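If you prefer a visual version of the matrix, here is a small optional sketch that plots it as a Seaborn heatmap (it assumes cm from the code above and the Setosa/Not Setosa labels used in this section):

import seaborn as sns
import matplotlib.pyplot as plt

# Annotated heatmap of the confusion matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Setosa (0)', 'Not Setosa (1)'],
            yticklabels=['Setosa (0)', 'Not Setosa (1)'])
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.title('Confusion Matrix')
plt.show()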

3. Classification Report

The Classification Report provides a detailed breakdown of the model’s performance, including Precision, Recall, and F1-Score for each class.

Key Metrics in the Classification Report

Precision (Positive Predictive Value):
The proportion of positive predictions that were actually correct.

$$\text{Precision} = \frac{TP}{TP + FP}$$

  • High precision means few false positives.

  • Useful when the cost of a false positive is high (e.g., predicting disease when it isn’t present).

Recall (Sensitivity or True Positive Rate):
The proportion of actual positives that were correctly predicted.

$$\text{Recall} = \frac{TP}{TP + FN}$$

  • High recall means few false negatives.

  • Useful when the cost of a false negative is high (e.g., missing a disease diagnosis).

F1-Score:
The harmonic mean of precision and recall. It balances both metrics and is useful when you need a single performance score.

$$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

  • A perfect F1-score is 1.0, indicating perfect precision and recall.

Support:
The number of actual occurrences of each class in the dataset.

Example Code

from sklearn.metrics import classification_report

report = classification_report(y_test, y_pred)
print("Classification Report:")
print(report)

Example Output

Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.90      0.90        20
           1       0.90      0.90      0.90        20

    accuracy                           0.90        40
   macro avg       0.90      0.90      0.90        40
weighted avg       0.90      0.90      0.90        40

Explanation of Output

  • Class 0 (Setosa):

    • Precision: 0.90 – 90% of predicted class 0 instances were correct.

    • Recall: 0.90 – 90% of actual class 0 instances were correctly predicted.

    • F1-Score: 0.90 – Balanced measure of precision and recall.

    • Support: 20 instances of class 0 in the test set.

  • Class 1 (Not Setosa):

    • Precision: 0.90 – 90% of predicted class 1 instances were correct.

    • Recall: 0.90 – 90% of actual class 1 instances were correctly predicted.

    • F1-Score: 0.90 – Balanced measure of precision and recall.

    • Support: 20 instances of class 1 in the test set.

  • Accuracy: The model correctly classified 90% of all instances.

  • Macro Average: The average of precision, recall, and F1-score across both classes. Useful when you care equally about all classes.

  • Weighted Average: The average of precision, recall, and F1-score weighted by the number of instances in each class. Useful when class sizes are imbalanced.

Summary of Evaluation Metrics

  1. Accuracy: Overall correctness of the model.

  2. Confusion Matrix: Detailed breakdown of correct and incorrect predictions (TP, TN, FP, FN).

  3. Classification Report: Provides precision, recall, F1-score, and support for each class, offering a comprehensive evaluation.

Visualizing the Decision Boundary

Let’s visualize the decision boundary to see how well the logistic regression model separates the two classes:

import numpy as np

# Create a mesh grid
h = 0.02  # Step size
x_min, x_max = X['sepal length (cm)'].min() - 1, X['sepal length (cm)'].max() + 1
y_min, y_max = X['sepal width (cm)'].min() - 1, X['sepal width (cm)'].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Predict on the mesh grid (wrap the grid in a DataFrame with the original
# column names so the input matches the feature names the model was trained with)
grid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=X.columns)
Z = model.predict(grid)
Z = Z.reshape(xx.shape)

# Plot the decision boundary
plt.contourf(xx, yy, Z, alpha=0.8)
plt.scatter(X_test['sepal length (cm)'], X_test['sepal width (cm)'], c=y_test, edgecolor='k', marker='o')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Logistic Regression Decision Boundary')
plt.show()

Predicting New Values

Let’s use the model to predict the class of a new flower with sepal length 5.0 cm and sepal width 3.5 cm:

# Use a DataFrame with the same feature names the model was trained on
new_flower = pd.DataFrame([[5.0, 3.5]], columns=['sepal length (cm)', 'sepal width (cm)'])
prediction = model.predict(new_flower)
print(f"Predicted Class: {'Setosa' if prediction[0] == 0 else 'Not Setosa'}")

Example Output

Predicted Class: Setosa
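If you also want the underlying probability (the sigmoid output discussed earlier) rather than just the class label, you can use predict_proba (a small optional sketch using the same new_flower input):

# Probability estimates: column 0 is P(Setosa), column 1 is P(Not Setosa)
probabilities = model.predict_proba(new_flower)
print(f"P(Setosa) = {probabilities[0][0]:.3f}, P(Not Setosa) = {probabilities[0][1]:.3f}")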

Summary

In this section, you learned how to:

  1. Load and explore the Iris dataset for logistic regression.

  2. Simplify the dataset to a binary classification problem (Setosa vs. Not Setosa).

  3. Train a logistic regression model to classify iris flowers.

  4. Evaluate the model using accuracy, confusion matrix, and classification report.

  5. Visualize the decision boundary to see how the model separates the classes.

  6. Make predictions for new data points.

Conclusion

In this blog, you’ve learned how to harness the power of regression in Python using Jupyter Lab. Specifically, we covered:

  1. Exploring Inbuilt Datasets:

    • The California Housing Dataset for predicting house prices using linear regression.

    • The Iris Dataset for classifying flowers using logistic regression.

  2. Understanding Dataset Attributes and Targets:

    • How to examine features (predictors) and target columns (outcomes) to frame the regression problem effectively.
  3. Performing Regression Tasks:

    • Linear Regression: Predicting continuous values by fitting a straight-line relationship.

    • Logistic Regression: Solving binary classification problems by estimating probabilities.

  4. Model Evaluation:

    • Calculating essential metrics such as Mean Squared Error (MSE), R² Score, Accuracy, and understanding the Confusion Matrix and Classification Report.

Regression is the first step toward mastering machine learning. By understanding these fundamental techniques, you’ve laid the groundwork for exploring more advanced models. The best way to solidify your learning is to experiment with different datasets, tweak your models, and observe how they perform.

Start exploring, keep experimenting, and see the power of prediction in action!
