Building an Email Spam Classifier Using Naive Bayes


In this post, we will walk through the process of building an email spam classifier using the Naive Bayes algorithm. The goal is to classify email messages as either spam or not spam (ham) by processing text data and training a model. We'll cover all the necessary steps, from loading and preprocessing the data, through vectorization and model training, to evaluation.

Objectives

  • Preprocess text data: Convert raw text into a suitable format for machine learning.

  • Train a Naive Bayes classifier: Use a Naive Bayes model to classify emails as spam or not.

  • Evaluate model performance: Assess the model using metrics such as accuracy, precision, and recall.

Skills

This project will involve the following skills:

  • Text Preprocessing: Cleaning and transforming raw text data for machine learning.

  • Machine Learning Basics: Using Naive Bayes or Logistic Regression to perform classification.

  • Model Evaluation: Understanding performance metrics like accuracy, precision, and recall.

Tools

For this project, we will be using:

  • Python: The primary programming language.

  • Scikit-Learn: A machine learning library to build the model.

  • NLTK (Natural Language Toolkit): A toolkit for natural language processing tasks like tokenization and stop word removal.

Step-by-Step Process

Here’s the core Python code for building the spam classifier. The snippet below covers loading the dataset, preprocessing the text, and converting it into numerical features; the remaining steps (data splitting, model training, evaluation, and a sample prediction) are sketched afterwards, and the complete code is linked on GitHub below.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score
import nltk
from nltk.corpus import stopwords
import re

# Download stopwords if you haven't already
nltk.download('stopwords')

# Step 1: Load the dataset
data = pd.read_csv("spam.csv", encoding="latin-1")
data = data[['v1', 'v2']]
data.columns = ['label', 'message']

# Step 2: Preprocess the text data
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = re.sub(r'\W', ' ', text)  # Remove non-word characters
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\s+', ' ', text)  # Remove extra whitespace
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

data['message'] = data['message'].apply(preprocess_text)

# Step 3: Convert text data to numerical data
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['message']).toarray()
y = data['label'].apply(lambda x: 1 if x == 'spam' else 0)

# Complete code in our GitHub repo

Check out the complete code on GitHub here.
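
The snippet above stops after vectorization. Reusing the same variable names (X, y, and vectorizer), a minimal sketch of the remaining steps, splitting the data, training the Naive Bayes model, and evaluating it, might look like the following (random_state=42 is an arbitrary choice for reproducibility):

# Step 4: Split the data into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Train the Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Step 6: Evaluate the model on the test set
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))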

Explanation of Each Step

  1. Load Dataset:

    • The dataset is loaded using pandas. We assume the file is named spam.csv. The dataset contains two columns: one for the label (spam or not) and one for the message.
  2. Text Preprocessing:

    • We clean the text by removing non-word characters (punctuation and other symbols) and converting all text to lowercase. We also remove stopwords, which are common words like "the" and "and" that do not contribute much to the model's performance.
  3. Text Vectorization:

    • We use CountVectorizer from Scikit-Learn to convert the cleaned text data into a format suitable for machine learning (i.e., a bag-of-words model). This transforms the text into a numerical array where each feature represents the count of a particular word in the message.
  4. Data Splitting:

    • The data is split into training and testing sets using train_test_split. 80% of the data is used for training, and 20% is reserved for testing the model.
  5. Model Training:

    • We use the Naive Bayes algorithm (MultinomialNB) to train the model. This algorithm works well with text classification tasks, especially for spam classification.
  6. Model Evaluation:

    • After training the model, we evaluate its performance using accuracy, precision, and recall. These metrics help us understand how well the model is distinguishing between spam and non-spam messages.
  7. Test Message Classification:

    • Finally, we use the trained model to classify a sample message (e.g., "Congratulations! You've won a free ticket.") as spam or not.
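
To make that last step concrete, here is a small sketch that reuses the preprocess_text function, the fitted vectorizer, and the trained model from above (the sample text is just an illustration):

# Step 7: Classify a sample message with the trained model
sample = "Congratulations! You've won a free ticket."
sample_clean = preprocess_text(sample)                       # apply the same cleaning as the training data
sample_vec = vectorizer.transform([sample_clean]).toarray()  # transform (not fit_transform) with the fitted vocabulary
prediction = model.predict(sample_vec)[0]
print("Spam" if prediction == 1 else "Not spam")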

Expected Output

Running this code will produce the following results:

  1. Model Accuracy: The overall accuracy of the model, i.e., the proportion of predictions that were correct.

  2. Precision: The proportion of positive (spam) predictions that were actually spam.

  3. Recall: The proportion of actual spam messages that were correctly identified by the model.
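
To make precision and recall concrete, here is a toy calculation with made-up counts (purely illustrative numbers, not results from the classifier above):

# Suppose the test set yields 40 true positives, 5 false positives, and 10 false negatives
tp, fp, fn = 40, 5, 10
precision = tp / (tp + fp)  # 40 / 45 = about 0.89: share of messages flagged as spam that really were spam
recall = tp / (tp + fn)     # 40 / 50 = 0.80: share of actual spam messages that were caught
print(precision, recall)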

The exact values will depend on your dataset and the random train/test split, so they will vary from run to run.

Conclusion

This spam classifier demonstrates how to preprocess text data, vectorize it into a machine-readable format, train a Naive Bayes classifier, and evaluate the model. By understanding the steps involved in building such classifiers, you can apply similar techniques to other text classification problems.

Let me know if you have any questions or need further assistance with this project!

Written by Roemai

At Roemai we are empowering individuals through education, innovation, and technology solutions with robotics, embedded systems, and AI.