Building an End-to-End Sentiment Classifier: Classical ML vs. BERT on IMDb Reviews

Introduction

In Natural Language Processing (NLP), one of the most classic problems is Sentiment Analysis, where the goal is to determine the emotion behind a piece of text. In this project, I address this problem using the IMDb Movie Reviews dataset, aiming to classify whether a given review expresses a Positive or Negative sentiment.

I approach this task by analyzing and comparing two different strategies:

  1. Classical Machine Learning Models: Logistic Regression, Support Vector Machine (SVM), Naive Bayes, and Random Forest, all using TF-IDF vectorization.

  2. Modern Deep Learning with BERT: a transformer-based model fine-tuned on the IMDb Dataset.

This project demonstrates the whole process: starting with data preprocessing, moving on to training and evaluating multiple classical ML models, then fine-tuning BERT, and finally deploying the models. The objective is to create a strong end-to-end sentiment classifier and compare how the different approaches perform on real examples.

Dataset & Preprocessing

Dataset

The dataset used in this project is IMDB Dataset of 50K Movie Reviews by Lakshmipathi N on Kaggle. It consists of 50,000 movie reviews labeled as either positive or negative, with an equal distribution of 25,000 reviews per class, as shown in the following plot.

Sentiment Class Distribution Plot

In addition to analyzing class distribution, I also explored the length of the reviews (in number of characters).
As shown in the plot below, most reviews are under 2000 characters, with the peak around 750 characters.

As a way to visualize the tokens in the data, I generated a word cloud for each of the classes, positive and negative, as shown in the following plot:

To gain further insight into the tokens in the movie reviews, I also looked at the most frequent words. The top 5 were: movie, film, one, like, and good. Here is a plot showing the 20 most frequent words:

We divided the data into 70% training, 15% validation, and 15% test sets, as shown in the following code snippet:

from sklearn.model_selection import train_test_split

# Split into a 70% train set and a 30% temporary set to be further split into validation and test sets
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Split temp into validation set with 15% and test set with 15%
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

Preprocessing

Before going into any of the models, I preprocessed the data by performing a sequence of steps to prepare the IMDb reviews text for analysis and models.

The preprocessing steps I did are:

  1. HTML Tag Removal
    Due to noticing that some reviews contain HTML formatting like <br>, I used BeautifulSoup to remove these tags and just keep the clean text content.

  2. Remove Non-letter Characters
    I removed all non-letter characters, such as digits and punctuation, to reduce noise in the text.

  3. Lowercase
    All text was converted to lowercase to ensure consistency and avoid treating the same word differently based on case.

  4. Tokenization
    The text is split into tokens, so that each review becomes a list of individual words for further processing.

  5. Stopword Removal
    I removed common English stopwords such as "and", "the" and "was" using NLTK's built-in stopword list, as they often don’t add much to the semantic analysis.

Here is the function clean_reviews_text that I used to apply these steps:

# Imports needed for the cleaning function
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
# nltk.download('stopwords')  # run once if the stopword list is not already available

stop_words = set(stopwords.words('english'))

def clean_reviews_text(text):
    # Remove HTML formatting
    text = BeautifulSoup(text, "html.parser").get_text()

    # Remove non-letter characters
    text = re.sub(r'[^a-zA-Z]', ' ', text)

    # lowercase
    text = text.lower()

    # Split to tokens
    tokens = text.split()

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    return ' '.join(tokens)
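
The post doesn't show how the function is applied to the dataset; here is a minimal sketch, assuming the review text lives in a 'review' column of imdb_data_df:

# Apply the cleaning function to every review (the column names are assumptions)
imdb_data_df['clean_review'] = imdb_data_df['review'].apply(clean_reviews_text)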

Here is an example of a movie review before and after cleaning:

Before: “Being a huge street fighter fan and thoroughly enjoying the previous film, Street Fighter II: The Animated Movie, I was really looking forward to this one!<br /><br />However, it seemed that the film had no real sense of direction or purpose. Most of the characters I could not associate with and it just lacked the intense action that made the other mentioned street fighter film so superior.<br /><br />There are some good points however, the Animation is superb!!!”

After: “huge street fighter fan thoroughly enjoying previous film street fighter ii animated movie really looking forward one however seemed film real sense direction purpose characters could associate lacked intense action made mentioned street fighter film superior good points however animation superb”

After applying the cleaning, the vocabulary size decreased significantly by 76.92%, from 438,729 to 101,246 words, as demonstrated in the following plot:

Encoding

I encoded the labels such that “Negative” is mapped to 0 and “Positive” is mapped to 1, to make the data ready for ML models training.

imdb_data_df['label'] = imdb_data_df['sentiment'].map({'positive':1, 'negative':0})

Classical Machine Learning Models with TF-IDF:

I started with classical machine learning models to create a baseline for sentiment classification and to compare between the different ones.

Since these models can’t work directly with raw text, I first transformed the text into numerical features using vectorization. I experimented with CountVectorizer and TF-IDF, using Logistic Regression as a test model to compare their performance.

CountVectorizer

CountVectorizer is a simple vectorization method that converts text into a matrix of token (word) counts. It first builds a vocabulary of all unique words, then counts how many times each word appears in each review, treating every word equally. For example:

  • Review 1: “I love this movie”

  • Review 2: “I hated this movie”

Vocabulary: [ “I”, “love”, “this”, “movie”, “hated”]

|          | I | love | this | movie | hated |
|----------|---|------|------|-------|-------|
| Review 1 | 1 | 1    | 1    | 1     | 0     |
| Review 2 | 1 | 0    | 1    | 1     | 1     |

This is the code for applying the CountVectorizer to our data: the vocabulary is learned and the count matrix computed on the training data using .fit_transform(), then the same vocabulary is applied to the validation and test sets using .transform().

from sklearn.feature_extraction.text import CountVectorizer
# Initialize CountVectorizer
count_vectorizer = CountVectorizer(max_features=10000, ngram_range=(1,2))

# Fit on the train data: learn the vocabulary & compute the count matrix for the train set
X_train_count = count_vectorizer.fit_transform(X_train)

# Apply the same vocabulary to the val set
X_val_count = count_vectorizer.transform(X_val)

# Apply the same vocabulary to the test set
X_test_count = count_vectorizer.transform(X_test)

To visualize the CountVectorizer features, we plotted the top 30 terms by frequency, as shown in the following plot:

TF-IDF (Term Frequency Inverse Document Frequency)

TF-IDF is another vectorization method. Like CountVectorizer, it counts the number of occurrences of a term (word), the “Term Frequency”, but it also factors in how rare the word is across all documents, the “Inverse Document Frequency”. This gives a weighted score of how important a word is in a document relative to the entire corpus.

$$\text{TF-IDF}(w, d) = \text{TF}(w, d) \times \text{IDF}(w)$$

That makes it very useful as it reduces the weight of common words, and highlights the words that might be important and informative.
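
For reference, scikit-learn’s TfidfVectorizer with its default smooth_idf=True setting computes the IDF term as:

$$\text{IDF}(w) = \ln\!\left(\frac{1 + n}{1 + \text{df}(w)}\right) + 1$$

where $n$ is the total number of documents and $\text{df}(w)$ is the number of documents containing $w$; the resulting vectors are then L2-normalized by default.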

Similarly to the CountVectorizer, we applied TF-IDF to the train set to learn the vocabulary and compute the TF-IDF values using fit_transform(), then used transform() for the validation and test sets.

from sklearn.feature_extraction.text import TfidfVectorizer
# Initializing the TF-IDF Vectorizer
tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1,2))

# Fit on the train data: learn the vocabulary & compute TF-IDF values for the train set
X_train_tfidf = tfidf.fit_transform(X_train)

# We apply same vocabulary & weights to the val set
X_val_tfidf = tfidf.transform(X_val)

# We apply same vocabulary & weights to the test set
X_test_tfidf = tfidf.transform(X_test)

To visualize the TF-IDF vectorizer, I calculated the average TF-IDF score of each term across all documents, then plotted the top 30 terms, as shown in the following plot:

Logistic Regression

The first classical machine learning model I used is Logistic Regression. It is a supervised machine learning algorithm commonly used for binary classification tasks, which makes it a natural fit for our sentiment analysis task of classifying reviews as Positive or Negative.

Logistic Regression uses the sigmoid function to convert its input into a probability value between 0 and 1. If the output is closer to 0, the review is predicted as negative; if closer to 1, it's predicted as positive (matching the encoding above).
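
Concretely, for a TF-IDF feature vector $x$ with learned weights $w$ and bias $b$, the model estimates:

$$P(\text{positive} \mid x) = \sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}$$

and a review is classified as positive when this probability exceeds 0.5.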

Here is how I initialized my logistic regression model for training:

from sklearn.linear_model import LogisticRegression

# Initializing the Logistic Regression Model
lr = LogisticRegression(
    max_iter=1000,
    random_state=42
)

Then, I trained the model using both CountVectorizer and TF-IDF features to compare their performance. The results are shown in the table below:

| Metric | LR with CountVectorizer | LR with TF-IDF |
|--------|-------------------------|----------------|
| Training Accuracy | 98.55% | 92.53% |
| Validation Accuracy | 86.87% | 89.04% |
| Testing Accuracy | 88.05% | 90.21% |

These results show that CountVectorizer achieves a higher training accuracy, but this is often a sign of overfitting: the model memorizes patterns from the training data too well and fails to generalize to unseen data. TF-IDF, however, achieves better validation and test accuracies, meaning it generalizes better. This is expected, as TF-IDF down-weights common terms and focuses on more meaningful and rarer ones.

To further evaluate the model’s predictions, I generated a confusion matrix for the Logistic Regression model trained with TF-IDF on the test set:

Therefore, based on the better generalization performance, I decided to use TF-IDF for training and comparing the rest of the classical machine learning models as well.

Training Function

To streamline the training process across different classical machine learning models, I created a reusable function called train_model. This function handles the training and returns important metrics such as training time and training accuracy. By passing in different model instances and training data, I could easily compare performance across various models.

Here is the code for the function:

import time

def train_model(model, X_train, y_train, model_name="Model"):
    # Train the model and measure how long training takes
    start = time.time()
    model.fit(X_train, y_train)
    end = time.time()
    train_time = end - start

    print(f"Training Time: {train_time:.2f} seconds")

    # Training Accuracy, computed on the same features the model was trained on
    train_acc = model.score(X_train, y_train)
    print(f"Training Accuracy: {train_acc*100:.2f} %")

    return {
        "Model": model,
        "Model Name": model_name,
        "Train Accuracy": train_acc,
        "Training Time (s)": train_time
    }
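
For example, the Logistic Regression model defined earlier can be trained on the TF-IDF features like this (a usage sketch; the exact call isn't shown in the post, but it mirrors the calls used for the other models below):

lr_results = train_model(lr, X_train_tfidf, y_train, model_name="Logistic Regression")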

Evaluate Function

Similarly, I created a reusable function called evaluate_model to handle evaluation for different models on either the validation or test set. This helped avoid repeating code and made it easy to generate consistent metrics and visualizations for every model. The function prints and returns the key evaluation metrics:

  • Accuracy

  • Classification Report: Precision, Recall, F1-Score

  • Confusion Matrix

Here is the code for the function:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def evaluate_model(model, X, y, set_name="Test"):
    # Predict on val/test set
    y_pred = model.predict(X)

    # Accuracy
    acc = accuracy_score(y, y_pred)
    print(f"{set_name} Accuracy: {acc*100:.2f}%\n")

    # Classification Report
    print(f"Classification Report for {set_name}:\n")
    print(classification_report(y, y_pred))

    # Confusion Matrix
    cm = confusion_matrix(y, y_pred)
    labels = ['Negative', 'Positive']

    plt.figure(figsize=(6,5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.title(f'Confusion Matrix - {set_name}')
    plt.show()

    return {
        "Set": set_name,
        "Accuracy": acc,
        "Predictions": y_pred
    }

This function was used for evaluating all classical ML models after training. It made it easy to consistently measure and compare their performance on both the validation and test sets.

Now, let’s explore how each of the remaining classical models performed using this training and evaluation setup.

Support Vector Machine (SVM)

Support Vector Machine (SVM) is a supervised machine learning algorithm that classifies data by finding the optimal line or hyperplane that best separates the classes in feature space. The goal is to maximize the margin, which is the distance between the hyperplane and the closest points of each class. A larger margin generally leads to better generalization on unseen data.
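
In its hard-margin form (shown here just for intuition; LinearSVC actually optimizes a soft-margin variant with a penalty parameter C), the optimization problem is:

$$\min_{w, b} \; \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i \,(w \cdot x_i + b) \ge 1 \;\; \forall i,$$

which makes the margin between the two classes equal to $2 / \lVert w \rVert$.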

In this project I used a Linear Support Vector Classifier (LinearSVC), which works well with TF-IDF vectors since text data is high-dimensional and sparse. The SVM tries to find the best separating hyperplane between the two classes: Positive and Negative.

Here is how I initialized and trained the model:

from sklearn.svm import LinearSVC

# Initialize the Linear SVM Model
svm = LinearSVC()

# Train Model
svm_results = train_model(svm, X_train_tfidf, y_train, model_name="SVM")

Then, I evaluated the model on the validation and test sets:

# Get trained model and evaluate
svm_model = svm_results["Model"]
svm_val = evaluate_model(svm_model, X_val_tfidf, y_val, "Validation")
svm_test = evaluate_model(svm_model, X_test_tfidf, y_test, "Test")

Here are the SVM Results:

Training Accuracy: 96.37%
Training Time: 0.85 seconds
Validation Accuracy: 88.32%
Test Accuracy: 89.11%

These results show that the SVM model trained on TF-IDF features performs consistently well across the training, validation, and test sets, indicating good generalization and robustness.

Here is the confusion matrix for the test set predictions:

Multinomial Naive Bayes:

Naive Bayes is a probabilistic supervised machine learning algorithm, commonly used for classification tasks. It assumes that all features (in our case, words/tokens) are independent given the class label, which is why it is called the “naive” assumption.

In this project, I used the Multinomial Naive Bayes variant, which is well suited to text classification problems. It works well when the features represent word counts or frequencies, such as those produced by TF-IDF or CountVectorizer. In our case, we continued using TF-IDF.
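
Under the independence assumption, the model scores each class by multiplying the class prior by the per-word likelihoods and predicts the class with the higher score:

$$P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c)$$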

Here is how I initialized and trained the model:

from sklearn.naive_bayes import MultinomialNB

# Initializing the Model
nb = MultinomialNB()

# Training the Model
nb_results = train_model(nb, X_train_tfidf, y_train, model_name="Multinomial Naive Bayes")

Then, I evaluated the model on the validation and test sets:

# Get trained model and evaluate
nb_model = nb_results["Model"]
nb_val = evaluate_model(nb_model, X_val_tfidf, y_val, "Validation")
nb_test = evaluate_model(nb_model, X_test_tfidf, y_test, "Test")

Here are the Multinomial Naive Bayes Results:

Training Accuracy: 88.09%
Training Time: 0.02 seconds
Validation Accuracy: 86.23%
Test Accuracy: 87.33%

Despite being a very fast and lightweight model, Multinomial Naive Bayes still achieved competitive results. It serves as a strong baseline for our sentiment analysis task.

Here is the confusion matrix for the test set predictions:

Random Forest (RF)

Random Forest (RF) is an ensemble machine learning algorithm used for classification and regression tasks. It builds multiple decision trees during training and merges them together, through majority voting in case of classification, to get more accurate and stable predictions.

In our case, we are working with sparse and high-dimensional TF-IDF vectors, which represent text data as long feature vectors with many zeros. Random Forest models, being tree-based, are not always ideal for this kind of data, as they tend to work better on dense, structured datasets. As a result, we might expect slightly lower performance compared to linear models like SVM or Logistic Regression, which are better suited for high-dimensional text classification. However, it's still worth evaluating as part of our comparison.

Here is how I initialized and trained the model:

from sklearn.ensemble import RandomForestClassifier

# Initializing the Model
rf = RandomForestClassifier()
# Training the Model
rf_results = train_model(rf, X_train_tfidf, y_train, model_name="Random Forest (RF)")

Then, I evaluated the model on the validation and test sets:

# Get trained model and evaluate
rf_model = rf_results['Model']
rf_val = evaluate_model(rf_model, X_val_tfidf, y_val, "Validation")
rf_test = evaluate_model(rf_model, X_test_tfidf, y_test, "Test")

Here are the Random Forest (RF) Results:

Training Accuracy: 100.00%
Training Time: 150.90 seconds
Validation Accuracy: 85.00%
Test Accuracy: 85.87%

The model clearly overfits the training data, achieving 100% accuracy, but performs worse on the unseen validation and test sets, which was expected given the characteristics of the Random Forest algorithm mentioned earlier.

Here is the confusion matrix for the test set predictions:


BERT

BERT, which stands for Bidirectional Encoder Representations from Transformers, is a transformer-based language model developed by Google. It captures context from both directions in text (left-to-right and right-to-left), which makes it highly effective for language-understanding tasks.

In this project, I fine-tuned the pretrained model bert-base-uncased to classify IMDb reviews as positive or negative for sentiment analysis.

Preprocessing & Tokenization

Unlike the classical machine learning models, BERT does not require manual text cleaning. We work directly with the raw text, as the BERT tokenizer handles lowercasing, punctuation, and special tokens internally.

As before, we split the dataset into 70% training, 15% validation, and 15% testing.

We tokenized the text using the pretrained BertTokenizer with a maximum sequence length of 256 to balance performance and memory usage.

Then, we loaded the BERT Tokenizer and tokenized each of the sets:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
MAX_LEN = 256

# Tokenizing the train set
train_encodings = tokenizer(
    X_train,
    truncation = True,
    padding = True,
    max_length = MAX_LEN,
    return_tensors = "pt"
)

# Tokenizing the val set
val_encodings = tokenizer(
    X_val,
    truncation=True,
    padding=True,
    max_length=MAX_LEN,
    return_tensors="pt"
)

# Tokenizing the test set
test_encodings = tokenizer(
    X_test,
    truncation = True,
    padding = True,
    max_length = MAX_LEN,
    return_tensors = "pt"
)

After tokenizing, the data is represented as input_ids and attention_mask tensors.

We also converted the labels to tensors and created a custom dataset class for PyTorch:

from torch.utils.data import Dataset

class IMDbDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        item = {key: val[index] for key, val in self.encodings.items()}
        item['labels'] = self.labels[index]
        return item
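
The post doesn't show the label conversion and dataset instantiation; here is a minimal sketch, assuming the labels come from the earlier y_train, y_val, and y_test splits:

import torch

# Convert the label arrays to tensors (an assumed step, not shown in the original)
train_labels = torch.tensor(list(y_train))
val_labels   = torch.tensor(list(y_val))
test_labels  = torch.tensor(list(y_test))

# Wrap the tokenized encodings and labels in the custom Dataset
train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset   = IMDbDataset(val_encodings, val_labels)
test_dataset  = IMDbDataset(test_encodings, test_labels)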

For efficient training, we prepared a DataLoader for each of the sets, as shown in the code:

from torch.utils.data import DataLoader

BATCH_SIZE = 16
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader   = DataLoader(val_dataset, batch_size=BATCH_SIZE)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

Fine-Tuning the BERT Model

We used BertForSequenceClassification with 2 output classes (positive and negative), trained for 3 epochs with a learning rate of 2e-5 on a Google Colab GPU.

from transformers import BertForSequenceClassification

# Load pretrained BERT model with a classification head for 2 classes (positive/negative)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
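
The training loop below also expects a device, an optimizer, and a learning-rate scheduler. That setup isn't shown in the post; here is a minimal sketch assuming AdamW and a linear schedule with no warmup:

import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Move the model to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

EPOCHS = 3
optimizer = AdamW(model.parameters(), lr=2e-5)

# Linear decay over all training steps (warmup steps assumed to be 0)
total_steps = len(train_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=total_steps
)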

Training and Evaluating Functions

I implemented three reusable functions:

  1. get_accuracy: computes prediction accuracy
# Function to get accuracy
def get_accuracy(preds, labels):
    # Get the predicted class by getting the index of the highest score
    pred_labels = torch.argmax(preds, dim=1)
    # calculates the total number of correct predictions comparing with the labels
    correct = (pred_labels == labels).sum().item()
    # Calculates the accuracy of correct predictions over total
    return correct / len(labels)
  2. train_epoch(): trains the model for one epoch
# Training one epoch
def train_epoch(model, dataloader, optimizer, scheduler):
    # Training mode for the model
    model.train()

    # Initializing to accumulate the training loss and accuracy over the epoch
    total_loss = 0
    total_acc = 0

    # Looping through each batch of data of input-output pairs
    for batch_idx, batch in enumerate(dataloader):
        # Clears previous gradients
        optimizer.zero_grad()

        # To make sure they are on the device we are training on
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Forward pass which computes predictions and loss
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        logits = outputs.logits

        # Getting accuracy for current batch
        acc = get_accuracy(logits, labels)

        # Backpropagation
        loss.backward()
        # Update weights and optimizing
        optimizer.step()
        scheduler.step()

        # Accumulate loss and accuracy
        total_loss += loss.item()
        total_acc += acc

        # Print progress every 500 batches
        if batch_idx % 500 == 0:
            print(f"Batch {batch_idx}/{len(dataloader)}, Loss: {loss.item():.4f}, Acc: {acc:.4f}")

    # Returning the epoch loss and accuracy over all batches
    avg_loss = total_loss / len(dataloader)
    avg_acc = total_acc / len(dataloader)
    return avg_loss, avg_acc
  3. eval_model(): evaluates the model on the validation/test sets
# Evaluate the model (for validation or testing)
def eval_model(model, dataloader):
    # Evaluating mode for the model
    model.eval()

    # Initializing to accumulate the loss and accuracy over the evaluation set
    total_loss = 0
    total_acc = 0

    # No gradient calculation in evaluating, saving memory
    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)

            loss = outputs.loss
            logits = outputs.logits

            # Calculate accuracy per batch
            acc = get_accuracy(logits, labels)

            # Accumulate loss and accuracy
            total_loss += loss.item()
            total_acc += acc

    # Returning avg loss and accuracy
    avg_loss = total_loss / len(dataloader)
    avg_acc = total_acc / len(dataloader)
    return avg_loss, avg_acc
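
The epoch loop that ties these functions together isn't shown in the post; here is a minimal sketch of how train_epoch and eval_model might be called each epoch (assuming 3 epochs, as stated above):

EPOCHS = 3
for epoch in range(EPOCHS):
    print(f"Epoch {epoch + 1}/{EPOCHS}")

    # Train for one epoch and report the training metrics
    train_loss, train_acc = train_epoch(model, train_loader, optimizer, scheduler)
    print(f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}")

    # Evaluate on the validation set after each epoch
    val_loss, val_acc = eval_model(model, val_loader)
    print(f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")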

BERT Fine-Tuning Results

We got the following results during training:

|         | Training Accuracy | Training Loss | Validation Accuracy | Validation Loss |
|---------|-------------------|---------------|---------------------|-----------------|
| Epoch 1 | 87.95%            | 0.2779        | 91.57%              | 0.2164          |
| Epoch 2 | 95.16%            | 0.1354        | 92.36%              | 0.2142          |
| Epoch 3 | 98.33%            | 0.0563        | 92.61%              | 0.2519          |

Total BERT Training Time: 4818.46 seconds (roughly 80 minutes)

We also plotted the loss and accuracy curves for both training and validation as shown in the following plot:

Deployment to Hugging Face

After training, I saved the model locally, then uploaded it to Hugging Face Hub for public use.
You can find it here: Fine-tuned BERT IMDb Model.

To load and use the model, use the following code snippet:

from transformers import BertTokenizer, BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("tarneemalaa/bert_imdb_model")
tokenizer = BertTokenizer.from_pretrained("tarneemalaa/bert_imdb_model")
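
Here is a quick inference sketch with the loaded model (the example review and the surrounding handling are illustrative, not from the original post):

import torch

review = "This movie was an absolute delight from start to finish!"

# Tokenize the review the same way as during training
inputs = tokenizer(review, truncation=True, padding=True, max_length=256, return_tensors="pt")

# Run the model and convert the logits to a prediction and confidence
model.eval()
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=1)
pred = torch.argmax(probs, dim=1).item()
print("Positive" if pred == 1 else "Negative", f"({probs[0][pred].item() * 100:.2f}%)")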

Testing the Model

Applying the fine-tuned model to the test set, we got the following results:

Test Accuracy: 93.08%
Test Loss: 0.2334

The test accuracy of 93.08% is a strong result, showing that the model performs very well on unseen data. With more training epochs, it might have achieved even better results.

Here is the confusion matrix on the test set:


Interactive Movie Sentiment Classifier - Gradio App:

To make the project more engaging and interactive, I built a Gradio web app where users can test and compare all the trained models live. The app allows users to enter any movie review, select the model they want (e.g., BERT, Logistic Regression), and instantly get the predicted sentiment (Positive or Negative), along with a confidence score where applicable.

This lets users actually experiment with the models and see how each one interprets real input, instead of just looking at accuracy percentages.

It supports all of the previously discussed models:

  1. Fine-tuned BERT

  2. Logistic Regression

  3. Support Vector Machine

  4. Naive Bayes

  5. Random Forest

If you choose BERT, the app tokenizes the text and feeds it into the fine-tuned model I uploaded to Hugging Face. If you choose any classical ML model, the review is vectorized using the same TF-IDF vectorizer used in training, and the corresponding model makes the prediction.
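
Here is a minimal sketch of how this routing could look in Gradio (the names lr, svm, nb, rf, tfidf, tokenizer, and model refer to the objects built earlier and are assumptions; the actual app code may differ):

import torch
import gradio as gr

def predict(review, model_choice):
    # Route the review to the fine-tuned BERT model or a classical TF-IDF model
    if model_choice == "BERT":
        inputs = tokenizer(review, truncation=True, max_length=256, return_tensors="pt")
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=1)[0]
        return {"Negative": float(probs[0]), "Positive": float(probs[1])}
    else:
        features = tfidf.transform([review])
        clf = {"Logistic Regression": lr, "SVM": svm, "Naive Bayes": nb, "Random Forest": rf}[model_choice]
        pred = clf.predict(features)[0]
        return {"Negative": float(pred == 0), "Positive": float(pred == 1)}

demo = gr.Interface(
    fn=predict,
    inputs=[
        gr.Textbox(label="Movie Review"),
        gr.Dropdown(["BERT", "Logistic Regression", "SVM", "Naive Bayes", "Random Forest"], label="Model"),
    ],
    outputs=gr.Label(label="Sentiment"),
)
demo.launch()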


Try it Out:

Light Mode

https://tarneemalaa-imdb-sentiment-classifier.hf.space/?__theme=light

Dark Mode

https://tarneemalaa-imdb-sentiment-classifier.hf.space/?__theme=dark


Here are a few screenshots from the app:

  1. Home Page

  2. Choose Model

  3. Enter Review and Submit

For the movie review:
“I was really looking forward to this movie, but it turned out to be a huge letdown. The story was slow and lacked direction, and the characters felt flat and uninteresting. Some scenes looked visually nice, but that wasn't enough to save it. The dialogue was awkward, and the plot just didn’t go anywhere meaningful. I kept waiting for something to happen, but it never did. Overall, it was boring and forgettable I wouldn’t recommend it.”

Fine-tuned BERT detected that it is a negative review with a confidence of 99.86%.


Summary

Here is a summary of all the models and their accuracies, for comparison and analysis:

It clearly shows how BERT, while being the most accurate, requires significantly more training time. On the other hand, classical models like Logistic Regression and Naive Bayes offer fast and reasonably good performance, making them a good choice when compute or time is limited.


Final Thoughts

This project was a deep dive into both classical and modern NLP techniques for sentiment analysis. It allowed me to explore and compare the strengths and trade-offs between traditional machine learning models and transformer-based models like BERT.

Building the interactive Gradio app brought the project to life, making it easy for anyone to try the models hands-on and see how each one performs in real time.

If you have any feedback, suggestions, or just want to connect, feel free to reach out!

GitHub Repository:
All the code for data preprocessing, classical models, BERT fine-tuning, evaluation, and the deployed app is available here:
github.com/TarneemAlaa1/imdb-sentiment-classifier
