📦 How I Used TF-IDF, Word2Vec, and MLPs to Analyze Sentiment in Amazon Reviews

So during one of my early NLP experiments, I came across a massive dataset of Amazon product reviews and wondered: can I train a model to automatically detect whether a review is positive, negative, or neutral?
I knew plenty of people had done sentiment analysis before. But I wanted to try a more detailed pipeline: combining classic NLP features like TF-IDF with embeddings like Word2Vec, and testing how each choice affects results across different models, from SVMs and Perceptrons to full-on MLPs.
🧩 What I Wanted to Explore
The goal was to perform both:
Binary classification: Positive vs. Negative
Ternary classification: Positive, Neutral, Negative
And I wanted to answer a few questions:
Does custom Word2Vec beat pre-trained embeddings for specific product domains?
Can a simple Perceptron beat a deep network if the features are good?
How do dimensionality and preprocessing affect model performance?
🔧 What I Used
Python
Pandas, NumPy for data handling
NLTK for text preprocessing
Scikit-learn for TF-IDF, SVMs, and the Perceptron
Gensim for training custom Word2Vec
PyTorch for building the MLP models
🛠️ What I Did (Step-by-Step)
1️⃣ Data Preprocessing
Cleaned the raw review text by (see the sketch after this list):
Removing HTML tags, links, and punctuation
Expanding contractions (e.g., "can't" → "cannot")
Lowercasing everything
Lemmatizing and removing stopwords
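To make that concrete, here's a minimal sketch of the cleaning step. The post doesn't name a contraction-expansion tool, so I'm assuming the third-party `contractions` package here; the stopword list and lemmatizer come from NLTK, and the helper name `clean_review` is my own.

```python
import re

import contractions  # assumed helper for expansion; pip install contractions
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_review(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # strip links
    text = contractions.fix(text)              # "can't" -> "cannot"
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # drop punctuation and digits
    tokens = [lemmatizer.lemmatize(t) for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(clean_review("I can't believe how <b>good</b> this phone is!"))
```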
2️⃣ Feature Extraction
TF-IDF vectors: for the classic ML models (SVM, Perceptron)
Word2Vec embeddings (see the sketch after this list):
Used both pre-trained (Google News) vectors and custom ones trained on the Amazon review corpus
Represented each review by averaging the vectors of all its words
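Roughly, the two feature pipelines look like this. It's a sketch under assumptions: `cleaned_reviews` stands in for the preprocessed corpus, and the TF-IDF vocabulary cap and Word2Vec hyperparameters are illustrative rather than the exact values from my runs.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
# import gensim.downloader as api  # api.load("word2vec-google-news-300") for the pre-trained variant

cleaned_reviews = ["cannot believe good phone", "battery died after two day"]  # toy placeholder corpus

# TF-IDF vectors for the classic ML models
tfidf = TfidfVectorizer(max_features=50_000)
X_tfidf = tfidf.fit_transform(cleaned_reviews)

# Custom Word2Vec trained on the review corpus itself
tokenized = [r.split() for r in cleaned_reviews]
w2v = Word2Vec(sentences=tokenized, vector_size=300, window=5, min_count=1, workers=4)
# vector_size=10 gives the low-dimensional variant from the experiments

def review_vector(tokens, model, dim=300):
    """Represent a review as the mean of its in-vocabulary word vectors."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X_w2v = np.vstack([review_vector(t, w2v) for t in tokenized])
```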
3️⃣ Model Training
Ran separate experiments for (see the sketch after this list):
Binary classification using SVM, Perceptron, and MLP
Ternary classification using an MLP on 300D Word2Vec features
Also experimented with dimensionality reduction (300D → 10D) to see the impact
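Here's a rough sketch of the training side. The placeholder data stands in for the real train split, and the MLP shown (one hidden layer with dropout, sizes of my choosing) is an illustrative stand-in, not necessarily the exact network from my experiments.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import Perceptron
from sklearn.svm import LinearSVC

# Toy placeholder data; in practice X_tfidf / X_w2v come from the feature step above
rng = np.random.default_rng(0)
X_tfidf_train = rng.random((100, 500))                   # TF-IDF features (dense here for simplicity)
X_w2v_train = rng.random((100, 300)).astype(np.float32)  # averaged 300D Word2Vec vectors
y_train = rng.integers(0, 2, 100)                        # 0 = negative, 1 = positive

# Classic linear models on TF-IDF features (binary task)
svm = LinearSVC(C=1.0).fit(X_tfidf_train, y_train)
perceptron = Perceptron(max_iter=1000).fit(X_tfidf_train, y_train)

# MLP on the embedding features; set n_classes=3 for the ternary task
class SentimentMLP(nn.Module):
    def __init__(self, in_dim=300, hidden=128, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):
        return self.net(x)  # raw logits; CrossEntropyLoss applies softmax internally

model = SentimentMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

X = torch.tensor(X_w2v_train)            # already float32
y = torch.tensor(y_train, dtype=torch.long)
for epoch in range(20):                  # full-batch loop for brevity; mini-batches in practice
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
```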
4️⃣ Evaluation
Used accuracy, precision, recall, and F1-score (see the sketch after this list)
Plotted confusion matrices to analyze where the models got confused (mostly neutral vs. positive/negative)
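The metrics and confusion matrix come straight from scikit-learn. A minimal sketch with placeholder labels standing in for real test data; the ternary class names are my labeling of the setup, not copied from the code.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

# Placeholders standing in for real test labels and model predictions
y_test = np.array([0, 1, 2, 1, 0, 2, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0, 2, 2, 1])

# Per-class precision, recall, and F1, plus overall accuracy
print(classification_report(y_test, y_pred, target_names=["negative", "neutral", "positive"], digits=3))

# The confusion matrix makes the neutral vs. pos/neg confusion visible
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=["negative", "neutral", "positive"])
plt.show()
```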
📊 Results
| Model | Accuracy | Notes |
| --- | --- | --- |
| TF-IDF + SVM | ~88% | Strong on binary classification |
| TF-IDF + Perceptron | ~85% | Simpler, faster to train |
| Word2Vec (custom) + MLP (300D) | ~91% | Best result overall ✅ |
| Word2Vec (pre-trained) + MLP | ~87% | Struggled with domain words |
| Word2Vec (10D) + MLP | ~78% | Dropped due to loss of detail |
🧠 Key Insight: custom embeddings outperformed pre-trained ones; domain context matters!
π‘ What I Learned
Preprocessing is everything in sentiment tasks; lemmatization and stopword removal helped the models generalize better
Even small differences in word vector dimensionality can drastically affect results
TF-IDF is still insanely strong for linear models like SVMs
MLPs only help when paired with rich embeddings; raw input isn't enough
🧠 Why This Project Still Feels Relevant
Sentiment analysis might sound basic, but it's exactly what companies like Abnormal do at scale: detecting sentiment, tone, and intent in large amounts of textual input (like emails or messages). This kind of modeling also lays the foundation for understanding behavioral anomalies or emotional cues in language.
🧪 Future Ideas
Plug in BERT-style embeddings and compare results
Add attention layers to make the model context-aware
Turn it into a real-time review classifier web app using Streamlit (see the sketch below)
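A minimal version of that Streamlit idea could look like this. It assumes the `clean_review`, `review_vector`, `w2v`, and `model` objects from the earlier sketches live in a hypothetical `pipeline` module; none of this exists in the project yet.

```python
import streamlit as st
import torch

from pipeline import clean_review, review_vector, w2v, model  # hypothetical module holding the trained pieces

st.title("Amazon Review Sentiment")
review = st.text_area("Paste a product review:")

if st.button("Classify") and review:
    tokens = clean_review(review).split()
    vec = torch.tensor(review_vector(tokens, w2v), dtype=torch.float32).unsqueeze(0)
    pred = model(vec).argmax(dim=1).item()
    st.write(["Negative", "Neutral", "Positive"][pred])
```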
✍️ Want to Try This or Discuss NLP Pipelines?
Always down to collaborate, debug, or brainstorm better approaches, especially if you're into text understanding or applying NLP to real-world systems.