🧠 Zero to Hero: Stopword Removal, POS Tagging, Lemmatization & Stemming

Rahul Kumar

In the vast realm of Natural Language Processing (NLP), data preprocessing is the bedrock of all downstream tasks—whether it's sentiment analysis, machine translation, or text summarization. Before any model shines, the text must be cleaned, normalized, and made digestible.

This blog is your step-by-step guide from zero to hero in understanding and implementing four essential preprocessing techniques:

✅ Stopword Removal
✅ POS (Part-of-Speech) Tagging
✅ Lemmatization
✅ Stemming


🚀 Why Text Preprocessing Matters

Raw text is messy—filled with redundant words, grammatical noise, and morphological variants. Preprocessing makes this text suitable for analysis:

  • Improves model accuracy

  • Reduces computational cost

  • Enhances generalization


🛑 1. Removing Stopwords

📘 What are Stopwords?

Stopwords are common words like "the", "is", "in", "and" that occur frequently in language but often add little semantic value for many NLP tasks.

Removing them helps in reducing dimensionality and noise.

✍️ Code Example with NLTK

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download required resources
nltk.download('punkt')
nltk.download('stopwords')

# Sample text
text = "This is a simple sentence, and we are going to remove stopwords from it."

# Tokenize and lowercase
words = word_tokenize(text.lower())

# Stopwords set
stop_words = set(stopwords.words('english'))

# Remove stopwords and punctuation
filtered_words = [word for word in words if word.isalnum() and word not in stop_words]

print(filtered_words)

📤 Output:

['simple', 'sentence', 'going', 'remove', 'stopwords']

🧠 Note: word.isalnum() also filters out punctuation.
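
If you also want to drop standalone numbers, str.isalpha() is a stricter filter than str.isalnum(). A minimal sketch of the difference:

tokens = ["nlp", "101", "!", "n't"]

# isalnum() keeps letters and digits; punctuation and "n't" are dropped
print([t for t in tokens if t.isalnum()])   # ['nlp', '101']

# isalpha() keeps letters only, so the number goes too
print([t for t in tokens if t.isalpha()])   # ['nlp']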


🧠 2. Part-of-Speech (POS) Tagging

📘 What is POS Tagging?

POS Tagging assigns grammatical categories (noun, verb, adjective) to words. It's crucial for tasks like lemmatization, syntax parsing, and named entity recognition.

✍️ Code Example with NLTK

from nltk import pos_tag

# The tagger model is a one-time download
nltk.download('averaged_perceptron_tagger')

# POS-tag the filtered words
tagged_words = pos_tag(filtered_words)

print(tagged_words)

📤 Output:

[('simple', 'JJ'), ('sentence', 'NN'), ('going', 'VBG'), ('remove', 'VB'), ('stopwords', 'NNS')]

🔤 POS Tags Explained

Tag   Meaning
NN    Noun
VB    Verb (base form)
JJ    Adjective
VBG   Verb (gerund)
NNS   Noun (plural)
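
🧠 Tip: NLTK also bundles documentation for the full Penn Treebank tagset. A quick lookup sketch (assumes the one-time 'tagsets' resource download):

import nltk

nltk.download('tagsets')

# Prints the definition and example words for a given tag
nltk.help.upenn_tagset('VBG')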

🌱 3. Lemmatization

📘 What is Lemmatization?

Lemmatization reduces a word to its dictionary form (the lemma), taking its POS tag into account. Unlike stemming, it looks the word up in a lexicon, so the output is always a valid word.

✍️ Code Example with WordNetLemmatizer

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

# Helper to convert POS tags for WordNet
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # default

# Lemmatize with POS
lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) 
                    for word, tag in tagged_words]

print(lemmatized_words)

📤 Output:

['simple', 'sentence', 'go', 'remove', 'stopword']

🎯 Notice how:

  • “going” → “go”

  • “stopwords” → “stopword”
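
To see why the POS tag matters, try lemmatizing "going" without it. A minimal sketch reusing the lemmatizer from above:

# Without a POS hint, WordNet defaults to noun and leaves "going" unchanged
print(lemmatizer.lemmatize("going"))                # going

# With the verb tag, it reduces to the base form
print(lemmatizer.lemmatize("going", wordnet.VERB))  # go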


🌿 4. Stemming

📘 What is Stemming?

Stemming applies crude heuristic rules to strip suffixes and reduce a word to a root-like stem. It's fast but can produce non-dictionary words.

📌 Use when speed > linguistic precision.

✍️ Code Example with NLTK’s Porter Stemmer

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Apply stemming
stemmed_words = [stemmer.stem(word) for word in filtered_words]

print(stemmed_words)

📤 Output:

['simpl', 'sentenc', 'go', 'remov', 'stopword']

👀 Note that:

  • “simple” ➝ “simpl”

  • “sentence” ➝ “sentenc”

Not always clean, but efficient for document-level tasks like search indexing.
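
Porter isn't the only option: NLTK also ships the SnowballStemmer ("Porter2"), a slightly more careful refinement of the original algorithm. A quick comparison sketch:

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')

# Compare the two stemmers side by side
for word in ["fairly", "generously", "sentence"]:
    print(word, "->", porter.stem(word), "|", snowball.stem(word))

On "fairly", for instance, Porter produces "fairli" while Snowball returns "fair".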


🧠 Stemming vs Lemmatization

Feature             Stemming                  Lemmatization
Output              Not always a valid word   Always a valid word
Approach            Heuristic                 Rule-based + dictionary
Speed               Faster                    Slower
Context awareness   No                        Yes
Use case            Search, fast filtering    Text classification, NLU
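
The contrast is easiest to see side by side. A small sketch running both tools over the same words, with WordNet POS tags supplied by hand:

from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# (word, WordNet POS) pairs chosen to highlight the difference
examples = [("studies", wordnet.NOUN), ("better", wordnet.ADJ), ("running", wordnet.VERB)]

for word, pos in examples:
    print(word, "| stem:", porter.stem(word), "| lemma:", lemmatizer.lemmatize(word, pos))

Expect "studies" to stem to the non-word "studi" but lemmatize to "study", and notice the lemmatizer mapping "better" to "good" via the dictionary, something no suffix-stripping rule can do.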

🔥 All-in-One with spaCy

If you're working on real-world NLP projects, consider using spaCy—a fast and industrial-strength NLP library.

✍️ spaCy Code Example

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a simple sentence, and we are going to remove stopwords from it.")

# Apply stopword removal, lemmatization, and punctuation filtering in one go
tokens = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
print(tokens)

📤 Output:

['simple', 'sentence', 'go', 'remove', 'stopword']

✨ One line = All steps. That’s spaCy magic!
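
The same doc also carries the POS information from section 2: each token exposes a coarse tag (token.pos_) and the fine-grained Penn Treebank tag (token.tag_):

# Reusing the doc from above: lemma plus both levels of POS tag
for token in doc:
    if not token.is_stop and token.is_alpha:
        print(token.text, token.lemma_, token.pos_, token.tag_)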


🧬 Final Recap: From Zero to Hero

Task               Tool(s) used     Purpose
Stopword Removal   NLTK, spaCy      Remove common, low-value words
POS Tagging        NLTK, spaCy      Understand grammatical structure
Lemmatization      WordNet, spaCy   Reduce words to base (dictionary) form
Stemming           NLTK (Porter)    Quick root-word extraction

💡 Pro Tips for Real Projects

  • Use spaCy for production—it's optimized for performance.

  • Customize the stopword list depending on your domain (e.g., legal, medical).

  • Don’t blindly remove all stopwords: words like “not” can flip sentiment (see the sketch after this list)!

  • Always lemmatize after POS tagging for best results.

  • Prefer lemmatization over stemming for model training tasks.
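
Putting two of these tips together: a minimal sketch that starts from NLTK's list, keeps negations, and adds domain-specific noise words (the added terms here are hypothetical placeholders, not a recommended list):

from nltk.corpus import stopwords

# Start from NLTK's English stopword list...
custom_stops = set(stopwords.words('english'))

# ...keep negations so sentiment isn't destroyed...
custom_stops -= {"not", "no", "nor"}

# ...and add domain noise words (hypothetical legal-domain examples)
custom_stops |= {"hereinafter", "whereas"}

filtered = [w for w in words if w.isalnum() and w not in custom_stops]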


🎯 Conclusion

Preprocessing is not just cleanup—it’s strategic preparation that decides how well your NLP models will perform. Whether you're analyzing tweets or building chatbots, mastering these fundamental steps sets the stage for success.

Ready to take the next step? Stay tuned for the next post where we’ll dive into Text Vectorization—from Bag-of-Words to Word Embeddings!

