Intro to NLP: NLP Text Preprocessing

Mariam Hamzat

I recently studied NLP text preprocessing, and to be honest, this module had me in a chokehold. I had to take the terms one by one and research them to understand why these steps are used to get text ready and how to do them.

In this blog, I'll share what I've learnt about Natural Language Processing (NLP), the steps involved in preparing text, and simple explanations to help you understand what they mean.

What Is NLP All About?

Imagine trying to talk to a computer like you talk to a friend. You use words, sentences, and sometimes even slang or jokes. But computers don't naturally understand human language the way humans do. Computers prefer things in neat, organised numbers.

Natural Language Processing (NLP) is all about teaching computers to "read" and "understand" everyday human language. It's the bridge that lets computers make sense of the large amount of text data out there, from emails and social media posts to news articles and books.

You need to think of it like this: if a computer can understand what you type or say, it can do amazing things like:

  • Answer your questions (like Google or Siri)

  • Translate languages (like Google Translate)

But before a computer can do any of that clever stuff, it needs your help to get the text ready. That's where NLP text preprocessing comes in.

Why Do We "Clean Up" Our Text? (The Why Behind Preprocessing)

You know how, when you're cooking potatoes, you first wash them, peel them, and chop them into smaller pieces? You don't just throw the whole, dirty potato into the pot!

Working with text for computers is similar. The raw text, straight from a book or a webpage, is often messy. It has capital letters, punctuation, common words that don't add much meaning, and different versions of the same word. If you feed this messy text directly to a computer program, it will get confused.

NLP text preprocessing is simply the process of tidying up this raw text. It is done for a few key reasons:

  • You want the computer to treat the same word consistently, whether it's at the start of a sentence or in the middle. This means "The" and "the" should be understood as the same thing.

  • Breaking down text into smaller, easier-to-manage pieces helps the computer process it faster and more accurately.

Now, let's review some of these text preprocessing steps. They are crucial for preparing your text before you use it to train any AI model, especially when you're working with NLP.

NLP Text Preprocessing Steps

  1. Lowercase: The first and simplest step in working with text data is turning everything into lowercase letters. You might be wondering why you should do this; take a look at this scenario:

Imagine your computer is analysing a lot of text, and the word "The" appears at the start of a sentence, while "the" appears in the middle. Without lowercasing, your program might treat "The" and "the" as two entirely different words. Lowercasing ensures that words like these are recognized as the same thing. This helps keep your data consistent and makes it easier for a computer to count and understand words correctly. It also simplifies other cleaning steps later on.

However, you have to note this: lowercasing can sometimes remove important distinctions. For example, "Apple" (referring to the company) and "apple" (referring to the fruit) are clearly different things. Similarly, "US" (referring to a country) is distinct from "us" (meaning "you and I"). When you lowercase, both "Apple" (the company) and "apple" (the fruit) become "apple". For many NLP tasks, this loss of distinction is acceptable because you're more interested in the core word.

Let's see how easy it is to make text lowercase using Python's built-in lower() string method.

sentence = "Her dog's name is Laina"
print(f"Original: {sentence}")

lower_sentence = sentence.lower()
print(f"Lowercased: {lower_sentence}")

The output will be:

Original: Her dog's name is Laina
Lowercased: her dog's name is laina
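
To see what this consistency buys you, here's a quick sketch (the sentence is just made up for illustration) that counts words before and after lowercasing:

from collections import Counter

words = "The cat saw the dog because the dog barked".split()

# Without lowercasing, 'The' and 'the' are counted as two different words
print(Counter(words))
# Counter({'the': 2, 'dog': 2, 'The': 1, 'cat': 1, 'saw': 1, 'because': 1, 'barked': 1})

# After lowercasing, they collapse into a single count
print(Counter(word.lower() for word in words))
# Counter({'the': 3, 'dog': 2, 'cat': 1, 'saw': 1, 'because': 1, 'barked': 1})
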
  2. Stopwords: "Stop words" are very common words in a language that typically don't carry much unique meaning on their own. Think of words like "and", "of", "a", or "to". These words appear so frequently that they can add unnecessary volume to your data without contributing much to the core understanding of the message. By removing them, you make your dataset smaller and cleaner. A smaller, cleaner dataset helps your computer programs process information faster and often leads to better, more accurate results in tasks like machine learning because the computer can focus on the more significant, content-bearing words.

    You often use the nltk package to find and remove these common words. First, you need to download the list of "stopwords" from nltk.

     import nltk
     # Run this once if you haven't downloaded stopwords yet:
     # nltk.download('stopwords')
    
     from nltk.corpus import stopwords
    
     # Get the list of English stopwords
     en_stopwords = stopwords.words('english')
    
     # Let's see some of them:
     print("Some English Stopwords:", en_stopwords[:10]) # This will display first 10
    
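    Once you have the list, removing stopwords from your own text is usually just a matter of checking each word against it. Here is a small sketch (the sentence is made up for illustration):

     from nltk.corpus import stopwords

     # Using a set makes the membership check faster
     en_stopwords = set(stopwords.words('english'))

     sentence = "the rise of artificial intelligence has led to significant advancements"
     words = sentence.split()

     # Keep only the words that are not in the stopword list
     filtered_words = [word for word in words if word not in en_stopwords]
     print(filtered_words)
     # ['rise', 'artificial', 'intelligence', 'led', 'significant', 'advancements']
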
  3. Regular Expressions: "Regular expressions", or "regex" for short, are like special codes for finding patterns in text. They're really handy when you want to search for text that fits a certain pattern, instead of just looking for an exact word or phrase.

      Why is it important for preprocessing? Take this scenario: you need to find all email addresses in a document, or all phone numbers, or perhaps replace all double spaces with a single space. You can't possibly list every single email address. But you can describe the pattern of an email address (like "characters@characters.domain"). Regex lets you do just that, giving you powerful control over text clean-up and extraction.
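
      For example, here is a quick sketch using Python's built-in re module to collapse double spaces and pull out anything that looks like an email address (the sample text and the simplified email pattern are just for illustration):

       import re

       text = "Contact  us at  support@example.com or sales@example.org  for help."

       # Replace runs of two or more spaces with a single space
       cleaned = re.sub(r' {2,}', ' ', text)
       print(cleaned)

       # Find anything that looks like an email address (a deliberately simple pattern)
       emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.]+', cleaned)
       print(emails)  # ['support@example.com', 'sales@example.org']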

  4. Tokenizing Text: This step is my favourite. It is a fundamental step in NLP that breaks your text down into smaller pieces, a process called tokenization. These smaller pieces are known as tokens. The most common type is "word tokenization", where each word becomes a token. However, tokens can also be sentences, parts of words, or even single characters, depending on your specific use case.

      Let's look at how to break text into sentences and sentences into words using the nltk package. You will need to download the punkt data for this.

       import nltk
       # Run this once if you haven't downloaded punkt yet:
       # nltk.download('punkt')
      
       from nltk.tokenize import word_tokenize, sent_tokenize
      
       # Sentence tokenization:
       sentences_text = "Her cat's name is Luna. Her dog's name is Max."
       individual_sentences = sent_tokenize(sentences_text)
       print(f"Sentences: {individual_sentences}")
      
       # Word tokenization:
       single_sentence = "Her cat's name is Luna"
       individual_words = word_tokenize(single_sentence)
       print(f"Words: {individual_words}")
      

      The output of this will be:

       Sentences: ["Her cat's name is Luna.", "Her dog's name is Max."]
       Words: ['Her', 'cat', "'s", 'name', 'is', 'Luna']
      
  5. Stemming: Stemming helps to standardise words by reducing them to their basic "root" form. For example, words like 'connecting', 'connected', or 'connections' might all be stemmed to the root 'connect'. Stemming typically works by simply chopping off the endings of words.

        This helps reduce the total number of unique words in your dataset. If 'connect', 'connecting', and 'connected' all convey roughly the same meaning for your specific task, treating them as one word ('connect') simplifies your data. This reduction in unique words makes your data smaller and less complex, which is beneficial for preparing it for machine learning programs.

        Note, though, that some stemmed words might not be meaningful; for example, ‘beautiful’ might become ‘beauti’.

        To handle stemming, you'll use PorterStemmer from nltk.

         from nltk.stem import PorterStemmer
        
         # Create a stemmer tool
         ps = PorterStemmer()
        
         words_to_stem = ['connecting', 'connected', 'connectivity', 'connect', 'connects']
        
         print("Stemming examples:")
         for word in words_to_stem:
             print(f"'{word}' : '{ps.stem(word)}'")
        
         print("\nMore examples:")
         learn_words = ['learned', 'learning', 'learn', 'learns', 'learner', 'learners']
         for word in learn_words:
             print(f"'{word}' : '{ps.stem(word)}'")
        
         likes_words = ['likes', 'better', 'worse']
         for word in likes_words:
             print(f"'{word}' : '{ps.stem(word)}'")
        

        The output for this stemming operation will be:

         Stemming examples:
         'connecting' : 'connect'
         'connected' : 'connect'
         'connectivity' : 'connect'
         'connect' : 'connect'
         'connects' : 'connect'
        
         More examples:
         'learned' : 'learn'
         'learning' : 'learn'
         'learn' : 'learn'
         'learns' : 'learn'
         'learner' : 'learner'
         'learners' : 'learner'
         'likes' : 'like'
         'better' : 'better'
         'worse' : 'wors' # If you notice 'wors' isn't a real word
        
  6. Lemmatization: Lemmatization is another way to standardise words by reducing them to their base form, like stemming. However, lemmatization is generally more intelligent! Instead of just chopping off word endings, it uses a pre-defined dictionary and an understanding of language to find the meaningful base form of a word. This base form is called a "lemma". This means your processed text retains more of its original meaning, which can be particularly important for NLP tasks where precise word meaning is critical.

        For lemmatization, you'll often use WordNetLemmatizer from nltk. You will also need to download the wordnet corpus for its dictionary.

         import nltk
         # Run this once if you haven't downloaded wordnet yet:
         # nltk.download('wordnet')
        
         from nltk.stem import WordNetLemmatizer
        
         # Create a lemmatizer tool
         lemmatizer = WordNetLemmatizer()
        
         words_to_lemmatize = ['connecting', 'connected', 'connectivity', 'connect', 'connects']
        
         print("Lemmatization examples (connect group):")
         for word in words_to_lemmatize:
             # For verbs, specify pos='v' for more accurate lemmatization
             print(f"'{word}' : '{lemmatizer.lemmatize(word, pos='v')}'")
        
         print("\nLemmatization examples (learn group):")
         learn_words = ['learned', 'learning', 'learn', 'learns', 'learner', 'learners']
         for word in learn_words:
             print(f"'{word}' : '{lemmatizer.lemmatize(word, pos='v')}'")
        
         print("\nLemmatization examples (likes/better/worse group):")
         likes_words = ['likes', 'better', 'worse']
         for word in likes_words:
             # Default pos is 'n' (noun). For 'better'/'worse' as adjectives, use pos='a' to get 'good'/'bad'
             print(f"'{word}' : '{lemmatizer.lemmatize(word, pos='a')}'")
        

        The output will be:

         Lemmatization examples (connect group):
         'connecting' : 'connect'
         'connected' : 'connect'
         'connectivity' : 'connectivity'
         'connect' : 'connect'
         'connects' : 'connect'
        
         Lemmatization examples (learn group):
         'learned' : 'learn'
         'learning' : 'learn'
         'learn' : 'learn'
         'learns' : 'learn'
         'learner' : 'learner'
         'learners' : 'learners'
        
         Lemmatization examples (likes/better/worse group):
         'likes' : 'likes'
         'better' : 'good'
         'worse' : 'bad'
        

        You need to note this for lemmatization: you'll observe that WordNetLemmatizer often produces more accurate base forms, like 'good' from 'better' or 'bad' from 'worse', when you provide the word's "part-of-speech" (whether it's a noun, verb, or adjective). By default, it assumes words are nouns, which might not always give the desired result for verbs or adjectives. This shows that lemmatization is smarter but sometimes needs a little guidance from you.
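
        Here's a tiny sketch of that difference, reusing the same WordNetLemmatizer as above:

         from nltk.stem import WordNetLemmatizer

         lemmatizer = WordNetLemmatizer()

         # With the default part-of-speech (noun), 'better' is left unchanged
         print(lemmatizer.lemmatize('better'))           # better

         # Telling it 'better' is an adjective gives the true base form
         print(lemmatizer.lemmatize('better', pos='a'))  # good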

  7. N-grams: So far, you've learnt how to break down text into single words (tokens). But sometimes, understanding individual words isn't enough to grasp the full meaning of a sentence. Think about phrases like "not good" or "very happy". If you only look at "good" or "happy" on their own, you might miss the complete picture.

        This is where N-grams come in. An N-gram is simply a sequence of n neighbouring words (or tokens). The n can be any number you choose, telling you how many words to group.

        • If n=1, you're looking at single words. These are called unigrams. (You've already been doing this when you tokenize words)

        • If n=2, you're looking at pairs of words. These are called bigrams.

        • If n=3, you're looking at groups of three words. These are called trigrams.

And so on. The "N" in N-gram just means "any number" of words you want to group.

Let's take a short list of tokens (words) and see how you can make unigrams, bigrams, and trigrams. You'll use the ngrams function from nltk.

            from nltk import ngrams
            from collections import Counter # This helps you count how often things appear

            tokens = ['the', 'rise', 'of', 'artificial', 'intelligence', 'has', 'led', 'to', 'significant', 'advancements']
            print(f"Original tokens: {tokens}")

            # Unigrams (n=1)
            # These are just the original individual tokens.
            print("\nUnigrams (single words):")
            for gram in ngrams(tokens, 1):
                print(gram)

            # Bigrams (n=2)
            # Here, words are paired up.
            print("\nBigrams (pairs of words):")
            for gram in ngrams(tokens, 2):
                print(gram)

            # Trigrams (n=3)
            # Here, words are grouped in threes.
            print("\nTrigrams (groups of three words):")
            for gram in ngrams(tokens, 3):
                print(gram)

            # You can also count how frequently specific N-grams appear.
            # For instance, let's count bigrams:
            bigram_counts = Counter(ngrams(tokens, 2))
            print("\nBigram counts (top 3):", bigram_counts.most_common(3))

The output of this will be:

            Original tokens: ['the', 'rise', 'of', 'artificial', 'intelligence', 'has', 'led', 'to', 'significant', 'advancements']

            Unigrams (single words):
            ('the',)
            ('rise',)
            ('of',)
            ('artificial',)
            ('intelligence',)
            ('has',)
            ('led',)
            ('to',)
            ('significant',)
            ('advancements',)

            Bigrams (pairs of words):
            ('the', 'rise')
            ('rise', 'of')
            ('of', 'artificial')
            ('artificial', 'intelligence')
            ('intelligence', 'has')
            ('has', 'led')
            ('led', 'to')
            ('to', 'significant')
            ('significant', 'advancements')

            Trigrams (groups of three words):
            ('the', 'rise', 'of')
            ('rise', 'of', 'artificial')
            ('of', 'artificial', 'intelligence')
            ('artificial', 'intelligence', 'has')
            ('intelligence', 'has', 'led')
            ('has', 'led', 'to')
            ('led', 'to', 'significant')
            ('to', 'significant', 'advancements')

            Bigram counts (top 3): [(('the', 'rise'), 1), (('rise', 'of'), 1), (('of', 'artificial'), 1)]
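
Before wrapping up, here's a small sketch of how several of these steps can fit together on a single sentence. The order and the sample text are just one reasonable choice, not the only way to do it:

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
# (stopwords, punkt, and wordnet need to be downloaded once, as shown earlier)

text = "The rise of Artificial Intelligence has led to significant advancements."

# 1. Lowercase
text = text.lower()

# 2. Tokenize into words
tokens = word_tokenize(text)

# 3. Remove stopwords and punctuation-only tokens
en_stopwords = set(stopwords.words('english'))
tokens = [t for t in tokens if t.isalpha() and t not in en_stopwords]

# 4. Lemmatize what's left (default part-of-speech is noun)
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t) for t in tokens]

print(tokens)
# ['rise', 'artificial', 'intelligence', 'led', 'significant', 'advancement']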

And that marks the end of the preprocessing steps. NLP preprocessing might seem like a lot of individual steps, but each one plays a vital role in getting your text data ready for computers to understand. It took me a week to learn and understand, and it may take more time for you. You have to take it easy and dissect it. By taking raw, messy human language and carefully cleaning and structuring it, you unlock its power for all sorts of amazing artificial intelligence and machine learning applications. Keep practising, and these crucial steps will soon become second nature to you.
