Text Preprocessing in NLP

Introduction

After collecting the data, the next step is to apply text pre-processing to the dataset.

Text pre-processing is of two types:

  1. Basic Text pre-processing

  2. Advanced Text pre-processing

    • POS (Part-of-Speech) tagging (a short example follows this list)

    • Chunking

    • Parsing

    • Coreference resolution
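
These advanced steps are not covered in detail in this post, but as a quick illustration of the first one, a POS tagger labels each token with its grammatical role. Below is a minimal sketch using NLTK (it assumes the required NLTK data has already been downloaded):

import nltk
from nltk import word_tokenize, pos_tag

# assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger') have been run
tokens = word_tokenize('I love my country')
print(pos_tag(tokens))
# roughly: [('I', 'PRP'), ('love', 'VBP'), ('my', 'PRP$'), ('country', 'NN')]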


Lowercasing

In NLP, we convert all of the text to lowercase. Let's understand this with an example:

Run until you can't run anymore.

Here, the first "Run" would be treated as different from the second "run" by the NLP model purely because of the letter-case difference, even though they are the same word. Lowercasing removes this distinction.
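
In Python, this is a one-liner:

text = "Run until you can't Run anymore."
print(text.lower())  # run until you can't run anymore.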


Remove HTML Tags

HTML tags are meant for the browser, not for the reader, so they often need to be removed from scraped text. They can be stripped out with a regular expression (regex). Below is a function that removes HTML tags.

import re
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'', text)
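
For example, applied to a small snippet (the sample string here is only illustrative):

sample = '<p>Movies that make you <b>think</b></p>'
print(remove_html_tags(sample))  # Movies that make you think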

Remove URLs

Sometimes we need to remove URLs that do not contribute to model performance. For example, when building a chatbot that analyses conversations, URLs carry no useful meaning and add nothing to the NLP model.

URLs can also be handled with regular expressions and removed from the dataset. Below is the implementation:

import re
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)
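
For example (the URL below is only illustrative):

sample = 'Check out my notebook at https://www.kaggle.com/some-notebook for the full code'
print(remove_url(sample))  # Check out my notebook at  for the full code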

Remove Punctuation

What are punctuations?

Punctuation refers to the symbols used in writing to separate sentences, clarify meaning, and indicate pauses or emphasis.

In Python, string.punctuation contains the following characters:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

In most cases, punctuation should be removed. If it is kept, later pre-processing steps inflate the vocabulary with tokens that carry no meaning, which only confuses the model. Below is one implementation to remove punctuation:

import string
exclude = string.punctuation

def remove_punc(text):
    for char in exclude:
        text = text.replace(char, '')
    return text

The code above is much slower than the version below; replacing characters one at a time can take roughly 20 times longer than using str.translate.

def remove_punc(text):
    return text.translate(str.maketrans('', '', exclude))
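
Both versions give the same output; for example (illustrative string, and the speed gap can be measured with the timeit module):

sample = 'Hello!!!, he said ---and went.'
print(remove_punc(sample))  # Hello he said and went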

Chat word treatment

With today's messaging habits, we use a lot of short hands while chatting.

Short hands in day-to-day talking refer to abbreviated forms of words or phrases used for convenience, speed, or informality, such as "brb" for "be right back," "idk" for "I don't know," or "asap" for "as soon as possible."

These short hands are not easily understood by a machine, which ultimately works only with numbers. If they are converted to numbers as they are, each abbreviation becomes a separate entity from its expanded form, which is not beneficial for our model.

To handle them, we have to decode the short hands back into their full forms and substitute those into the text before model building. To do so, we need a mapping of abbreviations to their full forms.

Short Hands Dataset

This dataset contains common abbreviations and their full forms, which can be loaded directly in our code. Below is the implementation to expand short hands:

chat_words = {...short_hands_dataset...}
def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words:
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    return ' '.join(new_text)
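
For illustration only, here is a tiny hand-written sample of such a mapping and how the function behaves with it; in practice chat_words would be loaded from the short-hands dataset mentioned above:

# illustrative sample; a real short-hands dataset contains many more entries
chat_words = {
    'BRB': 'be right back',
    'IDK': "I don't know",
    'ASAP': 'as soon as possible',
}
print(chat_conversion('idk when he will arrive, brb'))
# I don't know when he will arrive, be right back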

Spelling Correction

Spelling correction is a major concern when building an NLP model, because misspellings create redundancy in the data.

For example, please read the notebook, and also like the ntebook.

Here, "notebook" and "ntebook" mean the same thing. A human reads them as the same word, but a machine, which works only with numbers, treats them as two different tokens.

To avoid this, a spelling corrector should be used while processing the dataset. Several libraries can correct text, such as TextBlob and spaCy.

Below is the code of implementation using textblob:

from textblob import TextBlob
incorrect_text = "please read the notebook, and also like the ntebook."
textBlb = TextBlob(incorrect_text)
print(textBlb.correct().string)

This corrects the spelling mistakes. If you are working with a language that TextBlob does not support, you may have to build your own spelling corrector.


Removing Stop words

In English, many words are needed to form grammatical sentences but carry little meaning on their own, such as "the", "an", "a", "of", "are", and "is". Removing them reduces the dataset size and avoids unnecessary noise during model building.

They can be removed either with code written from scratch or with a library such as NLTK.

from nltk.corpus import stopwords
# assumes the stopwords corpus is downloaded: nltk.download('stopwords')
exclude = set(stopwords.words('english'))

def remove_stopwords(text):
    # keep only the words that are not stop words
    new_text = [word for word in text.split() if word not in exclude]
    return ' '.join(new_text)
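
Applied to an illustrative sentence:

sample = 'this is one of the best movies I have ever seen'
print(remove_stopwords(sample))  # roughly: one best movies I ever seen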

Handling Emojis

In day-to-day life we use emojis in place of many words, such as 😀 for happy. There are two ways to deal with them: remove them, or replace them with their textual meaning. Sometimes emojis carry useful information for the sentence, so the choice depends on the problem statement and the application being built.

To make emojis understandable to the machine, we first convert them into words, and later those words into numbers.

Below is the implementation to remove emojis:

import re
def remove_emoji(text):
    emoji_pattern = re.compile("["
                        u"\U0001F600-\U0001F64F"  # emoticons
                        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                        u"\U0001F680-\U0001F6FF"  # transport & map symbols
                        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                        u"\U00002702-\U000027B0"
                        u"\U000024C2-\U0001F251"
                        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
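
Applied to an illustrative string:

print(remove_emoji('Loved the movie. It was 😀'))  # Loved the movie. It was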

Another method is to use the emoji Python package:

import emoji
print(emoji.demojize('Python is ❤️'))

Tokenization

In Natural Language Processing (NLP), tokenization is the process of breaking down text into smaller units, called tokens, which can be words, phrases, sentences, or subwords. It serves as a fundamental step in preparing text data for analysis or model training by enabling machines to process and understand natural language effectively.

For example, I am an Indian. I love my country.

After word tokenize,

[I, am, an, Indian, ., I, love, my, country, .]

Or if sentence tokenize,

[I am an Indian., I love my country.]

What’s the need of tokenization?

It can be understood with an example. Suppose we have to analyse customer reviews of a particular product and classify their sentiment as positive or negative. One simple approach is to count the positive and negative words in each review, which requires splitting the text into word tokens first.

Tokenization also lets us create more features, such as the number of unique words in a sentence, and much more.
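
As a rough sketch of that idea (the word lists below are illustrative, not a standard sentiment lexicon):

# count positive and negative words in a review using its word tokens
positive_words = {'good', 'great', 'love', 'excellent'}
negative_words = {'bad', 'poor', 'hate', 'terrible'}

review = 'I love the camera but the battery is bad'
tokens = review.lower().split()
print(sum(t in positive_words for t in tokens),
      sum(t in negative_words for t in tokens))  # 1 1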

Disadvantages

  • For example, I am new in new delhi.

    Here, "new" is treated as a single token, and "new delhi" is not recognised as one entity, which is a major disadvantage of plain word tokenization.

  • Prefix: Character(s) at the beginning of a token, e.g. the $ in "$20" or the ( in "(name)".

  • Suffix: Character(s) at the end of a token, e.g. the km in "10km", the ) in "(name)", or a trailing . or !.

  • Infix: Character(s) in the middle of a token, e.g. - or / in hyphenated or slash-separated words.

  • Exception: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied.

import re
msg = 'I am going to delhi!'
tokens = re.findall(r"[\w']+", msg)
print(tokens)  # ['I', 'am', 'going', 'to', 'delhi']

There are many ways to tokenize using Python packages; two popular ones are NLTK and spaCy.

For word tokenization, NLTK provides word_tokenize.

For sentence tokenization, NLTK provides sent_tokenize.

from nltk.tokenize import word_tokenize, sent_tokenize
msg = 'I have a Ph.D in A.I'
print(word_tokenize(msg))

import spacy
nlp = spacy.load('en_core_web_sm')
print([token.text for token in nlp(msg)])

Stemming

In grammar, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood.

For example, walk, walking, walked, walks, etc.

This is a problem for NLP model building: all these forms share the same root meaning, but to the model each inflected form is a separate entity.

Stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the language.

For example, probably after stemming becomes probabl where probabl is not a word in English dictionary.
Similarly, movie transforms to movi.

To do stemming, we can use the NLTK Python package, which provides two commonly used stemmers:

  • For English, the Porter Stemmer.

  • For several other languages (and English as well), the Snowball Stemmer; a short example appears at the end of this section.

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
def stem_words(text):
    return ' '.join([ps.stem(word) for word in text.split()])

sample = 'walk walks walking walked'
print(stem_words(sample))
  • Stemming is very fast because it applies simple rule-based algorithms.

  • It is similar to lemmatization, but a lemma is always a valid dictionary word, which makes lemmatization slower than stemming.
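
For completeness, a minimal Snowball Stemmer example (English shown here; several other languages are supported):

from nltk.stem.snowball import SnowballStemmer

ss = SnowballStemmer('english')  # e.g. 'french', 'german', 'spanish' also work
print([ss.stem(word) for word in 'walk walks walking walked'.split()])
# ['walk', 'walk', 'walk', 'walk']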


Lemmatization

Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization, the root word is called a lemma.

A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.

import nltk
from nltk.stem import WordNetLemmatizer
# assumes the WordNet corpus is downloaded: nltk.download('wordnet')
word_lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time."
punctuations = "?:!.,;"
sentence_words = nltk.word_tokenize(sentence)
# drop punctuation tokens before lemmatizing
sentence_words = [word for word in sentence_words if word not in punctuations]

for word in sentence_words:
    print("{0:20}{1:20}".format(word, word_lemmatizer.lemmatize(word)))

If the model's text output needs to be human-readable, lemmatization is preferred. If the speed of model building matters more, stemming is preferred.

Thanks for reading this blog and giving your precious time!
If you find this blog to be useful, upvote it!
0
Subscribe to my newsletter

Read articles from Avdhesh Varshney directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Avdhesh Varshney
Avdhesh Varshney

I am an aspiring data scientist. Currently, I'm pursuing B.Tech from Dr. B R Ambedkar NIT Jalandhar. Contributed a lot in many open-source programs and secured top ranks amongs them.