Natural Language Processing Basics

Fatima Jannet

Introduction

NLP works as a bridge between computer science, AI, and natural human language. Essentially, it involves programming computers to analyze large amounts of natural language data. Computers are very good at handling numerical data, but text, like PDFs, emails, text commands, and similar formats, is unstructured and comes in a variety of languages; we can read it as humans, but computers can't make sense of it directly. In this blog I'm going to discuss some basic techniques behind the process, how the bridge actually works.

SpaCy Setup

spaCy is an open-source natural language processing library. It efficiently handles all the core pieces you need for NLP. Interestingly, spaCy typically ships a single implementation per task (the most efficient algorithm currently available), which means you don't get to choose between alternative algorithms. Another option is NLTK, the Natural Language Toolkit, which is also a very popular library for NLP. It provides a lot of functionality but is slower and less efficient.

For many common NLP tasks, spaCy is fast and efficient. NLTK offers a variety of algorithms to choose from, whereas spaCy defaults to the single most efficient one currently available.

However, spaCy has no pre-built models for some applications, for example sentiment analysis, which is easier to perform with NLTK.

We’ll use both spaCy and NLTK depending on which is better suited to the task. Also, Google Colab has both spaCy and NLTK pre-installed, so you don’t need to go through the pain of installing them through pip.

But let me tell you what you need to do:

For Windows users:

  • Open cmd as an administrator. It will automatically open in System32.

  • If you're in a folder with permission issues, type cd .. and then cd .. to navigate to the C drive.

  • If you're using Anaconda, type conda install -c conda-forge spacy and hit enter to install spaCy.

  • After spaCy is downloaded, ensure you have full administrative capabilities and no firewall blocking downloads.

  • Type python -m spacy download en (or, on newer spaCy versions, python -m spacy download en_core_web_sm) and hit enter to download the English language library.

  • On success, you should see "linking successful," indicating you can now load the model with spacy.load('en') (on newer versions, load it with spacy.load('en_core_web_sm'), as we do below).

spaCy basics

Importing spaCy in Google Colab is straightforward.

  1. First, you need to check/install the spacy package using pip

  2. spaCy requires a pre-trained language model. To download the English one, we pass the specific string en_core_web_sm, which stands for the core English model trained on web text, small version.

  3. Import spacy and load the model. It might take some time because it’s a fairly large library.

  4. Create a doc variable and pass a string to nlp(). Yes, the loaded model itself is callable.

# Step 1: Install spaCy
!pip install spacy

# Step 2: Download the English model
!python -m spacy download en_core_web_sm

# Step 3: Import and use spaCy
import spacy

nlp = spacy.load("en_core_web_sm")  # known as loading a model
doc = nlp("I want c.a.t.s and $100 million cash")  # create a doc object by applying the nlp model

# Print tokens and part-of-speech tags
for token in doc:
    print(token.text, token.pos_)

spaCy parses the entire string into tokens. After printing, you can see that spaCy is smart enough to recognize "c.a.t.s" as a word, even though it has dots in it. It is also smart enough to realize that the $ sign and 100 should be separated.

We're also going to print some more information.

Let's print token.pos, which stands for part of speech. When we run it, we see numbers like 95, 99, 99. Each number corresponds to a part of speech, like adverb, verb, noun, conjunction, etc.

If you want the actual name, just use pos_. Run that, and it will tell you the part of speech. Notice how it's smart enough to recognize parts of speech too! And if you add token.dep_, it will give you more information. "Dep" stands for syntactic dependency. However, it's not necessary here.
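For reference, here is a minimal sketch (reusing the doc object from above) that prints the numeric ID next to its readable name, plus the dependency label:

# Numeric tag vs. readable tag, plus syntactic dependency
for token in doc:
    print(token.text, token.pos, token.pos_, token.dep_)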

nlp.pipeline #Run this in a different cell and you will see the pipeline.

When we run NLP, our text goes through a processing pipeline that first breaks down the text and then performs a series of operations like tagging, parsing, and describing the data. The basic NLP pipeline includes a tagger, a parser, and NER, which stands for named entity recognizer. We'll discuss each of these in more detail later on.
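If you only want the component names rather than the component objects, nlp.pipe_names gives a shorter view. A quick sketch (the exact list depends on the model version):

# Just the component names of the loaded model
nlp.pipe_names  # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

Back to the examples: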

doc2 = nlp("Tesla isn't looking into       startups anymore")
for i in doc2: 
  print(i.text, i.pos_, i.dep_)

spaCy recognizes "isn't" as two different tokens and understands that "n't" is a negation. It also notices the unusual whitespace in between and marks it as a token as well!

We can index it as well.

print(doc2[0])
print(doc2[-1])

doc2[3].pos_
doc2[4].dep_

Check out these links: https://spacy.io/usage, https://spacy.io/usage/linguistic-features. They are very useful and will help you a lot.

Moving forward, now I have a fairly large document. Sometimes large docs can be hard to work with.

doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')
life_quote = doc3[16:30]
print(life_quote) #this is called a span

What I did here is called a span. (As a matter of fact, I of course already know the starting and ending indexes.) A span is a slice of the doc. Maybe we don’t need the whole document, maybe we only need a small section. In that case, we use spans.

But wait, the interesting part comes after this.

type(life_quote) #spacy knows that it's a span

Spacy knows!

type(doc)

Spacy knows which is what. It’s really doing a lot of work under the hood!

Not only this, spaCy can also separate sentences and understands their order, like 1st sentence, 2nd sentence, and so on.

#spaCy can separate the sentences too! 
doc4 = nlp("This is the 1st sentence. This is 2nd. And this is 3rd")

for x in doc4.sents: #sents attribute
  print(x)

The catch is that you have to use periods to separate the sentences, that’s it.

#spaCy can detect the starting/ending token of a sentence too. It returns a boolean value 

doc4[6].is_sent_start
#doc4[7].is_sent_end

spaCy completely understands sentences!

Tokenization

Tokenization is the process of breaking up the original raw text into component pieces, otherwise known as tokens. By now you are quite familiar with tokens. The tokenizer is able to detect prefixes, suffixes, symbols, concatenations of words, and even exceptions. Tokens are literally the basic building blocks of a doc object, and we saw that spaCy handles all of these things really well.
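If you're curious about which rule produced each token, newer spaCy versions (2.3+) expose nlp.tokenizer.explain(), which reports whether a piece came from a prefix, suffix, infix, or a special-case rule. A minimal sketch:

# Show which tokenizer rule produced each token (requires spaCy >= 2.3)
for rule, text in nlp.tokenizer.explain("Let's go to N.Y.!"):
  print(rule, repr(text))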

You now have some idea of tokens. Let’s dig deeper.

myString = '"We\'re moving to L.A.!"'  # backslash to keep the apostrophe inside the string
myString
doc5 = nlp(myString) 
for t in doc5: 
  print(t)

Run it, and you will get the opening quotation mark " as its own token. spaCy was able to separate "We" and the apostrophe contraction 're as well.

"moving" and "to" are separate tokens, "L.A." is kept together, and then we have the exclamation mark as a token too. Basically, spaCy separates punctuation that does not form an integral part of a word. Things like quotation marks, commas, and punctuation at the end of a sentence are assigned their own tokens.

However, punctuation that exists as part of an email address, website, or numerical value will be kept as part of the token.

doc6 = nlp("We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")
for t in doc6: 
  print(t)

This was actually a complex string to analyze. Here we have a dash/hyphen, dots that are not the end of a sentence, and a lot of punctuation marks. But spaCy is smart enough to handle each of them.

If we want to count the number of tokens in a doc, just use the len() function.

# checking how many tokens a doc has 
len(doc6)

We know how to do indexing. But one thing you need to keep in mind is that you can’t reassign tokens. Colab will throw a TypeError saying the object does not support item assignment. Fair enough, right? spaCy does so much work to understand each and every aspect of a document; of course it isn’t going to let you overwrite parts of it with something like doc5[0] = 'change text'.
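For example, here is a minimal sketch of what happens if you try (wrapped in try/except so the cell keeps running):

# Doc objects don't support item assignment
try:
  doc5[0] = 'change text'
except TypeError as e:
  print(e)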

However, spaCy understands entities very well! It recognizes that certain words refer to organizations, locations, money, dates, and so on. Here's how it works:

# entities 

doc7 = nlp("Apple to build a Hong Kong factory for $6 million")

for entity in doc7.ents: 
  print(entity)

spaCy knows that Apple, Hong Kong, and $6 million are named entities.

# entities, with labels this time 

doc7 = nlp("Apple to build a Hong Kong factory for $6 million")

for entity in doc7.ents: 
  print(entity)
  #we can label those entities too 
  print(entity.label_)
  print()

spaCy has an amazing function called spacy.explain().

  # more explanation (this line goes inside the same entity loop)
  print(spacy.explain(entity.label_))
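You can also pass any label string to spacy.explain() directly; a quick sketch (the exact wording of the descriptions may differ between spaCy versions):

# explain() works on plain label strings too
print(spacy.explain('ORG'))    # e.g. "Companies, agencies, institutions, etc."
print(spacy.explain('GPE'))    # e.g. "Countries, cities, states"
print(spacy.explain('MONEY'))  # e.g. "Monetary values, including unit"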

We can extract chunks too; for example, let me show you noun chunks:

# noun chunks 
doc8 = nlp("Autonomous cars shift insurance liability toward manufacturers")

for chunk in doc8.noun_chunks:
  print(chunk)

Did you know we can visualize this? spaCy has a built-in visualizer called displaCy.

from spacy import displacy

doc9 = nlp(u'Apple is going to build a U.K. factory for $6 million.')
displacy.render(doc9, style='dep', jupyter=True, options={'distance': 110}) #params 

displacy.render(doc9) #it works just fine too

Call displacy, then render, and pass in the doc. Then we have some params here for customization purposes.

Now I’m gonna show you another style

from spacy import displacy

doc9 = nlp(u'Apple is going to build a U.K. factory for $6 million.')
displacy.render(doc9, style='ent', jupyter=True) #params

Basically, it attempts to find the entities in a doc and then highlights them. Cool, right?

For more options, check out https://spacy.io/usage/visualizers. Here you will get a bunch of styling options.

Stemming

It is the idea of extracting the main word, the root word (or you can say the base word). Take 'boat', 'boater', 'boating', and 'boats': here 'boat' is the stem for all of these words.

This works well in most cases, but English has many exceptions where a more advanced process is needed. spaCy doesn't even include a stemmer; it has opted for lemmatization instead, which we'll cover next. However, stemming is often discussed in natural language processing, so it's important to understand its basics before moving on to lemmatization.

Since spaCy doesn’t have stemming, we're going to jump over to NLTK and learn about various stemmers.

We’re going to import the PorterStemmer, a widely used stemming algorithm in natural language processing (NLP).

#using porter stemmer 
import nltk

#import PorterStemmer 
from nltk.stem.porter import PorterStemmer

#create an instance
p_stemmer = PorterStemmer()

#list of words
words = ['run', 'runner', 'ran', 'easily', 'fairly', 'fairness','frog', 'soup']

for word in words: 
  print(word + '------->' + p_stemmer.stem(word))

There is a better, more sophisticated stemmer: the Snowball stemmer.

#using snowball stemmer (better version) 
from nltk.stem.snowball import SnowballStemmer 

s_stemmer = SnowballStemmer(language='english')  # you need the language param, or it will throw an error

words = ['run', 'runner', 'ran', 'easily', 'fairly', 'fairness','frog', 'soup']

for i in words: 
  print(i + '------> ' + s_stemmer.stem(i))

fairly and fairness both go down to "fair" with the Snowball stemmer.
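To see the difference side by side, here is a small sketch that runs both stemmers (p_stemmer and s_stemmer from above) over the same word list:

# Compare Porter vs. Snowball on the same words
for word in words:
  print(f'{word:10} porter: {p_stemmer.stem(word):10} snowball: {s_stemmer.stem(word)}')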

Hopefully, this gives you an idea of how to use NLTK to stem words. The specific stem isn't very important. What matters is understanding the process happening here. These stemmers follow a set of algorithmic rules to reduce words to a root idea or root word. Often, when you're reading books in natural language processing, people like to perform stemming before analysis to reduce words to their roots.

As I mentioned, Spacy doesn't include stemming because it considers lemmatization a more effective way to reduce words. So now, off to lemmatization.

Lemmatization

Unlike stemming, lemmatization goes beyond just cutting down words. It looks at the full vocabulary of a language to analyze the word's formation. For example, the lemma of "was" is "be," and the lemma of "mice" is "mouse." So, instead of just shortening words, we consider the word's full context. Also, the lemma of "meeting" could be "meet" or "meeting," depending on how it's used in a sentence. Lemmatization is generally more informative than simple stemming, which is why the spaCy library only uses lemmatization instead of stemming.
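A tiny sketch of those examples; with a well-trained model you should see "was" reduced to "be", "mice" to "mouse", and the two uses of "meeting" handled differently depending on part of speech (the exact output depends on the model version):

# Lemmas depend on context: "meeting" as a verb vs. as a noun
for token in nlp("I was meeting the mice at a meeting"):
  print(token.text, '->', token.lemma_)

Now, a longer example: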

doc10 = nlp("I'm a runner running a race because I love to run since I ran today")

for token in doc10: 
  print(token.text, '\t', token.pos_, '\t', token.lemma, '\t',token.lemma_)

You’ll get a table like this: the words, their part of speech, and a number that points to a specific lemma inside the language library (recall "en_core_web_sm"). Each number is actually a hash that works as a reference to its lemma. And lemma_ is the actual lemma text. Note that the words "running", "run", and "ran" all get simplified to the lemma "run". You can confirm it's the same lemma by checking the hash value. It's almost like a large hash table of lemmas and parts of speech that you can look up.
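To confirm that the hash really is just a reference, you can look it up in both directions through nlp.vocab.strings. A minimal sketch:

run_hash = nlp.vocab.strings['run']  # string -> hash
print(run_hash)
print(nlp.vocab.strings[run_hash])   # hash -> string, gives back 'run'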

However, the alignment of that printout is slightly off. We can write a small function using f-string literals to format it more neatly.

def show_lemma(text): 
  for token in text: 
    print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

show_lemma(doc10)

Stop words

Words like "a" and "the" appear so often in text that they don't need to be tagged as thoroughly as unique nouns, verbs, and modifiers. We call these words stop words, and they can be filtered out from the text before processing. spaCy has a built-in list of about 305 English stop words. Essentially, stop words are very common words that don't provide extra information and can sometimes even hinder your natural language processing. Depending on the task, you will often want to remove these stop words.

print(nlp.Defaults.stop_words)

Additionally, you can check how many words are in the default stop word list and even check whether a given word is a stop word or not.

len(nlp.Defaults.stop_words)

Now for the check, using the vocab:

nlp.vocab['is'].is_stop #is_stop is an attribute
nlp.vocab['fiction'].is_stop #is_stop is an attribute

You can add a stop word for your particular data set.

nlp.Defaults.stop_words.add('btw')
#manually grab it 
nlp.vocab['btw'].is_stop = True
len(nlp.Defaults.stop_words)
nlp.vocab['btw'].is_stop

You can also remove a stop word depending on your situation/model.

nlp.Defaults.stop_words.remove('btw')
#manually grab it 
nlp.vocab['btw'].is_stop = False
len(nlp.Defaults.stop_words)
nlp.vocab['btw'].is_stop

Phrase matching and vocabulary

Now, we're going to identify and label specific phrases that match patterns we can define ourselves. We can think of this as a more powerful version of regular expressions, where we actually take parts of speech into account in our pattern search. Let's explore this concept with spaCy.

from spacy.matcher import Matcher 
matcher = Matcher(nlp.vocab) #passing nlp.vocab

Creating the patterns we want to match on. The way this works in spaCy is that each pattern is a list of dictionaries, one dictionary per token. We set the key 'LOWER' so the match is done on the lowercased text. Then we read a document or string and try to find "solar power". We want to detect it in three ways: as one word, with a dash in the middle, or as two separate words. So we're looking for "solarpower" as a single word, "solar-power" with a dash, or "solar power" as two words. We're trying to find all three of these patterns.

# SolarPower (one word)
pattern1 = [{'LOWER': 'solarpower'}]

# Solar power (two separate words)
pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]

# Solar-power (punctuation in between)
pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]


matcher.add('SolarPower', [pattern1,pattern2,pattern3]) 


doc11 = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')


found_matches = matcher(doc11)
print(found_matches)

Here I have added the name for the match ('SolarPower') and the list of patterns to match.

matcher.add('SolarPower', [pattern1,pattern2,pattern3])

Creating a document

doc11 = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')
found_matches = matcher(doc11) #pass the doc in matches
print(found_matches)

The first piece of info is the match ID, and the following numbers are the start and end token indexes of the match.

for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc11[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

This is a for-loop which will print out everything in a nice way.

# Redefine the patterns:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]

# Remove the old 'SolarPower' patterns so they don't linger, then add the new set:
matcher.remove('SolarPower')
matcher.add('SolarPower', [pattern1, pattern2])

Inside the dictionary, we set the key 'OP' to the asterisk string '*'. This allows that token pattern to match zero or more times.

Just to clarify, pattern1 recognizes "solarpower" in any casing, such as "SolarPower" with a capital S and P. pattern2 identifies "solar" followed by any amount of punctuation, then "power". The asterisk in OP allows for this flexibility: it can be double dashes, an underscore, a plus, a dot, or any other punctuation, followed by "power". This makes it very versatile.

doc12 = nlp('Solar--power is solarpower yay!')

found_matches = matcher(doc12)
print(found_matches)

A recap (a compact end-to-end sketch follows this list):

  • Import matcher from spacy.

  • Create patterns, which are a list of dictionaries based on the keywords and terms you are looking for.

  • Once you have those patterns, add them to the matcher with a name you provide.

  • Pass in any doc objects you have into this matcher, and it reports back the found matches.
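Putting the recap together, here is a compact end-to-end sketch using a hypothetical 'HelloWorld' pattern (just for illustration, not part of the example above):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# "hello", optionally followed by punctuation, then "world" (case-insensitive)
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'world'}]
matcher.add('HelloWorld', [pattern])

doc = nlp("Hello, world! Hello world!")
for match_id, start, end in matcher(doc):
  print(nlp.vocab.strings[match_id], doc[start:end].text)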

Next blog: Part of Speech Tagging and Named Entity Recognition.
