Natural Language Processing Basics


Introduction
NLP works as a bridge, a connection between computer science, AI, and natural human language. Basically, it involves programming computers to analyze large amounts of natural language data. Computers are very good at handling numerical data, but unstructured text like PDFs, emails, text commands and similar formats is another story: we can read them as humans, but computers can't easily, because the data is unstructured and comes in a variety of languages. In this blog I'm going to discuss some basic techniques of the process, how the bridge actually works.
spaCy Setup
spaCy is an open-source natural language processing library. It efficiently handles all the necessary parts of NLP. Funnily enough, spaCy implements only one method per task (choosing the most efficient algorithm currently available), which means you don't have the option to choose other algorithms. Another option is NLTK, the Natural Language Toolkit, which is also a very popular library for NLP. It provides a lot of functionality but is slower and less efficient.
For many common NLP tasks, spaCy is fast and efficient. NLTK has a variety of algorithms from which you can choose, whereas spaCy defaults to the currently most efficient one.
But spaCy has no pre-built models for some applications, for example sentiment analysis, which is easier to perform with NLTK.
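As a quick illustration of that point, here's a minimal sketch of sentiment analysis with NLTK's built-in VADER analyzer (VADER is just one of the options NLTK offers, and the example sentence is made up):
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()
# polarity_scores returns neg/neu/pos/compound scores for the text
print(sia.polarity_scores("spaCy is fast and surprisingly easy to use!"))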
We’ll use both spaCy and NLTK based on their suitability. Also, Google Colab has both spaCy and NLTK pre-installed, so you don’t need to go through the pain of installing them through pip.
But let me tell you what you need to do:
For Windows users:
Open cmd and run it as an administrator. It will automatically open to System32.
If you're in a folder with permission issues, type
cd ..
and then
cd ..
again to navigate to the C drive.
If you're using Anaconda, type
conda install -c conda-forge spacy
and hit enter to install spaCy.
After spaCy is downloaded, ensure you have full administrative capabilities and no firewall blocking downloads.
Type
python -m spacy download en
and hit enter to download the English language library. (On newer spaCy versions the en shortcut no longer exists, so download en_core_web_sm instead, as we do below.)
On success, you should see "linking successful," indicating you can now load the model with spacy.load('en') (or spacy.load('en_core_web_sm') on newer versions).
spaCy basics
Importing spaCy in Google Colab is straightforward.
First, you need to check/install the spacy package using pip.
spaCy requires a pre-trained language model. To download the English model, we pass a specific string, en_core_web_sm, which stands for core English language, small version.
Import spacy and load the model. It might take some time because it's a fairly large library.
Create a doc variable and pass it a string. Yes, we have a method named nlp() for that.
# Step 1: Install spaCy
!pip install spacy
# Step 2: Download the English model
!python -m spacy download en_core_web_sm
# Step 3: Import and use spaCy
import spacy
nlp = spacy.load("en_core_web_sm") #known as loading a model
doc = nlp("I want c.a.t.s and $100 million cash") #doc abject by applying the nlp mpdel.
# Print tokens and part-of-speech tags
for token in doc:
print(token.text, token.pos_)
spaCy parses the entire string into tokens. After printing, you can see that spaCy is smart enough to recognize "c.a.t.s" as a single word, even though it has dots in it. It is also smart enough to realize that the $ sign and 100 should be separated.
We're also going to print some more information.
Let's print token.pos, which stands for part of speech. When we run it, we see numbers like 95, 99, 99. Each number corresponds to a part of speech, like adverb, verb, noun, conjunction, etc.
If you want the actual name, just use pos_. Run that, and it will tell you the part of speech. Notice how it's smart enough to recognize parts of speech too! And if you add token.dep_, it will give you more information. "Dep" stands for syntactic dependency. However, it's not necessary here.
nlp.pipeline #Run this in a different cell and you will see the pipeline.
When we run NLP, our text goes through a processing pipeline that first breaks down the text and then performs a series of operations like tagging, parsing, and describing the data. The basic NLP pipeline includes a tagger, a parser, and NER, which stands for named entity recognizer. We'll discuss each of these in more detail later on.
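If you only want the component names, without the component objects themselves, there is also nlp.pipe_names (a small aside; the exact names depend on your spaCy version and model):
nlp.pipe_names  # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'] for en_core_web_sm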
doc2 = nlp("Tesla isn't looking into startups anymore")
for i in doc2:
    print(i.text, i.pos_, i.dep_)
Spacy recognizes "isn't" as two different words and understands that "n't" is a negative word. It also notices the unusual space in between and marks it as well!
We can index it as well.
print(doc2[0])
print(doc2[-1])
doc2[3].pos_
doc2[4].dep_
Check out these links: https://spacy.io/usage, https://spacy.io/usage/linguistic-features. They are very useful and will help you a lot.
Moving forward, now I have a very large document. Sometimes, large docs can be hard to work with.
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')
life_quote = doc3[16:30]
print(life_quote) #this is called a span
What I did here is called a span. (As a matter of fact, I of course know the starting and ending index.) A span is a slice of the doc. Maybe we don’t need the whole document, maybe we need only a small section. In that case, we need spans.
But wait, the interesting part comes after this.
type(life_quote) #spacy knows that it's a span
Spacy knows!
type(doc)
Spacy knows which is what. It’s really doing a lot of work under the hood!
Not only this, spaCy can also separate sentences and understands them in order, like 1st sentence, 2nd sentence and so on.
#spacy can separate the sentences too!
doc4 = nlp("This is the 1st sentence. This is 2nd. And this is 3rd")
for x in doc4.sents: #sents attribute
    print(x)
The catch is, you have to use periods to separate the sentences, that’s it.
#spacy can detect the starting/ending token of a sentence too. It returns a boolean value
doc4[6].is_sent_start
#doc4[7].is_sent_end
spaCy completely understands sentences!
Tokenization
Tokenization is the process of breaking up the original raw text into component pieces, otherwise known as tokens. You are now very familiar with tokens. spaCy is able to detect suffixes, prefixes, symbols, concatenations of words, and even exceptions. Tokens are literally the basic building blocks of a doc object, and we saw that spaCy handles all of these things really well.
You now have some idea of tokens. Let’s dig deeper.
myString = '"We\'re moving to L.A.!"' #backslash to maintain the string properly
myString
doc5 = nlp(myString)
for t in doc5:
    print(t)
Run it, and you will get the opening quotation mark " as its own token. We're gets separated into We and the apostrophe 're, moving and to each become tokens, L.A. is kept together, and then we have the exclamation mark (and the closing quote) as tokens too. Basically, spaCy is going to separate punctuation that does not form an integral part of a word. Things like quotation marks, commas, and punctuation at the end of the sentence will be assigned their own token.
However, punctuation that exists as part of an email address, website, or numerical value will be kept as part of the token.
doc6 = nlp("We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")
for t in doc6:
    print(t)
This was actually a complex string to analyze. Here we have a dash/hyphen, then a dot that is not the end of a sentence, and a lot of punctuation marks. But spaCy is smart enough to understand each of them.
If we want to count the number of tokens in a doc, we just use the len() function.
# checking how many tokens a doc has
len(doc6)
We know how to do indexing. But one thing you need to keep in mind is that you can’t reassign tokens. Colab will say the object does not support item assignment. I mean, fair enough, right? spaCy does so much work to understand each and every aspect of a document. Of course it’s not going to let you overwrite the whole thing with something like doc5[0] = 'change text'.
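If you want to see that error for yourself, here's a tiny sketch that wraps the assignment in try/except so the notebook keeps running:
# Doc objects don't support item assignment; trying it raises a TypeError
try:
    doc5[0] = 'change text'
except TypeError as e:
    print(e)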
However, it understands entities very well! It recognizes that certain words refer to an organization, a location, money, dates and so on.
Here's how it works:
#entities
doc7 = nlp("Apple to build a Hong Kong factory for $6 million")
for entity in doc7.ents:
    print(entity)
spaCy knows that Apple, Hong Kong and $6 million are named entities.
#entities
doc7 = nlp("Apple to build a Hong Kong factory for $6 million")
for entity in doc7.ents:
    print(entity)
    #we can label those entities too
    print(entity.label_)
    print()
spaCy has an amazing function called explain().
#more explanation (this uses the last entity from the loop above)
print(spacy.explain(entity.label_))
We can extract chunks too. For example, let me show you noun chunks.
# noun chunks
doc8 = nlp("Autonomous cars shift insurance liability toward manufacturers")
for chunks in doc8.noun_chunks:
    print(chunks)
Did you know we can visualize tokenization? spaCy has a built-in displacy module for visualization.
from spacy import displacy
doc9 = nlp(u'Apple is going to build a U.K. factory for $6 million.')
displacy.render(doc9, style='dep', jupyter=True, options={'distance': 110}) #params
displacy.render(doc9) #it works just fine too
Call displacy, then render, and pass in the doc. Then we have some params here for customization purposes.
Now I’m gonna show you another style
from spacy import displacy
doc9 = nlp(u'Apple is going to build a U.K. factory for $6 million.')
displacy.render(doc9, style='ent', jupyter=True) #params
Basically what it does is, it attempts to find the entities in a doc and then highlights them. Cool right?
For more options, check out https://spacy.io/usage/visualizers. There you will get a bunch of styling options.
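For example, the ent style accepts options too; a small sketch (the entity filter and the color here are just made-up choices to show the idea):
# Only highlight ORG and MONEY entities, and give ORG a custom color
options = {'ents': ['ORG', 'MONEY'], 'colors': {'ORG': 'yellow'}}
displacy.render(doc9, style='ent', jupyter=True, options=options)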
Stemming
Stemming is the idea of extracting the main word, the root word (or you can say the base word). Take 'boat', 'boater', 'boating', and 'boats': here 'boat' is going to be the stem for all of these words.
This works well in most cases, but English has many exceptions where a more advanced process is needed. spaCy doesn't even include a stemmer; it has opted for lemmatization instead, which we'll cover next. However, stemming is often discussed in natural language processing, so it's important to understand its basics before moving on to lemmatization.
Now, as spaCy doesn’t have stemming, we're going to jump over to NLTK and learn about various stemmers.
We’re going to import the Porter Stemmer, which is a widely used stemming algorithm in natural language processing (NLP).
#using porter stemmer
import nltk
#import PorterStemmer
from nltk.stem.porter import PorterStemmer
#create an instance
p_stemmer = PorterStemmer()
#list of words
words = ['run', 'runner', 'ran', 'easily', 'fairly', 'fairness','frog', 'soup']
for word in words:
    print(word + '------->' + p_stemmer.stem(word))
There is a better, more sophisticated version of the stemmer, the Snowball stemmer.
#using snowball stemmer (better version)
from nltk.stem.snowball import SnowballStemmer
s_stemmer = SnowballStemmer(language='english') #you'll need a language param, or it will throw an error
words = ['run', 'runner', 'ran', 'easily', 'fairly', 'fairness','frog', 'soup']
for i in words:
    print(i + '------> ' + s_stemmer.stem(i))
fairly and fairness both get reduced to fair with Snowball.
Hopefully, this gives you an idea of how to use NLTK to stem words. The specific stem isn't very important. What matters is understanding the process happening here. These stemmers follow a set of algorithmic rules to reduce words to a root idea or root word. Often, when you're reading books in natural language processing, people like to perform stemming before analysis to reduce words to their roots.
As I mentioned, Spacy doesn't include stemming because it considers lemmatization a more effective way to reduce words. So now, off to lemmatization.
Lemmatization
Unlike stemming, lemmatization goes beyond just cutting down words. It looks at the full vocabulary of a language to analyze the word's formation. For example, the lemma of "was" is "be," and the lemma of "mice" is "mouse." So, instead of just shortening words, we consider the word's full context. Also, the lemma of "meeting" could be "meet" or "meeting," depending on how it's used in a sentence. Lemmatization is generally more informative than simple stemming, which is why the spaCy library only uses lemmatization instead of stemming.
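Just to sanity-check those two examples with the same en_core_web_sm model we loaded earlier (the sentence itself is made up):
# "was" should lemmatize to "be" and "mice" to "mouse"
for token in nlp("I was chasing the mice"):
    print(token.text, '->', token.lemma_)
Now, a longer example: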
doc10 = nlp("I'm a runner running a race brcause i love to run since i ran today")
for token in doc10:
print(token.text, '\t', token.pos_, '\t', token.lemma, '\t',token.lemma_)
You’ll get a table like this: the words, their part of speech, a number, and the lemma. The number you see here points to a specific lemma inside the language library (recall "en_core_web_sm"). Each number is actually an individual hash working as a reference to its lemma. And lemma_ is the actual lemma. Note that the words running, run, and ran all get simplified to the lemma "run." You can confirm that it's the same lemma "run" by checking the hash value for this lemma. It's almost like a large hash table with many lemmas and parts of speech that you can look up.
However, the alignment is slightly off. We can write a function using f-string literals to format this decently.
def show_lemma(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

show_lemma(doc10)
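And about that numeric hash: you can look it up (and reverse it) through nlp.vocab.strings, which acts like the big lookup table mentioned above:
run_hash = nlp.vocab.strings['run']
print(run_hash)                     # the integer hash spaCy stores for "run"
print(nlp.vocab.strings[run_hash])  # and back to the string "run"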
Stop words
Words like "a" and "the" appear so often in text that they don't need to be tagged as thoroughly as unique nouns, verbs, and modifiers. We call these words stop words, and they can be filtered out from the text before processing. Spacy has a built-in list of about 305 English stop words. Essentially, stop words are very common words that don't provide extra information and can sometimes even hinder your natural language processing. You should always remove this stop words.
print(nlp.Defaults.stop_words)
Additionally, you can check how many words are in the default stop word list, and even check whether a particular word is a stop word or not.
len(nlp.Defaults.stop_words)
Now, to check a word using the vocab:
nlp.vocab['is'].is_stop #is_stop is an attribute
nlp.vocab['fiction'].is_stop #is_stop is an attribute
You can add a stop word for your particular data set.
nlp.Defaults.stop_words.add('btw')
#manually grab it
nlp.vocab['btw'].is_stop = True
len(nlp.Defaults.stop_words)
nlp.vocab['btw'].is_stop
You can also remove a stop word depending on your situation/model.
nlp.Defaults.stop_words.remove('btw')
#manually grab it
nlp.vocab['btw'].is_stop = False
len(nlp.Defaults.stop_words)
nlp.vocab['btw'].is_stop
Phrase matching and vocabulary
Now, we're going to identify and label specific phrases that match patterns we can define ourselves. You can think of this as a powerful version of regular expressions, where we actually take parts of speech into account for our pattern search. Let's explore this concept with spaCy.
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab) #passing nlp.vocab
Creating the patterns we want to match on. The way this works with spaCy is by using a list where you provide a series of dictionaries. First, we set the key 'LOWER' and the value to the lowercase text we want to match. Then, we read a document or string and try to find "solar power." We want to detect it in three ways: as one word, with a dash in the middle, or as two separate words. So, we're looking for "solarpower" as a single word, "solar-power" with a dash, or "solar power" as two words. We're trying to find all three of these patterns.
# SolarPower
pattern1 = [{'LOWER': 'solarpower'}]
# Solar power (two separate words)
pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]
# Solar-power (with punctuation in between)
pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]
matcher.add('SolarPower', [pattern1,pattern2,pattern3])
doc11 = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')
found_matches = matcher(doc11)
print(found_matches)
Here I have added the name I'd like to give the match ('SolarPower') and the patterns to match, in a list.
matcher.add('SolarPower', [pattern1,pattern2,pattern3])
Creating a document
doc11 = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')
found_matches = matcher(doc11) #pass the doc in matches
print(found_matches)
The first piece of info is the match ID, and the following numbers are the start and end token indexes.
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc11[start:end]                  # get the matched span
    print(match_id, string_id, start, end, span.text)
This is a for-loop which will print out everything in a nice way.
# Redefine the patterns:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]
# Remove the old patterns first, then add the new set under the same 'SolarPower' name:
matcher.remove('SolarPower')
matcher.add('SolarPower', [pattern1, pattern2])
Inside the dictionary, we set the 'OP' key to an asterisk ('*'). This allows that part of the pattern to match zero or more times.
Just to clarify, pattern1 can recognize "solarpower" or "SolarPower" with a capitalized S and P. Pattern2 can identify "solar" followed by any amount of punctuation and then "power". The asterisk in 'OP' allows for this flexibility. So it can be double dashes, an underscore, a plus, a dot, or any other punctuation, followed by "power." This makes it very versatile.
doc12 = nlp('Solar--power is solarpower yay!')
found_matches = matcher(doc12)
print(found_matches)
A recap:
Import Matcher from spacy.matcher.
Create patterns, which are a list of dictionaries based on the keywords and terms you are looking for.
Once you have those patterns, add them to the matcher with a name you provide.
Pass in any doc objects you have into this matcher, and it reports back the found matches.
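One last pointer: besides the token-level Matcher used above, spaCy also ships a PhraseMatcher for matching exact phrases from a terminology list. It isn't covered in this post, so treat this as a minimal sketch with made-up terms:
from spacy.matcher import PhraseMatcher

phrase_matcher = PhraseMatcher(nlp.vocab)
terms = ['solar power', 'solar energy']   # hypothetical terminology list
patterns = [nlp(term) for term in terms]  # pattern docs built from plain strings
phrase_matcher.add('SolarTerms', patterns)

doc13 = nlp('Investment in solar power and solar energy keeps growing.')
for match_id, start, end in phrase_matcher(doc13):
    print(nlp.vocab.strings[match_id], doc13[start:end].text)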
Next blog: Part of Speech Tagging and Named Entity Recognition.
