NLP - Part of Speech Tagging and Named Entity Recognition


This is part 2 of my NLP series. Enjoy!
Part of Speech tagging
So far I’ve shown you how to get the part of speech of the tokens. Now I’ll dive deeper.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("The quick brown fox jumped over the lazy dog")
for token in doc:
    print(f"{token.text:{10}} {token.pos_:{10}} {token.tag:{10}} {token.tag_:{10}} {spacy.explain(token.tag_):{10}}")
The tag attribute gives you the numerical hash value, while tag_ (with the underscore) gives you the fine-grained tag. Finally, if you'd like a clear explanation, use spacy.explain(), which you learned about in the previous blog. Use f-string literals for a neat output.
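That neat, aligned output comes from Python's own f-string width specifiers, not from spaCy. A minimal sketch with made-up values:

```python
# Each {value:{width}} pads the value to a fixed column width,
# so rows line up no matter how long each word is
word, pos = "fox", "NOUN"
line = f"{word:{10}} {pos:{10}}"
print(line)
```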
I’m going to show another interesting thing.
doc1 = nlp("I read a book on NLP")
word = doc1[1]
token = word
print(f"{token.text:{10}} {token.pos_:{10}} {token.tag:{10}} {token.tag_:{10}} {spacy.explain(token.tag_):{10}}")
doc1 = nlp("I read books on NLP")
word = doc1[1]
token = word
print(f"{token.text:{10}} {token.pos_:{10}} {token.tag:{10}} {token.tag_:{10}} {spacy.explain(token.tag_):{10}}")
The difference lies in the sentence structure; otherwise both words are 'read', and spaCy was able to detect perfectly that it's a past-tense verb in the first sentence and a present-tense verb in the second.
Doc also has a count_by() method for frequency counting by attribute.
pos_count = doc.count_by(spacy.attrs.POS)
pos_count
It’s a dictionary, mapping each attribute's hash value to how many times it appeared. You can also look up which POS a number stands for using vocab.
doc.vocab[84].text
So, there are 3 adjectives in doc.
For a more refined look- using a loop and little bit of string formatting will do the job.
for k,v in sorted(pos_count.items()):
    print(f"{k}. {doc.vocab[k].text:{5}} {v}")
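Under the hood this is just frequency counting. A rough stdlib equivalent, with a hand-written tag list standing in for the real doc's token.pos_ values, looks like:

```python
from collections import Counter

# Hand-written POS tags standing in for token.pos_ values from the doc
pos_tags = ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADP", "DET", "ADJ", "NOUN"]
pos_count = Counter(pos_tags)
for tag, freq in sorted(pos_count.items()):
    print(f"{tag:{5}} {freq}")
```

The only difference is that count_by keys the dictionary by hash values rather than by the tag strings themselves.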
In a similar way, we can count tags too.
tag_count = doc.count_by(spacy.attrs.TAG)
for k,v in sorted(tag_count.items()):
    print(f"{k}. {doc.vocab[k].text:{5}} {v}")
Same for DEP too!
dep_count = doc.count_by(spacy.attrs.DEP)
for k,v in sorted(dep_count.items()):
    print(f"{k}. {doc.vocab[k].text:{5}} {v}")
Visualizing Part of Speech
Let’s have a recap on how to visualize:
doc2 = nlp("The quick brown fox jumped over the lazy dog")
from spacy import displacy
displacy.render(doc2)
You can customize it too!
options = {'distance': 110, 'compact': 'True', 'color': 'yellow', 'bg': '#09a3d5', 'font': 'Times'}
displacy.render(doc2, style='dep', options=options)
You can set the distance between words and adjust it to be more or less spread out, and you can decide if you want the layout to be compact; spaCy will try to make it as compact as possible based on the distance you choose. Note that "compact" is set to "True" as a string here. You can pick a color for the text using basic names or hex codes, like "yellow", and set the background with "bg", again using a color name or hex code. If you are a hex-code picker, I’m sure you can mix and match in your own way. You can also choose the font, but not all fonts are available; use fonts listed in the spaCy documentation or general browser fonts. For example, for Times New Roman, just use "Times". Run this render, and you'll see the options in action.
Named Entity Recognition
Common Entity Labels in spaCy's English Models (en_core_web_*)
Here’s a list of standard entity types for the English pipeline:
| Label | Description | Example |
| --- | --- | --- |
| PERSON | People, including fictional. | "Elon Musk" |
| NORP | Nationalities, religious or political groups. | "Americans", "Christians" |
| FAC | Buildings, airports, highways, etc. | "Golden Gate Bridge" |
| ORG | Companies, institutions, agencies. | "Apple Inc.", "NASA" |
| GPE | Countries, cities, states. | "France", "New York" |
| LOC | Non-GPE locations (mountains, lakes). | "Mount Everest" |
| PRODUCT | Objects, vehicles, foods, etc. | "iPhone", "Coca-Cola" |
| EVENT | Named events (wars, sports, festivals). | "Olympics", "World War II" |
| WORK_OF_ART | Books, songs, movies. | "The Mona Lisa", "Hamlet" |
| LAW | Legal document titles. | "First Amendment" |
| LANGUAGE | Named languages. | "English", "Spanish" |
| DATE | Absolute or relative dates. | "2025", "next Monday" |
| TIME | Times smaller than a day. | "2:30 PM", "an hour" |
| PERCENT | Percentages. | "50%", "100 percent" |
| MONEY | Monetary values. | "$10", "500 euros" |
| QUANTITY | Measurements (weight, distance, etc.). | "5 kilometers", "10 pounds" |
| ORDINAL | "First", "second", etc. | "1st place", "third" |
| CARDINAL | Numerals not covered by others. | "one", "two", "100" |
Adding Named Entities to a Span
Named Entity Recognition, or NER for short, aims to find and classify named entity mentions in unstructured text into predefined categories. These categories include person names, organizations, locations, medical codes, time expressions, quantities, percentages, monetary values, and more. There are many types of entities that can be recognized. Thankfully, spaCy will automatically handle this for us.
# displaying basic entity information
def show_entity(doc):
    if doc.ents:
        for entity in doc.ents:
            print(f"{entity.text} -- {entity.label_} -- {spacy.explain(entity.label_)}")
    else:
        print("No entities found")
#demo 1
doc = nlp('Hi, how are you')
show_entity(doc)
#demo 2
doc = nlp("Hi, I'm Fatima. I live in Bangladesh")
show_entity(doc)
As you can see, spaCy has picked up the entities. Let’s try another document.
#demo 3
doc = nlp("Can I have 500 dollars of Microsoft stock?")
show_entity(doc)
You can also set the entity of a token yourself.
#demo 4
doc = nlp("Tesla to build a UK factory for $6 million")
show_entity(doc)
So right now, it understands that UK is some sort of country, city, or state, and that $6 million refers to money. But spaCy isn't recognizing Tesla as a named entity.
We saw Tesla in a previous example, and spaCy realized it was a proper noun, but it doesn't know that here we're referring to Tesla as a company. It would be nice if we could tell spaCy: hey, Tesla should be an ORG, a company or agency. I’ll show you how to set up a custom entity.
# setting up a custom entity
from spacy.tokens import Span
#grab the "ORG" as entity label
ORG = doc.vocab.strings["ORG"]
ORG #383, this is actually the hash value of ORG entity
# Create a new Span for the new entity ("Tesla", the first token)
new_entity = Span(doc, 0, 1, label=ORG)  # label takes the hash value we grabbed above
# Append the new entity to doc.ents
doc.ents = list(doc.ents) + [new_entity]  # doc.ents is a tuple, so we rebuild it instead of appending
In the code above, Span takes four arguments:
- doc is the name of the document object.
- 0 is the start index position of the span.
- 1 is the stop index position of the span (exclusive).
- label is the label assigned to our entity.
How doc.ents = list(doc.ents) + [new_entity] works:
- list(doc.ents) converts the immutable tuple of entities into a mutable list.
- + [new_entity] concatenates the existing entities with the new Span object in a list.
- doc.ents = ... assigns the new tuple (automatically created from the concatenated list) back to doc.ents.
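The reason for this list-then-reassign dance is that doc.ents is exposed as a tuple, and tuples are immutable. The same pattern sketched with plain Python strings (standing in for real Span objects):

```python
# doc.ents behaves like an immutable tuple, sketched here with plain strings
ents = ("UK", "$6 million")
# ents.append("Tesla") would raise AttributeError: tuples have no append
ents = tuple(list(ents) + ["Tesla"])  # convert, concatenate, rebuild
print(ents)  # ('UK', '$6 million', 'Tesla')
```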
# function we created before
show_entity(doc)
# output
# Tesla -- ORG -- Companies, agencies, institutions, etc.
# UK -- GPE -- Countries, cities, states
# $6 million -- MONEY -- Monetary values, including unit
So this is it. That’s how you add a named entity as a Span.
Adding Named Entities to all Matching Spans
We've learned how to add a single term as our own named entity (like adding "Tesla"). But what if we have multiple terms to add as potential entities? For example, if we are working with a vacuum company, we might want to add both vacuum cleaner and vacuum-cleaner as PRODUCT entities. Let's see how we can do this.
# creating a document
doc = nlp("Our company created a brand new vacuum cleaner."
          "This new vacuum-cleaner is the best in the show")
# checking if vacuum cleaner is an entity or not
show_entity(doc)
# Import PhraseMatcher
from spacy.matcher import PhraseMatcher
# create an instance and pass the vocab of document
# linking the matcher to the vocabulary
matcher = PhraseMatcher(nlp.vocab)
# Create the desired phrase patterns:
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']
# turning these into phrase patterns by passing them into the nlp function
phrase_pattern = [nlp(text) for text in phrase_list]
# Apply the patterns to our matcher object:
# You can name the matcher whatever you want
matcher.add('newproduct', phrase_pattern)  # in spaCy v3 the callback argument is optional and keyword-only
# Apply the matcher to our Doc object:
found_matches = matcher(doc)
found_matches
# Here we create Spans from each match, and create named entities from them:
from spacy.tokens import Span
#grab the "PRODUCT" as entity label
PROD = doc.vocab.strings["PRODUCT"]
Inside found_matches, each tuple looks like (match_id, start, end); we only care about the start and the end, which we'll use with Span to define the boundaries of each match.
We'll pass the original doc, grab each match's items at index 1 and 2, and set the label to PROD, which was defined earlier. Then we'll wrap it all in a list comprehension and call it new_ents.
# (2689272359382549672, 6, 8); we only need the numbers at the 2nd and 3rd index
new_ents = [Span(doc, match[1], match[2], label=PROD) for match in found_matches]
doc.ents = list(doc.ents) + new_ents
show_entity(doc)
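Since found_matches is just a list of (match_id, start, end) tuples, pulling out the span boundaries is plain tuple indexing. A sketch with hypothetical match tuples (the hash and positions are made up for illustration):

```python
# Hypothetical matcher output: (match_id_hash, start_token, end_token)
found_matches = [(2689272359382549672, 6, 8), (2689272359382549672, 11, 14)]
boundaries = [(match[1], match[2]) for match in found_matches]
print(boundaries)  # [(6, 8), (11, 14)]
```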
Visualizing Named Entity Recognition
I’m just going to drop the code here. As I’ve already discussed visualization, there’s not much to teach; rather, I’m going to show you how you can customize it.
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. '
u'By contrast, Sony sold only 7 thousand Walkman music players.')
# For style='ent', displacy will highlight entities
displacy.render(doc, style='ent', jupyter=True)
# For line by line, use a for loop.
# Separate out with sentence segmentation
for sent in doc.sents:
    # Passing the text of each individual sentence
    # Make sure to add style
    displacy.render(nlp(sent.text), style='ent', jupyter=True)
doc2 = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. '
u'By contrast, Sony sold only 7 thousand Walkman music players.')
for sent in doc2.sents:
    displacy.render(nlp(sent.text), style='ent', jupyter=True)
for sent in doc2.sents:
    docx = nlp(sent.text)
    if docx.ents:
        displacy.render(docx, style='ent', jupyter=True)
    else:
        print(docx.text)
# Additionally you can opt for which entities you want
# Store them in a dict
# Under the 'ents' key, pass a list of what you are interested in
options = {'ents': ['ORG', 'MONEY']}
# Then render the whole thing
displacy.render(doc, style='ent', jupyter=True, options=options)
# You can customize colours for different entities
# Create another dict for colors
colors = {'ORG': 'pink'}
# Inside your options, state 'colors' key and set it equal to the colour dictionary
options = {'ents': ['ORG', 'MONEY'], 'colors': colors}
displacy.render(doc, style='ent', jupyter=True, options=options)
# Hex code works too
colors = {'ORG': '#aa9cfc'}
options = {'ents': ['ORG', 'MONEY'], 'colors': colors}
displacy.render(doc, style='ent', jupyter=True, options=options)
# You can actually linear gradient them too!
# You can linear gradient
# You can radial gradient
colors = {'ORG': 'linear-gradient(90deg, #aa9cfc, #fc9ce7)', 'MONEY': 'radial-gradient(yellow, green)'}
options = {'ents': ['ORG', 'MONEY'], 'colors':colors}
displacy.render(doc, style='ent', jupyter=True, options=options)
Usually you won't need to customize color effects that often unless you have a very specific style.
Sentence Segmentation
# From Spacy Basics:
doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')
for sent in doc.sents:
    print(sent)
Keep in mind that doc.sents is a generator, so if you try to index into it, you will fail. It generates the sentences on the fly instead of storing them in memory.
Although you can grab tokens from the doc by indexing, you can’t grab sentences that way. So, put the sentences in a list; then you will be able to extract them by indexing.
list(doc.sents)[1]
type(list(doc.sents)[1])
But the data type will be a Span type.
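This isn't spaCy-specific; any Python generator behaves the same way. A plain-Python sketch with made-up sentences:

```python
# A generator yields items lazily, so it doesn't support indexing
sents = (s for s in ["First sentence.", "Second sentence.", "Third sentence."])
# sents[1] would raise TypeError: 'generator' object is not subscriptable
sent_list = list(sents)  # materialize the generator into a list
print(sent_list[1])  # Second sentence.
```

Note that materializing a generator consumes it, so list(doc.sents) can only be taken once per pass.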
# SPACY'S DEFAULT BEHAVIOR
doc3 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')
for sent in doc3.sents:
    print(sent)
Now I’ll show you two things:
- Adding a new segmentation rule
- Changing segmentation rules
# ADD A NEW RULE TO THE PIPELINE
from spacy.language import Language

# 1. Register your component with a decorator
@Language.component("custom_boundaries")  # Give it a name
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i+1].is_sent_start = True
    return doc

# 2. Add using the REGISTERED NAME (as string)
nlp.add_pipe("custom_boundaries", before="parser")
# 3. Verify
print(nlp.pipe_names)  # Should show your component
# Re-run the Doc object creation:
doc4 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')
for sent in doc4.sents:
    print(sent)
See you in the next blog!