20 spaCy Concepts with Before-and-After Examples

Anix Lynch
14 min read

1. Loading a Language Model (spacy.load) 🧠

Boilerplate Code:

import spacy
nlp = spacy.load("en_core_web_sm")

Use Case: Load a pre-trained language model to analyze text. 🧠

Goal: Initialize spaCy's language model for various NLP tasks. 🎯

Sample Code:

# Load English language model
nlp = spacy.load("en_core_web_sm")

# Example text
doc = nlp("This is a test sentence.")
print(doc)

Before Example: The intern needs a pre-trained model but doesn't know how to load it. 🤔

Need: Pre-trained NLP model.

After Example: With spacy.load(), the intern can now analyze text using the loaded model! 🧠

Loaded Model: "This is a test sentence."

Challenge: 🌟 Try loading a larger model (e.g., en_core_web_md or en_core_web_lg) for more advanced tasks.
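A minimal sketch of the challenge, assuming the medium model has already been downloaded (python -m spacy download en_core_web_md):

import spacy

# Medium and large models also ship static word vectors
nlp_md = spacy.load("en_core_web_md")

doc = nlp_md("This is a test sentence.")
print(doc[0].vector.shape)  # the md model provides 300-dimensional vectors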


2. Tokenization (spacy.tokens.Token) ✂️

Boilerplate Code:

from spacy.tokens import Token

Use Case: Split text into tokens (words or punctuation) using spaCy's tokenizer. ✂️

Goal: Tokenize text into individual words or punctuation marks. 🎯

Sample Code:

# Example text
doc = nlp("This is a test sentence.")

# Tokenize the text
tokens = [token.text for token in doc]
print(tokens)

Before Example: The intern has text but doesn't know how to split it into words. 🤔

Text: "This is a test sentence."

After Example: With spaCy tokenization, the text is split into tokens! ✂️

Tokens: ['This', 'is', 'a', 'test', 'sentence', '.']

Challenge: 🌟 Try tokenizing a more complex sentence with punctuation and special characters.
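A quick sketch of the challenge, mixing punctuation and special characters (token attributes like is_punct and like_email show how the tokenizer classified each piece):

# Tokenize text with punctuation, an email address, and a price
doc = nlp("Let's email support@example.com (ASAP!) about the $5.99 fee.")
for token in doc:
    print(token.text, token.is_punct, token.like_email)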


3. Named Entity Recognition (NER with doc.ents) 🏢

Boilerplate Code:

from spacy.tokens import Doc

Use Case: Extract named entities (people, organizations, locations) from text. 🏢

Goal: Identify and classify entities like names, dates, or places. 🎯

Sample Code:

# Example text
doc = nlp("Barack Obama was the president of the United States.")

# Extract named entities
for ent in doc.ents:
    print(ent.text, ent.label_)

Before Example: The intern has text but doesn't know which words refer to names or places. 🤔

Text: "Barack Obama was the president of the United States."

After Example: With spaCy NER, the intern identifies named entities! 🏢

Named Entities: "Barack Obama" (PERSON), "United States" (GPE)

Challenge: 🌟 Try analyzing a news article and extracting all named entities like people, locations, and organizations.
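A hedged sketch of the challenge, tallying entity labels in a slightly longer text (the article text is made up):

from collections import Counter

# Count how often each entity label appears in a longer text
article = ("Apple opened a new office in London. Tim Cook met officials "
           "from the U.K. government on Tuesday.")
doc = nlp(article)
print(Counter(ent.label_ for ent in doc.ents))
print([(ent.text, ent.label_) for ent in doc.ents])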


4. Part-of-Speech Tagging (token.pos_ and token.tag_) 🏷️

Boilerplate Code:

from spacy.tokens import Doc

Use Case: Assign part-of-speech (POS) tags to each word in a sentence. 🏷️

Goal: Understand the grammatical role of each word in a sentence. 🎯

Sample Code:

# Example text
doc = nlp("This is a test sentence.")

# Get POS tags for each token
for token in doc:
    print(token.text, token.pos_, token.tag_)

Before Example: The intern has a sentence but doesn't know the grammatical role of each word. 🤔

Sentence: "This is a test sentence."

After Example: With POS tagging, each word is tagged with its grammatical role! 🏷️

POS Tags: ('This', 'DET'), ('is', 'AUX'), ('a', 'DET'), ...

Challenge: 🌟 Try analyzing more complex sentences and observe how POS tags change with different sentence structures.
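If a tag abbreviation is unfamiliar, spaCy's built-in spacy.explain() helper decodes it:

import spacy

# Look up human-readable descriptions for tag abbreviations
print(spacy.explain("DET"))  # determiner
print(spacy.explain("AUX"))  # auxiliary
print(spacy.explain("VBZ"))  # verb, 3rd person singular present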


5. Dependency Parsing (token.dep_) 🌳

Boilerplate Code:

from spacy.tokens import Doc

Use Case: Extract dependency relationships between words (e.g., subject-verb-object). 🌳

Goal: Understand the syntactic structure of a sentence. 🎯

Sample Code:

# Example text
doc = nlp("I love programming in Python.")

# Display dependencies
for token in doc:
    print(token.text, token.dep_, token.head.text)

Before Example: The intern has a sentence but doesn't understand the grammatical relationships between words. 🤔

Sentence: "I love programming in Python."

After Example: With dependency parsing, the intern understands how words relate to each other! 🌳

Dependencies: ('I', 'nsubj', 'love'), ('love', 'ROOT', 'love'), ...

Challenge: 🌟 Try visualizing dependencies using spacy.displacy.render() for a better understanding of sentence structure.
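As a taste of what the labels enable, here is a rough sketch (not a robust extractor) that pulls naive subject-verb-object triples out of the parse:

# Rough sketch: walk from each nominal subject to its verb's direct object
doc = nlp("I love programming in Python.")
for token in doc:
    if token.dep_ == "nsubj":
        verb = token.head
        for child in verb.children:
            if child.dep_ == "dobj":
                print(token.text, verb.text, child.text)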


6. Similarity Comparison (doc.similarity) 🔗

Boilerplate Code:

from spacy.tokens import Doc

Use Case: Compare the similarity between words, sentences, or documents. 🔗

Goal: Measure how similar two pieces of text are. 🎯

Sample Code:

# Example sentences
doc1 = nlp("I love pizza.")
doc2 = nlp("I like pasta.")

# Compare similarity
similarity_score = doc1.similarity(doc2)
print(similarity_score)

Before Example: The intern has two sentences but doesn't know how to compare their similarity. 🤔

Sentences: "I love pizza." vs. "I like pasta."

After Example: With similarity comparison, the intern can measure how similar they are! 🔗

Similarity Score: 0.8

Challenge: 🌟 Try comparing the similarity between longer documents or paragraphs.
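One caveat: en_core_web_sm ships no real word vectors, so spaCy warns that its similarity scores are rough. A minimal sketch using the medium model instead (assuming it has been downloaded):

import spacy

# en_core_web_md includes static word vectors, so similarity is meaningful
nlp_md = spacy.load("en_core_web_md")
doc1 = nlp_md("I love pizza.")
doc2 = nlp_md("I like pasta.")
print(doc1.similarity(doc2))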


7. Text Lemmatization (token.lemma_) 🍂

Boilerplate Code:

from spacy.tokens import Token

Use Case: Perform lemmatization, which reduces words to their base form (e.g., "running" → "run"). 🍂

Goal: Normalize words to their dictionary form for easier analysis. 🎯

Sample Code:

# Example text
doc = nlp("The cats are running in the garden.")

# Get lemmas for each token
lemmas = [token.lemma_ for token in doc]
print(lemmas)

Before Example: The intern has words in different forms but wants to normalize them. 🤔

Words: "cats", "running", "garden"

After Example: With lemmatization, the intern reduces words to their base forms! 🍂

Lemmas: ['the', 'cat', 'be', 'run', 'in', 'the', 'garden', '.']

Challenge: 🌟 Try lemmatizing text in different tenses or forms and observe the results.
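A quick sketch of the challenge, running the same verb through several tenses:

# Lemmas collapse different inflections onto one base form
for text in ["She ran quickly.", "She runs quickly.", "She has been running."]:
    doc = nlp(text)
    print([(token.text, token.lemma_) for token in doc])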


8. Custom Named Entity Recognition (NER) 🔧

Boilerplate Code:

from spacy.tokens import Span

Use Case: Create custom named entities by labeling specific text patterns. 🔧

Goal: Extend spaCy's NER capabilities by adding custom entities. 🎯

Sample Code:

# Create a custom entity span ("Apple" = token 0)
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
org = Span(doc, 0, 1, label="ORG")

# Drop any model entities that overlap the new span, then add it
# (assigning overlapping spans to doc.ents raises an error)
doc.ents = [ent for ent in doc.ents
            if ent.end <= org.start or ent.start >= org.end] + [org]
print([(ent.text, ent.label_) for ent in doc.ents])

Before Example: The intern has a company name in the text but it's not labeled as an entity. 🤔

Text: "Apple is looking at buying U.K. startup."

After Example: With custom NER, the intern labels "Apple" as an organization! 🔧

Entities: [('Apple', 'ORG'), ('U.K.', 'GPE')]

Challenge: 🌟 Try adding custom entities for different types of data like product names or company names.
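For recurring patterns, spaCy's EntityRuler component does this declaratively. A minimal sketch (the PRODUCT pattern is just an illustration):

import spacy

nlp_ruler = spacy.load("en_core_web_sm")

# Insert a rule-based entity matcher before the statistical NER
ruler = nlp_ruler.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "PRODUCT", "pattern": "iPhone 15"}])

doc = nlp_ruler("The iPhone 15 sold out within hours.")
print([(ent.text, ent.label_) for ent in doc.ents])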


9. Word Vector Representation (nlp.vocab.vectors) 🔠

Boilerplate Code:

from spacy.vectors import Vectors

Use Case: Use word vectors to represent words as numerical vectors for machine learning tasks. 🔠

Goal: Convert words into vectors to perform mathematical operations on text. 🎯

Sample Code:

# Example word
word = nlp("apple")

# Get word vector
vector = word.vector
print(vector[:5])  # Print first 5 elements of the vector

Before Example: The intern has words but doesn't know how to represent them as numerical vectors. 🤔

Word: "apple"

After Example: With word vectors, the word is represented as a numerical vector! 🔠

Word Vector: [0.231, 0.127, 0.654, ...]

Challenge: 🌟 Try using vectors for similarity comparison between words or performing arithmetic operations on words (e.g., king - man + woman = queen).
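A hedged sketch of the word-analogy challenge. It assumes a model with static vectors (en_core_web_md or en_core_web_lg; the small model has none) and uses nlp.vocab.vectors.most_similar for the nearest-neighbour lookup:

import numpy
import spacy

nlp_md = spacy.load("en_core_web_md")

# king - man + woman, as plain vector arithmetic
query = (nlp_md.vocab["king"].vector
         - nlp_md.vocab["man"].vector
         + nlp_md.vocab["woman"].vector)

# most_similar expects a 2D array of query vectors
keys, _, scores = nlp_md.vocab.vectors.most_similar(numpy.asarray([query]), n=5)
print([nlp_md.vocab.strings[key] for key in keys[0]])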


10. Visualizing Dependencies (spacy.displacy.render) 🖼️

Boilerplate Code:

from spacy import displacy

Use Case: Visualize sentence structure using dependency parsing. 🖼️

Goal: Generate a visual representation of how words in a sentence are related. 🎯

Sample Code:

# Example text
doc = nlp("I love programming in Python.")

# Render dependency graph
displacy.render(doc, style="dep", jupyter=True)

Before Example: The intern has a sentence but finds it hard to understand how the words relate. 🤔

Sentence: "I love programming in Python."

After Example: With displacy, the intern can see a visual diagram of the sentence structure! 🖼️

Visual: Arrows showing grammatical relationships between words.

Challenge: 🌟 Try visualizing more complex sentences or paragraphs and see how the dependency structure changes.
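Outside a notebook, displacy.render returns the markup as a string you can save (displacy.serve starts a local preview server instead). A small sketch:

from pathlib import Path
from spacy import displacy

doc = nlp("I love programming in Python.")

# Without jupyter=True, render returns the SVG markup as a string
svg = displacy.render(doc, style="dep")
Path("sentence.svg").write_text(svg, encoding="utf-8")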


11. Pooling Token Vectors (token.vector) 🏊‍♂️

Boilerplate Code:

from spacy.tokens import Doc

Use Case: Perform pooling operations like sum or max-pooling to collapse many token vectors into a single fixed-size vector. 🏊‍♂️

Goal: Apply pooling operations over the vectors of a document's tokens. 🎯

Sample Code:

# Example text
doc = nlp("I love programming in Python.")

# Perform sum pooling
sum_pooling = sum(token.vector for token in doc)
print(sum_pooling[:5])  # Print the first 5 elements of the pooled vector

Before Example: The intern has one vector per word but needs a single vector for the whole sentence. 🤔

Word Vectors: [Vector of each word in the sentence]

After Example: With pooling, the intern combines multiple vectors into one! 🏊‍♂️

Pooled Vector: [Sum of word vectors]

Challenge: 🌟 Try experimenting with different pooling methods like max-pooling and average-pooling.
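A sketch of the challenge using NumPy (note that doc.vector already gives you average pooling for free):

import numpy as np

doc = nlp("I love programming in Python.")
vectors = np.stack([token.vector for token in doc])

max_pooled = vectors.max(axis=0)    # element-wise max across tokens
mean_pooled = vectors.mean(axis=0)  # the same idea as doc.vector
print(max_pooled[:5], mean_pooled[:5])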


12. Text Classification (spacy.pipeline.TextCategorizer) 🏆

Boilerplate Code:

from spacy.pipeline import TextCategorizer

Use Case: Classify text into categories like positive/negative or news/sports using spaCy's text categorizer. 🏆

Goal: Build a text classifier for sentiment analysis or document categorization. 🎯

Sample Code:

# Add the text categorizer to the pipeline
textcat = nlp.add_pipe("textcat")

# Add labels to the text categorizer
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# NOTE: the categorizer must be trained before its scores mean anything
# (see the training sketch after the challenge below)
doc = nlp("This is an awesome product!")

# Predicted category scores
print(doc.cats)

Before Example: The intern has text but doesn't know how to categorize it (positive or negative). 🤔

Sentence: "This is an awesome product!"

After Example: With a trained TextCategorizer, the text is categorized as positive or negative! 🏆

Categories: {'POSITIVE': 0.85, 'NEGATIVE': 0.15}

Challenge: 🌟 Try training the text classifier on a larger dataset for better performance.
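A minimal toy training loop for spaCy v3, just to show the mechanics (two examples are nowhere near enough data for a useful classifier):

import random
import spacy
from spacy.training import Example

nlp_toy = spacy.blank("en")
textcat = nlp_toy.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# Tiny made-up training set, for illustration only
train_data = [
    ("This is an awesome product!", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("This is a terrible product.", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]
examples = [Example.from_dict(nlp_toy.make_doc(text), ann)
            for text, ann in train_data]

optimizer = nlp_toy.initialize(get_examples=lambda: examples)
for _ in range(20):  # a few passes over the toy data
    random.shuffle(examples)
    nlp_toy.update(examples, sgd=optimizer)

print(nlp_toy("This is an awesome product!").cats)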


13. Document Similarity (doc.similarity) 🔗

Boilerplate Code:

from spacy.tokens import Doc

Use Case: Compare the similarity between documents using spaCy's built-in similarity function. 🔗

Goal: Measure how similar two pieces of text are. 🎯

Sample Code:

# Example sentences
doc1 = nlp("I love playing football.")
doc2 = nlp("I enjoy soccer.")

# Compare document similarity
similarity = doc1.similarity(doc2)
print(similarity)

Before Example: The intern has two texts but doesn't know how similar they are. 🤔

Text 1: "I love playing football."
Text 2: "I enjoy soccer."

After Example: With similarity comparison, the intern can measure how similar the texts are! 🔗

Similarity Score: 0.92

Challenge: 🌟 Try comparing the similarity between different types of documents like news articles or research papers.


14. Custom Token Attributes (spacy.tokens.Token.set_extension) 🔧

Boilerplate Code:

from spacy.tokens import Token

Use Case: Add custom attributes to tokens to store additional information like polarity or frequency. 🔧

Goal: Extend tokens with custom attributes to suit your NLP needs. 🎯

Sample Code:

# Define custom token attribute
Token.set_extension('is_positive', default=False)

# Example text
doc = nlp("This is a great product!")

# Set custom attribute for specific tokens
for token in doc:
    if token.text == "great":
        token._.is_positive = True
    print(token.text, token._.is_positive)

Before Example: The intern wants to tag words like "great" with a custom attribute (e.g., positivity). 🤔

Sentence: "This is a great product!"

After Example: With custom token attributes, the intern tags specific words with custom attributes! 🔧

Token Attributes: "great" → is_positive = True

Challenge: 🌟 Try adding custom attributes to other tokens like "excellent" or "awesome."
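Instead of flipping flags token by token, an extension can compute its value on the fly with a getter. A small sketch (the word list is just an illustration):

from spacy.tokens import Token

POSITIVE_WORDS = {"great", "excellent", "awesome"}

# force=True lets the script re-run without "extension already exists" errors
Token.set_extension(
    "is_positive",
    getter=lambda token: token.lower_ in POSITIVE_WORDS,
    force=True,
)

doc = nlp("This is a great, excellent, truly awesome product!")
print([(token.text, token._.is_positive) for token in doc if token._.is_positive])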


15. Document Vectors (doc.vector) 🧮

Boilerplate Code:

from spacy.tokens import Doc

Use Case: Extract the document vector, which is a numerical representation of the entire document. 🧮

Goal: Represent an entire document as a vector for similarity comparisons or machine learning tasks. 🎯

Sample Code:

# Example text
doc = nlp("I love programming in Python.")

# Get document vector
doc_vector = doc.vector
print(doc_vector[:5])  # Print the first 5 elements of the vector

Before Example: The intern has text but doesn't know how to convert the entire document into a vector. 🤔

Text: "I love programming in Python."

After Example: With doc.vector, the intern converts the text into a numerical vector! 🧮

Document Vector: [0.23, 0.56, 0.12, ...]

Challenge: 🌟 Try comparing the document vectors of two similar texts.
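A sketch of the challenge: cosine similarity between two document vectors, computed by hand with NumPy (doc1.similarity(doc2) does essentially this for vector-based models):

import numpy as np

doc1 = nlp("I love programming in Python.")
doc2 = nlp("Python programming is my favorite hobby.")

# Cosine similarity = dot product over the product of the norms
cosine = np.dot(doc1.vector, doc2.vector) / (
    np.linalg.norm(doc1.vector) * np.linalg.norm(doc2.vector))
print(cosine)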


16. Matcher (spacy.matcher.Matcher) 🔍

Boilerplate Code:

from spacy.matcher import Matcher

Use Case: Use the Matcher to find specific patterns in the text (e.g., word sequences or phrases). 🔍

Goal: Identify specific sequences of words based on patterns. 🎯

Sample Code:

# Initialize the matcher
matcher = Matcher(nlp.vocab)

# Define pattern (e.g., "New York City")
pattern = [{"TEXT": "New"}, {"TEXT": "York"}, {"TEXT": "City"}]

# Add pattern to matcher
matcher.add("NYC_PATTERN", [pattern])

# Example text
doc = nlp("I visited New York City last year.")

# Find matches
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end].text)

Before Example: The intern wants to find a specific phrase (e.g., "New York City") but doesn't know how to identify it. 🤔

Text: "I visited New York City last year."

After Example: With Matcher, the intern finds the phrase in the text! 🔍

Match Found: "New York City"

Challenge: 🌟 Try creating more complex patterns, like searching for specific parts of speech or combinations of words.
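Patterns can match on token attributes, not just exact text. A small sketch that matches adjective-noun pairs:

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

# Match an adjective immediately followed by a noun, e.g. "lazy dog"
matcher.add("ADJ_NOUN", [[{"POS": "ADJ"}, {"POS": "NOUN"}]])

doc = nlp("The quick brown fox jumped over the lazy dog.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)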


17. Text Entity Linking (spacy.pipeline.EntityLinker) 🔗

Boilerplate Code:

from spacy.pipeline import EntityLinker

Use Case: Link named entities to external knowledge bases like Wikipedia. 🔗

Goal: Provide more context for named entities by linking them to real-world information. 🎯

Sample Code:

# Add the entity linker to the pipeline
# NOTE: this component only produces links after it has been given a
# knowledge base and trained; out of the box it has nothing to link to
linker = nlp.add_pipe("entity_linker")

# Example text
doc = nlp("Google was founded by Larry Page and Sergey Brin.")

# Get linked entities
for ent in doc.ents:
    print(ent.text, ent.kb_id_)

Before Example: The intern identifies entities but doesn't have additional information about them. 🤔

Entities: "Google", "Larry Page", "Sergey Brin"

After Example: With EntityLinker, the intern links entities to real-world knowledge! 🔗

Linked Entities: "Google" → Wikipedia ID, "Larry Page" → Wikipedia ID

Challenge: 🌟 Try linking entities to other knowledge bases like Wikidata or custom datasets.
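Building the knowledge base is the bulk of the work. A rough sketch of the mechanics, assuming spaCy v3.5+ (where the class is InMemoryLookupKB; earlier v3 releases called it KnowledgeBase) and a made-up Wikidata-style ID:

from spacy.kb import InMemoryLookupKB

kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=64)

# Hypothetical entity ID and a dummy vector, for illustration only
kb.add_entity(entity="Q95", freq=100, entity_vector=[0.0] * 64)
kb.add_alias(alias="Google", entities=["Q95"], probabilities=[1.0])

print(kb.get_size_entities(), kb.get_size_aliases())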


18. Sentence Segmentation (doc.sents) ✂️

Boilerplate Code:

from spacy.tokens import Doc

Use Case: Split text into sentences using spaCy's sentence boundary detection. ✂️

Goal: Segment text into individual sentences for further analysis. 🎯

Sample Code:

# Example text
doc = nlp("This is the first sentence. Here's another one!")

# Extract sentences
for sent in doc.sents:
    print(sent.text)

Before Example: The intern has a paragraph but doesn't know how to split it into individual sentences. 🤔

Text: "This is the first sentence. Here's another one!"

After Example: With sentence segmentation, the text is split into separate sentences! ✂️

Sentences: "This is the first sentence." "Here's another one!"

Challenge: 🌟 Try segmenting a longer article or text document into individual sentences.
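By default, sentence boundaries come from the dependency parser. For long documents where speed matters, a blank pipeline with the rule-based sentencizer is a cheap alternative; a minimal sketch:

import spacy

# A blank pipeline with only rule-based sentence boundaries (no parser)
nlp_fast = spacy.blank("en")
nlp_fast.add_pipe("sentencizer")

doc = nlp_fast("This is the first sentence. Here's another one!")
print([sent.text for sent in doc.sents])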


19. Pipeline Customization (nlp.disable_pipes) 🔄

  • Default pipeline: When you run nlp("Google was founded in 1998."), spaCy processes the text through every component: tokenization, POS tagging, NER, and so on.

  • Disable NER: This example disables NER (the component that identifies entities like "Google" or "1998"). Skipping components you don't need, such as NER when you only care about POS tagging (figuring out whether each word is a noun, verb, etc.), speeds up processing.

Boilerplate Code:

from spacy.language import Language

Use Case: Customize the NLP pipeline by adding or removing components (e.g., NER, TextCategorizer). 🔄

Goal: Tailor the NLP pipeline to your specific needs by adding or removing components. 🎯

Sample Code:

# Disable Named Entity Recognition (NER) for this block
# (in spaCy v3, nlp.select_pipes(disable="ner") is the preferred spelling)
with nlp.disable_pipes("ner"):
    doc = nlp("Google was founded in 1998.")

# Process text without NER
print([(token.text, token.pos_) for token in doc])

Before Example: The intern runs the full NLP pipeline but doesn't need some components, like NER. 🤔

Text: "Google was founded in 1998."

After Example: With pipeline customization, the intern disables unnecessary components! 🔄

Pipeline: Disabled "ner", only POS tagging applied.

Challenge: 🌟 Try creating a custom pipeline with only the components you need for a specific task.
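Components can also be excluded when the model is loaded, so you never pay for them at all. A small sketch:

import spacy

# Load only what you need; pipes disabled at load time can be re-enabled later
nlp_light = spacy.load("en_core_web_sm", disable=["ner", "lemmatizer"])
print(nlp_light.pipe_names)  # the remaining components, in order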


20. Document Extension (spacy.tokens.Doc.set_extension) 🛠️

Boilerplate Code:

from spacy.tokens import Doc

Use Case: Add custom attributes to the entire document (not just tokens) for additional processing. 🛠️

Goal: Extend spaCy's Doc object to store custom attributes for the entire text. 🎯

Sample Code:

# Define custom document attribute
Doc.set_extension('is_technical', default=False)

# Example text
doc = nlp("Python is a popular programming language.")

# Set custom attribute for the document
doc._.is_technical = True
print(doc._.is_technical)

Before Example: The intern wants to tag entire documents with custom attributes (e.g., technical or non-technical). 🤔

Document: "Python is a popular programming language."

After Example: With a document extension, the intern adds a custom attribute to the entire document! 🛠️

Custom Attribute: is_technical = True

Challenge: 🌟 Try adding more custom attributes at the document level for specific types of analysis.
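Document extensions also accept getters, so an attribute can be derived from the text instead of set by hand. A sketch with a made-up keyword list:

from spacy.tokens import Doc

TECH_TERMS = {"python", "programming", "algorithm"}

# Compute the attribute on access instead of storing it
Doc.set_extension(
    "tech_term_count",
    getter=lambda doc: sum(token.lower_ in TECH_TERMS for token in doc),
    force=True,
)

doc = nlp("Python is a popular programming language.")
print(doc._.tech_term_count)  # 2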

Bonus Point:
Both NLTK and spaCy are popular libraries for Natural Language Processing (NLP), but they serve slightly different purposes, and your choice depends on your needs.

When to choose spaCy:

  • Speed: spaCy is faster and more efficient, making it ideal for real-time applications and larger datasets.

  • Modern NLP: It's designed with modern NLP tasks in mind, like Named Entity Recognition (NER), Dependency Parsing, and Word Vectors.

  • Ease of use: spaCy comes with pre-trained models that are ready to use, making it simpler to get started on common tasks without much setup.

  • Deep Learning: If you plan to integrate with deep learning frameworks like TensorFlow or PyTorch, spaCy is easier to work with.

When to choose NLTK:

  • Flexibility and Variety: NLTK offers a wider variety of tools and datasets for NLP research, covering tasks like tokenization, parsing, and corpora access.

  • Learning and Research: It's a great library for teaching and learning NLP, with more academic features. It also includes a variety of text processing techniques and algorithms.

  • Customization: NLTK gives you more control and customization, but it's slower and more manually intensive compared to spaCy.

Default Choice:

If you're looking for speed, simplicity, and modern NLP features, spaCy is the better default choice. If you need flexibility and want to dive deeper into NLP theory or work with a variety of text processing tools, then NLTK might be more suitable.

