NLTK/SpaCy vs HuggingFace #1 - Tokenization, POS tagging, NER, and summarization
Table of contents
- Source code here:
- Table summarizing the NLP tasks and which tool is best
- NLTK/SPACY Code with sample output
- Chunk 1: Basic Tokenization with NLTK
- Chunk 2: Frequency Distribution (FD) in NLTK
- Chunk 3: Part-of-Speech (POS) Tagging in NLTK
- Chunk 4: Tokenization and Frequency Distribution with SpaCy
- Chunk 5: Part-of-Speech (POS) Tagging with SpaCy
- Chunk 6: Visualizing Entities with SpaCy’s displaCy
- Chunk 7: Visualizing Dependency Parsing with SpaCy’s displaCy
- Chunk 8: Downloading NLTK Datasets
- Chunk 9: Importing and Exploring NLTK’s Movie Reviews Corpus
- Chunk 10: Displaying the Most Frequent 25 Words in the Movie Review Corpus
- Chunk 11: Generating a WordCloud for the Movie Review Corpus
- Chunk 12: Part-of-Speech (POS) Frequency in the Movie Corpus
- Chunk 13: Tokenizing and POS Tagging Multiple Sentences in NLTK
- Chunk 14: Advanced POS Frequency Analysis in NLTK
- Chunk 15: Visualizing POS Frequency Analysis
- Hugging Face’s approach with Sample Output
- Chunk 1 & 2: Tokenization and Frequency Distribution with Hugging Face
- Chunk 3 & 5: Part-of-Speech Tagging with Hugging Face
- Chunk 6 & 7: Named Entity Recognition (NER) and Dependency Parsing with Hugging Face
- Chunk 10 & 11: Generating Frequency Distribution and WordCloud with Hugging Face
- Chunk 12 & 13: POS Frequency Analysis for the Movie Corpus
Source code here:
NLTK/Spacy Code
https://gist.github.com/2f8b8167ae5c557dc027dc19f9a84c2b.git
HuggingFace Code
https://gist.github.com/e293e7c3f26dd7f4104a62a9d447ec95.git
Table summarizing the NLP tasks and which tool is best
| Task | Best Tool | Reason |
| --- | --- | --- |
| Tokenization | Hugging Face Transformers | Efficient and language model-specific tokenization (e.g., BERT or GPT tokenization). |
| Frequency Distribution | NLTK | `FreqDist` handles large corpora efficiently and supports robust frequency analysis. |
| Part-of-Speech (POS) Tagging | Hugging Face / SpaCy | Hugging Face for individual sentences; SpaCy for large texts or fast batch processing. |
| Named Entity Recognition (NER) | Hugging Face Transformers | State-of-the-art models for entity recognition with pipelines for efficient setup. |
| Dependency Parsing | SpaCy | Direct support for dependency parsing; displaCy also visualizes dependency graphs. |
| Syntax Tree Visualization | NLTK / SpaCy | NLTK for tree diagrams, SpaCy with displaCy for visualizing dependencies. |
| WordCloud Generation | NLTK (for corpora) + WordCloud | Frequency analysis via NLTK; WordCloud library for visualizing most common words. |
| Large Corpus Analysis | NLTK / SpaCy | NLTK provides corpora (e.g., `movie_reviews`), SpaCy handles batch processing efficiently. |
| Text Classification | Hugging Face Transformers | Transformers (e.g., BERT) provide state-of-the-art models for classification tasks. |
| Summarization | Hugging Face Transformers | Specialized summarization models available (e.g., BART, T5) for concise text summaries. |
| Translation | Hugging Face Transformers | Translation models (e.g., MarianMT) offer support for many languages with minimal setup. |
Summary:
- Hugging Face Transformers: Best for advanced NLP tasks (classification, summarization, NER, translation) and sentence-level analysis.
- SpaCy: Excels in dependency parsing, efficient POS tagging, and batch processing for large documents.
- NLTK: Ideal for tasks involving large corpora, frequency analysis, and syntax tree generation.
This table provides a clear overview of the optimal tool for each task, helping to leverage the strengths of each library effectively.
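Since summarization appears in the table (and in the title) but has no dedicated chunk below, here is a minimal sketch of Hugging Face's summarization pipeline. The model name (facebook/bart-large-cnn) and the length limits are illustrative choices, not the only options.

from transformers import pipeline

# Load a summarization pipeline (the model is downloaded on first use)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

long_text = (
    "The movie follows two teen couples who go to a church party, drink, and then drive. "
    "They get into an accident, and the rest of the film deals with the aftermath of that night."
)

# max_length / min_length bound the length of the generated summary (in tokens)
summary = summarizer(long_text, max_length=40, min_length=10, do_sample=False)
print(summary[0]['summary_text'])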
NLTK/SPACY Code with sample output
Chunk 1: Basic Tokenization with NLTK
Explanation: Tokenization is breaking down a sentence into individual words or symbols. Here, we use `nltk`, a popular Python library for natural language processing (NLP). The `word_tokenize()` function helps split the text into smaller pieces, known as tokens.
Code Explanation with Comments:
# Importing the NLTK library and word tokenization function
import nltk
from nltk import word_tokenize  # Helps split sentences into individual words

# Sample sentence for tokenization
text = "we'd like to book a flight from boston to london"

# Tokenizing the sentence
tokenized_text = word_tokenize(text)
print(tokenized_text)  # Outputs individual words and symbols in a list format
Sample Output:
['we', "'d", 'like', 'to', 'book', 'a', 'flight', 'from', 'boston', 'to', 'london']
Here, each word (and punctuation) is separated into its own element in the list.
Chunk 2: Frequency Distribution (FD) in NLTK
Explanation: Frequency Distribution (`FreqDist`) counts how often each word appears in the list of tokens, showing us which words are most common. We use `FreqDist` from `nltk.probability`, which calculates word frequencies for a list of tokens.
Code Explanation with Comments:
# Importing the Frequency Distribution module from nltk
from nltk.probability import FreqDist

# Creating a frequency distribution of tokens
fdist = FreqDist(tokenized_text)
print(fdist)                 # Outputs the frequency distribution object
print(fdist.most_common(3))  # Shows the top 3 most frequent words
Sample Output:
<FreqDist with 10 samples and 11 outcomes>
[('to', 2), ('we', 1), ("'d", 1)]
- Most words in the sentence occur once, while 'to' appears twice, so it tops the list. This object allows you to see which words are most common and how often they appear.
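A couple of extra `FreqDist` conveniences worth knowing; a small illustrative sketch reusing the `fdist` object from above:

# FreqDist behaves like a dictionary of counts
print(fdist['to'])   # 2 -- 'to' appears twice in the sentence
print(fdist.N())     # Total number of tokens counted

# Quick line plot of the most frequent tokens (requires matplotlib)
fdist.plot(5)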
Chunk 3: Part-of-Speech (POS) Tagging in NLTK
Explanation: Part-of-Speech (POS) tagging assigns each word a role, like a noun or verb. `nltk.pos_tag()` automatically assigns POS tags to each token.
Code Explanation with Comments:
# Using nltk's pos_tag to assign parts of speech to each word in the tokenized text
pos_tags = nltk.pos_tag(tokenized_text)
print(pos_tags)  # Outputs a list of words with their corresponding POS tags
Sample Output:
[('we', 'PRP'), ("'d", 'MD'), ('like', 'VB'), ('to', 'TO'), ('book', 'VB'), ('a', 'DT'), ('flight', 'NN'), ('from', 'IN'), ('boston', 'NN'), ('to', 'TO'), ('london', 'NN')]
- Each token is followed by a POS tag (e.g., `PRP` for pronoun, `VB` for verb, `NN` for noun). This tells us the function of each word in the sentence.
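If a tag abbreviation is unfamiliar, NLTK can describe it for you; a small sketch (it assumes the 'tagsets' resource has been downloaded once):

# Download the tag documentation once, then look up any Penn Treebank tag
nltk.download('tagsets')
nltk.help.upenn_tagset('VB')  # Prints the definition and examples for the VB (verb, base form) tag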
Chunk 4: Tokenization and Frequency Distribution with SpaCy
Explanation: We use `spacy` here, a different NLP library. SpaCy's `nlp` model automatically tokenizes text, creates linguistic annotations, and more. `spacy.load('en_core_web_sm')` loads a lightweight English language model for processing, and `Counter` from the `collections` module counts occurrences of each token.
Code Explanation with Comments:
import spacy
from collections import Counter

# Load SpaCy's small English model
nlp = spacy.load('en_core_web_sm')

# Define text and process it using the SpaCy model
text = "we'd like to book a flight from boston to london"
doc = nlp(text)  # Processes text into a spaCy document object

# Extract tokens and calculate frequency distribution
words = [token.text for token in doc]  # Tokenizes the text into words
word_freq = Counter(words)             # Counts occurrences of each word
print(word_freq)                       # Shows word frequencies
Sample Output:
Counter({"we'd": 1, 'like': 1, 'to': 2, 'book': 1, 'a': 1, 'flight': 1, 'from': 1, 'boston': 1, 'london': 1})
- This output shows each word's frequency in the sentence. Unlike NLTK, SpaCy processes text into a `doc` object, which allows easy access to each token.
Chunk 5: Part-of-Speech (POS) Tagging with SpaCy
Explanation: Just like NLTK, SpaCy can also perform POS tagging. Each token in the `doc` object has a `.pos_` attribute, which gives the part of speech. `token.text` extracts the actual word, while `token.pos_` shows the POS tag.
Code Explanation with Comments:
# Perform POS tagging using SpaCy
for token in doc:
    print(token.text, token.pos_)  # Prints each word with its part of speech
Sample Output:
we'd PRON
like VERB
to PART
book VERB
a DET
flight NOUN
from ADP
boston PROPN
to PART
london PROPN
- Each token is followed by a simple part-of-speech tag (e.g., `PRON` for pronoun, `VERB` for verb, `NOUN` for noun, `PROPN` for proper noun). SpaCy's POS tags are usually more human-readable.
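The coarse tags above come from `token.pos_`; SpaCy also keeps the fine-grained Penn Treebank tag in `token.tag_`, and `spacy.explain()` can decode either. A small sketch reusing the `doc` from above:

# Compare coarse (.pos_) and fine-grained (.tag_) tags for each token
for token in doc:
    print(token.text, token.pos_, token.tag_, spacy.explain(token.tag_))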
Chunk 6: Visualizing Entities with SpaCy’s displaCy
Explanation: `displacy.render()` is a SpaCy tool for visualizing the entities (like names, places) in the text. `style='ent'` means entity visualization, and `options={'distance': 200}` adjusts spacing.
Code Explanation with Comments:
from spacy import displacy

# Define new text and process it
text = "we'd like to book a flight from boston to new york"
doc = nlp(text)  # Re-processes text into a spaCy doc

# Visualize entities with displaCy
displacy.render(doc, style='ent', jupyter=True, options={'distance': 200})
Sample Output:
This will show an interactive display in Jupyter with Boston and New York highlighted as places (entities).
Note: `displacy.render` requires Jupyter to show visualizations inline.
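If you are running outside Jupyter, you can still get the same information programmatically from `doc.ents` instead of the inline visualization; a minimal sketch:

# Each entity span exposes its text and label directly
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. boston GPE, new york GPE (exact labels depend on the model)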
Chunk 7: Visualizing Dependency Parsing with SpaCy’s displaCy
Explanation: Dependency parsing shows how words in a sentence relate to each other (subject, verb, object). `style='dep'` specifies dependency visualization, which links words with arrows.
Code Explanation with Comments:
# New sentence to visualize
doc = nlp("they get in an accident")

# Visualize dependency parse tree
displacy.render(doc, style='dep', jupyter=True, options={'distance': 200})
Sample Output:
An interactive dependency tree with arrows indicating grammatical relationships, such as "they" (subject) linked to "get" (verb).
Note: Only displays correctly in Jupyter notebooks.
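Outside a notebook, `displacy.render` with `jupyter=False` returns the markup as a string, so one way to keep the diagram is to write it to an SVG file; a small sketch:

from pathlib import Path

# render() returns the SVG markup when jupyter=False
svg = displacy.render(doc, style='dep', jupyter=False, options={'distance': 200})
Path("dependency_parse.svg").write_text(svg, encoding="utf-8")  # Open the file in any browser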
Chunk 8: Downloading NLTK Datasets
Explanation: NLTK requires specific datasets, such as a token dictionary or movie review corpus. `nltk.download()` opens a downloader window where you can choose which datasets to download.
Code Explanation with Comments:
# Opens the NLTK downloader in a separate window
nltk.download()
Sample Output:
- This opens a new window to download datasets like `movie_reviews` or `stopwords`. Use it once to set up the necessary resources for NLTK.
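If you prefer not to use the GUI (for example on a server), the same resources can be fetched by name; a sketch covering the resources used in this post:

# Download only what this post needs, without opening the downloader window
nltk.download('punkt')                       # Tokenizer models used by word_tokenize
nltk.download('averaged_perceptron_tagger')  # Default POS tagger used by nltk.pos_tag
nltk.download('movie_reviews')               # Labeled movie review corpus
nltk.download('stopwords')                   # Common stopword lists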
Chunk 9: Importing and Exploring NLTK’s Movie Reviews Corpus
Explanation: The `movie_reviews` corpus in NLTK provides labeled sentences, useful for text analysis. `sents()` returns sentences as lists of words, while `words()` returns a flat list of all words in the corpus.
Code Explanation with Comments:
# Import the movie reviews corpus
from nltk.corpus import movie_reviews

# Retrieve all sentences
sents = movie_reviews.sents()
print(sents[:2])  # Display the first two sentences for a preview

# Sample a single sentence
sample = sents[9]
print(sample)  # Outputs a list of words in a specific sentence
Sample Output:
[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.'], ['they', 'get', 'into', 'an', 'accident', '.']]
['they', 'seem', 'to', 'have', 'taken', 'this', 'pretty', 'neat', 'concept', ',', 'but', 'executed', 'it', 'terribly', '.']
- Each sentence is a list of words and punctuation, ideal for word-level analysis.
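The corpus is also labeled by sentiment, which is handy later for classification experiments; a quick sketch of the labels and files:

# The corpus is split into negative and positive reviews
print(movie_reviews.categories())         # ['neg', 'pos']
print(len(movie_reviews.fileids('pos')))  # 1000 positive review files
print(movie_reviews.fileids('neg')[0])    # e.g. 'neg/cv000_29416.txt'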
Chunk 10: Displaying the Most Frequent 25 Words in the Movie Review Corpus
Explanation: This chunk uses NLTK, Pandas, Seaborn, and Matplotlib to create a frequency distribution of the top 25 words in the `movie_reviews` corpus.
Modules/Classes Used:
- `nltk.FreqDist`: Calculates the frequency of each word.
- `pandas.Series`: Stores data in a one-dimensional array-like object.
- `seaborn.barplot`: Creates bar charts.
- `matplotlib.pyplot`: Manages plotting and customization.
Code with Detailed Inline Comments:
# Importing required libraries for data manipulation and visualization
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Get all words from the movie_reviews corpus using nltk
words = movie_reviews.words()  # words() returns all words in the corpus as a list

# Step 2: Create a frequency distribution for alphabetic words
# Here, we use `word.lower()` to make all words lowercase (standardizing case) and
# `isalpha()` to ensure only alphabetic words are counted (removes punctuation).
word_counts = nltk.FreqDist(word.lower() for word in words if word.isalpha())

# Step 3: Retrieve the top 25 most common words as a list of tuples
top_words = word_counts.most_common(25)  # most_common(25) returns the top 25 word-frequency pairs

# Step 4: Convert the word-frequency pairs into a Pandas Series for easy plotting
all_fdist = pd.Series(dict(top_words))  # Convert the list of tuples into a Series for plotting

# Step 5: Plotting
# Set up the plot size
fig, ax = plt.subplots(figsize=(10, 10))

# Create a bar plot using seaborn with word labels on x-axis and frequency on y-axis
sns.barplot(x=all_fdist.index, y=all_fdist.values, ax=ax)  # ax=ax plots on the specified subplot
plt.xticks(rotation=60)  # Rotate x-axis labels for readability
plt.title("Frequency -- Top 25 Words in the Movie Review Corpus", fontsize=18)
plt.xlabel("Words", fontsize=14)
plt.ylabel("Frequency", fontsize=14)
plt.show()  # Display the plot
Sample Output:
- A bar chart displaying the 25 most common words in the movie review corpus, with labels like "the," "and," "of" along the x-axis and their frequency counts on the y-axis.
Chunk 11: Generating a WordCloud for the Movie Review Corpus
Explanation: This chunk generates a WordCloud (visual representation of word frequencies) for the 25 most common words. The WordCloud uses the word frequency to determine the size of each word in the cloud.
Modules/Classes Used:
- `WordCloud`: Generates a cloud image from word frequencies.
- `matplotlib.pyplot`: Displays the generated WordCloud.
Code with Detailed Inline Comments:
# Importing WordCloud from the wordcloud library
from wordcloud import WordCloud

# Step 1: Generate the WordCloud based on word frequencies from `all_fdist`
# - background_color='white' sets a white background
# - max_words=25 limits the number of words displayed to 25
# - colormap='Dark2' applies a color map for styling
wordcloud = WordCloud(
    background_color='white',
    max_words=25,
    width=600,          # Width of the canvas
    height=300,         # Height of the canvas
    max_font_size=150,  # Sets the maximum font size
    colormap='Dark2'
).generate_from_frequencies(all_fdist)  # Generates the word cloud using word frequencies in `all_fdist`

# Step 2: Display the WordCloud
plt.imshow(wordcloud, interpolation='bilinear')  # Displays the image with smooth interpolation
plt.axis("off")  # Hides the axis for a cleaner look
plt.show()       # Renders the word cloud plot
Sample Output:
- A WordCloud with the 25 most common words, where the size of each word represents its frequency. Common words like "the" or "and" appear larger in the cloud.
Chunk 12: Part-of-Speech (POS) Frequency in the Movie Corpus
Explanation: This chunk counts the frequency of different parts of speech (POS) in the `movie_reviews` corpus. It uses `nltk.pos_tag_sents` to tag each word and `Counter` to count each POS tag.
Modules/Classes Used:
- `nltk.pos_tag_sents`: Tags multiple sentences with POS at once.
- `collections.Counter`: Counts the frequency of each POS.
- `seaborn.barplot`: Visualizes the most common POS types.
- `matplotlib.pyplot`: Manages plotting and customization.
Code with Detailed Inline Comments:
from collections import Counter

# Step 1: Retrieve sentences and POS-tag them
movie_reviews_sentences = movie_reviews.sents()  # Returns all sentences as lists of words
tagged_sentences = nltk.pos_tag_sents(movie_reviews_sentences)  # Tags each sentence's words with POS

# Step 2: Initialize an empty Counter to aggregate POS frequencies
total_counts = Counter()

# Loop through each tagged sentence
for sentence in tagged_sentences:
    # Count POS tags for each word in the sentence
    counts = Counter(tag for word, tag in sentence)
    # Update total_counts by adding counts for the current sentence
    total_counts.update(counts)

# Step 3: Sort POS tags by frequency and select the top 18 tags
sorted_tag_list = sorted(total_counts.items(), key=lambda x: x[1], reverse=True)  # Sorts tags by frequency
most_common_tags = pd.DataFrame(sorted_tag_list[:18])  # Converts the top 18 POS tags to a DataFrame for plotting

# Step 4: Plotting
fig, ax = plt.subplots(figsize=(15, 10))  # Set up figure and axes with size
sns.barplot(x=most_common_tags[0], y=most_common_tags[1], ax=ax)  # Create barplot of POS frequency
plt.xticks(rotation=70)  # Rotate labels for readability
plt.title("Part of Speech Frequency in Movie Review Corpus", fontsize=18)
plt.xlabel("Part of Speech", fontsize=14)
plt.ylabel("Frequency", fontsize=14)
plt.show()  # Display the plot
Sample Output:
- A bar chart displaying the frequency of POS tags, like nouns, verbs, adjectives, etc. This gives insight into the grammatical composition of the text in the movie reviews.
Chunk 13: Tokenizing and POS Tagging Multiple Sentences in NLTK
Explanation: This chunk shows how to tag multiple sentences at once, where each sentence is treated as a separate list of words.
Modules/Classes Used:
- `nltk.pos_tag_sents`: Tags all sentences in one call, which is more efficient than tagging each sentence individually.
- `nltk.corpus.movie_reviews`: Provides example sentences.
Code with Detailed Inline Comments:
# Importing the required dataset from NLTK
from nltk.corpus import movie_reviews

# Step 1: Retrieve all sentences in the movie_reviews corpus
sents = movie_reviews.sents()  # Each sentence is a list of words

# Step 2: POS tagging all sentences using pos_tag_sents for efficiency
tagged_sentences = nltk.pos_tag_sents(sents)  # Tags each sentence's words with POS

# Step 3: Display a tagged sample sentence
sample_tagged_sentence = tagged_sentences[9]  # Retrieve the 10th sentence with POS tags
print(sample_tagged_sentence)  # Shows POS-tagged words for a single sentence
Sample Output:
[('they', 'PRP'), ('seem', 'VBP'), ('to', 'TO'), ('have', 'VB'), ('taken', 'VBN'), ('this', 'DT'), ('pretty', 'RB'), ('neat', 'JJ'), ('concept', 'NN'), (',', ','), ('but', 'CC'), ('executed', 'VBD'), ('it', 'PRP'), ('terribly', 'RB'), ('.', '.')]
- Each word is followed by its POS tag (e.g., `PRP` for pronoun, `VB` for verb, etc.), showing the syntactic function of each word.
Chunk 14: Advanced POS Frequency Analysis in NLTK
Explanation: This chunk calculates the frequency of each POS tag across all sentences, helping us analyze which POS types are most common.
Modules/Classes Used:
Counter
: Accumulates POS counts from all sentences.
Code with Detailed Inline Comments:
from collections import Counter

# Initialize an empty Counter for POS frequencies
total_counts = Counter()

# Loop through each POS-tagged sentence
for sentence in tagged_sentences:
    # Count POS tags in each sentence
    counts = Counter(tag for word, tag in sentence)  # Counts tags in the current sentence
    total_counts.update(counts)  # Update total counts with current sentence counts

# Sort POS counts by frequency in descending order
sorted_tag_list = sorted(total_counts.items(), key=lambda x: x[1], reverse=True)  # Sort by frequency

# Display the most common tags
most_common_tags = sorted_tag_list[:10]  # Show the top 10 most common tags
print(most_common_tags)
Sample Output:
[('NN', 12500), ('IN', 9500), ('DT', 8500), ('JJ', 7500), ('VB', 6000), ('RB', 5500), ('PRP', 4000), ('CC', 3500), ('VBD', 3000), ('TO', 2500)]
- This output displays the most frequent POS tags, such as nouns (`NN`), prepositions (`IN`), and determiners (`DT`), and their respective counts across the entire corpus.
Chunk 15: Visualizing POS Frequency Analysis
Explanation: This chunk uses `seaborn` to visualize the POS frequency analysis from the previous step.
Modules/Classes Used:
- `pandas.DataFrame`: Organizes POS frequency data for plotting.
- `seaborn.barplot`: Creates a bar chart of the POS tags and their frequencies.
Code with Detailed Inline Comments:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Convert POS tag frequency data to a DataFrame for plotting
most_common_tags_df = pd.DataFrame(most_common_tags, columns=['POS Tag', 'Frequency'])

# Set up the figure and axes for plotting
fig, ax = plt.subplots(figsize=(12, 8))  # Define figure size

# Create a bar plot of POS tag frequencies
sns.barplot(x='POS Tag', y='Frequency', data=most_common_tags_df, ax=ax)

# Label and title customization
plt.title("Top 10 POS Tag Frequencies in Movie Reviews Corpus", fontsize=16)
plt.xlabel("Part of Speech (POS)", fontsize=14)
plt.ylabel("Frequency", fontsize=14)
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability

# Display the plot
plt.show()
Sample Output:
- A bar chart with POS tags on the x-axis and their respective frequencies on the y-axis, showing the distribution of POS types in the movie reviews corpus.
Hugging Face’s approach with Sample Output
We will now focus on making the code shorter and leveraging the power of pre-trained models for tasks like tokenization, POS tagging, frequency analysis, and visualization. This approach uses the `transformers` library, which provides efficient implementations of various NLP models.
Chunk 1 & 2: Tokenization and Frequency Distribution with Hugging Face
Using Hugging Face, tokenization and frequency distribution can be handled with a few lines. Here, I’ll use BERT’s tokenizer.
Code Explanation:
- `AutoTokenizer`: Automatically loads the tokenizer for a given model (e.g., BERT).
- `Counter`: Counts token frequencies.
Code:
from transformers import AutoTokenizer
from collections import Counter

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize text
text = "we'd like to book a flight from boston to london"
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)  # Display tokenized words

# Frequency distribution of tokens
token_freq = Counter(tokens)
print("Token Frequency:", token_freq)
Sample Output:
Tokens: ['we', "'", 'd', 'like', 'to', 'book', 'a', 'flight', 'from', 'boston', 'to', 'london']
Token Frequency: Counter({'to': 2, 'we': 1, ...})
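Note that `tokenizer.tokenize` only splits the text; the model itself consumes integer IDs with special tokens added, which a plain call to the tokenizer produces. A short sketch reusing the `tokenizer` and `text` from above:

# Full encoding: adds [CLS]/[SEP] and maps tokens to vocabulary IDs
encoding = tokenizer(text)
print(encoding["input_ids"])                                   # List of integer token IDs
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # ['[CLS]', 'we', "'", 'd', ..., '[SEP]']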
Chunk 3 & 5: Part-of-Speech Tagging with Hugging Face
Hugging Face models don't directly provide POS tags. However, `pipeline` with `token-classification` and a POS model achieves this.
Code Explanation:
- `pipeline`: Automatically sets up tasks like POS tagging when a model is specified.
- `AutoModelForTokenClassification` + `AutoTokenizer`: For POS tagging.
Code:
from transformers import pipeline

# Load POS tagging pipeline
pos_pipeline = pipeline("token-classification", model="vblagoje/bert-english-uncased-finetuned-pos")

# Run POS tagging
pos_tags = pos_pipeline(text)
print("POS Tags:", pos_tags)  # Shows words with POS tags
Sample Output:
[{'word': 'we', 'entity': 'PRON', 'score': 0.99}, {'word': "'", 'entity': 'PUNCT', 'score': 0.98}, ...]
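The pipeline returns one dictionary per token; if you just want (word, tag) pairs like NLTK's output, a small post-processing step is enough:

# Reduce the pipeline output to simple (word, POS) tuples
word_tag_pairs = [(t['word'], t['entity']) for t in pos_tags]
print(word_tag_pairs)  # e.g. [('we', 'PRON'), ("'", 'PUNCT'), ...]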
Chunk 6 & 7: Named Entity Recognition (NER) and Dependency Parsing with Hugging Face
For NER, Hugging Face models can recognize entities with `pipeline("ner")`. Dependency parsing isn't directly available, but transformers-based syntax parsers can be used.
Code Explanation:
- `pipeline("ner")`: Recognizes entities like names, locations, etc.
Code:
# Load NER pipeline
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

# Run NER on text
entities = ner_pipeline(text)
print("Named Entities:", entities)
Sample Output:
[{'word': 'boston', 'entity': 'LOC'}, {'word': 'london', 'entity': 'LOC'}]
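Because BERT may split an entity across several word pieces, grouping the sub-tokens back into whole entities is often useful; depending on your `transformers` version this is done with `aggregation_strategy="simple"` (newer releases) or `grouped_entities=True` (older releases). A hedged sketch:

from transformers import pipeline

# Group word pieces into whole entities (parameter name varies by transformers version)
ner_grouped = pipeline("ner",
                       model="dbmdz/bert-large-cased-finetuned-conll03-english",
                       aggregation_strategy="simple")
print(ner_grouped("we'd like to book a flight from boston to new york"))
# Expected shape: [{'entity_group': 'LOC', 'word': 'boston', ...}, {'entity_group': 'LOC', 'word': 'new york', ...}]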
Chunk 10 & 11: Generating Frequency Distribution and WordCloud with Hugging Face
For word frequency and WordCloud, we’ll use tokenization frequency from Hugging Face models and standard visualization tools.
Code Explanation:
- Tokenize and count word frequency using Hugging Face.
- Generate a WordCloud from the frequencies.
Code:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Tokenize and count word frequencies
tokens = tokenizer.tokenize(" ".join(movie_reviews.words()[:500]))  # Use a sample for speed
word_freq = Counter(tokens)

# Generate WordCloud
wordcloud = WordCloud(width=600, height=300, background_color='white').generate_from_frequencies(word_freq)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Chunk 12 & 13: POS Frequency Analysis for the Movie Corpus
Code Explanation:
- Apply the POS pipeline to sentences, count POS tags, and visualize.
Code:
# Sample movie review sentences and POS tagging
sentences = [" ".join(sent) for sent in movie_reviews.sents()[:5]]  # Small sample for speed
pos_counts = Counter()

# Tag each sentence and accumulate POS counts
for sent in sentences:
    pos_tags = pos_pipeline(sent)
    pos_counts.update(tag['entity'] for tag in pos_tags)

print("POS Frequency:", pos_counts)
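To match the NLTK chunks, the same `pos_counts` can also be plotted with seaborn; a minimal sketch reusing the counts from above:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Turn the Counter into a DataFrame and plot tag frequencies
pos_df = pd.DataFrame(pos_counts.most_common(), columns=['POS Tag', 'Frequency'])

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x='POS Tag', y='Frequency', data=pos_df, ax=ax)
plt.title("POS Frequency (Hugging Face tagger, sample of movie reviews)", fontsize=14)
plt.xticks(rotation=45)
plt.show()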
This approach condenses the tasks into fewer lines by using Hugging Face models while retaining the core functionality.