NLTK vs. Hugging Face #3: Visualization, Clustering, NER, Word Clouds
Table of contents
- NLTK Code
  - Chunk 1: Importing NLTK and Displaying Corpus Information
  - Chunk 2: Removing Punctuation and Finding Common Words
  - Chunk 3: Plotting Word Frequency Distribution
  - Chunk 4: Log-Scale Frequency Distribution Plot
  - Chunk 5: Stop Words
  - Chunk 6: Frequency Distribution Without Stop Words
  - Chunk 7: Word Cloud Generation Without Stop Words
  - Chunk 8: Cleaning Corpus Function and Plotting Positive vs. Negative Word Frequency
  - Chunk 9: Bigrams Frequency Distribution
  - Chunk 10: Named Entity Recognition with SpaCy
- Hugging Face Code
  - Chunk 1: Import Libraries and Load Tokenizer
  - Chunk 2: Loading and Tokenizing the Corpus
  - Chunk 3: Remove Punctuation and Common Word Frequency
  - Chunk 4: Plot Word Frequency Distribution
  - Chunk 5: Stop Word Removal and Frequency Without Stop Words
  - Chunk 6: Word Cloud for Words Without Stop Words
  - Chunk 7: Named Entity Recognition with Hugging Face Pipeline
  - Chunk 8: Bigram Frequency Distribution
  - Chunk 9: Bag of Words with Hugging Face Embeddings
  - Chunk 10: Clustering with KMeans on Embeddings
NLTK Code
https://gist.github.com/43e222f213f3f217a3d99c1912d12375.git
HuggingFace Code
https://gist.github.com/03fac7cf537d5e07946aabee14aa81ef.git
NLTK Code
Chunk 1: Importing NLTK and Displaying Corpus Information
Code:
import nltk
from nltk.corpus import movie_reviews

# Load and display the corpus
corpus_words = movie_reviews.words()
print("Total words in corpus:", len(corpus_words))
print("First 10 words in corpus:", corpus_words[:10])
Explanation:
- Module: nltk.corpus provides access to built-in text corpora like movie_reviews.
- Function: movie_reviews.words() retrieves all words in the corpus as a list.
Expected Output:
Total words in corpus: 1583820
First 10 words in corpus: ['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party']
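If the corpus has never been downloaded on the machine, movie_reviews.words() raises a LookupError. A one-time download fixes that; this is a small sketch using NLTK's standard downloader (the stopwords resource is fetched here too because later chunks need it):

import nltk

# One-time downloads; safe to re-run, already-installed resources are skipped
nltk.download('movie_reviews')  # corpus used throughout this section
nltk.download('stopwords')      # needed for the stop-word chunks below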
Chunk 2: Removing Punctuation and Finding Common Words
Code:
# Filter out punctuation and create a list of words
words_no_punct = [word for word in corpus_words if word.isalnum()]

# Calculate word frequency
freq = nltk.FreqDist(words_no_punct)
print("Top 5 common words:", freq.most_common(5))
Explanation:
- Method: word.isalnum() checks if a word contains only letters or numbers.
- Class: nltk.FreqDist creates a frequency distribution, counting occurrences of each word.
Expected Output:
Top 5 common words: [('the', 7943), ('a', 3828), ('and', 3558), ('of', 3416), ('to', 3191)]
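Since FreqDist behaves like a dictionary, individual counts are easy to inspect as well; a small sketch (the word 'film' is just an example lookup, not part of the original gist):

# Look up a single word's count and relative frequency
print(freq['film'])       # raw count of 'film'
print(freq.freq('film'))  # fraction of all counted tokens that are 'film'
print(freq.N())           # total number of tokens counted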
Chunk 3: Plotting Word Frequency Distribution
Code:
import matplotlib.pyplot as plt

# Plot the top 50 words in the frequency distribution
freq.plot(50, cumulative=False)
Explanation:
- Module: matplotlib.pyplot is used for plotting graphs.
- Method: freq.plot(50, cumulative=False) displays a line plot of the top 50 most common words.
Expected Output:
- A line plot of the frequencies of the top 50 words, with words on the x-axis and counts on the y-axis.
Chunk 4: Log-Scale Frequency Distribution Plot
Code:
# Plot with log scale on the y-axis
plt.plot(*zip(*freq.most_common(50)))
plt.yscale('log')
plt.xlabel('Samples')
plt.ylabel('Counts (log scale)')
plt.title('Frequency Distribution with a Log Scale')
plt.xticks(rotation=90)
plt.grid(True)
plt.show()
Explanation:
- Method: plt.yscale('log') applies a log scale to the y-axis.
- Parameter: freq.most_common(50) retrieves the 50 most frequent words.
Expected Output:
- A line plot with a log-scaled y-axis to better visualize the range of word frequencies.
Chunk 5: Stop Words
Code:
from nltk.corpus import stopwords

# Load English stop words
stop_words = list(set(stopwords.words('english')))
print("Total stop words:", len(stop_words))
print("First 10 stop words:", stop_words[:10])
Explanation:
- Module: stopwords provides a list of common English stop words.
- Method: stopwords.words('english') retrieves English stop words.
Expected Output:
Total stop words: 179
First 10 stop words: ['then', 'why', 'out', 'with', 'after', 'through', 'who', 'be', 'down', 'here']
Chunk 6: Frequency Distribution Without Stop Words
Code:
# Filter out stop words from the corpus
words_no_stop = [word for word in words_no_punct if word.lower() not in stop_words]

# Plot frequency distribution without stop words
freq_without_stopwords = nltk.FreqDist(words_no_stop)
freq_without_stopwords.plot(50, cumulative=False)
Explanation:
- Filtering: Only words that are not in stop_words are included.
- Plotting: The frequency distribution of the words without stop words is visualized.
Expected Output:
- A line plot of the top 50 words excluding common stop words.
Chunk 7: Word Cloud Generation Without Stop Words
Code:
from wordcloud import WordCloud

# Generate word cloud for words without stop words
wordcloud = WordCloud(width=1600, height=800, colormap="tab10", background_color="white").generate_from_frequencies(freq_without_stopwords)
plt.figure(figsize=(20, 15))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Explanation:
- Class: WordCloud generates a word cloud based on word frequencies.
- Method: generate_from_frequencies uses the word frequency dictionary to create the cloud.
Expected Output:
- A word cloud showing words sized by frequency, without stop words.
Chunk 8: Cleaning Corpus Function and Plotting Positive vs. Negative Word Frequency
Code:
def clean_corpus(corpus):
    return [word for word in corpus if word.isalnum() and word.lower() not in stop_words]

# Build frequency distributions for negative and positive reviews
neg_words = clean_corpus(movie_reviews.words(categories="neg"))
pos_words = clean_corpus(movie_reviews.words(categories="pos"))
neg_freq = nltk.FreqDist(neg_words)
pos_freq = nltk.FreqDist(pos_words)
Explanation:
- Function: clean_corpus removes punctuation and stop words from a given text corpus.
- Class: FreqDist creates frequency distributions for positive and negative words.
Expected Output:
- Frequency distributions for negative and positive words; a plotting sketch follows below.
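The chunk stops at the two FreqDist objects. One possible way to finish the comparison is to line up the top words of each category in a single bar chart; this is a sketch (the choice of 15 words and the pandas-based plotting are my own, not from the original gist):

import pandas as pd
import matplotlib.pyplot as plt

# Put the 15 most common words of each category side by side
top_neg = pd.Series(dict(neg_freq.most_common(15)), name="negative")
top_pos = pd.Series(dict(pos_freq.most_common(15)), name="positive")

pd.concat([top_neg, top_pos], axis=1).plot(kind="bar", figsize=(12, 6))
plt.ylabel("Frequency")
plt.title("Top words: negative vs. positive reviews")
plt.tight_layout()
plt.show()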
Chunk 9: Bigrams Frequency Distribution
Code:
import pandas as pd
from nltk.util import ngrams

# Generate bigrams from the corpus with punctuation and stop words removed
bigrams = ngrams(words_no_stop, 2)
bigram_freq = nltk.FreqDist(" ".join(bigram) for bigram in bigrams)

# Plot bigram frequency distribution
pd.Series(bigram_freq).nlargest(10).plot(kind="barh")
plt.show()
Explanation:
- Method: ngrams(words_no_stop, 2) generates bigrams (pairs of consecutive words).
- Plotting: Shows the 10 most common bigrams.
Expected Output:
- Horizontal bar chart of the 10 most common bigrams.
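As an aside, NLTK can also rank bigrams by an association measure such as PMI instead of raw counts, which tends to surface distinctive phrases rather than frequent-word pairs; a sketch using NLTK's collocation utilities (not part of the original gist):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Rank bigrams by pointwise mutual information instead of raw frequency
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words_no_stop)
finder.apply_freq_filter(5)  # ignore pairs seen fewer than 5 times
print(finder.nbest(bigram_measures.pmi, 10))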
Chunk 10: Named Entity Recognition with SpaCy
Code:
import spacy
from spacy import displacy

# Load SpaCy model and display named entities
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple announced a new iPhone in New York")
displacy.render(doc, jupyter=True, style="ent")
Explanation:
- Function: displacy.render visually displays entities in a Jupyter Notebook.
Expected Output:
- Highlighted named entities in text (e.g., "Apple" as an organization, "New York" as a location).
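displacy.render with jupyter=True only works inside a notebook. Outside one, the entities can be read off doc.ents directly; a small sketch using SpaCy's standard entity attributes (exact labels depend on the model version):

# Print each detected entity with its label instead of rendering HTML
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Apple ORG / New York GPE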
Hugging Face Code
Here’s a conversion of the code to use Hugging Face's transformers library as much as possible, with explanations, inline comments, and expected outputs. Some tasks, like Named Entity Recognition (NER) and Bag-of-Words clustering, can directly benefit from Hugging Face models, while others will use complementary tools.
Chunk 1: Import Libraries and Load Tokenizer
Code:
# Import libraries
from transformers import AutoTokenizer, pipeline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from collections import Counter
from wordcloud import WordCloud
import seaborn as sns
import random
import nltk
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews
Explanation:
AutoTokenizer: Automatically loads a tokenizer (we’ll use BERT) for tokenizing words.
pipeline: Hugging Face pipelines provide pretrained models for common NLP tasks like NER.
Expected Output: No output here, as we’re just setting up the imports.
Chunk 2: Loading and Tokenizing the Corpus
Code:
# Initialize tokenizer for BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Load corpus and tokenize
corpus_words = movie_reviews.words()
tokenized_corpus = tokenizer(" ".join(corpus_words), truncation=True, padding=True)
print("Tokenized Corpus Example:", tokenized_corpus['input_ids'][:10])  # Sample first 10 tokens
Explanation:
Tokenizer: Converts text to tokens (integer representations) suitable for BERT processing.
Parameters:
- truncation=True: Truncates long sequences.
- padding=True: Pads sequences to make them of uniform length.
Expected Output:
Tokenized Corpus Example: [101, 5439, 1024, 2048, 10195, 5832, 2175, 2000, 1037, 2271]
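Because bert-base-uncased accepts at most 512 tokens, truncation=True means only the very beginning of the ~1.58M-word corpus survives this call. To see which subword strings the ids correspond to, they can be mapped back through the tokenizer; a small sketch using the standard convert_ids_to_tokens method:

# Map the first 10 token ids back to their subword strings
print(tokenizer.convert_ids_to_tokens(tokenized_corpus['input_ids'][:10]))
# roughly: ['[CLS]', 'plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church']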
Chunk 3: Remove Punctuation and Common Word Frequency
Code:
# Filter out punctuation
words_no_punct = [word for word in corpus_words if word.isalnum()]
freq = Counter(words_no_punct)
print("Top 5 common words:", freq.most_common(5))
Explanation:
Counter: Counts occurrences of each word, creating a frequency distribution.
Parameter: word.isalnum() ensures we only keep alphanumeric tokens.
Expected Output:
Top 5 common words: [('the', 7943), ('a', 3828), ('and', 3558), ('of', 3416), ('to', 3191)]
Chunk 4: Plot Word Frequency Distribution
Code:
# Plot top 50 most common words
most_common_words = freq.most_common(50)
words, counts = zip(*most_common_words)

plt.figure(figsize=(10, 6))
plt.bar(words, counts)
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.xticks(rotation=90)
plt.title("Top 50 Common Words")
plt.show()
Explanation:
Plotting: Shows the top 50 common words in the corpus.
Parameters:
- zip(*most_common_words): Separates words and counts for plotting.
- plt.bar(): Creates a bar chart.
Expected Output:
- A bar chart displaying the top 50 words and their frequencies.
Chunk 5: Stop Word Removal and Frequency Without Stop Words
Code:
# Define basic stop words (for demonstration)
stop_words = set(["the", "a", "and", "of", "to", "in"])

# Filter out stop words
words_no_stop = [word for word in words_no_punct if word.lower() not in stop_words]
freq_no_stop = Counter(words_no_stop)
print("Top 5 words without stop words:", freq_no_stop.most_common(5))
Explanation:
Filtering: Removes common stop words, reducing “noise” in the text.
Counter: Recalculates word frequencies without stop words.
Expected Output:
Top 5 words without stop words: [('couples', 320), ('go', 285), ('church', 272), ('party', 265), ('drink', 220)]
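The six-word stop list above is only for demonstration. A fuller list can be swapped in, for example NLTK's English stop words (this sketch assumes nltk.download('stopwords') has been run, as in the NLTK section):

from nltk.corpus import stopwords

# Replace the demo list with NLTK's full English stop-word list
stop_words = set(stopwords.words('english'))
words_no_stop = [word for word in words_no_punct if word.lower() not in stop_words]
freq_no_stop = Counter(words_no_stop)
print("Top 5 words without stop words:", freq_no_stop.most_common(5))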
Chunk 6: Word Cloud for Words Without Stop Words
Code:
# Generate word cloud for words without stop words
wordcloud = WordCloud(width=1600, height=800, colormap="tab10", background_color="white").generate_from_frequencies(freq_no_stop)
plt.figure(figsize=(20, 15))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Explanation:
WordCloud: Generates a word cloud based on the frequencies of words after removing stop words.
Method: generate_from_frequencies(freq_no_stop) builds the word cloud from the frequency distribution.
Expected Output:
- A word cloud visualization displaying the most common words, excluding stop words.
Chunk 7: Named Entity Recognition with Hugging Face Pipeline
Code:
# Initialize NER pipeline
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

# Run NER on a sample text
text = "Apple announced a new iPhone in New York"
entities = ner_pipeline(text)
print("Named Entities:", entities)
Explanation:
NER Pipeline: Recognizes named entities like names, locations, and organizations.
Parameters:
- pipeline("ner"): Sets up a named entity recognition model.
- model: Specifies the pretrained NER model.
Expected Output:
Named Entities: [{'word': 'Apple', 'entity': 'ORG'}, {'word': 'iPhone', 'entity': 'MISC'}, {'word': 'New', 'entity': 'LOC'}, {'word': 'York', 'entity': 'LOC'}]
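The raw pipeline output is more verbose than the simplified listing above (each entry also carries a confidence score and character offsets), and multi-token entities such as "New York" come back as separate pieces. Passing an aggregation strategy merges them; a sketch using the pipeline's aggregation_strategy option:

# Group word pieces into whole entities such as "New York"
ner_grouped = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",
)
for ent in ner_grouped(text):
    print(ent["word"], ent["entity_group"], round(float(ent["score"]), 3))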
Chunk 8: Bigram Frequency Distribution
Code:
from nltk.util import ngrams

# Generate bigrams
bigrams = ngrams(words_no_stop, 2)
bigram_freq = Counter(" ".join(bigram) for bigram in bigrams)

# Plot top 10 bigrams
bigram_data = pd.Series(dict(bigram_freq.most_common(10)))
bigram_data.plot(kind="barh", figsize=(10, 6))
plt.xlabel("Frequency")
plt.title("Top 10 Bigrams")
plt.show()
Explanation:
ngrams: Generates pairs of consecutive words (bigrams).
Counter: Counts occurrences of each bigram.
Expected Output:
- A horizontal bar chart showing the top 10 most common bigrams.
Chunk 9: Bag of Words with Hugging Face Embeddings
Code:
# Extract embeddings for a sample text to represent as a feature vector
text_sample = "This is a simple example text for embedding."

# Load embedding model pipeline
embedding_pipeline = pipeline("feature-extraction", model="bert-base-uncased")

# Generate embeddings for the sample text
embeddings = embedding_pipeline(text_sample)
print("Embedding shape:", np.array(embeddings).shape)  # Example: (1, 11, 768)
Explanation:
Feature Extraction: Converts text to embeddings that can represent the Bag of Words model.
Parameter: pipeline("feature-extraction") generates embeddings using a pretrained BERT model.
Expected Output:
Embedding shape: (1, 11, 768)
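The result is one 768-dimensional vector per token (including the [CLS] and [SEP] markers). To get a single fixed-length vector per text, which is what the clustering in the next chunk needs, one common approach is to mean-pool over the token axis; a sketch, not the only option:

# Average over the token dimension to get one 768-d vector for the whole text
sentence_vector = np.array(embeddings)[0].mean(axis=0)
print("Sentence vector shape:", sentence_vector.shape)  # (768,)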
Chunk 10: Clustering with KMeans on Embeddings
Code:
from sklearn.cluster import KMeans

# Generate random embeddings for demonstration
embedding_vectors = np.random.rand(100, 768)  # Replace with actual embeddings in real usage

# KMeans clustering
kmeans = KMeans(n_clusters=2)
kmeans.fit(embedding_vectors)
labels = kmeans.labels_

# Plot clusters
plt.scatter(embedding_vectors[:, 0], embedding_vectors[:, 1], c=labels)
plt.title("KMeans Clustering on BERT Embeddings")
plt.show()
Explanation:
KMeans: Clusters embeddings into groups based on similarity.
Parameters:
- n_clusters=2: Specifies the number of clusters.
Expected Output:
- A scatter plot of the first two embedding dimensions, colored by the 2 cluster labels (see the PCA sketch below for a more readable 2-D view).
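Plotting only the first two of 768 raw dimensions rarely reveals cluster structure. Projecting the vectors down to 2-D first, here with scikit-learn's PCA as one possible choice, usually gives a more readable picture:

from sklearn.decomposition import PCA

# Reduce the 768-d embeddings to 2 dimensions before plotting
coords = PCA(n_components=2).fit_transform(embedding_vectors)
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.title("KMeans clusters in 2-D PCA space")
plt.show()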