NLTK, SpaCy, Gensim vs. Hugging Face #4: POS Tagging, Parsing, TF-IDF, One-Hot Encoding, and Word2Vec
Table of contents
- Comparison Table
- NLTK Code
- Hugging Face Code
- Chunk 1: POS Tagging with Hugging Face
- Chunk 2: Dependency Parsing (Using SpaCy, as Hugging Face doesn’t directly support parsing)
- Chunk 3: Bag of Words Representation with Hugging Face Tokenizer
- Chunk 4: TF-IDF Alternative with Hugging Face Embeddings
- Chunk 5: One-Hot Encoding with Hugging Face Tokenizer
- Chunk 6: Word Embeddings and Similarity with Hugging Face
- Chunk 7: Document Clustering Using Embeddings with KMeans
NLTK Code
https://gist.github.com/c969899618e37ba00be355eb676c8c39.git
Hugging Face Code
https://gist.github.com/4f9b5a23dcde12a45f87ea5254013aed.git
Comparison Table
Here’s a comparison table summarizing which tool (Hugging Face Transformers, SpaCy, or traditional methods like NLTK) is best suited for each NLP task based on functionality, ease of use, and task-specific strengths:
| Task | Best Tool | Reason |
| --- | --- | --- |
| POS Tagging | Hugging Face Transformers | Hugging Face’s pretrained pipelines are easy to set up and provide accurate POS tagging with deep learning models. |
| Dependency Parsing | SpaCy | SpaCy’s dependency parsing is robust and fast, making it ideal for syntactic analysis; Hugging Face doesn’t directly support dependency parsing. |
| Bag of Words (BoW) | Traditional (NLTK/Sklearn) | Simple word counting is straightforward in NLTK or Sklearn’s `CountVectorizer`. Hugging Face is better suited for embeddings than basic BoW. |
| TF-IDF | Sklearn | Sklearn’s `TfidfVectorizer` is specialized for TF-IDF and more flexible for text corpus analysis, while Hugging Face is better for embeddings. |
| Embedding-Based Features | Hugging Face Transformers | Hugging Face’s feature-extraction pipeline provides deep, contextualized embeddings, making it more advanced than TF-IDF for feature extraction. |
| Integer and One-Hot Encoding | Traditional (Sklearn) | Sklearn’s encoders (`LabelEncoder`, `OneHotEncoder`) are ideal for simple integer and one-hot encoding; Hugging Face’s transformers are too advanced for this. |
| Word Similarity (Cosine Similarity) | Hugging Face Transformers | Hugging Face embeddings capture word similarity effectively with contextual embeddings, providing richer comparisons than simple word vectors. |
| Document Clustering | Hugging Face + Sklearn (KMeans) | Hugging Face embeddings combined with Sklearn’s KMeans clustering capture topic clusters effectively thanks to contextual word meanings. |
| Named Entity Recognition (NER) | Hugging Face Transformers | Hugging Face’s pretrained NER models are state-of-the-art, easy to use, and outperform traditional rule-based approaches in accuracy. |
| Tokenization with Special Characters | Hugging Face Transformers | Hugging Face tokenizers handle complex tokens, special characters, and subword splits, making them ideal for modern NLP tasks. |
| Dependency Tree Visualization | SpaCy | SpaCy provides built-in dependency visualization (`displacy`), making it easier to visualize syntactic structures than Hugging Face, which lacks direct support. |
| Basic Frequency Analysis | Traditional (NLTK) | Simple word frequency analysis is easier and faster in NLTK than in Hugging Face, which is optimized for more complex, contextual embeddings. |
Summary:
- Hugging Face Transformers: Best for advanced NLP tasks that benefit from contextual embeddings, such as POS tagging, word similarity, clustering, and NER (a minimal NER sketch follows below).
- SpaCy: Preferred for syntactic analysis tasks like dependency parsing and visualization, thanks to its optimized dependency trees and visual tools.
- Traditional (NLTK/Sklearn): Ideal for simple frequency-based tasks, TF-IDF, and integer or one-hot encoding, where deep learning models are unnecessary.
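The table recommends Hugging Face for NER, but none of the chunks below cover it, so here is a minimal sketch. The `dslim/bert-base-NER` checkpoint and the example sentence are assumptions chosen only for illustration; any token-classification NER model on the Hub could be swapped in.

```python
# Minimal NER sketch with a Hugging Face pipeline
# (assumes the publicly available dslim/bert-base-NER checkpoint)
from transformers import pipeline

ner_pipeline = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
entities = ner_pipeline("Samuel L. Jackson starred in a movie filmed in New York.")
print(entities)  # list of dicts with entity_group, word, score, start, end
```

`aggregation_strategy="simple"` merges subword pieces so each detected entity is returned as a single span rather than separate WordPiece tokens.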
NLTK Code
Chunk 1: POS Tagging with NLTK
Code:
```python
# Import necessary libraries
import nltk
from nltk.corpus import movie_reviews

# One-time downloads may be required, e.g.:
# nltk.download('movie_reviews'); nltk.download('averaged_perceptron_tagger')

# Load sentences from movie reviews and select one for demonstration
example_sentences = movie_reviews.sents()
example_sentence = example_sentences[0]

# Perform POS tagging
pos_tags = nltk.pos_tag(example_sentence)
print("POS Tags:", pos_tags)
```
Explanation:
- Module: `nltk.pos_tag` is used for part-of-speech (POS) tagging, identifying grammatical roles (like nouns and verbs) for each word.
- Parameter: `example_sentence` is a list of words representing a single sentence from the movie reviews dataset.
Sample Output:
POS Tags: [('plot', 'NN'), (':', ':'), ('two', 'CD'), ('teen', 'JJ'), ('couples', 'NNS'), ('go', 'VBP'), ('to', 'TO'), ('a', 'DT'), ('church', 'NN'), ('party', 'NN'), (',', ','), ('drink', 'NN'), ('and', 'CC'), ('then', 'RB'), ('drive', 'VB'), ('.', '.')]
Chunk 2: Dependency Parsing with SpaCy
Code:
```python
import spacy

# Sample text for dependency parsing
text = "plot: two teen couples go to a church party, drink and then drive."

# Load the small English model in SpaCy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

# Print token information including POS tags and dependency relations
for token in doc:
    print(f"Token: {token.text}, POS Tag: {token.tag_}, Head: {token.head.text}, Dependency: {token.dep_}")
```
Explanation:
- Module: `spacy.load("en_core_web_sm")` loads a pretrained small English model that includes POS tagging, parsing, and NER.
- Class: `doc` is a processed object where each `token` has attributes like `text`, `tag_` (POS tag), `head` (head word), and `dep_` (dependency type).
Sample Output:
```
Token: plot, POS Tag: NN, Head: plot, Dependency: ROOT
Token: :, POS Tag: :, Head: plot, Dependency: punct
Token: two, POS Tag: CD, Head: couples, Dependency: nummod
...
```
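The comparison table also credits SpaCy with built-in dependency-tree visualization via `displacy`, which none of the chunks demonstrate. A minimal sketch reusing the `doc` object from the code above (the output file name is an arbitrary choice):

```python
# Sketch: visualize the dependency tree of `doc` with SpaCy's displacy
from spacy import displacy

svg = displacy.render(doc, style="dep", jupyter=False)  # returns the rendered SVG markup as a string
with open("dependency_tree.svg", "w", encoding="utf-8") as f:
    f.write(svg)

# displacy.serve(doc, style="dep")  # alternatively, serve an interactive view in the browser
```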
Chunk 3: Count Bag of Words (BoW)
Code:
```python
# Function to calculate Bag of Words counts
# (word_features, a list of vocabulary words, must be defined beforehand)
def document_features(document):
    features = {}
    for word in word_features:
        features[word] = 0
        for doc_word in document:
            if word == doc_word:
                features[word] += 1
    return features
```
Explanation:
- Function: `document_features` counts occurrences of each word in `word_features` for a given document.
- Parameters: `document` is a list of words representing a document; `word_features` is a list of important words to track in each document (it is not defined in this chunk; see the sketch after the sample output).
- Usage: Creates a simple word-count-based feature representation (BoW).
Sample Output:
- A dictionary with words as keys and their counts in the document as values (e.g., `{'plot': 1, 'party': 2}`).
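Since `word_features` is assumed rather than defined above, here is a usage sketch that builds it from the most frequent corpus words and runs `document_features` on the first review. The vocabulary size of 200 is an arbitrary choice for illustration.

```python
# Sketch: define word_features from frequent corpus words, then extract BoW features for one review
import nltk
from nltk.corpus import movie_reviews

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [word for word, _ in all_words.most_common(200)]  # 200 is an arbitrary vocabulary size

first_review = list(movie_reviews.words(movie_reviews.fileids()[0]))
features = document_features(first_review)
print({word: count for word, count in features.items() if count > 0})
```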
Chunk 4: TF-IDF Vectorization
Code:
```python
import string
import os
import nltk
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Maximum number of tokens (words) to consider
max_tokens = 200

# Tokenizer function to split text into words
def tokenize(text):
    return nltk.word_tokenize(text)

# Path to movie reviews (adjust path as needed)
path = './movie_reviews/'
token_dict = {}

# Read files and remove punctuation
for dirpath, dirs, files in os.walk(path):
    for f in files:
        fname = os.path.join(dirpath, f)
        with open(fname) as review:
            text = review.read()
            token_dict[f] = text.lower().translate(str.maketrans('', '', string.punctuation))

# TF-IDF Vectorizer with a maximum of max_tokens words
tfIdfVectorizer = TfidfVectorizer(input="content", use_idf=True, tokenizer=tokenize,
                                  max_features=max_tokens, stop_words='english')
tfIdf = tfIdfVectorizer.fit_transform(token_dict.values())

# Convert to DataFrame
tfidf_tokens = tfIdfVectorizer.get_feature_names_out()
final_vectors = pd.DataFrame(data=tfIdf.toarray(), columns=tfidf_tokens)
print(final_vectors.head())
```
Explanation:
- Class: `TfidfVectorizer` transforms text into TF-IDF vectors, representing text by word importance.
- Parameters: `max_features=max_tokens` limits the vocabulary to the top `max_tokens` most important words; `stop_words='english'` removes common English stop words.
Sample Output:
- A DataFrame showing TF-IDF values for each word in each document.
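Because each row of `final_vectors` is one document and each column one vocabulary term, a quick follow-up (a sketch using the DataFrame produced above) is to sort a single row and see which terms TF-IDF weights most heavily for that document:

```python
# Sketch: show the 10 highest-weighted TF-IDF terms for the first document
first_doc_scores = final_vectors.iloc[0]
print(first_doc_scores.sort_values(ascending=False).head(10))
```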
Chunk 5: Integer and One-Hot Encoding
Code:
```python
from numpy import argmax
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from nltk.corpus import movie_reviews

# List of words from the first movie review (only first 50 words for the example)
data = list(movie_reviews.words(movie_reviews.fileids()[0])[:50])

# Integer encode the words
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(data)
print("Integer Encoded:", integer_encoded)

# One-hot encode (on scikit-learn >= 1.2, use sparse_output=False instead of sparse=False)
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print("One-Hot Encoded:", onehot_encoded[0])

# Decode the first one-hot vector back to the original word
inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0, :])])
print("Decoded Word:", inverted[0])
```
Explanation:
- Classes: `LabelEncoder` encodes words as integers; `OneHotEncoder` converts the integer encoding to a one-hot encoding.
- Parameters: `sparse=False` in `OneHotEncoder` ensures the output is dense, not sparse.
Sample Output:
```
Integer Encoded: [integer values]
One-Hot Encoded: [binary vector]
Decoded Word: plot
```
Chunk 6: Finding Similar Words with Word2Vec
Code:
```python
from gensim.models import Word2Vec
from nltk.corpus import movie_reviews

# Prepare documents for Word2Vec: one list of tokens per review
documents = [list(movie_reviews.words(fileid)) for fileid in movie_reviews.fileids()]

# Train the Word2Vec model
model = Word2Vec(documents, min_count=5)

# Find words similar to 'movie'
similar_words = model.wv.most_similar(positive=['movie'], topn=5)
print("Words similar to 'movie':", similar_words)
```
Explanation:
- Class: `Word2Vec` learns word embeddings, which are dense representations of words capturing semantic meaning.
- Parameters: `min_count=5` only includes words that appear at least 5 times; `topn=5` returns the top 5 most similar words.
Sample Output:
Words similar to 'movie': [('film', 0.85), ('story', 0.76), ('plot', 0.75), ('character', 0.74), ('director', 0.73)]
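Beyond `most_similar`, the trained model also exposes pairwise similarity scores and the raw vectors themselves. A short usage sketch (the 100-dimension figure is gensim's default `vector_size`, not something set above):

```python
# Sketch: pairwise similarity and raw vector access on the trained Word2Vec model
print("Similarity('movie', 'film'):", model.wv.similarity("movie", "film"))

movie_vector = model.wv["movie"]  # dense embedding; 100 dimensions with gensim's default vector_size
print("Vector dimensions:", len(movie_vector))
```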
Hugging Face Code
Chunk 1: POS Tagging with Hugging Face
Code:
```python
# Import the Hugging Face pipeline for POS tagging
from transformers import pipeline

# Initialize the POS tagging pipeline
pos_pipeline = pipeline("token-classification", model="vblagoje/bert-english-uncased-finetuned-pos")

# Define example sentence
example_sentence = "plot: two teen couples go to a church party, drink and then drive."

# Perform POS tagging
pos_tags = pos_pipeline(example_sentence)
print("POS Tags:", pos_tags)
```
Explanation:
- Pipeline: Hugging Face's `pipeline` function sets up a pretrained POS tagging model, which identifies parts of speech for each word.
- Parameter: `model="vblagoje/bert-english-uncased-finetuned-pos"` specifies a model fine-tuned for POS tagging.
Sample Output:
POS Tags: [{'word': 'plot', 'entity': 'NOUN'}, {'word': ':', 'entity': 'PUNCT'}, ...]
Chunk 2: Dependency Parsing (Using SpaCy, as Hugging Face doesn’t directly support parsing)
Code:
```python
import spacy

# Load SpaCy's small English model
nlp = spacy.load("en_core_web_sm")
doc = nlp("plot: two teen couples go to a church party, drink and then drive.")

# Print token, POS tag, and dependency info
print("\nDependency Parsing:")
for token in doc:
    print(f"Token: {token.text}, POS Tag: {token.tag_}, Head: {token.head.text}, Dependency: {token.dep_}")
```
Explanation:
- SpaCy Dependency Parsing: SpaCy provides dependency parsing as part of its NLP pipeline, analyzing grammatical roles and relationships between words.
- Parameter: `en_core_web_sm` is SpaCy's small English model, which includes POS tagging and dependency parsing.
Sample Output:
```
Token: plot, POS Tag: NN, Head: plot, Dependency: ROOT
Token: :, POS Tag: :, Head: plot, Dependency: punct
...
```
Chunk 3: Bag of Words Representation with Hugging Face Tokenizer
Code:
```python
from transformers import AutoTokenizer
from collections import Counter

# Load the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Define example text
example_text = "plot: two teen couples go to a church party, drink and then drive."

# Tokenize and count each token for a BoW representation
tokens = tokenizer.tokenize(example_text)
token_counts = Counter(tokens)
print("\nBag of Words:", token_counts)
```
Explanation:
- Tokenizer: Converts text into tokens using a BERT tokenizer, generating tokens that represent words or subwords.
- Counter: Counts occurrences of each token, mimicking a basic Bag of Words (BoW) model for a single document (a multi-document `CountVectorizer` sketch follows the sample output).
Sample Output:
Bag of Words: Counter({'plot': 1, ':': 1, 'two': 1, 'teen': 1, 'couples': 1, 'go': 1, ...})
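As the comparison table notes, a true multi-document BoW matrix is usually built with Sklearn's `CountVectorizer` rather than a transformer tokenizer. A minimal sketch on two made-up review snippets:

```python
# Sketch: classic Bag of Words over multiple documents with Sklearn's CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "plot: two teen couples go to a church party, drink and then drive.",
    "this movie was about a young couple who fell in love unexpectedly.",
]
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(docs)  # sparse document-term count matrix
print(vectorizer.get_feature_names_out())
print(bow_matrix.toarray())
```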
Chunk 4: TF-IDF Alternative with Hugging Face Embeddings
Code:
```python
from transformers import pipeline
import pandas as pd

# Load the Hugging Face pipeline for feature extraction
embedding_pipeline = pipeline("feature-extraction", model="bert-base-uncased")

# Sample text data
documents = ["plot: two teen couples go to a church party, drink and then drive.",
             "this movie was about a young couple who fell in love unexpectedly."]

# Extract embeddings for each document and convert to a DataFrame
embeddings = [embedding_pipeline(doc)[0] for doc in documents]  # [0] selects the only sequence in the batch
embedding_df = pd.DataFrame([embedding[0] for embedding in embeddings])  # keep the first ([CLS]) token for simplicity

print("\nEmbeddings (first document):")
print(embedding_df.head())
```
Explanation:
- Feature Extraction: Converts documents into embedding vectors that capture the contextual meaning of each word/token; only the first ([CLS]) token vector is kept here (a mean-pooling alternative is sketched after the sample output).
- Parameter: `model="bert-base-uncased"` uses BERT to generate embeddings based on context.
Sample Output:
```
Embeddings (first document):
        0       1       2  ...     765     766     767
0  0.6532  0.2481  0.9732  ... -0.8761  0.2432  0.7651
```
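The chunk above keeps only the first ([CLS]) token vector for brevity. A common alternative, sketched here, is to mean-pool every token embedding into one fixed-size vector per document (768 dimensions for `bert-base-uncased`):

```python
# Sketch: mean-pool token embeddings into one 768-dimensional vector per document
import numpy as np

pooled_vectors = []
for doc in documents:
    token_embeddings = np.array(embedding_pipeline(doc)[0])  # shape: (num_tokens, 768)
    pooled_vectors.append(token_embeddings.mean(axis=0))     # average across tokens

print("Pooled vector shape:", pooled_vectors[0].shape)        # (768,)
```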
Chunk 5: One-Hot Encoding with Hugging Face Tokenizer
Code:
```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Tokenize example text (reuses `tokenizer` and `example_text` from Chunk 3)
tokens = tokenizer.tokenize(example_text)

# Convert tokens to unique integer IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)

# One-hot encode the token IDs (on scikit-learn >= 1.2, use sparse_output=False instead of sparse=False)
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = np.array(token_ids).reshape(-1, 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print("\nOne-Hot Encoded Vector for First Token:", onehot_encoded[0])
```
Explanation:
- OneHotEncoder: Converts the integer IDs of tokens into binary vectors that represent each token uniquely.
- Parameter: `sparse=False` ensures dense array output.
Sample Output:
One-Hot Encoded Vector for First Token: [0. 0. 1. 0. ...]
Chunk 6: Word Embeddings and Similarity with Hugging Face
Code:
```python
from sklearn.metrics.pairwise import cosine_similarity

# Define two example words for comparison
word1, word2 = "movie", "film"

# Get an embedding for each word (reuses embedding_pipeline from Chunk 4)
embedding1 = embedding_pipeline(word1)[0][0]  # [0][0] selects the vector of the first ([CLS]) token
embedding2 = embedding_pipeline(word2)[0][0]

# Compute cosine similarity
similarity = cosine_similarity([embedding1], [embedding2])[0][0]
print(f"\nCosine Similarity between '{word1}' and '{word2}':", similarity)
```
Explanation:
- Cosine Similarity: Measures how similar two embedding vectors are in direction, with 1 being identical and -1 being opposite.
- Indexing: `[0][0]` selects the embedding vector for the first token returned for each word.
Sample Output:
Cosine Similarity between 'movie' and 'film': 0.89
Chunk 7: Document Clustering Using Embeddings with KMeans
Code:
```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate one embedding per document for clustering (reuses embedding_pipeline and documents from Chunk 4)
document_embeddings = [embedding_pipeline(doc)[0][0] for doc in documents]

# Apply KMeans clustering
kmeans = KMeans(n_clusters=2)
labels = kmeans.fit_predict(document_embeddings)

# Plot clusters
plt.scatter(range(len(labels)), labels, c=labels, cmap='viridis')
plt.title("Document Clustering with KMeans on BERT Embeddings")
plt.xlabel("Document Index")
plt.ylabel("Cluster Label")
plt.show()
```
Explanation:
- KMeans Clustering: Groups documents into clusters based on their embeddings, finding similarities in topics or themes.
- Parameter: `n_clusters=2` specifies the number of clusters for grouping.
Sample Output:
- A scatter plot showing clusters, where each point represents a document’s cluster assignment.