Hugging Face #6: Topic Modeling, Sentence Embeddings, and Dynamic Clustering
Table of contents
- Source Code Here:
- Chunk 1: Import Libraries and Load Data
- Chunk 2: Generating Embeddings
- Chunk 3: (Optional) View Embeddings
- Chunk 4: Configure CountVectorizer
- Chunk 5: Set Parameters for HDBSCAN and UMAP
- Chunk 6: Train BERTopic Model
- Chunk 7: Get and Visualize Topic Frequencies
- Chunk 8: View Topic Information
- Chunk 9: Document-Level Visualization
- Chunk 10: Topic Bar Chart Visualization
- Chunk 11: Predict Topics for New Documents
- Summary of Code Workflow
Source Code Here:
HuggingFace Code
https://gist.github.com/0f88f53525d0fd992b627d62a0da13b2.git
Chunk 1: Import Libraries and Load Data
Code:
!pip install bertopic

from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

# Load text data from the 20 Newsgroups dataset
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
Explanation:
- Libraries:
  - BERTopic: For topic modeling.
  - UMAP (Uniform Manifold Approximation and Projection): Reduces embedding dimensionality.
  - HDBSCAN: Clustering algorithm used by BERTopic to form topic clusters.
- Dataset: 20 Newsgroups text data.
Sample Output:
Loaded 20 Newsgroups dataset, containing documents on different topics.
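Before generating embeddings, it can help to sanity-check the corpus. A minimal sketch (the document count assumes subset='all'):

# Quick look at the loaded corpus
print(len(docs))       # 18846 documents for subset='all'
print(docs[0][:200])   # preview the first document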
Chunk 2: Generating Embeddings
Code:
# Use SentenceTransformer for creating embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_embeddings = embedding_model.encode(docs, show_progress_bar=True)
Explanation:
- SentenceTransformer: Generates embeddings (dense vectors) from sentences using a pre-trained model.
- Parameters:
  - show_progress_bar=True: Displays a progress bar during embedding generation.
Sample Output:
Corpus embeddings generated for all documents.
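Since embeddings place semantically similar texts near each other, one way to build intuition is to compare two documents with cosine similarity. A minimal sketch using scikit-learn (the printed value depends on the corpus):

from sklearn.metrics.pairwise import cosine_similarity

# Similarity between the first two documents' embeddings
sim = cosine_similarity(corpus_embeddings[0:1], corpus_embeddings[1:2])
print(sim[0][0])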
Chunk 3: (Optional) View Embeddings
Code:
# View the shape of the generated embeddings
print(corpus_embeddings.shape)
Explanation:
- Allows inspection of the embeddings to understand dimensions and structure.
Sample Output:
(18846, 384) # (number of documents, embedding dimension of all-MiniLM-L6-v2)
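Individual vectors can also be inspected; for example, the first few dimensions of one document's embedding:

# First five dimensions of the first document's embedding
print(corpus_embeddings[0][:5])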
Chunk 4: Configure CountVectorizer
Code:
from sklearn.feature_extraction.text import CountVectorizer

# Vectorizer for term frequency counts, useful for topic representation
vectorizer_model = CountVectorizer(stop_words="english", max_df=0.95, min_df=0.01)
Explanation:
- CountVectorizer: Prepares a word frequency matrix for use in BERTopic's topic representations.
- Parameters:
  - stop_words="english": Removes common English stop words.
  - max_df=0.95: Excludes terms appearing in more than 95% of documents.
  - min_df=0.01: Keeps only terms present in at least 1% of documents.
Sample Output:
CountVectorizer configured for text frequency analysis.
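To make the max_df/min_df behavior concrete, here is a standalone sketch on a hypothetical four-document toy corpus: a term in every document is dropped by max_df, and a term in only one document is dropped by min_df.

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus chosen only for illustration
toy_docs = ["apple banana", "apple cherry", "apple banana cherry", "apple date"]
cv = CountVectorizer(max_df=0.95, min_df=0.5)
cv.fit(toy_docs)

# 'apple' (in 100% of docs) and 'date' (in 25%) are filtered out
print(cv.get_feature_names_out())  # ['banana' 'cherry']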
Chunk 5: Set Parameters for HDBSCAN and UMAP
Code:
# Configure clustering and dimensionality reduction models
hdbscan_model = HDBSCAN(min_cluster_size=30, metric='euclidean', prediction_data=True)
umap_model = UMAP(n_neighbors=15, n_components=10, metric='cosine', low_memory=False)
Explanation:
- HDBSCAN:
  - min_cluster_size=30: Minimum size for clusters, which influences topic granularity.
  - metric='euclidean': Distance metric used for clustering.
  - prediction_data=True: Stores the data HDBSCAN needs to assign clusters to new documents later.
- UMAP:
  - n_neighbors=15: Balances local and global structure in the data.
  - n_components=10: Reduces embeddings to 10 dimensions before clustering.
  - metric='cosine': Uses cosine distance, a common choice for text embeddings.
Sample Output:
HDBSCAN and UMAP models configured for clustering and dimensionality reduction.
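As a standalone check, UMAP can be run directly on the corpus embeddings to confirm the reduced dimensionality. BERTopic fits its own reduction during training, so this step is for intuition only:

# Reduce the embeddings and confirm the target dimensionality
reduced = umap_model.fit_transform(corpus_embeddings)
print(reduced.shape)  # (number_of_documents, 10)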
Chunk 6: Train BERTopic Model
Code:
# Initialize and train BERTopic
model = BERTopic(
    n_gram_range=(1, 3),
    vectorizer_model=vectorizer_model,
    nr_topics=40,
    top_n_words=10,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    min_topic_size=30,
    calculate_probabilities=True
).fit(docs, corpus_embeddings)
Explanation:
- BERTopic Parameters:
  - n_gram_range=(1, 3): Considers 1- to 3-word combinations as terms.
  - nr_topics=40: Reduces the discovered topics down to 40.
  - top_n_words=10: Shows the top 10 words for each topic.
  - min_topic_size=30: Matches the HDBSCAN minimum cluster size.
  - calculate_probabilities=True: Enables a probability distribution over topics for each document.
Sample Output:
BERTopic model trained with specified clustering and vectorization parameters.
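Once trained, individual topics can be inspected directly. get_topic returns the top words and their weights for a given topic id (the words shown in the comment are only illustrative):

# Top words and c-TF-IDF weights for topic 0
print(model.get_topic(0))  # e.g. [('space', 0.01), ('nasa', 0.008), ...]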
Chunk 7: Get and Visualize Topic Frequencies
Code:
# Retrieve topics and probabilities for each document
topics, probabilities = model.transform(docs, corpus_embeddings)

df_topic_freq = model.get_topic_freq()
print(df_topic_freq)

# Visualize topic frequencies in a bar chart
topics_count = len(df_topic_freq) - 1
model.visualize_barchart(top_n_topics=topics_count)
Explanation:
- get_topic_freq: Returns the frequency of each topic across all documents; subtracting 1 from its length excludes the outlier topic (-1).
- visualize_barchart: Displays bar charts of the top words for each topic.
Sample Output:
- DataFrame with topic frequencies, plus bar charts of each topic's top words.
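BERTopic also provides an intertopic distance map that complements the bar chart; a minimal sketch (the output filename is arbitrary):

# Two-dimensional map of topic similarity
fig_topics = model.visualize_topics()
fig_topics.write_html("./topics.html")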
Chunk 8: View Topic Information
Code:
# Get topic information
topic_info = model.get_topic_info()
print(topic_info)
Explanation:
- get_topic_info: Retrieves detailed information about each topic, including representative words.
Sample Output:
Topic information with top words for each topic.
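Since get_topic_info returns a pandas DataFrame, it can be filtered like any other; for example, dropping the outlier topic (-1) before reading off the largest topics:

# Exclude the outlier topic and show the ten largest topics
print(topic_info[topic_info["Topic"] != -1].head(10))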
Chunk 9: Document-Level Visualization
Code:
# Visualize document clusters based on topics
fig = model.visualize_documents(
    docs,
    embeddings=corpus_embeddings,
    sample=0.6,
    topics=[0, 1, 2, 3, 4, 5, 6],
    hide_annotations=False,
    hide_document_hover=True
)
fig.write_image("./clusters.svg")
Explanation:
- visualize_documents: Creates a scatter plot with documents grouped by topic.
- Parameters:
  - sample=0.6: Uses a 60% sample of documents for visualization.
  - topics=[0, 1, 2, ...]: Restricts the plot to the specified topics.
  - hide_document_hover=True: Disables hover text to keep the figure lightweight.
Sample Output:
- Cluster plot saved as clusters.svg.
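Note that Plotly's write_image requires the kaleido package; if static export fails, installing it or falling back to HTML output are both options:

# Static image export depends on kaleido
# !pip install kaleido
fig.write_html("./clusters.html")  # HTML export needs no extra dependency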
Chunk 10: Topic Bar Chart Visualization
Code:
# Create a bar chart for topic visualization
fig2 = model.visualize_barchart()
fig2.write_image("./barchart.svg")
Explanation:
- visualize_barchart: Renders bar charts of each topic's most relevant words and their scores.
Sample Output:
- Bar chart saved as barchart.svg.
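The chart can be trimmed with parameters such as top_n_topics and n_words (the values below are arbitrary):

# Show only the 8 largest topics with 5 words each
fig2 = model.visualize_barchart(top_n_topics=8, n_words=5)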
Chunk 11: Predict Topics for New Documents
Code:
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")

new_docs = ["I'm looking for a new graphics card", "When is the next NASA launch?"]
embeddings = sentence_model.encode(new_docs)

topics, probs = model.transform(new_docs, embeddings)
print("Topics:", topics)
print("Probabilities:", probs)
Explanation:
- Predict New Topics: Encodes new documents and predicts their topic associations.
- Parameters:
  - new_docs: List of new document texts to analyze.
Sample Output:
Topics: [topic_number, ...]
Probabilities: [probability_distribution]
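To make the predictions readable, the returned topic ids can be mapped back to their top words with get_topic (output will vary by run):

# Translate topic ids into their top words
for doc, topic in zip(new_docs, topics):
    words = [word for word, _ in model.get_topic(topic)] if topic != -1 else ["outlier"]
    print(f"{doc!r} -> topic {topic}: {words[:5]}")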
Summary of Code Workflow
- BERTopic is trained to find clusters (topics) in text documents, which are then visualized and examined.
- Embeddings are generated by SentenceTransformer and drive the topic assignments.
- Visualization tools give insight into document-topic relationships and the most frequent topics.
This code provides an entire topic modeling pipeline using BERTopic, from loading text data to predicting topics for new data.