Hugging Face #6: Topic Modeling, Sentence Embeddings, and Dynamic Clustering
Table of contents
- Source Code Here:
- Chunk 1: Import Libraries and Load Data
- Chunk 2: Generating Embeddings
- Chunk 3: (Optional) View Embeddings
- Chunk 4: Configure CountVectorizer
- Chunk 5: Set Parameters for HDBSCAN and UMAP
- Chunk 6: Train BERTopic Model
- Chunk 7: Get and Visualize Topic Frequencies
- Chunk 8: View Topic Information
- Chunk 9: Document-Level Visualization
- Chunk 10: Topic Bar Chart Visualization
- Chunk 11: Predict Topics for New Documents
- Summary of Code Workflow
Source Code Here:
HuggingFace Code
https://gist.github.com/0f88f53525d0fd992b627d62a0da13b2.git
Chunk 1: Import Libraries and Load Data
Code:
!pip install bertopic

from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

# Load text data from the 20 Newsgroups dataset
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
Explanation:
- Libraries:
  - BERTopic: For topic modeling.
  - UMAP (Uniform Manifold Approximation and Projection): Reduces embedding dimensionality.
  - HDBSCAN: Clustering algorithm used by BERTopic to form topic clusters.
- Dataset: 20 Newsgroups text data.
Sample Output:
Loaded 20 Newsgroups dataset, containing documents on different topics.
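Before generating embeddings, it can help to sanity-check the corpus. A minimal sketch (the document count assumes subset='all'):

# Quick look at the loaded corpus
print(len(docs))       # 18846 documents for subset='all'
print(docs[0][:200])   # preview the first document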
Chunk 2: Generating Embeddings
Code:
# Use SentenceTransformer for creating embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_embeddings = embedding_model.encode(docs, show_progress_bar=True)
Explanation:
- SentenceTransformer: Generates embeddings (dense vectors) from sentences using a pre-trained model.
- Parameters:
  - show_progress_bar=True: Displays a progress bar during embedding generation.
Sample Output:
Corpus embeddings generated for all documents.
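Since embeddings place semantically similar texts near each other, one way to build intuition is to compare two documents with cosine similarity. A minimal sketch using scikit-learn (the printed value depends on the corpus):

from sklearn.metrics.pairwise import cosine_similarity

# Similarity between the first two documents' embeddings
sim = cosine_similarity(corpus_embeddings[0:1], corpus_embeddings[1:2])
print(sim[0][0])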
Chunk 3: (Optional) View Embeddings
Code:
# View the shape of the generated embeddings
print(corpus_embeddings.shape)
Explanation:
- Allows inspection of the embeddings to understand dimensions and structure.
Sample Output:
(18846, 384) # (number of documents, embedding dimension of all-MiniLM-L6-v2)
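Individual vectors can also be inspected; for example, the first few dimensions of one document's embedding:

# First five dimensions of the first document's embedding
print(corpus_embeddings[0][:5])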
Chunk 4: Configure CountVectorizer
Code:
from sklearn.feature_extraction.text import CountVectorizer

# Vectorizer for term frequency counts, useful for topic representation
vectorizer_model = CountVectorizer(stop_words="english", max_df=0.95, min_df=0.01)
Explanation:
- CountVectorizer: Prepares a word frequency matrix for use in BERTopic's topic representations.
- Parameters:
  - stop_words="english": Removes common English stop words.
  - max_df=0.95: Excludes terms appearing in more than 95% of documents.
  - min_df=0.01: Keeps only terms present in at least 1% of documents.
Sample Output:
CountVectorizer configured for text frequency analysis.
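To make the max_df/min_df behavior concrete, here is a standalone sketch on a hypothetical four-document toy corpus: a term in every document is dropped by max_df, and a term in only one document is dropped by min_df.

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus chosen only for illustration
toy_docs = ["apple banana", "apple cherry", "apple banana cherry", "apple date"]
cv = CountVectorizer(max_df=0.95, min_df=0.5)
cv.fit(toy_docs)

# 'apple' (in 100% of docs) and 'date' (in 25%) are filtered out
print(cv.get_feature_names_out())  # ['banana' 'cherry']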
Chunk 5: Set Parameters for HDBSCAN and UMAP
Code:
# Configure clustering and dimensionality reduction models
hdbscan_model = HDBSCAN(min_cluster_size=30, metric='euclidean', prediction_data=True)
umap_model = UMAP(n_neighbors=15, n_components=10, metric='cosine', low_memory=False)
Explanation:
- HDBSCAN:
  - min_cluster_size=30: Minimum size for clusters, which influences topic granularity.
  - metric='euclidean': Distance metric used for clustering.
  - prediction_data=True: Stores the data HDBSCAN needs to assign clusters to new documents later.
- UMAP:
  - n_neighbors=15: Balances local and global structure in the data.
  - n_components=10: Reduces embeddings to 10 dimensions before clustering.
  - metric='cosine': Uses cosine distance, a common choice for text embeddings.
Sample Output:
HDBSCAN and UMAP models configured for clustering and dimensionality reduction.
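As a standalone check, UMAP can be run directly on the corpus embeddings to confirm the reduced dimensionality. BERTopic fits its own reduction during training, so this step is for intuition only:

# Reduce the embeddings and confirm the target dimensionality
reduced = umap_model.fit_transform(corpus_embeddings)
print(reduced.shape)  # (number_of_documents, 10)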
Chunk 6: Train BERTopic Model
Code:
# Initialize and train BERTopic
model = BERTopic(
    n_gram_range=(1, 3),
    vectorizer_model=vectorizer_model,
    nr_topics=40,
    top_n_words=10,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    min_topic_size=30,
    calculate_probabilities=True
).fit(docs, corpus_embeddings)
Explanation:
- BERTopic Parameters:
  - n_gram_range=(1, 3): Considers 1- to 3-word combinations as terms.
  - nr_topics=40: Reduces the discovered topics down to 40.
  - top_n_words=10: Shows the top 10 words for each topic.
  - min_topic_size=30: Matches the HDBSCAN minimum cluster size.
  - calculate_probabilities=True: Enables a probability distribution over topics for each document.
Sample Output:
BERTopic model trained with specified clustering and vectorization parameters.
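Once trained, individual topics can be inspected directly. get_topic returns the top words and their weights for a given topic id (the words shown in the comment are only illustrative):

# Top words and c-TF-IDF weights for topic 0
print(model.get_topic(0))  # e.g. [('space', 0.01), ('nasa', 0.008), ...]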
Chunk 7: Get and Visualize Topic Frequencies
Code:
# Retrieve topics and probabilities for each document
topics, probabilities = model.transform(docs, corpus_embeddings)

df_topic_freq = model.get_topic_freq()
print(df_topic_freq)

# Visualize topic frequencies in a bar chart
topics_count = len(df_topic_freq) - 1
model.visualize_barchart(top_n_topics=topics_count)
Explanation:
- get_topic_freq: Returns the frequency of each topic across all documents; subtracting 1 from its length excludes the outlier topic (-1).
- visualize_barchart: Displays bar charts of the top words for each topic.
Sample Output:
- DataFrame with topic frequencies, plus bar charts of each topic's top words.
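BERTopic also provides an intertopic distance map that complements the bar chart; a minimal sketch (the output filename is arbitrary):

# Two-dimensional map of topic similarity
fig_topics = model.visualize_topics()
fig_topics.write_html("./topics.html")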
Chunk 8: View Topic Information
Code:
# Get topic information
topic_info = model.get_topic_info()
print(topic_info)
Explanation:
- get_topic_info: Retrieves detailed information about each topic, including representative words.
Sample Output:
Topic information with top words for each topic.
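Since get_topic_info returns a pandas DataFrame, it can be filtered like any other; for example, dropping the outlier topic (-1) before reading off the largest topics:

# Exclude the outlier topic and show the ten largest topics
print(topic_info[topic_info["Topic"] != -1].head(10))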
Chunk 9: Document-Level Visualization
Code:
# Visualize document clusters based on topics
fig = model.visualize_documents(
    docs,
    embeddings=corpus_embeddings,
    sample=0.6,
    topics=[0, 1, 2, 3, 4, 5, 6],
    hide_annotations=False,
    hide_document_hover=True
)
fig.write_image("./clusters.svg")
Explanation:
- visualize_documents: Creates a scatter plot with documents grouped by topic.
- Parameters:
  - sample=0.6: Uses a 60% sample of documents for visualization.
  - topics=[0, 1, 2, ...]: Restricts the plot to the specified topics.
  - hide_document_hover=True: Disables hover text to keep the figure lightweight.
Sample Output:
- Cluster plot saved as clusters.svg.
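Note that Plotly's write_image requires the kaleido package; if static export fails, installing it or falling back to HTML output are both options:

# Static image export depends on kaleido
# !pip install kaleido
fig.write_html("./clusters.html")  # HTML export needs no extra dependency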
Chunk 10: Topic Bar Chart Visualization
Code:
# Create a bar chart for topic visualization
fig2 = model.visualize_barchart()
fig2.write_image("./barchart.svg")
Explanation:
- visualize_barchart: Renders bar charts of each topic's most relevant words and their scores.
Sample Output:
- Bar chart saved as barchart.svg.
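The chart can be trimmed with parameters such as top_n_topics and n_words (the values below are arbitrary):

# Show only the 8 largest topics with 5 words each
fig2 = model.visualize_barchart(top_n_topics=8, n_words=5)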
Chunk 11: Predict Topics for New Documents
Code:
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")

new_docs = ["I'm looking for a new graphics card", "When is the next NASA launch?"]
embeddings = sentence_model.encode(new_docs)

topics, probs = model.transform(new_docs, embeddings)
print("Topics:", topics)
print("Probabilities:", probs)
Explanation:
- Predict New Topics: Encodes new documents and predicts their topic associations.
- Parameters:
  - new_docs: List of new document texts to analyze.
Sample Output:
Topics: [topic_number, ...]
Probabilities: [probability_distribution]
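To make the predictions readable, the returned topic ids can be mapped back to their top words with get_topic (output will vary by run):

# Translate topic ids into their top words
for doc, topic in zip(new_docs, topics):
    words = [word for word, _ in model.get_topic(topic)] if topic != -1 else ["outlier"]
    print(f"{doc!r} -> topic {topic}: {words[:5]}")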
Summary of Code Workflow
- BERTopic is trained to find clusters (topics) in text documents, which are then visualized and examined.
- Embeddings are generated by SentenceTransformer and drive the topic assignments.
- Visualization tools give insight into document-topic relationships and the most frequent topics.
This code provides an entire topic modeling pipeline using BERTopic, from loading text data to predicting topics for new data.