πŸš€ My First NLP Project: Topic Modeling 63,000+ Research Papers with BERTopic + GPT-4o-mini

Khushi Dubey
4 min read

Hey there! πŸ‘‹ I'm super excited to share my first blog post and even more excited that it's about my very first NLP project. I dove deep into topic modeling using BERTopic, and ended up analyzing over 63,000 research papers from arXiv (yes, sixty-three thousand!!).

This blog post is a walk-through of what I built, why I built it, and what cool insights I found along the way. Let’s dive in!


πŸ” What I Worked On

I applied BERTopic on the neuralwork/arxiver dataset of over 63,000 research papers to discover and label meaningful topics using embeddings and clustering.

Here’s what I did:

  • βœ… Preprocessed the abstracts

  • βœ… Ran BERTopic on them

  • βœ… Used GPT-4o-mini to assign descriptive topic names

  • βœ… Saved outputs (topics, names, visualizations) to avoid recomputation

  • βœ… Analyzed trends like top topics and paper counts over time

I learned a lot about how BERTopic clusters papers based on semantic meaning and how GPT-generated names can make topics more intuitive.
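The preprocessing step deserves a quick illustration. The exact cleaning rules I used live in the notebook; below is a minimal sketch of the idea, assuming light-touch cleaning (BERTopic embeds whole sentences, so you strip markup noise rather than stopwords) β€” the `clean_abstract` helper and the LaTeX-stripping regex are illustrative, not the notebook's exact code:

```python
import re

def clean_abstract(text: str) -> str:
    """Light-touch cleaning: BERTopic works on raw sentences,
    so we only strip markup noise, not stopwords."""
    text = re.sub(r"\$[^$]*\$", " ", text)  # drop inline LaTeX math
    text = re.sub(r"\s+", " ", text)        # collapse newlines/whitespace
    return text.strip()

cleaned = clean_abstract("We study  black holes\nwith mass $M > 10$ solar masses.")
print(cleaned)
```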


πŸ—‚ Dataset at a Glance

  • πŸ“¦ Dataset: neuralwork/arxiver

  • πŸ“„ Fields: Title, Abstract, Authors, Date, Markdown, Link

  • πŸ“Š Size: 63,357 papers from Sept 2022 to Oct 2023


🧠 Project Workflow (Modular & Reusable!)

To keep things clean, fast, and reusable, I split the project into 3 neat parts:

Part 1: Topic Modeling Pipeline + Caching

  • Loaded the neuralwork/arxiver dataset

  • Preprocessed the abstracts

  • Ran BERTopic to generate topics

  • Used GPT-4o-mini to give each topic a meaningful name

  • Saved everything to a CSV (so I don’t have to rerun the heavy stuff)

πŸ“Œ This step is compute-heavy but only needs to be done once β€” super helpful if you want to experiment later.
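The caching pattern is simple but worth showing. Here's a minimal sketch, assuming the expensive work (BERTopic + GPT naming) is wrapped in a function and the results fit in a CSV β€” `load_or_compute_topics` and `compute_fn` are hypothetical names, not the notebook's exact code:

```python
import os
import pandas as pd

def load_or_compute_topics(cache_path: str, compute_fn) -> pd.DataFrame:
    """Reuse cached topic assignments if present; otherwise run the
    expensive pipeline once and persist the result."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    df = compute_fn()                  # e.g. runs BERTopic + GPT naming
    df.to_csv(cache_path, index=False)
    return df
```

On the second call, the function returns straight from disk β€” which is what makes later experiments fast.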


Part 2: Building the Final Dataset

  • Fetched full paper metadata (title, authors, date, etc.)

  • Merged it with the topic modeling results

  • The final dataset includes:
    id | title | abstract | authors | published_date | link | markdown | topic | Topic Name
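The merge itself is a one-liner in pandas. A minimal sketch with toy data, assuming both frames share the paper `id` as a key (the two-row frames below are illustrative, not the real dataset):

```python
import pandas as pd

# Hypothetical stand-ins: `papers` holds the arxiver metadata,
# `topics` holds the cached BERTopic output, both keyed by paper id.
papers = pd.DataFrame({
    "id": ["2301.00001", "2301.00002"],
    "title": ["Paper A", "Paper B"],
    "published_date": ["2023-01-02", "2023-01-03"],
})
topics = pd.DataFrame({
    "id": ["2301.00001", "2301.00002"],
    "topic": [3, 7],
    "Topic Name": ["Quantum Phase Transitions", "Medical Imaging & Diagnosis"],
})

# A left join keeps every paper even if it has no topic assignment
final = papers.merge(topics, on="id", how="left")
print(final.columns.tolist())
```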


Part 3: Visualizations + Topic Insights

I then explored the data with fun, interactive, informative visualizations!

πŸ“Š Key Highlights:

  1. πŸ” Top 10 Topics by Paper Count

    • Most frequent: Astrophysics of Neutrinos and Black Holes (9561 papers!)

    • Least frequent: Renewable Energy and Grid Management (509 papers)

  2. πŸ“… Papers Published Per Month

    • Peak: May 2023 (4701 papers)

    • Low: Sept 2022 (1 paper πŸ˜…)

  3. πŸ“ˆ Monthly Trends for Top 5 Topics

Tracked trends for topics like Quantum Phase Transitions, Medical Imaging, and more!
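Both the per-month counts and the per-topic trends come down to a pandas groupby on the publication month. A minimal sketch with toy rows (the real data spans Sept 2022 to Oct 2023; the three-row frame here is just illustrative):

```python
import pandas as pd

# Toy stand-in for the merged dataset
final = pd.DataFrame({
    "published_date": ["2023-05-01", "2023-05-20", "2022-09-15"],
    "Topic Name": ["Quantum Phase Transitions",
                   "Medical Imaging & Diagnosis",
                   "Quantum Phase Transitions"],
})
final["month"] = pd.to_datetime(final["published_date"]).dt.to_period("M")

# Papers published per month
per_month = final.groupby("month").size()

# Monthly trend per topic (rows: month, columns: topic)
trend = final.groupby(["month", "Topic Name"]).size().unstack(fill_value=0)
print(per_month)
```

Calling `.plot()` on either result gives the bar chart and the trend lines directly.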


πŸ› οΈ Model Settings

Here’s what I used under the hood:

| Component | Config |
| --- | --- |
| Embedding | all-MiniLM-L6-v2 |
| UMAP | n_neighbors=10, min_dist=0.1 |
| HDBSCAN | min_cluster_size=60, min_samples=15 |
| Topic Naming | GPT-4o-mini (summarized top words into readable names) |
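Wired together, that configuration looks roughly like this β€” a sketch, not the notebook's exact cell, and the `random_state` is my addition for reproducibility, not a setting from the table above:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
umap_model = UMAP(n_neighbors=10, min_dist=0.1, random_state=42)  # random_state: my assumption
hdbscan_model = HDBSCAN(min_cluster_size=60, min_samples=15)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)
topics, probs = topic_model.fit_transform(abstracts)  # abstracts: list[str]
```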

πŸ“Š Visual Goodies

You can also explore the BERTopic visualizations in the Colab notebook.

  • Bar Chart

  • Heatmap

Intertopic Distance Map


πŸ§ͺ Sample Topic Analysis

| Topic Name | Peak Month | Trough Month | Total Papers |
| --- | --- | --- | --- |
| Astrophysics of Neutrinos and Black Holes | Jul 2023 | Oct 2022 | 9561 |
| Audio Recognition and Analysis | May 2023 | Dec 2022 | 1166 |
| Deep Neural Network Optimization | Oct 2023 | Dec 2022 | 787 |
| Medical Imaging & Diagnosis | Mar 2023 | Dec 2022 | 1412 |
| Quantum Phase Transitions | Mar 2023 | Nov 2022 | 7659 |

πŸ’‘ Why I Loved This Project

  • Fast Iteration: Thanks to caching, I could explore ideas quickly

  • Readable Results: GPT-4o made topics actually understandable

  • Emerging Trends: I saw how research areas evolve month-by-month

  • Easily Extendable: You can plug in more data, tweak models, or explore new fields πŸ”§


πŸ“ Want to Try It Out?

Check out the full Colab notebook and GitHub repo here:
πŸ‘‰ GitHub Repo
πŸ‘‰ Colab Notebook


Thanks for reading! I’m just starting out in NLP and AI, and this project taught me so much about pipelines, embeddings, visualizations, and model efficiency. Hope it inspired you to explore BERTopic too! 😊

If you have feedback or ideas to improve it β€” I’d love to hear from you!

