π My First NLP Project: Topic Modeling 63,000+ Research Papers with BERTopic + GPT-4o-mini


Hey there! π I'm super excited to share my first blog post and even more excited that it's about my very first NLP project. I dove deep into topic modeling using BERTopic , and ended up analyzing over 63,000 research papers from arXiv (yes, sixty-three thousand!!).
This blog post is a walk-through of what I built, why I built it, and what cool insights I found along the way. Letβs dive in!
π What I Worked On
I applied BERTopic on the neuralwork/arxiver
dataset of over 63,000 research papers to discover and label meaningful topics using embeddings and clustering.
Hereβs what I did:
β Preprocessed the abstracts
β Ran BERTopic on them
β Used GPT-4o-mini to assign descriptive topic names
β Saved outputs (topics, names, visualizations) to avoid recomputation
β Analyzed trends like top topics and paper counts over time
I learned a lot about how BERTopic clusters papers based on semantic meaning and how GPT-generated names can make topics more intuitive.
π Dataset at a Glance
π¦ Dataset:
neuralwork/arxiver
π Fields: Title, Abstract, Authors, Date, Markdown, Link
π Size: 63,357 papers from Sept 2022 to Oct 2023
π§ Project Workflow (Modular & Reusable!)
To keep things clean, fast, and reusable, I split the project into 3 neat parts:
Part 1: Topic Modeling Pipeline + Caching
Loaded the neuralwork/arxiver dataset
Preprocessed the abstracts
Ran BERTopic to generate topics
Used GPT-4o-mini to give each topic a meaningful name
Saved everything to a CSV (so I donβt have to rerun the heavy stuff)
π This step is compute-heavy but only needs to be done once β super helpful if you want to experiment later.
Part 2: Building the Final Dataset
Fetched full paper metadata (title, authors, date, etc.)
Merged it with the topic modeling results
The final dataset includes:
id | title | abstract | authors | published_date | link | markdown | topic | Topic Name
Part 3: Visualizations + Topic Insights
I then explored the data with fun, interactive, informative visualizations!
π Key Highlights:
π Top 10 Topics by Paper Count
Most frequent: Astrophysics of Neutrinos and Black Holes (9561 papers!)
Least frequent: Renewable Energy and Grid Management (509 papers)
π Papers Published Per Month
Peak: May 2023 (4701 papers)
Low: Sept 2022 (1 paper π )
- π Monthly Trends for Top 5 Topics
Tracked trends for topics like Quantum Phase Transitions, Medical Imaging, and more!
π οΈ Model Settings
Hereβs what I used under the hood:
Component | Config |
Embedding | all-MiniLM-L6-v2 |
UMAP | n_neighbors=10 , min_dist=0.1 |
HDBSCAN | min_cluster_size=60 , min_samples=15 |
Topic Naming | GPT-4o-mini (summarized top words into readable names) |
π Visual Goodies
You can also explore the BERTopic visualizations in the Colab notebook.
Bar Chart
Heatmap
Intertopic Distance Map
π§ͺ Sample Topic Analysis
Topic Name | Peak Month | Trough Month | Total Papers |
Astrophysics of Neutrinos and Black Holes | Jul 2023 | Oct 2022 | 9561 |
Audio Recognition and Analysis | May 2023 | Dec 2022 | 1166 |
Deep Neural Network Optimization | Oct 2023 | Dec 2022 | 787 |
Medical Imaging & Diagnosis | Mar 2023 | Dec 2022 | 1412 |
Quantum Phase Transitions | Mar 2023 | Nov 2022 | 7659 |
π‘ Why I Loved This Project
Fast Iteration: Thanks to caching, I could explore ideas quickly
Readable Results: GPT-4o made topics actually understandable
Emerging Trends: I saw how research areas evolve month-by-month
Easily Extendable: You can plug in more data, tweak models, or explore new fields π§
π Want to Try It Out?
Check out the full Colab notebook and GitHub repo here:
π GitHub Repo
π Colab Notebook
Thanks for reading! Iβm just starting out in NLP and AI, and this project taught me so much about pipelines, embeddings, visualizations, and model efficiency. Hope it inspired you to explore BERTopic too! π
If you have feedback or ideas to improve it β Iβd love to hear from you!
Subscribe to my newsletter
Read articles from Khushi Dubey directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by

Khushi Dubey
Khushi Dubey
Building AI & tech stuff | Blogging the chaos :)