πŸš€ My First NLP Project: Topic Modeling 63,000+ Research Papers with BERTopic + GPT-4o-mini

Khushi Dubey
4 min read

Hey there! πŸ‘‹ I'm super excited to share my first blog post and even more excited that it's about my very first NLP project. I dove deep into topic modeling using BERTopic, and ended up analyzing over 63,000 research papers from arXiv (yes, sixty-three thousand!!).

This blog post is a walk-through of what I built, why I built it, and what cool insights I found along the way. Let’s dive in!


πŸ” What I Worked On

I applied BERTopic on the neuralwork/arxiver dataset of over 63,000 research papers to discover and label meaningful topics using embeddings and clustering.

Here’s what I did:

  • βœ… Preprocessed the abstracts

  • βœ… Ran BERTopic on them

  • βœ… Used GPT-4o-mini to assign descriptive topic names

  • βœ… Saved outputs (topics, names, visualizations) to avoid recomputation

  • βœ… Analyzed trends like top topics and paper counts over time

I learned a lot about how BERTopic clusters papers based on semantic meaning and how GPT-generated names can make topics more intuitive.
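The preprocessing step deserves a quick illustration. The exact cleaning rules I used live in the notebook; below is a minimal sketch of the idea, assuming light-touch cleaning (BERTopic embeds whole sentences, so you strip markup noise rather than stopwords) β€” the `clean_abstract` helper and the LaTeX-stripping regex are illustrative, not the notebook's exact code:

```python
import re

def clean_abstract(text: str) -> str:
    """Light-touch cleaning: BERTopic works on raw sentences,
    so we only strip markup noise, not stopwords."""
    text = re.sub(r"\$[^$]*\$", " ", text)  # drop inline LaTeX math
    text = re.sub(r"\s+", " ", text)        # collapse newlines/whitespace
    return text.strip()

cleaned = clean_abstract("We study  black holes\nwith mass $M > 10$ solar masses.")
print(cleaned)
```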


πŸ—‚ Dataset at a Glance

  • πŸ“¦ Dataset: neuralwork/arxiver

  • πŸ“„ Fields: Title, Abstract, Authors, Date, Markdown, Link

  • πŸ“Š Size: 63,357 papers from Sept 2022 to Oct 2023


🧠 Project Workflow (Modular & Reusable!)

To keep things clean, fast, and reusable, I split the project into 3 neat parts:

Part 1: Topic Modeling Pipeline + Caching

  • Loaded the neuralwork/arxiver dataset

  • Preprocessed the abstracts

  • Ran BERTopic to generate topics

  • Used GPT-4o-mini to give each topic a meaningful name

  • Saved everything to a CSV (so I don’t have to rerun the heavy stuff)

πŸ“Œ This step is compute-heavy but only needs to be done once β€” super helpful if you want to experiment later.
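The caching pattern is simple but worth showing. Here's a minimal sketch, assuming the expensive work (BERTopic + GPT naming) is wrapped in a function and the results fit in a CSV β€” `load_or_compute_topics` and `compute_fn` are hypothetical names, not the notebook's exact code:

```python
import os
import pandas as pd

def load_or_compute_topics(cache_path: str, compute_fn) -> pd.DataFrame:
    """Reuse cached topic assignments if present; otherwise run the
    expensive pipeline once and persist the result."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    df = compute_fn()                  # e.g. runs BERTopic + GPT naming
    df.to_csv(cache_path, index=False)
    return df
```

On the second call, the function returns straight from disk β€” which is what makes later experiments fast.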


Part 2: Building the Final Dataset

  • Fetched full paper metadata (title, authors, date, etc.)

  • Merged it with the topic modeling results

  • The final dataset includes:
    id | title | abstract | authors | published_date | link | markdown | topic | Topic Name
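The merge itself is a one-liner in pandas. A minimal sketch with toy data, assuming both frames share the paper `id` as a key (the two-row frames below are illustrative, not the real dataset):

```python
import pandas as pd

# Hypothetical stand-ins: `papers` holds the arxiver metadata,
# `topics` holds the cached BERTopic output, both keyed by paper id.
papers = pd.DataFrame({
    "id": ["2301.00001", "2301.00002"],
    "title": ["Paper A", "Paper B"],
    "published_date": ["2023-01-02", "2023-01-03"],
})
topics = pd.DataFrame({
    "id": ["2301.00001", "2301.00002"],
    "topic": [3, 7],
    "Topic Name": ["Quantum Phase Transitions", "Medical Imaging & Diagnosis"],
})

# A left join keeps every paper even if it has no topic assignment
final = papers.merge(topics, on="id", how="left")
print(final.columns.tolist())
```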


Part 3: Visualizations + Topic Insights

I then explored the data with fun, interactive, informative visualizations!

πŸ“Š Key Highlights:

  1. πŸ” Top 10 Topics by Paper Count

    • Most frequent: Astrophysics of Neutrinos and Black Holes (9561 papers!)

    • Least frequent: Renewable Energy and Grid Management (509 papers)

  2. πŸ“… Papers Published Per Month

    • Peak: May 2023 (4701 papers)

    • Low: Sept 2022 (1 paper πŸ˜…)

  3. πŸ“ˆ Monthly Trends for Top 5 Topics

Tracked trends for topics like Quantum Phase Transitions, Medical Imaging, and more!
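Both the per-month counts and the per-topic trends come down to a pandas groupby on the publication month. A minimal sketch with toy rows (the real data spans Sept 2022 to Oct 2023; the three-row frame here is just illustrative):

```python
import pandas as pd

# Toy stand-in for the merged dataset
final = pd.DataFrame({
    "published_date": ["2023-05-01", "2023-05-20", "2022-09-15"],
    "Topic Name": ["Quantum Phase Transitions",
                   "Medical Imaging & Diagnosis",
                   "Quantum Phase Transitions"],
})
final["month"] = pd.to_datetime(final["published_date"]).dt.to_period("M")

# Papers published per month
per_month = final.groupby("month").size()

# Monthly trend per topic (rows: month, columns: topic)
trend = final.groupby(["month", "Topic Name"]).size().unstack(fill_value=0)
print(per_month)
```

Calling `.plot()` on either result gives the bar chart and the trend lines directly.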


πŸ› οΈ Model Settings

Here’s what I used under the hood:

| Component | Config |
| --- | --- |
| Embedding | all-MiniLM-L6-v2 |
| UMAP | n_neighbors=10, min_dist=0.1 |
| HDBSCAN | min_cluster_size=60, min_samples=15 |
| Topic Naming | GPT-4o-mini (summarized top words into readable names) |
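Wired together, that configuration looks roughly like this β€” a sketch, not the notebook's exact cell, and the `random_state` is my addition for reproducibility, not a setting from the table above:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
umap_model = UMAP(n_neighbors=10, min_dist=0.1, random_state=42)  # random_state: my assumption
hdbscan_model = HDBSCAN(min_cluster_size=60, min_samples=15)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)
topics, probs = topic_model.fit_transform(abstracts)  # abstracts: list[str]
```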

πŸ“Š Visual Goodies

You can also explore the BERTopic visualizations in the Colab notebook.

  • Bar Chart

  • Heatmap

Intertopic Distance Map


πŸ§ͺ Sample Topic Analysis

| Topic Name | Peak Month | Trough Month | Total Papers |
| --- | --- | --- | --- |
| Astrophysics of Neutrinos and Black Holes | Jul 2023 | Oct 2022 | 9561 |
| Audio Recognition and Analysis | May 2023 | Dec 2022 | 1166 |
| Deep Neural Network Optimization | Oct 2023 | Dec 2022 | 787 |
| Medical Imaging & Diagnosis | Mar 2023 | Dec 2022 | 1412 |
| Quantum Phase Transitions | Mar 2023 | Nov 2022 | 7659 |

πŸ’‘ Why I Loved This Project

  • Fast Iteration: Thanks to caching, I could explore ideas quickly

  • Readable Results: GPT-4o made topics actually understandable

  • Emerging Trends: I saw how research areas evolve month-by-month

  • Easily Extendable: You can plug in more data, tweak models, or explore new fields πŸ”§


πŸ“ Want to Try It Out?

Check out the full Colab notebook and GitHub repo here:
πŸ‘‰ GitHub Repo
πŸ‘‰ Colab Notebook


Thanks for reading! I’m just starting out in NLP and AI, and this project taught me so much about pipelines, embeddings, visualizations, and model efficiency. Hope it inspired you to explore BERTopic too! 😊

If you have feedback or ideas to improve it β€” I’d love to hear from you!

