🧠 Summarizing YouTube with AI: Building a Transcript-to-Insight Tool Using LLMs

Khushal Jhaveri
3 min read

So while I was working on some NLP side projects, I had this thought — YouTube videos are packed with great information, but most of the time they’re way too long. Even if someone’s explaining something valuable, it takes 15-20 minutes to get to the main points. So I wanted to try building something that could take a video, read its transcript, and give you a clean summary — just like we do when we take notes.

That’s when I started building a YouTube Transcript Summarizer — a tool that extracts the full transcript of a video, cleans it up, and runs it through a Large Language Model to generate a short summary.


🧩 What I Was Trying to Solve

People consume hours of video content — podcasts, tutorials, interviews — but they often just want a few bullet points or a quick gist. And with LLMs getting better at understanding language, I figured this was a good time to test how well they can do structured summarization from real-world noisy data (like YouTube transcripts).

The bigger idea behind this is also about understanding intent — what’s the main message? What is the tone? What kind of speaker is this? These are the kinds of things that apply even in email security or behavioral detection, just in a different form.


🔧 What I Used to Build It

  • Python

  • Streamlit – to build a lightweight interactive frontend

  • YouTube Transcript API – to extract subtitles automatically

  • Google Gemini AI (and tested on GPT-based models too) – for generating the summaries

  • dotenv – to manage API keys cleanly


🛠️ How I Built It

Here’s the flow I designed:

1️⃣ Input

  • User enters a YouTube video link into the Streamlit app

  • It fetches the transcript using the YouTube Transcript API

  • If the video has no transcript (e.g. captions are disabled), the app shows a warning instead of failing
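A minimal sketch of this input step, assuming the `youtube-transcript-api` package; the helper names (`extract_video_id`, `fetch_transcript`) are my own, and the exact exception classes may differ between library versions:

```python
import re


def extract_video_id(url: str):
    """Pull the 11-character video ID out of a YouTube URL."""
    match = re.search(r"(?:v=|youtu\.be/)([\w-]{11})", url)
    return match.group(1) if match else None


def fetch_transcript(url: str):
    """Return the full transcript text, or None if captions are unavailable."""
    # Imported lazily so the ID-parsing helper works without the dependency.
    from youtube_transcript_api import (
        YouTubeTranscriptApi,
        TranscriptsDisabled,
        NoTranscriptFound,
    )

    video_id = extract_video_id(url)
    if video_id is None:
        return None
    try:
        segments = YouTubeTranscriptApi.get_transcript(video_id)
    except (TranscriptsDisabled, NoTranscriptFound):
        return None  # the Streamlit app surfaces a warning in this case
    # Each segment is a dict with "text", "start", and "duration" keys.
    return " ".join(seg["text"] for seg in segments)
```

The ID is parsed locally so a malformed link can be rejected before any network call is made.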

2️⃣ Preprocessing

  • Removed filler lines, timestamps, and broken text from the transcript

  • Cleaned up extra spaces, non-verbal cues like “[laughs]”, etc.

  • Formatted everything into a readable chunk of text
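The cleanup step above can be sketched with a few regular expressions; the patterns here are illustrative, not the exact ones from the project:

```python
import re


def clean_transcript(raw: str) -> str:
    """Strip non-verbal cues, stray timestamps, and extra whitespace."""
    text = re.sub(r"\[[^\]]*\]", " ", raw)                      # cues like "[laughs]" or "[Music]"
    text = re.sub(r"\b\d{1,2}:\d{2}(?::\d{2})?\b", " ", text)   # timestamps like "1:23" or "01:23:45"
    text = re.sub(r"\s+", " ", text)                            # collapse newlines and repeated spaces
    return text.strip()
```

Running it on a messy snippet flattens everything into one readable chunk of text, ready for the model.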

3️⃣ Summarization (Main Part)

  • Fed the cleaned transcript to Google Gemini AI using an API call

  • I also experimented with abstractive models like T5 and BART (alongside simpler extractive baselines) before settling on Gemini for the demo

  • Generated concise summaries, either in paragraph or bullet-point format

4️⃣ Output

  • Displayed the summary on the Streamlit page

  • Future plan: let users download the summary or export to Notion/Google Docs


⚙️ What It Looks Like (UI)

The Streamlit interface is minimal:

  • Just a link input box and a "Summarize" button

  • It shows a loading bar while it fetches and processes

  • Final summary gets shown with a clean design


🚀 Future Ideas

I'm planning to add:

  • Multi-language support (especially for Hindi, Spanish)

  • Tone analysis — like is the speaker excited, serious, sarcastic?

  • Speaker breakdown if multiple people are in the video

  • Save summary history for the user locally


💡 What I Learned

  • Real-world data (like YouTube transcripts) is messy, and cleaning it is half the battle

  • LLMs can handle this well if you feed them clean, structured input

  • Having an actual frontend helps a lot — it’s one thing to write scripts, but another to let people interact with them directly


✉️ Want to Collaborate?

Still working on improving it, but if this kind of stuff excites you too — feel free to reach out. I’d love to discuss better ways to scale this.

📩 LinkedIn | 🔗 GitHub

