🧠 Summarizing YouTube with AI: Building a Transcript-to-Insight Tool Using LLMs

So while I was working on some NLP side projects, I had this thought — YouTube videos are packed with great information, but most of the time they’re way too long. Even when someone is explaining something valuable, it takes 15–20 minutes of watching to get to the main points. So I wanted to build something that could take a video, read its transcript, and give you a clean summary — just like we do when we take notes.
That’s when I started building a YouTube Transcript Summarizer — a tool that extracts the full transcript of a video, cleans it up, and runs it through a Large Language Model to generate a short summary.
🧩 What I Was Trying to Solve
People consume hours of video content — podcasts, tutorials, interviews — but they often just want a few bullet points or a quick gist. And with LLMs getting better at understanding language, I figured this was a good time to test how well they can do structured summarization from real-world noisy data (like YouTube transcripts).
The bigger idea behind this is also about understanding intent — what’s the main message? What is the tone? What kind of speaker is this? These are the kinds of things that apply even in email security or behavioral detection, just in a different form.
🔧 What I Used to Build It
Python
Streamlit – to build a lightweight interactive frontend
YouTube Transcript API – to extract subtitles automatically
Google Gemini AI (and tested on GPT-based models too) – for generating the summaries
dotenv – to manage API keys cleanly
🛠️ How I Built It
Here’s the flow I designed:
1️⃣ Input
User enters a YouTube video link into the Streamlit app
It fetches the transcript using the YouTube Transcript API
If the video has no transcript available, the app shows a warning instead of failing
2️⃣ Preprocessing
Removed filler lines, timestamps, and broken text from the transcript
Cleaned up extra spaces, non-verbal cues like “[laughs]”, etc.
Formatted everything into a readable chunk of text
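The cleanup above can be done with a couple of regex passes — a minimal sketch, where the bracketed cues removed (like “[laughs]” or “[Music]”) are examples rather than an exhaustive list:

```python
import re


def clean_transcript(raw: str) -> str:
    """Strip non-verbal cues and normalize whitespace in a raw transcript."""
    text = re.sub(r"\[[^\]]*\]", " ", raw)  # bracketed cues like "[laughs]", "[Music]"
    text = re.sub(r"\s+", " ", text)        # collapse newlines and repeated spaces
    return text.strip()
```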
3️⃣ Summarization (Main Part)
Fed the cleaned transcript to Google Gemini AI using an API call
I also experimented with extractive and abstractive summarization (using models like T5 and BART) before settling on Gemini for the demo
Generated concise summaries, either in paragraph or bullet-point format
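The summarization call can be sketched with the google-generativeai SDK; the model name, prompt wording, and the `GOOGLE_API_KEY` variable name are assumptions for illustration (the post loads its key via dotenv):

```python
import os


def build_prompt(transcript: str, style: str = "bullets") -> str:
    """Assemble the summarization prompt; style is 'bullets' or 'paragraph'."""
    form = "5-8 concise bullet points" if style == "bullets" else "a short paragraph"
    return (
        "Summarize the following YouTube transcript as "
        f"{form}, keeping only the main ideas:\n\n{transcript}"
    )


def summarize(transcript: str, style: str = "bullets") -> str:
    """Send the cleaned transcript to Gemini and return the summary text."""
    # Third-party SDK: pip install google-generativeai
    import google.generativeai as genai

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # assumed env var name
    model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model choice
    response = model.generate_content(build_prompt(transcript, style))
    return response.text
```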
4️⃣ Output
Displayed the summary on the Streamlit page
Future plan: let users download the summary or export to Notion/Google Docs
⚙️ What It Looks Like (UI)
The Streamlit interface is minimal:
Just a link input box and a "Summarize" button
It shows a loading bar while it fetches and processes
Final summary gets shown with a clean design
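The whole page can be wired up in a handful of Streamlit calls — a sketch, with the three pipeline steps passed in as callables (`fetch`, `clean`, `summarize` are placeholder names) and `st.spinner` standing in for the loading indicator:

```python
import re


def looks_like_youtube_link(url: str) -> bool:
    """Cheap sanity check before hitting the transcript API."""
    return bool(re.search(r"(youtube\.com/watch\?v=|youtu\.be/)", url))


def render_app(fetch, clean, summarize) -> None:
    """Streamlit page: a link box, a Summarize button, a spinner, the result."""
    # Third-party dependency: pip install streamlit
    import streamlit as st

    st.title("YouTube Transcript Summarizer")
    url = st.text_input("YouTube video link")
    if st.button("Summarize"):
        if not looks_like_youtube_link(url):
            st.warning("That doesn't look like a YouTube link.")
            return
        with st.spinner("Fetching transcript and summarizing..."):
            transcript = fetch(url)
        if transcript is None:
            st.warning("This video has no transcript available.")
        else:
            st.subheader("Summary")
            st.write(summarize(clean(transcript)))
```

Injecting the pipeline steps keeps the UI sketch independent of any particular transcript or LLM client, so each piece can be swapped or tested on its own.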
🚀 Future Ideas
I'm planning to add:
Multi-language support (especially for Hindi, Spanish)
Tone analysis — like is the speaker excited, serious, sarcastic?
Speaker breakdown if multiple people are in the video
Save summary history for the user locally
💡 What I Learned
Real-world data (like YouTube transcripts) is messy, and cleaning it is half the battle
LLMs can handle this well if you feed them clean, structured input
Having an actual frontend helps a lot — it’s one thing to write scripts, but another to let people interact with them directly
✉️ Want to Collaborate?
Still working on improving it, but if this kind of stuff excites you too — feel free to reach out. I’d love to discuss better ways to scale this.
Written by Khushal Jhaveri