Build Your Own YouTube AI ChatBot Using LangChain, Python, and Vector DB – Beginner Friendly Guide!

Vishwajit Vm
4 min read

Have you ever watched a long YouTube video and wished you could just ask a question and get the answer from it? Or get a quick summary without watching it all?

Well, that's exactly what we're building today — a YouTube AI Assistant that can:

  • Understand a YouTube video through its captions,

  • Break the transcript into chunks,

  • Embed and store those chunks in a vector database,

  • Retrieve the relevant chunks and generate answers to your questions using AI!

And yes, we’ll use Python 🐍, LangChain 🧠, Google’s Gemini (via LangChain), and FAISS for vector storage!

🔗 Project GitHub Repo: https://github.com/vishwajitvm/YouTube-AI-ChatBot

🔧 Tools & Technologies Used

  • Streamlit: For building an interactive web app

  • LangChain: For chaining the LLM, retriever, and prompts

  • Google Gemini: The LLM that generates the answers

  • FAISS: Vector database to store and search document embeddings

  • youtube-transcript-api: To fetch YouTube captions

  • python-dotenv: To manage environment variables
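
For reference, the snippets below assume imports along these lines (package names follow LangChain's current split packages; adjust the paths if your installed versions differ):

from youtube_transcript_api import YouTubeTranscriptApi
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate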

💡 What We’ll Cover (LangChain in 4 Steps)

We’ll break this project into four core LangChain steps:

  1. Indexing: Extracting transcript, chunking it, embedding it, and storing in FAISS DB.

  2. Retrieval: Fetching relevant chunks from DB based on the user's query.

  3. Augmentation: Adding context to the user query using retrieved documents.

  4. Generation: Asking the LLM to respond using that context.

🚀 Step-by-Step Walkthrough

🎬 Step 1: Indexing - Making the Video "Searchable"

Goal: Convert video transcript into a format that can be queried.

1.1. Extract Transcript

transcript_list = YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])
transcript = " ".join(chunk["text"] for chunk in transcript_list)

We extract the auto-generated or uploaded captions from YouTube using the video ID.
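
If you only have the full URL, a small helper (hypothetical, not part of the repo) can pull out the video ID first:

from urllib.parse import urlparse, parse_qs

def extract_video_id(url: str) -> str:
    # Handles both youtu.be short links and standard watch URLs
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")
    return parse_qs(parsed.query)["v"][0]

video_id = extract_video_id("https://www.youtube.com/watch?v=HAnw168huqA")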

1.2. Chunk the Text

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.create_documents([transcript])

Why chunking? Long transcripts are too big for a model to process in one go, so we break them into smaller, overlapping pieces; the overlap keeps context from being lost at chunk boundaries.
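
A quick sanity check shows what the splitter produced:

print(len(chunks))                    # number of Document chunks created
print(chunks[0].page_content[:200])   # first 200 characters of the first chunk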

1.3. Generate Embeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")  # use any other embedding model if you like

Each chunk is converted into a numerical format (embedding) using Google’s Gemini embedding model — this lets us compare chunks mathematically!
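
To get a feel for what an embedding is, you can embed a single string yourself; each text becomes a fixed-length vector of floats (768 dimensions for embedding-001, at the time of writing):

vector = embeddings.embed_query("What is LangChain?")
print(len(vector))   # dimensionality of the embedding
print(vector[:5])    # first few values of the vector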

1.4. Store in Vector DB (FAISS)

vector_store = FAISS.from_documents(chunks, embeddings)

These embeddings are stored in FAISS, a fast vector search library. Now our video content is indexed and ready to be searched!
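
If you don't want to re-index the same video on every run, the index can be persisted to disk (the allow_dangerous_deserialization flag applies to recent LangChain versions):

vector_store.save_local("faiss_index")

# Later, reload it with the same embedding model
vector_store = FAISS.load_local(
    "faiss_index", embeddings, allow_dangerous_deserialization=True
)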


🔍 Step 2: Retrieval - Finding Relevant Info

Goal: When a user asks a question, find relevant chunks from the transcript.

retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 4})
retrieved_docs = retriever.invoke(question)

Here’s what’s happening:

  • User types a question.

  • We convert that question into an embedding.

  • We compare it against the stored video chunks in FAISS.

  • We return the top k=4 most similar chunks.
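
If you're curious how close each hit actually is, the vector store can also return distance scores (with FAISS's default metric, lower means more similar):

for doc, score in vector_store.similarity_search_with_score(question, k=4):
    print(round(score, 3), doc.page_content[:80])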


🧩 Step 3: Augmentation - Combine Context & Query

Goal: Combine retrieved transcript chunks with the user's question into a prompt.

prompt = PromptTemplate(
    template="""
    You are a helpful assistant.
    Answer ONLY from the provided transcript context.
    If the context is insufficient, just say you don't know.

    {context}
    Question: {question}
    """,
    input_variables=["context", "question"],
)

We create a prompt that tells the LLM to:

  • Stick to the transcript context,

  • Admit when the context is insufficient instead of hallucinating,

  • Act as a helpful assistant.

context_text = "\n\n".join(doc.page_content for doc in retrieved_docs)
final_prompt = prompt.invoke({"context": context_text, "question": question})
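
It's worth printing the assembled prompt once to see exactly what the model will receive:

print(final_prompt.to_string())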

💬 Step 4: Generation - Let the LLM Answer

Goal: Use Google Gemini to generate the answer using the provided context.

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0.7)
answer = llm.invoke(final_prompt)

Voilà! You now have an AI that can answer questions about a YouTube video! 🎉
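
In the actual project, all of this sits behind a Streamlit UI. Here's a stripped-down sketch of the wiring (answer_question is a hypothetical helper that wraps Steps 1-4; the repo's layout will differ):

import streamlit as st

st.title("YouTube AI ChatBot")
url = st.text_input("YouTube video URL")
question = st.text_input("Ask a question about the video")

if url and question:
    # answer_question is a hypothetical helper wrapping Steps 1-4 above
    answer = answer_question(url, question)
    st.write(answer)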


📺 Bonus: Summarizing the Entire Video

You can also summarize the video using a similar technique:

summary = main_chain.invoke("Can you summarize the video")

Here, instead of asking a custom question, we send a fixed one like "Can you summarize the video" through the same retrieve-augment-generate pipeline; the retrieved chunks become the context for the summary. (main_chain is simply Steps 2-4 wired together; see the sketch below.)
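
One common way to wire main_chain up with LangChain's runnable composition, reusing the retriever, prompt, and llm from earlier (a sketch, not necessarily the exact chain in the repo), looks like this:

from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

def format_docs(docs):
    # Join the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

main_chain = (
    RunnableParallel({
        "context": retriever | RunnableLambda(format_docs),
        "question": RunnablePassthrough(),
    })
    | prompt
    | llm
    | StrOutputParser()
)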


🧪 Sample Output

Input Video: https://www.youtube.com/watch?v=HAnw168huqA
Sample Question: "What is the main topic discussed?"
Output: "The video discusses how LangChain helps build context-aware AI applications using modular components..."

🧠 Concepts You Learned

✅ Chunking text with overlap
✅ Generating embeddings
✅ Storing in a vector database (FAISS)
✅ Retrieving similar chunks based on query
✅ Contextual prompting with LangChain
✅ Generating responses using Google Gemini LLM

🛠 Try It Yourself

Fork and run the project from GitHub 👇
🔗 https://github.com/vishwajitvm/YouTube-AI-ChatBot

⚠️ Make sure your .env file contains your Google Generative AI API key and that your system has Python 3.10+ installed.
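
A quick way to confirm the key is being picked up (python-dotenv reads the .env file, and langchain-google-genai looks for GOOGLE_API_KEY):

import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root
assert os.getenv("GOOGLE_API_KEY"), "GOOGLE_API_KEY is missing from your .env file"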


📌 Final Thoughts

This project is a great way to learn:

  • How LLMs work behind the scenes.

  • How vector DBs make documents searchable.

  • How prompt engineering drives results.

Let me know if you'd like a Part 2 where we add voice support or multi-video context!


If you enjoyed this, hit 💙, share, and follow me for more hands-on AI projects!
