🚀 DocuTube: A RAG-Based Chatbot with YouTube Summarization - Complete Process Flow

Waqar Iqbal
6 min read

DocuTube is an intelligent platform that allows users to interact with documents and YouTube videos using AI.
With DocuTube, you can:

  • Upload a document and ask context-aware questions about it.

  • Paste a YouTube URL and instantly get a transcript-based summary.

Behind the scenes, DocuTube uses:

  • Retrieval-Augmented Generation (RAG) for document Q&A.

  • YouTube Data API v3 and youtube-transcript-api for video metadata and transcript extraction.

  • Gemini AI for generating natural, accurate answers and summaries.

  • Hugging Face sentence-transformers/all-MiniLM-L6-v2 for vector embeddings.

  • PostgreSQL to store every single history record (documents, chats, summaries, metadata).

The result: A secure, modern, and lightning-fast AI knowledge companion.

🛠 Tech Stack

| Component | Technology / Tool | Purpose |
| --- | --- | --- |
| Backend Framework | FastAPI | High-performance Python web framework for building the API backend. |
| Database | PostgreSQL + SQLAlchemy ORM | Stores structured data such as document metadata, chat history, and YouTube summaries. |
| Vector Database | Pinecone | Stores and retrieves vector embeddings for semantic search in the RAG pipeline. |
| LLM | Google Gemini 1.5 Flash | Large Language Model used for generating high-quality answers from retrieved context. |
| Embeddings | Hugging Face sentence-transformers/all-MiniLM-L6-v2 | Converts text chunks into dense vector embeddings for semantic similarity. |
| Document Processing | LangChain | Handles document loading, text splitting, metadata enrichment, and chaining for the RAG flow. |
| YouTube Integration | YouTube Data API v3 + youtube-transcript-api | Fetches metadata and transcripts for YouTube summarization. |

🏗 System Architecture Overview

High-Level Flow

  1. Document Upload — Files are validated, processed, chunked, embedded, and stored in Pinecone.

  2. Question Answering (RAG) — Questions are matched against relevant document chunks in Pinecone, and Gemini generates an answer.

  3. YouTube Summarization — Extract metadata & transcript, then summarize using Gemini.

  4. Persistent Storage — Metadata, chat history, and summaries are stored in PostgreSQL.

📄 1. Document Uploading Process

When a user uploads a document (PDF, DOCX, PPTX, or other supported formats), the system follows a structured pipeline to prepare the content for AI-based question answering. This ensures that documents are stored efficiently, searchable, and optimized for retrieval.

Step-by-Step Flow

1. File Upload & Validation

  • The user initiates the upload via the application interface.

  • The system temporarily stores the uploaded file in a secure directory.

  • File type and size are validated to ensure compatibility with supported formats.

  • If validation fails, the user is notified with an error message.
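
A minimal sketch of this validation step with FastAPI follows; the endpoint shape, size limit, and allowed extensions are illustrative assumptions rather than DocuTube's exact configuration:

```python
import os
import shutil
import tempfile

from fastapi import FastAPI, File, HTTPException, UploadFile

app = FastAPI()

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".pptx"}  # assumed supported formats
MAX_SIZE_BYTES = 20 * 1024 * 1024                # assumed 20 MB limit

@app.post("/upload")
async def upload_document(file: UploadFile = File(...)):
    # Reject unsupported file types before any processing happens.
    ext = os.path.splitext(file.filename or "")[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise HTTPException(status_code=400, detail=f"Unsupported file type: {ext}")

    # Stream the upload into a secure temporary directory, then enforce the size cap.
    tmp_dir = tempfile.mkdtemp(prefix="docutube_")
    tmp_path = os.path.join(tmp_dir, os.path.basename(file.filename))
    with open(tmp_path, "wb") as out:
        shutil.copyfileobj(file.file, out)
    if os.path.getsize(tmp_path) > MAX_SIZE_BYTES:
        os.remove(tmp_path)
        raise HTTPException(status_code=413, detail="File too large")

    return {"status": "accepted", "path": tmp_path}
```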

2. Document Record Creation

  • A unique namespace is generated for the document. This namespace is used in the Pinecone vector database to keep each user’s content separate.

  • Metadata such as file name, upload time, and document owner is stored in a PostgreSQL database for quick lookup.
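
A sketch of the document record, assuming a SQLAlchemy model; the table and column names are hypothetical:

```python
import uuid
from datetime import datetime, timezone

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Document(Base):
    __tablename__ = "documents"

    id = Column(Integer, primary_key=True)
    file_name = Column(String, nullable=False)
    owner_id = Column(String, nullable=False)
    uploaded_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))
    # Unique Pinecone namespace that isolates this document's vectors.
    namespace = Column(String, unique=True, nullable=False,
                       default=lambda: f"doc-{uuid.uuid4().hex}")
```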

3. Document Processing with LangChain

  • The document is read using LangChain’s document loaders (e.g., PyMuPDFLoader, UnstructuredFileLoader).

  • Chunking & Splitting:

    • The document is split into smaller, manageable text chunks using LangChain’s RecursiveCharacterTextSplitter.

    • Chunk size: 1000 characters

    • Chunk overlap: 200 characters (ensures contextual continuity between chunks).

  • Each chunk is then converted into vector embeddings using a pre-configured embedding model (e.g., sentence-transformers/all-MiniLM-L6-v2 via LangChain’s HuggingFaceEmbeddings).
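
Roughly, the loading, splitting, and embedding steps look like the sketch below (exact import paths vary across LangChain versions, and the file name is a placeholder):

```python
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the raw file into LangChain Document objects.
docs = PyMuPDFLoader("report.pdf").load()

# Split into overlapping chunks so each embedding keeps local context.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed each chunk with the MiniLM sentence-transformer (384-dimensional vectors).
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    encode_kwargs={"normalize_embeddings": True},
)
vectors = embeddings.embed_documents([c.page_content for c in chunks])
```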

4. Storage in Vector Database

  • The vectorized chunks, along with metadata, are stored in Pinecone for fast similarity search.

  • The original file remains accessible for download or re-processing if required.
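
With the langchain-pinecone integration, the upsert is a one-liner; the index name and namespace below are assumptions, and PINECONE_API_KEY must be set in the environment:

```python
from langchain_pinecone import PineconeVectorStore

# Upsert the embedded chunks (with their metadata) into the document's namespace.
vector_store = PineconeVectorStore.from_documents(
    documents=chunks,          # chunks from the splitting step above
    embedding=embeddings,      # the MiniLM embedding model
    index_name="docutube",     # assumed index name
    namespace="doc-1234abcd",  # per-document namespace from step 2
)
```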

💬 2. Asking the LLM a Question

Once documents are processed and stored, the user can ask natural language questions. The system uses Retrieval-Augmented Generation (RAG) with LangChain to provide accurate, context-aware responses.

Flow for Question Answering

1. User Query Submission

  • The user types a question into the chat interface.

2. Retrieval from Pinecone

  • The system performs a vector similarity search in Pinecone, retrieving the most relevant chunks from the user’s uploaded documents.

  • The top matching chunks are ranked based on similarity scores.
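
A sketch of this retrieval step, reusing the Pinecone vector store from the upload pipeline; the query and `k` value are illustrative:

```python
# Retrieve the k best-matching chunks, returned ranked by similarity score.
results = vector_store.similarity_search_with_score(
    "What are the main points?",
    k=4,
)
for chunk, score in results:
    print(f"{score:.3f}  {chunk.page_content[:80]}")
```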

3. Prompt Construction with LangChain

  • LangChain’s prompt templates are used to combine:

    • The retrieved chunks (context)

    • The user’s query

    • Instructions for the LLM to provide concise, fact-based answers.

4. LLM Response Generation

  • The constructed prompt is sent to the configured LLM (Gemini) using LangChain’s LLM wrapper.

  • The LLM generates a human-readable answer based on the provided context.

  • The answer is displayed in the chat interface along with optional citations from the retrieved document chunks.
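
Steps 3 and 4 together might look like the sketch below, assuming the langchain-google-genai wrapper; the prompt wording is illustrative rather than DocuTube's actual template, and `results` is the ranked list from the retrieval sketch above:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI

# Template that combines retrieved context, the user's query, and instructions.
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below. Be concise and "
    "factual, and say so if the context does not contain the answer.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

# Gemini 1.5 Flash via LangChain; reads GOOGLE_API_KEY from the environment.
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")

question = "What are the main points?"
context = "\n\n".join(chunk.page_content for chunk, _ in results)
answer = (prompt | llm).invoke({"context": context, "question": question})
print(answer.content)
```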

5. Chat History Storage

  • Each interaction (question, retrieved context, and generated answer) is stored in PostgreSQL for conversation continuity and analytics.

▶️ 3. YouTube URL Processing – Complete Workflow

1. User Submission

The user enters a YouTube video URL into the frontend interface and submits it.

2. API Request from Frontend

The frontend sends the provided YouTube URL to the backend through an API request.

3. Backend URL Validation

The backend receives the request and validates that the URL is a properly formatted YouTube link.
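
A minimal sketch of this check; the regex is an assumption that covers the common watch, short, and embed URL formats:

```python
import re

YOUTUBE_URL_RE = re.compile(
    r"(?:https?://)?(?:www\.)?"
    r"(?:youtube\.com/(?:watch\?v=|embed/)|youtu\.be/)"
    r"(?P<video_id>[A-Za-z0-9_-]{11})"
)

def extract_video_id(url: str) -> str | None:
    """Return the 11-character video ID, or None for a non-YouTube URL."""
    match = YOUTUBE_URL_RE.search(url)
    return match.group("video_id") if match else None

assert extract_video_id("https://youtu.be/dQw4w9WgXcQ") == "dQw4w9WgXcQ"
```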

4. Transcript & Metadata Retrieval

Using the YouTube Data API v3 (authenticated with the YOUTUBE_API_KEY from .env) together with youtube-transcript-api, the backend retrieves:

  • Transcript (if available)

  • Video metadata such as title, description, channel name, publish date, likes, views, and comments

  • Timestamps (if available)
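
In code, this step could look like the following, using google-api-python-client for the metadata and youtube-transcript-api for the transcript (shown with the classic get_transcript interface; newer releases of that library use an instance-based API):

```python
import os

from googleapiclient.discovery import build
from youtube_transcript_api import YouTubeTranscriptApi

video_id = "dQw4w9WgXcQ"

# Metadata (title, channel, publish date, view/like counts) via the Data API.
youtube = build("youtube", "v3", developerKey=os.environ["YOUTUBE_API_KEY"])
item = youtube.videos().list(part="snippet,statistics", id=video_id).execute()["items"][0]
metadata = {
    "title": item["snippet"]["title"],
    "channel": item["snippet"]["channelTitle"],
    "published_at": item["snippet"]["publishedAt"],
    "views": item["statistics"].get("viewCount"),
    "likes": item["statistics"].get("likeCount"),
}

# Transcript segments (text plus start timestamps) via youtube-transcript-api.
segments = YouTubeTranscriptApi.get_transcript(video_id)
transcript = " ".join(seg["text"] for seg in segments)
```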

5. Transcript Preprocessing

The transcript undergoes cleaning and formatting, including:

  • Removing timestamps and unnecessary formatting

  • Eliminating special characters

  • Structuring the text for clarity and consistency
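
A sketch of the cleanup; the exact regexes are assumptions about what counts as noise here:

```python
import re

def clean_transcript(raw: str) -> str:
    text = re.sub(r"\[(?:Music|Applause|Laughter)\]", " ", raw)  # drop caption cue tags
    text = re.sub(r"[^\w\s.,?!'-]", " ", text)                   # strip special characters
    return re.sub(r"\s+", " ", text).strip()                     # collapse whitespace
```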

6. Summarization with LLM

The cleaned transcript is sent to the Gemini LLM (via the GEMINI_API_KEY) to generate:

  • A concise, human-readable summary highlighting the key points of the video
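
The summarization call itself can reuse the same LangChain Gemini wrapper as the Q&A flow; the prompt wording here is an assumption:

```python
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")  # reads GOOGLE_API_KEY

def summarize_transcript(transcript: str) -> str:
    # Ask Gemini for a short, key-point summary of the cleaned transcript.
    prompt = (
        "Summarize the following YouTube transcript as a concise set of "
        "key points:\n\n" + transcript
    )
    return llm.invoke(prompt).content
```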

7. Response Construction

The backend compiles the final response containing:

  • The generated summary

  • Retrieved video metadata

  • (Optional) The full transcript for reference

8. Frontend Display

The frontend receives the structured response and displays:

  • The video’s summary

  • Metadata (title, channel, publish date, views, etc.)

  • Full transcript for detailed review

API Endpoints

The FastAPI application exposes these main endpoints:

| Endpoint | Method | Description |
| --- | --- | --- |
| /upload | POST | Upload a document for processing |
| /ask | POST | Ask a question about the current document |
| /summarize | POST | Get a summary of a YouTube video |

Example usage:

```bash
# Upload a document
curl -X POST -F "file=@report.pdf" http://localhost:8000/upload

# Ask a question
curl -X POST -H "Content-Type: application/json" \
  -d '{"question":"What are the main points?"}' \
  http://localhost:8000/ask

# Summarize a YouTube video
curl -X POST -H "Content-Type: application/json" \
  -d '{"url":"https://youtu.be/dQw4w9WgXcQ"}' \
  http://localhost:8000/summarize
```

Error Handling and Edge Cases

The system includes robust error handling:

  1. Document Processing Errors

    • Invalid file types are rejected immediately

    • Processing failures are logged with detailed error messages

  2. Pinecone Issues

    • Checks namespace stats before querying

    • Validates embedding dimensions

  3. YouTube Specifics

    • Handles videos without transcripts

    • Falls back to basic info when API unavailable

    • Manages various URL formats
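
For instance, the no-transcript fallback could be handled with youtube-transcript-api's exception types (the fallback payload below is illustrative):

```python
from youtube_transcript_api import (
    NoTranscriptFound,
    TranscriptsDisabled,
    YouTubeTranscriptApi,
)

def fetch_transcript_or_fallback(video_id: str, metadata: dict) -> dict:
    try:
        segments = YouTubeTranscriptApi.get_transcript(video_id)
        return {"transcript": " ".join(s["text"] for s in segments), **metadata}
    except (TranscriptsDisabled, NoTranscriptFound):
        # No captions available: fall back to basic video info only.
        return {"transcript": None, "note": "No transcript available", **metadata}
```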

Performance Considerations

  1. Chunking Strategy

    • 1000-character chunks with a 200-character overlap balance context and precision

    • Recursive splitting preserves semantic boundaries

  2. Embedding Model Choice

    • MiniLM provides good quality at a low dimensionality (384-dimensional vectors), which keeps searches fast

    • Normalized embeddings for better similarity comparisons

  3. Pinecone Optimization

    • Namespacing allows efficient document isolation

    • Metadata filtering enables complex queries
