🚀 DocuTube: A RAG-Based Chatbot with YouTube Summarization - Complete Process Flow


DocuTube is an intelligent platform that allows users to interact with documents and YouTube videos using AI.
With DocuTube, you can:
Upload a document and ask context-aware questions about it.
Paste a YouTube URL and instantly get a transcript-based summary.
Behind the scenes, DocuTube uses:
Retrieval-Augmented Generation (RAG) for document Q&A.
YouTube Data API v3 and youtube-transcript-api for video metadata and transcript extraction.
Gemini AI for generating natural, accurate answers and summaries.
Hugging Face sentence-transformers/all-MiniLM-L6-v2 for vector embeddings.
PostgreSQL to store every single history record (documents, chats, summaries, metadata).
The result: A secure, modern, and lightning-fast AI knowledge companion.
🛠 Tech Stack
| Component | Technology / Tool | Purpose |
| --- | --- | --- |
| Backend Framework | FastAPI | High-performance Python web framework for building the API backend. |
| Database | PostgreSQL + SQLAlchemy ORM | Stores structured data such as document metadata, chat history, and YouTube summaries. |
| Vector Database | Pinecone | Stores and retrieves vector embeddings for semantic search in the RAG pipeline. |
| LLM | Google Gemini 1.5 Flash | Large Language Model used for generating high-quality answers from retrieved context. |
| Embeddings | Hugging Face (sentence-transformers/all-MiniLM-L6-v2) | Converts text chunks into dense vector embeddings for semantic similarity. |
| Document Processing | LangChain | Handles document loading, text splitting, metadata enrichment, and chaining for the RAG flow. |
| YouTube Integration | YouTube Data API v3 + youtube-transcript-api | Fetches metadata and transcripts for YouTube summarization. |
🏗 System Architecture Overview
High-Level Flow
Document Upload — Files are validated, processed, chunked, embedded, and stored in Pinecone.
Question Answering (RAG) — Questions are matched against relevant document chunks in Pinecone, and Gemini generates an answer.
YouTube Summarization — Extract metadata & transcript, then summarize using Gemini.
Persistent Storage — Metadata, chat history, and summaries are stored in PostgreSQL.
📄 1. Document Uploading Process
When a user uploads a document (PDF, DOCX, PPTX, or another supported format), the system follows a structured pipeline that prepares the content for AI-based question answering, so documents are stored efficiently, kept searchable, and optimized for retrieval.
Step-by-Step Flow
1. File Upload & Validation
The user initiates the upload via the application interface.
The system temporarily stores the uploaded file in a secure directory.
File type and size are validated to ensure compatibility with supported formats.
If validation fails, the user is notified with an error message.
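As a minimal sketch of this validation step, here is what the route could look like in FastAPI. The allowed extensions, size limit, and temp directory are illustrative assumptions, not values from the project:

```python
import os
from fastapi import FastAPI, HTTPException, UploadFile

app = FastAPI()

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".pptx"}   # hypothetical whitelist
MAX_SIZE_BYTES = 25 * 1024 * 1024                 # hypothetical 25 MB cap
TEMP_DIR = "uploads/tmp"                          # hypothetical temp directory

@app.post("/upload")
async def upload(file: UploadFile):
    # Reject unsupported extensions before reading the body.
    ext = os.path.splitext(file.filename or "")[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise HTTPException(status_code=400, detail=f"Unsupported file type: {ext}")
    contents = await file.read()
    if len(contents) > MAX_SIZE_BYTES:
        raise HTTPException(status_code=413, detail="File too large")
    # Store the file temporarily for the processing pipeline.
    os.makedirs(TEMP_DIR, exist_ok=True)
    with open(os.path.join(TEMP_DIR, file.filename), "wb") as f:
        f.write(contents)
    return {"status": "stored", "filename": file.filename}
```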
2. Document Record Creation
A unique namespace is generated for the document. This namespace is used in the Pinecone vector database to keep different users' content separate.
Metadata such as file name, upload time, and document owner is stored in a PostgreSQL database for quick lookup.
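A sketch of what record creation might look like with SQLAlchemy; the table layout and namespace format are assumptions based only on the fields the post names:

```python
import uuid
from datetime import datetime, timezone
from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Document(Base):
    """Hypothetical schema covering the metadata mentioned above."""
    __tablename__ = "documents"
    id = Column(Integer, primary_key=True)
    file_name = Column(String, nullable=False)
    owner = Column(String, nullable=False)
    namespace = Column(String, unique=True, nullable=False)
    uploaded_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))

def make_namespace(owner: str) -> str:
    # One namespace per upload keeps each user's vectors isolated in Pinecone.
    return f"{owner}-{uuid.uuid4().hex}"
```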
3. Document Processing with LangChain
The document is read using LangChain's document loaders (e.g., PyMuPDFLoader, UnstructuredFileLoader).
Chunking & Splitting: The document is split into smaller, manageable text chunks using LangChain's RecursiveCharacterTextSplitter.
Chunk Size: 1000 characters
Chunk Overlap: 200 characters (ensures contextual continuity between chunks).
Each chunk is then converted into vector embeddings using a pre-configured embedding model (e.g., sentence-transformers/all-MiniLM-L6-v2 via LangChain's HuggingFaceEmbeddings).
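Putting these sub-steps together, a hedged sketch assuming the current split LangChain packages (langchain-community, langchain-text-splitters, langchain-huggingface); the file path is illustrative:

```python
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the uploaded file into LangChain Document objects.
docs = PyMuPDFLoader("uploads/tmp/report.pdf").load()

# Split into overlapping chunks (sizes measured in characters).
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed each chunk with the MiniLM sentence-transformer.
embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectors = embedder.embed_documents([c.page_content for c in chunks])
```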
4. Storage in Vector Database
The vectorized chunks, along with metadata, are stored in Pinecone for fast similarity search.
The original file remains accessible for download or re-processing if required.
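A sketch of the upsert with the v3+ Pinecone Python client, continuing the chunks, vectors, and namespace variables from the sketches above; the index name and metadata keys are assumptions:

```python
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("docutube")  # hypothetical index name

index.upsert(
    vectors=[
        {
            "id": f"chunk-{i}",
            "values": vec,
            # Storing the chunk text as metadata lets retrieval return it directly.
            "metadata": {"text": chunk.page_content, "source": "report.pdf"},
        }
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ],
    namespace=namespace,  # per-user/per-document isolation (see step 2)
)
```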
💬 2. Asking the LLM a Question
Once documents are processed and stored, the user can ask natural language questions. The system uses Retrieval-Augmented Generation (RAG) with LangChain to provide accurate, context-aware responses.
Flow for Question Answering
1. User Query Submission
- The user types a question into the chat interface.
2. Retrieval from Pinecone
The system performs a vector similarity search in Pinecone, retrieving the most relevant chunks from the user’s uploaded documents.
The top matching chunks are ranked based on similarity scores.
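A hedged sketch of the similarity search, reusing embedder, index, and namespace from the earlier sketches; top_k=5 is an assumed value:

```python
# Embed the question with the same model used at upload time,
# so query and chunk vectors live in the same space.
query_vec = embedder.embed_query("What are the main points?")

results = index.query(
    vector=query_vec,
    top_k=5,                  # number of chunks to retrieve; tune as needed
    namespace=namespace,
    include_metadata=True,    # return the stored chunk text with each match
)
context_chunks = [match.metadata["text"] for match in results.matches]
```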
3. Prompt Construction with LangChain
LangChain’s prompt templates are used to combine:
The retrieved chunks (context)
The user’s query
Instructions for the LLM to provide concise, fact-based answers.
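A sketch of such a template with langchain-core; the exact wording of the instructions is an assumption:

```python
from langchain_core.prompts import ChatPromptTemplate

rag_prompt = ChatPromptTemplate.from_template(
    """Answer the question using only the context below.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}

Answer concisely and factually."""
)

prompt_value = rag_prompt.invoke(
    {"context": "\n\n".join(context_chunks), "question": "What are the main points?"}
)
```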
4. LLM Response Generation
The constructed prompt is sent to the configured LLM (Gemini) using LangChain’s LLM wrapper.
The LLM generates a human-readable answer based on the provided context.
The answer is displayed in the chat interface along with optional citations from the retrieved document chunks.
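A sketch of the Gemini call through LangChain's wrapper; note that langchain-google-genai reads GOOGLE_API_KEY from the environment by default, while the post names the key GEMINI_API_KEY:

```python
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")
answer = llm.invoke(prompt_value)  # prompt_value from the template sketch above
print(answer.content)
```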
5. Chat History Storage
- Each interaction (question, retrieved context, and generated answer) is stored in PostgreSQL for conversation continuity and analytics.
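A possible shape for that table, extending the SQLAlchemy Base from the earlier sketch; all column names are assumptions:

```python
from datetime import datetime, timezone
from sqlalchemy import Column, DateTime, Integer, String, Text

class ChatMessage(Base):
    """Hypothetical chat-history schema."""
    __tablename__ = "chat_history"
    id = Column(Integer, primary_key=True)
    namespace = Column(String, nullable=False)  # ties the chat to a document
    question = Column(Text, nullable=False)
    context = Column(Text)                      # retrieved chunks, for auditing
    answer = Column(Text, nullable=False)
    created_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))
```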
▶️ 3. YouTube URL Processing – Complete Workflow
1. User Submission
The user enters a YouTube video URL into the frontend interface and submits it.
2. API Request from Frontend
The frontend sends the provided YouTube URL to the backend through an API request.
3. Backend URL Validation
The backend receives the request and validates that the URL is a properly formatted YouTube link.
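A minimal validation sketch covering the two most common URL shapes (watch?v= and youtu.be); real-world handling would cover more variants, as noted in the error-handling section below:

```python
import re

YOUTUBE_URL_RE = re.compile(
    r"^https?://(www\.)?(youtube\.com/watch\?v=|youtu\.be/)([A-Za-z0-9_-]{11})"
)

def extract_video_id(url: str) -> str | None:
    """Return the 11-character video ID, or None for a non-YouTube URL."""
    m = YOUTUBE_URL_RE.match(url)
    return m.group(3) if m else None
```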
4. Transcript & Metadata Retrieval
Using the YouTube Data API v3 (authenticated with the YOUTUBE_API_KEY from .env), the backend retrieves:
Transcript (if available)
Video metadata such as title, description, channel name, publish date, likes, views, and comments
Timestamps (if available)
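A sketch of both calls, assuming google-api-python-client for metadata and the pre-1.0 youtube-transcript-api interface for the transcript; the video ID is illustrative:

```python
import os
from googleapiclient.discovery import build
from youtube_transcript_api import YouTubeTranscriptApi

video_id = "dQw4w9WgXcQ"

# Metadata via the Data API, authenticated with YOUTUBE_API_KEY.
youtube = build("youtube", "v3", developerKey=os.environ["YOUTUBE_API_KEY"])
meta = youtube.videos().list(part="snippet,statistics", id=video_id).execute()
snippet = meta["items"][0]["snippet"]     # title, description, channel, publish date
stats = meta["items"][0]["statistics"]    # views, likes, comments

# Transcript entries carry "text", "start", and "duration" fields.
transcript = YouTubeTranscriptApi.get_transcript(video_id)
raw_text = " ".join(entry["text"] for entry in transcript)
```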
5. Transcript Preprocessing
The transcript undergoes cleaning and formatting, including:
Removing timestamps and unnecessary formatting
Eliminating special characters
Structuring the text for clarity and consistency
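A small cleaning sketch; the exact rules are assumptions that mirror the list above:

```python
import re

def clean_transcript(raw: str) -> str:
    # Drop bracketed cues such as [Music] or [Applause].
    text = re.sub(r"\[[^\]]*\]", " ", raw)
    # Strip stray special characters, keeping basic punctuation.
    text = re.sub(r"[^\w\s.,;:!?'\"-]", " ", text)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()
```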
6. Summarization with LLM
The cleaned transcript is sent to the Gemini LLM (via the GEMINI_API_KEY) to generate:
- A concise, human-readable summary highlighting the key points of the video
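A sketch with the google-generativeai SDK, using the GEMINI_API_KEY naming from the post; `text` stands for the cleaned transcript from the previous step and the prompt wording is an assumption:

```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

summary = model.generate_content(
    "Summarize the following YouTube transcript into a concise, "
    "human-readable overview of the key points:\n\n" + text
).text
```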
7. Response Construction
The backend compiles the final response containing:
The generated summary
Retrieved video metadata
(Optional) The full transcript for reference
8. Frontend Display
The frontend receives the structured response and displays:
The video’s summary
Metadata (title, channel, publish date, views, etc.)
Full transcript for detailed review
API Endpoints
The FastAPI application exposes these main endpoints:
| Endpoint | Method | Description |
| --- | --- | --- |
| /upload | POST | Upload a document for processing |
| /ask | POST | Ask a question about the current document |
| /summarize | POST | Get a summary of a YouTube video |
Example usage:
```bash
# Upload a document
curl -X POST -F "file=@report.pdf" http://localhost:8000/upload

# Ask a question
curl -X POST -H "Content-Type: application/json" \
  -d '{"question":"What are the main points?"}' \
  http://localhost:8000/ask

# Summarize a YouTube video
curl -X POST -H "Content-Type: application/json" \
  -d '{"url":"https://youtu.be/dQw4w9WgXcQ"}' \
  http://localhost:8000/summarize
```
Error Handling and Edge Cases
The system includes robust error handling:
Document Processing Errors
Invalid file types are rejected immediately
Processing failures are logged with detailed error messages
Pinecone Issues
Checks namespace stats before querying
Validates embedding dimensions
YouTube Specifics
Handles videos without transcripts
Falls back to basic video info when the API is unavailable
Manages various URL formats
Performance Considerations
Chunking Strategy
1000-character chunks with 200-character overlap balances context and precision
Recursive splitting preserves semantic boundaries
Embedding Model Choice
MiniLM provides good quality with lower dimensionality (faster searches)
Normalized embeddings for better similarity comparisons
Pinecone Optimization
Namespacing allows efficient document isolation
Metadata filtering enables complex queries
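As an illustration of metadata filtering, a hedged query sketch with the v3+ Pinecone client; the `source` key matches the assumed upsert metadata from the upload section:

```python
# Restrict retrieval to chunks from one file inside the user's namespace.
results = index.query(
    vector=query_vec,
    top_k=5,
    namespace=namespace,
    filter={"source": {"$eq": "report.pdf"}},
    include_metadata=True,
)
```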