🚀 DocuTube: A RAG-Based Chatbot with YouTube Summarization - Complete Process Flow


DocuTube is an intelligent platform that allows users to interact with documents and YouTube videos using AI.
With DocuTube, you can:
Upload a document and ask context-aware questions about it.
Paste a YouTube URL and instantly get a transcript-based summary.
Behind the scenes, DocuTube uses:
Retrieval-Augmented Generation (RAG) for document Q&A.
YouTube Data API v3 and youtube-transcript-api for video metadata and transcript extraction.
Gemini AI for generating natural, accurate answers and summaries.
Hugging Face sentence-transformers/all-MiniLM-L6-v2 for vector embeddings.
PostgreSQL to store every single history record (documents, chats, summaries, metadata).
The result: A secure, modern, and lightning-fast AI knowledge companion.
🛠 Tech Stack
| Component | Technology / Tool | Purpose |
| --- | --- | --- |
| Backend Framework | FastAPI | High-performance Python web framework for building the API backend. |
| Database | PostgreSQL + SQLAlchemy ORM | Stores structured data such as document metadata, chat history, and YouTube summaries. |
| Vector Database | Pinecone | Stores and retrieves vector embeddings for semantic search in the RAG pipeline. |
| LLM | Google Gemini 1.5 Flash | Large Language Model used for generating high-quality answers from retrieved context. |
| Embeddings | Hugging Face (sentence-transformers/all-MiniLM-L6-v2) | Converts text chunks into dense vector embeddings for semantic similarity. |
| Document Processing | LangChain | Handles document loading, text splitting, metadata enrichment, and chaining for the RAG flow. |
| YouTube Integration | YouTube Data API v3 + youtube-transcript-api | Fetches metadata and transcripts for YouTube summarization. |
🏗 System Architecture Overview
High-Level Flow
Document Upload — Files are validated, processed, chunked, embedded, and stored in Pinecone.
Question Answering (RAG) — Questions are matched against relevant document chunks in Pinecone, and Gemini generates an answer.
YouTube Summarization — Extract metadata & transcript, then summarize using Gemini.
Persistent Storage — Metadata, chat history, and summaries are stored in PostgreSQL.
📄 1. Document Uploading Process
When a user uploads a document (PDF, DOCX, PPTX, or another supported format), the system follows a structured pipeline that prepares the content for AI-based question answering, so documents are stored efficiently, kept searchable, and optimized for retrieval.
Step-by-Step Flow
1. File Upload & Validation
The user initiates the upload via the application interface.
The system temporarily stores the uploaded file in a secure directory.
File type and size are validated to ensure compatibility with supported formats.
If validation fails, the user is notified with an error message.
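As a minimal sketch of this validation step, here is what the route could look like in FastAPI. The allowed extensions, size limit, and temp directory are illustrative assumptions, not values from the project:

```python
import os
from fastapi import FastAPI, HTTPException, UploadFile

app = FastAPI()

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".pptx"}   # hypothetical whitelist
MAX_SIZE_BYTES = 25 * 1024 * 1024                 # hypothetical 25 MB cap
TEMP_DIR = "uploads/tmp"                          # hypothetical temp directory

@app.post("/upload")
async def upload(file: UploadFile):
    # Reject unsupported extensions before reading the body.
    ext = os.path.splitext(file.filename or "")[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise HTTPException(status_code=400, detail=f"Unsupported file type: {ext}")
    contents = await file.read()
    if len(contents) > MAX_SIZE_BYTES:
        raise HTTPException(status_code=413, detail="File too large")
    # Store the file temporarily for the processing pipeline.
    os.makedirs(TEMP_DIR, exist_ok=True)
    with open(os.path.join(TEMP_DIR, file.filename), "wb") as f:
        f.write(contents)
    return {"status": "stored", "filename": file.filename}
```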
2. Document Record Creation
A unique namespace is generated for the document. This namespace is used in the Pinecone vector database to keep different users' content separate.
Metadata such as file name, upload time, and document owner is stored in a PostgreSQL database for quick lookup.
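A sketch of what record creation might look like with SQLAlchemy; the table layout and namespace format are assumptions based only on the fields the post names:

```python
import uuid
from datetime import datetime, timezone
from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Document(Base):
    """Hypothetical schema covering the metadata mentioned above."""
    __tablename__ = "documents"
    id = Column(Integer, primary_key=True)
    file_name = Column(String, nullable=False)
    owner = Column(String, nullable=False)
    namespace = Column(String, unique=True, nullable=False)
    uploaded_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))

def make_namespace(owner: str) -> str:
    # One namespace per upload keeps each user's vectors isolated in Pinecone.
    return f"{owner}-{uuid.uuid4().hex}"
```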
3. Document Processing with LangChain
The document is read using LangChain's document loaders (e.g., PyMuPDFLoader, UnstructuredFileLoader).
Chunking & Splitting: The document is split into smaller, manageable text chunks using LangChain's RecursiveCharacterTextSplitter.
Chunk Size: 1000 characters
Chunk Overlap: 200 characters (ensures contextual continuity between chunks).
Each chunk is then converted into vector embeddings using a pre-configured embedding model (e.g., sentence-transformers/all-MiniLM-L6-v2 via LangChain's HuggingFaceEmbeddings).
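Putting these sub-steps together, a hedged sketch assuming the current split LangChain packages (langchain-community, langchain-text-splitters, langchain-huggingface); the file path is illustrative:

```python
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the uploaded file into LangChain Document objects.
docs = PyMuPDFLoader("uploads/tmp/report.pdf").load()

# Split into overlapping chunks (sizes measured in characters).
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed each chunk with the MiniLM sentence-transformer.
embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectors = embedder.embed_documents([c.page_content for c in chunks])
```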
4. Storage in Vector Database
The vectorized chunks, along with metadata, are stored in Pinecone for fast similarity search.
The original file remains accessible for download or re-processing if required.
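A sketch of the upsert with the v3+ Pinecone Python client, continuing the chunks, vectors, and namespace variables from the sketches above; the index name and metadata keys are assumptions:

```python
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("docutube")  # hypothetical index name

index.upsert(
    vectors=[
        {
            "id": f"chunk-{i}",
            "values": vec,
            # Storing the chunk text as metadata lets retrieval return it directly.
            "metadata": {"text": chunk.page_content, "source": "report.pdf"},
        }
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ],
    namespace=namespace,  # per-user/per-document isolation (see step 2)
)
```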
💬 2. Asking the LLM a Question
Once documents are processed and stored, the user can ask natural language questions. The system uses Retrieval-Augmented Generation (RAG) with LangChain to provide accurate, context-aware responses.
Flow for Question Answering
1. User Query Submission
- The user types a question into the chat interface.
2. Retrieval from Pinecone
The system performs a vector similarity search in Pinecone, retrieving the most relevant chunks from the user’s uploaded documents.
The top matching chunks are ranked based on similarity scores.
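A hedged sketch of the similarity search, reusing embedder, index, and namespace from the earlier sketches; top_k=5 is an assumed value:

```python
# Embed the question with the same model used at upload time,
# so query and chunk vectors live in the same space.
query_vec = embedder.embed_query("What are the main points?")

results = index.query(
    vector=query_vec,
    top_k=5,                  # number of chunks to retrieve; tune as needed
    namespace=namespace,
    include_metadata=True,    # return the stored chunk text with each match
)
context_chunks = [match.metadata["text"] for match in results.matches]
```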
3. Prompt Construction with LangChain
LangChain’s prompt templates are used to combine:
The retrieved chunks (context)
The user’s query
Instructions for the LLM to provide concise, fact-based answers.
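A sketch of such a template with langchain-core; the exact wording of the instructions is an assumption:

```python
from langchain_core.prompts import ChatPromptTemplate

rag_prompt = ChatPromptTemplate.from_template(
    """Answer the question using only the context below.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}

Answer concisely and factually."""
)

prompt_value = rag_prompt.invoke(
    {"context": "\n\n".join(context_chunks), "question": "What are the main points?"}
)
```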
4. LLM Response Generation
The constructed prompt is sent to the configured LLM (Gemini) using LangChain’s LLM wrapper.
The LLM generates a human-readable answer based on the provided context.
The answer is displayed in the chat interface along with optional citations from the retrieved document chunks.
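A sketch of the Gemini call through LangChain's wrapper; note that langchain-google-genai reads GOOGLE_API_KEY from the environment by default, while the post names the key GEMINI_API_KEY:

```python
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")
answer = llm.invoke(prompt_value)  # prompt_value from the template sketch above
print(answer.content)
```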
5. Chat History Storage
- Each interaction (question, retrieved context, and generated answer) is stored in PostgreSQL for conversation continuity and analytics.
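A possible shape for that table, extending the SQLAlchemy Base from the earlier sketch; all column names are assumptions:

```python
from datetime import datetime, timezone
from sqlalchemy import Column, DateTime, Integer, String, Text

class ChatMessage(Base):
    """Hypothetical chat-history schema."""
    __tablename__ = "chat_history"
    id = Column(Integer, primary_key=True)
    namespace = Column(String, nullable=False)  # ties the chat to a document
    question = Column(Text, nullable=False)
    context = Column(Text)                      # retrieved chunks, for auditing
    answer = Column(Text, nullable=False)
    created_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))
```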
▶️ 3. YouTube URL Processing – Complete Workflow
1. User Submission
The user enters a YouTube video URL into the frontend interface and submits it.
2. API Request from Frontend
The frontend sends the provided YouTube URL to the backend through an API request.
3. Backend URL Validation
The backend receives the request and validates that the URL is a properly formatted YouTube link.
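A minimal validation sketch covering the two most common URL shapes (watch?v= and youtu.be); real-world handling would cover more variants, as noted in the error-handling section below:

```python
import re

YOUTUBE_URL_RE = re.compile(
    r"^https?://(www\.)?(youtube\.com/watch\?v=|youtu\.be/)([A-Za-z0-9_-]{11})"
)

def extract_video_id(url: str) -> str | None:
    """Return the 11-character video ID, or None for a non-YouTube URL."""
    m = YOUTUBE_URL_RE.match(url)
    return m.group(3) if m else None
```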
4. Transcript & Metadata Retrieval
Using the YouTube Data API v3 (authenticated with the YOUTUBE_API_KEY from .env), the backend retrieves:
Transcript (if available)
Video metadata such as title, description, channel name, publish date, likes, views, and comments
Timestamps (if available)
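A sketch of both calls, assuming google-api-python-client for metadata and the pre-1.0 youtube-transcript-api interface for the transcript; the video ID is illustrative:

```python
import os
from googleapiclient.discovery import build
from youtube_transcript_api import YouTubeTranscriptApi

video_id = "dQw4w9WgXcQ"

# Metadata via the Data API, authenticated with YOUTUBE_API_KEY.
youtube = build("youtube", "v3", developerKey=os.environ["YOUTUBE_API_KEY"])
meta = youtube.videos().list(part="snippet,statistics", id=video_id).execute()
snippet = meta["items"][0]["snippet"]     # title, description, channel, publish date
stats = meta["items"][0]["statistics"]    # views, likes, comments

# Transcript entries carry "text", "start", and "duration" fields.
transcript = YouTubeTranscriptApi.get_transcript(video_id)
raw_text = " ".join(entry["text"] for entry in transcript)
```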
5. Transcript Preprocessing
The transcript undergoes cleaning and formatting, including:
Removing timestamps and unnecessary formatting
Eliminating special characters
Structuring the text for clarity and consistency
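A small cleaning sketch; the exact rules are assumptions that mirror the list above:

```python
import re

def clean_transcript(raw: str) -> str:
    # Drop bracketed cues such as [Music] or [Applause].
    text = re.sub(r"\[[^\]]*\]", " ", raw)
    # Strip stray special characters, keeping basic punctuation.
    text = re.sub(r"[^\w\s.,;:!?'\"-]", " ", text)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()
```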
6. Summarization with LLM
The cleaned transcript is sent to the Gemini LLM (via the GEMINI_API_KEY) to generate:
- A concise, human-readable summary highlighting the key points of the video
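A sketch with the google-generativeai SDK, using the GEMINI_API_KEY naming from the post; `text` stands for the cleaned transcript from the previous step and the prompt wording is an assumption:

```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

summary = model.generate_content(
    "Summarize the following YouTube transcript into a concise, "
    "human-readable overview of the key points:\n\n" + text
).text
```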
7. Response Construction
The backend compiles the final response containing:
The generated summary
Retrieved video metadata
(Optional) The full transcript for reference
8. Frontend Display
The frontend receives the structured response and displays:
The video’s summary
Metadata (title, channel, publish date, views, etc.)
Full transcript for detailed review
API Endpoints
The FastAPI application exposes these main endpoints:
| Endpoint | Method | Description |
| --- | --- | --- |
| /upload | POST | Upload a document for processing |
| /ask | POST | Ask a question about the current document |
| /summarize | POST | Get a summary of a YouTube video |
Example usage:
```bash
# Upload a document
curl -X POST -F "file=@report.pdf" http://localhost:8000/upload

# Ask a question
curl -X POST -H "Content-Type: application/json" \
  -d '{"question":"What are the main points?"}' \
  http://localhost:8000/ask

# Summarize a YouTube video
curl -X POST -H "Content-Type: application/json" \
  -d '{"url":"https://youtu.be/dQw4w9WgXcQ"}' \
  http://localhost:8000/summarize
```
Error Handling and Edge Cases
The system includes robust error handling:
Document Processing Errors
Invalid file types are rejected immediately
Processing failures are logged with detailed error messages
Pinecone Issues
Checks namespace stats before querying
Validates embedding dimensions
YouTube Specifics
Handles videos without transcripts
Falls back to basic video info when the API is unavailable
Manages various URL formats
Performance Considerations
Chunking Strategy
1000-character chunks with 200-character overlap balances context and precision
Recursive splitting preserves semantic boundaries
Embedding Model Choice
MiniLM provides good quality with lower dimensionality (faster searches)
Normalized embeddings for better similarity comparisons
Pinecone Optimization
Namespacing allows efficient document isolation
Metadata filtering enables complex queries
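As an illustration of metadata filtering, a hedged query sketch with the v3+ Pinecone client; the `source` key matches the assumed upsert metadata from the upload section:

```python
# Restrict retrieval to chunks from one file inside the user's namespace.
results = index.query(
    vector=query_vec,
    top_k=5,
    namespace=namespace,
    filter={"source": {"$eq": "report.pdf"}},
    include_metadata=True,
)
```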