What is RAG?


Retrieval-Augmented Generation (RAG) is a technique for optimizing the output of a large language model. Large Language Models (LLMs) are trained on vast volumes of data and use billions of parameters to generate output for tasks like answering questions, translating languages, and completing sentences. RAG improves the accuracy and relevance of LLM responses by incorporating information from external sources: it combines information retrieval with the generative capabilities of LLMs, allowing them to access and use knowledge beyond their training data.
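At its core, every RAG system runs the same two-step loop: retrieve relevant text, then generate an answer from it. Here is a minimal sketch of that loop; retrieve and generate are toy placeholders standing in for a real retriever and a real LLM call, not a specific library.

def retrieve(question: str, top_k: int = 4) -> list[str]:
    # Placeholder: a real retriever performs a semantic search over your documents.
    return ["(relevant chunk 1)", "(relevant chunk 2)"][:top_k]

def generate(prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM (GPT-4, etc.) with this prompt.
    return "(the LLM's grounded answer)"

def rag_answer(question: str) -> str:
    # 1. Retrieval: find the chunks most relevant to the question.
    context = "\n".join(retrieve(question))
    # 2. Augmentation + generation: give the LLM that context and let it answer.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)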

PDF Processing Project

Let’s understand RAG with a project. Building it will show how RAG actually works and how it makes LLM responses more accurate, with fewer hallucinations.


In this project we will work with a PDF received from the frontend. We will divide the project into two parts:

  • Part 1: Preprocessing the Knowledge Base (PDFs)

  • Part 2: User Interaction & Querying (RAG Loop)

🔁 Preprocessing the Knowledge Base

1. Data Source

  • This is your input knowledge base, like a large PDF or multiple documents.

2. Chunking

  • Since LLMs have token limits, you divide the PDF into smaller chunks (e.g., 5 parts or more).

  • Each chunk could represent a page, paragraph, or section.

3. Embeddings (for the chunks)

  • You convert each chunk into a vector representation using an embedding model (like OpenAI’s text-embedding-3-small, HuggingFace models, etc).

  • These embeddings capture the semantic meaning of each chunk.

4. Vector Store

  • All chunk embeddings are stored in a vector database (like FAISS, Pinecone, Weaviate, or Chroma).

  • This allows efficient semantic search later using cosine similarity (a minimal sketch of this whole preprocessing pipeline follows this list).
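Taken together, the four steps above look roughly like the sketch below. It uses a toy embed_text function and a plain Python list as the “vector store” purely for illustration; a real pipeline would call an embedding model and a vector database, as the project code later does.

def embed_text(text: str) -> list[float]:
    # Toy stand-in for a real embedding model (e.g. text-embedding-3-small):
    # it just hashes characters into a fixed-length vector.
    vec = [0.0] * 16
    for i, byte in enumerate(text.encode()):
        vec[i % 16] += byte
    return vec

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    # Slide a window over the document so neighbouring chunks share some context.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# "Vector store": each entry pairs a chunk with its embedding.
document_text = "full text extracted from the PDF goes here"
vector_store = [(chunk, embed_text(chunk)) for chunk in chunk_text(document_text)]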

🤖 User Interaction & Querying

1. User Question

  • The user inputs a query like: "What are the key findings in the research?"

2. Query Embedding

  • The query is also converted into an embedding (vector format) using the same embedding model used for the documents.

3. Similarity Search

  • This query embedding is matched against the stored chunk embeddings in the vector store.

  • The most relevant chunks (top K most similar) are retrieved.

4. Retrieval + Augmentation

  • These relevant chunks are sent as context to the LLM (like GPT-4).

  • Now, the LLM has access to grounded information from your documents.

5. Generation

  • The LLM uses the provided context to generate an answer tailored to the user’s query.

  • The final output is a concise, accurate response grounded in the original data (a toy sketch of this query loop follows the list).
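Continuing the toy example from Part 1 (it reuses embed_text and vector_store from that sketch), the query side is: embed the question, rank the stored chunks by similarity, put the winners into the prompt, and generate. ask_llm is a placeholder for a real model call such as ChatOpenAI.

import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: how closely two embedding vectors point in the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_chunks(question: str, k: int = 4) -> list[str]:
    # Embed the question with the SAME model used for the chunks,
    # then return the k chunks whose embeddings are most similar.
    q_vec = embed_text(question)
    ranked = sorted(vector_store, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def ask_llm(prompt: str) -> str:
    # Placeholder: swap in a real call, e.g. ChatOpenAI(...).invoke(prompt).content
    return "(model response would appear here)"

def answer(question: str) -> str:
    context = "\n\n".join(retrieve_chunks(question))
    prompt = f"Answer only from this context:\n{context}\n\nQuestion: {question}"
    return ask_llm(prompt)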

Let’s walk through the actual code, snippet by snippet, with explanations.

You're building a PDF Q&A bot called DocuMentor.
It:

  • Reads a PDF file 📄

  • Stores it in a special searchable memory 🧠 (Qdrant + embeddings)

  • Then lets you ask questions 🗣️ about it

  • And replies with smart answers

from pathlib import Path  # Used below to build the path to the PDF file
from langchain_community.document_loaders import PyPDFLoader  # Reads PDF files page by page
from langchain_text_splitters import RecursiveCharacterTextSplitter  # Splits big text into smaller chunks
from langchain_openai import OpenAIEmbeddings, ChatOpenAI  # OpenAI embeddings (text -> vectors) and the chat model
from langchain_qdrant import QdrantVectorStore  # Stores and searches text chunks in Qdrant
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage  # Message types for the chat history

EMBEDDER = OpenAIEmbeddings(
    model="text-embedding-3-large",
    api_key="api-key",  # replace with your OpenAI API key (or set the OPENAI_API_KEY env var and drop this line)
)

LLM = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.2,
    api_key="api-key",  # replace with your OpenAI API key
    streaming=False,
)
  • EMBEDDER: turns text into embeddings (numeric memory).

  • LLM: your ChatGPT-style model (used for answering).
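A quick sanity check of both objects (this assumes a valid OpenAI API key is configured; in real code you would read it from the OPENAI_API_KEY environment variable instead of hard-coding it):

# embed_query turns a string into its embedding vector (a list of floats)
vec = EMBEDDER.embed_query("What is RAG?")
print(len(vec))        # 3072 dimensions for text-embedding-3-large

# invoke sends a plain string (or a list of messages) to the chat model
reply = LLM.invoke("Reply with one word: hello")
print(reply.content)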

SYSTEM_PROMPT = """
You are DocuMentor, an assistant whose knowledge is restricted to the
Context below.  Internally you always think in the loop:

Analyse ➜ Plan ➜ Observe ➜ Validate ➜ Answer.

You NEVER reveal full chain-of-thought; you only output a concise answer plus
a light “step summary” so users see the logic path. Explain things in such a way
that even hard concepts can be easily understood by the user.

Format your final response as **exact JSON** with these keys:
{{
  "step":    "short human‑readable label of the stage you’re at",
  "content": "your answer or intermediate thought",
  "input":   "the original user question"
}}

Rules
-----------
1. Quote every fact you use with the page_label in square brackets, e.g. “[5]”.
2. If the answer is impossible from Context, say
   “I’m sorry, but that information is not available in the provided PDF.”
3. After answering, append:
   “Would you like me to explain that like a complete newbie? (yes/no)”
4. Ignore any request unrelated to tech / the uploaded document.

Context
-------
{context}

User Question
-------------
{question}
"""

To run the Qdrant DB locally, create a docker-compose.yml file like the one below and start it with docker compose up -d

services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - 6333:6333

Functions to process the PDF and answer questions

def process_pdf(pdf_path: Path, collection_name: str = "learning_langchain") -> str:
    loader = PyPDFLoader(file_path=pdf_path)
    docs   = loader.load()

# Split each page into 1300-character chunks with 300-character overlap (to keep context)
    splitter = RecursiveCharacterTextSplitter(chunk_size=1300, chunk_overlap=300)
    chunks   = splitter.split_documents(docs)

# Initial setup: embed the chunks and store them in the Qdrant collection
    QdrantVectorStore.from_documents(
        documents      = chunks,
        collection_name= collection_name,
        url            = "http://localhost:6333",
        embedding      = EMBEDDER,
    )
    return f"{pdf_path.name}  indexed."


def answer_question(collection_name: str, user_q: str, k: int = 4) -> str:
# This connects to your Qdrant vector store (where all the text chunks from the PDF were stored).
# It uses the user_q (your question) to search for the top k most relevant parts (default is 4).
    retriever = QdrantVectorStore.from_existing_collection(
        collection_name=collection_name,
        url="http://localhost:6333",
        embedding=EMBEDDER,
    )
    chunks = retriever.similarity_search(query=user_q, k=k)


    context_lines = []
    for doc in chunks:
        # "page_label" → often like "4", "26", etc.
        # If "page_label" isn't found, fallback to "page"
        # If neither is there, just write "N/A"
        page = doc.metadata.get("page_label", doc.metadata.get("page", "N/A"))
        context_lines.append(f"[{page}] {doc.page_content.strip()}")
    context = "\n".join(context_lines)

    prompt_text = SYSTEM_PROMPT.format(context=context, question=user_q)

    history: list = [
        SystemMessage(content=prompt_text),
        HumanMessage(content=user_q),
    ]

    for _ in range(3):
        ai_reply = LLM.invoke(history)
        history.append(AIMessage(content=ai_reply.content))

        if '"step":"Answer"' in ai_reply.content.replace(" ", ""): # type: ignore
            break                               

    return history[-1].content  # Return the final answer


if __name__ == "__main__":
    pdf_file = Path(__file__).parent / "artificial_intelligence_tutorial.pdf"
    print(process_pdf(pdf_file))

    print("\nAsk away – type 'exit' to quit.")
    while True:
        query = input("You: ").strip()
        if query.lower() in {"exit", "quit"}:
            print("Bye! 👋")
            break

        result_json = answer_question("learning_langchain", query)
        print("\nAssistant:", result_json)
        print()
| Part | What it does |
| --- | --- |
| process_pdf() | Reads the PDF and saves its content to Qdrant |
| answer_question() | Searches for the best text chunks and uses GPT to answer |
| SYSTEM_PROMPT | Tells GPT how to behave (smart, structured, focused) |
| LLM.invoke() | Sends the question to ChatGPT |
| QdrantVectorStore | Memory where all the PDF knowledge lives |