🧠 Understanding RAG (Retrieval-Augmented Generation) for Smarter LLMs

Shaim Khanusiya
3 min read

🤔 Why do we even need RAG?

Large Language Models (LLMs) like GPT, Gemini, and Claude are amazing, but they come with limitations:

  • They are trained on general data (mostly internet-based).

  • They don't know your business-specific or custom data (e.g., internal docs, product DB, PDFs).

So if you ask them:

"What's the refund policy in my internal employee handbook?"

The LLM will shrug (metaphorically 🤷) because it has never seen your PDF.

That's where RAG comes in. It augments LLMs with retrieved real-world data.


๐Ÿญ RAG Simplified

RAG = Retrieval + Generation

Or simply put:
๐Ÿ‘‰ "RAG is LLM ko haath paair dena" (RAG is helping the LLM with external context so it can actually give you meaningful answers.)

For example, if you want to chat with a specific PDF, the LLM alone can't do it. But RAG makes it possible 💡


๐Ÿ› ๏ธ Simple Example of RAG

Let's say we have this scenario:

You have 10 rows in a product database.

You can put those 10 rows along with the user's question into the prompt and ask the LLM to generate a smart response using those rows.

prompt = f"""
Here are the 10 rows of product info:
{db_rows}

User question: {user_query}
"""

But... ⚠️
There's a problem: the prompt/token size limit. What if you had 10,000 rows? You can't fit all of that into the prompt. That's when the real RAG magic starts 🔮
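Just to make the problem concrete, here's a rough back-of-the-envelope sketch (my own illustration, using the common "~4 characters ≈ 1 token" rule of thumb and made-up product rows):

# Rough sketch: estimate the token cost of stuffing 10,000 rows into one prompt.
# Assumes the common heuristic of roughly 4 characters per token for English text;
# the rows themselves are made up for illustration.

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # crude approximation, not a real tokenizer

rows = [f"id={i}, name=Product {i}, price={i * 10}" for i in range(10_000)]
prompt = "Here are the product rows:\n" + "\n".join(rows)

print(f"~{estimate_tokens(prompt):,} tokens")
# Tens of thousands of tokens even for these tiny made-up rows; real rows are far
# bigger, so the prompt quickly exceeds (or at least wastes) the model's context window.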


🚀 Overcoming the Token Size Limit

Instead of stuffing everything in the prompt, RAG works smartly:

✅ It converts your data into semantic vectors (embeddings)
✅ Then finds the most relevant data chunks for the user's query (see the sketch below)
✅ And gives only those chunks to the LLM
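Here's a tiny sketch of that retrieval idea in plain Python: embed the chunks and the query, then pick the chunk whose vector is closest to the query vector. It reuses the Gemini embedding model from the full example further down (so it assumes GOOGLE_API_KEY is set); the three chunks are made up.

import math
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Made-up chunks standing in for pieces of your real documents
chunks = [
    "Refunds are accepted within 30 days of purchase.",
    "Our office is open Monday to Friday, 9am to 6pm.",
    "Shipping takes 3-5 business days within India.",
]

embedding_model = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
chunk_vectors = embedding_model.embed_documents(chunks)   # one vector per chunk
query_vector = embedding_model.embed_query("What is the refund policy?")

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Keep only the chunk most similar to the query: this is the "R" in RAG
best_chunk, _ = max(zip(chunks, chunk_vectors), key=lambda pair: cosine(query_vector, pair[1]))
print(best_chunk)  # -> the refund-policy chunk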

Let's break it down with a PDF example.


๐Ÿ“ Example: Chat with PDF โ€” Two Approaches

🔴 Approach 1: Dump entire PDF into prompt

PDF → TEXT

USER_PROMPT + SYSTEM_PROMPT(TEXT) → LLM

Issue: If the PDF is large, the token limit will explode 💥
Not scalable, not efficient.
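For reference, Approach 1 looks roughly like this (a sketch only; it assumes pypdf is installed, a local sample.pdf, and the same Gemini setup used in the full example below):

# Approach 1 (naive): extract ALL text from the PDF and stuff it into one prompt.
# Fine for a tiny PDF, but a large one will blow past the model's context window.
from pypdf import PdfReader
from langchain_google_genai import ChatGoogleGenerativeAI  # needs GOOGLE_API_KEY set

reader = PdfReader("sample.pdf")
full_text = "\n".join(page.extract_text() or "" for page in reader.pages)

llm = ChatGoogleGenerativeAI(model="gemini-pro")
prompt = f"""Use the document below to answer the question.

Document:
{full_text}

Question: What is the refund policy?"""

print(llm.invoke(prompt).content)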


🟢 Approach 2: Send only the relevant chunks (semantic search)

PDF → Chunk1, Chunk2, Chunk3, ...

Relevant Chunk = Chunk2  ← (semantic search)

USER_PROMPT + SYSTEM_PROMPT(Chunk2) → LLM

Now, you only send the relevant chunk to the model!
Fewer tokens, more accurate answers. Win-win 🏆


🧠 Semantic Chunk Retrieval + Manual Prompt (Real RAG)

Want to see real RAG in action? Here's the core idea:
We'll use LangChain to:

  1. 🧾 Load and split the PDF

  2. ๐Ÿ” Store chunks in Qdrant

  3. 🧲 Retrieve relevant chunks based on user query

  4. 🧠 Pass only the relevant context to Gemini LLM

💡 Full Code Example:

import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Qdrant
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from qdrant_client import QdrantClient
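
# Prerequisites for this script:
#   - GOOGLE_API_KEY set in the environment (e.g. os.environ["GOOGLE_API_KEY"] = "<your key>"),
#     which the Gemini embedding and chat clients pick up automatically
#   - a local Qdrant instance running, e.g. `docker run -p 6333:6333 qdrant/qdrant`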

# === 1. Load and Chunk PDF ===
loader = PyPDFLoader("sample.pdf")
pages = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(pages)

# === 2. Generate Embeddings ===
embedding_model = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# === 3. Store in Qdrant ===
vectorstore = Qdrant.from_documents(
    documents=chunks,
    embedding=embedding_model,
    url="http://localhost:6333",
    collection_name="rag_chunks"
)

# === 4. Retrieve Relevant Chunks ===
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})
query = "What is the refund policy?"
relevant_docs = retriever.get_relevant_documents(query)

# Merge the top 3 chunks into one context
semantic_context = "\n".join([doc.page_content for doc in relevant_docs])

# === 5. Feed to Gemini Manually ===
llm = ChatGoogleGenerativeAI(model="gemini-pro", temperature=0.2)

final_prompt = f"""
You are a helpful assistant. Use the context below to answer the user's question.

Context:
{semantic_context}

Question:
{query}
"""

response = llm.invoke(final_prompt)
print("๐Ÿง  Answer:", response.content)

🧪 Output:

🧠 Answer: The refund policy allows returns within 30 days...

With this approach, you're doing full RAG manually, which is perfect for learning and for building production apps.
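One practical note: once the chunks are stored, you don't need to re-embed the PDF on every run. Here's a small sketch of reconnecting to the existing collection (assuming the "rag_chunks" collection created above is still in your local Qdrant):

# Reuse the already-indexed collection instead of loading and embedding the PDF again
from qdrant_client import QdrantClient
from langchain_community.vectorstores import Qdrant
from langchain_google_genai import GoogleGenerativeAIEmbeddings

client = QdrantClient(url="http://localhost:6333")
vectorstore = Qdrant(
    client=client,
    collection_name="rag_chunks",
    embeddings=GoogleGenerativeAIEmbeddings(model="models/embedding-001"),
)

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
docs = retriever.get_relevant_documents("What is the refund policy?")
print(docs[0].page_content)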


💻 Full-Stack ChatPDF Project:

I've built a full-stack ChatPDF project using:

  • LangChain

  • Gemini API

  • Semantic Search

  • File Upload Support

Check it out on GitHub 👇
🔗 https://github.com/r00tshaim/chat-pdf

Let's connect on LinkedIn!
🔗 https://www.linkedin.com/in/shaimkhanusiya/

