🧠 Understanding RAG (Retrieval-Augmented Generation) for Smarter LLMs

Shaim Khanusiya
3 min read

🤔 Why do we even need RAG?

Large Language Models (LLMs) like GPT, Gemini, and Claude are amazing, but they come with limitations:

  • They are trained on general data (mostly internet-based).

  • They don't know your business-specific or custom data (e.g., internal docs, product DB, PDFs).

So if you ask them:

"What's the refund policy in my internal employee handbook?"

The LLM will shrug (metaphorically 🤷) because it has never seen your PDF.

That's where RAG comes in. It augments LLMs with retrieved real-world data.


๐Ÿญ RAG Simplified

RAG = Retrieval + Generation

Or simply put:
๐Ÿ‘‰ "RAG is LLM ko haath paair dena" (RAG is helping the LLM with external context so it can actually give you meaningful answers.)

For example, if you want to chat with a specific PDF, the LLM alone can't do it. But RAG makes it possible 💡


๐Ÿ› ๏ธ Simple Example of RAG

Let's say we have this scenario:

You have 10 rows in a product database.

You can put those 10 rows along with the user's question into the prompt and ask the LLM to generate a smart response using those rows.

prompt = f"""
Here are the 10 rows of product info:
{db_rows}

User question: {user_query}
"""

But... ⚠️
There's a problem: the prompt/token size limit. What if you had 10,000 rows? You can't fit all of that into the prompt. That's when the real RAG magic starts 🔮
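Just to make the problem concrete, here's a rough back-of-the-envelope sketch (my own illustration, using the common "~4 characters ≈ 1 token" rule of thumb and made-up product rows):

# Rough sketch: estimate the token cost of stuffing 10,000 rows into one prompt.
# Assumes the common heuristic of roughly 4 characters per token for English text;
# the rows themselves are made up for illustration.

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # crude approximation, not a real tokenizer

rows = [f"id={i}, name=Product {i}, price={i * 10}" for i in range(10_000)]
prompt = "Here are the product rows:\n" + "\n".join(rows)

print(f"~{estimate_tokens(prompt):,} tokens")
# Tens of thousands of tokens even for these tiny made-up rows; real rows are far
# bigger, so the prompt quickly exceeds (or at least wastes) the model's context window.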


🚀 Overcoming the Token Size Limit

Instead of stuffing everything in the prompt, RAG works smartly:

✅ It converts your data into semantic vectors (embeddings)
✅ Then finds the most relevant data chunks for the user's query (see the sketch below)
✅ And gives only those chunks to the LLM
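Here's a tiny sketch of that retrieval idea in plain Python: embed the chunks and the query, then pick the chunk whose vector is closest to the query vector. It reuses the Gemini embedding model from the full example further down (so it assumes GOOGLE_API_KEY is set); the three chunks are made up.

import math
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Made-up chunks standing in for pieces of your real documents
chunks = [
    "Refunds are accepted within 30 days of purchase.",
    "Our office is open Monday to Friday, 9am to 6pm.",
    "Shipping takes 3-5 business days within India.",
]

embedding_model = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
chunk_vectors = embedding_model.embed_documents(chunks)   # one vector per chunk
query_vector = embedding_model.embed_query("What is the refund policy?")

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Keep only the chunk most similar to the query: this is the "R" in RAG
best_chunk, _ = max(zip(chunks, chunk_vectors), key=lambda pair: cosine(query_vector, pair[1]))
print(best_chunk)  # -> the refund-policy chunk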

Let's break it down with a PDF example.


๐Ÿ“ Example: Chat with PDF โ€” Two Approaches

🔴 Approach 1: Dump entire PDF into prompt

PDF → TEXT

USER_PROMPT + SYSTEM_PROMPT(TEXT) → LLM

Issue: If the PDF is large, the token limit will explode 💥
Not scalable, not efficient.
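For reference, Approach 1 looks roughly like this (a sketch only; it assumes pypdf is installed, a local sample.pdf, and the same Gemini setup used in the full example below):

# Approach 1 (naive): extract ALL text from the PDF and stuff it into one prompt.
# Fine for a tiny PDF, but a large one will blow past the model's context window.
from pypdf import PdfReader
from langchain_google_genai import ChatGoogleGenerativeAI  # needs GOOGLE_API_KEY set

reader = PdfReader("sample.pdf")
full_text = "\n".join(page.extract_text() or "" for page in reader.pages)

llm = ChatGoogleGenerativeAI(model="gemini-pro")
prompt = f"""Use the document below to answer the question.

Document:
{full_text}

Question: What is the refund policy?"""

print(llm.invoke(prompt).content)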


🟢 Approach 2: Send only the relevant chunks (semantic search)

PDF → Chunk1, Chunk2, Chunk3, ...

Relevant Chunk = Chunk2  ← (semantic search)

USER_PROMPT + SYSTEM_PROMPT(Chunk2) → LLM

Now, you only send the relevant chunk to the model!
Fewer tokens, more accurate answers. Win-win 🏆


🧠 Semantic Chunk Retrieval + Manual Prompt (Real RAG)

Want to see real RAG in action? Here's the core idea:
We'll use LangChain to:

  1. 🧾 Load and split the PDF

  2. ๐Ÿ” Store chunks in Qdrant

  3. 🧲 Retrieve relevant chunks based on user query

  4. 🧠 Pass only the relevant context to Gemini LLM

💡 Full Code Example:

import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Qdrant
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from qdrant_client import QdrantClient
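
# Prerequisites for this script:
#   - GOOGLE_API_KEY set in the environment (e.g. os.environ["GOOGLE_API_KEY"] = "<your key>"),
#     which the Gemini embedding and chat clients pick up automatically
#   - a local Qdrant instance running, e.g. `docker run -p 6333:6333 qdrant/qdrant`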

# === 1. Load and Chunk PDF ===
loader = PyPDFLoader("sample.pdf")
pages = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(pages)

# === 2. Generate Embeddings ===
embedding_model = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# === 3. Store in Qdrant ===
vectorstore = Qdrant.from_documents(
    documents=chunks,
    embedding=embedding_model,
    url="http://localhost:6333",
    collection_name="rag_chunks"
)

# === 4. Retrieve Relevant Chunks ===
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})
query = "What is the refund policy?"
relevant_docs = retriever.get_relevant_documents(query)

# Merge the top 3 chunks into one context
semantic_context = "\n".join([doc.page_content for doc in relevant_docs])

# === 5. Feed to Gemini Manually ===
llm = ChatGoogleGenerativeAI(model="gemini-pro", temperature=0.2)

final_prompt = f"""
You are a helpful assistant. Use the context below to answer the user's question.

Context:
{semantic_context}

Question:
{query}
"""

response = llm.invoke(final_prompt)
print("๐Ÿง  Answer:", response.content)

🧪 Output:

🧠 Answer: The refund policy allows returns within 30 days...

With this approach, you're doing full RAG manually, which is perfect for learning and for building production apps.
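One practical note: once the chunks are stored, you don't need to re-embed the PDF on every run. Here's a small sketch of reconnecting to the existing collection (assuming the "rag_chunks" collection created above is still in your local Qdrant):

# Reuse the already-indexed collection instead of loading and embedding the PDF again
from qdrant_client import QdrantClient
from langchain_community.vectorstores import Qdrant
from langchain_google_genai import GoogleGenerativeAIEmbeddings

client = QdrantClient(url="http://localhost:6333")
vectorstore = Qdrant(
    client=client,
    collection_name="rag_chunks",
    embeddings=GoogleGenerativeAIEmbeddings(model="models/embedding-001"),
)

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
docs = retriever.get_relevant_documents("What is the refund policy?")
print(docs[0].page_content)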


💻 Full-Stack ChatPDF Project:

I've built a full-stack ChatPDF project using:

  • LangChain

  • Gemini API

  • Semantic Search

  • File Upload Support

Check it out on GitHub 👇
🔗 https://github.com/r00tshaim/chat-pdf

Let's connect on LinkedIn!
🔗 https://www.linkedin.com/in/shaimkhanusiya/

