Supercharge Your LLM: An In-Depth Look at Retrieval Augmented Generation (RAG)

Table of contents
- What is RAG?
- Why Do We Need RAG?
- How Does RAG Work?
- Stage 1: Data Preparation (Indexing Phase)
- Stage 2: Retrieval Augmented Generation (Query Phase)
- Why Is Overlap Used in Chunking?
- Related Core Concepts
- Building a Minimal RAG in Node.js with LangChain + Qdrant
- Prerequisites
- Indexing Script: indexing.js
- Chat Script: chat.js
- docker-compose.yml
- Run it locally
- Practical Tips & Caveats
- Final Takeaway

Large Language Models (LLMs) like GPT-4 are powerful, but they have limitations. They are trained on a snapshot of data — which means if you ask them about recent events, private documents, or specialized knowledge (like a lawyer’s current case files), they will struggle because that information doesn’t exist in their pretrained dataset.
This is where Retrieval Augmented Generation (RAG) comes into play.
What is RAG?
Retrieval-Augmented Generation (RAG) is an architecture that improves LLM responses by combining two components:
- Retriever – finds relevant documents or passages from a large corpus (using semantic search / vector similarity).
- Generator – a large language model (LLM) that conditions its answer on the retrieved context (plus the user query).
RAG enhances LLMs by giving them the ability to look things up from external data sources (like PDFs, databases, or internal documents) before answering you. Think of it as giving your AI “memory recall” for knowledge that isn’t built into its training.
Why Do We Need RAG?
Let’s take an example:
Imagine a lawyer working on a case. The LLM doesn't know the details of the lawyer's personal case files because they are not in the pretrained dataset. Stuffing every case document directly into the system prompt isn't a real option either, because it would cause:
- Context window overflow (LLMs can only handle a limited number of tokens at once).
- Increased cost and latency.
RAG solves this by storing documents efficiently and retrieving only the relevant parts when needed.
How Does RAG Work?
RAG operates in two main phases:
Stage 1: Data Preparation (Indexing Phase)
This is a preprocessing step where raw knowledge sources are prepared for retrieval.
Step 1: Setup Data Sources
Gather documents → PDFs, web pages, CSVs, or databases.
Step 2: Extract Information
Run OCR on images, extract text from PDFs, or scrape web pages to convert the sources into clean text.
Step 3: Chunking
Split large docs into smaller pieces (500–1000 tokens each).
- Why chunking? LLMs cannot process huge documents at once, and smaller chunks improve retrieval precision.
Step 4: Embedding (Vectorization)
Convert each text chunk into a dense numeric vector using an embedding model such as text-embedding-3-large.
- Why vectorization? Unlike keyword search, embeddings capture semantic meaning, so the retriever can find conceptually similar chunks even if the wording differs (a short sketch at the end of this stage illustrates this).
Step 5: Store in Vector Database
Persist embeddings and metadata in a vector DB (Qdrant, Pinecone, Milvus, Weaviate, FAISS).
- Why vector DBs? They’re designed for ultra‑fast similarity search on millions of embeddings.
This entire phase is called Indexing and is done offline or periodically — not during every query.
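To make Step 4 concrete, here is a minimal sketch (assuming the same text-embedding-3-large model and an OPENAI_API_KEY in .env) that embeds two sentences which share meaning but not wording, then compares them with cosine similarity:

import "dotenv/config";
import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({
  apiKey: process.env.OPENAI_API_KEY,
  model: "text-embedding-3-large",
});

// Embed two semantically related sentences with different wording.
const [a, b] = await embeddings.embedDocuments([
  "The contract grants the licensor all intellectual property rights.",
  "IP ownership stays with the party that issued the license.",
]);

// Cosine similarity: values closer to 1 mean more semantically similar.
const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
const norm = (v) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
console.log("Cosine similarity:", dot / (norm(a) * norm(b)));

A keyword search would find almost no overlap between those two sentences, yet their embeddings land close together, which is exactly what the retriever exploits.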
Stage 2: Retrieval Augmented Generation (Query Phase)
This phase is executed whenever a user asks a question.
Step 6: User Input – User asks a question.
Step 7: Query Embedding – Convert the query into an embedding vector (using the same model as before).
Step 8: Semantic Search – Compare the query vector with the stored embeddings in the vector DB and fetch the top‑k most relevant chunks (a short sketch at the end of this stage shows this step in isolation).
Step 9: Retrieve Relevant Data – Return matching chunks of text (and metadata).
Step 10: Construct Prompt – Pass both the retrieved context + user’s query to the LLM.
Step 11: LLM Output – The LLM generates a contextual answer grounded in those chunks.
💡 Example:
Lawyer asks: “What clauses in my documents mention intellectual property?”
System query is vectorized → Vector DB retrieves only the relevant sections.
LLM sees both the query + retrieved text and responds with precise, grounded info.
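To see Steps 7–9 in isolation, here is a minimal sketch (assuming the notebookllm collection from the indexing script already exists, and using a hypothetical query) that lets LangChain embed the query and run the vector search, returning the top chunks together with their similarity scores:

import "dotenv/config";
import { OpenAIEmbeddings } from "@langchain/openai";
import { QdrantVectorStore } from "@langchain/qdrant";

const embeddings = new OpenAIEmbeddings({
  apiKey: process.env.OPENAI_API_KEY,
  model: "text-embedding-3-large",
});

// Connect to the collection created during indexing (Step 5).
const vectorStore = await QdrantVectorStore.fromExistingCollection(embeddings, {
  url: "http://localhost:6333",
  collectionName: "notebookllm",
});

// Steps 7–9: embed the query, search, and return the 3 closest chunks with scores.
const results = await vectorStore.similaritySearchWithScore(
  "Which clauses mention intellectual property?",
  3
);
for (const [doc, score] of results) {
  // Metadata layout may vary by loader; PDFLoader usually sets loc.pageNumber.
  console.log(score.toFixed(3), doc.metadata?.loc?.pageNumber, doc.pageContent.slice(0, 80));
}

The full chat script later in this article wraps the same search in a retriever and feeds the results to the LLM (Steps 10–11).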
Why Is Overlap Used in Chunking?
Sometimes important information is split across two chunks:
Chunk 1: "The agreement begins on Jan 1, 2025 and…"
Chunk 2: "... terminates on Dec 31, 2025."
If split without overlap, context is lost. Small overlaps (e.g. 50–100 tokens) ensure continuity between chunks.
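Here is a minimal sketch of how overlap behaves with LangChain's RecursiveCharacterTextSplitter, the same splitter used in the indexing script below (note it counts characters rather than tokens, so the numbers are only illustrative):

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 100,   // max characters per chunk
  chunkOverlap: 30, // characters repeated at each boundary
});

const chunks = await splitter.splitText(
  "The agreement begins on Jan 1, 2025 and, unless renewed in writing by both parties, terminates on Dec 31, 2025."
);
// Adjacent chunks share roughly 30 characters, so the start and end dates
// stay connected even though the sentence had to be split.
console.log(chunks);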
Related Core Concepts
Why RAGs exist:
LLMs can hallucinate and fail on unseen/recent/private knowledge.
LLMs don’t scale well as knowledge bases.
RAG injects external, factual context → reduced hallucinations + scalability.
What is LangChain?
A framework that abstracts the RAG pipeline → data loaders, chunkers, embeddings, retrievers, chaining logic, and vector DB integration. It accelerates app development.
What is a Vector Database?
A specialized DB for embeddings (Qdrant, Pinecone, Weaviate, Milvus). Supports high‑scale, low‑latency nearest neighbor searches essential for semantic retrieval.
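For a feel of what a vector DB actually stores, here is a minimal sketch using the official @qdrant/js-client-rest client (an assumption on my part; the rest of this article uses the LangChain wrapper, which creates the collection for you). A collection is defined mainly by a vector size and a distance metric:

import { QdrantClient } from "@qdrant/js-client-rest";

const client = new QdrantClient({ url: "http://localhost:6333" });

// A collection holds points: { id, vector, payload (metadata) }.
// text-embedding-3-large produces 3072-dimensional vectors.
await client.createCollection("notebookllm", {
  vectors: { size: 3072, distance: "Cosine" },
});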
Building a Minimal RAG in Node.js with LangChain + Qdrant
Let’s implement a small pipeline: Indexing + Retrieval/Chat.
Prerequisites
Node.js 18+
Docker (for Qdrant)
OpenAI API key
A sample PDF (here: nodejs.pdf)
Indexing Script: indexing.js
import "dotenv/config";
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";
import { QdrantVectorStore } from "@langchain/qdrant";
const main = async () => {
try {
// Step 2: load the pdf data after raw data
const pdfPath = "./nodejs.pdf";
const loader = new PDFLoader(pdfPath);
const docs = await loader.load();
console.log("Pages loaded:", docs.length);
// Step 3: split pdf data into chunks
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 10000,
chunkOverlap: 1000,
});
const chunks = await splitter.splitDocuments(docs);
console.log("Total chunks: ", chunks.length);
console.log("First chunk: ", chunks[0]);
// Step 4: create vector embedding for each chunks
const embeddings = new OpenAIEmbeddings({
apiKey: process.env.OPENAI_API_KEY,
model: "text-embedding-3-large",
});
const vectorData = await embeddings.embedDocuments(
chunks.map((chunk) => chunk.pageContent)
);
console.log("Total embeddings generated:", vectorData.length);
console.log("Embedding for first chunk:", vectorData[0]);
// Step 5: store documents(chunks) + embeddings inside vector DB Qdrant
const vectorStore = await QdrantVectorStore.fromDocuments(
chunks,
embeddings,
{
url: "http://localhost:6333",
collectionName: "notebookllm",
}
);
console.log("Data successfully indexed into Qdrant...");
} catch (err) {
console.log(`Indexing error: ${err}`);
}
};
main();
Chat Script: chat.js
import "dotenv/config";
import { OpenAI } from "openai";
import { OpenAIEmbeddings } from "@langchain/openai";
import { QdrantVectorStore } from "@langchain/qdrant";
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
});
const main = async () => {
try {
const userQuery =
"please, can you tell me about the MongoDB hosting is what and why use?";
// Step 7: create vector embedding for for user query
const embeddings = new OpenAIEmbeddings({
apiKey: process.env.OPENAI_API_KEY,
model: "text-embedding-3-large",
});
// Step 8: search relevant vector embedding from vector Database Qdrant DB
const vectorStore = await QdrantVectorStore.fromExistingCollection(
embeddings,
{
url: "http://localhost:6333",
collectionName: "notebookllm",
}
);
// Step 9: retrieve relevant chunks from top 3 most relevant chunks for any query
const vectorRetriver = vectorStore.asRetriever({
k: 3,
});
const relevantChunks = await vectorRetriver.invoke(userQuery);
// Step 6: user input query
const SYSTEM_PROMPT = `You are an AI assistant that answers questions based on the provided context available to you from a PDF file with the content and page number. Only answer based on the available context from file.
Context: ${JSON.stringify(relevantChunks)}`;
// Step 10: pass relevant data & user input query to chat LLM(s) to get the relevant answere
const messagesHistory = [
{ role: "system", content: SYSTEM_PROMPT },
{ role: "user", content: userQuery },
];
const response = await openai.chat.completions.create({
model: "gpt-4.1-nano",
messages: messagesHistory,
});
// Step 11: user get the final output through chat LLM
console.log("Response:", response.choices[0].message.content);
} catch (error) {
console.log(`Reterival chat phase error: ${err}`);
}
};
main();
/*
:::::::::::::::::::::::::::::::::Output Sample based on Nodejs PDF:::::::::::::::::::::::::::::::::
const userQuery =
"please, can you tell me about the hosting concept in node.js";
Response: The provided document does not contain information specifically about the hosting concept in Node.js.
-----------------------------
const userQuery =
"please, can you tell me about the MongoDB hosting is what and why use?";
Response: Based on the provided content, MongoDB hosting refers to deploying and managing your MongoDB database on a dedicated platform or service. One example mentioned is MongoDB Atlas, which is the official managed hosting platform for MongoDB released by the MongoDB organization. It allows you to set up a production-ready MongoDB database without the need to manage the underlying infrastructure yourself.
Why use MongoDB hosting?
- It simplifies the deployment process and reduces administrative overhead.
- Ensures reliable and secure data storage.
- Provides scalable options to handle increasing data and traffic.
- Offers features like automatic backups, updates, and monitoring.
- Facilitates easier deployment and management of your database in production environments.
Using a managed hosting platform like MongoDB Atlas is especially beneficial for scaling applications, ensuring uptime, and focusing on development rather than infrastructure management.
*/
docker-compose.yml
version: '3.8'
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    restart: unless-stopped
    volumes:
      - qdrant_storage:/qdrant/storage

volumes:
  qdrant_storage:
Run it locally
Add your OpenAI API key to .env:
OPENAI_API_KEY=sk-...
# Start DB
docker compose up -d
# Install dependencies
npm install
# Index PDF
npm run index
# Ask questions
npm run chat
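The npm run index / npm run chat commands above assume a package.json roughly like the following sketch (version ranges are indicative only; pdf-parse is needed by PDFLoader, and "type": "module" enables the import syntax used in the scripts):

{
  "name": "minimal-rag",
  "type": "module",
  "scripts": {
    "index": "node indexing.js",
    "chat": "node chat.js"
  },
  "dependencies": {
    "@langchain/community": "latest",
    "@langchain/openai": "latest",
    "@langchain/qdrant": "latest",
    "@langchain/textsplitters": "latest",
    "dotenv": "latest",
    "openai": "latest",
    "pdf-parse": "latest"
  }
}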
Practical Tips & Caveats
Chunk size & overlap: Adjust 500–1500 tokens with ~100 overlap for balance.
Top‑k: Fetch 3–10 chunks. More chunks = rich answers but higher token use.
Vector DB tuning: Choose cosine, dot product, or Euclidean depending on needs.
Prompt engineering: Explicitly instruct the model: “If the answer is not in the context, say so” (see the sketch after this list).
Caching: Cache frequent query results.
Security: Strip PII/sensitive data before indexing.
Monitoring: Evaluate retrieval accuracy and output drift over time.
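Here is a minimal sketch of the top‑k knob and the prompt guard together, meant to slot into the main() of chat.js above (the refusal wording and the metadata path are assumptions, not fixed APIs):

// Top-k: 3–10 is a sensible range; more chunks = richer context but more tokens.
const vectorRetriever = vectorStore.asRetriever({ k: 5 });
const relevantChunks = await vectorRetriever.invoke(userQuery);

const SYSTEM_PROMPT = `Answer ONLY from the context below (PDF content plus page numbers).
If the answer is not in the context, reply: "I could not find this in the indexed documents."

Context:
${relevantChunks
  .map((c, i) => `[${i + 1}] (page ${c.metadata?.loc?.pageNumber ?? "?"}) ${c.pageContent}`)
  .join("\n\n")}`;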
Final Takeaway
RAG is a scalable pattern that extends LLMs with capabilities like:
Grounded, reliable answers (less hallucination).
Private/custom knowledge made searchable.
Efficient retrieval across large corpora.
In essence: Index once → Retrieve relevant chunks → Generate grounded answers.
With frameworks like LangChain and vector DBs like Qdrant, it’s now practical for anyone—from startups to enterprises—to build domain‑specific AI assistants (e.g., legal bots, enterprise knowledge bases, research copilots).