Introduction To RAG


When large language models (LLMs) like GPT, Claude, and Gemini came out, everyone was impressed with their ability to generate human-like text. But then reality hit:
Models don’t know about events or facts after their training data cutoff. For example, GPT-3, trained on data from before 2021, can’t tell you what happened after 2021.
LLMs often make things up (hallucinations) in a very confident tone. Ask about a niche research paper, and it might invent an author or cite a journal that doesn’t exist.
The old fix was fine-tuning. But fine-tuning a model every time the company updates its docs is expensive and inefficient.
So researchers started asking: instead of forcing the model to memorize everything, why not let it fetch information from an external source when needed?
That’s when Retrieval-Augmented Generation (RAG) came in, formalized around 2020 in research from Facebook AI (now Meta). The idea was simple but powerful:
Store knowledge in a retrieval system (vector database, search engine, etc).
At query time, retrieve only the most relevant chunks of data.
Feed those chunks into the LLM as context, so it can generate accurate, up-to-date, and grounded answers.
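In code, that loop looks roughly like this. This is only a minimal sketch of the idea; retriever and llm here are placeholders for the concrete pieces we build later in this post.

// Minimal sketch of the RAG loop; `retriever` and `llm` are placeholders
// for the real pieces (vector store retriever, chat model) shown below.
async function answerWithRag(
  retriever: { invoke: (q: string) => Promise<{ pageContent: string }[]> },
  llm: (prompt: string) => Promise<string>,
  question: string
): Promise<string> {
  // 1) Retrieve only the most relevant chunks for this question
  const chunks = await retriever.invoke(question);
  // 2) Feed those chunks to the LLM as context
  const context = chunks.map((c) => c.pageContent).join("\n---\n");
  const prompt = `Answer using only this context:\n${context}\n\nQuestion: ${question}`;
  // 3) Generate a grounded answer
  return llm(prompt);
}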
This solved three problems at once:
Models could stay updated without retraining.
Hallucinations dropped because the model had real evidence to ground its answers.
Companies could control what data the AI uses (their own PDFs or databases).
That’s why RAG became a big deal — it turned LLMs into something closer to a trustworthy assistant.
What is RAG?
Retrieval-Augmented Generation (RAG) is a pattern where you first retrieve relevant documents and then feed them into a Large Language Model (LLM) so the LLM can generate an answer grounded in facts instead of hallucinating.
Components of RAG
1) Documents
The raw knowledge you own: product manuals, FAQs, research papers, support tickets.
Why is it important: Garbage in, garbage out. If your docs are outdated or incomplete, your RAG will mislead too.
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

async function init() {
  const pdfFilePath = "./JsTs.pdf";
  const loader = new PDFLoader(pdfFilePath);

  // Load the PDF page by page; each page becomes a Document
  const docs = await loader.load();
}
2) Chunking
It breaks long documents into smaller, retrievable chunks.
Why is it important: Embeddings and retrievers perform better on chunks than on entire documents. Without chunking, the right fact might get buried.
Custom chunk size code:
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,   // max characters per chunk
  chunkOverlap: 50, // overlap so facts aren't cut off at chunk boundaries
});

const docs = await loader.load();
const splitDocs = await splitter.splitDocuments(docs);
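To sanity-check the chunking, you can inspect what the splitter produced. A quick sketch, assuming the splitDocs from above:

// How many chunks did we get, and what does one look like?
console.log(`Created ${splitDocs.length} chunks`);
console.log(splitDocs[0].pageContent.slice(0, 200)); // preview of the first chunk
console.log(splitDocs[0].metadata);                  // source file and page info carried over from the loader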
3) Embeddings
Converts text into numerical vectors that capture meaning.
Why is it important: Retrieval works by comparing embeddings. Good embeddings mean relevant retrieval; bad embeddings mean nonsense.
const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-small",
  apiKey: process.env.OPENAI_API_KEY,
});
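To see what “comparing embeddings” means concretely, here is a small sketch that embeds two strings with the model above and computes their cosine similarity. The cosine helper is just for illustration; in practice the vector store does this comparison for you.

// Cosine similarity between two vectors: closer to 1 = more similar in meaning
function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

const [v1, v2] = await embeddings.embedDocuments([
  "TypeScript adds static types to JavaScript",
  "How do I add types to my JS code?",
]);
console.log(cosine(v1, v2)); // semantically related texts score higher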
4) Vector store
Stores embeddings and allows fast similarity search.
Why is it important: At scale (millions of chunks), you need a real vector database such as Qdrant (used in this example), FAISS, Pinecone, or Milvus for fast similarity search.
const vectorStore = await QdrantVectorStore.fromDocuments(splitDocs, embeddings, {
  url: "http://localhost:6333",
  collectionName: "rag-collection",
});
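Once indexing is done, you can query the store directly; the retriever in the next step is a thin wrapper around this kind of search. A small sketch, assuming the vectorStore created above:

// Find the 3 chunks whose embeddings are closest to the query's embedding
const results = await vectorStore.similaritySearch("What is TypeScript?", 3);
for (const doc of results) {
  console.log(doc.pageContent.slice(0, 100), doc.metadata);
}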
5) Retriever
Turns a user query into an embedding, searches in the vector store, and returns the top-k most relevant chunks.
Why is it important: This is the core of RAG. If retrieval fails, everything downstream fails.
const vectorRetriever = vectorStore.asRetriever({ k: 3 });
const relevantChunks = await vectorRetriever.invoke(userQuery);
6) Prompt builder
Combines the user question with retrieved context into a prompt for the LLM.
Why is it important: Without careful instructions, the LLM might ignore context and hallucinate.
const SYSTEM_PROMPT = `
You are an AI assistant who helps resolve user queries based on the context available from a PDF.
Only answer based on the available context from the file.
Context: ${JSON.stringify(relevantChunks)}
`;
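Instead of dumping raw JSON into the prompt, a common alternative (sketched here, not part of the original code) is to format each chunk with its page number so the model can cite where an answer came from. This assumes the loc.pageNumber metadata that the PDF loader attaches to each document.

// Build a readable context block with page numbers from the loader's metadata
const context = relevantChunks
  .map((doc, i) => `[#${i + 1}, page ${doc.metadata?.loc?.pageNumber}]\n${doc.pageContent}`)
  .join("\n\n");

const SYSTEM_PROMPT_V2 = `
You are an AI assistant. Answer only from the context below and mention the page number you used.
Context:
${context}
`;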
7) LLM
It uses the prompt to generate an answer.
In production: call OpenAI, Anthropic, etc. Ask for structured JSON output if you care about reliability.
const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: userQuery },
  ],
});
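If you want structured output as mentioned above, the Chat Completions API supports a JSON response format. A minimal sketch; the schema shape here is just an example:

// Ask the model to reply as JSON so the answer can be parsed reliably
const structured = await client.chat.completions.create({
  model: "gpt-4o-mini",
  response_format: { type: "json_object" },
  messages: [
    {
      role: "system",
      content: `${SYSTEM_PROMPT}\nReply as JSON: { "answer": string, "pages": number[] }`,
    },
    { role: "user", content: userQuery },
  ],
});
const parsed = JSON.parse(structured.choices[0].message.content ?? "{}");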
Full Code:
import "dotenv/config";
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { OpenAIEmbeddings } from "@langchain/openai";
import { QdrantVectorStore } from "@langchain/qdrant";
async function init() {
const pdfFilePath = "./JsTs.pdf";
const loader = new PDFLoader(pdfFilePath);
//page by page load the pdf
const docs = await loader.load();
//Ready the client OpenAI embedding model
const embeddings = new OpenAIEmbeddings({
model: "text-embedding-3-small",
apiKey: process.env.OPENAI_API_KEY,
});
//
try {
const vectorStore = await QdrantVectorStore.fromDocuments(
docs,
embeddings,
{
url: "http://localhost:6333",
collectionName: "rag-collection",
}
);
console.log("Indexing of documents done...");
} catch (err) {
console.error("Qdrant indexing failed:", err);
}
}
init();
import "dotenv/config";
import { QdrantVectorStore } from "@langchain/qdrant";
import { OpenAI } from "openai/client.js";
import { OpenAIEmbeddings } from "@langchain/openai";
const client = new OpenAI();
async function chat() {
const userQuery = "Can you tell me about typescript";
const embeddings = new OpenAIEmbeddings({
model: "text-embedding-3-small",
apiKey: process.env.OPENAI_API_KEY,
});
const vectorStore = await QdrantVectorStore.fromExistingCollection(
embeddings,
{
url: "http://localhost:6333",
collectionName: "rag-collection",
}
);
const vectorRetriever = vectorStore.asRetriever({
k: 3,
});
const relevantChunks = vectorRetriever.invoke(userQuery);
const SYSTEM_PROMPT = `
You are an AI assistant who helps resolving user query based on the context available to you from a PDF file with the content and page number.
Only based ans on the available context from file only.
Context: ${JSON.stringify(relevantChunks)}
`;
const response = await client.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{ role: "system", content: SYSTEM_PROMPT },
{ role: "user", content: userQuery },
],
});
console.log(`${response.choices[0].message.content}`);
}
chat();
Conclusion
RAG connects the static knowledge of LLMs with dynamic external data, making answers more accurate and context-aware. By chunking documents, creating embeddings, storing them in a vector database, and retrieving the right context at query time, it reduces hallucinations and adapts to real-world needs.
This was just the foundation—what RAG is and why it matters.
👉 In the next blog, we’ll explore where RAG fails and take a deeper dive into RAG.