Practical Guide to Retrieval-Augmented Generation (RAG) - Indexing, Retrieval, and a Node.js Example

Table of contents
- 1. What is RAG?
- 2. Why is RAG used?
- 3. What is indexing in RAG?
- 4. What is retrieval/chat in RAG?
- 5. How RAG works (retriever + generator), with a simple example
- 6. Why do we perform vectorization in RAG?
- 7. Why does RAG exist?
- 8. What is chunking and why do we perform it?
- 9. Why is overlap used in chunking?
- 10. What is LangChain and why is it used?
- 11. What is a vector database and why is it used?
- 12. Build your own RAG system using Node.js with two main phases: Indexing and Retrieval/Chat
- How to run (local quickstart)
- Practical tips & caveats
- Final Takeaway

1. What is RAG?
Retrieval-Augmented Generation (RAG) is an architecture that improves LLM responses by combining two components:
Retriever - finds relevant documents or passages from a large corpus (using semantic search / vector similarity).
Generator - a large language model (LLM) that conditions its answer on the retrieved context (plus the user query).
Instead of asking the LLM to remember everything or store large documents in prompts, RAG fetches the most relevant pieces of information at query time and feeds them into the LLM. This leads to more factual, up-to-date, and context-specific answers.
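In pseudocode, the whole loop is small. This is a conceptual sketch only; retrieveTopK and llm.generate stand in for whatever retriever and LLM client you actually use:

// Conceptual RAG flow (placeholders, not a specific library API)
async function answerWithRAG(userQuery) {
  // Retriever: find the chunks most similar to the query
  const relevantChunks = await retrieveTopK(userQuery, 3);

  // Generator: condition the LLM on the retrieved context plus the query
  const prompt = `Answer using only this context:\n${relevantChunks.join("\n---\n")}\n\nQuestion: ${userQuery}`;
  return llm.generate(prompt);
}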
2. Why is RAG used?
Scalability: Store thousands or millions of documents in a vector DB and fetch only the few relevant ones per request.
Accuracy & grounding: Provide concrete evidence (chunks) to the LLM, reducing hallucinations.
Cost-efficiency: Smaller prompt sizes and selective context reduce token usage.
Updatability: Update the knowledge base (index) without retraining the LLM.
3. What is indexing in RAG?
Indexing prepares raw data into a search-friendly format. Typical steps:
Extract text from sources (PDFs, web pages, OCR, databases).
Split text into chunks.
Convert each chunk into a vector embedding (semantic representation).
Store embeddings plus chunk metadata in a vector database (Qdrant, Pinecone, Milvus, etc.).
Indexing is usually done offline or periodically (not at every query), so retrieval becomes fast.
4. What is retrieval/chat in RAG?
The retrieval/chat phase happens when a user asks a question:
Transform the user query into an embedding.
Use the embedding to query the vector store, retrieving top-k similar chunks.
Pass those chunks and the user query to the LLM (system + user prompt).
LLM generates the final answer grounded on retrieved chunks.
This two-phase flow (index once, query often) enables fast and relevant responses.
5. How RAG works (retriever + generator), with a simple example
We’ll illustrate three solution levels: Naive, Optimized, and Highly Optimized.
RAG Naive Solution
Send the entire document corpus as system prompt context for every query.
Problem: impractical for large corpora (e.g., 3K+ files), extremely slow and expensive.
RAG Optimized Solution
Preprocess documents (chunk → embed → store).
At query time, retrieve top-k chunks and pass them to the LLM as context.
Much faster and cost-efficient than naive.
RAG Highly Optimized Solution
Use careful chunking strategies and overlap, metadata tagging, fine-tuned embedding models, vector DB tuning (indexes, distance metrics), multi-stage retrieval, caching, and relevance scoring.
Note: Even a simple RAG application requires tuning multiple parameters, components, and models.
Stage 1: Data Preparation (Indexing Phase)
Step 1: Setup Raw Data Sources
Collect PDFs, docs, web pages, CSVs, or databases.
Step 2: Information Extraction
Run OCR (if images), PDF extraction, or web-scraping to convert to text.
Step 3: Chunking
Split text into fixed-size chunks (e.g., 500–1,000 tokens), possibly with overlap.
Step 4: Embedding (create semantic representations)
Use an embedding model (e.g., text-embedding-3-large) to convert chunks into numeric vectors.
Step 5: Store Embedding into Vector Database
Persist embeddings and chunk metadata in a vector DB (Qdrant / Pinecone / Milvus).
Stage 2: Retrieval-Augmented Generation (Retrieval/Chat Phase)
Step 6: User Input Query
User asks a question.
Step 7: Create User Input Query Embedding
Embed the user query with the same embedding model used during indexing.
Step 8: Search Relevant Embedding from Vector Database
Run similarity search and retrieve top-k chunks.
Step 9: Return the Relevant Data (vector embedding + chunks)
Return chunk text and metadata to the application.
Step 10: Pass Relevant Data & User Input Query to Chat LLM(s)
Construct a system prompt + user prompt that includes relevant chunks as context.
Step 11: User Gets the Final Output through the LLM
LLM returns the final answer grounded on the retrieved chunks.
6. Why do we perform vectorization in RAG?
Vectorization (embeddings) converts text into dense numeric vectors that capture semantic meaning. This enables semantic similarity search rather than brittle keyword matching, allowing the retriever to find conceptually similar chunks even if they use different words.
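To make "semantic similarity" concrete, here is a toy comparison using cosine similarity, a common metric in vector databases. The 3-dimensional vectors are made up purely for illustration; real embeddings such as text-embedding-3-large have thousands of dimensions:

// Toy example: cosine similarity between made-up embedding vectors
const cosineSimilarity = (a, b) => {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
};

const queryVector = [0.12, 0.84, 0.33];     // e.g., "MongoDB hosting"
const similarChunk = [0.1, 0.8, 0.35];      // a chunk about MongoDB Atlas
const unrelatedChunk = [0.9, 0.05, 0.02];   // a chunk about the event loop

console.log(cosineSimilarity(queryVector, similarChunk));   // close to 1 -> retrieved
console.log(cosineSimilarity(queryVector, unrelatedChunk)); // much lower -> skipped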
7. Why does RAG exist?
RAGs address limitations of standalone LLMs:
LLMs don't scale well as knowledge stores.
They can hallucinate without grounding.
RAG adds factual grounding and scalability by injecting relevant chunks from a knowledge base.
8. What is chunking and why do we perform it?
Chunking splits long documents into manageable pieces. Reasons:
LLM context windows are limited.
Retrieval benefits from focused, specific chunks.
Smaller chunks make embeddings more precise for local context.
9. Why is overlap used in chunking?
Overlapping ensures contextual continuity between adjacent chunks and reduces the chance of missing answers that require context across chunk boundaries.
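A minimal character-based chunker makes both ideas concrete. This is an illustrative sketch only; the example in section 12 uses LangChain's RecursiveCharacterTextSplitter, which also respects natural boundaries like paragraphs and sentences:

// Sliding-window chunking with overlap (overlap must be smaller than chunkSize)
function chunkText(text, chunkSize = 1000, overlap = 200) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    // advance by chunkSize minus overlap so adjacent chunks share a margin of context
    start += chunkSize - overlap;
  }
  return chunks;
}

const longDocumentText = "..."; // e.g., text extracted from a PDF
const chunks = chunkText(longDocumentText, 1000, 200);
// A sentence that straddles the 1,000-character boundary still appears intact
// in at least one chunk, thanks to the 200-character overlap.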
10. What is LangChain and why is it used?
LangChain is a framework that simplifies building LLM-powered apps: it provides prompt orchestration, document loaders, chains, and connectors to vector DBs and embedding providers. It abstracts common RAG patterns and accelerates development.
11. What is a vector database and why is it used?
A vector database (Qdrant, Pinecone, Milvus, etc.) is optimized for storing embeddings and performing fast nearest-neighbor searches. It supports high-scale, low-latency similarity queries essential for RAG.
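Under the hood, a similarity query is a single request to the database's search endpoint. The sketch below calls Qdrant's REST search API directly, assuming the notebookllm collection created in section 12 already exists; queryEmbedding is a placeholder you would fill with a real query embedding. The LangChain QdrantVectorStore wrapper used later performs an equivalent search for you:

// Sketch: nearest-neighbor search against Qdrant's REST API
const queryEmbedding = [/* floats from the same embedding model used at index time */];

const res = await fetch("http://localhost:6333/collections/notebookllm/points/search", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    vector: queryEmbedding, // placeholder above; must match the collection's dimension
    limit: 3,               // top-k
    with_payload: true,     // return the stored chunk text and metadata
  }),
});
const { result } = await res.json();
console.log(result); // [{ id, score, payload }, ...] ranked by similarity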
12. Build your own RAG system using Node.js with two main phases: Indexing and Retrieval/Chat
Below is a minimal Node.js example (indexing + chat) using LangChain, Qdrant, and OpenAI embeddings / chat. Place these files in a project folder and follow the run steps.
Prerequisites: Docker (for Qdrant), Node 18+, an OpenAI API key, and nodejs.pdf in the project root.
indexing.js
import "dotenv/config";
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";
import { QdrantVectorStore } from "@langchain/qdrant";

const main = async () => {
  try {
    // Steps 1-2: load the raw PDF and extract its text (one document per page)
    const pdfPath = "./nodejs.pdf";
    const loader = new PDFLoader(pdfPath);
    const docs = await loader.load();
    console.log("Pages loaded:", docs.length);

    // Step 3: split the PDF text into chunks
    const splitter = new RecursiveCharacterTextSplitter({
      chunkSize: 10000,
      chunkOverlap: 1000,
    });
    const chunks = await splitter.splitDocuments(docs);
    console.log("Total chunks: ", chunks.length);
    console.log("First chunk: ", chunks[0]);

    // Step 4: create a vector embedding for each chunk
    const embeddings = new OpenAIEmbeddings({
      apiKey: process.env.OPENAI_API_KEY,
      model: "text-embedding-3-large",
    });
    const vectorData = await embeddings.embedDocuments(
      chunks.map((chunk) => chunk.pageContent)
    );
    console.log("Total embeddings generated:", vectorData.length);
    console.log("Embedding for first chunk:", vectorData[0]);

    // Step 5: store documents (chunks) + embeddings inside the Qdrant vector DB.
    // Note: fromDocuments embeds the chunks internally, so the explicit
    // embedDocuments call above is only there to illustrate Step 4.
    const vectorStore = await QdrantVectorStore.fromDocuments(chunks, embeddings, {
      url: "http://localhost:6333",
      collectionName: "notebookllm",
    });
    console.log("Data successfully indexed into Qdrant...");
  } catch (err) {
    console.log(`Indexing error: ${err}`);
  }
};

main();
chat.js
import "dotenv/config";
import { OpenAI } from "openai";
import { OpenAIEmbeddings } from "@langchain/openai";
import { QdrantVectorStore } from "@langchain/qdrant";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

const main = async () => {
  try {
    // Step 6: user input query
    const userQuery =
      "please, can you tell me about the MongoDB hosting is what and why use?";

    // Step 7: create a vector embedding for the user query
    const embeddings = new OpenAIEmbeddings({
      apiKey: process.env.OPENAI_API_KEY,
      model: "text-embedding-3-large",
    });

    // Step 8: connect to the existing collection in the Qdrant vector database
    const vectorStore = await QdrantVectorStore.fromExistingCollection(embeddings, {
      url: "http://localhost:6333",
      collectionName: "notebookllm",
    });

    // Step 9: retrieve the top 3 most relevant chunks for the query
    const vectorRetriever = vectorStore.asRetriever({
      k: 3,
    });
    const relevantChunks = await vectorRetriever.invoke(userQuery);

    // Step 10: pass the relevant chunks + user query to the chat LLM
    const SYSTEM_PROMPT = `You are an AI assistant that answers questions based on the provided context available to you from a PDF file with the content and page number. Only answer based on the available context from file.
Context: ${JSON.stringify(relevantChunks)}`;
    const messagesHistory = [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: userQuery },
    ];
    const response = await openai.chat.completions.create({
      model: "gpt-4.1-nano",
      messages: messagesHistory,
    });

    // Step 11: user gets the final output from the chat LLM
    console.log("Response:", response.choices[0].message.content);
  } catch (error) {
    console.log(`Retrieval/chat phase error: ${error}`);
  }
};

main();
/*
:::::::::::::::::::::::::::::::::Output Sample based on Nodejs PDF:::::::::::::::::::::::::::::::::
const userQuery =
"please, can you tell me about the hosting concept in node.js";
Response: The provided document does not contain information specifically about the hosting concept in Node.js.
-----------------------------
const userQuery =
"please, can you tell me about the MongoDB hosting is what and why use?";
Response: Based on the provided content, MongoDB hosting refers to deploying and managing your MongoDB database on a dedicated platform or service. One example mentioned is MongoDB Atlas, which is the official managed hosting platform for MongoDB released by the MongoDB organization. It allows you to set up a production-ready MongoDB database without the need to manage the underlying infrastructure yourself.
Why use MongoDB hosting?
- It simplifies the deployment process and reduces administrative overhead.
- Ensures reliable and secure data storage.
- Provides scalable options to handle increasing data and traffic.
- Offers features like automatic backups, updates, and monitoring.
- Facilitates easier deployment and management of your database in production environments.
Using a managed hosting platform like MongoDB Atlas is especially beneficial for scaling applications, ensuring uptime, and focusing on development rather than infrastructure management.
*/
docker-compose.yml
version: '3.8'
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    restart: unless-stopped
    volumes:
      - qdrant_storage:/qdrant/storage

volumes:
  qdrant_storage:
.gitignore
node_modules
.env
package.json
{
  "name": "notebookllm",
  "version": "1.0.0",
  "type": "module",
  "main": "index.js",
  "scripts": {
    "index": "node indexing.js",
    "chat": "node chat.js"
  },
  "dependencies": {
    "@langchain/community": "^0.3.53",
    "@langchain/core": "^0.3.72",
    "@langchain/openai": "^0.6.9",
    "@langchain/qdrant": "^0.1.3",
    "@langchain/textsplitters": "^0.1.0",
    "dotenv": "^17.2.1",
    "openai": "^5.12.2",
    "pdf-parse": "^1.1.1"
  }
}
How to run (local quickstart)
Create a .env file with: OPENAI_API_KEY=sk-...
Start Qdrant:
docker compose up -d
Install dependencies:
npm install
Index the PDF:
npm run index
Run the chat example:
npm run chat
Note: LangChain and client library APIs evolve. If you see API mismatches, check your installed package docs - e.g., methods for retrieval may be .invoke(), .getRelevantDocuments(), or .similaritySearch() depending on version.
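For example, these two calls fetch roughly the same top-k chunks (a sketch; which one is available depends on your installed @langchain versions, and vectorStore / userQuery are the objects from chat.js above):

// Version-dependent alternatives for retrieving the top-k chunks
const docsViaRetriever = await vectorStore.asRetriever({ k: 3 }).invoke(userQuery);
const docsViaSearch = await vectorStore.similaritySearch(userQuery, 3);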
Practical tips & caveats
Chunk size & overlap: Tune chunk size and overlap to balance context richness and redundancy.
k (top-k): Choose how many chunks to fetch (k=3–10). Higher k increases context but costs tokens.
Vector DB tuning: Try different distance metrics (cosine vs dot) and index settings to trade latency vs accuracy.
Prompt engineering: Use clear system prompts and constraints (e.g., “Answer only from context; if not present, say you don’t know.”) - see the sketch after this list.
Caching: Cache retrieval results for repeated queries.
Security & PII: Sanitize sensitive data before indexing.
Monitoring: Track retrieval relevance and LLM output to detect drift or hallucinations.
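For the prompt-engineering tip, here is an illustrative variant of the system prompt from chat.js with an explicit fallback and citation rule; the exact wording is only a suggestion, and relevantChunks is the array returned by the retriever:

// Illustrative constrained system prompt (wording is a suggestion, not a requirement)
const SYSTEM_PROMPT = `You are an assistant that answers strictly from the context below.
If the answer is not present in the context, reply: "I don't know based on the provided document."
When you do answer, mention the page number from the chunk metadata.

Context:
${JSON.stringify(relevantChunks)}`;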
Final Takeaway
RAG is a practical, scalable pattern that bridges large language models and persistent knowledge stores. By separating indexing (prepare and embed once) from retrieval (fetch and condition at query time), RAG delivers grounded, up-to-date, and cost-efficient answers.
Start with a simple embed → store → retrieve → prompt loop, then iterate: tune chunking, retrieval scoring, vector DB settings, and prompts. With careful engineering, especially around chunking, embeddings, and prompt design, you can build robust assistants (legal, enterprise KBs, research tools) that significantly reduce hallucinations and scale to large document collections.