Understanding RAG: A Comprehensive Guide

somil

RAG (Retrieval-Augmented Generation) is a powerful technique in Generative AI that combines a retrieval system with a language model to generate more accurate, grounded, and up-to-date responses.

How RAG Works

  1. Query Input
    A user asks a question or gives a prompt.

  2. Retriever (Search Component)
    The system searches a document store (such as PDFs, websites, or databases) using vector similarity or keyword-based search (e.g., FAISS, Elasticsearch).

  3. Retriever Output
    Top k relevant documents are returned (e.g., paragraphs, chunks).

  4. Generator (LLM)
    The retrieved documents are combined with the query and sent to a language model (like GPT or LLaMA), which generates a response grounded in both the query and the retrieved context (a minimal sketch of this loop appears after the list).
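
Taken together, these four steps form a simple loop: embed the query, retrieve the most similar chunks, assemble a prompt, and generate. Here is a minimal sketch in Python, where embed, search_index, and call_llm are hypothetical helpers standing in for whatever embedding model, vector store, and LLM you use:

def answer_with_rag(query, index, k=4):
    # 1. Query input: embed the user's question
    query_vector = embed(query)  # hypothetical embedding helper

    # 2-3. Retriever: fetch the top-k most similar chunks from the document store
    chunks = search_index(index, query_vector, k=k)  # hypothetical vector search

    # 4. Generator: combine the query with the retrieved context and call the LLM
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)  # hypothetical LLM call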

Why Use RAG?

  • Overcomes the knowledge cutoff of LLMs

  • Reduces hallucinations

  • Enables domain-specific answers (legal, medical, business, etc.)

  • Makes models more trustworthy and explainable

RAG Variants

  • Standard RAG – simple retrieval + LLM

  • RAG-Fusion – generates multiple query variants and fuses their retrieval results

  • Multi-hop RAG – reasoning across multiple documents over successive retrieval steps

  • Conversational RAG – context-aware over chat history

Two-Step Process in RAG: Indexing and Retrieving

Retrieval-Augmented Generation (RAG) relies on a two-step process to fetch relevant external knowledge before generating a response. The two key steps are:

1. Indexing (Preprocessing Phase)

This is a one-time or periodic process where your data is prepared and stored for efficient retrieval.

Steps in Indexing:

  • Chunking: Break large documents (e.g., PDFs, blogs, reports) into smaller chunks (like 200–500 words).

  • Embedding: Convert each chunk into a high-dimensional vector using an embedding model (e.g., OpenAI, HuggingFace, BGE).

  • Storing: Store the vectors and their corresponding text chunks in a vector database (e.g., FAISS, Chroma, Pinecone).

Example:

Text: "The mitochondria are the powerhouse of the cell."
→ Embedding → Stored as a vector in FAISS with reference to the original text.
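
As a concrete sketch of the indexing phase, here is one way to chunk, embed, and store text with FAISS directly. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, "document.txt" is a placeholder input file, and the chunking is deliberately naive (fixed word windows, no overlap):

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Chunking: split the document into fixed-size word windows
def chunk_text(text, words_per_chunk=300):
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

chunks = chunk_text(open("document.txt").read())

# Embedding: one vector per chunk, normalized so inner product = cosine similarity
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(chunks, normalize_embeddings=True)

# Storing: FAISS index for the vectors, plus the chunk list for lookup by position
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.asarray(vectors, dtype="float32"))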

2. Retrieving (Query-Time Phase)

This step happens every time a user asks a question.

Steps in Retrieval:

  • Embed the query using the same embedding model.

  • Search the vector DB for the most similar chunks using vector similarity (like cosine similarity).

  • Return the Top-k relevant chunks to feed into the language model along with the original question.

Example:

Query: "What is the function of mitochondria?"
→ Embedding → Compare against stored vectors → Retrieve the chunk above
→ Send it to the LLM for grounded response generation.
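
Continuing the indexing sketch above, the query-time phase embeds the question with the same model and asks FAISS for the nearest chunks:

# Embed the query with the same embedding model used during indexing
query = "What is the function of mitochondria?"
query_vector = model.encode([query], normalize_embeddings=True)

# Vector similarity search: top-k chunks by cosine similarity
k = 3
scores, ids = index.search(np.asarray(query_vector, dtype="float32"), k)
top_chunks = [chunks[i] for i in ids[0]]

# These chunks go into the LLM prompt alongside the original question
print(top_chunks)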

Basic RAG structure: query → retriever → top-k chunks → LLM → grounded answer

LangChain Introduction

LangChain is a powerful open-source framework designed to help developers build applications powered by large language models (LLMs). It simplifies the process of integrating LLMs with external data sources, memory, tools, and workflows like RAG (Retrieval-Augmented Generation).

LangChain is especially useful for building:

  • Custom chatbots

  • PDF Q&A assistants

  • RAG pipelines

  • Agents that can browse, code, or search

Core Concepts in LangChain

  • LLMs – connect to models like GPT, Claude, or LLaMA

  • Prompts – templates and input formatting for LLMs

  • Chains – sequences of LLM calls (e.g., input → prompt → output)

  • Agents – LLMs that decide which tools to use step-by-step

  • Retrievers – fetch relevant context from vector stores

  • Tools – interfaces for search, APIs, file systems, etc.
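
To make the prompt and chain concepts concrete, here is a minimal example using the classic LangChain API (PromptTemplate plus LLMChain); newer releases favor the runnable/pipe syntax, so treat this as a sketch rather than the current recommended style:

from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain

# Prompt: a template with a single input variable
prompt = PromptTemplate.from_template("Explain {topic} in one sentence.")

# LLM: connects to an OpenAI chat model (requires OPENAI_API_KEY in the environment)
llm = ChatOpenAI(temperature=0)

# Chain: input -> prompt -> LLM -> output
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(topic="retrieval-augmented generation"))

A chain like this is roughly what RetrievalQA composes internally: the retriever's chunks are formatted into a prompt template, which is then passed to the LLM.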

LangChain PDF Chatbot – Basic Workflow

1. Install Required Packages

pip install langchain openai faiss-cpu pypdf tiktoken

2. Code Overview

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# 1. Load the PDF
loader = PyPDFLoader("your_file.pdf")
documents = loader.load()

# 2. Split into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

# 3. Convert text chunks to embeddings
embeddings = OpenAIEmbeddings()  # You can use HuggingFaceEmbeddings instead

# 4. Store vectors in a vector database
vectorstore = FAISS.from_documents(docs, embeddings)

# 5. Create retriever and chain
retriever = vectorstore.as_retriever()
llm = ChatOpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

# 6. Ask a question
query = "What are the key findings in the document?"
response = qa_chain.run(query)
print(response)

What Happens Behind the Scenes

  • PDF is chunked → each chunk is embedded → stored in FAISS.

  • Query is embedded → similar chunks are retrieved → the LLM receives the query plus the relevant chunks → generates a grounded answer.
