Chain of Thought RAG: Stepwise Reasoning for Smarter Retrieval

Suraj Patel
5 min read

Introduction

Chain of Thought RAG is an approach where step-by-step reasoning (Chain of Thought) is applied on top of a Retrieval-Augmented Generation (RAG) system — meaning the model first retrieves relevant knowledge and then reasons through it step-by-step before generating the final answer.
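
Before diving into the implementation, here is a minimal sketch of that flow; decompose_query, retrieve, and answer are placeholders for the functions built step by step in the sections below:

# Minimal sketch of the Chain of Thought RAG flow (placeholder functions are
# passed in; the real implementations are built in the sections below)
def chain_of_thought_rag(user_query, decompose_query, retrieve, answer):
    sub_queries = decompose_query(user_query)       # 1. split the question into sub-queries
    previous_answer, intermediate_answers = "", []
    for sub_query in sub_queries:                   # 2. retrieve + reason, one sub-query at a time
        context = retrieve(sub_query + " " + previous_answer)
        previous_answer = answer(sub_query, context)
        intermediate_answers.append(previous_answer)
    # 3. final reasoning over the original query plus all intermediate answers
    return answer(user_query, "\n\n".join(intermediate_answers))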

Pipeline Overview

  1. Before You Begin

  2. 📥 Ingest Data and ✂️ Chunk Text

  3. 🔢 Generate Embeddings and 💾Store in Vector DB

  4. 🔄 Decompose Query

  5. 🔍 Sequential Retrieval & Reasoning for Each Sub-Query

  6. 🧠 Final Reasoning & Answer Generation

  1. Before You Begin

Before installing any packages, create a virtual environment and activate it:

# 1. Create a virtual environment named .venv
python -m venv .venv

# 2. Activate it
# On macOS / Linux:
source .venv/bin/activate
# On Windows (PowerShell):
.venv\Scripts\Activate.ps1
# On Windows (Command Prompt):
.venv\Scripts\activate.bat
  2. 📥 Ingest Data and ✂️ Chunk Text

  • Gather all your source materials—PDFs, text documents, websites, and other knowledge repositories.

  • Break the content into manageable segments (around 500–1,000 tokens each).

  • This chunking boosts retrieval efficiency and keeps the model’s context window from being overloaded.

  • chunk_size = 1000 – each slice of text will be at most 1,000 characters long (RecursiveCharacterTextSplitter counts characters by default, not tokens). chunk_overlap = 200 – each new slice repeats the last 200 characters of the previous slice so context flows smoothly across chunks.

To do this, we need to install the packages langchain_community, pypdf, and langchain-text-splitters (the text-splitters package is usually pulled in as a dependency, but installing it explicitly does no harm).
Run the following command in the terminal:

pip install langchain_community pypdf langchain-text-splitters
#loader.py
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Loading process: point this at your own PDF
pdf_path = Path(__file__).parent / "file_name.extension_type"

loader = PyPDFLoader(file_path=pdf_path)
doc = loader.load()

# Chunking process: 1,000-character chunks with a 200-character overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
split_doc = text_splitter.split_documents(documents=doc)
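
As an optional sanity check at the end of loader.py, you can print how many pages were loaded and how many chunks were produced:

# Optional sanity check: number of pages loaded and chunks produced
print(f"Loaded {len(doc)} pages, produced {len(split_doc)} chunks")
print(split_doc[0].page_content[:200])  # preview the first 200 characters of the first chunk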
  3. 🔢 Generate Embeddings and 💾 Store in Vector DB

Each chunk is passed through an embedding model (e.g. text‑embedding‑ada-002) that turns it into a fixed‑length vector in semantic space.

I’m using Google AI embeddings for this example, but you can use OpenAI embeddings instead. You can see all the supported embedding integrations in the LangChain Embeddings documentation.

Note: Create a .env file to store your Google API key, and use python-dotenv to load it into your Python script.

To use GoogleGenerativeAIEmbeddings and load_dotenv, you first need to install the integration packages langchain-google-genai and python-dotenv.

pip install langchain-google-genai
pip install python-dotenv
#loader.py
import os
from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Load GOOGLE_API_KEY from the .env file into the environment
load_dotenv()

if "GOOGLE_API_KEY" not in os.environ:
    raise EnvironmentError("GOOGLE_API_KEY is not set; add it to your .env file")

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004",
)
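
As a quick check that the key and model are wired up correctly, you can embed a short test string and inspect the vector length (text-embedding-004 should return 768-dimensional vectors):

# Optional check: embed a test string and inspect its dimensionality
vector = embeddings.embed_query("hello world")
print(len(vector))  # expected: 768 for text-embedding-004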

Those vectors, plus your chunk text and metadata, go into a specialized index (Pinecone, Qdrant, FAISS, etc.).

Why use a vector DB? It lets you do ultra‑fast approximate nearest‑neighbor searches over millions of vectors, usually in milliseconds.

Here we’re using the Qdrant vector database. You can either install it directly on your system or run it in Docker; I’m using Docker in this example, with the following docker-compose.yml:

services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"

To start the container, run the compose file from the terminal:

docker compose -f docker-compose.yml up

Once the container is running, you can connect to Qdrant at http://localhost:6333.
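
If you want to confirm the connection from Python, a minimal check looks like this (qdrant-client is installed as a dependency of the langchain-qdrant package installed in the next step):

# Quick connectivity check against the local Qdrant instance
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())  # lists existing collections (empty on a fresh instance)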

To use QdrantVectorStore, you first need to install the integration package langchain-qdrant (which also pulls in qdrant-client).

pip install langchain-qdrant
#loader.py
from langchain_qdrant import QdrantVectorStore

# Create (or connect to) the collection, then index the chunks
vector_store = QdrantVectorStore.from_documents(
    documents=[],
    url="http://localhost:6333",
    embedding=embeddings,
    collection_name="learning_langchain"
)
vector_store.add_documents(documents=split_doc)
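
Once add_documents has finished, a quick similarity_search against the same vector_store confirms that ingestion and retrieval work end to end (the query string here is just a placeholder):

# Optional smoke test: fetch the two most similar chunks for a test query
results = vector_store.similarity_search("what is this document about", k=2)
for r in results:
    print(r.metadata.get("page"), r.page_content[:100])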
  4. 🔄 Decompose Query

Break the original user question into several focused, semantically distinct sub-queries using the LLM. For example, given the query

“What is fs module?”

you might generate:

What is a module in Node.js?

What does "fs" stand for?

What functionalities does the fs module provide in Node.js?

To use the OpenAI client (here pointed at Gemini's OpenAI-compatible endpoint), you first need to install the openai package.

pip install openai
#main.py
from openai import OpenAI
from dotenv import load_dotenv
import os
import json

load_dotenv()

# The OpenAI client is pointed at Gemini's OpenAI-compatible endpoint
client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def ai(message):
    # Ask the model for a JSON object and parse it
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=message,
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

system_prompt = """
You are a helpful AI Assistant who is specialized in resolving user queries.
You break the user query into three to five different sub-queries.

Example: "What is the fs module?"
You break this question into different questions:
- What is a module in Node.js?
- What does "fs" stand for?
- What functionalities does the fs module provide in Node.js?

You respond with a JSON object containing an array, like this:

Output: {"questions": ["What is a module in Node.js?", "What does 'fs' stand for?", "What functionalities does the fs module provide in Node.js?"]}
"""

query = input("> ")
message = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": query}
]
question = ai(message).get("questions", [])

print("\nQuestions: ")
print(question)
  5. 🔍 Sequential Retrieval & Reasoning for Each Sub-Query

  • Query 1 → Retrieve relevant chunks → Generate Response 1.

  • Query 2 + Response 1 → Retrieve → Generate Response 2.

  • Query 3 + Response 2 → Retrieve → Generate Response 3.

Each retrieval step uses the sub-query and the previous response as additional context.

#main.py
from retrieval import retrieve
from answer_ai import answer_AI

response = []
answer = ""  # previous response; empty for the first sub-query
for sub_query in question:
    # Retrieve chunks using the sub-query plus the previous answer as extra context
    chunk = retrieve(sub_query + " " + answer)
    answer = answer_AI(sub_query, chunk)
    response.append(answer)

Python file answer_ai.py

#answer_ai.py
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def answer_AI(query, assistant=""):

    system_prompt = """
    You are a helpful AI Assistant who is specialized in resolving user queries.
    You answer questions using the retrieved knowledge. Think step-by-step:
    first explain your reasoning, then give the final answer, and mention the
    page number for each fact.

    Note:
    The answer should be detailed.
    You receive a question and you answer it based on the assistant content.
    Mention the page numbers from which you picked the information, and if you
    add anything of your own, say so explicitly.
    """

    if assistant == "":
        message = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query}
        ]
    else:
        message = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
            {"role": "assistant", "content": assistant}
        ]

    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=message
    )

    return response.choices[0].message.content
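
An illustrative standalone call looks like this; the context string below is only a stand-in for real retrieved chunks:

# Illustrative call with a made-up context snippet
from answer_ai import answer_AI

context = "[Page 12] \nThe fs module provides an API for interacting with the file system."
print(answer_AI("What functionalities does the fs module provide in Node.js?", context))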

Python File retrieval.py

#retrieval.py
import os
from dotenv import load_dotenv
from langchain_qdrant import QdrantVectorStore
from langchain_google_genai import GoogleGenerativeAIEmbeddings

load_dotenv()

def retrieve(query) -> str:
    if "GOOGLE_API_KEY" not in os.environ:
        raise EnvironmentError("GOOGLE_API_KEY is not set; add it to your .env file")

    embedding = GoogleGenerativeAIEmbeddings(
        model="models/text-embedding-004"
    )

    # Connect to the collection created in loader.py (the names must match)
    retriever = QdrantVectorStore.from_existing_collection(
        collection_name="learning_langchain",
        embedding=embedding,
        url="http://localhost:6333",
    )

    relevant_chunks = retriever.similarity_search(
        query=query,
    )

    # Format each chunk with its page number so the answer can cite sources
    formatted = []
    for doc in relevant_chunks:
        snippet = f"[Page {doc.metadata.get('page')}] \n{doc.page_content}"
        formatted.append(snippet)

    context = "\n\n".join(formatted)

    return context
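
A standalone call looks like this (assuming the collection created in loader.py already exists in Qdrant):

# Example: fetch formatted context for a single sub-query
from retrieval import retrieve

context = retrieve("What is a module in Node.js?")
print(context[:500])  # preview the first 500 characters of the retrieved context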
  6. 🧠 Final Reasoning & Answer Generation

Combine the original user query + all intermediate responses. Feed this combined input into the LLM for final reasoning.
This results in a well-structured, accurate final answer.

Pass the original query plus all intermediate responses to answer_AI (from answer_ai.py above):

#main.py
from answer_ai import answer_AI
output = answer_AI(query, json.dumps(response))
print("\n------------------")
print("Answer: ")
print(output)

Executing the Code

Executing the main.py file
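
Assuming the scripts are named as in the examples above, run the ingestion step once and then start the pipeline:

python loader.py
python main.py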

Full Source Code

Grab everything—loader, retrieval, and answer‑generation—in one repo:

https://github.com/SurajPatel04/genAI/tree/main/cohort/day5class/chain_of_Thought

Written by

Suraj Patel