Structured Thinking in RAG: Unlocking Chain of Thought for Smarter AI

Yogyashri Patil

In Chain-of-Thought RAG, step-by-step reasoning is layered on top of a Retrieval-Augmented Generation system: the model retrieves relevant knowledge first, then reasons through it methodically before generating its final output.

It is a technique that enhances the reasoning capabilities of large language models (LLMs) by guiding them to break complex problems down into intermediate steps, grounded in the retrieved knowledge.

🧠 What is Chain of Thought (CoT) Reasoning?

Chain of Thought (CoT) is a prompting technique used in Large Language Models (LLMs) where the model is guided to explain its reasoning step by step, just like how humans think through a problem before answering it.

Example:

Ever helped someone solve a work-time math problem?
You don’t just throw the answer at them.
You explain it, bit by bit, like you’re guiding them through the logic.

That’s Chain of Thought Prompting in a nutshell.

It’s like telling a language model:

“Chill. Don’t just answer. Think out loud.”

So instead of jumping to the final result, the model reasons step-by-step — exactly how we solve stuff in real life.

Let’s look at a new example:

Problem:
If A can complete a job in 10 days and B can do it in 15 days, how long will it take for them to finish the job together?

Now instead of rushing to the formula, let’s think like a human — step-by-step 🧠

Let’s break it down:

  • A finishes 1 job in 10 days → So A does 1/10 of the job per day.

  • B finishes 1 job in 15 days → So B does 1/15 of the job per day.

  • If they work together, we add their daily work:

    • 1/10 + 1/15

    • Common denominator is 30 → (3 + 2)/30 = 5/30 = 1/6

  • That means: Together, they finish 1/6 of the job per day.

  • So, they’ll finish the full job in 6 days.

Simple, right?
But instead of jumping to “6 days,” we went through the logic.
That’s how Chain of Thought Prompting works in AI:
Reason first. Answer later.
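The same step-by-step arithmetic can be checked with a few lines of Python using the standard `fractions` module, which keeps the work rates exact instead of as floats:

```python
from fractions import Fraction

# A's and B's daily work rates
rate_a = Fraction(1, 10)   # A does 1/10 of the job per day
rate_b = Fraction(1, 15)   # B does 1/15 of the job per day

combined = rate_a + rate_b  # 1/10 + 1/15 = (3 + 2)/30 = 1/6 of the job per day
days = 1 / combined         # whole job divided by the combined daily rate

print(combined)  # 1/6
print(days)      # 6
```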

Pipeline Overview:

Step 0: Set Up Virtual Environment (Before You Begin)

Step 1: Ingest Data & Chunk Text

Step 2: Generate Embeddings & Store in Vector DB

Step 3: Decompose Query

Step 4: Sequential Retrieval

Step 5: Step-by-Step Reasoning & Final Answer

Step 6: Bring It All Together
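Before diving into the real code, the data flow of these steps can be sketched with stubbed components. `fake_retrieve` and `fake_llm` below are placeholders invented for illustration, not real APIs; they just make the decompose → retrieve → reason loop visible:

```python
# Minimal sketch of the CoT-RAG pipeline with stubbed components.

def fake_retrieve(query: str) -> str:
    # Stand-in for vector search: returns a canned context chunk.
    return f"[context for: {query}]"

def fake_llm(question: str, context: str) -> str:
    # Stand-in for the LLM call: would reason over `context` step by step.
    return f"answer({question})"

def cot_rag(query: str) -> list[str]:
    # Step 3: decompose the query into sub-questions (stubbed here).
    sub_questions = [f"{query} - part {i}" for i in range(1, 3)]
    # Steps 4-5: retrieve and reason sequentially, one sub-question at a time.
    answers = []
    for q in sub_questions:
        context = fake_retrieve(q)
        answers.append(fake_llm(q, context))
    return answers

print(cot_rag("What is the fs module?"))
```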

Code Walkthrough

🧩 Step 0: Set Up Virtual Environment

# Create and activate virtual environment (run these in terminal)
python -m venv .venv
# Activate on macOS/Linux
source .venv/bin/activate
# Activate on Windows (PowerShell)
.venv\Scripts\Activate.ps1
# Activate on Windows (Command Prompt)
.venv\Scripts\activate.bat

📥 Step 1: Ingest Data & Chunk Text

📦 Install Required Packages:

pip install langchain-community pypdf langchain-text-splitters

📄 Python Code (loader.py):

from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the PDF (replace "your_file.pdf" with your document)
pdf_path = Path(__file__).parent / "your_file.pdf"
loader = PyPDFLoader(file_path=str(pdf_path))
docs = loader.load()

# Split into overlapping chunks for embedding
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_doc = text_splitter.split_documents(documents=docs)
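To see what `chunk_size` and `chunk_overlap` actually describe, here is a rough pure-Python sketch. LangChain's `RecursiveCharacterTextSplitter` is smarter (it prefers splitting on paragraph and sentence boundaries), so this only illustrates the sliding window the two parameters define:

```python
# Naive fixed-window chunker: each chunk is `chunk_size` characters long,
# and consecutive chunks share `chunk_overlap` characters.

def naive_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = naive_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

The overlap means each chunk repeats the tail of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.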

🔢 Step 2: Generate Embeddings & Store in Vector DB

📦 Install Required Packages:

pip install langchain-google-genai python-dotenv

📄 Python Code (embed_store.py):

from langchain_google_genai import GoogleGenerativeAIEmbeddings
from dotenv import load_dotenv
import os

load_dotenv()

embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

Now install and run Qdrant DB:

📦 Install Qdrant (Docker recommended):

# docker-compose.yml
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"

Then start it:

docker compose -f docker-compose.yml up

📦 Install Python SDK:

pip install langchain-qdrant

📄 Add to embed_store.py:

from langchain_qdrant import QdrantVectorStore

vector_store = QdrantVectorStore.from_documents(
    documents=split_doc,
    url="http://localhost:6333",
    embedding=embeddings,
    collection_name="my_docs"
)

🔄 Step 3: Decompose Query

📦 Install OpenAI SDK (if using OpenAI API):

pip install openai

📄 Python Code (query_decomposer.py):

from openai import OpenAI
from dotenv import load_dotenv
import os
import json

load_dotenv()

client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def ai(message):
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=message,
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

# Guard so that importing `ai` from this module doesn't trigger the prompt
if __name__ == "__main__":
    query = input("> ")
    system_prompt = """
    You are a helpful AI that breaks down a complex query into 3-5 sub-questions.
    Respond with a JSON object of the form {"questions": [...]}.
    Example: "What is FS module?"
    → {"questions": ["What is a module in Node.js?", "What does fs stand for?", "What does the fs module do?"]}
    """
    message = [{"role": "system", "content": system_prompt}, {"role": "user", "content": query}]
    questions = ai(message)["questions"]
    print(questions)
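Note that `response_format={"type": "json_object"}` makes the model return a JSON object string, so prompting for a keyed object such as `{"questions": [...]}` keeps the parse predictable. With a response in that shape (the string below is a hand-written example, not real model output), `json.loads` gives you the list directly:

```python
import json

# Example of the JSON object the decomposer is prompted to return
raw = '{"questions": ["What is a module in Node.js?", "What does fs stand for?"]}'
parsed = json.loads(raw)

print(parsed["questions"])     # the list of sub-questions
print(parsed["questions"][0])  # What is a module in Node.js?
```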

🔍 Step 4: Sequential Retrieval

📄 Python Code (retrieval.py):

from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_qdrant import QdrantVectorStore

def retrieve(query):
    # Embed the query with the same model used at ingestion time
    embedding = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
    retriever = QdrantVectorStore.from_existing_collection(
        collection_name="my_docs",
        embedding=embedding,
        url="http://localhost:6333",
    )
    # Top-k similarity search (k defaults to 4)
    results = retriever.similarity_search(query)
    # Join the hits into one context string, tagged with their page numbers
    context = "\n\n".join(
        f"[Page {doc.metadata.get('page')}] {doc.page_content}" for doc in results
    )
    return context

🧠 Step 5: Step-by-Step Reasoning & Final Answer

📄 Python Code (answer_ai.py):

from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()

client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def answer_AI(query, context=""):
    system_prompt = """
    You are an AI that answers using retrieved knowledge step-by-step.
    Always explain your reasoning, cite pages, and state if you added new information.
    """
    # Attach the retrieved context to the user message so the model
    # reasons over it, rather than mistaking it for its own prior answer
    user_content = f"{query}\n\nRetrieved context:\n{context}" if context else query
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]
    response = client.chat.completions.create(
        model="gemini-2.0-flash",  # same model as the decomposer
        messages=messages,
    )
    return response.choices[0].message.content

🚀 Step 6: Bring It All Together

📄 Python Code (main.py):

from query_decomposer import ai
from retrieval import retrieve
from answer_ai import answer_AI

query = input("Enter your question: ")

# Step 1: Decompose the query into sub-questions
system_prompt = 'Decompose this query into 3-5 sub-questions. Respond with a JSON object of the form {"questions": [...]}.'
message = [{"role": "system", "content": system_prompt}, {"role": "user", "content": query}]
questions = ai(message)["questions"]

# Step 2: Sequential retrieval + answering, feeding each answer forward
responses = []
context = ""
for q in questions:
    # Enrich the retrieval query with what we've learned so far
    chunk = retrieve(q + " " + context)
    response = answer_AI(q, chunk)
    responses.append(response)
    context += "\n" + response

# Final Answer
final = answer_AI(query, "\n".join(responses))
print("Answer:\n", final)

🧠 CoT + RAG = Superpowers

In a RAG (Retrieval-Augmented Generation) setup:

  • The system retrieves relevant knowledge,

  • Then applies Chain of Thought to reason over it,

  • Resulting in answers that are not only relevant, but explainable.

Conclusion

Chain of Thought (CoT) reasoning transforms how large language models think: not just what they answer, but how they arrive at that answer. By encouraging step-by-step reasoning, CoT makes models more transparent, accurate, and reliable, especially when tackling complex or multi-layered queries.

When combined with Retrieval-Augmented Generation (RAG), CoT unlocks the full potential of LLMs: they can retrieve external knowledge and logically reason over it just like an expert human would.

In short, CoT turns LLMs from fast responders into thoughtful problem-solvers.
