From Queries to Quality: Implementing Reciprocal Rank Fusion

Introduction
Reciprocal Rank Fusion (RRF) is an information-retrieval method that merges several ranked result lists into one enhanced ordering. In each list where a document appears, it receives a score of 1 / (k + rank), where k is a small smoothing constant, and these reciprocal scores are summed across lists to produce the final ranking. By emphasizing items that appear near the top of multiple lists, RRF effectively surfaces the most relevant results.
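To make the scoring concrete, here is a tiny, self-contained sketch of the idea. The two ranked lists, the document IDs, and the constant k = 60 (a value commonly used in the RRF literature) are made up purely for illustration; the retrieval code later in this post uses its own value of k.
#rrf_toy_example.py (illustrative only)
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank)
    # highest combined score first
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

list_a = ["doc3", "doc1", "doc7"]  # ranking from one query
list_b = ["doc1", "doc3", "doc9"]  # ranking from another query
print(rrf([list_a, list_b]))
# doc3 and doc1 come out on top because both lists rank them highly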
Pipeline Overview
Before You Begin
📥 Ingest Data and ✂️ Chunk Text
🔢 Generate Embeddings and 💾Store in Vector DB
🔄 Decompose Query
🔍 Retrieve Top‑K and ➗ Fuse Rankings with Reciprocal Rank Fusion
✍️ Generate Answer
Before You Begin
Before installing any packages, create a virtual environment:
# 1. Create a virtual environment named .venv
python -m venv .venv
# 2. Activate it
# On macOS / Linux:
source .venv/bin/activate
# On Windows (PowerShell):
.venv\Scripts\Activate.ps1
# On Windows (Command Prompt):
.venv\Scripts\activate.bat
📥 Ingest Data and ✂️ Chunk Text
Start by bringing in all the source material you want your system to “know.”
Examples: PDFs of manuals, GitHub READMEs, web‑scraped articles, CSV exports.
Goal: Make sure you extract clean text (strip out headers/footers, fix encoding issues) and record metadata (source filename, page number, date) so you can always trace back where an answer came from.
To do this, we need to install the packages langchain_community and pypdf.
Run the following command in the terminal:
pip install langchain_community pypdf
#loader.py
from langchain_community.document_loaders import PyPDFLoader
from pathlib import Path

# Path to the PDF you want to ingest (replace the placeholder file name)
pdf_path = Path(__file__).parent / "file_name.extension_type"

# Load the PDF; each page becomes one Document with its text and metadata
loader = PyPDFLoader(file_path=pdf_path)
doc = loader.load()
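As a quick optional check, you can print what was loaded; PyPDFLoader typically records the source path and page number in each Document's metadata, which is what lets you trace answers back to their origin later.
#loader.py (optional sanity check)
print(len(doc), "pages loaded")
print(doc[0].metadata)            # typically includes the source path and page number
print(doc[0].page_content[:200])  # first 200 characters of extracted text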
LLMs have finite context windows; if you handed a 500‑page PDF to an LLM, it wouldn’t fit.
Split into ~500–1,000 token chunks, often with a 10–20% overlap so that you don’t lose sentence continuity at chunk boundaries.
Why: Smaller chunks both fit in the model’s context and allow more precise matching when you retrieve later.
chunk_size = 1000 – each slice of text will be at most 1,000 characters long (RecursiveCharacterTextSplitter counts characters by default).
chunk_overlap = 200 – each new slice repeats the last 200 characters of the previous slice so context flows smoothly across chunks.
#loader.py
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split the page-level documents into overlapping ~1,000-character chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
split_doc = text_splitter.split_documents(documents=doc)
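A quick optional check that the split worked, reusing the doc and split_doc variables from above:
#loader.py (optional)
print(f"{len(doc)} pages -> {len(split_doc)} chunks")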
🔢 Generate Embeddings and 💾Store in Vector DB
Each chunk is passed through an embedding model (e.g. text‑embedding‑ada-002) that turns it into a fixed‑length vector in semantic space.
Similar meaning → nearby points in vector space. “How do I reset my password?” and “password reset steps” end up close together.
I’m using Google AI embeddings for this example, but you can use OpenAI embeddings instead. You can see all the available embedding integrations on the LangChain Embeddings page.
Note: Create a .env file to store your Google API key, and use python-dotenv to load it into your Python script.
To use GoogleGenerativeAIEmbeddings and load_dotenv, you first need to install the integration package langchain-google-genai and the python-dotenv package.
pip install langchain-google-genai
pip install python-dotenv
#loader.py
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import os
from dotenv import load_dotenv

load_dotenv()  # reads GOOGLE_API_KEY from the .env file into the environment

if "GOOGLE_API_KEY" not in os.environ:
    raise EnvironmentError("GOOGLE_API_KEY is not set; add it to your .env file")

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004",
)
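To see the “similar meaning → nearby points” idea in action, you can embed two related questions and compare them. This optional sketch reuses the embeddings object defined above and computes cosine similarity by hand:
#loader.py (optional sanity check)
v1 = embeddings.embed_query("How do I reset my password?")
v2 = embeddings.embed_query("password reset steps")

# cosine similarity: closer to 1.0 means more semantically similar
dot = sum(a * b for a, b in zip(v1, v2))
norm = (sum(a * a for a in v1) ** 0.5) * (sum(b * b for b in v2) ** 0.5)
print(dot / norm)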
Those vectors, plus your chunk text and metadata, go into a specialized index (Pinecone, Qdrant, FAISS, etc.).
Why use a vector DB? It lets you do ultra‑fast approximate nearest‑neighbor searches over millions of vectors, usually in milliseconds.
Here we’re using the Qdrant vector database.
You can either install it directly on your system or run it in Docker; I’m using Docker in this example.
# docker-compose.yml
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
To run this Docker Compose file from the terminal:
docker compose -f docker-compose.yml up
Once the container is running, you can connect to Qdrant at http://localhost:6333.
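To verify the instance is up (optional), you can hit Qdrant's REST API; on a fresh instance, listing collections returns an empty list. Recent Qdrant images also serve a small web UI at http://localhost:6333/dashboard.
curl http://localhost:6333/collections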
To use QdrantVectorStore and QdrantClient, you first need to install the integration package langchain-qdrant:
pip install langchain-qdrant
#loader.py
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# Create (or connect to) the collection; the embedding model defines the vector size
vector_store = QdrantVectorStore.from_documents(
    documents=[],
    url="http://localhost:6333",
    embedding=embeddings,
    collection_name="learning_langchain"  # retrieval.py must use this same collection name
)

# Embed and store all the chunks produced by the text splitter
vector_store.add_documents(documents=split_doc)
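Once the chunks are stored, an optional similarity search confirms retrieval works end to end; the query string here is just an example:
#loader.py (optional)
results = vector_store.similarity_search("What is the fs module?", k=3)
for r in results:
    print(r.metadata.get("page"), r.page_content[:100])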
🔄 Decompose Query
Use the LLM to split the user’s original question into several targeted, semantically distinct sub‑queries. For instance, from:
“What is fs module?”
you could derive:
What is a “module” in Node.js?
What does “fs” abbreviate?
What capabilities does Node.js’s fs module offer?
Why this matters
Broader coverage: Retrieves documents matching different phrasing.
Reduced ambiguity: Each sub‑query zeroes in on a specific facet.
Sharper embeddings: More focused queries produce embedding vectors that better align with the most relevant text.
To use the OpenAI client library (which we point at Gemini's OpenAI-compatible endpoint below), you first need to install the openai package:
pip install openai
#main.py
from openai import OpenAI
from dotenv import load_dotenv
import os
import json

load_dotenv()

# Gemini accessed through its OpenAI-compatible endpoint
client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def ai(message):
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=message,
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

system_prompt = """
You are a helpful AI assistant who is specialized in resolving user queries.
You break the user query into three to five different queries.

Example: "What is FS module?"
You break this question into different questions:
- What is a module in Node.js?
- What does fs stand for?
- What functionalities does the fs module provide in Node.js?

You respond with a JSON object in this format:
{
  "questions": [
    "What is a module in Node.js?",
    "What does fs stand for?",
    "What functionalities does the fs module provide in Node.js?"
  ]
}
"""

query = input("> ")
message = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": query}
]
# the JSON response holds the sub-queries under the "questions" key
question = ai(message).get("questions", [])

print("\nQuestions: ")
print(question)
🔍 Retrieve Top‑K and ➗ Fuse Rankings with Reciprocal Rank Fusion
For each decomposed sub‑query, you hit your vector database (e.g. Qdrant, FAISS, Pinecone) with a semantic‐similarity search. The goal is to pull back the K most relevant chunks—typically 10–20 passages—that best match your query embedding.
Why Top‑K? Grabbing only the highest‑scoring chunks keeps your context tight and your LLM prompt focused on the most pertinent information.
Once you have multiple ranked lists, one per sub‑query, RRF merges them into a single consensus list by:
Scoring each document by summing 1 / (k + rank + 1) across all ranking lists (the code below uses zero-based ranks, so the top result contributes 1 / (k + 1)).
Sorting documents by their total score in descending order.
#main.py
from retrieval import retrieve

# fused, formatted context chunks for all sub-queries
relevant_chunks = retrieve(question)
#retrieval.py
from langchain_qdrant import QdrantVectorStore
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from dotenv import load_dotenv

load_dotenv()  # make sure GOOGLE_API_KEY from .env is available in this module too


def reciprocal_rank_fusion(rankings, k=15):
    """Fuse several ranked lists of document IDs into one scored list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # enumerate is zero-based, so the top result contributes 1 / (k + 1)
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank + 1)
    # highest combined score first
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)


def retrieve(queries, k=15):
    # must be the same embedding model used at ingestion time
    embedding = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

    # connect to the collection that loader.py populated
    vector_store = QdrantVectorStore.from_existing_collection(
        collection_name="learning_langchain",  # must match the name used in loader.py
        embedding=embedding,
        url="http://localhost:6333",
    )

    # run each sub-query, collect rankings of IDs, and keep a lookup back to the documents
    rankings = []
    lookup = {}
    for q in queries:
        docs = vector_store.similarity_search(query=q, k=k)
        ids = []
        for d in docs:
            # fall back to a page/content hash when there is no unique metadata["id"]
            doc_id = d.metadata.get("id") or f"{d.metadata.get('page')}#{hash(d.page_content)}"
            ids.append(doc_id)
            lookup[doc_id] = d
        rankings.append(ids)

    # fuse the ranked ID lists
    fused = reciprocal_rank_fusion(rankings)

    # map fused IDs back to Document objects, preserving the fused order
    fused_docs = [lookup[doc_id] for doc_id, score in fused if doc_id in lookup]

    # format the chunks with their page numbers for the answer prompt
    formatted = []
    for doc in fused_docs:
        page = doc.metadata.get("page", "?")
        text = doc.page_content.strip()
        formatted.append(f"[Page {page}]\n{text}")
    return "\n\n".join(formatted)
✍️ Generate Answer
We feed the assembled prompt, which combines the retrieved, labeled chunks with the user’s original question, into the chosen language model. The LLM then uses both its internal knowledge and the provided context to generate a coherent, fact‑grounded response.
#answer_ai.py
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def answer_AI(query, assistant):
    system_prompt = """
    You are a helpful AI assistant who is specialized in resolving user queries.
    Note:
    - Answer in detail.
    - You receive a question and answer it based on the assistant content.
    - Mention the page numbers the information was taken from.
    - If you add anything from your own knowledge, say clearly what you added.
    """
    message = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
        {"role": "assistant", "content": assistant}
    ]
    # the answer is plain text, so no JSON response_format is needed here
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=message
    )
    return response.choices[0].message.content
Finally, pass the original query and the fused chunks into answer_AI from answer_ai.py:
#main.py
from answer_ai import answer_AI

output = answer_AI(query, relevant_chunks)
print("\n------------------")
print("Answer: ")
print(output)
Executing the Code
With the virtual environment activated and the Qdrant container running, execute the main.py file from the terminal:
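python main.py
(Run python loader.py once beforehand so the Qdrant collection is already populated.)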
Full Source Code
Grab everything—loader, retrieval, fusion, and answer‑generation—in one repo:
https://github.com/SurajPatel04/genAI/tree/main/cohort/day5class/reciprocol_rank_fusion