Reciprocal Rank Fusion (RRF) - The Rank Symphony

In this article, we’ll dive into another clever approach to enhancing search result quality when dealing with large text datasets, leveraging a technique known as Reciprocal Rank Fusion (RRF). As with Fan-Out Retrieval from our earlier blog (https://parallel-query-in-rag.hashnode.dev/parallel-query-magic-boosting-rag-quality-with-gemini-and-qdrant), we’ll use LangChain, Google’s Gemini model, and Qdrant to build an improved retrieval pipeline, but this time we’ll focus on combining the results more intelligently.
Definition of RRF:-
Reciprocal Rank Fusion (RRF) is a technique used to merge multiple result sets, each based on distinct relevance metrics, into one cohesive result set. This method does not require parameter tuning, and the relevance metrics involved can be completely unrelated, yet it still delivers excellent outcomes.
How RRF Works in RAG:-
We’ll first break down the workflow diagram and explain how the process works step by step.
(Figure: Reciprocal Rank Fusion workflow)
Formula:
For a document d, the RRF score is calculated as:

RRF(d) = Σ (over all ranked lists) 1 / (k + r(d))

where:
k is a constant that balances the influence of high and low rankings (commonly around 60; the code later in this post uses 15).
r(d) is the rank (position, starting at 1) of document d in a given ranked list.
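As a quick worked example with k = 60: a document ranked 1st in one list and 3rd in another scores 1/(60+1) + 1/(60+3) ≈ 0.0164 + 0.0159 ≈ 0.0323, while a document that appears only once at rank 1 scores about 0.0164. Documents that show up near the top of several lists are therefore rewarded.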
Step-by-Step Breakdown of the RRF-Enhanced Retrieval Workflow
User Input
The process begins with a user asking a question in natural language. This input is typically unstructured and may vary in phrasing depending on the user’s intent.
Example: A user might ask, "What are the health benefits of green tea?"
Query Expansion
A Large Language Model (LLM), such as Google's Gemini, rewrites the original query into multiple variations.
These variations capture different ways of expressing the same intent, enhancing the chance of retrieving relevant results that might otherwise be missed.
Example Variations:
"Benefits of drinking green tea for health"
"Why is green tea good for you?"
"Health advantages of consuming green tea"
Parallel Retrieval
Each query variation is sent simultaneously to a vector database, such as Qdrant.
This parallel processing allows the system to fetch results efficiently, ensuring diverse perspectives are retrieved for each query.
Benefit: Faster retrieval and a broader range of potential answers.
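The code walkthrough later in this post issues these searches one after another for simplicity. If you want true parallelism, here is a minimal sketch using a thread pool; it assumes a LangChain vector store such as the QdrantVectorStore built later, and the names parallel_retrieve and query_variations are purely illustrative:

from concurrent.futures import ThreadPoolExecutor

def parallel_retrieve(vector_store, query_variations, k=10):
    # Run one similarity search per query variation in its own thread.
    def search(q):
        return vector_store.similarity_search(query=q, k=k)

    with ThreadPoolExecutor(max_workers=max(len(query_variations), 1)) as pool:
        # Returns one ranked list of documents per query variation.
        return list(pool.map(search, query_variations))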
Document Retrieval
For each query, the vector database independently identifies and retrieves the most relevant documents based on similarity scoring or other ranking metrics.
Example: A query like "Benefits of green tea" might retrieve scientific studies, health blogs, and dietary guides.
RRF Ranking
All the retrieved documents across the multiple queries are combined.
Reciprocal Rank Fusion (RRF) is applied to assign scores to each document based on its position in the individual ranked lists.
Documents ranked highly in multiple lists are prioritized in the final ranking, ensuring relevance and diversity in the results.
Deduplication
The combined results are cleaned to eliminate duplicate entries.
Only the top unique documents are retained, providing a concise and comprehensive set of relevant results.
Example: If two queries return the same blog post, it will appear only once in the final results.
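Here is a tiny toy sketch that makes the last two steps (fusion and deduplication) concrete. The document ids and rankings are made up purely for illustration:

def rrf(rankings, k=60):
    # Sum 1 / (k + rank) over every list a document appears in (rank starts at 1).
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank)
    # The dict is keyed by document id, so duplicates across lists collapse automatically.
    return sorted(scores, key=scores.get, reverse=True)

print(rrf([["doc_a", "doc_b", "doc_c"], ["doc_a", "doc_c"]]))
# ['doc_a', 'doc_c', 'doc_b']: doc_c overtakes doc_b because it appears in both lists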
Answer Generation
The cleaned set of documents, along with the user’s original query, is sent back to the LLM.
The LLM synthesizes the information from the documents and generates a coherent and contextually relevant answer for the user.
Example Answer: "Green tea is beneficial due to its high antioxidant content, which can reduce inflammation, support brain health, and improve heart health."
Why Use RRF in RAG Systems?
Maximizes Recall
RRF aggregates results from multiple query variations, greatly reducing the chance that a relevant document is overlooked.
Improves Precision
Prioritizes documents that consistently rank high across multiple result sets.
Supports Heterogeneous Ranking Systems
Works seamlessly with lists derived from different relevance metrics (e.g., dense vector similarity or keyword-based scores such as BM25).
No Parameter Tuning Needed
Simplicity is a key advantage: RRF doesn’t require complex hyperparameter tuning.
Enhances Answer Generation
Supplies the LLM with highly relevant and diverse documents, improving the quality of generated answers.
Code Walkthrough:
Before installing any packages, create a virtual environment:
# 1. Create a virtual environment named .venv
python -m venv .venv
# 2. Activate it
# On macOS / Linux:
source .venv/bin/activate
# On Windows (PowerShell):
.venv\Scripts\Activate.ps1
# On Windows (Command Prompt):
.venv\Scripts\activate.bat
📥 Ingest Data and ✂️ Chunk Text
Start by bringing in all the source material you want your system to “know.”
Examples: PDFs of manuals, GitHub READMEs, web‑scraped articles, CSV exports.
Goal: Make sure you extract clean text (strip out headers/footers, fix encoding issues) and record metadata (source filename, page number, date) so you can always trace back where an answer came from.
To do this, we need to install the langchain_community and pypdf packages. Run the following command in the terminal:
pip install langchain_community pypdf
#loader.py
from langchain_community.document_loaders import PyPDFLoader
from pathlib import Path
pdf_path = Path(__file__).parent / "file_name.extension_type"
loader = PyPDFLoader(file_path=pdf_path)
doc = loader.load()
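After loading, each Document keeps page-level metadata, so you can always trace an answer back to its source. The exact keys depend on the loader, but PyPDFLoader records at least the source path and page number:

print(doc[0].metadata)            # e.g. {'source': '.../file_name.extension_type', 'page': 0}
print(doc[0].page_content[:200])  # first 200 characters of the first page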
LLMs have finite context windows; if you handed a 500-page PDF to an LLM, it wouldn’t fit.
Split into ~500–1,000 token chunks, often with a 10–20% overlap so that you don’t lose sentence continuity at chunk boundaries.
Why: Smaller chunks both fit in the model’s context and allow more precise matching when you retrieve later.
chunk_size = 1000 – each slice of text will be at most 1,000 characters long.
chunk_overlap = 200 – each new slice repeats the last 200 characters of the previous slice so context flows smoothly across chunks.
#loader.py
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_spliter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200
)
split_doc = text_spliter.split_documents(documents=doc)
🔢 Generate Embeddings and 💾Store in Vector DB
Each chunk is passed through an embedding model (e.g. text‑embedding‑ada-002) that turns it into a fixed‑length vector in semantic space.
Similar meaning → nearby points in vector space. “How do I reset my password?” and “password reset steps” end up close together.
I’m using Google AI embeddings for this example, but you can use OpenAI embeddings instead. You can browse all the available embedding integrations on the LangChain Embeddings page.
To use GoogleGenerativeAIEmbeddings and load_dotenv, you first need to install the integration packages langchain-google-genai and python-dotenv.
pip install langchain-google-genai
pip install python-dotenv
#loader.py
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import os
from dotenv import load_dotenv

load_dotenv()

# load_dotenv() reads GOOGLE_API_KEY from a .env file; make sure it is present in the environment
if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = os.getenv("GOOGLE_API_KEY", "")

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004",
)
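To see the “nearby points” idea concretely, here is a minimal optional sketch that embeds two phrasings of the same question and measures how close they are with cosine similarity. It reuses the same embedding model and assumes numpy is installed and GOOGLE_API_KEY is set:

import numpy as np
from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAIEmbeddings

load_dotenv()
embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

a = np.array(embeddings.embed_query("How do I reset my password?"))
b = np.array(embeddings.embed_query("password reset steps"))

# Cosine similarity close to 1.0 means the vectors point in nearly the same direction.
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cosine:.3f}")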
Those vectors, plus your chunk text and metadata, go into a specialized index (Pinecone, Qdrant, FAISS, etc.).
Why use a vector DB? It lets you do ultra‑fast approximate nearest‑neighbor searches over millions of vectors, usually in milliseconds.
Here we’re using the Qdrant vector database. You can either install it directly on your system or run it in Docker; I’m using Docker in this example.
# docker-compose.yml
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
To run this Docker Compose file, run the following in the terminal:
docker compose -f docker-compose.yml up
Once the container is running, you can connect to Qdrant at http://localhost:6333.
To use QdrantVectorStore and QdrantClient, you first need to install the integration package langchain-qdrant:
pip install langchain-qdrant
#loader.py
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

vector_store = QdrantVectorStore.from_documents(
    documents=[],
    url="http://localhost:6333",
    embedding=embeddings,
    collection_name="learning_langchain"
)
vector_store.add_documents(documents=split_doc)
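As an optional sanity check after running loader.py, you can confirm the collection exists using the Qdrant client (installed as a dependency of langchain-qdrant):

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())  # should list the "learning_langchain" collection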
🔄 Decompose Query
Use the LLM to split the user’s original question into several targeted, semantically distinct sub‑queries. For instance, from:
“What is fs module?”
you could derive:
What is a “module” in Node.js?
What does “fs” abbreviate?
What capabilities does Node.js’s fs module offer?
Why this matters
Broader coverage: Retrieves documents matching different phrasing.
Reduced ambiguity: Each sub‑query zeroes in on a specific facet.
Sharper embeddings: More focused queries produce embedding vectors that better align with the most relevant text.
To use the OpenAI client (pointed here at Gemini’s OpenAI-compatible endpoint), you first need to install the openai package:
pip install openai
#main.py
from openai import OpenAI
from dotenv import load_dotenv
import os
import json

load_dotenv()

# The OpenAI client pointed at Gemini's OpenAI-compatible endpoint
client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def ai(message):
    # Ask Gemini for a JSON response and parse it into a Python object
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=message,
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
query = input("> ")

system_prompt = f"""
You are a helpful AI Assistant that generates multiple alternate
search queries out of the user's input query. These alternate queries
will be used to run semantic search within a vector database
using similarity metrics. Generate 5 alternate queries that help
better capture the user's input query given below.

context:
{query}

Strictly return a JSON object with a single key "queries" whose value is
an array of the alternate queries.

Example: "What is an Operating System?"
You break this question into different questions:
- What is an operating system?
- Why use an operating system?
- What are the benefits of an operating system?
- How does an operating system work?
- Benefits of an operating system

Output: {{"queries": [
    "What is an operating system?",
    "Why use an operating system?",
    "What are the benefits of an operating system?",
    "How does an operating system work?",
    "Benefits of an operating system"
]}}
"""

message = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": query}
]
question = ai(message).get("queries", [])

print("\nQuestions: ")
print(question)
🔍 Retrieve Top‑K and ➗ Fuse Rankings with Reciprocal Rank Fusion
For each decomposed sub‑query, you hit your vector database (e.g. Qdrant, FAISS, Pinecone) with a semantic‐similarity search. The goal is to pull back the K most relevant chunks—typically 10–20 passages—that best match your query embedding.
Why Top‑K? Grabbing only the highest‑scoring chunks keeps your context tight and your LLM prompt focused on the most pertinent information.
Once you have multiple ranked lists—one per sub‑query—RRF merges them into a single consensus list by:
Scoring each document by summing 1 / (k + rank + 1) across all ranked lists (rank is zero-based in the code below, so the +1 makes it one-based).
Sorting documents by their total score in descending order.
#main.py
from retrieval import retrieve

relevent_chunk = retrieve(question)
#retrieval.py
from langchain_qdrant import QdrantVectorStore
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import os

def reciprocal_rank_fusion(rankings, k=15):
    # Sum 1 / (k + rank + 1) for every list a document appears in,
    # then sort by the fused score (highest first).
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

def retrieve(queries, k=15):
    if "GOOGLE_API_KEY" not in os.environ:
        os.environ["GOOGLE_API_KEY"] = os.getenv("GOOGLE_API_KEY", "")

    embedding = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
    relevent_chunk = QdrantVectorStore.from_existing_collection(
        collection_name="learning_langchain",  # must match the collection created in loader.py
        embedding=embedding,
        url="http://localhost:6333",
    )

    # One ranked list of document ids per sub-query
    rankings = []
    lookup = {}
    for q in queries:
        docs = relevent_chunk.similarity_search(query=q, k=k)
        ids = []
        for d in docs:
            doc_id = d.metadata.get("id") or f"{d.metadata.get('page')}#{hash(d.page_content)}"
            ids.append(doc_id)
            lookup[doc_id] = d
        rankings.append(ids)

    # Fuse the per-query rankings; the lookup dict keeps each document only once (deduplication)
    fused = reciprocal_rank_fusion(rankings)
    fused_docs = []
    for doc_id, score in fused:
        if doc_id in lookup:
            fused_docs.append(lookup[doc_id])

    # Format the fused chunks with their page numbers for the answer prompt
    formatted = []
    for doc in fused_docs:
        page = doc.metadata.get("page", "?")
        text = doc.page_content.strip()
        formatted.append(f"[Page {page}]\n{text}")
    return "\n\n".join(formatted)
✍️ Generate Answer
We feed the assembled prompt, which combines the retrieved, labeled chunks with the user’s original question, into your chosen language model. The LLM then uses both its internal knowledge and the provided context to generate a coherent, fact-grounded response.
#answer_ai.py
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def answer_AI(query, assistant):
    system_prompt = """
    You are a helpful AI Assistant who is specialized in resolving the user's query.
    Note:
    - The answer should be detailed.
    - You receive a question and answer it based on the assistant content.
    - Mention the page numbers from which you picked the information.
    - If you add something of your own, state where you added it.
    """
    message = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
        {"role": "assistant", "content": assistant}
    ]
    # Generate the final free-form answer from the fused context
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=message
    )
    return response.choices[0].message.content
Finally, pass the fused chunks and the original query into answer_AI from main.py:
#main.py
from answer_ai import answer_AI
output = answer_AI(query, relevent_chunk)
print("\n------------------")
print("Answer: ")
print(output)
The full source code is available here: https://github.com/YogyashriPatil/reciprocal-rank-fusion.git
Conclusion
Reciprocal Rank Fusion (RRF) is a highly effective and versatile method for combining ranked result sets in information retrieval systems. Its ability to merge results from diverse query formulations without requiring parameter tuning makes it particularly valuable in complex workflows like Retrieval-Augmented Generation (RAG). By balancing recall and precision, RRF ensures the retrieval of comprehensive yet relevant documents, thereby providing high-quality inputs for downstream processes such as large language model-driven answer generation. Its simplicity, scalability, and compatibility with heterogeneous retrieval metrics position RRF as a robust solution for modern information retrieval challenges, enabling smarter and more accurate decision-making in data-intensive applications.