Reciprocal Rank Fusion (RRF) - The Rank Symphony

In this article, we’ll dive into another clever approach to enhancing search result quality when dealing with large text datasets, leveraging a technique known as Reciprocal Rank Fusion (RRF). As with Fan-Out Retrieval from our earlier blog (https://parallel-query-in-rag.hashnode.dev/parallel-query-magic-boosting-rag-quality-with-gemini-and-qdrant), we’ll use LangChain, Google’s Gemini model, and Qdrant to build an improved retrieval pipeline, but this time we’ll focus on combining the results more intelligently.
Definition of RRF:-
Reciprocal Rank Fusion (RRF) is a technique used to merge multiple result sets, each based on distinct relevance metrics, into one cohesive result set. This method does not require parameter tuning, and the relevance metrics involved can be completely unrelated, yet it still delivers excellent outcomes.
How RRF Works in RAG:-
We’ll first break down the workflow diagram and explain how the process works step by step.
(Figure: Reciprocal Rank Fusion workflow)
Formula:
For a document d, the RRF score is calculated as:

RRF(d) = Σ (over all ranked lists) 1 / (k + r(d))

where:
k is a constant that balances the influence of high and low rankings (commonly around 60; the code later in this post uses 15).
r(d) is the rank (position, starting at 1) of document d in a given ranked list.
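As a quick worked example with k = 60: a document ranked 1st in one list and 3rd in another scores 1/(60+1) + 1/(60+3) ≈ 0.0164 + 0.0159 ≈ 0.0323, while a document that appears only once at rank 1 scores about 0.0164. Documents that show up near the top of several lists are therefore rewarded.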
Step-by-Step Breakdown of the RRF-Enhanced Retrieval Workflow
User Input
The process begins with a user asking a question in natural language. This input is typically unstructured and may vary in phrasing depending on the user’s intent.
Example: A user might ask, "What are the health benefits of green tea?"
Query Expansion
A Large Language Model (LLM), such as Google's Gemini, rewrites the original query into multiple variations.
These variations capture different ways of expressing the same intent, enhancing the chance of retrieving relevant results that might otherwise be missed.
Example Variations:
"Benefits of drinking green tea for health"
"Why is green tea good for you?"
"Health advantages of consuming green tea"
Parallel Retrieval
Each query variation is sent simultaneously to a vector database, such as Qdrant.
This parallel processing allows the system to fetch results efficiently, ensuring diverse perspectives are retrieved for each query.
Benefit: Faster retrieval and a broader range of potential answers.
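The code walkthrough later in this post issues these searches one after another for simplicity. If you want true parallelism, here is a minimal sketch using a thread pool; it assumes a LangChain vector store such as the QdrantVectorStore built later, and the names parallel_retrieve and query_variations are purely illustrative:

from concurrent.futures import ThreadPoolExecutor

def parallel_retrieve(vector_store, query_variations, k=10):
    # Run one similarity search per query variation in its own thread.
    def search(q):
        return vector_store.similarity_search(query=q, k=k)

    with ThreadPoolExecutor(max_workers=max(len(query_variations), 1)) as pool:
        # Returns one ranked list of documents per query variation.
        return list(pool.map(search, query_variations))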
Document Retrieval
For each query, the vector database independently identifies and retrieves the most relevant documents based on similarity scoring or other ranking metrics.
Example: A query like "Benefits of green tea" might retrieve scientific studies, health blogs, and dietary guides.
RRF Ranking
All the retrieved documents across the multiple queries are combined.
Reciprocal Rank Fusion (RRF) is applied to assign scores to each document based on its position in the individual ranked lists.
Documents ranked highly in multiple lists are prioritized in the final ranking, ensuring relevance and diversity in the results.
Deduplication
The combined results are cleaned to eliminate duplicate entries.
Only the top unique documents are retained, providing a concise and comprehensive set of relevant results.
Example: If two queries return the same blog post, it will appear only once in the final results.
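Here is a tiny toy sketch that makes the last two steps (fusion and deduplication) concrete. The document ids and rankings are made up purely for illustration:

def rrf(rankings, k=60):
    # Sum 1 / (k + rank) over every list a document appears in (rank starts at 1).
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank)
    # The dict is keyed by document id, so duplicates across lists collapse automatically.
    return sorted(scores, key=scores.get, reverse=True)

print(rrf([["doc_a", "doc_b", "doc_c"], ["doc_a", "doc_c"]]))
# ['doc_a', 'doc_c', 'doc_b']: doc_c overtakes doc_b because it appears in both lists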
Answer Generation
The cleaned set of documents, along with the user’s original query, is sent back to the LLM.
The LLM synthesizes the information from the documents and generates a coherent and contextually relevant answer for the user.
Example Answer: "Green tea is beneficial due to its high antioxidant content, which can reduce inflammation, support brain health, and improve heart health."
Why Use RRF in RAG Systems?
Maximizes Recall
RRF aggregates results from multiple query variations, greatly reducing the chance that a relevant document is overlooked.
Improves Precision
Prioritizes documents that consistently rank high across multiple result sets.
Supports Heterogeneous Ranking Systems
Works seamlessly with lists derived from different relevance metrics (e.g., dense vector similarity or keyword-based scores such as BM25).
No Parameter Tuning Needed
Simplicity is a key advantage: RRF doesn’t require complex hyperparameter tuning.
Enhances Answer Generation
Supplies the LLM with highly relevant and diverse documents, improving the quality of generated answers.
Code Walkthrough:
Before installing any packages, create a virtual environment:
# 1. Create a virtual environment named .venv
python -m venv .venv
# 2. Activate it
# On macOS / Linux:
source .venv/bin/activate
# On Windows (PowerShell):
.venv\Scripts\Activate.ps1
# On Windows (Command Prompt):
.venv\Scripts\activate.bat
📥 Ingest Data and ✂️ Chunk Text
Start by bringing in all the source material you want your system to “know.”
Examples: PDFs of manuals, GitHub READMEs, web‑scraped articles, CSV exports.
Goal: Make sure you extract clean text (strip out headers/footers, fix encoding issues) and record metadata (source filename, page number, date) so you can always trace back where an answer came from.
To do this, we need to install the langchain_community and pypdf packages. Run the following command in the terminal:
pip install langchain_community pypdf
#loader.py
from langchain_community.document_loaders import PyPDFLoader
from pathlib import Path
pdf_path = Path(__file__).parent / "file_name.extension_type"
loader = PyPDFLoader(file_path=pdf_path)
doc = loader.load()
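After loading, each Document keeps page-level metadata, so you can always trace an answer back to its source. The exact keys depend on the loader, but PyPDFLoader records at least the source path and page number:

print(doc[0].metadata)            # e.g. {'source': '.../file_name.extension_type', 'page': 0}
print(doc[0].page_content[:200])  # first 200 characters of the first page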
LLMs have finite context windows; if you handed a 500-page PDF to an LLM, it wouldn’t fit.
Split into ~500–1,000 token chunks, often with a 10–20% overlap so that you don’t lose sentence continuity at chunk boundaries.
Why: Smaller chunks both fit in the model’s context and allow more precise matching when you retrieve later.
chunk_size = 1000 – each slice of text will be at most 1,000 characters long.
chunk_overlap = 200 – each new slice repeats the last 200 characters of the previous slice so context flows smoothly across chunks.
#loader.py
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_spliter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200
)
split_doc = text_spliter.split_documents(documents=doc)
🔢 Generate Embeddings and 💾Store in Vector DB
Each chunk is passed through an embedding model (e.g. text‑embedding‑ada-002) that turns it into a fixed‑length vector in semantic space.
Similar meaning → nearby points in vector space. “How do I reset my password?” and “password reset steps” end up close together.
I’m using Google AI embeddings for this example, but you can use OpenAI embeddings instead. You can browse all the available embedding integrations on the LangChain Embeddings page.
To use GoogleGenerativeAIEmbeddings and load_dotenv, you first need to install the integration packages langchain-google-genai and python-dotenv.
pip install langchain-google-genai
pip install python-dotenv
#loader.py
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import os
from dotenv import load_dotenv

load_dotenv()

# load_dotenv() reads GOOGLE_API_KEY from a .env file; make sure it is present in the environment
if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = os.getenv("GOOGLE_API_KEY", "")

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004",
)
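To see the “nearby points” idea concretely, here is a minimal optional sketch that embeds two phrasings of the same question and measures how close they are with cosine similarity. It reuses the same embedding model and assumes numpy is installed and GOOGLE_API_KEY is set:

import numpy as np
from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAIEmbeddings

load_dotenv()
embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

a = np.array(embeddings.embed_query("How do I reset my password?"))
b = np.array(embeddings.embed_query("password reset steps"))

# Cosine similarity close to 1.0 means the vectors point in nearly the same direction.
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cosine:.3f}")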
Those vectors, plus your chunk text and metadata, go into a specialized index (Pinecone, Qdrant, FAISS, etc.).
Why use a vector DB? It lets you do ultra‑fast approximate nearest‑neighbor searches over millions of vectors, usually in milliseconds.
Here we’re using the Qdrant vector database. You can either install it directly on your system or run it in Docker; I’m using Docker in this example.
# docker-compose.yml
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
To run this Docker Compose file, run the following in the terminal:
docker compose -f docker-compose.yml up
Once the container is running, you can connect to Qdrant at http://localhost:6333.
To use QdrantVectorStore and QdrantClient, you first need to install the integration package langchain-qdrant:
pip install langchain-qdrant
#loader.py
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

vector_store = QdrantVectorStore.from_documents(
    documents=[],
    url="http://localhost:6333",
    embedding=embeddings,
    collection_name="learning_langchain"
)
vector_store.add_documents(documents=split_doc)
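As an optional sanity check after running loader.py, you can confirm the collection exists using the Qdrant client (installed as a dependency of langchain-qdrant):

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())  # should list the "learning_langchain" collection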
🔄 Decompose Query
Use the LLM to split the user’s original question into several targeted, semantically distinct sub‑queries. For instance, from:
“What is fs module?”
you could derive:
What is a “module” in Node.js?
What does “fs” abbreviate?
What capabilities does Node.js’s fs module offer?
Why this matters
Broader coverage: Retrieves documents matching different phrasing.
Reduced ambiguity: Each sub‑query zeroes in on a specific facet.
Sharper embeddings: More focused queries produce embedding vectors that better align with the most relevant text.
To use the OpenAI client (pointed here at Gemini’s OpenAI-compatible endpoint), you first need to install the openai package:
pip install openai
#main.py
from openai import OpenAI
from dotenv import load_dotenv
import os
import json

load_dotenv()

# The OpenAI client pointed at Gemini's OpenAI-compatible endpoint
client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def ai(message):
    # Ask Gemini for a JSON response and parse it into a Python object
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=message,
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
query = input("> ")

system_prompt = f"""
You are a helpful AI Assistant that generates multiple alternate
search queries out of the user's input query. These alternate queries
will be used to run semantic search within a vector database
using similarity metrics. Generate 5 alternate queries that help
better capture the user's input query given below.

context:
{query}

Strictly return a JSON object with a single key "queries" whose value is
an array of the alternate queries.

Example: "What is an Operating System?"
You break this question into different questions:
- What is an operating system?
- Why use an operating system?
- What are the benefits of an operating system?
- How does an operating system work?
- Benefits of an operating system

Output: {{"queries": [
    "What is an operating system?",
    "Why use an operating system?",
    "What are the benefits of an operating system?",
    "How does an operating system work?",
    "Benefits of an operating system"
]}}
"""

message = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": query}
]
question = ai(message).get("queries", [])

print("\nQuestions: ")
print(question)
🔍 Retrieve Top‑K and ➗ Fuse Rankings with Reciprocal Rank Fusion
For each decomposed sub‑query, you hit your vector database (e.g. Qdrant, FAISS, Pinecone) with a semantic‐similarity search. The goal is to pull back the K most relevant chunks—typically 10–20 passages—that best match your query embedding.
Why Top‑K? Grabbing only the highest‑scoring chunks keeps your context tight and your LLM prompt focused on the most pertinent information.
Once you have multiple ranked lists—one per sub‑query—RRF merges them into a single consensus list by:
Scoring each document by summing 1 / (k + rank + 1) across all ranked lists (rank is zero-based in the code below, so the +1 makes it one-based).
Sorting documents by their total score in descending order.
#main.py
from retrieval import retrieve

relevent_chunk = retrieve(question)
#retrieval.py
from langchain_qdrant import QdrantVectorStore
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import os

def reciprocal_rank_fusion(rankings, k=15):
    # Sum 1 / (k + rank + 1) for every list a document appears in,
    # then sort by the fused score (highest first).
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

def retrieve(queries, k=15):
    if "GOOGLE_API_KEY" not in os.environ:
        os.environ["GOOGLE_API_KEY"] = os.getenv("GOOGLE_API_KEY", "")

    embedding = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
    relevent_chunk = QdrantVectorStore.from_existing_collection(
        collection_name="learning_langchain",  # must match the collection created in loader.py
        embedding=embedding,
        url="http://localhost:6333",
    )

    # One ranked list of document ids per sub-query
    rankings = []
    lookup = {}
    for q in queries:
        docs = relevent_chunk.similarity_search(query=q, k=k)
        ids = []
        for d in docs:
            doc_id = d.metadata.get("id") or f"{d.metadata.get('page')}#{hash(d.page_content)}"
            ids.append(doc_id)
            lookup[doc_id] = d
        rankings.append(ids)

    # Fuse the per-query rankings; the lookup dict keeps each document only once (deduplication)
    fused = reciprocal_rank_fusion(rankings)
    fused_docs = []
    for doc_id, score in fused:
        if doc_id in lookup:
            fused_docs.append(lookup[doc_id])

    # Format the fused chunks with their page numbers for the answer prompt
    formatted = []
    for doc in fused_docs:
        page = doc.metadata.get("page", "?")
        text = doc.page_content.strip()
        formatted.append(f"[Page {page}]\n{text}")
    return "\n\n".join(formatted)
✍️ Generate Answer
We feed the assembled prompt, which combines the retrieved, labeled chunks with the user’s original question, into your chosen language model. The LLM then uses both its internal knowledge and the provided context to generate a coherent, fact-grounded response.
#answer_ai.py
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def answer_AI(query, assistant):
    system_prompt = """
    You are a helpful AI Assistant who is specialized in resolving the user's query.
    Note:
    - The answer should be detailed.
    - You receive a question and answer it based on the assistant content.
    - Mention the page numbers from which you picked the information.
    - If you add something of your own, state where you added it.
    """
    message = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
        {"role": "assistant", "content": assistant}
    ]
    # Generate the final free-form answer from the fused context
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=message
    )
    return response.choices[0].message.content
Finally, pass the fused chunks and the original query into answer_AI from main.py:
#main.py
from answer_ai import answer_AI
output = answer_AI(query, relevent_chunk)
print("\n------------------")
print("Answer: ")
print(output)
The full source code is available here: https://github.com/YogyashriPatil/reciprocal-rank-fusion.git
Conclusion
Reciprocal Rank Fusion (RRF) is a highly effective and versatile method for combining ranked result sets in information retrieval systems. Its ability to merge results from diverse query formulations without requiring parameter tuning makes it particularly valuable in complex workflows like Retrieval-Augmented Generation (RAG). By balancing recall and precision, RRF ensures the retrieval of comprehensive yet relevant documents, thereby providing high-quality inputs for downstream processes such as large language model-driven answer generation. Its simplicity, scalability, and compatibility with heterogeneous retrieval metrics position RRF as a robust solution for modern information retrieval challenges, enabling smarter and more accurate decision-making in data-intensive applications.