Reciprocal Rank Fusion (RRF) in RAG: Complete Guide with Working Code

Before reading this article, I suggest you go through the article Query Translation in RAG: 5 Powerful Techniques to Improve Retrieval Accuracy, where I briefly discussed the Reciprocal Rank Fusion (RRF) technique. In this article we go into Reciprocal Rank Fusion in depth and also write the code for it.
What is Reciprocal Rank Fusion (RRF)?
Reciprocal Rank Fusion is similar to the Parallel Query Retrieval technique. Here too we generate 4-5 queries from the user query and retrieve documents for each of them. But instead of just filtering out duplicate documents, we rank the documents with a ranking formula and take the top n documents from the result.
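Ranking Formula:
score(d) = sum over all queries q of 1 / (k + rank_q(d))
Here rank_q(d) is the position of document d in the results for query q, and k is a smoothing constant (the code in this article uses k = 60).
Basically, we rank the documents based on their frequency and position of occurrence.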
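To see how frequency and position both feed into the score, here is a minimal sketch of the formula in plain Python. The helper name rrf_score and the example ranks are illustrative assumptions; k = 60 matches the default used later in this article.

```python
from typing import Iterable


def rrf_score(ranks: Iterable[int], k: int = 60) -> float:
    """Sum 1 / (k + rank) over every query in which the document appeared.

    `ranks` holds the document's 1-based position in each result list.
    """
    return sum(1.0 / (k + rank) for rank in ranks)


# Hypothetical documents: the ranks are made up for illustration.
seen_three_times = rrf_score([2, 2, 1])  # appears in three result lists
seen_twice_high = rrf_score([1, 2])      # appears twice, near the top
seen_twice_low = rrf_score([3, 1])       # appears twice, slightly lower
seen_once = rrf_score([3])               # appears in one list only

for name, score in [
    ("three occurrences", seen_three_times),
    ("two occurrences (ranks 1, 2)", seen_twice_high),
    ("two occurrences (ranks 3, 1)", seen_twice_low),
    ("one occurrence", seen_once),
]:
    print(f"{name}: {score:.5f}")
```

A document that shows up in more result lists accumulates more terms, and among documents with the same number of occurrences the one with better positions wins.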
How Reciprocal Rank Fusion works
Step 1: Generating multiple queries
When the user asks a query, we pass it to a small LLM and ask it to generate multiple variations of that query.
import os
from dotenv import load_dotenv
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import BaseOutputParser
from typing import List, Dict

load_dotenv()

class LineListOutputParser(BaseOutputParser[List[str]]):
    """Output parser for a list of lines."""

    def parse(self, text: str) -> List[str]:
        lines = text.strip().split("\n")
        return list(filter(None, lines))

output_parser = LineListOutputParser()

llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    google_api_key=os.getenv("GEMINI_API_KEY"),
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

QUERY_REWRITE_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five
different versions of the given user question to retrieve relevant documents from a vector
database. By generating multiple perspectives on the user question, your goal is to help
the user overcome some of the limitations of the distance-based similarity search.
Provide these alternative questions separated by newlines.
Original question: {question}""",
)

llm_chain = QUERY_REWRITE_PROMPT | llm | output_parser

# ----- Step 1: Generate Multiple Queries -----
user_query = "What is FS Module?"
generated_queries = llm_chain.invoke(user_query)

print("--------- Generated Queries -----------")
for i, query in enumerate(generated_queries, 1):
    print(f"{i}. {query}")
print("------------------------")
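LLMs often add numbering or blank lines to a newline-separated list. Below is a minimal sketch of how the raw text could be cleaned before retrieval, with no LangChain dependency. The function name clean_query_lines and the sample text are assumptions for illustration, not part of the original project.

```python
import re
from typing import List


def clean_query_lines(text: str) -> List[str]:
    """Split raw LLM output into one query per line.

    Drops blank lines and strips leading list markers such as "1.", "2)" or "-".
    """
    queries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Remove a leading "1.", "2)", "-" or "*" marker, if present.
        line = re.sub(r"^(\d+[.)]|[-*])\s+", "", line)
        queries.append(line)
    return queries


# Made-up LLM output for illustration.
sample_output = """
1. What does the FS module do?

2) How is the FS module used?
- Which functions does the FS module provide?
"""

for query in clean_query_lines(sample_output):
    print(query)
```

Cleaning the lines this way keeps list markers out of the embedded query text, which would otherwise add noise to the similarity search.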
Step 2: Fetching relevant chunks from the vector store in parallel
Once we have generated multiple queries, we embed each query and retrieve the relevant chunks from the vector store by performing a similarity search. This is done for all the generated queries in parallel.
import os
from langchain_google_genai import ChatGoogleGenerativeAI
from typing import List, Set
import concurrent.futures
from langchain_core.documents import Document
from utils.output_parser import output_parser
from config.vector_store import get_vector_store
from llm.prompt_templates import QUERY_REWRITE_PROMPT
from utils.rank_docs import rank_documents

# ----- setup --------
qdrant = get_vector_store()

llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    google_api_key=os.getenv("GEMINI_API_KEY"),
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

llm_chain = QUERY_REWRITE_PROMPT | llm | output_parser

def reciprocal_rank_fusion(
    user_query: str,
    llm_chain=llm_chain,
    retriever=qdrant.as_retriever()
) -> List[Document]:
    """
    Generate multiple queries using llm_chain, fetch documents using retriever,
    and return a ranked, deduplicated list of documents.
    Args:
        user_query (str): The input user query.
        llm_chain: A Runnable (PromptTemplate | LLM | OutputParser) for generating queries.
        retriever: A LangChain retriever (e.g., qdrant.as_retriever()).
    Returns:
        List[Document]: Ranked, deduplicated documents retrieved across all queries.
    """
    # Step 1: Generate multiple queries
    generated_queries = llm_chain.invoke(user_query)

    # Step 2: Fetch documents in parallel
    def fetch_docs(query: str):
        return retriever.invoke(query)

    all_docs: List[Document] = []
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # executor.map returns the results in the same order as generated_queries
        results = executor.map(fetch_docs, generated_queries)
        for docs in results:
            all_docs.extend(docs)

    # Step 3: Rank docs
    sorted_docs = rank_documents(all_docs)

    print("--------- sorted docs ------")
    print(f"\nTotal sorted documents: {len(sorted_docs)}")
    print("===============================")
    for i, doc in enumerate(sorted_docs, 1):
        print(f"Doc {i} ID: {doc}")
    print("===============================")

    return sorted_docs
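The parallel fetch can be tried without a vector store. The sketch below swaps the real retriever for a hypothetical in-memory fake_retrieve function (an assumption for illustration) and shows that executor.map returns results in the same order as the input queries, so each result list can still be matched to its query.

```python
import concurrent.futures
from typing import Dict, List

# Hypothetical stand-in for a vector store: query -> ranked doc ids.
FAKE_INDEX: Dict[str, List[str]] = {
    "query a": ["doc1", "doc3", "doc2"],
    "query b": ["doc2", "doc3", "doc4"],
    "query c": ["doc3", "doc1"],
}


def fake_retrieve(query: str) -> List[str]:
    """Pretend similarity search; a real retriever would call the vector store."""
    return FAKE_INDEX.get(query, [])


def fetch_all(queries: List[str]) -> List[List[str]]:
    """Run one retrieval per query in a thread pool, keeping one list per query."""
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # executor.map yields results in the order of `queries`,
        # regardless of which thread finishes first.
        return list(executor.map(fake_retrieve, queries))


queries = ["query a", "query b", "query c"]
results = fetch_all(queries)
for query, docs in zip(queries, results):
    print(query, "->", docs)
```

Keeping one list per query, rather than extending a single flat list, preserves each document's rank within its own query, which is what the RRF formula works with.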
Step 3: Ranking the documents
From all the fetched documents, we rank them and take only the top n documents to answer the user query.
To rank the documents we consider how often each one occurs and its position in the results of each query.
Suppose we have three queries, each of which fetched its own list of documents.
While ranking, we reason as follows:
doc_id:3 occurs 3 times, so it gets the 1st rank.
doc_id:1 and doc_id:2 both occur twice, so to rank them we consider their positions of occurrence.
In Query 1, doc_id:1 is at the 1st position and in Query 3 it is at the 2nd position, while doc_id:2 is at the 3rd position in Query 1 and at the 1st position in Query 2. So doc_id:1 takes precedence over doc_id:2.
doc_id:4 occurs only once, so it goes last in the ranking.
The final ranking therefore becomes: doc_id:3, doc_id:1, doc_id:2, doc_id:4.
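The walk-through above can be checked with a few lines of Python. The three result lists below are an assumption: they are one arrangement consistent with the positions described in the text, not a copy of the original figure. The function name fuse is likewise illustrative.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical result lists, chosen to match the walk-through above.
results_per_query: List[List[int]] = [
    [1, 3, 2],  # Query 1: doc 1 first, doc 2 third
    [2, 3, 4],  # Query 2: doc 2 first
    [3, 1],     # Query 3: doc 1 second
]


def fuse(results: List[List[int]], k: int = 60) -> List[int]:
    """Return doc ids sorted by summed 1 / (k + rank), best first."""
    scores: Dict[int, float] = defaultdict(float)
    for ranked_ids in results:
        # Rank is 1-based and restarts for every query.
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


print(fuse(results_per_query))  # [3, 1, 2, 4]
```

With these lists, doc 3 collects three terms, doc 1 edges out doc 2 because 1/61 + 1/62 is slightly larger than 1/63 + 1/61, and doc 4 has a single term.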
from typing import List, Dict
from collections import defaultdict
from langchain_core.documents import Document

# Named rank_documents to match the import in retrival.py
def rank_documents(
    docs: List[Document],
    k: int = 60
) -> List[Document]:
    """
    Apply Reciprocal Rank Fusion (RRF) scoring to the retrieved documents.
    Args:
        docs (List[Document]): Flattened list of documents from all queries, in retrieval order.
        k (int): RRF constant. Defaults to 60.
    Returns:
        List[Document]: Deduplicated documents sorted by combined RRF score (highest first).
    """
    doc_scores = defaultdict(float)
    doc_store = {}

    # Note: rank here is the position in the flattened list,
    # not the position within each individual query's results.
    for rank, doc in enumerate(docs):
        # Use the metadata id when present, otherwise a hash of the chunk text
        doc_id = doc.metadata.get("id") or hash(doc.page_content.strip())
        score = 1 / (k + rank)
        doc_scores[doc_id] += score
        # Store one version of the doc
        if doc_id not in doc_store:
            doc_store[doc_id] = doc

    # Sort by RRF score (descending)
    ranked_doc_ids = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)
    for doc_id, score in ranked_doc_ids:
        print(f"doc_id: {doc_id}, score : {score} ")

    # Return the corresponding Document objects
    return [doc_store[doc_id] for doc_id, _ in ranked_doc_ids]
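The function above receives a single flattened list, so a document's rank is its position in the concatenated results rather than within its own query. A possible variant, sketched here under assumptions, takes one list per query instead. The Doc dataclass is a hypothetical stand-in for LangChain's Document (same page_content and metadata attributes) so the example runs without dependencies, and rank_document_lists is an illustrative name that is not part of the original project.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, Hashable, List


@dataclass
class Doc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def doc_key(doc: Doc) -> Hashable:
    # Prefer an explicit id; fall back to the chunk text itself as the key.
    doc_id = doc.metadata.get("id")
    return doc_id if doc_id is not None else doc.page_content.strip()


def rank_document_lists(doc_lists: List[List[Doc]], k: int = 60) -> List[Doc]:
    """Fuse one ranked list per query into a single ranking."""
    scores: Dict[Hashable, float] = defaultdict(float)
    first_seen: Dict[Hashable, Doc] = {}
    for docs in doc_lists:
        # Rank restarts at 1 for every query's result list.
        for rank, doc in enumerate(docs, start=1):
            key = doc_key(doc)
            scores[key] += 1.0 / (k + rank)
            first_seen.setdefault(key, doc)
    ordered = sorted(scores, key=lambda key: scores[key], reverse=True)
    return [first_seen[key] for key in ordered]


a, b, c = Doc("alpha", {"id": 1}), Doc("beta", {"id": 2}), Doc("gamma")
ranked = rank_document_lists([[a, b], [b, c], [b, a]])
print([doc.page_content for doc in ranked])  # ['beta', 'alpha', 'gamma']
```

To use a variant like this, the retrieval step would need to append each query's results as a separate list instead of extending one flat list.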
Step 4: Generating output
Using the ranked documents and the user's original query, we generate the output, which will be more accurate.
import os
from openai import OpenAI
from retriever.retrival import reciprocal_rank_fusion

api_key = os.getenv("GEMINI_API_KEY")

client = OpenAI(
    api_key=api_key,
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def llm_chat(query: str):
    unique_docs = reciprocal_rank_fusion(query)

    context = "\n\n"
    for doc in unique_docs:
        # print(doc)
        context += f"Page Content: {doc.page_content}\nPage Number: {doc.metadata['page_label']}\nFile Location: {doc.metadata['source']} \n\n"

    SYSTEM_PROMPT = f"""
You are a helpful AI assistant who answers user query based on the available context retrieved from a PDF file along with page_contents and page number.
You should only answer the user based on the following context and navigate the user to open the right page number to know more.
answer should be in details.
Context:
{context}
"""

    chat_completion = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query}
        ],
    )

    return chat_completion.choices[0].message.content
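The description mentions taking only the top n documents, which can be applied just before building the prompt. Here is a minimal sketch, assuming the ranked documents expose page_content and a metadata dict as LangChain documents do. The names build_context and top_n, and the SimpleNamespace stand-ins, are hypothetical and used only for illustration.

```python
from types import SimpleNamespace
from typing import Any, List


def build_context(ranked_docs: List[Any], top_n: int = 3) -> str:
    """Format the top_n ranked documents into a context string for the prompt."""
    parts = []
    for doc in ranked_docs[:top_n]:
        # .get avoids a KeyError when a chunk lacks page or source metadata.
        page = doc.metadata.get("page_label", "unknown")
        source = doc.metadata.get("source", "unknown")
        parts.append(
            f"Page Content: {doc.page_content}\n"
            f"Page Number: {page}\n"
            f"File Location: {source}"
        )
    return "\n\n".join(parts)


# Stand-in documents with made-up content.
docs = [
    SimpleNamespace(page_content=f"chunk {i}", metadata={"page_label": str(i)})
    for i in range(1, 6)
]
context = build_context(docs, top_n=2)
print(context)
```

Limiting the context to the best-ranked chunks keeps the prompt short and leaves out the low-scoring documents that appeared for only one query.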
Benefits of Reciprocal Rank Fusion
Even if some of the generated queries fetch documents that are not entirely relevant to the user query, ranking pushes those documents down, and we take only the closely relevant documents to answer the user query.
Source Code
retrival.py
import os
from langchain_google_genai import ChatGoogleGenerativeAI
from typing import List, Set
import concurrent.futures
from langchain_core.documents import Document
from utils.output_parser import output_parser
from config.vector_store import get_vector_store
from llm.prompt_templates import QUERY_REWRITE_PROMPT
from utils.rank_docs import rank_documents

# ----- setup --------
qdrant = get_vector_store()

llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    google_api_key=os.getenv("GEMINI_API_KEY"),
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

llm_chain = QUERY_REWRITE_PROMPT | llm | output_parser

def reciprocal_rank_fusion(
    user_query: str,
    llm_chain=llm_chain,
    retriever=qdrant.as_retriever()
) -> List[Document]:
    """
    Generate multiple queries using llm_chain, fetch documents using retriever,
    and return a ranked, deduplicated list of documents.
    Args:
        user_query (str): The input user query.
        llm_chain: A Runnable (PromptTemplate | LLM | OutputParser) for generating queries.
        retriever: A LangChain retriever (e.g., qdrant.as_retriever()).
    Returns:
        List[Document]: Ranked, deduplicated documents retrieved across all queries.
    """
    # Step 1: Generate multiple queries
    generated_queries = llm_chain.invoke(user_query)

    # Step 2: Fetch documents in parallel
    def fetch_docs(query: str):
        return retriever.invoke(query)

    all_docs: List[Document] = []
    with concurrent.futures.ThreadPoolExecutor() as executor:
        results = executor.map(fetch_docs, generated_queries)
        for docs in results:
            all_docs.extend(docs)

    # Step 3: Rank docs
    sorted_docs = rank_documents(all_docs)
    return sorted_docs
chat.py
import os
from openai import OpenAI
from retriever.retrival import reciprocal_rank_fusion

api_key = os.getenv("GEMINI_API_KEY")

client = OpenAI(
    api_key=api_key,
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def llm_chat(query: str):
    search_results = reciprocal_rank_fusion(query)

    context = "\n\n"
    for doc in search_results:
        context += f"Page Content: {doc.page_content}\nPage Number: {doc.metadata['page_label']}\nFile Location: {doc.metadata['source']} \n\n"

    SYSTEM_PROMPT = f"""
You are a helpful AI assistant who answers user query based on the available context retrieved from a PDF file along with page_contents and page number.
You should only answer the user based on the following context and navigate the user to open the right page number to know more.
answer should be in details.
Context:
{context}
"""

    chat_completion = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query}
        ],
    )

    return chat_completion.choices[0].message.content
main.py
import os
from dotenv import load_dotenv
from generator.chat import llm_chat

load_dotenv()

def main():
    query = "what is fs module?"
    result = llm_chat(query=query)
    print(result)

if __name__ == "__main__":
    main()
Full source code available at: GitHub
Written by Ganesh Ghadage