From Queries to Quality: Implementing Reciprocal Rank Fusion

Introduction
Reciprocal Rank Fusion (RRF) is an information-retrieval method that merges several ranked result lists into one enhanced ordering. In each list where a document appears, it receives a score of 1 / (k + rank), where k is a small smoothing constant, and these reciprocal scores are summed across lists to produce the final ranking. By emphasizing items that appear near the top of multiple lists, RRF effectively surfaces the most relevant results.
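To make the scoring concrete, here is a tiny, self-contained sketch of the idea. The two ranked lists, the document IDs, and the constant k = 60 (a value commonly used in the RRF literature) are made up purely for illustration; the retrieval code later in this post uses its own value of k.
#rrf_toy_example.py (illustrative only)
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank)
    # highest combined score first
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

list_a = ["doc3", "doc1", "doc7"]  # ranking from one query
list_b = ["doc1", "doc3", "doc9"]  # ranking from another query
print(rrf([list_a, list_b]))
# doc3 and doc1 come out on top because both lists rank them highly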
Pipeline Overview
Before You Begin
📥 Ingest Data and ✂️ Chunk Text
🔢 Generate Embeddings and 💾Store in Vector DB
🔄 Decompose Query
🔍 Retrieve Top‑K and ➗ Fuse Rankings with Reciprocal Rank Fusion
✍️ Generate Answer
Before You Begin
Before installing any packages, create a virtual environment:
# 1. Create a virtual environment named .venv
python -m venv .venv
# 2. Activate it
# On macOS / Linux:
source .venv/bin/activate
# On Windows (PowerShell):
.venv\Scripts\Activate.ps1
# On Windows (Command Prompt):
.venv\Scripts\activate.bat
📥 Ingest Data and ✂️ Chunk Text
Start by bringing in all the source material you want your system to “know.”
Examples: PDFs of manuals, GitHub READMEs, web‑scraped articles, CSV exports.
Goal: Make sure you extract clean text (strip out headers/footers, fix encoding issues) and record metadata (source filename, page number, date) so you can always trace back where an answer came from.
To do this, we need to install the packages langchain_community and pypdf.
Run the following command in the terminal:
pip install langchain_community pypdf
#loader.py
from langchain_community.document_loaders import PyPDFLoader
from pathlib import Path

# Path to the PDF you want to ingest (replace the placeholder file name)
pdf_path = Path(__file__).parent / "file_name.extension_type"

# Load the PDF; each page becomes one Document with its text and metadata
loader = PyPDFLoader(file_path=pdf_path)
doc = loader.load()
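As a quick optional check, you can print what was loaded; PyPDFLoader typically records the source path and page number in each Document's metadata, which is what lets you trace answers back to their origin later.
#loader.py (optional sanity check)
print(len(doc), "pages loaded")
print(doc[0].metadata)            # typically includes the source path and page number
print(doc[0].page_content[:200])  # first 200 characters of extracted text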
LLMs have finite context windows; if you handed a 500‑page PDF to an LLM, it wouldn’t fit.
Split into ~500–1,000 token chunks, often with a 10–20% overlap so that you don’t lose sentence continuity at chunk boundaries.
Why: Smaller chunks both fit in the model’s context and allow more precise matching when you retrieve later.
chunk_size = 1000 – each slice of text will be at most 1,000 characters long (RecursiveCharacterTextSplitter counts characters by default).
chunk_overlap = 200 – each new slice repeats the last 200 characters of the previous slice so context flows smoothly across chunks.
#loader.py
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split the page-level documents into overlapping ~1,000-character chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
split_doc = text_splitter.split_documents(documents=doc)
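A quick optional check that the split worked, reusing the doc and split_doc variables from above:
#loader.py (optional)
print(f"{len(doc)} pages -> {len(split_doc)} chunks")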
🔢 Generate Embeddings and 💾Store in Vector DB
Each chunk is passed through an embedding model (e.g. text‑embedding‑ada-002) that turns it into a fixed‑length vector in semantic space.
Similar meaning → nearby points in vector space. “How do I reset my password?” and “password reset steps” end up close together.
I’m using Google AI embeddings for this example, but you can use OpenAI embeddings instead. You can see all the available embedding integrations on the LangChain Embeddings page.
Note: Create a .env file to store your Google API key, and use python-dotenv to load it into your Python script.
To use GoogleGenerativeAIEmbeddings and load_dotenv, you first need to install the integration package langchain-google-genai and the python-dotenv package.
pip install langchain-google-genai
pip install python-dotenv
#loader.py
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import os
from dotenv import load_dotenv

load_dotenv()  # reads GOOGLE_API_KEY from the .env file into the environment

if "GOOGLE_API_KEY" not in os.environ:
    raise EnvironmentError("GOOGLE_API_KEY is not set; add it to your .env file")

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004",
)
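To see the “similar meaning → nearby points” idea in action, you can embed two related questions and compare them. This optional sketch reuses the embeddings object defined above and computes cosine similarity by hand:
#loader.py (optional sanity check)
v1 = embeddings.embed_query("How do I reset my password?")
v2 = embeddings.embed_query("password reset steps")

# cosine similarity: closer to 1.0 means more semantically similar
dot = sum(a * b for a, b in zip(v1, v2))
norm = (sum(a * a for a in v1) ** 0.5) * (sum(b * b for b in v2) ** 0.5)
print(dot / norm)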
Those vectors, plus your chunk text and metadata, go into a specialized index (Pinecone, Qdrant, FAISS, etc.).
Why use a vector DB? It lets you do ultra‑fast approximate nearest‑neighbor searches over millions of vectors, usually in milliseconds.
Here we’re using the Qdrant vector database.
You can either install it directly on your system or run it in Docker; I’m using Docker in this example.
# docker-compose.yml
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
To run this Docker Compose file from the terminal:
docker compose -f docker-compose.yml up
Once the container is running, you can connect to Qdrant at http://localhost:6333.
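To verify the instance is up (optional), you can hit Qdrant's REST API; on a fresh instance, listing collections returns an empty list. Recent Qdrant images also serve a small web UI at http://localhost:6333/dashboard.
curl http://localhost:6333/collections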
To use QdrantVectorStore and QdrantClient, you first need to install the integration package langchain-qdrant:
pip install langchain-qdrant
#loader.py
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# Create (or connect to) the collection; the embedding model defines the vector size
vector_store = QdrantVectorStore.from_documents(
    documents=[],
    url="http://localhost:6333",
    embedding=embeddings,
    collection_name="learning_langchain"  # retrieval.py must use this same collection name
)

# Embed and store all the chunks produced by the text splitter
vector_store.add_documents(documents=split_doc)
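Once the chunks are stored, an optional similarity search confirms retrieval works end to end; the query string here is just an example:
#loader.py (optional)
results = vector_store.similarity_search("What is the fs module?", k=3)
for r in results:
    print(r.metadata.get("page"), r.page_content[:100])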
🔄 Decompose Query
Use the LLM to split the user’s original question into several targeted, semantically distinct sub‑queries. For instance, from:
“What is fs module?”
you could derive:
What is a “module” in Node.js?
What does “fs” abbreviate?
What capabilities does Node.js’s fs module offer?
Why this matters
Broader coverage: Retrieves documents matching different phrasing.
Reduced ambiguity: Each sub‑query zeroes in on a specific facet.
Sharper embeddings: More focused queries produce embedding vectors that better align with the most relevant text.
To use the OpenAI client library (which we point at Gemini's OpenAI-compatible endpoint below), you first need to install the openai package:
pip install openai
#main.py
from openai import OpenAI
from dotenv import load_dotenv
import os
import json

load_dotenv()

# Gemini accessed through its OpenAI-compatible endpoint
client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def ai(message):
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=message,
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

system_prompt = """
You are a helpful AI assistant who is specialized in resolving user queries.
You break the user query into three to five different queries.

Example: "What is FS module?"
You break this question into different questions:
- What is a module in Node.js?
- What does fs stand for?
- What functionalities does the fs module provide in Node.js?

You respond with a JSON object in this format:
{
  "questions": [
    "What is a module in Node.js?",
    "What does fs stand for?",
    "What functionalities does the fs module provide in Node.js?"
  ]
}
"""

query = input("> ")
message = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": query}
]
# the JSON response holds the sub-queries under the "questions" key
question = ai(message).get("questions", [])

print("\nQuestions: ")
print(question)
🔍 Retrieve Top‑K and ➗ Fuse Rankings with Reciprocal Rank Fusion
For each decomposed sub‑query, you hit your vector database (e.g. Qdrant, FAISS, Pinecone) with a semantic‐similarity search. The goal is to pull back the K most relevant chunks—typically 10–20 passages—that best match your query embedding.
Why Top‑K? Grabbing only the highest‑scoring chunks keeps your context tight and your LLM prompt focused on the most pertinent information.
Once you have multiple ranked lists, one per sub‑query, RRF merges them into a single consensus list by:
Scoring each document by summing 1 / (k + rank + 1) across all ranking lists (the code below uses zero-based ranks, so the top result contributes 1 / (k + 1)).
Sorting documents by their total score in descending order.
#main.py
from retrieval import retrieve

# fused, formatted context chunks for all sub-queries
relevant_chunks = retrieve(question)
#retrieval.py
from langchain_qdrant import QdrantVectorStore
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from dotenv import load_dotenv

load_dotenv()  # make sure GOOGLE_API_KEY from .env is available in this module too


def reciprocal_rank_fusion(rankings, k=15):
    """Fuse several ranked lists of document IDs into one scored list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # enumerate is zero-based, so the top result contributes 1 / (k + 1)
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank + 1)
    # highest combined score first
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)


def retrieve(queries, k=15):
    # must be the same embedding model used at ingestion time
    embedding = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

    # connect to the collection that loader.py populated
    vector_store = QdrantVectorStore.from_existing_collection(
        collection_name="learning_langchain",  # must match the name used in loader.py
        embedding=embedding,
        url="http://localhost:6333",
    )

    # run each sub-query, collect rankings of IDs, and keep a lookup back to the documents
    rankings = []
    lookup = {}
    for q in queries:
        docs = vector_store.similarity_search(query=q, k=k)
        ids = []
        for d in docs:
            # fall back to a page/content hash when there is no unique metadata["id"]
            doc_id = d.metadata.get("id") or f"{d.metadata.get('page')}#{hash(d.page_content)}"
            ids.append(doc_id)
            lookup[doc_id] = d
        rankings.append(ids)

    # fuse the ranked ID lists
    fused = reciprocal_rank_fusion(rankings)

    # map fused IDs back to Document objects, preserving the fused order
    fused_docs = [lookup[doc_id] for doc_id, score in fused if doc_id in lookup]

    # format the chunks with their page numbers for the answer prompt
    formatted = []
    for doc in fused_docs:
        page = doc.metadata.get("page", "?")
        text = doc.page_content.strip()
        formatted.append(f"[Page {page}]\n{text}")
    return "\n\n".join(formatted)
✍️ Generate Answer
We feed the assembled prompt, which combines the retrieved, labeled chunks with the user’s original question, into the chosen language model. The LLM then uses both its internal knowledge and the provided context to generate a coherent, fact‑grounded response.
#answer_ai.py
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def answer_AI(query, assistant):
    system_prompt = """
    You are a helpful AI assistant who is specialized in resolving user queries.
    Note:
    - Answer in detail.
    - You receive a question and answer it based on the assistant content.
    - Mention the page numbers the information was taken from.
    - If you add anything from your own knowledge, say clearly what you added.
    """
    message = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
        {"role": "assistant", "content": assistant}
    ]
    # the answer is plain text, so no JSON response_format is needed here
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=message
    )
    return response.choices[0].message.content
Finally, pass the original query and the fused chunks into answer_AI from answer_ai.py:
#main.py
from answer_ai import answer_AI

output = answer_AI(query, relevant_chunks)
print("\n------------------")
print("Answer: ")
print(output)
Executing the Code
With the virtual environment activated and the Qdrant container running, execute the main.py file from the terminal:
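python main.py
(Run python loader.py once beforehand so the Qdrant collection is already populated.)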
Full Source Code
Grab everything—loader, retrieval, fusion, and answer‑generation—in one repo:
https://github.com/SurajPatel04/genAI/tree/main/cohort/day5class/reciprocol_rank_fusion