Boost Retrieval Quality with Parallel Query Retrieval (Fan‑out)

Introduction
Instead of issuing a single search request, you "fan‑out" the original query into multiple variations, each phrased differently or targeting a different aspect of the question, and run them in parallel. You then retrieve a small set of top‑k documents for each variation and merge the results.
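Here is a minimal sketch of the idea. The two callables are placeholders for pieces we build later in this post: an LLM that rewrites the query, and a vector store that runs the top‑k similarity search.

from typing import Callable, List

# Illustrative sketch of fan-out retrieval. generate_variations and
# retrieve_top_k are placeholders; the sections below implement them with
# an LLM (query decomposition) and a Qdrant vector store (similarity search).
def fan_out_retrieve(
    original_query: str,
    generate_variations: Callable[[str], List[str]],
    retrieve_top_k: Callable[[str], List[str]],
) -> List[str]:
    merged, seen = [], set()
    for variation in generate_variations(original_query):
        for chunk in retrieve_top_k(variation):   # top-k chunks for this variation
            if chunk not in seen:                 # de-duplicate across variations
                seen.add(chunk)
                merged.append(chunk)
    return merged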
Pipeline Overview
Before You Begin
📥 Ingest Data
✂️ Chunk Text
🔢 Generate Embeddings
💾 Store in Vector DB
🔄 Decompose Query (Fan‑Out)
🔍 Retrieve & Merge Chunks
✍️ Generate Final Answer
Before You Begin
Before installing any packages, create and activate a virtual environment:
# 1. Create a virtual environment named .venv
python -m venv .venv
# 2. Activate it
# On macOS / Linux:
source .venv/bin/activate
# On Windows (PowerShell):
.venv\Scripts\Activate.ps1
# On Windows (Command Prompt):
.venv\Scripts\activate.bat
📥 Ingest Data
You start by bringing in all the source material you want your system to “know.”
Examples: PDFs of manuals, GitHub READMEs, web‑scraped articles, CSV exports.
Goal: Make sure you extract clean text (strip out headers/footers, fix encoding issues) and record metadata (source filename, page number, date) so you can always trace back where an answer came from.
To do this, we need to install the langchain_community and pypdf packages.
Run the following command in the terminal:
pip install langchain_community pypdf
#parallel_query_loader.py
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader

# Point the loader at the PDF sitting next to this script
pdf_path = Path(__file__).parent / "file_name.extension_type"

loader = PyPDFLoader(file_path=pdf_path)
doc = loader.load()  # one Document per page, with source and page metadata
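If you want to confirm what was loaded, a quick check (assuming the loader code above) prints the page count and the metadata of the first page:

print(f"Loaded {len(doc)} pages")
print(doc[0].metadata)            # source file path and page number
print(doc[0].page_content[:200])  # first 200 characters of the first page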
✂️ Chunk the Text
LLMs have finite context windows; if you handed a 500‑page PDF to an LLM, it wouldn't fit.
How: Split into ~500–1,000 token chunks, often with a 10–20% overlap so that you don’t lose sentence continuity at chunk boundaries.
Why: Smaller chunks both fit in the model’s context and allow more precise matching when you retrieve later.
chunk_size = 1000 – each slice of text will be at most 1,000 characters long (the splitter counts characters by default).
chunk_overlap = 200 – each new slice repeats the last 200 characters of the previous slice so context flows smoothly across chunks.
#parallel_query_loader.py
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # max characters per chunk
    chunk_overlap=200,  # characters repeated from the previous chunk
)

split_doc = text_splitter.split_documents(documents=doc)
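A quick check (assuming the splitter code above) shows how many chunks were produced; each chunk keeps the metadata of the page it was cut from:

print(f"Split {len(doc)} pages into {len(split_doc)} chunks")
print(split_doc[0].metadata)  # same source/page metadata as the original page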
🔢 Generate Embeddings
Each chunk is passed through an embedding model (e.g. text‑embedding‑ada-002) that turns it into a fixed‑length vector in semantic space.
Similar meaning → nearby points in vector space. “How do I reset my password?” and “password reset steps” end up close together.
For this example, I'm using Google AI embeddings, but you can use OpenAI embeddings instead. You can see all the available embedding integrations on the LangChain Embeddings page.
Note: Create a .env file to store your Google API key, and use python-dotenv to load it into your Python script.
To use GoogleGenerativeAIEmbeddings and load_dotenv, you first need to install the integration packages langchain-google-genai and python-dotenv:
pip install langchain-google-genai
pip install python-dotenv
#parallel_query_loader.py
import os

from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Load GOOGLE_API_KEY from the .env file into the environment
load_dotenv()
if "GOOGLE_API_KEY" not in os.environ:
    raise EnvironmentError("GOOGLE_API_KEY is not set; add it to your .env file")

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004",
)
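As a quick sanity check (assuming the embeddings object above), you can embed a single query and inspect the vector length; text-embedding-004 produces 768-dimensional vectors:

vector = embeddings.embed_query("How do I reset my password?")
print(len(vector))  # 768 for text-embedding-004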
💾 Store Embeddings in a Vector Database
Those vectors, plus your chunk text and metadata, go into a specialized index (Pinecone, Qdrant, FAISS, etc.).
Why use a vector DB? It lets you do ultra‑fast approximate nearest‑neighbor searches over millions of vectors, usually in milliseconds.
Here we're using the Qdrant vector database.
You can either install it directly on your system or run it in Docker; I'm using Docker in this example.
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
To run this Docker Compose file, run the following in the terminal:
docker compose -f docker-compose.yml up
Once the container is running, you can connect to Qdrant at http://localhost:6333.
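Optionally, you can verify the connection from Python before indexing anything (this uses qdrant-client, which is installed together with langchain-qdrant in the next step):

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())  # no collections yet on a fresh instance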
To use QdrantVectorStore and QdrantClient, you first need to install the integration package langchain-qdrant:
pip install langchain-qdrant
#parallel_query_loader.py
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# Create (or connect to) the collection; the name must match the one used
# later in parallel_query_retrieval.py
vector_store = QdrantVectorStore.from_documents(
    documents=[],
    url="http://localhost:6333",
    embedding=embeddings,
    collection_name="parallel_query",
)

vector_store.add_documents(documents=split_doc)  # embed and index the chunks
After you run the code, you can view the collection in the Qdrant vector database at http://localhost:6333/dashboard
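As a quick sanity check (assuming the vector_store object above), you can run a similarity search directly against the freshly populated collection; swap in a query that matches your own document:

results = vector_store.similarity_search("What is the fs module?", k=3)
for doc in results:
    print(doc.metadata.get("page"), doc.page_content[:100])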
🔄 Decompose Query
Break the original user question into several focused, semantically distinct sub‑queries using the LLM. For example, given
“What is fs module?”
you might generate:
What is a module in Node.js?
What does "fs" stand for?
What functionalities does the fs module provide in Node.js?
Why This Helps
Broader Coverage: Captures documents that match specific wording or context.
Reduced Ambiguity: Each variant zeroes in on one angle of the user’s need.
Sharper Embeddings: Focused queries yield embedding vectors that align more closely with relevant chunks.
To use OpenAI (here, the OpenAI SDK pointed at Google's Gemini OpenAI-compatible endpoint), you first need to install the openai package:
pip install openai
#main.py
import json
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# The OpenAI SDK pointed at Google's Gemini OpenAI-compatible endpoint
client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)


def ai(message):
    # Ask the model for a JSON object so the output can be parsed reliably
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=message,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
system_prompt = """
You are a helpful AI Assistant who is specialized in resolving user queries.
You break the user query into three to five different queries.

Example: "What is FS module?"
You break this question into different questions:
- What is a module in Node.js?
- What does fs stand for?
- What functionalities does the fs module provide in Node.js?

You respond with a JSON object containing a "questions" array, like this:
Output: {
    "questions": [
        "What is a module in Node.js?",
        "What does fs stand for?",
        "What functionalities does the fs module provide in Node.js?"
    ]
}
"""
query = input("> ")

message = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": query},
]

questions = ai(message)["questions"]
print("\nQuestions: ")
print(questions)
🔍 Retrieve Relevant Chunks and Filter Duplicates
Execute each decomposed sub‑query as its own top‑K similarity search, then merge results and filter for uniqueness:
parallel_query_retrieval.py File
#parallel_query_retrieval.py
import os

from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_qdrant import QdrantVectorStore


def retrieve(query: str) -> str:
    if "GOOGLE_API_KEY" not in os.environ:
        raise EnvironmentError("GOOGLE_API_KEY is not set; add it to your .env file")

    embedding = GoogleGenerativeAIEmbeddings(
        model="models/text-embedding-004"
    )

    # Connect to the collection created by parallel_query_loader.py
    retriever = QdrantVectorStore.from_existing_collection(
        collection_name="parallel_query",
        embedding=embedding,
        url="http://localhost:6333",
    )

    # Top-k similarity search for this sub-query (k defaults to 4)
    relevant_chunks = retriever.similarity_search(query=query)

    # Filter out duplicate chunks (same page and same text)
    seen = set()
    unique = []
    for doc in relevant_chunks:
        content = doc.page_content.strip()
        page = doc.metadata.get("page")
        key = (page, content)
        if key not in seen:
            seen.add(key)
            unique.append(doc)

    # Label each chunk with its page number so the answer can cite it
    formatted = []
    for doc in unique:
        snippet = f"[Page {doc.metadata.get('page')}] \n{doc.page_content}"
        formatted.append(snippet)

    context = "\n\n".join(formatted)
    return context
Passing all the questions to the retrieve function in parallel_query_retrieval.py:
#main.py
from parallel_query_retrieval import retrieve

# Run a top-k retrieval for every decomposed sub-query and collect the contexts
array = []
for q in questions:
    answer = retrieve(q)
    array.append(answer)
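The loop above runs the sub-query searches one after another. If you want the fan-out to be genuinely parallel, a small variation (a sketch, assuming the same retrieve function and questions list) runs them concurrently with a thread pool:

#main.py (optional variation: run the sub-query retrievals concurrently)
from concurrent.futures import ThreadPoolExecutor

from parallel_query_retrieval import retrieve

with ThreadPoolExecutor(max_workers=5) as executor:
    array = list(executor.map(retrieve, questions))  # preserves question order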
✍️ Generate Final Answer
We feed the assembled prompt, which combines the retrieved, labeled chunks with the user's original question, into the chosen language model. The LLM then uses both its internal knowledge and the provided context to generate a coherent, fact‑grounded response.
answer_ai.py File
#answer_ai.py
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)


def answer_AI(query, assistant):
    system_prompt = """
    You are a helpful AI Assistant who is specialized in resolving user queries.
    Note:
    - The answer should be detailed.
    - You receive a question and answer it based on the assistant content.
    - Mention the page number(s) the information was picked from.
    - If you add something of your own, say where you added it.
    """

    message = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
        {"role": "assistant", "content": assistant},
    ]

    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=message,
    )
    return response.choices[0].message.content
Passing the original query and all the retrieved chunks to answer_AI in answer_ai.py:
#main.py
from answer_ai import answer_AI
output = answer_AI(query, json.dumps(array))
print("\n---------------------------")
print("Answer: ")
print(output)
Executing the Code
Executing the main.py file
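With the virtual environment activated and the Qdrant container running, index the PDF once and then start the main script, i.e. something like:

python parallel_query_loader.py
python main.py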
Full Source Code
You can find the complete implementation here:
https://github.com/SurajPatel04/genAI