🔍 Parallel Query Retrieval (Fan Out): A Deep Dive into Advanced RAG Query Translation Patterns

Table of contents
- Introduction
- 🤷♂️ The Problem: What if Your Question is Confusing?
- What is Parallel Query Retrieval (Fan Out)?
- Why Fan Out?
- 📚 Real-Life Analogy: Asking Five Teachers Instead of One
- 🔍 Step-by-Step Breakdown of Parallel Query Retrieval
- 💻 Step 1: User Input
- 💬 Step 2: Prompt the LLM to Reframe the Query
- 🧠 Step 3: Generate Multiple Query Variations
- 🔎 Step 4: Perform Similarity Search for Each Query
- 📦 Step 5: Retrieve Relevant Chunks (Vector Embeddings)
- 🧹 Step 6: Deduplicate and Filter Unique Chunks
- 🔁 Step 7: Enrich the Original Query Using Filtered Chunks
- 🧾 Step 8: Generate the Final Response
- 🚀 Implementation: Parallel Query Retrieval in Action
- 🧪 Run the System

Ever asked a smart assistant a question and didn’t get the answer you were hoping for? What if it could understand what you meant, even when you said it wrong? That’s where Parallel Query Retrieval comes in—your AI’s superpower to “think in multiple directions” at once.
Introduction
What is RAG?
Before jumping into Parallel Query Retrieval (PQR), let’s break down the basics.
Retrieval-Augmented Generation (RAG) is a hybrid approach that combines:
🗂️ Retrieval: Getting relevant facts or documents from a database (like pulling pages from a book).
✍️ Generation: Using a language model (like ChatGPT) to generate a natural language answer using that information.
Instead of relying only on what the model “remembers,” RAG lets it look things up in real-time. It’s like a student checking their textbook before answering your question.
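In code, that two-step loop is tiny. Here is a minimal Python sketch; retrieve() and generate() are hypothetical placeholders for the vector search and LLM call we build later in this post:
# Minimal RAG sketch. retrieve() and generate() are hypothetical stand-ins
# for a real vector-store lookup and a real LLM call.
def retrieve(question: str) -> list[str]:
    return ["<chunks returned by your vector database>"]

def generate(prompt: str) -> str:
    return "<answer written by your LLM>"

def answer_with_rag(question: str) -> str:
    context = "\n\n".join(retrieve(question))  # Retrieval: look things up
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)                    # Generation: write the answer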
🤷‍♂️ The Problem: What if Your Question is Confusing?
Let’s say you type:
“How do mobil apps keep file like documnts or photo save on net withut losing?”
That’s:
Full of typos
Missing structure
Technically vague
A regular system might get confused. But using Parallel Query Retrieval, the AI rewrites your messy question into better ones and searches for answers from multiple angles.
What is Parallel Query Retrieval (Fan Out)?
Parallel Query Retrieval (PQR) is an enhancement to the RAG pipeline where instead of using your original query as-is, the system:
Uses an LLM to rewrite your question in multiple meaningful ways.
Sends each rewritten version to the retrieval system (vector database).
Collects and filters results from each version.
Uses this rich context to generate a final answer.
This technique is also known as Fan Out, because it spreads one query into many directions.
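In code form, the whole pattern fits in a few lines. This is only a sketch with stubbed-out rewrite_query() and vector_search() helpers; the real versions appear in the implementation section below:
# Fan Out sketch: one query becomes several, each searched separately.
def rewrite_query(query: str, n: int = 3) -> list[str]:
    return [f"{query} (variant {i + 1})" for i in range(n)]  # stub for the LLM rewrite step

def vector_search(query: str) -> list[str]:
    return [f"<chunk matching: {query}>"]                     # stub for the vector DB lookup

def fan_out_retrieve(user_query: str) -> list[str]:
    variants = rewrite_query(user_query)                       # 1. rewrite into variants
    chunks = [c for v in variants for c in vector_search(v)]   # 2. search per variant
    return list(dict.fromkeys(chunks))                         # 3. deduplicate, keep order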
Why Fan Out?
The name comes from the Fan Out architecture in system design. If you've ever ordered something from an online platform like Amazon or Flipkart, you've noticed that you receive an email, an SMS, and a WhatsApp message.
This happens because one event is pushed onto multiple queues, spreading the same message across several consumers; that pattern is called Fan Out. In our case, one user query is converted into several different queries, so several retrieval pipelines run in parallel and converge at the end. It's the same idea.
📚 Real-Life Analogy: Asking Five Teachers Instead of One
Say you’re a student confused about how mobile apps use cloud storage.
Instead of asking just one teacher, you ask five:
“How does Google Drive work in mobile apps?”
“Why do apps save photos to the cloud?”
“How do apps prevent data loss in storage?”
“What is cloud backup for apps?”
“How is cloud used in file sync between devices?”
Each teacher explains a part of it.
Now, you combine their answers and truly understand the concept.
That’s exactly what PQR does.
🔍 Step-by-Step Breakdown of Parallel Query Retrieval
Let’s walk through a beginner-friendly flow:
💻 Step 1: User Input
The process begins when a user submits a question or query — this can be well-structured, informal, or even slightly flawed.
Example:
“how do mobile apps save user photos to cloud without losing them?”
This raw query often lacks precision and may include typos, vague language, or limited context.
💬 Step 2: Prompt the LLM to Reframe the Query
We send the user's input to a Large Language Model (LLM) like OpenAI’s GPT or Google’s Gemini, along with a carefully crafted system prompt.
Prompt Example:
“Rephrase the following user query into 5 semantically diverse and contextually relevant versions that retain the original intent.”
This step helps the system explore the query from different angles, increasing the chances of retrieving richer, more accurate information.
🧠 Step 3: Generate Multiple Query Variations
The LLM processes the prompt and outputs multiple intelligently rephrased queries.
LLM Output (for the cloud storage example):
How do mobile apps use cloud storage to save images?
What methods are used by apps to backup user files online?
How is data loss prevented in cloud photo storage for apps?
What is the process of uploading files from an app to the cloud?
How do cloud services ensure user data safety in apps?
These versions explore the same topic through different linguistic and conceptual lenses.
🔎 Step 4: Perform Similarity Search for Each Query
Each rewritten query is transformed into a vector embedding using an embedding model.
These vectors are then used to perform similarity searches against a vector database such as Pinecone, Weaviate, or Qdrant. The system finds and retrieves the top-matching results (chunks) for each query.
Think of this as running 5 separate smart searches across your knowledge base in parallel.
📦 Step 5: Retrieve Relevant Chunks (Vector Embeddings)
For each of the rephrased queries, the similarity search fetches the most relevant content — often short passages, paragraphs, or metadata-rich documents.
These text segments, also known as chunks, are stored in their vector form and contain valuable context tied to each version of the question.
🧹 Step 6: Deduplicate and Filter Unique Chunks
Now, all chunks from the five queries are combined into a single pool.
At this stage:
🔁 Duplicates are removed
📊 Low-relevance items are filtered out
🏆 The most informative and distinct chunks are retained
The result is a compact, clean, and contextually rich knowledge base, fine-tuned to answer the original question.
🔁 Step 7: Enrich the Original Query Using Filtered Chunks
With the high-quality, filtered chunks in hand, the original user query is once again sent to the LLM — this time alongside the supporting context.
This provides the LLM with external knowledge that compensates for ambiguity or gaps in the user’s original input.
Think of this step as giving your AI some well-organized notes to refer to while answering.
🧾 Step 8: Generate the Final Response
Finally, the LLM uses the enriched context to generate a precise, comprehensive, and context-aware response to the user’s query.
The answer is no longer based on “just guessing.” It’s grounded in retrieved facts, filtered insights, and semantic clarity — all tailored to the original question’s intent.
🚀 Implementation: Parallel Query Retrieval in Action
🧾 1. Load and Preprocess the Document
We begin by loading a PDF document and splitting it into smaller chunks for easier indexing and retrieval.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path
pdf_path = Path(__file__).parent / "1706.03762v7.pdf"
loader = PyPDFLoader(file_path=pdf_path)
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_docs = text_splitter.split_documents(docs)
What this does:
Loads the PDF into memory
Splits the content into chunks of ~1000 characters with some overlap to preserve context between segments
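As a quick sanity check (the exact numbers depend on your PDF), you can print how many chunks were produced and peek at the first one:
print(f"{len(docs)} pages -> {len(split_docs)} chunks")
print(split_docs[0].page_content[:200])  # first 200 characters of the first chunk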
🧠 2. Embed the Document Chunks
We now convert text chunks into vector embeddings using OpenAI’s embedding model.
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv
import os
load_dotenv()
embedder = OpenAIEmbeddings(
    model="text-embedding-3-small",
    api_key=os.getenv("OPENAI_API_KEY")
)
These embeddings represent the semantic meaning of text numerically, enabling similarity searches later.
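For example, assuming OPENAI_API_KEY is set in your .env file, embedding a short query returns a fixed-length vector:
sample_vector = embedder.embed_query("cloud photo backup")
print(len(sample_vector))  # text-embedding-3-small produces 1536-dimensional vectors by default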
🧱 3. Store in Vector Database (Qdrant)
We store our embedded chunks in a vector store like Qdrant for fast similarity-based search.
from langchain_qdrant import QdrantVectorStore
# To inject documents (only run once); from_documents already embeds and
# inserts split_docs, so no separate add_documents call is needed
vector_store = QdrantVectorStore.from_documents(
    documents=split_docs,
    url="http://localhost:6333",
    collection_name="learning_langchain_PQR",
    embedding=embedder
)
# ----
# To retrieve later
retriever = QdrantVectorStore.from_existing_collection(
    url="http://localhost:6333",
    collection_name="learning_langchain_PQR",
    embedding=embedder
)
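This assumes a Qdrant instance is already running at http://localhost:6333. If you don't have one, the quickest way to start it locally is usually the official Docker image:
docker run -p 6333:6333 qdrant/qdrant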
✨ 4. Generate Diverse Variations of the User’s Query
We use Google Gemini to rewrite the user’s input in multiple semantically diverse forms.
from openai import OpenAI
client = OpenAI(
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key=os.getenv("GEMINI_API_KEY")
)

def generate_different_user_prompt(user_input, num_variants=3):
    SYSTEM_PROMPT = f"""
    You are a helpful AI Assistant that rewrites user input queries in different forms for better document retrieval.
    Original Query: "{user_input}"
    Rewrite this query in {num_variants} different ways.
    """
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_input}
        ]
    )
    # Simple line-based parsing: drop blank lines and strip leading numbering like "1. "
    lines = response.choices[0].message.content.split('\n')
    return [line.strip("1234567890. ") for line in lines if line.strip()]
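A quick way to try the helper on its own (the output varies with the Gemini response and the simple line-based parsing above):
variants = generate_different_user_prompt("how do apps save photos to the cloud?")
for v in variants:
    print("-", v)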
🔍 5. Perform Parallel Similarity Searches
We now use each of the rephrased queries to retrieve relevant document chunks.
def get_similar_chunks_from_document(user_input):
    ai_prompts = generate_different_user_prompt(user_input)
    all_results = []
    for prompt in ai_prompts:
        # One similarity search per rewritten query
        results = retriever.similarity_search(query=prompt)
        all_results.append(results)
    return all_results
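The loop above actually runs the searches one after another. If you want them to run in parallel for real, here is an optional sketch (not part of the original script) using Python's ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor

def get_similar_chunks_concurrently(user_input):
    ai_prompts = generate_different_user_prompt(user_input)
    # One thread per rewritten query, each running its own similarity search
    with ThreadPoolExecutor(max_workers=max(1, len(ai_prompts))) as pool:
        return list(pool.map(retriever.similarity_search, ai_prompts))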
🧹 6. Filter Unique Chunks
We flatten and deduplicate all chunks across all query variations.
def filter_unique_chunks(nested_chunks):
    seen = set()
    filtered = []
    for chunk_list in nested_chunks:
        for doc in chunk_list:
            # Keep each chunk only once, keyed by its text content
            if doc.page_content not in seen:
                seen.add(doc.page_content)
                filtered.append(doc)
    return filtered
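Chaining the two helpers together looks like this (the example question assumes the indexed PDF is the "Attention Is All You Need" paper loaded earlier; counts will vary):
nested = get_similar_chunks_from_document("how is attention computed?")
unique_docs = filter_unique_chunks(nested)
print(f"{sum(len(r) for r in nested)} retrieved -> {len(unique_docs)} unique chunks")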
🧾 7. Final Answer Generation with Enriched Context
We feed the cleaned, unique chunks + original query into Gemini again to produce a final, context-aware answer.
def parallel_query_retrieval():
    while True:
        user_input = input(">> ")
        if user_input.lower() in ["exit", "quit"]:
            break
        similar_chunks = get_similar_chunks_from_document(user_input)
        filtered = filter_unique_chunks(similar_chunks)
        # Join the deduplicated chunks before building the prompt (backslashes
        # inside f-string expressions are only allowed on Python 3.12+)
        context = "\n\n".join(doc.page_content for doc in filtered)
        SYSTEM_PROMPT = f"""
        You are a helpful AI Assistant who responds based on the available context.
        If the answer is not found in the context, reply with "I don't know based on the document."
        Context:
        {context}
        """
        response = client.chat.completions.create(
            model="gemini-2.0-flash",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_input}
            ]
        )
        print("-----> ", response.choices[0].message.content)

# Entry point so running the script from the terminal starts the loop
if __name__ == "__main__":
    parallel_query_retrieval()
🧪 Run the System
Run the entire system from your terminal:
python parallel_query_retrieval.py