3. Parallel Query Retrieval (Fan Out)


You might have come across the popular reel where Virat Kohli talks about Rohit Sharma’s lazy communication style.
I will describe it in English so that the non-Hindi-speaking audience can follow along. Fair warning, my mediocre English can’t do justice to the humor here. Where you would say, “there is a lot of traffic in Lokhandwala (a place in Mumbai)”, Rohit Sharma would say, “that place has a lot of this”. Now it’s your responsibility to figure out “what place” and “has a lot of what”.
My point is that we humans are lazy. Google has exposed us to so much convenience for so long that we generally don’t care about what we type into the search bar. We just expect Google to bring us the right results. And if you want your RAG application to get popular, you have to make it very good at understanding what the user wants to ask.
In this and the next couple of articles, we will try to solve this exact problem of making your RAG application understand users’ queries better, so that it can generate better results.
Before we dive deep into the topic of this article, I highly recommend reading my previous articles in the RAG series. We are getting into advanced RAG topics now, so make sure your basics are clear first.
Parallel Query Retrieval
So the problem at hand is that we want our RAG application to understand what the user wants to ask, given that most of the time humans are going to give it bad input. You may have heard the phrase “Garbage In, Garbage Out”. It applies perfectly to LLMs. If you give an LLM bad input, you will most likely get bad output from it. That means you want to improve the input you are giving to the LLM to make your RAG application usable for normal users.
The Parallel Query Retrieval technique tries to generate better LLM input for a user’s query. It does so by asking the LLM to generate multiple refined queries for the given user query. It then processes all the LLM-generated queries along with the user’s query to generate a comprehensive answer. The following diagram will help you understand this better.
For example, let’s say you create a RAG application capable of answering programming-related questions, and you have ingested relevant data into your vector database (using the ingestion phase defined in the previous article). If the user asks “implemend goroutines golang” (notice the spelling mistake in “implement”), your RAG application will ask the LLM to generate queries similar to the user’s query. Let’s say the LLM returns the following queries:
How to implement Goroutines in GoLang?
What are the various concurrency patterns in GoLang?
How to take care of thread-safety while using Goroutines in GoLang?
As described in the above diagram, you:
Generate Vector Embeddings for all the LLM generated queries & the user’s query
Fetch relevant documents from your vector database using similarity search
Aggregate unique data points from similarity search results across multiple queries
Pass the user’s query along with the aggregated data points to LLM
After following these steps, the response from the LLM will most likely be better than the response from the basic RAG that we coded in the previous article.
Implementation in Python
Enough with the theory, let’s code this thing. As discussed before, this RAG differs from the basic RAG we built in the previous article in the QUERY phase. Hence, I will be reusing some components from my basic RAG implementation article. If you haven’t read it already, I highly recommend reading it first.
Let’s assume that you have ingested a PDF document about GoLang into your RAG application. Now we will discuss the changes in the query flow.
Step 1: Generate Multiple Queries Given User’s Query
Our goal in this step is to generate multiple queries that are similar to the user’s query. We will use an LLM to generate them. At a high level, there are two ways to achieve this.
You make multiple requests to your LLM, each one asking it to generate a query similar to the user’s query. But this is more time-consuming and, most importantly, it will cost you more.
The second way is to ask the LLM to generate multiple queries within the same response. But there is a problem here. When you ask an LLM a question, it gives you a response as plain text. How do you extract individual queries from a plain-text response? This is where a concept called “Structured Output” helps you. Basically, modern LLMs can respond in a specific format that you define before making the request.
Let’s see structured output in action using LangChain.
Define Output Format
We will use BaseModel from the pydantic library to create a class MultipleQueries that defines the output structure we are expecting from the LLM.
from pydantic import BaseModel

# model for multiple queries
class MultipleQueries(BaseModel):
    queries: list[str]
Instruct the LLM to Respond in the Output Format
LangChain makes it very easy to instruct LLMs to respond in a specific format.
from langchain_openai import ChatOpenAI

# create LLM
llm = ChatOpenAI(
    model="gpt-4.1",
)

# llm for query generation
llm_for_query_gen = llm.with_structured_output(MultipleQueries)
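Before wiring this into the query-generation prompt, you can do a quick sanity check. This snippet is just illustrative and not part of the final flow; it shows that the wrapped model now returns a MultipleQueries object instead of plain text.

```python
# illustrative sanity check: the structured LLM returns a MultipleQueries
# object, so the generated queries are available as a plain Python list
result = llm_for_query_gen.invoke("implement goroutines in golang")
print(type(result).__name__)  # MultipleQueries
print(result.queries)         # e.g. ["how to implement goroutines in golang", ...]
```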
Generate Multiple Queries for a Given User Query
SYSTEM_PROMPT_QUERY_GEN = """
You are a helpful assistant. Your job is to generate 3 queries that are similar to the user's query.
You need to give the response in the required format.
Example:
user_query: implement goroutines in golang
response:
[
"how to implement goroutines in golang",
"what is goroutine in golang",
"how to use goroutines in golang"
]
"""
# generate 3 queries similar to the user's query
def generate_queries(query: str) -> list[str]:
    # 1. use LLM to generate 3 queries similar to the user's query
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT_QUERY_GEN},
        {"role": "user", "content": query},
    ]
    response = llm_for_query_gen.invoke(messages)
    if isinstance(response, MultipleQueries):
        result = response.queries
        print(f"🌀🌀🌀 Generated {len(result)} queries")
        for i, query in enumerate(result):
            print(f"🌀🌀🌀 {i+1}. {query}")
        return result
    else:
        raise ValueError("Invalid response from LLM")
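Calling the function with our misspelled query might produce something like the output below. This is illustrative only; the exact queries will vary from run to run.

```python
# illustrative run -- the generated queries will differ between runs
similar_queries = generate_queries("implemend goroutines golang")
# 🌀🌀🌀 Generated 3 queries
# 🌀🌀🌀 1. How to implement Goroutines in GoLang?
# 🌀🌀🌀 2. What are the various concurrency patterns in GoLang?
# 🌀🌀🌀 3. How to take care of thread-safety while using Goroutines in GoLang?
```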
Step 2: Fetch Relevant Documents from Vector DB for Each Query
Here, we will use the get_vector_store() method which we defined in the previous article.
COLLECTION_NAME = "golang-docs"

# fetch the relevant documents for the query
def fetch_relevant_documents_for_query(query: str) -> list[Document]:
    # 1. check if the collection exists
    if not collection_exists(COLLECTION_NAME):
        raise ValueError("Collection does not exist")

    # 2. get the vector store for the collection
    vector_store = get_vector_store(COLLECTION_NAME)

    # 3. fetch the relevant documents with their similarity scores
    docs = vector_store.similarity_search_with_score(query, k=5)

    # 4. filter the documents based on the similarity threshold
    filtered_docs = [doc for doc, score in docs if score >= SIMILARITY_THRESHOLD]
    print(f"🌀🌀🌀 QUERY: {query}. FOUND: {len(filtered_docs)} documents")
    return filtered_docs
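The function above also relies on collection_exists() and SIMILARITY_THRESHOLD from the previous article’s setup. In case you don’t have that code handy, here is a rough sketch of what those helpers might look like, assuming the same Qdrant-backed vector store and embeddings as in the ingestion sketch earlier; the threshold value is a placeholder you should tune for your own data.

```python
# minimal sketch of the helpers reused from the basic RAG setup;
# reuses `embeddings` and QDRANT_URL from the ingestion sketch above
from qdrant_client import QdrantClient

SIMILARITY_THRESHOLD = 0.5  # placeholder: tune for your data and embedding model

qdrant_client = QdrantClient(url=QDRANT_URL)

def collection_exists(collection_name: str) -> bool:
    # check whether the collection was created during the ingestion phase
    return qdrant_client.collection_exists(collection_name=collection_name)

def get_vector_store(collection_name: str) -> QdrantVectorStore:
    # connect to the already-ingested collection
    return QdrantVectorStore.from_existing_collection(
        collection_name=collection_name,
        embedding=embeddings,
        url=QDRANT_URL,
    )
```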
Step 3: Aggregate Unique Documents Across Queries
from langchain_core.documents import Document

# aggregate the relevant documents
def aggregate_relevant_documents(queries: list[str]) -> list[Document]:
    # 1. fetch the relevant documents for each query
    docs = [fetch_relevant_documents_for_query(query) for query in queries]

    # 2. flatten the list of lists and get unique documents
    flattened_docs = [doc for sublist in docs for doc in sublist]
    unique_docs = list({doc.page_content: doc for doc in flattened_docs}.values())
    print(f"🌀🌀🌀 Found {len(unique_docs)} unique documents across all the queries")
    return unique_docs
Step 4: Query LLM using Aggregated Documents
SYSTEM_PROMPT_ANSWER_GEN = """
You are a helpful assistant. Your job is to generate an answer for the user's query based on the relevant documents provided.
"""

# generate the answer for the user's query
def generate_answer(query: str, docs: list[Document]) -> str:
    # 1. use LLM to generate the answer for the user's query based on the relevant documents
    system_prompt = SYSTEM_PROMPT_ANSWER_GEN
    for doc in docs:
        system_prompt += f"""
Document: {doc.page_content}
"""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
    ]
    response = llm.invoke(messages)
    return response.content
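All the building blocks are now in place; what remains is gluing them together. The small driver below is my own wiring (the function name answer_user_query is not from the article’s code), showing how the fan-out comes together: generate similar queries, keep the original query in the mix, aggregate unique documents, and answer.

```python
# end-to-end query flow (my own glue code; adjust names to your project)
def answer_user_query(user_query: str) -> str:
    # 1. fan out: generate similar queries and keep the original query as well
    queries = generate_queries(user_query) + [user_query]
    # 2. fetch and aggregate unique documents across all the queries
    docs = aggregate_relevant_documents(queries)
    # 3. answer the original query using the aggregated documents as context
    return generate_answer(user_query, docs)

if __name__ == "__main__":
    print(answer_user_query("implemend goroutines golang"))
```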
As you can see below, even though I asked a question with a spelling mistake (and it is certainly possible to make the input even worse), my RAG application was able to answer it well.
And that’s it, that’s how easy it is to implement Parallel Query Retrieval. In future articles in this series, I will discuss more techniques used in advanced RAG applications. Stay tuned.
Hope you liked this article. If you have questions or comments, please feel free to leave a comment on this article.
Source Code: GitHub