3. Parallel Query Retrieval (Fan Out)


You might have come across the popular reel where Virat Kohli talks about Rohit Sharma’s lazy communication style.
I will describe it in English so that the non-Hindi-speaking audience can follow along. Fair warning, my mediocre English can’t do justice to the humor here. Where you would say, “there is a lot of traffic in Lokhandwala (a place in Mumbai)”, Rohit Sharma would say, “that place has a lot of this”. Now it’s your responsibility to figure out “what place” and “has a lot of what”.
My point is that we humans are lazy. Google has exposed us to so much convenience for so long that we generally don’t care about what we type into the search bar. We just expect Google to bring us the right results. And if you want your RAG application to get popular, you have to make it very good at understanding what the user wants to ask.
In this and the next couple of articles, we will try to solve this exact problem of making your RAG application understand users’ queries better, so that it can generate better results.
Before we dive deep into the topic of this article, I highly recommend reading my previous articles in the RAG series. We are getting into advanced RAG topics now, so make sure your basics are clear first.
Parallel Query Retrieval
So the problem at hand is that we want our RAG application to understand what the user wants to ask, given that most of the time humans are going to give it bad input. You may have heard the phrase “Garbage In, Garbage Out”. It applies perfectly to LLMs. If you give an LLM bad input, you will most likely get bad output from it. That means you want to improve the input you are giving to the LLM to make your RAG application usable for normal users.
The Parallel Query Retrieval technique tries to generate better LLM input for a user’s query. It does so by asking the LLM to generate multiple refined queries for the given user query. It then processes all the LLM-generated queries along with the user’s query to generate a comprehensive answer. The following diagram will help you understand this better.
For example, let’s say you create a RAG application capable of answering programming-related questions, and you have ingested relevant data into your vector database (using the ingestion phase defined in the previous article). If the user asks “implemend goroutines golang” (notice the spelling mistake in “implement”), your RAG application will ask the LLM to generate queries similar to the user’s query. Let’s say the LLM returns the following queries:
How to implement Goroutines in GoLang?
What are the various concurrency patterns in GoLang?
How to take care of thread-safety while using Goroutines in GoLang?
As described in the above diagram, you:
Generate Vector Embeddings for all the LLM generated queries & the user’s query
Fetch relevant documents from your vector database using similarity search
Aggregate unique data points from similarity search results across multiple queries
Pass the user’s query along with the aggregated data points to LLM
After following these steps, the response from the LLM will most likely be better than the response from the basic RAG that we coded in the previous article.
Implementation in Python
Enough with the theory, let’s code this thing. As discussed before, this RAG differs from the basic RAG we built in the previous article in the QUERY phase. Hence, I will be reusing some components from my basic RAG implementation article. If you haven’t read it already, I highly recommend reading it first.
Let’s assume that you have ingested a PDF document about GoLang into your RAG application. Now we will discuss the changes in the query flow.
Step 1: Generate Multiple Queries Given User’s Query
Our goal in this step is to generate multiple queries that are similar to the user’s query. We will use an LLM to generate them. At a high level, there are two ways to achieve this.
You make multiple requests to your LLM, each one asking it to generate a query similar to the user’s query. But this is more time-consuming and, most importantly, it will cost you more.
The second way is to ask the LLM to generate multiple queries within the same response. But there is a problem here. When you ask an LLM a question, it gives you a response as plain text. How do you extract individual queries from a plain-text response? This is where a concept called “Structured Output” helps you. Basically, modern LLMs can respond in a specific format that you define before making the request.
Let’s see structured output in action using LangChain.
Define Output Format
We will use BaseModel from the pydantic library to create a class MultipleQueries that defines the output structure we are expecting from the LLM.
from pydantic import BaseModel

# model for multiple queries
class MultipleQueries(BaseModel):
    queries: list[str]
Instruct the LLM to Respond in the Output Format
LangChain makes it very easy to instruct LLMs to respond in a specific format.
from langchain_openai import ChatOpenAI

# create LLM
llm = ChatOpenAI(
    model="gpt-4.1",
)

# llm for query generation
llm_for_query_gen = llm.with_structured_output(MultipleQueries)
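Before wiring this into the query-generation prompt, you can do a quick sanity check. This snippet is just illustrative and not part of the final flow; it shows that the wrapped model now returns a MultipleQueries object instead of plain text.

```python
# illustrative sanity check: the structured LLM returns a MultipleQueries
# object, so the generated queries are available as a plain Python list
result = llm_for_query_gen.invoke("implement goroutines in golang")
print(type(result).__name__)  # MultipleQueries
print(result.queries)         # e.g. ["how to implement goroutines in golang", ...]
```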
Generate Multiple Queries for a Given User Query
SYSTEM_PROMPT_QUERY_GEN = """
You are a helpful assistant. Your job is to generate 3 queries that are similar to the user's query.
You need to give the response in the required format.
Example:
user_query: implement goroutines in golang
response:
[
"how to implement goroutines in golang",
"what is goroutine in golang",
"how to use goroutines in golang"
]
"""
# generate 3 queries similar to the user's query
def generate_queries(query: str) -> list[str]:
    # 1. use LLM to generate 3 queries similar to the user's query
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT_QUERY_GEN},
        {"role": "user", "content": query},
    ]
    response = llm_for_query_gen.invoke(messages)
    if isinstance(response, MultipleQueries):
        result = response.queries
        print(f"🌀🌀🌀 Generated {len(result)} queries")
        for i, query in enumerate(result):
            print(f"🌀🌀🌀 {i+1}. {query}")
        return result
    else:
        raise ValueError("Invalid response from LLM")
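Calling the function with our misspelled query might produce something like the output below. This is illustrative only; the exact queries will vary from run to run.

```python
# illustrative run -- the generated queries will differ between runs
similar_queries = generate_queries("implemend goroutines golang")
# 🌀🌀🌀 Generated 3 queries
# 🌀🌀🌀 1. How to implement Goroutines in GoLang?
# 🌀🌀🌀 2. What are the various concurrency patterns in GoLang?
# 🌀🌀🌀 3. How to take care of thread-safety while using Goroutines in GoLang?
```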
Step 2: Fetch Relevant Documents from Vector DB for Each Query
Here, we will use the get_vector_store() method which we defined in the previous article.
COLLECTION_NAME = "golang-docs"

# fetch the relevant documents for the query
def fetch_relevant_documents_for_query(query: str) -> list[Document]:
    # 1. check if the collection exists
    if not collection_exists(COLLECTION_NAME):
        raise ValueError("Collection does not exist")

    # 2. get the vector store for the collection
    vector_store = get_vector_store(COLLECTION_NAME)

    # 3. fetch the relevant documents with their similarity scores
    docs = vector_store.similarity_search_with_score(query, k=5)

    # 4. filter the documents based on the similarity threshold
    filtered_docs = [doc for doc, score in docs if score >= SIMILARITY_THRESHOLD]
    print(f"🌀🌀🌀 QUERY: {query}. FOUND: {len(filtered_docs)} documents")
    return filtered_docs
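The function above also relies on collection_exists() and SIMILARITY_THRESHOLD from the previous article’s setup. In case you don’t have that code handy, here is a rough sketch of what those helpers might look like, assuming the same Qdrant-backed vector store and embeddings as in the ingestion sketch earlier; the threshold value is a placeholder you should tune for your own data.

```python
# minimal sketch of the helpers reused from the basic RAG setup;
# reuses `embeddings` and QDRANT_URL from the ingestion sketch above
from qdrant_client import QdrantClient

SIMILARITY_THRESHOLD = 0.5  # placeholder: tune for your data and embedding model

qdrant_client = QdrantClient(url=QDRANT_URL)

def collection_exists(collection_name: str) -> bool:
    # check whether the collection was created during the ingestion phase
    return qdrant_client.collection_exists(collection_name=collection_name)

def get_vector_store(collection_name: str) -> QdrantVectorStore:
    # connect to the already-ingested collection
    return QdrantVectorStore.from_existing_collection(
        collection_name=collection_name,
        embedding=embeddings,
        url=QDRANT_URL,
    )
```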
Step 3: Aggregate Unique Documents Across Queries
from langchain_core.documents import Document

# aggregate the relevant documents
def aggregate_relevant_documents(queries: list[str]) -> list[Document]:
    # 1. fetch the relevant documents for each query
    docs = [fetch_relevant_documents_for_query(query) for query in queries]

    # 2. flatten the list of lists and get unique documents
    flattened_docs = [doc for sublist in docs for doc in sublist]
    unique_docs = list({doc.page_content: doc for doc in flattened_docs}.values())
    print(f"🌀🌀🌀 Found {len(unique_docs)} unique documents across all the queries")
    return unique_docs
Step 4: Query LLM using Aggregated Documents
SYSTEM_PROMPT_ANSWER_GEN = """
You are a helpful assistant. Your job is to generate an answer for the user's query based on the relevant documents provided.
"""

# generate the answer for the user's query
def generate_answer(query: str, docs: list[Document]) -> str:
    # 1. use LLM to generate the answer for the user's query based on the relevant documents
    system_prompt = SYSTEM_PROMPT_ANSWER_GEN
    for doc in docs:
        system_prompt += f"""
Document: {doc.page_content}
"""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
    ]
    response = llm.invoke(messages)
    return response.content
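All the building blocks are now in place; what remains is gluing them together. The small driver below is my own wiring (the function name answer_user_query is not from the article’s code), showing how the fan-out comes together: generate similar queries, keep the original query in the mix, aggregate unique documents, and answer.

```python
# end-to-end query flow (my own glue code; adjust names to your project)
def answer_user_query(user_query: str) -> str:
    # 1. fan out: generate similar queries and keep the original query as well
    queries = generate_queries(user_query) + [user_query]
    # 2. fetch and aggregate unique documents across all the queries
    docs = aggregate_relevant_documents(queries)
    # 3. answer the original query using the aggregated documents as context
    return generate_answer(user_query, docs)

if __name__ == "__main__":
    print(answer_user_query("implemend goroutines golang"))
```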
As you can see below, even though I asked a question with a spelling mistake (and it is certainly possible to make the input even worse), my RAG application was able to answer it well.
And that’s it, that’s how easy it is to implement Parallel Query Retrieval. In future articles in this series, I will discuss more techniques used in advanced RAG applications. Stay tuned.
Hope you liked this article. If you have questions or comments, please feel free to leave a comment on this article.
Source Code: GitHub