Parallel Query Retrieval (Fan Out)

Let’s Understand Query Translation in RAG Systems (with Example)

In a RAG (Retrieval-Augmented Generation) system, users can type anything: a question, a vague thought, or just a few keywords. It is not always easy for the system to understand exactly what the user wants.

There are two important parts in every query:

  1. What the user wants to know (intent)

  2. What the user actually wrote (actual query)

These two don’t always match. That’s where Query Translation helps.

Real-life Example:

Imagine you walk into a pharmacy and say:

“I have a cold.”

But what you really mean is:

“Can you give me medicine for sneezing and a sore throat?”

If the pharmacist just searches for the word "cold," you might not get the exact medicine you need. But if they understand your real need and search for the symptoms instead, they’ll give you the right solution. Right?

The same thing happens in RAG systems. The original user query may be unclear or too short. So, we translate or rewrite it into a better version that captures the full intent, making it easier for the system to retrieve the most relevant documents.

This process of translating the query helps improve accuracy and gives better answers.

My point here is that we humans are lazy. Google has exposed us to so much convenience for so long that we generally don't care what we type in the search bar. We just expect Google to bring us the right results. And if you want your RAG application to become popular, you have to make it very good at understanding what the user wants to ask.

Parallel Query Retrieval

So the problem at hand is this: we want our RAG application to understand what the user wants to ask, given that most of the time humans are going to give bad input. You may have heard the phrase “Garbage In, Garbage Out”. It applies perfectly to LLMs: if you give an LLM bad input, you will most likely get bad output. That means you want to improve the input you give to the LLM to make your RAG application “usable” for normal users.

The Parallel Query Retrieval technique tries to give the LLM better input for the user's query. It does so by asking the LLM to generate multiple refined queries for any given user query, then processing all the LLM-generated queries along with the original query to produce a more comprehensive answer.

The following diagram will help you understand this better.

For example, let’s say you’ve built a RAG application that answers questions about Node.js, and you’ve already ingested documentation and tutorials related to the File System (FS) module into your vector database.

Now, a user asks a query like:

“red fs files node”
(Notice the spelling mistake in “read”. The user meant “read FS files in Node”, but wrote it poorly.)

A typical keyword-based system may not understand this query well. But your RAG system uses a Query Translation step to improve its understanding.

The system sends the original user query to an LLM (Large Language Model), asking it to generate better or related queries. The LLM might return something like:

  1. How to read files using the FS module in Node.js?

  2. What are the methods to handle file operations in Node.js FS module?

  3. How to read and write files asynchronously using FS in Node.js?

As described in the above diagram, you:

  1. Generate Vector Embeddings for all the LLM-generated queries & the user’s query

  2. Fetch relevant documents from your vector database using similarity search

  3. Aggregate unique data points from similarity search results across multiple queries

  4. Pass the user’s query along with the aggregated data points to LLM

After following these steps, the response from the LLM will most likely be better than the response from a basic RAG.

In classic system design, the Fan-Out Pattern refers to sending a single message or event to multiple services or consumers at once. I hope you can now see why the technique we are discussing in this article falls under the Fan-Out pattern: a single user query fans out into multiple retrieval requests, one for each generated query.
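
To make the fan-out (and the “Parallel” in the name) concrete, here is a minimal sketch in Python. It assumes you already have some retrieve(query) function that runs a similarity search against your vector database; the function names here are illustrative, not part of any library.

from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def fan_out_retrieve(
    queries: list[str],
    retrieve: Callable[[str], list[str]],
) -> list[str]:
    # fan out: run one similarity search per query, all at the same time
    with ThreadPoolExecutor() as executor:
        per_query_results = list(executor.map(retrieve, queries))

    # fan in: keep only the unique chunks across all searches
    unique_chunks: dict[str, None] = {}
    for chunks in per_query_results:
        for chunk in chunks:
            unique_chunks.setdefault(chunk, None)
    return list(unique_chunks)

A plain dict is used for de-duplication because it preserves insertion order in Python 3.7+, so the most relevant chunks stay near the top of the aggregated list.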

Implementation in Python

Enough with theory, let's code this thing. As discussed before, this RAG differs from the basic RAG we built in the previous article in the QUERY phase.

Step 1: Generate Multiple Queries Given User’s Query

Our goal in this step is to generate multiple queries that are similar to the user's query, and we will use an LLM to do it. At a high level, there are two ways to achieve this.

  1. You make multiple requests to your LLM, each one asking it to generate a query similar to the user's query. But this is more time-consuming and, most importantly, it will cost you more.

  2. You ask the LLM to generate multiple queries within the same response. But there is a problem here: when you ask an LLM a question, it gives you the response as plain text. How do you extract the queries from a plain-text response? This is where a concept called “Structured Output” helps. Basically, modern LLMs can respond in a specific format that you define before making the request.

Let’s see structured output in action using LangChain.

Define Output Format

We will use BaseModel from the pydantic library to create a class, MultipleQueries, that defines the output structure we expect from the LLM.

from pydantic import BaseModel

# model for multiple queries
class MultipleQueries(BaseModel):
    # the refined queries generated by the LLM
    queries: list[str]

💡 You can watch this YouTube video to learn more about Pydantic.

Instruct LLM Model to Respond in Output Format

LangChain makes it very easy to instruct LLM models to respond in a specific format.

from langchain_openai import ChatOpenAI

# create LLM
llm = ChatOpenAI(
    model="gpt-4.1",
)

# llm for query generation
llm_for_query_gen = llm.with_structured_output(MultipleQueries)

Generate Multiple Queries for a Given User Query

SYSTEM_PROMPT_QUERY_GEN = """
You are a helpful assistant. Your job is to generate 3 queries that are similar to the user's query.
You need to give the response in the required format.

Example:
user_query: How to read file in node.js
response:
[
    "How to read files using the FS module in Node.js?",
    "What are the methods to handle file operations in Node.js FS module?",
    "How to read and write files asynchronously using FS in Node.js?"
]
"""

# generate 3 queries similar to the user's query
def generate_queries(query: str) -> list[str]:
    # 1. use LLM to generate 3 queries similar to the user's query
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT_QUERY_GEN},
        {"role": "user", "content": query},
    ]

    response = llm_for_query_gen.invoke(messages)
    if isinstance(response, MultipleQueries):
        result = response.queries
        print(f"🌀🌀🌀 Generated {len(result)} queries")
        for i, q in enumerate(result):
            print(f"🌀🌀🌀 {i+1}. {q}")
        return result
    else:
        raise ValueError("Invalid response from LLM")

generate_queries("what is fs module ?")
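
The code above covers Step 1 (query generation). To see how it plugs into the rest of the pipeline, here is a minimal sketch of Steps 2 to 4. It assumes you already have a LangChain vector store (for example a Chroma or FAISS instance named vector_store) that was populated during the ingestion phase; the vector_store name and the context prompt are illustrative assumptions, not something defined earlier in this article.

# assumption: `vector_store` is a LangChain vector store (e.g. Chroma/FAISS)
# that was populated with the Node.js FS documentation during ingestion

def answer_query(user_query: str) -> str:
    # 1. generate refined queries and keep the original one as well
    all_queries = [user_query] + generate_queries(user_query)

    # 2. run a similarity search for every query
    # 3. aggregate unique chunks across all the searches
    unique_chunks: dict[str, None] = {}
    for q in all_queries:
        for doc in vector_store.similarity_search(q, k=3):
            unique_chunks.setdefault(doc.page_content, None)

    context = "\n\n".join(unique_chunks)

    # 4. pass the user's query along with the aggregated context to the LLM
    messages = [
        {
            "role": "system",
            "content": f"Answer the user's question using only this context:\n{context}",
        },
        {"role": "user", "content": user_query},
    ]
    return llm.invoke(messages).content

print(answer_query("red fs files node"))

For simplicity this sketch runs the searches one after another; you could fan them out concurrently with a ThreadPoolExecutor, as shown earlier in the article.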

And that’s it, that’s how easy it is to implement Parallel Query Retrieval. In my future articles in this series, I will discuss more about techniques used in advanced RAG applications. Stay tuned.

Hope you liked this article. If you have questions/comments, then please feel free to comment on this article.
