Advanced RAG Techniques: Transforming Queries for Smarter AI Responses

Abhijith Kale

Retrieval-Augmented Generation (RAG) has become a powerful tool in AI. If you’re already familiar with how it works, let’s take a deeper dive into some advanced techniques and see how we can optimize the responses generated through RAG.

[ Article ] – You can check out my previous article for a quick overview of what RAG is and how it works.


The Problem with Basic RAG

The real challenge with basic RAG implementations starts when the number of documents increases. As your data becomes more sparse and diverse—imagine a mix of books, study materials, resumes, and more—the accuracy of generated responses begins to drop.

For instance, if a user has different types of documents stored in Google Drive, a basic RAG setup isn't sufficient to fetch and generate accurate responses.

So, let’s take a look at how we can implement an advanced version of RAG.


Overview of Advanced RAG Implementation

The main goal here is to optimize the quality of responses generated by the LLM.

In basic RAG, the process is divided into three stages:

  • Indexing

  • Retrieval

  • Generation

In advanced RAG, we introduce three additional steps to improve the pipeline:

  • Query Transformation

  • Routing

  • Query Construction

In this article, we’ll focus solely on Query Transformation and how it can drastically improve the system.


Query Transformation

When we talk about a query, we usually mean the prompt given by the user. But user queries can often be ambiguous, carrying multiple meanings. If we attempt retrieval and generation based on such vague input, the responses naturally suffer.

There’s a principle called GIGO (Garbage In, Garbage Out), which basically says: bad input leads to bad output. So, if we feed a weak query into the system, we can’t expect a strong answer.

This is where Query Transformation comes in. Its purpose is to better understand what the user really wants and to enhance their query accordingly.


Types of Queries

Queries can be categorized into two types:

  • Less abstract (very specific and detailed)

  • More abstract (broad, high-level)

A user's query usually lies somewhere in the middle.

To tackle this, we rewrite the query into multiple improved versions. Combined with rank-based merging of the retrieved results (covered below as Reciprocal Rank Fusion), this overall approach is often called RAG Fusion.


Techniques in Query Transformation

1. Parallel Query Retrieval (Fan-out Model)

This technique involves generating multiple similar queries from the user's original query.

Here’s the process:

  1. Convert the user's query into multiple versions.

  2. Turn each version into vector embeddings.

  3. Perform a similarity search for each embedding in the vector database.

  4. Retrieve relevant chunks and filter out the important ones.

  5. Feed these chunks, along with the original query, to the LLM to generate the final response.

Because the query fans out into multiple queries and they’re processed in parallel, it’s called the Fan-out Model.

Result: Improved accuracy
Tradeoff: Slight decrease in speed
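
Here is a minimal sketch of this fan-out flow. The helpers generate_query_variants, embed_text, ask_llm, and the vector_store.similarity_search interface are hypothetical placeholders standing in for your own LLM, embedding model, and vector database.

def fan_out_retrieve(user_query, vector_store, n_variants=3, top_k=5):
    # 1. Rewrite the original query into several improved versions (LLM call)
    variants = generate_query_variants(user_query, n=n_variants)

    # 2-3. Embed each variant and run a similarity search against the vector DB
    retrieved = []
    for variant in variants:
        embedding = embed_text(variant)
        retrieved.extend(vector_store.similarity_search(embedding, top_k=top_k))

    # 4. De-duplicate so only unique, relevant chunks remain
    unique_chunks = list({chunk.id: chunk for chunk in retrieved}.values())

    # 5. Generate the final answer from the original query plus the filtered chunks
    return ask_llm(user_query, context=unique_chunks)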


2. Reciprocal Rank Fusion (RRF)

The user's query isn't the only thing that can cause issues — the data chunks we retrieve can also hurt the quality if they aren't relevant.

Instead of just randomly filtering chunks, we rank them using a technique called Reciprocal Rank Fusion.

Example:

  • Suppose we retrieve 10 chunks.

  • We rank them based on relevance.

  • We then pick only the top 3 to pass on to the LLM.

Result: Stronger, more relevant context and fewer hallucinations by the model.

Here is how to implement the algorithm:

Formula: RRF(d) = Σ(r ∈ R) 1 / (k + r(d))

Where:
- d is a document
- R is the set of rankers (retrievers)
- k is a constant (typically 60)
- r(d) is the rank of document d in ranker r

from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    # ranked_lists: one ordered list of documents (best first) per retriever
    scores = defaultdict(float)
    for ranked_list in ranked_lists:
        for rank, doc in enumerate(ranked_list, start=1):  # ranks are 1-based
            scores[doc] += 1 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
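
For example, fusing the rankings produced by two rewritten queries might look like this (the document IDs are made up for illustration):

rankings = [
    ["doc_a", "doc_b", "doc_c"],  # ranking for query variant 1
    ["doc_b", "doc_d", "doc_a"],  # ranking for query variant 2
]

# Keep only the top 3 fused chunks to pass on to the LLM
top_chunks = reciprocal_rank_fusion(rankings)[:3]
print(top_chunks)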

3. Query Decomposition

Earlier, we talked about abstract vs. less abstract queries. In the previous techniques, we didn’t actually change the abstractness of the query.

Now, let's look at how we can move a query in either direction: making it less abstract or more abstract.


Making Queries Less Abstract

Here, we use a Chain of Thought approach:
Break down the user’s query into multiple sub-queries.

Example:
Suppose the question is "What is machine learning?"

  • First, we might ask, "What is a machine?"

  • Then, "What is learning?"

  • Finally, combine the two to answer "What is machine learning?"

Each sub-query is embedded and used step-by-step to retrieve context. This approach is extremely useful for medical and legal documents, where we must consider every aspect of a problem before answering.

Usually, we guide the LLM using a system prompt by providing it a step-by-step process to follow.

Here’s an example of chain of thought prompting.

   User Query: What is the weather in New York?
    Output: { "step": "plan", "content": "The user is interested in weather data for New York" }
    Output: { "step": "plan", "content": "From the available tools I should call get_weather" }
    Output: { "step": "action", "function": "get_weather", "input": "New York" }
    Output: { "step": "observe", "output": "12 degrees Celsius" }
    Output: { "step": "output", "content": "The weather in New York seems to be 12 degrees Celsius." }
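
On the retrieval side, the same decomposition idea can be sketched in a few lines. As before, decompose_query, embed_text, vector_store.similarity_search, and ask_llm are hypothetical placeholders for your own components.

def decompose_and_answer(user_query, vector_store, top_k=3):
    # Break the query into simpler sub-queries, e.g.
    # "What is machine learning?" -> ["What is a machine?", "What is learning?"]
    sub_queries = decompose_query(user_query)

    context = []
    for sub_query in sub_queries:
        # Retrieve context for each sub-query, step by step
        embedding = embed_text(sub_query)
        context.extend(vector_store.similarity_search(embedding, top_k=top_k))

    # Answer the original question using the accumulated context
    return ask_llm(user_query, context=context)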

Making Queries More Abstract

In 2023, Google released a white paper called "Take a Step Back," which introduced a method called Step-Back Prompting.

It’s a few-shot prompting technique where you show examples of how to make a query broader and more abstract.

The idea is to get the LLM to "think bigger" — almost like it's answering based on its pre-trained knowledge rather than treating the input as a one-off question.

This is very helpful when you want the LLM to provide deep, generalized insights rather than narrow answers.
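
A few-shot step-back prompt might look something like this (the examples are purely illustrative):

    Original: What was the GDP of a given country in 1997?
    Step-back: How has that country's economy evolved over time?

    Original: Which database should I use for a chat application?
    Step-back: What are the main considerations when choosing a database?

    User Query: Why does my RAG system hallucinate on legal documents?
    Step-back: What are the common causes of hallucination in RAG systems?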


4. Hypothetical Document Embedding (HyDE)

Another fascinating technique is Hypothetical Document Embedding, commonly known as HyDE.

Here’s how it works:

  • You give the user’s query to a powerful LLM (like GPT-4.1).

  • The LLM writes a hypothetical document about the query.

  • You convert that document into vector embeddings.

  • Then, use those embeddings to search and retrieve relevant information from your database.

Because the hypothetical document looks more like a real answer than the raw query does, its embedding tends to land closer to the relevant chunks. This lets retrieval cover a broader, more relevant area of your database and results in more accurate, well-rounded responses.
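
Here is a minimal sketch of HyDE, again using the hypothetical helpers ask_llm, embed_text, and vector_store.similarity_search introduced above.

def hyde_retrieve(user_query, vector_store, top_k=5):
    # 1. Ask a capable LLM to write a hypothetical passage answering the query
    hypothetical_doc = ask_llm(
        f"Write a short passage that answers the question: {user_query}"
    )

    # 2. Embed the hypothetical document instead of the raw query
    embedding = embed_text(hypothetical_doc)

    # 3. Retrieve real chunks whose embeddings are close to the hypothetical one
    chunks = vector_store.similarity_search(embedding, top_k=top_k)

    # 4. Generate the final, grounded answer from the retrieved context
    return ask_llm(user_query, context=chunks)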


Wrapping Up

In this article, we covered one major part of advanced RAG — Query Transformation — and discussed different techniques like:

  • Parallel Query Retrieval (Fan-out Model)

  • Reciprocal Rank Fusion

  • Query Decomposition (Less and More Abstract)

  • Hypothetical Document Embedding (HyDE)

Each technique plays a critical role in boosting the quality and reliability of responses generated by RAG systems.

In the next parts, we’ll dive deeper into Routing and Query Construction — so stay tuned!
