#6 - Advanced RAG: Query Translation Patterns

INTRODUCTION

RAG, or Retrieval-Augmented Generation, is an AI framework that combines the strengths of traditional information retrieval systems with the generative abilities of large language models (LLMs). It lets an LLM consult current, external knowledge sources before producing a response, making the output more accurate and relevant.

BASIC RAG

A basic RAG is a system that follows a simple three-step process: Indexing, Retrieval, and Generation.

  • First step: Indexing - For a given document, we create chunks (chunking), convert them into vector embeddings using an embedding model, and store them in a vector database such as Qdrant or Pinecone.

  • Second step: Retrieval - For a given user query, create embeddings of the query, search for similar chunks in the vector database, and retrieve the relevant document chunks.

  • Third step: Generation - Combine the user prompt with the retrieved relevant chunks, feed them to a chat model LLM, and return the generated answer (a minimal sketch of the whole pipeline follows below).
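To make the three steps concrete, here is a minimal sketch in Python. The `embed()` and `chat()` helpers (imported from a made-up `my_llm_client` module) stand in for whichever embedding and chat models you use, and an in-memory list stands in for a vector database such as Qdrant or Pinecone; a real system would use the database client's own API.

```python
import numpy as np

# Hypothetical helpers: embed(text) returns an embedding vector,
# chat(prompt) sends a prompt to a chat model and returns its reply.
from my_llm_client import embed, chat  # assumed wrappers, not a real library

# --- Step 1: Indexing ---
document = open("handbook.txt").read()          # placeholder document path
chunks = [document[i:i + 500] for i in range(0, len(document), 500)]  # naive fixed-size chunking
index = [(chunk, np.array(embed(chunk))) for chunk in chunks]          # toy in-memory "vector DB"

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# --- Step 2: Retrieval ---
def retrieve(query, k=3):
    q_vec = np.array(embed(query))
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# --- Step 3: Generation ---
def answer(query):
    context = "\n\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return chat(prompt)
```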

All RAG systems generally follow this pattern. This works for simple applications, but for an enterprise the workflow is not sufficient: as the data grows, the chunks become sparse and lose their connection with one another. Fetching finer, better results requires additional, more advanced steps on top of the three above:

  • Query Translation

  • Routing

  • Query Construction

The scope of this article is limited to the ‘Query Translation’ process.

QUERY TRANSLATION PROCESS

The problem with processing a user's requirement is that, most of the time, the user does not know exactly what they want. This is reflected directly in the query they type, which is almost always ambiguous, and with an ambiguous query we get an ambiguous result. Sometimes the query is very abstract, while other times it is more specific; this varies from user to user. Most advanced RAG systems try to narrow down the requirement even from an ambiguous query and return exactly what the user is looking for (at times even to the surprise of the user).

The goal of the query translation process is to make the user query less abstract so that the system can return a valuable output that meets expectations. The query can be rewritten in two broad ways, the ‘Multi query’ and ‘RAG fusion’ methods, and within these we have different types of query translation patterns. Some of them are listed below:

  • RAG Fusion

    • Parallel query retrieval (fan-out)

    • Reciprocal Rank Fusion

  • Multi Query Method

    • Step back prompting

    • Chain of thought prompting

    • Hypothetical document embedding

Note: You might have heard that using AI-generated prompts can return poor results; if that's true, wouldn't this approach also return poor results? It is true, but that concern applies to system prompts, not user prompts; in the query translation process we work with user prompts, which can be improved with the help of an LLM. This also takes more time than a basic RAG, since the extra calls run synchronously, but accuracy improves significantly. Be mindful, too, that the process consumes far more tokens from the LLM (self-hosting is an option if cost is an issue).

PARALLEL QUERY RETRIEVAL (“FAN-OUT”)

Parallel query retrieval is often used in conjunction with query translation to handle multiple reformulated queries simultaneously, increasing the chances of finding the most relevant information. The purpose is to make the search query more aligned with how information is stored and structured in the knowledge base.

How it works:

Take the user query and generate multiple queries from it using an LLM. Create vector embeddings for these queries in parallel and run a similarity search against the vector database to retrieve the relevant chunks for each query. Finally, filter the retrieved chunks down to the unique ones and provide them to the LLM as the data source along with the original user query.

As this workflow shows, the method gives the LLM an improved context to work with, which scales better and fetches better results.
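Here is a minimal sketch of the fan-out flow, reusing the hypothetical `chat()` helper and the `retrieve()` function from the basic RAG sketch above. The query-rewriting prompt and deduplication by chunk text are illustrative choices, not a fixed recipe.

```python
def fan_out_retrieve(user_query, n_variants=3, k=3):
    # 1. Ask the LLM to rewrite the query from several angles.
    prompt = (
        f"Rewrite the following question in {n_variants} different ways, "
        f"one per line:\n{user_query}"
    )
    variants = [user_query] + chat(prompt).splitlines()

    # 2. Retrieve top-k chunks for every variant (these lookups can run in parallel).
    all_chunks = []
    for q in variants:
        all_chunks.extend(retrieve(q, k=k))

    # 3. Keep only the unique chunks, preserving order.
    unique_chunks = list(dict.fromkeys(all_chunks))

    # 4. Answer the *original* query with the merged context.
    context = "\n\n".join(unique_chunks)
    return chat(f"Context:\n{context}\n\nQuestion: {user_query}")
```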

RECIPROCAL RANK FUSION (RRF)

Reciprocal Rank Fusion is a way to combine ranked lists from different search systems into one list. The main idea is to give more importance to documents that are near the top of several lists, even if they aren't ranked high in any single list. Here, it assigns a score to each document based on its position in each list, then adds up these scores to create a final ranking.

How it works:

This is quite similar to the parallel query retrieval method but with a small twist - after getting the relevant chunks from the vector database for each of the multiple queries, instead of filtering out unique chunks, we rank them using a simple reciprocal rank formula and return them sorted by each chunk's fused score.

We can control how much ‘weight’ is given to high- or low-ranked chunks and either keep or drop chunks based on their rank.
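A small sketch of the scoring step, assuming each query variant's retrieval has already returned an ordered list of chunks. The score for each chunk is the sum of 1 / (k + rank) over every list it appears in; k = 60 is the constant commonly used in the RRF literature and can be tuned to weight high- or low-ranked chunks differently.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60, top_n=5):
    """ranked_lists: one ordered list of chunk texts per query variant."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk in enumerate(ranked, start=1):
            # A chunk near the top of many lists accumulates a high fused score.
            scores[chunk] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [chunk for chunk, _ in fused[:top_n]]

# Example: fuse the per-query results from the fan-out step.
# top_chunks = reciprocal_rank_fusion([retrieve(q) for q in variants])
```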

CHAIN OF THOUGHT PROMPTING

In this type of query translation pattern, we employ a method called ‘query decomposition’.

How it works:

Let’s say the user query is -
‘What is the ‘fs’ module in Node.js?’.

Here, we need to ‘decompose’ the query into a set of sub-queries such as:

  • ‘What is Node.js?’

  • ‘What is a module in Node.js?’

  • ….

  • ‘What is the ‘fs’ module in Node.js?’

For each of these sub-queries, get the answer from the LLM and feed the question and answer back to the LLM along with the next sub-query.

Follow this process until the last sub-query is reached, at which point the LLM has ample context to answer the user's query.
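A sketch of this sequential decomposition, again using the hypothetical `chat()` and `retrieve()` helpers from earlier. How the sub-questions are produced and how much earlier context is carried forward are design choices, not a fixed prescription.

```python
def decompose_and_answer(user_query):
    # 1. Ask the LLM to break the query into simpler sub-questions.
    sub_questions = chat(
        "Break this question into simpler sub-questions, from most basic "
        f"to most specific, one per line:\n{user_query}"
    ).splitlines()

    # 2. Answer each sub-question in order, feeding earlier Q&A pairs back in.
    history = ""
    for sub_q in sub_questions:
        context = "\n\n".join(retrieve(sub_q))
        sub_answer = chat(
            f"Previously answered:\n{history}\n\n"
            f"Context:\n{context}\n\nQuestion: {sub_q}"
        )
        history += f"\nQ: {sub_q}\nA: {sub_answer}"

    # 3. By the last sub-question the model has ample context for the original query.
    return chat(f"{history}\n\nNow answer the original question: {user_query}")
```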

STEP BACK PROMPTING

Step-back prompting is a technique used to improve the reasoning abilities of large language models (LLMs) by encouraging them to first consider a more abstract, higher-level concept before addressing the specific details of a complex question. It's akin to how humans might approach a difficult problem by first understanding the broader context or principle involved.

How it works:

Let’s say for a user query:

‘When was the last time a team from Canada won the Stanley Cup?’

In the step-back prompting method, the LLM is encouraged to generate higher-level, abstract queries related to this query before answering it (the model can be encouraged to think like this using few-shot prompting, i.e. by providing some examples). This way, if an answer is found before calling an agent for a Google search, it can be returned right away and the whole process feels faster. For example, the LLM would first check:

‘In which years has a team from Canada won the Stanley Cup, as of <knowledge cutoff year>?’.

Based on experiments conducted by Google, published in a paper titled ‘Take a Step Back’, it was found that:

… and observe substantial performance gains on various challenging reasoning-intensive tasks including STEM, Knowledge QA, and Multi-Hop Reasoning

Therefore, this prompting technique can be used when dealing with systems in the above mentioned areas.
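A minimal sketch of step-back prompting under the same assumptions as the earlier snippets; the few-shot example embedded in the prompt is a placeholder showing how the model can be nudged to abstract the question before answering it.

```python
STEP_BACK_EXAMPLES = (
    "Q: Could Einstein have used a smartphone?\n"
    "Step-back Q: What technology existed during Einstein's lifetime?\n"
)

def step_back_answer(user_query):
    # 1. Generate a more abstract, higher-level version of the question.
    step_back_q = chat(
        "Given a question, write a more general question whose answer helps "
        f"answer it.\n{STEP_BACK_EXAMPLES}\nQ: {user_query}\nStep-back Q:"
    )

    # 2. Retrieve context for both the abstract and the original question.
    context = "\n\n".join(retrieve(step_back_q) + retrieve(user_query))

    # 3. Answer the original question using the broader context.
    return chat(f"Context:\n{context}\n\nQuestion: {user_query}")
```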

HYPOTHETICAL DOCUMENT EMBEDDING (HyDE)

In simple words, hypothetical document embedding works by generating a "hypothetical" document from the user query. Note: this method only works with large language models, since the model must already know enough about the topic to write that hypothetical document.

How it works:

Let’s assume that we have built a video RAG in which the user can query and search based on the topics discussed in the video. Say the user wants to know where the speaker in the video talks about the ‘fs’ module. The query is quite abstract. Now, the LLM knows about Node.js and the ‘fs’ module, so we can ask the LLM to generate a document about the ‘fs’ module in Node.js (not shared with the user directly), create vector embeddings for this document, and use those embeddings for the similarity search within the vector database, which will now find the relevant chunks faster.

Do keep in mind that this won’t work on legal documents, or on other domains where the LLM cannot generate an accurate hypothetical document.
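A sketch of HyDE with the same hypothetical helpers: the LLM writes a short hypothetical passage about the topic, that passage (rather than the raw query) is embedded and used for the similarity search, and the final answer is generated from the real chunks that come back.

```python
def hyde_retrieve(user_query, k=3):
    # 1. Ask the LLM to write a hypothetical document answering the query.
    hypothetical_doc = chat(
        f"Write a short passage that answers this question:\n{user_query}"
    )

    # 2. Search with the hypothetical document instead of the query.
    #    (retrieve() embeds its argument and runs the similarity search.)
    chunks = retrieve(hypothetical_doc, k=k)

    # 3. Generate the final answer from the retrieved *real* chunks.
    context = "\n\n".join(chunks)
    return chat(f"Context:\n{context}\n\nQuestion: {user_query}")
```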

CONCLUSION

In this article, we took an introductory look at the theoretical foundation behind the query translation patterns used in advanced Retrieval-Augmented Generation (RAG) systems. We examined techniques that transform a simple, abstract user query into a more detailed and comprehensive one, allowing the RAG system to generate significantly better results. This transformation makes the system more robust and usable, and far more effective at delivering precise, relevant information in response to user queries.


REFERENCE

GenAI with Python with Hitesh Chaudhary and Piyush Garg
