Parallel (Fan Out) Query Retrieval: Query Transformation Technique

Garv
3 min read

📖Introduction

This article is part of the Advance RAG Series, which covers various tenets and features of advanced RAG systems along with diagrams and code. In this article, an intuitive query translation method, Parallel (Fan Out) Query Retrieval, is explained through a diagram and code.


🔍What is Parallel Query Retrieval?

Parallel Query Retrieval, as the name suggests, optimizes the retrieval part of RAG. As a refresher: a RAG pipeline first indexes the external knowledge data and ingests it into the vector store as vector embeddings. When a user query is received, a vector embedding of the query is created, the most relevant data chunks in the vector store are searched for, and those chunks are fed to the LLM as context. The LLM then responds to the user's prompt based on that context.
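The basic index-then-retrieve flow can be sketched in a few lines. The `embed` function below is a toy stand-in (a hash-style bag-of-letters vector); a real pipeline would use a proper embedding model and a vector database rather than an in-memory list.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy "embedding": normalized letter-frequency vector, for illustration only.
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# 1. Index: embed each knowledge chunk into the "vector store".
chunks = [
    "RAG retrieves relevant chunks",
    "Vector stores hold embeddings",
    "LLMs generate answers from context",
]
store = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Retrieve: embed the query and rank chunks by cosine similarity.
query = "how are embeddings stored"
q_vec = embed(query)
ranked = sorted(store, key=lambda item: float(q_vec @ item[1]), reverse=True)
context = ranked[0][0]  # top chunk, fed to the LLM as context
```

The top-ranked chunk becomes the context string passed to the LLM alongside the original prompt.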

In Parallel Query Retrieval, before performing a semantic search in the vector database, the user query is first rewritten into several alternative versions while keeping the original query intact. A similarity search is then run for each of these versions, so more relevant documents (vector embeddings) are retrieved. From all of the retrieved documents, the unique ones are selected and sent to the LLM, along with the original user query, for context and response generation.
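The "select the unique ones" step is a unique union over the per-query result lists. A minimal sketch, assuming each search returns a ranked list of document identifiers:

```python
def unique_union(result_lists):
    """Merge ranked results from several query variants, keeping the
    first occurrence of each document and its original relative order."""
    seen, merged = set(), []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Hits retrieved for the original query and two generated variants.
per_query_hits = [
    ["doc_a", "doc_b"],   # original query
    ["doc_b", "doc_c"],   # variant 1
    ["doc_a", "doc_d"],   # variant 2
]
context_docs = unique_union(per_query_hits)  # ['doc_a', 'doc_b', 'doc_c', 'doc_d']
```

Duplicates across variants are dropped, so the LLM receives each relevant chunk only once.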


🌀What is the Fan Out Architecture?

Fan-out architecture is a design where one component (like a server or process) distributes tasks or requests to multiple downstream components in parallel. It improves performance, scalability, and fault tolerance by allowing simultaneous processing, but requires coordination to manage responses and maintain system consistency.
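Applied to query retrieval, the fan-out means dispatching one similarity search per query version concurrently and gathering the results. A minimal sketch using a thread pool, with `search` as a hypothetical stand-in for a vector-store call:

```python
from concurrent.futures import ThreadPoolExecutor

def search(query: str) -> list[str]:
    # Hypothetical stand-in for a vector-store similarity search;
    # a real system would query the database here.
    return [f"hit for '{query}'"]

queries = ["variant 1", "variant 2", "variant 3"]

# Fan out: dispatch all searches in parallel, then gather results in order.
with ThreadPoolExecutor(max_workers=len(queries)) as pool:
    results = list(pool.map(search, queries))
```

`pool.map` preserves input order, which keeps the gather step simple; an async client would work equally well.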


⚡What is the effect of Parallel (Fan Out) Query Retrieval?

Since more query versions are used for the similarity search in the vector store, a wider range of data chunks is retrieved, which ultimately produces a better response to the user's original query.


📊💻Step By Step Working Through Diagram & Code

  1. From the user prompt, the LLM generates several similar queries (versions of the original query), say 3.

  2. For each of these queries, compute its vector embedding and perform a semantic search to retrieve relevant data chunks.

  3. Filter the retrieved data chunks down to the unique, relevant ones; these become the context for the original user query.

  4. Pass the original user prompt along with this context to the LLM to get the final response.
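The four steps above can be sketched end to end. `generate_variants`, `retrieve`, and `answer` are hypothetical stand-ins for the LLM rewrite prompt, the vector-store search, and the final generation call, so the skeleton runs on its own:

```python
def generate_variants(query: str, n: int = 3) -> list[str]:
    # Step 1: stand-in for prompting an LLM ("rewrite this question n ways").
    return [f"{query} (variant {i})" for i in range(1, n + 1)]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Step 2: stand-in for embedding the query and searching the vector store.
    return [f"chunk for '{query}' #{i}" for i in range(k)]

def answer(query: str, context: list[str]) -> str:
    # Step 4: stand-in for the final LLM call with context.
    return f"answer to '{query}' using {len(context)} chunks"

def parallel_query_retrieval(query: str) -> str:
    queries = [query] + generate_variants(query)           # step 1
    hits = [doc for q in queries for doc in retrieve(q)]   # step 2 (fan out)
    context = list(dict.fromkeys(hits))                    # step 3: unique docs
    return answer(query, context)                          # step 4
```

Swapping the three stubs for real LLM and vector-store calls (and running the retrieves concurrently, as in the fan-out sketch above) yields the full technique.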


📋Parallel Query Retrieval Output


  • PQR Code File

  • Advance RAG Article Series

  • Advance RAG Repository


🎯Conclusion

Through this article, you saw how to implement the Parallel (Fan Out) Query Retrieval technique in your RAG pipeline and make its responses more efficient and optimised.


Written by

Garv

A person trying to learn and question things.