Advanced RAG Patterns & Pipelines

RAG, which stands for Retrieval-Augmented Generation, is one of the most fascinating topics discussed in the Gen AI community. We covered the basics of RAG in our previous article. This article focuses on preparing production-level RAG pipelines that produce better content for user queries.

What’s wrong with the Basic RAG?

Basic RAG traditionally works in three major steps:

  1. Indexing

  2. Retrieval

  3. Generation

Here, we store the user's data in the form of vector embeddings; the user query is converted into an embedding as well, and a similarity search retrieves the semantically closest chunks from the vector DB, which an LLM then uses to generate the response.

Everything is fine until the user query becomes ambiguous. It may contain spelling mistakes that make it harder for the vector DB to find chunks similar to the query, or the user may not be sure what he/she actually wants to search for. In such cases, the vector DB can return no matching chunks, and the LLM responds that it has no information about the query in the provided context.

For example, a user has submitted a PDF on MERN development and then asks something like “can i get rid of nome moduls”, when what he/she actually meant was “can I get rid of node modules”. The misspelling could make the LLM say it knows nothing about this from the provided context. Another ambiguous case: a user has submitted information about a restaurant dish and asks “what did people most like?”. This can still cause problems because nothing semantically similar to that phrasing is stored in the vector database.

One well-known family of advanced RAG patterns is Corrective RAG (CRAG), which follows these steps (a minimal sketch of the loop comes after the list):

  1. Improving the Initial Retrieval and Generation

  2. Validation and Feedback Loop on the generated Response

  3. Error Detection and Correction

  4. Final Verification and Output
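
As a rough illustration of that loop, here is a minimal sketch in Python. The `llm`, `embed`, and `vector_search` helpers are assumed placeholders (no specific model, embedding API, or vector DB is implied), and the grading prompt is just one possible way to run the validation step:

```python
# A hedged sketch of a Corrective RAG loop. The three helpers below are
# assumed placeholders (not from any specific library); swap in your own
# LLM client, embedding model, and vector DB.

def llm(prompt: str) -> str:
    """Placeholder: call your chat-completion model and return its text."""
    raise NotImplementedError

def embed(text: str) -> list[float]:
    """Placeholder: return the embedding vector for `text`."""
    raise NotImplementedError

def vector_search(vector: list[float], top_k: int = 5) -> list[str]:
    """Placeholder: return the top_k most similar chunks from the vector DB."""
    raise NotImplementedError

def corrective_rag(query: str, max_retries: int = 2) -> str:
    answer = ""
    for _ in range(max_retries + 1):
        # 1. Initial retrieval and generation
        context = "\n\n".join(vector_search(embed(query)))
        answer = llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")

        # 2. Validation / feedback loop on the generated response
        verdict = llm(
            f"Does the answer below actually address '{query}'? "
            f"Reply RELEVANT or IRRELEVANT.\n\nAnswer: {answer}"
        )

        # 3. Error detection and correction: rewrite the query and retry
        if "IRRELEVANT" in verdict.upper():
            query = llm(f"Rewrite this query to be clearer and more specific: {query}")
            continue

        # 4. Final verification and output
        return answer
    return answer
```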

Knowing all these shortcomings, one thing we can all agree on is that improving the user's query can significantly improve our RAG pipeline. Improving the user query is also known as Query Translation.
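
A query-translation step can be as small as one extra LLM call that rewrites the raw query before it is embedded. A sketch, reusing the placeholder `llm` helper from above:

```python
# Query translation: clean up the raw query before retrieval.
# `llm` is the assumed placeholder helper defined in the earlier sketch.

def translate_query(raw_query: str) -> str:
    prompt = (
        "Rewrite the following user query so it is clear and unambiguous, "
        "fixing any spelling mistakes. Return only the rewritten query.\n\n"
        f"Query: {raw_query}"
    )
    return llm(prompt)

# e.g. translate_query("can i get rid of nome moduls") might come back as
# "Can I delete the node_modules folder in a MERN project?"
```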

Advanced RAG Patterns

In this article we discuss different ways to improve our RAG pipelines. The patterns covered here are not the only ones out there, and assuming the exact same ones will fit your use case would be a mistake; the focus is on understanding the crux of each approach.

Basic RAG:

Basic RAG is the simplest to implement, and we have already discussed in detail how it can be done. It works on the principle that we simply create an embedding from the user query, fetch the chunks most similar to it from the vector DB, and let the LLM generate the content from them.
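
Put together, the whole Basic RAG flow is only a few lines. A sketch using the same assumed placeholder helpers as above:

```python
# Basic RAG: embed the query, fetch similar chunks, generate from them.
# Reuses the placeholder embed / vector_search / llm helpers from above.

def basic_rag(query: str, top_k: int = 5) -> str:
    query_vector = embed(query)                    # indexing is assumed done; embed the query
    chunks = vector_search(query_vector, top_k)    # retrieval: similarity search in the vector DB
    context = "\n\n".join(chunks)
    return llm(                                    # generation: answer grounded in the chunks
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```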

Parallel Query (FAN OUT)

There could be instances where the user wants to know something from the content he/she shared earlier, but other related chunks aren't fetched because the user query didn't explicitly ask for them, even though they could add better and richer context.

Parallel Query works on the principle of generating multiple queries from a single user query using an LLM. This helps remove spelling mistakes that might have crept into the original query, and having multiple queries adds more context and removes ambiguity from the query.

Once there are multiple queries, a vector embedding is created for each query, the relevant chunks are found, and these are then provided to the LLM to generate the response.
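
A sketch of the fan-out step with the same assumed placeholders; deduplicating chunks by exact text is a simplification:

```python
# Parallel Query (fan-out): expand one query into several, retrieve for each,
# and pass the deduplicated union of chunks to the LLM.
# llm / embed / vector_search are the assumed placeholder helpers from above.

def fan_out_rag(query: str, n_variants: int = 3, top_k: int = 5) -> str:
    raw = llm(
        f"Generate {n_variants} alternative phrasings of this query, "
        f"correcting any spelling mistakes. Return one query per line.\n\nQuery: {query}"
    )
    variants = [line.strip() for line in raw.splitlines() if line.strip()]

    seen, merged_chunks = set(), []
    for q in [query, *variants]:
        for chunk in vector_search(embed(q), top_k):
            if chunk not in seen:          # naive dedup by exact chunk text
                seen.add(chunk)
                merged_chunks.append(chunk)

    context = "\n\n".join(merged_chunks)
    return llm(f"Context:\n{context}\n\nQuestion: {query}")
```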

Reciprocal Rank Fusion RAG

The RRF RAG works in a similar fashion to Parallel Query, with the key difference being that instead of passing all the retrieved chunks of data to the LLM, it filters and selects only the most relevant ones.

This filtering step is important for two reasons:

  1. Efficiency – By narrowing down the number of chunks, we reduce the overall token usage, which directly lowers cost and latency.

  2. Quality – Feeding too many chunks into the LLM can overwhelm the context window and lead to verbose or irrelevant answers. Passing only the most relevant information helps the model stay focused and precise.

When multiple retrievers are used in parallel, each one produces its own ranked list of results.
Instead of trusting just one system, Reciprocal Rank Fusion (RRF) combines these ranked lists into a single, more reliable ordering. If multiple systems think a chunk is important, it will consistently appear near the top of their lists. RRF rewards this consensus, surfacing results that are robust across different retrieval strategies and hence improving the overall content generation.
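
The fusion itself is a simple formula: each chunk's score is the sum of 1 / (k + rank) over every ranked list it appears in, with k commonly set around 60. A standard-library sketch:

```python
# Reciprocal Rank Fusion: score(chunk) = sum over rankings of 1 / (k + rank).
# Pure standard-library sketch; each ranking is an ordered list of chunk IDs.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest combined score first; consensus across retrievers wins.
    return sorted(scores, key=scores.get, reverse=True)

# Example: "b" is ranked well by all three retrievers, so it rises to the top.
print(rrf([["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]]))  # ['b', 'a', 'c', 'd']
```

Only the top few fused chunks would then be passed to the LLM, which is what keeps token usage and context-window pressure down.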

Chain Of Thought RAG

The Chain-of-Thought RAG (COT RAG) lets the LLM reason step by step instead of answering in one go. It is especially effective for complex queries that are better solved when broken into smaller steps. Unlike standard RAG, it doesn’t retrieve all chunks at once but instead fetches evidence for each sub-question iteratively. This approach not only improves accuracy on multi-hop queries but also reduces hallucinations, since every reasoning step is backed by retrieved evidence.

The evidence passed along to the next query could be either the actual chunks or the metadata related to those chunks, and this varies by use case.
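
A compressed sketch of the iterative retrieve-per-step loop with the same assumed placeholders; how you decompose the query and format the carried evidence will vary by use case:

```python
# Chain-of-Thought RAG: decompose the query, retrieve evidence per sub-question,
# and carry that evidence forward into the final answer.
# llm / embed / vector_search are the assumed placeholder helpers from above.

def cot_rag(query: str, top_k: int = 3) -> str:
    raw = llm(f"Break this question into 2-4 smaller sub-questions, one per line:\n{query}")
    sub_questions = [line.strip() for line in raw.splitlines() if line.strip()]

    evidence = []
    for sub_q in sub_questions:
        chunks = "\n".join(vector_search(embed(sub_q), top_k))   # retrieve per step
        step_note = llm(f"Using this evidence:\n{chunks}\n\nBriefly answer: {sub_q}")
        evidence.append(f"Q: {sub_q}\nA: {step_note}")           # evidence passed forward

    return llm(
        "Using the step-by-step findings below, answer the original question.\n\n"
        + "\n\n".join(evidence)
        + f"\n\nOriginal question: {query}"
    )
```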

Hypothetical Document Embeddings RAG

HyDE RAG is one of the most widely adopted RAG patterns in industry. It works on the principle that Large Language Models (LLMs), through their pre-training, already have rich background knowledge about most user queries. Instead of embedding the raw query directly, HyDE first asks the LLM to generate a hypothetical answer or document related to the query. This synthetic content is then embedded and used to search the vector database for the most relevant chunks of information.

For example, if the user query is:
“What are promises in JavaScript?”

  • In a traditional RAG setup, the query alone might return chunks only describing what promises are.

  • With HyDE RAG, the LLM first generates a detailed hypothetical explanation of promises, including their role in async/await, API calls, and real-world usage.

  • That generated text is then used to search the database, which results in retrieving richer and more diverse chunks (e.g., explanations, code examples, and advanced use cases). This ultimately allows the final LLM response to be more comprehensive and contextual.

However, if the synthetic document contains incorrect information, retrieval may be biased toward irrelevant or misleading content. It might also pull in too broad a set of chunks, diluting precision.

HyDE RAG leverages the LLM’s generative ability to bridge the gap between user intent and database retrieval, making results more useful and comprehensive.
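
In code, HyDE is only a couple of lines on top of Basic RAG. A sketch with the same assumed placeholder helpers:

```python
# HyDE: embed a hypothetical answer instead of the raw query, then retrieve.
# llm / embed / vector_search are the assumed placeholder helpers from above.

def hyde_rag(query: str, top_k: int = 5) -> str:
    hypothetical = llm(
        "Write a short, plausible passage that answers this question, "
        f"even if you have to guess details:\n{query}"
    )
    chunks = vector_search(embed(hypothetical), top_k)   # search with the synthetic doc
    context = "\n\n".join(chunks)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```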

Trade-offs in Advanced RAG Mechanisms

While advanced RAG patterns (like HyDE, RRF, and COT-RAG) improve the quality and reliability of retrieval, they come with trade-offs. Each additional reasoning or retrieval step adds latency, cost, and complexity, and in some cases, may risk overloading the LLM with too much or misleading context. The key is to balance precision with efficiency based on the use case.

  • HyDE RAG

    • ✅ Better recall, richer context

    • ⚠️ Extra LLM call → higher cost & latency

    • ⚠️ Risk of hallucinated synthetic documents biasing retrieval

  • Reciprocal Rank Fusion (RRF) RAG

    • ✅ Aggregates multiple retrieval rankings for robustness

    • ⚠️ Needs more compute for ranking

    • ⚠️ Risk of over-fetching chunks (more tokens → higher cost)

  • Chain-of-Thought (COT) RAG

    • ✅ Breaks complex queries into steps → reduces hallucination, improves reasoning

    • ⚠️ Multiple queries + evidence passing → slower responses

    • ⚠️ Complexity in pipeline design

  • Parallel Query RAG

    • ✅ Fast multi-perspective retrieval

    • ⚠️ May overload the LLM with redundant or irrelevant chunks

Conclusion

We have discussed some of the most used RAG patterns, but there is no single pattern that can solve every query; instead, you can use a combination of RAG patterns to get more relevant context from the vector DB, which you then pass on to the LLM for content generation. Our next step will be getting our hands dirty with a POC of these concepts, so stay tuned, and drop any other popular RAG pattern you are aware of, or any improvements to the current ones.

Written by

Saurav Pratap Singh