Advanced RAG: Query Translation Patterns

As the title suggests, this article is about “Advanced” RAG, which can feel quite overwhelming. So we will start with the basics: the concept of RAG and why it is one of the most widely used and in-demand approaches for leveraging LLMs to solve various business use cases by combining information retrieval and text generation.
Retrieval-Augmented Generation (RAG) is one of the most powerful and innovative approaches for getting contextually relevant output from LLMs, particularly in the field of Natural Language Processing (NLP).
As mentioned earlier, it involves both information retrieval and text generation.
The basic structure of RAG involves three parts:
Indexing
Retrieval
Generation
Indexing
This is the stage where large amounts of data are processed and organized. It relies mostly on embedding techniques, which convert textual information into numerical vectors, making it easier to search, manage, and retrieve meaningful, contextualized information.
Retrieval
Here the system uses similarity measures to get the most semantically meaningful and contextually relevant information from the indexed data.
This ensures that the LLM used for text generation receives the best possible context and can produce coherent, contextually appropriate responses.
Generation
In this phase, the large language model (LLM) is deployed and used. The LLM leverages the retrieved context to generate relevant, contextually accurate responses that align with the user query.
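To make the three stages concrete, here is a minimal end-to-end sketch using the same stack as the code later in this article (LangChain, Gemini embeddings, and the Gemini client). The in-memory vector store, the sample texts, and the placeholder API key are illustrative assumptions, not part of any particular pipeline:

from langchain_core.vectorstores import InMemoryVectorStore
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from google import genai
from google.genai import types

# Placeholder API key - replace with your own
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004", google_api_key="GOOGLE_GEMINI_API"
)
client = genai.Client(api_key="GOOGLE_GEMINI_API")

# 1. Indexing: embed a few sample texts and keep the vectors in memory
texts = [
    "RAG combines information retrieval with text generation.",
    "Qdrant is a vector database used to store embeddings.",
]
store = InMemoryVectorStore.from_texts(texts, embedding=embeddings)

# 2. Retrieval: fetch the chunks most similar to the user query
query = "What does RAG combine?"
context = store.similarity_search(query, k=1)

# 3. Generation: answer the query using the retrieved context
response = client.models.generate_content(
    model="gemini-2.0-flash-001",
    contents=query,
    config=types.GenerateContentConfig(
        system_instruction=f"Answer using this context: {[d.page_content for d in context]}"
    ),
)
print(response.text)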
Advanced RAG
Now that we know the basics of RAG, we can get into the “Advanced” part with better context.
So why do we need advanced RAG? Put simply, every RAG system is quite specific to the business use case it is trying to solve, and accuracy plays a major role in that context. Therefore, to make these systems more accurate and efficient in real-world scenarios, Advanced RAG techniques are used to optimize the pipeline, making the responses more accurate and reliable and aiming for the best possible output.
Advanced RAG adds the following techniques to the list we discussed under basic RAG:
Query Transformation
Routing
Query Construction
Indexing
Retrieval
Generation
As the title suggests, this article focuses on Query Translation/Transformation.
Here, the query is the user prompt provided as input to the RAG system. It can be quite ambiguous and can carry different meanings depending on the context. To extract the most accurate meaning from the user prompt, we can apply techniques that “translate” it into something that aligns closely with what the user actually wants from the system. This significantly improves the accuracy and reliability of our RAG system.
Query Transformation, in essence, improves the user prompt (query) so that it becomes more semantically meaningful, allowing the system to retrieve the closest and most accurate information from the high-dimensional embedding space and thereby significantly enhancing the accuracy of the generated output.
Parallel Query Retrieval (Fan Out)
This technique simply breaks the user query down into sub-parts, which can be called sub-queries, to get a better idea of what the user really wants to know.
Instead of using a single query, we “fan out” the query into multiple rephrased or semantically related queries, like this:
Break the original query into simpler, related sub-queries.
Send each of them independently to the retriever.
The retriever answers these sub-queries in parallel, and the results are combined to provide a more accurate and comprehensive response.
Example :
Query: "How to train a transformer model?"
Fan-out Queries:
- "Best practices for fine-tuning transformers"
- "Transformer architecture training guide"
- "Steps to train BERT or GPT models"
CODE :
from pathlib import Path # File Path
from langchain_community.document_loaders import PyPDFLoader # Loader
from langchain_text_splitters import RecursiveCharacterTextSplitter # Text Splitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings # Google Embedding
from langchain_qdrant import QdrantVectorStore # Vector Store
# GOOGLE GENERATIVE AI
from google import genai
from google.genai import types
from concurrent.futures import ThreadPoolExecutor #Multithreading
from itertools import chain #Flatten
import ast #Parsing
# === CONFIGURATION ===
# Initialize the Gemini client with your API key
genai_client = genai.Client(api_key='GOOGLE_GEMINI_API')
# Google Generative AI Embeddings
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004",
    google_api_key="GOOGLE_GEMINI_API"
)
# === INDEXING PART ===
# Data Source - PDF
pdf_path = Path(__file__).parent / "file_path.pdf"
# Load the document from the PDF file
loader = PyPDFLoader(file_path=pdf_path)
docs = loader.load()
# Split the document into smaller chunks, Adjust chunk_size and chunk_overlap
# according to your need
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_docs = text_splitter.split_documents(documents=docs)
# Create a vector store - if collection exists
# vector_store = QdrantVectorStore.from_existing_collection(
# url="http://localhost:6333",
# collection_name="collection_name",
# embedding=embeddings
# )
# Create a new vector store - if collection doesn't already exist
vector_store = QdrantVectorStore.from_documents(
    documents=[],
    url="http://localhost:6333",
    collection_name="collection_name",  # Name of your collection in Qdrant
    embedding=embeddings
)
# Add the documents to the vector store
vector_store.add_documents(split_docs)
# === RETRIEVAL PART ===
retriever = QdrantVectorStore.from_existing_collection(
    url="http://localhost:6333",
    collection_name="collection_name",  # Name of your collection in Qdrant
    embedding=embeddings
)
user_query = "Can you explain how the File System module works in Node.js?" # User Query
# === SUB-QUERY EXTRACTION USING GEMINI ===
# System prompt for breaking down the user's query into sub-queries
system_prompt_for_subqueries = """
You are a helpful AI Assistant.
Your task is to take the user query and break it down into different sub-queries.
Rule:
Minimum Sub Query Length :- 3
Maximum Sub Query Length :- 5
Example:
Query: How to become GenAI Developer?
Output: [
"How to become GenAI Developer?",
"What is GenAI?",
"What is Developer?",
"What is GenAI Developer?",
"Steps to become GenAI Developer."
]
"""
# Call Gemini API to break down the user's query into sub-queries
breakdown_response = genai_client.models.generate_content(
    model='gemini-2.0-flash-001',
    contents=f"Query: {user_query}",
    config=types.GenerateContentConfig(system_instruction=system_prompt_for_subqueries)
)
# Convert the Gemini response to a Python list (parse the output safely)
sub_queries = ast.literal_eval(breakdown_response.text.strip())
print("Sub Queries:", sub_queries)
# === PARALLEL VECTOR RETRIEVAL ===
# Function to retrieve relevant document chunks for each sub-query
def retrieve_chunks(query):
    return retriever.similarity_search(query=query)
# Use ThreadPoolExecutor to perform parallel retrieval of chunks for each sub-query
with ThreadPoolExecutor() as executor:
    all_chunks = list(executor.map(retrieve_chunks, sub_queries))
# Flatten the list of results (if there are multiple chunks per sub-query)
flattened_chunks = list(chain.from_iterable(all_chunks))
# Optionally remove duplicate chunks (based on content)
unique_chunks = list({doc.page_content: doc for doc in flattened_chunks}.values())
# === Generation Part ===
# Prepare the final system prompt with the unique relevant document chunks
final_system_prompt = f"""
You are a helpful assistant who answers the user's query using the following pieces of context.
If you don't know the answer, just say you don't know — don't make up an answer.
Context:
{[doc.page_content for doc in unique_chunks]}
"""
# Send the final request to Gemini for generating the response using the relevant context
final_response = genai_client.models.generate_content(
    model='gemini-2.0-flash-001',
    contents=user_query,  # The original user query
    config=types.GenerateContentConfig(system_instruction=final_system_prompt)
)
# Output the final response
print("\nFinal Answer:\n")
print(final_response.text)
Reciprocal Rank Fusion
The retrieval stage is the most important part of a RAG system if it fails to fetch the most relevant documents then the precision becomes quite low and hallucination likelihood increases.
To solve this issue a rank aggregation method called Reciprocal Rank Fusion(RRF) is used to combine rankings from multiple sources into a single, unified ranking.
This is how RRF works in RAG:
User Query: The process begins when a user inputs a question or query.
Multiple Retrievers: The query is sent to multiple retrievers. These could be different retrieval models (e.g., dense, sparse, hybrid).
Individual Rankings: Each retriever produces its own ranking of relevant documents.
RRF Fusion: The rankings from all retrievers are combined using the RRF formula.
Final Ranking: A unified ranking is produced based on the RRF scores.
Generation: The generative model uses the top-ranked documents to produce the final answer.
Pseudo-Code for RRF
def calculate_rrf(rankings, k):
    scores = {}
    # Collect all documents across all rankers
    all_documents = set()
    for ranker in rankings.values():
        all_documents.update(ranker.keys())
    # Initialize scores for all documents
    for doc in all_documents:
        scores[doc] = 0.0
    # Compute Reciprocal Rank Fusion scores
    for ranker in rankings:
        for doc, rank in rankings[ranker].items():
            scores[doc] += 1 / (k + rank)
    return scores

def get_final_ranking(scores):
    # Sort documents by score in descending order
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
Example Usage
rankings = {
    'ranker1': {'doc1': 1, 'doc2': 2, 'doc3': 3},
    'ranker2': {'doc2': 1, 'doc1': 2, 'doc4': 3},
    'ranker3': {'doc3': 1, 'doc4': 2, 'doc1': 3},
}
k = 60
rrf_scores = calculate_rrf(rankings, k)
final_ranking = get_final_ranking(rrf_scores)
print("RRF Scores:", rrf_scores)
print("Final Ranking:", final_ranking)
Step-Back Prompting
This is a prompting technique primarily designed to improve and optimize LLM outputs on complex tasks by “stepping back” and focusing on abstract principles before reasoning.
This helps prevent errors in intermediate steps and leads to more accurate reasoning.
This involves two steps:
Abstraction: The model is prompted to focus on a higher-level concept or principle related to the question.
Reasoning: Once the high-level abstraction is retrieved, the model uses it to reason through the specifics of the original question.
For example:
Original Question: "What happens to the pressure of an ideal gas if the temperature is increased by a factor of 2 and the volume is increased by a factor of 8?"
Step-Back Abstraction: "What are the principles or fundamental concepts involved in this problem?" (Ideal Gas Law)
Final Answer: Using the Ideal Gas Law, the model can calculate the correct answer: the pressure decreases by a factor of 4.
CODE
from langchain_openai import ChatOpenAI
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage

# Initialize the chat model (gpt-4o here) from LangChain
llm = ChatOpenAI(model="gpt-4o", temperature=0.5)

def step_back_prompting_with_langchain(query: str):
    # System primer to guide the assistant
    messages = [
        SystemMessage(content="You are a reasoning tutor that helps break down complex problems step by step.")
    ]
    # Step 1: Abstract to the underlying principle
    messages.append(HumanMessage(content=f"Original Question: {query}\nWhat fundamental concept or law applies here?"))
    step1_response = llm.invoke(messages)
    print("\n--- Step 1: Underlying Principle ---")
    print(step1_response.content)
    # Step 2: Use the principle to solve the original problem
    messages.append(AIMessage(content=step1_response.content))
    messages.append(HumanMessage(content="Now use that principle to solve the problem logically."))
    step2_response = llm.invoke(messages)
    print("\n--- Step 2: Final Reasoning ---")
    print(step2_response.content)

# Example scientific question (the ideal gas problem from above)
science_question = (
    "What happens to the pressure of an ideal gas if the temperature is increased by a factor of 2 "
    "and the volume is increased by a factor of 8?"
)

# Run the Step-Back Prompting pipeline
step_back_prompting_with_langchain(science_question)
CoT - Chain of Thought
This technique improves the performance of LLMs by explicitly prompting the model to generate a step-by-step explanation or reasoning process before arriving at a final answer. It helps the model break the problem down without skipping intermediate steps, which avoids reasoning failures.
CoT is an effective prompting technique because it focuses the attention mechanism of the LLM: by decomposing the reasoning process, the model attends to one part of the problem at a time, minimizing the risk of errors and hallucinations that can arise from handling too much information simultaneously.
It was introduced back in 2022 in the paper Chain-of-Thought Prompting Elicits Reasoning in Large Language Models by researchers at Google.
How does CoT work?
1. Explicit Instructions
This directly guides the model to think step-by-step by breaking down the task into sub-tasks in the prompt itself.
Example:
"List coffee words" (on a non-English input text) → incorrect output.
“1. Translate to English.
2. List coffee words." → correct output.
This encourages the model to follow a logical sequence and solve complex tasks more reliably, with fewer hallucinations.
2. Implicit Instructions
Adding a simple cue like “Let’s think step by step” at the end of the prompt is considered good practice when implementing CoT.
This, in turn, triggers the model's internal reasoning mechanism without needing to spell out the steps.
This practice has shown a significant boost in accuracy on math problems, from 18% to 79% (as shown in research by Google & the University of Tokyo).
3. Demonstrative Examples (Few-Shot / One-Shot)
Showing the model example(s) of similar tasks before asking the real question adds more context.
One-shot: 1 example; Few-shot: multiple examples.
Few-shot CoT: Combines the step-by-step explanations with few-shot examples to guide the model more reliably in generating accurate responses.
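As a minimal illustration, here is a zero-shot CoT sketch that simply appends the “Let’s think step by step” cue, using the same ChatOpenAI model as the Step-Back example above; the sample question and its numbers are purely illustrative:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

llm = ChatOpenAI(model="gpt-4o", temperature=0)

question = (
    "A cafe sells 150 coffees a day at $4 each and its daily costs are $420. "
    "How much profit does it make in a 30-day month?"
)

# Implicit CoT cue appended to the question
messages = [
    SystemMessage(content="You are a careful assistant that reasons before answering."),
    HumanMessage(content=f"{question}\nLet's think step by step."),
]

response = llm.invoke(messages)
print(response.content)  # Prints the intermediate reasoning followed by the final answer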
HyDE - Hypothetical Document Embedding
When working with a similarity-search-based index, like a vector store, searching with the raw question often does not produce accurate results, because the question's embedding may not be very similar to the embeddings of the relevant documents. Instead, if the model first generates a hypothetical document that would answer the question and that document is used for the similarity search, the retrieval makes more sense and generates better outputs. This is the key idea behind HyDE.
HyDE uses a Large Language Model to create a hypothetical document in response to a query.
An unsupervised encoder then converts this hypothetical document into an embedding vector, which is used to locate similar documents in a vector database.
Rather than relying on question-to-answer embedding similarity, it focuses on answer-to-answer embedding similarity.
Its performance is therefore robust, matching well-tuned retrievers on various tasks such as web search, QA, and fact verification.
Setup
# %pip install -qU langchain langchain-openai
Set your OpenAI environment variables here:
import getpass
import os
os.environ["OPENAI_API_KEY"] = getpass.getpass()
# Optional, uncomment to trace runs with LangSmith. Sign up here: https://smith.langchain.com.
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
system = """You are an expert about a set of software for building LLM-powered applications called LangChain, LangGraph, LangServe, and LangSmith.
LangChain is a Python framework that provides a large set of integrations that can easily be composed to build LLM applications.
LangGraph is a Python package built on top of LangChain that makes it easy to build stateful, multi-actor LLM applications.
LangServe is a Python package built on top of LangChain that makes it easy to deploy a LangChain application as a REST API.
LangSmith is a platform that makes it easy to trace and test LLM applications.
Answer the user question as best you can. Answer as though you were writing a tutorial that addressed the user question."""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)
llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
qa_no_context = prompt | llm | StrOutputParser()
answer = qa_no_context.invoke(
    {
        "question": "how to use multi-modal models in a chain and turn chain into a rest api"
    }
)
print(answer)
OUTPUT
To use multi-modal models in a chain and turn the chain into a REST API, you can leverage the capabilities of LangChain, LangGraph, and LangServe. Here's a step-by-step guide on how to achieve this:
1. **Building a Multi-Modal Model with LangChain**:
- Start by defining your multi-modal model using LangChain. LangChain provides integrations with various deep learning frameworks like TensorFlow, PyTorch, and Hugging Face Transformers, making it easy to incorporate different modalities such as text, images, and audio.
- You can create separate components for each modality and then combine them in a chain to build a multi-modal model.
2. **Building a Stateful, Multi-Actor Application with LangGraph**:
- Once you have your multi-modal model defined in LangChain, you can use LangGraph to build a stateful, multi-actor application around it.
- LangGraph allows you to define actors that interact with each other and maintain state, which is useful for handling multi-modal inputs and outputs in a chain.
3. **Deploying the Chain as a REST API with LangServe**:
- After building your multi-modal model and application using LangChain and LangGraph, you can deploy the chain as a REST API using LangServe.
- LangServe simplifies the process of exposing your LangChain application as a REST API, allowing you to easily interact with your multi-modal model through HTTP requests.
4. **Testing and Tracing with LangSmith**:
- To ensure the reliability and performance of your multi-modal model and REST API, you can use LangSmith for testing and tracing.
- LangSmith provides tools for tracing the execution of your LLM applications and running tests to validate their functionality.
By following these steps and leveraging the capabilities of LangChain, LangGraph, LangServe, and LangSmith, you can effectively use multi-modal models in a chain and turn the chain into a REST API.
Returning the hypothetical document and original question
from langchain_core.runnables import RunnablePassthrough
hyde_chain = RunnablePassthrough.assign(hypothetical_document=qa_no_context)
hyde_chain.invoke(
    {
        "question": "how to use multi-modal models in a chain and turn chain into a rest api"
    }
)
OUTPUT
{'question': 'how to use multi-modal models in a chain and turn chain into a rest api',
'hypothetical_document': "To use multi-modal models in a chain and turn the chain into a REST API, you can leverage the capabilities of LangChain, LangGraph, and LangServe. Here's a step-by-step guide on how to achieve this:\n\n1. **Set up your multi-modal models**: First, you need to create or import your multi-modal models. These models can include text, image, audio, or any other type of data that you want to process in your LLM application.\n\n2. **Build your LangGraph application**: Use LangGraph to build a stateful, multi-actor LLM application that incorporates your multi-modal models. LangGraph allows you to define the flow of data and interactions between different components of your application.\n\n3. **Integrate your models in LangChain**: LangChain provides integrations for various types of models and data sources. You can easily integrate your multi-modal models into your LangGraph application using LangChain's capabilities.\n\n4. **Deploy your LangChain application as a REST API using LangServe**: Once you have built your multi-modal LLM application using LangGraph and LangChain, you can deploy it as a REST API using LangServe. LangServe simplifies the process of exposing your LangChain application as a web service, making it accessible to other applications and users.\n\n5. **Test and trace your application using LangSmith**: Finally, you can use LangSmith to trace and test your multi-modal LLM application. LangSmith provides tools for monitoring the performance of your application, debugging any issues, and ensuring that it functions as expected.\n\nBy following these steps and leveraging the capabilities of LangChain, LangGraph, LangServe, and LangSmith, you can effectively use multi-modal models in a chain and turn the chain into a REST API."}
Using function-calling to get structured output
from langchain_core.output_parsers.openai_tools import PydanticToolsParser
from langchain_core.pydantic_v1 import BaseModel, Field
class Query(BaseModel):
    answer: str = Field(
        ...,
        description="Answer the user question as best you can. Answer as though you were writing a tutorial that addressed the user question.",
    )
system = """You are an expert about a set of software for building LLM-powered applications called LangChain, LangGraph, LangServe, and LangSmith.
LangChain is a Python framework that provides a large set of integrations that can easily be composed to build LLM applications.
LangGraph is a Python package built on top of LangChain that makes it easy to build stateful, multi-actor LLM applications.
LangServe is a Python package built on top of LangChain that makes it easy to deploy a LangChain application as a REST API.
LangSmith is a platform that makes it easy to trace and test LLM applications."""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)
llm_with_tools = llm.bind_tools([Query])
hyde_chain = prompt | llm_with_tools | PydanticToolsParser(tools=[Query])
hyde_chain.invoke(
    {
        "question": "how to use multi-modal models in a chain and turn chain into a rest api"
    }
)
OUTPUT
[Query(answer='To use multi-modal models in a chain and turn the chain into a REST API, you can follow these steps:\n\n1. Use LangChain to build your multi-modal model by integrating different modalities such as text, image, and audio.\n2. Utilize LangGraph, a Python package built on top of LangChain, to create a stateful, multi-actor LLM application that can handle interactions between different modalities.\n3. Once your multi-modal model is built using LangChain and LangGraph, you can deploy it as a REST API using LangServe, another Python package that simplifies the process of creating REST APIs from LangChain applications.\n4. Use LangSmith to trace and test your multi-modal model to ensure its functionality and performance.\n\nBy following these steps, you can effectively use multi-modal models in a chain and turn the chain into a REST API.')]
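The snippets above stop at producing the hypothetical answer. To complete the HyDE loop, that answer (not the raw question) is embedded and used for the similarity search. Here is a minimal sketch, assuming an OpenAI embedding model and a small in-memory vector store with placeholder texts; in practice you would search your own document collection:

from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

# Placeholder store - replace the texts with your real documents
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = InMemoryVectorStore.from_texts(
    [
        "LangServe lets you deploy a LangChain runnable as a REST API.",
        "LangGraph builds stateful, multi-actor LLM applications on top of LangChain.",
    ],
    embedding=embeddings,
)

question = "how to use multi-modal models in a chain and turn chain into a rest api"

# 1. Generate the hypothetical document (reusing the qa_no_context chain defined above)
hypothetical_doc = qa_no_context.invoke({"question": question})

# 2. Retrieve using the hypothetical document instead of the original question
docs = vector_store.similarity_search(hypothetical_doc, k=2)

# 3. Answer the original question with the retrieved documents as context
context = "\n".join(doc.page_content for doc in docs)
final_answer = llm.invoke(
    f"Answer the question using only this context:\n{context}\n\nQuestion: {question}"
)
print(final_answer.content)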