Parallel Query (Fan Out) Retrieval


This blog is part of a series that started with the basics of RAG. The series now deep dives into RAG fine-tuning techniques; this part focuses on Parallel Query (Fan Out) Retrieval. If you are new to RAG, please refer to RAG (Retrieval Augmented Generation) Basics. The code for this blog is available at https://github.com/ashutoshmca/RAG/
Query Transformation
When a user submits a query to an LLM, the query may be abstract or ambiguous. In RAG, the LLM produces its output based on the user query, and a query that is too vague may not reflect the user's actual intent: a garbage query can lead to garbage output, whereas a well-refined query leads to better output.
The user may have written a naïve query; the idea behind RAG fine-tuning techniques is to improve the query and thereby provide better responses to the user.
Generating a good response may require both abstraction over the user query and additional detail: abstraction provides an overview, while detail adds depth. User queries can be re-written broadly in two ways:
RAG Fusion
Multi query
We will cover the following RAG fine-tuning techniques over the next few blogs:
Parallel Query (Fan Out) Retrieval
Reciprocal Rank Fusion
Step Back Prompting
CoT - Chain of Thought
HyDE - Hypothetical Document Embeddings
This blog focuses on Parallel Query (Fan Out) Retrieval.
Parallel Query (Fan Out) Retrieval
This technique re-writes the user query by generating multiple queries from it; each generated query retrieves a different set of chunks from the vector store. The chunks retrieved for all the queries are then provided to the LLM along with the original user query, which gives the LLM better context than basic RAG.
The name fan-out is taken from the fan-out pattern used in messaging, where an event or message is sent to multiple destinations in parallel.
The user asks a query through the application or bot.
A prompt, along with the user query, is given to the LLM to generate multiple queries.
An example of a prompt that can be used to generate multiple queries from the user query:
MULTI_QUERY_PROMPT = """You are a helpful assistant that helps in refining user query.
You receive a query and you need to generate {n} number of questions that are more accurate to represent the query of the user.
Output the answer in a JSON format.
Example: "What is the capital of France?"
Answer: {{
"queries": [
"What is the capital city of France?",
"Can you tell me the capital of France?",
"What city serves as the capital of France?"
]
}}
"""
The prompt, along with the user query, is then submitted to the LLM, and the LLM produces multiple queries in the response.
import json
from openai import OpenAI

client = OpenAI()

# Helper that sends the messages to the LLM and returns the raw response
def query_to_LLM(messages):
    result = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=messages
    )
    return result

query = "Provide summary of the document"
messages = [
    {"role": "system", "content": MULTI_QUERY_PROMPT.format(n=3)},
    {"role": "user", "content": query}
]

result = query_to_LLM(messages)
json_response = result.choices[0].message.content
print("Response:", json_response)

# Parse the JSON response to get the list of generated queries
data = json.loads(json_response)
queries = data["queries"]
print(queries)
The queries are converted to vector embeddings through an embedding model. Note the use of the "text-embedding-3-large" model for embeddings.
The vector embedding of each query is used to search for the most relevant content/chunks in the vector store. Note the use of the Qdrant vector store, which has already been indexed with the document. Please refer to RAG (Retrieval Augmented Generation) Basics for details on deploying a Qdrant vector store and indexing documents.
The chunks are retrieved for each query.
Unique chunks are identified from the chunks returned by each query. Note the use of set() in the code to deduplicate the chunks.
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore

# Embedding model used to embed each generated query
embedder = OpenAIEmbeddings(
    model="text-embedding-3-large"
)

# Connect to the existing Qdrant collection that is already indexed with the document
retriever = QdrantVectorStore.from_existing_collection(
    url="http://localhost:6333",
    collection_name="learning_langchain",
    embedding=embedder
)
...
...
# Retrieve relevant chunks for each generated query and keep only the unique ones
search_results = set()
for generated_query in queries:
    print(generated_query)
    search_result = retriever.similarity_search(
        query=generated_query
    )
    print("Relevant Chunks", search_result)
    for doc in search_result:
        search_results.add(doc.page_content)

search_results_string = "\n".join(search_results)
print("Search Results:", search_results_string)
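The loop above runs the retrievals one after another. Since each generated query is retrieved independently, the fan-out can also be executed in parallel, in the spirit of the messaging fan-out pattern mentioned earlier. Below is a minimal sketch of this idea using Python's ThreadPoolExecutor with the retriever and queries defined above; the concurrency is an illustration of the pattern and not part of the original code.

from concurrent.futures import ThreadPoolExecutor

def retrieve_chunks(generated_query):
    # Each generated query is searched against the vector store independently
    return retriever.similarity_search(query=generated_query)

# Fan out one retrieval call per generated query in parallel
with ThreadPoolExecutor(max_workers=len(queries)) as executor:
    results_per_query = list(executor.map(retrieve_chunks, queries))

# Deduplicate the retrieved chunks across all queries
search_results = set()
for search_result in results_per_query:
    for doc in search_result:
        search_results.add(doc.page_content)
search_results_string = "\n".join(search_results)

Either way, the result is the same set of unique chunks that is passed to the LLM in the next step.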
The unique chunks, along with the system prompt (represented as 8a in the diagram) and the query (represented as 8b in the diagram), are provided to the LLM.
SYSTEM_PROMPT = """You are a helpful assistant that helps the user to learn details only within the provided context.
If the query is out of context then, say "I don't know".
You are not allowed to make any assumptions or guesses.
Output the answer in a JSON format.
context:
{context}
"""
The LLM uses the relevant chunks, along with the system prompt and the query, to generate the output.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT.format(context=search_results_string)},
    {"role": "user", "content": query}
]

query_response = query_to_LLM(messages)
print("Response:", query_response.choices[0].message.content)
The response is provided back to the user.
We can now compare the output generated by basic RAG with the output generated by RAG fine-tuned through Parallel Query retrieval, for the same query, i.e. a request to summarize the document.
Response from Parallel Query retrieval:
Response: {
"summary": "The document discusses various methods for software architecture assessment.
It covers the ABAS (Attribute-Based Architecture Styles) approach, detailing its sections
like problem description, stimuli/response measures, architectural style, and analysis.
It also describes SAAM (Software Architecture Analysis Method) and
ATAM (Architecture Tradeoff Analysis Method), highlighting their focus areas and activities.
The document explores how these methods evaluate architecture based on quality attributes
such as security, performance, and modifiability, helping identify strengths and weaknesses in
architectural designs."
}
Response from basic RAG for the same query, i.e. to generate a summary of the PDF document:
Response: {
"summary": "The document 'Architecture Assessment Frameworks Comparative Analysis' written by
Ashutosh Gupta, discusses the evaluation of architecture against quality goals such as performance,
scalability, security, reliability, modifiability, and usability. It compares various
architecture assessment methods, detailing when each might be suitable. The document also covers
ABAS descriptions and SAAM activities. ABAS includes problem description, stimulus/response measures,
architectural style, and analysis. SAAM involves characterizing functional partitioning,
mapping it onto architecture, selecting quality attributes and tasks, and evaluating architectural
support for these tasks."
}
You may notice that the output from Parallel Query retrieval is better than that from basic RAG.
Summary
The Parallel Query (Fan Out) Retrieval technique creates multiple queries from the user query to get more relevant output from the LLM. It searches the vector store for chunks relevant to each generated query, then provides the unique chunks from all the queries, along with the original user query, to the LLM to get the response. The responses are better than those produced by the basic RAG approach.
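Putting the pieces together, the whole flow can be wrapped in a single helper. The sketch below reuses the names introduced earlier in this post (MULTI_QUERY_PROMPT, SYSTEM_PROMPT, query_to_LLM and retriever); the function name parallel_query_rag is only illustrative and not part of the repository code.

import json

def parallel_query_rag(user_query, n_queries=3):
    # Step 1: ask the LLM to generate multiple refined queries from the user query
    result = query_to_LLM([
        {"role": "system", "content": MULTI_QUERY_PROMPT.format(n=n_queries)},
        {"role": "user", "content": user_query}
    ])
    queries = json.loads(result.choices[0].message.content)["queries"]

    # Step 2: retrieve chunks for each generated query and keep only the unique ones
    unique_chunks = set()
    for generated_query in queries:
        for doc in retriever.similarity_search(query=generated_query):
            unique_chunks.add(doc.page_content)

    # Step 3: answer the original user query using the combined context
    answer = query_to_LLM([
        {"role": "system", "content": SYSTEM_PROMPT.format(context="\n".join(unique_chunks))},
        {"role": "user", "content": user_query}
    ])
    return answer.choices[0].message.content

print(parallel_query_rag("Provide summary of the document"))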
Reference
RAG (Retrieval Augmented Generation) Basics