Reciprocal Rank Fusion

Ashutosh Gupta
6 min read

This blog is part of a series that started with the basics of RAG. The series then moved into RAG fine-tuning techniques and covered Parallel Query (Fan Out) Retrieval. In this post we cover Reciprocal Rank Fusion as the next fine-tuning technique for RAG.

To learn the basics of RAG, please refer to RAG (Retrieval Augmented Generation) Basics. To learn about Parallel Query (Fan Out) Retrieval, please refer to Parallel Query (Fan Out) Retrieval. The code for this blog is available at github.com/ashutoshmca/RAG

Reciprocal Rank Fusion

In Parallel Query (Fan Out) Retrieval, all the chunks retrieved from the vector database for each generated query are passed to the LLM. This has a drawback: among the chunks returned by the vector store for the multiple parallel queries, the pipeline has no notion of which chunks rank higher and align best with the response the user actually needs.

Reciprocal Rank Fusion generates multiple queries from the user query, just like Parallel Query (Fan Out) Retrieval. In this case, however, the chunks retrieved from the vector store for the multiple queries are first ranked and only then provided to the LLM along with the user query. Because the generated context carries a ranking, the results are better than both basic RAG and Parallel Query Retrieval.

Example of ranking

Let’s assume the process generated three different queries: Query1, Query2 and Query3. Each of these queries returned a different set of chunks from the vector store.

Query1 provided chunks: C1, C3

Query2 provided chunks: C2, C1

Query3 provided chunks: C1, C2, C3

Among these chunks, the number of query result lists a chunk appears in, together with its position within each list, determines its overall rank. From the results above, C1 appears in all three query results, so it ranks higher than the other two. C2 and C3 each appear in two query results, but C2 appears earlier in its lists (rank 1 in Query2, rank 2 in Query3) than C3 does (rank 2 in Query1, rank 3 in Query3), so C2 ranks higher than C3.

So the ranking of these chunks would be

C1 > C2 > C3

These chunks are concatenated and sent to the LLM in this order; a worked scoring sketch follows below.
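To make this concrete, here is a minimal sketch of the Reciprocal Rank Fusion scoring for this toy example, using the standard RRF formula score(chunk) = sum over query result lists of 1 / (k + rank), with k = 60 (the same constant used in the code later in this post). The chunk names are placeholders and the helper function is illustrative, not the implementation from the repository.

# Toy example: ranked chunk lists returned for Query1, Query2 and Query3.
rankings = [
    ["C1", "C3"],
    ["C2", "C1"],
    ["C1", "C2", "C3"],
]

def rrf_scores(rankings, k=60):
    # Each chunk gets 1 / (k + rank) for every list it appears in (rank starts at 1).
    scores = {}
    for ranking in rankings:
        for rank, chunk in enumerate(ranking, start=1):
            scores[chunk] = scores.get(chunk, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

print(rrf_scores(rankings))
# Approximately: [('C1', 0.0489), ('C2', 0.0325), ('C3', 0.0320)] -> C1 > C2 > C3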

As covered in the previous blogs, the RAG pipeline starts with indexing. Indexing is covered separately in RAG (Retrieval Augmented Generation) Basics, which also covers using Qdrant as the vector store and its deployment; a brief sketch is included below for context.
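The following is only a minimal indexing sketch, assuming a PDF source document and a locally running Qdrant instance; the file name, collection name and URL are illustrative placeholders, and the actual indexing code is in the referenced blog and repository.

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore

# Load the source document and split it into overlapping chunks.
docs = PyPDFLoader("document.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)

# Embed the chunks and store them in a Qdrant collection.
embedder = OpenAIEmbeddings(model="text-embedding-3-large")
vector_store = QdrantVectorStore.from_documents(
    documents=chunks,
    embedding=embedder,
    url="http://localhost:6333",       # assumes Qdrant running locally
    collection_name="rag_documents",   # placeholder collection name
)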

  1. The user asks a query via the application or bot.

  2. The application or bot uses a prompt to generate multiple versions of the query.

     MULTI_QUERY_PROMPT = """You are a helpful assistant that helps in refining user query.
    
     You receive a query and you need to generate {n} number of questions that are more accurate to represent the query of the user.
    
     Output the answer in a JSON format.
    
     Example: "What is the capital of France?"
     Answer: {{
         "queries": [
             "What is the capital city of France?",
             "Can you tell me the capital of France?",
             "What city serves as the capital of France?"
         ]
     }}
    
     """
    
  3. The prompt, along with the user query, is then submitted to the LLM, and the LLM produces multiple queries in its response.

query = "Provide summary of the document"

messages=[
            { "role": "system", "content": MULTI_QUERY_PROMPT.format(n=3) }, 
            { "role": "user", "content": query }
        ]


client = OpenAI()

def query_to_LLM(messages):
    result = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=messages
    )
    return result

result = query_to_LLM(messages)
json_response = result.choices[0].message.content

print("Response:", json_response)

data = json.loads(json_response)
queries = data["queries"]
print(queries)
  4. The queries are converted to vector embeddings using an embedding model. Note that the “text-embedding-3-large” model is used for the embeddings.

  5. The vector embedding of each query is used to search the vector store for the most relevant chunks for that query.

  6. The chunks are retrieved for each query, as shown below; a note on how the retriver object is created follows the snippet.

     from langchain_openai import OpenAIEmbeddings

     # Embedding model used for the queries (must match the model used during indexing).
     embedder = OpenAIEmbeddings(
         model="text-embedding-3-large"
     )
     search_results = []
     for query in queries:
         print(query)
         # Similarity search in the vector store for each generated query.
         search_result = retriver.similarity_search(
             query=query
         )
         print("Relevant Chunks", search_result)
         search_results.append(search_result)
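     Note that retriver above is the Qdrant vector store created during indexing and is not defined in this snippet. Below is a minimal sketch of how it might be reconnected to the existing collection; the collection name and URL are assumptions matching the indexing sketch earlier, not values from the original code.

     from langchain_qdrant import QdrantVectorStore

     # Reconnect to the collection populated during the indexing step.
     retriver = QdrantVectorStore.from_existing_collection(
         collection_name="rag_documents",   # placeholder; use the name from indexing
         embedding=embedder,                # the OpenAIEmbeddings instance defined above
         url="http://localhost:6333",       # assumes a locally running Qdrant instance
     )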
    
  7. Ranking is performed on these chunks using Reciprocal Rank Fusion: each chunk receives a score equal to the sum, over all query result lists it appears in, of 1 / (k + rank), where rank is its position in that list and k is a constant (60 here).

     def reciprocal_rank_fusion(rankings: list, k=60):
         # Accumulate an RRF score per unique chunk: 1 / (k + rank + 1) for each
         # result list it appears in (rank is 0-based here, hence the +1).
         scores = {}
         for ranking in rankings:
             print("Ranking:", ranking)
             for rank, doc in enumerate(ranking):
                 chunk = doc.page_content
                 scores[chunk] = scores.get(chunk, 0) + 1 / (k + rank + 1)
         # Highest-scoring chunks first.
         return sorted(scores.items(), key=lambda x: x[1], reverse=True)
    
     ranked_results = reciprocal_rank_fusion(rankings=search_results, k=60)
     print("Ranked Results:", ranked_results)
    
     # Concatenate the ranked chunks into a single context string for the LLM.
     ranked_results_string = "\n".join([chunk for chunk, _ in ranked_results])
     print("Search Results:", ranked_results_string)
    
  8. The ranked chunks, along with the system prompt (represented as 8a in the diagram) and the query (represented as 8b in the diagram), are provided to the LLM.
    """    
    SYSTEM_PROMPT = """You are a helpful assistant that helps the user to learn details only with in the provided context.
    If the context does not contain the answer, say "I don't know".
    You are not allowed to make any assumptions or guesses.

    Ouutput the answer in a JSON format.

    context:
    {context}
    """
  9. The LLM uses the ranked chunks, along with the system prompt and the query, and generates the output.
    messages = [
        { "role": "system", "content": SYSTEM_PROMPT.format(context=ranked_results_string) },
        { "role": "user", "content": query }
    ]

    query_response = query_to_LLM(messages)
    print("Response:", query_response.choices[0].message.content)
  10. The response is provided back to the user.

Now let’s compare the results generated by the LLM using Parallel Query Retrieval and Reciprocal Rank Fusion for the same query, i.e. to provide a summary of the document that was loaded and stored in the vector store.

Response from Reciprocal Rank Fusion

Response: {
  "overview": "The document discusses Attribute-based Architectural Styles (ABAS) and 
architectural assessment frameworks. It outlines the structure of an ABAS description 
which includes problem description, stimulus/response attribute measures, architectural style,
 and analysis. The document also covers the Architecture Assessment Framework and 
its comparative analysis, discussing different assessment methods and their applicability 
under various contexts. It highlights activities in the Software Architecture Analysis 
Method (SAAM), the importance of architecture evaluation effectiveness, and guidelines 
for conducting architecture reviews. Additionally, it touches on the role of 
software execution models in identifying performance issues and suggests alternatives 
for improvement. SAAM's evaluation based on scenarios and its impact on architecture 
elements is also explained, along with the relationship between scenario interaction 
and metrics like structural complexity, coupling, and cohesion."
}
Response from Parallel Query Retrieval
Response: { 
  "summary": "The document discusses various methods for software architecture assessment.
 It covers the ABAS (Attribute-Based Architecture Styles) approach, detailing its sections 
like problem description, stimuli/response measures, architectural style, and analysis. 
It also describes SAAM (Software Architecture Analysis Method) and ATAM 
(Architecture Tradeoff Analysis Method), highlighting their focus areas and activities. 
The document explores how these methods evaluate architecture based on quality attributes 
such as security, performance, and modifiability, helping identify strengths and weaknesses 
in architectural designs."
}

Conclusion: The result from Reciprocal Rank Fusion is far better than the result from Parallel Query Retrieval for the same query.

Summary

The Reciprocal Rank Fusion technique creates multiple queries from the user query, just like Parallel Query Retrieval. It searches the vector store for the chunks relevant to each query, then ranks those chunks across the queries before providing them, along with the user query, to the LLM to get the response. Because the chunks are ranked, the responses are better than those from Parallel Query Retrieval as well as the basic RAG approach.

Reference

RAG (Retrieval Augmented Generation) Basics

Parallel Query (Fan Out) Retrieval

https://github.com/ashutoshmca/RAG/

Introduction | 🦜️🔗 LangChain

LangChain Python API Reference — 🦜🔗 LangChain documentation
