🔍 Parallel Query Retrieval (Fan Out): A Deep Dive into Advanced RAG Query Translation Patterns

Mohammed Saleh

Ever asked a smart assistant a question and didn’t get the answer you were hoping for? What if it could understand what you meant, even when you said it wrong? That’s where Parallel Query Retrieval comes in—your AI’s superpower to “think in multiple directions” at once.

Introduction

What is RAG?

Before jumping into Parallel Query Retrieval (PQR), let’s break down the basics.

Retrieval-Augmented Generation (RAG) is a hybrid approach that combines:

  • 🗂️ Retrieval: Getting relevant facts or documents from a database (like pulling pages from a book).

  • ✍️ Generation: Using a language model (like ChatGPT) to generate a natural language answer using that information.

Instead of relying only on what the model “remembers,” RAG lets it look things up in real-time. It’s like a student checking their textbook before answering your question.

🤷‍♂️ The Problem: What if Your Question is Confusing?

Let’s say you type:

“How do mobil apps keep file like documnts or photo save on net withut losing?”

That’s:

  • Full of typos

  • Missing structure

  • Technically vague

A regular system might get confused. But using Parallel Query Retrieval, the AI rewrites your messy question into better ones and searches for answers from multiple angles.

What is Parallel Query Retrieval (Fan Out)?

Parallel Query Retrieval (PQR) is an enhancement to the RAG pipeline where instead of using your original query as-is, the system:

  1. Uses an LLM to rewrite your question in multiple meaningful ways.

  2. Sends each rewritten version to the retrieval system (vector database).

  3. Collects and filters results from each version.

  4. Uses this rich context to generate a final answer.

This technique is also known as Fan Out, because it spreads one query into many directions.
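In code terms, the whole pipeline boils down to a few lines. The helper names below are placeholders, not real functions; the full implementation appears later in this post:

def answer_with_pqr(user_query):
    variants = rewrite_query_with_llm(user_query)    # 1. fan out into several rewrites
    results = [vector_search(v) for v in variants]   # 2. retrieve chunks for each rewrite
    context = deduplicate_and_filter(results)        # 3. merge into one clean context
    return generate_answer(user_query, context)      # 4. answer using the enriched context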

Why Fan Out?

The name comes from the fan-out pattern in system design. If you've ordered something from an online platform like Amazon or Flipkart, you've probably noticed that you receive an email, an SMS, and a WhatsApp message.

This happens because one event is pushed into multiple queues, and each queue drives its own delivery pipeline (email, SMS, WhatsApp). In our example, one user query is converted into three different queries, so three retrieval pipelines run independently and converge at the end. The shape is exactly the same, which is why the technique is called Fan Out.

📚 Real-Life Analogy: Asking Five Teachers Instead of One

Say you’re a student confused about how mobile apps use cloud storage.

Instead of asking just one teacher, you ask five:

  • “How does Google Drive work in mobile apps?”

  • “Why do apps save photos to the cloud?”

  • “How do apps prevent data loss in storage?”

  • “What is cloud backup for apps?”

  • “How is cloud used in file sync between devices?”

Each teacher explains a part of it.

Now, you combine their answers and truly understand the concept.

That’s exactly what PQR does.

🔍 Step-by-Step Breakdown of Parallel Query Retrieval

Let’s walk through a beginner-friendly flow:

💻 Step 1: User Input

The process begins when a user submits a question or query — this can be well-structured, informal, or even slightly flawed.

Example:
“how do mobile apps save user photos to cloud without losing them?”

This raw query often lacks precision and may include typos, vague language, or limited context.

💬 Step 2: Prompt the LLM to Reframe the Query

We send the user's input to a Large Language Model (LLM) like OpenAI’s GPT or Google’s Gemini, along with a carefully crafted system prompt.

Prompt Example:
“Rephrase the following user query into 5 semantically diverse and contextually relevant versions that retain the original intent.”

This step helps the system explore the query from different angles, increasing the chances of retrieving richer, more accurate information.

🧠 Step 3: Generate Multiple Query Variations

The LLM processes the prompt and outputs multiple intelligently rephrased queries.

LLM Output (for the cloud storage example):

  1. How do mobile apps use cloud storage to save images?

  2. What methods are used by apps to backup user files online?

  3. How is data loss prevented in cloud photo storage for apps?

  4. What is the process of uploading files from an app to the cloud?

  5. How do cloud services ensure user data safety in apps?

These versions explore the same topic through different linguistic and conceptual lenses.

🔎 Step 4: Perform Similarity Search for Each Query

Each rewritten query is transformed into a vector embedding using an embedding model.

These vectors are then used to perform similarity searches against a vector database such as Pinecone, Weaviate, or Qdrant. The system finds and retrieves the top-matching results (chunks) for each query.

Think of this as running 5 separate smart searches across your knowledge base in parallel.

📦 Step 5: Retrieve Relevant Chunks (Vector Embeddings)

For each of the rephrased queries, the similarity search fetches the most relevant content — often short passages, paragraphs, or metadata-rich documents.

These text segments, also known as chunks, are stored in their vector form and contain valuable context tied to each version of the question.

🧹 Step 6: Deduplicate and Filter Unique Chunks

Now, all chunks from the five queries are combined into a single pool.

At this stage:

  • 🔁 Duplicates are removed

  • 📊 Low-relevance items are filtered out

  • 🏆 The most informative and distinct chunks are retained

The result is a compact, clean, and contextually rich knowledge base, fine-tuned to answer the original question.

🔁 Step 7: Enrich the Original Query Using Filtered Chunks

With the high-quality, filtered chunks in hand, the original user query is once again sent to the LLM — this time alongside the supporting context.

This provides the LLM with external knowledge that compensates for ambiguity or gaps in the user’s original input.

Think of this step as giving your AI some well-organized notes to refer to while answering.

🧾 Step 8: Generate the Final Response

Finally, the LLM uses the enriched context to generate a precise, comprehensive, and context-aware response to the user’s query.

The answer is no longer based on “just guessing.” It’s grounded in retrieved facts, filtered insights, and semantic clarity — all tailored to the original question’s intent.

🚀 Implementation: Parallel Query Retrieval in Action

🧾 1. Load and Preprocess the Document

We begin by loading a PDF document and splitting it into smaller chunks for easier indexing and retrieval.

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path

pdf_path = Path(__file__).parent / "1706.03762v7.pdf"
loader = PyPDFLoader(file_path=pdf_path)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_docs = text_splitter.split_documents(docs)

What this does:

  • Loads the PDF into memory

  • Splits the content into chunks of ~1000 characters with some overlap to preserve context between segments

🧠 2. Embed the Document Chunks

We now convert text chunks into vector embeddings using OpenAI’s embedding model.

from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv
import os

load_dotenv()
embedder = OpenAIEmbeddings(
    model="text-embedding-3-small",
    api_key=os.getenv("OPENAI_API_KEY")
)

These embeddings represent the semantic meaning of text numerically, enabling similarity searches later.
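As a quick sanity check (a minimal snippet, not part of the pipeline, assuming OPENAI_API_KEY is set in your .env), you can embed a single string and inspect the resulting vector:

# Each piece of text becomes a fixed-length list of floats.
vector = embedder.embed_query("How do apps back up photos to the cloud?")
print(len(vector))  # 1536 dimensions for text-embedding-3-small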

🧱 3. Store in Vector Database (Qdrant)

We store our embedded chunks in a vector store like Qdrant for fast similarity-based search.

from langchain_qdrant import QdrantVectorStore

# Ingest the documents into Qdrant (run this once);
# from_documents creates the collection and adds the chunks.
vector_store = QdrantVectorStore.from_documents(
    documents=split_docs,
    url="http://localhost:6333",
    collection_name="learning_langchain_PQR",
    embedding=embedder
)
# ----

# To retrieve later, connect to the existing collection
retriever = QdrantVectorStore.from_existing_collection(
    url="http://localhost:6333",
    collection_name="learning_langchain_PQR",
    embedding=embedder
)
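The code above assumes a Qdrant instance is already running locally on port 6333. If you don't have one yet, the usual way to start it is with Docker:

docker run -p 6333:6333 qdrant/qdrant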

✨ 4. Generate Diverse Variations of the User’s Query

We use Google Gemini to rewrite the user’s input in multiple semantically diverse forms.

from openai import OpenAI

client = OpenAI(
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key=os.getenv("GEMINI_API_KEY")
)

def generate_different_user_prompt(user_input, num_variants=3):
    SYSTEM_PROMPT = f"""
    You are a helpful AI Assistant that rewrites user input queries in different forms for better document retrieval.

    Original Query: "{user_input}"

    Rewrite this query in {num_variants} different ways.
    Return only the rewritten queries, one per line, with no extra text.
    """
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_input}
        ]
    )
    # Split the LLM output into one variant per line, dropping blank lines
    # and any leading list numbering the model may still add.
    lines = response.choices[0].message.content.split('\n')
    return [line.strip("1234567890. ") for line in lines if line.strip()]
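For the query from earlier, the helper might return something like the list below (illustrative only; the exact wording varies from run to run):

variants = generate_different_user_prompt(
    "how do mobile apps save user photos to cloud without losing them?"
)
# e.g. ["How do mobile apps store user photos in the cloud?",
#       "What prevents photos from being lost when apps upload them?",
#       "How is photo backup handled by mobile apps?"]
print(variants)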

🔍 5. Perform Parallel Similarity Searches

We now use each of the rephrased queries to retrieve relevant document chunks.

def get_similar_chunks_from_document(user_input):
    ai_prompts = generate_different_user_prompt(user_input)
    all_results = []
    # Run one similarity search per rewritten query and keep the result lists grouped.
    for prompt in ai_prompts:
        results = retriever.similarity_search(query=prompt)
        all_results.append(results)
    return all_results
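Note that the loop above runs the searches one after another. Since each search is an I/O-bound network call, you can make the retrieval genuinely parallel with a thread pool. Here is a minimal sketch (the function name get_similar_chunks_concurrently is just an illustrative alternative, not part of the original pipeline):

from concurrent.futures import ThreadPoolExecutor

def get_similar_chunks_concurrently(user_input):
    ai_prompts = generate_different_user_prompt(user_input)
    # Each similarity search is a separate network round trip, so threads give real concurrency here.
    with ThreadPoolExecutor(max_workers=max(1, len(ai_prompts))) as pool:
        return list(pool.map(lambda p: retriever.similarity_search(query=p), ai_prompts))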

🧹 6. Filter Unique Chunks

We flatten and deduplicate all chunks across all query variations.

def filter_unique_chunks(nested_chunks):
    seen = set()
    filtered = []
    for chunk_list in nested_chunks:
        for doc in chunk_list:
            if doc.page_content not in seen:
                seen.add(doc.page_content)
                filtered.append(doc)
    return filtered
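Deduplication removes exact repeats, but Step 6 also talked about dropping low-relevance chunks. One way to approximate that (a sketch, not part of the original code; the 0.5 threshold is an arbitrary starting point you would tune) is to retrieve with scores and keep only chunks above a threshold:

def get_scored_unique_chunks(user_input, score_threshold=0.5):
    seen = set()
    filtered = []
    for prompt in generate_different_user_prompt(user_input):
        # similarity_search_with_score returns (Document, score) pairs;
        # with the default cosine setup, a higher score means more similar.
        for doc, score in retriever.similarity_search_with_score(query=prompt):
            if score >= score_threshold and doc.page_content not in seen:
                seen.add(doc.page_content)
                filtered.append(doc)
    return filtered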

🧾 7. Final Answer Generation with Enriched Context

We feed the cleaned, unique chunks + original query into Gemini again to produce a final, context-aware answer.

def parallel_query_retrieval():
    while True:
        user_input = input(">> ")
        if user_input.lower() in ["exit", "quit"]:
            break

        similar_chunks = get_similar_chunks_from_document(user_input)
        filtered = filter_unique_chunks(similar_chunks)

        # Join the deduplicated chunks outside the f-string
        # (backslashes inside f-string expressions fail on Python < 3.12).
        context = "\n\n".join([doc.page_content for doc in filtered])

        SYSTEM_PROMPT = f"""
        You are a helpful AI Assistant who responds based on the available context.
        If the answer is not found in the context, reply with "I don't know based on the document."

        Context:
        {context}
        """

        response = client.chat.completions.create(
            model="gemini-2.0-flash",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_input}
            ]
        )

        print("-----> ", response.choices[0].message.content)
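One last detail: nothing above actually calls parallel_query_retrieval(), so add a standard main guard at the bottom of the file to start the loop:

if __name__ == "__main__":
    parallel_query_retrieval()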

🧪 Run the System

Run the entire system from your terminal:

python parallel_query_retrieval.py