Parallel Query Retrieval


Imagine getting the perfect answer for your incomplete query
Parallel query retrieval, also known as "fan out" in Retrieval-Augmented Generation (RAG), involves an LLM generating multiple, slightly different queries based on the original user query. These queries are executed in parallel (hence "fan out"), and their results are merged to provide more comprehensive coverage.
Why use the fan-out method?
Diverse Perspectives: An LLM generates multiple versions of the original user query, capturing information that might be missed by using just one query.
Complex Problems: With multiple queries from a single user query, the LLM gathers more information, resulting in a better response.
Let's focus on what this diagram is showing:
User sends a question:
"What are the recent developments in Apple?"
System rewrites it into 3-5 better versions (Query Transformation):
Q1: What are Apple's latest product launches?
Q2: How is Apple expanding into new markets?
Q3: What strategies is Apple using for revenue growth?
This process is called Query Rewriting / Transformation. We then generate vector embeddings for these LLM-generated queries, and for each query we use its embedding to retrieve the matching chunks of the document/data stored in the Qdrant DB. We can use LangChain, since it eases up the similarity search.
Fan-Out Retrieval (Parallel Vector Search)
Now we search each rewritten query separately in Qdrant (the database where we store the indexed data):
| Query | Resulting Chunks |
| --- | --- |
| Q1 | 2 chunks |
| Q2 | 1 chunk |
| Q3 | 3 chunks |
We now have chunks of data for each LLM-generated query; a single query can return multiple chunks, and different queries can return duplicate chunks.
Combine & Deduplicate: filter_unique
We now combine all the results from Q1, Q2, Q3... and remove duplicates so that only unique chunks remain. A simple content-based deduplication keeps just one copy of each chunk.
Imagine the colours in the diagram are:
Blue = Product news
Yellow = Market entry
Red = Revenue growth
We merge those into (accepting only unique data):
- Blue
- Yellow
- Red
Final Step: Ask the LLM
Now that we have:
High-quality context chunks (from multiple queries)
Original user question
We send both to the LLM:
SYSTEM_PROMPT + context_chunks + user_question → GPT
The AI finally generates a rich, informed answer.
Why is this better than traditional RAG?
| Traditional RAG | Fan-Out Parallel RAG |
| --- | --- |
| Asks 1 question → searches once | Rewrites the question 3-5 times and searches in parallel |
| May return weak chunks | Better chance of catching the right context |
| Sensitive to how the user phrased the question | Handles ambiguity and vague prompts |
| Fewer insights | More diverse and richer document results |
Code Snippets
Before diving into Parallel Query Retrieval, you should understand how normal RAG works and how it is implemented in code.
Follow this link: Retrieval-Augmented Generation (RAG)
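The snippets below also rely on a few module-level names: EMBEDDER, VECTOR_COLLECTION, VECTOR_URL, VECTOR_STORE, LLM_REWRITE, LLM_ANSWER, _REWRITE_SYS and _ANSWER_SYS. Here is a minimal setup sketch, assuming OpenAI models and a local Qdrant instance; the model names, URL, collection name and prompt texts are placeholders you can swap out:

```python
from langchain_core.messages import SystemMessage
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore

# Placeholder config: point these at your own Qdrant instance / collection
VECTOR_URL = "http://localhost:6333"
VECTOR_COLLECTION = "pdf_chunks"

EMBEDDER = OpenAIEmbeddings(model="text-embedding-3-small")

# A cheap model for query rewriting, a stronger one for the final answer
LLM_REWRITE = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
LLM_ANSWER = ChatOpenAI(model="gpt-4o", temperature=0)

_REWRITE_SYS = SystemMessage(
    content=(
        "You rewrite a user question into N alternate phrasings that cover "
        "different angles. Return ONLY a JSON array of strings."
    )
)
_ANSWER_SYS = (
    "Answer the user's question using only the provided context. "
    "If the context is not enough, say so."
)

# Handle used for querying; assumes the collection already exists
# (it is created the first time ingest_pdf runs)
VECTOR_STORE = QdrantVectorStore.from_existing_collection(
    collection_name=VECTOR_COLLECTION,
    embedding=EMBEDDER,
    url=VECTOR_URL,
)
```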
- Write a function to load the PDF data into Qdrant DB
```python
import uuid
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter


def ingest_pdf(pdf_path: Path) -> str:
    """Load a PDF, split it into chunks, embed & store -> return `doc_id`."""
    # Give every PDF an ID so we can filter on it later
    doc_id = str(uuid.uuid4())

    loader = PyPDFLoader(str(pdf_path))
    raw_docs = loader.load()  # each element has .page_content & .metadata

    splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200)
    chunks = splitter.split_documents(raw_docs)
    for c in chunks:
        c.metadata["doc_id"] = doc_id
        c.metadata["filename"] = pdf_path.name
        # keep the existing page metadata as-is

    QdrantVectorStore.from_documents(
        chunks,
        embedding=EMBEDDER,
        collection_name=VECTOR_COLLECTION,
        url=VECTOR_URL,
    )
    return doc_id
```
- Write a function that rewrites the original user query into 3-5 variants
```python
import json
from typing import List

from langchain_core.messages import HumanMessage


def rewrite_query(user_q: str, n_variants: int = 4) -> List[str]:
    """
    Use a cheap model to produce 3-5 alternate phrasings of the original query.
    """
    prompt = [
        _REWRITE_SYS,
        HumanMessage(content=f"Original question: {user_q}\nN={n_variants}"),
    ]
    llm_resp = LLM_REWRITE.invoke(prompt).content
    try:
        queries = json.loads(llm_resp)  # type: ignore
        # Check that the model's response really is a list (as we expect)
        if isinstance(queries, list):
            return queries[:n_variants]
    except Exception:
        pass
    # If JSON parsing failed, fall back to the original question as the only element
    return [user_q]
```
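With the Apple question from the walkthrough above, the rewriter should give us something like this (illustrative output only; actual model responses will vary):

```python
rewrite_query("What are the recent developments in Apple?")
# Expected shape of the result (illustrative):
# ["What are Apple's latest product launches?",
#  "How is Apple expanding into new markets?",
#  "What strategies is Apple using for revenue growth?"]
```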
- Write a function to search Qdrant with the list of LLM-generated queries (similar to plain RAG)
```python
def search_chunks_parallel(
    queries: Sequence[str],
    active_doc_id: str,
    k: int = 4,
) -> List[Any]:
    ...  # a possible body is sketched below
```
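Here is one possible implementation, a minimal sketch that assumes the module-level VECTOR_STORE handle from the setup sketch above and filters on the doc_id metadata written by ingest_pdf; the thread pool is what makes the fan-out searches actually run in parallel:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, List, Sequence

from qdrant_client import models


def search_chunks_parallel(
    queries: Sequence[str],
    active_doc_id: str,
    k: int = 4,
) -> List[Any]:
    """Run one similarity search per rewritten query in parallel and pool all hits."""
    # Only return chunks that belong to the currently active PDF
    doc_filter = models.Filter(
        must=[
            models.FieldCondition(
                key="metadata.doc_id",
                match=models.MatchValue(value=active_doc_id),
            )
        ]
    )

    def _search(q: str) -> List[Any]:
        # VECTOR_STORE is assumed to be the QdrantVectorStore from the setup sketch
        return VECTOR_STORE.similarity_search(q, k=k, filter=doc_filter)

    hits: List[Any] = []
    with ThreadPoolExecutor(max_workers=len(queries) or 1) as pool:
        for result in pool.map(_search, queries):  # "fan out": one search per query
            hits.extend(result)
    return hits
```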
- Write a function to remove duplicate chunks
```python
def deduplicate_chunks(chunks: List[Any]) -> List[Any]:
    """
    Remove duplicate page_content (or duplicate _id) while preserving order.
    """
    seen = set()
    unique = []
    for c in chunks:
        key = c.page_content.strip()  # the chunk's text content, minus surrounding whitespace
        if key not in seen:  # keep only content we haven't seen yet
            unique.append(c)
            seen.add(key)
    return unique
```
- Write a function to build the context string that we send to the LLM
```python
def build_context(chunks: List[Any], max_chars: int = 3900) -> str:
    """
    Concatenate chunks into a single string, truncated to fit the GPT-4 context limit.
    """
    lines = []
    total = 0  # running character count, used to cap the context size
    for c in chunks:
        page = c.metadata.get("page_label", c.metadata.get("page", "N/A"))
        text = f"[{page}] {c.page_content.strip()}\n"
        new_total = total + len(text)
        if new_total > max_chars:  # stop before the context exceeds max_chars
            break
        lines.append(text)
        total = new_total
    return "".join(lines)
```
- Write a function to answer the user query
```python
def answer_question(user_q: str, active_doc_id: str) -> str:
    # 1) rewrite the query into several variants
    rewrites = rewrite_query(user_q)
    print("GPT rewrites:", rewrites)

    # 2) fan-out search
    raw_chunks = search_chunks_parallel(rewrites, active_doc_id, k=4)
    if not raw_chunks:
        return (
            '{ "step":"Answer", "content":"No info in current PDF.", '
            f'"input": "{user_q}" }}'
        )

    # 3) deduplicate
    uniq_chunks = deduplicate_chunks(raw_chunks)

    # 4) build the context string
    context_text = build_context(uniq_chunks)

    # 5) ask GPT-4o
    history = [
        SystemMessage(content=_ANSWER_SYS + "\n\nContext:\n" + context_text),
        HumanMessage(content=user_q),
    ]
    ai = LLM_ANSWER.invoke(history)
    return ai.content
```
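Putting it all together, a minimal usage sketch (the PDF file name and question are placeholders):

```python
if __name__ == "__main__":
    doc_id = ingest_pdf(Path("apple_report.pdf"))  # hypothetical file name
    print(answer_question("What are the recent developments in Apple?", doc_id))
```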