Parallel Query Retrieval

hardik sehgal
5 min read

Imagine getting the perfect answer for your incomplete query

Parallel query retrieval, also known as "fan out" in Retrieval-Augmented Generation (RAG), has an LLM generate multiple, slightly different queries based on the original user query. These queries are executed in parallel (hence "fan out"), and their results are merged to provide more comprehensive coverage.

Why use the fan-out method?

  • Diverse Perspectives: An LLM generates multiple versions of the original user query, capturing information that might be missed by using just one query.

  • Complex Problems: With multiple queries from a single user query, the LLM gathers more information, resulting in a better response.

Let's focus on what this diagram is showing:

๐Ÿง User sends a question

What are the recent developments in Apple?

🧠 System rewrites it into 3-5 better versions (Query Transformation)

Q1: What are Apple's latest product launches?

Q2: How is Apple expanding into new markets?

Q3: What strategies is Apple using for revenue growth?

This process is called Query Rewriting / Transformation. We generate vector embeddings for these three LLM-generated queries, then use the embeddings to retrieve the matching chunks of document/data stored in Qdrant for each query. We can use LangChain, since it eases the similarity-search plumbing.

Now, we search each rewritten query separately in Qdrant (the database where we store the indexed data):

Query | Resulting Chunks
Q1    | 📄📄
Q2    | 📄
Q3    | 📄📄📄
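
A minimal sketch of that per-query lookup, assuming a Qdrant collection that already holds the indexed chunks (the embedding model, collection name and URL below are placeholders, not the author's originals):

from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore

embedder = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = QdrantVectorStore.from_existing_collection(
    embedding=embedder,
    collection_name="apple_news",   # placeholder collection
    url="http://localhost:6333",
)

rewritten = [
    "What are Apple's latest product launches?",
    "How is Apple expanding into new markets?",
    "What strategies is Apple using for revenue growth?",
]

# LangChain embeds each rewritten query and runs the similarity search in Qdrant.
results = {q: vector_store.similarity_search(q, k=4) for q in rewritten}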

Now we have chunks of data for each LLM-generated query; a single query can return several chunks, and the same chunk can come back for more than one query.

🧼 Combine & Deduplicate: filter_unique

We now combine all the results from Q1, Q2, Q3... and remove duplicates so we only keep unique chunks. Note that this is a union of the result sets with duplicates dropped, not an intersection: a chunk is kept if any query retrieved it, but only once (see the small sketch after the colour example below).

Imagine the colours in the diagram are:

  • Blue = Product news

  • Yellow = Market entry

  • Red = Revenue growth

We merge those, keeping only unique data:

📄 Blue

📄 Yellow

📄 Red
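
In code, that merge is an ordered union; a dict keyed on content drops the repeats:

# Toy version of the merge: union of all per-query results, duplicates dropped.
q1 = ["blue", "blue"]           # product news chunks
q2 = ["yellow"]                 # market entry chunk
q3 = ["red", "blue", "red"]     # revenue growth chunks (one overlaps with q1)

merged = list(dict.fromkeys(q1 + q2 + q3))  # dict keys keep first-seen order
print(merged)  # ['blue', 'yellow', 'red']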

🤖 Final Step: Ask the LLM

Now that we have:

  • High-quality context chunks (from multiple queries)

  • Original user question

We send both to the LLM:

SYSTEM_PROMPT + context_chunks + user_question → GPT

The AI finally generates a rich, informed answer.

🧱 Why is this better than traditional RAG?

Traditional RAG                       | Fan-Out Parallel RAG
Ask 1 question → search once          | Rewrites the question 3-5 times, searches in parallel
May return weak chunks                | Better chance of catching the right context
Sensitive to how the user phrased it  | Handles ambiguity and vague prompts
Fewer insights                        | More diverse and richer document results

Code Snippets

Before diving into parallel query retrieval, you should understand plain RAG and how to implement it in code.

Follow this link: Retrieval-Augmented Generation (RAG)

  1. Write a function to load the PDF data into Qdrant DB
import uuid
from pathlib import Path
from typing import Any, List, Sequence

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_qdrant import QdrantVectorStore

# EMBEDDER, VECTOR_COLLECTION and VECTOR_URL are module-level settings:
# an embeddings instance plus the Qdrant collection name and URL.

def ingest_pdf(pdf_path: Path) -> str:
    """Load a PDF, split to chunks, embed & store ➜ return `doc_id`."""
    # Give every PDF an ID so we can filter later
    doc_id = str(uuid.uuid4())

    loader   = PyPDFLoader(str(pdf_path))
    raw_docs = loader.load()  # each element has .page_content & .metadata

    splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200)
    chunks   = splitter.split_documents(raw_docs)

    for c in chunks:
        c.metadata["doc_id"]   = doc_id
        c.metadata["filename"] = pdf_path.name
        # keep existing page metadata

    QdrantVectorStore.from_documents(
        chunks,
        embedding = EMBEDDER,
        collection_name = VECTOR_COLLECTION,
        url = VECTOR_URL,
    )
    return doc_id
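Typical usage, assuming a running Qdrant instance (the file name is just an example):

doc_id = ingest_pdf(Path("apple_annual_report.pdf"))  # hypothetical PDF
print("Indexed as", doc_id)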
  2. Write a function that generates 3-5 query variants from the original user query
def rewrite_query(user_q: str, n_variants: int = 4) -> List[str]:
    """
    Use a cheap model to produce 3-5 alternate phrasings of the original query.
    """
    prompt = [
        _REWRITE_SYS,
        HumanMessage(content=f"Original question: {user_q}\nN={n_variants}")
    ]
    llm_resp = LLM_REWRITE.invoke(prompt).content 

    try:
        import json
        queries = json.loads(llm_resp) # type: ignore 

        # Check that GPT's response really is a list (as we expect).
        if isinstance(queries, list):
            return queries[: n_variants] 
    except Exception:
        pass

    return [user_q]  # if JSON parsing failed, return the original question as the only element
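rewrite_query relies on two module-level objects that are not shown above. A plausible setup looks like this; the prompt wording and model choice are assumptions, not the author's originals:

from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

# Hypothetical definitions for the helpers used in rewrite_query.
_REWRITE_SYS = SystemMessage(content=(
    "You rewrite a user's question into N diverse search queries that "
    "cover different angles of the topic. "
    "Respond with ONLY a JSON array of N strings."
))
LLM_REWRITE = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)  # cheap rewrite model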
  3. Write a function to search with each LLM-generated query (similar to plain RAG)
def search_chunks_parallel(
    queries: Sequence[str],
    active_doc_id: str,
    k: int = 4,
) -> List[Any]:
    ...  # one possible body is sketched below
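The post gives only the signature, so here is a minimal sketch of one possible body: a thread pool fans the searches out concurrently, and a Qdrant payload filter restricts hits to the active PDF. The `vector_store` handle (and the metadata filter key) are assumptions, not shown in the original:

from concurrent.futures import ThreadPoolExecutor

from qdrant_client import models

# Assumed handle, opened on the same collection that ingest_pdf writes to, e.g.:
# vector_store = QdrantVectorStore.from_existing_collection(
#     embedding=EMBEDDER, collection_name=VECTOR_COLLECTION, url=VECTOR_URL)

def search_chunks_parallel(
    queries: Sequence[str],
    active_doc_id: str,
    k: int = 4,
) -> List[Any]:
    """Fan out: one similarity search per rewritten query, run in parallel."""
    if not queries:
        return []

    # Restrict hits to the active PDF. LangChain's Qdrant integration stores
    # document metadata under the "metadata" payload key.
    doc_filter = models.Filter(
        must=[
            models.FieldCondition(
                key="metadata.doc_id",
                match=models.MatchValue(value=active_doc_id),
            )
        ]
    )

    def _search(q: str) -> List[Any]:
        return vector_store.similarity_search(q, k=k, filter=doc_filter)

    chunks: List[Any] = []
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        for hits in pool.map(_search, queries):  # results come back in query order
            chunks.extend(hits)
    return chunks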
  4. Write a function to remove duplicate chunks
def deduplicate_chunks(chunks: List[Any]) -> List[Any]:
    """
    Remove duplicate page_content (or duplicate _id) while preserving order.
    """
    seen = set()
    unique = []
    for c in chunks:
        key = c.page_content.strip()  # the chunk's text content, minus surrounding whitespace
        if key not in seen:  # skip content we've already kept
            unique.append(c)
            seen.add(key)
    return unique
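Keying on page_content rather than a chunk ID means two hits with identical text collapse into one even when different queries retrieved them, and because insertion order is preserved, the earliest (highest-ranked) copy of each chunk is the one that survives.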
  5. Write a function to build the context string we send to the LLM
def build_context(chunks: List[Any], max_chars: int = 3900) -> str:
    """
    Concatenate chunks → single string, truncated to fit the GPT-4 context limit.
    """
    lines = []
    total = 0 # Counts how many characters we've added (for limiting context size).
    for c in chunks:
        page = c.metadata.get("page_label", c.metadata.get("page", "N/A"))
        text = f"[{page}] {c.page_content.strip()}\n"
        new_total = total + len(text)  # size if we were to add this chunk
        if new_total > max_chars:  # stop before exceeding the limit
            break
        lines.append(text)
        total = new_total
    return "".join(lines)
  6. Write a function to answer the user query
def answer_question(user_q: str, active_doc_id: str) -> str:
    # 1) rewrite query
    rewrites = rewrite_query(user_q)
    print("๐Ÿ” GPT rewrites: ", rewrites) 

    # 2) fan-out search
    raw_chunks = search_chunks_parallel(rewrites, active_doc_id, k=4)

    if not raw_chunks:
        return (
            '{ "step":"Answer", "content":"No info in current PDF.", '
            f'"input": "{user_q}" }}'
        )

    # 3) deduplicate
    uniq_chunks = deduplicate_chunks(raw_chunks)

    # 4) build context string
    context_text = build_context(uniq_chunks)

    # 5) ask GPT-4(o)
    history = [
        SystemMessage(content=_ANSWER_SYS + "\n\nContext:\n" + context_text),
        HumanMessage(content=user_q),
    ]
    ai = LLM_ANSWER.invoke(history)
    return ai.content
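Putting it all together (the PDF name and question are illustrative):

doc_id = ingest_pdf(Path("apple_annual_report.pdf"))  # hypothetical PDF
print(answer_question("What are the recent developments in Apple?", doc_id))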