Boost Retrieval Quality with Parallel Query Retrieval (Fan‑out)

Suraj Patel
6 min read

Introduction

Instead of issuing a single search request, you “fan out” the original query into multiple variations that run in parallel, each phrased differently or targeting a different aspect of the question. You then retrieve a small set of top‑k documents for each variation and merge the results.
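At a high level, the whole technique fits in a few lines. Here is a minimal conceptual sketch; generate_variations and retrieve_top_k are hypothetical placeholders for the LLM call and the vector search you will build later in this post:

#fan_out_sketch.py (conceptual outline only)
def fan_out_retrieve(user_query: str, k: int = 4):
    variations = generate_variations(user_query)   # hypothetical: 3-5 rephrasings from an LLM
    merged, seen = [], set()
    for q in variations:                           # each variation can run in parallel
        for doc in retrieve_top_k(q, k):           # hypothetical: top-k similarity search
            if doc.page_content not in seen:       # drop duplicates across variations
                seen.add(doc.page_content)
                merged.append(doc)
    return merged                                  # merged context for the final answer

The rest of the article builds each of these pieces step by step.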

Pipeline Overview

  1. Before You Begin

  2. 📥 Ingest Data

  3. ✂️ Chunk Text

  4. 🔢 Generate Embeddings

  5. 💾 Store in Vector DB

  6. 🔄 Decompose Query (Fan‑Out)

  7. 🔍 Retrieve & Merge Chunks

  8. ✍️ Generate Final Answer

  1. Before You Begin

Before installing any packages, create a virtual environment:

# 1. Create a virtual environment named .venv
python -m venv .venv

# 2. Activate it
# On macOS / Linux:
source .venv/bin/activate
# On Windows (PowerShell):
.venv\Scripts\Activate.ps1
# On Windows (Command Prompt):
.venv\Scripts\activate.bat
  2. 📥 Ingest Data

You start by bringing in all the source material you want your system to “know.”

  • Examples: PDFs of manuals, GitHub READMEs, web‑scraped articles, CSV exports.

  • Goal: Make sure you extract clean text (strip out headers/footers, fix encoding issues) and record metadata (source filename, page number, date) so you can always trace back where an answer came from.

To do this, we need to install the packages langchain_community and pypdf.
Run the following command in the terminal:

pip install langchain_community pypdf
#parallel_query_loader.py
from langchain_community.document_loaders import PyPDFLoader
from pathlib import Path

# Path to the PDF you want to ingest (placed next to this script)
pdf_path = Path(__file__).parent / "your_file.pdf"

loader = PyPDFLoader(file_path=pdf_path)
doc = loader.load()  # returns one Document per page, with source and page metadata
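To confirm that the text and metadata you will rely on later were actually captured, here is a quick check, assuming the loader code above ran successfully:

print(len(doc))                    # number of pages loaded
print(doc[0].metadata)             # e.g. source filename and page number
print(doc[0].page_content[:200])   # first 200 characters of the first page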
  3. ✂️ Chunk the Text

LLMs have finite context windows; if you handed a 500‑page PDF to an LLM, it wouldn’t fit.

  • How: Split into ~500–1,000 token chunks, often with a 10–20% overlap so that you don’t lose sentence continuity at chunk boundaries.

  • Why: Smaller chunks both fit in the model’s context and allow more precise matching when you retrieve later.

chunk_size = 1000 – each slice of text will be at most 1,000 characters long (RecursiveCharacterTextSplitter counts characters by default).

chunk_overlap = 200 – each new slice repeats the last 200 characters of the previous slice so context flows smoothly across chunks.

#parallel_query_loader.py
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

split_doc = text_splitter.split_documents(documents=doc)
  4. 🔢 Generate Embeddings

Each chunk is passed through an embedding model (e.g. text‑embedding‑ada-002) that turns it into a fixed‑length vector in semantic space.

Similar meaning → nearby points in vector space. “How do I reset my password?” and “password reset steps” end up close together.

For this example, I’m using Google AI embeddings, but you can use OpenAI embeddings instead. You can see all the supported embedding integrations on the LangChain Embeddings page.

Note: Create a .env file to store your Google API key, and use python-dotenv to load it into your Python script.

To use GoogleGenerativeAIEmbeddings and load_dotenv, you first need to install the integration packages langchain-google-genai and python-dotenv.

pip install langchain-google-genai
pip install python-dotenv
#parallel_query_loader.py
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import os
from dotenv import load_dotenv

load_dotenv()  # loads GOOGLE_API_KEY from the .env file into the environment

if "GOOGLE_API_KEY" not in os.environ:
    raise EnvironmentError("GOOGLE_API_KEY is not set; add it to your .env file")

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004",
)
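To see the “similar meaning → nearby points” idea in action, you can embed two phrasings and compare them with cosine similarity. A small sketch using the embeddings object just created:

import math

v1 = embeddings.embed_query("How do I reset my password?")
v2 = embeddings.embed_query("password reset steps")

dot = sum(a * b for a, b in zip(v1, v2))
cosine = dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))
print(cosine)   # close to 1.0 for semantically similar texts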
  5. 💾 Store Embeddings in a Vector Database

Those vectors, plus your chunk text and metadata, go into a specialized index (Pinecone, Qdrant, FAISS, etc.).

Why use a vector DB? It lets you do ultra‑fast approximate nearest‑neighbor searches over millions of vectors, usually in milliseconds.

Here we’re using the Qdrant vector database.
You can either install it directly on your system or run it in Docker; I’m using Docker in this example.

services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"

To run this Docker Compose file, run the following in the terminal:

docker compose -f docker-compose.yml up

Once the container is running, you can connect to Qdrant at http://localhost:6333.
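If you want to confirm the connection from Python before storing anything, here is a quick check using qdrant_client (it is installed as a dependency of langchain-qdrant in the next step):

from qdrant_client import QdrantClient

qdrant = QdrantClient(url="http://localhost:6333")
print(qdrant.get_collections())   # lists existing collections (empty on a fresh instance)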

To use QdrantVectorStore and QdrantClient, you first need to install the integration package langchain-qdrant

pip install langchain-qdrant
#parallel_query_loader.py
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# Create the collection; the name must match the one used later at retrieval time
vector_store = QdrantVectorStore.from_documents(
    documents=[],
    url="http://localhost:6333",
    embedding=embeddings,
    collection_name="parallel_query"
)
vector_store.add_documents(documents=split_doc)  # embed and store all chunks

After you run the code, you can view the collection in the Qdrant vector database at http://localhost:6333/dashboard
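You can also sanity-check the new collection directly from Python with a quick similarity search (swap in a question that matches your own PDF):

results = vector_store.similarity_search("What is the fs module?", k=3)
for r in results:
    print(r.metadata.get("page"), r.page_content[:100])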

  6. 🔄 Decompose Query (Fan‑Out)

Break the original user question into several focused, semantically distinct sub‑queries using an LLM. For example, given the query

“What is fs module?”

you might generate:

What is a module in Node.js?

What does "fs" stand for?

What functionalities does the fs module provide in Node.js?

Why This Helps

  • Broader Coverage: Captures documents that match specific wording or context.

  • Reduced Ambiguity: Each variant zeroes in on one angle of the user’s need.

  • Sharper Embeddings: Focused queries yield embedding vectors that align more closely with relevant chunks.

To call Gemini through its OpenAI‑compatible endpoint, you first need to install the openai package:

pip install openai
#main.py
from openai import OpenAI
from dotenv import load_dotenv
import os
import json

load_dotenv()

# Gemini accessed through its OpenAI-compatible endpoint
client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def ai(message):
    # Ask the model for a JSON object and parse it into a Python dict
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=message,
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

system_prompt = """
You are a helpful AI assistant who is specialized in resolving user queries.
You break the user query into three to five different queries.

Example: "What is FS module?"
You break this question into different questions:
- What is a module in Node.js?
- What does 'fs' stand for?
- What functionalities does the fs module provide in Node.js?

Return the result as a JSON object with a "questions" array, like this:

Output: {"questions": [
  "What is a module in Node.js?",
  "What does 'fs' stand for?",
  "What functionalities does the fs module provide in Node.js?"
]}
"""

query = input("> ")
message = [{"role": "system", "content": system_prompt},
           {"role": "user", "content": query}]
question = ai(message)

print("\nQuestions: ")
print(question["questions"])
  7. 🔍 Retrieve Relevant Chunks and Filter Duplicates

Execute each decomposed sub‑query as its own top‑K similarity search, then merge results and filter for uniqueness:

parallel_query_retrieval.py File

#parallel_query_retrieval.py
from langchain_qdrant import QdrantVectorStore
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from dotenv import load_dotenv
import os

load_dotenv()  # makes GOOGLE_API_KEY from the .env file available

def retrieve(query: str) -> str:
    if "GOOGLE_API_KEY" not in os.environ:
        raise EnvironmentError("GOOGLE_API_KEY is not set; add it to your .env file")

    embedding = GoogleGenerativeAIEmbeddings(
        model="models/text-embedding-004"
    )

    # Connect to the collection created during ingestion
    retriever = QdrantVectorStore.from_existing_collection(
        collection_name="parallel_query",
        embedding=embedding,
        url="http://localhost:6333",
    )

    # Top-k similarity search for this sub-query (k defaults to 4)
    relevant_chunks = retriever.similarity_search(
        query=query,
    )

    # Keep only unique chunks (same page + same text counts as a duplicate)
    seen = set()
    unique = []
    for doc in relevant_chunks:
        content = doc.page_content.strip()
        page = doc.metadata.get("page")
        key = (page, content)
        if key not in seen:
            seen.add(key)
            unique.append(doc)

    # Label each chunk with its page number so the final answer can cite sources
    formatted = []
    for doc in unique:
        snippet = f"[Page {doc.metadata.get('page')}] \n{doc.page_content}"
        formatted.append(snippet)

    context = "\n\n".join(formatted)
    return context

Passing all the generated questions to the retrieve function in parallel_query_retrieval.py:

#main.py
from parallel_query_retrieval import retrieve

array = []
for q in question["questions"]:
    answer = retrieve(q)
    array.append(answer)
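The loop above runs the sub‑queries one after another. Because each retrieve call is independent, you can also run them concurrently, which is where the “parallel” in parallel query retrieval comes in. A minimal sketch (not part of the original code) using Python’s built‑in ThreadPoolExecutor:

#main.py (optional: run the sub-queries concurrently instead of in a loop)
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=5) as pool:
    array = list(pool.map(retrieve, question["questions"]))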
  8. ✍️ Generate Final Answer

We feed the assembled prompt, which combines the retrieved, labeled chunks with the user’s original question, into the chosen language model. The LLM then uses both its internal knowledge and the provided context to generate a coherent, fact‑grounded response.

answer_ai.py File

#answer_ai.py
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()

# Gemini accessed through its OpenAI-compatible endpoint
client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def answer_AI(query, assistant):
    system_prompt = """
    You are a helpful AI assistant who is specialized in resolving user queries.

    Note:
    The answer should be detailed.
    You receive a question and answer it based on the assistant content.
    Mention the page numbers the information was taken from, and
    if you add anything from your own knowledge, say so explicitly.
    """
    message = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
        {"role": "assistant", "content": assistant}
    ]
    # Plain-text completion (no JSON formatting needed for the final answer)
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=message
    )
    return response.choices[0].message.content

Passing the retrieved context, along with the original query, to answer_AI in answer_ai.py:

#main.py
from answer_ai import answer_AI
output = answer_AI(query, json.dumps(array))

print("\n---------------------------")
print("Answer: ")
print(output)

Executing the Code

Run main.py from your terminal (python main.py), type your question at the prompt, and the script will print the generated sub‑questions followed by the final answer.

Full Source Code

You can find the complete implementation on GitHub:
https://github.com/SurajPatel04/genAI
