Parallel Query Retrieval


Imagine getting the perfect answer for your incomplete query
Parallel query retrieval, also known as "fan out" in Retrieval-Augmented Generation (RAG), involves an LLM generating multiple, slightly different queries based on the original user query. These queries are executed in parallel (hence "fan out"), and their results are merged to provide more comprehensive coverage.
Why use the fan-out method?
Diverse Perspectives: An LLM generates multiple versions of the original user query, capturing information that might be missed by using just one query.
Complex Problems: With multiple queries from a single user query, the LLM gathers more information, resulting in a better response.
Let's focus on what this diagram is showing:
User sends a question:
"What are the recent developments in Apple?"
System rewrites it into 3-5 better versions (Query Transformation):
Q1: What are Apple's latest product launches?
Q2: How is Apple expanding into new markets?
Q3: What strategies is Apple using for revenue growth?
This process is called Query Rewriting / Transformation. We then generate vector embeddings for these LLM-generated queries, and for each query we use its embedding to retrieve the matching chunks of the document/data stored in the Qdrant DB. We can use LangChain, since it eases up the similarity search.
Fan-Out Retrieval (Parallel Vector Search)
Now we search each rewritten query separately in Qdrant (the database where we store the indexed data):
| Query | Resulting Chunks |
| --- | --- |
| Q1 | 2 chunks |
| Q2 | 1 chunk |
| Q3 | 3 chunks |
We now have chunks of data for each LLM-generated query; a single query can return multiple chunks, and different queries can return duplicate chunks.
Combine & Deduplicate: filter_unique
We now combine all the results from Q1, Q2, Q3... and remove duplicates so that only unique chunks remain. A simple content-based deduplication keeps just one copy of each chunk.
Imagine the colours in the diagram are:
Blue = Product news
Yellow = Market entry
Red = Revenue growth
We merge those into (accepting only unique data):
- Blue
- Yellow
- Red
Final Step: Ask the LLM
Now that we have:
High-quality context chunks (from multiple queries)
Original user question
We send both to the LLM:
SYSTEM_PROMPT + context_chunks + user_question → GPT
The AI finally generates a rich, informed answer.
Why is this better than traditional RAG?
| Traditional RAG | Fan-Out Parallel RAG |
| --- | --- |
| Asks 1 question → searches once | Rewrites the question 3-5 times and searches in parallel |
| May return weak chunks | Better chance of catching the right context |
| Sensitive to how the user phrased the question | Handles ambiguity and vague prompts |
| Fewer insights | More diverse and richer document results |
Code Snippets
Before diving into Parallel Query Retrieval, you should understand how normal RAG works and how it is implemented in code.
Follow this link: Retrieval-Augmented Generation (RAG)
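The snippets below also rely on a few module-level names: EMBEDDER, VECTOR_COLLECTION, VECTOR_URL, VECTOR_STORE, LLM_REWRITE, LLM_ANSWER, _REWRITE_SYS and _ANSWER_SYS. Here is a minimal setup sketch, assuming OpenAI models and a local Qdrant instance; the model names, URL, collection name and prompt texts are placeholders you can swap out:

```python
from langchain_core.messages import SystemMessage
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore

# Placeholder config: point these at your own Qdrant instance / collection
VECTOR_URL = "http://localhost:6333"
VECTOR_COLLECTION = "pdf_chunks"

EMBEDDER = OpenAIEmbeddings(model="text-embedding-3-small")

# A cheap model for query rewriting, a stronger one for the final answer
LLM_REWRITE = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
LLM_ANSWER = ChatOpenAI(model="gpt-4o", temperature=0)

_REWRITE_SYS = SystemMessage(
    content=(
        "You rewrite a user question into N alternate phrasings that cover "
        "different angles. Return ONLY a JSON array of strings."
    )
)
_ANSWER_SYS = (
    "Answer the user's question using only the provided context. "
    "If the context is not enough, say so."
)

# Handle used for querying; assumes the collection already exists
# (it is created the first time ingest_pdf runs)
VECTOR_STORE = QdrantVectorStore.from_existing_collection(
    collection_name=VECTOR_COLLECTION,
    embedding=EMBEDDER,
    url=VECTOR_URL,
)
```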
- Write a function to load the PDF data into Qdrant DB
```python
import uuid
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter


def ingest_pdf(pdf_path: Path) -> str:
    """Load a PDF, split it into chunks, embed & store -> return `doc_id`."""
    # Give every PDF an ID so we can filter on it later
    doc_id = str(uuid.uuid4())

    loader = PyPDFLoader(str(pdf_path))
    raw_docs = loader.load()  # each element has .page_content & .metadata

    splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200)
    chunks = splitter.split_documents(raw_docs)
    for c in chunks:
        c.metadata["doc_id"] = doc_id
        c.metadata["filename"] = pdf_path.name
        # keep the existing page metadata as-is

    QdrantVectorStore.from_documents(
        chunks,
        embedding=EMBEDDER,
        collection_name=VECTOR_COLLECTION,
        url=VECTOR_URL,
    )
    return doc_id
```
- Write a function that rewrites the original user query into 3-5 variants
```python
import json
from typing import List

from langchain_core.messages import HumanMessage


def rewrite_query(user_q: str, n_variants: int = 4) -> List[str]:
    """
    Use a cheap model to produce 3-5 alternate phrasings of the original query.
    """
    prompt = [
        _REWRITE_SYS,
        HumanMessage(content=f"Original question: {user_q}\nN={n_variants}"),
    ]
    llm_resp = LLM_REWRITE.invoke(prompt).content
    try:
        queries = json.loads(llm_resp)  # type: ignore
        # Check that the model's response really is a list (as we expect)
        if isinstance(queries, list):
            return queries[:n_variants]
    except Exception:
        pass
    # If JSON parsing failed, fall back to the original question as the only element
    return [user_q]
```
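With the Apple question from the walkthrough above, the rewriter should give us something like this (illustrative output only; actual model responses will vary):

```python
rewrite_query("What are the recent developments in Apple?")
# Expected shape of the result (illustrative):
# ["What are Apple's latest product launches?",
#  "How is Apple expanding into new markets?",
#  "What strategies is Apple using for revenue growth?"]
```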
- Write a function to search Qdrant with the list of LLM-generated queries (similar to plain RAG)
```python
def search_chunks_parallel(
    queries: Sequence[str],
    active_doc_id: str,
    k: int = 4,
) -> List[Any]:
    ...  # a possible body is sketched below
```
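Here is one possible implementation, a minimal sketch that assumes the module-level VECTOR_STORE handle from the setup sketch above and filters on the doc_id metadata written by ingest_pdf; the thread pool is what makes the fan-out searches actually run in parallel:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, List, Sequence

from qdrant_client import models


def search_chunks_parallel(
    queries: Sequence[str],
    active_doc_id: str,
    k: int = 4,
) -> List[Any]:
    """Run one similarity search per rewritten query in parallel and pool all hits."""
    # Only return chunks that belong to the currently active PDF
    doc_filter = models.Filter(
        must=[
            models.FieldCondition(
                key="metadata.doc_id",
                match=models.MatchValue(value=active_doc_id),
            )
        ]
    )

    def _search(q: str) -> List[Any]:
        # VECTOR_STORE is assumed to be the QdrantVectorStore from the setup sketch
        return VECTOR_STORE.similarity_search(q, k=k, filter=doc_filter)

    hits: List[Any] = []
    with ThreadPoolExecutor(max_workers=len(queries) or 1) as pool:
        for result in pool.map(_search, queries):  # "fan out": one search per query
            hits.extend(result)
    return hits
```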
- Write a function to remove duplicate chunks
```python
def deduplicate_chunks(chunks: List[Any]) -> List[Any]:
    """
    Remove duplicate page_content (or duplicate _id) while preserving order.
    """
    seen = set()
    unique = []
    for c in chunks:
        key = c.page_content.strip()  # the chunk's text content, minus surrounding whitespace
        if key not in seen:  # keep only content we haven't seen yet
            unique.append(c)
            seen.add(key)
    return unique
```
- Write a function to build the context string that we send to the LLM
```python
def build_context(chunks: List[Any], max_chars: int = 3900) -> str:
    """
    Concatenate chunks into a single string, truncated to fit the GPT-4 context limit.
    """
    lines = []
    total = 0  # running character count, used to cap the context size
    for c in chunks:
        page = c.metadata.get("page_label", c.metadata.get("page", "N/A"))
        text = f"[{page}] {c.page_content.strip()}\n"
        new_total = total + len(text)
        if new_total > max_chars:  # stop before the context exceeds max_chars
            break
        lines.append(text)
        total = new_total
    return "".join(lines)
```
- Write a function to answer the user query
```python
def answer_question(user_q: str, active_doc_id: str) -> str:
    # 1) rewrite the query into several variants
    rewrites = rewrite_query(user_q)
    print("GPT rewrites:", rewrites)

    # 2) fan-out search
    raw_chunks = search_chunks_parallel(rewrites, active_doc_id, k=4)
    if not raw_chunks:
        return (
            '{ "step":"Answer", "content":"No info in current PDF.", '
            f'"input": "{user_q}" }}'
        )

    # 3) deduplicate
    uniq_chunks = deduplicate_chunks(raw_chunks)

    # 4) build the context string
    context_text = build_context(uniq_chunks)

    # 5) ask GPT-4o
    history = [
        SystemMessage(content=_ANSWER_SYS + "\n\nContext:\n" + context_text),
        HumanMessage(content=user_q),
    ]
    ai = LLM_ANSWER.invoke(history)
    return ai.content
```
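Putting it all together, a minimal usage sketch (the PDF file name and question are placeholders):

```python
if __name__ == "__main__":
    doc_id = ingest_pdf(Path("apple_report.pdf"))  # hypothetical file name
    print(answer_question("What are the recent developments in Apple?", doc_id))
```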