# How Parallel Query Retrieval Works: An Overview

## What is Parallel Query Retrieval?

Traditional RAG systems use a single query to retrieve documents, which can overlook subtle nuances in a question. Parallel Query Retrieval improves on this by:

- Creating multiple rephrasings of the user's query.
- Retrieving documents for each variation simultaneously.
- Combining the results into a deduplicated, comprehensive set.

This approach makes retrieval more accurate and more robust, which is especially valuable for complex, multi-faceted queries.
## System Overview

Here's how the system works at a high level:

1. **Document Processing**: Load a PDF and split it into manageable chunks.
2. **Embedding Storage**: Convert the chunks into embeddings and store them in Qdrant.
3. **Query Expansion**: Generate variations of the user's query.
4. **Parallel Retrieval**: Search the vector store with all variations.
5. **Result Fusion**: Merge and deduplicate the retrieved documents.
6. **Response Generation**: Use a language model to answer based on the fused context.
7. **Interactive Interface**: Enable continuous user interaction.
## Let's Dive into the Code: A Step-by-Step Walkthrough
### Step 1: Import Required Libraries

We begin by importing the tools needed for document handling, embeddings, vector storage, and AI interaction.

```python
import os
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_qdrant import QdrantVectorStore
from langchain_core.prompts import PromptTemplate
```
### Step 2: Load the PDF Document

We load the PDF file to extract its text content.

```python
pdf_path = Path(__file__).parent / "Python Programming.pdf"
loader = PyPDFLoader(file_path=pdf_path)
docs = loader.load()
```

- `pdf_path`: Constructs the path to "Python Programming.pdf" relative to the script.
- `PyPDFLoader`: Initializes the PDF loader.
- `loader.load()`: Returns a list of `Document` objects, one per page, containing the PDF's text.
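A quick check (illustrative) confirms what the loader returned: each element of `docs` carries the page text plus metadata.

```python
print(f"Loaded {len(docs)} page(s)")
print(docs[0].metadata)            # e.g. source file and page number
print(docs[0].page_content[:200])  # preview of the first page's text
```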
### Step 3: Split the Document into Chunks

Large documents need to be split into smaller pieces for processing.

```python
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
)
split_docs = text_splitter.split_documents(documents=docs)
```

- `chunk_size=2000`: Limits each chunk to 2,000 characters.
- `chunk_overlap=200`: Overlaps adjacent chunks by 200 characters for context continuity.
- `split_documents`: Produces a list of chunked documents.
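It's worth sanity-checking the split before embedding (illustrative):

```python
print(f"Split {len(docs)} pages into {len(split_docs)} chunks")
print(split_docs[0].page_content[:200])  # preview of the first chunk
```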
### Step 4: Generate Embeddings

We convert the text chunks into numerical vectors using Google's embedding model.

```python
embedder = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004",
    google_api_key="YOUR_API_KEY",  # Replace with your actual key
)
```

- `model`: Specifies Google's "text-embedding-004" model.
- `google_api_key`: Authenticates API access (replace with your key).
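Hard-coding keys is risky. A safer pattern (a sketch) reads the key from the environment; `GOOGLE_API_KEY` is the variable the `langchain_google_genai` integration checks by default. A quick `embed_query` call then confirms the model works:

```python
embedder = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004",
    google_api_key=os.environ["GOOGLE_API_KEY"],  # read from the environment
)

# Sanity check: embed a sample string and inspect the vector size.
vector = embedder.embed_query("What is a Python list?")
print(len(vector))  # text-embedding-004 produces 768-dimensional vectors
```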
### Step 5: Set Up the Vector Store

We store the embeddings in Qdrant for efficient similarity searches.

```python
vector_store = QdrantVectorStore.from_documents(
    documents=split_docs,
    embedding=embedder,
    url="http://localhost:6333",  # Qdrant instance URL
    collection_name="learning_langchain",
)
```

- `from_documents`: Embeds the chunks and stores them in Qdrant.
- `url`: Points to a local Qdrant server (ensure it's running).
- `collection_name`: Labels the storage collection.
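Note that `from_documents` re-embeds and re-uploads every chunk each time the script runs. On later runs you can reconnect to the collection you already built; a sketch, assuming a `langchain_qdrant` version that provides `from_existing_collection`:

```python
vector_store = QdrantVectorStore.from_existing_collection(
    embedding=embedder,
    collection_name="learning_langchain",
    url="http://localhost:6333",
)
```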
### Step 6: Initialize the Language Model

We set up Google's generative AI for response generation.

```python
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    google_api_key="YOUR_API_KEY",  # Replace with your actual key
)
```

- `model`: Uses "gemini-2.0-flash".
- `google_api_key`: Authenticates the API.
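A one-line smoke test (illustrative) confirms the key and model work before wiring up the rest of the pipeline:

```python
print(llm.invoke("Reply with one word: ready").content)
```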
### Step 7: Define the System Prompt

The system prompt instructs the language model on how to respond using the retrieved context.

```python
SYSTEM_PROMPT = """
You are a smart PDF assistant designed to help users understand the content of a PDF document. Your task is to provide accurate, clear, and concise responses based on the user's query and the relevant excerpts from the PDF. Follow these guidelines to ensure your responses are helpful and aligned with the user's intent:

1. **Understand the Query Type**:
   - If the user asks for a **summary**, provide a high-level overview of the main content, focusing on key points or themes.
   - If the user asks for **specific information** (e.g., "What is [term]?"), locate and present that information directly.
   - If the user asks for an **explanation** (e.g., "Explain [concept]"), provide a clear, general overview first, adding specifics only if requested.
   - If the query is vague, assume a general understanding is desired and respond concisely.

2. **Use the PDF Excerpts**:
   - Base your response solely on the provided PDF excerpts. Do not add information beyond what’s in the document.
   - If the excerpts lack the requested information, say: "The PDF does not contain this information."

3. **Tailor the Response**:
   - For **general queries**, prioritize broad, introductory content over technical details.
   - For **specific queries**, focus on the exact details requested, keeping it brief.
   - Synthesize information from multiple excerpts into a single, coherent answer if needed.

4. **Structure Your Answer**:
   - Start with a short, direct response to the query.
   - Add supporting details or context as appropriate, especially for explanations.
   - Keep responses concise for specific questions and slightly longer for summaries or explanations.

5. **Ensure Clarity**:
   - Use simple, clear language.
   - Avoid unnecessary jargon unless it’s central to the query and explained.

If the query is unclear, ask the user for clarification to ensure an accurate response.
"""
```
This ensures responses are accurate, concise, and context-driven.
### Step 8: Generate Query Variations

We create multiple versions of the user's query to improve retrieval.

```python
def generate_query_variations(original_query, num_variations=3):
    prompt = f"Generate {num_variations} different ways to ask the following question: {original_query}"
    response = llm.invoke(prompt)
    variations = response.content.split("\n")
    return [original_query] + [v.strip() for v in variations if v.strip()]
```

- Purpose: Expands the query for broader coverage.
- Output: A list with the original query plus rephrasings (e.g., "What is X?" becomes "Explain X." and "What does X do?").
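One caveat: models often answer with numbered lists ("1. ...", "2. ..."), and the naive newline split keeps that numbering inside the queries. A slightly hardened variant (a sketch) asks for plain lines and strips any bullets or numbers that sneak through:

```python
import re

def generate_query_variations(original_query, num_variations=3):
    prompt = (
        f"Generate {num_variations} different ways to ask the following "
        f"question, one per line, without numbering: {original_query}"
    )
    response = llm.invoke(prompt)
    variations = []
    for line in response.content.split("\n"):
        # Strip leading numbering ("1.", "2)") or bullets ("-", "*").
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if cleaned:
            variations.append(cleaned)
    return [original_query] + variations[:num_variations]
```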
### Step 9: Retrieve Documents in Parallel

We search the vector store with every query variation. Note that, despite the name, the simple version below loops over the queries sequentially; a genuinely concurrent variant follows.

```python
def retrieve_parallel(vector_store, queries, k=3):
    all_docs = []
    for query in queries:
        docs = vector_store.similarity_search(query, k=k)
        all_docs.extend(docs)
    return all_docs
```

- `k=3`: Retrieves the top 3 matching documents per query.
- Process: Combines the results from all queries into one list.
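To make the retrieval actually run in parallel, each search can be dispatched to its own thread. A minimal sketch, assuming your Qdrant client is safe to call from multiple threads:

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve_parallel(vector_store, queries, k=3):
    # Run one similarity search per query concurrently and flatten the results.
    with ThreadPoolExecutor(max_workers=max(len(queries), 1)) as executor:
        results = list(
            executor.map(lambda q: vector_store.similarity_search(q, k=k), queries)
        )
    return [doc for docs in results for doc in docs]
```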
### Step 10: Fuse Retrieved Results

We deduplicate the retrieved documents to keep the context concise.

```python
def fuse_results(docs):
    seen = set()
    unique_docs = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique_docs.append(doc)
    return unique_docs
```

- Purpose: Removes duplicates based on content.
- Output: A list of unique documents.
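To see the fusion at work, compare the raw and deduplicated counts. With four queries and `k=3` you can retrieve up to 12 chunks, often overlapping (the question below is illustrative):

```python
queries = generate_query_variations("What is a Python list?")
raw_docs = retrieve_parallel(vector_store, queries)
fused_docs = fuse_results(raw_docs)
print(f"{len(raw_docs)} retrieved -> {len(fused_docs)} unique chunks")
```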
### Step 11: Create the Chat Function

This function ties together query expansion, retrieval, fusion, and response generation.

```python
def chat_with_fusion(query, vector_store, llm):
    # Generate query variations
    queries = generate_query_variations(query)
    print(f"Generated queries: {queries}")

    # Retrieve documents for every variation
    all_retrieved_docs = retrieve_parallel(vector_store, queries)

    # Fuse the results and join them into one context string
    fused_docs = fuse_results(all_retrieved_docs)
    context = "\n\n...\n\n".join([doc.page_content for doc in fused_docs])

    # Construct the full prompt
    full_prompt = (
        SYSTEM_PROMPT
        + "\n\nHere are the relevant excerpts from the PDF:\n" + context
        + "\n\nUser's question: " + query
        + "\n\nAssistant:"
    )

    # Generate the response
    response = llm.invoke(full_prompt)
    return response.content
```

- Flow: Expands the query, retrieves and fuses documents, builds a prompt, and generates an answer.
### Step 12: Build the Interactive Loop

We create a loop for continuous user interaction.

```python
print("Welcome to the PDF Query Assistant!")
while True:
    query = input("Ask a question about the PDF (or type 'exit' to quit): ")
    if query.lower() == 'exit':
        print("Goodbye!")
        break
    if not query.strip():
        print("Please enter a valid question.")
        continue
    try:
        answer = chat_with_fusion(query, vector_store, llm)
        print("Assistant:", answer)
    except Exception as e:
        print(f"An error occurred: {e}")
```

- Features: Validates input, exits on "exit", and catches errors so the loop keeps running.
### Step 13: Optional Prompt Template

An optional template for cleaner prompt structuring (not used in the main flow).

```python
prompt_template = PromptTemplate(
    input_variables=["query", "excerpts"],
    template=SYSTEM_PROMPT + "\n\nUser Query: {query}\nPDF Excerpts: {excerpts}\nResponse:",
)
```

- Use case: Could replace the manual prompt construction in `chat_with_fusion`.
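If you do wire it in, the template composes with the model via LangChain's pipe syntax. A sketch (the question is illustrative):

```python
chain = prompt_template | llm

queries = generate_query_variations("What is a Python list?")
fused = fuse_results(retrieve_parallel(vector_store, queries))
excerpts = "\n\n".join(doc.page_content for doc in fused)

response = chain.invoke({"query": "What is a Python list?", "excerpts": excerpts})
print(response.content)
```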
## Example Interaction
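A session might look like this (illustrative output; actual answers depend on your PDF and the model):

```text
Welcome to the PDF Query Assistant!
Ask a question about the PDF (or type 'exit' to quit): What is a Python list?
Generated queries: ['What is a Python list?', 'Explain Python lists.', 'How do lists work in Python?', 'What does a list do in Python?']
Assistant: A Python list is an ordered, mutable collection of items...
Ask a question about the PDF (or type 'exit' to quit): exit
Goodbye!
```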
## Conclusion

This interactive RAG system with Parallel Query Fusion offers a powerful way to query PDF content dynamically. By leveraging query variations and result fusion, it provides more accurate and comprehensive answers than single-query RAG. Whether for education, research, or support, the system is easy to adapt. Replace the API keys, run Qdrant locally, and start exploring your PDFs interactively!