# How Parallel Query Retrieval Works: An Overview

## What is Parallel Query Retrieval?

Traditional RAG systems use a single query to retrieve documents, which can overlook subtle nuances in a question. Parallel Query Retrieval improves on this by:

- Creating multiple rephrasings of the user's query.
- Retrieving documents for each variation simultaneously.
- Combining the results into a deduplicated, comprehensive set.

This approach makes retrieval more accurate and more robust, which is especially valuable for complex, multi-faceted queries.
## System Overview

Here's how the system works at a high level:

1. **Document Processing**: Load a PDF and split it into manageable chunks.
2. **Embedding Storage**: Convert the chunks into embeddings and store them in Qdrant.
3. **Query Expansion**: Generate variations of the user's query.
4. **Parallel Retrieval**: Search the vector store with all variations.
5. **Result Fusion**: Merge and deduplicate the retrieved documents.
6. **Response Generation**: Use a language model to answer based on the fused context.
7. **Interactive Interface**: Enable continuous user interaction.
## Let's Dive into the Code: A Step-by-Step Walkthrough
### Step 1: Import Required Libraries

We begin by importing the tools needed for document handling, embeddings, vector storage, and AI interaction.

```python
import os
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_qdrant import QdrantVectorStore
from langchain_core.prompts import PromptTemplate
```
### Step 2: Load the PDF Document

We load the PDF file to extract its text content.

```python
pdf_path = Path(__file__).parent / "Python Programming.pdf"
loader = PyPDFLoader(file_path=pdf_path)
docs = loader.load()
```

- `pdf_path`: Constructs the path to "Python Programming.pdf" relative to the script.
- `PyPDFLoader`: Initializes the PDF loader.
- `loader.load()`: Returns a list of `Document` objects, one per page, containing the PDF's text.
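A quick check (illustrative) confirms what the loader returned: each element of `docs` carries the page text plus metadata.

```python
print(f"Loaded {len(docs)} page(s)")
print(docs[0].metadata)            # e.g. source file and page number
print(docs[0].page_content[:200])  # preview of the first page's text
```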
### Step 3: Split the Document into Chunks

Large documents need to be split into smaller pieces for processing.

```python
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
)
split_docs = text_splitter.split_documents(documents=docs)
```

- `chunk_size=2000`: Limits each chunk to 2,000 characters.
- `chunk_overlap=200`: Overlaps adjacent chunks by 200 characters for context continuity.
- `split_documents`: Produces a list of chunked documents.
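It's worth sanity-checking the split before embedding (illustrative):

```python
print(f"Split {len(docs)} pages into {len(split_docs)} chunks")
print(split_docs[0].page_content[:200])  # preview of the first chunk
```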
### Step 4: Generate Embeddings

We convert the text chunks into numerical vectors using Google's embedding model.

```python
embedder = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004",
    google_api_key="YOUR_API_KEY",  # Replace with your actual key
)
```

- `model`: Specifies Google's "text-embedding-004" model.
- `google_api_key`: Authenticates API access (replace with your key).
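Hard-coding keys is risky. A safer pattern (a sketch) reads the key from the environment; `GOOGLE_API_KEY` is the variable the `langchain_google_genai` integration checks by default. A quick `embed_query` call then confirms the model works:

```python
embedder = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004",
    google_api_key=os.environ["GOOGLE_API_KEY"],  # read from the environment
)

# Sanity check: embed a sample string and inspect the vector size.
vector = embedder.embed_query("What is a Python list?")
print(len(vector))  # text-embedding-004 produces 768-dimensional vectors
```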
### Step 5: Set Up the Vector Store

We store the embeddings in Qdrant for efficient similarity searches.

```python
vector_store = QdrantVectorStore.from_documents(
    documents=split_docs,
    embedding=embedder,
    url="http://localhost:6333",  # Qdrant instance URL
    collection_name="learning_langchain",
)
```

- `from_documents`: Embeds the chunks and stores them in Qdrant.
- `url`: Points to a local Qdrant server (ensure it's running).
- `collection_name`: Labels the storage collection.
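Note that `from_documents` re-embeds and re-uploads every chunk each time the script runs. On later runs you can reconnect to the collection you already built; a sketch, assuming a `langchain_qdrant` version that provides `from_existing_collection`:

```python
vector_store = QdrantVectorStore.from_existing_collection(
    embedding=embedder,
    collection_name="learning_langchain",
    url="http://localhost:6333",
)
```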
### Step 6: Initialize the Language Model

We set up Google's generative AI for response generation.

```python
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    google_api_key="YOUR_API_KEY",  # Replace with your actual key
)
```

- `model`: Uses "gemini-2.0-flash".
- `google_api_key`: Authenticates the API.
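A one-line smoke test (illustrative) confirms the key and model work before wiring up the rest of the pipeline:

```python
print(llm.invoke("Reply with one word: ready").content)
```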
### Step 7: Define the System Prompt

The system prompt instructs the language model on how to respond using the retrieved context.

```python
SYSTEM_PROMPT = """
You are a smart PDF assistant designed to help users understand the content of a PDF document. Your task is to provide accurate, clear, and concise responses based on the user's query and the relevant excerpts from the PDF. Follow these guidelines to ensure your responses are helpful and aligned with the user's intent:

1. **Understand the Query Type**:
   - If the user asks for a **summary**, provide a high-level overview of the main content, focusing on key points or themes.
   - If the user asks for **specific information** (e.g., "What is [term]?"), locate and present that information directly.
   - If the user asks for an **explanation** (e.g., "Explain [concept]"), provide a clear, general overview first, adding specifics only if requested.
   - If the query is vague, assume a general understanding is desired and respond concisely.

2. **Use the PDF Excerpts**:
   - Base your response solely on the provided PDF excerpts. Do not add information beyond what’s in the document.
   - If the excerpts lack the requested information, say: "The PDF does not contain this information."

3. **Tailor the Response**:
   - For **general queries**, prioritize broad, introductory content over technical details.
   - For **specific queries**, focus on the exact details requested, keeping it brief.
   - Synthesize information from multiple excerpts into a single, coherent answer if needed.

4. **Structure Your Answer**:
   - Start with a short, direct response to the query.
   - Add supporting details or context as appropriate, especially for explanations.
   - Keep responses concise for specific questions and slightly longer for summaries or explanations.

5. **Ensure Clarity**:
   - Use simple, clear language.
   - Avoid unnecessary jargon unless it’s central to the query and explained.

If the query is unclear, ask the user for clarification to ensure an accurate response.
"""
```
This ensures responses are accurate, concise, and context-driven.
### Step 8: Generate Query Variations

We create multiple versions of the user's query to improve retrieval.

```python
def generate_query_variations(original_query, num_variations=3):
    prompt = f"Generate {num_variations} different ways to ask the following question: {original_query}"
    response = llm.invoke(prompt)
    variations = response.content.split("\n")
    return [original_query] + [v.strip() for v in variations if v.strip()]
```

- Purpose: Expands the query for broader coverage.
- Output: A list with the original query plus rephrasings (e.g., "What is X?" becomes "Explain X." and "What does X do?").
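One caveat: models often answer with numbered lists ("1. ...", "2. ..."), and the naive newline split keeps that numbering inside the queries. A slightly hardened variant (a sketch) asks for plain lines and strips any bullets or numbers that sneak through:

```python
import re

def generate_query_variations(original_query, num_variations=3):
    prompt = (
        f"Generate {num_variations} different ways to ask the following "
        f"question, one per line, without numbering: {original_query}"
    )
    response = llm.invoke(prompt)
    variations = []
    for line in response.content.split("\n"):
        # Strip leading numbering ("1.", "2)") or bullets ("-", "*").
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if cleaned:
            variations.append(cleaned)
    return [original_query] + variations[:num_variations]
```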
### Step 9: Retrieve Documents in Parallel

We search the vector store with every query variation. Note that, despite the name, the simple version below loops over the queries sequentially; a genuinely concurrent variant follows.

```python
def retrieve_parallel(vector_store, queries, k=3):
    all_docs = []
    for query in queries:
        docs = vector_store.similarity_search(query, k=k)
        all_docs.extend(docs)
    return all_docs
```

- `k=3`: Retrieves the top 3 matching documents per query.
- Process: Combines the results from all queries into one list.
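To make the retrieval actually run in parallel, each search can be dispatched to its own thread. A minimal sketch, assuming your Qdrant client is safe to call from multiple threads:

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve_parallel(vector_store, queries, k=3):
    # Run one similarity search per query concurrently and flatten the results.
    with ThreadPoolExecutor(max_workers=max(len(queries), 1)) as executor:
        results = list(
            executor.map(lambda q: vector_store.similarity_search(q, k=k), queries)
        )
    return [doc for docs in results for doc in docs]
```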
### Step 10: Fuse Retrieved Results

We deduplicate the retrieved documents to keep the context concise.

```python
def fuse_results(docs):
    seen = set()
    unique_docs = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique_docs.append(doc)
    return unique_docs
```

- Purpose: Removes duplicates based on content.
- Output: A list of unique documents.
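To see the fusion at work, compare the raw and deduplicated counts. With four queries and `k=3` you can retrieve up to 12 chunks, often overlapping (the question below is illustrative):

```python
queries = generate_query_variations("What is a Python list?")
raw_docs = retrieve_parallel(vector_store, queries)
fused_docs = fuse_results(raw_docs)
print(f"{len(raw_docs)} retrieved -> {len(fused_docs)} unique chunks")
```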
### Step 11: Create the Chat Function

This function ties together query expansion, retrieval, fusion, and response generation.

```python
def chat_with_fusion(query, vector_store, llm):
    # Generate query variations
    queries = generate_query_variations(query)
    print(f"Generated queries: {queries}")

    # Retrieve documents for every variation
    all_retrieved_docs = retrieve_parallel(vector_store, queries)

    # Fuse the results and join them into one context string
    fused_docs = fuse_results(all_retrieved_docs)
    context = "\n\n...\n\n".join([doc.page_content for doc in fused_docs])

    # Construct the full prompt
    full_prompt = (
        SYSTEM_PROMPT
        + "\n\nHere are the relevant excerpts from the PDF:\n" + context
        + "\n\nUser's question: " + query
        + "\n\nAssistant:"
    )

    # Generate the response
    response = llm.invoke(full_prompt)
    return response.content
```

- Flow: Expands the query, retrieves and fuses documents, builds a prompt, and generates an answer.
### Step 12: Build the Interactive Loop

We create a loop for continuous user interaction.

```python
print("Welcome to the PDF Query Assistant!")
while True:
    query = input("Ask a question about the PDF (or type 'exit' to quit): ")
    if query.lower() == 'exit':
        print("Goodbye!")
        break
    if not query.strip():
        print("Please enter a valid question.")
        continue
    try:
        answer = chat_with_fusion(query, vector_store, llm)
        print("Assistant:", answer)
    except Exception as e:
        print(f"An error occurred: {e}")
```

- Features: Validates input, exits on "exit", and catches errors so the loop keeps running.
### Step 13: Optional Prompt Template

An optional template for cleaner prompt structuring (not used in the main flow).

```python
prompt_template = PromptTemplate(
    input_variables=["query", "excerpts"],
    template=SYSTEM_PROMPT + "\n\nUser Query: {query}\nPDF Excerpts: {excerpts}\nResponse:",
)
```

- Use case: Could replace the manual prompt construction in `chat_with_fusion`.
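If you do wire it in, the template composes with the model via LangChain's pipe syntax. A sketch (the question is illustrative):

```python
chain = prompt_template | llm

queries = generate_query_variations("What is a Python list?")
fused = fuse_results(retrieve_parallel(vector_store, queries))
excerpts = "\n\n".join(doc.page_content for doc in fused)

response = chain.invoke({"query": "What is a Python list?", "excerpts": excerpts})
print(response.content)
```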
## Example Interaction
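A session might look like this (illustrative output; actual answers depend on your PDF and the model):

```text
Welcome to the PDF Query Assistant!
Ask a question about the PDF (or type 'exit' to quit): What is a Python list?
Generated queries: ['What is a Python list?', 'Explain Python lists.', 'How do lists work in Python?', 'What does a list do in Python?']
Assistant: A Python list is an ordered, mutable collection of items...
Ask a question about the PDF (or type 'exit' to quit): exit
Goodbye!
```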
## Conclusion

This interactive RAG system with Parallel Query Fusion offers a powerful way to query PDF content dynamically. By leveraging query variations and result fusion, it provides more accurate and comprehensive answers than single-query RAG. Whether for education, research, or support, the system is easy to adapt. Replace the API keys, run Qdrant locally, and start exploring your PDFs interactively!