Accelerating Retrieval with Parallel Query Execution in RAG Systems


When building Retrieval-Augmented Generation (RAG) systems, latency can quickly become a bottleneck—especially when generating multiple variations of a user query for better context coverage. A common solution? Parallelize your semantic search requests to speed things up dramatically.
This article walks you through a practical example of parallel query retrieval using Python's asyncio, LangChain, and Qdrant as the vector database. We'll use OpenAI's GPT to generate semantically similar queries, retrieve relevant context in parallel, deduplicate the results, and then generate a final answer.
Problem: Why Parallel Query Retrieval?
Let's say your system receives this query:
"How does photosynthesis work?"
To give a more robust answer, you want to:
Generate a few variations of the query.
Retrieve relevant documents for all those variations.
Do it fast, so the user isn’t left waiting.
Instead of retrieving documents sequentially, we can run all search queries in parallel using asyncio.
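If you want to see the effect in isolation, here is a minimal, self-contained sketch. It uses asyncio.sleep to stand in for a network-bound vector search, so the numbers are illustrative only, not measurements from the real pipeline:

import asyncio
import time

async def fake_search(query: str) -> str:
    await asyncio.sleep(1)  # stand-in for a ~1 second vector search round trip
    return f"results for {query!r}"

async def demo():
    queries = ["q1", "q2", "q3", "q4"]

    start = time.perf_counter()
    for q in queries:
        await fake_search(q)  # sequential: roughly 4 seconds total
    print(f"sequential: {time.perf_counter() - start:.1f}s")

    start = time.perf_counter()
    await asyncio.gather(*(fake_search(q) for q in queries))  # parallel: roughly 1 second total
    print(f"parallel:   {time.perf_counter() - start:.1f}s")

asyncio.run(demo())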
Flow Diagram:
Step-by-Step Implementation
1. Generate Similar Queries with LLM
We first use an LLM to expand the original query into semantically similar ones:
from typing import List
import json

async def generate_similar_queries_with_llm(query: str, num_queries: int = 3) -> List[str]:
    prompt = f"Generate {num_queries} similar queries to: '{query}'. Return only a JSON object with a 'queries' key."
    messages = [{"role": "user", "content": prompt}]
    response = call_llm(messages, json_format=True)  # call_llm is defined in the reference repository linked at the end of this article
    data = json.loads(response)
    return data.get('queries', [])
For example,
"How does photosynthesis work?"
may produce:
"Explain the process of photosynthesis."
"What happens during photosynthesis?"
"How do plants convert sunlight into energy?”
2. Retrieve Embeddings & Search — In Parallel
Instead of doing this sequentially, we fire off concurrent retrievals:
from langchain_qdrant import QdrantVectorStore

async def get_embedding_and_search(query: str):
    embedder = initialize_embeddings()
    retriever = QdrantVectorStore.from_existing_collection(
        embedding=embedder,
        collection_name="test_rag",
        url="http://localhost:6333",
    )
    # Use the async search variant so asyncio.gather can actually overlap the requests.
    return await retriever.asimilarity_search(query=query)
And run them all at once:
tasks = [get_embedding_and_search(query) for query in all_queries]
results = await asyncio.gather(*tasks)
This uses asyncio.gather to retrieve documents for each query concurrently, cutting down latency.
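One thing to watch: the function above re-creates the embedder and the Qdrant connection inside every task. A hedged variation (not the repository's exact code) builds the retriever once and shares it across tasks, so the parallel calls only pay for the searches themselves:

import asyncio
from langchain_qdrant import QdrantVectorStore

async def search_with(retriever, query: str):
    # Reuse one vector-store client for every query instead of reconnecting per task.
    return await retriever.asimilarity_search(query=query)

async def retrieve_all(all_queries):
    retriever = QdrantVectorStore.from_existing_collection(
        embedding=initialize_embeddings(),  # same helper used in the snippet above
        collection_name="test_rag",
        url="http://localhost:6333",
    )
    tasks = [search_with(retriever, query) for query in all_queries]
    return await asyncio.gather(*tasks)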
3. Deduplicate Results
With multiple queries, there's a good chance we retrieve duplicate chunks, so we need to deduplicate them:
def identify_unique_chunks(results: List) -> List:
    seen = set()
    unique_chunks = []
    for chunk_list in results:
        for chunk in chunk_list:
            if chunk.page_content not in seen:
                seen.add(chunk.page_content)
                unique_chunks.append(chunk)
    return unique_chunks
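As a quick sanity check, here is a small usage example with hand-made Document objects (not real retrieval results) showing how overlapping result lists collapse into unique chunks:

from langchain_core.documents import Document

results = [
    [Document(page_content="Photosynthesis converts light into chemical energy."),
     Document(page_content="Chlorophyll absorbs sunlight.")],
    [Document(page_content="Chlorophyll absorbs sunlight."),  # duplicate returned by another query
     Document(page_content="The Calvin cycle fixes carbon dioxide.")],
]

print(len(identify_unique_chunks(results)))  # 3 -- the duplicate is kept only once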
4. Final Answer Generation Using Context
Feed the deduplicated chunks back into the LLM to generate a final answer:
async def run_llm(query: str, context: List) -> str:
    context_text = "\n".join([chunk.page_content for chunk in context])
    messages = [
        {"role": "system", "content": "Use the provided context to answer the question accurately."},
        {"role": "user", "content": f"Question: {query}\nContext: {context_text}"}
    ]
    return call_llm(messages)
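With several query variations, the deduplicated context can grow large. If you need to keep prompts bounded, a minimal sketch like the one below caps the context before it is passed to run_llm; the 8,000-character budget is an arbitrary assumption, tune it for your model's context window:

def cap_context(chunks, max_chars: int = 8000):
    # Keep chunks (in order) until a rough character budget is reached.
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk.page_content) > max_chars:
            break
        kept.append(chunk)
        used += len(chunk.page_content)
    return kept  # pass this trimmed list to run_llm instead of the full chunk set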
Main Logic Tying It All Together
Here’s the orchestrator:
async def main_logic(original_query: str):
    similar_queries = await generate_similar_queries_with_llm(original_query)
    all_queries = [original_query] + similar_queries
    # Perform retrieval in parallel
    tasks = [get_embedding_and_search(query) for query in all_queries]
    results = await asyncio.gather(*tasks)
    unique_chunks = identify_unique_chunks(results)
    llm_response = await run_llm(original_query, unique_chunks)
    return llm_response
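In practice, one failing search shouldn't take down the whole request. A hedged variation of the orchestration step (not in the reference repository) uses asyncio.gather with return_exceptions=True so failed retrievals are skipped instead of raising:

import asyncio

async def safe_gather(tasks):
    # Collect results but tolerate individual task failures.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    failures = [r for r in results if isinstance(r, Exception)]
    if failures:
        print(f"{len(failures)} retrieval task(s) failed; continuing with the rest.")
    return [r for r in results if not isinstance(r, Exception)]

main_logic could then call results = await safe_gather(tasks) in place of the plain asyncio.gather line.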
Result: Fast, Rich Answers (Because of More Context)
Here’s the full pipeline:
Use an LLM to rephrase the query.
Search in parallel to reduce latency.
Deduplicate the retrieved chunks.
Pass the unique chunks to the LLM for a better answer.
Run it from the CLI:
import asyncio

if __name__ == "__main__":
    query = input("Enter your query: ")
    response = asyncio.run(main_logic(query))
    print("\nFinal Response:")
    print(response)
When to Use This Pattern?
This pattern is a good fit for:
Chatbots that need to give detailed, well-grounded answers.
Enterprise search over mixed document sources.
Research assistants where deep contextual understanding matters.
Conclusion
Parallel retrieval is a game changer for boosting both performance and answer quality in multi-query RAG systems, and it takes only a few lines of asyncio code to add.
Code link
GitHub link for the detailed implementation: https://github.com/sandipdeshmukh77/RAG/blob/main/parallel_query_retrival.py