System Design of RAGs (Production‑Ready)

Table of contents
- Table of contents
- 1. Basic RAG flow
- 2. Why improve RAG
- 3. Query rewriting / translation: make queries clearer for better search
- 4. CRAG (Corrective RAG): fix poor retrieval automatically
- 5. Ranking: order retrieved docs so best ones go to the LLM
- 6. HyDE: generate a hypothetical answer first, then use it to find real docs
- Advanced RAG concepts & production patterns

Advanced concepts covered: scaling RAGs, speed vs accuracy tradeoffs, LLM-as-evaluator, sub‑query rewriting, ranking strategies, caching, hybrid search, contextual embeddings, GraphRAG, and production pipelines.
1. Basic RAG flow
One line: query → retrieve → generate answer.
What happens:
- The user sends a query.
- The system retrieves relevant documents/chunks from a vector DB or search index.
- An LLM consumes the retrieved context in a prompt and generates the final answer.
When to use: quick prototypes, small datasets, low-cost proof of concept.
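A minimal sketch of this flow, assuming hypothetical embed(), vector_search(), and llm() helpers in place of a real embedding model, vector DB, and LLM client:

```python
# Minimal RAG loop. embed(), vector_search(), and llm() are hypothetical
# stand-ins for your embedding model, vector DB, and LLM client.

def embed(text: str) -> list[float]:
    """Placeholder: call your embedding model here."""
    return [float(ord(c)) for c in text[:8]]  # dummy vector for illustration

def vector_search(query_vector: list[float], top_k: int = 3) -> list[str]:
    """Placeholder: query your vector DB / search index here."""
    return ["chunk about topic A", "chunk about topic B"][:top_k]

def llm(prompt: str) -> str:
    """Placeholder: call your LLM API here."""
    return f"Answer based on: {prompt[:60]}..."

def basic_rag(query: str) -> str:
    query_vector = embed(query)                    # 1. embed the query
    chunks = vector_search(query_vector, top_k=3)  # 2. retrieve relevant chunks
    context = "\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."
    return llm(prompt)                             # 3. generate the final answer

print(basic_rag("What is a vector database?"))
```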
Question 1
Suppose we have four different RAG systems with varying speed, accuracy, and cost. The main question is: which one should we choose based on what matters most for our use case - speed, accuracy, or cost?
The best choice is accuracy, because even if a system is fast or cheap, it’s not useful if the answers are wrong.
Question 2
How can we control the accuracy?
At first, you might think accuracy only depends on the AI model, but that’s not true.
1. If the source files are wrong, the output will also be wrong.
2. If the files are correct but the user has little knowledge, they may ask a poor or incomplete query.
3. If the user forgets important keywords or makes many typos, accuracy will drop.
👉 So, instead of only doubting your pipeline, think from the user’s perspective:
1. The files they uploaded might be indexed incorrectly.
2. The user’s query might be unclear or incorrect.
3. The user might introduce too many typos.
2. Why improve RAG
Key problems in production:
- Accuracy: irrelevant retrieval or hallucinations.
- Speed: multi-step retrieval + generation can be slow.
- Cost: many API calls / large contexts increase price.
- Scalability: more users and larger corpora require different retrieval strategies.
Typical failure modes: poor queries, wrong indexing, noisy documents, user typos and misunderstandings.
3. Query rewriting / translation: make queries clearer for better search
Goal: convert noisy user input into clearer, retrieval‑friendly queries.
Techniques:
- Typo correction and normalization.
- Add domain keywords or expand acronyms.
- Translate vague queries into structured subqueries.
- Context injection: add user metadata (course, role, previous messages).
Pattern: user query → rewrite/expand → candidate subqueries → embed & retrieve.
Why it helps: better queries = better embeddings = better retrieval = fewer hallucinations.
Query Rewriting ensures cleaner and more complete queries, leading to better retrieval.
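A minimal sketch of a rewrite step, assuming a hypothetical llm() helper and a hand-rolled acronym map (not any particular library):

```python
# Query rewriting: normalize the raw query, expand known acronyms, and ask an
# LLM to produce a cleaner, retrieval-friendly version. llm() is a placeholder.

ACRONYMS = {"js": "JavaScript", "db": "database", "ml": "machine learning"}

def llm(prompt: str) -> str:
    """Placeholder: call your LLM API here."""
    return prompt.split("Query:")[-1].strip()  # echo back for illustration

def rewrite_query(raw_query: str, user_context: dict) -> str:
    # 1. Normalize: trim and expand acronyms the user may have used.
    words = [ACRONYMS.get(w.lower(), w) for w in raw_query.strip().split()]
    normalized = " ".join(words)

    # 2. Inject user metadata so retrieval respects the user's context.
    prompt = (
        f"Rewrite this query so it is clear and keyword-rich for search.\n"
        f"Course: {user_context.get('course', 'unknown')}\n"
        f"Query: {normalized}"
    )
    return llm(prompt)

print(rewrite_query("how do js closures work??", {"course": "Web Dev 101"}))
```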
4. CRAG (Corrective RAG): fix poor retrieval automatically
Idea: when initial retrieval is weak, run corrective steps to improve context before final generation.
Typical corrective steps:
- Re‑rewrite the query using retrieved chunk summaries.
- Expand queries into multiple subqueries and re‑retrieve.
- Use an LLM to validate whether chunks actually answer the query (LLM-as-evaluator).
- Retry retrieval with alternate embeddings or different candidate sources.
Tradeoffs: more API/compute calls and latency, but dramatically higher accuracy for hard queries.
In short, CRAG is a retrieval-augmented pipeline that improves accuracy by rewriting queries, validating retrieved chunks, and refining context before final generation.
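A rough sketch of that corrective loop, where retrieve(), grade_chunk(), rewrite(), and generate() are hypothetical helpers standing in for your retriever and LLM calls:

```python
# Corrective RAG (CRAG) loop: retrieve, let an LLM grade the chunks, and if the
# evidence is weak, rewrite the query and retry before generating the answer.

def retrieve(query: str) -> list[str]:
    return ["some chunk about " + query]       # placeholder retrieval

def grade_chunk(query: str, chunk: str) -> float:
    return 0.9                                  # placeholder LLM grader score in [0, 1]

def rewrite(query: str, chunks: list[str]) -> str:
    return query + " (expanded with keywords from retrieved summaries)"

def generate(query: str, chunks: list[str]) -> str:
    return f"Answer to '{query}' grounded in {len(chunks)} chunk(s)."

def corrective_rag(query: str, min_score: float = 0.6, max_retries: int = 2) -> str:
    for _ in range(max_retries + 1):
        chunks = retrieve(query)
        good = [c for c in chunks if grade_chunk(query, c) >= min_score]
        if good:                                # evidence looks relevant: answer now
            return generate(query, good)
        query = rewrite(query, chunks)          # weak evidence: correct the query and retry
    return "I could not find reliable context for this question."

print(corrective_rag("two sum in JavaScript"))
```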
5. Ranking: order retrieved docs so best ones go to the LLM
Goal: reduce noise by selecting the most relevant chunks to include in the prompt.
Strategies:
- Embedding similarity + signal fusion: combine cosine similarity with metadata (timestamp, source trust, recency).
- LLM re‑ranker: ask an LLM to score candidate chunks (good for semantic nuance).
- Learned ranker: train a small model on click/feedback signals.
- Top‑N filtering per subquery: retrieve N per subquery, then merge and rank globally.
Implementation detail: always trim to a safe token budget; rankers let you pack the highest‑value context.
The ranking pipeline retrieves the top‑N chunks per subquery, ranks them for relevance, and combines the most useful context before final generation, reducing noise and hallucination.
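A sketch of the signal-fusion idea, with illustrative weights and metadata fields (the 0.7/0.2/0.1 split and the chunk schema are assumptions to tune on your own data):

```python
# Signal fusion ranking: blend cosine similarity with simple metadata signals
# (recency, source trust) into one score, then keep the best chunks that fit
# the token budget.

import math
import time

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def score(chunk: dict, query_vec: list[float], now: float) -> float:
    similarity = cosine(chunk["embedding"], query_vec)
    age_days = (now - chunk["timestamp"]) / 86400
    recency = 1.0 / (1.0 + age_days)            # newer chunks score higher
    return 0.7 * similarity + 0.2 * recency + 0.1 * chunk["source_trust"]

def rank_and_trim(chunks: list[dict], query_vec: list[float], token_budget: int = 1000) -> list[dict]:
    now = time.time()
    ranked = sorted(chunks, key=lambda c: score(c, query_vec, now), reverse=True)
    selected, used = [], 0
    for chunk in ranked:                        # pack highest-value context first
        if used + chunk["tokens"] > token_budget:
            break
        selected.append(chunk)
        used += chunk["tokens"]
    return selected
```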
6. HyDE: generate a hypothetical answer first, then use it to find real docs
Concept: generate a synthetic (hypothetical) answer for the query, embed that answer, and use the embedding to retrieve matching real documents. This often surfaces more focused chunks than embedding the raw query.
Why it works: HyDE turns a short/ambiguous query into a rich semantic representation that better matches how human texts explain answers.
Best use case: teacher/academy content where you must return only class‑taught material - HyDE finds teacher‑specific chunks that align with the hypothetical student answer.
Suppose you need to build a RAG system for a teacher or academy where students’ doubts must be answered strictly from what the teacher taught in class.
For example, if in a JavaScript class the teacher solved the ‘Two Sum’ program in a specific way, then the chatbot should return only that exact explanation and example, not any alternate or generalized answer. How is this possible? HyDE helps here: the hypothetical answer is embedded and matched against the teacher’s own indexed chunks (filtered by teacher or class), so retrieval surfaces the explanation actually given in class.
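A minimal HyDE sketch for this scenario, assuming hypothetical llm(), embed(), and vector_search() helpers and a teacher_id metadata filter:

```python
# HyDE sketch: generate a hypothetical answer with the LLM, embed that answer,
# and use its embedding (filtered by teacher_id) to retrieve real chunks.

def llm(prompt: str) -> str:
    return "A hypothetical explanation of Two Sum using a hash map, as taught in class."

def embed(text: str) -> list[float]:
    return [float(len(w)) for w in text.split()[:8]]

def vector_search(vector: list[float], metadata_filter: dict, top_k: int = 3) -> list[str]:
    return [f"teacher {metadata_filter['teacher_id']}: Two Sum solved with a hash map"]

def hyde_retrieve(query: str, teacher_id: str) -> list[str]:
    hypothetical = llm(f"Write a short answer a student might expect for: {query}")
    vector = embed(hypothetical)                # embed the answer, not the query
    return vector_search(vector, {"teacher_id": teacher_id})

print(hyde_retrieve("How do I solve Two Sum in JavaScript?", teacher_id="t-42"))
```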
Advanced RAG concepts & production patterns
Below are advanced techniques you’ll use when moving RAG to production.
Scale & performance
- Sharding vector DBs and partitioning by domain.
- Approximate nearest neighbor (ANN) engines (HNSW, FAISS, Qdrant) for speed.
- Hot cache for popular queries/answers to reduce repeated LLM calls.
- Batching embedding requests and reusing embeddings when possible.
Speed vs accuracy tradeoffs
- Low‑latency mode: smaller LLM, smaller context, tighter ranker → faster but less nuanced answers.
- High‑accuracy mode: CRAG + HyDE + LLM re‑rankers → slower and costlier but fewer hallucinations.
- Offer configurable profiles per customer (fast vs accurate).
Using LLM as evaluator
- Use an LLM to judge whether a chunk answers the query (binary/score).
- Combine LLM scores with embedding similarity to form a robust relevance score.
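A small sketch of that blending step; the 0–10 grading prompt, the llm_grade() helper, and the 0.6/0.4 weights are assumptions, not any specific library’s API:

```python
# LLM-as-evaluator sketch: ask an LLM to grade a chunk's relevance, then blend
# that grade with embedding similarity into one relevance score.

def llm_grade(query: str, chunk: str) -> float:
    """Placeholder: prompt an LLM, e.g. 'Rate 0-10 how well this chunk answers
    the question', parse the number, and scale it to [0, 1]."""
    return 0.8

def relevance_score(query: str, chunk: str, embedding_similarity: float) -> float:
    # Blend the two signals; weights should be tuned on your own eval set.
    return 0.6 * llm_grade(query, chunk) + 0.4 * embedding_similarity

print(relevance_score("What is HNSW?", "HNSW is a graph-based ANN index...", 0.72))
```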
Subquery rewriting & orchestration
- Break complex queries into subqueries (who/what/when/how) and retrieve per subquery.
- Merge top‑k from each subquery, then rank globally.
- This uncovers diverse evidence and reduces missing facets.
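A sketch of the merge step, where decompose() and retrieve() are hypothetical helpers standing in for an LLM decomposition call and your vector DB:

```python
# Subquery orchestration sketch: retrieve top-k chunks per subquery, deduplicate,
# and merge into one candidate pool for global ranking.

def decompose(query: str) -> list[str]:
    return [f"what: {query}", f"how: {query}", f"when: {query}"]

def retrieve(subquery: str, top_k: int = 3) -> list[str]:
    return [f"chunk {i} for '{subquery}'" for i in range(top_k)]

def gather_candidates(query: str, top_k: int = 3) -> list[str]:
    seen, merged = set(), []
    for subquery in decompose(query):
        for chunk in retrieve(subquery, top_k):
            if chunk not in seen:               # dedupe across subqueries
                seen.add(chunk)
                merged.append(chunk)
    return merged                               # hand off to a global ranker

print(len(gather_candidates("How did transformers change NLP?")))
```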
Ranking strategies (summary)
- Heuristics: cosine + recency + source weight.
- LLM re‑ranker: semantic judgement of top candidates.
- Learning to rank: use labeled examples/feedback for continuous improvement.
Caching
- Cache embeddings for repeated queries and cache LLM responses for exact prompts.
- Use a validity TTL and invalidate when relevant docs change.
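A simple in-memory sketch of a TTL response cache keyed by a hash of the exact prompt; in production you would likely back this with Redis or a similar store rather than a dict:

```python
# Response cache with a TTL, keyed by a hash of the exact prompt.

import hashlib
import time

CACHE: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 3600

def cache_key(prompt: str) -> str:
    return hashlib.sha256(prompt.encode()).hexdigest()

def cached_llm(prompt: str, llm_call) -> str:
    key = cache_key(prompt)
    hit = CACHE.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:  # fresh hit: skip the LLM call
        return hit[1]
    answer = llm_call(prompt)                       # miss or expired: call and store
    CACHE[key] = (time.time(), answer)
    return answer

print(cached_llm("What is RAG?", lambda p: "Retrieval-augmented generation."))
```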
Hybrid search
- Combine keyword/boolean search (BM25) with semantic (vector) search to handle exact matches and semantic matches.
- Useful for legal or code corpora where exact wording matters.
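One common way to combine the two result lists is reciprocal rank fusion (RRF); here is a sketch where keyword_search() and vector_search() are placeholders for your BM25 index and vector DB:

```python
# Hybrid search sketch: merge a keyword (BM25-style) result list and a vector
# result list with reciprocal rank fusion (RRF).

def keyword_search(query: str) -> list[str]:
    return ["doc-legal-17", "doc-code-3", "doc-blog-9"]

def vector_search(query: str) -> list[str]:
    return ["doc-code-3", "doc-blog-9", "doc-faq-2"]

def hybrid_search(query: str, k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in (keyword_search(query), vector_search(query)):
        for rank, doc_id in enumerate(results):
            # RRF: documents that rank well in either list float to the top.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

print(hybrid_search("breach of contract clause"))
```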
Contextual embeddings
- Build embeddings that include short metadata (document type, teacher id, class id) so retrieval respects context constraints.
- Contextual embeddings make it easier to restrict results to a single teacher or class.
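A tiny sketch of one way to do this: prepend a short metadata header to the chunk text before embedding. The metadata fields and embed() helper are illustrative assumptions:

```python
# Contextual embedding sketch: prepend metadata to the chunk text before
# embedding, so the vector itself carries the context.

def embed(text: str) -> list[float]:
    return [float(len(w)) for w in text.split()[:8]]

def contextual_embedding(chunk_text: str, metadata: dict) -> list[float]:
    header = f"[type={metadata['doc_type']} teacher={metadata['teacher_id']} class={metadata['class_id']}]"
    return embed(header + " " + chunk_text)     # context and content share one vector

vec = contextual_embedding(
    "Two Sum solved with a hash map in O(n).",
    {"doc_type": "lecture", "teacher_id": "t-42", "class_id": "js-101"},
)
print(len(vec))
```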
GraphRAG
- Build a graph over entities/subjects to navigate related documents (useful for multi‑document reasoning).
- Graph search can provide structured chains of evidence for complex queries.
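A toy sketch of the idea: a small entity graph whose nodes point at the documents that mention them, walked breadth-first from the query’s entities. The graph and extract_entities() are illustrative assumptions, not a GraphRAG library:

```python
# GraphRAG sketch: walk an entity graph from the query's entities and collect
# related documents as a chain of evidence.

from collections import deque

GRAPH = {  # entity -> (neighbouring entities, documents mentioning it)
    "closures": (["scope"], ["doc-js-closures"]),
    "scope": (["hoisting"], ["doc-js-scope"]),
    "hoisting": ([], ["doc-js-hoisting"]),
}

def extract_entities(query: str) -> list[str]:
    return [e for e in GRAPH if e in query.lower()]

def graph_retrieve(query: str, max_hops: int = 2) -> list[str]:
    docs, seen = [], set()
    queue = deque((e, 0) for e in extract_entities(query))
    while queue:
        entity, depth = queue.popleft()
        if entity in seen or depth > max_hops:
            continue
        seen.add(entity)
        neighbours, entity_docs = GRAPH[entity]
        docs.extend(entity_docs)                # evidence from this entity
        queue.extend((n, depth + 1) for n in neighbours)
    return docs

print(graph_retrieve("Why do closures capture scope in JavaScript?"))
```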
Production‑ready pipeline checklist
- Input validation & query normalization.
- Embedding caching and an efficient ANN index.
- Query rewrite + subquery orchestration.
- Top‑N retrieval per subquery & learned/LLM ranking.
- Corrective loop (CRAG) for low‑confidence responses.
- Answer grounding: always return source ids & confidence.
- Monitoring: precision/recall, latency, cost per request.
- Feedback loop: collect user corrections and use them to retrain the ranker and rewrite modules.
Final takeaway
Production‑ready RAG is not a single model; it’s a pipeline. You must combine rewriting, smart retrieval, ranking, corrective steps, and grounding techniques (like HyDE) to build systems that are accurate, scalable, and trustworthy. Tradeoffs between speed and accuracy are inevitable - design profiles and monitoring so your system meets real user needs.