System Design of RAGs (Production‑Ready)

MANOJ KUMAR
6 min read

Table of contents

  1. Basic RAG flow

  2. Why improve RAG

  3. Query rewriting / translation

  4. CRAG (Corrective RAG)

  5. Ranking

  6. HyDE

Advanced concepts covered: scaling RAGs, speed vs accuracy tradeoffs, LLM-as-evaluator, sub‑query rewriting, ranking strategies, caching, hybrid search, contextual embeddings, GraphRAG, and production pipelines.

1. Basic RAG flow

One line: query → retrieve → generate answer

What happens:

  • User sends a query.

  • The system retrieves relevant documents/chunks from a vector DB or search index.

  • An LLM consumes the retrieved context in a prompt and generates the final answer.

When to use: quick prototypes, small datasets, low-cost proof of concept.

Basic RAG flow
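
A minimal sketch of this flow in Python, assuming a generic vector_db.search(...) and llm.generate(...) interface (placeholder names, not a specific library):

```python
# Minimal RAG flow: query -> retrieve -> generate.
# `vector_db` and `llm` stand in for whatever vector store and LLM client
# you actually use (assumed interface, not a specific library).

def basic_rag(query: str, vector_db, llm, top_k: int = 5) -> str:
    # 1. Retrieve the most similar chunks for the raw query.
    chunks = vector_db.search(query, top_k=top_k)

    # 2. Pack the retrieved text into the prompt as context.
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

    # 3. Generate the final answer from the grounded prompt.
    return llm.generate(prompt)
```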

Question 1

Suppose we have four different RAG systems with varying speed, accuracy, and cost. Which one should we choose? That depends on what matters most for our use case: speed, accuracy, or cost.

For most production use cases, the best choice is accuracy: a system that is fast or cheap is still useless if its answers are wrong.

Question 2

How can we control the accuracy?

At first, you might think accuracy only depends on the AI model, but that’s not true.

1. If the source files are wrong, the output will also be wrong.

2. If the files are correct but the user has little knowledge, they may ask a poor or incomplete query.

3. If the user forgets important keywords or makes many typos, accuracy will drop.

👉 So, instead of only doubting your pipeline, think from the user’s perspective:

1. The files they uploaded might be indexed incorrectly.

2. The user’s query might be unclear or incorrect.

3. The user might introduce too many typos.

2. Why improve RAG

Key problems in production:

  • Accuracy: irrelevant retrieval or hallucinations.

  • Speed: multi-step retrieval + generation can be slow.

  • Cost: many API calls / large contexts increase price.

  • Scalability: more users and larger corpora require different retrieval strategies.

Typical failure modes: poor queries, wrong indexing, noisy documents, and user typos or misunderstandings.

Why improve RAG: common problems

3. Query rewriting / translation: clean up queries before retrieval

Goal: convert noisy user input into clearer, retrieval‑friendly queries.

Techniques:

  • Typo correction and normalization.

  • Add domain keywords or expand acronyms.

  • Translate vague queries into structured subqueries.

  • Context injection: add user metadata (course, role, previous messages).

Pattern: User query → rewrite/expand → candidate subqueries → embed & retrieve.

Why it helps: better queries = better embeddings = better retrieval = fewer hallucinations.

Query Rewriting ensures cleaner and more complete queries, leading to better retrieval.

Query Rewriting
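
A rough sketch of the rewrite/expand step described above, assuming a placeholder llm.generate(...) call that returns one rewritten query per line (not a specific library):

```python
def rewrite_query(user_query: str, llm, user_metadata: dict | None = None) -> list[str]:
    """Turn a noisy user query into cleaner, retrieval-friendly subqueries."""
    metadata = user_metadata or {}                  # e.g. course, role, previous messages
    prompt = (
        "Rewrite the user query below for a document search engine.\n"
        "Fix typos, expand acronyms, add obvious domain keywords, and if the "
        "query mixes several questions, split it into separate subqueries.\n"
        f"User context: {metadata}\n"
        f"User query: {user_query}\n"
        "Return one rewritten query per line."
    )
    response = llm.generate(prompt)                 # placeholder LLM call (assumption)
    subqueries = [line.strip("-• ").strip() for line in response.splitlines()]
    return [q for q in subqueries if q]             # drop empty lines

# Each subquery is then embedded and retrieved independently.
```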

4. CRAG (Corrective RAG): fix poor retrieval automatically

CRAG (Corrective RAG)

Idea: when initial retrieval is weak, run corrective steps to improve context before final generation.

Typical corrective steps:

  • Re‑rewrite the query using retrieved chunk summaries.

  • Expand queries into multiple subqueries and re‑retrieve.

  • Use LLMs to validate whether chunks actually answer the query (LLM-as-evaluator).

  • Retry retrieval with alternate embeddings or different candidate sources.

Tradeoffs: more API/compute calls and latency, but dramatically higher accuracy for hard queries.

A retrieval-augmented pipeline that improves accuracy by rewriting queries, validating retrieved chunks, and refining context before final generation.

CRAG flow
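
One possible shape of the corrective loop, assuming hypothetical retrieve(...), llm_judge(...), rewrite(...) and llm.generate(...) helpers (all names are assumptions):

```python
def corrective_rag(query: str, retrieve, llm_judge, rewrite, llm, max_rounds: int = 2) -> str:
    """CRAG-style loop: retrieve, grade the chunks, and correct the query if retrieval is weak."""
    chunks = retrieve(query)

    for _ in range(max_rounds):
        # LLM-as-evaluator: llm_judge returns a 0..1 relevance score per chunk (assumption).
        graded = [(chunk, llm_judge(query, chunk["text"])) for chunk in chunks]
        relevant = [chunk for chunk, score in graded if score >= 0.5]

        if relevant:                          # good enough context -> stop correcting
            chunks = relevant
            break

        # Weak retrieval: rewrite the query using summaries of what came back, then retry.
        query = rewrite(query, [chunk["text"][:200] for chunk in chunks])
        chunks = retrieve(query)

    context = "\n\n".join(chunk["text"] for chunk in chunks)
    return llm.generate(f"Context:\n{context}\n\nQuestion: {query}")
```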

5. Ranking: order retrieved docs so best ones go to the LLM

Goal: reduce noise by selecting the most relevant chunks to include in the prompt.

Strategies:

  • Embedding similarity + signal fusion: combine cosine similarity with metadata (timestamp, source trust, recency).

  • LLM re‑ranker: ask an LLM to score candidate chunks (good for semantic nuance).

  • Learned ranker: train a small model on click/feedback signals.

  • Top‑N filtering per subquery: retrieve N per subquery, then merge and rank globally.

Implementation detail: always trim to a safe token budget; rankers let you pack the highest‑value context.

The ranking pipeline retrieves the top‑N chunks per subquery, ranks them for relevance, and combines the most useful context before final generation, reducing noise and hallucination.

Ranking pipeline
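
A self‑contained sketch of the heuristic signal‑fusion strategy above: cosine similarity blended with recency and source trust, then trimmed to a token budget (the weights and the rough 4‑characters‑per‑token estimate are illustrative assumptions):

```python
import math
import time

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_chunks(query_embedding: list[float], chunks: list[dict], token_budget: int = 2000) -> list[dict]:
    """Each chunk dict has 'text', 'embedding', 'timestamp' (epoch seconds) and 'source_trust' (0..1)."""
    now = time.time()
    scored = []
    for chunk in chunks:
        similarity = cosine(query_embedding, chunk["embedding"])
        age_days = (now - chunk["timestamp"]) / 86400
        recency = 1.0 / (1.0 + age_days / 30)          # decays over roughly a month (assumption)
        score = 0.7 * similarity + 0.2 * recency + 0.1 * chunk["source_trust"]  # illustrative weights
        scored.append((score, chunk))

    scored.sort(key=lambda pair: pair[0], reverse=True)

    # Greedily pack the highest-value chunks into the token budget
    # (rough estimate: ~4 characters per token).
    selected, used = [], 0
    for _, chunk in scored:
        tokens = len(chunk["text"]) // 4 + 1
        if used + tokens > token_budget:
            break
        selected.append(chunk)
        used += tokens
    return selected
```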

6. HyDE: generate a hypothetical answer first, then use it to find real docs

Concept: generate a synthetic (hypothetical) answer for the query, embed that answer, and use the embedding to retrieve matching real documents. This often surfaces more focused chunks than embedding the raw query.

Why it works: HyDE turns a short/ambiguous query into a rich semantic representation that better matches how human texts explain answers.

Best use case: teacher/academy content where you must return only class‑taught material - HyDE finds teacher‑specific chunks that align with the hypothetical answer.

Suppose you need to build a RAG system for a teacher or academy where students’ doubts must be answered strictly from what the teacher taught in class.

For example, if in a JavaScript class the teacher solved the ‘Two Sum’ program in a specific way, then the chatbot should return only that exact explanation and example, not any alternate or generalized answer. How is this possible?

HyDE (Hypothetical Document Embeddings)
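
A minimal HyDE sketch for that scenario, assuming placeholder llm.generate(...), embed(...) and vector_db.search_by_vector(...) helpers plus a teacher_id metadata filter (all names are assumptions):

```python
def hyde_retrieve(question: str, llm, embed, vector_db, teacher_id: str, top_k: int = 5) -> list[dict]:
    # 1. Generate a hypothetical answer; its factual accuracy matters less than its wording.
    hypothetical = llm.generate(
        f"Write a short, plausible answer to this question:\n{question}"
    )

    # 2. Embed the hypothetical answer instead of the raw question.
    vector = embed(hypothetical)

    # 3. Retrieve real chunks whose wording matches the hypothetical answer,
    #    restricted to this teacher's material via a metadata filter.
    return vector_db.search_by_vector(
        vector, top_k=top_k, filter={"teacher_id": teacher_id}
    )

# The retrieved chunks (e.g. the teacher's own 'Two Sum' walkthrough) are then
# passed to the LLM to compose the final answer, so it stays within class material.
```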


Advanced RAG concepts & production patterns

Below are advanced techniques you’ll use when moving RAG to production.

Scale & performance

  • Sharding vector DBs and partitioning by domain.

  • Approximate nearest neighbor (ANN) search (e.g., HNSW indexes via FAISS or Qdrant) for speed - see the sketch after this list.

  • Hot cache for popular queries/answers to reduce repeated LLM calls.

  • Batching embedding requests and reusing embeddings when possible.
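
For the ANN point, a rough sketch with FAISS's HNSW index (assuming faiss-cpu and numpy are installed; dimensions and parameters are illustrative):

```python
import numpy as np
import faiss  # pip install faiss-cpu (assumes FAISS is available)

dim = 768                                                    # embedding dimension (example)
embeddings = np.random.rand(10_000, dim).astype("float32")   # stand-in for real chunk embeddings

# HNSW index: approximate nearest-neighbour search, far faster than brute force at scale.
index = faiss.IndexHNSWFlat(dim, 32)    # 32 = graph neighbours per node (tunable)
index.add(embeddings)

query = np.random.rand(1, dim).astype("float32")             # stand-in for a query embedding
distances, ids = index.search(query, 5)                      # top-5 approximate neighbours
print(ids[0])
```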

Speed vs accuracy tradeoffs

  • Low‑latency mode: smaller LLM, smaller context, tighter ranker → faster but less nuanced answers.

  • High‑accuracy mode: CRAG + HyDE + LLM re‑rankers → slower and costlier but fewer hallucinations.

  • Offer configurable profiles per customer (fast vs accurate).
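
One way to expose such profiles is a simple per-customer config; a sketch (model names and values are placeholders):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RagProfile:
    model: str               # placeholder model name, not a real model id
    top_k: int               # number of chunks sent to the LLM
    use_hyde: bool
    use_crag: bool
    use_llm_reranker: bool

PROFILES = {
    # Low-latency mode: small model, small context, heuristic ranking only.
    "fast": RagProfile(model="small-llm", top_k=3,
                       use_hyde=False, use_crag=False, use_llm_reranker=False),
    # High-accuracy mode: bigger model plus HyDE, CRAG and LLM re-ranking.
    "accurate": RagProfile(model="large-llm", top_k=10,
                           use_hyde=True, use_crag=True, use_llm_reranker=True),
}

def get_profile(customer_setting: str) -> RagProfile:
    return PROFILES.get(customer_setting, PROFILES["fast"])
```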

Using LLM as evaluator

  • Use an LLM to judge whether a chunk answers the query (binary/score).

  • Combine LLM scores with embedding similarity to form a robust relevance score.
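
A sketch of combining the two signals above, assuming llm_judge returns a 0..1 score and similarity is a cosine value in the same range (the weights are illustrative):

```python
def relevance_score(query: str, chunk_text: str, similarity: float, llm_judge) -> float:
    """Blend embedding similarity with an LLM judgement into one robust relevance score."""
    prompt = (
        "On a scale from 0 to 1, how well does the passage answer the question?\n"
        f"Question: {query}\nPassage: {chunk_text}\n"
        "Reply with only the number."
    )
    try:
        llm_score = float(llm_judge(prompt))      # placeholder LLM call (assumption)
    except ValueError:
        llm_score = 0.0                           # judge replied with something unparsable
    llm_score = min(max(llm_score, 0.0), 1.0)

    return 0.6 * llm_score + 0.4 * similarity     # illustrative weighting
```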

Subquery rewriting & orchestration

  • Break complex queries into subqueries (who/what/when/how) and retrieve per subquery.

  • Merge top‑k from each subquery then rank globally.

  • This uncovers diverse evidence and reduces missing facets.
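
A sketch of the orchestration, assuming rewrite_to_subqueries(...), retrieve(...) and rank(...) helpers like the ones sketched earlier:

```python
def multi_query_retrieve(query: str, rewrite_to_subqueries, retrieve, rank, per_query_k: int = 5):
    """Break a complex query into subqueries, retrieve per subquery, then rank globally."""
    subqueries = rewrite_to_subqueries(query)       # e.g. who / what / when / how facets
    seen, merged = set(), []

    for subquery in subqueries:
        for chunk in retrieve(subquery, top_k=per_query_k):
            if chunk["id"] not in seen:             # de-duplicate across subqueries
                seen.add(chunk["id"])
                merged.append(chunk)

    # Global ranking over the merged pool keeps the best evidence across all facets.
    return rank(query, merged)
```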

Ranking strategies (summary)

  • Heuristics: cosine + recency + source weight.

  • LLM re‑ranker: semantic judgement of top candidates.

  • Learning to rank: use labeled examples/feedback for continuous improvement.

Caching

  • Cache embeddings for repeated queries and cache LLM responses for exact prompts.

  • Use a validity TTL and invalidate when relevant docs change.
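
A minimal TTL cache sketch for exact-prompt responses (in production you would more likely use Redis or similar; this in-memory version just shows the idea):

```python
import time

class TTLCache:
    """Tiny in-memory cache with a validity TTL; invalidate keys when source docs change."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store = {}                     # key -> (value, expiry_timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() > expires_at:         # stale -> drop it
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.time() + self.ttl)

    def invalidate_prefix(self, prefix: str):
        """Drop cached answers whose key starts with `prefix` (e.g. a document id)."""
        for key in [k for k in self._store if k.startswith(prefix)]:
            del self._store[key]
```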

Hybrid search

  • Combine keyword/boolean search (BM25) with semantic (vector) search to handle exact matches and semantic matches (see the fusion sketch after this list).

  • Useful for legal or code corpora where exact wording matters.
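
One common way to fuse the two result lists is reciprocal rank fusion (RRF); a minimal sketch over two ranked lists of chunk ids (the constant 60 is a commonly used default, taken here as an assumption):

```python
def reciprocal_rank_fusion(keyword_ids: list[str], vector_ids: list[str], k: int = 60) -> list[str]:
    """Fuse BM25 (keyword) and vector (semantic) rankings into one ordering."""
    scores: dict[str, float] = {}
    for ranked_list in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: a document matching an exact legal phrase (found only by BM25)
# still surfaces near the top of the fused list.
fused = reciprocal_rank_fusion(["doc3", "doc1"], ["doc1", "doc2"])
print(fused)   # doc1 first (appears in both lists), then doc3, then doc2
```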

Contextual embeddings

  • Build embeddings that include short metadata (document type, teacher id, class id) so retrieval respects context constraints.

  • Contextual embeddings make it easier to restrict results to a single teacher or class.
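
A small sketch of prefixing metadata into the text before embedding (embed(...) is a placeholder for your embedding call; the prefix format is an assumption):

```python
def contextual_embedding(chunk_text: str, metadata: dict, embed):
    """Prepend short metadata so the vector itself carries the context constraints."""
    prefix = (
        f"[doc_type={metadata.get('doc_type', 'unknown')}] "
        f"[teacher_id={metadata.get('teacher_id', 'n/a')}] "
        f"[class_id={metadata.get('class_id', 'n/a')}] "
    )
    return embed(prefix + chunk_text)    # placeholder embedding call (assumption)

# Apply the same prefix format at query time so chunks from the same teacher/class
# tend to sit closer to the query in vector space, and/or add a hard metadata filter.
```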

GraphRAG

  • Build a graph over entities/subjects to navigate related documents (useful for multi‑document reasoning).

  • Graph search can provide structured chains of evidence for complex queries.
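
A rough sketch of entity-graph expansion using networkx (assuming it is installed; the entity and document names are made up):

```python
import networkx as nx

# Undirected graph linking entities to the documents that mention them (toy data).
graph = nx.Graph()
graph.add_edge("two-sum", "lecture_12_arrays")
graph.add_edge("two-sum", "assignment_03")
graph.add_edge("hash-map", "lecture_12_arrays")
graph.add_edge("hash-map", "lecture_15_hashing")

def expand_evidence(entities: list[str], hops: int = 2) -> set[str]:
    """Collect documents reachable within `hops` edges of the query's entities."""
    evidence = set()
    for entity in entities:
        if entity in graph:
            reachable = nx.single_source_shortest_path_length(graph, entity, cutoff=hops)
            evidence.update(n for n in reachable if n.startswith(("lecture", "assignment")))
    return evidence

print(expand_evidence(["two-sum"]))   # {'lecture_12_arrays', 'assignment_03'}
```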

Production‑ready pipeline checklist

  • Input validation & query normalization.

  • Embedding caching and efficient ANN index.

  • Query rewrite + subquery orchestration.

  • Top‑N retrieval per subquery & learned/LLM ranking.

  • Corrective loop (CRAG) for low‑confidence responses.

  • Answer grounding: always return source ids & confidence.

  • Monitoring: precision/recall, latency, cost per request.

  • Feedback loop: collect user corrections and use for retraining ranker/rewrite modules.
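
As one concrete shape for the answer-grounding and monitoring items above, a possible response schema (all field names are assumptions, not a standard):

```python
from dataclasses import dataclass, field

@dataclass
class GroundedAnswer:
    """Shape of a production RAG response: the answer plus provenance and confidence."""
    answer: str
    source_ids: list[str] = field(default_factory=list)   # chunk/document ids used as context
    confidence: float = 0.0                                # e.g. blended ranker/judge score, 0..1
    latency_ms: int = 0                                    # feeds latency monitoring
    cost_usd: float = 0.0                                  # per-request cost tracking

# Example:
# GroundedAnswer(answer="...", source_ids=["lecture_12#chunk_4"],
#                confidence=0.82, latency_ms=940, cost_usd=0.003)
```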


Final takeaway

Production‑ready RAG is not a single model; it's a pipeline. You must combine rewriting, smart retrieval, ranking, corrective steps, and grounding techniques (like HyDE) to build systems that are accurate, scalable, and trustworthy. Tradeoffs between speed and accuracy are inevitable - design profiles and monitoring so your system meets real user needs.


Written by

MANOJ KUMAR

Haan Ji, I am Manoj Kumar, a product-focused Full Stack Developer passionate about crafting and deploying modern web apps, SaaS solutions, and Generative AI applications using Node.js, Next.js, databases, and cloud technologies, with 10+ real-world projects delivered, including AI-powered tools and business applications.