Advanced RAG Patterns and Pipelines


RAG looks simple on a slide. Index the data, retrieve a few chunks, let the model speak. Then reality shows up. Queries are messy. Sources are noisy. Latency bites. Costs creep. You ship a demo, but users catch the weak spots in five minutes. The goal here is to move from a toy loop to a system that stands up in production. I will walk through the full stack of ideas and the trade-offs that come with them. I will also play the skeptic where needed. If an idea sounds magical, assume it hides a cost.
Foundations that do not crumble
Before any fancy trick, make the boring parts solid.
Clean sources. Remove duplicated pages, boilerplate headers and footers, and junk like cookie banners. If the diet is bad, the body fails.
Chunking that respects structure. Split by heading or section markers. Keep references like title and path as metadata (see the sketch after this list). Blind fixed-size chunks are quick, but you lose meaning.
Deterministic preprocessing. Same text in should become the same text out. No random edits during indexing.
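To make the chunking point concrete, here is a minimal sketch of structure-aware splitting that carries title, section, and path along with each chunk. The `Chunk` record and the markdown-style heading rule are assumptions, not a library API; adapt them to whatever structure your sources have.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str        # the text that will be embedded
    title: str       # document title carried as metadata
    section: str     # heading the chunk belongs to
    path: str        # source path, useful for citations later

def split_by_headings(doc_text: str, title: str, path: str) -> list[Chunk]:
    """Split on markdown-style headings so each chunk stays inside one section."""
    chunks, section = [], "Introduction"
    for block in re.split(r"\n{2,}", doc_text):          # paragraph-sized blocks
        block = block.strip()
        if not block:
            continue
        heading = re.match(r"^#{1,6}\s+(.*)", block)
        if heading:
            section = heading.group(1)                   # remember the current section
            continue
        chunks.append(Chunk(text=block, title=title, section=section, path=path))
    return chunks
```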
Speed and accuracy trade-off
You will always juggle three things. Accuracy. Latency. Money. Pretending otherwise is how teams end up with a slow and costly system that is still wrong.
Levers you can tune
Chunk size. Smaller chunks raise recall but increase the number of candidates and compute. Larger chunks lower the number of calls but mix topics.
Top k at retrieval. More candidates help recall but slow reranking and generation.
Embedding model. Small models are cheap and fast but lose nuance. Larger models catch meaning but cost more.
Reranking depth. Cross encoders are great judges but heavy. Use them only on a short list.
Generation model. Strong models stabilize answers but add latency. Use them after you have high quality context.
Dev note: pause for a moment. If the product needs sub-second replies, many of the popular academic tricks will not fit. Be honest about your latency budget first, then choose the recipe.
Query translation and rewriting
Users speak in the language of their pain, not in the language your index expects. Fix the query before you hit the store.
What to do
Spell and grammar cleanup. A small model can correct typos and clarify intent without heavy cost.
Context injection. Add stable facts known about the session or tenant. For example the user works with Node and React, so expand a vague query about logging to that stack.
Language translation. Move user text into the language used in your corpus if they differ.
Sub query rewriting. Break a broad query into focused parts, retrieve for each, then merge. Example:
User asks how to log errors in Node. Create sub queries for console error, VS Code debug attach, and Sentry setup.
Pitfall to watch. Over rewriting can push the system toward answers the user never asked for. Keep the original query around and show it to the model along with the rewritten versions, as in the sketch below.
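A minimal sketch of the rewrite step that keeps the original query next to the rewritten ones. The `llm` callable, the prompt wording, and the returned shape are assumptions; swap in whatever client and format you actually use.

```python
def rewrite_query(query: str, session_facts: list[str], llm) -> dict:
    """Produce focused sub queries while preserving the original query.
    `llm` is any callable that takes a prompt string and returns text (assumed)."""
    prompt = (
        "Rewrite the user query into 2-4 focused search queries.\n"
        f"Known context: {'; '.join(session_facts)}\n"
        f"User query: {query}\n"
        "Return one query per line."
    )
    sub_queries = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    # Keep the original so over-rewriting cannot hide what the user actually asked.
    return {"original": query, "sub_queries": sub_queries}
```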
Using the model as evaluator
Treat the model as a judge, not just a writer. After retrieval, ask the model to score if each passage is on topic and if it contains concrete evidence. Keep the top few. If all scores are weak, trigger a retry with a new query rewrite or with HyDE which you will meet soon.
Two patterns
Pre answer check. Judge the passages before generation.
Post answer check. After generation, ask the model to point to lines in the retrieved text that back each claim. If it cannot, mark low confidence or try another pass.
This step costs latency. Use it when correctness matters more than speed.
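Here is a minimal sketch of the pre answer check. The scoring prompt, the 1 to 5 scale, and the `llm` callable are all assumptions; any judge model that returns a parseable score works the same way.

```python
def judge_passages(query: str, passages: list[str], llm, keep: int = 4,
                   min_score: int = 3) -> tuple[list[str], bool]:
    """Return the strongest passages and a flag that says whether a retry is needed."""
    scored = []
    for passage in passages:
        prompt = (
            "Score 1-5 how well this passage answers the query. "
            "Reply with a single digit.\n"
            f"Query: {query}\nPassage: {passage}"
        )
        reply = llm(prompt).strip()
        score = int(reply[0]) if reply[:1].isdigit() else 1   # be defensive about format
        scored.append((score, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    strong = [p for score, p in scored[:keep] if score >= min_score]
    needs_retry = len(strong) == 0        # everything weak: trigger rewrite or HyDE
    return strong, needs_retry
```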
Ranking strategies that actually work
Retrieval gives you candidate chunks. Ranking decides which few get to speak to the model.
Hybrid search. Use sparse search like BM25 for exact term match plus vector search for meaning. Simple and reliable. A safe default.
Cross encoder reranking. A small but sharp model sees the full text of the query and the candidate and gives a relevance score. Apply on the top 50 or top 100 only.
Heuristic boosts. Prefer fresh content for news, prefer official docs for API questions, down rank low quality sources. This is product insight, not math. Do not ignore it.
Learning to rank. If you have feedback data, train a light model that maps features to a score. Useful once you have traffic.
Dev note: Reranking is not a silver bullet. If your index returns garbage, a perfect reranker only picks the least bad garbage. Invest in indexing quality first.
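With that caveat, the cross encoder step itself is small. This sketch uses the sentence-transformers CrossEncoder class; the checkpoint name is just a common public example, not a recommendation, and the 20-passage cutoff is arbitrary.

```python
from sentence_transformers import CrossEncoder

# A small public cross encoder checkpoint; pick whatever fits your latency budget.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 20) -> list[str]:
    """Score query-passage pairs and keep the best few. Apply only to a short list."""
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in ranked[:top_k]]
```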
HyDE
HyDE stands for Hypothetical Document Embeddings. The model writes a short imaginary answer based on the query. You embed that answer and use it to retrieve. Why it helps. The imagined answer contains the right terms and structure, which pulls more precise passages from the store.
How to use
Generate a crisp synthetic answer. One or two paragraphs at most.
Embed it and run retrieval.
Use the retrieved passages to write the real answer.
Costs. Extra generation and embedding calls. Benefit. Better recall for vague or long tail queries.
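A minimal HyDE sketch under the same assumptions as before: `llm` and `embed` are placeholders for your generation and embedding clients, and `vector_store.search` stands in for whatever store you run.

```python
def hyde_retrieve(query: str, llm, embed, vector_store, top_k: int = 20) -> list[str]:
    """Retrieve with a hypothetical answer instead of the raw query."""
    # 1. Write a short imaginary answer. Keep it crisp; one or two paragraphs at most.
    hypothetical = llm(f"Write a short, factual answer to: {query}")
    # 2. Embed the imaginary answer and use it as the search vector.
    vector = embed(hypothetical)
    # 3. Retrieve real passages; these, not the imaginary answer, feed the generator.
    return vector_store.search(vector, top_k=top_k)
```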
Corrective RAG
Corrective RAG is a safety net. The system inspects its own context and answer, then chooses a fix.
Steps:
Detect low confidence. Signals can be empty citations, weak evaluator scores, or high disagreement between top passages.
Pick a fix. Options are query rewrite, HyDE, switch to a slower but stronger retriever, or ask a follow up question to the user.
Try again within a strict budget.
The point is not endless loops. The point is one measured second attempt that lifts quality on hard questions.
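A sketch of that single corrective pass with a hard budget of one extra attempt. Every name here (`retrieve`, `judge`, `rewrite`, `hyde`, `llm`) is a placeholder for the pieces described above, not a real API.

```python
def answer_with_correction(query: str, llm, retrieve, judge, rewrite, hyde) -> str:
    """One measured second attempt, never an endless loop."""
    passages, weak = judge(query, retrieve(query))
    if weak:
        # Second and final attempt: rewrite the query and widen retrieval with HyDE.
        second_try = retrieve(rewrite(query)) + hyde(query)
        passages, weak = judge(query, second_try)
    if weak:
        # Still no solid evidence: ask the user instead of guessing.
        return "I could not find solid evidence for this. Can you rephrase or narrow the question?"
    context = "\n\n".join(passages)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```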
Contextual embeddings
Plain embeddings miss structure. Help them.
Prefix each chunk with the document title, section, and parent path.
Add entity tags for people, products, versions, dates. You can extract these with a small named entity model or with a rules pipeline.
Put table headers near each row you embed. Tables without headers are cryptic.
All of this stays in the text field that you embed. Retrieval now has more hooks to catch meaning.
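A sketch of building the text that actually gets embedded, with title, section, and entity tags prefixed. The field names mirror the earlier chunking sketch; how you extract entities (small NER model or rules) is left out.

```python
def build_embed_text(chunk_text: str, title: str, section: str,
                     entities: list[str]) -> str:
    """Prefix structure and entities so retrieval has more hooks to catch meaning."""
    header = f"Document: {title} | Section: {section}"
    tags = f"Entities: {', '.join(entities)}" if entities else ""
    # Everything lives in the one text field that gets embedded.
    return "\n".join(part for part in (header, tags, chunk_text) if part)

# Example: a table row embedded with its header so it is not cryptic on its own.
row_text = build_embed_text(
    chunk_text="Plan: Pro | Price: 49 USD/month | Seats: 10",
    title="Pricing guide",
    section="Plans",
    entities=["Pro plan"],
)
```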
Hybrid search in practice
A practical recipe
Use BM25 to fetch a rough cut of 200 to 500 hits for exact keywords.
Use a vector search to fetch 200 to 500 semantic hits.
Union them, remove near duplicates by cosine similarity, keep top 200.
Apply a cross encoder to rank down to top 20.
Keep only the top 5 to 8 for the generator.
This keeps recall high without blowing up cost.
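The same recipe as a sketch. `bm25_search`, `vector_search`, and `embed` are placeholders for your own index and store, `rerank` is the cross encoder step shown earlier, and the 0.95 duplicate threshold is an assumption to tune.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_retrieve(query, bm25_search, vector_search, embed, rerank,
                    rough_cut=300, keep=200, final=8):
    # 1. Rough cuts from both retrievers, then union by passage text.
    candidates = {p: None for p in bm25_search(query, rough_cut)}
    candidates.update({p: None for p in vector_search(query, rough_cut)})
    # 2. Drop near duplicates by cosine similarity.
    #    In production, reuse the stored vectors instead of re-embedding here.
    unique, vectors = [], []
    for passage in candidates:
        vec = embed(passage)
        if all(cosine(vec, seen) < 0.95 for seen in vectors):
            unique.append(passage)
            vectors.append(vec)
        if len(unique) >= keep:
            break
    # 3. Cross encoder ranks the survivors; keep only a handful for the generator.
    return rerank(query, unique)[:final]
```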
GraphRAG
Sometimes relationships matter more than sentences. GraphRAG builds a small knowledge graph from your corpus. Nodes are entities like cases, laws, APIs, or people. Edges are relations like cites, depends on, or part of. Retrieval can then follow paths instead of guessing with text alone.
Use cases:
Law and policy. Jump from a section to all cases that cite it, then to outcomes.
Complex software docs. From an API call to related errors, to examples, to version notes.
Research. From a paper to prior work to datasets.
Do not fall for hype. Graph building takes effort and tuning. Start narrow. Build graphs for the highest value objects only, not for every noun in the corpus.
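A narrow graph example with networkx, limited to a few high value objects as suggested above. The node and edge names are invented for illustration; the point is that retrieval can expand along relations instead of guessing with text alone.

```python
import networkx as nx

graph = nx.DiGraph()
# Nodes are high value entities only: an API call, its errors, its examples.
graph.add_edge("api:createInvoice", "error:INVOICE_LIMIT", relation="raises")
graph.add_edge("api:createInvoice", "example:invoice-quickstart", relation="shown_in")
graph.add_edge("error:INVOICE_LIMIT", "doc:plan-limits", relation="explained_by")

def related_nodes(start: str, max_hops: int = 2) -> list[str]:
    """Follow paths from a retrieved entity to pull in connected context."""
    reachable = nx.single_source_shortest_path_length(graph, start, cutoff=max_hops)
    return [node for node in reachable if node != start]

# A query that retrieves the createInvoice chunk can also pull in its errors and examples.
print(related_nodes("api:createInvoice"))
```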
Caching that saves real money
Cache wherever results repeat.
Query cache. Map normalized queries to final answers for a period. Add user or tenant scope if needed.
Retrieval cache. Map normalized queries to a stable set of document ids. When sources update, bump the cache version.
Reranker cache. Store pairwise scores for query and passage pairs.
Prompt and output cache. If the same prompt context appears often, cache the final generation.
Add strict size limits and time to live. Cache misses are not failures. They are a sign your users ask new things.
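A sketch of a query cache with a time to live, a size cap, and a version that gets bumped when sources update. It is in-process and deliberately tiny; swap it for Redis or similar in production.

```python
import time
from collections import OrderedDict

class QueryCache:
    """Tiny in-process cache: TTL, size limit, and a version for invalidation."""
    def __init__(self, max_items: int = 10_000, ttl_seconds: int = 3600):
        self.store = OrderedDict()           # key -> (timestamp, answer)
        self.max_items, self.ttl, self.version = max_items, ttl_seconds, 0

    def _key(self, query: str) -> str:
        return f"v{self.version}:{' '.join(query.lower().split())}"  # normalize the query

    def get(self, query: str):
        entry = self.store.get(self._key(query))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None          # a miss is not a failure, just a new question

    def put(self, query: str, answer: str) -> None:
        self.store[self._key(query)] = (time.time(), answer)
        while len(self.store) > self.max_items:
            self.store.popitem(last=False)   # evict the oldest entry

    def bump_version(self) -> None:
        self.version += 1    # call this when the index or sources change
```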
LLM as router
Not every query deserves the same cost. Use a light model to pick a path.
Simple factual questions with strong keyword signals can use sparse search only.
Vague or multi step questions go to hybrid with reranking.
Very hard or high risk queries take the slow lane with evaluator and corrective loops.
Measure the win. If routing does not lower cost or raise accuracy, remove it.
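A routing sketch. The `classify` call is the assumption here: a cheap model, or even a heuristic on query length and keywords, returns one of three labels, and each label maps to a path described above.

```python
def route(query: str, classify) -> str:
    """Pick a retrieval path per query. `classify` is a light model call (assumed)."""
    label = classify(
        "Label this query as 'simple', 'vague', or 'high_risk'. "
        f"Reply with one word.\nQuery: {query}"
    ).strip().lower()
    if label == "simple":
        return "sparse_only"            # strong keyword signal, BM25 alone is enough
    if label == "high_risk":
        return "slow_lane"              # hybrid + reranker + evaluator + corrective pass
    return "hybrid_rerank"              # default for vague or multi step questions
```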
Production pipeline blueprint
Here is a blueprint that teams actually ship.
Ingest
Clean, segment, enrich with metadata, and version the corpus. Use a stream tool like Kafka for fresh updates. Store text blobs in an object store. Store metadata in a relational store.
Index
Build sparse index for BM25. Build dense vectors for a vector store. Keep ids consistent. Track index version.
Query preprocess
Spell fix, translate, add context, and create sub queries. Keep a record of what changed.
Retrieve
Run hybrid retrieval with top k tuned per tenant or per domain.
Rerank
Cross encoder on the merged set. Apply heuristic boosts. Drop near duplicates.
Evaluate
Model judge checks coverage and evidence. If weak, call corrective RAG once.
Generate
Compose a grounded prompt with explicit citations and clear instructions.
Post answer
Add quotes or line references. Create a short summary if the answer is long. Mark confidence.
Log and learn
Store all events with timings. Capture user feedback. Use this to tune chunk size, top k, and routing rules.
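The query-time half of the blueprint reduced to a sketch that wires the stages together and times each one. Every stage function is a placeholder for the pieces sketched earlier, the logging shape is an assumption, and real stages would pass richer state than a single value.

```python
import time
import uuid

def answer(query: str, stages: dict, log) -> dict:
    """Run the query-time stages of the blueprint and time every step."""
    request_id, timings, out = str(uuid.uuid4()), {}, query
    for name in ("preprocess", "retrieve", "rerank", "evaluate", "generate", "post_answer"):
        start = time.perf_counter()
        out = stages[name](out)                       # each stage feeds the next
        timings[name] = time.perf_counter() - start
    log({"request_id": request_id, "timings": timings})  # store events to tune later
    return {"request_id": request_id, "answer": out}
```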
Metrics that matter
Stop arguing by taste. Measure.
Retrieval hit rate at k. Does the gold passage appear in the top k.
Coverage. Fraction of answer sentences that have a citation.
Latency per stage. P50 and P95 for preprocess, retrieve, rerank, evaluate, generate.
Cost per answer. Sum of embedding, search, rerank, and generation.
User outcomes. Click through on citations, copy rate, thumbs up or down, follow up rate.
If a new trick does not improve these, throw it out.
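Two of these metrics as code, to show how little is needed to start measuring: hit rate at k over a labeled set, and citation coverage over answer sentences. The data shapes (lists of retrieved ids, a set of cited sentence indices) are assumptions.

```python
def hit_rate_at_k(results: list[list[str]], gold: list[str], k: int = 5) -> float:
    """Fraction of queries whose gold passage id appears in the top k results."""
    hits = sum(1 for retrieved, g in zip(results, gold) if g in retrieved[:k])
    return hits / len(gold)

def citation_coverage(answer_sentences: list[str], cited: set[int]) -> float:
    """Fraction of answer sentences that carry at least one citation."""
    return len(cited) / max(len(answer_sentences), 1)
```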
Failure patterns and how to fight them
Garbage in. Poor scans, duplicated content, and old versions. Fix during ingest.
Vocabulary mismatch. The user says bill while the docs say invoice. Add synonyms to sparse search and use HyDE to bridge.
Long answers with weak grounding. Force the model to quote lines. Penalize unsupported claims during evaluation.
Slow spikes. Heavy reranking at peak traffic. Add a guardrail that drops reranker depth when the system is hot.
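A guardrail for the slow-spike case: shrink reranker depth as load climbs. The queue-depth signal and the thresholds are assumptions; wire in whatever load metric your infrastructure exposes.

```python
def reranker_depth(queue_depth: int, normal: int = 100, hot: int = 25) -> int:
    """Drop reranking depth when the system is hot so latency stays bounded."""
    if queue_depth > 500:      # severe backlog: skip reranking entirely
        return 0
    if queue_depth > 100:      # hot: rerank a shorter list
        return hot
    return normal              # calm: full depth
```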
A short field guide for setup choices
If you are starting fresh
Use hybrid retrieval from day one.
Keep chunk size near a short paragraph and carry title and section as prefixes.
Add a small query rewrite step for typo fix and language normalization.
Cache retrieval results.
Log everything with request ids and index version.
Then grow
Add cross encoder reranking.
Add evaluator and a single corrective pass.
Add routing to split fast and careful paths.
Add GraphRAG for the most connected part of your domain.
Conclusion
RAG is not a trick. It is a system. Every gain has a price. HyDE boosts recall but adds calls. Evaluators raise quality but slow replies. GraphRAG gives structure but needs care to build. Your job is to pick the correct level of ambition for your product and budget, then design the pipeline that hits those numbers.
If someone promises perfect answers with a single vector store and a large model, push back. Ask about chunking. Ask about reranking. Ask about caching. Ask about metrics. Good RAG feels simple to the user because the hard choices were made earlier.