Common Failure Cases in RAG Systems

Vaidik Jaiswal

Figure: A typical Retrieval-Augmented Generation (RAG) pipeline. The user’s query (1) is used to search for relevant documents (2) in the knowledge base. Those top documents are fetched as enhanced context (3–4) and fed with the original prompt to an LLM, which then generates a grounded response (5). Source: Sahin Ahmed et al. (image adapted).

Retrieval-Augmented Generation (RAG) combines a document retrieval step with a generative language model. In RAG, the user’s query is first used to fetch relevant text passages from a document store, and those passages are provided as context to an LLM when generating an answer. This grounding in real documents can improve answer accuracy and ensure that the LLM has up-to-date information. For example, a well-built RAG system will retrieve the latest reports or FAQs and then use them to answer a question, instead of relying solely on the model’s pre-training. In practice, RAG often reduces hallucinations by “grounding responses in specific retrieved documents” and by dynamically accessing current data.
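To make the retrieve-then-generate flow concrete, here is a minimal sketch of that loop in Python. The `retrieve` and `generate` callables are placeholders for whatever vector store and LLM client you actually use (assumptions for illustration, not a specific library’s API); the numbered comments mirror the steps in the figure above.

```python
from typing import Callable, List

def answer_query(
    query: str,
    retrieve: Callable[[str, int], List[str]],  # returns top-k passages for a query
    generate: Callable[[str], str],             # wraps your LLM client of choice
    top_k: int = 5,
) -> str:
    """Minimal RAG loop: retrieve supporting passages, then generate a grounded answer."""
    # (2)-(4) Fetch the most relevant passages from the knowledge base.
    passages = retrieve(query, top_k)
    context = "\n\n".join(passages)

    # (5) Ask the LLM to answer using only the retrieved context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return generate(prompt)
```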

However, RAG introduces new points of failure around retrieval and context handling. If the retriever misses key documents, chunks text poorly, drifts off the user’s intent, or operates on stale data, the system’s answers can be incomplete, off-topic, or confidently wrong. The sections below examine five common failure cases: poor document recall, bad chunking, query drift, outdated indexes, and hallucinations from weak context. For each, we explain how it manifests and offer simple best practices to mitigate it.

Poor Document Recall

If the retrieval step fails to find the relevant passages, the LLM simply has no data to draw on. In effect, “poor recall” means the retriever is not returning the documents that contain the answer. This often shows up as missing or overly vague answers. For example, Morgan Stanley’s internal RAG-based search initially retrieved only ~20% of the relevant documents for a query, so their AI assistant frequently omitted needed information. In that case, document recall was the “primary bottleneck”: once identified, the team focused on improving retrieval (e.g. tuning embeddings, chunk sizes, and search methods) and lifted recall to ~80%.

Manifestation: Key facts are missing from answers, or answers fall back on generic text because the right documents weren’t retrieved. The LLM may repeat “I’m not sure” or hallucinate an answer because it has no supporting source. Users may notice that asking the same question in different ways suddenly changes the answer entirely, indicating that some phrasings match indexed documents while others do not.

Mitigation and Takeaways:

  • Measure recall explicitly. Use a set of test queries with known “gold” documents to see how many relevant passages the retriever finds. Iteratively adjust your retriever (e.g. embedding model, similarity threshold, index parameters) to improve this metric (see the sketch after this list).

  • Use hybrid retrieval or expansion. Combining dense embeddings with a keyword search (BM25) can boost recall across different types of queries. Likewise, moderate query expansion (e.g. adding synonyms or related terms) can help surface docs that use different phrasing. For example, expanding “AI classification” to include “machine learning models” might catch additional documents.

  • Augment and clean your corpus. Ensure the knowledge base covers the expected domain. Remove irrelevant documents, update missing content, and preprocess text to remove noise. As one RAG guide warns, “Garbage in, garbage out”: a clean, comprehensive document set is essential.
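To measure recall explicitly (first bullet above), a simple evaluation harness is enough. The sketch below assumes you have a small hand-labeled gold set mapping each test query to the IDs of documents known to contain its answer, and a retriever that returns ranked document IDs; both names are hypothetical placeholders.

```python
from typing import Callable, Dict, List, Set

def recall_at_k(
    gold: Dict[str, Set[str]],                      # query -> IDs of documents known to answer it
    retrieve_ids: Callable[[str, int], List[str]],  # your retriever, returning ranked document IDs
    k: int = 10,
) -> float:
    """Average fraction of gold documents that appear in the top-k results."""
    scores = []
    for query, relevant in gold.items():
        retrieved = set(retrieve_ids(query, k))
        scores.append(len(retrieved & relevant) / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical usage:
# gold = {"What is our refund window?": {"policy-2024-returns"}}
# print(recall_at_k(gold, retrieve_ids=my_retriever, k=10))
```

Re-run this after each change to the embedding model, similarity threshold, or chunking strategy to see whether recall actually moved.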

Bad Chunking Strategies

How you split each document into chunks (passages) can make or break retrieval accuracy. If chunks are too large, they may mix multiple topics, so the retriever’s embedding becomes a coarse summary of unrelated content. If chunks are too small or cut arbitrarily, important context can be lost and sentences may be truncated. For instance, a fixed character-length split often cuts off sentences mid-way or mid-word, yielding nonsensical passages. Conversely, using a single very large chunk per document can dilute the signal: long embeddings tend to “average out” details.

Manifestation: Retrieval returns chunks that only partially answer the query or include irrelevant text. The LLM’s context window may be filled with overlapping or duplicated information. You might see the LLM referencing context that doesn’t answer the question, or forgetting key facts that were split across chunk boundaries.

Mitigation and Takeaways:

  • Use semantic or adaptive chunking. Break documents along natural boundaries (paragraphs, headings, sentences) instead of fixed sizes. For example, chunk on paragraph breaks with a little overlap so ideas aren’t cut off (a sketch of this approach follows the list).

  • Tune chunk size to your domain. A general guideline is roughly 250–500 tokens per chunk (enough for one or two paragraphs). Smaller chunks give more precise matches; larger chunks give more context. Experiment to find the sweet spot: scientific papers might work well with slightly larger sections, whereas short FAQs can be chunked by sentence.

  • Ensure meaningful overlap. If you overlap chunks, do so thoughtfully (e.g. the end of one chunk repeats the start of the next paragraph) to avoid splitting sentences or diluting the context.

  • Review and iterate. Spot-check chunks and retrieval results. If the retriever often pulls irrelevant pieces or splits answers awkwardly, adjust your chunking strategy. Remember that “optimized chunking involves carefully selecting size and boundaries” to keep each chunk focused and informative.
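As a starting point for semantic chunking (first bullet above), the sketch below splits on paragraph breaks, packs paragraphs until a rough size budget is reached, and carries one paragraph of overlap into the next chunk. Word count stands in for tokens here purely for simplicity; the 300-word default is an assumption you should tune.

```python
from typing import List

def chunk_by_paragraph(text: str, max_words: int = 300, overlap_paras: int = 1) -> List[str]:
    """Split on paragraph boundaries, packing paragraphs until a rough size budget is hit.

    Consecutive chunks share `overlap_paras` paragraphs so ideas that span a
    boundary are not cut off.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_words = 0

    for para in paragraphs:
        words = len(para.split())
        if current and current_words + words > max_words:
            chunks.append("\n\n".join(current))
            # Carry the last paragraph(s) forward as overlap for the next chunk.
            current = current[-overlap_paras:] if overlap_paras else []
            current_words = sum(len(p.split()) for p in current)
        current.append(para)
        current_words += words

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```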

Query Drift

“Query drift” happens when the effective search query the system uses drifts away from the user’s intent, usually during automatic query expansion or transformation. For instance, if you naïvely append related terms, the expanded query may grab off-topic documents. A classic example: expanding “python programming” to include “python species” (the snake) because of the word’s ambiguity. That drifted query would return reptile docs instead of code documentation. In RAG, query drift leads the retrieval step astray, so the LLM gets context unrelated to the original question.

Manifestation: The retrieved documents become tangential or irrelevant. The final answer appears off-topic or “jumpy.” You might see the assistant suddenly changing subject or focusing on a side aspect not mentioned by the user. Metrics like precision at top-K docs will drop as more unrelated hits appear.

Mitigation and Takeaways:

  • Constrain expansion. Only add terms that are strongly relevant. Use domain-specific thesauri or semantic embeddings to filter out irrelevant senses (see the sketch after this list). For example, if expanding “java,” ensure the context implies “programming” not “coffee” by including adjacent keywords or by asking the user to clarify.

  • Use context-aware embeddings. Embedding-based expansion can help preserve the query’s meaning. For example, using a language model to suggest synonyms within the original context can avoid literal mismatches.

  • Limit or skip expansion when in doubt. If expansion is causing noise, it may be better to rely on the raw query or use selective expansion (e.g. only querying indexes of a certain field).

  • Monitor for drift. In testing, examine the top retrieved documents: do they all align with the user’s intent? If not, adjust your expansion rules or retriever settings. Keeping the query “anchored” to key terms from the user’s question can reduce unintended drift.
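One way to constrain expansion (first bullet above) is to keep only candidate terms whose embedding stays close to the embedding of the full original query, so the query remains anchored to the user’s intent. The `embed` callable below is a placeholder for whatever embedding model you use; the 0.75 threshold is an illustrative assumption.

```python
from typing import Callable, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def expand_query(
    query: str,
    candidates: List[str],                    # synonyms / related terms from a thesaurus or LLM
    embed: Callable[[str], Sequence[float]],  # your embedding model
    min_similarity: float = 0.75,
) -> str:
    """Append only the candidate terms that stay semantically close to the original query."""
    query_vec = embed(query)
    kept = [term for term in candidates if cosine(embed(term), query_vec) >= min_similarity]
    # If nothing clears the bar, fall back to the raw query rather than drifting.
    return f"{query} {' '.join(kept)}" if kept else query
```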

Outdated Indexes

RAG only works if the document index is current. An “outdated index” means new or edited information never made it into the search corpus, so the system answers based on stale data. As one analysis puts it, stale indexes are a “silent killer”: RAG systems with old knowledge just serve up “yesterday’s facts,” which in domains like law or healthcare can be dangerous. For example, an e-commerce RAG bot might pull last year’s policy and confidently give wrong guidance because the index wasn’t refreshed.

Manifestation: Answers cite obsolete statistics or miss recently changed facts. Users notice the bot quoting outdated terms or failing to recognize new product names or laws. In practice, the model may not even mention obvious recent events or updates.

Mitigation and Takeaways:

  • Regularly refresh the index. Implement a pipeline or cron job to re-index your document store on a schedule (daily, weekly, etc.) depending on how fast content changes. Ingest new documents and updates so the RAG system has the latest information.

  • Use incremental or streaming updates. Instead of full re-indexing, push changes through an update stream or real-time sync to your vector store. This minimizes staleness in fast-changing domains (a sketch of this pattern, plus a simple staleness check, follows this list).

  • Tag and filter by timestamp. Store document dates and, if appropriate, prefer more recent documents in ranking. You can even prune very old content if it’s no longer relevant.

  • Monitor for staleness. Add checks that flag when, e.g., the top answer references content older than a threshold. Treat an “answer date” or lack of recent terms as a signal to review the index. As one RAG guide notes, “Without frequent updates, [RAG] serves responses based on yesterday’s facts.”
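A minimal sketch of incremental refresh plus a staleness check, assuming your documents carry an `updated_at` timestamp and your vector store exposes an insert-or-update operation (both hypothetical here):

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, List, NamedTuple

class Doc(NamedTuple):
    doc_id: str
    text: str
    updated_at: datetime

def refresh_index(
    changed_since: datetime,
    fetch_changed: Callable[[datetime], Iterable[Doc]],  # pull docs edited after a timestamp
    upsert: Callable[[Doc], None],                       # insert-or-update in your vector store
) -> int:
    """Re-index only the documents that changed, instead of rebuilding everything."""
    count = 0
    for doc in fetch_changed(changed_since):
        upsert(doc)
        count += 1
    return count

def flag_stale(retrieved: List[Doc], max_age_days: int = 180) -> List[Doc]:
    """Return retrieved documents old enough to warrant a freshness review."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    return [d for d in retrieved if d.updated_at < cutoff]
```

Run `refresh_index` from a scheduled job, and call `flag_stale` on the retrieved chunks at query time to surface answers that lean on old content.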

Hallucinations from Weak or Irrelevant Context

Even with RAG, the LLM can “hallucinate” (invent facts not found in any source) if it lacks solid grounding. Hallucinations are most likely when the retrieved context is weak, low-quality, or irrelevant. For example, if a medical chatbot fetches an unrelated health blog instead of a peer-reviewed source, the model might confidently generate wrong side effects. In RAG, hallucinations occur when the model over-relies on its language priors instead of the provided documents.

Manifestation: The answer contains assertions not supported by any of the retrieved text. You may see contradictions or made-up statistics. The model sounds fluent and confident, even though no source says those details. Often, hallucinations appear as plausible-sounding but unverified explanations.

Mitigation and Takeaways:

  • Curate high-quality sources. Assemble a reliable, domain-specific knowledge base. Remove irrelevant or dubious documents. As one RAG best practice notes, “the quality of your retrieval corpus directly impacts accuracy,” and hallucinations spike when the model retrieves “irrelevant or low-quality information.”

  • Filter and re-rank results. After retrieval, apply heuristics or metadata filters (date, domain, keyword presence) to weed out off-topic passages. Using a second-stage re-ranker (or simpler rules) can ensure only the most relevant chunks reach the LLM.

  • Constrain the LLM. Craft prompts that explicitly instruct the model to rely on the given context (e.g. “Answer only using the information above”). Lowering the LLM’s temperature or using deterministic decoding can also reduce creative extrapolation (see the sketch after this list).

  • Use fact-checking and verification. For critical domains, run the generated answer through a fact-check module or ask the model to cite sources. Some workflows do a second “consistency check” by comparing the answer to the retrieved docs. Designing the system to say “I don’t know” when evidence is lacking (rather than guessing) also helps.
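For constraining the LLM (third bullet above), the sketch below shows one possible grounded prompt with an explicit “I don’t know” escape hatch. `generate` again stands in for your LLM client, ideally called with temperature 0; the exact wording is an assumption to adapt to your model.

```python
from typing import Callable, List

GROUNDED_PROMPT = """Answer the question using ONLY the context below.
If the context does not contain the answer, reply exactly: "I don't know."
Cite the passage number(s) you used, e.g. [2].

Context:
{context}

Question: {question}
Answer:"""

def grounded_answer(
    question: str,
    passages: List[str],
    generate: Callable[[str], str],  # your LLM call, ideally with temperature=0
) -> str:
    """Constrain the model to the retrieved passages and allow an explicit refusal."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return generate(GROUNDED_PROMPT.format(context=context, question=question))
```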

Conclusion / Key Takeaways

Building a robust RAG system requires attention not just to the LLM, but especially to the retrieval pipeline and data maintenance. In summary:

  • Test retrieval separately: Evaluate recall and precision of your search independently of generation.

  • Chunk thoughtfully: Choose chunk sizes and boundaries that preserve meaning.

  • Guard query semantics: Avoid drifting the user’s intent when expanding or reformulating queries.

  • Keep data fresh: Automate index updates so the model always has the latest facts.

  • Ensure context quality: Only feed high-quality, relevant documents to the LLM to prevent confident hallucinations.

By monitoring these areas and applying the simple best practices above, developers can avoid many of the common failure modes in RAG systems and deliver more reliable, accurate responses.
