The Struggles of Building a Reliable AI Chatbot


It's tempting to think that building a smart, document-aware chatbot is as simple as plugging a Large Language Model (LLM) into your data. You've got a mountain of company documents and a powerful AI like Gemini or ChatGPT. What could go wrong?
This strategy, known as Retrieval-Augmented Generation (RAG), lets an AI find and use your specific, up-to-date information, sidestepping the twin problems of stale knowledge and "hallucinations." But as a recent paper from software engineering researchers at Deakin University shows, the journey from concept to a genuinely robust RAG system is fraught with subtle, often unexpected failure points.
Based on their research across three case studies, here are seven critical issues that can derail even the most well-intentioned RAG system.
1. The Empty Shelf Problem: Missing Content
The most fundamental failure is simple: the answer isn't in your documents. While a well-designed system might respond with a polite "I don't know," a poorly designed one might invent an answer that seems plausible. You can't expect the AI to retrieve what was never stored in the first place.
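To make the "polite I don't know" concrete, here is a minimal sketch of a confidence guard. The word-overlap scorer, the toy document list, and the 0.3 threshold are all illustrative stand-ins for a real embedding-based retriever, not any particular library's API:

```python
# Minimal sketch: refuse to answer when retrieval looks weak.
# The scorer and threshold are illustrative assumptions.

DOCS = [
    "Expense reports are due on the 5th of each month.",
    "Remote employees must connect through the corporate VPN.",
]
THRESHOLD = 0.3  # assumed cutoff; tune against real queries

def score(query: str, doc: str) -> float:
    """Crude word-overlap score standing in for vector similarity."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def answer(query: str) -> str:
    best = max(DOCS, key=lambda d: score(query, d))
    if score(query, best) < THRESHOLD:
        # The answer is probably not in the store; say so, don't guess.
        return "I don't know; that isn't covered in my documents."
    return f"(An LLM would answer here, grounded in: {best!r})"

print(answer("When are expense reports due?"))
print(answer("What is the parental leave policy?"))  # triggers the refusal
```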
2. The Misplaced Book Problem: Missed the Top Documents
Your document library holds the right answer, but the retrieval component (the part of the system that finds relevant information) doesn't rank it highly enough. Imagine a library where the perfect book for your query has been shelved in the wrong aisle. If the system grabs only the top 5 "most relevant" documents and your answer is ranked #6, it will never make it to the LLM.
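A few lines of Python make the cutoff painfully clear. The chunk IDs and similarity scores below are invented for illustration:

```python
# The top-k cutoff in miniature: the chunk with the answer ranks 6th,
# so a k=5 retriever never passes it to the LLM.

ranked = [                      # (chunk_id, similarity), best first
    ("chunk-12", 0.81), ("chunk-03", 0.79), ("chunk-44", 0.75),
    ("chunk-31", 0.74), ("chunk-09", 0.72),
    ("chunk-17", 0.71),         # <- the chunk that holds the answer
]

TOP_K = 5
retrieved = [cid for cid, _ in ranked[:TOP_K]]
print("chunk-17" in retrieved)  # False: the answer never reaches the LLM
```

Raising k, improving the embedding model, or adding a reranking stage are the usual remedies, each with its own cost in latency and noise.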
3. The Clutter Problem: Not in Context
The retrieval system successfully found the correct documents, but there were too many of them. LLMs have a strict context window, a limit on how much text they can process at once. If the consolidation strategy fails to prioritize the exact right information, the key piece of the answer gets left out of the prompt and is lost in the digital static.
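Here is a rough sketch of what that consolidation step often looks like. The four-characters-per-token estimate and the 3,000-token budget are assumptions; a real system would use the model's actual tokenizer and limit:

```python
# Sketch of a consolidation step that packs ranked chunks into a fixed
# token budget. The token estimate and budget are rough assumptions.

def rough_tokens(text: str) -> int:
    return len(text) // 4  # crude heuristic, good enough for a sketch

def pack_context(chunks: list[str], budget: int = 3000) -> list[str]:
    """Keep chunks in rank order until the budget is spent."""
    kept, used = [], 0
    for chunk in chunks:        # assumed sorted best-first by the ranker
        cost = rough_tokens(chunk)
        if used + cost > budget:
            break               # everything after this is silently dropped
        kept.append(chunk)
        used += cost
    return kept
```

If the ranker put the key chunk near the end of the list, this loop is exactly where it falls out of the prompt.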
4. The Misinterpretation Problem: Not Extracted
This is perhaps the most frustrating failure. The correct information is not only in your document collection but is also successfully retrieved and passed to the LLM. Yet, the model fails to extract the right answer. This often happens when the surrounding context is full of noise, contradictory information, or complex jargon that confuses the model.
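One common mitigation (a widely used pattern, not a template taken from the paper) is to ask the model to quote its evidence before answering, which makes extraction failures visible instead of silent:

```python
# A generic prompt pattern that asks the model to quote its evidence
# before answering. Illustrative wording, not from the paper.

PROMPT = """Answer using ONLY the context below.
First, under "Evidence:", copy the exact sentence(s) that support
your answer. Then give the answer under "Answer:".
If the context does not contain the answer, write "Not found."

Context:
{context}

Question: {question}"""

def build_prompt(context: str, question: str) -> str:
    return PROMPT.format(context=context, question=question)
```

If the model can't produce a quote that actually appears in the context, you've caught an extraction failure before shipping it to the user.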
5. The Formatting Faux Pas: Wrong Format
You ask the system for a list, a table, or a structured response, but it returns a single, jumbled paragraph. The LLM, despite receiving clear instructions in the prompt, fails to adhere to the requested format. This is less about factual accuracy and more about usability, making the answer less valuable to the user.
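Because prompt instructions alone are unreliable, many systems validate the format in code and retry. In this sketch, `call_llm` is a placeholder for whatever client you use, and the JSON shape and retry count are arbitrary illustrative choices:

```python
import json

# Sketch of a validate-and-retry loop for structured output.
# `call_llm` is a placeholder, not a real client API.

def structured_answer(call_llm, question: str, retries: int = 2) -> dict:
    prompt = (
        'Respond with ONLY a JSON object of the form {"points": ["..."]}. '
        "No prose outside the JSON.\n"
        f"Question: {question}"
    )
    for _ in range(retries + 1):
        raw = call_llm(prompt)
        try:
            return json.loads(raw)      # enforce the format in code,
        except json.JSONDecodeError:    # don't just trust the prompt
            continue
    raise ValueError("model never produced valid JSON")
```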
6. The "Just Tell Me" Problem: Incorrect Specificity
The answer is technically correct, but it's either too broad or too specific to be useful. If you ask, "What are the key points of this paper?" and the system returns a single sentence, it's too general. Conversely, if you ask a simple question and receive a highly technical, verbose explanation, it's too specific. The key is to match the response to the user's intent.
7. The Half-Truth Problem: Incomplete Answers
An answer is returned, but it's missing critical information that was available in the provided context. While not technically incorrect, an incomplete answer can be misleading or require the user to ask follow-up questions to get the full picture. A well-designed system should provide a comprehensive response on the first try.
The Two Core Takeaways
The overarching lesson of this research is that a RAG system is not a "set it and forget it" solution.
Validation is an ongoing process. Unlike traditional software, you can't just test a RAG system offline and assume it will perform well in production. It must be continuously validated during real-world operation, against real user queries (a minimal feedback-logging sketch follows these two takeaways).
Robustness is evolved, not designed. A truly reliable RAG system is not a perfect first build. It requires constant calibration, refinement, and monitoring—a process of learning and adapting to how users actually interact with it over time.
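As a starting point for that kind of operational validation, here is a minimal feedback logger: record every query, answer, and user rating so failures can be reviewed and the system recalibrated. The field names and the JSONL file are assumptions, not a prescription from the paper:

```python
import json
import time

# Minimal feedback logger for validation during live operation.
# Field names and file format are illustrative assumptions.

def log_interaction(question: str, answer: str, helpful: bool | None,
                    path: str = "rag_feedback.jsonl") -> None:
    record = {
        "ts": time.time(),
        "question": question,
        "answer": answer,
        "helpful": helpful,  # None until the user rates the answer
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```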
RAG offers incredible potential for building intelligent, data-driven applications. However, understanding and actively addressing these failure points is the only way to move from a basic proof-of-concept to a truly reliable and valuable solution.