🧠 RAG 101: Making Your LLM Think Like Sherlock With Your Data

What is RAG (Retrieval-Augmented Generation)?
In simple terms:
[Data Source] --> LLM (with system prompt) --> Chat
That’s what Retrieval-Augmented Generation (RAG) enables us to do. It lets Large Language Models (LLMs) fetch external information from a data source — like a PDF, website, or database — and use that to generate better, context-rich responses.
Why Do We Need RAG?
LLMs like GPT or Gemini are powerful, but they have two key limitations:
1. Knowledge cutoff – They only know what they were trained on (up to a certain date).
2. Token limit – They can't process long documents like full PDFs all at once.
So what happens when we want to ask a question based on a 200-page research paper or a huge database?
That’s where RAG comes in. Instead of fine-tuning the entire model with new data (which is costly and static), RAG allows the LLM to retrieve only the most relevant parts of the data and generate a response based on that.
Fine-Tuning vs. RAG
| Fine-Tuning | RAG |
| --- | --- |
| Modify the LLM's weights with new data | Keep the model fixed; just add a retrieval step |
| Costly and slow | Faster and cheaper |
| Great for domain-specific behavior | Great for dynamic, up-to-date information |
| Static – needs retraining for updates | Dynamic – just update the data source |
How RAG Works
We can divide the process into two stages:
Step 1: Indexing (Data Preparation Phase)
This step happens before any user asks a question.
A → B → C → D → Vector DB
1. Raw Data Source – Documents, websites, PDFs, etc.
2. Information Extraction – Use OCR, parsers, or scrapers to extract text.
3. Chunking – Split the text into small, manageable sections (chunks).
4. Embedding – Convert each chunk into a vector (using models like OpenAI Embeddings or Sentence Transformers).
5. Store in Vector DB – Save all the embeddings in a database optimized for fast similarity search (e.g., Pinecone, FAISS, Weaviate).
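As a rough sketch of this indexing phase, the snippet below chunks already-extracted text, embeds each chunk with Sentence Transformers, and stores the vectors in a local FAISS index. The file name, chunk size, and model choice are illustrative assumptions, not a fixed recipe.

```python
# A minimal indexing sketch (illustrative, not the only way to do it).
# Assumes `sentence-transformers` and `faiss-cpu` are installed, and that the
# raw text has already been extracted (the file name here is hypothetical).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping, character-based chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

raw_text = open("research_paper.txt").read()     # hypothetical extracted text
chunks = chunk_text(raw_text)

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, free embedding model
embeddings = model.encode(chunks)                # one vector per chunk

index = faiss.IndexFlatL2(embeddings.shape[1])   # exact nearest-neighbour index
index.add(np.asarray(embeddings, dtype="float32"))
```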
Step 2: Retrieval (At Query Time)
This happens when the user interacts with the system.
1 → 2 → 3 → 4 → 5
1. User Query – The user asks a question.
2. Query Embedding – The question is converted to an embedding vector.
3. Vector Search – The system searches the vector database to find relevant chunks.
4. Relevant Data → LLM – The LLM receives only those chunks, along with the question.
5. LLM Generates Response – Based on the context, the model gives a grounded, accurate answer.
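Continuing the same sketch, the query-time side could look like this. It reuses the model, index, and chunks objects from the indexing snippet; the prompt template is just one reasonable format, and ask_llm is a placeholder for whichever LLM API you actually call (GPT, Gemini, Claude, etc.).

```python
# Query-time sketch, reusing `model`, `index`, and `chunks` from the indexing step.
def retrieve(query: str, k: int = 4) -> list[str]:
    """Embed the query and return the k most similar chunks."""
    query_vec = model.encode([query]).astype("float32")
    _, indices = index.search(query_vec, k)   # FAISS returns (distances, indices)
    return [chunks[i] for i in indices[0]]

question = "What are the main findings of the paper?"
context = "\n\n".join(retrieve(question))

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
# response = ask_llm(prompt)   # placeholder: send the grounded prompt to your LLM
```

In practice you would also handle empty results and deduplicate overlapping chunks, but the core retrieve-then-generate loop is exactly this simple.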
Optimal RAG – Why Chunking is Key
If we feed the entire document to the LLM, we will hit the token limit quickly. That's inefficient, and the request may even fail.
Instead, RAG works optimally by:
- Splitting the content into small, logical chunks (chunking).
- Indexing each chunk into a vector database.
- Retrieving only the top 3–5 relevant chunks at query time.
This makes RAG efficient, scalable, and accurate.
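To make that concrete, here is a back-of-the-envelope comparison using the common (and approximate) 4-characters-per-token heuristic, reusing raw_text and retrieve() from the sketches above:

```python
# Rough comparison, assuming ~4 characters per token (a heuristic, not an exact count).
full_doc_tokens = len(raw_text) // 4
top_k_tokens = sum(len(c) for c in retrieve("What are the main findings?")) // 4

print(f"Whole document: ~{full_doc_tokens} tokens")
print(f"Top-4 chunks:   ~{top_k_tokens} tokens")
```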
What is a “State-of-the-Art” Model?
A state-of-the-art (SOTA) model is the most advanced model in terms of performance on specific benchmarks or tasks. Examples:
- GPT-4o from OpenAI
- Gemini 1.5 Flash from Google
- Claude 3 Opus from Anthropic
- Mistral, Mixtral, and other open-source models
When used with RAG, even smaller or open-source LLMs can give powerful and accurate answers, because the information comes from your data.
Final Thoughts
RAG bridges the gap between static LLMs and the dynamic world of information. You don't need to fine-tune a model every time your documents update; just feed in the relevant data when it's needed.
Whether you're building a chatbot, a search engine, or a document assistant, RAG is a game-changer.