🧠 RAG 101: Making Your LLM Think Like Sherlock With Your Data

What is RAG (Retrieval-Augmented Generation)?
In simple terms:
[Data Source] --> LLM (with system prompt) --> Chat
That’s what Retrieval-Augmented Generation (RAG) enables us to do. It lets Large Language Models (LLMs) fetch external information from a data source — like a PDF, website, or database — and use that to generate better, context-rich responses.
Why Do We Need RAG?
LLMs like GPT or Gemini are powerful, but they have two key limitations:
1. Knowledge cutoff – They only know what they were trained on (up to a certain date).
2. Token limit – They can't process long documents like full PDFs all at once.
So what happens when we want to ask a question based on a 200-page research paper or a huge database?
That’s where RAG comes in. Instead of fine-tuning the entire model with new data (which is costly and static), RAG allows the LLM to retrieve only the most relevant parts of the data and generate a response based on that.
Fine-Tuning vs. RAG
| Fine-Tuning | RAG |
| --- | --- |
| Modify the LLM's weights with new data | Keep the model fixed; just add a retrieval step |
| Costly and slow | Faster and cheaper |
| Great for domain-specific behavior | Great for dynamic, up-to-date information |
| Static – needs retraining for updates | Dynamic – just update the data source |
How RAG Works
We can divide the process into two stages:
Step 1: Indexing (Data Preparation Phase)
This step happens before any user asks a question.
A → B → C → D → Vector DB
1. Raw Data Source – Documents, websites, PDFs, etc.
2. Information Extraction – Use OCR, parsers, or scrapers to extract text.
3. Chunking – Split the text into small, manageable sections (chunks).
4. Embedding – Convert each chunk into a vector (using models like OpenAI Embeddings or Sentence Transformers).
5. Store in Vector DB – Save all the embeddings in a database optimized for fast similarity search (e.g., Pinecone, FAISS, Weaviate).
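As a rough sketch of this indexing phase, the snippet below chunks already-extracted text, embeds each chunk with Sentence Transformers, and stores the vectors in a local FAISS index. The file name, chunk size, and model choice are illustrative assumptions, not a fixed recipe.

```python
# A minimal indexing sketch (illustrative, not the only way to do it).
# Assumes `sentence-transformers` and `faiss-cpu` are installed, and that the
# raw text has already been extracted (the file name here is hypothetical).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping, character-based chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

raw_text = open("research_paper.txt").read()     # hypothetical extracted text
chunks = chunk_text(raw_text)

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, free embedding model
embeddings = model.encode(chunks)                # one vector per chunk

index = faiss.IndexFlatL2(embeddings.shape[1])   # exact nearest-neighbour index
index.add(np.asarray(embeddings, dtype="float32"))
```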
Step 2: Retrieval (At Query Time)
This happens when the user interacts with the system.
1 → 2 → 3 → 4 → 5
1. User Query – The user asks a question.
2. Query Embedding – The question is converted to an embedding vector.
3. Vector Search – The system searches the vector database to find relevant chunks.
4. Relevant Data → LLM – The LLM receives only those chunks, along with the question.
5. LLM Generates Response – Based on the context, the model gives a grounded, accurate answer.
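Continuing the same sketch, the query-time side could look like this. It reuses the model, index, and chunks objects from the indexing snippet; the prompt template is just one reasonable format, and ask_llm is a placeholder for whichever LLM API you actually call (GPT, Gemini, Claude, etc.).

```python
# Query-time sketch, reusing `model`, `index`, and `chunks` from the indexing step.
def retrieve(query: str, k: int = 4) -> list[str]:
    """Embed the query and return the k most similar chunks."""
    query_vec = model.encode([query]).astype("float32")
    _, indices = index.search(query_vec, k)   # FAISS returns (distances, indices)
    return [chunks[i] for i in indices[0]]

question = "What are the main findings of the paper?"
context = "\n\n".join(retrieve(question))

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
# response = ask_llm(prompt)   # placeholder: send the grounded prompt to your LLM
```

In practice you would also handle empty results and deduplicate overlapping chunks, but the core retrieve-then-generate loop is exactly this simple.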
Optimal RAG – Why Chunking is Key
If we feed the entire document to the LLM, we will hit the token limit quickly. That's inefficient, and the request may even fail.
Instead, RAG works optimally by:
- Splitting the content into small, logical chunks (chunking).
- Indexing each chunk into a vector database.
- Retrieving only the top 3–5 relevant chunks at query time.
This makes RAG efficient, scalable, and accurate.
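To make that concrete, here is a back-of-the-envelope comparison using the common (and approximate) 4-characters-per-token heuristic, reusing raw_text and retrieve() from the sketches above:

```python
# Rough comparison, assuming ~4 characters per token (a heuristic, not an exact count).
full_doc_tokens = len(raw_text) // 4
top_k_tokens = sum(len(c) for c in retrieve("What are the main findings?")) // 4

print(f"Whole document: ~{full_doc_tokens} tokens")
print(f"Top-4 chunks:   ~{top_k_tokens} tokens")
```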
What is a “State-of-the-Art” Model?
A state-of-the-art (SOTA) model is the most advanced model in terms of performance on specific benchmarks or tasks. Examples:
- GPT-4o from OpenAI
- Gemini 1.5 Flash from Google
- Claude 3 Opus from Anthropic
- Mistral, Mixtral, and other open-source models
When used with RAG, even smaller or open-source LLMs can give powerful and accurate answers, because the information comes from your data.
Final Thoughts
RAG bridges the gap between static LLMs and the dynamic world of information. You don't need to fine-tune a model every time your documents update; just feed in the relevant data when it's needed.
Whether you're building a chatbot, a search engine, or a document assistant, RAG is a game-changer.