Introduction to RAG

Rachit Goyal
10 min read

Why RAG?

Disclaimer: A lot of technical terms are used in this blog; you can read about them here.

Think of a basic chat application. Under the hood, it processes your input and generates a response based on the data it was trained on, right? Now, this training data has a knowledge cut-off date. For example, imagine a chatbot whose Large Language Model (LLM) was last trained in 2024. If you ask it today, "Did RCB win the 2025 IPL?", it will likely say no. The model isn't wrong, exactly; it's giving the best answer it can based on the outdated information it was trained on.

You can think of the LLM as an over-enthusiastic new employee who refuses to keep up with current events but will always answer every question with absolute confidence. Unfortunately, that attitude erodes user trust, and it's not something you want your chatbot to emulate.

To get the correct, updated answer, we could retrain or fine-tune the model with new data. However, retraining is expensive and time-consuming. Even giants like OpenAI only train their models periodically, and for most individuals and companies, frequent retraining simply isn't a feasible option.

So, how can we ensure our LLM answers the RCB question correctly without a full-scale update?
What if we could simply provide the LLM with the right context for the specific question being asked?

This is where Retrieval-Augmented Generation (RAG) comes in.

In simple terms, RAG gives your LLM access to fresh, relevant information without needing to be retrained. It works by connecting the model to an external knowledge source. When you ask a question, the system first retrieves relevant information from that source and then augments the model's prompt with this new context. Finally, the LLM generates an answer based on both its original training and the new, specific information it was just given. This approach gives us greater control over the text output and can even offer users insights into how the model arrived at its answer.


What is RAG?

Retrieval-Augmented Generation (RAG) is a technique for optimizing the output of Large Language Models (LLMs). It works by allowing the LLM to query an external, authoritative knowledge base before generating a response.

The name itself describes the process: the system retrieves data and augments the user's prompt with it before the LLM generates a response. This process is also highly efficient. For instance, imagine your knowledge base contains 10,000 documents. Instead of sending all that information to the LLM (which is impractical), the retrieval step intelligently pinpoints only the handful of document chunks most relevant to your query. These targeted snippets are then sent to the LLM, providing the precise context it needs without the noise of irrelevant data.

While LLMs are already powerful tools trained on billions of parameters, RAG extends their capabilities to specific domains or an organization’s private internal data. This is achieved without the need for constant and costly model retraining, making it a highly effective approach for improving LLM accuracy and relevance.

The applications are too many to count: RAG enables a model to handle everything from answering a simple, real-time query to synthesizing information from thousands of pages of dense documents into a single, context-aware answer.

So, what actually happens when you ask a RAG system a question? The process can be broken down into a few simple steps:

  • The Query: The user asks a question, like, "What were our company's key achievements in the last quarter?"

  • The Retrieval: Instead of going straight to the LLM, the system first searches an external knowledge base (e.g., your company's internal documents). It finds the most relevant snippets of text related to your query.

  • The Augmentation: The system takes the original question and adds the relevant snippets it just found as extra context. It packages this all into a new, more detailed prompt for the LLM.

  • The Generation: The LLM receives this augmented prompt and generates an answer. Because it has the precise, up-to-date context, its answer is accurate and based on the provided documents.
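To make these four steps concrete, here is a toy, self-contained Python sketch. The keyword-overlap retriever and the fake generate() function are stand-ins I've made up for a real search backend and a real LLM call; the point is only the order of operations, not any specific library.

# Toy sketch of the four RAG steps. retrieve(), build_prompt() and generate()
# are illustrative stand-ins, not any particular library's API.

def retrieve(question, knowledge_base, top_k=2):
    # "Retrieval": score each snippet by word overlap with the question
    words = set(question.lower().split())
    ranked = sorted(knowledge_base, key=lambda s: -len(words & set(s.lower().split())))
    return ranked[:top_k]

def build_prompt(context_snippets, question):
    # "Augmentation": pack the retrieved snippets and the question into one prompt
    return f"Context: {' '.join(context_snippets)}\n\nAnswer using the context above.\nQuestion: {question}"

def generate(prompt):
    # "Generation": stand-in for a real LLM call (e.g. the Gemini call later in this post)
    return f"[LLM answers here, using the prompt]\n{prompt}"

knowledge_base = [
    "RCB won the IPL in 2025.",
    "The cafeteria is open from 9 AM to 6 PM.",
]
question = "Did RCB win the 2025 IPL?"
print(generate(build_prompt(retrieve(question, knowledge_base, top_k=1), question)))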


How Does RAG Work?

Large Language Models (LLMs) are incredibly powerful, but they have two fundamental limitations: their knowledge is frozen at the time of their last training, and they have no access to your private, real-time data. This means they can't answer questions about recent events or your company's internal documents. So, how do we bridge this gap?

This is where Retrieval-Augmented Generation (RAG) comes in. RAG is a flexible framework that enhances LLMs by connecting them to live, authoritative knowledge sources, allowing them to provide accurate, context-aware, and trustworthy answers.

The Big Picture: An Expert Librarian for Your LLM

Let me explain RAG through the analogy of an expert librarian.

Think of the process like this:

  • Vector Embeddings & Database: This is the library's card catalog, but instead of alphabetical order, it's organized by topic and meaning (semantic similarity).

  • Your Question: You go to the expert librarian (the RAG system) with a question.

  • Retrieval Step: The librarian doesn't read the card catalog numbers to you. They use your question to find the right cards (vector search) and then retrieve the actual books or articles those cards point to (retrieving the text chunks).

  • LLM Step: The librarian then hands you the open books (the context) and says, "Based on these sources, here is the answer to your question."

This analogy highlights a critical technical point that often causes confusion: the LLM is designed to understand and process natural language (text), not the high-dimensional numerical arrays that make up vector embeddings. Sending it a list of floating-point numbers would be meaningless. Instead, the vector embeddings are used as a tool in the retrieval step to find the most relevant pieces of text, and the LLM then receives that retrieved text as context.

How RAG Works: The Technical Deep Dive

Here is a breakdown of the standard RAG process, clarifying the distinct roles of embeddings and the LLM. We can imagine two phases: an offline "Indexing" phase and an online "Query" phase.

Phase 1: Indexing (Done once, beforehand)

  1. Load & Chunk Documents: You start with your knowledge base (e.g., PDFs, text files, database entries). You break these documents down into smaller, manageable chunks.

  2. Create Embeddings: Each text chunk is passed through an embedding model (like OpenAI's text-embedding-3-small or an open-source model like e5-large-v2). This model converts the text chunk into a vector embedding (a list of numbers).

  3. Store in Vector Database: You store these vector embeddings in a specialized vector database (like Pinecone, Chroma, or FAISS). Crucially, you also store a reference to the original text chunk itself alongside its corresponding vector.

The result of this phase is a searchable library where each numerical vector acts as a semantic address for a piece of real text.

Original Text → Chunking → Embedding → Store in Vector DB
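In code, the indexing phase might look something like the rough sketch below. The embed() function here is a stub rather than a real embedding model, and a plain Python list stands in for a proper vector database like Pinecone, Chroma, or FAISS; the important part is that each stored entry keeps the original text right next to its vector.

# Rough indexing sketch: chunk the documents, embed each chunk, store vector + text.
# embed() is a placeholder; in practice you would call a real embedding model here
# (e.g. text-embedding-3-small or e5-large-v2).

def chunk(text, size=500):
    # Naive fixed-size chunking; real pipelines often split on sentences or sections
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text):
    # Placeholder vector so the sketch runs; NOT a meaningful embedding
    return [float(ord(c)) for c in text[:16]]

documents = ["First document text goes here...", "Second document text goes here..."]

vector_store = []                     # stands in for a vector database
for doc in documents:
    for piece in chunk(doc):
        vector_store.append({"vector": embed(piece), "text": piece})   # keep the original text!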

Phase 2: Querying (Happens in real-time)

This is where the user interacts with the system.

  1. User Query: A user asks a question in natural language, e.g., "What were the company's Q2 profits?"

  2. Embed the Query: The user's query is sent to the same embedding model used in Phase 1. This converts the question into a query vector.

  3. Vector Search: This query vector is used to search the vector database. The database performs a similarity search (e.g., cosine similarity) to find the vectors from your documents that are most mathematically similar to the query vector. It returns the top k results (e.g., the top 3 or 5 most similar vectors).

  4. Retrieve the Original Text: This is the key step. The system uses the results of the vector search to retrieve the original text chunks associated with those top vectors.

  5. Augment the Prompt: A new, detailed prompt is constructed for the LLM. This prompt includes the retrieved text chunks (the context) and the user's original question. It looks something like this:

    Context: [Text chunk 1 from your documents...] [Text chunk 2 from your documents...] [Text chunk 3 from your documents...]

    Based on the context provided above, please answer the following question: Question: What were the company's Q2 profits?

  6. Send to LLM: This entire block of text (context + question) is sent to the LLM (e.g., Gemini, GPT-4, Llama 3).

  7. Generate Answer: The LLM reads the context and the question and generates a final, synthesized answer based only on the information provided.
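Continuing the indexing sketch above (same stubbed embed() and the same vector_store list), the query phase might look roughly like this. Cosine similarity is computed by hand to keep the sketch dependency-free, and the final augmented prompt is simply a string you would then pass to whichever LLM you use, for example with the Gemini call shown later in this post.

import math

def cosine_similarity(a, b):
    # Similarity between two vectors: dot product divided by the product of their norms
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norms if norms else 0.0

def retrieve_top_k(query, vector_store, k=3):
    query_vector = embed(query)                       # same embedding model as in indexing
    ranked = sorted(vector_store,
                    key=lambda entry: cosine_similarity(query_vector, entry["vector"]),
                    reverse=True)
    return [entry["text"] for entry in ranked[:k]]    # hand back the original TEXT, not vectors

query = "What were the company's Q2 profits?"
context_chunks = retrieve_top_k(query, vector_store)
context_block = "\n".join(context_chunks)

prompt = f"""Context: {context_block}

Based on the context provided above, please answer the following question:
Question: {query}
"""
# `prompt` is now ready to be sent to the LLM (e.g. Gemini, GPT-4, Llama 3).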

At a Glance: The Key Components

Component | Its Job in RAG | What it Processes | What it Outputs
Embedding Model | Creates numerical representations of text for semantic comparison. | Text | Vector embeddings
Vector Database | Stores and rapidly searches for similar vectors. | Vector embeddings | IDs of similar vectors
Large Language Model (LLM) | Reasons, synthesizes, and generates answers based on evidence. | Text (context + query) | Text (final answer)

The RAG Framework in Action: From Simple to Complex

Now that we understand the standard process, we can see how flexible the RAG framework truly is.

Basic Example: RAG without Vector Embeddings

While we often associate RAG with vector searches, the "retrieval" step can be any form of logic. Let's walk through how the code shown below implements this for a simple restaurant chatbot.

  1. The "Retrieval" Step: Instead of searching a database, the code's retrieval logic is a simple time check. It gets the current time and compares it against the restaurant's business hours (11 AM to 10 PM). Based on this check, it selects one of two pre-written context strings. For instance, if the current time is, say, 6:55 PM IST, it retrieves the "open" context:

    "Context: The business is currently open. Business hours are 11 AM to 10 PM."

  2. The "Augmentation & Generation" Steps: Next, the code augments the user's query ("Are you open right now and what are your hours?") by combining it with the retrieved context. This full, context-rich prompt is then sent to the Gemini model to generate the final, conversational answer.

This example perfectly demonstrates the RAG pattern's power to retrieve real-time, relevant context from any source—even simple business logic—to create an accurate response.

You can view the code for this implementation below:

import google.generativeai as genai
from datetime import datetime, time
from IPython.display import display, Markdown

API_KEY = "insert your GEMINI API Key"
genai.configure(api_key=API_KEY)

# --- Retrieval step: simple business logic instead of a vector search ---
current_time = datetime.now().time()
open_time = time(11, 0)    # 11:00 AM
close_time = time(22, 0)   # 10:00 PM

if open_time <= current_time < close_time:
    context = "Context: The business is currently open. Business hours are 11 AM to 10 PM."
else:
    context = "Context: The business is currently closed. Business hours are 11 AM to 10 PM."

query = "Are you open right now and what are your hours?"

# --- Augmentation step: combine the retrieved context with the user's question ---
prompt = f"""{context}

Answer the user's question based on the context provided.
Question: {query}
"""

# --- Generation step: send the augmented prompt to the LLM ---
model = genai.GenerativeModel('gemini-1.5-flash')
response = model.generate_content(prompt)
display(Markdown(response.text))

Imagine a lawyer asking an advanced AI agent to analyze a key witness's video testimony, cross-reference it with specific clauses in a PDF contract, and find similar historical cases. In a single, complex process, the RAG system would transcribe the video to text, semantically search that transcript for key arguments, retrieve the exact clauses from the contract PDF, and simultaneously pull relevant summaries from a legal precedents database. It would then synthesize all this disparate information—video timestamps, contract clauses, and case law summaries—into one comprehensive brief. The final output wouldn't just be a summary, but a fully-cited legal analysis, with every statement linked back to its original source document, page, or timestamp, demonstrating the most sophisticated application of RAG.
The code for this advanced example has not been provided because of some skill issues which are being resolved as you read this blog.


Conclusion: The Power of Context

So, to summarize, vector embeddings are critical for finding the right information, but it is the original text that is ultimately sent to the LLM for reasoning and generation. By mastering this flow, RAG gives us the power to ground the incredible reasoning capabilities of LLMs in factual, timely, and relevant information.

P.S. Our RAG Journey Continues!

I'm excited to let you know that this is just the first post in a new series I'm writing all about RAG. In the next articles, we'll get into more advanced topics. If you enjoyed this introduction, I hope you'll stick around for what's next. Stay tuned!


Written by Rachit Goyal
i code sometimes