Another Piece of Jargon to Expose: RAG (Retrieval-Augmented Generation)

Karan Shaw

From the title, you already know the full form of RAG: Retrieval-Augmented Generation. Sounds fancy, right?

It’s one of those buzzwords that can make beginners think it’s some ultra-advanced, PhD-level technology that only big companies can build.

But here’s the truth:
No, it’s not.

At the time of writing this article, there’s one thing I want you to remember:

In the AI world, 90% is marketing, 9% is coding, and 1% is real-world adoption.

Once you see through the jargon, you’ll realize that RAG is not some magic—it’s just a clever combination of search + generate.

By the end of this article, you’ll not only understand what RAG is,
but you’ll also be confident enough to build your own RAG pipeline.

Let’s break it down.

Business Point of View: Why RAG Actually Matters

We know that LLMs are pre-trained and can handle general tasks quite well.
But the real question is:

Can an off-the-shelf LLM actually help a business solve its specific problems?
On its own, no: it has never seen the company’s private data.

Let’s take an example.

Suppose a business wants to build a chatbot where employees can ask questions and retrieve data from a large internal data source—like PDFs, databases, images, audio files, and more.

First Approach: Fine-Tune the Model on Internal Data

The first idea that usually comes to mind is to train the model using the company’s data. Then, whenever we query the model, it responds using that internal knowledge.

This might seem like it would work—but it's not optimal for a business environment.

Why?

  • Business data changes every second.

  • Fine-tuning a model typically takes hours to days, depending on the model and the amount of data.

  • Re-training the model every time the data changes is costly and time-consuming.

  • And most importantly, the model will always be outdated, because it's trained on a snapshot of data—not live information.

So businesses not only bear the cost of fine-tuning, but they also lose the ability to access real-time data.

Second Approach: Use a SOTA Model with Prompt Engineering

An alternative is to take a SOTA (State of the Art) model—like GPT-4.1 or Claude Sonnet 4—and use prompting techniques to feed it the business data through system prompts.

💡
SOTA is another buzzword that simply means “best in the market” for a specific task. At the time of writing, GPT-4.1 and Claude Sonnet 4 are considered SOTA models in the LLM space.

You can even go further and give the chatbot access to tools like:

  • write_to_db

  • generate_report
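
For instance, with the OpenAI-style function-calling (tools) API, declaring such tools might look like the sketch below. The tool names come from the list above; the parameter schemas are illustrative assumptions, not taken from any real system.

    # A minimal sketch of tool declarations in the OpenAI function-calling format.
    # The parameter schemas below are made up for illustration.
    tools = [
        {
            "type": "function",
            "function": {
                "name": "write_to_db",
                "description": "Insert a record into an internal database table.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "table": {"type": "string", "description": "Target table name."},
                        "record": {"type": "object", "description": "Column/value pairs to insert."},
                    },
                    "required": ["table", "record"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "generate_report",
                "description": "Generate a summary report for a given date range.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "start_date": {"type": "string", "description": "ISO date, e.g. 2025-01-01."},
                        "end_date": {"type": "string", "description": "ISO date, e.g. 2025-01-31."},
                    },
                    "required": ["start_date", "end_date"],
                },
            },
        },
    ]
    # These definitions are passed to the chat completion call (tools=tools),
    # and the model decides when and how to call them.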

This makes the chatbot agentic, meaning it can reason and take actions.
However, there's one major limitation:

Token Limit Exceeded.

At the time of writing this article, every SOTA model has a limited context window (a token limit).
You simply can’t send all of a business’s data into the model in one prompt; it’s far too large. And without access to the relevant data, the model can’t reliably give the right answer.
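
To make this limitation concrete, here is a tiny sketch that counts how many tokens the internal data would consume, assuming it has already been exported to a single text file (a hypothetical all_internal_docs.txt) and using the tiktoken tokenizer library:

    # Rough illustration of the context-window problem.
    # "all_internal_docs.txt" is a hypothetical dump of the company's data.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    with open("all_internal_docs.txt", encoding="utf-8") as f:
        business_data = f.read()

    num_tokens = len(enc.encode(business_data))
    print(f"Internal data: ~{num_tokens:,} tokens")

    # Even generous context windows (on the order of 100k+ tokens) are easily
    # exceeded by a realistic document collection.
    CONTEXT_WINDOW = 128_000  # illustrative figure; varies by model
    if num_tokens > CONTEXT_WINDOW:
        print("Too large for a single prompt -> we need retrieval.")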

RAG: A Practical Solution

This brings us to the real solution — the RAG pipeline.

With some clever optimizations, RAG helps us overcome the token limit issue.
Instead of sending all data to the model, RAG allows the LLM to retrieve only the relevant information it needs — just in time, just enough.

So, in simple terms:

RAG = retrieving just the relevant pieces of external data and handing them to the LLM so it can produce accurate results.

Where Developers Come In

This is where we, as developers, play a key role:
we optimize data retrieval from the source so that the LLM gets exactly what it needs to respond intelligently, without wasting tokens or resources.

What Happens Under the Hood of RAG

Generally, there are two main phases in any RAG-based application.

📌
In this section, we’ll use a PDF file as an example to explain how text is extracted and processed: we feed text in and get text out. However, the same method can be applied to other data formats too, such as images (using OCR), audio files (using speech-to-text), or even structured databases.
  1. Ingestion / Indexing Phase:

    In this phase, the data (PDF) is first converted to text using PDF parsers. After that, the text is chunked or split—paragraph by paragraph, page by page, or sentence by sentence—depending on the specific use case.

    Once we have the chunks, we generate a vector embedding for each one using an embedding model. Each embedding is then stored in a vector database along with the chunk’s content and metadata (there’s a code sketch of both phases after this list).

    💡
    Metadata can contain the page number of the PDF or any other details about the chunk or its source document.

    The purpose of vector embeddings is to capture the semantic relationships between chunks, allowing us to compare and retrieve relevant pieces of information based on similarity in vector space.

  2. Retrieval Phase:

    In this phase, we take the user’s query and convert it into a vector embedding using the same embedding model used during indexing.
    Then, using this query embedding, we search the vector database to find and retrieve the most semantically similar chunks of content.

    Next, we construct a prompt by combining:

    • The retrieved context (from the vector search)

    • The original user query

This prompt is then sent to the LLM, which generates a relevant and meaningful response based on the provided context.
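
To tie both phases together, here is a minimal end-to-end sketch in Python. This is not the exact pipeline from my project: it assumes a hypothetical company_handbook.pdf, uses pypdf for parsing and sentence-transformers for embeddings, and keeps the “vector database” as a plain in-memory NumPy array searched with cosine similarity.

    import numpy as np
    from pypdf import PdfReader
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # ---------- Phase 1: Ingestion / Indexing ----------
    reader = PdfReader("company_handbook.pdf")  # hypothetical internal document

    chunks, metadata = [], []
    for page_num, page in enumerate(reader.pages, start=1):
        text = (page.extract_text() or "").strip()
        if text:
            # Naive chunking: one chunk per page. Real pipelines usually split
            # into smaller pieces (paragraphs, fixed-size windows with overlap).
            chunks.append(text)
            metadata.append({"page": page_num})

    # One embedding per chunk; together with the metadata this is our toy "vector database".
    chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

    # ---------- Phase 2: Retrieval ----------
    def retrieve(query, top_k=3):
        # Embed the query with the *same* model used during indexing.
        query_vec = embedder.encode([query], normalize_embeddings=True)[0]
        # With normalized vectors, cosine similarity is just a dot product.
        scores = chunk_vectors @ query_vec
        best = np.argsort(scores)[::-1][:top_k]
        return [(chunks[i], metadata[i]) for i in best]

    user_query = "What is the leave policy?"  # example user question
    context = "\n\n".join(f"[page {m['page']}] {c}" for c, m in retrieve(user_query))

    # ---------- Prompt construction: retrieved context + original query ----------
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}"
    )
    print(prompt)  # send this prompt to the LLM of your choice

In a real application you would swap the in-memory array for a proper vector database (Chroma, Pinecone, pgvector, and so on) and send the constructed prompt to an LLM API to generate the final answer.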

Conclusion

RAG isn’t just another AI buzzword — it’s a practical way to make LLMs work with real, ever-changing business data.

Instead of retraining models or overloading prompts, RAG lets us retrieve only what’s needed, when it’s needed.

As developers, our job is to optimize that flow.
RAG is not magic — it’s smart engineering.

I even built a project using RAG to see it in action — and trust me, it’s much more doable than it sounds.

Project Demo Link: https://chat-with-pdf-rag-app.streamlit.app/

Project Github Link: https://github.com/karanShaw000/Chat-with-pdf-RAG-App
