RAG: The Fix LLMs Badly Needed

Shravan Bhati

Let’s be honest. Large Language Models (LLMs) are impressive. You ask them who invented the telephone, and they say Alexander Graham Bell. You ask them for Python code to scrape a website, and they’ll happily hand it over. But here’s the catch: give them a 400-page technical PDF and ask a question about page 276, and they will stare at you blankly, or worse, give you a confident but completely wrong answer.

That’s not intelligence. That’s bluffing.

So how do we stop LLMs from pretending to know everything when, in fact, they don’t have the context? This is where RAG comes into play.

What is RAG?

RAG stands for Retrieval-Augmented Generation. Fancy phrase, but the idea is simple. Instead of forcing an LLM to carry all the knowledge inside its limited brain (its context window), we give it a helping hand.

We teach it to retrieve the right information from an external source (database, documents, PDFs, etc.) and then use its natural language generation skills to form a coherent answer.

So, retrieval plus generation equals RAG.

Why do we even need RAG?

Because LLMs are forgetful and overconfident.

  • Forgetful: They can only look at a limited number of tokens at once. Think of this as a small backpack. You can carry a few things, but not your entire library.

  • Overconfident: If you feed them irrelevant or incomplete data, they don’t say, “I don’t know.” They’ll happily invent facts. For example, ask a clueless model who India’s Prime Minister is, and it might say “Dhruv Rathee” with full conviction.

RAG fixes both problems:

  1. It only feeds relevant data to the LLM.

  2. It keeps the context within the LLM’s limits.

How does RAG work? (Retriever + Generator)

RAG has two main parts:

  1. Retriever: Its job is to search and fetch relevant information from a knowledge base.

  2. Generator: The LLM itself, which takes the retrieved data and produces a readable, accurate answer.

Let’s say you have 100 PDFs about Japanese universities. A user asks: “What is the MEXT scholarship deadline at Tokyo University?”

  • The Retriever converts the question into vector form, looks inside your database, and fetches chunks of text from the PDFs that mention Tokyo University deadlines.

  • The Generator (LLM) takes those chunks and produces a clean, human-like answer: “The MEXT scholarship deadline at Tokyo University is usually in May.”

Without retrieval, the LLM might just make up a random month.
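
To make the split concrete, here’s a minimal sketch in Python. The tiny corpus, the word-overlap `retrieve()` and the `build_prompt()` helper are hypothetical stand-ins; a real retriever would search a vector database, and the final prompt would go to an actual LLM API.

```python
# Toy illustration of the Retriever + Generator split.
# The corpus, retrieve(), and build_prompt() are hypothetical stand-ins;
# a real pipeline would query a vector database and call an LLM API.

corpus = [
    "Tokyo University: the MEXT scholarship deadline is usually in May.",
    "Kyoto University: exchange applications open in April.",
    "Osaka University: campus tours run every Friday.",
]

def retrieve(question: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Naive retriever: rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question: str, chunks: list[str]) -> str:
    """Generator input: the retrieved chunks become the LLM's context."""
    context = "\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

question = "What is the MEXT scholarship deadline at Tokyo University?"
prompt = build_prompt(question, retrieve(question, corpus))
print(prompt)  # this prompt is what the generator (the LLM) would receive
```

The retriever here is deliberately dumb (word overlap); real RAG systems rank by vector similarity, which is exactly what the next sections build up to.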

What is Indexing?

Before retrieval works, you need a proper index.

Indexing is like creating a library catalog. Imagine walking into a library with 50,000 books but no index. You’ll never find what you need.

So, when you build RAG, you:

  1. Break down your data (books, PDFs, documents) into smaller chunks.

  2. Convert those chunks into vector embeddings.

  3. Save them in a vector database.

That database is your catalog. Now the retriever can look things up quickly.
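
Here’s a minimal indexing sketch, assuming the sentence-transformers library for embeddings (any embedding model works) and a plain Python list standing in for a real vector database:

```python
# Indexing sketch: chunk -> embed -> store.
# Assumes `pip install sentence-transformers`; the in-memory `index` list
# stands in for a real vector database (Chroma, Pinecone, pgvector, ...).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # one of many embedding models

# 1. Break your data into smaller chunks (here, each string is one chunk).
chunks = [
    "Tokyo University: the MEXT scholarship deadline is usually in May.",
    "Kyoto University: exchange applications open in April.",
]

# 2. Convert the chunks into vector embeddings.
vectors = model.encode(chunks)

# 3. Save them; each entry pairs the original text with its embedding.
index = list(zip(chunks, vectors))
print(f"Indexed {len(index)} chunks as {len(vectors[0])}-dimensional vectors.")
```

In production you would write these vectors to a dedicated vector database so the retriever can search millions of chunks quickly.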

Why Vectorization?

LLMs don’t understand plain text the way we do. They understand numbers.

Vectorization is the process of converting words, sentences, or chunks into numerical vectors in a high-dimensional space. Similar meanings end up closer together in that space.

For example:

  • “Dog” and “Puppy” will have embeddings that are close.

  • “Dog” and “Carburetor” will be far apart.

This allows the retriever to search by meaning instead of searching by exact keywords.
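
You can check this yourself with a quick cosine-similarity comparison. The sketch below assumes sentence-transformers again; the exact numbers depend on the model, but “dog” vs “puppy” should score far higher than “dog” vs “carburetor”.

```python
# Rough demo of "similar meanings end up closer together in vector space".
# Assumes `pip install sentence-transformers`; scores vary by embedding model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dog, puppy, carburetor = model.encode(["dog", "puppy", "carburetor"])

print("dog vs puppy:     ", round(cosine(dog, puppy), 3))       # close in meaning
print("dog vs carburetor:", round(cosine(dog, carburetor), 3))  # far apart
```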

Why do RAGs exist at all?

Because throwing all your data directly at an LLM is a bad idea.

  • It won’t fit (context window is small).

  • Even if it did, the LLM would drown in irrelevant information.

  • You’ll spend huge amounts of money on token usage.

RAG makes the process efficient: only the relevant parts of your data reach the LLM.

Why Chunking?

Suppose you upload a 200-page PDF. If you treat the entire file as one single block, the retriever can’t work properly. Searching through a single giant blob of text is useless.

Instead, we chunk the document into smaller pieces. A chunk can be:

  • A page

  • A paragraph

  • A fixed number of sentences

When a query comes in, the retriever only has to match against these small chunks. This makes searching faster and more accurate.
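
Here’s a small sketch of fixed-size chunking; the 500-character size is an arbitrary example value, and paragraph- or sentence-based splitting works the same way:

```python
# Fixed-size chunking: split one long document into bite-sized pieces.
# The 500-character chunk size is an arbitrary example value.

def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

document = "The MEXT scholarship deadline is in May. " * 200  # pretend this is a 200-page PDF
chunks = chunk_text(document)
print(f"{len(chunks)} chunks of up to 500 characters each")
```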

Why Overlapping in Chunking?

Now imagine this:

Page 5 of your PDF ends a paragraph with “…the deadline is in May,” and Page 6 starts with “…for the Tokyo University program.”

If you chunk strictly by page, neither page makes sense on its own: Page 5 has the deadline but not the program, and Page 6 has the program but not the deadline.

This is why we use overlapping chunks. Each chunk slightly overlaps with the next one, so no important context gets lost. It’s like making sure the pieces of a puzzle still connect.
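
In code, overlap is just a sliding window: each chunk starts a little before the previous one ends. The sizes below are arbitrary example values you would tune for your own data.

```python
# Overlapping chunks via a sliding window.
# chunk_size and overlap are arbitrary example values.

def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    step = chunk_size - overlap  # each chunk starts `overlap` characters before the previous one ends
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

document = "the deadline is in May for the Tokyo University program. " * 100
chunks = chunk_with_overlap(document)
# Neighbouring chunks now share 100 characters, so text near a chunk boundary
# shows up in both chunks instead of being cut in half.
print(len(chunks), "overlapping chunks")
```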

Putting It All Together

Let’s go back to our example of storing lots of PDFs about Japan.

  1. You take one PDF at a time.

  2. Break it into chunks (paragraphs or pages).

  3. Convert those chunks into vector embeddings.

  4. Save all embeddings in a vector database.

That completes the indexing phase.

Now comes the chat phase:

  1. A user asks a question.

  2. You convert the question into vector embeddings.

  3. The retriever looks into the vector database and fetches the most relevant chunks.

  4. These chunks are passed into the LLM’s context window.

  5. The LLM generates an accurate, context-aware answer.

No hallucinations, no Dhruv Rathee as PM.
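
Here’s a hedged end-to-end sketch of that chat phase. It reuses sentence-transformers for embeddings and a plain list as the “vector database”; the last step (the actual LLM call) is left as a comment because it depends on your provider.

```python
# End-to-end sketch: embed the question, retrieve the closest chunks,
# and hand them to the LLM as context.
# Assumes `pip install sentence-transformers`; the list-based index and
# the prompt format are illustrative choices, not a fixed recipe.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Indexing phase (normally done once, ahead of time).
chunks = [
    "The MEXT scholarship deadline at Tokyo University is usually in May.",
    "Kyoto University holds its entrance ceremony in early April.",
    "Tokyo University asks MEXT applicants for two recommendation letters.",
]
chunk_vectors = model.encode(chunks)

# Chat phase.
question = "What is the MEXT scholarship deadline at Tokyo University?"
query_vector = model.encode([question])[0]

# Rank chunks by cosine similarity to the question and keep the top 2.
scores = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:2]]

prompt = ("Answer using only this context:\n"
          + "\n".join(top_chunks)
          + f"\n\nQuestion: {question}")
print(prompt)  # send this to your LLM of choice to get the final, grounded answer
```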

Final Thoughts

RAG is not magic. It doesn’t make LLMs suddenly omniscient. But it’s the best way we have to combine the reasoning power of LLMs with the reliability of external data.

Think of it this way: LLMs are like smooth talkers. RAG hands them the right notes before they speak.

That’s why RAG matters.

