RAG made easy 🦾

Sejal Kaur
4 min read

What is it?

Retrieval-Augmented Generation (RAG) is a method that combines information retrieval with generative AI models to produce more accurate and contextually relevant text.

Why is it used?

  • Retrieval — First, the system searches for the most relevant information in documents, databases, or websites.
    (Example: you ask, “What is the latest iPhone model?” → it searches a knowledge source.)

  • Augmented — That information is given as extra context to the AI model.

  • Generation — Now, the AI uses both its own knowledge and the retrieved info to give a better, more accurate answer.

It makes answers more accurate, up-to-date, and based on real documents.
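To make those three steps concrete, here is a toy end-to-end sketch in Python. Everything in it is made up for illustration: the letter-frequency embed() stands in for a real embedding model, and llm() stands in for a real language model.

```python
# Toy RAG flow: retrieve -> augment -> generate.
# embed() and llm() are made-up stand-ins, not a real model or API.

def embed(text: str) -> list[float]:
    # Toy "embedding": letter frequencies. Real systems use trained models.
    return [text.lower().count(c) / max(len(text), 1) for c in "abcdefghij"]

def similarity(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # dot product as a toy score

def llm(prompt: str) -> str:
    return f"(model answer based on: {prompt[:60]}…)"  # stand-in for a real LLM

knowledge = ["The firefly algorithm is a nature-inspired optimization technique.",
             "Chunking splits documents into smaller pieces."]
db = [(chunk, embed(chunk)) for chunk in knowledge]       # the knowledge source

question = "What is the firefly algorithm?"
q_vec = embed(question)                                   # Retrieval: embed the query…
context = max(db, key=lambda item: similarity(q_vec, item[1]))[0]  # …pick the closest chunk
prompt = f"Context: {context}\nQuestion: {question}"      # Augmented: add context
print(llm(prompt))                                        # Generation: answer using both
```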

How does it work?

[Diagram: RAG functioning, explained below]

Example: I am a lawyer, and I have a case that is similar to earlier ones, but with so many files I cannot find the relevant one… So I feed this system all of the previous files I have. In return, it converts every file into chunks (splitting it into pieces): if a file has 125 pages, it becomes 125 chunks (you can also chunk by paragraphs or characters). These chunks are then passed to an embedding model, which maps each chunk to a vector embedding (basically, giving a number to every property), and the embeddings are stored in a vector database (Pinecone, MongoDB, etc.). This process is known as indexing. Now, let's move on to the second process, which is…
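As a sketch, that indexing step might look like this in Python, assuming the sentence-transformers package and its all-MiniLM-L6-v2 model (the three “pages” are made-up stand-ins for my case files):

```python
# Indexing: files -> chunks -> vector embeddings -> stored index.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small open-source embedding model

# Pretend each string is one page of a case file (the "125 chunks" idea).
pages = [
    "Page 1: summary of the prior case and its outcome…",
    "Page 2: arguments raised by the defence…",
    "Page 3: the court's final ruling and reasoning…",
]

vectors = model.encode(pages)      # one embedding per chunk
index = list(zip(pages, vectors))  # store (chunk, embedding) pairs
# In production, this index lives in a vector database (Pinecone, etc.).
```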

Retrieval (the main hero): now, whenever the user gives me a query, it also goes through the same embedding step, so we have it mapped to its vector embedding, and since similar text maps to nearby vectors, we can fetch the most similar chunks from the database itself (you can use Qdrant for this).
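Here is a sketch of that retrieval step with Qdrant, reusing model and index from the snippet above (the collection name and payload fields are made up for illustration):

```python
# Retrieval with Qdrant, reusing `model` and `index` from the indexing sketch.
# Qdrant's ":memory:" mode runs a throwaway local instance for experiments.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")
client.create_collection(
    collection_name="case_files",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # MiniLM dim = 384
)

# Load the indexed chunks into the collection.
client.upsert(
    collection_name="case_files",
    points=[PointStruct(id=i, vector=vec.tolist(), payload={"text": page})
            for i, (page, vec) in enumerate(index)],
)

# A query goes through the same embedding step, then a similarity search.
query_vec = model.encode("What was the final ruling?")
for hit in client.search(collection_name="case_files",
                         query_vector=query_vec.tolist(), limit=2):
    print(hit.score, hit.payload["text"])  # most similar chunks first
```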

What is Indexing?

Let’s take a book without an index: you would have to scan through the pages to find “Computer Networks”. But with an index (a catalogue), you can hop straight to the page that has “Computer Networks”. In short, it makes searching easy.

book → chunks → vectors → database → query → relevant passage

When we store documents for retrieval, we first index them.

Each document is split into chunks, and each chunk is converted into a vector embedding. These are stored in a database.

Later, when you ask a query, instead of searching the whole text, the system matches similar vector embeddings and retrieves their chunks, and hence the pages or documents they belong to. So the index is used to quickly find relevant passages.
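In miniature, that “match similar embeddings, get their chunks back” step is just a nearest-vector lookup. A sketch with plain NumPy and made-up three-number embeddings:

```python
import numpy as np

# Tiny made-up index: each chunk is stored next to its embedding.
chunks = ["intro to computer networks", "history of the printing press"]
vectors = np.array([[0.9, 0.1, 0.3],
                    [0.1, 0.8, 0.5]])   # one made-up embedding per chunk

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query_vec = np.array([0.85, 0.15, 0.25])         # made-up embedding of the query
scores = [cosine(query_vec, v) for v in vectors]
print(chunks[int(np.argmax(scores))])            # -> "intro to computer networks"
```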

Why do we perform vectorization?

Because computers do not understand words like “cat” or “dog” directly, we need to convert them into numbers. But if we just assign them 1, 2, 3 (say “cat” = 1, “dog” = 2), the computer won’t understand the relationship between them. So we use VECTORS to capture MEANING, which is what we need in order to find relevant documents.

  • “cat” → [0.12, 0.85, 0.31]

  • “dog” → [0.10, 0.80, 0.35]

Both vectors are close because cats and dogs are related animals.
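You can check that closeness with cosine similarity, the measure most vector databases use; plugging in the two vectors above:

```python
import numpy as np

cat = np.array([0.12, 0.85, 0.31])
dog = np.array([0.10, 0.80, 0.35])

# Cosine similarity: 1.0 means identical direction, 0 means unrelated.
similarity = cat @ dog / (np.linalg.norm(cat) * np.linalg.norm(dog))
print(round(similarity, 3))  # ~0.998: the vectors point almost the same way
```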

Why do RAGs exist?

Large Language Models (LLMs) like GPT are trained on a massive amount of text, but they have some limitations:

  • Outdated knowledge — An LLM is only as up to date as its training data. If it was trained in 2023, it wouldn’t know what happened in 2024.

  • Incomplete knowledge — The model may not have access to every document, research paper, or company’s private data.

  • Hallucination — Sometimes, LLMs make up answers when they don’t know.

So, RAGs exist because they can:

  • Keep AI up-to-date

  • Reduce hallucinations

  • Allow AI to use your own custom data (company docs, PDFs, etc.)

LLM + Retrieval (knowledge) = RAG (smarter AI)

Why do we perform chunking?

Chunking means breaking documents down into pages, paragraphs, or words, which improves search and retrieval: imagine searching a 125-page document versus a 1-page one.

It also gives more relevant results, because each search runs over small, focused pieces of text, so the model is handed only the most useful context.
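A minimal chunker is only a few lines; this sketch splits on a fixed number of words (real splitters also work by tokens, sentences, or characters):

```python
def chunk(text: str, size: int = 50) -> list[str]:
    """Split text into non-overlapping chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

document = "word " * 125              # stand-in for a long file
print(len(chunk(document, size=25)))  # -> 5 chunks of 25 words each
```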

Why is overlapping used in chunking?

If we split a document into non-overlapping chunks, sometimes important context gets cut off between two chunks.

Example: Chunk 1: “…The firefly algorithm is a nature-inspired optimization technique. It is based on the flashing behavior of fireflies.”

Chunk 2: “This algorithm is often used in engineering problems for optimization tasks…”

Without overlap, the system might retrieve only chunk 1 or chunk 2 for the query “What is the firefly algorithm used for?”, and miss half the meaning.

To avoid losing context, we add a small overlap (like 50–100 tokens) between consecutive chunks.

Chunk 1: “…The firefly algorithm is a nature-inspired optimization technique. It is based on the flashing behavior of fireflies. This algorithm is often used in engineering problems…”

Chunk 2: “…It is based on the flashing behavior of fireflies. This algorithm is often used in engineering problems for optimization tasks…”

So, overlapping is important because it preserves context, enables better retrieval, and improves accuracy.
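In code, overlap just means stepping forward by less than a full chunk. A sketch extending the word-based chunker from earlier (the 50–100 tokens above correspond to the overlap parameter here):

```python
def chunk_with_overlap(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into `size`-word chunks, each sharing `overlap` words
    with the previous chunk so context is never cut clean in half."""
    words = text.split()
    step = size - overlap  # advance by less than a full chunk
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

doc = ("The firefly algorithm is a nature-inspired optimization technique. "
       "It is based on the flashing behavior of fireflies. "
       "This algorithm is often used in engineering problems for optimization tasks.")
for c in chunk_with_overlap(doc, size=15, overlap=5):
    print(c)  # consecutive chunks repeat 5 words, preserving the boundary context
```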


Written by Sejal Kaur