Introduction to RAG


RAG (Retrieval-Augmented Generation) is a technique in artificial intelligence that helps a model answer questions or generate text by looking up external information first, then using that information to produce a response.
In simple words:
Normal AI model (like ChatGPT without RAG): Answers only from what it already knows (its training data).
RAG-powered AI: Before answering, it searches in a database/documents/knowledge source, takes the most relevant info, and then writes the answer using that info.
This makes the AI:
More accurate
Up-to-date
Better for domain-specific tasks (like legal, medical, finance, etc.)
⚡ Example:
If you ask: “What is the latest iPhone model?”
A normal model (without RAG) might not know the answer if its training data stopped before the latest release.
With RAG, the model first retrieves the latest news or database entry, then answers correctly.
How RAG Works (Retriever + Generator) with a Simple Example
RAG has two main parts:
Retriever → Finds relevant information from a knowledge base (like documents, PDFs, or a vector database).
Generator → Uses a language model (like GPT) to create a final answer using the retrieved info.
📖 Example:
You ask the system:
“Who won the FIFA World Cup 2022?”
🔹 Step 1: Retriever
The system converts your question into embeddings (mathematical vectors).
It searches a knowledge base (like Wikipedia articles, sports news, or a vector DB).
It retrieves the most relevant document:
“Argentina won the FIFA World Cup 2022 in Qatar, defeating France in the final.”
🔹 Step 2: Generator
Now, the language model (LLM) takes your question + the retrieved document.
It generates a natural, human-like answer:
“Argentina won the FIFA World Cup 2022 by defeating France in the final match held in Qatar.”
✅ Without the retriever, the model might hallucinate or give outdated info.
✅ With RAG, the model stays accurate, grounded, and up-to-date.
👉 Formula view:
Final Answer = Generator(User Query + Retrieved Info)
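To make the two steps concrete, here is a minimal Python sketch of the retrieve-then-generate flow. Everything in it is a toy stand-in: the knowledge base is a hard-coded list, the retriever ranks documents by simple word overlap instead of embeddings, and generate() just returns the prompt a real LLM would receive.

```python
import re

# Toy in-memory "knowledge base"; a real system would use a vector database.
KNOWLEDGE_BASE = [
    "Argentina won the FIFA World Cup 2022 in Qatar, defeating France in the final.",
    "The FIFA World Cup 2018 was won by France.",
    "Lionel Messi was Argentina's captain at the 2022 World Cup.",
]

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, docs: list[str], top_k: int = 1) -> list[str]:
    """Retriever: rank documents by word overlap with the query
    (a stand-in for embedding-based similarity search)."""
    q = tokens(query)
    ranked = sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:top_k]

def generate(query: str, context: list[str]) -> str:
    """Generator: a real system would send this prompt to an LLM;
    here we just return the prompt to show what the model receives."""
    return f"Answer using only this context:\n{' '.join(context)}\n\nQuestion: {query}"

query = "Who won the FIFA World Cup 2022?"
relevant = retrieve(query, KNOWLEDGE_BASE)
print(generate(query, relevant))
```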
🔹 Real-Life Use Cases of RAG
Chatbots & Virtual Assistants
Customer support bots use RAG to search company documents (FAQs, manuals, policies) before answering.
Example: If you ask “How do I reset my bank account password?”, the bot retrieves steps from the bank’s help docs and then replies in natural language.
Search Engines with AI
- Instead of just showing links, RAG-based search (like Perplexity AI or Bing Copilot) retrieves information from the web and generates a summarized answer.
Healthcare & Medical Research
Doctors or researchers can query large medical databases.
Example: Ask “What are the latest treatments for Type 2 Diabetes?” and the system retrieves recent research papers and generates a summary.
Legal Industry
Lawyers use RAG-powered tools to search through laws, case histories, and legal documents.
Instead of reading 1,000 pages, the system retrieves the most relevant cases and summarizes them.
E-commerce
Product recommendation chatbots retrieve details from the product catalog and generate answers.
Example: “Which laptops under ₹50,000 have the best battery life?”
Education & Research
Students can ask questions on study material, and the AI retrieves from textbooks, notes, or PDFs, then explains it.
Example: “Explain Newton’s third law with an example from daily life.”
Business Intelligence
Companies use RAG to query internal databases, reports, and dashboards.
Example: “What were our top 5 selling products last quarter?”
👉 In short:
RAG is useful whenever you need AI + accurate external knowledge (not just what the model was trained on).
What is Indexing?
In Retrieval-Augmented Generation (RAG) systems, indexing is the process of organizing and structuring data (like documents) so that relevant information can be efficiently retrieved for use in large language models (LLMs). It's essentially creating a searchable database of your data, enabling the RAG system to quickly find the most relevant context to answer user queries.
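As a rough sketch, indexing in code means: take your chunks, embed each one, and store the text together with its vector so it can be searched later. In the toy example below, embed() is a made-up keyword counter (not a real embedding model) and a plain Python list stands in for the vector database:

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    text: str             # the original chunk, returned to the LLM at query time
    vector: list[float]   # its embedding, used for similarity search

def embed(text: str) -> list[float]:
    # Placeholder embedding: counts a few keywords so the example runs end to end.
    # A real pipeline would call an embedding model here instead.
    keywords = ["argentina", "fifa", "qatar", "messi", "france"]
    words = text.lower().split()
    return [float(words.count(k)) for k in keywords]

def build_index(chunks: list[str]) -> list[IndexEntry]:
    """Indexing: embed every chunk once, up front, and store text + vector together."""
    return [IndexEntry(text=c, vector=embed(c)) for c in chunks]

index = build_index([
    "Argentina won the FIFA World Cup 2022.",
    "The match was held in Qatar.",
])
print(index[0].vector)  # -> [1.0, 1.0, 0.0, 0.0, 0.0]
```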
Why We Perform Vectorization
Vectorization means converting text (or any data) into numerical vectors (embeddings).
We do this because:
Machines Understand Numbers, Not Words
Computers can’t directly “understand” text like “FIFA World Cup”.
But if we convert it into a vector (a list of numbers), the computer can work with it mathematically.
Semantic Meaning (Not Just Exact Words)
Traditional search works on keywords. If you search “Who won FIFA 2022?”, it only matches exact words.
With vectorization, the embeddings capture meaning.
- Example: “World Cup champion 2022” and “Winner of FIFA 2022” will have vectors close to each other → so retrieval works even if the wording is different.
Fast Similarity Search
Once text is converted into vectors, we can use mathematical operations (like cosine similarity) to quickly find which documents are most similar to the query.
This is the core of RAG retrieval.
🔹 Simple Example
Sentence → Vector (simplified numbers):
“Argentina won FIFA 2022” →
[0.12, -0.45, 0.88, ...]
“Winner of FIFA World Cup 2022 was Argentina” →
[0.10, -0.47, 0.90, ...]
These two vectors will be very close in the vector space → so the retriever knows they mean almost the same thing.
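“Very close” is usually measured with cosine similarity. Here is a tiny illustration using the simplified three-number vectors above (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.12, -0.45, 0.88]  # "Argentina won FIFA 2022"
v2 = [0.10, -0.47, 0.90]  # "Winner of FIFA World Cup 2022 was Argentina"

print(cosine_similarity(v1, v2))  # ≈ 0.9997 → almost identical meaning
```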
✅ In short:
We perform vectorization so that:
Text can be understood by machines.
Semantic similarity (meaning) can be captured.
Searching becomes fast and accurate in RAG systems.
Why RAGs Exist
LLMs (like GPT) are powerful, but they have three big limitations:
Knowledge Cutoff
An LLM only knows what it was trained on (up to a certain date).
Example: If a model was trained on data only up to 2021, it won’t know who won FIFA 2022.
Domain-Specific Knowledge
LLMs are trained on general internet data.
They don’t automatically know private or specialized info (like your company policies, legal docs, or medical research).
Hallucinations
- Sometimes LLMs make up answers that sound good but are wrong, because they’re guessing from patterns, not retrieving facts.
🔹 How RAG Solves This
RAG = Retrieval + Generation
Retrieval: Connects the LLM to external knowledge (databases, documents, APIs, the web).
Generation: Uses that knowledge to give an accurate, human-like answer.
This way, RAG:
Keeps answers up-to-date.
Brings in trusted sources (your docs, PDFs, APIs).
Reduces hallucinations, since answers are grounded in real data.
📖 Example
Without RAG:
❌ Q: “What is the latest iPhone model?”
→ The model might guess incorrectly if trained before release.
With RAG:
✅ Q: “What is the latest iPhone model?”
→ Retriever fetches Apple’s latest press release → Generator writes:
“The latest iPhone model is the iPhone 16, announced in September 2024.”
✅ In short:
RAGs exist because normal LLMs are not enough on their own.
They make AI more accurate, up-to-date, and useful in real-world applications.
Why We Perform Chunking
Chunking = breaking large text/documents into smaller pieces (chunks) before indexing them into a vector database.
We do this because:
Embedding Models Have Token Limits
Most embedding models can only take a limited number of tokens at once (e.g., 512 for many Sentence-BERT models, around 8,000 for OpenAI’s).
If your document has 50 pages, you can’t embed the whole thing at once → you split it into chunks.
Better Retrieval Accuracy
If we store the entire document as one big vector, a query might match some part of it but not others → less precise.
By chunking, the retriever can fetch only the relevant part of the text.
👉 Example:
Query: “What are the side effects of Drug X?”
Instead of returning the entire 100-page medical PDF, chunking ensures the system retrieves just the small section that lists side effects.
Faster Search
Smaller chunks = more fine-grained results.
Retrieval becomes more efficient, since you don’t process unnecessary text.
Better Context for the Generator
When answering, the generator only needs a few related chunks, not the whole document.
This avoids overwhelming the LLM with irrelevant info.
🔹 Example of Chunking
Suppose you have a document:
Full document (too big to embed):
“Argentina won the FIFA World Cup 2022. Lionel Messi scored in the final. The match was held in Qatar. France was the runner-up.”
After chunking:
Chunk 1: “Argentina won the FIFA World Cup 2022.”
Chunk 2: “Lionel Messi scored in the final.”
Chunk 3: “The match was held in Qatar.”
Chunk 4: “France was the runner-up.”
👉 Now, if someone asks “Where was the FIFA 2022 final held?”, the retriever only fetches Chunk 3, not the whole paragraph.
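The splitting itself can be just a few lines of code. Here is a minimal sketch that reproduces the four chunks above by treating each sentence as one chunk (real pipelines usually split by a token budget and have to handle much messier text):

```python
import re

def chunk_by_sentence(text: str) -> list[str]:
    """Naive chunker: treat each sentence as one chunk."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

doc = ("Argentina won the FIFA World Cup 2022. Lionel Messi scored in the final. "
       "The match was held in Qatar. France was the runner-up.")

for i, chunk in enumerate(chunk_by_sentence(doc), start=1):
    print(f"Chunk {i}: {chunk}")
```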
✅ In short:
We perform chunking to make documents searchable, accurate, and efficient in RAG pipelines.
Why Overlapping Is Used in Chunking
When splitting documents into chunks for RAG, we add overlap (some repeated words/sentences) so that:
Preserve context at boundaries
If a sentence or idea falls at the end of one chunk, overlap ensures part of it also appears in the next chunk.
This prevents cutting important information in half.
Better retrieval
- A user’s query might relate to text that spans across two chunks. Overlap ensures the retriever still finds it.
Reduce hallucinations
- Without enough context, the LLM may guess or generate incomplete answers. Overlap keeps continuity between chunks.
Improve embedding quality
- Embeddings work best on semantically complete text. Overlap helps avoid chopped-off sentences.
📖 Example
Text:
“Argentina won the FIFA World Cup 2022. Lionel Messi was the captain.”
Without overlap (chunk size = 6 words):
Chunk 1: “Argentina won the FIFA World Cup”
Chunk 2: “2022. Lionel Messi was the captain”
→ If you ask “Who was Argentina’s captain in FIFA 2022?”, the retriever may miss the link.
With overlap (3 words):
Chunk 1: “Argentina won the FIFA World Cup”
Chunk 2: “FIFA World Cup 2022. Lionel Messi was the captain”
→ Now both chunks contain enough shared context, so retrieval works correctly.
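A common way to implement this is a sliding window, where each new chunk starts a few words before the previous one ended. Here is a minimal word-based sketch (the exact boundaries come out slightly different from the hand-made chunks above):

```python
def chunk_with_overlap(text: str, chunk_size: int = 6, overlap: int = 3) -> list[str]:
    """Sliding-window chunker: the last `overlap` words of each chunk
    are repeated at the start of the next one."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # last window already covers the end
            break
    return chunks

text = "Argentina won the FIFA World Cup 2022. Lionel Messi was the captain."
for i, chunk in enumerate(chunk_with_overlap(text), start=1):
    print(f"Chunk {i}: {chunk}")
# Chunk 1: Argentina won the FIFA World Cup
# Chunk 2: FIFA World Cup 2022. Lionel Messi
# Chunk 3: 2022. Lionel Messi was the captain.
```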
✅ In short:
Overlapping in chunking is used to carry context across chunk boundaries, making retrieval more accurate and answers more reliable.