1. Introduction to RAG


You may have noticed recently that the buzzword RAG is sprinkled all over your LinkedIn feed. Frustrated with being constantly bombarded by this word on my feed, I caved in and decided to understand what it means. What I found was quite interesting, so I decided to write a series of articles on this topic. This one is the first in the series and will introduce you to the world of RAG.
What is RAG?
RAG stands for Retrieval Augmented Generation. A terrifying set of words, isn’t it? Don’t worry, we will break them down in this section. For now, all you need to understand is that it’s a framework built to pass better & more relevant context to large language models in order to get better responses. If you have used tools like ChatGPT or Google Gemini, then you must know that the quality of their answers improves drastically when you pass them more relevant pieces of information.
Now, let’s break down those words.
Retrieval → It refers to the process of retrieving/fetching the relevant pieces of information. How and from where? We will discuss that later in this article.
Augmented → In this context, Augmented means enhancing large language models by enriching them with information that is relevant to users’ queries.
Generation → This is the core capability of LLMs. Given an input prompt, the model generates a relevant piece of data such as an answer, explanation, summary, etc.
Semantic Search
Before we get into the implementation details, we must understand Semantic Search, which is the core principle on which RAG systems work. Semantic search is a way of finding information based on meaning rather than just matching exact words. In simple terms, semantic search finds what you mean, not just what you type.
Here’s how semantic search works:
Turning text into meaning vectors: A piece of text is passed to a pre-trained model (like Sentence-BERT or OpenAI’s text embeddings) that maps the text into a vector capturing its meaning. The model converts the text into a fixed-length list of numbers (e.g. a 768-dimensional vector). Those numbers encode the text’s meaning in a high-dimensional “semantic space.”
Indexing for faster lookup: These vector embeddings are stored in a vector database. The database builds an index so it can quickly find which vectors lie closest to any given point in that space.
Querying with meaning: When you type a search query (“why is life so hard? 😔”), the system also turns it into a vector. It then asks the vector database, “Which stored vectors are most similar to this query vector?” If your RAG system has previously stored data that addresses such queries, the response from the LLM will be much better.
The key benefit of semantic search is that even if a document doesn’t literally say “why is life so hard? 😔”, it might use synonyms (“What makes life so challenging?”, “Why do I face so many obstacles in life?”) and still be retrieved, because its vector sits near your query’s vector in the space.
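To make this concrete, here’s a minimal sketch of semantic search in Python using the sentence-transformers library. The model name, the example documents, and the query are illustrative assumptions, not a prescribed setup:

```python
# A minimal sketch of semantic search with sentence-transformers.
# The model name and example texts are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # maps text to 384-dim vectors

documents = [
    "What makes life so challenging?",
    "Why do I face so many obstacles in life?",
    "How to bake a chocolate cake",
]
doc_embeddings = model.encode(documents)

query = "why is life so hard? 😔"
query_embedding = model.encode(query)

# Cosine similarity: a higher score means the texts are closer in meaning.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
for doc, score in zip(documents, scores):
    print(f"{float(score):.3f}  {doc}")
# The two paraphrases score far higher than the cake recipe,
# even though they share no words with the query.
```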
Semantic search works on different types of data such as text, video, audio, images, etc. As long as you have a model that maps your data (text, pixels, audio waveforms, code tokens…) into real-valued vectors that capture “meaning” in that domain, you can perform semantic search.
For example, Spotify uses audio embeddings to power “Fans also like” and “Discover Weekly” by finding tracks whose embeddings cluster together.
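As an illustration of the cross-modal idea, here’s a minimal sketch that embeds an image and a few captions into the same vector space using a CLIP model via sentence-transformers. The model name, image path, and captions are assumptions for illustration only:

```python
# A minimal sketch of cross-modal semantic search: text and images
# are embedded into one shared vector space by a CLIP model.
# The model name, image path, and captions are illustrative assumptions.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

captions = ["a dog running on a beach", "a plate of pasta", "a city skyline at night"]
caption_embeddings = model.encode(captions)

# Encode an image into the same space and pick the closest caption.
image_embedding = model.encode(Image.open("holiday_photo.jpg"))
scores = util.cos_sim(image_embedding, caption_embeddings)[0]
print(captions[int(scores.argmax())])
```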
Phases of RAG
RAG in its most basic form has two phases. Let’s understand these phases through an example. Say you have a big PDF document & you want to get answers to some questions based on that document.
Ingestion Phase
This phase refers to ingesting into the RAG system the data that will later be used to pass better context to the LLM. In our example, we upload our PDF document to the RAG system, which indexes the document and stores it in such a way that it’s easy to fetch relevant information from it.
This phase has the following steps:
Load Data: The first step in ingestion is loading the data. The data can be uploaded by users, or we may already have certain data on which we want to build a specialized RAG system.
Chunk Data: In this step, the loaded data is split into smaller pieces called chunks. Chunking splits large documents into smaller passages that fit within the model’s context window, since the retrieved data can’t exceed that window. It also ensures that we don’t pass the whole document (in short, too much context) to the LLM.
Generate Vector Embeddings: As discussed before, in this step we generate the vector embeddings for each chunk of the data. We rely on vector embedding models for this step.
Store Vector Embeddings: In this step, we store the vector embeddings of the chunks in a vector database such as Pinecone for fast & efficient semantic search. A minimal code sketch of these steps follows this list.
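Putting these four steps together, here’s a minimal sketch of the ingestion phase. The plain-text input file, the fixed-size character chunking, the embedding model, and the in-memory ChromaDB collection are all assumptions made for illustration; a real system might load the PDF with a dedicated loader and store the vectors in Pinecone instead:

```python
# A minimal sketch of the ingestion phase: load -> chunk -> embed -> store.
# File name, chunk size, model, and vector store are illustrative assumptions.
import chromadb
from sentence_transformers import SentenceTransformer


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size character chunks."""
    step = chunk_size - overlap
    return [text[start : start + chunk_size] for start in range(0, len(text), step)]


# 1. Load data (here: a plain-text dump of the PDF).
with open("my_document.txt", encoding="utf-8") as f:
    raw_text = f.read()

# 2. Chunk data.
chunks = chunk_text(raw_text)

# 3. Generate vector embeddings for each chunk.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(chunks)

# 4. Store vector embeddings in a vector database.
client = chromadb.Client()  # in-memory instance for the sketch
collection = client.create_collection("my_document")
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings.tolist(),
)
```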
Query Phase
This phase refers to fetching the data that is most relevant to the user’s query, which is then passed to the LLM. In our example, let’s say you have a question about your document & you ask the RAG system that question. The system looks at the stored information and fetches the most relevant pieces of data, which are passed to the LLM to answer your question.
The query phase has the following steps:
Generate Vector Embeddings for Query: In this step, we generate vector embeddings for the user’s query using the same embedding model used for ingestion.
Semantic Search: In this step, we use the vector embeddings generated for the user’s query to do a similarity search on the vector database. This step returns the chunks of data most relevant to the user’s query.
Generate Response: In this step, we take the information retrieved from the vector database & pass it to the LLM. Since the LLM now has the most relevant context for the user’s query, it will be able to generate good results, as shown in the sketch below.
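Continuing the ingestion sketch above (the same `embedder` and `collection` are reused), here’s a minimal sketch of the query phase. The OpenAI model name, the question, and the prompt wording are illustrative assumptions:

```python
# A minimal sketch of the query phase: embed the query, retrieve, generate.
# Reuses `embedder` and `collection` from the ingestion sketch above.
from openai import OpenAI

question = "What does the document say about refund policies?"  # hypothetical query

# 1. Generate vector embeddings for the query (same model as ingestion).
query_embedding = embedder.encode(question)

# 2. Semantic search: fetch the most relevant chunks from the vector database.
results = collection.query(query_embeddings=[query_embedding.tolist()], n_results=3)
context = "\n\n".join(results["documents"][0])

# 3. Generate a response by passing the retrieved context to the LLM.
llm = OpenAI()  # reads OPENAI_API_KEY from the environment
response = llm.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```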
So that’s it for this one. Hope you liked this introductory article on RAG! In the next article, we will build a simple RAG system, in which we will upload a PDF to our RAG system & ask it questions about the PDF. The system will integrate with a vector database & the OpenAI APIs. Stay tuned for the next one!
If you have questions/comments, then please feel free to comment on this article.