What is RAG and why is it needed?

What is RAG?
Among the AI buzzwords we've been hearing a lot these days, "RAG" is one of the most common. Short for Retrieval-Augmented Generation, RAG is a technique in natural language processing (NLP) where a language model is combined with a retrieval mechanism to generate more accurate and contextually relevant responses. The idea is to augment a generative model (like GPT) with information retrieved from an external knowledge base or corpus. In simpler terms, RAG means storing external knowledge (such as information about a company, for that company's chatbot) in a storage service (usually a vector database), retrieving the relevant pieces of that knowledge at query time, and providing them to the LLM as context so it can give more accurate and relevant responses.
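As a rough illustration of that last point, the "augmentation" often amounts to nothing more than placing the retrieved text into the prompt. The snippet below is a hypothetical example (the company, question, and wording are made up for illustration), not any specific library's API:

```python
# Hypothetical example of how retrieved knowledge is injected into an LLM prompt.
retrieved_context = "Acme Corp's support line is open 9am-5pm EST, Monday to Friday."
user_question = "When can I reach Acme support?"

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{retrieved_context}\n\n"
    f"Question: {user_question}"
)
# `prompt` is then sent to the LLM, which answers from the retrieved context.
```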
How does RAG work?
Refer to the flowchart above while reading this for a better understanding; a minimal code sketch of the same flow follows the list.
Data Sources:
- It starts with structured or unstructured data (e.g., text files, PDFs).
Chunking:
- The large text data is split into smaller, manageable chunks for processing.
Embedding Model:
- Each chunk of text is passed through an embedding model to generate numerical vectors (embeddings) that represent the meaning of the text.
Vector Database:
- These embeddings, along with their corresponding text chunks, are stored in a vector database.
User Query:
- When a user submits a query (e.g., "What color is an apple?"), it is also passed through the embedding model to create a query embedding.
Retrieval System:
- The system retrieves the most relevant text chunk(s) from the vector database based on the similarity between the query embedding and the stored embeddings.
Context to LLM:
- The retrieved chunk(s) (e.g., "Apple is red") are provided as context to the Large Language Model (LLM), alongside the original query.
LLM Response:
- The LLM processes the provided context and query to generate a relevant, accurate response (e.g., "The apple is red.").
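To make the flow above concrete, here is a minimal, self-contained sketch in Python. It is illustrative only: `embed()` is a toy stand-in for a real embedding model, the in-memory list stands in for a vector database, and the function names and sample sentences are made up for this example rather than taken from any particular library.

```python
import re
import numpy as np

def chunk(text: str, chunk_size: int = 200) -> list[str]:
    """Chunking: split a long document into fixed-size pieces."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Embedding model (toy stand-in): hash words into a fixed-size, unit-length vector.
    A real system would call an actual embedding model here instead."""
    vec = np.zeros(dim)
    for word in re.findall(r"[a-z]+", text.lower()):
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Vector database (toy stand-in): a list of (embedding, chunk) pairs held in memory.
documents = "Apple is red. Banana is yellow. Grapes are green."
store = [(embed(c), c) for c in chunk(documents, chunk_size=30)]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Retrieval system: rank stored chunks by cosine similarity to the query embedding."""
    q = embed(query)
    ranked = sorted(store, key=lambda pair: float(np.dot(pair[0], q)), reverse=True)
    return [text for _, text in ranked[:k]]

# User query -> retrieval -> context to LLM
query = "What color is an apple?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this prompt would then be sent to an LLM to generate the final answer
```

In a real system you would swap `embed()` for calls to an embedding model and the list for a proper vector database, but the chunk-embed-store-retrieve-prompt structure stays the same.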
Why do we need RAG?
Looking at how the RAG system works, you may ask: can't we just put the context directly in the user message? That can work when the data is small, but it usually breaks down when the data is large, mainly because of the language model's context window limit. The model simply cannot process an input prompt whose token count exceeds that limit (for example, gpt-4o's context window is 128k tokens). The RAG system solves this problem by providing only a small amount of the most relevant context to the language model.
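A back-of-the-envelope calculation makes the point. Assuming the common rough heuristic of about 4 characters per token (an approximation, not an exact count) and a hypothetical 50 MB company knowledge base, pasting everything into one prompt is impossible, while a single retrieved chunk fits easily:

```python
def approx_tokens(num_chars: int) -> int:
    """Rough token estimate using the ~4 characters-per-token heuristic."""
    return num_chars // 4

CONTEXT_WINDOW = 128_000            # e.g., gpt-4o's context window in tokens
knowledge_base_chars = 50_000_000   # hypothetical company knowledge base (~50 MB of text)
chunk_chars = 1_000                 # one retrieved chunk

print(approx_tokens(knowledge_base_chars))  # ~12,500,000 tokens -> far beyond the 128k limit
print(approx_tokens(chunk_chars))           # ~250 tokens -> fits with plenty of room for the query
```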