Introduction to Retrieval-Augmented Generation (RAG)

Rajat Sharma

What is RAG?

Retrieval-Augmented Generation (RAG) is an AI framework that enhances Large Language Models (LLMs) by retrieving relevant information from an external data source and feeding that information to the model as context, so that it can generate more accurate, informative, and appropriate responses.


How does RAG work?

Now that you have an overview of what RAG is, let's look at how it works:

  1. Indexing: Structures and transforms external data so that the system can search through it efficiently later. It includes:

    1. Data Collection: Gather all the data that is needed for your application.

    2. Data Chunking: Breaking down large datasets into smaller, manageable units (chunks) for easier processing.

    3. Document Embedding: Each chunk is converted into a vector (a numeric representation) that captures the semantic meaning of the text.

    4. Vector Storing: These vectors are stored in a vector database (such as Pinecone, Qdrant, or Weaviate) for efficient similarity search.
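The indexing stage above can be sketched in a few lines. This is a toy illustration, not a specific library's API: the `embed` function here is a bag-of-words stand-in for a real embedding model, and the in-memory `index` list stands in for a vector database such as Pinecone or Qdrant.

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping word-based chunks."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

def embed(text: str, vocab: list[str]) -> list[float]:
    """Toy embedding: term-frequency vector over a fixed vocabulary.
    A real system would call an embedding model here instead."""
    words = text.lower().split()
    return [words.count(term) / max(len(words), 1) for term in vocab]

# Build the index: collect -> chunk -> embed -> store.
corpus = "RAG retrieves relevant chunks from a vector database before generation."
vocab = ["rag", "vector", "chunks", "database", "generation"]
index = [(chunk, embed(chunk, vocab)) for chunk in chunk_text(corpus, chunk_size=6, overlap=2)]
```

Overlapping chunks (the `overlap` parameter) are a common choice so that sentences split across a chunk boundary still appear whole in at least one chunk.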

  2. Retrieval: Transforms the user's query into a vector embedding and retrieves the most relevant chunks by comparing it with the stored embeddings in the vector database. It includes:

    1. Query Embedding: The user query is converted into a vector using the same embedding model used for document embedding, ensuring uniformity.

    2. Similarity Search: The system compares the user’s query embedding with the stored document embeddings to retrieve the most relevant chunks.

    3. Top-k Retrieval: The most relevant chunks are identified based on similarity and retrieved from the vector database.
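The retrieval steps can be sketched with plain cosine similarity over an in-memory index. The chunks and 3-dimensional vectors below are made-up illustrations; in practice the vectors come from the same embedding model used at indexing time, and the search runs inside the vector database.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Rank stored (chunk, vector) pairs by similarity to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in index]
    return [chunk for _, chunk in sorted(scored, reverse=True)[:k]]

# Illustrative index of pre-embedded chunks (toy 3-d vectors).
index = [
    ("RAG combines retrieval with generation", [0.9, 0.1, 0.0]),
    ("Vector databases store embeddings",      [0.2, 0.8, 0.1]),
    ("LLMs generate text from context",        [0.7, 0.2, 0.3]),
]
query_vec = [1.0, 0.0, 0.1]  # query embedded with the same model as the index
results = top_k(query_vec, index, k=2)
```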

  3. Generation: Once the relevant chunks are retrieved, they are fed into the language model (LLM) along with the user’s initial query to generate a response. It includes:

    1. Input to LLM: The retrieved text chunks and user query are passed to the LLM, providing context for the generation process.

    2. Contextual Understanding: The LLM uses the provided information to understand the user’s intent and context.

    3. Response Generation: The LLM synthesizes a response based on the context and generates a coherent, informative reply for the user.
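The generation step mostly amounts to assembling a prompt from the query and the retrieved chunks. A minimal sketch follows; `call_llm` is a hypothetical stand-in for whatever chat-completion client you actually use, and the prompt template is just one reasonable choice.

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    """Combine retrieved context and the user query into a single LLM prompt."""
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

retrieved = [
    "RAG combines retrieval with generation.",
    "Relevant chunks are fetched from a vector database.",
]
prompt = build_prompt("What is RAG?", retrieved)
# response = call_llm(prompt)  # hypothetical LLM call
```

Instructing the model to answer "using only the context" is a common way to reduce hallucination, since it anchors the response to the retrieved evidence.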

Summary Table

| Stage | What Happens |
| --- | --- |
| Indexing | Collect → Chunk → Embed → Store in vector DB |
| Retrieval | Embed query → Search vector DB → Return top-matching chunks |
| Generation | Combine query + retrieved context → Feed into LLM → Generate final response |

Flow Chart for Visual Understanding

[Flowchart: how RAG works]


Conclusion

That completes our overview of Retrieval-Augmented Generation (RAG).
We covered the fundamentals of indexing, retrieval, and generation.
In future blogs, we'll explore more advanced RAG topics such as step-back prompting, Reciprocal Rank Fusion (RRF), parallel query retrieval, Chain-of-Thought (CoT), and HyDE.
Stay tuned for more informative blogs in the future.
