Understanding the Basics of Retrieval Augmented Generation (RAG)

shrihari katti
10 min read

Introduction

Retrieval-Augmented Generation (RAG) is an important technique in artificial intelligence that improves large language models (LLMs) by using external knowledge bases. RAG is crucial for overcoming the limits of LLMs, which are trained on fixed datasets and often have trouble with current or specific information.

Context and Purpose

RAG is used in AI to overcome the inherent constraints of LLMs, which rely on their training data—often static and cut off from new information. By enabling LLMs to retrieve and incorporate external knowledge, RAG ensures that generated responses are accurate, contextually relevant, and grounded in current or specialized data. This is particularly crucial in fields where information evolves rapidly, such as healthcare, finance, and customer support, and where trust and accuracy are paramount.

Key Reasons for Using RAG

  1. Access to External Knowledge

  2. Improved Accuracy and Reduced Hallucinations

  3. Cost-Effectiveness and Efficiency

  4. Domain-Specific Applications

  5. Transparency and Trust Building

  6. Scalability and Flexibility

  7. Reduction of Data Leakage and Enhanced Security

  8. Support for Personalized and Verifiable Responses

  9. Addressing Complex or Unanswerable Queries

Now that we understand the purpose of Retrieval-Augmented Generation (RAG), let's look at how it works. RAG enhances LLMs by querying custom datasets or documents to retrieve relevant information. This data is then fed into the LLM as context, improving response relevance and domain-specific accuracy.

This method ensures responses are tailored and updated, which is vital in fields like healthcare, finance, and customer support. The RAG process starts by identifying the need for external information, querying databases, and integrating the retrieved data into the LLM's context. This improves response accuracy and builds user trust with transparent, verifiable information.

BASIC RAG

The basic RAG pipeline involves a few key steps. First, a query is generated, which is then used to retrieve relevant information from an external knowledge base. This retrieved data is combined with the query and fed into a language model to generate a response. The response is then refined and presented to the user, ensuring it is accurate and contextually relevant. This process allows the language model to provide up-to-date and specific information, overcoming the limitations of relying solely on pre-trained data.

  1. Indexing

  • In this process, data is gathered from one or multiple sources using specialized data loaders. These loaders are responsible for efficiently collecting and preparing the data for further processing. The sources can range from databases, documents, or any other repositories that contain the necessary information.

  • Once the data is loaded, it is divided into smaller, manageable pieces using text splitters. This step is crucial as it breaks down large volumes of text into chunks that are easier to handle and analyze. These chunks ensure that the system can process the data efficiently without being overwhelmed by the size of the input.

  • Next, vector embeddings are generated for these data chunks using an embedding model. The model transforms the text into numerical representations, known as embeddings, which capture the semantic meaning of the data. The same embedding model is later used for user queries, so queries and document chunks can be compared in the same vector space.

  • Finally, the vector embeddings of the relevant data are stored in a vector database. This database acts as a repository where embeddings can be quickly retrieved and compared. By storing the embeddings, the system can efficiently match user queries with the most pertinent data, facilitating rapid and accurate responses.
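
To make these four indexing steps concrete, here is a minimal, framework-free sketch in Python. The embed function below is a deliberately toy, hash-based stand-in for a real embedding model (it is not semantically meaningful), and a plain Python list stands in for the vector database; the LangChain and Qdrant walkthrough later in this article shows a real implementation.

    import hashlib
    import math

    def embed(text: str, dims: int = 8) -> list[float]:
        # Toy stand-in for a real embedding model: hash-based, NOT semantic.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vec = [b / 255.0 for b in digest[:dims]]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    def split_into_chunks(text: str, chunk_size: int = 80, overlap: int = 20) -> list[str]:
        # Naive character-based splitter with overlap between consecutive chunks.
        chunks, start = [], 0
        while start < len(text):
            chunks.append(text[start:start + chunk_size])
            start += chunk_size - overlap
        return chunks

    # 1. Load the source data (an inline string here instead of a real loader).
    document_text = (
        "Retrieval-Augmented Generation lets a language model pull in external "
        "knowledge at query time instead of relying only on its training data."
    )

    # 2. Split, 3. embed, 4. store: the list stands in for a vector database.
    index = [{"text": chunk, "vector": embed(chunk)}
             for chunk in split_into_chunks(document_text)]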


  2. Retrieval and Generation

Flowchart showing the process from "Query" to "Answer." The query leads to a document repository labeled "Retrieve," followed by a "Prompt" with an AI icon. It then connects to a "LLM" (Large Language Model) box with various icons, ultimately resulting in an "Answer."

  • In the Retrieval and Generation process, the user's query is taken as input. We then calculate the vector embeddings of this query using the embedding generation model, which is the same model utilized during the indexing process.

  • The query is then searched within the vector database, where the document chunks are stored. The system retrieves the Top-K relevant data chunks that have contexts similar to the user's query. This ensures that the most pertinent information is selected as search results.

  • After retrieving these contexts from the vector database, we prompt the LLM with both the user's query and the retrieved chunks, instructing it to answer using only that retrieved information.

  • Finally, the LLM generates a response to the user's query, grounding its answer in the retrieved document content. This ensures the response is both accurate and informed by the most relevant data available, effectively addressing the user's needs.
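
Continuing the toy sketch from the indexing section (and reusing its embed function and index list), the retrieval and generation steps can be outlined as follows; the final LLM call is represented only by the assembled prompt string.

    def cosine_similarity(a: list[float], b: list[float]) -> float:
        # The toy vectors are already normalized, so the dot product is the cosine.
        return sum(x * y for x, y in zip(a, b))

    def retrieve(query: str, k: int = 3) -> list[str]:
        # Embed the query with the SAME model used at indexing time, then
        # rank the stored chunks by similarity and keep the Top-K.
        query_vector = embed(query)
        ranked = sorted(index, key=lambda item: cosine_similarity(query_vector, item["vector"]), reverse=True)
        return [item["text"] for item in ranked[:k]]

    user_query = "What problem does RAG solve?"
    context = "\n\n".join(retrieve(user_query))

    # The retrieved context is placed into the prompt that is sent to the LLM.
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {user_query}"
    print(prompt)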

The following working diagram clearly demonstrates how both the indexing and retrieval & generation components make up the Retrieval Augmented Generation (RAG) system.

Flowchart illustrating a process involving a data source being chunked and embedded, stored in a vector store. A user query is embedded for search in the vector store to find relevant chunks. Results are processed by an LLM to provide the final output.

Building a PDF Query System with RAG: A Step-by-Step Code Walkthrough

In this tutorial, we'll guide you through a Python script to set up a Retrieval-Augmented Generation (RAG) system. It loads a PDF, splits it into searchable chunks, stores them in a vector database, and uses a language model to answer questions based on the document. Let's explore the code.

Before installing any packages, follow these steps:

    # 1. Create a virtual environment named .venv
    python -m venv .venv

    # 2. Activate it
    # On macOS / Linux:
    source .venv/bin/activate

    # On Windows (PowerShell):
    .venv\Scripts\Activate.ps1

    # On Windows (Command Prompt):
    .venv\Scripts\activate.bat
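
With the virtual environment active, install the packages that the script below imports. These are the distribution names as published on PyPI at the time of writing (pypdf is needed by PyPDFLoader); adjust as required for your environment:

    pip install langchain-community langchain-text-splitters langchain-google-genai langchain-qdrant qdrant-client langchain-core pypdf
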
  1. Import Required Libraries

    We start by importing the libraries and modules we’ll need. Each one serves a specific purpose in our RAG pipeline.


     from pathlib import Path
     from langchain_community.document_loaders import PyPDFLoader
     from langchain_text_splitters import RecursiveCharacterTextSplitter
     from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
     from langchain_qdrant import QdrantVectorStore
     from qdrant_client import QdrantClient
     from langchain_core.prompts import PromptTemplate
    

    These imports set the stage for loading, processing, and querying our PDF document.

  2. Load the PDF Document

Next, we load the PDF file we want to query. Here, we assume it’s a file named "Python Programming.pdf" located in the same directory as our script.

    pdf_path = Path(__file__).parent / "Python Programming.pdf"
    loader = PyPDFLoader(file_path=pdf_path)
    docs = loader.load()

This step gives us the raw text content of the PDF, ready for further processing.
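
As an optional sanity check, you can confirm how many pages were loaded; PyPDFLoader returns one Document per page, each with page_content and metadata:

    print(f"Loaded {len(docs)} pages from {pdf_path.name}")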

  3. Split the Document into Chunks

PDFs can be long, so we split the text into smaller chunks to make it easier to process and search. This is where RecursiveCharacterTextSplitter comes in.

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2000,
        chunk_overlap=200,
    )
    splitdocs = text_splitter.split_documents(documents=docs)

Splitting the text allows us to work with manageable pieces and improves retrieval accuracy later on.
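
If you are tuning chunk_size and chunk_overlap, a quick optional inspection of the result helps you see what the splitter produced:

    print(f"Split {len(docs)} pages into {len(splitdocs)} chunks")
    print(splitdocs[0].page_content[:300])  # preview the first chunk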

  4. Generate Embeddings for the Chunks

To search the chunks effectively, we convert them into embeddings—numerical representations that capture their meaning. We use Google’s Generative AI embeddings for this.

    embedder = GoogleGenerativeAIEmbeddings(
        model="models/text-embedding-004",
        google_api_key="Insert your google api key"
    )

These embeddings will let us perform similarity searches to find chunks relevant to a user’s query.
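
A note on the API key: rather than hard-coding it as in the snippet above, it is generally safer to read it from an environment variable. A minimal sketch, assuming you have set a GOOGLE_API_KEY variable in your shell:

    import os

    embedder = GoogleGenerativeAIEmbeddings(
        model="models/text-embedding-004",
        google_api_key=os.environ["GOOGLE_API_KEY"],  # assumes this variable is set
    )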

  5. Set Up the Vector Store

We store the embeddings in a Qdrant vector database, which enables fast and efficient similarity searches.

You can either install it directly on your system or run it in Docker. In this example, I'm using Docker.

    services:
      qdrant:
        image: qdrant/qdrant
        ports:
          - "6333:6333"

In terminal:

    docker compose -f docker-compose.yml up

Once the container is up and running, you can connect to Qdrant at http://localhost:6333.

    vector_store = QdrantVectorStore.from_existing_collection(
        url="http://localhost:6333",
        collection_name="learning_langchain",
        embedding=embedder
    )

Note that from_existing_collection assumes the collection already exists in Qdrant. If this is your first time running the script, create and populate the collection first (one way to do this is sketched below), and use from_existing_collection on subsequent runs.
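
For that first run, a minimal sketch using the same URL, collection name, and embedder as above is to let QdrantVectorStore create the collection and insert the chunks in one step:

    vector_store = QdrantVectorStore.from_documents(
        documents=splitdocs,
        embedding=embedder,
        url="http://localhost:6333",
        collection_name="learning_langchain",
    )

If you create the store this way, the chunks are already inserted, so you can skip the add_documents call in Step 6.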

  6. Add Documents to the Vector Store

To populate the vector store with our PDF chunks, we add the split documents and their embeddings.

    vector_store.add_documents(documents=splitdocs)

This step ensures our document content is indexed and searchable. If the collection already contains data, this adds to it.
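
If you rerun the script, calling add_documents again will insert the same chunks a second time. One optional safeguard, using the qdrant_client count API, is to populate the collection only when it is empty:

    client = QdrantClient(url="http://localhost:6333")
    existing = client.count(collection_name="learning_langchain").count
    if existing == 0:
        vector_store.add_documents(documents=splitdocs)
    else:
        print(f"Collection already holds {existing} points; skipping re-indexing.")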

  7. Define the System Prompt

The system prompt instructs the language model on how to respond to queries using the retrieved document chunks. It’s a critical part of ensuring helpful and relevant answers.


    SYSTEM_PROMPT = """
    You are a smart PDF assistant designed to help users understand the content of a PDF document. Your task is to provide accurate, clear, and concise responses based on the user's query and the relevant excerpts from the PDF. Follow these guidelines to ensure your responses are helpful and aligned with the user's intent:

    1. **Understand the Query Type**:
       - If the user asks for a **summary**, provide a high-level overview of the main content, focusing on key points or themes.
       - If the user asks for **specific information** (e.g., "What is [term]?"), locate and present that information directly.
       - If the user asks for an **explanation** (e.g., "Explain [concept]"), provide a clear, general overview first, adding specifics only if requested.
       - If the query is vague, assume a general understanding is desired and respond concisely.

    2. **Use the PDF Excerpts**:
       - Base your response solely on the provided PDF excerpts. Do not add information beyond what’s in the document.
       - If the excerpts lack the requested information, say: "The PDF does not contain this information."

    3. **Tailor the Response**:
       - For **general queries**, prioritize broad, introductory content over technical details.
       - For **specific queries**, focus on the exact details requested, keeping it brief.
       - Synthesize information from multiple excerpts into a single, coherent answer if needed.

    4. **Structure Your Answer**:
       - Start with a short, direct response to the query.
       - Add supporting details or context as appropriate, especially for explanations.
       - Keep responses concise for specific questions and slightly longer for summaries or explanations.

    5. **Ensure Clarity**:
       - Use simple, clear language.
       - Avoid unnecessary jargon unless it’s central to the query and explained.

    If the query is unclear, ask the user for clarification to ensure an accurate response.
    """

This prompt ensures the language model understands its role, uses only the PDF content, and tailors responses appropriately.

  8. Initialize the Language Model

We set up Google’s Generative AI model to generate responses based on our prompts.

    llm = ChatGoogleGenerativeAI(
        model="gemini-2.0-flash",
        google_api_key="Insert your api key"
    )

This model will process our prompts and generate human-like responses.
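
Before wiring up the full loop, you can optionally send a trivial prompt to confirm that the API key and model name are accepted:

    # Quick smoke test: should print a short reply from the model.
    print(llm.invoke("Reply with the single word: ready").content)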

  9. Create an Interactive Chat Loop

The chat loop lets users ask questions about the PDF and get answers in real time.

    while True:
        query = input("Ask a question about the PDF (or type 'exit' to quit): ")
        if query.lower() in ["exit", "quit"]:
            print("Goodbye!")
            break

        # Retrieve top 3 relevant chunks
        retrieved_docs = vector_store.similarity_search(query, k=3)

        # Create context from retrieved chunks
        context = "\n\n...\n\n".join([doc.page_content for doc in retrieved_docs])

        # Construct the full prompt
        full_prompt = (SYSTEM_PROMPT + "\n\nHere are the relevant excerpts from the PDF:\n" +
                       context +
                       "\n\nUser's question: " + query +
                       "\n\nAssistant:")

        # Generate the response
        response = llm.invoke(full_prompt)

        # Display the response
        print("Assistant:", response.content)

This loop ties everything together, making the system interactive.

  10. Define a Prompt Template (Optional)

For cleaner code, we can define a PromptTemplate to structure the prompt dynamically, though it’s not used in the main loop here.

    prompt_template = PromptTemplate(
        input_variables=["query", "excerpts"],
        template=SYSTEM_PROMPT + "\n\nUser Query: {query}\nPDF Excerpts: {excerpts}\nResponse:"
    )

You could modify the chat loop to use this template for more maintainable code, like this:

    full_prompt = prompt_template.format(query=query, excerpts=context)
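
Alternatively, you can compose the template and the model into a small chain with LangChain's pipe operator and invoke it with a dictionary of inputs; this is functionally equivalent to formatting the prompt yourself:

    chain = prompt_template | llm
    response = chain.invoke({"query": query, "excerpts": context})
    print("Assistant:", response.content)
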

Output:

Running the script gives an interactive assistant: for each question, it retrieves the top three relevant text chunks, combines them into a context, and generates an answer with the language model grounded in that context. The optional PromptTemplate keeps the prompt construction readable and maintainable.

Conclusion:

In conclusion, Retrieval-Augmented Generation (RAG) represents a significant advancement in the field of artificial intelligence, particularly in enhancing the capabilities of large language models. By integrating external knowledge bases, RAG addresses the limitations of static training data, ensuring that AI systems can provide accurate, contextually relevant, and up-to-date information. This is especially important in dynamic fields like healthcare, finance, and customer support, where precision and trust are critical. The RAG process, through its indexing and retrieval mechanisms, not only improves the accuracy of AI responses but also builds user trust by offering transparent and verifiable information. As AI continues to evolve, RAG will play a crucial role in enabling more personalized and reliable interactions, paving the way for more sophisticated and trustworthy AI applications.
