Deep Dive Into RAG Race šŸ

Ruturaj Bayad
7 min read

RAG Flow with Project

šŸ“¢
Haven’t read Chapter 1 yet? You can find it here: Intro to the RAG Race šŸ

Chapter 2: Complete RAG Flow


Introduction

  • Quick Recap of Chapter 1

In the last chapter, you learned the basic theory of RAG, including what it is, why it’s useful, and how it works. RAG combines the power of search and generation to create smarter, more useful AI applications.

  • Why do we actually need RAG?

Let's imagine you have a large service with a specific set of data. Now, LLMs like ChatGPT or Gemini are trained on huge amounts of general data from across the internet. But what if you want to build a chat app or web service that answers only based on your own documents, like your platform's internal content or personal files?

There are two ways to do this.

The first option is fine-tuning, where you train the model again using your specific data. But this takes a lot of time, needs strong hardware, and can be expensive.

Instead, a better and more efficient method is RAG.

You might wonder, why not just send all the data along with the user's question? The problem is that LLMs have a context window, which limits how much information they can process at once. If your data is large, you simply can't fit everything in.

This is where RAG comes in. Instead of passing all the data, RAG searches and selects only the most relevant parts of your data and sends that along with the user’s question to the model. This way, the model gives answers based only on the information you care about.

Now the question is: how is this possible? Let’s walk through the flow!


RAG Flow

Let’s break down the complete flow of how Retrieval-Augmented Generation (RAG) works behind the scenes:

  1. Data Source: You start with your own data. It can be PDFs, text files, website content, or any other knowledge base.

  2. Chunking: Since raw documents can be too large to process directly, the data is split into smaller parts (chunks). This ensures better understanding and handling.

  3. Embedding: Each chunk is converted into a numerical format (called an embedding) that captures its meaning in vector form.

  4. Store in Vector Space: All the embeddings are saved in a vector database like Qdrant, Pinecone, etc., so they can be searched efficiently.

  5. User Query: The user enters a question or prompt.

  6. Query Embedding: The user’s question is also turned into an embedding using the same method as the documents.

  7. Search in Vector Space: The system compares the query embedding with the stored document embeddings to find the most relevant chunks.

  8. Retrieve Relevant Chunks: Only the top-matching chunks are selected from the vector store.

  9. Model (LLM): A language model like Gemini or GPT takes both the query and the retrieved chunks.

  10. Generate Output: The model uses the combined information (query + relevant chunks) to generate a final, accurate answer.

Here you can see the RAG chain. The sketch below walks through the same steps in code.
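If the chain is easier to see as code, here is a tiny, self-contained sketch of the same flow in plain Python. It is only a toy: the embedding step is faked with a set of words, similarity is just word overlap, and the sample chunks and query are made up for illustration. The real pipeline we build below uses proper vector embeddings.

# Toy sketch of the RAG chain (illustration only, not the real pipeline)
chunks = [
    "The platform supports refunds within 30 days of purchase.",
    "Support tickets are answered within one business day.",
    "The mobile app is available on Android and iOS.",
]

def embed(text: str) -> set[str]:
    # Toy "embedding": the set of lowercase words in the text
    return set(text.lower().split())

index = [(chunk, embed(chunk)) for chunk in chunks]       # toy "vector store"

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)                                      # embed the query the same way
    ranked = sorted(index, key=lambda item: len(q & item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]             # top-k most similar chunks

query = "How long do refunds take?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}"      # query + relevant chunks
print(prompt)  # in a real app, this prompt goes to the LLM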

RAG in Action: Build a Resume Chatbot

Introduction of project

  • In this section, we'll bring RAG to life by building a chatbot that can answer questions based on the content of a resume PDF. We'll use LangChain, Qdrant, and Gemini to create a working example of how RAG is used in real-world applications.

Prerequisites

  • Install Required Packages

    Run this command in your terminal to install all the required packages:

      pip install langchain langchain-community langchain-text-splitters langchain-google-genai langchain-qdrant qdrant-client pypdf
    
  • Create Docker Compose File

    We need a Qdrant vector database running locally, and the easiest way to do that is with Docker Compose.

    Create a docker-compose.db.yml file

      services:
        qdrant:
          image: qdrant/qdrant
          ports:
            - 6333:6333
    
  • Install Docker

    Qdrant runs locally as a vector database, so make sure Docker is installed and running.

    Then start Qdrant with this command:

      docker-compose -f docker-compose.db.yml up
    
  • Gemini API Key

    You’ll need a Gemini API Key to use Google’s Generative AI.
    Get it from Google AI Studio and set it in your code:

      GEMINI_API_KEY = "your-gemini-api-key-here"
    

All set? Let’s move ahead.

Code Game

Create a resume_rag.py file.

Step 1: Import all the required dependencies

# Import dependencies
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_qdrant import QdrantVectorStore

Step 2: Define Your Gemini API Key

GEMINI_API_KEY = "your_gemini_api_key"
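Hardcoding the key is fine for a quick demo, but a safer pattern is to read it from an environment variable. Here is a small optional sketch, assuming you have exported GEMINI_API_KEY in your shell:

import os

# Read the key from the environment instead of hardcoding it in the script
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
if not GEMINI_API_KEY:
    raise RuntimeError("Set the GEMINI_API_KEY environment variable first.")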

Step 3: Load PDF

Make sure the PDF file is in the same directory as this script.

pdf_path = Path(__file__).parent / "resume.pdf"
loader = PyPDFLoader(file_path=pdf_path)

docs = loader.load()

# Preview the loaded data (optional; you can skip this)
print(docs[0])

Here, resume.pdf is your own resume file.
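If you want to sanity-check what PyPDFLoader returned, you can inspect the list of Document objects; each one carries page_content and metadata. This optional snippet just prints a quick preview:

# Optional: inspect the loaded documents (PyPDFLoader returns one Document per page)
print(len(docs))                   # number of pages loaded
print(docs[0].metadata)            # source file and page number
print(docs[0].page_content[:300])  # first 300 characters of page 1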

Step 4: Split the Document into Chunks

As we discussed in the RAG flow, the document has to be split into chunks because of the model’s context window, so that’s what we do here.

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Max size of each chunk
    chunk_overlap=200       # How much each chunk overlaps with the next
)

split_docs = text_splitter.split_documents(documents=docs)
  • chunk_size=1000 means each chunk will be around 1000 characters long.

  • chunk_overlap=200 helps keep context between chunks, so no important information is lost at the boundary from one chunk to the next (the optional check below makes this overlap visible).
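To see chunking in action, you can print how many chunks were produced and compare the end of one chunk with the start of the next; the 200-character overlap should show up. This check is optional:

# Optional: verify chunking and the overlap between consecutive chunks
print(f"Total chunks: {len(split_docs)}")
if len(split_docs) > 1:
    print("End of chunk 1:\n", split_docs[0].page_content[-200:])
    print("Start of chunk 2:\n", split_docs[1].page_content[:200])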

Step 5: Embedding

This converts the text into vector representations that can be stored in a vector database for efficient retrieval later.

# Embedding
embedder = GoogleGenerativeAIEmbeddings(
            model='models/text-embedding-004', 
            google_api_key=GEMINI_API_KEY
          )
  • model='models/text-embedding-004' is the embedding model we use.
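If you are curious what an embedding actually looks like, the embedder can turn a single string into a vector of floats. This optional snippet makes one API call; for text-embedding-004 the vector should have 768 dimensions (check Google’s docs for your model version):

# Optional: peek at one embedding vector (makes a single API call)
sample_vector = embedder.embed_query("Experienced backend developer")
print(len(sample_vector))   # vector dimension (768 for text-embedding-004)
print(sample_vector[:5])    # first few values of the vector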

Step 6: Store Data on Vector Space

This step, adding the documents to the vector store, should only be performed once. This step is responsible for storing the embedded documents in the Qdrant vector database for later retrieval.

vector_store = QdrantVectorStore.from_documents(
       documents=[],  # Creates the collection on Qdrant automatically
       url="http://localhost:6333",  # URL of the Qdrant instance
       collection_name="learning_langchain",  # Name of the collection
       embedding=embedder  # Embedding model used to vectorize the chunks
)

vector_store.add_documents(documents=split_docs)

After this code has run once, you can comment these lines out so the documents aren’t re-indexed on every run (or guard them behind a flag, as shown below).
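Instead of commenting the lines out by hand, one convenient pattern (just a suggestion, not part of the original flow) is to guard the indexing step behind a flag:

# Run the script once with INDEX_DOCS = True, then flip it to False
INDEX_DOCS = False

if INDEX_DOCS:
    vector_store = QdrantVectorStore.from_documents(
        documents=[],
        url="http://localhost:6333",
        collection_name="learning_langchain",
        embedding=embedder,
    )
    vector_store.add_documents(documents=split_docs)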

Step 7: Create Retriever

This step is crucial as it connects to the Qdrant vector store (which you set up earlier) and allows you to perform searches on the data you previously stored.

retriever = QdrantVectorStore.from_existing_collection(
    url="http://localhost:6333",  # URL of the Qdrant instance
    collection_name="learning_langchain",  # Name of the collection
    embedding=embedder
)
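Before wiring up the chat loop, you can run a quick optional check that the retriever actually finds something in the collection:

# Optional: confirm that retrieval returns stored chunks
hits = retriever.similarity_search("work experience", k=2)
for hit in hits:
    print(hit.page_content[:120], "...")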

Step 8: Create the Chat Model

This model generates the final response from the query plus the retrieved context.

chat_model = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash-001",  # Free Gemini Model
    google_api_key=GEMINI_API_KEY  # Your Api Key
)

Step 9: Final step

In Step 9, you're setting up the interactive loop where users can continuously ask questions about the resume, and the system will generate relevant answers.

while True:
    user_query = input("\nšŸ¤– Ask something about the resume (or 'exit'): ")
    if user_query.lower() == "exit":
        break

    # Do a similarity search in your vector DB
    results = retriever.similarity_search(query=user_query, k=3)

    # Optional: use Gemini to generate final answer
    context = "\n\n".join([doc.page_content for doc in results])

    # Add Prompt to get accurate answer (context + user_query)
    prompt = f"Given the following context from a resume:\n\n{context}\n\nAnswer this: {user_query}"

    answer = chat_model.invoke(prompt)

    print(f"\n🧠 Answer:\n{answer.content}")

In this step, we perform a similarity search in the vector DB and pass the results to the model.

  • What is k=3?

    It means you’re asking the vector database to return the top 3 most relevant document chunks, ranked by the similarity between the user’s query embedding and the stored embeddings (the optional sketch after this list shows these scores).

  • invoke()

    • Converts the string into a chat message format (like {role: "user", content: "Tell me a joke"}).

    • Sends it to the LLM.

    • Receives the response.

    • Returns a ChatMessage object (usually AIMessage).
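If you want to see how relevant the top matches actually are, and what invoke() hands back, here is a small optional sketch. similarity_search_with_score returns (document, score) pairs, and invoke() returns a message object whose text lives in .content:

# Optional: inspect retrieval scores and the raw model response
scored = retriever.similarity_search_with_score("education", k=3)
for doc, score in scored:
    print(round(score, 3), doc.page_content[:80])

response = chat_model.invoke("Reply with the single word: ready")
print(type(response).__name__)  # usually AIMessage
print(response.content)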

With this, we have completed all the steps of the RAG flow.

Finally, you can run this command:

Windows:

python resume_rag.py

Mac or Linux:

python3 resume_rag.py

Outcome

šŸ’”
GitHub Repo → Basic RAG

Summary

We explored RAG in depth and saw how to use LangChain, Qdrant, and Gemini to build a working project.

We broke down the RAG flow, understood why chunking is important, and learned how to connect everything using the RAG chain.

In the end, we saw how easy and powerful it is to build smart, document-aware apps with RAG.

Chai, Code & Gratitude

Before we move ahead, a huge thanks to the people behind the scenes, Hitesh Choudhary and Piyush Garg, for their constant inspiration and guidance. Your mentorship means the world!

Up Next!!

In the next chapter, we’ll understand advanced RAG techniques in depth.


Written by

Ruturaj Bayad

Hello, I am Ruturaj Bayad, and I write code, indeed.