Deep Dive Into RAG Race


RAG Flow with Project
Chapter 2: Complete RAG Flow
Introduction
Quick Recap of Chapter 1
In the last chapter, you learned the basic theory of RAG, including what it is, why it's useful, and how it works. RAG combines the power of search and generation to create smarter, more useful AI applications.
Why do we actually need RAG?
Let's imagine you have a large service with a specific set of data. Now, LLMs like ChatGPT or Gemini are trained on huge amounts of general data from across the internet. But what if you want to build a chat app or web service that answers only based on your own documents, like your platform's internal content or personal files?
There are two ways to do this.
The first option is fine-tuning, where you train the model again using your specific data. But this takes a lot of time, needs strong hardware, and can be expensive.
Instead, a better and more efficient method is RAG.
You might wonder, why not just send all the data along with the user's question? The problem is that LLMs have a context window, which limits how much information they can process at once. If your data is large, you simply can't fit everything in.
This is where RAG comes in. Instead of passing all the data, RAG searches and selects only the most relevant parts of your data and sends that along with the user's question to the model. This way, the model gives answers based only on the information you care about.
Now the question is, how is this possible? Let's understand the flow!
RAG Flow
Let's break down the complete flow of how Retrieval-Augmented Generation (RAG) works behind the scenes (a small code sketch of this flow follows the list):
Data Source: You start with your own data. It can be PDFs, text files, website content, or any other knowledge base.
Chunking: Since raw documents can be too large to process directly, the data is split into smaller parts (chunks). This ensures better understanding and handling.
Embedding: Each chunk is converted into a numerical format (called an embedding) that captures its meaning in vector form.
Store in Vector Space: All the embeddings are saved in a vector database like Qdrant, Pinecone, etc., so they can be searched efficiently.
User Query: The user enters a question or prompt.
Query Embedding: The user's question is also turned into an embedding using the same method as the documents.
Search in Vector Space: The system compares the query embedding with the stored document embeddings to find the most relevant chunks.
Retrieve Relevant Chunks: Only the top-matching chunks are selected from the vector store.
Model (LLM): A language model like Gemini or GPT takes both the query and the retrieved chunks.
Generate Output: The model uses the combined information (query + relevant chunks) to generate a final, accurate answer.
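To make this concrete, here is a minimal, library-agnostic sketch of the same flow in Python. Everything in it (embed, vector_store, llm_generate) is a placeholder standing in for a real embedding model, vector database, and LLM; the actual project later in this chapter uses LangChain, Qdrant, and Gemini for those pieces.

# A minimal sketch of the RAG flow. embed(), vector_store, and llm_generate()
# are placeholders, not real library calls.

def build_index(chunks, embed, vector_store):
    # Indexing: embed each chunk and store the vector alongside its text
    for chunk in chunks:
        vector_store.add(vector=embed(chunk), text=chunk)

def answer(question, embed, vector_store, llm_generate, k=3):
    # Query embedding: use the same embedding model as for the chunks
    query_vector = embed(question)
    # Search: find the k stored chunks closest to the query vector
    top_chunks = vector_store.search(query_vector, limit=k)
    # Generation: give the LLM only the relevant chunks plus the question
    prompt = "Context:\n" + "\n\n".join(top_chunks) + "\n\nQuestion: " + question
    return llm_generate(prompt)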
This sequence, from data source to generated output, is what we call the RAG chain.
RAG in Action: Build a Resume Chatbot
Introduction to the project
In this section, we'll bring RAG to life by building a chatbot that can answer questions based on the content of a resume PDF. We'll use LangChain, Qdrant, and Gemini to create a working example of how RAG is used in real-world applications.
Prerequisites
Install Required Packages
Run this command in your terminal to install all the required packages:
pip install langchain langchain-community langchain-google-genai langchain-qdrant qdrant-client pypdf
Create Docker Compose File
We need a Qdrant vector database running locally, and the easiest way to do that is with Docker Compose.
Create a docker-compose.db.yml file:

services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - 6333:6333
Install Docker
Qdrant runs locally as a vector database, so make sure Docker is installed and running.
Then run this command:
docker-compose -f docker-compose.db.yml up
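Once the container is up, you can optionally confirm Qdrant is reachable by calling its HTTP API (this endpoint lists collections; on a fresh instance the list is simply empty):

curl http://localhost:6333/collections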
Gemini API Key
You'll need a Gemini API key to use Google's Generative AI.
Get it from Google AI Studio and set it in your code:

GEMINI_API_KEY = "your-gemini-api-key-here"
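Hardcoding the key is fine for a quick demo, but if you prefer not to keep it in the file, a common pattern is to export it in your shell and read it with os.getenv (just an alternative, not required for this tutorial):

import os

# Assumes you ran: export GEMINI_API_KEY="your-gemini-api-key-here"
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")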
All set? Now we can move ahead.
Code Game
Create a resume_rag.py file.
Step 1: Import all the required dependencies
# Import dependencies
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_qdrant import QdrantVectorStore
Step 2: Define Your Gemini API Key
GEMINI_API_KEY = "your_gemini_api_key"
Step 3: Load PDF
Make sure the PDF file is in the same directory as this script.
pdf_path = Path(__file__).parent / "resume.pdf"
loader = PyPDFLoader(file_path=pdf_path)
docs = loader.load()

# Show the loaded data (optional; you can skip this)
print(docs[0])
Here, resume.pdf is your resume file.
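For context, loader.load() returns a list of LangChain Document objects, typically one per PDF page, each with page_content (the text) and metadata (source file, page number). An optional inspection step might look like this:

# Optional: inspect what the loader produced
print(len(docs))                    # number of pages loaded
print(docs[0].metadata)             # e.g. source file and page number
print(docs[0].page_content[:200])   # first 200 characters of the first page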
Step 4: Split the Document into Chunks
As we discussed in the RAG flow, we need to split the document into chunks because of the model's context window, so that's what we do here.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Max size of each chunk
    chunk_overlap=200   # How much each chunk overlaps with the next
)
split_docs = text_splitter.split_documents(documents=docs)
chunk_size=1000 means each chunk will be around 1000 characters long. chunk_overlap=200 helps the model keep context between chunks, so no important info is lost when it transitions from one chunk to the next.
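If you're curious how the splitter behaved on your particular resume, a quick optional check shows how many chunks were produced and what one looks like:

# Optional: inspect the chunks
print(f"Created {len(split_docs)} chunks")
print(split_docs[0].page_content[:300])   # preview of the first chunk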
Step 5: Embedding
This converts the text into vector representations that can be stored in a vector database for efficient retrieval later.
# Embedding
embedder = GoogleGenerativeAIEmbeddings(
    model='models/text-embedding-004',
    google_api_key=GEMINI_API_KEY
)
model='models/text-embedding-004' is the embedding model we use.
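If you want to see an embedding for yourself, embed_query() turns a single string into a plain list of floats. The exact vector length depends on the model, so treat the printed size as informational:

# Optional: sanity-check the embedder
vector = embedder.embed_query("What databases does this candidate know?")
print(len(vector))    # dimensionality of the embedding
print(vector[:5])     # first few values of the vector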
Step 6: Store Data on Vector Space
This step, adding the documents to the vector store, should only be performed once. It stores the embedded documents in the Qdrant vector database for later retrieval.
vector_store = QdrantVectorStore.from_documents(
    documents=[],                          # Will create the collection automatically on Qdrant
    url="http://localhost:6333",           # URL of the Qdrant instance
    collection_name="learning_langchain",  # Name of the collection
    embedding=embedder                     # Embedding model used for the documents
)
vector_store.add_documents(documents=split_docs)
After running this once, you can comment out this block so you don't re-insert the same documents on every run!
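Commenting the block in and out works, but if you prefer, a simple boolean flag keeps the script runnable either way (INDEX_DOCS is just an illustrative name, not anything from LangChain):

INDEX_DOCS = False  # set to True only for the first run

if INDEX_DOCS:
    vector_store = QdrantVectorStore.from_documents(
        documents=[],
        url="http://localhost:6333",
        collection_name="learning_langchain",
        embedding=embedder
    )
    vector_store.add_documents(documents=split_docs)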
Step 7: Create Retriever
This step is crucial as it connects to the Qdrant vector store (which you set up earlier) and allows you to perform searches on the data you previously stored.
retriever = QdrantVectorStore.from_existing_collection(
    url="http://localhost:6333",           # URL of Qdrant
    collection_name="learning_langchain",  # Name of the collection
    embedding=embedder
)
Step 8: Create the Chat Model
This model is used to generate the final response from the query plus the retrieved data.
chat_model = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash-001",   # Free Gemini model
    google_api_key=GEMINI_API_KEY   # Your API key
)
Step 9: Final step
In Step 9, you're setting up the interactive loop where users can continuously ask questions about the resume, and the system will generate relevant answers.
while True:
    user_query = input("\nAsk something about the resume (or 'exit'): ")
    if user_query.lower() == "exit":
        break

    # Do a similarity search in your vector DB
    results = retriever.similarity_search(query=user_query, k=3)

    # Join the retrieved chunks into a single context string
    context = "\n\n".join([doc.page_content for doc in results])

    # Build the prompt (context + user_query) to get an accurate answer
    prompt = f"Given the following context from a resume:\n\n{context}\n\nAnswer this: {user_query}"

    # Use Gemini to generate the final answer
    answer = chat_model.invoke(prompt)
    print(f"\nAnswer:\n{answer.content}")
In this step, we perform a similarity search in the vector DB.
What is k=3? It means you're asking the vector database to return the top 3 most relevant document chunks, based on the similarity between the user's query embedding and the stored embeddings.
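If you want to see why those particular chunks were picked, most LangChain vector stores, including the Qdrant one, also offer a scored variant of the search. The score scale depends on the configured distance metric, so read it as relative rather than absolute:

# Optional: inspect retrieved chunks together with their similarity scores
scored = retriever.similarity_search_with_score(query="What projects are listed?", k=3)
for doc, score in scored:
    print(round(score, 3), doc.page_content[:80])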
What does invoke() do? It converts the string into a chat message format (like {role: "user", content: "Tell me a joke"}), sends it to the LLM, receives the response, and returns a message object (usually an AIMessage), as the quick example below shows.
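As a standalone illustration (independent of the resume loop), a call like this returns a message object whose content attribute holds the text; the actual reply will of course vary:

reply = chat_model.invoke("Tell me a joke")
print(type(reply).__name__)   # typically AIMessage
print(reply.content)          # the model's text reply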
With this, we have completed all the steps of RAG.
Finally, you can run this command:
Windows:
python resume_rag.py
Mac or Linux:
python3 resume_rag.py
Outcome
Summary
We explored RAG in depth and saw how to use LangChain, Qdrant, and Gemini to build a working project.
We broke down the RAG flow, understood why chunking is important, and learned how to connect everything using the RAG chain.
In the end, we saw how easy and powerful it is to build smart, document-aware apps with RAG.
Chai, Code & Gratitude
Before we move ahead, a huge thanks to the people behind the scenes, Hitesh Choudhary and Piyush Garg, for their constant inspiration and guidance. Your mentorship means the world!
Up Next!!
In the next chapter, we'll understand advanced RAG techniques in depth.
Written by Ruturaj Bayad