2. Implementing RAG


This is the second article in my series, RAG Deep Dive. The goal of this series is to dive deep into the world of RAG & understand it from first principles by actually implementing a scalable, production-ready RAG system.
In the previous article, Introduction to RAG, we discussed what RAG is & how it works. In this article we will implement the simplest possible RAG. The goal of this article is to show you how easy it is to build a basic RAG.
Set Up
Python
Make sure you have Python installed locally, preferably the latest version.
OpenAI
You need to create an account with OpenAI & generate an API key for testing. We will store this API key in a .env file to be used in the code. You can refer to this short YouTube video to learn how to generate an OpenAI API key.
Clone GitHub Repository
GitHub Repository: https://github.com/Niket1997/rag-tutorial
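If you have git installed, you can clone it with the following command.
# clone the tutorial repository
git clone https://github.com/Niket1997/rag-tutorial.git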
Install Dependencies
You also need to install the required dependencies. Open the cloned repository in the IDE of your choice & run the following commands to install them.
# installing uv on mac
brew install uv
# install dependencies
uv pip install .
# or alternatively: uv pip install -r pyproject.toml
Install Docker
We will be using Docker to run the vector database qdrant locally, hence you need to install Docker on your machine. Just Google it.
Run qdrant locally using Docker
To set up qdrant using Docker, we will use the following docker-compose.yml file.
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage

volumes:
  qdrant_data:
You can start the qdrant docker container using the following command.
docker compose -f docker-compose.yml up -d
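Once the container is up, you can verify that qdrant is reachable; it responds with its name & version.
# quick sanity check that qdrant is running
curl http://localhost:6333/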
Create .env file
Create a new file in the cloned repository with the name .env & add the following contents to it.
OPENAI_API_KEY="<your-openai-api-key>"
QDRANT_URL="http://localhost:6333"
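Our code will read these values as environment variables. Here is a minimal sketch of how that works, assuming the python-dotenv package is installed (it is loaded once at startup).
import os

from dotenv import load_dotenv

# load the variables defined in .env into the process environment
load_dotenv()

print(os.getenv("QDRANT_URL"))  # http://localhost:6333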
As mentioned in the previous article, a RAG system has two phases: an ingestion phase & a query phase. Let’s code them one by one.
Ingestion Phase
As mentioned in the Introduction to RAG article, the ingestion phase has the following steps. We will implement them one by one.
Load Data
Chunk Data
Generate Vector Embeddings for Individual Chunks
Store Vector Embeddings for Chunks in Vector Database
Load Data
LangChain provides loaders for different types of data, as mentioned in the documentation here. In our example, we want to load PDF data into our RAG system, hence we will be using PyPDFLoader. You can find its documentation here. You need the packages langchain_community & pypdf for this.
The docs variable below will hold an array of pages. Every element in this array contains the contents of a particular page (in order).
from langchain_community.document_loaders import PyPDFLoader
file_path = "./demo.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()
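To sanity-check the load, you can inspect how many pages were read & peek at the first one.
# each element of docs corresponds to one page of the PDF
print(f"loaded {len(docs)} pages")
print(docs[0].metadata)            # source file, page number, etc.
print(docs[0].page_content[:200])  # first 200 characters of page 1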
Chunk Data
A single page can contain a large amount of data, hence we need to chunk the data in docs. This can be achieved using text splitters. In our case we will be using RecursiveCharacterTextSplitter. You can read more about it here.
from langchain_text_splitters import RecursiveCharacterTextSplitter

def get_text_splitter():
    return RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )

text_splitter = get_text_splitter()
chunks = text_splitter.split_documents(docs)
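Each chunk is itself a document of at most 1000 characters, & consecutive chunks share 200 characters of overlap so that a sentence cut at a chunk boundary still appears whole in at least one chunk. You can verify this quickly:
print(f"split {len(docs)} pages into {len(chunks)} chunks")
# chunk_size is an upper bound on the chunk length
print(max(len(chunk.page_content) for chunk in chunks))  # <= 1000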
Generate & Store Vector Embeddings
We need to generate vector embeddings for each chunk. We will use OpenAI’s text-embedding-3-small embedding model. Refer to the previous article in this series to learn more about vector embeddings. You need the package langchain-openai for this.
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
)
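Each embedding produced by text-embedding-3-small is a 1536-dimensional vector; this is why we will configure the qdrant collection with size=1536 below. A quick check:
# embed a sample query & confirm the vector dimensionality
vector = embeddings.embed_query("hello world")
print(len(vector))  # 1536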
Next, we define certain functions & variables that we will use to interact with qdrant. You need the package langchain-qdrant (which also pulls in qdrant-client) for this.
import os

from dotenv import load_dotenv
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# load OPENAI_API_KEY & QDRANT_URL from the .env file
load_dotenv()

# create qdrant client
qdrant_client = QdrantClient(
    url=os.getenv("QDRANT_URL"),
)

# check if collection exists
def collection_exists(collection_name: str):
    return qdrant_client.collection_exists(collection_name)

# create a collection if it doesn't exist
def create_collection_if_not_exists(collection_name: str):
    if not collection_exists(collection_name):
        # note: the dimension 1536 corresponds to the embedding model
        # we chose, which is text-embedding-3-small
        qdrant_client.create_collection(
            collection_name=collection_name,
            vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
        )
        print(f"Collection {collection_name} created")
    else:
        print(f"Collection {collection_name} already exists")

# get the qdrant vector store for a collection
def get_vector_store(collection_name: str):
    return QdrantVectorStore(
        collection_name=collection_name,
        client=qdrant_client,
        embedding=embeddings,
    )

# derive the collection name from the file name
def get_collection_name(file_name: str):
    return f"rag_collection_{file_name.split('/')[-1].split('.')[0]}"
We will use these helpers along with the code above to generate & store vector embeddings for the PDF document.
# get the name of the collection in qdrant based on the file
collection_name = get_collection_name(file_path)

# create the collection in qdrant if it doesn't exist
create_collection_if_not_exists(collection_name=collection_name)

# this will create a vector store & assign the OpenAI embeddings to it
vector_store = get_vector_store(collection_name=collection_name)

# this will generate embeddings for the chunks & add them to the vector store
vector_store.add_documents(documents=chunks)
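Putting the ingestion phase together, here is a minimal end-to-end sketch using the loader, splitter, & helpers defined above (the actual structure in the repo may differ).
def ingest_pdf(file_path: str):
    # load the PDF into per-page documents
    docs = PyPDFLoader(file_path).load()
    # split the pages into overlapping chunks
    chunks = get_text_splitter().split_documents(docs)
    # make sure the collection exists, then embed & store the chunks
    collection_name = get_collection_name(file_path)
    create_collection_if_not_exists(collection_name=collection_name)
    vector_store = get_vector_store(collection_name=collection_name)
    vector_store.add_documents(documents=chunks)
    print(f"ingested {len(chunks)} chunks into {collection_name}")

ingest_pdf("./demo.pdf")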
Query Phase
Now that we have ingested the PDF document into our qdrant vector database, let’s see how we can fetch the chunks relevant to a user’s query from it using similarity search, or, as defined in the Introduction to RAG article, semantic search.
Generate Vector Embeddings for Query
Let’s begin by writing a system prompt that we will use to provide instructions to the LLM, in our case OpenAI’s latest model gpt-4.1.
system_prompt = """
You are a helpful AI assistant that can answer user's questions based on the documents provided.
If there aren't any related documents, or if the user's query is not related to the documents, then you can provide the answer based on your knowledge. Think carefully before answering the user's question.
"""
Now, we will generate vector embeddings for the user’s query & find the chunks of documents relevant to it in our vector database. Here, we first check if the collection exists in our vector database, & if it does, we fetch the chunks whose similarity score is at least our threshold (0.5 out of 1) & add them to our system prompt.
# keep only the chunks that have a similarity score of at least 0.5 out of 1
SIMILARITY_THRESHOLD = 0.5

collection_name = get_collection_name(file_path)
if collection_exists(collection_name):
    vector_store = get_vector_store(collection_name)
    # get the top 5 documents along with their similarity scores
    # (with cosine distance, a higher score means a closer match)
    docs = vector_store.similarity_search_with_score(query, k=5)
    for doc, score in docs:
        if score >= SIMILARITY_THRESHOLD:
            system_prompt += f"""
Document: {doc.page_content}
"""
Now we will define the LLM client that communicates with OpenAI & pass it the above system prompt, which now contains the context relevant to the user’s query, along with the query itself to get a more refined & relevant answer.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4.1",
)

messages = [("system", system_prompt), ("user", query)]
response = llm.invoke(messages)
print(f"response: {response.content}")
And that’s all, we just built our first RAG from scratch. Just run the main.py file in the 1_implementing_basic_rag directory and you can interact with the RAG.
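To give you a feel for how the pieces fit together, here is a sketch of a minimal interactive loop (the actual main.py in the repo may be organized differently; BASE_SYSTEM_PROMPT here is a hypothetical constant holding the system prompt from above).
def answer(query: str, file_path: str) -> str:
    # start from the base instructions & append relevant context
    system_prompt = BASE_SYSTEM_PROMPT  # hypothetical: the system prompt defined earlier
    collection_name = get_collection_name(file_path)
    if collection_exists(collection_name):
        vector_store = get_vector_store(collection_name)
        for doc, score in vector_store.similarity_search_with_score(query, k=5):
            if score >= SIMILARITY_THRESHOLD:
                system_prompt += f"\nDocument: {doc.page_content}\n"
    response = llm.invoke([("system", system_prompt), ("user", query)])
    return response.content

while True:
    query = input("Ask a question (or type 'exit'): ")
    if query.strip().lower() == "exit":
        break
    print(answer(query, "./demo.pdf"))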
I am attaching a screenshot of one run of our basic RAG application.
So that’s it for this one. Hope you liked this article on implementing a basic RAG from scratch! In the next set of articles, we will discuss how to optimize our RAG application to make it production-ready. There are various techniques used in production-ready RAG applications to make them performant & efficient at scale. Stay tuned to learn more about them.
If you have questions/comments, then please feel free to comment on this article.