2. Implementing RAG

This is the second article in my series, RAG Deep Dive. The goal of this series is to dive deep into the world of RAG & understand it from first principles by actually implementing a scalable, production-ready RAG system.

In the previous article, Introduction to RAG, we discussed what RAG is & how it works. In this article we will implement the simplest possible RAG. The goal is to show you how easy it is to build a basic RAG.

Set Up

Python

Make sure you have Python installed locally, preferably the latest version.

OpenAI

You need to create an account with OpenAI & generate an API key for testing. We will store this API key in a .env file to be used in the code. You can refer to this short YouTube video to learn how to generate an OpenAI API key.

Clone GitHub Repository

GitHub Repository: https://github.com/Niket1997/rag-tutorial
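You can clone the repository locally with:

# clone the tutorial repository & move into it
git clone https://github.com/Niket1997/rag-tutorial.git
cd rag-tutorial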

Install Dependencies

You also need to install the required dependencies. Open the cloned repository in the IDE of your choice & run the following commands.

# installing uv on mac
brew install uv 

# install dependencies
uv pip install .
## or alternatively, uv pip install -r pyproject.toml

Install Docker

We will be using Docker to run the vector database qdrant locally, hence you need Docker installed on your machine.
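If you're on macOS, one option (mirroring the uv install above) is to install Docker Desktop via Homebrew; any other Docker installation works just as well.

# installing Docker Desktop on mac
brew install --cask docker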

Run qdrant locally using Docker

To run qdrant using Docker, we will use the following docker-compose.yml file.

services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage

volumes:
  qdrant_data:

You can start the qdrant Docker container using the following command.

docker compose -f docker-compose.yml up -d

Create .env file

Create a new file in the cloned repository with the name .env & add the following contents to it.

OPENAI_API_KEY="<your-openai-api-key>"
QDRANT_URL="http://localhost:6333"

As mentioned in the previous article, a RAG system has two phases: the ingestion phase & the query phase. Let’s code them one by one.

💡
We will be using the LangChain framework in this tutorial to build our basic RAG. LangChain is a widely used open-source framework for building applications on top of Large Language Models (LLMs). You can read more about LangChain here.

Ingestion Phase

As mentioned in the Introduction to RAG article, the ingestion phase has the following steps. We will implement them one by one.

  1. Load Data

  2. Chunk Data

  3. Generate Vector Embeddings for Individual Chunks

  4. Store Vector Embeddings for Chunks in Vector Database

Load Data

LangChain provides loaders for different types of data, as mentioned in the documentation here. In our example, we want to load PDF data into our RAG system, hence we will be using PyPDFLoader. You can find its documentation here. You need the packages langchain_community & pypdf for this.

The docs variable below will hold an array of pages. Every element in this array contains the contents of one page, in order.

from langchain_community.document_loaders import PyPDFLoader

file_path = "./demo.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()
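To see what the loader actually returned, a quick sanity check like the one below (assuming demo.pdf exists in the repository root) prints the number of pages, the metadata of the first page, & a snippet of its text.

# quick sanity check: every element of docs is one page of the PDF
print(f"pages loaded: {len(docs)}")
print(docs[0].metadata)            # source file & page number
print(docs[0].page_content[:200])  # first 200 characters of the first page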

Chunk Data

A single page can contain a large amount of data, hence we need to chunk the documents in docs. This can be achieved using text splitters. In our case we will use RecursiveCharacterTextSplitter. You can read more about it here.

from langchain_text_splitters import RecursiveCharacterTextSplitter

def get_text_splitter():
    return RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )

text_splitter = get_text_splitter()
chunks = text_splitter.split_documents(docs)
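With chunk_size=1000 & chunk_overlap=200, each chunk holds roughly up to 1,000 characters & shares about 200 characters with its neighbour, so context isn’t lost at chunk boundaries. A quick way to see the effect:

# each chunk is itself a Document & keeps the metadata of the page it came from
print(f"chunks created: {len(chunks)}")
print(max(len(chunk.page_content) for chunk in chunks))  # roughly bounded by chunk_size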

Generate & Store Vector Embeddings

We need to generate vector embeddings for each chunk. We will use OpenAI’s text-embedding-3-small embedding model. Refer to the previous article in this series to learn more about vector embeddings. You need the package langchain-openai for this.

from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings

# load the OPENAI_API_KEY & QDRANT_URL defined in the .env file
load_dotenv()

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
)
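If you want to confirm the vector size this model produces (this makes a single API call), you can embed a sample string. The 1536 dimensions you see here are exactly what we will configure the qdrant collection with below.

# optional: embed a sample string & check the vector dimension
sample_vector = embeddings.embed_query("What is RAG?")
print(len(sample_vector))  # 1536 for text-embedding-3-small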

We need to define certain functions & variables that we will use to interact with qdrant. You need the packages langchain-qdrant & qdrant-client for this.

import os

from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# create qdrant client
qdrant_client = QdrantClient(
    url=os.getenv("QDRANT_URL"),
)

# create a collection if it doesn't exist
def create_collection_if_not_exists(collection_name: str):
    # check if collection exists
    if not collection_exists(collection_name):
        # create the collection if it doesn't exist
        # note: the dimension 1536 corresponds to the embedding model we chose,
        # which is text-embedding-3-small
        qdrant_client.create_collection(
            collection_name=collection_name,
            vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
        )
        print(f"Collection {collection_name} created")
    else:
        print(f"Collection {collection_name} already exists")

# check if collection exists
def collection_exists(collection_name: str):
    return qdrant_client.collection_exists(collection_name)

# get the qdrant vector store for collection
def get_vector_store(collection_name: str):
    return QdrantVectorStore(
        collection_name=collection_name,
        client=qdrant_client,
        embedding=embeddings,
    )

# get the collection name
def get_collection_name(file_name: str):
    return f"rag_collection_{file_name.split('/')[-1].split('.')[0]}"

We will use these helpers along with the code above to generate & store vector embeddings for the PDF document.

# get the name of the collection in qdrant db based on the file
collection_name = get_collection_name(file_path)

# create the collection in qdrant db if it does not exist
create_collection_if_not_exists(collection_name=collection_name)

# this will create a vector store & assign the OpenAI embeddings to it
vector_store = get_vector_store(collection_name=collection_name)

# this will generate the embeddings for the chunks & add them to the vector store
vector_store.add_documents(documents=chunks)
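To verify that ingestion worked, you can count the points stored in the collection directly via the qdrant client; the number should match the number of chunks we created.

# optional: confirm the chunks landed in qdrant
result = qdrant_client.count(collection_name=collection_name, exact=True)
print(f"points stored in {collection_name}: {result.count}")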

Query Phase

Now that we have ingested the PDF document into our qdrant vector database, let’s see how we can use it to retrieve the chunks of data relevant to a user’s query using similarity search (or, as described in the Introduction to RAG article, semantic search).

Generate Vector Embeddings for Query

Let’s begin by writing a system prompt that we will use to provide instructions to the LLM, in our case OpenAI’s gpt-4.1.

system_prompt = """
    You are a helpful AI assistant that can answer user's questions based on the documents provided.
    If there aren't any related documents, or if the user's query is not related to the documents, then you can provide the answer based on your knowledge.        Think carefully before answering the user's question.
    """

Now we will generate vector embeddings for the user’s query & find the chunks of documents relevant to it in our vector database. We first check whether the collection exists in our vector database, & if it does, we retrieve the top chunks & add those with a similarity score of at least 0.5 to our system prompt.

# keep only the chunks that have a similarity score of at least 0.5 out of 1
SIMILARITY_THRESHOLD = 0.5

collection_name = get_collection_name(file_name)
if collection_exists(collection_name):
    vector_store = get_vector_store(collection_name)
    # get documents along with their similarity scores
    results = vector_store.similarity_search_with_score(query, k=5)

    for doc, score in results:
        if score >= SIMILARITY_THRESHOLD:
            system_prompt += f"""
             Document: {doc.page_content}
             """

Now we will create the LLM client that communicates with OpenAI & pass it the system prompt above, which now contains the context relevant to the user’s query, along with the query itself, to get a more refined & relevant answer.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4.1",
)

messages = [("system", system_prompt), ("user", query)]

response = llm.invoke(messages)

print(f"response: {response.content}")

And that’s all, we just built our first RAG from scratch. Run the main.py file in the 1_implementing_basic_rag directory & you can interact with the RAG.

I am attaching a screenshot of one run of our basic RAG application.


So that’s it for this one. Hope you liked this article on implementing a basic RAG from scratch! In the next set of articles, we will discuss how to optimize our RAG application to make it production-ready. There are various techniques used in production-ready RAG applications to make them performant & efficient at scale. Stay tuned to learn more about them.

If you have questions/comments, then please feel free to comment on this article.


Written by

Aniket Mahangare

I am a Software Engineer in the Platform team at Uber India, deeply passionate about System Architecture, Database Internals, and Advanced Algorithms. I thrive on diving into intricate engineering details and bringing complex solutions to life. In my leisure time, I enjoy reading insightful articles, experimenting with new ideas, and sharing my knowledge through writing. This blog is a space where I document my learning journey, project experiences, and technical insights. Thank you for visiting—I hope you enjoy my posts!