What is RAG and How to Implement It in Python


AI is changing the way we search and find answers. One of the most exciting techniques used in modern AI applications is RAG, which stands for Retrieval-Augmented Generation.
In this guide, you’ll learn:
- What RAG is
- Why it's useful
- How to build a simple RAG pipeline in Python, with full code and setup instructions
What is RAG (Retrieval-Augmented Generation)?
Imagine asking an AI a question about Node.js or Python. If the AI hasn’t been trained on that content, it might not know the answer.
RAG solves this by combining two steps:
- Retrieval: Search for relevant content in your documents or database.
- Generation: Use that content to generate a meaningful, accurate answer.
It's like having a system that not only remembers but also looks up information before answering.
How RAG Works (Overview)
One-time activity (vector embedding)
- Load and read a document.
- Split the document into small readable chunks.
- Convert each chunk into embeddings (numerical representations of meaning).
- Store the embeddings in a vector database.
Chat loop (question and answer)
- When a user asks a question, the system finds the most relevant chunks in the vector database.
- Those chunks are sent to a language model (such as Gemini or Claude) to generate a response (see the sketch below).
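Before building the real pipeline, here is a tiny, self-contained toy example of the retrieve-then-generate pattern. The word-overlap "retriever" and the prompt-building "generator" are made-up stand-ins for intuition only; the actual implementation below uses embeddings, Qdrant, and Gemini.

documents = [
    "Node.js is a JavaScript runtime built on Chrome's V8 engine.",
    "Python is a popular language for data science and scripting.",
    "A vector database stores embeddings and supports similarity search.",
]

def retrieve(question, docs, k=1):
    # Step 1 (retrieval): rank documents by how many question words they share
    words = set(question.lower().replace("?", "").split())
    ranked = sorted(docs, key=lambda d: len(words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def generate(question, context):
    # Step 2 (generation): a real system would send this prompt to an LLM
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
    return prompt  # placeholder for the LLM call

question = "What is Node.js?"
context = "\n".join(retrieve(question, documents))
print(generate(question, context))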
Step-by-Step RAG Implementation in Python
1. Initial Setup
Create a directory to store all the files and name it rag_chatbot (the name can be anything).
Create a Python virtual environment. On Windows, you can use the following command (or any other method you prefer):
python -m venv .venv
Activate the virtual environment using .\.venv\Scripts\activate
Install the required packages:
pip install langchain langchain-community langchain-openai langchain-qdrant openai qdrant-client python-dotenv langchain_google_genai pypdf
Create a .env file and add GOOGLE_API_KEY="your_api_key" (you can generate an API key for free at https://aistudio.google.com/apikey).
Download nodejs.pdf from the GitHub repo linked below and add it to the current working directory, or use any other document of your choice.
2. Start Qdrant in Docker
We are using Qdrant as the vector database, so let's set it up in Docker.
Note: to run Qdrant in Docker, you need Docker Desktop up and running on your machine (install it from https://docs.docker.com/get-started/introduction/get-docker-desktop/).
Create a file called docker-compose.db.yml with the following content:
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - '6333:6333'
Then run:
docker-compose -f docker-compose.db.yml up -d
This starts the Qdrant vector database locally. You can access its dashboard at http://localhost:6333/dashboard.
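(Optional) To confirm from Python that Qdrant is reachable before indexing anything, you can run a quick sanity check with the qdrant-client package installed earlier:

from qdrant_client import QdrantClient

# Connect to the local Qdrant instance started by docker-compose
client = QdrantClient(url="http://localhost:6333")

# Lists existing collections; an empty list is expected on a fresh instance
print(client.get_collections())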
3. Create a Reusable LLM Function and Embedder
Create a file called llm_utils.py with the following content:
from openai import OpenAI
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from dotenv import load_dotenv
import os

load_dotenv()

# Gemini-compatible OpenAI SDK
client = OpenAI(
    api_key=os.getenv('GOOGLE_API_KEY'),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def call_llm(query):
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=query
    )
    return response.choices[0].message.content

# Create embeddings
# Initialize Google Generative AI embeddings with the specified model
def initialize_embeddings():
    embeder = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
    return embeder
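(Optional) You can quickly sanity-check both helpers from a Python shell before wiring up the full pipeline; this assumes your GOOGLE_API_KEY is set in .env:

from llm_utils import call_llm, initialize_embeddings

# call_llm expects a list of chat messages, not a bare string
print(call_llm([{"role": "user", "content": "Say hello in one sentence."}]))

# The embedder turns text into a list of floats (its length is the embedding dimension)
embeder = initialize_embeddings()
vector = embeder.embed_query("What is Node.js?")
print(len(vector))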
4. Vector Embedding and Indexing
Let's index the data into the vector database.
Create a file called index_documents.py with the following content:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from llm_utils import initialize_embeddings
from langchain_qdrant import QdrantVectorStore
from pathlib import Path
import os
from dotenv import load_dotenv

embeder = initialize_embeddings()

# Load your PDF
file_path = Path(__file__).parent / "nodejs.pdf"  # change this path if you are using a file from a different location
loader = PyPDFLoader(file_path)
docs = loader.load()

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_docs = text_splitter.split_documents(docs)

# Index into Qdrant
QdrantVectorStore.from_documents(
    documents=split_docs,
    embedding=embeder,
    collection_name="test_rag",  # collection name
    url="http://localhost:6333"
)

print("Indexing completed successfully.")
5. Main RAG Script
Create a file called rag_chatbot.py with the following content:
from langchain_qdrant import QdrantVectorStore
from langchain_openai import OpenAIEmbeddings
from llm_utils import call_llm, initialize_embeddings
from dotenv import load_dotenv

# Set up embeddings and connect to the existing vector store
embeder = initialize_embeddings()
retriver = QdrantVectorStore.from_existing_collection(
    embedding=embeder,
    collection_name="test_rag",
    url="http://localhost:6333"
)

# Chat loop
conversation_history = []
while True:
    query = input("Enter your question (or 'exit' to end): ")
    if query.lower() == 'exit':
        break

    # Retrieve the most relevant data chunks from the vector database based on the user query
    relevant_chunks = retriver.similarity_search(query=query)

    # Define the system prompt for the chatbot
    system_prompt = f"""
    You are a helpful assistant. You help the user to find the answer to their question based on the provided context.
    context: {relevant_chunks}
    You will be provided with a context and a question. You need to answer the question based on the context.
    If the context does not provide enough information to answer the question, you should say "I don't know".
    Note:
    The answer should be detailed and should not be too short.
    The answer should be in a conversational tone.
    """

    messages = [{"role": "system", "content": system_prompt}]

    # Add conversation history to messages
    for msg in conversation_history:
        messages.append(msg)

    # Add the current user query to messages
    messages.append({"role": "user", "content": query})

    response = call_llm(messages)

    conversation_history.append({"role": "user", "content": query})
    conversation_history.append({"role": "assistant", "content": response})

    # Keep conversation history limited to the last 4 interactions (8 messages)
    if len(conversation_history) > 8:
        conversation_history = conversation_history[-8:]

    print("\nAssistant:", response, "\n")
File Structure Overview
Now your file structure should look like this:
rag_chatbot/
├── .env                     <-- your GOOGLE_API_KEY
├── docker-compose.db.yml
├── llm_utils.py
├── index_documents.py       <-- vector embedding + indexing logic
├── rag_chatbot.py           <-- main chatbot app
└── nodejs.pdf               <-- your document
How to Use
- Start Qdrant (if it's not already running):
docker-compose -f docker-compose.db.yml up -d
- Index Documents (Run Once):
python index_documents.py
- Start Chatbot:
python rag_chatbot.py
Example Use Case
Let's say your nodejs.pdf contains a tutorial. You ask:
"How do I create an HTTP server in Node.js?"
The system will:
- Search for the relevant chunks in the vector database based on your query.
- Feed those chunks to the LLM.
- The LLM uses the chunks as context, analyzes them, and provides an answer (the retrieval step on its own is shown in the snippet below).
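If you want to watch the retrieval step in isolation, a short snippet like this prints the chunks that would be fed to the LLM (it assumes you have already indexed nodejs.pdf in step 4):

from langchain_qdrant import QdrantVectorStore
from llm_utils import initialize_embeddings

# Reconnect to the collection created by index_documents.py
retriver = QdrantVectorStore.from_existing_collection(
    embedding=initialize_embeddings(),
    collection_name="test_rag",
    url="http://localhost:6333",
)

# Print the beginning of each retrieved chunk for the example question
for doc in retriver.similarity_search("How do I create an HTTP server in Node.js?"):
    print(doc.page_content[:150], "...\n")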
GitHub repo link