What is RAG and How to Implement It in Python


AI is changing the way we search and find answers. One of the most exciting techniques used in modern AI applications is RAG, which stands for Retrieval-Augmented Generation.
In this guide, you’ll learn:
- What RAG is
- Why it's useful
- How to build a simple RAG pipeline in Python, with full code and setup instructions
What is RAG (Retrieval-Augmented Generation)?
Imagine asking an AI a question about Node.js or Python. If the AI hasn’t been trained on that content, it might not know the answer.
RAG solves this by combining two steps:
- Retrieval: Search for relevant content in your documents or database.
- Generation: Use that content to generate a meaningful, accurate answer.
It's like having a system that not only remembers but also looks up information before answering.
How RAG Works (Overview)
One-time activity (vector embedding)
- Load and read a document.
- Split the document into small readable chunks.
- Convert each chunk into embeddings (numerical representations of meaning).
- Store the embeddings in a vector database.
Chat loop (question and answer)
- When a user asks a question, the system finds the most relevant chunks in the vector database.
- Those chunks are sent to a language model (such as Gemini or Claude) to generate a response (see the sketch below).
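Before building the real pipeline, here is a tiny, self-contained toy example of the retrieve-then-generate pattern. The word-overlap "retriever" and the prompt-building "generator" are made-up stand-ins for intuition only; the actual implementation below uses embeddings, Qdrant, and Gemini.

documents = [
    "Node.js is a JavaScript runtime built on Chrome's V8 engine.",
    "Python is a popular language for data science and scripting.",
    "A vector database stores embeddings and supports similarity search.",
]

def retrieve(question, docs, k=1):
    # Step 1 (retrieval): rank documents by how many question words they share
    words = set(question.lower().replace("?", "").split())
    ranked = sorted(docs, key=lambda d: len(words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def generate(question, context):
    # Step 2 (generation): a real system would send this prompt to an LLM
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
    return prompt  # placeholder for the LLM call

question = "What is Node.js?"
context = "\n".join(retrieve(question, documents))
print(generate(question, context))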
Step-by-Step RAG Implementation in Python
1. Initial Setup
Create a directory to store all the files and name it rag_chatbot (the name can be anything).
Create a Python virtual environment. On Windows, you can use the following command (or any other method you prefer):
python -m venv .venv
Activate the virtual environment using .\.venv\Scripts\activate
Install the required packages:
pip install langchain langchain-community langchain-openai langchain-qdrant openai qdrant-client python-dotenv langchain_google_genai pypdf
Create a .env file and add GOOGLE_API_KEY="your_api_key" (you can generate an API key for free at https://aistudio.google.com/apikey).
Download nodejs.pdf from the GitHub repo linked below and add it to the current working directory, or use any other document of your choice.
2. Start Qdrant in Docker
We are using Qdrant as the vector database, so let's set it up in Docker.
Note: to run Qdrant in Docker, you need Docker Desktop up and running on your machine (install it from https://docs.docker.com/get-started/introduction/get-docker-desktop/).
Create a file called docker-compose.db.yml with the following content:
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - '6333:6333'
Then run:
docker-compose -f docker-compose.db.yml up -d
This starts the Qdrant vector database locally. You can access its dashboard at http://localhost:6333/dashboard.
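(Optional) To confirm from Python that Qdrant is reachable before indexing anything, you can run a quick sanity check with the qdrant-client package installed earlier:

from qdrant_client import QdrantClient

# Connect to the local Qdrant instance started by docker-compose
client = QdrantClient(url="http://localhost:6333")

# Lists existing collections; an empty list is expected on a fresh instance
print(client.get_collections())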
3. Create a Reusable LLM Function and Embedder
Create a file called llm_utils.py with the following content:
from openai import OpenAI
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from dotenv import load_dotenv
import os

load_dotenv()

# Gemini-compatible OpenAI SDK
client = OpenAI(
    api_key=os.getenv('GOOGLE_API_KEY'),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def call_llm(query):
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=query
    )
    return response.choices[0].message.content

# Create embeddings
# Initialize Google Generative AI embeddings with the specified model
def initialize_embeddings():
    embeder = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
    return embeder
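(Optional) You can quickly sanity-check both helpers from a Python shell before wiring up the full pipeline; this assumes your GOOGLE_API_KEY is set in .env:

from llm_utils import call_llm, initialize_embeddings

# call_llm expects a list of chat messages, not a bare string
print(call_llm([{"role": "user", "content": "Say hello in one sentence."}]))

# The embedder turns text into a list of floats (its length is the embedding dimension)
embeder = initialize_embeddings()
vector = embeder.embed_query("What is Node.js?")
print(len(vector))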
4. Vector Embedding and Indexing
Let's index the data into the vector database.
Create a file called index_documents.py with the following content:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from llm_utils import initialize_embeddings
from langchain_qdrant import QdrantVectorStore
from pathlib import Path
import os
from dotenv import load_dotenv

embeder = initialize_embeddings()

# Load your PDF
file_path = Path(__file__).parent / "nodejs.pdf"  # change this path if you are using a file from a different location
loader = PyPDFLoader(file_path)
docs = loader.load()

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_docs = text_splitter.split_documents(docs)

# Index into Qdrant
QdrantVectorStore.from_documents(
    documents=split_docs,
    embedding=embeder,
    collection_name="test_rag",  # collection name
    url="http://localhost:6333"
)

print("Indexing completed successfully.")
5. Main RAG Script
Create a file called rag_chatbot.py with the following content:
from langchain_qdrant import QdrantVectorStore
from langchain_openai import OpenAIEmbeddings
from llm_utils import call_llm, initialize_embeddings
from dotenv import load_dotenv

# Set up embeddings and connect to the existing vector store
embeder = initialize_embeddings()
retriver = QdrantVectorStore.from_existing_collection(
    embedding=embeder,
    collection_name="test_rag",
    url="http://localhost:6333"
)

# Chat loop
conversation_history = []
while True:
    query = input("Enter your question (or 'exit' to end): ")
    if query.lower() == 'exit':
        break

    # Retrieve the most relevant data chunks from the vector database based on the user query
    relevant_chunks = retriver.similarity_search(query=query)

    # Define the system prompt for the chatbot
    system_prompt = f"""
    You are a helpful assistant. You help the user to find the answer to their question based on the provided context.
    context: {relevant_chunks}
    You will be provided with a context and a question. You need to answer the question based on the context.
    If the context does not provide enough information to answer the question, you should say "I don't know".
    Note:
    The answer should be detailed and should not be too short.
    The answer should be in a conversational tone.
    """

    messages = [{"role": "system", "content": system_prompt}]

    # Add conversation history to messages
    for msg in conversation_history:
        messages.append(msg)

    # Add the current user query to messages
    messages.append({"role": "user", "content": query})

    response = call_llm(messages)

    conversation_history.append({"role": "user", "content": query})
    conversation_history.append({"role": "assistant", "content": response})

    # Keep conversation history limited to the last 4 interactions (8 messages)
    if len(conversation_history) > 8:
        conversation_history = conversation_history[-8:]

    print("\nAssistant:", response, "\n")
File Structure Overview
Now your file structure should look like this:
rag_chatbot/
├── .env                     <-- your GOOGLE_API_KEY
├── docker-compose.db.yml
├── llm_utils.py
├── index_documents.py       <-- vector embedding + indexing logic
├── rag_chatbot.py           <-- main chatbot app
└── nodejs.pdf               <-- your document
How to Use
- Start Qdrant (if it's not already running):
docker-compose -f docker-compose.db.yml up -d
- Index Documents (Run Once):
python index_documents.py
- Start Chatbot:
python rag_chatbot.py
Example Use Case
Let's say your nodejs.pdf contains a tutorial. You ask:
"How do I create an HTTP server in Node.js?"
The system will:
- Search for the relevant chunks in the vector database based on your query.
- Feed those chunks to the LLM.
- The LLM uses the chunks as context, analyzes them, and provides an answer (the retrieval step on its own is shown in the snippet below).
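If you want to watch the retrieval step in isolation, a short snippet like this prints the chunks that would be fed to the LLM (it assumes you have already indexed nodejs.pdf in step 4):

from langchain_qdrant import QdrantVectorStore
from llm_utils import initialize_embeddings

# Reconnect to the collection created by index_documents.py
retriver = QdrantVectorStore.from_existing_collection(
    embedding=initialize_embeddings(),
    collection_name="test_rag",
    url="http://localhost:6333",
)

# Print the beginning of each retrieved chunk for the example question
for doc in retriver.similarity_search("How do I create an HTTP server in Node.js?"):
    print(doc.page_content[:150], "...\n")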
GitHub repo link