From Documents to Dialogues: Building a Persona (Chatbot) with RAG and Embeddings

📚 RAG: Bridging the Gap Between LLMs and Real-Time, Proprietary Data
I’ve come across an important limitation of Large Language Models (LLMs):
They have knowledge cutoffs and no access to real-time or recent data.
LLMs generate answers based on the data they were trained on, which means that if you ask them about your internal documentation, recent experiments, or proprietary systems, they likely won’t have the answer. So how can we make LLMs work with our own data?
This is where the technique called RAG (Retrieval-Augmented Generation) comes into play.
🤔 The Problem
Let’s say you want to use an LLM to answer questions based on your:
Large PDFs
Internal reports
Source code
Research papers
or any private dataset
You have two options:
Fine-tune the model with your data (which is expensive and complex)
Use RAG, a much more flexible and cost-effective method
But even here, we face a challenge:
LLMs have token limits, so you can’t just feed an entire dataset to them in one go.
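You can see the problem by counting tokens. Here is a quick, optional check with the tiktoken library; it is not part of the chatbot we build below, and the file name is just an example:
# Quick illustration: counting tokens with tiktoken (pip install tiktoken)
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4o")
document_text = open("big_report.txt").read()  # hypothetical large document
print(len(encoding.encode(document_text)), "tokens")
# A few hundred pages easily exceeds a typical context window.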
Let’s Explore RAG:
⚙️ What is RAG?
RAG (Retrieval-Augmented Generation) is a technique where we retrieve relevant information from external data sources and augment the LLM’s prompt with that information to get more accurate, relevant, and personalized answers.
Here’s how the RAG pipeline works (a minimal code sketch of all four steps follows the breakdown):
🔄 The RAG Pipeline (Step-by-Step)
1️⃣ Data Ingestion (Indexing)
You split your data into smaller chunks (e.g. paragraphs or sections)
Convert each chunk into vector embeddings
Store these embeddings in a vector database like Pinecone, Qdrant, Weaviate, etc.
2️⃣ Retrieval
When a user submits a query, we convert that query into embeddings
These embeddings are used to search the vector database for similar or related chunks
We retrieve the top matching results (relevant context)
3️⃣ Augmentation
Combine the original query with the retrieved data
Format the input (prompt) clearly so the model understands both the context and the question
This step improves understanding and leads to better responses
4️⃣ Generation
The LLM receives the augmented prompt and generates an answer that is:
Context-aware
More accurate
Based on your actual data
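Here is that minimal end-to-end sketch. It assumes the same stack used later in this post (OpenAI embeddings and Qdrant via LangChain), that indexing has already happened, and that the collection name and query are placeholders; the full, working persona build follows.
# Minimal RAG sketch (assumes an already-indexed Qdrant collection; names are placeholders)
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from openai import OpenAI

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Step 1 (ingestion) is assumed already done by an indexing script.
# Step 2. Retrieval: connect to the existing collection and search it with the query
vector_db = QdrantVectorStore.from_existing_collection(
    url="http://localhost:6333",
    collection_name="my_docs",   # placeholder collection name
    embedding=embeddings,
)
query = "What does the internal report say about Q3?"
chunks = vector_db.similarity_search(query=query, k=4)

# Step 3. Augmentation: combine the retrieved chunks and the question into one prompt
context = "\n\n".join(chunk.page_content for chunk in chunks)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Step 4. Generation: the LLM answers from the augmented prompt
client = OpenAI()
answer = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": prompt}],
)
print(answer.choices[0].message.content)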
Let’s build your own persona with a RAG pipeline:
File Structure:
Chatbot/
├── indexing.py
├── main.py
├── streamlit_chat.py
└── Rag-chatbot-content.pdf (Please compile all your information into a PDF, including your background, qualifications, areas of expertise, skills, and projects.)
.env:
OPENAI_API_KEY=<your openai key>
Docker Containers:
docker pull qdrant/qdrant
docker run -d -p 6333:6333 qdrant/qdrant # This will run a qdrant db container on localhost:6333
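To confirm the container is up, you can query Qdrant's REST API; on a fresh instance the collection list should be empty:
curl http://localhost:6333/collections   # lists existing collections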
Python Packages:
python-dotenv
streamlit
langchain
langchain-openai
langchain-community
langchain-text-splitters
qdrant-client
langchain-qdrant
pypdf
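All of these can be installed in one go (ideally inside a virtual environment):
pip install python-dotenv streamlit langchain langchain-openai langchain-community langchain-text-splitters langchain-qdrant qdrant-client pypdf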
indexing.py
from dotenv import load_dotenv
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_qdrant import QdrantVectorStore

load_dotenv()

# PDF path
pdf_path = Path(__file__).parent / "Rag-chatbot-content.pdf"

# Loading the file
pdf_loader = PyPDFLoader(file_path=pdf_path)
all_docs = pdf_loader.load()

# Chunking
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=400)
split_docs = text_splitter.split_documents(documents=all_docs)

# Vector embedding model
embedding_model = OpenAIEmbeddings(model="text-embedding-3-large")

# Create embeddings of the split docs with embedding_model and store them in the DB
vector_store = QdrantVectorStore.from_documents(
    documents=split_docs,
    url="http://localhost:6333",
    collection_name="persona",
    embedding=embedding_model
)

print("Indexing of Documents Done..")
main.py
import streamlit as st
from dotenv import load_dotenv
from streamlit_chat import llm_response
load_dotenv()
st.set_page_config(page_title="Sharad Kumar Singh Bot", layout="centered")
# --- Session state init ---
if "messages" not in st.session_state:
st.session_state.messages = []
if "awaiting_response" not in st.session_state:
st.session_state.awaiting_response = False
# --- Custom CSS ---
st.markdown("""
<style>
.chat-box {
height: 60vh;
overflow-y: auto;
padding: 1rem;
border-radius: 10px;
background-color: lightgrey;
border: 1px solid #ccc;
color:black;
margin-bottom:1em;
}
.user-msg, .bot-msg {
padding: 10px;
margin: 10px 0;
max-width: 90%;
border-radius: 10px;
word-wrap: break-word;
clear: both;
font-size: 16px;
color:black;
}
.user-msg {
background-color: #DCF8C6;
float: right;
text-align: right;
}
.bot-msg {
background-color: #E6E6EA;
float: left;
text-align: left;
}
.message-wrapper::after {
content: "";
display: table;
clear: both;
}
</style>
""", unsafe_allow_html=True)
st.markdown("""
<script>
const chatBox = window.parent.document.querySelector('.chat-box');
if (chatBox) {
chatBox.scrollTop = chatBox.scrollHeight;
}
</script>
""", unsafe_allow_html=True)
st.markdown("<h2 style='text-align: center;'>Sharad Singh</h2>", unsafe_allow_html=True)
st.markdown("<h5 style='text-align: center;'>Ask anything about me</h5>", unsafe_allow_html=True)
# --- Display chat messages ---
chat_html = '<div class="chat-box">'
for msg in st.session_state.messages:
    role_class = "user-msg" if msg["role"] == "user" else "bot-msg"
    chat_html += f'<div class="message-wrapper"><div class="{role_class}">{msg["content"]}</div></div>'
chat_html += '</div>'
st.markdown(chat_html, unsafe_allow_html=True)
# --- Input form ---
with st.form("chat_input", clear_on_submit=True):
    user_input = st.text_area("Type your message here...", height=80, label_visibility="collapsed")
    submitted = st.form_submit_button("Send")
# --- Step 1: Handle user input and trigger rerun ---
if submitted and user_input.strip():
    st.session_state.messages.append({"role": "user", "content": user_input})
    st.session_state.awaiting_response = True
    st.rerun()
# --- Step 2: Generate bot response only after rerun ---
if st.session_state.awaiting_response:
    user_msg = st.session_state.messages[-1]["content"]
    response = llm_response(user_msg)
    st.session_state.messages.append({"role": "bot", "content": response})
    st.session_state.awaiting_response = False
    st.rerun()
streamlit_chat.py
from dotenv import load_dotenv
from langchain_qdrant import QdrantVectorStore
from langchain_openai import OpenAIEmbeddings
from openai import OpenAI
load_dotenv()
client = OpenAI()
# Vector embeddings
embedding_model = OpenAIEmbeddings(
    model="text-embedding-3-large"
)

vector_db = QdrantVectorStore.from_existing_collection(
    url="http://localhost:6333",
    collection_name="persona",
    embedding=embedding_model
)
def llm_response(query):
    bot_name = "Sharad Bot"

    # Vector similarity search for the query in the DB
    search_results = vector_db.similarity_search(query=query)

    # Build the context block from the retrieved chunks
    context = "\n\n\n".join(
        f"Page Content: {result.page_content}\n"
        f"Page Number: {result.metadata['page_label']}\n"
        f"File Location: {result.metadata['source']}"
        for result in search_results
    )

    SYSTEM_PROMPT = f"""
    You are an AI persona of Sharad Singh. Your name is {bot_name}. Answer every question as if you are Sharad Singh, in a natural, human tone.
    Use the context below to understand Sharad Singh's background.

    Context:
    {context}

    Catchphrases:
    <add your catchphrases...>

    Here are a few examples that show my tone and how I react to different types of questions. Understand them and answer in the same tone.

    Examples:
    friend says: wow
    my answer: thank you so much.
    <add some examples that show your tone, behavior, and way of talking>

    Restrictions:
    <Add instructions specifying that the LLM should not respond to requests such as code generation or questions that are not related to the user's profile.>
    """

    chat_completion = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query}
        ]
    )

    return chat_completion.choices[0].message.content
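Before wiring it into the Streamlit UI, you can smoke-test llm_response from a Python shell; Qdrant must be running and indexing.py must have been run first, and the question is just an example:
# Quick test of the local streamlit_chat module
from streamlit_chat import llm_response

print(llm_response("What projects have you worked on?"))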
Commands:
python indexing.py
streamlit run main.py
Final Output:
🔧 Integration with Agentic Workflows
If you're familiar with agentic workflows, you can go a step further by giving the LLM tools (like search, calculators, or APIs); as the sketch after this list shows, that enables it to:
Perform live actions
Fetch real-time data
Or enhance retrieval with more control
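For a feel of what giving tools to the LLM looks like with the OpenAI SDK, here is a rough, standalone sketch; get_weather is a made-up tool and not part of the persona bot above:
# Hedged sketch of OpenAI tool calling; get_weather is a hypothetical tool
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "What's the weather in Delhi?"}],
    tools=tools,
)

tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    args = json.loads(tool_calls[0].function.arguments)
    print("Model wants to call get_weather with:", args)
    # Your code would run the real tool here and send the result back
    # in a follow-up message with role="tool".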
✅ Benefits of RAG
Access real-time and proprietary data without retraining the model
Cost-effective compared to fine-tuning
Greater control over the model’s output
Better answers tailored to your domain knowledge
RAG is a game-changer for building AI assistants, knowledge bots, and enterprise copilots that need to work with live or internal data.
Please check out my other learnings and posts on GenAI concepts.
I’m just scratching the surface; more posts to come as I keep experimenting. 🚀
Have you tried implementing RAG in your projects? Would love to hear your experience.