Local RAG with Ollama and DeepSeek

Retrieval-Augmented Generation (RAG) is changing how we put large language models to work. Instead of expecting an LLM to know every fact, RAG pairs a lightweight, custom knowledge base with the model's generative capabilities. You first ingest your documents, index them for fast similarity search, then ask your model to reason over just the relevant excerpts. The result? Faster, more accurate, and fully private Q&A on any text you choose.
In this post, I’ll show you two implementations of a RAG pipeline I built:
1. A Streamlit-based web UI, so you can upload and explore documents interactively
2. A standalone Python script for CLI or automated workflows
Both share the same core steps and use LangChain, Ollama on-device models, and ChromaDB for vector indexing.
TECH STACK OVERVIEW
• Python 3.7+ with python-docx, PyPDF2, pandas, and openpyxl
• LangChain (core + community loaders) for document ingestion and splitting
• Ollama Embeddings & ChatOllama for on-device embeddings and chat inference
• ChromaDB as our local vector database
• Streamlit for building the web interface
PROJECT STRUCTURE
.
├── streamlit_app.py # Interactive RAG demo
├── standalone_script.py # CLI-only RAG workflow
├── requirements.txt # All dependencies
└── README.md # Usage & deployment notes
CORE RAG WORKFLOW
All features—from the UI to the script—follow this four-step pipeline:
1. Load & split your source documents into text chunks
2. Build (or load) a Chroma vector store over those chunks
3. Retrieve the most relevant chunks for a given query
4. Generate a final answer with your chosen Ollama chat model
LOADING AND SPLITTING
We support PDF, Word (.docx), plain text, and even Excel sheets. Under the hood, each file is written to a temporary path, then passed into the matching LangChain loader. We split documents into 500-character segments with a 50-character overlap so that relevant context survives chunk boundaries.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader(pdf_path)
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
chunks = splitter.split_documents(docs)
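To sketch the per-format dispatch described above, one way to route each file to a loader looks like this; the loader choices (Docx2txtLoader, TextLoader, UnstructuredExcelLoader) and the load_any helper are my assumptions here, since the repo may use python-docx and openpyxl directly:
import os
from langchain_community.document_loaders import (
    Docx2txtLoader,
    PyPDFLoader,
    TextLoader,
    UnstructuredExcelLoader,
)

# Illustrative extension-to-loader mapping; names here are assumptions
LOADERS = {
    ".pdf": PyPDFLoader,
    ".docx": Docx2txtLoader,
    ".txt": TextLoader,
    ".xlsx": UnstructuredExcelLoader,
}

def load_any(path):
    # Dispatch on the file extension and return LangChain Document objects
    ext = os.path.splitext(path)[1].lower()
    return LOADERS[ext](path).load()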
VECTOR STORE CREATION
ChromaDB gives us a lightning-fast, local vector index:
from uuid import uuid1
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectordb = Chroma(
    collection_name="pdf_collection",
    embedding_function=embeddings,
    persist_directory="./DB/chroma_langchain_db"
)
vectordb.add_documents(documents=chunks, ids=[str(uuid1()) for _ in chunks])
By caching or persisting this directory, repeated runs reuse the same index instead of re-ingesting every time.
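On a later run you can simply reopen the persisted directory instead of re-adding documents; a minimal sketch, using the same names as above:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Reopen the existing index; Chroma reads it straight from disk
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectordb = Chroma(
    collection_name="pdf_collection",
    embedding_function=embeddings,
    persist_directory="./DB/chroma_langchain_db",
)
# No add_documents call needed on this path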
STREAMLIT UI (streamlit_app.py)
The Streamlit demo walks you through the following steps (a skeletal sketch follows the list):
1. Model selection via a dropdown (e.g. gemma:2b, deepseek-r1:1.5b)
2. Multi-file upload (PDF, DOCX, TXT, XLSX)
3. A Load & Split button to chunk all content
4. A Build the Vector Store button to index those chunks
5. A query box and Get Answer button to retrieve and display results
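Wired together, the flow looks roughly like this; it assumes the three reusable helpers introduced in the standalone-script section below are importable from standalone_script.py, and their exact signatures are assumptions rather than the repo's actual code:
import streamlit as st
# Reusable helpers from the repo (import path and signatures assumed)
from standalone_script import load_and_split, create_vector_store, query_and_respond

model = st.selectbox("Model", ["gemma:2b", "gemma:7b", "deepseek-r1:1.5b"])
files = st.file_uploader("Documents", type=["pdf", "docx", "txt", "xlsx"],
                         accept_multiple_files=True)
if st.button("Load & Split"):
    # Chunk every uploaded file and keep the result across reruns
    st.session_state.chunks = [c for f in files for c in load_and_split(f)]
if st.button("Build the Vector Store"):
    st.session_state.vectordb = create_vector_store(st.session_state.chunks)
query = st.text_input("Your question")
if st.button("Get Answer"):
    st.write(query_and_respond(st.session_state.vectordb, query, model))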
Uploads land in your browser session as UploadedFile objects; we write them to temporary files so LangChain loaders can treat them like real files.
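A minimal sketch of that hand-off (the save_upload helper is hypothetical):
import tempfile
from langchain_community.document_loaders import PyPDFLoader

def save_upload(uploaded_file, suffix=".pdf"):
    # Persist the in-memory UploadedFile to a real path on disk
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(uploaded_file.getvalue())
        return tmp.name

# docs = PyPDFLoader(save_upload(pdf_file)).load()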
STANDALONE SCRIPT (standalone_script.py)
For CI pipelines or headless runs, simply call:
python standalone_script.py
It will:
1. Load ./files/Shishir-Resume-PD.pdf (I used my resume; you can use anything else.)
2. Split it into chunks
3. Build (or load) the Chroma index
4. Run a sample query (“What did shishir do using Python?”)
5. Print the cleaned response
All functionality lives in three reusable functions: load_and_split, create_vector_store, and query_and_respond, so you can drop them into any Python project.
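As an illustration, query_and_respond might look roughly like this; the prompt wording, defaults, and the <think>-stripping regex are assumptions, not the repo's exact code:
import re
from langchain_community.chat_models import ChatOllama

def query_and_respond(vectordb, question, model="deepseek-r1:1.5b", k=4):
    # Retrieve the most relevant chunks for the question
    docs = vectordb.similarity_search(question, k=k)
    context = "\n\n".join(d.page_content for d in docs)
    # Ask the local chat model to answer from that context only
    answer = ChatOllama(model=model).invoke(
        f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    ).content
    # deepseek-r1 wraps its reasoning in <think> tags; strip them for clean output
    return re.sub(r"<think>.*?</think>", "", answer, flags=re.DOTALL).strip()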
SETTING UP OLLAMA LOCALLY
To run entirely on-device, you need Ollama installed on your machine.
Install Ollama CLI
macOS:
brew install ollama
Windows/Linux:
Download the latest package from https://ollama.com/ and follow the platform-specific installer.
Download Required Models
ollama pull gemma:2b
ollama pull gemma:7b
ollama pull deepseek-r1:1.5b
ollama pull nomic-embed-text
This step caches the models locally so inference runs offline.
(Optional) Run Ollama Daemon for API Access
ollama serve
If you prefer Python code to call a local REST endpoint, point your client at http://localhost:11434 (Ollama's default address).
Your Python scripts and Streamlit app will automatically use the local Ollama socket or REST API—no external network calls required.
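For example, a quick connectivity check against that endpoint (a sketch; base_url is Ollama's default and usually doesn't need to be set explicitly):
from langchain_community.chat_models import ChatOllama

# Point the client at the local daemon explicitly
llm = ChatOllama(model="deepseek-r1:1.5b", base_url="http://localhost:11434")
print(llm.invoke("Reply with one word: ready?").content)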
BENEFITS OF KEEPING IT ALL LOCAL
Privacy & Compliance: Your documents never leave your infrastructure, ideal for sensitive SOPs, internal reports, or proprietary research.
Latency & Cost: Local inference eliminates API-call overhead and per-token charges. Queries return in milliseconds rather than seconds.
Offline Capability: Work without internet connectivity—perfect for air-gapped environments or field deployments.
Full Control: You decide which models and versions to run, when to upgrade, and how to scale on your hardware.
START RUNNING IT LOCALLY
Clone the repo:
git clone https://github.com/shishir/GenAI-Usecases.git
cd RAG-Examples
Install dependencies:
pip install -r requirements.txt
Run the Streamlit UI:
streamlit run streamlit_app.py
Or launch the CLI script:
python standalone_script.py
And done!
This RAG pipeline lets you combine your documents with on-device LLMs for fast, private, and accurate Q&A. I built both a Streamlit UI and a standalone Python script using LangChain, Ollama, and ChromaDB. Everything runs 100% locally: no API keys, no billing surprises, and sub-second responses. Follow the steps above to clone, install, and get querying in minutes.
TL;DR
RAG marries retrieval and generative AI for accurate, private Q&A.
Two flavours provided: an interactive Streamlit UI and a CLI-friendly script.
Core steps: load & split documents, build/load Chroma index, retrieve chunks, chat with Ollama.
Supports PDF, Word, text, and Excel—just upload, build, and ask.
Output is sanitised (no <think> tags) and the model is selectable via a dropdown.
Ready in 20 lines of code; fully open-source and self-hostable.
Thanks for coming this far in the article. I hope I was able to explain the concept well enough. If you face any issues, please feel free to reach out to me; I'd be happy to help.
© 2025 Shishir Srivastav. All rights reserved.