Local RAG with Ollama and DeepSeek

Retrieval-Augmented Generation (RAG) is changing how we work with large language models. Instead of expecting an LLM to know every fact, RAG combines a lightweight, custom knowledge base with powerful generative capabilities. You first ingest your documents, index them for fast similarity search, then ask your model to reason over just the relevant excerpts. The result? Faster, more accurate, and fully private Q&A on any text you choose.

In this post, I’ll show you two implementations of a RAG pipeline I built:

  • A Streamlit-based web UI, so you can upload and explore documents interactively

  • A standalone Python script for CLI or automated workflows

Both share the same core steps and use LangChain, on-device Ollama models, and ChromaDB for vector indexing.

TECH STACK OVERVIEW

Python 3.7+ with python-docx, PyPDF2, pandas and openpyxl
LangChain (core + community loaders) for document ingestion and splitting
OllamaEmbeddings & ChatOllama for on-device embeddings and chat inference
ChromaDB as our local vector database
Streamlit for building the web interface
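
If you want to mirror this stack, a requirements.txt along these lines should cover it (package names only, unpinned; the repo's own requirements.txt is authoritative):

langchain
langchain-community
chromadb
streamlit
python-docx
PyPDF2
pandas
openpyxl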

PROJECT STRUCTURE

.
├── streamlit_app.py    # Interactive RAG demo
├── standalone_script.py       # CLI-only RAG workflow
├── requirements.txt    # All dependencies
└── README.md           # Usage & deployment notes

CORE RAG WORKFLOW

All features—from the UI to the script—follow this four-step pipeline:

  1. Load & split your source documents into text chunks

  2. Build (or load) a Chroma vector store over those chunks

  3. Retrieve the most relevant chunks for a given query

  4. Generate a final answer with your chosen Ollama chat model

LOADING AND SPLITTING

We support PDF, Word (.docx), plain text, and even Excel sheets. Under the hood, each file is written to a temporary path, then passed into the matching LangChain loader. We split documents into 500-character segments with a 50-character overlap so that context carries across chunk boundaries.

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader(pdf_path)
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
chunks = splitter.split_documents(docs)
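
The PDF path above is just one branch of the loader selection. Here is a sketch of how the other formats could be dispatched; the mapping is illustrative (the stack lists pandas and openpyxl, so the actual Excel handling may differ from the UnstructuredExcelLoader shown here):

from pathlib import Path
from langchain_community.document_loaders import (
    PyPDFLoader,
    Docx2txtLoader,
    TextLoader,
    UnstructuredExcelLoader,
)

# Illustrative extension-to-loader mapping; the repo's actual dispatch may differ.
LOADERS = {
    ".pdf": PyPDFLoader,
    ".docx": Docx2txtLoader,
    ".txt": TextLoader,
    ".xlsx": UnstructuredExcelLoader,
}

def load_any(path: str):
    """Load a document with the loader that matches its file extension."""
    ext = Path(path).suffix.lower()
    return LOADERS[ext](path).load()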

VECTOR STORE CREATION

ChromaDB gives us a lightning-fast, local vector index:

from uuid import uuid1
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectordb = Chroma(
    collection_name="pdf_collection",
    embedding_function=embeddings,
    persist_directory="./DB/chroma_langchain_db",
)
vectordb.add_documents(documents=chunks, ids=[str(uuid1()) for _ in chunks])

By caching or persisting this directory, repeated runs reuse the same index instead of re-ingesting every time.
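
In practice that can be as simple as re-opening the same persist_directory and ingesting only when the collection is still empty. A minimal sketch (the emptiness check is one illustrative heuristic, and chunks is assumed to come from the splitting step above):

from uuid import uuid1
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Re-opening the same persist_directory picks up the index built on earlier runs.
vectordb = Chroma(
    collection_name="pdf_collection",
    embedding_function=OllamaEmbeddings(model="nomic-embed-text"),
    persist_directory="./DB/chroma_langchain_db",
)

# Ingest only when nothing has been indexed yet (illustrative check).
if not vectordb.get()["ids"]:
    vectordb.add_documents(documents=chunks, ids=[str(uuid1()) for _ in chunks])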

STREAMLIT UI (streamlit_app.py)

The Streamlit demo walks you through:

  1. Model selection via dropdown (e.g. gemma:2b, deepseek-r1:1.5b)

  2. Multi-file upload (PDF, DOCX, TXT, XLSX)

  3. Load & Split button to chunk all content

  4. Build the Vector Store button to index those chunks

  5. Query box and Get Answer button to retrieve and display results

Uploads land in your browser session as UploadedFile objects; we write them to temporary files so LangChain loaders can treat them like real files.
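
A sketch of that hand-off (st.file_uploader and tempfile are standard APIs; the variable names and the PDF-only dispatch at the end are simplifications of what the app does):

import tempfile
from pathlib import Path

import streamlit as st
from langchain_community.document_loaders import PyPDFLoader

uploads = st.file_uploader(
    "Upload documents", type=["pdf", "docx", "txt", "xlsx"], accept_multiple_files=True
)

for uploaded in uploads or []:
    suffix = Path(uploaded.name).suffix
    # Write the in-memory UploadedFile to disk so a LangChain loader can open it by path.
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(uploaded.getbuffer())
        tmp_path = tmp.name
    if suffix == ".pdf":  # the app picks the matching loader per extension
        docs = PyPDFLoader(tmp_path).load()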

STANDALONE SCRIPT (standalone_script.py)

For CI pipelines or headless runs, simply call:

python standalone_script.py

It will:

  1. Load ./files/Shishir-Resume-PD.pdf (I used my resume; you can use any document)

  2. Split into chunks

  3. Build (or load) the Chroma index

  4. Run a sample query (“What did shishir do using Python?”)

  5. Print the cleaned response

All functionality lives in three reusable functions—load_and_split, create_vector_store, and query_and_respond—so you can drop them into any Python project.
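
To give a feel for the retrieval and generation half, here is a sketch of what a query_and_respond-style function can look like. The retriever settings, the prompt wording, and the regex that strips deepseek-r1's <think> blocks are my assumptions, not necessarily the repo's exact code:

import re

from langchain_community.chat_models import ChatOllama


def query_and_respond(vectordb, question: str, model: str = "deepseek-r1:1.5b") -> str:
    # 3. Retrieve the most relevant chunks for the query.
    retriever = vectordb.as_retriever(search_kwargs={"k": 4})
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)

    # 4. Generate an answer grounded in the retrieved context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    answer = ChatOllama(model=model).invoke(prompt).content

    # deepseek-r1 emits <think>...</think> reasoning; strip it for clean output.
    return re.sub(r"<think>.*?</think>", "", answer, flags=re.DOTALL).strip()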

SETTING UP OLLAMA LOCALLY

To run entirely on-device, you need Ollama installed on your machine.

  1. Install Ollama CLI

    • macOS:

        brew install ollama
      
    • Windows/Linux:
      Download the latest package from https://ollama.com/ and follow the platform-specific installer.

  2. Download Required Models

     ollama pull gemma:2b
     ollama pull gemma:7b
     ollama pull deepseek-r1:1.5b
     ollama pull nomic-embed-text
    

    This step caches the models locally so inference runs offline.

  3. (Optional) Run Ollama Daemon for API Access

     ollama serve
    

    If you prefer Python code to call a local REST endpoint, point your client at http://localhost:11434 (Ollama's default port; set the OLLAMA_HOST environment variable to change it).

Your Python scripts and Streamlit app will automatically use the local Ollama socket or REST API—no external network calls required.
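
For example, both ChatOllama and OllamaEmbeddings default to http://localhost:11434 and accept a base_url parameter if the daemon runs elsewhere. A quick sanity check, assuming deepseek-r1:1.5b and nomic-embed-text are already pulled:

from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings

# base_url defaults to the local daemon; shown explicitly here for clarity.
llm = ChatOllama(model="deepseek-r1:1.5b", base_url="http://localhost:11434")
embeddings = OllamaEmbeddings(model="nomic-embed-text", base_url="http://localhost:11434")

print(llm.invoke("Say hello in one short sentence.").content)
print(len(embeddings.embed_query("hello")))  # embedding dimensionality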

BENEFITS OF KEEPING IT ALL LOCAL

  • Privacy & Compliance: Your documents never leave your infrastructure, ideal for sensitive SOPs, internal reports, or proprietary research.

  • Latency & Cost: Local inference eliminates API-call overhead and per-token charges, and skipping the network round-trip keeps responses fast.

  • Offline Capability: Work without internet connectivity—perfect for air-gapped environments or field deployments.

  • Full Control: You decide which models and versions to run, when to upgrade, and how to scale on your hardware.

START RUNNING IT LOCALLY

  1. Clone the repo:

     git clone https://github.com/shishir/GenAI-Usecases.git
     cd RAG-Examples
    
  2. Install dependencies:

     pip install -r requirements.txt
    
  3. Run the Streamlit UI:

     streamlit run streamlit_app.py
    

    Or launch the CLI script:

     python standalone_script.py
    

And done!

This Retrieval-Augmented Generation (RAG) setup lets you combine your documents with on-device LLMs for fast, private, and accurate Q&A. I built both a Streamlit UI and a standalone Python script using LangChain, Ollama, and ChromaDB. Everything runs 100% locally—no API keys, no billing surprises, and sub-second responses. Follow the steps above to clone, install, and get querying in minutes.


TL;DR

  • RAG marries retrieval and generative AI for accurate, private Q&A.

  • Two flavours provided: an interactive Streamlit UI and a CLI-friendly script.

  • Core steps: load & split documents, build/load Chroma index, retrieve chunks, chat with Ollama.

  • Supports PDF, Word, text, and Excel—just upload, build, and ask.

  • Output is sanitised (no <think> tags) and model-selectable via a dropdown.

  • Ready in 20 lines of code; fully open-source and self-hostable.



Thanks for coming this far in the article. I hope I was able to explain the concept well enough. If you face any issues, please feel free to reach out to me; I'd be happy to help.

🔗 Let’s stay connected!

🌐 Linktree | 🐦 X | 📸 Instagram | ▶️ YouTube | ✍️ Hashnode | 💻 GitHub | 🔗 LinkedIn | 🤝 Topmate | 🏅 Credly

© 2025 Shishir Srivastav. All rights reserved.
