Building a Private AI Document Pipeline with Paperless, PaddleOCR, and LLMs


So this past week I hacked together a little side project to smarten up my Paperless-ngx setup — you know, that self-hosted document management system that eats PDFs and makes them searchable.

Now, Paperless-ngx is solid, don’t get me wrong. But it uses Tesseract for OCR, and honestly... Tesseract struggles with anything that isn't clean, well-scanned text.

So this evolved out of a need to improve paperless-ngx's OCR capability and to properly classify documents and extract tags, titles, and summaries for them.

This blog is a walkthrough of what I built — what worked, what didn’t, and how it turned into a pretty neat little pipeline with its own microservices. Hope it gives you some ideas.


💡 What I Was Going For

I wanted a system that:

  • Uses PaddleOCR instead of Tesseract for better OCR output
  • Runs a local LLM using Ollama to:
    • Suggest a smart document title
    • Classify the document into a type (invoice, id, tax, etc.)
  • Pushes that back to Paperless so the doc is nicely searchable and tagged

Everything stays local/private. No exposure to external LLMs. Just Python, containers, and Docker Compose to stitch everything together.


🔧 How It All Works (Now)

After a few iterations, I ended up with a clean microservice setup with each part doing its job:

  1. paperless-ngx: The main document management system (already amazing)

  2. ollama: Runs a local LLM like Mistral (or phi3, or any model configurable via env), no cloud stuff

  3. ocr-service: FastAPI service that runs PaddleOCR

  4. pipeline: Python CLI that connects all the dots — downloads doc from Paperless, sends it to OCR and LLM, then updates Paperless with the smart results

That way, each service does one thing and does it well, and each can be enhanced in isolation. They all run in Docker, talk to each other over the same network, and together make one smooth, local AI document workflow.

This leverages paperless-ngx's extensive feature set and augments it with better OCR and LLM-based classification.
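
To make this concrete, here is roughly what the ocr-service boils down to: a FastAPI app wrapping PaddleOCR. This is a minimal sketch, not the real main.py; the `/ocr` route name and the OCR options are illustrative (the actual service reads its settings from ocr_config.yaml), and it assumes image input.

```python
# ocr_service/app/main.py (sketch): route name and options are illustrative.
import tempfile
from pathlib import Path

from fastapi import FastAPI, UploadFile
from paddleocr import PaddleOCR

app = FastAPI()

# Load the (fairly heavy) detection/recognition models once at startup,
# not per request.
ocr = PaddleOCR(use_angle_cls=True, lang="en")

@app.post("/ocr")
async def run_ocr(file: UploadFile):
    # PaddleOCR wants a file path, so spool the upload to a temp file.
    suffix = Path(file.filename or "doc.png").suffix
    with tempfile.NamedTemporaryFile(suffix=suffix) as tmp:
        tmp.write(await file.read())
        tmp.flush()
        result = ocr.ocr(tmp.name, cls=True)

    # Each page is a list of (bounding_box, (text, confidence)) lines.
    lines = [text for page in result if page for _box, (text, _conf) in page]
    return {"text": "\n".join(lines)}
```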


🚦 The Flow Looks Like This:

📁 Paperless stores docs in its DB
⬇️
🤖 I run my pipeline CLI: `docker compose run pipeline 42`
⬇️
📥 pipeline downloads the doc via Paperless API (by ID)
⬇️
📤 Sends it to the OCR microservice over HTTP
⬇️
🧠 Gets back clean OCR’d text
⬇️
🧠 Sends text to Ollama (LLM) to generate:
    - title
    - document type
⬇️
🔁 Updates Paperless document via PATCH
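
Condensed into code, that whole flow is just a handful of HTTP calls. Here's a simplified sketch of the pipeline CLI; the service URLs, environment variable names, and the inline prompt are illustrative (the real pipeline keeps its prompt in prompts/classify_title.txt), while the Paperless-ngx endpoints (token auth, document download, PATCH) and Ollama's /api/generate are their documented APIs.

```python
# pipeline_service/app/main.py (sketch): service URLs and env var names are
# illustrative; the Paperless and Ollama endpoints are their documented APIs.
import os
import sys

import requests

PAPERLESS = os.environ.get("PAPERLESS_URL", "http://paperless:8000")
OCR = os.environ.get("OCR_URL", "http://ocr-service:8000")
OLLAMA = os.environ.get("OLLAMA_URL", "http://ollama:11434")
HEADERS = {"Authorization": f"Token {os.environ['PAPERLESS_TOKEN']}"}

def process(doc_id: int) -> None:
    # 1. Download the original document from Paperless by ID.
    pdf = requests.get(f"{PAPERLESS}/api/documents/{doc_id}/download/",
                       headers=HEADERS).content

    # 2. Send it to the OCR microservice, get clean text back.
    text = requests.post(f"{OCR}/ocr",
                         files={"file": ("doc.pdf", pdf)}).json()["text"]

    # 3. Ask the local LLM for a title and a document type.
    prompt = ("Suggest a short descriptive title and a document type "
              f"(invoice, id, tax, ...) for this document:\n\n{text[:4000]}")
    answer = requests.post(f"{OLLAMA}/api/generate",
                           json={"model": "mistral", "prompt": prompt,
                                 "stream": False}).json()["response"]
    title = answer.splitlines()[0]  # real code parses a structured reply

    # 4. PATCH the results back onto the Paperless document.
    requests.patch(f"{PAPERLESS}/api/documents/{doc_id}/",
                   headers=HEADERS, json={"title": title})

if __name__ == "__main__":
    process(int(sys.argv[1]))  # e.g. `docker compose run pipeline 42`
```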

🐳 Everything Runs in Docker

Here's the final list of containers:

  • paperless → standard Paperless-ngx

  • redis → required by Paperless

  • ollama → runs local LLMs like Mistral

  • ocr-service → FastAPI + PaddleOCR

  • pipeline → command-line microservice that ties it all together

🗺 Architecture Diagram

                   ┌────────────────────────┐
                   │ 📄 Paperless-ngx (UI)  │
                   └────────────┬───────────┘
                                │
                    [User notes Document ID]
                                │
                   ┌────────────▼────────────┐
                   │   🐍 Pipeline Service    │
                   │ (LLM & Orchestration)   │
                   └────────────┬────────────┘
                                │
           ┌────────────────────┼────────────────────┐
           │                    │                    │
           ▼                    ▼                    ▼
  ┌────────────────┐   ┌────────────────┐   ┌─────────────────────┐
  │ Downloads PDF  │   │ Sends to OCR   │   │ Sends OCR text to   │
  │ via Paperless  │   │ microservice   │   │ Ollama LLM (Mistral)│
  └────────────────┘   └────────────────┘   └─────────────────────┘
                                                │
                     ◀────────────┬─────────────┘
                                  ▼
                       📝 Title + Type Prediction
                                  │
                       🔁 PATCH back to Paperless
                       (update metadata + text)

😵 What Gave Me Trouble

  • Paperless consumes files from consume/ automatically and moves them — I worked around that by operating only on doc IDs via the API, since at this stage I was focused on adding the OCR/AI features. Handling the consume flow properly is high on my TODO list.

  • PaddleOCR kept re-downloading models on every container start — fixed by caching them in a mounted volume.
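
The caching fix, for reference: PaddleOCR 2.x downloads its models under ~/.paddleocr by default, which disappears with the container. Pointing the engine at a mounted volume (the model-cache folder in the structure below) makes the download a one-time cost. A sketch, assuming PaddleOCR 2.x and with illustrative paths:

```python
# ocr_service/app/ocr_engine.py (sketch): keep models on a mounted volume
# so they survive container rebuilds. Paths are illustrative.
from paddleocr import PaddleOCR

MODEL_CACHE = "/model-cache"  # mounted from ./model-cache in compose

ocr = PaddleOCR(
    lang="en",
    det_model_dir=f"{MODEL_CACHE}/det",  # text detection model
    rec_model_dir=f"{MODEL_CACHE}/rec",  # text recognition model
    cls_model_dir=f"{MODEL_CACHE}/cls",  # orientation classifier
)
```

With the volume in place, the models are fetched once on first run and reused by every container after that.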

📦 Folder Structure

├── docker-compose.yml
├── __init__.py
├── model-cache
├── ocr_service
│   ├── app
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── ocr_config.yaml
│   │   └── ocr_engine.py
│   ├── Dockerfile
│   └── requirements.txt
├── paperless-data
│   ├── consume
│   ├── data
│   │   ├── db.sqlite3
│   │   ├── index
│   │   │   ├── _MAIN_63.toc
│   │   │   ├── MAIN_9v88o8vye3gbqub1.seg
│   │   │   ├── MAIN_ui8jrpftrvauh4n1.seg
│   │   │   ├── MAIN_wceagdbh71brm5wg.seg
│   │   │   └── MAIN_WRITELOCK
│   │   ├── log
│   │   │   └── celery.log.1
│   │   └── migration_lock
│   └── media
│       ├── documents
│       │   ├── archive
│       │   │   ├── 0000009.pdf
│       │   │   └── 0000016.pdf
│       │   ├── originals
│       │   │   ├── 0000009.jpg
│       │   │   └── 0000016.pdf
│       │   └── thumbnails
│       │       ├── 0000009.webp
│       │       └── 0000016.webp
│       └── media.lock
├── pipeline_service
│   ├── app
│   │   ├── api_client.py
│   │   ├── __init__.py
│   │   ├── llm_processor.py
│   │   ├── main.py
│   │   └── watcher.py
│   ├── Dockerfile
│   ├── __init__.py
│   ├── logger.py
│   ├── prompts
│   │   └── classify_title.txt
│   ├── requirements.txt
│   └── test.py
└── README.md

✅ What's Next?

I might:

  • Run the pipeline automatically when a new doc lands (one possible polling approach is sketched after this list)

  • Add authentication for OCR and pipeline service (utilizing paperless-ngx's token auth?)

  • Improve the OCR service's performance, perhaps by porting it to another language (Go, Rust)

  • Add document summarization via LLM

  • Extract metadata like amount, date, sender

  • Hook into Paperless tags and correspondents
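
For the first item, here's a hedged sketch of what a poller could look like (the real watcher.py may well differ). It reuses the process() function and env vars from the pipeline sketch above, plus Paperless's documented documents list endpoint:

```python
# pipeline_service/app/watcher.py (sketch): trigger the pipeline when a new
# document appears, by polling the Paperless API. A real version would
# persist the last-seen IDs instead of keeping them in memory.
import os
import time

import requests

PAPERLESS = os.environ.get("PAPERLESS_URL", "http://paperless:8000")
HEADERS = {"Authorization": f"Token {os.environ['PAPERLESS_TOKEN']}"}

def watch(poll_seconds: int = 30) -> None:
    seen: set[int] = set()
    first_pass = True
    while True:
        resp = requests.get(f"{PAPERLESS}/api/documents/",
                            headers=HEADERS,
                            params={"ordering": "-added"}).json()
        for doc in resp["results"]:
            if doc["id"] in seen:
                continue
            seen.add(doc["id"])
            if not first_pass:       # skip the existing backlog on startup
                process(doc["id"])   # from the pipeline sketch above
        first_pass = False
        time.sleep(poll_seconds)
```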

🏁 Wrapping Up

If you need a document management system:

  • That ingests all kinds of docs: PDFs, images, etc.
  • That reliably extracts text from varied kinds of docs.
  • That classifies, tags, and summarizes docs using an LLM.
  • That keeps stuff private, with no exposure to external LLMs.
  • That comes loaded with paperless-ngx's features.

Then this might be an appealing setup.

It’s Python all the way down. No rocket science — just containers, OCR, and a private LLM.

Questions and suggestions are welcome.
