Building a Private AI Document Pipeline with Paperless, PaddleOCR, and LLMs


So this past week I hacked together a little side project to smarten up my Paperless-ngx setup — you know, that self-hosted document management system that eats PDFs and makes them searchable.

Now, Paperless-ngx is solid, don’t get me wrong. But it uses Tesseract for OCR, and honestly... Tesseract struggles with anything that isn't clean, well-scanned text.

So this evolved out of a need to improve paperless-ngx's OCR capability and to properly classify documents and extract tags, titles, and summaries for them.

This blog is a walkthrough of what I built — what worked, what didn’t, and how it turned into a pretty neat little pipeline with its own microservices. Hope it gives you some ideas.


💡 What I Was Going For

I wanted a system that:

  • Uses PaddleOCR instead of Tesseract for better OCR output
  • Runs a local LLM using Ollama to:
    • Suggest a smart document title
    • Classify the document into a type (invoice, id, tax, etc.)
  • Pushes that back to Paperless so the doc is nicely searchable and tagged

Everything stays local/private. No exposure to external LLMs. Just Python, containers, and Docker Compose to stitch everything together.


🔧 How It All Works (Now)

After a few iterations, I ended up with a clean microservice setup with each part doing its job:

  1. paperless-ngx: The main document management system (already amazing)

  2. ollama: Runs a local LLM like Mistral (or phi3, or any model configurable via env), no cloud stuff

  3. ocr-service: FastAPI service that runs PaddleOCR

  4. pipeline: Python CLI that connects all the dots — downloads doc from Paperless, sends it to OCR and LLM, then updates Paperless with the smart results

That way, each service does one thing and does it well, and each can be enhanced in isolation. They all run in Docker, talk to each other over the same network, and together make one smooth, local AI document workflow.

This leverages paperless-ngx's extensive feature set and augments it with better OCR and LLM-based classification.
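
To make this concrete, here is roughly what the ocr-service boils down to: a FastAPI app wrapping PaddleOCR. This is a minimal sketch, not the real main.py; the `/ocr` route name and the OCR options are illustrative (the actual service reads its settings from ocr_config.yaml), and it assumes image input.

```python
# ocr_service/app/main.py (sketch): route name and options are illustrative.
import tempfile
from pathlib import Path

from fastapi import FastAPI, UploadFile
from paddleocr import PaddleOCR

app = FastAPI()

# Load the (fairly heavy) detection/recognition models once at startup,
# not per request.
ocr = PaddleOCR(use_angle_cls=True, lang="en")

@app.post("/ocr")
async def run_ocr(file: UploadFile):
    # PaddleOCR wants a file path, so spool the upload to a temp file.
    suffix = Path(file.filename or "doc.png").suffix
    with tempfile.NamedTemporaryFile(suffix=suffix) as tmp:
        tmp.write(await file.read())
        tmp.flush()
        result = ocr.ocr(tmp.name, cls=True)

    # Each page is a list of (bounding_box, (text, confidence)) lines.
    lines = [text for page in result if page for _box, (text, _conf) in page]
    return {"text": "\n".join(lines)}
```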


🚦 The Flow Looks Like This:

📁 Paperless stores docs in its DB
⬇️
🤖 I run my pipeline CLI: `docker compose run pipeline 42`
⬇️
📥 pipeline downloads the doc via Paperless API (by ID)
⬇️
📤 Sends it to the OCR microservice over HTTP
⬇️
🧠 Gets back clean OCR’d text
⬇️
🧠 Sends text to Ollama (LLM) to generate:
    - title
    - document type
⬇️
🔁 Updates Paperless document via PATCH
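
Condensed into code, that whole flow is just a handful of HTTP calls. Here's a simplified sketch of the pipeline CLI; the service URLs, environment variable names, and the inline prompt are illustrative (the real pipeline keeps its prompt in prompts/classify_title.txt), while the Paperless-ngx endpoints (token auth, document download, PATCH) and Ollama's /api/generate are their documented APIs.

```python
# pipeline_service/app/main.py (sketch): service URLs and env var names are
# illustrative; the Paperless and Ollama endpoints are their documented APIs.
import os
import sys

import requests

PAPERLESS = os.environ.get("PAPERLESS_URL", "http://paperless:8000")
OCR = os.environ.get("OCR_URL", "http://ocr-service:8000")
OLLAMA = os.environ.get("OLLAMA_URL", "http://ollama:11434")
HEADERS = {"Authorization": f"Token {os.environ['PAPERLESS_TOKEN']}"}

def process(doc_id: int) -> None:
    # 1. Download the original document from Paperless by ID.
    pdf = requests.get(f"{PAPERLESS}/api/documents/{doc_id}/download/",
                       headers=HEADERS).content

    # 2. Send it to the OCR microservice, get clean text back.
    text = requests.post(f"{OCR}/ocr",
                         files={"file": ("doc.pdf", pdf)}).json()["text"]

    # 3. Ask the local LLM for a title and a document type.
    prompt = ("Suggest a short descriptive title and a document type "
              f"(invoice, id, tax, ...) for this document:\n\n{text[:4000]}")
    answer = requests.post(f"{OLLAMA}/api/generate",
                           json={"model": "mistral", "prompt": prompt,
                                 "stream": False}).json()["response"]
    title = answer.splitlines()[0]  # real code parses a structured reply

    # 4. PATCH the results back onto the Paperless document.
    requests.patch(f"{PAPERLESS}/api/documents/{doc_id}/",
                   headers=HEADERS, json={"title": title})

if __name__ == "__main__":
    process(int(sys.argv[1]))  # e.g. `docker compose run pipeline 42`
```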

🐳 Everything Runs in Docker

Here's the final list of containers:

  • paperless → standard Paperless-ngx

  • redis → required by Paperless

  • ollama → runs local LLMs like Mistral

  • ocr-service → FastAPI + PaddleOCR

  • pipeline → command-line microservice that ties it all together

🗺 Architecture Diagram

                   ┌────────────────────────┐
                   │ 📄 Paperless-ngx (UI)  │
                   └────────────┬───────────┘
                                │
                    [User notes Document ID]
                                │
                   ┌────────────▼────────────┐
                   │   🐍 Pipeline Service    │
                   │ (LLM & Orchestration)   │
                   └────────────┬────────────┘
                                │
           ┌────────────────────┼────────────────────┐
           │                    │                    │
           ▼                    ▼                    ▼
  ┌────────────────┐   ┌────────────────┐   ┌─────────────────────┐
  │ Downloads PDF  │   │ Sends to OCR   │   │ Sends OCR text to   │
  │ via Paperless  │   │ microservice   │   │ Ollama LLM (Mistral)│
  └────────────────┘   └────────────────┘   └─────────────────────┘
                                                │
                     ◀────────────┬─────────────┘
                                  ▼
                       📝 Title + Type Prediction
                                  │
                       🔁 PATCH back to Paperless
                       (update metadata + text)

😵 What Gave Me Trouble

  • Paperless consumes files from consume/ automatically and moves them — I worked around that by operating only on doc IDs via the API, since at this stage I was focused on adding the OCR/AI features. Handling the consume flow properly is high on my TODO list.

  • PaddleOCR kept re-downloading models on every container start — fixed by caching them in a mounted volume.
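
The caching fix, for reference: PaddleOCR 2.x downloads its models under ~/.paddleocr by default, which disappears with the container. Pointing the engine at a mounted volume (the model-cache folder in the structure below) makes the download a one-time cost. A sketch, assuming PaddleOCR 2.x and with illustrative paths:

```python
# ocr_service/app/ocr_engine.py (sketch): keep models on a mounted volume
# so they survive container rebuilds. Paths are illustrative.
from paddleocr import PaddleOCR

MODEL_CACHE = "/model-cache"  # mounted from ./model-cache in compose

ocr = PaddleOCR(
    lang="en",
    det_model_dir=f"{MODEL_CACHE}/det",  # text detection model
    rec_model_dir=f"{MODEL_CACHE}/rec",  # text recognition model
    cls_model_dir=f"{MODEL_CACHE}/cls",  # orientation classifier
)
```

With the volume in place, the models are fetched once on first run and reused by every container after that.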

📦 Folder Structure

├── docker-compose.yml
├── __init__.py
├── model-cache
├── ocr_service
│   ├── app
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── ocr_config.yaml
│   │   └── ocr_engine.py
│   ├── Dockerfile
│   └── requirements.txt
├── paperless-data
│   ├── consume
│   ├── data
│   │   ├── db.sqlite3
│   │   ├── index
│   │   │   ├── _MAIN_63.toc
│   │   │   ├── MAIN_9v88o8vye3gbqub1.seg
│   │   │   ├── MAIN_ui8jrpftrvauh4n1.seg
│   │   │   ├── MAIN_wceagdbh71brm5wg.seg
│   │   │   └── MAIN_WRITELOCK
│   │   ├── log
│   │   │   └── celery.log.1
│   │   └── migration_lock
│   └── media
│       ├── documents
│       │   ├── archive
│       │   │   ├── 0000009.pdf
│       │   │   └── 0000016.pdf
│       │   ├── originals
│       │   │   ├── 0000009.jpg
│       │   │   └── 0000016.pdf
│       │   └── thumbnails
│       │       ├── 0000009.webp
│       │       └── 0000016.webp
│       └── media.lock
├── pipeline_service
│   ├── app
│   │   ├── api_client.py
│   │   ├── __init__.py
│   │   ├── llm_processor.py
│   │   ├── main.py
│   │   └── watcher.py
│   ├── Dockerfile
│   ├── __init__.py
│   ├── logger.py
│   ├── prompts
│   │   └── classify_title.txt
│   ├── requirements.txt
│   └── test.py
└── README.md

✅ What's Next?

I might:

  • Run the pipeline automatically when a new doc lands (one possible polling approach is sketched after this list)

  • Add authentication for OCR and pipeline service (utilizing paperless-ngx's token auth?)

  • Improve the OCR service's performance, perhaps by porting it to another language (Go, Rust)

  • Add document summarization via LLM

  • Extract metadata like amount, date, sender

  • Hook into Paperless tags and correspondents
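
For the first item, here's a hedged sketch of what a poller could look like (the real watcher.py may well differ). It reuses the process() function and env vars from the pipeline sketch above, plus Paperless's documented documents list endpoint:

```python
# pipeline_service/app/watcher.py (sketch): trigger the pipeline when a new
# document appears, by polling the Paperless API. A real version would
# persist the last-seen IDs instead of keeping them in memory.
import os
import time

import requests

PAPERLESS = os.environ.get("PAPERLESS_URL", "http://paperless:8000")
HEADERS = {"Authorization": f"Token {os.environ['PAPERLESS_TOKEN']}"}

def watch(poll_seconds: int = 30) -> None:
    seen: set[int] = set()
    first_pass = True
    while True:
        resp = requests.get(f"{PAPERLESS}/api/documents/",
                            headers=HEADERS,
                            params={"ordering": "-added"}).json()
        for doc in resp["results"]:
            if doc["id"] in seen:
                continue
            seen.add(doc["id"])
            if not first_pass:       # skip the existing backlog on startup
                process(doc["id"])   # from the pipeline sketch above
        first_pass = False
        time.sleep(poll_seconds)
```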

🏁 Wrapping Up

If you need a document management system:

  • That ingests all kinds of docs: PDFs, images, etc.
  • That reliably extracts text from varied kinds of docs.
  • That classifies, tags, and summarizes docs using an LLM.
  • That keeps stuff private, with no exposure to external LLMs.
  • That comes loaded with paperless-ngx's features.

Then this might be an appealing setup.

It’s Python all the way down. No rocket science — just containers, OCR, and a private LLM.

Questions and suggestions are welcome.
