Setting Up Your Self-Hosted AI Stack - Part 2: Document Processing and RAG with Apache Tika and Qdrant Vector Database

Farzam Mohammadi

What we're building today

We're adding document processing capabilities to the foundation from Part 1. This extends your chat interface to work with your files and documents.

Two key additions make this possible:

  • Document Processing: Extract text from documents like PDFs, Word docs, and text files using Apache Tika (with OCR capabilities for scanned documents)

  • RAG (Retrieval-Augmented Generation): Search and retrieve relevant information from your documents to answer questions

This transforms your local AI from conversation-only to document-aware. Upload a file, ask questions about its contents, and get answers grounded in your actual documents rather than the model's training data.

The bigger picture

I'm building this self-hosted AI stack and documenting everything as I go. Part 2 is where the real power emerges: we're adding the intelligence layer that transforms your local AI from a chat toy into a genuine knowledge assistant.

Here's the thing: even with today's resources, implementing proper document processing and RAG together is surprisingly complex. This guide captures the lessons learned getting both capabilities production ready.

What we've built so far:

  • Part 1: Building the Foundation: Ollama, Open WebUI, PostgreSQL, and pgAdmin running locally with Docker Compose

What's still coming:

  • Part 3: Visual Data Extraction: Web upload interface with N8N workflows and Vision Language Models for extracting structured data from images

  • Part 4: Model Superpowers: Advanced WebUI configuration with tools and knowledge integration

  • Part 5: Intelligent Automation: WebUI filters and N8N workflows for content processing

Prerequisites

  • Part 1 foundation stack running

  • Ollama running (verify with ollama serve; if you see "address already in use", it's already running)

  • 24GB+ RAM recommended (16GB minimum)

  • 50GB+ storage for documents and vectors

  • Docker and Docker Compose

Important: If Part 1 is running, stop it first with cd ../part-1-building-the-foundation && docker compose down before starting Part 2.

Credits where credits are due

Massive thanks to the creators and contributors of the open source projects that make this possible: Ollama, Open WebUI, PostgreSQL, pgAdmin, Apache Tika (with Tesseract OCR), Qdrant, and the nomic-embed-text model.

Quick start (skip explanations, just get it running)

  1. Download the embedding model (~275MB):

     ollama pull nomic-embed-text
    
  2. Clone the repository or pull latest:

     git clone https://github.com/FarzamMohammadi/self-hosted-ai-stack
     # Or if you already have it: git pull origin main
    
  3. Navigate to part 2:

     cd part-2-rag-with-tika-and-qdrant
    
  4. Start the enhanced stack:

     docker compose up -d
    

Done. Open http://localhost:3000, sign in, and you'll see a document upload button (+) in the bottom left corner of the chat. Upload a PDF and start asking questions about it.

Important: Open WebUI handles files through two distinct pathways:

Document Processing (what we built): Extracts text from documents like PDFs, Word docs, and text files through Tika. Does NOT work with standalone image files (PNG, JPEG, etc.).

Image Analysis (not covered here): For standalone image files, you'll need to download and use a vision model.

Understanding what you built

Let's break down what we just built and why each component matters. Understanding the architecture transforms you from someone following steps into someone who owns the system.

What are document processing and RAG, and why you need both

Your AI model was trained on data from months or years ago. It doesn't know about your company's latest reports, personal documents, or today's news. Ask about quarterly financials, and it apologizes for lacking access.

Document Processing uses Apache Tika to extract machine-readable text directly from documents like PDFs, Word docs, and other formats by parsing their internal structure. For scanned documents or image-based content within PDFs, Tika automatically falls back to OCR using Tesseract when needed.

RAG (Retrieval-Augmented Generation) solves the knowledge gap by giving your AI instant access to search through your documents for relevant information.

Together, they create a document intelligence system that processes documents and provides grounded answers.

The transformation:

| Without Document Processing + RAG | With Document Processing + RAG |
| --- | --- |
| "I can't access your documents." | "Based on your invoice PDF, the total amount due is $2,847." |
| "I don't have access to your company's financial data." | "Based on your Q3 report, revenue increased 23% compared to last quarter." |
| Limited to conversation only | Processes documents like PDFs, Word docs, text files |
| Operates with frozen training data | Searches YOUR content for relevant information |
| Generic responses, frequent "I don't have access" | Accurate answers with source attribution |

Real examples:

Document processing: "What's the total on this invoice?"

  • Without document processing: "I can't read documents or extract text from files."

  • With document processing + RAG: "Based on your invoice PDF, the total amount due is $2,847, with a due date of March 15th."

Document analysis: "What were our biggest challenges last quarter?"

  • Without RAG: "I don't have access to your quarterly reports."

  • With document processing + RAG: "Based on your Q3 report, the biggest challenges were supply chain delays (mentioned 8 times) and staffing shortages in the manufacturing division (pages 12-14)."

This transforms your AI from a chat interface into a document intelligence system that reads, understands, and answers questions about your documents.

How vector similarity works

Traditional search looks for exact word matches. Search for "machine learning performance" and you'll miss documents about "AI optimization" or "model efficiency" entirely.

Embeddings solve this: mathematical representations that capture semantic meaning.

Think of embeddings as GPS coordinates for concepts:

  • "Machine learning" → [0.2, 0.8, 0.1, ...] (768 dimensions)

  • "AI optimization" → [0.3, 0.7, 0.2, ...]

  • "Car repair" → [0.9, 0.1, 0.3, ...]

Closer coordinates mean more similar meaning. When nomic-embed-text converts "reduce costs" into a 768-dimensional vector, it positions near related concepts:

  • "Budget optimization" (distance: 0.23)

  • "Expense management" (distance: 0.31)

  • "Financial efficiency" (distance: 0.28)

This mathematical precision ensures RAG retrieves conceptually relevant information beyond keyword matches.
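
You can look at one of these vectors yourself once the embedding model is pulled. A minimal check against Ollama's embeddings endpoint (assuming Ollama is on its default port 11434; jq is only used to count the dimensions and is an optional dependency):

# Ask Ollama to embed a phrase with nomic-embed-text and count the dimensions
curl -s http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "reduce costs"}' \
  | jq '.embedding | length'

The response contains a single embedding array, and the command above should print 768.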

Vector search visualization showing how concepts cluster in high-dimensional space. Related concepts like "Kitten" and "Cat" appear close together in mathematical space, enabling semantic search that finds meaning beyond exact keyword matches.

Source: https://odsc.medium.com/a-gentle-introduction-to-vector-search-3c0511bc6771

Vector databases store and search embeddings

Once you have embeddings, you need somewhere to store and search them efficiently. Regular databases aren't designed for similarity search.

Traditional database (SQL):

SELECT * FROM documents WHERE title CONTAINS 'machine learning'

Finds: Documents with exact phrase "machine learning" in title

Vector database:

SEARCH FOR documents SIMILAR TO embedding([0.2, 0.8, 0.1, ...])

Finds: Documents about AI, neural networks, model training, performance optimization, etc.

Vector databases use specialized algorithms to find similar vectors among millions quickly. Semantic search understands intent:

  • Search: "reduce costs" → Finds: "budget optimization", "expense management", "financial efficiency"

  • Search: "team problems" → Finds: "communication challenges", "staff conflicts", "collaboration issues"

How the complete RAG flow works

Document Processing (Setup Phase): Upload document → Apache Tika server extracts machine-readable text from documents like PDFs, Word docs, and text files (using OCR via Tesseract for scanned content when needed) → Open WebUI splits text into 1000-character chunks → nomic-embed-text model (via Ollama) converts each chunk into 768-dimensional vectors → Qdrant vector database stores vectors with original text for fast similarity search

Query Processing (When You Ask Questions): Question → nomic-embed-text model converts question to 768-dimensional vector → Qdrant vector database searches and finds 5 most similar document chunks → Open WebUI compiles relevant text excerpts → Enhanced prompt (question + retrieved excerpts) sent to Ollama LLM → AI generates grounded response with source attribution

The complete document processing + RAG orchestration: Open WebUI handles this entire workflow behind the scenes:

  • Apache Tika extracts text from documents like PDFs, Word docs, and text files (with automatic OCR fallback for scanned content)

  • nomic-embed-text converts extracted content into 768-dimensional vectors

  • Qdrant stores and indexes vectors for fast similarity search

  • Open WebUI orchestrates the complete pipeline from upload to intelligent response

What typically requires complex document processing and RAG engineering becomes simple upload and query. Your documents transform into a searchable knowledge base within seconds.
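
To make the final step concrete, here is roughly what an "enhanced prompt" looks like if you send one to Ollama's generate API by hand. This is only an illustration; Open WebUI's actual prompt template, system instructions, and parameters differ:

# Question plus retrieved excerpts, sent as a single prompt to the LLM
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Use the following excerpts to answer.\n\nExcerpt 1: Total amount due: $2,847, payable by March 15th.\n\nQuestion: What is the total on this invoice?",
  "stream": false
}'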

Setting up the document processing and RAG components

Now that you understand the concepts, let's examine the technical implementation that makes everything work.

System requirements

| Component | Memory | Capabilities | Storage Impact |
| --- | --- | --- | --- |
| Apache Tika | 4GB | Text extraction from documents like PDFs, Word docs, and text files (with OCR for scanned content) | ~2MB per processed document |
| nomic-embed-text | 275MB | Document embedding generation | Creates 768-dimensional vectors |
| Qdrant | 2GB | Vector storage and similarity search | ~50KB per document chunk |
| Overall system | 24GB+ | Process 100+ documents concurrently | Scales to 10,000+ documents |
| Storage growth | 50GB+ | Raw documents + vector indexes | ~5MB total per average business document |

Performance estimates are rough approximations based on system understanding and real-world usage patterns.

Putting it all together

Here's our enhanced docker-compose.yml that adds complete document processing and RAG capabilities to our foundation:

version: '3.8'

services:
  postgres:
    image: postgres:15-alpine
    container_name: postgres
    ports:
      - '5432:5432'
    environment:
      - POSTGRES_DB=openwebui
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=securepassword123
    volumes:
      - ./volumes/postgres/data:/var/lib/postgresql/data
    restart: unless-stopped
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U postgres -d openwebui']
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - local-ai-network

  pgadmin:
    image: dpage/pgadmin4:latest
    container_name: pgadmin
    ports:
      - '5050:80'
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@local.ai
      - PGADMIN_DEFAULT_PASSWORD=admin123
      - PGADMIN_CONFIG_SERVER_MODE=False
      - PGADMIN_CONFIG_MASTER_PASSWORD_REQUIRED=False
    volumes:
      - ./volumes/pgadmin:/var/lib/pgadmin
    restart: unless-stopped
    depends_on:
      postgres:
        condition: service_healthy
    networks:
      - local-ai-network

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: webui
    ports:
      - '3000:8080'
    volumes:
      - ./volumes/open-webui/data:/app/backend/data
    environment:
      # Ollama connection
      - OLLAMA_BASE_URL=http://host.docker.internal:11434

      # Database connection
      - DATABASE_URL=postgresql://postgres:securepassword123@postgres:5432/openwebui

      # Basic settings
      - WEBUI_SECRET_KEY=your-secret-key-here
      - WEBUI_AUTH=true
      - ENABLE_SIGNUP=true
      - DEFAULT_MODELS=qwen3:8b

      # Document Processing + RAG Configuration (NEW)
      - UPLOAD_DIR=/app/backend/data/uploads
      - ENABLE_PERSISTENT_CONFIG=false
      - CONTENT_EXTRACTION_ENGINE=tika
      - TIKA_SERVER_URL=http://tika:9998
      - VECTOR_DB=qdrant
      - QDRANT_URI=http://qdrant:6333/
      - QDRANT_COLLECTION_PREFIX=document_chunks
      - RAG_EMBEDDING_MODEL=nomic-embed-text
      - RAG_EMBEDDING_ENGINE=ollama
      - ENABLE_RAG_HYBRID_SEARCH=false
      - RAG_RELEVANCE_THRESHOLD=0.75
      - CHUNK_SIZE=1000
      - CHUNK_OVERLAP=100

    extra_hosts:
      - 'host.docker.internal:host-gateway'
    restart: unless-stopped
    depends_on:
      postgres:
        condition: service_healthy
    networks:
      - local-ai-network

  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    ports:
      - '6333:6333'
      - '6334:6334'
    environment:
      - QDRANT__SERVICE__HTTP_PORT=6333
      - QDRANT__SERVICE__GRPC_PORT=6334
    volumes:
      - ./volumes/qdrant/storage:/qdrant/storage
    restart: unless-stopped
    networks:
      - local-ai-network

  tika:
    image: apache/tika:latest-full
    container_name: tika-server
    ports:
      - '9998:9998'
    environment:
      - TIKA_CONFIG=/opt/tika/config/tika-config.xml
      - JAVA_OPTS=-Xmx2g -Xms512m -Dfile.encoding=UTF-8 -Djava.awt.headless=true
      - TIKA_OCR_LANGUAGE=eng
      - TIKA_PDF_OCR_STRATEGY=OCR_AND_TEXT_EXTRACTION
    volumes:
      - ../shared/tika/config:/opt/tika/config
    command:
      - --host=0.0.0.0
      - --port=9998
      - --config=/opt/tika/config/tika-config.xml
    restart: unless-stopped
    networks:
      - local-ai-network

networks:
  local-ai-network:
    driver: bridge

What's new from Part 1:

This Docker Compose builds on your existing Part 1 setup by adding two key containers:

  • Qdrant: Vector database for storing document embeddings

  • Apache Tika: Text extraction engine for documents like PDFs, Word docs, and text files

Key configuration changes:

  • ENABLE_PERSISTENT_CONFIG=false: Ensures these environment variables always take effect instead of configuration values persisted from Part 1

  • CONTENT_EXTRACTION_ENGINE=tika: Routes document processing through the new Tika container

  • RAG_RELEVANCE_THRESHOLD=0.75: Filters out retrieved chunks whose relevance score falls below 0.75

  • QDRANT_COLLECTION_PREFIX=document_chunks: Names your vector storage collections

  • TIKA_PDF_OCR_STRATEGY=OCR_AND_TEXT_EXTRACTION: Extracts machine-readable text first, then uses OCR for scanned content within documents

  • CHUNK_SIZE=1000 & CHUNK_OVERLAP=100: Document splitting settings for better search

Tika Configuration

The repository includes a pre-configured tika-config.xml file mounted into the Tika container via the volume mapping ../shared/tika/config:/opt/tika/config shown in the tika service above. This configuration contains simple but sensible settings for document processing, including PDF text extraction and OCR fallback, plus limits to keep resource usage in check while documents are processed.
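
If you want to sanity-check the Tika container on its own, you can push a document straight to its extraction endpoint and inspect the raw text it returns (sample.pdf is a placeholder for any document you have on hand):

# Send a local file to Tika and get plain-text extraction back
curl -s -T sample.pdf http://localhost:9998/tika -H "Accept: text/plain"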

Testing your document processing + RAG setup

Time to verify everything works as expected. Let's systematically test each component:

1. Container health check

docker compose ps

All five containers should show as running:

  • postgres (healthy)

  • pgadmin

  • webui

  • qdrant

  • tika-server

2. Verify service endpoints

Qdrant Database:
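
A quick check (assuming the default port mapping from the compose file) is to list its collections over the REST API; a healthy instance answers with a JSON object containing a, possibly empty, collections list:

curl http://localhost:6333/collections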

Tika server:
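
To confirm Tika is up, ask for its version string; it should respond with the Apache Tika version the container is running:

curl http://localhost:9998/version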

Ollama with both models:

ollama list

Should list both qwen3:8b and nomic-embed-text.

3. Upload and test different document types

  1. Navigate to http://localhost:3000

  2. Sign in with your existing account

  3. Look for the document upload icon in the bottom left corner of the chat

  4. Upload documents: PDFs, Word docs, or text files

  5. Wait for processing to complete

  6. Ask specific questions about the document content

Test questions for document processing:

  • "What's the total amount on this invoice?"

  • "Extract the key information from this document"

  • "What are the main points in this document?"

Test questions for RAG analysis:

  • "What are the main topics covered in this document?"

  • "Summarize the key findings"

  • "What does the document say about [specific topic]?"

4. Verify vector storage

Check that Qdrant is storing your document vectors: http://localhost:6333/dashboard#/collections

You should see collections created for your uploaded documents.

5. View extracted text (optional)

Want to see exactly what text was extracted from your uploaded documents? This helps verify text extraction accuracy (including OCR when used) and understand what content the AI uses.

  1. Visit http://localhost:5050

  2. Login with admin@local.ai / admin123

  3. If you haven't already connected to the database, add a server connection:

    1. Right-click on Servers → Register → Server

    2. General tab → Name: local-ai

    3. Connection tab:

      • Host name/address: postgres

      • Port: 5432

      • Maintenance database: postgres

      • Username: postgres

      • Password: securepassword123

    4. Click Save

  4. Navigate to local-ai → Databases → openwebui → Schemas → public → Tables → file

  5. Right-click on the file table and select View/Edit Data → All Rows

  6. Look at the data column to see the extracted text from your uploaded documents (or run the psql one-liner shown below)
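
If you'd rather skip the clicking, the same information is available from the command line. A minimal sketch using psql inside the postgres container (it relies only on the filename and data columns described above; the left(...) cast just keeps the preview short):

docker compose exec postgres psql -U postgres -d openwebui \
  -c "SELECT filename, left(data::text, 200) AS preview FROM file;"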

What you'll see:

  • The filename column shows your original file names

  • The data column contains the full extracted text from each document

  • This is exactly what the AI uses to answer your questions about the documents

Troubleshooting tip: If the data column is empty or contains gibberish, the document may be corrupted or contain primarily images. This setup works best with standard document formats and doesn't process standalone image files.

Troubleshooting common issues

| Issue | Quick Fix | Details |
| --- | --- | --- |
| Document upload fails | Check docker compose logs tika-server | Sweet spot: 5 to 15MB files |
| No RAG responses | Verify ollama list includes both models | First upload takes 2 to 3 times longer |
| RAG responses seem inaccurate | Check chunk size and overlap settings | Adjust CHUNK_SIZE=500 for technical docs |
| Tika memory errors | Increase to -Xmx6g or -Xmx8g | 4GB handles most documents |
| No text extracted | Standalone image file uploaded (PNG, JPEG, etc.) | Use vision models for standalone image files |
| System memory issues | Monitor with docker stats | 16GB minimum, 24GB comfortable |
| Vector search too slow | Check Qdrant storage space | Database performance degrades when disk is full |

What's next

You've successfully built a document intelligence system with full document processing and RAG capabilities! Your self-hosted stack now includes:

  • Ollama serving your LLM and embedding models

  • Open WebUI with complete document intelligence

  • PostgreSQL storing conversations and configurations

  • Qdrant powering semantic document search and retrieval

  • Apache Tika providing text extraction from documents like PDFs, Word docs, and text files (with OCR capabilities for scanned content)

Coming up in Part 3

We'll build a visual data extraction pipeline with:

  • Web upload interface for batch image processing

  • N8N workflows orchestrating the extraction pipeline

  • Vision Language Models (VLM) via Ollama for analyzing images

  • PostgreSQL storage for tracking jobs and extracted structured data

  • Automated job processing with retry logic and error handling

This adds the ability to extract structured data from images at scale, perfect for processing receipts, invoices, forms, or any visual documents.

Homework before Part 3

Explore your new document processing and RAG capabilities:

  • Test document processing: Upload PDFs, Word docs, and text files

  • Test RAG: Upload various documents and ask complex questions

  • Multi-language: Review the tika-config.xml file in the self-hosted-ai-stack/shared/ directory to understand the language settings, add new languages, then test documents in different languages (a hands-on way to learn Tika configuration)

  • Multi-format: Try Word docs, PowerPoint slides, Excel sheets (remember: this processes standard document formats, but not standalone image files)

  • Advanced queries: Try complex multi-document queries across different formats

  • Monitor processing: Browse vector collections in Qdrant's web interface at http://localhost:6333/dashboard

Helpful resources


This is part of my "Complete Self-Hosted AI Infrastructure" series. Follow along as we build increasingly sophisticated AI capabilities, all running self-hosted on your machine.


Written by

Farzam Mohammadi

I'm Farzam, a Software Engineer specializing in backend development. My mission: Collaborate, share proven tricks, and help you avoid the pricey surprises I've encountered along the way.