Setting Up Your Self-Hosted AI Stack - Part 2: Document Processing and RAG with Apache Tika and Qdrant Vector Database

Farzam Mohammadi

What we're building today

We're adding document processing capabilities to the foundation from Part 1. This extends your chat interface to work with your files and documents.

Two key additions make this possible:

  • Document Processing: Extract text from documents like PDFs, Word docs, and text files using Apache Tika (with OCR capabilities for scanned documents)

  • RAG (Retrieval-Augmented Generation): Search and retrieve relevant information from your documents to answer questions

This transforms your local AI from conversation-only to document-aware. Upload a file, ask questions about its contents, and get answers grounded in your actual documents rather than the model's training data.

The bigger picture

I'm building this self-hosted AI stack and documenting everything as I go. Part 2 is where the real power emerges: we're adding the intelligence layer that transforms your local AI from a chat toy into a genuine knowledge assistant.

Here's the thing: even with today's resources, implementing proper document processing and RAG together is surprisingly complex. This guide captures the lessons learned getting both capabilities production ready.

What we've built so far:

  • Part 1: Building the Foundation: Ollama, Open WebUI, PostgreSQL, and pgAdmin running locally with Docker Compose

What's still coming:

  • Part 3: Visual Data Extraction: Web upload interface with N8N workflows and Vision Language Models for extracting structured data from images

  • Part 4: Model Superpowers: Advanced WebUI configuration with tools and knowledge integration

  • Part 5: Intelligent Automation: WebUI filters and N8N workflows for content processing

Prerequisites

  • Part 1 foundation stack running

  • Ollama running (verify with ollama serve; if you see "address already in use", it's already running)

  • 24GB+ RAM recommended (16GB minimum)

  • 50GB+ storage for documents and vectors

  • Docker and Docker Compose

Important: If Part 1 is running, stop it first with cd ../part-1-building-the-foundation && docker compose down before starting Part 2.

Credits where credits are due

Massive thanks to the creators and contributors of the open source projects that make this possible: Ollama, Open WebUI, PostgreSQL, pgAdmin, Apache Tika (with Tesseract OCR), Qdrant, and the nomic-embed-text model.

Quick start (skip explanations, just get it running)

  1. Download the embedding model (~275MB):

     ollama pull nomic-embed-text
    
  2. Clone the repository or pull latest:

     git clone https://github.com/FarzamMohammadi/self-hosted-ai-stack
     # Or if you already have it: git pull origin main
    
  3. Navigate to part 2:

     cd part-2-rag-with-tika-and-qdrant
    
  4. Start the enhanced stack:

     docker compose up -d
    

Done. Open http://localhost:3000, sign in, and you'll see a document upload button (+) in the bottom left corner of the chat. Upload a PDF and start asking questions about it.

Important: Open WebUI handles files through two distinct pathways:

Document Processing (what we built): Extracts text from documents like PDFs, Word docs, and text files through Tika. Does NOT work with standalone image files (PNG, JPEG, etc.).

Image Analysis (not covered here): For standalone image files, you'll need to download and use a vision model.

Understanding what you built

Let's break down what we just built and why each component matters. Understanding the architecture transforms you from someone following steps into someone who owns the system.

What are document processing and RAG, and why you need both

Your AI model was trained on data from months or years ago. It doesn't know about your company's latest reports, personal documents, or today's news. Ask about quarterly financials, and it apologizes for lacking access.

Document Processing uses Apache Tika to extract machine-readable text directly from documents like PDFs, Word docs, and other formats by parsing their internal structure. For scanned documents or image-based content within PDFs, Tika automatically falls back to OCR using Tesseract when needed.

RAG (Retrieval-Augmented Generation) solves the knowledge gap by giving your AI instant access to search through your documents for relevant information.

Together, they create a document intelligence system that processes documents and provides grounded answers.

The transformation:

| Without Document Processing + RAG | With Document Processing + RAG |
| --- | --- |
| "I can't access your documents." | "Based on your invoice PDF, the total amount due is $2,847." |
| "I don't have access to your company's financial data." | "Based on your Q3 report, revenue increased 23% compared to last quarter." |
| Limited to conversation only | Processes documents like PDFs, Word docs, text files |
| Operates with frozen training data | Searches YOUR content for relevant information |
| Generic responses, frequent "I don't have access" | Accurate answers with source attribution |

Real examples:

Document processing: "What's the total on this invoice?"

  • Without document processing: "I can't read documents or extract text from files."

  • With document processing + RAG: "Based on your invoice PDF, the total amount due is $2,847, with a due date of March 15th."

Document analysis: "What were our biggest challenges last quarter?"

  • Without RAG: "I don't have access to your quarterly reports."

  • With document processing + RAG: "Based on your Q3 report, the biggest challenges were supply chain delays (mentioned 8 times) and staffing shortages in the manufacturing division (pages 12-14)."

This transforms your AI from a chat interface into a document intelligence system that reads, understands, and answers questions about your documents.

How vector similarity works

Traditional search looks for exact word matches. Search for "machine learning performance" and you'll miss documents about "AI optimization" or "model efficiency" entirely.

Embeddings solve this: mathematical representations that capture semantic meaning.

Think of embeddings as GPS coordinates for concepts:

  • "Machine learning" → [0.2, 0.8, 0.1, ...] (768 dimensions)

  • "AI optimization" → [0.3, 0.7, 0.2, ...]

  • "Car repair" → [0.9, 0.1, 0.3, ...]

Closer coordinates mean more similar meaning. When nomic-embed-text converts "reduce costs" into a 768-dimensional vector, it positions near related concepts:

  • "Budget optimization" (distance: 0.23)

  • "Expense management" (distance: 0.31)

  • "Financial efficiency" (distance: 0.28)

This mathematical precision ensures RAG retrieves conceptually relevant information beyond keyword matches.
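
You can look at one of these vectors yourself once the embedding model is pulled. A minimal check against Ollama's embeddings endpoint (assuming Ollama is on its default port 11434; jq is only used to count the dimensions and is an optional dependency):

# Ask Ollama to embed a phrase with nomic-embed-text and count the dimensions
curl -s http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "reduce costs"}' \
  | jq '.embedding | length'

The response contains a single embedding array, and the command above should print 768.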

Vector search visualization showing how concepts cluster in high-dimensional space. Related concepts like "Kitten" and "Cat" appear close together in mathematical space, enabling semantic search that finds meaning beyond exact keyword matches.

Source: https://odsc.medium.com/a-gentle-introduction-to-vector-search-3c0511bc6771

Vector databases store and search embeddings

Once you have embeddings, you need somewhere to store and search them efficiently. Regular databases aren't designed for similarity search.

Traditional database (SQL):

SELECT * FROM documents WHERE title CONTAINS 'machine learning'

Finds: Documents with exact phrase "machine learning" in title

Vector database:

SEARCH FOR documents SIMILAR TO embedding([0.2, 0.8, 0.1, ...])

Finds: Documents about AI, neural networks, model training, performance optimization, etc.

Vector databases use specialized algorithms to find similar vectors among millions quickly. Semantic search understands intent:

  • Search: "reduce costs" → Finds: "budget optimization", "expense management", "financial efficiency"

  • Search: "team problems" → Finds: "communication challenges", "staff conflicts", "collaboration issues"

How the complete RAG flow works

Document Processing (Setup Phase): Upload document → Apache Tika server extracts machine-readable text from documents like PDFs, Word docs, and text files (using OCR via Tesseract for scanned content when needed) → Open WebUI splits text into 1000-character chunks → nomic-embed-text model (via Ollama) converts each chunk into 768-dimensional vectors → Qdrant vector database stores vectors with original text for fast similarity search

Query Processing (When You Ask Questions): Question → nomic-embed-text model converts question to 768-dimensional vector → Qdrant vector database searches and finds 5 most similar document chunks → Open WebUI compiles relevant text excerpts → Enhanced prompt (question + retrieved excerpts) sent to Ollama LLM → AI generates grounded response with source attribution

The complete document processing + RAG orchestration: Open WebUI handles this entire workflow behind the scenes:

  • Apache Tika extracts text from documents like PDFs, Word docs, and text files (with automatic OCR fallback for scanned content)

  • nomic-embed-text converts extracted content into 768-dimensional vectors

  • Qdrant stores and indexes vectors for fast similarity search

  • Open WebUI orchestrates the complete pipeline from upload to intelligent response

What typically requires complex document processing and RAG engineering becomes simple upload and query. Your documents transform into a searchable knowledge base within seconds.
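
To make the final step concrete, here is roughly what an "enhanced prompt" looks like if you send one to Ollama's generate API by hand. This is only an illustration; Open WebUI's actual prompt template, system instructions, and parameters differ:

# Question plus retrieved excerpts, sent as a single prompt to the LLM
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Use the following excerpts to answer.\n\nExcerpt 1: Total amount due: $2,847, payable by March 15th.\n\nQuestion: What is the total on this invoice?",
  "stream": false
}'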

Setting up the document processing and RAG components

Now that you understand the concepts, let's examine the technical implementation that makes everything work.

System requirements

| Component | Memory | Capabilities | Storage Impact |
| --- | --- | --- | --- |
| Apache Tika | 4GB | Text extraction from documents like PDFs, Word docs, and text files (with OCR for scanned content) | ~2MB per processed document |
| nomic-embed-text | 275MB | Document embedding generation | Creates 768-dimensional vectors |
| Qdrant | 2GB | Vector storage and similarity search | ~50KB per document chunk |
| Overall system | 24GB+ | Process 100+ documents concurrently | Scales to 10,000+ documents |
| Storage growth | 50GB+ | Raw documents + vector indexes | ~5MB total per average business document |

Performance estimates are rough approximations based on system understanding and real-world usage patterns.

Putting it all together

Here's our enhanced docker-compose.yml that adds complete document processing and RAG capabilities to our foundation:

version: '3.8'

services:
  postgres:
    image: postgres:15-alpine
    container_name: postgres
    ports:
      - '5432:5432'
    environment:
      - POSTGRES_DB=openwebui
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=securepassword123
    volumes:
      - ./volumes/postgres/data:/var/lib/postgresql/data
    restart: unless-stopped
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U postgres -d openwebui']
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - local-ai-network

  pgadmin:
    image: dpage/pgadmin4:latest
    container_name: pgadmin
    ports:
      - '5050:80'
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@local.ai
      - PGADMIN_DEFAULT_PASSWORD=admin123
      - PGADMIN_CONFIG_SERVER_MODE=False
      - PGADMIN_CONFIG_MASTER_PASSWORD_REQUIRED=False
    volumes:
      - ./volumes/pgadmin:/var/lib/pgadmin
    restart: unless-stopped
    depends_on:
      postgres:
        condition: service_healthy
    networks:
      - local-ai-network

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: webui
    ports:
      - '3000:8080'
    volumes:
      - ./volumes/open-webui/data:/app/backend/data
    environment:
      # Ollama connection
      - OLLAMA_BASE_URL=http://host.docker.internal:11434

      # Database connection
      - DATABASE_URL=postgresql://postgres:securepassword123@postgres:5432/openwebui

      # Basic settings
      - WEBUI_SECRET_KEY=your-secret-key-here
      - WEBUI_AUTH=true
      - ENABLE_SIGNUP=true
      - DEFAULT_MODELS=qwen3:8b

      # Document Processing + RAG Configuration (NEW)
      - UPLOAD_DIR=/app/backend/data/uploads
      - ENABLE_PERSISTENT_CONFIG=false
      - CONTENT_EXTRACTION_ENGINE=tika
      - TIKA_SERVER_URL=http://tika:9998
      - VECTOR_DB=qdrant
      - QDRANT_URI=http://qdrant:6333/
      - QDRANT_COLLECTION_PREFIX=document_chunks
      - RAG_EMBEDDING_MODEL=nomic-embed-text
      - RAG_EMBEDDING_ENGINE=ollama
      - ENABLE_RAG_HYBRID_SEARCH=false
      - RAG_RELEVANCE_THRESHOLD=0.75
      - CHUNK_SIZE=1000
      - CHUNK_OVERLAP=100

    extra_hosts:
      - 'host.docker.internal:host-gateway'
    restart: unless-stopped
    depends_on:
      postgres:
        condition: service_healthy
    networks:
      - local-ai-network

  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    ports:
      - '6333:6333'
      - '6334:6334'
    environment:
      - QDRANT__SERVICE__HTTP_PORT=6333
      - QDRANT__SERVICE__GRPC_PORT=6334
    volumes:
      - ./volumes/qdrant/storage:/qdrant/storage
    restart: unless-stopped
    networks:
      - local-ai-network

  tika:
    image: apache/tika:latest-full
    container_name: tika-server
    ports:
      - '9998:9998'
    environment:
      - TIKA_CONFIG=/opt/tika/config/tika-config.xml
      - JAVA_OPTS=-Xmx2g -Xms512m -Dfile.encoding=UTF-8 -Djava.awt.headless=true
      - TIKA_OCR_LANGUAGE=eng
      - TIKA_PDF_OCR_STRATEGY=OCR_AND_TEXT_EXTRACTION
    volumes:
      - ../shared/tika/config:/opt/tika/config
    command:
      - --host=0.0.0.0
      - --port=9998
      - --config=/opt/tika/config/tika-config.xml
    restart: unless-stopped
    networks:
      - local-ai-network

networks:
  local-ai-network:
    driver: bridge

What's new from Part 1:

This Docker Compose builds on your existing Part 1 setup by adding two key containers:

  • Qdrant: Vector database for storing document embeddings

  • Apache Tika: Text extraction engine for documents like PDFs, Word docs, and text files

Key configuration changes:

  • ENABLE_PERSISTENT_CONFIG=false: Ensures these environment variables always take effect instead of configuration values persisted from Part 1

  • CONTENT_EXTRACTION_ENGINE=tika: Routes document processing through the new Tika container

  • RAG_RELEVANCE_THRESHOLD=0.75: Filters out retrieved chunks whose relevance score falls below 0.75

  • QDRANT_COLLECTION_PREFIX=document_chunks: Names your vector storage collections

  • TIKA_PDF_OCR_STRATEGY=OCR_AND_TEXT_EXTRACTION: Extracts machine-readable text first, then uses OCR for scanned content within documents

  • CHUNK_SIZE=1000 & CHUNK_OVERLAP=100: Document splitting settings for better search

Tika Configuration

The repository includes a pre-configured tika-config.xml file mounted into the Tika container via the volume mapping ../shared/tika/config:/opt/tika/config shown in the tika service above. This configuration contains simple but sensible settings for document processing, including PDF text extraction and OCR fallback, plus limits to keep resource usage in check while documents are processed.
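
If you want to sanity-check the Tika container on its own, you can push a document straight to its extraction endpoint and inspect the raw text it returns (sample.pdf is a placeholder for any document you have on hand):

# Send a local file to Tika and get plain-text extraction back
curl -s -T sample.pdf http://localhost:9998/tika -H "Accept: text/plain"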

Testing your document processing + RAG setup

Time to verify everything works as expected. Let's systematically test each component:

1. Container health check

docker compose ps

All five containers should show as running:

  • postgres (healthy)

  • pgadmin

  • webui

  • qdrant

  • tika-server

2. Verify service endpoints

Qdrant Database:
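
A quick check (assuming the default port mapping from the compose file) is to list its collections over the REST API; a healthy instance answers with a JSON object containing a, possibly empty, collections list:

curl http://localhost:6333/collections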

Tika server:
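
To confirm Tika is up, ask for its version string; it should respond with the Apache Tika version the container is running:

curl http://localhost:9998/version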

Ollama with both models:

ollama list

Should list both qwen3:8b and nomic-embed-text.

3. Upload and test different document types

  1. Navigate to http://localhost:3000

  2. Sign in with your existing account

  3. Look for the document upload icon in the bottom left corner of the chat

  4. Upload documents: PDFs, Word docs, or text files

  5. Wait for processing to complete

  6. Ask specific questions about the document content

Test questions for document processing:

  • "What's the total amount on this invoice?"

  • "Extract the key information from this document"

  • "What are the main points in this document?"

Test questions for RAG analysis:

  • "What are the main topics covered in this document?"

  • "Summarize the key findings"

  • "What does the document say about [specific topic]?"

4. Verify vector storage

Check that Qdrant is storing your document vectors: http://localhost:6333/dashboard#/collections

You should see collections created for your uploaded documents.

5. View extracted text (optional)

Want to see exactly what text was extracted from your uploaded documents? This helps verify text extraction accuracy (including OCR when used) and understand what content the AI uses.

  1. Visit http://localhost:5050

  2. Login with admin@local.ai / admin123

  3. If you haven't already connected to the database, add a server connection:

    1. Right-click on Servers → Register → Server

    2. General tab → Name: local-ai

    3. Connection tab:

      • Host name/address: postgres

      • Port: 5432

      • Maintenance database: postgres

      • Username: postgres

      • Password: securepassword123

    4. Click Save

  4. Navigate to local-ai → Databases → openwebui → Schemas → public → Tables → file

  5. Right-click on the file table and select View/Edit Data → All Rows

  6. Look at the data column to see the extracted text from your uploaded documents (or run the psql one-liner shown below)
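
If you'd rather skip the clicking, the same information is available from the command line. A minimal sketch using psql inside the postgres container (it relies only on the filename and data columns described above; the left(...) cast just keeps the preview short):

docker compose exec postgres psql -U postgres -d openwebui \
  -c "SELECT filename, left(data::text, 200) AS preview FROM file;"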

What you'll see:

  • The filename column shows your original file names

  • The data column contains the full extracted text from each document

  • This is exactly what the AI uses to answer your questions about the documents

Troubleshooting tip: If the data column is empty or contains gibberish, the document may be corrupted or contain primarily images. This setup works best with standard document formats and doesn't process standalone image files.

Troubleshooting common issues

| Issue | Quick Fix | Details |
| --- | --- | --- |
| Document upload fails | Check docker compose logs tika-server | Sweet spot: 5 to 15MB files |
| No RAG responses | Verify ollama list includes both models | First upload takes 2 to 3 times longer |
| RAG responses seem inaccurate | Check chunk size and overlap settings | Adjust CHUNK_SIZE=500 for technical docs |
| Tika memory errors | Increase to -Xmx6g or -Xmx8g | 4GB handles most documents |
| No text extracted | Standalone image file uploaded (PNG, JPEG, etc.) | Use vision models for standalone image files |
| System memory issues | Monitor with docker stats | 16GB minimum, 24GB comfortable |
| Vector search too slow | Check Qdrant storage space | Database performance degrades when disk is full |

What's next

You've successfully built a document intelligence system with full document processing and RAG capabilities! Your self-hosted stack now includes:

  • Ollama serving your LLM and embedding models

  • Open WebUI with complete document intelligence

  • PostgreSQL storing conversations and configurations

  • Qdrant powering semantic document search and retrieval

  • Apache Tika providing text extraction from documents like PDFs, Word docs, and text files (with OCR capabilities for scanned content)

Coming up in Part 3

We'll build a visual data extraction pipeline with:

  • Web upload interface for batch image processing

  • N8N workflows orchestrating the extraction pipeline

  • Vision Language Models (VLM) via Ollama for analyzing images

  • PostgreSQL storage for tracking jobs and extracted structured data

  • Automated job processing with retry logic and error handling

This adds the ability to extract structured data from images at scale, perfect for processing receipts, invoices, forms, or any visual documents.

Homework before Part 3

Explore your new document processing and RAG capabilities:

  • Test document processing: Upload PDFs, Word docs, and text files

  • Test RAG: Upload various documents and ask complex questions

  • Multi-language: Review the tika-config.xml file in the self-hosted-ai-stack/shared/ directory to understand the language settings, add new languages, then test documents in different languages (a hands-on way to learn Tika configuration)

  • Multi-format: Try Word docs, PowerPoint slides, Excel sheets (remember: this processes standard document formats, but not standalone image files)

  • Advanced queries: Try complex multi-document queries across different formats

  • Monitor processing: Browse vector collections in Qdrant's web interface at http://localhost:6333/dashboard

Helpful resources


This is part of my "Complete Self-Hosted AI Infrastructure" series. Follow along as we build increasingly sophisticated AI capabilities, all running self-hosted on your machine.


Written by

Farzam Mohammadi

I'm Farzam, a Software Engineer specializing in backend development. My mission: Collaborate, share proven tricks, and help you avoid the pricey surprises I've encountered along the way.