Setting Up Your Self-Hosted AI Stack - Part 2: Document Processing and RAG with Apache Tika and Qdrant Vector Database

What we're building today
We're adding document processing capabilities to the foundation from Part 1. This extends your chat interface to work with your files and documents.
Two key additions make this possible:
- Document Processing: Extract text from documents like PDFs, Word docs, and text files using Apache Tika (with OCR capabilities for scanned documents)
- RAG (Retrieval-Augmented Generation): Search and retrieve relevant information from your documents to answer questions
This transforms your local AI from conversation-only to document-aware. Upload a file, ask questions about its contents, and get answers grounded in your actual documents rather than the model's training data.
The bigger picture
I'm building this self-hosted AI stack and documenting everything as I go. Part 2 is where real power emerges. We're adding the intelligence layer that transforms your local AI from a chat toy into a genuine knowledge assistant.
Here's the thing: even with today's resources, implementing proper document processing and RAG together is surprisingly complex. This guide captures the lessons learned getting both capabilities production-ready.
What we've built so far:
- Part 1: Building the foundation with Open WebUI, Ollama, and Postgres: Your foundational chat interface with local LLM hosting
What's still coming:
- Part 3: Visual Data Extraction: Web upload interface with N8N workflows and Vision Language Models for extracting structured data from images
- Part 4: Model Superpowers: Advanced WebUI configuration with tools and knowledge integration
- Part 5: Intelligent Automation: WebUI filters and N8N workflows for content processing
Prerequisites
- Ollama running (verify with ollama serve; if you see "address already in use", it's already running)
- 24GB+ RAM recommended (16GB minimum)
- 50GB+ storage for documents and vectors
- Docker and Docker Compose

Important: If Part 1 is running, stop it first with cd ../part-1-building-the-foundation && docker compose down before starting Part 2.
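A few quick pre-flight checks can save time later. These are standard CLI commands (the memory and disk checks vary by OS, so treat them as a sketch):

```bash
# Confirm the Ollama server is reachable and see which models are already pulled
ollama list

# Confirm Docker Compose v2 is available
docker compose version

# Memory visible to Docker (in bytes) and free disk space on the current drive
docker info --format '{{.MemTotal}}'
df -h .
```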
Credits where credits are due
Massive thanks to the creators and contributors of these open source projects that make this possible:
- Apache Tika - Repository: apache/tika
- Qdrant - Repository: qdrant/qdrant
- Ollama - Repository: ollama/ollama
- Open WebUI - Repository: open-webui/open-webui
- Docker - Repositories: moby/moby & docker/compose
Quick start (skip explanations, just get it running)
Download the embedding model (~275MB):
ollama pull nomic-embed-text
Clone the repository or pull latest:
git clone https://github.com/FarzamMohammadi/self-hosted-ai-stack # Or if you already have it: git pull origin main
Navigate to part 2:
cd part-2-rag-with-tika-and-qdrant
Start the enhanced stack:
docker compose up -d
Done. Open http://localhost:3000, sign in, and you'll see the document upload button (+) in the bottom-left corner of the chat. Upload a PDF and start asking questions about it.
Important: Open WebUI handles files through two distinct pathways:
Document Processing (what we built): Extracts text from documents like PDFs, Word docs, and text files through Tika. Does NOT work with standalone image files (PNG, JPEG, etc.).
Image Analysis (not covered here): For standalone image files, you'll need to download and use a vision model.
Understanding what you built
Let's break down what we just built and why each component matters. Understanding the architecture transforms you from someone following steps into someone who owns the system.
What document processing and RAG are, and why you need both
Your AI model was trained on data from months or years ago. It doesn't know about your company's latest reports, personal documents, or today's news. Ask about quarterly financials, and it apologizes for lacking access.
Document Processing uses Apache Tika to extract machine-readable text directly from documents like PDFs, Word docs, and other formats by parsing their internal structure. For scanned documents or image-based content within PDFs, Tika automatically falls back to OCR using Tesseract when needed.
RAG (Retrieval-Augmented Generation) solves the knowledge gap by giving your AI instant access to search through your documents for relevant information.
Together, they create a document intelligence system that processes documents and provides grounded answers.
The transformation:
| Without Document Processing + RAG | With Document Processing + RAG |
| --- | --- |
| "I can't access your documents." | "Based on your invoice PDF, the total amount due is $2,847." |
| "I don't have access to your company's financial data." | "Based on your Q3 report, revenue increased 23% compared to last quarter." |
| Limited to conversation only | Processes documents like PDFs, Word docs, text files |
| Operates with frozen training data | Searches YOUR content for relevant information |
| Generic responses, frequent "I don't have access" | Accurate answers with source attribution |
Real examples:
Document processing: "What's the total on this invoice?"
Without document processing: "I can't read documents or extract text from files."
With document processing + RAG: "Based on your invoice PDF, the total amount due is $2,847, with a due date of March 15th."
Document analysis: "What were our biggest challenges last quarter?"
Without RAG: "I don't have access to your quarterly reports."
With document processing + RAG: "Based on your Q3 report, the biggest challenges were supply chain delays (mentioned 8 times) and staffing shortages in the manufacturing division (pages 12-14)."
This transforms your AI from a chat interface into a document intelligence system that reads, understands, and answers questions about your documents.
How vector similarity works
Traditional search looks for exact word matches. Search for "machine learning performance" and miss documents about "AI optimization" or "model efficiency" entirely.
Embeddings solve this: mathematical representations that capture semantic meaning.
Think of embeddings as GPS coordinates for concepts:
"Machine learning" → [0.2, 0.8, 0.1, ...] (768 dimensions)
"AI optimization" → [0.3, 0.7, 0.2, ...]
"Car repair" → [0.9, 0.1, 0.3, ...]
Closer coordinates mean more similar meaning. When nomic-embed-text converts "reduce costs" into a 768-dimensional vector, it positions near related concepts:
"Budget optimization" (distance: 0.23)
"Expense management" (distance: 0.31)
"Financial efficiency" (distance: 0.28)
This mathematical precision ensures RAG retrieves conceptually relevant information beyond keyword matches.
Vector search visualization showing how concepts cluster in high-dimensional space. Related concepts like "Kitten" and "Cat" appear close together in mathematical space, enabling semantic search that finds meaning beyond exact keyword matches.
Source: https://odsc.medium.com/a-gentle-introduction-to-vector-search-3c0511bc6771
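You can generate one of these vectors yourself by calling the embedding model directly. A minimal sketch, assuming Ollama is running on its default port and you've pulled nomic-embed-text from the quick start:

```bash
# Ask nomic-embed-text (via Ollama) for the embedding of a phrase.
# The response is a JSON object with a single "embedding" array of 768 numbers.
curl -s http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "reduce costs"}'
```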
Vector databases store and search embeddings
Once you have embeddings, you need somewhere to store and search them efficiently. Regular databases can't handle similarity search.
Traditional database (SQL):
SELECT * FROM documents WHERE title CONTAINS 'machine learning'
Finds: Documents with exact phrase "machine learning" in title
Vector database:
SEARCH FOR documents SIMILAR TO embedding([0.2, 0.8, 0.1, ...])
Finds: Documents about AI, neural networks, model training, performance optimization, etc.
Vector databases use specialized algorithms to find similar vectors among millions quickly. Semantic search understands intent:
Search: "reduce costs" → Finds: "budget optimization", "expense management", "financial efficiency"
Search: "team problems" → Finds: "communication challenges", "staff conflicts", "collaboration issues"
How the complete RAG flow works
Document Processing (Setup Phase): Upload document → Apache Tika server extracts machine-readable text from documents like PDFs, Word docs, and text files (using OCR via Tesseract for scanned content when needed) → Open WebUI splits text into 1000-character chunks → nomic-embed-text model (via Ollama) converts each chunk into 768-dimensional vectors → Qdrant vector database stores vectors with original text for fast similarity search
Query Processing (When You Ask Questions): Question → nomic-embed-text model converts question to 768-dimensional vector → Qdrant vector database searches and finds 5 most similar document chunks → Open WebUI compiles relevant text excerpts → Enhanced prompt (question + retrieved excerpts) sent to Ollama LLM → AI generates grounded response with source attribution
The complete document processing + RAG orchestration: Open WebUI handles this entire workflow behind the scenes:
- Apache Tika extracts text from documents like PDFs, Word docs, and text files (with automatic OCR fallback for scanned content)
- nomic-embed-text converts extracted content into 768-dimensional vectors
- Qdrant stores and indexes vectors for fast similarity search
- Open WebUI orchestrates the complete pipeline from upload to intelligent response
What typically requires complex document processing and RAG engineering becomes simple upload and query. Your documents transform into a searchable knowledge base within seconds.
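If you're curious what the very first step of that setup phase produces, you can call the Tika server directly and see the raw extracted text before any chunking or embedding happens. A quick sketch (the file path is just an example); this is essentially the extraction Open WebUI triggers when you upload a file:

```bash
# Send a local PDF straight to the Tika server and get plain text back
curl -s -T ./example.pdf -H "Accept: text/plain" http://localhost:9998/tika
```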
Setting up the document processing and RAG components
Now that you understand the concepts, let's examine the technical implementation that makes everything work.
System requirements
| Component | Memory | Capabilities | Storage Impact |
| --- | --- | --- | --- |
| Apache Tika | 4GB | Text extraction from documents like PDFs, Word docs, and text files (with OCR for scanned content) | ~2MB per processed document |
| nomic-embed-text | 275MB | Document embedding generation | Creates 768-dimensional vectors |
| Qdrant | 2GB | Vector storage and similarity search | ~50KB per document chunk |
| Overall system | 24GB+ | Process 100+ documents concurrently | Scales to 10,000+ documents |
| Storage growth | 50GB+ | Raw documents + vector indexes | ~5MB total per average business document |
Performance estimates are rough approximations based on system understanding and real-world usage patterns.
Putting it all together
Here's our enhanced docker-compose.yml that adds complete document processing and RAG capabilities to our foundation:
version: '3.8'

services:
  postgres:
    image: postgres:15-alpine
    container_name: postgres
    ports:
      - '5432:5432'
    environment:
      - POSTGRES_DB=openwebui
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=securepassword123
    volumes:
      - ./volumes/postgres/data:/var/lib/postgresql/data
    restart: unless-stopped
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U postgres -d openwebui']
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - local-ai-network

  pgadmin:
    image: dpage/pgadmin4:latest
    container_name: pgadmin
    ports:
      - '5050:80'
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@local.ai
      - PGADMIN_DEFAULT_PASSWORD=admin123
      - PGADMIN_CONFIG_SERVER_MODE=False
      - PGADMIN_CONFIG_MASTER_PASSWORD_REQUIRED=False
    volumes:
      - ./volumes/pgadmin:/var/lib/pgadmin
    restart: unless-stopped
    depends_on:
      postgres:
        condition: service_healthy
    networks:
      - local-ai-network

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: webui
    ports:
      - '3000:8080'
    volumes:
      - ./volumes/open-webui/data:/app/backend/data
    environment:
      # Ollama connection
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      # Database connection
      - DATABASE_URL=postgresql://postgres:securepassword123@postgres:5432/openwebui
      # Basic settings
      - WEBUI_SECRET_KEY=your-secret-key-here
      - WEBUI_AUTH=true
      - ENABLE_SIGNUP=true
      - DEFAULT_MODELS=qwen3:8b
      # Document Processing + RAG Configuration (NEW)
      - UPLOAD_DIR=/app/backend/data/uploads
      - ENABLE_PERSISTENT_CONFIG=false
      - CONTENT_EXTRACTION_ENGINE=tika
      - TIKA_SERVER_URL=http://tika:9998
      - VECTOR_DB=qdrant
      - QDRANT_URI=http://qdrant:6333/
      - QDRANT_COLLECTION_PREFIX=document_chunks
      - RAG_EMBEDDING_MODEL=nomic-embed-text
      - RAG_EMBEDDING_ENGINE=ollama
      - ENABLE_RAG_HYBRID_SEARCH=false
      - RAG_RELEVANCE_THRESHOLD=0.75
      - CHUNK_SIZE=1000
      - CHUNK_OVERLAP=100
    extra_hosts:
      - 'host.docker.internal:host-gateway'
    restart: unless-stopped
    depends_on:
      postgres:
        condition: service_healthy
    networks:
      - local-ai-network

  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    ports:
      - '6333:6333'
      - '6334:6334'
    environment:
      - QDRANT__SERVICE__HTTP_PORT=6333
      - QDRANT__SERVICE__GRPC_PORT=6334
    volumes:
      - ./volumes/qdrant/storage:/qdrant/storage
    restart: unless-stopped
    networks:
      - local-ai-network

  tika:
    image: apache/tika:latest-full
    container_name: tika-server
    ports:
      - '9998:9998'
    environment:
      - TIKA_CONFIG=/opt/tika/config/tika-config.xml
      - JAVA_OPTS=-Xmx2g -Xms512m -Dfile.encoding=UTF-8 -Djava.awt.headless=true
      - TIKA_OCR_LANGUAGE=eng
      - TIKA_PDF_OCR_STRATEGY=OCR_AND_TEXT_EXTRACTION
    volumes:
      - ../shared/tika/config:/opt/tika/config
    command:
      - --host=0.0.0.0
      - --port=9998
      - --config=/opt/tika/config/tika-config.xml
    restart: unless-stopped
    networks:
      - local-ai-network

networks:
  local-ai-network:
    driver: bridge
What's new from Part 1:
This Docker Compose builds on your existing Part 1 setup by adding two key containers:
- Qdrant: Vector database for storing document embeddings
- Apache Tika: Text extraction engine for documents like PDFs, Word docs, and text files
Key configuration changes:
- ENABLE_PERSISTENT_CONFIG=false: Lets these new document processing + RAG settings override Part 1 configs
- CONTENT_EXTRACTION_ENGINE=tika: Routes document processing through the new Tika container
- RAG_RELEVANCE_THRESHOLD=0.75: Sets the search quality threshold
- QDRANT_COLLECTION_PREFIX=document_chunks: Names your vector storage collections
- TIKA_PDF_OCR_STRATEGY=OCR_AND_TEXT_EXTRACTION: Extracts machine-readable text first, then uses OCR for scanned content within documents
- CHUNK_SIZE=1000 & CHUNK_OVERLAP=100: Document splitting settings for better search
Tika Configuration
The repository includes a pre-configured tika-config.xml file, mounted into the Tika container via the volume mapping ../shared/tika/config:/opt/tika/config in the tika service above. This configuration contains simple yet optimized settings for document processing, including PDF text extraction and OCR fallback capabilities, plus resource limits to prevent excessive usage during document processing.
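To confirm the config file actually made it into the container, a quick look at the mounted path works (tika is the Compose service name; this assumes the image includes a basic shell):

```bash
# List the config directory inside the Tika container
docker compose exec tika ls -l /opt/tika/config/
```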
Testing your document processing + RAG setup
Time to verify everything works as expected. Let's systematically test each component:
1. Container health check
docker compose ps
All five containers should show as running:
- postgres (healthy)
- pgadmin
- webui
- qdrant
- tika-server
2. Verify service endpoints
Qdrant Database:
Health check: Open http://localhost:6333/healthz - should show "healthz check successful"
Dashboard: Browse http://localhost:6333/dashboard - explore your document collections and vector data
Tika server:
Quick version check: Open http://localhost:9998/version - shows Tika version info
All available routes: Browse http://localhost:9998/ - lists all Tika API endpoints
Ollama with both models: run ollama list; it should list both qwen3:8b and nomic-embed-text.
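If you prefer the terminal over the browser, the same checks work with curl against each service (default ports from the compose file; the last call is Ollama's standard model-listing endpoint):

```bash
# Qdrant health check
curl -s http://localhost:6333/healthz

# Tika version info
curl -s http://localhost:9998/version

# Models known to Ollama; the list should include qwen3:8b and nomic-embed-text
curl -s http://localhost:11434/api/tags
```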
3. Upload and test different document types
Navigate to http://localhost:3000
Sign in with your existing account
Look for the document upload icon in the bottom left corner of the chat
Upload documents: PDFs, Word docs, or text files
Wait for processing to complete
Ask specific questions about the document content
Test questions for document processing:
"What's the total amount on this invoice?"
"Extract the key information from this document"
"What are the main points in this document?"
Test questions for RAG analysis:
"What are the main topics covered in this document?"
"Summarize the key findings"
"What does the document say about [specific topic]?"
4. Verify vector storage
Check that Qdrant is storing your document vectors: http://localhost:6333/dashboard#/collections
You should see collections created for your uploaded documents.
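You can also confirm this from the command line through Qdrant's REST API (the collection name in the second command is a placeholder; use one returned by the first):

```bash
# List every collection Qdrant knows about; after an upload you should see
# names starting with the document_chunks prefix from the compose file.
curl -s http://localhost:6333/collections

# Inspect a single collection: the response includes points_count (number of
# stored chunks) and the vector size (768 for nomic-embed-text).
curl -s http://localhost:6333/collections/<collection-name>
```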
5. View extracted text (optional)
Want to see exactly what text was extracted from your uploaded documents? This helps verify text extraction accuracy (including OCR when used) and understand what content the AI uses.
- Visit http://localhost:5050
- Log in with admin@local.ai / admin123
- If you haven't already connected to the database, add a server connection:
  - Right-click on Servers → Register → Server
  - General tab → Name: local-ai
  - Connection tab:
    - Host name/address: postgres
    - Port: 5432
    - Maintenance database: postgres
    - Username: postgres
    - Password: securepassword123
  - Click Save
- Navigate to local-ai → Databases → openwebui → Schemas → public → Tables → file
- Right-click on the file table and select View/Edit Data → All Rows
- Look at the data column to see the extracted text from your uploaded documents
What you'll see:
- The filename column shows your original file names
- The data column contains the full extracted text from each document
- This is exactly what the AI uses to answer your questions about the documents
Troubleshooting tip: If the data column is empty or contains gibberish, the document may be corrupted or contain primarily images. This setup works best with standard document formats and doesn't process standalone image files.
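If you'd rather skip pgAdmin, a rough equivalent works from the command line with psql inside the Postgres container. The table and column names are the ones from the walkthrough above; the cast to text is a precaution in case the data column is stored as JSON:

```bash
# Preview the first few uploaded files and the start of their extracted text
docker compose exec postgres psql -U postgres -d openwebui \
  -c "SELECT filename, left(data::text, 300) AS extracted_preview FROM file LIMIT 5;"
```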
Troubleshooting common issues
| Issue | Quick Fix | Details |
| --- | --- | --- |
| Document upload fails | Check docker compose logs tika | Sweet spot: 5 to 15MB files |
| No RAG responses | Verify ollama list includes both models | First upload takes 2 to 3 times longer |
| RAG responses seem inaccurate | Check chunk size and overlap settings | Adjust CHUNK_SIZE=500 for technical docs |
| Tika memory errors | Increase to -Xmx6g or -Xmx8g | 4GB handles most documents |
| No text extracted | Standalone image file uploaded (PNG, JPEG, etc.) | Use vision models for standalone image files |
| System memory issues | Monitor with docker stats | 16GB minimum, 24GB comfortable |
| Vector search too slow | Check Qdrant storage space | Database performance degrades when disk is full |
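When working through the table above, a few standard Docker commands cover most of the diagnosis:

```bash
# Follow Tika's logs while re-uploading a problem document (tika is the service name)
docker compose logs -f tika

# One-off snapshot of per-container CPU and memory usage
docker stats --no-stream

# How much disk Docker itself is using (images, volumes, build cache)
docker system df
```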
What's next
You've successfully built a document processing and RAG-enabled document intelligence system! Your self-hosted stack now includes:
✅ Ollama serving your LLM and embedding models
✅ Open WebUI with complete document intelligence
✅ PostgreSQL storing conversations and configurations
✅ Qdrant powering semantic document search and retrieval
✅ Apache Tika providing text extraction from documents like PDFs, Word docs, and text files (with OCR capabilities for scanned content)
Coming up in Part 3
We'll build a visual data extraction pipeline with:
Web upload interface for batch image processing
N8N workflows orchestrating the extraction pipeline
Vision Language Models (VLM) via Ollama for analyzing images
PostgreSQL storage for tracking jobs and extracted structured data
Automated job processing with retry logic and error handling
This adds the ability to extract structured data from images at scale, perfect for processing receipts, invoices, forms, or any visual documents.
Homework before Part 3
Explore your new document processing and RAG capabilities:
- Test document processing: Upload PDFs, Word docs, and text files
- Test RAG: Upload various documents and ask complex questions
- Multi-language: Review the tika-config.xml file in the self-hosted-ai-stack/shared/ directory to understand language settings, add new languages to it, then test documents in different languages (a hands-on way to learn Tika configuration)
- Multi-format: Try Word docs, PowerPoint slides, and Excel sheets (remember: this processes standard document formats, but not standalone image files)
- Advanced queries: Try complex multi-document queries across different formats
- Monitor processing: Browse vector collections in Qdrant's web interface at http://localhost:6333/dashboard
Helpful resources
This is part of my "Complete Self-Hosted AI Infrastructure" series. Follow along as we build increasingly sophisticated AI capabilities, all running self-hosted on your machine.
Written by Farzam Mohammadi
I'm Farzam, a Software Engineer specializing in backend development. My mission: Collaborate, share proven tricks, and help you avoid the pricey surprises I've encountered along the way.