How I Built a Fully Local RAG App with Ollama, FastAPI, and Qdrant

Ahmad W Khan

As developers, we’re often faced with the question:
How do we bring AI into our apps without giving up privacy, control, or blowing the budget?

The answer: local-first AI using Retrieval-Augmented Generation (RAG).

RAG allows you to feed your own data (PDFs, notes, docs) into an LLM, so instead of hallucinating, it grounds its answers in your actual content.
When done locally, this becomes a powerful, private, and fully offline assistant.

In this guide, I’ll show you how I built a local, private ChatGPT clone that:

  • Reads PDFs or markdown files

  • Embeds and indexes them into a vector database (Qdrant)

  • Uses a local LLM (via Ollama) for generating responses

  • Serves everything over a clean FastAPI backend

No OpenAI. No vendor lock-in. No tokens burned.

Architecture Overview

The flow is straightforward: documents are loaded and split into chunks, embedded with Sentence Transformers, and stored in Qdrant. At query time, FastAPI retrieves the most relevant chunks and passes them, together with the question, to a local model served by Ollama.
1. The Theory Behind It

🔸 What is RAG (Retrieval-Augmented Generation)?

RAG bridges two worlds:

  • Information Retrieval (search, chunking, semantic similarity)

  • Text Generation (LLMs like GPT, LLaMA, Mistral)

Instead of making your model "know everything," you let it look things up. This drastically improves accuracy and interpretability.
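
To make the idea concrete, here is a tiny, self-contained toy example: the "retrieval" is just keyword overlap over a handful of strings, and the final step is where a real LLM would go. None of this is the real stack we build below; it only illustrates the retrieve-then-generate shape:

# Toy RAG: retrieve the most relevant snippets, then hand them to a generator.
docs = [
    "Qdrant is a vector database used for semantic search.",
    "Ollama runs LLMs such as Mistral locally.",
    "FastAPI is a Python web framework.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    # Stand-in for embedding + vector search: score by word overlap
    q_words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    # A real system would send this prompt to an LLM instead of returning it
    return f"Context:\n{context}\n\nQuestion: {question}"

print(answer("What runs Mistral locally?"))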

🔸 Why Local?

  • You control your data

  • Costs are predictable (or free)

  • Perfect for privacy-sensitive domains like healthcare, law, or enterprise internal tools


2. Setup & Tools

Stack:

  • Ollama – Run Mistral or LLaMA locally with GPU or CPU.

  • FastAPI – Lightning-fast Python API framework.

  • Qdrant – Vector database for semantic search.

  • LangChain – Orchestrates RAG logic.

  • Sentence Transformers – For embedding docs.


3. Installing Dependencies

Python Packages:

pip install fastapi uvicorn langchain qdrant-client pypdf sentence-transformers requests

Note: on recent LangChain releases you may also need pip install langchain-community, since the document loaders, embeddings, and vector store wrappers used below live there.

Ollama:

# macOS
brew install ollama
ollama run mistral

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Qdrant via Docker:

docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
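
Before wiring anything together, it's worth confirming both services are reachable on their default ports. A quick sanity check from Python:

import requests
from qdrant_client import QdrantClient

# Ollama lists its locally pulled models at /api/tags
print(requests.get("http://localhost:11434/api/tags").json())

# Qdrant should answer with an (initially empty) list of collections
print(QdrantClient(url="http://localhost:6333").get_collections())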

4. Load & Chunk Your Docs

We’ll use LangChain to split PDFs into small chunks for embedding.

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the PDF; each page becomes one Document
loader = PyPDFLoader("example.pdf")
pages = loader.load()

# Split into ~500-character chunks with 50 characters of overlap
# so context isn't lost at chunk boundaries
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.split_documents(pages)
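
Before embedding anything, it's worth a quick look at what the splitter produced:

print(f"{len(docs)} chunks")
print(docs[0].page_content[:200])  # preview the first chunk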

5. Embed and Store in Qdrant

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Qdrant

# Small, fast sentence-transformers model; a solid default for English text
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Embed every chunk and upsert it into a Qdrant collection
qdrant = Qdrant.from_documents(
    documents=docs,
    embedding=embedding_model,
    location="http://localhost:6333",
    collection_name="mydocs"
)
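
You can sanity-check retrieval immediately, before putting any API in front of it:

hits = qdrant.similarity_search("What is this document about?", k=3)
for hit in hits:
    print(hit.page_content[:120])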

6. FastAPI Backend

Let’s build a clean API that reconnects to the Qdrant collection from step 5, retrieves the relevant chunks for each query, and passes them to the local LLM.

from fastapi import FastAPI
from pydantic import BaseModel
import requests

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Qdrant
from qdrant_client import QdrantClient

app = FastAPI()

# Reconnect to the collection we populated in step 5
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
client = QdrantClient(url="http://localhost:6333")
qdrant = Qdrant(client=client, collection_name="mydocs", embeddings=embedding_model)

class Query(BaseModel):
    question: str

@app.post("/ask")
def ask(query: Query):
    # Pull the 4 most relevant chunks for this question
    retriever = qdrant.as_retriever(search_kwargs={"k": 4})
    docs = retriever.get_relevant_documents(query.question)

    # Stitch the chunks into a grounded prompt
    context = "\n\n".join(d.page_content for d in docs)
    prompt = f"Use the following context to answer the question:\n\n{context}\n\nQuestion: {query.question}"

    # Ask the local Ollama server (non-streaming)
    response = requests.post("http://localhost:11434/api/generate", json={
        "model": "mistral",
        "prompt": prompt,
        "stream": False
    })
    response.raise_for_status()

    return {"answer": response.json()["response"]}

Run it (assuming the code above is saved as app.py):

uvicorn app:app --reload

7. Interacting With It

You can now hit:

curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "Summarize the document."}'

Or use Postman, Insomnia, or even a React/Vue frontend.


8. Privacy, Security & Real-World Considerations

  • Your data never leaves your machine. Great for air-gapped environments.

  • If needed, you can Dockerize the whole stack and deploy it on your private cloud.

  • Harden Qdrant with TLS and API-key authentication (see the sketch after this list).

  • Swap the embedding model for intfloat/e5-large-v2 for stronger retrieval quality (or intfloat/multilingual-e5-large if your documents aren’t all in English).
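
On the authentication point: once the Qdrant server is configured with an API key, the client-side change is a single argument. A minimal sketch, with placeholder URL and key:

from qdrant_client import QdrantClient

# Placeholder values: use your own endpoint and configure the same
# API key on the Qdrant server before relying on this.
client = QdrantClient(
    url="https://qdrant.internal.example:6333",
    api_key="change-me",
)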


9. Bonus: Add File Upload + Frontend

Extend your FastAPI backend with an /upload endpoint using aiofiles (a sketch follows this list), and wire up a React frontend with:

  • Drag-and-drop file upload

  • Chat window with streaming responses

  • Local memory using IndexedDB
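
Here is a minimal sketch of that /upload endpoint. It assumes the app, splitter, and qdrant objects from the earlier sections are in scope, that an uploads/ directory exists, and that python-multipart is installed so FastAPI can accept file uploads; treat it as a starting point rather than production code:

import aiofiles
from fastapi import UploadFile
from langchain.document_loaders import PyPDFLoader

@app.post("/upload")
async def upload(file: UploadFile):
    # Save the uploaded PDF to disk without blocking the event loop
    path = f"uploads/{file.filename}"
    async with aiofiles.open(path, "wb") as out:
        await out.write(await file.read())

    # Chunk the new document and add it to the existing collection
    pages = PyPDFLoader(path).load()
    chunks = splitter.split_documents(pages)
    qdrant.add_documents(chunks)

    return {"file": file.filename, "indexed_chunks": len(chunks)}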


10. Final Thoughts

This stack changed how I prototype AI tools. Instead of burning tokens and stressing about data security, I now run entire GPT-style systems locally, with:

  • Real-time responses

  • Grounded context from my docs

  • Full control over prompt tuning and latency


If you enjoyed this, follow my blog or drop me a message. I love building clean, production-ready tools that put AI in the hands of indie developers and engineers.
