Building a Vector Service for Intelligent Document Search with FastAPI, Qdrant & Transformers

Introduction

In the era of information overload, accessing relevant content swiftly from vast document repositories has become essential. This is where vector-based search systems shine: they enable semantic understanding and retrieval, making document search significantly smarter.

In this blog post, I'll walk through the architecture and implementation of a Vector Service, a FastAPI-powered microservice that handles:

  • Document ingestion (PDFs & text files)

  • Text chunking with linguistic awareness

  • Embedding using Transformer models

  • Vector storage & retrieval with Qdrant

  • Search APIs for querying the stored vectors

Letโ€™s explore how each module works, how they integrate, and how you can customize or extend the service for your own applications.

Architecture Overview

The Vector Service is structured around modular, single-responsibility components. The service includes the following core layers:

  1. API Layer (FastAPI) - Routes, request validation, and authentication.

  2. Service Layer - Coordinates ingestion, processing, and storage.

  3. Processing Layer - File reading, chunking, and embedding.

  4. Storage Layer - Handles document metadata and vector persistence.

Project Initialization

The project uses Poetry for dependency management. The pyproject.toml specifies packages such as:

  • fastapi, uvicorn for the API

  • pymupdf, spacy for file parsing and NLP

  • sentence-transformers for embeddings

  • qdrant-client for vector storage

  • asyncpg for async DB operations
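A trimmed-down pyproject.toml for such a service might look like the sketch below (the project metadata and version constraints here are illustrative, not the exact ones used):

[tool.poetry]
name = "vector-service"
version = "0.1.0"
description = "FastAPI microservice for document ingestion and vector search"

[tool.poetry.dependencies]
python = "^3.11"
fastapi = "*"
uvicorn = "*"
pymupdf = "*"
spacy = "*"
sentence-transformers = "*"
qdrant-client = "*"
asyncpg = "*"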

Together, these dependencies cover the API, document parsing, NLP, embedding, and storage layers of the service.

Application Startup

The Vector Service initializes through a carefully designed lifespan context in main.py, leveraging FastAPI's lifespan feature. This approach ensures that critical resources are validated and available before the service begins handling traffic.

Lifespan Management

@asynccontextmanager
async def lifespan(app: FastAPI):
    ...
    yield
    ...

In this asynchronous context manager, the code before the yield runs at startup, before the first request is served, and the code after it runs at shutdown. At startup, it checks two vital components:

  1. Embedding Service Initialization

    Ensures that the EmbeddingService and its dependent modules (chunker, ingestor, embedding store, etc.) are instantiated correctly.

     embedding_service = EmbeddingService()
     if embedding_service is None:
         raise RuntimeError("EmbeddingService initialization failed.")
    
  2. Database Connection Verification

    A dummy session is opened and closed to verify DB connectivity at startup.

     db_service = AsyncSessionLocal()
     await db_service.close()
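Putting the two checks together, a minimal lifespan might look like the sketch below (names are taken from the snippets above; the real implementation may do more, such as attaching the service to app.state):

from contextlib import asynccontextmanager

from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: fail fast if the embedding stack cannot be built
    embedding_service = EmbeddingService()
    if embedding_service is None:
        raise RuntimeError("EmbeddingService initialization failed.")

    # Startup: open and close a session to confirm the database is reachable
    db_service = AsyncSessionLocal()
    await db_service.close()

    yield

    # Shutdown: close clients, flush logs, etc.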
    

CORS Configuration

CORS middleware is enabled for wide compatibility, particularly useful during development:

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

Note: In production, replace "*" with specific domain(s) to enhance security.
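For example, with an explicit allow list (the domains shown are illustrative):

allow_origins=["https://app.example.com", "https://admin.example.com"]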

Route Registration

All embedding-related endpoints are grouped under:

app.include_router(embedding_routes.router, prefix="/api/v1/vector")

This modular approach makes it easy to scale and organize future functionalities (e.g., feedback routes, analytics).

Authentication Layer

To ensure only authorized services can access the Vector Service endpoints, a simple but effective service key authentication mechanism is employed. This is a great fit for internal APIs or microservice architectures where OAuth or user-based authentication may be overkill.

def verify_service_key(service_identity_key: str = Header(...)):
    if service_identity_key != SERVICE_IDENTITY_KEY:
        raise HTTPException(status_code=401, detail="Invalid or missing service identity key")

This lightweight mechanism keeps the service protected in multi-tenant or microservice architectures.

Plugging into Routes

This function is added as a dependency to protected endpoints using Depends.

@router.post("/generate_vectors")
async def generate_vectors(
    document_id: int = Form(...),
    file: UploadFile = File(...),
    _: None = Depends(verify_service_key),
):
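    # The body delegates to EmbeddingService.handle_file_upload, covered in the next section
    ...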

Uploading & Ingesting Files

One of the core capabilities of the Vector Service is transforming raw documents into structured, queryable vector data. This pipeline starts with the handle_file_upload method in the EmbeddingService class and involves multiple subsystems working seamlessly together.

Step 1: Uploading to S3

When a file is uploaded via the /generate_vectors endpoint, it is read into memory and stored on Amazon S3. The file key is dynamically generated to ensure uniqueness and traceability.

timestamp = current_utc_timestamp()
unique_id = uuid4().hex
s3_filename = f"{base_name}_{timestamp}_{unique_id}{ext}"
key = f"{sanitized_document_name}/{s3_filename}"
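A rough sketch of that upload step, using boto3 and a worker thread to avoid blocking the event loop (the bucket name and helper signature are illustrative; the service's actual client code may differ):

import asyncio
from datetime import datetime, timezone
from uuid import uuid4

import boto3

s3 = boto3.client("s3")

async def upload_to_s3(content: bytes, sanitized_document_name: str, base_name: str, ext: str, bucket: str) -> str:
    # Build a unique, traceable key: <document>/<name>_<timestamp>_<uuid><ext>
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    s3_filename = f"{base_name}_{timestamp}_{uuid4().hex}{ext}"
    key = f"{sanitized_document_name}/{s3_filename}"
    # put_object is blocking, so run it in a thread to keep the event loop free
    await asyncio.to_thread(s3.put_object, Bucket=bucket, Key=key, Body=content)
    return key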

Step 2: Recording Metadata

After a successful upload, metadata is persisted in a PostgreSQL database using SQLAlchemy:

await upsert_uploaded_document(
    document_id=document_id,
    file_name=s3_filename,
    file_url=file_url,
    uploaded_at=datetime.now(timezone.utc).replace(tzinfo=None),
)

This enables auditability, historical analysis, and the ability to retrieve or reprocess documents later.
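For reference, an upsert like this can be sketched with SQLAlchemy's PostgreSQL dialect roughly as follows (the UploadedDocument model and its column names are hypothetical stand-ins):

from datetime import datetime

from sqlalchemy.dialects.postgresql import insert

async def upsert_uploaded_document(document_id: int, file_name: str, file_url: str, uploaded_at: datetime) -> None:
    # Insert the row, or update it if a record for this document_id already exists
    stmt = (
        insert(UploadedDocument)
        .values(
            document_id=document_id,
            file_name=file_name,
            file_url=file_url,
            uploaded_at=uploaded_at,
        )
        .on_conflict_do_update(
            index_elements=["document_id"],
            set_={"file_name": file_name, "file_url": file_url, "uploaded_at": uploaded_at},
        )
    )
    async with AsyncSessionLocal() as session:
        await session.execute(stmt)
        await session.commit()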

Step 3: Fetching and Reading

Next, the file is pulled back from S3 and processed locally.

local_path = await self.fetcher.fetch(s3_key, filename)
text = await self.ingestor.ingest(local_path)

The DocumentFetcher handles secure downloading, while FileIngestor supports two formats:

  • .txt: Simple UTF-8 read

  • .pdf: Intelligent parsing with PyMuPDF (see the sketch after this list), including:

    • Font size detection

    • Section classification (h1, h2, h3)

    • Conversion to structured Markdown
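To give a feel for the PDF path, here is a simplified sketch of font-size-based heading detection with PyMuPDF (the size thresholds are illustrative; the real FileIngestor may classify sections differently):

import fitz  # PyMuPDF

def pdf_to_markdown(path: str) -> str:
    # Walk every text span, classify headings by font size, and emit Markdown lines
    lines = []
    for page in fitz.open(path):
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):
                for span in line["spans"]:
                    text, size = span["text"].strip(), span["size"]
                    if not text:
                        continue
                    if size >= 20:
                        lines.append(f"# {text}")
                    elif size >= 16:
                        lines.append(f"## {text}")
                    elif size >= 13:
                        lines.append(f"### {text}")
                    else:
                        lines.append(text)
    return "\n".join(lines)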

Smart Text Chunking with spaCy

Before a document can be embedded, it needs to be broken down into smaller, coherent segments or chunks. This is where the Chunker class comes into play. Its job is to split long-form content into manageable units while preserving context and structure, which is crucial for generating meaningful embeddings.

Why Chunking Matters

Embedding entire documents in one go is inefficient and often exceeds model token limits. But naive splitting (e.g., every 500 characters) can fracture sentences or break logical sections. Chunking ensures:

  • Semantic coherence

  • Better embedding quality

  • Relevance in search results

How It Works

The Chunker is initialized with a spaCy model, defined by the SPACY_CPU_MODEL environment variable. If the model is not available locally, it is automatically downloaded:

self.nlp = spacy.load(model)
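The download fallback typically follows the common pattern below (a sketch; the exact implementation may differ):

import spacy

def load_spacy_model(model: str):
    try:
        return spacy.load(model)
    except OSError:
        # The model is not installed locally: download it once, then load it
        spacy.cli.download(model)
        return spacy.load(model)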

Step 1: Section Splitting

Markdown-style heading detection (like ##, ###) is used to isolate sections.

pattern = re.compile(r"(?:^|\n)(#{2,6} .+)", re.MULTILINE)

This helps keep content related to the same topic together, even across multiple paragraphs.
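One way a splitter can apply this pattern is with re.split, which keeps the captured headings alongside their bodies (a simplified sketch, not necessarily the exact _split_into_sections implementation):

import re

def split_into_sections(text: str) -> list[str]:
    # re.split with a capturing group yields [preamble, heading1, body1, heading2, body2, ...]
    parts = re.split(r"(?:^|\n)(#{2,6} .+)", text)
    sections = [parts[0].strip()] if parts[0].strip() else []
    for i in range(1, len(parts), 2):
        body = parts[i + 1].strip() if i + 1 < len(parts) else ""
        sections.append(f"{parts[i]}\n{body}".strip())
    return sections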

Step 2: Sentence Segmentation

Each section is parsed into sentences using spaCy's Doc.sents, providing linguistically accurate boundaries.

doc = self.nlp(section)
sentences = [sent.text.strip() for sent in doc.sents]

This is far more reliable than using punctuation or regex.

Step 3: Chunk Assembly

Sentences are grouped into chunks based on a configurable max_chunk_size and overlap. This rolling window approach ensures:

  • Context preservation (via overlapping words)

  • Token budget control (chunks under model limits)

if current_len + sentence_len <= max_chunk_size:
    current_chunk.append(sentence)
else:
    chunks.append(" ".join(current_chunk))
    i = max(i - overlap, i + 1)
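The fragment above is part of a larger loop; the core rolling-window idea can be sketched as follows (assuming max_chunk_size is measured in characters and overlap in sentences, which may differ from the actual Chunker):

def assemble_chunks(sentences: list[str], max_chunk_size: int = 500, overlap: int = 1) -> list[str]:
    # Fill a chunk up to max_chunk_size characters, then start the next chunk
    # `overlap` sentences back so neighboring chunks share some context
    chunks, i = [], 0
    while i < len(sentences):
        current, length, j = [], 0, i
        while j < len(sentences) and length + len(sentences[j]) <= max_chunk_size:
            current.append(sentences[j])
            length += len(sentences[j]) + 1  # +1 for the joining space
            j += 1
        if not current:
            # A single sentence longer than the budget is emitted on its own
            current, j = [sentences[j]], j + 1
        chunks.append(" ".join(current))
        # Step back by `overlap` sentences, but always make forward progress
        i = max(j - overlap, i + 1)
    return chunks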

Customization Tips

  • Adjust max_chunk_size based on your transformer model's token limits (e.g., 384, 512).

  • Use different overlap values for chat-like vs. article-style documents.

  • Extend _split_into_sections for HTML, XML, or custom formats.

Embedding with Sentence Transformers

At the heart of the Vector Service's intelligence lies its ability to convert natural language into numerical vectors that machines can reason about. This is achieved through the powerful sentence-transformers library, specifically the SentenceTransformer class.

What Are Sentence Embeddings?

Sentence embeddings are high-dimensional vectors that represent the semantic meaning of a sentence. Two semantically similar sentences will have embeddings that are close together in vector space.

Example:

  • "How to reset my password?"

  • "I forgot my password, how can I recover it?"

These two queries will produce vectors with a small cosine distance between them, enabling smart, context-aware search.
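You can check this with a few lines of sentence-transformers (the model name below is just an example):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "How to reset my password?",
    "I forgot my password, how can I recover it?",
])
# A cosine similarity close to 1.0 means the two sentences are semantically close
print(util.cos_sim(embeddings[0], embeddings[1]).item())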


Each chunk is transformed into a dense vector representation suitable for semantic similarity operations.

Model Initialization

The EmbeddingStore class initializes the embedding model once during service startup.

self.model = SentenceTransformer(TRANSFORMER_MODEL)

The model is configurable via an environment variable (TRANSFORMER_MODEL), allowing you to choose models such as:

  • all-MiniLM-L6-v2 (fast, lightweight)

  • multi-qa-mpnet-base-dot-v1 (optimized for QA)

  • paraphrase-multilingual-MiniLM-L12-v2 (for multilingual support)

Embedding Process

Text chunks are passed to the model for encoding. To keep the event loop responsive, encoding is done in a background thread using asyncio.to_thread.

embeddings = await asyncio.to_thread(self.model.encode, texts)

This non-blocking strategy ensures the API remains performant even during batch embedding.

Output Format

The output is a NumPy array of shape (n_chunks, embedding_dimension). This array is ready to be stored in Qdrant or used for similarity comparisons.

Generated embeddings shape: (10, 384)

Extensibility

Want to upgrade the intelligence of your embeddings?

  • Replace SentenceTransformer with OpenAI's API or Hugging Face Transformers.

  • Fine-tune a model on your own document corpus for better relevance.

  • Add attention-based chunk scoring before embedding.

Storing Vectors in Qdrant

Qdrant is used as the vector database. Collections are dynamically created per document using a sanitized document name. If the collection already exists, it is dropped and re-initialized.

self.client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
)

Each vector is stored alongside its metadata.

PointStruct(id=str(uuid4()), vector=vec.tolist(), payload=payload)

This metadata-enriched storage enables filtering and attribution in search results.
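Putting those pieces together, the storage step can be sketched roughly as follows (the payload fields are illustrative; the actual service stores its own metadata):

from uuid import uuid4

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

def store_vectors(client: QdrantClient, collection_name: str, embeddings, chunks: list[str]) -> None:
    # Drop the per-document collection if it already exists, then create it fresh
    client.recreate_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(size=embeddings.shape[1], distance=Distance.COSINE),
    )
    points = [
        PointStruct(
            id=str(uuid4()),
            vector=vector.tolist(),
            payload={"chunk_index": index, "text": chunk},  # illustrative payload
        )
        for index, (vector, chunk) in enumerate(zip(embeddings, chunks))
    ]
    client.upsert(collection_name=collection_name, points=points)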

Vector Search APIs

The service exposes a search endpoint that embeds an incoming query with the same model and returns the most similar chunks from the relevant collection:

POST /api/v1/vector/search_vectors
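A minimal version of such an endpoint might look like the sketch below, assuming the route module already has access to the SentenceTransformer model, the Qdrant client, and verify_service_key shown earlier (parameter names are illustrative):

import asyncio

from fastapi import APIRouter, Depends, Form

router = APIRouter()

@router.post("/search_vectors")
async def search_vectors(
    query: str = Form(...),
    collection_name: str = Form(...),
    top_k: int = Form(5),
    _: None = Depends(verify_service_key),
):
    # Embed the query with the same model used at ingestion time, off the event loop
    query_vector = await asyncio.to_thread(model.encode, query)
    hits = client.search(
        collection_name=collection_name,
        query_vector=query_vector.tolist(),
        limit=top_k,
    )
    return [{"score": hit.score, "payload": hit.payload} for hit in hits]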

Logging, Structure, and Maintenance

All services use a singleton ChatbotLogger that writes to rotating logs for auditability:

logger = ChatbotLogger().get_logger()

The codebase is cleanly structured, leveraging Python best practices:

  • Async I/O everywhere

  • Clear separation of concerns

  • Dependency injection via FastAPI's Depends

  • Code formatting with black, isort, and linting via ruff

Example Use Case

Imagine you're building an internal search engine for an enterprise. Each team uploads project reports, strategy docs, or PDFs. Using this Vector Service, you can:

  • Automatically process and vectorize every document.

  • Enable semantic search over documents filtered by team.

  • Extend with features like highlights, context answers, or QA chatbots.

Conclusion

The Vector Service offers a robust foundation for any document intelligence application. From ingestion to intelligent search, it transforms static files into queryable knowledge using modern NLP and vector technologies.

By combining FastAPI, Qdrant, and sentence-transformers, this microservice enables:

  • Scalable file processing with asynchronous pipelines.

  • Domain-specific vector search through per-document indexing.

  • Smart document chunking that preserves structure and meaning.

  • API-first architecture, ready for a frontend or for integration with other microservices.

What's Next? Extensions & Ideas

Looking to take it further? Here are a few ideas:

  • Integrate with LLMs (like GPT-4) for contextual answers based on search results.

  • Add multilingual support using a multilingual transformer model.

  • Build a frontend UI for document upload and semantic search.

  • Implement OAuth or JWT auth for user-level access controls.

  • Add feedback loops for relevance tuning and retraining.

This system is not only functional but extensible: a perfect playground for developers working with AI, search, and document management.

