Building a Vector Service for Intelligent Document Search with FastAPI, Qdrant & Transformers
Table of contents
- Introduction
- Architecture Overview
- Project Initialization
- Application Startup
- Authentication Layer
- Uploading & Ingesting Files
- Smart Text Chunking with spaCy
- Embedding with Sentence Transformers
- Storing Vectors in Qdrant
- Vector Search APIs
- Logging, Structure, and Maintenance
- Example Use Case
- Conclusion
- What's Next? Extensions & Ideas

Introduction
In the era of information overload, accessing relevant content swiftly from vast document repositories has become essential. This is where vector-based search systems shine: they enable semantic understanding and retrieval, making document search significantly smarter.
In this blog post, I'll walk through the architecture and implementation of a Vector Service, a FastAPI-powered microservice that handles:
- Document ingestion (PDFs & text files)
- Text chunking with linguistic awareness
- Embedding using Transformer models
- Vector storage & retrieval with Qdrant
- Search APIs for semantic retrieval
Let's explore how each module works, how they integrate, and how you can customize or extend the service for your own applications.
Architecture Overview
The Vector Service is structured around modular, single-responsibility components. The service includes the following core layers:
- API Layer (FastAPI): routes, request validation, and authentication.
- Service Layer: coordinates ingestion, processing, and storage.
- Processing Layer: file reading, chunking, and embedding.
- Storage Layer: handles document metadata and vector persistence.
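One way these layers could map onto a project tree (directory and file names here are illustrative, not the actual repository layout):

```
vector_service/
├── main.py                  # app setup: lifespan, CORS, routers
├── routes/
│   └── embedding_routes.py  # API layer: endpoints + auth dependency
├── services/
│   └── embedding_service.py # service layer: orchestrates the pipeline
├── processing/
│   ├── ingestor.py          # file reading (.txt, .pdf)
│   ├── chunker.py           # spaCy-based chunking
│   └── embedding_store.py   # sentence-transformers encoding
└── storage/
    ├── qdrant_store.py      # vector persistence
    └── db.py                # SQLAlchemy document metadata
```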
Project Initialization
The project uses Poetry for dependency management. The pyproject.toml specifies packages such as:
- fastapi, uvicorn for the API
- pymupdf, spacy for file parsing and NLP
- sentence-transformers for embeddings
- qdrant-client for vector storage
- asyncpg for async DB operations
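As a sketch, the dependency section of pyproject.toml might look like this (the version constraints are illustrative, not taken from the project):

```toml
[tool.poetry.dependencies]
python = "^3.11"
fastapi = "^0.110"
uvicorn = "^0.29"
pymupdf = "^1.24"
spacy = "^3.7"
sentence-transformers = "^2.7"
qdrant-client = "^1.9"
asyncpg = "^0.29"
```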
Together, these dependencies cover the full pipeline: HTTP handling, document parsing, NLP, embedding, and storage.
Application Startup
The Vector Service initializes through a lifespan context in main.py, leveraging FastAPI's lifespan feature. This approach ensures that critical resources are validated and available before the service begins handling traffic.
Lifespan Management
@asynccontextmanager
async def lifespan(app: FastAPI):
    # startup: initialize services and verify connectivity
    ...
    yield
    # shutdown: release any held resources
    ...
This asynchronous context manager runs before the first request and after the last one. It checks two vital components:
1. Embedding Service Initialization: ensures that the EmbeddingService and its dependent modules (chunker, ingestor, embedding store, etc.) are instantiated correctly.
embedding_service = EmbeddingService()
if embedding_service is None:
    raise RuntimeError("EmbeddingService initialization failed.")
2. Database Connection Verification: a dummy session is opened and closed to verify DB connectivity at startup.
db_service = AsyncSessionLocal()
await db_service.close()
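Put together, the startup half of the lifespan might read like this (a sketch assembled from the snippets above; the module-level names are assumed):

```python
@asynccontextmanager
async def lifespan(app: FastAPI):
    global embedding_service
    # instantiate the whole pipeline (chunker, ingestor, embedding store, ...)
    embedding_service = EmbeddingService()
    if embedding_service is None:
        raise RuntimeError("EmbeddingService initialization failed.")
    # open and close a dummy session to fail fast on DB misconfiguration
    db_service = AsyncSessionLocal()
    await db_service.close()
    yield
```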
CORS Configuration
CORS middleware is enabled for wide compatibility, particularly useful during development:
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
Note: In production, replace "*" with specific domain(s) to enhance security.
Route Registration
All embedding-related endpoints are grouped under:
app.include_router(embedding_routes.router, prefix="/api/v1/vector")
This modular approach makes it easy to scale and organize future functionalities (e.g., feedback routes, analytics).
Authentication Layer
To ensure only authorized services can access the Vector Service endpoints, a simple but effective service key authentication mechanism is employed. This is a great fit for internal APIs or microservice architectures where OAuth or user-based authentication may be overkill.
from fastapi import Header, HTTPException

def verify_service_key(service_identity_key: str = Header(...)):
    if service_identity_key != SERVICE_IDENTITY_KEY:
        raise HTTPException(status_code=401, detail="Invalid or missing service identity key")
This lightweight mechanism keeps the service protected in multi-tenant or microservice architectures.
Plugging into Routes
This function is added as a dependency to protected endpoints using Depends.
@router.post("/generate_vectors")
async def generate_vectors(
    document_id: int = Form(...),
    file: UploadFile = File(...),
    _: None = Depends(verify_service_key),
):
    # hand off to the service layer; the exact handle_file_upload
    # signature is assumed (see the next section)
    return await embedding_service.handle_file_upload(document_id, file)
Uploading & Ingesting Files
One of the core capabilities of the Vector Service is transforming raw documents into structured, queryable vector data. This pipeline starts with the handle_file_upload method in the EmbeddingService class and involves multiple subsystems working together.
Step 1: Uploading to S3
When a file is uploaded via the /generate_vectors endpoint, it is read into memory and stored on Amazon S3. The file key is dynamically generated to ensure uniqueness and traceability.
timestamp = current_utc_timestamp()
unique_id = uuid4().hex
s3_filename = f"{base_name}_{timestamp}_{unique_id}{ext}"
key = f"{sanitized_document_name}/{s3_filename}"
Step 2: Recording Metadata
After a successful upload, metadata is persisted in a PostgreSQL database using SQLAlchemy:
await upsert_uploaded_document(
    document_id=document_id,
    file_name=s3_filename,
    file_url=file_url,
    uploaded_at=datetime.now(timezone.utc).replace(tzinfo=None),
)
This enables auditability, historical analysis, and the ability to retrieve or reprocess documents later.
Step 3: Fetching and Reading
Next, the file is pulled back from S3 and processed locally.
local_path = await self.fetcher.fetch(s3_key, filename)
text = await self.ingestor.ingest(local_path)
The DocumentFetcher handles secure downloading, while FileIngestor supports two formats:
- .txt: a simple UTF-8 read
- .pdf: intelligent parsing with PyMuPDF, including font size detection, section classification (h1, h2, h3), and conversion to structured Markdown (sketched below)
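The font-size-to-heading idea can be sketched with PyMuPDF like this (the size thresholds and function name are illustrative; the service's real ingestor is more involved):

```python
import fitz  # PyMuPDF

def pdf_to_markdown(path: str, h1=20.0, h2=16.0, h3=13.0) -> str:
    """Map large font sizes to Markdown headings; thresholds need calibration per corpus."""
    out = []
    with fitz.open(path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):  # image blocks have no "lines"
                    text = "".join(span["text"] for span in line["spans"]).strip()
                    if not text:
                        continue
                    size = max(span["size"] for span in line["spans"])
                    if size >= h1:
                        out.append(f"# {text}")
                    elif size >= h2:
                        out.append(f"## {text}")
                    elif size >= h3:
                        out.append(f"### {text}")
                    else:
                        out.append(text)
    return "\n".join(out)
```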
Smart Text Chunking with spaCy
Before a document can be embedded, it needs to be broken down into smaller, coherent segments, or chunks. This is where the Chunker class comes into play. Its job is to split long-form content into manageable units while preserving context and structure, which is crucial for generating meaningful embeddings.
Why Chunking Matters
Embedding entire documents in one go is inefficient and often exceeds model token limits. But naive splitting (e.g., every 500 characters) can fracture sentences or break logical sections. Chunking ensures:
- Semantic coherence
- Better embedding quality
- Relevance in search results
How It Works
The Chunker is initialized with a spaCy model, defined by the SPACY_CPU_MODEL environment variable. If the model is not available locally, it is automatically downloaded:
self.nlp = spacy.load(model)
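A common load-with-fallback pattern looks like this (a sketch; the service's actual handling may differ):

```python
import spacy

def load_spacy_model(model: str):
    try:
        return spacy.load(model)
    except OSError:
        # model not installed locally: download it, then retry
        spacy.cli.download(model)
        return spacy.load(model)
```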
Step 1: Section Splitting
Markdown-style heading detection (like ## and ###) is used to isolate sections.
pattern = re.compile(r"(?:^|\n)(#{2,6} .+)", re.MULTILINE)
This helps keep content related to the same topic together, even across multiple paragraphs.
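Used with re.split, the capture group keeps each heading attached to the body that follows it (a sketch; the actual _split_into_sections may differ):

```python
import re

pattern = re.compile(r"(?:^|\n)(#{2,6} .+)", re.MULTILINE)

def split_into_sections(text: str) -> list[str]:
    parts = pattern.split(text)
    # parts alternates [preamble, heading, body, heading, body, ...]
    sections = [parts[0].strip()] if parts[0].strip() else []
    for heading, body in zip(parts[1::2], parts[2::2]):
        sections.append(f"{heading}\n{body.strip()}")
    return sections
```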
Step 2: Sentence Segmentation
Each section is parsed into sentences using spaCy's Doc.sents, providing linguistically accurate boundaries.
doc = self.nlp(section)
sentences = [sent.text.strip() for sent in doc.sents]
This is far more reliable than using punctuation or regex.
Step 3: Chunk Assembly
Sentences are grouped into chunks based on a configurable max_chunk_size and overlap. This rolling-window approach ensures:
- Context preservation (via overlapping sentences)
- Token budget control (chunks under model limits)
if current_len + sentence_len <= max_chunk_size:
    current_chunk.append(sentence)
    i += 1
else:
    chunks.append(" ".join(current_chunk))
    # rewind by `overlap` sentences for shared context, but always move
    # past the start of the chunk just emitted so the loop makes progress
    i = max(i - overlap, chunk_start + 1)
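A self-contained version of that loop might look like the following (a sketch under assumed names; the guard on the rewind is what keeps the window moving forward):

```python
def assemble_chunks(sentences: list[str], max_chunk_size: int = 500, overlap: int = 2) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    current_len = i = chunk_start = 0
    while i < len(sentences):
        sentence = sentences[i]
        if not current or current_len + len(sentence) <= max_chunk_size:
            current.append(sentence)
            current_len += len(sentence)
            i += 1
        else:
            chunks.append(" ".join(current))
            # re-use the last `overlap` sentences, but never fall back to
            # (or before) the previous chunk's start
            i = max(i - overlap, chunk_start + 1)
            chunk_start = i
            current, current_len = [], 0
    if current:
        chunks.append(" ".join(current))
    return chunks
```

For example, with overlap=2 and sentences of roughly 100 characters each, every new chunk re-opens with the last two sentences of the previous one.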
Customization Tips
- Adjust max_chunk_size based on your transformer model's token limits (e.g., 384, 512).
- Use different overlap values for chat-like vs. article-style documents.
- Extend _split_into_sections for HTML, XML, or custom formats.
Embedding with Sentence Transformers
At the heart of the Vector Service's intelligence lies its ability to convert natural language into numerical vectors that machines can reason about. This is achieved through the powerful sentence-transformers library, specifically the SentenceTransformer class.
What Are Sentence Embeddings?
Sentence embeddings are high-dimensional vectors that represent the semantic meaning of a sentence. Two semantically similar sentences will have embeddings that are close together in vector space.
Example:
"How to reset my password?"
"I forgot my password, how can I recover it?"
These two queries will produce vectors with a small cosine distance between them, enabling smart, context-aware search.
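You can check this directly with sentence-transformers' util.cos_sim (a quick sketch; the model here is one of the options listed below):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
vec_a, vec_b = model.encode([
    "How to reset my password?",
    "I forgot my password, how can I recover it?",
])
# semantically similar sentences score close to 1.0
print(util.cos_sim(vec_a, vec_b))
```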
In the service, each chunk is transformed into a dense vector representation in the same way, suitable for semantic similarity operations.
Model Initialization
The EmbeddingStore class initializes the embedding model once during service startup.
self.model = SentenceTransformer(TRANSFORMER_MODEL)
The model is configurable via an environment variable (TRANSFORMER_MODEL), allowing you to choose models such as:
- all-MiniLM-L6-v2 (fast, lightweight)
- multi-qa-mpnet-base-dot-v1 (optimized for QA)
- paraphrase-multilingual-MiniLM-L12-v2 (for multilingual support)
Embedding Process
Text chunks are passed to the model for encoding. To keep the event loop responsive, encoding is done in a background thread using asyncio.to_thread.
embeddings = await asyncio.to_thread(self.model.encode, texts)
This non-blocking strategy ensures the API remains performant even during batch embedding.
Output Format
The output is a NumPy array of shape (n_chunks, embedding_dimension). This array is ready to be stored in Qdrant or used for similarity comparisons.
Generated embeddings shape: (10, 384)
Extensibility
Want to upgrade the intelligence of your embeddings?
- Replace SentenceTransformer with OpenAI's API or Hugging Face Transformers.
- Fine-tune a model on your own document corpus for better relevance.
- Add attention-based chunk scoring before embedding.
Storing Vectors in Qdrant
Qdrant is used as the vector database. Collections are dynamically created per document using a sanitized document name. If the collection already exists, it is dropped and re-initialized.
self.client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
)
Each vector is stored alongside its metadata.
PointStruct(id=str(uuid4()), vector=vec.tolist(), payload=payload)
This metadata-enriched storage enables filtering and attribution in search results.
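End to end, the drop-and-recreate plus upsert flow could look like this (a sketch; the payload fields and local Qdrant URL are assumptions):

```python
from uuid import uuid4

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

def store_vectors(client: QdrantClient, collection_name: str,
                  embeddings, chunks: list[str], document_id: int) -> None:
    # drop and re-initialize the per-document collection
    if client.collection_exists(collection_name):
        client.delete_collection(collection_name)
    client.create_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(size=embeddings.shape[1], distance=Distance.COSINE),
    )
    # store each vector alongside metadata for filtering and attribution
    points = [
        PointStruct(
            id=str(uuid4()),
            vector=vec.tolist(),
            payload={"document_id": document_id, "chunk_index": idx, "text": chunk},
        )
        for idx, (vec, chunk) in enumerate(zip(embeddings, chunks))
    ]
    client.upsert(collection_name=collection_name, points=points)
```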
Vector Search APIs
The service exposes a search endpoint for querying similar vectors:
POST /api/v1/vector/search_vectors
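A minimal sketch of what that endpoint could look like (the request model and the module-level model, client, and verify_service_key objects are assumptions):

```python
import asyncio

from fastapi import APIRouter, Depends
from pydantic import BaseModel

router = APIRouter()

class SearchRequest(BaseModel):
    collection_name: str  # the per-document collection to query
    query: str
    top_k: int = 5

@router.post("/search_vectors")
async def search_vectors(req: SearchRequest, _: None = Depends(verify_service_key)):
    # embed the query off the event loop, just like document chunks
    query_vec = await asyncio.to_thread(model.encode, req.query)
    hits = client.search(
        collection_name=req.collection_name,
        query_vector=query_vec.tolist(),
        limit=req.top_k,
    )
    return [{"score": hit.score, "payload": hit.payload} for hit in hits]
```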
Logging, Structure, and Maintenance
All services use a singleton ChatbotLogger that writes to rotating logs for auditability:
logger = ChatbotLogger().get_logger()
The codebase is cleanly structured, leveraging Python best practices:
- Async I/O everywhere
- Clear separation of concerns
- Dependency injection via FastAPI's Depends
- Code formatting with black and isort, and linting via ruff
Example Use Case
Imagine you're building an internal search engine for an enterprise. Each team uploads project reports, strategy docs, or PDFs. Using this Vector Service, you can:
- Automatically process and vectorize every document.
- Enable semantic search over documents filtered by team.
- Extend with features like highlights, context answers, or QA chatbots.
Conclusion
The Vector Service offers a robust foundation for any document intelligence application. From ingestion to intelligent search, it transforms static files into queryable knowledge using modern NLP and vector technologies.
By combining FastAPI, Qdrant, and sentence-transformers, this microservice enables:
- Scalable file processing with asynchronous pipelines.
- Domain-specific vector search through per-document indexing.
- Smart document chunking that preserves structure and meaning.
- API-first architecture ready for a frontend or integration with other microservices.
What's Next? Extensions & Ideas
Looking to take it further? Here are a few ideas:
- Integrate with LLMs (like GPT-4) for contextual answers based on search results.
- Add multilingual support using a multilingual transformer model.
- Build a frontend UI for document upload and semantic search.
- Implement OAuth or JWT auth for user-level access controls.
- Add feedback loops for relevance tuning and retraining.
This system is not only functional but extensible: a perfect playground for developers working with AI, search, and document management.