Semantic Search Data Engineering Pipeline: RAG Without the AI


Building a Semantic Document Search System
```mermaid
flowchart TD
    %% Color definitions
    classDef default fill:#2c3e50,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1
    classDef processing fill:#3498db,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1
    classDef storage fill:#9b59b6,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1
    classDef embedding fill:#2ecc71,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1
    classDef query fill:#f1c40f,stroke:#34495e,stroke-width:2px,color:#34495e
    classDef display fill:#e74c3c,stroke:#ecf0f1,stroke-width:2px,color:#ecf0f1

    subgraph Pipeline1["Document Processing Pipeline"]
        A["PDF Documents"] --> B["Supabase Storage Upload"]
        B --> C["File Parsing via Llama Index"]
        C --> D["Text Semantic Chunking via LangChain Flask API (Vercel)"]
        D --> E["Text Embedding Generation via Nomic-Embed-Text Flask API (Vercel)"]
        E --> F["Supabase Upload: Text per Embedding ID"]
        E --> G["Pinecone Upload: Embedding per Embedding ID"]
    end

    subgraph Pipeline2["Query Processing Pipeline"]
        H["User Query"] --> I["Convert Query to Embedding via Nomic-Embed-Text Flask API (Vercel)"]
        I --> J["Compare Embeddings via Pinecone Query API (Return Top 2 References)"]
        J --> K["Display References in UI (Show Source Information)"]
        K --> L["Display Results Based on Retrieved References"]
    end

    %% Styling nodes by category
    class A default
    class B,C storage
    class D processing
    class E,F,G,I embedding
    class H,J query
    class K,L display
```
In today's data-driven world, organizations are drowning in unstructured information. PDF documents, reports, manuals, and other text-based resources contain valuable knowledge, but accessing this information efficiently remains challenging. While Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are gaining popularity, not every solution requires the generative AI component.
In this post, I'll walk through how I built a powerful semantic search system for documents that captures the "retrieval" part of RAG without the "generation" component - providing accurate document references without synthesizing new content.
The Architecture
Our system consists of two primary pipelines:
Document Processing Pipeline
This pipeline handles the ingestion and processing of documents (a code sketch of the full flow follows the steps below):
PDF Document Collection: The starting point is a repository of PDF documents containing the information we want to make searchable.
Supabase Storage Upload: Documents are uploaded to Supabase storage, providing a centralized location for all our documents.
File Parsing via Llama Index: We utilize Llama Index to extract and structure the content from our PDFs. This tool effectively transforms unstructured documents into structured content.
Text Semantic Chunking: Using LangChain's Flask API (hosted on Vercel), we divide the document content into semantic chunks - logical sections that preserve context rather than arbitrary splits.
Text Embedding Generation: Each chunk is processed through the Nomic-Embed-Text Flask API (also hosted on Vercel) to generate vector embeddings. These embeddings capture the semantic meaning of text in a mathematical format.
Dual Storage Strategy:
We store the text chunks in Supabase, indexed by unique embedding IDs.
We upload the vector embeddings to Pinecone, a vector database optimized for similarity search.
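To make this concrete, here's a minimal end-to-end sketch of the ingestion flow in Python. The endpoint URLs, JSON payload shapes, and the table and index names are placeholders I've assumed for illustration; the real pipeline exposes chunking and embedding as Flask APIs on Vercel, which this sketch models as plain HTTP calls.

```python
# Ingestion sketch: parse -> chunk -> embed -> dual store.
# CHUNK_API/EMBED_API, the "chunks" table, and the "documents" index are assumed names.
import uuid
import requests
from llama_index.core import SimpleDirectoryReader
from pinecone import Pinecone
from supabase import create_client

CHUNK_API = "https://example.vercel.app/chunk"  # hypothetical LangChain chunking service
EMBED_API = "https://example.vercel.app/embed"  # hypothetical Nomic-Embed-Text service

supabase = create_client("https://YOUR_PROJECT.supabase.co", "SUPABASE_KEY")
index = Pinecone(api_key="PINECONE_KEY").Index("documents")

# 1. Parse the PDFs into text with Llama Index.
for doc in SimpleDirectoryReader("./pdfs").load_data():
    source = doc.metadata.get("file_name", "unknown")

    # 2. Split into semantic chunks via the hosted chunking API (assumed contract).
    chunks = requests.post(CHUNK_API, json={"text": doc.text}).json()["chunks"]

    for chunk in chunks:
        # 3. Embed each chunk (assumed contract).
        vector = requests.post(EMBED_API, json={"text": chunk}).json()["embedding"]

        # 4. Dual storage keyed by one shared embedding ID.
        embedding_id = str(uuid.uuid4())
        supabase.table("chunks").insert(
            {"embedding_id": embedding_id, "text": chunk, "source": source}
        ).execute()
        index.upsert(vectors=[
            {"id": embedding_id, "values": vector, "metadata": {"source": source}}
        ])
```

Keying both stores on the same embedding ID is what lets the query pipeline turn a Pinecone match back into readable, attributable text.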
Query Processing Pipeline
This pipeline handles user interactions (a matching code sketch follows these steps):
User Query: The process begins when a user submits a text query seeking information.
Query Embedding: The user's query is converted into an embedding using the same Nomic-Embed-Text model, ensuring compatibility with our document embeddings.
Embedding Comparison: Pinecone's Query API compares the query embedding with stored document embeddings, returning the top 2 most semantically similar text chunks.
Reference Display: The system displays these references in the UI along with source information, helping users understand where the information originated.
Results Display: Finally, the system presents the retrieved information based on semantic relevance rather than keyword matching.
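The retrieval side mirrors the ingestion sketch, reusing the same assumed endpoint and store names:

```python
# Query sketch: embed the query, find the nearest chunks, fetch their text.
import requests
from pinecone import Pinecone
from supabase import create_client

EMBED_API = "https://example.vercel.app/embed"  # hypothetical embedding service
supabase = create_client("https://YOUR_PROJECT.supabase.co", "SUPABASE_KEY")
index = Pinecone(api_key="PINECONE_KEY").Index("documents")

def search(query: str, top_k: int = 2) -> list[dict]:
    # Embed the query with the same model used for the document chunks.
    vector = requests.post(EMBED_API, json={"text": query}).json()["embedding"]

    # Ask Pinecone for the top-k nearest chunk embeddings.
    response = index.query(vector=vector, top_k=top_k, include_metadata=True)

    # Map each match back to its stored text and source via the shared ID.
    results = []
    for match in response.matches:
        row = (supabase.table("chunks").select("text, source")
               .eq("embedding_id", match.id).single().execute())
        results.append({"score": match.score, **row.data})
    return results

for hit in search("What does the onboarding manual say about security training?"):
    print(f"{hit['score']:.3f}  {hit['source']}: {hit['text'][:120]}...")
```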
Technical Implementation Details
For this implementation, I leveraged several key technologies:
Embedding Model: Nomic-Embed-Text provides high-quality embeddings for both document chunks and user queries.
Vector Database: Pinecone stores and efficiently searches through vector embeddings.
Storage Solution: Supabase stores both the original documents and the text chunks.
Processing Tools: Llama Index for document parsing and LangChain for semantic chunking.
Deployment: All API components are deployed on Vercel for reliable scaling.
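For a sense of what those Vercel-hosted services might look like, here is one plausible layout with both routes collapsed into a single Flask app. The route names, payload shapes, and the SemanticChunker-plus-NomicEmbeddings pairing are my assumptions (the post doesn't publish the service code), and the nomic client needs an Atlas API key configured in the environment.

```python
# A plausible shape for the chunking and embedding microservices.
from flask import Flask, jsonify, request
from langchain_experimental.text_splitter import SemanticChunker
from langchain_nomic import NomicEmbeddings
from nomic import embed

app = Flask(__name__)

# SemanticChunker splits where the embedding similarity between adjacent
# sentences drops, so boundaries follow meaning rather than character counts.
chunker = SemanticChunker(NomicEmbeddings(model="nomic-embed-text-v1.5"))

@app.route("/chunk", methods=["POST"])
def chunk():
    return jsonify({"chunks": chunker.split_text(request.get_json()["text"])})

@app.route("/embed", methods=["POST"])
def embed_text():
    out = embed.text(texts=[request.get_json()["text"]],
                     model="nomic-embed-text-v1.5")
    return jsonify({"embedding": out["embeddings"][0]})
```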
The Benefits of This Approach
By implementing a "RAG without the AI" approach, we gain several advantages:
Reference Transparency: Users receive direct references to relevant documents rather than AI-generated summaries that might contain hallucinations.
Semantic Understanding: Unlike traditional keyword search, this system understands the meaning behind queries, returning contextually relevant results.
Source Verification: Each result links directly to its source document, enabling users to verify information.
Reduced Complexity: Without the generative component, the system is simpler to implement, debug, and maintain.
Lower Computational Requirements: Vector similarity search requires fewer resources than running large language models.
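The last two points come down to simple math: at query time the system only compares vectors. Here's a toy version with 3-dimensional vectors (real Nomic embeddings have hundreds of dimensions, and Pinecone layers approximate-nearest-neighbor indexing on top so it never scans every vector):

```python
# Toy illustration of the similarity ranking a vector database performs.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.1, 0.9, 0.2])
chunks = {
    "chunk-a": np.array([0.0, 1.0, 0.1]),  # semantically close to the query
    "chunk-b": np.array([0.9, 0.1, 0.0]),  # semantically distant
}

# Rank chunks by similarity; the top results become the returned references.
for cid, vec in sorted(chunks.items(), key=lambda kv: -cosine(query, kv[1])):
    print(cid, round(cosine(query, vec), 3))
```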
Real-World Applications
This system is particularly valuable for:
Legal Firms: Searching through case law and precedents
Healthcare Organizations: Finding relevant medical documentation
Financial Institutions: Locating specific regulatory guidance
Research Organizations: Discovering relevant papers and findings
Educational Institutions: Connecting students with relevant learning materials
Conclusion
Building a semantic document search system using embedding-based retrieval provides organizations with a powerful tool to unlock the value hidden in their unstructured data. By focusing on the retrieval component without the generative AI aspect, we create a system that:
Delivers accurate, source-verified information
Understands the semantic meaning behind user queries
Scales efficiently with growing document collections
Maintains transparency in information retrieval
For organizations with large collections of documents that need to be searchable by meaning rather than just keywords, this approach offers significant value. It bridges the gap between traditional search and full RAG systems, providing a practical solution for making institutional knowledge accessible without the complexity and potential pitfalls of generative AI.
The next time you're considering implementing a document search solution, remember that sometimes you don't need the "G" in RAG to deliver transformative results.
P.S. Let's Build Something Cool Together!
Drowning in data? Pipelines giving you a headache? I've been there – and I actually enjoy fixing these things. I'm that data engineer who:
- Makes ETL pipelines behave
- Turns data warehouse chaos into zen
- Gets ML models from laptop to production
If you found this blog interesting, connect with me on LinkedIn and make sure to leave a message!