Creating a Hybrid Search System for the Medical Domain Using Qdrant

You can find the code for this tutorial here: https://github.com/Rudr16a/SuperTeams

TL;DR:

  • Here we explore how to create a hybrid search system for the medical domain using Qdrant.

  • We demonstrate how combining the strengths of sparse and dense vectors yields a search system that is both accurate and context-aware, making it well suited to retrieving complex medical information.

Introduction

In medicine, professionals and administrators need to search for and retrieve data reliably. Whether it's accessing patient records, clinical trial data, or the latest research articles, the accuracy and speed of search can affect medical outcomes. Traditional search methods often rely on keyword-based approaches, which, while useful, may miss the deeper, context-based nuances of medical information. This is where hybrid search systems, which combine sparse and dense vectors, come into play.

Sparse vectors represent traditional keyword-based searches, focusing on exact term matching, while dense vectors utilize semantic search, understanding the context and meaning behind the terms. This guide will walk you through the process of creating a hybrid search system tailored for the medical domain using Qdrant, a high-performance vector search engine. By the end of this article, you'll have a solid understanding of how to set up, implement, and optimize a hybrid search system that leverages both sparse and dense vectors to deliver accurate and relevant search results.

Hybrid search combines two powerful techniques: sparse vector search and dense vector search. Sparse vectors are typically used in traditional search engines, where documents are indexed based on the presence or absence of keywords. This method excels at retrieving documents that contain specific terms but may struggle with understanding context or synonyms.

On the other hand, dense vectors are derived from advanced machine learning models like NV-Embed-v2 or bge-en-icl, which generate embeddings that capture the semantic meaning of text. These embeddings allow the search system to understand the context, synonyms, and related terms, providing more comprehensive search results.

By combining these two approaches, hybrid search systems can achieve higher accuracy, relevance, and comprehensiveness in their results. Imagine searching for a specific medical condition: a sparse vector search might retrieve documents that mention the exact term, while a dense vector search could also surface relevant articles that discuss related conditions or symptoms.

  • Accuracy: By leveraging both keyword matching and semantic understanding, hybrid search systems can retrieve more accurate results.

  • Relevance: Dense vectors help in understanding the context, ensuring that the most relevant documents are prioritized.

  • Comprehensiveness: The combination of sparse and dense vectors reduces the chance that relevant information is missed, providing a broader view of the topic.

Sparse and Dense Vectors: An Overview

Sparse Vectors

Sparse vectors are representations of documents where only certain dimensions (representing keywords) have non-zero values, while every other dimension stays at zero. These vectors are often used in traditional search engines to match exact terms. For example, in a sparse vector representation of a medical document, specific terms like "diabetes" or "insulin" would have non-zero values.
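As a minimal sketch of this idea, the snippet below stores a document as a dictionary of non-zero dimensions and scores a query against it with a dot product. The five-term vocabulary and the raw term counts are assumptions made for illustration; a real system would use a full vocabulary and learned or TF-IDF-style weights.

```python
# Toy vocabulary: term -> dimension index (hypothetical, for illustration only)
vocab = {"diabetes": 0, "insulin": 1, "fracture": 2, "chemotherapy": 3, "glucose": 4}

def to_sparse(text):
    """Count vocabulary terms in the text; only non-zero dimensions are stored."""
    sparse = {}
    for token in text.lower().split():
        token = token.strip(".,;:!?")
        if token in vocab:
            idx = vocab[token]
            sparse[idx] = sparse.get(idx, 0.0) + 1.0
    return sparse

def sparse_dot(a, b):
    """Dot product over the dimensions the two sparse vectors share."""
    return sum(weight * b[idx] for idx, weight in a.items() if idx in b)

doc = to_sparse("Insulin therapy lowers glucose in diabetes. Insulin doses vary.")
query = to_sparse("diabetes insulin")

print("Document vector:", doc)
print("Query vector:", query)
print("Match score:", sparse_dot(query, doc))
```

Terms outside the vocabulary contribute nothing, which is why a purely sparse representation cannot match a synonym it has never indexed.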

Dense Vectors

Dense vectors, on the other hand, are continuous representations of documents generated by embedding models. These vectors capture the semantic meaning of the text, allowing for more context-aware searches. For instance, a dense vector generated from a sentence like "treatment for high blood sugar" would be close to a vector for "diabetes management" in the vector space.
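To illustrate what "close in the vector space" means, here is a small sketch that compares embeddings with cosine similarity. The 4-dimensional vectors are made-up stand-ins chosen for illustration; real embedding models produce hundreds or thousands of dimensions, and the actual values would come from the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings (not produced by any real model)
embeddings = {
    "treatment for high blood sugar": [0.82, 0.51, 0.10, 0.05],
    "diabetes management": [0.78, 0.58, 0.12, 0.02],
    "ankle fracture rehabilitation": [0.05, 0.10, 0.85, 0.50],
}

query = embeddings["treatment for high blood sugar"]
for text, vector in embeddings.items():
    print(f"{cosine_similarity(query, vector):.3f}  {text}")
```

With these assumed values, the two diabetes-related phrases score far higher against each other than either does against the unrelated phrase, even though they share no keywords.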

Combining Sparse and Dense Vectors

The power of hybrid search comes from combining these two types of vectors. Sparse vectors handle the exact term matching, while dense vectors understand the context and synonyms. This combination ensures that the search system retrieves both precise and contextually relevant documents.

Let’s Code

Setting Up the Environment

SPLADE (Sparse Lexical and Expansion model) is an approach for producing sparse vectors for text representation. Unlike traditional dense embeddings, SPLADE adapts language models such as BERT so that only a small subset of dimensions in the vocabulary-sized vector space is active.

Here’s a high-level approach to generate sparse vectors using SPLADE:

  1. Install SPLADE: You need the SPLADE implementation, which is typically built on top of Hugging Face transformers and PyTorch.

!pip install transformers torch

You might also need specific SPLADE implementation repositories or their models. You can clone the relevant GitHub repository if available.

git clone https://github.com/naver/splade

  2. Load a pre-trained SPLADE model: SPLADE models can be found on Hugging Face’s model hub. For example, you could use naver/splade-cocondenser-ensembledistil.

  3. Generate the sparse vector for a piece of text:

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load pre-trained SPLADE model and tokenizer
model_name = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

def generate_sparse_vector(text):
    # Tokenize input text
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
   
    # Run the masked-LM head to get per-token vocabulary logits
    with torch.no_grad():
        outputs = model(**inputs)

    # SPLADE weighting: log(1 + ReLU(logits)), masked by attention, max-pooled over tokens
    logits = outputs.logits[0]
    mask = inputs['attention_mask'][0].unsqueeze(-1)
    sparse_vector = torch.max(torch.log1p(torch.relu(logits)) * mask, dim=0)[0]
   
    # Convert to dictionary format (term_id: weight)
    sparse_dict = {idx: weight.item() for idx, weight in enumerate(sparse_vector) if weight > 0}
   
    return sparse_dict

# Example usage
text = "This is an example sentence for SPLADE sparse vector generation."
sparse_vector = generate_sparse_vector(text)

print("Sparse vector:")
print(sparse_vector)

# Optionally, you can map the term IDs back to tokens
id2token = {v: k for k, v in tokenizer.get_vocab().items()}
sparse_terms = {id2token[idx]: weight for idx, weight in sparse_vector.items()}

print("\nSparse terms:")
print(sparse_terms)
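The pooling step inside generate_sparse_vector can be traced by hand without downloading a model. The sketch below applies the same log(1 + ReLU(x)) weighting, attention masking, and max-pooling over token positions to a toy logits matrix; the numbers are invented for illustration and the helper name splade_pool is hypothetical.

```python
import math

# Toy logits: 3 token positions x 4 vocabulary entries (hypothetical values)
logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, 3.0, -2.0, -0.5],
    [9.0, 9.0, 9.0, 9.0],  # padding position, ignored via the mask
]
attention_mask = [1, 1, 0]

def splade_pool(logits, attention_mask):
    """Max over token positions of log(1 + ReLU(logit)), skipping masked positions."""
    vocab_size = len(logits[0])
    pooled = [0.0] * vocab_size
    for row, keep in zip(logits, attention_mask):
        for j, value in enumerate(row):
            weight = math.log1p(max(value, 0.0)) * keep
            pooled[j] = max(pooled[j], weight)
    # Keep only the active (non-zero) vocabulary dimensions
    return {j: w for j, w in enumerate(pooled) if w > 0}

toy_sparse = splade_pool(logits, attention_mask)
print(toy_sparse)
```

Negative logits are zeroed by the ReLU and the padded row is removed by the mask, so only vocabulary entries that some real token activates end up in the sparse vector.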

Dense vector search is more complex as it involves generating embeddings that capture the semantic meaning of text. These vectors are then indexed in Qdrant for efficient retrieval based on similarity.

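from qdrant_client import QdrantClient, models

# Connect to Qdrant (assumes a local instance; adjust the URL for your deployment)
client = QdrantClient(url="http://localhost:6333")

# Create a collection for dense vectors. The size must match the embedding
# model's output dimension (768 here; all-MiniLM-L6-v2, used later, produces 384).
client.create_collection(
    collection_name='medical_dense_vectors',
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
)

# Index the dense vectors; df is assumed to hold 'dense_vectors' and 'text' columns
client.upload_collection(
    collection_name='medical_dense_vectors',
    vectors=df['dense_vectors'].tolist(),
    payload=[{'text': text} for text in df['text'].tolist()],
)

# Query the dense vectors (generate_dense_vector is defined in the full implementation below)
query_vector = generate_dense_vector('treatment for high blood sugar')
results = client.search(
    collection_name='medical_dense_vectors',
    query_vector=query_vector,
    limit=5,
)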

Combining Sparse and Dense Vectors

The final step is to combine the sparse and dense vector searches into a hybrid system using Qdrant’s Query API, which supports prefetch and fusion mechanisms. In a hybrid query, the sparse and dense vectors are searched in the same request, and the results are fused by a scoring method.

We can define prefetch queries that first run individual searches for sparse and dense vectors. Then, Qdrant offers two main methods to combine the results:

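  1. Reciprocal Rank Fusion (RRF): Scores each document by the reciprocal of its rank in each result list, so documents that rank highly in both searches rise to the top.

  2. Distribution-Based Score Fusion (DBSF): Normalizes the scores from each query and then sums them per document.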

client.query_points(
    collection_name="medical_vectors",
    prefetch=[
        models.Prefetch(
            query=models.SparseVector(indices=[1, 42], values=[0.22, 0.8]),
            using="sparse",
            limit=20,
        ),
        models.Prefetch(
            query=[0.01, 0.45, 0.67],
            using="dense",
            limit=20,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=10
)
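To make the two fusion methods concrete, here is a plain-Python sketch of both. It is not Qdrant's implementation: the constant k=60 is the value commonly used in the RRF literature and Qdrant's internal constant may differ, the DBSF bounds of mean ± 3 standard deviations follow the description in Qdrant's documentation, and the document IDs and scores are invented for illustration.

```python
import statistics

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def distribution_based_score_fusion(score_lists):
    """Normalize each list with mean +/- 3 std as bounds, then sum per document."""
    fused = {}
    for scores in score_lists:
        values = list(scores.values())
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        low, high = mean - 3 * std, mean + 3 * std
        for doc_id, score in scores.items():
            # If every score is identical, fall back to the midpoint
            normalized = 0.5 if high == low else (score - low) / (high - low)
            fused[doc_id] = fused.get(doc_id, 0.0) + normalized
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical results from a sparse and a dense search
sparse_scores = {"doc_a": 12.0, "doc_b": 9.0, "doc_c": 3.0}
dense_scores = {"doc_b": 0.9, "doc_d": 0.8, "doc_a": 0.4}

# RRF only needs the rank order of each list
sparse_ranking = sorted(sparse_scores, key=sparse_scores.get, reverse=True)
dense_ranking = sorted(dense_scores, key=dense_scores.get, reverse=True)

rrf_result = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
dbsf_result = distribution_based_score_fusion([sparse_scores, dense_scores])

print("RRF: ", [doc_id for doc_id, _ in rrf_result])
print("DBSF:", [doc_id for doc_id, _ in dbsf_result])
```

RRF ignores the raw scores entirely, which makes it robust when sparse and dense scores live on very different scales. DBSF keeps score magnitudes but rescales each list first so that neither search dominates the sum.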

Code for Implementation of Combined Sparse and Dense Vectors

!pip install transformers torch sentence-transformers numpy

from transformers import AutoTokenizer, AutoModelForMaskedLM
from sentence_transformers import SentenceTransformer
import torch
import numpy as np

# Load pre-trained SPLADE model and tokenizer
splade_model_name = "naver/splade-cocondenser-ensembledistil"
splade_tokenizer = AutoTokenizer.from_pretrained(splade_model_name)
splade_model = AutoModelForMaskedLM.from_pretrained(splade_model_name)

# Load pre-trained Sentence Transformer model
sbert_model_name = "all-MiniLM-L6-v2"
sbert_model = SentenceTransformer(sbert_model_name)

def generate_sparse_vector(text):
    inputs = splade_tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
   
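    with torch.no_grad():
        outputs = splade_model(**inputs)

    # SPLADE weighting: log(1 + ReLU(logits)), masked by attention, max-pooled over tokens
    logits = outputs.logits[0]
    mask = inputs['attention_mask'][0].unsqueeze(-1)
    sparse_vector = torch.max(torch.log1p(torch.relu(logits)) * mask, dim=0)[0]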
   
    sparse_dict = {idx: weight.item() for idx, weight in enumerate(sparse_vector) if weight > 0}
    return sparse_dict

def generate_dense_vector(text):
    return sbert_model.encode(text)

def combine_vectors(sparse_vector, dense_vector, alpha=0.5):
    # Convert sparse vector to dense format
    max_dim = max(sparse_vector.keys()) + 1
    sparse_dense = np.zeros(max_dim)
    for idx, weight in sparse_vector.items():
        sparse_dense[idx] = weight
   
    # Normalize vectors
    sparse_dense = sparse_dense / np.linalg.norm(sparse_dense)
    dense_vector = dense_vector / np.linalg.norm(dense_vector)
   
    # Pad dense vector if necessary
    if len(dense_vector) < max_dim:
        dense_vector = np.pad(dense_vector, (0, max_dim - len(dense_vector)))
    else:
        dense_vector = dense_vector[:max_dim]
   
    # Combine vectors
    combined = alpha * sparse_dense + (1 - alpha) * dense_vector
   
    return combined

# Example usage
text = "This is an example sentence for combining sparse and dense vectors."
sparse_vector = generate_sparse_vector(text)
dense_vector = generate_dense_vector(text)
combined_vector = combine_vectors(sparse_vector, dense_vector)

print("Sparse vector (top 5 terms):")
print(dict(sorted(sparse_vector.items(), key=lambda x: x[1], reverse=True)[:5]))

print("\nDense vector (first 5 dimensions):")
print(dense_vector[:5])

print("\nCombined vector (first 5 dimensions):")
print(combined_vector[:5])

# Optionally, map sparse terms to tokens
id2token = {v: k for k, v in splade_tokenizer.get_vocab().items()}
sparse_terms = {id2token[idx]: weight for idx, weight in sparse_vector.items()}

print("\nTop 5 sparse terms:")
print(dict(sorted(sparse_terms.items(), key=lambda x: x[1], reverse=True)[:5]))

Result Showcase Using Gradio or Streamlit UI

Gradio

import gradio as gr

# Assuming client and vectorizer, and generate_dense_vector function are already defined

def combine_results(sparse_results, dense_results, alpha=0.5, beta=0.5):
    combined_scores = {}

    # Index dense scores by point ID so each lookup is a dictionary access
    dense_scores = {result.id: result.score for result in dense_results}

    # Combine the sparse and dense results using the custom weights
    # (Qdrant returns ScoredPoint objects, so fields are read as attributes)
    for result in sparse_results:
        combined_scores[result.id] = alpha * result.score + beta * dense_scores.get(result.id, 0)

    # Add any dense results not in sparse results
    for result in dense_results:
        if result.id not in combined_scores:
            combined_scores[result.id] = beta * result.score  # Only dense score

    return combined_scores

def search(query):
    # Generate sparse vector and search
    sparse_query = vectorizer.transform([query]).toarray()
    sparse_results = client.search(
        collection_name='medical_sparse_vectors',
        query_vector=sparse_query[0],
        limit=5
    )

    # Generate dense vector and search
    dense_query = generate_dense_vector(query)
    dense_results = client.search(
        collection_name='medical_dense_vectors',
        query_vector=dense_query,
        limit=5
    )

    # Combine results using a custom weighting mechanism
    combined_results = combine_results(sparse_results, dense_results)

    # Top K combined results (reranked) as (point ID, score) pairs
    top_k_combined_results = sorted(combined_results.items(), key=lambda x: x[1], reverse=True)[:5]

    # Look up payloads by point ID so the reranked IDs map back to documents
    payloads = {r.id: r.payload for r in list(sparse_results) + list(dense_results)}

    # Format results for display
    sparse_texts = [result.payload for result in sparse_results]
    dense_texts = [result.payload for result in dense_results]
    combined_texts = [payloads[doc_id] for doc_id, _ in top_k_combined_results]

    return sparse_texts, dense_texts, combined_texts

iface = gr.Interface(
    fn=search,
    inputs="text",
    outputs=["text", "text", "text"],
    title="Hybrid Search System for Medical Domain",
    description="Compare results using Sparse Vectors, Dense Vectors, and a Hybrid approach."
)

iface.launch()

Streamlit

import streamlit as st

# Assumes client, vectorizer, generate_dense_vector, and combine_results are defined as above

def search(query):
    # Generate sparse vector and search
    sparse_query = vectorizer.transform([query]).toarray()
    sparse_results = client.search(
        collection_name='medical_sparse_vectors',
        query_vector=sparse_query[0],
        limit=5
    )

    # Generate dense vector and search
    dense_query = generate_dense_vector(query)
    dense_results = client.search(
        collection_name='medical_dense_vectors',
        query_vector=dense_query,
        limit=5
    )

    # Combine results using a custom weighting mechanism (point ID -> fused score)
    combined_results = combine_results(sparse_results, dense_results)

    return sparse_results, dense_results, combined_results

# Streamlit UI
st.title("Hybrid Search System for Medical Domain")
query = st.text_input("Enter your search query:")

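if query:
    sparse_results, dense_results, combined_results = search(query)

    st.subheader("Results using Sparse Vectors:")
    for result in sparse_results:
        st.write(result.payload)

    st.subheader("Results using Dense Vectors:")
    for result in dense_results:
        st.write(result.payload)

    # combined_results maps point ID -> fused score, so look up payloads by ID
    payloads = {r.id: r.payload for r in list(sparse_results) + list(dense_results)}

    st.subheader("Combined Hybrid Results:")
    for doc_id, score in sorted(combined_results.items(), key=lambda x: x[1], reverse=True)[:5]:
        st.write(payloads[doc_id])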

Showcase Results Using Sparse Vectors

When the user enters a search query, the system first uses sparse vectors (keyword-based) to retrieve the results. These results will typically match the exact terms in the query but may miss out on related terms or context.

Example Sparse Vector Search

  • Query: "Anthracyclines treatment"

Sparse Vector Search Example: Query: "Anthracyclines"

Results:

  1. Document titled "Chemotherapy protocols using Anthracyclines"

  2. Article discussing "Anthracyclines in breast cancer treatment"

  3. Study on "Cardiotoxic effects of Anthracyclines"

These results are highly specific because sparse vector searches are based on exact keyword matches. The system retrieves documents where the term "Anthracyclines" is explicitly mentioned. This can be useful for retrieving documents with uncommon or domain-specific terminology.

Showcase Results Using Dense Vectors

Next, the system performs a search using dense vectors, which capture the semantic meaning of the query. This allows the search to return documents that are contextually relevant, even if the exact terms are not matched.

Example Dense Vector Search

  • Query: "Anthracyclines treatment"

If you perform a dense vector search using the query "Anthracyclines," the results will focus on semantically similar concepts, potentially retrieving documents that discuss chemotherapy or cancer treatment without necessarily mentioning "Anthracyclines" directly. Here's how the results might look:


Dense Vector Search Example: Query: "Anthracyclines"

Results:

  1. Article titled "Effective treatment of breast cancer using chemotherapy"

  2. Clinical report on "Common drugs used in chemotherapy"

  3. Study discussing "Side effects of cancer medications"

  4. Research on "Advancements in cancer drug therapies"

  5. Article on "Cardiotoxic effects of chemotherapy agents"

In this case, dense vector search prioritizes semantic meaning, so even if "Anthracyclines" isn't mentioned explicitly, documents related to chemotherapy and cancer treatment, which are conceptually similar, will appear. However, specific details or papers that focus on "Anthracyclines" (e.g., cardiotoxic effects unique to this class of drugs) might be missed unless they are highly relevant to the broader chemotherapy context.

Showcase Combined Results and Their Quality

Finally, the combined hybrid search results are displayed. These results integrate both the precise term matching of sparse vectors and the contextual understanding of dense vectors. The combination ensures that the most relevant documents are retrieved, offering both precision and depth.

Example Hybrid Search Results

  • Query: "Anthracyclines treatment"

Results:

  1. Sparse Match: Document titled "Chemotherapy protocols using Anthracyclines"
    (Sparse search catches this because it contains the exact term "Anthracyclines.")

  2. Sparse Match: Study on "Cardiotoxic effects of Anthracyclines"
    (Another direct match due to the exact keyword.)

  3. Dense Match: Article on "Effective treatment of breast cancer using chemotherapy"
    (Dense search retrieves this because it understands that Anthracyclines is related to chemotherapy, even if the term is not directly mentioned.)

  4. Dense Match: Clinical report on "Common drugs used in chemotherapy"
    (Another semantically relevant result related to chemotherapy drugs in general, which could include Anthracyclines.)

  5. Dense Match: Study discussing "Side effects of cancer medications"
    (Dense search picks up this document due to the similarity with the query's topic on cancer drug side effects.)

To Summarize:

Sparse search pulls in documents with exact matches like "Anthracyclines," ensuring that domain-specific terms aren't missed. Dense search enriches the results by adding documents that are conceptually similar, such as those discussing chemotherapy or cancer drugs, even if "Anthracyclines" isn't explicitly mentioned.

This hybrid approach ensures that both precise (keyword) and semantically related (conceptual) documents are retrieved, reducing the chances of missing relevant information while still keeping the results relevant to the user's intent.
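One practical detail when weighting scores directly is that sparse and dense scores usually live on different scales. The sketch below, which is one possible approach rather than the method used in the repository, rescales each result list to the 0 to 1 range before applying the weights. The scores are invented for illustration, and the helper names min_max and weighted_fusion are hypothetical.

```python
def min_max(scores):
    """Rescale a {document: score} mapping to the 0..1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (score - lo) / (hi - lo) for doc, score in scores.items()}

def weighted_fusion(sparse_scores, dense_scores, alpha=0.5):
    """Blend normalized scores: alpha weights sparse, (1 - alpha) weights dense."""
    sparse_n, dense_n = min_max(sparse_scores), min_max(dense_scores)
    docs = set(sparse_n) | set(dense_n)
    fused = {
        doc: alpha * sparse_n.get(doc, 0.0) + (1 - alpha) * dense_n.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical scores for the "Anthracyclines treatment" query
sparse_scores = {
    "Chemotherapy protocols using Anthracyclines": 14.2,
    "Cardiotoxic effects of Anthracyclines": 11.7,
    "Anthracyclines in breast cancer treatment": 9.3,
}
dense_scores = {
    "Effective treatment of breast cancer using chemotherapy": 0.71,
    "Chemotherapy protocols using Anthracyclines": 0.68,
    "Common drugs used in chemotherapy": 0.55,
}

ranked = weighted_fusion(sparse_scores, dense_scores, alpha=0.5)
for title, score in ranked:
    print(f"{score:.3f}  {title}")
```

A document found by both searches accumulates weight from each side, so it tends to rank above documents that only one search retrieved. Raising alpha favors exact keyword matches, and lowering it favors semantic matches.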

Conclusion

In this guide, we delved into the process of building a hybrid search system specifically designed for the medical field using Qdrant. By integrating the strengths of sparse vectors, which focus on precise keyword matching, with dense vectors, which excel at understanding contextual meaning, we created a search solution that balances both accuracy and semantic relevance. This hybrid approach enhances the ability to retrieve complex and nuanced medical information, making it highly effective for tasks such as accessing patient records, clinical data, and research articles. The result is a powerful, context-aware search system, ideal for supporting medical professionals in making informed decisions quickly and efficiently.

GitHub

Code Repository: https://github.com/Rudr16a/SuperTeams

References

Qdrant: https://qdrant.tech/documentation/

Hugging Face: https://huggingface.co/mtebad/classification_model

Written by

Rudra Pratap Dash