Hybrid search with QDrant, OpenAI, and re-rank algorithms
As an introduction to this article, please read my previous notes:
Sparse vectors have a large number of dimensions, where only a small portion of values are non-zero. When used for keyword search, each sparse vector represents a document; the dimensions represent words from a dictionary, and the values represent the importance (e.g. by BM25) of these words in the document.
To convert your text corpus to sparse vectors, you can use BM25 (more common) or SPLADE.
Sparse vectors shine in domains and scenarios with many rare keywords or specialized terms. For example, the medical domain contains many rare terms that are absent from the general vocabulary, so general-purpose dense vectors cannot capture the nuances of the domain.
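To make the idea concrete, here is a purely illustrative, hand-rolled BM25 sketch over a toy corpus (the corpus, vocabulary, and the parameters k1 and b are assumptions chosen for brevity; it is not the pipeline used later in this article):

import math
from collections import Counter

# Toy corpus; in practice this would be your document collection
corpus = [
    "the patient shows symptoms of acute pancreatitis",
    "acute pancreatitis requires immediate treatment",
    "the weather today is sunny and warm",
]

k1, b = 1.5, 0.75                                       # standard BM25 parameters
docs = [doc.split() for doc in corpus]
avgdl = sum(len(d) for d in docs) / len(docs)           # average document length
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
df = Counter(w for d in docs for w in set(d))           # document frequency per word
N = len(docs)

def bm25_sparse_vector(doc):
    """Return a sparse vector {dimension index: weight} with BM25 term weights."""
    tf = Counter(doc)
    vec = {}
    for word, f in tf.items():
        idf = math.log((N - df[word] + 0.5) / (df[word] + 0.5) + 1)
        weight = idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        vec[vocab[word]] = round(weight, 4)
    return vec

print(bm25_sparse_vector(docs[0]))
# Only the handful of words present in the document get non-zero weights.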
I. Configure QDrant to store sparse vectors.
Since version 1.7, QDrant supports sparse vectors natively. It doesn't matter which tool created the vector: in QDrant, a sparse vector is a JSON object with two parallel array fields (effectively emulating key-value pairs):
{
    "indices": [
        1012,
        1996,
        25309
    ],
    "values": [
        0.06361289,
        1.0990041,
        0.08670003
    ]
}
The corresponding configuration looks like this:
from qdrant_client import QdrantClient, models

COLLECTION_NAME = "sparse-collection"  # choose any collection name

client = QdrantClient(
    url="https://e9d74ef0-b9f2-4b44-b5f0-e22ea1d6fc34.europe-west3-0.gcp.cloud.qdrant.io:6334",
    api_key="...",
    prefer_grpc=True,
)
# client = QdrantClient("localhost")

client.recreate_collection(
    collection_name=COLLECTION_NAME,
    vectors_config={},  # no dense vectors configured here yet
    sparse_vectors_config={
        "text-sparse": models.SparseVectorParams(
            index=models.SparseIndexParams(
                on_disk=False,
            )
        )
    },
)
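Once the collection exists, documents are stored as points whose named sparse vector matches the "text-sparse" configuration above. A minimal upsert sketch (the id, indices, values, and payload below are placeholder values for illustration; in practice they come from encoding your documents):

client.upsert(
    collection_name=COLLECTION_NAME,
    points=[
        models.PointStruct(
            id=1,
            vector={
                "text-sparse": models.SparseVector(
                    indices=[1012, 1996, 25309],
                    values=[0.06361289, 1.0990041, 0.08670003],
                )
            },
            payload={"text": "the original document text"},
        )
    ],
)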
II. How to create a sparse vector.
- Using transformers, define the tokenizer and the model:
from transformers import AutoModelForMaskedLM, AutoTokenizer
model_id = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
- Now you need PyTorch (preferably with CUDA support), because the tokens are easily manipulated as tensors:
import torch

# `prompt` is the input text to encode
tokens = tokenizer(prompt, return_tensors="pt")
output = model(**tokens)
- Compute a vector from logits and attention mask using ReLU, log, and max operations.
logits, attention_mask = output.logits, tokens.attention_mask
# SPLADE takes the probability distribution from the MLM step
# and aggregates it into a single distribution called the "Importance Estimation".
# This distribution represents the sparse vector, highlighting relevant tokens
# that may not even exist in the original input sequence.
relu_log = torch.log(1 + torch.relu(logits))
weighted_log = relu_log * attention_mask.unsqueeze(-1)
max_val, _ = torch.max(weighted_log, dim=1)
vec = max_val.squeeze()
- Finally, combine the steps above into a single function:
def compute_sparse_vector(prompt, tokenizer, model):
    """
    Computes a sparse SPLADE vector for the given prompt
    using ReLU, log, and max operations over the model logits.

    Args:
        prompt (str): The input text to encode.
        tokenizer: The tokenizer matching the SPLADE model.
        model: The masked-language model producing the logits.

    Returns:
        tuple: (torch.Tensor, BatchEncoding) - the sparse vector and the tokenized input.
    """
    tokens = tokenizer(prompt, return_tensors="pt")
    output = model(**tokens)
    logits, attention_mask = output.logits, tokens.attention_mask
    relu_log = torch.log(1 + torch.relu(logits))
    weighted_log = relu_log * attention_mask.unsqueeze(-1)
    max_val, _ = torch.max(weighted_log, dim=1)
    vec = max_val.squeeze()
    return vec, tokens
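To see what the SPLADE vector actually encodes, you can map its non-zero dimensions back to tokens through the tokenizer vocabulary. This is a small optional sketch (the helper explain_sparse_vector is just an inspection aid, not part of the pipeline), assuming the tokenizer and model defined above:

def explain_sparse_vector(vec, tokenizer, top_k=10):
    """Show the top-k highest-weighted tokens of a SPLADE sparse vector."""
    # Map vocabulary ids back to token strings
    id_to_token = {idx: token for token, idx in tokenizer.get_vocab().items()}
    indices = vec.nonzero().squeeze(-1).tolist()
    weights = vec[indices].detach().tolist()
    ranked = sorted(zip(indices, weights), key=lambda x: x[1], reverse=True)[:top_k]
    return [(id_to_token[i], round(w, 3)) for i, w in ranked]

vec, tokens = compute_sparse_vector("sparse vectors in medical search", tokenizer, model)
print(explain_sparse_vector(vec, tokenizer))
# The output typically contains the input words plus semantically related terms.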
III. Search with a sparse vector.
prompt = "הרעב"  # Hebrew for "the famine"
query_vec, query_tokens = compute_sparse_vector(prompt, tokenizer, model)
query_indices = query_vec.nonzero().numpy().flatten()
query_values = query_vec.detach().numpy()[query_indices]
results = client.search(
    collection_name=COLLECTION_NAME,
    query_vector=models.NamedSparseVector(
        name="text-sparse",
        vector=models.SparseVector(
            indices=query_indices,
            values=query_values
        )
    ),
    limit=3
)
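Each entry in results is a ScoredPoint, so you can read the id, the similarity score, and the stored payload directly. A minimal sketch, assuming the points were upserted with a "text" payload field as in section I:

for point in results:
    payload = point.payload or {}
    print(point.id, round(point.score, 4), payload.get("text", ""))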
IV. Add semantic search.
We talked about sparse vector search; the previous part ended with the following QDrant API call:
results = client.search(
    collection_name=COLLECTION_NAME,
    query_vector=models.NamedSparseVector(
        name="text-sparse",
        vector=models.SparseVector(
            indices=query_indices,
            values=query_values
        )
    ),
    limit=3
)
Now let's add semantic search to this approach by executing both searches concurrently. Thanks to the search_batch API, this is fairly easy:
search_queries = [
    models.SearchRequest(
        vector=models.NamedVector(
            name="text-dense",
            vector=compute_dense_vector(prompt)
        ),
        limit=10,
        with_payload=True
    ),
    models.SearchRequest(
        vector=models.NamedSparseVector(
            name="text-sparse",
            vector=models.SparseVector(
                indices=query_indices,
                values=query_values,
            ),
        ),
        limit=10,
        with_payload=True
    )
]
and then send this list to search_batch():
result = client.search_batch(
    collection_name=COLLECTION_NAME,
    requests=search_queries
)
The vector passed into the first SearchRequest is an embedding produced by any suitable embedding engine, in this case OpenAI's 'text-embedding-3-large' model. Symmetrically to the compute_sparse_vector() function, add a compute_dense_vector() function that uses the OpenAI API to create the embeddings:
from openai import OpenAI

openAIClient = OpenAI(api_key="...")

def compute_dense_vector(prompt, model="text-embedding-3-large"):
    prompt = prompt.replace("\n", " ")
    return openAIClient.embeddings.create(input=prompt, model=model).data[0].embedding
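Note that the collection from section I declared only the sparse vector (vectors_config={}). For the hybrid setup the collection also needs a named dense vector whose size matches the embedding model; the sketch below assumes 'text-embedding-3-large' with its default 3072 dimensions and cosine distance:

client.recreate_collection(
    collection_name=COLLECTION_NAME,
    vectors_config={
        "text-dense": models.VectorParams(
            size=3072,  # output size of text-embedding-3-large
            distance=models.Distance.COSINE,
        )
    },
    sparse_vectors_config={
        "text-sparse": models.SparseVectorParams(
            index=models.SparseIndexParams(on_disk=False)
        )
    },
)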
The search_batch() method returns a list of result lists, one per request and in request order, so result[0] holds the dense hits and result[1] the sparse hits.
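To merge the two result lists into a single ranking, one simple, model-free re-rank option is Reciprocal Rank Fusion (RRF). The sketch below is one possible client-side implementation, not a QDrant API; the constant k=60 is just the commonly used default:

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of ScoredPoint into one list ordered by RRF score."""
    scores, points = {}, {}
    for hits in result_lists:
        for rank, point in enumerate(hits):
            scores[point.id] = scores.get(point.id, 0.0) + 1.0 / (k + rank + 1)
            points[point.id] = point
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(points[pid], score) for pid, score in fused]

dense_hits, sparse_hits = result[0], result[1]
for point, rrf_score in reciprocal_rank_fusion([dense_hits, sparse_hits])[:5]:
    print(point.id, round(rrf_score, 4))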