Optimizing Vector Storage with halfvecs
RAG (Retrieval Augmented Generation) is a powerful technique that lets us enhance the output of large language models (LLMs) with private documents and proprietary knowledge that is not available elsewhere, such as a company's internal documents or a researcher's notes.
There are many ways to give relevant context to LLMs in a RAG system. We can use a simple keyword search in the database, or more advanced search algorithms like BM25 that go beyond plain keyword matching. Here is an example of a simple keyword search.
SELECT *
FROM articles
WHERE content LIKE '%keyword%';
A step further, we can use pretrained language models to create embeddings that encode a lot of information through high dimensionality. Here is a simple example using the excellent SentenceTransformers library.
from sentence_transformers import SentenceTransformer
# 1. Load model
model = SentenceTransformer("all-MiniLM-L6-v2")
# 2. Our documents and query
documents = [
"Python is great for programming",
"I have a pet dog",
"The weather is sunny today"
]
query = "How to program in Python?"
# 3. Calculate embeddings
doc_embeddings = model.encode(documents)
query_embedding = model.encode(query)
# 4. Find similarities
similarities = model.similarity(query_embedding, doc_embeddings)
print(f"Most similar document: {documents[similarities.argmax()]}")
# Python is great for programming
In recent years, pretrained language models have greatly improved the quality of text embeddings. However, in our experience, the main challenge for efficient document retrieval is not the performance of the embedding model but the earlier data ingestion process.
The process of turning documents into passages of text via OCR, chunking, and a complex pipeline of data cleaning is fragile and error prone. For one of our projects doing RAG over clinical trials, we lost over 30% of the context during the process.
ColPali
One advanced technique to improve this process is a retrieval model architecture called ColPali. It uses the document understanding abilities of recent Vision Language Models to create embeddings directly from images of document pages. ColPali significantly outperforms modern document retrieval pipelines while being much faster.
One of the trade-offs of this new retrieval method is that while "late interaction" allows for more detailed matching between specific parts of the query and the potential context, it requires more computing power than simple vector comparisons and produces up to 100 times more embeddings per page.
These trade-offs are often worthwhile in highly visual documents and situations where accuracy is crucial.
Here, we highlight one of our many optimizations in ColiVara, where we leveraged halfvecs as our preferred method of scalar quantization.
ColiVara
ColiVara is a state-of-the-art retrieval API that stores, searches, and retrieves documents based on their visual embeddings.
In simple terms, we ask the AI models to "see" and reason, rather than "read" and reason. From the user's perspective, it functions like retrieval augmented generation (RAG), but end to end it uses vision models instead of chunking and text processing for documents.
It is a web-first implementation of the ColPali: Efficient Document Retrieval with Vision Language Models paper.
Like many AI/ML RAG systems, we create and store vectors when we save a user’s document. Since we use ColPali under the hood, each page generates embeddings that look like this:
# 1030 embeddings per page, each a list of 128 floats
embeddings = [[0.1, 0.2, ..., 0.128], [0.1, 0.2, ...]]
Let's calculate the storage requirements for this:
- Each float is 4 bytes.
- Each embedding has 128 dimensions, so: 128 * 4 bytes = 512 bytes per embedding.
- Total embeddings per page: 1030.
- Total storage: 1030 * 512 bytes = 527,360 bytes ≈ 515 KB per page.
If we have a 100-page document and a collection of 100 documents, then:
- 515 KB * 100 pages = 51.5 MB per document.
- 51.5 MB * 100 documents = 5.15 GB per collection.
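The same back-of-the-envelope arithmetic as a tiny script (it mirrors the rough rounding above, loosely mixing 1024- and 1000-based units, so treat the outputs as estimates):
bytes_per_embedding = 128 * 4                 # 128 dims * 4 bytes = 512 bytes
bytes_per_page = 1030 * bytes_per_embedding   # 527,360 bytes
kb_per_page = bytes_per_page / 1024           # ~515 KB per page
print(kb_per_page)                   # 515.0
print(kb_per_page * 100 / 1000)      # ~51.5 MB per 100-page document
print(kb_per_page * 100 * 100 / 1e6) # ~5.15 GB per 100-document collection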
This calculation is just for the raw numerical data. Actual memory usage in Python might be slightly higher due to Python's object overhead and list structure. ~5 GB per collection is manageable, but not exactly lightweight. So, we explored different quantization methods to better manage our resource usage.
Quantization
There are three common quantization techniques around vector databases:
- Scalar quantization, which reduces the overall size of the dimensions to a smaller data type (e.g. a 4-byte float to a 2-byte float or 1-byte integer), as sketched below.
- Binary quantization, which is a subset of scalar quantization that reduces each dimension to a single bit (e.g. > 0 to 1, <= 0 to 0).
- Product quantization, which uses a clustering technique to effectively remap the original vector to a vector with smaller dimensionality and index that (e.g. reduce a vector from 128 dimensions to 8).
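To make the first two concrete, here is a rough numpy sketch (illustrative only; the float16 cast and the sign threshold are the conventional choices, not a particular library's implementation):
import numpy as np

# A toy 128-dimensional float32 embedding
vec = np.random.default_rng(0).standard_normal(128).astype(np.float32)

# Scalar quantization: 4-byte floats -> 2-byte floats (what halfvec stores)
half = vec.astype(np.float16)
print(vec.nbytes, half.nbytes)     # 512 bytes -> 256 bytes

# Binary quantization: keep only the sign, one bit per dimension
bits = (vec > 0).astype(np.uint8)  # 1 if > 0, else 0
packed = np.packbits(bits)         # 128 dims -> 16 bytes
print(packed.nbytes)               # 16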
Scalar quantization is often the easiest way to reduce vector index storage. It involves converting dimensions to a smaller data type, like changing a 4-byte float to a 2-byte float.
In many cases, using a 2-byte float makes sense because, during distance operations, the most important differences between two dimensions are in the more significant bits. By slightly reducing the information to focus on those bits, we shouldn't notice much difference in recall.
In addition, the original ColPali implementation used bfloat16, so the extra bits we would gain by converting to a 4-byte float are imprecise anyway.
You very rarely get a free lunch with quantization, but in this particular instance it looks like we really do get one.
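As a rough sketch of what storing these embeddings as halfvecs could look like (assuming pgvector 0.7 or newer, which added the halfvec type; the connection string, table, and column names are illustrative and not ColiVara's actual schema):
import psycopg  # psycopg 3

with psycopg.connect("dbname=colivara_demo") as conn:  # hypothetical database
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS page_embeddings (
            id        bigserial PRIMARY KEY,
            page_id   bigint NOT NULL,
            embedding halfvec(128)  -- 2 bytes per dimension instead of 4
        )
        """
    )
    # One row per ColPali patch embedding (~1030 rows per page).
    literal = "[" + ",".join("0.1" for _ in range(128)) + "]"
    conn.execute(
        f"INSERT INTO page_embeddings (page_id, embedding) VALUES (1, '{literal}')"
    )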
pgVector performance
Jonathan Katz, the pgVector maintainer, has benchmarked and evaluated halfvecs in an excellent post, which we highly recommend. In summary, you get near-identical performance between halfvecs and full vectors, but you cut your storage in half and get slight speedups.
This was proof enough for us on the savings. But late interaction embeddings are a different beast than regular embeddings, so we needed to validate performance ourselves.
We ran the ArxivQ portion of the Vidore benchmark and scored 86.6, matching state-of-the-art results on the Vidore leaderboard at the time we ran it. This made us comfortable that there is no significant performance cost to using halfvecs.
Future work
Optimizing vector storage with halfvecs is a first step toward making the ColPali architecture viable and cost-effective. We plan to explore a few more optimizations in the future, specifically around latency and the use of re-rankers.
The ColPali architecture uses MaxSim to calculate relevancy. At larger corpus sizes, the MaxSim calculation is a significant overhead with less-than-ideal latency.
Most “traditional” RAG architectures use cosine similarity as the first-step similarity measure, so in a sense this is our baseline. MaxSim is more computationally intense than cosine similarity because it compares each query term with every document term.
While cosine similarity does just one vector comparison, MaxSim does many:
- If there are n terms in the query and m terms in the document, MaxSim needs n × m cosine-similarity-like calculations, making it much slower.
So, MaxSim could be 100 to 5,000 times more costly than cosine similarity, depending on the number of terms.
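For illustration, here is a minimal numpy version of MaxSim scoring (toy shapes and random data; a simplified sketch of late-interaction scoring, not ColiVara's production code):
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    # For each query term, take its best-matching document term, then sum
    # those maxima over the query. Shapes: (n, d) and (m, d).
    sim = query_emb @ doc_emb.T          # (n, m) matrix: n * m dot products
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
query_terms = l2_normalize(rng.standard_normal((16, 128)))     # n = 16 query terms
page_patches = l2_normalize(rng.standard_normal((1030, 128)))  # m = 1030 patches per page
print(maxsim_score(query_terms, page_patches))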
We believe the way to solve this is via re-rankers. In a practical sense, we would run a fast search to narrow down the number of documents, then run MaxSim on those. Instead of running MaxSim on 1,000 documents, we would run it on only 10.
Our next step is an automated evaluation pipeline, so we can accurately measure and optimize this process. We believe that a combination of native Postgres vector search followed by MaxSim is probably the best balance. But we want a good foundation of automated evaluations first.
Binary Quantization
Binary quantization is a more extreme technique that reduces the full value of a vector's dimension to just a single bit of information. Specifically, it converts any positive value to 1 and any zero or negative value to 0.
For further storage optimizations, we ran a few quick experiments with binary quantization and came to the conclusion that the performance penalty is difficult to determine, as bit diversity is not easily measured.
Bit diversity depends on the embedding model, its size, and the data being embedded. Our eval data and our customers' data could look very different, so it is difficult to measure the effects.
We could explore future pipelines where we compute Hamming distance scores first, then run MaxSim. However, this would increase storage requirements, as you need to save both the halfvecs and the binary bits, and it could be less predictable than standard Postgres vector search.
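A purely hypothetical sketch of that two-stage idea, with a mean-pooled binary code per page for the coarse Hamming-distance filter and exact MaxSim for re-ranking (the pooling choice and all names here are our illustration, not something ColiVara implements):
import numpy as np

def pooled_bits(emb: np.ndarray) -> np.ndarray:
    # Mean-pool a (terms, dims) multivector, then binarize and pack to bytes.
    return np.packbits(emb.mean(axis=0) > 0)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.unpackbits(a ^ b).sum())

def maxsim(query: np.ndarray, doc: np.ndarray) -> float:
    return float((query @ doc.T).max(axis=1).sum())

rng = np.random.default_rng(0)
query = rng.standard_normal((16, 128)).astype(np.float32)
pages = [rng.standard_normal((1030, 128)).astype(np.float32) for _ in range(50)]

# Stage 1: cheap Hamming-distance filter on pooled binary codes keeps 10 candidates.
q_code = pooled_bits(query)
candidates = sorted(range(len(pages)), key=lambda i: hamming(q_code, pooled_bits(pages[i])))[:10]

# Stage 2: exact MaxSim re-ranking only on the shortlisted pages.
best_page = max(candidates, key=lambda i: maxsim(query, pages[i]))
print(best_page)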
Conclusion
We recommend using halfvecs as the starting point for efficient vector storage. The performance loss is minimal, and the storage savings are substantial. In ColiVara, which is built on top of pgVector and Postgres, we experienced no performance loss and achieved a 50% reduction in storage usage.