Don't use raw embeddings


Introduction:

With the rise of Transformers, embeddings are now widely used:

  • As representations of images or texts that can be used by other models or in a zero-shot manner

  • As a basic building block for Vector Search in LLM RAG and image search

However, embeddings are still quite large. OpenAI’s text-embedding-3-large can reach up to d=3072, which means about 12kB (stored as float32) per entity. From experience, this is enough to overwhelm SQL engines when performing large JOINs, as this data needs to be sent across the network for a distributed JOIN.

Therefore, it makes sense to try compressing these embeddings into a smaller, yet high-quality, representation.

Vector Quantization:

Vector Search has been around for a while but became truly popular in 2022 (see Trend below). However, with much foresight, Facebook released its vector search codebase FAISS back in 2017.

A common challenge with vector search is storing all vectors in memory, which can be quite costly when the vectors are large. H. Jégou et al. (paper) introduced Product Quantization in 2010. The main idea is to divide a long vector into smaller chunks (about 4 dimensions each) and apply k-means clustering to each chunk. Each chunk is then represented by the index of its closest centroid. With enough centroids, the loss is expected to be minimal.
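To make this concrete, here is a minimal sketch of the training and encoding steps, assuming a training matrix X of shape (n, d) with d divisible by the chunk size, and using scikit-learn's KMeans for the per-chunk clustering (function names like train_pq are illustrative, not part of any library):

```python
import numpy as np
from sklearn.cluster import KMeans  # assumption: scikit-learn is available

def train_pq(X, chunk_dim=4, n_centroids=256):
    """Fit one k-means codebook per chunk of `chunk_dim` columns."""
    n, d = X.shape
    n_chunks = d // chunk_dim
    codebooks = []
    for i in range(n_chunks):
        chunk = X[:, i * chunk_dim:(i + 1) * chunk_dim]
        km = KMeans(n_clusters=n_centroids, n_init=10).fit(chunk)
        codebooks.append(km.cluster_centers_)   # (n_centroids, chunk_dim)
    return np.stack(codebooks)                  # (n_chunks, n_centroids, chunk_dim)

def encode_pq(X, codebooks):
    """Replace each chunk by the uint8 index of its nearest centroid."""
    n_chunks, n_centroids, chunk_dim = codebooks.shape
    codes = np.empty((X.shape[0], n_chunks), dtype=np.uint8)
    for i in range(n_chunks):
        chunk = X[:, i * chunk_dim:(i + 1) * chunk_dim]
        # squared L2 distance from every row to every centroid of this chunk
        dists = ((chunk[:, None, :] - codebooks[i][None, :, :]) ** 2).sum(-1)
        codes[:, i] = dists.argmin(axis=1)
    return codes
```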

The illustration below shows how this works. It displays 2 chunks of 4 columns and their closest centroids (125 and 12) for encoding.

To decode at runtime, the vector is reconstructed by looking up the coordinates of the centroids and combining them back into the original vector space.
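Continuing the sketch above, decoding is just a table lookup per chunk followed by a concatenation (decode_pq is again an illustrative name):

```python
def decode_pq(codes, codebooks):
    """Reconstruct approximate vectors by looking up each chunk's centroid."""
    n_chunks, n_centroids, chunk_dim = codebooks.shape
    # codebooks[i][codes[:, i]] is a gather of shape (n, chunk_dim)
    parts = [codebooks[i][codes[:, i]] for i in range(n_chunks)]
    return np.concatenate(parts, axis=1)        # (n, n_chunks * chunk_dim)
```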

You can also calculate the space saved:

  • Current: 3072 dims at float32 = 12kB

  • Product Quantized (dim=4, with 256 centroids stored as uint8): (3072 / 4) * 1 byte = 768B (16x smaller)
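Scaling the same arithmetic to a larger corpus makes the savings tangible; the one-million-vector figure below is just an illustrative assumption:

```python
n_vectors = 1_000_000          # illustrative corpus size
d, chunk_dim = 3072, 4

raw_bytes = n_vectors * d * 4                 # float32: 4 bytes per dimension
pq_bytes  = n_vectors * (d // chunk_dim) * 1  # one uint8 code per chunk
print(f"{raw_bytes / 1e9:.1f} GB raw vs {pq_bytes / 1e9:.2f} GB quantized")
# ~12.3 GB raw vs ~0.77 GB quantized
```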

Product Quantization (PQ) is a simple yet effective technique to save space. Although there are other methods, PQ is still widely used in the industry.

To illustrate, we implemented it in a few lines of NumPy code in a notebook, available in this gist.

CPU/GPU Friendliness:

A keen observer will notice the following:

  • Each group of columns can be processed independently, making the computation highly parallelizable.

  • The operations are basic: matrix multiplication and lookups (i.e. gather), which are highly optimized on both CPUs and GPUs.

This makes Product Quantization efficient on almost all modern hardware.
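As a sketch of what this looks like in practice, the common asymmetric-distance trick builds one small distance table per query (a dense computation over the codebooks) and then answers the query with nothing but gathers over the uint8 codes; pq_search and top_k below are illustrative names, not FAISS's API:

```python
def pq_search(query, codes, codebooks, top_k=10):
    """Rank PQ-encoded vectors against a raw query using per-chunk lookup tables."""
    n_chunks, n_centroids, chunk_dim = codebooks.shape
    q_chunks = query.reshape(n_chunks, chunk_dim)
    # distance table: query chunk vs every centroid -> (n_chunks, n_centroids)
    tables = ((q_chunks[:, None, :] - codebooks) ** 2).sum(-1)
    # gather each database vector's per-chunk distances and sum them
    dists = tables[np.arange(n_chunks), codes].sum(axis=1)   # codes: (n, n_chunks)
    return np.argsort(dists)[:top_k]
```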

Going Further:

T. Ge et al. (CVPR 2013) (link) improved PQ by adding a rotation step, which significantly reduces the reconstruction loss. From experience, a cosine similarity above 99.X% between the original and reconstructed vectors enables most downstream use cases.
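One simple way to check whether your own data lands in that 99.X% regime is to compare each embedding with its PQ reconstruction, for example by reusing the sketch functions from above:

```python
def mean_cosine_similarity(X, X_hat):
    """Average cosine similarity between original vectors and their reconstructions."""
    num = (X * X_hat).sum(axis=1)
    den = np.linalg.norm(X, axis=1) * np.linalg.norm(X_hat, axis=1)
    return float((num / den).mean())

codebooks = train_pq(X)
X_hat = decode_pq(encode_pq(X, codebooks), codebooks)
print(mean_cosine_similarity(X, X_hat))
```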

Takeaway:

While embeddings are useful for many applications, they are often large and hard to manage. By quantizing them and using the codes instead of the raw embeddings, you can improve their usability while keeping them close to the original vector. Check it out!
