Multi-Stage Vector Querying Using Matryoshka Representation Learning (MRL) in Qdrant

Vansh Khaneja

Data retrieval is a critical component in building an efficient Retrieval-Augmented Generation (RAG) application. The effectiveness of data retrieval directly impacts the performance, accuracy, and reliability of the application.

There are various methods of data retrieval from vector databases. Some of the most efficient ones are:

  1. Self-Query Retrieval

  2. Multi-Stage Query

  3. Auto-Merging Retrieval

  4. Hybrid Retrieval

In this article, we will explore Multi-Stage Query retrieval using Matryoshka Representation Learning (MRL) to make fetching data from the database more efficient.

So, let’s first understand: What is Matryoshka Representation Learning?

Matryoshka Representation Learning

The name Matryoshka is inspired by Russian dolls, also known as stacking dolls: a set of dolls of decreasing size, nested one inside another.

The core idea of MRL is to learn representations at multiple resolutions, much like Matryoshka dolls are nested within each other in decreasing sizes. The first dimensions of a full embedding already form a smaller embedding that is meaningful on its own, so each prefix captures the same information at a coarser level of detail, giving the model an effective hierarchical structure.

There are various advantages of using MRL:

  1. Enhanced Search Efficiency: MRL embeddings allow multi-stage search where initial filtering can be done with smaller embeddings, thus speeding up the search process.

  2. Improved Accuracy: The ability to refine searches with higher-resolution embeddings after an initial broad match ensures that the most relevant results are surfaced.

  3. Flexibility: Depending on the use case, the resolution of the embeddings can be adjusted, providing flexibility in terms of precision and performance.

To learn more about how Matryoshka Representation Learning works, please refer to the MRL paper listed in the references.
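To make the nesting property concrete: an MRL-style embedding can be truncated to its first k dimensions and re-normalized, and that prefix still works as a coarser embedding of the same text. Below is a minimal NumPy sketch with toy numbers; the helper name and the dimensions are purely illustrative.

import numpy as np

def truncate_embedding(full_vector, dims):
    # Keep the first `dims` dimensions of an MRL-style embedding and re-normalize
    truncated = np.asarray(full_vector, dtype=np.float32)[:dims]
    return truncated / np.linalg.norm(truncated)

# A toy 8-dimensional "full" embedding; with MRL, its 4-dimensional prefix
# is itself a usable (coarser) embedding of the same text.
full = np.array([0.40, 0.10, 0.35, 0.20, 0.05, 0.15, 0.25, 0.30])
small = truncate_embedding(full, 4)
print(small.shape)  # (4,)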

Now let’s understand: What is Multi-Stage Query Retrieval?

Multi-Stage Query Retrieval

Among the various methods for retrieving data from a database, Multi-Stage Query retrieval is one of the fastest. Although the initial stage trades away some precision, the approach is preferred in many applications because of its speed and because it saves a significant amount of compute and memory.

Let’s try to understand how Multi-Stage Query retrieval works.

Multi-Stage Query retrieval focuses on retrieving data in different stages within a hierarchical structure, starting with smaller embeddings and increasing in size.

Steps Involved in Multi-Stage Querying:

  1. Transformation into Vector Embeddings: Initially, the data is transformed into vector embeddings of different sizes.

  2. Initial Search: The query is first matched against the smaller embeddings, where the search runs faster, to select a set of candidate chunks.

  3. Secondary Search: The selected candidate chunks are then re-scored against the larger embeddings to refine the ranking.

This process ensures that data fetching does not take a lot of time since it is relatively faster to search over embeddings with fewer dimensions. To ensure accuracy, a secondary search is conducted over the selected chunks of data using larger embeddings.
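Before bringing in a vector database, the same two-stage idea can be illustrated with plain NumPy over toy data. This is only a conceptual sketch with random vectors, not the Qdrant implementation used later in the article.

import numpy as np

def cosine_scores(query, matrix):
    # Cosine similarity between one query vector and each row of `matrix`
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q

# Toy corpus: 1,000 chunks, each with a 64-dim "small" and a 256-dim "large" embedding
rng = np.random.default_rng(0)
small_embs = rng.normal(size=(1000, 64)).astype(np.float32)
large_embs = rng.normal(size=(1000, 256)).astype(np.float32)
query_small = rng.normal(size=64).astype(np.float32)
query_large = rng.normal(size=256).astype(np.float32)

# Stage 1: cheap filtering over the small embeddings
candidates = np.argsort(-cosine_scores(query_small, small_embs))[:50]

# Stage 2: precise re-scoring of the 50 candidates with the large embeddings
rescored = cosine_scores(query_large, large_embs[candidates])
top_ids = candidates[np.argsort(-rescored)[:5]]
print(top_ids)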

Let’s dive deeper and understand more with a practical implementation:

Check out the full code and implementation on GitHub.

Let’s Code

The first step is to install the necessary Python libraries required for the project.

pip install sentence_transformers
pip install qdrant-client
pip install langchain
pip install -U langchain-community
pip install pypdf
pip install openai

Once the installation is complete, import the libraries so their modules and sub-modules can be used in the code.

from sentence_transformers import SentenceTransformer  # installed above, but unused here since we embed with OpenAI

from qdrant_client import QdrantClient, models

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

from openai import OpenAI

Before beginning the main code, let’s initialize the OpenAI client variable so we can use its modules. Please remember to set OPENAI_API_KEY as an environment variable.

To get the OpenAI API key, please go here.
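One common way to provide the key is through the environment, either exported in your shell or, for a quick notebook experiment, set directly in Python. The value below is a placeholder.

import os

# Placeholder only; replace with your own key or export it in your shell instead
os.environ["OPENAI_API_KEY"] = "sk-..."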

from openai import OpenAI
openai_client = OpenAI()

Let’s now create two functions: one for creating smaller embeddings and the other for creating larger embeddings.

Small Embeddings Function

Starting with the function for small embeddings. Here we are using text-embedding-3-small as the embedding model and explicitly setting the vector dimensions to 512.

def small_embedding(text, model="text-embedding-3-small"):
    # Newlines can hurt embedding quality, so flatten the text first
    text = text.replace("\n", " ")
    # Request a 512-dimensional vector for the fast first-stage search
    return openai_client.embeddings.create(input=[text], model=model, dimensions=512).data[0].embedding

Large Embeddings Function

Now we need to create the function for larger embeddings. Here we are using text-embedding-3-large as the embedding model and explicitly setting the vector dimensions to 2048, since larger embeddings capture more detail and give better retrieval quality.

def large_embedding(text, model="text-embedding-3-large"):
    # Flatten newlines, as above
    text = text.replace("\n", " ")
    # Request a 2048-dimensional vector for the precise second-stage search
    return openai_client.embeddings.create(input=[text], model=model, dimensions=2048).data[0].embedding
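As a quick sanity check (note that this makes two embedding API calls), the returned vectors should have the declared sizes:

print(len(small_embedding("algorithmic trading")))  # 512
print(len(large_embedding("algorithmic trading")))  # 2048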

Loading the Dataset

It’s time to load the dataset. Here we are using a PDF that explains algorithmic trading in detail. The dataset can be downloaded from here.

loaders = [
    PyPDFLoader("/content/TEGI0570.pdf"),
]

For further processing, the PDF needs to be broken into smaller chunks that can later be retrieved based on their relevance. Here we split the PDF into chunks of 550 characters with an overlap of 50 characters.

docs = []
chunk_size = 550
chunk_overlap = 50

# Split the loaded pages into overlapping character chunks
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
)
for loader in loaders:
    docs.extend(loader.load())
splits = r_splitter.split_documents(docs)

Let’s find out how many chunks we have after splitting the document.

len(splits)

The output is 558, which means we have divided the PDF text into 558 small chunks, each at most 550 characters long.
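It can also help to peek at one chunk to confirm the split looks sensible:

# Inspect the first chunk and its metadata (source file and page number added by PyPDFLoader)
print(splits[0].page_content[:200])
print(splits[0].metadata)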

Now, since our data processing is done, we can proceed to insert the data into the Qdrant vector database.

Initializing the Qdrant Client

Let’s start by initializing the Qdrant client. Here we are using in-memory storage for the database.

client = QdrantClient(":memory:")
COLLECTION_NAME = "multi_stage_db"
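The ":memory:" mode is convenient for experimentation, but the data disappears when the process ends. If you need persistence, the client can instead point at a local path or at a running Qdrant server; the commented alternatives below are sketches, assuming the default local port.

# Alternative: persist the collection on disk instead of in RAM
# client = QdrantClient(path="./qdrant_data")

# Alternative: connect to a running Qdrant server (default port 6333)
# client = QdrantClient(url="http://localhost:6333")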

Now we need to create the collection in the database along with its schema. In this schema, we store two named vectors per point: one with small embeddings of 512 dimensions and another with large embeddings of 2048 dimensions. Moreover, we use COSINE similarity to match the query embeddings against the chunk embeddings stored in the database.

client.recreate_collection(
    collection_name=COLLECTION_NAME,
    vectors_config={
        # 512-dim vectors for the fast first-stage (prefetch) search
        "small-embedding": models.VectorParams(
            size=512,
            distance=models.Distance.COSINE,
            datatype=models.Datatype.FLOAT16,
        ),
        # 2048-dim vectors for the precise second-stage re-scoring
        "large-embedding": models.VectorParams(
            size=2048,
            distance=models.Distance.COSINE,
            datatype=models.Datatype.FLOAT16,
        ),
    },
)
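As an optional check, the freshly created collection can be inspected before uploading anything:

# Verify the collection exists and see how many points it currently holds
info = client.get_collection(COLLECTION_NAME)
print(info.points_count)  # 0 before any data is uploaded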

Now that the schema and the collection are created, we can upload the data into the database. For that, the two functions declared above, small_embedding and large_embedding, will be used to convert each chunk into vector embeddings of the respective sizes.

for i in range(len(splits)):
    client.upsert(
        collection_name=COLLECTION_NAME,
        points=[
            models.PointStruct(
                id=i,
                vector={
                    # Store both resolutions of the same chunk as named vectors
                    "small-embedding": small_embedding(splits[i].page_content),
                    "large-embedding": large_embedding(splits[i].page_content),
                },
            )
        ],
    )
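Calling upsert once per chunk works, but it sends one request per point on top of the two embedding API calls. If the loop feels slow, one option is to build the points first and upload them in batches; a rough sketch reusing the same small_embedding and large_embedding helpers:

# Build all points first, then upload them in batches of 100
points = [
    models.PointStruct(
        id=i,
        vector={
            "small-embedding": small_embedding(split.page_content),
            "large-embedding": large_embedding(split.page_content),
        },
    )
    for i, split in enumerate(splits)
]

BATCH = 100
for start in range(0, len(points), BATCH):
    client.upsert(collection_name=COLLECTION_NAME, points=points[start:start + BATCH])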

Now all the data has been added to the database, so it’s time to test the retrieval.

We will first declare a query whose answer should be present in the database. The next step is to convert the query into vector embeddings of size 512 and 2048 respectively.

query_text = "what are common measurements and mismeasurements of risk "


small_vector = small_embedding(query_text)
large_vector = large_embedding(query_text)

Now, to match the query embeddings with the embeddings present in the database, we will perform multi-stage vector querying with the help of Qdrant.

result = client.query_points(
    collection_name=COLLECTION_NAME,
    # Stage 1: quickly fetch 250 candidates using the small embeddings
    prefetch=models.Prefetch(
        query=small_vector,
        using="small-embedding",
        limit=250,
    ),
    # Stage 2: re-score the candidates with the large embeddings and keep the top 5
    query=large_vector,
    using="large-embedding",
    limit=5,
)

Explanation of Multi-Stage Querying

In this code snippet, we optimize the search process by using embeddings of different sizes. Initially, we use a smaller query embedding of size 512 to match against other smaller embeddings in the database. This approach is computationally efficient and speeds up the initial search phase due to the reduced dimensionality. Once the initial matching is complete, we retrieve the top 250 most similar embeddings from the database. Next, we perform a more detailed and precise matching using larger embeddings of size 2048. However, this second phase of matching is only conducted on the 250 embeddings identified in the first step. By focusing on this smaller subset, we maintain efficiency while benefiting from the higher accuracy of the larger embeddings.

Displaying the Results

Let’s now check the output from the above code.

The output displays the IDs of the top 5 vector embeddings most similar to the query embedding. To make it more understandable, we map those IDs back to the original chunks and print their text, using a small piece of code that extracts the IDs from the result variable.

ids = [item.id for item in result.points]
for i in ids:
    print(splits[i].page_content)
    print('\n')
    print('-' * 75)

The output of the above code prints the five most relevant chunks, separated by divider lines.

Now that our multi-stage query retrieval is ready, it can be combined with any framework such as LangChain or LlamaIndex to build an efficient and fast RAG application.
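For example, even without a framework, the retrieved chunks can be passed straight to a chat model as context. The snippet below is a minimal sketch using the OpenAI client created earlier; the model name and prompt wording are illustrative choices, not part of the original pipeline.

# Use the retrieved chunks as context for a simple RAG-style answer
context = "\n\n".join(splits[item.id].page_content for item in result.points)

response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer the question using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query_text}"},
    ],
)
print(response.choices[0].message.content)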

Conclusion

In conclusion, the multi-stage query process with MRL embeddings represents a powerful method for balancing computational efficiency with search accuracy in large-scale data environments. By using smaller embeddings for initial filtering and subsequently refining with larger, more detailed embeddings, this approach not only accelerates search times but also enhances the precision of similarity matching.

References

OpenAI, “Embeddings” guide: https://platform.openai.com/docs/guides/embeddings/what-are-embeddings

Qdrant documentation, “Hybrid Queries”: https://qdrant.tech/documentation/concepts/hybrid-queries/

Kusupati et al., “Matryoshka Representation Learning” (2022): https://arxiv.org/abs/2205.13147
