Explaining Vector Index and Building an LLM Application with Pathway on Windows

Vinayak Gavariya

In the world of modern data management and retrieval, traditional indexing techniques may not always be sufficient, especially when dealing with large volumes of unstructured data such as text, images, and audio. This is where vector indexing comes into play, offering a powerful solution for efficient data retrieval based on semantic similarity.

What is Indexing?

A big textbook often has a list at the end called an index. This index acts like a cheat sheet. It shows important words or topics from the book, along with the page numbers where you can find them.

Instead of flipping through every single page to find what you need, you can just look at the index and go straight to the right page.

Databases work in a similar way. They organize information and use special lists, also called indexes. These indexes make it easy to find things quickly without having to look through everything.

Just like how the index in a textbook helps you locate specific words or topics by listing the page numbers, the indexes in databases help locate specific pieces of information efficiently.
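
To make the analogy concrete, here's a toy sketch in Python: the index is just a lookup table from terms to the pages where they appear (all values made up). Real database indexes use more sophisticated structures such as B-trees, but the principle is the same: pay a little storage up front to avoid scanning everything later.

    # A toy "back-of-book" index: each term maps to the pages where it appears.
    book_index = {
        "vector": [12, 47, 103],
        "embedding": [47, 59],
        "similarity": [103, 110],
    }

    # Instead of scanning every page, jump straight to the relevant ones.
    print(book_index["embedding"])  # [47, 59]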

What is Vector Indexing?

Imagine you have a huge library with millions of books. You loved one particular book and want to find others similar to it. Instead of reading through the summary or contents of every single book, which would take forever, there is a faster way: vector indexing.

With vector indexing, each book is represented by a vector, which is a set of numbers. These numbers capture the main ideas, themes, and overall essence of the book in a mathematical way. Books with similar vectors have similar themes, genres, or writing styles.

To find books like the one you enjoyed, you can compare its vector to the vectors of all the other books in the library. The books whose vectors are closest to the vector of the book you liked are likely to be the most similar books. This makes it much faster and easier to discover new books you might enjoy, without having to read through everything first.

Here's a technical definition:

A vector index is a specialized data structure used to efficiently store and manage high-dimensional vector data, enabling fast similarity searches and nearest-neighbor queries.
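
To make this concrete, here is a minimal brute-force similarity search sketch using NumPy. The tiny four-dimensional vectors are made-up stand-ins for real embeddings, which typically have hundreds or thousands of dimensions; a real vector index avoids comparing the query against every stored vector, but the "closest vectors win" idea is exactly this.

    import numpy as np

    # Toy 4-dimensional "embeddings" for five books (values made up).
    book_vectors = np.array([
        [0.9, 0.1, 0.0, 0.3],
        [0.8, 0.2, 0.1, 0.4],
        [0.1, 0.9, 0.7, 0.0],
        [0.0, 0.8, 0.9, 0.1],
        [0.5, 0.5, 0.5, 0.5],
    ])
    query = np.array([0.85, 0.15, 0.05, 0.35])  # the book you loved

    # Cosine similarity between the query and every book vector.
    sims = book_vectors @ query / (
        np.linalg.norm(book_vectors, axis=1) * np.linalg.norm(query)
    )
    print(np.argsort(-sims)[:3])  # indices of the 3 most similar books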

Why Use Vector Indexing?

Imagine you're a bookworm who can't get enough of reading. You've explored a vast array of books, from thrilling mysteries to heartwarming romances. One day, you stumble upon a gem of a book that resonates with you deeply. Now, you're eager to discover more books that capture that same essence—ones that share its vibe, themes, and overall feel.

Traditionally, finding similar books would mean diving into summaries, reviews, and endless browsing, which can be daunting and time-consuming. But with vector indexing, it's like having a literary expert by your side, ready to guide you to your next favorite read.

Vector indexing streamlines the process of retrieving relevant information from vast datasets. By transforming data into high-dimensional vectors, it facilitates quick similarity searches and nearest neighbor queries. This means users can swiftly find not only exact matches but also similar items, enhancing their overall experience by providing them with a comprehensive view of related content.

Why is Vector Indexing Fast?

Vector indexing is renowned for its speed due to several key factors. Firstly, by representing data as vectors, it reduces the computational complexity of similarity searches and nearest neighbor queries. These vectors capture essential features of the data, allowing for efficient comparison and retrieval. Additionally, vector indexes are optimized for high-dimensional data, enabling quick processing even with large datasets.

Furthermore, advanced algorithms and data structures are employed to organize the vectors in a manner that facilitates rapid search operations. This optimization ensures that relevant information can be accessed swiftly, contributing to an enhanced user experience characterized by seamless and efficient data retrieval.
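
Part of that speed comes simply from doing vectorized math over contiguous arrays, and the rest comes from approximate index structures (trees, graphs, hashing) that avoid scanning every vector at all. The sketch below illustrates only the first part, comparing a Python-level loop against a single batched NumPy operation over the same data:

    import time
    import numpy as np

    rng = np.random.default_rng(0)
    vectors = rng.random((100_000, 128))  # 100k 128-dimensional vectors
    query = rng.random(128)

    # Naive: a Python-level loop over every vector.
    t0 = time.perf_counter()
    dists_loop = [np.linalg.norm(v - query) for v in vectors]
    t1 = time.perf_counter()

    # Vectorized: one batched operation over the whole matrix.
    dists_vec = np.linalg.norm(vectors - query, axis=1)
    t2 = time.perf_counter()

    assert np.allclose(dists_loop, dists_vec)  # same results, very different cost
    print(f"loop: {t1 - t0:.3f}s, vectorized: {t2 - t1:.3f}s")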

Why Use Pathway's Vector Index?

  1. Always Up-to-Date: The vector index in Pathway ensures that the data index is continuously updated with each data change. This feature provides real-time and accurate results, making it ideal for applications where data freshness is critical. In the context of LLMs, having access to the most up-to-date information is crucial for generating relevant and accurate responses.

  2. RESTful API Access: The vector index in Pathway can be accessed using a RESTful API, providing a standardized and easy-to-use interface for querying and retrieving data from the index. This API-based access makes it easier to integrate Pathway's vector indexing capabilities into your LLM pipeline or application, simplifying the development process.

  3. Integration with LLM Toolkits: The Pathway vector index integrates seamlessly with popular large language model (LLM) toolkits such as LangChain and LlamaIndex. This integration enhances the capabilities of the vector index and lets developers leverage advanced language processing features. By combining Pathway's vector indexing with these toolkits, you can build more powerful and sophisticated LLM applications that efficiently retrieve and process vast amounts of information.

  4. Scalability and Performance: Pathway's vector indexing solution is designed to handle large-scale datasets efficiently, making it suitable for building and deploying LLMs that require processing and retrieving massive amounts of data. With its optimized algorithms and data structures, Pathway ensures high performance and fast retrieval times, even when dealing with extremely large datasets.

  5. Flexible Data Ingestion: Pathway supports ingesting and indexing various data formats, including text, images, and other types of unstructured data. This flexibility allows you to build LLMs that can process and generate outputs based on diverse data sources, expanding the potential applications and use cases.

Let's Build an LLM Application Using Pathway's Vector Index

When we use Pathway's vector index, we don't require a separate vector database. So, let's get our hands on the code.

First, we import the necessary modules and classes from Pathway and its extensions for large language models (LLMs) and embedders.

      import os
      import pathway as pw
      from pathway.stdlib.ml.index import KNNIndex
      from pathway.xpacks.llm.embedders import OpenAIEmbedder
      from pathway.xpacks.llm.llms import OpenAIChat, prompt_chat_single_qa
    

Next, we set up the environment variables required for the program, including the OpenAI API key, the directory where the data files are stored, and the host and port for the REST API connector.

     os.environ['OPENAI_API_KEY'] = ''  # paste your OpenAI API key here
     os.environ['PATHWAY_DATA_DIR'] = '/content/data/pathway-docs/'
     os.environ['PATHWAY_REST_CONNECTOR_HOST'] = '0.0.0.0'
     os.environ['PATHWAY_REST_CONNECTOR_PORT'] = '8080'
    

We define two Pathway schemas: DocumentInputSchema for the input documents and QueryInputSchema for the user's query.

     class DocumentInputSchema(pw.Schema):
         doc: str
    
     class QueryInputSchema(pw.Schema):
         query: str
         user: str
    

The run function is the main entry point of the program. It sets up various configurations, including the data directory, API key, host and port for the REST API, the embedder model and embedding dimension, and the language model with its maximum tokens and temperature. We also initialize an OpenAIEmbedder instance, which will be used to generate embeddings for the input documents and queries.

    def run(
        *,
        data_dir: str = os.environ.get("PATHWAY_DATA_DIR", "../../data/pathway-docs/"),
        api_key: str = os.environ.get("OPENAI_API_KEY", ""),
        host: str = os.environ.get("PATHWAY_REST_CONNECTOR_HOST", "0.0.0.0"),
        port: int = int(os.environ.get("PATHWAY_REST_CONNECTOR_PORT", "8080")),
        embedder_locator: str = "text-embedding-ada-002",
        embedding_dimension: int = 1536,
        model_locator: str = "gpt-3.5-turbo",
        max_tokens: int = 60,
        temperature: float = 0.0,
        **kwargs,
    ):
        embedder = OpenAIEmbedder(
            api_key=api_key,
            model=embedder_locator,
            retry_strategy=pw.udfs.FixedDelayRetryStrategy(),
            cache_strategy=pw.udfs.DefaultCache(),
        )

Next, we read the input documents from the specified data directory using the DocumentInputSchema. We then enrich the documents by adding a vector column, which contains the embeddings generated by the OpenAIEmbedder for each document.

    documents = pw.io.jsonlines.read(
        data_dir, schema=DocumentInputSchema, mode="streaming", autocommit_duration_ms=50,
    )
    enriched_documents = documents + documents.select(vector=embedder(pw.this.doc))
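
Since the files are read with pw.io.jsonlines.read against DocumentInputSchema, each line in the data directory is a JSON object with a doc field. An illustrative line (the content here is made up) would look like:

    {"doc": "Pathway is a Python framework for building streaming and batch data pipelines."}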

We create a KNN (K-Nearest Neighbors) index using the enriched documents and their embeddings. This index will be used to efficiently retrieve the most relevant documents for a given query.

    index = KNNIndex(
        enriched_documents.vector, enriched_documents, n_dimensions=embedding_dimension
    )

We set up a REST connector to handle incoming queries from users. The response_writer will be used to send the responses back to the client. We also enrich the query by adding an embedding vector using the OpenAIEmbedder.

    query, response_writer = pw.io.http.rest_connector(
        host=host,
        port=port,
        schema=QueryInputSchema,
        autocommit_duration_ms=50,
        delete_completed_queries=True,
    )
    query += query.select(vector=embedder(pw.this.query))

For each query, we retrieve the k (in this case, 3) nearest documents from the index based on the similarity between the query embedding and the document embeddings. We then create a query_context that includes the query and the list of relevant documents.

    query_context = query + index.get_nearest_items(
        query.vector, k=3, collapse_rows=True
    ).select(documents_list=pw.this.doc)

We define a user-defined function build_prompt that takes the list of relevant documents and the query, and constructs a prompt for the language model. The prompt includes the relevant documents and the query, instructing the model to answer the query based on the provided context.

    @pw.udf
    def build_prompt(documents, query):
        docs_str = "\n".join(documents)
        prompt = f"Given the following documents:\n{docs_str}\nanswer this query: {query}"
        return prompt

    prompt = query_context.select(
        prompt=build_prompt(pw.this.documents_list, pw.this.query)
    )

We initialize an OpenAIChat instance, which will be used to generate responses based on the constructed prompt. We then select the query ID and the result obtained from the language model, using the prompt_chat_single_qa function to format the prompt for the chat model.

Finally, we write the responses to the REST connector using the response_writer.

    model = OpenAIChat(
        api_key=api_key,
        model=model_locator,
        temperature=temperature,
        max_tokens=max_tokens,
        retry_strategy=pw.udfs.FixedDelayRetryStrategy(),
        cache_strategy=pw.udfs.DefaultCache(),
    )
    responses = prompt.select(
        query_id=pw.this.id, result=model(prompt_chat_single_qa(pw.this.prompt))
    )
    response_writer(responses)

Finally, pw.run() launches the Pathway computation, and we execute the run function if the script is run directly (not imported as a module).

Altogether, the code demonstrates the steps involved, from ingesting and embedding documents to setting up a KNN index, handling user queries, retrieving relevant context, constructing prompts, and generating responses using a language model.

    pw.run()

    if __name__ == "__main__":
        run()

Steps to Run the Code on Windows

  1. Install Docker: First, you need to install Docker on your Windows machine. You can download the Docker Desktop for Windows from the official website: https://www.docker.com/products/docker-desktop

  2. Create a Dockerfile: In your Python app project directory, create a new file named Dockerfile (without any file extension) and include the following code. Replace "your-script.py" with the actual name of your Python script file, and update the directory name if necessary:

     FROM pathwaycom/pathway:latest
    
     WORKDIR /app
    
     COPY requirements.txt ./
     RUN pip install --no-cache-dir -r requirements.txt
    
     COPY . .
    
     CMD ["python", "./your-script.py"]
    
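    The Dockerfile installs your script's dependencies from a requirements.txt file alongside it, so make sure one exists in the project directory. Since the pathwaycom/pathway base image already ships with Pathway, this file only needs whatever extra packages your script imports; a minimal illustrative example might be just:

     openai
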
  3. Build the Docker Image: Open a terminal or command prompt, navigate to your Python app project directory, and run the following command to build the Docker image:

     docker build -t my-pathway-app .
    

    This command will build a Docker image named my-pathway-app based on the instructions in the Dockerfile.

  4. Run the Docker Container: After the image is built successfully, run the following command to start a Docker container from the image:

     docker run -it --rm --name my-pathway-app my-pathway-app
    

    This command will start a new container named my-pathway-app from the my-pathway-app image and execute the Python script inside the container. The -it flag allows you to interact with the container, and --rm ensures that the container is removed after it stops running.

  5. Run the Docker Container with Environment Variables and Port Mapping: If your Python script requires environment variables or needs to expose a port, you can use the following command instead:

     docker run -p 8080:8080 --env-file .env my-pathway-app
    

    Replace .env with the name of your environment file (if you have one), and update the port mapping (-p 8080:8080) if your script uses a different port.
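
    For this application, the environment file holds the variables the script reads at startup. An illustrative .env (with a placeholder API key; the data path assumes your data folder is copied into the image under /app) might look like:

     OPENAI_API_KEY=sk-your-key-here
     PATHWAY_DATA_DIR=/app/data/pathway-docs/
     PATHWAY_REST_CONNECTOR_HOST=0.0.0.0
     PATHWAY_REST_CONNECTOR_PORT=8080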

  6. Test the Application with Postman: After the Docker container is running, you can test the application using Postman or any other HTTP client. Follow these steps:

    • Open Postman Desktop.

    • In the left panel, click on the "Import" button in the "Collections" tab.

    • In the import dialog, paste the following command:

        curl -X POST -H "Content-Type: application/json" -d '{"user": "user", "query": "How to build vector index in Pathway?"}' http://0.0.0.0:8080/ | jq
      
    • Click "Import".

    • This will create a new collection with a POST request that sends a JSON payload to your application running at http://0.0.0.0:8080/.

    • Click on the request in the collection and then click "Send" to execute the request.

    • You should see the response from your application in the "Response" section of Postman.

Congratulations! By following these steps, you have successfully built and run a Large Language Model (LLM) application using Pathway's vector index on your Windows machine.

If you want to check the entire code in one place, you can check it out here.

Resources:

Here are a few resources you can use to learn more about Pathway:

Thanks for reading the blog; I hope you enjoyed it. If you face any problems, you can reach out to me here.
