Indexing embeddings for efficient retrieval using Azure Cognitive Search

Indexing embeddings for efficient retrieval with Azure Cognitive Search involves five steps:

  1. Preparing a Small Text Dataset

  2. Generating Embeddings for the Text

  3. Creating an Azure Cognitive Search Index with Vector Search Capabilities

  4. Uploading Documents and Embeddings to the Index

  5. Querying the Index Using Vector Similarity Search

The sections below walk through each step with Python code examples and a small text dataset.


Prerequisites

  • Azure Subscription: An active Azure account.

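  • Azure Cognitive Search Service (since renamed Azure AI Search): A provisioned search service. Vector search availability and vector storage quotas vary by tier and region, so check the current service limits for your SKU.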

  • Python Environment: Python 3.8 or higher installed.

  • Python Packages: Install the Azure Search client library and the OpenAI client library.

      pip install azure-search-documents==11.4.0
      pip install openai
    

1. Preparing a Small Text Dataset

Let's start with a small dataset of text documents.

documents = [
    {
        "id": "1",
        "content": "The quick brown fox jumps over the lazy dog.",
        "category": "animal behavior"
    },
    {
        "id": "2",
        "content": "Never gonna give you up, never gonna let you down.",
        "category": "song lyrics"
    },
    {
        "id": "3",
        "content": "To be or not to be, that is the question.",
        "category": "literature"
    }
]
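Before generating embeddings, it can help to confirm that the dataset has the shape the index will expect. The following is a minimal sketch, not part of any SDK: validate_documents is a hypothetical helper name, and the key pattern assumes the documented rule that document keys may contain only letters, digits, underscores, dashes, and equal signs.

```python
import re

# Assumption: keys may contain only letters, digits, underscore, dash, or equal sign.
KEY_PATTERN = re.compile(r"^[A-Za-z0-9_\-=]+$")


def validate_documents(docs, required_fields=("id", "content", "category")):
    """Return a list of human-readable problems; an empty list means the data looks fine."""
    problems = []
    seen_ids = set()
    for position, doc in enumerate(docs):
        missing = [field for field in required_fields if field not in doc]
        if missing:
            problems.append(f"document {position}: missing fields {missing}")
            continue
        key = doc["id"]
        if not isinstance(key, str) or not KEY_PATTERN.match(key):
            problems.append(f"document {position}: invalid key {key!r}")
        elif key in seen_ids:
            problems.append(f"document {position}: duplicate key {key!r}")
        else:
            seen_ids.add(key)
    return problems


# Illustrative data with two deliberate mistakes.
sample = [
    {"id": "1", "content": "The quick brown fox.", "category": "animal behavior"},
    {"id": "1", "content": "Same key as above.", "category": "test"},
    {"id": "bad key!", "content": "Key has a space and a bang.", "category": "test"},
]
for problem in validate_documents(sample):
    print(problem)
```

Running a check like this first surfaces data problems locally, rather than as per-document failures reported by the service at upload time.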

2. Generating Embeddings for the Text

We'll use OpenAI's Embedding API to generate embeddings for each document's content.

a. Set Up OpenAI API

import os
from openai import OpenAI

# Read the OpenAI API key from the environment (openai>=1.0 client interface)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

b. Generate Embeddings

def generate_embedding(text):
    # Request an embedding for a single piece of text
    response = client.embeddings.create(
        input=text,
        model="text-embedding-ada-002"  # Or another embedding model
    )
    # The API returns one embedding per input; take the first
    return response.data[0].embedding

# Generate embeddings for each document
for doc in documents:
    doc['embedding'] = generate_embedding(doc['content'])

3. Creating an Azure Cognitive Search Index with Vector Search Capabilities

a. Import Necessary Modules

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SimpleField,
    SearchField,
    SearchFieldDataType,
    SearchableField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
)

b. Set Up Azure Cognitive Search Credentials

# Replace with your Azure Cognitive Search service name and admin key
search_service_endpoint = "https://<your-search-service-name>.search.windows.net"
admin_key = "<your-admin-key>"

credential = AzureKeyCredential(admin_key)
index_client = SearchIndexClient(endpoint=search_service_endpoint, credential=credential)

c. Define the Index Schema

We need to define an index schema that includes a vector field for embeddings. The class and parameter names below follow the 11.4.0 release of azure-search-documents; earlier preview releases used different names.

index_name = "documents-index"

fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SimpleField(name="category", type=SearchFieldDataType.String, filterable=True),
    # Vector field: a collection of single-precision floats tied to a vector search profile
    SearchField(
        name="embedding",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=len(documents[0]['embedding']),
        vector_search_profile_name="default-profile"
    ),
]

# A profile names the algorithm configuration that a vector field uses
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(name="default-hnsw")
    ],
    profiles=[
        VectorSearchProfile(
            name="default-profile",
            algorithm_configuration_name="default-hnsw"
        )
    ]
)

index = SearchIndex(
    name=index_name,
    fields=fields,
    vector_search=vector_search
)

d. Create the Index

# Delete the index if it already exists
if index_name in index_client.list_index_names():
    index_client.delete_index(index_name)

# Create the new index
index_client.create_index(index)

4. Uploading Documents and Embeddings to the Index

a. Initialize Search Client

search_client = SearchClient(
    endpoint=search_service_endpoint,
    index_name=index_name,
    credential=credential
)

b. Upload Documents

Azure Cognitive Search expects the vector field to be a list of floats. Ensure that the embeddings are in the correct format.

from azure.search.documents import IndexDocumentsBatch

# Queue an "upload" action (insert, or replace if the key exists) for every document
batch = IndexDocumentsBatch()
batch.add_upload_actions(documents)

# Send the batch; the service returns one indexing result per document
result = search_client.index_documents(batch)
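Since the vector field must be a list of floats, a small conversion step before upload can catch format problems early. This is a minimal sketch under stated assumptions: as_float_list is a hypothetical helper, and it assumes the embedding is any iterable of numbers (for example a tuple or an array from another library).

```python
import math


def as_float_list(embedding, expected_dimensions=None):
    """Convert an embedding to a plain list of finite floats, or raise ValueError."""
    values = [float(x) for x in embedding]
    if not values:
        raise ValueError("embedding is empty")
    if any(not math.isfinite(v) for v in values):
        raise ValueError("embedding contains NaN or infinity")
    if expected_dimensions is not None and len(values) != expected_dimensions:
        raise ValueError(
            f"expected {expected_dimensions} dimensions, got {len(values)}"
        )
    return values


# A tuple with mixed int/float values becomes a list of floats.
print(as_float_list((1, 2.5, 3), expected_dimensions=3))
```

In the walkthrough above, this could be applied as doc['embedding'] = as_float_list(doc['embedding']) before the batch is built.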

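5. Querying the Index Using Vector Similarity Search

We can now perform searches using vector similarity.

a. Generate Query Embedding

query = "What is the meaning of life?"
query_embedding = generate_embedding(query)

b. Perform Vector Search

from azure.search.documents.models import VectorizedQuery

# Describe the vector query: which field to search and how many neighbors to return
vector_query = VectorizedQuery(
    vector=query_embedding,
    k_nearest_neighbors=2,  # Number of nearest neighbors to return
    fields="embedding"
)

# search_text=None runs a pure vector query with no keyword component
results = search_client.search(
    search_text=None,
    vector_queries=[vector_query],
    select=["id", "content", "category"],
)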

c. Display Results

print("Search Results:")
for result in results:
    print(f"ID: {result['id']}")
    print(f"Content: {result['content']}")
    print(f"Category: {result['category']}")
    print("--------")

Sample Output (the exact ranking depends on the embedding model):

Search Results:
ID: 3
Content: To be or not to be, that is the question.
Category: literature
--------
ID: 2
Content: Never gonna give you up, never gonna let you down.
Category: song lyrics
--------
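Each hit returned by the search also carries a relevance score under the "@search.score" key, which is useful when comparing results. The sketch below formats a hit on one line; format_hit is a hypothetical helper, and the sample dictionaries and their scores are made-up placeholders rather than real service output.

```python
def format_hit(hit):
    """Render one search hit (a dict-like object) as a single line."""
    score = hit.get("@search.score")
    score_text = f"{score:.4f}" if score is not None else "n/a"
    return f"[{score_text}] {hit['id']}: {hit['content']} ({hit['category']})"


# Placeholder hits for illustration only; real hits come from search_client.search(...).
fake_hits = [
    {
        "@search.score": 0.5,
        "id": "3",
        "content": "To be or not to be, that is the question.",
        "category": "literature",
    },
    {
        "id": "2",
        "content": "Never gonna give you up, never gonna let you down.",
        "category": "song lyrics",
    },
]
for hit in fake_hits:
    print(format_hit(hit))
```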

Explanation of How It Works

  1. Generating Embeddings: We use OpenAI's Embedding API to convert text into high-dimensional vectors that capture semantic meaning.

  2. Index Schema: In Azure Cognitive Search, we define an index schema that includes:

    • A key field (id) to uniquely identify each document.

    • A searchable field (content) for full-text search and a filterable field (category) for narrowing results.

    • A vector field (embedding) to store embeddings.

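  3. Vector Search Configuration: We configure vector search using the HNSW (Hierarchical Navigable Small World) algorithm, an approximate nearest-neighbor method that trades a small amount of recall for much faster similarity search over high-dimensional vectors.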

  4. Uploading Documents: We upload the documents along with their embeddings to the index. Azure Cognitive Search stores the embeddings in the vector field.

  5. Vector Similarity Search:

    • Query Embedding: We generate an embedding for the user's query.

    • Search Parameters: We specify the vector field and the number of nearest neighbors (k).

    • Search Execution: Azure Cognitive Search computes the similarity between the query embedding and document embeddings to retrieve the most relevant documents.

  6. Results: The search returns documents ordered by their similarity to the query, allowing for efficient retrieval of semantically related content.
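To make the similarity step concrete, here is a minimal sketch of what a vector query computes conceptually: score every document against the query vector and keep the top k. It is an exact brute-force search over toy three-dimensional vectors, whereas HNSW approximates the same ranking without scanning every document. Cosine similarity is assumed as the metric, and the function names and toy data are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vector, docs, k=2):
    """Return the k (score, id) pairs with the highest similarity to the query."""
    scored = [
        (cosine_similarity(query_vector, doc["embedding"]), doc["id"])
        for doc in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors standing in for real 1536-dimensional embeddings.
toy_docs = [
    {"id": "1", "embedding": [1.0, 0.0, 0.0]},
    {"id": "2", "embedding": [0.0, 1.0, 0.0]},
    {"id": "3", "embedding": [0.6, 0.8, 0.0]},
]
for score, doc_id in top_k([0.0, 1.0, 0.0], toy_docs, k=2):
    print(f"id={doc_id} score={score:.3f}")
```

Brute force costs one comparison per document for every query, which is why an approximate index such as HNSW matters once the collection grows.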


Putting It All Together

Here is the complete code for the entire process:

import os
from openai import OpenAI
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import IndexDocumentsBatch, SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SimpleField,
    SearchField,
    SearchFieldDataType,
    SearchableField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
)
from azure.search.documents.models import VectorizedQuery

# Set up the OpenAI client (openai>=1.0 interface)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Azure Cognitive Search credentials
search_service_endpoint = "https://<your-search-service-name>.search.windows.net"
admin_key = "<your-admin-key>"
credential = AzureKeyCredential(admin_key)

# Sample dataset
documents = [
    {
        "id": "1",
        "content": "The quick brown fox jumps over the lazy dog.",
        "category": "animal behavior"
    },
    {
        "id": "2",
        "content": "Never gonna give you up, never gonna let you down.",
        "category": "song lyrics"
    },
    {
        "id": "3",
        "content": "To be or not to be, that is the question.",
        "category": "literature"
    }
]

# Function to generate embeddings
def generate_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding

# Generate embeddings for each document
for doc in documents:
    doc['embedding'] = generate_embedding(doc['content'])

# Define the index
index_name = "documents-index"

fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SimpleField(name="category", type=SearchFieldDataType.String, filterable=True),
    # Vector field: a collection of single-precision floats tied to a vector search profile
    SearchField(
        name="embedding",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=len(documents[0]['embedding']),
        vector_search_profile_name="default-profile"
    ),
]

vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(name="default-hnsw")
    ],
    profiles=[
        VectorSearchProfile(
            name="default-profile",
            algorithm_configuration_name="default-hnsw"
        )
    ]
)

index = SearchIndex(
    name=index_name,
    fields=fields,
    vector_search=vector_search
)

# Create the index (deleting any existing index with the same name)
index_client = SearchIndexClient(endpoint=search_service_endpoint, credential=credential)
if index_name in index_client.list_index_names():
    index_client.delete_index(index_name)
index_client.create_index(index)

# Upload documents
search_client = SearchClient(endpoint=search_service_endpoint, index_name=index_name, credential=credential)
batch = IndexDocumentsBatch()
batch.add_upload_actions(documents)
search_client.index_documents(batch)

# Generate query embedding
query = "What is the meaning of life?"
query_embedding = generate_embedding(query)

# Perform vector search
vector_query = VectorizedQuery(
    vector=query_embedding,
    k_nearest_neighbors=2,
    fields="embedding"
)

results = search_client.search(
    search_text=None,
    vector_queries=[vector_query],
    select=["id", "content", "category"],
)

# Display results
print("Search Results:")
for result in results:
    print(f"ID: {result['id']}")
    print(f"Content: {result['content']}")
    print(f"Category: {result['category']}")
    print("--------")

Important Notes

  • Vector Dimensions: Ensure that the dimensions configured for the vector field match the length of the embeddings generated by the model (1536 for text-embedding-ada-002).

  • API Versions: Vector search class and parameter names changed between the preview (beta) releases and the 11.4.0 release of azure-search-documents. Ensure that the SDK version you install matches the code you run.

  • Permissions: The admin key is used for index management and data ingestion. For query operations in a production environment, use query keys with appropriate permissions.

  • Embeddings Storage: Be mindful of the storage requirements, as embeddings can be high-dimensional vectors.
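A rough sense of the storage note above can be had with simple arithmetic: each vector value in a single-precision field occupies 4 bytes, so raw vector size is documents × dimensions × 4. The sketch below is an estimate only; the function name is hypothetical, the one-million-document count is an arbitrary example, and the figure excludes the HNSW graph and any other index overhead.

```python
def estimate_vector_storage_bytes(num_documents, dimensions, bytes_per_value=4):
    """Raw size of the vector data alone (single-precision floats by default)."""
    return num_documents * dimensions * bytes_per_value


# Example: one million documents with 1536-dimensional embeddings.
raw_bytes = estimate_vector_storage_bytes(num_documents=1_000_000, dimensions=1536)
print(f"{raw_bytes:,} bytes = {raw_bytes / 1024**3:.2f} GiB of raw vector data")
```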


Additional Considerations

Scaling Up

  • For larger datasets, consider batching the document uploads and optimizing the indexing process.
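One way to batch uploads is to slice the document list into fixed-size chunks and send each chunk separately. This is a minimal sketch: chunked and upload_in_batches are hypothetical helpers, the batch size of 1000 is an assumption to check against the service's per-request limits, and upload_fn stands in for a real call such as lambda b: search_client.upload_documents(documents=b).

```python
def chunked(items, batch_size=1000):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upload_in_batches(upload_fn, docs, batch_size=1000):
    """Call upload_fn once per batch and return the number of documents sent."""
    uploaded = 0
    for batch in chunked(docs, batch_size):
        upload_fn(batch)
        uploaded += len(batch)
    return uploaded


# Demo with a stand-in uploader that only records batch sizes.
batch_sizes = []
many_docs = [{"id": str(i)} for i in range(2500)]
total = upload_in_batches(lambda batch: batch_sizes.append(len(batch)), many_docs)
print(total, batch_sizes)
```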

Error Handling

  • Add try-except blocks to handle exceptions, such as API errors or connectivity issues.
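For transient failures, a try-except block is often paired with retries and exponential backoff. The following is a sketch under stated assumptions: with_retries is a hypothetical helper, the demo operation is a stand-in that fails twice, and in a real script retry_on would be narrowed to the relevant exception types (for example HttpResponseError from azure.core.exceptions) rather than every Exception.

```python
import time


def with_retries(operation, max_attempts=3, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**(attempt - 1) and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on as error:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.2f}s")
            sleep(delay)


# Demo: a stand-in operation that fails twice, then succeeds.
calls = {"count": 0}


def flaky_upload():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network problem")
    return "uploaded"


print(with_retries(flaky_upload, base_delay=0.01))
```

The sleep parameter is injectable so the backoff can be tested without actually waiting.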

Security

  • Secure Credentials: Never hard-code API keys. Use environment variables or secure key management systems.

  • Data Privacy: Ensure compliance with data privacy regulations when handling sensitive content.
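Reading credentials from environment variables can be sketched as follows. The helper name require_env and the variable name AZURE_SEARCH_ENDPOINT are assumptions chosen for illustration; the setdefault line exists only so the example runs without real configuration and would not appear in production code.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demo only: supply a placeholder so the example runs without real settings.
os.environ.setdefault(
    "AZURE_SEARCH_ENDPOINT",
    "https://<your-search-service-name>.search.windows.net",
)
search_service_endpoint = require_env("AZURE_SEARCH_ENDPOINT")
print(search_service_endpoint)
```

Failing fast on a missing variable gives a clearer error than passing None to a client constructor and failing later on the first request.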


Conclusion

By indexing embeddings in Azure Cognitive Search, you can efficiently retrieve documents based on semantic similarity. This approach enhances traditional keyword search by understanding the meaning behind queries and documents.

Key Steps Recap:

  1. Prepare Data: Collect and structure your documents.

  2. Generate Embeddings: Use a language model to convert text into embeddings.

  3. Create Index: Define an index schema with vector search capabilities.

  4. Upload Data: Ingest documents and embeddings into the index.

  5. Search: Perform vector similarity searches to retrieve relevant documents.


Written by

Sai Prasanna Maharana