Indexing embeddings for efficient retrieval using Azure Cognitive Search
Indexing embeddings for efficient retrieval using Azure Cognitive Search involves several steps:
Preparing a Small Text Dataset
Generating Embeddings for the Text
Creating an Azure Cognitive Search Index with Vector Search Capabilities
Uploading Documents and Embeddings to the Index
Querying the Index Using Vector Similarity Search
Below, I'll walk you through each of these steps with code examples using Python and a small text dataset.
Prerequisites
Azure Subscription: An active Azure account.
Azure Cognitive Search Service: Provisioned in a tier and region that supports vector search (check the current Azure documentation for availability).
Python Environment: Python 3.8 or higher installed.
Azure SDK for Python: Install necessary packages.
pip install azure-search-documents==11.4.0
pip install openai
1. Preparing a Small Text Dataset
Let's start with a small dataset of text documents.
documents = [
    {
        "id": "1",
        "content": "The quick brown fox jumps over the lazy dog.",
        "category": "animal behavior"
    },
    {
        "id": "2",
        "content": "Never gonna give you up, never gonna let you down.",
        "category": "song lyrics"
    },
    {
        "id": "3",
        "content": "To be or not to be, that is the question.",
        "category": "literature"
    }
]
2. Generating Embeddings for the Text
We'll use OpenAI's Embeddings API (via the openai Python package, version 1.x) to generate an embedding for each document's content.
a. Set Up OpenAI API
import os

from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment if no key is passed explicitly
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
b. Generate Embeddings
def generate_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-ada-002"  # Or another embedding model
    )
    return response.data[0].embedding
# Generate embeddings for each document
for doc in documents:
    doc['embedding'] = generate_embedding(doc['content'])
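For three documents, one API call per document is fine. The Embeddings API also accepts a list of inputs, so for slightly larger datasets you can cut down on round-trips with a single call; here is a minimal sketch using the same client as above:
def generate_embeddings_batch(texts):
    # A single request for several inputs; results come back in input order
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-ada-002"
    )
    return [item.embedding for item in response.data]

# Equivalent to the loop above, but with one API call
embeddings = generate_embeddings_batch([doc['content'] for doc in documents])
for doc, emb in zip(documents, embeddings):
    doc['embedding'] = emb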
3. Creating an Azure Cognitive Search Index with Vector Search Capabilities
a. Import Necessary Modules
The vector search classes were renamed between the preview SDKs and the 11.4.0 GA release; the imports below target azure-search-documents 11.4.0.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SimpleField,
    SearchableField,
    SearchFieldDataType,
    VectorSearch,
    VectorSearchProfile,
    HnswAlgorithmConfiguration,
)
b. Set Up Azure Cognitive Search Credentials
# Replace with your Azure Cognitive Search service name and admin key
search_service_endpoint = "https://<your-search-service-name>.search.windows.net"
admin_key = "<your-admin-key>"
credential = AzureKeyCredential(admin_key)
index_client = SearchIndexClient(endpoint=search_service_endpoint, credential=credential)
c. Define the Index Schema
We need to define an index schema that includes a vector field for embeddings.
index_name = "documents-index"
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SimpleField(name="category", type=SearchFieldDataType.String, filterable=True),
    SearchField(
        name="embedding",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=len(documents[0]['embedding']),
        vector_search_profile_name="default-profile"
    ),
]
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(name="hnsw-config")
    ],
    profiles=[
        VectorSearchProfile(
            name="default-profile",
            algorithm_configuration_name="hnsw-config"
        )
    ]
)
index = SearchIndex(
    name=index_name,
    fields=fields,
    vector_search=vector_search
)
d. Create the Index
# Delete the index if it already exists
if index_name in index_client.list_index_names():
    index_client.delete_index(index_name)
# Create the new index
index_client.create_index(index)
4. Uploading Documents and Embeddings to the Index
a. Initialize Search Client
search_client = SearchClient(
    endpoint=search_service_endpoint,
    index_name=index_name,
    credential=credential
)
b. Upload Documents
Azure Cognitive Search expects the vector field to be a list of floats. Ensure that the embeddings are in the correct format.
# upload_documents wraps the documents in an indexing batch of "upload" actions
result = search_client.upload_documents(documents=documents)
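If you are unsure about the format, a quick defensive check like the following (run before uploading) can catch problems early. This is optional; the OpenAI client already returns plain Python floats:
expected_dim = len(documents[0]['embedding'])
for doc in documents:
    # Coerce to plain floats (useful if the embeddings came from numpy, for example)
    doc['embedding'] = [float(x) for x in doc['embedding']]
    assert len(doc['embedding']) == expected_dim, f"Unexpected dimension for doc {doc['id']}"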
5. Querying the Index Using Vector Similarity Search
We can now perform searches using vector similarity.
a. Generate Query Embedding
query = "What is the meaning of life?"
query_embedding = generate_embedding(query)
b. Perform Vector Search
from azure.search.documents.models import VectorizedQuery

vector_query = VectorizedQuery(
    vector=query_embedding,
    k_nearest_neighbors=2,  # Number of nearest neighbors to return
    fields="embedding"
)

results = search_client.search(
    search_text=None,  # Pure vector search; no keyword query
    vector_queries=[vector_query],
    select=["id", "content", "category"],
)
c. Display Results
print("Search Results:")
for result in results:
    print(f"ID: {result['id']}")
    print(f"Content: {result['content']}")
    print(f"Category: {result['category']}")
    print("--------")
Sample Output:
Search Results:
ID: 3
Content: To be or not to be, that is the question.
Category: literature
--------
ID: 2
Content: Never gonna give you up, never gonna let you down.
Category: song lyrics
--------
Explanation of How It Works
Generating Embeddings: We use OpenAI's Embedding API to convert text into high-dimensional vectors that capture semantic meaning.
Index Schema: In Azure Cognitive Search, we define an index schema that includes a key field (id) to uniquely identify each document, a searchable field (content) for full-text search, a filterable field (category), and a vector field (embedding) to store the embeddings.
Vector Search Configuration: We configure vector search with the HNSW algorithm, which is efficient for similarity search over high-dimensional vectors.
Uploading Documents: We upload the documents along with their embeddings to the index. Azure Cognitive Search stores the embeddings in the vector field.
Vector Similarity Search:
Query Embedding: We generate an embedding for the user's query.
Search Parameters: We specify the vector field and the number of nearest neighbors (k).
Search Execution: Azure Cognitive Search computes the similarity between the query embedding and the document embeddings to retrieve the most relevant documents.
Results: The search returns documents ordered by their similarity to the query, allowing for efficient retrieval of semantically related content.
Putting It All Together
Here is the complete code for the entire process:
import os

from openai import OpenAI
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SimpleField,
    SearchableField,
    SearchFieldDataType,
    VectorSearch,
    VectorSearchProfile,
    HnswAlgorithmConfiguration,
)
from azure.search.documents.models import VectorizedQuery

# Set up the OpenAI client (reads OPENAI_API_KEY from the environment)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
# Azure Cognitive Search credentials
search_service_endpoint = "https://<your-search-service-name>.search.windows.net"
admin_key = "<your-admin-key>"
credential = AzureKeyCredential(admin_key)
# Sample dataset
documents = [
    {
        "id": "1",
        "content": "The quick brown fox jumps over the lazy dog.",
        "category": "animal behavior"
    },
    {
        "id": "2",
        "content": "Never gonna give you up, never gonna let you down.",
        "category": "song lyrics"
    },
    {
        "id": "3",
        "content": "To be or not to be, that is the question.",
        "category": "literature"
    }
]
# Function to generate embeddings
def generate_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding
# Generate embeddings for each document
for doc in documents:
    doc['embedding'] = generate_embedding(doc['content'])
# Define the index
index_name = "documents-index"
fields = [
SimpleField(name="id", type=SearchFieldDataType.String, key=True),
SearchableField(name="content", type=SearchFieldDataType.String),
SimpleField(name="category", type=SearchFieldDataType.String, filterable=True),
VectorField(
name="embedding",
dimensions=len(documents[0]['embedding']),
vector_search_configuration="default"
),
]
vector_search = VectorSearch(
algorithm_configurations=[
VectorSearchAlgorithmConfiguration(
name="default",
algorithm="hnsw"
)
]
)
index = SearchIndex(
    name=index_name,
    fields=fields,
    vector_search=vector_search
)
# Create the index
index_client = SearchIndexClient(endpoint=search_service_endpoint, credential=credential)
if index_name in index_client.list_index_names():
    index_client.delete_index(index_name)
index_client.create_index(index)
# Upload documents
search_client = SearchClient(endpoint=search_service_endpoint, index_name=index_name, credential=credential)
search_client.upload_documents(documents=documents)
# Generate query embedding
query = "What is the meaning of life?"
query_embedding = generate_embedding(query)
# Perform vector search
vector_query = VectorizedQuery(
    vector=query_embedding,
    k_nearest_neighbors=2,
    fields="embedding"
)
results = search_client.search(
    search_text=None,
    vector_queries=[vector_query],
    select=["id", "content", "category"],
)
# Display results
print("Search Results:")
for result in results:
    print(f"ID: {result['id']}")
    print(f"Content: {result['content']}")
    print(f"Category: {result['category']}")
    print("--------")
Important Notes
Vector Dimensions: Ensure that the vector_search_dimensions parameter on the embedding field matches the length of the embeddings generated by the model (1536 for text-embedding-ada-002).
API Versions: Vector search is relatively new in Azure Cognitive Search. Ensure that you're using a compatible API version and SDK; this walkthrough targets azure-search-documents 11.4.0.
Permissions: The admin key is used for index management and data ingestion. For query operations in a production environment, use query keys with appropriate permissions (a short sketch follows these notes).
Embeddings Storage: Be mindful of the storage requirements, as embeddings can be high-dimensional vectors.
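Following up on the Permissions note, a query-only client can be built with a query key obtained from the Azure portal. A minimal sketch; the AZURE_SEARCH_QUERY_KEY variable name is just an example, not something created earlier in this walkthrough:
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# A query key can run searches but cannot create, update, or delete indexes
query_key = os.getenv("AZURE_SEARCH_QUERY_KEY")
query_client = SearchClient(
    endpoint=search_service_endpoint,
    index_name=index_name,
    credential=AzureKeyCredential(query_key)
)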
Additional Considerations
Scaling Up
- For larger datasets, consider batching the document uploads and optimizing the indexing process.
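A rough sketch of what batched uploads might look like, assuming the same search_client and documents as above (the batch size is an arbitrary choice; a single indexing request accepts at most 1,000 documents):
def upload_in_batches(search_client, docs, batch_size=500):
    # Upload documents in chunks and report any per-document failures
    for start in range(0, len(docs), batch_size):
        chunk = docs[start:start + batch_size]
        results = search_client.upload_documents(documents=chunk)
        failed = [r for r in results if not r.succeeded]
        if failed:
            print(f"{len(failed)} documents failed in the batch starting at {start}")

upload_in_batches(search_client, documents)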
Error Handling
- Add try-except blocks to handle exceptions, such as API errors or connectivity issues.
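As a minimal sketch, the upload call can be wrapped with the Azure SDK's HttpResponseError, which is raised for failed service requests:
from azure.core.exceptions import HttpResponseError

try:
    search_client.upload_documents(documents=documents)
except HttpResponseError as e:
    # Covers throttling, schema mismatches, bad keys, and similar service errors
    print(f"Indexing request failed: {e.message}")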
Security
Secure Credentials: Never hard-code API keys. Use environment variables or a secure key management system (see the sketch below).
Data Privacy: Ensure compliance with data privacy regulations when handling sensitive content.
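For instance, the endpoint and keys used throughout this post can be read from environment variables rather than hard-coded; the variable names below are just examples:
import os
from azure.core.credentials import AzureKeyCredential

search_service_endpoint = os.environ["AZURE_SEARCH_ENDPOINT"]  # e.g. https://<name>.search.windows.net
admin_key = os.environ["AZURE_SEARCH_ADMIN_KEY"]               # keep out of source control
credential = AzureKeyCredential(admin_key)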
Conclusion
By indexing embeddings in Azure Cognitive Search, you can efficiently retrieve documents based on semantic similarity. This approach enhances traditional keyword search by understanding the meaning behind queries and documents.
Key Steps Recap:
Prepare Data: Collect and structure your documents.
Generate Embeddings: Use a language model to convert text into embeddings.
Create Index: Define an index schema with vector search capabilities.
Upload Data: Ingest documents and embeddings into the index.
Search: Perform vector similarity searches to retrieve relevant documents.