Unlocking Multimodal RAG: A Guide to Using Cohere Embed with Azure AI Search

Farzad Sunavala

In today's digital landscape, organizations are dealing with an ever-growing collection of both textual and visual data. The ability to effectively search across these different modalities has become crucial for building modern enterprise applications. With the recent release of Cohere's Embed v3 model on Azure AI Studio, developers can now implement powerful multimodal search capabilities within their Azure AI Search solutions.

Key Features of Cohere Embed v3 on Azure AI Search

Unified Vector Space

  • Text and image embeddings share the same semantic space

  • Enables seamless cross-modal search capabilities

  • No need for separate indexes or complex routing logic

Enterprise-Grade Performance

  • Support for 100+ languages

  • Optimized for real-world business data

  • Exceptional accuracy on retrieval tasks

Integration Benefits

  • Native integration with Azure AI Studio

  • Simplified deployment and scaling

  • Built-in support for Azure AI Search vector search capabilities

What's New with Cohere Embed v3?

Cohere's latest Embed v3 model brings groundbreaking multimodal capabilities to Azure AI Studio. This state-of-the-art model can generate embeddings for both text and images, placing them in a unified vector space. This means you can:

  • Search images using text queries

  • Find relevant text using image queries

  • Perform cross-modal searches

  • Build sophisticated retrieval-augmented generation (RAG) systems

The model supports 100+ languages and maintains exceptional performance across various retrieval tasks, making it ideal for enterprise applications.

Setting Up Your Environment

Let's walk through implementing a multimodal search solution using Cohere Embed v3 and Azure AI Search. First, you'll need to install the required packages:

!pip install azure-search-documents==11.6.0b6
!pip install cohere python-dotenv azure-identity tqdm requests

Configuration and Authentication

Set up your environment variables and initialize the necessary clients:

import cohere
import json
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.models import VectorizedQuery
from dotenv import load_dotenv

# Load environment variables
load_dotenv()
AZURE_AI_STUDIO_COHERE_EMBED_KEY = os.getenv("AZURE_AI_STUDIO_COHERE_EMBED_KEY")
AZURE_AI_STUDIO_COHERE_EMBED_ENDPOINT = os.getenv("AZURE_AI_STUDIO_COHERE_EMBED_ENDPOINT")
AZURE_SEARCH_SERVICE_ENDPOINT = os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT")
AZURE_SEARCH_ADMIN_KEY = os.getenv("AZURE_SEARCH_ADMIN_KEY")
INDEX_NAME = "multimodal-cohere-index"

# Initialize clients
azure_search_credential = AzureKeyCredential(AZURE_SEARCH_ADMIN_KEY)
index_client = SearchIndexClient(endpoint=AZURE_SEARCH_SERVICE_ENDPOINT, credential=azure_search_credential)
search_client = SearchClient(endpoint=AZURE_SEARCH_SERVICE_ENDPOINT, index_name=INDEX_NAME, credential=azure_search_credential)
cohere_client = cohere.Client(api_key=AZURE_AI_STUDIO_COHERE_EMBED_KEY, base_url=AZURE_AI_STUDIO_COHERE_EMBED_ENDPOINT)
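
The search functions later in this post call three small helpers that are not defined in the snippets above: one to fetch an image and encode it as a base64 data URI, and two to generate query embeddings through the Cohere client. Here is a minimal sketch of what they might look like, assuming the public Cohere Python SDK's embed call; the model name is an assumption (serverless Azure AI Studio deployments may ignore it), and depending on your SDK version you may need to pass embedding_types=["float"] and read response.embeddings.float instead.

import base64
import requests

def encode_image_to_base64(image_url):
    # Download the image and return it as a base64 data URI
    response = requests.get(image_url)
    response.raise_for_status()
    content_type = response.headers.get("Content-Type", "image/jpeg")
    encoded = base64.b64encode(response.content).decode("utf-8")
    return f"data:{content_type};base64,{encoded}"

def generate_text_embedding(text, input_type="search_query"):
    # Embed text with Cohere Embed v3 (1024 dimensions);
    # use input_type="search_document" when embedding documents for indexing
    response = cohere_client.embed(
        texts=[text],
        model="embed-english-v3.0",  # assumed deployment name; may be ignored by the serverless endpoint
        input_type=input_type,
    )
    return response.embeddings[0]

def generate_image_embedding(image_base64):
    # Embed a base64 data-URI image with Cohere Embed v3
    response = cohere_client.embed(
        images=[image_base64],
        model="embed-english-v3.0",  # assumed deployment name; may be ignored by the serverless endpoint
        input_type="image",
    )
    return response.embeddings[0]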

Creating a Multimodal Search Index

The heart of our solution lies in creating an Azure AI Search index that can handle both text and image embeddings:

from azure.search.documents.indexes.models import (
    SearchField,
    SearchableField,
    SimpleField,
    SearchFieldDataType,
)

fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SimpleField(name="imageUrl", type=SearchFieldDataType.String),
    SearchableField(name="caption", type=SearchFieldDataType.String),
    SearchField(
        name="imageVector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=1024,
        vector_search_profile_name="vector_profile"
    ),
    SearchField(
        name="captionVector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=1024,
        vector_search_profile_name="vector_profile"
    )
]
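
The vector fields reference a profile named "vector_profile", which must also be defined on the index. Below is a minimal sketch of the remaining index definition, index creation, and a small document upload, assuming the helper functions defined earlier; the sample captions and image URLs are illustrative placeholders, not real data.

from azure.search.documents.indexes.models import (
    SearchIndex,
    VectorSearch,
    VectorSearchProfile,
    HnswAlgorithmConfiguration,
)

# Define the vector search configuration referenced by the fields above
vector_search = VectorSearch(
    algorithms=[HnswAlgorithmConfiguration(name="hnsw_config")],
    profiles=[
        VectorSearchProfile(
            name="vector_profile",
            algorithm_configuration_name="hnsw_config",
        )
    ],
)

# Create (or update) the index
index = SearchIndex(name=INDEX_NAME, fields=fields, vector_search=vector_search)
index_client.create_or_update_index(index)

# Embed and upload a few sample documents (captions and URLs are placeholders)
sample_docs = [
    {"id": "1", "caption": "A golden retriever playing in the snow",
     "imageUrl": "https://example.com/images/dog.jpg"},
    {"id": "2", "caption": "A red sports car parked by the ocean",
     "imageUrl": "https://example.com/images/car.jpg"},
]
documents = []
for doc in sample_docs:
    doc["captionVector"] = generate_text_embedding(doc["caption"], input_type="search_document")
    doc["imageVector"] = generate_image_embedding(encode_image_to_base64(doc["imageUrl"]))
    documents.append(doc)

search_client.upload_documents(documents)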

Implementing Different Search Modes

1. Text-to-Text Search

Search for similar text descriptions based on a text query:

def text_to_text_search(query_text):
    text_embedding = generate_text_embedding(query_text)
    text_vector_query = VectorizedQuery(
        vector=text_embedding,
        k_nearest_neighbors=1,
        fields="captionVector"
    )
    results = search_client.search(
        search_text=None, 
        vector_queries=[text_vector_query], 
        top=1
    )
    return results
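
To sanity-check the index, you can run the function with an example query (the string below is arbitrary) and print the matched caption along with its similarity score:

results = text_to_text_search("a dog playing outside in winter")
for result in results:
    print(f"Caption: {result['caption']} (score: {result['@search.score']:.4f})")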

2. Text-to-Image Search

Find images that match a text description:

def text_to_image_search(query_text):
    text_embedding = generate_text_embedding(query_text)
    text_to_image_query = VectorizedQuery(
        vector=text_embedding,
        k_nearest_neighbors=1,
        fields="imageVector"
    )
    results = search_client.search(
        search_text=None, 
        vector_queries=[text_to_image_query], 
        top=1
    )
    return results

3. Image-to-Text Search

Search for relevant text descriptions using an image:

def image_to_text_search(image_url):
    image_base64 = encode_image_to_base64(image_url)
    image_embedding = generate_image_embedding(image_base64)
    image_to_text_query = VectorizedQuery(
        vector=image_embedding,
        k_nearest_neighbors=1,
        fields="captionVector"
    )
    results = search_client.search(
        search_text=None, 
        vector_queries=[image_to_text_query], 
        top=1
    )
    return results
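
The image-driven direction can be exercised the same way; the image URL below is a placeholder for any publicly reachable image:

results = image_to_text_search("https://example.com/images/snowy-dog.jpg")  # placeholder URL
for result in results:
    print(f"Matched caption: {result['caption']}")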

4. Cross-Field Vector Search with Text

Search across both text and image fields simultaneously, given a text input query:

def text_embedding_cross_field_search(query_text):
    text_embedding = generate_text_embedding(query_text)
    cross_field_query = VectorizedQuery(
        vector=text_embedding,
        k_nearest_neighbors=1,
        fields="imageVector, captionVector"
    )
    results = search_client.search(
        search_text=None, 
        vector_queries=[cross_field_query], 
        top=3
    )
    return results

5. Cross-Field Vector Search with Images

The same cross-field search can also be driven by an image input query:

def image_embedding_cross_field_search(image_url):
    image_base64 = encode_image_to_base64(image_url)
    image_embedding = generate_image_embedding(image_base64)

    cross_field_query = VectorizedQuery(
        vector=image_embedding,
        k_nearest_neighbors=1,
        fields="imageVector, captionVector"
    )

    return search_client.search(
        search_text=None, 
        vector_queries=[cross_field_query], 
        top=1
    )

6. Multi-Vector Search

Combine multiple vector queries for more precise results:

def text_and_image_query_multi_vector(query_text, image_url):
    text_embedding = generate_text_embedding(query_text)
    image_base64 = encode_image_to_base64(image_url)
    image_embedding = generate_image_embedding(image_base64)

    text_vector_query = VectorizedQuery(
        vector=text_embedding,
        k_nearest_neighbors=1,
        fields="captionVector"
    )

    image_vector_query = VectorizedQuery(
        vector=image_embedding,
        k_nearest_neighbors=1,
        fields="imageVector"
    )

    results = search_client.search(
        search_text=None,
        vector_queries=[text_vector_query, image_vector_query],
        top=2
    )
    return results
💡 This implementation of multi-vector search differs from ColBERT (Contextualized Late Interaction over BERT). While ColBERT uses late interaction between query and document vectors for dense retrieval, our approach simply combines multiple input vectors (text and image) to query a single vector index, using rank fusion to merge the results.

Building RAG Applications

The combination of Cohere Embed v3 and Azure AI Search enables powerful RAG applications. Here's how to implement a simple RAG system (in this example, I'll arbitrarily use the text_embedding_cross_field_search retrieval configuration):

def ask(query_text):
    search_results = text_embedding_cross_field_search(query_text)
    documents = [{"text": result["caption"]} for result in search_results]

    # co_chat is assumed to be a separate cohere.Client pointed at a Cohere Command (chat)
    # deployment on Azure AI Studio, initialized the same way as cohere_client above
    chat_response = co_chat.chat(
        message=query_text,
        documents=documents,
        max_tokens=100
    )
    return chat_response
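
Calling it end to end might look like the following; with the Cohere chat API, the response object exposes the generated text and, when documents are supplied, may include citations that ground the answer in the retrieved captions (the question string is just an example):

response = ask("What outdoor scenes are in the image collection?")
print(response.text)

# When documents are passed, the response may include citations grounding the answer
if response.citations:
    for citation in response.citations:
        print(citation)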

Real-World Applications

The multimodal search capabilities enabled by this integration open up numerous possibilities:

  1. E-commerce Product Discovery: Enable customers to find products using both text descriptions and visual similarities.

  2. Content Management: Efficiently organize and retrieve mixed media content including documents, images, and presentations.

  3. Knowledge Management: Build sophisticated enterprise search systems that understand both textual and visual content.

  4. Design Asset Management: Help creative teams quickly find relevant design assets using natural language descriptions.

Conclusion

The integration of Cohere Embed v3 with Azure AI Search represents a significant advancement in multimodal search capabilities. This powerful combination provides enterprises with the tools they need to build next-generation search experiences. Whether you're building an e-commerce platform, a content management system, or a knowledge base, this technology stack enables you to create more intelligent and user-friendly applications.

Get started today by deploying Cohere Embed v3 through Azure AI Studio and revolutionize how your applications handle multimodal search.
