Multimodal RAG: Understanding RAG, Multimodal AI Models & LlamaIndex

Nagen K
18 min read

I have been contemplating writing a blog post on RAG applications for quite some time. The idea was to use a RAG application to explain the basic concepts behind Retrieval-Augmented Generation for LLMs and demonstrate how to integrate a Vector DB, RAG orchestrator, and LLM to generate contextually relevant responses. However, since RAG has become widely popularized and readers of this blog are likely already familiar with the RAG framework and its fundamentals, I've decided to focus on something more contemporary: Multimodal RAG, an extension of traditional RAG that represents the next evolution in this space.

While traditional RAG (Retrieval-Augmented Generation) systems excel at text-based information retrieval, real-world applications often require understanding images, documents, charts, and other visual content alongside textual data.

This is where Multimodal RAG comes into play—an approach that combines text and visual understanding to create more comprehensive AI systems. In this deep dive, we'll build a complete multimodal RAG system using LLaVA for vision-language understanding, FAISS for efficient vector storage, and LlamaIndex for orchestration.

A heads up: unlike our previous blogs where the sample code could be tested on local systems with moderate capabilities, the code sample in this post requires significantly more computational resources due to the heavy memory and processing demands of vision-based models. To run the sample code, you'll need a system with at least 16 GB of free RAM and a GPU with at least 8 GB of dedicated VRAM. For those without suitable local hardware, we recommend using an AWS EC2 g4dn.xlarge instance (~$0.50/hour, 4 vCPUs, 16 GB RAM, NVIDIA T4 GPU with 16 GB VRAM).

OK, let's get into our topic.

What is Multimodal RAG?

Multimodal RAG extends the traditional RAG architecture by incorporating multiple data modalities—primarily text and images—into a unified retrieval and generation system. Unlike conventional RAG that only processes text, multimodal RAG can:

  • Understand visual content: Extract information from images, charts, diagrams, and documents

  • Reason across modalities: Connect information between text and visual elements

  • Retrieve rich context: Find relevant information across different media types

  • Generate comprehensive responses: Produce answers that incorporate both textual and visual insights

The Architecture Overview

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Input Query   │───▶│  Multimodal RAG  │───▶│   Response      │
│ (Text + Images) │    │     System       │    │ (Enriched Text) │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                               │
                               ▼
                    ┌─────────────────────┐
                    │  Component Stack    │
                    │  ┌───────────────┐  │
                    │  │    LLaVA      │  │ ◄── Vision-Language Model
                    │  │   (Vision)    │  │
                    │  ├───────────────┤  │
                    │  │    FAISS      │  │ ◄── Vector Database
                    │  │  (Storage)    │  │
                    │  ├───────────────┤  │
                    │  │  LlamaIndex   │  │ ◄── Orchestration
                    │  │ (Framework)   │  │
                    │  └───────────────┘  │
                    └─────────────────────┘

Understanding Multimodal AI Models

Before diving into the implementation, let's understand how multimodal AI models differ from traditional text-only transformers.

Text-Only Transformers vs Multimodal Models

Traditional Text-Only Transformers:

  • Process only tokenized text sequences

  • Single encoder-decoder architecture

  • Attention mechanisms operate on text tokens only

  • Output: Text embeddings and generated text

Input: "What is in this image?" → Tokenizer → Transformer → Text Output

Multimodal AI Models:

  • Process multiple input types (text, images, audio)

  • Combined architecture with separate encoders for each modality

  • Cross-modal attention mechanisms

  • Shared representation space for different modalities

Text Input: "What is this?" ──┐
                              ├─→ Multimodal Fusion → Cross-Modal Attention → Output
Image Input: [Image Data] ────┘
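
To make this contrast concrete, here is a minimal sketch using Hugging Face's CLIP as the multimodal example; the model names and the shirt.jpg path are purely illustrative, not part of the system we build later:

import torch
from PIL import Image
from transformers import AutoTokenizer, CLIPModel, CLIPProcessor

# Text-only transformer: a tokenizer turns the prompt into token IDs, nothing more.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_only = tokenizer("What is in this image?", return_tensors="pt")
print(text_only["input_ids"].shape)  # just a sequence of text tokens

# Multimodal model: one processor prepares both modalities, and separate encoders
# project them into a shared embedding space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
inputs = processor(text=["a red formal shirt"], images=Image.open("shirt.jpg"),
                   return_tensors="pt", padding=True)

with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Because both embeddings live in the same space, they can be compared directly.
print(torch.nn.functional.cosine_similarity(text_emb, image_emb))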

Get Familiar With The Technology Stack

LLaVA: Vision-Language Understanding

LLaVA (Large Language and Vision Assistant) is a multimodal AI model that combines the visual understanding capabilities of computer vision models with the reasoning power of large language models. Unlike traditional approaches that process text and images separately, LLaVA is a unified model that can simultaneously understand visual content and engage in natural language conversations about what it sees. It achieves this through an architecture that uses a vision encoder (typically CLIP) to convert images into visual tokens, which are then projected into the same embedding space as text tokens for the language model. This allows LLaVA to perform complex visual reasoning tasks, answer questions about images, describe visual content in detail, and even hold multi-turn conversations that reference both textual context and visual elements. What makes LLaVA particularly powerful for RAG applications is its ability to ground textual knowledge in visual understanding, enabling it to generate responses that synthesize information from both retrieved text documents and visual content.

Core LLaVA APIs:

from llama_index.core.schema import ImageDocument
from llama_index.multi_modal_llms.ollama import OllamaMultiModal

# Initialize LLaVA (served locally through Ollama)
llava = OllamaMultiModal(model="llava:13b")

# Text-only completion (the interface still expects an image_documents argument)
response = llava.complete("Explain quantum computing", image_documents=[])

# Image + text completion (images are passed as ImageDocument objects)
response = llava.complete(
    prompt="What's in this image?",
    image_documents=[ImageDocument(image_path="path/to/image.jpg")]
)

# Streaming responses
for chunk in llava.stream_complete(
    "Describe this image",
    image_documents=[ImageDocument(image_path="image.jpg")]
):
    print(chunk.delta, end="")

Key LLaVA Capabilities:

  • Native multimodal processing: Unlike OCR + text models, LLaVA understands visual semantics

  • Unified embedding space: Text and visual concepts share the same representation space

  • Visual reasoning: Can answer questions requiring understanding of spatial relationships, charts, diagrams

  • Cross-modal grounding: Links textual descriptions to visual elements

How LLaVA Works Internally

LLaVA's architecture consists of three main components:

  1. Vision Encoder: Processes images and converts them to visual tokens

  2. Projection Layer: Maps visual features to the language model's embedding space

  3. Language Model: Processes both text and projected visual tokens together

    ┌─────────────┐    ┌──────────────┐    ┌─────────────────┐
    │   Image     │───▶│ Vision       │───▶│ Projection      │
    │   Input     │    │ Encoder      │    │ Layer           │
    └─────────────┘    │ (CLIP)       │    │ (Linear)        │
                       └──────────────┘    └─────────────────┘
                                                    │
                                                    ▼
    ┌─────────────┐    ┌─────────────────────────────────────┐
    │   Text      │───▶│        Language Model               │───▶ Output
    │   Input     │    │    (Processes text + visual tokens) │
    └─────────────┘    └─────────────────────────────────────┘
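
If you want to see these three components directly, the Hugging Face transformers port of LLaVA exposes them as attributes of the model. A minimal sketch, assuming the llava-hf/llava-1.5-7b-hf checkpoint and an illustrative shirt.jpg image (downloading the weights needs substantial disk space and VRAM):

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

# The three components from the diagram above
print(type(model.vision_tower).__name__)           # Vision encoder (CLIP)
print(type(model.multi_modal_projector).__name__)  # Projection layer
print(type(model.language_model).__name__)         # Language model

# A single forward pass mixes text tokens and projected visual tokens
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(text=prompt, images=Image.open("shirt.jpg"), return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))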

FAISS: Efficient Vector Storage

FAISS provides high-performance vector storage and similarity search capabilities. See my earlier posts https://thinkboundlessai.hashnode.dev/how-vector-databases-store-data-an-in-depth-explanation and https://thinkboundlessai.hashnode.dev/vector-db-in-ai for a deeper look at vector databases.

Core FAISS APIs:

import faiss
import numpy as np

dimension = 768  # embedding size (CLIP ViT-L/14 vectors, used later in this post, are 768-dim)
nlist = 100      # number of clusters for the IVF index
M = 32           # graph connectivity for the HNSW index

# Create different types of indexes
index_flat = faiss.IndexFlatIP(dimension)  # Exact search
quantizer = faiss.IndexFlatIP(dimension)   # Coarse quantizer used by the IVF index
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)  # Approximate search
index_hnsw = faiss.IndexHNSWFlat(dimension, M)  # Graph-based search

# Basic operations (shown on the flat index)
vectors = np.random.rand(1000, dimension).astype('float32')
index_flat.add(vectors)  # Add vectors
query_vector = np.random.rand(1, dimension).astype('float32')
scores, indices = index_flat.search(query_vector, 5)  # Search for the 5 nearest vectors
index_flat.remove_ids(np.arange(10, dtype='int64'))  # Remove vectors by id

FAISS Index Types for Multimodal RAG:

  • IndexFlatIP: Exact inner product search, best for small datasets

  • IndexIVFFlat: Inverted file index, good balance of speed and accuracy (must be trained on sample vectors before data is added; see the sketch below)

  • IndexHNSW: Hierarchical graph index, excellent for high-dimensional data
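
Here is a minimal sketch of the practical difference between these index types: the flat index can be used immediately, the IVF index has to be trained on a representative sample of vectors before anything can be added, and HNSW trades memory for fast approximate search. The vectors are random placeholders:

import faiss
import numpy as np

dimension, nlist, M = 768, 100, 32
sample_vectors = np.random.rand(10_000, dimension).astype('float32')

# Exact search: no training step, just add and search
flat = faiss.IndexFlatIP(dimension)
flat.add(sample_vectors)

# IVF: must be trained (learns nlist cluster centroids) before add()
quantizer = faiss.IndexFlatIP(dimension)
ivf = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)
ivf.train(sample_vectors)
ivf.add(sample_vectors)
ivf.nprobe = 10  # how many clusters to visit per query (speed/accuracy knob)

# HNSW: graph-based, no training step, higher memory use, very fast queries
hnsw = faiss.IndexHNSWFlat(dimension, M)
hnsw.add(sample_vectors)

query = np.random.rand(1, dimension).astype('float32')
for name, index in [("flat", flat), ("ivf", ivf), ("hnsw", hnsw)]:
    scores, ids = index.search(query, 5)
    print(name, ids[0])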

LlamaIndex: The Orchestration Layer

LlamaIndex provides abstractions for building RAG applications with multiple data sources.

Core LlamaIndex APIs:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.faiss import FaissVectorStore  # wired in below

# Document loading
documents = SimpleDirectoryReader("data/").load_data()

# Text processing
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)

# Embedding model used for indexing
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Index creation
index = VectorStoreIndex(nodes, embed_model=embed_model)

# Querying
query_engine = index.as_query_engine()
response = query_engine.query("Your question here")
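
Since FAISS is our vector store for this post, here is a minimal sketch of wiring FaissVectorStore into LlamaIndex through a StorageContext. It reuses nodes and embed_model from the snippet above; the 384 dimension matches BAAI/bge-small-en-v1.5:

import faiss
from llama_index.core import StorageContext, VectorStoreIndex

# BAAI/bge-small-en-v1.5 produces 384-dimensional embeddings
faiss_index = faiss.IndexFlatIP(384)
vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build the index on top of the FAISS-backed store
index = VectorStoreIndex(nodes, storage_context=storage_context, embed_model=embed_model)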

If you're here only to understand the concepts, you can stop reading now. The rest of the post walks through the implementation of a multimodal RAG application using the stack described above.

Implementing The Multimodal RAG Application

Now let's get into the implementation; be patient, it's going to be lengthy. We'll use a clothing-store catalogue search application as our example. The application reads a catalogue file that contains the details of each garment along with a clear product image, and users can search the catalogue using text, an image, or a combination of both.

Step 1: Environment Setup and Dependencies

# requirements.txt
# Core frameworks
llama-index==0.9.30
llama-index-vector-stores-faiss==0.1.1
llama-index-multi-modal-llms-ollama==0.1.3
llama-index-embeddings-huggingface==0.1.4

# Multimodal processing
transformers==4.36.2
torch==2.1.0
torchvision==0.16.0
Pillow==10.1.0

# Vector storage and search
faiss-cpu==1.7.4  # Use faiss-gpu if you have GPU support

# Document processing
pytesseract==0.3.10
pdf2image==1.16.3

# Additional utilities
numpy==1.24.3
scipy==1.11.3

# Installation commands:
# pip install -r requirements.txt
# 
# For Ollama with LLaVA:
# curl -fsSL https://ollama.ai/install.sh | sh
# ollama pull llava:13b

Step 2: Understanding Why We Need Different Models

Why use BAAI/bge-small for text embeddings instead of LLaVA?

Here's the technical reasoning:

# LLaVA's text processing
llava_response = llava.complete("Explain this concept: neural networks", image_documents=[])
# Returns: a generated text response, NOT embeddings suitable for search

# What we need for search: a dedicated embedding model
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embedding_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
text_embedding = embedding_model.get_text_embedding("neural networks")
# Returns: dense vector [0.23, -0.15, 0.87, ...] for similarity search

LLaVA vs Specialized Embedding Models:

  • LLaVA: Generative model optimized for conversation and reasoning

  • BAAI/bge: Specialized embedding model optimized for semantic similarity

  • Different purposes: Generation vs Retrieval

However, for true multimodal search, we'll use CLIP which creates aligned embeddings for both text and images:

Step 3: Proper Multimodal Embedding Strategy

import numpy as np
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

class TrueMultimodalEmbedding:
    """
    Creates embeddings where text and images exist in the same semantic space.
    This enables cross-modal search: text queries can find relevant images and vice versa.
    """

    def __init__(self):
        # CLIP creates aligned embeddings for text and images
        self.clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
        self.clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

        # For enhanced text understanding, we'll also use a text-specific model
        self.text_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")

    def get_image_embedding(self, image_path: str) -> np.ndarray:
        """
        Generate semantic embeddings for images that can be compared with text embeddings.
        This captures visual concepts, not just OCR text.
        """
        image = Image.open(image_path)

        # Process image through CLIP vision encoder
        inputs = self.clip_processor(images=image, return_tensors="pt")

        with torch.no_grad():
            # Get image features from CLIP
            image_features = self.clip_model.get_image_features(**inputs)
            # Normalize for cosine similarity
            image_features = image_features / image_features.norm(dim=-1, keepdim=True)

        return image_features.numpy().flatten()

    def get_text_embedding(self, text: str) -> np.ndarray:
        """
        Generate text embeddings that are aligned with image embeddings.
        """
        # Process text through CLIP text encoder
        inputs = self.clip_processor(text=[text], return_tensors="pt", padding=True)

        with torch.no_grad():
            text_features = self.clip_model.get_text_features(**inputs)
            text_features = text_features / text_features.norm(dim=-1, keepdim=True)

        return text_features.numpy().flatten()

    def get_enhanced_text_embedding(self, text: str) -> np.ndarray:
        """
        Combine CLIP text embedding with specialized text embedding for better text understanding.
        """
        clip_embedding = self.get_text_embedding(text)
        text_embedding = self.text_model.get_text_embedding(text)

        # Combine embeddings (you could also train a fusion layer)
        combined = np.concatenate([clip_embedding, text_embedding])
        return combined

    def compute_similarity(self, embedding1: np.ndarray, embedding2: np.ndarray) -> float:
        """
        Compute similarity between any two embeddings (text-text, image-image, or cross-modal).
        """
        # Ensure same dimensionality for comparison
        if embedding1.shape != embedding2.shape:
            # Truncate to smaller size for comparison
            min_size = min(len(embedding1), len(embedding2))
            embedding1 = embedding1[:min_size]
            embedding2 = embedding2[:min_size]

        # Cosine similarity
        return np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2))
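
A quick usage sketch of the class above, checking that a text query and a product photo land close together in the shared CLIP space. The image path is a placeholder following the naming convention used in the next step:

embedder = TrueMultimodalEmbedding()

text_vec = embedder.get_text_embedding("a blue formal shirt with full sleeves")
image_vec = embedder.get_image_embedding("catalog_mens_formal_catalog_page_0.png")

# Cross-modal similarity: both vectors are 768-dim CLIP embeddings
print(f"text-image similarity: {embedder.compute_similarity(text_vec, image_vec):.3f}")

# The enhanced text embedding is longer (768 CLIP + 1024 BGE = 1792 dims)
print(embedder.get_enhanced_text_embedding("a blue formal shirt").shape)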

Step 4: Clothing Catalog Document Processing

from pathlib import Path
from typing import Any, Dict, List

import pytesseract
from PIL import Image
from pdf2image import convert_from_path
from llama_index.core.node_parser import SentenceSplitter


class ClothingCatalogProcessor:
    """
    Process clothing catalog PDFs to extract both product images and descriptions.
    Creates a rich multimodal knowledge base for clothing search.
    """

    def __init__(self):
        self.embedding_model = TrueMultimodalEmbedding()
        self.text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)

    def extract_product_info(self, image_path: str) -> Dict[str, Any]:
        """Extract product text from catalog pages using OCR; downstream code reads the 'description' key."""
        try:
            image = Image.open(image_path)
            # Extract text descriptions, prices, product details via OCR
            extracted_text = pytesseract.image_to_string(image)

            # Return the raw description plus parsed attributes
            return {
                'description': extracted_text.strip(),
                'attributes': self.parse_product_attributes(extracted_text)
            }
        except Exception as e:
            print(f"Error extracting product info from {image_path}: {e}")
            return {}

    def parse_product_attributes(self, text: str) -> Dict[str, str]:
        """Parse structured attributes from product descriptions."""
        # Simple regex-based parsing (in production, use NER models)
        import re

        attributes = {}

        # Extract common clothing attributes
        size_match = re.search(r'Size[s]?:\s*([^\n]+)', text, re.IGNORECASE)
        if size_match:
            attributes['sizes'] = size_match.group(1)

        color_match = re.search(r'Color[s]?:\s*([^\n]+)', text, re.IGNORECASE)
        if color_match:
            attributes['colors'] = color_match.group(1)

        material_match = re.search(r'Material:\s*([^\n]+)', text, re.IGNORECASE)
        if material_match:
            attributes['material'] = material_match.group(1)

        price_match = re.search(r'\$(\d+(?:\.\d{2})?)', text)
        if price_match:
            attributes['price'] = f"${price_match.group(1)}"

        return attributes

    def process_clothing_catalog(self, pdf_path: str) -> Dict[str, List]:
        """
        Extract and process clothing items from catalog PDFs.
        Creates separate embeddings for:
        1. Text descriptions (for attribute-based search)
        2. Product images (for visual similarity search)  
        3. Combined content (for multimodal reasoning)
        """
        catalog_name = Path(pdf_path).stem
        pages = convert_from_path(pdf_path)

        processed_data = {
            'text_nodes': [],      # Product descriptions and attributes
            'image_nodes': [],     # Product images for visual search
            'product_nodes': []    # Complete product information
        }

        for page_num, page_image in enumerate(pages):
            # Save page as image
            page_path = f"catalog_{catalog_name}_page_{page_num}.png"
            page_image.save(page_path)

            # Extract product information
            product_text = self.extract_product_info(page_path)
            product_attributes = self.parse_product_attributes(product_text.get('description', ''))

            # Detect individual product images on the page (simplified - assume one product per page)
            product_id = f"{catalog_name}_p{page_num}"

            # 1. Create text-searchable node for attribute-based queries
            if product_text.get('description'):
                description = product_text['description']
                text_embedding = self.embedding_model.get_enhanced_text_embedding(description)

                processed_data['text_nodes'].append({
                    'id': f"{product_id}_text",
                    'content': description,
                    'embedding': text_embedding,
                    'metadata': {
                        'product_id': product_id,
                        'catalog': catalog_name,
                        'page': page_num,
                        'type': 'product_description',
                        'attributes': product_attributes
                    }
                })

            # 2. Create image-searchable node for visual similarity
            image_embedding = self.embedding_model.get_image_embedding(page_path)

            processed_data['image_nodes'].append({
                'id': f"{product_id}_image",
                'content': f"Product image from {catalog_name}, page {page_num}",
                'image_path': page_path,
                'embedding': image_embedding,
                'metadata': {
                    'product_id': product_id,
                    'catalog': catalog_name,
                    'page': page_num,
                    'type': 'product_image',
                    'attributes': product_attributes
                }
            })

            # 3. Create complete product node for multimodal search
            product_summary = f"Product: {product_attributes.get('category', 'clothing item')}"
            if product_text.get('description'):
                product_summary += f" - {product_text['description']}"

            processed_data['product_nodes'].append({
                'id': product_id,
                'content': product_summary,
                'image_path': page_path,
                'text_embedding': self.embedding_model.get_text_embedding(product_summary),
                'image_embedding': image_embedding,
                'metadata': {
                    'product_id': product_id,
                    'catalog': catalog_name,
                    'page': page_num,
                    'type': 'complete_product',
                    'attributes': product_attributes,
                    'description': product_text.get('description', ''),
                    'price': product_attributes.get('price', 'N/A')
                }
            })

        return processed_data
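
A short usage sketch; mens_formal_catalog.pdf is one of the sample catalog files used in the full demo later in this post:

processor = ClothingCatalogProcessor()
catalog_data = processor.process_clothing_catalog("mens_formal_catalog.pdf")

print(f"{len(catalog_data['text_nodes'])} product descriptions")
print(f"{len(catalog_data['image_nodes'])} product images")
print(f"{len(catalog_data['product_nodes'])} complete product profiles")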

Step 5: True Multimodal Vector Storage

from typing import Dict, List

import faiss
import numpy as np


class MultimodalVectorStore:
    """
    Manages separate indexes for different types of content while enabling cross-modal search.
    """

    def __init__(self, clip_dim: int = 768, text_dim: int = 1792):
        # Separate indexes for different modalities.
        # text_dim = 768 (CLIP text) + 1024 (BAAI/bge-large-en-v1.5), because the
        # enhanced text embeddings concatenate both vectors.
        self.text_index = faiss.IndexFlatIP(text_dim)  # Enhanced text embeddings
        self.image_index = faiss.IndexFlatIP(clip_dim)  # CLIP image embeddings
        self.cross_modal_index = faiss.IndexFlatIP(clip_dim)  # Aligned CLIP embeddings

        # Metadata storage
        self.text_metadata = {}
        self.image_metadata = {}
        self.cross_modal_metadata = {}

    def add_text_nodes(self, nodes: List[Dict]):
        """Add text-only nodes for traditional text search."""
        embeddings = np.array([node['embedding'] for node in nodes]).astype('float32')

        # Get current index size to map back to metadata
        start_idx = self.text_index.ntotal
        self.text_index.add(embeddings)

        # Store metadata
        for i, node in enumerate(nodes):
            self.text_metadata[start_idx + i] = node

    def add_image_nodes(self, nodes: List[Dict]):
        """Add image nodes for visual search."""
        embeddings = np.array([node['embedding'] for node in nodes]).astype('float32')

        start_idx = self.image_index.ntotal
        self.image_index.add(embeddings)

        for i, node in enumerate(nodes):
            self.image_metadata[start_idx + i] = node

    def add_cross_modal_nodes(self, nodes: List[Dict]):
        """Add nodes that enable cross-modal search (text queries finding images, etc.)."""
        # For cross-modal search, we use CLIP embeddings that are aligned
        text_embeddings = []
        image_embeddings = []

        for node in nodes:
            # Add both text and image representations to cross-modal index
            text_embed = node['text_embedding']
            image_embed = node['image_embedding']

            text_embeddings.append(text_embed)
            image_embeddings.append(image_embed)

        # Add both text and image embeddings to the same index
        all_embeddings = np.vstack([
            np.array(text_embeddings),
            np.array(image_embeddings)
        ]).astype('float32')

        start_idx = self.cross_modal_index.ntotal
        self.cross_modal_index.add(all_embeddings)

        # Store metadata for both text and image versions
        for i, node in enumerate(nodes):
            # Text version
            self.cross_modal_metadata[start_idx + i] = {
                **node,
                'modal_type': 'text_representation'
            }
            # Image version  
            self.cross_modal_metadata[start_idx + len(nodes) + i] = {
                **node,
                'modal_type': 'image_representation'
            }
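
One practical note: embedding a whole catalog is slow, so you may want to persist the FAISS indexes to disk instead of rebuilding them on every run. A minimal sketch, assuming vector_store is a populated MultimodalVectorStore; the file names are arbitrary and the metadata dictionaries are saved separately with pickle:

import pickle

import faiss

# Save the indexes and their metadata
faiss.write_index(vector_store.text_index, "catalog_text.index")
faiss.write_index(vector_store.image_index, "catalog_image.index")
faiss.write_index(vector_store.cross_modal_index, "catalog_cross_modal.index")
with open("catalog_metadata.pkl", "wb") as f:
    pickle.dump({
        "text": vector_store.text_metadata,
        "image": vector_store.image_metadata,
        "cross_modal": vector_store.cross_modal_metadata,
    }, f)

# Reload later
vector_store.text_index = faiss.read_index("catalog_text.index")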

Step 6: Clothing-Specific Query Engine

from typing import Dict, List

import numpy as np
from llama_index.core.schema import ImageDocument


class ClothingCatalogQueryEngine:
    """
    Handles clothing-specific queries: text-only, image-only, and text+image combinations.
    Demonstrates practical multimodal search for e-commerce applications.
    """

    def __init__(self, vector_store: MultimodalVectorStore, llava_model):
        self.vector_store = vector_store
        self.llava = llava_model
        self.embedding_model = TrueMultimodalEmbedding()

    def search_by_attributes(self, query: str, top_k: int = 5) -> List[Dict]:
        """
        Attribute-based search for clothing.
        Example: "Find formal shirts with full sleeves in pastel colors"
        """
        print(f"👔 Processing attribute-based query: '{query}'")

        # Generate text embedding for the query
        query_embedding = self.embedding_model.get_enhanced_text_embedding(query)

        # Search text index (product descriptions and attributes)
        scores, indices = self.vector_store.text_index.search(
            np.array([query_embedding]).astype('float32'), top_k
        )

        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx != -1:
                node = self.vector_store.text_metadata[idx]
                results.append({
                    **node,
                    'similarity_score': float(score),
                    'search_type': 'attribute_based'
                })

        return results

    def search_by_visual_similarity(self, image_path: str, top_k: int = 5) -> List[Dict]:
        """
        Visual similarity search for clothing.
        Example: Upload photo of a dress → find similar dresses
        """
        print(f"👗 Processing visual similarity query: {image_path}")

        # Generate image embedding for the uploaded image
        query_embedding = self.embedding_model.get_image_embedding(image_path)

        # Search image index (product images)
        scores, indices = self.vector_store.image_index.search(
            np.array([query_embedding]).astype('float32'), top_k
        )

        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx != -1:
                node = self.vector_store.image_metadata[idx]
                results.append({
                    **node,
                    'similarity_score': float(score),
                    'search_type': 'visual_similarity'
                })

        return results

    def search_with_style_transfer(self, image_path: str, text_constraints: str, top_k: int = 5) -> List[Dict]:
        """
        Advanced search combining visual style with text constraints.
        Example: Upload casual shirt + "show me formal versions with spread collar in navy"
        """
        print(f"✨ Processing style transfer query:")
        print(f"   Image: {image_path}")
        print(f"   Constraints: '{text_constraints}'")

        # Get embeddings for both image and text
        image_embedding = self.embedding_model.get_image_embedding(image_path)
        text_embedding = self.embedding_model.get_text_embedding(text_constraints)

        # Combine embeddings (simple average - could use learned fusion)
        combined_embedding = (image_embedding + text_embedding) / 2

        # Search cross-modal index
        scores, indices = self.vector_store.cross_modal_index.search(
            np.array([combined_embedding]).astype('float32'), top_k
        )

        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx != -1:
                node = self.vector_store.cross_modal_metadata[idx]
                results.append({
                    **node,
                    'similarity_score': float(score),
                    'search_type': 'style_transfer'
                })

        return results

    def generate_product_response(self, query: str, retrieved_products: List[Dict], 
                                query_image: str = None) -> str:
        """
        Generate helpful responses about clothing products using LLaVA.
        """
        print("🛍️ Generating product recommendations with LLaVA...")

        # Prepare context from retrieved products
        context_text = "Retrieved Products:\n"
        context_images = []

        for i, product in enumerate(retrieved_products):
            context_text += f"\n--- Product {i+1} ---\n"
            context_text += f"Description: {product['content']}\n"

            # Add product attributes if available
            if 'attributes' in product['metadata']:
                attrs = product['metadata']['attributes']
                for key, value in attrs.items():
                    context_text += f"{key.title()}: {value}\n"

            if 'image_path' in product and product['image_path']:
                context_text += f"[Product image available]\n"
                context_images.append(product['image_path'])

        # Create clothing-specific prompt
        enhanced_prompt = f"""
        Based on the clothing products found in our catalog, please provide helpful recommendations for the user.

        {context_text}

        User Query: {query}

        Instructions:
        - Recommend the most suitable products from the retrieved items
        - Highlight key features like style, color, material, and price when available
        - If the user uploaded an image, compare it with the found products
        - Provide styling suggestions and explain why certain items match their request
        - Be helpful and fashion-forward in your recommendations
        """

        # Use LLaVA for multimodal product recommendation
        if query_image or context_images:
            images_to_process = []
            if query_image:
                images_to_process.append(query_image)
            if context_images:
                images_to_process.extend(context_images[:3])  # Limit to prevent overload

            # Wrap image paths in ImageDocument objects, as the multimodal LLM expects
            response = self.llava.complete(
                prompt=enhanced_prompt,
                image_documents=[ImageDocument(image_path=p) for p in images_to_process]
            )
        else:
            # Text-only fallback (the interface still expects an image_documents argument)
            response = self.llava.complete(prompt=enhanced_prompt, image_documents=[])

        return str(response)

Step 7: Complete Clothing Catalog Example

import os

from llama_index.multi_modal_llms.ollama import OllamaMultiModal


def demonstrate_clothing_catalog_rag():
    """
    Complete demonstration of multimodal RAG for clothing catalog search.
    Shows three types of queries in a practical e-commerce context.
    """

    # Initialize the clothing catalog system
    print("🚀 Initializing Clothing Catalog Multimodal RAG System...")
    print("This system can search through clothing catalogs using text, images, or both")

    processor = ClothingCatalogProcessor()
    vector_store = MultimodalVectorStore()

    # Initialize LLaVA for product recommendations
    llava = OllamaMultiModal(model="llava:13b")
    query_engine = ClothingCatalogQueryEngine(vector_store, llava)

    # Process clothing catalog PDFs
    print("\n📚 Processing clothing catalogs...")
    catalog_files = [
        "mens_formal_catalog.pdf",      # Business shirts, suits, dress pants
        "womens_casual_catalog.pdf",    # Dresses, tops, jeans, accessories  
        "kids_clothing_catalog.pdf"     # Children's wear for all occasions
    ]

    for catalog_path in catalog_files:
        if os.path.exists(catalog_path):
            print(f"  Processing {catalog_path}...")
            processed_data = processor.process_clothing_catalog(catalog_path)

            # Build searchable indexes for different query types
            vector_store.add_text_nodes(processed_data['text_nodes'])        # Attribute-based search
            vector_store.add_image_nodes(processed_data['image_nodes'])      # Visual similarity search
            vector_store.add_cross_modal_nodes(processed_data['product_nodes']) # Combined search

            print(f"    Indexed {len(processed_data['text_nodes'])} product descriptions")
            print(f"    Indexed {len(processed_data['image_nodes'])} product images")
            print(f"    Created {len(processed_data['product_nodes'])} complete product profiles")

    print(f"\n✅ Clothing catalog indexed successfully!")
    print(f"Total products searchable by text: {vector_store.text_index.ntotal}")
    print(f"Total products searchable by image: {vector_store.image_index.ntotal}")
    print(f"Cross-modal search index size: {vector_store.cross_modal_index.ntotal}")

    print("\n" + "="*80)
    print("🛍️ DEMONSTRATION: Clothing Catalog Search Types")
    print("="*80)

    # 1. ATTRIBUTE-BASED SEARCH (Text Only)
    print("\n1️⃣ ATTRIBUTE-BASED SEARCH")
    print("-" * 50)
    print("Use case: Customer knows what they want and describes it")

    attribute_query = "Find formal shirts with full sleeves in pastel colors under $50"
    print(f"Query: '{attribute_query}'")
    print("Search method: Text embedding → Product descriptions → Filtered results")

    attribute_results = query_engine.search_by_attributes(attribute_query, top_k=3)
    print(f"✅ Found {len(attribute_results)} matching products")

    for i, result in enumerate(attribute_results):
        attrs = result['metadata'].get('attributes', {})
        print(f"  Product {i+1}: {result['content'][:80]}...")
        print(f"    Price: {attrs.get('price', 'N/A')} | Colors: {attrs.get('colors', 'N/A')}")

    response = query_engine.generate_product_response(attribute_query, attribute_results)
    print(f"🤖 Recommendation: {response[:200]}...")

    # 2. VISUAL SIMILARITY SEARCH (Image Only)
    print("\n2️⃣ VISUAL SIMILARITY SEARCH")
    print("-" * 50)
    print("Use case: Customer likes a style and wants to find similar items")

    query_image = "customer_shirt_photo.jpg"  # Customer uploads this
    print(f"Query: [Customer uploads photo: {query_image}]")
    print("Search method: Image embedding → Visual similarity → Similar products")

    if os.path.exists(query_image):
        visual_results = query_engine.search_by_visual_similarity(query_image, top_k=3)
        print(f"✅ Found {len(visual_results)} visually similar products")

        for i, result in enumerate(visual_results):
            attrs = result['metadata'].get('attributes', {})
            print(f"  Similar Product {i+1}: {result['content']}")
            print(f"    Visual similarity score: {result['similarity_score']:.3f}")
            print(f"    Available in: {attrs.get('colors', 'Multiple colors')}")

        response = query_engine.generate_product_response(
            "I like this style, show me similar items available in your catalog", 
            visual_results, 
            query_image=query_image
        )
        print(f"🤖 Recommendation: {response[:200]}...")
    else:
        print("⚠️  Sample customer image not found - upload a clothing photo for this demo")

    # 3. STYLE TRANSFER SEARCH (Image + Text)
    print("\n3️⃣ STYLE TRANSFER SEARCH")
    print("-" * 50)
    print("Use case: Customer has style reference but wants specific modifications")

    style_query = "Show me formal shirts similar to this casual one but with spread collar in navy blue"
    print(f"Text constraint: '{style_query}'")
    print(f"Style reference: [casual shirt image]")
    print("Search method: Combined embedding → Style transfer → Modified recommendations")

    if os.path.exists(query_image):
        style_results = query_engine.search_with_style_transfer(
            query_image, style_query, top_k=5
        )
        print(f"✅ Found {len(style_results)} products matching style + constraints")

        for i, result in enumerate(style_results):
            modal_type = result['metadata'].get('modal_type', 'product')
            attrs = result['metadata'].get('attributes', {})
            print(f"  Style Match {i+1} ({modal_type}): {result['content'][:80]}...")
            print(f"    Style score: {result['similarity_score']:.3f}")
            if attrs.get('colors'):
                print(f"    Available colors: {attrs['colors']}")

        response = query_engine.generate_product_response(
            style_query, style_results, query_image=query_image
        )
        print(f"🤖 Style Transfer Recommendation: {response[:200]}...")

    print("\n" + "="*80)
    print("🎉 Clothing Catalog RAG Demonstration Complete!")
    print("="*80)
    print("\nKey capabilities demonstrated:")
    print("✅ Attribute-based search: 'Find blue formal shirts under $50'")
    print("✅ Visual similarity: Upload photo → Find similar styles")
    print("✅ Style transfer: 'Like this but more formal and in navy'")
    print("✅ Cross-modal search: Text ↔ Images seamlessly")
    print("✅ LLaVA recommendations: Intelligent product suggestions")

if __name__ == "__main__":
    demonstrate_clothing_catalog_rag()

Real-World Use Case: Smart Fashion Assistant

class SmartFashionAssistant:
    """
    A user-friendly interface for the clothing catalog multimodal RAG system.
    Demonstrates realistic customer interactions.
    """

    def __init__(self):
        self.query_engine = ClothingCatalogQueryEngine(
            vector_store=MultimodalVectorStore(),
            llava_model=OllamaMultiModal(model="llava:13b")
        )
        self.setup_catalog()

    def setup_catalog(self):
        """Load and index clothing catalogs."""
        print("📚 Loading fashion catalog...")
        # This would process actual catalog PDFs with product images and descriptions
        pass

    def handle_customer_queries(self):
        """Demonstrate realistic customer interaction scenarios."""

        print("\n🛍️ SMART FASHION ASSISTANT SCENARIOS")
        print("="*60)

        # Scenario 1: Specific attribute search
        print("\n📝 SCENARIO 1: Attribute-Based Shopping")
        print("-" * 40)
        print("Customer: 'I need a white button-down shirt for work interviews'")

        query = "white button-down shirt formal interview professional"
        results = self.query_engine.search_by_attributes(query)

        print("Assistant finds:")
        for result in results[:2]:
            attrs = result['metadata'].get('attributes', {})
            print(f"  • {attrs.get('category', 'Shirt')}: {attrs.get('colors', 'White')}")
            print(f"    Price: {attrs.get('price', '$45')} | Material: {attrs.get('material', 'Cotton')}")

        # Scenario 2: Visual inspiration
        print("\n📸 SCENARIO 2: Visual Style Matching")
        print("-" * 40)
        print("Customer uploads Instagram photo: 'I love this outfit, do you have anything similar?'")

        inspiration_image = "instagram_outfit.jpg"
        if os.path.exists(inspiration_image):
            results = self.query_engine.search_by_visual_similarity(inspiration_image)
            print("Assistant finds visually similar items:")
            for result in results[:2]:
                print(f"  • Similar style: {result['content'][:60]}...")
                print(f"    Match confidence: {result['similarity_score']:.1%}")

        # Scenario 3: Style modification
        print("\n✨ SCENARIO 3: Style Transfer Request")
        print("-" * 40)
        print("Customer: 'I like this casual dress but need something more formal for a wedding'")

        casual_dress_image = "casual_dress_reference.jpg"
        formal_constraint = "formal wedding guest dress elegant"

        if os.path.exists(casual_dress_image):
            results = self.query_engine.search_with_style_transfer(
                casual_dress_image, formal_constraint
            )
            print("Assistant finds formal alternatives with similar style elements:")
            for result in results[:2]:
                attrs = result['metadata'].get('attributes', {})
                print(f"  • Formal version: {attrs.get('category', 'Dress')}")
                print(f"    Occasion: {attrs.get('occasion', 'Formal events')}")
                print(f"    Colors: {attrs.get('colors', 'Multiple options')}")

# Demonstrate the assistant
assistant = SmartFashionAssistant()
assistant.handle_customer_queries()

Final Thoughts

Multimodal RAG systems represent a significant step forward in building AI applications that can understand and reason about the complex, multi-faceted information that defines our digital world. Through our clothing catalog example, we've seen how these systems can seamlessly integrate text and visual understanding to create powerful, practical applications.

The implementation we've covered provides a solid foundation for building multimodal RAG systems that can handle real-world use cases. Whether you're building e-commerce search systems, content discovery platforms, or any application that needs to understand both text and images, the principles and code examples in this guide will help you create powerful, efficient, and scalable multimodal AI applications.

The clothing catalog use case perfectly demonstrates how multimodal RAG can solve real business problems by enabling natural, intuitive search experiences that match how users actually think about visual content. As you build your own systems, remember to focus on the specific needs of your domain and users, while leveraging the technical capabilities we've explored.
