Building a Multimodal Video Search App With VideoDB, FastAPI, OpenAI LLM & Streamlit

Ved Vekhande
18 min read

Introduction

Ever find yourself lost in a sea of videos, just wishing there was an easier way to find that one perfect moment? You're not alone. These days, we're creating and sharing more video than ever before: billions of hours, every single day. But searching through all that content still feels stuck in the past, like flipping through an ancient card catalog for an answer that should be at your fingertips.

What if searching through video was finally as smart (and as easy) as searching a conversation, where you could just say, "Show me the part where the sun sets behind the mountains" and instantly get exactly what you need?

I developed this project as part of the recent AI Demos x VideoDB hackathon: https://aidemos.com/ai-hackathons.


What This Codebase Delivers

This codebase demonstrates a production-ready multimodal video search platform that enables users to upload videos from various sources (YouTube, direct URLs) and perform sophisticated natural language queries that combine both spoken and visual content criteria. The system automatically processes videos to extract speech transcripts and visual scene descriptions using VideoDB, then uses AI to understand complex queries like:

"Show me where the narrator discusses solar system formation while showing the Milky Way galaxy."

Key capabilities include:

  • Semantic and keyword search across both audio and visual content

  • AI-powered query understanding that separates spoken and visual components

  • Real-time video streaming of relevant segments

  • Intersection and union operations for multimodal results

  • Scalable architecture ready for enterprise deployment


Architecture Overview

The application is split into a FastAPI backend and a Streamlit frontend. This architecture allows for modularity, scalability, and technology replacement (e.g., swapping Streamlit for React).

Directory Map

multimodal-video-search/
├── backend/
│   ├── main.py                    # FastAPI entrypoint
│   ├── api/
│   │   ├── video_routes.py        # Video upload and processing routes
│   │   └── search_routes.py       # Multimodal search routes
│   ├── services/
│   │   ├── videodb_service.py     # VideoDB SDK integration
│   │   ├── openai_service.py      # OpenAI API and prompt engineering
│   │   └── search_service.py      # Search service on videos
│   ├── models/
│   │   ├── search_models.py       # Pydantic models for requests and responses
│   │   └── video_models.py        # Pydantic models for video data storage
│   ├── config.py                  # Environment configuration using Pydantic
│   └── logging_config.py          # Rotating file logger for production logging
│
├── frontend/
│   ├── app.py                     # Streamlit main entrypoint
│   ├── pages/
│   │   ├── 1_upload_video.py      # Upload video page
│   │   └── 2_search_interface.py  # Search interface page
│   ├── components/
│   │   ├── video_uploader.py      # Video uploader component
│   │   └── search_interface.py    # Search input + results display
│   └── utils/
│       └── api_client.py          # API client for calling FastAPI backend
│
├── shared/
│   └── constants.py               # Constant variables
│
├── .env                           # Environment variables (OpenAI & VideoDB keys)
├── requirements.txt               # Library requirements for the project
├── README.md                      # Project documentation
└── logs/                          # Log files

Backend Services Architecture

FastAPI Main Application (main.py)

"""FastAPI main application."""

@app.get("/")
async def root():
    """Root endpoint."""
    return {
        "message": "Multimodal Video Search API",
        "version": "1.0.0",
        "docs": "/docs"
    }

The FastAPI application uses router-based organization to separate video management from search functionality. This modular approach supports team collaboration and independent service scaling.
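
For reference, a minimal sketch of how these routers might be wired into main.py is shown below; the module paths follow the directory map above, while the /api/videos and /api/search URL prefixes are my assumptions rather than values confirmed by the repository.

# Sketch: router registration in backend/main.py (URL prefixes are assumed).
from api.video_routes import router as video_router
from api.search_routes import router as search_router

app.include_router(video_router, prefix="/api/videos", tags=["videos"])
app.include_router(search_router, prefix="/api/search", tags=["search"])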


API Route Handlers (api/ Directory)

  1. The video routes demonstrate asynchronous processing patterns that are critical for handling long-running video operations (video_routes.py):

     """Video management API routes."""
    
     @router.post("/upload", response_model=VideoUploadResponse)
     async def upload_video(request: VideoUploadRequest, background_tasks: BackgroundTasks):
         """Upload a video from URL and start indexing."""
         try:
             video_id, video_info = await videodb_service.upload_video(
                 url=str(request.url),
                 title=request.title
             )
    
             return VideoUploadResponse(
                 video_id=video_id,
                 status=VideoStatus.INDEXING,
                 title=request.title,
                 message="Video uploaded successfully. Indexing in progress."
             )
    
         except Exception as e:
             logger.error(f"Error uploading video: {e}")
             raise HTTPException(status_code=500, detail=str(e))
    

    Why background tasks: Video processing can take minutes for long content. The background-task pattern lets the API respond to users immediately while indexing continues asynchronously, which prevents request timeouts and improves perceived performance. (The frontend later polls indexing progress; a sketch of a companion status-check route appears after this list.)

  2. The search routes handle complex multimodal queries (search_routes.py):

     """Search API routes."""
    
     @router.post("/multimodal", response_model=MultimodalSearchResponse)
     async def multimodal_search(request: MultimodalSearchRequest):
         """Perform multimodal search across videos and generate AI answer."""
         try:
             results = await search_service.multimodal_search(request)
             return results
         except Exception as e:
             logger.error(f"Error in multimodal search: {e}")
             raise HTTPException(status_code=500, detail=str(e))
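
Because indexing continues after the upload response is returned, the frontend needs a way to poll progress (the Streamlit uploader later calls api_client.get_video_status). A minimal sketch of such a status route, which would live alongside the upload route in video_routes.py, is shown below; the videodb_service.get_video_status helper it delegates to is hypothetical and stands in for however the repository actually reports indexing state.

@router.get("/status/{video_id}")
async def get_video_status(video_id: str):
    """Return the indexing status for a video (sketch)."""
    try:
        # Hypothetical service helper that inspects the spoken-word and scene indexes.
        status = videodb_service.get_video_status(video_id)
        return {"video_id": video_id, "status": status}
    except Exception as e:
        logger.error(f"Error fetching status for video {video_id}: {e}")
        raise HTTPException(status_code=404, detail=str(e))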
    

Logic Services (services/ Directory)

VideoDB Service (videodb_service.py)

The code below shows the VideoDB integration in our application.

"""VideoDB integration service."""

class VideoDBService:

    def _initialize_connection(self):
        """Initialize VideoDB connection and handle collection management."""
        try:
            # Connect to VideoDB
            self.conn = connect(api_key=settings.video_db_api_key)
            logger.info("Connected to VideoDB successfully")

            # Try to find existing collection by name
            existing_collection = self._find_collection_by_name(settings.videodb_collection_name)

            if existing_collection:
                # Use existing collection
                self.collection = self.conn.get_collection(existing_collection.id)
                logger.info(f"Found existing collection: {existing_collection.name} (ID: {existing_collection.id})")
            else:
                # Create new collection
                self.collection = self._create_new_collection()
                logger.info(f"Created new collection: {self.collection.name} (ID: {self.collection.id})")

        except Exception as e:
            logger.error(f"Failed to initialize VideoDB connection: {e}")
            raise

    def list_all_collections(self) -> list:
        """List all collections in the account."""
        try:
            collections = self.conn.get_collections()
            collection_info = []
            for c in collections:
                collection_info.append({
                    "id": c.id,
                    "name": c.name,
                    "description": getattr(c, 'description', 'No description available')
                })
            return collection_info
        except Exception as e:
            logger.error(f"Error listing collections: {e}")
            return []

    async def upload_video(self, url: str, title: Optional[str] = None) -> Tuple[str, VideoInfo]:
        """Upload and index a video."""
        try:
            # Upload video
            video = self.collection.upload(url=url)
            video_id = video.id

            # Create video info
            video_info = VideoInfo(
                video_id=video_id,
                title=title or f"Video {video_id}",
                source_url=url,
                status=VideoStatus.INDEXING,
                upload_date=datetime.now()
            )

            # Start indexing in background
            asyncio.create_task(self._index_video(video))

            return video_id, video_info

        except Exception as e:
            logger.error(f"Error uploading video: {e}")
            raise

    async def _index_video(self, video):
        """Index video for spoken words and scenes."""
        try:
            # Index spoken words
            video.index_spoken_words()

            # Index scenes
            scene_index_id = video.index_scenes(
                extraction_type=SceneExtractionType.time_based,
                extraction_config={
                    "time": settings.scene_extraction_time,
                    "select_frames": ['first', 'last']
                },
                prompt="Describe the scene in detail including objects, people, actions, and environment"
            )

            logger.info(f"Video {video.id} indexed successfully")
            return scene_index_id

        except Exception as e:
            logger.error(f"Error indexing video {video.id}: {e}")
            raise

    def search_spoken_content(self, video_id: str, query: str, search_type: str = "semantic"):
        """Search spoken content in video."""
        try:
            video = self.collection.get_video(video_id)
            results = video.search(
                query=query,
                index_type=IndexType.spoken_word,
                search_type=SearchType.semantic if search_type == "semantic" else SearchType.keyword
            )
            return results
        except Exception as e:
            logger.error(f"Error searching spoken content: {e}")
            raise

    def search_visual_content(self, video_id: str, query: str, search_type: str = "semantic"):
        """Search visual content in video."""
        try:
            video = self.collection.get_video(video_id)
            results = video.search(
                query=query,
                index_type=IndexType.scene,
                search_type=SearchType.semantic if search_type == "semantic" else SearchType.keyword,
                score_threshold=settings.score_threshold,
                dynamic_score_percentage=settings.dynamic_score_percentage
            )
            return results
        except Exception as e:
            logger.error(f"Error searching visual content: {e}")
            raise

To efficiently manage and organize video data, we utilize collections within VideoDB. The following utility methods abstract common operations such as retrieving, creating, or listing collections:

  • _find_collection_by_name:
    Searches for a collection by its name among all existing collections. If found, it returns the collection object; otherwise, returns None.

  • _create_new_collection:
    Creates a new collection in VideoDB using the name specified in the environment config (settings.videodb_collection_name). This ensures our video data is logically grouped under a dedicated namespace.

  • get_collection_info:
    Returns metadata (ID, name, description) of the currently active collection. Useful for debugging or displaying collection context in the frontend.

  • list_all_collections:
    Retrieves and returns metadata of all collections available in the current VideoDB workspace. This is helpful for administrative interfaces or bulk inspection.

  • get_collection_by_id:
    Fetches a specific collection using its unique ID and logs basic information like name and ID for traceability.

πŸ” These methods ensure that our system is robust, traceable, and scalable β€” supporting multiple video datasets with minimal code changes.

To power our real-time video search platform, we define a central VideoDBService that orchestrates the video ingestion and indexing lifecycle using the VideoDB SDK. Below are the core capabilities:

🔼 upload_video(url: str, title: Optional[str])

This asynchronous method uploads a video from a given URL and initializes a background task to index its spoken and visual content. It returns both the video_id and structured VideoInfo metadata.

  • Video metadata like title, upload date, and status are tracked.

  • Indexing is offloaded using asyncio.create_task to avoid blocking the main thread.

βš™οΈ _index_video(video)

Runs in the background after upload to extract:

  • Spoken content, using speech-to-text models.

  • Visual scenes, based on a configurable time-based strategy (e.g., every 10 seconds), capturing both the first and last frames of each scene with detailed prompts.

🔎 search_spoken_content(video_id, query, search_type)

Enables semantic or keyword-based search over the spoken word index of a specific video. It uses IndexType.spoken_word and SearchType.semantic or keyword to retrieve relevant segments.

πŸ–ΌοΈ search_visual_content(video_id, query, search_type)

Similarly, allows users to search across indexed visual scenes using detailed visual descriptions. It supports configurable scoring (score_threshold, dynamic_score_percentage) for fine-grained control over retrieval relevance.

✅ Result: This multimodal search flow allows querying "what was said" and "what was seen", enabling powerful use cases such as highlight generation, video Q&A, and surveillance analysis.


OpenAI Service (openai_service.py)

This service utilizes the OpenAI API to intelligently interpret and respond to user queries in the context of multimodal (spoken + visual) video data.

"""OpenAI service for query transformation."""

class OpenAIService:

    async def divide_query(self, query: str) -> QueryDivision:
        """Divide query into spoken and visual components using OpenAI."""
        try:
            response = self.client.chat.completions.create(
                model="gpt-4.1-mini",
                messages=[
                    {"role": "user", "content": self.transformation_prompt.format(query=query)}
                ],
                temperature=0.1
            )

            message = response.choices[0].message.content
            divided_query = message.strip().split("\n")

            spoken_query = divided_query[0].replace("Spoken:", "").strip()
            visual_query = divided_query[1].replace("Visual:", "").strip()

            return QueryDivision(
                spoken_query=spoken_query,
                visual_query=visual_query,
                original_query=query
            )

        except Exception as e:
            logger.error(f"Error dividing query: {e}")
            # Fallback: use original query for both
            return QueryDivision(
                spoken_query=query,
                visual_query=query,
                original_query=query
            )

    async def generate_answer_from_context(self, query: str, context_texts: List[Dict[str, Any]]) -> str:
        """Generate answer based on query and extracted video context with metadata."""
        try:
            # Limit context size to avoid token overflow
            max_contexts = 10
            formatted_context = []

            for item in context_texts[:max_contexts]:
                start = f"{item['start']:.1f}"
                end = f"{item['end']:.1f}"
                ctype = item.get("type", "unknown")
                text = item.get("text", "").strip().replace("\n", " ")
                formatted_context.append(f"[{start} - {end}] ({ctype}): {text}")

            combined_context = "\n".join(formatted_context)

            # Prompt construction omitted here (see GitHub); it builds `prompt` from the user query and combined_context.

            response = self.client.chat.completions.create(
                model="gpt-4.1-mini",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.3
            )

            return response.choices[0].message.content.strip()

        except Exception as e:
            logger.error(f"Error generating answer from context: {e}")
            return "Unable to generate answer due to processing error."

It offers two key functionalities:

1. divide_query(query: str)

✅ Purpose: Decomposes a user's query into two parts:

  • Spoken Query: Pertains to narration, dialogue, or audio commentary.

  • Visual Query: Pertains to images, actions, or graphical elements.

📌 Powered by a structured prompt and the gpt-4.1-mini model, this method ensures:

  • Strict output format parsing

  • Fallback behavior that reuses the original query for both parts in case of failure

📤 Returns: A QueryDivision object with spoken_query, visual_query, and original_query.
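
The exact transformation_prompt lives in the repository; a plausible version, written here to match the parsing logic above (two lines prefixed with Spoken: and Visual:), might look like this:

TRANSFORMATION_PROMPT = """You are a query analyst for a multimodal video search engine.
Split the user's query into what should be matched against the spoken transcript
and what should be matched against the visual scene descriptions.

Respond in exactly two lines, with no extra text:
Spoken: <query about narration or dialogue>
Visual: <query about on-screen content>

Query: {query}"""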

2. generate_answer_from_context(query: str, context_texts: List[Dict])

✅ Purpose: Generates a detailed and timestamp-aware answer using:

  • Extracted video segments (spoken/visual)

  • Associated timestamps and types (e.g., narration or on-screen action)

💡 The model is instructed to:

  • Ground its response in the provided data only

  • Include explicit references to timestamps and modality (e.g., spoken or visual)

🛡️ Includes error handling for token limits and malformed data.

📤 Returns: A natural-language response string or an error fallback message.
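
The answer-generation prompt itself is elided in the snippet above (the # Prompt comment); a hedged example of what it might contain, consistent with the behavior described here, is:

def build_answer_prompt(query: str, combined_context: str) -> str:
    """Hypothetical prompt builder matching the behavior described above."""
    return (
        "Answer the user's question using ONLY the video context below. "
        "Reference timestamps and say whether each piece of evidence is spoken or visual. "
        "If the context does not contain the answer, say so explicitly.\n\n"
        f"Question: {query}\n\n"
        f"Video context:\n{combined_context}"
    )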


Search Service (search_service.py)

To support nuanced multimodal search and summarization, our system includes a robust shot interval processing utility within the SearchService class.

"""Search orchestration service."""

class SearchService:

    def process_shots(self, l1: List[Tuple[float, float]], l2: List[Tuple[float, float]], operation: str, min_duration: float = 0.0) -> List[Tuple[float, float]]:
        """Process and combine shot intervals with duration filtering."""

        # Full implementation in the GitHub repo; a sketch appears after the feature list below.

    async def multimodal_search(self, request: MultimodalSearchRequest) -> MultimodalSearchResponse:
        """Perform multimodal search across videos and generate LLM answer."""
        try:
            # Divide query into spoken and visual components
            query_division = await openai_service.divide_query(request.query)
            results = []
            all_extracted_texts = []

            collection_name = videodb_service._find_collection_by_name(settings.videodb_collection_name)
            logger.info(f"Collection name: {collection_name}")

            # If no specific video IDs provided, search all videos
            video_ids = request.video_ids or self._get_all_video_ids(collection_name.id)
            logger.info(f"Video IDs: {video_ids}")

            for video_id in video_ids:
                try:
                    # Search spoken content
                    spoken_results = videodb_service.search_spoken_content(
                        video_id=video_id,
                        query=query_division.spoken_query,
                        search_type=request.search_type.value
                    )

                    # Search visual content
                    visual_results = videodb_service.search_visual_content(
                        video_id=video_id,
                        query=query_division.visual_query,
                        search_type=request.search_type.value
                    )

                    logger.info(f"Spoken results: {spoken_results}")
                    logger.info(f"Break\n\n\n")
                    logger.info(f"Visual results: {visual_results}")

                    # Extract text from shots
                    spoken_texts = []
                    visual_texts = []
                    all_extracted_texts_with_timestamps = []

                    for shot in spoken_results.get_shots():
                        if hasattr(shot, 'text') and shot.text:
                            text_with_timestamp = {
                                "text": shot.text,
                                "start": shot.start,
                                "end": shot.end,
                                "type": "spoken"
                            }
                            spoken_texts.append(text_with_timestamp)
                            all_extracted_texts_with_timestamps.append(text_with_timestamp)

                    for shot in visual_results.get_shots():
                        if hasattr(shot, 'text') and shot.text:
                            text_with_timestamp = {
                                "text": shot.text,
                                "start": shot.start,
                                "end": shot.end,
                                "type": "visual"
                            }
                            visual_texts.append(text_with_timestamp)
                            all_extracted_texts_with_timestamps.append(text_with_timestamp)

                    # Extract timestamps
                    spoken_timestamps = [(shot.start, shot.end) for shot in spoken_results.get_shots()]
                    visual_timestamps = [(shot.start, shot.end) for shot in visual_results.get_shots()]
                    logger.info(f"Spoken timestamps: {spoken_timestamps}")
                    logger.info(f"Visual timestamps: {visual_timestamps}")

                    # Combine spoken and visual segments
                    combined_segments = self.process_shots(
                        spoken_timestamps,
                        visual_timestamps,
                        request.combine_operation.value,
                        min_duration=0.0
                    )

                    logger.info(f"Combined segments: {combined_segments}")

                    if combined_segments:
                        total_matches = len(spoken_timestamps) + len(visual_timestamps)
                        similarity_score = len(combined_segments) / total_matches if total_matches else 0

                        if similarity_score > 0.3:
                            search_result = SearchResult(
                                video_id=video_id,
                                segments=[{"start": seg[0], "end": seg[1]} for seg in combined_segments],
                                total_score=similarity_score,
                                spoken_matches=len(spoken_timestamps),
                                visual_matches=len(visual_timestamps),
                                extracted_text=spoken_texts + visual_texts
                            )

                            # Generate stream URLs (one per segment)
                            try:
                                video_obj = videodb_service.collection.get_video(video_id)
                                search_result.stream_urls = [
                                    video_obj.generate_stream(timeline=[[start, end]])
                                    for start, end in combined_segments
                                ]
                                for i, (seg_start, seg_end) in enumerate(combined_segments):
                                    stream_url = search_result.stream_urls[i]
                                    for text_entry in search_result.extracted_text:
                                        if seg_start <= text_entry.start and text_entry.end <= seg_end:
                                            text_entry.stream_url = stream_url
                                            text_entry.start_time = self.format_seconds_to_timestamp(text_entry.start)
                                            text_entry.end_time = self.format_seconds_to_timestamp(text_entry.end)


                            except Exception as e:
                                logger.error(f"Failed to generate stream URLs for video {video_id}: {e}")
                                # results.append(search_result)
                            results.append(search_result)
                except Exception as e:
                    logger.error(f"Error searching video {video_id}: {e}")
                    continue

            logger.info(f"Results: {results}")

            # Sort by score
            results.sort(key=lambda x: x.total_score, reverse=True)
            results = results[:request.max_results]

            logger.info(f"All extracted texts: {all_extracted_texts_with_timestamps}")

            # Generate answer
            generated_answer = None
            if all_extracted_texts_with_timestamps:
                generated_answer = await openai_service.generate_answer_from_context(
                    query=request.query,
                    context_texts=all_extracted_texts_with_timestamps
                )

            logger.info(f"Generated answer: {generated_answer}")
            return MultimodalSearchResponse(
                results=results,
                query_division=query_division,
                total_results=len(results),
                search_params=request.dict(),
                generated_answer=generated_answer
            )

        except Exception as e:
            logger.error(f"Error in multimodal search: {e}")
            raise

The process_shots method enables flexible manipulation of video segments (shots) based on different logical operations:

🔧 process_shots: Key Features

  • Union & Intersection of Segments:
    Combines or intersects two lists of (start_time, end_time) tuples, typically from different modalities (e.g., visual and spoken cues).

  • Shot Merging Logic:
    Ensures that overlapping intervals are merged cleanly, removing redundancy and producing a set of non-overlapping segments.

  • Duration Filtering:
    Optional filtering keeps only meaningful shots (i.e., longer than min_duration), which is ideal for skipping noise or overly short segments.

  • Supported Operations (see the sketch below):

    • "union": Merges all intervals from both modalities.

    • "intersection": Returns only the overlapping parts where both modalities align.

πŸ” Multimodal Video Search Logic

The multimodal_search function performs an intelligent search across both spoken and visual content in videos, powered by LLM-assisted reasoning. Here’s a breakdown of its operations:

  1. Query Splitting: The user query is divided into spoken and visual parts using the OpenAI service, enabling specialized search logic for different modalities.

  2. Video Selection: If no video IDs are specified, the method fetches all videos from the configured collection.

  3. Content Search:

    • Spoken content is searched using text extracted from speech (e.g., transcripts).

    • Visual content is searched using the AI-generated scene descriptions created during indexing.

  4. Result Processing:

    • For each matched segment, it extracts relevant text with timestamps and tags (spoken/visual).

    • Timestamps are collected for both modalities to compute overlaps or combinations based on a chosen operation (e.g., union or intersection).

  5. Segment Combination & Scoring:

    • Combines spoken and visual timestamps into final segments.

    • Calculates a similarity score to filter out weak matches.

  6. Stream URL Generation:

    • For each final segment, generates a stream URL for video playback.

    • Stream URLs are embedded back into the text entries along with human-readable timestamps.

  7. LLM Answer Generation:

    • If results are found, an OpenAI model generates a natural-language answer using the extracted spoken/visual context.

  8. Response:

    • Returns structured results including matched segments, scores, stream URLs, and an LLM-generated answer.

Pydantic Models (models/ Directory)

Search Models (search_models.py)

This module defines the data models and enums used throughout the multimodal video search pipeline. Built with Pydantic, it ensures strong validation and structure across services. It includes:

  • MultimodalSearchRequest: Schema for incoming search queries.

  • QueryDivision: Holds split queries for spoken and visual modalities.

  • ExtractedText: Represents timestamped, typed content segments with optional stream URLs.

  • SearchResult: Encapsulates all relevant data returned per video match.

  • MultimodalSearchResponse: Combines all search results, parameters, and the final LLM-generated answer.

  • Enums SearchType and CombineOperation allow flexible search logic.

These models ensure consistent, typed communication between the backend services.
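
As a rough illustration (field names are inferred from how the backend code above uses these models, so details may differ from the repository), the schemas could look like this:

from enum import Enum
from typing import Any, Dict, List, Optional

from pydantic import BaseModel

class SearchType(str, Enum):
    semantic = "semantic"
    keyword = "keyword"

class CombineOperation(str, Enum):
    intersection = "intersection"
    union = "union"

class MultimodalSearchRequest(BaseModel):
    query: str
    video_ids: Optional[List[str]] = None
    search_type: SearchType = SearchType.semantic
    combine_operation: CombineOperation = CombineOperation.intersection
    max_results: int = 10

class QueryDivision(BaseModel):
    spoken_query: str
    visual_query: str
    original_query: str

class ExtractedText(BaseModel):
    text: str
    start: float
    end: float
    type: str
    stream_url: Optional[str] = None
    start_time: Optional[str] = None
    end_time: Optional[str] = None

class SearchResult(BaseModel):
    video_id: str
    segments: List[Dict[str, float]]
    total_score: float
    spoken_matches: int
    visual_matches: int
    extracted_text: List[ExtractedText] = []
    stream_urls: List[str] = []

class MultimodalSearchResponse(BaseModel):
    results: List[SearchResult]
    query_division: QueryDivision
    total_results: int
    search_params: Dict[str, Any]
    generated_answer: Optional[str] = None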


Video Models (video_models.py)

This module defines Pydantic models for managing video ingestion, indexing, and metadata operations. It includes:

  • Enums

    • VideoStatus: Captures the current state of a video, from upload to indexing, to ready for search or error.

  • Request & Response Models

    • VideoUploadRequest: Accepts a video URL and optional title/description.

    • VideoUploadResponse: Confirms upload success, returns video_id, status, and optional title/message.

  • Metadata Model

    • VideoInfo: Stores full video metadata, including title, description, upload date, duration, source/thumbnail URLs, and indexing status.

  • Indexing Status

    • IndexStatus: Flags indicating whether spoken-word transcripts or visual scenes are indexed, and stores the associated scene_index_id.

These models serve as the backbone of the video pipeline and ensure structured communication between video ingestion, processing, and retrieval systems.
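
Again as an illustration only (field names inferred from the upload and indexing code above, so the repository may differ), these models might look roughly like:

from datetime import datetime
from enum import Enum
from typing import Optional

from pydantic import BaseModel, HttpUrl

class VideoStatus(str, Enum):
    UPLOADING = "uploading"
    INDEXING = "indexing"
    READY = "ready"
    ERROR = "error"

class VideoUploadRequest(BaseModel):
    url: HttpUrl
    title: Optional[str] = None
    description: Optional[str] = None

class VideoUploadResponse(BaseModel):
    video_id: str
    status: VideoStatus
    title: Optional[str] = None
    message: Optional[str] = None

class IndexStatus(BaseModel):
    spoken_indexed: bool = False
    scenes_indexed: bool = False
    scene_index_id: Optional[str] = None

class VideoInfo(BaseModel):
    video_id: str
    title: str
    description: Optional[str] = None
    source_url: str
    status: VideoStatus
    upload_date: datetime
    duration: Optional[float] = None
    thumbnail_url: Optional[str] = None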


Frontend

The frontend of the application is a Streamlit interface that calls the FastAPI endpoints to perform each task. It could later be swapped out for a React app.

Streamlit Main Application (app.py)

This Streamlit frontend is the main landing page for the Multimodal Video Search app. It introduces the platform's key features (video upload, smart search, and rich results) through a clean, interactive UI. It includes quick action buttons to navigate to upload or search pages, describes the backend processing (upload → indexing → multimodal search), and shows real-time backend connectivity and usage stats in the sidebar.

"""Main Streamlit application."""

st.set_page_config(
    page_title="Multimodal Video Search",
    page_icon="🎬",
    layout="wide",
    initial_sidebar_state="expanded"
)

# Feature overview
col1, col2, col3 = st.columns(3)

st.divider()

# Quick actions
st.subheader("🚀 Quick Actions")

col1, col2, col3 = st.columns(3)

with col1:
    if st.button("📹 Upload New Video", use_container_width=True):
        st.switch_page("pages/1_Upload_Video.py")

with col2:
    if st.button("🔍 Search Videos", use_container_width=True):
        st.switch_page("pages/2_Search_Videos.py")

Snapshot of Home Page


Components (components/ Directory)

Video Uploading Interface (video_uploader.py)

🎥 video_uploader.py: Video Upload UI for Streamlit

This module provides a Streamlit interface to upload and index videos for search:

  • Accepts a video URL and optional title.

  • Validates and submits the upload to the backend via api_client.

  • Displays success/failure messages and tracks recent uploads in session state.

  • Allows checking indexing status of each uploaded video.

"""Video upload component."""

def video_uploader():
    """Render video upload interface."""
    st.header("πŸ“Ή Upload Video")

    # URL input
    url = st.text_input(
        "Enter YouTube URL or video URL:",
        placeholder="https://www.youtube.com/watch?v=..."
    )

    # Title input
    title = st.text_input(
        "Video Title (optional):",
        placeholder="Enter a descriptive title"
    )

    # Upload button
    if st.button("Upload and Index Video", type="primary"):
        if not url:
            st.error("Please enter a video URL")
            return

        if not validators.url(url):
            st.error("Please enter a valid URL")
            return

        try:
            with st.spinner("Uploading and indexing video..."):
                api_client = get_api_client()
                result = api_client.upload_video(url, title)

                st.success(f"Video uploaded successfully! Video ID: {result['video_id']}")
                st.info("The video is being indexed for search. This may take a few minutes.")

                # Store video info in session state
                if 'uploaded_videos' not in st.session_state:
                    st.session_state.uploaded_videos = []

                st.session_state.uploaded_videos.append({
                    'video_id': result['video_id'],
                    'title': title or f"Video {result['video_id']}",
                    'url': url,
                    'status': result['status']
                })

        except Exception as e:
            st.error(f"Error uploading video: {str(e)}")

    # Show recently uploaded videos
    if 'uploaded_videos' in st.session_state and st.session_state.uploaded_videos:
        st.subheader("Recently Uploaded Videos")
        for video in st.session_state.uploaded_videos[-5:]:  # Show last 5
            with st.expander(f"πŸ“Ή {video['title']}"):
                st.write(f"**Video ID:** {video['video_id']}")
                st.write(f"**URL:** {video['url']}")
                st.write(f"**Status:** {video['status']}")

                # Check status button
                if st.button(f"Check Status", key=f"status_{video['video_id']}"):
                    try:
                        api_client = get_api_client()
                        status = api_client.get_video_status(video['video_id'])
                        st.json(status)
                    except Exception as e:
                        st.error(f"Error checking status: {str(e)}")

Snapshot of Upload Video Page


Search Interface Component (search_interface.py)

πŸ” search_interface.py: Multimodal Search UI & Results Renderer

This module defines the user interface and result display for multimodal video search in Streamlit:

  • search_interface(): Builds an interactive form for users to submit queries, choose search modes (semantic or keyword), and optionally filter by video IDs.

  • display_search_results(): Renders the generated answer, query decomposition, and detailed matched segments, including embedded video players with HLS support and matched transcript snippets.

"""Search interface component."""

def search_interface():
    """Render search interface."""
    st.header("πŸ” Multimodal Video Search")

    # Search query input
    query = st.text_area(
        "Enter your search query:",
        placeholder="Show me where the narrator discusses the formation of the solar system and visualize the milky way galaxy",
        height=100
    )

    # Search options
    col1, col2 = st.columns(2)

    with col1:
        search_type = st.selectbox(
            "Search Type:",
            ["semantic", "keyword"],
            help="Semantic search uses AI to understand meaning, keyword search looks for exact matches"
        )

    with col2:
        combine_operation = st.selectbox(
            "Combine Results:",
            ["intersection", "union"],
            help="Intersection shows segments matching both spoken and visual criteria, union shows all matches"
        )

    # Advanced options
    with st.expander("Advanced Options"):
        video_ids = st.text_input(
            "Specific Video IDs (comma-separated, leave empty for all):",
            placeholder="video_id_1, video_id_2"
        )

        max_results = st.slider(
            "Maximum Results:",
            min_value=1,
            max_value=50,
            value=10
        )

    # Search button
    if st.button("πŸ”Ž Search Videos", type="primary"):
        if not query.strip():
            st.error("Please enter a search query")
            return

        try:
            with st.spinner("Searching videos..."):
                api_client = get_api_client()

                # Parse video IDs if provided
                video_id_list = None
                if video_ids.strip():
                    video_id_list = [vid.strip() for vid in video_ids.split(",")]

                # Perform search
                results = api_client.search_videos(
                    query=query,
                    video_ids=video_id_list,
                    search_type=search_type,
                    combine_operation=combine_operation
                )

                # Store results in session state
                st.session_state.search_results = results
                st.session_state.current_query = query

                st.success(f"Found {results['total_results']} results!")

        except Exception as e:
            st.error(f"Search error: {str(e)}")

Snapshot of Search Page


Streamlit Pages (pages/ Directory)

Video Upload Function (1_upload_video.py)

This Streamlit page allows users to upload videos by URL and initiate indexing for search:

  • Uses the video_uploader component to handle input and backend communication.

  • Includes expandable instructions explaining supported formats and processing time.

  • Provides a clear and informative UI with upload tips and status guidance.

"""Upload video page.""" 
st.set_page_config(
    page_title="Upload Video - Multimodal Search",
    page_icon="πŸ“Ή",
    layout="wide"
)

st.title("πŸ“Ή Video Upload")
st.markdown("Upload videos from YouTube or other sources to make them searchable.")

video_uploader()

Video Search Function (2_video_search.py)

This Streamlit page enables users to perform multimodal searches across uploaded videos:

  • Integrates search_interface() to capture user queries.

  • Displays results via display_search_results(), supporting both spoken and visual content.

  • Automatically renders recent search results stored in session state.

"""Search videos page."""

st.title("πŸ” Multimodal Video Search")
st.markdown("Search through your video library using both spoken content and visual elements.")

# Search interface
search_interface()

st.divider()

# Results section
if 'search_results' in st.session_state:

    display_search_results()

else:
    display_search_results()

🚀 Future Enhancements & Expansion Opportunities

Building on the current multimodal video search foundation, several high-impact extensions can unlock new use cases and commercial potential using VideoDB:

πŸ” Content Intelligence Extensions

  • Advanced Video Analytics: Integrate emotion detection, object tracking, logo/brand recognition, and activity recognition for deeper scene understanding.

  • Multilingual Capabilities: Enable cross-language search and translation using OpenAI and Whisper for a globally scalable solution.

🏢 Enterprise Integrations

  • Workflow Automation: Embed search into Slack, Teams, or CRMs to surface relevant videos contextually during user workflows.

  • Learning & Compliance: Integrate with LMS and DMS platforms to streamline educational discovery and safety compliance checks.

🎥 Media & Entertainment Use Cases

  • Highlight Generation: Auto-create sports/event highlights from speech + action cues.

  • Live Broadcast Intelligence: Detect key moments or anomalies in real-time video streams.

  • Content Moderation & Copyright: Monitor and flag copyrighted or restricted content across platforms.

πŸ₯ Healthcare & Industrial Training

  • Procedure Evaluation: Analyze surgical/technical footage for training, QA, and certification.

  • Skill Assessments: Evaluate trainee performance via structured video tasks.

  • Remote Expert Support: Recommend content dynamically during video-based remote consultations.


I would encourage more of you to build on VideoDB and participate in the hackathon here: https://aidemos.com/ai-hackathons/aidemos-videodb/submit

Written by

Ved Vekhande

I am a Data Science Intern at FutureSmart AI, where I work on projects related to LangChain, LlamaIndex, OpenAI, etc. I am a machine learning enthusiast with a passion for data. I am currently in my pre-final year, pursuing a Bachelor's in Computer Science at IIIT Vadodara ICD.