Building a Multimodal Video Search App With VideoDB, FastAPI, OpenAI LLM & Streamlit


Introduction
Ever find yourself lost in a sea of videos, just wishing there was an easier way to find that one perfect moment? You're not alone. These days, we're creating and sharing more video than ever before: billions of hours, every single day. But searching through all that content still feels stuck in the past, like flipping through an ancient card catalog for an answer that should be at your fingertips.
What if searching through video was finally as smart (and as easy) as searching a conversation, where you could just say, "Show me the part where the sun sets behind the mountains" and instantly get exactly what you need?
I developed this project as part of the recent AI Demos x VideoDB hackathon: https://aidemos.com/ai-hackathons.
What This Codebase Delivers
This codebase demonstrates a production-ready multimodal video search platform that enables users to upload videos from various sources (YouTube, direct URLs) and perform sophisticated natural language queries that combine both spoken and visual content criteria. The system automatically processes videos to extract speech transcripts and visual scene descriptions using VideoDB, then uses AI to understand complex queries like:
"Show me where the narrator discusses solar system formation while showing the Milky Way galaxy."
Key capabilities include:
Semantic and keyword search across both audio and visual content
AI-powered query understanding that separates spoken and visual components
Real-time video streaming of relevant segments
Intersection and union operations for multimodal results
Scalable architecture ready for enterprise deployment
Architecture Overview
This architecture allows modularity, scalability, and tech replacement (e.g., swap Streamlit with React).
Directory Map
multimodal-video-search/
├── backend/
│   ├── main.py                    # FastAPI entrypoint
│   ├── api/
│   │   ├── video_routes.py        # Video upload and processing routes
│   │   └── search_routes.py       # Multimodal search routes
│   ├── services/
│   │   ├── videodb_service.py     # VideoDB SDK integration
│   │   ├── openai_service.py      # OpenAI API and prompt engineering
│   │   └── search_service.py      # Search service on videos
│   ├── models/
│   │   ├── search_models.py       # Pydantic models for requests and responses
│   │   └── video_model.py         # Pydantic models for video data storage
│   ├── config.py                  # Environment configuration using Pydantic
│   └── logging_config.py          # Rotating file logger for production logging
│
├── frontend/
│   ├── app.py                     # Streamlit main entrypoint
│   ├── pages/
│   │   ├── 1_upload_video.py      # Upload video page
│   │   └── 2_search_interface.py  # Search interface page
│   ├── components/
│   │   ├── video_uploader.py      # Video uploader component
│   │   └── search_interface.py    # Search input + results display
│   └── utils/
│       └── api_client.py          # API client for calling FastAPI backend
├── shared/
│   └── constants.py               # Constant variables
│
├── .env                           # Environment variables (OpenAI & VideoDB keys)
├── requirements.txt               # Library requirements for the project
├── README.md                      # Project documentation
└── logs/                          # Log file
Backend Services Architecture
FastAPI Main Application (main.py)
"""FastAPI main application."""
@app.get("/")
async def root():
"""Root endpoint."""
return {
"message": "Multimodal Video Search API",
"version": "1.0.0",
"docs": "/docs"
}
The FastAPI application uses router-based organization to separate video management from search functionality. This modular approach supports team collaboration and independent service scaling.
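For reference, here is a minimal sketch of how the two routers might be wired into the application; the prefixes and tags are assumptions for illustration, not taken from the repository.
# backend/main.py (sketch): wiring the routers; prefixes/tags are illustrative
from fastapi import FastAPI
from api import video_routes, search_routes

app = FastAPI(title="Multimodal Video Search API", version="1.0.0")

# Each router owns its own path prefix so the two services can evolve independently.
app.include_router(video_routes.router, prefix="/videos", tags=["videos"])
app.include_router(search_routes.router, prefix="/search", tags=["search"])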
API Route Handlers (api/ Directory)
The video routes demonstrate asynchronous processing patterns critical for handling long-running video operations (video_routes.py):
"""Video management API routes."""
@router.post("/upload", response_model=VideoUploadResponse)
async def upload_video(request: VideoUploadRequest, background_tasks: BackgroundTasks):
    """Upload a video from URL and start indexing."""
    try:
        video_id, video_info = await videodb_service.upload_video(
            url=str(request.url),
            title=request.title
        )
        return VideoUploadResponse(
            video_id=video_id,
            status=VideoStatus.INDEXING,
            title=request.title,
            message="Video uploaded successfully. Indexing in progress."
        )
    except Exception as e:
        logger.error(f"Error uploading video: {e}")
        raise HTTPException(status_code=500, detail=str(e))
Why background tasks: Video processing can take minutes for long content. The background task pattern allows immediate response to users while processing continues asynchronously. This prevents timeout issues and improves perceived performance.
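As a hedged illustration of the pattern (not the repository's exact code), FastAPI's BackgroundTasks can schedule the slow indexing step to run after the response has been returned; index_video here is a stand-in for the real service call.
# Sketch: offloading indexing with FastAPI's BackgroundTasks.
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

def index_video(video_id: str) -> None:
    """Placeholder for the long-running indexing work."""
    ...

@app.post("/upload")
async def upload(url: str, background_tasks: BackgroundTasks):
    video_id = "demo-id"  # in the real route this comes from the VideoDB upload
    background_tasks.add_task(index_video, video_id)  # runs after the response is sent
    return {"video_id": video_id, "status": "indexing"}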
The search routes handle complex multimodal queries (search_routes.py):
"""Search API routes."""
@router.post("/multimodal", response_model=MultimodalSearchResponse)
async def multimodal_search(request: MultimodalSearchRequest):
    """Perform multimodal search across videos and generate AI answer."""
    try:
        results = await search_service.multimodal_search(request)
        return results
    except Exception as e:
        logger.error(f"Error in multimodal search: {e}")
        raise HTTPException(status_code=500, detail=str(e))
Logic Services (services/ Directory)
VideoDB Service (videodb_service.py)
The code below implements the VideoDB integration in our application.
"""VideoDB integration service."""
class VideoDBService:
def _initialize_connection(self):
"""Initialize VideoDB connection and handle collection management."""
try:
# Connect to VideoDB
self.conn = connect(api_key=settings.video_db_api_key)
logger.info("Connected to VideoDB successfully")
# Try to find existing collection by name
existing_collection = self._find_collection_by_name(settings.videodb_collection_name)
if existing_collection:
# Use existing collection
self.collection = self.conn.get_collection(existing_collection.id)
logger.info(f"Found existing collection: {existing_collection.name} (ID: {existing_collection.id})")
else:
# Create new collection
self.collection = self._create_new_collection()
logger.info(f"Created new collection: {self.collection.name} (ID: {self.collection.id})")
except Exception as e:
logger.error(f"Failed to initialize VideoDB connection: {e}")
raise
def list_all_collections(self) -> list:
"""List all collections in the account."""
try:
collections = self.conn.get_collections()
collection_info = []
for c in collections:
collection_info.append({
"id": c.id,
"name": c.name,
"description": getattr(c, 'description', 'No description available')
})
return collection_info
except Exception as e:
logger.error(f"Error listing collections: {e}")
return []
async def upload_video(self, url: str, title: Optional[str] = None) -> Tuple[str, VideoInfo]:
"""Upload and index a video."""
try:
# Upload video
video = self.collection.upload(url=url)
video_id = video.id
# Create video info
video_info = VideoInfo(
video_id=video_id,
title=title or f"Video {video_id}",
source_url=url,
status=VideoStatus.INDEXING,
upload_date=datetime.now()
)
# Start indexing in background
asyncio.create_task(self._index_video(video))
return video_id, video_info
except Exception as e:
logger.error(f"Error uploading video: {e}")
raise
async def _index_video(self, video):
"""Index video for spoken words and scenes."""
try:
# Index spoken words
video.index_spoken_words()
# Index scenes
scene_index_id = video.index_scenes(
extraction_type=SceneExtractionType.time_based,
extraction_config={
"time": settings.scene_extraction_time,
"select_frames": ['first', 'last']
},
prompt="Describe the scene in detail including objects, people, actions, and environment"
)
logger.info(f"Video {video.id} indexed successfully")
return scene_index_id
except Exception as e:
logger.error(f"Error indexing video {video.id}: {e}")
raise
def search_spoken_content(self, video_id: str, query: str, search_type: str = "semantic"):
"""Search spoken content in video."""
try:
video = self.collection.get_video(video_id)
results = video.search(
query=query,
index_type=IndexType.spoken_word,
search_type=SearchType.semantic if search_type == "semantic" else SearchType.keyword
)
return results
except Exception as e:
logger.error(f"Error searching spoken content: {e}")
raise
def search_visual_content(self, video_id: str, query: str, search_type: str = "semantic"):
"""Search visual content in video."""
try:
video = self.collection.get_video(video_id)
results = video.search(
query=query,
index_type=IndexType.scene,
search_type=SearchType.semantic if search_type == "semantic" else SearchType.keyword,
score_threshold=settings.score_threshold,
dynamic_score_percentage=settings.dynamic_score_percentage
)
return results
except Exception as e:
logger.error(f"Error searching visual content: {e}")
raise
To efficiently manage and organize video data, we utilize collections within VideoDB. The following utility methods abstract common operations such as retrieving, creating, or listing collections:
_find_collection_by_name: Searches for a collection by its name among all existing collections. If found, it returns the collection object; otherwise, it returns None.
_create_new_collection: Creates a new collection in VideoDB using the name specified in the environment config (settings.videodb_collection_name). This ensures our video data is logically grouped under a dedicated namespace.
get_collection_info: Returns metadata (ID, name, description) of the currently active collection. Useful for debugging or displaying collection context in the frontend.
list_all_collections: Retrieves and returns metadata of all collections available in the current VideoDB workspace. This is helpful for administrative interfaces or bulk inspection.
get_collection_by_id: Fetches a specific collection using its unique ID and logs basic information like name and ID for traceability.
These methods ensure that our system is robust, traceable, and scalable, supporting multiple video datasets with minimal code changes.
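A minimal sketch of two of these helpers is shown below; it assumes the VideoDB SDK's get_collections() and create_collection() calls and is an approximation, not the repository code.
def _find_collection_by_name(self, name: str):
    """Return the first collection whose name matches, or None if absent."""
    try:
        for collection in self.conn.get_collections():
            if collection.name == name:
                return collection
        return None
    except Exception as e:
        logger.error(f"Error finding collection '{name}': {e}")
        return None

def _create_new_collection(self):
    """Create a collection named after the environment config."""
    return self.conn.create_collection(
        name=settings.videodb_collection_name,
        description="Collection for multimodal video search"
    )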
To power our real-time video search platform, we define a central VideoDBService that orchestrates the video ingestion and indexing lifecycle using the VideoDB SDK. Below are the core capabilities:
upload_video(url: str, title: Optional[str])
This asynchronous method uploads a video from a given URL and initializes a background task to index its spoken and visual content. It returns both the video_id and structured VideoInfo metadata.
Video metadata like title, upload date, and status are tracked.
Indexing is offloaded using asyncio.create_task to avoid blocking the main thread.
_index_video(video)
Runs in the background after upload to extract:
Spoken content, using speech-to-text models.
Visual scenes, based on a configurable time-based strategy (e.g., every 10 seconds), capturing both the first and last frames of each scene with detailed prompts.
search_spoken_content(video_id, query, search_type)
Enables semantic or keyword-based search over the spoken-word index of a specific video. It uses IndexType.spoken_word with SearchType.semantic or keyword to retrieve relevant segments.
search_visual_content(video_id, query, search_type)
Similarly, allows users to search across indexed visual scenes using detailed visual descriptions. It supports configurable scoring (score_threshold, dynamic_score_percentage) for fine-grained control over retrieval relevance.
Result: This multimodal search flow allows querying "what was said" and "what was seen", enabling powerful use cases such as highlight generation, video Q&A, and surveillance analysis.
OpenAI Service (openai_service.py)
This service utilizes the OpenAI API to intelligently interpret and respond to user queries in the context of multimodal (spoken + visual) video data.
"""OpenAI service for query transformation."""
class OpenAIService:
async def divide_query(self, query: str) -> QueryDivision:
"""Divide query into spoken and visual components using OpenAI."""
try:
response = self.client.chat.completions.create(
model="gpt-4.1-mini",
messages=[
{"role": "user", "content": self.transformation_prompt.format(query=query)}
],
temperature=0.1
)
message = response.choices[0].message.content
divided_query = message.strip().split("\n")
spoken_query = divided_query[0].replace("Spoken:", "").strip()
visual_query = divided_query[1].replace("Visual:", "").strip()
return QueryDivision(
spoken_query=spoken_query,
visual_query=visual_query,
original_query=query
)
except Exception as e:
logger.error(f"Error dividing query: {e}")
# Fallback: use original query for both
return QueryDivision(
spoken_query=query,
visual_query=query,
original_query=query
)
async def generate_answer_from_context(self, query: str, context_texts: List[Dict[str, Any]]) -> str:
"""Generate answer based on query and extracted video context with metadata."""
try:
# Limit context size to avoid token overflow
max_contexts = 10
formatted_context = []
for item in context_texts[:max_contexts]:
start = f"{item['start']:.1f}"
end = f"{item['end']:.1f}"
ctype = item.get("type", "unknown")
text = item.get("text", "").strip().replace("\n", " ")
formatted_context.append(f"[{start} - {end}] ({ctype}): {text}")
combined_context = "\n".join(formatted_context)
## Prompt construction elided here for brevity (see the sketch below and the GitHub repo)
response = self.client.chat.completions.create(
model="gpt-4.1-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0.3
)
return response.choices[0].message.content.strip()
except Exception as e:
logger.error(f"Error generating answer from context: {e}")
return "Unable to generate answer due to processing error."
It offers two key functionalities:
1. divide_query(query: str)
Purpose: Decomposes a user's query into two parts:
Spoken Query: Pertains to narration, dialogue, or audio commentary.
Visual Query: Pertains to images, actions, or graphical elements.
Powered by a structured prompt and the gpt-4.1-mini model, this method ensures:
Strict output format parsing
Fallback behavior using the original query for both parts in case of failure
Returns: A QueryDivision object with spoken_query, visual_query, and original_query.
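The transformation prompt itself is not reproduced in the excerpt above; a plausible version that matches the "Spoken:" / "Visual:" parsing logic might look like this (illustrative only, not the repository's exact prompt):
# Assumed shape of the transformation prompt used by divide_query.
transformation_prompt = (
    "Divide the following video search query into a spoken part and a visual part.\n"
    "Respond in exactly two lines and nothing else:\n"
    "Spoken: <what is said or narrated>\n"
    "Visual: <what is shown on screen>\n\n"
    "Query: {query}"
)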
2. generate_answer_from_context(query: str, context_texts: List[Dict])
Purpose: Generates a detailed and timestamp-aware answer using:
Extracted video segments (spoken/visual)
Associated timestamps and types (e.g., narration or on-screen action)
The model is instructed to:
Ground its response in the provided data only
Include explicit references to timestamps and modality (e.g., spoken or visual)
Includes error handling for token limits and malformed data.
Returns: A natural-language response string or an error fallback message.
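The answer-generation prompt is elided in the code excerpt above; a sketch of how it might be assembled from combined_context and the user query, following the grounding rules just described, could look like this (an assumption, not the repository's prompt):
# Assumed prompt construction for generate_answer_from_context.
prompt = (
    "Answer the user's question using ONLY the video context below.\n"
    "Reference timestamps and say whether each piece of evidence is spoken or visual.\n\n"
    f"Context:\n{combined_context}\n\n"
    f"Question: {query}\n"
    "Answer:"
)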
Search Service (search_service.py)
To support nuanced multimodal search and summarization, our system includes a robust shot interval processing utility within the SearchService
class.
"""Search orchestration service."""
class SearchService:
def process_shots(self, l1: List[Tuple[float, float]], l2: List[Tuple[float, float]], operation: str, min_duration: float = 0.0) -> List[Tuple[float, float]]:
"""Process and combine shot intervals with duration filtering."""
### Full implementation in the GitHub repo (a sketch of this logic appears later in this section)
async def multimodal_search(self, request: MultimodalSearchRequest) -> MultimodalSearchResponse:
"""Perform multimodal search across videos and generate LLM answer."""
try:
# Divide query into spoken and visual components
query_division = await openai_service.divide_query(request.query)
results = []
all_extracted_texts = []
collection_name = videodb_service._find_collection_by_name(settings.videodb_collection_name)
logger.info(f"Collection name: {collection_name}")
# If no specific video IDs provided, search all videos
video_ids = request.video_ids or self._get_all_video_ids(collection_name.id)
logger.info(f"Video IDs: {video_ids}")
for video_id in video_ids:
try:
# Search spoken content
spoken_results = videodb_service.search_spoken_content(
video_id=video_id,
query=query_division.spoken_query,
search_type=request.search_type.value
)
# Search visual content
visual_results = videodb_service.search_visual_content(
video_id=video_id,
query=query_division.visual_query,
search_type=request.search_type.value
)
logger.info(f"Spoken results: {spoken_results}")
logger.info(f"Break\n\n\n")
logger.info(f"Visual results: {visual_results}")
# Extract text from shots
spoken_texts = []
visual_texts = []
all_extracted_texts_with_timestamps = []
for shot in spoken_results.get_shots():
if hasattr(shot, 'text') and shot.text:
text_with_timestamp = {
"text": shot.text,
"start": shot.start,
"end": shot.end,
"type": "spoken"
}
spoken_texts.append(text_with_timestamp)
all_extracted_texts_with_timestamps.append(text_with_timestamp)
for shot in visual_results.get_shots():
if hasattr(shot, 'text') and shot.text:
text_with_timestamp = {
"text": shot.text,
"start": shot.start,
"end": shot.end,
"type": "visual"
}
visual_texts.append(text_with_timestamp)
all_extracted_texts_with_timestamps.append(text_with_timestamp)
# Extract timestamps
spoken_timestamps = [(shot.start, shot.end) for shot in spoken_results.get_shots()]
visual_timestamps = [(shot.start, shot.end) for shot in visual_results.get_shots()]
logger.info(f"Spoken timestamps: {spoken_timestamps}")
logger.info(f"Visual timestamps: {visual_timestamps}")
# Combine spoken and visual segments
combined_segments = self.process_shots(
spoken_timestamps,
visual_timestamps,
request.combine_operation.value,
min_duration=0.0
)
logger.info(f"Combined segments: {combined_segments}")
if combined_segments:
total_matches = len(spoken_timestamps) + len(visual_timestamps)
similarity_score = len(combined_segments) / total_matches if total_matches else 0
if similarity_score > 0.3:
search_result = SearchResult(
video_id=video_id,
segments=[{"start": seg[0], "end": seg[1]} for seg in combined_segments],
total_score=similarity_score,
spoken_matches=len(spoken_timestamps),
visual_matches=len(visual_timestamps),
extracted_text=spoken_texts + visual_texts
)
# Generate stream URLs (one per segment)
try:
video_obj = videodb_service.collection.get_video(video_id)
search_result.stream_urls = [
video_obj.generate_stream(timeline=[[start, end]])
for start, end in combined_segments
]
for i, (seg_start, seg_end) in enumerate(combined_segments):
stream_url = search_result.stream_urls[i]
for text_entry in search_result.extracted_text:
if seg_start <= text_entry.start and text_entry.end <= seg_end:
text_entry.stream_url = stream_url
text_entry.start_time = self.format_seconds_to_timestamp(text_entry.start)
text_entry.end_time = self.format_seconds_to_timestamp(text_entry.end)
except Exception as e:
logger.error(f"Failed to generate stream URLs for video {video_id}: {e}")
results.append(search_result)
except Exception as e:
logger.error(f"Error searching video {video_id}: {e}")
continue
logger.info(f"Results: {results}")
# Sort by score
results.sort(key=lambda x: x.total_score, reverse=True)
results = results[:request.max_results]
logger.info(f"All extracted texts: {all_extracted_texts_with_timestamps}")
# Generate answer
generated_answer = None
if all_extracted_texts_with_timestamps:
generated_answer = await openai_service.generate_answer_from_context(
query=request.query,
context_texts=all_extracted_texts_with_timestamps
)
logger.info(f"Generated answer: {generated_answer}")
return MultimodalSearchResponse(
results=results,
query_division=query_division,
total_results=len(results),
search_params=request.dict(),
generated_answer=generated_answer
)
except Exception as e:
logger.error(f"Error in multimodal search: {e}")
raise
The process_shots method enables flexible manipulation of video segments (shots) based on different logical operations:
process_shots: Key Features
Union & Intersection of Segments: Combines or intersects two lists of (start_time, end_time) tuples, typically from different modalities (e.g., visual and spoken cues).
Shot Merging Logic: The method ensures that overlapping intervals are merged cleanly, removing redundancy and producing a set of non-overlapping segments.
Duration Filtering: Optional filtering ensures that only meaningful shots (i.e., longer than min_duration) are retained, ideal for skipping noise or too-short segments.
Supported Operations:
"union": Merges all intervals from both modalities.
"intersection": Returns only overlapping parts where both modalities align.
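Since the full implementation lives in the GitHub repository, here is a self-contained sketch of the interval logic described above (union, intersection, merging, and duration filtering); treat it as an approximation of the repository code.
# Sketch of process_shots as described above.
from typing import List, Tuple

def process_shots(
    l1: List[Tuple[float, float]],
    l2: List[Tuple[float, float]],
    operation: str,
    min_duration: float = 0.0,
) -> List[Tuple[float, float]]:
    """Combine two lists of (start, end) intervals by union or intersection."""
    def merge(intervals):
        # Merge overlapping/adjacent intervals into a sorted, non-overlapping list.
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        return merged

    if operation == "union":
        combined = merge(l1 + l2)
    else:  # intersection
        combined = []
        for s1, e1 in merge(l1):
            for s2, e2 in merge(l2):
                start, end = max(s1, s2), min(e1, e2)
                if start < end:
                    combined.append((start, end))
        combined = merge(combined)

    # Keep only segments longer than min_duration.
    return [(s, e) for s, e in combined if (e - s) > min_duration]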
Multimodal Video Search Logic
The multimodal_search function performs an intelligent search across both spoken and visual content in videos, powered by LLM-assisted reasoning. Here's a breakdown of its operations:
Query Splitting: The user query is divided into spoken and visual parts using the OpenAI service, enabling specialized search logic for different modalities.
Video Selection: If no video IDs are specified, the method fetches all videos from the configured collection.
Content Search:
Spoken content is searched using the speech transcript built by the spoken-word index.
Visual content is searched using the AI-generated scene descriptions from the scene index.
Result Processing:
For each matched segment, it extracts relevant text with timestamps and tags (spoken/visual).
Timestamps are collected for both modalities to compute overlaps or combinations based on a chosen operation (e.g., union or intersection).
Segment Combination & Scoring:
Combines spoken and visual timestamps into final segments.
Calculates a similarity score to filter out weak matches.
Stream URL Generation:
For each final segment, generates a stream URL for video playback.
Stream URLs are embedded back into the text entries along with human-readable timestamps.
LLM Answer Generation:
- If results are found, an OpenAI model generates a natural-language answer using the extracted spoken/visual context.
Response:
- Returns structured results including matched segments, scores, stream URLs, and an LLM-generated answer.
Pydantic Models (models/ Directory)
Search Models (search_models.py)
This module defines the data models and enums used throughout the multimodal video search pipeline. Built with Pydantic, it ensures strong validation and structure across services. It includes:
MultimodalSearchRequest: Schema for incoming search queries.
QueryDivision: Holds split queries for spoken and visual modalities.
ExtractedText: Represents timestamped, typed content segments with optional stream URLs.
SearchResult: Encapsulates all relevant data returned per video match.
MultimodalSearchResponse: Combines all search results, parameters, and the final LLM-generated answer.
Enums SearchType and CombineOperation allow flexible search logic.
These models ensure consistent, typed communication between the backend services.
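A condensed sketch of these models, with field names inferred from how the services above use them (so treat it as an approximation of the real search_models.py):
# Sketch of the core search models.
from enum import Enum
from typing import Any, Dict, List, Optional
from pydantic import BaseModel

class SearchType(str, Enum):
    semantic = "semantic"
    keyword = "keyword"

class CombineOperation(str, Enum):
    intersection = "intersection"
    union = "union"

class QueryDivision(BaseModel):
    spoken_query: str
    visual_query: str
    original_query: str

class ExtractedText(BaseModel):
    text: str
    start: float
    end: float
    type: str
    stream_url: Optional[str] = None
    start_time: Optional[str] = None
    end_time: Optional[str] = None

class MultimodalSearchRequest(BaseModel):
    query: str
    video_ids: Optional[List[str]] = None
    search_type: SearchType = SearchType.semantic
    combine_operation: CombineOperation = CombineOperation.intersection
    max_results: int = 10

class SearchResult(BaseModel):
    video_id: str
    segments: List[Dict[str, float]]
    total_score: float
    spoken_matches: int
    visual_matches: int
    extracted_text: List[ExtractedText] = []
    stream_urls: List[str] = []

class MultimodalSearchResponse(BaseModel):
    results: List[SearchResult]
    query_division: QueryDivision
    total_results: int
    search_params: Dict[str, Any]
    generated_answer: Optional[str] = None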
Video Models (video_models.py)
This module defines Pydantic models for managing video ingestion, indexing, and metadata operations. It includes:
Enums
VideoStatus: Captures the current state of a video, from upload to indexing, to ready-for-search or error.
Request & Response Models
VideoUploadRequest: Accepts a video URL and optional title/description.
VideoUploadResponse: Confirms upload success; returns video_id, status, and an optional title/message.
Metadata Model
VideoInfo: Stores full video metadata, including title, description, upload date, duration, source/thumbnail URLs, and indexing status.
Indexing Status
IndexStatus: Flags indicating whether spoken-word transcripts or visual scenes are indexed, and stores the associated scene_index_id.
These models serve as the backbone of the video pipeline and ensure structured communication between video ingestion, processing, and retrieval systems.
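A similarly condensed sketch of these video models, again with fields inferred from the upload flow rather than copied from the repository:
# Sketch of the video models.
from datetime import datetime
from enum import Enum
from typing import Optional
from pydantic import BaseModel, HttpUrl

class VideoStatus(str, Enum):
    UPLOADING = "uploading"
    INDEXING = "indexing"
    READY = "ready"
    ERROR = "error"

class VideoUploadRequest(BaseModel):
    url: HttpUrl
    title: Optional[str] = None
    description: Optional[str] = None

class VideoUploadResponse(BaseModel):
    video_id: str
    status: VideoStatus
    title: Optional[str] = None
    message: Optional[str] = None

class VideoInfo(BaseModel):
    video_id: str
    title: str
    source_url: str
    status: VideoStatus
    upload_date: datetime
    description: Optional[str] = None
    duration: Optional[float] = None
    thumbnail_url: Optional[str] = None

class IndexStatus(BaseModel):
    spoken_indexed: bool = False
    scenes_indexed: bool = False
    scene_index_id: Optional[str] = None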
Frontend
The frontend of the application is a Streamlit interface that calls the FastAPI endpoints to perform each task. This part of the codebase could be swapped out for a React app.
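All backend calls go through utils/api_client.py, which is not reproduced elsewhere in this post. Here is a hedged sketch of what it might contain; the endpoint paths and the get_api_client() helper are assumptions chosen to be consistent with the routes shown earlier.
# Sketch of the thin API client used by the frontend.
from typing import Optional
import requests

class APIClient:
    """Thin wrapper around the FastAPI backend."""
    def __init__(self, base_url: str = "http://localhost:8000"):
        self.base_url = base_url.rstrip("/")

    def upload_video(self, url: str, title: Optional[str] = None) -> dict:
        resp = requests.post(f"{self.base_url}/videos/upload",
                             json={"url": url, "title": title}, timeout=60)
        resp.raise_for_status()
        return resp.json()

    def get_video_status(self, video_id: str) -> dict:
        resp = requests.get(f"{self.base_url}/videos/{video_id}/status", timeout=30)
        resp.raise_for_status()
        return resp.json()

    def search_videos(self, query: str, video_ids=None, search_type: str = "semantic",
                      combine_operation: str = "intersection") -> dict:
        payload = {"query": query, "video_ids": video_ids,
                   "search_type": search_type, "combine_operation": combine_operation}
        resp = requests.post(f"{self.base_url}/search/multimodal", json=payload, timeout=300)
        resp.raise_for_status()
        return resp.json()

def get_api_client() -> APIClient:
    return APIClient()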
Streamlit Main Application (app.py)
This Streamlit frontend is the main landing page for the Multimodal Video Search app. It introduces the platform's key features (video upload, smart search, and rich results) through a clean, interactive UI. It includes quick action buttons to navigate to the upload or search pages, describes the backend processing (upload → indexing → multimodal search), and shows real-time backend connectivity and usage stats in the sidebar.
"""Main Streamlit application."""
st.set_page_config(
page_title="Multimodal Video Search",
page_icon="🎬",
layout="wide",
initial_sidebar_state="expanded"
)
# Feature overview
col1, col2, col3 = st.columns(3)
st.divider()
# Quick actions
st.subheader("Quick Actions")
col1, col2, col3 = st.columns(3)
with col1:
if st.button("📹 Upload New Video", use_container_width=True):
st.switch_page("pages/1_Upload_Video.py")
with col2:
if st.button("🔍 Search Videos", use_container_width=True):
st.switch_page("pages/2_Search_Videos.py")
Snapshot of Home Page
Components (components/ Directory)
Video Uploading Interface (video_uploader.py)
video_uploader.py: Video Upload UI for Streamlit
This module provides a Streamlit interface to upload and index videos for search:
Accepts a video URL and optional title.
Validates and submits the upload to the backend via
api_client
.Displays success/failure messages and tracks recent uploads in session state.
Allows checking indexing status of each uploaded video.
"""Video upload component."""
def video_uploader():
"""Render video upload interface."""
st.header("📹 Upload Video")
# URL input
url = st.text_input(
"Enter YouTube URL or video URL:",
placeholder="https://www.youtube.com/watch?v=..."
)
# Title input
title = st.text_input(
"Video Title (optional):",
placeholder="Enter a descriptive title"
)
# Upload button
if st.button("Upload and Index Video", type="primary"):
if not url:
st.error("Please enter a video URL")
return
if not validators.url(url):
st.error("Please enter a valid URL")
return
try:
with st.spinner("Uploading and indexing video..."):
api_client = get_api_client()
result = api_client.upload_video(url, title)
st.success(f"Video uploaded successfully! Video ID: {result['video_id']}")
st.info("The video is being indexed for search. This may take a few minutes.")
# Store video info in session state
if 'uploaded_videos' not in st.session_state:
st.session_state.uploaded_videos = []
st.session_state.uploaded_videos.append({
'video_id': result['video_id'],
'title': title or f"Video {result['video_id']}",
'url': url,
'status': result['status']
})
except Exception as e:
st.error(f"Error uploading video: {str(e)}")
# Show recently uploaded videos
if 'uploaded_videos' in st.session_state and st.session_state.uploaded_videos:
st.subheader("Recently Uploaded Videos")
for video in st.session_state.uploaded_videos[-5:]: # Show last 5
with st.expander(f"📹 {video['title']}"):
st.write(f"**Video ID:** {video['video_id']}")
st.write(f"**URL:** {video['url']}")
st.write(f"**Status:** {video['status']}")
# Check status button
if st.button(f"Check Status", key=f"status_{video['video_id']}"):
try:
api_client = get_api_client()
status = api_client.get_video_status(video['video_id'])
st.json(status)
except Exception as e:
st.error(f"Error checking status: {str(e)}")
Snapshot of Upload Video Page
Search Interface Component (search_interface.py)
search_interface.py: Multimodal Search UI & Results Renderer
This module defines the user interface and result display for multimodal video search in Streamlit:
search_interface(): Builds an interactive form for users to submit queries, choose search modes (semantic or keyword), and optionally filter by video IDs.
display_search_results(): Renders the generated answer, query decomposition, and detailed matched segments, including embedded video players with HLS support and matched transcript snippets (a sketch of this helper follows the code below).
"""Search interface component."""
def search_interface():
"""Render search interface."""
st.header("🔍 Multimodal Video Search")
# Search query input
query = st.text_area(
"Enter your search query:",
placeholder="Show me where the narrator discusses the formation of the solar system and visualize the milky way galaxy",
height=100
)
# Search options
col1, col2 = st.columns(2)
with col1:
search_type = st.selectbox(
"Search Type:",
["semantic", "keyword"],
help="Semantic search uses AI to understand meaning, keyword search looks for exact matches"
)
with col2:
combine_operation = st.selectbox(
"Combine Results:",
["intersection", "union"],
help="Intersection shows segments matching both spoken and visual criteria, union shows all matches"
)
# Advanced options
with st.expander("Advanced Options"):
video_ids = st.text_input(
"Specific Video IDs (comma-separated, leave empty for all):",
placeholder="video_id_1, video_id_2"
)
max_results = st.slider(
"Maximum Results:",
min_value=1,
max_value=50,
value=10
)
# Search button
if st.button("🔍 Search Videos", type="primary"):
if not query.strip():
st.error("Please enter a search query")
return
try:
with st.spinner("Searching videos..."):
api_client = get_api_client()
# Parse video IDs if provided
video_id_list = None
if video_ids.strip():
video_id_list = [vid.strip() for vid in video_ids.split(",")]
# Perform search
results = api_client.search_videos(
query=query,
video_ids=video_id_list,
search_type=search_type,
combine_operation=combine_operation
)
# Store results in session state
st.session_state.search_results = results
st.session_state.current_query = query
st.success(f"Found {results['total_results']} results!")
except Exception as e:
st.error(f"Search error: {str(e)}")
Snapshot of Search Page
Streamlit Pages (pages/ Directory)
Video Upload Function (1_upload_video.py)
This Streamlit page allows users to upload videos by URL and initiate indexing for search:
Uses the video_uploader component to handle input and backend communication.
Includes expandable instructions explaining supported formats and processing time.
Provides a clear and informative UI with upload tips and status guidance.
"""Upload video page."""
st.set_page_config(
page_title="Upload Video - Multimodal Search",
page_icon="📹",
layout="wide"
)
st.title("📹 Video Upload")
st.markdown("Upload videos from YouTube or other sources to make them searchable.")
video_uploader()
Video Search Function (2_video_search.py)
This Streamlit page enables users to perform multimodal searches across uploaded videos:
Integrates search_interface() to capture user queries.
Displays results via display_search_results(), supporting both spoken and visual content.
Automatically renders recent search results stored in session state.
"""Search videos page."""
st.title("🔍 Multimodal Video Search")
st.markdown("Search through your video library using both spoken content and visual elements.")
# Search interface
search_interface()
st.divider()
# Results section
if 'search_results' in st.session_state:
    display_search_results()
Future Enhancements & Expansion Opportunities
Building on the current multimodal video search foundation, several high-impact extensions can unlock new use cases and commercial potential using VideoDB:
Content Intelligence Extensions
Advanced Video Analytics: Integrate emotion detection, object tracking, logo/brand recognition, and activity recognition for deeper scene understanding.
Multilingual Capabilities: Enable cross-language search and translation using OpenAI and Whisper for a globally scalable solution.
Enterprise Integrations
Workflow Automation: Embed search into Slack, Teams, or CRMs to surface relevant videos contextually during user workflows.
Learning & Compliance: Integrate with LMS and DMS platforms to streamline educational discovery and safety compliance checks.
Media & Entertainment Use Cases
Highlight Generation: Auto-create sports/event highlights from speech + action cues.
Live Broadcast Intelligence: Detect key moments or anomalies in real-time video streams.
Content Moderation & Copyright: Monitor and flag copyrighted or restricted content across platforms.
Healthcare & Industrial Training
Procedure Evaluation: Analyze surgical/technical footage for training, QA, and certification.
Skill Assessments: Evaluate trainee performance via structured video tasks.
Remote Expert Support: Recommend content dynamically during video-based remote consultations.
I encourage you to build on VideoDB and participate in the hackathon here: https://aidemos.com/ai-hackathons/aidemos-videodb/submit