Cost-Effective Video Search with Frame-Based Multimodal Embeddings


TL;DR: Traditional video models are costly and often impractical for large-scale search. By splitting videos into ~800 frames per hour and embedding both the visuals and the transcribed audio into a vector database, we can build a precise, low-cost system for querying exact video moments. This approach makes multimodal search affordable without sacrificing accuracy.
Why Video Search is Expensive Today
Most AI models don't process videos directly. To search within a video, you typically need to process it frame by frame, create embeddings for each frame, and store these embeddings in a database. For a 1-hour video, this can cost nearly a dollar with models like Gemini-2.0-flash. Costs rise quickly when scaling to hundreds of hours of content, and using more advanced models increases the price even more. This makes precise, multimodal search (visual + audio) expensive and often impractical for everyday use.
By dividing videos into frames and embedding both the visuals and the transcribed audio into a vector database, we can build a precise, low-cost system that finds exact video moments without losing accuracy. Here, I'll show you an approach that cuts costs and improves accuracy 🌸
Splitting Videos Into Frames
As of today, there are models like Gemini that can directly process videos. However, I won't be using them because I want to create something that remembers videos by frames and costs less than traditional video models.
FFmpeg can split a video into any number of frames. I wrote a script that splits a video into 800 frames per hour (FPH). Source videos typically run at 30 or 60 FPS, and the script uses FFmpeg to downsample them into .jpeg files at 800 FPH.
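Here's a minimal sketch of that script, assuming FFmpeg is installed; the input file name is just an example. 800 frames per hour works out to 800/3600 ≈ 0.22 frames per second:

```python
import subprocess
from pathlib import Path

FRAMES_PER_HOUR = 800  # roughly one frame every 4.5 seconds

def extract_frames(video_path: str, out_dir: str = "extracted_frames") -> None:
    """Downsample a video into .jpeg frames at 800 frames per hour with FFmpeg."""
    Path(out_dir).mkdir(exist_ok=True)
    subprocess.run(
        [
            "ffmpeg",
            "-i", video_path,
            "-vf", f"fps={FRAMES_PER_HOUR}/3600",  # 800 FPH ≈ 0.22 frames per second
            "-q:v", "2",                           # high-quality JPEG output
            f"{out_dir}/frame_%05d.jpeg",
        ],
        check=True,
    )

extract_frames("videos-to-train/12115024_3840_2160_30fps.mp4")
```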
From Frames to Embeddings
This part gets interesting, because there are models that can help embed video content. Here, I used the multimodal embedding model from Vertex AI, and there's a good reason for that: it supports both text and images, which makes text-to-image search precise.
I loop over the extracted_frames folder, convert each frame into a 1408-dimension vector with google/multimodalembedding, and store the vectors in a vector database configured with (cosine, 1408).
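Here's a minimal sketch of that per-frame call, assuming the Vertex AI Python SDK (google-cloud-aiplatform) and placeholder project/region values; multimodalembedding@001 is the model identifier:

```python
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

vertexai.init(project="my-gcp-project", location="us-central1")  # placeholders
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

def embed_frame(frame_path: str) -> list[float]:
    """Return a 1408-dimension embedding for a single extracted frame."""
    result = model.get_embeddings(
        image=Image.load_from_file(frame_path),
        dimension=1408,
    )
    return result.image_embedding

vector = embed_frame("extracted_frames/frame_00011.jpeg")
print(len(vector))  # 1408
```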
With a bit more looping, we're ready to run a script that processes all videos in the videos-to-train folder, creates their extracted_frames, and saves their vectors to Upstash. The ID of each vector is structured to point to a specific video at a specific timestamp. Example:
videos-to-train/12115024_3840_2160_30fps.mp4-11/16
points to a video stored in the videos-to-train folder with the name 12115024_3840_2160_30fps.mp4. The current frame is 11 of 16, about 69% of the way through the video.
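Here's a sketch of that loop and the ID format, assuming the upstash-vector Python SDK, an index created with 1408 dimensions and cosine similarity, and the embed_frame helper from the sketch above:

```python
import os
from pathlib import Path
from upstash_vector import Index

# Upstash Vector index created with dimension=1408 and cosine similarity.
index = Index(
    url=os.environ["UPSTASH_VECTOR_REST_URL"],
    token=os.environ["UPSTASH_VECTOR_REST_TOKEN"],
)

def store_video(video_path: str, frames_dir: str = "extracted_frames") -> None:
    """Embed every extracted frame and upsert it with an ID of the form <video>-<frame>/<total>."""
    frames = sorted(Path(frames_dir).glob("*.jpeg"))
    total = len(frames)
    vectors = []
    for i, frame in enumerate(frames, start=1):
        vector_id = f"{video_path}-{i}/{total}"  # e.g. videos-to-train/12115024_3840_2160_30fps.mp4-11/16
        vectors.append((vector_id, embed_frame(str(frame))))
    # For long videos, you may want to upsert in smaller batches.
    index.upsert(vectors=vectors)

store_video("videos-to-train/12115024_3840_2160_30fps.mp4")
```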
Querying the Vector Database
For text queries, the same model can be used: it returns vectors of the same dimension, which can be matched against the vectors stored in Upstash.
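A minimal sketch of that query path, reusing the model and index objects from the sketches above:

```python
def search(prompt: str, top_k: int = 5) -> None:
    """Embed a text prompt and find the closest stored frames."""
    query_vector = model.get_embeddings(
        contextual_text=prompt,
        dimension=1408,
    ).text_embedding
    for match in index.query(vector=query_vector, top_k=top_k):
        # match.id encodes the source video and the frame position, e.g.
        # videos-to-train/12115024_3840_2160_30fps.mp4-11/16
        print(f"{match.score:.3f}  {match.id}")

search("squirrel jumping off a toast in the forest")
```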
Here's a demo of a few queries:
Prompt: Shot of a river from a cliff where clouds seem coming towards me
Prompt: squirrel jumping off a toast in the forest
Watch how it accurately identified the moment when the squirrel was about to jump off the toast.
What’s Next: Adding Audio Context
With this approach, a query across thousands of videos can point directly to the specific video and moment it comes from, and it doesn't stop there.
Sound can also be extracted using ffmpeg at the same rate as the frames. This audio can be converted to text and embedded with the frame data, then saved to the same vector. This way, not only the visuals but also a small piece of dialogue from the video can be precisely identified.
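Here's a rough sketch of how that could look, reusing the model and Image objects from earlier. The transcribe function is a placeholder for whatever speech-to-text you prefer (Whisper, Google Speech-to-Text, etc.), and since the image and text embeddings land in the same 1408-dimension space, one simple option is to average them into a single vector per frame:

```python
import subprocess

def extract_audio_clip(video_path: str, start_s: float, duration_s: float, out_path: str) -> str:
    """Cut the audio around one frame's timestamp with FFmpeg (no video stream)."""
    subprocess.run(
        ["ffmpeg", "-ss", str(start_s), "-i", video_path,
         "-t", str(duration_s), "-vn", "-acodec", "libmp3lame", out_path],
        check=True,
    )
    return out_path

def transcribe(audio_path: str) -> str:
    # Placeholder: plug in Whisper, Google Speech-to-Text, or any other STT here.
    raise NotImplementedError

def embed_frame_with_audio(frame_path: str, transcript: str) -> list[float]:
    """Embed the frame and its transcript together so one vector carries both signals."""
    result = model.get_embeddings(
        image=Image.load_from_file(frame_path),
        contextual_text=transcript,  # a short dialogue snippet around this frame
        dimension=1408,
    )
    # Image and text embeddings share the same space; average them into one vector.
    return [(i + t) / 2 for i, t in zip(result.image_embedding, result.text_embedding)]
```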
This method shows that video search doesn’t have to be expensive. By combining frame extraction, embeddings, and audio transcripts, you can build a multimodal system that pinpoints exact moments in hours of footage at a fraction of the cost of traditional video models.