Cost-Effective Video Search with Frame-Based Multimodal Embeddings


TL;DR: Traditional video models are costly and often impractical for large-scale search. By splitting videos into ~800 frames per hour and embedding both the visuals and the transcribed audio into a vector database, we can build a precise, low-cost system for querying exact video moments. This approach makes multimodal search affordable without sacrificing accuracy.
Why Video Search is Expensive Today
Most AI models don't process videos directly. To search within a video, you typically need to process it frame by frame, create embeddings for each frame, and store these embeddings in a database. For a 1-hour video, this can cost nearly a dollar with models like Gemini-2.0-flash. Costs rise quickly when scaling to hundreds of hours of content, and using more advanced models increases the price even more. This makes precise, multimodal search (visual + audio) expensive and often impractical for everyday use.
By dividing videos into frames and embedding both the visuals and the transcribed audio into a vector database, we can build a precise, low-cost system that finds exact video moments without losing accuracy. Here, I'll show you an approach that cuts costs and improves accuracy 🌸
Splitting Videos Into Frames
As of today, there are models like Gemini that can directly process videos. However, I won't be using them because I want to create something that remembers videos by frames and costs less than traditional video models.
FFmpeg can split a video into any number of frames. I wrote a script that splits a video into 800 frames per hour (FPH). Source videos typically run at 30 or 60 FPS, and the script uses FFmpeg to downsample them into .jpeg files at 800 FPH.
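Here's a minimal sketch of that script, assuming FFmpeg is installed; the input file name is just an example. 800 frames per hour works out to 800/3600 ≈ 0.22 frames per second:

```python
import subprocess
from pathlib import Path

FRAMES_PER_HOUR = 800  # roughly one frame every 4.5 seconds

def extract_frames(video_path: str, out_dir: str = "extracted_frames") -> None:
    """Downsample a video into .jpeg frames at 800 frames per hour with FFmpeg."""
    Path(out_dir).mkdir(exist_ok=True)
    subprocess.run(
        [
            "ffmpeg",
            "-i", video_path,
            "-vf", f"fps={FRAMES_PER_HOUR}/3600",  # 800 FPH ≈ 0.22 frames per second
            "-q:v", "2",                           # high-quality JPEG output
            f"{out_dir}/frame_%05d.jpeg",
        ],
        check=True,
    )

extract_frames("videos-to-train/12115024_3840_2160_30fps.mp4")
```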
From Frames to Embeddings
This part gets interesting, because there are models that can help embed video content. Here, I used the multimodal embedding model from Vertex AI, and there's a good reason for that: it supports both text and images, which makes text-to-image search precise.
I loop over the extracted_frames folder, convert each frame into a 1408-dimension vector with google/multimodalembedding, and store the vectors in a vector database configured with (cosine, 1408).
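Here's a minimal sketch of that per-frame call, assuming the Vertex AI Python SDK (google-cloud-aiplatform) and placeholder project/region values; multimodalembedding@001 is the model identifier:

```python
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

vertexai.init(project="my-gcp-project", location="us-central1")  # placeholders
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

def embed_frame(frame_path: str) -> list[float]:
    """Return a 1408-dimension embedding for a single extracted frame."""
    result = model.get_embeddings(
        image=Image.load_from_file(frame_path),
        dimension=1408,
    )
    return result.image_embedding

vector = embed_frame("extracted_frames/frame_00011.jpeg")
print(len(vector))  # 1408
```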
With a bit more looping, we're ready to run a script that processes all videos in the videos-to-train folder, creates their extracted_frames, and saves their vectors to Upstash. The ID of each vector is structured to point to a specific video at a specific timestamp. Example:
videos-to-train/12115024_3840_2160_30fps.mp4-11/16
points to a video stored in the videos-to-train folder with the name 12115024_3840_2160_30fps.mp4. The current frame is 11 of 16, about 69% of the way through the video.
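Here's a sketch of that loop and the ID format, assuming the upstash-vector Python SDK, an index created with 1408 dimensions and cosine similarity, and the embed_frame helper from the sketch above:

```python
import os
from pathlib import Path
from upstash_vector import Index

# Upstash Vector index created with dimension=1408 and cosine similarity.
index = Index(
    url=os.environ["UPSTASH_VECTOR_REST_URL"],
    token=os.environ["UPSTASH_VECTOR_REST_TOKEN"],
)

def store_video(video_path: str, frames_dir: str = "extracted_frames") -> None:
    """Embed every extracted frame and upsert it with an ID of the form <video>-<frame>/<total>."""
    frames = sorted(Path(frames_dir).glob("*.jpeg"))
    total = len(frames)
    vectors = []
    for i, frame in enumerate(frames, start=1):
        vector_id = f"{video_path}-{i}/{total}"  # e.g. videos-to-train/12115024_3840_2160_30fps.mp4-11/16
        vectors.append((vector_id, embed_frame(str(frame))))
    # For long videos, you may want to upsert in smaller batches.
    index.upsert(vectors=vectors)

store_video("videos-to-train/12115024_3840_2160_30fps.mp4")
```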
Querying the Vector Database
For text queries, the same model can be used: it returns vectors of the same dimension, which can be matched against the vectors stored in Upstash.
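A minimal sketch of that query path, reusing the model and index objects from the sketches above:

```python
def search(prompt: str, top_k: int = 5) -> None:
    """Embed a text prompt and find the closest stored frames."""
    query_vector = model.get_embeddings(
        contextual_text=prompt,
        dimension=1408,
    ).text_embedding
    for match in index.query(vector=query_vector, top_k=top_k):
        # match.id encodes the source video and the frame position, e.g.
        # videos-to-train/12115024_3840_2160_30fps.mp4-11/16
        print(f"{match.score:.3f}  {match.id}")

search("squirrel jumping off a toast in the forest")
```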
Here's a demo of a few queries:
Prompt: Shot of a river from a cliff where clouds seem coming towards me
Prompt: squirrel jumping off a toast in the forest
Watch how it accurately identified the moment when the squirrel was about to jump off the toast.
What’s Next: Adding Audio Context
With this approach, a query across thousands of videos can point directly to the specific video and moment it comes from, and it doesn't stop there.
Sound can also be extracted using ffmpeg at the same rate as the frames. This audio can be converted to text and embedded with the frame data, then saved to the same vector. This way, not only the visuals but also a small piece of dialogue from the video can be precisely identified.
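Here's a rough sketch of how that could look, reusing the model and Image objects from earlier. The transcribe function is a placeholder for whatever speech-to-text you prefer (Whisper, Google Speech-to-Text, etc.), and since the image and text embeddings land in the same 1408-dimension space, one simple option is to average them into a single vector per frame:

```python
import subprocess

def extract_audio_clip(video_path: str, start_s: float, duration_s: float, out_path: str) -> str:
    """Cut the audio around one frame's timestamp with FFmpeg (no video stream)."""
    subprocess.run(
        ["ffmpeg", "-ss", str(start_s), "-i", video_path,
         "-t", str(duration_s), "-vn", "-acodec", "libmp3lame", out_path],
        check=True,
    )
    return out_path

def transcribe(audio_path: str) -> str:
    # Placeholder: plug in Whisper, Google Speech-to-Text, or any other STT here.
    raise NotImplementedError

def embed_frame_with_audio(frame_path: str, transcript: str) -> list[float]:
    """Embed the frame and its transcript together so one vector carries both signals."""
    result = model.get_embeddings(
        image=Image.load_from_file(frame_path),
        contextual_text=transcript,  # a short dialogue snippet around this frame
        dimension=1408,
    )
    # Image and text embeddings share the same space; average them into one vector.
    return [(i + t) / 2 for i, t in zip(result.image_embedding, result.text_embedding)]
```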
This method shows that video search doesn’t have to be expensive. By combining frame extraction, embeddings, and audio transcripts, you can build a multimodal system that pinpoints exact moments in hours of footage at a fraction of the cost of traditional video models.