Semantic Search using CLIP and FAISS

1. Situation + Task
I wanted to explore how semantic image search works, the kind where you can either type a text prompt like “a tiger in the wild” or upload an image, and the system returns visually or semantically similar images.
To try this out, I used a Kaggle dataset with around 5,400 images spanning 90 animal categories (dataset link).
The main things I wanted to figure out were:
How to convert images and text into a common vector space using pretrained models
How to perform fast similarity search on those high-dimensional vectors
And how to build a simple Flask-based web UI so users could test the search interactively
2. Action
To achieve this, I combined two tools that are often used in industry:
CLIP (by OpenAI): A model that converts images and text into embeddings, placing them into the same vector space where they can be meaningfully compared.
FAISS (by Meta): A library that performs fast similarity search over large collections of high-dimensional vectors, such as the embeddings produced by models like CLIP.
Simply put, it answers: “Given this vector, which other vectors in my dataset are closest to it?”
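To make that concrete, here is a minimal, self-contained FAISS example. The vectors and dimensions are made up for illustration, and I'm assuming L2-normalized embeddings so that inner product behaves like cosine similarity:

```python
# Minimal sketch of what FAISS does: given stored vectors, find the nearest ones.
# Assumes vectors are L2-normalized so inner product == cosine similarity.
import numpy as np
import faiss

dim = 512                                   # CLIP ViT-B/32 embedding size
vectors = np.random.rand(1000, dim).astype("float32")
faiss.normalize_L2(vectors)

index = faiss.IndexFlatIP(dim)              # exact inner-product search
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)        # top-5 closest vectors
print(ids[0], scores[0])
```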
Here’s what I did:
Encoded all images in the dataset using CLIP via index.py and stored the resulting embeddings in a FAISS index for fast similarity search (a simplified sketch of the indexing and query steps follows this list)
Built a Flask web app with two search options:
Text Search: User types a prompt → CLIP converts it to an embedding → FAISS returns the top 5 closest image matches
Image Search: User uploads an image → CLIP generates its embedding → FAISS returns the top 5 visually/semantically similar images
Designed the backend to be modular (easily swap CLIP models) and kept the frontend customizable using HTML/CSS for styling.
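For anyone curious what those steps roughly look like in code, here is a simplified sketch. It is not the exact code from index.py or the Flask app; it assumes Hugging Face's transformers CLIP wrappers, a flat FAISS index over normalized embeddings, and placeholder file paths:

```python
# Sketch of the two halves of the pipeline (the actual repo code may differ).
import faiss
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)   # normalize for cosine search
    return feats.detach().numpy().astype("float32")

def embed_text(prompt):
    inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.detach().numpy().astype("float32")

# Indexing step (what index.py does): embed every image once, store in FAISS.
image_paths = ["animals/tiger/001.jpg", "animals/cat/002.jpg"]  # placeholder paths
index = faiss.IndexFlatIP(512)
index.add(embed_images(image_paths))

# Text query: the prompt goes through CLIP's text encoder, then FAISS returns top-5.
scores, ids = index.search(embed_text("a tiger in the wild"), 5)
top_matches = [image_paths[i] for i in ids[0]]

# Image query works the same way: embed the uploaded image and search.
scores_img, ids_img = index.search(embed_images(["uploads/query_cat.jpg"]), 5)
```

The Flask routes just wrap these two query paths and render the returned image paths in the template.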
3. Problems Faced and Learnings
One thing I learned early on was that CLIP is great for semantics, but not for exact matching.
For example, if I uploaded two different pictures of the same dog, CLIP might not consider them very similar: it focuses on what is in the image, not who.
Another catch is that FAISS always returns the top-k results, even if none of them are actually close.
So in small datasets, it sometimes gave poor matches.
To fix this, I looked into adding a distance threshold: basically filtering out results that weren't a good enough match.
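A rough sketch of that idea, continuing from the normalized-embedding setup above (the cutoff value is purely illustrative, not something tuned in the project, and query_embedding stands in for whatever CLIP produced for the query):

```python
# Sketch of a similarity threshold on top of FAISS results (cutoff is illustrative).
SCORE_THRESHOLD = 0.25   # cosine similarity below this is treated as "no real match"

scores, ids = index.search(query_embedding, 5)
filtered = [
    (int(i), float(s))
    for i, s in zip(ids[0], scores[0])
    if s >= SCORE_THRESHOLD and i != -1   # FAISS pads with -1 when it has fewer than k results
]
```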
4. Results
In the end, I had a working app where I could type something like “an elephant drinking water” or upload an image of a cat and in seconds, it would show me the top 5 most semantically similar animal images.
It felt intuitive and fast, thanks to CLIP's ability to understand meaning and FAISS's speed in finding the closest matches.
This project helped me understand how semantic search works under the hood: how to represent images and text in the same latent space, how to search through them efficiently, and how to tie it all together in a web app that anyone could use.
5. What’s Next?
Fine-tune the CLIP model for better accuracy when identifying specific instances
Extend this (CLIP) to video retrieval by extracting and indexing frames for search
Build a proper backend to scale the app, make it faster, and support larger datasets
🔗 GitHub Link - surajrao2003/semantic_search_clip_faiss