Enhance Image Recognition with Zero-Shot Classification: Using CLIP and Gradio

🧠 Introduction

Traditional image classification models need to be trained on a fixed set of labeled classes. But what if we could classify new images into any category we want, without retraining the model? That’s where zero-shot classification comes in.

With zero-shot image classification, you can upload an image and provide custom labels, and the model returns the label that best describes the image, even if it was never trained on those specific labels. This is made possible by powerful vision-language models like CLIP (Contrastive Language-Image Pre-training).

🧬 What is CLIP?

CLIP, developed by OpenAI, is a model that learns to connect images and text. It uses two encoders:

  • One for images (usually a Vision Transformer or CNN)

  • One for text (usually a Transformer)

Both encoders map inputs into the same embedding space. At inference time, it compares the similarity between the image embedding and each label’s text embedding to determine the best match.

🧩 CLIP doesn’t need retraining for new tasks; it just needs new text prompts.
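
To make the "same embedding space" idea concrete, here is a minimal standalone sketch (the example.jpg path is just a placeholder for any local image) that encodes an image and a few candidate labels separately and picks the label with the highest cosine similarity:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder: any local image works
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

with torch.no_grad():
    # Encode the image and the texts into the shared embedding space
    image_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=labels, return_tensors="pt", padding=True))

# Normalize, then compare with cosine similarity (a dot product of unit vectors)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T        # shape: (1, number_of_labels)
print(labels[similarity.argmax().item()])  # the closest label wins

This is essentially what CLIPModel does internally (up to a learned scaling factor) when it produces the logits_per_image scores we will use in the app below.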

🌍 Use Cases of CLIP

  • Content Moderation: Flag inappropriate visuals based on given labels

  • Image Search Engines: Enable flexible, label-free querying

  • Smart Filters: Detect categories like food, nature, or people dynamically

Now, let's experiment with this in Gradio so we can visualize the results interactively.

💻 Code with Gradio

from PIL import Image
import requests
import gradio as gr
from transformers import CLIPProcessor, CLIPModel

# Load the model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Classification function
def classify_image(image, labels):
    labels = [label.strip() for label in labels.split(",")]
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1).detach().numpy()[0]
    return {label: float(f"{score:.2f}") for label, score in zip(labels, probs)}

# Gradio Interface
demo = gr.Interface(
    fn=classify_image,
    inputs=[
        gr.Image(type="pil"),
        gr.Textbox(label="Comma-separated Labels", placeholder="e.g., cat, dog, airplane")
    ],
    outputs="label",
    title="Zero-Shot Image Classification with CLIP",
    description="Upload an image and enter labels to see what CLIP predicts."
)

demo.launch()

🔍 Code Explanation:

✅ 1. Importing Libraries

  • PIL helps process and manipulate images.

  • requests is useful for downloading images (not used here directly but often kept handy).

  • gradio is used to build the interactive web UI.

  • transformers library provides the pre-trained CLIP model and processor.

✅ 2. Load the CLIP Model and Processor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
  • This downloads and loads the CLIP model from Hugging Face.

  • CLIPModel is the neural network.

  • CLIPProcessor handles preprocessing of both images and text (e.g., resizing, tokenizing, etc.)
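
If you are curious what the processor actually hands to the model, a quick inspection sketch (using a blank placeholder image so it runs on its own, with the processor loaded above) looks like this:

from PIL import Image

dummy = Image.new("RGB", (640, 480), color="gray")  # stand-in for a real photo
inputs = processor(text=["cat", "dog"], images=dummy, return_tensors="pt", padding=True)
print(inputs.keys())                 # input_ids, attention_mask, pixel_values
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224]) -- resized and normalized image
print(inputs["input_ids"].shape)     # one tokenized sequence per label, padded to the same length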

✅ 3. Define the Classification Function

def classify_image(image, labels):
    labels = [label.strip() for label in labels.split(",")]
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1).detach().numpy()[0]
    return {label: float(f"{score:.2f}") for label, score in zip(labels, probs)}

Step-by-step:

  • labels.split(","): Takes a comma-separated list like cat, dog, panda and converts it into a list.

  • processor(...) : Prepares the image and labels for the model in the correct format.

  • model(...): Runs both the image and labels through the CLIP model.

  • logits_per_image: The output scores representing similarity between the image and each label.

  • softmax: Converts scores into probabilities.

  • return {label: score}: Outputs a dictionary mapping labels to their respective confidence values.
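
Since classify_image is a plain Python function, you can also sanity-check it outside the UI. The sketch below reuses the requests import from earlier; the URL points to a sample photo of two cats that is commonly used in the transformers documentation:

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # sample image (two cats on a couch)
image = Image.open(requests.get(url, stream=True).raw)
print(classify_image(image, "cat, dog, airplane"))
# Prints a dict mapping each label to its probability; "cat" should come out on top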


✅ 4. Create Gradio Interface

demo = gr.Interface(
    fn=classify_image,
    inputs=[
        gr.Image(type="pil"),
        gr.Textbox(label="Comma-separated Labels", placeholder="e.g., cat, dog, airplane")
    ],
    outputs="label",
    title="Zero-Shot Image Classification with CLIP",
    description="Upload an image and enter labels to see what CLIP predicts."
)
  • This sets up the user interface with Gradio:

    • Upload an image

    • Enter labels separated by commas

    • Get the model’s prediction back
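
The string shortcut outputs="label" is equivalent to using the gr.Label component directly. If you ever want to cap how many labels are displayed, a small optional variation looks like this:

demo = gr.Interface(
    fn=classify_image,
    inputs=[
        gr.Image(type="pil"),
        gr.Textbox(label="Comma-separated Labels", placeholder="e.g., cat, dog, airplane")
    ],
    outputs=gr.Label(num_top_classes=3, label="Top predictions"),  # show only the 3 most likely labels
    title="Zero-Shot Image Classification with CLIP",
    description="Upload an image and enter labels to see what CLIP predicts."
)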

✅ 5. Launch the App

demo.launch()
  • This line starts the web server — it will work in a Jupyter notebook, local script, or Hugging Face Spaces.
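
launch() also accepts a few optional arguments that come in handy, for example:

demo.launch(
    share=True,        # creates a temporary public link you can send to others
    server_port=7860,  # Gradio's default port; change it if 7860 is already taken
)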

At a high level, the prediction works like this: the image and each candidate label are encoded into the shared embedding space, the similarity between the image and every label is computed, and a softmax over those similarities gives the final probabilities.

🧪 Sample Prompt to Try

Try uploading an image and inputting labels like:

  • mountain, river, house, castle

  • cat, dog, panda, horse

  • happy, sad, angry, calm

The app will return the best match along with a confidence score for each label.

⚠️ Limitations

  • CLIP works best with general concepts — it might not perform well on niche or highly specific categories

  • The predictions depend heavily on how the labels are phrased (see the prompt-template sketch below)

  • It may not work well with noisy or blurry images
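
A common trick to reduce the sensitivity to phrasing (used in the original CLIP paper) is to wrap each label in a caption-like template before encoding it. Here is a sketch of how the classify_image function above could be adapted; the template text is just one reasonable choice:

def classify_image(image, labels):
    labels = [label.strip() for label in labels.split(",")]
    # CLIP was trained on caption-like text, so "a photo of a dog"
    # usually works better than the bare word "dog"
    prompts = [f"a photo of a {label}" for label in labels]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1).detach().numpy()[0]
    return {label: float(score) for label, score in zip(labels, probs)}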

🚀 Deploy to Hugging Face Spaces

  1. Create a new Space and choose Gradio + GPU (if needed)

  2. Upload the code file as app.py

  3. Add requirements.txt:

     transformers==4.41.1
     gradio
     torch
     Pillow
    
  4. Commit and deploy. You’ll see your app running in the App tab of your Space.

🔗 Example Space: Try the app here

🎯 Try building your own classifier

  • Classify art by mood

  • Detect object types in sketches

  • Create a meme detector 😄

🙌 Wrap-Up

Zero-shot image classification is a game-changer for anyone who wants flexible, on-the-fly classification without training their own model. Whether you're a student, developer, or just curious, this app is your playground. Try uploading your own images and test it with creative labels!

🚀 Go on, give it a shot — no training required!
