Enhance Image Recognition with Zero-Shot Classification: Using CLIP and Gradio

🧠 Introduction

Traditional image classification models need to be trained on a fixed set of labeled classes. But what if we could classify new images into any category we want, without retraining the model? That’s where zero-shot classification comes in.

With zero-shot image classification, you can upload an image and provide custom labels, and the model returns the label that best describes the image, even if it was never trained on those specific labels. This is made possible by powerful vision-language models like CLIP (Contrastive Language-Image Pre-training).

🧬 What is CLIP?

CLIP, developed by OpenAI, is a model that learns to connect images and text. It uses two encoders:

  • One for images (usually a Vision Transformer or CNN)

  • One for text (usually a Transformer)

Both encoders map inputs into the same embedding space. At inference time, it compares the similarity between the image embedding and each label’s text embedding to determine the best match.

🧩 CLIP doesn’t need retraining for new tasks; it just needs new text prompts.
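
To make the "same embedding space" idea concrete, here is a minimal standalone sketch (the example.jpg path is just a placeholder for any local image) that encodes an image and a few candidate labels separately and picks the label with the highest cosine similarity:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder: any local image works
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

with torch.no_grad():
    # Encode the image and the texts into the shared embedding space
    image_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=labels, return_tensors="pt", padding=True))

# Normalize, then compare with cosine similarity (a dot product of unit vectors)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T        # shape: (1, number_of_labels)
print(labels[similarity.argmax().item()])  # the closest label wins

This is essentially what CLIPModel does internally (up to a learned scaling factor) when it produces the logits_per_image scores we will use in the app below.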

🌍 Use Cases of CLIP

  • Content Moderation: Flag inappropriate visuals based on given labels

  • Image Search Engines: Enable flexible, label-free querying

  • Smart Filters: Detect categories like food, nature, or people dynamically

Now, let's experiment with this in Gradio so we can visualize the results interactively.

💻 Code with Gradio

from PIL import Image
import requests
import gradio as gr
from transformers import CLIPProcessor, CLIPModel

# Load the model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Classification function
def classify_image(image, labels):
    labels = [label.strip() for label in labels.split(",")]
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1).detach().numpy()[0]
    return {label: float(f"{score:.2f}") for label, score in zip(labels, probs)}

# Gradio Interface
demo = gr.Interface(
    fn=classify_image,
    inputs=[
        gr.Image(type="pil"),
        gr.Textbox(label="Comma-separated Labels", placeholder="e.g., cat, dog, airplane")
    ],
    outputs="label",
    title="Zero-Shot Image Classification with CLIP",
    description="Upload an image and enter labels to see what CLIP predicts."
)

demo.launch()

🔍 Code Explanation:

✅ 1. Importing Libraries

  • PIL helps process and manipulate images.

  • requests is useful for downloading images (not used here directly but often kept handy).

  • gradio is used to build the interactive web UI.

  • transformers library provides the pre-trained CLIP model and processor.

✅ 2. Load the CLIP Model and Processor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
  • This downloads and loads the CLIP model from Hugging Face.

  • CLIPModel is the neural network.

  • CLIPProcessor handles preprocessing of both images and text (e.g., resizing, tokenizing, etc.)
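
If you are curious what the processor actually hands to the model, a quick inspection sketch (using a blank placeholder image so it runs on its own, with the processor loaded above) looks like this:

from PIL import Image

dummy = Image.new("RGB", (640, 480), color="gray")  # stand-in for a real photo
inputs = processor(text=["cat", "dog"], images=dummy, return_tensors="pt", padding=True)
print(inputs.keys())                 # input_ids, attention_mask, pixel_values
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224]) -- resized and normalized image
print(inputs["input_ids"].shape)     # one tokenized sequence per label, padded to the same length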

✅ 3. Define the Classification Function

def classify_image(image, labels):
    labels = [label.strip() for label in labels.split(",")]
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1).detach().numpy()[0]
    return {label: float(f"{score:.2f}") for label, score in zip(labels, probs)}

Step-by-step:

  • labels.split(","): Takes a comma-separated list like cat, dog, panda and converts it into a list.

  • processor(...) : Prepares the image and labels for the model in the correct format.

  • model(...): Runs both the image and labels through the CLIP model.

  • logits_per_image: The output scores representing similarity between the image and each label.

  • softmax: Converts scores into probabilities.

  • return {label: score}: Outputs a dictionary mapping labels to their respective confidence values.
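
Since classify_image is a plain Python function, you can also sanity-check it outside the UI. The sketch below reuses the requests import from earlier; the URL points to a sample photo of two cats that is commonly used in the transformers documentation:

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # sample image (two cats on a couch)
image = Image.open(requests.get(url, stream=True).raw)
print(classify_image(image, "cat, dog, airplane"))
# Prints a dict mapping each label to its probability; "cat" should come out on top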


✅ 4. Create Gradio Interface

demo = gr.Interface(
    fn=classify_image,
    inputs=[
        gr.Image(type="pil"),
        gr.Textbox(label="Comma-separated Labels", placeholder="e.g., cat, dog, airplane")
    ],
    outputs="label",
    title="Zero-Shot Image Classification with CLIP",
    description="Upload an image and enter labels to see what CLIP predicts."
)
  • This sets up the user interface with Gradio:

    • Upload an image

    • Enter labels separated by commas

    • Get the model’s prediction back
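
The string shortcut outputs="label" is equivalent to using the gr.Label component directly. If you ever want to cap how many labels are displayed, a small optional variation looks like this:

demo = gr.Interface(
    fn=classify_image,
    inputs=[
        gr.Image(type="pil"),
        gr.Textbox(label="Comma-separated Labels", placeholder="e.g., cat, dog, airplane")
    ],
    outputs=gr.Label(num_top_classes=3, label="Top predictions"),  # show only the 3 most likely labels
    title="Zero-Shot Image Classification with CLIP",
    description="Upload an image and enter labels to see what CLIP predicts."
)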

✅ 5. Launch the App

demo.launch()
  • This line starts the web server — it will work in a Jupyter notebook, local script, or Hugging Face Spaces.
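
launch() also accepts a few optional arguments that come in handy, for example:

demo.launch(
    share=True,        # creates a temporary public link you can send to others
    server_port=7860,  # Gradio's default port; change it if 7860 is already taken
)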

At a high level, the prediction works like this: the image and each candidate label are encoded into the shared embedding space, the similarity between the image and every label is computed, and a softmax over those similarities gives the final probabilities.

🧪 Sample Prompt to Try

Try uploading an image and inputting labels like:

  • mountain, river, house, castle

  • cat, dog, panda, horse

  • happy, sad, angry, calm

The app will return the best match along with a confidence score for each label.

⚠️ Limitations

  • CLIP works best with general concepts — it might not perform well on niche or highly specific categories

  • The predictions depend heavily on how the labels are phrased (see the prompt-template sketch below)

  • It may not work well with noisy or blurry images
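
A common trick to reduce the sensitivity to phrasing (used in the original CLIP paper) is to wrap each label in a caption-like template before encoding it. Here is a sketch of how the classify_image function above could be adapted; the template text is just one reasonable choice:

def classify_image(image, labels):
    labels = [label.strip() for label in labels.split(",")]
    # CLIP was trained on caption-like text, so "a photo of a dog"
    # usually works better than the bare word "dog"
    prompts = [f"a photo of a {label}" for label in labels]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1).detach().numpy()[0]
    return {label: float(score) for label, score in zip(labels, probs)}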

🚀 Deploy to Hugging Face Spaces

  1. Create a new Space and choose Gradio + GPU (if needed)

  2. Upload the code file as app.py

  3. Add requirements.txt:

     transformers==4.41.1
     gradio
     torch
     Pillow
    
  4. Commit and deploy. You’ll see your app running in the App tab of your Space.

🔗 Example Space: Try the app here

🎯 Try building your own classifier

  • Classify art by mood

  • Detect object types in sketches

  • Create a meme detector 😄

🙌 Wrap-Up

Zero-shot image classification is a game-changer for anyone who wants flexible, on-the-fly classification without training their own model. Whether you're a student, developer, or just curious, this app is your playground. Try uploading your own images and test it with creative labels!

🚀 Go on, give it a shot — no training required!
