Enhance Image Recognition with Zero-Shot Classification: Using CLIP and Gradio


🧠 Introduction
Traditional image classification models need to be trained on a fixed set of labeled classes. But what if we could classify new images into any category we want, without retraining the model? That’s where zero-shot classification comes in.
With zero-shot image classification, you can upload an image and provide custom labels, and the model returns the label that best describes the image, even if it was never trained on those labels. This is made possible by powerful vision-language models like CLIP (Contrastive Language-Image Pre-training).
🧬 What is CLIP?
CLIP, developed by OpenAI, is a model that learns to connect images and text. It uses two encoders:
One for images (usually a Vision Transformer or CNN)
One for text (usually a Transformer)
Both encoders map inputs into the same embedding space. At inference time, it compares the similarity between the image embedding and each label’s text embedding to determine the best match.
🧩 CLIP doesn’t need retraining for new tasks; you just give it new text prompts.
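To make this concrete, here is a minimal sketch of that comparison using the same openai/clip-vit-base-patch32 checkpoint used later in this post; the image path and label phrasings are placeholders of my own, not part of the app code.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder path: any local image works
labels = ["a photo of a cat", "a photo of a dog"]

with torch.no_grad():
    # Each encoder maps its input into the shared embedding space
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=labels, return_tensors="pt", padding=True)
    image_emb = model.get_image_features(**image_inputs)
    text_emb = model.get_text_features(**text_inputs)

# Normalise, then compare with cosine similarity: the highest score wins
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)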
🌍 Use Cases of CLIP
Content Moderation: Flag inappropriate visuals based on given labels
Image Search Engines: Enable flexible, label-free querying
Smart Filters: Detect categories like food, nature, or people dynamically
Now, let’s start experimenting with this in Gradio for a better visualization.
💻 Code with Gradio
from PIL import Image
import requests
import gradio as gr
from transformers import CLIPProcessor, CLIPModel

# Load the model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Classification function
def classify_image(image, labels):
    labels = [label.strip() for label in labels.split(",")]
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1).detach().numpy()[0]
    return {label: float(f"{score:.2f}") for label, score in zip(labels, probs)}

# Gradio Interface
demo = gr.Interface(
    fn=classify_image,
    inputs=[
        gr.Image(type="pil"),
        gr.Textbox(label="Comma-separated Labels", placeholder="e.g., cat, dog, airplane")
    ],
    outputs="label",
    title="Zero-Shot Image Classification with CLIP",
    description="Upload an image and enter labels to see what CLIP predicts."
)

demo.launch()
🔍 Code Explanation:
✅ 1. Importing Libraries
PIL helps open and manipulate images.
requests is useful for downloading images (not used directly here, but handy to keep around).
gradio is used to build the interactive web UI.
transformers provides the pre-trained CLIP model and processor.
✅ 2. Load the CLIP Model and Processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
This downloads and loads the CLIP model from Hugging Face.
CLIPModel is the neural network itself.
CLIPProcessor handles preprocessing of both images and text (resizing, tokenizing, etc.).
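If you want to see what the processor actually produces, a quick inspection looks like this; the blank image is just a dummy input for checking shapes.

from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dummy = Image.new("RGB", (640, 480))  # blank image, only to inspect shapes

inputs = processor(text=["cat", "dog"], images=dummy, return_tensors="pt", padding=True)
print(list(inputs.keys()))           # includes 'input_ids', 'attention_mask', 'pixel_values'
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224]) for this checkpoint
print(inputs["input_ids"].shape)     # one tokenised sequence per label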
✅ 3. Define the Classification Function
def classify_image(image, labels):
    labels = [label.strip() for label in labels.split(",")]
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1).detach().numpy()[0]
    return {label: float(f"{score:.2f}") for label, score in zip(labels, probs)}
Step-by-step:
labels.split(","): takes a comma-separated string like cat, dog, panda and turns it into a list.
processor(...): prepares the image and labels for the model in the correct format.
model(...): runs both the image and the labels through the CLIP model.
logits_per_image: the output scores representing the similarity between the image and each label.
softmax: converts the scores into probabilities.
return {label: score}: outputs a dictionary mapping each label to its confidence value.
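Since requests is already imported, you can also call classify_image outside Gradio on an image fetched from a URL. A hedged example, assuming the model, processor, and classify_image defined above are in scope; the URL is a placeholder to replace with a real image link.

import requests
from PIL import Image

# Placeholder URL: swap in any publicly accessible image
url = "https://example.com/cat.jpg"
image = Image.open(requests.get(url, stream=True).raw)

print(classify_image(image, "cat, dog, panda"))
# e.g. {'cat': 0.97, 'dog': 0.02, 'panda': 0.01}  (illustrative values)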
✅ 4. Create Gradio Interface
demo = gr.Interface(
    fn=classify_image,
    inputs=[
        gr.Image(type="pil"),
        gr.Textbox(label="Comma-separated Labels", placeholder="e.g., cat, dog, airplane")
    ],
    outputs="label",
    title="Zero-Shot Image Classification with CLIP",
    description="Upload an image and enter labels to see what CLIP predicts."
)
This sets up the user interface with Gradio:
Upload an image
Enter labels separated by commas
Get the model’s prediction back
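The string outputs="label" is shorthand for Gradio's Label component. If you prefer to configure it explicitly, for example to show only the top three labels, a sketch of that variation:

# Same app, but with an explicit Label component limited to the top 3 matches
demo = gr.Interface(
    fn=classify_image,
    inputs=[
        gr.Image(type="pil"),
        gr.Textbox(label="Comma-separated Labels", placeholder="e.g., cat, dog, airplane")
    ],
    outputs=gr.Label(num_top_classes=3),
    title="Zero-Shot Image Classification with CLIP",
    description="Upload an image and enter labels to see what CLIP predicts."
)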
✅ 5. Launch the App
demo.launch()
This line starts the web server. It works in a Jupyter notebook, a local script, or on Hugging Face Spaces.
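launch() also takes a few handy options. For instance, share=True generates a temporary public link so others can try your demo without deploying it; this is optional and not needed for local use.

# Optional: temporary public URL plus a fixed local port
demo.launch(share=True, server_port=7860)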
🧪 Sample Prompt to Try
Try uploading an image and inputting labels like:
mountain, river, house, castle
cat, dog, panda, horse
happy, sad, angry, calm
The app returns each label with its confidence score, with the best match ranked first.
⚠️ Limitations
CLIP works best with general concepts — it might not perform well on niche or highly specific categories
The predictions depend heavily on how the labels are phrased (see the prompt-template sketch after this list)
It may not work well with noisy or blurry images
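Because phrasing matters, a common trick is to wrap each label in a prompt template such as "a photo of a ..." before sending it to CLIP. Here is a sketch of that tweak to the classification function above; the template wording is a suggestion, not part of the original app.

def classify_image(image, labels):
    labels = [label.strip() for label in labels.split(",")]
    # CLIP was trained on image captions, so "a photo of a cat"
    # usually scores more reliably than the bare word "cat"
    prompts = [f"a photo of a {label}" for label in labels]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1).detach().numpy()[0]
    return {label: float(f"{score:.2f}") for label, score in zip(labels, probs)}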
🚀 Deploy to Hugging Face Spaces
Create a new Space and choose Gradio + GPU (if needed)
Upload the code file as app.py
Add a requirements.txt with:
transformers==4.41.1
gradio
torch
Pillow
Commit and deploy. You’ll see your app running in the app section.
🔗 Example Space: Try the app here
🎯 Try building your own classifier
Classify art by mood
Detect object types in sketches
Create a meme detector 😄
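Each of these ideas only needs a different label string. The label sets below are rough suggestions of my own, assuming the classify_image function from earlier is in scope and the file name is a placeholder.

from PIL import Image

# Hypothetical label sets for the ideas above; only the labels change
mood_labels = "a joyful painting, a melancholic painting, a chaotic painting"
sketch_labels = "a sketch of a face, a sketch of a building, a sketch of an animal"
meme_labels = "a meme with text, a regular photograph, a screenshot"

print(classify_image(Image.open("artwork.jpg"), mood_labels))  # placeholder file name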
🙌 Wrap-Up
Zero-shot image classification is a game-changer for anyone who wants flexible, on-the-fly classification without training their own model. Whether you're a student, developer, or just curious, this app is your playground. Try uploading your own images and test it with creative labels!
🚀 Go on, give it a shot — no training required!