What is CLIP and why does it matter?

Guillaume Guy

Paper: Learning Transferable Visual Models From Natural Language Supervision, Radford et al. (2021)

Introduction

As we discussed in previous posts, Contrastive Learning isn't new: FAIR's MoCo was published in 2019, along with many other papers. CLIP, however, introduces something new. Let's take a step back and look at the landscape at the time of its publication.

Around 2021, there were two major areas of work in computer vision:

  • Supervised prediction over well-structured, fixed sets of classes (e.g., ImageNet)

  • Self-Supervised Learning, which removes the need for expensive labels

Meanwhile, the team at OpenAI noticed that the web held over a billion captioned images that had gone largely untapped because of the challenges of handling natural language supervision. For instance, Li et al. (2017) tried to predict phrase n-grams from images, but the approach didn't generalize well, leading to low scores on ImageNet.

The question was clear: how can this web data be used to train effective, generalist computer vision models? And can we leverage both text and image modalities?

Modality alignment

The main claim of this seminal paper is twofold:

  • One can approach the challenge as a modality-alignment problem, i.e., learning a shared latent space for images and text

  • Trained on web data, the resulting model achieves strong zero-shot performance across a variety of tasks, i.e., it generalizes well.

On the first claim, the architecture is relatively simple: two encoders, one for each modality. Note that there is no decoding step (i.e., no attempt to generate the caption from the image). Instead, the objective is to "align" the two encoders: for instance, the picture of the Aussie pup (see below) ends up in the same vicinity of the latent space as the caption "Pepper the aussie pup". Two encoders, one shared latent space.
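To make the alignment concrete, here is a minimal sketch using the Hugging Face transformers wrappers around the released model. The "openai/clip-vit-base-patch32" checkpoint and the local "pepper.jpg" file are illustrative choices, not something prescribed by the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any of the released CLIP variants would do.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("pepper.jpg")  # hypothetical local photo of the Aussie pup
texts = ["Pepper the aussie pup", "a diagram of a transformer"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Both encoders project into the same latent space; normalize so that the
# dot product between image and text embeddings is a cosine similarity.
img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
print(img @ txt.T)  # the matching caption should score highest
```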

On the second claim, they show that the model provides rich semantic representations and beats self-supervised methods such as SimCLRv2 on downstream classification in the zero-shot setting (i.e., with no task-specific training), which is a remarkable claim. Further, with the same number of training examples, CLIP is about 10 percentage points more performant.
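Zero-shot classification falls out of the same mechanism: class names are rewritten as captions (e.g., "a photo of a dog"), and the image is assigned to the caption it is most similar to. A minimal sketch, again assuming the same illustrative checkpoint and a local test image:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["dog", "cat", "airplane"]            # illustrative label set
prompts = [f"a photo of a {label}" for label in labels]
image = Image.open("pepper.jpg")               # any local test image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_labels)

probs = logits.softmax(dim=-1).squeeze()
print({label: round(p.item(), 3) for label, p in zip(labels, probs)})
```

No gradient step ever touches the downstream labels; the "classifier" is just the text encoder applied to the prompts.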

Why does CLIP matter?

CLIP is the first computer vision architecture to "mine the web," effectively overcoming the challenges of poor data quality. Its strength lies in its ability to generalize across a wide range of domains, making it a strong choice for your vision backbone.

Secondly, the image embeddings are semantically supervised, making them well suited for giving vision to LLMs (more details in my blog post here). As a matter of fact, CLIP has become the main (only?) vision backbone for multimodal LLMs.

Finally, it democratizes image search. By encoding all images and adding them to a vector store (e.g., FAISS), one can find images similar to a text or image prompt. This technique has become popular because it eliminates the need for complex infrastructure.
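As a rough sketch of that recipe (the checkpoint, file names, and query string below are all illustrative): embed the image collection once with the image encoder, index the vectors in FAISS, and embed each text query with the text encoder at search time.

```python
import faiss
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]  # your collection
images = [Image.open(p) for p in image_paths]

# Embed and L2-normalize the images, then index them with inner product
# (inner product on unit-norm vectors == cosine similarity).
with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
index = faiss.IndexFlatIP(img_emb.shape[1])
index.add(img_emb.numpy().astype("float32"))

# Query the index with free text.
with torch.no_grad():
    q = model.get_text_features(
        **processor(text=["a dog playing on the beach"], return_tensors="pt", padding=True)
    )
q = q / q.norm(dim=-1, keepdim=True)
scores, ids = index.search(q.numpy().astype("float32"), 2)
print([image_paths[i] for i in ids[0]])
```

A flat (exact) index is fine for smaller collections; for larger ones, FAISS offers approximate indexes (e.g., IVF, HNSW) that trade a little recall for speed.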

It is no surprise that AI labs continue to invest heavily in CLIP-style learning. For instance, Google recently released SigLIP 2, which incorporates additional losses to address some of CLIP's weaknesses: plain CLIP features are not well suited for segmentation and depth estimation, so the authors added LocCa-style tasks. FAIR, for its part, expanded to many more modalities with ImageBind.

Conclusion

Go check it out on Hugging Face: SigLIP 2 is open weights and released under the permissive Apache 2.0 license.
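For a quick start, something along these lines should work with the transformers zero-shot image-classification pipeline; the checkpoint id and the local image path below are assumptions to verify against the Hugging Face hub.

```python
from transformers import pipeline

# Assumed checkpoint id; double-check the exact SigLIP 2 model names on the hub.
classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip2-base-patch16-224",
)
print(classifier("pepper.jpg", candidate_labels=["an aussie pup", "a cat", "a bicycle"]))
```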
