The state of Self-Supervision for Vision


Introduction
To perform vision tasks effectively, it's important to have a strong, general-purpose vision backbone. Such a backbone lets you handle many tasks, such as:
Image-to-image similarity: For comparison or retrieval.
Vision adapters: Add a lightweight head on top of the frozen backbone to solve specific problems (e.g., cat vs. dog classification); see the sketch below.
Fine-tuning: Adapt the backbone to a specific problem (similar to adapters, usually with higher accuracy, but it requires retraining the entire backbone).
Giving vision to LLMs: By feeding the vision hidden state to the LLM (see PaliGemma as an example), the LLM can learn to interpret images.
The backbone serves as a starting point for these downstream tasks: the better the backbone, the better the performance.
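To make the "vision adapter" idea concrete, here is a minimal PyTorch sketch (my own illustration, not taken from any of the papers below): the backbone is frozen and only a small linear head is trained on top of its embeddings. The backbone, embedding size, and class count are placeholder assumptions.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Frozen backbone + trainable linear head (the "adapter" setting)."""
    def __init__(self, backbone: nn.Module, embed_dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                 # the backbone stays frozen
        self.head = nn.Linear(embed_dim, num_classes)  # e.g. cat vs. dog

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            features = self.backbone(images)   # assumed shape: (batch, embed_dim)
        return self.head(features)             # (batch, num_classes) logits
```

Fine-tuning is the same setup with the `requires_grad = False` loop removed, so gradients flow into the backbone as well.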
Currently, there's a lot of excitement about CLIP, a method that semantically supervises the hidden state (using "words"). It tends to deliver great performance, easier integration with LLMs (since LLMs are semantic machines), and zero-shot capabilities (allowing image retrieval based on text queries). I will write a post on that topic later, but for now, let's discuss self-supervision (usually shortened to SSL).
A selective history of self-supervision
Let’s investigate a few seminal papers to see where we’re coming from.
1- Unsupervised Visual Representation Learning by Context Prediction, 2015
The authors (Doersch et al.) formulated the self-supervision task as a puzzle: split the image into patches and have the model predict where a given patch should be slotted relative to a reference patch. This is called a "pretext task" (meaning it is not the model's overarching objective, but rather a way to force the model to train on a task that will be useful for other applications).
The model is Siamese: it takes in two patches, processes them with shared weights, and applies a late fusion at the end to predict the location of the second patch relative to the first.
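To make the setup concrete, here is a rough PyTorch sketch of such a Siamese, late-fusion model. It is a toy stand-in (the paper uses an AlexNet-style trunk); only the 8-way output, one class per neighboring position, comes from the task definition, while the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class RelativePositionNet(nn.Module):
    """Pretext task: predict where patch B sits relative to reference patch A (8 positions)."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Shared encoder applied to both patches (same weights for A and B).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Late fusion: concatenate both embeddings, then classify the relative position.
        self.classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 8),  # 8 possible neighbor positions around the reference patch
        )

    def forward(self, patch_a: torch.Tensor, patch_b: torch.Tensor) -> torch.Tensor:
        za, zb = self.encoder(patch_a), self.encoder(patch_b)
        return self.classifier(torch.cat([za, zb], dim=1))
```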
The most fascinating finding of this paper is that, even though the training signal is purely geometric and carries no semantic information about the photos or patches, the model learns mid-level semantic structure (e.g., distinguishing objects and surfaces). For instance, the figure below shows an uncanny ability to discern fixtures in a city. We can surmise that the model builds its own internal conceptual representation of the world.
Finally, the authors show that these trained models can be helpful for many downstream applications such as object detection and geometry estimation (section 4).
However, because the task is local (patch-level), it can bias the model toward capturing only short-range dependencies rather than full object understanding.
Top-1 accuracy on ImageNet-1K: 51.4% @ 100M params (source)
2- MoCo: Momentum Contrast for Unsupervised Visual Representation Learning, 2020
MoCo's core improvement is in the realm of negative sampling: supplying the contrastive loss with a large, consistent set of negatives so that learning is as efficient as possible.
To do this, MoCo brings in two innovations:
Momentum Encoder: A slow-moving copy of the query encoder, updated via exponential moving average, ensuring the encoded keys evolve smoothly rather than changing rapidly.
Queue: A dynamically maintained FIFO queue of previous embeddings (keys), allowing the dictionary of negatives to be much larger than the minibatch size without prohibitive memory cost. The queue resembles a memory bank but is simpler and fresher: it maintains only a limited number of recent embeddings, without associating them to specific dataset indices.
So, let's sum it up. The model needs positives and negatives:
Positives: A positive pair consists of two random augmentations of the same image: one passed through the query encoder (actively updated) and one through the momentum encoder (slow-moving copy).
Negatives: Anything currently in the queue.
What’s the current challenge?
- Traditional end-to-end contrastive learning depends on large batch sizes (to supply many negatives), which quickly run into GPU memory limits.
What's MoCo's contribution?
- By decoupling the dictionary size from the minibatch size (via a queue) and ensuring representation consistency (via a momentum encoder), MoCo replicates and exceeds the performance of large-batch end-to-end contrastive learning — scaling up to queues with 65k negatives.
Very cool!
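To make these mechanics concrete, here is a minimal PyTorch-style sketch of the three moving parts: the momentum update of the key encoder, the contrastive loss against the queue, and the FIFO enqueue/dequeue step. The momentum, temperature, and queue size are illustrative values, not a faithful reproduction of the paper's training recipe.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m: float = 0.999):
    # The key (momentum) encoder drifts slowly toward the query encoder.
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)

def moco_loss(q, k, queue, temperature: float = 0.07):
    """q: (B, D) query embeddings, k: (B, D) key embeddings, queue: (K, D) past keys (negatives)."""
    q, k = F.normalize(q, dim=1), F.normalize(k.detach(), dim=1)
    l_pos = (q * k).sum(dim=1, keepdim=True)           # (B, 1) similarity to the positive key
    l_neg = q @ queue.t()                               # (B, K) similarity to queued negatives
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)   # the positive is always at index 0
    return F.cross_entropy(logits, labels)

def dequeue_and_enqueue(queue, new_keys, max_size: int = 65536):
    # FIFO: prepend the freshest keys, drop the oldest so the dictionary keeps a fixed size.
    return torch.cat([F.normalize(new_keys.detach(), dim=1), queue], dim=0)[:max_size]
```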
Top-1 accuracy on ImageNet-1K: 65.4% @ 100M params (source)
3- SimCLR, 2020
Some papers achieve higher performance by adding complexity; the best ones achieve both higher performance and lower complexity. SimCLR belongs to the latter: it discards specialized components like memory banks and momentum encoders, offering a simpler, more scalable framework for contrastive learning.
Figure: Overall architecture of the SimCLR contrastive learning framework
What is SimCLR’s contribution?
First, it introduces a simplified contrastive learning pipeline: given an image, two independent but carefully selected augmentations are applied to create a positive pair. A single model (no second momentum encoder) is trained to distinguish this pair from all others in a large batch (typically 8k examples), treating all other examples as negatives (i.e., contrastive learning).
To further improve the learned representations, SimCLR adds a lightweight projection head on top of the encoder, used only during training. The contrastive loss is applied to the outputs of this projection head, allowing the encoder to retain richer, more transferable features. This architectural tweak significantly boosts linear evaluation performance (e.g., from ~50% to ~65% Top-1 on ImageNet-1K).
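Here is a hedged sketch of that idea: a small MLP projection head sits on top of the encoder output, and a batch-wise contrastive (NT-Xent-style) loss is applied to its outputs. The dimensions and temperature are assumptions, not SimCLR's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Small MLP used only during training; the encoder output beneath it is what you keep."""
    def __init__(self, in_dim: int = 2048, hidden_dim: int = 2048, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """z1, z2: (B, D) projections of the two views. Each view's positive is its counterpart."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)    # (2B, D)
    sim = z @ z.t() / temperature                          # (2B, 2B) pairwise similarities
    sim.fill_diagonal_(float("-inf"))                      # never contrast a sample with itself
    batch = z1.size(0)
    targets = torch.cat([torch.arange(batch, 2 * batch),   # row i (< B) pairs with i + B
                         torch.arange(0, batch)])          # row i (>= B) pairs with i - B
    return F.cross_entropy(sim, targets)
```

At evaluation time the projection head is simply discarded and a linear classifier is trained on the encoder's features.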
Critically, SimCLR systematically studies the role of data augmentation. The authors find that augmentation strategies are not equally effective: combinations like random cropping plus color jittering dramatically outperform others (e.g., crop + color achieving 56% vs. Sobel + rotate yielding only 4%).
Let's dive into the set of transformations mentioned above. The table below analyzes different augmentation compositions and shows that they are not all equal!
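As a rough illustration of composing such transformations, here is a torchvision-style sketch of a crop + color-distortion pipeline producing two views of the same image (the parameter values are assumptions, not the paper's tuned ones):

```python
from torchvision import transforms

# Two independently sampled "views" of the same image form a positive pair.
simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                 # random crop, then resize back
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),  # color distortion
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

def two_views(image):
    # The same image goes through the stochastic pipeline twice -> two different augmentations.
    return simclr_augment(image), simclr_augment(image)
```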
Overall, SimCLR delivers a substantial performance improvement — almost 10 percentage points over previous contrastive methods — while simplifying the overall framework.
Top-1 accuracy on ImageNet-1K: 69% @ 100M params (source)
4- DINO, 2021
Finally, let's talk about DINO from FAIR.
DINO approaches the problem of self-supervision (SSL) as a distillation problem, i.e., having two models (teacher + student) where the student learns from the teacher. This paradigm is common in the LLM space, where the teacher is a much larger, pretrained model that needs to be shrunk for inference efficiency (denoted "mini" in OpenAI parlance). However, in DINO, the teacher also starts from scratch and improves over time. A more descriptive term would therefore be "pair-learning with a centered partner".
OK, let's talk about DINO's contributions now:
Self-supervision as a distillation problem: The student's role is to match the teacher's distribution of scores. The teacher also learns, through a momentum update (its weights are blended with the student's, weighted by a small coefficient).
Avoiding collapse without a contrastive loss: Instead, they "center and sharpen" the teacher's outputs. DINO found a surprisingly simple rule: center (subtract a moving average in the latent space) to prevent any one dimension from dominating, and sharpen (i.e., apply a softmax with a low temperature) to avoid a uniform distribution.
Now, how does the teacher "learn"? Its weights are an exponential moving average of the student's (no SGD update).
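Putting the three pieces together, here is a minimal sketch (my own, with assumed temperatures and momentum coefficients) of the EMA teacher update, the centering and sharpening of the teacher's outputs, and the resulting distillation loss:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(student, teacher, momentum: float = 0.996):
    # The teacher's weights are an exponential moving average of the student's (no SGD step).
    for s_param, t_param in zip(student.parameters(), teacher.parameters()):
        t_param.data.mul_(momentum).add_(s_param.data, alpha=1.0 - momentum)

def dino_loss(student_out, teacher_out, center, t_student: float = 0.1, t_teacher: float = 0.04):
    """The student matches the teacher's centered, sharpened output distribution."""
    teacher_probs = F.softmax((teacher_out - center) / t_teacher, dim=1)  # center + sharpen
    student_logprobs = F.log_softmax(student_out / t_student, dim=1)
    return -(teacher_probs * student_logprobs).sum(dim=1).mean()          # cross-entropy

@torch.no_grad()
def update_center(center, teacher_out, momentum: float = 0.9):
    # Moving average of teacher outputs; subtracting it stops any single dimension from dominating.
    return center * momentum + teacher_out.mean(dim=0) * (1.0 - momentum)
```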
Figure: Diagram of the DINO training protocol
One interesting aspect is that the teacher always keeps an edge over the student, as shown in the figure above. The reason might be that the student's role is to explore while the teacher keeps the student "on the right track" (through the momentum mechanism).
Top-1 accuracy on ImageNet-1K: 75% @ 23M params, ResNet-50 (source)
Conclusion
Doing a large paper review over a long time range always brings an interesting perspective on the field's progression, as the high-level view sorts the important from the unimportant. Looking back at these papers, a few key innovations strike me as important:
At the highest level, SSL + a classification head has kept pace with fully supervised learning. This is a major advantage, as SSL only requires the photos themselves, without the need for expensive and slow labels. Yet, to my knowledge, only a few companies have applied these techniques.
The Teacher+Student paradigm works stunningly well as an SSL technique.
All reviewed authors take special care with their data augmentation. This aspect is critical to creating good visual representations.
Momentum encoders, used as teachers in MoCo and DINO, are a simple yet useful construct for guiding a more eager encoder. This concept may well expand beyond SSL.
Can you frame some of your own problems as self-supervised learning?