How did LLMs gain Vision?

Guillaume Guy

Introduction

LLMs were created as text-only chatbots as an artifact of their training paradigm: they learn to predict the next token (~word). Photos, videos, and other richer media aren't made of words and were therefore naturally excluded from the training protocol.

However, most LLMs are now multimodal. How did that happen?

Image as Cross-Encoded signals

Paper

One of the early papers to explore multimodality is DeepMind's Flamingo from 2022. The technique consists of interleaving text and photos and having the LLM complete the sequence by following the pattern.

Their architecture is relatively simple and can be summarized in 3 points (*):

  • Keep the Vision Encoder and the language model frozen. The implicit assumption is that both are already "multimodal-ready" since they are both semantically supervised, so the only missing piece is the "bridge" between the two "languages". This setup greatly reduces the number of learned parameters.

  • Add a learned <image> token to mark the location of the n-th photo in the text.

  • For each layer of the LM, add a cross-attention block that attends to the photo representations (see the sketch after this list).

(*) Leaving out the Perceiver Resampler, which is an interesting component but not central to the architecture. More details in the paper.
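To make the "bridge" concrete, here is a minimal PyTorch sketch of a Flamingo-style gated cross-attention block, as referenced in the list above. The module names, dimensions, and gate placement are illustrative assumptions rather than the paper's exact implementation (which also routes image features through the Perceiver Resampler first).

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Illustrative Flamingo-style block: text tokens attend to frozen
    image features; tanh gates initialized at zero let the frozen LM
    start from its original behavior and learn to use vision gradually."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffw = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # Gates start at 0, so the block is a no-op at initialization.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, text_len, d_model) from the frozen LM
        # image_features: (batch, n_image_tokens, d_model) from the frozen vision encoder
        attn_out, _ = self.cross_attn(query=text_hidden, key=image_features, value=image_features)
        x = text_hidden + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffw_gate) * self.ffw(x)
        return x  # fed into the next (frozen) LM layer
```

Only the new cross-attention and gate parameters are trained: because the gates start at zero, the frozen LM initially behaves exactly as before and progressively learns to pull in visual information.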

Image as a dense token

Paper

PaliGemma, a recent 3B model released by Google in 2024, shows another way to integrate vision signals. It uses the output from the Image Encoder as dense tokens, which can be combined with text tokens after a projection. This way, the LLM encounters new dense tokens that aren't part of the token dictionary and can handle them just like text tokens.

Architecture

The paper is very accessible and full of useful details for replication; in particular, the authors include a clear, step-by-step training recipe.

The above architecture is interesting in several ways:

  • First, it eliminates the cross-attention system, which is computationally demanding because it is added to each layer of the LLM.

  • Instead, the authors use a simple "Linear Projection," a technique previously used by LLaVA (a minimal sketch follows below).

  • The authors allowed the Image Encoder to be fine-tuned, stating that it improves performance on the "blind spots" of its contrastive pretraining, specifically relation and localization (*).

(*) It would be interesting to see whether this remains true, as SigLIP-2 tried to address these blind spots by adding extra losses for these tasks.
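To contrast with the cross-attention approach, below is a minimal sketch of the dense-token path, assuming a SigLIP-style patch encoder; the dimensions, names, and vocabulary size are illustrative, not PaliGemma's actual configuration.

```python
import torch
import torch.nn as nn

# Illustrative dimensions, not PaliGemma's actual configuration.
d_vision, d_model, vocab_size = 1152, 2048, 256_000

text_embedding = nn.Embedding(vocab_size, d_model)
projection = nn.Linear(d_vision, d_model)  # the only learned "bridge" in this sketch

def build_input_sequence(image_features: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Project image-patch features into the LM's embedding space and
    prepend them to the text embeddings; downstream, the LM treats
    both kinds of dense tokens identically."""
    image_tokens = projection(image_features)  # (B, n_patches, d_model)
    text_tokens = text_embedding(input_ids)    # (B, text_len, d_model)
    return torch.cat([image_tokens, text_tokens], dim=1)

# Toy usage: 256 image patches from a SigLIP-style encoder plus a short prompt.
image_features = torch.randn(1, 256, d_vision)
input_ids = torch.randint(0, vocab_size, (1, 16))
sequence = build_input_sequence(image_features, input_ids)  # (1, 272, d_model)
```

Compared with the Flamingo sketch, the trainable bridge here is a single linear layer, and the LM's self-attention sees image and text tokens in one flat sequence.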

What matters when training a VLM?

Paper

In 2024, the team at HuggingFace aimed to find out what matters when tuning a VLM. They reached the following conclusions:

  • Better pretraining (for both the LLM and the vision encoder) results in better performance on downstream tasks (Finding #1).

  • Cross-attention generally provides better performance than fully autoregressive models, but it requires 10% more compute. However, this difference disappears when LoRA adapters are used (Finding #2).

The paper also contains many other insights, such as the link between image splitting and performance on reading tasks.
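As a rough illustration of the image-splitting idea (the crop layout and sizes below are assumptions, not the paper's exact recipe): encoding a downscaled global view plus a grid of native-resolution crops preserves small text, which is what helps on reading-heavy tasks.

```python
from PIL import Image

def split_image(img: Image.Image, grid: int = 2, crop_size: int = 448) -> list[Image.Image]:
    """Assumed strategy: one downscaled copy of the full image plus a
    grid x grid set of sub-crops, each resized to the vision encoder's
    input size and encoded independently."""
    crops = [img.resize((crop_size, crop_size))]  # global view
    w, h = img.size
    for row in range(grid):
        for col in range(grid):
            box = (col * w // grid, row * h // grid,
                   (col + 1) * w // grid, (row + 1) * h // grid)
            crops.append(img.crop(box).resize((crop_size, crop_size)))
    return crops  # 1 + grid**2 images, each fed to the vision encoder
```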

Conclusion

We compared two major early architectures for integrating vision into LLMs, both popular up to 2024. Going forward, we expect to see more "natively" multimodal, co-trained models. GPT-4o pioneered this approach in 2024, but its architecture and training details are still unclear. One hypothesis, supported by Gwern, is that it uses a VQ-VAE to tokenize photos and fold them into its token dictionary. With interleaved data, the LLM's autoregressive training then applies "natively": the model must predict the next token in either modality. Only time will tell.
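If that hypothesis holds, the mechanics could look roughly like the sketch below: a VQ-VAE encoder turns an image into a grid of codebook indices, and those indices are offset past the text vocabulary so that the same next-token objective covers both modalities. All the sizes, the 32x32 code grid, and the offset scheme are speculative illustrations, not GPT-4o's actual design.

```python
import torch

# Speculative illustration of "images as tokens in the same dictionary".
text_vocab_size = 100_000
codebook_size = 8_192                   # number of VQ-VAE codebook entries
image_token_offset = text_vocab_size    # image codes live after the text vocabulary

def image_codes_to_token_ids(code_indices: torch.Tensor) -> torch.Tensor:
    """Map a grid of VQ-VAE codebook indices (e.g. 32x32 codes per image)
    to ids in the LM's extended vocabulary, so image tokens and text tokens
    share the same next-token prediction objective."""
    return (code_indices + image_token_offset).flatten()

# Toy usage: pretend a VQ-VAE encoder produced a 32x32 grid of codes.
codes = torch.randint(0, codebook_size, (32, 32))
image_ids = image_codes_to_token_ids(codes)   # shape: (1024,)
text_ids = torch.tensor([17, 934, 62])        # some text token ids
sequence = torch.cat([text_ids, image_ids])   # an interleaved training sequence
```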
