#AISprint Welcome to the Multimodal-verse: A Beginner's Guide
Hey there, weary traveler! Feeling overwhelmed by the AI revolution? Everywhere you look, it's AI this, AI that. And now you're hearing whispers about "multimodal something something." Don't sweat it, my friend. I've got your back!
Let's dive into the fascinating world of multimodality together. This blog post series is designed to be your friendly guide through the multimodal landscape. We'll keep things simple and beginner-friendly (but you'll need at least some AI 101 under your belt).
Get ready to expand your human brain with some cool concepts and techniques!
Here's what we'll cover in this series:
- Plato's Cave: A philosophical warm-up
- Unimodality and its limitations: Why one just isn't enough
- Unique information in different modalities: Spice up your AI life
- From Satellite to Earth: A Case Study on Why We Should Go Multimodal
  - Satellite bands: More than meets the eye
  - Night vision with thermal imagery
- This is Cool, But Why So Complex?
  - Which task are you trying to solve?
  - The messy world of fusion
  - Timing is everything: When to fuse?
  - Fusion techniques: From simple to sophisticated
- You've Got My Attention, Where Do I Start?
  - Welcome to Flax: Your new best friend (tiny teaser right after this list)
  - Data loaders, modeling, and training loops
  - Evaluation, tracking, and model management
  - Colab 🧡 GitHub Codespaces
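Since Flax will be doing the heavy lifting later in the series, here's a tiny teaser of what that code tends to look like: a minimal, hypothetical Flax Linen encoder. The names, sizes, and the normalization choice are placeholders for illustration, not the series' final code.

```python
# A minimal, hypothetical Flax Linen encoder: placeholder names and sizes,
# just to show the flavor of the code we'll write later in the series.
import jax
import jax.numpy as jnp
import flax.linen as nn


class TinyEncoder(nn.Module):
    """Maps a flat input vector to an L2-normalized embedding."""
    embed_dim: int = 128

    @nn.compact
    def __call__(self, x):
        x = nn.Dense(256)(x)
        x = nn.relu(x)
        x = nn.Dense(self.embed_dim)(x)
        # Normalize so embeddings live on the unit sphere (more on why later).
        return x / (jnp.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)


# Usage: initialize parameters with dummy data, then embed a small batch.
model = TinyEncoder()
params = model.init(jax.random.PRNGKey(0), jnp.ones((1, 32)))
embeddings = model.apply(params, jnp.ones((4, 32)))  # shape: (4, 128)
```

If that snippet reads like gibberish right now, don't worry: the "Where Do I Start?" posts will build it up piece by piece.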
Throughout this journey, we'll tackle some common questions and challenges:
- Why is classification so picky about representations?
- What's the deal with modality translation and high-quality features?
- The great encoder debate: To leak or not to leak?
- Fusion timing: Early, mid, or late? (Spoiler: It depends!)
- Embedding shenanigans: Size matters, and so does normalization
- Fusion techniques: From "concatenate and chill" to "attention fusion and thrill" (sketched right below)
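To make "concatenate and chill" concrete before we dress it up, here's a hedged sketch of the simplest fusion baseline: assume we already have one feature vector per modality (the names and shapes below are made up), concatenate them, and let a small classifier head do the rest.

```python
# "Concatenate and chill": the simplest fusion baseline, sketched in Flax.
# Hypothetical names and dimensions; later posts cover fancier attention fusion.
import jax
import jax.numpy as jnp
import flax.linen as nn


class ConcatFusionClassifier(nn.Module):
    num_classes: int = 10

    @nn.compact
    def __call__(self, image_feats, text_feats):
        # Glue the per-modality features together along the feature axis...
        fused = jnp.concatenate([image_feats, text_feats], axis=-1)
        # ...and let a small head sort out the rest.
        fused = nn.relu(nn.Dense(256)(fused))
        return nn.Dense(self.num_classes)(fused)


# Usage with dummy features: a batch of 4 image and 4 text embeddings.
model = ConcatFusionClassifier()
params = model.init(jax.random.PRNGKey(0),
                    jnp.ones((4, 128)), jnp.ones((4, 64)))
logits = model.apply(params, jnp.ones((4, 128)), jnp.ones((4, 64)))  # (4, 10)
```

The attention-based variants swap this blunt concatenation for learned mixing between modalities; we'll get there step by step.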
So, buckle up, brew a whole berrad (teapot) of Moroccan tea, and sip on the knowledge I am about to drop on you.
Acknowledgments
The Google AI/ML Developer Programs team supported this work by providing Google Cloud credits.
References
I will try to keep the citation numbering consistent across the rest of the posts in this series.
Resources
ML Ascent 7: Building Your First Multimodal Deep Learning Model with JAX/Flax/TensorFlow, Part 1 (YouTube link, slide deck).
Salama, K. (2021, January 30). Natural language image search with a dual encoder. Retrieved from https://keras.io/examples/vision/nl_image_search/
Attention in transformers, visually explained | Chapter 6, Deep Learning (3Blue1Brown, YouTube).
Self-supervised learning: The dark matter of intelligence. (n.d.). Meta AI Blog. https://ai.meta.com/blog/self-supervised-learning-the-dark-matter-of-intelligence/
Papers and Theses
Le-Khac, P. H., Healy, G., & Smeaton, A. F. (2020). Contrastive representation learning: A framework and review. IEEE Access, 8, 193907–193934.
Jia, C., Yang, Y., Xia, Y., Chen, Y., Parekh, Z., Pham, H., Le, Q., Sung, Y., Li, Z., & Duerig, T. (2021). Scaling up Visual and Vision-Language representation learning with noisy text supervision. International Conference on Machine Learning, 4904–4916. http://proceedings.mlr.press/v139/jia21b/jia21b.pdf
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. arXiv. https://arxiv.org/abs/2103.00020
Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023, October). Sigmoid loss for language image pre-training. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 11941-11952). IEEE. https://doi.org/10.1109/ICCV51070.2023.01100
Li, S., Zhang, L., Wang, Z., Wu, D., Wu, L., Liu, Z., Xia, J., Tan, C., Liu, Y., Sun, B., & Li, S. Z. (2024). Masked modeling for self-supervised representation learning on vision and beyond. arXiv. https://arxiv.org/abs/2401.00897
Bachmann, R., Kar, O. F., Mizrahi, D., Garjani, A., Gao, M., Griffiths, D., Hu, J., Dehghan, A., & Zamir, A. (2024, June 13). 4M-21: An Any-to-Any Vision model for tens of tasks and modalities. arXiv.org. https://arxiv.org/abs/2406.09406
Bao, H., Dong, L., Piao, S., & Wei, F. (2021, June 15). BEIT: BERT Pre-Training of Image Transformers. arXiv.org. https://arxiv.org/abs/2106.08254
Balestriero, R., Ibrahim, M., Sobal, V., Morcos, A., Shekhar, S., Goldstein, T., Bordes, F., Bardes, A., Mialon, G., Tian, Y., Schwarzschild, A., Wilson, A. G., Geiping, J., Garrido, Q., Fernandez, P., Bar, A., Pirsiavash, H., LeCun, Y., & Goldblum, M. (2023, April 24). A cookbook of Self-Supervised Learning. arXiv.org. https://arxiv.org/abs/2304.12210
Zadeh, A., Chen, M., Poria, S., Cambria, E., & Morency, L. (2017, July 23). Tensor Fusion Network for Multimodal Sentiment Analysis. arXiv.org. https://arxiv.org/abs/1707.07250
Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., & Isola, P. (2020, May 20). What makes for good views for contrastive learning? arXiv.org. https://arxiv.org/abs/2005.10243
Huang, Y., Du, C., Xue, Z., Chen, X., Zhao, H., & Huang, L. (2021, June 8). What Makes Multi-modal Learning Better than Single (Provably). arXiv.org. https://arxiv.org/abs/2106.04538
Nagrani, A., Yang, S., Arnab, A., Jansen, A., Schmid, C., & Sun, C. (2021, June 30). Attention bottlenecks for multimodal fusion. arXiv.org. https://arxiv.org/abs/2107.00135
Liu, Z., Shen, Y., Lakshminarasimhan, V. B., Liang, P. P., Zadeh, A., & Morency, L. (2018, May 31). Efficient Low-rank Multimodal Fusion with Modality-Specific Factors. arXiv.org. https://arxiv.org/abs/1806.00064
Wang, X., Chen, G., Qian, G., Gao, P., Wei, X., Wang, Y., Tian, Y., & Gao, W. (2023, February 20). Large-scale Multi-Modal Pre-trained Models: A comprehensive survey. arXiv.org. https://arxiv.org/abs/2302.10035
Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O. K., Singhal, S., Som, S., & Wei, F. (2022, August 22). Image as a Foreign Language: BEIT Pretraining for all Vision and Vision-Language tasks. arXiv.org. https://arxiv.org/abs/2208.10442
Liang, P. P. (2024, April 29). Foundations of multisensory artificial Intelligence. arXiv.org. https://arxiv.org/abs/2404.18976
Huang, S., Pareek, A., Seyyedi, S., Banerjee, I., & Lungren, M. P. (2020). Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. Npj Digital Medicine, 3(1). https://doi.org/10.1038/s41746-020-00341-z