Vision Language Model

What is a Vision Language Model?
Vision Language Models (VLMs) are changing the game for multimodal content creation and interaction by bridging the gap between visual and textual cognition. Ongoing research keeps pushing multimodal architectures to new heights, bringing models a step closer to human perception in their understanding of images and natural language. Fast-paced VLM research is propelling innovation in areas such as creative content generation, assistive technology, image understanding, and robotics, and most major organizations have invested heavily in the field over the last few years. Multimodal learning also makes VLMs remarkably versatile: in healthcare, they can support medical image analysis, aiding diagnosis and treatment planning; in the automotive industry, they contribute to advanced driver-assistance systems and even self-driving cars.
History of VLMs
From 2019 to 2023, Vision-Language Models (VLMs) evolved rapidly from foundational frameworks like ViLBERT and VisualBERT, which integrated visual and textual data, to sophisticated models like CLIP, enabling zero-shot learning with natural language. The period saw the rise of specialized models such as FashionCLIP and image generation models like DALL-E 2 and GLIDE. By 2023, advanced models like Instruct-Imagen and UNIMO-G emerged, reflecting significant advancements in multimodal understanding and generation, showcasing the increasing specialization and refinement in VLM capabilities.
Families of VLMs
Contrastive training is a commonly used strategy that relies on pairs of positive and negative examples. The VLM is trained to produce similar representations for the positive pairs and dissimilar representations for the negative pairs. Masking is another strategy for training VLMs: the model reconstructs masked image patches given an unmasked text caption, and, symmetrically, by masking words in a caption, it can be trained to reconstruct those words given an unmasked image.
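To make the contrastive idea concrete, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss. The embedding dimension, batch size, and temperature are illustrative assumptions, and the random tensors stand in for the outputs of real image and text encoders:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matching image-text pairs."""
    # Normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: diagonal entries are the positive pairs,
    # off-diagonal entries act as in-batch negatives.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Random stand-ins for encoder outputs: a batch of 8 image-text pairs, 512-dim.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```

Each image is pulled toward its own caption (the diagonal of the similarity matrix) and pushed away from every other caption in the batch, which is exactly the positive/negative pairing described above.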
ViLBERT is another example of a vision-language model built around the attention mechanism. It extends the famous BERT architecture into two parallel BERT-style streams that operate over image regions and text segments and interact through co-attentional transformer layers.
For multimodal fusion, this model employs a co-attention Transformer architecture. This architecture dedicates separate Transformers to process image region and text features, with an additional Transformer fusing these representations.
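As a rough illustration of co-attention, the sketch below lets text tokens attend over image region features and vice versa, the way ViLBERT's co-attentional layers condition each stream on the other. The dimensions, number of heads, and sequence lengths are illustrative assumptions rather than ViLBERT's actual configuration:

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Toy co-attention block: each modality queries the other."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.text_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, image_feats, text_feats):
        # Text tokens attend over image regions; image regions attend over text.
        text_out, _ = self.text_to_image(text_feats, image_feats, image_feats)
        image_out, _ = self.image_to_text(image_feats, text_feats, text_feats)
        return image_out, text_out

# 36 image regions and 20 text tokens for a batch of 2 (toy numbers).
block = CoAttentionBlock()
img, txt = block(torch.randn(2, 36, 768), torch.randn(2, 20, 768))
print(img.shape, txt.shape)  # (2, 36, 768) (2, 20, 768)
```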
PrefixLM is an NLP learning technique mostly used for model pre-training: the model takes part of a text (a prefix) as input and learns to predict the next word in the sequence. In vision-language models, PrefixLM enables the model to predict the next words based on an image and its accompanying prefix text. It leverages a Vision Transformer (ViT) that divides the image into a one-dimensional sequence of patches, each representing a local image region.
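The sketch below illustrates this setup: a strided convolution patchifies the image into a ViT-style sequence of patch embeddings, the text prefix is embedded and appended, and the model scores the next word from the final position. The patch size, dimensions, and the single self-attention layer standing in for a full autoregressive decoder are all simplifying assumptions:

```python
import torch
import torch.nn as nn

patch_size, dim, vocab = 16, 256, 1000

# Patchify: a strided convolution turns a 224x224 image into a
# (224/16)^2 = 196-step sequence of patch embeddings, as in a ViT.
to_patches = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
token_emb = nn.Embedding(vocab, dim)
# A single self-attention layer stands in for the autoregressive decoder
# (a real PrefixLM decoder would apply causal masking over the text part).
block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
to_vocab = nn.Linear(dim, vocab)

image = torch.randn(1, 3, 224, 224)
prefix_tokens = torch.randint(0, vocab, (1, 5))         # toy prefix token ids

patches = to_patches(image).flatten(2).transpose(1, 2)  # (1, 196, dim)
prefix = token_emb(prefix_tokens)                       # (1, 5, dim)

# Image patches followed by the text prefix form one sequence; the logits
# at the final position score the next word over the vocabulary.
sequence = torch.cat([patches, prefix], dim=1)
next_word_logits = to_vocab(block(sequence))[:, -1]
print(next_word_logits.shape)  # (1, 1000)
```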
MLM (Masked Language Modeling) works in language models like BERT by masking (hiding) a portion of a textual sequence and training the model to predict the missing tokens. ITM (Image-Text Matching) trains the model to predict whether a given caption actually describes the paired image.
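A minimal sketch of how these two objectives are typically set up on toy data is shown below; the token ids, mask token id, and 15% masking rate are illustrative assumptions borrowed from BERT-style training:

```python
import torch

tokens = torch.tensor([[101, 2023, 3899, 2003, 2652, 102]])  # toy token ids
MASK_ID = 103

# MLM: randomly mask ~15% of positions; the original ids become the labels
# (-100 marks unmasked positions so the loss ignores them).
mask = torch.rand(tokens.shape) < 0.15
mlm_labels = torch.where(mask, tokens, torch.tensor(-100))
mlm_inputs = torch.where(mask, torch.tensor(MASK_ID), tokens)

# ITM: label 1 when the caption matches the image, 0 for a mismatched
# caption drawn from elsewhere in the batch.
itm_labels = torch.tensor([1, 0])
print(mlm_inputs, mlm_labels, itm_labels)
```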
Main Tasks of VLMs
Visual Question Answering : Visual Question Answering (VQA) consists of interpreting visual elements such as images or videos to provide textual responses to specific queries. VQA can play a great role in interactive educational platforms, virtual customer assistance, and helping visually impaired people.
Visual Captioning: Generate descriptive text captions for visual elements like images or videos. Visual captioning, along with translation services, has great potential in education, entertainment, news and media, and many other fields.
Visual Commonsense Reasoning: Visual Commonsense Reasoning focuses on assessing the underlying correlations, associations, and fine-grained details in visual content, including images and videos. This simulates human cognitive visual perception while also reasoning about the smallest details of the visual content.
Multimodal Affective Computing: In this aspect, the emphasis is on the interpretation of both visual and text-based inputs to discern emotional states or moods. The integration of multimodal affective computing is pivotal to human-computer interaction. This brings empathetic and context-sensitive responses to various human-centric uses.
Natural Language for Visual Reasoning: This task evaluates the credibility of textual statements that describe visual elements like images or videos. With the emergence of generative AI, verifying extracted and generated information against the underlying visual evidence has become one of the hardest problems to solve. Natural Language for Visual Reasoning helps with fact verification and content moderation for images and videos.
Multimodal Machine Translation: Vision-language models can translate text while considering additional visual context, such as images or videos. This task enhances the accuracy and richness of translations, making them valuable for applications like international e-commerce, cross-cultural social media platforms, and global news dissemination.
Working of VLMs
VLMs process both images and natural language text to perform various tasks. These models are designed for tasks such as image captioning, image-text matching, and answering questions about visual and textual content. A VLM consists of three key elements:
• An image encoder
• A text encoder, and
• A strategy to fuse information from the two encoders
Put differently, the following three elements are essential in a VLM (a minimal sketch of how they fit together follows this list):
Machine vision. Translates raw pixels into representations of the lines, shapes and forms of objects in visual imagery, such as determining if an image has a cat or a dog.
LLMs. Connects the dots between concepts expressed across many different contexts such as all the ways we might interact with dogs versus cats.
Fusing aspects. Automates the process of labeling parts of an image and connecting them to words in an LLM, such as describing a dog as sitting, eating, chasing squirrels or walking with its owner.
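As referenced above, here is a minimal sketch of how the three components fit together: a small image encoder, a small text encoder, and a fusion layer that combines their outputs. Every module size here is an illustrative assumption, not a published architecture:

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        # Image encoder: a small CNN that pools the image into one vector.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        # Text encoder: token embeddings averaged over the sequence.
        self.text_encoder = nn.Embedding(vocab, dim)
        # Fusion: concatenate the two modalities and project to a shared space.
        self.fusion = nn.Linear(2 * dim, dim)

    def forward(self, image, token_ids):
        img_vec = self.image_encoder(image)
        txt_vec = self.text_encoder(token_ids).mean(dim=1)
        return self.fusion(torch.cat([img_vec, txt_vec], dim=-1))

model = TinyVLM()
fused = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 10)))
print(fused.shape)  # (2, 256)
```

The fused representation is what a downstream head would consume, whether for captioning, matching, or question answering.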
Use Cases and Conclusion
Use Cases
We’ve explored the diverse tasks VLMs enable and their unique architectural strengths. Let’s now delve into real-world applications of these powerful models:
Image Search and Retrieval: VLMs power fine-grained image search using natural language queries, changing how we find relevant imagery online and even when retrieving from non-relational databases.
Visual Assistance for the Impaired: Models can generate detailed image descriptions or answer questions about visual content, aiding visually impaired individuals.
Enhanced Product Discovery: VLMs enable users to search for products using images or detailed descriptions, streamlining the shopping experience and boosting e-commerce websites.
Automatic Content Moderation: Detecting inappropriate or harmful content within images and videos often leverages VLM capabilities.
Conclusion
Putting it all together, a Vision-Language Model (VLM) learns to understand the relationship between images and their corresponding text descriptions. This is achieved through a combination of techniques:
Image and Text Preprocessing: Both image and text data undergo preprocessing steps to ensure compatibility with the VLM architecture. Images are normalized, resized, and potentially adjusted further based on the model's requirements. Text is tokenized, cleaned, and padded to a fixed length, or encoded with a pre-trained sentence encoder.
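A minimal sketch of this kind of preprocessing is shown below, using torchvision transforms for the image and a toy whitespace tokenizer for the text; a real pipeline would use the model's own tokenizer and normalization statistics (the ImageNet mean and standard deviation here are common defaults, assumed for illustration):

```python
import torch
from torchvision import transforms
from PIL import Image

image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def tokenize(text, vocab, max_len=16, pad_id=0):
    """Toy tokenizer: lowercase, split on spaces, map to ids, pad/truncate."""
    ids = [vocab.get(w, 1) for w in text.lower().split()][:max_len]
    return torch.tensor(ids + [pad_id] * (max_len - len(ids)))

vocab = {"a": 2, "dog": 3, "playing": 4, "in": 5, "the": 6, "park": 7}
image = image_transform(Image.new("RGB", (640, 480)))   # placeholder image
tokens = tokenize("A dog playing in the park", vocab)
print(image.shape, tokens.shape)  # (3, 224, 224), (16,)
```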
Implicit Text-Image Fusion: Separate encoders for images and text create embedding vectors that capture their essential features. A key component, the CrossAttention layer, performs an attention-based fusion. This allows the model to focus on relevant parts of the text description based on the image content, and vice versa. This combined representation captures the relationship between the visual and textual data.
Evaluation using Cosine Similarity: The cosine similarity metric is used to evaluate the model's performance. A high cosine similarity between the encoded image and its valid text description indicates the model's ability to learn the correspondence between visual content and textual meaning. Conversely, a lower cosine similarity for an irrelevant text description demonstrates the model's capacity to differentiate relevant and irrelevant text for a given image.
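The evaluation step can be sketched in a few lines; the embeddings below are random stand-ins for the model's actual image and text outputs:

```python
import torch
import torch.nn.functional as F

image_emb = torch.randn(512)
matching_text_emb = image_emb + 0.1 * torch.randn(512)  # stand-in for a good match
random_text_emb = torch.randn(512)                       # stand-in for an unrelated caption

sim_match = F.cosine_similarity(image_emb, matching_text_emb, dim=0)
sim_random = F.cosine_similarity(image_emb, random_text_emb, dim=0)

# A well-trained VLM should score the true caption clearly higher.
print(f"matching: {sim_match:.3f}  unrelated: {sim_random:.3f}")
```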