A Beginner-Friendly Review of Trustworthiness in Vision-Language Multimodal AI Systems

Toyibat Adele
10 min read

Artificial intelligence (AI) has changed the way we live and work. Today, many people use AI tools to make everyday tasks easier. Some even rely on them for facts or to help make important decisions. I may or may not have used AI myself while writing this article.

A gif showing John Krasinski leaning down slowly into a car to hide.

People have come to depend on AI tools, but how can we be sure we can trust them? I’m sure you’ve prompted an AI tool at some point and it gave you information that wasn’t valid, but presented it as if it were. This is actually an issue with large language models called hallucination. This article from IBM really gets into the nitty-gritty of the concept.

But that raises even more questions: how can we ensure that the information these systems give us is actually credible and fair? How do we build trustworthy AI systems? Why does this even matter in the first place? What do we even need to see to trust these systems?

I read a paper recently titled ‘Building Trustworthy Multimodal AI: A Review of Fairness, Transparency, and Ethics in Vision Language Tasks’. This paper helped me understand trustworthiness in the context of multimodal AI systems (and even AI systems in general) and why it is so important. It covers three important areas of concern that speak to the trustworthiness of vision-language systems: transparency, fairness, and ethics.

I’ll be reviewing the paper in this article, and hopefully, by the end, you’ll also understand what trustworthiness is and why it matters so much for AI systems.

What are Vision Language Tasks?

This paper reviews trustworthiness in vision-language tasks, but what exactly is a vision-language task? Vision-language tasks are tasks an AI tool performs that combine visual input (like images or videos) with language input or output (like questions, captions, or conversations). These tasks require the model to understand the relationship between the inputs.

The Core Vision Language Tasks

An image showing a Venn diagram of different vision language tasks.

There are quite a number of vision-language tasks, as shown in the image above. This paper looks at the three core tasks, which are:

1. Visual Question Answering (VQA): In this task, the model answers natural language questions based on visual input. In other words, the user sends in a picture and asks a question about it, and the model responds (there’s a small code sketch of this after the subcategories below).


An image showing the architecture of visual question answering models from the paper.

To demonstrate this, I prompted ChatGPT on the GPT-4o model, as you can see in the image below.

Visual question answering tasks can be broken further into the following subcategories:

a. Image Question Answering (ImageQA): Answers questions about a single image.

b. Video Question Answering (VideoQA): Answers questions based on a video or sequence of images.

c. Knowledge-Based VQA (KB-VQA): Uses extra knowledge (like Wikipedia) to answer questions about an image.
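If you’d like to poke at a VQA model yourself outside of ChatGPT, here’s a minimal sketch using the open-source ViLT model through Hugging Face’s transformers library. The model choice and the image file name are placeholders of mine for illustration, not something taken from the paper.

```python
from transformers import pipeline
from PIL import Image

# Load a small open-source VQA model (ViLT fine-tuned for visual question answering).
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

image = Image.open("dog_in_park.jpg")  # placeholder: any local image you want to ask about

# The model answers a natural language question about the visual input.
answers = vqa(image=image, question="What animal is in the picture?")
print(answers[0])  # e.g. {'answer': 'dog', 'score': 0.97}
```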

2. Visual Dialogue: This task is exactly what it says: holding a dialogue. The user can have a multi-turn conversation with the model about a visual input, and the system should be able to answer follow-up questions while remembering what has already been said.

3. Visual Captioning: For this task, the model should be able to look at an image and tell you what it sees. This involves using computer vision techniques to analyze the image and natural language to describe it (there’s a small code sketch of this after the images below).


Image showing the architecture of image captioning models from the paper.

To demonstrate this, I prompted Claude on the Sonnet 3.7 model, as shown in the image below.

Image showing a screenshot of an image captioned by the Claude Sonnet 3.7 model
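To try captioning in code, here’s a minimal sketch using the open-source BLIP captioning model via transformers. Again, the model choice and the file name are my own placeholders rather than anything from the paper.

```python
from transformers import pipeline
from PIL import Image

# Load an open-source image captioning model (BLIP).
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

image = Image.open("beach_sunset.jpg")  # placeholder: any local image

# The model looks at the image and describes it in natural language.
captions = captioner(image)
print(captions[0]["generated_text"])  # e.g. "a sunset over the ocean"
```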

4. Other tasks: Another important task mentioned is visual reasoning, which involves identifying the relationships that exist within and between images. VQA is an example of this: the model has to reason about patterns and relationships in the visual input before it responds.

Areas of Concern in Trustworthiness

We’ve gone over the core vision-language tasks performed by multimodal AI systems. The paper covers the following three tasks in its review: visual question answering, image captioning (a subset of visual captioning), and visual dialogue. In this section, we’ll look at the areas of concern for the trustworthiness of these tasks, what each area means, and what it entails for each task.

Transparency

Transparency means the model shows how it makes decisions. The transparency of a system is closely tied to its explainability: the model should be able to give a detailed explanation of how it arrived at a decision, which helps users trust the system.

A gif showing a woman speaking with the text "You've got some explaining to do".

This is what it means for the chosen vision language tasks:

a. Explainability in VQA

A good VQA model should not just answer, but also explain how it got there. This helps users understand the AI’s reasoning. You can see this in action with the GPT-4o model in the image below.

Image showing a screenshot of a conversation with the GPT-4o model explaining how it answered a question based on a visual input.

b. Explainability in Image Captioning

Good image captioning models should show why they chose certain words in a caption. Some techniques used here are:

  1. Attention maps: These show where the AI is “looking” in the image when it gives an answer or writes a caption by highlighting those parts (see the sketch after this list).

  2. Latent space perturbations: This means making tiny changes inside the model’s “thinking space” to see what parts matter most. It helps to understand which parts of the model affect the output and how.

  3. Image-to-word-based models: This method links specific parts of an image to the exact words in the caption. This makes the caption easier to trust because we can see how each word relates to the image.

  4. Segmentation-based models: The image is broken into parts (like puzzle pieces), and each part is connected to parts of the caption.
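To make the attention-map idea a bit more concrete, here’s a rough sketch of how you might overlay an attention map on an image with NumPy, Pillow, and Matplotlib. The 7x7 grid of attention weights here is random, standing in for whatever a real captioning model would actually produce, and the file name is a placeholder.

```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

image = Image.open("dog_in_park.jpg")  # placeholder: the image the model "looked at"
attention = np.random.rand(7, 7)       # placeholder: a real model outputs a grid of attention weights

# Upscale the coarse 7x7 attention grid to the image size so the two can be overlaid.
heatmap = Image.fromarray((attention * 255).astype(np.uint8)).resize(image.size, Image.BILINEAR)

plt.imshow(image)
plt.imshow(heatmap, cmap="jet", alpha=0.4)  # brighter regions = where the model "looked" most
plt.axis("off")
plt.show()
```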

The image below shows this kind of explanation in action with the Claude Sonnet 3.7 model.

Image showing a screenshot of a conversation with the Claude Sonnet 3.7 model explaining how it captioned an image.

c. Explainability in Visual Dialogue

In vision-language tasks, especially visual dialogue, it's important for the model to explain its decisions clearly. Methods like deconfounded learning help improve the quality of the model's explanations and the feedback it gets from users. Natural Language Explanations (NLEs) give simple, human-friendly insight into AI decisions, making them easier to understand. Research shows that helpful explanations can also improve AI performance, which is why clear explanations are so important in AI systems. Some techniques used here include:

  1. Deconfounded learning: This method helps the model avoid wrong guesses based on patterns in the training data. For example, if it often sees women with babies, it might wrongly assume every woman has a baby. This method teaches the AI to focus on what’s really in the image, not just what it has seen often before.

  2. Data filtering: The AI is trained only on good-quality conversations. Bad or biased dialogue examples are filtered out using a scoring system (see the sketch after this list). This improves the quality of the model’s responses and makes it less likely to repeat harmful or untrustworthy content.

  3. AI-framed questioning: This lets users ask why the AI gave a certain answer or challenge its logic. It turns the interaction into a two-way conversation, making the AI more open, testable, and explainable.
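As a rough idea of what the data-filtering step could look like, here’s a tiny sketch: each dialogue gets a quality score from some scoring function, and only dialogues above a threshold are kept for training. The scoring rule below is a made-up stub for illustration; real systems typically use learned classifiers or human ratings.

```python
# Hypothetical dialogue examples: each is a list of (speaker, utterance) turns.
dialogues = [
    [("user", "What is the child holding?"), ("model", "A red balloon.")],
    [("user", "What is the child holding?"), ("model", "idk lol")],
]

def quality_score(dialogue) -> float:
    """Stand-in scoring rule: reward informative turns, penalise throwaway answers."""
    informative = sum(len(utterance.split()) > 2 for _, utterance in dialogue) / len(dialogue)
    text = " ".join(utterance for _, utterance in dialogue).lower()
    has_filler = any(word in text for word in ("idk", "lol"))
    return informative - (0.5 if has_filler else 0.0)

# Keep only dialogues whose score clears the threshold before training on them.
filtered = [d for d in dialogues if quality_score(d) >= 0.5]
print(len(filtered), "of", len(dialogues), "dialogues kept")  # 1 of 2 dialogues kept
```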

Fairness

Fairness refers to principles that ensure AI doesn’t reinforce bias, treat people unfairly, or support harmful stereotypes. Bias means the AI system is making decisions or predictions that are unfair or one-sided, often because of problems in the data it was trained on or how it was built.

A gif showing Homer from The Simpsons yelling and pointing at a machine he claims is rigged.

Bias can be reduced by changing the data, updating how the model learns, or correcting unfair predictions. To learn more about AI bias and how it happens, you can read this article by IBM. You can go even further and see some examples of models where bias was at play in this article from Datatron.

a. Fairness in VQA Models

VQA models can give unfair or biased answers. Bias often comes from over-relying on either modality: the image or the question. This is called modality bias (language or visual).
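A simple way to probe language-side modality bias is to ask the same question once with the real image and once with a blank image: if the answers match, the model is probably leaning on the wording of the question rather than on what it actually sees. Here is a minimal sketch, reusing the same open-source ViLT model as earlier (the file name is again a placeholder).

```python
from transformers import pipeline
from PIL import Image

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

question = "What color is the banana?"
real_image = Image.open("green_banana.jpg")               # placeholder: a photo of an unripe, green banana
blank_image = Image.new("RGB", real_image.size, "white")  # an image with no useful visual information

with_image = vqa(image=real_image, question=question)[0]["answer"]
without_image = vqa(image=blank_image, question=question)[0]["answer"]

# If both answers come back "yellow", the model is likely relying on a language prior
# ("bananas are usually yellow") instead of the visual evidence in front of it.
print("real image:", with_image, "| blank image:", without_image)
```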

b. Fairness in Visual Dialogue

Research on fairness in visual dialogue focuses on specific demographic contexts such as gender and race. AI models can carry biases, such as gender or racial bias, which affect tasks like VQA, image captioning, and visual dialogue. Just looking at accuracy can be misleading. To ensure fairness, we need to improve both the data and how models are trained, for example by training on more diverse datasets and using fairness-aware algorithms. Studies reviewed in the paper show that bias can be reduced by up to 23%, though accuracy might drop by around 9%. FairCLIP shows it’s possible to strike a good balance between fairness and performance.
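To see why accuracy alone can be misleading, here’s a tiny sketch (with made-up numbers) of one of the simplest fairness checks: computing accuracy per demographic group and looking at the gap between groups rather than just the overall score.

```python
# Made-up evaluation records: (demographic group, was the model's answer correct?)
results = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

def accuracy(records):
    return sum(correct for _, correct in records) / len(records)

overall = accuracy(results)
per_group = {
    group: accuracy([r for r in results if r[0] == group])
    for group in {g for g, _ in results}
}

# A big gap between groups signals unfairness even when the overall number looks fine.
print("overall:", overall)                                         # 0.5
print("per group:", per_group)                                     # group_a: 0.75, group_b: 0.25
print("gap:", max(per_group.values()) - min(per_group.values()))   # 0.5
```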

Ethical Implications

Ethical implications in vision-language tasks involve addressing biases, fairness, and transparency to ensure that AI systems serve all users equitably.

A gif showing a woman speaking with the text, 'They need ethical rules of conduct.'

a. Ethical Implications in VQA

Bias in VQA systems means the AI might give better answers for certain groups or questions while giving worse or unfair answers for others. This can lead to harmful stereotypes, like making wrong assumptions based on gender or race. In multilingual models, the problem can show up as better performance in popular languages like English, but poor performance with other languages, like Swahili or Hindi. This shows the model has a hidden preference for some languages and ignores others that aren’t as common in the training data.

b. Ethical Implications in Image Captioning

Image captioning systems can show biases like gender, race, and other social factors, which can strengthen harmful stereotypes.

A gif from the UNDP on gender bias statistics.

c. Ethical Implications in Visual Dialogue

Visual dialogue systems face challenges related to privacy, data quality, and transparency. They must handle sensitive data fairly and avoid discrimination. The use of multi-agent systems for ethical monitoring helps ensure these tools are built in line with ethical principles.

Methods like data cleaning and model selection can reduce bias, but they still have limits and raise ethical concerns. Balancing fairness, explainability, and accuracy is hard, but working on them together can help make AI more trustworthy. Future research should aim to create AI systems that are both useful and fair.

What is the way forward?

The paper highlights key trustworthiness opportunities in vision-language research:

  1. Real-world applications: Applying fairness, explainability, and ethics in AI is hard because of changing data and technological limits, but upholding these principles is essential for building trust with users.

  2. Advancements in MLLMs: Multimodal large language models (MLLMs) connect visual and text data. Future work should improve these models to make them better and usable for more tasks.

  3. Mitigating hallucinations in LVLMs: A big issue with large vision-language models (LVLMs) is that they sometimes create wrong content. Better solutions are needed to fix this and make models more reliable.

  4. Ensuring consistency across tasks: Models must perform consistently across tasks to keep user trust. Future research should create better ways to evaluate models that also account for fairness and transparency.

Final Thoughts

This study looks at how fair, honest, and ethical AI is when it works with pictures and language, like answering questions about images, describing pictures, or talking about them. These things are important to build AI systems we can trust.

There has been some progress in making AI fair when answering questions about images, but we still need to work more on reducing bias in large language models. The study also shows some improvements in handling ethical problems when AI describes pictures.

As these AI models get better, it’s very important to add clear ethical rules and make them easy to understand. Developers need to focus on building AI that people can trust. Everyone, including researchers, policymakers, and companies, should work together to make AI safer and fair for all.

A gif showing a minion mic dropping.

References

Building Trustworthy Multimodal AI: A Review of Fairness, Transparency, and Ethics in Vision Language Tasks