Explainable AI for Computer Vision: Advanced CAM Visualizations

Table of contents
- Introduction: Peering Into the Mind of a Neural Network
- The Black Box Problem: Why Can't We Just Trust the AI?
- What is Explainable AI?
- Class Activation Mapping: A Window Into Neural Networks
- Advanced CAM Techniques: Going Beyond the Basics
- Seeing It in Action: CAM for Multi-Object Classification
- Comparing CAM Methods: Which One Should You Use?
- Practical Applications: Why This Matters
- Experimenting with CAM Using Our Interactive Tool
- Getting Started with CAM Visualizations
- Suggested Experiments
- Conclusion: Making AI More Transparent and Trustworthy

Introduction: Peering Into the Mind of a Neural Network
Imagine you're training a self-driving car to recognize pedestrians. One day, it fails to detect a person wearing an unusual outfit. Why did it fail? What was it looking at instead? Without answers to these questions, would you feel comfortable deploying such a system?
This scenario highlights one of the most significant challenges in modern AI: the black box problem. Neural networks make decisions, but we often don't understand why or how they arrived at those conclusions. In high-stakes domains like healthcare, autonomous driving, or security, this lack of transparency isn't just frustrating—it can be dangerous.
Enter Explainable AI (XAI), a set of techniques that help us understand, trust, and improve AI systems by making their decision-making processes more transparent. In computer vision specifically, these techniques help us visualize what features or regions of an image are influencing a model's classifications.
In this blog post, we'll explore how Class Activation Mapping (CAM) techniques can open up the black box of computer vision models, with a particular focus on complex cases like images containing multiple objects. We'll break down how these techniques work, compare the most popular methods, and show how you can experiment with them yourself using a simple interactive tool.
The Black Box Problem: Why Can't We Just Trust the AI?
Let's start with an everyday example: you show a neural network an image, and it tells you with 95% confidence that it contains a dog. Great! But what if the image shows both a dog and a cat together?
- Did the model actually recognize the dog's distinctive features?
- Or was it just looking at the grassy background that happened to appear in many dog photos during training?
- What about the cat—did the model see it but consider it less important, or did it miss it entirely?
Without explainability, we're left guessing. This opacity creates several significant problems:
- Trust Issues: In critical applications, users need to verify the model is making decisions for the right reasons.
- Debugging Challenges: When models make mistakes, engineers struggle to identify what went wrong.
- Hidden Biases: Models might rely on spurious correlations rather than meaningful features.
- Regulatory Concerns: In many sectors, AI systems must provide explanations for their decisions to meet legal requirements.
The black box problem is particularly acute in computer vision because images are inherently complex, containing thousands of potential features that could influence a decision.
What is Explainable AI?
Explainable AI refers to methods and techniques that make the decisions and predictions of AI systems understandable to humans. Think of it as adding a "why" layer to the "what" that AI systems typically provide.
In computer vision, XAI helps us answer questions like:
- What parts of the image did the model focus on?
- Which features were most important for the classification?
- How would changing certain elements affect the prediction?
For complex tasks like classifying images with multiple objects, these insights are invaluable. Consider a wildlife monitoring system that needs to count different animal species in the same frame. Without explainability, we can't verify if the system correctly identified each animal based on appropriate visual features, rather than contextual clues or background elements.
Class Activation Mapping: A Window Into Neural Networks
Among the many techniques in the XAI toolkit, Class Activation Mapping (CAM) stands out for its intuitive approach and visual results. At its core, CAM techniques generate heatmaps that highlight the regions of an image that most influenced a model's classification decision.
The Birth of CAM
The original CAM technique, introduced in 2016, worked by (see the short sketch after this list):
- Taking a CNN that uses Global Average Pooling before the final classification layer
- Extracting the feature maps from the last convolutional layer
- Weighting these feature maps based on their importance for a specific class
- Combining these weighted feature maps to create a heatmap
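To make that weighted combination concrete, here is a minimal NumPy sketch. The shapes and random values are made-up stand-ins for real activations and classifier weights; it only illustrates the arithmetic, not a full CAM pipeline.

```python
import numpy as np

# Toy stand-ins: 512 feature maps of size 7x7 from the last conv layer,
# and one classifier weight per feature map for the chosen class.
feature_maps = np.random.rand(512, 7, 7)
class_weights = np.random.rand(512)

# CAM is a weighted sum of the feature maps, one weight per map.
cam = np.tensordot(class_weights, feature_maps, axes=1)  # shape (7, 7)

# Keep only positive evidence and rescale to [0, 1] for display.
cam = np.maximum(cam, 0)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

# In practice, the small map is then upsampled to the input resolution
# and overlaid on the image as a heatmap.
```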
However, this original method required modifying the network architecture, which wasn't practical for many pre-trained models. This limitation led to the development of more flexible approaches.
Gradient-weighted Class Activation Mapping (Grad-CAM)
Grad-CAM emerged as a more versatile solution that works with any CNN architecture without modification. Here's how it works, in simple terms:
- Run the image through the network to get a prediction
- Pick a specific class prediction to explain (e.g., "dog")
- Calculate how sensitive this class prediction is to each feature map in the last convolutional layer
- Create a weighted combination of these feature maps to show which image regions were most important
The result is a heatmap where brighter areas (often shown in red) indicate regions that strongly influenced the model's decision about that class.
To make this more concrete, imagine you're a detective trying to understand how a witness identified a suspect. Grad-CAM is like asking the witness, "What specific features made you think this was the person?" and then highlighting those features on a photograph. Maybe they focused on distinctive clothing, a particular facial feature, or a unique hairstyle—Grad-CAM shows you what equivalent visual elements the AI focused on.
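If you would like to see what this looks like in code, here is a minimal sketch using the open-source pytorch-grad-cam package and a pretrained torchvision ResNet50. The image path is a placeholder, and the exact API can vary slightly between library versions, so treat it as a starting point rather than a definitive script.

```python
import numpy as np
from PIL import Image
from torchvision import models, transforms
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

# Pretrained ResNet50; the last convolutional block is a common choice of target layer.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
target_layers = [model.layer4[-1]]

# Load and preprocess the image (the file name is a placeholder).
img = Image.open("cat_and_dog.jpg").convert("RGB").resize((224, 224))
rgb_img = np.float32(img) / 255.0
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(img).unsqueeze(0)

# Explain the "tabby cat" prediction (ImageNet class index 281).
cam = GradCAM(model=model, target_layers=target_layers)
grayscale_cam = cam(input_tensor=input_tensor,
                    targets=[ClassifierOutputTarget(281)])[0]

# Overlay the heatmap on the original image and save it.
visualization = show_cam_on_image(rgb_img, grayscale_cam, use_rgb=True)
Image.fromarray(visualization).save("gradcam_cat.jpg")
```

Running the same call with a different ClassifierOutputTarget explains a different class, which is exactly what the multi-object example later in this post relies on.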
Advanced CAM Techniques: Going Beyond the Basics
While Grad-CAM provided a breakthrough in neural network visualization, researchers have continued to refine and improve these techniques.
Grad-CAM++: Better at Multiple Instances
One limitation of Grad-CAM is that it sometimes struggles with images containing multiple instances of the same object. For example, if an image shows three cats, Grad-CAM might only highlight one or two of them.
Grad-CAM++ addresses this by using a more sophisticated weighting scheme. The mathematics gets complex, but conceptually, it's better at:
- Identifying multiple instances of the same object
- Providing more precise object localization
- Generating more visually pleasing and accurate heatmaps
For our cat-and-dog example, Grad-CAM++ would typically do a better job of highlighting both animals when we're looking at either class prediction.
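If you are using the pytorch-grad-cam package from the earlier sketch, switching to Grad-CAM++ is a drop-in change: only the class name differs, the call stays the same.

```python
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

# Reuses the model, target_layers, and input_tensor from the earlier Grad-CAM sketch.
cam = GradCAMPlusPlus(model=model, target_layers=target_layers)
grayscale_cam = cam(input_tensor=input_tensor,
                    targets=[ClassifierOutputTarget(281)])[0]
```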
Score-CAM: A Gradient-Free Approach
Score-CAM takes a completely different tack by eliminating gradients entirely. Instead, it:
- Takes each feature map from the target layer
- Uses it to create a masked version of the original image
- Measures how much each masked image activates the target class
- Weights the feature maps based on these activation scores
The intuition is simple but powerful: "If this feature is important for identifying dogs, then an image highlighting just this feature should strongly activate the 'dog' prediction."
Score-CAM often produces cleaner, more focused heatmaps, especially in cases where gradient calculations might be noisy or unstable.
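To make the procedure explicit, here is a heavily simplified PyTorch sketch of the Score-CAM weighting loop. It skips the batching, mask-normalization details, and baseline handling that a real implementation (such as the ScoreCAM class in pytorch-grad-cam) takes care of, so read it as an illustration of the idea rather than production code.

```python
import torch
import torch.nn.functional as F

def score_cam_heatmap(model, input_tensor, feature_maps, class_idx):
    """Simplified Score-CAM: weight each feature map by the class score of its masked image.

    input_tensor: preprocessed image, shape (1, 3, H, W)
    feature_maps: activations from the target conv layer, shape (num_maps, h, w)
    """
    _, _, H, W = input_tensor.shape
    weights = []
    for fmap in feature_maps:
        # Upsample the feature map to image size and use it as a soft mask.
        mask = F.interpolate(fmap[None, None], size=(H, W),
                             mode="bilinear", align_corners=False)
        mask = (mask - mask.min()) / (mask.max() - mask.min() + 1e-8)
        # Score the masked image for the target class; that score becomes the map's weight.
        with torch.no_grad():
            score = torch.softmax(model(input_tensor * mask), dim=1)[0, class_idx]
        weights.append(score)
    weights = torch.stack(weights)
    # Weighted sum of feature maps, ReLU'd and normalized, is the heatmap.
    cam = torch.relu((weights[:, None, None] * feature_maps).sum(dim=0))
    return cam / (cam.max() + 1e-8)
```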
Seeing It in Action: CAM for Multi-Object Classification
Let's see how these techniques help with our multi-object classification challenge. Imagine an image containing both a cat and a dog:
Visualizing "Cat" Classification: When we generate a CAM heatmap for the "cat" class, we should see the heat concentrated on the cat's features—potentially its ears, face, or distinctive pose.
Visualizing "Dog" Classification: Generating a heatmap for the "dog" class should show heat concentrated on the dog's features—perhaps its snout, ears, or tail.
By comparing these heatmaps, we can:
- Verify the model is looking at the correct objects for each class
- Identify any confusion between similar features
- Understand whether the model is using contextual clues (like a dog leash) rather than the animals themselves
This visual confirmation builds trust in the model and helps identify potential weaknesses before deployment.
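Continuing the earlier pytorch-grad-cam sketch, producing both heatmaps only requires changing the target class. The ImageNet indices below (281 for "tabby cat", 207 for "golden retriever") are illustrative; pick whichever indices match your model's top predictions for your image.

```python
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

# Reuses model, target_layers, input_tensor, and rgb_img from the Grad-CAM sketch above.
cam = GradCAM(model=model, target_layers=target_layers)

cat_cam = cam(input_tensor=input_tensor, targets=[ClassifierOutputTarget(281)])[0]
dog_cam = cam(input_tensor=input_tensor, targets=[ClassifierOutputTarget(207)])[0]

cat_overlay = show_cam_on_image(rgb_img, cat_cam, use_rgb=True)
dog_overlay = show_cam_on_image(rgb_img, dog_cam, use_rgb=True)
# Ideally the "cat" heat sits on the cat and the "dog" heat sits on the dog; heat on
# the background or on the wrong animal is exactly the kind of red flag CAM exposes.
```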
Comparing CAM Methods: Which One Should You Use?
Each CAM method has its strengths and situations where it excels:
- Grad-CAM: The standard approach; works well in most cases and is computationally efficient
- Grad-CAM++: Better when you have multiple objects of the same class or need more precise localization
- Score-CAM: Produces cleaner visualizations but is more computationally intensive
- XGrad-CAM: Provides a good balance between computational efficiency and visualization quality
- EigenCAM: Useful when you want to understand the principal components the model focuses on
- Ablation-CAM: Helps understand the importance of different features by systematically removing them
- HiRes-CAM: Generates higher-resolution heatmaps
- FullGrad: Incorporates information from all layers for more comprehensive explanations
For beginners, Grad-CAM and Grad-CAM++ are great starting points, with Score-CAM as a next step when you need higher quality visualizations.
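If you want to switch between these methods programmatically (much like the interactive tool described below), a simple lookup table works, assuming the pytorch-grad-cam package, which exposes all of them behind the same interface.

```python
from pytorch_grad_cam import (
    GradCAM, GradCAMPlusPlus, ScoreCAM, XGradCAM,
    HiResCAM, EigenCAM, AblationCAM, FullGrad,
)

# User-facing names mapped to their implementations; all share the same call signature.
CAM_METHODS = {
    "Grad-CAM": GradCAM,
    "Grad-CAM++": GradCAMPlusPlus,
    "Score-CAM": ScoreCAM,
    "XGrad-CAM": XGradCAM,
    "HiRes-CAM": HiResCAM,
    "EigenCAM": EigenCAM,
    "Ablation-CAM": AblationCAM,
    "FullGrad": FullGrad,  # draws on all layers rather than a single target layer
}

# Swap methods without touching the rest of the pipeline
# (model and target_layers as in the earlier sketches).
cam = CAM_METHODS["Score-CAM"](model=model, target_layers=target_layers)
```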
Practical Applications: Why This Matters
Beyond satisfying curiosity, CAM techniques have practical applications across many domains:
Medical Imaging
In healthcare, it's not enough for an AI to detect a tumor—doctors need to know what features led to that conclusion. CAM visualizations help physicians verify that the model is focusing on clinically relevant abnormalities rather than artifacts or unrelated features.
Autonomous Vehicles
For self-driving cars, understanding why the system classified an object as a pedestrian, cyclist, or vehicle is crucial for safety. CAM helps engineers verify the model is focusing on the objects themselves rather than contextual elements like road type or time of day.
Wildlife Conservation
In wildlife monitoring systems that count and classify animal species, CAM techniques help verify that the system correctly distinguishes between similar-looking species based on the right features, especially when multiple animals appear in the same frame.
Model Debugging and Improvement
Perhaps the most common use case is debugging and improving models. When a model misclassifies an image, CAM visualizations immediately show what it was focusing on, helping engineers identify the root cause of errors.
For instance, if a model incorrectly classifies a husky as a wolf, a CAM visualization might reveal it's focusing on snow in the background (which appears frequently in wolf images) rather than the animal's features. This insight allows engineers to improve the training data or model architecture.
Experimenting with CAM Using Our Interactive Tool
To make these concepts more concrete, I've created an interactive tool that lets you experiment with different CAM techniques on your own images. This Streamlit application provides a user-friendly interface to:
- Upload any image
- Choose from different CNN architectures (ResNet50 or DenseNet121)
- Select which CAM methods to visualize
- Compare the results side by side
You can find the code and instructions in the GitHub repository, and the key components are implemented in the app.py file.
The tool supports multiple CAM methods, including:
- Grad-CAM
- Grad-CAM++
- Score-CAM
- XGrad-CAM
- HiRes-CAM
- EigenCAM
- Ablation-CAM
- FullGrad
It also provides options for enhancement techniques like augmentation smoothing and eigen smoothing, which can improve the quality of visualizations.
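Assuming the tool builds on pytorch-grad-cam (its method list and options match that library), these enhancements correspond to the aug_smooth and eigen_smooth flags on the CAM call: the first averages heatmaps over augmented copies of the image (horizontal flips and small intensity shifts), and the second keeps only the first principal component of the activations to suppress noise. A minimal example, reusing the earlier setup:

```python
# Same model, target_layers, cam, and input_tensor as in the Grad-CAM sketch above.
grayscale_cam = cam(
    input_tensor=input_tensor,
    targets=[ClassifierOutputTarget(281)],
    aug_smooth=True,    # average the heatmap over augmented copies of the image
    eigen_smooth=True,  # project activations onto their first principal component
)[0]
```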
Getting Started with CAM Visualizations
If you want to try this yourself, here's how to get started:
- Clone the repository:

```bash
git clone https://github.com/shaunliew/gradcam-pytorch-classification-tutorial-streamlit.git
cd gradcam-pytorch-classification-tutorial-streamlit
```

- Set up the environment using uv (a fast Python package installer):

```bash
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate a virtual environment
uv venv
source .venv/bin/activate  # On Windows, use .venv\Scripts\activate

# Install dependencies
uv sync
```

- Run the application:

```bash
streamlit run app.py
```
The app will open in your browser, allowing you to upload images and explore different CAM visualizations.
Suggested Experiments
Once you have the tool running, try these experiments to deepen your understanding:
- Multi-Object Images: Upload an image containing both a cat and a dog. Generate CAM visualizations for both "cat" and "dog" classes, and compare how different methods highlight each animal.
- Model Comparison: Try the same image with both ResNet50 and DenseNet121 to see how different architectures focus on different features.
- Enhancement Options: Toggle the augmentation and eigen smoothing options to see how they affect visualization quality.
- Attention Verification: Upload images where the main subject is off-center or partially obscured. Do the models correctly focus on the subject, or are they distracted by background elements?
Conclusion: Making AI More Transparent and Trustworthy
As AI systems become more prevalent in critical decision-making contexts, the need for explainability grows ever more important. Class Activation Mapping techniques represent a significant step toward making computer vision models more transparent and interpretable.
By visualizing what features influence a model's decisions, we can:
- Build trust in AI systems
- Identify and correct biases and weaknesses
- Debug and improve model performance
- Ensure models focus on relevant features rather than spurious correlations
For complex tasks like multi-object classification, these insights are not just nice to have—they're essential for creating robust, reliable AI systems that we can confidently deploy in real-world applications.
Whether you're a researcher pushing the boundaries of computer vision, an engineer building practical applications, or simply someone curious about how AI "sees" the world, understanding CAM techniques provides valuable insights into the inner workings of neural networks.
The next time you use or build an AI system that classifies images, remember that you don't have to accept its decisions as a black box—you can peer inside and see exactly what it's "thinking."
What images would you like to analyze with CAM? How might these techniques help in your specific domain or application? The interactive tool provided here gives you a starting point to explore these questions and begin your journey into explainable AI for computer vision.
Written by Shaun Liew, a Year 3 Computer Science student at Universiti Sains Malaysia. Keep Learning.