Performing an adversarial attack on images

Exploring adversarial attacks - where tiny, imperceptible modifications can completely deceive even the smartest neural networks
From Autoencoders to Adversarial Attacks: A New Kind of Magic
After my fascinating journey with Variational Autoencoders, where I learned to generate new images from random noise, I encountered something that seemed almost like magic: adversarial attacks. Imagine being able to take a photo of an elephant and, with changes so subtle that human eyes can't detect them, make a sophisticated AI model confidently declare it's actually a saxophone or a lemon!
This isn't theoretical - it's a fundamental vulnerability that exists in virtually every neural network, and understanding it is crucial for anyone working with AI in the real world.
The Adversarial Awakening: When Perfect Vision Fails
Coming from my work with autoencoders, I was fascinated by how neural networks could learn such meaningful representations of images. But then I discovered something unsettling: these same networks that could classify thousands of objects with superhuman accuracy could be completely fooled by changes invisible to human perception.
The core concept is beautifully simple yet profound: What if we could modify an image so minimally that humans can't tell the difference, but the neural network perceives it as belonging to a completely different class?
This is exactly what adversarial attacks accomplish, and they reveal something fundamental about how neural networks "see" the world - very differently from how we do.
Understanding the Strategy: The Adversarial Game Plan
Think of adversarial attacks like a strategic game. Instead of training the model (which we've been doing in previous projects), we're now going to:
Keep the model frozen - We can't change how it thinks
Modify the input image - This is our only tool
Guide these modifications - Use the model's own gradients against it
Achieve our target - Make it predict whatever class we want
The strategy from my project follows these specific steps:
Provide an image (like an elephant)
Specify our target class (like "saxophone")
Load a pre-trained model and freeze all its parameters
Run the image through the frozen model to get its current prediction
Calculate the loss between that prediction and our target class
Use backpropagation to get gradients with respect to the input pixels (not the model weights!) - this tells us how each pixel affects the prediction
Update image pixels in the direction that fools the model
Repeat until the model is completely deceived
The key insight: We're hijacking the training machinery - instead of updating the model's weights to reduce the loss on the true label, we're updating the input pixels to reduce the loss on a target label of our choosing, pushing the prediction away from what the image really shows!
Hands-On Implementation: Building the Digital Deception
Let me walk you through the actual implementation that opened my eyes to this fascinating vulnerability.
Setting Up Our Adversarial Laboratory
# First, let's set up our environment
!pip install torch_snippets
from torch_snippets import inspect, show, np, torch, nn
from torchvision.models import resnet50

# Load our "victim" - a pre-trained ResNet50
model = resnet50(pretrained=True)

# HERE'S THE CRUCIAL PART: Freeze ALL model parameters
for param in model.parameters():
    param.requires_grad = False
model = model.eval()  # Set to evaluation mode
Why freeze the parameters? This is the opposite of normal training! Instead of updating the model to better understand images, we're going to update the image to fool the model. The model becomes our fixed "judge" that we're trying to deceive.
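As a quick sanity check (my own addition, not part of the original walkthrough), you can confirm that nothing in the network will receive weight updates:
# Every parameter should now be frozen
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f'Trainable parameters: {trainable} / {total}')  # expect 0 out of roughly 25.6 million for ResNet50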
Loading Our Target Image
# Download an elephant image to experiment with
import requests
from PIL import Image
url = 'https://lionsvalley.co.za/wp-content/uploads/2015/11/africanelephant-square.jpg'
original_image = Image.open(requests.get(url, stream=True).raw).convert('RGB')
original_image = np.array(original_image)
original_image = torch.Tensor(original_image)
I chose an elephant image because it's clearly recognizable - making the eventual deception even more striking when this obvious elephant gets classified as completely different objects.
Setting Up ImageNet Classes
# Get the mapping of ImageNet class IDs to class names
image_net_classes = 'https://gist.githubusercontent.com/yrevar/942d3a0ac09ec9e5eb3a/raw/238f720ff059c1f82f368259d1ca4ffa5dd8f9f5/imagenet1000_clsidx_to_labels.txt'
image_net_classes = requests.get(image_net_classes).text
image_net_ids = eval(image_net_classes)
image_net_classes = {i:j for j,i in image_net_ids.items()}
This gives us access to all 1000 ImageNet classes, so we can target any specific misclassification we want - from "lemon" to "saxophone" to "comic book"!
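The two dictionaries go in opposite directions, which is easy to check (the exact index printed depends on the label file above, so treat it as illustrative):
# image_net_ids maps class index -> label, image_net_classes maps label -> class index
idx = image_net_classes['lemon']
print(idx, image_net_ids[idx])  # some integer index, followed by 'lemon'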
The Image Processing Pipeline
One of the trickiest parts was getting the image normalization right:
from torchvision import transforms as T
from torch.nn import functional as F
# Standard ImageNet normalization (what the model expects)
normalize = T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
# Reverse normalization (for displaying results)
# Reverse normalization (for displaying results)
denormalize = T.Normalize(
    [-0.485/0.229, -0.456/0.224, -0.406/0.225],
    [1/0.229, 1/0.224, 1/0.225]
)

def image2tensor(input):
    """Convert image to model-ready tensor"""
    x = normalize(input.clone().permute(2,0,1)/255.)[None]
    return x

def tensor2image(input):
    """Convert tensor back to displayable image"""
    x = (denormalize(input[0].clone()).permute(1,2,0)*255.).type(torch.uint8)
    return x
Critical Learning: The pre-trained model expects ImageNet-normalized inputs, so we need to convert our images to the exact format the model was trained on, then convert back for visualization.
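A small round-trip check (my own sanity sketch, not from the original walkthrough) confirms the two functions really are inverses up to rounding:
# Normalizing and then denormalizing should give back (almost) the same pixels
roundtrip = tensor2image(image2tensor(original_image))
diff = (roundtrip.float() - original_image.float()).abs().max()
print(f'Max per-pixel difference after round trip: {diff.item()}')  # should be at most a couple of intensity levels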
The Prediction Function: Reading the Model's Mind
def predict_on_image(input):
    """Get the model's prediction for an image"""
    model.eval()
    show(input)                        # Display the image
    input = image2tensor(input)        # Convert to proper format
    pred = model(input)                # Get raw predictions
    pred = F.softmax(pred, dim=-1)[0]  # Convert to probabilities
    prob, clss = torch.max(pred, 0)    # Find highest probability
    clss = image_net_ids[clss.item()]  # Convert to class name
    print(f'PREDICTION: `{clss}` @ {prob.item()}')
This function became my window into the model's "thoughts" - showing me exactly what it sees and how confident it is.
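Running it on the untouched elephant gives us the baseline to beat (the exact confidence you get may differ slightly across torchvision versions):
# Baseline prediction on the unmodified image
predict_on_image(original_image)
# Prints something like: PREDICTION: `African elephant, Loxodonta africana` @ 0.52...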
The Heart of the Magic: The Attack Function
Here's where the real magic happens - the core adversarial attack algorithm:
from tqdm import trange
losses = []
def attack(image, model, target, epsilon=1e-6):
    """
    Perform one step of adversarial attack
    Args:
        image: Current image being modified
        model: The model we're trying to fool
        target: Our desired target class
        epsilon: How big a step to take (very small!)
    """
    # Step 1: Convert image to tensor and ENABLE gradients
    input = image2tensor(image)
    input.requires_grad = True  # This is the key difference!
    # Step 2: Get model's prediction
    pred = model(input)
    # Step 3: Calculate loss relative to our TARGET (not true class)
    loss = nn.CrossEntropyLoss()(pred, target)
    # Step 4: Backpropagate to get gradients w.r.t. INPUT pixels
    loss.backward()
    losses.append(loss.mean().item())
    # Step 5: Nudge the pixels a tiny step against the gradient, towards the target class
    output = input - epsilon * input.grad.sign()
    # Step 6: Convert back to image format and return
    output = tensor2image(output)
    del input  # Clean up memory
    return output.detach()
Breaking Down the Attack Magic
Step 1: Enable Gradients on Input
input.requires_grad = True
This is the most important line! We're telling PyTorch to track gradients with respect to the input pixels instead of model weights.
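A quick way to see this in action (a small sketch of my own, reusing the helpers above with an arbitrary target just to produce a loss) is to check where the gradients actually end up after backpropagation:
# After backward(), the gradient lives on the input tensor, not on the frozen weights
x = image2tensor(original_image)
x.requires_grad = True
loss = nn.CrossEntropyLoss()(model(x), torch.tensor([0]))  # arbitrary target class, purely illustrative
loss.backward()
print(x.grad.shape)          # same shape as the input: one gradient value per pixel and channel
print(model.fc.weight.grad)  # None - the frozen ResNet50 parameters receive no gradient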
Step 3: The Adversarial Loss
loss = nn.CrossEntropyLoss()(pred, target)
Instead of calculating loss against the true class (elephant), we calculate it against our desired target class (saxophone). Higher loss means the model is more confident it's NOT a saxophone.
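A tiny numerical illustration (made-up logits for a 3-class toy example, not taken from the actual model) shows how the same prediction produces a small loss for the class it favours and a large loss for a class it rejects:
# Cross-entropy with a toy 3-class prediction: confident about class 0, dismissive of class 2
logits = torch.tensor([[4.0, 0.5, 0.1]])
loss_fn = nn.CrossEntropyLoss()
print(loss_fn(logits, torch.tensor([0])))  # small loss (~0.05): prediction already matches class 0
print(loss_fn(logits, torch.tensor([2])))  # large loss (~3.95): prediction is far from target class 2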
Step 5: The Gradient Step
output = input - epsilon * input.grad.sign()
input.grad.sign(): the direction in which each pixel change would increase the loss with respect to the target
epsilon: a very small step size (we want imperceptible changes!)
We subtract because we want to DECREASE the loss with respect to the target class - every step nudges the prediction a little closer to the class we chose
Why use .sign() instead of the full gradient?
It bounds our changes (each pixel changes by at most ε)
It's more stable than using raw gradient magnitudes
It often works better in practice
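To make the difference concrete, here is a tiny comparison of the two update rules on a made-up gradient (illustrative values only):
# Raw gradients span several orders of magnitude; sign() equalises the step per pixel
grad = torch.tensor([0.003, -0.8, 0.0001])
epsilon = 1e-6
raw_step = epsilon * grad           # step size depends on each pixel's gradient magnitude
fgsm_step = epsilon * grad.sign()   # every pixel moves by exactly +epsilon or -epsilon
print(raw_step, fgsm_step)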
The Complete Deception: Iterative Attacks
Now let's put it all together and perform complete attacks on our elephant image:
# Define our attack targets - let's fool the model into these classes
modified_images = []
desired_targets = ['lemon', 'comic book', 'sax, saxophone']
for target in desired_targets:
    # Convert target name to tensor format
    target = torch.tensor([image_net_classes[target]])
    # Start with original elephant image
    image_to_attack = original_image.clone()
    # Perform 10 iterations of small modifications
    for _ in trange(10):
        image_to_attack = attack(image_to_attack, model, target)
    # Save the final result
    modified_images.append(image_to_attack)

# Show all results
for image in [original_image, *modified_images]:
    predict_on_image(image)
    inspect(image)
The Iterative Strategy Explained
Why 10 iterations? Each attack step makes a tiny change. By repeating the step, we gradually "nudge" the image across the model's decision boundary while keeping the changes imperceptible (a small monitoring sketch follows the progression below).
The process (illustrative confidences - your exact numbers will differ):
Start: "African elephant" (99.8% confidence)
Step 1: Still "African elephant" (95% confidence)
Step 5: "African elephant" (60% confidence)
Step 8: "Lemon" (30% confidence)
Step 10: "Lemon" (94% confidence) ✨
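To watch a progression like this on your own machine, a small monitoring loop (my own addition, reusing the helpers defined earlier; the classes and confidences you see per step will differ) can log the top-1 prediction after every attack step:
# Log the top-1 prediction after every attack step for a single target ('lemon')
image_to_attack = original_image.clone()
target = torch.tensor([image_net_classes['lemon']])
for step in range(10):
    image_to_attack = attack(image_to_attack, model, target)
    with torch.no_grad():
        probs = F.softmax(model(image2tensor(image_to_attack)), dim=-1)[0]
    prob, clss = torch.max(probs, 0)
    print(f'step {step + 1}: {image_net_ids[clss.item()]} @ {prob.item():.3f}')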
The Stunning Results: Digital Illusion Achieved
The results of my adversarial attacks were both fascinating and slightly unsettling:
Original Image: African elephant @ 0.523 confidence
Attack Result 1: lemon @ 0.999 confidence
Attack Result 2: comic book @ 0.999 confidence
Attack Result 3: sax, saxophone @ 0.999 confidence
The mind-bending part: To my human eyes, all four images looked virtually identical! The changes were so subtle that I had to look extremely carefully to notice any difference at all.
Understanding Why This Works
Through my experiments, I discovered several key insights:
1. High-Dimensional Vulnerability
Images exist in extremely high-dimensional spaces (224×224×3 = 150,528 dimensions for a small image). In such spaces (a quick back-of-the-envelope calculation follows this list):
Small changes in many dimensions can sum to large effects
Even imperceptible per-pixel changes can cross decision boundaries
The "perturbation budget" can be distributed across all pixels
2. The Linear Approximation
Neural networks behave approximately linearly in local neighborhoods. This means:
new_prediction ≈ old_prediction + gradient × step_size
This linear behavior is why our simple gradient-based attack works so reliably.
3. Decision Boundary Reality
Neural networks don't "see" images like humans do. They learn complex decision boundaries in high-dimensional space, and these boundaries can be surprisingly close to natural images.
The Broader Implications: Beyond the Magic Trick
What started as a fascinating technical exercise revealed profound implications:
Real-World Vulnerabilities
Autonomous vehicles: Adversarial stop signs could cause accidents
Medical diagnosis: Manipulated medical images could lead to misdiagnosis
Security systems: Adversarial examples could evade detection
Content moderation: Inappropriate content could bypass AI filters
The Robustness Challenge
This revealed that high accuracy on test sets doesn't guarantee robustness to small perturbations. A model can be 99.9% accurate and still be completely fooled by invisible modifications.
The Trust Question
How do we maintain confidence in AI systems knowing they can be deceived by changes we can't even see?
Defensive Strategies: Fighting Back
Understanding attacks naturally led me to explore defenses:
1. Adversarial Training
Train models on both clean and adversarial examples:
# Pseudo-code for adversarial training
for batch in dataloader:
    clean_images, labels = batch
    # Generate adversarial examples (e.g. with an FGSM-style attack like the one above)
    adv_images = generate_adversarial_examples(clean_images, model)
    # Train on both clean and adversarial examples
    loss_clean = criterion(model(clean_images), labels)
    loss_adv = criterion(model(adv_images), labels)
    total_loss = 0.5 * (loss_clean + loss_adv)
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
2. Input Preprocessing
JPEG compression: Destroys some adversarial perturbations (a minimal re-encoding sketch follows this list)
Gaussian noise: Adds randomness that can mask attacks
Bit depth reduction: Removes fine-grained adversarial signals
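As an illustration of the first idea, here is a minimal sketch (my own, using Pillow's standard JPEG encoder rather than any purpose-built defense library) that re-encodes an attacked image before classifying it again:
from io import BytesIO
from PIL import Image

def jpeg_compress(image_tensor, quality=75):
    """Re-encode an (H, W, 3) image tensor as JPEG to wash out fine-grained perturbations."""
    pil_img = Image.fromarray(image_tensor.numpy().astype(np.uint8))
    buffer = BytesIO()
    pil_img.save(buffer, format='JPEG', quality=quality)
    buffer.seek(0)
    return torch.Tensor(np.array(Image.open(buffer)))

# Re-classify one of the attacked elephants after compression; the defense may or may not
# recover the original "African elephant" prediction, depending on the attack strength
predict_on_image(jpeg_compress(modified_images[0]))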
3. Detection Methods
Monitor prediction confidence distributions
Check for unusual gradient patterns
Use ensemble methods for verification
Key Lessons: The Deeper Understanding
My adversarial attack journey fundamentally changed how I think about AI:
1. Brittleness vs Capability
AI models can be simultaneously incredibly capable and surprisingly fragile. High performance doesn't guarantee robustness.
2. The Feature Learning Reality
Models often rely on features that humans can't perceive or understand, highlighting the "black box" nature of deep learning.
3. Security as Core Requirement
As AI systems are deployed in critical applications, security can't be an afterthought - it must be built in from the beginning.
The Ethical Dimension: With Great Power...
Learning to fool AI systems raised important ethical questions:
Responsible Research: How do we study vulnerabilities without enabling malicious use?
Disclosure: Should we publicly share attack methods that could be misused?
Defense Priority: The primary goal should be making AI systems more robust and trustworthy.
Conclusion: The Invisible Revolution
Adversarial attacks revealed a hidden dimension of machine learning that completely changed my perspective. The ability to make an elephant appear as a saxophone with invisible changes isn't just a clever trick - it's a fundamental challenge that affects the reliability and trustworthiness of AI systems.
The key takeaways:
Neural networks are more fragile than they appear
High accuracy doesn't guarantee robustness
Security must be considered from the beginning
Understanding attacks is essential for building better defenses
The broader impact: As AI systems make increasingly important decisions in our world - from medical diagnosis to autonomous driving - ensuring they can't be easily fooled becomes not just a technical challenge, but a societal imperative.
The magic trick of fooling AI with invisible changes teaches us that in the age of artificial intelligence, seeing truly isn't always believing. And sometimes, the most important insights come from learning how to break things before we can properly build them.
References
The implementation and insights in this post are based on the practical exercises from "Modern Computer Vision with PyTorch" and hands-on experimentation with gradient-based adversarial attacks using ResNet50 and ImageNet classes.
Written by Shaun Liew
Year 3 Computer Science student from Universiti Sains Malaysia. Keep Learning.