The Power of Activation Functions in Neural Networks: A Visual Exploration


Neural networks have revolutionized machine learning, but their true power comes from a critical component that's often overlooked: activation functions. These mathematical operations transform the output of each neuron, enabling networks to learn complex patterns. In this article, we'll explore how activation functions dramatically impact neural network performance through a practical PyTorch implementation.
Understanding the Problem
This experiment uses the "moons" dataset—a classic non-linear classification problem with two intertwined semicircles. We'll train two identical neural network architectures, differing only in one critical aspect: one uses activation functions between layers, while the other doesn't.
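For a quick look at the data itself, here is a minimal sketch that just generates and plots the moons (it uses the same make_moons call as the full implementation later in the article):

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

# Two interleaved half-circles with a little Gaussian noise
X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdBu, edgecolors='k')
plt.title('The "moons" dataset')
plt.show()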
Neural Network Architecture
Both models share the same structure:
Input layer: 2 features
Hidden layer 1: 16 neurons
Hidden layer 2: 8 neurons
Output layer: 1 neuron (binary classification)
The key difference is in the forward pass:
# Model WITHOUT activation functions
def forward(self, x):
    x = self.fc1(x)                 # No activation
    x = self.fc2(x)                 # No activation
    x = torch.sigmoid(self.fc3(x))  # Only activation at output
    return x

# Model WITH activation functions
def forward(self, x):
    x = torch.relu(self.fc1(x))     # ReLU activation
    x = torch.relu(self.fc2(x))     # ReLU activation
    x = torch.sigmoid(self.fc3(x))  # Output activation
    return x
Why This Matters: The Linear Combination Problem
Without activation functions, multiple linear layers collapse mathematically into a single linear transformation. Consider:
y = W2 * (W1 * x + b1) + b2
y = (W2 * W1) * x + (W2 * b1 + b2)
y = W_combined * x + b_combined
This means a network without non-linear activations can only learn linear decision boundaries, regardless of how many layers it has.
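To make the collapse concrete, here is a minimal sketch (separate from the article's experiment) that folds two nn.Linear layers into one and checks that the outputs match:

import torch
import torch.nn as nn

torch.manual_seed(0)
fc1 = nn.Linear(2, 16)
fc2 = nn.Linear(16, 1)

# Fold the two layers into one: W = W2 @ W1, b = W2 @ b1 + b2
combined = nn.Linear(2, 1)
with torch.no_grad():
    combined.weight.copy_(fc2.weight @ fc1.weight)
    combined.bias.copy_(fc2.weight @ fc1.bias + fc2.bias)

x = torch.randn(5, 2)
print(torch.allclose(fc2(fc1(x)), combined(x), atol=1e-6))  # True

No matter how many linear layers you stack, the same folding argument applies, so the network's decision boundary stays linear.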
Results: Seeing is Believing
When trained on our non-linear moons dataset:
Performance gap: The network with ReLU activations achieves significantly higher accuracy than the linear network.
Decision boundaries: The visualization reveals why: the model with activations learns a complex, curved decision boundary that cleanly separates the two classes, while the model without activations can only draw a straight-line boundary, which fails to separate the intertwined data.
Learning dynamics: Training curves show that the model with activations learns faster and achieves higher validation accuracy, while the model without activations plateaus quickly.
Why Activation Functions Are Essential
Non-linearity: Activations allow networks to model complex, non-linear relationships in data.
Representational power: With activations, neural networks become universal function approximators—they can theoretically represent any continuous function given enough neurons.
Feature learning: In deep networks, each layer with activations can learn progressively more abstract features.
Gradient flow: Well-chosen activation functions help maintain healthy gradient propagation during training, mitigating issues like vanishing or exploding gradients; the short sketch below illustrates this.
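As a rough illustration of the gradient-flow point, this toy sketch (not part of the moons experiment) backpropagates through a deep stack of linear layers and compares the size of the input gradients with sigmoid versus ReLU activations:

import torch
import torch.nn as nn

torch.manual_seed(0)

def mean_input_grad(activation_cls, depth=10, width=32):
    # A hypothetical deep stack used only to illustrate gradient flow
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), activation_cls()]
    net = nn.Sequential(*layers)

    x = torch.randn(64, width, requires_grad=True)
    net(x).sum().backward()
    return x.grad.abs().mean().item()

# Exact numbers depend on depth and initialization, but the sigmoid stack's
# input gradients are typically orders of magnitude smaller than the ReLU stack's
print("Sigmoid stack:", mean_input_grad(nn.Sigmoid))
print("ReLU stack:   ", mean_input_grad(nn.ReLU))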
Common Activation Functions
While this example uses ReLU (Rectified Linear Unit), which passes positive inputs through unchanged and outputs zero otherwise, many other options exist (a sketch of swapping them into the model follows this list):
Sigmoid: Maps inputs to values between 0 and 1, useful for outputs representing probabilities
Tanh: Similar to sigmoid but maps to values between -1 and 1
Leaky ReLU: A variation of ReLU that allows a small negative slope for negative inputs
ELU, SELU, GELU: More advanced activation functions with specialized properties
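Swapping activations in the article's model is a one-line change in the forward pass. Here is a minimal sketch, using a hypothetical MoonsNet class that takes the hidden-layer activation as a parameter (the specific choices shown are just examples):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoonsNet(nn.Module):
    # Same 2 -> 16 -> 8 -> 1 architecture as in the article,
    # parameterized by the hidden-layer activation
    def __init__(self, activation=F.relu):
        super().__init__()
        self.fc1 = nn.Linear(2, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)
        self.activation = activation

    def forward(self, x):
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        return torch.sigmoid(self.fc3(x))

relu_model  = MoonsNet(F.relu)
tanh_model  = MoonsNet(torch.tanh)
leaky_model = MoonsNet(F.leaky_relu)
gelu_model  = MoonsNet(F.gelu)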
Implementation Code
# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Set random seed for reproducibility
np.random.seed(0)
torch.manual_seed(0)
# Generate a non-linear dataset (moons)
X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Convert to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train)
y_train_tensor = torch.FloatTensor(y_train).unsqueeze(1)
X_test_tensor = torch.FloatTensor(X_test)
y_test_tensor = torch.FloatTensor(y_test).unsqueeze(1)
# Create DataLoaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
# Model WITHOUT activation functions
class ModelWithoutActivation(nn.Module):
    def __init__(self):
        super(ModelWithoutActivation, self).__init__()
        self.fc1 = nn.Linear(2, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)

    def forward(self, x):
        x = self.fc1(x)                 # No activation
        x = self.fc2(x)                 # No activation
        x = torch.sigmoid(self.fc3(x))  # Only activation at output
        return x

# Model WITH activation functions
class ModelWithActivation(nn.Module):
    def __init__(self):
        super(ModelWithActivation, self).__init__()
        self.fc1 = nn.Linear(2, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))     # ReLU activation
        x = torch.relu(self.fc2(x))     # ReLU activation
        x = torch.sigmoid(self.fc3(x))  # Output activation
        return x
# Instantiate models
model_without_activation = ModelWithoutActivation()
model_with_activation = ModelWithActivation()
# Loss function and optimizers
criterion = nn.BCELoss()
optimizer_without = optim.Adam(model_without_activation.parameters(), lr=0.01)
optimizer_with = optim.Adam(model_with_activation.parameters(), lr=0.01)
# Training function
def train_model(model, optimizer, num_epochs=50):
    train_losses = []
    train_accuracies = []
    val_losses = []
    val_accuracies = []

    # Split train data into train and validation
    train_size = int(0.8 * len(X_train_tensor))
    val_size = len(X_train_tensor) - train_size
    train_subset, val_subset = torch.utils.data.random_split(train_dataset, [train_size, val_size])
    train_loader_split = DataLoader(train_subset, batch_size=32, shuffle=True)
    val_loader = DataLoader(val_subset, batch_size=32, shuffle=False)

    for epoch in range(num_epochs):
        model.train()
        train_loss = 0.0
        correct_train = 0
        total_train = 0
        for inputs, labels in train_loader_split:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
            predicted = (outputs > 0.5).float()
            total_train += labels.size(0)
            correct_train += (predicted == labels).sum().item()
        train_loss = train_loss / len(train_loader_split)
        train_accuracy = correct_train / total_train
        train_losses.append(train_loss)
        train_accuracies.append(train_accuracy)

        # Validation
        model.eval()
        val_loss = 0.0
        correct_val = 0
        total_val = 0
        with torch.no_grad():
            for inputs, labels in val_loader:
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                val_loss += loss.item()
                predicted = (outputs > 0.5).float()
                total_val += labels.size(0)
                correct_val += (predicted == labels).sum().item()
        val_loss = val_loss / len(val_loader)
        val_accuracy = correct_val / total_val
        val_losses.append(val_loss)
        val_accuracies.append(val_accuracy)

    return train_accuracies, val_accuracies
# Train both models
train_acc_without, val_acc_without = train_model(model_without_activation, optimizer_without)
train_acc_with, val_acc_with = train_model(model_with_activation, optimizer_with)
# Evaluation function
def evaluate_model(model, data_loader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in data_loader:
            outputs = model(inputs)
            predicted = (outputs > 0.5).float()
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return correct / total
# Evaluate both models
acc_without = evaluate_model(model_without_activation, test_loader)
acc_with = evaluate_model(model_with_activation, test_loader)
print(f"Test accuracy WITHOUT activation functions: {acc_without:.4f}")
print(f"Test accuracy WITH activation functions: {acc_with:.4f}")
# Plot decision boundaries
def plot_decision_boundary(model, X, y, title):
    model.eval()
    # Create a mesh grid on which we will run our model
    h = 0.02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Convert to PyTorch tensors
    grid = torch.FloatTensor(np.c_[xx.ravel(), yy.ravel()])
    # Make predictions on the meshgrid
    with torch.no_grad():
        Z = model(grid).numpy()
    Z = Z.reshape(xx.shape)
    # Plot the decision boundary
    plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.RdBu)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.RdBu)
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
# Plot learning curves
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.plot(train_acc_without, label='Without Activation (Train)')
plt.plot(val_acc_without, label='Without Activation (Val)')
plt.plot(train_acc_with, label='With Activation (Train)')
plt.plot(val_acc_with, label='With Activation (Val)')
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
# Plot decision boundaries
plt.subplot(1, 3, 2)
plot_decision_boundary(model_without_activation, X_test, y_test,
                       f'Without Activation (Acc: {acc_without:.4f})')
plt.subplot(1, 3, 3)
plot_decision_boundary(model_with_activation, X_test, y_test,
                       f'With Activation (Acc: {acc_with:.4f})')
plt.tight_layout()
plt.savefig('activation_function_comparison.png')
plt.show()
Conclusion
This experiment provides clear visual evidence of why activation functions are fundamental to neural networks. Without them, even deep networks are limited to simple linear models. With them, networks can learn complex, non-linear patterns that accurately represent real-world data.
Next time you design a neural network, remember that those simple non-linear functions between layers are what give your model its true learning power.