The Power of Activation Functions in Neural Networks: A Visual Exploration

Omkar Thorve

Neural networks have revolutionized machine learning, but their true power comes from a critical component that's often overlooked: activation functions. These mathematical operations transform the output of each neuron, enabling networks to learn complex patterns. In this article, we'll explore how activation functions dramatically impact neural network performance through a practical PyTorch implementation.

Understanding the Problem

This experiment uses the "moons" dataset—a classic non-linear classification problem with two intertwined semicircles. We'll train two identical neural network architectures, differing only in one critical aspect: one uses activation functions between layers, while the other doesn't.
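
To get a feel for the data before training anything, here is a minimal sketch (using the same make_moons parameters as the full implementation below) that generates and plots the dataset:

# Sketch: generate and visualize the moons dataset
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdBu, edgecolors='k')
plt.title('Moons dataset: two intertwined semicircles')
plt.show()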

Neural Network Architecture

Both models share the same structure:

  • Input layer: 2 features

  • Hidden layer 1: 16 neurons

  • Hidden layer 2: 8 neurons

  • Output layer: 1 neuron (binary classification)

The key difference is in the forward pass:

# Model WITHOUT activation functions
def forward(self, x):
    x = self.fc1(x)  # No activation
    x = self.fc2(x)  # No activation
    x = torch.sigmoid(self.fc3(x))  # Only activation at output
    return x

# Model WITH activation functions
def forward(self, x):
    x = torch.relu(self.fc1(x))  # ReLU activation
    x = torch.relu(self.fc2(x))  # ReLU activation
    x = torch.sigmoid(self.fc3(x))  # Output activation
    return x

Why This Matters: The Linear Combination Problem

Without activation functions, multiple linear layers collapse mathematically into a single linear transformation. Consider:

y = W2 * (W1 * x + b1) + b2
y = (W2 * W1) * x + (W2 * b1 + b2)
y = W_combined * x + b_combined

This means a network without non-linear activations can only learn linear decision boundaries, regardless of how many layers it has.
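
A minimal sketch (not from the article's code) makes this concrete: two stacked nn.Linear layers produce exactly the same outputs as a single linear layer built from the combined weight and bias.

# Sketch: stacked linear layers collapse into one linear layer
import torch
import torch.nn as nn

torch.manual_seed(0)
fc1 = nn.Linear(2, 16)
fc2 = nn.Linear(16, 1)

x = torch.randn(5, 2)
stacked = fc2(fc1(x))

# Combined transformation: W = W2 @ W1, b = W2 @ b1 + b2
W = fc2.weight @ fc1.weight
b = fc2.weight @ fc1.bias + fc2.bias
combined = x @ W.T + b

print(torch.allclose(stacked, combined, atol=1e-6))  # True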

Results: Seeing is Believing

When trained on our non-linear moons dataset:

  1. Performance gap: The network with ReLU activations achieves significantly higher accuracy than the linear network.

  2. Decision boundaries: The visualization reveals why—the model with activations learns a complex, curved decision boundary that perfectly separates the classes, while the model without activations can only create a linear boundary that fails to separate the intertwined data.

  3. Learning dynamics: Training curves show that the model with activations learns faster and achieves higher validation accuracy, while the model without activations plateaus quickly.

Why Activation Functions Are Essential

  1. Non-linearity: Activations allow networks to model complex, non-linear relationships in data.

  2. Representational power: With activations, neural networks become universal function approximators—they can theoretically represent any continuous function given enough neurons.

  3. Feature learning: In deep networks, each layer with activations can learn progressively more abstract features.

  4. Gradient flow: Proper activation functions help maintain healthy gradient propagation during training, preventing issues like vanishing or exploding gradients.
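
To illustrate the last point, here is a rough sketch (a toy 20-layer stack, purely illustrative and not part of the article's experiment) comparing how much gradient reaches the input through a deep stack of sigmoid layers versus ReLU layers:

# Sketch: gradient magnitude at the input of a deep sigmoid vs ReLU stack
import torch
import torch.nn as nn

def input_grad_norm(activation_cls, depth=20, width=16):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), activation_cls()]
    net = nn.Sequential(*layers)
    x = torch.randn(1, width, requires_grad=True)
    net(x).sum().backward()
    return x.grad.norm().item()

torch.manual_seed(0)
print("Sigmoid stack:", input_grad_norm(nn.Sigmoid))  # typically vanishingly small
print("ReLU stack:   ", input_grad_norm(nn.ReLU))     # typically orders of magnitude larger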

Common Activation Functions

While this example uses ReLU (Rectified Linear Unit), which outputs the input unchanged for positive values and zero otherwise, many other options exist (a short comparison sketch follows the list):

  • Sigmoid: Maps inputs to values between 0 and 1, useful for outputs representing probabilities

  • Tanh: Similar to sigmoid but maps to values between -1 and 1

  • Leaky ReLU: A variation of ReLU that allows a small negative slope for negative inputs

  • ELU, SELU, GELU: More advanced activation functions with specialized properties
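
For a quick sense of how these differ, here is a small illustrative sketch (not part of the article's experiment) applying several of them to the same inputs:

# Sketch: comparing common activations on the same inputs
import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print("ReLU:      ", torch.relu(z))
print("Sigmoid:   ", torch.sigmoid(z))
print("Tanh:      ", torch.tanh(z))
print("Leaky ReLU:", F.leaky_relu(z, negative_slope=0.01))
print("GELU:      ", F.gelu(z))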

Implementation Code

# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Set random seed for reproducibility
np.random.seed(0)
torch.manual_seed(0)
# Generate a non-linear dataset (moons)
X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Convert to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train)
y_train_tensor = torch.FloatTensor(y_train).unsqueeze(1)
X_test_tensor = torch.FloatTensor(X_test)
y_test_tensor = torch.FloatTensor(y_test).unsqueeze(1)
# Create DataLoaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
# Model WITHOUT activation functions
class ModelWithoutActivation(nn.Module):
    def __init__(self):
        super(ModelWithoutActivation, self).__init__()
        self.fc1 = nn.Linear(2, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)

    def forward(self, x):
        x = self.fc1(x)  # No activation
        x = self.fc2(x)  # No activation
        x = torch.sigmoid(self.fc3(x))  # Only activation at output
        return x

# Model WITH activation functions
class ModelWithActivation(nn.Module):
    def __init__(self):
        super(ModelWithActivation, self).__init__()
        self.fc1 = nn.Linear(2, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))  # ReLU activation
        x = torch.relu(self.fc2(x))  # ReLU activation
        x = torch.sigmoid(self.fc3(x))  # Output activation
        return x
# Instantiate models
model_without_activation = ModelWithoutActivation()
model_with_activation = ModelWithActivation()
# Loss function and optimizers
criterion = nn.BCELoss()
optimizer_without = optim.Adam(model_without_activation.parameters(), lr=0.01)
optimizer_with = optim.Adam(model_with_activation.parameters(), lr=0.01)
# Training function
def train_model(model, optimizer, num_epochs=50):
    train_losses = []
    train_accuracies = []
    val_losses = []
    val_accuracies = []

    # Split train data into train and validation
    train_size = int(0.8 * len(X_train_tensor))
    val_size = len(X_train_tensor) - train_size
    train_subset, val_subset = torch.utils.data.random_split(train_dataset, [train_size, val_size])
    train_loader_split = DataLoader(train_subset, batch_size=32, shuffle=True)
    val_loader = DataLoader(val_subset, batch_size=32, shuffle=False)

    for epoch in range(num_epochs):
        model.train()
        train_loss = 0.0
        correct_train = 0
        total_train = 0

        for inputs, labels in train_loader_split:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()
            predicted = (outputs > 0.5).float()
            total_train += labels.size(0)
            correct_train += (predicted == labels).sum().item()

        train_loss = train_loss / len(train_loader_split)
        train_accuracy = correct_train / total_train
        train_losses.append(train_loss)
        train_accuracies.append(train_accuracy)

        # Validation
        model.eval()
        val_loss = 0.0
        correct_val = 0
        total_val = 0

        with torch.no_grad():
            for inputs, labels in val_loader:
                outputs = model(inputs)
                loss = criterion(outputs, labels)

                val_loss += loss.item()
                predicted = (outputs > 0.5).float()
                total_val += labels.size(0)
                correct_val += (predicted == labels).sum().item()

        val_loss = val_loss / len(val_loader)
        val_accuracy = correct_val / total_val
        val_losses.append(val_loss)
        val_accuracies.append(val_accuracy)

    return train_accuracies, val_accuracies
# Train both models
train_acc_without, val_acc_without = train_model(model_without_activation, optimizer_without)
train_acc_with, val_acc_with = train_model(model_with_activation, optimizer_with)
# Evaluation function
def evaluate_model(model, data_loader):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, labels in data_loader:
            outputs = model(inputs)
            predicted = (outputs > 0.5).float()
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return correct / total
# Evaluate both models
acc_without = evaluate_model(model_without_activation, test_loader)
acc_with = evaluate_model(model_with_activation, test_loader)

print(f"Test accuracy WITHOUT activation functions: {acc_without:.4f}")
print(f"Test accuracy WITH activation functions: {acc_with:.4f}")
# Plot decision boundaries
def plot_decision_boundary(model, X, y, title):
    model.eval()

    # Create a mesh grid on which we will run our model
    h = 0.02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Convert to PyTorch tensors
    grid = torch.FloatTensor(np.c_[xx.ravel(), yy.ravel()])

    # Make predictions on the meshgrid
    with torch.no_grad():
        Z = model(grid).numpy()
    Z = Z.reshape(xx.shape)

    # Plot the decision boundary
    plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.RdBu)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.RdBu)
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
# Plot learning curves
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.plot(train_acc_without, label='Without Activation (Train)')
plt.plot(val_acc_without, label='Without Activation (Val)')
plt.plot(train_acc_with, label='With Activation (Train)')
plt.plot(val_acc_with, label='With Activation (Val)')
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
# Plot decision boundaries
plt.subplot(1, 3, 2)
plot_decision_boundary(model_without_activation, X_test, y_test, 
                      f'Without Activation (Acc: {acc_without:.4f})')

plt.subplot(1, 3, 3)
plot_decision_boundary(model_with_activation, X_test, y_test, 
                      f'With Activation (Acc: {acc_with:.4f})')

plt.tight_layout()
plt.savefig('activation_function_comparison.png')
plt.show()

Conclusion

This experiment provides clear visual evidence of why activation functions are fundamental to neural networks. Without them, even deep networks are limited to simple linear models. With them, networks can learn complex, non-linear patterns that accurately represent real-world data.

Next time you design a neural network, remember that those simple non-linear functions between layers are what give your model its true learning power.
