Over-Regularization Kills Learning

Shivam
8 min read

Deep learning has revolutionized how we solve complex problems like image recognition, language translation, and more. Its power lies in its ability to learn patterns directly from data, sometimes even too well. But here's the catch: deep neural networks are so flexible that they often memorize the training data instead of actually learning meaningful patterns. This phenomenon is called overfitting, and it's one of the biggest challenges in deep learning.

To fight overfitting, we employ crucial techniques like Dropout (randomly ignoring neurons during training) and Batch Normalization (standardizing layer inputs to stabilize training and help the network converge faster). These regularization techniques act like discipline for the model, forcing it to generalize better instead of simply memorizing.
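As a quick illustration of what Dropout does (a minimal sketch, not from the original experiment): in training mode it randomly zeroes activations and rescales the survivors, while in evaluation mode it is a no-op.

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()     # training mode: ~half the values are zeroed, the rest scaled by 1/(1-p)
print(drop(x))

drop.eval()      # evaluation mode: dropout does nothing, the input passes through unchanged
print(drop(x))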

Sounds great, right? Well… not always.

What happens when we apply too much of a good thing? In this blog, we'll explore the flip side of regularization, a less-discussed topic: over-regularization can kill learning. We'll see how excessive dropout, an abundance of layers, or even unnecessary normalization can hinder your model's ability to grasp fundamental patterns. We'll demonstrate this critical concept with a real, hands-on example using the CIFAR-10 image classification dataset, focusing on simple code, clear visuals, and direct comparisons to make it easy to understand.

  1. The "Not Normalized and Less Regularized" Model

To set up our experiment, we first prepared the CIFAR-10 dataset. For training and validation, we selected a total of 13,000 images, ensuring an equal number of images for each of the 10 classes. From these 13,000, we used 10,000 images for training our model and the remaining 3,000 for validation. For the test set, which is used to evaluate the final performance of the trained model, we took 5,000 images, again making sure that each class was equally represented.

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor()])
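The class-balanced subsampling itself isn't shown in the post; here is one way such a split could be built with torchvision (a sketch, the original experiment may have done it differently):

import numpy as np
from torch.utils.data import Subset
from torchvision import datasets

# Full CIFAR-10 training set (50,000 images) with the training transform defined above
full_train = datasets.CIFAR10(root="./data", train=True, download=True,
                              transform=train_transform)

# Take the same number of images per class: 1,000 train + 300 val for each of the 10 classes
labels = np.array(full_train.targets)
train_idx, val_idx = [], []
for c in range(10):
    cls_idx = np.where(labels == c)[0][:1300]   # 1,300 per class -> 13,000 total
    train_idx.extend(cls_idx[:1000])            # 10,000 for training
    val_idx.extend(cls_idx[1000:])              # 3,000 for validation

train_set = Subset(full_train, train_idx)
val_set = Subset(full_train, val_idx)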

Sample of train and test images:

CNN model Architecture: Less Regularized

import torch.nn as nn

class CNN(nn.Module):
    def __init__(self,input_features=10):
        super().__init__()
        self.convo = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32), 
            nn.ReLU(),

            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),

            nn.Conv2d(64, 128, kernel_size=3, padding=1, stride=2),
            nn.BatchNorm2d(128),
            nn.ReLU(),

            nn.Conv2d(128, 64, kernel_size=3, padding=1, stride=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),

            nn.Conv2d(64, 32, kernel_size=3, padding=1, stride=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
        )

        self.fc=nn.Sequential(
            nn.Flatten(),
            nn.Linear(32*7*7,2048),
            nn.ReLU(),
            nn.Dropout(0.5), 
            nn.Linear(2048,1024),
            nn.ReLU(),
            nn.Dropout(0.4),
            nn.Linear(1024,512),
            nn.ReLU(),
            nn.Dropout(0.4), 
            nn.Linear(512,10)
        )

    def forward(self,x):
        x=self.convo(x)
        x=self.fc(x)
        return x
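
A quick sanity check of the flatten size (a small sketch, assuming the 224×224 input produced by Resize above): the stride-2 convolutions and max-pools reduce 224 down to a 7×7 feature map with 32 channels, which is why the first linear layer expects 32*7*7 inputs.

import torch

model = CNN()
model.eval()                                  # deterministic: no dropout, no BN stat updates
dummy = torch.randn(1, 3, 224, 224)           # one fake 224x224 RGB image
print(model.convo(dummy).shape)               # torch.Size([1, 32, 7, 7]) -> matches 32*7*7
print(model(dummy).shape)                     # torch.Size([1, 10]) -> one logit per class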

RESULTS:

We trained this initial model for 40 epochs, logging the training and validation loss and accuracy after every epoch. The per-epoch numbers below clearly illustrate the point at which overfitting sets in.
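The post doesn't show the training loop or the optimizer settings; the log below comes from a loop along these lines (Adam with lr=1e-3 and CrossEntropyLoss are assumptions, not taken from the original, and train_set/val_set are the subsets built in the sketch above):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # assumed settings

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64)

for epoch in range(40):
    # --- training pass ---
    model.train()
    train_loss, correct = 0.0, 0
    for images, targets in train_loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        logits = model(images)
        loss = criterion(logits, targets)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * images.size(0)
        correct += (logits.argmax(1) == targets).sum().item()
    train_loss /= len(train_loader.dataset)
    train_acc = correct / len(train_loader.dataset)

    # --- validation pass ---
    model.eval()
    val_loss, val_correct = 0.0, 0
    with torch.no_grad():
        for images, targets in val_loader:
            images, targets = images.to(device), targets.to(device)
            logits = model(images)
            val_loss += criterion(logits, targets).item() * images.size(0)
            val_correct += (logits.argmax(1) == targets).sum().item()
    val_loss /= len(val_loader.dataset)
    val_acc = val_correct / len(val_loader.dataset)

    print(f"Epoch {epoch+1}/40 | Train Loss: {train_loss:.4f}, Acc: {train_acc:.4f} "
          f"| Val Loss: {val_loss:.4f}, Acc: {val_acc:.4f}")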

Epoch 1/40 | Train Loss: 2.3014, Acc: 0.1028 | Val Loss: 2.2827, Acc: 0.1670
Epoch 2/40 | Train Loss: 2.2361, Acc: 0.1619 | Val Loss: 2.1163, Acc: 0.2470
Epoch 3/40 | Train Loss: 2.0101, Acc: 0.2395 | Val Loss: 1.9608, Acc: 0.2487
Epoch 4/40 | Train Loss: 1.8096, Acc: 0.3102 | Val Loss: 1.9206, Acc: 0.3333
Epoch 5/40 | Train Loss: 1.6884, Acc: 0.3636 | Val Loss: 1.6209, Acc: 0.4013
Epoch 6/40 | Train Loss: 1.5699, Acc: 0.4165 | Val Loss: 1.4837, Acc: 0.4440
Epoch 7/40 | Train Loss: 1.4804, Acc: 0.4589 | Val Loss: 1.4074, Acc: 0.4843
Epoch 8/40 | Train Loss: 1.4060, Acc: 0.4909 | Val Loss: 1.4100, Acc: 0.4823
Epoch 9/40 | Train Loss: 1.3404, Acc: 0.5155 | Val Loss: 1.8671, Acc: 0.3703
Epoch 10/40 | Train Loss: 1.2775, Acc: 0.5426 | Val Loss: 1.4806, Acc: 0.4750
Epoch 11/40 | Train Loss: 1.2118, Acc: 0.5657 | Val Loss: 1.2812, Acc: 0.5550
Epoch 12/40 | Train Loss: 1.1585, Acc: 0.5826 | Val Loss: 1.2558, Acc: 0.5480
Epoch 13/40 | Train Loss: 1.1048, Acc: 0.6091 | Val Loss: 1.1754, Acc: 0.5773
Epoch 14/40 | Train Loss: 1.0787, Acc: 0.6179 | Val Loss: 1.1479, Acc: 0.5857
Epoch 15/40 | Train Loss: 1.0226, Acc: 0.6406 | Val Loss: 1.1105, Acc: 0.6080
Epoch 16/40 | Train Loss: 0.9744, Acc: 0.6577 | Val Loss: 1.2366, Acc: 0.5707
Epoch 17/40 | Train Loss: 0.9433, Acc: 0.6656 | Val Loss: 1.0283, Acc: 0.6487
Epoch 18/40 | Train Loss: 0.8998, Acc: 0.6827 | Val Loss: 1.0152, Acc: 0.6390
Epoch 19/40 | Train Loss: 0.8890, Acc: 0.6881 | Val Loss: 0.9835, Acc: 0.6590
Epoch 20/40 | Train Loss: 0.8440, Acc: 0.7056 | Val Loss: 0.8981, Acc: 0.6843
Epoch 21/40 | Train Loss: 0.7820, Acc: 0.7248 | Val Loss: 1.1031, Acc: 0.6203
Epoch 22/40 | Train Loss: 0.7917, Acc: 0.7222 | Val Loss: 0.9274, Acc: 0.6693
Epoch 23/40 | Train Loss: 0.7665, Acc: 0.7341 | Val Loss: 1.2899, Acc: 0.5967
Epoch 24/40 | Train Loss: 0.7141, Acc: 0.7475 | Val Loss: 0.9817, Acc: 0.6577
Epoch 25/40 | Train Loss: 0.6824, Acc: 0.7615 | Val Loss: 1.0246, Acc: 0.6477
Epoch 26/40 | Train Loss: 0.6534, Acc: 0.7755 | Val Loss: 0.9047, Acc: 0.6877
Epoch 27/40 | Train Loss: 0.6163, Acc: 0.7812 | Val Loss: 0.9526, Acc: 0.6710
Epoch 28/40 | Train Loss: 0.5913, Acc: 0.7932 | Val Loss: 0.9292, Acc: 0.6917
Epoch 29/40 | Train Loss: 0.5741, Acc: 0.8013 | Val Loss: 1.0302, Acc: 0.6567
Epoch 30/40 | Train Loss: 0.5601, Acc: 0.8043 | Val Loss: 0.9821, Acc: 0.6797
Epoch 31/40 | Train Loss: 0.5096, Acc: 0.8249 | Val Loss: 0.9452, Acc: 0.6957
Epoch 32/40 | Train Loss: 0.4938, Acc: 0.8287 | Val Loss: 0.9734, Acc: 0.6880
Epoch 33/40 | Train Loss: 0.4473, Acc: 0.8450 | Val Loss: 1.0509, Acc: 0.6833
Epoch 34/40 | Train Loss: 0.4533, Acc: 0.8443 | Val Loss: 1.2324, Acc: 0.6373
Epoch 35/40 | Train Loss: 0.4170, Acc: 0.8562 | Val Loss: 1.0924, Acc: 0.6647
Epoch 36/40 | Train Loss: 0.4110, Acc: 0.8569 | Val Loss: 1.0186, Acc: 0.6867
Epoch 37/40 | Train Loss: 0.3691, Acc: 0.8738 | Val Loss: 1.0541, Acc: 0.6843
Epoch 38/40 | Train Loss: 0.3741, Acc: 0.8741 | Val Loss: 0.9498, Acc: 0.6963
Epoch 39/40 | Train Loss: 0.3557, Acc: 0.8778 | Val Loss: 0.9654, Acc: 0.7113
Epoch 40/40 | Train Loss: 0.3261, Acc: 0.8911 | Val Loss: 1.0607, Acc: 0.7020

However, the more critical observation is what happens after around Epoch 20. While the training accuracy continues to improve, the validation accuracy starts to plateau and even slightly decrease, and the validation loss begins to fluctuate and increase. This widening gap between the training and validation metrics is a clear sign of overfitting. The model is memorizing the training data too well, losing its ability to generalize to new, unseen examples.

The final test accuracy of the less-regularized model was 0.699.

  2. The "Normalized and More Regularized" Model

Now we introduce the second experiment. This model adds transforms.Normalize to the input pipeline and uses significantly more Dropout and Dropout2d layers.

train_transform = transforms.Compose([
    transforms.Resize(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),                  
    transforms.Normalize((0.4914, 0.4822, 0.4465), 
                         (0.2023, 0.1994, 0.2010))])
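The mean and std passed to Normalize are the commonly cited per-channel statistics of the CIFAR-10 training set. They can be recomputed with a short sketch like this (at the native 32×32 resolution, before any augmentation); note that the per-pixel std comes out closer to (0.247, 0.243, 0.262), since the (0.2023, 0.1994, 0.2010) values circulating online are derived with a different convention:

import torch
from torchvision import datasets, transforms

raw = datasets.CIFAR10(root="./data", train=True, download=True,
                       transform=transforms.ToTensor())
# Stack all 50,000 images into one tensor of shape (50000, 3, 32, 32)
data = torch.stack([img for img, _ in raw])
print(data.mean(dim=(0, 2, 3)))   # ~ tensor([0.4914, 0.4822, 0.4465])
print(data.std(dim=(0, 2, 3)))    # per-pixel std; differs slightly from the set used above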

Sample of train and test images:

CNN model Architecture: Over Regularized

class CNN(nn.Module):
    def __init__(self,input_features=10):
        super().__init__()
        self.convo = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Dropout2d(0.2), 

            nn.Conv2d(64, 128, kernel_size=3, padding=1, stride=2),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.Dropout2d(0.2), 

            nn.Conv2d(128, 256, kernel_size=3, padding=1, stride=2),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.Dropout2d(0.2), 

            nn.Conv2d(256, 128, kernel_size=3, padding=1, stride=2),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.Dropout2d(0.2),
            nn.MaxPool2d(2),

            nn.Conv2d(128, 64, kernel_size=3, padding=1, stride=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Dropout2d(0.3), 
            nn.MaxPool2d(2),
        )

        self.fc=nn.Sequential(
            nn.Flatten(),
            nn.Linear(64*3*3,2048),
            nn.ReLU(),
            nn.Dropout(0.5), 
            nn.Linear(2048,1024),
            nn.ReLU(),
            nn.Dropout(0.4),
            nn.Linear(1024,512),
            nn.ReLU(),
            nn.Dropout(0.5), 
            nn.Linear(512,10)
        )

    def forward(self,x):
        x=self.convo(x)
        x=self.fc(x)
        return x

RESULTS:

The graph below illustrates the training and validation curves for this heavily regularized model.

Looking at the graph, we observe a significantly different learning pattern compared to our previous model.

Slower learning: Both the training and validation accuracy curves rise much more slowly, and the loss curves fall at a much gentler pace. The model struggles to achieve high accuracy even on the training set.

Reduced Overfitting (but at a cost): While the gap between training and validation accuracy is smaller, indicating less overfitting, the overall accuracy is lower. The model isn't memorizing, but it's also not learning the underlying patterns effectively. It seems to be "stuck".

This behavior is a classic example of over-regularization. While Normalize primarily helps stabilize training by centering and scaling the inputs, excessive Dropout2d on top of that stabilization can unintentionally restrict learning capacity. The model is being "handicapped" to such an extent that it cannot fully grasp the patterns in the data, leading to underfitting or a severely slowed learning curve.
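As a rough back-of-the-envelope illustration (not from the original post): with Dropout2d(0.2) after the first four conv blocks and Dropout2d(0.3) after the fifth, the chance that any particular chain of feature channels survives every stage of a single training pass is only about 0.8⁴ × 0.7 ≈ 0.29. In other words, roughly 70% of the channel combinations the classifier head might rely on are disrupted on any given step, which goes a long way toward explaining the sluggish training curve.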

After training, this regularized model achieved a test accuracy of 0.65. This is noticeably lower than the 0.699 accuracy achieved by the "Not Normalized" model.

Why the "Unexpected" Results?

The observation that our first model (without explicit Normalize and Dropout2d layers) performed better than the heavily regularized one, showing higher accuracy at earlier epochs, is insightful. It points to a common challenge in deep learning: hyperparameter tuning.

Let's break down the reasons behind this "unexpected" outcome:

  • Over-regularization: When both Normalize (which indirectly aids stability and can thus contribute to regularization) and Dropout2d or Dropout are applied extensively, the model might become over-regularized. Excessive regularization makes it too difficult for the model to learn the training data effectively, even if the dataset isn't particularly large or complex. This leads to underfitting or simply a slower learning curve, resulting in lower training accuracy and, consequently, lower validation/test accuracy within a limited number of epochs. The model is intentionally being "handicapped" too much to fully grasp the patterns.

  • Learning Rate Mismatch: Normalizing the inputs generally supports higher learning rates, but it doesn't guarantee the one you picked is optimal. It's possible that, without normalization, our learning rate and initial weights were (coincidentally) a better match for the raw input distribution, giving rapid initial learning, even if that learning later became unstable or led to overfitting.

  • Early Convergence vs. Generalization: The model without heavy regularization might have achieved higher accuracy at fewer epochs by memorizing parts of the training data very quickly. Regularization, by design, slows down this memorization process to force true learning and generalization. While the non-regularized model might look good early on, it often plateaus faster and then overfits, leading to degraded performance on new data if trained for many more epochs. The regularized model, though slower to start, might eventually achieve higher peak performance and better generalization with more training.

Key Takeaway

This experience highlights that regularization and normalization are hyperparameters that must be tuned. They are powerful tools, but applying them without proper tuning (especially the dropout rates and how they interact with learning rates) can sometimes hinder early training performance.

To truly assess their benefit, it's often necessary to:

  • Monitor validation accuracy closely.

  • Train for more epochs to see if the regularized model eventually surpasses the non-regularized one.

  • Experiment with different dropout rates (e.g., lower dropout in early layers, or higher in deeper/FC layers where overfitting is more common).

  • Adjust the learning rate schedule to complement the normalization and regularization (a small sketch follows below).
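
As one concrete example of the last two points (a sketch under assumed settings, not from the original experiments): ReduceLROnPlateau lowers the learning rate whenever validation accuracy stalls, and checkpointing on the best validation score keeps the run honest. One could also pair this with lighter Dropout2d (e.g. 0.05–0.1) in the early conv blocks rather than 0.2 everywhere. The train_one_epoch and evaluate helpers are hypothetical placeholders, not functions from the post.

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Cut the LR by 5x whenever validation accuracy stops improving for 3 epochs
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max",
                                                       factor=0.2, patience=3)

best_val_acc = 0.0
for epoch in range(40):
    train_one_epoch(model, train_loader, optimizer)   # hypothetical helper (not shown in the post)
    val_acc = evaluate(model, val_loader)             # hypothetical helper (not shown in the post)
    scheduler.step(val_acc)                           # scheduler watches validation accuracy
    if val_acc > best_val_acc:                        # keep the best checkpoint seen so far
        best_val_acc = val_acc
        torch.save(model.state_dict(), "best_model.pt")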
