Building Intuition for Convolutional Neural Networks

Harvey Ducay

The motivation behind Convolutional Neural Networks (CNNs) comes from how poorly traditional dense neural networks perform on image classification tasks. Why is that? A dense network, also known as a fully-connected network, treats an image as a flat vector of pixels. If you flatten a 32x32 RGB image, you get a 3,072-dimensional vector (32 × 32 × 3), and all spatial information is discarded. The network has no inherent understanding that one pixel is "next to" another. This makes it difficult to learn concepts like edges, textures, or shapes, and the network never learns translation invariance, the idea that a cat is still a cat whether it sits in the top-left or bottom-right corner of the image.

This is where CNNs shine. They are designed to work directly on pixel grids: instead of flattening the input, they slide small filters (kernels) across the image, each filter recognizing local patterns like edges, corners, and textures. Deeper layers then combine these simple patterns into more complex features like eyes, wheels, or wings.
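To make the idea of a filter "sliding across the image" concrete, here is a tiny, self-contained sketch (not part of the project code; the image and kernel are made up) that applies a hand-made vertical-edge filter with F.conv2d:

import torch
import torch.nn.functional as F

# A tiny 6x6 "image": left half dark (0), right half bright (1)
image = torch.zeros(1, 1, 6, 6)
image[:, :, :, 3:] = 1.0

# A hand-made 3x3 vertical-edge filter: it responds where brightness changes left-to-right
kernel = torch.tensor([[[[-1.0, 0.0, 1.0],
                         [-1.0, 0.0, 1.0],
                         [-1.0, 0.0, 1.0]]]])

# Slide the filter across the image (stride 1, no padding)
response = F.conv2d(image, kernel)
print(response.squeeze())
# The strongest responses line up with the vertical edge in the middle of the image

A learned convolutional layer works the same way, except the network learns the filter values itself instead of us writing them by hand.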

In this post, we'll build a CNN from scratch using PyTorch to understand its core components. We'll train it on the popular CIFAR-10 dataset and see how it learns to classify images into one of ten categories.

Let's break down the process step-by-step.

Phase 1: Importing Dependencies

First, we import all the necessary libraries. We'll be using torch and its nn module for building the network, torchvision for the dataset and image transformations, and PIL for handling our own custom images later.

import numpy as np
from PIL import Image

import torch
import os
from torch import nn

import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader

import torchvision
from torchvision import datasets, transforms

Phase 2: Data Transformation and Loading

Before we can feed images to our network, we need to preprocess them. This is done using torchvision.transforms.

  • transforms.ToTensor(): This converts the image from a PIL Image format (with pixel values from 0-255) to a PyTorch tensor (with values from 0.0 to 1.0).

  • transforms.Normalize(): This standardizes the pixel values. The first tuple (0.5, 0.5, 0.5) is the per-channel mean and the second is the per-channel standard deviation for the three (R, G, B) channels. Normalizing centers the data around zero, which helps the network train faster and more stably.

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
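As a quick sanity check (a minimal sketch using the modules we import in Phase 1), with mean 0.5 and std 0.5 each channel value x is mapped to (x - 0.5) / 0.5, so the [0, 1] range produced by ToTensor becomes [-1, 1]:

# With mean=0.5 and std=0.5 per channel, Normalize computes (x - 0.5) / 0.5
fake_image = torch.tensor([[[0.0, 0.5, 1.0]]]).repeat(3, 1, 1)  # a tiny made-up 3-channel "image"
normalize = transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
print(normalize(fake_image))
# Every channel becomes [-1.0, 0.0, 1.0]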

With our transformation pipeline ready, we can load the CIFAR-10 dataset. We also wrap our datasets in a DataLoader, which is a handy utility that provides batches of data, shuffles it for each epoch, and can even use multiple workers to load data in parallel.

train_data = torchvision.datasets.CIFAR10(root='./data', train=True, transform=transform, download=True)
test_data = torchvision.datasets.CIFAR10(root='./data', train=False, transform=transform, download=True)

train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True, num_workers=2)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=32, shuffle=False, num_workers=2)  # no need to shuffle the test set

class_names = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

The CIFAR-10 images are 3-channel (RGB) images of size 32x32 pixels. Let's confirm this:

image, label = train_data[0]
print(image.size())
# Output: torch.Size([3, 32, 32])

Phase 3: Defining the Neural Network Architecture

This is the core of our project. We'll define a class NeuralNet that inherits from nn.Module.

class NeuralNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Input: (3, 32, 32)
        self.conv1 = nn.Conv2d(3, 16, 5, padding=2) # 32x32 -> 32x32
        self.pool1 = nn.MaxPool2d(2, 2)            # 32x32 -> 16x16
        # Shape: (16, 16, 16)

        self.conv2 = nn.Conv2d(16, 32, 3, padding=1) # 16x16 -> 16x16
        self.pool2 = nn.MaxPool2d(2, 2)             # 16x16 -> 8x8
        # Shape: (32, 8, 8)

        self.conv3 = nn.Conv2d(32, 64, 3, padding=1) # 8x8 -> 8x8
        self.pool3 = nn.MaxPool2d(2, 2)             # 8x8 -> 4x4
        # Shape: (64, 4, 4)

        # IMPORTANT: Calculate the new flattened size
        self.fc1 = nn.Linear(64 * 4 * 4, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)

Let's break down how the shape of our data changes as it flows through the network:

Convolutional Layers (nn.Conv2d)

The shape of the output from a convolutional layer depends on the input size, kernel size, stride, and padding. The formula is:
Output_Size = (Input_Size - Kernel_Size + 2 * Padding) / Stride + 1

  • self.conv1 = nn.Conv2d(3, 16, 5, padding=2)

    • in_channels=3: We start with a 3-channel (RGB) image.

    • out_channels=16: The layer will produce 16 feature maps.

    • kernel_size=5: The filter is a 5x5 matrix.

    • padding=2: We add a 2-pixel border around the image.

    • Shape Change: (32 - 5 + 2*2) / 1 + 1 = 32. With this padding, the height and width are preserved. Our shape goes from (3, 32, 32) to (16, 32, 32).
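If you don't want to do this arithmetic by hand every time, a small helper (just a sketch of the formula above) makes it easy to check each layer:

def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Apply the formula: (Input - Kernel + 2*Padding) / Stride + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(32, kernel_size=5, padding=2))  # 32 -> conv1 keeps 32x32
print(conv_output_size(16, kernel_size=3, padding=1))  # 16 -> conv2 keeps 16x16
print(conv_output_size(8,  kernel_size=3, padding=1))  # 8  -> conv3 keeps 8x8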

Pooling Layers (nn.MaxPool2d)

Pooling layers are used to downsample the feature maps. This reduces the computational load and makes the detected features less sensitive to their exact location in the image.

  • self.pool1 = nn.MaxPool2d(2, 2)

    • This takes a 2x2 window and keeps only the maximum value, effectively halving the height and width.

    • Shape Change: The input (16, 32, 32) becomes (16, 16, 16).

We repeat this pattern. After conv3 and pool3, our final feature map has a shape of (64, 4, 4).
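You can verify this whole shape progression by pushing a dummy batch through the layers defined in __init__ (a quick sketch; the ReLU activations are omitted because they don't change shapes):

model = NeuralNet()
x = torch.randn(1, 3, 32, 32)                     # a dummy batch containing one random image
x = model.pool1(model.conv1(x)); print(x.shape)   # torch.Size([1, 16, 16, 16])
x = model.pool2(model.conv2(x)); print(x.shape)   # torch.Size([1, 32, 8, 8])
x = model.pool3(model.conv3(x)); print(x.shape)   # torch.Size([1, 64, 4, 4])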

The Flattening and Fully-Connected Layers (nn.Linear)

The convolutional layers have done their job of extracting spatial features. Now, we need to feed these features into a standard dense network to perform the final classification. To do this, we must "flatten" our 3D feature map (64, 4, 4) into a 1D vector.

The size of this vector is channels × height × width, which is 64 × 4 × 4 = 1,024.

This is why our first fully-connected layer, fc1, is defined as nn.Linear(64 * 4 * 4, 256). It takes the 1,024 features from our flattened map and transforms them into 256 features. The final layer, fc3, outputs 10 values, one for each class in CIFAR-10.
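To see where the 1,024 comes from, here's a one-line check (a sketch with a random tensor standing in for the real feature map):

feature_map = torch.randn(1, 64, 4, 4)      # one image's final feature map
print(torch.flatten(feature_map, 1).shape)  # torch.Size([1, 1024]) -- the batch dimension is kept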

To understand how fully-connected layers work in more depth, here's a link where we walk through the math and intuition by building one from scratch.

Phase 4: Defining the Forward Propagation

The forward method defines the actual path our data takes through the layers. We apply a ReLU activation function after each convolution and after the first two fully-connected layers to introduce non-linearity, which is crucial for learning complex patterns.

    def forward(self, x):
        x = self.pool1(F.relu(self.conv1(x)))
        x = self.pool2(F.relu(self.conv2(x)))
        x = self.pool3(F.relu(self.conv3(x)))

        x = torch.flatten(x, 1) # Flatten all dimensions except batch

        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

Phase 5: Optimizer and Loss Function

To train the network, we need two things:

  1. Loss Function: Measures how wrong the model's predictions are.

  2. Optimizer: Updates the model's weights to reduce the loss.

net = NeuralNet()
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

Why CrossEntropyLoss?

nn.CrossEntropyLoss is the standard choice for multi-class classification problems like this one. It's particularly effective because it combines two operations: LogSoftmax and NLLLoss (Negative Log Likelihood Loss). Internally, it takes the raw output scores (logits) from our final layer, converts them into probabilities using a softmax function, and then calculates the loss. It heavily penalizes the model for being confident in the wrong prediction, which makes it a very effective teacher during training.
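You can verify this equivalence directly (a small sketch with made-up logits and labels, using the modules imported in Phase 1):

logits = torch.randn(4, 10)               # fake raw scores for a batch of 4 images over 10 classes
targets = torch.tensor([3, 1, 0, 7])      # fake ground-truth class indices

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
print(ce.item(), nll.item())              # the two values match (up to floating-point rounding)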

Phase 6: The Training Loop

Here, we iterate through our training data for a set number of epochs. In each step, we perform the standard training routine:

  1. Get a batch of inputs and labels.

  2. Clear previous gradients with optimizer.zero_grad().

  3. Make a prediction (outputs = net(inputs)).

  4. Calculate the loss.

  5. Perform backpropagation to calculate gradients (loss.backward()).

  6. Update the network's weights (optimizer.step()).

for epoch in range(30):
    print(f"Training Epoch: {epoch}")
    running_loss = 0.0

    for i, data in enumerate(train_loader):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = net(inputs)

        loss = loss_function(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f"Loss: {running_loss/len(train_loader):.4f}")

Phase 7: Saving the Model and Evaluating Performance

After training, we save the learned weights (the model's "state") to a file. Then, we load these weights into a fresh instance of our network and evaluate its performance on the test dataset, which it has never seen before.

We switch the network to evaluation mode with net.eval(). This matters for layers like Dropout or BatchNorm that behave differently during training and inference (our simple model has neither, but it's a good habit). We also wrap the loop in torch.no_grad() to tell PyTorch not to calculate gradients, which saves memory and computation.

torch.save(net.state_dict(), 'trained_net.pth')

net = NeuralNet()
net.load_state_dict(torch.load('trained_net.pth'))

correct = 0
total = 0

net.eval()
with torch.no_grad():
    for data in test_loader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f"Accuracy: {accuracy}%")

This will give us a final accuracy score, showing how well our CNN learned to generalize.

Phase 8: Testing with Our Own Image

Finally, the fun part! Let's see how our trained model performs on a completely new image from the web. We create a simple function to load, resize, and transform an image to match the input format our network expects.

Note the image.unsqueeze(0) step. Our network was trained on batches of images. This adds a "batch dimension" of size 1, so the tensor shape becomes (1, 3, 32, 32), which is what the network expects.

new_transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

def load_image(image_path):
    image = Image.open(image_path)
    image = image.convert('RGB')
    image = new_transform(image)
    image = image.unsqueeze(0) # Add batch dimension
    return image

# Replace with the path to your image
image_paths = ['path/to/your/image.png'] 
images = [load_image(img) for img in image_paths]

net.eval()
with torch.no_grad():
    for image in images:
        output = net(image)
        _, predicted = torch.max(output, 1)
        print(f"Prediction: {class_names[predicted.item()]}")

Conclusion

We've successfully built, trained, and tested a Convolutional Neural Network. We saw how convolutional and pooling layers work together to extract meaningful features from raw pixels, and how these features are then used by a classifier to make a final prediction. This ability to learn spatial hierarchies of patterns is what makes CNNs the powerhouse behind modern computer vision. From here, you can experiment by changing the architecture, adding more layers, or trying different optimizers to see how it affects performance.
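For example, one small experiment (a sketch, not something we ran above) is to swap the SGD optimizer for Adam and watch how the loss curve changes:

# One possible experiment: replace the SGD optimizer with Adam before rerunning the training loop
optimizer = optim.Adam(net.parameters(), lr=0.001)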

Python Code Link: https://github.com/HarvsDucs/hashnode_python_scripts/tree/main/Building%20Intuition%20for%20Convolutional%20Neural%20Networks
