Understanding Convolution in Neural Networks: Filters, Padding, and Strides


Convolutional Neural Networks (CNNs) are at the heart of modern computer vision. They excel at tasks like image classification, object detection, and more, thanks to their ability to automatically extract features from images. In this blog, we'll break down the core concepts of convolution, filters (kernels), feature maps, padding, and strides, and provide code examples to help you understand how these work in practice.
What Are Filters (Kernels) in CNNs?
Filters, also known as kernels, are small matrices (often 3x3) that slide over the input image to detect specific features, such as edges or textures. Each filter is designed to highlight a particular pattern in the image.
For example, a horizontal edge detector filter might look like this:
```text
[[-1, -1, -1],
 [ 0,  0,  0],
 [ 1,  1,  1]]
```
When this filter is applied to an image, it highlights horizontal edges by amplifying the difference between pixel values in the vertical direction.
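To see this in action, here is a minimal NumPy sketch (names and the toy image are illustrative) that slides the horizontal edge filter above over a small image with a dark top half and a bright bottom half. Like CNN layers, it computes a cross-correlation: an element-wise product and sum at each position.

```python
import numpy as np

# Horizontal edge detector from the text
kernel = np.array([[-1, -1, -1],
                   [ 0,  0,  0],
                   [ 1,  1,  1]], dtype=float)

# Toy 5x5 grayscale image: dark top half, bright bottom half
image = np.array([[0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0],
                  [9, 9, 9, 9, 9],
                  [9, 9, 9, 9, 9],
                  [9, 9, 9, 9, 9]], dtype=float)

def convolve2d(img, k):
    """Slide the kernel over the image (cross-correlation, as CNN layers do)."""
    f = k.shape[0]
    out = np.zeros((img.shape[0] - f + 1, img.shape[1] - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + f, j:j + f] * k)
    return out

fm = convolve2d(image, kernel)
print(fm)
```

The feature map responds strongly (value 27) wherever the 3x3 window straddles the dark-to-bright transition, and is zero where the window sits entirely in the uniform bottom region.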
The Convolution Operation
The process of applying a filter to an image is called convolution. The filter slides (or convolves) over the image, computing a weighted sum at each position. The result is a feature map that highlights where the filter's pattern appears in the image.
After convolution, it's common to apply a ReLU (Rectified Linear Unit) activation function, which sets all negative values in the feature map to zero and keeps the positive values as they are. This introduces non-linearity and helps the network learn complex patterns.
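In NumPy, ReLU on a feature map is a one-liner (the sample values here are made up):

```python
import numpy as np

feature_map = np.array([[ 3, -2],
                        [-5,  7]])

# ReLU: negative values become 0, positive values pass through unchanged
relu_map = np.maximum(feature_map, 0)
print(relu_map)
```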
Multiple Filters and Output Channels
A single image can be processed by multiple filters, each detecting different features. If you use two filters (e.g., one for vertical edges and one for horizontal edges) on an RGB image of size 6x6x3, each filter produces a 4x4 feature map. Stacking these together gives an output of size 4x4x2, where 2 is the number of filters (output channels).
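A quick NumPy sketch (with random values standing in for a real image and learned filters) confirms these shapes: each filter spans all three input channels, and stacking the two resulting 4x4 maps gives 4x4x2.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))       # 6x6 RGB image
filters = rng.random((2, 3, 3, 3))  # 2 filters, each 3x3 across 3 channels

n, f = 6, 3
out = np.zeros((n - f + 1, n - f + 1, len(filters)))  # 4x4x2
for c, k in enumerate(filters):
    for i in range(n - f + 1):
        for j in range(n - f + 1):
            # Each filter covers all 3 channels and yields a single number
            out[i, j, c] = np.sum(image[i:i + f, j:j + f, :] * k)

print(out.shape)  # (4, 4, 2)
```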
The Problem: Shrinking Feature Maps
Each time you apply a filter, the output feature map becomes smaller. For an f×f filter applied to an n×n image, the output size is:
$$n-f+1$$
This reduction can cause loss of important information, especially at the edges.
Solution: Padding
Padding involves adding a border of zeros around the image before applying the filter. This ensures that the filter can process the edges and corners, preserving more information.
Zero Padding: Adds zeros around the image.
Formula with Padding:
$$\text{Output size} = n + 2p - f + 1$$
where p is the padding size.
For example, adding a padding of 1 to a 5x5 image with a 3x3 filter results in a 5x5 feature map—no reduction in size.
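A tiny helper (illustrative, not from any library) makes both formulas easy to check:

```python
def output_size(n, f, p=0):
    """Output size for an f x f filter on an n x n image with padding p."""
    # Without padding this is n - f + 1; padding adds p pixels on each side.
    return n + 2 * p - f + 1

print(output_size(6, 3))       # 4: a 3x3 filter shrinks a 6x6 image to 4x4
print(output_size(5, 3, p=1))  # 5: padding of 1 keeps the 5x5 size
```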
Strides: Controlling the Step Size
Stride determines how far the filter moves at each step. A stride of 1 means the filter moves one pixel at a time; a stride of 2 skips every other pixel.
Formula with Stride and Padding:
$$\text{Output size} = \frac{n + 2p - f}{s} + 1$$
where s is the stride.
Larger stride: Reduces the size of the feature map and speeds up computation.
Smaller stride: Preserves more detail.
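Extending the earlier helper with a stride parameter (again, an illustrative sketch), integer division implements the floor that applies when the stride does not divide the span evenly:

```python
def output_size(n, f, p=0, s=1):
    """Output size with filter size f, padding p, and stride s."""
    # floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

print(output_size(7, 3, s=1))  # 5: stride 1, no padding
print(output_size(7, 3, s=2))  # 3: stride 2 roughly halves the output
```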
Code Examples: Adding Filters, Padding, and Strides in CNNs
Here's how you can implement these concepts using TensorFlow and Keras:
```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Example input: 32x32 RGB image
input_shape = (32, 32, 3)

model = models.Sequential()

# Convolutional layer with 2 filters (kernels), 3x3 size, stride 1, and padding
model.add(layers.Conv2D(
    filters=2,            # Number of filters
    kernel_size=(3, 3),   # Filter size
    strides=(1, 1),       # Stride
    padding='same',       # 'same' applies zero padding to keep output size
    activation='relu',
    input_shape=input_shape
))

# Convolutional layer with stride 2 (downsampling)
model.add(layers.Conv2D(
    filters=4,
    kernel_size=(3, 3),
    strides=(2, 2),       # Stride of 2
    padding='valid',      # No padding
    activation='relu'
))

model.summary()
```
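The spatial dimensions that `model.summary()` reports can be checked by hand with the stride formula (a sketch in plain Python, no TensorFlow needed):

```python
# First layer: 'same' padding with stride 1 keeps the 32x32 size
n, f, s = 32, 3, 1
p = (f - 1) // 2                  # 'same' padding for a 3x3 filter is 1
layer1 = (n + 2 * p - f) // s + 1
print(layer1)                     # 32 -> output shape (32, 32, 2)

# Second layer: 'valid' (no padding) with stride 2
n, f, p, s = layer1, 3, 0, 2
layer2 = (n + 2 * p - f) // s + 1
print(layer2)                     # 15 -> output shape (15, 15, 4)
```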
Visualizing the Convolution Process
Imagine a 5x5 image and a 3x3 filter. The filter slides over the image, computing a weighted sum at each position. With zero padding, even the corners are processed, ensuring no information is lost at the edges.
Key Takeaways
Filters (kernels) detect specific features in images.
Convolution produces feature maps that highlight these features.
Padding preserves information at the edges.
Strides control the step size and output resolution.
Multiple filters create multiple output channels, allowing the network to learn diverse features.
Understanding these building blocks is essential for designing effective convolutional neural networks for image analysis and beyond.
Special Case: Calculating the Feature Map Size (Stride = 2)
When performing convolution on a grayscale image of size 6×7 with a 3×3 filter and a stride of 2, the formula to calculate the feature map size is:
$$\text{Output size} = \left\lfloor \frac{n - f}{s} + 1 \right\rfloor$$
Where:
n = input dimension (height or width)
f = filter (kernel) size
s = stride
Applying the Formula
For a 6×7 image and a 3×3 filter with stride = 2:
Height Calculation (n = 6)
$$\text{Output height} = \left\lfloor \frac{6 - 3}{2} + 1 \right\rfloor = \left\lfloor \frac{3}{2} + 1 \right\rfloor = \left\lfloor 1.5 + 1 \right\rfloor = \left\lfloor 2.5 \right\rfloor = 2$$
Width Calculation (n = 7)
$$\text{Output width} = \left\lfloor \frac{7 - 3}{2} + 1 \right\rfloor = \left\lfloor \frac{4}{2} + 1 \right\rfloor = \left\lfloor 2 + 1 \right\rfloor = \left\lfloor 3 \right\rfloor = 3$$
Final Feature Map Size: 2 × 3
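This can be verified by counting the positions a 3x3 window can actually occupy in a 6x7 image at stride 2 (a quick NumPy sketch with a dummy image):

```python
import numpy as np

image = np.ones((6, 7))  # 6x7 grayscale image (contents don't matter here)
f, s = 3, 2              # 3x3 filter, stride 2

rows = (image.shape[0] - f) // s + 1  # floor((6-3)/2) + 1 = 2
cols = (image.shape[1] - f) // s + 1  # floor((7-3)/2) + 1 = 3

# Top-left corners the filter can actually occupy
positions = [(i, j) for i in range(0, image.shape[0] - f + 1, s)
                    for j in range(0, image.shape[1] - f + 1, s)]
print(rows, cols)       # 2 3
print(len(positions))   # 6 = 2 * 3
```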
When padding is applied, the formula becomes:
$$\text{Output size} = \left\lfloor \frac{n + 2p - f}{s} + 1 \right\rfloor$$
Where:
n = input size (height or width)
f = filter (kernel) size
p = padding size
s = stride
Written by

Meemansha Priyadarshini