Introduction to Convolution


If you are thinking of dipping your toes into Computer Vision or even Deep Learning, you might’ve already come across the term Convolutional Neural Networks. These are an extension of Artificial Neural Networks tailored to learn from images.
In this series, we will try to understand the inner workings of CNN and try to implement it from scratch without using any deep learning frameworks.
Prerequisites
Knowledge of fundamental libraries like numpy and scipy.
Knowledge of Artificial Neural Networks (covered before on our blog).
What is a Convolution?
Before we start discussing the Architecture of a CNN, we need to understand what a convolution means.
Well fundamentally, it is just a simple operation very similar to how you multiply two polynomials. Let’s say that we have two polynomials:
$$\begin{align} p(x) &= x^3 + 2x^2 + 3x \\ g(x) &= 4x^3 + 5x^2 + 6x \end{align}$$
We know the standard way to multiply them, but let’s try a different approach this time. Consider an array named coeff_p comprising the coefficients of p, and an array coeff_g comprising the coefficients of g.
To perform the convolution, we reverse the second array, i.e. the coeff_g array, and treat it like a sliding window over the first. So now we have:
Let’s push the second array a little ahead. Now we can see that the product of the pair which is lined up (i.e. 1 and 4) forms the coefficient of the highest power of x.
If we push it one step further, we have two products (4×2 and 5×1), which add up to form the coefficient of the next highest power.
We can continue with this process and this pattern will give us the coefficients of all the powers in order, starting from the highest to the lowest. You can even verify this by multiplying the polynomial yourself.
$$\begin{align} p(x) \cdot g(x) &= (x^3 + 2x^2 + 3x)(4x^3 + 5x^2 + 6x) \\ &= 4x^6 + 13x^5 + 28x^4 + 27x^3 + 18x^2 \end{align}$$
Therefore, we can conclude that the convolution operation for two arrays consists of three steps:
Reversing the second array.
Sliding it along the first array like a window.
Multiplying the overlapping elements and adding up the products.
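The three steps above can be checked with NumPy’s np.convolve, which performs exactly this operation. Here is a minimal sketch, assuming numpy is installed:

```python
import numpy as np

# Coefficients of p(x) = x^3 + 2x^2 + 3x and g(x) = 4x^3 + 5x^2 + 6x,
# from the highest power down (both polynomials end at x, so the
# constant terms are omitted).
coeff_p = [1, 2, 3]
coeff_g = [4, 5, 6]

# np.convolve reverses the second array, slides it over the first,
# and sums the overlapping products at each step.
result = np.convolve(coeff_p, coeff_g)
print(result.tolist())  # [4, 13, 28, 27, 18]
```

These are precisely the coefficients of 4x⁶ + 13x⁵ + 28x⁴ + 27x³ + 18x², matching the hand-computed product above.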
In subsequent parts of the series we will see how this is used in CNNs.
Architecture of a CNN
A CNN, just like a regular Neural Network, consists of layers. But here we have four different types of layers:
Convolution Layer
Max Pool Layer
Activation Layer
Fully Connected Layer
Convolution Layer
As the name suggests, this layer handles the convolution operation between two matrices, one of which is our image matrix. The second one is a bit special.
Let’s consider a 3×3 matrix with some specific values which are magically set to help us learn the image features. This is the second matrix which we will convolve with the image matrix. It is also called a filter or kernel.
You might be wondering how we get those magic values. That’s where backpropagation comes into play, but that’s something we will discuss in detail in later parts of the series.
The output of this convolution layer is called a feature map, because it gives us a matrix representing the learned features of the image.
Max Pool Layer
Max pooling is a crucial operation that reduces the spatial dimensions of an input feature map, making computations more efficient while reducing both the number of parameters and the risk of overfitting.
It operates by sliding a window (or kernel) over the feature map and extracting the maximum value from each region it covers.
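Here is a minimal sketch of 2×2 max pooling with stride 2, using only a NumPy reshape trick (it assumes the feature map’s dimensions are divisible by the pool size):

```python
import numpy as np

# A toy 4x4 feature map.
feature_map = np.array([
    [ 1,  2,  3,  4],
    [ 5,  6,  7,  8],
    [ 9, 10, 11, 12],
    [13, 14, 15, 16],
])

pool = 2
h, w = feature_map.shape

# Split the 4x4 map into non-overlapping 2x2 blocks, then take the
# maximum of each block (axes 1 and 3 index within each block).
pooled = feature_map.reshape(h // pool, pool, w // pool, pool).max(axis=(1, 3))
print(pooled.tolist())  # [[6, 8], [14, 16]]
```

Each output value is the maximum of one 2×2 region, so the 4×4 map shrinks to 2×2 while keeping the strongest activation from every region.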
Fully Connected Layer and Activation Layer
The fully connected layers and activation layers are essentially the same as in an Artificial Neural Network. Then why don’t we just use ANNs right away? This might be the first question that comes to mind, and the answer is pretty simple.
Because of the large size of image files (and datasets), we cannot treat every pixel as a feature and feed it directly into the Neural Network. We would have to learn weights and biases for all of them, which is simply impractical.
Instead, we first use a combination of convolution layers and max pool layers to obtain feature maps that retain the most important features of the image. The fully connected layers are then used at the end of the architecture.
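To get a rough sense of the scale involved, here is a back-of-the-envelope count of the weights needed if every pixel of a 224×224 RGB image (the input size VGG16 uses) were fed directly into a single fully connected layer of 1000 units:

```python
# Every pixel of a 224x224 RGB image treated as an input feature.
input_features = 224 * 224 * 3   # 150,528 features
units = 1000                     # a single dense layer of 1000 units

# One weight per (feature, unit) pair, plus one bias per unit.
parameters = input_features * units + units
print(parameters)  # 150529000 -- over 150 million, for just one layer
```

Convolution layers sidestep this by sharing one small kernel (e.g. 3×3 = 9 weights) across the entire image.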
Final Picture
This is an illustration of the architecture of VGG16, a well-known CNN that has shown splendid results on various standard datasets like ImageNet.
Image source : learnopencv.com
References
Though we will discuss all these layers in detail with code in the coming articles, you can also check out this video by 3Blue1Brown for a deeper understanding of convolutions. To be honest, this explanation was borrowed from that video, so don’t forget to check it out.
Written by
Ayush Saraswat
I am currently a 3rd year B.Tech Undergrad, excited to learn new stuff about programming, and share my findings with everyone.