Convolutional Neural Network

Nitin Sharma

A convolutional neural network (CNN) is an advanced deep learning architecture designed for the identification and classification of images. In addition to image recognition, CNNs are utilized for object detection within images, audio classification, and the analysis of time-series data. A convolutional layer processes an input volume, transforming it into an output volume that may vary in size.

Convolutional neural networks (commonly known as convnets) build on fully connected neural networks. They consist of layers of neurons with learnable weights and biases. Each neuron applies a linear transformation to its inputs followed by a nonlinear activation function, and the network as a whole expresses a single scoring function that maps raw image pixels at the input layer to class scores at the output layer.
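As a rough illustration of "a linear transformation followed by a nonlinear activation", here is a minimal sketch of one small layer of neurons. The choice of ReLU as the activation, the function names, and all the numbers are assumptions made for this example; the article does not prescribe a particular activation.

```python
import numpy as np

def relu(z):
    """Nonlinear activation (assumed here): negative values become zero."""
    return np.maximum(z, 0.0)

def dense_layer(x, weights, bias):
    """Linear transformation (weights @ x + bias) followed by the activation."""
    return relu(weights @ x + bias)

# Hypothetical input with three values and a layer of two neurons
x = np.array([1.0, -2.0, 3.0])
weights = np.array([[0.5, 0.0, 1.0],
                    [1.0, 1.0, -1.0]])
bias = np.array([0.5, 0.0])

activations = dense_layer(x, weights, bias)
print(activations)  # first neuron: 4.0, second neuron: 0.0 (clipped by ReLU)
```

The weights and biases are the learnable parts; during training they would be adjusted rather than fixed by hand as they are here.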

The advantage of convnets lies in their explicit assumptions about the structure of the input, especially images. These assumptions let the architecture encode properties of the data directly, which makes it more efficient to implement and greatly reduces the number of parameters in the network.
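To make the parameter reduction concrete, the sketch below counts parameters for a fully connected layer and for a convolutional layer applied to the same input. The 32x32x3 input, the 100 fully connected units, and the ten 3x3 filters are illustrative assumptions, not figures from this article.

```python
# Illustrative sizes (assumptions chosen for this example)
height, width, depth = 32, 32, 3

def dense_params(n_inputs, n_units):
    """One weight per input per unit, plus one bias per unit."""
    return n_inputs * n_units + n_units

def conv_params(filter_size, in_depth, n_filters):
    """Each filter spans the full input depth and has one bias."""
    return (filter_size * filter_size * in_depth + 1) * n_filters

fc_count = dense_params(height * width * depth, 100)
conv_count = conv_params(3, depth, 10)

print(fc_count)    # 307300
print(conv_count)  # 280
```

The convolutional layer needs far fewer parameters because each filter is reused at every position of the image instead of having a separate weight for every pixel.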

Instead of treating input data as flat linear arrays, convnets treat it as three-dimensional volumes with width, height, and depth. Each layer accepts a 3D volume of numbers as input and produces another 3D volume as output. With the color channels as the third dimension, a two-dimensional input image becomes a three-dimensional volume, which preserves the spatial arrangement of the pixels for the layers that follow.
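A short sketch of the difference between a flat array and a 3D volume, using NumPy. The 4x5 image size and the (height, width, depth) axis order are assumptions for illustration; some libraries place the depth axis first.

```python
import numpy as np

# A hypothetical 4x5 RGB image: height 4, width 5, depth 3 (one channel per color)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 5, 3))

# A fully connected network would see the same numbers as one flat array
flat = image.reshape(-1)

print(image.shape)  # (4, 5, 3)
print(flat.shape)   # (60,)
```

Both arrays hold the same 60 numbers, but only the 3D form keeps track of which values are spatial neighbors and which belong to the same color channel.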

A convolutional neural network is built from three main types of layers: convolutional layers, pooling layers, and fully connected layers. CNNs usually stack many layers, in particular alternating convolutional and pooling layers, which work together to extract and refine features from the input data.

In convolutional layers, specialized nodes, or filters, slide over the input data to detect patterns, such as edges or textures, by performing convolution operations. Pooling layers follow, downsampling the feature maps generated by the convolutional layers to reduce their dimensionality, thereby retaining essential information while minimizing computational load.

Finally, fully connected layers integrate the features learned throughout the network, connecting all nodes to produce the final output. Together, these layers enable a CNN to analyze and interpret complex data, which makes CNNs a powerful tool in applications such as image and video recognition.

Convolution Layer

The convolution process consists of the following steps:

  1. It commences with an input volume.

  2. A filter is applied at every position throughout the input.

  3. The process yields an output volume, which typically differs in size from the input.

To convolve a 3x3 filter with an image, multiply the filter's values element-wise with the corresponding 3x3 patch of the input matrix, sum the resulting products, and add a bias to produce one output value. Then move the filter one step to the right and repeat the same computation. On reaching the end of a row, move the filter one step down and repeat the process. Moving one position at a time is called using a stride of 1.
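The procedure described above can be sketched in a few lines of NumPy. This is a minimal single-channel version with a stride of 1 and no padding; the function name `conv2d` and the sample values are assumptions for illustration. As is usual in deep learning, the filter is not flipped, so strictly speaking this computes a cross-correlation.

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    """Slide `kernel` over a 2D `image` with stride 1 and no padding."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):        # move down one step at a time
        for j in range(out.shape[1]):    # move right one step at a time
            window = image[i:i + f_h, j:j + f_w]
            # Element-wise product, sum, then add the bias
            out[i, j] = np.sum(window * kernel) + bias
    return out

image = np.arange(16, dtype=float).reshape(4, 4)  # values 0..15
kernel = np.ones((3, 3))                          # hypothetical filter
result = conv2d(image, kernel, bias=1.0)
print(result)
# [[46. 55.]
#  [82. 91.]]
```

A 4x4 input with a 3x3 filter leaves only two valid positions in each direction, so the output is 2x2.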

Zero-Padding

Zero-padding adds zeros around the border of an image.

The primary advantages of utilizing padding in convolutional neural networks are as follows:

  • Padding makes it possible to apply a convolutional layer without shrinking the height and width of the input volume. This matters when building deeper networks, because otherwise the height and width would shrink with every layer. The "same" convolution is the specific case in which the height and width are exactly preserved after one layer.

  • Additionally, padding contributes to the retention of information at the periphery of an image. In the absence of padding, the influence of border pixels on subsequent layers would be significantly diminished, thereby compromising the utilization of crucial edge data.
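A minimal sketch of zero-padding with NumPy's `np.pad`. The helper name `zero_pad` and the 5x5 image are assumptions for illustration. For a 3x3 filter with a stride of 1, a padding of 1 gives a "same" convolution.

```python
import numpy as np

def zero_pad(image, pad):
    """Add `pad` rows and columns of zeros on every side of a 2D image."""
    return np.pad(image, pad_width=pad, mode="constant", constant_values=0)

image = np.ones((5, 5))
padded = zero_pad(image, 1)

print(padded.shape)  # (7, 7)

# Sliding a 3x3 filter with stride 1 over the 7x7 padded image
# yields 7 - 3 + 1 = 5 positions per side: the original 5x5 size.
same_size = padded.shape[0] - 3 + 1
print(same_size)     # 5
```

With padding, the original border pixels also fall inside more filter windows than they would otherwise, which is the second advantage listed above.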

Stride

Stride is the number of positions the filter window moves each time it slides. For example, convolving a 3x3 filter with a stride of 2 over an unpadded 7x7 input produces a 3x3 output. The main advantage of a larger stride is that it downsamples the input efficiently, which lets the network concentrate on more significant features. Because the filter skips positions, it is applied at fewer locations, so the layer performs fewer operations and produces a smaller output. This can speed up both training and inference.
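The downsampling effect of stride can be seen in a small sketch. It assumes a square single-channel input, a square filter, and no padding; the function name `conv2d_strided` and the 7x7 input are chosen for illustration.

```python
import numpy as np

def conv2d_strided(image, kernel, stride):
    """Convolve a square image with a square kernel, moving `stride` steps at a time."""
    n = image.shape[0]
    f = kernel.shape[0]
    out_size = (n - f) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            top, left = i * stride, j * stride   # window jumps by `stride`
            window = image[top:top + f, left:left + f]
            out[i, j] = np.sum(window * kernel)
    return out

image = np.ones((7, 7))
kernel = np.ones((3, 3))

out_stride_1 = conv2d_strided(image, kernel, stride=1)
out_stride_2 = conv2d_strided(image, kernel, stride=2)

print(out_stride_1.shape)  # (5, 5): 25 filter applications
print(out_stride_2.shape)  # (3, 3): 9 filter applications
```

Going from a stride of 1 to a stride of 2 cuts the number of filter applications from 25 to 9 in this example.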

If the input image is n x n, the filter is f x f, the padding is P, and the stride is S, then the output size is

\(\lfloor\frac{n+2P-f}{S}\rfloor +1\) by \(\lfloor\frac{n+2P-f}{S}\rfloor +1\)

where the division is rounded down when S does not divide n + 2P - f evenly.
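The output-size formula is easy to check with a small helper. The function name `conv_output_size` is hypothetical. It uses integer division, which rounds down when the stride does not divide evenly; that rounding is the common convention and is an assumption here.

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Output height/width for an n x n input and an f x f filter."""
    return (n + 2 * padding - f) // stride + 1

size_strided = conv_output_size(7, 3, padding=0, stride=2)
size_same = conv_output_size(5, 3, padding=1, stride=1)
size_uneven = conv_output_size(6, 3, padding=0, stride=2)

print(size_strided)  # 3  (7x7 input, 3x3 filter, stride 2)
print(size_same)     # 5  ("same" convolution keeps the 5x5 size)
print(size_uneven)   # 2  (3 / 2 is rounded down to 1, then 1 is added)
```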

Pooling Layer

The pooling (POOL) layer reduces the height and width of the input, which helps decrease computation while also making feature detectors more invariant to their position in the input. There are two main types of pooling layers:

  • Max-Pooling Layer: This layer stores the maximum value within the specified window in the output.

  • Average-Pooling Layer: This layer calculates and stores the average value within the specified window in the output.
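Both pooling types can be sketched with one helper that differs only in the reduction applied to each window. The function name `pool2d`, the 2x2 window with a stride of 2, and the sample feature map are assumptions for illustration.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample a 2D feature map with max or average pooling."""
    reducer = np.max if mode == "max" else np.mean
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = feature_map[top:top + size, left:left + size]
            out[i, j] = reducer(window)
    return out

feature_map = np.array([[1.0, 3.0, 2.0, 4.0],
                        [5.0, 7.0, 6.0, 8.0],
                        [9.0, 2.0, 1.0, 0.0],
                        [3.0, 4.0, 5.0, 6.0]])

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")

print(max_pooled)
# [[7. 8.]
#  [9. 6.]]
print(avg_pooled)
# [[4.  5. ]
#  [4.5 3. ]]
```

Unlike a convolutional layer, a pooling layer has no weights to learn; it only summarizes each window.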
