Backward Pass of CNN


Since the forward pass ended on the Fully Connected Layer, it is only natural to begin the discussion of backpropagation from that layer.
Fortunately for us, it is no different from what we usually see in regular Neural Networks. Therefore, we can just proceed to implement it:
import numpy as np

class FullyConnectedLayer:
    def __init__(self, input_size, num_classes, lr=0.01):
        self.weights = np.random.randn(input_size, num_classes) / input_size
        self.biases = np.zeros(num_classes)
        self.lr = lr

    def forward(self, input):
        self.input = input
        return np.dot(input, self.weights) + self.biases

    def backward(self, dL_dout):
        dL_dB = dL_dout.sum(axis=0)
        dL_dW = self.input.reshape(1352, 1) @ dL_dout.reshape(1, 10)  # for a single image
        dL_dinput = np.dot(dL_dout, self.weights.T)  # gradient passed back to the previous layer
        # Gradient descent update
        self.weights -= self.lr * dL_dW
        self.biases -= self.lr * dL_dB
        return dL_dinput
Here, you might notice that instead of the dot product we are using matrix multiplication together with reshapes. This is because our example implementation handles only a single image, and the reshaping helps us align the matrices correctly. We will have a look at a more general implementation in later parts of the series.
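To see how those shapes line up, here is a quick sanity check of the outer product used for dL_dW, assuming a flattened input of 1352 values (13 × 13 × 8 from the forward pass) and 10 output classes; the array names are purely illustrative:

import numpy as np

flat_input = np.random.randn(1352)   # flattened 13 x 13 x 8 feature map
dL_dout = np.random.randn(1, 10)     # gradient of the loss w.r.t. the 10 class scores

dL_dW = flat_input.reshape(1352, 1) @ dL_dout.reshape(1, 10)
print(dL_dW.shape)   # (1352, 10), the same shape as the weight matrix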
Max Pool Layer
Since the Max Pool Layer retains only the maximum value of each region in the forward pass, any gradient coming in through dL_dout is passed back only to the position that held that maximum value; every other position gets zero.
def backward(self, dL_dout):
    dL_dinput = np.zeros(self.input.shape)
    h, w, num_filters = self.input.shape
    for i in range(0, h, self.size):
        for j in range(0, w, self.size):
            region = self.input[i:i+self.size, j:j+self.size]
            max_vals = np.amax(region, axis=(0, 1))
            for k in range(num_filters):
                for ii in range(self.size):
                    for jj in range(self.size):
                        if region[ii, jj, k] == max_vals[k]:
                            dL_dinput[i+ii, j+jj, k] = dL_dout[i // self.size, j // self.size, k]
    return dL_dinput
But if you combine this with our previous forward implementation, you will see that it looks quite bulky and repetitive, since we are rewriting the code for finding the maximum element.
Instead, we can extract and store the indices of the maximum elements during the forward pass using the np.argmax() function. Since np.argmax() operates on the flattened array, we convert the flat index back to the required 2-D position with np.unravel_index.
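As a quick illustration of the argmax/unravel_index trick (the values here are made up):

import numpy as np

region = np.array([[1.0, 5.0],
                   [3.0, 2.0]])

flat_idx = np.argmax(region)                    # index into the flattened array -> 1
pos = np.unravel_index(flat_idx, region.shape)  # back to 2-D coordinates
print(pos)   # (0, 1), the row and column of the maximum value

With that in hand, here is the final version of the layer: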
class MaxPoolLayer:
    def __init__(self, size=2):
        self.size = size
        self.max_positions = None  # for storing the indices of the max values

    def forward(self, input):
        self.input = input
        h, w, num_filters = input.shape
        output = np.zeros((h // self.size, w // self.size, num_filters))
        self.max_positions = np.zeros((h // self.size, w // self.size, num_filters, 2), dtype=int)
        for i in range(0, h, self.size):
            for j in range(0, w, self.size):
                region = input[i:i+self.size, j:j+self.size]
                max_vals = np.amax(region, axis=(0, 1))
                output[i // self.size, j // self.size] = max_vals
                for k in range(num_filters):
                    # Position of the maximum inside this region, converted back to 2-D
                    max_pos = np.unravel_index(np.argmax(region[:, :, k]), (self.size, self.size))
                    # Store the absolute position in the input
                    self.max_positions[i // self.size, j // self.size, k] = (i + max_pos[0], j + max_pos[1])
        return output

    def backward(self, dL_dout):
        dL_dinput = np.zeros(self.input.shape)
        h_out, w_out, num_filters = dL_dout.shape
        for i in range(h_out):
            for j in range(w_out):
                for k in range(num_filters):
                    # Route the gradient only to the position that produced the max
                    max_i, max_j = self.max_positions[i, j, k]
                    dL_dinput[max_i, max_j, k] = dL_dout[i, j, k]
        return dL_dinput
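A quick shape check of the layer, using a random 26 × 26 × 8 array like the one our Conv layer produces for a 28 × 28 image (the numbers are only illustrative):

pool = MaxPoolLayer(size=2)
x = np.random.randn(26, 26, 8)

out = pool.forward(x)                     # (13, 13, 8)
grad = pool.backward(np.ones_like(out))   # gradient routed back to the max positions
print(out.shape, grad.shape)              # (13, 13, 8) (26, 26, 8)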
ReLU Layer
Again, backpropagation in the ReLU layer stays the same as in an ANN. The gradient from the layer above is passed on as it is, but only to those positions where the input value was positive; everywhere else it is set to zero.
def backward(self, dL_dout):
    dL_dinput = dL_dout.copy()
    dL_dinput[self.input <= 0] = 0
    return dL_dinput
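For completeness, here is a minimal sketch of the whole layer, assuming a forward pass like the one from the earlier part of the series that simply stores its input:

class ReLULayer:
    def forward(self, input):
        self.input = input
        return np.maximum(0, input)

    def backward(self, dL_dout):
        dL_dinput = dL_dout.copy()
        dL_dinput[self.input <= 0] = 0   # no gradient where the input was not positive
        return dL_dinput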
Convolution Layer
Here too, we already have the derivative of the loss w.r.t. the output (dL/d_out).
In the setup below, X represents the input, K is the kernel which holds the weights we will be updating, B is the bias and Z represents the output. Now you might ask, "Hey, we didn't put any bias in our forward pass?", which is right, but for the sake of the derivation, let's just assume for a second that we did.
This lays a nice carpet for us to tinker with, because this is similar to the picture of regular neural networks which we have worked with before. We have dL/dZ and we need to find dL/dK and dL/dB so that we can adjust the filters and biases through gradient descent.
For ease of calculation, we will use a 3 × 3 input matrix, and a 2 × 2 kernel, since the results we derive from these calculations will be standard.
Firstly, we have:
$$X \circledast K + B = Z$$
which means,
$$\begin{bmatrix} X_{11} & X_{12} & X_{13} \\ X_{21} & X_{22} & X_{23} \\ X_{31} & X_{32} & X_{33} \end{bmatrix} \circledast \begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix} + B = \\ \begin{bmatrix} Z_{11} & Z_{12} \\ Z_{21} & Z_{22} \end{bmatrix}$$
which gives us 4 equations:
$$\begin{align} Z_{11} &= X_{11}K_{11} + X_{12}K_{12} + X_{21}K_{21} + X_{22}K_{22} + B \\ Z_{12} &= X_{12}K_{11} + X_{13}K_{12} + X_{22}K_{21} + X_{23}K_{22} + B \\ Z_{21} &= X_{21}K_{11} + X_{22}K_{12} + X_{31}K_{21} + X_{32}K_{22} + B \\ Z_{22} &= X_{22}K_{11} + X_{23}K_{12} + X_{32}K_{21} + X_{33}K_{22} + B \\ \end{align}$$
From chain rule, we know:
$$\frac{\partial L}{\partial K} = \frac{\partial L}{\partial Z}\cdot\frac{\partial Z}{\partial K}$$
But that was for single entities only. You can see that K is a kernel representing 4 different weights, so we will have to calculate the change in the loss function w.r.t. each of those 4 weights, which will make it look like this:
$$\frac{\partial L}{\partial K} = \begin{bmatrix} \frac{\partial L}{\partial k_{11}} & \frac{\partial L}{\partial k_{12}}\\ \frac{\partial L}{\partial k_{21}} & \frac{\partial L}{\partial k_{22}} \end{bmatrix}$$
Modifying the above equation:
$$\frac{\partial L}{\partial K_{mn}} = \frac{\partial L}{\partial Z}\cdot\frac{\partial Z}{\partial K_{mn}}$$
which if you think about it, seems incomplete or incorrect. That is because the output Z is also not a single entity.
While calculating the change in loss function w.r.t a particular weight, we will have to consider all those instances where any of the outputs is affected by that weight, because all of those outputs, in turn, will affect the loss function individually.
For example, for k_11 we will have:
$${\small \begin{align} \frac{\partial L}{\partial K_{11}} &= (\frac{\partial L}{\partial Z_{11}}\cdot\frac{\partial Z_{11}}{\partial K_{11}}) + (\frac{\partial L}{\partial Z_{12}}\cdot\frac{\partial Z_{12}}{\partial K_{11}}) + (\frac{\partial L}{\partial Z_{21}}\cdot\frac{\partial Z_{21}}{\partial K_{11}}) + (\frac{\partial L}{\partial Z_{22}}\cdot\frac{\partial Z_{22}}{\partial K_{11}}) \\ \\ \\ &= \sum_{j=1}^{W_{out}}\sum_{i=1}^{H_{out}} (\frac{\partial L}{\partial Z_{ij}}\cdot\frac{\partial Z_{ij}}{\partial K_{11}}) \end{align} }$$
where W_out and H_out are the width and height of the output respectively. We can replace K_11 with K_mn to generalise the formula, but before that, note that we already have all the dL/dZ_ij terms; they are simply the dL_dout values passed in from the layer above. What we still need is the second term in each product, which we can easily get from the four equations written above.
For k_11:
$$\begin{align} Z_{11} &= X_{11}K_{11} + X_{12}K_{12} + X_{21}K_{21} + X_{22}K_{22} + B \\ \\ \implies \frac{\partial Z_{11}}{\partial K_{11}} &= X_{11} \end{align}$$
which can be generalised for all other outputs as well. So:
$${\small \frac{\partial L}{\partial K_{11}} = (\frac{\partial L}{\partial Z_{11}}\cdot X_{11})+ (\frac{\partial L}{\partial Z_{12}}\cdot X_{12}) + (\frac{\partial L}{\partial Z_{21}}\cdot X_{21}) + (\frac{\partial L}{\partial Z_{22}}\cdot X_{22}) }$$
You can repeat the process to obtain similar results for the other three and this is what it will look like:
$${\small \begin{align} \frac{\partial L}{\partial K_{11}} &= (\frac{\partial L}{\partial Z_{11}}\cdot X_{11})+ (\frac{\partial L}{\partial Z_{12}}\cdot X_{12}) + (\frac{\partial L}{\partial Z_{21}}\cdot X_{21}) + (\frac{\partial L}{\partial Z_{22}}\cdot X_{22}) \\ \frac{\partial L}{\partial K_{12}} &= (\frac{\partial L}{\partial Z_{11}}\cdot X_{12})+ (\frac{\partial L}{\partial Z_{12}}\cdot X_{13}) + (\frac{\partial L}{\partial Z_{21}}\cdot X_{22}) + (\frac{\partial L}{\partial Z_{22}}\cdot X_{23}) \\ \frac{\partial L}{\partial K_{21}} &= (\frac{\partial L}{\partial Z_{11}}\cdot X_{21})+ (\frac{\partial L}{\partial Z_{12}}\cdot X_{22}) + (\frac{\partial L}{\partial Z_{21}}\cdot X_{31}) + (\frac{\partial L}{\partial Z_{22}}\cdot X_{32}) \\ \frac{\partial L}{\partial K_{22}} &= (\frac{\partial L}{\partial Z_{11}}\cdot X_{22})+ (\frac{\partial L}{\partial Z_{12}}\cdot X_{23}) + (\frac{\partial L}{\partial Z_{21}}\cdot X_{32}) + (\frac{\partial L}{\partial Z_{22}}\cdot X_{33}) \end{align} }$$
Seems overwhelming? How about this?
$$\begin{bmatrix} X_{11} & X_{12} & X_{13} \\ X_{21} & X_{22} & X_{23} \\ X_{31} & X_{32} & X_{33} \end{bmatrix} \circledast \begin{bmatrix} \frac{\partial L}{\partial Z_{11}} & \frac{\partial L}{\partial Z_{12}} \\ \frac{\partial L}{\partial Z_{21}} & \frac{\partial L}{\partial Z_{22}} \end{bmatrix} = \\ \begin{bmatrix} \frac{\partial L}{\partial K_{11}} & \frac{\partial L}{\partial K_{12}} \\ \frac{\partial L}{\partial K_{21}} & \frac{\partial L}{\partial K_{22}} \end{bmatrix}$$
Yes. The above four equations can be summarised with a single convolution operation. If you look again at the equations, you will see that they are not much different from the initial four; only the positions of the K terms and the Z terms have been swapped by our reasoning, so it is natural for them to follow this pattern.
It can also be written as:
$$\frac{\partial L}{\partial K} = conv(X , \frac{\partial L}{\partial Z})$$
We can use the same equations to calculate the partial derivative w.r.t Biases as well and it will result in this:
$$\frac{\partial L}{\partial B} = sum(\frac{\partial L}{\partial Z})$$
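If you want to convince yourself of these two formulas numerically, here is a small check that computes dL/dK both as a valid convolution of X with dL/dZ and directly from the element-wise equation for K_11 (the random values are purely for illustration):

import numpy as np

X = np.random.randn(3, 3)        # input
dL_dZ = np.random.randn(2, 2)    # error signal from the layer above

# dL/dK as a "valid" convolution of X with dL/dZ
dL_dK = np.zeros((2, 2))
for m in range(2):
    for n in range(2):
        dL_dK[m, n] = np.sum(X[m:m+2, n:n+2] * dL_dZ)

# The same quantity written out from the equation for K_11
dL_dK11 = (dL_dZ[0, 0] * X[0, 0] + dL_dZ[0, 1] * X[0, 1]
           + dL_dZ[1, 0] * X[1, 0] + dL_dZ[1, 1] * X[1, 1])

print(np.isclose(dL_dK[0, 0], dL_dK11))   # True
print(np.sum(dL_dZ))                      # dL/dB, just the sum of the error signal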
This completes half of our work. The other half is calculating the change in loss w.r.t. the input values. For that, again, we will use the same logic:
find all the instances where an output is affected by that particular input, multiply the corresponding error signals by the weights involved, and add them up. For example, let's take X_11:
$$\begin{align} \frac{\partial L}{\partial X_{11}} &= \frac{\partial L}{\partial Z_{11}}\cdot\frac{\partial Z_{11}}{\partial X_{11}} \\ \\ \implies \frac{\partial L}{\partial X_{11}} &= \frac{\partial L}{\partial Z_{11}}\cdot K_{11} \end{align}$$
because Z_11 is the only output which is being affected by the input X_11. But this isn’t the case with X_12. It affects both Z_11 and Z_12. Doing this, you will get these equations:
$${\small \begin{align} \frac{\partial L}{\partial X_{11}} &= (\frac{\partial L}{\partial Z_{11}}\cdot K_{11}) \\ \frac{\partial L}{\partial X_{12}} &= (\frac{\partial L}{\partial Z_{11}}\cdot K_{12}) + (\frac{\partial L}{\partial Z_{12}} \cdot K_{11}) \\ \frac{\partial L}{\partial X_{13}} &= (\frac{\partial L}{\partial Z_{12}}\cdot K_{12}) \\ \\ \frac{\partial L}{\partial X_{21}} &= (\frac{\partial L}{\partial Z_{11}}\cdot K_{21}) + (\frac{\partial L}{\partial Z_{21}}\cdot K_{11}) \\ \frac{\partial L}{\partial X_{22}} &= (\frac{\partial L}{\partial Z_{11}}\cdot K_{22}) + (\frac{\partial L}{\partial Z_{12}} \cdot K_{21}) + (\frac{\partial L}{\partial Z_{21}} \cdot K_{12}) + (\frac{\partial L}{\partial Z_{22}} \cdot K_{11}) \\ \frac{\partial L}{\partial X_{23}} &= (\frac{\partial L}{\partial Z_{12}}\cdot K_{22}) + (\frac{\partial L}{\partial Z_{22}}\cdot K_{12}) \\ \\ \frac{\partial L}{\partial X_{31}} &= (\frac{\partial L}{\partial Z_{21}}\cdot K_{21}) \\ \frac{\partial L}{\partial X_{32}} &= (\frac{\partial L}{\partial Z_{21}}\cdot K_{22}) + (\frac{\partial L}{\partial Z_{22}} \cdot K_{21}) \\ \frac{\partial L}{\partial X_{33}} &= (\frac{\partial L}{\partial Z_{22}}\cdot K_{22}) \end{align} }$$
But who likes this? As you can already see, it follows a similar pattern to the previous case, so there ought to be something we can leverage to express this in terms of a convolution.
The difference between the previous case and this one is that, before, the sizes of the two operand matrices were in our favour. Here, even if we try to brute-force the same thing, we cannot hope to achieve any desirable result, because the kernel and the error-signal matrix have the same size, 2 × 2, and a valid convolution of two 2 × 2 matrices gives only a 1 × 1 output instead of the 3 × 3 we need.
So we apply a padding of 1 around the error-signal matrix, which makes it 4 × 4, and since 4 − 2 + 1 = 3, the convolution now gives a 3 × 3 matrix. This is good. But there's still a catch.
If you attempt the convolution as it is, you get the wrong result for the very first equation: the top-left window of the padded matrix contains only ∂L/∂Z_11, and it lines up with K_22, whereas we need it to meet K_11. How can we achieve that? How about we rotate the filter matrix by 180°?
Now it works. You can slide the rotated filter over the rest of the padded matrix and you will obtain the exact same set of equations which we had.
To summarise what we did:
$$\frac{\partial L}{\partial X} = conv(\space padded(\frac{\partial L}{\partial Z}) , 180^{\circ} \space rotated \space (K) \space )$$
But wait. Why did this happen? We didn't do any rotation for the filter gradient, so why do we need it for the input gradient? Well, there's a simple answer.
Intuition
The difference arises from the roles of Kernel and Input matrices:
Input (X): During backpropagation, the error signal must "spread back" to the input. This reversal in the direction of convolution requires the 180° rotation of K to ensure proper alignment.
Filter/Kernel (K): The filter weights are learned by summing up contributions from the input regions corresponding to each error signal position. The spatial alignment between K and X remains consistent with the forward pass, so no rotation is necessary.
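Before implementing this, a small numerical check can make the padded-and-rotated convolution concrete. The sketch below pads dL/dZ with one ring of zeros, rotates K by 180°, performs a valid convolution, and compares one entry against the corresponding hand-written equation (all values are random and purely illustrative):

import numpy as np

K = np.random.randn(2, 2)        # kernel
dL_dZ = np.random.randn(2, 2)    # error signal from the layer above

padded = np.pad(dL_dZ, 1)        # 4 x 4, one ring of zeros around dL/dZ
K_rot = np.rot90(K, 2)           # kernel rotated by 180 degrees

# Full convolution expressed as a valid convolution over the padded error signal
dL_dX = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        dL_dX[i, j] = np.sum(padded[i:i+2, j:j+2] * K_rot)

# Compare with the hand-written equation for X_12
dL_dX12 = dL_dZ[0, 0] * K[0, 1] + dL_dZ[0, 1] * K[0, 0]
print(np.isclose(dL_dX[0, 1], dL_dX12))   # True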
With this, we have finally completed our backward pass. Now it’s time to implement this. We won’t need the equation for updating the biases since we haven’t used any.
def backward(self, dL_dout):
    dL_dinput = np.zeros(self.input.shape)
    dL_dfilters = np.zeros(self.filters.shape)
    for region, i, j in self.iterate_regions(self.input):
        for f in range(self.num_filters):
            # dL/dK: accumulate the error signal times the input region
            dL_dfilters[f] += dL_dout[i, j, f] * region
            # dL/dX: scatter the error signal times the filter back onto the input region
            # (equivalent to the padded convolution with the 180-degree rotated filter)
            dL_dinput[i:i+self.filter_size, j:j+self.filter_size] += dL_dout[i, j, f] * self.filters[f]
    # Update the filters
    self.filters -= self.lr * dL_dfilters
    return dL_dinput
This is the block of code which we will be adding to our previous implementation of the Convolutional Layer, so it will finally look like this:
class ConvLayer:
    def __init__(self, num_filters, filter_size, lr=0.01):
        self.num_filters = num_filters
        self.filter_size = filter_size
        self.filters = np.random.randn(num_filters, filter_size, filter_size) / 9
        self.lr = lr

    def iterate_regions(self, image):
        h, w = image.shape
        for i in range(h - self.filter_size + 1):
            for j in range(w - self.filter_size + 1):
                region = image[i:(i + self.filter_size), j:(j + self.filter_size)]
                yield region, i, j

    def forward(self, input):
        self.input = input
        h, w = input.shape
        output = np.zeros((h - self.filter_size + 1, w - self.filter_size + 1, self.num_filters))
        for region, i, j in self.iterate_regions(input):
            output[i, j] = np.sum(region * self.filters, axis=(1, 2))
        return output

    def backward(self, dL_dout):
        dL_dinput = np.zeros(self.input.shape)
        dL_dfilters = np.zeros(self.filters.shape)
        for region, i, j in self.iterate_regions(self.input):
            for f in range(self.num_filters):
                dL_dfilters[f] += dL_dout[i, j, f] * region
                dL_dinput[i:i+self.filter_size, j:j+self.filter_size] += dL_dout[i, j, f] * self.filters[f]
        # Update the filters
        self.filters -= self.lr * dL_dfilters
        return dL_dinput
The only change, other than implementing the equations, is to include the learning rate for updating the filters.
Now all that remains is to combine all of this and test it on a sample image.
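As a preview of that combination, here is a minimal sketch that only checks the shapes flowing through the layers we built, using a random 28 × 28 array in place of a real image and a dummy gradient in place of a real loss gradient (ReLULayer is the small sketch from above; the actual loss function and training loop are left for the next part):

conv = ConvLayer(num_filters=8, filter_size=3)
relu = ReLULayer()
pool = MaxPoolLayer(size=2)
fc = FullyConnectedLayer(input_size=13 * 13 * 8, num_classes=10)

image = np.random.randn(28, 28)

# Forward pass
out = conv.forward(image)            # (26, 26, 8)
out = relu.forward(out)
out = pool.forward(out)              # (13, 13, 8)
scores = fc.forward(out.flatten())   # (10,)

# Backward pass with a dummy gradient standing in for dL/d(scores)
dL_dout = np.random.randn(1, 10)
grad = fc.backward(dL_dout)
grad = pool.backward(grad.reshape(13, 13, 8))
grad = relu.backward(grad)
grad = conv.backward(grad)
print(scores.shape, grad.shape)      # (10,) (28, 28)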
References
If you liked this, or want a more visual understanding, don't forget to check out the videos by Learn with Jay. They helped me a lot, and this explanation is based on them.