Exploring Multi-Layer Perceptrons: A Study of Their Matrix Structure

Amir Sakib Saad
4 min read

What is a Multi-Layer Perceptron (MLP)?

A multi-layer perceptron (MLP) is an artificial neural network (ANN) built from multiple layers of neurons. It is often described as the basic building block of deep learning.

An MLP usually has three types of layers:

  1. Input Layer: A layer that accepts data. Each of the input features is represented by a neuron.

  2. Hidden Layer: This sits between the input and output layers and is where the main mathematical processing takes place; a network is called a 'deep neural network' only when it has multiple hidden layers. Each neuron combines its inputs using weights (W) and biases (b) and produces an output through an activation function such as ReLU (max(0, z)), sigmoid (range 0 to 1), or tanh (range -1 to 1). A short sketch of these activation functions appears after this list.

  3. Output Layer: This is the last layer, which gives the final result known as y_prediction.
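To make these activation functions concrete, here is a minimal Python sketch (assuming NumPy is available; the sample values are made up purely for illustration):

```python
import numpy as np

def relu(z):
    # ReLU: max(0, z), applied element-wise
    return np.maximum(0, z)

def sigmoid(z):
    # Sigmoid: squashes z into the range (0, 1)
    return 1 / (1 + np.exp(-z))

def tanh(z):
    # Tanh: squashes z into the range (-1, 1)
    return np.tanh(z)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))     # [0. 0. 3.]
print(sigmoid(z))  # approx. [0.119 0.5   0.953]
print(tanh(z))     # approx. [-0.964 0.    0.995]
```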

How does it work?

Each neuron computes the weighted sum of its inputs, then applies an activation function (such as ReLU, sigmoid, or tanh) to make a decision. The output is then sent to the next layer. The entire network works together in a feedforward manner.

Then, if the output is incorrect, the error is calculated and the weights and biases are updated using a method called backpropagation. The main goal of backpropagation is to minimize this error.
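To make the feedforward and backpropagation steps concrete, here is a minimal, hypothetical NumPy sketch of one training step for a tiny MLP with one hidden layer. The shapes, the sample data, the squared-error loss, and the learning rate are all illustrative choices, not part of the explanation above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny MLP: 4 inputs -> 3 hidden neurons (sigmoid) -> 1 output (sigmoid)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, 0.1, 0.9, 0.3])  # one input sample
y = np.array([1.0])                 # its target value

# Feedforward: weighted sum, then activation, layer by layer
z1 = x @ W1 + b1
a1 = sigmoid(z1)
z2 = a1 @ W2 + b2
y_pred = sigmoid(z2)

# Backpropagation for a squared-error loss: push the error backwards
error = y_pred - y
d_z2 = error * y_pred * (1 - y_pred)   # gradient at the output layer
d_W2 = np.outer(a1, d_z2)
d_z1 = (d_z2 @ W2.T) * a1 * (1 - a1)   # gradient at the hidden layer
d_W1 = np.outer(x, d_z1)

# Gradient-descent update: nudge weights and biases to reduce the error
lr = 0.1
W2 -= lr * d_W2
b2 -= lr * d_z2
W1 -= lr * d_W1
b1 -= lr * d_z1
```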

Usage

MLPs are used in image recognition (often as part of convolutional neural networks, or CNNs), natural language processing (NLP), audio perception, and general classification or regression problems. Although it is a simple neural network, an MLP can become much more powerful with the right hyperparameters, data, and training.

In short, a multilayer perceptron is a neural network that can analyze data through multiple layers and make predictions.

Prediction:

$$\sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$

Here, x is the input data, and w and b are the assigned weights and bias, respectively. This network uses a sigmoid activation function. Remember that using the sigmoid activation function in the hidden layers can cause the vanishing gradient problem; ReLU is commonly used instead to overcome that drawback. Below is a small numerical sketch of this single-neuron prediction; after that, let's write layer 1 in matrix form for four inputs (x_A, x_B, x_C, x_D) and three hidden neurons.
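As a small numerical sketch of this prediction (the values of x, w, and b below are made up purely for illustration):

```python
import numpy as np

x = np.array([0.5, 0.1, 0.9, 0.3])   # input features
w = np.array([0.2, -0.4, 0.7, 0.1])  # weights
b = 0.05                             # bias

z = w @ x + b                 # weighted sum w^T x + b
y_hat = 1 / (1 + np.exp(-z))  # sigmoid activation
print(z, y_hat)               # z = 0.77, y_hat is roughly 0.68
```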

For Layer 1:

$$\begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ w_{41} & w_{42} & w_{43} \end{bmatrix}^T \begin{bmatrix} x_A \\ x_B \\ x_C \\ x_D \end{bmatrix} + \begin{bmatrix} b_{11} \\ b_{12} \\ b_{13} \end{bmatrix}$$

$$\begin{bmatrix} w_{11} & w_{21} & w_{31} & w_{41} \\ w_{12} & w_{22} & w_{32} & w_{42} \\ w_{13} & w_{23} & w_{33} & w_{43} \end{bmatrix} \begin{bmatrix} x_A \\ x_B \\ x_C \\ x_D \end{bmatrix} + \begin{bmatrix} b_{11} \\ b_{12} \\ b_{13} \end{bmatrix}$$

$$\begin{bmatrix} w_{11} x_A + w_{21} x_B + w_{31} x_C + w_{41} x_D \\ w_{12} x_A + w_{22} x_B + w_{32} x_C + w_{42} x_D \\ w_{13} x_A + w_{23} x_B + w_{33} x_C + w_{43} x_D \end{bmatrix} + \begin{bmatrix} b_{11} \\ b_{12} \\ b_{13} \end{bmatrix}$$

For simplicity, let's assume that

$$\begin{cases} w_{11} x_A + w_{21} x_B + w_{31} x_C + w_{41} x_D = \alpha_1 \\ w_{12} x_A + w_{22} x_B + w_{32} x_C + w_{42} x_D = \alpha_2 \\ w_{13} x_A + w_{23} x_B + w_{33} x_C + w_{43} x_D = \alpha_3 \end{cases}$$

$$\begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{bmatrix} + \begin{bmatrix} b_{11} \\ b_{12} \\ b_{13} \end{bmatrix}$$

$$\begin{bmatrix} \alpha_1 + b_{11} \\ \alpha_2 + b_{12} \\ \alpha_3 + b_{13} \end{bmatrix}$$

Applying the activation function, we get

$$\sigma\left( \begin{bmatrix} \alpha_1 + b_{11} \\ \alpha_2 + b_{12} \\ \alpha_3 + b_{13} \end{bmatrix} \right) = \begin{bmatrix} \sigma(\alpha_1 + b_{11}) \\ \sigma(\alpha_2 + b_{12}) \\ \sigma(\alpha_3 + b_{13}) \end{bmatrix}$$

$$\begin{bmatrix} \gamma_1 \\ \gamma_2 \\ \gamma_3 \end{bmatrix}$$
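The layer-1 computation above maps directly to a few lines of NumPy. The weight, bias, and input values below are made up for illustration, but the shapes (a 4×3 weight matrix transposed against a 4-element input, plus a 3-element bias) follow the derivation exactly:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# W has one row per input (A..D) and one column per hidden neuron
W = np.array([[0.1, 0.4, 0.7],
              [0.2, 0.5, 0.8],
              [0.3, 0.6, 0.9],
              [0.4, 0.7, 1.0]])
x = np.array([1.0, 0.5, 0.2, 0.8])  # [x_A, x_B, x_C, x_D]
b = np.array([0.1, 0.2, 0.3])       # [b_11, b_12, b_13]

alpha = W.T @ x             # the weighted sums alpha_1, alpha_2, alpha_3
gamma = sigmoid(alpha + b)  # layer-1 activations gamma_1, gamma_2, gamma_3
print(alpha)
print(gamma)
```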

There are some advantages and disadvantages to using the sigmoid function in the hidden layers. As an advantage, it introduces non-linearity, which helps the network learn complex patterns, and it limits the output to the range 0 to 1, which makes interpretation easier. The disadvantage is that sigmoid can lead to the vanishing gradient problem, which slows down learning and reduces performance in deep networks. Therefore, other activation functions such as ReLU are more commonly used in modern networks.
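A quick way to see the vanishing gradient problem mentioned above is to note that the sigmoid's derivative never exceeds 0.25, so backpropagation, which multiplies such factors layer by layer, shrinks the gradient very quickly in deep networks. The sketch below is only illustrative and assumes the best case (z = 0 at every layer):

```python
import numpy as np

def sigmoid_grad(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)  # maximum value is 0.25, reached at z = 0

# Even in the best case, the backpropagated factor shrinks geometrically
# with depth: 0.25 ** depth
for depth in (1, 5, 10, 20):
    print(depth, sigmoid_grad(0.0) ** depth)
# depth 10 -> about 9.5e-07, depth 20 -> about 9.1e-13
```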
