Exploring Multi-Layer Perceptrons: A Study of Their Matrix Structure

Amir Sakib Saad
4 min read

What is a Multi-Layer Perceptron (MLP)?

A multi-layer perceptron (MLP) is an artificial neural network (ANN) built from multiple layers of neurons. It is often described as the basic building block of deep learning.

An MLP usually has three types of layers:

  1. Input Layer: A layer that accepts data. Each of the input features is represented by a neuron.

  2. Hidden Layer: This sits between the input and output layers and is where the main mathematical processing takes place; a network is called a 'deep neural network' only when it has multiple hidden layers. Each neuron combines its inputs using weights (W) and biases (b) and produces an output through an activation function such as ReLU (max(0, z)), sigmoid (range 0 to 1), or tanh (range -1 to 1). A short sketch of these activation functions appears after this list.

  3. Output Layer: This is the last layer, which gives the final result known as y_prediction.
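To make these activation functions concrete, here is a minimal Python sketch (assuming NumPy is available; the sample values are made up purely for illustration):

```python
import numpy as np

def relu(z):
    # ReLU: max(0, z), applied element-wise
    return np.maximum(0, z)

def sigmoid(z):
    # Sigmoid: squashes z into the range (0, 1)
    return 1 / (1 + np.exp(-z))

def tanh(z):
    # Tanh: squashes z into the range (-1, 1)
    return np.tanh(z)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))     # [0. 0. 3.]
print(sigmoid(z))  # approx. [0.119 0.5   0.953]
print(tanh(z))     # approx. [-0.964 0.    0.995]
```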

How does it work?

Each neuron computes the weighted sum of its inputs, then applies an activation function (such as ReLU, sigmoid, or tanh) to make a decision. The output is then sent to the next layer. The entire network works together in a feedforward manner.

Then, if the output is incorrect, the error is calculated and the weights and biases are updated using a method called backpropagation. The main goal of backpropagation is to minimize this error.
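To make the feedforward and backpropagation steps concrete, here is a minimal, hypothetical NumPy sketch of one training step for a tiny MLP with one hidden layer. The shapes, the sample data, the squared-error loss, and the learning rate are all illustrative choices, not part of the explanation above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny MLP: 4 inputs -> 3 hidden neurons (sigmoid) -> 1 output (sigmoid)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, 0.1, 0.9, 0.3])  # one input sample
y = np.array([1.0])                 # its target value

# Feedforward: weighted sum, then activation, layer by layer
z1 = x @ W1 + b1
a1 = sigmoid(z1)
z2 = a1 @ W2 + b2
y_pred = sigmoid(z2)

# Backpropagation for a squared-error loss: push the error backwards
error = y_pred - y
d_z2 = error * y_pred * (1 - y_pred)   # gradient at the output layer
d_W2 = np.outer(a1, d_z2)
d_z1 = (d_z2 @ W2.T) * a1 * (1 - a1)   # gradient at the hidden layer
d_W1 = np.outer(x, d_z1)

# Gradient-descent update: nudge weights and biases to reduce the error
lr = 0.1
W2 -= lr * d_W2
b2 -= lr * d_z2
W1 -= lr * d_W1
b1 -= lr * d_z1
```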

Usage

MLPs are used in image recognition (often as part of convolutional neural networks, or CNNs), natural language processing (NLP), audio perception, and general classification or regression problems. Although it is a simple neural network, an MLP can become much more powerful with the right hyperparameters, data, and training.

In short, a multilayer perceptron is a neural network that can analyze data through multiple layers and make predictions.

Prediction:

$$\sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$

Here, x is the input data, and w and b are the assigned weights and bias, respectively. This network uses a sigmoid activation function. Remember that using the sigmoid activation function in the hidden layers can cause the vanishing gradient problem; ReLU is commonly used instead to overcome that drawback. Below is a small numerical sketch of this single-neuron prediction; after that, let's write layer 1 in matrix form for four inputs (x_A, x_B, x_C, x_D) and three hidden neurons.
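As a small numerical sketch of this prediction (the values of x, w, and b below are made up purely for illustration):

```python
import numpy as np

x = np.array([0.5, 0.1, 0.9, 0.3])   # input features
w = np.array([0.2, -0.4, 0.7, 0.1])  # weights
b = 0.05                             # bias

z = w @ x + b                 # weighted sum w^T x + b
y_hat = 1 / (1 + np.exp(-z))  # sigmoid activation
print(z, y_hat)               # z = 0.77, y_hat is roughly 0.68
```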

For Layer 1:

$$\begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ w_{41} & w_{42} & w_{43} \end{bmatrix}^T \begin{bmatrix} x_A \\ x_B \\ x_C \\ x_D \end{bmatrix} + \begin{bmatrix} b_{11} \\ b_{12} \\ b_{13} \end{bmatrix}$$

$$\begin{bmatrix} w_{11} & w_{21} & w_{31} & w_{41} \\ w_{12} & w_{22} & w_{32} & w_{42} \\ w_{13} & w_{23} & w_{33} & w_{43} \end{bmatrix} \begin{bmatrix} x_A \\ x_B \\ x_C \\ x_D \end{bmatrix} + \begin{bmatrix} b_{11} \\ b_{12} \\ b_{13} \end{bmatrix}$$

$$\begin{bmatrix} w_{11} x_A + w_{21} x_B + w_{31} x_C + w_{41} x_D \\ w_{12} x_A + w_{22} x_B + w_{32} x_C + w_{42} x_D \\ w_{13} x_A + w_{23} x_B + w_{33} x_C + w_{43} x_D \end{bmatrix} + \begin{bmatrix} b_{11} \\ b_{12} \\ b_{13} \end{bmatrix}$$

For simplicity, let's assume that

$$\begin{cases} w_{11} x_A + w_{21} x_B + w_{31} x_C + w_{41} x_D = \alpha_1 \\ w_{12} x_A + w_{22} x_B + w_{32} x_C + w_{42} x_D = \alpha_2 \\ w_{13} x_A + w_{23} x_B + w_{33} x_C + w_{43} x_D = \alpha_3 \end{cases}$$

$$\begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{bmatrix} + \begin{bmatrix} b_{11} \\ b_{12} \\ b_{13} \end{bmatrix}$$

$$\begin{bmatrix} \alpha_1 + b_{11} \\ \alpha_2 + b_{12} \\ \alpha_3 + b_{13} \end{bmatrix}$$

Applying the activation function, we get

$$\sigma\left( \begin{bmatrix} \alpha_1 + b_{11} \\ \alpha_2 + b_{12} \\ \alpha_3 + b_{13} \end{bmatrix} \right) = \begin{bmatrix} \sigma(\alpha_1 + b_{11}) \\ \sigma(\alpha_2 + b_{12}) \\ \sigma(\alpha_3 + b_{13}) \end{bmatrix}$$

$$\begin{bmatrix} \gamma_1 \\ \gamma_2 \\ \gamma_3 \end{bmatrix}$$
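The layer-1 computation above maps directly to a few lines of NumPy. The weight, bias, and input values below are made up for illustration, but the shapes (a 4×3 weight matrix transposed against a 4-element input, plus a 3-element bias) follow the derivation exactly:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# W has one row per input (A..D) and one column per hidden neuron
W = np.array([[0.1, 0.4, 0.7],
              [0.2, 0.5, 0.8],
              [0.3, 0.6, 0.9],
              [0.4, 0.7, 1.0]])
x = np.array([1.0, 0.5, 0.2, 0.8])  # [x_A, x_B, x_C, x_D]
b = np.array([0.1, 0.2, 0.3])       # [b_11, b_12, b_13]

alpha = W.T @ x             # the weighted sums alpha_1, alpha_2, alpha_3
gamma = sigmoid(alpha + b)  # layer-1 activations gamma_1, gamma_2, gamma_3
print(alpha)
print(gamma)
```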

There are some advantages and disadvantages to using the sigmoid function in the hidden layers. As an advantage, it introduces non-linearity, which helps the network learn complex patterns, and it limits the output to the range 0 to 1, which makes interpretation easier. The disadvantage is that sigmoid can lead to the vanishing gradient problem, which slows down learning and reduces performance in deep networks. Therefore, other activation functions such as ReLU are more commonly used in modern networks.
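A quick way to see the vanishing gradient problem mentioned above is to note that the sigmoid's derivative never exceeds 0.25, so backpropagation, which multiplies such factors layer by layer, shrinks the gradient very quickly in deep networks. The sketch below is only illustrative and assumes the best case (z = 0 at every layer):

```python
import numpy as np

def sigmoid_grad(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)  # maximum value is 0.25, reached at z = 0

# Even in the best case, the backpropagated factor shrinks geometrically
# with depth: 0.25 ** depth
for depth in (1, 5, 10, 20):
    print(depth, sigmoid_grad(0.0) ** depth)
# depth 10 -> about 9.5e-07, depth 20 -> about 9.1e-13
```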
