Improving the Performance of Machine Learning Models - Part 2

Yusuf Olaniyi
10 min read

Introduction


Aside from understanding the regularization techniques used to address the issue of high variance or overfitting in machine learning algorithms, it is essential to explore optimization algorithms that enhance the performance and convergence of these models. Optimization algorithms help in finding the best set of parameters that minimize the loss function and improve the overall performance of the model.

In this section, we will delve into some popular optimization algorithms, their principles, and how they contribute to improving the training process and enhancing the model's predictive capabilities. If you are yet to read my previous post on various regularization techniques, I recommend you read it here.

Gradient Descent

Before the era of big data, we could easily train a neural network on an entire dataset of, say, 1,000 examples. However, with the increase in the amount of data generated daily, there is a need for a faster approach to training neural networks. To solve this problem, we can use mini-batch gradient descent.

What Is Mini-Batch Gradient Descent?

Mini-batch gradient descent is a variant of gradient descent that splits the training dataset into small batches, which are used to calculate the model error and update the model coefficients. There are basically two steps involved:

  1. Splitting the dataset into smaller portions called mini-batches, say m of them.

  2. Implementing forward and backward propagation on each mini-batch independently.

What this means is that if I have, say, 5 million rows of data, I can split them into one thousand (1,000) mini-batches of 5,000 rows each, which will train faster while still letting the algorithm learn the unique attributes of each example.

Note:

  1. When the mini-batch size is one (1), each example is its own mini-batch; this is termed stochastic gradient descent.

  2. When the mini-batch size equals the number of rows in the data, you have just one mini-batch, which is the same as training on the entire training set at once. This is called batch gradient descent.

  3. The size of each mini-batch depends on the available resources. A minimal sketch of the splitting step follows these notes.
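To make the splitting step concrete, here is a minimal sketch in Python; the dataset, batch size, and shapes are illustrative assumptions rather than values from any particular project.

import numpy as np

# Illustrative toy dataset: 5,000 examples with 3 features each
X = np.random.randn(5000, 3)
y = np.random.randn(5000)
batch_size = 64  # chosen based on available resources

# Shuffle once per epoch so mini-batches are not biased by data order
perm = np.random.permutation(len(X))
X_shuffled, y_shuffled = X[perm], y[perm]

for start in range(0, len(X), batch_size):
    X_batch = X_shuffled[start:start + batch_size]
    y_batch = y_shuffled[start:start + batch_size]
    # forward and backward propagation would run here on (X_batch, y_batch)
    print(X_batch.shape, y_batch.shape)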

Exponentially Weighted Average

The exponentially weighted average is a smoothing technique commonly used in time series analysis, similar to the moving average. This mathematical tool serves as the basic building block for the optimization algorithms we will discuss shortly.

To understand this better, consider the heat energy generated during a particular operation at a particular time to be represented as E = f(θ, t). This implies that the amount of heat energy at a particular time depends on the temperature at that time. To apply an exponentially weighted average and create a smooth curve, we use the general equation:

Et = β * Et-1 + (1 - β) * θt

Where:

Et-1 = heat energy at the previous time step (here, the previous day)

θt = temperature at time t

Et = heat energy at the present time

β = a constant that determines how many time intervals are effectively averaged, roughly the last 1/(1 - β) of them (when β is 0.9 we are averaging over approximately the last 10 days, while 0.98 covers approximately the last 50 days)

To understand this, consider the example below:

| Day | θt  | Et    |
| --- | --- | ----- |
| 1   | –   | 17.20 |
| 2   | 4.9 | 15.97 |
| 3   | 5.2 | 14.89 |
| 4   | 5.8 | 13.98 |
| 5   | 6.0 | 13.18 |
| 6   | 6.2 | 12.48 |

For simplicity, I am assuming β to be 0.9, taking E1 = 17.2 as the given starting value, and rounding each result to two decimal places.

Let’s compute:

E6 = 0.9 * E5 + (1 - 0.9) * θ6, i.e. E6 = 0.9 * E5 + 0.1 * θ6

E5 = 0.9 * E4 + 0.1 * θ5

E4 = 0.9 * E3 + 0.1 * θ4

E3 = 0.9 * E2 + 0.1 * θ3

E2 = 0.9 * E1 + 0.1 * θ2

Therefore we have,

E2 = 0.9 * 17.2 + 0.1 * 4.9 = 15.97

E3 = 0.9 * 15.97 + 0.1 * 5.2 ≈ 14.89

E4 = 0.9 * 14.89 + 0.1 * 5.8 ≈ 13.98

E5 = 0.9 * 13.98 + 0.1 * 6.0 ≈ 13.18

E6 = 0.9 * 13.18 + 0.1 * 6.2 ≈ 12.48

It’s important to note that the values used are chosen arbitrarily; the aim is simply for you to understand the methodology.
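Here is a minimal sketch of the same computation in Python, using the toy temperatures above and rounding each step to two decimal places as in the hand calculation.

# Temperatures θ for days 2 to 6; E1 = 17.2 is the given starting value
temps = [4.9, 5.2, 5.8, 6.0, 6.2]
beta = 0.9

E = 17.2  # E1
for theta in temps:
    # Exponentially weighted average, rounded as in the table above
    E = round(beta * E + (1 - beta) * theta, 2)
    print(E)  # prints 15.97, 14.89, 13.98, 13.18, 12.48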

Using the Exponentially Weighted Average

The exponentially weighted average (EWA) smooths time series data by giving more weight to recent observations and gradually reducing the influence of older data points. This adaptability makes it valuable for capturing changing patterns. It is widely applied in optimization algorithms like gradient descent with momentum, RMSprop, and the Adam optimizer. EWA enables these algorithms to navigate complex parameter spaces, converge faster, and enhance overall performance in machine learning and data analysis tasks.

A notable refinement is bias correction. Because the running average is typically initialized at zero, the early values of Et are biased toward zero; dividing Et by (1 - β^t), where t is the current time step, corrects this so that early estimates better reflect the underlying data. We will meet this same correction again in the Adam optimizer.
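As a small illustration (assuming a running average initialized at zero and the same toy temperatures), dividing by (1 - β^t) pulls the early, zero-biased estimates back toward the actual readings.

beta = 0.9
E = 0.0  # running average initialized at zero
for t, theta in enumerate([4.9, 5.2, 5.8], start=1):
    E = beta * E + (1 - beta) * theta
    E_corrected = E / (1 - beta**t)  # removes the bias toward zero at small t
    print(round(E, 2), round(E_corrected, 2))
    # raw E:        0.49, 0.96, 1.44  (biased toward zero)
    # corrected E:  4.90, 5.06, 5.33  (tracks the actual temperatures)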

Local Minimum and Global Minimum

In deep learning, a local minimum is a point in the parameter space where the loss function reaches a relatively low value compared to neighboring points. It signifies a potential stopping point for the optimization algorithm, as the gradient becomes zero. However, a local minimum does not guarantee the lowest value across the entire parameter space, as there may exist other points with even lower errors, including the global minimum. The global minimum represents the optimal point in the parameter space where the loss function reaches its lowest possible value, signifying the desired outcome of the optimization process.

The exponentially weighted average is employed in various optimization algorithms, including those used to train neural network models, to address challenges associated with the optimization process. By computing the exponentially weighted average, the optimizer smooths out parameter updates during training (backward propagation). This helps prevent overshooting the global minimum and aids in escaping local minima. The smoothing effect enables the optimizer to navigate the loss landscape more effectively, leading to stable and consistent training. By reducing abrupt changes in parameter values, an exponentially weighted average facilitates convergence to the global minimum of the loss function, enhancing the training process.
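As a small numerical illustration (with made-up gradient values), the exponentially weighted average damps the oscillations of a noisy gradient sequence while preserving its overall direction.

# Illustrative noisy gradients oscillating around a mean of roughly 1.0
grads = [1.8, 0.3, 1.6, 0.2, 1.9, 0.4]
beta = 0.9

V = 0.0
for g in grads:
    V = beta * V + (1 - beta) * g
    # V stays far less jumpy than the raw gradients and trends toward their mean
    print(round(V, 3))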


Gradient Descent with Momentum

Gradient Descent with Momentum is an optimization algorithm that enhances the standard gradient descent method by incorporating momentum. It introduces an additional term that accumulates the previous gradients to smooth and stabilize the steps taken during weight updates. This momentum term helps accelerate the convergence process, especially in scenarios with high curvature or noisy gradients.

The algorithm is as follows:

  1. Vdw = β1 * Vdw + (1 - β1) * dw

  2. Vdb = β1 * Vdb + (1 - β1) * db

  3. w = w - α * Vdw

  4. b = b - α * Vdb

The code implementation is as follows, where β1 is beta and α is learning_rate (the Vdw and Vdb on the right-hand sides are the values carried over from the previous iteration).

import numpy as np

# Toy 1-D regression data (illustrative values so the example is runnable)
np.random.seed(0)
X = np.random.randn(100)
y = 3.0 * X + 2.0 + 0.1 * np.random.randn(100)

# Initialize variables
w = 0.5               # Initial weight
b = -1.0              # Initial bias
learning_rate = 0.1   # Learning rate
beta = 0.9            # Momentum hyperparameter
num_iterations = 100  # Number of iterations

# Initialize velocity
Vdw = 0.0  # Initial velocity for weight
Vdb = 0.0  # Initial velocity for bias

# Perform gradient descent with momentum
for i in range(num_iterations):
    # Compute predictions
    y_pred = w * X + b

    # Compute gradients, assuming the loss function is mean squared error (MSE)
    dw = (2 / len(X)) * np.sum((y_pred - y) * X)
    db = (2 / len(X)) * np.sum(y_pred - y)

    # Update velocities (exponentially weighted averages of the gradients)
    Vdw = beta * Vdw + (1 - beta) * dw
    Vdb = beta * Vdb + (1 - beta) * db

    # Update weight and bias using the smoothed gradients
    w = w - learning_rate * Vdw
    b = b - learning_rate * Vdb

RMSprop

RMSprop (Root Mean Squared Propagation) is an optimization algorithm that builds on the same ideas as mini-batch gradient descent and gradient descent with momentum. It addresses the challenges of optimizing neural networks on large datasets by performing updates on smaller mini-batches of data instead of the entire dataset. Instead of averaging the gradients themselves, however, RMSprop keeps a rolling (exponentially weighted) average of the squared gradients and uses it to scale each update.

To avoid confusion with the Adam optimizer later, I will use S instead of V. The algorithm is as follows.

For each mini-batch:

  1. Compute dw, db

  2. Sdw = β2 * Sdw + (1 - β2) * dw²

  3. Sdb = β2 * Sdb + (1 - β2) * db²

  4. w = w - α * dw / (√Sdw + ε)

  5. b = b - α * db / (√Sdb + ε)

We add ε to avoid a division-by-zero error when Sdw or Sdb is zero, and it's advisable to set ε to 10⁻⁸.

The algorithm is then applied to every mini-batch, and our hyperparameter choices include β2, α, and ε.

The implementation is as follows:

import numpy as np

# Toy 1-D regression data (illustrative values so the example is runnable)
np.random.seed(0)
X = np.random.randn(100)
y = 3.0 * X + 2.0 + 0.1 * np.random.randn(100)

# Initialize variables
w = 0.5               # Initial weight
b = -1.0              # Initial bias
learning_rate = 0.1   # Learning rate
beta2 = 0.999         # RMSprop hyperparameter
epsilon = 1e-8        # Small constant to avoid division by zero
num_iterations = 100  # Number of iterations

# Initialize squared gradients
Sdw = 0.0  # Initial squared gradient for weight
Sdb = 0.0  # Initial squared gradient for bias

# Perform RMSprop optimization
for i in range(num_iterations):
    # Compute predictions
    y_pred = w * X + b

    # Compute gradients, assuming the loss function is mean squared error (MSE)
    dw = (2 / len(X)) * np.sum((y_pred - y) * X)
    db = (2 / len(X)) * np.sum(y_pred - y)

    # Update the rolling averages of the squared gradients
    Sdw = beta2 * Sdw + (1 - beta2) * dw**2
    Sdb = beta2 * Sdb + (1 - beta2) * db**2

    # Update weight and bias, scaling each gradient by its root mean square
    w = w - (learning_rate * dw) / (np.sqrt(Sdw) + epsilon)
    b = b - (learning_rate * db) / (np.sqrt(Sdb) + epsilon)

Adam Optimization

Adam (Adaptive Moment Estimation) is an optimization algorithm that combines the concepts of momentum and RMSProp to achieve efficient and adaptive parameter updates. Similar to RMSProp, Adam performs updates on mini-batches of data and incorporates both momentum and squared gradients to adjust the learning rate accordingly.

The algorithm is as follows:

  1. Vdw = β1 * Vdw + (1 - β1) * dw

  2. Vdb = β1 * Vdb + (1 - β1) * db

  3. Sdw = β2 * Sdw + (1 - β2) * dw²

  4. Sdb = β2 * Sdb + (1 - β2) * db²

  5. Vcorr_dw = Vdw / (1 - β1^t)

  6. Vcorr_db = Vdb / (1 - β1^t)

  7. Scorr_dw = Sdw / (1 - β2^t)

  8. Scorr_db = Sdb / (1 - β2^t)

  9. w = w - α * Vcorr_dw / (√Scorr_dw + ε)

  10. b = b - α * Vcorr_db / (√Scorr_db + ε)

where t is the iteration counter used for bias correction.

Here's a Python code implementation of the Adam optimization algorithm using the equations above:

import numpy as np

# Toy 1-D regression data (illustrative values so the example is runnable)
np.random.seed(0)
X = np.random.randn(100)
y = 3.0 * X + 2.0 + 0.1 * np.random.randn(100)

# Initialize variables
w = 0.5               # Initial weight
b = -1.0              # Initial bias
learning_rate = 0.1   # Learning rate
beta1 = 0.9           # Momentum hyperparameter
beta2 = 0.999         # RMSprop hyperparameter
epsilon = 1e-8        # Small constant to avoid division by zero
num_iterations = 100  # Number of iterations

# Initialize moments and squared gradients
Vdw = 0.0  # Initial moment for weight
Vdb = 0.0  # Initial moment for bias
Sdw = 0.0  # Initial squared gradient for weight
Sdb = 0.0  # Initial squared gradient for bias

# Perform Adam optimization; t starts at 1 so the bias correction
# never divides by zero
for t in range(1, num_iterations + 1):
    # Compute predictions
    y_pred = w * X + b

    # Compute gradients, assuming the loss function is mean squared error (MSE)
    dw = (2 / len(X)) * np.sum((y_pred - y) * X)
    db = (2 / len(X)) * np.sum(y_pred - y)

    # Update moments (exponentially weighted averages of the gradients)
    Vdw = beta1 * Vdw + (1 - beta1) * dw
    Vdb = beta1 * Vdb + (1 - beta1) * db

    # Update rolling averages of the squared gradients
    Sdw = beta2 * Sdw + (1 - beta2) * dw**2
    Sdb = beta2 * Sdb + (1 - beta2) * db**2

    # Bias correction for moments
    Vcorrdw = Vdw / (1 - beta1**t)
    Vcorrdb = Vdb / (1 - beta1**t)

    # Bias correction for squared gradients
    Scorrdw = Sdw / (1 - beta2**t)
    Scorrdb = Sdb / (1 - beta2**t)

    # Update weight and bias
    w = w - (learning_rate * Vcorrdw) / (np.sqrt(Scorrdw) + epsilon)
    b = b - (learning_rate * Vcorrdb) / (np.sqrt(Scorrdb) + epsilon)

Conclusion

I would like to emphasize that the knowledge and insights shared in this series are based on the valuable teachings of Andrew Ng's Deep Learning course available on Coursera. If you're interested in delving deeper into the subject, I highly recommend taking the course, as it offers a comprehensive and insightful learning experience.

Thank you for taking the time to read through this series. I appreciate your support, and I would love to hear your feedback. Please feel free to like, comment, share, and subscribe for more informative content. You can also connect with me via LinkedIn, Twitter, or Email me at olaniyiyusuf2000@gmail.com for job recommendations, projects, and potential collaborations.
