Boltzmann Machine

We have hidden nodes and visible nodes.

Look at these visible nodes and you can see that they are all connected to each other.

Boltzmann machines are fundamentally different from the other algorithms we have seen, because they don't just expect input data.

**What they do is generate data. They generate information in all of these nodes.** A Boltzmann machine doesn't distinguish or discriminate between the visible nodes and the hidden nodes.

For a Boltzmann machine, this whole thing is one system, and it generates states of that system. The best way to think about it is through an example that Geoffrey Hinton once gave of a nuclear power plant.

So here we've got a diagram of a nuclear power plant, and as you can see it's quite a simplified diagram and normally there's lots and lots more.

There are lots of things we do measure. For example, we could be measuring how quickly this turbine is spinning. Another thing we could be measuring, not as important but still crucial to the operation of the power plant, is the pressure inside this pump, because if it's too high or too low that could mean a problem. Then we could be measuring how much electricity the plant is outputting. So there are lots of different things that we're measuring about the power plant, but at the same time there could be a lot of things that we're not measuring. For example, the speed of the wind. Or even if we are measuring the speed of the wind, we might not be measuring the moisture of the soil in this specific location or that one. Or we might not be measuring the thickness of the cooling tower wall at a height of 20 meters at this specific radial location. So there can be lots of different parameters of the nuclear power plant that we're not measuring, but all these parameters together form one system and they all work together, and that is what the Boltzmann machine represents.

The Boltzmann machine is a representation of a certain system, in our case a nuclear power plant. The visible nodes are the things that we can and do measure, and the hidden nodes are the things that we can't or don't measure.

The way this whole model works is that, instead of just waiting for us to input values, the Boltzmann machine is capable of generating all of the values, for all of the nodes, on its own. It doesn't need any inputs; it just generates parameters on its own. So basically it looks at one state, then it looks at another state, and it says: okay, what if the temperature is higher, the wind is faster, the pressure in the pump is lower, the moisture of the earth is lower or higher? So it just keeps generating all these different states.

It's a stochastic deep learning model, or, a better way to put it, a generative deep learning model, because it generates these states.

What we want is to use our training data, the hundred, or thousand, or tens of thousands of rows that we have. We want to feed it into this Boltzmann machine as the inputs to help it adjust the weights of the system accordingly, so that it actually resembles our system.

So what's the benefit of that? Well, once we've done the learning, once all of these weights are adjusted, the Boltzmann machine understands how all of these parameters interact with each other and what kind of constraints should exist between them in order for this system to be the system that we're modeling. Once that's all done, we can use the Boltzmann machine, for example, to monitor our nuclear power plant.

Because we've modeled it using good behavior, behavior that hasn't led to any meltdown or explosion, we know what is normal for the nuclear power plant. Then the Boltzmann machine will help us understand what is abnormal behavior. This is a great example of when unsupervised learning is the way to go.

Energy Based Models

This is the Boltzmann distribution.

So we've got a probability here (p_i), and this is the probability of your system being in a certain state i. It is equal to e to the power of minus the energy of that state, divided by kT.

k is a constant (the Boltzmann constant) and T is the temperature of your system. Finally, the whole thing is divided by the sum of those same values over all of the possible states of the system.

The important thing to see here is that e is raised to the power of minus the energy, meaning the higher the energy of a certain state, the lower its probability. So probability is inversely related to energy.
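To make the formula concrete, here is a minimal sketch that computes Boltzmann probabilities for a handful of states. The function name `boltzmann_probabilities`, the choice kT = 1 and the three energies are made up for illustration; they do not come from the original material.

```python
import numpy as np

def boltzmann_probabilities(energies, kT=1.0):
    """Return p_i = exp(-E_i / kT) / sum_j exp(-E_j / kT)."""
    energies = np.asarray(energies, dtype=float)
    # Subtracting the minimum energy leaves the ratios unchanged
    # and avoids overflow in the exponential.
    weights = np.exp(-(energies - energies.min()) / kT)
    return weights / weights.sum()

# Made-up energies for three states of a toy system.
energies = [0.0, 1.0, 2.0]
probs = boltzmann_probabilities(energies)
print(probs)  # the lowest-energy state gets the highest probability
```

The probabilities always sum to one, and raising the energy of a state while keeping the others fixed lowers its probability.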

Let's see an example. Assume that you and your friends are in a room.

Why is the gas in the room, for instance, not all in one corner? Technically, there's nothing preventing that from happening, right? There's nothing saying the molecules of gas have to be spread all over the room. They could be anywhere.

Assume that they are now all in one corner. This is possible for sure.

But what the Boltzmann distribution is saying is that the probability of that state occurring is very low, because the energy in that state would be very high. Because the molecules are very close to each other, they would be bumping into each other, they would be making a lot of chaos and havoc, and they would be moving very quickly.

More examples:

For instance, if you drop some ink into water, it will start spreading evenly in all directions. It won't form a star, for instance. It could form a star, or a snowflake, but it's not going to, because that's not the lowest energy state.

On the other hand, if you put a drop of oil into water, it won't start spreading. It will turn into a little ball on the surface of the water, because that is the lowest energy state for that specific system.

So, energy is defined in Boltzmann machines through the weights of the synapses. Once the system is trained up and the weights are set, the system, based on those weights, will always try to find the lowest energy state for itself.

It has lots of different options. It can be in this state, it can be in that state, but the weights dictate what the lowest energy state for the system is, and it will constantly try to get to the lowest energy state possible. That's how it works, and that's why they're called energy-based models.

And here's an example of an energy function for a restricted Boltzmann machine.

The probability of being in a certain state is inversely related to the energy of that state.

So it's going to play by those rules: it's going to find the lowest energy state, just because of the way we set it up.
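The energy function itself is not reproduced in the text, so the sketch below assumes the form commonly used for a binary RBM: E(v, h) = -b·v - a·h - hᵀWv, with W the weights, a the hidden biases and b the visible biases. The shapes and all numbers are invented for illustration.

```python
import numpy as np

def rbm_energy(v, h, W, a, b):
    """Assumed binary-RBM energy: E(v, h) = -b.v - a.h - h^T W v."""
    return -(b @ v) - (a @ h) - (h @ W @ v)

# Toy parameters: 2 hidden nodes, 3 visible nodes (made-up values).
W = np.array([[1.0, -1.0, 0.5],
              [0.0, 2.0, -0.5]])   # shape (n_hidden, n_visible)
a = np.array([0.1, -0.2])          # hidden biases
b = np.array([0.0, 0.3, -0.1])     # visible biases

v = np.array([1.0, 0.0, 1.0])      # one visible state
h = np.array([1.0, 0.0])           # one hidden state

energy = rbm_energy(v, h, W, a, b)
print(energy)
```

Large positive weights between nodes that are on together lower the energy, which is how the weights end up defining which states the model considers likely.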

Restricted Boltzmann Machine (RBM)

In practice, a full Boltzmann machine is very hard to implement. At some point we run into a roadblock because we simply cannot compute it: as you increase the number of nodes, the number of connections between them grows very quickly (every node connects to every other node), and the number of possible states of the system grows exponentially.

So a different type of architecture was proposed, which is called the restricted Boltzmann machine.

Here we've got exactly the same concept, with the simple restriction that hidden nodes cannot connect to each other and visible nodes cannot connect to each other.

Now we're going to talk about how it works, how it's trained and then how it's applied in practice.

So let's get straight into it. We're going to look at an example with movies because you can use a restricted Boltzmann machine to build a recommender system.

So let's say our recommender system, our restricted Boltzmann machine, is going to be working on six movies.

As you remember, a Boltzmann machine is a generative type of model, so it is always capable of generating these different states of our system. In training, we feed it training data, and through a process called contrastive divergence it adjusts its weights.

Through that process, the restricted Boltzmann machine learns how to allocate its hidden nodes to certain features. This is very similar to what we discussed for convolutional neural networks.

Let's go through this. During the training process, we feed lots and lots of rows to the restricted Boltzmann machine. These rows could look something like this, where we've got the movies as columns and the users as rows.

Here we've got the ratings or the feedback that each user has left for each movie: a one if they liked it, a zero if they didn't. Empty cells are totally fine as well, because they just mean that the person hasn't watched that movie.

As we feed this data to the restricted Boltzmann machine, it is able to understand our system better and adjust itself to be a better representation of it, reflecting all of the interconnectivity that might be present here. Ultimately, people have biases, preferences and tastes, and that is what is reflected in the data.

Let's have a look at how this would play out in action. We've now trained up our restricted Boltzmann machine, and we know that it is able to pick out certain features based on what it has previously seen from thousands of our users and their ratings. Let's say it has picked out these specific features: Drama, Action, DiCaprio (Leonardo DiCaprio as an actor in the movie), Oscar (whether or not the movie has won an Oscar) and Tarantino (whether or not Quentin Tarantino was the director of the movie).

And again, these labels are just for our benefit. In reality, the restricted Boltzmann machine has no idea whether the director's name is Tarantino or not. It just picks out features; we name them for intuitive purposes. Now we're going to look at a couple of movies.

So the machine is trained up on lots and lots of rows, and now we're going to input a new row into this recommender system and see how it goes about predicting whether or not a person will like certain movies.

This is the actual application of the RBM. So let's start. We've got the movies The Matrix, Fight Club, Forrest Gump, Pulp Fiction, Titanic and The Departed.

The Oscar node here represents whether or not a movie won an Oscar.

And now let's look at the person we're trying to make a recommendation for: what they have seen, what they haven't seen, and how they've rated it.

One is like, zero is dislike. They've seen The Matrix and didn't like it, so they put a zero. They haven't seen Fight Club. They've seen Forrest Gump and they liked it. They've seen Pulp Fiction but didn't like it. They've seen Titanic and liked it. They haven't seen The Departed. Now we want to make a recommendation for this person: will they like Fight Club or not?

Now the machine is going to assess which of these features are going to activate. It can be useful to think of the convolutional neural network analogy: there, we would feed a picture into the network, and certain features would light up if they were present in that picture.

So let's go through this. We're going to start with Drama.

We know that The Matrix is not Drama, Fight Club is not Drama, Forrest Gump is Drama. Actually, I looked it up, it's listed as comedy first and then Drama, but we don't have comedy here, so for our purposes it's Drama. Pulp Fiction is not Drama. Titanic is Drama and The Departed is Drama.

But we don't have data for The Departed, right (the person hasn't seen it yet)? So this Boltzmann machine can only learn from the other two, Forrest Gump and Titanic, which the person has seen.

It can only say: all right, this person liked Forrest Gump and this person liked Titanic, and based on that this node is going to light up. We'll light it up symbolically in green, meaning that it's activated.

Next, Action. You can see that the Action movies we have here are The Matrix, Fight Club, Pulp Fiction and The Departed. We have four Action movies, but out of them we only have data for The Matrix and Pulp Fiction, and this person didn't like either of them. So the node is going to light up in red.

DiCaprio. Out of all of these movies, Leonardo DiCaprio is present in Titanic and The Departed. We only have data for Titanic, and based on just that one movie the DiCaprio node is going to light up green.

Oscar. We've got three Oscar movies. We only have data for Forrest Gump and Titanic, and the person liked both, so the node is going to light up green.

And finally Tarantino. The only movie here with Tarantino as the director is Pulp Fiction, and the person did not like that movie, so this node is going to light up red.

Now what happens is the Boltzmann machine is going to try to reconstruct our input.

The inputs get passed back through the network and come back as adjusted values (although here they are still the same).

Here we only care about the movies where we don't have ratings, and we're going to use the values that the machine reconstructs as predictions.

How is it going to reconstruct Fight Club?

Well, the machine is going to look at all of the nodes, and based on what it learned during training it knows which nodes actually connect to Fight Club. Is it a Drama movie? No, it's not.

Is it an Action movie? Yes, it is. So that is the one node it connects to.

Does it have DiCaprio in it? No, it doesn't. Did this movie win an Oscar? It didn't. And is Tarantino the director of this movie? No, he's not. That's in our understanding, because we know these things.

So Fight Club has no connection with any node except the Action one.

Note: In the Boltzmann machine's understanding it will be more like, "is this node connected to this node?" and so on.

Based on this one connection, and knowing that the Action node lit up in red, Fight Club is going to be a movie that this person is not going to like.

So, 0 is added.

Now for The Departed. Just from the weights established during training, the machine knows these connections: The Departed is connected to this node (Drama), to this node (Action), to this node (DiCaprio) and to this node (Oscar), and it's not connected to this node (Tarantino).

The weight to the Tarantino node is low or very insignificant. In human language, why is that? Well, this node is responsible for Drama movies, and it's a Drama movie.

This node is responsible for Action movies, and it's an Action movie.

This node is responsible for DiCaprio movies, and it does have DiCaprio in it.

This node is responsible for Oscar movies, and it did win an Oscar.

But it's not a Tarantino movie.

So we have 3 greens and 1 red connected with "The Departed". Green means a yes.

So we predict: "Yes, you are most likely going to enjoy that movie."
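The walkthrough can be mimicked with a few lines of code. To be clear, this is only a toy illustration of the green/red intuition, not an actual RBM: the `membership` table is hand-written from the walkthrough (a real RBM learns its own weights and has no feature labels), and the simple averaging and voting rules are assumptions made for this sketch.

```python
import numpy as np

movies = ["The Matrix", "Fight Club", "Forrest Gump",
          "Pulp Fiction", "Titanic", "The Departed"]
features = ["Drama", "Action", "DiCaprio", "Oscar", "Tarantino"]

# membership[i, j] = 1 if movie i is linked to feature j in the walkthrough.
membership = np.array([
    [0, 1, 0, 0, 0],  # The Matrix
    [0, 1, 0, 0, 0],  # Fight Club
    [1, 0, 0, 1, 0],  # Forrest Gump
    [0, 1, 0, 0, 1],  # Pulp Fiction
    [1, 0, 1, 1, 0],  # Titanic
    [1, 1, 1, 1, 0],  # The Departed
])

# The new user's row: 1 = liked, 0 = disliked, -1 = not seen.
ratings = np.array([0, -1, 1, 0, 1, -1])
rated = ratings >= 0

# "Light up" each feature node: green (1) if the rated movies linked to it
# were mostly liked, red (0) otherwise. Every feature here has at least one
# rated movie, so the mean is always defined.
node_on = {}
for j, name in enumerate(features):
    linked = rated & (membership[:, j] == 1)
    node_on[name] = int(ratings[linked].mean() >= 0.5)

# "Reconstruct" the unseen movies by a majority vote of their linked nodes.
predictions = {}
for i, title in enumerate(movies):
    if not rated[i]:
        votes = [node_on[f] for j, f in enumerate(features) if membership[i, j] == 1]
        predictions[title] = int(np.mean(votes) > 0.5)

print(node_on)
print(predictions)
```

In a trained RBM the hard 0/1 links are replaced by learned weights and the votes by sigmoid probabilities, but the flow (visible to hidden, then hidden back to visible) is the same.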

Contrastive Divergence

This is the algorithm that actually allows Restricted Boltzmann Machines to learn. The question that we still have is: how does a Restricted Boltzmann Machine adjust its weights?

Let's find out.

So here we've got our input nodes. Once you put them into the network, the Restricted Boltzmann Machine calculates the hidden nodes, using some randomly assigned weights at the very start.

Then those hidden nodes use the exact same weights to reconstruct the input nodes.

The key point here is that the weights are exactly the same in both directions; they don't change between the two passes.

What is also important to understand is that the reconstructed inputs are not going to equal the original inputs, even though the weights are the same.

Let's have a look at our Restricted Boltzmann Machine in a bit more detail to understand this specific thing.

So here we've got our Restricted Boltzmann Machine, with our visible nodes and our hidden nodes. The question is: once we've reconstructed our visible nodes, how come they're not identical to the original visible nodes, even though we're using the same weights? Well, the reason is that the visible nodes are not connected to each other directly.

Let's understand this with an example.

Let's look at this node, node number two over here. How does it get reconstructed?

Well, it gets reconstructed based on the values that all five of these hidden nodes have in them.

When we first run this RBM, the initial input values initiate some values in the hidden nodes.

Then, once we run it backwards, these hidden nodes reconstruct all of the visible nodes.

Each hidden node is connected to all of the visible nodes, so its value is formed from a combination of multiple visible nodes.

Therefore, the hidden nodes depend on all of the visible nodes, and once the hidden nodes have values assigned, they in turn update all of the visible nodes. So each reconstructed visible node is influenced by every original visible node, not just by its own original value.

Now we're going to do another pass. We again feed the reconstructed values of our inputs into the RBM, and we get some hidden values.

Then, based on these hidden values, we reconstruct the inputs again.

Then we construct the hidden values again, and so on. This whole process is called Gibbs sampling.

Finally, at some point, we get reconstructed input values such that when we feed them into the RBM and get the hidden layer,

and then try to reconstruct the input layer again, we get those same values.

So in essence, this process has finally converged and our network is finally a great model of our inputs.
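The back-and-forth described above can be sketched as a short Gibbs chain. This is a minimal illustration with random, untrained weights; the helper names (`sample_hidden`, `sample_visible`, `gibbs_chain`), the 6-visible/5-hidden sizes and the fixed number of steps are all assumptions for the example, and the sigmoid activation is the usual choice for a binary RBM rather than something stated in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, a):
    """Visible -> hidden: activation probabilities and a binary sample."""
    p_h = sigmoid(W @ v + a)
    return p_h, (rng.random(p_h.shape) < p_h).astype(float)

def sample_visible(h, W, b):
    """Hidden -> visible, reusing the same weights (transposed)."""
    p_v = sigmoid(W.T @ h + b)
    return p_v, (rng.random(p_v.shape) < p_v).astype(float)

def gibbs_chain(v0, W, a, b, n_steps):
    """Alternate hidden and visible sampling for n_steps rounds."""
    v = v0.copy()
    for _ in range(n_steps):
        _, h = sample_hidden(v, W, a)
        _, v = sample_visible(h, W, b)
    return v

n_visible, n_hidden = 6, 5
W = rng.normal(size=(n_hidden, n_visible))  # random, untrained weights
a = np.zeros(n_hidden)                      # hidden biases
b = np.zeros(n_visible)                     # visible biases

v0 = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0])
v_k = gibbs_chain(v0, W, a, b, n_steps=10)
print(v_k)
```

Because every hidden unit sees all visible units, the reconstruction of any single visible node depends on the whole input vector, which is why it generally differs from the original value.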

Let's see what this means in terms of the energy curve, and also what it means for us. In terms of the curve, this is what it looks like.

On the left, we've got the gradient of the log probability of a certain state of our system with respect to the weights of the system.

v means the visible vector and h means the hidden vector.

So, on the right, we subtract the term formed by the final visible and hidden vectors (after sampling) from the term formed by the initial visible and hidden vectors.

The weights dictate the shape of this energy curve. Now we place our initial inputs on it, and we end up somewhere on the curve.

After the second pass, we end up somewhere else, here.

So as we've discussed before, a system which is governed by its energy will always try to end up in the lowest energy state possible.

We know that the ball is going to roll downhill.

So what we actually want is to pull the curve down here, at our initial inputs, and push it up over here, where the reconstruction ended up.

That way your ball is already inside the minimum, and you don't even have to go through the long process of sampling to get the recipe for how to adjust the curve; you can just adjust the weights.

Through that process, the system will always aim to get to values in its nodes which represent the lowest energy state possible.

There's a shortcut: we don't actually have to go through to the very end of the sampling process. We can just do two passes, a first pass and a second pass, which is Contrastive Divergence with one step (CD-1), and that will tell us how to adjust the curve. That's the essence of it all.
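A single CD-1 update can be sketched as follows. This is a minimal illustration under assumptions: the function name `cd1_update`, the learning rate and the toy sizes are invented here, and the update rule (initial visible-hidden product minus the one after a single reconstruction) is the commonly used CD-1 form rather than code taken from the original.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.1):
    """One contrastive divergence step (CD-1) for a single input vector."""
    # First pass: visible -> hidden.
    ph0 = sigmoid(W @ v0 + a)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Second pass: hidden -> reconstructed visible -> hidden again.
    pv1 = sigmoid(W.T @ h0 + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(W @ v1 + a)
    # Pull energy down at the data, push it up at the reconstruction.
    W_new = W + lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
    a_new = a + lr * (ph0 - ph1)
    b_new = b + lr * (v0 - v1)
    return W_new, a_new, b_new

n_visible, n_hidden = 6, 3
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))
a = np.zeros(n_hidden)
b = np.zeros(n_visible)
v0 = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0])

W2, a2, b2 = cd1_update(v0, W, a, b)
print(W2.shape, a2.shape, b2.shape)
```

Repeating this update over many training rows is what gradually reshapes the energy curve so that the training data sits in its low-energy regions.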

Deep Belief Networks (DBN)

A Deep Belief Network comes about when you stack several Restricted Boltzmann Machines (RBMs) on top of each other.

So, basically, the hidden layer of the first RBM is the input of the second RBM, and the hidden layer of the second RBM is the input of the third RBM. That is how the RBMs are stacked up in a Deep Belief Network.

You make sure that the connections between layers one, two and three are directed, and that they point downwards. The connection between the top two layers, on the other hand, is undirected.

You train them layer by layer as RBMs, and then there's also the wake-sleep algorithm. The wake-sleep algorithm basically means you train all the way up, then you train all the way down.

To learn more, read this out

Deep Boltzmann Machine (DBM)

In a DBM, there is no directedness at all: all connections are undirected, so you don't need to think about it. In a DBN, by contrast, we had to make sure the connections between layers are directed towards the visible nodes (note: the connection to the top hidden layer is undirected).

Let's code this down

We will download the dataset from here

Importing the libraries

import numpy as np

import pandas as pd

import torch

import torch.nn as nn

import torch.nn.parallel #for parallel computation

import torch.optim as optim #for optimizer

import torch.utils.data #for the tools we are going to use

from torch.autograd import Variable #legacy autograd wrapper around tensors (deprecated since PyTorch 0.4)

Importing the dataset

We will work with movies.dat file within ml-1m folder

The fields in this file cannot be separated by commas alone, because the movie titles can contain commas too.

This is what the .dat file looks like

So we separate the fields with '::'. Also, there are no column names, so header=None.

These arguments make sure that the dataset gets imported correctly. We use the Python engine (engine='python') because a multi-character separator like '::' is not supported by the default C parser. Finally, some of the movie titles contain special characters that cannot be treated properly with the classic encoding, UTF-8, so we add the encoding argument and set it to 'latin-1'.

movies = pd.read_csv('ml-1m/movies.dat',sep='::',header=None, engine='python',encoding='latin-1')

So, this is the imported dataset

We will do the same for users.dat file

users= pd.read_csv('ml-1m/users.dat',sep='::',header=None, engine='python',encoding='latin-1')

So, here is the users dataset

Same for ratings

ratings= pd.read_csv('ml-1m/ratings.dat',sep='::',header=None, engine='python',encoding='latin-1')
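To check the '::' parsing without the real files, here is a small sketch that reads two illustrative lines in the same layout from an in-memory buffer. The column names `movie_id`, `title` and `genres` are labels chosen for this example; the files themselves have no header.

```python
import io
import pandas as pd

# Two sample lines in the movies.dat layout (id::title::genres).
sample = io.StringIO(
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
)

# A separator longer than one character is treated as a regular
# expression, which requires the Python parsing engine.
movies_sample = pd.read_csv(sample, sep='::', header=None, engine='python')

# Hypothetical column names, added only for readability.
movies_sample.columns = ['movie_id', 'title', 'genres']
print(movies_sample)
```

Naming the columns right after loading makes later indexing easier to read than working with the default integer column labels.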

We have several train (base) and test sets within the ml-100k folder

We will use just the u1.base and u1.test files

This is what the u1.base file looks like. Its columns are separated by tabs

So, let's import it

training_set=pd.read_csv('ml-100k/u1.base',delimiter='\t',header=None) #header=None: the file has no header row, so the first rating is kept as data

Here we have index, user number, movie number, ratings and timestamp

We will turn that training set to an array

training_set=np.array(training_set,dtype='int')

This converts the training_set to an array, and we set the data type to integer since all of the values are integers.

Now, let's prepare the test set

test_set=pd.read_csv('ml-100k/u1.test',delimiter='\t',header=None)

test_set=np.array(test_set,dtype='int')

We want to know the total number of users

In training_set the user number is in the first column. Here 1 means user number 1.

So, if we take the max value of that column, we will know how many users there are: max(training_set[:,0])

The same goes for the test_set

We take the max over both sets, because the highest user number might appear in only one of them (the train and test sets do not contain the same rows)

So, for the test set, max(test_set[:,0]) takes the max value of the first column

Together, we take the max of both values

max(max(training_set[:,0]),max(test_set[:,0]))

And then turn the result into an integer

So, nb_users= int(max(max(training_set[:,0]),max(test_set[:,0])))

Again, this is the same for the number of movies

But here we take column index 1

nb_movies= int(max(max(training_set[:,1]),max(test_set[:,1])))

You can see that we have 943 users and 1682 movies in our dataset

Converting the data into an array with users in lines and movies in columns

Well, the reason is that we need to make a specific structure of data that will correspond to what the restricted Boltzmann machine expects as inputs. The restricted Boltzmann machines are a type of neural network where you have some input nodes that are the features, and you have some observations going one by one into the networks starting with the input nodes. And so what we have to do is create a structure that will contain these observations that will go into the network and their different features that are going to be in the input nodes. And that's exactly what we are about to do by creating this array with the users in lines and the movies in columns because we will have the observations in lines and the features in columns.
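The conversion code is not shown in the text, so here is a minimal sketch of one way to build the users-by-movies structure. The function name `convert`, its explicit `nb_users` and `nb_movies` parameters, and the toy input are assumptions for illustration; the input is assumed to have user ids in column 0, movie ids in column 1 and ratings in column 2, with ids starting at 1.

```python
import numpy as np

def convert(data, nb_users, nb_movies):
    """Return a list of lists: one row per user, one column per movie.

    Movies a user did not rate are left at 0.
    """
    new_data = []
    for id_user in range(1, nb_users + 1):
        mask = data[:, 0] == id_user          # rows belonging to this user
        id_movies = data[mask, 1]
        id_ratings = data[mask, 2]
        ratings = np.zeros(nb_movies)
        ratings[id_movies - 1] = id_ratings   # movie ids are 1-based
        new_data.append(list(ratings))
    return new_data

# Toy data: (user id, movie id, rating, timestamp).
toy = np.array([[1, 1, 5, 0],
                [1, 3, 3, 0],
                [2, 2, 4, 0]], dtype='int')

matrix = convert(toy, nb_users=2, nb_movies=3)
print(matrix)
```

Returning a list of lists (rather than a 2-D NumPy array) keeps the result in a form that can be passed directly to a tensor constructor in the next step.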

Once we are done, this is the new training_set

and this is the new test_set

Converting the data into torch tensors

So far, we have created arrays where the lines are the observations going into the network and the columns are the features that are going to be the input nodes in the network.

So for each user we will have their ratings of all the movies, zeros included, and these ratings are going to be the input nodes for this observation going into the network. And that's when PyTorch comes into play.

PyTorch tensors. So what are tensors? Tensors are simply arrays that contain elements of a single data type. So a tensor is a multi-dimensional matrix, but instead of being a NumPy array, it is a PyTorch array. In fact, we could build a neural network with NumPy arrays, but that would be much less efficient, and that's why we're using tensors.

Since we're using the FloatTensor class, the single type is going to be float

training_set=torch.FloatTensor(training_set)

The training set has turned to

We will do the same for test_set

test_set=torch.FloatTensor(test_set)

Convert the ratings into binary ratings 1 (Liked) or 0 (Not liked)

But now, since the ratings are going to become zero and one, the original zeros (movies the user did not rate) must get another value, and this new value is minus one. So minus one will mean that there was no rating for a specific movie by a specific user.

training_set[training_set==0]=-1 #unrated movies become -1

Ratings of 1 or 2 become 0, meaning the user did not like the movie

training_set[training_set==1]= 0

training_set[training_set==2]= 0

Ratings greater than or equal to 3 become 1, meaning the user liked the movie

training_set[training_set>=3]= 1

We do the same for test_set
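The order of these assignments matters, so it can help to check them on a tiny tensor first. The helper name `binarize` is hypothetical, introduced only so the same steps can be reused for both sets.

```python
import torch

def binarize(ratings):
    """Map 0 -> -1 (no rating), 1-2 -> 0 (not liked), 3+ -> 1 (liked)."""
    ratings = ratings.clone()
    # Unrated entries must be moved to -1 first, before 1 and 2
    # are turned into 0, otherwise the two cases would be mixed up.
    ratings[ratings == 0] = -1
    ratings[ratings == 1] = 0
    ratings[ratings == 2] = 0
    ratings[ratings >= 3] = 1
    return ratings

toy = torch.FloatTensor([[0, 1, 2],
                         [3, 4, 5]])
binary = binarize(toy)
print(binary)
```

With such a helper, the training and test tensors could be converted with one call each, for example `training_set = binarize(training_set)`.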

We also need to initialize the biases. Remember, there is a bias for the probability of the hidden nodes given the visible nodes and a bias for the probability of the visible nodes given the hidden nodes. Let's start with the bias for the probabilities of the hidden nodes given the visible nodes. As with the weights, we have to give this bias a name.

For this first bias, we use the name a. It is attached to self because a is a parameter of the object, so we write self.a, and we again use the torch.randn function to initialize it according to a normal distribution with mean 0 and variance 1.

Since there is one bias for each hidden node and we have nh hidden nodes, we need a vector of nh elements, all initialized to numbers that follow a normal distribution. But we need to create an additional dimension corresponding to the batch, so this vector shouldn't have one dimension, like a single input vector; it should have two. The first dimension corresponds to the batch and the second to the bias. Why do we need this fake batch dimension? Because the PyTorch functions that we're going to use expect a two-dimensional tensor, with the first dimension corresponding to the batch, rather than a one-dimensional vector. That's why we cannot pass nh directly; we need to put a 1 first. That creates a 2-D tensor of size (1, nh), where the 1 corresponds to the batch and nh to the bias.

Now we have a third parameter to define that is still specific to the object, our RBM model: the bias for the visible nodes. It's the same idea: we use torch.randn, but this time with nv instead of nh, which initializes a tensor of nv elements with one additional dimension for the batch, giving a 2-D tensor of size (1, nv). That's all we need to initialize future objects of the RBM class.

We can then complete the rest of the RBM class, and that's it
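The class itself is not shown in the text, so the following is a sketch of what it could look like under the description above: weights `W` of size (nh, nv), hidden bias `a` of size (1, nh) and visible bias `b` of size (1, nv), all drawn from `torch.randn`. The method names `sample_h`, `sample_v` and `train`, and the exact update rule, are assumptions for illustration.

```python
import torch

class RBM:
    def __init__(self, nv, nh):
        self.W = torch.randn(nh, nv)  # weights
        self.a = torch.randn(1, nh)   # bias for p(h | v), with a batch dimension
        self.b = torch.randn(1, nv)   # bias for p(v | h), with a batch dimension

    def sample_h(self, x):
        """Given visible values x (batch, nv), sample the hidden nodes."""
        wx = torch.mm(x, self.W.t())
        p_h_given_v = torch.sigmoid(wx + self.a.expand_as(wx))
        return p_h_given_v, torch.bernoulli(p_h_given_v)

    def sample_v(self, y):
        """Given hidden values y (batch, nh), sample the visible nodes."""
        wy = torch.mm(y, self.W)
        p_v_given_h = torch.sigmoid(wy + self.b.expand_as(wy))
        return p_v_given_h, torch.bernoulli(p_v_given_h)

    def train(self, v0, vk, ph0, phk):
        """Contrastive divergence update from initial and sampled states."""
        self.W += (torch.mm(v0.t(), ph0) - torch.mm(vk.t(), phk)).t()
        self.b += torch.sum(v0 - vk, 0)
        self.a += torch.sum(ph0 - phk, 0)

# Toy usage: 6 visible nodes (movies), 4 hidden nodes, a batch of one user.
rbm = RBM(nv=6, nh=4)
v0 = torch.FloatTensor([[1, 0, 1, 0, 1, 0]])
ph0, h0 = rbm.sample_h(v0)
pv1, v1 = rbm.sample_v(h0)
ph1, _ = rbm.sample_h(v1)
rbm.train(v0, v1, ph0, ph1)
print(ph0.shape, pv1.shape)
```

The `expand_as` calls are where the extra batch dimension of the biases pays off: the (1, nh) or (1, nv) bias is repeated across every row of the batch.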

Once we are done with training, we end up with a train loss of 0.242, which is pretty good because it means that in the training set we get the correct predicted rating about three times out of four.

One time out of four we make a mistake when predicting the ratings of the movies by our users.

Now, let's see the test result. We run the test code and get a test loss of 0.257.

That is a good result, because it is for new observations: for ratings the model did not see during training, we still predict the correct rating roughly three times out of four.

The test loss is only slightly above 25%, so we managed to build a fairly robust recommender system.
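The loss code is not shown, so the sketch below assumes the loss is the mean absolute difference between the true and reconstructed binary ratings, computed only over movies the user actually rated. That assumption is consistent with reading a loss of about 0.25 as one wrong prediction in four. The function name `mean_abs_error` and the toy tensors are made up for the example.

```python
import torch

def mean_abs_error(target, reconstruction):
    """Mean absolute error over rated entries (target >= 0)."""
    rated = target >= 0  # -1 marks movies without a rating
    return torch.mean(torch.abs(target[rated] - reconstruction[rated])).item()

# One user, four movies: the third movie was never rated.
target = torch.FloatTensor([[1, 0, -1, 1]])
reconstruction = torch.FloatTensor([[1, 1, 0, 1]])

loss = mean_abs_error(target, reconstruction)
print(loss)  # one of the three rated movies is predicted wrongly
```

With binary ratings, this number is simply the fraction of rated movies that were predicted wrongly, which is why it can be read directly as an error rate.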

Done!

Get the codes

Machine Learning : Deep Learning - Boltzman Machine (Part 30)