A Simple Intro To Recommender Systems

Kartavay
6 min read

In this short article, I will just try to pass on my understanding of Recommender Systems.

This is going to be a very basic one.

Intro

Recommender systems are used in many scenarios these days, from the books that you read to the movies that you watch under your blanket at 2 AM.

The cool part is that these systems have become remarkably good at this task of “recommendation”. They analyze the behavior of millions of users at a scale no human could hope to match by hand.

It’s amazing to see how we are using mathematics to do some real-world shit, and the best part is that it even works.

I don’t know why it works, but I know that it works.

Mathematical Way of Looking at Them

Obviously, to use recommender systems, you need to have some data about the users. You then use this data to predict different patterns in their behavior, which in turn allows you to provide a better experience to the user.

Take a look at the table below.

| Movie \ User | U1 | U2 | U3 | U4 | U5 | U6 | U7 | U8 | U9 | U10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M1 | 5 | 3 | 4 | 2 | 1 | 4 | 3 | 2 | 5 | 4 |
| M2 | 2 | 4 | 3 | 5 | 3 | 2 | 4 | 3 | 4 | 1 |
| M3 | 4 | 2 | 5 | 4 | 3 | 1 | 2 | 5 | 3 | 2 |
| M4 | 3 | 5 | 2 | 1 | 4 | 3 | 5 | 4 | 2 | 3 |
| M5 | 1 | 3 | 4 | 5 | 2 | 4 | 3 | 1 | 2 | 5 |

Our goal is to make a recommender system out of this data that gives users the best possible experience.

So, now we are getting into a little bit of mathematics.

Let us define some variables so that we can set up the problem.

We will use i as the index for movies and j as the index for users. If we have to point out user 2, we will say j = 2.

Now, here are some common things that we need to define, which are fairly obvious.

The first is r(i, j), which asks whether user j has rated movie i or not. If the user has rated movie i, then r(i, j) = 1; otherwise it is 0.
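To make this concrete, here is a minimal sketch (assuming Python with NumPy, and treating a missing rating as 0) of how the ratings table and the indicator r(i, j) could be stored:

```python
import numpy as np

# Ratings table from above: rows are movies M1..M5, columns are users U1..U10.
# Y[i, j] is the rating that user j gave to movie i.
Y = np.array([
    [5, 3, 4, 2, 1, 4, 3, 2, 5, 4],  # M1
    [2, 4, 3, 5, 3, 2, 4, 3, 4, 1],  # M2
    [4, 2, 5, 4, 3, 1, 2, 5, 3, 2],  # M3
    [3, 5, 2, 1, 4, 3, 5, 4, 2, 3],  # M4
    [1, 3, 4, 5, 2, 4, 3, 1, 2, 5],  # M5
])

# R[i, j] = 1 if user j rated movie i, otherwise 0.
# In this toy table every rating is present, so R is all ones;
# with missing ratings (stored as 0 here) the mask would have zeros.
R = (Y > 0).astype(int)

n_m, n_u = Y.shape   # number of movies, number of users
print(n_m, n_u)      # 5 10
```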

Now let’s pour in some linear algebra. We need it to work with bigger datasets and make faster calculations.

The key thing to notice here is that each user j has their own weight vector w(j), and its length n is the number of features we use to describe each movie.

And each user also needs a bias term b(j).

The total number of vectors like this is going to be n_u (the total number of users).

Vector x(i) is the feature vector for movie i. It also has length n, and we will have one such vector for each movie, so the total number of vectors like this is n_m (the total number of movies).

We also need one more variable, y(i, j), which stores the rating given to movie i by user j.

Now that we have all the variables, we can set up equations that give us the best possible results.

Our model predicts the rating of movie i by user j with the equation below:

$$\hat{y}^{(i,j)} = w^{(j)} \cdot x^{(i)} + b^{(j)}$$

We are going to take this equation, plug it into gradient descent, and get the best parameters so that our model performs well.

Here, w and b are the two parameters associated with each user, and the goal is obviously to tweak these parameters in such a way that they give the best possible approximation of the rating each user would give to each movie.

To visualize this, imagine each user carrying around their own weight vector w(j) and bias b(j), while each movie i carries its own feature vector x(i).
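As a rough sketch (NumPy again, with n = 2 made-up features per movie and randomly initialized parameters, since the real values still have to be learned), a single prediction could look like this:

```python
import numpy as np

n_m, n_u = 5, 10           # number of movies and users, from the table above
n = 2                      # number of features per movie (chosen arbitrarily here)
rng = np.random.default_rng(0)

W = rng.normal(size=(n_u, n))   # one weight vector w(j) per user
b = np.zeros(n_u)               # one bias b(j) per user
X = rng.normal(size=(n_m, n))   # one feature vector x(i) per movie

def predict(i, j):
    """Predicted rating of movie i by user j: w(j) . x(i) + b(j)."""
    return W[j] @ X[i] + b[j]

print(predict(0, 1))  # predicted rating of movie M1 by user U2
```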

We do this by gradient descent, but before that, we need to figure out our cost function.

Below is the cost function. Here, we are trying to find the best parameters (weights and biases) for each user by minimizing the overall cost function.

$$\min_{w^{(1)}, b^{(1)}, \dots, w^{(n_u)}, b^{(n_u)}} \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i: r(i,j)=1} \left( w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( w_k^{(j)} \right)^2$$

Don’t worry, here we are just trying to minimize the overall cost by tweaking the weight vectors and biases associated with each user. This is the cost function for that, and we use gradient descent to minimize it.

But before that, here is a simple explanation of how this formula works.

Look at the rightmost term of the equation: it’s just a regularization term that penalizes large weights, which helps prevent overfitting. The bigger the value of the regularization parameter (lambda), the smaller the weights will become, because gradient descent tries to decrease the overall value of the cost function. This is the L2 regularization term.

Also, through these summations, we are just accessing the different elements of the weight vector associated with each user. For example, if k = 1 and j = 2, that means the 1st element of user 2’s weight vector.

So, in effect, we are taking the individual elements out of the weight vectors, squaring them, and adding them up.

Now, look back at the leftmost term of the equation.

Here we are just taking the difference between our model’s prediction and the actual rating and squaring that error. Again, the summation signs are just letting us compute this for many different predictions.

In the first summation, j varies from 1 to n_u, and the second summation goes through all the values of i for which r(i, j) = 1. In short, for each user j, we go through all the movies, but only the ones that user has actually rated, hence the condition r(i, j) = 1.
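Here is a short sketch of that cost function in code (reusing the Y, R, W, b, X arrays and the NumPy import from the snippets above, with lam standing in for lambda), written exactly the way the double summation reads:

```python
def cost_wrt_users(W, b, X, Y, R, lam):
    """Cost from the first equation: squared errors over the rated movies,
    plus L2 regularization on the user weight vectors."""
    n_m, n_u = Y.shape
    J = 0.0
    for j in range(n_u):              # for each user j...
        for i in range(n_m):          # ...go through the movies...
            if R[i, j] == 1:          # ...but only the ones user j rated
                err = W[j] @ X[i] + b[j] - Y[i, j]
                J += 0.5 * err ** 2
    J += (lam / 2) * np.sum(W ** 2)   # regularization term on the weights
    return J

print(cost_wrt_users(W, b, X, Y, R, lam=1.0))
```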

We do something similar in the following case, where we learn the feature vectors x(i) for each movie by minimizing a similar cost function.

$$\min_{x^{(1)}, \dots, x^{(n_m)}} \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j: r(i,j)=1} \left( w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

The only real difference between the two equations is that the order of the summations has changed, along with a few other small details. Instead of going through all the rated movies for each user, we go through all the users who rated each movie, calculate the predictions, and square the errors. In the rightmost term of this equation we are doing regularization as before, but this time we are regularizing the feature vectors rather than the weight vectors.
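And a matching sketch for this second equation; the only changes are which loop sits on the outside and which term gets regularized:

```python
def cost_wrt_features(W, b, X, Y, R, lam):
    """Cost from the second equation: the same squared errors, but now the
    regularization term is on the movie feature vectors instead of the weights."""
    n_m, n_u = Y.shape
    J = 0.0
    for i in range(n_m):              # for each movie i...
        for j in range(n_u):          # ...go through the users who rated it
            if R[i, j] == 1:
                err = W[j] @ X[i] + b[j] - Y[i, j]
                J += 0.5 * err ** 2
    J += (lam / 2) * np.sum(X ** 2)   # regularization on the features this time
    return J
```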

What if we combine the two equations? This will decrease our overall work and let us learn the weights, biases, and features together. So we do this.

$$\min_{w, b, x} J(w, b, x) = \frac{1}{2} \sum_{(i,j): r(i,j)=1} \left( w^{(j)} \cdot x^{(i)} + b^{(j)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( w_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

This equation is just the combination of the above two equations, and it does the work of both equations by itself. It’s a 2 in 1 pack.
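Here is the combined cost as a sketch, this time written in a vectorized way (same arrays and assumptions as before):

```python
def cost_collaborative(W, b, X, Y, R, lam):
    """Combined cost: squared errors over every pair (i, j) with r(i, j) = 1,
    plus regularization on both the weights W and the features X."""
    E = (X @ W.T + b - Y) * R    # errors, zeroed out wherever there is no rating
    return 0.5 * np.sum(E ** 2) + (lam / 2) * (np.sum(W ** 2) + np.sum(X ** 2))

print(cost_collaborative(W, b, X, Y, R, lam=1.0))
```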

So, once gradient descent has driven this cost function down to (or close to) its lowest possible value, we can hope that we have found great parameters that will give us better predictions.
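To close the loop, here is a minimal gradient descent sketch with hand-derived gradients of the combined cost above; the learning rate, lambda, and number of steps are arbitrary choices for this toy example, not tuned values:

```python
alpha, lam, steps = 0.01, 1.0, 500   # arbitrary hyperparameters for this sketch

for step in range(steps):
    E = (X @ W.T + b - Y) * R        # prediction errors on the rated entries only
    grad_W = E.T @ X + lam * W       # gradient of the cost with respect to W
    grad_X = E @ W + lam * X         # gradient with respect to X
    grad_b = E.sum(axis=0)           # gradient with respect to b
    W -= alpha * grad_W
    X -= alpha * grad_X
    b -= alpha * grad_b

print(cost_collaborative(W, b, X, Y, R, lam))   # the cost should have gone down
```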

So, this was the intro to recommender systems in a mathematical way.

I hope it helped.

