Linear Regression and Gradient Descent

Kartavay

Linear Regression is one of the most common machine learning algorithms, used especially for regression problems. This article will walk you through how it works and make your understanding of it clearer.

What is Linear Regression?

In machine learning, when you are training a linear regression model, your assumption is that there is a linear relationship between the input features and the output of the model.

A linear regression model is a supervised learning algorithm: it is trained on data that includes the outputs, and it assumes a linear relation between the inputs and the output.

When you train a linear regression model, it forms a hypothesis function, which is mathematically a linear equation of the input variables with a bias term added to it.

For a single input feature, the hypothesis function looks like this:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

In the above equation, you can easily see that the hypothesis function has one input variable x; theta naught is the bias of the function and theta one is the weight of the feature x (or input x). This looks like the typical equation of a straight line, which is

$$y=mx+c$$

But as the number of features increases this equation can become huge.
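To make this concrete, here is a minimal Python sketch of such a hypothesis function; the function and variable names are just illustrative, not part of any particular library:

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    """h(x) = theta0 + theta1 * x for one input feature."""
    return theta0 + theta1 * x

def hypothesis_multi(theta0, theta, x):
    """The same idea with several features: h(x) = theta0 + theta . x."""
    return theta0 + np.dot(theta, x)
```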

Cost Function

The better the hypothesis function, the nearer the model's predictions are to the actual outputs for the given input features. To measure exactly how well your model performs, we have the cost function:

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Here, J represents the cost function. If you look carefully, you will realize that we are actually calculating the error of each prediction, squaring that error, and using the summation to add all the squared errors together. m represents the number of examples that we are taking into account, and the i's represent the index of a particular example. We add the ½ just to make the later calculation (taking the derivative) easier.

When you create a machine learning model, your goal should be to minimize the cost function. After all, who wants to increase the error?

In order to decrease the error, the best thing you can do is form a hypothesis function whose predictions are as close as possible to the actual outputs, so as to minimize the squared terms and, in turn, the cost function.

Now, the cost function depends on two things: the feature values and the weights (or parameters) of the model, which are denoted by theta (look at the hypothesis function equation for reference).
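Here is a small Python sketch of this cost function for the one-feature case, under the same illustrative naming as before:

```python
import numpy as np

def cost(theta0, theta1, X, y):
    """J(theta) = 1/2 * sum of squared prediction errors over all m examples."""
    predictions = theta0 + theta1 * X   # h(x) for every example at once
    errors = predictions - y            # error of each prediction
    return 0.5 * np.sum(errors ** 2)    # the 1/2 matches the cost function above
```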

Minimizing the Cost Function: Intro to Gradient Descent

Since we are on a journey of minimizing the cost function, we have to find some algorithm to do it; we cannot just guess out of thin air which set of weights might be the best for our model and data.

Gradient Descent is an algorithm that helps us do exactly that. Two common forms of it are batch gradient descent and stochastic gradient descent.

Since we cannot change the feature values of the individual examples, the only way to minimize the error is to adjust the weights of the model, and we use gradient descent for that.

Below I am providing the mathematical way of minimizing the cost function. Taking the partial derivative of the cost function with respect to a weight $\theta_j$ gives

$$\frac{\partial J}{\partial \theta_j} = \sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

and the update rule for each weight becomes

$$\theta_j := \theta_j - \alpha \sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

Don't be scared, just look at the last line.

Here, in order to update a particular weight (theta j), the model goes through all m examples, which is quite a long process.

This is called batch gradient descent; the "batch" in the name refers to the fact that we go through all the examples and take them as one batch.

As described above, since we go through all the rows (all the data points) in the dataset for every update, it takes more time than it would if we did not have to go through every row or example.
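A minimal Python sketch of batch gradient descent for the one-feature model could look like the following; the default learning rate and iteration count are assumptions chosen purely for illustration:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.001, iterations=1000):
    """One-feature batch gradient descent: every update uses all m examples."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = (theta0 + theta1 * X) - y   # h(x) - y for the whole batch
        theta0 -= alpha * np.sum(errors)     # gradient w.r.t. theta0
        theta1 -= alpha * np.sum(errors * X) # gradient w.r.t. theta1
    return theta0, theta1
```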

Updating a weight without going through all the examples is the idea behind another form of this algorithm called stochastic gradient descent, which updates the weights using one example at a time.

Each update in stochastic gradient descent takes less time than in batch gradient descent, because in batch gradient descent we have to go through all the rows first, which takes more time.

We generally prefer to use stochastic gradient descent on larger datasets, where we can't afford to go through all the examples for every update, but on smaller datasets we can choose to use batch gradient descent.
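For comparison, here is a sketch of stochastic gradient descent under the same assumptions; the weights are updated after every single example rather than after the whole batch:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.001, epochs=10):
    """One-feature SGD: the weights are updated one example at a time."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):      # visit examples in random order
            error = (theta0 + theta1 * X[i]) - y[i]  # error on this single example
            theta0 -= alpha * error
            theta1 -= alpha * error * X[i]
    return theta0, theta1
```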

If you look carefully at the update rule, you will find an alpha (α) symbol there, which is called the learning rate. The learning rate basically controls the size of each update step and therefore how quickly the model converges to the best values. In practice, we try to keep it at 0.001 or some other fairly low value.

So, by using all these elements that you can see in the gradient descent equation, the model tries to find the best possible values of the weights in order to form the best possible hypothesis function.
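Putting the sketches above together, a toy usage might look like this; the data is made up purely for illustration:

```python
import numpy as np

# Toy data that roughly follows y = 2x + 1 (made up for illustration).
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

theta0, theta1 = batch_gradient_descent(X, y, alpha=0.01, iterations=5000)
print("bias:", theta0, "weight:", theta1)   # should land near 1 and 2
print("cost:", cost(theta0, theta1, X, y))
```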

CONCLUSION

Linear regression is generally used for regression problems in machine learning. When using it, our assumption is that there is some linear relation between the input features and the outputs. A linear regression model tries to form a hypothesis function to predict the outputs based on the inputs.

The better the hypothesis function, the better the model and the lower the cost function. In order to minimize the cost function, we use another algorithm called gradient descent, which tries to find the best possible weights so that our model performs better on a given set of features.

This is the way both of them work together.

Thanks for reading! If it helped, follow to support.
