Linear Regression

Nitin Sharma

Machine learning is an evolving domain within the field of artificial intelligence (AI) that focuses on the development of algorithms capable of learning from data. These algorithms empower systems to improve their performance on tasks over time through experience, primarily by recognizing patterns and making data-driven predictions. Among the diverse array of techniques employed in machine learning, linear regression emerges as a fundamental statistical approach extensively used for analyzing relationships among variables.

At its core, linear regression examines the connection between a dependent variable—often referred to as the outcome we aim to predict—and one or multiple independent variables, which are the factors believed to influence this outcome. This technique is particularly powerful in situations where the relationship between variables can be approximated by a straight line.

To implement linear regression, a mathematical line is fitted to the data points in such a manner that the distance between the line and the actual data points is minimized. This method, known as the least squares method, optimally determines the line that best represents the data. The equation of this line can be expressed as

\(\textbf {Y=mX+c}\)

Where \(Y\) is the dependent variable, \(X\) is the independent variable, \(m\) denotes the slope of the line indicating how changes in \(X\) impact \(Y\), and \(c\) represents the y-intercept.

By effectively drawing this line, linear regression not only aids in forecasting and predicting outcomes but also facilitates a deeper understanding of trends in data. It can be utilized to draw insights that inform decision-making processes across various fields, such as economics, where it can elucidate market trends; biology, for exploring relationships between biological factors; engineering, for modeling processes and behaviors; and social sciences, for analyzing societal trends and implications.
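To make this concrete, here is a minimal sketch that fits such a line to a small synthetic dataset using NumPy's least-squares `polyfit`; the data and the underlying slope and intercept are invented purely for illustration.

```python
import numpy as np

# Synthetic data: a roughly linear relationship with some noise (made up for this example)
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
Y = 3.0 * X + 2.0 + rng.normal(scale=1.5, size=X.shape)

# Least-squares fit of a straight line Y = m*X + c
m, c = np.polyfit(X, Y, deg=1)
print(f"slope m ≈ {m:.2f}, intercept c ≈ {c:.2f}")

# Use the fitted line to predict Y for a new X
x_new = 7.5
print(f"predicted Y at X = {x_new}: {m * x_new + c:.2f}")
```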

A linear model generates predictions by first assessing each input feature, which represents a specific characteristic of the data. It calculates a weighted sum, where each feature is multiplied by a corresponding weight that reflects its importance in the prediction process. To refine this computation, a constant value known as the bias term is added. This bias term allows the model to adjust its predictions to better fit the observed data, ensuring more accurate outcomes.

The model function for linear regression is represented as

\(\textbf f_{w,b}(x) = \textbf {wx + b}\)

Where \(w\) is the weight and \(b\) is the bias.

For a multivariate problem with many features, this can be defined as

\(\hat{y}= w_0+w_1x_1+w_2x_2+\cdots+w_nx_n\)

\(\hat{y}\) is the predicted value, \(w_0\) is the bias term, and \(n\) is the number of features.

Or, more compactly, we can write \(\hat{y}= w \cdot x\), where:

  • \(w\) is the model’s parameter vector, containing the bias term \(w_0\) and the feature weights \(w_1\) to \(w_n\).

  • \(x\) is the instance’s feature vector, containing \(x_0\) to \(x_n\), with \(x_0\) always equal to 1 so that the dot product includes the bias term.

  • \(w \cdot x\) is the dot product of the vectors \(w\) and \(x\).
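As a small illustration of this vectorized form, the sketch below computes a single prediction as a dot product in NumPy; the weights and feature values are made up.

```python
import numpy as np

# Hypothetical parameter vector: bias w0 followed by three feature weights
w = np.array([4.0, 0.5, -1.2, 3.3])   # [w0, w1, w2, w3]

# One instance's feature vector, with x0 = 1 so the dot product also picks up the bias
x = np.array([1.0, 2.0, 0.7, 1.5])    # [x0, x1, x2, x3]

# Prediction y_hat = w . x
y_hat = np.dot(w, x)
print(y_hat)   # 4.0 + 0.5*2.0 - 1.2*0.7 + 3.3*1.5 = 9.11
```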

To evaluate the effectiveness of various pairs of parameters \(\textbf {(w,b)}\) in a linear regression model, we employ a cost function represented as \(\textbf {J(w,b)}\). This function plays a crucial role in measuring the performance of the selected parameters by quantifying the discrepancy between the predicted outcomes generated by the linear model and the actual target values observed in the dataset.

In more detail, the cost function typically calculates the sum of the squared differences between the predicted values (obtained from the linear equation defined by \(w\) and \(b\)) and the true values from the data. This quantification allows us to assess how accurately the linear model is able to predict outcomes based on the input features.

By systematically varying the parameters \(w\) (the weights) and \(b\) (the bias) and computing the corresponding values of the cost function \(J(w,b)\), we can analyze and compare the performance of different combinations. The ultimate goal of this assessment is to identify the pair of parameters that results in the lowest cost, indicating the best fit for the linear regression model. This process is foundational in optimizing the model to achieve the most accurate predictions possible based on the input data.

The cost function for linear regression \(\textbf {J(w, b) }\) is defined as

\( J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \)
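Translated directly into NumPy, this cost function might look like the sketch below (single-feature case; the tiny dataset and parameter values are made up for illustration):

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Squared-error cost J(w, b) for single-feature linear regression."""
    m = X.shape[0]
    predictions = w * X + b                  # f_{w,b}(x^(i)) for every example
    return ((predictions - y) ** 2).sum() / (2 * m)

# Tiny made-up dataset
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.5, 4.5, 6.5])
print(compute_cost(X, y, w=2.0, b=0.5))   # 0.0: this (w, b) fits the data exactly
print(compute_cost(X, y, w=1.0, b=0.0))   # noticeably higher cost for a worse fit
```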

To find the optimal values for the parameters \(\textbf {(w,b)}\) that minimize the cost function \(\textbf {J(w, b) }\), one effective method we can use is gradient descent. This powerful iterative optimization technique systematically refines the parameter values over time.

The process begins by calculating the gradient of the cost function, which indicates the direction of the steepest increase in cost. By following the negative gradient, essentially moving in the opposite direction, gradient descent gradually adjusts \(\textbf {(w,b)}\) in small steps. Each update is designed to reduce the cost function \(J(w,b)\), guiding the parameters toward those values that achieve the lowest cost.

Through repeated iterations, where each step brings us closer to the optimal solution, gradient descent effectively uncovers the best-fitting parameters for your model. This method is not only fundamental in machine learning but also widely applicable in various optimization problems across different fields.

The gradient descent algorithm is:

\(\begin{align*}& \text{repeat until convergence:} \; \lbrace \newline \; & \phantom {0000} b := b - \alpha \frac{\partial J(w,b)}{\partial b} \newline \; & \phantom {0000} w := w - \alpha \frac{\partial J(w,b)}{\partial w} \; & \newline & \rbrace\end{align*}\)
where the parameters \(w\) and \(b\) are both updated simultaneously, and where

\(\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \)

\(\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) -y^{(i)})x^{(i)} \)

  • \(m\) is the number of training examples in the dataset

  • \(f_{w,b}(x^{(i)})\) is the model's prediction, while \(y^{(i)}\) is the target value
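These two partial derivatives translate directly into code; the sketch below is one way to write them for the single-feature case (the function name and toy data are my own):

```python
import numpy as np

def compute_gradients(X, y, w, b):
    """Partial derivatives of J(w, b) with respect to w and b (single feature)."""
    m = X.shape[0]
    errors = (w * X + b) - y                 # f_{w,b}(x^(i)) - y^(i)
    dj_dw = (errors * X).sum() / m
    dj_db = errors.sum() / m
    return dj_dw, dj_db

# Toy data: with w too small, both gradients come out negative, so w and b should be increased
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.5, 4.5, 6.5])
print(compute_gradients(X, y, w=1.0, b=0.0))
```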

Deriving the partial derivatives of the cost function for gradient descent

Let's derive the above gradient expressions by taking the partial derivatives of \(J(w,b)\) with respect to \(w\) and \(b\).

\(\frac{\partial J(w,b)}{\partial b}=\frac{\partial \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 }{\partial b}\)

Applying the chain rule with respect to \(b\), we get

$$\frac {\partial J(w,b)}{\partial b}= 2 \cdot \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})$$

Similarly, applying the chain rule with respect to \(w\), where the inner derivative contributes a factor of \(x^{(i)}\), we get

$$\frac {\partial J(w,b)}{\partial w}= 2 \cdot \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})\,x^{(i)} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})\,x^{(i)}$$
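As a quick sanity check on this derivation, one can compare the analytic \(\frac{\partial J}{\partial w}\) with a finite-difference estimate of the same quantity; in the sketch below the data and the point \((w, b)\) are arbitrary made-up values, and the two numbers should agree to several decimal places.

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Squared-error cost J(w, b) for single-feature linear regression."""
    m = X.shape[0]
    return (((w * X + b) - y) ** 2).sum() / (2 * m)

# Arbitrary made-up data and parameter values
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.5, 3.5])
w, b, eps = 0.8, 0.1, 1e-6

# Central finite difference vs. the analytic expression derived above
numeric = (compute_cost(X, y, w + eps, b) - compute_cost(X, y, w - eps, b)) / (2 * eps)
analytic = (((w * X + b) - y) * X).sum() / X.shape[0]
print(numeric, analytic)   # both ≈ -1.9
```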

We start the process by assigning random values to the weights, denoted as \(w\). This initial step is referred to as random initialization and serves as the starting point for our optimization. Once the weights are initialized, we enter the optimization phase, where we strive to improve these weights iteratively.

During this phase, we take small, measured steps (often referred to as "baby steps") to modify the weights. Each step involves calculating the gradient of the cost function, such as the Mean Squared Error (MSE), with respect to the weights. The gradient informs us of the direction in which to adjust \(w\) in order to minimize the cost function.

We continue this iterative process of adjusting the weights and evaluating the cost function until the algorithm converges, meaning that further adjustments result in negligible changes to the cost function. This convergence indicates that we have reached a minimum point, where the weights are optimized for our model.
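Putting the pieces together, a bare-bones gradient descent loop for the single-feature case might look like the following sketch; the function name, the learning rate, the iteration count, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, num_iters=5000):
    """Fit w and b by repeatedly stepping against the gradient of J(w, b)."""
    rng = np.random.default_rng(42)
    w = rng.normal()                 # random initialization of the weight
    b = 0.0
    m = X.shape[0]
    for _ in range(num_iters):
        errors = (w * X + b) - y
        dj_dw = (errors * X).sum() / m
        dj_db = errors.sum() / m
        # Simultaneous update of both parameters
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b

# Made-up data that roughly follows y = 2x + 1
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
w, b = gradient_descent(X, y)
print(w, b)   # should land near w ≈ 2, b ≈ 1
```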

When the learning rate \( \alpha\) is set excessively high, it can cause the optimization process to overshoot the minimum point in the loss landscape, akin to a ball that rolls too far down one side of a valley and ends up on the opposite slope. Instead of homing in on the optimal solution, the algorithm may land at a point with a higher error value than it started with. This misstep can lead to divergence, where the values of the solution continue to increase uncontrollably, ultimately preventing the algorithm from converging on a suitable and effective solution.
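To see this overshooting concretely, the sketch below runs the same kind of update loop on the toy data from the previous example with a small and a deliberately large learning rate; the specific values of \(\alpha\) are just illustrative.

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

def cost_after(alpha, steps=20):
    """Run a few gradient descent steps and return the final cost J(w, b)."""
    w, b, m = 0.0, 0.0, X.shape[0]
    for _ in range(steps):
        errors = (w * X + b) - y
        w -= alpha * (errors * X).sum() / m
        b -= alpha * errors.sum() / m
    return (((w * X + b) - y) ** 2).sum() / (2 * m)

print(cost_after(alpha=0.05))   # cost shrinks steadily toward the minimum
print(cost_after(alpha=0.40))   # cost blows up: each step overshoots and the updates diverge
```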
