Basics of Machine Learning and Linear Regression (Week 02)

Abhiraj Bhatta

Week 2 of my internship was focused on getting familiar with NumPy, pandas, scikit-learn, and other libraries commonly used in machine learning. I also went into further depth on some of the concepts that I had already implemented and learned about the process of data cleaning.

We started the week off by downloading a sales dataset from Kaggle and proceeded with the data cleaning process while simultaneously refreshing concepts such as linear regression.

The data cleaning process and EDA

EDA stands for exploratory data analysis and is the process of applying various analysis techniques to a given dataset to identify trends or patterns, such as the number of null values, the presence of outliers, and so on.
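
As a concrete starting point, a first pass of EDA with pandas might look like the sketch below. The sales.csv filename and the exact checks are just placeholders for whatever the downloaded Kaggle dataset actually contains.

```python
import pandas as pd

# Placeholder filename for the Kaggle sales dataset.
df = pd.read_csv("sales.csv")

df.info()                       # column types and non-null counts
print(df.describe())            # summary statistics for the numeric columns
print(df.isnull().sum())        # number of missing values per column
print(df.duplicated().sum())    # number of exact duplicate rows
```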

The general workflow for data cleaning is as follows (a short code sketch tying the steps together appears after the list):

  • Data Cleaning:

    The process of standardizing data formats such as usernames, phone numbers, etc. pandas methods like replace and astype, together with regular expressions, help us format the data properly.

  • Data imputation (handle missing/null values):

    We use commands like isnull to identify all the missing values and then make a decision. We can impute data using various methods, including mean/median/mode imputation, which fills with the mean or median of the column for numeric data and the mode for categorical data, or forward/backward filling, which uses the previous or next value in a sequence. Note that if the percentage of missing values is very large, it may be better to drop the column instead of trying to impute values.

  • Encoding for categorical values:

    Not all the columns in a given dataset are numeric in nature; in order to make everything numeric so our models can function properly, we use encoding. Common methods for encoding are label encoding, which simply assigns a unique integer to each category, and frequency encoding, which replaces each category with how often it appears in the column.

  • Removing outliers:

    We remove outliers using the IQR (Inter-Quartile Range) method, where we take the range between the 25th percentile (Q1) and the 75th percentile (Q3) of the data, and anything more than 1.5 times this range below Q1 or above Q3 is considered an outlier. The df.clip command can be used to cap values at these bounds instead of dropping the rows.

  • Analyze feature importance via heatmaps, correlation, etc.:

    We do this to check which columns have a high linear correlation with one another; this is useful for models like linear regression.

  • Feature Elimination:

    Methods such as RFE, RFECV, and PCA are used in order to remove unimportant features and reduce dimensionality.
    RFE stands for Recursive Feature Elimination, and RFECV is the same with cross-validation. In RFE we iteratively remove features one by one until a desired number of features is reached; RFECV automatically finds the optimal number of features using k-fold cross-validation. Both use an estimator (our model trained on the current set of features), take a measure of feature importance from the trained model, and eliminate the least important feature at each step. In RFECV, the data is divided into k folds, out of which one is used for validation and the rest for training; this process repeats so that each fold is used for validation once, and the average score across all the folds is the CV score for that number of features.

    PCA stands for Principal Component Analysis and is a technique for transforming high-dimensional data into lower-dimensional data by keeping only the directions along which the data varies the most. Essentially we calculate the eigenvalues and eigenvectors of the covariance matrix of the features, then project the data onto the eigenvectors with the highest eigenvalues. Note that the data must be standardized for PCA to work properly (each feature should have mean 0 and variance 1).

  • Scale/Normalize features:
    Feature scaling is an essential step for data analysis that helps optimization algorithms such as gradient descent converge faster and can reduce the mean squared error (MSE) of the final trained model.

    Various methods for scaling exist; in our case we mostly use standard scaling, which subtracts the mean of the column from each entry and then divides by the standard deviation, so that the whole column has zero mean and unit standard deviation. Another method, known as min-max scaling, subtracts the minimum value of the column from each value and then divides by the difference between the maximum and minimum values.

  • Train/Test split and modeling:

    The train_test_split function from scikit-learn helps us split our processed dataset into training and testing sets; we can set the proportion of data reserved for testing, such as 20%, which is the most common choice, and also control how randomly the data is divided. Note that the data splitting must be done before feature scaling, as otherwise our model gets to know the mean and standard deviation of the test data, which may lead to a falsely improved performance estimate. We first split our data and then normalize each set for fair results.
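
To tie the steps above together, here is a minimal sketch of the workflow on a hypothetical sales.csv with made-up columns (price as the target, plus quantity, discount, region, and phone); the real dataset's columns and imputation choices will differ, so treat this as an outline rather than the exact code we ran.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("sales.csv")   # hypothetical file and column names

# 1. Cleaning: standardize formats with astype/replace and regular expressions.
df["phone"] = df["phone"].astype(str).str.replace(r"\D", "", regex=True)
df["quantity"] = df["quantity"].astype(float)

# 2. Imputation: mean for numeric, mode for categorical; drop rows missing the target.
df = df.dropna(subset=["price"])
df["discount"] = df["discount"].fillna(df["discount"].mean())
df["region"] = df["region"].fillna(df["region"].mode()[0])

# 3. Encoding: frequency encoding for the categorical column.
df["region"] = df["region"].map(df["region"].value_counts(normalize=True))

# 4. Outliers: clip numeric values to the 1.5 * IQR bounds.
q1, q3 = df["quantity"].quantile(0.25), df["quantity"].quantile(0.75)
iqr = q3 - q1
df["quantity"] = df["quantity"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 5. Feature importance: correlation of each numeric column with the target.
print(df.select_dtypes("number").corr()["price"].sort_values(ascending=False))

# 6. Split first, then scale, so test-set statistics never leak into training.
X = df[["quantity", "discount", "region"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit only on the training data
X_test = scaler.transform(X_test)         # reuse the training mean and std
```

RFE and RFECV (from sklearn.feature_selection) and PCA (from sklearn.decomposition) would then be fitted on the scaled training data before modeling, following the same fit/transform pattern as the scaler.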

After the data cleaning process, it's time to apply machine learning models to our processed dataset and produce meaningful results. Let us begin this journey by discussing one of the simplest linear models, linear regression.

Linear Regression

It is one of the simplest supervised learning algorithms: our hypothesis is a linear equation, and we optimize the weights of this equation to minimize the sum of squared errors between the actual and predicted values. The exact form of the equation depends on the number of features in our problem. For example, in a dataset of house prices there may be only a single feature that affects the price, such as size in square feet, so we define the hypothesis:

$$h(x) = \theta_0 + \theta_1x_1$$

where h(x) is the hypothesis which represents our predicted output (price), and x₁ represents our input feature (housing size). θ₀ and θ₁ are regression coefficients which in this simple case are just the intercept and slope of a line in a 2-dimensional plane. The linear regression algorithm tries to optimize these coefficients to get accurate outputs as we will see later.

Now a question arises: in real life, multiple factors may influence a particular trait. Even in our house example, the price may depend on factors such as the number of rooms, the distance from shops, etc., so in general we define the following hypothesis:

$$h(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \ ... \ + \theta_px_p$$

where p is the total number of features that our output depends upon. This case of linear regression is also called multiple linear regression (MLR).

There are many ways of calculating the weights θ₀, θ₁, …, θₚ. We will be discussing:

  • Gradient Descent

  • Normal Equation

Gradient descent (Batch and Stochastic)

We first make some assumptions:

  • m = number of training examples

  • p = number of features (excluding bias)

  • Let

    $$x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ \vdots \\ x_p^{(i)} \end{bmatrix} \in \mathbb{R}^{(p+1) \times 1}$$

    where x⁽ⁱ⁾ is the column vector representing all the features of the iᵗʰ training example, and x₀⁽ⁱ⁾ is assumed to be 1 for all i.

  • Let us define a weight vector:

    $$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_p \end{bmatrix} \in \mathbb{R}^{(p+1) \times 1}$$

Where θ is the column vector representing all the weights.

Notice that θ has dimension (p+1)×1 in order to accommodate the bias/intercept term θ₀.

  • We also define our hypothesis as

    $$h_\theta\left(x^{(i)}\right) = \theta^T x^{(i)} \in \mathbb{R}$$

    On simplifying the RHS, we get our original linear regression equation:

    $$h_\theta(x^{(i)}) = \theta_0 + \theta_1x_1^{(i)} + \theta_2x_2^{(i)} + \ ... \ + \theta_px_p^{(i)}$$

Now we can begin the derivation. Our goal is to minimize the cost function:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

which is essentially the sum of squared errors between the predicted and actual values; the ½ exists so that the calculations up ahead are simpler. In gradient descent we start with some initial θ vector, such as the zero vector, and keep updating θ until the cost function approaches a local optimum, which in the case of linear regression is always the global minimum. The reason is that our cost function is quadratic in nature and produces a bowl-shaped surface if we plot the parameters θ against the output J(θ).

Each θ is updated as follows:

$$\theta_j := \theta_j - \alpha \cdot \frac{\partial}{\partial \theta_j} J(\theta)$$

where 𝛂 is the learning rate or step size; we will get back to its purpose after the derivation. Clearly we need to partially differentiate the cost function with respect to each particular θⱼ in order to update it, and on doing so we get:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

and our final equation becomes:

$$\theta_j := \theta_j - \alpha \cdot \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Here, the learning rate 𝛂 decides the step size for each iteration of the gradient descent algorithm. When 𝛂 is low, the step size is small and the algorithm may take too long to converge, whereas when 𝛂 is high, the steps may be too large and we can overshoot the global minimum.
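
As a quick illustration of this update rule, here is a minimal NumPy sketch on randomly generated toy data (this is not the internship dataset or code, just the batch update written out):

```python
import numpy as np

# Toy data: m examples, p features, plus a leading column of ones for the bias x0 = 1.
rng = np.random.default_rng(0)
m, p = 100, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, p))])   # shape (m, p+1)
true_theta = np.array([4.0, 2.0, -3.0])                     # [theta0, theta1, theta2]
y = X @ true_theta + rng.normal(scale=0.1, size=m)

theta = np.zeros(p + 1)   # start from the zero vector
alpha = 0.005             # learning rate

for _ in range(2000):
    errors = X @ theta - y              # h_theta(x^(i)) - y^(i) for every example
    theta -= alpha * (X.T @ errors)     # batch update: sum over all m examples

print(theta)   # approaches [4.0, 2.0, -3.0]
```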

The method discussed above is called Batch Gradient Descent, as each step uses the sum of errors over every training example. For larger datasets this can be inefficient, since even a single small step of the algorithm requires a large number of computations. An alternative is the Stochastic Gradient Descent algorithm:

Initialize θ = [θ₀, θ₁, ..., θₚ] randomly and set the learning rate α; then, instead of summing over all training examples, each θⱼ is updated as:

$$\theta_j := \theta_j - \alpha \cdot \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Here we optimize for a single training example at a time, at each step. This leads to a noisier path as the algorithm approaches the global minimum, but makes each step much cheaper and the overall method much faster. Another drawback of this method is that it never truly converges: since the algorithm is always optimizing for a new training example, it only gets close to the global minimum and keeps fluctuating around it. We can mitigate this by reducing the learning rate as we get closer to the minimum, which leads to smaller fluctuations and keeps the algorithm closer to the minimum in general.
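
On the same kind of toy data, the stochastic variant only changes the inner update: one randomly chosen example per step instead of the full sum, with the learning rate shrunk each epoch as described above. A minimal sketch:

```python
import numpy as np

# Same toy setup as the batch example above.
rng = np.random.default_rng(0)
m, p = 100, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, p))])
y = X @ np.array([4.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=m)

theta = np.zeros(p + 1)
alpha = 0.01

for epoch in range(50):
    for i in rng.permutation(m):        # visit the examples in a random order
        error = X[i] @ theta - y[i]     # error on a single training example
        theta -= alpha * error * X[i]   # update using that example alone
    alpha *= 0.95                       # shrink the step size to reduce fluctuation

print(theta)   # hovers close to [4.0, 2.0, -3.0] without exactly converging
```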


Normal Equations

This is another method for linear regression that is best suited to smaller datasets, as it relies on matrix operations (including a matrix inversion) with high time complexity. The advantage of this method is that we get the global minimum in a single shot through a closed-form calculation instead of after many iterations as in gradient descent. We must note, however, that the matrix involved must be invertible, which requires the feature columns to be linearly independent. We can help ensure this via proper data preprocessing techniques such as handling null values, removing duplicate features, etc., as we will see later.

We assume:

  • m = number of training examples

  • p = number of features (excluding bias)

  • we define a feature matrix as

    $$X = \begin{bmatrix} x_0^{(1)} & x_1^{(1)} & \cdots & x_p^{(1)} \\ x_0^{(2)} & x_1^{(2)} & \cdots & x_p^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_0^{(m)} & x_1^{(m)} & \cdots & x_p^{(m)} \end{bmatrix} \in \mathbb{R}^{m \times (p+1)}$$

  • The weight vector remains the same as in gradient descent

    $$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_p \end{bmatrix} \in \mathbb{R}^{(p+1) \times 1}$$

  • The output vector of target values is defined as

    $$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m \times 1}$$

Our hypothesis, applied to all training examples at once, is now:

$$h_\theta(X) = X\theta$$

and our cost function, written in matrix form, is:

$$J(\theta) = \frac{1}{2}\left( y - X\theta \right)^T \left( y - X\theta \right)$$

Our goal is to minimize this function, so we differentiate it with respect to θ, set the derivative to 0, and solve for θ, which finally gives:

$$\theta = (X^TX)^{-1}X^Ty$$

Clearly we can see why XᵀX needs to be invertible for this formula to be applicable, and how larger datasets (or many features) make the matrix computations expensive.
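
For completeness, here is the same toy problem solved with the normal equation. np.linalg.solve is used on XᵀXθ = Xᵀy rather than forming the inverse explicitly, which is the numerically safer way to evaluate the closed-form solution:

```python
import numpy as np

# Same toy setup: the leading column of ones plays the role of x0 = 1.
rng = np.random.default_rng(0)
m, p = 100, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, p))])
y = X @ np.array([4.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=m)

# theta = (X^T X)^{-1} X^T y, computed by solving the linear system instead.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # close to [4.0, 2.0, -3.0] in a single shot
```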
