From Linear Predictions to K-Means Clustering: Essential ML Concepts Explained with Math

Hello folks, in this blog we will explore some of the most basic and foundational concepts of machine learning. Recently, I delved into the ML domain and came across these topics that form the backbone of this exciting field.
Table of Contents
Why is math important in understanding ML concepts?
Types of prediction tasks
Binary Classification
Regression
Multiclass Classification
Ranking
Structured prediction
Supervised Learning: Feature Engineering
Feature Extraction
Feature Extractor
Key Definitions (Feature Vector, Weight Vector)
Linear Predictors
Binary Classification with Linear Predictors
Loss Functions (Zero-One Loss, Square Loss)
Score and Margin Definitions
Linear Regression
Residuals and Square Loss
Loss Minimization Framework
Gradient and Stochastic Gradient Descent (SGD)
Hinge Loss
Feature Templates
Hypothesis class
Hyperparameters
Validation set
Unsupervised Learning
Clustering (K-Means Objective & Algorithm)
Dimensionality Reduction
Why is math important in understanding ML concepts?
Math is important in machine learning because it helps us understand how computers learn and make decisions. It’s like giving the computer a set of rules to follow so it can figure out patterns in data and make predictions. For example, math helps us measure how wrong a computer’s guess is and shows us how to improve it. Without math, we wouldn’t know how to teach computers to learn or solve problems.
Types of prediction tasks
Binary Classification
Let me explain this with a simple example: imagine you want to predict whether an email is spam or not spam. To do this, you need to design a predictor, which is based on an algorithm. In this case, you would input certain features of the email, such as whether it contains the recipient's name or how it ends (let's call this input x), into the predictor (which we can represent as F). The predictor evaluates these features and provides an output (y), which can only have two possible values: +1 for spam and -1 for not spam.
So, you can visualize it like this:
$$x \;\rightarrow\; \boxed{F} \;\rightarrow\; y \in \{+1, -1\}$$
Regression
Regression can be defined as the supervised learning task of learning a function that maps an input point to a continuous value. For example, consider a real-world scenario where a real estate agent wants to estimate the price of houses based on various factors such as size (in square feet), number of bedrooms, and location. By using regression analysis, the agent can create a model that learns from historical data of house prices and their corresponding features. Once trained, this model can predict the price of a new house by inputting its features into the regression equation. This allows buyers and sellers to make informed decisions based on predicted values, demonstrating how regression helps in understanding and forecasting trends in various fields.
Then what’s the difference between Binary Classification and Regression??
Discrete vs. Continuous : The first key difference is that binary classification (BC) outputs discrete values, such as "yes" or "no," or in numerical terms, +1 or -1. In contrast, regression outputs continuous values, meaning it predicts a range of possible outcomes, such as prices or temperatures.
Curvy vs. Linear : The second difference is in the nature of decision boundaries. Binary classification can use both linear and non-linear (curvy) decision boundaries to separate classes based on the data. For example, a model like a decision tree or a neural network can create complex shapes to classify the training data effectively. On the other hand, linear regression models the relationship between features and the target variable using a straight line, assuming a linear relationship between them.
Multiclass Classification
Multiclass classification is used to categorize data into more than two distinct classes or categories. Unlike binary classification, which deals with only two outcomes (Ex. spam vs. not spam), multiclass classification handles problems with multiple possible outcomes. A real-world example is classifying handwritten digits (0–9) in the MNIST dataset, where the model predicts the digit based on pixel features of the image. Other examples include identifying the type of animal in an image (Ex. cat, dog, horse) or classifying emails into categories like primary, social, or promotions. Multiclass classification enables models to make complex decisions across diverse applications like image recognition, text categorization, and sentiment analysis
Ranking
Ranking is a machine learning task that involves ordering items based on their relevance or importance relative to a specific query or criterion. Unlike classification, which assigns labels to items, ranking focuses on producing a sorted list of items. A real-world example of ranking is in search engines, where results are ranked based on their relevance to a user's query. For instance, when you search for "best restaurants in Mysuru," the search engine uses ranking algorithms to display the most relevant restaurants at the top of the list, considering factors like user reviews, location, and popularity.
Structured prediction
Structured prediction is a machine learning approach used to predict complex outputs that have internal relationships or dependencies, rather than simple scalar values like in classification or regression. For example, in natural language processing, structured prediction can be used to assign parts of speech (Ex. noun, verb, adjective) to each word in a sentence while ensuring the tags follow grammatical rules.
Supervised Learning: Feature Engineering
Feature Extraction
Feature extraction is the process of transforming raw data into meaningful numerical features that machine learning models can understand and process efficiently. For example, in image recognition, feature extraction might involve identifying edges, shapes, or colors from an image to simplify the data while retaining its essential information. This approach reduces complexity, improves model accuracy, and speeds up processing by focusing only on relevant data
Feature Extractor
A feature extractor is a tool or algorithm used in the process of feature extraction to identify and extract meaningful features from raw data for machine learning models. For instance, in facial recognition systems, a feature extractor might detect key facial landmarks like the distance between eyes or the shape of the jawline, converting these into numerical values that the model can process. This simplifies complex data, enabling faster and more accurate predictions while focusing on relevant information
Key Definitions (Feature Vector, Weight Vector)
Feature Vector is an n-dimensional numerical representation of the characteristics or attributes of an object, event, or data sample, used in machine learning and pattern recognition. It is typically denoted as a vector x = (x1, x2, …, xn), where each element xi corresponds to a specific feature.
Mathematically, if we have n features, the feature vector can be represented as:
$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$
Weight Vector assigns importance to the features in a dataset, influencing how much each feature contributes to the model’s prediction. It is typically denoted as w = (w1, w2, …, wn), where wi represents the weight assigned to the i-th feature.
$$\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}$$
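To make these two definitions concrete, here is a tiny sketch (assuming NumPy; the feature values and weights are made up purely for illustration) of a feature vector, a weight vector, and the weighted sum w⋅x that the linear predictors below are built on:

```python
import numpy as np

# Hypothetical features for one email: [count of "free", count of "$", contains recipient name]
x = np.array([2.0, 1.0, 0.0])   # feature vector x
w = np.array([1.5, 0.8, -2.0])  # weight vector w (one weight per feature)

score = np.dot(w, x)            # weighted sum w . x
print(score)                    # 3.8
```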
Linear Predictors
Binary Classification with Linear Predictors
Binary classification is a machine learning task where the goal is to classify data points into one of two categories, such as determining whether an email is spam or not, or predicting if a patient has a disease. A linear predictor is a mathematical model that uses a linear equation to separate these two classes. It works by computing a weighted sum of the input features and applying a threshold to make predictions.
Mathematically, the linear predictor can be expressed as:
$$y = w^T x + b$$
Here,
x is the feature vector (input data),
w is the weight vector (learned during training),
b is the bias term,
y is the output, which determines the class based on a threshold (Ex : y > 0 for class +1 and y ≤ 0 for class -1).
For example, in predicting whether an email is spam, each feature (Ex : word frequency) contributes to the decision, and the linear predictor calculates a score. If the score exceeds a certain threshold, the email is classified as spam; otherwise, it’s not.
Linear predictors are simple yet powerful for linearly separable data. However, when data is not linearly separable (Ex : overlapping classes), techniques like logistic regression or kernel methods are used to improve performance.
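Here is a minimal sketch of such a linear predictor (assuming NumPy); the weights, bias, and feature values below are invented for illustration rather than learned from real data:

```python
import numpy as np

def predict(w, b, x):
    """Linear predictor: return +1 if w.x + b > 0, else -1."""
    score = np.dot(w, x) + b
    return 1 if score > 0 else -1

# Hypothetical spam features: [count of "free", count of "!", email length / 100]
w = np.array([1.2, 0.5, -0.1])   # weights (made up here)
b = -1.0                         # bias term

x_spam = np.array([3.0, 4.0, 1.0])   # looks spammy
x_ham  = np.array([0.0, 1.0, 5.0])   # looks normal

print(predict(w, b, x_spam))  # 1  -> spam
print(predict(w, b, x_ham))   # -1 -> not spam
```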
Loss Functions (Zero-One Loss, Square Loss)
A loss function is nothing but a mathematical function that determines how well our hypothesis (model) performs by calculating the difference between predicted values and actual values.
Zero-One Loss
The Zero-One Loss is one of the simplest loss functions used in classification tasks. It assigns a loss of 1 for an incorrect prediction and 0 for a correct prediction.
Mathematically, it is defined as:
$$L(y, \hat{y}) = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{if } y \neq \hat{y} \end{cases}$$
Here,
y is the true label.
y^ is the predicted label.
For example, in spam email detection, if the model predicts "spam" for a non-spam email, it incurs a loss of 1. While simple, Zero-One Loss is not differentiable, making it unsuitable for gradient-based optimization methods.
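A tiny sketch of the zero-one loss on toy ±1 labels:

```python
def zero_one_loss(y_true, y_pred):
    """Return 0 for a correct prediction, 1 for an incorrect one."""
    return 0 if y_true == y_pred else 1

print(zero_one_loss(+1, +1))  # 0: predicted spam, email was spam
print(zero_one_loss(-1, +1))  # 1: predicted spam, email was not spam
```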
Square Loss (Mean Squared Error)
The Square Loss, commonly used in regression tasks, measures the squared difference between the predicted value (y^) and the true value (y). It is mathematically expressed as:
$$L(y, \hat{y}) = (y - \hat{y})^2$$
This loss penalizes larger errors more heavily than smaller ones due to squaring. For example, in predicting house prices, if the true price is $300,000 and the model predicts $280,000, the squared error would be:
$$L(300000, 280000) = (300000 - 280000)^2 = 400000000$$
Square Loss is differentiable and widely used with gradient descent for optimizing regression models.
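The house-price arithmetic above can be checked in a couple of lines:

```python
def square_loss(y_true, y_pred):
    """Squared difference between the true value and the prediction."""
    return (y_true - y_pred) ** 2

print(square_loss(300_000, 280_000))  # 400000000
```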
Score and Margin Definitions
Score on an example (x,y) is given by w⋅ϕ(x) , which represents how confident we are in predicting the class label +1. Here, w is the weight vector, and ϕ(x) is the feature map applied to the input x.
The score on an example (x, y) is:
$$\text{Score} = \mathbf{w} \cdot \phi(x)$$
Margin on an example (x,y) is defined as (w⋅ϕ(x))y, which indicates how correct we are in our prediction. A larger margin suggests that the model is more confident and accurate in its classification.
The margin on an example (x, y) is:
$$\text{Margin} = (\mathbf{w} \cdot \phi(x)) \, y$$
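A small sketch that computes both quantities for a single example (the weight vector and feature map values are made up):

```python
import numpy as np

def score(w, phi_x):
    """Score = w . phi(x): how confident we are that the label is +1."""
    return np.dot(w, phi_x)

def margin(w, phi_x, y):
    """Margin = (w . phi(x)) * y: how correct the prediction is."""
    return score(w, phi_x) * y

w = np.array([0.5, -1.0, 2.0])     # hypothetical weight vector
phi_x = np.array([1.0, 0.0, 1.5])  # hypothetical feature map phi(x)

print(score(w, phi_x))       # 3.5
print(margin(w, phi_x, +1))  # 3.5  -> confident and correct if the true label is +1
print(margin(w, phi_x, -1))  # -3.5 -> the prediction is wrong if the true label is -1
```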
Linear Regression
Linear regression is a supervised learning algorithm used to model the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a straight line to the data. The goal is to predict continuous outcomes by minimizing the difference between the actual values and the predicted values.
The mathematical representation of linear regression is:
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$
Here,
y is the predicted output (dependent variable),
x1,x2,…,xn are the input features (independent variables),
θ0 is the bias term (intercept),
θ1,θ2,…,θn are the weights (coefficients).
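As a sketch, the prediction is just an intercept plus a dot product; the house-price coefficients below are made-up numbers, not a trained model:

```python
import numpy as np

# Hypothetical model: price = theta0 + theta1 * sqft + theta2 * bedrooms
theta0 = 50_000.0                     # bias term (intercept)
theta  = np.array([150.0, 10_000.0])  # weights for [sqft, bedrooms]

x = np.array([1_200.0, 3.0])          # a house: 1200 sqft, 3 bedrooms
y_pred = theta0 + np.dot(theta, x)

print(y_pred)  # 260000.0
```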
Residuals and Square Loss
The Residual is given by (w⋅ϕ(x))−y, which represents the amount by which the prediction fw(x)=w⋅ϕ(x) differs from the target y. If the residual is positive, the prediction overshoots the target; if it is negative, the prediction undershoots it.
The Square Loss, also known as the Mean Squared Error (MSE) in machine learning, is a loss function used in regression tasks to measure the difference between predicted values and actual target values. It penalizes larger errors more heavily by squaring the difference, making it sensitive to outliers. The goal of the model is to minimize this loss during training.
The Square Loss for a single data point is the one we already came across above:
$$L(y, \hat{y}) = (y - \hat{y})^2$$
For a dataset with multiple samples, the Mean Squared Error (MSE) is calculated as:
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Here, n is the number of samples, and the remaining symbols have the same meaning as before.
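A minimal sketch of residuals and the MSE over a few toy samples:

```python
import numpy as np

y_true = np.array([300_000.0, 150_000.0, 420_000.0])  # actual prices (toy data)
y_pred = np.array([280_000.0, 160_000.0, 400_000.0])  # model predictions

residuals = y_pred - y_true            # positive = overshoot, negative = undershoot
mse = np.mean((y_true - y_pred) ** 2)  # average of the squared errors

print(residuals)  # [-20000.  10000. -20000.]
print(mse)        # 300000000.0
```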
Loss Minimization Framework
The loss minimization framework refers to the process of training a model by iteratively adjusting its parameters to reduce the error between predictions and actual target values. This is achieved using a loss function, which quantifies the prediction error. For example, in predicting house prices, the model calculates how far its predicted price is from the actual price and updates its parameters to minimize this difference. By continually reducing the loss during training, the model improves its accuracy and generalizes better to unseen data. Formally, the training loss is the average of the per-example loss over the training set Dtrain:
$$\text{TrainLoss}(\mathbf{w}) = \frac{1}{|D_{\text{train}}|} \sum_{(x, y) \in D_{\text{train}}} \text{Loss}(x, y, \mathbf{w})$$
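In code, TrainLoss is simply the average of a per-example loss over the training set. Here is a sketch with the square loss plugged in and a two-example toy dataset:

```python
import numpy as np

def loss(x, y, w):
    """Per-example square loss for a linear model with weights w."""
    return (np.dot(w, x) - y) ** 2

def train_loss(D_train, w):
    """Average loss over the whole training set."""
    return sum(loss(x, y, w) for x, y in D_train) / len(D_train)

# Toy training set of (feature vector, target) pairs
D_train = [(np.array([1.0, 2.0]), 5.0),
           (np.array([2.0, 0.0]), 3.0)]
w = np.array([1.0, 2.0])

print(train_loss(D_train, w))  # ((5-5)^2 + (2-3)^2) / 2 = 0.5
```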
Gradient and Stochastic Gradient Descent (SGD)
Gradient Descent
Gradient descent is an optimization algorithm used in ML to minimize the error (loss) of a model by adjusting its parameters step by step. Imagine you are blindfolded and trying to find the lowest point in a valley. You can feel the slope under your feet and decide which way to step to go downhill. At each step, you measure the steepness of the slope (gradient) and move in the direction that decreases your altitude the fastest. Over time, you gradually reach the bottom of the valley, which represents the minimum error for your model.
Gradient descent is an iterative optimization procedure. It has two hyperparameters: the step size (which specifies how aggressively we want to pursue a direction) and the number of iterations T.
In machine learning, this process involves calculating the gradient of the loss function with respect to the model’s parameters (weights and biases) and updating them iteratively. The size of each step is controlled by a parameter called the learning rate: too large a step might overshoot the minimum, while too small a step might take forever to converge.
Given a model with parameters w and a loss function L(w), the gradient descent update rule is:
$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla L(\mathbf{w})$$
w is the vector of model parameters (weights and biases),
α is the learning rate, which controls the step size,
∇L(w) is the gradient of the loss function with respect to the parameters w.
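A minimal sketch of gradient descent for linear regression with square loss, using the fact that the gradient of (w⋅x − y)² with respect to w is 2(w⋅x − y)x; the data, learning rate, and iteration count are made up:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column acts as the bias feature
y = np.array([2.0, 3.0, 4.0])                        # targets (here y = 1 + 1*x)

w = np.zeros(2)   # start from all-zero weights
alpha = 0.1       # learning rate (step size)
T = 200           # number of iterations

for _ in range(T):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error
    w = w - alpha * grad                   # gradient descent update

print(w)  # approximately [1. 1.]
```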
Stochastic Gradient Descent
This algorithm follows the motto “It’s not about quality, it’s about quantity” (yes, you heard that right!).
Stochastic Gradient Descent (SGD) is an optimization algorithm used in machine learning to minimize the loss function and improve model accuracy. Unlike standard gradient descent, which calculates the gradient using the entire dataset, SGD updates the model parameters using the gradient of the loss function for a single randomly selected data point at each step. This makes SGD faster and more memory-efficient, especially for large datasets, but it introduces some noise into the optimization process.
Real-world example , Imagine you're hiking down a mountain to reach the lowest point. In standard gradient descent, you carefully analyze the entire terrain before deciding your next step. In contrast, with SGD, you only look at the slope under your feet (a single point) and take a step based on that. While this approach may not always lead directly downhill due to local variations (noise), it often helps you escape small valleys (local minima) and eventually find the lowest point (global minimum).
The parameter update rule for SGD is:
$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla L(x_i, y_i, \mathbf{w})$$
where w is the model’s parameter vector,
α is the learning rate,
∇L(xi,yi,w) is the gradient of the loss function computed for a single data point (xi,yi)
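The same toy problem with stochastic updates, one randomly chosen point per step (again a sketch, with made-up data and learning rate):

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column acts as the bias feature
y = np.array([2.0, 3.0, 4.0])                        # targets (here y = 1 + 1*x)

w = np.zeros(2)
alpha = 0.05  # learning rate

for _ in range(2000):
    i = rng.integers(len(y))             # pick one random training example
    grad = 2 * (X[i] @ w - y[i]) * X[i]  # gradient of that single example's square loss
    w = w - alpha * grad                 # noisy update based on one point

print(w)  # roughly [1. 1.], up to SGD noise
```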
Hinge Loss
Hinge Loss is a loss function used in machine learning, particularly for training Support Vector Machines (SVMs), to classify data points into two categories. It measures how well a model separates data points by penalizing misclassified points and those close to the decision boundary. The goal is to maximize the margin between classes while minimizing the loss.
The Hinge Loss for a single data point (x, y) is:
$$L(y, \hat{y}) = \max(0, 1 - y \cdot \hat{y})$$
where y is the true label (either +1 or -1) and y^ is the predicted score (distance from the decision boundary).
For multiple data points in a dataset:
$$\text{Hinge Loss} = \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i \cdot \hat{y}_i)$$
If y⋅y^≥1 (correct classification and far from boundary), loss = 0.
If y⋅y^<1 (misclassified or close to boundary), loss increases proportionally.
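A sketch of the hinge loss on a few toy (label, score) pairs, matching the two cases above:

```python
import numpy as np

def hinge_loss(y, y_hat):
    """Hinge loss for a single example: max(0, 1 - y * y_hat)."""
    return max(0.0, 1.0 - y * y_hat)

examples = [(+1, 2.5),   # correct and far from the boundary -> loss 0
            (+1, 0.3),   # correct but inside the margin     -> loss 0.7
            (-1, 1.0)]   # wrong side of the boundary        -> loss 2.0

losses = [hinge_loss(y, y_hat) for y, y_hat in examples]
print(losses)           # [0.0, 0.7, 2.0]
print(np.mean(losses))  # 0.9  (average hinge loss over the dataset)
```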
Feature Templates
Feature templates in machine learning are predefined patterns or rules used to extract or generate features from raw data. They act as a blueprint for creating meaningful inputs (features) that the model can use to make predictions. Feature templates are particularly useful in tasks like natural language processing, where structured features (Ex : word n-grams, part-of-speech tags) are derived from unstructured text data.
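As a sketch of the idea, here are a few hand-written templates applied to a single word; the specific templates (word suffix, word length, contains-a-digit) are just illustrative choices:

```python
def extract_features(word):
    """Apply a few feature templates to a single word."""
    features = {}
    features[f"last3={word[-3:]}"] = 1.0   # template: "last three characters are ___"
    features[f"length={len(word)}"] = 1.0  # template: "word length is ___"
    features["has_digit"] = float(any(c.isdigit() for c in word))
    return features

print(extract_features("running"))
# {'last3=ing': 1.0, 'length=7': 1.0, 'has_digit': 0.0}
```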
Hypothesis class
A hypothesis class refers to the set of all possible functions or models that can be considered to solve a given problem. It defines the space of candidate solutions that the learning algorithm searches through to find the best model for mapping inputs to outputs. The hypothesis class is determined by the type of model chosen (e.g., linear models, decision trees, neural networks) and the constraints or assumptions applied to the data.
Hyperparameters
Hyperparameters are settings that you choose before training a machine learning model. They help control how the model learns from the data. Think of them like knobs you adjust to make sure your model works well. Examples include things like how fast the model learns (learning rate) or how many layers it should have. You set these before training starts, and they can greatly affect how well your model performs.
Validation set
A validation set is a subset of data used during the training phase of a machine learning model to evaluate its performance and fine-tune its hyperparameters. It acts as a middle step between the training and test sets, helping to prevent overfitting by assessing the model on data it hasn’t seen during training. For example, if you’re training a model to predict house prices, the validation set allows you to test different settings (like learning rate or regularization strength) and choose the best-performing configuration before final testing
Unsupervised Learning
Unsupervised learning, in simple words, can be expressed as: given input data without any additional feedback, learn patterns.
Clustering (K-Means Objective & Algorithm)
K-Means is a popular clustering algorithm used to group similar data points into K clusters. The goal is to minimize the distance between data points and their cluster's center (called the centroid) while maximizing the differences between clusters.
Imagine you’re organizing books in a library. You want to group them into K shelves based on their similarity (Ex : genre). Initially, you randomly place a few books on each shelf (random centroids). Then,
For each book, you assign it to the shelf with the most similar books (closest centroid).
Once all books are assigned, you rearrange the shelves’ positions by averaging the characteristics of the books on each shelf (recomputing centroids).
You repeat this process until no books change shelves.
The objective of K-Means is to minimize the Within-Cluster Sum of Squares (WCSS), which is the sum of squared distances between each data point and its assigned cluster centroid:
$$\text{WCSS} = \sum_{i=1}^{K} \sum_{x \in S_i} \|x - \mu_i\|^2$$
where,
K : Number of clusters,
Si : Set of points in cluster i
μi : Centroid of cluster i
||x − μi||^2 : Squared Euclidean distance between a point x and its cluster centroid μi
Steps in K-Means Algorithm
Initialize Centroids : Randomly choose K points as initial centroids.
Assign Points : Assign each data point to the nearest centroid based on Euclidean distance.
Update Centroids : Recalculate centroids by averaging all points in each cluster.
Repeat : Repeat steps 2 and 3 until centroids no longer change or a maximum number of iterations is reached.
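A compact sketch of these four steps (assuming NumPy) on toy 2-D data with K = 2:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: two well-separated blobs in 2-D
data = np.vstack([rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
                  rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))])

K = 2
centroids = data[rng.choice(len(data), K, replace=False)]  # 1. initialize centroids

for _ in range(100):
    # 2. assign each point to the nearest centroid (squared Euclidean distance)
    dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # 3. update each centroid to the mean of its assigned points
    new_centroids = np.array([data[labels == k].mean(axis=0) for k in range(K)])
    # 4. repeat until the centroids no longer change
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

wcss = ((data - centroids[labels]) ** 2).sum()  # the K-means objective (WCSS)
print(centroids)  # roughly [[0, 0], [5, 5]] (cluster order may vary)
print(wcss)
```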
Dimensionality Reduction
Dimensionality reduction is a technique in machine learning used to simplify datasets by reducing the number of features (or dimensions) while retaining the most important information. This helps make models faster, reduces the risk of overfitting, and improves visualization of high-dimensional data. For example, if you are predicting house prices using features like bedrooms, square footage, and location, adding too many irrelevant features (e.g., flooring type) can make the dataset complex and slow down training. Dimensionality reduction removes such redundant or noisy features.
Example : PCA (Principal Component Analysis)
PCA is one of the most common dimensionality reduction techniques. It works by projecting data onto a lower-dimensional space while preserving as much variance as possible. Imagine you have data points in 3D space (X, Y, Z), but most of the variation lies along the X and Y axes. PCA simplifies this by projecting the data onto a 2D plane (X and Y), ignoring the less important Z-dimension.
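A minimal PCA sketch using NumPy: center the data, take the top principal directions from the SVD, and project. The 3-D toy data is generated so that most of the variance lies in the first two features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-D data: large spread along the first two axes, tiny spread along the third
X = rng.normal(size=(200, 3)) * np.array([5.0, 3.0, 0.1])

X_centered = X - X.mean(axis=0)                # 1. center each feature
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]                            # 2. top-2 principal directions
X_reduced = X_centered @ components.T          # 3. project 3-D points down to 2-D

print(X.shape, "->", X_reduced.shape)          # (200, 3) -> (200, 2)
explained = (S ** 2)[:2].sum() / (S ** 2).sum()
print(round(explained, 4))                     # close to 1.0: very little variance lost
```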
Hope you all had a great time learning through this blog. Until next time, Happy Learning!