Support Vector Machine and the Math Behind It

A support vector machine (SVM) is a machine learning algorithm used for classification and regression. SVM uses a hyperplane to separate the data into groups, maximizing the distance between the hyperplane and the closest points (vectors) of each class (group).
In machine learning, the boundary separating the examples of different classes is called the decision boundary.
Consider the graph below, in which x’s represent positive training examples, o’s denote negative training examples, the decision boundary (the line given by the equation \(θ^T x = 0\), also called the separating hyperplane) is shown, and three points are labeled A, B, and C.
We notice that point A is far from the decision boundary, so if we predict \(y\) at A we can be quite confident in that prediction. Point C, however, lies close to the decision boundary, and even a small change to the boundary could flip our prediction to \(y = 0\).
In SVM, we redefine the label space to:
\(y \in \{-1, +1\}\)
Functional Margin and Geometric Margin
The decision boundary is a hyperplane in the n-dimensional feature space, defined by the equation
\(w^Tx+b=0\)
This equation splits the feature space into two half-spaces (see the sketch below):
\(w^Tx+b>0\) on one side (classified as +1)
\(w^Tx+b<0\) on the other side (classified as -1)
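As a quick illustration, here is a minimal NumPy sketch of this half-space decision rule; the weight vector, bias, and sample points are made-up values chosen only for demonstration.

```python
import numpy as np

# Hypothetical hyperplane parameters (illustrative only)
w = np.array([2.0, -1.0])
b = -1.0

# Two made-up points, one on each side of the hyperplane
points = np.array([[3.0, 1.0],    # w.x + b = 4  -> +1 side
                   [0.0, 2.0]])   # w.x + b = -3 -> -1 side

# Classify by the sign of w^T x + b
scores = points @ w + b
labels = np.where(scores > 0, 1, -1)
print(scores)   # [ 4. -3.]
print(labels)   # [ 1 -1]
```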
SVM doesn’t just look for any separating hyperplane; it looks for the one that maximizes the margin between the two classes. The condition
\(y^{(i)}\left( \mathbf{w}^\top \mathbf{x}^{(i)} + b \right) \geq 1\)
imposes that each training example lies on the correct side of the margin.
\(\begin{cases} y = +1: & w^\top x + b \geq +1 \\ y = -1: & w^\top x + b \leq -1 \end{cases}\)
Functional Margin
Given a training example \((x^{(i)}, y^{(i)})\), where \(x^{(i)}\) is a feature vector and \(y^{(i)} \in \{-1, 1\}\) is the corresponding label, the functional margin \(\hat{\gamma}^{(i)}\) with respect to a hyperplane parameterized by a vector \(w\) (weights) and a scalar \(b\) (bias) is defined as:
\(\hat{\gamma}^{(i)} = y^{(i)} \left( w^\top x^{(i)} + b \right)\)
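In code, the functional margin is just this product. Below is a small sketch using the same made-up hyperplane as before; the points and labels are illustrative only.

```python
import numpy as np

def functional_margin(w, b, x, y):
    """Functional margin y * (w^T x + b) of a single labeled example."""
    return y * (np.dot(w, x) + b)

w = np.array([2.0, -1.0])
b = -1.0

# Correctly classified positive point: large positive margin
print(functional_margin(w, b, np.array([3.0, 1.0]), +1))   # 4.0
# Correctly classified negative point: positive margin as well
print(functional_margin(w, b, np.array([0.0, 2.0]), -1))   # 3.0
# Misclassified point: negative margin
print(functional_margin(w, b, np.array([0.0, 2.0]), +1))   # -3.0
```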
Significance of the Functional Margin:
For a correct classification \( y^{(i)} \left( w^\top x^{(i)} + b \right) > 0.\)
If \(y^{(i)} = 1\), we require that \(w^Tx^{(i)}+b\) be positive and large for high confidence, meaning that the further \(w^Tx^{(i)}+b\) is from zero, the more confident the model is in its prediction.
A small positive value (e.g., \(w^Tx+b=0.1\)) means the point is only barely classified as positive.
A large positive value (e.g., \(w^Tx + b = 100\)) means the point is far into the positive region, well away from the decision boundary.
If \(y^{(i)} = -1\), we require that \(w^Tx^{(i)}+b\) be negative and large in magnitude for high confidence.
A small negative value (e.g., \(w^Tx+b=-0.1\)) means the point is only barely classified as negative.
A large negative value (e.g., \(w^Tx+b=-100\)) means the point is far into the negative region, well away from the decision boundary.
Scaling Problem
The decision rule in SVMs is:
\(\hat{y} = \operatorname{sign}(w^Tx+b)\)
This rule depends only on the sign, not on the magnitude, of \(w^Tx+b\). That is:
if \(w^Tx+b>0\), classify as +1
if \(w^Tx+b<0\), classify as -1
For example, multiplying both \(w\) and \(b\) by a positive scalar (e.g., \(w \to 2w\), \(b \to 2b\)) results in the same decision boundary and the same classifications, because \(\operatorname{sign}(w^Tx+b) = \operatorname{sign}(2w^Tx+2b)\).
This unbounded behavior makes the functional margin meaningless to optimize on its own: a large functional margin by itself tells us nothing, since it does not reflect the distance to the decision boundary.
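The snippet below (same illustrative \(w\) and \(b\) as before) makes the issue concrete: scaling both by \(\alpha\) leaves every prediction unchanged while inflating the functional margin by the same factor \(\alpha\).

```python
import numpy as np

w = np.array([2.0, -1.0])
b = -1.0
x, y = np.array([3.0, 1.0]), +1

for alpha in (1, 10, 1000):
    w_s, b_s = alpha * w, alpha * b
    score = np.dot(w_s, x) + b_s
    print(alpha, int(np.sign(score)), y * score)
# prediction (sign) stays +1 for every alpha,
# but the functional margin grows: 4, 40, 4000
```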
A normalization condition is what resolves this problem.
Geometric Margin
Let us consider the diagram below:
Suppose we have a training point \(x^{(i)} \in \mathbb{R}^d\), which we call A, with label \(y = 1\).
The geometric margin \(γ^{(i)}\) of this point relative to the hyperplane is the signed perpendicular distance from \(x^{(i)}\) to the hyperplane \(H.\)
Here we move from the point \(x^{(i)}\) (A) perpendicularly down to the hyperplane. The direction of that movement is along the unit normal vector \(\hat{w} = \frac{w}{\|w\|}\) (the unit vector in the direction of \(w\)), and the amount moved is the signed perpendicular distance \(\gamma^{(i)}\).
The vector from the point to the hyperplane is :
\(\vec{v} = \gamma^{(i)} \cdot \frac{w}{\|w\|}\)
So to land on the hyperplane, we subtract that vector
\( B = x^{(i)} - \vec{v}\)
where B is the projection of A onto the hyperplane
By geometric reasoning, point \(B \) is given by:
\(B = A - \gamma^{(i)} \cdot \frac{w}{\|w\|} = x^{(i)} - \gamma^{(i)} \cdot \frac{w}{\|w\|}\)
But since \(B \in H\), it must satisfy the hyperplane equation:
\(w^TB+b=0\)
Substituting for \(B\):
\(w^T\left(x^{(i)} - \gamma^{(i)} \cdot \frac{w}{\|w\|}\right) + b = 0\)
\(w^Tx^{(i)} - \gamma^{(i)} \cdot \frac{w^Tw}{\|w\|} + b = 0\)
Since \(w^Tw = \|w\|^2\), this becomes
\(w^Tx^{(i)} - \gamma^{(i)}\,\|w\| + b = 0\)
Solving for \(γ^{(i)}\) :
\(γ^{(i)}=\frac{w^Tx^{(i)}+b}{∥w∥}\)
This is the distance from point \(x^{(i)}\) to the hyperplane.
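As a sanity check on this derivation, the sketch below (with illustrative values) computes \(\gamma^{(i)}\) from the formula, projects the point onto the hyperplane, and verifies that the foot of the perpendicular \(B\) indeed satisfies \(w^TB + b = 0\).

```python
import numpy as np

w = np.array([2.0, -1.0])
b = -1.0
x = np.array([3.0, 1.0])                           # the point A

gamma = (np.dot(w, x) + b) / np.linalg.norm(w)     # distance from the formula
B = x - gamma * w / np.linalg.norm(w)              # projection of A onto the hyperplane

print(gamma)              # ~1.7889
print(np.dot(w, B) + b)   # ~0.0, so B lies on the hyperplane
```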
Signed Distance & Generalization to Any Label
If the point has label \(y^{(i)} \in \) {+1,−1}, then the signed distance becomes
\(γ^{(i)}=\frac{y^{(i)}(w^Tx^{(i)}+b)}{∥w∥}\)
We multiply by \(y^{(i)}\) because we want:
Positive margin if correctly classified
Negative margin if misclassified
Zero margin if the point lies exactly on the hyperplane.
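A small sketch of the three cases, using the same illustrative hyperplane as before:

```python
import numpy as np

def geometric_margin(w, b, x, y):
    """Signed geometric margin y * (w^T x + b) / ||w||."""
    return y * (np.dot(w, x) + b) / np.linalg.norm(w)

w, b = np.array([2.0, -1.0]), -1.0

print(geometric_margin(w, b, np.array([3.0, 1.0]), +1))  # > 0: correctly classified
print(geometric_margin(w, b, np.array([3.0, 1.0]), -1))  # < 0: misclassified
print(geometric_margin(w, b, np.array([1.0, 1.0]), +1))  # = 0: on the hyperplane
```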
Invariance to Scaling
Suppose we scale:
\(w' = \alpha w\), \(b' = \alpha b\) for some \(\alpha > 0\)
Then :
Functional margin:
\(\hat{\gamma}^{\prime\,(i)} = \alpha \cdot \hat{\gamma}^{(i)}\), which grows without bound as \(\alpha\) increases.
Geometric margin:
\(\gamma^{\prime\,(i)} = \frac{\hat{\gamma}^{\prime\,(i)}}{\|\alpha w\|} = \frac{\alpha\,\hat{\gamma}^{(i)}}{\alpha\,\|w\|} = \frac{\hat{\gamma}^{(i)}}{\|w\|} = \gamma^{(i)}\)
Geometric margin remains unchanged under scaling.
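This contrast is easy to verify numerically (illustrative values again): the functional margin scales with \(\alpha\), while the geometric margin does not move.

```python
import numpy as np

w, b = np.array([2.0, -1.0]), -1.0
x, y = np.array([3.0, 1.0]), +1

for alpha in (1.0, 5.0, 100.0):
    w_s, b_s = alpha * w, alpha * b
    functional = y * (np.dot(w_s, x) + b_s)
    geometric = functional / np.linalg.norm(w_s)
    print(alpha, functional, round(geometric, 4))
# functional margin: 4, 20, 400 (grows with alpha)
# geometric margin: ~1.7889 every time
```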
Set-level Margin: Margin over a Training Set
Given a training set:
\(S = \{\, (x^{(i)},\, y^{(i)}) : i = 1, \dots, m \,\}\)
The geometric margin of the hyperplane with respect to the dataset is defined as the minimum margin over all training samples:
\(\gamma = \min_{i = 1, \dots, m} \gamma^{(i)} = \min_{i} \left( \frac{y^{(i)} \left( w^\top x^{(i)} + b \right)}{\lVert w \rVert} \right)\)
This defines how well-separated the worst-case point is from the decision boundary.
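A sketch of computing this set-level margin for a hyperplane and a small made-up dataset, where `X` holds one example per row and `y` the corresponding labels:

```python
import numpy as np

def dataset_margin(w, b, X, y):
    """Minimum signed geometric margin over all training examples."""
    margins = y * (X @ w + b) / np.linalg.norm(w)
    return margins.min()

w, b = np.array([2.0, -1.0]), -1.0
X = np.array([[3.0, 1.0],
              [2.0, 0.5],
              [0.0, 2.0]])
y = np.array([+1, +1, -1])

print(dataset_margin(w, b, X, y))   # the worst-case (smallest) margin over the set
```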
In SVMs, we aim to:
Maximize the minimum geometric margin \(\gamma\)
We are explicitly maximizing the geometric margin \(\gamma\), i.e., the minimum distance from any data point to the hyperplane \(H\).
\(\max_{w,\,b} \; \gamma\)
We want the largest possible margin such that all training points are correctly classified:
\(\max_{w,\,b} \; \gamma \quad \text{subject to} \quad y^{(i)}\left(w^\top x^{(i)} + b\right) \ge \gamma \;\; \forall i, \quad \|w\| = 1\)
For each point, the signed distance from the hyperplane must be at least γ.
This ensures that every point lies on the correct side of the margin band.
We leverage the fact that both functional and geometric margins are related:
\(\gamma^{(i)} = \frac{\hat{\gamma}^{(i)}}{\|w\|}\)
So we rewrite the problem in terms of the functional margin:
\(\max_{w,\,b} \; \frac{\hat{\gamma}}{\|w\|} \quad \text{s.t.} \quad y^{(i)}\left(w^\top x^{(i)} + b\right) \ge \hat{\gamma} \;\; \forall i\)
This is still not a convex problem, so we take one more step:
Scaling the Functional Margin to 1
Since we can rescale \(w\) and \(b\) arbitrarily (the classifier is invariant under such scaling), we choose to set:
\(\hat \gamma =1\)
This means we now fix the functional margin and allow \(∥w∥\) to vary. Now the geometric margin becomes:
\(\gamma = \frac{1}{\|w\|}\)
So maximizing geometric margin is equivalent to:
\(\min_{w,\,b}\) \(∥w∥\) or \(\min_{w,\,b}\frac{1}{2}∥w∥^2\) subject to \(y^{(i)}(w^T x^{(i)} + b ) \ge 1\) ∀i
Objective: minimize the norm of the hyperplane's normal vector → maximize the margin
Constraints: every point is on the correct side, and at least unit distance from the hyperplane.
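To make the final formulation concrete, here is one possible sketch that solves the hard-margin problem \(\min_{w,b} \frac{1}{2}\|w\|^2\) subject to \(y^{(i)}(w^Tx^{(i)}+b) \ge 1\) directly as a convex program. The use of the cvxpy library and the tiny toy dataset are my own illustrative choices, not part of the derivation above.

```python
import numpy as np
import cvxpy as cp

# Tiny linearly separable toy dataset (made up for illustration)
X = np.array([[2.0, 2.0],
              [3.0, 3.0],
              [-1.0, -1.0],
              [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

d = X.shape[1]
w = cp.Variable(d)
b = cp.Variable()

# Hard-margin SVM: minimize (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(objective, constraints)
problem.solve()

w_opt, b_opt = w.value, b.value
print("w:", w_opt, "b:", b_opt)
print("geometric margin:", 1.0 / np.linalg.norm(w_opt))
```

In practice you would usually call an SVM solver such as scikit-learn's `SVC(kernel="linear")`, which solves an equivalent dual problem; the point here is only that the optimization we derived is a standard convex quadratic program.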