Support Vector Machine

Vidushi Anand

SVM is a supervised machine learning algorithm used for classification, regression and outlier detection.

It is a fancy and super smart version of y=mx+c (a straight line) but with a mission in hand.

It doesn’t just draw a line, it draws the best possible line. It looks at the data, sorts it into one of two categories, and leaves a margin, i.e. a wide gap, between the two categories.

As I said, it is smarter than y=mx+c. Why?

  • It can handle curved boundaries and multiple dimensions.

  • It focuses on the data points that matter most.

  • It keeps a wide margin between categories.

Under the hood, it is just a line… but with style, strategy and purpose.

SVM for classification tasks

Classification problems involve dividing data into separate classes. For example: disease detection, email spam filtering.

You give the model something -> it tells you which class it belongs to.

For classification, SVM is a discriminative classifier, i.e. it learns the boundary between different classes. It doesn't care much about how the data gets generated (i.e., the probability of features given a class); it focuses on “how do I separate them?”. It learns the relationship between the input features x and the target label y directly, by finding a decision boundary rather than modelling the data distribution.

SVM tries to find the best boundary (hyperplane) to divide the data points into different classes.

The optimal hyperplane is the best possible decision boundary, the one that divides the data points into separate classes with the maximum margin, i.e. the maximal perpendicular distance between the hyperplane and the support vectors.

Let’s talk about the margin in detail now. The margin is the distance between the hyperplane and the nearest points of both classes; you can think of it as a “safe buffer zone” between the two classes. A maximal margin helps the model generalize to unseen data.

The distance from the decision boundary (hyperplane) to the nearest data points of either class is the margin. Maximizing the margin keeps the classes as far from the boundary as possible, which improves accuracy on classification tasks.

The wider the margin, the better the generalization accuracy.
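In the usual textbook notation (assuming a linear boundary with weight vector w and bias b), the hyperplane, the two margin planes, and the margin width can be written as:

$$
w \cdot x + b = 0, \qquad w \cdot x + b = \pm 1, \qquad \text{margin} = \frac{2}{\lVert w \rVert}
$$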

Support vectors are the data points that lie on the positive and negative margin planes, i.e. the points closest to the decision boundary (the hyperplane). These points support/decide the position of the optimal hyperplane. Only these points matter when constructing the margin; the rest of the data has no direct effect on the hyperplane.

Convex hull

For 2D data points, the convex hull is the smallest convex shape that completely encloses them. SVMs don’t explicitly compute convex hulls, but the concept plays an important role in understanding class separability; the actual optimization is done through quadratic programming, not geometric hull computations.

The support vectors, the critical points closest to the decision boundary, often lie near or on the convex hulls of each class. By maximizing the margin between these convex boundaries, SVM finds the optimal hyperplane that best separates the data, especially when the classes are linearly separable.

To find the maximum-margin optimal hyperplane using convex hulls:

  1. Draw a convex hull for positive support vectors

  2. Draw a convex hull for negative support vectors

  3. Find the shortest possible line between the two hulls

  4. Draw the perpendicular bisector of that line, and voila!

That perpendicular bisector is the “optimal hyperplane”.
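Here is a rough geometric sketch of those four steps using scipy's ConvexHull on toy 2D data (purely illustrative; the closest-pair step is simplified to hull vertices, and this is not how scikit-learn actually solves the problem):

```python
import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import cdist

# Toy 2D data: two linearly separable clusters
rng = np.random.default_rng(0)
positive = rng.normal(loc=[2.0, 2.0], scale=0.4, size=(20, 2))
negative = rng.normal(loc=[-2.0, -2.0], scale=0.4, size=(20, 2))

# Steps 1 & 2: convex hull of each class (keep only the hull vertices)
pos_hull = positive[ConvexHull(positive).vertices]
neg_hull = negative[ConvexHull(negative).vertices]

# Step 3: shortest line between the two hulls
# (approximated here by the closest pair of hull vertices)
d = cdist(pos_hull, neg_hull)
i, j = np.unravel_index(d.argmin(), d.shape)
p, q = pos_hull[i], neg_hull[j]

# Step 4: the separating hyperplane is the perpendicular bisector of p-q:
# w . x + b = 0, with w along p-q and b chosen so the midpoint lies on the plane
w = p - q
b = -w.dot((p + q) / 2.0)
print("w =", w, "b =", b)
```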

Mathematics Behind Support Vector Machines

Hard Margin SVM

A Hard Margin SVM is a type of Support Vector Machine used for linearly separable data, where a straight line (or hyperplane in higher dimensions) can completely separate the two classes without any errors or overlaps.

No room for mistakes
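In the usual notation (labels yᵢ ∈ {−1, +1}), the hard-margin problem is written as:

$$
\min_{w,\,b} \ \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \ \text{ for all } i
$$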

Soft Margin

A Soft Margin SVM is an improved version of the hard margin SVM designed to handle real-world, noisy, or overlapping data that is not perfectly linearly separable.

In most real-world cases, data points from different classes overlap, and outliers or noise break perfect separability, so a strict hard margin (no mistakes) would fail or overfit.
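The usual soft-margin formulation handles this by giving every point a slack variable ξᵢ, so violations are allowed but penalized:

$$
\min_{w,\,b,\,\xi} \ \frac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0
$$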

C is a hyperparameter, always positive, that controls the trade-off between maximizing the margin and minimizing classification errors.

As C increases, Overfitting increases (Overfitting: When a model learns the training data too well, capturing noise and details, and performs poorly on new data.)

As C decreases, Underfitting increases (Underfitting: When a model is too simple to capture the underlying pattern, resulting in poor performance on both training and new data.)

When C increases, SVM draws a complex, tight boundary that captures every bit of noise, which leads to the model performing poorly on unseen data; when C decreases, SVM ignores outliers and draws clean, wide margins, which can leave the model performing poorly on both the training and test sets.

Hence, it is like a tuning knob. We have to tune this “knob” so that it doesn’t underfit or overfit.
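One common way to tune this knob is a cross-validated grid search; a minimal sketch (the candidate C values here are just illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Try a range of C values and keep the one with the best cross-validated accuracy
grid = GridSearchCV(SVC(kernel="rbf"), param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)

print("Best C:", grid.best_params_["C"])
print("Cross-validated accuracy:", grid.best_score_)
```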

Hinge Loss

Hinge loss is the cost function SVMs use to penalize incorrect or uncertain predictions. It focuses on not just whether a point is classified correctly, but how confidently it is classified.
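For a label y ∈ {−1, +1} and a raw decision score f(x) = w·x + b, the hinge loss on a single point is usually written as:

$$
L_{\text{hinge}}\bigl(y, f(x)\bigr) = \max\bigl(0,\ 1 - y\,f(x)\bigr)
$$

so the training objective can equivalently be written as margin regularization plus the total hinge loss:

$$
\min_{w,\,b} \ \frac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i\,(w \cdot x_i + b)\bigr)
$$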

We should note that the hinge-loss objective written this way and the soft-margin optimization problem above are the same problem, just expressed differently.

Dual form of SVM

It is a mathematical reformulation of the primal form of SVM that makes the training process more convenient, especially when using kernel functions.

It expresses the problem in terms of dot products only, making it more efficient for high-dimensional and non-linear data.

Both primal form and dual form are equivalent to each other.
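In the usual notation, the dual introduces a Lagrange multiplier αᵢ for every training point, and the data enters only through dot products xᵢ · xⱼ:

$$
\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\,(x_i \cdot x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C,\ \ \sum_{i=1}^{n} \alpha_i y_i = 0
$$

Replacing xᵢ · xⱼ with a kernel K(xᵢ, xⱼ) is exactly what makes the kernel trick possible.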

Kernel

A kernel in SVM is a mathematical trick that helps the algorithm handle non-linear data by transforming it into a higher-dimensional space without actually computing the transformation directly.

Working with non-linear data without explicitly transforming it

Instead of explicitly mapping the features into that higher-dimensional space, the kernel computes the dot products the mapped features would have, directly from the original features.
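A tiny numeric example of that idea with the degree-2 polynomial kernel K(a, b) = (a·b)²: the kernel value equals the dot product of explicitly mapped features, but we never had to build the mapping (a toy illustration, not scikit-learn internals):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

def phi(x):
    # Explicit feature map whose dot product reproduces (x . z)^2
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

explicit = phi(a).dot(phi(b))  # dot product in the higher-dimensional space
kernel = (a.dot(b)) ** 2       # same value, computed in the original space

print(explicit, kernel)        # both print 121.0
```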

Mercer’s Theorem

Mercer’s theorem tells us what conditions a function must satisfy to be considered a valid kernel function.

The first condition is that it should be symmetric, since a kernel function is a dot product of mapping functions.

The second condition involves positive semi-definiteness: a matrix is called positive semi-definite if all its eigenvalues are non-negative.

In kernel methods (like SVM with kernels), the kernel matrix must be positive semi-definite to ensure that:

  • The optimization problem remains convex.

  • It corresponds to a valid inner product in some high-dimensional space.

  • The solution is stable and meaningful.
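A quick numeric sanity check of these conditions: build the kernel (Gram) matrix for some sample points and confirm it is symmetric with non-negative eigenvalues (an illustrative check, not a proof of Mercer's condition):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.default_rng(0).normal(size=(50, 3))

K = rbf_kernel(X, X, gamma=0.5)        # Gram matrix of the RBF kernel
eigenvalues = np.linalg.eigvalsh(K)    # eigvalsh is for symmetric matrices

print("Symmetric:", np.allclose(K, K.T))
print("Smallest eigenvalue:", eigenvalues.min())  # ~0 or positive => positive semi-definite
```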

SVR- Support Vector Regression

SVR is a regression version of SVM. Instead of classifying data points, SVR aims to predict a continuous value while maintaining the core ideas of margin and support vectors.

Our goal in SVR is to find a function (a hyperplane) that fits the data. Instead of separating data into classes, we want to approximate the data points within a certain error margin.

Here, the margin determines how far off a prediction is allowed to be from the actual value.

Epsilon-Insensitivity Tube

It is a tube of width 2ε around the regression line. Predictions within this range are not penalized; predictions outside it contribute to the cost function. This makes the model flexible and robust to noise.

SVR tries to minimize:

  • Model complexity (the flatness of the model)

  • Error outside the ε tube, measured with slack variables ξi, ξi* that handle deviations beyond ε
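A minimal SVR sketch with scikit-learn, where the epsilon parameter sets the half-width of the insensitivity tube and C penalizes points falling outside it (the values are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

# Noisy 1D regression data
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# epsilon: half-width of the tube; C: penalty for errors outside the tube
model = SVR(kernel="rbf", C=10, epsilon=0.1)
model.fit(X, y)

print("Number of support vectors:", len(model.support_))
print("Prediction at x = 2.5:", model.predict([[2.5]]))
```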

Time Complexity

The training time complexity of SVM is roughly O(n²), where n is the number of training samples.

If n is large, this becomes very expensive. Hence, SVM is not usually used in low-latency applications.

The run-time (prediction) complexity of SVM is O(k·d), where k is the number of support vectors and d is the dimensionality of the data.

SVM Implementation

To implement SVM for classification, we import SVC from the scikit-learn library. Then we build the model using SVC(). Next, we fit our model on the training data.

X represents the features here, and y represents the labels.
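A minimal sketch of that flow, using tiny made-up arrays for X and y:

```python
from sklearn.svm import SVC

# Toy training data: X holds the features, y holds the class labels
X = [[0, 0], [0, 1], [2, 2], [2, 3]]
y = [0, 0, 1, 1]

model = SVC()    # build the model (default: RBF kernel, C=1.0)
model.fit(X, y)  # fit the model on the training data

print(model.predict([[2, 2.5]]))  # -> [1]
```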

SVC, NuSVC and LinearSVC are classes in the scikit-learn library capable of performing binary and multi-class classification on a dataset.

SVC() has multiple hyperparameters we can tune to get better results from our model.

Commonly Tuned Hyperparameters:

(1) C

(2) kernel

(3) gamma

(4) degree (for poly)

(5) class_weight

Example code:

This code demonstrates how to train and visualize an SVM (Support Vector Machine) classifier using the breast cancer dataset from sklearn. It:

  1. Loads the dataset and selects only the first two features for 2D visualization.

  2. Trains an SVM model with an RBF kernel using specified hyperparameters (gamma=0.01, C=10).

  3. Uses DecisionBoundaryDisplay to visualize the model’s decision boundary.

  4. Overlays a scatter plot of the data points, colored by class label.

  5. Displays the feature names on the axes for interpretability.
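A reconstruction of that example (the dataset, kernel, and hyperparameters follow the description above; the exact plotting details are reasonable assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.svm import SVC

# 1. Load the dataset and keep only the first two features for 2D visualization
data = load_breast_cancer()
X = data.data[:, :2]
y = data.target

# 2. Train an SVM with an RBF kernel and the stated hyperparameters
model = SVC(kernel="rbf", gamma=0.01, C=10)
model.fit(X, y)

# 3. Visualize the model's decision boundary
disp = DecisionBoundaryDisplay.from_estimator(
    model,
    X,
    response_method="predict",
    alpha=0.5,
    xlabel=data.feature_names[0],  # 5. feature names on the axes
    ylabel=data.feature_names[1],
)

# 4. Overlay the data points, colored by class label
disp.ax_.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")

plt.show()
```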
