Unveiling the Power of Support Vector Machines in Classification Tasks

Support Vector Machines (SVMs) are among the most effective and widely used classification algorithms in machine learning. At its core, the SVM algorithm works by finding the optimal boundary, called a hyperplane, that best separates the different classes in a dataset.

  • In the case of 2-dimensional data, this boundary is a line.

  • For 3-dimensional data, it becomes a plane.

  • And for higher dimensions, it is referred to as a hyperplane.

To understand this concept intuitively, let's begin by exploring how SVM behaves with a simple 2D dataset. We'll start by generating an arbitrary dataset for visualization and demonstration purposes.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.svm import SVC

def making_df(slope_type=1, n=100, p1=100, p2=30, p3=2):
    # n seeds the random generator, p1 is the number of points per class,
    # p2 scales the spread, and p3 controls how far apart the two clusters sit
    np.random.seed(n)

    p = np.random.randn(p1) * p2
    q = np.random.randn(p1) * p2

    # offset the second cluster diagonally from the first;
    # the direction of the offset depends on slope_type
    if slope_type == 1:
        r = p + p3 * p.max()
        s = q - p3 * q.max()
    else:
        r = p + p3 * p.max()
        s = q + p3 * q.max()

    df1 = pd.DataFrame({'ones': np.ones(len(p)), 'feature1': p,
                        'feature2': q, 'label': np.ones(len(p))})

    df2 = pd.DataFrame({'ones': np.ones(len(r)), 'feature1': r,
                        'feature2': s, 'label': [-1] * len(r)})

    df = pd.concat([df1, df2], axis=0)
    df = df.sample(n=len(df))   # shuffle the rows
    return [df, p, q, r, s]

To avoid diving too deeply into the implementation details, let’s assume that the above-defined function returns a dataframe consisting of two visually separable clusters of data points. These clusters represent two distinct classes:

  • (p, q) for negative class data points

  • (r, s) for positive class data points

The function also accepts a few input parameters, such as the desired slope direction (positive or negative) and other customization options for generating the dataset. Since we’ll be generating dataframes multiple times throughout our practical exploration, having a reusable function makes the process more efficient.

Now, to understand how SVM finds a boundary between these classes, we refer to the general equation of a straight line:

$$ax + by + c = 0$$

Where:

  • x and y are the coordinate variables

  • a, b, and c are constants

From this, we can derive:

  • Slope of the line: −a/b

  • Y-intercept: −c/b

Without delving further into theory just yet, let’s generate an arbitrary dataframe and plot it. This visual representation will help us clearly see the separation between classes and better understand what we are working with.

df, p, q, r, s = making_df(slope_type=1, n=40)

plt.figure(figsize=(20, 8))
plt.scatter(p, q, facecolors='purple', marker='_', s=100)
plt.scatter(r, s, facecolors='red', marker='+', s=100)
plt.xlabel('Feature 1', fontsize=16)
plt.ylabel('Feature 2', fontsize=16)

plt.show()

Now we have a clear dataset with distinct data points, conveniently labeled as ‘+1’ and ‘-1’.

def primary_fit(weight):
    # normalise the first two components (a, b) of the weight vector
    def nor(lst):
        factor = (1 / sum([i**2 for i in lst[:2]])) ** 0.5
        return [i * factor for i in lst[:2]]

    a, b = nor(weight)
    c = weight[2]

    # draw the line ax + by + c = 0 from its slope and intercept
    def st(a, b, c):
        slope = -(a / b)
        intercept = -(c / b)
        x = np.concatenate([p, r])
        y = slope * x + intercept
        plt.plot(x, y)
        return [slope, intercept]

    plt.figure(figsize=(20, 8))
    plt.scatter(p, q, facecolors='purple', marker='_', s=100)
    plt.scatter(r, s, facecolors='red', marker='+', s=100)
    plt.xlabel('Feature 1', fontsize=16)
    plt.ylabel('Feature 2', fontsize=16)
    st(a, b, c)
    plt.show()


We can try to fit a line using a = -1, b = 1, and c = 110:

primary_fit([-1, 1, 110])

With the selected values of a, b, and c, we were able to fit a line of the form \(ax + by + c = 0\).

This line successfully separates the data points into two distinct clusters or classes. In the context of Support Vector Machines (SVM), this separating boundary is referred to as a hyperplane.

  • In 2D, the hyperplane is simply a line.

  • In 3D, it becomes a 2D plane.

  • In higher dimensions (4D, 5D, and beyond), the hyperplane is a flat subspace with one dimension fewer than the data itself; we can no longer visualize it directly, but the idea stays the same.

From the plotted graph, we can logically infer:

  • If ax+by+c > 0, the point can be classified as belonging to one class (say, –1)

  • If ax+by+c < 0, the point belongs to the other class (say, +1); the short sketch below turns this sign check into code.
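
In code, this decision rule is nothing more than a sign check. The sketch below is purely illustrative: the helper side_of_line and the coefficients a = -1, b = 1, c = 110 (the line plotted above) are hand-picked for demonstration, not the output of any fitted model.

def side_of_line(x, y, a=-1, b=1, c=110):
    # the sign of ax + by + c tells us which side of the line the point lies on;
    # following the convention above, the positive side maps to class -1
    value = a * x + b * y + c
    return -1 if value > 0 else +1

print(side_of_line(0, 0))       # falls on the positive side of the line -> -1
print(side_of_line(150, -150))  # falls on the negative side of the line -> +1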

This behavior forms the basis for classification using SVM. However, one interesting observation is that different values of a, b, and c can generate different hyperplanes, each potentially separating the points in a slightly different manner.

Let’s explore this further by fitting another line using a new set of parameters:

a=−1, b=3, c=80

We’ll visualize how this new line affects the classification boundary and compare it with the previous one.
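
Using the same primary_fit helper defined earlier, the corresponding call is:

primary_fit([-1, 3, 80])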

We observe that with every new combination of values for a, b, and c, an entirely different line is produced. This illustrates a key challenge in Support Vector Machines:

  1. The first issue is that the parameters a, b, and c need to be carefully optimized in order to find the best possible separating line, one that generalizes well to unseen data.

  2. The second issue arises when, during real-world application, new data points appear that are either very close to the decision boundary or even on the opposite side of it. These edge cases can lead to misclassifications and reduce the model’s reliability.

To address both of these concerns, SVM introduces the concept of margins.

What are Margins?

Simply put, margins are the regions on either side of the classifier (or hyperplane) that define the boundaries of each class. More formally:

A margin is the sum of the perpendicular distances from the classifier to the closest data point in each class.

The goal of an SVM is not just to find any separating line, but to find the one that maximizes this margin. This optimal boundary is known as the maximum margin classifier and is considered to be the most robust in terms of generalization.

In the next section, we’ll visualize these margins and understand how they make SVM one of the most powerful classification algorithms available.

🧠 SVM Formulation

The fundamental goal of a Support Vector Machine (SVM) is to find a hyperplane that not only separates the data but does so with the maximum possible margin. This margin is the buffer zone that helps ensure better generalization when the model is applied to unseen data.

Mathematically, this objective can be expressed as:

$$l_i\cdot (W \cdot y_i) \geq M$$

Where:

  • \(l_i\) is the label of the \(i^{th}\) training data point (either +1 or -1),

  • \(y_i\) is the feature vector of the \(i^{th}\) point,

  • \(W\) is the weight vector perpendicular to the hyperplane,

  • \(M\) is the margin we want to maximize.

🧾 Understanding the Expression

The term \(W \cdot y_i\) represents the projected distance of the point from the hyperplane. Whenever a point is correctly classified, \(l_i\) (the class label) and this projection share the same sign, so their product is positive for every point on the correct side of the margin.

Thus, the constraint \( l_i\cdot (W \cdot y_i) \geq M\) ensures that all data points lie outside or on the margin, reinforcing the separation.


📐 Distance from a Point to a Hyperplane

For any point \(P\), the perpendicular distance \(d\) from a line (or hyperplane) defined by \(ax+by+c=0\) is given by:

$$d=\frac{|a x + b y + c|}{\sqrt{a^2 + b^2}}$$

This formula helps quantify how close each point is to the classifier. In the context of SVM, we are particularly interested in the closest points from both classes—these are the support vectors.
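
As a quick illustration, here is a small sketch that applies this formula to a batch of points; the helper point_to_line_distance and the sample coordinates are illustrative, not part of the original notebook.

def point_to_line_distance(x, y, a, b, c):
    # |ax + by + c| / sqrt(a^2 + b^2), vectorised over arrays of points
    return np.abs(a * x + b * y + c) / np.sqrt(a**2 + b**2)

# distances of a few sample points from the line -x + y + 110 = 0
xs = np.array([0.0, 150.0, 50.0])
ys = np.array([0.0, -150.0, -60.0])
print(point_to_line_distance(xs, ys, a=-1, b=1, c=110))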


🎯 Goal of SVM

The classifier we are searching for is the one that lies as far as possible from the closest points (the support vectors) of both classes. In other words, the SVM algorithm maximizes the minimum distance from the boundary to any data point, giving the most robust separation.

This leads to the optimization problem at the heart of SVM:

$$\min_{W,\, b} \; \frac{1}{2}\|W\|^2 \quad \text{subject to} \quad l_i\,(W \cdot y_i + b) \geq 1$$

In simple terms, this optimization tries to:

  • Minimize the norm of the weight vector (which, as the short derivation below shows, is equivalent to maximizing the margin),

  • While making sure that all data points are correctly classified and lie outside the margin.
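
To see why minimizing \(\|W\|\) maximizes the margin, note that the constraint is tight at the support vectors, i.e. \(l_i (W \cdot y_i + b) = 1\) for the closest points of each class. Using the distance formula above, the perpendicular distance from such a point to the hyperplane \(W \cdot y + b = 0\) is

$$\frac{|W \cdot y_i + b|}{\|W\|} = \frac{1}{\|W\|},$$

so the full margin, with one such distance on each side of the classifier, has width \(\frac{2}{\|W\|}\). Making the margin as wide as possible is therefore the same as making \(\|W\|\), or the more convenient \(\frac{1}{2}\|W\|^2\), as small as possible.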

def sup(weight, epsilon=0):
    def nor(lst):
        factor = (1/sum([i**2 for i in lst[:2]]))**0.5
        return [i*factor for i in lst[:2]]

    a, b = nor(weight)
    c = weight[2]

    def st(a,b,c):
        slope = -(a/b)
        intercept = -(c/b)

        x = np.concatenate([p,r])
        y = slope*x + intercept
        return [x,y]

    # evaluate c*1 + a*feature1 + b*feature2 for every point (a and b are unit-normalised)
    df['distances'] = np.matmul(np.array(df.iloc[:,:3]),np.array([c,a,b]))

    # multiplying by the label gives l_i * (W . y_i), matching the SVM formulation above
    df['formulation'] = df.label * df.distances

    if -(a/b)>0:
        df_neg = df[df.label<0].sort_values('formulation')
        df_pos = df[df.label>0].sort_values('formulation')
    elif -(a/b)<0:
        df_neg = df[df.label<0].sort_values('formulation', ascending=False)
        df_pos = df[df.label>0].sort_values('formulation', ascending=False)



    neg_dist = df_neg.formulation.values[0]
    pos_dist = df_pos.formulation.values[0]


    def vert_dist(distance):
        # convert a perpendicular distance from the line into a vertical (y-axis) offset
        import math
        slope = -(a/b)
        angle = abs(math.degrees(math.atan(slope)))
        vertical = distance/math.cos(math.radians(angle))
        return vertical

    neg_vert_dist = vert_dist(neg_dist)
    pos_vert_dist = vert_dist(pos_dist)

    intercept = -(c/b)
    pos_intercept = intercept + pos_vert_dist
    neg_intercept = intercept - neg_vert_dist
    c_intercept = min([neg_intercept, pos_intercept]) + abs((neg_intercept - pos_intercept)/2)

    c_pos = -(pos_intercept*b)
    c_neg = -(neg_intercept*b)
    c_new = -(c_intercept*b)

    if epsilon != 0:

        # soft margins: shrink the full margin by a factor of (1 - epsilon)
        # and centre it on the classifier
        margin = (neg_dist+pos_dist)*(1-epsilon)
        distance = margin/2
        dist_vert_dist = vert_dist(distance)

        pos_intercept = c_intercept + dist_vert_dist
        neg_intercept = c_intercept - dist_vert_dist
        c_pos = -(pos_intercept*b)
        c_neg = -(neg_intercept*b) 


    plt.figure(figsize=(20,8))
    plt.scatter(p, q, facecolors='purple', marker='_', s=100)
    plt.scatter(r,s, facecolors='red', marker='+', s=100)
    plt.xlabel('Feature 1', fontsize=16)
    plt.ylabel('Feature 2', fontsize=16)

    plt.plot(st(a,b,c_new)[0], st(a,b,c_new)[1], linewidth=3.5, label='classifier')
    plt.plot(st(a,b,c_pos)[0], st(a,b,c_pos)[1], '--', linewidth=1, alpha=0.8, label='negative margin')
    plt.plot(st(a,b,c_neg)[0], st(a,b,c_neg)[1], '--', linewidth=1, alpha=0.8, label='positive margin')
    plt.legend()
    plt.show()


weights = [-1,1,110]
sup(weights)
💡
Note: The above function is solely intended for visualization and demonstration purposes to aid conceptual understanding. It does not represent the actual model-building process in practice.

The negative and positive margin lines pass through the nearest negative and positive data points to the classifier, respectively. However, there can be instances where data points lie very close to the classifier. In such scenarios, the model may attempt to overfit the data by shrinking the margin, as it strives to achieve the ideal condition of zero classification errors. Let’s visualize how this behavior manifests.

# 'overfit' is an extra option of the fuller sup() used to produce the article's
# figures; it is not part of the simplified definition shown above
sup(weights, overfit=True)

While this overfitted model may perform exceptionally well during training, it suffers a significant drawback. Due to the narrow margin—i.e., a highly constrained decision boundary—the model is more likely to misclassify data during evaluation or testing. Let’s understand this visually. Suppose we receive a new data point that truly belongs to the positive class but falls just above the negative margin. The model, relying on the tight boundary, would incorrectly classify it as ‘-1’. This misclassification leads to a drop in overall accuracy.

To address this issue, Support Vector Machines introduce a slack variable (ϵ) into the formulation. This allows the model to tolerate some degree of misclassification, thereby increasing its generalization ability.

The updated SVM constraint becomes:

$$l_i\cdot (W \cdot y_i) \geq M(1 - \epsilon_i)$$

From this formulation, we can infer that the required margin shrinks as the slack grows: with \(M = 1\) and \(\epsilon_i = 0.4\), for example, a point only has to satisfy \(l_i\cdot (W \cdot y_i) \geq 0.6\). The slack variable ϵᵢ represents the degree of allowable error for each data point; in other words, it measures how much that point is permitted to violate the margin.

A higher value of ϵ indicates more flexibility, meaning the model is allowed to misclassify certain points, thus preventing overfitting. Let’s visualize the impact of this by setting ϵ = 0.6 and examining how the decision boundary adapts.
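
Using the visualization helper defined earlier (only the epsilon argument is needed for this plot):

sup(weights, epsilon=0.6)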

Now the model generalizes better, because a few training data points are allowed to cross the margin boundaries. These relaxed thresholds are known as soft margins, and they define the extent to which the model can tolerate errors in the training set without significantly compromising its generalization capabilities.

By incorporating this flexibility through the slack variable ε, the model avoids overfitting and becomes more robust to unseen data. Let's now revisit the same scenario illustrated previously, where the overfitted model misclassified a positive point due to overly narrow margins, and observe how the introduction of soft margins improves the outcome.

# 'test_pos_pos', like 'overfit' above, belongs to the fuller sup() behind the
# article's figures; it appears to add the test point (the red dot) described below
sup(weights, epsilon=0.6, test_pos_pos=48)

Unlike the earlier overfitted model, this generalized version will no longer misclassify the data point (represented by the red dot), as it now falls within the permissible margin. This demonstrates how the introduction of soft margins helps the model tolerate minor violations and improves overall robustness.

I hope this demonstration has provided a clear understanding of the foundational concepts behind Support Vector Machines. Now, let's explore an additional insight: as the value of ε increases, the soft margin becomes narrower. In the extreme case ε = 1, the soft margins coincide exactly with the classifier, since the condition becomes \(l_i\cdot (W \cdot y_i) \geq 0\); with ε = 0 we are back to the original hard margin. Let's visualize how the fit changes with different values of ε.

# 'show' is another option of the fuller sup(); with the simplified definition
# above, sup(weights, epsilon=i) produces a comparable plot
for i in np.arange(0.2, 1.1, 0.2):
    sup(weights, epsilon=i, show=True)

The greater the value of ε, the simpler the model becomes. A larger ε allows the model to be more flexible, reducing its tendency to fit the classifier too aggressively and thereby avoiding overfitting.

However, all these formulations assume that the classes are linearly separable—an ideal condition that rarely holds true in real-world scenarios. To better reflect practical situations, let’s now construct another dataset that introduces some overlap between classes, mimicking the complexity of real-world data.

# outer class: points that lie at least 10 units from the origin (plus mirror images)
x = np.linspace(-5.0, 2.0, 100)
y = np.sqrt(10**2 + x**2)
y = np.hstack([y, -y])
x = np.hstack([x, -x])

# inner class: points on a circle of radius 5 around the origin
x1 = np.linspace(-5.0, 2.0, 100)
y1 = np.sqrt(5**2 - x1**2)
y1 = np.hstack([y1, -y1])
x1 = np.hstack([x1, -x1])

df1 = pd.DataFrame(np.vstack([y, x]).T, columns=['X1', 'X2'])
df1['Y'] = 0
df2 = pd.DataFrame(np.vstack([y1, x1]).T, columns=['X1', 'X2'])
df2['Y'] = 1
df = pd.concat([df1, df2])   # DataFrame.append is deprecated in newer pandas

df.head()

Let's take a look at the graphical representation of the data.
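
A quick scatter plot does the job; this is a minimal matplotlib sketch, with class 0 being the outer set of points and class 1 the inner circle:

plt.figure(figsize=(10, 8))
plt.scatter(df[df.Y == 0].X1, df[df.Y == 0].X2, s=8, label='class 0')
plt.scatter(df[df.Y == 1].X1, df[df.Y == 1].X2, s=8, label='class 1')
plt.xlabel('X1', fontsize=13)
plt.ylabel('X2', fontsize=13)
plt.legend(fontsize=12)
plt.show()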

As we can see, these data points are not linearly separable. To handle such cases, the concept of Kernels was introduced. Kernels are mathematical functions that, when applied to the data points from different features, transform them into a higher-dimensional space, allowing us to better classify the data. Essentially, a kernel function allows us to find a decision boundary in a transformed feature space, even when the original space is not linearly separable.

There are several types of kernels, each of which performs a different transformation. While the mathematical expressions for these kernels vary, their main goal is the same: to map the data points into a higher-dimensional space where they become linearly separable.

For simplicity, let’s consider one of the most commonly used kernels: Polynomial Kernel.

$$K(x,y)=(x^Ty+c)^d$$

Above is the mathematical expression for the polynomial kernel, where \(x\) and \(y\) are two data points, \(c\) is a constant, and \(d\) is the degree. Let's now calculate the transformation and observe the new dimensions formulated by the kernel. In our dataset, each data point has two components, \(X_1\) and \(X_2\).

When applying the Polynomial kernel function, we are essentially mapping the input features (from a lower-dimensional space) into a higher-dimensional space. The transformation process allows us to achieve linear separability in situations where the original data is not linearly separable.

We can compute the transformed feature space for our dataset by applying the kernel to pairs of data points, where each pair is processed through the following:
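
For a degree-2 polynomial kernel applied to two points, written here as \(u = (u_1, u_2)\) and \(v = (v_1, v_2)\) to avoid clashing with the feature names, the expansion is:

$$K(u, v) = (u_1 v_1 + u_2 v_2 + c)^2 = u_1^2 v_1^2 + u_2^2 v_2^2 + 2\,u_1 u_2\, v_1 v_2 + 2c\, u_1 v_1 + 2c\, u_2 v_2 + c^2$$

Ignoring the lower-order and constant terms, this is (up to constant factors) an inner product of feature vectors built from the squared and cross terms of each point, which is exactly where the new dimensions below come from.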

From the above derivation, we obtained three new dimensions, namely:

  1. \(X_1^2\)

  2. \(X_2^2\)

  3. \(X_1 \cdot X_2\)

Let's calculate these new dimensions and integrate them into the original dataframe.

df['X1_Square']= df['X1']**2
df['X2_Square']= df['X2']**2
df['X1X2'] = (df['X1'] *df['X2'])
df.head()

Now, let’s plot these 3 newly formed dimensions in a 3D graph.

# class 0 points in the transformed feature space
df1 = df[df.Y == 0]
a = np.array(df1.X1_Square)
b = np.array(df1.X2_Square)
c = np.array(df1.X1X2)

# creating the 3D figure
plt.figure(figsize=(13, 13))
ax = plt.axes(projection="3d")
ax.scatter3D(a, b, c, s=8, label='class 0')

# class 1 points in the transformed feature space
df1 = df[df.Y == 1]
a = np.array(df1.X1_Square)
b = np.array(df1.X2_Square)
c = np.array(df1.X1X2)
ax.scatter3D(a, b, c, s=8, label='class 1')

ax.set_xlabel('X1_Square', fontsize=13)
ax.set_ylabel('X2_Square', fontsize=13)
ax.set_zlabel('X1X2', fontsize=13)
plt.legend(fontsize=12)

ax.view_init(30, 250)

# show plot
plt.show()

After applying the kernel, we can easily imagine fitting a 2D plane between the two classes, using all the theoretical concepts discussed earlier. Unlike before the transformation, where the data was not linearly separable, there now exists a hyperplane that can successfully classify both classes. This becomes possible only because of the additional dimensions introduced over the initial 2D Cartesian plane, which allow the model to find a suitable separating boundary in the transformed feature space.

I hope this explanation provided a clear understanding of the Support Vector Machine (SVM) classifier algorithm and the internal process it follows. Although the practical implementation using automated processes with libraries (such as scikit-learn) has not been discussed here, I can assure you that once the working of the algorithm is clear, model building becomes pretty straightforward.

For further details and practical implementation, you can refer to the scikit-learn SVC documentation.
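
As a quick taste of what that looks like, here is a minimal sketch using the SVC class imported at the beginning; the dataframes and column names are the ones built in this article, and the hyperparameters are illustrative defaults rather than tuned values:

# linear kernel on the linearly separable toy data produced by making_df()
df_linear = making_df(slope_type=1, n=40)[0]
clf_linear = SVC(kernel='linear', C=1.0)
clf_linear.fit(df_linear[['feature1', 'feature2']], df_linear['label'])

# polynomial kernel (degree 2) on the circular data, mirroring the manual transformation above
clf_poly = SVC(kernel='poly', degree=2, C=1.0)
clf_poly.fit(df[['X1', 'X2']], df['Y'])
print(clf_poly.score(df[['X1', 'X2']], df['Y']))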

💡
For more insights, projects, and articles, visit my portfolio at tuhindutta.com.