The Algorithm That Outsmarted Expectations: Beating Image Tasks Without a Single Layer of Deep Learning

Precious Robert

Classification is simply seeing patterns: noticing what one object has that another doesn't.

Imagine This

It is the mid-20th century. Scientists are puzzling over a big question:

Can a machine tell an apple from an orange?

Sounds simple, right? Apples are red and smooth. Oranges are orange and bumpy. Give a machine a few examples, and it should “learn.”

But here’s the twist: when objects were almost the same, early machines failed miserably. For example, two apples, slightly different shades or sizes? Confused. Handwritten letters with slightly different loops? Misread.

The problem wasn’t that machines couldn’t see. It was that they couldn’t see subtle patterns.

History

Ronald A. Fisher, a statistician, geneticist, and biologist, is widely known as the father of modern statistics. From the 1930s through the 1950s, much of his work focused on one central challenge: how to classify living things. He didn't just analyze data; he asked bold questions. In 1936, he posed one that quietly shaped the future of machine learning:

“Given measurements from two groups, how can we find a rule that best separates them so we can classify new observations?”

This led to Linear Discriminant Analysis (LDA), a method for drawing a line between groups.

Neural Networks Join the Scene

Decades later, scientists took another leap. They built the perceptron, an early neural network. It was like giving machines a tiny brain that could draw a line through data and say:

“This side is apples, that side is oranges.”

But it was limited. It could only handle simple separations. When the data got tricky, it stumbled.

Then came backpropagation.

This was the breakthrough that let neural networks learn complex, layered patterns. Suddenly, machines could classify handwritten digits, faces, even speech.

So the natural question is:

What was the point? Why invent something new if backpropagation already separated the data?

The truth is that backpropagation could separate the data, but it often overfit. Neural networks back then memorized patterns instead of learning rules that held up on unseen data.

This is where the Support Vector Machine (SVM) stepped in.

SVM asked a smarter question:

👉 “Not just any line. What’s the optimal line?”

Mathematically, it searched for the hyperplane with the widest margin: the greatest distance between the boundary and the closest data points. This helped SVM because, even with very high-dimensional data, it resisted overfitting by focusing only on the small set of closest points that define the boundary. Those closest points are called support vectors, and they're the real heroes: they alone shape the boundary, while the rest of the data doesn't matter.

This simple but powerful idea meant SVM didn’t just memorize the training set. It learned to generalize.

SVM flipped the question. Instead of asking “How do I separate this dataset?” it asked:

“Which separation will generalize best to the unknown?”

If you ask me, that alone is powerful.

The Math Journey

Imagine you have two groups of points on a piece of paper: red dots and blue dots.

The problem: How do we draw a line that separates them so clearly that, when a new dot arrives, we can quickly decide which group it belongs to?

1. The Geometry

Let’s say your input data points are vectors:

$$x \in \mathbb{R}^n$$

We want a hyperplane (a flat sheet in high dimensions) that separates the classes.

The hyperplane is written as:

$$w \cdot x + b = 0$$

w is a vector that determines the orientation of the hyperplane.

b shifts the hyperplane.

The sign of w · x + b tells us which side of the hyperplane a point lies on: positive for one class, negative for the other.
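To make this concrete, here is a tiny sketch where the values of w, b, and the new point are made up purely for illustration; the sign of w · x + b is all we need to classify:

import numpy as np

# Hypothetical hyperplane parameters and a new point (illustrative values only)
w = np.array([1.0, -2.0])   # orientation of the hyperplane
b = 0.5                     # offset
x_new = np.array([3.0, 1.0])

score = np.dot(w, x_new) + b   # which side of the boundary, and how far from it
label = np.sign(score)         # +1 for one class, -1 for the other

print(score, label)  # 1.5 1.0 -> this point lands on the +1 side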

2. The Margin

There are infinitely many lines (or hyperplanes) that can separate the two groups.

SVM introduces the margin:

The margin is the distance from the hyperplane to the closest data points.

We don’t just want to separate the points—we want to separate them with the maximum margin.

Mathematically:

$$\text{margin} = \frac{2}{\|w\|}$$
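Where does that 2/‖w‖ come from? The distance from any point x to the hyperplane is

$$\frac{|w \cdot x + b|}{\|w\|}$$

and SVM scales w and b so that the closest points on each side satisfy |w · x + b| = 1. Each side therefore sits at distance 1/‖w‖ from the boundary, and the gap between them is 2/‖w‖. Maximizing the margin is the same as minimizing the norm of w.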

So the optimization problem becomes:

$$\min \frac{1}{2}\|w\|^2 \quad \text{subject to } y_i(w \cdot x_i + b) \geq 1$$

This is a convex optimization problem, which means it can be solved efficiently.
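For completeness (the derivation via Lagrange multipliers is beyond this post), the same problem can be rewritten in its standard dual form, where the data appears only through dot products:

$$\max_{\alpha} \ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{subject to } \alpha_i \geq 0, \ \sum_i \alpha_i y_i = 0$$

Keep that dot product in mind: it is exactly where the kernel trick below plugs in.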

3. Support Vectors

Not all points matter for defining the hyperplane. Only the ones closest to the boundary matter. These are called support vectors.

So, instead of remembering the whole dataset, the SVM only “remembers” the critical few points that shape the boundary.
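If you'd rather not write the optimizer yourself, here is a minimal sketch using scikit-learn (assuming it is installed, with toy data invented for the example); the fitted model exposes exactly those critical points through support_vectors_:

import numpy as np
from sklearn.svm import SVC

# Two small, well-separated clusters (toy data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

print(clf.support_vectors_.shape)  # only a handful of points define the boundary
print(clf.n_support_)              # number of support vectors per class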

4. The Kernel Trick

What if the data isn’t linearly separable? What if the red and blue dots are twisted like spirals?

The trick is to:

Map the data into a higher-dimensional space using a non-linear mapping:

$$\phi: \mathbb{R}^n \to Z$$

In this new space, the data might become linearly separable. But computing the mapping explicitly for every point would be too expensive.

So mathematicians invented the kernel trick:

$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$$

This lets us compute in the high-dimensional space without ever going there directly.
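A tiny numeric check makes this concrete. For the degree-2 polynomial kernel K(x, z) = (x · z)², one explicit feature map for 2-D inputs is φ(x) = (x₁², x₂², √2 x₁x₂). The sketch below (with arbitrary example values) shows the kernel returning the same number as the explicit mapping, without ever constructing it:

import numpy as np

def phi(x):
    # One explicit degree-2 feature map for a 2-D input
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    # Same quantity computed straight from the dot product, no phi needed
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(np.dot(phi(x), phi(z)))  # 121.0 (explicit mapping)
print(poly_kernel(x, z))       # 121.0 (kernel trick, same answer)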

5. Infinite-Dimensional Surprise

Even if the space is infinite-dimensional (like with the Gaussian kernel), the generalization still works. Why? Because the bound on generalization depends on the number of support vectors, not the dimensionality.

That’s why SVMs can separate data in what feels like a billion-dimensional feature space—without collapsing under the “curse of dimensionality.”
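As a quick illustration, here is a hedged sketch (again assuming scikit-learn) that fits an SVM with the Gaussian (RBF) kernel, K(x, z) = exp(−γ‖x − z‖²), on data no straight line can separate, then counts how many support vectors actually carry the solution:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable in 2-D
X, y = make_moons(n_samples=200, noise=0.1, random_state=42)

clf = SVC(kernel='rbf', gamma=1.0, C=1.0)
clf.fit(X, y)

print("Training accuracy:", clf.score(X, y))
print("Support vectors used:", clf.support_vectors_.shape[0], "out of", len(X))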

Conclusion

Think of SVM like building a fence. I know because I have one of those neighbors—the kind who always wanders into your yard. Kids, pets, even their barbecue smoke somehow find their way over. An SVM would draw the perfect line, keeping them on their side while giving you the biggest margin of peace. So next time they cross over, just grin and think: “If only I could drop a hyperplane here.” 😅

SVM from scratch in Python

import numpy as np
import matplotlib.pyplot as plt

# Step 1: Generate some sample data
np.random.seed(42)

# Two classes of points
X_class1 = np.random.randn(20, 2) - [2, 2]   # shifted cluster
X_class2 = np.random.randn(20, 2) + [2, 2]   # shifted cluster

X = np.vstack((X_class1, X_class2))
y = np.hstack((-1 * np.ones(20), 1 * np.ones(20)))  # labels: -1 and +1

# Step 2: Initialize parameters
w = np.zeros(X.shape[1])  # weights
b = 0.0                   # bias

learning_rate = 0.001     # step size for the gradient updates
lambda_param = 0.01       # regularization strength (encourages a wide margin); value chosen for this demo
epochs = 1000             # number of passes over the data; value chosen for this demo

# Step 3: Training loop
for epoch in range(epochs):
    for i, x_i in enumerate(X):
        condition = y[i] * (np.dot(x_i, w) - b) >= 1
        if condition:      # Correct side of the margin: only regularization
            w -= learning_rate * (2 * lambda_param * w)
        else:              # Misclassified or within margin
            w -= learning_rate * (2 * lambda_param * w - x_i * y[i])
            b -= learning_rate * y[i]

# Step 4: Prediction function
def predict(X):
    return np.sign(np.dot(X, w) - b)

# Step 5: Plot the results
def plot_svm(X, y, w, b):
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', alpha=0.7)
    # Hyperplane: w.x - b = 0, margins: w.x - b = +1 and w.x - b = -1
    x_min, x_max = plt.xlim()
    x_vals = np.linspace(x_min, x_max, 100)
    y_vals = -(w[0] * x_vals - b) / w[1]            # decision boundary
    y_vals_plus = -(w[0] * x_vals - b - 1) / w[1]   # upper margin line
    y_vals_minus = -(w[0] * x_vals - b + 1) / w[1]  # lower margin line
    plt.plot(x_vals, y_vals, 'k-')
    plt.plot(x_vals, y_vals_plus, 'g--')
    plt.plot(x_vals, y_vals_minus, 'g--')
    plt.title("SVM from Scratch (Linear)")
    plt.show()

plot_svm(X, y, w, b)

# Step 6: Test predictions
print("Predictions:", predict(X[:5]))
print("True labels:", y[:5])

Curious to see SVM in action on real data? This book “40 Beginner Machine Learning Projects” has a fun chapter that walks you through it step by step. Check it out here 👉 [https://selar.com/1y457n5a14]
