Introduction To Support Vector Machine

Omkar Kasture
8 min read

What is SVM?

In this blog we are going to cover the "out of the box" machine learning technique called the support vector machine. I say "out of the box" because it works differently from all the models covered so far.

The flow of learning will be as follows:

  • Maximal Margin Classifier: The maximal margin classifier is a very simple and intuitive classifier, and the support vector machine is a generalization of it. However, this classifier can only be applied when the classes are linearly separable.

  • Support Vector Classifier: An extension of the maximal margin classifier that can be applied in a broader range of cases.

  • Support Vector Machines: A further extension of the support vector classifier. We will see how SVM accommodates non-linear class boundaries.

Maximal Margin Classifier

When we start our discussion of the maximal margin classifier, the first thing we need to understand is the concept of a hyperplane.

What is a Hyperplane?

A hyperplane is a flat subspace that divides a p-dimensional space into two parts.

Let's consider a one-dimensional space. What is a one-dimensional space? It's a line. And how do we separate this line into two parts? We just need a point: there is the part of the line to the left of this point, and the part to the right. This point is a hyperplane for this space.

For a two-dimensional space we have a one-dimensional separator, that is, a line. Similarly, if we keep increasing the dimension of the space, a three-dimensional space can be separated using a two-dimensional plane. This concept can be extended to any p-dimensional space.
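In equation form (using the standard textbook notation, which the post does not write out), a hyperplane in p-dimensional space is the set of points X = (X1, ..., Xp) satisfying

    β0 + β1·X1 + β2·X2 + ... + βp·Xp = 0

Points where the left-hand side is positive lie on one side of the hyperplane, and points where it is negative lie on the other.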

So we say that there is a hyperplane of p - 1 dimensions which separates the p-dimensional space into two parts. But you may be wondering why we are talking about p-dimensional spaces and trying to split them. Let me explain with an example. Suppose we have three variables: X1, X2, and the category of the class to which each observation belongs.

| X1 | X2 | Category |
| --- | --- | --- |
| 87 | 76 | pass |
| 35 | 54 | fail |
| ... | ... | ... |
| 65 | 76 | pass |

Basically, we have two predictor variables and one variable to be predicted. Since we have two predictor variables, we can plot the observations in a two-dimensional space.

We can put X1 on the x-axis and X2 on the y-axis, and the color of each point will tell us which class that particular point belongs to. We want a hyperplane that divides this space into two parts, such that all the blue points are in one part and all the purple points are in the other. If I am able to find this hyperplane, I will have my classifier: any point that falls on the upper side of the hyperplane will be classified as blue (class 1), and anything that falls on the lower side will be classified as purple (class 2).
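As a minimal sketch of this decision rule (the weights and intercept below are made-up numbers, not fitted values), classifying a point just means checking which side of the hyperplane it falls on:

```python
import numpy as np

# Hypothetical hyperplane 0.5*X1 + 0.8*X2 - 70 = 0; the weights and
# intercept are invented numbers chosen just for this illustration
w = np.array([0.5, 0.8])
b = -70.0

def classify(point):
    """Classify a point by which side of the hyperplane it falls on."""
    return "pass (class 1)" if point @ w + b > 0 else "fail (class 2)"

print(classify(np.array([87, 76])))  # positive side -> pass
print(classify(np.array([35, 54])))  # negative side -> fail
```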

How to choose the hyperplane?

When the data is perfectly separable, we can draw infinitely many hyperplanes: take any separating hyperplane and make tiny shifts or tiny rotations, and you get more hyperplanes. Three such hyperplanes are shown in this figure.

But which of these hyperplanes should we choose, and why? One reasonable choice is to select the hyperplane which is farthest from the training observations. That is, for a given hyperplane, we find the perpendicular distance of every observation from it; the smallest of these distances is called the margin. Whichever hyperplane has the maximum value of the margin is the one we select.

You can now see where the name comes from: because we choose the hyperplane with the maximum value of the margin, this is called the maximal margin classifier.

If we draw the hyperplane and the two margins on the graph, we notice that there are three points which lie exactly on the margin. If these points were not there, we would have obtained a wider margin. These points are called support vectors. Our classifier depends only on the support vectors: any slight movement in any of them would change the classifier. Identifying such points and classifying on the basis of only these few points is a special characteristic of support vector classifiers and machines, and it separates this technique from other conventional techniques.
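If you want to see the support vectors concretely, here is a minimal sketch using scikit-learn (the tiny dataset is invented, and the very large C approximates the hard-margin maximal margin classifier on separable data):

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters (synthetic data for illustration)
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves essentially no budget for violations, so this
# behaves like the hard-margin maximal margin classifier
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.support_vectors_)  # only these points determine the hyperplane
```

Only the printed support vectors pin down the hyperplane; nudging any other point (without crossing the margin) leaves the fit unchanged.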

What are the limitations of the maximal margin classifier?

The maximal margin classifier has two major limitations, due to which it is not used in the majority of real-world scenarios.

  • The maximal margin classifier cannot be used if the two classes are not separable by a hyperplane.

  • The maximal margin classifier is very sensitive to the support vectors: a single additional observation can lead to a dramatic shift in the maximal margin hyperplane.

Due to these two limitations, we will move on from the maximal margin classifier to the support vector classifier and see how it handles them.

Support Vector Classifier

We are now going to discuss the support vector classifier so that we can handle scenarios that are not perfectly separable. Let me first tell you what a support vector classifier is: it is a soft margin classifier. This means we still have a hyperplane and margins around it, but not all the observations need to be on the correct side of the margin. We allow some observations to be present within the margin area, and some may even go on the wrong side of the hyperplane.

  • Look at point number 8. This point is supposed to be on the right-hand side of the upper margin, but it is on the wrong side of the margin. However, since it is still to the right of the hyperplane, it is correctly classified. So it is on the wrong side of the margin but correctly classified.

  • Whereas point number 12 is a blue point, which is supposed to be on the right side of the upper margin, but it is on the left side of the hyperplane itself. This point will be incorrectly classified by our support vector classifier.

  • So there are two types of errors we allow: first, the point is correctly classified but on the wrong side of the margin; second, the point is misclassified, that is, on the wrong side of the hyperplane.

  • The points numbered 1, 8, 11, and 12, which lie within the margin or on the margin, are all called support vectors.

How is a support vector classifier created?

The underlying concept stays the same: we are trying to find a hyperplane with maximum margin, but with one additional constraint. The constraint we add is a budget of mistakes. We know some points are going to be misclassified, so we create a budget B, which is the amount of margin violation that is acceptable to us. The software will then try to find the hyperplane with maximum margin while staying within this budget.

  1. We create a misclassification budget ( B )

  2. We limit the sum of the distances of the points on the wrong side of the margin:

    ( x1 + x2 + x3 + x4 ) < B

  3. We try to maximize the margin while staying within this budget

Another way of implementing the same concept, often found in software packages, is the use of a cost parameter. It is effectively the same idea; the only difference is that instead of giving each unit of violation a unit cost and capping the total at a budget B, each unit of violation is charged a cost C, which is added as a penalty:

C ( x1 + x2 + x3 + x4 )

So C acts as a kind of inverse of B.

Impact of C:

  • When C is small, the margin will be wide, and there will be many support vectors and many misclassified observations.

  • When C is large, the margin will be narrow, and there will be fewer support vectors and fewer misclassified observations.

  • However, a low cost value prevents overfitting and may give better test set performance.

  • We try to find the optimal value of C at which we get the best test performance.

Click the link below to simulate how C impacts the margin:

https://cs.stanford.edu/~karpathy/svmjs/demo/
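If you prefer to experiment in code, the same cost parameter can be tuned with cross-validation. A minimal sketch using scikit-learn, where the synthetic data and the grid of C values are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic, slightly overlapping classes, so some misclassification is inevitable
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=1.0, random_state=0)

# Cross-validate over several cost values: small C widens the margin,
# large C narrows it
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
```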

Limitations of Support Vector Classifier

The support vector classifier is a linear classifier; it cannot correctly classify data that is not linearly separable.

By visual inspection we can clearly see that these classes are well separated. If I could draw a circle here and say that anything within the circle is purple and anything outside is blue, that would be a much better classifier than the linear one on the right. So, to handle this limitation of linearity, we generalize further and use something known as the kernel method to arrive at support vector machines.

Using support vector machines, we will be able to draw non-linear boundaries as well.
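This failure mode is easy to reproduce. In the sketch below, scikit-learn's synthetic make_circles data (an illustrative choice, not from the original post) defeats the linear classifier, while the radial kernel introduced in the next section separates the classes almost perfectly:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# One class forms an inner circle, the other surrounds it
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)

# Linear boundary vs. the radial kernel covered in the next section
linear_acc = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
rbf_acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()

print(f"linear: {linear_acc:.2f}, rbf: {rbf_acc:.2f}")  # rbf should score far higher
```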

Kernel Based Support Vector Machine

The support vector machine (SVM) is an extension of the support vector classifier which uses kernels to create non-linear boundaries.

What are Kernels?

A kernel is a function that expresses a relationship between two observations. Some popular kernels:

  1. Linear Kernel: the linear kernel takes the inner product of two observations.

  2. Polynomial Kernel: the polynomial kernel uses a power function to create non-linear boundaries.

  3. Radial Kernel: the radial kernel uses a radial basis function to create radial boundaries. Gamma defines how much influence a single training example has: the larger gamma is, the closer other examples must be to be affected.

You can try these kernels in the same demo: https://cs.stanford.edu/~karpathy/svmjs/demo/
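To make these definitions concrete, here is a small sketch computing each kernel between two observations with NumPy; the gamma, offset, and degree values are arbitrary, and the formulas follow the common textbook forms of these kernels:

```python
import numpy as np

x1 = np.array([87.0, 76.0])
x2 = np.array([65.0, 76.0])

# Linear kernel: the plain inner product of the two observations
linear = x1 @ x2

# Polynomial kernel: (gamma * <x1, x2> + r) ** d raises the inner
# product to a power, which yields non-linear boundaries
gamma, r, d = 0.01, 1.0, 3
poly = (gamma * (x1 @ x2) + r) ** d

# Radial (RBF) kernel: similarity decays with squared distance; a larger
# gamma means only very close points influence each other
rbf = np.exp(-gamma * np.sum((x1 - x2) ** 2))

print(linear, poly, rbf)
```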

This was all about the support vector machine and the background logic of support-vector-based models.

Summary Video Suggestion:

For an actual use case of SVM in regression and classification (implementation of SVM in Python), please refer to: GitHub Link

Stay tuned, Happy Learning!


Written by

Omkar Kasture
MERN Stack Developer, Machine Learning & Deep Learning