## Machine Learning

## A complete explanation of the inner workings of Support Vector Machines (SVM) and Radial Basis Function (RBF) kernel

It is essential to understand how different Machine Learning algorithms work to succeed in your Data Science projects.

I have written this story as part of a series that dives into each ML algorithm, explaining its mechanics with Python code examples and intuitive visualizations. This story covers:

- The category of algorithms that SVM classification belongs to
- An explanation of how the algorithm works
- What are kernels, and how are they used in SVM?
- A closer look into the RBF kernel with Python examples and graphs

Support Vector Machines (SVMs) are most frequently used for solving **classification** problems, which fall under the supervised machine learning category. However, with small adaptations, SVMs can also be used for other types of problems such as:

- **Clustering** (unsupervised learning) through the use of the Support Vector Clustering algorithm
- **Regression** (supervised learning) through the use of the Support Vector Regression algorithm (SVR)

The exact place of these algorithms is displayed in the diagram below.

Let’s assume we have a set of points that belong to two separate classes. We want to separate those two classes in a way that allows us to correctly assign any future new points to one class or the other.

The SVM algorithm attempts to find a hyperplane that separates these two classes with the largest possible margin. If the classes are fully linearly separable, a **hard-margin** can be used. Otherwise, a **soft-margin** is required.

Note, the points that end up on the margins are known as **support vectors**.
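To make the idea of support vectors concrete, here is a minimal sketch using sklearn's `SVC` with a linear kernel. The six data points are invented for illustration, and the very large `C` is only an approximation of a hard margin, since sklearn does not expose a dedicated hard-margin option.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data invented for this sketch: two linearly separable classes
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],   # class 0 ("blue")
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])  # class 1 ("green")
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for slack, approximating a hard margin
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# The training points that lie on the margins
support_vectors = clf.support_vectors_
print("Support vectors:\n", support_vectors)
print("Support vectors per class:", clf.n_support_)

# Classify a new, unseen point
new_point = np.array([[2.0, 2.0]])
prediction = clf.predict(new_point)
print("Predicted class for (2, 2):", prediction[0])
```

Only the points closest to the other class end up in `support_vectors_`; the remaining training points could be removed without changing the fitted hyperplane.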

To aid the understanding, let’s review the examples in the below illustrations.

**Hard-margin**

- Hyperplane called “**H1**” cannot accurately separate the two classes; hence, it is not a viable solution to our problem.
- The “**H2**” hyperplane separates classes correctly. However, the margin between the hyperplane and the nearest blue and green points is tiny. Hence, there is a high chance of incorrectly classifying any future new points. E.g., the new grey point (x1=3, x2=3.6) would be assigned to the green class by the algorithm when it is obvious that it should belong to the blue class instead.
- Finally, the “**H3**” hyperplane separates the two classes correctly and with the highest possible margin (yellow shaded area). Solution found!

Note, finding the largest possible margin allows more accurate classification of new points, making the model a lot more robust. You can see that the new grey point would be assigned correctly to the blue class when using the “H3” hyperplane.
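The difference between a tight separating hyperplane and the maximum-margin one can also be checked numerically. The sketch below is illustrative only: the data, the hand-picked line `x1 + x2 = 3.2` (standing in for something like “H2”), and the helper `geometric_margin` are assumptions made for this example, not part of the illustrations above.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data invented for this sketch, with labels -1 and +1
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

def geometric_margin(w, b, X, y):
    """Distance from the hyperplane w.x + b = 0 to the nearest training point.

    The result is negative if any point falls on the wrong side.
    """
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Hand-picked line that separates the classes but hugs the -1 class
w_tight, b_tight = np.array([1.0, 1.0]), -3.2
margin_tight = geometric_margin(w_tight, b_tight, X, y)

# Maximum-margin hyperplane found by a linear SVM (large C ~ hard margin)
clf = SVC(kernel="linear", C=1e6).fit(X, y)
margin_svm = geometric_margin(clf.coef_[0], clf.intercept_[0], X, y)

print(f"Hand-picked hyperplane margin: {margin_tight:.3f}")
print(f"SVM hyperplane margin:         {margin_svm:.3f}")
```

Both hyperplanes classify the training data perfectly, but the SVM solution keeps the nearest points much further away, which is what makes it more robust to new points.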

**Soft-margin**

Sometimes, it may be impossible to separate the two classes perfectly. In such scenarios, a **soft-margin** is used where some points are allowed to be misclassified or to fall inside the margin (yellow shaded area). This is where the “slack” value comes in, denoted by the Greek letter ξ (xi, pronounced “ksi”).

Using this example, we can see that the “H4” hyperplane treats the green point inside the margin as an outlier. Hence, the support vectors are the two green points closer to the main group of green points. This allows a larger margin to exist, increasing the model’s robustness.

Note, the algorithm allows you to control how much you care about misclassifications (and points inside the margin) by adjusting the hyperparameter C. Essentially, C acts as a weight assigned to ξ. A low C makes the decision surface smooth (more robust), while a high C aims at classifying all training examples correctly, producing a closer fit to the training data but making it less robust.
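For reference, the role of C as a weight on the slack can be seen in the standard textbook form of the soft-margin objective. The notation here (weight vector w, intercept b, n training points x_i with labels y_i ∈ {−1, +1}) is introduced for this formula and does not appear elsewhere in this story:

$$
\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0 .
$$

The first term favors a wide margin, while the second penalizes points that violate it. A larger C puts more weight on the violations, and a smaller C puts more weight on the margin.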

Beware, while setting a high value for C is likely to lead to a better model performance on the training data, there is a high risk of overfitting the model, producing poor results on the test data.
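The effect of C can be explored with a small experiment. This is a sketch under assumptions: the overlapping data is synthetic (generated with `make_blobs`), the two C values are arbitrary, and the margin width is computed with the standard formula 2/‖w‖, which is not derived in this story.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, overlapping two-class data (parameters chosen only for illustration)
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=1.5, random_state=42)

results = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    results[C] = {
        # Width of the margin for a linear SVM: 2 / ||w||
        "margin_width": 2.0 / np.linalg.norm(clf.coef_[0]),
        # Points on or inside the margin, plus misclassified points
        "n_support_vectors": int(clf.n_support_.sum()),
        "train_accuracy": clf.score(X, y),
    }
    print(f"C={C}: {results[C]}")
```

The low-C model should report the wider margin, since it tolerates more points inside it. To judge overfitting properly, you would compare accuracy on a held-out test set rather than on the training data used here.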

The above explanation of SVM covered examples where blue and green classes are linearly separable. However, what if we wanted to apply SVMs to non-linear problems? How would we do that?

This is where the kernel trick comes in. A **kernel is a function** that takes the original non-linear problem and transforms it into a linear one in a higher-dimensional space. To explain this trick, let’s study the example below.

Suppose you have two classes — red and black, as shown below:

As you can see, red and black points are not linearly separable since we cannot draw a line that would put these two classes on different sides of such a line. However, we can separate them by drawing a circle with all the red points inside it and the black points outside it.

**How to transform this problem into a linear one?**

Let’s add a third dimension and make it a sum of squared x and y values:

`z = x² + y²`

Using this three-dimensional space with x, y, and z coordinates, we can now draw a hyperplane (flat 2D surface) to separate red and black points. Hence, the SVM classification algorithm can now be used.
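The transformation can be tried out directly. In this minimal sketch, the concentric-circle data comes from sklearn's `make_circles` (an assumed stand-in for the red and black points), and the third feature is added by hand rather than through a kernel, purely to show why the extra dimension helps.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two classes arranged as an inner and an outer circle (synthetic data)
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear SVM in the original 2D space cannot separate the circles
acc_2d = SVC(kernel="linear").fit(X, y).score(X, y)

# Add the third dimension: z = x^2 + y^2
z = (X ** 2).sum(axis=1, keepdims=True)
X_3d = np.hstack([X, z])

# The same linear SVM in 3D can now separate the classes with a flat plane
acc_3d = SVC(kernel="linear").fit(X_3d, y).score(X_3d, y)

print(f"Training accuracy in 2D (x, y):    {acc_2d:.2f}")
print(f"Training accuracy in 3D (x, y, z): {acc_3d:.2f}")
```

In practice you rarely build such features manually. A kernel lets the SVM work as if the data had been mapped to the higher-dimensional space without computing the new coordinates explicitly.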

RBF is the default kernel used in sklearn’s SVM classification algorithm and can be described with the following formula: