A complete explanation of the inner workings of Support Vector Machines (SVM) and Radial Basis Function (RBF) kernel
It is essential to understand how different Machine Learning algorithms work to succeed in your Data Science projects.
I have written this story as part of a series that dives into each ML algorithm, explaining its mechanics and supplementing it with Python code examples and intuitive visualizations. In this story, I will cover:
- The category of algorithms that SVM classification belongs to
- An explanation of how the algorithm works
- What kernels are, and how they are used in SVM
- A closer look into RBF kernel with Python examples and graphs
Support Vector Machines (SVMs) are most frequently used for solving classification problems, which fall under the supervised machine learning category. However, with small adaptations, SVMs can also be used for other types of problems such as:
- Clustering (unsupervised learning) through the use of Support Vector Clustering algorithm
- Regression (supervised learning) through the use of Support Vector Regression algorithm (SVR)
The exact place of these algorithms is displayed in the diagram below.
Let’s assume we have a set of points that belong to two separate classes. We want to separate those two classes in a way that allows us to correctly assign any future new points to one class or the other.
The SVM algorithm attempts to find a hyperplane that separates the two classes with the largest possible margin. If the classes are fully linearly separable, a hard margin can be used; otherwise, a soft margin is required.
Note, the points that end up on the margins are known as support vectors.
To aid the understanding, let’s review the examples in the below illustrations.
- Hyperplane called “H1” cannot accurately separate the two classes; hence, it is not a viable solution to our problem.
- The “H2” hyperplane separates classes correctly. However, the margin between the hyperplane and the nearest blue and green points is tiny. Hence, there is a high chance of incorrectly classifying any future new points. E.g., the new grey point (x1=3, x2=3.6) would be assigned to the green class by the algorithm when it is obvious that it should belong to the blue class instead.
- Finally, the “H3” hyperplane separates the two classes correctly and with the highest possible margin (yellow shaded area). Solution found!
Note, finding the largest possible margin allows more accurate classification of new points, making the model a lot more robust. You can see that the new grey point would be assigned correctly to the blue class when using the “H3” hyperplane.
Sometimes, it may not be possible to separate the two classes perfectly. In such scenarios, a soft margin is used, where some points are allowed to be misclassified or to fall inside the margin (yellow shaded area). This is where the “slack” value comes in, denoted by the Greek letter ξ (xi, pronounced “ksi”).
Using this example, we can see that the “H4” hyperplane treats the green point inside the margin as an outlier. Hence, the support vectors are the two green points closer to the main group of green points. This allows a larger margin to exist, increasing the model’s robustness.
Note, the algorithm allows you to control how much you care about misclassifications (and points inside the margin) by adjusting the hyperparameter C. Essentially, C acts as a weight assigned to ξ. A low C makes the decision surface smooth (more robust), while a high C aims at classifying all training examples correctly, producing a closer fit to the training data but making it less robust.
Beware, while setting a high value for C is likely to lead to a better model performance on the training data, there is a high risk of overfitting the model, producing poor results on the test data.
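To make the role of C concrete, here is a minimal sketch on synthetic, overlapping data (the blob parameters are arbitrary, not taken from this story): a low C tolerates slack and keeps a wide margin with many support vectors, while a high C penalises misclassification and fits the training data more tightly.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping classes (synthetic stand-in data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Low C: wide soft margin, many points allowed inside it (many support vectors)
# High C: slack is penalized heavily, narrower margin, tighter fit to the data
for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors")
```

The drop in the support-vector count as C grows is the narrowing of the margin described above.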
The above explanation of SVM covered examples where blue and green classes are linearly separable. However, what if we wanted to apply SVMs to non-linear problems? How would we do that?
This is where the kernel trick comes in. A kernel is a function that implicitly maps the original non-linear problem into a higher-dimensional space where it becomes linearly separable. To explain this trick, let’s study the below example.
Suppose you have two classes — red and black, as shown below:
As you can see, red and black points are not linearly separable since we cannot draw a line that would put these two classes on different sides of such a line. However, we can separate them by drawing a circle with all the red points inside it and the black points outside it.
How to transform this problem into a linear one?
Let’s add a third dimension and make it a sum of squared x and y values:
z = x² + y²
Using this three-dimensional space with x, y, and z coordinates, we can now draw a hyperplane (flat 2D surface) to separate red and black points. Hence, the SVM classification algorithm can now be used.
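The lift can be sketched in a few lines of Python (the radii below are arbitrary choices that make the classes circularly separable):

```python
import numpy as np

rng = np.random.default_rng(42)
theta = rng.uniform(0, 2 * np.pi, 100)

# Red points inside a circle of radius 1, black points on a ring of radius 2-3
r_red = rng.uniform(0, 1, 100)
r_black = rng.uniform(2, 3, 100)
red = np.column_stack([r_red * np.cos(theta), r_red * np.sin(theta)])
black = np.column_stack([r_black * np.cos(theta), r_black * np.sin(theta)])

# Lift to 3D: z = x^2 + y^2 (squared distance from the origin)
z_red = red[:, 0] ** 2 + red[:, 1] ** 2
z_black = black[:, 0] ** 2 + black[:, 1] ** 2

# In (x, y, z) space a flat plane such as z = 2.5 now separates the classes
print(z_red.max() < 2.5 < z_black.min())  # True
```

No point ever needs its circle drawn: in the lifted space, the circular boundary has become a plane.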
RBF is the default kernel used within sklearn’s SVM classification algorithm and can be described with the following formula:

K(x, x') = exp(-γ ||x - x'||²)

where gamma can be set manually and has to be > 0. The default value for gamma in sklearn’s SVM classification algorithm (the ‘scale’ setting) is:

γ = 1 / (n_features × X.var())

||x - x'||² is the squared Euclidean distance between two feature vectors (2 points). Gamma is a scalar that defines how much influence a single training example (point) has.
So, given the above setup, we can control individual points’ influence on the overall algorithm. The larger gamma is, the closer other points must be to affect the model. We will see the impact of changing gamma in the below Python examples.
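A quick way to build intuition is to evaluate the kernel directly for a fixed pair of points while varying gamma (a hand-rolled version of the formula above, not sklearn’s internal implementation):

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma):
    """RBF kernel value: exp(-gamma * squared Euclidean distance)."""
    diff = np.asarray(x) - np.asarray(x_prime)
    return np.exp(-gamma * np.dot(diff, diff))

# Two points at squared Euclidean distance 4
x, x_prime = [0.0, 0.0], [2.0, 0.0]
for gamma in (0.1, 1.0, 10.0):
    print(f"gamma={gamma}: K = {rbf_kernel(x, x_prime, gamma):.6f}")
```

The similarity between the same two points collapses towards zero as gamma grows, which is exactly why a large gamma means only very close neighbours influence the prediction.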
We will use the following data and libraries:
Let’s import all the libraries:
Then we get the chess games data from Kaggle, which you can download following this link: https://www.kaggle.com/datasnaek/chess.
Once you have saved the data on your machine, ingest it with the below code. Note, we also derive a couple of new variables for us to use in the modeling.
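The ingestion code itself is not reproduced here; below is a minimal sketch of the derivation step, with a few hypothetical inline rows standing in for games.csv so it runs on its own (the column names ‘white_rating’, ‘black_rating’, ‘winner’, and ‘turns’ follow the Kaggle file’s schema):

```python
import pandas as pd

# Hypothetical inline rows stand in for Kaggle's games.csv; in practice the
# ingestion step would be:
# df = pd.read_csv("games.csv")
df = pd.DataFrame({
    "white_rating": [1500, 1400, 1700],
    "black_rating": [1400, 1500, 1700],
    "winner": ["white", "black", "draw"],
    "turns": [61, 45, 80],
})

# Derive the two new variables used in the models below
df["rating_difference"] = df["white_rating"] - df["black_rating"]
df["white_win"] = (df["winner"] == "white").astype(int)
print(df[["rating_difference", "turns", "white_win"]])
```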
Now, let’s create a couple of functions to reuse when building different models and plotting the results.
This first function will split the data into train and test samples, fit the model, predict the result on a test set, and generate model performance evaluation metrics.
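The helper is not reproduced above; here is a sketch of what such a ‘fitting’ function might look like. The exact signature, split ratio, and printed metrics are assumptions, not the author’s original code:

```python
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def fitting(df, features, target, C=1.0, gamma="scale"):
    """Split the data, fit an RBF SVM, and print evaluation metrics.

    A sketch of the helper described in the story; the split ratio and
    metrics the original used may differ.
    """
    X, y = df[features], df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    # probability=True so the prediction surface can show P(class=1)
    clf = SVC(kernel="rbf", C=C, gamma=gamma, probability=True)
    clf.fit(X_train, y_train)
    print("Train accuracy:", metrics.accuracy_score(y_train, clf.predict(X_train)))
    print("Test accuracy: ", metrics.accuracy_score(y_test, clf.predict(X_test)))
    return clf, X, X_test, y_test
```

With the chess data in df, a call would then look like clf, X, X_test, y_test = fitting(df, ['rating_difference', 'turns'], 'white_win').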
The following function will draw a Plotly 3D scatter graph with the test data and model prediction surface.
Build a model with default values for C and Gamma
Let’s build our first SVM model using ‘rating_difference’ and ‘turns’ fields as our independent variables (attributes/predictors) and the ‘white_win’ flag as our target.
Note, we are somewhat cheating here, as the total number of moves would only be known after the match; ‘turns’ would therefore not be available if we wanted to generate a model prediction before the match starts. Nevertheless, since this is for illustration purposes only, we will use it in the below examples.
Since we are using our previously defined ‘fitting’ function, the code is short.
The function prints the following model evaluation metrics:
We can see that the model performance on test data is similar to that on training data, which gives reassurance that the model can generalize well using the default hyperparameters.
Let’s now visualize the prediction by simply calling the Plot_3D function:
Plot_3D(X, X_test, y_test, clf)
Note, black points at the top are actual class=1 (white won), and the ones at the bottom are actual class=0 (white did not win). Meanwhile, the surface is the probability of a white win produced by the model.
While there is local variation in the probability, the decision boundary lies around x=0 (i.e., rating difference=0) since this is where the probability crosses the p=0.5 boundary.
SVM model 2 — Gamma = 0.1
Let’s now see what happens when we set a relatively high value for gamma.
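The chess-data call itself is not shown here, but the effect can be reproduced on synthetic, noisy data (a stand-in with arbitrary blob parameters, not the article’s dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two noisy, overlapping classes as a stand-in for the chess data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1.5, (200, 2)), rng.normal(1, 1.5, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# High gamma: each training point only influences its close neighbourhood,
# so the model effectively memorizes the training set
clf = SVC(kernel="rbf", gamma=10).fit(X_tr, y_tr)
print("train accuracy:", clf.score(X_tr, y_tr))  # typically near-perfect
print("test accuracy: ", clf.score(X_te, y_te))  # typically much lower
```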
We can see that increasing gamma has led to better model performance on training data but worse performance on test data. The below graph helps us see exactly why that is.
Instead of having a smooth prediction surface like before, we now have a very “spiky” one. To understand why this happens, we need to study the kernel function a bit closer.
When we choose high gamma, we tell the function that the near points are much more important for the prediction than points further away. Hence, we get these “spikes” as the prediction largely depends on individual points of the training examples rather than what is around them.
On the opposite side, reducing gamma tells the function that it’s not just the individual point but also the points around it that matter when making the prediction. To verify this, let’s look at another example with a relatively low value for gamma.
SVM model 3 — Gamma = 0.000001
Let’s rerun the functions:
As expected, reducing gamma made the model more robust with an increase in model performance on the test data (accuracy = 0.66). The below graph illustrates how much smoother the prediction surface has become after assigning more influence to the points further away.
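The contrast between the high- and low-gamma regimes can also be measured directly as the train/test accuracy gap, sketched here on synthetic stand-in data (blob parameters are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic overlapping classes standing in for the chess data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1.5, (200, 2)), rng.normal(1, 1.5, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Train/test gap for an overfit (high gamma) and a smooth (low gamma) model
for gamma in (10, 0.000001):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    gap = clf.score(X_tr, y_tr) - clf.score(X_te, y_te)
    print(f"gamma={gamma}: train/test gap = {gap:.3f}")
```

The smaller gap at low gamma is the numeric counterpart of the smoother prediction surface.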
Adjusting hyperparameter C
I decided not to include examples with different C values in this story because C affects the smoothness of the prediction surface in a similar way to gamma, albeit for different reasons. You can try it yourself by passing a value such as C=100 to the “fitting” function.
The SVM algorithm is powerful and flexible. While I only covered the basic usage with one of the available kernels, I hope this story has given you an understanding of the inner workings of SVM and the RBF kernel. This should enable you to explore all the rest of the options by yourself.
Please give me a shout if you have any questions, and thanks for reading!