If you’ve ever been introduced to the theory of machine learning, you might have come across the term kernel methods and how they can transform the original input space into a higher dimensional one (sometimes infinite) in order to perform regression/classification tasks using something called the kernel trick. This seemed a bit “magical” and confused me for quite some time. However, I finally think I managed to wrap my head around it, which is why I would like to share my take on it.
A common linear parametric model is called linear regression, where some output is given as a function of some input vector x and some linear combination of some parameters w. We usually partition the initial data into a training- and test set and tune the regression parameters w on the training set until we are happy with the results. We then throw away the training set and use w to predict new, fresh unseen data using the testing set to evaluate its general performance.
Let’s use ridge regression as an example:
This is the loss function we are trying to minimize, expressed in the parameters w. If we take the gradient of this function with respect to w we get the following:
Note how we can rewrite w as a function of a. If we substitute this back into the original loss function we get the following:
Note the repeating pattern of phi’s. The matrix phi is each initial data point x mapped to some extended feature space. The product of phi with itself transposed represents the dot product of each datapoint with itself and all other datapoints (under the individual transformations phi(x)). This product is also known as the gram matrix K. We continue to rewrite the loss function in terms of K:
If we calculate the gradient of this function with respect to a and set it equal to zero we get:
If we rearrange and solve for a we end up with:
We managed to rephrase the problem in terms of a and by doing so we can see that the solution can be written only in terms of K and y. This is the dual representation, but why is this a good thing? *drumroll* Enter the kernel trick.
In the simple case phi(x) represents the identity transformation phi(x) = x. This means that each entry of K is just the dot product of each data point in the training set. Since K is a N by N matrix this only seems to make things worse (as the computational complexity of inverting a N by N matrix is O(n^3) when using the Gauss-Jordan algorithm), however when the dimensionality D of the training set is much larger than the amount of samples (D >> N) we can see that it is more efficient to solve the dual problem than the primal one.
But what is the trick?
If we view the dot product as a measure of similarity, we can imagine that each entry in K gives us a measure of how similar each vector pair x and z are in the original D-dimensional space. But what if we want to expand into a higher dimensional space?
There exist a few popular kernels such as the polynomial kernel:
For simplicity, we set b = 1 and d = 2 so that each entry in the kernel becomes:
We see that each entry is simply the square of a dot product and an addition. However, if we rewrite each entry in terms of a dot product we can see that it actually represents a comparison in a (2D + 1) dimensional space:
This is the kernel trick! We find some mapping K that allows us to compare the original data set, in a higher dimension without doing the extra work of actually calculating the dot product in the extended feature space.
What about kernels in infinite dimensions?
There is another very popular kernel used called the RBF (Radial Basis Function) which looks like this:
We start by expanding the square:
If you remember anything from calculus you know that we can use something called a Taylor expansion to approximate a function:
Fortunately for us the expansion is easy when f(x) is the exponential function:
If we simplify the expression above we get:
We use this expansion to replace the last exp(xy) term so that the initial formulation becomes:
Now replace the left hand factor of the expression like this:
Just as the polynomial kernel could be rewritten into a dot product, the RBF can be too:
This means that the RBF maps each pair of data points into an infinite kernel, and more importantly without us having to actually calculate these (infinite) dot products.
Note that not all functions are valid kernel functions, there are some conditions that must apply, if you are interested I can recommend chapter 6.2 of “Pattern Recognition and ML” by M. Bishop.
Thanks for reading!