The intuition behind kernel methods

Kernel methods are a family of machine learning algorithms that perform typically non-linear classification tasks by means of a similarity measure called a kernel. In this article we will see how this similarity measure can be written as a scalar product in a higher dimensional space through a feature map, and we will provide an example of a kernel method, namely a kernelized support vector machine (KSVM).

 
Kernelized SVM method visualized

Introduction

Kernel methods rely on the definition of a kernel, namely a similarity function that lets us perform non-linear classification tasks by expressing them as linear problems in higher dimensional spaces. Let's take a dataset

\begin{equation} (x_1, y_1), ..., (x_n, y_n) \in \mathcal{X} \times \mathcal{Y} \end{equation} where $\mathcal{X}$ is the set of input variables of our problem, while $\mathcal{Y}$ is the set of labels associated to this data.
Let's also define a feature map \begin{equation} \Phi: \mathcal{X} \longrightarrow \mathcal{H} \end{equation} where $\mathcal{H}$ is a higher dimensional space referred to as the feature space. A kernel $k: \mathcal{X} \times \mathcal{X} \longrightarrow \mathcal{R}$ is defined as \begin{equation} k(x, y) = \langle\Phi(x), \Phi(y)\rangle \end{equation} In other words, a kernel is the scalar product between two data points once they have been mapped to the feature space through the feature map $\Phi$. The advantage of such a scheme is that it allows us to work in a dot product space where efficient algorithms such as SVMs can be exploited.
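To make this concrete, here is a minimal numeric sketch (in Python, with a toy feature map chosen purely for illustration) showing that mapping two points with $\Phi$ and taking their scalar product gives exactly the value of a kernel.

```python
import numpy as np

# Toy feature map from R^2 to R^3: Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
# This particular choice corresponds to the kernel k(x, y) = <x, y>^2
# (the homogeneous polynomial kernel of degree 2).
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, y):
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# The two numbers coincide: the kernel is the scalar product in feature space.
print(np.dot(phi(x), phi(y)))  # 1.0
print(k(x, y))                 # 1.0
```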
Though the formal definition already says it all, it is very difficult to get a feeling for why we define kernels this way and how they can really be used: to fix this, let's start with a practical example.

Support vector machine

Let's start by investigating a common linear classification algorithm in machine learning, namely the support vector machine (SVM). This is one of the most common cases in which kernels can be used to improve a linear algorithm, giving it non-linear capabilities.

A support vector machine (SVM) is an algorithm that divides a dataset into classes by finding a hyperplane that separates the data points in such a way that the two classes are kept as far apart as possible. Let's look at an example
 
A simple 2D dataset

Distinguishing the two pieces of the dataset optimally (and linearly) means finding a hyperplane which separates the two classes with the maximum possible margin, where the margin is defined as the smallest distance between the hyperplane and the data points. Let's look at the following figure

 

Optimal and non-optimal margins (in yellow and green, respectively)

 

In the figure above we can see an example of an optimal margin (yellow) which maximizes the separation between the two classes. On the other hand, the second hyperplane (green) still separates the two classes, but with a smaller margin: the first one is the one we are looking for. How can we characterize this hyperplane?

In general we can describe a hyperplane by means of a normal vector $\boldsymbol{\omega}$: the hyperplane is the set of points satisfying the equation \begin{equation} \langle\boldsymbol{\omega}, \textbf{x}\rangle = b \end{equation} where $\boldsymbol{\omega}$ is orthogonal to the hyperplane, and one can verify that the distance between the hyperplane and the origin is given by $\frac{|b|}{\|\boldsymbol{\omega}\|_2}$ (the point $\textbf{x}_0 = \frac{b}{\|\boldsymbol{\omega}\|_2^2}\boldsymbol{\omega}$ lies on the hyperplane along the normal direction and has exactly this norm). This means that by varying $b$ we can slide the plane along $\boldsymbol{\omega}$, as shown in the following figure
 
Support vector machine details

We know that we can slide the planes along $\boldsymbol{\omega}$ by adjusting $\|\boldsymbol{\omega}\|$ and $b$, and that the margin width is given by $\frac{2}{\|\boldsymbol{\omega}\|}$ (the margin boundaries are the two hyperplanes $\langle\boldsymbol{\omega}, \textbf{x}\rangle = b + 1$ and $\langle\boldsymbol{\omega}, \textbf{x}\rangle = b - 1$, whose distance along the normal direction is exactly $\frac{2}{\|\boldsymbol{\omega}\|}$). Therefore we would like to minimize $\|\boldsymbol{\omega}\|$ in order to maximize the margin; at the same time we want the points to stay outside of the margin, so we need to impose a constraint. Let's label the blue points with $y_i = 1$ and the red points with $y_i = -1$; then the conditions we just identified can be written formally as follows \begin{equation} \langle\boldsymbol{\omega}, \textbf{x}_i\rangle \geq b + 1 \hspace{1cm} \forall y_i = 1\\ \langle\boldsymbol{\omega}, \textbf{x}_i\rangle \leq b - 1 \hspace{1cm} \forall y_i = -1 \end{equation} These conditions can then be summarized in a single one \begin{equation} y_i\cdot(\langle\boldsymbol{\omega}, \textbf{x}_i\rangle - b) \geq 1 \end{equation} Therefore the optimization problem can be summarized as follows: minimize $\|\boldsymbol{\omega}\|$ while keeping $y_i\cdot(\langle\boldsymbol{\omega}, \textbf{x}_i\rangle - b) \geq 1 \; \forall i$.

This kind of optimization problem is also known as a constrained optimization problem and can be solved by the method of Lagrange multipliers.
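As a rough sketch of what this looks like in practice, the snippet below (using scikit-learn and a made-up linearly separable dataset) fits a linear SVM; the fitted coef_ and intercept_ play the roles of $\boldsymbol{\omega}$ and $b$ above.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up linearly separable 2D dataset: two Gaussian blobs.
rng = np.random.default_rng(0)
X_blue = rng.normal(loc=[2, 2], scale=0.5, size=(50, 2))
X_red = rng.normal(loc=[-2, -2], scale=0.5, size=(50, 2))
X = np.vstack([X_blue, X_red])
y = np.array([1] * 50 + [-1] * 50)

# A linear SVM solves the maximum-margin problem described above
# (a large C approximates the hard-margin formulation).
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("omega =", clf.coef_[0])    # normal vector of the separating hyperplane
print("b =", -clf.intercept_[0])  # offset (sklearn's sign convention differs from the text)
print("margin width =", 2 / np.linalg.norm(clf.coef_[0]))
```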

Kernelized support vector machine

Let's now introduce kernels and see how they can be used to extend SVMs to non-linear classification problems, which is one of the most common applications of this idea. Let's look at a different dataset, like the one in the figure below

 
Dataset non-linearly labeled

The red and blue colors identify again two different labels, and the goal is to find a rule to separate the two groups of points. This rule is evident to the "human eye", but not so much for a classification algorithm. There is clearly no linear rule that lets us make this separation without huge errors, so the SVM approach we just discussed turns out to be useless. Is there a way to somehow manipulate this dataset to make it "solvable" by an SVM? Of course there is: this is where the idea of kernels comes into play. We can map these points from the 2D space to a higher dimensional space with some map (called a feature map) and see if the dataset is linearly separable in this higher dimensional space. Let's try the following feature map from 2D to 3D

\begin{equation} \Phi(x, y) \equiv (x, y, e^{-(x^2 + y^2)}) \end{equation}

This produces the following dataset, which now lives in a three-dimensional space

 
Projection of the dataset in the feature space

The points have been extruded by the effect of the feature map we chose, and they are now much easier to separate by a linear classifier, namely a plane

 
Data classifier in the feature space

Therefore we can now simply plug the higher dimensional problem into the SVM algorithm and solve it; this is simply equivalent to the change of variables

\begin{equation} x \longrightarrow \Phi(x)\\ y \longrightarrow \Phi(y) \end{equation}
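Here is a rough sketch of the whole procedure (using scikit-learn and a synthetic two-ring dataset standing in for the one in the figures): we apply the feature map explicitly to get 3D vectors and then feed them to an ordinary linear SVM.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic stand-in for the dataset in the figure: two concentric rings.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The feature map Phi(x, y) = (x, y, exp(-(x^2 + y^2))) from the text.
def phi(X):
    return np.column_stack([X[:, 0], X[:, 1],
                            np.exp(-(X[:, 0]**2 + X[:, 1]**2))])

# In the 3D feature space the two rings sit at different "heights",
# so a plain linear SVM can separate them.
clf = SVC(kernel="linear")
clf.fit(phi(X), y)
print("training accuracy:", clf.score(phi(X), y))  # expected to be ~1.0
```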

The kernel trick

Let's look back at our example: we used the feature map to transform the initial problem into a higher dimensional problem, paying a price in terms of dimension but gaining simplicity in terms of data structure, which became linearly separable.

Now here is the problem: we did everything assuming that we already know the form of the feature map $\Phi$, but this clearly prevents us from working in spaces whose dimension is too big, let alone infinite dimensional spaces. It turns out, however, that kernel methods are a powerful tool especially for infinite dimensional feature maps, so how do we treat these feature maps? Simple, we don't treat them! Let's look back at the definition of a kernel: we said that it is a similarity measure that can be written as \begin{equation} k(x, y) = \langle\Phi(x), \Phi(y)\rangle \end{equation} Furthermore, in our example we only ever used scalar products, never the feature map explicitly, so as long as we have an expression for this scalar product, namely an expression for the kernel, we never need $\Phi$ itself. This is called the kernel trick, namely the process of keeping the scalar product implicit without having to explicitly write the feature map.
This trick is useful because kernels are generally much simpler than their corresponding feature maps: this was not the case in our previous example, so let's look at two more clarifying examples.

Polynomial kernel

This is defined as \begin{equation} k_p(\textbf{x}, \textbf{y}) = (\textbf{x} \cdot \textbf{y} + 1)^d \end{equation} Let's see which feature map implements this kernel by taking, for example, $d = 2$. We can verify that this kernel is produced by a feature map $\Phi: \mathcal{R}^2 \longrightarrow \mathcal{R}^6$ which has the form \begin{equation} \Phi(\textbf{x}) = \begin{bmatrix} 1\\ \sqrt{2}x_1\\ \sqrt{2}x_2\\ \sqrt{2}x_1 x_2\\ x^2_1\\ x^2_2\\ \end{bmatrix} \end{equation} Simply plugging into the kernel formula we get \begin{equation} k(\textbf{x}, \textbf{y}) = \begin{bmatrix} 1 & \sqrt{2}x_1 & \sqrt{2}x_2 & \sqrt{2}x_1 x_2 & x^2_1 & x^2_2 \end{bmatrix} \cdot \begin{bmatrix} 1\\ \sqrt{2}y_1\\ \sqrt{2}y_2\\ \sqrt{2}y_1 y_2\\ y^2_1\\ y^2_2\\ \end{bmatrix} = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x^2_1 y^2_1 + x^2_2 y^2_2 = (\textbf{x}\cdot \textbf{y}+1)^2 \end{equation}

Here we can see the power of the kernel trick: we can compute a scalar product in a 6-dimensional space without ever visiting that space. In fact, the scalar product is simply given by the kernel, which involves only the original 2-dimensional vectors.
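A quick numeric sanity check of this identity (a minimal sketch with arbitrarily chosen 2D vectors): the explicit 6-dimensional feature map and the kernel give the same number.

```python
import numpy as np

# Explicit feature map for the degree-2 polynomial kernel (R^2 -> R^6), as above.
def phi(x):
    return np.array([1.0,
                     np.sqrt(2) * x[0],
                     np.sqrt(2) * x[1],
                     np.sqrt(2) * x[0] * x[1],
                     x[0]**2,
                     x[1]**2])

def k_poly(x, y, d=2):
    return (np.dot(x, y) + 1) ** d

x = np.array([0.5, -1.2])
y = np.array([2.0, 0.7])

# Same value, but the kernel never leaves the original 2D space.
print(np.dot(phi(x), phi(y)))  # scalar product in R^6
print(k_poly(x, y))            # (x . y + 1)^2
```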

Gaussian kernel

Let's now look at another kernel, namely the Gaussian kernel, which is defined as

\begin{equation} k_G(\textbf{x}, \textbf{y}) = e^{-\gamma\|\textbf{x}-\textbf{y}\|^2} \end{equation} What is the feature map associated to this kernel? Let's recall the series expansion of the exponential function \begin{equation} e^x = \sum_{n = 0}^\infty \frac{x^n}{n!} = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + ... \end{equation} and the fact that \begin{equation} \|\textbf{x}-\textbf{y}\|^2 = \|\textbf{x}\|^2 + \|\textbf{y}\|^2 - 2 \langle\textbf{x},\textbf{y}\rangle \end{equation} Putting everything together we get \begin{equation} k_G(\textbf{x}, \textbf{y}) = e^{-\gamma\|\textbf{x}\|^2} e^{-\gamma\|\textbf{y}\|^2} e^{2\gamma \langle\textbf{x},\textbf{y}\rangle} = e^{-\gamma\|\textbf{x}\|^2} e^{-\gamma\|\textbf{y}\|^2} \sum_{n = 0}^\infty \frac{(2\gamma)^n}{n!} \langle\textbf{x},\textbf{y}\rangle^n \end{equation} This expression can be evaluated via a multinomial expansion of $\langle\textbf{x},\textbf{y}\rangle^n$, namely using the multinomial theorem \begin{equation} \Big(\sum_{q = 1}^d x_q \Big)^n = \sum_{k_1 + k_2 + ... + k_d = n} \frac{n!}{k_1!k_2! \cdots k_d!}\prod_{q = 1}^d x^{k_q}_q \end{equation} where $d$ is the dimension of the input vectors. Substituting this into the formula above and rearranging the terms gives the final result \begin{equation} k_G(\textbf{x}, \textbf{y}) = \sum_{n = 0}^{\infty} \sum_{k_1 + k_2 + ... + k_d = n} \Bigg( \sqrt{\frac{(2\gamma)^n}{n!} \frac{n!}{k_1!k_2! \cdots k_d!}}\, e^{-\gamma\|\textbf{x}\|^2} \prod_{q = 1}^d x^{k_q}_q \Bigg)\cdot\Bigg( \sqrt{\frac{(2\gamma)^n}{n!} \frac{n!}{k_1!k_2! \cdots k_d!}}\, e^{-\gamma\|\textbf{y}\|^2}\prod_{q = 1}^d y^{k_q}_q \Bigg) \end{equation} This is a rather complicated expression, so let's finish by writing the explicit expression for the feature map; one can verify that it is given by \begin{equation} \Phi(\textbf{x}) = \Big(..., \sqrt{\frac{(2\gamma)^n}{n!} \frac{n!}{k_1!k_2! \cdots k_d!}}\, e^{-\gamma\|\textbf{x}\|^2} \prod_{q = 1}^d x^{k_q}_q, ...\Big)_{n = 0, ..., \infty} \end{equation} where the index $n$ ranges from 0 to infinity and, for each $n$, there is one component for every multi-index $(k_1, ..., k_d)$ with $k_1 + ... + k_d = n$; therefore the feature map is a map $\Phi: \mathcal{R}^d \longrightarrow \mathcal{R}^{\infty}$.
Why did we spend so much time on all of this? The reason is to show that the feature map implemented by the Gaussian kernel projects, as we can now see, into an infinite dimensional space. Furthermore, we can now fully appreciate the power of the kernel trick: when "kernelizing" an algorithm we don't need to compute the feature map explicitly (which in the infinite-dimensional case is simply impossible), but we can perform the scalar product implicitly through the infinitely simpler kernel \begin{equation} k_G(\textbf{x}, \textbf{y}) = e^{-\gamma\|\textbf{x}-\textbf{y}\|^2} \end{equation}
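To make the series expansion above a bit more tangible, here is a small numeric sketch (with arbitrarily chosen vectors and truncation order): summing the first few terms of $e^{-\gamma\|\textbf{x}\|^2} e^{-\gamma\|\textbf{y}\|^2} \sum_n \frac{(2\gamma)^n}{n!}\langle\textbf{x},\textbf{y}\rangle^n$ already reproduces the closed-form Gaussian kernel to high accuracy.

```python
import numpy as np
from math import factorial

def k_gauss(x, y, gamma):
    # Closed-form Gaussian kernel.
    return np.exp(-gamma * np.linalg.norm(x - y) ** 2)

def k_gauss_series(x, y, gamma, n_terms=20):
    # Truncation of the infinite expansion used to exhibit the feature map.
    prefactor = np.exp(-gamma * np.dot(x, x)) * np.exp(-gamma * np.dot(y, y))
    s = sum((2 * gamma) ** n / factorial(n) * np.dot(x, y) ** n
            for n in range(n_terms))
    return prefactor * s

x = np.array([0.3, -0.8])
y = np.array([1.1, 0.4])
gamma = 0.5

print(k_gauss(x, y, gamma))         # exact closed form
print(k_gauss_series(x, y, gamma))  # truncated series, nearly identical
```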

So what is a kernel method?

A kernel method is simply a method by which we can “transport” a non-linear problem into a higher dimensional space and compute scalar products in this space (which is possibly infinite-dimensional) to try to make it more easily solvable. The kernel trick allows us to perform all the calculations without ever visiting the space explicitly: in other words we never have to “write down” the vectors living in this space and we can simply exploit the much simpler form of the scalar product in this space, namely the kernel function.
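As a closing sketch (again on a synthetic two-ring dataset), this is what the kernel trick looks like in practice: the SVM below is trained with the Gaussian (RBF) kernel and never materializes the infinite-dimensional feature map.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic non-linearly separable dataset, as in the earlier example.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# kernel="rbf" is the Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2);
# the optimization works only with kernel evaluations, never with Phi itself.
clf = SVC(kernel="rbf", gamma=1.0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))  # expected to be close to 1.0
```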
