Approximation of Gaussian Kernel
We explain the motivation for the kernel method in machine learning theory and how one might go about overcoming its computational challenges.
-
The concept of a kernel plays a crucial role when dealing with massive datasets, as it provides a computationally feasible method to encapsulate non-linearity.
-
There are several popular types of kernels, including the Gaussian Kernel (RBF), the polynomial kernel, and the sigmoid kernel, each with its own advantages.
-
Essentially, kernels serve as a valuable tool to navigate computational limitations while still capturing the intricate non-linearity inherent in data. However, it’s important to note that approximations may be necessary in order to fully harness the power of kernels.
-
In this article, we aim to understand the significance of kernel methods and survey the theoretical underpinnings of Gaussian Kernel approximations, focusing on Taylor approximation and random features.
Motivation
-
Let’s take a look at the kernel (soft) SVM optimization problem:
$$\min_{w\in\mathcal{H}} p(w) = \frac{\lambda}{2}\|w\|_{\mathcal{H}}^{2} + \frac{1}{m}\sum_{i=1}^{m}\bigl(1-y_i\langle w,\phi(x_i)\rangle\bigr)_{+}$$
Or, more generally,
$$\min_{w}\Bigl(f\bigl(\langle w,\phi(x_1)\rangle,\dots,\langle w,\phi(x_m)\rangle\bigr)+R(\|w\|)\Bigr)$$
where $R$ is a monotonically nondecreasing function. According to the Representer theorem (whose proof uses the orthogonal decomposition of the Hilbert space and the monotonicity of $R$), the solution takes the form $w=\sum_{i=1}^{m}\alpha_i\phi(x_i)$, so that $f(x)=\langle w,\phi(x)\rangle=\sum_{i=1}^{m}\alpha_i K(x_i,x)$. This implies that the minimizer can be expressed entirely in terms of $K(x_i,x_j):=\langle\phi(x_i),\phi(x_j)\rangle$. However, this might still be unsatisfactory. For instance, if a single kernel evaluation takes $\Theta(d)$ operations, then evaluating $f$ at one point requires $\Theta(md)$ operations in total.
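To make the $\Theta(md)$ prediction cost concrete, here is a minimal sketch in Python/NumPy (the names `gaussian_kernel`, `predict_kernel_expansion`, `alpha`, `X_train`, and `sigma` are illustrative, not from the original): each prediction with the kernel expansion $f(x)=\sum_{i=1}^{m}\alpha_i K(x_i,x)$ touches all $m$ training points.

```python
import numpy as np

def gaussian_kernel(x, x_prime, sigma):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); one evaluation costs Theta(d)
    return np.exp(-np.sum((x - x_prime) ** 2) / (2 * sigma**2))

def predict_kernel_expansion(x, X_train, alpha, sigma):
    # f(x) = sum_i alpha_i K(x_i, x): Theta(m d) work per test point
    return sum(a * gaussian_kernel(xi, x, sigma) for a, xi in zip(alpha, X_train))

# toy usage with made-up data; alpha would come from solving the (dual) soft SVM
rng = np.random.default_rng(0)
m, d = 500, 20
X_train, alpha = rng.normal(size=(m, d)), rng.normal(size=m)
print(predict_kernel_expansion(rng.normal(size=d), X_train, alpha, sigma=1.0))
```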
-
If we can find a way to replace $\phi:\mathcal{X}\to\mathcal{H}$ with a finite-dimensional map $\tilde\phi:\mathbb{R}^{d}\to\mathbb{R}^{D}$ and define $\tilde f(x)=\langle\tilde w,\tilde\phi(x)\rangle$, we can potentially reduce the prediction cost to $\Theta(D)$.
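Continuing the sketch above (again with hypothetical names), prediction with an explicit finite-dimensional feature map is a single $D$-dimensional dot product once $\tilde w$ has been trained, independent of $m$; the Taylor features and random Fourier features below are two concrete choices of `feature_map`.

```python
import numpy as np

def predict_explicit(x, w_tilde, feature_map):
    # f_tilde(x) = <w_tilde, phi_tilde(x)>: the dot product costs Theta(D),
    # plus whatever the chosen feature_map itself costs to evaluate
    return np.dot(w_tilde, feature_map(x))
```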
-
Notice how this bears resemblance to the method of weighted sums and quadrature points commonly employed for numerically approximating integrals in partial differential equations.
Approximation method 1: Taylor approximation
-
Motivation
-
The following theorem illustrates how the concept of projection can achieve this reduction in computational cost.
Theorem 1. Let $p^{*}=\inf_{w}p(w)$ be the optimal value of the soft SVM, let $P$ be an orthogonal projection on $\mathcal{H}$, define $\tilde\phi(x)=P\phi(x)$ with $\tilde K(x,x'):=\langle\tilde\phi(x),\tilde\phi(x')\rangle$, and let $\tilde p^{*}=\inf_{\tilde w}\tilde p(\tilde w)$ be the optimal value of the projected problem. Then we have
$$p^{*}\;\le\;\tilde p^{*}\;\le\;p^{*}+\frac{1}{m}\sqrt{\frac{2}{\lambda}}\,\sum_{i=1}^{m}\sqrt{K(x_i,x_i)-\tilde K(x_i,x_i)}$$
Proof. Note that since $P$ is an orthogonal projection, we have $\|Pw\|\le\|w\|$. Further, for any $w$ we have
$$\langle Pw,\phi(x_i)\rangle=\langle w,P^{*}\phi(x_i)\rangle=\langle w,P\phi(x_i)\rangle=\langle w,\tilde\phi(x_i)\rangle,$$
and so $p(Pw)\le\tilde p(w)$ for every $w$. Taking the infimum over $w$ implies that $p^{*}\le\tilde p^{*}$, which establishes the first inequality.
For the second inequality, note that
$$\bigl|(1-y_i\langle w,\phi(x_i)\rangle)_{+}-(1-y_i\langle w,P\phi(x_i)\rangle)_{+}\bigr|\le\|w\|\,\|P^{\perp}\phi(x_i)\|,$$
since the hinge function is 1-Lipschitz and by Cauchy-Schwarz. Now note that, again using the orthogonality of the Hilbert space, $\|P^{\perp}\phi(x_i)\|^{2}=\|\phi(x_i)\|^{2}-\|P\phi(x_i)\|^{2}=K(x_i,x_i)-\tilde K(x_i,x_i)$. Moreover, comparing the minimizer $w^{*}$ of $p$ with $w=0$ gives $\frac{\lambda}{2}\|w^{*}\|^{2}\le p(w^{*})\le p(0)\le 1$, so $\|w^{*}\|\le\sqrt{2/\lambda}$. Therefore
$$\tilde p^{*}\le\tilde p(Pw^{*})\le p(w^{*})+\frac{1}{m}\sum_{i=1}^{m}\|w^{*}\|\,\|P^{\perp}\phi(x_i)\|\le p^{*}+\frac{1}{m}\sqrt{\frac{2}{\lambda}}\sum_{i=1}^{m}\sqrt{K(x_i,x_i)-\tilde K(x_i,x_i)},$$
which gives the second inequality. ◻
-
This suggests that choosing $P$ to minimize $\sum_{i=1}^{m}\|\phi(x_i)-P\phi(x_i)\|$ yields a good approximation, and the Taylor approximation provides one such choice.
-
-
Taylor approximation
-
For the Gaussian Kernel $K(x,x')=\exp\left(-\frac{\|x-x'\|^{2}}{2\sigma^{2}}\right)$, we first find $\phi$, truncate it, and use Taylor’s theorem with remainder to obtain a bound.
-
First note that we have
$$K(x,x')=\langle\phi(x),\phi(x')\rangle=\sum_{k=0}^{\infty}\sum_{j\in[d]^{k}}\phi_{k,j}(x)\,\phi_{k,j}(x').$$
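Spelling this out (a standard computation, consistent with the truncated kernel below): expanding $\exp\bigl(\langle x,x'\rangle/\sigma^{2}\bigr)$ as a power series gives an explicit form for the components $\phi_{k,j}$, indexed by the order $k$ and the tuple $j=(j_1,\dots,j_k)\in[d]^{k}$:
$$K(x,x')=\exp\Bigl(-\tfrac{\|x\|^{2}+\|x'\|^{2}}{2\sigma^{2}}\Bigr)\sum_{k=0}^{\infty}\frac{1}{k!}\Bigl(\tfrac{\langle x,x'\rangle}{\sigma^{2}}\Bigr)^{k}=\sum_{k=0}^{\infty}\sum_{j\in[d]^{k}}\underbrace{e^{-\frac{\|x\|^{2}}{2\sigma^{2}}}\frac{x_{j_1}\cdots x_{j_k}}{\sigma^{k}\sqrt{k!}}}_{\phi_{k,j}(x)}\;\underbrace{e^{-\frac{\|x'\|^{2}}{2\sigma^{2}}}\frac{x'_{j_1}\cdots x'_{j_k}}{\sigma^{k}\sqrt{k!}}}_{\phi_{k,j}(x')}.$$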
-
This is like an inner product on a space of dimension $\sum_{k=0}^{\infty}d^{k}$ (one coordinate per order $k$ and tuple $j\in[d]^{k}$), projected onto the subspace spanned by the coordinates with $k\le r$.
-
This leads to the truncated Kernel
$$\tilde K(x,x')=\langle\tilde\phi(x),\tilde\phi(x')\rangle=\exp\left(-\frac{\|x\|^{2}+\|x'\|^{2}}{2\sigma^{2}}\right)\sum_{k=0}^{r}\frac{1}{k!}\left(\frac{\langle x,x'\rangle}{\sigma^{2}}\right)^{k}$$
-
By Taylor’s theorem with remainder, we have
$$\bigl|K(x,x')-\tilde K(x,x')\bigr|\le\frac{1}{(r+1)!}\left(\frac{\|x\|\,\|x'\|}{\sigma^{2}}\right)^{r+1}$$
-
So we have reduced the problem to a $D=\sum_{k=0}^{r}d^{k}$ dimensional feature space. However, there are duplicates; for example, we treat $j=(1,2,3)$ differently from $j=(3,1,2)$ when in fact they can be treated as the same feature. So $d^{k}$ can be reduced to $\binom{d+k-1}{k}$ (this is the number of ways to solve $\sum_{i=1}^{d}x_i=k$ with $x_i\ge 0$). Summing over $k$ and using Pascal’s rule, we have $D=\binom{d+r}{r}$.
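Here is a minimal numerical sketch of this construction in Python/NumPy (the function name, the brute-force enumeration of multisets, and the toy data are my own, chosen for clarity rather than efficiency). It builds the duplicate-free order-$r$ features, checks that $\langle\tilde\phi(x),\tilde\phi(x')\rangle$ equals the truncated series, verifies the Taylor remainder bound, and confirms the dimension $D=\binom{d+r}{r}$.

```python
import itertools
import math
import numpy as np

def taylor_features(x, r, sigma):
    """Duplicate-free Taylor features of the Gaussian kernel up to order r."""
    d = len(x)
    feats = []
    for k in range(r + 1):
        # multisets of size k over {0,...,d-1}  <->  monomials of degree k
        for idx in itertools.combinations_with_replacement(range(d), k):
            monomial = np.prod(x[list(idx)]) if k > 0 else 1.0
            # number of ordered tuples in [d]^k giving the same monomial
            mult = math.factorial(k) // math.prod(
                math.factorial(idx.count(i)) for i in set(idx))
            feats.append(np.exp(-x @ x / (2 * sigma**2))
                         * math.sqrt(mult / math.factorial(k)) * monomial / sigma**k)
    return np.array(feats)

rng = np.random.default_rng(0)
d, r, sigma = 3, 4, 1.0
x, y = rng.normal(size=d), rng.normal(size=d)

K_exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))
K_trunc = np.exp(-(x @ x + y @ y) / (2 * sigma**2)) * sum(
    (x @ y / sigma**2) ** k / math.factorial(k) for k in range(r + 1))

phi_x, phi_y = taylor_features(x, r, sigma), taylor_features(y, r, sigma)
print(len(phi_x) == math.comb(d + r, r))        # D = C(d + r, r)
print(np.isclose(phi_x @ phi_y, K_trunc))       # inner product = truncated kernel
print(abs(K_exact - K_trunc)                    # Taylor remainder bound holds
      <= (np.linalg.norm(x) * np.linalg.norm(y) / sigma**2) ** (r + 1)
      / math.factorial(r + 1))
```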
-
Approximation method 2: Random features
-
The Gaussian Kernel is shift-invariant, so (with a slight abuse of notation) we can write $K(x,x')=K(x-x')$, and it falls into the Schwartz class of functions. As such, we may use the Fourier inversion formula, given by
$$K(x-x')=\int_{\mathbb{R}^{d}}\hat K(w)\exp\bigl(2\pi i\,(x-x')\cdot w\bigr)\,dw=\int_{\mathbb{R}^{d}}\operatorname{Re}\bigl(\hat K(w)\bigr)\cos\bigl(2\pi(x-x')\cdot w\bigr)-\operatorname{Im}\bigl(\hat K(w)\bigr)\sin\bigl(2\pi(x-x')\cdot w\bigr)\,dw,$$
where the second equality holds since $K$ is real-valued. Now $\int_{\mathbb{R}^{d}}\operatorname{Im}(\hat K(w))\sin(2\pi(x-x')\cdot w)\,dw=0$ since $K(x-x')=K(-(x-x'))$. We thus have, appropriately scaling $\hat K$ to a probability density (for the Gaussian Kernel, $\hat K$ is itself a Gaussian density), absorbing the factor $2\pi$ into $w$, and writing $\hat K$ for its real part for ease of notation,
$$K(x-x')=\mathbb{E}_{w\sim\hat K}\bigl[\cos\bigl(w\cdot(x-x')\bigr)\bigr]=\mathbb{E}_{w\sim\hat K}\bigl[\cos(w\cdot x+\theta)\cos(w\cdot x'+\theta)+\sin(w\cdot x+\theta)\sin(w\cdot x'+\theta)\bigr]$$
for any fixed $\theta$, by the cosine angle-difference identity.
-
On the other hand, by the product-to-sum identities and since $\mathbb{E}_{\theta\sim U[0,2\pi]}\bigl[\cos(w\cdot(x+x')+2\theta)\bigr]=0$, we have
$$\mathbb{E}_{\theta\sim U[0,2\pi]}\bigl[\sin(w\cdot x+\theta)\sin(w\cdot x'+\theta)\bigr]=\mathbb{E}_{\theta\sim U[0,2\pi]}\bigl[\cos(w\cdot x+\theta)\cos(w\cdot x'+\theta)\bigr]=\tfrac{1}{2}\cos\bigl(w\cdot(x-x')\bigr).$$
In conclusion, we have
$$K(x-x')=2\,\mathbb{E}_{w\sim\hat K,\;\theta\sim U[0,2\pi]}\bigl[\cos(w\cdot x+\theta)\cos(w\cdot x'+\theta)\bigr].$$
-
We can then employ the Monte Carlo method to approximate $K$ using this equality.
-
The feature map is defined as $\tilde\phi_{j}(x)=\sqrt{\tfrac{2}{D}}\cos(w_{j}\cdot x+\theta_{j})$ for $j=1,\dots,D$, where $w_{j}\sim\hat K$ and $\theta_{j}\sim U([0,2\pi])$ are drawn independently, so that $\langle\tilde\phi(x),\tilde\phi(x')\rangle$ is an unbiased Monte Carlo estimate of $K(x-x')$.
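A minimal sketch of this construction in Python/NumPy (the function name and toy data are my own; the sampling distribution $w\sim\mathcal{N}(0,\sigma^{-2}I)$ is the scaled Fourier transform of the Gaussian Kernel used above):

```python
import numpy as np

def random_fourier_features(X, D, sigma, rng):
    """Map rows of X to D random Fourier features approximating the Gaussian kernel."""
    d = X.shape[1]
    # For K(delta) = exp(-||delta||^2 / (2 sigma^2)), the normalized Fourier
    # transform is the density of N(0, sigma^{-2} I), so we sample w from it.
    W = rng.normal(scale=1.0 / sigma, size=(D, d))
    theta = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + theta)

rng = np.random.default_rng(0)
d, D, sigma = 5, 2000, 1.0
X = rng.normal(size=(2, d))

Z = random_fourier_features(X, D, sigma, rng)
approx = Z[0] @ Z[1]                         # <phi_tilde(x), phi_tilde(x')>
exact = np.exp(-np.sum((X[0] - X[1]) ** 2) / (2 * sigma**2))
print(exact, approx)                         # close for large D
```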
Fact 1 (Rahimi and Recht, uniform bound for the random features method). Let $\tilde K$ be the kernel defined by $D$ random Fourier features. For any $\epsilon>0$, we have
$$\mathbb{P}\left[\sup_{\|x\|,\|y\|\le R}\bigl|K(x,y)-\tilde K(x,y)\bigr|\ge\epsilon\right]\le 2^{8}\,\frac{d(\sigma R)^{2}}{\epsilon^{2}}\exp\left(-\frac{D\epsilon^{2}}{4(2+d)}\right)$$
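Read the other way (under the bound as stated above), this tells us how many features suffice: to guarantee $\sup_{\|x\|,\|y\|\le R}|K(x,y)-\tilde K(x,y)|\le\epsilon$ with probability at least $1-\delta$, it is enough to take
$$D\;\ge\;\frac{4(2+d)}{\epsilon^{2}}\,\log\left(\frac{2^{8}\,d(\sigma R)^{2}}{\delta\,\epsilon^{2}}\right),$$
so $D$ needs to grow only logarithmically in $1/\delta$ and roughly linearly in the input dimension $d$.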
Conclusion
- We surveyed the Kernel method within the realm of machine learning theory and how one might go about overcoming its computational challenges via approximations of the Gaussian Kernel.
References
- Cotter, A., Keshet, J., & Srebro, N. (2011). Explicit Approximations of the Gaussian Kernel. arXiv:1109.4603.
- Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.