Approximation of Gaussian Kernel
We explain the motivation for the kernel method in machine learning theory and how one might go about overcoming its computational challenges.
-
The concept of a kernel plays a crucial role when dealing with massive datasets, as it provides a computationally feasible method to encapsulate non-linearity.
-
There are several popular types of kernels, including the Gaussian Kernel (RBF), the polynomial kernel, and the sigmoid kernel, each with its own advantages.
-
Essentially, kernels serve as a valuable tool to navigate computational limitations while still capturing the intricate non-linearity inherent in data. However, it’s important to note that approximations may be necessary in order to fully harness the power of kernels.
-
In this article, we aim to understand the significance of kernel methods and survey the theoretical underpinnings of Gaussian Kernel approximations, focusing on Taylor approximation and random features.
Motivation
-
Let’s take a look at the kernel (soft) SVM optimization problem:
$$\min_{w\in\mathcal{H}} p(w) = \frac{\lambda}{2}\|w\|_{\mathcal{H}}^{2} + \frac{1}{m}\sum_{i=1}^{m}\bigl(1-y_i\langle w,\phi(x_i)\rangle\bigr)_{+}$$
Or, more generally,
$$\min_{w}\Bigl(f\bigl(\langle w,\phi(x_1)\rangle,\dots,\langle w,\phi(x_m)\rangle\bigr)+R(\|w\|)\Bigr)$$
where $R$ is a monotonically nondecreasing function. According to the Representer theorem (whose proof uses the orthogonal decomposition of the Hilbert space and the monotonicity of $R$), the solution takes the form $w=\sum_{i=1}^{m}\alpha_i\phi(x_i)$, so that $f(x)=\langle w,\phi(x)\rangle=\sum_{i=1}^{m}\alpha_i K(x_i,x)$. This implies that the minimizer can be expressed entirely in terms of $K(x_i,x_j):=\langle\phi(x_i),\phi(x_j)\rangle$. However, this might still be unsatisfactory. For instance, if a single kernel evaluation takes $\Theta(d)$ operations, then evaluating $f$ at one point requires $\Theta(md)$ operations in total.
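To make the $\Theta(md)$ prediction cost concrete, here is a minimal sketch in Python/NumPy (the names `gaussian_kernel`, `predict_kernel_expansion`, `alpha`, `X_train`, and `sigma` are illustrative, not from the original): each prediction with the kernel expansion $f(x)=\sum_{i=1}^{m}\alpha_i K(x_i,x)$ touches all $m$ training points.

```python
import numpy as np

def gaussian_kernel(x, x_prime, sigma):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); one evaluation costs Theta(d)
    return np.exp(-np.sum((x - x_prime) ** 2) / (2 * sigma**2))

def predict_kernel_expansion(x, X_train, alpha, sigma):
    # f(x) = sum_i alpha_i K(x_i, x): Theta(m d) work per test point
    return sum(a * gaussian_kernel(xi, x, sigma) for a, xi in zip(alpha, X_train))

# toy usage with made-up data; alpha would come from solving the (dual) soft SVM
rng = np.random.default_rng(0)
m, d = 500, 20
X_train, alpha = rng.normal(size=(m, d)), rng.normal(size=m)
print(predict_kernel_expansion(rng.normal(size=d), X_train, alpha, sigma=1.0))
```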
-
If we can find a way to replace $\phi:\mathcal{X}\to\mathcal{H}$ with a finite-dimensional map $\tilde\phi:\mathbb{R}^{d}\to\mathbb{R}^{D}$ and define $\tilde f(x)=\langle\tilde w,\tilde\phi(x)\rangle$, we can potentially reduce the prediction cost to $\Theta(D)$.
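Continuing the sketch above (again with hypothetical names), prediction with an explicit finite-dimensional feature map is a single $D$-dimensional dot product once $\tilde w$ has been trained, independent of $m$; the Taylor features and random Fourier features below are two concrete choices of `feature_map`.

```python
import numpy as np

def predict_explicit(x, w_tilde, feature_map):
    # f_tilde(x) = <w_tilde, phi_tilde(x)>: the dot product costs Theta(D),
    # plus whatever the chosen feature_map itself costs to evaluate
    return np.dot(w_tilde, feature_map(x))
```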
-
Notice how this bears resemblance to the method of weighted sums and quadrature points commonly employed for numerically approximating integrals in partial differential equations.
Approximation method 1: Taylor approximation
-
Motivation
-
The following theorem illustrates how the concept of projection can achieve this reduction in computational cost.
Theorem 1. Let $p^{*}=\inf_{w}p(w)$ be the optimal value of the soft SVM, let $P$ be an orthogonal projection on $\mathcal{H}$, define $\tilde\phi(x)=P\phi(x)$ with $\tilde K(x,x'):=\langle\tilde\phi(x),\tilde\phi(x')\rangle$, and let $\tilde p^{*}=\inf_{\tilde w}\tilde p(\tilde w)$ be the optimal value of the projected problem. Then we have
$$p^{*}\;\le\;\tilde p^{*}\;\le\;p^{*}+\frac{1}{m}\sqrt{\frac{2}{\lambda}}\,\sum_{i=1}^{m}\sqrt{K(x_i,x_i)-\tilde K(x_i,x_i)}$$
Proof. Note that since $P$ is an orthogonal projection, we have $\|Pw\|\le\|w\|$. Further, for any $w$ we have
$$\langle Pw,\phi(x_i)\rangle=\langle w,P^{*}\phi(x_i)\rangle=\langle w,P\phi(x_i)\rangle=\langle w,\tilde\phi(x_i)\rangle,$$
and so $p(Pw)\le\tilde p(w)$ for every $w$. Taking the infimum over $w$ implies that $p^{*}\le\tilde p^{*}$, which establishes the first inequality.
For the second inequality, note that
$$\bigl|(1-y_i\langle w,\phi(x_i)\rangle)_{+}-(1-y_i\langle w,P\phi(x_i)\rangle)_{+}\bigr|\le\|w\|\,\|P^{\perp}\phi(x_i)\|,$$
since the hinge function is 1-Lipschitz and by Cauchy-Schwarz. Now note that, again using the orthogonality of the Hilbert space, $\|P^{\perp}\phi(x_i)\|^{2}=\|\phi(x_i)\|^{2}-\|P\phi(x_i)\|^{2}=K(x_i,x_i)-\tilde K(x_i,x_i)$. Moreover, comparing the minimizer $w^{*}$ of $p$ with $w=0$ gives $\frac{\lambda}{2}\|w^{*}\|^{2}\le p(w^{*})\le p(0)\le 1$, so $\|w^{*}\|\le\sqrt{2/\lambda}$. Therefore
$$\tilde p^{*}\le\tilde p(Pw^{*})\le p(w^{*})+\frac{1}{m}\sum_{i=1}^{m}\|w^{*}\|\,\|P^{\perp}\phi(x_i)\|\le p^{*}+\frac{1}{m}\sqrt{\frac{2}{\lambda}}\sum_{i=1}^{m}\sqrt{K(x_i,x_i)-\tilde K(x_i,x_i)},$$
which gives the second inequality. ◻
-
This suggests that choosing $P$ to minimize $\sum_{i=1}^{m}\|\phi(x_i)-P\phi(x_i)\|$ yields a good approximation, and the Taylor approximation provides one such choice.
-
-
Taylor approximation
-
For the Gaussian Kernel $K(x,x')=\exp\left(-\frac{\|x-x'\|^{2}}{2\sigma^{2}}\right)$, we first find $\phi$, truncate it, and use Taylor’s theorem with remainder to obtain a bound.
-
First note that we have
$$K(x,x')=\langle\phi(x),\phi(x')\rangle=\sum_{k=0}^{\infty}\sum_{j\in[d]^{k}}\phi_{k,j}(x)\,\phi_{k,j}(x').$$
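Spelling this out (a standard computation, consistent with the truncated kernel below): expanding $\exp\bigl(\langle x,x'\rangle/\sigma^{2}\bigr)$ as a power series gives an explicit form for the components $\phi_{k,j}$, indexed by the order $k$ and the tuple $j=(j_1,\dots,j_k)\in[d]^{k}$:
$$K(x,x')=\exp\Bigl(-\tfrac{\|x\|^{2}+\|x'\|^{2}}{2\sigma^{2}}\Bigr)\sum_{k=0}^{\infty}\frac{1}{k!}\Bigl(\tfrac{\langle x,x'\rangle}{\sigma^{2}}\Bigr)^{k}=\sum_{k=0}^{\infty}\sum_{j\in[d]^{k}}\underbrace{e^{-\frac{\|x\|^{2}}{2\sigma^{2}}}\frac{x_{j_1}\cdots x_{j_k}}{\sigma^{k}\sqrt{k!}}}_{\phi_{k,j}(x)}\;\underbrace{e^{-\frac{\|x'\|^{2}}{2\sigma^{2}}}\frac{x'_{j_1}\cdots x'_{j_k}}{\sigma^{k}\sqrt{k!}}}_{\phi_{k,j}(x')}.$$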
-
This is like an inner product on a space of dimension $\sum_{k=0}^{\infty}d^{k}$ (one coordinate per order $k$ and tuple $j\in[d]^{k}$), projected onto the subspace spanned by the coordinates with $k\le r$.
-
This leads to the truncated Kernel
$$\tilde K(x,x')=\langle\tilde\phi(x),\tilde\phi(x')\rangle=\exp\left(-\frac{\|x\|^{2}+\|x'\|^{2}}{2\sigma^{2}}\right)\sum_{k=0}^{r}\frac{1}{k!}\left(\frac{\langle x,x'\rangle}{\sigma^{2}}\right)^{k}$$
-
By Taylor’s theorem with remainder, we have
$$\bigl|K(x,x')-\tilde K(x,x')\bigr|\le\frac{1}{(r+1)!}\left(\frac{\|x\|\,\|x'\|}{\sigma^{2}}\right)^{r+1}$$
-
So we have reduced the problem to a $D=\sum_{k=0}^{r}d^{k}$ dimensional feature space. However, there are duplicates; for example, we treat $j=(1,2,3)$ differently from $j=(3,1,2)$ when in fact they can be treated as the same feature. So $d^{k}$ can be reduced to $\binom{d+k-1}{k}$ (this is the number of ways to solve $\sum_{i=1}^{d}x_i=k$ with $x_i\ge 0$). Summing over $k$ and using Pascal’s rule, we have $D=\binom{d+r}{r}$.
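Here is a minimal numerical sketch of this construction in Python/NumPy (the function name, the brute-force enumeration of multisets, and the toy data are my own, chosen for clarity rather than efficiency). It builds the duplicate-free order-$r$ features, checks that $\langle\tilde\phi(x),\tilde\phi(x')\rangle$ equals the truncated series, verifies the Taylor remainder bound, and confirms the dimension $D=\binom{d+r}{r}$.

```python
import itertools
import math
import numpy as np

def taylor_features(x, r, sigma):
    """Duplicate-free Taylor features of the Gaussian kernel up to order r."""
    d = len(x)
    feats = []
    for k in range(r + 1):
        # multisets of size k over {0,...,d-1}  <->  monomials of degree k
        for idx in itertools.combinations_with_replacement(range(d), k):
            monomial = np.prod(x[list(idx)]) if k > 0 else 1.0
            # number of ordered tuples in [d]^k giving the same monomial
            mult = math.factorial(k) // math.prod(
                math.factorial(idx.count(i)) for i in set(idx))
            feats.append(np.exp(-x @ x / (2 * sigma**2))
                         * math.sqrt(mult / math.factorial(k)) * monomial / sigma**k)
    return np.array(feats)

rng = np.random.default_rng(0)
d, r, sigma = 3, 4, 1.0
x, y = rng.normal(size=d), rng.normal(size=d)

K_exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))
K_trunc = np.exp(-(x @ x + y @ y) / (2 * sigma**2)) * sum(
    (x @ y / sigma**2) ** k / math.factorial(k) for k in range(r + 1))

phi_x, phi_y = taylor_features(x, r, sigma), taylor_features(y, r, sigma)
print(len(phi_x) == math.comb(d + r, r))        # D = C(d + r, r)
print(np.isclose(phi_x @ phi_y, K_trunc))       # inner product = truncated kernel
print(abs(K_exact - K_trunc)                    # Taylor remainder bound holds
      <= (np.linalg.norm(x) * np.linalg.norm(y) / sigma**2) ** (r + 1)
      / math.factorial(r + 1))
```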
-
Approximation method 2: Random features
-
The Gaussian Kernel is shift-invariant, so (with a slight abuse of notation) we can write $K(x,x')=K(x-x')$, and it falls into the Schwartz class of functions. As such, we may use the Fourier inversion formula, given by
$$K(x-x')=\int_{\mathbb{R}^{d}}\hat K(w)\exp\bigl(2\pi i\,(x-x')\cdot w\bigr)\,dw=\int_{\mathbb{R}^{d}}\operatorname{Re}\bigl(\hat K(w)\bigr)\cos\bigl(2\pi(x-x')\cdot w\bigr)-\operatorname{Im}\bigl(\hat K(w)\bigr)\sin\bigl(2\pi(x-x')\cdot w\bigr)\,dw,$$
where the second equality holds since $K$ is real-valued. Now $\int_{\mathbb{R}^{d}}\operatorname{Im}(\hat K(w))\sin(2\pi(x-x')\cdot w)\,dw=0$ since $K(x-x')=K(-(x-x'))$. We thus have, appropriately scaling $\hat K$ to a probability density (for the Gaussian Kernel, $\hat K$ is itself a Gaussian density), absorbing the factor $2\pi$ into $w$, and writing $\hat K$ for its real part for ease of notation,
$$K(x-x')=\mathbb{E}_{w\sim\hat K}\bigl[\cos\bigl(w\cdot(x-x')\bigr)\bigr]=\mathbb{E}_{w\sim\hat K}\bigl[\cos(w\cdot x+\theta)\cos(w\cdot x'+\theta)+\sin(w\cdot x+\theta)\sin(w\cdot x'+\theta)\bigr]$$
for any fixed $\theta$, by the cosine angle-difference identity.
-
On the other hand, by the product-to-sum identities and since $\mathbb{E}_{\theta\sim U[0,2\pi]}\bigl[\cos(w\cdot(x+x')+2\theta)\bigr]=0$, we have
$$\mathbb{E}_{\theta\sim U[0,2\pi]}\bigl[\sin(w\cdot x+\theta)\sin(w\cdot x'+\theta)\bigr]=\mathbb{E}_{\theta\sim U[0,2\pi]}\bigl[\cos(w\cdot x+\theta)\cos(w\cdot x'+\theta)\bigr]=\tfrac{1}{2}\cos\bigl(w\cdot(x-x')\bigr).$$
In conclusion, we have
$$K(x-x')=2\,\mathbb{E}_{w\sim\hat K,\;\theta\sim U[0,2\pi]}\bigl[\cos(w\cdot x+\theta)\cos(w\cdot x'+\theta)\bigr].$$
-
We can then employ the Monte Carlo method to approximate $K$ using this equality.
-
The feature map is defined as $\tilde\phi_{j}(x)=\sqrt{\tfrac{2}{D}}\cos(w_{j}\cdot x+\theta_{j})$ for $j=1,\dots,D$, where $w_{j}\sim\hat K$ and $\theta_{j}\sim U([0,2\pi])$ are drawn independently, so that $\langle\tilde\phi(x),\tilde\phi(x')\rangle$ is an unbiased Monte Carlo estimate of $K(x-x')$.
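A minimal sketch of this construction in Python/NumPy (the function name and toy data are my own; the sampling distribution $w\sim\mathcal{N}(0,\sigma^{-2}I)$ is the scaled Fourier transform of the Gaussian Kernel used above):

```python
import numpy as np

def random_fourier_features(X, D, sigma, rng):
    """Map rows of X to D random Fourier features approximating the Gaussian kernel."""
    d = X.shape[1]
    # For K(delta) = exp(-||delta||^2 / (2 sigma^2)), the normalized Fourier
    # transform is the density of N(0, sigma^{-2} I), so we sample w from it.
    W = rng.normal(scale=1.0 / sigma, size=(D, d))
    theta = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + theta)

rng = np.random.default_rng(0)
d, D, sigma = 5, 2000, 1.0
X = rng.normal(size=(2, d))

Z = random_fourier_features(X, D, sigma, rng)
approx = Z[0] @ Z[1]                         # <phi_tilde(x), phi_tilde(x')>
exact = np.exp(-np.sum((X[0] - X[1]) ** 2) / (2 * sigma**2))
print(exact, approx)                         # close for large D
```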
Fact 1 (Rahimi and Recht, uniform bound for the random features method). Let $\tilde K$ be the kernel defined by $D$ random Fourier features. For any $\epsilon>0$, we have
$$\mathbb{P}\left[\sup_{\|x\|,\|y\|\le R}\bigl|K(x,y)-\tilde K(x,y)\bigr|\ge\epsilon\right]\le 2^{8}\,\frac{d(\sigma R)^{2}}{\epsilon^{2}}\exp\left(-\frac{D\epsilon^{2}}{4(2+d)}\right)$$
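Read the other way (under the bound as stated above), this tells us how many features suffice: to guarantee $\sup_{\|x\|,\|y\|\le R}|K(x,y)-\tilde K(x,y)|\le\epsilon$ with probability at least $1-\delta$, it is enough to take
$$D\;\ge\;\frac{4(2+d)}{\epsilon^{2}}\,\log\left(\frac{2^{8}\,d(\sigma R)^{2}}{\delta\,\epsilon^{2}}\right),$$
so $D$ needs to grow only logarithmically in $1/\delta$ and roughly linearly in the input dimension $d$.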
Conclusion
- We surveyed the Kernel method within the realm of machine learning theory and how one might go about overcoming its computational challenges via approximations of the Gaussian Kernel.
References
- Cotter, A., Keshet, J., & Srebro, N. (2011). Explicit Approximations of the Gaussian Kernel. arXiv:1109.4603.
- Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.