Random Features for Large Scale Kernel Machines



1 Random Features for Large Scale Kernel Machines
Andrea Vedaldi
From: A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Proc. NIPS, 2007.

2 Explicit feature maps
Fast algorithms exist for linear kernels:
- one-slack SVM (SVM-perf)
- primal SVM (liblinear)
- stochastic SVM (PEGASOS)
- PCA
They are slow for a non-linear kernel K.
However, any kernel is linear in an appropriate feature space.
Can we get a finite-dimensional feature map approximating the kernel?

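3 Random Fourier features 1/2
Translation invariant kernel: $K(x, y) = K(y - x)$.
Bochner: if K is PD and translation invariant, then
$K(y - x) = \int e^{-i \lambda^\top (y - x)} \, \kappa(\lambda) \, d\lambda$
for some real, symmetric, non-negative measure $\kappa(\lambda) \, d\lambda$.
Bochner as expected value: if $K(0) = 1$ (normalized), then $\kappa$ is a probability density and
$K(y - x) = E_{\lambda \sim \kappa}\left[ \overline{e^{-i \lambda^\top x}} \, e^{-i \lambda^\top y} \right]$.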

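4 Random Fourier features 2/2
Random Fourier features: obtain an approximate feature map by random sampling, replacing the expected value with an average over D sampled frequencies.
Gaussian random Fourier features, MATLAB pseudocode:
    function psix = gaussianrandomfeatures(x, D)
    % x is assumed to hold one d-dimensional point per column
    d = size(x, 1) ;
    omega = randn(D, d) ;   % D frequencies, one per row, drawn from N(0, I)
    psix = exp(-1i*omega*x) ;
    % the kernel is then approximated by real(psix' * psiy) / D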
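To make the sampling step concrete, here is a minimal Python sketch of Gaussian random Fourier features using only the standard library. It assumes the unit-bandwidth Gaussian kernel exp(-||x - y||^2 / 2), whose frequencies are drawn from N(0, I), and it uses the real-valued cos/sin form of the complex exponential. The function names, the test points, and D = 5000 are illustrative choices, not part of the slides.

```python
import math
import random

def sample_frequencies(d, D, rng):
    """Draw D frequency vectors in R^d from N(0, I)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(D)]

def gaussian_random_features(x, omegas):
    """Map x to [cos(w.x), sin(w.x)] / sqrt(D) for every sampled frequency w."""
    scale = 1.0 / math.sqrt(len(omegas))
    feats = []
    for w in omegas:
        t = sum(wi * xi for wi, xi in zip(w, x))
        feats.append(scale * math.cos(t))
        feats.append(scale * math.sin(t))
    return feats

def gaussian_kernel(x, y):
    """Exact kernel exp(-||x - y||^2 / 2) for comparison."""
    return math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))

rng = random.Random(0)                      # fixed seed, illustrative only
omegas = sample_frequencies(d=3, D=5000, rng=rng)
x = [0.2, -0.5, 1.0]                        # hypothetical test points
y = [0.0, 0.3, 0.4]
zx = gaussian_random_features(x, omegas)
zy = gaussian_random_features(y, omegas)
approx = sum(a * b for a, b in zip(zx, zy))  # linear inner product of features
exact = gaussian_kernel(x, y)
print(approx, exact)
```

The inner product of the two feature vectors equals the average of cos(w.(x - y)) over the sampled frequencies, which is the real part of the complex estimate, so a linear method trained on these features behaves approximately like a Gaussian-kernel method.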

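5 Random Fourier features: Errors
Claim 1 (Uniform convergence of Fourier features). Let $M$ be a compact subset of $R^d$ with diameter $\mathrm{diam}(M)$. Then, for the mapping $z$ defined in Algorithm 1, we have
$\Pr\left[ \sup_{x, y \in M} |z(x)^\top z(y) - k(x, y)| \ge \epsilon \right] \le 2^8 \left( \frac{\sigma_p \, \mathrm{diam}(M)}{\epsilon} \right)^2 \exp\left( -\frac{D \epsilon^2}{4(d + 2)} \right)$,
where $\sigma_p^2 \equiv E_p[\omega^\top \omega]$ is the second moment of the Fourier transform of $k$. Further, $\sup_{x, y \in M} |z(x)^\top z(y) - k(y, x)| \le \epsilon$ with any constant probability when $D = \Omega\left( \frac{d}{\epsilon^2} \log \frac{\sigma_p \, \mathrm{diam}(M)}{\epsilon} \right)$.
Reading the bound: a uniform error bound in a ball (with any fixed probability) requires D random projections, where
- d: data dimensionality
- ε: accuracy required
- σ_p: Gaussian kernel variance
- diam(M): validity range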
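As a rough illustration of how the quantities in Claim 1 interact, the sketch below evaluates the right-hand side of the bound, 2^8 (σ_p diam(M) / ε)^2 exp(-D ε^2 / (4(d + 2))), and solves it for the smallest D that pushes the bound below a chosen failure probability. The function names, the failure probability, and the diameter are hypothetical; σ_p = sqrt(d) assumes the unit-bandwidth Gaussian kernel, whose frequencies are N(0, I).

```python
import math

def uniform_error_bound(eps, D, d, sigma_p, diam):
    """Right-hand side of Claim 1: upper bound on Pr[sup error >= eps]."""
    prefactor = 2 ** 8 * (sigma_p * diam / eps) ** 2
    return prefactor * math.exp(-D * eps ** 2 / (4 * (d + 2)))

def features_needed(eps, fail_prob, d, sigma_p, diam):
    """Smallest integer D for which the bound is at most fail_prob."""
    prefactor = 2 ** 8 * (sigma_p * diam / eps) ** 2
    return math.ceil(4 * (d + 2) / eps ** 2 * math.log(prefactor / fail_prob))

d = 21                      # data dimensionality (as in the CPU dataset)
eps = 0.1                   # accuracy required (hypothetical)
fail_prob = 0.05            # allowed failure probability (hypothetical)
sigma_p = math.sqrt(d)      # E[w.w] = d for w ~ N(0, I)
diam = 2.0                  # diameter of the data region (hypothetical)
D = features_needed(eps, fail_prob, d, sigma_p, diam)
print(D, uniform_error_bound(eps, D, d, sigma_p, diam))
```

This is a worst-case guarantee over the whole region, so the D it returns is expected to be much larger than the values that work in practice.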

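6 Random Binning features 1/3
[Figure: the real line split into randomly shifted bins ..., B-2, B-1, B0, B1, B2, B3, ... of side δ, with shift u and a point x; next to it, the hat kernel of width δ.]
Randomly shifted bins of side δ.
Hat kernel: $K_{\text{hat}}(x, y; \delta) = \max(0, 1 - |x - y| / \delta)$.
x ↦ indicator vector for the bin occupied by x under shift u.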
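The link between shifted bins and the hat kernel can be checked numerically: two points land in the same bin with probability max(0, 1 - |x - y| / δ) when the shift u is uniform on [0, δ]. The sketch below is a minimal Monte Carlo check under that assumption; the function names, the points 0.3 and 0.7, and the number of trials are illustrative.

```python
import math
import random

def bin_index(x, delta, u):
    """Index of the bin of side delta, shifted by u, that contains x."""
    return math.floor((x - u) / delta)

def hat_kernel(x, y, delta):
    """Hat kernel of width delta."""
    return max(0.0, 1.0 - abs(x - y) / delta)

def collision_rate(x, y, delta, trials, rng):
    """Fraction of random shifts u ~ U[0, delta] that put x and y in one bin."""
    hits = 0
    for _ in range(trials):
        u = rng.uniform(0.0, delta)
        if bin_index(x, delta, u) == bin_index(y, delta, u):
            hits += 1
    return hits / trials

rng = random.Random(0)      # fixed seed, illustrative only
estimate = collision_rate(0.3, 0.7, delta=1.0, trials=20000, rng=rng)
exact = hat_kernel(0.3, 0.7, delta=1.0)
print(estimate, exact)
```

Since the inner product of two bin indicator vectors is 1 exactly when the bins coincide, the collision rate is the averaged inner product of the features.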

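7 Random binning features 2/3
Decompose other kernels as averages over hat kernels of width δ:
$K(\Delta) = \int_0^\infty K_{\text{hat}}(\Delta; \delta) \, p(\delta) \, d\delta$, with $\Delta = x - y$.
It's a simple deconvolution problem, solved by $p(\delta) = \delta \, \ddot{K}(\delta)$.
Assuming that $p(\delta) \ge 0$ and that it integrates to one, interpret $p(\delta)$ as a probability density and expand the kernel as
$K(x, y) \approx \frac{1}{P} \sum_{j=1}^{P} \langle \psi(x; u_j, \delta_j), \psi(y; u_j, \delta_j) \rangle$,
where $\psi(x; u, \delta)$ is the indicator vector of the bin occupied by x and
- $u_j$ ~ uniformly at random in $[0, \delta_j]$
- $\delta_j \sim p(\delta)$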
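A worked instance may help here. For the one-dimensional Laplacian kernel K(Δ) = exp(-|Δ|), the second derivative gives p(δ) = δ exp(-δ), which is a Gamma density with shape 2 and scale 1. The sketch below samples P pairs (δ_j, u_j) accordingly and estimates the kernel as the fraction of partitions that bin both points together. The kernel choice, function names, and test points are assumptions made for illustration.

```python
import math
import random

def sample_partitions(P, rng):
    """Draw P pairs (delta, u): delta ~ Gamma(2, 1), then u ~ U[0, delta]."""
    parts = []
    for _ in range(P):
        delta = rng.gammavariate(2.0, 1.0)
        parts.append((delta, rng.uniform(0.0, delta)))
    return parts

def binning_features(x, parts):
    """Bin index occupied by x in each of the P random partitions."""
    return [math.floor((x - u) / delta) for delta, u in parts]

def kernel_estimate(fx, fy):
    """Fraction of partitions in which the two points share a bin."""
    return sum(1 for a, b in zip(fx, fy) if a == b) / len(fx)

rng = random.Random(0)      # fixed seed, illustrative only
parts = sample_partitions(P=20000, rng=rng)
x, y = 0.2, 1.0             # hypothetical test points
estimate = kernel_estimate(binning_features(x, parts), binning_features(y, parts))
exact = math.exp(-abs(x - y))
print(estimate, exact)
```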

8 Random binning features 3/3
Separable kernel: represent it as the probability of being binned together simultaneously in the d dimensions.
Remarks
- The feature map has very high dimension, but it is very sparse.
- It must be stored in a hash structure.
- Convergence rates are similar to those of the random Fourier features.
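The remarks about sparsity and hashing can be sketched as follows, assuming the separable Laplacian kernel exp(-||x - y||_1), for which each coordinate gets its own bin width δ ~ Gamma(2, 1) and shift u ~ U[0, δ]. Each of the P grids contributes exactly one nonzero feature, identified by a hashable key (grid number, tuple of bin indices), so a Python set stands in for the hash structure. All names and values are illustrative.

```python
import math
import random

def sample_grids(P, d, rng):
    """Draw P random grids; each grid holds one (delta, u) pair per dimension."""
    grids = []
    for _ in range(P):
        grid = []
        for _ in range(d):
            delta = rng.gammavariate(2.0, 1.0)
            grid.append((delta, rng.uniform(0.0, delta)))
        grids.append(grid)
    return grids

def sparse_features(x, grids):
    """Keys of the nonzero features of x: one (grid number, bin tuple) per grid."""
    keys = set()
    for j, grid in enumerate(grids):
        bins = tuple(math.floor((xi - u) / delta)
                     for xi, (delta, u) in zip(x, grid))
        keys.add((j, bins))
    return keys

def kernel_estimate(keys_x, keys_y, P):
    """Share of grids in which both points fall in the same d-dimensional bin."""
    return len(keys_x & keys_y) / P

rng = random.Random(0)      # fixed seed, illustrative only
P = 20000
grids = sample_grids(P, d=2, rng=rng)
x = [0.1, 0.4]              # hypothetical test points
y = [0.5, 0.2]
kx = sparse_features(x, grids)
ky = sparse_features(y, grids)
estimate = kernel_estimate(kx, ky, P)
exact = math.exp(-sum(abs(a - b) for a, b in zip(x, y)))
print(estimate, exact)
```

Only P keys are stored per point, regardless of how many bins exist in total, which is why a hash structure is a natural fit.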

9 Experiments
Each cell gives the error and the running time, followed by the number of features (D for Fourier, P for binning) or, in the Exact SVM column, the solver used.

Dataset                                                         Fourier+LS                 Binning+LS               CVM                              Exact SVM
CPU (regression, 6500 instances, 21 dims)                       3.6%, 20 secs, D = 300     5.3%, 3 mins, P = 350    5.5%, 51 secs                    11%, 31 secs, ASVM
Census (regression, 18,000 instances, 119 dims)                 5%, 36 secs, D = 500       7.5%, 19 mins, P = 30    8.8%, 7.5 mins                   9%, 13 mins, SVMTorch
Adult (classification, 32,000 instances, 123 dims)              14.9%, 9 secs, D = 500     15.3%, 1.5 mins, P = 30  14.8%, 73 mins                   15.1%, 7 mins, SVMlight
Forest Cover (classification, 522,000 instances, 54 dims)       11.6%, 71 mins, D = 5000   2.2%, 25 mins, P = 50    2.3%, 7.5 hrs                    2.2%, 44 hrs, libsvm
KDDCUP99 (see footnote) (classification, 4,900,000 instances,   7.3%, 1.5 min, D = 50      7.3%, 35 mins, P = 10    6.2% (18%), 1.4 secs (20 secs)   8.3%, < 1 s, SVM+sampling
  127 dims)

[Figure: testing error (% error) and training+testing time (sec), plotted against training set size and against P.]
