Fastfood - Approximating Kernel Expansions in Loglinear Time


1 Fastfood - Approximating Kernel Expansions in Loglinear Time
Quoc Le, Tamás Sarlós, Alex Smola. ICML 2013.
Machine Learning Journal Club, Gatsby. May 16, 2014.

2 Notations
Given: domain $X$, kernel $k$. Feature map $\phi : X \to H$ with
$k(x, x') = \langle \phi(x), \phi(x') \rangle_H$.  (1)
Representer theorem: for many tasks (SVM, ...)
$w = \sum_{i=1}^{N} \alpha_i \phi(x_i)$.  (2)
Consequence: decision function
$f(x) = \langle w, \phi(x) \rangle_H = \sum_{i=1}^{N} \alpha_i k(x_i, x)$.  (3)
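As a concrete illustration of the decision function (3), here is a minimal numpy sketch (the names gauss_kernel and decision_function, and the random data, are purely illustrative) that evaluates $f(x) = \sum_i \alpha_i k(x_i, x)$ for the Gaussian kernel used in the rest of the talk:

    import numpy as np

    def gauss_kernel(X, Y, sigma):
        """Gaussian kernel matrix, k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2))

    def decision_function(x, X_train, alpha, sigma):
        """f(x) = sum_i alpha_i k(x_i, x), cf. eq. (3)."""
        return gauss_kernel(X_train, x[None, :], sigma)[:, 0] @ alpha

    rng = np.random.default_rng(0)
    X_train, alpha = rng.standard_normal((10, 4)), rng.standard_normal(10)
    print(decision_function(rng.standard_normal(4), X_train, alpha, sigma=1.5))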

3 Random kitchen sinks ($X = R^d$)
Bochner theorem: $k$ continuous, shift invariant $\Rightarrow$
$k(x - x', 0) = \int_{R^d} e^{i \langle z, x - x' \rangle}\, d\lambda(z)$, $\lambda \in M_+(R^d)$  (4)
$= \int_{R^d} \phi_z(x)\, \overline{\phi_z(x')}\, d\lambda(z)$, where $\phi_z(x) = e^{i \langle z, x \rangle}$.  (5)
Assumption: $\lambda$ is a probability measure (normalization).
Trick: $\hat{k}(x - x', 0) = \frac{1}{n} \sum_{j=1}^{n} e^{i \langle z_j, x - x' \rangle}$, $z_j \sim \lambda$.  (6)

4 Random kitchen sinks - continued
Specifically, for Gaussians:
$k(x - x', 0) = e^{-\|x - x'\|^2/(2\sigma^2)}$, $\lambda(z) = N(z; 0, I_d/\sigma^2)$,  (7)
$k(x - x', 0) \approx \langle \hat\phi(x), \hat\phi(x') \rangle = \hat\phi(x)^* \hat\phi(x')$,  (8)
$\hat\phi(x) = \frac{1}{\sqrt{n}}\, e^{i Z x} \in C^n$,  (9)
$Z = \big[Z_{ab} \sim N(0, \sigma^{-2})\big] \in R^{n \times d}$.  (10)
Properties: O(nd) CPU, O(nd) RAM.
Idea (fastfood): do not store $Z$, only the fast generators of $\hat{Z}$.
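A minimal numpy sketch of the random kitchen sinks map (7)-(10), assuming the complex-exponential features of eq. (9); the function name rks_features and the sanity check are illustrative only:

    import numpy as np

    def rks_features(X, n_features, sigma, rng=None):
        """Random kitchen sinks map for the Gaussian kernel, cf. eqs. (7)-(10):
        phi_hat(x) = exp(i Z x) / sqrt(n), with Z_ab ~ N(0, 1/sigma^2)."""
        rng = np.random.default_rng(rng)
        Z = rng.standard_normal((n_features, X.shape[1])) / sigma  # stored: O(nd) RAM
        return np.exp(1j * X @ Z.T) / np.sqrt(n_features)          # O(nd) CPU per point

    # sanity check: phi_hat(x)^* phi_hat(x') ~ exp(-||x - x'||^2 / (2 sigma^2))
    rng = np.random.default_rng(0)
    X, sigma = rng.standard_normal((4, 8)), 2.0
    Phi = rks_features(X, n_features=20000, sigma=sigma, rng=1)
    K_hat = (Phi.conj() @ Phi.T).real
    K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / (2 * sigma ** 2))
    print(np.abs(K_hat - K).max())   # small; decays like 1/sqrt(n_features)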

5 Fastfood construction: n = d ($d = 2^l$; otherwise padding)
$V = \frac{1}{\sigma\sqrt{d}}\, S H G P H B$,  (11)
where
$G$: diag of i.i.d. $N(0,1)$ entries, $\in R^{d \times d}$.
$P$: random permutation matrix $\in \{0,1\}^{d \times d}$.
$B$: diag(Bernoulli) $\in R^{d \times d}$, $B_{ii} \in \{-1, 1\}$.
$H = H_d$: Walsh-Hadamard (WH) transformation $\in R^{d \times d}$,
$H_1 = 1$, $H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$, $H_{2^{k+1}} = \begin{bmatrix} H_{2^k} & H_{2^k} \\ H_{2^k} & -H_{2^k} \end{bmatrix}$, i.e. $H_{2^k} = (H_2)^{\otimes k}$.
$S$: $\mathrm{diag}\big(s_i \|G\|_F^{-1}\big)$ with $s_i \sim (2\pi)^{-d/2} A_{d-1}\, r^{d-1} e^{-r^2/2}$, $A_{d-1} = \frac{2\pi^{d/2}}{\Gamma(d/2)}$.

6 Fastfood construction: n > d (assumption: d divides n)
We stack n/d independent copies together:
$V = [V_1; \ldots; V_{n/d}] = \hat{Z}$.  (12)
Intuition of $V_j = \frac{1}{\sigma\sqrt{d}} S H G P H B$ (see the sketch below):
$\frac{1}{\sqrt{d}} HB$: acts as an isometry, which makes the input denser.
$P$: ensures the incoherence of the two H-s.
$H, G$: WH transforms with a diagonal Gaussian in between $\approx$ dense Gaussian.
$S$: makes the length distributions of the rows of $V$ independent.
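A self-contained numpy sketch of the construction (11)-(12), combined with the complex-exponential features of eq. (9); the helpers fwht, fastfood_block and fastfood_features are illustrative names, and the scaling matrix S is drawn as on slide 5 ($s_i$ distributed as the norm of a d-dimensional standard normal vector):

    import numpy as np

    def fwht(x):
        """Fast Walsh-Hadamard transform along the last axis (length d = 2**l):
        computes H_d @ x for the +-1 Hadamard matrix in O(d log d)."""
        x = np.array(x, dtype=float, copy=True)
        d = x.shape[-1]
        h = 1
        while h < d:
            for i in range(0, d, 2 * h):
                a = x[..., i:i + h].copy()
                b = x[..., i + h:i + 2 * h].copy()
                x[..., i:i + h] = a + b
                x[..., i + h:i + 2 * h] = a - b
            h *= 2
        return x

    def fastfood_block(d, rng):
        """Draw (B, P, G, S) for one block; only O(d) numbers are stored."""
        B = rng.choice([-1.0, 1.0], size=d)         # diag(+-1)
        P = rng.permutation(d)                      # permutation, kept as an index array
        G = rng.standard_normal(d)                  # diag(N(0, 1))
        s = np.sqrt(rng.chisquare(df=d, size=d))    # s_i ~ chi_d = ||N(0, I_d)||
        S = s / np.linalg.norm(G)                   # S_ii = s_i / ||G||_F
        return B, P, G, S

    def fastfood_features(X, n_blocks, sigma, rng=None):
        """phi_hat(x) = exp(i V x) / sqrt(n), with V_j = (1/(sigma sqrt(d))) S H G P H B."""
        rng = np.random.default_rng(rng)
        d = X.shape[1]                    # assumed to be a power of two (else zero-pad)
        blocks = []
        for _ in range(n_blocks):
            B, P, G, S = fastfood_block(d, rng)
            T = fwht(X * B)               # H B x      -- O(d log d) per point
            T = G * T[:, P]               # G P H B x
            T = fwht(T)                   # H G P H B x
            blocks.append(S * T / (sigma * np.sqrt(d)))
        V_x = np.concatenate(blocks, axis=1)    # arguments V x for all points, shape (N, n_blocks*d)
        return np.exp(1j * V_x) / np.sqrt(V_x.shape[1])

    # sanity check against the exact Gaussian kernel
    rng = np.random.default_rng(0)
    X, sigma = rng.standard_normal((4, 16)), 2.0
    Phi = fastfood_features(X, n_blocks=512, sigma=sigma, rng=1)
    K_hat = (Phi.conj() @ Phi.T).real
    K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / (2 * sigma ** 2))
    print(np.abs(K_hat - K).max())        # small for large n_blocks

Only B, P, G and S (O(d) numbers per block) are stored; the dense Gaussian Z of random kitchen sinks is never formed.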

7 Fastfood: computational efficiency
$G, B, S$: generate them once and store; RAM: O(n), cost of multiplication: O(n).
$P$: O(n) storage, O(n) computation (lookup table).
$H_d$: do not store; $H_d x$ costs O(d log d) per block, n/d blocks $\Rightarrow$ O(n log d).
To sum up: kitchen sinks CPU O(nd), RAM O(nd) vs. fastfood CPU O(n log d), RAM O(n).

8 Walsh-Hadamard transformation: symmetry, orthogonality
Definition: $H_1 = 1$, $H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$, $H_{2^{k+1}} = \begin{bmatrix} H_{2^k} & H_{2^k} \\ H_{2^k} & -H_{2^k} \end{bmatrix} = (H_2)^{\otimes (k+1)}$.

9 Walsh-Hadamard transformation: symmetry, orthogonality
Definition: $H_1 = 1$, $H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$, $H_{2^{k+1}} = \begin{bmatrix} H_{2^k} & H_{2^k} \\ H_{2^k} & -H_{2^k} \end{bmatrix} = (H_2)^{\otimes (k+1)}$.
Symmetry, orthogonality ($d = 2^k$):
$H_d = H_d^T$, $\quad H_d H_d^T = d I$ (i.e., $\frac{1}{\sqrt{d}} H_d$ is orthogonal).  (13)

10 Walsh-Hadamard transformation: symmetry, orthogonality
Definition: $H_1 = 1$, $H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$, $H_{2^{k+1}} = \begin{bmatrix} H_{2^k} & H_{2^k} \\ H_{2^k} & -H_{2^k} \end{bmatrix} = (H_2)^{\otimes (k+1)}$.
Symmetry, orthogonality ($d = 2^k$):
$H_d = H_d^T$, $\quad H_d H_d^T = d I$ (i.e., $\frac{1}{\sqrt{d}} H_d$ is orthogonal).  (13)
Proof: $H_1$, $H_2$: OK. For the induction step,
$[H_{2^{k+1}}]^T = [H_{2^k} \otimes H_2]^T = H_{2^k}^T \otimes H_2^T = H_{2^k} \otimes H_2 = H_{2^{k+1}}$,
$H_{2^{k+1}} H_{2^{k+1}}^T = (H_{2^k} \otimes H_2)(H_{2^k} \otimes H_2)^T = (H_{2^k} \otimes H_2)(H_{2^k}^T \otimes H_2^T) = (H_{2^k} H_{2^k}^T) \otimes (H_2 H_2^T) = (2^k I) \otimes (2 I) = 2^{k+1} I$,
using $(A \otimes B)^T = A^T \otimes B^T$ and $(A \otimes B)(C \otimes D) = AC \otimes BD$.  (14)

11 Walsh-Hadamard transformation: spectral norm
We have seen ($d = 2^k$): $H_d H_d^T = d I$.
Spectral norm: $\|H_d\|_2 = \sqrt{\lambda_{\max}(H_d^T H_d)} = \sqrt{\lambda_{\max}(d I)} = \sqrt{d}$.  (15)
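A quick numeric check of (13) and (15), using scipy's Sylvester-construction Hadamard matrix (the same recursion as above); illustrative only:

    import numpy as np
    from scipy.linalg import hadamard

    d = 16                                               # d = 2**4
    H = hadamard(d).astype(float)                        # +-1 Walsh-Hadamard matrix
    print(np.array_equal(H, H.T))                        # symmetry: True
    print(np.allclose(H @ H.T, d * np.eye(d)))           # H H^T = d I: True
    print(np.isclose(np.linalg.norm(H, 2), np.sqrt(d)))  # spectral norm = sqrt(d): True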

12 Goal (n = d)
Unbiasedness:
$E\big[\hat{k}(x, x')\big] = E\big[\hat\phi(x)^T \hat\phi(x')\big] = e^{-\|x - x'\|^2/(2\sigma^2)} = k(x, x')$.  (16)
Concentration:
$P\big[\,|\hat{k}(x, x') - k(x, x')| \ge a\,\big] \le b$.  (17)

13 Goal - continued
Low variance (one block): $v = \frac{x - x'}{\sigma}$, $\psi_j(v) = \cos\!\Big(\frac{[HGPHB\, v]_j}{\sqrt{d}}\Big)$, $j \in [d]$:
$\mathrm{var}[\psi_j(v)] = \frac{1}{2}\big(1 - e^{-\|v\|^2}\big)^2$,  (18)
$\mathrm{var}\Big[\frac{1}{d}\sum_{j=1}^{d}\psi_j(v)\Big] \le \frac{1}{2d}\big(1 - e^{-\|v\|^2}\big)^2 + \frac{C(\|v\|)}{d}$,  (19)
$C(\alpha) = 6\alpha^4\Big(e^{-\alpha^2} + \frac{\alpha^2}{3}\Big)$.  (20)
For the full feature map:
$\mathrm{var}\big[\hat\phi(x)^T \hat\phi(x')\big] \le \frac{\big(1 - e^{-\|v\|^2}\big)^2}{2n} + \frac{C(\|v\|)}{n}$.  (21)
Proof: $\hat\phi(x)^T \hat\phi(x')$ is the average of n/d independent block estimates, so the block bound (19) is divided by n/d.
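A quick Monte Carlo check of the single-feature variance (18), using only the fact (established on slide 25 below) that each $[HGPHBv]_j/\sqrt{d}$ is conditionally $N(0, \|v\|^2)$; all numbers are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    v_norm = 0.8                                     # ||v|| = ||x - x'|| / sigma
    z = v_norm * rng.standard_normal(1_000_000)      # z ~ N(0, ||v||^2)
    print(np.cos(z).var(), 0.5 * (1 - np.exp(-v_norm ** 2)) ** 2)  # both ~ 0.112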

14 Towards unbiasedness: $E([HGPHB]_{ij})$
Let $M := HGPHB$. Then
$E(M_{ij}) = 0$  (22)
since, with $H_i^T$ the i-th row of H and $H_j$ the j-th column of H,
$M_{ij} = (H_i^T)\, G P\, (H_j B_{jj})$;
conditioned on $P, B$, $M_{ij}$ is a sum of independent $N(0,1)$-s with $\pm$ sign changes, so
$E(M_{ij}) = E[E(M_{ij} \mid P, B)] = E(0) = 0$.

15 Unbiasedness: $\mathrm{var}([HGPHB]_{ij})$
Last slide: $M_{ij} = (H_i^T)\, G P\, (H_j B_{jj})$, $E(M_{ij}) = 0$.
$\mathrm{var}(M_{ij}) = E\big[M_{ij} M_{ij}^T\big]$
$= E\big[(H_i^T G P H_j B_{jj})(B_{jj} H_j^T P^T G H_i)\big]$
$= E\big[B_{jj}^2\, H_i^T G P e e^T P^T G H_i\big]$
$= E\big[1 \cdot H_i^T G e e^T G H_i\big]$
$= H_i^T E[G^2] H_i = H_i^T I H_i = H_i^T H_i = d$
using $e := [1; \ldots; 1] \in R^d$, $H_j H_j^T \to e e^T$ (only the diagonal matters under $E_G$), $Pe = e$, $E(G e e^T G) = E(G^2)$ ($G$: diagonal), $E(G_{ii}^2) = 1$.

16 Unbiasedness: $\mathrm{cov}([HGPHB]_{ij}, [HGPHB]_{ik})$, $j \ne k$
We have seen: $E(M_{ij}) = 0$, $\mathrm{var}(M_{ij}) = d$.
$\mathrm{cov}(M_{ij}, M_{ik}) = 0$ ($j \ne k$) since
l.h.s. $= E\big(H_i^T G P H_j B_{jj} \cdot H_i^T G P H_k B_{kk}\big)$  (23)
$= E(B_{jj} B_{kk})\, E\big(H_i^T G P H_j \cdot H_i^T G P H_k\big)$,  (24)
$E(B_{jj} B_{kk}) = E(B_{jj})\, E(B_{kk}) = 0 \cdot 0 = 0$  (25)
using that $0 = I\big((B_{jj}, B_{kk}); \text{others}\big) = I(B_{jj}; B_{kk}) = E(B_{uu})$, i.e. independence and zero mean.

17 Unbiasedness
In $V = \frac{1}{\sigma} \frac{1}{\sqrt{d}} HGPHB$ ($V = \frac{1}{\sigma\sqrt{d}} M$):
$E(V_{ij}) = E\Big(\frac{M_{ij}}{\sigma\sqrt{d}}\Big) = 0$,  (26)
$\mathrm{var}(V_{ij}) = \mathrm{var}\Big(\frac{M_{ij}}{\sigma\sqrt{d}}\Big) = \frac{\mathrm{var}(M_{ij})}{\sigma^2 d} = \frac{d}{\sigma^2 d} = \frac{1}{\sigma^2}$,  (27)
$\mathrm{cov}(V_{ij}, V_{ik}) = 0$ ($j \ne k$).  (28)
Thus the distribution of the rows of $V \mid P, B$ is $N\big(0, \frac{I}{\sigma^2}\big)$ $\Rightarrow$ [Rahimi & Recht, 2007] unbiasedness given $P, B$ $\Rightarrow$ unbiasedness.
Note: we need (i) (28) conditioned on $P, B$, but we used $E_B(B_{jj} B_{kk})$; otherwise $V$ is a mixture of Gaussians (MOG), (ii) the independence of the rows.

18 Concentration ($e^{i(\cdot)} \to \cos$, n = d)
Theorem (RBF): let
$\hat{k}(x, x') = \frac{1}{d} \sum_{j=1}^{d} \cos\!\Big(\frac{1}{\sigma\sqrt{d}}\big[HGPHB(x - x')\big]_j\Big)$.  (29)
Then
$P\Big[\,\big|\hat{k}(x, x') - k(x, x')\big| \ge \alpha\,\sqrt{\tfrac{2\log(2/\delta)}{d}}\,\Big] \le 2\delta$  (30)
for all $\delta > 0$, where $\alpha = \frac{\|x - x'\|_2}{\sigma}\sqrt{2\log(2d/\delta)}$.
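A small Monte Carlo sketch of the one-block estimator (29), written with a dense Hadamard matrix for simplicity (scipy's hadamard); it only illustrates that the error $|\hat k - k|$ is typically of order $1/\sqrt{d}$, in line with (30):

    import numpy as np
    from scipy.linalg import hadamard

    def one_block_khat(v, rng):
        """k_hat(v) = (1/d) sum_j cos([HGPHB v]_j / sqrt(d)), with v = (x - x')/sigma."""
        d = v.shape[0]
        H = hadamard(d).astype(float)
        B = rng.choice([-1.0, 1.0], size=d)
        P = rng.permutation(d)
        G = rng.standard_normal(d)
        z = H @ (G * (H @ (B * v))[P]) / np.sqrt(d)
        return np.cos(z).mean()

    rng = np.random.default_rng(0)
    d = 64
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)                            # ||v|| = 1
    k_true = np.exp(-np.linalg.norm(v) ** 2 / 2)      # k(v) = e^{-||v||^2 / 2}
    errs = np.abs([one_block_khat(v, rng) - k_true for _ in range(2000)])
    print(errs.mean(), np.quantile(errs, 0.95))       # both shrink as d grows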

19 Concentration proof
We have already seen: $E[\hat{k}(x, x')] = k(x, x')$.
Lemma (concentration of Gaussian measure; Ledoux 1996): $f: R^d \to R$ Lipschitz continuous with constant $L$, $g \sim N(0, I_d)$. Then
$P\big(|f(g) - E[f(g)]| \ge t\big) \le 2 e^{-t^2/(2L^2)}$.  (31)
Lemma (approximate isometry of HB; Ailon & Chazelle, 2009): $x \in R^d$; $H, B$ as in $V$. For any $\delta > 0$,
$P\Big[\|HBx\|_\infty \ge \|x\|_2\,\sqrt{2\log(2d/\delta)}\Big] \le \delta$.  (32)

20 Concentration proof
Notation: $v = \frac{x - x'}{\sigma}$, $k(v) = k(x, x')$, $\hat{k}(v) = \hat{k}(x, x')$.
Sufficient to prove that
$f(g, P, B) = \frac{1}{d} \sum_{j=1}^{d} \cos(z_j)$, $\quad z = HGu$, $\quad u = P\,\tfrac{1}{\sqrt{d}} HB\, v$  (33)
concentrates around its mean. Idea:
$g \mapsto f(g, P, B)$: Lipschitz $\Rightarrow$ high-probability concentration in $G$ given $B$.
Approximate isometry of HB: high probability in $B$ ($P$ does not matter).
Union bound.

21 Concentration proof
$h(a) = \frac{1}{d} \sum_{j=1}^{d} \cos(a_j)$ ($a \in R^d$),
$|f(g; P, B) - f(g'; P, B)| = \big|h[H \mathrm{diag}(g) u] - h[H \mathrm{diag}(g') u]\big|$,
$|h(a) - h(b)| = \frac{1}{d}\Big|\sum_{j=1}^{d} \cos(a_j) - \cos(b_j)\Big| \le \frac{1}{d}\sum_{j=1}^{d} |\cos(a_j) - \cos(b_j)| \le \frac{1}{d}\sum_{j=1}^{d} |a_j - b_j| = \frac{1}{d}\|a - b\|_1 \le \frac{1}{\sqrt{d}}\|a - b\|_2$,
$\|H \mathrm{diag}(g) u - H \mathrm{diag}(g') u\| \le \|H\|_2\, \|\mathrm{diag}(g - g') u\| = \sqrt{d}\, \|(g - g') \circ u\| \le \sqrt{d}\, \|g - g'\|_2\, \|u\|_\infty$.

22 Concentration proof
Until now:
$|f(g; P, B) - f(g'; P, B)| \le \|u\|_\infty\, \|g - g'\|$.  (34)
The $\|u\|_\infty$ term: using $\|Pw\|_\infty = \|w\|_\infty$,
$\|u\|_\infty = \big\|P \tfrac{1}{\sqrt{d}} HB v\big\|_\infty = \tfrac{1}{\sqrt{d}}\|HB v\|_\infty$.  (35)
Approximate isometry of HB: with probability $1 - \delta$ over $B$ (and $P$),
$\|u\|_\infty \le \|v\|_2\, \sqrt{\tfrac{2\log(2d/\delta)}{d}}$.  (36)

23 Concentration proof
Until now: $f$ is Lipschitz, with probability $1 - \delta$ over $B, P$:
$|f(g; P, B) - f(g'; P, B)| \le \Big[\|v\|_2 \sqrt{\tfrac{2\log(2d/\delta)}{d}}\Big]\, \|g - g'\| =: L\, \|g - g'\|$.
By the concentration of the Gaussian measure [$G_{ii} \sim N(0,1)$]:
$P_G\big[\,|f(g; P, B) - k(v)| \ge t\,\big] \le 2 e^{-t^2/(2L^2)} =: \delta$,  (37)
$P_G\Big[\,|f(g; P, B) - k(v)| \ge L \sqrt{2\log(2/\delta)}\,\Big] \le \delta$.  (38)
We apply a union bound: total failure probability $\le 2\delta$.

24 Low variance: $\mathrm{var}[\psi_j(v)]$
Notation: $w = \tfrac{1}{\sqrt{d}} HBv$, $u = Pw$, $z = HGu$.  (39)
High-level idea:
$(z_j, z_t) \mid u$ is normal $\Rightarrow$ compute $\mathrm{cov}(z_j, z_t \mid u)$.
Then $\mathrm{cov}(\psi(z_j), \psi(z_t) \mid u)$ via some exp-cosh relations; finally set $j = t$.

25 Low variance: $z_j \mid u$
Def.: $w = \tfrac{1}{\sqrt{d}} HBv$, $u = Pw$, $z = HGu$.
Using $E_G(HGu \mid u) = 0$:
$\mathrm{cov}(z_j, z_j \mid u) = \mathrm{cov}([HGu]_j, [HGu]_j \mid u) = \mathrm{cov}(H_j^T Gu, H_j^T Gu \mid u) = E\big[(H_j^T Gu)(H_j^T Gu)^T\big]$,
$H_j^T G u = \sum_i H_{ji} G_{ii} u_i$ ($G$: diagonal), so
$\mathrm{cov}(z_j, z_j \mid u) = E\Big(\sum_i G_{ii}^2 H_{ji}^2 u_i^2\Big) = \sum_i E(G_{ii}^2)\, u_i^2 = \sum_i u_i^2 = \|u\|^2 = \|v\|^2$
using $H_{ji}^2 = 1$ ($H_{ji} = \pm 1$), $E(G_{ii}^2) = 1$ [$G_{ii} \sim N(0,1)$], and the isometry of $P \tfrac{1}{\sqrt{d}} HB$.
$z \mid u$ is normal, so $z_j \mid u \sim N(0, \|v\|^2)$.

26 Low variance: $\mathrm{cov}(z_j, z_t \mid u)$
Last slide: $z_j \mid u \sim N(0, \|v\|^2)$.
$\mathrm{cov}(z_j, z_t \mid u) = \mathrm{corr}(z_j, z_t \mid u)\, \mathrm{std}(z_j \mid u)\, \mathrm{std}(z_t \mid u)$  (40)
$= \mathrm{corr}(z_j, z_t \mid u)\, \|v\|^2 =: \rho_{jt}(u)\, \|v\|^2 =: \rho\, \|v\|^2$.  (41)
$\begin{bmatrix} z_j \\ z_t \end{bmatrix} \Big|\, u \sim N\Big(0, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \|v\|^2 =: LL^T\Big)$, i.e. $\begin{bmatrix} z_j \\ z_t \end{bmatrix} \overset{d}{=} Lg$ with $g \sim N(0, I_2)$,  (42)
$L = \begin{bmatrix} 1 & 0 \\ \rho & \sqrt{1 - \rho^2} \end{bmatrix} \|v\|$.  (43)
Now for $\psi_j(v) = \cos(z_j)$:
$\mathrm{cov}(\psi_j(v), \psi_t(v) \mid u) = \mathrm{cov}\big(\cos([Lg]_1), \cos([Lg]_2)\big)$  (44)
$= E_g\Big[\prod_{k=1}^{2} \cos([Lg]_k)\Big] - \prod_{k=1}^{2} E_g\big[\cos([Lg]_k)\big]$.  (45)

27 Low variance: first term in $\mathrm{cov}(\psi_j(v), \psi_t(v) \mid u)$
Using $\cos\alpha \cos\beta = \tfrac{1}{2}\big[\cos(\alpha - \beta) + \cos(\alpha + \beta)\big]$ and $g = [g_1; g_2]$:
$E_g\big[\cos([Lg]_1) \cos([Lg]_2)\big] = \tfrac{1}{2} E_g\big\{\cos([Lg]_1 - [Lg]_2) + \cos([Lg]_1 + [Lg]_2)\big\}$,
where
$[Lg]_1 - [Lg]_2 = \|v\|\big((1 - \rho) g_1 - \sqrt{1 - \rho^2}\, g_2\big) \overset{d}{=} \|v\| \sqrt{2 - 2\rho}\; h$,
$[Lg]_1 + [Lg]_2 = \|v\|\big((1 + \rho) g_1 + \sqrt{1 - \rho^2}\, g_2\big) \overset{d}{=} \|v\| \sqrt{2 + 2\rho}\; h$,
since
$(1 - \rho) g_1 - \sqrt{1 - \rho^2}\, g_2 \overset{d}{=} \sqrt{(1 - \rho)^2 + (1 - \rho^2)}\; h = \sqrt{2 - 2\rho}\; h$,
$(1 + \rho) g_1 + \sqrt{1 - \rho^2}\, g_2 \overset{d}{=} \sqrt{(1 + \rho)^2 + (1 - \rho^2)}\; h = \sqrt{2 + 2\rho}\; h$,
where $h \sim N(0, 1)$.

28 Low variance: first term in $\mathrm{cov}(\psi_j(v), \psi_t(v) \mid u)$
Thus
$E_g\big[\cos([Lg]_1) \cos([Lg]_2)\big] = \tfrac{1}{2} E_g\big\{\cos(a_- h) + \cos(a_+ h)\big\}$,  (46)
$a_- = \|v\| \sqrt{2 - 2\rho}$,  (47)
$a_+ = \|v\| \sqrt{2 + 2\rho}$.  (48)
Making use of the relation
$E[\cos(a h)] = e^{-\frac{1}{2} a^2}$, $h \sim N(0, 1)$,  (49)
we obtain
$E_g\big[\cos([Lg]_1) \cos([Lg]_2)\big] = \tfrac{1}{2}\big[e^{-\|v\|^2 (1 - \rho)} + e^{-\|v\|^2 (1 + \rho)}\big]$.

29 Low variance: value of $E[\cos(b)]$
Lemma: $E[\cos(b)] = e^{-\frac{1}{2}\sigma^2}$ for $b \sim N(0, \sigma^2)$.  (50)
Proof: the characteristic function of $b \sim N(m, \sigma^2)$ is
$c(t) = E_b\big[e^{itb}\big] = e^{itm - \frac{1}{2}\sigma^2 t^2}$.  (51)
In particular, for $m = 0$, $t = 1$ ($b \sim N(0, \sigma^2)$):
$e^{-\frac{1}{2}\sigma^2} = E_b\big[e^{ib}\big] = E[\cos(b)]$  (52)
(the imaginary part $E[\sin(b)]$ vanishes by symmetry).
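A one-line Monte Carlo check of the lemma (50), just to make the identity concrete (numbers illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 1.3
    b = sigma * rng.standard_normal(2_000_000)        # b ~ N(0, sigma^2)
    print(np.cos(b).mean(), np.exp(-sigma ** 2 / 2))  # the two agree to about 1e-3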

30 Low variance: second term in $\mathrm{cov}(\psi_j(v), \psi_t(v) \mid u)$
Since $z_j \mid u \sim N(0, \|v\|^2)$,
$E_g[\cos(z_j)]\, E_g[\cos(z_t)] = \big(E_g[\cos(\|v\| h)]\big)^2 = \big(e^{-\frac{1}{2}\|v\|^2}\big)^2 = e^{-\|v\|^2}$
using the identity for $E[\cos(ah)]$. Thus, with $\cosh(a) = \frac{e^a + e^{-a}}{2}$,
$\mathrm{cov}(\psi_j(v), \psi_t(v) \mid u) = \tfrac{1}{2}\big[e^{-\|v\|^2(1 - \rho)} + e^{-\|v\|^2(1 + \rho)}\big] - e^{-\|v\|^2}$
$= e^{-\|v\|^2}\Big[\frac{e^{\|v\|^2 \rho} + e^{-\|v\|^2 \rho}}{2} - 1\Big] = e^{-\|v\|^2}\big[\cosh(\|v\|^2 \rho) - 1\big]$.
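A Monte Carlo check of the conditional covariance formula just derived, assuming a fixed correlation $\rho$ for the conditionally normal pair $(z_j, z_t)$ (all numbers illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    v2, rho, n = 0.7, 0.4, 2_000_000                 # ||v||^2, correlation, sample size
    cov = v2 * np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    emp = np.cov(np.cos(z[:, 0]), np.cos(z[:, 1]))[0, 1]
    print(emp, np.exp(-v2) * (np.cosh(v2 * rho) - 1))  # both ~ 0.0196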

31 Low variance: $\mathrm{var}[\psi_j(v)]$
With $j = t$, $\rho = 1$ we get
$\mathrm{var}[\psi_j(v)] = e^{-\|v\|^2}\Big[\frac{e^{\|v\|^2} + e^{-\|v\|^2}}{2} - 1\Big]$  (53)
$= \frac{1 + e^{-2\|v\|^2}}{2} - e^{-\|v\|^2}$  (54)
$= \frac{1}{2}\big(1 - 2e^{-\|v\|^2} + e^{-2\|v\|^2}\big)$  (55)
$= \frac{1}{2}\big(1 - e^{-\|v\|^2}\big)^2$.  (56)

32 Low variance: $\mathrm{var}\big[\sum_{j=1}^{d} \psi_j(v)\big]$
Decomposition:
$\mathrm{var}\Big[\sum_{j=1}^{d} \psi_j(v)\Big] = \sum_{j,t=1}^{d} \mathrm{cov}\big[\psi_j(v), \psi_t(v)\big]$.  (57)
We have seen that
$\mathrm{cov}\big[\psi_j(v), \psi_t(v) \mid u\big] = e^{-\|v\|^2}\big[\cosh(\|v\|^2 \rho) - 1\big]$.  (58)
We rewrite the cosh term.

33 Low variance: $\cosh(\|v\|^2 \rho)$
Third-order Taylor expansion around 0 with remainder term:
$\cosh(\|v\|^2 \rho) = 1 + \frac{1}{2!}\|v\|^4 \rho^2 + \frac{1}{3!}\sinh(\eta)\, \|v\|^6 \rho^3$, $\quad |\eta| \le \|v\|^2 |\rho| \le \|v\|^2$,  (59)
$\le 1 + \frac{1}{2}\|v\|^4 \rho^2 + \frac{1}{3!}\sinh(\|v\|^2)\, \|v\|^6 \rho^3$  (60)
$\le 1 + \|v\|^4 \rho^2\, B(\|v\|)$,  (61)
where $B(\|v\|) = \frac{1}{2} + \frac{\sinh(\|v\|^2)\, \|v\|^2}{6}$; we use: $\cosh' = \sinh$, $\sinh' = \cosh$, $\cosh(0) = 1$, $\sinh(a) = \frac{e^a - e^{-a}}{2}$, $\sinh(0) = 0$, the monotonicity of $\sinh$, and $|\rho| \le 1$ (so $\rho^3 \le \rho^2$).

34 Low variance: $\mathrm{var}\big[\sum_{j=1}^{d} \psi_j(v)\big]$
Plugging the result back into $\mathrm{cov}\big[\psi_j(v), \psi_t(v) \mid u\big]$ and using $e^{-\|v\|^2} \le 1$:
$\mathrm{cov}\big[\psi_j(v), \psi_t(v) \mid u\big] \le \|v\|^4 \rho^2\, B(\|v\|)$.  (62)
Here, $\rho = \rho_{jt}(u)$.
Remains: to bound $E_u\big[\rho_{jt}^2(u)\big]$. This is small if $E_u \|u\|_4^4$ is small ($HB$: randomized preconditioner).

35 Numerical experiments
Accuracy: similar to random kitchen sinks (RKS).
CPU, RAM: see the comparison on slide 7 (fastfood O(n log d) CPU, O(n) RAM vs. RKS O(nd), O(nd)).

36 Summary
Random kitchen sinks: use (normally distributed) random projections, which are stored (Z).
Fastfood: approximates the RKS features using the composition of diagonal, permutation and Walsh-Hadamard transformations ($\hat{Z}$); does not store the feature map!
Results: unbiasedness, concentration, low variance, RAM + CPU improvements.

37 Fastfood: properties - rows of HGPHB have the same length
Let $M = HGPHB$. The squared norm of the j-th row is
$l_j^2 = \big[M M^T\big]_{jj} = \big[(HGPHB)(HGPHB)^T\big]_{jj}$  (63)
$= \big[HGPHBB^T H^T P^T G H^T\big]_{jj} = d\,\big[H G^2 H^T\big]_{jj}$  (64)
$= d \sum_i H_{ij}^2 G_{ii}^2 = d \sum_i G_{ii}^2 = d\, \|G\|_F^2$  (65)
by $BB^T = I$ [$B = \mathrm{diag}(\pm 1)$], $HH^T = dI$, $PP^T = I$, $H_{ij}^2 = 1$ ($H_{ij} = \pm 1$).
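A quick numeric check of (63)-(65) with small dense matrices (illustrative only):

    import numpy as np
    from scipy.linalg import hadamard

    rng = np.random.default_rng(0)
    d = 32
    H = hadamard(d).astype(float)
    G = np.diag(rng.standard_normal(d))
    B = np.diag(rng.choice([-1.0, 1.0], size=d))
    P = np.eye(d)[rng.permutation(d)]                # permutation matrix

    M = H @ G @ P @ H @ B
    row_norms_sq = (M ** 2).sum(axis=1)
    print(np.allclose(row_norms_sq, d * np.linalg.norm(G, 'fro') ** 2))  # True: l_j^2 = d ||G||_F^2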

38 Fastfood: optional scaling matrix (S)
Previous slide: $l_j^2 = d\, \|G\|_F^2$.
Rescaling by $\frac{1}{l_j} = \frac{1}{\sqrt{d}\, \|G\|_F}$ yields rows of unit length.
$S$: $\mathrm{diag}\big(s_i \|G\|_F^{-1}\big)$ with $s_i \sim (2\pi)^{-d/2} A_{d-1}\, r^{d-1} e^{-r^2/2}$ (a chi distribution with d degrees of freedom), $A_{d-1} = \frac{2\pi^{d/2}}{\Gamma(d/2)}$.
$\Rightarrow$ the length distributions of the rows of $V$ become independent of each other.
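A small sketch of drawing the rescaling factors $S_{ii}$, assuming the chi-distribution reading of the density above ($s_i$ distributed as the norm of a d-dimensional standard normal vector); names are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 64
    G = rng.standard_normal(d)                   # diagonal of G
    s = np.sqrt(rng.chisquare(df=d, size=d))     # s_i ~ chi_d
    S = s / np.linalg.norm(G)                    # S_ii = s_i / ||G||_F
    # rows of V = (1/(sigma sqrt(d))) S H G P H B then have norms s_i / sigma,
    # i.e. the same length distribution as the rows of a dense Gaussian Z with N(0, 1/sigma^2) entries
    print(S[:5])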
