Supplementary Material
Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison
Appendix A: Proof of Proposition 1

For any $f \in \mathcal{H}_a$, we can write
$$f(\cdot) = \sum_{k=1}^m w_k^s\, s_k(\cdot) + w_k^c\, c_k(\cdot),$$
where $s_k(\cdot) = \sin(u_k^\top \cdot)$, $c_k(\cdot) = \cos(u_k^\top \cdot)$, and $w = (w_1^s, w_1^c, \ldots, w_m^s, w_m^c)^\top$ are the coefficients. Then we have
$$f(x) = w^\top z_f(x), \qquad \|f\|_{\mathcal{H}_\kappa}^2 = \int du\, |\tilde f(u)|^2 \exp\big(\sigma^2\|u\|^2/2\big) = w^\top (w \circ \gamma),$$
where the functional norm follows [2], $\tilde f$ denotes the Fourier transform of $f$, $\circ$ denotes the element-wise product, and $\gamma = (\gamma_1^s, \gamma_1^c, \ldots, \gamma_m^s, \gamma_m^c)^\top$, with $\gamma_i^s = \gamma_i^c = \exp(\sigma^2\|u_i\|^2/2)$, weights the Fourier component $u_i$. We then finish the proof by substituting the linear form of $f(x)$ and the functional norm of $f$ into equation (5) to obtain equation (6) in the paper.

Appendix B: Proof of Proposition 2

By noting that, for any $f(\cdot) = \sum_{j=1}^r w_j \hat\varphi_j(\cdot) \in \mathcal{H}_a^n$, we have
$$f(x_i) = \sum_{j=1}^r w_j \hat\varphi_j(x_i) = w^\top z_n(x_i), \qquad \|f\|^2_{\mathcal{H}_\kappa} = \|w\|_2^2,$$
where $\hat\varphi_j(x_i) = [z_n(x_i)]_j$ follows the equality in (2) in the paper, and the functional norm follows from the fact that $\hat\varphi_j$, $j = 1, \ldots, r$, are normalized orthonormal eigenfunctions. By replacing these into equation (10) in the paper, we obtain an equivalent formulation in equation (9) in the paper.
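Both propositions rest on the same reduction: a function in the approximate hypothesis space is a linear model in an explicit feature vector. The following is a minimal numerical sketch of the two feature maps (all variable names are ours; we assume a Gaussian kernel, stack the sin/cos coordinates in two blocks rather than interleaving them, and fold the usual $1/\sqrt{m}$ scaling into the features, so this illustrates the construction rather than reproducing the paper's exact notation):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m, r, sigma = 200, 3, 40, 10, 1.0
X = rng.normal(size=(N, d))

def kernel(A, B):
    """Gaussian kernel exp(-||a - b||^2 / (2 sigma^2)) between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Random Fourier features (Appendix A): any f in the RFF space is f(x) = w^T z(x).
U = rng.normal(scale=1.0 / sigma, size=(m, d))        # Fourier components u_1..u_m
Z_rff = np.hstack([np.sin(X @ U.T), np.cos(X @ U.T)]) / np.sqrt(m)

# Nystrom features (Appendix B): f(x) = w^T z_n(x) with ||f||_{H_kappa}^2 = ||w||_2^2.
Xs = X[rng.choice(N, m, replace=False)]               # sampled points
lam, V = np.linalg.eigh(kernel(Xs, Xs))
lam, V = lam[::-1], V[:, ::-1]                        # eigenvalues in descending order
Z_nys = kernel(X, Xs) @ V[:, :r] / np.sqrt(lam[:r])   # [z_n(x_i)]_j = phi-hat_j(x_i)

# Both Gram matrices approximate the exact kernel matrix K.
K = kernel(X, X)
print(np.abs(K - Z_rff @ Z_rff.T).max(), np.abs(K - Z_nys @ Z_nys.T).max())
```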
Appendix C: Proof of Theorem 2

To prove Theorem 2, we first prove the following lemma.

Lemma 1. Define $L(f)$ as
$$L(f) = \frac{\lambda}{2}\|f\|^2_{\mathcal{H}_\kappa} + \frac{1}{N}\sum_{i=1}^N \ell(f(x_i), y_i).$$
We have
$$0 \le L(\hat f_m) - L(f^*) \le \frac{C^2}{2\lambda N}\|K - \tilde K_r\|_2,$$
where $C$ is the bound on the gradient of the loss function, i.e., $|\nabla_z \ell(z,y)| \le C$, and $\|\cdot\|_2$ stands for the spectral norm of a matrix.

Proof. Let $\ell^*(\alpha)$ denote the Fenchel conjugate of $\ell(z,y)$ in terms of $z$, i.e.,
$$\ell^*(\alpha) = \sup_z\ \alpha z - \ell(z,y), \qquad \alpha \in \Omega,$$
where $\Omega$ is the range of the mapping $\nabla_z \ell(z,y): \mathbb{R} \mapsto \mathbb{R}$. Using the conjugate of $\ell(z,y)$, we can turn
$$L(f^*) = \min_{f \in \mathcal{H}_D}\ \frac{\lambda}{2}\|f\|^2_{\mathcal{H}_\kappa} + \frac{1}{N}\sum_{i=1}^N \ell(f(x_i), y_i)$$
into the equivalent form
$$L(f^*) = \max_{\{\alpha_i \in \Omega\}}\ -\frac{1}{N}\sum_{i=1}^N \ell^*(\alpha_i) - \frac{1}{2\lambda N^2}(\alpha \circ y)^\top K(\alpha \circ y),$$
where $\alpha = (\alpha_1, \ldots, \alpha_N)^\top$. Similarly, we can cast
$$L(\hat f_m) = \min_{f \in \mathcal{H}_a^n}\ \frac{\lambda}{2}\|f\|^2_{\mathcal{H}_\kappa} + \frac{1}{N}\sum_{i=1}^N \ell(f(x_i), y_i)$$
into
$$L(\hat f_m) = \max_{\{\alpha_i \in \Omega\}}\ -\frac{1}{N}\sum_{i=1}^N \ell^*(\alpha_i) - \frac{1}{2\lambda N^2}(\alpha \circ y)^\top K_b \hat K_r^{-1} K_b^\top (\alpha \circ y).$$
Then
$$L(\hat f_m) = \max_{\{\alpha_i \in \Omega\}}\ -\frac{1}{N}\sum_{i=1}^N \ell^*(\alpha_i) - \frac{1}{2\lambda N^2}(\alpha \circ y)^\top K(\alpha \circ y) + \frac{1}{2\lambda N^2}(\alpha \circ y)^\top \big(K - K_b\hat K_r^{-1}K_b^\top\big)(\alpha \circ y)$$
$$\le \max_{\{\alpha_i \in \Omega\}}\ \Big\{-\frac{1}{N}\sum_{i=1}^N \ell^*(\alpha_i) - \frac{1}{2\lambda N^2}(\alpha \circ y)^\top K(\alpha \circ y)\Big\} + \max_{\{\alpha_i \in \Omega\}}\ \frac{1}{2\lambda N^2}(\alpha \circ y)^\top \big(K - K_b\hat K_r^{-1}K_b^\top\big)(\alpha \circ y)$$
$$\le L(f^*) + \frac{1}{2\lambda N^2}\|\alpha\|_2^2\,\big\|K - K_b\hat K_r^{-1}K_b^\top\big\|_2 \le L(f^*) + \frac{C^2}{2\lambda N}\|K - \tilde K_r\|_2,$$
where the last inequality follows from $|\alpha| \le C$ for all $\alpha \in \Omega$ (due to $|\nabla_z \ell(z,y)| \le C$) and the definition of $\tilde K_r$ in equation (7) in the paper. The lower bound $L(\hat f_m) \ge L(f^*)$ holds since the minimization defining $L(\hat f_m)$ is taken over the subset $\mathcal{H}_a^n \subseteq \mathcal{H}_D$. This completes the proof of Lemma 1.
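Lemma 1 reduces the optimization gap between the Nyström solution and the exact solution to the spectral norm $\|K - \tilde K_r\|_2$. A small sketch of how this quantity, and the right-hand side of the lemma, behave as the number of sampled points $m$ grows (Gaussian kernel with uniform sampling; the values of $C$ and $\lambda$ below are placeholders of our choosing):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, r, lam_reg, C = 300, 3, 10, 1e-2, 1.0
X = rng.normal(size=(N, d))

def kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

K = kernel(X, X)
for m in (20, 40, 80, 160):
    Xs = X[rng.choice(N, m, replace=False)]
    lam, V = np.linalg.eigh(kernel(Xs, Xs))
    lam, V = lam[::-1], V[:, ::-1]
    KbVr = kernel(X, Xs) @ V[:, :r]
    Ktil_r = (KbVr / lam[:r]) @ KbVr.T               # K_b Khat_r^{-1} K_b^T
    gap = np.linalg.norm(K - Ktil_r, 2)              # spectral norm ||K - Ktil_r||_2
    print(m, gap, C ** 2 * gap / (2 * lam_reg * N))  # right-hand side of Lemma 1
```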
Proof of Theorem 2. Define the loss function
$$\tilde\ell(f(x), y) = \frac{\lambda}{2}\|f\|^2_{\mathcal{H}_\kappa} + \ell(f(x), y).$$
To simplify our notation, we define $P_N$ and $P$ as
$$P_N(\tilde\ell f) = \frac{1}{N}\sum_{i=1}^N \tilde\ell(f(x_i), y_i) = L(f), \qquad P(\tilde\ell f) = \mathrm{E}\big[\tilde\ell(f(x), y)\big] = F(f).$$
Using this notation, we have
$$\Lambda(\hat f_m) - \Lambda(f^*) = P(\tilde\ell \hat f_m - \tilde\ell f^*) = P_N(\tilde\ell \hat f_m - \tilde\ell f^*) + (P - P_N)(\tilde\ell \hat f_m - \tilde\ell f^*) \le P_N(\tilde\ell \hat f_m - \tilde\ell f^*) + \max_{(f, f') \in \mathcal{G}} (P - P_N)(\tilde\ell f - \tilde\ell f'),$$
where $\mathcal{G}$ is defined as
$$\mathcal{G} = \Big\{(f, f') \in \mathcal{H}_\kappa \times \mathcal{H}_\kappa : \|f - f'\|_{\ell_2} \le r,\ \|f - f'\|_{\mathcal{H}_\kappa} \le R \Big\},$$
where $\|f\|_{\ell_2} = \sqrt{\mathrm{E}[f^2(x)]} \le \|f\|_{\mathcal{H}_\kappa}$, and $r, R$ are given by
$$r = \|\hat f_m - f^*\|_{\ell_2}, \qquad R = 4/\sqrt{\lambda} \ge \|\hat f_m - f^*\|_{\mathcal{H}_\kappa}.$$
Using Lemma 9 from [1], we have, with a probability $1 - 2N^{-3}$, for any $\epsilon_1, \epsilon_2$ with $r\epsilon_1 \ge e_N$ and $R\epsilon_2^2 \ge e_N$,
$$\sup_{(f, f') \in \mathcal{G}} (P - P_N)(\tilde\ell f - \tilde\ell f') \le C_0 \bar L\,(r\epsilon_1 + R\epsilon_2^2 + e_N) \le C_1 \bar L\,(r\epsilon_1 + e_N),$$
where $C_0, C_1$ are numerical constants, and $\bar L$ is the upper bound on the gradient of $\tilde\ell$ for functions in $\mathcal{G}$, given by $\bar L = \sqrt{\lambda} + C$. Since $\max(\|f^*\|_{\mathcal{H}_\kappa}, \|\hat f_m\|_{\mathcal{H}_\kappa}) \le 2/\sqrt{\lambda}$ and we may choose $\epsilon_2$ with $16\epsilon_2^4 = e_N^2\lambda$, the condition $R\epsilon_2^2 = e_N$ is satisfied, and therefore, with a probability $1 - 2N^{-3}$, writing $\epsilon$ for $\epsilon_1$, we have
$$P(\tilde\ell \hat f_m - \tilde\ell f^*) \le P_N(\tilde\ell \hat f_m - \tilde\ell f^*) + C_1(\sqrt{\lambda} + C)(r\epsilon + e_N) \le \frac{C^2\|K - \tilde K_r\|_2}{2\lambda N} + C_1(\sqrt{\lambda} + C)(r\epsilon + e_N),$$
where we use the result in Lemma 1. Using the definition of $r$, and for any scalar $s$, we have
$$2rs \le \frac{\lambda}{8}\|\hat f_m - f^*\|^2_{\mathcal{H}_\kappa} + \frac{8s^2}{\lambda} \le \frac{\lambda}{4}\|\hat f_m - f_0\|^2_{\mathcal{H}_\kappa} + \frac{\lambda}{4}\|f^* - f_0\|^2_{\mathcal{H}_\kappa} + \frac{8s^2}{\lambda} \le \frac{1}{2}P(\tilde\ell \hat f_m - \tilde\ell f_0) + \frac{1}{2}P(\tilde\ell f^* - \tilde\ell f_0) + \frac{8s^2}{\lambda},$$
where $f_0$ denotes the minimizer of $F(f)$ and the last step uses the strong convexity of $\tilde\ell$. Using this bound for $2r[C_1(\sqrt{\lambda} + C)\epsilon/2]$, we have
$$P(\tilde\ell \hat f_m - \tilde\ell f^*) \le C_1(\sqrt{\lambda} + C)e_N + \frac{C^2\|K - \tilde K_r\|_2}{2\lambda N} + \frac{1}{2}P(\tilde\ell \hat f_m - \tilde\ell f_0) + \frac{1}{2}P(\tilde\ell f^* - \tilde\ell f_0) + \frac{2C_1^2(\sqrt{\lambda} + C)^2\epsilon^2}{\lambda}.$$
Using the fact $\Lambda(f) = P(\tilde\ell f - \tilde\ell f_0)$, we have
$$\frac{1}{2}\Lambda(\hat f_m) \le \frac{3}{2}\Lambda(f^*) + C_1(\sqrt{\lambda} + C)e_N + \frac{C^2\|K - \tilde K_r\|_2}{2\lambda N} + \frac{2C_1^2(\sqrt{\lambda} + C)^2\epsilon^2}{\lambda}.$$
We complete the proof by absorbing the constant terms into a numerical constant $C_2$.
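To see Lemma 1 and Theorem 2 at work, one can solve the regularized problem exactly and in the Nyström feature space and compare the objective values. Below is a sketch using the squared loss as a convenient stand-in (Lemma 1 assumes a loss with bounded gradient, so this is purely illustrative); the closed forms follow from the first-order conditions of the two problems, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(2)

def kernel(A, B, sigma=1.0):
    # Gaussian kernel matrix between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

N, d, m, r, lam = 200, 3, 40, 10, 1e-1
X = rng.normal(size=(N, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)
K = kernel(X, X)

# Full problem: f* minimizes lam/2 ||f||^2 + (1/N) sum_i (f(x_i) - y_i)^2.
# With f = sum_i alpha_i kappa(x_i, .), optimality gives (lam N/2 I + K) alpha = y.
alpha = np.linalg.solve(lam * N / 2 * np.eye(N) + K, y)
L_star = lam / 2 * alpha @ K @ alpha + np.mean((K @ alpha - y) ** 2)

# Nystrom problem: f_m = w^T z_n(x), with ||f_m||^2 = ||w||^2 (Appendix B).
Xs = X[rng.choice(N, m, replace=False)]
lmb, V = np.linalg.eigh(kernel(Xs, Xs))
lmb, V = lmb[::-1], V[:, ::-1]
Z = kernel(X, Xs) @ V[:, :r] / np.sqrt(lmb[:r])
w = np.linalg.solve(lam * N / 2 * np.eye(r) + Z.T @ Z, Z.T @ y)
L_m = lam / 2 * w @ w + np.mean((Z @ w - y) ** 2)

print(f"L(f_m) - L(f*) = {L_m - L_star:.2e}")  # nonnegative, shrinks as m, r grow
```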
Appendix D: Proof of Theorem 3

To prove Theorem 3, we first prove the following lemma, which relates $H_r$ and $\hat H_r$ to the matrices $K_r$ and $\tilde K_r$, respectively.

Lemma 2. Assume $\lambda_r > 0$ and $\hat\lambda_r > 0$. We have
$$[\tilde K_r]_{i,j} = \langle \kappa(x_i,\cdot), \hat H_r \kappa(x_j,\cdot)\rangle_{\mathcal{H}_\kappa}, \qquad [K_r]_{i,j} = \langle \kappa(x_i,\cdot), H_r \kappa(x_j,\cdot)\rangle_{\mathcal{H}_\kappa}, \qquad i, j = 1, \ldots, N.$$

Proof. By the definition of $\hat H_r$ and the expression of $\hat\varphi_j$, we have
$$\langle\kappa(x_i,\cdot), \hat H_r\kappa(x_j,\cdot)\rangle_{\mathcal{H}_\kappa} = \sum_{k=1}^r \langle\kappa(x_i,\cdot), \hat\varphi_k\rangle_{\mathcal{H}_\kappa}\langle\kappa(x_j,\cdot), \hat\varphi_k\rangle_{\mathcal{H}_\kappa} = \sum_{k=1}^r \frac{1}{\hat\lambda_k}\sum_{s,t=1}^m \hat V_{s,k}\hat V_{t,k}\,\langle\kappa(x_i,\cdot), \kappa(\hat x_s,\cdot)\rangle_{\mathcal{H}_\kappa}\langle\kappa(x_j,\cdot), \kappa(\hat x_t,\cdot)\rangle_{\mathcal{H}_\kappa}$$
$$= \sum_{s,t=1}^m [K_b]_{i,s}\Big(\sum_{k=1}^r \frac{\hat V_{s,k}\hat V_{t,k}}{\hat\lambda_k}\Big)[K_b]_{j,t} = \sum_{s,t=1}^m [K_b]_{i,s}\,[\hat K_r^{-1}]_{s,t}\,[K_b]_{j,t} = [K_b\hat K_r^{-1}K_b^\top]_{i,j} = [\tilde K_r]_{i,j},$$
where in the second equality we use equation (2) in the paper for $\hat\varphi_j$, and the last line uses $\hat K_r^{-1} = \sum_{k=1}^r \hat\lambda_k^{-1}\hat v_k\hat v_k^\top$. Similarly, we have
$$\langle\kappa(x_i,\cdot), H_r\kappa(x_j,\cdot)\rangle_{\mathcal{H}_\kappa} = \sum_{k=1}^r \langle\kappa(x_i,\cdot), \varphi_k\rangle_{\mathcal{H}_\kappa}\langle\kappa(x_j,\cdot), \varphi_k\rangle_{\mathcal{H}_\kappa} = \sum_{s,t=1}^N [K]_{i,s}\Big(\sum_{k=1}^r \frac{V_{s,k}V_{t,k}}{\lambda_k}\Big)[K]_{j,t} = [KK_r^{-1}K]_{i,j} = [K_r]_{i,j}.$$

Proof of Theorem 3. Define $\bar\kappa(x_a, x_i) = \langle\kappa(x_a,\cdot), \Delta H\,\kappa(x_i,\cdot)\rangle_{\mathcal{H}_\kappa}$, where $\Delta H = \hat H_r - H_r$. By Lemma 2, we have $[\tilde K_r - K_r]_{i,a} = \bar\kappa(x_a, x_i)$, and therefore
$$\|\tilde K_r - K_r\|_2^2 = \max_{\|u\|_2 \le 1}\ \sum_{i,j}u_iu_j\sum_a \bar\kappa(x_a, x_i)\bar\kappa(x_a, x_j) = \max_{\|u\|_2 \le 1}\ \sum_{i,j}u_iu_j\sum_a \langle\kappa(x_a,\cdot), \Delta H\kappa(x_i,\cdot)\rangle_{\mathcal{H}_\kappa}\langle\kappa(x_a,\cdot), \Delta H\kappa(x_j,\cdot)\rangle_{\mathcal{H}_\kappa}. \quad (1)$$
Using the definition of $L_N$ in equation (11) in the paper and the reproducing property, for any $f, g \in \mathcal{H}_\kappa$, we have
$$\frac{1}{N}\sum_{i=1}^N \langle\kappa(x_i,\cdot), f\rangle_{\mathcal{H}_\kappa}\langle\kappa(x_i,\cdot), g\rangle_{\mathcal{H}_\kappa} = \frac{1}{N}\sum_{i=1}^N f(x_i)\langle\kappa(x_i,\cdot), g\rangle_{\mathcal{H}_\kappa} = \langle L_N f, g\rangle_{\mathcal{H}_\kappa}.$$
Applying this equality to $f = \Delta H\kappa(x_i,\cdot)$ and $g = \Delta H\kappa(x_j,\cdot)$, we have
$$\sum_a \langle\kappa(x_a,\cdot), \Delta H\kappa(x_i,\cdot)\rangle_{\mathcal{H}_\kappa}\langle\kappa(x_a,\cdot), \Delta H\kappa(x_j,\cdot)\rangle_{\mathcal{H}_\kappa} = N\,\langle\kappa(x_i,\cdot), \Delta H L_N \Delta H\kappa(x_j,\cdot)\rangle_{\mathcal{H}_\kappa}. \quad (2)$$
Using equality (2) in (1), we have
$$\|\tilde K_r - K_r\|_2^2 = N\max_{\|u\|_2 \le 1}\ \sum_{i,j}u_iu_j\,\langle\kappa(x_i,\cdot), \Delta H L_N \Delta H\kappa(x_j,\cdot)\rangle_{\mathcal{H}_\kappa}.$$
Define $f(\cdot) = \sum_i u_i\kappa(x_i,\cdot)$. We note that $\kappa(x_i,\cdot)$ can be expressed in an eigen-expansion form, i.e.,
$$\kappa(x_i,\cdot) = \sum_j \sqrt{\lambda_j}\,V_{i,j}\,\varphi_j(\cdot). \quad (3)$$
Using the relationship between $\kappa(x_i,\cdot)$ and the eigenfunctions $\varphi_j(\cdot)$ in equation (3), we have
$$f(\cdot) = \sum_{i=1}^N u_i\sum_j \sqrt{\lambda_j}\,V_{i,j}\,\varphi_j(\cdot) = \sqrt{N}\,L_N^{1/2}[g](\cdot),$$
where $g(\cdot) = \sum_{i,j}u_iV_{i,j}\varphi_j(\cdot)$. Since $\|u\|_2 \le 1$, we have $\|g\|^2_{\mathcal{H}_\kappa} = u^\top VV^\top u \le 1$, where $V = (v_1, \ldots, v_N)$. We thus have
$$\|\tilde K_r - K_r\|_2^2 \le N^2\max_{\|g\|_{\mathcal{H}_\kappa} \le 1}\ \langle L_N^{1/2}g,\ \Delta H L_N \Delta H L_N^{1/2}g\rangle_{\mathcal{H}_\kappa} = N^2\max_{\|g\|_{\mathcal{H}_\kappa} \le 1}\ \langle g,\ L_N^{1/2}\Delta H L_N \Delta H L_N^{1/2}g\rangle_{\mathcal{H}_\kappa} \le N^2\big\|L_N^{1/2}\Delta H L_N \Delta H L_N^{1/2}\big\|_2 = N^2\big\|L_N^{1/2}\Delta H L_N^{1/2}\big\|_2^2,$$
which completes the proof, since taking the square root yields $\|\tilde K_r - K_r\|_2 \le N\|L_N^{1/2}(\hat H_r - H_r)L_N^{1/2}\|_2$.
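Theorem 3 controls $\|K_r - \tilde K_r\|_2$, the distance between the best rank-$r$ approximation of $K$ and its Nyström counterpart, through the operator $L_N^{1/2}(\hat H_r - H_r)L_N^{1/2}$. A direct numerical look at the matrix-side quantity, under the same hypothetical Gaussian-kernel setup as the earlier sketches:

```python
import numpy as np

rng = np.random.default_rng(4)

def kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

N, d, r = 300, 3, 10
X = rng.normal(size=(N, d))
K = kernel(X, X)

# Best rank-r approximation K_r of K (truncated eigendecomposition).
lam, V = np.linalg.eigh(K)
lam, V = lam[::-1], V[:, ::-1]
Kr = (V[:, :r] * lam[:r]) @ V[:, :r].T

for m in (20, 40, 80, 160):
    Xs = X[rng.choice(N, m, replace=False)]
    lmb, W = np.linalg.eigh(kernel(Xs, Xs))
    lmb, W = lmb[::-1], W[:, ::-1]
    KbWr = kernel(X, Xs) @ W[:, :r]
    Ktil_r = (KbWr / lmb[:r]) @ KbWr.T         # K_b Khat_r^{-1} K_b^T
    print(m, np.linalg.norm(Kr - Ktil_r, 2))   # ||K_r - Ktil_r||_2, bounded in Theorem 3
```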
Appendix E: Proof of Theorem 4

To prove Theorem 4, we first prove the following lemma.

Lemma 3. Assume $(\lambda_r - \lambda_{r+1})/N > 3\|L_N - L_m\|_{\mathrm{HS}}$, where $\|\cdot\|_{\mathrm{HS}}$ stands for the Hilbert-Schmidt norm of an integral operator. Let $\hat\Theta = (\hat\varphi_1, \ldots, \hat\varphi_r)$, $\Phi = (\varphi_1, \ldots, \varphi_r)$, $\bar\Phi = (\varphi_{r+1}, \ldots, \varphi_N)$. Then there exists a matrix $P \in \mathbb{R}^{(N-r) \times r}$ satisfying
$$\|P\|_F \le \frac{\sqrt{2}\,\|L_N - L_m\|_{\mathrm{HS}}}{(\lambda_r - \lambda_{r+1})/N - \sqrt{2}\,\|L_N - L_m\|_{\mathrm{HS}}}$$
such that $\hat\Theta = (\Phi + \bar\Phi P)(I + P^\top P)^{-1/2}$.

Proof of Lemma 3. We need the following perturbation result from [3].

Lemma 4 (Theorem 2.7 of Chapter 6 in [3]; we simplify the statement to make it better fit our objective). Let $(\lambda_i, v_i)$, $i \in [n]$, be the eigenvalues and eigenvectors of a symmetric matrix $A \in \mathbb{R}^{n \times n}$, ranked in descending order of eigenvalues. Set $X = (v_1, \ldots, v_r)$ and $Y = (v_{r+1}, \ldots, v_n)$. Given a symmetric perturbation matrix $E$, let
$$(X, Y)^\top E\,(X, Y) = \begin{pmatrix} \hat E_{11} & \hat E_{21}^\top \\ \hat E_{21} & \hat E_{22} \end{pmatrix}.$$
Let $\|\cdot\|$ represent a consistent family of norms, and set
$$\gamma = \|\hat E_{21}\|, \qquad \delta = \lambda_r - \lambda_{r+1} - \|\hat E_{11}\| - \|\hat E_{22}\|.$$
If $\delta > 0$ and $2\gamma < \delta$, then there exists a unique matrix $P \in \mathbb{R}^{(n-r) \times r}$ satisfying $\|P\| < \frac{2\gamma}{\delta}$ such that
$$\tilde X = (X + YP)(I + P^\top P)^{-1/2}, \qquad \tilde Y = (Y - XP^\top)(I + PP^\top)^{-1/2}$$
are the eigenvectors of $A + E$.
Define the matrix $B$ as
$$B_{i,j} = \sum_{k=1}^m \frac{\hat\lambda_k}{m}\,\langle\hat\varphi_k, \varphi_i\rangle_{\mathcal{H}_\kappa}\langle\hat\varphi_k, \varphi_j\rangle_{\mathcal{H}_\kappa}.$$
Let $\hat z_i$ be the eigenvector of $B$ corresponding to the eigenvalue $\hat\lambda_i/m$. It is straightforward to show that
$$\hat z_i = \big(\langle\hat\varphi_i, \varphi_1\rangle_{\mathcal{H}_\kappa}, \ldots, \langle\hat\varphi_i, \varphi_N\rangle_{\mathcal{H}_\kappa}\big)^\top, \quad i \in [m],$$
and therefore we have
$$\hat\varphi_i = \sum_k [\hat z_i]_k\,\varphi_k, \quad i \in [m], \qquad \text{or} \qquad \hat\Theta = (\Phi, \bar\Phi)Z,$$
where $Z = (\hat z_1, \ldots, \hat z_r)$. To determine the relationship between $\{\hat\varphi_i\}_{i=1}^r$ and $\{\varphi_i\}_{i=1}^N$, we need to determine the matrix $Z$. We define the matrix $D = \mathrm{diag}(\lambda_1/N, \ldots, \lambda_N/N)$ and the matrix $E = B - D$, i.e.,
$$E_{i,j} = B_{i,j} - \frac{\lambda_i}{N}\delta_{i,j} = \langle\varphi_i, (L_m - L_N)\varphi_j\rangle_{\mathcal{H}_\kappa}.$$
Following the notation of Lemma 4, we define $X = (e_1, \ldots, e_r)$ and $Y = (e_{r+1}, \ldots, e_N)$, where $e_1, \ldots, e_N$ are the canonical bases of $\mathbb{R}^N$, which are also eigenvectors of $D$ (the matrix corresponding to $A$ in Lemma 4). Define $\gamma$ and $\delta$ as follows:
$$\gamma = \Big(\sum_{i=r+1}^N\sum_{j=1}^r E_{i,j}^2\Big)^{1/2}, \qquad \delta = \frac{\lambda_r - \lambda_{r+1}}{N} - \Big(\sum_{i,j=1}^r E_{i,j}^2\Big)^{1/2} - \Big(\sum_{i,j=r+1}^N E_{i,j}^2\Big)^{1/2}.$$
It is easy to verify that $\gamma$ and $\delta$ are defined with respect to the Frobenius norm of $\hat E$ in Lemma 4. In order to apply the result in Lemma 4, we need to show $\delta > 0$ and $2\gamma < \delta$. To this end, we provide a lower bound for $\delta$ and an upper bound for $\gamma$. Since $E$ is the matrix of $L_m - L_N$ in the orthonormal basis $\{\varphi_i\}$, so that $\|\hat E_{11}\|_F + \|\hat E_{22}\|_F \le \sqrt{2}\,\|E\|_F \le \sqrt{2}\,\|L_N - L_m\|_{\mathrm{HS}}$, we first bound $\delta$ as
$$\delta \ge \frac{\lambda_r - \lambda_{r+1}}{N} - \sqrt{2}\,\|L_N - L_m\|_{\mathrm{HS}}.$$
We then bound $\gamma$, using $2\|\hat E_{21}\|_F^2 \le \|E\|_F^2$ by the symmetry of $E$, as
$$\gamma = \Big(\sum_{i=r+1}^N\sum_{j=1}^r E_{i,j}^2\Big)^{1/2} \le \frac{1}{\sqrt{2}}\,\|L_N - L_m\|_{\mathrm{HS}}.$$
Hence, when $(\lambda_r - \lambda_{r+1})/N > 3\|L_N - L_m\|_{\mathrm{HS}}$, we have $\delta > 2\gamma > 0$, which satisfies the conditions specified in Lemma 4. Thus, according to Lemma 4, there exists a $P \in \mathbb{R}^{(N-r) \times r}$ satisfying $\|P\|_F < 2\gamma/\delta$ such that
$$Z = (\hat z_1, \ldots, \hat z_r) = (X + YP)(I + P^\top P)^{-1/2} = (X, Y)\begin{pmatrix} I_{r \times r} \\ P \end{pmatrix}(I + P^\top P)^{-1/2},$$
implying
$$\hat\Theta = (\Phi, \bar\Phi)Z = (\Phi, \bar\Phi)(X, Y)\begin{pmatrix} I_{r \times r} \\ P \end{pmatrix}(I + P^\top P)^{-1/2} = (\Phi + \bar\Phi P)(I + P^\top P)^{-1/2},$$
where we use $(X, Y) = I$. This completes the proof of Lemma 3.
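Lemma 4 is a standard invariant-subspace perturbation bound, and the matrix $P$ can be recovered numerically from the top-$r$ eigenvectors $\tilde X$ of $A + E$ via $P = (Y^\top\tilde X)(X^\top\tilde X)^{-1}$, since $\tilde X = (X + YP)(I + P^\top P)^{-1/2}$. A sketch of this check (diagonal $A$ with an explicit eigengap; all constants and names are ours):

```python
import numpy as np

rng = np.random.default_rng(5)
n, r = 30, 5

# Symmetric A with an eigengap between eigenvalues r and r+1 (setting of Lemma 4).
eigs = np.concatenate([np.linspace(2.0, 1.5, r), np.linspace(0.5, 0.1, n - r)])
A = np.diag(eigs)                        # eigenvectors of A: the canonical basis
X, Y = np.eye(n)[:, :r], np.eye(n)[:, r:]

S = rng.normal(size=(n, n))
E = 1e-2 * (S + S.T) / 2                 # small symmetric perturbation

# gamma and delta of Lemma 4, with the Frobenius norm as the consistent family.
gamma = np.linalg.norm(Y.T @ E @ X, 'fro')
delta = (eigs[r - 1] - eigs[r]
         - np.linalg.norm(X.T @ E @ X, 'fro')
         - np.linalg.norm(Y.T @ E @ Y, 'fro'))

# Top-r eigenvectors of A + E, and the matrix P they induce.
Xt = np.linalg.eigh(A + E)[1][:, ::-1][:, :r]
P = (Y.T @ Xt) @ np.linalg.inv(X.T @ Xt)          # from Xt = (X + Y P)(I + P^T P)^{-1/2}
print(np.linalg.norm(P, 'fro'), 2 * gamma / delta)  # ||P||_F stays below 2*gamma/delta

# (X + Y P)(I + P^T P)^{-1/2} spans the same subspace as Xt.
w, Q = np.linalg.eigh(np.eye(r) + P.T @ P)
Xrec = (X + Y @ P) @ (Q @ np.diag(w ** -0.5) @ Q.T)
print(np.linalg.norm(Xrec @ Xrec.T - Xt @ Xt.T))    # ~ 0
```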
Proof of Theorem 4. Since
$$\big\|L_N^{1/2}(H_r - \hat H_r)L_N^{1/2}\big\|_2 = \max_{\|f\|_{\mathcal{H}_\kappa} \le 1}\ \langle f, L_N^{1/2}H_rL_N^{1/2}f\rangle_{\mathcal{H}_\kappa} - \langle f, L_N^{1/2}\hat H_rL_N^{1/2}f\rangle_{\mathcal{H}_\kappa},$$
we need to bound $\langle f, L_N^{1/2}\hat H_rL_N^{1/2}f\rangle_{\mathcal{H}_\kappa}$. To this end, we write $f(\cdot) = \sum_{i=1}^N u_i\varphi_i(\cdot)$ with $\sum_i u_i^2 \le 1$, and have
$$\langle f, L_N^{1/2}\hat H_rL_N^{1/2}f\rangle_{\mathcal{H}_\kappa} = \sum_{i,j}u_iu_j\,\langle\varphi_i, L_N^{1/2}\hat H_rL_N^{1/2}\varphi_j\rangle_{\mathcal{H}_\kappa} = \sum_{i,j}u_iu_j\,\frac{\sqrt{\lambda_i\lambda_j}}{N}\,\langle\varphi_i, \hat H_r\varphi_j\rangle_{\mathcal{H}_\kappa} = u^\top DAA^\top Du,$$
where $u = (u_1, \ldots, u_N)^\top$, $D = \mathrm{diag}(\sqrt{\lambda_1/N}, \ldots, \sqrt{\lambda_N/N})$, and $A = [\langle\varphi_i, \hat\varphi_j\rangle_{\mathcal{H}_\kappa}]_{N \times r} = (\Phi, \bar\Phi)^\top\hat\Theta$. Since $(\lambda_r - \lambda_{r+1})/N > 3\|L_N - L_m\|_{\mathrm{HS}}$, according to Lemma 3 there exists a matrix $P \in \mathbb{R}^{(N-r) \times r}$ satisfying
$$\|P\|_F \le \frac{\sqrt{2}\,\|L_N - L_m\|_{\mathrm{HS}}}{(\lambda_r - \lambda_{r+1})/N - \sqrt{2}\,\|L_N - L_m\|_{\mathrm{HS}}},$$
such that $\hat\Theta = (\Phi + \bar\Phi P)(I + P^\top P)^{-1/2}$. Using the expression of $\hat\Theta$, we compute $A$ as
$$A = (\Phi, \bar\Phi)^\top\hat\Theta = (\Phi, \bar\Phi)^\top(\Phi + \bar\Phi P)(I + P^\top P)^{-1/2} = \begin{pmatrix} I \\ P \end{pmatrix}(I + P^\top P)^{-1/2}.$$
Thus, we have
$$\langle f, L_N^{1/2}H_rL_N^{1/2}f\rangle_{\mathcal{H}_\kappa} - \langle f, L_N^{1/2}\hat H_rL_N^{1/2}f\rangle_{\mathcal{H}_\kappa} = u^\top D\left(\begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix} - AA^\top\right)Du = u^\top DBDu,$$
where $B$ is given by
$$B = \begin{pmatrix} I - (I + P^\top P)^{-1} & -(I + P^\top P)^{-1}P^\top \\ -P(I + P^\top P)^{-1} & -P(I + P^\top P)^{-1}P^\top \end{pmatrix} = \begin{pmatrix} P^\top(I + PP^\top)^{-1}P & -(I + P^\top P)^{-1}P^\top \\ -P(I + P^\top P)^{-1} & -P(I + P^\top P)^{-1}P^\top \end{pmatrix}.$$
Rewrite $u = (u_a, u_b)$, where $u_a \in \mathbb{R}^r$ includes the first $r$ entries of $u$ and $u_b$ includes the rest of the entries of $u$. Define $D_a = \mathrm{diag}(\sqrt{\lambda_1/N}, \ldots, \sqrt{\lambda_r/N})$ and $D_b = \mathrm{diag}(\sqrt{\lambda_{r+1}/N}, \ldots, \sqrt{\lambda_N/N})$. Let $Q = (I + P^\top P)^{-1}$. We have
$$u^\top DBDu \le \|u_a\|_2^2\,\big\|D_aP^\top(I + PP^\top)^{-1}PD_a\big\|_2 + \|u_b\|_2^2\,\big\|D_bPQP^\top D_b\big\|_2 + 2\|u_a\|_2\|u_b\|_2\,\big\|D_aQP^\top D_b\big\|_2$$
$$\le (\|u_a\|_2 + \|u_b\|_2)^2\max\Big(\frac{\lambda_1}{N}\|P\|_2^2,\ \frac{\lambda_{r+1}}{N}\|P\|_2^2,\ \frac{\sqrt{\lambda_1\lambda_{r+1}}}{N}\|P\|_2\Big) \le \frac{2\lambda_1}{N}\max\Big(\|P\|_2^2,\ \sqrt{\lambda_{r+1}/\lambda_1}\,\|P\|_2\Big),$$
where we use the facts $\|P^\top(I + PP^\top)^{-1}P\|_2 \le \|P\|_2^2$, $\|PQP^\top\|_2 \le \|P\|_2^2$, $\|QP^\top\|_2 = \|PQ\|_2 \le \|P\|_2$, $(\|u_a\|_2 + \|u_b\|_2)^2 \le 2$, and $\lambda_{r+1} \le \lambda_1$. We complete the proof by using the fact
$$\big\|L_N^{1/2}(H_r - \hat H_r)L_N^{1/2}\big\|_2 = \max_{\|f\|_{\mathcal{H}_\kappa} \le 1}\ \langle f, L_N^{1/2}(H_r - \hat H_r)L_N^{1/2}f\rangle_{\mathcal{H}_\kappa} = \max_{\|u\|_2 \le 1}\ u^\top DBDu$$
and the bound on $\|P\|_2$ in Lemma 3.

References

[1] V. Koltchinskii and M. Yuan. Sparsity in multiple kernel learning. Annals of Statistics, 38(6):3660-3695, 2010.

[2] H. Q. Minh. Some properties of Gaussian reproducing kernel Hilbert spaces and their implications for function approximation and learning theory. Constructive Approximation, 32(2):307-338, 2010.

[3] G. W. Stewart and J. Sun. Matrix Perturbation Theory. Academic Press, 1990.