Approximate Kernel PCA with Random Features

1 Approximate Kernel PCA with Random Features (Computational vs. Statistical Trade-off)
Bharath K. Sriperumbudur, Department of Statistics, Pennsylvania State University
Journées de Statistique, Paris, May 28, 2018
(Joint work with Nicholas Sterge, PSU)
Supported by the National Science Foundation (DMS )

2 Outline
Principal Component Analysis (PCA)
Kernel PCA
Approximation methods
Kernel PCA with Random Features
Computational vs. statistical trade-off

3 Principal Component Analysis (PCA) (Pearson, 1901)
Suppose $X \sim P$ is a random variable in $\mathbb{R}^d$ with mean $\mu$ and covariance matrix $\Sigma$.
Find a direction $w \in \{v : \|v\|_2 = 1\}$ such that $\mathrm{Var}[\langle w, X\rangle_2] = \langle w, \Sigma w\rangle_2$ is maximized.
Find a direction $w \in \{v : \|v\|_2 = 1\}$ such that $\mathbb{E}\big\|(X - \mu) - \langle w, X - \mu\rangle_2\, w\big\|_2^2$ is minimized.
The formulations are equivalent, and the solution is the eigenvector of the covariance matrix $\Sigma$ corresponding to the largest eigenvalue.
Can be generalized to multiple directions (find a subspace...).
Applications: dimensionality reduction.
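
The following minimal NumPy sketch (my own illustration, not part of the talk; the toy data and variable names are assumptions) computes the leading principal direction from the sample covariance matrix and checks that the variance captured along it equals $\langle w, \Sigma w\rangle_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])  # toy data in R^5

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)            # sample covariance matrix (d x d)

eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
w = eigvecs[:, -1]                         # leading eigenvector, ||w||_2 = 1

# Variance of the projections <w, X_i> equals the PCA objective <w, Sigma w>
print(np.var((X - mu) @ w, ddof=1), w @ Sigma @ w)
```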

6 Kernel PCA
Nonlinear generalization of PCA (Schölkopf et al., 1998).
Map $X \mapsto \Phi(X)$ through the feature map $\Phi$ and apply PCA.
The choice of $\Phi$ determines the degree of information we capture about $X$. Suppose $\Phi(X) = (1, X, X^2, X^3, \ldots)$; then the covariance of $\Phi(X)$ captures the higher-order moments of $X$.
$\Phi$ is not explicitly specified but is implicitly specified through a positive definite kernel function, $k(x, y) = \langle \Phi(x), \Phi(y)\rangle$.

7 Kernel PCA: Functional Version
Find $f \in \{g \in \mathcal{H} : \|g\|_{\mathcal{H}} = 1\}$ such that $\mathrm{Var}[f(X)]$ is maximized, i.e.,
$$f_\star = \arg\sup_{\|f\|_{\mathcal{H}} = 1} \mathrm{Var}[f(X)],$$
where $\mathcal{H}$ is a reproducing kernel Hilbert space (the evaluation functionals $f \mapsto f(x)$ are bounded for all $x \in \mathcal{X}$) of real-valued functions (Aronszajn, 1950).
There exists a unique $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $k(\cdot, x) \in \mathcal{H}$ for all $x \in \mathcal{X}$ and $f(x) = \langle k(\cdot, x), f\rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$, $x \in \mathcal{X}$.
$k$ is called the reproducing kernel of $\mathcal{H}$ since $k(x, y) = \langle \underbrace{k(\cdot, x)}_{\Phi(x)}, \underbrace{k(\cdot, y)}_{\Phi(y)}\rangle_{\mathcal{H}}$, and it is symmetric and positive definite. In fact, the converse is also true (Moore–Aronszajn theorem).
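
To make "symmetric and positive definite" concrete, here is a small numerical check (my own, with the Gaussian kernel and all sample sizes assumed for illustration) that a Gram matrix built from a random sample is symmetric with non-negative eigenvalues.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) evaluated on all pairs of rows."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = gaussian_kernel(X, X)                      # Gram matrix K_ij = k(X_i, X_j)

print(np.allclose(K, K.T))                     # symmetry
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # positive semi-definiteness (up to rounding)
```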

10 RKHS ⟷ Kernels (positive definite & symmetric functions)
RKHS: $\mathcal{H} = \mathrm{span}\{k(\cdot, x) : x \in \mathcal{X}\}$ (linear span of kernel functions).
$k$ controls the properties of $f \in \mathcal{H}$: if $k$ satisfies ($\star$), then every $f \in \mathcal{H}$ satisfies ($\star$), where ($\star$) is boundedness, continuity, measurability, integrability, or differentiability.

13 Kernel PCA
Kernel PCA generalizes linear PCA.
For $k(x, y) = \langle x, y\rangle_2$, $x, y \in \mathbb{R}^d$: $\mathcal{H}$ is isometrically isomorphic to $\mathbb{R}^d$, and $f(x) = \langle w_f, x\rangle_2$ for $f \in \mathcal{H}$.

15 Kernel PCA
Using the reproducing property $f(X) = \langle f, \underbrace{k(\cdot, X)}_{\Phi(X)}\rangle_{\mathcal{H}}$, we obtain
$$f_\star = \arg\sup_{\|f\|_{\mathcal{H}} = 1} \langle f, \Sigma f\rangle_{\mathcal{H}},$$
where
$$\Sigma := \int_{\mathcal{X}} k(\cdot, x) \otimes_{\mathcal{H}} k(\cdot, x)\, dP(x) - \mu_P \otimes_{\mathcal{H}} \mu_P$$
is the covariance operator (self-adjoint, positive and trace class) on $\mathcal{H}$ and $\mu_P := \int_{\mathcal{X}} k(\cdot, x)\, dP(x)$ is the mean element.
Spectral theorem: $\Sigma = \sum_{i \in I} \lambda_i\, \phi_i \otimes_{\mathcal{H}} \phi_i$, where $I$ is either countable ($\lambda_i \to 0$ as $i \to \infty$) or finite.
Similar to PCA, the solution is the eigenfunction corresponding to the largest eigenvalue of $\Sigma$.

18 Empirical Kernel PCA
In practice, $P$ is unknown, but we have access to $(X_i)_{i=1}^n \stackrel{\mathrm{i.i.d.}}{\sim} P$:
$$\hat{f} = \arg\sup_{\|f\|_{\mathcal{H}} = 1} \langle f, \hat{\Sigma} f\rangle_{\mathcal{H}},$$
where
$$\hat{\Sigma} := \frac{1}{n}\sum_{i=1}^n k(\cdot, X_i) \otimes_{\mathcal{H}} k(\cdot, X_i) - \mu_n \otimes_{\mathcal{H}} \mu_n$$
is the empirical covariance operator (symmetric, positive and trace class) on $\mathcal{H}$ and $\mu_n := \frac{1}{n}\sum_{i=1}^n k(\cdot, X_i)$.
Spectral theorem: $\hat{\Sigma} = \sum_{i=1}^n \hat{\lambda}_i\, \hat{\phi}_i \otimes_{\mathcal{H}} \hat{\phi}_i$.
$\hat{f}$ is the eigenfunction corresponding to the largest eigenvalue of $\hat{\Sigma}$.

21 Empirical Kernel PCA: Representer Theorem
Since $\hat{\Sigma}$ is an infinite-dimensional operator, we have to solve an infinite-dimensional eigensystem, $\hat{\Sigma}\hat{\phi}_i = \hat{\lambda}_i \hat{\phi}_i$. Consider
$$\sup_{\|f\|_{\mathcal{H}} = 1} \langle f, \hat{\Sigma} f\rangle_{\mathcal{H}} = \sup_{\|f\|_{\mathcal{H}} = 1} \frac{1}{n}\sum_{i=1}^n \langle f, k(\cdot, X_i)\rangle_{\mathcal{H}}^2 - \Big\langle f, \frac{1}{n}\sum_{i=1}^n k(\cdot, X_i)\Big\rangle_{\mathcal{H}}^2.$$
Clearly $f \in \mathrm{span}\{k(\cdot, X_i) : i = 1, \ldots, n\}$, i.e., $f = \sum_{i=1}^n \alpha_i k(\cdot, X_i)$. Then
$$\sup\big\{\langle f, \hat{\Sigma} f\rangle_{\mathcal{H}} : \|f\|_{\mathcal{H}} = 1\big\} = \sup\big\{\alpha^\top K H_n K \alpha : \alpha^\top K \alpha = 1\big\},$$
i.e., $K H_n K \alpha = \lambda K \alpha$, where $K_{ij} = k(X_i, X_j)$ is the Gram matrix and $H_n = \mathbf{I}_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\top$ is the centering matrix. Requires $K$ to be invertible.

24 Empirical Kernel PCA
In classical PCA, $X_i \in \mathbb{R}^d$, $i = 1, \ldots, n$, is represented as
$$\big(\langle X_i - \mu, \hat{w}_1\rangle_2, \ldots, \langle X_i - \mu, \hat{w}_l\rangle_2\big) \in \mathbb{R}^l$$
with $l \le d$, where $(\hat{w}_i)_{i=1}^l$ are the eigenvectors of $\hat{\Sigma}$ corresponding to the top-$l$ eigenvalues.
In kernel PCA, $X_i \in \mathcal{X}$, $i = 1, \ldots, n$, is represented as
$$\big(\langle k(\cdot, X_i) - \mu_n, \hat{\phi}_1\rangle_{\mathcal{H}}, \ldots, \langle k(\cdot, X_i) - \mu_n, \hat{\phi}_l\rangle_{\mathcal{H}}\big) \in \mathbb{R}^l$$
with $l \le n$, where $(\hat{\phi}_i)_{i=1}^l$ are the eigenfunctions of $\hat{\Sigma}$ corresponding to the top-$l$ eigenvalues.

26 Summary
The direct formulation, $\hat{\Sigma}\hat{\phi}_i = \hat{\lambda}_i \hat{\phi}_i$, requires knowledge of the feature map $\Phi$ (and of course $\mathcal{H}$), and these could be infinite dimensional.
The alternate formulation,
$$\hat{\phi}_i = \sum_{j=1}^n \alpha_{i,j}\, k(\cdot, X_j), \quad \text{where } \alpha_i \text{ satisfies } H_n K \alpha_i = \lambda_i \alpha_i,$$
is entirely determined by kernel evaluations (the Gram matrix). But poor scalability: $O(n^3)$.
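
A minimal sketch of the Gram-matrix formulation summarized above (my own code; the Gaussian kernel, the function names gaussian_kernel and kernel_pca, the bandwidth sigma, and the toy data are all illustrative assumptions). It centers the Gram matrix with $H_n = \mathbf{I} - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, solves the $n \times n$ eigenproblem (the $O(n^3)$ bottleneck), and returns the top-$l$ kernel PCA scores.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) on all pairs of rows."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def kernel_pca(X, l=2, sigma=1.0):
    """Empirical kernel PCA via the centered Gram matrix; the eigh call is the O(n^3) step."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix H_n
    Kc = H @ K @ H                               # symmetric form; same nonzero spectrum as H_n K
    lam, A = np.linalg.eigh(Kc)                  # ascending eigenvalues
    lam, A = lam[::-1][:l], A[:, ::-1][:, :l]    # top-l eigenpairs of the centered Gram matrix
    # score of sample i on component j: <k(., X_i) - mu_n, phi_hat_j>_H = sqrt(lam_j) * A_ij
    scores = A * np.sqrt(np.maximum(lam, 0.0))
    return lam / n, scores                       # eigenvalues of Sigma_hat, top-l representations

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
eigvals, Z = kernel_pca(X, l=2)
print(eigvals, Z.shape)                          # two leading eigenvalues of Sigma_hat, (200, 2)
```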

27 Approximation Schemes
Incomplete Cholesky factorization (e.g., Fine and Scheinberg, 2001)
Sketching (Yang et al., 2015)
Sparse greedy approximation (Smola and Schölkopf, 2000)
Nyström method (e.g., Williams and Seeger, 2001)
Random Fourier features (e.g., Rahimi and Recht, 2008a), ...

28 Random Fourier Approximation
Let $\mathcal{X} = \mathbb{R}^d$ and let $k$ be continuous and translation-invariant, i.e., $k(x, y) = \psi(x - y)$.
Bochner's theorem: $\psi$ is positive definite if and only if
$$k(x, y) = \int_{\mathbb{R}^d} e^{\sqrt{-1}\,\langle \omega, x - y\rangle_2}\, d\Lambda(\omega),$$
where $\Lambda$ is a finite non-negative Borel measure on $\mathbb{R}^d$.
$k$ is symmetric and therefore $\Lambda$ is a symmetric measure on $\mathbb{R}^d$. Therefore
$$k(x, y) = \int_{\mathbb{R}^d} \cos(\langle \omega, x - y\rangle_2)\, d\Lambda(\omega).$$

30 Random Feature Approximation
(Rahimi and Recht, 2008a): Draw $(\omega_j)_{j=1}^m \stackrel{\mathrm{i.i.d.}}{\sim} \Lambda$ and define
$$k_m(x, y) = \frac{1}{m}\sum_{j=1}^m \cos(\langle \omega_j, x - y\rangle_2) = \langle \Phi_m(x), \Phi_m(y)\rangle_{\mathbb{R}^{2m}},$$
which approximates $k(x, y) = \langle \underbrace{k(\cdot, x), k(\cdot, y)}_{\Phi(x),\,\Phi(y)}\rangle_{\mathcal{H}}$, where
$$\Phi_m(x) = \frac{1}{\sqrt{m}}\big(\cos(\langle \omega_1, x\rangle_2), \ldots, \cos(\langle \omega_m, x\rangle_2), \sin(\langle \omega_1, x\rangle_2), \ldots, \sin(\langle \omega_m, x\rangle_2)\big),$$
with components denoted $\varphi_1(x), \varphi_2(x), \ldots$
Idea: Apply PCA to $\Phi_m(x)$.
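
A sketch of the random feature map $\Phi_m$ (mine, not from the talk), assuming the Gaussian kernel $k(x, y) = \exp(-\|x - y\|_2^2/(2\sigma^2))$, whose spectral measure $\Lambda$ (normalized) is the Gaussian $N(0, \sigma^{-2} I)$; it checks that $\langle \Phi_m(x), \Phi_m(y)\rangle$ approximates $k(x, y)$.

```python
import numpy as np

def rff_features(X, omegas):
    """Phi_m(x) = (1/sqrt(m)) (cos<w_1,x>, ..., cos<w_m,x>, sin<w_1,x>, ..., sin<w_m,x>)."""
    m = omegas.shape[0]
    proj = X @ omegas.T                               # <w_j, x> for all samples and frequencies
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m, sigma = 3, 2000, 1.0
omegas = rng.normal(scale=1.0 / sigma, size=(m, d))   # draws from the Gaussian spectral measure

x, y = rng.normal(size=d), rng.normal(size=d)
k_exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))
phi_x = rff_features(x[None, :], omegas)[0]
phi_y = rff_features(y[None, :], omegas)[0]
print(k_exact, phi_x @ phi_y)                         # close for large m
```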

32 Approximate Kernel PCA (RF-KPCA)
Perform linear PCA on $\Phi_m(X)$ where $X \sim P$. Approximate KPCA finds $\beta$ that solves
$$\sup_{\|\beta\|_2 = 1} \mathrm{Var}[\langle \beta, \Phi_m(X)\rangle_2] = \sup_{\|\beta\|_2 = 1} \langle \beta, \Sigma_m \beta\rangle_2,$$
where $\Sigma_m := \mathbb{E}[\Phi_m(X) \otimes_2 \Phi_m(X)] - \mathbb{E}[\Phi_m(X)] \otimes_2 \mathbb{E}[\Phi_m(X)]$.
Same as doing kernel PCA in $\mathcal{H}_m$, where
$$\mathcal{H}_m = \Big\{ f = \sum_{i=1}^m \beta_i \varphi_i : \beta \in \mathbb{R}^m \Big\}$$
is an RKHS induced by the reproducing kernel $k_m$, with $\langle f, g\rangle_{\mathcal{H}_m} = \langle \beta_f, \beta_g\rangle_2$.

34 Empirical RF-KPCA
The empirical counterpart is obtained as
$$\hat{\beta}_m = \arg\sup_{\|\beta\|_2 = 1} \langle \beta, \hat{\Sigma}_m \beta\rangle_2,$$
where
$$\hat{\Sigma}_m := \frac{1}{n}\sum_{i=1}^n \Phi_m(X_i) \otimes_2 \Phi_m(X_i) - \Big(\frac{1}{n}\sum_{i=1}^n \Phi_m(X_i)\Big) \otimes_2 \Big(\frac{1}{n}\sum_{i=1}^n \Phi_m(X_i)\Big).$$
Eigendecomposition: $\hat{\Sigma}_m = \sum_{i=1}^m \hat{\lambda}_{i,m}\, \hat{\phi}_{i,m} \otimes_2 \hat{\phi}_{i,m}$.
$\hat{\beta}_m$ is obtained by solving an $m \times m$ eigensystem: complexity is $O(m^3)$.
What happens statistically?
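
A sketch of empirical RF-KPCA in the same assumed Gaussian-kernel setup as above (my own illustration; rf_kpca and all sizes are made-up names and values): form $\Phi_m(X_i)$, build the empirical covariance $\hat{\Sigma}_m$ of the random features, and eigendecompose it, so that only a $2m \times 2m$ eigenproblem is solved instead of the $n \times n$ one.

```python
import numpy as np

def rff_features(X, omegas):
    m = omegas.shape[0]
    proj = X @ omegas.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

def rf_kpca(X, omegas, l=2):
    """Empirical RF-KPCA: linear PCA on the random features Phi_m(X_i)."""
    Phi = rff_features(X, omegas)                    # n x 2m feature matrix
    mean = Phi.mean(axis=0)
    Sigma_m = np.cov(Phi, rowvar=False, bias=True)   # empirical covariance Sigma_hat_m (2m x 2m)
    lam, B = np.linalg.eigh(Sigma_m)
    lam, B = lam[::-1][:l], B[:, ::-1][:, :l]        # top-l eigenpairs
    scores = (Phi - mean) @ B                        # l-dimensional representation of each X_i
    return lam, scores

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
omegas = rng.normal(size=(100, 3))                   # m = 100 random frequencies (sigma = 1)
eigvals, Z = rf_kpca(X, omegas, l=2)
print(eigvals, Z.shape)                              # eigenvalues of Sigma_hat_m, (5000, 2)
```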

37 How good is the approximation?
(Sriperumbudur and Szabó, 2015):
$$\sup_{x, y \in S} |k_m(x, y) - k(x, y)| = O_{a.s.}\Big(\sqrt{\tfrac{\log |S|}{m}}\Big).$$
Optimal convergence rate.
Other results are known but they are non-optimal (Rahimi and Recht, 2008a; Sutherland and Schneider, 2015).

38 What happens statistically?
Kernel ridge regression: $(X_i, Y_i)_{i=1}^n \stackrel{\mathrm{i.i.d.}}{\sim} \rho_{XY}$ and
$$R_P = \inf_{f \in L^2(\rho_X)} \mathbb{E}|f(X) - Y|^2 = \mathbb{E}|f_\star(X) - Y|^2.$$
Penalized risk minimization: $O(n^3)$
$$f_n = \arg\inf_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n |Y_i - f(X_i)|^2 + \lambda \|f\|_{\mathcal{H}}^2$$
Penalized risk minimization (approximate): $O(m^2 n)$
$$f_{m,n} = \arg\inf_{f \in \mathcal{H}_m} \frac{1}{n}\sum_{i=1}^n |Y_i - f(X_i)|^2 + \lambda \|f\|_{\mathcal{H}_m}^2$$
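
To illustrate the $O(m^2 n)$ claim for the approximate penalized risk minimization, here is a ridge-regression sketch on random Fourier features (my own; the toy regression problem, rff_ridge_fit, and all parameter choices are assumptions): forming $\Phi^\top \Phi$ costs $O(m^2 n)$ and the linear solve involves only a $2m \times 2m$ system.

```python
import numpy as np

def rff_features(X, omegas):
    m = omegas.shape[0]
    proj = X @ omegas.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

def rff_ridge_fit(X, y, omegas, lam=1e-3):
    """f_{m,n}: ridge regression in H_m; Phi^T Phi costs O(m^2 n), the solve O(m^3)."""
    Phi = rff_features(X, omegas)                            # n x 2m
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * len(y) * np.eye(d), Phi.T @ y)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=2000)
omegas = rng.normal(size=(200, 1))                           # m = 200 frequencies (Gaussian kernel, sigma = 1)

w = rff_ridge_fit(X, y, omegas)
X_test = np.linspace(-3, 3, 5)[:, None]
print(rff_features(X_test, omegas) @ w)                      # roughly sin(x) at the grid points
```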

39 What happens statistically?
$$\underbrace{R_P(f_{m,n}) - R_P}_{R_P(f) := \mathbb{E}|f(X) - Y|^2} = \underbrace{\big(R_P(f_{m,n}) - R_P(f_n)\big)}_{\text{error due to approximation}} + \big(R_P(f_n) - R_P\big)$$
(Rahimi and Recht, 2008b): rate of $(m \wedge n)^{-1/2}$.
(Rudi and Rosasco, 2016): If $m \gtrsim n^{\alpha}$, where $\frac{1}{2} \le \alpha < 1$ with $\alpha$ depending on the properties of $f_\star$, then $f_{m,n}$ achieves the minimax optimal rate obtained in the case with no approximation.
Similar results are derived for the Nyström approximation (Bach, 2013; Alaoui and Mahoney, 2015; Rudi et al., 2015).
Computational gain with no statistical loss!!

43 What happens statistically?
Two notions for PCA:
Reconstruction error
Convergence of eigenspaces

44 Reconstruction Error
Linear PCA: $\mathbb{E}_{X \sim P}\big\|(X - \mu) - \sum_{i=1}^l \langle X - \mu, \phi_i\rangle_2\, \phi_i\big\|_2^2$
Kernel PCA: $\mathbb{E}_{X \sim P}\big\|\bar{k}(\cdot, X) - \sum_{i=1}^l \langle \bar{k}(\cdot, X), \phi_i\rangle_{\mathcal{H}}\, \phi_i\big\|_{\mathcal{H}}^2$, where $\bar{k}(\cdot, x) = k(\cdot, x) - \int k(\cdot, x)\, dP(x)$.
However, the eigenfunctions of approximate empirical KPCA lie in $\mathcal{H}_m$, which is finite dimensional and not contained in $\mathcal{H}$.

47 Embedding to $L^2(P)$
What we have:
Population eigenfunctions $(\phi_i)_{i \in I}$ of $\Sigma$: these form a subspace of $\mathcal{H}$.
Empirical eigenfunctions $(\hat{\phi}_i)_{i=1}^n$ of $\hat{\Sigma}$: these form a subspace of $\mathcal{H}$.
Eigenvectors after approximation, $(\hat{\phi}_{i,m})_{i=1}^m$ of $\hat{\Sigma}_m$: these form a subspace of $\mathbb{R}^m$.
We embed them in a common space before comparing. The common space is $L^2(P)$.
(Inclusion operator) $I : \mathcal{H} \to L^2(P)$, $f \mapsto f - \int_{\mathcal{X}} f(x)\, dP(x)$
(Approximation operator) $U : \mathbb{R}^m \to L^2(P)$, $\alpha \mapsto \sum_{i=1}^m \alpha_i \big(\varphi_i - \int_{\mathcal{X}} \varphi_i(x)\, dP(x)\big)$

49 Properties
$\Sigma = I^* I$; $I$ and $I^*$ are Hilbert–Schmidt and $\Sigma$ is trace-class.
$\Sigma_m = U^* U$; $U$ and $U^*$ are Hilbert–Schmidt and $\Sigma_m$ is trace-class.
$\|\Sigma - \hat{\Sigma}\|_{HS} = O_P(n^{-1/2})$
$\|\Sigma_m - \hat{\Sigma}_m\|_{HS} = O_P(n^{-1/2})$
$\|II^* - UU^*\|_{HS} = O_\Lambda(m^{-1/2})$

53 Reconstruction Error
$\big(\frac{I\phi_i}{\sqrt{\lambda_i}}\big)_i$ form an ONS for $L^2(P)$. Define $\bar{k}(\cdot, x) = k(\cdot, x) - \mu_P$ and $\tau > 0$.
Population KPCA:
$$R_l = \mathbb{E}\Big\| I\bar{k}(\cdot, X) - \sum_{i=1}^l \Big\langle I\bar{k}(\cdot, X), \frac{I\phi_i}{\sqrt{\lambda_i}}\Big\rangle_{L^2(P)} \frac{I\phi_i}{\sqrt{\lambda_i}} \Big\|_{L^2(P)}^2 = \big\|\Sigma - \Sigma^{1/2}\Sigma_l^{-1}\Sigma^{3/2}\big\|_{HS}^2 = \|\Sigma - \Sigma_l\|_{HS}^2.$$
Empirical KPCA:
$$R_{n,l} = \mathbb{E}\Big\| I\bar{k}(\cdot, X) - \sum_{i=1}^l \Big\langle I\bar{k}(\cdot, X), \frac{I\hat{\phi}_i}{\sqrt{\hat{\lambda}_i}}\Big\rangle_{L^2(P)} \frac{I\hat{\phi}_i}{\sqrt{\hat{\lambda}_i}} \Big\|_{L^2(P)}^2 = \big\|\Sigma - \Sigma^{1/2}\hat{\Sigma}_l^{-1}\Sigma^{3/2}\big\|_{HS}^2.$$
Approximate Empirical KPCA:
$$R_{m,n,l} = \mathbb{E}\Big\| I\bar{k}(\cdot, X) - \sum_{i=1}^l \Big\langle I\bar{k}(\cdot, X), \frac{U\hat{\phi}_{i,m}}{\sqrt{\hat{\lambda}_{i,m}}}\Big\rangle_{L^2(P)} \frac{U\hat{\phi}_{i,m}}{\sqrt{\hat{\lambda}_{i,m}}} \Big\|_{L^2(P)}^2 = \big\|\big(\mathbf{I} - U\hat{\Sigma}_{m,l}^{-1}U^*\big)\, II^*\big\|_{HS}^2.$$

56 Result
Clearly $R_l \to 0$ as $l \to \infty$. The goal is to study the convergence rates of $R_l$, $R_{n,l}$ and $R_{m,n,l}$ as $l, m, n \to \infty$.
Suppose $\lambda_i \asymp i^{-\alpha}$, $\alpha > \frac{1}{2}$, $l = n^{\theta/\alpha}$ and $m = n^\gamma$, where $\theta > 0$ and $0 < \gamma < 1$. Then
$$R_l \asymp n^{-2\theta(1 - \frac{1}{2\alpha})},$$
$$R_{n,l} \lesssim \begin{cases} n^{-2\theta(1 - \frac{1}{2\alpha})}, & 0 < \theta \le \frac{\alpha}{2(3\alpha - 1)} \quad \text{(bias dominates)} \\ n^{-(\frac{1}{2} - \theta)}, & \frac{\alpha}{2(3\alpha - 1)} \le \theta < \frac{1}{2} \quad \text{(variance dominates)} \end{cases}$$
and, for $\gamma > 2\theta$,
$$R_{m,n,l} \lesssim \begin{cases} n^{-2\theta(1 - \frac{1}{2\alpha})}, & 0 < \theta \le \frac{\alpha}{2(3\alpha - 1)} \\ n^{-(\frac{1}{2} - \theta)}, & \frac{\alpha}{2(3\alpha - 1)} \le \theta < \frac{1}{2} \end{cases}$$
No statistical loss.
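
A small worked example of the parameter schedule in the result above (my own numbers, purely illustrative; rf_kpca_schedule is a made-up helper): given $n$, an eigendecay exponent $\alpha$, and choices of $\theta$ and $\gamma$ with $\gamma > 2\theta$, it returns the truncation level $l = n^{\theta/\alpha}$, the number of random features $m = n^\gamma$, and the predicted exponent of the $R_{m,n,l}$ rate.

```python
import numpy as np

def rf_kpca_schedule(n, alpha, theta, gamma):
    """Return (l, m, rate_exponent) so that R_{m,n,l} decays like n^{-rate_exponent}."""
    assert alpha > 0.5 and 0 < theta < 0.5 and 2 * theta < gamma < 1
    l = int(np.ceil(n ** (theta / alpha)))        # number of retained components
    m = int(np.ceil(n ** gamma))                  # number of random features
    threshold = alpha / (2 * (3 * alpha - 1))     # bias/variance crossover for theta
    if theta <= threshold:
        exponent = 2 * theta * (1 - 1 / (2 * alpha))   # bias-dominated regime
    else:
        exponent = 0.5 - theta                         # variance-dominated regime
    return l, m, exponent

# Example: n = 10**5 samples, alpha = 2, theta = 0.2 (= crossover), gamma = 0.5
print(rf_kpca_schedule(10**5, alpha=2.0, theta=0.2, gamma=0.5))
# -> roughly l = 4, m = 317, and R_{m,n,l} decaying like n^{-0.3}
```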

61 Convergence of Projection Operators-I
Since $\big(\frac{I\phi_i}{\sqrt{\lambda_i}}\big)_i$ form an ONS for $L^2(P)$, we consider:
Empirical KPCA:
$$S_{n,l} = \Big\| \sum_{i=1}^l \frac{I\phi_i}{\sqrt{\lambda_i}} \otimes_{L^2(P)} \frac{I\phi_i}{\sqrt{\lambda_i}} - \sum_{i=1}^l \frac{I\hat{\phi}_i}{\sqrt{\hat{\lambda}_i}} \otimes_{L^2(P)} \frac{I\hat{\phi}_i}{\sqrt{\hat{\lambda}_i}} \Big\|_{op}$$
Approximate Empirical KPCA:
$$S_{m,n,l} = \Big\| \sum_{i=1}^l \frac{I\phi_i}{\sqrt{\lambda_i}} \otimes_{L^2(P)} \frac{I\phi_i}{\sqrt{\lambda_i}} - \sum_{i=1}^l \frac{U\hat{\phi}_{m,i}}{\sqrt{\hat{\lambda}_{m,i}}} \otimes_{L^2(P)} \frac{U\hat{\phi}_{m,i}}{\sqrt{\hat{\lambda}_{m,i}}} \Big\|_{op}$$
as $l, m, n \to \infty$.

63 Convergence of Projection Operators-I
Unlike for the reconstruction error, the convergence of the projection operators depends on the behavior of the eigen-gap, $\delta_i = \frac{1}{2}(\lambda_i - \lambda_{i+1})$, $i \in \mathbb{N}$.
Empirical KPCA: assuming $\delta_l \gtrsim n^{-1/2}$ and $\lambda_l \gtrsim n^{-1/2}$,
$$S_{n,l} \lesssim \frac{1}{\delta_l \sqrt{n}} + \frac{n^{-1/4}}{\sqrt{\lambda_l}}.$$
Approximate Empirical KPCA: assuming $\delta_l \gtrsim m^{-1/2}$ and $\lambda_l \gtrsim m^{-1/2}$,
$$S_{m,n,l} \lesssim \frac{1}{\delta_l \sqrt{m}} + \frac{n^{-1/4}}{\sqrt{\lambda_l}}.$$

65 Convergence of Projection Operators-I
Suppose $\lambda_i \asymp i^{-\alpha}$, $\alpha > \frac{1}{2}$, $\delta_i \asymp i^{-\beta}$, $\beta \ge \alpha$, $l = n^{\theta/\alpha}$ and $m = n^\gamma$, where $0 < \theta < \frac{1}{2}$ and $0 < \gamma < 1$. Then
$$S_{n,l} \lesssim \begin{cases} n^{-(\frac{1}{4} - \frac{\theta}{2})}, & 0 < \theta < \frac{\alpha}{2(2\beta - \alpha)} \\ n^{-(\frac{1}{2} - \frac{\theta\beta}{\alpha})}, & \frac{\alpha}{2(2\beta - \alpha)} \le \theta < \frac{\alpha}{2\beta} \end{cases}$$
$$S_{m,n,l} \lesssim \begin{cases} n^{-(\frac{1}{4} - \frac{\theta}{2})}, & 0 < \theta < \frac{\alpha}{2(2\beta - \alpha)},\ \ \gamma \ge \theta\big(\frac{2\beta}{\alpha} - 1\big) + \frac{1}{2} \\ n^{-(\frac{\gamma}{2} - \frac{\theta\beta}{\alpha})}, & 0 < \theta < \frac{\alpha}{2(2\beta - \alpha)},\ \ \frac{2\theta\beta}{\alpha} < \gamma < \theta\big(\frac{2\beta}{\alpha} - 1\big) + \frac{1}{2} \end{cases}$$

67 Convergence of Projection Operators-II
Empirical KPCA:
$$T_{n,l} = \Big\| \sum_{i=1}^l I\phi_i \otimes_{L^2(P)} I\phi_i - \sum_{i=1}^l I\hat{\phi}_i \otimes_{L^2(P)} I\hat{\phi}_i \Big\|_{HS} = \Big\| \Sigma^{1/2}\Big( \sum_{i=1}^l \phi_i \otimes_{\mathcal{H}} \phi_i - \sum_{i=1}^l \hat{\phi}_i \otimes_{\mathcal{H}} \hat{\phi}_i \Big)\Sigma^{1/2} \Big\|_{HS} \lesssim \frac{\sqrt{l\lambda_l}}{\delta_l \sqrt{n}}.$$
Approximate Empirical KPCA:
$$T_{m,n,l} = \Big\| \sum_{i=1}^l I\phi_i \otimes_{L^2(P)} I\phi_i - \sum_{i=1}^l U\hat{\phi}_{i,m} \otimes_{L^2(P)} U\hat{\phi}_{i,m} \Big\|_{HS} \lesssim \frac{\sqrt{l\lambda_l}}{\delta_l}\Big(\frac{1}{\sqrt{n}} + \frac{1}{\sqrt{m}}\Big).$$
Convergence rates can be derived under the assumptions $\lambda_i \asymp i^{-\alpha}$, $\alpha > \frac{1}{2}$, $\delta_i \asymp i^{-\beta}$, $\beta \ge \alpha$, $l = n^{\theta/\alpha}$ and $m = n^\gamma$, where $0 < \theta < \frac{1}{2}$ and $0 < \gamma < 1$.

70 Summary
Random feature approximation to kernel PCA improves its computational complexity.
Statistical trade-off: reconstruction error; convergence of eigenspaces.
Open questions: lower bounds; extension to kernel canonical correlation analysis; Nyström approximation.

71 Thank You

72 References I
Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc., 68.
Fine, S. and Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2.
Rahimi, A. and Recht, B. (2008a). Random features for large-scale kernel machines. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors, Advances in Neural Information Processing Systems 20. Curran Associates, Inc.
Rahimi, A. and Recht, B. (2008b). Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS.
Rudi, A. and Rosasco, L. (2016). Generalization properties of learning with random features.
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10.
Smola, A. J. and Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proc. 17th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA.
Sriperumbudur, B. K. and Szabó, Z. (2015). Optimal rates for random Fourier features. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28. Curran Associates, Inc.
Sutherland, D. and Schneider, J. (2015). On the error of random Fourier features. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence.

73 References II
Williams, C. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, and V. T., editors, Advances in Neural Information Processing Systems 13, Cambridge, MA. MIT Press.
Yang, Y., Pilanci, M., and Wainwright, M. J. (2015). Randomized sketches for kernels: Fast and optimal non-parametric regression. Technical report.
