Approximate Kernel PCA with Random Features
Approximate Kernel PCA with Random Features (Computational vs. Statistical Trade-off)

Bharath K. Sriperumbudur, Department of Statistics, Pennsylvania State University
Journées de Statistique, Paris, May 28, 2018
(Joint work with Nicholas Sterge, PSU)
Supported by the National Science Foundation (DMS)
Outline

- Principal Component Analysis (PCA)
- Kernel PCA
- Approximation methods
- Kernel PCA with Random Features
- Computational vs. statistical trade-off
Principal Component Analysis (PCA) (Pearson, 1901)

Suppose $X \sim P$ is a random variable in $\mathbb{R}^d$ with mean $\mu$ and covariance matrix $\Sigma$.

- Find a direction $w \in \{v : \|v\|_2 = 1\}$ such that $\mathrm{Var}[\langle w, X \rangle_2] = \langle w, \Sigma w \rangle_2$ is maximized.
- Find a direction $w \in \{v : \|v\|_2 = 1\}$ such that $\mathbb{E}\,\|(X - \mu) - \langle w, X - \mu \rangle_2\, w\|_2^2$ is minimized.

The formulations are equivalent, and the solution is the eigenvector of the covariance matrix $\Sigma$ corresponding to its largest eigenvalue. Can be generalized to multiple directions (find a subspace...). Applications: dimensionality reduction.
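To make this concrete, here is a minimal NumPy sketch (all names are illustrative, not from the talk) that recovers the top-$l$ principal directions from the eigendecomposition of the empirical covariance matrix:

```python
import numpy as np

def pca(X, l):
    """Top-l principal directions of the rows of X (n samples, d features)."""
    mu = X.mean(axis=0)
    Xc = X - mu                           # center the data
    Sigma = Xc.T @ Xc / X.shape[0]        # empirical covariance matrix (d x d)
    evals, evecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:l]     # indices of the top-l eigenvalues
    W = evecs[:, idx]                     # d x l matrix of principal directions
    scores = Xc @ W                       # representation <X_i - mu, w_j>_2
    return W, evals[idx], scores
```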
Kernel PCA

Nonlinear generalization of PCA (Schölkopf et al., 1998): map $X \mapsto \Phi(X)$ through the feature map $\Phi$ and apply PCA. The choice of $\Phi$ determines the degree of information we capture about $X$. Suppose $\Phi(X) = (1, X, X^2, X^3, \ldots)$; then the covariance of $\Phi(X)$ captures the higher-order moments of $X$. $\Phi$ is not explicitly specified but implicitly specified through a positive definite kernel function, $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$.
Kernel PCA: Functional Version

Find $f \in \{g \in H : \|g\|_H = 1\}$ such that $\mathrm{Var}[f(X)]$ is maximized, i.e.,

$f^\star = \arg\sup_{\|f\|_H = 1} \mathrm{Var}[f(X)]$,

where $H$ is a reproducing kernel Hilbert space (the evaluation functionals $f \mapsto f(x)$ are bounded for all $x \in \mathcal{X}$) of real-valued functions (Aronszajn, 1950). There exists a unique $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $k(\cdot, x) \in H$ for all $x \in \mathcal{X}$ and $f(x) = \langle k(\cdot, x), f \rangle_H$ for all $f \in H$, $x \in \mathcal{X}$. $k$ is called the reproducing kernel of $H$, as

$k(x, y) = \langle \underbrace{k(\cdot, x)}_{\Phi(x)}, \underbrace{k(\cdot, y)}_{\Phi(y)} \rangle_H$,

and it is symmetric and positive definite. In fact, the converse is also true (Moore-Aronszajn theorem).
RKHS and Kernels

Kernels correspond exactly to positive definite and symmetric functions. The RKHS is $H = \overline{\mathrm{span}}\{k(\cdot, x) : x \in \mathcal{X}\}$ (the closure of the linear span of kernel functions). $k$ controls the properties of $f \in H$: if $k$ satisfies $(\star)$, then every $f \in H$ satisfies $(\star)$, where $(\star)$ is boundedness, continuity, measurability, integrability, or differentiability.
Kernel PCA

Kernel PCA generalizes linear PCA: for $k(x, y) = \langle x, y \rangle_2$, $x, y \in \mathbb{R}^d$, $H$ is isometrically isomorphic to $\mathbb{R}^d$, and $f(x) = \langle w_f, x \rangle_2$ for $f \in H$.
Kernel PCA

Using the reproducing property $f(X) = \langle f, \underbrace{k(\cdot, X)}_{\Phi(X)} \rangle_H$, we obtain

$f^\star = \arg\sup_{\|f\|_H = 1} \langle f, \Sigma f \rangle_H$,

where

$\Sigma := \int_{\mathcal{X}} (k(\cdot, x) - \mu_P) \otimes_H (k(\cdot, x) - \mu_P)\, dP(x)$

is the covariance operator (self-adjoint, positive and trace class) on $H$, and $\mu_P := \int_{\mathcal{X}} k(\cdot, x)\, dP(x)$ is the mean element.

Spectral theorem: $\Sigma = \sum_{i \in I} \lambda_i\, \phi_i \otimes_H \phi_i$, where $I$ is either countable ($\lambda_i \to 0$ as $i \to \infty$) or finite. As in linear PCA, the solution is the eigenfunction corresponding to the largest eigenvalue of $\Sigma$.
Empirical Kernel PCA

In practice $P$ is unknown, but we have access to $(X_i)_{i=1}^n \stackrel{\text{i.i.d.}}{\sim} P$:

$\hat f = \arg\sup_{\|f\|_H = 1} \langle f, \hat\Sigma f \rangle_H$,

where

$\hat\Sigma := \frac{1}{n} \sum_{i=1}^n (k(\cdot, X_i) - \mu_n) \otimes_H (k(\cdot, X_i) - \mu_n)$

is the empirical covariance operator (symmetric, positive and trace class) on $H$, and $\mu_n := \frac{1}{n} \sum_{i=1}^n k(\cdot, X_i)$.

Spectral theorem: $\hat\Sigma = \sum_{i=1}^n \hat\lambda_i\, \hat\phi_i \otimes_H \hat\phi_i$. $\hat f$ is the eigenfunction corresponding to the largest eigenvalue of $\hat\Sigma$.
Empirical Kernel PCA: Representer Theorem

Since $\hat\Sigma$ is an infinite-dimensional operator, we have to solve an infinite-dimensional eigensystem, $\hat\Sigma \hat\phi_i = \hat\lambda_i \hat\phi_i$. Consider

$\sup_{\|f\|_H = 1} \langle f, \hat\Sigma f \rangle_H = \sup_{\|f\|_H = 1} \left[ \frac{1}{n} \sum_{i=1}^n \langle f, k(\cdot, X_i) \rangle_H^2 - \left\langle f, \frac{1}{n} \sum_{i=1}^n k(\cdot, X_i) \right\rangle_H^2 \right]$.

Clearly $f^\star \in \mathrm{span}\{k(\cdot, X_i) : i = 1, \ldots, n\}$, i.e., $f^\star = \sum_{i=1}^n \alpha_i k(\cdot, X_i)$. Then

$\sup\left\{ \langle f, \hat\Sigma f \rangle_H : \|f\|_H = 1 \right\} = \sup\left\{ \alpha^\top K H_n K \alpha : \alpha^\top K \alpha = 1 \right\}$,

i.e., $K H_n K \alpha = \lambda K \alpha$, where $K_{ij} = k(X_i, X_j)$ is the Gram matrix and $H_n = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is the centering matrix. Requires $K$ to be invertible.
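In practice one typically works with the centered Gram matrix $H_n K H_n$, whose eigensystem is equivalent to the generalized problem above. A minimal sketch (names are illustrative; small or negative eigenvalues would need guarding in real code):

```python
import numpy as np

def kernel_pca(K, l):
    """Top-l kernel principal components from an n x n Gram matrix K."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix H_n
    Kc = H @ K @ H                                # centered Gram matrix
    evals, evecs = np.linalg.eigh(Kc)             # O(n^3): the scalability bottleneck
    idx = np.argsort(evals)[::-1][:l]             # top-l eigenvalues of Kc
    lam_hat = evals[idx] / n                      # eigenvalues of Sigma_hat
    alphas = evecs[:, idx] / np.sqrt(evals[idx])  # coefficients of phi_hat_i in k(., X_j)
    scores = evecs[:, idx] * np.sqrt(evals[idx])  # <k(., X_j) - mu_n, phi_hat_i>_H
    return lam_hat, alphas, scores
```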
Empirical Kernel PCA

In classical PCA, $X_i \in \mathbb{R}^d$, $i = 1, \ldots, n$, is represented as

$(\langle X_i - \mu, \hat w_1 \rangle_2, \ldots, \langle X_i - \mu, \hat w_l \rangle_2) \in \mathbb{R}^l$ with $l \le d$,

where $(\hat w_i)_{i=1}^l$ are the eigenvectors of $\hat\Sigma$ corresponding to the top-$l$ eigenvalues.

In kernel PCA, $X_i \in \mathcal{X}$, $i = 1, \ldots, n$, is represented as

$(\langle k(\cdot, X_i) - \mu_n, \hat\phi_1 \rangle_H, \ldots, \langle k(\cdot, X_i) - \mu_n, \hat\phi_l \rangle_H) \in \mathbb{R}^l$ with $l \le n$,

where $(\hat\phi_i)_{i=1}^l$ are the eigenfunctions of $\hat\Sigma$ corresponding to the top-$l$ eigenvalues.
Summary

The direct formulation requires knowledge of the feature map $\Phi$ (and of course $H$), and these could be infinite dimensional: $\hat\Sigma \hat\phi_i = \hat\lambda_i \hat\phi_i$. The alternate formulation is entirely determined by kernel evaluations (the Gram matrix):

$\hat\phi_i = \sum_{j=1}^n \alpha_{i,j}\, k(\cdot, X_j)$, where $\alpha_i$ satisfies $H_n K \alpha_i = \lambda_i \alpha_i$.

But poor scalability: $O(n^3)$.
Approximation Schemes

- Incomplete Cholesky factorization (e.g., Fine and Scheinberg, 2001)
- Sketching (Yang et al., 2015)
- Sparse greedy approximation (Smola and Schölkopf, 2000)
- Nyström method (e.g., Williams and Seeger, 2001)
- Random Fourier features (e.g., Rahimi and Recht, 2008a), ...
Random Fourier Approximation

Let $\mathcal{X} = \mathbb{R}^d$ and let $k$ be continuous and translation-invariant, i.e., $k(x, y) = \psi(x - y)$.

Bochner's theorem: $\psi$ is positive definite if and only if

$k(x, y) = \int_{\mathbb{R}^d} e^{\sqrt{-1}\, \langle \omega, x - y \rangle_2}\, d\Lambda(\omega)$,

where $\Lambda$ is a finite non-negative Borel measure on $\mathbb{R}^d$. Since $k$ is symmetric, $\Lambda$ is a symmetric measure on $\mathbb{R}^d$, and therefore

$k(x, y) = \int_{\mathbb{R}^d} \cos(\langle \omega, x - y \rangle_2)\, d\Lambda(\omega)$.
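A standard example (not spelled out on the slide, but useful for what follows): for the Gaussian kernel the spectral measure $\Lambda$ is itself Gaussian, so sampling from it is easy.

```latex
k(x,y) = \exp\left(-\frac{\|x-y\|_2^2}{2\sigma^2}\right)
       = \int_{\mathbb{R}^d} e^{\sqrt{-1}\,\langle \omega,\, x-y\rangle_2}\, d\Lambda(\omega),
\qquad \Lambda = N\!\left(0,\, \sigma^{-2} I_d\right).
```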
Random Feature Approximation

(Rahimi and Recht, 2008a): Draw $(\omega_j)_{j=1}^m \stackrel{\text{i.i.d.}}{\sim} \Lambda$ and set

$k_m(x, y) = \frac{1}{m} \sum_{j=1}^m \cos(\langle \omega_j, x - y \rangle_2) = \langle \Phi_m(x), \Phi_m(y) \rangle_{\mathbb{R}^{2m}} \approx k(x, y) = \underbrace{\langle k(\cdot, x), k(\cdot, y) \rangle_H}_{\langle \Phi(x), \Phi(y) \rangle_H}$,

where

$\Phi_m(x) = \frac{1}{\sqrt{m}} \big( \overbrace{\cos(\langle \omega_1, x \rangle_2)}^{\varphi_1(x)}, \ldots, \cos(\langle \omega_m, x \rangle_2), \sin(\langle \omega_1, x \rangle_2), \ldots, \sin(\langle \omega_m, x \rangle_2) \big)$.

Idea: Apply PCA to $\Phi_m(x)$.
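A minimal sketch of this feature map for the Gaussian kernel, whose spectral measure is the Gaussian $\Lambda$ above (names are illustrative):

```python
import numpy as np

def rff_features(X, m, sigma=1.0, seed=0):
    """Random Fourier feature map Phi_m: maps n x d data to n x 2m features."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], m))  # omega_j ~ N(0, sigma^{-2} I_d)
    XW = X @ W                                               # <omega_j, x_i> for all i, j
    return np.hstack([np.cos(XW), np.sin(XW)]) / np.sqrt(m)
```

By the product formula for cosines, $\langle \Phi_m(x), \Phi_m(y) \rangle_2 = \frac{1}{m} \sum_{j=1}^m \cos(\langle \omega_j, x - y \rangle_2)$, which is exactly the $k_m$ above.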
Approximate Kernel PCA (RF-KPCA)

Perform linear PCA on $\Phi_m(X)$, where $X \sim P$. Approximate empirical KPCA finds $\beta \in \mathbb{R}^m$ that solves

$\sup_{\|\beta\|_2 = 1} \mathrm{Var}[\langle \beta, \Phi_m(X) \rangle_2] = \sup_{\|\beta\|_2 = 1} \langle \beta, \Sigma_m \beta \rangle_2$,

where $\Sigma_m := \mathbb{E}[\Phi_m(X) \otimes_2 \Phi_m(X)] - \mathbb{E}[\Phi_m(X)] \otimes_2 \mathbb{E}[\Phi_m(X)]$.

This is the same as doing kernel PCA in $H_m$, where

$H_m = \left\{ f = \sum_{i=1}^m \beta_i \varphi_i : \beta \in \mathbb{R}^m \right\}$

is an RKHS induced by the reproducing kernel $k_m$, with $\langle f, g \rangle_{H_m} = \langle \beta_f, \beta_g \rangle_2$.
Empirical RF-KPCA

The empirical counterpart is

$\hat\beta_m = \arg\sup_{\|\beta\|_2 = 1} \langle \beta, \hat\Sigma_m \beta \rangle_2$,

where

$\hat\Sigma_m := \frac{1}{n} \sum_{i=1}^n \Phi_m(X_i) \otimes_2 \Phi_m(X_i) - \left( \frac{1}{n} \sum_{i=1}^n \Phi_m(X_i) \right) \otimes_2 \left( \frac{1}{n} \sum_{i=1}^n \Phi_m(X_i) \right)$.

Eigendecomposition: $\hat\Sigma_m = \sum_{i=1}^m \hat\lambda_{i,m}\, \hat\phi_{i,m} \otimes_2 \hat\phi_{i,m}$.

$\hat\beta_m$ is obtained by solving an $m \times m$ eigensystem: complexity is $O(m^3)$. What happens statistically?
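Putting the pieces together, a sketch of empirical RF-KPCA (reusing the hypothetical rff_features above): the eigensystem is now $2m \times 2m$ instead of $n \times n$.

```python
import numpy as np  # assumes rff_features from the previous sketch is in scope

def rf_kpca(X, m, l, sigma=1.0, seed=0):
    """Approximate kernel PCA: linear PCA on the random features Phi_m(X_i)."""
    Z = rff_features(X, m, sigma, seed)     # n x 2m feature matrix
    Zc = Z - Z.mean(axis=0)                 # center in feature space
    Sigma_m = Zc.T @ Zc / Z.shape[0]        # empirical covariance Sigma_hat_m (2m x 2m)
    evals, evecs = np.linalg.eigh(Sigma_m)  # O(m^3) instead of O(n^3)
    idx = np.argsort(evals)[::-1][:l]
    betas = evecs[:, idx]                   # top-l eigenvectors beta_hat
    scores = Zc @ betas                     # l-dimensional representation of the data
    return evals[idx], betas, scores
```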
How good is the approximation?

(Sriperumbudur and Szabó, 2015):

$\sup_{x, y \in S} |k_m(x, y) - k(x, y)| = O_{a.s.}\left( \sqrt{\frac{\log |S|}{m}} \right)$.

This is the optimal convergence rate. Other results are known, but they are non-optimal (Rahimi and Recht, 2008a; Sutherland and Schneider, 2015).
What happens statistically?

Kernel ridge regression: $(X_i, Y_i)_{i=1}^n \stackrel{\text{i.i.d.}}{\sim} \rho_{XY}$, and

$R^\star_P = \inf_{f \in L^2(\rho_X)} \mathbb{E}|f(X) - Y|^2 = \mathbb{E}|f^\star(X) - Y|^2$,

where $f^\star$ is the minimizer. Penalized risk minimization, $O(n^3)$:

$f_n = \arg\inf_{f \in H} \frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2 + \lambda \|f\|_H^2$.

Penalized risk minimization (approximate), $O(m^2 n)$:

$f_{m,n} = \arg\inf_{f \in H_m} \frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2 + \lambda \|f\|_{H_m}^2$.
What happens statistically?

$\underbrace{R_P(f_{m,n})}_{\mathbb{E}|f_{m,n}(X) - Y|^2} - R^\star_P = \underbrace{(R_P(f_{m,n}) - R_P(f_n))}_{\text{error due to approximation}} + (R_P(f_n) - R^\star_P)$.

(Rahimi and Recht, 2008b): rate of order $(m \wedge n)^{-1/2}$.

(Rudi and Rosasco, 2016): If $m \sim n^\alpha$ where $1/2 \le \alpha < 1$, with $\alpha$ depending on the properties of the target $f^\star$, then $f_{m,n}$ achieves the minimax optimal rate obtained in the case with no approximation. Similar results have been derived for the Nyström approximation (Bach, 2013; Alaoui and Mahoney, 2015; Rudi et al., 2015).

Computational gain with no statistical loss!!
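The same computational pattern, sketched for ridge regression in $H_m$: forming and solving the $2m \times 2m$ system costs $O(m^2 n + m^3)$ rather than the $O(n^3)$ of exact kernel ridge regression. A hedged sketch with illustrative names, assuming the Gaussian kernel:

```python
import numpy as np

def rff_ridge_fit(X, y, m, lam, sigma=1.0, seed=0):
    """Ridge regression on random Fourier features: f_{m,n}(x) = <w, Phi_m(x)>_2."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], m))   # omegas, reused at prediction
    phi = lambda A: np.hstack([np.cos(A @ W), np.sin(A @ W)]) / np.sqrt(m)
    Z = phi(X)                                                # n x 2m
    G = Z.T @ Z + X.shape[0] * lam * np.eye(2 * m)            # normal equations: O(m^2 n) to form
    w = np.linalg.solve(G, Z.T @ y)                           # O(m^3) to solve
    return lambda Xnew: phi(Xnew) @ w                         # the predictor f_{m,n}
```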
What happens statistically?

Two notions for PCA:
- Reconstruction error
- Convergence of eigenspaces
Reconstruction Error

Linear PCA: $\mathbb{E}_{X \sim P} \left\| (X - \mu) - \sum_{i=1}^l \langle X - \mu, \phi_i \rangle_2\, \phi_i \right\|_2^2$.

Kernel PCA: $\mathbb{E}_{X \sim P} \left\| \bar k(\cdot, X) - \sum_{i=1}^l \langle \bar k(\cdot, X), \phi_i \rangle_H\, \phi_i \right\|_H^2$, where $\bar k(\cdot, x) = k(\cdot, x) - \int k(\cdot, x)\, dP(x)$.

However, the eigenfunctions of approximate empirical KPCA lie in $H_m$, which is finite dimensional and not contained in $H$.
Embedding to $L^2(P)$

What we have:
- Population eigenfunctions $(\phi_i)_{i \in I}$ of $\Sigma$: these span a subspace of $H$.
- Empirical eigenfunctions $(\hat\phi_i)_{i=1}^n$ of $\hat\Sigma$: these span a subspace of $H$.
- Eigenvectors after approximation, $(\hat\phi_{i,m})_{i=1}^m$ of $\hat\Sigma_m$: these span a subspace of $\mathbb{R}^m$.

We embed them in a common space before comparing; the common space is $L^2(P)$.

(Inclusion operator) $I : H \to L^2(P)$, $f \mapsto f - \int_{\mathcal{X}} f(x)\, dP(x)$.

(Approximation operator) $U : \mathbb{R}^m \to L^2(P)$, $\alpha \mapsto \sum_{i=1}^m \alpha_i \left( \varphi_i - \int_{\mathcal{X}} \varphi_i(x)\, dP(x) \right)$.
Properties

- $\Sigma = I^* I$; $I$ and $I^*$ are Hilbert-Schmidt and $\Sigma$ is trace-class.
- $\Sigma_m = U^* U$; $U$ and $U^*$ are Hilbert-Schmidt and $\Sigma_m$ is trace-class.
- $\|\Sigma - \hat\Sigma\|_{HS} = O_P(n^{-1/2})$
- $\|\Sigma_m - \hat\Sigma_m\|_{HS} = O_P(n^{-1/2})$
- $\|I I^* - U U^*\|_{HS} = O_\Lambda(m^{-1/2})$
Reconstruction Error

$\left( \frac{I\phi_i}{\sqrt{\lambda_i}} \right)_i$ form an ONS for $L^2(P)$. Define $\bar k(\cdot, x) = k(\cdot, x) - \mu_P$ and let $\tau > 0$.

Population KPCA:

$R_l = \mathbb{E} \left\| I\bar k(\cdot, X) - \sum_{i=1}^l \left\langle I\bar k(\cdot, X), \frac{I\phi_i}{\sqrt{\lambda_i}} \right\rangle_{L^2(P)} \frac{I\phi_i}{\sqrt{\lambda_i}} \right\|_{L^2(P)}^2 = \|\Sigma - \Sigma^{1/2} \Sigma_l^{-1} \Sigma^{3/2}\|_{HS}^2 = \|\Sigma - \Sigma_l\|_{HS}^2$.

Empirical KPCA:

$R_{n,l} = \mathbb{E} \left\| I\bar k(\cdot, X) - \sum_{i=1}^l \left\langle I\bar k(\cdot, X), \frac{I\hat\phi_i}{\sqrt{\hat\lambda_i}} \right\rangle_{L^2(P)} \frac{I\hat\phi_i}{\sqrt{\hat\lambda_i}} \right\|_{L^2(P)}^2 = \|\Sigma - \Sigma^{1/2} \hat\Sigma_l^{-1} \Sigma^{3/2}\|_{HS}^2$.

Approximate empirical KPCA:

$R_{m,n,l} = \mathbb{E} \left\| I\bar k(\cdot, X) - \sum_{i=1}^l \left\langle I\bar k(\cdot, X), \frac{U\hat\phi_{i,m}}{\sqrt{\hat\lambda_{i,m}}} \right\rangle_{L^2(P)} \frac{U\hat\phi_{i,m}}{\sqrt{\hat\lambda_{i,m}}} \right\|_{L^2(P)}^2 = \left\| \left( I - U \hat\Sigma_{m,l}^{-1} U^* \right) I I^* \right\|_{HS}^2$.
Result

Clearly $R_l \to 0$ as $l \to \infty$. The goal is to study the convergence rates of $R_l$, $R_{n,l}$ and $R_{m,n,l}$ as $l, m, n \to \infty$.

Suppose $\lambda_i \asymp i^{-\alpha}$ with $\alpha > 1/2$, $l \asymp n^{\theta/\alpha}$ and $m = n^\gamma$, where $\theta > 0$ and $0 < \gamma < 1$. Then

$R_l \asymp n^{-2\theta\left(1 - \frac{1}{2\alpha}\right)}$,

$R_{n,l} \asymp \begin{cases} n^{-2\theta\left(1 - \frac{1}{2\alpha}\right)}, & 0 < \theta \le \frac{\alpha}{2(3\alpha - 1)} \quad \text{(bias dominates)} \\ n^{-\left(\frac{1}{2} - \theta\right)}, & \frac{\alpha}{2(3\alpha - 1)} \le \theta < \frac{1}{2} \quad \text{(variance dominates)} \end{cases}$

and, for $\gamma > 2\theta$,

$R_{m,n,l} \asymp \begin{cases} n^{-2\theta\left(1 - \frac{1}{2\alpha}\right)}, & 0 < \theta \le \frac{\alpha}{2(3\alpha - 1)} \\ n^{-\left(\frac{1}{2} - \theta\right)}, & \frac{\alpha}{2(3\alpha - 1)} \le \theta < \frac{1}{2} \end{cases}$

No statistical loss.
Convergence of Projection Operators I

Since $\left( \frac{I\phi_i}{\sqrt{\lambda_i}} \right)_i$ form an ONS for $L^2(P)$, we consider:

Empirical KPCA:

$S_{n,l} = \left\| \sum_{i=1}^l \frac{I\phi_i}{\sqrt{\lambda_i}} \otimes \frac{I\phi_i}{\sqrt{\lambda_i}} - \sum_{i=1}^l \frac{I\hat\phi_i}{\sqrt{\hat\lambda_i}} \otimes \frac{I\hat\phi_i}{\sqrt{\hat\lambda_i}} \right\|_{op}$

Approximate empirical KPCA:

$S_{m,n,l} = \left\| \sum_{i=1}^l \frac{I\phi_i}{\sqrt{\lambda_i}} \otimes \frac{I\phi_i}{\sqrt{\lambda_i}} - \sum_{i=1}^l \frac{U\hat\phi_{i,m}}{\sqrt{\hat\lambda_{i,m}}} \otimes \frac{U\hat\phi_{i,m}}{\sqrt{\hat\lambda_{i,m}}} \right\|_{op}$

as $l, m, n \to \infty$.
Convergence of Projection Operators I

Unlike the reconstruction error, the convergence of the projection operators depends on the behavior of the eigen-gap, $\delta_i = \frac{1}{2}(\lambda_i - \lambda_{i+1})$, $i \in \mathbb{N}$.

Empirical KPCA: assuming $\delta_l \gtrsim n^{-1/2}$ and $\lambda_l \gtrsim n^{-1/2}$,

$S_{n,l} \lesssim \frac{1}{\delta_l \sqrt{n}} \vee \frac{n^{-1/4}}{\sqrt{\lambda_l}}$.

Approximate empirical KPCA: assuming $\delta_l \gtrsim m^{-1/2}$ and $\lambda_l \gtrsim m^{-1/2}$,

$S_{m,n,l} \lesssim \frac{1}{\delta_l \sqrt{m}} \vee \frac{n^{-1/4}}{\sqrt{\lambda_l}}$.
Convergence of Projection Operators I

Suppose $\lambda_i \asymp i^{-\alpha}$ with $\alpha > 1/2$, $\delta_i \asymp i^{-\beta}$ with $\beta \ge \alpha$, $l \asymp n^{\theta/\alpha}$ and $m = n^\gamma$, where $0 < \theta < 1/2$ and $0 < \gamma < 1$. Then

$S_{n,l} \asymp \begin{cases} n^{-\left(\frac{1}{4} - \frac{\theta}{2}\right)}, & 0 < \theta < \frac{\alpha}{2(2\beta - \alpha)} \\ n^{-\left(\frac{1}{2} - \frac{\theta\beta}{\alpha}\right)}, & \frac{\alpha}{2(2\beta - \alpha)} \le \theta < \frac{\alpha}{2\beta} \end{cases}$

$S_{m,n,l} \asymp \begin{cases} n^{-\left(\frac{1}{4} - \frac{\theta}{2}\right)}, & 0 < \theta < \frac{\alpha}{2(2\beta - \alpha)},\ \gamma \ge \theta\left(\frac{2\beta}{\alpha} - 1\right) + \frac{1}{2} \\ n^{-\left(\frac{\gamma}{2} - \frac{\theta\beta}{\alpha}\right)}, & 0 < \theta < \frac{\alpha}{2(2\beta - \alpha)},\ \frac{2\theta\beta}{\alpha} < \gamma < \theta\left(\frac{2\beta}{\alpha} - 1\right) + \frac{1}{2} \end{cases}$
Convergence of Projection Operators II

Empirical KPCA:

$T_{n,l} = \left\| \sum_{i=1}^l I\phi_i \otimes_{L^2(P)} I\phi_i - \sum_{i=1}^l I\hat\phi_i \otimes_{L^2(P)} I\hat\phi_i \right\|_{HS} = \left\| \Sigma^{1/2} \left( \sum_{i=1}^l \phi_i \otimes_H \phi_i - \sum_{i=1}^l \hat\phi_i \otimes_H \hat\phi_i \right) \Sigma^{1/2} \right\|_{HS} \lesssim \frac{l \lambda_l}{\delta_l \sqrt{n}}$

Approximate empirical KPCA:

$T_{m,n,l} = \left\| \sum_{i=1}^l I\phi_i \otimes_{L^2(P)} I\phi_i - \sum_{i=1}^l U\hat\phi_{i,m} \otimes_{L^2(P)} U\hat\phi_{i,m} \right\|_{HS} \lesssim \frac{l \lambda_l}{\delta_l} \left( \frac{1}{\sqrt{n}} + \frac{1}{\sqrt{m}} \right)$

Convergence rates can be derived under the assumptions $\lambda_i \asymp i^{-\alpha}$, $\alpha > 1/2$, $\delta_i \asymp i^{-\beta}$, $\beta \ge \alpha$, $l \asymp n^{\theta/\alpha}$ and $m = n^\gamma$, where $0 < \theta < 1/2$ and $0 < \gamma < 1$.
Summary

Random feature approximation to kernel PCA improves its computational complexity. Statistical trade-off:
- Reconstruction error
- Convergence of eigenspaces

Open questions:
- Lower bounds
- Extension to kernel canonical correlation analysis
- Nyström approximation
Thank You
References I

- Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68.
- Fine, S. and Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2.
- Rahimi, A. and Recht, B. (2008a). Random features for large-scale kernel machines. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors, Advances in Neural Information Processing Systems 20. Curran Associates, Inc.
- Rahimi, A. and Recht, B. (2008b). Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS.
- Rudi, A. and Rosasco, L. (2016). Generalization properties of learning with random features.
- Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10.
- Smola, A. J. and Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proc. 17th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA.
- Sriperumbudur, B. K. and Szabó, Z. (2015). Optimal rates for random Fourier features. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28. Curran Associates, Inc.
- Sutherland, D. and Schneider, J. (2015). On the error of random Fourier features. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence.
References II

- Williams, C. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems 13, Cambridge, MA. MIT Press.
- Yang, Y., Pilanci, M., and Wainwright, M. J. (2015). Randomized sketches for kernels: Fast and optimal non-parametric regression. Technical report.