Approximate Kernel Methods
1 Lecture 3: Approximate Kernel Methods. Bharath K. Sriperumbudur, Department of Statistics, Pennsylvania State University. Machine Learning Summer School, Tübingen, 2017.
2 Outline

- Motivating example: Ridge regression
- Approximation methods
- Kernel PCA
- Computational vs. statistical trade-off

(Joint work with Nicholas Sterge, Pennsylvania State University)
3 Motivating Example: Ridge regression

Given: $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.

Task: Find a linear regressor $f = \langle w, \cdot \rangle_2$ such that $f(x_i) \approx y_i$:
$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \left( \langle w, x_i \rangle_2 - y_i \right)^2 + \lambda \|w\|_2^2 \qquad (\lambda > 0).$$

Solution: For $X := (x_1, \ldots, x_n) \in \mathbb{R}^{d \times n}$ and $y := (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$,
$$w = \underbrace{\left( \tfrac{1}{n} XX^\top + \lambda I_d \right)^{-1} \tfrac{1}{n} X y}_{\text{primal}}.$$

Easy: $\left( \tfrac{1}{n} XX^\top + \lambda I_d \right)^{-1} X = X \left( \tfrac{1}{n} X^\top X + \lambda I_n \right)^{-1}$, so
$$w = \underbrace{\tfrac{1}{n} X \left( \tfrac{1}{n} X^\top X + \lambda I_n \right)^{-1} y}_{\text{dual}}.$$
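A quick numerical check of the primal/dual identity above. This is a sketch of my own (the toy data and variable names are illustrative, not from the lecture), assuming NumPy:

```python
# Sketch: verify numerically that the primal and dual ridge solutions agree.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 5, 0.1

X = rng.standard_normal((d, n))  # columns x_1, ..., x_n
y = rng.standard_normal(n)

# Primal: w = ((1/n) X X^T + lam I_d)^{-1} (1/n) X y  -- a d x d solve
w_primal = np.linalg.solve(X @ X.T / n + lam * np.eye(d), X @ y / n)

# Dual: w = (1/n) X ((1/n) X^T X + lam I_n)^{-1} y  -- an n x n solve
w_dual = X @ np.linalg.solve(X.T @ X / n + lam * np.eye(n), y) / n

print(np.allclose(w_primal, w_dual))  # True
```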
7 Ridge regression

Prediction: Given $t \in \mathbb{R}^d$,
$$f(t) = \langle w, t \rangle_2 = y^\top X^\top \left( XX^\top + n\lambda I_d \right)^{-1} t = y^\top \left( X^\top X + n\lambda I_n \right)^{-1} X^\top t.$$

What does $X^\top X$ look like?
$$X^\top X = \begin{pmatrix} \langle x_1, x_1 \rangle_2 & \langle x_1, x_2 \rangle_2 & \cdots & \langle x_1, x_n \rangle_2 \\ \langle x_2, x_1 \rangle_2 & \langle x_2, x_2 \rangle_2 & \cdots & \langle x_2, x_n \rangle_2 \\ \vdots & & \ddots & \vdots \\ \langle x_n, x_1 \rangle_2 & \langle x_n, x_2 \rangle_2 & \cdots & \langle x_n, x_n \rangle_2 \end{pmatrix}$$
The $(i,j)$-th entry is $\langle x_i, x_j \rangle_2$: the matrix of inner products, i.e., the Gram matrix.
9 Kernel Ridge regression: Feature Map and Kernel Trick

Given: $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathcal{X}$, $y_i \in \mathbb{R}$.

Task: Find a regressor $f \in H$ (some feature space) such that $f(x_i) \approx y_i$.

Idea: Map $x_i$ to $\Phi(x_i)$ and do linear regression:
$$\min_{f \in H} \frac{1}{n} \sum_{i=1}^n \left( \langle f, \Phi(x_i) \rangle_H - y_i \right)^2 + \lambda \|f\|_H^2 \qquad (\lambda > 0).$$

Solution: For $\Phi(X) := (\Phi(x_1), \ldots, \Phi(x_n)) \in \mathbb{R}^{\dim(H) \times n}$ and $y := (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$,
$$f = \underbrace{\left( \tfrac{1}{n} \Phi(X)\Phi(X)^\top + \lambda I_{\dim(H)} \right)^{-1} \tfrac{1}{n} \Phi(X) y}_{\text{primal}} = \underbrace{\tfrac{1}{n} \Phi(X) \left( \tfrac{1}{n} \Phi(X)^\top \Phi(X) + \lambda I_n \right)^{-1} y}_{\text{dual}}.$$
12 Kernel Ridge regression: Feature Map and Kernel Trick

Prediction: Given $t \in \mathcal{X}$,
$$f(t) = \langle f, \Phi(t) \rangle_H = \tfrac{1}{n} y^\top \Phi(X)^\top \left( \tfrac{1}{n} \Phi(X)\Phi(X)^\top + \lambda I_{\dim(H)} \right)^{-1} \Phi(t) = \tfrac{1}{n} y^\top \left( \tfrac{1}{n} \Phi(X)^\top \Phi(X) + \lambda I_n \right)^{-1} \Phi(X)^\top \Phi(t).$$

As before,
$$\Phi(X)^\top \Phi(X) = \begin{pmatrix} \langle \Phi(x_1), \Phi(x_1) \rangle_H & \cdots & \langle \Phi(x_1), \Phi(x_n) \rangle_H \\ \vdots & \ddots & \vdots \\ \langle \Phi(x_n), \Phi(x_1) \rangle_H & \cdots & \langle \Phi(x_n), \Phi(x_n) \rangle_H \end{pmatrix}$$
is the Gram matrix with entries $k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle_H$, and
$$\Phi(X)^\top \Phi(t) = \left( \langle \Phi(x_1), \Phi(t) \rangle_H, \ldots, \langle \Phi(x_n), \Phi(t) \rangle_H \right)^\top.$$
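To make the kernel trick concrete, here is a minimal kernel ridge regression sketch under an assumed Gaussian kernel $k(x,y) = \exp(-\|x-y\|_2^2/2)$ (the helper names and toy data are mine, not the lecture's); the dual formula above is used so only kernel evaluations appear:

```python
# Sketch: kernel ridge regression via the dual formula, using only
# kernel evaluations (Gram matrix K and the vector (k(x_i, t))_i).
import numpy as np

rng = np.random.default_rng(1)
n, lam = 200, 1e-2

X = rng.uniform(-3, 3, (n, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(n)

def gram(A, B):
    """Gaussian kernel matrix: k(a, b) = exp(-||a - b||^2 / 2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / 2)

K = gram(X, X)                                          # n x n Gram matrix
beta = np.linalg.solve(K / n + lam * np.eye(n), y / n)  # dual coefficients

t = np.array([[0.5]])
f_t = beta @ gram(X, t).ravel()   # f(t) = sum_i beta_i k(x_i, t)
print(f_t, np.sin(0.5))           # approximately equal
```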
14 Remarks

- The primal formulation requires knowledge of the feature map $\Phi$ (and of course $H$), and these could be infinite dimensional.
- The dual formulation is entirely determined by kernel evaluations: the Gram matrix and $(k(x_i, t))_i$. But it scales poorly: $O(n^3)$.
15 Approximation Schemes

- Incomplete Cholesky factorization (e.g., Fine and Scheinberg, JMLR 2001)
- Sketching (Yang et al., 2015)
- Sparse greedy approximation (Smola and Schölkopf, ICML 2000)
- Nyström method (e.g., Williams and Seeger, NIPS 2001)
- Random Fourier features (e.g., Rahimi and Recht, NIPS 2008), ...
16 Random Fourier Approximation

$\mathcal{X} = \mathbb{R}^d$; let $k$ be continuous and translation invariant, i.e., $k(x, y) = \psi(x - y)$.

Bochner's theorem:
$$k(x, y) = \int_{\mathbb{R}^d} e^{i \langle \omega, x - y \rangle_2} \, d\Lambda(\omega),$$
where $\Lambda$ is a finite non-negative Borel measure on $\mathbb{R}^d$.

$k$ is symmetric and therefore $\Lambda$ is a symmetric measure on $\mathbb{R}^d$. Therefore
$$k(x, y) = \int_{\mathbb{R}^d} \cos(\langle \omega, x - y \rangle_2) \, d\Lambda(\omega).$$
18 Random Feature Approximation

(Rahimi and Recht, NIPS 2008): Draw $(\omega_j)_{j=1}^m \overset{\text{i.i.d.}}{\sim} \Lambda$. Then
$$k_m(x, y) = \frac{1}{m} \sum_{j=1}^m \cos(\langle \omega_j, x - y \rangle_2) = \langle \Phi_m(x), \Phi_m(y) \rangle_{\mathbb{R}^{2m}},$$
where
$$\Phi_m(x) = \frac{1}{\sqrt{m}} \left( \cos(\langle \omega_1, x \rangle_2), \ldots, \cos(\langle \omega_m, x \rangle_2), \sin(\langle \omega_1, x \rangle_2), \ldots, \sin(\langle \omega_m, x \rangle_2) \right).$$

Use the feature map idea with $\Phi_m$, i.e., $x \mapsto \Phi_m(x)$, and apply your favorite linear method.
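A small sketch (mine, not the speaker's code) instantiating $\Phi_m$ for the Gaussian kernel $k(x,y) = \exp(-\|x-y\|_2^2/2)$, whose spectral measure $\Lambda$ is $N(0, I_d)$; for large $m$, $\langle \Phi_m(x), \Phi_m(y) \rangle$ should be close to $k(x,y)$:

```python
# Sketch: random Fourier feature map for the Gaussian kernel.
import numpy as np

rng = np.random.default_rng(2)
d, m = 5, 2000

def phi(X, W):
    """Phi_m: rows of X in R^d -> R^{2m}, given frequencies W (m x d)."""
    P = X @ W.T  # entries <omega_j, x>
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(W.shape[0])

W = rng.standard_normal((m, d))  # omega_j ~ Lambda = N(0, I_d)
x, y = rng.standard_normal(d), rng.standard_normal(d)

k_exact = np.exp(-np.sum((x - y) ** 2) / 2)       # Gaussian kernel k(x, y)
k_m = float(phi(x[None], W) @ phi(y[None], W).T)  # <Phi_m(x), Phi_m(y)>
print(k_exact, k_m)                               # close for large m
```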
19 How good is the approximation?

(S and Szabó, NIPS 2015): for a compact set $S$,
$$\sup_{x, y \in S} |k_m(x, y) - k(x, y)| = O_{\text{a.s.}}\!\left( \sqrt{\frac{\log |S|}{m}} \right).$$

- This is the optimal convergence rate.
- Other results are known, but they are non-optimal (Rahimi and Recht, NIPS 2008; Sutherland and Schneider, UAI 2015).
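An empirical peek at this rate (my own sketch, for the Gaussian kernel on a grid $S \subset \mathbb{R}$): the sup error over $S$ should shrink roughly like $1/\sqrt{m}$.

```python
# Sketch: sup-error of the random Fourier approximation versus m.
import numpy as np

rng = np.random.default_rng(3)
S = np.linspace(-2, 2, 50)        # grid on a compact set
diff = S[:, None] - S[None, :]    # all pairwise x - y
k_exact = np.exp(-diff ** 2 / 2)  # Gaussian kernel

for m in [10, 100, 1000]:
    w = rng.standard_normal((m, 1, 1))     # omega_j ~ N(0, 1)
    k_m = np.cos(w * diff).mean(axis=0)    # (1/m) sum_j cos(omega_j (x - y))
    print(m, np.abs(k_m - k_exact).max())  # decays roughly like 1/sqrt(m)
```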
20 Ridge Regression: Random Feature Approximation

Given: $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathcal{X}$, $y_i \in \mathbb{R}$.

Task: Find a regressor $w \in \mathbb{R}^{2m}$ such that $\langle w, \Phi_m(x_i) \rangle_{\mathbb{R}^{2m}} \approx y_i$.

Idea: Map $x_i$ to $\Phi_m(x_i)$ and do linear regression:
$$\min_{w \in \mathbb{R}^{2m}} \frac{1}{n} \sum_{i=1}^n \left( \langle w, \Phi_m(x_i) \rangle_{\mathbb{R}^{2m}} - y_i \right)^2 + \lambda \|w\|_{\mathbb{R}^{2m}}^2 \qquad (\lambda > 0).$$

Solution: For $\Phi_m(X) := (\Phi_m(x_1), \ldots, \Phi_m(x_n)) \in \mathbb{R}^{2m \times n}$ and $y := (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$,
$$w = \underbrace{\left( \tfrac{1}{n} \Phi_m(X)\Phi_m(X)^\top + \lambda I_{2m} \right)^{-1} \tfrac{1}{n} \Phi_m(X) y}_{\text{primal}} = \underbrace{\tfrac{1}{n} \Phi_m(X) \left( \tfrac{1}{n} \Phi_m(X)^\top \Phi_m(X) + \lambda I_n \right)^{-1} y}_{\text{dual}}.$$

Computation: $O(m^2 n)$.
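A sketch of the primal solve with random features (assumptions as before: Gaussian kernel, my own names and toy data); forming and solving the $2m \times 2m$ system costs $O(m^2 n)$ rather than $O(n^3)$:

```python
# Sketch: ridge regression in the 2m-dimensional random feature space.
import numpy as np

rng = np.random.default_rng(4)
n, d, m, lam = 5000, 3, 100, 1e-3

X = rng.uniform(-2, 2, (n, d))
y = np.sin(X.sum(axis=1)) + 0.1 * rng.standard_normal(n)

W = rng.standard_normal((m, d))  # frequencies for the Gaussian kernel

def phi(X, W):
    P = X @ W.T
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(W.shape[0])

Z = phi(X, W)                          # n x 2m feature matrix Phi_m(X)^T
A = Z.T @ Z / n + lam * np.eye(2 * m)  # (1/n) Phi_m(X) Phi_m(X)^T + lam I_{2m}
w = np.linalg.solve(A, Z.T @ y / n)    # primal solution, O(m^2 n) overall

t = np.zeros((1, d))
print(float(phi(t, W) @ w), np.sin(0.0))  # prediction vs. noiseless target
```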
23 What happens statistically?

(Rudi and Rosasco, 2016): If $m \asymp n^\alpha$ where $\tfrac{1}{2} \le \alpha < 1$, with $\alpha$ depending on the properties of the unknown true regressor $f^\star$, then the excess risk
$$R_{L,P}(f_{m,n}) - R_{L,P}(f^\star)$$
achieves the minimax optimal rate obtained in the case with no approximation. Here $L$ is the squared loss.

Computational gain with no statistical loss!!
24 Principal Component Analysis (PCA)

Suppose $X \sim P$ is a random variable in $\mathbb{R}^d$ with mean $\mu$ and covariance matrix $\Sigma$.

- Find a direction $w \in \{v : \|v\|_2 = 1\}$ such that $\mathrm{Var}[\langle w, X \rangle_2]$ is maximized.
- Find a direction $w \in \mathbb{R}^d$ such that
$$\mathbb{E}\left\| (X - \mu) - \frac{\langle w, X - \mu \rangle_2}{\|w\|_2^2}\, w \right\|_2^2$$
is minimized.

The formulations are equivalent, and the solution is the eigenvector of the covariance matrix $\Sigma$ corresponding to the largest eigenvalue. This can be generalized to multiple directions (find a subspace...). Applications: dimensionality reduction.
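A minimal PCA sketch (mine, not from the slides): the top eigenvector of the sample covariance maximizes the empirical variance of $\langle w, X \rangle_2$.

```python
# Sketch: PCA via eigendecomposition of the sample covariance matrix.
import numpy as np

rng = np.random.default_rng(5)
n, d = 2000, 4
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # correlated data

Xc = X - X.mean(axis=0)               # center
Sigma = Xc.T @ Xc / n                 # sample covariance matrix
evals, evecs = np.linalg.eigh(Sigma)  # ascending eigenvalues
w = evecs[:, -1]                      # top principal direction, ||w||_2 = 1

print(np.var(Xc @ w), evals[-1])      # Var[<w, X>] equals lambda_max
```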
27 Kernel PCA

Nonlinear generalization of PCA (Schölkopf et al., 1998): $X \mapsto \Phi(X)$ and apply PCA. Provides a low-dimensional manifold (curves) in $\mathbb{R}^d$ to efficiently represent the data.

Function space view: Find $f \in \{g \in H : \|g\|_H = 1\}$ such that $\mathrm{Var}[f(X)]$ is maximized, i.e.,
$$f^\star = \arg\sup_{\|f\|_H = 1} \mathbb{E}[f^2(X)] - \mathbb{E}^2[f(X)].$$

Using the reproducing property $f(X) = \langle f, k(\cdot, X) \rangle_H$, we obtain
$$\mathbb{E}[f^2(X)] = \mathbb{E}\langle f, k(\cdot, X) \rangle_H^2 = \mathbb{E}\langle f, (k(\cdot, X) \otimes_H k(\cdot, X)) f \rangle_H = \langle f, \mathbb{E}[k(\cdot, X) \otimes_H k(\cdot, X)] f \rangle_H,$$
assuming $\int_{\mathcal{X}} k(x, x) \, dP(x) < \infty$, and
$$\mathbb{E}[f(X)] = \langle f, \mu_P \rangle_H, \quad \text{where } \mu_P := \int_{\mathcal{X}} k(\cdot, x) \, dP(x).$$
31 Kernel PCA

Therefore
$$f^\star = \arg\sup_{\|f\|_H = 1} \langle f, \Sigma f \rangle_H,$$
where
$$\Sigma := \int_{\mathcal{X}} k(\cdot, x) \otimes_H k(\cdot, x) \, dP(x) - \mu_P \otimes_H \mu_P$$
is the covariance operator (symmetric, positive and Hilbert-Schmidt) on $H$.

Spectral theorem: $\Sigma = \sum_{i \in I} \lambda_i \, \phi_i \otimes_H \phi_i$, where $I$ is either countable ($\lambda_i \to 0$ as $i \to \infty$) or finite.

As in PCA, the solution is the eigenfunction corresponding to the largest eigenvalue of $\Sigma$.
33 Empirical Kernel PCA

In practice $P$ is unknown, but we have access to $(X_i)_{i=1}^n \overset{\text{i.i.d.}}{\sim} P$:
$$\hat{f} = \arg\sup_{\|f\|_H = 1} \langle f, \hat\Sigma f \rangle_H,$$
where
$$\hat\Sigma := \frac{1}{n} \sum_{i=1}^n k(\cdot, X_i) \otimes_H k(\cdot, X_i) - \mu_n \otimes_H \mu_n$$
is the empirical covariance operator (symmetric, positive and Hilbert-Schmidt) on $H$, and $\mu_n := \frac{1}{n} \sum_{i=1}^n k(\cdot, X_i)$.

$\hat\Sigma$ is an infinite-dimensional operator but has finite rank (what is its rank?).

Spectral theorem: $\hat\Sigma = \sum_{i=1}^n \hat\lambda_i \, \hat\phi_i \otimes_H \hat\phi_i$.

$\hat{f}$ is the eigenfunction corresponding to the largest eigenvalue of $\hat\Sigma$.
35 Empirical Kernel PCA

Since $\hat\Sigma$ is an infinite-dimensional operator, we are solving an infinite-dimensional eigensystem, $\hat\Sigma \hat\phi_i = \hat\lambda_i \hat\phi_i$. How do we find $\hat\phi_i$ in practice?

Approach 1 (Schölkopf et al., 1998):
$$\hat{f} = \arg\sup_{\|f\|_H = 1} \frac{1}{n} \sum_{i=1}^n f^2(X_i) - \left( \frac{1}{n} \sum_{i=1}^n f(X_i) \right)^2 = \arg\inf_{f \in H} -\frac{1}{n} \sum_{i=1}^n f^2(X_i) + \left( \frac{1}{n} \sum_{i=1}^n f(X_i) \right)^2 + \lambda \|f\|_H^2$$
for some $\lambda > 0$.

Use the representer theorem, which says there exist $(\alpha_i)_{i=1}^n \subset \mathbb{R}$ such that
$$\hat{f} = \sum_{i=1}^n \alpha_i k(\cdot, X_i).$$

Solving for $(\alpha_i)_{i=1}^n$ yields an $n \times n$ eigensystem involving the Gram matrix. (Exercise)
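A numerical sketch of where Approach 1 lands (assuming a Gaussian kernel; the normalizations follow the standard kernel-PCA recipe and the variable names are mine): the infinite-dimensional eigenproblem becomes an $n \times n$ eigensystem on the centered Gram matrix.

```python
# Sketch: empirical kernel PCA via the n x n centered Gram matrix.
import numpy as np

rng = np.random.default_rng(6)
n = 300
X = rng.standard_normal((n, 2))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # data on a circle (nonlinear)

sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-sq / 2)                     # Gram matrix, Gaussian kernel

H = np.eye(n) - np.ones((n, n)) / n     # centering matrix H_n
Kc = H @ K @ H                          # centered Gram matrix
evals, alphas = np.linalg.eigh(Kc / n)  # eigenpairs of the n x n system

lam, alpha = evals[-1], alphas[:, -1]   # top eigenpair
scores = Kc @ alpha / np.sqrt(n * lam)  # f_hat evaluated at the training points
print(lam, scores[:5])
```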
37 Empirical Kernel PCA

Approach 2 (Smale and Zhou, 2007):

- Sampling operator: $S: H \to \mathbb{R}^n$, $f \mapsto \frac{1}{\sqrt{n}} (f(X_1), \ldots, f(X_n))$.
- Reconstruction operator: $S^*: \mathbb{R}^n \to H$, $\alpha \mapsto \frac{1}{\sqrt{n}} \sum_{i=1}^n \alpha_i k(\cdot, X_i)$.

$S^*$ is the adjoint (transpose) of $S$: $\langle f, S^*\alpha \rangle_H = \langle Sf, \alpha \rangle_{\mathbb{R}^n}$ for all $f \in H$, $\alpha \in \mathbb{R}^n$.

It can be shown that $\hat\Sigma = S^* H_n S$, where $H_n := I_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top$, and that $K = SS^*$, where $K := \frac{1}{n} (k(X_i, X_j))_{i,j=1}^n$. Then
$$\hat\Sigma \hat\phi_i = \hat\lambda_i \hat\phi_i \iff S^* H_n S \hat\phi_i = \hat\lambda_i \hat\phi_i \implies SS^* H_n S \hat\phi_i = \hat\lambda_i S \hat\phi_i,$$
so with $\hat\alpha_i := S \hat\phi_i$, the $\hat\alpha_i$ are eigenvectors of $K H_n$ with corresponding eigenvalues $\hat\lambda_i$, and
$$\hat\phi_i = \frac{1}{\hat\lambda_i} S^* H_n \hat\alpha_i,$$
since $S^* H_n \hat\alpha_i = S^* H_n S \hat\phi_i = \hat\Sigma \hat\phi_i = \hat\lambda_i \hat\phi_i$.
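A self-contained check of the two operator identities (my own sketch): with the linear kernel $k(x, y) = \langle x, y \rangle$, $H = \mathbb{R}^D$ and $k(\cdot, X_i) = X_i$, so $S$ becomes an $n \times D$ matrix and both $\hat\Sigma = S^* H_n S$ and $K = SS^*$ can be verified directly.

```python
# Sketch: verify Sigma_hat = S* H_n S and K = S S* for a linear kernel.
import numpy as np

rng = np.random.default_rng(7)
n, D = 50, 3
X = rng.standard_normal((n, D))  # k(., X_i) = X_i since k is linear

S = X / np.sqrt(n)                     # sampling operator as an n x D matrix
H_n = np.eye(n) - np.ones((n, n)) / n  # centering matrix

Sigma_hat = S.T @ H_n @ S              # S* H_n S
Xc = X - X.mean(axis=0)
print(np.allclose(Sigma_hat, Xc.T @ Xc / n))  # equals the empirical covariance

K = X @ X.T / n                        # K = (1/n)(k(X_i, X_j))_{ij}
print(np.allclose(K, S @ S.T))         # K = S S*
```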
40 Empirical Kernel PCA: Random Features

Empirical kernel PCA solves an $n \times n$ linear system: complexity is $O(n^3)$.

Use the idea of random features: $X \mapsto \Phi_m(X)$ and apply PCA:
$$\hat{w} := \arg\sup_{\|w\|_2 = 1} \widehat{\mathrm{Var}}[\langle w, \Phi_m(X) \rangle_2] = \arg\sup_{\|w\|_2 = 1} \langle w, \hat\Sigma_m w \rangle_2,$$
where
$$\hat\Sigma_m := \frac{1}{n} \sum_{i=1}^n \Phi_m(X_i)\Phi_m(X_i)^\top - \left( \frac{1}{n} \sum_{i=1}^n \Phi_m(X_i) \right) \left( \frac{1}{n} \sum_{i=1}^n \Phi_m(X_i) \right)^\top$$
is a covariance matrix of size $2m \times 2m$.

Eigendecomposition: $\hat\Sigma_m = \sum_{i=1}^{2m} \hat\lambda_{i,m} \, \hat\phi_{i,m} \hat\phi_{i,m}^\top$.

$\hat{w}$ is obtained by solving a $2m \times 2m$ eigensystem: complexity is $O(m^3)$.

What happens statistically?
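A sketch of approximate kernel PCA with random features (same assumptions as earlier: Gaussian kernel, illustrative names): the $n \times n$ Gram eigensystem is replaced by a $2m \times 2m$ covariance eigensystem.

```python
# Sketch: approximate kernel PCA by eigendecomposing the 2m x 2m
# covariance of the random Fourier features.
import numpy as np

rng = np.random.default_rng(8)
n, d, m = 1000, 2, 50
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # nonlinear structure

W = rng.standard_normal((m, d))                # Gaussian-kernel frequencies

def phi(X, W):
    P = X @ W.T
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(W.shape[0])

Z = phi(X, W)                           # n x 2m
Zc = Z - Z.mean(axis=0)
Sigma_m = Zc.T @ Zc / n                 # 2m x 2m covariance matrix
evals, evecs = np.linalg.eigh(Sigma_m)  # O(m^3) eigensystem

w_hat = evecs[:, -1]                    # top approximate direction
print(evals[-1], (Zc @ w_hat)[:5])      # top eigenvalue and projections
```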
42 Empirical Kernel PCA: Random Features

Two notions:
- Reconstruction error (what does this depend on?)
- Convergence of eigenvectors (more generally, eigenspaces)
43 Eigenspace Convergence

What do we have?
- Eigenvectors after approximation, $(\hat\phi_{i,m})_{i=1}^{2m}$: these span a subspace of $\mathbb{R}^{2m}$.
- Population eigenfunctions $(\phi_i)_{i \in I}$ of $\Sigma$: these span a subspace of $H$.

How do we compare them? We embed both in a common space before comparing. The common space is $L^2(P)$.
44 Eigenspace Convergence

$$I: H \to L^2(P), \quad f \mapsto f - \int_{\mathcal{X}} f(x) \, dP(x)$$
$$U: \mathbb{R}^{2m} \to L^2(P), \quad \alpha \mapsto \sum_{i=1}^{2m} \alpha_i \left( \Phi_{m,i} - \int_{\mathcal{X}} \Phi_{m,i}(x) \, dP(x) \right),$$
where $\Phi_m = \frac{1}{\sqrt{m}} (\cos(\langle \omega_1, \cdot \rangle_2), \ldots, \cos(\langle \omega_m, \cdot \rangle_2), \sin(\langle \omega_1, \cdot \rangle_2), \ldots, \sin(\langle \omega_m, \cdot \rangle_2))$.

Main results (S and Sterge, 2017):
$$\sum_{i=1}^{\ell} \left\| I\hat\phi_i - I\phi_i \right\|_{L^2(P)} = O_p\!\left( \frac{1}{\sqrt{n}} \right).$$
A similar result holds in $H$ (Zwald et al., NIPS 2005).
$$\sum_{i=1}^{\ell} \left\| U\hat\phi_{i,m} - I\phi_i \right\|_{L^2(P)} = O_p\!\left( \frac{1}{\sqrt{m}} + \frac{1}{\sqrt{n}} \right).$$

Random features are not helpful from the point of view of eigenvector convergence (this also holds for eigenspaces). They may be useful for convergence of the reconstruction error (work in progress).
47 References I

- Fine, S. and Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2.
- Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20. Curran Associates, Inc.
- Rudi, A. and Rosasco, L. (2016). Generalization properties of learning with random features.
- Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10.
- Smale, S. and Zhou, D.-X. (2007). Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26.
- Smola, A. J. and Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proc. 17th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA.
- Sriperumbudur, B. K. and Sterge, N. (2017). Statistical consistency of kernel PCA with random features. arXiv preprint.
- Sriperumbudur, B. K. and Szabó, Z. (2015). Optimal rates for random Fourier features. In Advances in Neural Information Processing Systems 28. Curran Associates, Inc.
- Sutherland, D. and Schneider, J. (2015). On the error of random Fourier features. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence.
48 References II

- Williams, C. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13. MIT Press, Cambridge, MA.
- Yang, Y., Pilanci, M., and Wainwright, M. J. (2015). Randomized sketches for kernels: Fast and optimal non-parametric regression. Technical report.
- Zwald, L. and Blanchard, G. (2006). On the convergence of eigenspaces in kernel principal component analysis. In Advances in Neural Information Processing Systems 18. MIT Press.