Approximate Kernel Methods

1 Lecture 3: Approximate Kernel Methods

Bharath K. Sriperumbudur
Department of Statistics, Pennsylvania State University

Machine Learning Summer School, Tübingen, 2017

2 Outline

- Motivating example: Ridge regression
- Approximation methods
- Kernel PCA
- Computational vs. statistical trade-off

(Joint work with Nicholas Sterge, Pennsylvania State University)

3 Motivating Example: Ridge regression

Given: $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.

Task: Find a linear regressor $f = \langle w, \cdot \rangle_2$ such that $f(x_i) \approx y_i$:
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \left(\langle w, x_i \rangle_2 - y_i\right)^2 + \lambda \|w\|_2^2 \qquad (\lambda > 0).$$

Solution: For $X := (x_1, \dots, x_n) \in \mathbb{R}^{d \times n}$ and $y := (y_1, \dots, y_n) \in \mathbb{R}^n$,
$$w = \underbrace{\left(\tfrac{1}{n} XX^\top + \lambda I_d\right)^{-1} \tfrac{1}{n} X y}_{\text{primal}}.$$

Easy: $\left(\tfrac{1}{n} XX^\top + \lambda I_d\right)^{-1} X = X \left(\tfrac{1}{n} X^\top X + \lambda I_n\right)^{-1}$, so
$$w = \underbrace{\tfrac{1}{n} X \left(\tfrac{1}{n} X^\top X + \lambda I_n\right)^{-1} y}_{\text{dual}}.$$
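
The primal/dual identity above is easy to check numerically. Below is a minimal numpy sketch (not from the slides; the toy data, the value of $\lambda$, and the variable names are my own) that computes both forms of $w$ and the dual prediction formula used on the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1

X = rng.standard_normal((d, n))   # columns are the samples x_i, as on the slide
y = rng.standard_normal(n)

# Primal: w = (XX^T/n + lam*I_d)^{-1} (1/n) X y  -- a d x d solve
w_primal = np.linalg.solve(X @ X.T / n + lam * np.eye(d), X @ y / n)

# Dual: w = (1/n) X (X^T X/n + lam*I_n)^{-1} y  -- an n x n solve
w_dual = X @ np.linalg.solve(X.T @ X / n + lam * np.eye(n), y) / n

assert np.allclose(w_primal, w_dual)

# Prediction at a new point t needs only inner products in the dual form
t = rng.standard_normal(d)
f_t = y @ np.linalg.solve(X.T @ X + n * lam * np.eye(n), X.T @ t)
assert np.isclose(f_t, w_primal @ t)
```

The dual solve touches the data only through the inner products $\langle x_i, x_j \rangle_2$ and $\langle x_i, t \rangle_2$, which is exactly what the kernel substitution below exploits.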

7 Ridge regression

Prediction: Given $t \in \mathbb{R}^d$,
$$f(t) = \langle w, t \rangle_2 = y^\top X^\top \left(XX^\top + n\lambda I_d\right)^{-1} t = y^\top \left(X^\top X + n\lambda I_n\right)^{-1} X^\top t.$$

What does $X^\top X$ look like?
$$X^\top X = \begin{pmatrix}
\langle x_1, x_1 \rangle_2 & \langle x_1, x_2 \rangle_2 & \cdots & \langle x_1, x_n \rangle_2 \\
\langle x_2, x_1 \rangle_2 & \langle x_2, x_2 \rangle_2 & \cdots & \langle x_2, x_n \rangle_2 \\
\vdots & & \ddots & \vdots \\
\langle x_n, x_1 \rangle_2 & \langle x_n, x_2 \rangle_2 & \cdots & \langle x_n, x_n \rangle_2
\end{pmatrix}$$
It is the matrix of inner products $\langle x_i, x_j \rangle_2$: the Gram matrix.

9 Kernel Ridge regression: Feature Map and Kernel Trick

Given: $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathcal{X}$, $y_i \in \mathbb{R}$.

Task: Find a regressor $f \in \mathcal{H}$ (some feature space) such that $f(x_i) \approx y_i$.

Idea: Map $x_i \mapsto \Phi(x_i)$ and do linear regression:
$$\min_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \left(\langle f, \Phi(x_i) \rangle_{\mathcal{H}} - y_i\right)^2 + \lambda \|f\|_{\mathcal{H}}^2 \qquad (\lambda > 0).$$

Solution: For $\Phi(X) := (\Phi(x_1), \dots, \Phi(x_n)) \in \mathbb{R}^{\dim(\mathcal{H}) \times n}$ and $y := (y_1, \dots, y_n) \in \mathbb{R}^n$,
$$f = \underbrace{\left(\tfrac{1}{n}\Phi(X)\Phi(X)^\top + \lambda I_{\dim(\mathcal{H})}\right)^{-1} \tfrac{1}{n}\Phi(X) y}_{\text{primal}} = \underbrace{\tfrac{1}{n}\Phi(X)\left(\tfrac{1}{n}\Phi(X)^\top \Phi(X) + \lambda I_n\right)^{-1} y}_{\text{dual}}.$$

12 Kernel Ridge regression: Feature Map and Kernel Trick

Prediction: Given $t \in \mathcal{X}$,
$$f(t) = \langle f, \Phi(t) \rangle_{\mathcal{H}} = \tfrac{1}{n} y^\top \Phi(X)^\top \left(\tfrac{1}{n}\Phi(X)\Phi(X)^\top + \lambda I_{\dim(\mathcal{H})}\right)^{-1} \Phi(t) = \tfrac{1}{n} y^\top \left(\tfrac{1}{n}\Phi(X)^\top \Phi(X) + \lambda I_n\right)^{-1} \Phi(X)^\top \Phi(t).$$

As before,
$$\Phi(X)^\top \Phi(X) = \begin{pmatrix}
\langle \Phi(x_1), \Phi(x_1) \rangle_{\mathcal{H}} & \cdots & \langle \Phi(x_1), \Phi(x_n) \rangle_{\mathcal{H}} \\
\langle \Phi(x_2), \Phi(x_1) \rangle_{\mathcal{H}} & \cdots & \langle \Phi(x_2), \Phi(x_n) \rangle_{\mathcal{H}} \\
\vdots & \ddots & \vdots \\
\langle \Phi(x_n), \Phi(x_1) \rangle_{\mathcal{H}} & \cdots & \langle \Phi(x_n), \Phi(x_n) \rangle_{\mathcal{H}}
\end{pmatrix}$$
with entries $k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle_{\mathcal{H}}$, and
$$\Phi(X)^\top \Phi(t) = \left[\langle \Phi(x_1), \Phi(t) \rangle_{\mathcal{H}}, \dots, \langle \Phi(x_n), \Phi(t) \rangle_{\mathcal{H}}\right]^\top.$$
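
To make the dual formulation concrete, here is a short numpy sketch of kernel ridge regression. The Gaussian kernel, the toy data, and all names are my own choices, not part of the slides; the point is that only the Gram matrix and the vector $(k(x_i, t))_i$ are ever needed:

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    # Gram matrix with entries k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
n, d, lam = 100, 3, 1e-2
X = rng.standard_normal((n, d))                    # rows are samples here
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

K = gauss_kernel(X, X)                             # n x n Gram matrix
alpha = np.linalg.solve(K / n + lam * np.eye(n), y / n)   # dual coefficients, O(n^3)

T = rng.standard_normal((5, d))                    # test points
f_T = gauss_kernel(T, X) @ alpha                   # f(t) = sum_i alpha_i k(x_i, t)
```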

14 Remarks

- The primal formulation requires knowledge of the feature map $\Phi$ (and of course of $\mathcal{H}$), and these could be infinite dimensional.
- The dual formulation is entirely determined by kernel evaluations: the Gram matrix and $(k(x_i, t))_i$. But it scales poorly: $O(n^3)$.

15 Approximation Schemes

- Incomplete Cholesky factorization (e.g., Fine and Scheinberg, JMLR 2001)
- Sketching (Yang et al., 2015)
- Sparse greedy approximation (Smola and Schölkopf, 2000)
- Nyström method (e.g., Williams and Seeger, NIPS 2001)
- Random Fourier features (e.g., Rahimi and Recht, NIPS 2008), ...

16 Random Fourier Approximation

Let $\mathcal{X} = \mathbb{R}^d$ and let $k$ be continuous and translation invariant, i.e., $k(x, y) = \psi(x - y)$.

Bochner's theorem:
$$k(x, y) = \int_{\mathbb{R}^d} e^{\sqrt{-1}\,\langle \omega, x - y \rangle_2}\, d\Lambda(\omega),$$
where $\Lambda$ is a finite non-negative Borel measure on $\mathbb{R}^d$.

$k$ is symmetric and therefore $\Lambda$ is a symmetric measure on $\mathbb{R}^d$. Therefore
$$k(x, y) = \int_{\mathbb{R}^d} \cos\left(\langle \omega, x - y \rangle_2\right) d\Lambda(\omega).$$
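
For concreteness, a standard example that the slide leaves implicit: for the Gaussian kernel the spectral measure $\Lambda$ is itself Gaussian,
$$k(x, y) = \exp\!\left(-\frac{\|x - y\|_2^2}{2\sigma^2}\right) \quad\Longleftrightarrow\quad \Lambda = \mathcal{N}(0, \sigma^{-2} I_d),$$
and since $k(x, x) = \Lambda(\mathbb{R}^d) = 1$, $\Lambda$ is a probability measure that is easy to sample from.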

18 Random Feature Approximation

(Rahimi and Recht, NIPS 2008): Draw $(\omega_j)_{j=1}^m \overset{\text{i.i.d.}}{\sim} \Lambda$ and set
$$k_m(x, y) = \frac{1}{m}\sum_{j=1}^m \cos\left(\langle \omega_j, x - y \rangle_2\right) = \langle \Phi_m(x), \Phi_m(y) \rangle_{\mathbb{R}^{2m}},$$
where
$$\Phi_m(x) = \frac{1}{\sqrt{m}}\left(\cos(\langle \omega_1, x \rangle_2), \dots, \cos(\langle \omega_m, x \rangle_2), \sin(\langle \omega_1, x \rangle_2), \dots, \sin(\langle \omega_m, x \rangle_2)\right).$$

Use the feature map idea with $\Phi_m$, i.e., $x \mapsto \Phi_m(x)$, and apply your favorite linear method.
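
A minimal numpy sketch of this construction for the Gaussian kernel (for which $\Lambda = \mathcal{N}(0, \sigma^{-2} I_d)$, as noted above); the kernel choice, the value of $m$, and the names are mine:

```python
import numpy as np

def rff_map(X, omega):
    # Phi_m(x) = (1/sqrt(m)) [cos(<omega_j, x>)_j, sin(<omega_j, x>)_j]
    m = omega.shape[0]
    proj = X @ omega.T                      # (n, m) matrix of <omega_j, x_i>
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

rng = np.random.default_rng(0)
n, d, m, sigma = 200, 3, 500, 1.0
X = rng.standard_normal((n, d))

omega = rng.standard_normal((m, d)) / sigma # omega_j ~ N(0, sigma^{-2} I_d) = Lambda
Phi = rff_map(X, omega)                     # (n, 2m) random features

K_m = Phi @ Phi.T                           # approximate Gram matrix [k_m(x_i, x_j)]
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / (2 * sigma**2))            # exact Gaussian Gram matrix
print(np.abs(K_m - K).max())                # approximation error; shrinks as m grows
```

The printed maximal error over the sample shrinks as $m$ grows, consistent with the rate quoted on the next slide.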

19 How good is the approximation?

(S. and Szabó, NIPS 2015):
$$\sup_{x, y \in S}\left|k_m(x, y) - k(x, y)\right| = O_{a.s.}\!\left(\sqrt{\frac{\log |S|}{m}}\right).$$

- Optimal convergence rate.
- Other results are known, but they are non-optimal (Rahimi and Recht, NIPS 2008; Sutherland and Schneider, UAI 2015).

20 Ridge Regression: Random Feature Approximation

Given: $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathcal{X}$, $y_i \in \mathbb{R}$.

Task: Find a regressor $w \in \mathbb{R}^{2m}$, i.e., $f = \langle w, \Phi_m(\cdot) \rangle_{\mathbb{R}^{2m}}$, such that $f(x_i) \approx y_i$.

Idea: Map $x_i \mapsto \Phi_m(x_i)$ and do linear regression:
$$\min_{w \in \mathbb{R}^{2m}} \frac{1}{n}\sum_{i=1}^n \left(\langle w, \Phi_m(x_i) \rangle_{\mathbb{R}^{2m}} - y_i\right)^2 + \lambda \|w\|_{\mathbb{R}^{2m}}^2 \qquad (\lambda > 0).$$

Solution: For $\Phi_m(X) := (\Phi_m(x_1), \dots, \Phi_m(x_n)) \in \mathbb{R}^{2m \times n}$ and $y := (y_1, \dots, y_n) \in \mathbb{R}^n$,
$$w = \underbrace{\left(\tfrac{1}{n}\Phi_m(X)\Phi_m(X)^\top + \lambda I_{2m}\right)^{-1} \tfrac{1}{n}\Phi_m(X) y}_{\text{primal}} = \underbrace{\tfrac{1}{n}\Phi_m(X)\left(\tfrac{1}{n}\Phi_m(X)^\top \Phi_m(X) + \lambda I_n\right)^{-1} y}_{\text{dual}}.$$

Computation: $O(m^2 n)$.
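
A self-contained numpy sketch of this scheme (Gaussian kernel and toy data of my choosing): the whole fit is a $2m \times 2m$ solve built from the feature matrix, which is where the $O(m^2 n)$ cost comes from.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, sigma, lam = 500, 3, 100, 1.0, 1e-2

X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

omega = rng.standard_normal((m, d)) / sigma       # omega_j ~ N(0, sigma^{-2} I_d)
Z = X @ omega.T
Phi = np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(m)      # rows are Phi_m(x_i), shape (n, 2m)

# Primal solve in R^{2m}: (Phi_m(X) Phi_m(X)^T / n + lam I_{2m}) w = Phi_m(X) y / n
w = np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(2 * m), Phi.T @ y / n)   # O(m^2 n + m^3)

T = rng.standard_normal((5, d))                   # test points
Phi_T = np.hstack([np.cos(T @ omega.T), np.sin(T @ omega.T)]) / np.sqrt(m)
f_T = Phi_T @ w                                   # predictions <w, Phi_m(t)>
```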

23 What happens statistically?

(Rudi and Rosasco, 2016): If $m \gtrsim n^{\alpha}$, where $\tfrac{1}{2} \le \alpha < 1$ with $\alpha$ depending on the properties of the unknown true regressor $f_*$, then
$$\mathcal{R}_{L,P}(f_{m,n}) - \mathcal{R}_{L,P}(f_*)$$
achieves the minimax optimal rate obtained in the case with no approximation. Here $L$ is the squared loss.

Computational gain with no statistical loss!!

24 Principal Component Analysis (PCA)

Suppose $X \sim P$ is a random variable in $\mathbb{R}^d$ with mean $\mu$ and covariance matrix $\Sigma$.

- Find a direction $w \in \{v : \|v\|_2 = 1\}$ such that $\mathrm{Var}[\langle w, X \rangle_2]$ is maximized.
- Find a direction $w \in \mathbb{R}^d$ such that
$$\mathbb{E}\left\|(X - \mu) - \frac{\langle w, X - \mu \rangle_2}{\|w\|_2^2}\, w\right\|_2^2$$
is minimized.

The formulations are equivalent, and the solution is the eigenvector of the covariance matrix $\Sigma$ corresponding to the largest eigenvalue.

Can be generalized to multiple directions (find a subspace...).

Applications: dimensionality reduction.
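
A minimal numpy illustration of the first (variance-maximization) formulation, with toy data and names of my own:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 4
X = rng.standard_normal((n, d)) @ np.diag([3.0, 1.0, 0.5, 0.1])  # anisotropic toy data

Xc = X - X.mean(axis=0)                  # center the data
Sigma_hat = Xc.T @ Xc / n                # sample covariance matrix (d x d)

evals, evecs = np.linalg.eigh(Sigma_hat) # eigenvalues in ascending order
w = evecs[:, -1]                         # unit eigenvector of the largest eigenvalue

# w maximizes the (sample) variance of <w, X> over unit-norm directions:
print(np.var(Xc @ w), evals[-1])         # the two numbers agree
```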

27 Kernel PCA

Nonlinear generalization of PCA (Schölkopf et al., 1998): $X \mapsto \Phi(X)$ and apply PCA. Provides a low-dimensional manifold (curves) in $\mathbb{R}^d$ to efficiently represent the data.

Function space view: Find $f \in \{g \in \mathcal{H} : \|g\|_{\mathcal{H}} = 1\}$ such that $\mathrm{Var}[f(X)]$ is maximized, i.e.,
$$f_* = \arg\sup_{\|f\|_{\mathcal{H}} = 1} \mathbb{E}[f^2(X)] - \mathbb{E}^2[f(X)].$$

Using the reproducing property $f(X) = \langle f, k(\cdot, X) \rangle_{\mathcal{H}}$, we obtain
$$\mathbb{E}[f^2(X)] = \mathbb{E}\langle f, k(\cdot, X) \rangle_{\mathcal{H}}^2 = \mathbb{E}\langle f, (k(\cdot, X) \otimes_{\mathcal{H}} k(\cdot, X)) f \rangle_{\mathcal{H}} = \langle f, \mathbb{E}[k(\cdot, X) \otimes_{\mathcal{H}} k(\cdot, X)] f \rangle_{\mathcal{H}},$$
assuming $\int_{\mathcal{X}} k(x, x)\, dP(x) < \infty$, and
$$\mathbb{E}[f(X)] = \langle f, \mu_P \rangle_{\mathcal{H}}, \quad \text{where } \mu_P := \int_{\mathcal{X}} k(\cdot, x)\, dP(x).$$

31 Kernel PCA

Therefore
$$f_* = \arg\sup_{\|f\|_{\mathcal{H}} = 1} \langle f, \Sigma f \rangle_{\mathcal{H}},$$
where
$$\Sigma := \int_{\mathcal{X}} k(\cdot, x) \otimes_{\mathcal{H}} k(\cdot, x)\, dP(x) - \mu_P \otimes_{\mathcal{H}} \mu_P$$
is the covariance operator (symmetric, positive and Hilbert-Schmidt) on $\mathcal{H}$.

Spectral theorem: $\Sigma = \sum_{i \in I} \lambda_i\, \phi_i \otimes_{\mathcal{H}} \phi_i$, where $I$ is either countable ($\lambda_i \to 0$ as $i \to \infty$) or finite.

Similar to PCA, the solution is the eigenfunction corresponding to the largest eigenvalue of $\Sigma$.

33 Empirical Kernel PCA

In practice, $P$ is unknown, but we have access to $(X_i)_{i=1}^n \overset{\text{i.i.d.}}{\sim} P$:
$$\hat{f} = \arg\sup_{\|f\|_{\mathcal{H}} = 1} \langle f, \hat{\Sigma} f \rangle_{\mathcal{H}},$$
where
$$\hat{\Sigma} := \frac{1}{n}\sum_{i=1}^n k(\cdot, X_i) \otimes_{\mathcal{H}} k(\cdot, X_i) - \mu_n \otimes_{\mathcal{H}} \mu_n$$
is the empirical covariance operator (symmetric, positive and Hilbert-Schmidt) on $\mathcal{H}$ and $\mu_n := \frac{1}{n}\sum_{i=1}^n k(\cdot, X_i)$.

$\hat{\Sigma}$ is an infinite-dimensional operator but with finite rank (what is its rank?).

Spectral theorem: $\hat{\Sigma} = \sum_{i=1}^n \hat{\lambda}_i\, \hat{\phi}_i \otimes_{\mathcal{H}} \hat{\phi}_i$.

$\hat{f}$ is the eigenfunction corresponding to the largest eigenvalue of $\hat{\Sigma}$.

35 Empirical Kernel PCA

Since $\hat{\Sigma}$ is an infinite-dimensional operator, we are solving an infinite-dimensional eigensystem,
$$\hat{\Sigma} \hat{\phi}_i = \hat{\lambda}_i \hat{\phi}_i.$$
How do we find $\hat{\phi}_i$ in practice?

Approach 1 (Schölkopf et al., 1998):
$$\hat{f} = \arg\sup_{\|f\|_{\mathcal{H}} = 1} \frac{1}{n}\sum_{i=1}^n f^2(X_i) - \left(\frac{1}{n}\sum_{i=1}^n f(X_i)\right)^2 = \arg\inf_{f \in \mathcal{H}} -\frac{1}{n}\sum_{i=1}^n f^2(X_i) + \left(\frac{1}{n}\sum_{i=1}^n f(X_i)\right)^2 + \lambda \|f\|_{\mathcal{H}}^2$$
for some $\lambda > 0$.

Use the representer theorem, which says there exist $(\alpha_i)_{i=1}^n \subset \mathbb{R}$ such that
$$\hat{f} = \sum_{i=1}^n \alpha_i k(\cdot, X_i).$$
Solving for $(\alpha_i)_{i=1}^n$ yields an $n \times n$ eigensystem involving the Gram matrix. (Exercise)

37 Empirical Kernel PCA

Approach 2:

- Sampling operator: $S : \mathcal{H} \to \mathbb{R}^n$, $f \mapsto \frac{1}{\sqrt{n}}(f(X_1), \dots, f(X_n))$.
- Reconstruction operator: $S^* : \mathbb{R}^n \to \mathcal{H}$, $\alpha \mapsto \frac{1}{\sqrt{n}}\sum_{i=1}^n \alpha_i k(\cdot, X_i)$. (Smale and Zhou, 2007)

$S^*$ is the adjoint (transpose) of $S$: they satisfy $\langle f, S^*\alpha \rangle_{\mathcal{H}} = \langle Sf, \alpha \rangle_{\mathbb{R}^n}$ for all $f \in \mathcal{H}$, $\alpha \in \mathbb{R}^n$.

Then $\hat{\phi}_i = \frac{1}{\hat{\lambda}_i} S^* H_n \hat{\alpha}_i$, where $H_n := I_n - \frac{1}{n}\mathbf{1}_n \mathbf{1}_n^\top$ and the $\hat{\alpha}_i$ are the eigenvectors of $KH_n$ with $\hat{\lambda}_i$ the corresponding eigenvalues:

- It can be shown that $\hat{\Sigma} = S^* H_n S$. Hence
$$\hat{\Sigma} \hat{\phi}_i = \hat{\lambda}_i \hat{\phi}_i \;\Leftrightarrow\; S^* H_n S \hat{\phi}_i = \hat{\lambda}_i \hat{\phi}_i \;\Rightarrow\; S S^* H_n S \hat{\phi}_i = \hat{\lambda}_i S \hat{\phi}_i, \qquad \hat{\alpha}_i := S \hat{\phi}_i.$$
- It can be shown that $K = SS^*$, so the above reads $K H_n \hat{\alpha}_i = \hat{\lambda}_i \hat{\alpha}_i$.
- Conversely, $\hat{\alpha}_i = S \hat{\phi}_i$ gives $S^* H_n \hat{\alpha}_i = S^* H_n S \hat{\phi}_i = \hat{\Sigma} \hat{\phi}_i = \hat{\lambda}_i \hat{\phi}_i$.
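
A numpy sketch of Approach 2 with a Gaussian kernel (kernel choice, toy data, and names are mine). The raw Gram matrix is divided by $n$, which matches the $\frac{1}{\sqrt{n}}$ factor in the sampling operator, and the symmetric matrix $H_n K H_n$ is eigendecomposed; it has the same nonzero spectrum as $K H_n$:

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.standard_normal((n, d))

K = gauss_kernel(X, X)                      # raw Gram matrix [k(X_i, X_j)]
H = np.eye(n) - np.ones((n, n)) / n         # centering matrix H_n
Kc = H @ K @ H / n                          # symmetric; same nonzero spectrum as (K/n) H_n

lam, U = np.linalg.eigh(Kc)                 # eigenvalues in ascending order
lam1, u1 = lam[-1], U[:, -1]                # top eigenpair: lam1 = hat{lambda}_1

# Top eigenfunction hat{phi}_1 = sum_i c_i (k(., X_i) - mu_n), unit norm in H
c = u1 / np.sqrt(n * lam1)

# Evaluate hat{phi}_1 at new points t
T = rng.standard_normal((5, d))
Kt = gauss_kernel(X, T)                     # (n, 5) matrix of k(X_i, t)
phi1_T = c @ (H @ Kt)                       # hat{phi}_1(t) for each test point
```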

40 Empirical Kernel PCA: Random Features

Empirical kernel PCA solves an $n \times n$ eigensystem: complexity is $O(n^3)$.

Use the idea of random features: $X \mapsto \Phi_m(X)$ and apply PCA,
$$\hat{w} := \arg\sup_{\|w\|_2 = 1} \widehat{\mathrm{Var}}[\langle w, \Phi_m(X) \rangle_2] = \arg\sup_{\|w\|_2 = 1} \langle w, \hat{\Sigma}_m w \rangle_2,$$
where
$$\hat{\Sigma}_m := \frac{1}{n}\sum_{i=1}^n \Phi_m(X_i)\Phi_m(X_i)^\top - \left(\frac{1}{n}\sum_{i=1}^n \Phi_m(X_i)\right)\left(\frac{1}{n}\sum_{i=1}^n \Phi_m(X_i)\right)^\top$$
is a covariance matrix of size $2m \times 2m$.

Eigendecomposition: $\hat{\Sigma}_m = \sum_{i=1}^{2m} \hat{\lambda}_{i,m}\, \hat{\phi}_{i,m} \hat{\phi}_{i,m}^\top$.

$\hat{w}$ is obtained by solving a $2m \times 2m$ eigensystem: complexity is $O(m^3)$.

What happens statistically?
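
Putting the two ideas together, a numpy sketch of kernel PCA with random features: only a $2m \times 2m$ covariance matrix is ever formed and eigendecomposed. Kernel choice, toy data, and names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, sigma = 1000, 3, 50, 1.0
X = rng.standard_normal((n, d))

omega = rng.standard_normal((m, d)) / sigma            # omega_j ~ N(0, sigma^{-2} I_d)
Z = X @ omega.T
Phi = np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(m)   # rows are Phi_m(X_i), shape (n, 2m)

mean = Phi.mean(axis=0)
Sigma_m = Phi.T @ Phi / n - np.outer(mean, mean)       # hat{Sigma}_m, a 2m x 2m matrix, O(m^2 n)

lam, W = np.linalg.eigh(Sigma_m)                       # O(m^3) eigendecomposition
w_hat = W[:, -1]                                       # hat{w}: top principal direction in R^{2m}

scores = (Phi - mean) @ w_hat                          # projections <w_hat, Phi_m(X_i) - mean>
```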

42 Empirical Kernel PCA: Random Features

Two notions:

- Reconstruction error (what does this depend on?)
- Convergence of eigenvectors (more generally, eigenspaces)

43 Eigenspace Convergence

What do we have?

- Eigenvectors after approximation, $(\hat{\phi}_{i,m})_{i=1}^{2m}$: these span a subspace of $\mathbb{R}^{2m}$.
- Population eigenfunctions $(\phi_i)_{i \in I}$ of $\Sigma$: these span a subspace of $\mathcal{H}$.

How do we compare them? We embed them in a common space before comparing. The common space is $L^2(P)$.

44 Eigenspace Convergence

Embeddings:
$$\mathcal{I} : \mathcal{H} \to L^2(P), \quad f \mapsto f - \int_{\mathcal{X}} f(x)\, dP(x),$$
$$\mathcal{U} : \mathbb{R}^{2m} \to L^2(P), \quad \alpha \mapsto \sum_{i=1}^{2m} \alpha_i \left(\Phi_{m,i} - \int_{\mathcal{X}} \Phi_{m,i}(x)\, dP(x)\right),$$
where $\Phi_{m,i}$ is the $i$-th coordinate of $\Phi_m = (\cos(\langle \omega_1, \cdot \rangle_2), \dots, \cos(\langle \omega_m, \cdot \rangle_2), \sin(\langle \omega_1, \cdot \rangle_2), \dots, \sin(\langle \omega_m, \cdot \rangle_2))$.

Main results (S. and Sterge, 2017):
$$\sum_{i=1}^{\ell} \left\|\mathcal{I}\hat{\phi}_i - \mathcal{I}\phi_i\right\|_{L^2(P)} = O_p\!\left(\frac{1}{\sqrt{n}}\right).$$
A similar result holds in $\mathcal{H}$ (Zwald and Blanchard, NIPS 2005).
$$\sum_{i=1}^{\ell} \left\|\mathcal{U}\hat{\phi}_{i,m} - \mathcal{I}\phi_i\right\|_{L^2(P)} = O_p\!\left(\frac{1}{\sqrt{m}} + \frac{1}{\sqrt{n}}\right).$$

- Random features are not helpful from the point of view of eigenvector convergence (the same holds for eigenspaces).
- They may be useful for the convergence of the reconstruction error (work in progress).

47 References I

Fine, S. and Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2.

Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20. Curran Associates, Inc.

Rudi, A. and Rosasco, L. (2016). Generalization properties of learning with random features.

Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10.

Smale, S. and Zhou, D.-X. (2007). Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26.

Smola, A. J. and Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proc. 17th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA.

Sriperumbudur, B. K. and Sterge, N. (2017). Statistical consistency of kernel PCA with random features. arXiv preprint.

Sriperumbudur, B. K. and Szabó, Z. (2015). Optimal rates for random Fourier features. In Advances in Neural Information Processing Systems 28. Curran Associates, Inc.

Sutherland, D. and Schneider, J. (2015). On the error of random Fourier features. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence.

48 References II

Williams, C. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13. MIT Press, Cambridge, MA.

Yang, Y., Pilanci, M., and Wainwright, M. J. (2015). Randomized sketches for kernels: Fast and optimal non-parametric regression. Technical report.

Zwald, L. and Blanchard, G. (2006). On the convergence of eigenspaces in kernel principal component analysis. In Advances in Neural Information Processing Systems 18. MIT Press.
