Approximate Kernel Methods

Lecture 3: Approximate Kernel Methods
Bharath K. Sriperumbudur, Department of Statistics, Pennsylvania State University
Machine Learning Summer School, Tübingen, 2017

Outline

- Motivating example: ridge regression
- Approximation methods
- Kernel PCA
- Computational vs. statistical trade-off

(Joint work with Nicholas Sterge, Pennsylvania State University)

Motivating Example: Ridge Regression

Given: $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.

Task: Find a linear regressor $f = \langle w, \cdot\rangle_2$ such that $f(x_i) \approx y_i$:
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \big(\langle w, x_i\rangle_2 - y_i\big)^2 + \lambda \|w\|_2^2 \qquad (\lambda > 0).$$

Solution: For $X := (x_1, \dots, x_n) \in \mathbb{R}^{d \times n}$ and $y := (y_1, \dots, y_n)^\top \in \mathbb{R}^n$,
$$w = \underbrace{\Big(\tfrac{1}{n} XX^\top + \lambda I_d\Big)^{-1} \tfrac{1}{n} X y}_{\text{primal}}.$$

Easy: $\big(\tfrac{1}{n} XX^\top + \lambda I_d\big)^{-1} X = X\big(\tfrac{1}{n} X^\top X + \lambda I_n\big)^{-1}$, hence
$$w = \underbrace{\tfrac{1}{n} X\Big(\tfrac{1}{n} X^\top X + \lambda I_n\Big)^{-1} y}_{\text{dual}}.$$
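The primal and dual expressions return the same $w$; here is a minimal numpy sketch (synthetic data, arbitrary $\lambda$; all variable names are illustrative) that checks the identity numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1

X = rng.standard_normal((d, n))          # columns are the inputs x_i
y = rng.standard_normal(n)

# Primal: w = (XX^T/n + lam I_d)^{-1} (1/n) X y  -- a d x d linear system
w_primal = np.linalg.solve(X @ X.T / n + lam * np.eye(d), X @ y / n)

# Dual: w = (1/n) X (X^T X/n + lam I_n)^{-1} y   -- an n x n linear system
w_dual = X @ np.linalg.solve(X.T @ X / n + lam * np.eye(n), y) / n

print(np.allclose(w_primal, w_dual))     # True: both formulations agree
```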

Ridge Regression: Prediction

Given $t \in \mathbb{R}^d$,
$$f(t) = \langle w, t\rangle_2 = y^\top X^\top \big(XX^\top + n\lambda I_d\big)^{-1} t = y^\top \big(X^\top X + n\lambda I_n\big)^{-1} X^\top t.$$

What does $X^\top X$ look like?
$$X^\top X = \begin{pmatrix}
\langle x_1, x_1\rangle_2 & \langle x_1, x_2\rangle_2 & \cdots & \langle x_1, x_n\rangle_2 \\
\langle x_2, x_1\rangle_2 & \langle x_2, x_2\rangle_2 & \cdots & \langle x_2, x_n\rangle_2 \\
\vdots & & \ddots & \vdots \\
\langle x_n, x_1\rangle_2 & \langle x_n, x_2\rangle_2 & \cdots & \langle x_n, x_n\rangle_2
\end{pmatrix},$$
the matrix of inner products (Gram matrix).

Kernel Ridge Regression: Feature Map and Kernel Trick

Given: $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathcal{X}$, $y_i \in \mathbb{R}$.

Task: Find a regressor $f \in H$ (some feature space) such that $f(x_i) \approx y_i$.

Idea: Map $x_i$ to $\Phi(x_i)$ and do linear regression:
$$\min_{f \in H} \frac{1}{n}\sum_{i=1}^n \big(\langle f, \Phi(x_i)\rangle_H - y_i\big)^2 + \lambda \|f\|_H^2 \qquad (\lambda > 0).$$

Solution: For $\Phi(X) := (\Phi(x_1), \dots, \Phi(x_n)) \in \mathbb{R}^{\dim(H) \times n}$ and $y := (y_1, \dots, y_n)^\top \in \mathbb{R}^n$,
$$f = \underbrace{\Big(\tfrac{1}{n}\Phi(X)\Phi(X)^\top + \lambda I_{\dim(H)}\Big)^{-1} \tfrac{1}{n}\Phi(X) y}_{\text{primal}}
  = \underbrace{\tfrac{1}{n}\Phi(X)\Big(\tfrac{1}{n}\Phi(X)^\top \Phi(X) + \lambda I_n\Big)^{-1} y}_{\text{dual}}.$$

Kernel Ridge Regression: Feature Map and Kernel Trick

Prediction: Given $t \in \mathcal{X}$,
$$f(t) = \langle f, \Phi(t)\rangle_H
= \tfrac{1}{n} y^\top \Phi(X)^\top \Big(\tfrac{1}{n}\Phi(X)\Phi(X)^\top + \lambda I_{\dim(H)}\Big)^{-1}\Phi(t)
= \tfrac{1}{n} y^\top \Big(\tfrac{1}{n}\Phi(X)^\top\Phi(X) + \lambda I_n\Big)^{-1}\Phi(X)^\top \Phi(t).$$

As before,
$$\Phi(X)^\top \Phi(X) = \begin{pmatrix}
\langle\Phi(x_1), \Phi(x_1)\rangle_H & \cdots & \langle\Phi(x_1), \Phi(x_n)\rangle_H \\
\langle\Phi(x_2), \Phi(x_1)\rangle_H & \cdots & \langle\Phi(x_2), \Phi(x_n)\rangle_H \\
\vdots & \ddots & \vdots \\
\langle\Phi(x_n), \Phi(x_1)\rangle_H & \cdots & \langle\Phi(x_n), \Phi(x_n)\rangle_H
\end{pmatrix},
\qquad k(x_i, x_j) = \langle\Phi(x_i), \Phi(x_j)\rangle_H,$$
and $\Phi(X)^\top \Phi(t) = \big[\langle\Phi(x_1), \Phi(t)\rangle_H, \dots, \langle\Phi(x_n), \Phi(t)\rangle_H\big]^\top$.

Remarks

The primal formulation requires knowledge of the feature map $\Phi$ (and of course $H$), and these could be infinite-dimensional. The dual formulation is entirely determined by kernel evaluations: the Gram matrix and $(k(x_i, t))_i$. But it scales poorly: $O(n^3)$.
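To make the dual formulation concrete, here is a minimal numpy sketch of kernel ridge regression using only the Gram matrix; the Gaussian kernel, bandwidth, and synthetic data are illustrative choices, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, sigma = 100, 2, 1e-3, 1.0

X = rng.uniform(-1, 1, size=(n, d))                      # rows are inputs x_i
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(n)   # noisy targets

def gaussian_gram(A, B, sigma):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

K = gaussian_gram(X, X, sigma)                            # n x n Gram matrix
alpha = np.linalg.solve(K / n + lam * np.eye(n), y) / n   # dual coefficients

t = rng.uniform(-1, 1, size=(5, d))                       # test points
k_t = gaussian_gram(X, t, sigma)                          # (k(x_i, t_j))_{ij}
print(alpha @ k_t)                                        # predictions f(t_j)
```

The $O(n^3)$ cost lives in the `np.linalg.solve` on the $n \times n$ Gram matrix; this is exactly the bottleneck the approximation schemes below address.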

Approximation Schemes

- Incomplete Cholesky factorization (e.g., Fine and Scheinberg, JMLR 2001)
- Sketching (Yang et al., 2015)
- Sparse greedy approximation (Smola and Schölkopf, ICML 2000)
- Nyström method (e.g., Williams and Seeger, NIPS 2001)
- Random Fourier features (e.g., Rahimi and Recht, NIPS 2008), ...

Random Fourier Approximation

Let $\mathcal{X} = \mathbb{R}^d$ and let $k$ be continuous and translation-invariant, i.e., $k(x, y) = \psi(x - y)$.

Bochner's theorem:
$$k(x, y) = \int_{\mathbb{R}^d} e^{\mathrm{i}\langle\omega, x - y\rangle_2}\, d\Lambda(\omega),$$
where $\Lambda$ is a finite non-negative Borel measure on $\mathbb{R}^d$.

$k$ is symmetric and therefore $\Lambda$ is a symmetric measure on $\mathbb{R}^d$. Therefore
$$k(x, y) = \int_{\mathbb{R}^d} \cos\big(\langle\omega, x - y\rangle_2\big)\, d\Lambda(\omega).$$

Random Feature Approximation

(Rahimi and Recht, NIPS 2008): Draw $(\omega_j)_{j=1}^m \overset{\text{i.i.d.}}{\sim} \Lambda$ and set
$$k_m(x, y) = \frac{1}{m}\sum_{j=1}^m \cos\big(\langle\omega_j, x - y\rangle_2\big) = \langle\Phi_m(x), \Phi_m(y)\rangle_{\mathbb{R}^{2m}},$$
where
$$\Phi_m(x) = \frac{1}{\sqrt{m}}\big(\cos(\langle\omega_1, x\rangle_2), \dots, \cos(\langle\omega_m, x\rangle_2), \sin(\langle\omega_1, x\rangle_2), \dots, \sin(\langle\omega_m, x\rangle_2)\big).$$

Use the feature map idea with $\Phi_m$, i.e., $x \mapsto \Phi_m(x)$, and apply your favorite linear method.
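For the Gaussian kernel $k(x, y) = \exp(-\|x - y\|_2^2 / 2)$ the spectral measure $\Lambda$ is the standard Gaussian $N(0, I_d)$, so the construction is easy to try out. A minimal sketch (the kernel and all names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 2000

# Gaussian kernel k(x, y) = exp(-||x - y||^2 / 2); by Bochner's theorem its
# spectral measure Lambda is N(0, I_d).
omega = rng.standard_normal((m, d))      # (omega_j)_{j=1}^m ~ Lambda, i.i.d.

def phi_m(x):
    """Random feature map Phi_m: R^d -> R^{2m}, so <Phi_m(x), Phi_m(y)> = k_m(x, y)."""
    proj = omega @ x
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

x, y = rng.standard_normal(d), rng.standard_normal(d)
k_exact = np.exp(-np.sum((x - y) ** 2) / 2)
k_m = phi_m(x) @ phi_m(y)                # = (1/m) sum_j cos(<omega_j, x - y>)
print(k_exact, k_m)                      # close for large m
```

The cosine/sine identity $\cos a \cos b + \sin a \sin b = \cos(a - b)$ is what makes the $2m$-dimensional inner product reproduce $k_m$.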

How Good Is the Approximation?

(Sriperumbudur and Szabó, NIPS 2015):
$$\sup_{x, y \in S} \big|k_m(x, y) - k(x, y)\big| = O_{a.s.}\!\left(\sqrt{\frac{\log |S|}{m}}\right).$$

This is the optimal convergence rate. Other results are known, but they are non-optimal (Rahimi and Recht, NIPS 2008; Sutherland and Schneider, UAI 2015).

Ridge Regression: Random Feature Approximation

Given: $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathcal{X}$, $y_i \in \mathbb{R}$.

Task: Find a regressor $w \in \mathbb{R}^{2m}$ such that $f(x_i) \approx y_i$.

Idea: Map $x_i$ to $\Phi_m(x_i)$ and do linear regression:
$$\min_{w \in \mathbb{R}^{2m}} \frac{1}{n}\sum_{i=1}^n \big(\langle w, \Phi_m(x_i)\rangle_{\mathbb{R}^{2m}} - y_i\big)^2 + \lambda \|w\|_{\mathbb{R}^{2m}}^2 \qquad (\lambda > 0).$$

Solution: For $\Phi_m(X) := (\Phi_m(x_1), \dots, \Phi_m(x_n)) \in \mathbb{R}^{2m \times n}$ and $y := (y_1, \dots, y_n)^\top \in \mathbb{R}^n$,
$$w = \underbrace{\Big(\tfrac{1}{n}\Phi_m(X)\Phi_m(X)^\top + \lambda I_{2m}\Big)^{-1}\tfrac{1}{n}\Phi_m(X) y}_{\text{primal}}
  = \underbrace{\tfrac{1}{n}\Phi_m(X)\Big(\tfrac{1}{n}\Phi_m(X)^\top\Phi_m(X) + \lambda I_n\Big)^{-1} y}_{\text{dual}}.$$

Computation: $O(m^2 n)$.
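Putting the two previous sketches together: a minimal example of ridge regression on random Fourier features, solving the $2m \times 2m$ primal system at $O(m^2 n)$ cost (kernel, bandwidth, and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lam = 5000, 2, 100, 1e-3

X = rng.uniform(-1, 1, size=(n, d))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(n)

omega = rng.standard_normal((m, d))      # frequencies for a unit-bandwidth Gaussian kernel

def features(A):
    """Map each row of A to its random Fourier feature vector in R^{2m}."""
    proj = A @ omega.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

Z = features(X)                          # n x 2m matrix, i.e. Phi_m(X)^T
# Primal solve: w = (Z^T Z / n + lam I_{2m})^{-1} Z^T y / n  -- costs O(m^2 n)
w = np.linalg.solve(Z.T @ Z / n + lam * np.eye(2 * m), Z.T @ y / n)

t = rng.uniform(-1, 1, size=(5, d))
print(features(t) @ w)                   # predictions at test points
```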

What Happens Statistically?

(Rudi and Rosasco, 2016): If $m \gtrsim n^{\alpha}$, where $\tfrac{1}{2} \le \alpha < 1$ depends on the properties of the unknown true regressor $f_*$, then the excess risk $\mathcal{R}_{L,P}(f_{m,n}) - \mathcal{R}_{L,P}(f_*)$ achieves the minimax optimal rate obtained in the case with no approximation. Here $L$ is the squared loss.

Computational gain with no statistical loss!

Principal Component Analysis (PCA)

Suppose $X \sim P$ is a random variable in $\mathbb{R}^d$ with mean $\mu$ and covariance matrix $\Sigma$.

Find a direction $w \in \{v : \|v\|_2 = 1\}$ such that $\mathrm{Var}[\langle w, X\rangle_2]$ is maximized.

Find a direction $w \in \mathbb{R}^d$ such that
$$\mathbb{E}\left\|(X - \mu) - \frac{\langle w, X - \mu\rangle_2}{\|w\|_2^2}\, w\right\|_2^2$$
is minimized.

The formulations are equivalent, and the solution is the eigenvector of the covariance matrix $\Sigma$ corresponding to the largest eigenvalue. Can be generalized to multiple directions (find a subspace...). Applications: dimensionality reduction.
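A minimal numpy sketch of the first formulation with an empirical covariance: estimate $\Sigma$ from samples and take its leading eigenvector (the synthetic data are an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3

# Synthetic data with one dominant direction of variance.
X = rng.standard_normal((n, d)) @ np.diag([3.0, 1.0, 0.5])

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / n        # empirical covariance matrix

evals, evecs = np.linalg.eigh(Sigma)     # eigenvalues in ascending order
w = evecs[:, -1]                         # leading eigenvector = first principal direction
print(evals[-1], w)                      # maximal variance Var[<w, X>] and its direction
```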

Kernel PCA

Nonlinear generalization of PCA (Schölkopf et al., 1998): map $X \mapsto \Phi(X)$ and apply PCA. Provides a low-dimensional manifold (curves) in $\mathbb{R}^d$ to efficiently represent the data.

Function space view: Find $f \in \{g \in H : \|g\|_H = 1\}$ such that $\mathrm{Var}[f(X)]$ is maximized, i.e.,
$$f_* = \arg\sup_{\|f\|_H = 1} \mathbb{E}[f^2(X)] - \mathbb{E}^2[f(X)].$$

Using the reproducing property $f(X) = \langle f, k(\cdot, X)\rangle_H$, we obtain
$$\mathbb{E}[f^2(X)] = \mathbb{E}\,\langle f, k(\cdot, X)\rangle_H^2
= \mathbb{E}\,\langle f, (k(\cdot, X) \otimes_H k(\cdot, X)) f\rangle_H
= \langle f, \mathbb{E}[k(\cdot, X) \otimes_H k(\cdot, X)] f\rangle_H,$$
assuming $\int_{\mathcal{X}} k(x, x)\, dP(x) < \infty$, and
$$\mathbb{E}[f(X)] = \langle f, \mu_P\rangle_H, \quad \text{where } \mu_P := \int_{\mathcal{X}} k(\cdot, x)\, dP(x).$$

Kernel PCA

$$f_* = \arg\sup_{\|f\|_H = 1} \langle f, \Sigma f\rangle_H,$$
where
$$\Sigma := \int_{\mathcal{X}} k(\cdot, x) \otimes_H k(\cdot, x)\, dP(x) - \mu_P \otimes_H \mu_P$$
is the covariance operator (symmetric, positive and Hilbert-Schmidt) on $H$.

Spectral theorem: $\Sigma = \sum_{i \in I} \lambda_i\, \phi_i \otimes_H \phi_i$, where $I$ is either countable (with $\lambda_i \to 0$ as $i \to \infty$) or finite.

As in PCA, the solution is the eigenfunction corresponding to the largest eigenvalue of $\Sigma$.

Empirical Kernel PCA

In practice, $P$ is unknown, but we have access to $(X_i)_{i=1}^n \overset{\text{i.i.d.}}{\sim} P$:
$$\hat{f} = \arg\sup_{\|f\|_H = 1} \langle f, \hat\Sigma f\rangle_H,$$
where
$$\hat\Sigma := \frac{1}{n}\sum_{i=1}^n k(\cdot, X_i) \otimes_H k(\cdot, X_i) - \mu_n \otimes_H \mu_n$$
is the empirical covariance operator (symmetric, positive and Hilbert-Schmidt) on $H$, and $\mu_n := \frac{1}{n}\sum_{i=1}^n k(\cdot, X_i)$.

$\hat\Sigma$ is an infinite-dimensional operator but with finite rank (what is its rank?).

Spectral theorem: $\hat\Sigma = \sum_{i} \hat\lambda_i\, \hat\phi_i \otimes_H \hat\phi_i$.

$\hat{f}$ is the eigenfunction corresponding to the largest eigenvalue of $\hat\Sigma$.

Empirical Kernel PCA

Since $\hat\Sigma$ is an infinite-dimensional operator, we are solving an infinite-dimensional eigensystem, $\hat\Sigma\hat\phi_i = \hat\lambda_i\hat\phi_i$. How do we find $\hat\phi_i$ in practice?

Approach 1 (Schölkopf et al., 1998):
$$\hat{f} = \arg\sup_{\|f\|_H = 1} \frac{1}{n}\sum_{i=1}^n f^2(X_i) - \Big(\frac{1}{n}\sum_{i=1}^n f(X_i)\Big)^2
= \arg\inf_{f \in H} -\frac{1}{n}\sum_{i=1}^n f^2(X_i) + \Big(\frac{1}{n}\sum_{i=1}^n f(X_i)\Big)^2 + \lambda\|f\|_H^2$$
for some $\lambda > 0$.

Use the representer theorem, which says there exist $(\alpha_i)_{i=1}^n \subset \mathbb{R}$ such that
$$\hat{f} = \sum_{i=1}^n \alpha_i k(\cdot, X_i).$$

Solving for $(\alpha_i)_{i=1}^n$ yields an $n \times n$ eigensystem involving the Gram matrix. (Exercise; a sketch of one standard implementation follows.)
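One standard way to carry out the exercise is to doubly center the Gram matrix with $H_n$ and take its leading eigenvectors; the sketch below does exactly that. The Gaussian kernel, bandwidth, and toy data are illustrative assumptions, and the normalization enforces $\|\hat f\|_H = 1$ via $\alpha^\top K \alpha = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Toy data on a noisy circle: nonlinear structure that linear PCA cannot capture.
theta = rng.uniform(0, 2 * np.pi, n)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((n, 2))

def gaussian_gram(A, B, sigma=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

K = gaussian_gram(X, X)                       # n x n Gram matrix
H = np.eye(n) - np.ones((n, n)) / n           # centering matrix H_n
Kc = H @ K @ H                                # doubly centered Gram matrix

evals, evecs = np.linalg.eigh(Kc / n)         # nonzero eigenvalues match those of Sigma_hat
lam, alpha = evals[-1], evecs[:, -1]          # leading eigenpair
alpha /= np.sqrt(alpha @ K @ alpha)           # so that f = sum_i alpha_i k(., X_i) has ||f||_H = 1

scores = K @ alpha                            # f(X_j) = sum_i alpha_i k(X_i, X_j)
print(lam, scores[:5])
```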

Empirical Kernel PCA

Approach 2 (Smale and Zhou, 2007):

Sampling operator: $S : H \to \mathbb{R}^n$, $f \mapsto \frac{1}{\sqrt{n}}\big(f(X_1), \dots, f(X_n)\big)$.
Reconstruction operator: $S^* : \mathbb{R}^n \to H$, $\alpha \mapsto \frac{1}{\sqrt{n}}\sum_{i=1}^n \alpha_i k(\cdot, X_i)$.

$S^*$ is the adjoint (transpose) of $S$; they satisfy
$$\langle f, S^*\alpha\rangle_H = \langle Sf, \alpha\rangle_{\mathbb{R}^n}, \qquad f \in H,\ \alpha \in \mathbb{R}^n.$$

Claim: $\hat\phi_i = \frac{1}{\hat\lambda_i} S^* H_n \hat\alpha_i$, where $H_n := I_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\top$ and $\hat\alpha_i$ are the eigenvectors of $K H_n$ with $\hat\lambda_i$ the corresponding eigenvalues.

It can be shown that $\hat\Sigma = S^* H_n S$. Then
$$\hat\Sigma\hat\phi_i = \hat\lambda_i\hat\phi_i
\;\Rightarrow\; S^*H_nS\hat\phi_i = \hat\lambda_i\hat\phi_i
\;\Rightarrow\; SS^*H_nS\hat\phi_i = \hat\lambda_i S\hat\phi_i,
\qquad \hat\alpha_i := S\hat\phi_i.$$

It can be shown that $K = SS^*$, so $\hat\alpha_i$ solves the eigensystem above. Conversely,
$$\hat\alpha_i = S\hat\phi_i
\;\Rightarrow\; S^*H_n\hat\alpha_i = S^*H_nS\hat\phi_i = \hat\Sigma\hat\phi_i = \hat\lambda_i\hat\phi_i.$$

Empirical Kernel PCA: Random Features

Empirical kernel PCA solves an $n \times n$ eigensystem: complexity is $O(n^3)$.

Use the idea of random features: $X \mapsto \Phi_m(X)$ and apply PCA:
$$\hat{w} := \arg\sup_{\|w\|_2 = 1} \mathrm{Var}[\langle w, \Phi_m(X)\rangle_2] = \arg\sup_{\|w\|_2 = 1} \langle w, \hat\Sigma_m w\rangle_2,$$
where
$$\hat\Sigma_m := \frac{1}{n}\sum_{i=1}^n \Phi_m(X_i)\Phi_m(X_i)^\top
- \Big(\frac{1}{n}\sum_{i=1}^n \Phi_m(X_i)\Big)\Big(\frac{1}{n}\sum_{i=1}^n \Phi_m(X_i)\Big)^\top$$
is a covariance matrix of size $2m \times 2m$.

Eigendecomposition: $\hat\Sigma_m = \sum_{i=1}^{2m} \hat\lambda_{i,m}\, \hat\phi_{i,m}\hat\phi_{i,m}^\top$.

$\hat{w}$ is obtained by solving a $2m \times 2m$ eigensystem: complexity is $O(m^3)$.

What happens statistically?
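A minimal sketch of the random-feature variant: build $\Phi_m(X_i)$, form the $2m \times 2m$ empirical covariance $\hat\Sigma_m$, and take its leading eigenvector. The kernel bandwidth and toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 1000, 2, 50

theta = rng.uniform(0, 2 * np.pi, n)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((n, 2))

sigma = 0.5
omega = rng.standard_normal((m, d)) / sigma   # frequencies for a Gaussian kernel with bandwidth sigma

def features(A):
    proj = A @ omega.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

Z = features(X)                               # n x 2m matrix of Phi_m(X_i)
mean = Z.mean(axis=0)
Sigma_m = (Z - mean).T @ (Z - mean) / n       # 2m x 2m covariance matrix Sigma_m_hat

evals, evecs = np.linalg.eigh(Sigma_m)        # O(m^3) eigendecomposition
w_hat = evecs[:, -1]                          # leading principal direction in R^{2m}
scores = (Z - mean) @ w_hat                   # projections <w_hat, Phi_m(X_i) - mean>
print(evals[-1], scores[:5])
```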

Empirical Kernel PCA: Random Features

Two notions:
- Reconstruction error (what does this depend on?)
- Convergence of eigenvectors (more generally, eigenspaces)

Eigenspace Convergence

What do we have?
- Eigenvectors after approximation, $(\hat\phi_{i,m})_{i=1}^{2m}$: these span a subspace of $\mathbb{R}^{2m}$.
- Population eigenfunctions $(\phi_i)_{i \in I}$ of $\Sigma$: these span a subspace of $H$.

How do we compare them? We embed them in a common space before comparing. The common space is $L^2(P)$.

Eigenspace Convergence

Embeddings:
$$I : H \to L^2(P), \quad f \mapsto f - \int_{\mathcal{X}} f(x)\, dP(x),$$
$$U : \mathbb{R}^{2m} \to L^2(P), \quad \alpha \mapsto \sum_{i=1}^{2m} \alpha_i\Big(\Phi_{m,i} - \int_{\mathcal{X}} \Phi_{m,i}(x)\, dP(x)\Big),$$
where $\Phi_{m,i}$ denotes the $i$-th coordinate of the random feature map $\Phi_m$ defined earlier (cosines and sines of $\langle\omega_j, \cdot\rangle_2$).

Main results (Sriperumbudur and Sterge, 2017):
$$\sum_{i=1}^{\ell} \big\|I\hat\phi_i - I\phi_i\big\|_{L^2(P)} = O_p\Big(\frac{1}{\sqrt{n}}\Big).$$
A similar result holds in $H$ (Zwald and Blanchard, NIPS 2005).
$$\sum_{i=1}^{\ell} \big\|U\hat\phi_{i,m} - I\phi_i\big\|_{L^2(P)} = O_p\Big(\frac{1}{\sqrt{m}} + \frac{1}{\sqrt{n}}\Big).$$

Random features are not helpful from the point of view of eigenvector convergence (this also holds for eigenspaces). They may be useful for convergence of the reconstruction error (work in progress).

References I

Fine, S. and Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264.

Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc.

Rudi, A. and Rosasco, L. (2016). Generalization properties of learning with random features. https://arxiv.org/pdf/1602.04474.pdf.

Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319.

Smale, S. and Zhou, D.-X. (2007). Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26:153–172.

Smola, A. J. and Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proc. 17th International Conference on Machine Learning, pages 911–918. Morgan Kaufmann, San Francisco, CA.

Sriperumbudur, B. K. and Sterge, N. (2017). Statistical consistency of kernel PCA with random features. arXiv:1706.06296.

Sriperumbudur, B. K. and Szabó, Z. (2015). Optimal rates for random Fourier features. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 1144–1152. Curran Associates, Inc.

Sutherland, D. and Schneider, J. (2015). On the error of random Fourier features. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 862–871.

References II

Williams, C. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 682–688, Cambridge, MA. MIT Press.

Yang, Y., Pilanci, M., and Wainwright, M. J. (2015). Randomized sketches for kernels: Fast and optimal non-parametric regression. Technical report. https://arxiv.org/pdf/1501.06195.pdf.

Zwald, L. and Blanchard, G. (2006). On the convergence of eigenspaces in kernel principal component analysis. In Weiss, Y., Schölkopf, B., and Platt, J. C., editors, Advances in Neural Information Processing Systems 18, pages 1649–1656. MIT Press.