Approximate Kernel PCA with Random Features


Approximate Kernel PCA with Random Features (Computational vs. Statistical Trade-off). Bharath K. Sriperumbudur, Department of Statistics, Pennsylvania State University. Journées de Statistique, Paris, May 28, 2018. Joint work with Nicholas Sterge (PSU). Supported by the National Science Foundation (DMS-1713011).

Outline: Principal Component Analysis (PCA); Kernel PCA; approximation methods; kernel PCA with random features; the computational vs. statistical trade-off.

Principal Component Analysis (PCA) (Pearson, 1901). Suppose $X \sim P$ is a random variable in $\mathbb{R}^d$ with mean $\mu$ and covariance matrix $\Sigma$. Find a direction $w \in \{v : \|v\|_2 = 1\}$ such that $\mathrm{Var}[\langle w, X\rangle_2] = \langle w, \Sigma w\rangle_2$ is maximized. Equivalently, find a direction $w \in \{v : \|v\|_2 = 1\}$ such that $\mathbb{E}\big\|(X-\mu) - \langle w, X-\mu\rangle_2\, w\big\|_2^2$ is minimized. The two formulations are equivalent, and the solution is the eigenvector of the covariance matrix $\Sigma$ corresponding to its largest eigenvalue. The idea generalizes to multiple directions (find a subspace). Applications: dimensionality reduction.

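As a concrete illustration, here is a minimal numpy sketch of this eigenvector formulation; the anisotropic synthetic data and the specific scalings are assumptions for illustration only.

```python
# Minimal sketch of linear PCA: the first principal direction is the top
# eigenvector of the sample covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.standard_normal((n, d)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])  # anisotropic cloud

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / n        # sample covariance matrix
evals, evecs = np.linalg.eigh(Sigma)     # eigenvalues in ascending order
w = evecs[:, -1]                         # direction maximizing Var[<w, X>_2]

print("top eigenvalue:  ", evals[-1])
print("variance along w:", np.var((X - mu) @ w))  # matches the top eigenvalue
```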

Kernel PCA: a nonlinear generalization of PCA (Schölkopf et al., 1998). Map $X \mapsto \Phi(X)$ through a feature map $\Phi$ and apply PCA. The choice of $\Phi$ determines how much information we capture about $X$: if $\Phi(X) = (1, X, X^2, X^3, \dots)$, then the covariance of $\Phi(X)$ captures the higher-order moments of $X$. $\Phi$ is not specified explicitly but implicitly, through a positive definite kernel function $k(x,y) = \langle\Phi(x), \Phi(y)\rangle$.

Kernel PCA: Functional Version. Find $f \in \{g \in H : \|g\|_H = 1\}$ such that $\mathrm{Var}[f(X)]$ is maximized, i.e., $f_\star = \arg\sup_{\|f\|_H = 1} \mathrm{Var}[f(X)]$, where $H$ is a reproducing kernel Hilbert space of real-valued functions (the evaluation functionals $f \mapsto f(x)$ are bounded for all $x \in \mathcal{X}$; Aronszajn, 1950). There exists a unique $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $k(\cdot, x) \in H$ for all $x \in \mathcal{X}$ and $f(x) = \langle k(\cdot, x), f\rangle_H$ for all $f \in H$, $x \in \mathcal{X}$. $k$ is called the reproducing kernel of $H$, since $k(x,y) = \langle k(\cdot,x), k(\cdot,y)\rangle_H = \langle\Phi(x), \Phi(y)\rangle_H$ with $\Phi(x) = k(\cdot,x)$, and it is symmetric and positive definite. In fact, the converse also holds (Moore-Aronszajn theorem).


RKHS ↔ kernels (positive definite and symmetric functions). The RKHS is $H = \overline{\mathrm{span}}\{k(\cdot, x) : x \in \mathcal{X}\}$, the (closed) linear span of the kernel functions. $k$ controls the properties of $f \in H$: if $k$ satisfies $(\star)$, then every $f \in H$ satisfies $(\star)$, where $(\star)$ is boundedness, continuity, measurability, integrability, or differentiability.


Kernel PCA generalizes linear PCA: with the linear kernel $k(x, y) = \langle x, y\rangle_2$, $x, y \in \mathbb{R}^d$, the RKHS $H$ is isometrically isomorphic to $\mathbb{R}^d$, and every $f \in H$ is of the form $f(x) = \langle w_f, x\rangle_2$.


Kernel PCA. Using the reproducing property $f(X) = \langle f, k(\cdot, X)\rangle_H$ (with $\Phi(X) = k(\cdot, X)$), we obtain $f_\star = \arg\sup_{\|f\|_H=1} \langle f, \Sigma f\rangle_H$, where $\Sigma := \int_{\mathcal{X}} \big(k(\cdot,x) - \mu_P\big) \otimes_H \big(k(\cdot,x) - \mu_P\big)\, dP(x)$ is the covariance operator (self-adjoint, positive and trace class) on $H$ and $\mu_P := \int_{\mathcal{X}} k(\cdot,x)\, dP(x)$ is the mean element. Spectral theorem: $\Sigma = \sum_{i \in I} \lambda_i\, \phi_i \otimes_H \phi_i$, where $I$ is either countable (with $\lambda_i \to 0$ as $i \to \infty$) or finite. As in linear PCA, the solution is the eigenfunction corresponding to the largest eigenvalue of $\Sigma$.


Empirical Kernel PCA. In practice, $P$ is unknown, but we have access to $(X_i)_{i=1}^n \stackrel{\mathrm{i.i.d.}}{\sim} P$. Then $\hat f = \arg\sup_{\|f\|_H=1} \langle f, \hat\Sigma f\rangle_H$, where $\hat\Sigma := \frac{1}{n}\sum_{i=1}^n \big(k(\cdot,X_i) - \mu_n\big) \otimes_H \big(k(\cdot,X_i) - \mu_n\big)$ is the empirical covariance operator (symmetric, positive and trace class) on $H$ and $\mu_n := \frac{1}{n}\sum_{i=1}^n k(\cdot, X_i)$. Spectral theorem: $\hat\Sigma = \sum_{i=1}^n \hat\lambda_i\, \hat\phi_i \otimes_H \hat\phi_i$. $\hat f$ is the eigenfunction corresponding to the largest eigenvalue of $\hat\Sigma$.


Empirical Kernel PCA: Representer Theorem. Since $\hat\Sigma$ is an infinite-dimensional operator, we apparently have to solve an infinite-dimensional eigensystem $\hat\Sigma \hat\phi_i = \hat\lambda_i \hat\phi_i$. Consider $\sup_{\|f\|_H=1} \langle f, \hat\Sigma f\rangle_H = \sup_{\|f\|_H=1} \frac{1}{n}\sum_{i=1}^n \Big(\big\langle f, k(\cdot,X_i)\big\rangle_H - \big\langle f, \tfrac{1}{n}\sum_{j=1}^n k(\cdot,X_j)\big\rangle_H\Big)^2$. Clearly the maximizer satisfies $f \in \mathrm{span}\{k(\cdot,X_i) : i=1,\dots,n\}$, i.e., $f = \sum_{i=1}^n \alpha_i k(\cdot, X_i)$. Hence $\sup\big\{\langle f, \hat\Sigma f\rangle_H : \|f\|_H = 1\big\} = \sup\big\{\alpha^\top K H_n K \alpha : \alpha^\top K \alpha = 1\big\}$, i.e., $K H_n K \alpha = \lambda K \alpha$, where $K_{ij} = k(X_i, X_j)$ is the Gram matrix and $H_n = I_n - \tfrac{1}{n}\mathbf{1}_n\mathbf{1}_n^\top$ is the centering matrix. This requires $K$ to be invertible.

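A minimal numpy sketch of this Gram-matrix route, assuming a Gaussian kernel on synthetic data; it eigendecomposes the symmetric centered matrix $H_n K H_n$, which has the same nonzero eigenvalues as the generalized problem above.

```python
# Minimal sketch of empirical kernel PCA via the centered Gram matrix.
import numpy as np

rng = np.random.default_rng(0)
n, d, l = 200, 2, 5
X = rng.standard_normal((n, d))

def gauss_kernel(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

K = gauss_kernel(X, X)
H = np.eye(n) - np.ones((n, n)) / n          # centering matrix H_n
Kc = H @ K @ H                               # centered Gram matrix
evals, evecs = np.linalg.eigh(Kc)
evals, evecs = evals[::-1], evecs[:, ::-1]   # descending order

lam_hat = evals[:l] / n                      # top-l eigenvalues of the empirical covariance operator
alpha = evecs[:, :l] / np.sqrt(evals[:l])    # expansion coefficients giving unit-norm eigenfunctions

scores = Kc @ alpha                          # n x l kernel PCA scores <k(.,X_i) - mu_n, phi_hat_j>_H
print("top-l eigenvalues:", lam_hat)
```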

Empirical Kernel PCA. In classical PCA, $X_i \in \mathbb{R}^d$, $i=1,\dots,n$, is represented as $\big(\langle X_i - \mu, \hat w_1\rangle_2, \dots, \langle X_i - \mu, \hat w_l\rangle_2\big) \in \mathbb{R}^l$ with $l \le d$, where $(\hat w_i)_{i=1}^l$ are the eigenvectors of $\hat\Sigma$ corresponding to the top-$l$ eigenvalues. In kernel PCA, $X_i \in \mathcal{X}$, $i=1,\dots,n$, is represented as $\big(\langle k(\cdot,X_i) - \mu_n, \hat\phi_1\rangle_H, \dots, \langle k(\cdot,X_i) - \mu_n, \hat\phi_l\rangle_H\big) \in \mathbb{R}^l$ with $l \le n$, where $(\hat\phi_i)_{i=1}^l$ are the eigenfunctions of $\hat\Sigma$ corresponding to the top-$l$ eigenvalues.


Summary. The direct formulation requires knowledge of the feature map $\Phi$ (and of course $H$), which can be infinite dimensional: $\hat\Sigma \hat\phi_i = \hat\lambda_i \hat\phi_i$. The alternate formulation is determined entirely by kernel evaluations through the Gram matrix, $\hat\phi_i = \sum_{j=1}^n \alpha_{i,j}\, k(\cdot, X_j)$, where $\alpha_i$ satisfies $H_n K \alpha_i = \lambda_i \alpha_i$, but it scales poorly: $O(n^3)$.

Approximation Schemes: incomplete Cholesky factorization (e.g., Fine and Scheinberg, 2001); sketching (Yang et al., 2015); sparse greedy approximation (Smola and Schölkopf, 2000); the Nyström method (e.g., Williams and Seeger, 2001); random Fourier features (e.g., Rahimi and Recht, 2008a); ...

Random Fourier Approximation. Let $\mathcal{X} = \mathbb{R}^d$ and let $k$ be continuous and translation invariant, i.e., $k(x,y) = \psi(x-y)$. Bochner's theorem: $\psi$ is positive definite if and only if $k(x,y) = \int_{\mathbb{R}^d} e^{\sqrt{-1}\langle\omega,\, x-y\rangle_2}\, d\Lambda(\omega)$, where $\Lambda$ is a finite non-negative Borel measure on $\mathbb{R}^d$. Since $k$ is symmetric, $\Lambda$ is a symmetric measure on $\mathbb{R}^d$, and therefore $k(x,y) = \int_{\mathbb{R}^d} \cos(\langle\omega,\, x-y\rangle_2)\, d\Lambda(\omega)$.


Random Feature Approximation (Rahimi and Recht, 2008a): draw $(\omega_j)_{j=1}^m \stackrel{\mathrm{i.i.d.}}{\sim} \Lambda$ and set $k_m(x,y) = \frac{1}{m}\sum_{j=1}^m \cos(\langle\omega_j,\, x-y\rangle_2) = \langle\Phi_m(x), \Phi_m(y)\rangle_{\mathbb{R}^{2m}}$, which approximates $k(x,y) = \langle k(\cdot,x), k(\cdot,y)\rangle_H = \langle\Phi(x), \Phi(y)\rangle_H$, where $\Phi_m(x) = \frac{1}{\sqrt{m}}\big(\cos(\langle\omega_1, x\rangle_2), \dots, \cos(\langle\omega_m, x\rangle_2), \sin(\langle\omega_1, x\rangle_2), \dots, \sin(\langle\omega_m, x\rangle_2)\big)$, with coordinate functions denoted $\varphi_1(x), \varphi_2(x), \dots$. Idea: apply PCA to $\Phi_m(X)$.

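A minimal sketch of this feature map for the Gaussian kernel $k(x,y) = \exp(-\|x-y\|_2^2/(2\sigma^2))$, whose spectral measure $\Lambda$ is $N(0, \sigma^{-2}I)$; the kernel choice and bandwidth are assumptions for illustration.

```python
# Minimal sketch of the random Fourier feature map Phi_m for a Gaussian kernel.
import numpy as np

rng = np.random.default_rng(0)
d, m, sigma = 3, 2000, 1.0
Omega = rng.standard_normal((m, d)) / sigma        # (omega_j)_{j=1}^m drawn i.i.d. from Lambda

def rff(X):
    """Phi_m : R^d -> R^{2m}, x -> (cos<omega_j, x>, sin<omega_j, x>)_j / sqrt(m)."""
    Z = X @ Omega.T
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(m)

x, y = rng.standard_normal(d), rng.standard_normal(d)
k_exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
k_m = (rff(x[None, :]) @ rff(y[None, :]).T)[0, 0]  # <Phi_m(x), Phi_m(y)>
print(k_exact, k_m)                                # close for large m
```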

Approximate Kernel PCA (RF-KPCA). Perform linear PCA on $\Phi_m(X)$, where $X \sim P$. Approximate KPCA finds the $\beta \in \mathbb{R}^m$ that solves $\sup_{\|\beta\|_2=1} \mathrm{Var}[\langle\beta, \Phi_m(X)\rangle_2] = \sup_{\|\beta\|_2=1} \langle\beta, \Sigma_m \beta\rangle_2$, where $\Sigma_m := \mathbb{E}[\Phi_m(X) \otimes_2 \Phi_m(X)] - \mathbb{E}[\Phi_m(X)] \otimes_2 \mathbb{E}[\Phi_m(X)]$. This is the same as performing kernel PCA in $H_m$, where $H_m = \big\{f = \sum_{i=1}^m \beta_i \varphi_i : \beta \in \mathbb{R}^m\big\}$ is the RKHS induced by the reproducing kernel $k_m$, with $\langle f, g\rangle_{H_m} = \langle\beta_f, \beta_g\rangle_2$.


Empirical RF-KPCA. The empirical counterpart is $\hat\beta_m = \arg\sup_{\|\beta\|_2=1} \langle\beta, \hat\Sigma_m \beta\rangle_2$, where $\hat\Sigma_m := \frac{1}{n}\sum_{i=1}^n \Phi_m(X_i) \otimes_2 \Phi_m(X_i) - \Big(\frac{1}{n}\sum_{i=1}^n \Phi_m(X_i)\Big) \otimes_2 \Big(\frac{1}{n}\sum_{i=1}^n \Phi_m(X_i)\Big)$. Eigendecomposition: $\hat\Sigma_m = \sum_{i=1}^m \hat\lambda_{i,m}\, \hat\phi_{i,m} \otimes_2 \hat\phi_{i,m}$. $\hat\beta_m$ is obtained by solving an $m \times m$ eigensystem, with complexity $O(m^3)$. What happens statistically?

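A minimal sketch of this step, assuming the Gaussian kernel and its random Fourier features from above: linear PCA on the centered feature matrix replaces the $n \times n$ Gram eigenproblem, and for large $m$ its top eigenvalues approach those of exact empirical kernel PCA.

```python
# Minimal sketch of empirical RF-KPCA: PCA on random Fourier features.
import numpy as np

rng = np.random.default_rng(0)
n, d, m, sigma, l = 1000, 3, 200, 1.0, 5
X = rng.standard_normal((n, d))
Omega = rng.standard_normal((m, d)) / sigma

Z = X @ Omega.T
Phi = np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(m)    # n x 2m feature matrix

Phi_c = Phi - Phi.mean(axis=0)                          # center the features
Sigma_m = Phi_c.T @ Phi_c / n                           # empirical covariance of Phi_m(X)
evals, evecs = np.linalg.eigh(Sigma_m)
lam_hat = evals[::-1][:l]
beta_hat = evecs[:, ::-1][:, :l]
scores = Phi_c @ beta_hat                               # n x l RF-KPCA representation

# sanity check against exact empirical kernel PCA (eigenvalues of the centered Gram matrix / n)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
H = np.eye(n) - np.ones((n, n)) / n
print("RF-KPCA eigenvalues:   ", lam_hat)
print("exact KPCA eigenvalues:", np.linalg.eigvalsh(H @ K @ H)[::-1][:l] / n)
```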

How good is the approximation? (Sriperumbudur and Szabó, 2015): $\sup_{x,y \in S} |k_m(x,y) - k(x,y)| = O_{\mathrm{a.s.}}\Big(\sqrt{\tfrac{\log |S|}{m}}\Big)$, which is the optimal convergence rate. Other results are known, but they are non-optimal (Rahimi and Recht, 2008a; Sutherland and Schneider, 2015).

What happens statistically? Kernel ridge regression: $(X_i, Y_i)_{i=1}^n \stackrel{\mathrm{i.i.d.}}{\sim} \rho_{XY}$, with Bayes risk $R_P = \inf_{f \in L^2(\rho_X)} \mathbb{E}|f(X) - Y|^2 = \mathbb{E}|f_\star(X) - Y|^2$, attained by the regression function $f_\star$. Penalized risk minimization, $O(n^3)$: $f_n = \arg\inf_{f \in H} \frac{1}{n}\sum_{i=1}^n |Y_i - f(X_i)|^2 + \lambda\|f\|_H^2$. Approximate penalized risk minimization, $O(m^2 n)$: $f_{m,n} = \arg\inf_{f \in H_m} \frac{1}{n}\sum_{i=1}^n |Y_i - f(X_i)|^2 + \lambda\|f\|_{H_m}^2$.
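A minimal sketch contrasting the two estimators, assuming a Gaussian kernel and synthetic data: the exact estimator requires an $O(n^3)$ solve with the Gram matrix, while the random-feature version solves a $2m \times 2m$ system built at cost $O(m^2 n)$.

```python
# Minimal sketch: exact kernel ridge regression vs. ridge regression on random Fourier features.
import numpy as np

rng = np.random.default_rng(0)
n, d, m, sigma, lam = 500, 2, 100, 1.0, 1e-2
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# exact KRR: alpha = (K + n*lam*I)^{-1} y, so f_n(x) = sum_i alpha_i k(x, X_i)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

# approximate KRR: ridge regression on Phi_m(X) in R^{2m}
Omega = rng.standard_normal((m, d)) / sigma
Phi = np.hstack([np.cos(X @ Omega.T), np.sin(X @ Omega.T)]) / np.sqrt(m)
w = np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(2 * m), Phi.T @ y)

print("training error (exact KRR):", np.mean((K @ alpha - y) ** 2))
print("training error (RFF KRR):  ", np.mean((Phi @ w - y) ** 2))
```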

What happens statistically? Decompose $R_P(f_{m,n}) - R_P = \big(R_P(f_{m,n}) - R_P(f_n)\big) + \big(R_P(f_n) - R_P\big)$, where $R_P(f_{m,n}) = \mathbb{E}|f_{m,n}(X) - Y|^2$ and the first term is the error due to approximation. (Rahimi and Recht, 2008b): error of order $(m \wedge n)^{-1/2}$. (Rudi and Rosasco, 2016): if $m \asymp n^{\alpha}$ with $\frac{1}{2} \le \alpha < 1$, where $\alpha$ depends on the properties of the target function, then $f_{m,n}$ achieves the minimax optimal rate obtained in the case with no approximation. Similar results hold for the Nyström approximation (Bach, 2013; Alaoui and Mahoney, 2015; Rudi et al., 2015). Computational gain with no statistical loss!


What happens statistically? Two notions for PCA: reconstruction error and convergence of eigenspaces.

Reconstruction Error. Linear PCA: $\mathbb{E}_{X\sim P}\big\|(X-\mu) - \sum_i \langle X-\mu, \phi_i\rangle_2\, \phi_i\big\|_2^2$. Kernel PCA: $\mathbb{E}_{X\sim P}\big\|\tilde k(\cdot,X) - \sum_i \langle \tilde k(\cdot,X), \phi_i\rangle_H\, \phi_i\big\|_H^2$, where $\tilde k(\cdot,x) = k(\cdot,x) - \int k(\cdot,x)\, dP(x)$. However, the eigenfunctions of approximate empirical KPCA lie in $H_m$, which is finite dimensional and not contained in $H$.

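As a training-sample sanity check of this notion (in the $H$-norm, before any embedding): for empirical kernel PCA with $l$ components, the average reconstruction error over the sample equals $\frac{1}{n}$ times the sum of the trailing eigenvalues of the centered Gram matrix. A minimal sketch, assuming a Gaussian kernel:

```python
# Minimal sketch: empirical kernel PCA reconstruction error on the training sample
# equals the tail sum of the centered Gram eigenvalues divided by n.
import numpy as np

rng = np.random.default_rng(0)
n, d, l, sigma = 300, 2, 10, 1.0
X = rng.standard_normal((n, d))

K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
H = np.eye(n) - np.ones((n, n)) / n
Kc = H @ K @ H                                   # centered Gram matrix
e, U = np.linalg.eigh(Kc)
e, U = e[::-1], U[:, ::-1]                       # descending eigenvalues e_i, eigenvectors u_i

# squared projection of the centered feature of X_j onto the i-th eigenfunction is e_i * U[j, i]^2
proj_sq = (U[:, :l] ** 2) * e[:l]                # n x l
recon_err = np.mean(np.diag(Kc) - proj_sq.sum(axis=1))
print(recon_err, e[l:].sum() / n)                # the two agree up to numerical error
```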

Embedding to $L^2(P)$. What do we have? Population eigenfunctions $(\phi_i)_{i \in I}$ of $\Sigma$: these form a subspace in $H$. Empirical eigenfunctions $(\hat\phi_i)_{i=1}^n$ of $\hat\Sigma$: these form a subspace in $H$. Eigenvectors after approximation, $(\hat\phi_{i,m})_{i=1}^m$ of $\hat\Sigma_m$: these form a subspace in $\mathbb{R}^m$. We embed them in a common space before comparing; the common space is $L^2(P)$. Inclusion operator: $I : H \to L^2(P)$, $f \mapsto f - \int_{\mathcal{X}} f(x)\, dP(x)$. Approximation operator: $U : \mathbb{R}^m \to L^2(P)$, $\alpha \mapsto \sum_{i=1}^m \alpha_i \big(\varphi_i - \int_{\mathcal{X}} \varphi_i(x)\, dP(x)\big)$.


Properties. $\Sigma = I^* I$; both $I$ and $I^*$ are Hilbert-Schmidt and $\Sigma$ is trace class. $\Sigma_m = U^* U$; both $U$ and $U^*$ are Hilbert-Schmidt and $\Sigma_m$ is trace class. Moreover, $\|\Sigma - \hat\Sigma\|_{HS} = O_P\big(n^{-1/2}\big)$, $\|\Sigma_m - \hat\Sigma_m\|_{HS} = O_P\big(n^{-1/2}\big)$, and $\|II^* - UU^*\|_{HS} = O_\Lambda\big(m^{-1/2}\big)$.


Reconstruction Error. $\big(\frac{I\phi_i}{\sqrt{\lambda_i}}\big)_i$ form an ONS for $L^2(P)$. Define $\tilde k(\cdot,x) = k(\cdot,x) - \mu_P$ and fix $\tau > 0$. Population KPCA: $R_l = \mathbb{E}\Big\|I\tilde k(\cdot,X) - \sum_{i=1}^l \Big\langle I\tilde k(\cdot,X), \frac{I\phi_i}{\sqrt{\lambda_i}}\Big\rangle_{L^2(P)} \frac{I\phi_i}{\sqrt{\lambda_i}}\Big\|_{L^2(P)}^2 = \|\Sigma - \Sigma^{1/2}\Sigma_l^{-1}\Sigma^{3/2}\|_{HS}^2 = \|\Sigma - \Sigma_l\|_{HS}^2$. Empirical KPCA: $R_{n,l} = \mathbb{E}\Big\|I\tilde k(\cdot,X) - \sum_{i=1}^l \Big\langle I\tilde k(\cdot,X), \frac{I\hat\phi_i}{\sqrt{\hat\lambda_i}}\Big\rangle_{L^2(P)} \frac{I\hat\phi_i}{\sqrt{\hat\lambda_i}}\Big\|_{L^2(P)}^2 = \|\Sigma - \Sigma^{1/2}\hat\Sigma_l^{-1}\Sigma^{3/2}\|_{HS}^2$. Approximate empirical KPCA: $R_{m,n,l} = \mathbb{E}\Big\|I\tilde k(\cdot,X) - \sum_{i=1}^l \Big\langle I\tilde k(\cdot,X), \frac{U\hat\phi_{i,m}}{\sqrt{\hat\lambda_{i,m}}}\Big\rangle_{L^2(P)} \frac{U\hat\phi_{i,m}}{\sqrt{\hat\lambda_{i,m}}}\Big\|_{L^2(P)}^2 = \big\|\big(I - U\hat\Sigma_{m,l}^{-1}U^*\big)II^*\big\|_{HS}^2$.


Result. Clearly $R_l \to 0$ as $l \to \infty$; the goal is to study the convergence rates of $R_l$, $R_{n,l}$ and $R_{m,n,l}$ as $l, m, n \to \infty$. Suppose $\lambda_i \asymp i^{-\alpha}$ with $\alpha > \frac12$, $l = n^{\theta/\alpha}$ and $m = n^{\gamma}$, where $\theta > 0$ and $0 < \gamma < 1$. Then $R_l \lesssim n^{-2\theta(1 - \frac{1}{2\alpha})}$,
$R_{n,l} \lesssim \begin{cases} n^{-2\theta(1 - \frac{1}{2\alpha})}, & 0 < \theta \le \frac{\alpha}{2(3\alpha-1)} \quad \text{(bias dominates)} \\ n^{-(\frac12 - \theta)}, & \frac{\alpha}{2(3\alpha-1)} \le \theta < \frac12 \quad \text{(variance dominates)} \end{cases}$
and, for $\gamma > 2\theta$,
$R_{m,n,l} \lesssim \begin{cases} n^{-2\theta(1 - \frac{1}{2\alpha})}, & 0 < \theta \le \frac{\alpha}{2(3\alpha-1)} \\ n^{-(\frac12 - \theta)}, & \frac{\alpha}{2(3\alpha-1)} \le \theta < \frac12 \end{cases}$
i.e., no statistical loss.

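For concreteness (simple arithmetic on the displayed rates): if $\alpha = 2$, the threshold is $\frac{\alpha}{2(3\alpha-1)} = \frac15$. Choosing $\theta = \frac15$ balances the two regimes and gives $R_{n,l} \asymp n^{-3/10}$, and the approximate version attains the same $n^{-3/10}$ rate whenever $\gamma > 2\theta = \frac25$, i.e., with only $m \gtrsim n^{2/5}$ random features.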

Convergence of Projection Operators I. Since $\big(\frac{I\phi_i}{\sqrt{\lambda_i}}\big)_i$ form an ONS for $L^2(P)$, we consider, for empirical KPCA, $S_{n,l} = \Big\|\sum_{i=1}^l \frac{I\phi_i}{\sqrt{\lambda_i}} \otimes \frac{I\phi_i}{\sqrt{\lambda_i}} - \sum_{i=1}^l \frac{I\hat\phi_i}{\sqrt{\hat\lambda_i}} \otimes \frac{I\hat\phi_i}{\sqrt{\hat\lambda_i}}\Big\|_{\mathrm{op}}$, and, for approximate empirical KPCA, $S_{m,n,l} = \Big\|\sum_{i=1}^l \frac{I\phi_i}{\sqrt{\lambda_i}} \otimes \frac{I\phi_i}{\sqrt{\lambda_i}} - \sum_{i=1}^l \frac{U\hat\phi_{i,m}}{\sqrt{\hat\lambda_{i,m}}} \otimes \frac{U\hat\phi_{i,m}}{\sqrt{\hat\lambda_{i,m}}}\Big\|_{\mathrm{op}}$, as $l, m, n \to \infty$.


Convergence of Projection Operators I. Unlike the reconstruction error, the convergence of the projection operators depends on the behavior of the eigen-gap $\delta_i = \frac12(\lambda_i - \lambda_{i+1})$, $i \in \mathbb{N}$. Empirical KPCA (assuming $\delta_l \gtrsim n^{-1/2}$ and $\lambda_l \gtrsim n^{-1/2}$): $S_{n,l} \lesssim \frac{1}{\delta_l \sqrt{n}} + \frac{1}{n^{1/4}\sqrt{\lambda_l}}$. Approximate empirical KPCA (assuming $\delta_l \gtrsim m^{-1/2}$ and $\lambda_l \gtrsim m^{-1/2}$): $S_{m,n,l} \lesssim \frac{1}{\delta_l \sqrt{m}} + \frac{1}{n^{1/4}\sqrt{\lambda_l}}$.


Convergence of Projection Operators I. Suppose $\lambda_i \asymp i^{-\alpha}$, $\alpha > \frac12$, $\delta_i \asymp i^{-\beta}$, $\beta \ge \alpha$, $l = n^{\theta/\alpha}$ and $m = n^{\gamma}$, where $0 < \theta < \frac12$ and $0 < \gamma < 1$. Then
$S_{n,l} \lesssim \begin{cases} n^{-(\frac14 - \frac{\theta}{2})}, & 0 < \theta < \frac{\alpha}{2(2\beta - \alpha)} \\ n^{-(\frac12 - \frac{\theta\beta}{\alpha})}, & \frac{\alpha}{2(2\beta - \alpha)} \le \theta < \frac{\alpha}{2\beta} \end{cases}$
and
$S_{m,n,l} \lesssim \begin{cases} n^{-(\frac14 - \frac{\theta}{2})}, & 0 < \theta < \frac{\alpha}{2(2\beta - \alpha)},\ \gamma \ge \frac12 + \theta\big(\frac{2\beta}{\alpha} - 1\big) \\ n^{-(\frac{\gamma}{2} - \frac{\theta\beta}{\alpha})}, & 0 < \theta < \frac{\alpha}{2(2\beta - \alpha)},\ \frac{2\theta\beta}{\alpha} < \gamma < \frac12 + \theta\big(\frac{2\beta}{\alpha} - 1\big) \end{cases}$


Convergence of Projection Operators II. Empirical KPCA: $T_{n,l} = \Big\|\sum_{i=1}^l I\phi_i \otimes_{L^2(P)} I\phi_i - \sum_{i=1}^l I\hat\phi_i \otimes_{L^2(P)} I\hat\phi_i\Big\|_{HS} = \Big\|\Sigma^{1/2}\Big(\sum_{i=1}^l \phi_i \otimes_H \phi_i - \sum_{i=1}^l \hat\phi_i \otimes_H \hat\phi_i\Big)\Sigma^{1/2}\Big\|_{HS} \lesssim \frac{\sqrt{l}\,\lambda_l}{\delta_l \sqrt{n}}$. Approximate empirical KPCA: $T_{m,n,l} = \Big\|\sum_{i=1}^l I\phi_i \otimes_{L^2(P)} I\phi_i - \sum_{i=1}^l U\hat\phi_{i,m} \otimes_{L^2(P)} U\hat\phi_{i,m}\Big\|_{HS} \lesssim \frac{\sqrt{l}\,\lambda_l}{\delta_l \sqrt{n}} + \frac{1}{\sqrt{m}}$. Convergence rates can be derived under the assumption $\lambda_i \asymp i^{-\alpha}$, $\alpha > \frac12$, $\delta_i \asymp i^{-\beta}$, $\beta \ge \alpha$, $l = n^{\theta/\alpha}$ and $m = n^{\gamma}$, where $0 < \theta < \frac12$ and $0 < \gamma < 1$.


Summary. Random feature approximation to kernel PCA improves its computational complexity. The statistical trade-off is studied through two notions: reconstruction error and convergence of eigenspaces. Open questions: lower bounds; extension to kernel canonical correlation analysis; the Nyström approximation.

Thank You

References I
Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404.
Fine, S. and Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243-264.
Rahimi, A. and Recht, B. (2008a). Random features for large-scale kernel machines. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors, Advances in Neural Information Processing Systems 20, pages 1177-1184. Curran Associates, Inc.
Rahimi, A. and Recht, B. (2008b). Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS, pages 1313-1320.
Rudi, A. and Rosasco, L. (2016). Generalization properties of learning with random features. https://arxiv.org/pdf/1602.04474.pdf.
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319.
Smola, A. J. and Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proc. 17th International Conference on Machine Learning, pages 911-918. Morgan Kaufmann, San Francisco, CA.
Sriperumbudur, B. K. and Szabó, Z. (2015). Optimal rates for random Fourier features. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 1144-1152. Curran Associates, Inc.
Sutherland, D. and Schneider, J. (2015). On the error of random Fourier features. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 862-871.

References II
Williams, C. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 682-688, Cambridge, MA. MIT Press.
Yang, Y., Pilanci, M., and Wainwright, M. J. (2015). Randomized sketches for kernels: Fast and optimal non-parametric regression. Technical report. https://arxiv.org/pdf/1501.06195.pdf.