Approximate Kernel PCA with Random Features


Approximate Kernel PCA with Random Features (Computational vs. Statistical Trade-off). Bharath K. Sriperumbudur, Department of Statistics, Pennsylvania State University. Journées de Statistique, Paris, May 28, 2018. Joint work with Nicholas Sterge (PSU). Supported by the National Science Foundation (DMS-1713011).

Outline: Principal Component Analysis (PCA); Kernel PCA; approximation methods; kernel PCA with random features; the computational vs. statistical trade-off.

Principal Component Analysis (PCA) (Pearson, 1901). Suppose $X \sim P$ is a random variable in $\mathbb{R}^d$ with mean $\mu$ and covariance matrix $\Sigma$. Find a direction $w \in \{v : \|v\|_2 = 1\}$ such that $\mathrm{Var}[\langle w, X\rangle_2] = \langle w, \Sigma w\rangle_2$ is maximized. Equivalently, find a direction $w \in \{v : \|v\|_2 = 1\}$ such that $\mathbb{E}\big\|(X-\mu) - \langle w, X-\mu\rangle_2\, w\big\|_2^2$ is minimized. The two formulations are equivalent, and the solution is the eigenvector of the covariance matrix $\Sigma$ corresponding to its largest eigenvalue. The idea generalizes to multiple directions (find a subspace). Applications: dimensionality reduction.

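As a concrete illustration, here is a minimal numpy sketch of this eigenvector formulation; the anisotropic synthetic data and the specific scalings are assumptions for illustration only.

```python
# Minimal sketch of linear PCA: the first principal direction is the top
# eigenvector of the sample covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.standard_normal((n, d)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])  # anisotropic cloud

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / n        # sample covariance matrix
evals, evecs = np.linalg.eigh(Sigma)     # eigenvalues in ascending order
w = evecs[:, -1]                         # direction maximizing Var[<w, X>_2]

print("top eigenvalue:  ", evals[-1])
print("variance along w:", np.var((X - mu) @ w))  # matches the top eigenvalue
```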

Kernel PCA: a nonlinear generalization of PCA (Schölkopf et al., 1998). Map $X \mapsto \Phi(X)$ through a feature map $\Phi$ and apply PCA. The choice of $\Phi$ determines how much information we capture about $X$: if $\Phi(X) = (1, X, X^2, X^3, \dots)$, then the covariance of $\Phi(X)$ captures the higher-order moments of $X$. $\Phi$ is not specified explicitly but implicitly, through a positive definite kernel function $k(x,y) = \langle\Phi(x), \Phi(y)\rangle$.

Kernel PCA: Functional Version. Find $f \in \{g \in H : \|g\|_H = 1\}$ such that $\mathrm{Var}[f(X)]$ is maximized, i.e., $f_\star = \arg\sup_{\|f\|_H = 1} \mathrm{Var}[f(X)]$, where $H$ is a reproducing kernel Hilbert space of real-valued functions (the evaluation functionals $f \mapsto f(x)$ are bounded for all $x \in \mathcal{X}$; Aronszajn, 1950). There exists a unique $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $k(\cdot, x) \in H$ for all $x \in \mathcal{X}$ and $f(x) = \langle k(\cdot, x), f\rangle_H$ for all $f \in H$, $x \in \mathcal{X}$. $k$ is called the reproducing kernel of $H$, since $k(x,y) = \langle k(\cdot,x), k(\cdot,y)\rangle_H = \langle\Phi(x), \Phi(y)\rangle_H$ with $\Phi(x) = k(\cdot,x)$, and it is symmetric and positive definite. In fact, the converse also holds (Moore-Aronszajn theorem).


RKHS ↔ kernels (positive definite and symmetric functions). The RKHS is $H = \overline{\mathrm{span}}\{k(\cdot, x) : x \in \mathcal{X}\}$, the (closed) linear span of the kernel functions. $k$ controls the properties of $f \in H$: if $k$ satisfies $(\star)$, then every $f \in H$ satisfies $(\star)$, where $(\star)$ is boundedness, continuity, measurability, integrability, or differentiability.


Kernel PCA generalizes linear PCA: with the linear kernel $k(x, y) = \langle x, y\rangle_2$, $x, y \in \mathbb{R}^d$, the RKHS $H$ is isometrically isomorphic to $\mathbb{R}^d$, and every $f \in H$ is of the form $f(x) = \langle w_f, x\rangle_2$.


Kernel PCA. Using the reproducing property $f(X) = \langle f, k(\cdot, X)\rangle_H$ (with $\Phi(X) = k(\cdot, X)$), we obtain $f_\star = \arg\sup_{\|f\|_H=1} \langle f, \Sigma f\rangle_H$, where $\Sigma := \int_{\mathcal{X}} \big(k(\cdot,x) - \mu_P\big) \otimes_H \big(k(\cdot,x) - \mu_P\big)\, dP(x)$ is the covariance operator (self-adjoint, positive and trace class) on $H$ and $\mu_P := \int_{\mathcal{X}} k(\cdot,x)\, dP(x)$ is the mean element. Spectral theorem: $\Sigma = \sum_{i \in I} \lambda_i\, \phi_i \otimes_H \phi_i$, where $I$ is either countable (with $\lambda_i \to 0$ as $i \to \infty$) or finite. As in linear PCA, the solution is the eigenfunction corresponding to the largest eigenvalue of $\Sigma$.


Empirical Kernel PCA. In practice, $P$ is unknown, but we have access to $(X_i)_{i=1}^n \stackrel{\mathrm{i.i.d.}}{\sim} P$. Then $\hat f = \arg\sup_{\|f\|_H=1} \langle f, \hat\Sigma f\rangle_H$, where $\hat\Sigma := \frac{1}{n}\sum_{i=1}^n \big(k(\cdot,X_i) - \mu_n\big) \otimes_H \big(k(\cdot,X_i) - \mu_n\big)$ is the empirical covariance operator (symmetric, positive and trace class) on $H$ and $\mu_n := \frac{1}{n}\sum_{i=1}^n k(\cdot, X_i)$. Spectral theorem: $\hat\Sigma = \sum_{i=1}^n \hat\lambda_i\, \hat\phi_i \otimes_H \hat\phi_i$. $\hat f$ is the eigenfunction corresponding to the largest eigenvalue of $\hat\Sigma$.


Empirical Kernel PCA: Representer Theorem. Since $\hat\Sigma$ is an infinite-dimensional operator, we apparently have to solve an infinite-dimensional eigensystem $\hat\Sigma \hat\phi_i = \hat\lambda_i \hat\phi_i$. Consider $\sup_{\|f\|_H=1} \langle f, \hat\Sigma f\rangle_H = \sup_{\|f\|_H=1} \frac{1}{n}\sum_{i=1}^n \Big(\big\langle f, k(\cdot,X_i)\big\rangle_H - \big\langle f, \tfrac{1}{n}\sum_{j=1}^n k(\cdot,X_j)\big\rangle_H\Big)^2$. Clearly the maximizer satisfies $f \in \mathrm{span}\{k(\cdot,X_i) : i=1,\dots,n\}$, i.e., $f = \sum_{i=1}^n \alpha_i k(\cdot, X_i)$. Hence $\sup\big\{\langle f, \hat\Sigma f\rangle_H : \|f\|_H = 1\big\} = \sup\big\{\alpha^\top K H_n K \alpha : \alpha^\top K \alpha = 1\big\}$, i.e., $K H_n K \alpha = \lambda K \alpha$, where $K_{ij} = k(X_i, X_j)$ is the Gram matrix and $H_n = I_n - \tfrac{1}{n}\mathbf{1}_n\mathbf{1}_n^\top$ is the centering matrix. This requires $K$ to be invertible.

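A minimal numpy sketch of this Gram-matrix route, assuming a Gaussian kernel on synthetic data; it eigendecomposes the symmetric centered matrix $H_n K H_n$, which has the same nonzero eigenvalues as the generalized problem above.

```python
# Minimal sketch of empirical kernel PCA via the centered Gram matrix.
import numpy as np

rng = np.random.default_rng(0)
n, d, l = 200, 2, 5
X = rng.standard_normal((n, d))

def gauss_kernel(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

K = gauss_kernel(X, X)
H = np.eye(n) - np.ones((n, n)) / n          # centering matrix H_n
Kc = H @ K @ H                               # centered Gram matrix
evals, evecs = np.linalg.eigh(Kc)
evals, evecs = evals[::-1], evecs[:, ::-1]   # descending order

lam_hat = evals[:l] / n                      # top-l eigenvalues of the empirical covariance operator
alpha = evecs[:, :l] / np.sqrt(evals[:l])    # expansion coefficients giving unit-norm eigenfunctions

scores = Kc @ alpha                          # n x l kernel PCA scores <k(.,X_i) - mu_n, phi_hat_j>_H
print("top-l eigenvalues:", lam_hat)
```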

Empirical Kernel PCA. In classical PCA, $X_i \in \mathbb{R}^d$, $i=1,\dots,n$, is represented as $\big(\langle X_i - \mu, \hat w_1\rangle_2, \dots, \langle X_i - \mu, \hat w_l\rangle_2\big) \in \mathbb{R}^l$ with $l \le d$, where $(\hat w_i)_{i=1}^l$ are the eigenvectors of $\hat\Sigma$ corresponding to the top-$l$ eigenvalues. In kernel PCA, $X_i \in \mathcal{X}$, $i=1,\dots,n$, is represented as $\big(\langle k(\cdot,X_i) - \mu_n, \hat\phi_1\rangle_H, \dots, \langle k(\cdot,X_i) - \mu_n, \hat\phi_l\rangle_H\big) \in \mathbb{R}^l$ with $l \le n$, where $(\hat\phi_i)_{i=1}^l$ are the eigenfunctions of $\hat\Sigma$ corresponding to the top-$l$ eigenvalues.


Summary. The direct formulation requires knowledge of the feature map $\Phi$ (and of course $H$), which can be infinite dimensional: $\hat\Sigma \hat\phi_i = \hat\lambda_i \hat\phi_i$. The alternate formulation is determined entirely by kernel evaluations through the Gram matrix, $\hat\phi_i = \sum_{j=1}^n \alpha_{i,j}\, k(\cdot, X_j)$, where $\alpha_i$ satisfies $H_n K \alpha_i = \lambda_i \alpha_i$, but it scales poorly: $O(n^3)$.

Approximation Schemes: incomplete Cholesky factorization (e.g., Fine and Scheinberg, 2001); sketching (Yang et al., 2015); sparse greedy approximation (Smola and Schölkopf, 2000); the Nyström method (e.g., Williams and Seeger, 2001); random Fourier features (e.g., Rahimi and Recht, 2008a); ...

Random Fourier Approximation. Let $\mathcal{X} = \mathbb{R}^d$ and let $k$ be continuous and translation invariant, i.e., $k(x,y) = \psi(x-y)$. Bochner's theorem: $\psi$ is positive definite if and only if $k(x,y) = \int_{\mathbb{R}^d} e^{\sqrt{-1}\langle\omega,\, x-y\rangle_2}\, d\Lambda(\omega)$, where $\Lambda$ is a finite non-negative Borel measure on $\mathbb{R}^d$. Since $k$ is symmetric, $\Lambda$ is a symmetric measure on $\mathbb{R}^d$, and therefore $k(x,y) = \int_{\mathbb{R}^d} \cos(\langle\omega,\, x-y\rangle_2)\, d\Lambda(\omega)$.


Random Feature Approximation (Rahimi and Recht, 2008a): draw $(\omega_j)_{j=1}^m \stackrel{\mathrm{i.i.d.}}{\sim} \Lambda$ and set $k_m(x,y) = \frac{1}{m}\sum_{j=1}^m \cos(\langle\omega_j,\, x-y\rangle_2) = \langle\Phi_m(x), \Phi_m(y)\rangle_{\mathbb{R}^{2m}}$, which approximates $k(x,y) = \langle k(\cdot,x), k(\cdot,y)\rangle_H = \langle\Phi(x), \Phi(y)\rangle_H$, where $\Phi_m(x) = \frac{1}{\sqrt{m}}\big(\cos(\langle\omega_1, x\rangle_2), \dots, \cos(\langle\omega_m, x\rangle_2), \sin(\langle\omega_1, x\rangle_2), \dots, \sin(\langle\omega_m, x\rangle_2)\big)$, with coordinate functions denoted $\varphi_1(x), \varphi_2(x), \dots$. Idea: apply PCA to $\Phi_m(X)$.

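A minimal sketch of this feature map for the Gaussian kernel $k(x,y) = \exp(-\|x-y\|_2^2/(2\sigma^2))$, whose spectral measure $\Lambda$ is $N(0, \sigma^{-2}I)$; the kernel choice and bandwidth are assumptions for illustration.

```python
# Minimal sketch of the random Fourier feature map Phi_m for a Gaussian kernel.
import numpy as np

rng = np.random.default_rng(0)
d, m, sigma = 3, 2000, 1.0
Omega = rng.standard_normal((m, d)) / sigma        # (omega_j)_{j=1}^m drawn i.i.d. from Lambda

def rff(X):
    """Phi_m : R^d -> R^{2m}, x -> (cos<omega_j, x>, sin<omega_j, x>)_j / sqrt(m)."""
    Z = X @ Omega.T
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(m)

x, y = rng.standard_normal(d), rng.standard_normal(d)
k_exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
k_m = (rff(x[None, :]) @ rff(y[None, :]).T)[0, 0]  # <Phi_m(x), Phi_m(y)>
print(k_exact, k_m)                                # close for large m
```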

Approximate Kernel PCA (RF-KPCA). Perform linear PCA on $\Phi_m(X)$, where $X \sim P$. Approximate KPCA finds the $\beta \in \mathbb{R}^m$ that solves $\sup_{\|\beta\|_2=1} \mathrm{Var}[\langle\beta, \Phi_m(X)\rangle_2] = \sup_{\|\beta\|_2=1} \langle\beta, \Sigma_m \beta\rangle_2$, where $\Sigma_m := \mathbb{E}[\Phi_m(X) \otimes_2 \Phi_m(X)] - \mathbb{E}[\Phi_m(X)] \otimes_2 \mathbb{E}[\Phi_m(X)]$. This is the same as performing kernel PCA in $H_m$, where $H_m = \big\{f = \sum_{i=1}^m \beta_i \varphi_i : \beta \in \mathbb{R}^m\big\}$ is the RKHS induced by the reproducing kernel $k_m$, with $\langle f, g\rangle_{H_m} = \langle\beta_f, \beta_g\rangle_2$.


Empirical RF-KPCA. The empirical counterpart is $\hat\beta_m = \arg\sup_{\|\beta\|_2=1} \langle\beta, \hat\Sigma_m \beta\rangle_2$, where $\hat\Sigma_m := \frac{1}{n}\sum_{i=1}^n \Phi_m(X_i) \otimes_2 \Phi_m(X_i) - \Big(\frac{1}{n}\sum_{i=1}^n \Phi_m(X_i)\Big) \otimes_2 \Big(\frac{1}{n}\sum_{i=1}^n \Phi_m(X_i)\Big)$. Eigendecomposition: $\hat\Sigma_m = \sum_{i=1}^m \hat\lambda_{i,m}\, \hat\phi_{i,m} \otimes_2 \hat\phi_{i,m}$. $\hat\beta_m$ is obtained by solving an $m \times m$ eigensystem, with complexity $O(m^3)$. What happens statistically?

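A minimal sketch of this step, assuming the Gaussian kernel and its random Fourier features from above: linear PCA on the centered feature matrix replaces the $n \times n$ Gram eigenproblem, and for large $m$ its top eigenvalues approach those of exact empirical kernel PCA.

```python
# Minimal sketch of empirical RF-KPCA: PCA on random Fourier features.
import numpy as np

rng = np.random.default_rng(0)
n, d, m, sigma, l = 1000, 3, 200, 1.0, 5
X = rng.standard_normal((n, d))
Omega = rng.standard_normal((m, d)) / sigma

Z = X @ Omega.T
Phi = np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(m)    # n x 2m feature matrix

Phi_c = Phi - Phi.mean(axis=0)                          # center the features
Sigma_m = Phi_c.T @ Phi_c / n                           # empirical covariance of Phi_m(X)
evals, evecs = np.linalg.eigh(Sigma_m)
lam_hat = evals[::-1][:l]
beta_hat = evecs[:, ::-1][:, :l]
scores = Phi_c @ beta_hat                               # n x l RF-KPCA representation

# sanity check against exact empirical kernel PCA (eigenvalues of the centered Gram matrix / n)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
H = np.eye(n) - np.ones((n, n)) / n
print("RF-KPCA eigenvalues:   ", lam_hat)
print("exact KPCA eigenvalues:", np.linalg.eigvalsh(H @ K @ H)[::-1][:l] / n)
```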

How good is the approximation? (Sriperumbudur and Szabó, 2015): $\sup_{x,y \in S} |k_m(x,y) - k(x,y)| = O_{\mathrm{a.s.}}\Big(\sqrt{\tfrac{\log |S|}{m}}\Big)$, which is the optimal convergence rate. Other results are known, but they are non-optimal (Rahimi and Recht, 2008a; Sutherland and Schneider, 2015).

What happens statistically? Kernel ridge regression: $(X_i, Y_i)_{i=1}^n \stackrel{\mathrm{i.i.d.}}{\sim} \rho_{XY}$, with Bayes risk $R_P = \inf_{f \in L^2(\rho_X)} \mathbb{E}|f(X) - Y|^2 = \mathbb{E}|f_\star(X) - Y|^2$, attained by the regression function $f_\star$. Penalized risk minimization, $O(n^3)$: $f_n = \arg\inf_{f \in H} \frac{1}{n}\sum_{i=1}^n |Y_i - f(X_i)|^2 + \lambda\|f\|_H^2$. Approximate penalized risk minimization, $O(m^2 n)$: $f_{m,n} = \arg\inf_{f \in H_m} \frac{1}{n}\sum_{i=1}^n |Y_i - f(X_i)|^2 + \lambda\|f\|_{H_m}^2$.
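A minimal sketch contrasting the two estimators, assuming a Gaussian kernel and synthetic data: the exact estimator requires an $O(n^3)$ solve with the Gram matrix, while the random-feature version solves a $2m \times 2m$ system built at cost $O(m^2 n)$.

```python
# Minimal sketch: exact kernel ridge regression vs. ridge regression on random Fourier features.
import numpy as np

rng = np.random.default_rng(0)
n, d, m, sigma, lam = 500, 2, 100, 1.0, 1e-2
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# exact KRR: alpha = (K + n*lam*I)^{-1} y, so f_n(x) = sum_i alpha_i k(x, X_i)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

# approximate KRR: ridge regression on Phi_m(X) in R^{2m}
Omega = rng.standard_normal((m, d)) / sigma
Phi = np.hstack([np.cos(X @ Omega.T), np.sin(X @ Omega.T)]) / np.sqrt(m)
w = np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(2 * m), Phi.T @ y)

print("training error (exact KRR):", np.mean((K @ alpha - y) ** 2))
print("training error (RFF KRR):  ", np.mean((Phi @ w - y) ** 2))
```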

What happens statistically? Decompose $R_P(f_{m,n}) - R_P = \big(R_P(f_{m,n}) - R_P(f_n)\big) + \big(R_P(f_n) - R_P\big)$, where $R_P(f_{m,n}) = \mathbb{E}|f_{m,n}(X) - Y|^2$ and the first term is the error due to approximation. (Rahimi and Recht, 2008b): error of order $(m \wedge n)^{-1/2}$. (Rudi and Rosasco, 2016): if $m \asymp n^{\alpha}$ with $\frac{1}{2} \le \alpha < 1$, where $\alpha$ depends on the properties of the target function, then $f_{m,n}$ achieves the minimax optimal rate obtained in the case with no approximation. Similar results hold for the Nyström approximation (Bach, 2013; Alaoui and Mahoney, 2015; Rudi et al., 2015). Computational gain with no statistical loss!


What happens statistically? Two notions for PCA: reconstruction error and convergence of eigenspaces.

Reconstruction Error. Linear PCA: $\mathbb{E}_{X\sim P}\big\|(X-\mu) - \sum_i \langle X-\mu, \phi_i\rangle_2\, \phi_i\big\|_2^2$. Kernel PCA: $\mathbb{E}_{X\sim P}\big\|\tilde k(\cdot,X) - \sum_i \langle \tilde k(\cdot,X), \phi_i\rangle_H\, \phi_i\big\|_H^2$, where $\tilde k(\cdot,x) = k(\cdot,x) - \int k(\cdot,x)\, dP(x)$. However, the eigenfunctions of approximate empirical KPCA lie in $H_m$, which is finite dimensional and not contained in $H$.

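As a training-sample sanity check of this notion (in the $H$-norm, before any embedding): for empirical kernel PCA with $l$ components, the average reconstruction error over the sample equals $\frac{1}{n}$ times the sum of the trailing eigenvalues of the centered Gram matrix. A minimal sketch, assuming a Gaussian kernel:

```python
# Minimal sketch: empirical kernel PCA reconstruction error on the training sample
# equals the tail sum of the centered Gram eigenvalues divided by n.
import numpy as np

rng = np.random.default_rng(0)
n, d, l, sigma = 300, 2, 10, 1.0
X = rng.standard_normal((n, d))

K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
H = np.eye(n) - np.ones((n, n)) / n
Kc = H @ K @ H                                   # centered Gram matrix
e, U = np.linalg.eigh(Kc)
e, U = e[::-1], U[:, ::-1]                       # descending eigenvalues e_i, eigenvectors u_i

# squared projection of the centered feature of X_j onto the i-th eigenfunction is e_i * U[j, i]^2
proj_sq = (U[:, :l] ** 2) * e[:l]                # n x l
recon_err = np.mean(np.diag(Kc) - proj_sq.sum(axis=1))
print(recon_err, e[l:].sum() / n)                # the two agree up to numerical error
```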

Embedding to $L^2(P)$. What do we have? Population eigenfunctions $(\phi_i)_{i \in I}$ of $\Sigma$: these form a subspace in $H$. Empirical eigenfunctions $(\hat\phi_i)_{i=1}^n$ of $\hat\Sigma$: these form a subspace in $H$. Eigenvectors after approximation, $(\hat\phi_{i,m})_{i=1}^m$ of $\hat\Sigma_m$: these form a subspace in $\mathbb{R}^m$. We embed them in a common space before comparing; the common space is $L^2(P)$. Inclusion operator: $I : H \to L^2(P)$, $f \mapsto f - \int_{\mathcal{X}} f(x)\, dP(x)$. Approximation operator: $U : \mathbb{R}^m \to L^2(P)$, $\alpha \mapsto \sum_{i=1}^m \alpha_i \big(\varphi_i - \int_{\mathcal{X}} \varphi_i(x)\, dP(x)\big)$.


Properties. $\Sigma = I^* I$; both $I$ and $I^*$ are Hilbert-Schmidt and $\Sigma$ is trace class. $\Sigma_m = U^* U$; both $U$ and $U^*$ are Hilbert-Schmidt and $\Sigma_m$ is trace class. Moreover, $\|\Sigma - \hat\Sigma\|_{HS} = O_P\big(n^{-1/2}\big)$, $\|\Sigma_m - \hat\Sigma_m\|_{HS} = O_P\big(n^{-1/2}\big)$, and $\|II^* - UU^*\|_{HS} = O_\Lambda\big(m^{-1/2}\big)$.


Reconstruction Error. $\big(\frac{I\phi_i}{\sqrt{\lambda_i}}\big)_i$ form an ONS for $L^2(P)$. Define $\tilde k(\cdot,x) = k(\cdot,x) - \mu_P$ and fix $\tau > 0$. Population KPCA: $R_l = \mathbb{E}\Big\|I\tilde k(\cdot,X) - \sum_{i=1}^l \Big\langle I\tilde k(\cdot,X), \frac{I\phi_i}{\sqrt{\lambda_i}}\Big\rangle_{L^2(P)} \frac{I\phi_i}{\sqrt{\lambda_i}}\Big\|_{L^2(P)}^2 = \|\Sigma - \Sigma^{1/2}\Sigma_l^{-1}\Sigma^{3/2}\|_{HS}^2 = \|\Sigma - \Sigma_l\|_{HS}^2$. Empirical KPCA: $R_{n,l} = \mathbb{E}\Big\|I\tilde k(\cdot,X) - \sum_{i=1}^l \Big\langle I\tilde k(\cdot,X), \frac{I\hat\phi_i}{\sqrt{\hat\lambda_i}}\Big\rangle_{L^2(P)} \frac{I\hat\phi_i}{\sqrt{\hat\lambda_i}}\Big\|_{L^2(P)}^2 = \|\Sigma - \Sigma^{1/2}\hat\Sigma_l^{-1}\Sigma^{3/2}\|_{HS}^2$. Approximate empirical KPCA: $R_{m,n,l} = \mathbb{E}\Big\|I\tilde k(\cdot,X) - \sum_{i=1}^l \Big\langle I\tilde k(\cdot,X), \frac{U\hat\phi_{i,m}}{\sqrt{\hat\lambda_{i,m}}}\Big\rangle_{L^2(P)} \frac{U\hat\phi_{i,m}}{\sqrt{\hat\lambda_{i,m}}}\Big\|_{L^2(P)}^2 = \big\|\big(I - U\hat\Sigma_{m,l}^{-1}U^*\big)II^*\big\|_{HS}^2$.


Result. Clearly $R_l \to 0$ as $l \to \infty$; the goal is to study the convergence rates of $R_l$, $R_{n,l}$ and $R_{m,n,l}$ as $l, m, n \to \infty$. Suppose $\lambda_i \asymp i^{-\alpha}$ with $\alpha > \frac12$, $l = n^{\theta/\alpha}$ and $m = n^{\gamma}$, where $\theta > 0$ and $0 < \gamma < 1$. Then $R_l \lesssim n^{-2\theta(1 - \frac{1}{2\alpha})}$,
$R_{n,l} \lesssim \begin{cases} n^{-2\theta(1 - \frac{1}{2\alpha})}, & 0 < \theta \le \frac{\alpha}{2(3\alpha-1)} \quad \text{(bias dominates)} \\ n^{-(\frac12 - \theta)}, & \frac{\alpha}{2(3\alpha-1)} \le \theta < \frac12 \quad \text{(variance dominates)} \end{cases}$
and, for $\gamma > 2\theta$,
$R_{m,n,l} \lesssim \begin{cases} n^{-2\theta(1 - \frac{1}{2\alpha})}, & 0 < \theta \le \frac{\alpha}{2(3\alpha-1)} \\ n^{-(\frac12 - \theta)}, & \frac{\alpha}{2(3\alpha-1)} \le \theta < \frac12 \end{cases}$
i.e., no statistical loss.

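For concreteness (simple arithmetic on the displayed rates): if $\alpha = 2$, the threshold is $\frac{\alpha}{2(3\alpha-1)} = \frac15$. Choosing $\theta = \frac15$ balances the two regimes and gives $R_{n,l} \asymp n^{-3/10}$, and the approximate version attains the same $n^{-3/10}$ rate whenever $\gamma > 2\theta = \frac25$, i.e., with only $m \gtrsim n^{2/5}$ random features.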

Convergence of Projection Operators I. Since $\big(\frac{I\phi_i}{\sqrt{\lambda_i}}\big)_i$ form an ONS for $L^2(P)$, we consider, for empirical KPCA, $S_{n,l} = \Big\|\sum_{i=1}^l \frac{I\phi_i}{\sqrt{\lambda_i}} \otimes \frac{I\phi_i}{\sqrt{\lambda_i}} - \sum_{i=1}^l \frac{I\hat\phi_i}{\sqrt{\hat\lambda_i}} \otimes \frac{I\hat\phi_i}{\sqrt{\hat\lambda_i}}\Big\|_{\mathrm{op}}$, and, for approximate empirical KPCA, $S_{m,n,l} = \Big\|\sum_{i=1}^l \frac{I\phi_i}{\sqrt{\lambda_i}} \otimes \frac{I\phi_i}{\sqrt{\lambda_i}} - \sum_{i=1}^l \frac{U\hat\phi_{i,m}}{\sqrt{\hat\lambda_{i,m}}} \otimes \frac{U\hat\phi_{i,m}}{\sqrt{\hat\lambda_{i,m}}}\Big\|_{\mathrm{op}}$, as $l, m, n \to \infty$.


Convergence of Projection Operators I. Unlike the reconstruction error, the convergence of the projection operators depends on the behavior of the eigen-gap $\delta_i = \frac12(\lambda_i - \lambda_{i+1})$, $i \in \mathbb{N}$. Empirical KPCA (assuming $\delta_l \gtrsim n^{-1/2}$ and $\lambda_l \gtrsim n^{-1/2}$): $S_{n,l} \lesssim \frac{1}{\delta_l \sqrt{n}} + \frac{1}{n^{1/4}\sqrt{\lambda_l}}$. Approximate empirical KPCA (assuming $\delta_l \gtrsim m^{-1/2}$ and $\lambda_l \gtrsim m^{-1/2}$): $S_{m,n,l} \lesssim \frac{1}{\delta_l \sqrt{m}} + \frac{1}{n^{1/4}\sqrt{\lambda_l}}$.


Convergence of Projection Operators I. Suppose $\lambda_i \asymp i^{-\alpha}$, $\alpha > \frac12$, $\delta_i \asymp i^{-\beta}$, $\beta \ge \alpha$, $l = n^{\theta/\alpha}$ and $m = n^{\gamma}$, where $0 < \theta < \frac12$ and $0 < \gamma < 1$. Then
$S_{n,l} \lesssim \begin{cases} n^{-(\frac14 - \frac{\theta}{2})}, & 0 < \theta < \frac{\alpha}{2(2\beta - \alpha)} \\ n^{-(\frac12 - \frac{\theta\beta}{\alpha})}, & \frac{\alpha}{2(2\beta - \alpha)} \le \theta < \frac{\alpha}{2\beta} \end{cases}$
and
$S_{m,n,l} \lesssim \begin{cases} n^{-(\frac14 - \frac{\theta}{2})}, & 0 < \theta < \frac{\alpha}{2(2\beta - \alpha)},\ \gamma \ge \frac12 + \theta\big(\frac{2\beta}{\alpha} - 1\big) \\ n^{-(\frac{\gamma}{2} - \frac{\theta\beta}{\alpha})}, & 0 < \theta < \frac{\alpha}{2(2\beta - \alpha)},\ \frac{2\theta\beta}{\alpha} < \gamma < \frac12 + \theta\big(\frac{2\beta}{\alpha} - 1\big) \end{cases}$


Convergence of Projection Operators II. Empirical KPCA: $T_{n,l} = \Big\|\sum_{i=1}^l I\phi_i \otimes_{L^2(P)} I\phi_i - \sum_{i=1}^l I\hat\phi_i \otimes_{L^2(P)} I\hat\phi_i\Big\|_{HS} = \Big\|\Sigma^{1/2}\Big(\sum_{i=1}^l \phi_i \otimes_H \phi_i - \sum_{i=1}^l \hat\phi_i \otimes_H \hat\phi_i\Big)\Sigma^{1/2}\Big\|_{HS} \lesssim \frac{\sqrt{l}\,\lambda_l}{\delta_l \sqrt{n}}$. Approximate empirical KPCA: $T_{m,n,l} = \Big\|\sum_{i=1}^l I\phi_i \otimes_{L^2(P)} I\phi_i - \sum_{i=1}^l U\hat\phi_{i,m} \otimes_{L^2(P)} U\hat\phi_{i,m}\Big\|_{HS} \lesssim \frac{\sqrt{l}\,\lambda_l}{\delta_l \sqrt{n}} + \frac{1}{\sqrt{m}}$. Convergence rates can be derived under the assumption $\lambda_i \asymp i^{-\alpha}$, $\alpha > \frac12$, $\delta_i \asymp i^{-\beta}$, $\beta \ge \alpha$, $l = n^{\theta/\alpha}$ and $m = n^{\gamma}$, where $0 < \theta < \frac12$ and $0 < \gamma < 1$.


Summary. Random feature approximation to kernel PCA improves its computational complexity. The statistical trade-off is studied through two notions: reconstruction error and convergence of eigenspaces. Open questions: lower bounds; extension to kernel canonical correlation analysis; the Nyström approximation.

Thank You

References I
Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404.
Fine, S. and Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243-264.
Rahimi, A. and Recht, B. (2008a). Random features for large-scale kernel machines. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors, Advances in Neural Information Processing Systems 20, pages 1177-1184. Curran Associates, Inc.
Rahimi, A. and Recht, B. (2008b). Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS, pages 1313-1320.
Rudi, A. and Rosasco, L. (2016). Generalization properties of learning with random features. https://arxiv.org/pdf/1602.04474.pdf.
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319.
Smola, A. J. and Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proc. 17th International Conference on Machine Learning, pages 911-918. Morgan Kaufmann, San Francisco, CA.
Sriperumbudur, B. K. and Szabó, Z. (2015). Optimal rates for random Fourier features. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 1144-1152. Curran Associates, Inc.
Sutherland, D. and Schneider, J. (2015). On the error of random Fourier features. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 862-871.

References II
Williams, C. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 682-688, Cambridge, MA. MIT Press.
Yang, Y., Pilanci, M., and Wainwright, M. J. (2015). Randomized sketches for kernels: Fast and optimal non-parametric regression. Technical report. https://arxiv.org/pdf/1501.06195.pdf.