Approximate Kernel Methods

Lecture 3: Approximate Kernel Methods
Bharath K. Sriperumbudur, Department of Statistics, Pennsylvania State University
Machine Learning Summer School, Tübingen, 2017

Outline

- Motivating example: ridge regression
- Approximation methods
- Kernel PCA
- Computational vs. statistical trade-off

(Joint work with Nicholas Sterge, Pennsylvania State University)

Motivating Example: Ridge Regression

Given: $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.

Task: Find a linear regressor $f = \langle w, \cdot\rangle_2$ such that $f(x_i) \approx y_i$:
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \big(\langle w, x_i\rangle_2 - y_i\big)^2 + \lambda \|w\|_2^2 \qquad (\lambda > 0).$$

Solution: For $X := (x_1, \dots, x_n) \in \mathbb{R}^{d \times n}$ and $y := (y_1, \dots, y_n)^\top \in \mathbb{R}^n$,
$$w = \underbrace{\Big(\tfrac{1}{n} XX^\top + \lambda I_d\Big)^{-1} \tfrac{1}{n} X y}_{\text{primal}}.$$

Easy: $\big(\tfrac{1}{n} XX^\top + \lambda I_d\big)^{-1} X = X\big(\tfrac{1}{n} X^\top X + \lambda I_n\big)^{-1}$, hence
$$w = \underbrace{\tfrac{1}{n} X\Big(\tfrac{1}{n} X^\top X + \lambda I_n\Big)^{-1} y}_{\text{dual}}.$$
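The primal and dual expressions return the same $w$; here is a minimal numpy sketch (synthetic data, arbitrary $\lambda$; all variable names are illustrative) that checks the identity numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1

X = rng.standard_normal((d, n))          # columns are the inputs x_i
y = rng.standard_normal(n)

# Primal: w = (XX^T/n + lam I_d)^{-1} (1/n) X y  -- a d x d linear system
w_primal = np.linalg.solve(X @ X.T / n + lam * np.eye(d), X @ y / n)

# Dual: w = (1/n) X (X^T X/n + lam I_n)^{-1} y   -- an n x n linear system
w_dual = X @ np.linalg.solve(X.T @ X / n + lam * np.eye(n), y) / n

print(np.allclose(w_primal, w_dual))     # True: both formulations agree
```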

Ridge Regression: Prediction

Given $t \in \mathbb{R}^d$,
$$f(t) = \langle w, t\rangle_2 = y^\top X^\top \big(XX^\top + n\lambda I_d\big)^{-1} t = y^\top \big(X^\top X + n\lambda I_n\big)^{-1} X^\top t.$$

What does $X^\top X$ look like?
$$X^\top X = \begin{pmatrix}
\langle x_1, x_1\rangle_2 & \langle x_1, x_2\rangle_2 & \cdots & \langle x_1, x_n\rangle_2 \\
\langle x_2, x_1\rangle_2 & \langle x_2, x_2\rangle_2 & \cdots & \langle x_2, x_n\rangle_2 \\
\vdots & & \ddots & \vdots \\
\langle x_n, x_1\rangle_2 & \langle x_n, x_2\rangle_2 & \cdots & \langle x_n, x_n\rangle_2
\end{pmatrix},$$
the matrix of inner products (Gram matrix).

Kernel Ridge Regression: Feature Map and Kernel Trick

Given: $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathcal{X}$, $y_i \in \mathbb{R}$.

Task: Find a regressor $f \in H$ (some feature space) such that $f(x_i) \approx y_i$.

Idea: Map $x_i$ to $\Phi(x_i)$ and do linear regression:
$$\min_{f \in H} \frac{1}{n}\sum_{i=1}^n \big(\langle f, \Phi(x_i)\rangle_H - y_i\big)^2 + \lambda \|f\|_H^2 \qquad (\lambda > 0).$$

Solution: For $\Phi(X) := (\Phi(x_1), \dots, \Phi(x_n)) \in \mathbb{R}^{\dim(H) \times n}$ and $y := (y_1, \dots, y_n)^\top \in \mathbb{R}^n$,
$$f = \underbrace{\Big(\tfrac{1}{n}\Phi(X)\Phi(X)^\top + \lambda I_{\dim(H)}\Big)^{-1} \tfrac{1}{n}\Phi(X) y}_{\text{primal}}
  = \underbrace{\tfrac{1}{n}\Phi(X)\Big(\tfrac{1}{n}\Phi(X)^\top \Phi(X) + \lambda I_n\Big)^{-1} y}_{\text{dual}}.$$

Kernel Ridge Regression: Feature Map and Kernel Trick

Prediction: Given $t \in \mathcal{X}$,
$$f(t) = \langle f, \Phi(t)\rangle_H
= \tfrac{1}{n} y^\top \Phi(X)^\top \Big(\tfrac{1}{n}\Phi(X)\Phi(X)^\top + \lambda I_{\dim(H)}\Big)^{-1}\Phi(t)
= \tfrac{1}{n} y^\top \Big(\tfrac{1}{n}\Phi(X)^\top\Phi(X) + \lambda I_n\Big)^{-1}\Phi(X)^\top \Phi(t).$$

As before,
$$\Phi(X)^\top \Phi(X) = \begin{pmatrix}
\langle\Phi(x_1), \Phi(x_1)\rangle_H & \cdots & \langle\Phi(x_1), \Phi(x_n)\rangle_H \\
\langle\Phi(x_2), \Phi(x_1)\rangle_H & \cdots & \langle\Phi(x_2), \Phi(x_n)\rangle_H \\
\vdots & \ddots & \vdots \\
\langle\Phi(x_n), \Phi(x_1)\rangle_H & \cdots & \langle\Phi(x_n), \Phi(x_n)\rangle_H
\end{pmatrix},
\qquad k(x_i, x_j) = \langle\Phi(x_i), \Phi(x_j)\rangle_H,$$
and $\Phi(X)^\top \Phi(t) = \big[\langle\Phi(x_1), \Phi(t)\rangle_H, \dots, \langle\Phi(x_n), \Phi(t)\rangle_H\big]^\top$.

Remarks

The primal formulation requires knowledge of the feature map $\Phi$ (and of course $H$), and these could be infinite-dimensional. The dual formulation is entirely determined by kernel evaluations: the Gram matrix and $(k(x_i, t))_i$. But it scales poorly: $O(n^3)$.
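To make the dual formulation concrete, here is a minimal numpy sketch of kernel ridge regression using only the Gram matrix; the Gaussian kernel, bandwidth, and synthetic data are illustrative choices, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, sigma = 100, 2, 1e-3, 1.0

X = rng.uniform(-1, 1, size=(n, d))                      # rows are inputs x_i
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(n)   # noisy targets

def gaussian_gram(A, B, sigma):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

K = gaussian_gram(X, X, sigma)                            # n x n Gram matrix
alpha = np.linalg.solve(K / n + lam * np.eye(n), y) / n   # dual coefficients

t = rng.uniform(-1, 1, size=(5, d))                       # test points
k_t = gaussian_gram(X, t, sigma)                          # (k(x_i, t_j))_{ij}
print(alpha @ k_t)                                        # predictions f(t_j)
```

The $O(n^3)$ cost lives in the `np.linalg.solve` on the $n \times n$ Gram matrix; this is exactly the bottleneck the approximation schemes below address.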

Approximation Schemes

- Incomplete Cholesky factorization (e.g., Fine and Scheinberg, JMLR 2001)
- Sketching (Yang et al., 2015)
- Sparse greedy approximation (Smola and Schölkopf, ICML 2000)
- Nyström method (e.g., Williams and Seeger, NIPS 2001)
- Random Fourier features (e.g., Rahimi and Recht, NIPS 2008), ...

Random Fourier Approximation

Let $\mathcal{X} = \mathbb{R}^d$ and let $k$ be continuous and translation-invariant, i.e., $k(x, y) = \psi(x - y)$.

Bochner's theorem:
$$k(x, y) = \int_{\mathbb{R}^d} e^{\mathrm{i}\langle\omega, x - y\rangle_2}\, d\Lambda(\omega),$$
where $\Lambda$ is a finite non-negative Borel measure on $\mathbb{R}^d$.

$k$ is symmetric and therefore $\Lambda$ is a symmetric measure on $\mathbb{R}^d$. Therefore
$$k(x, y) = \int_{\mathbb{R}^d} \cos\big(\langle\omega, x - y\rangle_2\big)\, d\Lambda(\omega).$$

Random Feature Approximation

(Rahimi and Recht, NIPS 2008): Draw $(\omega_j)_{j=1}^m \overset{\text{i.i.d.}}{\sim} \Lambda$ and set
$$k_m(x, y) = \frac{1}{m}\sum_{j=1}^m \cos\big(\langle\omega_j, x - y\rangle_2\big) = \langle\Phi_m(x), \Phi_m(y)\rangle_{\mathbb{R}^{2m}},$$
where
$$\Phi_m(x) = \frac{1}{\sqrt{m}}\big(\cos(\langle\omega_1, x\rangle_2), \dots, \cos(\langle\omega_m, x\rangle_2), \sin(\langle\omega_1, x\rangle_2), \dots, \sin(\langle\omega_m, x\rangle_2)\big).$$

Use the feature map idea with $\Phi_m$, i.e., $x \mapsto \Phi_m(x)$, and apply your favorite linear method.
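For the Gaussian kernel $k(x, y) = \exp(-\|x - y\|_2^2 / 2)$ the spectral measure $\Lambda$ is the standard Gaussian $N(0, I_d)$, so the construction is easy to try out. A minimal sketch (the kernel and all names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 2000

# Gaussian kernel k(x, y) = exp(-||x - y||^2 / 2); by Bochner's theorem its
# spectral measure Lambda is N(0, I_d).
omega = rng.standard_normal((m, d))      # (omega_j)_{j=1}^m ~ Lambda, i.i.d.

def phi_m(x):
    """Random feature map Phi_m: R^d -> R^{2m}, so <Phi_m(x), Phi_m(y)> = k_m(x, y)."""
    proj = omega @ x
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

x, y = rng.standard_normal(d), rng.standard_normal(d)
k_exact = np.exp(-np.sum((x - y) ** 2) / 2)
k_m = phi_m(x) @ phi_m(y)                # = (1/m) sum_j cos(<omega_j, x - y>)
print(k_exact, k_m)                      # close for large m
```

The cosine/sine identity $\cos a \cos b + \sin a \sin b = \cos(a - b)$ is what makes the $2m$-dimensional inner product reproduce $k_m$.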

How Good Is the Approximation?

(Sriperumbudur and Szabó, NIPS 2015):
$$\sup_{x, y \in S} \big|k_m(x, y) - k(x, y)\big| = O_{a.s.}\!\left(\sqrt{\frac{\log |S|}{m}}\right).$$

This is the optimal convergence rate. Other results are known, but they are non-optimal (Rahimi and Recht, NIPS 2008; Sutherland and Schneider, UAI 2015).

Ridge Regression: Random Feature Approximation

Given: $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathcal{X}$, $y_i \in \mathbb{R}$.

Task: Find a regressor $w \in \mathbb{R}^{2m}$ such that $f(x_i) \approx y_i$.

Idea: Map $x_i$ to $\Phi_m(x_i)$ and do linear regression:
$$\min_{w \in \mathbb{R}^{2m}} \frac{1}{n}\sum_{i=1}^n \big(\langle w, \Phi_m(x_i)\rangle_{\mathbb{R}^{2m}} - y_i\big)^2 + \lambda \|w\|_{\mathbb{R}^{2m}}^2 \qquad (\lambda > 0).$$

Solution: For $\Phi_m(X) := (\Phi_m(x_1), \dots, \Phi_m(x_n)) \in \mathbb{R}^{2m \times n}$ and $y := (y_1, \dots, y_n)^\top \in \mathbb{R}^n$,
$$w = \underbrace{\Big(\tfrac{1}{n}\Phi_m(X)\Phi_m(X)^\top + \lambda I_{2m}\Big)^{-1}\tfrac{1}{n}\Phi_m(X) y}_{\text{primal}}
  = \underbrace{\tfrac{1}{n}\Phi_m(X)\Big(\tfrac{1}{n}\Phi_m(X)^\top\Phi_m(X) + \lambda I_n\Big)^{-1} y}_{\text{dual}}.$$

Computation: $O(m^2 n)$.
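Putting the two previous sketches together: a minimal example of ridge regression on random Fourier features, solving the $2m \times 2m$ primal system at $O(m^2 n)$ cost (kernel, bandwidth, and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lam = 5000, 2, 100, 1e-3

X = rng.uniform(-1, 1, size=(n, d))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(n)

omega = rng.standard_normal((m, d))      # frequencies for a unit-bandwidth Gaussian kernel

def features(A):
    """Map each row of A to its random Fourier feature vector in R^{2m}."""
    proj = A @ omega.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

Z = features(X)                          # n x 2m matrix, i.e. Phi_m(X)^T
# Primal solve: w = (Z^T Z / n + lam I_{2m})^{-1} Z^T y / n  -- costs O(m^2 n)
w = np.linalg.solve(Z.T @ Z / n + lam * np.eye(2 * m), Z.T @ y / n)

t = rng.uniform(-1, 1, size=(5, d))
print(features(t) @ w)                   # predictions at test points
```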

What Happens Statistically?

(Rudi and Rosasco, 2016): If $m \gtrsim n^{\alpha}$, where $\tfrac{1}{2} \le \alpha < 1$ depends on the properties of the unknown true regressor $f_*$, then the excess risk $\mathcal{R}_{L,P}(f_{m,n}) - \mathcal{R}_{L,P}(f_*)$ achieves the minimax optimal rate obtained in the case with no approximation. Here $L$ is the squared loss.

Computational gain with no statistical loss!

Principal Component Analysis (PCA)

Suppose $X \sim P$ is a random variable in $\mathbb{R}^d$ with mean $\mu$ and covariance matrix $\Sigma$.

Find a direction $w \in \{v : \|v\|_2 = 1\}$ such that $\mathrm{Var}[\langle w, X\rangle_2]$ is maximized.

Find a direction $w \in \mathbb{R}^d$ such that
$$\mathbb{E}\left\|(X - \mu) - \frac{\langle w, X - \mu\rangle_2}{\|w\|_2^2}\, w\right\|_2^2$$
is minimized.

The formulations are equivalent, and the solution is the eigenvector of the covariance matrix $\Sigma$ corresponding to the largest eigenvalue. Can be generalized to multiple directions (find a subspace...). Applications: dimensionality reduction.
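A minimal numpy sketch of the first formulation with an empirical covariance: estimate $\Sigma$ from samples and take its leading eigenvector (the synthetic data are an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3

# Synthetic data with one dominant direction of variance.
X = rng.standard_normal((n, d)) @ np.diag([3.0, 1.0, 0.5])

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / n        # empirical covariance matrix

evals, evecs = np.linalg.eigh(Sigma)     # eigenvalues in ascending order
w = evecs[:, -1]                         # leading eigenvector = first principal direction
print(evals[-1], w)                      # maximal variance Var[<w, X>] and its direction
```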

Kernel PCA

Nonlinear generalization of PCA (Schölkopf et al., 1998): map $X \mapsto \Phi(X)$ and apply PCA. Provides a low-dimensional manifold (curves) in $\mathbb{R}^d$ to efficiently represent the data.

Function space view: Find $f \in \{g \in H : \|g\|_H = 1\}$ such that $\mathrm{Var}[f(X)]$ is maximized, i.e.,
$$f_* = \arg\sup_{\|f\|_H = 1} \mathbb{E}[f^2(X)] - \mathbb{E}^2[f(X)].$$

Using the reproducing property $f(X) = \langle f, k(\cdot, X)\rangle_H$, we obtain
$$\mathbb{E}[f^2(X)] = \mathbb{E}\,\langle f, k(\cdot, X)\rangle_H^2
= \mathbb{E}\,\langle f, (k(\cdot, X) \otimes_H k(\cdot, X)) f\rangle_H
= \langle f, \mathbb{E}[k(\cdot, X) \otimes_H k(\cdot, X)] f\rangle_H,$$
assuming $\int_{\mathcal{X}} k(x, x)\, dP(x) < \infty$, and
$$\mathbb{E}[f(X)] = \langle f, \mu_P\rangle_H, \quad \text{where } \mu_P := \int_{\mathcal{X}} k(\cdot, x)\, dP(x).$$

Kernel PCA

$$f_* = \arg\sup_{\|f\|_H = 1} \langle f, \Sigma f\rangle_H,$$
where
$$\Sigma := \int_{\mathcal{X}} k(\cdot, x) \otimes_H k(\cdot, x)\, dP(x) - \mu_P \otimes_H \mu_P$$
is the covariance operator (symmetric, positive and Hilbert-Schmidt) on $H$.

Spectral theorem: $\Sigma = \sum_{i \in I} \lambda_i\, \phi_i \otimes_H \phi_i$, where $I$ is either countable (with $\lambda_i \to 0$ as $i \to \infty$) or finite.

As in PCA, the solution is the eigenfunction corresponding to the largest eigenvalue of $\Sigma$.

Empirical Kernel PCA

In practice, $P$ is unknown, but we have access to $(X_i)_{i=1}^n \overset{\text{i.i.d.}}{\sim} P$:
$$\hat{f} = \arg\sup_{\|f\|_H = 1} \langle f, \hat\Sigma f\rangle_H,$$
where
$$\hat\Sigma := \frac{1}{n}\sum_{i=1}^n k(\cdot, X_i) \otimes_H k(\cdot, X_i) - \mu_n \otimes_H \mu_n$$
is the empirical covariance operator (symmetric, positive and Hilbert-Schmidt) on $H$, and $\mu_n := \frac{1}{n}\sum_{i=1}^n k(\cdot, X_i)$.

$\hat\Sigma$ is an infinite-dimensional operator but with finite rank (what is its rank?).

Spectral theorem: $\hat\Sigma = \sum_{i} \hat\lambda_i\, \hat\phi_i \otimes_H \hat\phi_i$.

$\hat{f}$ is the eigenfunction corresponding to the largest eigenvalue of $\hat\Sigma$.

Empirical Kernel PCA

Since $\hat\Sigma$ is an infinite-dimensional operator, we are solving an infinite-dimensional eigensystem, $\hat\Sigma\hat\phi_i = \hat\lambda_i\hat\phi_i$. How do we find $\hat\phi_i$ in practice?

Approach 1 (Schölkopf et al., 1998):
$$\hat{f} = \arg\sup_{\|f\|_H = 1} \frac{1}{n}\sum_{i=1}^n f^2(X_i) - \Big(\frac{1}{n}\sum_{i=1}^n f(X_i)\Big)^2
= \arg\inf_{f \in H} -\frac{1}{n}\sum_{i=1}^n f^2(X_i) + \Big(\frac{1}{n}\sum_{i=1}^n f(X_i)\Big)^2 + \lambda\|f\|_H^2$$
for some $\lambda > 0$.

Use the representer theorem, which says there exist $(\alpha_i)_{i=1}^n \subset \mathbb{R}$ such that
$$\hat{f} = \sum_{i=1}^n \alpha_i k(\cdot, X_i).$$

Solving for $(\alpha_i)_{i=1}^n$ yields an $n \times n$ eigensystem involving the Gram matrix. (Exercise; a sketch of one standard implementation follows.)
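One standard way to carry out the exercise is to doubly center the Gram matrix with $H_n$ and take its leading eigenvectors; the sketch below does exactly that. The Gaussian kernel, bandwidth, and toy data are illustrative assumptions, and the normalization enforces $\|\hat f\|_H = 1$ via $\alpha^\top K \alpha = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Toy data on a noisy circle: nonlinear structure that linear PCA cannot capture.
theta = rng.uniform(0, 2 * np.pi, n)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((n, 2))

def gaussian_gram(A, B, sigma=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

K = gaussian_gram(X, X)                       # n x n Gram matrix
H = np.eye(n) - np.ones((n, n)) / n           # centering matrix H_n
Kc = H @ K @ H                                # doubly centered Gram matrix

evals, evecs = np.linalg.eigh(Kc / n)         # nonzero eigenvalues match those of Sigma_hat
lam, alpha = evals[-1], evecs[:, -1]          # leading eigenpair
alpha /= np.sqrt(alpha @ K @ alpha)           # so that f = sum_i alpha_i k(., X_i) has ||f||_H = 1

scores = K @ alpha                            # f(X_j) = sum_i alpha_i k(X_i, X_j)
print(lam, scores[:5])
```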

Empirical Kernel PCA

Approach 2 (Smale and Zhou, 2007):

Sampling operator: $S : H \to \mathbb{R}^n$, $f \mapsto \frac{1}{\sqrt{n}}\big(f(X_1), \dots, f(X_n)\big)$.
Reconstruction operator: $S^* : \mathbb{R}^n \to H$, $\alpha \mapsto \frac{1}{\sqrt{n}}\sum_{i=1}^n \alpha_i k(\cdot, X_i)$.

$S^*$ is the adjoint (transpose) of $S$; they satisfy
$$\langle f, S^*\alpha\rangle_H = \langle Sf, \alpha\rangle_{\mathbb{R}^n}, \qquad f \in H,\ \alpha \in \mathbb{R}^n.$$

Claim: $\hat\phi_i = \frac{1}{\hat\lambda_i} S^* H_n \hat\alpha_i$, where $H_n := I_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\top$ and $\hat\alpha_i$ are the eigenvectors of $K H_n$ with $\hat\lambda_i$ the corresponding eigenvalues.

It can be shown that $\hat\Sigma = S^* H_n S$. Then
$$\hat\Sigma\hat\phi_i = \hat\lambda_i\hat\phi_i
\;\Rightarrow\; S^*H_nS\hat\phi_i = \hat\lambda_i\hat\phi_i
\;\Rightarrow\; SS^*H_nS\hat\phi_i = \hat\lambda_i S\hat\phi_i,
\qquad \hat\alpha_i := S\hat\phi_i.$$

It can be shown that $K = SS^*$, so $\hat\alpha_i$ solves the eigensystem above. Conversely,
$$\hat\alpha_i = S\hat\phi_i
\;\Rightarrow\; S^*H_n\hat\alpha_i = S^*H_nS\hat\phi_i = \hat\Sigma\hat\phi_i = \hat\lambda_i\hat\phi_i.$$

Empirical Kernel PCA: Random Features

Empirical kernel PCA solves an $n \times n$ eigensystem: complexity is $O(n^3)$.

Use the idea of random features: $X \mapsto \Phi_m(X)$ and apply PCA:
$$\hat{w} := \arg\sup_{\|w\|_2 = 1} \mathrm{Var}[\langle w, \Phi_m(X)\rangle_2] = \arg\sup_{\|w\|_2 = 1} \langle w, \hat\Sigma_m w\rangle_2,$$
where
$$\hat\Sigma_m := \frac{1}{n}\sum_{i=1}^n \Phi_m(X_i)\Phi_m(X_i)^\top
- \Big(\frac{1}{n}\sum_{i=1}^n \Phi_m(X_i)\Big)\Big(\frac{1}{n}\sum_{i=1}^n \Phi_m(X_i)\Big)^\top$$
is a covariance matrix of size $2m \times 2m$.

Eigendecomposition: $\hat\Sigma_m = \sum_{i=1}^{2m} \hat\lambda_{i,m}\, \hat\phi_{i,m}\hat\phi_{i,m}^\top$.

$\hat{w}$ is obtained by solving a $2m \times 2m$ eigensystem: complexity is $O(m^3)$.

What happens statistically?
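A minimal sketch of the random-feature variant: build $\Phi_m(X_i)$, form the $2m \times 2m$ empirical covariance $\hat\Sigma_m$, and take its leading eigenvector. The kernel bandwidth and toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 1000, 2, 50

theta = rng.uniform(0, 2 * np.pi, n)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((n, 2))

sigma = 0.5
omega = rng.standard_normal((m, d)) / sigma   # frequencies for a Gaussian kernel with bandwidth sigma

def features(A):
    proj = A @ omega.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

Z = features(X)                               # n x 2m matrix of Phi_m(X_i)
mean = Z.mean(axis=0)
Sigma_m = (Z - mean).T @ (Z - mean) / n       # 2m x 2m covariance matrix Sigma_m_hat

evals, evecs = np.linalg.eigh(Sigma_m)        # O(m^3) eigendecomposition
w_hat = evecs[:, -1]                          # leading principal direction in R^{2m}
scores = (Z - mean) @ w_hat                   # projections <w_hat, Phi_m(X_i) - mean>
print(evals[-1], scores[:5])
```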

Empirical Kernel PCA: Random Features

Two notions:
- Reconstruction error (what does this depend on?)
- Convergence of eigenvectors (more generally, eigenspaces)

Eigenspace Convergence

What do we have?
- Eigenvectors after approximation, $(\hat\phi_{i,m})_{i=1}^{2m}$: these span a subspace of $\mathbb{R}^{2m}$.
- Population eigenfunctions $(\phi_i)_{i \in I}$ of $\Sigma$: these span a subspace of $H$.

How do we compare them? We embed them in a common space before comparing. The common space is $L^2(P)$.

Eigenspace Convergence

Embeddings:
$$I : H \to L^2(P), \quad f \mapsto f - \int_{\mathcal{X}} f(x)\, dP(x),$$
$$U : \mathbb{R}^{2m} \to L^2(P), \quad \alpha \mapsto \sum_{i=1}^{2m} \alpha_i\Big(\Phi_{m,i} - \int_{\mathcal{X}} \Phi_{m,i}(x)\, dP(x)\Big),$$
where $\Phi_{m,i}$ denotes the $i$-th coordinate of the random feature map $\Phi_m$ defined earlier (cosines and sines of $\langle\omega_j, \cdot\rangle_2$).

Main results (Sriperumbudur and Sterge, 2017):
$$\sum_{i=1}^{\ell} \big\|I\hat\phi_i - I\phi_i\big\|_{L^2(P)} = O_p\Big(\frac{1}{\sqrt{n}}\Big).$$
A similar result holds in $H$ (Zwald and Blanchard, NIPS 2005).
$$\sum_{i=1}^{\ell} \big\|U\hat\phi_{i,m} - I\phi_i\big\|_{L^2(P)} = O_p\Big(\frac{1}{\sqrt{m}} + \frac{1}{\sqrt{n}}\Big).$$

Random features are not helpful from the point of view of eigenvector convergence (this also holds for eigenspaces). They may be useful for convergence of the reconstruction error (work in progress).

References I

Fine, S. and Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264.

Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc.

Rudi, A. and Rosasco, L. (2016). Generalization properties of learning with random features. https://arxiv.org/pdf/1602.04474.pdf.

Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319.

Smale, S. and Zhou, D.-X. (2007). Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26:153–172.

Smola, A. J. and Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proc. 17th International Conference on Machine Learning, pages 911–918. Morgan Kaufmann, San Francisco, CA.

Sriperumbudur, B. K. and Sterge, N. (2017). Statistical consistency of kernel PCA with random features. arXiv:1706.06296.

Sriperumbudur, B. K. and Szabó, Z. (2015). Optimal rates for random Fourier features. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 1144–1152. Curran Associates, Inc.

Sutherland, D. and Schneider, J. (2015). On the error of random Fourier features. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 862–871.

References II

Williams, C. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 682–688, Cambridge, MA. MIT Press.

Yang, Y., Pilanci, M., and Wainwright, M. J. (2015). Randomized sketches for kernels: Fast and optimal non-parametric regression. Technical report. https://arxiv.org/pdf/1501.06195.pdf.

Zwald, L. and Blanchard, G. (2006). On the convergence of eigenspaces in kernel principal component analysis. In Weiss, Y., Schölkopf, B., and Platt, J. C., editors, Advances in Neural Information Processing Systems 18, pages 1649–1656. MIT Press.