On the Impact of Kernel Approximation on Learning Accuracy


Corinna Cortes (Google Research, New York, NY; corinna@google.com), Mehryar Mohri (Courant Institute and Google Research, New York, NY; mohri@cs.nyu.edu), Ameet Talwalkar (Courant Institute, New York, NY; ameet@cs.nyu.edu)

Abstract

Kernel approximation is commonly used to scale kernel-based algorithms to applications containing as many as several million instances. This paper analyzes the effect of such approximations in the kernel matrix on the hypothesis generated by several widely used learning algorithms. We give stability bounds based on the norm of the kernel approximation for these algorithms, including SVMs, KRR, and graph Laplacian-based regularization algorithms. These bounds help determine the degree of approximation that can be tolerated in the estimation of the kernel matrix. Our analysis is general and applies to arbitrary approximations of the kernel matrix. However, we also give a specific analysis of the Nyström low-rank approximation in this context and report the results of experiments evaluating the quality of the Nyström low-rank kernel approximation when used with ridge regression.

1 Introduction

The size of modern-day learning problems found in computer vision, natural language processing, systems design, and many other areas is often in the order of hundreds of thousands and can exceed several million. Scaling standard kernel-based algorithms such as support vector machines (SVMs) (Cortes and Vapnik, 1995), kernel ridge regression (KRR) (Saunders et al., 1998), and kernel principal component analysis (KPCA) (Schölkopf et al., 1998) to such magnitudes is a serious issue, since even storing the kernel matrix can be prohibitive at this size.

Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.
One solution suggested for dealing with such large-scale problems consists of a low-rank approximation of the kernel matrix (Williams and Seeger, 2000). Other variants of these approximation techniques based on the Nyström method have also been recently presented and shown to be applicable to large-scale problems (Belabbas and Wolfe, 2009; Drineas and Mahoney, 2005; Kumar et al., 2009a; Talwalkar et al., 2008; Zhang et al., 2008). Kernel approximations based on other techniques such as column sampling (Kumar et al., 2009b), incomplete Cholesky decomposition (Bach and Jordan, 2002; Fine and Scheinberg, 2002), or Kernel Matching Pursuit (KMP) (Hussain and Shawe-Taylor, 2008; Vincent and Bengio, 2000) have also been widely used in large-scale learning applications.

But how does the kernel approximation affect the performance of the learning algorithm? There exists some previous work on this subject. Spectral clustering with perturbed data was studied in a restrictive setting with several assumptions by Huang et al. (2008). In Fine and Scheinberg (2002), the authors address this question in terms of the impact on the value of the objective function to be optimized by the learning algorithm. However, we strive to take the question one step further and directly analyze the effect of an approximation in the kernel matrix on the hypothesis generated by several widely used kernel-based learning algorithms. We give stability bounds based on the norm of the kernel approximation for these algorithms, including SVMs, KRR, and graph Laplacian-based regularization algorithms (Belkin et al., 2004). These bounds help determine the degree of approximation that can be tolerated in the estimation of the kernel matrix. Our analysis differs from previous applications of stability analysis as put forward by Bousquet and Elisseeff (2001): instead of studying the effect of changing one training point, we study the effect of changing the kernel matrix. Our analysis is general and applies to arbitrary approximations of the kernel matrix. However, we also give a specific analysis of the Nyström low-rank approximation, given the recent interest in this method and the successful applications of this algorithm to large-scale problems. We also report the results of experiments evaluating the quality of this kernel approximation when used with ridge regression.

The remainder of this paper is organized as follows. Section 2 introduces the problem of kernel stability and gives a kernel stability analysis of several algorithms. Section 3 provides a brief review of the Nyström approximation method and gives error bounds that can be combined with the kernel stability bounds. Section 4 reports the results of experiments with kernel approximation combined with kernel ridge regression.

2 Kernel Stability Analysis

In this section we analyze the impact of kernel approximation on several common kernel-based learning algorithms: KRR, SVMs, and graph Laplacian-based regularization algorithms. Our stability analyses result in bounds on the hypotheses directly in terms of the quality of the kernel approximation. In our analysis we assume that the kernel approximation is only used during training, where it may serve to reduce resource requirements; at testing time the true kernel function is used. This scenario is standard for the Nyström method and other approximations.

We consider the standard supervised learning setting where the learning algorithm receives a sample of m labeled points S = ((x_1, y_1), …, (x_m, y_m)) ∈ (X × Y)^m, where X is the input space and Y the set of labels, with Y = ℝ and |y| ≤ M in the regression case, and Y = {−1, +1} in the classification case. Throughout the paper, the kernel matrix K and its approximation K′ are assumed to be symmetric positive semidefinite (SPSD).

2.1 Kernel Ridge Regression

We first provide a stability analysis of kernel ridge regression.
The following is the dual optimization problem solved by KRR (Saunders et al., 1998):

  max_{α ∈ ℝ^m}  −λα⊤α − α⊤Kα + 2α⊤y,   (1)

where λ = mλ₀ > 0 is the ridge parameter. The problem admits the closed-form solution α = (K + λI)⁻¹y. We denote by h the hypothesis returned by kernel ridge regression when using the exact kernel matrix.

Proposition 1. Let h′ denote the hypothesis returned by kernel ridge regression when using the approximate kernel matrix K′ ∈ ℝ^{m×m}. Furthermore, define κ > 0 such that K(x, x) ≤ κ and K′(x, x) ≤ κ for all x ∈ X. This condition is verified with κ = 1 for Gaussian kernels, for example. Then the following inequality holds for all x ∈ X:

  |h′(x) − h(x)| ≤ (κM / (λ₀²m)) ‖K′ − K‖₂.   (2)

Proof. Let α′ denote the solution obtained using the approximate kernel matrix K′. We can write

  α′ − α = (K′ + λI)⁻¹y − (K + λI)⁻¹y   (3)
         = [(K + λI)⁻¹(K − K′)(K′ + λI)⁻¹]y,   (4)

where we used the identity M′⁻¹ − M⁻¹ = M⁻¹(M − M′)M′⁻¹, valid for any invertible matrices M, M′. Thus, ‖α′ − α‖ can be bounded as follows:

  ‖α′ − α‖ ≤ ‖(K + λI)⁻¹‖ ‖K′ − K‖₂ ‖(K′ + λI)⁻¹‖ ‖y‖ ≤ ‖K′ − K‖₂ ‖y‖ / (λ_min(K + λI) λ_min(K′ + λI)),   (5)

where λ_min(K + λI) is the smallest eigenvalue of K + λI and λ_min(K′ + λI) the smallest eigenvalue of K′ + λI. The hypothesis h derived with the exact kernel matrix is defined by h(x) = Σᵢ₌₁^m αᵢK(x, xᵢ) = α⊤k_x, where k_x = (K(x, x₁), …, K(x, x_m))⊤. By assumption, no approximation affects k_x; thus the approximate hypothesis h′ is given by h′(x) = α′⊤k_x and

  |h′(x) − h(x)| ≤ ‖α′ − α‖ ‖k_x‖ ≤ κ√m ‖α′ − α‖.   (6)

Using the bound on ‖α′ − α‖ given by inequality (5), the fact that the eigenvalues of (K + λI) and (K′ + λI) are larger than or equal to λ = mλ₀ since K and K′ are SPSD matrices, and ‖y‖ ≤ √m M yields

  |h′(x) − h(x)| ≤ κ√m · √m M ‖K′ − K‖₂ / (λ_min(K + λI) λ_min(K′ + λI)) ≤ (κM / (λ₀²m)) ‖K′ − K‖₂. ∎

The generalization bounds for KRR, e.g., stability bounds (Bousquet and Elisseeff, 2001), are of the form R(h) ≤ R̂(h) + O(1/√m), where R(h) denotes the generalization error and R̂(h) the empirical error of a hypothesis h with respect to the squared loss. The proposition shows that |h′(x) − h(x)|² = O(‖K′ − K‖₂² / (λ₀⁴m²)). Thus, it suggests that the kernel approximation tolerated should be such that ‖K′ − K‖₂² / (λ₀⁴m²) ≤ O(1/√m), that is, such that ‖K′ − K‖₂ ≤ O(λ₀²m^{3/4}).

Note that the main bound used in the proof of the proposition, inequality (5), is tight in the sense that it can be matched for some kernels K and K′. Indeed, let K and K′ be the kernel functions defined by K(x, y) = β and K′(x, y) = β′ if x = y, K′(x, y) = K(x, y) = 0 otherwise, with β, β′ ≥ 0. Then the corresponding kernel matrices for a sample S are K = βI and K′ = β′I, and the dual parameter vectors are given by α = y/(β + λ) and α′ = y/(β′ + λ). Now, since λ_min(K + λI) = β + λ, λ_min(K′ + λI) = β′ + λ, and ‖K′ − K‖ = |β′ − β|, the following equality holds:

  ‖α′ − α‖ = |β′ − β| ‖y‖ / ((β + λ)(β′ + λ))   (7)
           = ‖K′ − K‖ ‖y‖ / (λ_min(K + λI) λ_min(K′ + λI)).   (8)

This limits significant improvements of the bound of Proposition 1 using similar techniques.

2.2 Support Vector Machines

This section analyzes the kernel stability of SVMs. For simplicity, we shall consider the case where the classification function sought has no offset; in practice, this corresponds to using a constant feature. Let Φ: X → F denote a feature mapping from the input space X to a Hilbert space F corresponding to some kernel K. The hypothesis set we consider is thus H = {h : ∃w ∈ F, ∀x ∈ X, h(x) = w⊤Φ(x)}. The following is the standard primal optimization problem for SVMs:

  min_{w ∈ F}  F_K(w) = ½‖w‖² + C₀ R̂_K(w),   (9)

where R̂_K(w) = (1/m) Σᵢ₌₁^m L(yᵢ w⊤Φ(xᵢ)) is the empirical error, with L(yᵢ w⊤Φ(xᵢ)) = max(0, 1 − yᵢ w⊤Φ(xᵢ)) the hinge loss associated to the ith point.

In the following, we analyze the difference between the hypothesis h returned by SVMs when trained on the sample S of m points using a kernel K, versus the hypothesis h′ obtained when training on the same sample with the kernel K′. For a fixed x ∈ X, we shall compare more specifically h(x) and h′(x). Thus, we can work with the finite set X_{m+1} = {x₁, …, x_m, x_{m+1}}, with x_{m+1} = x.

Different feature mappings Φ can be associated to the same kernel K. To compare the solutions w and w′ of the optimization problems based on F_K and F_{K′}, we can choose the feature mappings Φ and Φ′ associated to K and K′ such that they both map to ℝ^{m+1}, as follows. Let K_{m+1} denote the Gram matrix associated to K and K′_{m+1} that of kernel K′ for the set of points X_{m+1}. Then for all u ∈ X_{m+1}, Φ and Φ′ can be defined by

  Φ(u) = K_{m+1}^{−1/2} [K(x₁, u), …, K(x_{m+1}, u)]⊤   (10)
  Φ′(u) = K′_{m+1}^{−1/2} [K′(x₁, u), …, K′(x_{m+1}, u)]⊤,   (11)

where K_{m+1}^{−1/2} denotes the pseudo-inverse of K_{m+1}^{1/2} and K′_{m+1}^{−1/2} that of K′_{m+1}^{1/2}. It is not hard to see then that for all u, v ∈ X_{m+1}, K(u, v) = Φ(u)⊤Φ(v) and K′(u, v) = Φ′(u)⊤Φ′(v) (Schölkopf and Smola, 2002). Since the optimization problem depends only on the sample S, we can use the feature mappings just defined in the expression of F_K and F_{K′}; this does not affect in any way the standard SVM optimization problem. Let w ∈ ℝ^{m+1} denote the minimizer of F_K and w′ that of F_{K′}. By definition, if we let Δw denote w′ − w, for all s ∈ [0, 1] the following inequalities hold:

  F_K(w) ≤ F_K(w + sΔw)   (12)
  F_{K′}(w′) ≤ F_{K′}(w′ − sΔw).   (13)

Summing these two inequalities, rearranging terms, and using the identity (‖w + sΔw‖² − ‖w‖²) + (‖w′ − sΔw‖² − ‖w′‖²) = −2s(1 − s)‖Δw‖², we obtain, as in (Bousquet and Elisseeff, 2001):

  s(1 − s)‖Δw‖² ≤ C₀[(R̂_K(w + sΔw) − R̂_K(w)) + (R̂_{K′}(w′ − sΔw) − R̂_{K′}(w′))].

Note that w + sΔw = sw′ + (1 − s)w and w′ − sΔw = sw + (1 − s)w′. Then, by the convexity of the hinge loss, and thus of R̂_K and R̂_{K′}, the following inequalities hold:

  R̂_K(w + sΔw) − R̂_K(w) ≤ s(R̂_K(w′) − R̂_K(w))
  R̂_{K′}(w′ − sΔw) − R̂_{K′}(w′) ≤ s(R̂_{K′}(w) − R̂_{K′}(w′)).

Plugging these inequalities into the left-hand side, simplifying by s, and taking the limit s → 0 yields

  ‖Δw‖² ≤ C₀[(R̂_K(w′) − R̂_K(w)) + (R̂_{K′}(w) − R̂_{K′}(w′))]
        = (C₀/m) Σᵢ₌₁^m [L(yᵢw′⊤Φ(xᵢ)) − L(yᵢw′⊤Φ′(xᵢ)) + L(yᵢw⊤Φ′(xᵢ)) − L(yᵢw⊤Φ(xᵢ))],

where the last equality results from the definition of the empirical error. Since the hinge loss is 1-Lipschitz, we can bound the terms on the right-hand side as follows:

  ‖Δw‖² ≤ (C₀/m) Σᵢ₌₁^m [|w′⊤(Φ′(xᵢ) − Φ(xᵢ))| + |w⊤(Φ′(xᵢ) − Φ(xᵢ))|]   (14)
         ≤ (C₀/m)(‖w‖ + ‖w′‖) Σᵢ₌₁^m ‖Φ′(xᵢ) − Φ(xᵢ)‖.   (15)

Let eᵢ denote the ith unit vector of ℝ^{m+1}; then (K(x₁, xᵢ), …, K(x_{m+1}, xᵢ))⊤ = K_{m+1}eᵢ. Thus, in view of the definition of Φ, for all i ∈ [1, m + 1],

  Φ(xᵢ) = K_{m+1}^{−1/2}[K(x₁, xᵢ), …, K(x_{m+1}, xᵢ)]⊤ = K_{m+1}^{−1/2}K_{m+1}eᵢ = K_{m+1}^{1/2}eᵢ,   (16)

and similarly Φ′(xᵢ) = K′_{m+1}^{1/2}eᵢ. K_{m+1}^{1/2}eᵢ is the ith column of K_{m+1}^{1/2} and similarly K′_{m+1}^{1/2}eᵢ the ith column of K′_{m+1}^{1/2}. Thus, (15) can be rewritten as

  ‖Δw‖² ≤ (C₀/m)(‖w‖ + ‖w′‖) Σᵢ₌₁^m ‖(K′_{m+1}^{1/2} − K_{m+1}^{1/2})eᵢ‖.

As in the case of ridge regression, we shall assume that there exists κ > 0 such that K(x, x) ≤ κ and K′(x, x) ≤ κ for all x ∈ X_{m+1}. Now, since w can be written in terms of the dual variables 0 ≤ αᵢ ≤ C, C = C₀/m, as w = Σᵢ₌₁^m αᵢK(xᵢ, ·), it can be bounded as ‖w‖ ≤ mCκ^{1/2} = C₀κ^{1/2}, and similarly ‖w′‖ ≤ C₀κ^{1/2}. Thus, we can write

  ‖Δw‖² ≤ (2C₀²κ^{1/2}/m) Σᵢ₌₁^m ‖(K′_{m+1}^{1/2} − K_{m+1}^{1/2})eᵢ‖ ≤ 2C₀²κ^{1/2} max_i ‖(K′_{m+1}^{1/2} − K_{m+1}^{1/2})eᵢ‖ ≤ 2C₀²κ^{1/2} ‖K′_{m+1}^{1/2} − K_{m+1}^{1/2}‖₂.   (17)

Let K denote the Gram matrix associated to K and K′ that of kernel K′ for the sample S. Then, the following result holds.

Proposition 2. Let h′ denote the hypothesis returned by SVMs when using the approximate kernel matrix K′ ∈ ℝ^{m×m}. Then, the following inequality holds for all x ∈ X:

  |h′(x) − h(x)| ≤ √2 κ^{3/4} C₀ ‖K′ − K‖₂^{1/4} [1 + (‖K′ − K‖₂ / (4κ))^{1/4}].   (18)

Proof. In view of (16) and (17), the following holds:

  |h′(x) − h(x)| = |w′⊤Φ′(x) − w⊤Φ(x)|
  = |(w′ − w)⊤Φ′(x) + w⊤(Φ′(x) − Φ(x))|
  ≤ ‖w′ − w‖ ‖Φ′(x)‖ + ‖w‖ ‖Φ′(x) − Φ(x)‖
  = ‖w′ − w‖ ‖Φ′(x_{m+1})‖ + ‖w‖ ‖Φ′(x_{m+1}) − Φ(x_{m+1})‖
  ≤ κ^{1/2}[2C₀²κ^{1/2}‖K′_{m+1}^{1/2} − K_{m+1}^{1/2}‖₂]^{1/2} + C₀κ^{1/2}‖(K′_{m+1}^{1/2} − K_{m+1}^{1/2})e_{m+1}‖
  ≤ √2 C₀κ^{3/4}‖K′_{m+1}^{1/2} − K_{m+1}^{1/2}‖₂^{1/2} + C₀κ^{1/2}‖K′_{m+1}^{1/2} − K_{m+1}^{1/2}‖₂.

Now, by Lemma 1 (see Appendix), ‖K′_{m+1}^{1/2} − K_{m+1}^{1/2}‖₂ ≤ ‖K′_{m+1} − K_{m+1}‖₂^{1/2}. By assumption, the kernel approximation is only used at training time, so K′(x, xᵢ) = K(x, xᵢ) for all i ∈ [1, m], and since by definition x = x_{m+1}, the last rows and the last columns of the matrices K_{m+1} and K′_{m+1} coincide.
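Proposition 1 can be checked numerically. The following sketch (NumPy; the Gaussian-kernel data and the eigenvalue-truncated approximation K′ are our own illustrative choices, not constructions from the paper) solves the KRR dual with the exact and approximate kernel matrices and compares the pointwise gap |h′(x) − h(x)| against the bound κM‖K′ − K‖₂/(λ₀²m):

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    # Pairwise Gaussian kernel; K(x, x) = 1, so kappa = 1 as in the text.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
m, lam0 = 200, 0.1
X = rng.standard_normal((m, 2))
y = np.clip(np.sin(X[:, 0]) + 0.1 * rng.standard_normal(m), -1.0, 1.0)
M, kappa = 1.0, 1.0                 # |y_i| <= M and K(x, x) <= kappa
lam = m * lam0                      # ridge parameter lambda = m * lambda_0

K = gaussian_kernel(X, X)
# Illustrative SPSD approximation K': keep only the top 20 eigenvalues of K.
s, U = np.linalg.eigh(K)
s_top = np.where(s >= np.sort(s)[-20], s, 0.0)
K_approx = (U * s_top) @ U.T

alpha = np.linalg.solve(K + lam * np.eye(m), y)           # alpha = (K + lam I)^-1 y
alpha_p = np.linalg.solve(K_approx + lam * np.eye(m), y)  # approximate dual solution

# Both hypotheses are evaluated with the TRUE kernel rows k_x, since the
# approximation is only used at training time, as assumed in the analysis.
X_test = rng.standard_normal((50, 2))
Kx = gaussian_kernel(X_test, X)
diff = np.abs(Kx @ alpha_p - Kx @ alpha).max()
bound = kappa * M * np.linalg.norm(K_approx - K, 2) / (lam0 ** 2 * m)
print(f"max |h'(x) - h(x)| = {diff:.2e}, bound = {bound:.2e}")
```

On this example the observed gap typically sits well below the bound, consistent with the worst-case nature of Proposition 1.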
Therefore, the matrix K′_{m+1} − K_{m+1} coincides with the matrix K′ − K bordered with a zero column and a zero row, and ‖K′_{m+1}^{1/2} − K_{m+1}^{1/2}‖₂ ≤ ‖K′ − K‖₂^{1/2}. Thus,

  |h′(x) − h(x)| ≤ √2 C₀κ^{3/4}‖K′ − K‖₂^{1/4} + C₀κ^{1/2}‖K′ − K‖₂^{1/2},   (19)

which is exactly the statement of the proposition. ∎

Since the hinge loss L is 1-Lipschitz, Proposition 2 leads directly to the following bound on the pointwise difference of the hinge loss between the hypotheses h′ and h.

Corollary 1. Let h′ denote the hypothesis returned by SVMs when using the approximate kernel matrix K′ ∈ ℝ^{m×m}. Then, the following inequality holds for all x ∈ X and y ∈ Y:

  |L(yh′(x)) − L(yh(x))| ≤ √2 κ^{3/4} C₀ ‖K′ − K‖₂^{1/4} [1 + (‖K′ − K‖₂ / (4κ))^{1/4}].   (20)

The bounds we obtain for SVMs are weaker than our bound for KRR. This is due mainly to the different loss functions defining the optimization problems of these algorithms.

2.3 Graph Laplacian regularization algorithms

We lastly study the kernel stability of graph Laplacian regularization algorithms such as that of Belkin et al. (2004). Given a connected weighted graph G = (X, E) in which edge weights can be interpreted as similarities between vertices, the task consists of predicting the vertex labels of u vertices using a labeled training sample S of m vertices. The input space X is thus reduced to the set of vertices, and a hypothesis h: X → ℝ can be identified with the finite-dimensional vector h of its predictions h = [h(x₁), …, h(x_{m+u})]⊤. The hypothesis set H can thus be identified with ℝ^{m+u} here. Let h_S denote the restriction of h to the training points, [h(x₁), …, h(x_m)]⊤ ∈ ℝ^m, and similarly let y_S denote [y₁, …, y_m]⊤ ∈ ℝ^m. Then, the following is the optimization problem corresponding to this algorithm:

  min_{h ∈ H}  m h⊤Lh + (h_S − y_S)⊤(h_S − y_S)   (21)
  subject to h⊤1 = 0,

where L is the graph Laplacian and 1 the column vector with all entries equal to 1. Thus, h⊤Lh = Σ_{i,j=1}^{m+u} w_{ij}(h(xᵢ) − h(xⱼ))², for some weight matrix (w_{ij}). The label vector y is assumed to be centered, which implies 1⊤y = 0. Since the graph is connected, the eigenvalue zero of the Laplacian has multiplicity one.

Define I_S ∈ ℝ^{(m+u)×(m+u)} to be the diagonal matrix with [I_S]_{i,i} = 1 if i ≤ m and 0 otherwise. Maintaining the notation used in Belkin et al. (2004), we let P_H denote the projection onto the hyperplane H orthogonal to 1, and let M = P_H(mL + I_S) and M′ = P_H(mL′ + I_S). We denote by h the hypothesis returned by the algorithm when using the exact Laplacian L, and by L′ an approximate graph Laplacian such that h⊤L′h = Σ_{i,j=1}^{m+u} w′_{ij}(h(xᵢ) − h(xⱼ))², based on the matrix (w′_{ij}) instead of (w_{ij}). We shall assume that there exists M > 0 such that |yᵢ| ≤ M for i ∈ [1, m].

Proposition 3. Let h′ denote the hypothesis returned by the graph-Laplacian regularization algorithm when using an approximate Laplacian L′ ∈ ℝ^{(m+u)×(m+u)}. Then, the following inequality holds:

  ‖h′ − h‖ ≤ (m^{3/2}M / (mλ̃₂ − 1)²) ‖L′ − L‖₂,   (22)

where λ̃₂ = min{λ₂, λ′₂}, with λ₂ denoting the second smallest eigenvalue of the Laplacian L and λ′₂ the second smallest eigenvalue of L′.

Proof. The closed-form solution of Problem (21) is given by Belkin et al. (2004): h = (P_H(mL + I_S))⁻¹y_S. Thus, we can use that expression and the matrix identity for M′⁻¹ − M⁻¹ already used in the analysis of KRR to write

  h′ − h = M′⁻¹y_S − M⁻¹y_S   (23)
         = (M′⁻¹ − M⁻¹)y_S   (24)
         = M⁻¹(M − M′)M′⁻¹y_S,   (25)

so that, since M − M′ = mP_H(L − L′) and ‖P_H(L − L′)‖ ≤ ‖L − L′‖₂,

  ‖h′ − h‖ ≤ ‖M⁻¹‖ ‖M − M′‖ ‖M′⁻¹‖ ‖y_S‖   (26)
           ≤ m ‖M⁻¹‖ ‖M′⁻¹‖ ‖y_S‖ ‖L′ − L‖₂.   (27)

For any column vector z ∈ ℝ^{m+u}, by the triangle inequality and the projection property ‖P_H z‖ ≤ ‖z‖, the following inequalities hold:

  ‖P_H mLz‖ = ‖P_H(mL + I_S)z − P_H I_S z‖   (28)
            ≤ ‖P_H(mL + I_S)z‖ + ‖P_H I_S z‖   (29)
            ≤ ‖P_H(mL + I_S)z‖ + ‖I_S z‖.   (30)

This yields the lower bound, for z orthogonal to 1:

  ‖Mz‖ = ‖P_H(mL + I_S)z‖   (31)
        ≥ ‖P_H mLz‖ − ‖I_S z‖   (32)
        ≥ (mλ₂ − 1)‖z‖,   (33)

since then ‖P_H mLz‖ = m‖Lz‖ ≥ mλ₂‖z‖ and ‖I_S z‖ ≤ ‖z‖. Assuming mλ₂ > 1 and mλ′₂ > 1, this gives the following upper bounds on ‖M⁻¹‖ and ‖M′⁻¹‖: ‖M⁻¹‖ ≤ 1/(mλ₂ − 1) and ‖M′⁻¹‖ ≤ 1/(mλ′₂ − 1). Plugging these inequalities into (27) and using ‖y_S‖ ≤ m^{1/2}M leads to

  ‖h′ − h‖ ≤ (m^{3/2}M / ((mλ₂ − 1)(mλ′₂ − 1))) ‖L′ − L‖₂ ≤ (m^{3/2}M / (mλ̃₂ − 1)²) ‖L′ − L‖₂. ∎

The generalization bounds for the graph-Laplacian algorithm are of the form R(h) ≤ R̂(h) + O(1/(mλ₂ − 1)) (Belkin et al., 2004). In view of the bound given by the proposition, this suggests that the approximation tolerated should verify ‖L′ − L‖₂ ≤ O(1/√m).

3 Application to the Nyström method

The previous section provided stability analyses for several common learning algorithms, studying the effect of using an approximate kernel matrix instead of the true one. The difference in hypothesis value is expressed simply in terms of the difference between the kernels measured by some norm. Although these bounds are general and independent of how the approximation is obtained (so long as K′ remains SPSD), one relevant application of these bounds involves the Nyström method. As shown by Williams and Seeger (2000), and later by Drineas and Mahoney (2005), Talwalkar et al. (2008), and Zhang et al. (2008), low-rank approximations of the kernel matrix via the Nyström method can provide an effective technique for tackling large-scale data sets. However, all previous theoretical work analyzing the performance of the Nyström method has focused on the quality of the low-rank approximations, rather than the performance of the kernel learning algorithms used in conjunction with these approximations. In this section, we first provide a brief review of the Nyström method and then show how we can leverage the analysis of Section 2 to present novel performance guarantees for the Nyström method in the context of kernel learning algorithms.

3.1 Nyström method

The Nyström approximation of a symmetric positive semidefinite (SPSD) matrix K is based on a sample of n columns of K (Drineas and Mahoney, 2005; Williams and Seeger, 2000). Let C denote the m × n matrix formed by these columns and W the n × n matrix consisting of the intersection of these n columns with the corresponding n rows of K. The columns and rows of K can be rearranged based on this sampling so that K and C are written as follows:

  K = [ W    K₂₁⊤ ]        C = [ W   ]
      [ K₂₁  K₂₂  ],           [ K₂₁ ].   (34)

Note that W is also SPSD since K is SPSD.

Table 1: Summary of datasets used in our experiments (Asuncion and Newman, 2007; Ghahramani, 1996).

  Dataset   Description                       # Points (m)   # Features (d)   Kernel   Largest label (M)
  ABALONE   physical attributes of abalones       4177             8           RBF            29
  KIN-8nm   kinematics of robot arm               4000             8           RBF            1.5
For a uniform sampling of the columns, the Nyström method generates a rank-k approximation K̃ of K for k ≤ n defined by:

  K̃ = C W_k⁺ C⊤ ≈ K,   (35)

where W_k is the best rank-k approximation of W for the Frobenius norm, that is, W_k = argmin_{rank(V)=k} ‖W − V‖_F, and W_k⁺ denotes the pseudo-inverse of W_k. W_k⁺ can be derived from the singular value decomposition (SVD) of W, W = UΣU⊤, where U is orthonormal and Σ = diag(σ₁, …, σ_n) is a real diagonal matrix with σ₁ ≥ ⋯ ≥ σ_n ≥ 0. For k ≤ rank(W), it is given by W_k⁺ = Σᵢ₌₁^k σᵢ⁻¹UᵢUᵢ⊤, where Uᵢ denotes the ith column of U. Since the running time complexity of the SVD is O(n³) and O(nkm) is required for the multiplication with C, the total complexity of the Nyström approximation computation is O(n³ + nkm).

3.2 Nyström kernel ridge regression

The accuracy of low-rank Nyström approximations has been theoretically analyzed by Drineas and Mahoney (2005) and Kumar et al. (2009c). The following theorem, adapted from Drineas and Mahoney (2005) for the case of uniform sampling, gives an upper bound on the norm-2 error of the Nyström approximation of the form ‖K − K̃‖₂/‖K‖₂ ≤ ‖K − K_k‖₂/‖K‖₂ + O(1/√n). We denote by K_max the maximum diagonal entry of K.

Theorem 1. Let K̃ denote the rank-k Nyström approximation of K based on n columns sampled uniformly at random with replacement from K, and K_k the best rank-k approximation of K. Then, with probability at least 1 − δ, the following inequality holds for any sample of size n:

  ‖K − K̃‖₂ ≤ ‖K − K_k‖₂ + (m/√n) K_max (2 + log(1/δ)).

Theorem 1 focuses on the quality of low-rank approximations. Combining the analysis from Section 2 with this theorem enables us to bound the relative performance of kernel learning algorithms when the Nyström method is used as a means of scaling them. To illustrate this point, Theorem 2 uses Proposition 1 along with Theorem 1 to upper bound the relative performance of kernel ridge regression as a function of the approximation accuracy of the Nyström method.

Theorem 2. Let h′ denote the hypothesis returned by kernel ridge regression when using the approximate rank-k kernel matrix K̃ ∈ ℝ^{m×m} generated using the Nyström method. Then, with probability at least 1 − δ, the following inequality holds for all x ∈ X:

  |h′(x) − h(x)| ≤ (κM / (λ₀²m)) [‖K − K_k‖₂ + (m/√n) K_max (2 + log(1/δ))].

A similar technique can be used to bound the error of the Nyström approximation when used with the other algorithms discussed in Section 2. The results are omitted due to space constraints.
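Definitions (34) and (35) translate directly into a few lines of linear algebra. A minimal sketch (NumPy; we sample uniformly without replacement for simplicity, whereas Theorem 1 assumes sampling with replacement, and the data and parameters are illustrative choices of ours):

```python
import numpy as np

def nystrom(K, n, k, rng):
    """Rank-k Nystrom approximation of an SPSD matrix K from n sampled columns."""
    m = K.shape[0]
    idx = rng.choice(m, size=n, replace=False)  # sampled column indices
    C = K[:, idx]                               # m x n column block
    W = C[idx, :]                               # n x n intersection block
    # W_k^+ = sum_{i<=k} sigma_i^-1 U_i U_i^T from the eigendecomposition of W.
    s, U = np.linalg.eigh(W)
    top = np.argsort(s)[::-1][:k]               # indices of the k largest eigenvalues
    s_k, U_k = s[top], U[:, top]
    inv = np.where(s_k > 1e-12, 1.0 / np.maximum(s_k, 1e-12), 0.0)
    W_k_pinv = (U_k * inv) @ U_k.T
    return C @ W_k_pinv @ C.T                   # K_tilde = C W_k^+ C^T

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.1 * d2)                           # smooth Gaussian kernel matrix
K_tilde = nystrom(K, n=100, k=20, rng=rng)
rel = np.linalg.norm(K - K_tilde, 2) / np.linalg.norm(K, 2)
print(f"relative spectral error: {rel:.4f}")
```

Only the n × n block W is ever decomposed, never the full m × m matrix K, which matches the O(n³ + nkm) cost stated above.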

Figure 1: Average absolute error of the kernel ridge regression hypothesis h′(·) generated from the Nyström approximation K̃, as a function of the relative spectral distance ‖K̃ − K‖₂/‖K‖₂, for the Abalone and Kin-8nm datasets. For each dataset, the reported results show the average absolute error as a function of relative spectral distance for both the full dataset and for a subset of the data containing m = 2000 points. Results for the same value of m are connected with a line. The different points along the lines correspond to various numbers of sampled columns, i.e., n ranging from 1% to 50% of m.

3.3 Nyström Woodbury approximation

The Nyström method provides an effective algorithm for obtaining a rank-k approximation of the kernel matrix. As suggested by Williams and Seeger (2000) in the context of Gaussian processes, this approximation can be combined with the Woodbury inversion lemma to derive an efficient algorithm for inverting the kernel matrix. The Woodbury inversion lemma states that the inverse of a rank-k correction of some matrix can be computed by doing a rank-k correction to the inverse of the original matrix. In the context of KRR, using the rank-k approximation K̃ given by the Nyström method instead of K and applying the inversion lemma yields

  (λI + K)⁻¹ ≈ (λI + K̃)⁻¹   (36)
  = (λI + CW_k⁺C⊤)⁻¹   (37)
  = (1/λ)(I − C[λI_n + W_k⁺C⊤C]⁻¹W_k⁺C⊤).   (38)

Thus, only the inversion of a small matrix of size n (whose rank-k structure can be exploited further) is needed, as opposed to the original problem of size m.

4 Experiments

For our experimental results, we focused on the kernel stability of kernel ridge regression, generating approximate kernel matrices using the Nyström method. We worked with the datasets listed in Table 1, and for each dataset, we randomly selected 80% of the points to generate K and used the remaining 20% as the test set T.
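The Woodbury shortcut of Section 3.3 can be made concrete with a short sketch (NumPy). It uses the push-through form of (38) with an n × n inner system; the random C and the identity standing in for W_k⁺ are illustrative choices for the check, not the authors' code:

```python
import numpy as np

def krr_nystrom_solve(C, B, y, lam):
    """Solve (lam*I_m + C B C^T) alpha = y without forming the m x m matrix.

    Push-through identity:
        (lam*I + C B C^T)^-1 = (1/lam) * (I - C (lam*I_n + B C^T C)^-1 B C^T),
    so only an n x n system is solved instead of an m x m one.
    """
    n = C.shape[1]
    BCt = B @ C.T                                # n x m
    inner = lam * np.eye(n) + BCt @ C            # n x n system matrix
    return (y - C @ np.linalg.solve(inner, BCt @ y)) / lam

# Check against the direct m x m solve on a small problem.
rng = np.random.default_rng(0)
m, n, lam = 300, 40, 1.0
C = rng.standard_normal((m, n))
B = np.eye(n)                                    # stand-in for W_k^+
y = rng.standard_normal(m)
direct = np.linalg.solve(lam * np.eye(m) + C @ B @ C.T, y)
fast = krr_nystrom_solve(C, B, y, lam)
print("agreement:", np.allclose(direct, fast))
```

The two solutions agree up to numerical precision, while the fast path only factors an n × n matrix.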
For each test-train split, we first performed a grid search to determine the optimal ridge parameter for K, as well as the associated optimal hypothesis h(·). Next, using this optimal ridge, we generated a set of Nyström approximations, using various numbers of sampled columns, i.e., n ranging from 1% to 50% of m. For each Nyström approximation K̃, we computed the associated hypothesis h′(·) using the same ridge and measured the distance between h′ and h as follows:

  average absolute error = (1/|T|) Σ_{x ∈ T} |h′(x) − h(x)|.   (39)

We measured the distance between K̃ and K as follows:

  relative spectral distance = (‖K̃ − K‖₂ / ‖K‖₂) × 100.   (40)

Figure 1 presents results for each dataset using all points and a subset of 2000 points. The plots show the average absolute error of h′(·) as a function of the relative spectral distance. Proposition 1 predicts a linear relationship between the kernel approximation error and the relative error, which is corroborated by the experiments, as both datasets display this behavior for different sizes of training data.

5 Conclusion

Kernel approximation is used in a variety of contexts, and its use is crucial for scaling many learning algorithms to very large tasks. We presented a stability-based analysis of the effect of kernel approximation on the hypotheses returned by several common learning algorithms. Our analysis is independent of how the approximation is obtained and simply expresses the change in hypothesis value in terms of the difference between the approximate kernel matrix and the true one, as measured by some norm. We also provided a specific analysis of the Nyström low-rank approximation in this context and reported experimental results that support our theoretical analysis. In practice, the two steps of kernel matrix approximation and training of a kernel-based algorithm are typically executed separately. Work by Bach and Jordan (2005) suggested one possible method for combining these two steps. Perhaps more accurate results could be obtained by combining these two stages using the bounds we presented, or other similar ones based on our analysis.

A Lemma 1

The proof of Lemma 1 is given for completeness.

Lemma 1. Let M and M′ be two n × n SPSD matrices. Then, the following bound holds for the difference of the square-root matrices:

  ‖M′^{1/2} − M^{1/2}‖₂ ≤ ‖M′ − M‖₂^{1/2}.

Proof. By definition of the spectral norm, M′ ⪯ M + ‖M′ − M‖₂ I, where I is the n × n identity matrix. Hence, M′ ⪯ M + λI and M′^{1/2} ⪯ (M + λI)^{1/2}, with λ = ‖M′ − M‖₂. Thus, M′^{1/2} ⪯ (M + λI)^{1/2} ⪯ M^{1/2} + λ^{1/2}I, by sub-additivity of the square root. This shows that M′^{1/2} − M^{1/2} ⪯ λ^{1/2}I and, by symmetry, M^{1/2} − M′^{1/2} ⪯ λ^{1/2}I; thus ‖M′^{1/2} − M^{1/2}‖₂ ≤ ‖M′ − M‖₂^{1/2}, which proves the statement of the lemma. ∎

References

A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.

Francis Bach and Michael Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.

Francis R. Bach and Michael I. Jordan. Predictive low-rank decomposition for kernel methods. In International Conference on Machine Learning, 2005.

M. A. Belabbas and P. J. Wolfe. Spectral methods in machine learning and new strategies for very large datasets. Proceedings of the National Academy of Sciences of the United States of America, 106(2):369–374, January 2009.

Mikhail Belkin, Irina Matveeva, and Partha Niyogi. Regularization and semi-supervised learning on large graphs. In Conference on Learning Theory, 2004.

Olivier Bousquet and André Elisseeff. Algorithmic stability and generalization performance. In Advances in Neural Information Processing Systems, 2001.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175, 2005.

Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2002.

Zoubin Ghahramani. The kin datasets, 1996.

Ling Huang, Donghui Yan, Michael Jordan, and Nina Taft. Spectral clustering with perturbed data. In Advances in Neural Information Processing Systems, 2008.

Zakria Hussain and John Shawe-Taylor. Theory of matching pursuit. In Advances in Neural Information Processing Systems, 2008.

Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Ensemble Nyström method. In Advances in Neural Information Processing Systems, 2009a.

Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. On sampling-based approximate spectral decomposition. In International Conference on Machine Learning, 2009b.

Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling techniques for the Nyström method. In Conference on Artificial Intelligence and Statistics, 2009c.

Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge regression learning algorithm in dual variables. In International Conference on Machine Learning, 1998.

Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press: Cambridge, MA, 2002.

Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.

Ameet Talwalkar, Sanjiv Kumar, and Henry Rowley. Large-scale manifold learning. In Conference on Computer Vision and Pattern Recognition, 2008.

Pascal Vincent and Yoshua Bengio. Kernel matching pursuit. Technical Report 1179, Département d'Informatique et Recherche Opérationnelle, Université de Montréal, 2000.

Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, 2000.

Kai Zhang, Ivor Tsang, and James Kwok. Improved Nyström low-rank approximation and error analysis. In International Conference on Machine Learning, 2008.