On the Impact of Kernel Approximation on Learning Accuracy

Corinna Cortes (Google Research, New York, NY), Mehryar Mohri (Courant Institute and Google Research, New York, NY), Ameet Talwalkar (Courant Institute, New York, NY)

Abstract

Kernel approximation is commonly used to scale kernel-based algorithms to applications containing as many as several million instances. This paper analyzes the effect of such approximations in the kernel matrix on the hypothesis generated by several widely used learning algorithms. We give stability bounds based on the norm of the kernel approximation for these algorithms, including SVMs, KRR, and graph Laplacian-based regularization algorithms. These bounds help determine the degree of approximation that can be tolerated in the estimation of the kernel matrix. Our analysis is general and applies to arbitrary approximations of the kernel matrix. However, we also give a specific analysis of the Nyström low-rank approximation in this context and report the results of experiments evaluating the quality of the Nyström low-rank kernel approximation when used with ridge regression.

1 Introduction

The size of modern-day learning problems found in computer vision, natural language processing, systems design and many other areas is often in the order of hundreds of thousands and can exceed several million. Scaling standard kernel-based algorithms such as support vector machines (SVMs) (Cortes and Vapnik, 1995), kernel ridge regression (KRR) (Saunders et al., 1998), and kernel principal component analysis (KPCA) (Schölkopf et al., 1998) to such magnitudes is a serious issue since even storing the kernel matrix can be prohibitive at this size.

(Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.)

One solution suggested for dealing with such large-scale problems consists of a low-rank approximation of the kernel matrix (Williams and Seeger, 2000). Other variants of these approximation techniques based on the Nyström method have also been recently presented and shown to be applicable to large-scale problems (Belabbas and Wolfe, 2009; Drineas and Mahoney, 2005; Kumar et al., 2009a; Talwalkar et al., 2008; Zhang et al., 2008). Kernel approximations based on other techniques such as column sampling (Kumar et al., 2009b), incomplete Cholesky decomposition (Bach and Jordan, 2002; Fine and Scheinberg, 2002), or Kernel Matching Pursuit (KMP) (Hussain and Shawe-Taylor, 2008; Vincent and Bengio, 2000) have also been widely used in large-scale learning applications.

But how does the kernel approximation affect the performance of the learning algorithm? There exists some previous work on this subject. Spectral clustering with perturbed data was studied in a restrictive setting with several assumptions by Huang et al. (2008). In Fine and Scheinberg (2002), the authors address this question in terms of the impact on the value of the objective function to be optimized by the learning algorithm. However, we strive to take the question one step further and directly analyze the effect of an approximation in the kernel matrix on the hypothesis generated by several widely used kernel-based learning algorithms. We give stability bounds based on the norm of the kernel approximation for these algorithms, including SVMs, KRR, and graph Laplacian-based regularization algorithms (Belkin et al., 2004). These bounds help determine the degree of approximation that can be tolerated in the estimation of the kernel matrix.

Our analysis differs from previous applications of stability analysis as put forward by Bousquet and Elisseeff (2001). Instead of studying the effect of changing one training point, we study the effect of changing the kernel matrix. Our analysis is general and applies to arbitrary approximations of the kernel matrix. However, we also give a specific analysis of the Nyström low-rank approximation, given the recent interest in this method and the successful applications of this algorithm to large-scale problems. We also report the results of experiments evaluating the quality of this kernel approximation when used with ridge regression.

The remainder of this paper is organized as follows. Section 2 introduces the problem of kernel stability and gives a kernel stability analysis of several algorithms. Section 3 provides a brief review of the Nyström approximation method and gives error bounds that can be combined with the kernel stability bounds. Section 4 reports the results of experiments with kernel approximation combined with kernel ridge regression.

2 Kernel Stability Analysis

In this section we analyze the impact of kernel approximation on several common kernel-based learning algorithms: KRR, SVMs, and graph Laplacian-based regularization algorithms. Our stability analyses result in bounds on the hypotheses directly in terms of the quality of the kernel approximation. In our analysis we assume that the kernel approximation is only used during training, where the approximation may serve to reduce resource requirements. At testing time the true kernel function is used. This scenario is standard for the Nyström method and other approximations.

We consider the standard supervised learning setting where the learning algorithm receives a sample of m labeled points S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m, where X is the input space and Y the set of labels, with Y = R and |y| ≤ M in the regression case, and Y = {−1, +1} in the classification case. Throughout the paper, the kernel matrix K and its approximation K′ are assumed to be symmetric, positive, and semi-definite (SPSD).

2.1 Kernel Ridge Regression

We first provide a stability analysis of kernel ridge regression. The following is the dual optimization problem solved by KRR (Saunders et al., 1998):

    max_{α ∈ R^m}  −λ α⊤α − α⊤Kα + 2α⊤y,    (1)

where λ = mλ_0 > 0 is the ridge parameter. The problem admits the closed-form solution α = (K + λI)^{-1} y. We denote by h the hypothesis returned by kernel ridge regression when using the exact kernel matrix.

Proposition 1. Let h′ denote the hypothesis returned by kernel ridge regression when using the approximate kernel matrix K′ ∈ R^{m×m}. Furthermore, define κ > 0 such that K(x, x) ≤ κ and K′(x, x) ≤ κ for all x ∈ X. This condition is verified with κ = 1 for Gaussian kernels, for example. Then the following inequality holds for all x ∈ X:

    |h′(x) − h(x)| ≤ (κ M)/(λ_0^2 m) ‖K′ − K‖_2.    (2)

Proof. Let α′ denote the solution obtained using the approximate kernel matrix K′. We can write

    α′ − α = (K′ + λI)^{-1} y − (K + λI)^{-1} y    (3)
           = −[(K′ + λI)^{-1} (K′ − K) (K + λI)^{-1}] y,    (4)

where we used the identity M′^{-1} − M^{-1} = −M′^{-1}(M′ − M)M^{-1}, valid for any invertible matrices M, M′. Thus, ‖α′ − α‖ can be bounded as follows:

    ‖α′ − α‖ ≤ ‖(K′ + λI)^{-1}‖ ‖K′ − K‖_2 ‖(K + λI)^{-1}‖ ‖y‖ ≤ ‖K′ − K‖_2 ‖y‖ / (λ_min(K′ + λI) λ_min(K + λI)),    (5)

where λ_min(K′ + λI) is the smallest eigenvalue of K′ + λI and λ_min(K + λI) the smallest eigenvalue of K + λI. The hypothesis h derived with the exact kernel matrix is defined by h(x) = Σ_{i=1}^m α_i K(x, x_i) = α⊤ k_x, where k_x = (K(x, x_1), ..., K(x, x_m))⊤. By assumption, no approximation affects k_x, thus the approximate hypothesis h′ is given by h′(x) = α′⊤ k_x and

    |h′(x) − h(x)| ≤ ‖α′ − α‖ ‖k_x‖ ≤ √m κ ‖α′ − α‖.    (6)

Using the bound on ‖α′ − α‖ given by inequality (5), the fact that the eigenvalues of (K + λI) and (K′ + λI) are larger than or equal to λ = mλ_0 since K and K′ are SPSD matrices, and ‖y‖ ≤ √m M, yields

    |h′(x) − h(x)| ≤ (κ m M ‖K′ − K‖_2) / (λ_min(K′ + λI) λ_min(K + λI)) ≤ (κ M)/(λ_0^2 m) ‖K′ − K‖_2.

The generalization bounds for KRR, e.g., stability bounds (Bousquet and Elisseeff, 2001), are of the form R(h) ≤ R̂(h) + O(1/√m), where R(h) denotes the generalization error and R̂(h) the empirical error of a hypothesis h with respect to the square loss. The proposition shows that (h′(x) − h(x))^2 = O(‖K′ − K‖_2^2 / (λ_0^4 m^2)). Thus, it suggests that the kernel approximation tolerated should be such that ‖K′ − K‖_2^2 / (λ_0^4 m^2) ≤ O(1/√m), that is, such that ‖K′ − K‖_2 ≤ O(λ_0^2 m^{3/4}).

Note that the main bound used in the proof of the proposition, inequality (5), is tight in the sense that it can be matched for some kernels K and K′. Indeed, let K and K′ be the kernel functions defined by K(x, y) = β and K′(x, y) = β′ if x = y, and K′(x, y) = K(x, y) = 0 otherwise, with β, β′ ≥ 0. Then, the corresponding kernel matrices for a sample S are K = βI and K′ = β′I, and the dual parameter vectors are given by α = y/(β + λ) and α′ = y/(β′ + λ). Now, since λ_min(K′ + λI) = β′ + λ and λ_min(K + λI) = β + λ, and ‖K′ − K‖_2 = |β′ − β|, the following equality holds:

    ‖α′ − α‖ = |β − β′| ‖y‖ / ((β + λ)(β′ + λ))    (7)
             = ‖K′ − K‖_2 ‖y‖ / (λ_min(K + λI) λ_min(K′ + λI)).    (8)

This limits significant improvements of the bound of Proposition 1 using similar techniques.
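
To make the role of the different quantities concrete, the following minimal sketch (not the authors' code; the Gaussian kernel, the rank-r eigenvalue truncation playing the role of K′, and all numerical values are assumptions chosen for illustration) solves the KRR dual with an approximate SPSD matrix and compares the resulting pointwise gap with the bound of Proposition 1 for κ = 1.

```python
import numpy as np

def krr_dual(K, y, lam):
    """KRR dual solution: alpha = (K + lam*I)^{-1} y."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

rng = np.random.default_rng(0)
m, lam0 = 300, 0.1
X = rng.normal(size=(m, 2))
y = np.sin(X[:, 0]); M = np.max(np.abs(y))                     # |y_i| <= M
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)                                                # exact Gaussian kernel, kappa = 1

# SPSD approximation K' of K: keep only the top-r eigenvalues.
w, V = np.linalg.eigh(K)
r = 20
Kp = (V[:, -r:] * w[-r:]) @ V[:, -r:].T

lam = m * lam0
alpha, alpha_p = krr_dual(K, y, lam), krr_dual(Kp, y, lam)

# At test time the exact kernel is used, so h'(x) - h(x) = (alpha' - alpha)^T k_x;
# here the columns of K serve as the test vectors k_x.
gap = np.max(np.abs(K @ (alpha_p - alpha)))
bound = 1.0 * M * np.linalg.norm(Kp - K, 2) / (lam0 ** 2 * m)  # Proposition 1, kappa = 1
print(f"max pointwise gap = {gap:.4f}, bound = {bound:.4f}")
```
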

2.2 Support Vector Machines

This section analyzes the kernel stability of SVMs. For simplicity, we shall consider the case where the classification function sought has no offset. In practice, this corresponds to using a constant feature. Let Φ: X → F denote a feature mapping from the input space X to a Hilbert space F corresponding to some kernel K. The hypothesis set we consider is thus H = {h : ∃w ∈ F, ∀x ∈ X, h(x) = w⊤Φ(x)}. The following is the standard primal optimization problem for SVMs:

    min_{w ∈ F} F_K(w) = (1/2)‖w‖^2 + C_0 R̂_K(w),    (9)

where R̂_K(w) = (1/m) Σ_{i=1}^m L(y_i w⊤Φ(x_i)) is the empirical error, with L(y_i w⊤Φ(x_i)) = max(0, 1 − y_i w⊤Φ(x_i)) the hinge loss associated to the ith point.

In the following, we analyze the difference between the hypothesis h returned by SVMs when trained on the sample S of m points using a kernel K, versus the hypothesis h′ obtained when training on the same sample with the kernel K′. For a fixed x ∈ X, we shall compare more specifically h(x) and h′(x). Thus, we can work with the finite set X_{m+1} = {x_1, ..., x_m, x_{m+1}}, with x_{m+1} = x.

Different feature mappings Φ can be associated to the same kernel K. To compare the solutions w and w′ of the optimization problems based on F_K and F_{K′}, we can choose the feature mappings Φ and Φ′ associated to K and K′ such that they both map to R^{m+1} as follows. Let K_{m+1} denote the Gram matrix associated to K and K′_{m+1} that of kernel K′ for the set of points X_{m+1}. Then for all u ∈ X_{m+1}, Φ and Φ′ can be defined by

    Φ(u) = (K_{m+1}^†)^{1/2} [K(x_1, u), ..., K(x_{m+1}, u)]⊤    (10)
    Φ′(u) = (K′^†_{m+1})^{1/2} [K′(x_1, u), ..., K′(x_{m+1}, u)]⊤,    (11)

where K_{m+1}^† denotes the pseudo-inverse of K_{m+1} and K′^†_{m+1} that of K′_{m+1}. It is not hard to see then that for all u, v ∈ X_{m+1}, K(u, v) = Φ(u)⊤Φ(v) and K′(u, v) = Φ′(u)⊤Φ′(v) (Schölkopf and Smola, 2002). Since the optimization problem depends only on the sample S, we can use the feature mappings just defined in the expression of F_K and F_{K′}. This does not affect in any way the standard SVM optimization problem.

Let w ∈ R^{m+1} denote the minimizer of F_K and w′ that of F_{K′}. By definition, if we let Δw denote w′ − w, for all s ∈ [0, 1] the following inequalities hold:

    F_K(w) ≤ F_K(w + sΔw)    (12)
    F_{K′}(w′) ≤ F_{K′}(w′ − sΔw).    (13)

Summing these two inequalities, rearranging terms, and using the identity (‖w + sΔw‖^2 − ‖w‖^2) + (‖w′ − sΔw‖^2 − ‖w′‖^2) = −2s(1 − s)‖Δw‖^2, we obtain, as in (Bousquet and Elisseeff, 2001),

    s(1 − s)‖Δw‖^2 ≤ C_0 [ (R̂_K(w + sΔw) − R̂_K(w)) + (R̂_{K′}(w′ − sΔw) − R̂_{K′}(w′)) ].

Note that w + sΔw = sw′ + (1 − s)w and w′ − sΔw = sw + (1 − s)w′. Then, by the convexity of the hinge loss and thus of R̂_K and R̂_{K′}, the following inequalities hold:

    R̂_K(w + sΔw) − R̂_K(w) ≤ s (R̂_K(w′) − R̂_K(w))
    R̂_{K′}(w′ − sΔw) − R̂_{K′}(w′) ≤ s (R̂_{K′}(w) − R̂_{K′}(w′)).

Plugging these inequalities into the right-hand side, simplifying by s, and taking the limit s → 0 yields

    ‖Δw‖^2 ≤ C_0 [ (R̂_K(w′) − R̂_K(w)) + (R̂_{K′}(w) − R̂_{K′}(w′)) ]
           = (C_0/m) Σ_{i=1}^m [ (L(y_i w′⊤Φ(x_i)) − L(y_i w′⊤Φ′(x_i))) + (L(y_i w⊤Φ′(x_i)) − L(y_i w⊤Φ(x_i))) ],

where the last equality results from the definition of the empirical error.

Since the hinge loss is 1-Lipschitz, we can bound the terms on the right-hand side as follows:

    ‖Δw‖^2 ≤ (C_0/m) Σ_{i=1}^m [ |w′⊤(Φ′(x_i) − Φ(x_i))| + |w⊤(Φ′(x_i) − Φ(x_i))| ]    (14)
           ≤ (C_0/m) (‖w‖ + ‖w′‖) Σ_{i=1}^m ‖Φ′(x_i) − Φ(x_i)‖.    (15)

Let e_i denote the ith unit vector of R^{m+1}; then (K(x_1, x_i), ..., K(x_{m+1}, x_i))⊤ = K_{m+1} e_i. Thus, in view of the definition of Φ, for all i ∈ [1, m+1],

    Φ(x_i) = (K_{m+1}^†)^{1/2} [K(x_1, x_i), ..., K(x_{m+1}, x_i)]⊤ = (K_{m+1}^†)^{1/2} K_{m+1} e_i = K_{m+1}^{1/2} e_i,    (16)

and similarly Φ′(x_i) = K′^{1/2}_{m+1} e_i. K^{1/2}_{m+1} e_i is the ith column of K^{1/2}_{m+1} and similarly K′^{1/2}_{m+1} e_i is the ith column of K′^{1/2}_{m+1}. Thus, (15) can be rewritten as

    ‖w′ − w‖^2 ≤ (C_0/m) (‖w‖ + ‖w′‖) Σ_{i=1}^m ‖(K′^{1/2}_{m+1} − K^{1/2}_{m+1}) e_i‖.

As for the case of ridge regression, we shall assume that there exists κ > 0 such that K(x, x) ≤ κ and K′(x, x) ≤ κ for all x ∈ X_{m+1}. Now, since w can be written in terms of the dual variables 0 ≤ α_i ≤ C, with C = C_0/m, as w = Σ_{i=1}^m α_i K(x_i, ·), it can be bounded as ‖w‖ ≤ m (C_0/m) κ^{1/2} = C_0 κ^{1/2}, and similarly ‖w′‖ ≤ C_0 κ^{1/2}. Thus, we can write

    ‖w′ − w‖^2 ≤ (2C_0^2 κ^{1/2}/m) Σ_{i=1}^m ‖(K′^{1/2}_{m+1} − K^{1/2}_{m+1}) e_i‖
               ≤ (2C_0^2 κ^{1/2}/m) Σ_{i=1}^m ‖K′^{1/2}_{m+1} − K^{1/2}_{m+1}‖_2 ‖e_i‖
               = 2C_0^2 κ^{1/2} ‖K′^{1/2}_{m+1} − K^{1/2}_{m+1}‖_2.    (17)

Let K denote the Gram matrix associated to K and K′ that of kernel K′ for the sample S. Then, the following result holds.

Proposition 2. Let h′ denote the hypothesis returned by SVMs when using the approximate kernel matrix K′ ∈ R^{m×m}. Then, the following inequality holds for all x ∈ X:

    |h′(x) − h(x)| ≤ √2 κ^{3/4} C_0 ‖K′ − K‖_2^{1/4} [ 1 + (‖K′ − K‖_2/(4κ))^{1/4} ].    (18)

Proof. In view of (16) and (17), the following holds:

    |h′(x) − h(x)| = |w′⊤Φ′(x) − w⊤Φ(x)|
                   = |(w′ − w)⊤Φ′(x) + w⊤(Φ′(x) − Φ(x))|
                   ≤ ‖w′ − w‖ ‖Φ′(x)‖ + ‖w‖ ‖Φ′(x_{m+1}) − Φ(x_{m+1})‖
                   ≤ [2C_0^2 κ^{1/2} ‖K′^{1/2}_{m+1} − K^{1/2}_{m+1}‖_2]^{1/2} κ^{1/2} + C_0 κ^{1/2} ‖(K′^{1/2}_{m+1} − K^{1/2}_{m+1}) e_{m+1}‖
                   ≤ √2 C_0 κ^{3/4} ‖K′^{1/2}_{m+1} − K^{1/2}_{m+1}‖_2^{1/2} + C_0 κ^{1/2} ‖K′^{1/2}_{m+1} − K^{1/2}_{m+1}‖_2.

Now, by Lemma 1 (see Appendix), ‖K′^{1/2}_{m+1} − K^{1/2}_{m+1}‖_2 ≤ ‖K′_{m+1} − K_{m+1}‖_2^{1/2}. By assumption, the kernel approximation is only used at training time, so K′(x, x_i) = K(x, x_i) for all i ∈ [1, m], and since by definition x = x_{m+1}, the last rows and the last columns of the matrices K′_{m+1} and K_{m+1} coincide. Therefore, the matrix K′_{m+1} − K_{m+1} coincides with the matrix K′ − K bordered with a zero column and a zero row, and ‖K′^{1/2}_{m+1} − K^{1/2}_{m+1}‖_2 ≤ ‖K′ − K‖_2^{1/2}. Thus,

    |h′(x) − h(x)| ≤ √2 C_0 κ^{3/4} ‖K′ − K‖_2^{1/4} + C_0 κ^{1/2} ‖K′ − K‖_2^{1/2},    (19)

which is exactly the statement of the proposition.

Since the hinge loss L is 1-Lipschitz, Proposition 2 leads directly to the following bound on the pointwise difference of the hinge loss between the hypotheses h′ and h.

Corollary 1. Let h′ denote the hypothesis returned by SVMs when using the approximate kernel matrix K′ ∈ R^{m×m}. Then, the following inequality holds for all x ∈ X and y ∈ Y:

    |L(y h′(x)) − L(y h(x))| ≤ √2 κ^{3/4} C_0 ‖K′ − K‖_2^{1/4} [ 1 + (‖K′ − K‖_2/(4κ))^{1/4} ].    (20)

The bounds we obtain for SVMs are weaker than our bound for KRR. This is due mainly to the different loss functions defining the optimization problems of these algorithms.
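
To give a sense of the different dependencies on ‖K′ − K‖_2, the short sketch below (an illustration only; the helper names and the parameter values κ = 1, C_0 = 1, M = 1, λ_0 = 0.1, m = 1000 are assumptions) evaluates the pointwise bounds of Propositions 1 and 2 side by side, making the fourth-root versus linear behavior explicit.

```python
import numpy as np

def svm_bound(d, kappa=1.0, C0=1.0):
    """Proposition 2: sqrt(2)*kappa^(3/4)*C0*d^(1/4)*(1 + (d/(4*kappa))^(1/4)),
    where d = ||K' - K||_2."""
    return np.sqrt(2) * kappa ** 0.75 * C0 * d ** 0.25 * (1 + (d / (4 * kappa)) ** 0.25)

def krr_bound(d, kappa=1.0, M=1.0, lam0=0.1, m=1000):
    """Proposition 1: kappa*M*d / (lam0^2 * m), where d = ||K' - K||_2."""
    return kappa * M * d / (lam0 ** 2 * m)

for d in [1e-4, 1e-2, 1e0, 1e2]:
    print(f"||K'-K||_2 = {d:8.0e}   SVM bound = {svm_bound(d):9.4f}   KRR bound = {krr_bound(d):9.4f}")
```
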

2.3 Graph Laplacian regularization algorithms

We lastly study the kernel stability of graph Laplacian regularization algorithms such as that of Belkin et al. (2004). Given a connected weighted graph G = (X, E) in which edge weights can be interpreted as similarities between vertices, the task consists of predicting the vertex labels of u vertices using a labeled training sample S of m vertices. The input space X is thus reduced to the set of vertices, and a hypothesis h: X → R can be identified with the finite-dimensional vector h of its predictions h = [h(x_1), ..., h(x_{m+u})]⊤. The hypothesis set H can thus be identified with R^{m+u} here. Let h_S denote the restriction of h to the training points, [h(x_1), ..., h(x_m)]⊤ ∈ R^m, and similarly let y_S denote [y_1, ..., y_m]⊤ ∈ R^m. Then, the following is the optimization problem corresponding to this problem:

    min_{h ∈ H}  (m/C_0) h⊤Lh + (h_S − y_S)⊤(h_S − y_S)    (21)
    subject to   h⊤1 = 0,

where L is the graph Laplacian, 1 the column vector with all entries equal to 1, and C_0 > 0 a trade-off parameter. Thus, h⊤Lh = Σ_{i,j=1}^{m+u} w_{ij}(h(x_i) − h(x_j))^2 for some weight matrix (w_{ij}). The label vector y is assumed to be centered, which implies 1⊤y = 0. Since the graph is connected, the eigenvalue zero of the Laplacian has multiplicity one.

Define I_S ∈ R^{(m+u)×(m+u)} to be the diagonal matrix with [I_S]_{i,i} = 1 if i ≤ m and 0 otherwise. Maintaining the notation used in Belkin et al. (2004), we let P_H denote the projection on the hyperplane H orthogonal to 1 and let M = P_H((m/C_0)L + I_S) and M′ = P_H((m/C_0)L′ + I_S). We denote by h the hypothesis returned by the algorithm when using the exact Laplacian L, and by L′ an approximate graph Laplacian such that h⊤L′h = Σ_{i,j=1}^{m+u} w′_{ij}(h(x_i) − h(x_j))^2, based on a weight matrix (w′_{ij}) instead of (w_{ij}). We shall assume that there exists M > 0 such that |y_i| ≤ M for i ∈ [1, m].

Proposition 3. Let h′ denote the hypothesis returned by the graph-Laplacian regularization algorithm when using an approximate Laplacian L′ ∈ R^{(m+u)×(m+u)}. Then, the following inequality holds:

    ‖h′ − h‖ ≤ (m^{3/2} M)/(C_0 (λ̃_2 m/C_0 − 1)^2) ‖L′ − L‖_2,    (22)

where λ̃_2 = min{λ_2, λ′_2}, with λ_2 denoting the second smallest eigenvalue of the Laplacian L and λ′_2 the second smallest eigenvalue of L′.

Proof. The closed-form solution of Problem (21) is given by Belkin et al. (2004): h = (P_H((m/C_0)L + I_S))^{-1} y_S, where y_S is padded with zeros for the unlabeled points. Thus, we can use that expression and the matrix identity for M′^{-1} − M^{-1} that we already used in the analysis of KRR to write

    ‖h′ − h‖ = ‖M′^{-1} y_S − M^{-1} y_S‖    (23)
             = ‖(M′^{-1} − M^{-1}) y_S‖    (24)
             = ‖M′^{-1}(M − M′)M^{-1} y_S‖    (25)
             = (m/C_0) ‖M′^{-1} P_H(L − L′) M^{-1} y_S‖    (26)
             ≤ (m/C_0) ‖M′^{-1}‖ ‖M^{-1}‖ ‖y_S‖ ‖L′ − L‖_2.    (27)

For any column vector z ∈ R^{(m+u)×1} orthogonal to 1, by the triangle inequality and the projection property ‖P_H z‖ ≤ ‖z‖, the following inequalities hold:

    (m/C_0)‖P_H L z‖ = ‖P_H((m/C_0)L + I_S)z − P_H I_S z‖    (28)
                     ≤ ‖P_H((m/C_0)L + I_S)z‖ + ‖P_H I_S z‖    (29)
                     ≤ ‖P_H((m/C_0)L + I_S)z‖ + ‖I_S z‖.    (30)

This yields the lower bound

    ‖Mz‖ = ‖P_H((m/C_0)L + I_S)z‖    (31)
         ≥ (m/C_0)‖P_H L z‖ − ‖I_S z‖    (32)
         ≥ (λ_2 m/C_0 − 1)‖z‖,    (33)

which gives the following upper bounds on ‖M^{-1}‖ and ‖M′^{-1}‖:

    ‖M^{-1}‖ ≤ 1/(λ_2 m/C_0 − 1)   and   ‖M′^{-1}‖ ≤ 1/(λ′_2 m/C_0 − 1).

Plugging these inequalities into (27) and using ‖y_S‖ ≤ m^{1/2} M leads to

    ‖h′ − h‖ ≤ (m^{3/2} M)/(C_0 (λ_2 m/C_0 − 1)(λ′_2 m/C_0 − 1)) ‖L′ − L‖_2.

The generalization bounds for the graph-Laplacian algorithm are of the form R(h) ≤ R̂(h) + O(C_0/((λ_2 m/C_0 − 1)^2 √m)) (Belkin et al., 2004). In view of the bound given by the proposition, this suggests that the approximation tolerated should verify ‖L′ − L‖_2 ≤ O(1/√m).
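
As a concrete illustration of problem (21), the following sketch (a toy example under the formulation written above, with a hypothetical path graph, labels, and trade-off value; not the authors' code) solves the equality-constrained quadratic program directly through its KKT system.

```python
import numpy as np

def graph_laplacian_solution(W, y_S, m, C0):
    """Solve problem (21): min (m/C0)*h'Lh + ||h_S - y_S||^2 s.t. 1'h = 0.
    W is the (m+u) x (m+u) symmetric weight matrix; the first m vertices are labeled."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian
    a = m / C0
    I_S = np.zeros((n, n)); I_S[:m, :m] = np.eye(m)
    y_pad = np.zeros(n); y_pad[:m] = y_S           # labels padded with zeros
    # KKT system of the equality-constrained quadratic program.
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = 2 * a * L + 2 * I_S
    A[:n, n] = 1.0                                  # column for the multiplier of 1'h = 0
    A[n, :n] = 1.0
    b = np.concatenate([2 * y_pad, [0.0]])
    return np.linalg.solve(A, b)[:n]

# Hypothetical toy graph: a path on 6 vertices, the first 3 of them labeled (centered labels).
W = np.zeros((6, 6))
for i in range(5):
    W[i, i + 1] = W[i + 1, i] = 1.0
h = graph_laplacian_solution(W, y_S=np.array([1.0, -1.0, 0.0]), m=3, C0=1.0)
print(h, h.sum())   # the solution satisfies the constraint: its entries sum to zero
```
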

3 Application to the Nyström method

The previous section provided stability analyses for several common learning algorithms, studying the effect of using an approximate kernel matrix instead of the true one. The difference in hypothesis value is expressed simply in terms of the difference between the kernels measured by some norm. Although these are general bounds that are independent of how the approximation is obtained (so long as K′ remains SPSD), one relevant application of these bounds involves the Nyström method. As shown by Williams and Seeger (2000), and later by Drineas and Mahoney (2005), Talwalkar et al. (2008) and Zhang et al. (2008), low-rank approximations of the kernel matrix via the Nyström method can provide an effective technique for tackling large-scale data sets. However, all previous theoretical work analyzing the performance of the Nyström method has focused on the quality of the low-rank approximations, rather than on the performance of the kernel learning algorithms used in conjunction with these approximations. In this section, we first provide a brief review of the Nyström method and then show how we can leverage the analysis of Section 2 to present novel performance guarantees for the Nyström method in the context of kernel learning algorithms.

3.1 Nyström method

The Nyström approximation of a symmetric positive semi-definite (SPSD) matrix K is based on a sample of n columns of K (Drineas and Mahoney, 2005; Williams and Seeger, 2000). Let C denote the m × n matrix formed by these columns and W the n × n matrix consisting of the intersection of these n columns with the corresponding n rows of K. The columns and rows of K can be rearranged based on this sampling so that K and C can be written as follows:

    K = [ W, K_{21}⊤ ; K_{21}, K_{22} ]    and    C = [ W ; K_{21} ].    (34)

Note that W is also SPSD since K is SPSD. For a uniform sampling of the columns, the Nyström method generates a rank-k approximation K̃ of K, for k ≤ n, defined by

    K̃ = C W_k^+ C⊤ ≈ K,    (35)

where W_k is the best rank-k approximation of W for the Frobenius norm, that is W_k = argmin_{rank(V)=k} ‖W − V‖_F, and W_k^+ denotes the pseudo-inverse of W_k. W_k^+ can be derived from the singular value decomposition (SVD) of W, W = UΣU⊤, where U is orthonormal and Σ = diag(σ_1, ..., σ_n) is a real diagonal matrix with σ_1 ≥ ... ≥ σ_n ≥ 0. For k ≤ rank(W), it is given by W_k^+ = Σ_{i=1}^k σ_i^{-1} U^i U^{i⊤}, where U^i denotes the ith column of U. Since the running-time complexity of the SVD is O(n^3) and O(nkm) is required for the multiplication with C, the total complexity of the Nyström approximation computation is O(n^3 + nkm).
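
A minimal sketch of equation (35) follows (an illustration only; the RBF kernel, the sampling without replacement, and the matrix sizes are assumptions made here, not the setup of the experiments in Section 4).

```python
import numpy as np

def nystrom_approximation(K, n, k, rng):
    """Rank-k Nystrom approximation built from n uniformly sampled columns of K (eq. 35)."""
    m = K.shape[0]
    idx = rng.choice(m, size=n, replace=False)
    C = K[:, idx]                    # m x n sampled columns
    W = K[np.ix_(idx, idx)]          # n x n intersection block
    s, U = np.linalg.eigh(W)         # W is SPSD
    order = np.argsort(s)[::-1][:k]  # keep the top-k eigenvalues
    s_k, U_k = s[order], U[:, order]
    W_k_pinv = (U_k / s_k) @ U_k.T   # pseudo-inverse of the best rank-k approximation of W
    return C @ W_k_pinv @ C.T

# Hypothetical usage on a small RBF kernel matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)
K_tilde = nystrom_approximation(K, n=100, k=20, rng=rng)
print(np.linalg.norm(K - K_tilde, 2) / np.linalg.norm(K, 2))  # relative spectral distance
```
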
3.2 Nyström kernel ridge regression

The accuracy of low-rank Nyström approximations has been theoretically analyzed by Drineas and Mahoney (2005) and Kumar et al. (2009c). The following theorem, adapted from Drineas and Mahoney (2005) for the case of uniform sampling, gives an upper bound on the norm-2 error of the Nyström approximation of the form ‖K − K̃‖_2/‖K‖_2 ≤ ‖K − K_k‖_2/‖K‖_2 + O(1/√n). We denote by K_max the maximum diagonal entry of K.

Theorem 1. Let K̃ denote the rank-k Nyström approximation of K based on n columns sampled uniformly at random with replacement from K, and let K_k denote the best rank-k approximation of K. Then, with probability at least 1 − δ, the following inequality holds for any sample of size n:

    ‖K − K̃‖_2 ≤ ‖K − K_k‖_2 + (m/√n) K_max (2 + log(1/δ)).

Theorem 1 focuses on the quality of low-rank approximations. Combining the analysis from Section 2 with this theorem enables us to bound the relative performance of kernel learning algorithms when the Nyström method is used as a means of scaling them to large tasks. To illustrate this point, Theorem 2 uses Proposition 1 along with Theorem 1 to upper bound the relative performance of kernel ridge regression as a function of the approximation accuracy of the Nyström method.

Theorem 2. Let h′ denote the hypothesis returned by kernel ridge regression when using the approximate rank-k kernel matrix K̃ ∈ R^{m×m} generated using the Nyström method. Then, with probability at least 1 − δ, the following inequality holds for all x ∈ X:

    |h′(x) − h(x)| ≤ (κM)/(λ_0^2 m) [ ‖K − K_k‖_2 + (m/√n) K_max (2 + log(1/δ)) ].

A similar technique can be used to bound the error of the Nyström approximation when used with the other algorithms discussed in Section 2. The results are omitted due to space constraints.
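
The sketch below ties the two pieces together in the spirit of Theorem 2 (again an illustration with assumed sizes and an assumed Gaussian kernel, not the experimental protocol of Section 4): it builds a Nyström approximation, trains KRR on it, predicts with the exact kernel, and reports both sides of the comparison.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k, lam0 = 600, 120, 30, 0.1
X = rng.normal(size=(m, 3))
y = np.tanh(X[:, 0] - X[:, 1])
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # exact kernel, kappa = 1

# Rank-k Nystrom approximation from n uniformly sampled columns (eq. 35).
idx = rng.choice(m, size=n, replace=False)
C, W = K[:, idx], K[np.ix_(idx, idx)]
s, U = np.linalg.eigh(W)
s_k, U_k = s[-k:], U[:, -k:]
K_tilde = C @ ((U_k / s_k) @ U_k.T) @ C.T

# KRR trained with K (exact) and with K_tilde (approximate); prediction uses the exact k_x.
lam = m * lam0
alpha = np.linalg.solve(K + lam * np.eye(m), y)
alpha_t = np.linalg.solve(K_tilde + lam * np.eye(m), y)
gap = np.max(np.abs(K @ (alpha_t - alpha)))

# Proposition 1 applied with the measured approximation error (kappa = 1).
bound = np.max(np.abs(y)) * np.linalg.norm(K - K_tilde, 2) / (lam0 ** 2 * m)
print(f"max |h'(x) - h(x)| = {gap:.4f}   Proposition 1 bound = {bound:.4f}")
```
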

[Figure 1: Average absolute error of the kernel ridge regression hypothesis h′(·) generated from the Nyström approximation K̃, as a function of the relative spectral distance ‖K − K̃‖_2/‖K‖_2. Left panel: Abalone; right panel: Kin-8nm; x-axis: relative spectral distance; y-axis: average absolute error. For each dataset, the reported results show the average absolute error as a function of relative spectral distance for both the full dataset and for a subset of the data containing m = 2000 points. Results for the same value of m are connected with a line. The different points along the lines correspond to various numbers of sampled columns, i.e., n ranging from 1% to 50% of m.]

3.3 Nyström Woodbury Approximation

The Nyström method provides an effective algorithm for obtaining a rank-k approximation of the kernel matrix. As suggested by Williams and Seeger (2000) in the context of Gaussian Processes, this approximation can be combined with the Woodbury inversion lemma to derive an efficient algorithm for inverting the kernel matrix. The Woodbury inversion lemma states that the inverse of a rank-k correction of some matrix can be computed by doing a rank-k correction to the inverse of the original matrix. In the context of KRR, using the rank-k approximation K̃ given by the Nyström method instead of K and applying the inversion lemma yields

    (λI + K)^{-1} ≈ (λI + C W_k^+ C⊤)^{-1}    (36, 37)
                  = (1/λ) ( I − C [ λ I_n + W_k^+ C⊤C ]^{-1} W_k^+ C⊤ ).    (38)

Thus, since W_k^+ C⊤C has rank at most k, only the inversion of a small n × n matrix is needed, as opposed to the original problem of size m.
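
The identity used in (38) is easy to check numerically; the sketch below (an illustrative stand-in, with a synthetic SPSD matrix and k = n for brevity; not the authors' implementation) compares the Woodbury-based solve against the direct inversion with the approximate kernel.

```python
import numpy as np

def nystrom_woodbury_solve(C, W_pinv, y, lam):
    """Compute (lam*I + C W^+ C^T)^{-1} y via eq. (38); only an n x n system is solved."""
    n = C.shape[1]
    inner = lam * np.eye(n) + W_pinv @ (C.T @ C)
    return (y - C @ np.linalg.solve(inner, W_pinv @ (C.T @ y))) / lam

rng = np.random.default_rng(1)
m, n = 400, 60
B = rng.normal(size=(m, 8))
K = B @ B.T + np.eye(m)                     # synthetic SPSD stand-in for a kernel matrix
idx = rng.choice(m, size=n, replace=False)
C, W = K[:, idx], K[np.ix_(idx, idx)]
W_pinv = np.linalg.pinv(W)                  # k = n here; W_k^+ would be its rank-k truncation
y = rng.normal(size=m)
lam = 1.0

alpha_fast = nystrom_woodbury_solve(C, W_pinv, y, lam)
alpha_direct = np.linalg.solve(lam * np.eye(m) + C @ W_pinv @ C.T, y)
print(np.max(np.abs(alpha_fast - alpha_direct)))   # agreement up to rounding error
```
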

4 Experiments

For our experimental results, we focused on the kernel stability of kernel ridge regression, generating approximate kernel matrices using the Nyström method. We worked with the datasets listed in Table 1, and for each dataset we randomly selected 80% of the points to generate K and used the remaining 20% as the test set T. For each train-test split, we first performed a grid search to determine the optimal ridge for K, as well as the associated optimal hypothesis h(·). Next, using this optimal ridge, we generated a set of Nyström approximations, using various numbers of sampled columns, i.e., n ranging from 1% to 50% of m. For each Nyström approximation K̃, we computed the associated hypothesis h′(·) using the same ridge, and measured the distance between h′ and h as follows:

    average absolute error = (1/|T|) Σ_{x ∈ T} |h′(x) − h(x)|.    (39)

We measured the distance between K̃ and K as follows:

    relative spectral distance = ‖K − K̃‖_2 / ‖K‖_2.    (40)

Table 1: Summary of datasets used in our experiments (Asuncion and Newman, 2007; Ghahramani, 1996).

    Dataset   | Description                     | # Points (m) | # Features (d) | Kernel | Largest label (M)
    ABALONE   | physical attributes of abalones |      —       |       —        |  RBF   | 29
    KIN-8nm   | kinematics of a robot arm       |      —       |       —        |  RBF   | 1.5

Figure 1 presents results for each dataset using the full dataset and a subset of m = 2000 points. The plots show the average absolute error of h′(·) as a function of the relative spectral distance. Proposition 1 predicts a linear relationship between kernel approximation and relative error, which is corroborated by the experiments, as both datasets display this behavior for different sizes of training data.

5 Conclusion

Kernel approximation is used in a variety of contexts and its use is crucial for scaling many learning algorithms to very large tasks. We presented a stability-based analysis of the effect of kernel approximation on the hypotheses returned by several common learning algorithms. Our analysis is independent of how the approximation is obtained and simply expresses the change in hypothesis value in terms of the difference between the approximate kernel matrix and the true one, measured by some norm. We also provided a specific analysis of the Nyström low-rank approximation in this context and reported experimental results that support our theoretical analysis. In practice, the two steps of kernel matrix approximation and training of a kernel-based algorithm are typically executed separately. Work by Bach and Jordan (2005) suggested one possible method for combining these two steps. Perhaps more accurate results could be obtained by combining these two stages using the bounds we presented, or other similar ones based on our analysis.

A Lemma 1

The proof of Lemma 1 is given for completeness.

Lemma 1. Let M and M′ be two n × n SPSD matrices. Then, the following bound holds for the difference of the square-root matrices:

    ‖M′^{1/2} − M^{1/2}‖_2 ≤ ‖M′ − M‖_2^{1/2}.

Proof. By definition of the spectral norm, M′ − M ⪯ ‖M′ − M‖_2 I, where I is the n × n identity matrix. Hence, M′ ⪯ M + ‖M′ − M‖_2 I and M′^{1/2} ⪯ (M + λI)^{1/2}, with λ = ‖M′ − M‖_2. Thus, M′^{1/2} ⪯ (M + λI)^{1/2} ⪯ M^{1/2} + λ^{1/2} I, by sub-additivity of the square root on commuting SPSD matrices. This shows that M′^{1/2} − M^{1/2} ⪯ λ^{1/2} I and, by symmetry, M^{1/2} − M′^{1/2} ⪯ λ^{1/2} I; thus ‖M′^{1/2} − M^{1/2}‖_2 ≤ ‖M′ − M‖_2^{1/2}, which proves the statement of the lemma.
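
A quick numerical spot-check of Lemma 1 (a hypothetical sanity check on random SPSD matrices, not part of the paper's experiments):

```python
import numpy as np

def sqrtm_spsd(S):
    """Matrix square root of an SPSD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(0)
n = 50
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))
M, Mp = A @ A.T, B @ B.T                       # two SPSD matrices
lhs = np.linalg.norm(sqrtm_spsd(Mp) - sqrtm_spsd(M), 2)
rhs = np.sqrt(np.linalg.norm(Mp - M, 2))
print(lhs <= rhs, lhs, rhs)                    # the inequality of Lemma 1 holds
```
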
References

A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.
Francis Bach and Michael Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1-48, 2002.
Francis R. Bach and Michael I. Jordan. Predictive low-rank decomposition for kernel methods. In International Conference on Machine Learning, 2005.
M. A. Belabbas and P. J. Wolfe. Spectral methods in machine learning and new strategies for very large datasets. Proceedings of the National Academy of Sciences of the United States of America, 106(2), January 2009.
Mikhail Belkin, Irina Matveeva, and Partha Niyogi. Regularization and semi-supervised learning on large graphs. In Conference on Learning Theory, 2004.
Olivier Bousquet and André Elisseeff. Algorithmic stability and generalization performance. In Advances in Neural Information Processing Systems, 2001.
Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3), 1995.
Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6, 2005.
Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 2002.
Zoubin Ghahramani. The kin datasets, 1996.
Ling Huang, Donghui Yan, Michael Jordan, and Nina Taft. Spectral clustering with perturbed data. In Advances in Neural Information Processing Systems, 2008.
Zakria Hussain and John Shawe-Taylor. Theory of matching pursuit. In Advances in Neural Information Processing Systems, 2008.
Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Ensemble Nyström method. In Advances in Neural Information Processing Systems, 2009.
Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. On sampling-based approximate spectral decomposition. In International Conference on Machine Learning, 2009.
Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling techniques for the Nyström method. In Conference on Artificial Intelligence and Statistics, 2009.
Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge regression learning algorithm in dual variables. In International Conference on Machine Learning, 1998.
Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press: Cambridge, MA, 2002.
Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1998.
Ameet Talwalkar, Sanjiv Kumar, and Henry Rowley. Large-scale manifold learning. In Conference on Computer Vision and Pattern Recognition, 2008.
Pascal Vincent and Yoshua Bengio. Kernel Matching Pursuit. Technical Report 1179, Département d'Informatique et Recherche Opérationnelle, Université de Montréal, 2000.
Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, 2000.
Kai Zhang, Ivor Tsang, and James Kwok. Improved Nyström low-rank approximation and error analysis. In International Conference on Machine Learning, 2008.
