Improved Bounds for the Nyström Method with Application to Kernel Classification


Improved Bounds for the Nyström Method with Application to Kernel Classification

Rong Jin, Member, IEEE, Tianbao Yang, Mehrdad Mahdavi, Yu-Feng Li, Zhi-Hua Zhou, Fellow, IEEE

Rong Jin and Mehrdad Mahdavi are with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824 USA. E-mail: {rongjin, mahdavi}@cse.msu.edu. Tianbao Yang is with GE Global Research, San Ramon, CA, USA. E-mail: tyang@ge.com. Yu-Feng Li and Zhi-Hua Zhou are with the National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China. E-mail: {liyf, zhouzh}@lamda.nju.edu.cn. Copyright © IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.

Abstract—We develop two approaches for analyzing the approximation error bound of the Nyström method, which approximates a positive semidefinite (PSD) matrix by sampling a small set of its columns: one based on a concentration inequality for integral operators, and one based on random matrix theory. We show that the approximation error, measured in the spectral norm, can be improved from O(N/√m) to O(N/m^{1−ρ}) in the case of a large eigengap, where N is the total number of data points, m is the number of sampled data points, and ρ ∈ (0, 1/2) is a positive constant that characterizes the eigengap. When the eigenvalues of the kernel matrix follow a p-power law, our analysis based on random matrix theory further improves the bound to O(N/m^{p−1}) under an incoherence assumption. We present a kernel classification approach based on the Nyström method and derive its generalization performance using the improved bound. We show that when the eigenvalues of the kernel matrix follow a p-power law, we can reduce the number of support vectors to N^{2p/(p²−1)}, which is sublinear in N when p > 1 + √2, without seriously sacrificing the generalization performance.

Index Terms—Nyström method, approximation error, concentration inequality, kernel methods, random matrix theory

I. INTRODUCTION

The Nyström method has been widely applied in machine learning to efficiently approximate large kernel matrices with low-rank matrices in order to speed up kernel algorithms [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]. To evaluate the quality of the Nyström method, we typically bound the norm of the difference between the original kernel matrix and the low-rank approximation generated by the Nyström method. Several analyses have been developed to bound the approximation error of the Nyström method [2], [4], [9], [12], [10], [13], [14]. Most of them focus on additive error bounds and build on the theoretical results of [2]. When the target matrix is of low rank, significantly better bounds for the approximation error of the Nyström method were given in [10] and [13]. These results were further generalized to kernel matrices of arbitrary rank by a relative error bound in [14]. In this study, we focus on the additive error bound of the Nyström method for PSD matrices, and compare our results mainly to those stated in [2]. Although a relative error bound is usually tighter than an additive bound [15], we show in this paper that the approximation error resulting from our additive error analysis is better than that indicated by the relative error analysis of [14] when the kernel matrix follows a power-law eigenvalue distribution. [Footnote 1: For completeness, we include a comparison to the relative error bound of [14] in the later remarks.]

Below, we review the main results of [2] and their limitations. Let K ∈ R^{N×N} be a PSD kernel matrix to be approximated, and let λ_i, i = 1, ..., N, be the eigenvalues of K ranked in descending order. Let K̃_r be an approximate kernel matrix of rank r generated by the Nyström method, and let m be the number of columns sampled from K to construct K̃_r. Then, under the assumption K_{i,i} = O(1), Drineas and Mahoney [2] showed that for m uniformly sampled columns, with a high probability,

    ‖K − K̃_r‖₂ ≤ λ_{r+1} + O(N/√m),    (1)

where ‖·‖₂ stands for the spectral norm of a matrix. [Footnote 2: Although the main results of [2] use a data-dependent sampling scheme, they also hold for uniform sampling, as mentioned by the authors of [2] and explicitly established in [16].] In particular, by setting r = m, the bound becomes

    ‖K − K̃_m‖₂ ≤ λ_{m+1} + O(N/√m).    (2)

The main problem with the bound in (2) is its slow reduction rate in the number of sampled columns m, i.e., O(N/√m), implying that a large number of samples is needed to achieve a small approximation error. In this study, we aim to improve the approximation error bound in (2) by considering two special cases of the kernel matrix K. In the first case, we assume there is a large eigengap in the spectrum of K. More specifically, we assume there exists a rank r ∈ [m] such that λ_r = Ω(N/m^ρ) and λ_{r+1} = O(N/m^{1−ρ}), where ρ < 1/2. Here, the parameter ρ is introduced to characterize the eigengap λ_r − λ_{r+1}: the smaller the ρ, the larger the eigengap. We show that the approximation error bound can be improved to O(N/m^{1−ρ}) in the case of a large eigengap. The second case assumes that the eigenvalues of K follow a p-power law with p > 1. Under this assumption, we obtain an improved O(N/m^{p−1}) bound on the approximation error, provided that the eigenvector matrix satisfies an incoherence assumption. [Footnote 3: A similar assumption was used in previous analyses of the Nyström method [10], [13], [14].]

This result may shed light on why the Nyström method works well for kernel matrices with power-law eigenvalue distributions, a phenomenon that is often observed in practice [20].

The second contribution of this study is a kernel classification algorithm that explicitly exploits the improved bounds for the Nyström method developed here. We show that when the eigenvalues of the kernel matrix follow a p-power law with p > 1, we can construct a kernel classifier that yields a generalization performance similar to the full kernel classifier but with no more than N^{2p/(p²−1)} support vectors, which is sublinear in N when p > 1 + √2. Although the generalization error of using the Nyström method for classification has been studied in [11], to the best of our knowledge this is the first work that bounds the number of support vectors using an analysis of the Nyström method.

Although the results presented in this paper are theoretical, they can find applications in the real world. The result for a PSD matrix with a large eigengap can find applications in document analysis [17] and social network analysis [18]. Several works [19], [20] have observed that a Gaussian kernel matrix usually exhibits a skewed eigenvalue distribution on real data sets, which makes our result for a PSD matrix with a power-law eigenvalue distribution attractive.

Finally, as we prepared the final version of the manuscript, two related works [21], [22] were accepted for publication. In [21], the authors compare the Nyström method under different sampling and projection strategies and establish improved bounds for these strategies. Their result for the spectral-norm error bound with the uniform sampling strategy, as considered in this work, is similar to the one presented in [14]. In [22], the author also addresses the generalization error of kernel learning using the Nyström method. The key result of [22] indicates that the minimum number of sampled columns depends on the degrees of freedom. Our result addresses the sample complexity of kernel learning from a different aspect.

II. NOTATIONS AND BACKGROUND

Let D = {x_1, ..., x_N} be a collection of N samples, where x_i ∈ X, and let K = [κ(x_i, x_j)]_{N×N} be the kernel matrix for the samples in D, where κ(·,·) is a kernel function. For simplicity, we assume κ(x, x) ≤ 1 for any x ∈ X. We denote by (v_i, λ_i), i = 1, ..., N, the eigenvectors and eigenvalues of K ranked in descending order of the eigenvalues, and by V = (v_1, ..., v_N) the orthonormal eigenvector matrix. To build the low-rank approximation of the kernel matrix K, the Nyström method first samples m < N examples uniformly at random from D, denoted by D̂ = {x̂_1, ..., x̂_m}. Let K̂ = [κ(x̂_i, x̂_j)]_{m×m} measure the kernel similarity between any two samples in D̂, and let K_b = [κ(x_i, x̂_j)]_{N×m} measure the similarity between the samples in D and D̂. Using the samples in D̂, with rank r set to m (or the rank of K̂ if it is less than m), the Nyström method approximates K by K ≈ K_b K̂^† K_b^T, where K̂^† denotes the pseudo-inverse of K̂. Our goal is to provide a high-probability bound for the approximation error ‖K − K_b K̂^† K_b^T‖₂. We choose r = m (or the rank of K̂) because, according to [2], [4], it yields the best approximation error for a non-singular kernel matrix. In this study, we focus on the spectral norm for measuring the approximation error, which is particularly suitable for kernel classification [11]. We also restrict the analysis to uniform sampling for the Nyström method. This is because, according to [4], [16], for most real-world datasets uniform sampling is computationally more efficient and yields performance that is comparable to, if not better than, the non-uniform sampling strategies [2], [8], [23], [24].

Our analysis of the Nyström method extensively exploits the properties of the integral operator. This is in contrast to most previous studies of the Nyström method, which rely on matrix analysis. The main advantage of using the integral operator is its convenience in handling unseen data points (i.e., test data), making it attractive for the analysis of generalization error bounds. In particular, we introduce a linear operator L_N defined over the samples in D. For any function f, the operator L_N is defined as

    L_N[f] = (1/N) Σ_{i=1}^N κ(x_i, ·) f(x_i).

It can be shown that the eigenvalues of the operator L_N are λ_i/N, i = 1, ..., N [25]. Let φ_1, ..., φ_N be the corresponding eigenfunctions of L_N, normalized by the functional norm, i.e., ⟨φ_i, φ_j⟩_{H_κ} = δ(i, j), 1 ≤ i, j ≤ N, where ⟨·,·⟩_{H_κ} denotes the inner product in H_κ and δ(i, j) is the Kronecker delta function. According to [25], the eigenfunctions satisfy

    √λ_j φ_j = Σ_{i=1}^N V_{i,j} κ(x_i, ·),  j = 1, ..., N,

where V_{i,j} is the (i, j)th element of V. Similarly, we can write κ(x_j, ·) by its eigen-expansion as

    κ(x_j, ·) = Σ_{i=1}^N √λ_i V_{j,i} φ_i,  j = 1, ..., N.    (3)

Furthermore, let L_m be the operator defined on the samples in D̂, i.e., L_m[f] = (1/m) Σ_{i=1}^m κ(x̂_i, ·) f(x̂_i). Finally, we denote by ⟨f, g⟩_{H_κ} and ‖f‖_{H_κ} the inner product and functional norm in the Hilbert space H_κ, respectively, and denote by ‖L‖_HS and ‖L‖₂ the Hilbert–Schmidt norm and spectral norm of a linear operator L, respectively, i.e., ‖L‖²_HS = Σ_{i,j} ⟨φ_i, Lφ_j⟩²_{H_κ} and ‖L‖₂ = max_{‖f‖_{H_κ} ≤ 1} ‖Lf‖_{H_κ}, where {φ_i, i = 1, 2, ...} is a complete orthogonal basis of H_κ. The two norms are the analogs of the Frobenius and spectral norms in Euclidean space, respectively. In the following analysis, omitted proofs are presented in the appendix.
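To make the construction above concrete, the following sketch implements the Nyström approximation K ≈ K_b K̂^† K_b^T with uniform column sampling, together with the spectral-norm error used throughout the analysis. It is a minimal illustration rather than code from the paper; the function names are our own, and numpy's pseudo-inverse is used to handle a possibly rank-deficient K̂.

    import numpy as np

    def nystrom_approximation(K, m, rng=None):
        """Rank-m Nystrom approximation of an N x N PSD kernel matrix K
        using m uniformly sampled columns (without replacement)."""
        rng = np.random.default_rng(rng)
        N = K.shape[0]
        idx = rng.choice(N, size=m, replace=False)   # sampled set D_hat
        K_b = K[:, idx]                              # N x m similarities to D_hat
        K_hat = K[np.ix_(idx, idx)]                  # m x m kernel on D_hat
        # K ~= K_b K_hat^+ K_b^T; the pseudo-inverse handles rank deficiency
        return K_b @ np.linalg.pinv(K_hat) @ K_b.T

    def spectral_error(K, K_tilde):
        """Spectral-norm approximation error, the quantity bounded in this paper."""
        return np.linalg.norm(K - K_tilde, ord=2)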

III. APPROXIMATION ERROR BOUND FOR THE NYSTRÖM METHOD

Our first step is to turn ‖K − K_b K̂^† K_b^T‖₂ into a functional approximation problem. To this end, we introduce two sets:

    H_a = span( κ(x̂_1, ·), ..., κ(x̂_m, ·) ),
    H_b = { f = Σ_{i=1}^N u_i κ(x_i, ·) : ‖u‖₂ ≤ 1 },

where H_a is the subspace spanned by the kernel functions defined on the samples in D̂, and H_b is a subset of the functional space spanned by the kernel functions defined on the samples in D with bounded coefficients. Using the eigen-expansion of κ(x_j, ·) in (3), it is straightforward to show that H_b can be rewritten in the basis of the eigenfunctions φ_i as

    H_b = { f = Σ_{i=1}^N w_i √λ_i φ_i : Σ_{i=1}^N w_i² ≤ 1 }.

Define E(g, H_a) as the minimum error in approximating a function g ∈ H_b by functions in H_a, i.e.,

    E(g, H_a) = min_{f∈H_a} ‖f − g‖²_{H_κ} = min_{f∈H_a} ( ‖f‖²_{H_κ} + ‖g‖²_{H_κ} − 2⟨f, g⟩_{H_κ} ).

Define E(H_a) as the worst error in approximating any function g ∈ H_b by functions in H_a, i.e.,

    E(H_a) = max_{g∈H_b} E(g, H_a).    (4)

The following proposition connects ‖K − K_b K̂^† K_b^T‖₂ with E(H_a).

Proposition 1: For any random samples x̂_1, ..., x̂_m, we have ‖K − K_b K̂^† K_b^T‖₂ = E(H_a).

Proof: Since g ∈ H_b and f ∈ H_a, we can write g and f as g = Σ_{i=1}^N u_i κ(x_i, ·) and f = Σ_{i=1}^m z_i κ(x̂_i, ·), where u = (u_1, ..., u_N)^T ∈ R^N satisfies ‖u‖₂ ≤ 1 and z = (z_1, ..., z_m)^T ∈ R^m. We can thus rewrite E(g, H_a) as an optimization problem in terms of z, i.e.,

    E(g, H_a) = min_{z∈R^m} z^T K̂ z − 2u^T K_b z + u^T K u = u^T ( K − K_b K̂^† K_b^T ) u,

and therefore

    E(H_a) = max_{g∈H_b} E(g, H_a) = max_{‖u‖₂≤1} u^T ( K − K_b K̂^† K_b^T ) u = ‖K − K_b K̂^† K_b^T‖₂.

Remark 1: We can restrict the space H_a to its subspace H_a^r = { Σ_{i=1}^m z_i κ(x̂_i, ·) : z ∈ span(v̂_1, ..., v̂_r) }, where v̂_i, i = 1, ..., r, are the first r eigenvectors of K̂, to conduct the analysis for the rank-r (r < m) approximation of the Nyström method.

To proceed with our analysis, for any r ∈ [m] we define

    H^r = span(φ_1, ..., φ_r),  H̄^r = span(φ_{r+1}, ..., φ_N),
    H_b^r = { f = Σ_{i=1}^r w_i √λ_i φ_i : Σ_{i=1}^r w_i² ≤ 1 },
    H̄_b^r = { f = Σ_{i=1}^{N−r} w_i √λ_{i+r} φ_{i+r} : Σ_{i=1}^{N−r} w_i² ≤ 1 }.

Define E(H_a, r) = max_{g∈H_b^r} E(g, H_a) as the worst error in approximating any function g ∈ H_b^r by functions in H_a. The proposition below bounds E(H_a) by E(H_a, r).

Proposition 2: For any r ∈ [m], we have

    E(H_a) ≤ 2 max( E(H_a, r), λ_{r+1} ) ≤ 2( E(H_a, r) + λ_{r+1} ).

Proof: For any g ∈ H_b, we can write g = g_1 + g_2, where g_1 ∈ δH_b^r, g_2 ∈ √(1−δ²) H̄_b^r, and δ ∈ [0, 1]; similarly, any f ∈ H_a can be decomposed into its components in H^r and H̄^r. Using these notations and the fact that ‖g_2‖²_{H_κ} ≤ (1−δ²)λ_{r+1}, we rewrite E(H_a) as

    E(H_a) = max_{δ∈[0,1]} max_{g_1∈δH_b^r, g_2∈√(1−δ²)H̄_b^r} min_{f∈H_a} ‖f − g_1 − g_2‖²_{H_κ}
           ≤ max_{δ∈[0,1]} [ 2δ² max_{g∈H_b^r} min_{f∈H_a} ‖f − g‖²_{H_κ} + 2(1−δ²) max_{g∈H̄_b^r} ‖g‖²_{H_κ} ]
           ≤ max_{δ∈[0,1]} [ 2δ² E(H_a, r) + 2(1−δ²) λ_{r+1} ] = 2 max( E(H_a, r), λ_{r+1} ),

where the first inequality uses (a + b)² ≤ 2a² + 2b², and the last step follows from the definition of E(H_a, r).

As indicated by Proposition 2, in order to bound the approximation error E(H_a), we can bound E(H_a, r), namely the approximation error for functions in the subspace spanned by the top eigenfunctions of L_N. In the next two subsections, we discuss two approaches for bounding E(H_a, r): the first relies on a concentration inequality for integral operators [25], and the second exploits random matrix theory [26]. Before upper-bounding E(H_a), we first provide a lower bound.

Theorem 1: There exists a kernel matrix K ∈ R^{N×N} with all its diagonal entries equal to 1 such that, for any sampling strategy that selects m columns, the approximation error of the Nyström method is lower bounded by Ω(N/m), i.e.,

    ‖K − K_b K̂^† K_b^T‖₂ ≥ Ω(N/m),

provided N > 64(m+1)²[ln 4]².

Remark 2: Theorem 1 shows that the lower bound on the approximation error of the Nyström method is Ω(N/m). The analysis developed in this work aims to bridge the gap between the known upper bound (i.e., O(N/√m)) and this lower bound.

A. Bound for E(H_a, r) Using the Concentration Inequality of Integral Operators

In this section, we bound E(H_a, r) using the concentration inequality of integral operators. We show that the approximation error of the Nyström method can be improved to O(N/m^{1−ρ}) when there is a large eigengap in the spectrum of the kernel matrix K, where ρ < 1/2 is introduced to characterize the eigengap. We first state the concentration inequality for a general random variable.

Proposition 3 (from [25]): Let ξ be a random variable on (X, P_X) with values in a Hilbert space H. Assume ‖ξ‖ ≤ M < ∞ almost surely. Then, with a probability at least 1 − δ, we have

    ‖ (1/m) Σ_{i=1}^m ξ(x_i) − E[ξ] ‖ ≤ 4M ln(2/δ) / √m.

The approximation error of the Nyström method based on the concentration inequality is given in the following theorem.

Theorem 2: With a probability at least 1 − δ, for any r ∈ [m], we have

    ‖K − K_b K̂^† K_b^T‖₂ ≤ 2( 16 N² [ln(2/δ)]² / (m λ_r) + λ_{r+1} ).

We consider the scenario where there is a very large eigengap in the spectrum of the kernel matrix K. In particular, we assume that there exist a rank r and ρ ∈ (0, 1/2) such that λ_r = Ω(N/m^ρ) and λ_{r+1} = O(N/m^{1−ρ}). The parameter ρ characterizes the eigengap, which is given by

    λ_r − λ_{r+1} = Ω( N/m^ρ − N/m^{1−ρ} ) = Ω( (N/m^ρ) [ 1 − 1/m^{1−2ρ} ] ).

Evidently, the smaller the ρ, the larger the eigengap; when ρ = 1/2, the eigengap is small. Under the large-eigengap assumption, the bound in Theorem 2 simplifies to

    ‖K − K_b K̂^† K_b^T‖₂ ≤ O( N/m^{1−ρ} ).    (5)

Compared to the bound in (2), the bound in (5) improves the approximation error from O(N/√m) to O(N/m^{1−ρ}) when ρ < 1/2.

To prove Theorem 2, we define two sets of functions

    H_c^r = { h = Σ_{i=1}^r w_i (N/√λ_i) φ_i : Σ_{i=1}^r w_i² ≤ 1 },
    H_d^r = { f ∈ H_κ : ‖f‖²_{H_κ} ≤ N²/λ_r },

where r corresponds to the rank with a large eigengap. It is evident that H_c^r ⊆ H_d^r; and any g ∈ H_b^r can also be written as g = L_N[h], where h ∈ H_c^r. Using H_c^r and H_d^r, we have

    E(H_a, r) = max_{g∈H_b^r} E(g, H_a) = max_{h∈H_c^r} min_{f∈H_a} ‖L_N h − f‖²_{H_κ} ≤ max_{h∈H_d^r} min_{f∈H_a} ‖L_N h − f‖²_{H_κ}.

By constructing f as L_m[h] ∈ H_a, we can bound E(H_a, r) as

    E(H_a, r) ≤ max_{h∈H_d^r} ‖ (L_N − L_m) h ‖²_{H_κ} ≤ ‖L_N − L_m‖²₂ · N²/λ_r ≤ ‖L_N − L_m‖²_HS · N²/λ_r,    (6)

where the last step follows from ‖L_N − L_m‖₂ ≤ ‖L_N − L_m‖_HS. The corollary below, which follows immediately from Proposition 3, allows us to bound the difference between L_N and L_m.

Corollary 1: With a probability 1 − δ, we have ‖L_N − L_m‖_HS ≤ 4 ln(2/δ) / √m.

Finally, Theorem 2 follows directly from the inequality in (6), the result in Corollary 1, and Proposition 2.

B. Bound for E(H_a, r) Using Random Matrix Theory

In this subsection, we aim to develop a better error bound for the Nyström method for kernel matrices whose eigenvalues follow a power-law distribution. Our analysis explicitly exploits some of the key results in random matrix theory that have been the main building blocks in the theory of compressive sensing [26], [27]. To this end, we first introduce the definition of a power-law distribution of eigenvalues [28], [29]. The eigenvalues σ_i, i = 1, ..., N, ranked in non-increasing order, follow a p-power law distribution if there exists a constant c > 0 such that σ_k ≤ c k^{−p}. In the sequel, we assume the normalized eigenvalues λ_i/N, i = 1, ..., N, i.e., the eigenvalues of the operator L_N, follow a p-power law distribution. [Footnote 4: We assume a power-law distribution for the normalized eigenvalues λ_i/N because the eigenvalues λ_i of K scale in N.] A well-known example of a kernel with a power-law eigenvalue distribution [28] is the kernel function that generates the Sobolev space W^{α,2}(T^d) of smoothness α > d/2, where T^d is the d-dimensional torus; its eigenvalues follow a p-power law with p = 2α > d. It is also observed that the eigenvalues of a Gaussian kernel, with an appropriately set width parameter, follow a power-law distribution [19]. In order to exploit the results on random matrices from [26], we introduce the definition of the coherence μ of the eigenvector matrix V = (v_1, ..., v_N) as

    μ = N max_{i,j} V_{i,j}².
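The coherence defined above (as reconstructed here, μ = N·max_{i,j} V_{i,j}²) is straightforward to compute for a given PSD matrix; the short sketch below is our own illustration, not code from the paper.

    import numpy as np

    def coherence(K):
        """Coherence mu = N * max_{i,j} V_{ij}^2 of the eigenvector matrix of K.
        mu ranges from 1 (maximally incoherent eigenvectors) up to N (an
        eigenvector aligned with a canonical basis vector)."""
        N = K.shape[0]
        _, V = np.linalg.eigh(K)   # columns of V are orthonormal eigenvectors
        return N * np.max(V ** 2)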

Intuitively, the coherence measures the degree to which the eigenvectors in V are correlated with the canonical basis. According to results from compressive sensing, highly coherent matrices are difficult, or even impossible, to recover by matrix completion with random sampling. The coherence measure was first introduced into the error analysis of the Nyström method by Talwalkar and Rostamizadeh [10]. Their analysis shows that a low-rank kernel matrix with incoherent eigenvectors (i.e., with low coherence) can be accurately approximated by the Nyström method using uniform sampling. This result was generalized to noisy observations of a low-rank matrix in [13]. The main limitation of these results is that they apply only to low-rank matrices. Recently, Gittens [14] developed a relative error bound for the Nyström method for kernel matrices of arbitrary rank, using a slightly different coherence measure. Similar to [10], [13], [14], the coherence measure also plays an important role in our analysis. However, unlike the previous studies, we focus on the error bound of the Nyström method for kernel matrices with an arbitrary rank and a skewed eigenvalue distribution. The main result of our analysis is given in the following theorem.

Theorem 3: Assume the eigenvalues λ_i/N, i = 1, ..., N, follow a p-power law with p > 1. Given a sufficiently large number of samples m, i.e.,

    m / ln N ≥ μ max( 16/γ, C_ab ln(3N³), 4 C_ab ln(3N³)/γ ),

we have, with a probability 1 − 3N^{−1},

    ‖K − K_b K̂^† K_b^T‖₂ ≤ Õ( N/m^{p−1} ),

where Õ(·) suppresses a factor polynomial in ln N, and C_ab is a numerical constant defined in our later analysis.

Remark 3: Compared to the approximation error in (2), Theorem 3 improves the bound from O(N/√m) to O(N/m^{p−1}), provided the eigenvalues of the kernel matrix follow a power law. It is worth noting that, similar to [10], [13], [14], the bound in Theorem 3 is meaningful only when the coherence μ of the eigenvector matrix is small (i.e., the eigenvector matrix satisfies the incoherence assumption). We also compare the additive error bound in Theorem 3 with the relative error bound in [14], which is dominated by O(N²/m^{p+1}) when the eigenvalues follow a p-power law. That bound is worse than the O(N/m^{p−1}) bound obtained in Theorem 3 when m ≤ √N, a favorable setting when N is very large and m is small. Although we note that the relative error bound in [14] holds more generally than the error bound in Theorem 3, the comparison may shed light on how a power-law eigenvalue distribution can yield an improved approximation error bound for the Nyström method, and may inspire new error analyses of other randomized matrix approximation methods. In Remark 4, we make a comparison between a more general result, Theorem 7, and [14].

We emphasize that the result in Theorem 3 does not contradict the lower bound given in Theorem 1, because Theorem 3 holds only when the eigenvalues of the kernel matrix follow a power law. In fact, an updated lower bound for kernel matrices with a power-law eigenvalue distribution is given in the following theorem.

Theorem 4: There exists a kernel matrix K ∈ R^{N×N} with all its diagonal entries equal to 1 and its eigenvalues following a p-power law such that, for any sampling strategy that selects m columns, the approximation error of the Nyström method is lower bounded by Ω(N/m^p), i.e.,

    ‖K − K_b K̂^† K_b^T‖₂ ≥ Ω( N/m^p ),

provided N > 64(m+1)²[ln 4]².

We skip the proof of this theorem as it is almost identical to that of Theorem 1. The gap between the upper bound and the lower bound given in Theorems 3 and 4 indicates that there is potentially room for further improvement.

Next, we present several theorems and corollaries to pave the path to the proof of Theorem 3. We borrow the following two results from compressive sensing theory [26].

Theorem 5 (from [26]): Let V be an orthogonal matrix (V^T V = I) with coherence μ. Fix a subset T of the signal domain. Choose a subset S of the measurement domain of size |S| = m uniformly at random. Suppose that the number of measurements obeys

    m ≥ |T| μ max( C_a ln |T|, C_b ln(3/δ) )

for some positive constants C_a and C_b. Then

    Pr( ‖ (N/m) V_{S,T}^T V_{S,T} − I ‖ ≥ 1/2 ) ≤ δ.

Theorem 6 (Lemma 3.3 from [26]): Let V, S, and T be the same as defined in Theorem 5. Let u_k be the k-th row of V_{S,·}^T V_{S,T}. Define σ² = μ max(1, μ|T|/m). Fix a > 0 obeying a ≤ (m/μ)^{1/4} if μ|T|/m > 1 and a ≤ (m/[μ²|T|])^{1/2} otherwise. Let z_k = (V_{S,T}^T V_{S,T})^{−1} u_k. Then we have

    Pr( sup_{k∈T^c} ‖z_k‖ ≥ μ|T|/m + aσ√|T|/m ) ≤ 3 exp(−γ a²) + Pr( ‖ (N/m) V_{S,T}^T V_{S,T} − I ‖ ≥ 1/2 )

for some positive constant γ, where T^c stands for the complement of T.

We note that Theorem 5 has been used in [10] to show that the Nyström approximation is exact when the target PSD matrix is of low rank. Our analysis, however, is not restricted to low-rank PSD matrices. Instead, by combining the results in Theorems 5 and 6, we obtain the following high-probability bound for sup_{k∈T^c} ‖z_k‖, which serves as the key to proving the general result stated in Theorem 7 for a PSD matrix of arbitrary rank.

Corollary 2: If |T| ≥ max( C_ab ln(3N³), 4 ln N/γ ) and

    μ max( |T| C_ab ln(3N³), 16 ln N/γ ) ≤ m ≤ μ |T| N,

where C_ab = max(C_a, C_b), then with a probability 1 − 3N^{−1}, we have

    sup_{k∈T^c} ‖z_k‖ ≤ 4μ|T|/m.
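As a numerical sanity check of the O(N/m^{p−1}) scaling in Theorem 3, one can build a synthetic kernel matrix whose eigenvalues follow a p-power law and whose eigenvectors come from a random orthogonal matrix (hence incoherent with high probability), and then track the Nyström error for growing m. This is our own illustration, reusing the helpers from the earlier sketch; it is not an experiment from the paper.

    import numpy as np
    from scipy.stats import ortho_group   # random orthogonal (incoherent w.h.p.)

    def powerlaw_kernel(N, p, rng=None):
        """PSD matrix with eigenvalues proportional to N * i^{-p} and a random
        eigenvector matrix; diagonal entries are O(1) as assumed in the paper."""
        lam = N * np.arange(1, N + 1, dtype=float) ** (-p)
        V = ortho_group.rvs(N, random_state=rng)
        return (V * lam) @ V.T

    N, p = 2000, 2.5
    K = powerlaw_kernel(N, p, rng=0)
    for m in (50, 100, 200, 400):
        err = spectral_error(K, nystrom_approximation(K, m, rng=0))
        print(m, err, N / m ** (p - 1))   # error should track the N/m^{p-1} scale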

6 6 Using Corollary, we have the following bound for EH a, r Theore 7: If r > axc ab ln3 3, 4 ln /γ and ln µ ax rc ab ln3 3, 6 < µ r, γ then, with a probability 3, we have EH a, r 6µ r i=r+ Reark 4: Before presenting the proof, it is worthwhile to copare the approxiation error bound by cobing the results in Theore 7 and Proposition to Gittens relative error bound [4], ie, with a probability at least δ it holds that K K K b Kb r+ + provided 8τ r logr/δ, where τ is a coherence easure of the top-r eigen-space of K defined by τ = r ax r i j= V ij In the case when the eigenvalues decay fast eg, eigenvalues follow a power law, we have i=r+ i r+, and therefore our bound could be significantly better than Gittens relative error bound On the other hand, when eigenvalues follow a flat distribution eg, i r+ for all i [r+, ], we have i=r+ i r+, and therefore our bound is worse than Gittens relative bound by only a factor of µ r Proof of Theore 7: For the sake of siplicity, we assue that the first exaples are sapled, ie, D = x,, x For any g Hb r, we have g = r w i / i ϕ i, with r Below, we will ake w i specific construction of f based on g that ensures a sall approxiation error Let f be f = = = a j κx j, j= ϕ i / i b i / i ϕ i, i a j V j,i j= where b i = j= a jv j,i, i =,,, and the value of a = a,, a will be given later Define T =,, r and S =,, Under the condition that rµ ax C a, C b ln3 3 rµ ax C a ln r, C b ln3 3, Theore 5 holds, and therefore with a probability at least 3, in V S,T V S,T ax V 3 S,T V S,T 7 We construct a as a = V S,T [ V S,T V S,T ] w, where w = w,, w r Since b = V S, a = V S, V S,T V S,T V S,T w, where b = b,, b, it is straightforward to see that b j = w j for j T Using the result fro Corollary, we have, with a probability at least 3, r ax b j ax z j w 4µ j T c j T c, where z j is the j-th row of atrix VS, V S,T V S,T V S,T We thus obtain f g 6µ r H κ = i b i ϕ i i Hence, i T c / H κ EH a, r = ax in f g g H r H b 6µ r a i=r+ i=r+ Finally, we show the proof of Theore 3 using Theore 7 Proof of Theore 3: Let r = µ C ab ln3 3, then µ rc ab ln3 3 < µ r, where the right inequality follows that r µ C ab ln3 3, and > 4µ Cab ln 3 3 Then the conditions in Theore 7 hold and we have K K K b Kb ax EH a, r, r+ 6µ r ax, i i=r+ Since ax6µ r/, O due to the specific value we choose for r, and i=r+ i O/r p due to the power law distribution, then K K K b Kb O r p Õ p IV APPLICATIO OF THE YSTRÖM METHOD TO KEREL CLASSIFICATIO Although the yströ ethod was proposed in [] to speed up kernel achine, few studies exaine the application of the yströ ethod to kernel classification In fact, to the best of our knowledge, [] and [] are the only two pieces of work that explicitly explore the yströ ethod for kernel classification The key idea of both works is to apply the yströ ethod to approxiate the kernel atrix with a low rank atrix in order to reduce the coputational cost More specifically, we consider the following optiization proble for kernel classification in L f = f H κ + i ly i fx i, 8 where y i, + is the class label assigned to instance x i, and lz is a convex loss function To facilitate our analysis, we assue i lz is strongly convex with odulus σ, ie l z σ 5, and ii lz is Lipschitz continuous, ie 5 Loss functions such as square loss used for regression and logit function used for logistic regression are strongly convex

Using the convex conjugate of the loss function l(z), denoted by l*(α) with α ∈ Ω, where Ω is the domain of the dual variable α, we can cast the problem in (8) into the following optimization problem over α:

    max_{α_i∈Ω}  −(1/N) Σ_{i=1}^N l*(α_i) − (1/(2λN²)) (α∘y)^T K (α∘y),    (9)

with the solution f̂ given by f̂ = (1/(λN)) Σ_{i=1}^N α_i y_i κ(x_i, ·). By Fenchel conjugacy, we have max_{α∈Ω} |α| ≤ C because |l'(z)| ≤ C. To reduce the computational cost, [11] and [12] suggest replacing the kernel matrix K with its low-rank approximation K̃ = K_b K̂^† K_b^T, leading to the following optimization problem for α:

    max_{α_i∈Ω}  −(1/N) Σ_{i=1}^N l*(α_i) − (1/(2λN²)) (α∘y)^T K̃ (α∘y).    (10)

One main problem with this approach is that, although it simplifies the computation involving the kernel matrix, it does not simplify the classifier f̂, because the number of support vectors after applying the Nyström method is not guaranteed to be small [30], [31], leading to a high computational cost when performing prediction on test examples. We address this difficulty by presenting a new approach that exploits the Nyström method for kernel classification. We show, via an analysis of the excess risk, that the number of support vectors needed to achieve a performance similar to the kernel classifier learned with the full kernel matrix is bounded by N^{2p/(p²−1)}, where p characterizes the power law of the eigenvalue distribution of the kernel matrix.

Similar to the previous analysis, we randomly select a subset of m training examples, denoted by D̂ = {x̂_1, ..., x̂_m}, and restrict the solution f to the subspace H_a = span( κ(x̂_1, ·), ..., κ(x̂_m, ·) ), leading to the following optimization problem:

    min_{f∈H_a}  L̃(f) = (λ/2) ‖f‖²_{H_κ} + (1/N) Σ_{i=1}^N l( y_i f(x_i) ).    (11)

The following proposition shows that the optimal solution to (11) is closely related to the optimal solution to (10).

Proposition 4: The solution f̃ to (11) is given by f̃ = Σ_{i=1}^m z̃_i κ(x̂_i, ·), where z̃ = (1/(λN)) K̂^† K_b^T (α̃∘y) and α̃ is the optimal solution to (10).

It is important to note that the classifier obtained from (11) is supported only by the m sampled training examples in D̂, which significantly reduces the complexity of the kernel classifier compared to the approach suggested in [11], [12]. We also note that the proposed approach is equivalent to learning a linear classifier after representing each instance x by the vector φ(x) = D̂^{−1/2} V̂^T ( κ(x̂_1, x), ..., κ(x̂_m, x) )^T, where D̂ is a diagonal matrix of the non-zero eigenvalues of K̂ and V̂ is the corresponding eigenvector matrix. Although this idea has already been adopted by practitioners, we are unable to find any reference on its empirical study. The remainder of this work shows that this approach can have good generalization performance provided that the eigenvalues of the kernel matrix follow a skewed distribution.

Below, we develop the generalization error bound for the classifier learned from (11). Let f̂ and f̂_a be the optimal solutions to (8) and (11), respectively. Let f* be the optimal classifier that minimizes the expected loss, i.e., f* = arg min_{f∈H_κ} P_l(f) := E_{x,y}[ l(y f(x)) ]. Let ‖f‖²_{L₂} = E_x[ f(x)² ] denote the squared L₂ norm of f. In order to obtain a tight bound, we exploit the technique of local Rademacher complexity [32], [33]. Define ψ as

    ψ(δ) = ( (1/N) Σ_{i=1}^N min( δ², λ_i/N ) )^{1/2}.

Let ε̂_N be the solution to ε̂_N² = ψ(ε̂_N), where the existence and uniqueness of ε̂_N follow from the sub-root property of ψ(δ) [32]. Finally, we define

    ε̃_N = max( ε̂_N², 16 ln N / N ).    (12)

Theorem 8: Assume that, with a probability 1 − 3N^{−1}, E(H_a) ≤ Γ(N, m), where Γ(N, m) is some function of N and m. Assume that N is sufficiently large such that

    e^{−N} ≤ max( ‖f̂_a‖_{H_κ}, ‖f*‖_{H_κ} ) √(ln N / N)  and  e^{−N} ≤ max( ‖f̂_a‖_{L₂}, ‖f*‖_{L₂} ) (16 ln N / N).

Then, with a probability at least 1 − 4N^{−1} − 3N^{−1}, we have

    P_l(f̂_a) ≤ P_l(f*) + (λ/2) ‖f*‖²_{H_κ} + (C²/(λN)) Γ(N, m) + (C₁²C²/(4σ)) ε̃_N + C₁C ε̃_N + (C₁C/σ) e^{−N},

where ε̃_N is given in (12) and C₁ is a constant independent of N and m. By choosing the λ that minimizes the above bound, we have

    P_l(f̂_a) ≤ P_l(f*) + 4 ‖f*‖_{H_κ} C √( Γ(N, m)/N ) + (C₁²C²/(4σ)) ε̃_N + C₁C ε̃_N + (C₁C/σ) e^{−N}.

Remark 5: In the case when the eigenvalues of the kernel matrix follow a p-power law with p > 1, we have ε̃_N = O(N^{−p/(p+1)}) according to [28], and Γ(N, m) = O(N/m^{p−1}) according to Theorem 3. Applying these results to Theorem 8, the generalization performance of f̂_a becomes

    P_l(f̂_a) ≤ P_l(f*) + (λ/2) ‖f*‖²_{H_κ} + (C₂C²)/(λ m^{p−1}) + (C₁C/σ) e^{−N} + C₃C N^{−p/(p+1)} + C₄C² N^{−p/(p+1)},    (13)

where C₂, C₃, and C₄ are constants independent of N and m. By choosing the λ that minimizes the bound in (13), we have

    P_l(f̂_a) ≤ P_l(f*) + 4 ‖f*‖_{H_κ} C √(C₂) m^{−(p−1)/2} + C₃C N^{−p/(p+1)} + C₄C² N^{−p/(p+1)} + (C₁C/σ) e^{−N}
              = P_l(f*) + O( N^{−p/(p+1)} + m^{−(p−1)/2} ).

As indicated by the above inequality, when the eigenvalues of the kernel matrix follow a p-power law, by setting m = N^{2p/(p²−1)} we achieve a performance similar to that of the full kernel classifier, i.e., O(N^{−p/(p+1)}). In other words, we can construct a kernel classifier, without sacrificing its generalization performance, with no more than N^{2p/(p²−1)} support vectors, which can be significantly smaller than N when p > 1 + √2. For the example of the kernel that generates the Sobolev space W^{α,2}(T^d) of smoothness α > d/2, where T^d is the d-dimensional torus, the eigenvalues follow a p-power law with p = 2α > d, which is larger than 1 + √2 when d ≥ 3.
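The classifier in (11) can be implemented by mapping every example to the Nyström feature vector φ(x) = D̂^{−1/2} V̂^T ( κ(x̂_1, x), ..., κ(x̂_m, x) )^T and training a regularized linear model on these features. The sketch below is our own rough illustration of that recipe; the names are ours, and scikit-learn's logistic regression is only an approximate stand-in for the (λ/2)‖f‖²_{H_κ} regularization in (11).

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def nystrom_features(X, X_hat, kernel, tol=1e-12):
        """phi(x) = D_hat^{-1/2} V_hat^T [kappa(x_hat_1, x), ..., kappa(x_hat_m, x)]^T,
        computed for every row of X; X_hat holds the m sampled examples."""
        K_hat = kernel(X_hat, X_hat)                  # m x m
        lam, V = np.linalg.eigh(K_hat)
        keep = lam > tol                              # drop the numerically zero part
        D_inv_sqrt = 1.0 / np.sqrt(lam[keep])
        K_bx = kernel(X, X_hat)                       # n x m similarities to D_hat
        return K_bx @ V[:, keep] * D_inv_sqrt         # n x rank(K_hat)

    def rbf(A, B, gamma=1.0):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = np.where(X[:, 0] + 0.3 * rng.normal(size=500) > 0, 1, -1)
    m = 40
    X_hat = X[rng.choice(len(X), size=m, replace=False)]
    clf = LogisticRegression(C=1.0).fit(nystrom_features(X, X_hat, rbf), y)

At prediction time the resulting classifier only evaluates the kernel against the m sampled examples, which is exactly the reduction in the number of support vectors that the analysis above quantifies.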

V. CONCLUSION

We have developed new methods for analyzing the approximation error of the Nyström method. We show that the approximation error can be improved to O(N/m^{1−ρ}) when there is a large eigengap in the spectrum of the kernel matrix, where ρ ∈ (0, 1/2) is introduced to characterize the eigengap. When the eigenvalues of the kernel matrix follow a p-power law, the approximation error is further reduced to O(N/m^{p−1}) under an incoherence assumption. We also develop a kernel classification approach based on the Nyström method and show that, when the eigenvalues of the kernel matrix follow a p-power law with p > 1, we can reduce the number of support vectors to N^{2p/(p²−1)}, which can be significantly less than N if p is large, without seriously sacrificing the generalization performance.

ACKNOWLEDGMENT

Rong Jin and Mehrdad Mahdavi are supported in part by an Office of Naval Research (ONR) award, and Yu-Feng Li and Zhi-Hua Zhou are partially supported by the National Fundamental Research Program of China (2010CB327903) and the National Science Foundation of China (61021062).

REFERENCES

[1] C. Williams and M. Seeger, "Using the Nyström method to speed up kernel machines," in Advances in Neural Information Processing Systems 13. MIT Press, 2001.
[2] P. Drineas and M. W. Mahoney, "On the Nyström method for approximating a Gram matrix for improved kernel-based learning," Journal of Machine Learning Research, vol. 6, 2005.
[3] C. Fowlkes, S. Belongie, F. Chung, and J. Malik, "Spectral grouping using the Nyström method," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, 2004.
[4] S. Kumar, M. Mohri, and A. Talwalkar, "Sampling techniques for the Nyström method," in Proceedings of the Conference on Artificial Intelligence and Statistics, 2009.
[5] V. D. Silva and J. B. Tenenbaum, "Global versus local methods in nonlinear dimensionality reduction," in Advances in Neural Information Processing Systems 15, 2003.
[6] J. C. Platt, "Fast embedding of sparse music similarity graphs," in Advances in Neural Information Processing Systems 16. MIT Press, 2004.
[7] A. Talwalkar, S. Kumar, and H. A. Rowley, "Large-scale manifold learning," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.
[8] K. Zhang, I. W. Tsang, and J. T. Kwok, "Improved Nyström low-rank approximation and error analysis," in Proceedings of the International Conference on Machine Learning, 2008.
[9] M.-A. Belabbas and P. J. Wolfe, "Spectral methods in machine learning and new strategies for very large data sets," Proceedings of the National Academy of Sciences of the USA, vol. 106, 2009.
[10] A. Talwalkar and A. Rostamizadeh, "Matrix coherence and the Nyström method," in Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2010.
[11] C. Cortes, M. Mohri, and A. Talwalkar, "On the impact of kernel approximation on learning accuracy," Journal of Machine Learning Research - Proceedings Track, vol. 9, pp. 113–120, 2010.
[12] M. Li, J. T. Kwok, and B.-L. Lu, "Making large-scale Nyström approximation possible," in Proceedings of the 27th International Conference on Machine Learning, 2010.
[13] L. W. Mackey, A. S. Talwalkar, and M. I. Jordan, "Divide-and-conquer matrix factorization," in Advances in Neural Information Processing Systems 24, 2011, pp. 1134–1142.
[14] A. Gittens, "The spectral norm error of the naive Nyström extension," CoRR, 2011.
[15] M. W. Mahoney, "Randomized algorithms for matrices and data," Foundations and Trends in Machine Learning, vol. 3, no. 2, pp. 123–224, 2011.
[16] S. Kumar, M. Mohri, and A. Talwalkar, "Sampling methods for the Nyström method," Journal of Machine Learning Research, vol. 13, pp. 981–1006, 2012.
[17] J. Chen and Y. Saad, "Lanczos vectors versus singular vectors for effective dimension reduction," IEEE Transactions on Knowledge and Data Engineering, vol. 21, pp. 1091–1103, 2009.
[18] L. Wu, X. Ying, X. Wu, and Z.-H. Zhou, "Line orthogonality in adjacency eigenspace with application to community partition," in Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
[19] M. Ji, T. Yang, B. Lin, R. Jin, and J. Han, "A simple algorithm for semi-supervised learning with improved generalization error bound," in Proceedings of the 29th International Conference on Machine Learning, 2012.
[20] T. Yang, Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou, "Nyström method vs random Fourier features: A theoretical and empirical comparison," in Advances in Neural Information Processing Systems 25, 2012.
[21] A. Gittens and M. W. Mahoney, "Revisiting the Nyström method for improved large-scale machine learning," in Proceedings of the 30th International Conference on Machine Learning, 2013.
[22] F. Bach, "Sharp analysis of low-rank kernel matrix approximations," in Proceedings of the 26th Conference on Learning Theory, 2013.
[23] J. B. Hough, M. Krishnapur, Y. Peres, and B. Virag, "Determinantal processes and independence," Probability Surveys, vol. 3, pp. 206–229, 2006.
[24] P. Drineas, M. W. Mahoney, and S. Muthukrishnan, "Relative-error CUR matrix decompositions," SIAM Journal on Matrix Analysis and Applications, vol. 30, 2008.
[25] S. Smale and D.-X. Zhou, "Geometry on probability spaces," Constructive Approximation, vol. 30, pp. 311–323, 2009.
[26] E. Candès and J. Romberg, "Sparsity and incoherence in compressive sampling," Inverse Problems, vol. 23, no. 3, pp. 969–985, 2007.
[27] D. L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[28] V. Koltchinskii and M. Yuan, "Sparsity in multiple kernel learning," Annals of Statistics, vol. 38, 2010.
[29] M. Kloft and G. Blanchard, "The local Rademacher complexity of lp-norm multiple kernel learning," in Advances in Neural Information Processing Systems, 2011.
[30] O. Dekel and Y. Singer, "Support vector machines on a budget," in Advances in Neural Information Processing Systems, 2006.
[31] T. Joachims and C.-N. J. Yu, "Sparse kernel SVMs via cutting-plane training," Machine Learning, vol. 76, pp. 179–193, 2009.

[32] P. L. Bartlett, O. Bousquet, and S. Mendelson, "Local Rademacher complexities," Annals of Statistics, vol. 33, pp. 1497–1537, 2005.
[33] V. Koltchinskii, Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer, 2011.

APPENDIX A: PROOF OF THEOREM 1

We argue that there exists a kernel matrix K such that (i) all its diagonal entries equal 1, and (ii) the first m+1 eigenvalues of K are of order Ω(N/m). To see the existence of such a matrix, we sample m+1 vectors u_1, ..., u_{m+1}, where u_i ∈ R^N, from a Bernoulli distribution with Pr(u_{i,j} = +1) = Pr(u_{i,j} = −1) = 1/2. We then construct K as

    K = (1/(m+1)) Σ_{i=1}^{m+1} u_i u_i^T = (1/(m+1)) U U^T,    (14)

where U = (u_1, ..., u_{m+1}). First, since u_{i,j} = ±1, we have diag(u_i u_i^T) = 1, the vector of all ones, and therefore K_{i,i} = 1 for i ∈ [N]. Second, we show that with some probability 1 − δ, all non-zero eigenvalues of (1/N) U^T U are bounded between 1/2 and 3/2, i.e.,

    λ_min( (1/N) U^T U ) ≥ 1/2,  λ_max( (1/N) U^T U ) ≤ 3/2.    (15)

To prove (15), we use the concentration inequality in Proposition 3. We define ξ_i = z_i z_i^T, i = 1, ..., N, where z_i ∈ R^{m+1} is the i-th row of the matrix U, and take the norm in Proposition 3 to be the spectral norm of a matrix. Since every element of z_i is sampled from a Bernoulli distribution with equal probabilities of being ±1, we have E[z_i z_i^T] = I and ‖z_i z_i^T‖₂ = m+1. Thus, with a probability 1 − δ, we have

    ‖ (1/N) U^T U − I ‖₂ = ‖ (1/N) Σ_{i=1}^N ξ_i − E[ξ] ‖₂ ≤ 4(m+1) ln(2/δ) / √N.

When N > 64(m+1)²[ln 4]², for any sampled U, with 50% chance (δ = 1/2), we have ‖(1/N) U^T U − I‖₂ ≤ 1/2, which implies (15). With the bound in (15), and using the fact that the non-zero eigenvalues of U U^T equal the eigenvalues of U^T U, it is straightforward to see that the first m+1 eigenvalues of K are of order Ω(N/m). Up to this point, we have proved the existence of such a kernel matrix.

Next, we prove the lower bound for the constructed kernel matrix. Let V_{1:m+1} = (v_1, ..., v_{m+1}) be the first m+1 eigenvectors of K. We construct ĝ as follows. Let u = V_{1:m+1} a be a vector in the subspace span(v_1, ..., v_{m+1}) that satisfies the condition K_b^T u = 0. The existence of such a vector is guaranteed because rank(K_b^T V_{1:m+1}) ≤ m < m+1. We normalize a such that ‖a‖ = 1. Then we let ĝ = Σ_{i=1}^N u_i κ(x_i, ·) = Σ_{i=1}^{m+1} w_i √λ_i φ_i, where w = V_{1:m+1}^T u. It is easy to verify that (i) ĝ ∈ H_b, since ‖u‖ = ‖V_{1:m+1} a‖ = 1, and (ii) ĝ ⊥ H_a, since u^T K_b = 0. Using ĝ, we have

    E(H_a) ≥ min_{f∈H_a} ‖f − ĝ‖²_{H_κ} = ‖ĝ‖²_{H_κ} = Σ_{i=1}^{m+1} w_i² λ_i ≥ λ_{m+1} ‖w‖² = Ω(N/m),

where we use ‖w‖ = ‖V_{1:m+1}^T V_{1:m+1} a‖ = ‖a‖ = 1. We complete the proof using the fact that E(H_a) = ‖K − K_b K̂^† K_b^T‖₂.

APPENDIX B: PROOF OF COROLLARY 1

Define ξ_{x̂_i} to be the rank-one linear operator ξ_{x̂_i}[f] = κ(x̂_i, ·) f(x̂_i). Apparently, L_m = (1/m) Σ_{i=1}^m ξ_{x̂_i} and E[ξ_{x̂_i}] = L_N. We complete the proof by using the result of Proposition 3 and the fact

    ‖ξ_{x̂_k}‖²_HS = Σ_{i,j=1}^N ⟨ φ_i, κ(x̂_k, ·) φ_j(x̂_k) ⟩²_{H_κ} = Σ_{i,j=1}^N φ_i(x̂_k)² φ_j(x̂_k)² = κ(x̂_k, x̂_k)² ≤ 1,

where the last equality follows from equation (3).

APPENDIX C: PROOF OF COROLLARY 2

We choose a = √(ln N / γ) in Theorem 6. The stated conditions on m ensure that a satisfies the requirement of Theorem 6, and by setting δ = N^{−3} in Theorem 5, the condition of Theorem 5 holds as well. Together these imply

    Pr( sup_{k∈T^c} ‖z_k‖ ≥ μ|T|/m + aσ√|T|/m ) ≤ 3 exp(−γ a²) + Pr( ‖ (N/m) V_{S,T}^T V_{S,T} − I ‖ ≥ 1/2 ) ≤ 3/N + N^{−3}.

From this we have, with a probability 1 − 3N^{−1},

    sup_{k∈T^c} ‖z_k‖ ≤ μ|T|/m + 3μ|T|/m = 4μ|T|/m,

where the last step uses the fact that a σ √|T|/m ≤ 3μ|T|/m under the stated conditions on m.
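The construction used in the proof of Theorem 1 (Appendix A) is easy to reproduce numerically: with U a random ±1 matrix, K = UU^T/(m+1) has unit diagonal and m+1 non-zero eigenvalues concentrating around N/(m+1), so no m-column Nyström approximation can push the spectral error below the Ω(N/m) scale. The sketch below is our own check (with N only roughly at the scale the proof requires), reusing the helpers from the Section II sketch.

    import numpy as np

    def hard_instance(N, m, rng=None):
        """K = U U^T / (m + 1) with U having i.i.d. +/-1 entries: unit diagonal,
        and m+1 nonzero eigenvalues concentrating around N/(m+1)."""
        rng = np.random.default_rng(rng)
        U = rng.choice([-1.0, 1.0], size=(N, m + 1))
        return (U @ U.T) / (m + 1), U

    N, m = 3000, 4
    K, U = hard_instance(N, m, rng=1)
    print(np.allclose(np.diag(K), 1.0))                          # diagonal entries are 1
    print(np.linalg.eigvalsh(U.T @ U) / (m + 1) / (N / m))       # m+1 nonzero eigenvalues, all Theta(N/m)
    print(spectral_error(K, nystrom_approximation(K, m, rng=1)) / (N / m))   # stays Omega(1)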

APPENDIX D: PROOF OF PROPOSITION 4

Since

    (1/N) Σ_{i=1}^N l( y_i f(x_i) ) = (1/N) Σ_{i=1}^N max_{α_i∈Ω} ( α_i y_i f(x_i) − l*(α_i) ),

we rewrite the optimization problem in (11) as the convex–concave optimization problem

    min_{f∈H_a} max_{α_i∈Ω}  (λ/2) ‖f‖²_{H_κ} + (1/N) Σ_{i=1}^N ( α_i y_i f(x_i) − l*(α_i) ).

Since f ∈ H_a, we write f = Σ_{i=1}^m z_i κ(x̂_i, ·), resulting in the following optimization problem:

    min_{z∈R^m} max_{α_i∈Ω}  (λ/2) z^T K̂ z + (1/N) (α∘y)^T K_b z − (1/N) Σ_{i=1}^N l*(α_i).

Since the above problem is convex in z and concave in α, we can switch minimization with maximization. We complete the proof by carrying out the minimization over z.

APPENDIX E: PROOF OF THEOREM 8

To simplify the presentation, we introduce the notations P̂_l(f) = (1/N) Σ_{i=1}^N l( y_i f(x_i) ) and Λ(f) = P_l(f) − P_l(f*). Using P̂_l(f), we can write L(f) = P̂_l(f) + (λ/2) ‖f‖²_{H_κ}.

We first prove that L̃(f̂_a) ≤ L(f̂) + (C²/(2λN)) E(H_a), where max_{z∈Ω} |z| ≤ C. Note that, by the dual formulations (9) and (10),

    L(f̂) = max_{α_i∈Ω} −(1/N) Σ_i l*(α_i) − (1/(2λN²)) (α∘y)^T K (α∘y),

and L̃(f̂_a) is the same expression with K replaced by K̃ = K_b K̂^† K_b^T. Then

    L̃(f̂_a) ≤ max_{α_i∈Ω} [ −(1/N) Σ_i l*(α_i) − (1/(2λN²)) (α∘y)^T K (α∘y) ] + max_{α_i∈Ω} (1/(2λN²)) (α∘y)^T (K − K̃) (α∘y)
            ≤ L(f̂) + (C²/(2λN)) ‖K − K̃‖₂ = L(f̂) + (C²/(2λN)) E(H_a),

since ‖α∘y‖² ≤ N C². We then proceed as follows:

    (λ/2) ‖f̂_a‖²_{H_κ} + P_l(f̂_a) ≤ L̃(f̂_a) + (P_l − P̂_l)(f̂_a)
                                   ≤ L(f̂) + (C²/(2λN)) E(H_a) + (P_l − P̂_l)(f̂_a)
                                   ≤ P̂_l(f*) + (λ/2) ‖f*‖²_{H_κ} + (C²/(2λN)) E(H_a) + (P_l − P̂_l)(f̂_a),

where the last inequality follows from the fact that f̂ is the minimizer of P̂_l(f) + (λ/2) ‖f‖²_{H_κ}. Hence,

    Λ(f̂_a) ≤ (λ/2) ( ‖f*‖²_{H_κ} − ‖f̂_a‖²_{H_κ} ) + (C²/(2λN)) E(H_a) + (P_l − P̂_l)(f̂_a) − (P_l − P̂_l)(f*).

Let r = ‖f* − f̂_a‖²_{L₂} and R = ‖f* − f̂_a‖²_{H_κ}. Define G(r, R) = { f ∈ H_κ : ‖f − f*‖²_{L₂} ≤ r, ‖f − f*‖²_{H_κ} ≤ R }. Using the domain G, we rewrite the bound for Λ(f̂_a) as

    Λ(f̂_a) ≤ (λ/2) ( ‖f*‖²_{H_κ} − ‖f̂_a‖²_{H_κ} ) + (C²/(2λN)) E(H_a) + sup_{f∈G(r,R)} (P_l − P̂_l)(f) − (P_l − P̂_l)(f*).

Since ε̃_N √r ≥ e^{−N} and ε̃_N √R ≥ e^{−N}, using Lemma 9 of [28] we have, with a probability 1 − 3N^{−1},

    sup_{f∈G(r,R)} (P_l − P̂_l)(f) − (P_l − P̂_l)(f*) ≤ C₁C ( √r ε̃_N^{1/2} + √R ε̃_N + e^{−N} ),

where C₁ is a constant independent of N and m. Combining this with the event E(H_a) ≤ Γ(N, m), applying Young's inequality (ab ≤ a²/(2ϵ) + ϵb²/2) twice to the terms C₁C√r ε̃_N^{1/2} and C₁C√R ε̃_N, using ‖f̂_a − f*‖²_{H_κ} ≤ 2‖f̂_a‖²_{H_κ} + 2‖f*‖²_{H_κ}, and using the strong convexity of l(z), which implies (σ/2) ‖f̂_a − f*‖²_{L₂} ≤ Λ(f̂_a) because f* is the minimizer of P_l(f) = E_{x,y}[ l(y f(x)) ], we can absorb the terms in r and R and obtain, up to a factor 1/2 of Λ(f̂_a) on the right-hand side,

    Λ(f̂_a) ≤ (λ/2) ‖f*‖²_{H_κ} + (C²/(2λN)) Γ(N, m) + (C₁²C²/(4σ)) ε̃_N + C₁C ε̃_N + (C₁C/σ) e^{−N} + (1/2) Λ(f̂_a).

Thus, with a probability at least 1 − 4N^{−1} − 3N^{−1}, we have

    P_l(f̂_a) ≤ P_l(f*) + (λ/2) ‖f*‖²_{H_κ} + (C²/(λN)) Γ(N, m) + (C₁²C²/(4σ)) ε̃_N + C₁C ε̃_N + (C₁C/σ) e^{−N}.

We complete the proof by minimizing over λ on the right-hand side of the above inequality.

Rong Jin is a professor in the Department of Computer Science and Engineering at Michigan State University. He has been working in the areas of statistical machine learning and its application to information retrieval. He has extensive research experience with a variety of machine learning algorithms, such as conditional exponential models, support vector machines, boosting, and optimization, for different applications including information retrieval. Dr. Jin is an associate editor of the ACM Transactions on Knowledge Discovery from Data and received the NSF CAREER Award in 2006. He obtained his PhD degree from Carnegie Mellon University in 2003, and received the best paper award at the Conference on Learning Theory (COLT) in 2012.

Tianbao Yang received the PhD degree in Computer Science from Michigan State University in 2012. He joined GE Global Research in 2012 and works as a machine learning researcher. Dr. Yang has broad interests in machine learning and has focused on several research topics, including social network analysis and large-scale optimization in machine learning. He has published over 20 papers in prestigious machine learning conferences and journals. He won the Mark Fulk Best Student Paper Award at the 25th Conference on Learning Theory (COLT) in 2012. He has also served on the program committees of several conferences and journals, including AAAI, CIKM, IJCAI'13, ACML, IJCNN'13, TKDD, and TKDE.

Zhi-Hua Zhou (S'00–M'01–SM'06) received the BSc, MSc and PhD degrees in computer science from Nanjing University, China, in 1996, 1998 and 2000, respectively, all with the highest honors. He joined the Department of Computer Science & Technology at Nanjing University as an assistant professor in 2001, and is currently professor and Director of the LAMDA group. His research interests are mainly in artificial intelligence, machine learning, data mining, pattern recognition and multimedia information retrieval. In these areas he has published more than 100 papers in leading international journals or conference proceedings, and holds a number of patents. He has won various awards and honors, including the IEEE CIS Outstanding Early Career Award, the National Science & Technology Award for Young Scholars of China, the Fok Ying Tung Young Professorship Award, the Microsoft Young Professorship Award, and nine international journal/conference paper or competition awards. He is an Associate Editor-in-Chief of the Chinese Science Bulletin, an Associate Editor of the ACM Transactions on Intelligent Systems and Technology, and serves on the editorial boards of various other journals. He is the founder and Steering Committee Chair of ACML, and a Steering Committee member of PAKDD and PRICAI. He serves or has served as General Chair/Co-chair of ACML, ADMA and PCM'13, Program Chair/Co-Chair of PAKDD'07, PRICAI'08, ACML'09 and SDM'13, Workshop Chair of KDD, and Program Vice Chair or Area Chair of various conferences, and has chaired many domestic conferences in China. He is the Chair of the Machine Learning Technical Committee of the Chinese Association of Artificial Intelligence, Chair of the Artificial Intelligence & Pattern Recognition Technical Committee of the China Computer Federation, Vice Chair of the Data Mining Technical Committee of the IEEE Computational Intelligence Society, and Chair of the IEEE Computer Society Nanjing Chapter. He is a fellow of the IAPR, the IEEE, and the IET/IEE.

Mehrdad Mahdavi received the MSc degree from Sharif University of Technology, Tehran, Iran. Since 2009, he has been pursuing a PhD in Computer Science at Michigan State University; before that he spent two years as a PhD candidate at Sharif University of Technology. His current research interests include machine learning, focused on online stochastic convex optimization, learning theory, and learning in games. He won the Mark Fulk Best Student Paper Award at the Conference on Learning Theory (COLT) in 2012.

Yu-Feng Li received the BSc and PhD degrees in computer science from Nanjing University, China, in 2006 and 2013, respectively. His main research interests include machine learning and data mining. He won the Microsoft Fellowship Award in 2009. He has been a Program Committee member of several conferences, including IJCAI, ICDM, and AAAI.


More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

On Poset Merging. 1 Introduction. Peter Chen Guoli Ding Steve Seiden. Keywords: Merging, Partial Order, Lower Bounds. AMS Classification: 68W40

On Poset Merging. 1 Introduction. Peter Chen Guoli Ding Steve Seiden. Keywords: Merging, Partial Order, Lower Bounds. AMS Classification: 68W40 On Poset Merging Peter Chen Guoli Ding Steve Seiden Abstract We consider the follow poset erging proble: Let X and Y be two subsets of a partially ordered set S. Given coplete inforation about the ordering

More information

On the Use of A Priori Information for Sparse Signal Approximations

On the Use of A Priori Information for Sparse Signal Approximations ITS TECHNICAL REPORT NO. 3/4 On the Use of A Priori Inforation for Sparse Signal Approxiations Oscar Divorra Escoda, Lorenzo Granai and Pierre Vandergheynst Signal Processing Institute ITS) Ecole Polytechnique

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 2: PAC Learning and VC Theory I Fro Adversarial Online to Statistical Three reasons to ove fro worst-case deterinistic

More information

Convex Programming for Scheduling Unrelated Parallel Machines

Convex Programming for Scheduling Unrelated Parallel Machines Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly

More information

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs On the Inapproxiability of Vertex Cover on k-partite k-unifor Hypergraphs Venkatesan Guruswai and Rishi Saket Coputer Science Departent Carnegie Mellon University Pittsburgh, PA 1513. Abstract. Coputing

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

Symmetrization and Rademacher Averages

Symmetrization and Rademacher Averages Stat 928: Statistical Learning Theory Lecture: Syetrization and Radeacher Averages Instructor: Sha Kakade Radeacher Averages Recall that we are interested in bounding the difference between epirical and

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung

More information

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data Suppleentary to Learning Discriinative Bayesian Networks fro High-diensional Continuous Neuroiaging Data Luping Zhou, Lei Wang, Lingqiao Liu, Philip Ogunbona, and Dinggang Shen Proposition. Given a sparse

More information

Support Vector Machines. Maximizing the Margin

Support Vector Machines. Maximizing the Margin Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago

More information

Bipartite subgraphs and the smallest eigenvalue

Bipartite subgraphs and the smallest eigenvalue Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.

More information

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Soft Coputing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Beverly Rivera 1,2, Irbis Gallegos 1, and Vladik Kreinovich 2 1 Regional Cyber and Energy Security Center RCES

More information

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate The Siplex Method is Strongly Polynoial for the Markov Decision Proble with a Fixed Discount Rate Yinyu Ye April 20, 2010 Abstract In this note we prove that the classic siplex ethod with the ost-negativereduced-cost

More information

arxiv: v1 [cs.ds] 29 Jan 2012

arxiv: v1 [cs.ds] 29 Jan 2012 A parallel approxiation algorith for ixed packing covering seidefinite progras arxiv:1201.6090v1 [cs.ds] 29 Jan 2012 Rahul Jain National U. Singapore January 28, 2012 Abstract Penghui Yao National U. Singapore

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

Rademacher Complexity Margin Bounds for Learning with a Large Number of Classes

Rademacher Complexity Margin Bounds for Learning with a Large Number of Classes Radeacher Coplexity Margin Bounds for Learning with a Large Nuber of Classes Vitaly Kuznetsov Courant Institute of Matheatical Sciences, 25 Mercer street, New York, NY, 002 Mehryar Mohri Courant Institute

More information

Interactive Markov Models of Evolutionary Algorithms

Interactive Markov Models of Evolutionary Algorithms Cleveland State University EngagedScholarship@CSU Electrical Engineering & Coputer Science Faculty Publications Electrical Engineering & Coputer Science Departent 2015 Interactive Markov Models of Evolutionary

More information

Testing Properties of Collections of Distributions

Testing Properties of Collections of Distributions Testing Properties of Collections of Distributions Reut Levi Dana Ron Ronitt Rubinfeld April 9, 0 Abstract We propose a fraework for studying property testing of collections of distributions, where the

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION GUANGHUI LAN AND YI ZHOU Abstract. In this paper, we consider a class of finite-su convex optiization probles defined over a distributed

More information

Lecture 9 November 23, 2015

Lecture 9 November 23, 2015 CSC244: Discrepancy Theory in Coputer Science Fall 25 Aleksandar Nikolov Lecture 9 Noveber 23, 25 Scribe: Nick Spooner Properties of γ 2 Recall that γ 2 (A) is defined for A R n as follows: γ 2 (A) = in{r(u)

More information

Chaotic Coupled Map Lattices

Chaotic Coupled Map Lattices Chaotic Coupled Map Lattices Author: Dustin Keys Advisors: Dr. Robert Indik, Dr. Kevin Lin 1 Introduction When a syste of chaotic aps is coupled in a way that allows the to share inforation about each

More information

OPTIMIZATION in multi-agent networks has attracted

OPTIMIZATION in multi-agent networks has attracted Distributed constrained optiization and consensus in uncertain networks via proxial iniization Kostas Margellos, Alessandro Falsone, Sione Garatti and Maria Prandini arxiv:603.039v3 [ath.oc] 3 May 07 Abstract

More information

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab Support Vector Machines Machine Learning Series Jerry Jeychandra Bloh Lab Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

Bayes Decision Rule and Naïve Bayes Classifier

Bayes Decision Rule and Naïve Bayes Classifier Bayes Decision Rule and Naïve Bayes Classifier Le Song Machine Learning I CSE 6740, Fall 2013 Gaussian Mixture odel A density odel p(x) ay be ulti-odal: odel it as a ixture of uni-odal distributions (e.g.

More information

. The univariate situation. It is well-known for a long tie that denoinators of Pade approxiants can be considered as orthogonal polynoials with respe

. The univariate situation. It is well-known for a long tie that denoinators of Pade approxiants can be considered as orthogonal polynoials with respe PROPERTIES OF MULTIVARIATE HOMOGENEOUS ORTHOGONAL POLYNOMIALS Brahi Benouahane y Annie Cuyt? Keywords Abstract It is well-known that the denoinators of Pade approxiants can be considered as orthogonal

More information

A Note on Online Scheduling for Jobs with Arbitrary Release Times

A Note on Online Scheduling for Jobs with Arbitrary Release Times A Note on Online Scheduling for Jobs with Arbitrary Release Ties Jihuan Ding, and Guochuan Zhang College of Operations Research and Manageent Science, Qufu Noral University, Rizhao 7686, China dingjihuan@hotail.co

More information

Optimal Jamming Over Additive Noise: Vector Source-Channel Case

Optimal Jamming Over Additive Noise: Vector Source-Channel Case Fifty-first Annual Allerton Conference Allerton House, UIUC, Illinois, USA October 2-3, 2013 Optial Jaing Over Additive Noise: Vector Source-Channel Case Erah Akyol and Kenneth Rose Abstract This paper

More information

The Hilbert Schmidt version of the commutator theorem for zero trace matrices

The Hilbert Schmidt version of the commutator theorem for zero trace matrices The Hilbert Schidt version of the coutator theore for zero trace atrices Oer Angel Gideon Schechtan March 205 Abstract Let A be a coplex atrix with zero trace. Then there are atrices B and C such that

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lessons 7 20 Dec 2017 Outline Artificial Neural networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

e-companion ONLY AVAILABLE IN ELECTRONIC FORM OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Distributed Subgradient Methods for Multi-agent Optimization

Distributed Subgradient Methods for Multi-agent Optimization 1 Distributed Subgradient Methods for Multi-agent Optiization Angelia Nedić and Asuan Ozdaglar October 29, 2007 Abstract We study a distributed coputation odel for optiizing a su of convex objective functions

More information

CS Lecture 13. More Maximum Likelihood

CS Lecture 13. More Maximum Likelihood CS 6347 Lecture 13 More Maxiu Likelihood Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2 Maxiu Likelihood

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

Lecture 21. Interior Point Methods Setup and Algorithm

Lecture 21. Interior Point Methods Setup and Algorithm Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016 Lessons 7 14 Dec 2016 Outline Artificial Neural networks Notation...2 1. Introduction...3... 3 The Artificial

More information

Tight Complexity Bounds for Optimizing Composite Objectives

Tight Complexity Bounds for Optimizing Composite Objectives Tight Coplexity Bounds for Optiizing Coposite Objectives Blake Woodworth Toyota Technological Institute at Chicago Chicago, IL, 60637 blake@ttic.edu Nathan Srebro Toyota Technological Institute at Chicago

More information

Geometrical intuition behind the dual problem

Geometrical intuition behind the dual problem Based on: Geoetrical intuition behind the dual proble KP Bennett, EJ Bredensteiner, Duality and Geoetry in SVM Classifiers, Proceedings of the International Conference on Machine Learning, 2000 1 Geoetrical

More information

SPECTRUM sensing is a core concept of cognitive radio

SPECTRUM sensing is a core concept of cognitive radio World Acadey of Science, Engineering and Technology International Journal of Electronics and Counication Engineering Vol:6, o:2, 202 Efficient Detection Using Sequential Probability Ratio Test in Mobile

More information

A Simplified Analytical Approach for Efficiency Evaluation of the Weaving Machines with Automatic Filling Repair

A Simplified Analytical Approach for Efficiency Evaluation of the Weaving Machines with Automatic Filling Repair Proceedings of the 6th SEAS International Conference on Siulation, Modelling and Optiization, Lisbon, Portugal, Septeber -4, 006 0 A Siplified Analytical Approach for Efficiency Evaluation of the eaving

More information

Learnability and Stability in the General Learning Setting

Learnability and Stability in the General Learning Setting Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu

More information

DESIGN OF THE DIE PROFILE FOR THE INCREMENTAL RADIAL FORGING PROCESS *

DESIGN OF THE DIE PROFILE FOR THE INCREMENTAL RADIAL FORGING PROCESS * IJST, Transactions of Mechanical Engineering, Vol. 39, No. M1, pp 89-100 Printed in The Islaic Republic of Iran, 2015 Shira University DESIGN OF THE DIE PROFILE FOR THE INCREMENTAL RADIAL FORGING PROCESS

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

arxiv: v1 [cs.lg] 8 Jan 2019

arxiv: v1 [cs.lg] 8 Jan 2019 Data Masking with Privacy Guarantees Anh T. Pha Oregon State University phatheanhbka@gail.co Shalini Ghosh Sasung Research shalini.ghosh@gail.co Vinod Yegneswaran SRI international vinod@csl.sri.co arxiv:90.085v

More information

A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING. Emmanuel J. Candès Yaniv Plan. Technical Report No November 2010

A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING. Emmanuel J. Candès Yaniv Plan. Technical Report No November 2010 A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING By Eanuel J Candès Yaniv Plan Technical Report No 200-0 Noveber 200 Departent of Statistics STANFORD UNIVERSITY Stanford, California 94305-4065

More information

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS ISSN 1440-771X AUSTRALIA DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS An Iproved Method for Bandwidth Selection When Estiating ROC Curves Peter G Hall and Rob J Hyndan Working Paper 11/00 An iproved

More information

ON THE TWO-LEVEL PRECONDITIONING IN LEAST SQUARES METHOD

ON THE TWO-LEVEL PRECONDITIONING IN LEAST SQUARES METHOD PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY Physical and Matheatical Sciences 04,, p. 7 5 ON THE TWO-LEVEL PRECONDITIONING IN LEAST SQUARES METHOD M a t h e a t i c s Yu. A. HAKOPIAN, R. Z. HOVHANNISYAN

More information

Design of Spatially Coupled LDPC Codes over GF(q) for Windowed Decoding

Design of Spatially Coupled LDPC Codes over GF(q) for Windowed Decoding IEEE TRANSACTIONS ON INFORMATION THEORY (SUBMITTED PAPER) 1 Design of Spatially Coupled LDPC Codes over GF(q) for Windowed Decoding Lai Wei, Student Meber, IEEE, David G. M. Mitchell, Meber, IEEE, Thoas

More information

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information Cite as: Straub D. (2014). Value of inforation analysis with structural reliability ethods. Structural Safety, 49: 75-86. Value of Inforation Analysis with Structural Reliability Methods Daniel Straub

More information