Distributed Learning with Regularized Least Squares


Journal of Machine Learning Research 18 (2017). Submitted 11/15; Revised 6/17; Published 9/17.

Distributed Learning with Regularized Least Squares

Shao-Bo Lin (sblin1983@gmail.com)
Department of Mathematics, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong

Xin Guo (xguo@polyu.edu.hk)
Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong

Ding-Xuan Zhou (mazhou@cityu.edu.hk)
Department of Mathematics, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong

Editor: Ingo Steinwart

Abstract

We study distributed learning with the least squares regularization scheme in a reproducing kernel Hilbert space (RKHS). By a divide-and-conquer approach, the algorithm partitions a data set into disjoint data subsets, applies the least squares regularization scheme to each data subset to produce an output function, and then takes an average of the individual output functions as a final global estimator or predictor. We show with error bounds and learning rates in expectation in both the L²-metric and RKHS-metric that the global output function of this distributed learning is a good approximation to the algorithm processing the whole data in one single machine. Our derived learning rates in expectation are optimal and stated in a general setting without any eigenfunction assumption. The analysis is achieved by a novel second order decomposition of operator differences in our integral operator approach. Even for the classical least squares regularization scheme in the RKHS associated with a general kernel, we give the best learning rate in expectation in the literature.

Keywords: Distributed learning, divide-and-conquer, error analysis, integral operator, second order decomposition

1. Introduction and Distributed Learning Algorithms

In the era of big data, the rapid expansion of computing capacities in automatic data generation and acquisition brings data of unprecedented size and complexity, and raises a series of scientific challenges such as storage bottleneck and algorithmic scalability (Zhou et al., 2014). To overcome the difficulty, some approaches for generating scalable approximate algorithms have been introduced in the literature, such as low-rank approximations of kernel matrices for kernel principal component analysis (Schölkopf et al., 1998; Bach, 2013), incomplete Cholesky decomposition (Fine and Scheinberg, 2001), early stopping of iterative optimization algorithms for gradient descent methods (Yao et al., 2007; Raskutti et al., 2014; Lin et al., 2016), and greedy-type algorithms. Another method proposed recently is distributed learning based on a divide-and-conquer approach and a particular learning algorithm implemented in individual machines (Zhang et al., 2015; Shamir and Srebro, 2014). This method

produces distributed learning algorithms consisting of three steps: partitioning the data into disjoint subsets, applying a particular learning algorithm implemented in an individual machine to each data subset to produce an individual output function, and synthesizing a global output by utilizing some average of the individual outputs. This method can significantly reduce computing time and lower single-machine memory requirements. For practical applications in medicine, finance, business and some other areas, the data are often stored naturally across multiple servers in a distributive way and are not combined, for reasons of protecting privacy and avoiding high costs. In such situations, the first step of data partitioning is not needed. It has been observed in many practical applications that when the number of data subsets is not too large, the divide-and-conquer approach does not cause a noticeable loss of performance compared with the learning scheme which processes the whole data on a single big machine. Theoretical attempts have been made recently in (Zhang et al., 2013, 2015) to derive learning rates for distributed learning with least squares regularization schemes under certain assumptions. This paper aims at error analysis of the distributed learning with regularized least squares and its approximation to the algorithm processing the whole data in one single machine.

Recall (Cristianini and Shawe-Taylor, 2000; Evgeniou et al., 2000) that in a reproducing kernel Hilbert space (RKHS) H_K induced by a Mercer kernel K on an input metric space X, with a sample D = {(x_i, y_i)}_{i=1}^{|D|} ⊂ X × Y, where Y = R is the output space, the least squares regularization scheme can be stated as

f_{D,λ} = arg min_{f ∈ H_K} { (1/|D|) Σ_{(x,y) ∈ D} (f(x) − y)² + λ ‖f‖_K² }.   (1)

Here λ > 0 is a regularization parameter and |D| is the cardinality of D. This learning algorithm is also called kernel ridge regression in statistics and has been well studied in learning theory; see, e.g., (De Vito et al., 2005; Caponnetto and De Vito, 2007; Steinwart et al., 2009; Bauer et al., 2007; Smale and Zhou, 2007; Steinwart and Christmann, 2008). The regularization scheme (1) can be explicitly solved by using a standard matrix inversion technique, which requires costs of O(|D|³) in time and O(|D|²) in memory. However, this matrix inversion technique may not conquer the challenges of storage or computation arising from big data.

The distributed learning algorithm studied in this paper starts with partitioning the data set D into m disjoint subsets {D_j}_{j=1}^m. Then it assigns each data subset D_j to one machine or processor to produce a local estimator f_{D_j,λ} by the least squares regularization scheme (1). Finally, these local estimators are communicated to a central processor, and a global estimator f̄_{D,λ} is synthesized by taking a weighted average

f̄_{D,λ} = Σ_{j=1}^m (|D_j|/|D|) f_{D_j,λ}   (2)

of the local estimators {f_{D_j,λ}}_{j=1}^m. This algorithm has been studied with a matrix analysis approach in (Zhang et al., 2015), where some error analysis has been conducted under some eigenfunction assumptions for the integral operator associated with the kernel, presenting error bounds in expectation.
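To make the construction concrete, here is a minimal NumPy sketch of the regularization scheme (1) and the averaged estimator (2). It is an illustration only: the Gaussian kernel, the random partition, and all function and variable names are assumptions made for the example, not part of the paper.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); A is (n, d), B is (m, d).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def krr_fit(X, y, lam, sigma=1.0):
    # Scheme (1): the minimizer is f = sum_i alpha_i K(x_i, .) with
    # (K + lam * n * I) alpha = y, costing O(n^3) time and O(n^2) memory.
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return X, alpha

def krr_predict(model, X_test, sigma=1.0):
    X_train, alpha = model
    return gaussian_kernel(X_test, X_train, sigma) @ alpha

def distributed_krr(X, y, lam, m, sigma=1.0, seed=0):
    # Scheme (2): random partition into m disjoint subsets, one local
    # estimator per subset, then the |D_j|/|D| weighted average of the
    # local predictions.
    idx = np.random.default_rng(seed).permutation(len(X))
    parts = np.array_split(idx, m)
    models = [krr_fit(X[p], y[p], lam, sigma) for p in parts]
    weights = [len(p) / len(X) for p in parts]

    def predict(X_test):
        return sum(w * krr_predict(mod, X_test, sigma)
                   for w, mod in zip(weights, models))

    return predict
```

Note that all subsets share a single regularization parameter λ, as in the theory developed below.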

In this paper we shall use a novel integral operator approach to prove that f̄_{D,λ} is a good approximation of f_{D,λ}. We present a representation of the difference f̄_{D,λ} − f_{D,λ} in terms of empirical integral operators, and analyze the error f̄_{D,λ} − f_{D,λ} in expectation without any eigenfunction assumptions. As a by-product, we present the best learning rates in expectation for the least squares regularization scheme (1) in a general setting, which surprisingly has not been done for a general kernel in the literature (see the detailed comparisons below).

2. Main Results

Our analysis is carried out in the standard least squares regression framework with a Borel probability measure ρ on Z := X × Y, where the input space X is a compact metric space. The sample D is drawn independently according to ρ. The Mercer kernel K : X × X → R defines an integral operator L_K on H_K by

L_K f = ∫_X K_x f(x) dρ_X,   f ∈ H_K,   (3)

where K_x is the function K(·, x) in H_K and ρ_X is the marginal distribution of ρ on X.

2.1 Error Bounds for the Distributed Learning Algorithm

Our error bounds in expectation for the distributed learning algorithm require the uniform boundedness condition for the output y, that is, for some constant M > 0, there holds |y| ≤ M almost surely. Our bounds are stated in terms of the approximation error

‖f_λ − f_ρ‖_ρ,   (4)

where f_λ is the data-free limit of (1) defined by

f_λ = arg min_{f ∈ H_K} { ∫_Z (f(x) − y)² dρ + λ ‖f‖_K² },   (5)

‖·‖_ρ denotes the norm of L²_{ρ_X}, the Hilbert space of square integrable functions with respect to ρ_X, and f_ρ is the regression function (conditional mean) of ρ defined by

f_ρ(x) = ∫_Y y dρ(y|x),   x ∈ X,

with ρ(·|x) being the conditional distribution of ρ induced at x ∈ X. Since K is continuous, symmetric and positive semidefinite, L_K is a compact positive operator of trace class and L_K + λI is invertible. Define a quantity measuring the complexity of H_K with respect to ρ_X, the effective dimension (Zhang, 2005), to be the trace of the operator (L_K + λI)^{-1} L_K:

N(λ) = Tr((L_K + λI)^{-1} L_K),   λ > 0.   (6)

In Section 6 we shall prove the following first main result of this paper concerning error bounds in expectation for f̄_{D,λ} − f_{D,λ} in H_K and in L²_{ρ_X}. Denote κ = sup_{x∈X} √(K(x, x)).
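The effective dimension (6) can be estimated from data: the eigenvalues of the normalized kernel matrix on a sample drawn from ρ_X serve as a proxy for the eigenvalues of L_K. The sketch below is such a plug-in estimate; treating the matrix eigenvalues as those of the integral operator is a standard heuristic assumption made for this illustration, not part of the paper's analysis.

```python
import numpy as np

def effective_dimension(K_matrix, lam):
    # Plug-in estimate of N(lambda) = Tr((L_K + lambda I)^{-1} L_K):
    # the eigenvalues of K/n, where K is the n x n kernel matrix on a
    # sample drawn from rho_X, stand in for those of L_K.
    n = K_matrix.shape[0]
    mu = np.linalg.eigvalsh(K_matrix) / n
    mu = np.clip(mu, 0.0, None)   # clear tiny negative round-off values
    return float(np.sum(mu / (mu + lam)))
```

For kernels whose eigenvalues decay polynomially, this quantity grows like a power of 1/λ, which is the behavior quantified by the capacity assumption used in the learning rates below.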

4 Lin, Guo and Zhou Theorem Assume y M almost surely If = m for j =,, m, then we have E { f D, f ρ D, C κ M { mc m + } mc m + f ρ f ρ m + } mc m and E f D, f D, C κ { M mc m { + mc m } + f ρ f ρ m + } mc m, where C κ is a constant depending only on κ, and C m is the quantity given in terms of m,,, by C m := m + To derive explicit learning rates from the general error bounds in Theorem, one can quantify the increment of by the following capacity assumption, a characteristic of the complexity of the hypothesis space Caponnetto and De Vito, 007; Blanchard and rämer, 00, c β, > 0 7 for some constants β > 0 and c > 0 Let { l, ϕ l } l be a set of normalized eigenpairs of L on H with {ϕ l } l being an orthonormal basis of H and { l } l arranged in a nonincreasing order A sufficient condition for the capacity assumption 7 with 0 < β < is l = Ol /β, which can be found in, eg Caponnetto and De Vito 007 Remark The sufficient condition l = Ol /β with the index β = d τ < is satisfied by the Sobolev space W τ BR d with the smoothness index τ > d/ on a ball BR d of the Euclidean space R d when the marginal distribution ρ X is the uniform distribution on BR d, see Steinwart et al, 009; Edmunds and Triebel, 996 Condition 7 with β = always holds true with the choice of the constant c = κ In fact, the eigenvalues of the operator L + I L are { l l + } l So its trace is given by = l l l + l l = TrL κ In the existing literature on learning rates for the classical least squares algorithm, the regularization parameter is often taken to satisfy the restriction = O as in Caponnetto and De Vito, 007 up to a logarithmic factor or in Steinwart et al, 009 under some assumptions on, ρ X see 4 below Here, to derive learning rates for E f D, f ρ D, with m corresponding to the distributed learning algorithm, we consider to satisfy the following restriction with some constant C 0 > 0, 0 < C 0 and m C 0 8 Corollary 3 Assume y M almost surely If = m for j =,, m, and satisfies 8, then we have E { f D, f ρ D, C κ M + m f } ρ f ρ 4

5 Distributed Learning and E f D, f D, C κ { M + m f } ρ f ρ, where C κ is a constant depending only on κ, C 0, and the largest eigenvalue of L In the special case that f ρ H, the approximation error can be bounded as f f ρ ρ f ρ A more general regularity condition can be imposed for the regression function as f ρ = L r g ρ for some g ρ L ρ X, r > 0, 9 where the integral operator L is regarded as a compact positive operator on L ρ X and its rth power L r is well defined for any r > 0 The condition 9 means f ρ lies in the range of L r, and the special case f ρ H corresponds to the choice r = / It implies f f ρ ρ r g ρ ρ by Lemma below Thus, under condition 9, we obtain from Corollary 3, by choosing to minimize the order of the error bound subject to the restriction 8, the following nice convergence rates for the error f D, f D, of the distributed learning algorithm Corollary 4 Assume regularity condition 9 for some 0 < r, capacity assumption 7 with β = α for some α > 0, and y M almost surely If D j = m for j =,, m with α max{r,}+ +αr m α max{r,}++αr, 0 and = α m α max{r,}+, then we have E f D, f ρ D, C m α max{r,} α max{r,}+ κ,c,r m and E f D, f D, C m α max{r,0} α max{r,}+ κ,c,r, m where C κ,c,r is a constant independent of or m In particular, when f ρ H and m E f D, f D, ρ C κ,c,r α α+ m +α, taking = m 4α+ and E f D, f D, C κ,c,r α α+ yields the rates Remark 5 In Corollary 4, we present learning rates in both H and L ρ X norms The rates with respect to the L ρ X norm provide estimates for the generalization ability of the algorithm for regression The rates with respect to the H norm give error estimates with respect to the uniform metric due to the relation f κ f, and might be used to solve some mismatch problems in learning theory where the generalization ability of learning algorithms is measured with respect to a probability measure µ different from ρ X Remark 6 The learning rates in Corollary 4 are stated for the norms of the difference f D, f D, which reflects the variance of the distributed learning algorithm These rates decrease as m increases subject to the restriction 0 and thereby the regularization parameter increases, which is different from the learning rates presented for E f D, f ρ ρ in Zhang et al, 05 To derive learning rates for E f D, f ρ ρ by our analysis, the regularization parameter should be independent of m, as shown in Corollary below m 5

6 Lin, Guo and Zhou Optimal Learning Rates for Least Squares Regularization Scheme The second main result of this paper is optimal learning rates in expectation for the least squares regularization scheme We can even relax the uniform boundedness to a moment condition that for some constant p, σ ρ L p ρ X, where σ ρ is the conditional variance defined by σ ρx = Y y f ρx dρy x The following learning rates in expectation for the least squares regularization scheme will be proved in Section 5 The existence of f is ensured by E[y < Theorem 7 Assume E[y < and moment condition for some p Then we have E [ f D, f ρ ρ + 56κ κ + κ + + { + f f ρ ρ + σρ p } p p If the parameters satisfy = O, we have the following explicit bound Corollary 8 Assume E[y < and moment condition for some p If satisfies 8 with m =, then we have E [ f D, f ρ ρ = O f f ρ ρ + p p In particular, if p =, that is, the conditional variances are uniformly bounded, then E [ f D, f ρ ρ = O f f ρ ρ + In particular, when regularity condition 9 is satisfied, we have the following learning rates in expectation Corollary 9 Assume E[y <, moment condition for some p, and regularity condition 9 for some 0 < r If the capacity assumption = O α holds with some α > 0, then by taking = α α max{r,}+ we have E [ f D, f ρ ρ = O rα α max{r,}+ + p α α max{r,}+ In particular, when p = the conditional variances are uniformly bounded, we have E [ f D, f ρ ρ = O rα α max{r,}+ 6

7 Distributed Learning Remark 0 It was shown in Caponnetto and De Vito, 007; Steinwart et al, 009; Bauer et al, 007 that under the condition of Corollary 9 with p = and r [,, the best learning rate, called minimax lower rate of convergence, for learning algorithms with output functions in H is O rα 4αr+ So the convergence rate we obtain in Corollary 9 is optimal Combining bounds for f D, f D, ρ and f D, f ρ ρ, we shall prove in Section 6 the following learning rates in expectation for the distributed learning algorithm for regression Corollary Assume y M almost surely and regularity condition 9 for some < r If the capacity assumption = O α holds with some α > 0, = m for j =,, m, and m satisfies the restriction α then by taking = 4αr+, we have [ E f D, f ρ ρ = O αr 4αr+ m αr 4αr+, 3 Remark Corollary shows that distributed learning with the least squares regularization scheme in a RHS can achieve the optimal learning rates in expectation, provided that m satisfies the restriction 3 It should be pointed out that our error analysis is carried out under regularity condition 9 with / < r while the work in Zhang et al, 05 focused on the case with r = / When r approaches /, the number m of local processors under the restriction 3 reduces to, which corresponds to the non-distributed case In our follow-up work, we will consider to relax the restriction 3 in a semi-supervised learning framework by using additional unlabelled data, as done in Caponnetto and Yao, 00 The main contribution of our analysis for distributed learning in this paper is to remove an eigenfunction assumption in Zhang et al, 05 by using a novel second order decomposition for a difference of operator inverses Remark 3 In Corollary and Corollary 9, the choice of the regularization parameter is independent of the number m of local processors This is consistent with the results in Zhang et al, 05 There have been several approaches for selecting the regularization parameter in regularization schemes in the literature including cross-validation Györfy et al, 00; Blanchard and rämer, 06 and the balancing principle De Vito et al, 00 For practical applications of distributed learning algorithms, how to choose and m except the situations when the data are stored naturally in a distributive way is crucial Though we only consider the theoretical topic of error analysis in this paper, it would be interesting to develop parameter selection methods for distributed learning 3 Comparisons and Discussion The least squares regularization scheme is a classical algorithm for regression and has been extensively investigated in statistics and learning theory There is a vast literature on 7

8 Lin, Guo and Zhou this topic Here for a general kernel beyond those for the Sobolev spaces, we compare our results with the best learning rates in the existing literature Under the assumption that the orthogonal projection f H of f ρ in L ρ X onto the closure of H satisfies regularity condition 9 for some r, and that the eigenvalues { i} i of L satisfy i i α with some α > /, it was proved in Caponnetto and De Vito, 007 that [ lim lim sup sup Prob f D, f H τ ρ > τr = 0, where ρ Pα = α log 4αr+, α if < r, α+, if r =, and Pα denotes a set of probability measures ρ satisfying some moment decay condition which is satisfied when y M This learning rate in probability is stated with a limit as τ and it has a logarithmic factor in the case r = In particular, to guarantee f D, f H ρ τ η r with confidence η by this result, one needs to restrict η to be large enough and has the constant τ η depending on η to be large enough Using existing tools for error analysis including those in Caponnetto and De Vito, 007, we develop a novel second order decomposition technique for a difference of operator inverses in this paper, and succeed in deriving the optimal learning rate in expectation in Corollary 9 by removing the logarithmic factor in the case r = Under the assumption that y M almost surely, the eigenvalues { i } i satisfying i ai α with some α > / and a > 0, and for some constant C > 0, the pair, ρ X satisfying f C f α f α ρ 4 for every f H, it was proved in Steinwart et al, 009 that for some constant c α,c depending only on α and C, with confidence η, for any 0 <, π M f D, f ρ ρ 9A a /α M log3/η + c α,c /α Here π M is the projection onto the interval [ M, M defined Chen et al, 004; Wu et al, 006 by M, if fx > M, π M fx = fx, if fx M, M, if fx < M, and A is the approximation error defined by { } A = inf f f ρ f H ρ + f When f ρ H, A = O and the choice = α α+ gives a learning rate of order f D, f ρ ρ = O α α+ Here one needs to impose the condition 4 for the functions spaces L ρ X and H, and to take the projection onto [ M, M Our learning rates 8

9 Distributed Learning in Corollary 9 do not require such a condition for the function spaces, nor do we take the projection Let us mention the fact from Steinwart et al, 009; Mendelson and eeman, 00 that condition 4 is satisfied when H is a Sobolev space on a domain of a Euclidean space and ρ X is the uniform distribution, or when the eigenfunctions {ϕ l / l : l > 0} of L normalized in L ρ X are uniformly bounded Recall that ϕ l L ρx = l since l = l ϕ l = lϕ l, ϕ l = L ϕ l, ϕ l equals x ϕ l xdρ X x, ϕ l = ϕ l xϕ l xdρ X x = ϕ l L ρ X X X Learning rates for the least squares regularization scheme in the H -metric have been investigated in Smale and Zhou, 007 For the distributed learning algorithm with subsets {D j } m j= of equal size, under the assumption that for some constants < k and A <, the eigenfunctions {ϕ i } i satisfy [ ϕ sup i >0 E i x k i A k, when k <, 5 sup i >0 ϕ ix i A, when k =, and that f ρ H and i ai α for some α > / and a > 0, it was proved in Zhang et al, 05 that for = α α+ and m satisfying the restriction k 4α k k α+ m c A 4k log k, when k < α 6 α α+ A 4 log, when k = with a constant c α depending only on α, there holds E [ f D, f ρ = O α α+ This ρ interesting result was achieved by a matrix analysis approach for which the eigenfunction assumption 5 played an essential role It might be challenging to check the eigenfunction assumption 5 involving the integral operator L = L,ρX for the pair, ρ X To our best knowledge, besides the case of finite dimensional RHSs, there exist in the literature only three classes of pairs, ρ X proved satisfying this eigenfunction assumption: the first class Steinwart et al, 009 is the Sobolev reproducing kernels on Euclidean domains together with the unform measures ρ X, the second Williamson et al, 00 is periodical kernels, and the third class can be constructed by a Mercer type expansion x, y = i ϕ i x ϕ i i y, 7 i i where { ϕ ix i } i is an orthonormal system in L ρ X An example of a C Mercer kernel was presented in Zhou, 00, 003 to show that only the smoothness of the Mercer kernel does not guarantee the uniform boundedness of the eigenfunctions { ϕ ix i } i Furthermore, it was shown in Gittens and Mahoney, 06 that these normalized eigenfunctions associated with radial basis kernels tend to be localized when the radial basis parameters are made smaller, which raises a practical concern for the uniform boundedness assumption on the eigenfunctions 9
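The uniform boundedness of the normalized eigenfunctions discussed above is hard to verify analytically, but it can at least be probed numerically on a sample from ρ_X: the eigenvectors of the kernel matrix give approximate values of the L²(ρ_X)-normalized eigenfunctions at the sample points. The sketch below is only such a heuristic diagnostic under that sampling approximation; it is not a procedure from the paper.

```python
import numpy as np

def normalized_eigenfunction_sup_norms(K_matrix, top=10):
    # If u_i is a unit eigenvector of the kernel matrix on x_1, ..., x_n,
    # then sqrt(n) * u_i[j] approximates the value at x_j of the
    # L^2(rho_X)-normalized eigenfunction phi_i / sqrt(lambda_i).
    # Large sup-norms therefore hint at localized eigenfunctions, the
    # practical concern raised by Gittens and Mahoney (2016).
    n = K_matrix.shape[0]
    eigvals, eigvecs = np.linalg.eigh(K_matrix)
    leading = np.argsort(eigvals)[::-1][:top]
    return [float(np.max(np.abs(np.sqrt(n) * eigvecs[:, i]))) for i in leading]
```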

10 Lin, Guo and Zhou Remark 4 To see how the eigenfunctions {ϕ i x} i change with the marginal distribution, we consider a different measure µ induced by a nonnegative function P L ρ X as dµ = P xdρ X An eigenpair, ϕ of the integral operator L,µ associated with the pair, µ with > 0 satisfies L,µ ϕ =, xϕxp xdρ X = ϕ i i X i i >0 X ϕ i x i ϕxp xdρ X = ϕ We introduce an index set I := {i : i > 0}, a possibly infinite matrix P ϕ i x = i P x ϕ jx dρ X, 8 i j X and express ϕ in terms of the orthonormal basis {ϕ i } i I as ϕ = i c i i I ϕ i where the ϕ sequence c = c i i I is given by c i = ϕ, i i L ρx = i ϕ, ϕ i Then the eigenpair, ϕ satisfies L,µ ϕ = ϕ i I ϕ i i j I P i,j c j = i I i,j I c i i ϕ i P c = c Thus the eigenpairs of the integral operator L,µ associated with, µ can be characterized by those of the possibly infinite matrix P defined by 8 Finding the eigenpairs of P is an interesting question involving matrix analysis in linear algebra and multiplier operators in harmonic analysis ote that when P corresponding to µ = ρ X, P is diagonal with diagonal entries { i } i and its eigenvectors yield eigenfunctions {ϕ i } i From the above observation, we can see that the marginal distribution ρ X plays an essential role in the eigenfunction assumption 5 which might be difficult to check for a general marginal distribution ρ X For example, it is even unknown whether any of the Gaussian kernels on X = [0, d satisfies the eigenfunction assumption 5 when ρ X is a general Borel measure Our learning rates stated in Corollary 4 or Corollary do not require the eigenfunction assumption 5 Moreover, our restriction 0 for the number m of local processors in Corollary 4 is more general than 6 when α is close to /: with r = / corresponding to the condition f ρ H, our restriction 0 in Corollary 4 is m 4+6α with the power index tending to 7 while the restriction in 6 with k = takes the form m c α α α+ A 4 log with the power index tending to 0 as α ote that the learning rates stated in Corollary 4 are for the difference f D, f D, between the output function of the distributed learning algorithm and that of the algorithm using the whole data In the special case of α α+ m r =, we can see that E f D, f ρ D, C κ,c,r 4α+, achieved by choosing = α m α+, is smaller as m becomes larger Here the dependence of on m is crucial for achieving this convergence rate of the sample error: if we fix and, the error bounds for E f D, f D, stated in Theorem and Corollary 3 become larger as m increases On the other hand, as one expects, increasing the number m of local processors would increase 0

the approximation error for the regression problem, which can also be seen from the bound stated in Theorem 7 with the corresponding choice of λ. The result in Corollary 11 with r > 1/2 compensates and gives the optimal learning rate E[‖f̄_{D,λ} − f_ρ‖_ρ] = O(|D|^{−2rα/(4rα+1)}) by restricting m as in (13).

Besides the divide-and-conquer technique, there are many other approaches in the literature to reduce the computational complexity in handling big data. These include localized learning (Meister and Steinwart, 2016), Nyström regularization (Bach, 2013; Rudi et al., 2015), and online learning (Dekel et al., 2012). The divide-and-conquer technique has the advantage of reducing the single-machine complexity of both space and time without a significant loss of prediction power.

It is an important and interesting problem how the big data are decomposed for distributed learning. Here we only study the approach of random decomposition, and the data subsets are regarded as being drawn from the same distribution. There should be better decomposition strategies for some practical applications. For example, intuitively data points should be divided according to their spatial distribution so that the learning process would yield locally optimal predictors, which could then be synthesized to the output function. See (Thomann et al., 2016).

In this paper, we consider the regularized least squares with Mercer kernels. Our results might be extended to more general kernels. A natural class consists of bounded and positive semidefinite kernels as studied in (Steinwart and Scovel, 2012). By means of the Mercer type expansion (17), one needs some assumptions on the system {ϕ_i(x)/√λ_i}_i, the domain X, and the measure ρ_X to relax the continuity of the kernel while keeping compactness of the integral operator. How to minimize the assumptions and to maximize the scope of applications of the framework, such as the situation of an input space X without a metric (Shen et al., 2014; De Vito et al., 2013), is a valuable question to investigate.

Here we only consider distributed learning with the regularized least squares. It would be of great interest and value to develop the theory for distributed learning with other algorithms such as spectral algorithms (Bauer et al., 2007), empirical feature-based learning (Guo and Zhou, 2012; Guo et al., 2017; Shi et al., 2011), the minimum error entropy principle (Hu et al., 2015; Fan et al., 2016), and randomized Kaczmarz algorithms (Lin and Zhou, 2015).

Remark 15. After the submission of this paper, in our follow-up paper by Z. C. Guo, S. B. Lin, and D. X. Zhou entitled Learning theory of distributed spectral algorithms, published in Inverse Problems, error analysis and optimal learning rates for distributed learning with spectral algorithms were derived. In late 2016, we learned that similar analysis was carried out for classical non-distributed spectral algorithms implemented in one machine by G. Blanchard and N. Mücke in a paper entitled Optimal rates for regularization of statistical inverse learning problems (arXiv preprint, April 2016), and by L. H. Dicker, D. P. Foster, and D. Hsu in a paper entitled Kernel ridge vs. principal component regression: minimax bounds and adaptability of regularization operators (arXiv preprint, May 2016). We are indebted to one of the referees for pointing this out to us.

4. Second Order Decomposition of Operator Differences and Norms

Our error estimates are achieved by a novel second order decomposition of operator differences in our integral operator approach. We approximate the integral operator L_K by the empirical integral operator L_{K,D(x)} on H_K defined with the input data set D(x) = {x_i}_{i=1}^{|D|} = {x : (x, y) ∈ D for some y ∈ Y} as

L_{K,D(x)} f = (1/|D|) Σ_{x ∈ D(x)} f(x) K_x = (1/|D|) Σ_{x ∈ D(x)} ⟨f, K_x⟩_K K_x,   f ∈ H_K,   (19)

where the reproducing property f(x) = ⟨f, K_x⟩_K for f ∈ H_K is used. Since K is a Mercer kernel, L_{K,D_j(x)} is a finite-rank positive operator and L_{K,D_j(x)} + λI is invertible. The operator difference in our study is that of the inverses of two operators defined by

Q_{D(x)} = (L_{K,D(x)} + λI)^{-1} − (L_K + λI)^{-1}.   (20)

If we denote A = L_{K,D(x)} + λI and B = L_K + λI, then our second order decomposition for the operator difference Q_{D(x)} is a special case of the following second order decomposition for the difference A^{-1} − B^{-1}.

Lemma 16. Let A and B be invertible operators on a Banach space. Then we have

A^{-1} − B^{-1} = B^{-1}{B − A}B^{-1} + B^{-1}{B − A}A^{-1}{B − A}B^{-1}.   (21)

Proof. We can decompose the operator A^{-1} − B^{-1} as

A^{-1} − B^{-1} = B^{-1}{B − A}A^{-1}.   (22)

This is the first order decomposition. Write the last term A^{-1} as B^{-1} + (A^{-1} − B^{-1}) and apply another first order decomposition similar to (22), A^{-1} − B^{-1} = A^{-1}{B − A}B^{-1}. It follows from (22) that

A^{-1} − B^{-1} = B^{-1}{B − A}{ B^{-1} + A^{-1}{B − A}B^{-1} }.

Then the desired identity (21) is verified. The lemma is proved.

To demonstrate how the second order decomposition (21) leads to tight error bounds for the classical least squares regularization scheme (1) with the output function f_{D,λ}, we recall the following formula (see, e.g., Caponnetto and De Vito, 2007; Smale and Zhou, 2007):

f_{D,λ} − f_λ = (L_{K,D(x)} + λI)^{-1} Δ_D,   Δ_D := (1/|D|) Σ_{z ∈ D} (ξ(z) − E[ξ]),   (23)

where Δ_D is induced by the random variable ξ with values in the Hilbert space H_K defined as

ξ(z) = (y − f_λ(x)) K_x,   z = (x, y) ∈ Z.   (24)
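Since Lemma 16 is a purely algebraic identity, it can be sanity-checked numerically on any pair of invertible operators; the sketch below does so for random positive definite matrices playing the roles of A = L_{K,D(x)} + λI and B = L_K + λI. The matrices and dimensions are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 8, 0.1

# Random positive definite matrices standing in for
# A = L_{K,D(x)} + lambda*I and B = L_K + lambda*I.
MA = rng.standard_normal((d, d))
MB = rng.standard_normal((d, d))
A = MA @ MA.T / d + lam * np.eye(d)
B = MB @ MB.T / d + lam * np.eye(d)

Ainv, Binv = np.linalg.inv(A), np.linalg.inv(B)

# First order decomposition (22) and second order decomposition (21).
first_order = Binv @ (B - A) @ Ainv
second_order = Binv @ (B - A) @ Binv + Binv @ (B - A) @ Ainv @ (B - A) @ Binv

print(np.max(np.abs(Ainv - Binv - first_order)))    # close to machine precision
print(np.max(np.abs(Ainv - Binv - second_order)))   # close to machine precision
```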

13 Distributed Learning Then we can express f D, f by means of the notation Q Dx = L,Dx + I L + I as f D, f = [ Q Dx D + L + I D Due to the identity between the L ρ X norm and the H metric g ρ = L g, g L ρ X, 5 we can estimate the error f D, f ρ = L f D, f by bounding the H norm of the following expression obtained from the second order decomposition with B A = L L,Dx { L [ QDx D = L L + I { + L L + I { {L } } } L,Dx L + I {L + I D } { L + I { } } L L,Dx L + I D } { L + I { } } L L,Dx L,Dx + I Combining this expression with the operator norm bound L L + I and the notation Ξ D = L + I { } L L,Dx 6 for convenience, we find L [ L QDx D Ξ D + I D + Ξ D L,Dx + I Ξ D L + I D By decomposing L + I D as L + I L + I D and using the bounds L,Dx + I, L + I /, we know that and L [ QDx D f D, f ρ ΞD + Ξ D L + I D 7 L ΞD + Ξ D + + I D 8 Thus the classical least squares regularization scheme can be analyzed after estimating the operator norm Ξ D = L + I { } L L,Dx and the function norm L + I D In the following two lemmas, to be proved in the appendix, these norms are estimated in terms of effective dimensions by methods in the existing literature Caponnetto and De Vito, 007; Bauer et al, 007; Blanchard and rämer, 00 3

14 Lin, Guo and Zhou Lemma 7 Let D be a sample drawn independently according to ρ Then the following estimates for the operator norm L + I { } L L,Dx hold [ L a E + I { } L L,Dx κ b For any 0 < δ <, with confidence at least δ, there holds L + I { } L L,Dx B, log/δ, 9 where we denote the quantity B, = { κ κ + } 30 c For any d >, there holds { [ L E + I { } } d d L L,Dx dγd + d B,, where Γ is the Gamma function defined for d > 0 by Γd = 0 u d exp { u} du Lemma 8 Let D be a sample drawn independently according to ρ and g be a measurable bounded function on Z and ξ g be the random variable with values in H defined by ξ g z = gz x for z = x, y Z Then the following statements hold [ L a E + I / x = b For almost every x X, there holds L + I / x κ c For any 0 < δ <, with confidence at least δ, there holds { L + I / ξ g z E [ξ g g log/δ κ + } z D Remark 9 In the existing literature, the first order decomposition was used To bound the norm of L,Dx + I D by this approach, one needs to either use the brute force estimate L,Dx + I leading to suboptimal learning rates, or applying the approximation of L by L,Dx which is valid only with confidence and leads to confidencebased estimates with depending on the confidence level In our second order decomposition, we decompose the inverse operator L,Dx + I further after the first order decomposition This leads to finer estimates with an additional factor B AB in the second term of the bound and gives the refined error bound 8 4

15 Distributed Learning 5 Deriving Error Bounds for Least Squares Regularization Scheme In this section we prove our main result on error bounds for the least squares regularization scheme, which illustrates how to apply the second order decomposition for operator differences in our integral operator approach Proposition 0 Assume E[y < and moment condition for some p Then E [ f D, f ρ + 56κ κ + + { κ p σ p p } p f f ρ ρ ρ + κ Proof We apply the error bound 8 and the Schwarz inequality to get { [ E [ f D, f ρ E + Ξ } / { [ L } D + Ξ D E + I / / D The first factor in the above bound involves Ξ D = it can be estimated by applying Lemma 7 as { [ E + Ξ D + Ξ D where L + I } / { [ Ξ } / { [ + E D Ξ 4 + E D { { κ 49B 4, + } / + 3 { } L L,Dx and } / } / + 56κ4 + 57κ 3 To deal with the factor in the bound 3, we separate D = z D ξ z E[ξ as D := z D D = D + D, ξ 0 z, D := ξ ξ 0 z E[ξ are induced by another random variable ξ 0 with values in the Hilbert space H defined by z D ξ 0 z = y f ρ x x, z = x, y Z 33 Then the second factor in 3 can be separated as { [ L } E + I / / D { E [ L + I / D } / + { [ L E + I / D } / 34 5

16 Lin, Guo and Zhou Let us bound the first term of 34 Observe that L + I / D = z D y f ρx L + I / x Each term in this expression is unbiased because Y y f ρx dρy x = 0 This unbiasedness and the independence tell us that { [ L E + I / D } / = If σρ L, then σρx σρ and by Lemma 8 we have = { [ L E + I / { [ y [ E fρ x L + I / } x / { [ [ E σρx L + I / x } / 35 D } / σ ρ / If σ ρ L p ρ X with p <, we take q = p p q = for p = satisfying p + q = and apply the Hölder inequality E[ ξη E[ ξ p /p E[ η q /q to ξ = σρ, η = L + I / x to find But [ [ E σρx L + I / x [ and E L + I / x [ L + I / x { [ [ σ p ρ E L + I / x q κ/ q = by Lemma 8 So we have q } /q { [ L E + I / D } / = { σ ρ σ ρ p κ p { } κ q /q } / p q p p Combining the above two cases, we know that for either p = or p <, the first term of 34 can be bounded as { [ L E + I / D } / σρ p κ p 6 p p

17 Distributed Learning The second term of 34 can be bounded easily as { [ L E + I / D } / [ {E f ρ x f x L + I / x {E [f ρ x f x κ } / = κ f ρ f ρ } / Putting the above estimates for the two terms of 34 into 3 and applying the bound 3 involving Ξ D, we know that E [ f D, f ρ is bounded by + 56κ4 + 57κ σρ p κ p p p κ f ρ f ρ + Then our desired error bound follows The proof of the proposition is complete Proof of Theorem 7 Combining Proposition 0 with the triangle inequality f D, f ρ ρ f D, f ρ + f f ρ ρ, we know that E [ f D, f ρ ρ f f ρ ρ κ κ + + { κ p σ p p } p κ ρ + f f ρ ρ Then the desired error bound holds true, and the proof of Theorem 7 is complete Proof of Corollary 8 By the definition of effective dimension, = l l l + + Combining this with the restriction 8 with m =, we find +C 0 and +C 0 C 0 Putting these and the restriction 8 with m = into the error bound, we know that E [ f D, f ρ ρ + 56κ κ + κ + + C 0 C0 + C 0 { + + C 0 C 0 / f f ρ ρ + σρ p Then the desired bound follows The proof of Corollary 8 is complete p p } To prove Corollary 9, we need the following bounds Smale and Zhou, 007 for f f ρ ρ and f f ρ 7

18 Lin, Guo and Zhou Lemma Assume regularity condition 9 with 0 < r There holds f f ρ ρ r g ρ ρ 36 Furthermore, if / r, then we have f f ρ r / g ρ ρ 37 Proof of Corollary 9 By Lemma, regularity condition 9 with 0 < r implies f f ρ ρ r g ρ ρ If C 0 α, > 0 for some constant C 0, then the choice = α α max{r,}+ yields C 0 α α+ = C 0 α max{r,}+ C 0 So 8 with m = is satisfied With this choice we also have p p C p 0 α max{r,} α max{r,}+ = C p 0 α max{r,} α max{r,}+ + p p α max{r,}+ α α max{r,}+ α α max{r,}+ p Putting these estimates into Corollary 8, we know that E [ f D, f ρ ρ = O αr α max{r,}+ + = O α min{r,max{r,}} + α max{r,}+ p α max{r,} α max{r,}+ + α p α max{r,}+ α α max{r,}+ But we find min {r, max{r, }} = r by discussing the two different cases 0 < r < and immediately The proof of Corollary 9 is complete r Then our conclusion follows 6 Proof of Error Bounds for the Distributed Learning Algorithm To analyze the error f D, f D, for distributed learning, we recall the notation Q Dx = L,Dx + I L + I for the difference of inverse operators and use the notation Q Dj x involving the data subset D j The empirical integral operator L,Dj x is defined with D replaced by the data subset D j For our error analysis for the distributed learning algorithm, we need the following error decomposition for f D, f D, 8

19 Distributed Learning Lemma Assume E[y < For > 0, we have f D, f D, = = m j= m j= [ L,Djx + I L,Dx + I j [ Q Dj x j + m j= [ Q Dj x j [ Q Dx D, 38 where and j := j := ξ z E[ξ, D := z D j ξ z E[ξ, z D ξ 0 z, j := ξ ξ 0 z E[ξ z D j z D j Proof Recall the expression 3 for f D, f When the data subset D j is used, we have So we know that f D, f = m j= We can decompose D as D = f Dj, f = ξ z E[ξ = z D L,Dj x + I j { fdj, f } = m m j= in the expression 3 for f D, f and find f D, f = m j= j= L,Dj x + I j ξ z E[ξ = m z D j j= L,Dx + I j Then the first desired expression for f D, f D, follows j By adding and subtracting the operator L + I, writing j = j + j, and noting E[ξ 0 = 0, we know that the first expression implies 38 This proves Lemma Before proving our first main result on the error f D, f D, in the H metric and L ρ metric, we state the following more general result which allows different sizes for data subsets {D j } 9

20 Lin, Guo and Zhou Theorem 3 Assume that for some constant M > 0, y M almost surely Then we have and [ E f D, f ρ D, f ρ f ρ Dj C κm m Dj j= + S, + C κ [ f E D, f D, C κm f ρ f ρ Dj + S, m Dj j= + C κ S, + S3, + S, S, + S3, + S, M M + C κ m j= + f ρ f ρ + C κ m j= + f ρ f ρ, where C κ is a constant depending only on κ, and S Dj, is the quantity given by S Dj, = + Proof Recall the expression 38 for f D, f D, in Lemma It enables us to express where the terms J, J, J 3 are given by J = m j= [ L / Q D j x L / { } f D, f D, = J + J + J 3, 39 j, J = m j= [ L / Q D j x [ j, J 3 = L / Q Dx D These three terms will be dealt with separately in the following For the first term[ J of 39, each summand with j {,, m} can be expressed as z D j y f ρx L / Q D j x x, and is unbiased because Y y f ρxdρy x = 0 The unbiasedness and the independence tell us that E [ J { [ } / E J m j= Dj [ [ E L / Q D j x j / 40 Let j {,, m} The estimate 7 derived from the second order decomposition yields [ L / Q D j x j ΞDj + Ξ D j L + I / j 4 0

21 Distributed Learning ow we apply the formula E[ξ = 0 Prob [ξ > t dt 4 to estimate the expected value of 4 By Part b of Lemma 7, for 0 < δ <, there exists a subset Z D j δ, of Z D j of measure at least δ such that Ξ Dj B Dj, log/δ, D j Z D j δ, 43 Applying Part c of Lemma 8 to gz = y f ρ x with g M and the data subset D j, we know that there exists another subset Z D j δ, of Z Dj of measure at least δ such that L + I / j M κ B, log/δ, D j Z D j Combining this with 43 and 4, we know that for D j Z D j δ, Z D j δ,, δ, [ L / Q D j x j B Dj, + B4, M κ B, log/δ6 Since the measure of the set Z D j δ, Z D j δ, is at least δ, by denoting C Dj, = 64 B Dj, + B4, M κ B,, we see that [ [ Prob L / Q D j x j > C, log/δ 6 δ For 0 < t <, the equation C Dj, log/δ 6 = t has the solution { } /6 δ t = exp t/c Dj, When δ t <, we have [ [ Prob L / Q D j x j { } /6 > t δ t = 4 exp t/c Dj, This inequality holds trivially when δ t since the probability is at most Thus we can [ apply the formula 4 to the nonnegative random variable ξ = L / Q D j x j and obtain [ [ { } E L / Q D j x /6 j = Prob [ξ > t dt 4 exp t/c Dj, dt 0 0

22 Lin, Guo and Zhou which equals 4Γ6C Dj, Therefore, m Dj E [ J 4Γ6C Dj, j= 536 m Dj κ 5κM + κ { + 8κ + } j= For the second term J of 39, we apply 7 again and obtain [ L / Q D j x ΞDj j + Ξ L D j + I / j Applying the Schwarz inequality and Lemmas 7 and 8, we get [[ / E L / Q D j x j E ΞDj + Ξ D j It follows that E [ J m j= / { [ E f ρ x f x L + I / x Dj { κ } / { } 49B 4 / Dj, + κ f ρ f ρ Dj κ + 56κ κ + κ f ρ f ρ Dj κ f ρ f ρ κ κ Dj + 56κ + } / The last term J 3 of 39 has been handled by 7 and in the proof of Proposition 0 by ignoring the summand in the bound 3, and we find from the trivial bound σ ρ 4M with p = that κ E [ J 3 κ + 56κ + κ f ρ f ρ M + Combining the above estimates for the three terms of 39, we see that the desired error bound in the L ρ X metric holds true The estimate in the H metric follows from the steps in deriving the error bound in the L ρ X metric except that in the representation 39 the operator L / in the front disappears This change gives an additional factor /, the bound for the operator L + I /, and proves the desired error bound in the H metric /

23 Distributed Learning Remark 4 A crucial step in the above error analysis for the distributed learning algorithm is to use the unbiasedness and independence to get 40 where the norm is squared in the expected value Thus we can obtain the optimal learning rates in expectation for the distributed learning algorithm but have difficulty in getting rates in probability It would be interesting to derive error bounds in probability by combining our second order decomposition technique with some analysis in the literature Caponnetto and De Vito, 007; Blanchard and rämer, 00; Steinwart and Christmann, 008; Wu and Zhou, 008 Proof of Theorem Since = m for j =,, m, the bound in Theorem 3 in the L ρ X metric can be simplified as } [ E f D, f ρ D, C κm m m +C κ +C κ f ρ f ρ m f ρ f ρ + M + { + m m + m m m } C κm m + m + { + m m + +C κ f ρ f ρ m + + m m m + C κm C m { m + m Cm } + C κ f ρ f ρ m + mc m + Then the desired error bound in the L ρ X metric with C κ = C κ follows The proof for the error bound in the H metric is similar The proof of Theorem is complete Proof of Corollary 3 As in the proof of Corollary 8, the restriction 8 implies m and +C 0 C 0 It follows that +C 0 and with C κ := +C 0 C 0 m + C 0 C 0 + C 0 C 0 +, C m C κ m Putting these bounds into Theorem and applying the restriction C 0, we know that E { } 3 f D, f ρ m m D, C κ C κ + M + m f ρ f ρ C κ C κ 3 + C 0 3 { M } m f ρ f ρ C 0 +

24 Lin, Guo and Zhou and E 3 f D, f D, C κ C κ + { C 0 M C 0 + m f } ρ f ρ Then the desired error bounds hold by taking the constant C κ = C κ C κ 3 + C0 This proves Corollary 3 Proof of Corollary 4 If C 0 α, > 0 for some constant C 0, then the choice = m choice we also have m C 0 m αmax{r,} α max{r,}+ α α max{r,}+ satisfies 8 With this Since regularity condition 9 yields f f ρ ρ g ρ ρ r by Lemma, we have by Corollary 3, E f D, f ρ D, C m αmax{r,} α max{r,}+ m α α max{r,}+ κ C0 g ρ m ρ m +M The inequality α m α max{r,}+ r m + m α α max{r,}+ α m m α max{r,}+ r + α α max{r,}+ is equivalent to α α max{r,}+ r and it can be expressed as 0 Since 0 is valid, we have E f D, f ρ D, C m α max{r,} α max{r,}+ κ C0 g ρ ρ + M m This proves the first desired convergence rate The second rate follows easily This proves Corollary 4 Proof of Corollary By Corollary 9, with the choice = 4αr+, we can immediately bound f D, f ρ ρ as E [ f D, f ρ ρ = O αr 4αr+ The assumption = O α tells us that for some constant C 0, C 0 α, > 0 α So the choice = α 4αr+ yields m m +α α C 0 = C 0 m +α 4αr+ = C 0 m α r 4αr+ 44 4

25 Distributed Learning If m satisfies 3, then 8 is valid, and by Corollary 3, E f D, f ρ D, C κ C0 α r 4αr+ r m g ρ + M C α r κ C0 g ρ + M r m 4αr+ + α r 4αr+ r = C κ C0 g ρ + M αr 4αr+ m αr + 4αr+ + Since 3 is satisfied, we have E f D, f D, ρ C κ C0 g ρ + M αr 4αr+, and thereby [ f E D, f ρ ρ = O αr 4αr+ This proves Corollary Acknowledgments We thank the anonymous referees for their constructive suggestions The work described in this paper is supported partially by the Research Grants Council of Hong ong [Project o CityU 3044 The corresponding author is Ding-Xuan Zhou Appendix This appendix provides detailed proofs of the norm estimates stated in Lemmas 7 and 8 involving the approximation of L by L,Dx To this end, we need the following probability inequality for vector-valued random variables in Pinelis, 994 Lemma 5 For a random variable ξ on Z, ρ with values in a separable Hilbert space H, satisfying ξ M < almost surely, and a random sample {z i } s i= independent drawn according to ρ, there holds with confidence δ, s [ξz i Eξ s M log/ δ E ξ + log/ δ 45 s s i= Proof of Lemma 7 We apply Lemma 5 to the random variable η defined by η x = L + I /, x x, x X 46 It takes values in HSH, the Hilbert space of Hilbert-Schmidt operators on H, with inner product A, B HS = TrB T A Here Tr denotes the trace of a trace-class linear operator The norm is given by A HS = i Ae i where {e i} is an orthonormal basis 5

26 Lin, Guo and Zhou of H The space HSH is a subspace of the space of bounded linear operators on H, denoted as LH,, with the norm relations A A HS, AB HS A HS B 47 ow we use effective dimensions to estimate norms involving η The random variable η defined by 46 has mean Eη = L + I / L and sample mean L + I / L,Dx Recall the set of normalized in H eigenfunctions {ϕ i } i of L It is an orthonormal basis of H If we regard L as an operator on L ρ X, the normalized eigenfunctions in L ρ X are { i ϕ i } i and they form an orthonormal basis of the orthogonal complement of the eigenspace associated with eigenvalue 0 By the Mercer Theorem, we have the following uniform convergent Mercer expansion x, y = i i i ϕ i x i ϕ i y = i ϕ i xϕ i y 48 Take the orthonormal basis {ϕ i } i of H By the definition of the HS norm, we have η x HS = L + I /, x x ϕ i i For a fixed i,, x x ϕ i = ϕ i x x, and x H can be expended by the orthonormal basis {ϕ l } l as x = l ϕ l, x ϕ l = l ϕ l xϕ l 49 Hence η x HS = ϕ ix i l = ϕ ix i l Combining this with 48, we see that ϕ l x L + I / ϕ l ϕ l x l + ϕ l = i ϕ i x l ϕ l x l + η x HS = x, x l ϕ l x, x X 50 l + and But E [ η x HS κ E X [ l ϕ l x = κ l + l X ϕ lx dρ X l + ϕ l x dρ X = ϕ l L = ρ X l ϕ l = l 5 l L ρ X 6

27 Distributed Learning So we have and E E [ η HS κ l l + = κ Tr L + I L = κ 5 l x Dx η x E[η HS [ L = E + I / { } L L,Dx HS Then our desired inequality in Part a follows from the first inequality of 47 From 49 and 50, we find a bound for η as κ η x HS κ l ϕ l x κ x, x κ, x X Applying Lemma 5 to the random variable η with M = κ, we know by 47 that with confidence at least δ, E[η η x E[η η x x Dx Writing the above bound by taking a factor x Dx κ log/δ + HS κ log/δ κ log/δ, we get the desired bound 9 Recall B, defined by 30 Apply the formula 4 for nonnegative random variables to ξ = L + I / { } d L L,Dx and use the bound Prob [ξ > t = Prob derived from 9 for t log d B, We find { [ξ d > t d exp E [ L + I / { L L,Dx } d log d B, + t d B, 0 } exp { t d B, The second term on the right-hand side of above equation equals db, d 0 u d exp { u} du Then the desired bound in Part c follows from 0 u d exp { u} du = Γd and the lemma is proved Proof of Lemma 8 Consider the random variable η defined by η z = L + I / x, z = x, y Z 53 } dt 7

It takes values in H_K. By (49), it satisfies

η_2(z) = (L_K + λI)^{-1/2} Σ_l ϕ_l(x) ϕ_l = Σ_l (λ_l + λ)^{-1/2} ϕ_l(x) ϕ_l,

so

E[ ‖η_2(z)‖_K² ] = E[ Σ_l ϕ_l(x)²/(λ_l + λ) ] = Σ_l λ_l/(λ_l + λ) = N(λ).

This is the statement of Part (a). We also see from (49) that for x ∈ X,

‖η_2(z)‖_K² = Σ_l ϕ_l(x)²/(λ_l + λ) ≤ (1/λ) Σ_l ϕ_l(x)² = K(x, x)/λ ≤ κ²/λ.

This verifies the statement of Part (b). For Part (c), we consider another random variable η_3 defined by

η_3(z) = (L_K + λI)^{-1/2} g(z) K_x,   z = (x, y) ∈ Z.   (54)

It takes values in H_K and satisfies η_3(z) = g(z)(L_K + λI)^{-1/2} K_x = g(z) η_2(z). So

‖η_3(z)‖_K ≤ κ ‖g‖_∞ / √λ,   z ∈ Z,

and

E[ ‖η_3‖_K² ] ≤ ‖g‖_∞² E[ Σ_l ϕ_l(x)²/(λ_l + λ) ] = ‖g‖_∞² N(λ).

Applying Lemma 25 verifies the statement in Part (c). The proof of the lemma is complete.

References

F. Bach. Sharp analysis of low-rank kernel matrix approximations. Proceedings of the 26th Annual Conference on Learning Theory, pp. 185-209, 2013.

F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. Journal of Complexity, 23:52-72, 2007.

G. Blanchard and N. Krämer. Optimal learning rates for kernel conjugate gradient regression. Advances in Neural Information Processing Systems, pp. 226-234, 2010.

G. Blanchard and N. Krämer. Convergence rates for kernel conjugate gradient for random design regression. Analysis and Applications, 14, 2016.

A. Caponnetto and E. De Vito. Optimal rates for the regularized least squares algorithm. Foundations of Computational Mathematics, 7:331-368, 2007.

A. Caponnetto and Y. Yao. Cross-validation based adaptation for regularization operators in learning theory. Analysis and Applications, 8:161-183, 2010.

D. R. Chen, Q. Wu, Y. Ying, and D. X. Zhou. Support vector machine soft margin classifiers: error analysis. Journal of Machine Learning Research, 5:1143-1175, 2004.

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165-202, 2012.

E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized least-squares algorithm in learning theory. Foundations of Computational Mathematics, 5:59-85, 2005.

E. De Vito, S. Pereverzyev, and L. Rosasco. Adaptive kernel methods using the balancing principle. Foundations of Computational Mathematics, 10:455-479, 2010.

E. De Vito, V. Umanità, and S. Villa. An extension of Mercer theorem to vector-valued measurable kernels. Applied and Computational Harmonic Analysis, 34:339-351, 2013.

D. E. Edmunds and H. Triebel. Function Spaces, Entropy Numbers, Differential Operators. Cambridge University Press, Cambridge, 1996.

T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13:1-50, 2000.

J. Fan, T. Hu, Q. Wu, and D. X. Zhou. Consistency analysis of an empirical minimum error entropy algorithm. Applied and Computational Harmonic Analysis, 41:164-189, 2016.

S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243-264, 2001.

A. Gittens and M. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. Journal of Machine Learning Research, 17:1-65, 2016.

X. Guo and D. X. Zhou. An empirical feature-based learning algorithm producing sparse approximations. Applied and Computational Harmonic Analysis, 32, 2012.

Z. C. Guo, D. H. Xiang, X. Guo, and D. X. Zhou. Thresholded spectral algorithms for sparse approximations. Analysis and Applications, 15, 2017.

L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, Berlin, 2002.

T. Hu, J. Fan, Q. Wu, and D. X. Zhou. Regularization schemes for minimum error entropy principle. Analysis and Applications, 13, 2015.

J. H. Lin, L. Rosasco, and D. X. Zhou. Iterative regularization for learning with convex loss functions. Journal of Machine Learning Research, 17(77):1-38, 2016.

J. H. Lin and D. X. Zhou. Learning theory of randomized Kaczmarz algorithm. Journal of Machine Learning Research, 16:3341-3365, 2015.

S. Mendelson and J. Neeman. Regularization in kernel learning. The Annals of Statistics, 38:526-565, 2010.

M. Meister and I. Steinwart. Optimal learning rates for localized SVMs. Journal of Machine Learning Research, 17:1-44, 2016.

I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, 22:1679-1706, 1994.

G. Raskutti, M. Wainwright, and B. Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. Journal of Machine Learning Research, 15, 2014.

A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nyström computational regularization. Advances in Neural Information Processing Systems, 2015.

B. Schölkopf, A. Smola, and K. R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.

O. Shamir and N. Srebro. Distributed stochastic optimization and learning. In the 52nd Annual Allerton Conference on Communication, Control and Computing, 2014.

W. J. Shen, H. S. Wong, Q. W. Xiao, X. Guo, and S. Smale. Introduction to the peptide binding problem of computational immunology: new results. Foundations of Computational Mathematics, 14:951-984, 2014.

L. Shi, Y. L. Feng, and D. X. Zhou. Concentration estimates for learning with l^1-regularizer and data dependent hypothesis spaces. Applied and Computational Harmonic Analysis, 31:286-302, 2011.

S. Smale and D. X. Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26:153-172, 2007.

I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.

I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory (S. Dasgupta and A. Klivans, eds.), pp. 79-93, 2009.

I. Steinwart and C. Scovel. Mercer's theorem on general domains: On the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35(3):363-417, 2012.

P. Thomann, I. Steinwart, I. Blaschzyk, and M. Meister. Spatial decompositions for large scale SVMs. arXiv preprint, 2016.


More information

Less is More: Computational Regularization by Subsampling

Less is More: Computational Regularization by Subsampling Less is More: Computational Regularization by Subsampling Lorenzo Rosasco University of Genova - Istituto Italiano di Tecnologia Massachusetts Institute of Technology lcsl.mit.edu joint work with Alessandro

More information

08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms

08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms (February 24, 2017) 08a. Operators on Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/real/notes 2016-17/08a-ops

More information

Reproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto

Reproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Lecture 5 : Projections

Lecture 5 : Projections Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

arxiv: v1 [math.st] 28 May 2016

arxiv: v1 [math.st] 28 May 2016 Kernel ridge vs. principal component regression: minimax bounds and adaptability of regularization operators Lee H. Dicker Dean P. Foster Daniel Hsu arxiv:1605.08839v1 [math.st] 8 May 016 May 31, 016 Abstract

More information

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )

More information

Optimal Rates for Spectral Algorithms with Least-Squares Regression over Hilbert Spaces

Optimal Rates for Spectral Algorithms with Least-Squares Regression over Hilbert Spaces Optimal Rates for Spectral Algorithms with Least-Squares Regression over Hilbert Spaces Junhong Lin 1, Alessandro Rudi 2,3, Lorenzo Rosasco 4,5, Volkan Cevher 1 1 Laboratory for Information and Inference

More information

Mercer s Theorem, Feature Maps, and Smoothing

Mercer s Theorem, Feature Maps, and Smoothing Mercer s Theorem, Feature Maps, and Smoothing Ha Quang Minh, Partha Niyogi, and Yuan Yao Department of Computer Science, University of Chicago 00 East 58th St, Chicago, IL 60637, USA Department of Mathematics,

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

Analysis of Spectral Kernel Design based Semi-supervised Learning

Analysis of Spectral Kernel Design based Semi-supervised Learning Analysis of Spectral Kernel Design based Semi-supervised Learning Tong Zhang IBM T. J. Watson Research Center Yorktown Heights, NY 10598 Rie Kubota Ando IBM T. J. Watson Research Center Yorktown Heights,

More information

Functional Analysis Review

Functional Analysis Review Outline 9.520: Statistical Learning Theory and Applications February 8, 2010 Outline 1 2 3 4 Vector Space Outline A vector space is a set V with binary operations +: V V V and : R V V such that for all

More information

Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data

Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data Raymond K. W. Wong Department of Statistics, Texas A&M University Xiaoke Zhang Department

More information

1 Math 241A-B Homework Problem List for F2015 and W2016

1 Math 241A-B Homework Problem List for F2015 and W2016 1 Math 241A-B Homework Problem List for F2015 W2016 1.1 Homework 1. Due Wednesday, October 7, 2015 Notation 1.1 Let U be any set, g be a positive function on U, Y be a normed space. For any f : U Y let

More information

arxiv: v2 [math.pr] 27 Oct 2015

arxiv: v2 [math.pr] 27 Oct 2015 A brief note on the Karhunen-Loève expansion Alen Alexanderian arxiv:1509.07526v2 [math.pr] 27 Oct 2015 October 28, 2015 Abstract We provide a detailed derivation of the Karhunen Loève expansion of a stochastic

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

Kernel Method: Data Analysis with Positive Definite Kernels

Kernel Method: Data Analysis with Positive Definite Kernels Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University

More information

1. Mathematical Foundations of Machine Learning

1. Mathematical Foundations of Machine Learning 1. Mathematical Foundations of Machine Learning 1.1 Basic Concepts Definition of Learning Definition [Mitchell (1997)] A computer program is said to learn from experience E with respect to some class of

More information

The following definition is fundamental.

The following definition is fundamental. 1. Some Basics from Linear Algebra With these notes, I will try and clarify certain topics that I only quickly mention in class. First and foremost, I will assume that you are familiar with many basic

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Recall that any inner product space V has an associated norm defined by

Recall that any inner product space V has an associated norm defined by Hilbert Spaces Recall that any inner product space V has an associated norm defined by v = v v. Thus an inner product space can be viewed as a special kind of normed vector space. In particular every inner

More information

EECS 598: Statistical Learning Theory, Winter 2014 Topic 11. Kernels

EECS 598: Statistical Learning Theory, Winter 2014 Topic 11. Kernels EECS 598: Statistical Learning Theory, Winter 2014 Topic 11 Kernels Lecturer: Clayton Scott Scribe: Jun Guo, Soumik Chatterjee Disclaimer: These notes have not been subjected to the usual scrutiny reserved

More information

Convergence of Eigenspaces in Kernel Principal Component Analysis

Convergence of Eigenspaces in Kernel Principal Component Analysis Convergence of Eigenspaces in Kernel Principal Component Analysis Shixin Wang Advanced machine learning April 19, 2016 Shixin Wang Convergence of Eigenspaces April 19, 2016 1 / 18 Outline 1 Motivation

More information

Spectral theory for compact operators on Banach spaces

Spectral theory for compact operators on Banach spaces 68 Chapter 9 Spectral theory for compact operators on Banach spaces Recall that a subset S of a metric space X is precompact if its closure is compact, or equivalently every sequence contains a Cauchy

More information

We describe the generalization of Hazan s algorithm for symmetric programming

We describe the generalization of Hazan s algorithm for symmetric programming ON HAZAN S ALGORITHM FOR SYMMETRIC PROGRAMMING PROBLEMS L. FAYBUSOVICH Abstract. problems We describe the generalization of Hazan s algorithm for symmetric programming Key words. Symmetric programming,

More information

Spectral Geometry of Riemann Surfaces

Spectral Geometry of Riemann Surfaces Spectral Geometry of Riemann Surfaces These are rough notes on Spectral Geometry and their application to hyperbolic riemann surfaces. They are based on Buser s text Geometry and Spectra of Compact Riemann

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract Scale-Invariance of Support Vector Machines based on the Triangular Kernel François Fleuret Hichem Sahbi IMEDIA Research Group INRIA Domaine de Voluceau 78150 Le Chesnay, France Abstract This paper focuses

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

THEOREMS, ETC., FOR MATH 515

THEOREMS, ETC., FOR MATH 515 THEOREMS, ETC., FOR MATH 515 Proposition 1 (=comment on page 17). If A is an algebra, then any finite union or finite intersection of sets in A is also in A. Proposition 2 (=Proposition 1.1). For every

More information

(1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define

(1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define Homework, Real Analysis I, Fall, 2010. (1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define ρ(f, g) = 1 0 f(x) g(x) dx. Show that

More information

Early Stopping for Computational Learning

Early Stopping for Computational Learning Early Stopping for Computational Learning Lorenzo Rosasco Universita di Genova, Massachusetts Institute of Technology Istituto Italiano di Tecnologia CBMM Sestri Levante, September, 2014 joint work with

More information

CIS 520: Machine Learning Oct 09, Kernel Methods

CIS 520: Machine Learning Oct 09, Kernel Methods CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed

More information

Are Loss Functions All the Same?

Are Loss Functions All the Same? Are Loss Functions All the Same? L. Rosasco E. De Vito A. Caponnetto M. Piana A. Verri November 11, 2003 Abstract In this paper we investigate the impact of choosing different loss functions from the viewpoint

More information

Eigenvalues and Eigenfunctions of the Laplacian

Eigenvalues and Eigenfunctions of the Laplacian The Waterloo Mathematics Review 23 Eigenvalues and Eigenfunctions of the Laplacian Mihai Nica University of Waterloo mcnica@uwaterloo.ca Abstract: The problem of determining the eigenvalues and eigenvectors

More information

Convergence Rates of Kernel Quadrature Rules

Convergence Rates of Kernel Quadrature Rules Convergence Rates of Kernel Quadrature Rules Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE NIPS workshop on probabilistic integration - Dec. 2015 Outline Introduction

More information

3. Some tools for the analysis of sequential strategies based on a Gaussian process prior

3. Some tools for the analysis of sequential strategies based on a Gaussian process prior 3. Some tools for the analysis of sequential strategies based on a Gaussian process prior E. Vazquez Computer experiments June 21-22, 2010, Paris 21 / 34 Function approximation with a Gaussian prior Aim:

More information

Lecture 6: September 19

Lecture 6: September 19 36-755: Advanced Statistical Theory I Fall 2016 Lecture 6: September 19 Lecturer: Alessandro Rinaldo Scribe: YJ Choe Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have

More information

A note on the generalization performance of kernel classifiers with margin. Theodoros Evgeniou and Massimiliano Pontil

A note on the generalization performance of kernel classifiers with margin. Theodoros Evgeniou and Massimiliano Pontil MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 68 November 999 C.B.C.L

More information

Optimal kernel methods for large scale learning

Optimal kernel methods for large scale learning Optimal kernel methods for large scale learning Alessandro Rudi INRIA - École Normale Supérieure, Paris joint work with Luigi Carratino, Lorenzo Rosasco 6 Mar 2018 École Polytechnique Learning problem

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

HILBERT SPACES AND THE RADON-NIKODYM THEOREM. where the bar in the first equation denotes complex conjugation. In either case, for any x V define

HILBERT SPACES AND THE RADON-NIKODYM THEOREM. where the bar in the first equation denotes complex conjugation. In either case, for any x V define HILBERT SPACES AND THE RADON-NIKODYM THEOREM STEVEN P. LALLEY 1. DEFINITIONS Definition 1. A real inner product space is a real vector space V together with a symmetric, bilinear, positive-definite mapping,

More information

Optimal Rates for Regularized Least Squares Regression

Optimal Rates for Regularized Least Squares Regression Optimal Rates for Regularized Least Suares Regression Ingo Steinwart, Don Hush, and Clint Scovel Modeling, Algorithms and Informatics Group, CCS-3 Los Alamos National Laboratory {ingo,dhush,jcs}@lanl.gov

More information

MATH 205C: STATIONARY PHASE LEMMA

MATH 205C: STATIONARY PHASE LEMMA MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)

More information

Learning, Regularization and Ill-Posed Inverse Problems

Learning, Regularization and Ill-Posed Inverse Problems Learning, Regularization and Ill-Posed Inverse Problems Lorenzo Rosasco DISI, Università di Genova rosasco@disi.unige.it Andrea Caponnetto DISI, Università di Genova caponnetto@disi.unige.it Ernesto De

More information

Real Variables # 10 : Hilbert Spaces II

Real Variables # 10 : Hilbert Spaces II randon ehring Real Variables # 0 : Hilbert Spaces II Exercise 20 For any sequence {f n } in H with f n = for all n, there exists f H and a subsequence {f nk } such that for all g H, one has lim (f n k,

More information

at time t, in dimension d. The index i varies in a countable set I. We call configuration the family, denoted generically by Φ: U (x i (t) x j (t))

at time t, in dimension d. The index i varies in a countable set I. We call configuration the family, denoted generically by Φ: U (x i (t) x j (t)) Notations In this chapter we investigate infinite systems of interacting particles subject to Newtonian dynamics Each particle is characterized by its position an velocity x i t, v i t R d R d at time

More information

Approximation Theoretical Questions for SVMs

Approximation Theoretical Questions for SVMs Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

Kernels for Multi task Learning

Kernels for Multi task Learning Kernels for Multi task Learning Charles A Micchelli Department of Mathematics and Statistics State University of New York, The University at Albany 1400 Washington Avenue, Albany, NY, 12222, USA Massimiliano

More information

An Even Order Symmetric B Tensor is Positive Definite

An Even Order Symmetric B Tensor is Positive Definite An Even Order Symmetric B Tensor is Positive Definite Liqun Qi, Yisheng Song arxiv:1404.0452v4 [math.sp] 14 May 2014 October 17, 2018 Abstract It is easily checkable if a given tensor is a B tensor, or

More information

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

BINARY CLASSIFICATION

BINARY CLASSIFICATION BINARY CLASSIFICATION MAXIM RAGINSY The problem of binary classification can be stated as follows. We have a random couple Z = X, Y ), where X R d is called the feature vector and Y {, } is called the

More information

A Concise Course on Stochastic Partial Differential Equations

A Concise Course on Stochastic Partial Differential Equations A Concise Course on Stochastic Partial Differential Equations Michael Röckner Reference: C. Prevot, M. Röckner: Springer LN in Math. 1905, Berlin (2007) And see the references therein for the original

More information

Inner product spaces. Layers of structure:

Inner product spaces. Layers of structure: Inner product spaces Layers of structure: vector space normed linear space inner product space The abstract definition of an inner product, which we will see very shortly, is simple (and by itself is pretty

More information

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2. APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

Back to the future: Radial Basis Function networks revisited

Back to the future: Radial Basis Function networks revisited Back to the future: Radial Basis Function networks revisited Qichao Que, Mikhail Belkin Department of Computer Science and Engineering Ohio State University Columbus, OH 4310 que, mbelkin@cse.ohio-state.edu

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

Online Learning with Samples Drawn from Non-identical Distributions

Online Learning with Samples Drawn from Non-identical Distributions Journal of Machine Learning Research 10 (009) 873-898 Submitted 1/08; Revised 7/09; Published 1/09 Online Learning with Samples Drawn from Non-identical Distributions Ting Hu Ding-uan hou Department of

More information

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

An introduction to some aspects of functional analysis

An introduction to some aspects of functional analysis An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms

More information

Supplementary Material: A Single-Pass Algorithm for Efficiently Recovering Sparse Cluster Centers of High-dimensional Data

Supplementary Material: A Single-Pass Algorithm for Efficiently Recovering Sparse Cluster Centers of High-dimensional Data Supplementary Material: A Single-Pass Algorithm for Efficiently Recovering Sparse Cluster Centers of High-dimensional Data Jinfeng Yi JINFENGY@US.IBM.COM IBM Thomas J. Watson Research Center, Yorktown

More information

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space.

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space. Chapter 1 Preliminaries The purpose of this chapter is to provide some basic background information. Linear Space Hilbert Space Basic Principles 1 2 Preliminaries Linear Space The notion of linear space

More information

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1 Chapter 2 Probability measures 1. Existence Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension to the generated σ-field Proof of Theorem 2.1. Let F 0 be

More information

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL) Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective

More information

Empirical Processes: General Weak Convergence Theory

Empirical Processes: General Weak Convergence Theory Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated

More information

SHARP BOUNDARY TRACE INEQUALITIES. 1. Introduction

SHARP BOUNDARY TRACE INEQUALITIES. 1. Introduction SHARP BOUNDARY TRACE INEQUALITIES GILES AUCHMUTY Abstract. This paper describes sharp inequalities for the trace of Sobolev functions on the boundary of a bounded region R N. The inequalities bound (semi-)norms

More information

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

The goal of this chapter is to study linear systems of ordinary differential equations: dt,..., dx ) T

The goal of this chapter is to study linear systems of ordinary differential equations: dt,..., dx ) T 1 1 Linear Systems The goal of this chapter is to study linear systems of ordinary differential equations: ẋ = Ax, x(0) = x 0, (1) where x R n, A is an n n matrix and ẋ = dx ( dt = dx1 dt,..., dx ) T n.

More information

Kernel-based Approximation. Methods using MATLAB. Gregory Fasshauer. Interdisciplinary Mathematical Sciences. Michael McCourt.

Kernel-based Approximation. Methods using MATLAB. Gregory Fasshauer. Interdisciplinary Mathematical Sciences. Michael McCourt. SINGAPORE SHANGHAI Vol TAIPEI - Interdisciplinary Mathematical Sciences 19 Kernel-based Approximation Methods using MATLAB Gregory Fasshauer Illinois Institute of Technology, USA Michael McCourt University

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information