arxiv: v4 [math.st] 9 Aug 2017


PARALLELIZING SPECTRAL ALGORITHMS FOR KERNEL LEARNING

GILLES BLANCHARD AND NICOLE MÜCKE

Abstract. We consider a distributed learning approach in supervised learning for a large class of spectral regularization methods in an RKHS framework. The data set of size $n$ is partitioned into $m = O(n^\alpha)$ disjoint subsets. On each subset, some spectral regularization method (belonging to a large class, including in particular Kernel Ridge Regression, $L^2$-boosting and spectral cut-off) is applied. The regression function $f$ is then estimated via simple averaging, leading to a substantial reduction in computation time. We show that minimax optimal rates of convergence are preserved if $m$ grows sufficiently slowly (corresponding to an upper bound for $\alpha$) as $n \to \infty$, depending on the smoothness assumptions on $f$ and the intrinsic dimensionality. In spirit, our approach is classical.

Date: August 10,

1. Introduction

Distributed learning (DL) algorithms are a standard tool for saving computation time in machine learning problems where massive datasets are involved: dividing randomly data of cardinality $n$ into $m$ equally-sized, easily manageable partitions and evaluating them in parallel roughly gains a factor $m^2$ (for time and memory) compared to the single machine approach. The final output is obtained from averaging the individual outputs.¹

¹ For the sake of simplicity, throughout this paper we assume that $n$ is divisible by $m$. This could always be achieved by disregarding some data; alternatively, it is straightforward to show that admitting one smaller block in the partition does not affect the asymptotic results of this paper. We shall not try to discuss this point in greater detail. In particular, we shall not analyze in which general framework our simple averages could be replaced by weighted averages.

Recently, DL was studied in several machine learning contexts: in point estimation [14], matrix factorization [17], smoothing spline models and testing [4], local average regression [3], in classification (kernel SVMs [13] and feature space decomposition [11]) and also in kernel (ridge) regression (KRR) [21], [16], [20].

In this paper, we study the DL approach for the statistical learning problem
\[ Y_j := f(X_j) + \varepsilon_j, \qquad j = 1, \dots, n, \tag{1.1} \]
at random i.i.d. data points $X_1, \dots, X_n$ drawn according to a probability distribution $\nu$ on $\mathcal X$, where the $\varepsilon_j$ are independent centered noise variables. The unknown regression function $f$ is real-valued and belongs to some reproducing kernel Hilbert space $\mathcal H_K$ with bounded kernel $K$. We partition the given data set $D = \{(X_1, Y_1), \dots, (X_n, Y_n)\} \subset \mathcal X \times \mathbb R$ into $m$ disjoint equal-size subsets $D_1, \dots, D_m$. On each subset $D_j$, we compute a local estimator $\hat f^{\lambda}_{D_j}$, using a spectral regularization method. The final estimator for the target function $f$ is obtained by simple averaging:
\[ \bar f^{\lambda}_D := \frac{1}{m} \sum_{j=1}^{m} \hat f^{\lambda}_{D_j} . \]

The non-distributed setting ($m = 1$) has been studied in the recent paper [2], where (weak and strong) minimax optimal rates of convergence are established; that paper is the starting point for our results in the distributed setting. Our aim is to extend these results to distributed learning and to derive minimax optimal rates. We again apply a fairly large class of spectral regularization methods, including the popular KRR, $L^2$-boosting and spectral cut-off.

As in [2], we let $T : f \mapsto \int f(x) K(x, \cdot)\, d\nu(x)$ denote the kernel integral operator associated to $K$ and the sampling measure $\nu$. Our rates of convergence are governed by a source condition assumption on $f$ of the form $\|T^{-r} f\| \le R$ for some constants $r, R > 0$, as well as by the ill-posedness of the problem, as measured by an assumed power decay of the eigenvalues of $T$ with exponent $b > 1$. We show that, for $s \in [0, \frac12]$, in the sense of $p$-th moment expectation,
\[ \big\| T^{s} (f - \bar f^{\lambda_n}_D) \big\|_{\mathcal H_K} \lesssim R \left( \frac{\sigma^2}{R^2 n} \right)^{\frac{r+s}{2r+1+1/b}} , \tag{1.2} \]
for an appropriate choice of the regularization parameter $\lambda_n$, depending on the global sample size $n$ as well as on $R$ and the noise variance $\sigma^2$ (but not on the number $m$ of subsample sets). Note that $s = 0$ corresponds to the reconstruction error (i.e. $\mathcal H_K$-norm), and $s = \frac12$ to the prediction error (i.e. $L^2(\nu)$-norm). The symbol $\lesssim$ means that the inequality holds up to a multiplicative constant that can depend on various parameters entering in the assumptions of the result, but not on $n$, $m$, $\sigma$, nor $R$. An important assumption is that the inequality $q \ge r + s$ should hold, where $q$ is the qualification of the regularization method, a quantity defined in the classical theory of inverse problems (see Section 2.3 for a precise definition).

Basic problems are the choice of the regularization parameter on the subsamples and, most importantly, the proper choice of $m$, since it is well known that choosing $m$ too large gives a suboptimal convergence rate in the limit $n \to \infty$, see e.g. [20]. Our approach
to this problem is classical. Using a bias-variance decomposition and choosing the regularization parameter according to the total sample size $n$ yields undersmoothing on each of the individual samples. The bias estimate is then straightforward. For the hard part we write the variance as a sum of independent random variables, leading to a substantial reduction of variance by averaging.

To the best of our knowledge, comparable results up to completion of this article had been restricted to KRR, corresponding to Tikhonov regularization. In [21] the authors derive minimax-optimal rates in 3 cases (finite rank kernels, sub-Gaussian decay of eigenvalues of the kernel and polynomial decay), provided $m$ satisfies a certain upper bound, depending on the rate of decay of the eigenvalues, and under an additional crucial upper bound on the eigenfunctions $\phi_j$ of the Mercer kernel (see Section 5). It is therefore of great interest to investigate if and how $m$ can be allowed to go to infinity as a function of $n$ without imposing any conditions on the eigenfunctions of the kernel. Results in this direction have been obtained in the recent paper [16], for KRR, which is a great improvement on the worst rate of [21]. The authors dub their approach a second order decomposition, which uses concentration inequalities and certain resolvent identities adapted to KRR.

After this paper had been completed, however, we learned of the Oberwolfach report [23], where the authors have reported results for general spectral regularization methods, which are similar to the results in this paper. At the time of writing, we are not aware of any published proof. It is unclear to us how the authors of [23] prove their results. They require a bounded output space and a continuous kernel (ours need only be bounded), and their estimates are only in $L^2$ sense, not in RKHS-norm. Furthermore, they do not seem to track the dependence on the noise variance and the source condition as precisely as we do. For more detail, we refer to our Discussion in Section 5.

The outline of the paper is as follows. Section 2 contains notation and the setting. Section 3 states our main result on distributed learning. Section 4 presents numerical studies, followed by a concluding discussion and a more detailed comparison of our results in Section 5. In Section 6 we prove our theorems.

2. Notation, statistical model and distributed learning algorithm

In this section, we specify the mathematical background and the statistical model for (distributed) regularized learning. We have included this section for self sufficiency and reader convenience. It essentially repeats the setting in [2] in summarized form.

2.1. Kernel-induced operators. We assume that the input space $\mathcal X$ is a standard Borel space endowed with a probability measure $\nu$; the output space is equal to $\mathbb R$. We let $K$ be a positive semidefinite kernel on $\mathcal X \times \mathcal X$ which is bounded by $\kappa^2$. The associated reproducing kernel Hilbert space will be denoted by $\mathcal H_K$. It is assumed that all functions $f \in \mathcal H_K$ are measurable and bounded in supremum norm, i.e. $\|f\|_\infty \le \kappa \|f\|_{\mathcal H_K}$ for all $f \in \mathcal H_K$. Therefore, $\mathcal H_K$ is a subset of $L^2(\mathcal X, \nu)$, with $S : \mathcal H_K \to L^2(\mathcal X, \nu)$ being the inclusion operator, satisfying $\|S\| \le \kappa$. The adjoint operator $S^* : L^2(\mathcal X, \nu) \to \mathcal H_K$ is identified as
\[ S^* g = \mathbb E_\nu[ g(X) K_X ] = \int_{\mathcal X} g(x) K_x \, \nu(dx) . \]
Setting $T_x = K_x \otimes K_x^* : \mathcal H_K \to \mathcal H_K$, the covariance operator is given by
\[ T = \mathbb E_\nu[ K_X \otimes K_X^* ] = \int_{\mathcal X} \langle \cdot, K_x \rangle_{\mathcal H_K} K_x \, \nu(dx) , \]
which can be shown to be positive, self-adjoint and trace class (and hence compact). The corresponding empirical versions of these operators are given by
\[ S_{\mathbf x} : \mathcal H_K \to \mathbb R^n, \qquad (S_{\mathbf x} f)_j = \langle f, K_{x_j} \rangle_{\mathcal H_K} , \]
\[ S_{\mathbf x}^* : \mathbb R^n \to \mathcal H_K, \qquad S_{\mathbf x}^* \mathbf y = \frac{1}{n} \sum_{j=1}^{n} y_j K_{x_j} , \]
\[ T_{\mathbf x} := S_{\mathbf x}^* S_{\mathbf x} : \mathcal H_K \to \mathcal H_K, \qquad T_{\mathbf x} = \frac{1}{n} \sum_{j=1}^{n} K_{x_j} \otimes K_{x_j}^* . \]

We introduce the shortcut notation $\bar T = \kappa^{-2} T$ and $\bar T_{\mathbf x} := \kappa^{-2} T_{\mathbf x}$, ensuring $\|\bar T\| \le 1$ and $\|\bar T_{\mathbf x}\| \le 1$. Similarly, $\bar S = \kappa^{-1} S$ and $\bar S_{\mathbf x} := \kappa^{-2} S_{\mathbf x}$, ensuring $\|\bar S\| \le 1$ and $\|\bar S_{\mathbf x}\| \le 1$. The numbers $\mu_j$ are the positive eigenvalues of $\bar T$, satisfying $0 < \mu_{j+1} \le \mu_j$ for all $j > 0$ and $\mu_j \searrow 0$.

2.2. Noise assumption and prior classes. In our setting of kernel learning, the sampling is assumed to be random i.i.d., where each observation point $(X_i, Y_i)$ follows the model $Y = f(X) + \varepsilon$. For $(X, Y)$ having distribution $\rho$, we assume: the conditional expectation w.r.t. $\rho$ of $Y$ given $X$ exists and it holds for $\nu$-almost all $x \in \mathcal X$:
\[ \mathbb E_\rho[ Y \mid X = x ] = f_\rho(x), \quad \text{for some } f_\rho \in \mathcal H_K . \tag{2.1} \]
Furthermore, we will make the following assumption on the observation noise distribution: there exists $\sigma > 0$ such that
\[ \mathbb E\big[\, |Y - f_\rho(X)|^2 \mid X \,\big] \le \sigma^2 \qquad \nu\text{-a.s.} \tag{2.2} \]

To derive nontrivial rates of convergence, we concentrate our attention on specific subsets (also called models) of the class of probability measures. If $\mathcal P$ denotes the set of all probability distributions on $\mathcal X$, we define classes of sampling distributions by introducing decay conditions on the eigenvalues $\mu_j$ of the operator $\bar T_\nu$. For $b > 1$ and $\beta > 0$, we set
\[ \mathcal P^{<}(b, \beta) := \{ \nu \in \mathcal P : \ \mu_j \le \beta / j^{b} \ \ \forall j \ge 1 \} . \]
For a subset $\Omega \subseteq \mathcal H_K$, we let $\mathcal K(\Omega)$ be the set of regular conditional probability distributions $\rho(\cdot \mid \cdot)$ on $\mathcal B(\mathbb R) \times \mathcal X$ such that (2.1) and (2.2) hold for some $f_\rho \in \Omega$. We will focus on a Hölder-type source condition, i.e. given $r > 0$, $R > 0$ and $\nu \in \mathcal P$, we define
\[ \Omega_\nu(r, R) := \{ f \in \mathcal H_K : \ f = \bar T_\nu^{\,r} h, \ \|h\|_{\mathcal H_K} \le R \} . \tag{2.3} \]
Then the class of models which we will consider will be defined as
\[ \mathcal M(r, R, \mathcal P') := \{ \rho(dx, dy) = \rho(dy \mid x)\, \nu(dx) : \ \rho(\cdot \mid \cdot) \in \mathcal K(\Omega_\nu(r, R)), \ \nu \in \mathcal P' \} , \tag{2.4} \]
with $\mathcal P' = \mathcal P^{<}(b, \beta)$. As a consequence, the class of models depends not only on the smoothness properties of the solution (reflected in the parameters $R > 0$, $r > 0$), but also essentially on the decay of the eigenvalues of $\bar T_\nu$.

2.3. Regularization. In this subsection, we introduce the class of linear regularization methods based on spectral theory for self-adjoint linear operators. These are standard methods for finding stable solutions for ill-posed inverse problems. Originally, these methods were developed in the deterministic context, see [8]. Later on, they have been applied to probabilistic problems in machine learning, see [10] or [2].

Definition 2.1 (Regularization function). Let $g : (0, 1] \times [0, 1] \to \mathbb R$ be a function and write $g_\lambda = g(\lambda, \cdot)$. The family $\{g_\lambda\}_\lambda$ is called regularization function, if the following conditions hold:

(i) There exists a constant $D < \infty$ such that for any $0 < \lambda \le 1$
\[ \sup_{0 < t \le 1} | t\, g_\lambda(t) | \le D . \]

(ii) There exists a constant $E < \infty$ such that for any $0 < \lambda \le 1$
\[ \sup_{0 < t \le 1} | g_\lambda(t) | \le \frac{E}{\lambda} . \tag{2.5} \]

(iii) Defining the residual $r_\lambda(t) := 1 - g_\lambda(t)\, t$, there exists a constant $\gamma_0 < \infty$ such that for any $0 < \lambda \le 1$
\[ \sup_{0 < t \le 1} | r_\lambda(t) | \le \gamma_0 . \]

It has been shown in e.g. [6], [2] that attainable learning rates are essentially linked with the qualification of the regularization $\{g_\lambda\}_\lambda$, being the maximal $q$ such that for any $0 < \lambda \le 1$
\[ \sup_{0 < t \le 1} | r_\lambda(t) |\, t^{q} \le \gamma_q \lambda^{q} \tag{2.6} \]
for some constant $\gamma_q > 0$. The most popular examples include:

Example 2.2 (Tikhonov Regularization, Kernel Ridge Regression). The choice $g_\lambda(t) = \frac{1}{\lambda + t}$ corresponds to Tikhonov regularization. In this case we have $D = E = \gamma_0 = 1$. The qualification of this method is $q = 1$ with $\gamma_q = 1$.

Example 2.3 (Landweber Iteration, gradient descent). The Landweber Iteration (gradient descent algorithm with constant stepsize) is defined by
\[ g_k(t) = \sum_{j=0}^{k-1} (1 - t)^{j} \qquad \text{with } k = 1/\lambda \in \mathbb N . \]
We have $D = E = \gamma_0 = 1$. The qualification $q$ of this algorithm can be arbitrary, with $\gamma_q = 1$ if $0 < q \le 1$ and $\gamma_q = q^{q}$ if $q > 1$.

Example 2.4 ($\nu$-method). The $\nu$-method belongs to the class of so-called semi-iterative regularization methods. This method has finite qualification $q = \nu$ with $\gamma_q$ a positive constant. Moreover, $D = 1$ and $E = 2$. The filter is given by $g_k(t) = p_k(t)$, a polynomial of degree $k - 1$, with regularization parameter $\lambda \sim k^{-2}$, which makes this method much faster than e.g. gradient descent.

2.4. Distributed Learning Algorithm. We let $D = \{(x_j, y_j)\}_{j=1}^{n} \subset \mathcal X \times \mathcal Y$ be the dataset, which we partition into $m$ disjoint subsets $D_1, \dots, D_m$, each having size $n/m$. Denote the $j$th data vector, $1 \le j \le m$, by $(\mathbf x_j, \mathbf y_j) \in (\mathcal X \times \mathbb R)^{n/m}$. On each subset we compute a local estimator for a suitable a-priori parameter choice $\lambda = \lambda_n$ according to
\[ f^{\lambda_n}_{D_j} := g_{\lambda_n}(\kappa^{-2} T_{\mathbf x_j})\, \kappa^{-2} S^*_{\mathbf x_j} \mathbf y_j = g_{\lambda_n}(\bar T_{\mathbf x_j})\, \bar S^*_{\mathbf x_j} \mathbf y_j . \tag{2.7} \]
By $f^{\lambda}_D$ we will denote the estimator using the whole sample ($m = 1$). The final estimator is given by simple averaging of the local ones:
\[ \bar f^{\lambda}_D := \frac{1}{m} \sum_{j=1}^{m} f^{\lambda}_{D_j} . \tag{2.8} \]
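The estimator (2.7) and its average (2.8) can be sketched in coefficient form. The following is only an illustration under stated assumptions, not the authors' implementation: it assumes the kernel $K(x,t) = \min(x,t) - xt$ on $[0,1]$ used in the numerical section (so $\kappa^2 = 1/4$), represents each local estimator as $\sum_i \alpha_i K_{x_i}$ with $\alpha = (n\kappa^2)^{-1} g_\lambda(\mathbf K/(n\kappa^2))\, \mathbf y$ for the kernel matrix $\mathbf K$, and splits the (i.i.d.) sample into contiguous blocks. All function names are hypothetical.

```python
import numpy as np

def kernel(x, t):
    # Assumed kernel of H_0^1[0, 1]: K(x, t) = min(x, t) - x * t, so kappa^2 = 1/4.
    return np.minimum.outer(x, t) - np.outer(x, t)

def tikhonov(lam):
    # Filter of Example 2.2: g_lambda(t) = 1 / (lambda + t).
    return lambda t: 1.0 / (t + lam)

def landweber(k):
    # Filter of Example 2.3: g_k(t) = sum_{j<k} (1 - t)^j = (1 - (1 - t)^k) / t.
    def g(t):
        t = np.asarray(t, dtype=float)
        safe = np.where(t > 1e-12, t, 1.0)  # avoid division by zero
        return np.where(t > 1e-12, (1.0 - (1.0 - safe) ** k) / safe, float(k))
    return g

def local_coefficients(x, y, g, kappa2=0.25):
    # Coefficients alpha of the local estimator sum_i alpha_i K(x_i, .), cf. (2.7).
    n = len(x)
    mu, U = np.linalg.eigh(kernel(x, x) / (n * kappa2))
    mu = np.clip(mu, 0.0, None)  # remove tiny negative round-off eigenvalues
    return U @ (g(mu) * (U.T @ y)) / (n * kappa2)

def distributed_predict(x, y, x_new, g, m, kappa2=0.25):
    # Average of the m local estimators, cf. (2.8); len(x) must be divisible by m.
    preds = []
    for xj, yj in zip(np.split(x, m), np.split(y, m)):
        alpha = local_coefficients(xj, yj, g, kappa2)
        preds.append(kernel(x_new, xj) @ alpha)
    return np.mean(preds, axis=0)
```

With the Tikhonov filter and $m = 1$, the coefficients reduce to the familiar KRR solution $(\mathbf K + n\kappa^2\lambda I)^{-1}\mathbf y$, which gives a quick sanity check of the normalization by $\kappa^2$.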

3. Main Results

This section presents our main results. Theorem 3.1 and Theorem 3.2 contain separate estimates on the approximation error and the sample error and lead to Corollary 3.3, which gives an upper bound for the error $\| \bar T^{s} (f_\rho - \bar f^{\lambda}_D) \|_{\mathcal H_K}$ and presents an upper rate of convergence for the sequence of distributed learning algorithms. For the sake of the reader we recall Theorem 3.4, which was already shown in [2], presenting the minimax optimal rate for the single machine problem. This yields an estimate on the difference between the single machine and the distributed learning algorithm in Corollary 3.5.

We want to track the precise behavior of these rates not only for what concerns the exponent in the number of examples $n$, but also in terms of their scaling (multiplicative constant) as a function of some important parameters (namely the noise variance $\sigma^2$ and the complexity radius $R$ in the source condition). For this reason, we introduce a notion of a family of rates over a family of models. More precisely, we consider an indexed family $(\mathcal M_\theta)_{\theta \in \Theta}$, where for all $\theta \in \Theta$, $\mathcal M_\theta$ is a class of Borel probability distributions on $\mathcal X \times \mathbb R$ satisfying the basic general assumptions (2.1) and (2.2). We consider rates of convergence in the sense of the $p$-th moments of the estimation error, where $1 \le p < \infty$ is a fixed real number.

As already mentioned in the Introduction, our proofs are based on a classical bias-variance decomposition as follows. Introducing
\[ \tilde f^{\lambda}_D = \frac{1}{m} \sum_{j=1}^{m} g_\lambda(\bar T_{\mathbf x_j})\, \bar T_{\mathbf x_j} f_\rho , \tag{3.1} \]
we write
\[ \bar T^{s} (f_\rho - \bar f^{\lambda}_D) = \bar T^{s} (f_\rho - \tilde f^{\lambda}_D) + \bar T^{s} (\tilde f^{\lambda}_D - \bar f^{\lambda}_D) = \underbrace{\frac{1}{m} \sum_{j=1}^{m} \bar T^{s} r_\lambda(\bar T_{\mathbf x_j}) f_\rho}_{\text{Approximation Error}} + \underbrace{\frac{1}{m} \sum_{j=1}^{m} \bar T^{s} g_\lambda(\bar T_{\mathbf x_j}) \big( \bar T_{\mathbf x_j} f_\rho - \bar S^*_{\mathbf x_j} \mathbf y_j \big)}_{\text{Sample Error}} . \tag{3.2} \]

In all the forthcoming results in this section, we let $s \in [0, \frac12]$, $p \ge 1$ and consider the model $\mathcal M_{\sigma, M, R} := \mathcal M(r, R, \mathcal P^{<}(b, \beta))$ where $r > 0$, $b > 1$ and $\beta > 0$ are fixed, and $\theta = (R, M, \sigma)$ varies in $\Theta = \mathbb R^3_+$. Given a sample $D \subset (\mathcal X \times \mathbb R)^n$ of size $n$, define $\bar f^{\lambda_n}_D$, $f^{\lambda_n}_D$ as in Section 2.4 and $\tilde f^{\lambda_n}_D$ as in (3.1), using a regularization function of qualification $q \ge r + s$, with parameter
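sequence
\[ \lambda_n := \lambda_{n,(\sigma,R)} := \min\left( \left( \frac{\sigma^2}{R^2 n} \right)^{\frac{b}{2br+b+1}} , \ 1 \right) , \tag{3.3} \]
independent of $M$. Define the sequence
\[ a_n := a_{n,(\sigma,R)} := R \left( \frac{\sigma^2}{R^2 n} \right)^{\frac{b(r+s)}{2br+b+1}} . \tag{3.4} \]
We recall from the introduction that we shall always assume that $n$ is a multiple of $m$. With these preparations, our main results are:

Theorem 3.1 (Approximation Error). If the number $m$ of subsample sets satisfies
\[ m \le n^{\alpha}, \qquad \alpha < \frac{2b \min\{r, 1\}}{2br + b + 1} , \tag{3.5} \]
then
\[ \sup_{(\sigma,M,R) \in \mathbb R^3_+} \ \limsup_{n \to \infty} \ \sup_{\rho \in \mathcal M_{\sigma,M,R}} \frac{ \mathbb E_{\rho^{\otimes n}} \big[ \| \bar T^{s} (f_\rho - \tilde f^{\lambda_n}_D) \|^{p}_{\mathcal H_K} \big]^{1/p} }{a_n} < \infty . \]

Theorem 3.2 (Sample Error). If the number $m$ of subsample sets satisfies
\[ m \le n^{\alpha}, \qquad \alpha < \frac{2br}{2br + b + 1} , \tag{3.6} \]
then
\[ \sup_{(\sigma,M,R) \in \mathbb R^3_+} \ \limsup_{n \to \infty} \ \sup_{\rho \in \mathcal M_{\sigma,M,R}} \frac{ \mathbb E_{\rho^{\otimes n}} \big[ \| \bar T^{s} (\tilde f^{\lambda_n}_D - \bar f^{\lambda_n}_D) \|^{p}_{\mathcal H_K} \big]^{1/p} }{a_n} < \infty . \]

And, as a consequence (by (3.2) and applying the triangle inequality):

Corollary 3.3. If the number $m$ of subsample sets satisfies
\[ m \le n^{\alpha}, \qquad \alpha < \frac{\min\{2br, \, b + 1\}}{2br + b + 1} , \tag{3.7} \]
then the sequence (3.4) is an upper rate of convergence in $L^p$, for the interpolation norm of parameter $s$, for the sequence of estimated solutions $(\bar f^{\lambda_{n,(\sigma,R)}}_D)$ over the family of models $(\mathcal M_{\sigma,M,R})_{(\sigma,M,R) \in \mathbb R^3_+}$, i.e.
\[ \sup_{(\sigma,M,R) \in \mathbb R^3_+} \ \limsup_{n \to \infty} \ \sup_{\rho \in \mathcal M_{\sigma,M,R}} \frac{ \mathbb E_{\rho^{\otimes n}} \big[ \| \bar T^{s} (f_\rho - \bar f^{\lambda_n}_D) \|^{p}_{\mathcal H_K} \big]^{1/p} }{a_n} < \infty . \]

Theorem 3.4 (Blanchard, Mücke (2017) [2]). The sequence (3.4) is an upper rate of convergence in $L^p$ for all $p \ge 1$, for the interpolation norm of parameter $s$, for the sequence of estimated solutions $(f^{\lambda_{n,(\sigma,R)}}_D)$, independent of $M$, over the family of models $(\mathcal M_{\sigma,M,R})_{(\sigma,M,R) \in \mathbb R^3_+}$, i.e.
\[ \sup_{(\sigma,M,R) \in \mathbb R^3_+} \ \limsup_{n \to \infty} \ \sup_{\rho \in \mathcal M_{\sigma,M,R}} \frac{ \mathbb E_{\rho^{\otimes n}} \big[ \| \bar T^{s} (f_\rho - f^{\lambda_n}_D) \|^{p}_{\mathcal H_K} \big]^{1/p} }{a_n} < \infty . \]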
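To get a feeling for the quantities in (3.3), (3.4) and (3.7), they can be evaluated numerically. The helper names below are hypothetical, and the sketch only transcribes the formulas; the multiplicative constants hidden in the theorems are not captured.

```python
import math

def lambda_n(n, sigma, R, r, b):
    # Regularization parameter (3.3).
    return min((sigma**2 / (R**2 * n)) ** (b / (2 * b * r + b + 1)), 1.0)

def rate_a_n(n, sigma, R, r, b, s):
    # Rate sequence (3.4).
    return R * (sigma**2 / (R**2 * n)) ** (b * (r + s) / (2 * b * r + b + 1))

def alpha_bound(r, b):
    # Strict upper bound on alpha in condition (3.7) of Corollary 3.3.
    return min(2 * b * r, b + 1) / (2 * b * r + b + 1)

# Setting of the low smoothness experiment: b = 2, r = 0.75, s = 0.
print(alpha_bound(0.75, 2))                  # 0.5
print(rate_a_n(16, 1.0, 1.0, 0.75, 2, 0.0))  # 16 ** (-0.25), i.e. 0.5 up to rounding
```

Whenever the minimum in (3.3) is not attained at 1, the two sequences are related by $a_n = R\,\lambda_n^{\,r+s}$.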

Combining Corollary 3.3 with Theorem 3.4 by applying the triangle inequality immediately yields:

Corollary 3.5. If the number $m$ of subsample sets satisfies
\[ m \le n^{\alpha}, \qquad \alpha < \frac{2b \min\{r, 1\}}{2br + b + 1} , \tag{3.8} \]
then
\[ \sup_{(\sigma,M,R) \in \mathbb R^3_+} \ \limsup_{n \to \infty} \ \sup_{\rho \in \mathcal M_{\sigma,M,R}} \frac{ \mathbb E_{\rho^{\otimes n}} \big[ \| \bar T^{s} (f^{\lambda_n}_D - \bar f^{\lambda_n}_D) \|^{p}_{\mathcal H_K} \big]^{1/p} }{a_n} < \infty . \]

4. Numerical Studies

In this section we numerically study the error in $\mathcal H_K$-norm, corresponding to $s = 0$ in Corollary 3.3 (in expectation with $p = 2$), both in the single machine and distributed learning setting. Our main interest is to study the upper bound for our theoretical exponent $\alpha$, parametrizing the size of subsamples in terms of the total sample size, $m = n^{\alpha}$, in different smoothness regimes. In addition we shall demonstrate in which way parallelization serves as a form of regularization.

More specifically, we let $\mathcal H_K = H^1_0[0, 1]$ with kernel $K(x, t) = x \wedge t - xt$. For all experiments in this section, we simulate data from the regression model
\[ Y_i = f_\rho(X_i) + \varepsilon_i, \qquad i = 1, \dots, n, \]
where the input variables $X_i \sim \mathrm{Unif}[0, 1]$ are uniformly distributed and the noise variables $\varepsilon_i \sim N(0, \sigma^2)$ are normally distributed with standard deviation $\sigma = 0.005$. We choose the target function $f_\rho$ according to two different cases, namely $r < 1$ (low smoothness) and $r = \infty$ (high smoothness). To accurately determine the degree of smoothness $r > 0$, we apply Proposition 4.1 below by explicitly calculating the Fourier coefficients $(\langle f_\rho, e_j \rangle_{\mathcal H_K})_{j \in \mathbb N}$, where $e_j(x) = \frac{\sqrt 2}{\pi j} \sin(\pi j x)$, for $j \in \mathbb N$, forms an ONB of $\mathcal H_K$. Recall that the rate of eigenvalue decay is explicitly given by $b = 2$, meaning that we have full control over all parameters in (3.8). From [8] we need:

Proposition 4.1. Let $\mathcal H_1$, $\mathcal H_2$ be separable Hilbert spaces and $S : \mathcal H_1 \to \mathcal H_2$ be a compact linear operator with singular system $\{\sigma_j, \varphi_j, \psi_j\}$.² Denoting by $S^\dagger$ the generalized inverse³ of $S$, one has for any $r > 0$ and $g \in \mathcal H_2$: $g$ is in the domain of $S^\dagger$ and $S^\dagger g \in \mathrm{Im}((S^* S)^{r})$ if and only if
\[ \sum_{j=0}^{\infty} \frac{ | \langle g, \psi_j \rangle_{\mathcal H_2} |^2 }{ \sigma_j^{2 + 4r} } < \infty . \]

² i.e., the $\varphi_j$ are the normalized eigenfunctions of $S^* S$ with eigenvalues $\sigma_j^2$ and $\psi_j = S\varphi_j / \|S\varphi_j\|$; thus $S = \sum_j \sigma_j \langle \varphi_j, \cdot \rangle \psi_j$.

³ the unique unbounded linear operator with domain $\mathrm{Im}(S) \oplus (\mathrm{Im}(S))^{\perp}$ in $\mathcal H_2$ vanishing on $(\mathrm{Im}(S))^{\perp}$ and satisfying $S S^\dagger = 1$ on $\mathrm{Im}(S)$, with range orthogonal to the null space $N(S)$.

In our case, $\mathcal H_1 = \mathcal H_K$ is as above, $\mathcal H_2$ is $L^2([0, 1])$ with Lebesgue measure and $S : H^1_0[0, 1] \to L^2([0, 1])$ is the inclusion. Since $H^1_0[0, 1]$ is dense in $L^2([0, 1])$, we know that $(\mathrm{Im}(S))^{\perp}$ is trivial, giving $S S^\dagger = 1$ on $\mathrm{Im}(S)$. Furthermore, $\varphi_j = e_j$ is a normalized eigenbasis of $T = S^* S$ with eigenvalues $\sigma_j^2 = (\pi j)^{-2}$. With $\psi_j = S\varphi_j / \|S\varphi_j\|_{L^2}$ we obtain for $f \in H^1_0[0, 1]$
\[ \langle S f, \psi_j \rangle_{L^2} = \Big\langle S f, \frac{S e_j}{\|S e_j\|_{L^2}} \Big\rangle_{L^2} = \Big\langle f, \frac{S^* S e_j}{\|S e_j\|_{L^2}} \Big\rangle_{H^1_0} = \sigma_j \langle f, e_j \rangle_{H^1_0} . \]
Thus, applying Proposition 4.1 gives:

Corollary 4.2. For $S$ and $T = S^* S$ as above we have for any $r > 0$: $f \in \mathrm{Im}(T^{r})$ if and only if
\[ \sum_{j=1}^{\infty} j^{4r} \, | \langle f, e_j \rangle_{H^1_0} |^2 < \infty . \]

Thus, as expected, abstract smoothness measured by the parameter $r$ in the source condition corresponds in this special case to decay of the classical Fourier coefficients which, by the classical theory of Fourier series, measures smoothness of the periodic continuation of $f \in L^2([0, 1])$ to the real line.

4.0.1. Low smoothness. We choose $f_\rho(x) = \frac12 x(1 - x)$, which clearly belongs to $\mathcal H_K$. A straightforward calculation gives the Fourier coefficient $\langle f_\rho, e_j \rangle_{\mathcal H_K} = 2\sqrt 2\, (\pi j)^{-2}$ for $j$ odd (vanishing for $j$ even). Thus, by the above criterion, $f_\rho$ satisfies the source condition $f_\rho \in \mathrm{Ran}(T^{r})$ precisely for $0 < r < 0.75$. According to Theorem 3.4, the worst case rate in the single machine problem is given by $n^{-\gamma}$, with $\gamma = 0.25$. Regularization is done using the $\nu$-method (see Example 2.4), with qualification $q = \nu = 1$. Recall that the stopping index $k_{\mathrm{stop}}$ serves as the regularization parameter $\lambda$, where $\lambda \sim k_{\mathrm{stop}}^{-2}$. We consider sample sizes from 500 to 9000.

In the model selection step, we estimate the performance of different models and choose the oracle stopping time $\hat k_{\mathrm{oracle}}$ by minimizing the reconstruction error
\[ \hat k_{\mathrm{oracle}} = \arg\min_{k} \left( \frac{1}{M} \sum_{j=1}^{M} \big\| f_\rho - \hat f^{k}_{j} \big\|^2_{\mathcal H_K} \right)^{1/2} \]
over $M = 30$ runs.

In the model assessment step, we partition the dataset into $m \sim n^{\alpha}$ subsamples, for any $\alpha \in \{0, 0.05, 0.1, \dots, 0.85\}$. On each subsample we regularize using the oracle stopping time $\hat k_{\mathrm{oracle}}$ (determined by using the whole sample). Corresponding to Corollary 3.3, the accuracy should be comparable to the one using the whole sample as long as $\alpha < 0.5$. In Figure 1 (left panel) we plot the reconstruction error $\| \bar f^{\hat k} - f_\rho \|_{\mathcal H_K}$ versus the ratio $\alpha = \log(m) / \log(n)$ for different sample sizes. We execute each simulation $M = 30$ times. The plot supports our theoretical finding. The right panel shows the reconstruction error versus the total number of samples using different partitions of the data. The black curve ($\alpha = 0$) corresponds to the baseline error ($m = 1$, no partition of data). Error curves below a threshold $\alpha < 0.6$ are roughly comparable, whereas curves above this threshold show a gap in performance.
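The Fourier coefficients of the low smoothness target can be checked numerically. The sketch below assumes the $H^1_0[0,1]$ inner product $\langle f, g \rangle = \int_0^1 f'(x) g'(x)\,dx$ and the sine basis $e_j(x) = \frac{\sqrt 2}{\pi j}\sin(\pi j x)$, and uses a plain midpoint rule; the function name is hypothetical.

```python
import numpy as np

def h01_coefficient(j, num=20000):
    # Midpoint rule for <f_rho, e_j> = int_0^1 f_rho'(x) e_j'(x) dx with
    # f_rho(x) = x (1 - x) / 2   =>  f_rho'(x) = 1/2 - x,
    # e_j(x) = sqrt(2) / (pi j) * sin(pi j x)  =>  e_j'(x) = sqrt(2) cos(pi j x).
    x = (np.arange(num) + 0.5) / num
    df = 0.5 - x
    de = np.sqrt(2.0) * np.cos(np.pi * j * x)
    return float(np.mean(df * de))

for j in (1, 2, 3):
    closed_form = 2.0 * np.sqrt(2.0) / (np.pi * j) ** 2 if j % 2 == 1 else 0.0
    print(j, h01_coefficient(j), closed_form)
```

Since the coefficients decay like $j^{-2}$, the series $\sum_j j^{4r} j^{-4}$ converges exactly when $4r - 4 < -1$, i.e. for $r < 0.75$, in line with the criterion of Corollary 4.2.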

In another experiment we study the performance in case of (very) different regularization: only partitioning the data (no regularization), underregularization (higher stopping index) and overregularization (lower stopping index). The outcome of this experiment amplifies the regularization effect of parallelizing. Figure 2 shows the main point: overregularization is always hopeless, underregularization is better. In the extreme case of no or almost no regularization there is a sharp minimum in the reconstruction error which is only slightly larger than the minimax optimal value for the oracle regularization parameter and which is achieved at an attractively large degree of parallelization. Qualitatively, this agrees very well with the intuitive notion that parallelizing serves as regularization.

We emphasize that the numerical results seem to indicate that parallelization is possible to a slightly larger degree than indicated by our theoretical estimate. A similar result was reported in the paper [21], which also treats the low smoothness case.

4.0.2. High smoothness. We choose $f_\rho(x) = \frac{1}{2\pi} \sin(2\pi x)$, which corresponds to just one non-vanishing Fourier coefficient and, by our criterion in Corollary 4.2, has $r = \infty$. In view of our main Corollary 3.3 this requires a regularization method with higher qualification; we take the Gradient Descent method (see Example 2.3).

The appearance of the term $2b \min\{1, r\}$ in our theoretical result, Corollary 3.3, gives a predicted value $\alpha = 0$ (and would imply that parallelization is strictly forbidden for infinite smoothness). More specifically, the left panel in Figure 3 shows the absence of any plateau for the reconstruction error as a function of $\alpha$. This corresponds to the right panel showing that no group of values of $\alpha$ performs roughly equivalently, meaning that we do not have any optimality guarantees.

Plotting different values of regularization in Figure 4, we again identify overregularization as hopeless, while severe underregularization exhibits a sharp minimum in the reconstruction error. But its value at roughly 0.25 is much less attractive compared to the case of low smoothness, where the error is an order of magnitude less.
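The reconstruction error in $\mathcal H_K$-norm can be computed exactly in this setting, which the following sketch illustrates for the high smoothness target and Landweber iteration. It is not the code behind the figures. It assumes the kernel $K(x,t) = \min(x,t) - xt$ with $\kappa^2 = 1/4$, contiguous blocks of an i.i.d. sample, and uses the reproducing property: for $\bar f = \sum_i a_i K_{x_i}$ one has $\|\bar f - f_\rho\|^2_{\mathcal H_K} = a^\top \mathbf K a - 2 a^\top f_\rho(\mathbf x) + \|f_\rho\|^2_{\mathcal H_K}$, where $\|f_\rho\|^2_{H^1_0} = \int_0^1 \cos^2(2\pi x)\,dx = 1/2$. All names are hypothetical.

```python
import numpy as np

KAPPA2 = 0.25     # sup_x K(x, x) for the assumed kernel
F_NORM_SQ = 0.5   # squared H_0^1 norm of f_rho(x) = sin(2 pi x) / (2 pi)

def kernel(x, t):
    return np.minimum.outer(x, t) - np.outer(x, t)

def f_rho(x):
    return np.sin(2.0 * np.pi * x) / (2.0 * np.pi)

def landweber_coefficients(x, y, k):
    # k steps of f <- f + S*_x y - T_x f (normalized by kappa^2), in coefficient form:
    # alpha <- alpha + (y - K alpha) / (n kappa^2).
    n = len(x)
    K = kernel(x, x)
    alpha = np.zeros(n)
    for _ in range(k):
        alpha = alpha + (y - K @ alpha) / (n * KAPPA2)
    return alpha

def distributed_rkhs_error(x, y, m, k):
    # Coefficients of the averaged estimator over all n points, then exact RKHS error.
    blocks = zip(np.split(x, m), np.split(y, m))
    a = np.concatenate([landweber_coefficients(xj, yj, k) for xj, yj in blocks]) / m
    err_sq = a @ kernel(x, x) @ a - 2.0 * (a @ f_rho(x)) + F_NORM_SQ
    return float(np.sqrt(max(err_sq, 0.0)))

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(size=n)
y = f_rho(x) + 0.005 * rng.standard_normal(n)
for m in (1, 5, 25):
    print(m, distributed_rkhs_error(x, y, m=m, k=100))
```

The stopping index `k=100` is an arbitrary illustrative choice, not the oracle stopping time used in the experiments.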

Figure 1. The reconstruction error $\| \bar f^{\hat k_{\mathrm{oracle}}}_D - f_\rho \|_{\mathcal H_K}$ in the low smoothness case. Left plot: reconstruction error curves for various (but fixed) sample sizes as a function of the number of partitions. Right plot: reconstruction error curves for various (but fixed) numbers of partitions as a function of the sample size (on log-scale).

Figure 2. The reconstruction error $\| \bar f^{\lambda}_D - f_\rho \|_{\mathcal H_K}$ in the low smoothness case. Left plot: error curves for different stopping times for $n = 500$ samples, as a function of the number of partitions. Right plot: error curves for different stopping times for $n = 5000$ samples, as a function of the number of partitions.

Figure 3. The reconstruction error $\| \bar f^{\lambda_{\mathrm{oracle}}}_D - f_\rho \|_{\mathcal H_K}$ in the high smoothness case. Left plot: reconstruction error curves for various (but fixed) sample sizes as a function of the number of partitions. Right plot: reconstruction error curves for various (but fixed) numbers of partitions as a function of the sample size (on log-scale).

Figure 4. The reconstruction error $\| \bar f^{\lambda}_D - f_\rho \|_{\mathcal H_K}$ in the high smoothness case. Left plot: error curves for different stopping times for $n = 500$ samples, as a function of the number of partitions. Right plot: error curves for different stopping times for $n = 5000$ samples, as a function of the number of partitions.

5. Discussion

Minimax Optimality: We have shown that for a large class of spectral regularization methods the error of the distributed algorithm $\| \bar T^{s} (\bar f^{\lambda_n}_D - f_\rho) \|_{\mathcal H_K}$ satisfies the same upper bound as the error $\| \bar T^{s} (f^{\lambda_n}_D - f_\rho) \|_{\mathcal H_K}$ for the single machine problem, if the regularization parameter $\lambda_n$ is chosen according to (3.3), provided the number of subsamples grows sufficiently slowly with the sample size $n$. Since, by [2], the rates for the latter are minimax optimal, our rates in Corollary 3.3 are minimax optimal also.

Comparison with other results: In [21] the authors derive minimax-optimal rates in 3 cases (finite rank kernels, sub-Gaussian decay of eigenvalues of the kernel and polynomial decay), provided $m$ satisfies a certain upper bound, depending on the rate of decay of the eigenvalues, under two crucial assumptions on the eigenfunctions of the integral operator associated to the kernel: for any $j \in \mathbb N$
\[ \mathbb E[ \phi_j(X)^{2k} ] \le \rho_k^{2k} , \tag{5.1} \]
for some $k \ge 2$ and $\rho_k < \infty$ or, even stronger, it is assumed that the eigenfunctions are uniformly bounded, i.e.
\[ \sup_{x \in \mathcal X} | \phi_j(x) | \le \rho , \tag{5.2} \]
for any $j \in \mathbb N$ and some $\rho < \infty$. We shall describe in more detail the case of polynomially decaying eigenvalues, which corresponds to our setting. Assuming eigenvalue decay $\mu_j \lesssim j^{-b}$ with $b > 1$, the authors choose a regularization parameter $\lambda_n = n^{-\frac{b}{b+1}}$ and
\[ m \lesssim \left( \frac{ n^{\frac{b(k-4) - k}{b+1}} }{ \rho^{4k} \log^{k}(n) } \right)^{\frac{1}{k-2}} , \]
leading to an error in $L^2$-norm
\[ \mathbb E\big[ \| \bar f^{\lambda_n}_D - f_\rho \|^2_{L^2} \big] \lesssim n^{-\frac{b}{b+1}} , \]
being minimax optimal. For $k < 4$, this is not a useful bound, since $m \lesssim 1$ as $n \to \infty$ in this case (for any sort of eigenvalue decay). On the other hand, if $k$ and $b$ might be taken arbitrarily large, corresponding to almost bounded eigenfunctions and arbitrarily large polynomial decay of eigenvalues, $m$ might be chosen proportional to $n^{1 - \epsilon}$, for any $\epsilon > 0$. As might be expected, replacing the $L^{2k}$ bound on the eigenfunctions by a bound in $L^\infty$ gives an upper bound on $m$ which simply is the limit for $k \to \infty$ in the bound given above, namely
\[ m \lesssim \frac{ n^{\frac{b-1}{b+1}} }{ \rho^{4} \log n } , \]
which for large $b$ behaves as above. Granted bounds on the eigenfunctions in $L^{2k}$ for (very) large $k$, this is a strong result.

While the decay rate of the eigenvalues can be determined by the smoothness of $K$ (see, e.g., [9] and references therein), it is a widely open question which general properties of the kernel imply estimates as in (5.1) and (5.2) on the eigenfunctions. The author in [22] even gives a counterexample and presents a $C^\infty$ Mercer kernel on $[0, 1]$ where the eigenfunctions of the corresponding integral operator are not uniformly bounded. Thus, smoothness of the kernel is not a sufficient condition for (5.2) to hold. Moreover, we point out that the upper bound (5.1) on the eigenfunctions (and thus the upper bound for $m$ in [21]) depends on the (unknown) marginal distribution $\nu$ (only the strongest assumption, a bound in sup-norm (5.2), does not depend on $\nu$). Concerning this point, our approach is agnostic.

As already mentioned in the Introduction, these bounds on the eigenfunctions have been eliminated in [16], for KRR, imposing polynomial decay of eigenvalues as above. This is very similar to our approach. As
a general rule, our bounds on $m$ and the bounds in [16] are worse than the bounds in [21] for eigenfunctions in (or close to) $L^\infty$, but in the complementary case where nothing is known on the eigenfunctions, $m$ still can be chosen as an increasing function of $n$, namely $m = n^{\alpha}$. More precisely, choosing $\lambda_n$ as in (3.3), the authors in [16] derive as an upper bound
\[ m \lesssim n^{\alpha}, \qquad \alpha = \frac{2br}{2br + b + 1} , \]
with $r$ being the smoothness parameter arising in the source condition. We recall here that due to our assumption $q \ge r + s$, the smoothness parameter $r$ is restricted to the interval $(0, \frac12]$ for KRR ($q = 1$) and $L^2$ risk ($s = \frac12$).

Our results (which hold for a general class of spectral regularization methods) are in some ways comparable to [16]. Specialized to KRR, our estimates for the exponent $\alpha$ in $m = O(n^{\alpha})$ coincide with the result given in [16]. Furthermore we emphasize that [21] and [16] estimate the DL-error only for $s = 1/2$ in our notation (corresponding to $L^2(\nu)$-norm), while our result holds for all values of $s \in [0, 1/2]$, which smoothly interpolates between $L^2(\nu)$-norm and RKHS norm, and, in addition, for all values of $p \in [1, \infty)$. Thus, our results also apply to the case of non-parametric inverse regression, where one is particularly interested in the reconstruction error (i.e. $\mathcal H_K$-norm), see e.g. [2]. Additionally, we precisely analyze the dependence on the noise variance $\sigma^2$ and the complexity radius $R$ in the source condition.

Concerning general strategy, while [16] uses a novel second order decomposition in an essential way, our approach is more classical. We clearly distinguish between estimating the approximation error and the sample error. The bias using a subsample should be of the same order as when using the whole sample, whereas the estimation error is higher on each subsample, but gets reduced by averaging, by writing the variance as a sum of i.i.d. random variables (which allows us to use Rosenthal's inequality).

Finally, we want to mention the recent works [15] and [12], which were worked out independently from our work. The authors in [12] also treat general spectral regularization methods (going beyond kernel ridge) and obtain essentially the same results, but with error bounds only in $L^2$-norm, excluding inverse learning problems. In [15], the authors investigate distributed learning on the example of Gradient Descent algorithms, which have infinite qualification and allow larger smoothness of the
regression function. They are able to improve the upper bound for the number of local machines to
\[ m \lesssim \frac{n^{\alpha}}{\log^{5}(n) + 1}, \qquad \alpha < \frac{br}{2br + b + 1} , \]
which is larger in case $r > 2$. In the intermediate case $1 < r < 2$, our bound in (3.7) is still better. An interesting feature is the fact that it is possible to allow more local machines by using additional unlabeled data. This indicates that finding the upper bound for the number of machines in the high smoothness regime is still an open problem.

Number of Subsamples: We follow the line of reasoning in earlier work on distributed learning insofar as we only prove sufficient conditions on the cardinality $m = n^{\alpha}$ of subsamples compatible with minimax optimal rates of convergence. On the complementary problem of proving necessity, analytical results are unknown to the best of our knowledge. However, our numerical results seem to indicate that the exponent $\alpha$ might actually be taken larger than we have proved so far in the low smoothness regime.

Adaptivity: It is clear from the theoretical results that both the regularization parameter $\lambda$ and the allowed cardinality of subsamples $m$ depend on the parameters $r$ and $b$, which in general are unknown. Thus, an adaptive approach to both parameters $b$ and $r$ for choosing $\lambda$ and $m$ is of interest. To the best of our knowledge, there are yet no rigorous results on adaptivity in this more general sense. Progress in this field may well be crucial in finally assessing the relative merits of the distributed learning approach as compared with alternative strategies to effectively deal with large data sets.

We sketch an alternative naive approach to adaptivity, based on hold-out in the direct case, where we consider each $f \in \mathcal H_K$ also as a function in $L^2(\mathcal X, \nu)$. We split the data $\mathbf z \in (\mathcal X \times \mathcal Y)^n$ into a training and validation part $\mathbf z = (\mathbf z^{t}, \mathbf z^{v})$ of cardinality $m_t$, $m_v$. We further subdivide $\mathbf z^{t}$ into $m_k$ subsamples, roughly of size $m_t / m_k$, where $m_k \le m_t$, $k = 1, 2, \dots$, is some strictly decreasing sequence. For each $k$ and each subsample $\mathbf z_j$, $1 \le j \le m_k$, we define the estimators $\hat f^{\lambda}_{\mathbf z_j}$ as in (2.7) and their average
\[ \bar f^{\lambda}_{k, \mathbf z^{t}} := \frac{1}{m_k} \sum_{j=1}^{m_k} \hat f^{\lambda}_{\mathbf z_j} . \tag{5.3} \]
Here, $\lambda$ varies in some sufficiently fine lattice $\Lambda$. Then evaluation on $\mathbf z^{v}$ gives the associated empirical $L^2$ error
\[ \mathrm{Err}^{\lambda}_{k}(\mathbf z^{v}) := \frac{1}{m_v} \sum_{i=1}^{m_v} \big| y^{v}_i - \bar f^{\lambda}_{k, \mathbf z^{t}}(x^{v}_i) \big|^2, \qquad \mathbf z^{v} = (\mathbf y^{v}, \mathbf x^{v}), \quad \mathbf y^{v} = (y^{v}_1, \dots, y^{v}_{m_v}), \tag{5.4} \]
leading us to define
\[ \hat\lambda_k := \operatorname*{Argmin}_{\lambda \in \Lambda} \mathrm{Err}^{\lambda}_{k}(\mathbf z^{v}), \qquad \mathrm{Err}(k) := \mathrm{Err}^{\hat\lambda_k}_{k}(\mathbf z^{v}) . \tag{5.5} \]
Then, an appropriate stopping criterion for $k$ might be to stop at
\[ k^{*} := \min\Big\{ k \ge 3 : \ \Delta(k) \le \delta \inf_{2 \le j < k} \Delta(j) \Big\}, \qquad \Delta(j) := | \mathrm{Err}(j) - \mathrm{Err}(j - 1) | , \tag{5.6} \]
for some $\delta < 1$ (which might require tuning). The corresponding regularization parameter is $\hat\lambda^{*} = \hat\lambda_{k^{*}}$, given by (5.5). At least intuitively, it is then reasonable to define a purely data driven estimator as
\[ \hat f_n := \bar f^{\hat\lambda^{*}}_{k^{*}, \mathbf z^{t}} . \tag{5.7} \]
Note that the training data $\mathbf z^{t}$ enter the definition of $\hat f_n$ via the explicit formula (5.3) encoding our kernel based approach, while $\mathbf z^{v}$ serves to determine $(k^{*}, \hat\lambda^{*})$ via minimization of the empirical $L^2$ error and some form of the discrepancy principle, which tells one to stop where $\mathrm{Err}(j)$ does not appreciably improve anymore. It is open if such a procedure achieves optimal rates, and we have to leave this for
future research 6 Proofs For ease of reading we ake use of the following conventions: we are interested in a recise deendence of ultilicative constants on the araeters σ, M, R, η,, n and

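- the dependence of multiplicative constants on various other parameters, including the kernel parameter $\kappa$, the norm parameter $s \in [0, \frac12]$, the parameters arising from the regularization method, $b > 1$, $\beta > 0$, $r > 0$, etc. will (generally) be omitted and simply indicated by the symbol $\blacktriangle$;
- the value of $C_{\blacktriangle}$ might change from line to line;
- the expression "for $n$ sufficiently large" means that the statement holds for $n \ge n_0$, with $n_0$ potentially depending on all model parameters (including $\sigma$, $M$ and $R$), but not on $\eta$.

6.1 Preliminaries

For proving our error bounds, we recall some results (without proof) from [5]. We introduce the effective dimension $\mathcal{N}(\lambda)$, being a measure for the complexity of $\mathcal{H}_K$ with respect to the marginal distribution $\nu$: For $\lambda \in (0, 1]$ we set

\[ \mathcal{N}(\lambda) = \operatorname{tr}\big( (\bar T + \lambda)^{-1} \bar T \big). \tag{6.1} \]

Since the operator $\bar T$ is trace-class, $\mathcal{N}(\lambda) < \infty$. Moreover, $\mathcal{N}(\lambda)$ satisfies

\[ \frac12 \le \mathcal{N}(\lambda) \le \frac{\beta b}{b-1} (\kappa^2 \lambda)^{-\frac1b}, \]

provided the marginal distribution $\nu$ of $X$ belongs to $\mathcal{P}^{<}(b, \beta)$ with $b > 1$ and $\beta > 0$ (see [5], Proposition 3).

Proposition 6.1 ([12], Proposition 1). Let $x_1, \dots, x_n$ be an iid sample, drawn according to $\nu$ on $\mathcal{X}$. Define

\[ B_n(\lambda) := 1 + \left( \frac{2}{n\lambda} + \sqrt{\frac{\mathcal{N}(\lambda)}{n\lambda}} \right)^2. \tag{6.2} \]

For any $\lambda > 0$, $\eta \in (0, 1]$, with probability at least $1 - \eta$ one has

\[ \big\| (\bar T_{\mathbf{x}} + \lambda)^{-1} (\bar T + \lambda) \big\| \le 8 \log^2(2\eta^{-1})\, B_n(\lambda). \tag{6.3} \]

Corollary 6.2. Let $\eta \in (0, 1)$. For $n \in \mathbb{N}$ let $\tilde\lambda_n$ be implicitly defined as the unique solution of $\mathcal{N}(\tilde\lambda_n) = n \tilde\lambda_n$. Then for any $\tilde\lambda_n \le \lambda \le 1$ one has

\[ B_n(\lambda) \le 26. \]

In particular,

\[ \big\| (\bar T_{\mathbf{x}} + \lambda)^{-1} (\bar T + \lambda) \big\| \le 208 \log^2(2\eta^{-1}), \]

with probability at least $1 - \eta$.

Proof of Corollary 6.2. Let $\tilde\lambda_n$ be defined via $\mathcal{N}(\tilde\lambda_n) = n \tilde\lambda_n$. Since $\mathcal{N}(\lambda)/\lambda$ is decreasing, we have for any $\lambda \ge \tilde\lambda_n$

\[ \frac{\mathcal{N}(\lambda)}{n\lambda} \le \frac{\mathcal{N}(\tilde\lambda_n)}{n\tilde\lambda_n} = 1. \]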

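Since the effective dimension is lower bounded by $\frac12$, the inequality above gives

\[ \frac{1}{n\lambda} \le \frac{2\,\mathcal{N}(\lambda)}{n\lambda} \le 2 \]

for any $\lambda \ge \tilde\lambda_n$, hence $\frac{2}{n\lambda} \le 4$. Inserting these bounds into (6.2) gives $B_n(\lambda) \le 1 + (4 + 1)^2 = 26$. Inserting this into (6.3) and noticing that $1 \le 2\log(2\eta^{-1})$ for any $\eta \in (0, 1)$ leads to the conclusion.

Corollary 6.3. If $\lambda_n$ is defined by (3.3) and if

\[ m_n \le n^{\alpha}, \qquad \alpha < \frac{2br}{2br + b + 1}, \]

one has

\[ B_{\frac{n}{m_n}}(\lambda_n) \le 2, \]

provided $n$ is sufficiently large.

Proof of Corollary 6.3. Recall that $\mathcal{N}(\lambda_n) \le C_b \lambda_n^{-\frac1b}$ and

\[ \sigma \sqrt{\frac{\lambda_n^{-\frac1b}}{n\lambda_n}} = R\lambda_n^r. \]

Using the definition of $\lambda_n$ in (3.3) yields

\[ \frac{2 m_n}{n\lambda_n} = o\big( \sqrt{m_n}\, \lambda_n^r \big), \]

provided

\[ m_n \le n^{\alpha}, \qquad \alpha < \frac{2(br + 1)}{2br + b + 1}. \]

Finally, $\sqrt{m_n}\, \lambda_n^r = o(1)$ if

\[ m_n \le n^{\alpha}, \qquad \alpha < \frac{2br}{2br + b + 1}. \]

We shortly illustrate how Corollary 6.2 and Proposition 6.1 will be used. Let $u \in [0, 1]$, $\tilde\lambda_n \le \lambda$ as above and $f \in \mathcal{H}_K$. We have

\[ \| \bar T^u f \|_{\mathcal{H}_K} = \big\| \bar T^u (\bar T + \lambda)^{-u} (\bar T + \lambda)^{u} (\bar T_{\mathbf{x}} + \lambda)^{-u} (\bar T_{\mathbf{x}} + \lambda)^{u} f \big\|_{\mathcal{H}_K} \]
\[ \le \big\| \bar T^u (\bar T + \lambda)^{-u} \big\| \, \big\| (\bar T + \lambda)^{u} (\bar T_{\mathbf{x}} + \lambda)^{-u} \big\| \, \big\| (\bar T_{\mathbf{x}} + \lambda)^{u} f \big\|_{\mathcal{H}_K} \le 8 \log^{2u}(2\eta^{-1})\, B_n(\lambda)^u \, \big\| (\bar T_{\mathbf{x}} + \lambda)^{u} f \big\|_{\mathcal{H}_K}, \tag{6.4} \]

with probability at least $1 - \eta$, for any $\eta \in (0, 1)$. In particular, for any $\tilde\lambda_n \le \lambda$ (with $\tilde\lambda_n$ as in Corollary 6.2)

\[ \| \bar T^u f \|_{\mathcal{H}_K} \le 208^u \log^{2u}(2\eta^{-1}) \, \big\| (\bar T_{\mathbf{x}} + \lambda)^{u} f \big\|_{\mathcal{H}_K}, \tag{6.5} \]

with probability at least $1 - \eta$. In the following, we constantly use (6.5).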
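The quantities in (6.1), (6.2) and Corollary 6.2 can be checked numerically on a toy spectrum. The following is a minimal sketch, not part of the paper: it assumes that the normalized operator has eigenvalues $\mu_j = j^{-b}$ (a hypothetical polynomial decay, truncated at 20000 terms), and the helper names `effective_dimension`, `b_n` and `solve_lambda_tilde` are illustrative only.

```python
import numpy as np

def effective_dimension(lam, eigvals):
    """N(lam) = tr((T + lam)^{-1} T) for an operator with the given eigenvalues."""
    return float(np.sum(eigvals / (eigvals + lam)))

def b_n(lam, n, eigvals):
    """B_n(lam) = 1 + (2/(n lam) + sqrt(N(lam)/(n lam)))^2, as in (6.2)."""
    n_lam = n * lam
    return 1.0 + (2.0 / n_lam + np.sqrt(effective_dimension(lam, eigvals) / n_lam)) ** 2

def solve_lambda_tilde(n, eigvals, lo=1e-12, hi=1.0, iters=200):
    """Bisection on a log scale for N(lam) = n * lam (N(lam)/lam is decreasing)."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)  # geometric midpoint
        if effective_dimension(mid, eigvals) > n * mid:
            lo = mid
        else:
            hi = mid
    return hi

# Assumed toy spectrum: mu_j = j^{-b} with b = 2, so the largest eigenvalue is 1.
b = 2.0
eigvals = np.arange(1, 20001, dtype=float) ** (-b)
n = 1000

lam_tilde = solve_lambda_tilde(n, eigvals)
grid = np.geomspace(lam_tilde, 1.0, 50)
n_values = np.array([effective_dimension(lam, eigvals) for lam in grid])
b_values = np.array([b_n(lam, n, eigvals) for lam in grid])
print(lam_tilde, b_values.max())
```

On this toy spectrum the largest value of $B_n(\lambda)$ over the grid should stay well below the worst-case constant 26, which reflects that the corollary is a uniform bound rather than a sharp one.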

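6.2 Approximation Error Bound

Recall that $\nu$ denotes the input sampling distribution and $\mathcal{P}$ the set of all probability distributions on the input space $\mathcal{X}$.

Lemma 6.4. Let $\nu \in \mathcal{P}$, $v \in \mathbb{R}$ and let $\mathbf{x} \in \mathcal{X}^n$ be an iid sample, drawn according to $\nu$. Assume the regularization $(g_\lambda)_\lambda$ has qualification $q \ge v + s + 1$. Then with probability at least $1 - \eta$

\[ \big\| \bar T^s r_\lambda(\bar T_{\mathbf{x}}) \bar T_{\mathbf{x}}^{v} (\bar T - \bar T_{\mathbf{x}}) \big\| \le C_{\blacktriangle} \log^4(4\eta^{-1})\, \lambda^{s+v+1} B_n^{s+1}(\lambda) \left( \frac{2}{n\lambda} + \sqrt{\frac{\mathcal{N}(\lambda)}{n\lambda}} \right), \]

for some $C_{\blacktriangle} < \infty$.

Proof of Lemma 6.4. From (6.4) and from Proposition A.1, since $q \ge s + v + 1$, one has

\[ \big\| \bar T^s r_\lambda(\bar T_{\mathbf{x}}) \bar T_{\mathbf{x}}^{v} (\bar T - \bar T_{\mathbf{x}}) \big\| \le C \log^{2(s+1)}(4\eta^{-1})\, B_n^{s+1}(\lambda)\, \big\| (\bar T_{\mathbf{x}} + \lambda)^s r_\lambda(\bar T_{\mathbf{x}}) \bar T_{\mathbf{x}}^{v} (\bar T_{\mathbf{x}} + \lambda) \big\| \, \big\| (\bar T + \lambda)^{-1} (\bar T - \bar T_{\mathbf{x}}) \big\| \]
\[ \le C_{\blacktriangle} \log^4(4\eta^{-1})\, \lambda^{s+v+1} B_n^{s+1}(\lambda) \left( \frac{2}{n\lambda} + \sqrt{\frac{\mathcal{N}(\lambda)}{n\lambda}} \right), \]

for any $\lambda \in (0, 1]$, $\eta \in (0, 1]$, with probability at least $1 - \eta$. We also used that $s \le \frac12$.

Lemma 6.5. Let $\nu \in \mathcal{P}$, $v \in \mathbb{R}$ and let $\mathbf{x} \in \mathcal{X}^n$ be an iid sample, drawn according to $\nu$. Assume the regularization $(g_\lambda)_\lambda$ has qualification $q \ge v + s$. Then for any $\lambda \in (0, 1]$, $\eta \in (0, 1]$, with probability at least $1 - \eta$

\[ \big\| \bar T^s r_\lambda(\bar T_{\mathbf{x}}) \bar T_{\mathbf{x}}^{v} \big\| \le C_{\blacktriangle} \log^{2s}(2\eta^{-1})\, B_n^{s}(\lambda)\, \lambda^{s+v}, \]

for some $C_{\blacktriangle} < \infty$.

Proof of Lemma 6.5. Using (6.4), since $q \ge v + s$,

\[ \big\| \bar T^s r_\lambda(\bar T_{\mathbf{x}}) \bar T_{\mathbf{x}}^{v} \big\| \le C \log^{2s}(2\eta^{-1})\, B_n^{s}(\lambda)\, \big\| (\bar T_{\mathbf{x}} + \lambda)^s r_\lambda(\bar T_{\mathbf{x}}) \bar T_{\mathbf{x}}^{v} \big\| \le C_{\blacktriangle} \log^{2s}(2\eta^{-1})\, B_n^{s}(\lambda)\, \lambda^{s+v}, \]

with probability at least $1 - \eta$.

Proposition 6.6 (Expectation of Approximation Error). Let $f_\rho \in \Omega_\nu(r, R)$, $\lambda \in (0, 1]$ and let $B_{\frac{n}{m}}(\lambda)$ be defined in (6.2). Assume the regularization has qualification $q \ge r + s$. For any $p \ge 1$ one has:

(1) If $r \le 1$, then

\[ \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s (f_\rho - \tilde f_D^{\lambda}) \big\|^p_{\mathcal{H}_K} \Big]^{\frac1p} \le C_{\blacktriangle,p}\, R\, \lambda^{s+r} B^{s+r}_{\frac{n}{m}}(\lambda). \]

(2) If $r > 1$, then

\[ \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s (f_\rho - \tilde f_D^{\lambda}) \big\|^p_{\mathcal{H}_K} \Big]^{\frac1p} \le C_{\blacktriangle,p}\, R\, \lambda^{s} B^{s+1}_{\frac{n}{m}}(\lambda) \left( \lambda^r + \lambda \left( \frac{2m}{n\lambda} + \sqrt{\frac{m\,\mathcal{N}(\lambda)}{n\lambda}} \right) \right). \]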

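In (1) and (2) the constant $C_{\blacktriangle,p}$ does not depend on $(\sigma, M, R) \in \mathbb{R}^3_+$.

Proof of Proposition 6.6. Since $f_\rho \in \Omega_\nu(r, R)$,

\[ \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s (f_\rho - \tilde f_D^{\lambda}) \big\|^p_{\mathcal{H}_K} \Big]^{\frac1p} = \mathbb{E}_{\rho^{\otimes n}} \Big[ \Big\| \frac1m \sum_{j=1}^m \bar T^s r_\lambda(\bar T_{\mathbf{x}_j}) f_\rho \Big\|^p_{\mathcal{H}_K} \Big]^{\frac1p} \le \frac1m \sum_{j=1}^m \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s r_\lambda(\bar T_{\mathbf{x}_j}) f_\rho \big\|^p_{\mathcal{H}_K} \Big]^{\frac1p} \le \frac{R}{m} \sum_{j=1}^m \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s r_\lambda(\bar T_{\mathbf{x}_j}) \bar T^r \big\|^p \Big]^{\frac1p}. \tag{6.6} \]

The first inequality is just the triangle inequality for the $p$-norm $\|f\|_p = \mathbb{E}[\|f\|^p_{\mathcal{H}_K}]^{\frac1p}$. We bound the expectation for each separate subsample of size $\frac{n}{m}$ by first deriving a probabilistic estimate and then we integrate.

Consider first the case where $r \le 1$. Using (6.4) and Cordes Inequality, Proposition A.3, one has for any $j = 1, \dots, m$

\[ \big\| \bar T^s r_\lambda(\bar T_{\mathbf{x}_j}) \bar T^r \big\| \le C \log^{2(s+r)}(4\eta^{-1})\, B^{s+r}_{\frac{n}{m}}(\lambda)\, \big\| (\bar T_{\mathbf{x}_j} + \lambda)^s r_\lambda(\bar T_{\mathbf{x}_j}) (\bar T_{\mathbf{x}_j} + \lambda)^r \big\| \le C_{\blacktriangle} \log^3(4\eta^{-1})\, \lambda^{s+r} B^{s+r}_{\frac{n}{m}}(\lambda), \]

with probability at least $1 - \eta$ and where $B_{\frac{n}{m}}(\lambda)$ is defined in (6.2). Recall that the regularization has qualification $q \ge r + s$. By integration one has

\[ \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s r_\lambda(\bar T_{\mathbf{x}_j}) \bar T^r \big\|^p \Big]^{\frac1p} \le C_{\blacktriangle,p}\, \lambda^{s+r} B^{s+r}_{\frac{n}{m}}(\lambda), \]

for some $C_{\blacktriangle,p} < \infty$, not depending on $\sigma, M, R$. Finally, from (6.6)

\[ \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s (f_\rho - \tilde f_D^{\lambda}) \big\|^p_{\mathcal{H}_K} \Big]^{\frac1p} \le C_{\blacktriangle,p}\, R\, \lambda^{s+r} B^{s+r}_{\frac{n}{m}}(\lambda). \]

In the case where $r > 1$, we write $r = k + u$, with $k = \lfloor r \rfloor$ and $u = r - k < 1$. We shall use the decomposition

\[ \bar T^k = \sum_{l=0}^{k-1} \bar T_{\mathbf{x}}^{l} (\bar T - \bar T_{\mathbf{x}}) \bar T^{k-(l+1)} + \bar T_{\mathbf{x}}^{k}. \tag{6.7} \]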
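Decomposition (6.7) is a telescoping identity: the $l$-th summand equals $\bar T_{\mathbf{x}}^{l} \bar T^{k-l} - \bar T_{\mathbf{x}}^{l+1} \bar T^{k-l-1}$, so the sum collapses to $\bar T^k - \bar T_{\mathbf{x}}^k$. As a sanity check, the identity can be verified numerically. The sketch below is an illustration only and assumes that small random positive semi-definite matrices stand in for the operators; the function names are hypothetical.

```python
import numpy as np

def random_psd(rng, dim):
    """Random symmetric positive semi-definite matrix with spectral norm 1."""
    a = rng.standard_normal((dim, dim))
    mat = a @ a.T
    return mat / np.linalg.norm(mat, 2)

def telescoping_rhs(T, Tx, k):
    """Right-hand side of (6.7): sum_l Tx^l (T - Tx) T^(k-(l+1)) + Tx^k."""
    total = np.linalg.matrix_power(Tx, k)
    for l in range(k):
        left = np.linalg.matrix_power(Tx, l)
        right = np.linalg.matrix_power(T, k - (l + 1))
        total = total + left @ (T - Tx) @ right
    return total

rng = np.random.default_rng(0)
T = random_psd(rng, 5)   # stand-in for the population operator
Tx = random_psd(rng, 5)  # stand-in for the empirical operator

# Largest entrywise deviation between T^k and the right-hand side, k = 1..5.
max_gap = max(
    np.abs(np.linalg.matrix_power(T, k) - telescoping_rhs(T, Tx, k)).max()
    for k in range(1, 6)
)
print(max_gap)
```

Note that the identity holds for arbitrary square matrices, since no commutation of the two operators is used; the order of the factors in each summand matters.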

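We proceed by bounding (6.6) according to decomposition (6.7). For any $j = 1, \dots, m$ one has

\[ \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s r_\lambda(\bar T_{\mathbf{x}_j}) \bar T^{k+u} \big\|^p \Big]^{\frac1p} \le \sum_{l=0}^{k-1} \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s r_\lambda(\bar T_{\mathbf{x}_j}) \bar T_{\mathbf{x}_j}^{l} (\bar T - \bar T_{\mathbf{x}_j}) \bar T^{k-(l+1)+u} \big\|^p \Big]^{\frac1p} + \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s r_\lambda(\bar T_{\mathbf{x}_j}) \bar T_{\mathbf{x}_j}^{k} \bar T^{u} \big\|^p \Big]^{\frac1p} \]
\[ \le \sum_{l=0}^{k-1} \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s r_\lambda(\bar T_{\mathbf{x}_j}) \bar T_{\mathbf{x}_j}^{l} (\bar T - \bar T_{\mathbf{x}_j}) \big\|^p \Big]^{\frac1p} + \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s r_\lambda(\bar T_{\mathbf{x}_j}) \bar T_{\mathbf{x}_j}^{k} \bar T^{u} \big\|^p \Big]^{\frac1p}. \tag{6.8} \]

Here we use that $\| \bar T^{k-(l+1)+u} \|$ is bounded by 1. By Lemma 6.5 and by (6.4), with probability at least $1 - \eta$

\[ \big\| \bar T^s r_\lambda(\bar T_{\mathbf{x}_j}) \bar T_{\mathbf{x}_j}^{k} \bar T^{u} \big\| \le C_{\blacktriangle} \log^{2(s+u)}(2\eta^{-1})\, B^{s+u}_{\frac{n}{m}}(\lambda)\, \lambda^{s+r}, \]

and thus integration yields

\[ \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s r_\lambda(\bar T_{\mathbf{x}_j}) \bar T_{\mathbf{x}_j}^{k} \bar T^{u} \big\|^p \Big]^{\frac1p} \le C_{\blacktriangle,p}\, B^{s+u}_{\frac{n}{m}}(\lambda)\, \lambda^{s+r}. \tag{6.9} \]

For estimating the first term in (6.8) we may use Lemma 6.4. For any $l = 0, \dots, k-1$, $j = 1, \dots, m$, with probability at least $1 - \eta$

\[ \big\| \bar T^s r_\lambda(\bar T_{\mathbf{x}_j}) \bar T_{\mathbf{x}_j}^{l} (\bar T - \bar T_{\mathbf{x}_j}) \big\| \le C_{\blacktriangle} \log^4(8\eta^{-1})\, \lambda^{s+l+1} B^{s+1}_{\frac{n}{m}}(\lambda) \left( \frac{2m}{n\lambda} + \sqrt{\frac{m\,\mathcal{N}(\lambda)}{n\lambda}} \right). \]

Again by integration, since $\lambda^l \le 1$ for any $l = 0, \dots, k-1$, one has

\[ \sum_{l=0}^{k-1} \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s r_\lambda(\bar T_{\mathbf{x}_j}) \bar T_{\mathbf{x}_j}^{l} (\bar T - \bar T_{\mathbf{x}_j}) \big\|^p \Big]^{\frac1p} \le C_{\blacktriangle,p}\, r\, \lambda^{s+1} B^{s+1}_{\frac{n}{m}}(\lambda) \left( \frac{2m}{n\lambda} + \sqrt{\frac{m\,\mathcal{N}(\lambda)}{n\lambda}} \right). \tag{6.10} \]

Finally, combining (6.9) and (6.10) with (6.6) gives in the case where $r > 1$

\[ \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s (f_\rho - \tilde f_D^{\lambda}) \big\|^p_{\mathcal{H}_K} \Big]^{\frac1p} \le C_{\blacktriangle,p}\, R\, \lambda^{s} B^{s+1}_{\frac{n}{m}}(\lambda) \left( \lambda^r + \lambda \left( \frac{2m}{n\lambda} + \sqrt{\frac{m\,\mathcal{N}(\lambda)}{n\lambda}} \right) \right). \]

The rest of the proof follows from (6.8).

Proof of Theorem 3.1. Let $\lambda_n$ be defined by (3.3). According to Corollary 6.3, we have $B_{\frac{n}{m_n}}(\lambda_n) \le 2$ provided $\alpha < \frac{2br}{2br+b+1}$. We immediately obtain from the first part of Proposition 6.6 in the case where $r \le 1$

\[ \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s (f_\rho - \tilde f_D^{\lambda_n}) \big\|^p_{\mathcal{H}_K} \Big]^{\frac1p} \le C_{\blacktriangle,p}\, R\, \lambda_n^{s+r} = C_{\blacktriangle,p}\, a_n. \]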

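We turn to the case where $r > 1$. We apply the second part of Proposition 6.6. By Corollary 6.3 we have

\[ \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s (f_\rho - \tilde f_D^{\lambda_n}) \big\|^p_{\mathcal{H}_K} \Big]^{\frac1p} \le C_{\blacktriangle,p}\, R\, \lambda_n^{s} B^{s+1}_{\frac{n}{m_n}}(\lambda_n) \left( \lambda_n^r + \lambda_n \left( \frac{2m_n}{n\lambda_n} + \sqrt{\frac{m_n\,\mathcal{N}(\lambda_n)}{n\lambda_n}} \right) \right) \]
\[ \le C_{\blacktriangle,p}\, R\, \lambda_n^{s} \left( \lambda_n^r + \lambda_n \left( \frac{2m_n}{n\lambda_n} + \sqrt{\frac{m_n\,\mathcal{N}(\lambda_n)}{n\lambda_n}} \right) \right), \]

where we used that $\mathcal{N}(\lambda_n) \le C_b \lambda_n^{-1/b}$ and the definition of $\lambda_n$. Observe that

\[ \frac{2m_n}{n\lambda_n} = o\big( \sqrt{m_n}\, \lambda_n^r \big), \]

provided

\[ m_n \le n^{\alpha}, \qquad \alpha < \frac{2(br+1)}{2br+b+1}. \]

Furthermore, for $n$ sufficiently large,

\[ \frac{R}{\sigma} \sqrt{m_n}\, \lambda_n \le 1, \]

provided that

\[ \alpha < \frac{2b}{2br+b+1}. \]

As a result, for any $p \ge 1$

\[ \limsup_{n \to \infty} \sup_{\rho \in \mathcal{M}_{\sigma,M,R}} \frac{ \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s (f_\rho - \tilde f_D^{\lambda_n}) \big\|^p_{\mathcal{H}_K} \Big]^{\frac1p} }{a_n} \le C_{\blacktriangle,p}, \]

for some $C_{\blacktriangle,p} < \infty$, not depending on $\sigma, M, R$.

6.3 Sample Error Bound

The main idea for deriving an upper bound for the sample error is to identify it as a sum of unbiased Hilbert space-valued iid variables and then to apply a suitable version of Rosenthal's inequality. Given $\lambda \in (0, 1]$, we define the random variable $\xi_\lambda : (\mathcal{X} \times \mathbb{R})^{\frac{n}{m}} \to \mathcal{H}_K$ by

\[ \xi_\lambda(\mathbf{x}, \mathbf{y}) := \bar T^s g_\lambda(\bar T_{\mathbf{x}}) \big( \bar T_{\mathbf{x}} f_\rho - \bar S_{\mathbf{x}}^{*} \mathbf{y} \big). \]

Recall that according to Assumption 2.1, the conditional expectation w.r.t. $\rho$ of $Y$ given $X$ satisfies

\[ \mathbb{E}_\rho[\,Y \mid X = x\,] = \bar S_x f_\rho, \]

implying that $\xi_\lambda$ is unbiased (since $\bar T_{\mathbf{x}} = \bar S_{\mathbf{x}}^{*} \bar S_{\mathbf{x}}$). Thus,

\[ \bar T^s \big( \tilde f_D^{\lambda} - \bar f_D^{\lambda} \big) = \frac1m \sum_{j=1}^m \xi_\lambda(\mathbf{x}_j, \mathbf{y}_j) \tag{6.11} \]

is a sum of centered iid random variables.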
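The structure behind (6.11) can be made concrete for kernel ridge regression, where $g_\lambda(t) = (t + \lambda)^{-1}$. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes a Gaussian kernel, a hypothetical target $f_\rho(t) = \sin(2\pi t)$, Gaussian noise, and evaluation on a grid instead of a Hilbert space norm. It computes the averaged estimator from noisy labels, its noise-free counterpart from the labels $f_\rho(x_i)$, and checks that their difference is exactly the averaged estimator applied to the noise alone, which is why the sample error is centered.

```python
import numpy as np

def gaussian_kernel(a, b, width=0.2):
    """Gaussian kernel matrix between two 1-d point sets (assumed kernel)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff ** 2 / (2.0 * width ** 2))

def krr_predict(x_train, y_train, x_eval, lam):
    """Kernel ridge regression: alpha = (K + n*lam*I)^{-1} y, evaluated at x_eval."""
    n = x_train.shape[0]
    gram = gaussian_kernel(x_train, x_train)
    alpha = np.linalg.solve(gram + n * lam * np.eye(n), y_train)
    return gaussian_kernel(x_eval, x_train) @ alpha

def averaged_krr(x, y, x_eval, lam, m):
    """Split the data into m disjoint subsets, fit KRR on each, average the outputs."""
    parts = zip(np.array_split(x, m), np.array_split(y, m))
    preds = [krr_predict(px, py, x_eval, lam) for px, py in parts]
    return np.mean(preds, axis=0)

rng = np.random.default_rng(1)
n, m, lam = 400, 8, 1e-3

def f_rho(t):
    """Hypothetical regression function used only for this illustration."""
    return np.sin(2.0 * np.pi * t)

x = rng.uniform(0.0, 1.0, n)
noise = 0.3 * rng.standard_normal(n)
y = f_rho(x) + noise
x_eval = np.linspace(0.0, 1.0, 50)

f_bar = averaged_krr(x, y, x_eval, lam, m)           # averaged estimator
f_tilde = averaged_krr(x, f_rho(x), x_eval, lam, m)  # noise-free counterpart
f_noise = averaged_krr(x, noise, x_eval, lam, m)     # estimator applied to noise only

print(np.abs(f_bar - f_tilde - f_noise).max())
```

The check relies only on the linearity of each local estimator in the labels, so it would carry over to any other linear spectral regularization scheme plugged in for `krr_predict`.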

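Furthermore, we need the following result from [18], Theorem 5.2, which generalizes Rosenthal's inequalities from [19] (originally only formulated for real valued random variables) to random variables with values in a Banach space. For Hilbert spaces this looks particularly nice.

Proposition 6.7. Let $\mathcal{H}$ be a Hilbert space and $\xi_1, \dots, \xi_m$ be a finite sequence of independent, mean zero $\mathcal{H}$-valued random variables. If $2 \le p < \infty$, then there exists a constant $C_p > 0$, only depending on $p$, such that

\[ \left( \mathbb{E} \Big\| \frac1m \sum_{j=1}^m \xi_j \Big\|_{\mathcal{H}}^p \right)^{\frac1p} \le \frac{C_p}{m} \max\left\{ \Big( \sum_{j=1}^m \mathbb{E} \|\xi_j\|_{\mathcal{H}}^p \Big)^{\frac1p}, \Big( \sum_{j=1}^m \mathbb{E} \|\xi_j\|_{\mathcal{H}}^2 \Big)^{\frac12} \right\}. \tag{6.12} \]

We remark in passing that [7], Corollary 1.22, contains the interesting result that in addition to the upper bound in (6.12) there is also a corresponding lower bound where the constant $C_p$ is replaced by another constant $C_p' > 0$, only depending on $p$.

Proposition 6.8 (Expectation of Sample Error). Let $\rho$ be a source distribution belonging to $\mathcal{M}_{\sigma,M,R}$, $s \in [0, \frac12]$ and let $\lambda \in (0, 1]$. Define $B_{\frac{n}{m}}(\lambda)$ as in (6.2). Assume the regularization has qualification $q \ge r + s$. For any $p \ge 1$ one has:

\[ \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s (\tilde f_D^{\lambda} - \bar f_D^{\lambda}) \big\|^p_{\mathcal{H}_K} \Big]^{\frac1p} \le C_{\blacktriangle,p}\, m^{-\frac12} B_{\frac{n}{m}}(\lambda)^{\frac12 + s} \lambda^{s} \left( \frac{M m}{n\lambda} + \sigma \sqrt{\frac{m\,\mathcal{N}(\lambda)}{n\lambda}} \right), \]

where $C_{\blacktriangle,p}$ does not depend on $(\sigma, M, R) \in \mathbb{R}^3_+$.

Proof of Proposition 6.8. Let $\lambda \in (0, 1]$ and $p \ge 2$. From Proposition 6.7

\[ \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s (\tilde f_D^{\lambda} - \bar f_D^{\lambda}) \big\|^p_{\mathcal{H}_K} \Big]^{\frac1p} = \mathbb{E}_{\rho^{\otimes n}} \Big[ \Big\| \frac1m \sum_{j=1}^m \xi_\lambda(\mathbf{x}_j, \mathbf{y}_j) \Big\|^p_{\mathcal{H}_K} \Big]^{\frac1p} \]
\[ \le \frac{C_p}{m} \max\left\{ \Big( \sum_{j=1}^m \mathbb{E}_{\rho^{\otimes n}} \big[ \| \xi_\lambda(\mathbf{x}_j, \mathbf{y}_j) \|^p_{\mathcal{H}_K} \big] \Big)^{\frac1p}, \Big( \sum_{j=1}^m \mathbb{E}_{\rho^{\otimes n}} \big[ \| \xi_\lambda(\mathbf{x}_j, \mathbf{y}_j) \|^2_{\mathcal{H}_K} \big] \Big)^{\frac12} \right\}. \tag{6.13} \]

Again, the estimates in expectation will follow from integrating a bound holding with high probability. By (6.4), one has for any $j = 1, \dots, m$

\[ \| \xi_\lambda(\mathbf{x}_j, \mathbf{y}_j) \|_{\mathcal{H}_K} = \big\| \bar T^s g_\lambda(\bar T_{\mathbf{x}_j}) \big( \bar T_{\mathbf{x}_j} f_\rho - \bar S_{\mathbf{x}_j}^{*} \mathbf{y}_j \big) \big\|_{\mathcal{H}_K} \le 8 \log^{2s}(4\eta^{-1})\, B_{\frac{n}{m}}(\lambda)^s\, \big\| (\bar T_{\mathbf{x}_j} + \lambda)^s g_\lambda(\bar T_{\mathbf{x}_j}) \big( \bar T_{\mathbf{x}_j} f_\rho - \bar S_{\mathbf{x}_j}^{*} \mathbf{y}_j \big) \big\|_{\mathcal{H}_K}, \tag{6.14} \]

holding with probability at least $1 - \frac{\eta}{2}$, where $B_{\frac{n}{m}}(\lambda)$ is defined in (6.2). We proceed by splitting:

\[ (\bar T_{\mathbf{x}_j} + \lambda)^s g_\lambda(\bar T_{\mathbf{x}_j}) \big( \bar T_{\mathbf{x}_j} f_\rho - \bar S_{\mathbf{x}_j}^{*} \mathbf{y}_j \big) = H^{(1)}_{\mathbf{x}_j} \cdot H^{(2)}_{\mathbf{x}_j} \cdot h^{\lambda}_{\mathbf{z}_j}, \]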

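with

\[ H^{(1)}_{\mathbf{x}_j} := (\bar T_{\mathbf{x}_j} + \lambda)^s g_\lambda(\bar T_{\mathbf{x}_j}) (\bar T_{\mathbf{x}_j} + \lambda)^{\frac12}, \qquad H^{(2)}_{\mathbf{x}_j} := (\bar T_{\mathbf{x}_j} + \lambda)^{-\frac12} (\bar T + \lambda)^{\frac12}, \qquad h^{\lambda}_{\mathbf{z}_j} := (\bar T + \lambda)^{-\frac12} \big( \bar T_{\mathbf{x}_j} f_\rho - \bar S_{\mathbf{x}_j}^{*} \mathbf{y}_j \big). \]

The first term is estimated using (2.6) and gives

\[ \| H^{(1)}_{\mathbf{x}_j} \| \le C_{\blacktriangle}\, \lambda^{s - \frac12}. \tag{6.15} \]

The second term is now bounded using (6.4) once more. One has with probability at least $1 - \frac{\eta}{4}$

\[ \| H^{(2)}_{\mathbf{x}_j} \| \le 8 \log(8\eta^{-1})\, B_{\frac{n}{m}}(\lambda)^{\frac12}. \tag{6.16} \]

Finally, $h^{\lambda}_{\mathbf{z}_j}$ is estimated using Proposition A.2:

\[ \| h^{\lambda}_{\mathbf{z}_j} \|_{\mathcal{H}_K} \le 2 \log(8\eta^{-1}) \left( \frac{m M}{n \sqrt{\lambda}} + \sigma \sqrt{\frac{m\,\mathcal{N}(\lambda)}{n}} \right), \tag{6.17} \]

holding with probability at least $1 - \frac{\eta}{4}$. Thus, combining (6.15), (6.16) and (6.17) with (6.14) gives for any $j = 1, \dots, m$

\[ \| \xi_\lambda(\mathbf{x}_j, \mathbf{y}_j) \|_{\mathcal{H}_K} \le C_{\blacktriangle} \log^{2(s+1)}(8\eta^{-1})\, B_{\frac{n}{m}}(\lambda)^{\frac12 + s} \lambda^{s} \left( \frac{M m}{n\lambda} + \sigma \sqrt{\frac{m\,\mathcal{N}(\lambda)}{n\lambda}} \right), \]

with probability at least $1 - \eta$. Integration gives for any $p \ge 2$

\[ \mathbb{E}_{\rho^{\otimes n}} \big[ \| \xi_\lambda(\mathbf{x}_j, \mathbf{y}_j) \|^p_{\mathcal{H}_K} \big] \le C_{\blacktriangle,p}\, A^p, \]

with

\[ A := A_{\frac{n}{m}}(\lambda) := B_{\frac{n}{m}}(\lambda)^{\frac12 + s} \lambda^{s} \left( \frac{M m}{n\lambda} + \sigma \sqrt{\frac{m\,\mathcal{N}(\lambda)}{n\lambda}} \right). \]

Combining this with (6.13) implies, since $p \ge 2$,

\[ \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s (\tilde f_D^{\lambda} - \bar f_D^{\lambda}) \big\|^p_{\mathcal{H}_K} \Big]^{\frac1p} \le \frac{C_{\blacktriangle,p}}{m} \max\Big( (A^p m)^{\frac1p}, (A^2 m)^{\frac12} \Big) = C_{\blacktriangle,p}\, A \max\Big( m^{\frac1p - 1}, m^{-\frac12} \Big) = C_{\blacktriangle,p}\, A\, m^{-\frac12}, \]

where $C_{\blacktriangle,p}$ does not depend on $(\sigma, M, R) \in \mathbb{R}^3_+$. The result for the case $1 \le p \le 2$ immediately follows from Hölder's inequality.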
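The factor $m^{-\frac12}$ produced by the Rosenthal-type bound can be observed in a small Monte Carlo experiment. The sketch below is only an illustration: it assumes standard Gaussian vectors in $\mathbb{R}^{10}$ as a stand-in for the Hilbert space-valued variables $\xi_j$, and the function name and sample sizes are arbitrary choices. If the $p$-th moment norm of the average scales like $m^{-\frac12}$, then multiplying it by $\sqrt{m}$ should give roughly the same value for every $m$.

```python
import numpy as np

def lp_norm_of_average(m, p, dim=10, reps=2000, seed=0):
    """Monte Carlo estimate of (E || (1/m) sum_j xi_j ||^p)^(1/p) for iid N(0, I) vectors."""
    rng = np.random.default_rng(seed)
    xi = rng.standard_normal((reps, m, dim))  # reps independent samples of m centered vectors
    avg_norm = np.linalg.norm(xi.mean(axis=1), axis=1)
    return float(np.mean(avg_norm ** p) ** (1.0 / p))

ms = [1, 4, 16, 64]
# Rescale by sqrt(m): an m^{-1/2} decay shows up as a roughly constant row.
rescaled = {p: [np.sqrt(m) * lp_norm_of_average(m, p) for m in ms] for p in (2, 4)}
for p, values in rescaled.items():
    print(p, np.round(values, 3))
```

For Gaussian vectors the rescaled average has the same distribution for every $m$, so any remaining variation across a row is Monte Carlo error; for other centered distributions the constancy would hold only up to the constant $C_p$.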

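Proof of Theorem 3.2. Let $\lambda_n$ be defined by (3.3). According to Corollary 6.3 we have $B_{\frac{n}{m_n}}(\lambda_n) \le 2$ provided $\alpha < \frac{2br}{2br+b+1}$. We immediately obtain from Proposition 6.8

\[ \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s (\tilde f_D^{\lambda_n} - \bar f_D^{\lambda_n}) \big\|^p_{\mathcal{H}_K} \Big]^{\frac1p} \le C_{\blacktriangle,p}\, m_n^{-\frac12} \lambda_n^{s} \left( \frac{M m_n}{n\lambda_n} + \sigma \sqrt{\frac{m_n\,\mathcal{N}(\lambda_n)}{n\lambda_n}} \right) = C_{\blacktriangle,p}\, \lambda_n^{s} \left( \frac{M \sqrt{m_n}}{n\lambda_n} + \sigma \sqrt{\frac{\mathcal{N}(\lambda_n)}{n\lambda_n}} \right). \]

Again, we use that $\mathcal{N}(\lambda_n) \le C_b \lambda_n^{-1/b}$ and

\[ \frac{M \sqrt{m_n}}{n\lambda_n} = o\left( \sigma \sqrt{\frac{\lambda_n^{-1/b}}{n\lambda_n}} \right), \]

provided

\[ m_n \le n^{\alpha}, \qquad \alpha < \frac{2(br+1)}{2br+b+1}. \]

Recalling that

\[ \sigma \sqrt{\frac{\lambda_n^{-1/b}}{n\lambda_n}} = R\lambda_n^r = \lambda_n^{-s} a_n, \]

we arrive at

\[ \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s (\tilde f_D^{\lambda_n} - \bar f_D^{\lambda_n}) \big\|^p_{\mathcal{H}_K} \Big]^{\frac1p} \le C_{\blacktriangle,p}\, a_n. \]

As a result, for any $p \ge 1$

\[ \limsup_{n \to \infty} \sup_{\rho \in \mathcal{M}_{\sigma,M,R}} \frac{ \mathbb{E}_{\rho^{\otimes n}} \Big[ \big\| \bar T^s (\tilde f_D^{\lambda_n} - \bar f_D^{\lambda_n}) \big\|^p_{\mathcal{H}_K} \Big]^{\frac1p} }{a_n} \le C_{\blacktriangle,p}, \]

for some $C_{\blacktriangle,p} < \infty$, not depending on the model parameters $(\sigma, M, R) \in \mathbb{R}^3_+$.

Appendix A

Proposition A.1 (see e.g. [2]). For any $n \in \mathbb{N}$, $\lambda \in (0, 1]$ and $\eta \in (0, 1)$, one has with probability at least $1 - \eta$:

\[ \big\| (\bar T + \lambda)^{-1} (\bar T - \bar T_{\mathbf{x}}) \big\|_{HS} \le 2 \log(2\eta^{-1}) \left( \frac{2}{n\lambda} + \sqrt{\frac{\mathcal{N}(\lambda)}{n\lambda}} \right). \]

Proposition A.2 (see e.g. [2]). For $n \in \mathbb{N}$, $\lambda \in (0, 1]$ and $\eta \in (0, 1]$, it holds with probability at least $1 - \eta$:

\[ \big\| (\bar T + \lambda)^{-\frac12} \big( \bar T_{\mathbf{x}} f_\rho - \bar S_{\mathbf{x}}^{*} \mathbf{y} \big) \big\|_{\mathcal{H}_K} \le 2 \log(2\eta^{-1}) \left( \frac{M}{n \sqrt{\lambda}} + \sqrt{\frac{\sigma^2 \mathcal{N}(\lambda)}{n}} \right). \]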

Proposition A.3 (Cordes Inequality, [1], Theorem IX.2.1-2). Let $A$, $B$ be two self-adjoint, positive operators on a Hilbert space. Then for any $s \in [0, 1]$:

\[ \| A^s B^s \| \le \| AB \|^s. \tag{A.1} \]

References

[1] R. Bhatia. Matrix Analysis. Springer, 1997.
[2] G. Blanchard and N. Mücke. Optimal rates for regularization of statistical inverse learning problems. Foundations of Computational Mathematics, 2017.
[3] L. Chang and Wang. Divide and conquer local average regression. arXiv preprint, 2016.
[4] G. Cheng and Z. Shang. Computational limits of divide-and-conquer method. arXiv preprint, 2015.
[5] E. De Vito and A. Caponnetto. Optimal rates for regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3), 2006.
[6] L. Dicker, D. Foster, and D. Hsu. Kernel methods and regularization techniques for nonparametric regression: Minimax optimality and adaptation. Technical report, Rutgers University, 2015.
[7] S. Dirksen. Noncommutative and vector-valued Rosenthal inequalities. PhD thesis, Delft Univ. Technology, 2011.
[8] H. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Kluwer Academic Publishers, 2000.
[9] J. C. Ferreira and V. A. Menegatto. Eigenvalues of integral operators defined by smooth positive definite kernels. Integral Equations and Operator Theory, 64, 2009.
[10] L. L. Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri. Spectral algorithms for supervised learning. Neural Computation, 20(7), 2008.
[11] Q. Guo et al. Efficient divide-and-conquer classification based on parallel feature-space decomposition for distributed systems. IEEE Systems Journal, 2015.
[12] Z.-C. Guo, S.-B. Lin, and D.-X. Zhou. Learning theory of distributed spectral algorithms. Inverse Problems, 33(7):074009, 2017.
[13] C. J. Hsieh, S. Si, and I. Dhillon. A divide-and-conquer solver for kernel support vector machine. Proceedings of the 31st International Conference on Machine Learning, 2014.
[14] R. Li, D. K. J. Lin, and B. Li. Statistical inference in massive data sets. Applied Stochastic Models in Business and Industry, 29(5), 2013.
[15] S.-B. Lin and D.-X. Zhou. Distributed kernel-based gradient descent algorithms. Constructive Approximation, May 2017.
[16] S. Lin, X. Guo, and D.-X. Zhou. Distributed learning with regularized least squares. arXiv preprint, 2016.
[17] L. Mackey, A. Talwalkar, and M. I. Jordan. Divide-and-conquer matrix factorization. Advances in Neural Information Processing Systems 24 (NIPS 2011), 2011.
[18] I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, 22(4), 1994.
[19] H. P. Rosenthal. On the subspaces of $L^p$ ($p > 2$) spanned by sequences of independent random variables. Israel J. Math., 8, 1970.
[20] C. Xu, Y. Zhang, and R. Li. On the feasibility of distributed kernel regression for big data. arXiv preprint, 2015.
[21] Y. Zhang, J. Duchi, and M. Wainwright. Divide and conquer kernel ridge regression. JMLR: Workshop and Conference Proceedings, 30, 2013.
[22] D.-X. Zhou. The covering number in learning theory. Journal of Complexity, 18(3), 2002.
[23] D.-X. Zhou. Distributed learning algorithms. Technical report, Mathematisches Forschungsinstitut Oberwolfach Report No. 33, 2016.

Institute of Mathematics, University of Potsdam, Karl-Liebknecht-Straße, Potsdam, Germany
