Parallelizing Spectrally Regularized Kernel Algorithms


Journal of Machine Learning Research 19 (2018) 1-29
Submitted 11/16; Revised 8/18; Published 8/18

Nicole Mücke
Institute of Stochastics and Applications, University of Stuttgart
Pfaffenwaldring, Stuttgart, Germany

Gilles Blanchard
Institute of Mathematics, University of Potsdam
Karl-Liebknecht-Straße, Potsdam, Germany

Editor: Ingo Steinwart

Abstract

We consider a distributed learning approach in supervised learning for a large class of spectral regularization methods in a reproducing kernel Hilbert space (RKHS) framework. The data set of size $n$ is partitioned into $m = O(n^{\alpha})$, $\alpha < \frac{1}{2}$, disjoint subsamples. On each subsample, some spectral regularization method (belonging to a large class, including in particular Kernel Ridge Regression, $L^2$-boosting and spectral cut-off) is applied. The regression function $f$ is then estimated via simple averaging, leading to a substantial reduction in computation time. We show that minimax optimal rates of convergence are preserved if $m$ grows sufficiently slowly (corresponding to an upper bound for $\alpha$) as $n \to \infty$, depending on the smoothness assumptions on $f$ and the intrinsic dimensionality. In spirit, the analysis relies on a classical bias/stochastic error analysis.

Keywords: Distributed Learning, Spectral Regularization, Minimax Optimality

Financial support by the DFG via Research Unit 1735 Structural Inference in Statistics as well as SFB 1294 Data Assimilation is gratefully acknowledged.

© 2018 Nicole Mücke and Gilles Blanchard. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v19/.

1. Introduction

Distributed learning (DL) algorithms are a standard tool for reducing computational burden in machine learning problems where massive datasets are involved. Assuming a complexity cost (for time and/or memory) of $O(n^{\beta})$ ($\beta > 1$, with $\beta \in [2, 3]$ being common) of the base learning algorithm without parallelization, dividing randomly data of cardinality $n$ into $m$ disjoint, equally-sized subsamples and processing them in parallel using the same base learning algorithm has therefore complexity cost of roughly $O(m \cdot (n/m)^{\beta}) = O(n^{\beta}/m^{\beta-1})$, gaining a factor $m^{\beta-1}$ (for time and memory) compared to the single machine approach. The final output is obtained from averaging the individual outputs.

Recently, DL was studied in several machine learning contexts: in point estimation (Li et al., 2013), matrix factorization (Mackey et al., 2011), smoothing spline models and testing (Cheng and Shang, 2016), local average regression (Chang et al., 2017), in classification (Hsieh et al., 2014; Guo et al., 2015), and also in kernel ridge regression (Zhang et al., 2013; Lin et al., 2017; Xu et al., 2016).

In this paper, we study the DL approach for the statistical learning problem

\[ Y_j := f(X_j) + \epsilon_j, \qquad j = 1, \dots, n, \tag{1} \]

at random i.i.d. data points $X_1, \dots, X_n$ drawn according to a probability distribution $\nu$ on $\mathcal{X}$, where the $\epsilon_j$ are independent centered noise variables. The unknown regression function $f$ is real-valued and belongs to some reproducing kernel Hilbert space $\mathcal{H}_K$ with bounded kernel $K$. We partition the given data set $D = \{(X_1, Y_1), \dots, (X_n, Y_n)\} \subset \mathcal{X} \times \mathbb{R}$ into $m$ disjoint equal-size subsamples $D_1, \dots, D_m$. On each subsample $D_j$, we compute a local estimator $\hat f^{\lambda}_{D_j}$, using a spectral regularization method. The final estimator for the target function $f$ is obtained by simple averaging:

\[ \bar f^{\lambda}_{D} := \frac{1}{m} \sum_{j=1}^{m} \hat f^{\lambda}_{D_j}. \]

The non-distributed setting ($m = 1$) has been studied in the recent paper of Blanchard and Mücke (2017), where weak and strong minimax optimal rates of convergence are established; that paper is the starting point for our results in the distributed setting. Our aim is to extend these results to distributed learning and to derive minimax optimal rates. We again apply a fairly large class of spectral regularization methods, including the popular kernel ridge regression (KRR), $L^2$-boosting and spectral cut-off.

Using the same notation as Blanchard and Mücke (2017), we let $T : f \mapsto \int f(x) K(x, \cdot)\, d\nu(x)$ denote the kernel integral operator associated to $K$ and the sampling measure $\nu$. We denote $\bar T = \kappa^{-2} T$, with $\kappa^2$ the upper bound of $K$.
Our rates of convergence are governed by a source condition assumption on $f$ of the form $f \in \Omega(r, R) := \{ f \in \mathcal{H}_K : f = \bar T^{r} h,\ \|h\|_{\mathcal{H}_K} \le R \}$ for some constants $r, R > 0$, as well as by the ill-posedness of the problem, as measured by an assumed power decay of the eigenvalues of $T$ with exponent $b > 1$. We show that for $s \in [0, \frac{1}{2}]$, in the sense of $p$-th moment ($p \ge 1$) expectation,

\[ \big\| \bar T^{s} ( f - \bar f^{\lambda_n}_{D} ) \big\|_{\mathcal{H}_K} \lesssim R \left( \frac{\sigma^2}{R^2 n} \right)^{\frac{r+s}{2r+1+1/b}}, \tag{2} \]

for an appropriate choice of the regularization parameter $\lambda_n$, depending on the global sample size $n$ as well as on $R$ and the noise variance $\sigma^2$ (but not on the number $m$ of subsample sets). Note that $s = 0$ corresponds to the reconstruction error (i.e., $\mathcal{H}_K$-norm), and $s = \frac{1}{2}$ to the prediction error (i.e., $L^2(\nu)$-norm). The symbol $\lesssim$ means that the inequality holds up to a multiplicative constant that can depend on various parameters entering in the assumptions of the result, but not on $n$, $m$, $\sigma$, nor $R$. An important assumption is that the inequality $q \ge r + s$ should hold, where $q$ is the qualification of the regularization method, a quantity defined in the classical theory of inverse problems (see Section 2.3 for a precise definition).

Basic problems are the choice of the regularization parameter on the subsamples and, most importantly, the proper choice of $m$, since it is well known that choosing $m$ too large gives a suboptimal convergence rate in the limit $n \to \infty$ (see, e.g., Xu et al., 2016).

Our approach to this problem is based on a relatively classical bias-variance decomposition principle. Choosing the global regularization parameter as the optimal choice for a single sample of size $n$ results in a bias estimate which is identical for all subsamples, is unchanged by averaging, and is straightforward from the single-sample analysis. On the other hand, the reduced sample size of each of the individual subsamples causes an inflation of variance. However, since the $m$ subsamples are independent, so are the outputs of the learning algorithm applied to each one of them; as a consequence, averaging reduces the inflated variance sufficiently to get minimax optimality. We can write the variance as a sum of independent random variables, which allows us to apply a Rosenthal-type inequality in the Hilbert space setting due to Pinelis (1994). The technical limiting factors in this argument give rise to the limitation on the number of subsamples $m$; for $m$ larger than the allowed range, some remainder terms are no longer negligible using our proof technique, and rate optimality is not guaranteed any longer.

The outline of the paper is as follows. Section 2 contains notation and the setting. Section 3 states our main result on distributed learning. Section 4 presents numerical studies.
A concluding discussion in Section 5 contains a more detailed comparison of our results with related results available in the literature. Section 6 contains the proofs of the theorems.

2. Notation, statistical model and distributed learning algorithm

In this section, we specify the mathematical background and the statistical model for (distributed) regularized learning. We have included this section for self-sufficiency and reader convenience. It essentially repeats the setting in Blanchard and Mücke (2017) in summarized form.

2.1 Kernel-induced operators

We assume that the input space $\mathcal{X}$ is a standard Borel space endowed with a probability measure $\nu$; the output space is equal to $\mathbb{R}$. We let $K$ be a real-valued positive semidefinite kernel on $\mathcal{X} \times \mathcal{X}$ which is bounded by $\kappa^2$. The associated reproducing kernel Hilbert space will be denoted by $\mathcal{H}_K$. It is assumed that all functions $f \in \mathcal{H}_K$ are measurable and bounded in supremum norm, i.e. $\|f\|_{\infty} \le \kappa \|f\|_{\mathcal{H}_K}$ for all $f \in \mathcal{H}_K$. Therefore, $\mathcal{H}_K$ is a subset of $L^2(\mathcal{X}, \nu)$, with $S : \mathcal{H}_K \to L^2(\mathcal{X}, \nu)$ being the inclusion operator, satisfying $\|S\| \le \kappa$. The adjoint operator $S^* : L^2(\mathcal{X}, \nu) \to \mathcal{H}_K$ is identified as

\[ S^* g = \int_{\mathcal{X}} g(x) K_x \, \nu(dx), \]

where $K_x$ denotes the element of $\mathcal{H}_K$ equal to the function $t \mapsto K(x, t)$. The covariance operator $T : \mathcal{H}_K \to \mathcal{H}_K$ is given by

\[ T = \int_{\mathcal{X}} \langle \cdot, K_x \rangle_{\mathcal{H}_K} K_x \, \nu(dx), \]

which can be shown to be positive self-adjoint trace class (and hence is compact). The empirical versions of these operators, corresponding formally to taking the empirical distribution $\hat\nu_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$ in place of $\nu$ in the above formulas, are given by

\[ S_{\mathbf{x}} : \mathcal{H}_K \to \mathbb{R}^n, \qquad (S_{\mathbf{x}} f)_j = \langle f, K_{x_j} \rangle_{\mathcal{H}_K}, \]
\[ S_{\mathbf{x}}^* : \mathbb{R}^n \to \mathcal{H}_K, \qquad S_{\mathbf{x}}^* \mathbf{y} = \frac{1}{n} \sum_{j=1}^{n} y_j K_{x_j}, \]
\[ T_{\mathbf{x}} := S_{\mathbf{x}}^* S_{\mathbf{x}} : \mathcal{H}_K \to \mathcal{H}_K, \qquad T_{\mathbf{x}} = \frac{1}{n} \sum_{j=1}^{n} \langle \cdot, K_{x_j} \rangle_{\mathcal{H}_K} K_{x_j}. \]

We introduce the shortcut notation $\bar T = \kappa^{-2} T$ and $\bar T_{\mathbf{x}} := \kappa^{-2} T_{\mathbf{x}}$, ensuring $\|\bar T\| \le 1$ and $\|\bar T_{\mathbf{x}}\| \le 1$ for any $\mathbf{x} \in \mathcal{X}^n$. Similarly, $\bar S = \kappa^{-1} S$ and $\bar S_{\mathbf{x}} := \kappa^{-1} S_{\mathbf{x}}$, ensuring $\|\bar S\| \le 1$ and $\|\bar S_{\mathbf{x}}\| \le 1$ for any $\mathbf{x} \in \mathcal{X}^n$. The numbers $\mu_j$ are the positive eigenvalues of $\bar T$, satisfying $0 < \mu_{j+1} \le \mu_j$ for all $j > 0$ and $\mu_j \searrow 0$.

2.2 Noise assumption and prior classes

In our setting of kernel learning, the sampling is assumed to be random i.i.d., where each observation point $(X_i, Y_i)$ follows the model $Y = f_\rho(X) + \varepsilon$. For $(X, Y)$ having distribution $\rho$, we assume that the conditional expectation with respect to $\rho$ of $Y$ given $X$ exists and belongs to $\mathcal{H}_K$, that is, it holds for $\nu$-almost all $x \in \mathcal{X}$:

\[ \mathbb{E}_\rho[\, Y \mid X = x \,] = \bar S_x f_\rho, \quad \text{for some } f_\rho \in \mathcal{H}_K. \tag{3} \]

Furthermore, we will make the following assumption on the observation noise distribution: there exist $\sigma > 0$ and $M > 0$ such that for any integer $l \ge 2$

\[ \mathbb{E}\big[\, | Y - \bar S_X f_\rho |^{l} \mid X \,\big] \le \frac{1}{2}\, l!\, \sigma^2 M^{l-2}, \quad \nu\text{-a.s.} \tag{4} \]

To derive nontrivial rates of convergence, we concentrate our attention on specific subsets (also called models) of the class of probability measures. If $\mathcal{P}$ denotes the set of all probability distributions on $\mathcal{X}$, we define classes of sampling distributions by introducing a decay condition on the effective dimension $\mathcal{N}(\lambda)$, being a measure for the complexity of $\mathcal{H}_K$ with respect to the marginal distribution $\nu$: for $\lambda \in (0, 1]$ we set

\[ \mathcal{N}(\lambda) = \mathrm{Trace}\big[\, (\bar T + \lambda)^{-1} \bar T \,\big]. \tag{5} \]

Note that N (λ) 1. For any $b > 1$ we introduce

\[ \mathcal{P}^{<}(b) := \big\{ \nu \in \mathcal{P} : \mathcal{N}(\lambda) \le C_b (\kappa^2 \lambda)^{-\frac{1}{b}} \big\}. \tag{6} \]

In De Vito and Caponnetto, 2006, Proposition 3, it is shown that such a condition is implied by polynomially decreasing eigenvalues of $\bar T$. More precisely, if the eigenvalues $\mu_j$ satisfy $\mu_j \le \beta / j^{b}$ for all $j \ge 1$, for some $b > 1$ and $\beta > 0$, then

\[ \mathcal{N}(\lambda) \le \beta^{\frac{1}{b}}\, \frac{b}{b-1}\, (\kappa^2 \lambda)^{-\frac{1}{b}}. \]

For a subset $\Omega \subseteq \mathcal{H}_K$, we let $\mathcal{K}(\Omega)$ be the set of regular conditional probability distributions $\rho(\cdot \mid \cdot)$ on $\mathcal{B}(\mathbb{R}) \times \mathcal{X}$ such that (3) and (4) hold for some $f_\rho \in \Omega$. We will focus on a Hölder-type source condition, i.e. given $r > 0$, $R > 0$ and $\nu \in \mathcal{P}$, we define

\[ \Omega(r, R) := \{ f \in \mathcal{H}_K : f = \bar T^{r} h,\ \|h\|_{\mathcal{H}_K} \le R \}. \tag{7} \]

Then the class of models which we will consider will be defined as

\[ \mathcal{M}(r, R, \mathcal{P}') := \big\{ \rho(dx, dy) = \rho(dy \mid x)\, \nu(dx) : \rho(\cdot \mid \cdot) \in \mathcal{K}(\Omega(r, R)),\ \nu \in \mathcal{P}' \big\}, \tag{8} \]

with $\mathcal{P}' = \mathcal{P}^{<}(b)$. As a consequence, the class of models depends not only on the smoothness properties of the solution (reflected in the parameters $R > 0$, $r > 0$), but also essentially on spectral properties of $\bar T$, reflected in $\mathcal{N}(\lambda)$.

2.3 Spectral regularization

In this subsection, we introduce the class of linear regularization methods based on spectral theory for self-adjoint linear operators. These are standard methods for finding stable solutions for ill-posed inverse problems. Originally, these methods were developed in the deterministic context (see Engl et al., 2000). Later on, they have been applied to probabilistic problems in machine learning (see, e.g., Bauer et al., 2007; De Vito and Caponnetto, 2006; Dicker et al., 2017 or Blanchard and Mücke, 2017).

Definition 1 (Regularization function) Let $g : (0, 1] \times [0, 1] \to \mathbb{R}$ be a function and write $g_\lambda = g(\lambda, \cdot)$. The family $\{g_\lambda\}_\lambda$ is called a regularization function if the following conditions hold:

(i) There exists a constant $D < \infty$ such that for any $0 < \lambda \le 1$

\[ \sup_{0 \le t \le 1} | t\, g_\lambda(t) | \le D. \tag{9} \]

(ii) There exists a constant $E < \infty$ such that for any $0 < \lambda \le 1$

\[ \sup_{0 \le t \le 1} | g_\lambda(t) | \le \frac{E}{\lambda}. \tag{10} \]

(iii) Defining the residual $r_\lambda(t) := 1 - g_\lambda(t)\, t$, there exists a constant $\gamma_0 < \infty$ such that for any $0 < \lambda \le 1$

\[ \sup_{0 \le t \le 1} | r_\lambda(t) | \le \gamma_0. \]

It has been shown in e.g. Gerfo et al. (2008), Dicker et al. (2017), Blanchard and Mücke (2017) that attainable learning rates are essentially linked with the qualification of the regularization $\{g_\lambda\}_\lambda$, being the maximal $q$ such that for any $0 < \lambda \le 1$

\[ \sup_{0 \le t \le 1} | r_\lambda(t) |\, t^{q} \le \gamma_q \lambda^{q} \tag{11} \]

for some constant $\gamma_q > 0$. Note that by (iii), using interpolation, (11) is also valid for any $q' \in [0, q]$ with constant $\gamma_{q'} = \gamma_0^{1 - q'/q}\, \gamma_q^{q'/q}$. The most popular examples include:

Example 1 (Tikhonov Regularization, Kernel Ridge Regression) The choice $g_\lambda(t) = \frac{1}{\lambda + t}$ corresponds to Tikhonov regularization. In this case we have $D = E = \gamma_0 = 1$. The qualification of this method is $q = 1$ with $\gamma_q = 1$.

Example 2 (Landweber Iteration, gradient descent) The Landweber iteration (gradient descent algorithm with constant stepsize) is defined by

\[ g_k(t) = \sum_{j=0}^{k-1} (1 - t)^{j} \quad \text{with } k = 1/\lambda \in \mathbb{N}. \]

We have $D = E = \gamma_0 = 1$. The qualification $q$ of this algorithm can be arbitrary, with $\gamma_q = 1$ if $0 < q \le 1$ and $\gamma_q = q^{q}$ if $q > 1$.

Example 3 ($\nu$-method) The $\nu$-method belongs to the class of so-called semi-iterative regularization methods. This method has finite qualification $q = \nu$ with $\gamma_q$ a positive constant. Moreover, $D = 1$ and $E = 2$. The filter is given by $g_k(t) = p_k(t)$, a polynomial of degree $k - 1$, with regularization parameter $\lambda \sim k^{-2}$, which makes this method much faster than, e.g., gradient descent.
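The constants in Definition 1 can be checked numerically for the filters of Examples 1 and 2. The following is a minimal sketch, not part of the original paper: the function names (`g_tikhonov`, `g_landweber`, `filter_constants`) and the grid-based approximation of the suprema over $t \in [0, 1]$ are illustrative assumptions.

```python
import numpy as np

def g_tikhonov(lam, t):
    """Tikhonov / KRR filter of Example 1: g_lambda(t) = 1 / (lambda + t)."""
    return 1.0 / (lam + t)

def g_landweber(lam, t):
    """Landweber filter of Example 2: sum_{j<k} (1 - t)^j with k = round(1 / lambda)."""
    k = max(int(round(1.0 / lam)), 1)
    j = np.arange(k)[:, None]          # exponents 0, ..., k-1 as a column
    return np.sum((1.0 - t[None, :]) ** j, axis=0)

def filter_constants(g, lam, q, grid_size=2001):
    """Approximate the suprema in (9), (10), (iii) and (11) on a grid of [0, 1].

    Returns grid estimates of D, E, gamma_0 and gamma_q for one fixed lambda.
    """
    t = np.linspace(0.0, 1.0, grid_size)
    gt = g(lam, t)
    residual = 1.0 - gt * t            # r_lambda(t) = 1 - g_lambda(t) t
    return {
        "D": np.max(np.abs(t * gt)),
        "E": lam * np.max(np.abs(gt)),
        "gamma_0": np.max(np.abs(residual)),
        "gamma_q": np.max(np.abs(residual) * t**q) / lam**q,
    }

if __name__ == "__main__":
    print(filter_constants(g_tikhonov, 0.01, q=1.0))
    print(filter_constants(g_landweber, 0.01, q=2.0))
```

For a single value of $\lambda$ the grid estimates should stay below the constants quoted in the examples ($D = E = \gamma_0 = 1$, $\gamma_1 = 1$ for Tikhonov, $\gamma_q = q^q$ for Landweber with $q > 1$); this is a sanity check on one $\lambda$, not a proof of the uniform bounds.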

2.4 Distributed learning algorithm

We let $D = \{(x_j, y_j)\}_{j=1}^{n} \subset \mathcal{X} \times \mathcal{Y}$ be the dataset, which we partition into $m$ disjoint subsamples (see Footnote 1) $D_1, \dots, D_m$, each having size $\frac{n}{m}$. Denote the $j$th data subsample by $(\mathbf{x}_j, \mathbf{y}_j) \in (\mathcal{X} \times \mathbb{R})^{\frac{n}{m}}$. On each subsample we compute a local estimator for a suitable a-priori parameter choice $\lambda = \lambda_n$ according to

\[ f^{\lambda_n}_{D_j} := g_{\lambda_n}( \bar T_{\mathbf{x}_j} )\, \bar S^*_{\mathbf{x}_j} \mathbf{y}_j. \tag{12} \]

By $f^{\lambda}_{D}$ we will denote the estimator using the whole sample ($m = 1$). The final estimator is given by simple averaging of the local ones:

\[ \bar f^{\lambda}_{D} := \frac{1}{m} \sum_{j=1}^{m} f^{\lambda}_{D_j}. \tag{13} \]

Footnote 1. For the sake of simplicity, throughout this paper we assume that $n$ is divisible by $m$. This could always be achieved by disregarding some data; alternatively, it is straightforward to show that admitting one smaller block in the partition does not affect the asymptotic results of this paper. We shall not try to discuss this point in greater detail. In particular, we shall not analyze in which general framework our simple averages could be replaced by weighted averages.

3. Main results

This section presents our main results. Theorem 3 and Theorem 4 contain separate estimates on the approximation error and the sample error and lead to Corollary 5, which gives an upper bound for the error $\| \bar T^{s} ( f_\rho - \bar f^{\lambda}_{D} ) \|_{\mathcal{H}_K}$ and presents an upper rate of convergence for the sequence of distributed learning algorithms. For the reader's convenience we recall Theorem 6, which was already shown in Blanchard and Mücke (2017), presenting the minimax optimal rate for the single machine problem. This yields an estimate on the difference between the single machine and the distributed learning algorithm in Corollary 7.

We want to track the precise behavior of these rates not only for what concerns the exponent in the number of examples $n$, but also in terms of their scaling (multiplicative constant) as a function of some important parameters (namely the noise variance $\sigma^2$ and the complexity radius $R$ in the source condition, see Remark 9 below). For this reason, we introduce a notion of a family of rates over a family of models. More precisely, we consider an indexed family $(\mathcal{M}_\theta)_{\theta \in \Theta}$, where for all $\theta \in \Theta$, $\mathcal{M}_\theta$ is a class of Borel probability distributions on $\mathcal{X} \times \mathbb{R}$ satisfying the basic general assumptions (3) and (4). We consider rates of convergence in the sense of the $p$-th moments of the estimation error, where $1 \le p < \infty$ is a fixed real number.
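The distributed estimator (12)-(13) of Section 2.4 can be sketched for the Tikhonov filter (KRR) as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the standard coefficient form of KRR with ridge term $n_j \lambda$ on a block of size $n_j$ and ignores the $\kappa$-normalization of the operators, it borrows the kernel $K(x, t) = \min(x, t) - xt$ from Section 4, and all function names are hypothetical.

```python
import numpy as np

def sobolev_kernel(x, t):
    """Gram matrix of K(x, t) = min(x, t) - x * t on [0, 1] (kernel of Section 4)."""
    return np.minimum(x[:, None], t[None, :]) - x[:, None] * t[None, :]

def krr_fit(x, y, lam):
    """Local KRR on one block: coefficients c with f = sum_i c_i K(x_i, .).

    Assumption: standard ridge form (K + n * lam * I) c = y, kappa set aside.
    """
    n = len(x)
    return np.linalg.solve(sobolev_kernel(x, x) + n * lam * np.eye(n), y)

def distributed_krr(x, y, lam, m, rng):
    """Split the data at random into m blocks, fit KRR on each, average as in (13)."""
    perm = rng.permutation(len(x))
    blocks = np.array_split(perm, m)   # also handles n not divisible by m
    fits = [(x[b], krr_fit(x[b], y[b], lam)) for b in blocks]

    def predict(t):
        # average of the m local predictors evaluated at the points t
        return np.mean([sobolev_kernel(t, xb) @ cb for xb, cb in fits], axis=0)

    return predict

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 400
    x = rng.uniform(size=n)
    f_rho = lambda u: 0.5 * u * (1.0 - u)   # target of Section 4.1
    y = f_rho(x) + 0.01 * rng.standard_normal(n)  # noise level chosen for illustration
    t = np.linspace(0.0, 1.0, 101)
    for m in (1, 4):
        err = np.max(np.abs(distributed_krr(x, y, 1e-4, m, rng)(t) - f_rho(t)))
        print(m, err)
```

Each local solve costs on the order of $(n/m)^3$ operations instead of $n^3$, which is the source of the computational gain described in the Introduction; the same $\lambda$ is used on every block, as in (12).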

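As already mentioned in the introduction, our proofs are based on a classical bias-variance decomposition as follows: introducing

\[ \tilde f^{\lambda}_{D} = \frac{1}{m} \sum_{j=1}^{m} g_\lambda( \bar T_{\mathbf{x}_j} )\, \bar T_{\mathbf{x}_j} f_\rho, \tag{14} \]

we write

\[ \bar T^{s} ( f_\rho - \bar f^{\lambda}_{D} ) = \bar T^{s} ( f_\rho - \tilde f^{\lambda}_{D} ) + \bar T^{s} ( \tilde f^{\lambda}_{D} - \bar f^{\lambda}_{D} ) = \frac{1}{m} \sum_{j=1}^{m} \bar T^{s} r_\lambda( \bar T_{\mathbf{x}_j} ) f_\rho + \frac{1}{m} \sum_{j=1}^{m} \bar T^{s} g_\lambda( \bar T_{\mathbf{x}_j} ) \big( \bar T_{\mathbf{x}_j} f_\rho - \bar S^*_{\mathbf{x}_j} \mathbf{y}_j \big). \tag{15} \]

In all the forthcoming results in this section, we assume:

Assumption 2 Let $s \in [0, \frac{1}{2}]$, $p \ge 1$ and consider the model $\mathcal{M}_{\sigma,M,R} := \mathcal{M}(r, R, \mathcal{P}^{<}(b))$, where $r > 0$ and $b > 1$ are fixed, and $\theta = (R, M, \sigma)$ varies in $\Theta = \mathbb{R}^3_+$. Given a sample $D \subset (\mathcal{X} \times \mathbb{R})^n$ of size $n$, define $f^{\lambda_n}_{D}$, $\bar f^{\lambda_n}_{D}$ as in Section 2.4 and $\tilde f^{\lambda_n}_{D}$ as in (14), using a regularization function of qualification $q \ge r + s$, with parameter sequence

\[ \lambda_n := \lambda_{n,(\sigma,R)} := \min\left( \left( \frac{\sigma^2}{R^2 n} \right)^{\frac{b}{2br+b+1}},\ 1 \right), \tag{16} \]

independent of $M$. Define the sequence

\[ a_n := a_{n,(\sigma,R)} := R \left( \frac{\sigma^2}{R^2 n} \right)^{\frac{b(r+s)}{2br+b+1}}. \tag{17} \]

We recall that we shall always assume that $n$ is a multiple of $m$. With these preparations, our main results are:

Theorem 3 (Approximation error) Under Assumption 2, we have: if the number $m_n$ of subsample sets satisfies

\[ m_n \le n^{\alpha}, \qquad \alpha < \frac{2b \min\{r, 1\}}{2br + b + 1}, \tag{18} \]

then

\[ \sup_{(\sigma,M,R) \in \mathbb{R}^3_+} \limsup_{n \to \infty} \sup_{\rho \in \mathcal{M}_{\sigma,M,R}} \frac{ \mathbb{E}_{\rho^{\otimes n}} \big[ \| \bar T^{s} ( f_\rho - \tilde f^{\lambda_n}_{D} ) \|^{p}_{\mathcal{H}_K} \big]^{\frac{1}{p}} }{ a_n } < \infty. \]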

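Theorem 4 (Sample error) Under Assumption 2, we have: if the number $m_n$ of subsample sets satisfies

\[ m_n \le n^{\alpha}, \qquad \alpha < \frac{2br}{2br + b + 1}, \tag{19} \]

then

\[ \sup_{(\sigma,M,R) \in \mathbb{R}^3_+} \limsup_{n \to \infty} \sup_{\rho \in \mathcal{M}_{\sigma,M,R}} \frac{ \mathbb{E}_{\rho^{\otimes n}} \big[ \| \bar T^{s} ( \tilde f^{\lambda_n}_{D} - \bar f^{\lambda_n}_{D} ) \|^{p}_{\mathcal{H}_K} \big]^{\frac{1}{p}} }{ a_n } < \infty. \]

And, as a consequence (by (15) and applying the triangle inequality for the $L^p$-norm):

Corollary 5 Under Assumption 2, we have: if the number $m_n$ of subsample sets satisfies

\[ m_n \le n^{\alpha}, \qquad \alpha < \frac{2b \min\{r, 1\}}{2br + b + 1}, \tag{20} \]

then the sequence (17) is an upper rate of convergence in $L^p$ for all $p > 0$, for the interpolation norm of parameter $s$, for the sequence of estimated solutions $( \bar f^{\lambda_{n,(\sigma,R)}}_{D} )$ over the family of models $(\mathcal{M}_{\sigma,M,R})_{(\sigma,M,R) \in \mathbb{R}^3_+}$, i.e.

\[ \sup_{(\sigma,M,R) \in \mathbb{R}^3_+} \limsup_{n \to \infty} \sup_{\rho \in \mathcal{M}_{\sigma,M,R}} \frac{ \mathbb{E}_{\rho^{\otimes n}} \big[ \| \bar T^{s} ( f_\rho - \bar f^{\lambda_n}_{D} ) \|^{p}_{\mathcal{H}_K} \big]^{\frac{1}{p}} }{ a_n } < \infty. \]

Theorem 6 (Blanchard and Mücke, 2017) The sequence (17) is an upper rate of convergence in $L^p$ for all $p > 0$, for the interpolation norm of parameter $s$, for the sequence of estimated solutions $( f^{\lambda_{n,(\sigma,R)}}_{D} )$ over the family of models $(\mathcal{M}_{\sigma,M,R})_{(\sigma,M,R) \in \mathbb{R}^3_+}$, i.e.

\[ \sup_{(\sigma,M,R) \in \mathbb{R}^3_+} \limsup_{n \to \infty} \sup_{\rho \in \mathcal{M}_{\sigma,M,R}} \frac{ \mathbb{E}_{\rho^{\otimes n}} \big[ \| \bar T^{s} ( f_\rho - f^{\lambda_n}_{D} ) \|^{p}_{\mathcal{H}_K} \big]^{\frac{1}{p}} }{ a_n } < \infty. \]

Combining Corollary 5 with Theorem 6 by applying the triangle inequality immediately yields:

Corollary 7 If the number $m_n$ of subsample sets satisfies

\[ m_n \le n^{\alpha}, \qquad \alpha < \frac{2b \min\{r, 1\}}{2br + b + 1}, \tag{21} \]

then

\[ \sup_{(\sigma,M,R) \in \mathbb{R}^3_+} \limsup_{n \to \infty} \sup_{\rho \in \mathcal{M}_{\sigma,M,R}} \frac{ \mathbb{E}_{\rho^{\otimes n}} \big[ \| \bar T^{s} ( f^{\lambda_n}_{D} - \bar f^{\lambda_n}_{D} ) \|^{p}_{\mathcal{H}_K} \big]^{\frac{1}{p}} }{ a_n } < \infty. \]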
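To make the quantities in (16), (17) and (20) concrete, the following small sketch evaluates them for given parameters. The function names are illustrative assumptions; the formulas are transcribed from the equations above.

```python
def alpha_bound(r, b):
    """Strict upper bound on alpha in (20): 2b min{r, 1} / (2br + b + 1)."""
    return 2.0 * b * min(r, 1.0) / (2.0 * b * r + b + 1.0)

def lambda_n(n, sigma, R, r, b):
    """Regularization parameter (16): min((sigma^2 / (R^2 n))^(b / (2br + b + 1)), 1)."""
    return min((sigma**2 / (R**2 * n)) ** (b / (2.0 * b * r + b + 1.0)), 1.0)

def rate_a_n(n, sigma, R, r, b, s):
    """Rate (17): R (sigma^2 / (R^2 n))^(b (r + s) / (2br + b + 1))."""
    return R * (sigma**2 / (R**2 * n)) ** (b * (r + s) / (2.0 * b * r + b + 1.0))

if __name__ == "__main__":
    n, sigma, R, r, b, s = 10**4, 1.0, 1.0, 0.75, 2.0, 0.0
    print("alpha <", alpha_bound(r, b))
    print("lambda_n =", lambda_n(n, sigma, R, r, b))
    print("a_n =", rate_a_n(n, sigma, R, r, b, s))
```

For example, with $b = 2$ and $r = 0.75$ the bound in (20) is $\alpha < 3/6 = 0.5$, so for $n = 10^4$ the number of subsamples may grow up to (but not including) order $n^{0.5} = 100$; with $\sigma = R = 1$ and $s = 0$ the rate (17) is $a_n = (10^{-4})^{0.25}$, approximately $0.1$.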

Remark 8 Our results in the distributed setting slightly differ from those obtained in Theorem 6 from Blanchard and Mücke (2017) in two respects:

While in the single machine approach rates of convergence are obtained for any $p > 0$, the proofs in Section 6 only hold for $p \ge 1$ due to loss of subadditivity of $p$-th moments for $0 < p < 1$.

While the upper rates of convergence in Blanchard and Mücke (2017) are derived over classes of marginals $\nu$ induced by assuming a decay condition for the eigenvalues of $\bar T$, we somewhat enlarge this class by assuming a decay condition for $\mathcal{N}(\lambda)$ in (6). Theorem 6 also holds under this weaker condition. Note that it is an open problem whether lower rates of convergence can also be obtained by weakening the condition for eigenvalue decay.

Remark 9 (Signal-to-noise-ratio) Our results show that the choice of the regularization parameter $\lambda_n$ in (16), and thus the rate of convergence $a_n$ in (17), highly depend on the signal-to-noise-ratio $\frac{\sigma^2}{R^2}$, a quantity which naturally appears in the theory of regularization of ill-posed inverse problems. As a general rule, the degree of regularization should increase with the level of noise in the data, i.e., the importance of the priors should increase as the model fit decreases. Our theoretical results precisely show this behavior.

4. Numerical studies

In this section we numerically study the error in $\mathcal{H}_K$-norm, corresponding to $s = 0$ in Corollary 5 (in expectation with $p = 2$), both in the single machine and distributed learning setting. Our main interest is to study the upper bound for our theoretical exponent $\alpha$, parametrizing the size of subsamples in terms of the total sample size, $m = n^{\alpha}$, in different smoothness regimes. In addition we shall demonstrate in which way parallelization serves as a form of regularization.

More specifically, we let $\mathcal{H}_K = H^1_0[0, 1]$ be the Sobolev space consisting of absolutely continuous functions $f$ on $[0, 1]$ with weak derivative of order 1 in $L^2[0, 1]$, with boundary condition $f(0) = f(1) = 0$. The reproducing kernel is given by $K(x, t) = x \wedge t - xt$. For all experiments in this section, we simulate data from the regression model

\[ Y_i = f_\rho(X_i) + \epsilon_i, \qquad i = 1, \dots, n, \]

where the input variables $X_i \sim \mathrm{Unif}[0, 1]$ are uniformly distributed and the noise variables $\epsilon_i \sim N(0, \sigma^2)$ are normally distributed with standard deviation $\sigma$. We choose the target function $f_\rho$ according to two different cases, namely $r < 1$ (low smoothness regime) and $r = \infty$ (high smoothness regime). To accurately determine the degree of smoothness $r > 0$, we apply Proposition 10 below by explicitly calculating the Fourier coefficients $( \langle f_\rho, e_j \rangle_{\mathcal{H}_K} )_{j \in \mathbb{N}}$, where $e_j(x) = \frac{\sqrt{2}}{\pi j} \sin(\pi j x)$, for $j \in \mathbb{N}$, forms an ONB of $\mathcal{H}_K$. Recall that the rate of eigenvalue decay is explicitly given by $b = 2$, meaning that we have full control over all parameters in (21). We need the following characterization:

Proposition 10 (Engl et al., 2000, Prop. 3.13) Let $\mathcal{H}_1$, $\mathcal{H}_2$ be separable Hilbert spaces and $S : \mathcal{H}_1 \to \mathcal{H}_2$ be a compact linear operator with singular system (see Footnote 2) $\{\sigma_j, \varphi_j, \psi_j\}$. Denoting by $S^{\dagger}$ the generalized inverse (see Footnote 3) of $S$, one has for any $r > 0$ and $g \in \mathcal{H}_2$: $g$ is in the domain of $S^{\dagger}$ and $S^{\dagger} g \in \mathrm{Im}((S^* S)^{r})$ if and only if

\[ \sum_{j=0}^{\infty} \frac{ | \langle g, \psi_j \rangle_{\mathcal{H}_2} |^2 }{ \sigma_j^{2+4r} } < \infty. \]

Footnote 2. I.e., the $\varphi_j$ are the normalized eigenfunctions of $S^* S$ with eigenvalues $\sigma_j^2$ and $\psi_j = S\varphi_j / \|S\varphi_j\|$; thus $S = \sum_j \sigma_j \langle \varphi_j, \cdot \rangle \psi_j$.

Footnote 3. The unique unbounded linear operator with domain $\mathrm{Im}(S) \oplus (\mathrm{Im}(S))^{\perp}$ in $\mathcal{H}_2$ vanishing on $(\mathrm{Im}(S))^{\perp}$ and satisfying $S S^{\dagger} = 1$ on $\mathrm{Im}(S)$, with range orthogonal to the null space $N(S)$.

In our case, $\mathcal{H}_1$ is as above, $\mathcal{H}_2$ is $L^2([0, 1])$ with Lebesgue measure and $S : H^1_0[0, 1] \to L^2([0, 1])$ is the inclusion. Since $H^1_0[0, 1]$ is dense in $L^2([0, 1])$, we know that $(\mathrm{Im}(S))^{\perp}$ is trivial, giving $S S^{\dagger} = \mathrm{id}$ on $\mathrm{Im}(S)$. Furthermore, $\varphi_j = e_j$ is a normalized eigenbasis of $T = S^* S$ with eigenvalues $\sigma_j^2 = (\pi j)^{-2}$. With $\psi_j = \frac{S\varphi_j}{\|S\varphi_j\|_{L^2}}$ we obtain for $f \in H^1_0[0, 1]$

\[ \langle S f, \psi_j \rangle_{L^2} = \Big\langle S f, \frac{S e_j}{\|S e_j\|_{L^2}} \Big\rangle_{L^2} = \Big\langle f, \frac{S^* S e_j}{\|S e_j\|_{L^2}} \Big\rangle_{H^1_0} = \sigma_j \langle f, e_j \rangle_{H^1_0}. \]

Thus, applying Proposition 10 gives

Corollary 11 For $S$ and $T = S^* S$ defined in Section 2, we have for any $r > 0$: $f \in \mathrm{Im}(T^{r})$ if and only if

\[ \sum_{j=1}^{\infty} j^{4r}\, | \langle f, e_j \rangle_{H^1_0} |^2 < \infty. \]

Thus, as expected, abstract smoothness measured by the parameter $r$ in the source condition corresponds in this special case to decay of the classical Fourier coefficients, which by the classical theory of Fourier series measures smoothness of the periodic continuation of $f \in L^2([0, 1])$ to the real line.

4.1 Low smoothness regime

We choose $f_\rho(x) = \frac{1}{2} x (1 - x)$, which clearly belongs to $\mathcal{H}_K$. A straightforward calculation gives the Fourier coefficients $\langle f_\rho, e_j \rangle = 2\sqrt{2}\, (\pi j)^{-2}$ for $j$ odd (vanishing for $j$ even). Thus, by the above criterion, $f_\rho$ satisfies the source condition $f_\rho \in \mathrm{Ran}(\bar T^{r})$ precisely for $0 < r < 0.75$. (Observe that although $f_\rho$ is smooth on $[0, 1]$, its periodic continuation on the real line is not, hence the low smoothness regime.) According to Theorem 6, the worst case rate in the single machine problem is given by $n^{-\gamma}$, with $\gamma = 0.25$. Regularization is done using the $\nu$-method (see Example 3), with qualification $q = \nu = 1$. Recall that the stopping index $k_{\mathrm{stop}}$ serves as the regularization parameter $\lambda$, where $\lambda \sim k_{\mathrm{stop}}^{-2}$. We consider sample sizes from 500 to 9000.

In the model selection step, we estimate the performance of different models and choose the oracle stopping time $\hat k_{\mathrm{oracle}}$ by minimizing the reconstruction error

\[ \hat k_{\mathrm{oracle}} = \arg\min_{k} \frac{1}{M} \sum_{j=1}^{M} \big\| f_\rho - \hat f^{\,j}_{k} \big\|_{\mathcal{H}_K} \]

over $M = 30$ runs.

In the model assessment step, we partition the dataset into $m \sim n^{\alpha}$ subsamples, for any $\alpha \in \{0, 0.05, 0.1, \dots, 0.85\}$. On each subsample we regularize using the oracle stopping time $\hat k_{\mathrm{oracle}}$ (determined by using the whole sample). Corresponding to Corollary 5, the accuracy should be comparable to the one using the whole sample as long as $\alpha < 0.5$. In Figure 1 (left panel) we plot the reconstruction error $\| \bar f_{\hat k} - f_\rho \|_{\mathcal{H}_K}$ versus the ratio $\alpha = \log(m)/\log(n)$ for different sample sizes. We execute each simulation $M = 30$ times. The plot supports our theoretical finding. The right panel shows the reconstruction error versus the total number of samples using different partitions of the data. The black curve ($\alpha = 0$) corresponds to the baseline error ($m = 1$, no partition of data). Error curves below a threshold $\alpha < 0.6$ are roughly comparable, whereas curves above this threshold show a gap in performance.

In another experiment we study the performance under (very) different regularization: only partitioning the data (no regularization), underregularization (higher stopping index) and overregularization (lower stopping index). The outcome of this experiment amplifies the regularization effect of parallelizing. Figure 2 shows the main point: overregularization is always hopeless, underregularization is better. In the extreme case of (almost) no regularization, there is a sharp minimum in the reconstruction error which is only slightly larger than the minimax optimal value for the oracle regularization parameter and which is achieved at an attractively large degree of parallelization. Qualitatively, this agrees very well with the intuitive notion that parallelizing serves as regularization.
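The Fourier coefficients used in Section 4.1 to determine the smoothness index $r$ can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the paper: it takes the inner product on $H^1_0[0, 1]$ to be $\langle f, g \rangle = \int_0^1 f'(x) g'(x)\, dx$ (the inner product for which $K(x, t) = x \wedge t - xt$ is the reproducing kernel), the basis $e_j(x) = \frac{\sqrt{2}}{\pi j} \sin(\pi j x)$, and a composite trapezoidal rule; the function name is hypothetical.

```python
import numpy as np

def h01_coefficient(df, j, grid_size=20001):
    """Approximate <f, e_j> in H_0^1[0, 1], i.e. the integral of f'(x) e_j'(x).

    df is the derivative f' of the function f; e_j'(x) = sqrt(2) cos(pi j x).
    """
    x = np.linspace(0.0, 1.0, grid_size)
    integrand = df(x) * np.sqrt(2.0) * np.cos(np.pi * j * x)
    h = x[1] - x[0]
    # composite trapezoidal rule on the uniform grid
    return h * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))

if __name__ == "__main__":
    df_rho = lambda x: 0.5 - x   # derivative of f_rho(x) = x (1 - x) / 2
    for j in range(1, 6):
        closed_form = 2.0 * np.sqrt(2.0) / (np.pi * j) ** 2 if j % 2 == 1 else 0.0
        print(j, h01_coefficient(df_rho, j), closed_form)
```

With coefficients of order $j^{-2}$, the series $\sum_j j^{4r} |\langle f_\rho, e_j \rangle|^2$ behaves like $\sum_j j^{4r-4}$, which converges exactly when $4r - 4 < -1$, that is, for $r < 0.75$.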
We emphasize that the numerical results seem to indicate that parallelization is possible to a slightly larger degree than indicated by our theoretical estimate. A similar result was reported in the paper by Zhang et al. (2013), which also treats the low smoothness case.

Figure 1: The reconstruction error $\| \bar f^{\hat k_{\mathrm{oracle}}}_{D} - f_\rho \|_{\mathcal{H}_K}$ in the low smoothness case. Left plot: reconstruction error curves for various (but fixed) total sample sizes ($n = 500, 1500, 3000, 6000, 9000$), as a function of the number of subsamples, plotted against $\log(m)/\log(n)$. Right plot: reconstruction error curves for various subsample number scalings $m = n^{\alpha}$ ($\alpha = 0, 0.2, 0.4, 0.6, 0.7$), as a function of the sample size (on log-scale).

Figure 2: The reconstruction error $\| \bar f^{\lambda}_{D} - f_\rho \|_{\mathcal{H}_K}$ in the low smoothness case. Left plot: error curves for different stopping times ($T = 5, 15, 31$ (oracle), $100, 500$, and no regularization) for $n = 500$, as a function of the number of subsamples, plotted against $\log(m)/\log(n)$. Right plot: error curves for different stopping times ($T = 5, 15, 45$ (oracle), $100, 500$, and no regularization) for $n = 5000$, as a function of the number of subsamples.

4.2 High smoothness regime

We choose $f_\rho(x) = \frac{1}{2\pi} \sin(2\pi x)$, which corresponds to just one non-vanishing Fourier coefficient and, by our criterion in Corollary 11, has $r = \infty$. In view of our main Corollary 5 this requires a regularization method with higher qualification; we take the Gradient Descent method (see Example 2). The appearance of the term $2b \min\{1, r\}$ in our theoretical result (Corollary 5) gives a predicted value $\alpha = 0$ (and would imply that parallelization is strictly forbidden for infinite smoothness). More specifically, the left panel in Figure 3 shows the absence of any plateau for the reconstruction error as a function of $\alpha$. This corresponds to the right panel showing that no group of values of $\alpha$ performs roughly equivalently, meaning that we do not have any optimality guarantees.

Figure 3: The reconstruction error $\| \bar f^{\lambda_{\mathrm{oracle}}}_{D} - f_\rho \|_{\mathcal{H}_K}$ in the high smoothness case. Left plot: reconstruction error curves for various (but fixed) sample sizes ($n = 500, 1500, 3000, 6000, 8000$), as a function of the number of subsamples, plotted against $\log(m)/\log(n)$. Right plot: reconstruction error curves for various subsample number scalings $m = n^{\alpha}$ ($\alpha = 0, 0.1, 0.2, 0.4, 0.6$), as a function of the sample size (on log-scale).

Plotting different values of regularization in Figure 4, we again identify overregularization as hopeless, while severe underregularization exhibits a sharp minimum in the reconstruction error. But its value at roughly 0.25 is much less attractive compared to the case of low smoothness, where the error is an order of magnitude less.

5. Discussion

Minimax Optimality: We have shown that for a large class of spectral regularization methods the error of the distributed algorithm $\| \bar T^{s} ( \bar f^{\lambda_n}_{D} - f_\rho ) \|_{\mathcal{H}_K}$ satisfies the same upper bound as the error $\| \bar T^{s} ( f^{\lambda_n}_{D} - f_\rho ) \|_{\mathcal{H}_K}$ for the single machine problem, if the regularization parameter $\lambda_n$ is chosen according to (16), provided the number of subsamples grows sufficiently slowly with the sample size $n$. Since the rates for the latter are minimax optimal (Blanchard and Mücke, 2017), our rates in Corollary 5 are minimax optimal also.

Comparison with other results: Zhang et al. (2013) derive minimax-optimal rates in three settings: finite rank kernels, sub-Gaussian decay of eigenvalues of the kernel and polynomial decay, provided $m$ satisfies a certain upper bound, depending on the rate of decay of the eigenvalues, under two crucial assumptions on the eigenfunctions of the integral operator associated to the kernel: for any $j \in \mathbb{N}$

\[ \mathbb{E}\big[ \phi_j(X)^{2k} \big] \le \rho^{2k}, \tag{22} \]

for some $k \ge 2$ and $\rho < \infty$ or, even stronger, it is assumed that the eigenfunctions are uniformly bounded, i.e.

\[ \sup_{x \in \mathcal{X}} | \phi_j(x) | \le \rho, \tag{23} \]

for any $j \in \mathbb{N}$ and some $\rho < \infty$.

Figure 4: The reconstruction error $\| \bar f^{\lambda}_{D} - f_\rho \|$ in the high smoothness case. Left plot: error curves for different stopping times ($T = 10, 100, 223$ (oracle), $500, 900$, and no regularization) for $n = 500$ samples, as a function of the number of subsamples, plotted against $\log(m)/\log(n)$. Right plot: error curves for different stopping times ($T = 10, 100, 219$ (oracle), $500, 900$, and no regularization) for $n = 5000$ samples, as a function of the number of subsamples (the panel title in the source reads $n = 3000$).

We shall describe in more detail the case of polynomially decaying eigenvalues, which corresponds to our setting. Assuming eigenvalue decay $\mu_j \lesssim j^{-b}$ with $b > 1$, the authors choose a regularization parameter $\lambda_n = n^{-\frac{b}{b+1}}$ and

\[ m \lesssim \left( \frac{ n^{\frac{b(k-4)-k}{b+1}} }{ \rho^{4k} \log^{k}(n) } \right)^{\frac{1}{k-2}}, \]

leading to an error in $L^2$-norm

\[ \mathbb{E}\big[ \| \bar f^{\lambda_n}_{D} - f_\rho \|^2_{L^2} \big] \lesssim n^{-\frac{b}{b+1}}, \]

being minimax optimal. Note that this choice of $\lambda_n$ and the resulting rate correspond to our case $r = 0$, i.e., no smoothness of $f_\rho$ is assumed (just that $f_\rho$ belongs to the RKHS). For $k < 4$ the bound becomes less meaningful (compared to the case where $k \ge 4$) since $m \to 0$ as $n \to \infty$ in this case (for any sort of eigenvalue decay). On the other hand, if $k$ and $b$ may be taken arbitrarily large, corresponding to almost bounded eigenfunctions and arbitrarily large polynomial decay of eigenvalues, $m$ may be chosen proportional to $n^{1-\epsilon}$, for any $\epsilon > 0$. As might be expected, replacing the $L^{2k}$ bound on the eigenfunctions by a bound in $L^{\infty}$ gives an upper bound on $m$ which simply is the limit for $k \to \infty$ in the bound given above, namely

\[ m \lesssim \frac{ n^{\frac{b-1}{b+1}} }{ \rho^{4} \log n }, \]

which for large $b$ behaves as above. Granted bounds on the eigenfunctions in $L^{2k}$ for (very) large $k$, this is a strong result. While the decay rate of the eigenvalues can be determined by the smoothness of $K$ (see, e.g., Ferreira and Menegatto, 2009 and references therein), it is a widely open question which general properties of the kernel imply estimates as in (22) and (23) on the eigenfunctions. Zhou (2002) even gives a counterexample and presents a $C^{\infty}$ Mercer kernel on $[0, 1]$ where the eigenfunctions of the corresponding integral operator are not uniformly bounded. Thus, smoothness of the kernel is not a sufficient condition for (23) to hold. Moreover, we point out that the upper bound (22) on the eigenfunctions (and thus the upper bound for $m$ in Zhang et al., 2013) depends on the unknown marginal distribution $\nu$. Only the strongest assumption, a bound in sup-norm (23), does not depend on $\nu$. Concerning this point, our approach is agnostic.

As already mentioned in the Introduction, these bounds on the eigenfunctions have been eliminated by Lin et al. (2017), for KRR, imposing polynomial decay of eigenvalues as above. This is very similar to our approach. As a general rule, our bounds on $m$ and the bounds obtained by Lin et al. (2017) are worse than the bounds of Zhang et al. (2013) for eigenfunctions in (or close to) $L^{\infty}$, but in the complementary case where nothing is known on the eigenfunctions, $m$ can still be chosen as an increasing function of $n$, namely $m = n^{\alpha}$. More precisely, choosing $\lambda_n$ as in (16), Lin et al. (2017) derive as an upper bound

\[ m \lesssim n^{\alpha}, \qquad \alpha = \frac{2br}{2br + b + 1}, \]

with $r$ being the smoothness parameter arising in the source condition. We recall here that due to our assumption $q \ge r + s$, the smoothness parameter $r$ is restricted to the interval $(0, \frac{1}{2}]$ for KRR ($q = 1$) and $L^2$-risk ($s = \frac{1}{2}$).

Our results (which hold for a general class of spectral regularization methods) are in some ways comparable to those of Lin et al. (2017).
Specialized to KRR, our estimates for the exponent $\alpha$ in $m = O(n^\alpha)$ coincide with the result of Lin et al. (2017). Furthermore, we emphasize that Zhang et al. (2013) and Lin et al. (2017) estimate the DL-error only for $s = 1/2$ in our notation (corresponding to $L^2(\nu)$-norm), while our result holds for all values of $s \in [0, 1/2]$, which smoothly interpolates between $L^2(\nu)$-norm and RKHS-norm, and, in addition, for all values of $p \in [1, \infty)$. Thus, our results also apply to the case of non-parametric inverse regression, where one is particularly interested in the reconstruction error, i.e. $\mathcal H_K$-norm (see, e.g., Blanchard and Mücke, 2017). Additionally, we precisely analyze the dependence on the noise variance $\sigma^2$ and the complexity radius $R$ in the source condition.

Concerning general strategy, while Lin et al. (2017) use a novel second order decomposition in an essential way, our approach is more classical. We clearly distinguish between estimating the approximation error and the sample error. The bias using a subsample should be of the same order as when using the whole sample, whereas the estimation error is higher on each subsample, but gets reduced by averaging, by writing the variance as a sum of i.i.d. random variables (which allows the use of Rosenthal's inequality).

Finally, we want to mention the recent works of Lin and Zhou (2018) and Guo et al. (2017), which were worked out independently from our work. Guo et al. (2017) also treat general spectral regularization methods (going beyond kernel ridge) and obtain essentially the same results, but with error bounds only in $L^2$-norm, excluding inverse learning problems. Lin and Zhou (2018) investigate distributed learning on the example of gradient descent algorithms, which have infinite qualification and allow larger smoothness of the regression function. They are able to improve the upper bound for the number of local machines to

$$m \lesssim \frac{n^\alpha}{\log^5(n) + 1}, \qquad \alpha < \frac{br}{2br + b + 1},$$

which is larger in the case $r > 2$. In the intermediate case $1 < r < 2$, our bound in (20) is still better. An interesting feature is the fact that it is possible to allow more local machines by using additional unlabeled data. This indicates that finding the upper bound for the number of machines in the high smoothness regime is still an open problem.

Adaptivity: It is clear from the theoretical results that both the regularization parameter $\lambda$ and the allowed cardinality of subsamples $m$ depend on the parameters $r$ and $b$, which in general are unknown. Thus, an adaptive approach to both parameters $b$ and $r$ for choosing $\lambda$ and $m$ is of interest. To the best of our knowledge, there are yet no rigorous results on adaptivity in this more general sense. Progress in this field may well be crucial in finally assessing the relative merits of the distributed learning approach as compared with alternative strategies to effectively deal with large data sets.
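We sketch an alternative naive approach to adaptivity, based on hold-out in the direct case, where we consider each $f \in \mathcal H_K$ also as a function in $L^2(\mathcal X, \nu)$. We split the data $\mathbf z \in (\mathcal X \times \mathcal Y)^n$ into a training and validation part $\mathbf z = (\mathbf z^t, \mathbf z^v)$ of cardinality $m_t$, $m_v$. We further subdivide $\mathbf z^t$ into $m_k$ subsamples, roughly of size $m_t / m_k$, where $m_k \leq m_t$, $k = 1, 2, \ldots$ is some strictly decreasing sequence. For each $k$ and each subsample $\mathbf z_j$, $1 \leq j \leq m_k$, we define the estimators $\hat f^\lambda_{\mathbf z_j}$ as in (12) and their average

$$\bar f^\lambda_{k, \mathbf z^t} := \frac{1}{m_k} \sum_{j=1}^{m_k} \hat f^\lambda_{\mathbf z_j}. \tag{24}$$

Here, $\lambda$ varies in some sufficiently fine lattice $\Lambda$. Then evaluation on $\mathbf z^v$ gives the associated empirical $L^2$ error

$$\mathrm{Err}^\lambda_k(\mathbf z^v) := \frac{1}{m_v} \sum_{i=1}^{m_v} \big( y_i^v - \bar f^\lambda_{k, \mathbf z^t}(x_i^v) \big)^2, \qquad \mathbf z^v = (\mathbf y^v, \mathbf x^v), \quad \mathbf y^v = (y_1^v, \ldots, y_{m_v}^v), \tag{25}$$

leading us to define

$$\hat\lambda_k := \operatorname*{argmin}_{\lambda \in \Lambda} \mathrm{Err}^\lambda_k(\mathbf z^v), \qquad \mathrm{Err}(k) := \mathrm{Err}^{\hat\lambda_k}_k(\mathbf z^v). \tag{26}$$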
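The steps (24) to (26) can be sketched in a few lines for kernel ridge regression on one-dimensional inputs. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian kernel and its width, the synthetic data, the lattice of $\lambda$ values, and all function names are hypothetical choices made only for the example.

```python
import numpy as np

def gaussian_kernel(a, b, width=0.5):
    # Gram matrix of a Gaussian kernel (the kernel choice is an assumption).
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2 * width ** 2))

def krr_fit(x, y, lam):
    # Local KRR coefficients on one subsample: solve (K + n*lam*I) c = y.
    n = len(x)
    return np.linalg.solve(gaussian_kernel(x, x) + n * lam * np.eye(n), y)

def averaged_predict(x_train, y_train, m, lam, x_new):
    # Average of the m local estimators evaluated at x_new, as in (24).
    parts = zip(np.array_split(x_train, m), np.array_split(y_train, m))
    preds = [gaussian_kernel(x_new, xj) @ krr_fit(xj, yj, lam) for xj, yj in parts]
    return np.mean(preds, axis=0)

def holdout_select(x_t, y_t, x_v, y_v, m, lambdas):
    # Empirical L2 error (25) on the validation part and its minimizer (26).
    errs = np.array([
        np.mean((y_v - averaged_predict(x_t, y_t, m, lam, x_v)) ** 2)
        for lam in lambdas
    ])
    return lambdas[int(np.argmin(errs))], errs

# Synthetic example data (hypothetical): noisy sine curve on [0, 1].
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 300)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(300)
lambdas = np.logspace(-6, 0, 7)
lam_hat, errs = holdout_select(x[:200], y[:200], x[200:], y[200:], m=4, lambdas=lambdas)
```

Repeating `holdout_select` for a sequence of values of `m` yields the quantities $\mathrm{Err}(k)$ of (26), one per subdivision level.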

Then, an appropriate stopping criterion for $k$ might be to stop at

$$k^* := \min\Big\{ k \geq 3 : \Delta(k) \leq \delta \inf_{2 \leq j < k} \Delta(j) \Big\}, \qquad \Delta(j) := |\mathrm{Err}(j) - \mathrm{Err}(j-1)|, \tag{27}$$

for some $\delta < 1$ (which might require tuning). The corresponding regularization parameter is $\hat\lambda^* = \hat\lambda_{k^*}$, given by (26). At least intuitively, it is then reasonable to define a purely data driven estimator as

$$\hat f_n := \bar f^{\hat\lambda^*}_{k^*, \mathbf z^t}. \tag{28}$$

Note that the training data $\mathbf z^t$ enter the definition of $\hat f_n$ via the explicit formula (24) encoding our kernel based approach, while $\mathbf z^v$ serves to determine $(k^*, \hat\lambda^*)$ via minimization of the empirical $L^2$-error and a criterion which tells one to stop where $\mathrm{Err}(j)$ does not appreciably improve anymore. It is open whether such a procedure achieves optimal rates, and we leave this for future research.

6. Proofs

For ease of reading we make use of the following conventions:

- for a (bounded) linear operator $A$, $\|A\|$ denotes the operator norm;
- we are interested in a precise dependence of multiplicative constants on the parameters $\sigma$, $M$, $R$, $m$, $n$ and $\eta$. (To be clear about the role of the latter quantity: the proofs rely on statements on deviations, typically holding with high probability $1 - \eta$.)
- the dependence of multiplicative constants on various other parameters, including the kernel parameter $\kappa$, the interpolating parameter $s \in [0, \frac12]$, the parameters arising from the regularization method, $b > 1$, $\beta > 0$, $r > 0$, etc. will (generally) be omitted and simply indicated by the symbol $\blacktriangle$.
- the dependence on the norm parameter $p$ will also be indicated, but will not be given explicitly.
- the values of $C_\blacktriangle$ and $C_{\blacktriangle, p}$ might change from line to line.
- the expression "for $n$ sufficiently large" means that the statement holds for $n \geq n_0$, with $n_0$ potentially depending on all model parameters (including $\sigma$, $M$ and $R$), but not on $\eta$.

6.1 Preliminaries

Proposition 12 (Guo et al., 2017, Proposition 1) Define

$$\mathcal B_n(\lambda) := \left[ 1 + \left( \frac{2}{n\lambda} + \sqrt{\frac{\mathcal N(\lambda)}{n\lambda}} \right)^2 \right]. \tag{29}$$

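For any $\lambda > 0$, $\eta \in (0, 1]$, with probability at least $1 - \eta$ one has

$$\big\| (\bar T_{\mathbf x} + \lambda)^{-1} (\bar T + \lambda) \big\| \leq 8 \log^2(2\eta^{-1})\, \mathcal B_n(\lambda). \tag{30}$$

Corollary 13 Let $\eta \in (0, 1)$. For $n \in \mathbb N$ let $\tilde\lambda_n$ be implicitly defined as the unique solution of $\mathcal N(\tilde\lambda_n) = n \tilde\lambda_n$. Then for any $\lambda \in [\max(\tilde\lambda_n, n^{-1}), 1]$, one has $\mathcal B_n(\lambda) \leq 10$. In particular,

$$\big\| (\bar T_{\mathbf x} + \lambda)^{-1} (\bar T + \lambda) \big\| \leq 80 \log^2(2\eta^{-1})$$

holds with probability at least $1 - \eta$.

We remark that the trace of $\bar T$ is bounded by 1. This ensures that the interval $[\tilde\lambda_n, 1]$ is non-empty.

Proof of Corollary 13. Let $\tilde\lambda_n$ be defined via $\mathcal N(\tilde\lambda_n) = n \tilde\lambda_n$. Since $\mathcal N(\lambda)/\lambda$ is decreasing, we have for any $\lambda \geq \tilde\lambda_n$

$$\frac{\mathcal N(\lambda)}{n\lambda} \leq \frac{\mathcal N(\tilde\lambda_n)}{n \tilde\lambda_n} = 1.$$

Inserting this bound as well as $n\lambda \geq 1$ into (29) and (30) leads to the conclusion.

Corollary 14 Assume the marginal distribution $\nu$ of $X$ belongs to $\mathcal P^<(b, \beta)$ with $b > 1$ and $\beta > 0$. If $\lambda_n$ is defined by (16) and if

$$m_n \leq n^\alpha, \qquad \alpha < \frac{2br}{2br + b + 1},$$

one has

$$\mathcal B_{\frac{n}{m_n}}(\lambda_n) \leq 2,$$

provided $n$ is sufficiently large.

Proof of Corollary 14. We can for starters assume that $n$ is sufficiently large so that $\lambda_n < 1$, i.e. $\lambda_n = \big( \frac{\sigma^2}{R^2 n} \big)^{\frac{b}{2br+b+1}}$ from (16). Recall that $\nu \in \mathcal P^<(b, \beta)$ implies $\mathcal N(\lambda_n) \leq C_\blacktriangle \lambda_n^{-\frac1b}$. Looking at the terms entering in $\mathcal B_{\frac{n}{m_n}}(\lambda_n)$, see (29), we have first, using the definition of $\lambda_n$ in (16):

$$\frac{m_n \mathcal N(\lambda_n)}{n \lambda_n} \leq C_\blacktriangle \lambda_n^{-\frac{b+1}{b}} \frac{m_n}{n} = C_\blacktriangle \frac{m_n}{n} \left( \frac{n R^2}{\sigma^2} \right)^{\frac{b+1}{2br+b+1}},$$

which (for fixed $R$, $\sigma$ and other parameters entering in $C_\blacktriangle$) is $O\big( m_n n^{-\frac{2br}{2br+b+1}} \big)$, and hence $o(1)$ provided

$$m_n \leq n^\alpha, \qquad \alpha < \frac{2br}{2br + b + 1}.$$

For the second term entering in $\mathcal B_{\frac{n}{m_n}}(\lambda_n)$, we have

$$\frac{m_n}{n \lambda_n} = \frac{m_n}{n} \left( \frac{n R^2}{\sigma^2} \right)^{\frac{b}{2br+b+1}},$$

which is $O\big( m_n n^{-\frac{2br+1}{2br+b+1}} \big) = o(1)$, provided

$$m_n \leq n^\alpha, \qquad \alpha < \frac{2br + 1}{2br + b + 1},$$

which is implied by the previous stronger condition.

We briefly illustrate how Corollary 13 and Proposition 12 will be used. Let $u \in [0, 1]$ and $\lambda$ be as above. We have for any bounded operator $A$

$$\begin{aligned}
\| \bar T^u A \| &= \big\| \bar T^u (\bar T + \lambda)^{-u} (\bar T + \lambda)^{u} (\bar T_{\mathbf x} + \lambda)^{-u} (\bar T_{\mathbf x} + \lambda)^{u} A \big\| \\
&\leq \big\| \bar T^u (\bar T + \lambda)^{-u} \big\| \, \big\| (\bar T + \lambda)^{u} (\bar T_{\mathbf x} + \lambda)^{-u} \big\| \, \big\| (\bar T_{\mathbf x} + \lambda)^{u} A \big\| \\
&\leq 8^u \log^{2u}(2\eta^{-1})\, \mathcal B_n(\lambda)^u \, \big\| (\bar T_{\mathbf x} + \lambda)^{u} A \big\|,
\end{aligned} \tag{31}$$

with probability at least $1 - \eta$, for any $\eta \in (0, 1)$; for the last inequality we have used that the first factor is less than 1, and for the second factor Proposition 12 in combination with the Cordes inequality (see Proposition 22 in the Appendix). In particular, for any $\lambda \in [\max(\tilde\lambda_n, n^{-1}), 1]$ (with $\tilde\lambda_n$ as in Corollary 13)

$$\| \bar T^u A \| \leq 80^u \log^{2u}(2\eta^{-1}) \, \big\| (\bar T_{\mathbf x} + \lambda)^{u} A \big\|, \tag{32}$$

with probability at least $1 - \eta$. In the following, we constantly use (31). Furthermore, to bound terms involving residuals we will frequently use the following estimate: for $v \geq 0$, $u \in [0, 2]$, and provided $u + v \leq q$ ($q$ being the qualification):

$$\sup_{t \in [0,1]} \big| r_\lambda(t)\, t^v (t + \lambda)^u \big| \leq 2 \Big( \sup_{t \in [0,1]} \big| r_\lambda(t)\, t^{v+u} \big| + \lambda^u \sup_{t \in [0,1]} \big| r_\lambda(t)\, t^v \big| \Big) \leq C_\blacktriangle \lambda^{v+u}, \tag{33}$$

using twice (11) since $q \geq u + v$.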
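As a numerical illustration of (29) and Corollary 14, the sketch below evaluates $\mathcal B_{n/m_n}(\lambda_n)$ for a toy spectrum. It assumes the usual definition of the effective dimension, $\mathcal N(\lambda) = \mathrm{tr}\big(\bar T(\bar T + \lambda)^{-1}\big)$, a truncated spectrum $\mu_j \propto j^{-b}$ normalized to trace 1, and the values $b = 2$, $r = 1/2$, $\sigma = R = 1$, $\alpha = 0.2$; all of these, and the function names, are hypothetical choices made for the example.

```python
import numpy as np

def effective_dimension(eigs, lam):
    # N(lam) = sum_j mu_j / (mu_j + lam) for an operator with eigenvalues mu_j.
    return float(np.sum(eigs / (eigs + lam)))

def B(n_eff, lam, eigs):
    # B_n(lam) = 1 + (2/(n*lam) + sqrt(N(lam)/(n*lam)))**2, see (29),
    # evaluated at the subsample size n_eff = n / m.
    N = effective_dimension(eigs, lam)
    return 1.0 + (2.0 / (n_eff * lam) + np.sqrt(N / (n_eff * lam))) ** 2

b, r, sigma, R = 2.0, 0.5, 1.0, 1.0
eigs = np.arange(1, 200001, dtype=float) ** (-b)  # assumed spectrum j^{-b}
eigs /= eigs.sum()                                # normalize the trace to 1

alpha = 0.2  # below the threshold 2br/(2br+b+1) = 0.4
values = []
for n in (10**4, 10**6, 10**8):
    lam_n = (sigma**2 / (R**2 * n)) ** (b / (2 * b * r + b + 1))
    m_n = n ** alpha
    values.append(B(n / m_n, lam_n, eigs))
```

In this toy setting the computed values decrease towards 1 as $n$ grows, which is the behavior that the bound $\mathcal B_{n/m_n}(\lambda_n) \leq 2$ for large $n$ relies on.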

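6.2 Approximation error bound

Recall that $\nu$ denotes the $X$-marginal of the sampling distribution $\rho$ and $\mathcal P$ the set of all probability distributions on the input space $\mathcal X$.

Lemma 15 Let $\nu \in \mathcal P$, $v \in \mathbb R$ and let $\mathbf x \in \mathcal X^{\frac{n}{m}}$ be an i.i.d. sample of size $n/m$, drawn according to $\nu$. Assume the regularization $(g_\lambda)_\lambda$ has qualification $q \geq v + s + 1$. Then with probability at least $1 - \eta$:

$$\big\| \bar T^s r_\lambda(\bar T_{\mathbf x}) \bar T_{\mathbf x}^v (\bar T - \bar T_{\mathbf x}) \big\| \leq C_\blacktriangle \log^4(4\eta^{-1})\, \lambda^{s+v+1}\, \mathcal B^{s+1}_{\frac{n}{m}}(\lambda) \left( \frac{m}{n\lambda} + \sqrt{\frac{m \mathcal N(\lambda)}{n\lambda}} \right).$$

Proof of Lemma 15. From (30), (31) and from Proposition 20 recalled in the Appendix, one has

$$\begin{aligned}
\big\| \bar T^s r_\lambda(\bar T_{\mathbf x}) \bar T_{\mathbf x}^v (\bar T - \bar T_{\mathbf x}) \big\|
&\leq C_\blacktriangle \log^{2(s+1)}(4\eta^{-1})\, \mathcal B^{s+1}_{\frac{n}{m}}(\lambda)\, \big\| (\bar T_{\mathbf x} + \lambda)^s r_\lambda(\bar T_{\mathbf x}) \bar T_{\mathbf x}^v (\bar T_{\mathbf x} + \lambda) \big\| \, \big\| (\bar T + \lambda)^{-1} (\bar T - \bar T_{\mathbf x}) \big\| \\
&\leq C_\blacktriangle \log^4(4\eta^{-1})\, \lambda^{s+v+1}\, \mathcal B^{s+1}_{\frac{n}{m}}(\lambda) \left( \frac{m}{n\lambda} + \sqrt{\frac{m \mathcal N(\lambda)}{n\lambda}} \right),
\end{aligned}$$

for any $\lambda \in (0, 1]$, $\eta \in (0, 1]$, with probability at least $1 - \eta$. We also used that $s \leq \frac12$, and the estimate (33).

Lemma 16 Let $\nu \in \mathcal P$, $v \in \mathbb R$ and let $\mathbf x \in \mathcal X^{\frac{n}{m}}$ be an i.i.d. sample of size $n/m$ drawn according to $\nu$. Assume the regularization $(g_\lambda)_\lambda$ has qualification $q \geq v + s$. Then for any $\lambda \in (0, 1]$, $\eta \in (0, 1]$, with probability at least $1 - \eta$

$$\big\| \bar T^s r_\lambda(\bar T_{\mathbf x}) \bar T_{\mathbf x}^v \big\| \leq C_\blacktriangle \log^{2s}(2\eta^{-1})\, \mathcal B^{s}_{\frac{n}{m}}(\lambda)\, \lambda^{s+v},$$

for some $C_\blacktriangle < \infty$.

Proof of Lemma 16. Using (31), (33), since $q \geq v + s$, it holds

$$\big\| \bar T^s r_\lambda(\bar T_{\mathbf x}) \bar T_{\mathbf x}^v \big\| \leq C_\blacktriangle \log^{2s}(2\eta^{-1})\, \mathcal B^{s}_{\frac{n}{m}}(\lambda)\, \big\| (\bar T_{\mathbf x} + \lambda)^s r_\lambda(\bar T_{\mathbf x}) \bar T_{\mathbf x}^v \big\| \leq C_\blacktriangle \log^{2s}(2\eta^{-1})\, \mathcal B^{s}_{\frac{n}{m}}(\lambda)\, \lambda^{s+v},$$

with probability at least $1 - \eta$.

Proposition 17 (Expectation of approximation error) Let $f_\rho \in \Omega(r, R)$, $\lambda \in (0, 1]$ and let $\mathcal B_{\frac{n}{m}}(\lambda)$ be defined in (29). Assume the regularization has qualification $q \geq r + s$. For any $p \geq 1$ one has:

1. If $r \leq 1$, then

$$\mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s (f_\rho - \tilde f_D^\lambda) \|^p_{\mathcal H_K} \big]^{\frac1p} \leq C_{\blacktriangle, p}\, R\, \lambda^{s+r}\, \mathcal B^{s+r}_{\frac{n}{m}}(\lambda).$$

2. If $r > 1$, then

$$\mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s (f_\rho - \tilde f_D^\lambda) \|^p_{\mathcal H_K} \big]^{\frac1p} \leq C_{\blacktriangle, p}\, R\, \lambda^{s}\, \mathcal B^{s+1}_{\frac{n}{m}}(\lambda) \left( \lambda^r + \lambda \left( \frac{m}{n\lambda} + \sqrt{\frac{m \mathcal N(\lambda)}{n\lambda}} \right) \right).$$

In 1. and 2. the constant $C_{\blacktriangle, p}$ does not depend on $(\sigma, M, R) \in \mathbb R^3_+$.

Proof of Proposition 17. Since $f_\rho \in \Omega(r, R)$,

$$\begin{aligned}
\mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s (f_\rho - \tilde f_D^\lambda) \|^p_{\mathcal H_K} \big]^{\frac1p}
&= \mathbb E_{\rho^{\otimes n}}\Big[ \Big\| \frac1m \sum_{j=1}^m \bar T^s r_\lambda(\bar T_{\mathbf x_j}) f_\rho \Big\|^p_{\mathcal H_K} \Big]^{\frac1p} \\
&\leq \frac1m \sum_{j=1}^m \mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s r_\lambda(\bar T_{\mathbf x_j}) f_\rho \|^p_{\mathcal H_K} \big]^{\frac1p} \\
&\leq \frac{R}{m} \sum_{j=1}^m \mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s r_\lambda(\bar T_{\mathbf x_j}) \bar T^r \|^p \big]^{\frac1p}.
\end{aligned} \tag{34}$$

The first inequality is just the triangle inequality for the $p$-norm $\|f\|_p = \mathbb E[\|f\|^p_{\mathcal H_K}]^{\frac1p}$. We bound the expectation for each separate subsample of size $\frac{n}{m}$ by first deriving a probabilistic estimate and then integrating.

Consider first the case where $r \leq 1$. Using (31), the Cordes inequality (Proposition 22 in the Appendix), and (33) one has for any $j = 1, \ldots, m$,

$$\begin{aligned}
\big\| \bar T^s r_\lambda(\bar T_{\mathbf x_j}) \bar T^r \big\|
&\leq C_\blacktriangle \log^{2(s+r)}(4\eta^{-1})\, \mathcal B^{s+r}_{\frac{n}{m}}(\lambda)\, \big\| (\bar T_{\mathbf x_j} + \lambda)^s r_\lambda(\bar T_{\mathbf x_j}) (\bar T_{\mathbf x_j} + \lambda)^r \big\| \\
&\leq C_\blacktriangle \log^{3}(4\eta^{-1})\, \lambda^{s+r}\, \mathcal B^{s+r}_{\frac{n}{m}}(\lambda),
\end{aligned}$$

with probability at least $1 - \eta$ and where $\mathcal B_{\frac{n}{m}}(\lambda)$ is defined in (29). Recall that the regularization has qualification $q \geq r + s$. By integration one has

$$\mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s r_\lambda(\bar T_{\mathbf x_j}) \bar T^r \|^p \big]^{\frac1p} \leq C_{\blacktriangle, p}\, \lambda^{s+r}\, \mathcal B^{s+r}_{\frac{n}{m}}(\lambda),$$

for some $C_{\blacktriangle, p} < \infty$, not depending on $\sigma$, $M$, $R$. Finally, from (34)

$$\mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s (f_\rho - \tilde f_D^\lambda) \|^p_{\mathcal H_K} \big]^{\frac1p} \leq C_{\blacktriangle, p}\, R\, \lambda^{s+r}\, \mathcal B^{s+r}_{\frac{n}{m}}(\lambda).$$

In the case where $r > 1$, we write $r = k + u$, with $k = \lfloor r \rfloor$ and $u = r - k < 1$. We shall use the decomposition

$$\bar T^k = \sum_{l=0}^{k-1} \bar T_{\mathbf x}^l (\bar T - \bar T_{\mathbf x}) \bar T^{k-(l+1)} + \bar T_{\mathbf x}^k. \tag{35}$$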
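The decomposition (35) is a telescoping identity that holds for any two square matrices, which can be checked numerically. The sketch below uses random positive semi-definite matrices as stand-ins for $\bar T$ and $\bar T_{\mathbf x}$; the dimension, the seed and the helper names are hypothetical.

```python
import numpy as np

def telescoping_rhs(T, Tx, k):
    # Right-hand side of (35): sum_{l=0}^{k-1} Tx^l (T - Tx) T^{k-(l+1)} + Tx^k.
    mp = np.linalg.matrix_power
    total = mp(Tx, k)
    for l in range(k):
        total = total + mp(Tx, l) @ (T - Tx) @ mp(T, k - (l + 1))
    return total

rng = np.random.default_rng(1)

def random_psd(d):
    # Random symmetric positive semi-definite matrix with operator norm 1.
    a = rng.standard_normal((d, d))
    s = a @ a.T
    return s / np.linalg.norm(s, 2)

T, Tx = random_psd(5), random_psd(5)
checks = [
    np.allclose(np.linalg.matrix_power(T, k), telescoping_rhs(T, Tx, k))
    for k in (1, 2, 3, 4)
]
```

Each summand contains exactly one factor $\bar T - \bar T_{\mathbf x}$, which is what allows the deviation bound of Lemma 15 to be applied term by term.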

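We proceed by bounding (34) according to decomposition (35). For any $j = 1, \ldots, m$, one has

$$\begin{aligned}
\mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s r_\lambda(\bar T_{\mathbf x_j}) \bar T^{k+u} \|^p \big]^{\frac1p}
&\leq \sum_{l=0}^{k-1} \mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s r_\lambda(\bar T_{\mathbf x_j}) \bar T_{\mathbf x_j}^l (\bar T - \bar T_{\mathbf x_j}) \bar T^{k-(l+1)+u} \|^p \big]^{\frac1p}
+ \mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s r_\lambda(\bar T_{\mathbf x_j}) \bar T_{\mathbf x_j}^k \bar T^u \|^p \big]^{\frac1p} \\
&\leq \sum_{l=0}^{k-1} \mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s r_\lambda(\bar T_{\mathbf x_j}) \bar T_{\mathbf x_j}^l (\bar T - \bar T_{\mathbf x_j}) \|^p \big]^{\frac1p}
+ \mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s r_\lambda(\bar T_{\mathbf x_j}) \bar T_{\mathbf x_j}^k \bar T^u \|^p \big]^{\frac1p}.
\end{aligned} \tag{36}$$

Here we use that $\| \bar T^{k-(l+1)+u} \|$ is bounded by 1. By Lemma 16 and by (31), (33), with probability at least $1 - \eta$

$$\begin{aligned}
\big\| \bar T^s r_\lambda(\bar T_{\mathbf x_j}) \bar T_{\mathbf x_j}^k \bar T^u \big\|
&\leq C_\blacktriangle \log^{2(s+u)}(2\eta^{-1})\, \mathcal B^{s+u}_{\frac{n}{m}}(\lambda)\, \big\| (\bar T_{\mathbf x_j} + \lambda)^s r_\lambda(\bar T_{\mathbf x_j}) \bar T_{\mathbf x_j}^k (\bar T_{\mathbf x_j} + \lambda)^u \big\| \\
&\leq C_\blacktriangle \log^{2(s+u)}(2\eta^{-1})\, \mathcal B^{s+u}_{\frac{n}{m}}(\lambda)\, \lambda^{s+r},
\end{aligned}$$

and thus integration yields

$$\mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s r_\lambda(\bar T_{\mathbf x_j}) \bar T_{\mathbf x_j}^k \bar T^u \|^p \big]^{\frac1p} \leq C_{\blacktriangle, p}\, \mathcal B^{s+u}_{\frac{n}{m}}(\lambda)\, \lambda^{s+r}. \tag{37}$$

For estimating the first term in (36) we may use Lemma 15. For any $l = 0, \ldots, k-1$, we have $l + s + 1 \leq k + s \leq r + s \leq q$, hence for any $j = 1, \ldots, m$, with probability at least $1 - \eta$

$$\big\| \bar T^s r_\lambda(\bar T_{\mathbf x_j}) \bar T_{\mathbf x_j}^l (\bar T - \bar T_{\mathbf x_j}) \big\| \leq C_\blacktriangle \log^4(8\eta^{-1})\, \lambda^{s+l+1}\, \mathcal B^{s+1}_{\frac{n}{m}}(\lambda) \left( \frac{m}{n\lambda} + \sqrt{\frac{m \mathcal N(\lambda)}{n\lambda}} \right).$$

Again by integration, since $\lambda^l \leq 1$ for any $l = 0, \ldots, k-1$, one has

$$\sum_{l=0}^{k-1} \mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s r_\lambda(\bar T_{\mathbf x_j}) \bar T_{\mathbf x_j}^l (\bar T - \bar T_{\mathbf x_j}) \|^p \big]^{\frac1p} \leq C_{\blacktriangle, p}\, r\, \lambda^{s+1}\, \mathcal B^{s+1}_{\frac{n}{m}}(\lambda) \left( \frac{m}{n\lambda} + \sqrt{\frac{m \mathcal N(\lambda)}{n\lambda}} \right). \tag{38}$$

Finally, combining (37) and (38) into (36), then (34), gives in the case where $r > 1$

$$\mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s (f_\rho - \tilde f_D^\lambda) \|^p_{\mathcal H_K} \big]^{\frac1p} \leq C_{\blacktriangle, p}\, R\, \lambda^{s}\, \mathcal B^{s+1}_{\frac{n}{m}}(\lambda) \left( \lambda^r + \lambda \left( \frac{m}{n\lambda} + \sqrt{\frac{m \mathcal N(\lambda)}{n\lambda}} \right) \right).$$

The rest of the proof follows from (36).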

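Proof of Theorem 3. Let $\lambda_n$ be as defined by (16). According to Corollary 14, we have $\mathcal B_{\frac{n}{m_n}}(\lambda_n) \leq 2$ provided $m_n \leq n^\alpha$ with $\alpha < \frac{2br}{2br+b+1}$, for $n$ sufficiently large. We can also assume $n$ sufficiently large so that $\lambda_n < 1$, i.e., $R \lambda_n^{r+s} = a_n$ (from (16), (17)). Under these conditions, we immediately obtain from the first part of Proposition 17 in the case where $r \leq 1$

$$\mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s (f_\rho - \tilde f_D^{\lambda_n}) \|^p_{\mathcal H_K} \big]^{\frac1p} \leq C_{\blacktriangle, p}\, R\, \lambda_n^{s+r} = C_{\blacktriangle, p}\, a_n.$$

We turn to the case where $r > 1$. We apply the second part of Proposition 17. By Corollary 14 we have

$$\begin{aligned}
\mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s (f_\rho - \tilde f_D^{\lambda_n}) \|^p_{\mathcal H_K} \big]^{\frac1p}
&\leq C_{\blacktriangle, p}\, R\, \lambda_n^{s}\, \mathcal B^{s+1}_{\frac{n}{m_n}}(\lambda_n) \left( \lambda_n^r + \lambda_n \left( \frac{m_n}{n\lambda_n} + \sqrt{\frac{m_n \mathcal N(\lambda_n)}{n\lambda_n}} \right) \right) \\
&\leq C_{\blacktriangle, p}\, R\, \lambda_n^{s} \left( \lambda_n^r + \lambda_n \left( \frac{m_n}{n\lambda_n} + \sqrt{\frac{m_n \lambda_n^{-\frac1b}}{n\lambda_n}} \right) \right),
\end{aligned}$$

where we used that $\mathcal N(\lambda_n) \leq C_\blacktriangle \lambda_n^{-1/b}$, and $\lambda_n < 1$. Furthermore,

$$\frac{m_n}{n\lambda_n} = o\left( \sqrt{\frac{m_n \lambda_n^{-\frac1b}}{n\lambda_n}} \right), \qquad \text{provided } m_n \leq n^\alpha, \quad \alpha < \frac{2(br+1)}{2br+b+1},$$

and

$$\sqrt{\frac{m_n \lambda_n^{-\frac1b}}{n\lambda_n}} = \sqrt{m_n}\, \frac{R}{\sigma}\, \lambda_n^r,$$

with $\sigma \sqrt{\frac{\lambda_n^{-1/b}}{n\lambda_n}} = R \lambda_n^r$ coming from the definition of $\lambda_n$. Finally, for $n$ sufficiently large, $\frac{R}{\sigma} \sqrt{m_n}\, \lambda_n \leq 1$, provided that

$$\alpha < \frac{2b}{2br+b+1}.$$

As a result, for any $p \geq 1$:

$$\limsup_{n \to \infty} \sup_{\rho \in \mathcal M_{\sigma, M, R}} \frac{\mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s (f_\rho - \tilde f_D^{\lambda_n}) \|^p_{\mathcal H_K} \big]^{\frac1p}}{a_n} \leq C_{\blacktriangle, p},$$

for some $C_{\blacktriangle, p} < \infty$, not depending on $\sigma$, $M$, $R$.

6.3 Sample error bound

The main idea for deriving an upper bound for the sample error is to identify it as a sum of unbiased Hilbert space-valued i.i.d. variables and then to apply a suitable version of Rosenthal's inequality.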
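The effect of averaging centered i.i.d. variables can be illustrated with a small Monte Carlo experiment. The sketch below estimates $\big(\mathbb E \| \frac1m \sum_j \xi_j \|^p\big)^{1/p}$ for standard Gaussian vectors in $\mathbb R^3$, which is a hypothetical toy model and not the random variables of the proof, and checks the $m^{-1/2}$ decay that the sample error bound exploits.

```python
import numpy as np

def pth_moment_norm_of_mean(m, p, d=3, reps=20000, seed=0):
    # Monte Carlo estimate of (E || (1/m) sum_j xi_j ||^p)^(1/p) for i.i.d.
    # centered standard Gaussian vectors xi_j in R^d (assumed toy model).
    rng = np.random.default_rng(seed)
    xi = rng.standard_normal((reps, m, d))
    norms = np.linalg.norm(xi.mean(axis=1), axis=1)
    return float(np.mean(norms ** p) ** (1.0 / p))

# Quadrupling m should roughly halve the p-th moment norm of the average.
vals = {m: pth_moment_norm_of_mean(m, p=4) for m in (4, 16, 64)}
ratios = (vals[4] / vals[16], vals[16] / vals[64])
```

For Gaussian vectors the decay is exactly of order $m^{-1/2}$; for general centered variables, an inequality of Rosenthal type provides an upper bound of the same order up to an additional term involving the $p$-th moments.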

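Given $\lambda \in (0, 1]$, we define the random variable $\xi_\lambda : (\mathcal X \times \mathbb R)^{\frac{n}{m}} \to \mathcal H_K$ by

$$\xi_\lambda(\mathbf x, \mathbf y) := \bar T^s g_\lambda(\bar T_{\mathbf x}) (\bar T_{\mathbf x} f_\rho - \bar S_{\mathbf x}^* \mathbf y).$$

Recall that according to Assumption (3), the conditional expectation w.r.t. $\rho$ of $Y$ given $X$ satisfies

$$\mathbb E_\rho[Y \mid X = x] = \bar S_x f_\rho,$$

implying that $\xi_\lambda$ has zero expectation (since $\bar T_{\mathbf x} = \bar S_{\mathbf x}^* \bar S_{\mathbf x}$). Thus,

$$\bar T^s (\tilde f_D^\lambda - \bar f_D^\lambda) = \frac1m \sum_{j=1}^m \xi_\lambda(\mathbf x_j, \mathbf y_j) \tag{39}$$

is a sum of centered i.i.d. random variables. Furthermore, we need the following result (Pinelis, 1994, Theorem 5.2), which generalizes Rosenthal's (1970) inequalities (originally only formulated for real valued random variables) to random variables with values in a Banach space. For Hilbert spaces the result takes a particularly simple form.

Proposition 18 Let $\mathcal H$ be a Hilbert space and $\xi_1, \ldots, \xi_m$ be a finite sequence of independent, mean zero $\mathcal H$-valued random variables. If $2 \leq p < \infty$, then there exists a constant $C_p > 0$, only depending on $p$, such that

$$\left( \mathbb E \Big\| \frac1m \sum_{j=1}^m \xi_j \Big\|^p_{\mathcal H} \right)^{\frac1p} \leq C_p\, \frac1m \max\left\{ \Big( \sum_{j=1}^m \mathbb E \| \xi_j \|^p_{\mathcal H} \Big)^{\frac1p}, \Big( \sum_{j=1}^m \mathbb E \| \xi_j \|^2_{\mathcal H} \Big)^{\frac12} \right\}. \tag{40}$$

We remark in passing that Dirksen (2011), Corollary 1.22, establishes the interesting result that in addition to the upper bound in (40) there is also a corresponding lower bound where the constant $C_p$ is replaced by another constant $C_p' > 0$, only depending on $p$.

Proposition 19 (Expectation of sample error) Let $\rho$ be a source distribution belonging to $\mathcal M_{\sigma, M, R}$, $s \in [0, \frac12]$ and let $\lambda \in (0, 1]$. Define $\mathcal B_{\frac{n}{m}}(\lambda)$ as in (29). Assume the regularization has qualification $q \geq r + s$. For any $p \geq 1$ one has:

$$\mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s (\bar f_D^\lambda - \tilde f_D^\lambda) \|^p_{\mathcal H_K} \big]^{\frac1p} \leq C_{\blacktriangle, p}\, m^{-\frac12}\, \mathcal B_{\frac{n}{m}}(\lambda)^{\frac12 + s}\, \lambda^s \left( \frac{m M}{n\lambda} + \sigma \sqrt{\frac{m \mathcal N(\lambda)}{n\lambda}} \right),$$

where $C_{\blacktriangle, p}$ does not depend on $(\sigma, M, R) \in \mathbb R^3_+$.

Proof of Proposition 19. Let $\lambda \in (0, 1]$ and $p \geq 2$. From Proposition 18

$$\begin{aligned}
\mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s (\bar f_D^\lambda - \tilde f_D^\lambda) \|^p_{\mathcal H_K} \big]^{\frac1p}
&= \mathbb E_{\rho^{\otimes n}}\Big[ \Big\| \frac1m \sum_{j=1}^m \xi_\lambda(\mathbf x_j, \mathbf y_j) \Big\|^p_{\mathcal H_K} \Big]^{\frac1p} \\
&\leq C_p\, \frac1m \max\left\{ \Big( \sum_{j=1}^m \mathbb E_{\rho^{\otimes n}}\big[ \| \xi_\lambda(\mathbf x_j, \mathbf y_j) \|^p_{\mathcal H_K} \big] \Big)^{\frac1p}, \Big( \sum_{j=1}^m \mathbb E_{\rho^{\otimes n}}\big[ \| \xi_\lambda(\mathbf x_j, \mathbf y_j) \|^2_{\mathcal H_K} \big] \Big)^{\frac12} \right\}.
\end{aligned} \tag{41}$$

Again, the estimates in expectation will follow from integrating a bound holding with high probability. By (31), one has for any $j = 1, \ldots, m$,

$$\begin{aligned}
\| \xi_\lambda(\mathbf x_j, \mathbf y_j) \|_{\mathcal H_K}
&= \big\| \bar T^s g_\lambda(\bar T_{\mathbf x_j}) (\bar T_{\mathbf x_j} f_\rho - \bar S_{\mathbf x_j}^* \mathbf y_j) \big\|_{\mathcal H_K} \\
&\leq 8^s \log^{2s}(4\eta^{-1})\, \mathcal B_{\frac{n}{m}}(\lambda)^s\, \big\| (\bar T_{\mathbf x_j} + \lambda)^s g_\lambda(\bar T_{\mathbf x_j}) (\bar T_{\mathbf x_j} f_\rho - \bar S_{\mathbf x_j}^* \mathbf y_j) \big\|_{\mathcal H_K},
\end{aligned} \tag{42}$$

holding with probability at least $1 - \frac{\eta}{2}$, where $\mathcal B_{\frac{n}{m}}(\lambda)$ is defined in (29). We proceed by splitting:

$$(\bar T_{\mathbf x_j} + \lambda)^s g_\lambda(\bar T_{\mathbf x_j}) (\bar T_{\mathbf x_j} f_\rho - \bar S_{\mathbf x_j}^* \mathbf y_j) = H^{(1)}_{\mathbf x_j} \cdot H^{(2)}_{\mathbf x_j} \cdot h^\lambda_{\mathbf z_j},$$

with

$$\begin{aligned}
H^{(1)}_{\mathbf x_j} &:= (\bar T_{\mathbf x_j} + \lambda)^s g_\lambda(\bar T_{\mathbf x_j}) (\bar T_{\mathbf x_j} + \lambda)^{\frac12}, \\
H^{(2)}_{\mathbf x_j} &:= (\bar T_{\mathbf x_j} + \lambda)^{-\frac12} (\bar T + \lambda)^{\frac12}, \\
h^\lambda_{\mathbf z_j} &:= (\bar T + \lambda)^{-\frac12} (\bar T_{\mathbf x_j} f_\rho - \bar S_{\mathbf x_j}^* \mathbf y_j).
\end{aligned}$$

The first term is estimated using (9), (10) and gives for $s \in [0, \frac12]$

$$\begin{aligned}
\| H^{(1)}_{\mathbf x_j} \|
&\leq \sup_{t \in [0,1]} \big| g_\lambda(t) (t + \lambda)^{s + \frac12} \big| \\
&\leq 2 \Big( \sup_{t \in [0,1]} \big| g_\lambda(t)\, t^{s + \frac12} \big| + \lambda^{s + \frac12} \sup_{t \in [0,1]} | g_\lambda(t) | \Big) \\
&\leq 2 \Big( \Big( \sup_{t \in [0,1]} | g_\lambda(t) | \Big)^{\frac12 - s} \Big( \sup_{t \in [0,1]} | g_\lambda(t)\, t | \Big)^{s + \frac12} + \lambda^{s + \frac12} \sup_{t \in [0,1]} | g_\lambda(t) | \Big) \\
&\leq C_\blacktriangle \lambda^{s - \frac12}.
\end{aligned} \tag{43}$$

The second term is now bounded using (31) once more. One has with probability at least $1 - \frac{\eta}{4}$

$$\| H^{(2)}_{\mathbf x_j} \| \leq \sqrt{8}\, \log(8\eta^{-1})\, \mathcal B_{\frac{n}{m}}(\lambda)^{\frac12}. \tag{44}$$

Finally, $h^\lambda_{\mathbf z_j}$ is estimated using Proposition 21:

$$\| h^\lambda_{\mathbf z_j} \|_{\mathcal H_K} \leq 2 \log(8\eta^{-1}) \left( \frac{m M}{n \sqrt{\lambda}} + \sigma \sqrt{\frac{m \mathcal N(\lambda)}{n}} \right), \tag{45}$$

holding with probability at least $1 - \frac{\eta}{4}$. Thus, combining (43), (44) and (45) with (42) gives for any $j = 1, \ldots, m$,

$$\| \xi_\lambda(\mathbf x_j, \mathbf y_j) \|_{\mathcal H_K} \leq C_\blacktriangle \log^{2(s+1)}(8\eta^{-1})\, \mathcal B_{\frac{n}{m}}(\lambda)^{\frac12 + s}\, \lambda^s \left( \frac{m M}{n\lambda} + \sigma \sqrt{\frac{m \mathcal N(\lambda)}{n\lambda}} \right),$$

with probability at least $1 - \eta$. Integration gives for any $p \geq 2$:

$$\mathbb E_{\rho^{\otimes n}}\big[ \| \xi_\lambda(\mathbf x_j, \mathbf y_j) \|^p_{\mathcal H_K} \big] \leq C_{\blacktriangle, p}\, \mathcal A^p,$$

with

$$\mathcal A := \mathcal A_{\frac{n}{m}}(\lambda) := \mathcal B_{\frac{n}{m}}(\lambda)^{\frac12 + s}\, \lambda^s \left( \frac{m M}{n\lambda} + \sigma \sqrt{\frac{m \mathcal N(\lambda)}{n\lambda}} \right).$$

Combining this with (41) implies, since $p \geq 2$:

$$\mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s (\bar f_D^\lambda - \tilde f_D^\lambda) \|^p_{\mathcal H_K} \big]^{\frac1p} \leq C_{\blacktriangle, p}\, \frac1m \max\Big( (m \mathcal A^p)^{\frac1p}, (m \mathcal A^2)^{\frac12} \Big) = C_{\blacktriangle, p}\, \mathcal A\, \max\Big( m^{\frac1p - 1}, m^{-\frac12} \Big) = C_{\blacktriangle, p}\, m^{-\frac12} \mathcal A,$$

where $C_{\blacktriangle, p}$ does not depend on $(\sigma, M, R) \in \mathbb R^3_+$. The result for the case $1 \leq p \leq 2$ immediately follows from Hölder's inequality.

Proof of Theorem 4. Let $\lambda_n$ be as defined by (16); as earlier we assume $n$ is big enough so that $\lambda_n < 1$. According to Corollary 14, we have $\mathcal B_{\frac{n}{m_n}}(\lambda_n) \leq 2$ provided $m_n \leq n^\alpha$ with $\alpha < \frac{2br}{2br+b+1}$ and $n$ is sufficiently large. Under this condition we immediately obtain from Proposition 19:

$$\begin{aligned}
\mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s (\bar f_D^{\lambda_n} - \tilde f_D^{\lambda_n}) \|^p_{\mathcal H_K} \big]^{\frac1p}
&\leq C_{\blacktriangle, p}\, \lambda_n^s \left( \frac{\sqrt{m_n}\, M}{n\lambda_n} + \sigma \sqrt{\frac{\mathcal N(\lambda_n)}{n\lambda_n}} \right) \\
&\leq C_{\blacktriangle, p}\, \lambda_n^s \left( \frac{\sqrt{m_n}\, M}{n\lambda_n} + \sigma \sqrt{\frac{\lambda_n^{-\frac1b}}{n\lambda_n}} \right),
\end{aligned}$$

where we used again that $\mathcal N(\lambda_n) \leq C_\blacktriangle \lambda_n^{-1/b}$; now

$$\frac{\sqrt{m_n}\, M}{n\lambda_n} = o\left( \sigma \sqrt{\frac{\lambda_n^{-\frac1b}}{n\lambda_n}} \right),$$

provided

$$m_n \leq n^\alpha, \qquad \alpha < \frac{2(br+1)}{2br+b+1}.$$

Recalling that $\sigma \sqrt{\frac{\lambda_n^{-1/b}}{n\lambda_n}} = R \lambda_n^r = \lambda_n^{-s} a_n$, we arrive at

$$\mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s (\bar f_D^{\lambda_n} - \tilde f_D^{\lambda_n}) \|^p_{\mathcal H_K} \big]^{\frac1p} \leq C_{\blacktriangle, p}\, a_n.$$

As a result, for any $p \geq 1$:

$$\limsup_{n \to \infty} \sup_{\rho \in \mathcal M_{\sigma, M, R}} \frac{\mathbb E_{\rho^{\otimes n}}\big[ \| \bar T^s (\bar f_D^{\lambda_n} - \tilde f_D^{\lambda_n}) \|^p_{\mathcal H_K} \big]^{\frac1p}}{a_n} \leq C_{\blacktriangle, p},$$

for some $C_{\blacktriangle, p} < \infty$, not depending on the model parameters $(\sigma, M, R) \in \mathbb R^3_+$, thus leading to the conclusion.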
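The identity relating the noise term to $R\lambda_n^r$, recalled in the proof above, can be verified in a few lines. The derivation below assumes only the explicit form $\lambda_n = (\sigma^2 / (R^2 n))^{b/(2br+b+1)}$ stated in the proof of Corollary 14 (valid when $\lambda_n < 1$) and the relation $a_n = R\lambda_n^{r+s}$.

```latex
\begin{align*}
\lambda_n = \Big(\frac{\sigma^2}{R^2 n}\Big)^{\frac{b}{2br+b+1}}
&\;\Longrightarrow\;
\lambda_n^{\,2r+1+\frac{1}{b}} = \frac{\sigma^2}{R^2 n}
&&\text{since } \tfrac{2br+b+1}{b} = 2r+1+\tfrac{1}{b}, \\
&\;\Longrightarrow\;
\frac{\sigma^2\,\lambda_n^{-\frac{1}{b}}}{n\,\lambda_n} = R^2\,\lambda_n^{2r}
&&\text{dividing by } \lambda_n^{1+\frac{1}{b}} \text{ and multiplying by } R^2, \\
&\;\Longrightarrow\;
\sigma\sqrt{\frac{\lambda_n^{-\frac{1}{b}}}{n\,\lambda_n}} = R\,\lambda_n^{r} = \lambda_n^{-s}\,a_n .
\end{align*}
```

In other words, $\lambda_n$ is chosen exactly so that the variance-type term and the bias-type term $R\lambda_n^r$ are balanced.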
