Divide and Conquer Kernel Ridge Regression: A Distributed Algorithm with Minimax Optimal Rates

Journal of Machine Learning Research (2015)

Divide and Conquer Kernel Ridge Regression: A Distributed Algorithm with Minimax Optimal Rates

Yuchen Zhang (yuczhang@berkeley.edu), Department of Electrical Engineering and Computer Science, University of California, Berkeley, Berkeley, CA 94720, USA

John Duchi (jduchi@stanford.edu), Departments of Statistics and Electrical Engineering, Stanford University, Stanford, CA 94305, USA

Martin Wainwright (wainwrig@berkeley.edu), Departments of Statistics and Electrical Engineering and Computer Science, University of California, Berkeley, Berkeley, CA 94720, USA

Editor: Hui Zou

Abstract

We study a decomposition-based scalable approach to kernel ridge regression, and show that it achieves minimax optimal convergence rates under relatively mild conditions. The method is simple to describe: it randomly partitions a dataset of size N into m subsets of equal size, computes an independent kernel ridge regression estimator for each subset using a careful choice of the regularization parameter, then averages the local solutions into a global predictor. This partitioning leads to a substantial reduction in computation time versus the standard approach of performing kernel ridge regression on all N samples. Our two main theorems establish that despite the computational speed-up, statistical optimality is retained: as long as m is not too large, the partition-based estimator achieves the statistical minimax rate over all estimators using the set of N samples. As concrete examples, our theory guarantees that the number of subsets m may grow nearly linearly for finite-rank or Gaussian kernels and polynomially in N for Sobolev spaces, which in turn allows for substantial reductions in computational cost. We conclude with experiments on both simulated data and a music-prediction task that complement our theoretical results, exhibiting the computational and statistical benefits of our approach.

Keywords: kernel ridge regression, divide and conquer, computation complexity

(c) 2015 Yuchen Zhang, John Duchi and Martin Wainwright.

1. Introduction

In non-parametric regression, the statistician receives N samples of the form $\{(x_i, y_i)\}_{i=1}^N$, where each $x_i \in \mathcal{X}$ is a covariate and $y_i \in \mathbb{R}$ is a real-valued response, and the samples are drawn i.i.d. from some unknown joint distribution $\mathbb{P}$ over $\mathcal{X} \times \mathbb{R}$. The goal is to estimate a function $\hat{f} : \mathcal{X} \to \mathbb{R}$ that can be used to predict future responses based on observing only the covariates. Frequently, the quality of an estimate $\hat{f}$ is measured in terms of the mean-squared prediction error $\mathbb{E}[(\hat{f}(X) - Y)^2]$, in which case the conditional expectation $f^*(x) = \mathbb{E}[Y \mid X = x]$ is optimal. The problem of non-parametric regression is a classical one, and researchers have studied a wide range of estimators (see, for example, the books of Gyorfi et al. (2002), Wasserman (2006), or van de Geer (2000)). One class of methods, known as regularized M-estimators (van de Geer, 2000), are based on minimizing the

combination of a data-dependent loss function with a regularization term. The focus of this paper is a popular M-estimator that combines the least-squares loss with a squared Hilbert norm penalty for regularization. When working in a reproducing kernel Hilbert space (RKHS), the resulting method is known as kernel ridge regression, and is widely used in practice (Hastie et al., 2001; Shawe-Taylor and Cristianini, 2004). Past work has established bounds on the estimation error for RKHS-based methods (Koltchinskii, 2006; Mendelson, 2002a; van de Geer, 2000; Zhang, 2005), which have been refined and extended in more recent work (e.g., Steinwart et al., 2009).

Although the statistical aspects of kernel ridge regression (KRR) are well-understood, the computation of the KRR estimate can be challenging for large datasets. In a standard implementation (Saunders et al., 1998), the kernel matrix must be inverted, which requires $O(N^3)$ time and $O(N^2)$ memory. Such scalings are prohibitive when the sample size N is large. As a consequence, approximations have been designed to avoid the expense of finding an exact minimizer. One family of approaches is based on low-rank approximation of the kernel matrix; examples include kernel PCA (Scholkopf et al., 1998), the incomplete Cholesky decomposition (Fine and Scheinberg, 2001), or Nystrom sampling (Williams and Seeger, 2001). These methods reduce the time complexity to $O(dN^2)$ or $O(d^2 N)$, where $d \ll N$ is the preserved rank. The associated prediction error has only been studied very recently. Concurrent work by Bach (2013) establishes conditions on the maintained rank that still guarantee optimal convergence rates; see the discussion in Section 7 for more detail. A second line of research has considered early-stopping of iterative optimization algorithms for KRR, including gradient descent (Yao et al., 2007; Raskutti et al., 2011) and conjugate gradient methods (Blanchard and Kramer, 2010), where early-stopping provides regularization against over-fitting and improves run-time. If the algorithm stops after t iterations, the aggregate time complexity is $O(tN^2)$.

In this work, we study a different decomposition-based approach. The algorithm is appealing in its simplicity: we partition the dataset of size N randomly into m equal sized subsets, and we compute the kernel ridge regression estimate $\hat{f}_i$ for each of the $i = 1, \ldots, m$ subsets independently, with a careful choice of the regularization parameter. The estimates are then averaged via $\bar{f} = (1/m) \sum_{i=1}^m \hat{f}_i$. Our main theoretical result gives conditions under which the average $\bar{f}$ achieves the minimax rate of convergence over the underlying Hilbert space. Even using naive implementations of KRR, this decomposition gives time and memory complexity scaling as $O(N^3/m^2)$ and $O(N^2/m^2)$, respectively. Moreover, our approach dovetails naturally with parallel and distributed computation: we are guaranteed superlinear speedup with m parallel processors (though we must still communicate the function estimates from each processor). Divide-and-conquer approaches have been studied by several authors, including McDonald et al. (2010) for perceptron-based algorithms, Kleiner et al. (2012) in distributed versions of the bootstrap, and Zhang et al. (2013) for parametric smooth convex optimization problems. This paper demonstrates the potential benefits of divide-and-conquer approaches for non-parametric and infinite-dimensional regression problems.

One difficulty in solving each of the sub-problems independently is how to choose the regularization parameter. Due to the infinite-dimensional nature of non-parametric problems, the choice of regularization parameter must be made with care (e.g., Hastie et al., 2001).
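The subsample-average estimator just described amounts to a handful of dense linear-algebra operations. The Python/NumPy sketch below is a minimal illustration only, not the authors' implementation; the helper names, the naive dense solver, and the example Gaussian kernel are assumptions made for this example, and the `kernel` argument may be any function returning the Gram matrix between two sample sets.

```python
import numpy as np

def krr_fit(X, y, lam, kernel):
    """Local KRR on one subset: solve (K + n * lam * I) alpha = y (illustrative sketch)."""
    n = X.shape[0]
    alpha = np.linalg.solve(kernel(X, X) + n * lam * np.eye(n), y)
    return X, alpha

def fast_krr(X, y, m, lam, kernel, seed=0):
    """Randomly split the data into m equal parts, fit each part with the SAME lam, average."""
    rng = np.random.default_rng(seed)
    splits = np.array_split(rng.permutation(X.shape[0]), m)
    models = [krr_fit(X[idx], y[idx], lam, kernel) for idx in splits]

    def predict(Xtest):
        # Average the m local predictors at the test points.
        return np.mean([kernel(Xtest, Xi) @ ai for Xi, ai in models], axis=0)

    return predict

# Example usage with a (hypothetical) Gaussian kernel on synthetic data.
gaussian = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X[:, 0] + 0.1 * rng.normal(size=1000)
predict = fast_krr(X, y, m=8, lam=1e-3, kernel=gaussian)
print(predict(X[:5]))
```

Note that the same regularization parameter is passed to every subproblem; as discussed below, it should be set as though all N samples were available on a single machine, which is precisely the deliberate under-regularization analyzed in this paper.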
An interesting consequence of our theoretical analysis is in demonstrating that, even though each partitioned sub-problem is based only on the fraction N/m of samples, it is nonetheless essential to regularize the partitioned sub-problems as though they had all N samples. Consequently, from a local point of view, each sub-problem is under-regularized. This under-regularization allows the bias of each local estimate to be very small, but it causes a detrimental blow-up in the variance. However, as we prove, the m-fold averaging

underlying the method reduces variance enough that the resulting estimator $\bar{f}$ still attains optimal convergence rate.

The remainder of this paper is organized as follows. We begin in Section 2 by providing background on the kernel ridge regression estimate and discussing the assumptions that underlie our analysis. In Section 3, we present our main theorems on the mean-squared error between the averaged estimate $\bar{f}$ and the optimal regression function $f^*$. We provide both a result when the regression function $f^*$ belongs to the Hilbert space $\mathcal{H}$ associated with the kernel, as well as a more general oracle inequality that holds for a general $f^*$. We then provide several corollaries that exhibit concrete consequences of the results, including convergence rates of $r/N$ for kernels with finite rank $r$, and convergence rates of $N^{-2\nu/(2\nu+1)}$ for estimation of functionals in a Sobolev space with $\nu$ degrees of smoothness. As we discuss, both of these estimation rates are minimax-optimal and hence unimprovable. We devote Sections 4 and 5 to the proofs of our results, deferring more technical aspects of the analysis to appendices. Lastly, we present simulation results in Section 6.1 to further explore our theoretical results, while Section 6.2 contains experiments with a reasonably large music prediction experiment.

2. Background and problem formulation

We begin with the background and notation required for a precise statement of our problem.

2.1 Reproducing kernels

The method of kernel ridge regression is based on the idea of a reproducing kernel Hilbert space. We provide only a very brief coverage of the basics here, referring the reader to one of the many books on the topic (Wahba, 1990; Shawe-Taylor and Cristianini, 2004; Berlinet and Thomas-Agnan, 2004; Gu, 2002) for further details. Any symmetric and positive semidefinite kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ defines a reproducing kernel Hilbert space (RKHS for short). For a given distribution $\mathbb{P}$ on $\mathcal{X}$, the Hilbert space is strictly contained in $L^2(\mathbb{P})$. For each $x \in \mathcal{X}$, the function $z \mapsto K(z, x)$ is contained within the Hilbert space $\mathcal{H}$; moreover, the Hilbert space is endowed with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ such that $K(\cdot, x)$ acts as the representer of evaluation, meaning
$$\langle f, K(x, \cdot) \rangle_{\mathcal{H}} = f(x) \quad \text{for } f \in \mathcal{H}. \quad (1)$$
We let $\|g\|_{\mathcal{H}} := \sqrt{\langle g, g \rangle_{\mathcal{H}}}$ denote the norm in $\mathcal{H}$, and similarly $\|g\|_2 := (\int_{\mathcal{X}} g(x)^2 \, d\mathbb{P}(x))^{1/2}$ denotes the norm in $L^2(\mathbb{P})$. Under suitable regularity conditions, Mercer's theorem guarantees that the kernel has an eigen-expansion of the form
$$K(x, x') = \sum_{j=1}^{\infty} \mu_j \phi_j(x) \phi_j(x'),$$
where $\mu_1 \ge \mu_2 \ge \cdots \ge 0$ is a non-negative sequence of eigenvalues, and $\{\phi_j\}_{j=1}^{\infty}$ is an orthonormal basis for $L^2(\mathbb{P})$.

From the reproducing relation (1), we have $\langle \phi_j, \phi_j \rangle_{\mathcal{H}} = 1/\mu_j$ for any $j$ and $\langle \phi_j, \phi_{j'} \rangle_{\mathcal{H}} = 0$ for any $j \ne j'$. For any $f \in \mathcal{H}$, by defining the basis coefficients $\theta_j = \langle f, \phi_j \rangle_{L^2(\mathbb{P})}$ for $j = 1, 2, \ldots$, we can expand the function in terms of these coefficients as $f = \sum_{j=1}^{\infty} \theta_j \phi_j$, and simple calculations show that
$$\|f\|_2^2 = \int_{\mathcal{X}} f^2(x) \, d\mathbb{P}(x) = \sum_{j=1}^{\infty} \theta_j^2, \quad \text{and} \quad \|f\|_{\mathcal{H}}^2 = \langle f, f \rangle_{\mathcal{H}} = \sum_{j=1}^{\infty} \frac{\theta_j^2}{\mu_j}.$$

Consequently, we see that the RKHS can be viewed as an elliptical subset of the sequence space $\ell^2(\mathbb{N})$ as defined by the non-negative eigenvalues $\{\mu_j\}_{j=1}^{\infty}$.

2.2 Kernel ridge regression

Suppose that we are given a data set $\{(x_i, y_i)\}_{i=1}^N$ consisting of N i.i.d. samples drawn from an unknown distribution $\mathbb{P}$ over $\mathcal{X} \times \mathbb{R}$, and our goal is to estimate the function that minimizes the mean-squared error $\mathbb{E}[(f(X) - Y)^2]$, where the expectation is taken jointly over $(X, Y)$ pairs. It is well-known that the optimal function is the conditional mean $f^*(x) := \mathbb{E}[Y \mid X = x]$. In order to estimate the unknown function $f^*$, we consider an M-estimator that is based on minimizing a combination of the least-squares loss defined over the dataset with a weighted penalty based on the squared Hilbert norm,
$$\hat{f} := \operatorname*{argmin}_{f \in \mathcal{H}} \Big\{ \frac{1}{N} \sum_{i=1}^N (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal{H}}^2 \Big\}, \quad (2)$$
where $\lambda > 0$ is a regularization parameter. When $\mathcal{H}$ is a reproducing kernel Hilbert space, then the estimator (2) is known as the kernel ridge regression estimate, or KRR for short. It is a natural generalization of the ordinary ridge regression estimate (Hoerl and Kennard, 1970) to the non-parametric setting.

By the representer theorem for reproducing kernel Hilbert spaces (Wahba, 1990), any solution to the KRR program (2) must belong to the linear span of the kernel functions $\{K(\cdot, x_i), i = 1, \ldots, N\}$. This fact allows the computation of the KRR estimate to be reduced to an N-dimensional quadratic program, involving the $N^2$ entries of the kernel matrix $\{K(x_i, x_j), i, j = 1, \ldots, N\}$. On the statistical side, a line of past work (van de Geer, 2000; Zhang, 2005; Caponnetto and De Vito, 2007; Steinwart et al., 2009; Hsu et al., 2012) has provided bounds on the estimation error of $\hat{f}$ as a function of N and $\lambda$.

3. Main results and their consequences

We now turn to the description of our algorithm, followed by the statements of our main results, namely Theorems 1 and 2. Each theorem provides an upper bound on the mean-squared prediction error for any trace class kernel. The second theorem is of oracle type, meaning that it applies even when the true regression function $f^*$ does not belong to the Hilbert space $\mathcal{H}$, and hence involves a combination of approximation and estimation error terms. The first theorem requires that $f^* \in \mathcal{H}$, and provides somewhat sharper bounds on the estimation error in this case. Both of these theorems apply to any trace class kernel, but as we illustrate, they provide concrete results when applied to specific classes of kernels. Indeed, as a corollary, we establish that our distributed KRR algorithm achieves minimax-optimal rates for three different kernel classes, namely finite-rank, Gaussian, and Sobolev.

3.1 Algorithm and assumptions

The divide-and-conquer algorithm Fast-KRR is easy to describe. Rather than solving the kernel ridge regression problem (2) on all N samples, the Fast-KRR method executes the following three steps:

1. Divide the set of samples $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ evenly and uniformly at random into the m disjoint subsets $S_1, \ldots, S_m \subset \mathcal{X} \times \mathbb{R}$, such that every subset contains $N/m$ samples.

2. For each $i = 1, 2, \ldots, m$, compute the local KRR estimate
$$\hat{f}_i := \operatorname*{argmin}_{f \in \mathcal{H}} \Big\{ \frac{1}{|S_i|} \sum_{(x, y) \in S_i} (f(x) - y)^2 + \lambda \|f\|_{\mathcal{H}}^2 \Big\}. \quad (3)$$

3. Average together the local estimates and output $\bar{f} = \frac{1}{m} \sum_{i=1}^m \hat{f}_i$.

This description actually provides a family of estimators, one for each choice of the regularization parameter $\lambda > 0$. Our main result applies to any choice of $\lambda$, while our corollaries for specific kernel classes optimize $\lambda$ as a function of the kernel.

We now describe our main assumptions. Our first assumption, for which we have two variants, deals with the tail behavior of the basis functions $\{\phi_j\}_{j=1}^{\infty}$.

Assumption A For some $k \ge 2$, there is a constant $\rho < \infty$ such that $\mathbb{E}[\phi_j(X)^{2k}] \le \rho^{2k}$ for all $j \in \mathbb{N}$.

In certain cases, we show that sharper error guarantees can be obtained by enforcing a stronger condition of uniform boundedness.

Assumption A' There is a constant $\rho < \infty$ such that $\sup_{x \in \mathcal{X}} |\phi_j(x)| \le \rho$ for all $j \in \mathbb{N}$.

Assumption A' holds, for example, when the input $x$ is drawn from a closed interval and the kernel is translation invariant, i.e. $K(x, x') = \psi(x - x')$ for some even function $\psi$. Given input space $\mathcal{X}$ and kernel $K$, the assumption is verifiable without the data.

Recalling that $f^*(x) := \mathbb{E}[Y \mid X = x]$, our second assumption involves the deviations of the zero-mean noise variables $Y - f^*(x)$. In the simplest case, when $f^* \in \mathcal{H}$, we require only a bounded variance condition:

Assumption B The function $f^* \in \mathcal{H}$, and for $x \in \mathcal{X}$, we have $\mathbb{E}[(Y - f^*(x))^2 \mid x] \le \sigma^2$.

When the function $f^* \notin \mathcal{H}$, we require a slightly stronger variant of this assumption. For each $\lambda \ge 0$, define
$$f^*_\lambda = \operatorname*{argmin}_{f \in \mathcal{H}} \Big\{ \mathbb{E}\big[(f(X) - Y)^2\big] + \lambda \|f\|_{\mathcal{H}}^2 \Big\}. \quad (4)$$
Note that $f^* = f^*_0$ corresponds to the usual regression function. As $f^* \in L^2(\mathbb{P})$, for each $\lambda \ge 0$, the associated mean-squared error $\sigma_\lambda^2(x) := \mathbb{E}[(Y - f^*_\lambda(x))^2 \mid x]$ is finite for almost every $x$. In this more general setting, the following assumption replaces Assumption B:

Assumption B' For any $\lambda \ge 0$, there exists a constant $\tau_\lambda < \infty$ such that $\tau_\lambda^4 = \mathbb{E}[\sigma_\lambda^4(X)]$.

3.2 Statement of main results

With these assumptions in place, we are now ready for the statements of our main results. All of our results give bounds on the mean-squared estimation error $\mathbb{E}[\|\bar{f} - f^*\|_2^2]$ associated with the averaged estimate $\bar{f}$ based on assigning $n = N/m$ samples to each of m machines. Both theorem statements involve the following three kernel-related quantities:
$$\operatorname{tr}(K) := \sum_{j=1}^{\infty} \mu_j, \quad \gamma(\lambda) := \sum_{j=1}^{\infty} \frac{1}{1 + \lambda/\mu_j}, \quad \text{and} \quad \beta_d = \sum_{j=d+1}^{\infty} \mu_j. \quad (5)$$
The first quantity is the kernel trace, which serves as a crude estimate of the size of the kernel operator, and is assumed to be finite. The second quantity $\gamma(\lambda)$, familiar from previous

work on kernel regression (Zhang, 2005), is the effective dimensionality of the kernel $K$ with respect to $L^2(\mathbb{P})$. Finally, the quantity $\beta_d$ is parameterized by a positive integer $d$ that we may choose in applying the bounds, and it describes the tail decay of the eigenvalues of $K$. For $d = 0$, note that $\beta_0 = \operatorname{tr}(K)$. Finally, both theorems involve a quantity that depends on the number of moments $k$ in Assumption A:
$$b(n, d, k) := \max\Big\{ \sqrt{\max\{k, \log(d)\}}, \; \frac{\max\{k, \log(d)\}}{n^{1/2 - 1/k}} \Big\}. \quad (6)$$
Here the integer $d \in \mathbb{N}$ is a free parameter that may be optimized to obtain the sharpest possible upper bound. (The algorithm's execution is independent of $d$.)

Theorem 1 With $f^* \in \mathcal{H}$ and under Assumptions A and B, the mean-squared error of the averaged estimate $\bar{f}$ is upper bounded as
$$\mathbb{E}\big[\|\bar{f} - f^*\|_2^2\big] \le \Big(8 + \frac{12}{m}\Big) \lambda \|f^*\|_{\mathcal{H}}^2 + \frac{12 \sigma^2 \gamma(\lambda)}{N} + \inf_{d \in \mathbb{N}} \big\{ T_1(d) + T_2(d) + T_3(d) \big\}, \quad (7)$$
where
$$T_1(d) = \frac{8 \rho^4 \|f^*\|_{\mathcal{H}}^2 \operatorname{tr}(K) \beta_d}{\lambda}, \qquad T_2(d) = 4 \Big( \|f^*\|_{\mathcal{H}}^2 + \frac{\sigma^2}{\lambda m} \Big) \Big( \mu_{d+1} + \frac{\rho^4 \operatorname{tr}(K) \beta_d}{\lambda} \Big), \quad \text{and}$$
$$T_3(d) = C \, b(n, d, k) \Big( \frac{\rho^2 \gamma(\lambda)}{\sqrt{n}} \Big)^k \Big( \mu_0 \|f^*\|_{\mathcal{H}}^2 + \frac{\sigma^2}{\lambda m} + \frac{4 \|f^*\|_{\mathcal{H}}^2}{m} \Big),$$
and $C$ denotes a universal (numerical) constant.

Theorem 1 is a general result that applies to any trace-class kernel. Although the statement appears somewhat complicated at first sight, it yields concrete and interpretable guarantees on the error when specialized to particular kernels, as we illustrate in Section 3.3. Before doing so, let us make a few heuristic arguments in order to provide intuition.

In typical settings, the term $T_3(d)$ goes to zero quickly: if the number of moments $k$ is suitably large and the number of partitions $m$ is small, say enough to guarantee that $b(n, d, k)(\rho^2 \gamma(\lambda)/\sqrt{n})^k = O(1/N)$, it will be of lower order. As for the remaining terms, at a high level, we show that an appropriate choice of the free parameter $d$ leaves the first two terms in the upper bound (7) dominant. Note that the terms $\mu_{d+1}$ and $\beta_d$ are decreasing in $d$ while the term $b(n, d, k)$ increases with $d$. However, the increasing term $b(n, d, k)$ grows only logarithmically in $d$, which allows us to choose a fairly large value without a significant penalty. As we show in our corollaries, for many kernels of interest, as long as the number of machines $m$ is not too large, this tradeoff is such that $T_1(d)$ and $T_2(d)$ are also of lower order compared to the first two terms in the bound (7). In such settings, Theorem 1 guarantees an upper bound of the form
$$\mathbb{E}\big[\|\bar{f} - f^*\|_2^2\big] = O(1) \Big( \underbrace{\lambda \|f^*\|_{\mathcal{H}}^2}_{\text{squared bias}} + \underbrace{\frac{\sigma^2 \gamma(\lambda)}{N}}_{\text{variance}} \Big). \quad (8)$$
This inequality reveals the usual bias-variance trade-off in non-parametric regression; choosing a smaller value of $\lambda > 0$ reduces the first squared bias term, but increases the second variance term. Consequently, the setting of $\lambda$ that minimizes the sum of these two terms is defined by the relationship
$$\lambda \|f^*\|_{\mathcal{H}}^2 \simeq \frac{\sigma^2 \gamma(\lambda)}{N}. \quad (9)$$

This type of fixed point equation is familiar from work on oracle inequalities and local complexity measures in empirical process theory (Bartlett et al., 2005; Koltchinskii, 2006; van de Geer, 2000; Zhang, 2005), and when $\lambda$ is chosen so that the fixed point equation (9) holds, this (typically) yields minimax optimal convergence rates (Bartlett et al., 2005; Koltchinskii, 2006; Zhang, 2005; Caponnetto and De Vito, 2007). In Section 3.3, we provide detailed examples in which the choice specified by equation (9), followed by application of Theorem 1, yields minimax-optimal prediction error (for the Fast-KRR algorithm) for many kernel classes.

We now turn to an error bound that applies without requiring that $f^* \in \mathcal{H}$. In order to do so, we introduce an auxiliary variable $\bar{\lambda} \in [0, \lambda]$ for use in our analysis (the algorithm's execution does not depend on $\bar{\lambda}$, and in our ensuing bounds we may choose any $\bar{\lambda} \in [0, \lambda]$ to give the sharpest possible results). Let the radius $R = \|f^*_{\bar{\lambda}}\|_{\mathcal{H}}$, where the population (regularized) regression function $f^*_{\bar{\lambda}}$ was previously defined (4). The theorem requires a few additional conditions to those in Theorem 1, involving the quantities $\operatorname{tr}(K)$, $\gamma(\lambda)$ and $\beta_d$ defined in Eq. (5), as well as the error moment $\tau_{\bar{\lambda}}$ from Assumption B'. We assume that the triplet $(m, d, k)$ of positive integers satisfies the conditions
$$\beta_d \le \frac{\lambda}{(R^2 + \tau_{\bar{\lambda}}^2/\lambda) N}, \quad \mu_{d+1} \le \frac{1}{(R^2 + \tau_{\bar{\lambda}}^2/\lambda) N}, \quad m \le \min\Big\{ \frac{\sqrt{N}}{\rho^2 \gamma(\lambda) \log(d)}, \; \frac{N^{\frac{k-2}{k}}}{(R^2 + \tau_{\bar{\lambda}}^2/\lambda)^{2/k} \big(b(n, d, k) \rho^2 \gamma(\lambda)\big)^2} \Big\}. \quad (10)$$
We then have the following result:

Theorem 2 Under condition (10), Assumption A with $k \ge 4$, and Assumption B', for any $\bar{\lambda} \in [0, \lambda]$ and $q > 0$ we have
$$\mathbb{E}\big[\|\bar{f} - f^*\|_2^2\big] \le \Big(1 + \frac{1}{q}\Big) \inf_{\|f\|_{\mathcal{H}} \le R} \|f - f^*\|_2^2 + (1 + q)\, \mathcal{E}(N, m, \lambda, R, \rho), \quad (11)$$
where the residual term is given by
$$\mathcal{E}(N, m, \lambda, R, \rho) := \Big(4 + \frac{C}{m}\Big) \lambda R^2 + \frac{C \gamma(\lambda) \rho^2 \tau_{\bar{\lambda}}^2}{N} + \frac{C}{N}, \quad (12)$$
and $C$ denotes a universal (numerical) constant.

Remarks: Theorem 2 is an oracle inequality, as it upper bounds the mean-squared error in terms of the error $\inf_{\|f\|_{\mathcal{H}} \le R} \|f - f^*\|_2^2$, which may only be obtained by an oracle knowing the sampling distribution $\mathbb{P}$, along with the residual error term (12).

In some situations, it may be difficult to verify Assumption B'. In such scenarios, an alternative condition suffices. For instance, if there exists a constant $\kappa < \infty$ such that $\mathbb{E}[Y^4] \le \kappa^4$, then under condition (10), the bound (11) holds with $\tau_{\bar{\lambda}}^2$ replaced by $\sqrt{8 \operatorname{tr}(K)^2 R^4 \rho^4 + 8 \kappa^4}$, that is, with the alternative residual error
$$\tilde{\mathcal{E}}(N, m, \lambda, R, \rho) := \Big(4 + \frac{C}{m}\Big) \lambda R^2 + \frac{C \gamma(\lambda) \rho^2 \sqrt{8 \operatorname{tr}(K)^2 R^4 \rho^4 + 8 \kappa^4}}{N} + \frac{C}{N}. \quad (13)$$
In essence, if the response variable $Y$ has sufficiently many moments, the prediction mean-square error $\tau_{\bar{\lambda}}^2$ in the statement of Theorem 2 can be replaced by constants related to the size of $\|f^*_{\bar{\lambda}}\|_{\mathcal{H}}$. See Section 5.2 for a proof of inequality (13).
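To make the balancing relationship (9) concrete, one can compute the effective dimension $\gamma(\lambda) = \sum_j \mu_j/(\mu_j + \lambda)$ from a truncated eigenvalue sequence and locate the $\lambda$ at which $\lambda \|f^*\|_{\mathcal{H}}^2$ matches $\sigma^2 \gamma(\lambda)/N$. The sketch below is a hypothetical illustration only; the truncation length, the geometric bisection, and the unit values of $\sigma^2$ and $\|f^*\|_{\mathcal{H}}^2$ are assumptions made for the example, not part of the paper. For eigenvalues $\mu_j \propto j^{-2}$ it recovers the $\lambda \asymp N^{-2/3}$ scaling that appears in Corollary 4 with $\nu = 1$.

```python
import numpy as np

def effective_dimension(mu, lam):
    """gamma(lambda) = sum_j mu_j / (mu_j + lambda), truncated to the eigenvalues provided."""
    return np.sum(mu / (mu + lam))

def balance_lambda(mu, N, sigma2=1.0, norm_H2=1.0, lo=1e-12, hi=1.0, iters=100):
    """Geometric bisection on the fixed point lam * ||f*||_H^2 = sigma^2 * gamma(lam) / N (eq. 9)."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if mid * norm_H2 > sigma2 * effective_dimension(mu, mid) / N:
            hi = mid   # bias term dominates, so decrease lambda
        else:
            lo = mid
    return np.sqrt(lo * hi)

# Example: first-order Sobolev-type decay mu_j ~ j^{-2}; theory predicts lambda ~ N^{-2/3}.
mu = 1.0 / np.arange(1, 10_000, dtype=float) ** 2
for N in (256, 1024, 4096):
    print(N, balance_lambda(mu, N), N ** (-2.0 / 3.0))
```

The bisection is valid because the left-hand side of (9) is increasing in $\lambda$ while the right-hand side is decreasing, so the two curves cross exactly once.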

In comparison with Theorem 1, Theorem 2 provides somewhat looser bounds. It is, however, instructive to consider a few special cases. For the first, we may assume that $f^* \in \mathcal{H}$, in which case $\|f^*\|_{\mathcal{H}} < \infty$. In this setting, the choice $\bar{\lambda} = 0$ (essentially) recovers Theorem 1, since there is no approximation error. Taking $q \to 0$, we are thus left with the bound
$$\mathbb{E}\big[\|\bar{f} - f^*\|_2^2\big] \lesssim \lambda \|f^*\|_{\mathcal{H}}^2 + \frac{\gamma(\lambda) \rho^2 \tau_0^2}{N}, \quad (14)$$
where $\lesssim$ denotes an inequality up to constants. By inspection, this bound is roughly equivalent to Theorem 1; see in particular the decomposition (8). On the other hand, when the condition $f^* \in \mathcal{H}$ fails to hold, we can take $\bar{\lambda} = \lambda$, and then choose $q$ to balance between the familiar approximation and estimation errors: we have
$$\mathbb{E}\big[\|\bar{f} - f^*\|_2^2\big] \le \Big(1 + \frac{1}{q}\Big) \underbrace{\inf_{\|f\|_{\mathcal{H}} \le R} \|f - f^*\|_2^2}_{\text{approximation}} + (1 + q) \underbrace{\Big( \lambda R^2 + \frac{\gamma(\lambda) \rho^2 \tau_\lambda^2}{N} \Big)}_{\text{estimation}}. \quad (15)$$
Relative to Theorem 1, the condition (10) required to apply Theorem 2 involves constraints on the number $m$ of subsampled data sets that are more restrictive. In particular, when ignoring constants and logarithm terms, the quantity $m$ may grow at rate $\sqrt{N}/\gamma(\lambda)$. By contrast, Theorem 1 allows $m$ to grow as quickly as $N/\gamma^2(\lambda)$ (recall the remarks on $T_3(d)$ following Theorem 1, or look ahead to condition (28)). Thus, at least in our current analysis, generalizing to the case that $f^* \notin \mathcal{H}$ prevents us from dividing the data into finer subsets.

3.3 Some consequences

We now turn to deriving some explicit consequences of our main theorems for specific classes of reproducing kernel Hilbert spaces. In each case, our derivation follows the broad outline given in the remarks following Theorem 1: we first choose the regularization parameter $\lambda$ to balance the bias and variance terms, and then show, by comparison to known minimax lower bounds, that the resulting upper bound is optimal. Finally, we derive an upper bound on the number of subsampled data sets $m$ for which the minimax optimal convergence rate can still be achieved. Throughout this section, we assume that $f^* \in \mathcal{H}$.

3.3.1 Finite-rank kernels

Our first corollary applies to problems for which the kernel has finite rank $r$, meaning that its eigenvalues satisfy $\mu_j = 0$ for all $j > r$. Examples of such finite rank kernels include the linear kernel $K(x, x') = \langle x, x' \rangle_{\mathbb{R}^d}$, which has rank at most $r = d$; and the kernel $K(x, x') = (1 + x x')^m$ generating polynomials of degree $m$, which has rank at most $r = m + 1$.

Corollary 3 For a kernel with rank $r$, consider the output of the Fast-KRR algorithm with $\lambda = r/N$. Suppose that Assumption B and Assumption A (or A') hold, and that the number of processors $m$ satisfies the bound
$$m \le c \frac{N^{\frac{k-4}{k-2}}}{r^{\frac{2k-2}{k-2}} \rho^{\frac{4k}{k-2}} \log^{\frac{k}{k-2}} r} \quad \text{(Assumption A)} \qquad \text{or} \qquad m \le c \frac{N}{r^2 \rho^4 \log N} \quad \text{(Assumption A')},$$
where $c$ is a universal (numerical) constant. For suitably large $N$, the mean-squared error is bounded as
$$\mathbb{E}\big[\|\bar{f} - f^*\|_2^2\big] = O(1)\, \frac{\sigma^2 r}{N}. \quad (16)$$

For finite-rank kernels, the rate (16) is known to be minimax-optimal, meaning that there is a universal constant $c' > 0$ such that
$$\inf_{\tilde{f}} \sup_{\|f^*\|_{\mathcal{H}} \le 1} \mathbb{E}\big[\|\tilde{f} - f^*\|_2^2\big] \ge c' \frac{r}{N}, \quad (17)$$
where the infimum ranges over all estimators $\tilde{f}$ based on observing all $N$ samples (and with no constraints on memory and/or computation). This lower bound follows from Theorem 2(a) of Raskutti et al. (2012) with $s = d = 1$.

3.3.2 Polynomially decaying eigenvalues

Our next corollary applies to kernel operators with eigenvalues that obey a bound of the form
$$\mu_j \le C j^{-2\nu} \quad \text{for all } j = 1, 2, \ldots, \quad (18)$$
where $C$ is a universal constant, and $\nu > 1/2$ parameterizes the decay rate. We note that equation (5) assumes a finite kernel trace $\operatorname{tr}(K) := \sum_{j=1}^{\infty} \mu_j$. Since $\operatorname{tr}(K)$ appears in Theorem 1, it is natural to use $\sum_{j=1}^{\infty} C j^{-2\nu}$ as an upper bound on $\operatorname{tr}(K)$. This upper bound is finite if and only if $\nu > 1/2$.

Kernels with polynomially decaying eigenvalues include those that underlie the Sobolev spaces with different orders of smoothness (e.g. Birman and Solomjak, 1967; Gu, 2002). As a concrete example, the first-order Sobolev kernel $K(x, x') = 1 + \min\{x, x'\}$ generates an RKHS of Lipschitz functions with smoothness $\nu = 1$. Other higher-order Sobolev kernels also exhibit polynomial eigendecay with larger values of the parameter $\nu$.

Corollary 4 For any kernel with $\nu$-polynomial eigendecay (18), consider the output of the Fast-KRR algorithm with $\lambda = (1/N)^{\frac{2\nu}{2\nu+1}}$. Suppose that Assumption B and Assumption A (or A') hold, and that the number of processors satisfies the bound
$$m \le c \Big( \frac{N^{\frac{2(k-4)\nu - k}{2\nu+1}}}{\rho^{4k} \log^{k} N} \Big)^{\frac{1}{k-2}} \quad \text{(Assumption A)} \qquad \text{or} \qquad m \le c \frac{N^{\frac{2\nu-1}{2\nu+1}}}{\rho^4 \log N} \quad \text{(Assumption A')},$$
where $c$ is a constant only depending on $\nu$. Then the mean-squared error is bounded as
$$\mathbb{E}\big[\|\bar{f} - f^*\|_2^2\big] = O\Big( \Big(\frac{\sigma^2}{N}\Big)^{\frac{2\nu}{2\nu+1}} \Big). \quad (19)$$
The upper bound (19) is unimprovable up to constant factors, as shown by known minimax bounds on estimation error in Sobolev spaces (Stone, 1982; Tsybakov, 2009); see also Theorem 2(b) of Raskutti et al. (2012).

3.3.3 Exponentially decaying eigenvalues

Our final corollary applies to kernel operators with eigenvalues that obey a bound of the form
$$\mu_j \le c_1 \exp(-c_2 j^2) \quad \text{for all } j = 1, 2, \ldots, \quad (20)$$
for strictly positive constants $(c_1, c_2)$. Such classes include the RKHS generated by the Gaussian kernel $K(x, x') = \exp(-\|x - x'\|_2^2)$.

Corollary 5 For a kernel with sub-Gaussian eigendecay (20), consider the output of the Fast-KRR algorithm with $\lambda = 1/N$. Suppose that Assumption B and Assumption A (or A') hold, and that the number of processors satisfies the bound
$$m \le c \frac{N^{\frac{k-4}{k-2}}}{\rho^{\frac{4k}{k-2}} \log^{\frac{k}{k-2}} N} \quad \text{(Assumption A)} \qquad \text{or} \qquad m \le c \frac{N}{\rho^4 \log N} \quad \text{(Assumption A')},$$
where $c$ is a constant only depending on $c_2$. Then the mean-squared error is bounded as
$$\mathbb{E}\big[\|\bar{f} - f^*\|_2^2\big] = O\Big( \sigma^2 \frac{\sqrt{\log N}}{N} \Big). \quad (21)$$
The upper bound (21) is minimax optimal; see, for example, the recent paper by Yang et al. (2015).

Summary: Each corollary gives a critical threshold for the number $m$ of data partitions: as long as $m$ is below this threshold, the decomposition-based Fast-KRR algorithm gives the optimal rate of convergence. It is interesting to note that the number of splits may be quite large: each threshold grows asymptotically with $N$ whenever the basis functions have more than four moments (viz. Assumption A). Moreover, the Fast-KRR method can attain these optimal convergence rates while using substantially less computation than standard kernel ridge regression methods, as it requires solving problems only of size $N/m$.

3.4 The choice of regularization parameter

In practice, the local sample size on each machine may be different and the optimal choice for the regularization $\lambda$ may not be known a priori, so that an adaptive choice of the regularization parameter $\lambda$ is desirable (e.g. Tsybakov, 2009, Chapters 3.5-3.7). We recommend using cross-validation to choose the regularization parameter, and we now sketch a heuristic argument that an adaptive algorithm using cross-validation may achieve optimal rates of convergence. (We leave fuller analysis to future work.)

Let $\lambda^*$ be the (oracle) optimal regularization parameter given knowledge of the sampling distribution $\mathbb{P}$ and eigen-structure of the kernel $K$. We assume (cf. Corollary 4) that there is a constant $\nu > 0$ such that $\lambda^*_n \asymp n^{-\nu}$ as $n \to \infty$. Let $n_i$ be the local sample size for each machine $i$ and $N$ the global sample size; we assume that $n_i \ll N$ (clearly, $N \ge n_i$). First, use local cross-validation to choose regularization parameters $\hat{\lambda}_i$ and $\hat{\lambda}_{i,N}$ corresponding to samples of size $n_i$ and $n_i^2/N$, respectively. Heuristically, if cross validation is successful, we expect to have $\hat{\lambda}_i \approx n_i^{-\nu}$ and $\hat{\lambda}_{i,N} \approx N^{\nu} n_i^{-2\nu}$, yielding that $\hat{\lambda}_i^2 / \hat{\lambda}_{i,N} \approx N^{-\nu}$. With this intuition, we then compute local estimates
$$\hat{f}_i := \operatorname*{argmin}_{f \in \mathcal{H}} \Big\{ \frac{1}{n_i} \sum_{(x, y) \in S_i} (f(x) - y)^2 + \hat{\lambda}^{(i)} \|f\|_{\mathcal{H}}^2 \Big\}, \quad \text{where } \hat{\lambda}^{(i)} := \frac{\hat{\lambda}_i^2}{\hat{\lambda}_{i,N}}, \quad (22)$$
and the global average estimate $\bar{f} = \sum_{i=1}^m \frac{n_i}{N} \hat{f}_i$ as usual. Notably, we have $\hat{\lambda}^{(i)} \approx \lambda^*_N$ in this heuristic setting. Using formula (22) and the average $\bar{f}$, we have
$$\mathbb{E}\big[\|\bar{f} - f^*\|_2^2\big] = \mathbb{E}\Big[ \Big\| \sum_{i=1}^m \frac{n_i}{N} \big(\hat{f}_i - \mathbb{E}[\hat{f}_i]\big) \Big\|_2^2 \Big] + \Big\| \sum_{i=1}^m \frac{n_i}{N} \big(\mathbb{E}[\hat{f}_i] - f^*\big) \Big\|_2^2 \le \sum_{i=1}^m \frac{n_i^2}{N^2} \mathbb{E}\big[\|\hat{f}_i - \mathbb{E}[\hat{f}_i]\|_2^2\big] + \max_{i \in [m]} \big\{ \|\mathbb{E}[\hat{f}_i] - f^*\|_2^2 \big\}. \quad (23)$$

Divide ad Coquer Kerel Ridge Regressio Usig Lemmas 6 ad 7 from the proof of Theorem to come ad assumig that is cocetrated tightly eough aroud, we obtai E[ f i f = O N f H ) by Lemma 6 ad that E[ f i E[ f i = O γ N ) i ) by Lemma 7. Substitutig these bouds ito iequality 3) ad otig that i i = N, we may upper boud the overall estimatio error as [ E f f O) N f H + γ N) N While the derivatio of this upper boud was o-rigorous, we believe that it is roughly accurate, ad i compariso with the previous upper boud 8), it provides optimal rates of covergece. ). 4. Proofs of Theorem ad related results We ow tur to the proofs of Theorem ad Corollaries 3 through 5. This sectio cotais oly a high-level view of proof of Theorem ; we defer more techical aspects to the appedices. 4. Proof of Theorem Usig the defiitio of the averaged estimate f = m m i= f i, a bit of algebra yields E[ f f = E[ f E[ f) + E[ f f ) = E[ f E[ f + E[ f f + E[ f E[ f, E[ f f L P) [ m = E m f i E[ f i ) + E[ f f, i= where we used the fact that E[ f i = E[ f for each i [m. Usig this ubiasedess oce more, we boud the variace of the terms f i E[ f to see that E [ f f = [ m E f E[ f + E[ f f [ m E f f + E[ f f, 4) where we have used the fact that E[ f i miimizes E[ f i f over f H. The error boud 4) suggests our strategy: we upper boud E[ f f ad E[ f f respectively. Based o equatio 3), the estimate f is obtaied from a stadard kerel ridge regressio with sample size = N/m ad ridge parameter. Accordigly, the followig two auxiliary results provide bouds o these two terms, where the reader should recall the defiitios of b, d, k) ad β d from equatio 5). I each lemma, C represets a uiversal umerical) costat. Lemma 6 Bias boud) Uder Assumptios A ad B, for each d =,,..., we have E[ f f 8 f H + 8ρ4 f H trk)β d ) k + Cb, d, k) ρ γ) µ 0 f H. 5)

Zhag, Duchi, Waiwright Lemma 7 Variace boud) Uder Assumptios A ad B, for each d =,,..., we have E[ f f f H + σ γ) ) σ + + 4 f H µ d+ + ρ4 trk)β d ) k ) + Cb, d, k) ρ γ) f. 6) The proofs of these lemmas, cotaied i Appedices A ad B respectively, costitute oe mai techical cotributio of this paper. Give these two lemmas, the remaider of the theorem proof is straightforward. Combiig the iequality 4) with Lemmas 6 ad 7 yields the claim of Theorem. Remarks: The proofs of Lemmas 6 ad 7 are somewhat complex, but to the best of our kowledge, existig literature does ot yield sigificatly simpler proofs. We ow discuss this claim to better situate our techical cotributios. Defie the regularized populatio miimizer f := argmi f H{E[fX) Y ) + f H }. Expadig the decompositio 4) of the L P)-risk ito bias ad variace terms, we obtai the further boud [ E f f E[ f f + [ m E f f = E[ f f [ + f f + E }{{} m }{{} f ) f f f = T + }{{} m T + T 3 ). :=T :=T :=T 3 I this decompositio, T ad T are bias ad approximatio error terms iduced by the regularizatio parameter, while T 3 is a excess risk variace) term icurred by miimizig the empirical loss. This upper boud illustrates three trade-offs i our subsampled ad averaged kerel regressio procedure: The trade-off betwee T ad T 3 : whe the regularizatio parameter grows, the bias term T icreases while the variace term T 3 coverges to zero. The trade-off betwee T ad T 3 : whe the regularizatio parameter grows, the bias term T icreases while the variace term T 3 coverges to zero. The trade-off betwee T ad the computatio time: whe the umber of machies m grows, the bias term T icreases as the local sample size = N/m shriks), while the computatio time N 3 /m decreases. Theoretical results i the KRR literature focus o the trade-off betwee T ad T 3, but i the curret cotext, we also eed a upper boud o the bias term T, which is ot relevat for classical cetralized) aalyses. With this settig i mid, Lemma 6 tightly upper bouds the bias T as a fuctio of ad. A essetial part of the proof is to characterize the properties of E[ f, which is the expectatio of a oparametric empirical loss miimizer. We are ot aware of existig literature o this problem, ad the proof of Lemma 6 itroduces ovel techiques for this purpose. O the other had, Lemma 7 upper bouds E[ f f as a fuctio of ad. Past work has focused o boudig a quatity of this form, but for techical reasos, most work e.g. va de Geer, 000; Medelso, 00b; Bartlett et al., 00; Zhag, 005) focuses o aalyzig the costraied form f i := argmi fx) y), 7) f H C S i x,y) S i

Divide ad Coquer Kerel Ridge Regressio of kerel ridge regressio. While this problem traces out the same set of solutios as that of the regularized kerel ridge regressio estimator 3), it is o-trivial to determie a matched settig of for a give C. Zhag 003) provides oe of the few aalyses of the regularized ridge regressio estimator 3) or )), providig a upper boud of the form E[ f f = O + / ), which is at best O ). I cotrast, Lemma 7 gives upper boud O + γ) ); the effective dimesio γ) is ofte much smaller tha /, yieldig a stroger covergece guaratee. 4. Proof of Corollary 3 We first preset a geeral iequality boudig the size of m for which optimal covergece rates are possible. We assume that d is chose large eough such that we have logd) k ad d N. I the rest of the proof, our assigmet to d will satisfy these iequalities. I this case, ispectio of Theorem shows that if m is small eough that ) k log d N/m ρ γ) m γ) N, the the term T 3 d) provides a covergece rate give by γ)/n. expressio above for m, we fid Thus, solvig the m log d N ρ4 γ) = /k m /k γ) /k N /k or m k k = Takig k )/k-th roots of both sides, we obtai that if m k N k k γ) k k ρ 4 log d. k N, 8) γ) k 4k k k ρ k log k d the the term T 3 d) of the boud 7) is Oγ)/N). Now we apply the boud 8) i the case i the corollary. Let us take d = max{r, N}. Notice that β d = β r = µ r+ = 0. We fid that γ) r sice each of its terms is bouded by, ad we take = r/n. Evaluatig the expressio 8) with this value, we arrive at m N k 4 k. r k 4k k k ρ k log k d If we have sufficietly may momets that k log N, ad N r for example, if the basis fuctios φ j have a uiform boud ρ, the k ca be chose arbitrarily large), the we may take k = log N, which implies that N k 4 k we replace log d with log N. The so log as = ΩN), r k k N m c r ρ 4 log N for some costat c > 0, we obtai a idetical result. = Or ) ad ρ 4k k = Oρ 4 ) ; ad 3

Zhag, Duchi, Waiwright 4.3 Proof of Corollary 4 We follow the program outlied i our remarks followig Theorem. We must first choose o the order of γ)/n. To that ed, we ote that settig = N ν ν+ gives γ) = j= + j ν N ν ν+ N ν+ + j>n ν+ N ν+ + N ν ν+ + j ν N ν ν+ N ν+ du = N ν+ uν + ν N ν+. Dividig by N, we fid that γ)/n, as desired. Now we choose the trucatio parameter d. By choosig d = N t for some t R +, the we fid that µ d+ N νt ad a itegratio yields β d N ν )t. Settig t = 3/ν ) guaratees that µ d+ N 3 ad β d N 3 ; the correspodig terms i the boud 7) are thus egligible. Moreover, we have for ay fiite k that log d k. Applyig the geeral boud 8) o m, we arrive at the iequality m c N N 4ν ν+)k ) N k ) 4k k ν+)k ) ρ k log k N = c N k 4)ν k ν+)k ) ρ 4k k k log k N. Wheever this holds, we have covergece rate = N ν+. Now, let Assumptio A hold. The takig k = log N, the above boud becomes to a multiplicative costat factor) N ν ν+ /ρ 4 log N as claimed. 4.4 Proof of Corollary 5 First, we set = /N. Cosiderig the sum γ) = j= µ j/µ j + ), we see that for j log N)/c, the elemets of the sum are bouded by. For j > log N)/c, we make the approximatio j log N)/c µ j µ j + j log N)/c µ j N ν log N)/c exp c t )dt = O). Thus we fid that γ) + c log N for some costat c. By choosig d = N, we have that the tail sum ad d + )-th eigevalue both satisfy µ d+ β d c N 4. As a cosequece, all the terms ivolvig β d or µ d+ i the boud 7) are egligible. Recallig our iequality 8), we thus fid that uder Assumptio A), as log as the umber of partitios m satisfies m c N k 4 k ρ 4k k, k log k N the covergece rate of f to f is give by γ)/n log N/N. Uder the boudedess assumptio A, as we did i the proof of Corollary 3, we take k = log N i Theorem. By ispectio, this yields the secod statemet of the corollary. 5. Proof of Theorem ad related results I this sectio, we provide the proofs of Theorem, as well as the boud 3) based o the alterative form of the residual error. As i the previous sectio, we preset a high-level proof, deferrig more techical argumets to the appedices. 4

Divide ad Coquer Kerel Ridge Regressio 5. Proof of Theorem We begi by statig ad provig two auxiliary claims: E [ Y fx)) = E [ Y f X)) + f f for ay f L P), ad 9a) f = argmi f f. f H R Let us begi by provig equality 9a). By addig ad subtractig terms, we have E [ Y f X)) = E [ Y f X)) + f f + E[fX) f X))E[Y f X) X i) = E [ Y f X)) + f f, where equality i) follows sice the radom variable Y f X) is mea-zero give X = x. For the secod equality 9b), cosider ay fuctio f i the RKHS that satisfies the boud f H R. The defiitio of the miimizer f guaratees that E [ f X) Y ) + R E[fX) Y ) + f H E[fX) Y ) + R. This result combied with equatio 9a) establishes the equality 9b). 9b) We ow tur to the proof of the theorem. Applyig Hölder s iequality yields that f f + ) f q f + + q) f f = + ) if q f f + + q) f f for all q > 0, 30) f H R where the secod step follows from equality 9b). It thus suffices to upper boud f f, ad followig the deductio of iequality 4), we immediately obtai the decompositio formula [ E f f m E[ f f + E[ f f, 3) where f deotes the empirical miimizer for oe of the subsampled datasets i.e. the stadard KRR solutio o a sample of size = N/m with regularizatio ). This suggests our strategy, which parallels our proof of Theorem : we upper boud E[ f f ad E[ f f, respectively. I the rest of the proof, we let f = f deote this solutio. Let the estimatio error for a subsample be give by = f f. Uder Assumptios A ad B, we have the followig two lemmas boudig expressio 3), which parallel Lemmas 6 ad 7 i the case whe f H. I each lemma, C deotes a uiversal costat. Lemma 8 For all d =,,..., we have [ E 6 ) R + 8γ)ρ τ + 3R 4 + 8τ 4 / µ d+ + 6ρ4 trk)β d + Deotig the right had side of iequality 3) by D, we have ) k ) Cb, d, k) ρ γ). 3) 5

Zhag, Duchi, Waiwright Lemma 9 For all d =,,..., we have E[ 4 ) R + C log d)ρ γ)) D See Appedices C ad D for the proofs of these two lemmas. + 3R 4 + 8τ 4 / µ d+ + 4ρ4 trk)β d Give these two lemmas, we ca ow complete the proof of the theorem. If the coditios 0) hold, we have β d R + τ /)N, µ d+ R + τ /)N, log d)ρ γ)) m ad b, d, k) ρ γ) so there is a uiversal costat C satisfyig ) k R + τ /)N, ) k ) 3R 4 + 8τ 4 / µ d+ + 6ρ4 trk)β d + Cb, d, k) ρ γ) C N. Cosequetly, Lemma 8 yields the upper boud ). 33) E[ 8 ) R + 8γ)ρ τ + C N. Sice log d)ρ γ)) / /m by assumptio, we obtai E [ f f C ) R + Cγ)ρ τ + C m N Nm + 4 ) R + C ) R m + Cγ)ρ τ N + C Nm + C N, where C is a uiversal costat whose value is allowed to chage from lie to lie). Summig these bouds ad usig the coditio that, we coclude that E [ f f 4 + C ) m )R + Cγ)ρ τ + C N N. Combiig this error boud with iequality 30) completes the proof. 5. Proof of the boud 3) Usig Theorem, it suffices to show that τ 4 8 trk) f 4 Hρ 4 + 8κ 4. 34) By the tower property of expectatios ad Jese s iequality, we have τ 4 = E[E[f x) Y ) X = x) E[f X) Y ) 4 8E[f X)) 4 + 8E[Y 4. 6

Since we have assumed that $\mathbb{E}[Y^4] \le \kappa^4$, the only remaining step is to upper bound $\mathbb{E}[(f^*_{\bar{\lambda}}(X))^4]$. Let $f^*_{\bar{\lambda}}$ have expansion $(\theta_1, \theta_2, \ldots)$ in the basis $\{\phi_j\}$. For any $x \in \mathcal{X}$, Hölder's inequality applied with the conjugates 4/3 and 4 implies the upper bound
$$\big(f^*_{\bar{\lambda}}(x)\big)^2 = \Big( \sum_{j=1}^{\infty} \mu_j^{1/4} \theta_j^{1/2} \cdot \frac{\theta_j^{1/2}}{\mu_j^{1/4}} \phi_j(x) \Big)^2 \le \Big[ \Big( \sum_{j=1}^{\infty} \mu_j^{1/3} \theta_j^{2/3} \Big)^{3/4} \Big( \sum_{j=1}^{\infty} \frac{\theta_j^2}{\mu_j} \phi_j^4(x) \Big)^{1/4} \Big]^2. \quad (35)$$
Again applying Hölder's inequality, this time with conjugates 3/2 and 3, to upper bound the first term in the product in inequality (35), we obtain
$$\sum_{j=1}^{\infty} \mu_j^{1/3} \theta_j^{2/3} = \sum_{j=1}^{\infty} \mu_j^{2/3} \Big( \frac{\theta_j^2}{\mu_j} \Big)^{1/3} \le \Big( \sum_{j=1}^{\infty} \mu_j \Big)^{2/3} \Big( \sum_{j=1}^{\infty} \frac{\theta_j^2}{\mu_j} \Big)^{1/3} = \operatorname{tr}(K)^{2/3} \|f^*_{\bar{\lambda}}\|_{\mathcal{H}}^{2/3}. \quad (36)$$
Combining inequalities (35) and (36), we find that
$$\mathbb{E}\big[(f^*_{\bar{\lambda}}(X))^4\big] \le \operatorname{tr}(K)^2 \|f^*_{\bar{\lambda}}\|_{\mathcal{H}}^2 \sum_{j=1}^{\infty} \frac{\theta_j^2}{\mu_j} \mathbb{E}[\phi_j^4(X)] \le \operatorname{tr}(K)^2 \|f^*_{\bar{\lambda}}\|_{\mathcal{H}}^4 \rho^4,$$
where we have used Assumption A. This completes the proof of inequality (34).

6. Experimental results

In this section, we report the results of experiments on both simulated and real-world data designed to test the sharpness of our theoretical predictions.

6.1 Simulation studies

We begin by exploring the empirical performance of our subsample-and-average methods for a non-parametric regression problem on simulated datasets. For all experiments in this section, we simulate data from the regression model $y = f^*(x) + \varepsilon$ for $x \in [0, 1]$, where $f^*(x) := \min(x, 1 - x)$ is 1-Lipschitz, the noise variables $\varepsilon \sim N(0, \sigma^2)$ are normally distributed with variance $\sigma^2 = 1/5$, and the samples $x_i \sim \mathrm{Uni}[0, 1]$. The Sobolev space of Lipschitz functions on $[0, 1]$ has reproducing kernel $K(x, x') = 1 + \min\{x, x'\}$ and norm $\|f\|_{\mathcal{H}}^2 = f^2(0) + \int_0^1 (f'(z))^2 \, dz$. By construction, the function $f^*(x) = \min(x, 1 - x)$ satisfies $\|f^*\|_{\mathcal{H}} = 1$. The kernel ridge regression estimator $\hat{f}$ takes the form
$$\hat{f} = \sum_{i=1}^N \alpha_i K(x_i, \cdot), \quad \text{where } \alpha = (K + \lambda N I)^{-1} y, \quad (37)$$
and $K$ is the $N \times N$ Gram matrix and $I$ is the $N \times N$ identity matrix. Since the first-order Sobolev kernel has eigenvalues (Gu, 2002) that scale as $\mu_j \simeq (1/j)^2$, the minimax convergence rate in terms of squared $L^2(\mathbb{P})$-error is $N^{-2/3}$ (see e.g. Tsybakov (2009); Stone (1982); Caponnetto and De Vito (2007)). By Corollary 4 with $\nu = 1$, this optimal rate of convergence can be achieved by Fast-KRR with regularization parameter $\lambda \approx N^{-2/3}$ as long as the number of partitions $m$ satisfies $m \lesssim N^{1/3}$. In each of our experiments, we begin with a dataset of size $N = mn$, which we partition uniformly at random into $m$ disjoint subsets. We compute the local estimator $\hat{f}_i$ for each of the $m$ subsets using $n$ samples via (37), where the Gram matrix is constructed using the $i$th batch of samples (and $n$ replaces $N$). We then compute

$\bar{f} = (1/m) \sum_{i=1}^m \hat{f}_i$. Our experiments compare the error of $\bar{f}$ as a function of sample size $N$, the number of partitions $m$, and the regularization $\lambda$.

[Figure 1: two panels (a) and (b); y-axis: mean square error (log scale), x-axis: total number of samples N from 256 to 8192; curves for m = 1, 4, 16, 64.]

Figure 1: The squared $L^2(\mathbb{P})$-norm between the averaged estimate $\bar{f}$ and the optimal solution $f^*$. (a) These plots correspond to the output of the Fast-KRR algorithm: each sub-problem is under-regularized by using $\lambda \approx N^{-2/3}$. (b) Analogous plots when each sub-problem is not under-regularized, that is, with $\lambda = n^{-2/3} = (N/m)^{-2/3}$ chosen as if there were only a single dataset of size $n$.

[Figure 2: y-axis: mean square error (log scale), x-axis: log(# of partitions)/log(# of samples) from 0 to 0.8; one curve per sample size N = 256, 512, 1024, 2048, 4096, 8192.]

Figure 2: The mean-square error curves for fixed sample size but varied number of partitions. We are interested in the threshold of partitioning number $m$ under which the optimal rate of convergence is achieved.

In Figure 1(a), we plot the error $\|\bar{f} - f^*\|_2^2$ versus the total number of samples $N$, where $N \in \{2^8, 2^9, \ldots, 2^{13}\}$, using four different data partitions $m \in \{1, 4, 16, 64\}$. We execute each simulation 20 times to obtain standard errors for the plot. The black circled curve ($m = 1$) gives the baseline KRR error; if the number of partitions $m \le 16$, Fast-KRR has accuracy comparable to the baseline algorithm. Even with $m = 64$, Fast-KRR's performance closely matches the full estimator for larger sample sizes ($N \ge 2^{11}$). In the right plot (Figure 1(b)), we perform an identical experiment, but we over-regularize by choosing $\lambda = n^{-2/3}$ rather than $\lambda = N^{-2/3}$ in each of the $m$ sub-problems, combining the local estimates by averaging as usual. In contrast to Figure 1(a), there is an obvious gap between the performance of the algorithms when $m = 1$ and $m > 1$, as our theory predicts.
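The qualitative behavior of Figure 1(a) is easy to reproduce. The self-contained sketch below is not the authors' original Matlab code; the grid size, seed, and replication choices are arbitrary assumptions, and it uses the first-order Sobolev kernel and the under-regularized $\lambda = N^{-2/3}$ from the text.

```python
import numpy as np

def sobolev_kernel(a, b):
    # First-order Sobolev kernel K(x, x') = 1 + min{x, x'} on [0, 1].
    return 1.0 + np.minimum(a.reshape(-1, 1), b.reshape(1, -1))

def fast_krr_predict(x, y, m, lam, grid, rng):
    """Split into m parts, fit each local KRR with the shared lambda, average the predictions."""
    preds = []
    for idx in np.array_split(rng.permutation(x.size), m):
        xi, yi, n = x[idx], y[idx], idx.size
        alpha = np.linalg.solve(sobolev_kernel(xi, xi) + n * lam * np.eye(n), yi)
        preds.append(sobolev_kernel(grid, xi) @ alpha)
    return np.mean(preds, axis=0)

rng = np.random.default_rng(0)
N, grid = 4096, np.linspace(0.0, 1.0, 2000)
x = rng.uniform(size=N)
y = np.minimum(x, 1 - x) + np.sqrt(0.2) * rng.normal(size=N)   # sigma^2 = 1/5
for m in (1, 4, 16, 64):
    fbar = fast_krr_predict(x, y, m, N ** (-2 / 3), grid, rng)  # under-regularized lambda = N^(-2/3)
    print(m, np.mean((fbar - np.minimum(grid, 1 - grid)) ** 2))
```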

[Table 1 reports, for each data size N = 2^12 through 2^17 and each number of partitions m in {1, 16, 64, 256, 1024}, the mean error ||f-bar - f*||_2^2 and the mean run-time in seconds (standard deviation in parentheses); most numeric entries of the table are not recoverable from this transcription.]

Table 1: Timing experiment giving $\|\bar{f} - f^*\|_2^2$ as a function of number of partitions $m$ and data size $N$, providing mean run-time (measured in seconds) for each number $m$ of partitions and data size $N$.

It is also interesting to understand the number of partitions $m$ into which a dataset of size $N$ may be divided while maintaining good statistical performance. According to Corollary 4 with $\nu = 1$, for the first-order Sobolev kernel, performance degradation should be limited as long as $m \lesssim N^{1/3}$. In order to test this prediction, Figure 2 plots the mean-square error $\|\bar{f} - f^*\|_2^2$ versus the ratio $\log(m)/\log(N)$. Our theory predicts that even as the number of partitions $m$ may grow polynomially in $N$, the error should grow only above some constant value of $\log(m)/\log(N)$. As Figure 2 shows, the point that $\|\bar{f} - f^*\|_2^2$ begins to increase appears to be around $\log(m) \approx 0.45 \log(N)$ for reasonably large $N$. This empirical performance is somewhat better than the $(1/3)$ threshold predicted by Corollary 4, but it does confirm that the number of partitions $m$ can scale polynomially with $N$ while retaining minimax optimality.

Our final experiment gives evidence for the improved time complexity partitioning provides. Here we compare the amount of time required to solve the KRR problem using the naive matrix inversion (37) for different partition sizes $m$ and provide the resulting squared errors $\|\bar{f} - f^*\|_2^2$. Although there are more sophisticated solution strategies, we believe this is a reasonable proxy to exhibit Fast-KRR's potential. In Table 1, we present the results of this simulation, which we performed in Matlab using a Windows machine with 16GB of memory and a single-threaded 3.4Ghz processor. In each entry of the table, we give the mean error of Fast-KRR and the mean amount of time it took to run (with standard deviation over 20 simulations in parentheses; the error rate standard deviations are an order of magnitude smaller than the errors, so we do not report them). The entries "Fail" correspond to out-of-memory failures because of the large matrix inversion, while entries "N/A" indicate that $\|\bar{f} - f^*\|_2^2$ was significantly larger than the optimal value (rendering time improvements meaningless). The table shows that without sacrificing accuracy, decomposition via Fast-KRR can yield substantial computational improvements.
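The run-time gains in Table 1 follow from the roughly $N^3/m^2$ cost of the naive dense solver, which is easy to observe directly. The following sketch is only a rough timing proxy under assumed synthetic data and a Gaussian kernel; it is not the benchmark that produced Table 1.

```python
import numpy as np, time

def timed_fast_krr(N, m, d=5, seed=0):
    """Total wall time of the m local solves of size n = N/m with a naive dense solver."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(N, d))
    y = X[:, 0] + 0.1 * rng.normal(size=N)
    lam = 1.0 / N
    start = time.perf_counter()
    for idx in np.array_split(rng.permutation(N), m):
        Xi, n = X[idx], idx.size
        sq = (Xi ** 2).sum(axis=1)
        K = np.exp(-(sq[:, None] + sq[None, :] - 2 * Xi @ Xi.T))  # Gaussian Gram matrix
        np.linalg.solve(K + n * lam * np.eye(n), y[idx])
    return time.perf_counter() - start

for m in (1, 16, 64):
    print(m, round(timed_fast_krr(4096, m), 3))  # expect roughly an m^2-fold reduction in time
```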

[Figure 3: y-axis: mean square error (roughly 80 to 83), x-axis: training runtime in seconds (0 to 1000); one curve each for Fast-KRR, Nystrom sampling, and random feature approximation.]

Figure 3: Results on year prediction on held-out test songs for Fast-KRR, Nystrom sampling, and random feature approximation. Error bars indicate standard deviations over ten experiments.

6.2 Real data experiments

We now turn to the results of experiments studying the performance of Fast-KRR on the task of predicting the year in which a song was released based on audio features associated with the song. We use the Million Song Dataset (Bertin-Mahieux et al., 2011), which consists of 463,715 training examples and a second set of 51,630 testing examples. Each example is a song (track) released between 1922 and 2011, and the song is represented as a vector of timbre information computed about the song. Each sample consists of the pair $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$, where $x_i \in \mathbb{R}^d$ is a $d = 90$-dimensional vector and $y_i \in [1922, 2011]$ is the year in which the song was released. (For further details, see Bertin-Mahieux et al. (2011).)

Our experiments with this dataset use the Gaussian radial basis kernel
$$K(x, x') = \exp\Big( -\frac{\|x - x'\|_2^2}{2\sigma^2} \Big). \quad (38)$$
We normalize the feature vectors $x$ so that the timbre signals have standard deviation 1, and select the bandwidth parameter $\sigma = 6$ via cross-validation. For regularization, we set $\lambda = N^{-1}$; since the Gaussian kernel has exponentially decaying eigenvalues (for typical distributions on $\mathcal{X}$), Corollary 5 shows that this regularization achieves the optimal rate of convergence for the Hilbert space.

In Figure 3, we compare the time-accuracy curve of Fast-KRR with two approximation-based methods, plotting the mean-squared error between the predicted release year and the actual year on test songs. The first baseline is Nystrom subsampling (Williams and Seeger, 2001), where the kernel matrix is approximated by a low-rank matrix of rank $r \in \{1, \ldots, 6\} \times 10^3$. The second baseline approach is an approximate form of kernel ridge regression using random features (Rahimi and Recht, 2007). The algorithm approximates the Gaussian kernel (38) by the inner product of two random feature vectors of dimensions $D \in \{2, 3, 5, 7, 8.5, 10\} \times 10^3$, and then solves the resulting linear regression problem. For the Fast-KRR algorithm, we use seven partitions $m \in \{32, 38, 48, 64, 96, 128, 256\}$ to test

the algorithm. Each algorithm is executed 10 times to obtain standard deviations (plotted as error-bars in Figure 3).

As we see in Figure 3, for a fixed time budget, Fast-KRR enjoys the best performance, though the margin between Fast-KRR and Nystrom sampling is not substantial. In spite of this close performance between Nystrom sampling and the divide-and-conquer Fast-KRR algorithm, it is worth noting that with parallel computation, it is trivial to accelerate Fast-KRR m times; parallelizing approximation-based methods appears to be a non-trivial task. Moreover, as our results in Section 3 indicate, Fast-KRR is minimax optimal in many regimes. At the same time the conference version of this paper was submitted, Bach (2013) published the first results we know of establishing convergence results in $\ell^2$-error for Nystrom sampling; see the discussion for more detail. We note in passing that standard linear regression with the original 90 features, while quite fast (with runtime on the order of 1 second, ignoring data loading), has mean-squared-error 90.44, which is significantly worse than the kernel-based methods.

[Figure 4: y-axis: mean square error (roughly 80 to 90), x-axis: number of partitions m in {32, 38, 48, 64, 96, 128, 256}; one curve for Fast-KRR and one for KRR with 1/m of the data.]

Figure 4: Comparison of the performance of Fast-KRR to a standard KRR estimator using a fraction 1/m of the data.

Our final experiment provides a sanity check: is the final averaging step in Fast-KRR even necessary? To this end, we compare Fast-KRR with standard KRR using a fraction 1/m of the data. For the latter approach, we employ the standard regularization $\lambda = (N/m)^{-1}$. As Figure 4 shows, Fast-KRR achieves much lower error rates than KRR using only a fraction of the data. Moreover, averaging stabilizes the estimators: the standard deviations of the performance of Fast-KRR are negligible compared to those for standard KRR.

7. Discussion

In this paper, we present results establishing that our decomposition-based algorithm for kernel ridge regression achieves minimax optimal convergence rates whenever the number of splits m of the data is not too large. The error guarantees of our method depend on the effective dimensionality $\gamma(\lambda) = \sum_{j=1}^{\infty} \mu_j/(\mu_j + \lambda)$ of the kernel. For any number of splits

$m \lesssim N/\gamma^2(\lambda)$, our method achieves estimation error decreasing as
$$\mathbb{E}\big[\|\bar{f} - f^*\|_2^2\big] \lesssim \lambda \|f^*\|_{\mathcal{H}}^2 + \frac{\sigma^2 \gamma(\lambda)}{N}.$$
(In particular, recall the bound (8) following Theorem 1.) Notably, this convergence rate is minimax optimal, and we achieve substantial computational benefits from the subsampling schemes, in that computational cost scales nearly linearly in N.

It is also interesting to consider the number of kernel evaluations required to implement our method. Our estimator requires m sub-matrices of the full kernel (Gram) matrix, each of size $N/m \times N/m$. Since the method may use $m \asymp N/\gamma^2(\lambda)$ machines, in the best case, it requires at most $N\gamma^2(\lambda)$ kernel evaluations. By contrast, Bach (2013) shows that Nystrom-based subsampling can be used to form an estimator within a constant factor of optimal as long as the number of N-dimensional subsampled columns of the kernel matrix scales roughly as the marginal dimension $\tilde{\gamma}(\lambda) = N \|\operatorname{diag}(K(K + \lambda N I)^{-1})\|_{\infty}$. Consequently, using roughly $N \tilde{\gamma}(\lambda)$ kernel evaluations, Nystrom subsampling can achieve optimal convergence rates. These two scalings, namely $N\gamma^2(\lambda)$ versus $N\tilde{\gamma}(\lambda)$, are currently not comparable: in some situations, such as when the data is not compactly supported, $\tilde{\gamma}(\lambda)$ can scale linearly with N, while in others it appears to scale roughly as the true effective dimensionality $\gamma(\lambda)$. A natural question arising from these lines of work is to understand the true optimal scaling for these different estimators: is one fundamentally better than the other? Are there natural computational tradeoffs that can be leveraged at large scale? As datasets grow substantially larger and more complex, these questions should become even more important, and we hope to continue to study them.

Acknowledgements: We thank Francis Bach for interesting and enlightening conversations on the connections between this work and his paper (Bach, 2013) and Yining Wang for pointing out a mistake in an earlier version of this manuscript. We also thank two reviewers for useful feedback and comments. JCD was partially supported by a National Defense Science and Engineering Graduate Fellowship (NDSEG) and a Facebook PhD fellowship. This work was partially supported by ONR MURI grant N0004---0688 to MJW.

Appendix A. Proof of Lemma 6

This appendix is devoted to the bias bound stated in Lemma 6. Let $X = \{x_i\}_{i=1}^n$ be shorthand for the design matrix, and define the error vector $\Delta = \hat{f} - f^*$. By Jensen's inequality, we have $\|\mathbb{E}[\Delta]\|_2 \le \mathbb{E}[\|\mathbb{E}[\Delta \mid X]\|_2]$, so it suffices to provide a bound on $\mathbb{E}[\|\mathbb{E}[\Delta \mid X]\|_2]$. Throughout this proof and the remainder of the paper, we represent the kernel evaluator by the function $\xi_x$, where $\xi_x := K(x, \cdot)$ and $f(x) = \langle \xi_x, f \rangle_{\mathcal{H}}$ for any $f \in \mathcal{H}$. Using this notation, the estimate $\hat{f}$ minimizes the empirical objective
$$\frac{1}{n} \sum_{i=1}^n \big( \langle \xi_{x_i}, f \rangle_{\mathcal{H}} - y_i \big)^2 + \lambda \|f\|_{\mathcal{H}}^2. \quad (39)$$
This objective is Fréchet differentiable, and as a consequence, the necessary and sufficient conditions for optimality (Luenberger, 1969) of $\hat{f}$ are that
$$\frac{1}{n} \sum_{i=1}^n \xi_{x_i} \big( \langle \xi_{x_i}, \hat{f} - f^* \rangle_{\mathcal{H}} - \varepsilon_i \big) + \lambda \hat{f} = \frac{1}{n} \sum_{i=1}^n \xi_{x_i} \big( \langle \xi_{x_i}, \hat{f} \rangle_{\mathcal{H}} - y_i \big) + \lambda \hat{f} = 0, \quad (40)$$
where the last equation uses the fact that $y_i = \langle \xi_{x_i}, f^* \rangle_{\mathcal{H}} + \varepsilon_i$. Taking conditional expectations over the noise variables $\{\varepsilon_i\}_{i=1}^n$ with the design $X = \{x_i\}_{i=1}^n$ fixed, we find