Divide and Conquer Kernel Ridge Regression: A Distributed Algorithm with Minimax Optimal Rates


Journal of Machine Learning Research (2015) 1-40. Submitted 4/00; Published 10/00

Divide and Conquer Kernel Ridge Regression: A Distributed Algorithm with Minimax Optimal Rates

Yuchen Zhang (yuczhang@berkeley.edu), Department of Electrical Engineering and Computer Science, University of California, Berkeley, Berkeley, CA 94720, USA

John Duchi (jduchi@stanford.edu), Departments of Statistics and Electrical Engineering, Stanford University, Stanford, CA 94305, USA

Martin Wainwright (wainwrig@berkeley.edu), Departments of Statistics and Electrical Engineering and Computer Science, University of California, Berkeley, Berkeley, CA 94720, USA

Editor: Hui Zou

Abstract

We study a decomposition-based scalable approach to kernel ridge regression, and show that it achieves minimax optimal convergence rates under relatively mild conditions. The method is simple to describe: it randomly partitions a dataset of size N into m subsets of equal size, computes an independent kernel ridge regression estimator for each subset using a careful choice of the regularization parameter, then averages the local solutions into a global predictor. This partitioning leads to a substantial reduction in computation time versus the standard approach of performing kernel ridge regression on all N samples. Our two main theorems establish that despite the computational speed-up, statistical optimality is retained: as long as m is not too large, the partition-based estimator achieves the statistical minimax rate over all estimators using the set of N samples. As concrete examples, our theory guarantees that the number of subsets m may grow nearly linearly for finite-rank or Gaussian kernels and polynomially in N for Sobolev spaces, which in turn allows for substantial reductions in computational cost. We conclude with experiments on both simulated data and a music-prediction task that complement our theoretical results, exhibiting the computational and statistical benefits of our approach.

Keywords: kernel ridge regression, divide and conquer, computation complexity

(c) 2015 Yuchen Zhang, John Duchi and Martin Wainwright.

1. Introduction

In non-parametric regression, the statistician receives N samples of the form {(x_i, y_i)}_{i=1}^N, where each x_i ∈ X is a covariate and y_i ∈ R is a real-valued response, and the samples are drawn i.i.d. from some unknown joint distribution P over X × R. The goal is to estimate a function f̂ : X → R that can be used to predict future responses based on observing only the covariates. Frequently, the quality of an estimate f̂ is measured in terms of the mean-squared prediction error E[(f̂(X) − Y)²], in which case the conditional expectation f*(x) = E[Y | X = x] is optimal. The problem of non-parametric regression is a classical one, and researchers have studied a wide range of estimators (see, for example, the books of Gyorfi et al. (2002), Wasserman (2006), or van de Geer (2000)). One class of methods, known as regularized M-estimators (van de Geer, 2000), are based on minimizing the

combination of a data-dependent loss function with a regularization term. The focus of this paper is a popular M-estimator that combines the least-squares loss with a squared Hilbert norm penalty for regularization. When working in a reproducing kernel Hilbert space (RKHS), the resulting method is known as kernel ridge regression, and is widely used in practice (Hastie et al., 2001; Shawe-Taylor and Cristianini, 2004). Past work has established bounds on the estimation error for RKHS-based methods (Koltchinskii, 2006; Mendelson, 2002a; van de Geer, 2000; Zhang, 2005), which have been refined and extended in more recent work (e.g., Steinwart et al., 2009).

Although the statistical aspects of kernel ridge regression (KRR) are well-understood, the computation of the KRR estimate can be challenging for large datasets. In a standard implementation (Saunders et al., 1998), the kernel matrix must be inverted, which requires O(N³) time and O(N²) memory. Such scalings are prohibitive when the sample size N is large. As a consequence, approximations have been designed to avoid the expense of finding an exact minimizer. One family of approaches is based on low-rank approximation of the kernel matrix; examples include kernel PCA (Schölkopf et al., 1998), the incomplete Cholesky decomposition (Fine and Scheinberg, 2001), and Nyström sampling (Williams and Seeger, 2001). These methods reduce the time complexity to O(dN²) or O(d²N), where d ≪ N is the preserved rank. The associated prediction error has only been studied very recently. Concurrent work by Bach (2013) establishes conditions on the maintained rank that still guarantee optimal convergence rates; see the discussion in Section 7 for more detail. A second line of research has considered early-stopping of iterative optimization algorithms for KRR, including gradient descent (Yao et al., 2007; Raskutti et al., 2011) and conjugate gradient methods (Blanchard and Krämer, 2010), where early-stopping provides regularization against over-fitting and improves run-time. If the algorithm stops after t iterations, the aggregate time complexity is O(tN²).

In this work, we study a different decomposition-based approach. The algorithm is appealing in its simplicity: we partition the dataset of size N randomly into m equal sized subsets, and we compute the kernel ridge regression estimate f̂_i for each of the i = 1, ..., m subsets independently, with a careful choice of the regularization parameter. The estimates are then averaged via f̄ = (1/m) Σ_{i=1}^m f̂_i. Our main theoretical result gives conditions under which the average f̄ achieves the minimax rate of convergence over the underlying Hilbert space. Even using naive implementations of KRR, this decomposition gives time and memory complexity scaling as O(N³/m²) and O(N²/m²), respectively. Moreover, our approach dovetails naturally with parallel and distributed computation: we are guaranteed superlinear speedup with m parallel processors (though we must still communicate the function estimates from each processor).

Divide-and-conquer approaches have been studied by several authors, including McDonald et al. (2010) for perceptron-based algorithms, Kleiner et al. (2012) in distributed versions of the bootstrap, and Zhang et al. (2013) for parametric smooth convex optimization problems. This paper demonstrates the potential benefits of divide-and-conquer approaches for non-parametric and infinite-dimensional regression problems.

One difficulty in solving each of the sub-problems independently is how to choose the regularization parameter. Due to the infinite-dimensional nature of non-parametric problems, the choice of regularization parameter must be made with care (e.g., Hastie et al., 2001).
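To make the procedure concrete, the following Python sketch implements the partition-estimate-average scheme just described, using the simplest closed-form KRR solver for each sub-problem. The function names, the user-supplied kernel argument, and the use of NumPy are our own illustrative choices, not the implementation used for the experiments reported later.

```python
import numpy as np

def krr_fit(X, y, kernel, lam):
    """Solve one KRR sub-problem in closed form: (K + lam * n * I) alpha = y."""
    n = X.shape[0]
    K = kernel(X, X)                       # n x n Gram matrix of the sub-sample
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return X, alpha

def krr_predict(model, X_test, kernel):
    X_train, alpha = model
    return kernel(X_test, X_train) @ alpha

def fast_krr(X, y, kernel, lam, m, rng=np.random.default_rng(0)):
    """Divide-and-conquer KRR: random equal split, local fits, averaged predictor.

    Note: lam should be chosen as if all N samples were used (the paper's
    'under-regularization' of each sub-problem), not tuned to the local size N/m.
    """
    N = X.shape[0]
    splits = np.array_split(rng.permutation(N), m)
    models = [krr_fit(X[idx], y[idx], kernel, lam) for idx in splits]

    def predict(X_test):
        preds = [krr_predict(mod, X_test, kernel) for mod in models]
        return np.mean(preds, axis=0)      # f_bar = (1/m) * sum_i f_hat_i
    return predict
```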
An interesting consequence of our theoretical analysis is in demonstrating that, even though each partitioned sub-problem is based only on the fraction N/m of samples, it is nonetheless essential to regularize the partitioned sub-problems as though they had all N samples. Consequently, from a local point of view, each sub-problem is under-regularized. This under-regularization allows the bias of each local estimate to be very small, but it causes a detrimental blow-up in the variance. However, as we prove, the m-fold averaging

underlying the method reduces variance enough that the resulting estimator f̄ still attains the optimal convergence rate.

The remainder of this paper is organized as follows. We begin in Section 2 by providing background on the kernel ridge regression estimate and discussing the assumptions that underlie our analysis. In Section 3, we present our main theorems on the mean-squared error between the averaged estimate f̄ and the optimal regression function f*. We provide both a result when the regression function f* belongs to the Hilbert space H associated with the kernel, as well as a more general oracle inequality that holds for a general f*. We then provide several corollaries that exhibit concrete consequences of the results, including convergence rates of r/N for kernels with finite rank r, and convergence rates of N^(−2ν/(2ν+1)) for estimation of functionals in a Sobolev space with ν degrees of smoothness. As we discuss, both of these estimation rates are minimax-optimal and hence unimprovable. We devote Sections 4 and 5 to the proofs of our results, deferring more technical aspects of the analysis to appendices. Lastly, we present simulation results in Section 6.1 to further explore our theoretical results, while Section 6.2 contains experiments with a reasonably large music prediction experiment.

2. Background and problem formulation

We begin with the background and notation required for a precise statement of our problem.

2.1 Reproducing kernels

The method of kernel ridge regression is based on the idea of a reproducing kernel Hilbert space. We provide only a very brief coverage of the basics here, referring the reader to one of the many books on the topic (Wahba, 1990; Shawe-Taylor and Cristianini, 2004; Berlinet and Thomas-Agnan, 2004; Gu, 2002) for further details. Any symmetric and positive semidefinite kernel function K : X × X → R defines a reproducing kernel Hilbert space (RKHS for short). For a given distribution P on X, the Hilbert space is strictly contained in L²(P). For each x ∈ X, the function z ↦ K(z, x) is contained within the Hilbert space H; moreover, the Hilbert space is endowed with an inner product ⟨·, ·⟩_H such that K(·, x) acts as the representer of evaluation, meaning

  ⟨f, K(x, ·)⟩_H = f(x)  for f ∈ H.   (1)

We let ‖g‖_H := √⟨g, g⟩_H denote the norm in H, and similarly ‖g‖₂ := (∫_X g(x)² dP(x))^(1/2) denotes the norm in L²(P). Under suitable regularity conditions, Mercer's theorem guarantees that the kernel has an eigen-expansion of the form

  K(x, x') = Σ_{j=1}^∞ μ_j φ_j(x) φ_j(x'),

where μ₁ ≥ μ₂ ≥ ··· ≥ 0 are a non-negative sequence of eigenvalues, and {φ_j}_{j=1}^∞ is an orthonormal basis for L²(P).

From the reproducing relation (1), we have ⟨φ_j, φ_j⟩_H = 1/μ_j for any j and ⟨φ_j, φ_{j'}⟩_H = 0 for any j ≠ j'. For any f ∈ H, by defining the basis coefficients θ_j = ⟨f, φ_j⟩_{L²(P)} for j = 1, 2, ..., we can expand the function in terms of these coefficients as f = Σ_{j=1}^∞ θ_j φ_j, and simple calculations show that

  ‖f‖₂² = ∫_X f(x)² dP(x) = Σ_{j=1}^∞ θ_j²,  and  ‖f‖_H² = ⟨f, f⟩_H = Σ_{j=1}^∞ θ_j²/μ_j.
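The coefficient representation above is easy to check numerically; the short sketch below computes both norms for a hypothetical truncated expansion (the spectrum and coefficients are arbitrary choices made only for illustration).

```python
import numpy as np

# Truncated Mercer view: represent f in H by its first J basis coefficients
# theta_j = <f, phi_j>_{L^2(P)}, with kernel eigenvalues mu_1 >= mu_2 >= ... > 0.
J = 50
mu = 1.0 / np.arange(1, J + 1) ** 2        # e.g. polynomial decay mu_j = j^{-2}
theta = 1.0 / np.arange(1, J + 1) ** 2     # coefficients of some f in H

l2_norm_sq = np.sum(theta ** 2)            # ||f||_2^2 = sum_j theta_j^2
hilbert_norm_sq = np.sum(theta ** 2 / mu)  # ||f||_H^2 = sum_j theta_j^2 / mu_j

# The RKHS ball {||f||_H <= R} corresponds to sum_j theta_j^2 / mu_j <= R^2.
print(l2_norm_sq, hilbert_norm_sq)
```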

Consequently, we see that the RKHS can be viewed as an elliptical subset of the sequence space ℓ²(N) as defined by the non-negative eigenvalues {μ_j}_{j=1}^∞.

2.2 Kernel ridge regression

Suppose that we are given a data set {(x_i, y_i)}_{i=1}^N consisting of N i.i.d. samples drawn from an unknown distribution P over X × R, and our goal is to estimate the function that minimizes the mean-squared error E[(f(X) − Y)²], where the expectation is taken jointly over (X, Y) pairs. It is well-known that the optimal function is the conditional mean f*(x) := E[Y | X = x]. In order to estimate the unknown function f*, we consider an M-estimator that is based on minimizing a combination of the least-squares loss defined over the dataset with a weighted penalty based on the squared Hilbert norm,

  f̂ := argmin_{f ∈ H} { (1/N) Σ_{i=1}^N (f(x_i) − y_i)² + λ‖f‖_H² },   (2)

where λ > 0 is a regularization parameter. When H is a reproducing kernel Hilbert space, then the estimator (2) is known as the kernel ridge regression estimate, or KRR for short. It is a natural generalization of the ordinary ridge regression estimate (Hoerl and Kennard, 1970) to the non-parametric setting.

By the representer theorem for reproducing kernel Hilbert spaces (Wahba, 1990), any solution to the KRR program (2) must belong to the linear span of the kernel functions {K(·, x_i), i = 1, ..., N}. This fact allows the computation of the KRR estimate to be reduced to an N-dimensional quadratic program, involving the N² entries of the kernel matrix {K(x_i, x_j), i, j = 1, ..., N}. On the statistical side, a line of past work (van de Geer, 2000; Zhang, 2005; Caponnetto and De Vito, 2007; Steinwart et al., 2009; Hsu et al., 2012) has provided bounds on the estimation error of f̂ as a function of N and λ.

3. Main results and their consequences

We now turn to the description of our algorithm, followed by the statements of our main results, namely Theorems 1 and 2. Each theorem provides an upper bound on the mean-squared prediction error for any trace class kernel. The second theorem is of oracle type, meaning that it applies even when the true regression function f* does not belong to the Hilbert space H, and hence involves a combination of approximation and estimation error terms. The first theorem requires that f* ∈ H, and provides somewhat sharper bounds on the estimation error in this case. Both of these theorems apply to any trace class kernel, but as we illustrate, they provide concrete results when applied to specific classes of kernels. Indeed, as a corollary, we establish that our distributed KRR algorithm achieves minimax-optimal rates for three different kernel classes, namely finite-rank, Gaussian, and Sobolev.

3.1 Algorithm and assumptions

The divide-and-conquer algorithm Fast-KRR is easy to describe. Rather than solving the kernel ridge regression problem (2) on all N samples, the Fast-KRR method executes the following three steps:

1. Divide the set of samples {(x₁, y₁), ..., (x_N, y_N)} evenly and uniformly at random into the m disjoint subsets S₁, ..., S_m ⊂ X × R, such that every subset contains N/m samples.

2. For each i = 1, 2, ..., m, compute the local KRR estimate

  f̂_i := argmin_{f ∈ H} { (1/|S_i|) Σ_{(x,y) ∈ S_i} (f(x) − y)² + λ‖f‖_H² }.   (3)

3. Average together the local estimates and output f̄ = (1/m) Σ_{i=1}^m f̂_i.

This description actually provides a family of estimators, one for each choice of the regularization parameter λ > 0. Our main result applies to any choice of λ, while our corollaries for specific kernel classes optimize λ as a function of the kernel.

We now describe our main assumptions. Our first assumption, for which we have two variants, deals with the tail behavior of the basis functions {φ_j}_{j=1}^∞.

Assumption A: For some k ≥ 2, there is a constant ρ < ∞ such that E[φ_j(X)^(2k)] ≤ ρ^(2k) for all j ∈ N.

In certain cases, we show that sharper error guarantees can be obtained by enforcing a stronger condition of uniform boundedness.

Assumption A': There is a constant ρ < ∞ such that sup_{x ∈ X} |φ_j(x)| ≤ ρ for all j ∈ N.

Assumption A' holds, for example, when the input x is drawn from a closed interval and the kernel is translation invariant, i.e. K(x, x') = ψ(x − x') for some even function ψ. Given the input space X and kernel K, the assumption is verifiable without the data.

Recalling that f*(x) := E[Y | X = x], our second assumption involves the deviations of the zero-mean noise variables Y − f*(x). In the simplest case, when f* ∈ H, we require only a bounded variance condition:

Assumption B: The function f* ∈ H, and for x ∈ X, we have E[(Y − f*(x))² | x] ≤ σ².

When the function f* ∉ H, we require a slightly stronger variant of this assumption. For each λ ≥ 0, define

  f*_λ = argmin_{f ∈ H} { E[(f(X) − Y)²] + λ‖f‖_H² }.   (4)

Note that f* = f*₀ corresponds to the usual regression function. As f* ∈ L²(P), for each λ ≥ 0, the associated mean-squared error σ_λ²(x) := E[(Y − f*_λ(x))² | x] is finite for almost every x. In this more general setting, the following assumption replaces Assumption B:

Assumption B': For any λ ≥ 0, there exists a constant τ_λ < ∞ such that τ_λ⁴ = E[σ_λ⁴(X)].

3.2 Statement of main results

With these assumptions in place, we are now ready for the statements of our main results. All of our results give bounds on the mean-squared estimation error E[‖f̄ − f*‖₂²] associated with the averaged estimate f̄ based on assigning n = N/m samples to each of m machines. Both theorem statements involve the following three kernel-related quantities:

  tr(K) := Σ_{j=1}^∞ μ_j,   γ(λ) := Σ_{j=1}^∞ 1/(1 + λ/μ_j),   and   β_d := Σ_{j=d+1}^∞ μ_j.   (5)

The first quantity is the kernel trace, which serves as a crude estimate of the size of the kernel operator, and is assumed to be finite. The second quantity γ(λ), familiar from previous

work on kernel regression (Zhang, 2005), is the effective dimensionality of the kernel K with respect to L²(P). Finally, the quantity β_d is parameterized by a positive integer d that we may choose in applying the bounds, and it describes the tail decay of the eigenvalues of K. For d = 0, note that β₀ = tr(K). Finally, both theorems involve a quantity that depends on the number of moments k in Assumption A:

  b(n, d, k) := max{ √(max{k, log d}),  max{k, log d} / n^(1/2 − 1/k) }.   (6)

Here the integer d ∈ N is a free parameter that may be optimized to obtain the sharpest possible upper bound. (The algorithm's execution is independent of d.)

Theorem 1: With f* ∈ H and under Assumptions A and B, the mean-squared error of the averaged estimate f̄ is upper bounded as

  E[‖f̄ − f*‖₂²] ≤ (8 + 12/m) λ‖f*‖_H² + 12 σ² γ(λ)/N + inf_{d ∈ N} { T₁(d) + T₂(d) + T₃(d) },   (7)

where

  T₁(d) = 8 ρ⁴ ‖f*‖_H² tr(K) β_d / λ,
  T₂(d) = (1/m)(4‖f*‖_H² + 2σ²/λ)(μ_{d+1} + ρ⁴ tr(K) β_d / λ),   and
  T₃(d) = C b(n, d, k) (ρ² γ(λ)/√n)^k ( μ₁‖f*‖_H² + σ²/(λm) + 4‖f*‖_H²/m ),

and C denotes a universal (numerical) constant.

Theorem 1 is a general result that applies to any trace-class kernel. Although the statement appears somewhat complicated at first sight, it yields concrete and interpretable guarantees on the error when specialized to particular kernels, as we illustrate in Section 3.3. Before doing so, let us make a few heuristic arguments in order to provide intuition.

In typical settings, the term T₃(d) goes to zero quickly: if the number of moments k is suitably large and the number of partitions m is small (say, small enough to guarantee that b(n, d, k)(γ(λ)/√n)^k = O(1/N)), it will be of lower order. As for the remaining terms, at a high level, we show that an appropriate choice of the free parameter d leaves the first two terms in the upper bound (7) dominant. Note that the terms μ_{d+1} and β_d are decreasing in d while the term b(n, d, k) increases with d. However, the increasing term b(n, d, k) grows only logarithmically in d, which allows us to choose a fairly large value without a significant penalty. As we show in our corollaries, for many kernels of interest, as long as the number of machines m is not too large, this tradeoff is such that T₁(d) and T₂(d) are also of lower order compared to the first two terms in the bound (7). In such settings, Theorem 1 guarantees an upper bound of the form

  E[‖f̄ − f*‖₂²] = O(1) ( λ‖f*‖_H² + σ² γ(λ)/N ),   (8)

where the first term is the squared bias and the second is the variance. This inequality reveals the usual bias-variance trade-off in non-parametric regression; choosing a smaller value of λ > 0 reduces the first squared bias term, but increases the second variance term. Consequently, the setting of λ that minimizes the sum of these two terms is defined by the relationship

  λ‖f*‖_H² ≈ σ² γ(λ)/N.   (9)
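As a numerical illustration of equation (9), the following sketch solves the fixed point λ‖f*‖_H² = σ²γ(λ)/N by bisection for an assumed eigenvalue sequence. The spectrum, Hilbert norm, and noise level are hypothetical, and the truncation of the spectrum is only an implementation convenience.

```python
import numpy as np

def effective_dim(lam, mu):
    """gamma(lambda) = sum_j 1 / (1 + lambda / mu_j) for a (truncated) spectrum."""
    return np.sum(1.0 / (1.0 + lam / mu))

def balance_lambda(mu, N, f_hilbert_norm_sq=1.0, sigma_sq=1.0,
                   lo=1e-12, hi=1.0, iters=200):
    """Bisection for the fixed point lambda * ||f*||_H^2 = sigma^2 * gamma(lambda) / N.

    The left side increases and the right side decreases in lambda, so the
    difference has a single sign change on (lo, hi).
    """
    def gap(lam):
        return lam * f_hilbert_norm_sq - sigma_sq * effective_dim(lam, mu) / N
    for _ in range(iters):
        mid = np.sqrt(lo * hi)             # bisect on a logarithmic scale
        lo, hi = (lo, mid) if gap(mid) > 0 else (mid, hi)
    return np.sqrt(lo * hi)

# Example: nu = 1 polynomial decay mu_j = j^{-2}; theory predicts lambda ~ N^{-2/3}.
mu = 1.0 / np.arange(1, 100_000, dtype=float) ** 2
for N in (10**3, 10**4, 10**5):
    print(N, balance_lambda(mu, N), N ** (-2.0 / 3.0))
```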

This type of fixed point equation is familiar from work on oracle inequalities and local complexity measures in empirical process theory (Bartlett et al., 2005; Koltchinskii, 2006; van de Geer, 2000; Zhang, 2005), and when λ is chosen so that the fixed point equation (9) holds, this (typically) yields minimax optimal convergence rates (Bartlett et al., 2005; Koltchinskii, 2006; Zhang, 2005; Caponnetto and De Vito, 2007). In Section 3.3, we provide detailed examples in which the choice of λ specified by equation (9), followed by application of Theorem 1, yields minimax-optimal prediction error (for the Fast-KRR algorithm) for many kernel classes.

We now turn to an error bound that applies without requiring that f* ∈ H. In order to do so, we introduce an auxiliary variable λ̄ ∈ [0, λ] for use in our analysis (the algorithm's execution does not depend on λ̄, and in our ensuing bounds we may choose any λ̄ ∈ [0, λ] to give the sharpest possible results). Let the radius R = ‖f*_λ̄‖_H, where the (population regularized) regression function f*_λ̄ was previously defined (4). The theorem requires a few additional conditions to those in Theorem 1, involving the quantities tr(K), γ(λ) and β_d defined in Eq. (5), as well as the error moment τ_λ̄ from Assumption B'. We assume that the triplet (m, d, k) of positive integers satisfies the conditions

  β_d ≤ λ / ((R² + τ_λ̄²/λ) N),   μ_{d+1} ≤ λ / ((R² + τ_λ̄²/λ) N),
  m ≤ min{ √N λ / (ρ² γ(λ) log d),   λ N^((k−4)/k) / ((R² + τ_λ̄²/λ)^(2/k) b(n, d, k) ρ² γ(λ)) }.   (10)

We then have the following result:

Theorem 2: Under condition (10), Assumption A with k ≥ 4, and Assumption B', for any λ̄ ∈ [0, λ] and q > 0 we have

  E[‖f̄ − f*‖₂²] ≤ (1 + q⁻¹) inf_{‖f‖_H ≤ R} ‖f − f*‖₂² + (1 + q) E_{N,m,λ}(λ̄, R, ρ),   (11)

where the residual term is given by

  E_{N,m,λ}(λ̄, R, ρ) := (4 + C/m)(λ − λ̄)R² + C γ(λ) ρ² τ_λ̄² / N + C/N,   (12)

and C denotes a universal (numerical) constant.

Remarks: Theorem 2 is an oracle inequality, as it upper bounds the mean-squared error in terms of the error inf_{‖f‖_H ≤ R} ‖f − f*‖₂², which may only be obtained by an oracle knowing the sampling distribution P, along with the residual error term (12).

In some situations, it may be difficult to verify Assumption B'. In such scenarios, an alternative condition suffices. For instance, if there exists a constant κ < ∞ such that E[Y⁴] ≤ κ⁴, then under condition (10), the bound (11) holds with τ_λ̄² replaced by √(8 tr(K)² R⁴ ρ⁴ + 8κ⁴), that is, with the alternative residual error

  Ẽ_{N,m,λ}(λ̄, R, ρ) := (4 + C/m)(λ − λ̄)R² + C γ(λ) ρ² √(8 tr(K)² R⁴ ρ⁴ + 8κ⁴) / N + C/N.   (13)

In essence, if the response variable Y has sufficiently many moments, the prediction mean-square error τ_λ̄² in the statement of Theorem 2 can be replaced by constants related to the size of ‖f*_λ̄‖_H. See Section 5.2 for a proof of inequality (13).
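Since the bound of Theorem 2 holds for every q > 0, one may optimize over q; the following standard calculation (a step not spelled out above) shows how the two terms combine.

```latex
% Optimizing the free parameter q > 0 in a bound of the form
%   (1 + 1/q) A + (1 + q) B,   with A, B >= 0,
% e.g. A = inf_{\|f\|_H \le R} \|f - f^*\|_2^2 and B the residual term (12):
\[
\inf_{q > 0} \Big\{ (1 + q^{-1})\,A + (1 + q)\,B \Big\}
  \;=\; A + B + \inf_{q > 0}\big\{ q^{-1} A + q B \big\}
  \;=\; A + B + 2\sqrt{AB}
  \;=\; \big(\sqrt{A} + \sqrt{B}\big)^{2},
\]
% where the infimum is attained at q = \sqrt{A/B} (by the AM-GM inequality).
% Thus the mean-squared error is bounded by the squared sum of the square roots
% of the approximation error and the residual term.
```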

In comparison with Theorem 1, Theorem 2 provides somewhat looser bounds. It is, however, instructive to consider a few special cases. For the first, we may assume that f* ∈ H, in which case ‖f*‖_H < ∞. In this setting, the choice λ̄ = 0 (essentially) recovers Theorem 1, since there is no approximation error. Taking q → 0, we are thus left with the bound

  E[‖f̄ − f*‖₂²] ≲ λ‖f*‖_H² + γ(λ) ρ² τ₀² / N,   (14)

where ≲ denotes an inequality up to constants. By inspection, this bound is roughly equivalent to Theorem 1; see in particular the decomposition (8). On the other hand, when the condition f* ∈ H fails to hold, we can take λ̄ = λ, and then choose q to balance between the familiar approximation and estimation errors: we have

  E[‖f̄ − f*‖₂²] ≤ (1 + q⁻¹) inf_{‖f‖_H ≤ R} ‖f − f*‖₂²  [approximation]  + (1 + q) O(1) γ(λ) ρ² τ_λ² / N  [estimation].   (15)

Relative to Theorem 1, the condition (10) required to apply Theorem 2 involves constraints on the number m of subsampled data sets that are more restrictive. In particular, when ignoring constants and logarithm terms, the quantity m may grow at rate √N/γ²(λ). By contrast, Theorem 1 allows m to grow as quickly as N/γ²(λ) (recall the remarks on T₃(d) following Theorem 1 or look ahead to condition (28)). Thus, at least in our current analysis, generalizing to the case that f* ∉ H prevents us from dividing the data into finer subsets.

3.3 Some consequences

We now turn to deriving some explicit consequences of our main theorems for specific classes of reproducing kernel Hilbert spaces. In each case, our derivation follows the broad outline given in the remarks following Theorem 1: we first choose the regularization parameter λ to balance the bias and variance terms, and then show, by comparison to known minimax lower bounds, that the resulting upper bound is optimal. Finally, we derive an upper bound on the number of subsampled data sets m for which the minimax optimal convergence rate can still be achieved. Throughout this section, we assume that f* ∈ H.

3.3.1 Finite-rank kernels

Our first corollary applies to problems for which the kernel has finite rank r, meaning that its eigenvalues satisfy μ_j = 0 for all j > r. Examples of such finite rank kernels include the linear kernel K(x, x') = ⟨x, x'⟩_{R^d}, which has rank at most r = d; and the kernel K(x, x') = (1 + x x')^m generating polynomials of degree m, which has rank at most r = m + 1.

Corollary 3: For a kernel with rank r, consider the output of the Fast-KRR algorithm with λ = r/N. Suppose that Assumption B and Assumption A (or A') hold, and that the number of processors m satisfies the bound

  m ≤ c N^((k−4)/(k−2)) / ( r^(2(k−1)/(k−2)) ρ^(4k/(k−2)) log^(k/(k−2)) N )   (Assumption A),   or
  m ≤ c N / (r² ρ⁴ log N)   (Assumption A'),

where c is a universal (numerical) constant. For suitably large N, the mean-squared error is bounded as

  E[‖f̄ − f*‖₂²] = O(1) σ² r / N.   (16)
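The finite-rank claims for the two example kernels are easy to verify numerically; the sketch below checks the numerical rank of the corresponding Gram matrices on random data (the sample sizes, dimensions, and tolerance are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)

def numerical_rank(K, tol=1e-8):
    """Rank of a symmetric Gram matrix via its eigenvalue spectrum."""
    eigs = np.linalg.eigvalsh(K)
    return int(np.sum(eigs > tol * eigs.max()))

n, d, deg = 200, 5, 3
X = rng.normal(size=(n, d))        # n samples in R^d
x = rng.uniform(size=(n, 1))       # n scalar samples for the polynomial kernel

K_linear = X @ X.T                 # K(x, x') = <x, x'>, rank at most d
K_poly = (1.0 + x @ x.T) ** deg    # K(x, x') = (1 + x x')^m, rank at most m + 1

print(numerical_rank(K_linear))    # expected: 5
print(numerical_rank(K_poly))      # expected: 4
```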

For finite-rank kernels, the rate (16) is known to be minimax-optimal, meaning that there is a universal constant c' > 0 such that

  inf_{f̃} sup_{f* ∈ H} E[‖f̃ − f*‖₂²] ≥ c' r / N,   (17)

where the infimum ranges over all estimators f̃ based on observing all N samples (and with no constraints on memory and/or computation). This lower bound follows from Theorem 2(a) of Raskutti et al. (2012) with s = d = 1.

3.3.2 Polynomially decaying eigenvalues

Our next corollary applies to kernel operators with eigenvalues that obey a bound of the form

  μ_j ≤ C j^(−2ν)  for all j = 1, 2, ...,   (18)

where C is a universal constant, and ν > 1/2 parameterizes the decay rate. We note that equation (5) assumes a finite kernel trace tr(K) := Σ_{j=1}^∞ μ_j. Since tr(K) appears in Theorem 1, it is natural to use Σ_{j=1}^∞ C j^(−2ν) as an upper bound on tr(K). This upper bound is finite if and only if ν > 1/2. Kernels with polynomially decaying eigenvalues include those that underlie the Sobolev spaces with different orders of smoothness (e.g. Birman and Solomjak, 1967; Gu, 2002). As a concrete example, the first-order Sobolev kernel K(x, x') = 1 + min{x, x'} generates an RKHS of Lipschitz functions with smoothness ν = 1. Other higher-order Sobolev kernels also exhibit polynomial eigendecay with larger values of the parameter ν.

Corollary 4: For any kernel with ν-polynomial eigendecay (18), consider the output of the Fast-KRR algorithm with λ = (1/N)^(2ν/(2ν+1)). Suppose that Assumption B and Assumption A (or A') hold, and that the number of processors satisfies the bound

  m ≤ c N^((2(k−4)ν − k − 2)/((2ν+1)(k−2))) / ( ρ^(4k/(k−2)) log^(k/(k−2)) N )   (Assumption A),   or
  m ≤ c N^((2ν−1)/(2ν+1)) / (ρ⁴ log N)   (Assumption A'),

where c is a constant only depending on ν. Then the mean-squared error is bounded as

  E[‖f̄ − f*‖₂²] = O( (σ²/N)^(2ν/(2ν+1)) ).   (19)

The upper bound (19) is unimprovable up to constant factors, as shown by known minimax bounds on estimation error in Sobolev spaces (Stone, 1982; Tsybakov, 2009); see also Theorem 2(b) of Raskutti et al. (2012).

3.3.3 Exponentially decaying eigenvalues

Our final corollary applies to kernel operators with eigenvalues that obey a bound of the form

  μ_j ≤ c₁ exp(−c₂ j²)  for all j = 1, 2, ...,   (20)

for strictly positive constants (c₁, c₂). Such classes include the RKHS generated by the Gaussian kernel K(x, x') = exp(−‖x − x'‖₂²).
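The effective dimension γ(λ) drives the rates in the two corollaries that follow; the sketch below evaluates γ(λ) for an assumed polynomial spectrum (with λ = N^(−2ν/(2ν+1))) and an assumed Gaussian-type spectrum (with λ = 1/N), for comparison with the scalings λ^(−1/(2ν)) and √(log N) used in the proofs. The spectra are idealized stand-ins rather than those of any particular kernel/distribution pair.

```python
import numpy as np

def effective_dim(lam, mu):
    """gamma(lambda) = sum_j mu_j / (mu_j + lambda)."""
    return np.sum(mu / (mu + lam))

j = np.arange(1, 10**6, dtype=float)

for N in (10**3, 10**4, 10**5):
    # Polynomial decay (Corollary 4 regime): mu_j = j^{-2 nu}, nu = 1.
    nu = 1.0
    lam_poly = N ** (-2 * nu / (2 * nu + 1))
    gamma_poly = effective_dim(lam_poly, j ** (-2 * nu))
    # Gaussian-type decay (Corollary 5 regime): mu_j = exp(-j^2), lambda = 1/N.
    gamma_gauss = effective_dim(1.0 / N, np.exp(-j[:50] ** 2))
    print(N, gamma_poly, lam_poly ** (-1 / (2 * nu)), gamma_gauss, np.sqrt(np.log(N)))
```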

Corollary 5: For a kernel with sub-Gaussian eigendecay (20), consider the output of the Fast-KRR algorithm with λ = 1/N. Suppose that Assumption B and Assumption A (or A') hold, and that the number of processors satisfies the bound

  m ≤ c N^((k−4)/(k−2)) / ( ρ^(4k/(k−2)) log^((2k−1)/(k−2)) N )   (Assumption A),   or   m ≤ c N / (ρ⁴ log² N)   (Assumption A'),

where c is a constant only depending on c₂. Then the mean-squared error is bounded as

  E[‖f̄ − f*‖₂²] = O( σ² √(log N) / N ).   (21)

The upper bound (21) is minimax optimal; see, for example, Theorem 1 and Example 2 of the recent paper by Yang et al. (2015).

Summary: Each corollary gives a critical threshold for the number m of data partitions: as long as m is below this threshold, the decomposition-based Fast-KRR algorithm gives the optimal rate of convergence. It is interesting to note that the number of splits may be quite large: each grows asymptotically with N whenever the basis functions have more than four moments (viz. Assumption A). Moreover, the Fast-KRR method can attain these optimal convergence rates while using substantially less computation than standard kernel ridge regression methods, as it requires solving problems only of size N/m.

3.4 The choice of regularization parameter

In practice, the local sample size on each machine may be different and the optimal choice for the regularization λ may not be known a priori, so that an adaptive choice of the regularization parameter λ is desirable (e.g. Tsybakov, 2009, Chapter 1). We recommend using cross-validation to choose the regularization parameter, and we now sketch a heuristic argument that an adaptive algorithm using cross-validation may achieve optimal rates of convergence. (We leave fuller analysis to future work.)

Let λ* be the (oracle) optimal regularization parameter given knowledge of the sampling distribution P and eigen-structure of the kernel K. We assume (cf. Corollary 4) that there is a constant ν > 0 such that λ*_n ∝ n^(−ν) as n → ∞. Let n_i be the local sample size for each machine i and N the global sample size; we assume that n_i ≳ √N (clearly, N ≥ n_i). First, use local cross-validation to choose regularization parameters λ_{n_i} and λ_{n_i²/N} corresponding to samples of size n_i and n_i²/N, respectively. Heuristically, if cross-validation is successful, we expect to have λ_{n_i} ≈ n_i^(−ν) and λ_{n_i²/N} ≈ N^ν n_i^(−2ν), yielding that λ_{n_i}² / λ_{n_i²/N} ≈ N^(−ν). With this intuition, we then compute local estimates

  f̂_i := argmin_{f ∈ H} { (1/n_i) Σ_{(x,y) ∈ S_i} (f(x) − y)² + λ^(i) ‖f‖_H² },  where  λ^(i) := λ_{n_i}² / λ_{n_i²/N},   (22)

and the global average estimate f̄ = Σ_{i=1}^m (n_i/N) f̂_i as usual. Notably, we have λ^(i) ≈ λ*_N in this heuristic setting. Using formula (22) and the average f̄, we have

  E[‖f̄ − f*‖₂²] = E[ ‖ Σ_{i=1}^m (n_i/N)(f̂_i − E[f̂_i]) ‖₂² ] + ‖ Σ_{i=1}^m (n_i/N) E[f̂_i] − f* ‖₂²
   ≤ Σ_{i=1}^m (n_i/N)² E[‖f̂_i − E[f̂_i]‖₂²] + max_{i ∈ [m]} { ‖E[f̂_i] − f*‖₂² }.   (23)

Using Lemmas 6 and 7 from the proof of Theorem 1 (to come), and assuming that λ^(i) is concentrated tightly enough around λ*_N, we obtain ‖E[f̂_i] − f*‖₂² = O(λ*_N ‖f*‖_H²) by Lemma 6 and that E[‖f̂_i − E[f̂_i]‖₂²] = O(γ(λ*_N)/n_i) by Lemma 7. Substituting these bounds into inequality (23) and noting that Σ_i n_i = N, we may upper bound the overall estimation error as

  E[‖f̄ − f*‖₂²] ≤ O(1) ( λ*_N ‖f*‖_H² + γ(λ*_N)/N ).

While the derivation of this upper bound was non-rigorous, we believe that it is roughly accurate, and in comparison with the previous upper bound (8), it provides optimal rates of convergence.

4. Proofs of Theorem 1 and related results

We now turn to the proofs of Theorem 1 and Corollaries 3 through 5. This section contains only a high-level view of the proof of Theorem 1; we defer more technical aspects to the appendices.

4.1 Proof of Theorem 1

Using the definition of the averaged estimate f̄ = (1/m) Σ_{i=1}^m f̂_i, a bit of algebra yields

  E[‖f̄ − f*‖₂²] = E[‖(f̄ − E[f̄]) + (E[f̄] − f*)‖₂²]
   = E[‖f̄ − E[f̄]‖₂²] + ‖E[f̄] − f*‖₂² + 2 E[⟨f̄ − E[f̄], E[f̄] − f*⟩_{L²(P)}]
   = E[‖(1/m) Σ_{i=1}^m (f̂_i − E[f̂_i])‖₂²] + ‖E[f̄] − f*‖₂²,

where we used the fact that E[f̂_i] = E[f̄] for each i ∈ [m]. Using this unbiasedness once more, we bound the variance of the terms f̂_i − E[f̄] to see that

  E[‖f̄ − f*‖₂²] ≤ (1/m) E[‖f̂₁ − E[f̂₁]‖₂²] + ‖E[f̄] − f*‖₂²
   ≤ (1/m) E[‖f̂₁ − f*‖₂²] + ‖E[f̂₁] − f*‖₂²,   (24)

where we have used the fact that E[f̂₁] minimizes E[‖f̂₁ − f‖₂²] over f ∈ H.

The error bound (24) suggests our strategy: we upper bound E[‖f̂₁ − f*‖₂²] and ‖E[f̂₁] − f*‖₂² respectively. Based on equation (3), the estimate f̂₁ is obtained from a standard kernel ridge regression with sample size n = N/m and ridge parameter λ. Accordingly, the following two auxiliary results provide bounds on these two terms, where the reader should recall the definitions of b(n, d, k) and β_d from equation (5). In each lemma, C represents a universal (numerical) constant.

Lemma 6 (Bias bound): Under Assumptions A and B, for each d = 1, 2, ..., we have

  ‖E[f̂] − f*‖₂² ≤ 8 λ‖f*‖_H² + 8 ρ⁴ ‖f*‖_H² tr(K) β_d / λ + C b(n, d, k) (ρ² γ(λ)/√n)^k μ₁ ‖f*‖_H².   (25)

Lemma 7 (Variance bound): Under Assumptions A and B, for each d = 1, 2, ..., we have

  E[‖f̂ − f*‖₂²] ≤ 12 λ‖f*‖_H² + 12 σ² γ(λ)/n + (4‖f*‖_H² + 2σ²/λ)(μ_{d+1} + ρ⁴ tr(K) β_d / λ) + C b(n, d, k) (ρ² γ(λ)/√n)^k (σ²/λ + 4‖f*‖_H²).   (26)

The proofs of these lemmas, contained in Appendices A and B respectively, constitute one main technical contribution of this paper. Given these two lemmas, the remainder of the theorem proof is straightforward. Combining the inequality (24) with Lemmas 6 and 7 yields the claim of Theorem 1.

Remarks: The proofs of Lemmas 6 and 7 are somewhat complex, but to the best of our knowledge, existing literature does not yield significantly simpler proofs. We now discuss this claim to better situate our technical contributions. Define the regularized population minimizer f*_λ := argmin_{f ∈ H} {E[(f(X) − Y)²] + λ‖f‖_H²}. Expanding the decomposition (24) of the L²(P)-risk into bias and variance terms, we obtain the further bound

  E[‖f̄ − f*‖₂²] ≤ ‖E[f̂] − f*‖₂² + (1/m) E[‖f̂ − f*‖₂²] ≤ T₁ + (2/m)(T₂ + T₃),

where T₁ := ‖E[f̂] − f*‖₂², T₂ := ‖f*_λ − f*‖₂², and T₃ := E[‖f̂ − f*_λ‖₂²]. In this decomposition, T₁ and T₂ are bias and approximation error terms induced by the regularization parameter λ, while T₃ is an excess risk (variance) term incurred by minimizing the empirical loss. This upper bound illustrates three trade-offs in our subsampled and averaged kernel regression procedure:

- The trade-off between T₂ and T₃: when the regularization parameter λ grows, the approximation term T₂ increases while the variance term T₃ converges to zero.
- The trade-off between T₁ and T₃: when the regularization parameter λ grows, the bias term T₁ increases while the variance term T₃ converges to zero.
- The trade-off between T₁ and the computation time: when the number of machines m grows, the bias term T₁ increases (as the local sample size n = N/m shrinks), while the computation time N³/m² decreases.

Theoretical results in the KRR literature focus on the trade-off between T₂ and T₃, but in the current context, we also need an upper bound on the bias term T₁, which is not relevant for classical (centralized) analyses. With this setting in mind, Lemma 6 tightly upper bounds the bias T₁ as a function of λ and n. An essential part of the proof is to characterize the properties of E[f̂], which is the expectation of a non-parametric empirical loss minimizer. We are not aware of existing literature on this problem, and the proof of Lemma 6 introduces novel techniques for this purpose.

On the other hand, Lemma 7 upper bounds E[‖f̂ − f*‖₂²] as a function of λ and n. Past work has focused on bounding a quantity of this form, but for technical reasons, most work (e.g. van de Geer, 2000; Mendelson, 2002b; Bartlett et al., 2002; Zhang, 2005) focuses on analyzing the constrained form

  f̂_i := argmin_{‖f‖_H ≤ C} (1/|S_i|) Σ_{(x,y) ∈ S_i} (f(x) − y)²   (27)

of kernel ridge regression. While this problem traces out the same set of solutions as that of the regularized kernel ridge regression estimator (3), it is non-trivial to determine a matched setting of λ for a given C. Zhang (2003) provides one of the few analyses of the regularized ridge regression estimator (3) (or (2)), providing an upper bound of the form E[‖f̂ − f*‖₂²] = O(λ + 1/(λn)), which is at best O(n^(−1/2)). In contrast, Lemma 7 gives an upper bound O(λ + γ(λ)/n); the effective dimension γ(λ) is often much smaller than 1/λ, yielding a stronger convergence guarantee.

4.2 Proof of Corollary 3

We first present a general inequality bounding the size of m for which optimal convergence rates are possible. We assume that d is chosen large enough such that we have log(d) ≥ k and d ≥ N. In the rest of the proof, our assignment to d will satisfy these inequalities. In this case, inspection of Theorem 1 shows that if m is small enough that

  √(log d) (ρ² γ(λ) √(m/N))^k / λ ≤ γ(λ)/N,

then the term T₃(d) provides a convergence rate given by γ(λ)/N. Thus, solving the expression above for m, we find

  m^(k/2) = λ N^(k/2 − 1) / ( √(log d) ρ^(2k) γ(λ)^(k−1) ),   or equivalently   m^((k−2)/2) = λ N^((k−2)/2) / ( √(log d) ρ^(2k) γ(λ)^(k−1) N^(1/2) ).

Taking (k−2)/2-th roots of both sides, we obtain that if

  m ≤ N^((k−4)/(k−2)) / ( γ(λ)^(2(k−1)/(k−2)) ρ^(4k/(k−2)) log^(k/(k−2)) d ),   (28)

then the term T₃(d) of the bound (7) is O(γ(λ)/N).

Now we apply the bound (28) in the case in the corollary. Let us take d = max{r, N}. Notice that β_d = β_r = μ_{r+1} = 0. We find that γ(λ) ≤ r since each of its terms is bounded by 1, and we take λ = r/N. Evaluating the expression (28) with this value, we arrive at

  m ≤ c N^((k−4)/(k−2)) / ( r^(2(k−1)/(k−2)) ρ^(4k/(k−2)) log^(k/(k−2)) d ).

If we have sufficiently many moments that k ≥ log N, and N ≥ r (for example, if the basis functions φ_j have a uniform bound ρ, then k can be chosen arbitrarily large), then we may take k = log N, which implies that N^((k−4)/(k−2)) = Ω(N), r^(2(k−1)/(k−2)) = O(r²) and ρ^(4k/(k−2)) = O(ρ⁴); and we replace log d with log N. Then so long as

  m ≤ c N / (r² ρ⁴ log N)

for some constant c > 0, we obtain an identical result.

4.3 Proof of Corollary 4

We follow the program outlined in our remarks following Theorem 1. We must first choose λ on the order of γ(λ)/N. To that end, we note that setting λ = N^(−2ν/(2ν+1)) gives

  γ(λ) = Σ_{j=1}^∞ 1/(1 + j^(2ν) N^(−2ν/(2ν+1))) ≤ N^(1/(2ν+1)) + N^(2ν/(2ν+1)) Σ_{j > N^(1/(2ν+1))} j^(−2ν)
   ≤ N^(1/(2ν+1)) + N^(2ν/(2ν+1)) ∫_{N^(1/(2ν+1))}^∞ u^(−2ν) du = N^(1/(2ν+1)) + N^(1/(2ν+1))/(2ν − 1).

Dividing by N, we find that γ(λ)/N ≲ λ, as desired. Now we choose the truncation parameter d. By choosing d = N^t for some t ∈ R₊, we find that μ_{d+1} ≲ N^(−2νt) and an integration yields β_d ≲ N^(−(2ν−1)t). Setting t = 3/(2ν − 1) guarantees that μ_{d+1} ≲ N^(−3) and β_d ≲ N^(−3); the corresponding terms in the bound (7) are thus negligible. Moreover, we have for any finite k that log d ≳ k.

Applying the general bound (28) on m, we arrive at the inequality

  m ≤ c N^((k−4)/(k−2)) / ( N^(2(k−1)/((2ν+1)(k−2))) ρ^(4k/(k−2)) log^(k/(k−2)) N ) = c N^((2(k−4)ν − k − 2)/((2ν+1)(k−2))) / ( ρ^(4k/(k−2)) log^(k/(k−2)) N ).

Whenever this holds, we have convergence rate λ = N^(−2ν/(2ν+1)). Now, let Assumption A' hold. Then taking k = log N, the above bound becomes (to a multiplicative constant factor) N^((2ν−1)/(2ν+1))/(ρ⁴ log N), as claimed.

4.4 Proof of Corollary 5

First, we set λ = 1/N. Considering the sum γ(λ) = Σ_{j=1}^∞ μ_j/(μ_j + λ), we see that for j ≤ √(log N / c₂), the elements of the sum are bounded by 1. For j > √(log N / c₂), we make the approximation

  Σ_{j > √(log N / c₂)} μ_j/(μ_j + λ) ≤ (1/λ) Σ_{j > √(log N / c₂)} μ_j ≤ N ∫_{√(log N / c₂)}^∞ c₁ exp(−c₂ t²) dt = O(1).

Thus we find that γ(λ) ≤ √(log N / c₂) + c for some constant c. By choosing d = √N, we have that the tail sum and (d+1)-th eigenvalue both satisfy μ_{d+1} ≤ β_d ≤ c₁ N^(−4) for N sufficiently large. As a consequence, all the terms involving β_d or μ_{d+1} in the bound (7) are negligible. Recalling our inequality (28), we thus find that (under Assumption A), as long as the number of partitions m satisfies

  m ≤ c N^((k−4)/(k−2)) / ( ρ^(4k/(k−2)) log^((2k−1)/(k−2)) N ),

the convergence rate of f̄ to f* is given by γ(λ)/N ≲ √(log N)/N. Under the boundedness assumption A', as we did in the proof of Corollary 3, we take k = log N in Theorem 1. By inspection, this yields the second statement of the corollary.

5. Proof of Theorem 2 and related results

In this section, we provide the proofs of Theorem 2, as well as the bound (13) based on the alternative form of the residual error. As in the previous section, we present a high-level proof, deferring more technical arguments to the appendices.

5.1 Proof of Theorem 2

We begin by stating and proving two auxiliary claims:

  E[(Y − f(X))²] = E[(Y − f*(X))²] + ‖f − f*‖₂²  for any f ∈ L²(P),   (29a)
  f*_λ̄ = argmin_{‖f‖_H ≤ R} ‖f − f*‖₂².   (29b)

Let us begin by proving equality (29a). By adding and subtracting terms, we have

  E[(Y − f(X))²] = E[(Y − f*(X))²] + ‖f − f*‖₂² + 2 E[(f(X) − f*(X)) E[f*(X) − Y | X]]
   (i)= E[(Y − f*(X))²] + ‖f − f*‖₂²,

where equality (i) follows since the random variable Y − f*(X) is mean-zero given X = x. For the second equality (29b), consider any function f in the RKHS that satisfies the bound ‖f‖_H ≤ R. The definition of the minimizer f*_λ̄ guarantees that

  E[(f*_λ̄(X) − Y)²] + λ̄ R² ≤ E[(f(X) − Y)²] + λ̄ ‖f‖_H² ≤ E[(f(X) − Y)²] + λ̄ R².

This result combined with equation (29a) establishes the equality (29b).

We now turn to the proof of the theorem. Applying Hölder's inequality yields that

  ‖f̄ − f*‖₂² ≤ (1 + q⁻¹) ‖f*_λ̄ − f*‖₂² + (1 + q) ‖f̄ − f*_λ̄‖₂² = (1 + q⁻¹) inf_{‖f‖_H ≤ R} ‖f − f*‖₂² + (1 + q) ‖f̄ − f*_λ̄‖₂²   for all q > 0,   (30)

where the second step follows from equality (29b). It thus suffices to upper bound E[‖f̄ − f*_λ̄‖₂²], and following the deduction of inequality (24), we immediately obtain the decomposition formula

  E[‖f̄ − f*_λ̄‖₂²] ≤ (1/m) E[‖f̂ − f*_λ̄‖₂²] + ‖E[f̂] − f*_λ̄‖₂²,   (31)

where f̂ denotes the empirical minimizer for one of the subsampled datasets (i.e. the standard KRR solution on a sample of size n = N/m with regularization λ). This suggests our strategy, which parallels our proof of Theorem 1: we upper bound E[‖f̂ − f*_λ̄‖₂²] and ‖E[f̂] − f*_λ̄‖₂², respectively. In the rest of the proof, we let f̂ = f̂₁ denote this solution. Let the estimation error for a subsample be given by Δ = f̂ − f*_λ̄. Under Assumptions A and B', we have the following two lemmas bounding expression (31), which parallel Lemmas 6 and 7 in the case when f* ∈ H. In each lemma, C denotes a universal constant.

Lemma 8: For all d = 1, 2, ..., we have

  E[‖Δ‖₂²] ≤ 16(λ − λ̄) R² + 8 γ(λ) ρ² τ_λ̄² / n + (32 R⁴ + 8 τ_λ̄⁴)^(1/2) ( μ_{d+1} + 16 ρ⁴ tr(K) β_d / λ ) + C b(n, d, k) (ρ² γ(λ)/√n)^k (R² + τ_λ̄²/λ).   (32)

Denoting the right hand side of inequality (32) by D, we have

Lemma 9: For all d = 1, 2, ..., we have

  E[‖Δ‖₂⁴] ≤ ( 4(λ − λ̄) R² + C (log d)(ρ² γ(λ))²/n ) D + (32 R⁴ + 8 τ_λ̄⁴)^(1/2) ( μ_{d+1} + 4 ρ⁴ tr(K) β_d / λ ).   (33)

See Appendices C and D for the proofs of these two lemmas.

Given these two lemmas, we can now complete the proof of the theorem. If the conditions (10) hold, we have

  β_d ≤ λ / ((R² + τ_λ̄²/λ) N),   μ_{d+1} ≤ λ / ((R² + τ_λ̄²/λ) N),   (log d)(ρ² γ(λ))² ≤ n/m,   and   b(n, d, k) (ρ² γ(λ)/√n)^k ≤ λ / ((R² + τ_λ̄²/λ) N),

so there is a universal constant C satisfying

  (32 R⁴ + 8 τ_λ̄⁴)^(1/2) ( μ_{d+1} + 16 ρ⁴ tr(K) β_d / λ ) + C b(n, d, k) (ρ² γ(λ)/√n)^k (R² + τ_λ̄²/λ) ≤ C/N.

Consequently, Lemma 8 yields the upper bound

  E[‖Δ‖₂²] ≤ 16(λ − λ̄) R² + 8 γ(λ) ρ² τ_λ̄² / n + C/N.

Since (log d)(ρ² γ(λ))²/n ≤ 1/m by assumption, we obtain

  E[‖f̄ − f*_λ̄‖₂²] ≤ C(λ − λ̄) R² / m + C γ(λ) ρ² τ_λ̄² / N + C/(Nm) + 4(λ − λ̄) R² + C(λ − λ̄) R² / m + C γ(λ) ρ² τ_λ̄² / N + C/(Nm) + C/N,

where C is a universal constant whose value is allowed to change from line to line. Summing these bounds and using the condition that λ̄ ≤ λ, we conclude that

  E[‖f̄ − f*_λ̄‖₂²] ≤ (4 + C/m)(λ − λ̄) R² + C γ(λ) ρ² τ_λ̄² / N + C/N.

Combining this error bound with inequality (30) completes the proof.

5.2 Proof of the bound (13)

Using Theorem 2, it suffices to show that

  τ_λ̄⁴ ≤ 8 tr(K)² ‖f*_λ̄‖_H⁴ ρ⁴ + 8 κ⁴.   (34)

By the tower property of expectations and Jensen's inequality, we have

  τ_λ̄⁴ = E[ (E[(f*_λ̄(x) − Y)² | X = x])² ] ≤ E[(f*_λ̄(X) − Y)⁴] ≤ 8 E[(f*_λ̄(X))⁴] + 8 E[Y⁴].

Since we have assumed that E[Y⁴] ≤ κ⁴, the only remaining step is to upper bound E[(f*_λ̄(X))⁴]. Let f*_λ̄ have expansion (θ₁, θ₂, ...) in the basis {φ_j}. For any x ∈ X, Hölder's inequality applied with the conjugates 4/3 and 4 implies the upper bound

  f*_λ̄(x)² = ( Σ_{j=1}^∞ μ_j^(1/4) θ_j^(1/2) · (θ_j^(1/2)/μ_j^(1/4)) φ_j(x) )² ≤ ( Σ_{j=1}^∞ μ_j^(1/3) θ_j^(2/3) )^(3/2) ( Σ_{j=1}^∞ (θ_j²/μ_j) φ_j⁴(x) )^(1/2).   (35)

Again applying Hölder's inequality (this time with conjugates 3/2 and 3) to upper bound the first term in the product in inequality (35), we obtain

  Σ_{j=1}^∞ μ_j^(1/3) θ_j^(2/3) = Σ_{j=1}^∞ μ_j^(2/3) (θ_j²/μ_j)^(1/3) ≤ ( Σ_{j=1}^∞ μ_j )^(2/3) ( Σ_{j=1}^∞ θ_j²/μ_j )^(1/3) = tr(K)^(2/3) ‖f*_λ̄‖_H^(2/3).   (36)

Combining inequalities (35) and (36), we find that

  E[(f*_λ̄(X))⁴] ≤ tr(K)² ‖f*_λ̄‖_H² Σ_{j=1}^∞ (θ_j²/μ_j) E[φ_j⁴(X)] ≤ tr(K)² ‖f*_λ̄‖_H⁴ ρ⁴,

where we have used Assumption A. This completes the proof of inequality (34).

6. Experimental results

In this section, we report the results of experiments on both simulated and real-world data designed to test the sharpness of our theoretical predictions.

6.1 Simulation studies

We begin by exploring the empirical performance of our subsample-and-average methods for a non-parametric regression problem on simulated datasets. For all experiments in this section, we simulate data from the regression model y = f*(x) + ε for x ∈ [0, 1], where f*(x) := min(x, 1 − x) is 1-Lipschitz, the noise variables ε ~ N(0, σ²) are normally distributed with variance σ² = 1/5, and the samples x_i ~ Uni[0, 1]. The Sobolev space of Lipschitz functions on [0, 1] has reproducing kernel K(x, x') = 1 + min{x, x'} and norm ‖f‖_H² = f(0)² + ∫₀¹ (f'(z))² dz. By construction, the function f*(x) = min(x, 1 − x) satisfies ‖f*‖_H = 1. The kernel ridge regression estimator f̂ takes the form

  f̂ = Σ_{i=1}^N α_i K(x_i, ·),  where  α = (K + λNI)^(−1) y,   (37)

and K is the N × N Gram matrix and I is the N × N identity matrix. Since the first-order Sobolev kernel has eigenvalues (Gu, 2002) that scale as μ_j ≍ (1/j)², the minimax convergence rate in terms of squared L²(P)-error is N^(−2/3) (see e.g. Tsybakov (2009); Stone (1982); Caponnetto and De Vito (2007)). By Corollary 4 with ν = 1, this optimal rate of convergence can be achieved by Fast-KRR with regularization parameter λ ≈ N^(−2/3) as long as the number of partitions m satisfies m ≲ N^(1/3). In each of our experiments, we begin with a dataset of size N = mn, which we partition uniformly at random into m disjoint subsets. We compute the local estimator f̂_i for each of the m subsets using n samples via (37), where the Gram matrix is constructed using the ith batch of samples (and n replaces N). We then compute

f̄ = (1/m) Σ_{i=1}^m f̂_i. Our experiments compare the error of f̄ as a function of sample size N, the number of partitions m, and the regularization λ.

[Figure 1: The squared L²(P)-norm between the averaged estimate f̄ and the optimal solution f*, plotted against the total number of samples N for m ∈ {1, 4, 16, 64}. (a) These plots correspond to the output of the Fast-KRR algorithm: each sub-problem is under-regularized by using λ ≈ N^(−2/3). (b) Analogous plots when each sub-problem is not under-regularized, that is, with λ = n^(−2/3) = (N/m)^(−2/3) chosen as if there were only a single dataset of size n.]

In Figure 1(a), we plot the error ‖f̄ − f*‖₂² versus the total number of samples N, where N ∈ {2⁸, 2⁹, ..., 2¹³}, using four different data partitions m ∈ {1, 4, 16, 64}. We execute each simulation 20 times to obtain standard errors for the plot. The black circled curve (m = 1) gives the baseline KRR error; if the number of partitions m ≤ 16, Fast-KRR has accuracy comparable to the baseline algorithm. Even with m = 64, Fast-KRR's performance closely matches the full estimator for larger sample sizes (N ≳ 2¹¹). In the right plot, Figure 1(b), we perform an identical experiment, but we over-regularize by choosing λ = n^(−2/3) rather than λ = N^(−2/3) in each of the m sub-problems, combining the local estimates by averaging as usual. In contrast to Figure 1(a), there is an obvious gap between the performance of the algorithms when m = 1 and m > 1, as our theory predicts.

[Figure 2: The mean-square error curves for fixed sample size but varied number of partitions, plotted against log(# of partitions)/log(# of samples) for N ∈ {256, 512, 1024, 2048, 4096, 8192}. We are interested in the threshold of partitioning number m under which the optimal rate of convergence is achieved.]
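The simulation just described is straightforward to reproduce in outline. The following sketch (ours, not the Matlab code used for the reported experiments) draws data from the stated model, fits the local estimators via the closed form (37) with the first-order Sobolev kernel, and averages them, for both the under-regularized choice λ = N^(−2/3) and the naive choice λ = (N/m)^(−2/3); the Monte Carlo approximation of the L²(P) error and all specific sizes are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sobolev_kernel(a, b):
    """First-order Sobolev kernel K(x, x') = 1 + min{x, x'} on [0, 1]."""
    return 1.0 + np.minimum(a[:, None], b[None, :])

def krr(x, y, lam):
    """Closed-form KRR, eq. (37): alpha = (K + lam * n * I)^{-1} y."""
    n = x.shape[0]
    K = sobolev_kernel(x, x)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return x, alpha

def predict(model, x_test):
    x_train, alpha = model
    return sobolev_kernel(x_test, x_train) @ alpha

def fast_krr_error(N, m, lam, n_test=2000):
    f_star = lambda x: np.minimum(x, 1.0 - x)
    x, x_test = rng.uniform(size=N), rng.uniform(size=n_test)
    y = f_star(x) + rng.normal(scale=np.sqrt(0.2), size=N)    # sigma^2 = 1/5
    splits = np.array_split(rng.permutation(N), m)
    models = [krr(x[idx], y[idx], lam) for idx in splits]
    pred = np.mean([predict(mod, x_test) for mod in models], axis=0)
    return np.mean((pred - f_star(x_test)) ** 2)               # Monte Carlo ||f_bar - f*||_2^2

for N in (256, 1024, 4096):
    for m in (1, 4, 16, 64):
        under = fast_krr_error(N, m, lam=N ** (-2.0 / 3.0))        # lambda = N^{-2/3}
        naive = fast_krr_error(N, m, lam=(N / m) ** (-2.0 / 3.0))  # lambda = (N/m)^{-2/3}
        print(N, m, round(under, 5), round(naive, 5))
```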

[Table 1: Timing experiment giving ‖f̄ − f*‖₂² as a function of the number of partitions m ∈ {1, 16, 64, 256, 1024} and data size N ∈ {2¹², ..., 2¹⁷}, providing mean run-time (measured in seconds, with standard deviations in parentheses) for each number m of partitions and data size N.]

It is also interesting to understand the number of partitions m into which a dataset of size N may be divided while maintaining good statistical performance. According to Corollary 4 with ν = 1, for the first-order Sobolev kernel, performance degradation should be limited as long as m ≲ N^(1/3). In order to test this prediction, Figure 2 plots the mean-square error ‖f̄ − f*‖₂² versus the ratio log(m)/log(N). Our theory predicts that even as the number of partitions m may grow polynomially in N, the error should grow only above some constant value of log(m)/log(N). As Figure 2 shows, the point at which ‖f̄ − f*‖₂² begins to increase appears to be around log(m) ≈ 0.45 log(N) for reasonably large N. This empirical performance is somewhat better than the (1/3) threshold predicted by Corollary 4, but it does confirm that the number of partitions m can scale polynomially with N while retaining minimax optimality.

Our final experiment gives evidence for the improved time complexity partitioning provides. Here we compare the amount of time required to solve the KRR problem using the naive matrix inversion (37) for different partition sizes m and provide the resulting squared errors ‖f̄ − f*‖₂². Although there are more sophisticated solution strategies, we believe this is a reasonable proxy to exhibit Fast-KRR's potential. In Table 1, we present the results of this simulation, which we performed in Matlab using a Windows machine with 16GB of memory and a single-threaded 3.4Ghz processor. In each entry of the table, we give the mean error of Fast-KRR and the mean amount of time it took to run (with standard deviation over 10 simulations in parentheses; the error rate standard deviations are an order of magnitude smaller than the errors, so we do not report them). The entries "Fail" correspond to out-of-memory failures because of the large matrix inversion, while entries "N/A" indicate that ‖f̄ − f*‖₂² was significantly larger than the optimal value (rendering time improvements meaningless). The table shows that without sacrificing accuracy, decomposition via Fast-KRR can yield substantial computational improvements.
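For orientation, a minimal timing harness in the same spirit as Table 1 is sketched below; absolute run-times are machine- and language-dependent (the original experiment used Matlab), so only the qualitative trend matters: each local solve costs O((N/m)³), for total work O(N³/m²) as noted in the introduction.

```python
import time
import numpy as np

rng = np.random.default_rng(1)

def naive_krr_time(N, m, lam):
    """Wall-clock time to fit Fast-KRR with m partitions via dense solves."""
    x = rng.uniform(size=N)
    y = np.minimum(x, 1 - x) + rng.normal(scale=np.sqrt(0.2), size=N)
    start = time.perf_counter()
    for idx in np.array_split(rng.permutation(N), m):
        xi = x[idx]
        K = 1.0 + np.minimum(xi[:, None], xi[None, :])
        np.linalg.solve(K + lam * idx.size * np.eye(idx.size), y[idx])
    return time.perf_counter() - start

N = 2 ** 12
for m in (1, 16, 64):
    print(m, round(naive_krr_time(N, m, lam=N ** (-2.0 / 3.0)), 3))
```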

[Figure 3: Results on year prediction on held-out test songs for Fast-KRR, Nyström sampling, and random feature approximation (mean square error versus training runtime in seconds). Error bars indicate standard deviations over ten experiments.]

6.2 Real data experiments

We now turn to the results of experiments studying the performance of Fast-KRR on the task of predicting the year in which a song was released based on audio features associated with the song. We use the Million Song Dataset (Bertin-Mahieux et al., 2011), which consists of 463,715 training examples and a second set of 51,630 testing examples. Each example is a song (track) released between 1922 and 2011, and the song is represented as a vector of timbre information computed about the song. Each sample consists of the pair (x_i, y_i) ∈ R^d × R, where x_i ∈ R^d is a d = 90-dimensional vector and y_i ∈ [1922, 2011] is the year in which the song was released. (For further details, see Bertin-Mahieux et al. (2011).)

Our experiments with this dataset use the Gaussian radial basis kernel

  K(x, x') = exp( −‖x − x'‖₂² / (2σ²) ).   (38)

We normalize the feature vectors x so that the timbre signals have standard deviation 1, and select the bandwidth parameter σ = 6 via cross-validation. For regularization, we set λ = 1/N; since the Gaussian kernel has exponentially decaying eigenvalues (for typical distributions on X), Corollary 5 shows that this regularization achieves the optimal rate of convergence for the Hilbert space.

In Figure 3, we compare the time-accuracy curve of Fast-KRR with two approximation-based methods, plotting the mean-squared error between the predicted release year and the actual year on test songs. The first baseline is Nyström subsampling (Williams and Seeger, 2001), where the kernel matrix is approximated by a low-rank matrix of rank r ∈ {1, ..., 6} × 10³. The second baseline approach is an approximate form of kernel ridge regression using random features (Rahimi and Recht, 2007). The algorithm approximates the Gaussian kernel (38) by the inner product of two random feature vectors of dimensions D ∈ {2, 3, 5, 7, 8.5, 10} × 10³, and then solves the resulting linear regression problem. For the Fast-KRR algorithm, we use seven partitions m ∈ {32, 38, 48, 64, 96, 128, 256} to test the algorithm.
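A minimal sketch of this setup, assuming the features and targets are already loaded into arrays, is given below. The standardization step, bandwidth σ = 6, regularization λ = 1/N, and per-split closed-form solves follow the description above, while everything else (function names, data handling) is our own illustration rather than the experimental code.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=6.0):
    """K(x, x') = exp(-||x - x'||_2^2 / (2 sigma^2)), eq. (38)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def fast_krr_music(X, y, X_test, m, sigma=6.0, rng=np.random.default_rng(0)):
    N = X.shape[0]
    lam = 1.0 / N                                    # Corollary 5 regime
    mu, sd = X.mean(0), X.std(0)
    X, X_test = (X - mu) / sd, (X_test - mu) / sd    # unit-variance timbre features
    preds = []
    for idx in np.array_split(rng.permutation(N), m):
        K = gaussian_kernel(X[idx], X[idx], sigma)
        alpha = np.linalg.solve(K + lam * idx.size * np.eye(idx.size), y[idx])
        preds.append(gaussian_kernel(X_test, X[idx], sigma) @ alpha)
    return np.mean(preds, axis=0)                    # averaged year predictions
```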

[Figure 4: Comparison of the performance of Fast-KRR to a standard KRR estimator using a fraction 1/m of the data (mean square error versus number of partitions m).]

Each algorithm is executed 10 times to obtain standard deviations (plotted as error-bars in Figure 3). As we see in Figure 3, for a fixed time budget, Fast-KRR enjoys the best performance, though the margin between Fast-KRR and Nyström sampling is not substantial. In spite of this close performance between Nyström sampling and the divide-and-conquer Fast-KRR algorithm, it is worth noting that with parallel computation, it is trivial to accelerate Fast-KRR m times; parallelizing approximation-based methods appears to be a non-trivial task. Moreover, as our results in Section 3 indicate, Fast-KRR is minimax optimal in many regimes. At the same time the conference version of this paper was submitted, Bach (2013) published the first results we know of establishing convergence results in ℓ²-error for Nyström sampling; see the discussion for more detail. We note in passing that standard linear regression with the original 90 features, while quite fast (with runtime on the order of 1 second, ignoring data loading), has mean-squared-error 90.44, which is significantly worse than the kernel-based methods.

Our final experiment provides a sanity check: is the final averaging step in Fast-KRR even necessary? To this end, we compare Fast-KRR with standard KRR using a fraction 1/m of the data. For the latter approach, we employ the standard regularization for a sample of size N/m. As Figure 4 shows, Fast-KRR achieves much lower error rates than KRR using only a fraction of the data. Moreover, averaging stabilizes the estimators: the standard deviations of the performance of Fast-KRR are negligible compared to those for standard KRR.

7. Discussion

In this paper, we present results establishing that our decomposition-based algorithm for kernel ridge regression achieves minimax optimal convergence rates whenever the number of splits m of the data is not too large. The error guarantees of our method depend on the effective dimensionality γ(λ) = Σ_{j=1}^∞ μ_j/(μ_j + λ) of the kernel. For any number of splits

m ≲ N/γ²(λ), our method achieves estimation error decreasing as

  E[‖f̄ − f*‖₂²] ≲ λ‖f*‖_H² + σ² γ(λ)/N.

(In particular, recall the bound (8) following Theorem 1.) Notably, this convergence rate is minimax optimal, and we achieve substantial computational benefits from the subsampling schemes, in that computational cost scales (nearly) linearly in N.

It is also interesting to consider the number of kernel evaluations required to implement our method. Our estimator requires m sub-matrices of the full kernel (Gram) matrix, each of size N/m × N/m. Since the method may use m ≲ N/γ²(λ) machines, in the best case, it requires at most Nγ²(λ) kernel evaluations. By contrast, Bach (2013) shows that Nyström-based subsampling can be used to form an estimator within a constant factor of optimal as long as the number of N-dimensional subsampled columns of the kernel matrix scales roughly as the marginal dimension γ̃(λ) = N ‖diag(K(K + λNI)^(−1))‖_∞. Consequently, using roughly Nγ̃(λ) kernel evaluations, Nyström subsampling can achieve optimal convergence rates. These two scalings, namely Nγ²(λ) versus Nγ̃(λ), are currently not comparable: in some situations, such as when the data is not compactly supported, γ̃(λ) can scale linearly with N, while in others it appears to scale roughly as the true effective dimensionality γ(λ). A natural question arising from these lines of work is to understand the true optimal scaling for these different estimators: is one fundamentally better than the other? Are there natural computational tradeoffs that can be leveraged at large scale? As datasets grow substantially larger and more complex, these questions should become even more important, and we hope to continue to study them.

Acknowledgements: We thank Francis Bach for interesting and enlightening conversations on the connections between this work and his paper (Bach, 2013) and Yining Wang for pointing out a mistake in an earlier version of this manuscript. We also thank two reviewers for useful feedback and comments. JCD was partially supported by a National Defense Science and Engineering Graduate Fellowship (NDSEG) and a Facebook PhD fellowship. This work was partially supported by an ONR MURI grant to MJW.

Appendix A. Proof of Lemma 6

This appendix is devoted to the bias bound stated in Lemma 6. Let X = {x_i}_{i=1}^n be shorthand for the design matrix, and define the error vector Δ = f̂ − f*. By Jensen's inequality, we have ‖E[Δ]‖₂² ≤ E[‖E[Δ | X]‖₂²], so it suffices to provide a bound on E[‖E[Δ | X]‖₂²]. Throughout this proof and the remainder of the paper, we represent the kernel evaluator by the function ξ_x, where ξ_x := K(x, ·) and f(x) = ⟨ξ_x, f⟩_H for any f ∈ H. Using this notation, the estimate f̂ minimizes the empirical objective

  (1/n) Σ_{i=1}^n (⟨ξ_{x_i}, f⟩_H − y_i)² + λ‖f‖_H².   (39)

This objective is Fréchet differentiable, and as a consequence, the necessary and sufficient conditions for optimality (Luenberger, 1969) of f̂ are that

  (1/n) Σ_{i=1}^n ξ_{x_i} (⟨ξ_{x_i}, f̂ − f*⟩_H − ε_i) + λ f̂ = (1/n) Σ_{i=1}^n ξ_{x_i} (⟨ξ_{x_i}, f̂⟩_H − y_i) + λ f̂ = 0,   (40)

where the last equation uses the fact that y_i = ⟨ξ_{x_i}, f*⟩_H + ε_i. Taking conditional expectations over the noise variables {ε_i}_{i=1}^n with the design X = {x_i}_{i=1}^n fixed, we find


More information

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample. Statistical Iferece (Chapter 10) Statistical iferece = lear about a populatio based o the iformatio provided by a sample. Populatio: The set of all values of a radom variable X of iterest. Characterized

More information

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ.

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ. 2 5. Weighted umber of late jobs 5.1. Release dates ad due dates: maximimizig the weight of o-time jobs Oce we add release dates, miimizig the umber of late jobs becomes a sigificatly harder problem. For

More information

Distribution of Random Samples & Limit theorems

Distribution of Random Samples & Limit theorems STAT/MATH 395 A - PROBABILITY II UW Witer Quarter 2017 Néhémy Lim Distributio of Radom Samples & Limit theorems 1 Distributio of i.i.d. Samples Motivatig example. Assume that the goal of a study is to

More information

The standard deviation of the mean

The standard deviation of the mean Physics 6C Fall 20 The stadard deviatio of the mea These otes provide some clarificatio o the distictio betwee the stadard deviatio ad the stadard deviatio of the mea.. The sample mea ad variace Cosider

More information

On Random Line Segments in the Unit Square

On Random Line Segments in the Unit Square O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,

More information

Bayesian Methods: Introduction to Multi-parameter Models

Bayesian Methods: Introduction to Multi-parameter Models Bayesia Methods: Itroductio to Multi-parameter Models Parameter: θ = ( θ, θ) Give Likelihood p(y θ) ad prior p(θ ), the posterior p proportioal to p(y θ) x p(θ ) Margial posterior ( θ, θ y) is Iterested

More information

1 Review of Probability & Statistics

1 Review of Probability & Statistics 1 Review of Probability & Statistics a. I a group of 000 people, it has bee reported that there are: 61 smokers 670 over 5 960 people who imbibe (drik alcohol) 86 smokers who imbibe 90 imbibers over 5

More information

4.3 Growth Rates of Solutions to Recurrences

4.3 Growth Rates of Solutions to Recurrences 4.3. GROWTH RATES OF SOLUTIONS TO RECURRENCES 81 4.3 Growth Rates of Solutios to Recurreces 4.3.1 Divide ad Coquer Algorithms Oe of the most basic ad powerful algorithmic techiques is divide ad coquer.

More information

Sequences. Notation. Convergence of a Sequence

Sequences. Notation. Convergence of a Sequence Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it

More information

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 +

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + 62. Power series Defiitio 16. (Power series) Give a sequece {c }, the series c x = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + is called a power series i the variable x. The umbers c are called the coefficiets of

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

Basics of Probability Theory (for Theory of Computation courses)

Basics of Probability Theory (for Theory of Computation courses) Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.

More information

Ada Boost, Risk Bounds, Concentration Inequalities. 1 AdaBoost and Estimates of Conditional Probabilities

Ada Boost, Risk Bounds, Concentration Inequalities. 1 AdaBoost and Estimates of Conditional Probabilities CS8B/Stat4B Sprig 008) Statistical Learig Theory Lecture: Ada Boost, Risk Bouds, Cocetratio Iequalities Lecturer: Peter Bartlett Scribe: Subhrasu Maji AdaBoost ad Estimates of Coditioal Probabilities We

More information

Support vector machine revisited

Support vector machine revisited 6.867 Machie learig, lecture 8 (Jaakkola) 1 Lecture topics: Support vector machie ad kerels Kerel optimizatio, selectio Support vector machie revisited Our task here is to first tur the support vector

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Discrete Mathematics for CS Spring 2008 David Wagner Note 22

Discrete Mathematics for CS Spring 2008 David Wagner Note 22 CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig

More information

A Hadamard-type lower bound for symmetric diagonally dominant positive matrices

A Hadamard-type lower bound for symmetric diagonally dominant positive matrices A Hadamard-type lower boud for symmetric diagoally domiat positive matrices Christopher J. Hillar, Adre Wibisoo Uiversity of Califoria, Berkeley Jauary 7, 205 Abstract We prove a ew lower-boud form of

More information

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015 ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],

More information

Resampling Methods. X (1/2), i.e., Pr (X i m) = 1/2. We order the data: X (1) X (2) X (n). Define the sample median: ( n.

Resampling Methods. X (1/2), i.e., Pr (X i m) = 1/2. We order the data: X (1) X (2) X (n). Define the sample median: ( n. Jauary 1, 2019 Resamplig Methods Motivatio We have so may estimators with the property θ θ d N 0, σ 2 We ca also write θ a N θ, σ 2 /, where a meas approximately distributed as Oce we have a cosistet estimator

More information

Rademacher Complexity

Rademacher Complexity EECS 598: Statistical Learig Theory, Witer 204 Topic 0 Rademacher Complexity Lecturer: Clayto Scott Scribe: Ya Deg, Kevi Moo Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved for

More information

arxiv: v1 [math.pr] 13 Oct 2011

arxiv: v1 [math.pr] 13 Oct 2011 A tail iequality for quadratic forms of subgaussia radom vectors Daiel Hsu, Sham M. Kakade,, ad Tog Zhag 3 arxiv:0.84v math.pr] 3 Oct 0 Microsoft Research New Eglad Departmet of Statistics, Wharto School,

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week Lecture: Cocept Check Exercises Starred problems are optioal. Statistical Learig Theory. Suppose A = Y = R ad X is some other set. Furthermore, assume P X Y is a discrete

More information

1 Inferential Methods for Correlation and Regression Analysis

1 Inferential Methods for Correlation and Regression Analysis 1 Iferetial Methods for Correlatio ad Regressio Aalysis I the chapter o Correlatio ad Regressio Aalysis tools for describig bivariate cotiuous data were itroduced. The sample Pearso Correlatio Coefficiet

More information

This is an introductory course in Analysis of Variance and Design of Experiments.

This is an introductory course in Analysis of Variance and Design of Experiments. 1 Notes for M 384E, Wedesday, Jauary 21, 2009 (Please ote: I will ot pass out hard-copy class otes i future classes. If there are writte class otes, they will be posted o the web by the ight before class

More information

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters

More information

Infinite Sequences and Series

Infinite Sequences and Series Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet

More information

Sequences and Series of Functions

Sequences and Series of Functions Chapter 6 Sequeces ad Series of Fuctios 6.1. Covergece of a Sequece of Fuctios Poitwise Covergece. Defiitio 6.1. Let, for each N, fuctio f : A R be defied. If, for each x A, the sequece (f (x)) coverges

More information

Introduction to Machine Learning DIS10

Introduction to Machine Learning DIS10 CS 189 Fall 017 Itroductio to Machie Learig DIS10 1 Fu with Lagrage Multipliers (a) Miimize the fuctio such that f (x,y) = x + y x + y = 3. Solutio: The Lagragia is: L(x,y,λ) = x + y + λ(x + y 3) Takig

More information

Spectral Partitioning in the Planted Partition Model

Spectral Partitioning in the Planted Partition Model Spectral Graph Theory Lecture 21 Spectral Partitioig i the Plated Partitio Model Daiel A. Spielma November 11, 2009 21.1 Itroductio I this lecture, we will perform a crude aalysis of the performace of

More information

Topic 9: Sampling Distributions of Estimators

Topic 9: Sampling Distributions of Estimators Topic 9: Samplig Distributios of Estimators Course 003, 2016 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be

More information

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4.

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4. 4. BASES I BAACH SPACES 39 4. BASES I BAACH SPACES Sice a Baach space X is a vector space, it must possess a Hamel, or vector space, basis, i.e., a subset {x γ } γ Γ whose fiite liear spa is all of X ad

More information

NYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018)

NYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018) NYU Ceter for Data Sciece: DS-GA 003 Machie Learig ad Computatioal Statistics (Sprig 208) Brett Berstei, David Roseberg, Be Jakubowski Jauary 20, 208 Istructios: Followig most lab ad lecture sectios, we

More information

17. Joint distributions of extreme order statistics Lehmann 5.1; Ferguson 15

17. Joint distributions of extreme order statistics Lehmann 5.1; Ferguson 15 17. Joit distributios of extreme order statistics Lehma 5.1; Ferguso 15 I Example 10., we derived the asymptotic distributio of the maximum from a radom sample from a uiform distributio. We did this usig

More information

10-701/ Machine Learning Mid-term Exam Solution

10-701/ Machine Learning Mid-term Exam Solution 0-70/5-78 Machie Learig Mid-term Exam Solutio Your Name: Your Adrew ID: True or False (Give oe setece explaatio) (20%). (F) For a cotiuous radom variable x ad its probability distributio fuctio p(x), it

More information

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014.

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014. Product measures, Toelli s ad Fubii s theorems For use i MAT3400/4400, autum 2014 Nadia S. Larse Versio of 13 October 2014. 1. Costructio of the product measure The purpose of these otes is to preset the

More information

CHAPTER 10 INFINITE SEQUENCES AND SERIES

CHAPTER 10 INFINITE SEQUENCES AND SERIES CHAPTER 10 INFINITE SEQUENCES AND SERIES 10.1 Sequeces 10.2 Ifiite Series 10.3 The Itegral Tests 10.4 Compariso Tests 10.5 The Ratio ad Root Tests 10.6 Alteratig Series: Absolute ad Coditioal Covergece

More information

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.

More information

Lecture 10 October Minimaxity and least favorable prior sequences

Lecture 10 October Minimaxity and least favorable prior sequences STATS 300A: Theory of Statistics Fall 205 Lecture 0 October 22 Lecturer: Lester Mackey Scribe: Brya He, Rahul Makhijai Warig: These otes may cotai factual ad/or typographic errors. 0. Miimaxity ad least

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

Optimization Methods MIT 2.098/6.255/ Final exam

Optimization Methods MIT 2.098/6.255/ Final exam Optimizatio Methods MIT 2.098/6.255/15.093 Fial exam Date Give: December 19th, 2006 P1. [30 pts] Classify the followig statemets as true or false. All aswers must be well-justified, either through a short

More information

Since X n /n P p, we know that X n (n. Xn (n X n ) Using the asymptotic result above to obtain an approximation for fixed n, we obtain

Since X n /n P p, we know that X n (n. Xn (n X n ) Using the asymptotic result above to obtain an approximation for fixed n, we obtain Assigmet 9 Exercise 5.5 Let X biomial, p, where p 0, 1 is ukow. Obtai cofidece itervals for p i two differet ways: a Sice X / p d N0, p1 p], the variace of the limitig distributio depeds oly o p. Use the

More information

Statistics 511 Additional Materials

Statistics 511 Additional Materials Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability

More information

Sieve Estimators: Consistency and Rates of Convergence

Sieve Estimators: Consistency and Rates of Convergence EECS 598: Statistical Learig Theory, Witer 2014 Topic 6 Sieve Estimators: Cosistecy ad Rates of Covergece Lecturer: Clayto Scott Scribe: Julia Katz-Samuels, Brado Oselio, Pi-Yu Che Disclaimer: These otes

More information

Math 155 (Lecture 3)

Math 155 (Lecture 3) Math 55 (Lecture 3) September 8, I this lecture, we ll cosider the aswer to oe of the most basic coutig problems i combiatorics Questio How may ways are there to choose a -elemet subset of the set {,,,

More information

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector Dimesio-free PAC-Bayesia bouds for the estimatio of the mea of a radom vector Olivier Catoi CREST CNRS UMR 9194 Uiversité Paris Saclay olivier.catoi@esae.fr Ilaria Giulii Laboratoire de Probabilités et

More information

DS 100: Principles and Techniques of Data Science Date: April 13, Discussion #10

DS 100: Principles and Techniques of Data Science Date: April 13, Discussion #10 DS 00: Priciples ad Techiques of Data Sciece Date: April 3, 208 Name: Hypothesis Testig Discussio #0. Defie these terms below as they relate to hypothesis testig. a) Data Geeratio Model: Solutio: A set

More information

The log-behavior of n p(n) and n p(n)/n

The log-behavior of n p(n) and n p(n)/n Ramauja J. 44 017, 81-99 The log-behavior of p ad p/ William Y.C. Che 1 ad Ke Y. Zheg 1 Ceter for Applied Mathematics Tiaji Uiversity Tiaji 0007, P. R. Chia Ceter for Combiatorics, LPMC Nakai Uivercity

More information

Empirical Process Theory and Oracle Inequalities

Empirical Process Theory and Oracle Inequalities Stat 928: Statistical Learig Theory Lecture: 10 Empirical Process Theory ad Oracle Iequalities Istructor: Sham Kakade 1 Risk vs Risk See Lecture 0 for a discussio o termiology. 2 The Uio Boud / Boferoi

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

Output Analysis and Run-Length Control

Output Analysis and Run-Length Control IEOR E4703: Mote Carlo Simulatio Columbia Uiversity c 2017 by Marti Haugh Output Aalysis ad Ru-Legth Cotrol I these otes we describe how the Cetral Limit Theorem ca be used to costruct approximate (1 α%

More information

Machine Learning for Data Science (CS 4786)

Machine Learning for Data Science (CS 4786) Machie Learig for Data Sciece CS 4786) Lecture & 3: Pricipal Compoet Aalysis The text i black outlies high level ideas. The text i blue provides simple mathematical details to derive or get to the algorithm

More information

Introduction to Optimization Techniques. How to Solve Equations

Introduction to Optimization Techniques. How to Solve Equations Itroductio to Optimizatio Techiques How to Solve Equatios Iterative Methods of Optimizatio Iterative methods of optimizatio Solutio of the oliear equatios resultig form a optimizatio problem is usually

More information

Chapter 9: Numerical Differentiation

Chapter 9: Numerical Differentiation 178 Chapter 9: Numerical Differetiatio Numerical Differetiatio Formulatio of equatios for physical problems ofte ivolve derivatives (rate-of-chage quatities, such as velocity ad acceleratio). Numerical

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology Advaced Aalysis Mi Ya Departmet of Mathematics Hog Kog Uiversity of Sciece ad Techology September 3, 009 Cotets Limit ad Cotiuity 7 Limit of Sequece 8 Defiitio 8 Property 3 3 Ifiity ad Ifiitesimal 8 4

More information

Advanced Stochastic Processes.

Advanced Stochastic Processes. Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.

More information

x a x a Lecture 2 Series (See Chapter 1 in Boas)

x a x a Lecture 2 Series (See Chapter 1 in Boas) Lecture Series (See Chapter i Boas) A basic ad very powerful (if pedestria, recall we are lazy AD smart) way to solve ay differetial (or itegral) equatio is via a series expasio of the correspodig solutio

More information

CS284A: Representations and Algorithms in Molecular Biology

CS284A: Representations and Algorithms in Molecular Biology CS284A: Represetatios ad Algorithms i Molecular Biology Scribe Notes o Lectures 3 & 4: Motif Discovery via Eumeratio & Motif Represetatio Usig Positio Weight Matrix Joshua Gervi Based o presetatios by

More information

Algebra of Least Squares

Algebra of Least Squares October 19, 2018 Algebra of Least Squares Geometry of Least Squares Recall that out data is like a table [Y X] where Y collects observatios o the depedet variable Y ad X collects observatios o the k-dimesioal

More information

Topic 9: Sampling Distributions of Estimators

Topic 9: Sampling Distributions of Estimators Topic 9: Samplig Distributios of Estimators Course 003, 2018 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be

More information

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices Radom Matrices with Blocks of Itermediate Scale Strogly Correlated Bad Matrices Jiayi Tog Advisor: Dr. Todd Kemp May 30, 07 Departmet of Mathematics Uiversity of Califoria, Sa Diego Cotets Itroductio Notatio

More information

Lecture 33: Bootstrap

Lecture 33: Bootstrap Lecture 33: ootstrap Motivatio To evaluate ad compare differet estimators, we eed cosistet estimators of variaces or asymptotic variaces of estimators. This is also importat for hypothesis testig ad cofidece

More information

Chapter 10: Power Series

Chapter 10: Power Series Chapter : Power Series 57 Chapter Overview: Power Series The reaso series are part of a Calculus course is that there are fuctios which caot be itegrated. All power series, though, ca be itegrated because

More information

Math Solutions to homework 6

Math Solutions to homework 6 Math 175 - Solutios to homework 6 Cédric De Groote November 16, 2017 Problem 1 (8.11 i the book): Let K be a compact Hermitia operator o a Hilbert space H ad let the kerel of K be {0}. Show that there

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit theorems Throughout this sectio we will assume a probability space (Ω, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Lecture 2. The Lovász Local Lemma

Lecture 2. The Lovász Local Lemma Staford Uiversity Sprig 208 Math 233A: No-costructive methods i combiatorics Istructor: Ja Vodrák Lecture date: Jauary 0, 208 Origial scribe: Apoorva Khare Lecture 2. The Lovász Local Lemma 2. Itroductio

More information

1 of 7 7/16/2009 6:06 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 6. Order Statistics Defiitios Suppose agai that we have a basic radom experimet, ad that X is a real-valued radom variable

More information

6.867 Machine learning

6.867 Machine learning 6.867 Machie learig Mid-term exam October, ( poits) Your ame ad MIT ID: Problem We are iterested here i a particular -dimesioal liear regressio problem. The dataset correspodig to this problem has examples

More information

Seunghee Ye Ma 8: Week 5 Oct 28

Seunghee Ye Ma 8: Week 5 Oct 28 Week 5 Summary I Sectio, we go over the Mea Value Theorem ad its applicatios. I Sectio 2, we will recap what we have covered so far this term. Topics Page Mea Value Theorem. Applicatios of the Mea Value

More information

Section 1 of Unit 03 (Pure Mathematics 3) Algebra

Section 1 of Unit 03 (Pure Mathematics 3) Algebra Sectio 1 of Uit 0 (Pure Mathematics ) Algebra Recommeded Prior Kowledge Studets should have studied the algebraic techiques i Pure Mathematics 1. Cotet This Sectio should be studied early i the course

More information

Recursive Algorithms. Recurrences. Recursive Algorithms Analysis

Recursive Algorithms. Recurrences. Recursive Algorithms Analysis Recursive Algorithms Recurreces Computer Sciece & Egieerig 35: Discrete Mathematics Christopher M Bourke cbourke@cseuledu A recursive algorithm is oe i which objects are defied i terms of other objects

More information

Machine Learning Assignment-1

Machine Learning Assignment-1 Uiversity of Utah, School Of Computig Machie Learig Assigmet-1 Chadramouli, Shridhara sdhara@cs.utah.edu 00873255) Sigla, Sumedha sumedha.sigla@utah.edu 00877456) September 10, 2013 1 Liear Regressio a)

More information

SRC Technical Note June 17, Tight Thresholds for The Pure Literal Rule. Michael Mitzenmacher. d i g i t a l

SRC Technical Note June 17, Tight Thresholds for The Pure Literal Rule. Michael Mitzenmacher. d i g i t a l SRC Techical Note 1997-011 Jue 17, 1997 Tight Thresholds for The Pure Literal Rule Michael Mitzemacher d i g i t a l Systems Research Ceter 130 Lytto Aveue Palo Alto, Califoria 94301 http://www.research.digital.com/src/

More information

6.883: Online Methods in Machine Learning Alexander Rakhlin

6.883: Online Methods in Machine Learning Alexander Rakhlin 6.883: Olie Methods i Machie Learig Alexader Rakhli LECTURES 5 AND 6. THE EXPERTS SETTING. EXPONENTIAL WEIGHTS All the algorithms preseted so far halluciate the future values as radom draws ad the perform

More information

Regression with an Evaporating Logarithmic Trend

Regression with an Evaporating Logarithmic Trend Regressio with a Evaporatig Logarithmic Tred Peter C. B. Phillips Cowles Foudatio, Yale Uiversity, Uiversity of Aucklad & Uiversity of York ad Yixiao Su Departmet of Ecoomics Yale Uiversity October 5,

More information