Random Design Analysis of Ridge Regression


JMLR: Workshop and Conference Proceedings vol 23 (2012) — 25th Annual Conference on Learning Theory

Random Design Analysis of Ridge Regression

Daniel Hsu, Microsoft Research (DAHSU@MICROSOFT.COM)
Sham M. Kakade, Microsoft Research (SKAKADE@MICROSOFT.COM)
Tong Zhang, Rutgers University (TZHANG@STAT.RUTGERS.EDU)

Editors: Shie Mannor, Nathan Srebro, Robert C. Williamson

Abstract

This work gives a simultaneous analysis of both the ordinary least squares estimator and the ridge regression estimator in the random design setting under mild assumptions on the covariate/response distributions. In particular, the analysis provides sharp results on the out-of-sample prediction error, as opposed to the in-sample (fixed design) error. The analysis also reveals the effect of errors in the estimated covariance structure, as well as the effect of modeling errors; neither of these effects is present in the fixed design setting. The proofs of the main results are based on a simple decomposition lemma combined with concentration inequalities for random vectors and matrices.

1. Introduction

In the random design setting for linear regression, we are provided with n samples of covariates and responses, (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), which are sampled independently from a population, where the x_i are random vectors and the y_i are random variables. Typically, these pairs are hypothesized to have the linear relationship

    y_i = ⟨β, x_i⟩ + ε_i

for some linear function β (though this hypothesis need not be true). Here, the ε_i are error terms, typically assumed to be normally distributed as N(0, σ²). The goal of estimation in this setting is to find coefficients β̂ based on these (x_i, y_i) pairs such that the expected prediction error on a new draw (x, y) from the population, measured as E[(⟨β̂, x⟩ − y)²], is as small as possible. This goal can also be interpreted as estimating β with accuracy measured under a particular norm. The random design setting stands in contrast to the fixed design setting, where the covariates x_1, x_2, ..., x_n are fixed (i.e., deterministic), and only the responses y_1, y_2, ..., y_n are treated as random.
Thus, the covariance structure of the design points is completely known and need not be estimated, which simplifies the analysis of standard estimators. However, the fixed design setting does not directly address out-of-sample prediction, which is of primary concern in many applications. For instance, in prediction problems, the estimator β̂ is computed from an initial sample from the population, and the end-goal is to use β̂ as a predictor of y given x, where (x, y) is a new draw from the population. A fixed design analysis only assesses the accuracy of β̂ on data already seen, while a random design analysis is concerned with the predictive performance on unseen data.

© 2012 D. Hsu, S.M. Kakade & T. Zhang.

This work gives a detailed analysis of both the ordinary least squares and ridge estimators (Hoerl, 1962) in the random design setting that quantifies the essential differences between random and fixed design. In particular, the analysis reveals, through a simple decomposition, the effect of errors in the estimated covariance structure, as well as the effect of approximating the true regression function by a linear function in the case the model is misspecified. Neither of these effects is present in the fixed design analysis of ridge regression. The random design analysis shows that the effect of errors in the estimated covariance structure is minimal: it is typically a second-order effect as soon as the sample size is large enough. The analysis also isolates the effect of approximation error in the main terms of the estimation error bound, so that the bound reduces to one that scales with the noise variance when the approximation error vanishes.

One feature of the analysis in this work is that it applies to the ridge estimator with an arbitrary setting of the regularization parameter λ. The estimation error is given in terms of the spectrum of the covariance E[x ⊗ x] and the particular choice of λ. When λ = 0, we obtain an analysis of ordinary least squares, applicable when the spectrum is finite (i.e., when the covariates live in a finite dimensional space). More generally, the convergence rate can be optimized by appropriately setting λ based on assumptions about the spectrum.

Outline. Section 2 discusses the model, preliminaries, and related work. Section 3 presents the main results on the excess mean squared error of the ordinary least squares and ridge estimators under random design and discusses the relationship to the standard fixed design analysis. An application to smoothing splines is provided in Appendix A, and the proofs of the main results are given in Appendix B.

2. Preliminaries

2.1. Notation

Unless otherwise specified, all vectors in this work are assumed to live in a (possibly infinite dimensional) separable Hilbert space with inner product ⟨·,·⟩.
Let ‖·‖_M, for a self-adjoint positive semidefinite linear operator M ⪰ 0, denote the vector norm given by ‖v‖_M := √⟨v, Mv⟩. When M is omitted, it is assumed to be the identity, so ‖v‖ = √⟨v, v⟩. Let u ⊗ u denote the outer product of a vector u, which acts as the rank-one linear operator v ↦ (u ⊗ u)v = ⟨v, u⟩u. For a linear operator M, let ‖M‖ denote its spectral (operator) norm, i.e., ‖M‖ = sup_{v ≠ 0} ‖Mv‖/‖v‖, and let ‖M‖_F denote its Frobenius norm, i.e., ‖M‖_F = √tr(M*M). If M is self-adjoint, ‖M‖_F = √tr(M²). Let λ_max[M] and λ_min[M], respectively, denote the largest and smallest eigenvalues of a self-adjoint linear operator M.

2.2. Linear regression

Let x be a random vector, and let y be a random variable. Let {v_j} be the eigenvectors of

    Σ := E[x ⊗ x],                                                          (1)

so that they form an orthonormal basis. The corresponding eigenvalues are

    λ_j := ⟨v_j, Σv_j⟩ = E[⟨v_j, x⟩²]

(assumed to be non-zero for convenience). Let β achieve the minimum mean squared error over all linear functions, i.e.,

    E[(⟨β, x⟩ − y)²] = min_w { E[(⟨w, x⟩ − y)²] },

so that:

    β := Σ_j β_j v_j   where   β_j := E[⟨v_j, x⟩ y] / E[⟨v_j, x⟩²].          (2)

We also have that the excess mean squared error of w over the minimum is:

    E[(⟨w, x⟩ − y)²] − E[(⟨β, x⟩ − y)²] = ‖w − β‖²_Σ

(see Proposition 21).

2.3. The ridge and ordinary least squares estimators

Let (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) be independent copies of (x, y), and let Ê denote the empirical expectation with respect to these copies, i.e.,

    Ê[f] := (1/n) Σ_{i=1}^n f(x_i, y_i),   Σ̂ := Ê[x ⊗ x] = (1/n) Σ_{i=1}^n x_i ⊗ x_i.      (3)

Let β̂_λ denote the ridge estimator with parameter λ ≥ 0, defined as the minimizer of the λ-regularized empirical mean squared error, i.e.,

    β̂_λ := arg min_w { Ê[(⟨w, x⟩ − y)²] + λ‖w‖² }.                           (4)

The special case with λ = 0 is the ordinary least squares estimator, which minimizes the empirical mean squared error. These estimators are uniquely defined if and only if Σ̂ + λI ≻ 0 (a sufficient condition is λ > 0), in which case β̂_λ = (Σ̂ + λI)^{-1} Ê[xy].

2.4. Data model

We now specify the conditions on the random pair (x, y) under which the analysis applies.

2.4.1. COVARIATE MODEL

The following conditions on the covariate x ensure that the second-moment operator Σ can be estimated from a random sample with sufficient accuracy. The first requires that the spectrum of Σ decays sufficiently fast at regularization level λ.

Condition 1 (Spectral decay at λ) For p ∈ {1, 2},

    d_{p,λ} := Σ_j (λ_j / (λ_j + λ))^p < ∞.                                  (5)
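As a sanity check, the closed form in (4) is easy to compute numerically. The sketch below (in NumPy; the function name `ridge_estimator` is ours, not the paper's) verifies that λ = 0 recovers ordinary least squares and that λ > 0 shrinks the estimate:

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Minimizer of the lam-regularized empirical MSE, eq. (4).

    Closed form (valid when Sigma_hat + lam*I is invertible):
        beta_hat_lam = (Sigma_hat + lam*I)^{-1} E_hat[x y],
    where Sigma_hat = (1/n) X^T X and E_hat[x y] = (1/n) X^T y.
    """
    n, d = X.shape
    Sigma_hat = X.T @ X / n
    return np.linalg.solve(Sigma_hat + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# lam = 0 is ordinary least squares.
beta_ols = ridge_estimator(X, y, 0.0)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_ols, beta_lstsq)

# lam > 0 shrinks the coefficients (Proposition 22).
beta_ridge = ridge_estimator(X, y, 10.0)
assert np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols)
```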

For technical reasons, we also use the quantity

    d̃_{1,λ} := max{d_{1,λ}, 1}                                              (6)

merely to simplify certain probability tail inequalities in the main result in the peculiar case that λ is very large (upon which d_{1,λ} ≈ 0). We remark that d_{2,λ} arises naturally in the standard fixed design analysis of ridge regression (see Proposition 5), and that d_{1,λ} was also used by Zhang (2005) in his random design analysis of (kernel) ridge regression. It is easy to see that d_{2,λ} ≤ d_{1,λ}, and that in covariate spaces of finite dimension d < ∞, we have d_{p,λ} ≤ d with equality iff λ = 0.

The second condition requires that the squared length of (Σ + λI)^{-1/2} x is never more than a constant factor greater than its expectation (hence the name bounded statistical leverage). The linear mapping x ↦ (Σ + λI)^{-1/2} x is sometimes called whitening when λ = 0. The reason for considering λ > 0, in which case we call the mapping λ-whitening, is that the expectation E[‖(Σ + λI)^{-1/2} x‖²] may only be finite for sufficiently large λ (as in Condition 1), as

    E[‖(Σ + λI)^{-1/2} x‖²] = tr((Σ + λI)^{-1/2} Σ (Σ + λI)^{-1/2}) = Σ_j λ_j/(λ_j + λ) = d_{1,λ}.

Condition 2 (Bounded statistical leverage at λ) There exists finite ρ_λ such that, almost surely,

    ‖(Σ + λI)^{-1/2} x‖ / √(E[‖(Σ + λI)^{-1/2} x‖²]) = ‖(Σ + λI)^{-1/2} x‖ / √(d_{1,λ}) ≤ ρ_λ.

The hard almost sure bound in Condition 2 may be relaxed to moment conditions simply by using different probability tail inequalities in the analysis. We do not consider this relaxation for sake of simplicity. We also remark that, in finite dimensional settings, it is easy to replace Condition 2 with a subgaussian condition (specifically, a requirement that every projection of (Σ + λI)^{-1/2} x be subgaussian), which can lead to a sharper deviation bound in certain cases.

Remark 1 (Finite dimensional setting and λ = 0) If λ = 0 and the dimension of the covariate space is d, then Condition 2 reduces to the requirement that there exists a finite ρ_0 such that, almost surely,

    ‖Σ^{-1/2} x‖ / √(E[‖Σ^{-1/2} x‖²]) = ‖Σ^{-1/2} x‖ / √d ≤ ρ_0.

Remark 2 (Bounded ‖x‖) If ‖x‖ ≤ r almost surely, then

    ‖(Σ + λI)^{-1/2} x‖ / √(d_{1,λ}) ≤ r / √((inf_j{λ_j} + λ) d_{1,λ}),

in which case Condition 2 is satisfied with ρ_λ ≤ r / √((inf_j{λ_j} + λ) d_{1,λ}).
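The effective dimensions of Condition 1 are direct functions of the spectrum, so the relations d_{2,λ} ≤ d_{1,λ} ≤ d (with equality iff λ = 0) can be checked numerically; a minimal sketch, with an illustrative spectrum of our choosing:

```python
import numpy as np

def effective_dims(eigvals, lam):
    """d_{p,lam} = sum_j (lam_j / (lam_j + lam))^p for p = 1, 2 (Condition 1)."""
    r = eigvals / (eigvals + lam)
    return r.sum(), (r ** 2).sum()

eigvals = np.array([1.0, 0.5, 0.25, 0.125])  # spectrum of Sigma (d = 4)
d = len(eigvals)

d1, d2 = effective_dims(eigvals, lam=0.1)
assert d2 <= d1 <= d            # d_{2,lam} <= d_{1,lam} <= d

d1_0, d2_0 = effective_dims(eigvals, lam=0.0)
assert d1_0 == d2_0 == d        # equality iff lam = 0
```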

2.4.2. RESPONSE MODEL

The response model considered in this work is a relaxation of the typical Gaussian model; the model specifically allows for approximation error and general subgaussian noise. Define the random variables

    noise(x) := y − E[y|x]   and   approx(x) := E[y|x] − ⟨β, x⟩,              (7)

where noise(x) corresponds to the response noise, and approx(x) corresponds to the approximation error of β. This gives the following modeling equation:

    y = ⟨β, x⟩ + approx(x) + noise(x).

Conditioned on x, noise(x) is random, while approx(x) is deterministic. The noise is assumed to satisfy the following subgaussian moment condition.

Condition 3 (Subgaussian noise) There exists finite σ ≥ 0 such that, almost surely,

    E[exp(η noise(x)) | x] ≤ exp(η²σ²/2)   ∀η ∈ ℝ.

Condition 3 is satisfied, for instance, if noise(x) is normally distributed with mean zero and variance σ².

For the next condition, define β_λ to be the minimizer of the λ-regularized mean squared error, i.e.,

    β_λ := arg min_w { E[(⟨w, x⟩ − y)²] + λ‖w‖² } = (Σ + λI)^{-1} E[xy],      (8)

and also define

    approx_λ(x) := E[y|x] − ⟨β_λ, x⟩.                                        (9)

The final condition requires a bound on the size of approx_λ(x).

Condition 4 (Bounded approximation error at λ) There exists finite b_λ ≥ 0 such that, almost surely,

    ‖(Σ + λI)^{-1/2} x‖ |approx_λ(x)| / √(E[‖(Σ + λI)^{-1/2} x‖²]) = ‖(Σ + λI)^{-1/2} x‖ |approx_λ(x)| / √(d_{1,λ}) ≤ b_λ.

The hard almost sure bound in Condition 4 can easily be relaxed to moment conditions, but we do not consider it here for sake of simplicity. We also remark that b_λ only appears in lower-order terms in the main bounds.

Remark 3 (Finite dimensional setting and λ = 0) If λ = 0 and the dimension of the covariate space is d, then Condition 4 reduces to the requirement that there exists a finite b_0 ≥ 0 such that, almost surely,

    ‖Σ^{-1/2} x‖ |approx(x)| / √(E[‖Σ^{-1/2} x‖²]) = ‖Σ^{-1/2} x‖ |approx(x)| / √d ≤ b_0.

Remark 4 (Bounded approx(x)) If |approx(x)| ≤ a almost surely and Condition 2 (with parameter ρ_λ) holds, then

    ‖(Σ + λI)^{-1/2} x‖ |approx_λ(x)| / √(d_{1,λ})
      ≤ ρ_λ |approx_λ(x)|
      ≤ ρ_λ (a + |⟨β − β_λ, x⟩|)
      ≤ ρ_λ (a + ‖β − β_λ‖_{Σ+λI} ‖x‖_{(Σ+λI)^{-1}})
      ≤ ρ_λ (a + ρ_λ √(d_{1,λ}) ‖β − β_λ‖_{Σ+λI}),

where the first and last inequalities use Condition 2, the second inequality uses the definition of approx_λ(x) in (9) and the triangle inequality, and the third inequality follows from Cauchy–Schwarz. The quantity ‖β − β_λ‖_{Σ+λI} can be bounded by √λ ‖β‖ using the arguments in the proof of Proposition 23. In this case, Condition 4 is satisfied with b_λ ≤ ρ_λ (a + ρ_λ √(λ d_{1,λ}) ‖β‖). If in addition ‖x‖ ≤ r almost surely, then Condition 2 and Condition 4 are satisfied with ρ_λ ≤ r / √(λ d_{1,λ}) and b_λ ≤ ρ_λ (a + r ‖β‖) as per Remark 2.

2.5. Related work

Many classical analyses of the ridge and ordinary least squares estimators in the random design setting (e.g., in the context of non-parametric estimators) do not actually show non-asymptotic O(d/n) convergence of the mean squared error to that of the best linear predictor, where d is the dimension of the covariate space. Rather, the error relative to the Bayes error is bounded by some multiple c > 1 of the error of the optimal linear predictor relative to the Bayes error, plus an O(d/n) term (Györfi et al., 2004):

    E[(⟨β̂, x⟩ − E[y|x])²] ≤ c · E[(⟨β, x⟩ − E[y|x])²] + O(d/n).

Such bounds are appropriate in non-parametric settings where the error of the optimal linear predictor also approaches the Bayes error at an O(d/n) rate. Beyond these classical results, analyses of ordinary least squares often come with non-standard restrictions on applicability or additional dependencies on the spectrum of the second moment matrix (see the recent work of Audibert and Catoni (2010b) for a comprehensive survey of these results). For instance, a result of Catoni (2004, Proposition 5.9.1) gives a bound on the excess mean squared error of the form

    ‖β̂ − β‖²_Σ ≤ O( (d + log(det(Σ̂)/det(Σ))) / n ),

but the bound is only shown to hold when every linear predictor with low empirical mean squared error satisfies certain boundedness conditions.
This work provides ridge regression bounds explicitly in terms of the vector β (as a sequence) and in terms of the eigenspectrum of the second moment matrix Σ. Previous analyses of ridge

regression make strong boundedness assumptions, or fail to give a bound in the case λ = 0 (e.g., Zhang, 2005; Smale and Zhou, 2007; Caponnetto and De Vito, 2007; Steinwart et al., 2009). For instance, Zhang assumes ‖x‖ ≤ b_x and |⟨β, x⟩ − y| ≤ b_approx almost surely, and gives the bound

    ‖β̂_λ − β‖²_Σ ≤ λ ‖β̂_λ − β‖² + c (d_{1,λ}/n) (b_approx + b_x ‖β̂_λ − β‖)²,

where d_{1,λ} is a notion of effective dimension at scale λ (the same as that in Condition 1). The quantity ‖β̂_λ − β‖ is then bounded by assuming ‖β‖ < ∞. Smale and Zhou assume the more stringent conditions that |y| ≤ b_y and ‖x‖ ≤ b_x almost surely, and prove the bound ‖β̂_λ − β_λ‖²_Σ ≤ c b_x² b_y² / (λ² n) (note that the bound becomes trivial when λ = 0); this is then used to bound ‖β̂_λ − β‖²_Σ under explicit boundedness assumptions on ‖β‖. Caponnetto and De Vito crucially require boundedness of ‖β‖ and λ > 0 in their analysis (in particular, in their Theorem 4), and also have a worse tail behavior, with a bound of the form d_{2,λ} t²/n holding with probability 1 − e^{−t}. Finally, Steinwart et al. explicitly require |y| ≤ b_y, and their bound depends on b_y in the dominant term; moreover, their bounds require explicit decay conditions on the eigenspectrum (their Equation 6) and are also trivial when λ = 0. Our result for ridge regression is given explicitly in terms of ‖β_λ − β‖²_Σ (and therefore explicitly in terms of β as a sequence, the eigenspectrum of Σ, and λ); this quantity vanishes when λ = 0 and can be bounded even when ‖β‖ is unbounded. We note that ‖β_λ − β‖²_Σ is precisely the bias term from the standard fixed design analysis of ridge regression, and therefore is natural to expect in a random design analysis.

Recently, Audibert and Catoni (2010a,b) derived sharp risk bounds for the ordinary least squares and ridge estimators (in addition to specially developed PAC-Bayesian estimators) in a random design setting under very mild assumptions. Their bounds are proved using PAC-Bayesian techniques, which allows them to achieve exponential tail inequalities under remarkably minimal moment conditions. Their non-asymptotic bound for ordinary least squares holds with probability at least 1 − e^{−t}, but only for t ≤ ln n.
Our result requires stronger assumptions in some respects, but it avoids this restriction on the probability tail parameter t, and the analysis is arguably more transparent and yields more reasonable quantitative bounds. The analysis of Audibert and Catoni (2010a) for the ridge estimator is established only in an asymptotic sense and bounds the excess regularized mean squared error rather than the excess mean squared error itself. Therefore, the results are not directly comparable to those provided here. It should also be mentioned that a number of other linear estimators have been considered in the literature with non-asymptotic prediction error bounds (e.g., Koltchinskii, 2006; Audibert and Catoni, 2010a,b), but the focus of our work is on the ordinary least squares and ridge estimators.

3. Random Design Regression

This section presents the main results of the paper on the excess mean squared error of the ridge estimator under random design (and its specialization to the ordinary least squares estimator). First, we review the standard fixed design analysis.

3.1. Review of fixed design analysis

It is informative to first review the fixed design analysis of the ridge estimator. Recall that in this setting, the design points x_1, x_2, ..., x_n are fixed (deterministic) vectors, and the responses y_1, y_2, ..., y_n are independent random variables. Therefore, we define Σ := Σ̂ = (1/n) Σ_{i=1}^n x_i ⊗ x_i (which is non-random), and assume it has eigenvectors {v_j} and corresponding eigenvalues λ_j := ⟨v_j, Σv_j⟩. As in the random design setting, the linear function β := Σ_j β_j v_j, where

β_j := λ_j^{-1} (1/n) Σ_{i=1}^n ⟨v_j, x_i⟩ E[y_i], minimizes the expected mean squared error, i.e.,

    β := arg min_w (1/n) Σ_{i=1}^n E[(⟨w, x_i⟩ − y_i)²].

Similar to the random design setup, define noise(x_i) := y_i − E[y_i] and approx(x_i) := E[y_i] − ⟨β, x_i⟩ for i = 1, 2, ..., n, so the following modeling equations hold:

    y_i = ⟨β, x_i⟩ + approx(x_i) + noise(x_i)   for i = 1, 2, ..., n.

Because Σ̂ = Σ, the ridge estimator β̂_λ in the fixed design setting is an unbiased estimator of the minimizer of the λ-regularized mean squared error, i.e.,

    E[β̂_λ] = (Σ + λI)^{-1} ( (1/n) Σ_{i=1}^n x_i E[y_i] ) = arg min_w { (1/n) Σ_{i=1}^n E[(⟨w, x_i⟩ − y_i)²] + λ‖w‖² }.

This unbiasedness implies that the expected mean squared error of β̂_λ has the bias-variance decomposition

    E[‖β̂_λ − β‖²_Σ] = ‖E[β̂_λ] − β‖²_Σ + E[‖β̂_λ − E[β̂_λ]‖²_Σ].              (10)

The following bound on the expected excess mean squared error easily follows from this decomposition and the definition of β_λ (see, e.g., Proposition 23).

Proposition 5 (Ridge regression: fixed design) Fix λ ≥ 0, and assume Σ + λI is invertible. If there exists σ ≥ 0 such that var(y_i) ≤ σ² for all i = 1, 2, ..., n, then

    E[‖β̂_λ − β‖²_Σ] ≤ Σ_j λ_j (λ/(λ_j + λ))² β_j² + (σ²/n) Σ_j (λ_j/(λ_j + λ))²,

with equality iff var(y_i) = σ² for all i = 1, 2, ..., n.

Remark 6 (Effect of approximation error in fixed design) Observe that approx(x_i) has no effect on the expected excess mean squared error.

Remark 7 (Effective dimension) The second sum in the bound is equal to d_{2,λ} from Condition 1, which implies a notion of effective dimension at regularization level λ.

Remark 8 (Ordinary least squares in fixed design) In finite dimensional spaces of dimension d, Σ has only d non-zero eigenvalues λ_j, and therefore setting λ = 0 gives the following bound for the ordinary least squares estimator β̂_0:

    E[‖β̂_0 − β‖²_Σ] ≤ σ² d / n,

where, as before, equality holds iff var(y_i) = σ² for all i = 1, 2, ..., n.
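The bias term in Proposition 5 is deterministic, so it can be checked exactly against the closed form E[β̂_λ] = (Σ + λI)^{-1}Σβ (assuming no approximation error, per Remark 6). A minimal sketch, working in the eigenbasis of Σ with an illustrative spectrum:

```python
import numpy as np

rng = np.random.default_rng(1)
eigvals = np.array([2.0, 1.0, 0.3, 0.05])        # lam_j
beta = rng.normal(size=4)                         # coordinates beta_j in the eigenbasis
lam = 0.2

Sigma = np.diag(eigvals)                          # Sigma is diagonal in its eigenbasis
beta_lam = np.linalg.solve(Sigma + lam * np.eye(4), Sigma @ beta)  # E[beta_hat_lam]

# Bias term of Proposition 5: sum_j lam_j * (lam / (lam_j + lam))^2 * beta_j^2
bias_formula = np.sum(eigvals * (lam / (eigvals + lam)) ** 2 * beta ** 2)
bias_direct = (beta_lam - beta) @ Sigma @ (beta_lam - beta)  # ||E[beta_hat_lam] - beta||_Sigma^2
assert np.isclose(bias_direct, bias_formula)
```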

3.2. Ordinary least squares in finite dimensions

Our analysis of the ordinary least squares estimator (under random design) is based on a simple decomposition of the excess mean squared error, similar to the one from the fixed design analysis. To state the decomposition, first let β̄_0 denote the conditional expectation of the least squares estimator β̂_0 conditioned on x_1, x_2, ..., x_n, i.e.,

    β̄_0 := E[β̂_0 | x_1, x_2, ..., x_n] = Σ̂^{-1} Ê[x E[y|x]].                (11)

Also, define the bias and variance as:

    ε_bs := ‖β̄_0 − β‖²_Σ,   ε_vr := ‖β̂_0 − β̄_0‖²_Σ.

Proposition 9 (Random design decomposition) We have:

    ‖β̂_0 − β‖²_Σ ≤ ε_bs + 2√(ε_bs ε_vr) + ε_vr ≤ 2(ε_bs + ε_vr).

Proof The claim follows from the triangle inequality and the fact (a + b)² ≤ 2(a² + b²).

Remark 10 Note that, in general, E[β̂_0] ≠ β (unlike in the fixed design setting, where E[β̂_0] = β). Hence, our decomposition differs from that in the fixed design analysis (see (10)).

Our first main result characterizes the excess loss of the ordinary least squares estimator.

Theorem 11 (Ordinary least squares regression) Let d be the dimension of the covariate space. Pick any t > max{0, 2.6 − log d}. Assume Condition 1, Condition 2 (with parameter ρ_0), Condition 3 (with parameter σ), and Condition 4 (with parameter b_0) hold, and that n ≥ 6ρ_0² d (log d + t). With probability at least 1 − 3e^{−t}, the following holds.

1. Relative spectral norm error in Σ̂: Σ̂ is invertible, and ‖Σ^{-1/2} Σ̂ Σ^{-1/2} − I‖ ≤ δ_s (< 1), where Σ is defined in (1), Σ̂ is defined in (3), and

    δ_s := √(4ρ_0² d (log d + t) / n) + 2ρ_0² d (log d + t) / (3n)

(note that the lower bound on n ensures δ_s ≤ 0.93 < 1).

2. Effect of bias due to random design:

    ε_bs ≤ (2 E[‖Σ^{-1/2} x approx(x)‖²] / ((1 − δ_s)² n)) (1 + √(8t))² + 16 b_0² d t² / (9 n²)
         ≤ (2 ρ_0² d E[approx(x)²] / ((1 − δ_s)² n)) (1 + √(8t))² + 16 b_0² d t² / (9 n²),

and approx(x) is defined in (7).

3. Effect of noise:

    ε_vr ≤ σ² (d + 2√(dt) + 2t) / ((1 − δ_s) n).
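The decomposition in Proposition 9 is a purely deterministic inequality between Σ-norms, so it holds for any three vectors playing the roles of β, β̄_0, and β̂_0; a minimal numerical check with arbitrary stand-in vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.diag([1.0, 0.5, 0.25])                     # any PSD operator works
beta, beta_bar, beta_hat = rng.normal(size=(3, 3))    # stand-ins for beta, bar-beta_0, hat-beta_0

def sq_norm(v, M):
    return v @ M @ v   # ||v||_M^2

eps_bs = sq_norm(beta_bar - beta, Sigma)
eps_vr = sq_norm(beta_hat - beta_bar, Sigma)
lhs = sq_norm(beta_hat - beta, Sigma)

# Proposition 9:
#   ||hat-beta_0 - beta||_Sigma^2 <= (sqrt(eps_bs) + sqrt(eps_vr))^2 <= 2*(eps_bs + eps_vr)
assert lhs <= (np.sqrt(eps_bs) + np.sqrt(eps_vr)) ** 2 + 1e-12
assert (np.sqrt(eps_bs) + np.sqrt(eps_vr)) ** 2 <= 2 * (eps_bs + eps_vr) + 1e-12
```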

Remark 12 (Simplified form) Suppressing the terms that are o(1/n), the overall bound from Theorem 11 is

    ‖β̂_0 − β‖²_Σ ≤ 2 E[‖Σ^{-1/2} x approx(x)‖²] (1 + √(8t))² / n + σ² (d + 2√(dt) + 2t) / n + o(1/n)

(so b_0 appears only in the o(1/n) terms). If the linear model is correct (i.e., E[y|x] = ⟨β, x⟩ almost surely), then

    ‖β̂_0 − β‖²_Σ ≤ σ² (d + 2√(dt) + 2t) / n + o(1/n).                       (12)

One can show that the constants in the first-order term in (12) are the same as those that one would obtain for a fixed design tail bound.

Remark 13 (Tightness of the bound) Since

    ‖β̄_0 − β‖²_Σ = ‖(Σ^{-1/2} Σ̂ Σ^{-1/2})^{-1} Ê[Σ^{-1/2} x approx(x)]‖²

and ‖Σ^{-1/2} Σ̂ Σ^{-1/2} − I‖ → 0 as n → ∞ (Lemma 24), ‖β̄_0 − β‖²_Σ is within constant factors of ‖Ê[Σ^{-1/2} x approx(x)]‖² for sufficiently large n. Moreover,

    n · E[‖Ê[Σ^{-1/2} x approx(x)]‖²] = E[‖Σ^{-1/2} x approx(x)‖²],

which is the main term that appears in the bound for ε_bs. Similarly, ‖β̂_0 − β̄_0‖²_Σ is within constant factors of ‖β̂_0 − β̄_0‖²_Σ̂ for sufficiently large n, and E[‖β̂_0 − β̄_0‖²_Σ̂] ≤ σ² d / n, with equality iff var(y|x) = σ² (this comes from the fixed design risk bound in Remark 8). Therefore, in this case where var(y|x) = σ², we conclude that the bound in Theorem 11 is tight up to constant factors and lower-order terms.

3.3. Random design ridge regression

The analysis of the ridge estimator under random design is again based on a simple decomposition of the excess mean squared error. Here, let β̄_λ denote the conditional expectation of β̂_λ given x_1, x_2, ..., x_n, i.e.,

    β̄_λ := E[β̂_λ | x_1, x_2, ..., x_n] = (Σ̂ + λI)^{-1} Ê[x E[y|x]].          (13)

Define the bias from regularization, the bias from the random design, and the variance as:

    ε_rg := ‖β_λ − β‖²_Σ,   ε_bs := ‖β̄_λ − β_λ‖²_Σ,   ε_vr := ‖β̂_λ − β̄_λ‖²_Σ,

where β_λ is the minimizer of the λ-regularized mean squared error (see (8)).

Proposition 14 (General random design decomposition)

    ‖β̂_λ − β‖²_Σ ≤ ε_rg + ε_bs + ε_vr + 2(√(ε_rg ε_bs) + √(ε_rg ε_vr) + √(ε_bs ε_vr)) ≤ 3(ε_rg + ε_bs + ε_vr).

Proof The claim follows from the triangle inequality and the fact (a + b)² ≤ 2(a² + b²).

Remark 15 Again, note that E[β̂_λ] ≠ β_λ in general, so the bias-variance decomposition in (10) from the fixed design analysis is not directly applicable in the random design setting.

The following theorem is the main result of the paper.

Theorem 16 (Ridge regression) Fix some λ ≥ 0, and pick any t > max{0, 2.6 − log d̃_{1,λ}}. Assume Condition 1, Condition 2 (with parameter ρ_λ), Condition 3 (with parameter σ), and Condition 4 (with parameter b_λ) hold; and that n ≥ 6ρ_λ² d̃_{1,λ} (log d̃_{1,λ} + t), where d_{p,λ} for p ∈ {1, 2} is defined in (5), and d̃_{1,λ} is defined in (6). With probability at least 1 − 4e^{−t}, the following holds.

1. Relative spectral norm error in Σ̂ + λI: Σ̂ + λI is invertible, and

    ‖(Σ + λI)^{-1/2} (Σ̂ + λI) (Σ + λI)^{-1/2} − I‖ ≤ δ_s (< 1),

where Σ is defined in (1), Σ̂ is defined in (3), and

    δ_s := √(4ρ_λ² d̃_{1,λ} (log d̃_{1,λ} + t) / n) + 2ρ_λ² d̃_{1,λ} (log d̃_{1,λ} + t) / (3n)

(note that the lower bound on n ensures δ_s ≤ 0.93 < 1).

2. Frobenius norm error in Σ̂:

    ‖(Σ + λI)^{-1/2} (Σ̂ − Σ) (Σ + λI)^{-1/2}‖_F ≤ √(d_{1,λ}) δ_f,

where

    δ_f := √((ρ_λ² d_{1,λ} − d_{2,λ}/d_{1,λ}) / n) (1 + √(8t)) + 4 √(ρ_λ⁴ d_{1,λ} + d_{2,λ}/d_{1,λ}) t / (3n).

3. Effect of regularization:

    ε_rg ≤ Σ_j λ_j (λ/(λ_j + λ))² β_j².

If λ = 0, then ε_rg = 0.

4. Effect of bias due to random design:

    ε_bs ≤ (2 E[‖(Σ + λI)^{-1/2} (x approx_λ(x) − λβ_λ)‖²] / ((1 − δ_s)² n)) (1 + √(8t))² + 16 (b_λ √(d_{1,λ}) + √(ε_rg))² t² / (9 n²)
         ≤ (4 (ρ_λ² d_{1,λ} E[approx_λ(x)²] + ε_rg) / ((1 − δ_s)² n)) (1 + √(8t))² + 16 (b_λ √(d_{1,λ}) + √(ε_rg))² t² / (9 n²),

and approx_λ(x) is defined in (9). If λ = 0, then approx_λ(x) = approx(x) as defined in (7).

5. Effect of noise:

    ε_vr ≤ σ² (d_{2,λ} + √(d_{1,λ} d_{2,λ}) δ_f) / ((1 − δ_s)² n) + 2σ² √((d_{2,λ} + √(d_{1,λ} d_{2,λ}) δ_f) t) / ((1 − δ_s)^{3/2} n) + 2σ² t / ((1 − δ_s) n).

We now discuss various aspects of Theorem 16.

Remark 17 (Simplified form) Ignoring the terms that are o(1/n) and treating t as a constant, the overall bound from Theorem 16 is

    ‖β̂_λ − β‖²_Σ ≤ ‖β_λ − β‖²_Σ + O( (E[‖(Σ + λI)^{-1/2} (x approx_λ(x) − λβ_λ)‖²] + σ² d_{2,λ}) / n )
                 ≤ ‖β_λ − β‖²_Σ + O( (ρ_λ² d_{1,λ} E[approx_λ(x)²] + ‖β_λ − β‖²_Σ + σ² d_{2,λ}) / n )
                 ≤ ‖β_λ − β‖²_Σ + O( (ρ_λ² d_{1,λ} E[approx(x)²] + (ρ_λ² d_{1,λ} + 1) ‖β_λ − β‖²_Σ + σ² d_{2,λ}) / n ),

where the last inequality follows from the fact E[approx_λ(x)²] ≤ 2(E[approx(x)²] + ‖β − β_λ‖²_Σ).

Remark 18 (Effect of errors in Σ̂) The accuracy of Σ̂ has a relatively mild effect on the bound: it appears essentially through multiplicative factors 1/(1 − δ_s) = 1 + O(δ_s) and 1 + δ_f, where both δ_s and δ_f are decreasing with n (as n^{-1/2}), and therefore only contribute to lower-order terms overall.

Remark 19 (Comparison to fixed design) As already discussed, the ridge estimator behaves similarly under fixed and random designs, with the main differences being the lack of errors in Σ̂ under fixed design, and the influence of approximation error under random design. These are revealed through the quantities ρ_λ and d_{1,λ} (and b_λ in lower-order terms), which are needed to apply the probability tail inequalities. Therefore, the scaling of ρ_λ² d_{1,λ} with n crucially controls the effect of random design compared to fixed design.

Acknowledgments

We thank Dean Foster, David McAllester, and Robert Stine for many insightful discussions.

References

J.-Y. Audibert and O. Catoni. Robust linear least squares regression, 2010a. arXiv preprint.

J.-Y. Audibert and O. Catoni. Robust linear regression through PAC-Bayesian truncation, 2010b. arXiv preprint.

A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

O. Catoni. Statistical Learning Theory and Stochastic Optimization, Lectures on Probability and Statistics, École d'Été de Probabilités de Saint-Flour XXXI—2001, volume 1851 of Lecture Notes in Mathematics. Springer, 2004.

L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, 2004.

A. E. Hoerl. Application of ridge analysis to regression problems. Chemical Engineering Progress, 58:54–59, 1962.

R. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.

D. Hsu, S. M. Kakade, and T. Zhang. A tail inequality for quadratic forms of subgaussian random vectors, 2011. arXiv preprint.

D. Hsu, S. M. Kakade, and T. Zhang. Tail inequalities for sums of random matrices that depend on the intrinsic dimension. Electronic Communications in Probability, 17(14):1–13, 2012.

V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.

B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.

S. Smale and D.-X. Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26:153–172, 2007.

I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory, pages 79–93, 2009.

G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, 1990.

C. J. Stone. Optimal global rates of convergence for nonparametric regression. Annals of Statistics, 10:1040–1053, 1982.

T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17:2077–2098, 2005.

Appendix A. Application to smoothing splines

The applications of ridge regression considered by Zhang (2005) can also be analyzed using Theorem 16. We specifically consider the problem of approximating a periodic function with smoothing splines, which are functions f : ℝ → ℝ whose s-th derivatives f^{(s)}, for some s > 1/2, satisfy

    ∫ (f^{(s)}(t))² dt < ∞.

The one-dimensional covariate t ∈ ℝ can be mapped to the infinite dimensional representation x := φ(t), where

    x_{2k} := sin(kt) / (k + 1)^s   and   x_{2k+1} := cos(kt) / (k + 1)^s,   k ∈ {0, 1, 2, ...}.

Assume that the regression function is E[y|x] = ⟨β, x⟩, so approx(x) = 0 almost surely. Observe that

    ‖x‖² ≤ Σ_k 1/(k + 1)^{2s} ≤ 2s/(2s − 1),

so Condition 2 is satisfied with ρ_λ := (2s / ((2s − 1) λ d_{1,λ}))^{1/2} as per Remark 2. Therefore, the simplified bound from Remark 17 becomes, in this case,

    ‖β̂_λ − β‖²_Σ ≤ ‖β_λ − β‖²_Σ + C ( (2s/((2s − 1) λ n) + 1/n) ‖β_λ − β‖²_Σ + σ² d_{2,λ}/n )
                 ≤ λ ‖β‖²/2 + C ( σ² d_{2,λ}/n + (2s/((2s − 1) λ n) + 1/n) λ ‖β‖²/2 )

for some constant C > 0, where we have used the inequality ‖β_λ − β‖²_Σ ≤ λ ‖β‖²/2. Zhang (2005, Section 5.3) shows that

    d_{1,λ} ≤ 2k + 1   if   λ ≥ ((2s − 1)/(2s)) k^{−2s}.

Since d_{2,λ} ≤ d_{1,λ}, it follows that setting λ := k^{−2s} where k := (((2s − 1)/(2s)) n)^{1/(2s+1)} gives the bound

    ‖β̂_λ − β‖²_Σ ≤ ( ‖β‖²/2 + 2Cσ² ) (2s/(2s − 1)) n^{−2s/(2s+1)} + lower-order terms,

which has the optimal data-dependent rate of n^{−2s/(2s+1)} (Stone, 1982).

Appendix B. Proofs of Theorem 11 and Theorem 16

The proof of Theorem 16 uses the decomposition of ‖β̂_λ − β‖²_Σ in Proposition 14, and then bounds each term using the lemmas proved in this section.

The proof of Theorem 11 omits one term from the decomposition in Proposition 14 due to the fact that β_λ = β when λ = 0; and it uses a slightly simpler argument to handle the effect of noise (Lemma 28 rather than Lemma 29), which reduces the number of lower-order terms. Other than these differences, the proof is the same as that for Theorem 16 in the special case of λ = 0.

Define

    Σ_λ := Σ + λI,                                                          (14)
    Σ̂_λ := Σ̂ + λI,  and                                                     (15)
    Δ := Σ_λ^{-1/2} (Σ̂ − Σ) Σ_λ^{-1/2}                                       (16)
       = Σ_λ^{-1/2} (Σ̂_λ − Σ_λ) Σ_λ^{-1/2}.

Recall the basic decomposition from Proposition 14:

    ‖β̂_λ − β‖²_Σ ≤ ( ‖β_λ − β‖_Σ + ‖β̄_λ − β_λ‖_Σ + ‖β̂_λ − β̄_λ‖_Σ )².

Section B.1 first establishes basic properties of β and β_λ, which are then used to bound ‖β_λ − β‖²_Σ; this part is exactly the same as the standard fixed design analysis of ridge regression. Section B.2 employs probability tail inequalities for the spectral and Frobenius norms of random matrices to bound the matrix errors in estimating Σ_λ with Σ̂_λ. Finally, Section B.3 and Section B.4 bound the contributions of approximation error (in ‖β̄_λ − β_λ‖²_Σ) and noise (in ‖β̂_λ − β̄_λ‖²_Σ), respectively, using probability tail inequalities for random vectors as well as the matrix error bounds for Σ̂_λ.

B.1. Basic properties of β and β_λ, and the effect of regularization

Proposition 20 (Normal equations) E[⟨w, x⟩ y] = E[⟨w, x⟩ ⟨β, x⟩] for any w.

Proof It suffices to prove the claim for w = v_j. Since E[⟨v_j, x⟩ ⟨v_{j'}, x⟩] = 0 for j ≠ j', it follows that

    E[⟨v_j, x⟩ ⟨β, x⟩] = Σ_{j'} β_{j'} E[⟨v_j, x⟩ ⟨v_{j'}, x⟩] = β_j E[⟨v_j, x⟩²] = E[⟨v_j, x⟩ y],

where the last equality follows from the definition of β in (2).

Proposition 21 (Excess mean squared error) E[(⟨w, x⟩ − y)²] − E[(⟨β, x⟩ − y)²] = E[⟨w − β, x⟩²] for any w.

Proof Directly expanding the squares in the expectations reveals that

    E[(⟨w, x⟩ − y)²] − E[(⟨β, x⟩ − y)²]
      = E[⟨w, x⟩²] − 2E[⟨w, x⟩ y] + 2E[⟨β, x⟩ y] − E[⟨β, x⟩²]
      = E[⟨w, x⟩²] − 2E[⟨w, x⟩ ⟨β, x⟩] + 2E[⟨β, x⟩ ⟨β, x⟩] − E[⟨β, x⟩²]
      = E[⟨w, x⟩² − 2⟨w, x⟩ ⟨β, x⟩ + ⟨β, x⟩²]
      = E[⟨w − β, x⟩²],

where the second equality follows from Proposition 20.

Proposition 22 (Shrinkage) For any j, ⟨v_j, β_λ⟩ = (λ_j/(λ_j + λ)) β_j.

Proof Since (Σ + λI)^{-1} = Σ_j (λ_j + λ)^{-1} v_j ⊗ v_j,

    ⟨v_j, β_λ⟩ = ⟨v_j, (Σ + λI)^{-1} E[xy]⟩ = E[⟨v_j, x⟩ y] / (λ_j + λ) = (λ_j/(λ_j + λ)) · E[⟨v_j, x⟩ y] / E[⟨v_j, x⟩²] = (λ_j/(λ_j + λ)) β_j.

Proposition 23 (Effect of regularization)

    ‖β_λ − β‖²_Σ = Σ_j λ_j (λ/(λ_j + λ))² β_j².

Proof By Proposition 22,

    ⟨v_j, β − β_λ⟩ = β_j − (λ_j/(λ_j + λ)) β_j = (λ/(λ_j + λ)) β_j.

Therefore,

    ‖β_λ − β‖²_Σ = Σ_j λ_j ((λ/(λ_j + λ)) β_j)² = Σ_j λ_j (λ/(λ_j + λ))² β_j².

B.2. Effect of errors in Σ̂

Lemma 24 (Spectral norm error in Σ̂) Assume Condition 1 and Condition 2 (with parameter ρ_λ) hold. Pick any t > max{0, 2.6 − log d̃_{1,λ}}. With probability at least 1 − e^{−t},

    ‖Δ‖ ≤ √(4ρ_λ² d̃_{1,λ} (log d̃_{1,λ} + t) / n) + 2ρ_λ² d̃_{1,λ} (log d̃_{1,λ} + t) / (3n),

where Δ is defined in (16).

Proof The claim is a consequence of the tail inequality from Lemma 32. First, define

    x̃ := Σ_λ^{-1/2} x   and   Σ̃ := Σ_λ^{-1/2} Σ Σ_λ^{-1/2}

(where Σ_λ is defined in (14)), and let

    Z := x̃ ⊗ x̃ − Σ̃ = Σ_λ^{-1/2} (x ⊗ x − Σ) Σ_λ^{-1/2},
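Since (Σ + λI)^{-1} shares the eigenvectors of Σ, the shrinkage identity of Proposition 22 holds coordinate-wise in the eigenbasis and is basis-independent; a minimal numerical check with an arbitrary positive definite Σ of our choosing (taking E[xy] = Σβ, i.e., no approximation error):

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 0.3

# A random positive definite second-moment operator Sigma (d = 4).
A = rng.normal(size=(4, 4))
Sigma = A @ A.T
eigvals, V = np.linalg.eigh(Sigma)      # columns of V are the eigenvectors v_j

beta = rng.normal(size=4)
# beta_lam = (Sigma + lam*I)^{-1} E[xy] with E[xy] = Sigma beta:
beta_lam = np.linalg.solve(Sigma + lam * np.eye(4), Sigma @ beta)

# Proposition 22: <v_j, beta_lam> = lam_j / (lam_j + lam) * <v_j, beta>
assert np.allclose(V.T @ beta_lam, eigvals / (eigvals + lam) * (V.T @ beta))
```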

so Δ = Ê[Z]. Observe that E[Z] = 0 and

    ‖Z‖ = max{λ_max[Z], λ_max[−Z]} ≤ max{‖x̃‖², 1} ≤ ρ_λ² d̃_{1,λ},

where the second inequality follows from Condition 2. Moreover,

    E[Z²] = E[(x̃ ⊗ x̃)²] − Σ̃² = E[‖x̃‖² (x̃ ⊗ x̃)] − Σ̃²,

so

    λ_max[E[Z²]] ≤ λ_max[E[(x̃ ⊗ x̃)²]] ≤ ρ_λ² d_{1,λ} λ_max[Σ̃] ≤ ρ_λ² d̃_{1,λ},
    tr(E[Z²]) ≤ tr(E[‖x̃‖² (x̃ ⊗ x̃)]) ≤ ρ_λ² d_{1,λ} tr(Σ̃) = ρ_λ² d²_{1,λ}.

The claim now follows from Lemma 32 (recall that d̃_{1,λ} = max{1, d_{1,λ}}).

Lemma 25 (Relative spectral norm error in Σ̂_λ^{-1}) If ‖Δ‖ < 1, where Δ is defined in (16), then

    ‖Σ_λ^{1/2} Σ̂_λ^{-1} Σ_λ^{1/2}‖ ≤ 1 / (1 − ‖Δ‖),

where Σ_λ is defined in (14) and Σ̂_λ is defined in (15).

Proof Observe that

    Σ_λ^{-1/2} Σ̂_λ Σ_λ^{-1/2} = Σ_λ^{-1/2} (Σ_λ + Σ̂_λ − Σ_λ) Σ_λ^{-1/2} = I + Σ_λ^{-1/2} (Σ̂_λ − Σ_λ) Σ_λ^{-1/2} = I + Δ,

and that λ_min[I + Δ] > 0 by the assumption ‖Δ‖ < 1 and Weyl's theorem (Horn and Johnson, 1985, Theorem 4.3.1). Therefore,

    ‖Σ_λ^{1/2} Σ̂_λ^{-1} Σ_λ^{1/2}‖ = λ_max[(Σ_λ^{-1/2} Σ̂_λ Σ_λ^{-1/2})^{-1}] = λ_max[(I + Δ)^{-1}] = 1/λ_min[I + Δ] ≤ 1/(1 − ‖Δ‖).

Lemma 26 (Frobenius norm error in Σ̂) Assume Condition 1 and Condition 2 (with parameter ρ_λ) hold. Pick any t > 0. With probability at least 1 − e^{−t},

    ‖Δ‖_F ≤ √((E[‖Σ_λ^{-1/2} x‖⁴] − d_{2,λ}) / n) (1 + √(8t)) + 4 √(ρ_λ⁴ d²_{1,λ} + d_{2,λ}) t / (3n)
          ≤ √((ρ_λ² d²_{1,λ} − d_{2,λ}) / n) (1 + √(8t)) + 4 √(ρ_λ⁴ d²_{1,λ} + d_{2,λ}) t / (3n),

where Δ is defined in (16).
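Lemma 25 is deterministic given ‖Δ‖ < 1, so it can be checked on a single simulated sample; a minimal sketch (dimensions, sample size, and the diagonal Σ are illustrative choices of ours, and for this seed the draw satisfies the hypothesis ‖Δ‖ < 1):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, lam = 3, 500, 0.2

Sigma = np.diag([1.0, 0.4, 0.1])                    # population second-moment operator
X = rng.normal(size=(n, d)) * np.sqrt(np.diag(Sigma))
Sigma_hat = X.T @ X / n

w, U = np.linalg.eigh(Sigma + lam * np.eye(d))
Sl_sqrt  = U @ np.diag(w ** 0.5)  @ U.T             # (Sigma + lam I)^{1/2}
Sl_isqrt = U @ np.diag(w ** -0.5) @ U.T             # (Sigma + lam I)^{-1/2}

Delta = Sl_isqrt @ (Sigma_hat - Sigma) @ Sl_isqrt   # Delta from (16)
spec = lambda M: np.linalg.norm(M, 2)               # spectral norm

assert spec(Delta) < 1                              # hypothesis of Lemma 25 (holds for this draw)
lhs = spec(Sl_sqrt @ np.linalg.inv(Sigma_hat + lam * np.eye(d)) @ Sl_sqrt)
assert lhs <= 1 / (1 - spec(Delta)) + 1e-9          # conclusion of Lemma 25
```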

Proof The claim is a consequence of the tail inequality in Lemma 31. As in the proof of Lemma 24, define x̃ := Σ_λ^{-1/2} x and Σ̃ := Σ_λ^{-1/2} Σ Σ_λ^{-1/2}, and let Z := x̃ ⊗ x̃ − Σ̃, so Δ = Ê[Z]. Now endow the space of self-adjoint linear operators with the inner product given by ⟨A, B⟩_F := tr(AB), and note that this inner product induces the Frobenius norm ‖M‖_F = √⟨M, M⟩_F. Observe that E[Z] = 0 and

    ‖Z‖²_F = ⟨x̃ ⊗ x̃ − Σ̃, x̃ ⊗ x̃ − Σ̃⟩_F
           = ⟨x̃ ⊗ x̃, x̃ ⊗ x̃⟩_F − 2⟨x̃ ⊗ x̃, Σ̃⟩_F + ⟨Σ̃, Σ̃⟩_F
           = ‖x̃‖⁴ − 2‖x̃‖²_{Σ̃} + tr(Σ̃²)
           = ‖x̃‖⁴ − 2‖x̃‖²_{Σ̃} + d_{2,λ}
           ≤ ρ_λ⁴ d²_{1,λ} + d_{2,λ},

where the inequality follows from Condition 2. Moreover,

    E[‖Z‖²_F] = E[⟨x̃ ⊗ x̃, x̃ ⊗ x̃⟩_F] − ⟨Σ̃, Σ̃⟩_F = E[‖x̃‖⁴] − d_{2,λ} ≤ ρ_λ² d_{1,λ} E[‖x̃‖²] − d_{2,λ} = ρ_λ² d²_{1,λ} − d_{2,λ},

where the inequality again uses Condition 2. The claim now follows from Lemma 31.

B.3. Effect of approximation error

Lemma 27 (Effect of approximation error) Assume Condition 1, Condition 2 (with parameter ρ_λ), and Condition 4 (with parameter b_λ) hold. Pick any t > 0. If ‖Δ‖ < 1, where Δ is defined in (16), then

    ‖β̄_λ − β_λ‖_Σ ≤ ‖Ê[x approx_λ(x) − λβ_λ]‖_{Σ_λ^{-1}} / (1 − ‖Δ‖),

where β̄_λ is defined in (13), β_λ is defined in (8), approx_λ(x) is defined in (9), and Σ_λ is defined in (14). Moreover, with probability at least 1 − e^{−t},

    ‖Ê[x approx_λ(x) − λβ_λ]‖_{Σ_λ^{-1}}
      ≤ √(E[‖Σ_λ^{-1/2} (x approx_λ(x) − λβ_λ)‖²] / n) (1 + √(8t)) + 4 (b_λ √(d_{1,λ}) + ‖β − β_λ‖_Σ) t / (3n)
      ≤ √(2 (ρ_λ² d_{1,λ} E[approx_λ(x)²] + ‖β − β_λ‖²_Σ) / n) (1 + √(8t)) + 4 (b_λ √(d_{1,λ}) + ‖β − β_λ‖_Σ) t / (3n).

Proof By the definitions of $\bar{\beta}_\lambda$ and $\beta_\lambda$,
\[
\bar{\beta}_\lambda - \beta_\lambda
= \hat{\Sigma}_\lambda^{-1} \hat{\mathbb{E}}\bigl[x \mathbb{E}[y|x]\bigr] - \beta_\lambda
= \hat{\Sigma}_\lambda^{-1} \Bigl( \hat{\mathbb{E}}\bigl[x \bigl(\operatorname{approx}_\lambda(x) + \langle \beta_\lambda, x \rangle\bigr)\bigr] - \hat{\Sigma}_\lambda \beta_\lambda \Bigr)
= \hat{\Sigma}_\lambda^{-1} \bigl( \hat{\mathbb{E}}[x \operatorname{approx}_\lambda(x)] - \lambda \beta_\lambda \bigr).
\]
Therefore, using the sub-multiplicative property of the spectral norm,
\[
\|\bar{\beta}_\lambda - \beta_\lambda\|_\Sigma
\le \bigl\|\Sigma^{1/2} \Sigma_\lambda^{-1/2}\bigr\| \cdot \bigl\|\Sigma_\lambda^{1/2} \hat{\Sigma}_\lambda^{-1} \Sigma_\lambda^{1/2}\bigr\| \cdot \bigl\|\hat{\mathbb{E}}[x \operatorname{approx}_\lambda(x)] - \lambda \beta_\lambda\bigr\|_{\Sigma_\lambda^{-1}}
\le \frac{\bigl\|\hat{\mathbb{E}}[x \operatorname{approx}_\lambda(x)] - \lambda \beta_\lambda\bigr\|_{\Sigma_\lambda^{-1}}}{1 - \|\Delta\|},
\]
where the second inequality follows from Lemma 25 and because
\[
\bigl\|\Sigma^{1/2} \Sigma_\lambda^{-1/2}\bigr\|^2 = \lambda_{\max}\bigl[\Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}\bigr] = \max_i \frac{\lambda_i}{\lambda_i + \lambda} \le 1.
\]
The second part of the claim is a consequence of the tail inequality in Lemma 31. Observe that
\[
\mathbb{E}[x \operatorname{approx}(x)] = \mathbb{E}\bigl[x \bigl(\mathbb{E}[y|x] - \langle \beta, x \rangle\bigr)\bigr] = 0
\]
by Proposition 20, and that
\[
\mathbb{E}\bigl[x \langle \beta - \beta_\lambda, x \rangle\bigr] - \lambda \beta_\lambda = \Sigma \beta - (\Sigma + \lambda I) \beta_\lambda = 0.
\]
Therefore,
\[
\mathbb{E}\bigl[\Sigma_\lambda^{-1/2} \bigl(x \operatorname{approx}_\lambda(x) - \lambda \beta_\lambda\bigr)\bigr]
= \Sigma_\lambda^{-1/2} \Bigl( \mathbb{E}\bigl[x \bigl(\operatorname{approx}(x) + \langle \beta - \beta_\lambda, x \rangle\bigr)\bigr] - \lambda \beta_\lambda \Bigr) = 0.
\]
Moreover, by Proposition 22 and Proposition 23,
\[
\lambda^2 \bigl\|\Sigma_\lambda^{-1/2} \beta_\lambda\bigr\|^2
= \sum_j \frac{\lambda^2}{\lambda_j + \lambda} \langle v_j, \beta_\lambda \rangle^2
= \sum_j \frac{\lambda_j}{\lambda_j + \lambda} \cdot \lambda_j \Bigl(\frac{\lambda}{\lambda_j + \lambda}\Bigr)^2 \beta_j^2
\le \sum_j \lambda_j \Bigl(\frac{\lambda}{\lambda_j + \lambda}\Bigr)^2 \beta_j^2
= \|\beta - \beta_\lambda\|_\Sigma^2. \tag{17}
\]
Combining the inequality from (17) with Condition 4 and the triangle inequality, it follows that
\[
\bigl\|\Sigma_\lambda^{-1/2} \bigl(x \operatorname{approx}_\lambda(x) - \lambda \beta_\lambda\bigr)\bigr\|
\le \bigl\|\Sigma_\lambda^{-1/2} x \operatorname{approx}_\lambda(x)\bigr\| + \lambda \bigl\|\Sigma_\lambda^{-1/2} \beta_\lambda\bigr\|
\le b_\lambda \sqrt{d_{1,\lambda}} + \|\beta - \beta_\lambda\|_\Sigma.
\]

Finally, by the triangle inequality, the fact $(a + b)^2 \le 2(a^2 + b^2)$, the inequality from (17), and Condition 2,
\[
\mathbb{E}\bigl[\|\Sigma_\lambda^{-1/2} (x \operatorname{approx}_\lambda(x) - \lambda \beta_\lambda)\|^2\bigr]
\le 2 \Bigl( \mathbb{E}\bigl[\|\Sigma_\lambda^{-1/2} x \operatorname{approx}_\lambda(x)\|^2\bigr] + \lambda^2 \bigl\|\Sigma_\lambda^{-1/2} \beta_\lambda\bigr\|^2 \Bigr)
\le 2 \bigl( \rho_\lambda^2 d_{1,\lambda} \mathbb{E}[\operatorname{approx}_\lambda(x)^2] + \|\beta - \beta_\lambda\|_\Sigma^2 \bigr).
\]
The claim now follows from Lemma 31.

B.4. Effect of noise

Lemma 28 (Effect of noise, $\lambda = 0$) Assume the dimension of the covariate space is $d < \infty$ and that $\lambda = 0$. Assume Condition 3 (with parameter $\sigma$) holds. Pick any $t > 0$. With probability at least $1 - e^{-t}$, either $\|\Delta_0\| \ge 1$, or $\|\Delta_0\| < 1$ and
\[
\|\bar{\beta}_0 - \hat{\beta}_0\|_\Sigma^2 \le \frac{\sigma^2 \bigl(d + 2\sqrt{dt} + 2t\bigr)}{(1 - \|\Delta_0\|)\, n},
\]
where $\Delta_0$ is defined in (16).

Proof Observe that
\[
\|\bar{\beta}_0 - \hat{\beta}_0\|_\Sigma^2
\le \bigl\|\Sigma^{1/2} \hat{\Sigma}^{-1/2}\bigr\|^2 \|\bar{\beta}_0 - \hat{\beta}_0\|_{\hat{\Sigma}}^2
= \bigl\|\Sigma^{1/2} \hat{\Sigma}^{-1} \Sigma^{1/2}\bigr\| \|\bar{\beta}_0 - \hat{\beta}_0\|_{\hat{\Sigma}}^2;
\]
and if $\|\Delta_0\| < 1$, then $\|\Sigma^{1/2} \hat{\Sigma}^{-1} \Sigma^{1/2}\| \le 1/(1 - \|\Delta_0\|)$ by Lemma 25. Let $\xi := (\operatorname{noise}(x_1), \operatorname{noise}(x_2), \ldots, \operatorname{noise}(x_n))$ be the random vector whose $i$-th component is $\operatorname{noise}(x_i) = y_i - \mathbb{E}[y_i | x_i]$. By the definitions of $\hat{\beta}_0$ and $\bar{\beta}_0$,
\[
\|\hat{\beta}_0 - \bar{\beta}_0\|_{\hat{\Sigma}}^2 = \bigl\|\hat{\Sigma}^{-1/2} \hat{\mathbb{E}}\bigl[x (y - \mathbb{E}[y|x])\bigr]\bigr\|^2 = \xi^\top K \xi,
\]
where $K \in \mathbb{R}^{n \times n}$ is the symmetric matrix whose $(i,j)$-th entry is $K_{i,j} := \frac{1}{n^2} \langle \hat{\Sigma}^{-1/2} x_i, \hat{\Sigma}^{-1/2} x_j \rangle$. Note that the non-zero eigenvalues of $K$ are the same as those of
\[
\frac{1}{n} \hat{\mathbb{E}}\bigl[(\hat{\Sigma}^{-1/2} x) \otimes (\hat{\Sigma}^{-1/2} x)\bigr] = \frac{1}{n} \hat{\Sigma}^{-1/2} \hat{\Sigma} \hat{\Sigma}^{-1/2} = \frac{1}{n} I.
\]
By Lemma 30, with probability at least $1 - e^{-t}$ (conditioned on $x_1, x_2, \ldots, x_n$),
\[
\xi^\top K \xi \le \sigma^2 \bigl( \operatorname{tr}(K) + 2\sqrt{\operatorname{tr}(K^2)\, t} + 2 \lambda_{\max}(K)\, t \bigr) = \frac{\sigma^2 \bigl(d + 2\sqrt{dt} + 2t\bigr)}{n}.
\]
The claim follows.

Lemma 29 (Effect of noise, $\lambda \ge 0$) Assume Condition 1 and Condition 3 (with parameter $\sigma$) hold. Pick any $t > 0$. Let $K$ be the symmetric matrix whose $(i,j)$-th entry is
\[
K_{i,j} := \frac{1}{n^2} \bigl\langle \Sigma^{1/2} \hat{\Sigma}_\lambda^{-1} x_i,\ \Sigma^{1/2} \hat{\Sigma}_\lambda^{-1} x_j \bigr\rangle
\]

where $\hat{\Sigma}_\lambda$ is defined in (15). With probability at least $1 - e^{-t}$,
\[
\|\bar{\beta}_\lambda - \hat{\beta}_\lambda\|_\Sigma^2 \le \sigma^2 \bigl( \operatorname{tr}(K) + 2\sqrt{\operatorname{tr}(K) \lambda_{\max}(K)\, t} + 2 \lambda_{\max}(K)\, t \bigr).
\]
Moreover, if $\|\Delta\| < 1$, where $\Delta$ is defined in (16), then
\[
\lambda_{\max}(K) \le \frac{1}{n (1 - \|\Delta\|)}
\quad\text{and}\quad
\operatorname{tr}(K) \le \frac{d_{2,\lambda} + \sqrt{d_{2,\lambda}}\, \|\Delta\|_F}{n (1 - \|\Delta\|)^2}.
\]

Proof Let $\xi := (\operatorname{noise}(x_1), \operatorname{noise}(x_2), \ldots, \operatorname{noise}(x_n))$ be the random vector whose $i$-th component is $\operatorname{noise}(x_i) = y_i - \mathbb{E}[y_i | x_i]$. By the definitions of $\hat{\beta}_\lambda$, $\bar{\beta}_\lambda$, and $K$,
\[
\|\hat{\beta}_\lambda - \bar{\beta}_\lambda\|_\Sigma^2 = \bigl\| \hat{\Sigma}_\lambda^{-1} \hat{\mathbb{E}}\bigl[x (y - \mathbb{E}[y|x])\bigr] \bigr\|_\Sigma^2 = \xi^\top K \xi.
\]
By Lemma 30, with probability at least $1 - e^{-t}$ (conditioned on $x_1, x_2, \ldots, x_n$),
\[
\xi^\top K \xi
\le \sigma^2 \bigl( \operatorname{tr}(K) + 2\sqrt{\operatorname{tr}(K^2)\, t} + 2 \lambda_{\max}(K)\, t \bigr)
\le \sigma^2 \bigl( \operatorname{tr}(K) + 2\sqrt{\operatorname{tr}(K) \lambda_{\max}(K)\, t} + 2 \lambda_{\max}(K)\, t \bigr),
\]
where the second inequality follows from von Neumann's theorem (Horn and Johnson, 1985, page 423). Note that the non-zero eigenvalues of $K$ are the same as those of
\[
\frac{1}{n} \hat{\mathbb{E}}\bigl[ (\Sigma^{1/2} \hat{\Sigma}_\lambda^{-1} x) \otimes (\Sigma^{1/2} \hat{\Sigma}_\lambda^{-1} x) \bigr]
= \frac{1}{n} \Sigma^{1/2} \hat{\Sigma}_\lambda^{-1} \hat{\Sigma} \hat{\Sigma}_\lambda^{-1} \Sigma^{1/2}.
\]
To bound $\lambda_{\max}(K)$, observe that $\hat{\Sigma}_\lambda^{-1} \hat{\Sigma} \hat{\Sigma}_\lambda^{-1} \preceq \hat{\Sigma}_\lambda^{-1}$, so by the sub-multiplicative property of the spectral norm and Lemma 25,
\[
\lambda_{\max}(K)
\le \frac{1}{n} \bigl\| \Sigma^{1/2} \hat{\Sigma}_\lambda^{-1} \Sigma^{1/2} \bigr\|
\le \frac{1}{n} \bigl\| \Sigma^{1/2} \Sigma_\lambda^{-1/2} \bigr\|^2 \bigl\| \Sigma_\lambda^{1/2} \hat{\Sigma}_\lambda^{-1} \Sigma_\lambda^{1/2} \bigr\|
\le \frac{1}{n (1 - \|\Delta\|)}.
\]
To bound $\operatorname{tr}(K)$, first define the $\lambda$-whitened versions of $\Sigma$, $\hat{\Sigma}$, and $\hat{\Sigma}_\lambda$:
\[
\Sigma_w := \Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}, \qquad
\hat{\Sigma}_w := \Sigma_\lambda^{-1/2} \hat{\Sigma} \Sigma_\lambda^{-1/2}, \qquad
\hat{\Sigma}_{\lambda,w} := \Sigma_\lambda^{-1/2} \hat{\Sigma}_\lambda \Sigma_\lambda^{-1/2}.
\]
Using these definitions with the cycle property of the trace,
\[
\operatorname{tr}(K)
= \frac{1}{n} \operatorname{tr}\bigl( \Sigma^{1/2} \hat{\Sigma}_\lambda^{-1} \hat{\Sigma} \hat{\Sigma}_\lambda^{-1} \Sigma^{1/2} \bigr)
= \frac{1}{n} \operatorname{tr}\bigl( \hat{\Sigma}_{\lambda,w}^{-1} \Sigma_w \hat{\Sigma}_{\lambda,w}^{-1} \hat{\Sigma}_w \bigr).
\]

Let $\{\lambda_j[M]\}$ denote the eigenvalues of a linear operator $M$. By von Neumann's theorem (Horn and Johnson, 1985, page 423),
\[
\operatorname{tr}\bigl( \hat{\Sigma}_{\lambda,w}^{-1} \Sigma_w \hat{\Sigma}_{\lambda,w}^{-1} \hat{\Sigma}_w \bigr)
\le \sum_j \lambda_j\bigl[ \hat{\Sigma}_{\lambda,w}^{-1} \Sigma_w \hat{\Sigma}_{\lambda,w}^{-1} \bigr] \lambda_j[\hat{\Sigma}_w],
\]
and by Ostrowski's theorem (Horn and Johnson, 1985, Theorem 4.5.9),
\[
\lambda_j\bigl[ \hat{\Sigma}_{\lambda,w}^{-1} \Sigma_w \hat{\Sigma}_{\lambda,w}^{-1} \bigr]
\le \lambda_{\max}\bigl[ \hat{\Sigma}_{\lambda,w}^{-1} \bigr]^2 \lambda_j[\Sigma_w].
\]
Therefore
\[
\operatorname{tr}\bigl( \hat{\Sigma}_{\lambda,w}^{-1} \Sigma_w \hat{\Sigma}_{\lambda,w}^{-1} \hat{\Sigma}_w \bigr)
\le \lambda_{\max}\bigl[ \hat{\Sigma}_{\lambda,w}^{-1} \bigr]^2 \sum_j \lambda_j[\hat{\Sigma}_w] \lambda_j[\Sigma_w]
\le \Bigl( \frac{1}{1 - \|\Delta\|} \Bigr)^2 \sum_j \lambda_j[\hat{\Sigma}_w] \lambda_j[\Sigma_w]
\]
\[
= \Bigl( \frac{1}{1 - \|\Delta\|} \Bigr)^2 \sum_j \Bigl( \lambda_j[\Sigma_w]^2 + \bigl( \lambda_j[\hat{\Sigma}_w] - \lambda_j[\Sigma_w] \bigr) \lambda_j[\Sigma_w] \Bigr)
\le \Bigl( \frac{1}{1 - \|\Delta\|} \Bigr)^2 \Bigl( d_{2,\lambda} + \sqrt{\textstyle\sum_j \bigl( \lambda_j[\hat{\Sigma}_w] - \lambda_j[\Sigma_w] \bigr)^2} \sqrt{\textstyle\sum_j \lambda_j[\Sigma_w]^2} \Bigr)
\]
\[
\le \Bigl( \frac{1}{1 - \|\Delta\|} \Bigr)^2 \Bigl( d_{2,\lambda} + \bigl\| \hat{\Sigma}_w - \Sigma_w \bigr\|_F \sqrt{d_{2,\lambda}} \Bigr)
= \Bigl( \frac{1}{1 - \|\Delta\|} \Bigr)^2 \Bigl( d_{2,\lambda} + \|\Delta\|_F \sqrt{d_{2,\lambda}} \Bigr),
\]
where the second inequality follows from Lemma 25, the third inequality follows from Cauchy-Schwarz, and the fourth inequality follows from Mirsky's theorem (Stewart and Sun, 1990, Corollary 4.13).

Appendix C. Probability tail inequalities

The following probability tail inequalities are used in our analysis. These specific inequalities were chosen in order to satisfy the general conditions set up in Section 2.4; however, our analysis can specialize or generalize with the availability of other tail inequalities of these sorts.

The first tail inequality is for positive semidefinite quadratic forms of a subgaussian random vector. It generalizes a standard tail inequality for Gaussian random vectors based on linear combinations of $\chi^2$ random variables (Laurent and Massart, 2000). We give the proof for completeness.

Lemma 30 (Quadratic forms of a subgaussian random vector; Hsu et al., 2011) Let $\xi$ be a random vector taking values in $\mathbb{R}^n$ such that for some $c \ge 0$,
\[
\mathbb{E}\bigl[\exp(\langle u, \xi \rangle)\bigr] \le \exp\bigl(c \|u\|^2 / 2\bigr) \quad \text{for all } u \in \mathbb{R}^n.
\]

For all symmetric positive semidefinite matrices $K \succeq 0$ and all $t > 0$,
\[
\Pr\Bigl[ \xi^\top K \xi > c \bigl( \operatorname{tr}(K) + 2\sqrt{\operatorname{tr}(K^2)\, t} + 2 \|K\|\, t \bigr) \Bigr] \le e^{-t}.
\]

Proof Let $z \in \mathbb{R}^n$ be a vector of i.i.d. standard normal random variables (independent of $\xi$). For any $\tau \ge 0$ and $\lambda \ge 0$, let $\eta := c \lambda^2 / 2$, so
\[
\mathbb{E}\bigl[ \exp(\lambda \langle z, K^{1/2} \xi \rangle) \bigr]
\ge \mathbb{E}\Bigl[ \exp(\lambda \langle z, K^{1/2} \xi \rangle)\, \mathbf{1}\bigl\{ \|K^{1/2} \xi\|^2 > c (\operatorname{tr}(K) + \tau) \bigr\} \Bigr]
\]
\[
\ge \exp\bigl( \lambda^2 c (\operatorname{tr}(K) + \tau) / 2 \bigr) \Pr\bigl[ \|K^{1/2} \xi\|^2 > c (\operatorname{tr}(K) + \tau) \bigr]
= \exp\bigl( \eta (\operatorname{tr}(K) + \tau) \bigr) \Pr\bigl[ \|K^{1/2} \xi\|^2 > c (\operatorname{tr}(K) + \tau) \bigr] \tag{8}
\]
since $\mathbb{E}[\exp(\langle u, z \rangle)] = \exp(\|u\|^2 / 2)$ for any $u \in \mathbb{R}^n$. Moreover, by independence of $\xi$ and $z$,
\[
\mathbb{E}\bigl[ \exp(\lambda \langle z, K^{1/2} \xi \rangle) \bigr]
= \mathbb{E}\Bigl[ \mathbb{E}\bigl[ \exp(\lambda \langle K^{1/2} z, \xi \rangle) \,\bigm|\, z \bigr] \Bigr]
\le \mathbb{E}\bigl[ \exp(c \lambda^2 \|K^{1/2} z\|^2 / 2) \bigr]
= \mathbb{E}\bigl[ \exp(\eta \|K^{1/2} z\|^2) \bigr].
\]
Since $K$ is symmetric and positive semidefinite, $K = V D V^\top$ for some orthogonal matrix $V = [u_1\ u_2\ \cdots\ u_r]$ and diagonal matrix $D = \operatorname{diag}(\rho_1, \rho_2, \ldots, \rho_r)$, where $r$ is the rank of $K$. By rotational symmetry, the vector $V^\top z$ is equal in distribution to a vector of $r$ i.i.d. standard normal random variables $q_1, q_2, \ldots, q_r$, and
\[
\|K^{1/2} z\|^2 = \|D^{1/2} V^\top z\|^2 = \rho_1 q_1^2 + \rho_2 q_2^2 + \cdots + \rho_r q_r^2.
\]
Therefore,
\[
\mathbb{E}\bigl[ \exp(\lambda \langle z, K^{1/2} \xi \rangle) \bigr]
\le \mathbb{E}\bigl[ \exp(\eta \|K^{1/2} z\|^2) \bigr]
= \mathbb{E}\bigl[ \exp\bigl( \eta (\rho_1 q_1^2 + \rho_2 q_2^2 + \cdots + \rho_r q_r^2) \bigr) \bigr]. \tag{9}
\]
Combining (8) and (9) gives
\[
\Pr\bigl[ \|K^{1/2} \xi\|^2 > c (\operatorname{tr}(K) + \tau) \bigr]
\le \exp\bigl( -\eta (\operatorname{tr}(K) + \tau) \bigr)\, \mathbb{E}\bigl[ \exp\bigl( \eta (\rho_1 q_1^2 + \rho_2 q_2^2 + \cdots + \rho_r q_r^2) \bigr) \bigr].
\]
The expectation on the right-hand side is the moment generating function for a linear combination of $r$ independent $\chi^2$ random variables, each with one degree of freedom. Since $\operatorname{tr}(K) = \rho_1 + \rho_2 + \cdots + \rho_r$, $\operatorname{tr}(K^2) = \rho_1^2 + \rho_2^2 + \cdots + \rho_r^2$, and $\|K\| = \max\{\rho_1, \rho_2, \ldots, \rho_r\}$, the conclusion follows from standard facts about $\chi^2$ random variables (Laurent and Massart, 2000):
\[
\Pr\bigl[ \|K^{1/2} \xi\|^2 > c (\operatorname{tr}(K) + \tau) \bigr]
\le \exp\Bigl( -\frac{\operatorname{tr}(K^2)}{2 \|K\|^2}\, h\Bigl( \frac{\|K\| \tau}{\operatorname{tr}(K^2)} \Bigr) \Bigr),
\]
where $h(a) := 1 + a - \sqrt{1 + 2a}$.

The next lemma is a tail inequality for sums of bounded random vectors; it is a standard application of Bernstein's inequality.

Lemma 31 (Vector Bernstein bound; see, e.g., Hsu et al., 2011) Let $x_1, x_2, \ldots, x_n$ be independent random vectors such that
\[
\sum_{i=1}^n \mathbb{E}\bigl[\|x_i\|^2\bigr] \le v
\quad\text{and}\quad
\|x_i\| \le r \quad \text{for all } i = 1, 2, \ldots, n,
\]
almost surely. Let $s := x_1 + x_2 + \cdots + x_n$. For all $t > 0$,
\[
\Pr\Bigl[ \|s\| > \sqrt{v} \bigl( 1 + \sqrt{8t} \bigr) + (4/3)\, r t \Bigr] \le e^{-t}.
\]

The last tail inequality concerns the spectral accuracy of an empirical second moment matrix.

Lemma 32 (Matrix Bernstein bound; Hsu et al., 2012) Let $X$ be a random matrix, and $r > 0$, $v > 0$, and $k > 0$ be such that, almost surely,
\[
\mathbb{E}[X] = 0, \qquad \lambda_{\max}[X] \le r, \qquad \lambda_{\max}\bigl[\mathbb{E}[X^2]\bigr] \le v, \qquad \operatorname{tr}\bigl(\mathbb{E}[X^2]\bigr) \le v k.
\]
If $X_1, X_2, \ldots, X_n$ are independent copies of $X$, then for any $t > 0$,
\[
\Pr\biggl[ \lambda_{\max}\Bigl[ \frac{1}{n} \sum_{i=1}^n X_i \Bigr] > \sqrt{\frac{2 v t}{n}} + \frac{r t}{3 n} \biggr] \le k t \bigl( e^t - t - 1 \bigr)^{-1}.
\]
If $t \ge 2.6$, then $t (e^t - t - 1)^{-1} \le e^{-t/2}$.
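As a quick numerical sanity check (not part of the paper), the tail bound of Lemma 30 can be tested by Monte Carlo with a standard Gaussian $\xi$, which satisfies the subgaussian moment condition with $c = 1$. The dimension, the matrix $K$, the value of $t$, and the trial count below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, c = 8, 2.0, 1.0  # dimension, tail parameter, subgaussian constant (illustrative)

# A fixed symmetric positive semidefinite matrix K.
A = rng.standard_normal((n, n))
K = A @ A.T

# Deviation threshold from Lemma 30: c(tr(K) + 2 sqrt(tr(K^2) t) + 2 ||K|| t).
threshold = c * (np.trace(K)
                 + 2 * np.sqrt(np.trace(K @ K) * t)
                 + 2 * np.linalg.norm(K, 2) * t)

# Standard Gaussian vectors satisfy E[exp(<u, xi>)] = exp(||u||^2 / 2), i.e. c = 1.
trials = 20000
xi = rng.standard_normal((trials, n))
quad = np.einsum('ij,jk,ik->i', xi, K, xi)  # xi^T K xi for each trial
violation_rate = np.mean(quad > threshold)

# The lemma guarantees the violation probability is at most e^{-t}.
print(violation_rate, np.exp(-t))
```

The empirical violation rate typically lands far below $e^{-t}$, reflecting the slack in the Chernoff argument for matrices whose spectrum is not concentrated on a single eigenvalue.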

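The deterministic inequality of Lemma 25 can likewise be verified numerically: whenever the whitened deviation $\Delta$ has spectral norm below 1, the relative spectral norm error of $\hat{\Sigma}_\lambda^{-1}$ is controlled. The covariance, sample size, and ridge parameter in this sketch are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, lam = 5, 2000, 0.1  # dimension, sample size, ridge parameter (illustrative)

# Population covariance Sigma and an i.i.d. sample whose rows have covariance Sigma.
scales = np.array([2.0, 1.5, 1.0, 0.5, 0.25])
Sigma = np.diag(scales ** 2)
X = rng.standard_normal((n, d)) * scales

Sigma_hat = X.T @ X / n
Sigma_lam = Sigma + lam * np.eye(d)
Sigma_hat_lam = Sigma_hat + lam * np.eye(d)

# Symmetric square roots of Sigma_lam via its eigendecomposition.
w, V = np.linalg.eigh(Sigma_lam)
root = V @ np.diag(w ** 0.5) @ V.T
inv_root = V @ np.diag(w ** -0.5) @ V.T

# Delta := Sigma_lam^{-1/2} (Sigma_hat - Sigma) Sigma_lam^{-1/2}, as in (16).
Delta = inv_root @ (Sigma_hat - Sigma) @ inv_root
norm_Delta = np.linalg.norm(Delta, 2)

# Lemma 25: if ||Delta|| < 1, then
# ||Sigma_lam^{1/2} Sigma_hat_lam^{-1} Sigma_lam^{1/2}|| <= 1 / (1 - ||Delta||).
lhs = np.linalg.norm(root @ np.linalg.inv(Sigma_hat_lam) @ root, 2)
print(norm_Delta, lhs, 1 / (1 - norm_Delta))
```

With $n$ well above $d$ the spectral norm of $\Delta$ is small, and the inequality holds deterministically (up to floating point) on every draw, since it is a linear-algebra fact rather than a probabilistic one.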

More information

Asymptotic distribution of products of sums of independent random variables

Asymptotic distribution of products of sums of independent random variables Proc. Idia Acad. Sci. Math. Sci. Vol. 3, No., May 03, pp. 83 9. c Idia Academy of Scieces Asymptotic distributio of products of sums of idepedet radom variables YANLING WANG, SUXIA YAO ad HONGXIA DU ollege

More information

1 Convergence in Probability and the Weak Law of Large Numbers

1 Convergence in Probability and the Weak Law of Large Numbers 36-752 Advaced Probability Overview Sprig 2018 8. Covergece Cocepts: i Probability, i L p ad Almost Surely Istructor: Alessadro Rialdo Associated readig: Sec 2.4, 2.5, ad 4.11 of Ash ad Doléas-Dade; Sec

More information

NYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018)

NYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018) NYU Ceter for Data Sciece: DS-GA 003 Machie Learig ad Computatioal Statistics (Sprig 208) Brett Berstei, David Roseberg, Be Jakubowski Jauary 20, 208 Istructios: Followig most lab ad lecture sectios, we

More information

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate Supplemetary Material for Fast Stochastic AUC Maximizatio with O/-Covergece Rate Migrui Liu Xiaoxua Zhag Zaiyi Che Xiaoyu Wag 3 iabao Yag echical Lemmas ized versio of Hoeffdig s iequality, ote that We

More information

An almost sure invariance principle for trimmed sums of random vectors

An almost sure invariance principle for trimmed sums of random vectors Proc. Idia Acad. Sci. Math. Sci. Vol. 20, No. 5, November 200, pp. 6 68. Idia Academy of Scieces A almost sure ivariace priciple for trimmed sums of radom vectors KE-ANG FU School of Statistics ad Mathematics,

More information

Physics 324, Fall Dirac Notation. These notes were produced by David Kaplan for Phys. 324 in Autumn 2001.

Physics 324, Fall Dirac Notation. These notes were produced by David Kaplan for Phys. 324 in Autumn 2001. Physics 324, Fall 2002 Dirac Notatio These otes were produced by David Kapla for Phys. 324 i Autum 2001. 1 Vectors 1.1 Ier product Recall from liear algebra: we ca represet a vector V as a colum vector;

More information

Stochastic Matrices in a Finite Field

Stochastic Matrices in a Finite Field Stochastic Matrices i a Fiite Field Abstract: I this project we will explore the properties of stochastic matrices i both the real ad the fiite fields. We first explore what properties 2 2 stochastic matrices

More information

ACO Comprehensive Exam 9 October 2007 Student code A. 1. Graph Theory

ACO Comprehensive Exam 9 October 2007 Student code A. 1. Graph Theory 1. Graph Theory Prove that there exist o simple plaar triagulatio T ad two distict adjacet vertices x, y V (T ) such that x ad y are the oly vertices of T of odd degree. Do ot use the Four-Color Theorem.

More information

The random version of Dvoretzky s theorem in l n

The random version of Dvoretzky s theorem in l n The radom versio of Dvoretzky s theorem i l Gideo Schechtma Abstract We show that with high probability a sectio of the l ball of dimesio k cε log c > 0 a uiversal costat) is ε close to a multiple of the

More information

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices Radom Matrices with Blocks of Itermediate Scale Strogly Correlated Bad Matrices Jiayi Tog Advisor: Dr. Todd Kemp May 30, 07 Departmet of Mathematics Uiversity of Califoria, Sa Diego Cotets Itroductio Notatio

More information

11 THE GMM ESTIMATION

11 THE GMM ESTIMATION Cotets THE GMM ESTIMATION 2. Cosistecy ad Asymptotic Normality..................... 3.2 Regularity Coditios ad Idetificatio..................... 4.3 The GMM Iterpretatio of the OLS Estimatio.................

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 Fuctioal Law of Large Numbers. Costructio of the Wieer Measure Cotet. 1. Additioal techical results o weak covergece

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

5 Birkhoff s Ergodic Theorem

5 Birkhoff s Ergodic Theorem 5 Birkhoff s Ergodic Theorem Amog the most useful of the various geeralizatios of KolmogorovâĂŹs strog law of large umbers are the ergodic theorems of Birkhoff ad Kigma, which exted the validity of the

More information

On Nonsingularity of Saddle Point Matrices. with Vectors of Ones

On Nonsingularity of Saddle Point Matrices. with Vectors of Ones Iteratioal Joural of Algebra, Vol. 2, 2008, o. 4, 197-204 O Nosigularity of Saddle Poit Matrices with Vectors of Oes Tadeusz Ostrowski Istitute of Maagemet The State Vocatioal Uiversity -400 Gorzów, Polad

More information

Notes 19 : Martingale CLT

Notes 19 : Martingale CLT Notes 9 : Martigale CLT Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces: [Bil95, Chapter 35], [Roc, Chapter 3]. Sice we have ot ecoutered weak covergece i some time, we first recall

More information

5.1 A mutual information bound based on metric entropy

5.1 A mutual information bound based on metric entropy Chapter 5 Global Fao Method I this chapter, we exted the techiques of Chapter 2.4 o Fao s method the local Fao method) to a more global costructio. I particular, we show that, rather tha costructig a local

More information

Empirical Processes: Glivenko Cantelli Theorems

Empirical Processes: Glivenko Cantelli Theorems Empirical Processes: Gliveko Catelli Theorems Mouliath Baerjee Jue 6, 200 Gliveko Catelli classes of fuctios The reader is referred to Chapter.6 of Weller s Torgo otes, Chapter??? of VDVW ad Chapter 8.3

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

Geometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT

Geometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT OCTOBER 7, 2016 LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT Geometry of LS We ca thik of y ad the colums of X as members of the -dimesioal Euclidea space R Oe ca

More information