Random Design Analysis of Ridge Regression


JMLR: Workshop and Conference Proceedings vol 23 (2012) — 25th Annual Conference on Learning Theory

Random Design Analysis of Ridge Regression

Daniel Hsu, Microsoft Research (DAHSU@MICROSOFT.COM)
Sham M. Kakade, Microsoft Research (SKAKADE@MICROSOFT.COM)
Tong Zhang, Rutgers University (TZHANG@STAT.RUTGERS.EDU)

Editors: Shie Mannor, Nathan Srebro, Robert C. Williamson

Abstract

This work gives a simultaneous analysis of both the ordinary least squares estimator and the ridge regression estimator in the random design setting under mild assumptions on the covariate/response distributions. In particular, the analysis provides sharp results on the out-of-sample prediction error, as opposed to the in-sample (fixed design) error. The analysis also reveals the effect of errors in the estimated covariance structure, as well as the effect of modeling errors; neither of these effects is present in the fixed design setting. The proofs of the main results are based on a simple decomposition lemma combined with concentration inequalities for random vectors and matrices.

1. Introduction

In the random design setting for linear regression, we are provided with n samples of covariates and responses, (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), which are sampled independently from a population, where the x_i are random vectors and the y_i are random variables. Typically, these pairs are hypothesized to have the linear relationship

    y_i = ⟨β, x_i⟩ + ε_i

for some linear function β (though this hypothesis need not be true). Here, the ε_i are error terms, typically assumed to be normally distributed as N(0, σ²). The goal of estimation in this setting is to find coefficients β̂ based on these (x_i, y_i) pairs such that the expected prediction error on a new draw (x, y) from the population, measured as E[(⟨β̂, x⟩ − y)²], is as small as possible. This goal can also be interpreted as estimating β with accuracy measured under a particular norm. The random design setting stands in contrast to the fixed design setting, where the covariates x_1, x_2, ..., x_n are fixed (i.e., deterministic), and only the responses y_1, y_2, ..., y_n are treated as random.
Thus, the covariance structure of the design points is completely known and need not be estimated, which simplifies the analysis of standard estimators. However, the fixed design setting does not directly address out-of-sample prediction, which is of primary concern in many applications. For instance, in prediction problems, the estimator β̂ is computed from an initial sample from the population, and the end-goal is to use β̂ as a predictor of y given x, where (x, y) is a new draw from the population. A fixed design analysis only assesses the accuracy of β̂ on data already seen, while a random design analysis is concerned with the predictive performance on unseen data.

© 2012 D. Hsu, S.M. Kakade & T. Zhang.

This work gives a detailed analysis of both the ordinary least squares and ridge estimators (Hoerl, 1962) in the random design setting that quantifies the essential differences between random and fixed design. In particular, the analysis reveals, through a simple decomposition, the effect of errors in the estimated covariance structure, as well as the effect of approximating the true regression function by a linear function in the case the model is misspecified. Neither of these effects is present in the fixed design analysis of ridge regression. The random design analysis shows that the effect of errors in the estimated covariance structure is minimal: it is typically a second-order effect as soon as the sample size is large enough. The analysis also isolates the effect of approximation error in the main terms of the estimation error bound, so that the bound reduces to one that scales with the noise variance when the approximation error vanishes.

One feature of the analysis in this work is that it applies to the ridge estimator with an arbitrary setting of the regularization parameter λ. The estimation error is given in terms of the spectrum of the covariance E[x ⊗ x] and the particular choice of λ. When λ = 0, we obtain an analysis of ordinary least squares, applicable when the spectrum is finite (i.e., when the covariates live in a finite dimensional space). More generally, the convergence rate can be optimized by appropriately setting λ based on assumptions about the spectrum.

Outline. Section 2 discusses the model, preliminaries, and related work. Section 3 presents the main results on the excess mean squared error of the ordinary least squares and ridge estimators under random design and discusses the relationship to the standard fixed design analysis. An application to smoothing splines is provided in Appendix A, and the proofs of the main results are given in Appendix B.

2. Preliminaries

2.1. Notation

Unless otherwise specified, all vectors in this work are assumed to live in a (possibly infinite dimensional) separable Hilbert space with inner product ⟨·,·⟩.
Let ‖·‖_M, for a self-adjoint positive semidefinite linear operator M ⪰ 0, denote the vector norm given by ‖v‖_M := √⟨v, Mv⟩. When M is omitted, it is assumed to be the identity, so ‖v‖ = √⟨v, v⟩. Let u ⊗ u denote the outer product of a vector u, which acts as the rank-one linear operator v ↦ (u ⊗ u)v = ⟨v, u⟩u. For a linear operator M, let ‖M‖ denote its spectral (operator) norm, i.e., ‖M‖ = sup_{v ≠ 0} ‖Mv‖/‖v‖, and let ‖M‖_F denote its Frobenius norm, i.e., ‖M‖_F = √tr(M*M). If M is self-adjoint, ‖M‖_F = √tr(M²). Let λ_max[M] and λ_min[M], respectively, denote the largest and smallest eigenvalues of a self-adjoint linear operator M.

2.2. Linear regression

Let x be a random vector, and let y be a random variable. Let {v_j} be the eigenvectors of

    Σ := E[x ⊗ x],                                                          (1)

so that they form an orthonormal basis. The corresponding eigenvalues are

    λ_j := ⟨v_j, Σv_j⟩ = E[⟨v_j, x⟩²]

(assumed to be non-zero for convenience). Let β achieve the minimum mean squared error over all linear functions, i.e.,

    E[(⟨β, x⟩ − y)²] = min_w { E[(⟨w, x⟩ − y)²] },

so that:

    β := Σ_j β_j v_j   where   β_j := E[⟨v_j, x⟩ y] / E[⟨v_j, x⟩²].          (2)

We also have that the excess mean squared error of w over the minimum is:

    E[(⟨w, x⟩ − y)²] − E[(⟨β, x⟩ − y)²] = ‖w − β‖²_Σ

(see Proposition 21).

2.3. The ridge and ordinary least squares estimators

Let (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) be independent copies of (x, y), and let Ê denote the empirical expectation with respect to these copies, i.e.,

    Ê[f] := (1/n) Σ_{i=1}^n f(x_i, y_i),   Σ̂ := Ê[x ⊗ x] = (1/n) Σ_{i=1}^n x_i ⊗ x_i.      (3)

Let β̂_λ denote the ridge estimator with parameter λ ≥ 0, defined as the minimizer of the λ-regularized empirical mean squared error, i.e.,

    β̂_λ := arg min_w { Ê[(⟨w, x⟩ − y)²] + λ‖w‖² }.                           (4)

The special case with λ = 0 is the ordinary least squares estimator, which minimizes the empirical mean squared error. These estimators are uniquely defined if and only if Σ̂ + λI ≻ 0 (a sufficient condition is λ > 0), in which case β̂_λ = (Σ̂ + λI)^{-1} Ê[xy].

2.4. Data model

We now specify the conditions on the random pair (x, y) under which the analysis applies.

2.4.1. COVARIATE MODEL

The following conditions on the covariate x ensure that the second-moment operator Σ can be estimated from a random sample with sufficient accuracy. The first requires that the spectrum of Σ decays sufficiently fast at regularization level λ.

Condition 1 (Spectral decay at λ) For p ∈ {1, 2},

    d_{p,λ} := Σ_j (λ_j / (λ_j + λ))^p < ∞.                                  (5)
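As a sanity check, the closed form in (4) is easy to compute numerically. The sketch below (in NumPy; the function name `ridge_estimator` is ours, not the paper's) verifies that λ = 0 recovers ordinary least squares and that λ > 0 shrinks the estimate:

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Minimizer of the lam-regularized empirical MSE, eq. (4).

    Closed form (valid when Sigma_hat + lam*I is invertible):
        beta_hat_lam = (Sigma_hat + lam*I)^{-1} E_hat[x y],
    where Sigma_hat = (1/n) X^T X and E_hat[x y] = (1/n) X^T y.
    """
    n, d = X.shape
    Sigma_hat = X.T @ X / n
    return np.linalg.solve(Sigma_hat + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# lam = 0 is ordinary least squares.
beta_ols = ridge_estimator(X, y, 0.0)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_ols, beta_lstsq)

# lam > 0 shrinks the coefficients (Proposition 22).
beta_ridge = ridge_estimator(X, y, 10.0)
assert np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols)
```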

For technical reasons, we also use the quantity

    d̃_{1,λ} := max{d_{1,λ}, 1}                                              (6)

merely to simplify certain probability tail inequalities in the main result in the peculiar case that λ is very large (upon which d_{1,λ} ≈ 0). We remark that d_{2,λ} arises naturally in the standard fixed design analysis of ridge regression (see Proposition 5), and that d_{1,λ} was also used by Zhang (2005) in his random design analysis of (kernel) ridge regression. It is easy to see that d_{2,λ} ≤ d_{1,λ}, and that in covariate spaces of finite dimension d < ∞, we have d_{p,λ} ≤ d with equality iff λ = 0.

The second condition requires that the squared length of (Σ + λI)^{-1/2} x is never more than a constant factor greater than its expectation (hence the name bounded statistical leverage). The linear mapping x ↦ (Σ + λI)^{-1/2} x is sometimes called whitening when λ = 0. The reason for considering λ > 0, in which case we call the mapping λ-whitening, is that the expectation E[‖(Σ + λI)^{-1/2} x‖²] may only be finite for sufficiently large λ (as in Condition 1), as

    E[‖(Σ + λI)^{-1/2} x‖²] = tr((Σ + λI)^{-1/2} Σ (Σ + λI)^{-1/2}) = Σ_j λ_j/(λ_j + λ) = d_{1,λ}.

Condition 2 (Bounded statistical leverage at λ) There exists finite ρ_λ such that, almost surely,

    ‖(Σ + λI)^{-1/2} x‖ / √(E[‖(Σ + λI)^{-1/2} x‖²]) = ‖(Σ + λI)^{-1/2} x‖ / √(d_{1,λ}) ≤ ρ_λ.

The hard almost sure bound in Condition 2 may be relaxed to moment conditions simply by using different probability tail inequalities in the analysis. We do not consider this relaxation for sake of simplicity. We also remark that, in finite dimensional settings, it is easy to replace Condition 2 with a subgaussian condition (specifically, a requirement that every projection of (Σ + λI)^{-1/2} x be subgaussian), which can lead to a sharper deviation bound in certain cases.

Remark 1 (Finite dimensional setting and λ = 0) If λ = 0 and the dimension of the covariate space is d, then Condition 2 reduces to the requirement that there exists a finite ρ_0 such that, almost surely,

    ‖Σ^{-1/2} x‖ / √(E[‖Σ^{-1/2} x‖²]) = ‖Σ^{-1/2} x‖ / √d ≤ ρ_0.

Remark 2 (Bounded ‖x‖) If ‖x‖ ≤ r almost surely, then

    ‖(Σ + λI)^{-1/2} x‖ / √(d_{1,λ}) ≤ r / √((inf_j{λ_j} + λ) d_{1,λ}),

in which case Condition 2 is satisfied with ρ_λ ≤ r / √((inf_j{λ_j} + λ) d_{1,λ}).
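The effective dimensions of Condition 1 are direct functions of the spectrum, so the relations d_{2,λ} ≤ d_{1,λ} ≤ d (with equality iff λ = 0) can be checked numerically; a minimal sketch, with an illustrative spectrum of our choosing:

```python
import numpy as np

def effective_dims(eigvals, lam):
    """d_{p,lam} = sum_j (lam_j / (lam_j + lam))^p for p = 1, 2 (Condition 1)."""
    r = eigvals / (eigvals + lam)
    return r.sum(), (r ** 2).sum()

eigvals = np.array([1.0, 0.5, 0.25, 0.125])  # spectrum of Sigma (d = 4)
d = len(eigvals)

d1, d2 = effective_dims(eigvals, lam=0.1)
assert d2 <= d1 <= d            # d_{2,lam} <= d_{1,lam} <= d

d1_0, d2_0 = effective_dims(eigvals, lam=0.0)
assert d1_0 == d2_0 == d        # equality iff lam = 0
```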

2.4.2. RESPONSE MODEL

The response model considered in this work is a relaxation of the typical Gaussian model; the model specifically allows for approximation error and general subgaussian noise. Define the random variables

    noise(x) := y − E[y|x]   and   approx(x) := E[y|x] − ⟨β, x⟩,              (7)

where noise(x) corresponds to the response noise, and approx(x) corresponds to the approximation error of β. This gives the following modeling equation:

    y = ⟨β, x⟩ + approx(x) + noise(x).

Conditioned on x, noise(x) is random, while approx(x) is deterministic. The noise is assumed to satisfy the following subgaussian moment condition.

Condition 3 (Subgaussian noise) There exists finite σ ≥ 0 such that, almost surely,

    E[exp(η noise(x)) | x] ≤ exp(η²σ²/2)   ∀η ∈ ℝ.

Condition 3 is satisfied, for instance, if noise(x) is normally distributed with mean zero and variance σ².

For the next condition, define β_λ to be the minimizer of the λ-regularized mean squared error, i.e.,

    β_λ := arg min_w { E[(⟨w, x⟩ − y)²] + λ‖w‖² } = (Σ + λI)^{-1} E[xy],      (8)

and also define

    approx_λ(x) := E[y|x] − ⟨β_λ, x⟩.                                        (9)

The final condition requires a bound on the size of approx_λ(x).

Condition 4 (Bounded approximation error at λ) There exists finite b_λ ≥ 0 such that, almost surely,

    ‖(Σ + λI)^{-1/2} x‖ |approx_λ(x)| / √(E[‖(Σ + λI)^{-1/2} x‖²]) = ‖(Σ + λI)^{-1/2} x‖ |approx_λ(x)| / √(d_{1,λ}) ≤ b_λ.

The hard almost sure bound in Condition 4 can easily be relaxed to moment conditions, but we do not consider it here for sake of simplicity. We also remark that b_λ only appears in lower-order terms in the main bounds.

Remark 3 (Finite dimensional setting and λ = 0) If λ = 0 and the dimension of the covariate space is d, then Condition 4 reduces to the requirement that there exists a finite b_0 ≥ 0 such that, almost surely,

    ‖Σ^{-1/2} x‖ |approx(x)| / √(E[‖Σ^{-1/2} x‖²]) = ‖Σ^{-1/2} x‖ |approx(x)| / √d ≤ b_0.

Remark 4 (Bounded approx(x)) If |approx(x)| ≤ a almost surely and Condition 2 (with parameter ρ_λ) holds, then

    ‖(Σ + λI)^{-1/2} x‖ |approx_λ(x)| / √(d_{1,λ})
      ≤ ρ_λ |approx_λ(x)|
      ≤ ρ_λ (a + |⟨β − β_λ, x⟩|)
      ≤ ρ_λ (a + ‖β − β_λ‖_{Σ+λI} ‖x‖_{(Σ+λI)^{-1}})
      ≤ ρ_λ (a + ρ_λ √(d_{1,λ}) ‖β − β_λ‖_{Σ+λI}),

where the first and last inequalities use Condition 2, the second inequality uses the definition of approx_λ(x) in (9) and the triangle inequality, and the third inequality follows from Cauchy–Schwarz. The quantity ‖β − β_λ‖_{Σ+λI} can be bounded by √λ ‖β‖ using the arguments in the proof of Proposition 23. In this case, Condition 4 is satisfied with b_λ ≤ ρ_λ (a + ρ_λ √(λ d_{1,λ}) ‖β‖). If in addition ‖x‖ ≤ r almost surely, then Condition 2 and Condition 4 are satisfied with ρ_λ ≤ r / √(λ d_{1,λ}) and b_λ ≤ ρ_λ (a + r ‖β‖) as per Remark 2.

2.5. Related work

Many classical analyses of the ridge and ordinary least squares estimators in the random design setting (e.g., in the context of non-parametric estimators) do not actually show non-asymptotic O(d/n) convergence of the mean squared error to that of the best linear predictor, where d is the dimension of the covariate space. Rather, the error relative to the Bayes error is bounded by some multiple c > 1 of the error of the optimal linear predictor relative to the Bayes error, plus an O(d/n) term (Györfi et al., 2004):

    E[(⟨β̂, x⟩ − E[y|x])²] ≤ c · E[(⟨β, x⟩ − E[y|x])²] + O(d/n).

Such bounds are appropriate in non-parametric settings where the error of the optimal linear predictor also approaches the Bayes error at an O(d/n) rate. Beyond these classical results, analyses of ordinary least squares often come with non-standard restrictions on applicability or additional dependencies on the spectrum of the second moment matrix (see the recent work of Audibert and Catoni (2010b) for a comprehensive survey of these results). For instance, a result of Catoni (2004, Proposition 5.9.1) gives a bound on the excess mean squared error of the form

    ‖β̂ − β‖²_Σ ≤ O( (d + log(det(Σ̂)/det(Σ))) / n ),

but the bound is only shown to hold when every linear predictor with low empirical mean squared error satisfies certain boundedness conditions.
This work provides ridge regression bounds explicitly in terms of the vector β (as a sequence) and in terms of the eigenspectrum of the second moment matrix Σ. Previous analyses of ridge

regression make strong boundedness assumptions, or fail to give a bound in the case λ = 0 (e.g., Zhang, 2005; Smale and Zhou, 2007; Caponnetto and De Vito, 2007; Steinwart et al., 2009). For instance, Zhang assumes ‖x‖ ≤ b_x and |⟨β, x⟩ − y| ≤ b_approx almost surely, and gives the bound

    ‖β̂_λ − β‖²_Σ ≤ λ ‖β̂_λ − β‖² + c (d_{1,λ}/n) (b_approx + b_x ‖β̂_λ − β‖)²,

where d_{1,λ} is a notion of effective dimension at scale λ (the same as that in Condition 1). The quantity ‖β̂_λ − β‖ is then bounded by assuming ‖β‖ < ∞. Smale and Zhou assume the more stringent conditions that |y| ≤ b_y and ‖x‖ ≤ b_x almost surely, and prove the bound ‖β̂_λ − β_λ‖²_Σ ≤ c b_x² b_y² / (λ² n) (note that the bound becomes trivial when λ = 0); this is then used to bound ‖β̂_λ − β‖²_Σ under explicit boundedness assumptions on ‖β‖. Caponnetto and De Vito crucially require boundedness of ‖β‖ and λ > 0 in their analysis (in particular, in their Theorem 4), and also have a worse tail behavior, with a bound of the form d_{2,λ} t²/n holding with probability 1 − e^{−t}. Finally, Steinwart et al. explicitly require |y| ≤ b_y, and their bound depends on b_y in the dominant term; moreover, their bounds require explicit decay conditions on the eigenspectrum (their Equation 6) and are also trivial when λ = 0. Our result for ridge regression is given explicitly in terms of ‖β_λ − β‖²_Σ (and therefore explicitly in terms of β as a sequence, the eigenspectrum of Σ, and λ); this quantity vanishes when λ = 0 and can be bounded even when ‖β‖ is unbounded. We note that ‖β_λ − β‖²_Σ is precisely the bias term from the standard fixed design analysis of ridge regression, and therefore is natural to expect in a random design analysis.

Recently, Audibert and Catoni (2010a,b) derived sharp risk bounds for the ordinary least squares and ridge estimators (in addition to specially developed PAC-Bayesian estimators) in a random design setting under very mild assumptions. Their bounds are proved using PAC-Bayesian techniques, which allows them to achieve exponential tail inequalities under remarkably minimal moment conditions. Their non-asymptotic bound for ordinary least squares holds with probability at least 1 − e^{−t}, but only for t ≤ ln n.
Our result requires stronger assumptions in some respects, but it avoids this restriction on the probability tail parameter t, and the analysis is arguably more transparent and yields more reasonable quantitative bounds. The analysis of Audibert and Catoni (2010a) for the ridge estimator is established only in an asymptotic sense and bounds the excess regularized mean squared error rather than the excess mean squared error itself. Therefore, the results are not directly comparable to those provided here. It should also be mentioned that a number of other linear estimators have been considered in the literature with non-asymptotic prediction error bounds (e.g., Koltchinskii, 2006; Audibert and Catoni, 2010a,b), but the focus of our work is on the ordinary least squares and ridge estimators.

3. Random Design Regression

This section presents the main results of the paper on the excess mean squared error of the ridge estimator under random design (and its specialization to the ordinary least squares estimator). First, we review the standard fixed design analysis.

3.1. Review of fixed design analysis

It is informative to first review the fixed design analysis of the ridge estimator. Recall that in this setting, the design points x_1, x_2, ..., x_n are fixed (deterministic) vectors, and the responses y_1, y_2, ..., y_n are independent random variables. Therefore, we define Σ := Σ̂ = (1/n) Σ_{i=1}^n x_i ⊗ x_i (which is non-random), and assume it has eigenvectors {v_j} and corresponding eigenvalues λ_j := ⟨v_j, Σv_j⟩. As in the random design setting, the linear function β := Σ_j β_j v_j, where

β_j := λ_j^{-1} (1/n) Σ_{i=1}^n ⟨v_j, x_i⟩ E[y_i], minimizes the expected mean squared error, i.e.,

    β := arg min_w (1/n) Σ_{i=1}^n E[(⟨w, x_i⟩ − y_i)²].

Similar to the random design setup, define noise(x_i) := y_i − E[y_i] and approx(x_i) := E[y_i] − ⟨β, x_i⟩ for i = 1, 2, ..., n, so the following modeling equations hold:

    y_i = ⟨β, x_i⟩ + approx(x_i) + noise(x_i)   for i = 1, 2, ..., n.

Because Σ̂ = Σ, the ridge estimator β̂_λ in the fixed design setting is an unbiased estimator of the minimizer of the λ-regularized mean squared error, i.e.,

    E[β̂_λ] = (Σ + λI)^{-1} ( (1/n) Σ_{i=1}^n x_i E[y_i] ) = arg min_w { (1/n) Σ_{i=1}^n E[(⟨w, x_i⟩ − y_i)²] + λ‖w‖² }.

This unbiasedness implies that the expected mean squared error of β̂_λ has the bias-variance decomposition

    E[‖β̂_λ − β‖²_Σ] = ‖E[β̂_λ] − β‖²_Σ + E[‖β̂_λ − E[β̂_λ]‖²_Σ].              (10)

The following bound on the expected excess mean squared error easily follows from this decomposition and the definition of β_λ (see, e.g., Proposition 23).

Proposition 5 (Ridge regression: fixed design) Fix λ ≥ 0, and assume Σ + λI is invertible. If there exists σ ≥ 0 such that var(y_i) ≤ σ² for all i = 1, 2, ..., n, then

    E[‖β̂_λ − β‖²_Σ] ≤ Σ_j λ_j (λ/(λ_j + λ))² β_j² + (σ²/n) Σ_j (λ_j/(λ_j + λ))²,

with equality iff var(y_i) = σ² for all i = 1, 2, ..., n.

Remark 6 (Effect of approximation error in fixed design) Observe that approx(x_i) has no effect on the expected excess mean squared error.

Remark 7 (Effective dimension) The second sum in the bound is equal to d_{2,λ} from Condition 1, which implies a notion of effective dimension at regularization level λ.

Remark 8 (Ordinary least squares in fixed design) In finite dimensional spaces of dimension d, Σ has only d non-zero eigenvalues λ_j, and therefore setting λ = 0 gives the following bound for the ordinary least squares estimator β̂_0:

    E[‖β̂_0 − β‖²_Σ] ≤ σ² d / n,

where, as before, equality holds iff var(y_i) = σ² for all i = 1, 2, ..., n.
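The bias term in Proposition 5 is deterministic, so it can be checked exactly against the closed form E[β̂_λ] = (Σ + λI)^{-1}Σβ (assuming no approximation error, per Remark 6). A minimal sketch, working in the eigenbasis of Σ with an illustrative spectrum:

```python
import numpy as np

rng = np.random.default_rng(1)
eigvals = np.array([2.0, 1.0, 0.3, 0.05])        # lam_j
beta = rng.normal(size=4)                         # coordinates beta_j in the eigenbasis
lam = 0.2

Sigma = np.diag(eigvals)                          # Sigma is diagonal in its eigenbasis
beta_lam = np.linalg.solve(Sigma + lam * np.eye(4), Sigma @ beta)  # E[beta_hat_lam]

# Bias term of Proposition 5: sum_j lam_j * (lam / (lam_j + lam))^2 * beta_j^2
bias_formula = np.sum(eigvals * (lam / (eigvals + lam)) ** 2 * beta ** 2)
bias_direct = (beta_lam - beta) @ Sigma @ (beta_lam - beta)  # ||E[beta_hat_lam] - beta||_Sigma^2
assert np.isclose(bias_direct, bias_formula)
```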

3.2. Ordinary least squares in finite dimensions

Our analysis of the ordinary least squares estimator (under random design) is based on a simple decomposition of the excess mean squared error, similar to the one from the fixed design analysis. To state the decomposition, first let β̄_0 denote the conditional expectation of the least squares estimator β̂_0 conditioned on x_1, x_2, ..., x_n, i.e.,

    β̄_0 := E[β̂_0 | x_1, x_2, ..., x_n] = Σ̂^{-1} Ê[x E[y|x]].                (11)

Also, define the bias and variance as:

    ε_bs := ‖β̄_0 − β‖²_Σ,   ε_vr := ‖β̂_0 − β̄_0‖²_Σ.

Proposition 9 (Random design decomposition) We have:

    ‖β̂_0 − β‖²_Σ ≤ ε_bs + 2√(ε_bs ε_vr) + ε_vr ≤ 2(ε_bs + ε_vr).

Proof The claim follows from the triangle inequality and the fact (a + b)² ≤ 2(a² + b²).

Remark 10 Note that, in general, E[β̂_0] ≠ β (unlike in the fixed design setting, where E[β̂_0] = β). Hence, our decomposition differs from that in the fixed design analysis (see (10)).

Our first main result characterizes the excess loss of the ordinary least squares estimator.

Theorem 11 (Ordinary least squares regression) Let d be the dimension of the covariate space. Pick any t > max{0, 2.6 − log d}. Assume Condition 1, Condition 2 (with parameter ρ_0), Condition 3 (with parameter σ), and Condition 4 (with parameter b_0) hold, and that n ≥ 6ρ_0² d (log d + t). With probability at least 1 − 3e^{−t}, the following holds.

1. Relative spectral norm error in Σ̂: Σ̂ is invertible, and ‖Σ^{-1/2} Σ̂ Σ^{-1/2} − I‖ ≤ δ_s (< 1), where Σ is defined in (1), Σ̂ is defined in (3), and

    δ_s := √(4ρ_0² d (log d + t) / n) + 2ρ_0² d (log d + t) / (3n)

(note that the lower bound on n ensures δ_s ≤ 0.93 < 1).

2. Effect of bias due to random design:

    ε_bs ≤ (2 E[‖Σ^{-1/2} x approx(x)‖²] / ((1 − δ_s)² n)) (1 + √(8t))² + 16 b_0² d t² / (9 n²)
         ≤ (2 ρ_0² d E[approx(x)²] / ((1 − δ_s)² n)) (1 + √(8t))² + 16 b_0² d t² / (9 n²),

and approx(x) is defined in (7).

3. Effect of noise:

    ε_vr ≤ σ² (d + 2√(dt) + 2t) / ((1 − δ_s) n).
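The decomposition in Proposition 9 is a purely deterministic inequality between Σ-norms, so it holds for any three vectors playing the roles of β, β̄_0, and β̂_0; a minimal numerical check with arbitrary stand-in vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.diag([1.0, 0.5, 0.25])                     # any PSD operator works
beta, beta_bar, beta_hat = rng.normal(size=(3, 3))    # stand-ins for beta, bar-beta_0, hat-beta_0

def sq_norm(v, M):
    return v @ M @ v   # ||v||_M^2

eps_bs = sq_norm(beta_bar - beta, Sigma)
eps_vr = sq_norm(beta_hat - beta_bar, Sigma)
lhs = sq_norm(beta_hat - beta, Sigma)

# Proposition 9:
#   ||hat-beta_0 - beta||_Sigma^2 <= (sqrt(eps_bs) + sqrt(eps_vr))^2 <= 2*(eps_bs + eps_vr)
assert lhs <= (np.sqrt(eps_bs) + np.sqrt(eps_vr)) ** 2 + 1e-12
assert (np.sqrt(eps_bs) + np.sqrt(eps_vr)) ** 2 <= 2 * (eps_bs + eps_vr) + 1e-12
```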

Remark 12 (Simplified form) Suppressing the terms that are o(1/n), the overall bound from Theorem 11 is

    ‖β̂_0 − β‖²_Σ ≤ 2 E[‖Σ^{-1/2} x approx(x)‖²] (1 + √(8t))² / n + σ² (d + 2√(dt) + 2t) / n + o(1/n)

(so b_0 appears only in the o(1/n) terms). If the linear model is correct (i.e., E[y|x] = ⟨β, x⟩ almost surely), then

    ‖β̂_0 − β‖²_Σ ≤ σ² (d + 2√(dt) + 2t) / n + o(1/n).                       (12)

One can show that the constants in the first-order term in (12) are the same as those that one would obtain for a fixed design tail bound.

Remark 13 (Tightness of the bound) Since

    ‖β̄_0 − β‖²_Σ = ‖(Σ^{-1/2} Σ̂ Σ^{-1/2})^{-1} Ê[Σ^{-1/2} x approx(x)]‖²

and ‖Σ^{-1/2} Σ̂ Σ^{-1/2} − I‖ → 0 as n → ∞ (Lemma 24), ‖β̄_0 − β‖²_Σ is within constant factors of ‖Ê[Σ^{-1/2} x approx(x)]‖² for sufficiently large n. Moreover,

    n · E[‖Ê[Σ^{-1/2} x approx(x)]‖²] = E[‖Σ^{-1/2} x approx(x)‖²],

which is the main term that appears in the bound for ε_bs. Similarly, ‖β̂_0 − β̄_0‖²_Σ is within constant factors of ‖β̂_0 − β̄_0‖²_Σ̂ for sufficiently large n, and E[‖β̂_0 − β̄_0‖²_Σ̂] ≤ σ² d / n, with equality iff var(y|x) = σ² (this comes from the fixed design risk bound in Remark 8). Therefore, in this case where var(y|x) = σ², we conclude that the bound in Theorem 11 is tight up to constant factors and lower-order terms.

3.3. Random design ridge regression

The analysis of the ridge estimator under random design is again based on a simple decomposition of the excess mean squared error. Here, let β̄_λ denote the conditional expectation of β̂_λ given x_1, x_2, ..., x_n, i.e.,

    β̄_λ := E[β̂_λ | x_1, x_2, ..., x_n] = (Σ̂ + λI)^{-1} Ê[x E[y|x]].          (13)

Define the bias from regularization, the bias from the random design, and the variance as:

    ε_rg := ‖β_λ − β‖²_Σ,   ε_bs := ‖β̄_λ − β_λ‖²_Σ,   ε_vr := ‖β̂_λ − β̄_λ‖²_Σ,

where β_λ is the minimizer of the λ-regularized mean squared error (see (8)).

Proposition 14 (General random design decomposition)

    ‖β̂_λ − β‖²_Σ ≤ ε_rg + ε_bs + ε_vr + 2(√(ε_rg ε_bs) + √(ε_rg ε_vr) + √(ε_bs ε_vr)) ≤ 3(ε_rg + ε_bs + ε_vr).

Proof The claim follows from the triangle inequality and the fact (a + b)² ≤ 2(a² + b²).

Remark 15 Again, note that E[β̂_λ] ≠ β_λ in general, so the bias-variance decomposition in (10) from the fixed design analysis is not directly applicable in the random design setting.

The following theorem is the main result of the paper.

Theorem 16 (Ridge regression) Fix some λ ≥ 0, and pick any t > max{0, 2.6 − log d̃_{1,λ}}. Assume Condition 1, Condition 2 (with parameter ρ_λ), Condition 3 (with parameter σ), and Condition 4 (with parameter b_λ) hold; and that n ≥ 6ρ_λ² d̃_{1,λ} (log d̃_{1,λ} + t), where d_{p,λ} for p ∈ {1, 2} is defined in (5), and d̃_{1,λ} is defined in (6). With probability at least 1 − 4e^{−t}, the following holds.

1. Relative spectral norm error in Σ̂ + λI: Σ̂ + λI is invertible, and

    ‖(Σ + λI)^{-1/2} (Σ̂ + λI) (Σ + λI)^{-1/2} − I‖ ≤ δ_s (< 1),

where Σ is defined in (1), Σ̂ is defined in (3), and

    δ_s := √(4ρ_λ² d̃_{1,λ} (log d̃_{1,λ} + t) / n) + 2ρ_λ² d̃_{1,λ} (log d̃_{1,λ} + t) / (3n)

(note that the lower bound on n ensures δ_s ≤ 0.93 < 1).

2. Frobenius norm error in Σ̂:

    ‖(Σ + λI)^{-1/2} (Σ̂ − Σ) (Σ + λI)^{-1/2}‖_F ≤ √(d_{1,λ}) δ_f,

where

    δ_f := √((ρ_λ² d_{1,λ} − d_{2,λ}/d_{1,λ}) / n) (1 + √(8t)) + 4 √(ρ_λ⁴ d_{1,λ} + d_{2,λ}/d_{1,λ}) t / (3n).

3. Effect of regularization:

    ε_rg ≤ Σ_j λ_j (λ/(λ_j + λ))² β_j².

If λ = 0, then ε_rg = 0.

4. Effect of bias due to random design:

    ε_bs ≤ (2 E[‖(Σ + λI)^{-1/2} (x approx_λ(x) − λβ_λ)‖²] / ((1 − δ_s)² n)) (1 + √(8t))² + 16 (b_λ √(d_{1,λ}) + √(ε_rg))² t² / (9 n²)
         ≤ (4 (ρ_λ² d_{1,λ} E[approx_λ(x)²] + ε_rg) / ((1 − δ_s)² n)) (1 + √(8t))² + 16 (b_λ √(d_{1,λ}) + √(ε_rg))² t² / (9 n²),

and approx_λ(x) is defined in (9). If λ = 0, then approx_λ(x) = approx(x) as defined in (7).

5. Effect of noise:

    ε_vr ≤ σ² (d_{2,λ} + √(d_{1,λ} d_{2,λ}) δ_f) / ((1 − δ_s)² n) + 2σ² √((d_{2,λ} + √(d_{1,λ} d_{2,λ}) δ_f) t) / ((1 − δ_s)^{3/2} n) + 2σ² t / ((1 − δ_s) n).

We now discuss various aspects of Theorem 16.

Remark 17 (Simplified form) Ignoring the terms that are o(1/n) and treating t as a constant, the overall bound from Theorem 16 is

    ‖β̂_λ − β‖²_Σ ≤ ‖β_λ − β‖²_Σ + O( (E[‖(Σ + λI)^{-1/2} (x approx_λ(x) − λβ_λ)‖²] + σ² d_{2,λ}) / n )
                 ≤ ‖β_λ − β‖²_Σ + O( (ρ_λ² d_{1,λ} E[approx_λ(x)²] + ‖β_λ − β‖²_Σ + σ² d_{2,λ}) / n )
                 ≤ ‖β_λ − β‖²_Σ + O( (ρ_λ² d_{1,λ} E[approx(x)²] + (ρ_λ² d_{1,λ} + 1) ‖β_λ − β‖²_Σ + σ² d_{2,λ}) / n ),

where the last inequality follows from the fact E[approx_λ(x)²] ≤ 2(E[approx(x)²] + ‖β − β_λ‖²_Σ).

Remark 18 (Effect of errors in Σ̂) The accuracy of Σ̂ has a relatively mild effect on the bound: it appears essentially through multiplicative factors 1/(1 − δ_s) = 1 + O(δ_s) and 1 + δ_f, where both δ_s and δ_f are decreasing with n (as n^{-1/2}), and therefore only contribute to lower-order terms overall.

Remark 19 (Comparison to fixed design) As already discussed, the ridge estimator behaves similarly under fixed and random designs, with the main differences being the lack of errors in Σ̂ under fixed design, and the influence of approximation error under random design. These are revealed through the quantities ρ_λ and d_{1,λ} (and b_λ in lower-order terms), which are needed to apply the probability tail inequalities. Therefore, the scaling of ρ_λ² d_{1,λ} with n crucially controls the effect of random design compared to fixed design.

Acknowledgments

We thank Dean Foster, David McAllester, and Robert Stine for many insightful discussions.

References

J.-Y. Audibert and O. Catoni. Robust linear least squares regression, 2010a. arXiv preprint.

J.-Y. Audibert and O. Catoni. Robust linear regression through PAC-Bayesian truncation, 2010b. arXiv preprint.

A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

O. Catoni. Statistical Learning Theory and Stochastic Optimization, Lectures on Probability and Statistics, École d'Été de Probabilités de Saint-Flour XXXI—2001, volume 1851 of Lecture Notes in Mathematics. Springer, 2004.

L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, 2004.

A. E. Hoerl. Application of ridge analysis to regression problems. Chemical Engineering Progress, 58:54–59, 1962.

R. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.

D. Hsu, S. M. Kakade, and T. Zhang. A tail inequality for quadratic forms of subgaussian random vectors, 2011. arXiv preprint.

D. Hsu, S. M. Kakade, and T. Zhang. Tail inequalities for sums of random matrices that depend on the intrinsic dimension. Electronic Communications in Probability, 17(14):1–13, 2012.

V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.

B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.

S. Smale and D.-X. Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26:153–172, 2007.

I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory, pages 79–93, 2009.

G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, 1990.

C. J. Stone. Optimal global rates of convergence for nonparametric regression. Annals of Statistics, 10:1040–1053, 1982.

T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17:2077–2098, 2005.

Appendix A. Application to smoothing splines

The applications of ridge regression considered by Zhang (2005) can also be analyzed using Theorem 16. We specifically consider the problem of approximating a periodic function with smoothing splines, which are functions f : ℝ → ℝ whose s-th derivatives f^{(s)}, for some s > 1/2, satisfy

    ∫ (f^{(s)}(t))² dt < ∞.

The one-dimensional covariate t ∈ ℝ can be mapped to the infinite dimensional representation x := φ(t), where

    x_{2k} := sin(kt) / (k + 1)^s   and   x_{2k+1} := cos(kt) / (k + 1)^s,   k ∈ {0, 1, 2, ...}.

Assume that the regression function is E[y|x] = ⟨β, x⟩, so approx(x) = 0 almost surely. Observe that

    ‖x‖² ≤ Σ_k 1/(k + 1)^{2s} ≤ 2s/(2s − 1),

so Condition 2 is satisfied with ρ_λ := (2s / ((2s − 1) λ d_{1,λ}))^{1/2} as per Remark 2. Therefore, the simplified bound from Remark 17 becomes, in this case,

    ‖β̂_λ − β‖²_Σ ≤ ‖β_λ − β‖²_Σ + C ( (2s/((2s − 1) λ n) + 1/n) ‖β_λ − β‖²_Σ + σ² d_{2,λ}/n )
                 ≤ λ ‖β‖²/2 + C ( σ² d_{2,λ}/n + (2s/((2s − 1) λ n) + 1/n) λ ‖β‖²/2 )

for some constant C > 0, where we have used the inequality ‖β_λ − β‖²_Σ ≤ λ ‖β‖²/2. Zhang (2005, Section 5.3) shows that

    d_{1,λ} ≤ 2k + 1   if   λ ≥ ((2s − 1)/(2s)) k^{−2s}.

Since d_{2,λ} ≤ d_{1,λ}, it follows that setting λ := k^{−2s} where k := (((2s − 1)/(2s)) n)^{1/(2s+1)} gives the bound

    ‖β̂_λ − β‖²_Σ ≤ ( ‖β‖²/2 + 2Cσ² ) (2s/(2s − 1)) n^{−2s/(2s+1)} + lower-order terms,

which has the optimal data-dependent rate of n^{−2s/(2s+1)} (Stone, 1982).

Appendix B. Proofs of Theorem 11 and Theorem 16

The proof of Theorem 16 uses the decomposition of ‖β̂_λ − β‖²_Σ in Proposition 14, and then bounds each term using the lemmas proved in this section.

The proof of Theorem 11 omits one term from the decomposition in Proposition 14 due to the fact that β_λ = β when λ = 0; and it uses a slightly simpler argument to handle the effect of noise (Lemma 28 rather than Lemma 29), which reduces the number of lower-order terms. Other than these differences, the proof is the same as that for Theorem 16 in the special case of λ = 0.

Define

    Σ_λ := Σ + λI,                                                          (14)
    Σ̂_λ := Σ̂ + λI,  and                                                     (15)
    Δ := Σ_λ^{-1/2} (Σ̂ − Σ) Σ_λ^{-1/2}                                       (16)
       = Σ_λ^{-1/2} (Σ̂_λ − Σ_λ) Σ_λ^{-1/2}.

Recall the basic decomposition from Proposition 14:

    ‖β̂_λ − β‖²_Σ ≤ ( ‖β_λ − β‖_Σ + ‖β̄_λ − β_λ‖_Σ + ‖β̂_λ − β̄_λ‖_Σ )².

Section B.1 first establishes basic properties of β and β_λ, which are then used to bound ‖β_λ − β‖²_Σ; this part is exactly the same as the standard fixed design analysis of ridge regression. Section B.2 employs probability tail inequalities for the spectral and Frobenius norms of random matrices to bound the matrix errors in estimating Σ_λ with Σ̂_λ. Finally, Section B.3 and Section B.4 bound the contributions of approximation error (in ‖β̄_λ − β_λ‖²_Σ) and noise (in ‖β̂_λ − β̄_λ‖²_Σ), respectively, using probability tail inequalities for random vectors as well as the matrix error bounds for Σ̂_λ.

B.1. Basic properties of β and β_λ, and the effect of regularization

Proposition 20 (Normal equations) E[⟨w, x⟩ y] = E[⟨w, x⟩ ⟨β, x⟩] for any w.

Proof It suffices to prove the claim for w = v_j. Since E[⟨v_j, x⟩ ⟨v_{j'}, x⟩] = 0 for j ≠ j', it follows that

    E[⟨v_j, x⟩ ⟨β, x⟩] = Σ_{j'} β_{j'} E[⟨v_j, x⟩ ⟨v_{j'}, x⟩] = β_j E[⟨v_j, x⟩²] = E[⟨v_j, x⟩ y],

where the last equality follows from the definition of β in (2).

Proposition 21 (Excess mean squared error) E[(⟨w, x⟩ − y)²] − E[(⟨β, x⟩ − y)²] = E[⟨w − β, x⟩²] for any w.

Proof Directly expanding the squares in the expectations reveals that

    E[(⟨w, x⟩ − y)²] − E[(⟨β, x⟩ − y)²]
      = E[⟨w, x⟩²] − 2E[⟨w, x⟩ y] + 2E[⟨β, x⟩ y] − E[⟨β, x⟩²]
      = E[⟨w, x⟩²] − 2E[⟨w, x⟩ ⟨β, x⟩] + 2E[⟨β, x⟩ ⟨β, x⟩] − E[⟨β, x⟩²]
      = E[⟨w, x⟩² − 2⟨w, x⟩ ⟨β, x⟩ + ⟨β, x⟩²]
      = E[⟨w − β, x⟩²],

where the second equality follows from Proposition 20.

Proposition 22 (Shrinkage) For any j, ⟨v_j, β_λ⟩ = (λ_j/(λ_j + λ)) β_j.

Proof Since (Σ + λI)^{-1} = Σ_j (λ_j + λ)^{-1} v_j ⊗ v_j,

    ⟨v_j, β_λ⟩ = ⟨v_j, (Σ + λI)^{-1} E[xy]⟩ = E[⟨v_j, x⟩ y] / (λ_j + λ) = (λ_j/(λ_j + λ)) · E[⟨v_j, x⟩ y] / E[⟨v_j, x⟩²] = (λ_j/(λ_j + λ)) β_j.

Proposition 23 (Effect of regularization)

    ‖β_λ − β‖²_Σ = Σ_j λ_j (λ/(λ_j + λ))² β_j².

Proof By Proposition 22,

    ⟨v_j, β − β_λ⟩ = β_j − (λ_j/(λ_j + λ)) β_j = (λ/(λ_j + λ)) β_j.

Therefore,

    ‖β_λ − β‖²_Σ = Σ_j λ_j ((λ/(λ_j + λ)) β_j)² = Σ_j λ_j (λ/(λ_j + λ))² β_j².

B.2. Effect of errors in Σ̂

Lemma 24 (Spectral norm error in Σ̂) Assume Condition 1 and Condition 2 (with parameter ρ_λ) hold. Pick any t > max{0, 2.6 − log d̃_{1,λ}}. With probability at least 1 − e^{−t},

    ‖Δ‖ ≤ √(4ρ_λ² d̃_{1,λ} (log d̃_{1,λ} + t) / n) + 2ρ_λ² d̃_{1,λ} (log d̃_{1,λ} + t) / (3n),

where Δ is defined in (16).

Proof The claim is a consequence of the tail inequality from Lemma 32. First, define

    x̃ := Σ_λ^{-1/2} x   and   Σ̃ := Σ_λ^{-1/2} Σ Σ_λ^{-1/2}

(where Σ_λ is defined in (14)), and let

    Z := x̃ ⊗ x̃ − Σ̃ = Σ_λ^{-1/2} (x ⊗ x − Σ) Σ_λ^{-1/2},
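Since (Σ + λI)^{-1} shares the eigenvectors of Σ, the shrinkage identity of Proposition 22 holds coordinate-wise in the eigenbasis and is basis-independent; a minimal numerical check with an arbitrary positive definite Σ of our choosing (taking E[xy] = Σβ, i.e., no approximation error):

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 0.3

# A random positive definite second-moment operator Sigma (d = 4).
A = rng.normal(size=(4, 4))
Sigma = A @ A.T
eigvals, V = np.linalg.eigh(Sigma)      # columns of V are the eigenvectors v_j

beta = rng.normal(size=4)
# beta_lam = (Sigma + lam*I)^{-1} E[xy] with E[xy] = Sigma beta:
beta_lam = np.linalg.solve(Sigma + lam * np.eye(4), Sigma @ beta)

# Proposition 22: <v_j, beta_lam> = lam_j / (lam_j + lam) * <v_j, beta>
assert np.allclose(V.T @ beta_lam, eigvals / (eigvals + lam) * (V.T @ beta))
```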

so Δ = Ê[Z]. Observe that E[Z] = 0 and

    ‖Z‖ = max{λ_max[Z], λ_max[−Z]} ≤ max{‖x̃‖², 1} ≤ ρ_λ² d̃_{1,λ},

where the second inequality follows from Condition 2. Moreover,

    E[Z²] = E[(x̃ ⊗ x̃)²] − Σ̃² = E[‖x̃‖² (x̃ ⊗ x̃)] − Σ̃²,

so

    λ_max[E[Z²]] ≤ λ_max[E[(x̃ ⊗ x̃)²]] ≤ ρ_λ² d_{1,λ} λ_max[Σ̃] ≤ ρ_λ² d̃_{1,λ},
    tr(E[Z²]) ≤ tr(E[‖x̃‖² (x̃ ⊗ x̃)]) ≤ ρ_λ² d_{1,λ} tr(Σ̃) = ρ_λ² d²_{1,λ}.

The claim now follows from Lemma 32 (recall that d̃_{1,λ} = max{1, d_{1,λ}}).

Lemma 25 (Relative spectral norm error in Σ̂_λ^{-1}) If ‖Δ‖ < 1, where Δ is defined in (16), then

    ‖Σ_λ^{1/2} Σ̂_λ^{-1} Σ_λ^{1/2}‖ ≤ 1 / (1 − ‖Δ‖),

where Σ_λ is defined in (14) and Σ̂_λ is defined in (15).

Proof Observe that

    Σ_λ^{-1/2} Σ̂_λ Σ_λ^{-1/2} = Σ_λ^{-1/2} (Σ_λ + Σ̂_λ − Σ_λ) Σ_λ^{-1/2} = I + Σ_λ^{-1/2} (Σ̂_λ − Σ_λ) Σ_λ^{-1/2} = I + Δ,

and that λ_min[I + Δ] > 0 by the assumption ‖Δ‖ < 1 and Weyl's theorem (Horn and Johnson, 1985, Theorem 4.3.1). Therefore,

    ‖Σ_λ^{1/2} Σ̂_λ^{-1} Σ_λ^{1/2}‖ = λ_max[(Σ_λ^{-1/2} Σ̂_λ Σ_λ^{-1/2})^{-1}] = λ_max[(I + Δ)^{-1}] = 1/λ_min[I + Δ] ≤ 1/(1 − ‖Δ‖).

Lemma 26 (Frobenius norm error in Σ̂) Assume Condition 1 and Condition 2 (with parameter ρ_λ) hold. Pick any t > 0. With probability at least 1 − e^{−t},

    ‖Δ‖_F ≤ √((E[‖Σ_λ^{-1/2} x‖⁴] − d_{2,λ}) / n) (1 + √(8t)) + 4 √(ρ_λ⁴ d²_{1,λ} + d_{2,λ}) t / (3n)
          ≤ √((ρ_λ² d²_{1,λ} − d_{2,λ}) / n) (1 + √(8t)) + 4 √(ρ_λ⁴ d²_{1,λ} + d_{2,λ}) t / (3n),

where Δ is defined in (16).
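Lemma 25 is deterministic given ‖Δ‖ < 1, so it can be checked on a single simulated sample; a minimal sketch (dimensions, sample size, and the diagonal Σ are illustrative choices of ours, and for this seed the draw satisfies the hypothesis ‖Δ‖ < 1):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, lam = 3, 500, 0.2

Sigma = np.diag([1.0, 0.4, 0.1])                    # population second-moment operator
X = rng.normal(size=(n, d)) * np.sqrt(np.diag(Sigma))
Sigma_hat = X.T @ X / n

w, U = np.linalg.eigh(Sigma + lam * np.eye(d))
Sl_sqrt  = U @ np.diag(w ** 0.5)  @ U.T             # (Sigma + lam I)^{1/2}
Sl_isqrt = U @ np.diag(w ** -0.5) @ U.T             # (Sigma + lam I)^{-1/2}

Delta = Sl_isqrt @ (Sigma_hat - Sigma) @ Sl_isqrt   # Delta from (16)
spec = lambda M: np.linalg.norm(M, 2)               # spectral norm

assert spec(Delta) < 1                              # hypothesis of Lemma 25 (holds for this draw)
lhs = spec(Sl_sqrt @ np.linalg.inv(Sigma_hat + lam * np.eye(d)) @ Sl_sqrt)
assert lhs <= 1 / (1 - spec(Delta)) + 1e-9          # conclusion of Lemma 25
```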

Proof The claim is a consequence of the tail inequality in Lemma 31. As in the proof of Lemma 24, define x̃ := Σ_λ^{-1/2} x and Σ̃ := Σ_λ^{-1/2} Σ Σ_λ^{-1/2}, and let Z := x̃ ⊗ x̃ − Σ̃, so Δ = Ê[Z]. Now endow the space of self-adjoint linear operators with the inner product given by ⟨A, B⟩_F := tr(AB), and note that this inner product induces the Frobenius norm ‖M‖_F = √⟨M, M⟩_F. Observe that E[Z] = 0 and

    ‖Z‖²_F = ⟨x̃ ⊗ x̃ − Σ̃, x̃ ⊗ x̃ − Σ̃⟩_F
           = ⟨x̃ ⊗ x̃, x̃ ⊗ x̃⟩_F − 2⟨x̃ ⊗ x̃, Σ̃⟩_F + ⟨Σ̃, Σ̃⟩_F
           = ‖x̃‖⁴ − 2‖x̃‖²_{Σ̃} + tr(Σ̃²)
           = ‖x̃‖⁴ − 2‖x̃‖²_{Σ̃} + d_{2,λ}
           ≤ ρ_λ⁴ d²_{1,λ} + d_{2,λ},

where the inequality follows from Condition 2. Moreover,

    E[‖Z‖²_F] = E[⟨x̃ ⊗ x̃, x̃ ⊗ x̃⟩_F] − ⟨Σ̃, Σ̃⟩_F = E[‖x̃‖⁴] − d_{2,λ} ≤ ρ_λ² d_{1,λ} E[‖x̃‖²] − d_{2,λ} = ρ_λ² d²_{1,λ} − d_{2,λ},

where the inequality again uses Condition 2. The claim now follows from Lemma 31.

B.3. Effect of approximation error

Lemma 27 (Effect of approximation error) Assume Condition 1, Condition 2 (with parameter ρ_λ), and Condition 4 (with parameter b_λ) hold. Pick any t > 0. If ‖Δ‖ < 1, where Δ is defined in (16), then

    ‖β̄_λ − β_λ‖_Σ ≤ ‖Ê[x approx_λ(x) − λβ_λ]‖_{Σ_λ^{-1}} / (1 − ‖Δ‖),

where β̄_λ is defined in (13), β_λ is defined in (8), approx_λ(x) is defined in (9), and Σ_λ is defined in (14). Moreover, with probability at least 1 − e^{−t},

    ‖Ê[x approx_λ(x) − λβ_λ]‖_{Σ_λ^{-1}}
      ≤ √(E[‖Σ_λ^{-1/2} (x approx_λ(x) − λβ_λ)‖²] / n) (1 + √(8t)) + 4 (b_λ √(d_{1,λ}) + ‖β − β_λ‖_Σ) t / (3n)
      ≤ √(2 (ρ_λ² d_{1,λ} E[approx_λ(x)²] + ‖β − β_λ‖²_Σ) / n) (1 + √(8t)) + 4 (b_λ √(d_{1,λ}) + ‖β − β_λ‖_Σ) t / (3n).

Proof By the definitions of $\bar{\beta}_\lambda$ and $\beta_\lambda$,
\[
\bar{\beta}_\lambda - \beta_\lambda
= \hat{\Sigma}_\lambda^{-1} \hat{\mathbb{E}}\bigl[x \mathbb{E}[y|x]\bigr] - \beta_\lambda
= \hat{\Sigma}_\lambda^{-1} \Bigl( \hat{\mathbb{E}}\bigl[x \bigl(\operatorname{approx}_\lambda(x) + \langle \beta_\lambda, x \rangle\bigr)\bigr] - \hat{\Sigma}_\lambda \beta_\lambda \Bigr)
= \hat{\Sigma}_\lambda^{-1} \bigl( \hat{\mathbb{E}}[x \operatorname{approx}_\lambda(x)] - \lambda \beta_\lambda \bigr).
\]
Therefore, using the sub-multiplicative property of the spectral norm,
\[
\|\bar{\beta}_\lambda - \beta_\lambda\|_\Sigma
\le \bigl\|\Sigma^{1/2} \Sigma_\lambda^{-1/2}\bigr\| \cdot \bigl\|\Sigma_\lambda^{1/2} \hat{\Sigma}_\lambda^{-1} \Sigma_\lambda^{1/2}\bigr\| \cdot \bigl\|\hat{\mathbb{E}}[x \operatorname{approx}_\lambda(x)] - \lambda \beta_\lambda\bigr\|_{\Sigma_\lambda^{-1}}
\le \frac{\bigl\|\hat{\mathbb{E}}[x \operatorname{approx}_\lambda(x)] - \lambda \beta_\lambda\bigr\|_{\Sigma_\lambda^{-1}}}{1 - \|\Delta\|},
\]
where the second inequality follows from Lemma 25 and because
\[
\bigl\|\Sigma^{1/2} \Sigma_\lambda^{-1/2}\bigr\|^2 = \lambda_{\max}\bigl[\Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}\bigr] = \max_i \frac{\lambda_i}{\lambda_i + \lambda} \le 1.
\]
The second part of the claim is a consequence of the tail inequality in Lemma 31. Observe that
\[
\mathbb{E}[x \operatorname{approx}(x)] = \mathbb{E}\bigl[x \bigl(\mathbb{E}[y|x] - \langle \beta, x \rangle\bigr)\bigr] = 0
\]
by Proposition 20, and that
\[
\mathbb{E}\bigl[x \langle \beta - \beta_\lambda, x \rangle\bigr] - \lambda \beta_\lambda = \Sigma \beta - (\Sigma + \lambda I) \beta_\lambda = 0.
\]
Therefore,
\[
\mathbb{E}\bigl[\Sigma_\lambda^{-1/2} \bigl(x \operatorname{approx}_\lambda(x) - \lambda \beta_\lambda\bigr)\bigr]
= \Sigma_\lambda^{-1/2} \Bigl( \mathbb{E}\bigl[x \bigl(\operatorname{approx}(x) + \langle \beta - \beta_\lambda, x \rangle\bigr)\bigr] - \lambda \beta_\lambda \Bigr) = 0.
\]
Moreover, by Proposition 22 and Proposition 23,
\[
\lambda^2 \bigl\|\Sigma_\lambda^{-1/2} \beta_\lambda\bigr\|^2
= \sum_j \frac{\lambda^2}{\lambda_j + \lambda} \langle v_j, \beta_\lambda \rangle^2
= \sum_j \frac{\lambda_j}{\lambda_j + \lambda} \cdot \lambda_j \Bigl(\frac{\lambda}{\lambda_j + \lambda}\Bigr)^2 \beta_j^2
\le \sum_j \lambda_j \Bigl(\frac{\lambda}{\lambda_j + \lambda}\Bigr)^2 \beta_j^2
= \|\beta - \beta_\lambda\|_\Sigma^2. \tag{17}
\]
Combining the inequality from (17) with Condition 4 and the triangle inequality, it follows that
\[
\bigl\|\Sigma_\lambda^{-1/2} \bigl(x \operatorname{approx}_\lambda(x) - \lambda \beta_\lambda\bigr)\bigr\|
\le \bigl\|\Sigma_\lambda^{-1/2} x \operatorname{approx}_\lambda(x)\bigr\| + \lambda \bigl\|\Sigma_\lambda^{-1/2} \beta_\lambda\bigr\|
\le b_\lambda \sqrt{d_{1,\lambda}} + \|\beta - \beta_\lambda\|_\Sigma.
\]

Finally, by the triangle inequality, the fact $(a + b)^2 \le 2(a^2 + b^2)$, the inequality from (17), and Condition 2,
\[
\mathbb{E}\bigl[\|\Sigma_\lambda^{-1/2} (x \operatorname{approx}_\lambda(x) - \lambda \beta_\lambda)\|^2\bigr]
\le 2 \Bigl( \mathbb{E}\bigl[\|\Sigma_\lambda^{-1/2} x \operatorname{approx}_\lambda(x)\|^2\bigr] + \lambda^2 \bigl\|\Sigma_\lambda^{-1/2} \beta_\lambda\bigr\|^2 \Bigr)
\le 2 \bigl( \rho_\lambda^2 d_{1,\lambda} \mathbb{E}[\operatorname{approx}_\lambda(x)^2] + \|\beta - \beta_\lambda\|_\Sigma^2 \bigr).
\]
The claim now follows from Lemma 31.

B.4. Effect of noise

Lemma 28 (Effect of noise, $\lambda = 0$) Assume the dimension of the covariate space is $d < \infty$ and that $\lambda = 0$. Assume Condition 3 (with parameter $\sigma$) holds. Pick any $t > 0$. With probability at least $1 - e^{-t}$, either $\|\Delta_0\| \ge 1$, or $\|\Delta_0\| < 1$ and
\[
\|\bar{\beta}_0 - \hat{\beta}_0\|_\Sigma^2 \le \frac{\sigma^2 \bigl(d + 2\sqrt{dt} + 2t\bigr)}{(1 - \|\Delta_0\|)\, n},
\]
where $\Delta_0$ is defined in (16).

Proof Observe that
\[
\|\bar{\beta}_0 - \hat{\beta}_0\|_\Sigma^2
\le \bigl\|\Sigma^{1/2} \hat{\Sigma}^{-1/2}\bigr\|^2 \|\bar{\beta}_0 - \hat{\beta}_0\|_{\hat{\Sigma}}^2
= \bigl\|\Sigma^{1/2} \hat{\Sigma}^{-1} \Sigma^{1/2}\bigr\| \|\bar{\beta}_0 - \hat{\beta}_0\|_{\hat{\Sigma}}^2;
\]
and if $\|\Delta_0\| < 1$, then $\|\Sigma^{1/2} \hat{\Sigma}^{-1} \Sigma^{1/2}\| \le 1/(1 - \|\Delta_0\|)$ by Lemma 25. Let $\xi := (\operatorname{noise}(x_1), \operatorname{noise}(x_2), \ldots, \operatorname{noise}(x_n))$ be the random vector whose $i$-th component is $\operatorname{noise}(x_i) = y_i - \mathbb{E}[y_i | x_i]$. By the definitions of $\hat{\beta}_0$ and $\bar{\beta}_0$,
\[
\|\hat{\beta}_0 - \bar{\beta}_0\|_{\hat{\Sigma}}^2 = \bigl\|\hat{\Sigma}^{-1/2} \hat{\mathbb{E}}\bigl[x (y - \mathbb{E}[y|x])\bigr]\bigr\|^2 = \xi^\top K \xi,
\]
where $K \in \mathbb{R}^{n \times n}$ is the symmetric matrix whose $(i,j)$-th entry is $K_{i,j} := \frac{1}{n^2} \langle \hat{\Sigma}^{-1/2} x_i, \hat{\Sigma}^{-1/2} x_j \rangle$. Note that the non-zero eigenvalues of $K$ are the same as those of
\[
\frac{1}{n} \hat{\mathbb{E}}\bigl[(\hat{\Sigma}^{-1/2} x) \otimes (\hat{\Sigma}^{-1/2} x)\bigr] = \frac{1}{n} \hat{\Sigma}^{-1/2} \hat{\Sigma} \hat{\Sigma}^{-1/2} = \frac{1}{n} I.
\]
By Lemma 30, with probability at least $1 - e^{-t}$ (conditioned on $x_1, x_2, \ldots, x_n$),
\[
\xi^\top K \xi \le \sigma^2 \bigl( \operatorname{tr}(K) + 2\sqrt{\operatorname{tr}(K^2)\, t} + 2 \lambda_{\max}(K)\, t \bigr) = \frac{\sigma^2 \bigl(d + 2\sqrt{dt} + 2t\bigr)}{n}.
\]
The claim follows.

Lemma 29 (Effect of noise, $\lambda \ge 0$) Assume Condition 1 and Condition 3 (with parameter $\sigma$) hold. Pick any $t > 0$. Let $K$ be the symmetric matrix whose $(i,j)$-th entry is
\[
K_{i,j} := \frac{1}{n^2} \bigl\langle \Sigma^{1/2} \hat{\Sigma}_\lambda^{-1} x_i,\ \Sigma^{1/2} \hat{\Sigma}_\lambda^{-1} x_j \bigr\rangle
\]

where $\hat{\Sigma}_\lambda$ is defined in (15). With probability at least $1 - e^{-t}$,
\[
\|\bar{\beta}_\lambda - \hat{\beta}_\lambda\|_\Sigma^2 \le \sigma^2 \bigl( \operatorname{tr}(K) + 2\sqrt{\operatorname{tr}(K) \lambda_{\max}(K)\, t} + 2 \lambda_{\max}(K)\, t \bigr).
\]
Moreover, if $\|\Delta\| < 1$, where $\Delta$ is defined in (16), then
\[
\lambda_{\max}(K) \le \frac{1}{n (1 - \|\Delta\|)}
\quad\text{and}\quad
\operatorname{tr}(K) \le \frac{d_{2,\lambda} + \sqrt{d_{2,\lambda}}\, \|\Delta\|_F}{n (1 - \|\Delta\|)^2}.
\]

Proof Let $\xi := (\operatorname{noise}(x_1), \operatorname{noise}(x_2), \ldots, \operatorname{noise}(x_n))$ be the random vector whose $i$-th component is $\operatorname{noise}(x_i) = y_i - \mathbb{E}[y_i | x_i]$. By the definitions of $\hat{\beta}_\lambda$, $\bar{\beta}_\lambda$, and $K$,
\[
\|\hat{\beta}_\lambda - \bar{\beta}_\lambda\|_\Sigma^2 = \bigl\| \hat{\Sigma}_\lambda^{-1} \hat{\mathbb{E}}\bigl[x (y - \mathbb{E}[y|x])\bigr] \bigr\|_\Sigma^2 = \xi^\top K \xi.
\]
By Lemma 30, with probability at least $1 - e^{-t}$ (conditioned on $x_1, x_2, \ldots, x_n$),
\[
\xi^\top K \xi
\le \sigma^2 \bigl( \operatorname{tr}(K) + 2\sqrt{\operatorname{tr}(K^2)\, t} + 2 \lambda_{\max}(K)\, t \bigr)
\le \sigma^2 \bigl( \operatorname{tr}(K) + 2\sqrt{\operatorname{tr}(K) \lambda_{\max}(K)\, t} + 2 \lambda_{\max}(K)\, t \bigr),
\]
where the second inequality follows from von Neumann's theorem (Horn and Johnson, 1985, page 423). Note that the non-zero eigenvalues of $K$ are the same as those of
\[
\frac{1}{n} \hat{\mathbb{E}}\bigl[ (\Sigma^{1/2} \hat{\Sigma}_\lambda^{-1} x) \otimes (\Sigma^{1/2} \hat{\Sigma}_\lambda^{-1} x) \bigr]
= \frac{1}{n} \Sigma^{1/2} \hat{\Sigma}_\lambda^{-1} \hat{\Sigma} \hat{\Sigma}_\lambda^{-1} \Sigma^{1/2}.
\]
To bound $\lambda_{\max}(K)$, observe that $\hat{\Sigma}_\lambda^{-1} \hat{\Sigma} \hat{\Sigma}_\lambda^{-1} \preceq \hat{\Sigma}_\lambda^{-1}$, so by the sub-multiplicative property of the spectral norm and Lemma 25,
\[
\lambda_{\max}(K)
\le \frac{1}{n} \bigl\| \Sigma^{1/2} \hat{\Sigma}_\lambda^{-1} \Sigma^{1/2} \bigr\|
\le \frac{1}{n} \bigl\| \Sigma^{1/2} \Sigma_\lambda^{-1/2} \bigr\|^2 \bigl\| \Sigma_\lambda^{1/2} \hat{\Sigma}_\lambda^{-1} \Sigma_\lambda^{1/2} \bigr\|
\le \frac{1}{n (1 - \|\Delta\|)}.
\]
To bound $\operatorname{tr}(K)$, first define the $\lambda$-whitened versions of $\Sigma$, $\hat{\Sigma}$, and $\hat{\Sigma}_\lambda$:
\[
\Sigma_w := \Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}, \qquad
\hat{\Sigma}_w := \Sigma_\lambda^{-1/2} \hat{\Sigma} \Sigma_\lambda^{-1/2}, \qquad
\hat{\Sigma}_{\lambda,w} := \Sigma_\lambda^{-1/2} \hat{\Sigma}_\lambda \Sigma_\lambda^{-1/2}.
\]
Using these definitions with the cycle property of the trace,
\[
\operatorname{tr}(K)
= \frac{1}{n} \operatorname{tr}\bigl( \Sigma^{1/2} \hat{\Sigma}_\lambda^{-1} \hat{\Sigma} \hat{\Sigma}_\lambda^{-1} \Sigma^{1/2} \bigr)
= \frac{1}{n} \operatorname{tr}\bigl( \hat{\Sigma}_{\lambda,w}^{-1} \Sigma_w \hat{\Sigma}_{\lambda,w}^{-1} \hat{\Sigma}_w \bigr).
\]

Let $\{\lambda_j[M]\}$ denote the eigenvalues of a linear operator $M$. By von Neumann's theorem (Horn and Johnson, 1985, page 423),
\[
\operatorname{tr}\bigl( \hat{\Sigma}_{\lambda,w}^{-1} \Sigma_w \hat{\Sigma}_{\lambda,w}^{-1} \hat{\Sigma}_w \bigr)
\le \sum_j \lambda_j\bigl[ \hat{\Sigma}_{\lambda,w}^{-1} \Sigma_w \hat{\Sigma}_{\lambda,w}^{-1} \bigr] \lambda_j[\hat{\Sigma}_w],
\]
and by Ostrowski's theorem (Horn and Johnson, 1985, Theorem 4.5.9),
\[
\lambda_j\bigl[ \hat{\Sigma}_{\lambda,w}^{-1} \Sigma_w \hat{\Sigma}_{\lambda,w}^{-1} \bigr]
\le \lambda_{\max}\bigl[ \hat{\Sigma}_{\lambda,w}^{-1} \bigr]^2 \lambda_j[\Sigma_w].
\]
Therefore
\[
\operatorname{tr}\bigl( \hat{\Sigma}_{\lambda,w}^{-1} \Sigma_w \hat{\Sigma}_{\lambda,w}^{-1} \hat{\Sigma}_w \bigr)
\le \lambda_{\max}\bigl[ \hat{\Sigma}_{\lambda,w}^{-1} \bigr]^2 \sum_j \lambda_j[\hat{\Sigma}_w] \lambda_j[\Sigma_w]
\le \Bigl( \frac{1}{1 - \|\Delta\|} \Bigr)^2 \sum_j \lambda_j[\hat{\Sigma}_w] \lambda_j[\Sigma_w]
\]
\[
= \Bigl( \frac{1}{1 - \|\Delta\|} \Bigr)^2 \sum_j \Bigl( \lambda_j[\Sigma_w]^2 + \bigl( \lambda_j[\hat{\Sigma}_w] - \lambda_j[\Sigma_w] \bigr) \lambda_j[\Sigma_w] \Bigr)
\le \Bigl( \frac{1}{1 - \|\Delta\|} \Bigr)^2 \Bigl( d_{2,\lambda} + \sqrt{\textstyle\sum_j \bigl( \lambda_j[\hat{\Sigma}_w] - \lambda_j[\Sigma_w] \bigr)^2} \sqrt{\textstyle\sum_j \lambda_j[\Sigma_w]^2} \Bigr)
\]
\[
\le \Bigl( \frac{1}{1 - \|\Delta\|} \Bigr)^2 \Bigl( d_{2,\lambda} + \bigl\| \hat{\Sigma}_w - \Sigma_w \bigr\|_F \sqrt{d_{2,\lambda}} \Bigr)
= \Bigl( \frac{1}{1 - \|\Delta\|} \Bigr)^2 \Bigl( d_{2,\lambda} + \|\Delta\|_F \sqrt{d_{2,\lambda}} \Bigr),
\]
where the second inequality follows from Lemma 25, the third inequality follows from Cauchy-Schwarz, and the fourth inequality follows from Mirsky's theorem (Stewart and Sun, 1990, Corollary 4.13).

Appendix C. Probability tail inequalities

The following probability tail inequalities are used in our analysis. These specific inequalities were chosen in order to satisfy the general conditions set up in Section 2.4; however, our analysis can specialize or generalize with the availability of other tail inequalities of these sorts.

The first tail inequality is for positive semidefinite quadratic forms of a subgaussian random vector. It generalizes a standard tail inequality for Gaussian random vectors based on linear combinations of $\chi^2$ random variables (Laurent and Massart, 2000). We give the proof for completeness.

Lemma 30 (Quadratic forms of a subgaussian random vector; Hsu et al., 2011) Let $\xi$ be a random vector taking values in $\mathbb{R}^n$ such that for some $c \ge 0$,
\[
\mathbb{E}\bigl[\exp(\langle u, \xi \rangle)\bigr] \le \exp\bigl(c \|u\|^2 / 2\bigr) \quad \text{for all } u \in \mathbb{R}^n.
\]

For all symmetric positive semidefinite matrices $K \succeq 0$ and all $t > 0$,
\[
\Pr\Bigl[ \xi^\top K \xi > c \bigl( \operatorname{tr}(K) + 2\sqrt{\operatorname{tr}(K^2)\, t} + 2 \|K\|\, t \bigr) \Bigr] \le e^{-t}.
\]

Proof Let $z \in \mathbb{R}^n$ be a vector of i.i.d. standard normal random variables (independent of $\xi$). For any $\tau \ge 0$ and $\lambda \ge 0$, let $\eta := c \lambda^2 / 2$, so
\[
\mathbb{E}\bigl[ \exp(\lambda \langle z, K^{1/2} \xi \rangle) \bigr]
\ge \mathbb{E}\Bigl[ \exp(\lambda \langle z, K^{1/2} \xi \rangle)\, \mathbf{1}\bigl\{ \|K^{1/2} \xi\|^2 > c (\operatorname{tr}(K) + \tau) \bigr\} \Bigr]
\]
\[
\ge \exp\bigl( \lambda^2 c (\operatorname{tr}(K) + \tau) / 2 \bigr) \Pr\bigl[ \|K^{1/2} \xi\|^2 > c (\operatorname{tr}(K) + \tau) \bigr]
= \exp\bigl( \eta (\operatorname{tr}(K) + \tau) \bigr) \Pr\bigl[ \|K^{1/2} \xi\|^2 > c (\operatorname{tr}(K) + \tau) \bigr] \tag{8}
\]
since $\mathbb{E}[\exp(\langle u, z \rangle)] = \exp(\|u\|^2 / 2)$ for any $u \in \mathbb{R}^n$. Moreover, by independence of $\xi$ and $z$,
\[
\mathbb{E}\bigl[ \exp(\lambda \langle z, K^{1/2} \xi \rangle) \bigr]
= \mathbb{E}\Bigl[ \mathbb{E}\bigl[ \exp(\lambda \langle K^{1/2} z, \xi \rangle) \,\bigm|\, z \bigr] \Bigr]
\le \mathbb{E}\bigl[ \exp(c \lambda^2 \|K^{1/2} z\|^2 / 2) \bigr]
= \mathbb{E}\bigl[ \exp(\eta \|K^{1/2} z\|^2) \bigr].
\]
Since $K$ is symmetric and positive semidefinite, $K = V D V^\top$ for some orthogonal matrix $V = [u_1\ u_2\ \cdots\ u_r]$ and diagonal matrix $D = \operatorname{diag}(\rho_1, \rho_2, \ldots, \rho_r)$, where $r$ is the rank of $K$. By rotational symmetry, the vector $V^\top z$ is equal in distribution to a vector of $r$ i.i.d. standard normal random variables $q_1, q_2, \ldots, q_r$, and
\[
\|K^{1/2} z\|^2 = \|D^{1/2} V^\top z\|^2 = \rho_1 q_1^2 + \rho_2 q_2^2 + \cdots + \rho_r q_r^2.
\]
Therefore,
\[
\mathbb{E}\bigl[ \exp(\lambda \langle z, K^{1/2} \xi \rangle) \bigr]
\le \mathbb{E}\bigl[ \exp(\eta \|K^{1/2} z\|^2) \bigr]
= \mathbb{E}\bigl[ \exp\bigl( \eta (\rho_1 q_1^2 + \rho_2 q_2^2 + \cdots + \rho_r q_r^2) \bigr) \bigr]. \tag{9}
\]
Combining (8) and (9) gives
\[
\Pr\bigl[ \|K^{1/2} \xi\|^2 > c (\operatorname{tr}(K) + \tau) \bigr]
\le \exp\bigl( -\eta (\operatorname{tr}(K) + \tau) \bigr)\, \mathbb{E}\bigl[ \exp\bigl( \eta (\rho_1 q_1^2 + \rho_2 q_2^2 + \cdots + \rho_r q_r^2) \bigr) \bigr].
\]
The expectation on the right-hand side is the moment generating function for a linear combination of $r$ independent $\chi^2$ random variables, each with one degree of freedom. Since $\operatorname{tr}(K) = \rho_1 + \rho_2 + \cdots + \rho_r$, $\operatorname{tr}(K^2) = \rho_1^2 + \rho_2^2 + \cdots + \rho_r^2$, and $\|K\| = \max\{\rho_1, \rho_2, \ldots, \rho_r\}$, the conclusion follows from standard facts about $\chi^2$ random variables (Laurent and Massart, 2000):
\[
\Pr\bigl[ \|K^{1/2} \xi\|^2 > c (\operatorname{tr}(K) + \tau) \bigr]
\le \exp\Bigl( -\frac{\operatorname{tr}(K^2)}{2 \|K\|^2}\, h\Bigl( \frac{\|K\| \tau}{\operatorname{tr}(K^2)} \Bigr) \Bigr),
\]
where $h(a) := 1 + a - \sqrt{1 + 2a}$.

The next lemma is a tail inequality for sums of bounded random vectors; it is a standard application of Bernstein's inequality.

Lemma 31 (Vector Bernstein bound; see, e.g., Hsu et al., 2011) Let $x_1, x_2, \ldots, x_n$ be independent random vectors such that
\[
\sum_{i=1}^n \mathbb{E}\bigl[\|x_i\|^2\bigr] \le v
\quad\text{and}\quad
\|x_i\| \le r \quad \text{for all } i = 1, 2, \ldots, n,
\]
almost surely. Let $s := x_1 + x_2 + \cdots + x_n$. For all $t > 0$,
\[
\Pr\Bigl[ \|s\| > \sqrt{v} \bigl( 1 + \sqrt{8t} \bigr) + (4/3)\, r t \Bigr] \le e^{-t}.
\]

The last tail inequality concerns the spectral accuracy of an empirical second moment matrix.

Lemma 32 (Matrix Bernstein bound; Hsu et al., 2012) Let $X$ be a random matrix, and $r > 0$, $v > 0$, and $k > 0$ be such that, almost surely,
\[
\mathbb{E}[X] = 0, \qquad \lambda_{\max}[X] \le r, \qquad \lambda_{\max}\bigl[\mathbb{E}[X^2]\bigr] \le v, \qquad \operatorname{tr}\bigl(\mathbb{E}[X^2]\bigr) \le v k.
\]
If $X_1, X_2, \ldots, X_n$ are independent copies of $X$, then for any $t > 0$,
\[
\Pr\biggl[ \lambda_{\max}\Bigl[ \frac{1}{n} \sum_{i=1}^n X_i \Bigr] > \sqrt{\frac{2 v t}{n}} + \frac{r t}{3 n} \biggr] \le k t \bigl( e^t - t - 1 \bigr)^{-1}.
\]
If $t \ge 2.6$, then $t (e^t - t - 1)^{-1} \le e^{-t/2}$.
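As a quick numerical sanity check (not part of the paper), the tail bound of Lemma 30 can be tested by Monte Carlo with a standard Gaussian $\xi$, which satisfies the subgaussian moment condition with $c = 1$. The dimension, the matrix $K$, the value of $t$, and the trial count below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, c = 8, 2.0, 1.0  # dimension, tail parameter, subgaussian constant (illustrative)

# A fixed symmetric positive semidefinite matrix K.
A = rng.standard_normal((n, n))
K = A @ A.T

# Deviation threshold from Lemma 30: c(tr(K) + 2 sqrt(tr(K^2) t) + 2 ||K|| t).
threshold = c * (np.trace(K)
                 + 2 * np.sqrt(np.trace(K @ K) * t)
                 + 2 * np.linalg.norm(K, 2) * t)

# Standard Gaussian vectors satisfy E[exp(<u, xi>)] = exp(||u||^2 / 2), i.e. c = 1.
trials = 20000
xi = rng.standard_normal((trials, n))
quad = np.einsum('ij,jk,ik->i', xi, K, xi)  # xi^T K xi for each trial
violation_rate = np.mean(quad > threshold)

# The lemma guarantees the violation probability is at most e^{-t}.
print(violation_rate, np.exp(-t))
```

The empirical violation rate typically lands far below $e^{-t}$, reflecting the slack in the Chernoff argument for matrices whose spectrum is not concentrated on a single eigenvalue.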

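The deterministic inequality of Lemma 25 can likewise be verified numerically: whenever the whitened deviation $\Delta$ has spectral norm below 1, the relative spectral norm error of $\hat{\Sigma}_\lambda^{-1}$ is controlled. The covariance, sample size, and ridge parameter in this sketch are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, lam = 5, 2000, 0.1  # dimension, sample size, ridge parameter (illustrative)

# Population covariance Sigma and an i.i.d. sample whose rows have covariance Sigma.
scales = np.array([2.0, 1.5, 1.0, 0.5, 0.25])
Sigma = np.diag(scales ** 2)
X = rng.standard_normal((n, d)) * scales

Sigma_hat = X.T @ X / n
Sigma_lam = Sigma + lam * np.eye(d)
Sigma_hat_lam = Sigma_hat + lam * np.eye(d)

# Symmetric square roots of Sigma_lam via its eigendecomposition.
w, V = np.linalg.eigh(Sigma_lam)
root = V @ np.diag(w ** 0.5) @ V.T
inv_root = V @ np.diag(w ** -0.5) @ V.T

# Delta := Sigma_lam^{-1/2} (Sigma_hat - Sigma) Sigma_lam^{-1/2}, as in (16).
Delta = inv_root @ (Sigma_hat - Sigma) @ inv_root
norm_Delta = np.linalg.norm(Delta, 2)

# Lemma 25: if ||Delta|| < 1, then
# ||Sigma_lam^{1/2} Sigma_hat_lam^{-1} Sigma_lam^{1/2}|| <= 1 / (1 - ||Delta||).
lhs = np.linalg.norm(root @ np.linalg.inv(Sigma_hat_lam) @ root, 2)
print(norm_Delta, lhs, 1 / (1 - norm_Delta))
```

With $n$ well above $d$ the spectral norm of $\Delta$ is small, and the inequality holds deterministically (up to floating point) on every draw, since it is a linear-algebra fact rather than a probabilistic one.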

More information

Asymptotic distribution of products of sums of independent random variables

Asymptotic distribution of products of sums of independent random variables Proc. Idia Acad. Sci. Math. Sci. Vol. 3, No., May 03, pp. 83 9. c Idia Academy of Scieces Asymptotic distributio of products of sums of idepedet radom variables YANLING WANG, SUXIA YAO ad HONGXIA DU ollege

More information

1 Convergence in Probability and the Weak Law of Large Numbers

1 Convergence in Probability and the Weak Law of Large Numbers 36-752 Advaced Probability Overview Sprig 2018 8. Covergece Cocepts: i Probability, i L p ad Almost Surely Istructor: Alessadro Rialdo Associated readig: Sec 2.4, 2.5, ad 4.11 of Ash ad Doléas-Dade; Sec

More information

NYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018)

NYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018) NYU Ceter for Data Sciece: DS-GA 003 Machie Learig ad Computatioal Statistics (Sprig 208) Brett Berstei, David Roseberg, Be Jakubowski Jauary 20, 208 Istructios: Followig most lab ad lecture sectios, we

More information

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate Supplemetary Material for Fast Stochastic AUC Maximizatio with O/-Covergece Rate Migrui Liu Xiaoxua Zhag Zaiyi Che Xiaoyu Wag 3 iabao Yag echical Lemmas ized versio of Hoeffdig s iequality, ote that We

More information

An almost sure invariance principle for trimmed sums of random vectors

An almost sure invariance principle for trimmed sums of random vectors Proc. Idia Acad. Sci. Math. Sci. Vol. 20, No. 5, November 200, pp. 6 68. Idia Academy of Scieces A almost sure ivariace priciple for trimmed sums of radom vectors KE-ANG FU School of Statistics ad Mathematics,

More information

Physics 324, Fall Dirac Notation. These notes were produced by David Kaplan for Phys. 324 in Autumn 2001.

Physics 324, Fall Dirac Notation. These notes were produced by David Kaplan for Phys. 324 in Autumn 2001. Physics 324, Fall 2002 Dirac Notatio These otes were produced by David Kapla for Phys. 324 i Autum 2001. 1 Vectors 1.1 Ier product Recall from liear algebra: we ca represet a vector V as a colum vector;

More information

Stochastic Matrices in a Finite Field

Stochastic Matrices in a Finite Field Stochastic Matrices i a Fiite Field Abstract: I this project we will explore the properties of stochastic matrices i both the real ad the fiite fields. We first explore what properties 2 2 stochastic matrices

More information

ACO Comprehensive Exam 9 October 2007 Student code A. 1. Graph Theory

ACO Comprehensive Exam 9 October 2007 Student code A. 1. Graph Theory 1. Graph Theory Prove that there exist o simple plaar triagulatio T ad two distict adjacet vertices x, y V (T ) such that x ad y are the oly vertices of T of odd degree. Do ot use the Four-Color Theorem.

More information

The random version of Dvoretzky s theorem in l n

The random version of Dvoretzky s theorem in l n The radom versio of Dvoretzky s theorem i l Gideo Schechtma Abstract We show that with high probability a sectio of the l ball of dimesio k cε log c > 0 a uiversal costat) is ε close to a multiple of the

More information

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices Radom Matrices with Blocks of Itermediate Scale Strogly Correlated Bad Matrices Jiayi Tog Advisor: Dr. Todd Kemp May 30, 07 Departmet of Mathematics Uiversity of Califoria, Sa Diego Cotets Itroductio Notatio

More information

11 THE GMM ESTIMATION

11 THE GMM ESTIMATION Cotets THE GMM ESTIMATION 2. Cosistecy ad Asymptotic Normality..................... 3.2 Regularity Coditios ad Idetificatio..................... 4.3 The GMM Iterpretatio of the OLS Estimatio.................

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 Fuctioal Law of Large Numbers. Costructio of the Wieer Measure Cotet. 1. Additioal techical results o weak covergece

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

5 Birkhoff s Ergodic Theorem

5 Birkhoff s Ergodic Theorem 5 Birkhoff s Ergodic Theorem Amog the most useful of the various geeralizatios of KolmogorovâĂŹs strog law of large umbers are the ergodic theorems of Birkhoff ad Kigma, which exted the validity of the

More information

On Nonsingularity of Saddle Point Matrices. with Vectors of Ones

On Nonsingularity of Saddle Point Matrices. with Vectors of Ones Iteratioal Joural of Algebra, Vol. 2, 2008, o. 4, 197-204 O Nosigularity of Saddle Poit Matrices with Vectors of Oes Tadeusz Ostrowski Istitute of Maagemet The State Vocatioal Uiversity -400 Gorzów, Polad

More information

Notes 19 : Martingale CLT

Notes 19 : Martingale CLT Notes 9 : Martigale CLT Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces: [Bil95, Chapter 35], [Roc, Chapter 3]. Sice we have ot ecoutered weak covergece i some time, we first recall

More information

5.1 A mutual information bound based on metric entropy

5.1 A mutual information bound based on metric entropy Chapter 5 Global Fao Method I this chapter, we exted the techiques of Chapter 2.4 o Fao s method the local Fao method) to a more global costructio. I particular, we show that, rather tha costructig a local

More information

Empirical Processes: Glivenko Cantelli Theorems

Empirical Processes: Glivenko Cantelli Theorems Empirical Processes: Gliveko Catelli Theorems Mouliath Baerjee Jue 6, 200 Gliveko Catelli classes of fuctios The reader is referred to Chapter.6 of Weller s Torgo otes, Chapter??? of VDVW ad Chapter 8.3

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

Geometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT

Geometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT OCTOBER 7, 2016 LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT Geometry of LS We ca thik of y ad the colums of X as members of the -dimesioal Euclidea space R Oe ca

More information