AGGREGATION AND HIGH-DIMENSIONAL STATISTICS (preliminary notes of Saint-Flour lectures, July 8-20, 2013)

Size: px
Start display at page:

Download "AGGREGATION AND HIGH-DIMENSIONAL STATISTICS (preliminary notes of Saint-Flour lectures, July 8-20, 2013)"

Transcription

1 AGGREGATION AND HIGH-DIENSIONAL STATISTICS (prelimiary otes of Sait-Flour lectures, July 8-20, 2013) Alexadre B. Tsybakov (CREST-ENSAE) October 30, Itroductio Give a collectio of estimators, the problem of liear, covex or model selectio type aggregatio cosists i costructig a ew estimator, called the aggregate, which is early as good as the best amog them (or early as good as their best liear or covex combiatio), with respect to a give risk criterio. Whe the uderlyig model is sparse, which meas that it is well approximated by a liear combiatio of a small umber of fuctios i the dictioary, the aggregatio techiques tur out to be very useful i takig advatage of sparsity. O the other had, aggregatio is a geeral techique of producig adaptive oparametric estimators, which is more powerful tha the classical methods sice it allows oe to combie estimators of differet ature. Aggregates are usually costructed by mixig the iitial estimators or fuctios of the dictioary with datadepedet weights that ca be computed is several possible ways. Importat example is give by aggregates with expoetial weights. They satisfy sharp oracle iequalities that allow oe to treat i a uified way three differet problems: Adaptive oparametric estimatio, aggregatio ad sparse estimatio. To be able to demostrate the mai ideas without excessive techicalities, throughout this course we will deal with a simple model, amely the Gaussia regressio model with fixed desig. Suppose that we observe {(Y i, X i )} i=1 such that Y i = f(x i ) + ξ i, i = 1,...,, (1) where X is a arbitrary set, f X R is a ukow fuctio, X i X are oradom, ad the radom errors ξ i are i.i.d. Gaussia with mea zero ad variace σ 2, ξ i N (0, σ 2 ). The overall goal is to costruct a estimator ˆf for f based o the observatios {(Y i, X i )} i=1. To measure how good ˆf is, we use the squared error loss of the form ˆf f 2 = 1 i=1 ( ˆf(X i ) f(x i )) 2 ad we defie the risk of estimator ˆf as E ˆf f 2. The pseudo-orm f is referred to as the empirical orm of a fuctio defied o X. For vectors b R, we will also cosider the empirical l 2 -orm defied by b 2 = 1 i=1 b2 i, while b 2 2 = i=1 b2 i defies the usual l 2-orm b 2. Assume that we are give a collectio of fuctios {f 1,..., f } called the dictioary, where f j X R. Assume also that we are give a subset Θ of R. For θ = (θ 1,..., θ ) Θ we cosider 1

2 the liear combiatios f θ defied by f θ (x) def = θ j f j (x), x X. Fuctios f θ are thought to be approximatios of the ukow f. Assumig the dictioary {f 1,..., f } to be rich eough ad sufficietly large, these approximatios ca be satisfactory. Therefore, the estimatio of f may be reduced to estimatig θ j, leadig a estimator ˆf = fˆθ = ˆθ j f j, where ˆθ j are suitable estimators of θ j. The the aim is to mimimize the risk by choosig a optimal ˆθ j. However, depedig o the assumptios we make about the dictioary, the set Θ ad f, we are lead to differet optimality properties. We itroduce below three scearios ad discuss how these assumptios ifluce the costructio of the estimators ad the optimality framework. 1.1 Sceario 1: Liear Regressio ad Sparsity Assume that the true f is a liear combiatio of the fuctios from the dictioary: θ R f(x) = f θ (x) = θ j f j (x). (2) The we are i the usual liear regressoi framework, ad the observatios ca be writte i the followig form y = Xθ + ξ, where Y 1 f 1 (X 1 ) f (X 1 ) y =, ξ =, X =. (3) f 1 (X ) f (X ) Y ξ 1 ξ Estimatio of f is ow reduced to estimatio of θ. Classical theory of liear regressio deals with cases where, which is a ecessary coditio of idetifiability of θ whe we oly kow that θ R. However, i recet years there is a icreasig applied iterest i the problems where is greater tha ad ofte. I this case, f is ot idetifiable without additioal assumptios o θ. A atural ad most popular additioal assumptio is a sparsity costrait o θ. It cosists i restrictig the parameter θ to the class Θ = B 0 (s) where B 0 (s) is the l 0 -ball i R : Here, B 0 (s) = {θ R θ 0 s}, s = 1,...,. (4) def θ 0 = I(θ j 0) is the l 0 orm. Vectors θ belogig to B 0 (s) are called s-sparse. It turs out that, uder the s-sparisty restrictio, estimatio with reasoable accuracy is possible. We may ask ourselves the followig questio. 2

3 Questio 1. What is the optimal way to estimate θ if we kow that θ B 0 (s)? Let ˆθ be a estimator of θ. The correspodig estimator of f is the ˆf = fˆθ = ad the squared risk defied above takes the form ˆθ j f j E ˆf f 2 = E ( 1 X(ˆθ θ ) 2 2). This is kow uder the ame of predictio risk for liear regressio. The optimality is usually defied i a miimax sese. A estimator ˆθ is called optimal if there exists a sequece of positive umbers ψ,,s such that, for all ad, the followig two coditios are satisfied: if T sup E ( 1 θ B 0 (s) X(ˆθ θ ) 2 2) Cψ,,s (5) sup θ B 0 (s) E ( 1 X(T θ ) 2 2) cψ,,s (6) where C ad c are positive costats idepedet of,, s, ad if T deotes the miimum over all estimators of θ based o the sample {(Y i, X i )} i=1. This is commoly referred to as the miimax optimality. A sequece ψ,,s such that (5) ad (6) hold is called miimax rate of covergece (or optimal rate of covergece) o B 0 (s). To summarize, our mai goal i this sceario is to fid a miimax optimal estimator ˆθ o the class B 0 (s). Alog with B 0 (s), other classes ca be cosidered, such as l q -balls with 0 < q. This problem, i its simplest versio where X T X/ is the idetity (the Gaussia sequece model) ad with asymptotic poit of view, has bee i the focus of statistical literature from the 1990ies, with the mai developmets due to Dooho ad Johstoe. We are iterested here i a more geeral liear regressio settig ad we deal with o-asymptotic miimax optimality. 1.2 Sceario 2: Noparametric Regressio Let f F β,l where F β,l, typically, is a class of smooth fuctios parametrized by β > 0 ad L > 0. Roughly speakig, parameter β is the umber of derivatives of f that are assumed bouded i some orm by costat L. I this sceario, it is usually assumed that the dictioary {f 1,..., f } is composed of the first fuctios of some orthoormal basis. For example, it ca be the Fourier or wavelet basis. A key assumptio i the oparametric regressio settig is that the true fuctio f ca be approximated by a liear combiatio of the basis fuctios. It ca be stated, for example, i the followig form. Let f F β,l. The, for all = 1, 2,... there exists θ = θ (f) R such that f θj f j C β, (7) where C is a costat depedig oly o β, L. Here, i geeral, f def θ = θ j f j f, which is i cotrast with the liear regressio settig. Like i the liear regressio case, we are iterested i optimal estimatio of f. 3

4 Questio 2. What is the miimax optimal estimator of f o the class F β,l? As before, a miimax optimal estimator ˆf is the oe that satisfies if f sup E ˆf f 2 Cψ,β, (8) f F β,l sup E f f 2 cψ,β, (9) f F β,l where C ad c are positive costats idepedet of β ad L, ad if f deotes the miimum over all estimators of f based o the sample {(Y i, X i )} i=1. A sequece ψ,β such that (8) ad (9) hold is called miimax rate of covergece (or optimal rate of covergece) o F β,l. Questio 3. How to costruct a adaptive estimatio procedure? A adaptive estimator is a estimator ˆf which is idepedet of β ad L ad satisfies (8) with optimal rate of covergece ψ,β for all pairs (β, L) i a wide rage of values. 1.3 Sceario 3: Aggregatio of estimators The geeral mathematical framework of aggregatio is itroduced by Nemirovski i his Sait-Flour lectures i 1998 (published as Nemirovski (2000)). Nemirovskii (2000) outlied three problems: model selectio type aggregatio, covex aggregatio, ad liear aggregatio. ore geerally, the problem of aggregatio is stated as follows. Suppose that we are give a collectio of prelimiary estimators ˆf 1,..., ˆf of f ad a subset Θ of R. The goal is to fid a ew estimator f, called the aggregate, which is approximately at least as good as the best liear combiatio f θ = θ j ˆf j restricted to θ Θ. The best liear combiatio is defied as the oe that solves the problem mi θ Θ E f f θ 2 miimizig the squared risk. Ulike i the previous scearios, here f θ is a radom fuctio depedig o the data. I cotrast to those scearia, we do ot assume that f f θ is zero or small (see (2), (7)); it may happe that all f θ for some Θ are very far from the true f. So, the choice of Θ is importat for aggregatio problems. Some examples of Θ are listed below. 1. L-aggregatio (Liear aggregatio): Θ = R. The aim of liear aggregatio is to costruct a estimator f, which is approximately as good as the best liear combiatio of the iitial estimators ˆf 1,..., ˆf. 2. C-aggregatio (Covex aggregatio): Θ is the simplex Θ = Λ def = {θ R θ j 0, θ j = 1}. The aim of covex aggregatio is to costruct a estimator f, which is approximately as good as the best covex combiatio of the iitial estimators ˆf 1,..., ˆf. 3. S-aggregatio (odel Selectio type aggregatio): Θ = {e 1,..., e } where e i are the caoical basis vectors i R. The aim of S-aggregatio is to costruct a estimator f, which is approximately as good as the best amog the iitial estimators ˆf 1,..., ˆf. 4

5 4. s-sparse aggregatio: Θ = B 0 (s) def = {θ R θ 0 s} where s {1,..., }. 5. L q -aggregatio: Θ = B q (τ) def = {θ R θ q τ} where θ q = ( θ j q ) 1 q is the usual l q -orm. Other types of aggregatio will be discussed below as well. Note that for liear, covex ad S aggregatio the sets Θ ca be expressed as itersectios of l 0 ad l 1 balls. Ideed, for liear aggregatio, Θ = R = B 0 (), where B 0 () is the l 0 -ball of radius. For covex aggregatio, the simplex is icluded ito B 1 + (1) a itersectio of the l 1-ball B 1 (1) with the coe of positive coordiates. For S-aggregatio, Θ = {e 1,..., e } = B 0 (1) B 1 + (1). The goal of aggregatio is to mimic the best liear combiatio of iitial estimators with weights restricted to a give set Θ of possible weights. The word best here is formalized as choosig f with the smallest possible excess risk (also kow uder the ame of regret) defied by E Θ ( f, f) def = E f f 2 if θ Θ E f θ f 2. (10) Based o the excess risk, we ca itroduce the cocept of miimax optimality for aggregatio. A estimator f is called a optimal aggregate for the class Θ if there exists a sequece of positive umbers ψ, (Θ) such that sup { sup E Θ ( f, f)} Cψ, (Θ), (11) ˆf 1,..., ˆf f sup { if ˆf 1,..., ˆf ˆf sup E Θ ( ˆf, f)} cψ, (Θ). (12) f Here, if ˆf is the miimum over all estimators, C ad c are positive costats idepedet of ad, ad sup ˆf1,..., ˆf, sup f are the suprema over wide classes of prelimiary estimators ad fuctios f. I some cases, these will be all possible estimators ad all possible f with o restrictio; i other cases it will suffice to cosider classes of ˆf 1,..., ˆf ad f satisfyig a boudedess assumptio i the empirical orm. If (11) ad (12) hold for some sequece ψ, (Θ), this sequece is called a optimal rate of aggregatio for the class Θ. The questios arisig i this cotext are as follows. Questio 4. How to costruct a optimal aggregate f for a give class Θ? Questio 5. Is it possible to costruct a uiversal aggregate, i.e., a aggregate which is optimal simultaeously for a large scale of classes Θ? The last questio is of the same ature as Questio 3 cocerig adaptive oparametric estimatio. Iequalities (11) ad (12) establish upper ad lower bouds for the miimax risk, respectively. The upper bouds (11) ca be equivaletly writte i the form of oracle iequalities E f f 2 if θ Θ E f θ f 2 + Cψ, (Θ), ˆf 1,..., ˆf, f, (13) which say that the risk of the suggested aggregate f is at least as good as the risk of the ukow oracle θ miimizig E f θ f 2, up to a small remaider term of the order ψ, (Θ) (a price to pay for aggregatio ). Lower bouds (12) say that this is the miimal price; the remaider term caot be of a smaller order whatever is the aggregate. For the sparsity classes, for example, Θ = B 0 (s), the rate ψ, (Θ) is a fuctio of s; the correspodig oracle iequalities are called sparsity oracle iequalities. 5

6 1.4 Outlie The mai message of this course is that there are methods that solve problems described i Sectios 1.1, 1.2, ad 1.3 simultaeously. We will cosider methods like the BIC, the Lasso, ad the expoetial weightig, provide oracle iequalities ad discuss lower bouds for the three above scearios i a uified framework. We will establish the optimal rates of aggregatio. Aticipatig, for the mai types of aggregatio they are give i the followig table where R = Rak(X) deotes the rak of matrix X. Problem ψ, (Θ) S-aggregatio C-aggregatio L-aggregatio σ 2 R σ 2 R σ2 log σ 2 σ log (1 + ) σ 2 R Table 1. We will also show that the techique of expoetial weightig achieves uiversal aggregatio. 2 From aggregatio of estimators to aggregatio of fuctios Aggregates are usually costructed i the form f = ˆθ j ˆfj where ˆθ j are suitably chose statistics measurable with respect to the data. The aalysis is more ivolved if both ˆθ j ad the prelimiary estimators ˆf j are costructed from the same sample {(Y i, X i )} i=1. To avoid this, the idea put forward by Nemirovski (2000) is to obtai two idepedet samples from the iitial oe by radomizatio (sample cloig). The estimators ˆf j are costructed from the first sample while the secod oe is used to perform aggregatio, i.e., to compute the weights ˆθ j. To carry out the aalysis of aggregatio, it is eough to work coditioally o the first sample, so that ˆf j ca be cosidered as determiistic fuctios. Thus, the problem reduces to aggregatio of determiistic fuctios that we will deote as previously f j = ˆf j, j = 1,...,. A limitatio is that this type of radomizatio oly applies to Gaussia model with kow variace. Nevertheless, the idea of two-step procedures carries over to models with i.i.d. observatios where oe ca do direct sample splittig (see, e.g., Rigollet ad Tsybakov (2007); Lecué (2011)). Thus, i may cases aggregatio of estimators ca be achieved by reductio to aggregatio of determiistic fuctios. Alog with this approach, oe ca aggregate estimators usig the same observatios for estimatio ad aggregatio. While for geeral estimators this would clearly result i overfittig, the idea proved to be successful for certai types of estimators, first for projectio estimators (Leug ad Barro (2006)) ad more recetly for a more geeral class of liear (affie) estimators (Dalalya ad Salmo (2012)). The procedure of sample cloig by radomizatio is based o the followig elemetary lemma. 6

7 Lemma 1. Let Y i = f(x i ) + ξ i. Let ω i be a stadard ormal radom variable idepedet of ξ i. Set The we have Y i1 = Y i + σω i, Y i2 = Y i σω i. Y i1 = f(x i ) + ξ i1, Y i2 = f(x i ) + ξ i2, where ξ i1 N (0, 2σ 2 ), ξ i2 N (0, 2σ 2 ) ad ξ i1 is idepedet of ξ i2. Thus, we obtai two idepedet Gaussia samples D 1 = {(Y i1, X i )} i=1 ad D 2 = {(Y i2, X i )} i=1, where Y ik = f(x i )+ξ ik, k = 1, 2. Both samples are of the same form as the origial oe {(Y i, X i )} i=1, with the oly differece that the variace of the oise is doubled. Now, we use D 1 to costruct prelimiary estimators ˆf 1,..., ˆf ad we use D 2 to determie the weights ˆθ 1,..., ˆθ. Deotig by E (k) the expectatios with respect to the distributio of D k for k = 1, 2, we may write the oracle iequality (13) that we eed to prove i the form E (1) E (2) f f 2 if θ Θ E (1) f θ f 2 + Cψ, (Θ). (14) Clearly, to obtai (14) it suffices to show that, for ay fixed fuctios f 1,..., f, f (possibly satisfyig some mild assumptios), we have E (2) f f 2 if θ Θ f θ f 2 + Cψ, (Θ), (15) where f θ is a liear combiatio of f 1,..., f, ad f = ˆθ j f j with ˆθ j measurable with respect to D 2. Thus, usig the sample cloig device, we ca reduce aggregatio of estimators to its special case, which is aggregatio of fixed fuctios. The, the miimax framework modifies oly i that the excess risk takes the form E Θ ( f, f) def = E f f 2 if θ Θ f θ f 2 (16) (o expectatio i the term if θ Θ f θ f 2 ). I this settig, a estimator f is a optimal aggregate for the class Θ if there exists a sequece of positive umbers ψ, (Θ) such that (11) ad (12) are satisfied where ˆf j s are replaced by f j s. The upper boud o the maximum excess risk is equivalet to the oracle iequality E f f 2 if θ Θ f θ f 2 + Cψ, (Θ), f 1,..., f, f. (17) Oce such a oracle iequality is established, we ca obtai upper bouds for the miimax risk i Scearios 1 ad 2 as simple corollaries. Ideed, those scearios itroduce additioal strog restrictios o f, i particular, that the oracle risk if θ Θ f θ f 2 is either 0 (for Sceario 1) or admits a give boud, cf. (7) (for Sceario 2). 7

8 3 Least squares aggregatio A first simple idea is to costruct aggregates via the least squares (LS). Give a set Θ ad a collectio of determiistic fuctios f 1,..., f, we take ad we defie the LS aggregate as ˆθ LS (Θ) = argmi y f θ 2 θ Θ f = fˆθls (Θ) = ˆθ j LS (Θ)f j. We are goig to show that this idea works for liear ad covex aggregatio but fails for Saggregatio. Recall that we deote by X the matrix f 1 (X 1 ) f (X 1 ) X =. f 1 (X ) f (X ) Propositio 2 (Liear aggregatio). Let ˆθ LS def = ˆθ LS (R ) be a least squares estimator o the set Θ = R. The for all f, f 1,..., f we have where R = Rak(X). E fˆθls f 2 = mi θ R f θ f 2 + σ2 R. Proof. I what follows, with a slight abuse of otatio, we will deote by f ad f θ ot oly the fuctios from X to R but also the -vectors of values of these fuctios at poits X 1,..., X. The, with the otatio from (3), the model of observatios (1) ca be writte as y = f + ξ. Also, f θ = Xθ for all θ ad, i particular, fˆθls = X ˆθ LS = Ay where A is the orthogoal projector o Im(X). Sice y = f + ξ we have which yileds Sice A is the projector o Im(X), O the other had, Af f 2 = fˆθls f 2 = Ay f 2 = A(f + ξ) f 2, E fˆθls f 2 = Af f 2 + E Aξ 2. mi v v Im(X) f 2 = mi Xθ f 2 = mi f θ f 2. θ R θ R ad the propositio follows. E Aξ 2 = σ2 Tr(A) = σ2 R A similar result is valid for the LS estimator o ay covex subset of R. (18) 8

9 Propositio 3. Let Θ be a closed covex subset of R. squares estimator o Θ satisfies where R = Rak(X). E fˆθls (Θ) f 2 = mi θ Θ f θ f 2 + 4σ2 R The, for all f, f 1,..., f, the least Proof. Set for brevity ˆθ = ˆθ LS (Θ), f = fˆθls (Θ). First, by a simple algebra, for ay g = f θ with θ Θ, usig that y f 2 y g 2 ad y = f + ξ, we deduce that f f 2 f g < f g, ξ > where Thus, for ay θ Θ, < f, g > def = 1 i=1 f(x i )g(x i ). f f 2 f f θ < f f θ, ξ >. (19) We may write < f f θ, ξ >=< fˆθ f θ, ξ >= fˆθ f θ < u, ξ > where u = fˆθ f θ fˆθ f θ while belogs to Im(X), ad u = 1. Therefore, < f f θ, ξ > fˆθ f θ sup < u, ξ > u Im(X) u =1 sup < u, ξ > = sup < u, Aξ > Aξ u Im(X) u =1 u Im(X) u =1 where, as i the proof of Propositio 2, A deotes the orthogoal projector o Im(X). Hece Let θ be the miimizer of f θ f o Θ: The, i view of the covexity of Θ, 2 < f f θ, ξ > 2 f f θ Aξ 1 2 f f θ Aξ 2. (20) f θ f 2 = mi θ Θ f θ f 2. f f 2 f f θ 2 + f θ f 2. (21) Settig θ = θ i (19) (20) combiig these iequalities with (21) we obtai f f 2 f θ f Aξ 2. The result of the propositio ow follows by takig the expectatios of both sides of this iequality ad usig (18). We ow cosider the LS estimator o ay (ot ecessarily covex) subset of the simplex Λ. The, alog with the rate of covergece obtaied i Propositio 3 we ca obtai a differet rate as shows the ext result. 9

10 Propositio 4. Let Θ be a closed subset of Λ. The for all f ad all dictioaries f 1,..., f such that f j L, j = 1,...,, least squares estimator o Θ satisfies E fˆθls (Θ) f 2 mi θ Θ f θ f 2 + 4σL 2 log. Now, Proof. It follows from (19) that, for all θ Λ, Note that, for ay θ Λ we have E f f 2 f f θ 2 + 2E < f f θ, ξ >. E < f f θ, ξ > E max θ Λ < f θ f θ, ξ >= E max 1 j < f j f θ, ξ >. f θ 2 = 1 θ j f j (X i ) i=1 1 θ j fj 2 (X i ) = Therefore, f j f θ 2L. O the other had, i=1 θ j f j 2 L 2. η j def = < f j f θ, ξ > N (0, σ 2 ) where σ 2 = σ 2 f j f θ 2 /. Hece, usig Lemma 29 we obtai that ad the propositio follows. E max < f j f θ, ξ > = E max η j σ 2 log 1 j 1 j 2 log 2 log = σ f j f θ 2Lσ, We ow tur to covex aggregatio ad cosider the correspodig LS estimator ˆθ LS cov = argmi θ Λ y f θ 2. The followig theorem is straightforward i view of Propositios 3 ad 4. It states thatfˆθls attais cov the fastest of the two rates. Theorem 5 (Covex aggregatio). For all f ad all dictioaries f 1,..., f such that f j L, j = 1,...,, we have E fˆθls f 2 mi f θ f σ 2 R 2 log cov θ Λ σl. 2 10

11 Note that, up to a mior logarithmic discrepacy, the aggregate f achieves the target optimal rate of covex aggregatio give i Table 1. However, for S-aggregatio the situatio is differet. I this case, Θ is a fiite set ad the least squares estimator of f is defied by ˆf S = fĵ where ĵ = argmi 1 j y f j 2. The followig oracle iequality is a immediate cosequece of Propositio 4. Theorem 6 (S-aggregatio). For all f ad all f 1,..., f such that f j L, j = 1,...,, we have E ˆf 2 log S f 2 mi f j f 2 + 4σL. 1 j We see that the desired optimal rate for S-aggregatio, which is of the order (log )/ (cf. Table 1) is ot achieved ad the LS-aggregate ˆf S exhibits much poorer behavior. This is ot due to the techiques of the proof. I fact, the rate (log )/ give i Theorem 6 is the best that oe ca obtai for ˆf S. The followig result shows that this defect is itrisic ot oly for the least squares estimator but also for ay method that selects oly oe fuctio i the dictioary. This icludes methods of model selectio by pealized empirical risk miimizatio. We call estimators Ŝ takig values i {f 1,..., f } the selectors. Theorem 7 (Suboptimality of selectors). Assume that (σ 1) (log )/ C 0 (22) for 0 < C 0 < 1 small eough. The, there exists a dictioary {f 1,..., f } with f j 1, j = 1,...,, such that the followig holds. For ay selector Ŝ, ad i particular, for ay selector based o pealized empirical risk miimizatio, there exists a regressio fuctio f such that f 1 ad E Ŝ f 2 mi 1 j f j f 2 + C σ log (23) for some positive costat C. It follows from the lower boud (23) that selectig oe of the fuctios i a fiite dictioary to solve the problem of model selectio is suboptimal i the sese that it exhibits a too large remaider term, of the order (log )/. It turs out that we ca do better if we take a mixture, that is a covex combiatio of the fuctios i the dictioary. We will see below that uder a particular choice of weights i this covex combiatio, amely the expoetial weights, oe ca achieve oracle iequalities with the optimal rate (log )/. Proof of Theorem 7. Cosider a radom matrix X of size such that its elemets X i,j, i = 1,...,, j = 1,..., are i.i.d. Rademacher radom variables, i.e., radom variables takig values 1 ad 1 with probability 1/2. oreover, assume that 2 log (1 + e 2 ) < C 1. (24) for some positive costat C 1 < 1/2. Note that (24) follows from (22) if C 0 is chose small eough. Theorem 5.2 i Baraiuk et al (2008) [see also Subsectio i Rigollet ad Tsybakov (2011)] 11

12 implies that if (24) holds for C 1 small eough, the there exists a oempty set of matrices obtaied as realizatios of the matrix X that ejoy the followig weak restricted isometry property. For ay X, there exist costats κ κ > 0, such that for ay λ R with at most 2 ozero coordiates, κ 2 λ 2 2 Xλ 2 2 κ 2 λ 2 2, (25) whe (24) is satisfied. For X, let φ 1,..., φ be ay fuctios o X satisfyig φ j (X i ) = x i,j, i = 1,...,, j = 1,...,, where x i,j are the etries of X. Note that φ j = 1 sice x i,j { 1, 1}. Fix τ > 0 to be chose later ad set where we set for brevity α = (σ/3) f j = τ (1 + α) φ j, j = 1,...,, log κ 2. oreover, cosider the fuctios η j = ταφ j, j = 1,...,. Usig (22) we choose τ small eough to esure that η j 1 ad f j 1 for ay j = 1,...,. For ay fuctio g, we write for brevity R j (g) = g η j 2. Set also H = {f 1,..., f }. It is easy to check that mi R j(f) = R j (f j ) = f j η j 2. (26) f H We ow reduce our estimatio problem to a testig problem as follows. Let ψ {1,..., } be the radom variable, or test, defied by ψ = j if ad oly if Ŝ = f j. The, ψ j implies that there exists k j such that Ŝ = f k, so that Ŝ η j 2 f j η j 2 = f k f j f k f j, f j η j = τ 2 (1 + α) 2 φ j φ k 2 + 2τ 2 (1 + α)( φ j, φ k 1) τ 2 α φ j φ k 2. From (25), we fid that φ j φ k 2 2κ 2 so that Ŝ η j 2 f j η j 2 2τ 2 κ 2 σ log 3 κ Therefore, we coclude that ψ j implies that Hece, R j (Ŝ) mi f H R j(f) ν,. max P j {R j (Ŝ) mi R j(f) ν, } if 1 j f H ψ def = ν,. max P j(ψ j), (27) 1 j where the ifimum is take over all tests takig values i {1,..., } ad P j deotes the joit distributio of Y 1,..., Y that are idepedet Gaussia radom variables with variace σ 2 ad 12

13 meas η j (X 1 ),..., η j (X ) respectively. It follows from Propositio 2.3 ad Theorem 2.5 i Tsybakov (2009) that if for ay 1 j, k, the Kullback-Leibler divergece betwee P j ad P k satisfies K(P j, P k ) < log, (28) 8 the there exists a costat C > 0 such that To check (28), observe that, choosig τ 1 ad applyig (25), we get if ψ max P j(ψ j) C. (29) 1 j K(P j, P k ) = 2σ 2 η j η k 2 = τ 2 log 18 κ 2 φ j φ k 2 < log 8. Therefore, i view of (27) ad (29), we fid usig the arkov iequality that for ay selector Ŝ, max E j [R j (Ŝ) mi R j(f)] Cν, = C σ 1 j f H log, where E j deotes the expectatio with respect to P j. This proves the theorem. 4 Sparsity ad high dimesioal regressio Let us go back to Sceario 1 (sparse liear regressio). We assume that f = f θ for some θ R, ad θ is s-sparse. Usig Propositio 2 we obtai that the least squares estimator satisfies E fˆθls f 2 = E fˆθls f θ 2 = E ( 1 X(ˆθ LS θ ) 2 2) 1 = mi θ R X(θ θ ) σ2 ( ) = σ2 ( ) wheever the matrix X is of full rak. This result is useless i high-dimesioal problems whe > sice the remaider term is ot small. The sparsity s is ot ivolved i the expressio for the risk. So, the global least squares caot take advatage of sparsity, eve if the target vector is very sparse, i.e., s. O the other had, imagie that some oracle discloses to us the set of o-zero compoets of the target vector J(θ ) = {j θj 0}. The we ca use the least squares estimator restricted to the liear subspace of vectors with o-zero compoets i J(θ ). Deotig this estimator by ˆθ LS,J(θ ) ad applyig agai Propositio 2 we fid E ( 1 X(ˆθ LS,J(θ ) θ ) 2 2) σ2 θ 0 σ2 s where we have used that Card(J(θ )) = θ 0 ad that θ is s-sparse. This boud is much better, it takes advatage of sparsity ad ca be very small whe s. Ufortuately, ˆθ LS,J(θ ) is ot a estimator. It is a oracle; it depeds o the ukow θ ad caot be computed from the data. 13

14 A atural questio i this cotext is whether oe ca costruct a true estimator θ such that E ( 1 X( θ θ ) 2 2) σ2 θ 0 We will see that this is almost possible. I particular, we will exhibit a estimator θ such that E ( 1 X( θ θ ) 2 2) C σ2 θ 0? log ( θ 0 ) (30) for some costat C ad all 0 < θ 0 <. The additioal logarithmic factor i (30) characterizes the (modest) price to pay for the lack of kowledge of the set J(θ ). We will see that this factor caot be avoided i a miimax sese o the class of all s-sparse vectors. Iequality (30) is a example of sparsity oracle iequality. 4.1 Sparsity i Gaussia sequece model To give a idea how to costruct estimators θ satisfyig (30), we cosider a simple but istructive case whe the colums of matrix X are orthoormal. Assumptio (ORT). atrix X is such that 1 XT X = I where I is the idetity matrix, 2. This assumptio implies that sice otherwise X T X is degeerate. Usig the model y = Xθ + ξ we may write y 1 y def = 1 XT y = 1 XT Xθ + 1 XT ξ = θ + ζ, where ζ = 1 XT ξ is a Gaussia radom vector i R with mea zero ad covariace matrix V(ζ) = 1 2 E(XT ξξ T X) = σ2 I. Thus, the compoets ζ j of ζ are i.i.d. Gaussia radom variables that ca be writte i the form ζ j = εη j where ε = σ ad η 1,..., η are i.i.d. stadard ormal. We see that, uder Assumptio (ORT), we have a sequece of ew observatios y 1,..., y of the form y j = θ j + εη j, j = 1,...,, ε = σ, (31) where θ j is the jth compoet of θ ad η j are i.i.d. N (0, 1) radom variables. The model (31) is called the Gaussia sequece model ad has a simple sigal + oise iterpretatio. I the rest of this subsectio, we will forget the iitial model y = Xθ + ξ ad work with a sequece of observatios y 1,..., y satisfyig (31). Note first that, for ay θ, i view of Assumptio (ORT), 1 Xθ 2 2 = 1 θt X T Xθ = θ 2 2, so that the squared risk of a arbitrary estimator ˆθ simplifies to E ( 1 X(ˆθ θ ) 2 2) = E ˆθ θ 2 2. (32) 14

15 As discussed above, uder the sparsity assumptio o θ, it is crucial to detect the set of ozero compoets J(θ ). For the Gaussia sequece model (31), such a detectio is based o a very simple idea to keep oly the idices j such that the absolute values y j are large eough. To quatify the otio of large eough value, we will refer to the followig property (cf. Lemma 28 below): If η j are stadard Gaussia radom variables the max 1 j η j 2 log with probability close to 1 for large. Ituitively, the value 2 log characterizes the oise level. The observatio y j is uder the oise level, or is difficult to distiguish from the oise if y j ε 2 log. O the cotrary, if y j > cε log for some costat c > 2, the it is almost impossible to have θj = 0. Thus, all 2 log idices j such that y j > ε 2 log = σ belog to the set J(θ ) with probability close to 1 for large. These remarks lead us to estimatio of coefficiets θj by thresholdig. It meas that we use a suitable estimator of θj (for example, the least squares ad maximal likelihood estimator equal to y j ) for idices j such that y j > cσ log log ad we estimate by 0 all the coefficiets θ j such that y j is uder the oise level cσ. A basic realizatio of this idea is give by the hard thresholdig estimator ˆθ j H = y j I( y j > τ), wher τ > 0 is the threshold, typically chose of the order log. The followig theorem summarizes the mai properties of the hard thresholdig estimator ˆθ H = (ˆθ 1 H,..., ˆθ H ). Theorem 8. Cosider the liear regressio model uder Assumptio (ORT). The the followig holds. (i) (Oracle iequality i expectatio) If τ = σ 2 log ad θ 0, the E ˆθ H θ 2 2 2σ 2 θ 0 log (1 + 4 log ). (ii) (Oracle iequality i probablility) If τ = Aσ least 1 1 A2 /8 we have: log ˆθ H θ A2 σ 2 ( θ 0 (iii) (Selectio of variables) If τ = Bσ log, A > 2 2, the with probability at log ). with B > 2 ad mi j θj 0 θ j > 2τ, the, with probability at least 1 1 B2 /2 we have: Ĵ = J(θ ), where J(θ ) = {j θ j 0} ad Ĵ = {j ˆθ H j 0}. 15

16 Proof. (i). If θ j = 0, the ˆθ H j θ j = y j I( y j > τ) = ε η j I( η j > 2 log ), while for θ j 0 we have the boud ˆθ H j θ j = y j I( y j > τ) θ j y j θ j + y j I( y j τ) ε η j + τ. Therefore, E ˆθ H θ 2 2 = Sice E(η 2 1 ) = 1 ad E η 1 = 2/π, E ˆθ H j θ j 2 (33) ε 2 E[η 2 1I( η 1 > 2 log )] + θ 0 E[(ε η 1 + τ) 2 ]. E[(ε η 1 + τ) 2 ] = ε 2 + 4ετ 2π + τ 2 (34) = ε log π + 2 log. By Lemma 27 E[η 2 1I( η 1 > 2 log )] 2 π ( 1 2 log log ) 1. (35) 1 Pluggig (34) ad (35) i (33) ad usig that 1+ π log 4 log for all 2 ad the iequality 6/ π 4, we obtai the result. (ii). Set r = Aσ 2 log = A 2 ε log. Cosider the radom evet A = { y j θ j r, j = 1,..., }. By Lemma 28, the probability of the complemetary evet A c satisfies P (A c ) = P { max 1 j ζ j > r} = P {ε max 1 j η j > A 2 ε log } 1 A2 /8. O the evet A we have, i view of Lemma 30, y j I( y j > 2r) θ j 3 mi( θ j, r). π 16

17 Usig that r = τ/2 this implies ˆθ H θ 2 2 = ˆθ j H θj 2 9 mi ( θj 2, τ 2 4 ) 2 τ = 9 j θj 0 4 = τ 2 9 θ 0 4. (iii). Set B = A/2. The r defied i the proof of part (ii) has the form r = τ. Cosider the evet A defied i the proof of part (ii). Let us show that Ĵ J(θ ) o the evet A. Let ˆθ H j 0. I this case, ˆθ H j = y j y j > τ θ j + εη j > τ, which implies θ j > τ εη j τ r = 0 o the evet A. Therefore, θ j 0. Let us show that J(θ ) Ĵ o the evet A. Let θ j 0. The θ j > 2τ, which yields y j = θ j + εη j > 2τ εη j 2τ r = τ o the evet A. O the other had, by defiitio of ˆθ H, y j > τ ˆθ H j = y j Thus, ˆθ H j 0 with probability 1. There exist other thresholdig estimators behavig similarly as described i Theorem 8. For example, if τ is the same threshold, the soft thresholdig estimator defied as ad the o-egative garrotte estimator 1, defied as ˆθ S j = max (1 τ y j, 0) y j, j = 1,...,, (36) ˆθ j G = max 1 τ 2, 0 y j j = 1,...,, (37) y 2 j have similar risk ad selectio of variables behavior. We ca equivaletly defie the soft ad hard thresholdig estimators i terms of optimizatio programs as described below. Propositio 9. The soft ad hard thresholdig estimators are solutios to the followig optimizatio problems ˆθ H = argmi θ R ˆθ S = argmi θ R 1 This estimator is closely related to the James-Stei estimator. (y j θ j ) 2 + τ 2 θ 0, (38) (y j θ j ) 2 + 2τ θ 1. (39) 17

18 Furthermore, uder Assumptio (ORT), we ca express these two estimators as follows: ˆθ H = argmi θ R ( 1 y Xθ τ 2 θ 0 ), (40) ˆθ S = argmi θ R ( 1 y Xθ τ θ 1 ). (41) Ideed, sice we assume that 1 XT X = I (Assumptio (ORT)) ad we use the otatio 1 XT y, we may write (y j θ j ) 2 = 1 XT y θ 2 2 = θ θt X T y yt XX T y = 1 Xθ θt X T y + 1 yt XX T y y 1 y def = = 1 Xθ y yt XX T y 1 y 2 2 = 1 y Xθ c where c is a costat idepedet of θ. A importat observatio is that the estimators (40) ad (41) ca be used with geeral matrices X ad therefore ca be applied i full geerality i Scearios 1-3 ad ot oly i the Gaussia sequece model. For geeral X, the estimator defied by (40) is called the BIC estimator ad that defied by (41) is called the Lasso estimator. So, the BIC ad Lasso are atural extesios of the hard ad soft thresholdig estimators respectively. 4.2 Sparsity oracle iequality for the BIC We ow retur to the geeral regressio model y = f + ξ. Let τ > 0 be a give threshold. The origial BIC estimator is defied as follows ˆθ BIC = argmi θ R ( 1 y Xθ τ 2 θ 0 ) = argmi θ R ( y f θ 2 + τ 2 θ 0 ). Note that it ca be cosidered ot oly as a estimator for Sceario 1 but it also geerates a oparametric estimator fˆθbic for Sceario 2 ad a aggregate fˆθbic for Sceario 3. To get sharper bouds o the risk, it is coveiet to slightly modify the BIC by replacig the term τ 2 θ 0 by a pealty fuctio pe(θ) defied by pe( θ 0 ) = 2σ2 (1 + C C 2 1 L(θ) + ɛ L(θ)) θ 0 (42) where C 1, C 2 are suitable positive costats, ɛ > 0 is a arbitrary positive umber, ad L(θ) = log ( e θ 0 1 ). 18

19 We will cosider this pealty istead of τ 2 θ 0 ad use a modified defiitio of BIC: θ BIC = argmi θ R ( 1 y Xθ pe( θ 0 )). (43) Both versios of the BIC are pealized least squares estimators where the peality is imposed o the size of the support of θ. However, the BIC optimizatio problem is NP-hard. To see this, we ca reformulate the BIC program as follows mi θ R ( 1 y Xθ pe( θ 0 )) = mi 0 m mi ( 1 θ θ 0 =m y Xθ pe( θ 0 )) = mi ( mi 1 0 m θ θ 0 =m y Xθ pe(m)). Thus, we have to solve m=0 ( m ) = 2 possible least squares problems. Despite the computatioal ufeasibility, the theoretical properties of the BIC estimator ca be aalyzed i detail. I particular, it satisfies the oracle iequalities give i the ext theorem. Theorem 10 (Oracle Iequality for BIC). Fix ɛ > 0. Let θ BIC be defied i (42) (43) with sufficietly large C 1 ad C 2 ad let f BIC = f θbic. The there exists a costat C > 0 such that, for all f, E f BIC f 2 (1 + ɛ) mi ( f θ f 2 + C σ 2 θ 0 e Cσ2 log ( )) + θ R ɛ θ 0 1. (44) I additio, there exists a costat C > 0 such that, for ay 0 < δ < 1 with probability at least 1 δ, f f BIC f 2 (1 + ɛ) mi [ f θ f 2 + C σ 2 θ 0 e Cσ2 log ( )] + θ R ɛ θ 0 1 log (1 δ ). (45) I particular, if f(x) = f θ (x) with θ 0, E ( 1 X( θ BIC θ ) 2 2 ) C σ2 θ 0 log ( e θ 0 1 ). (46) The oracle iequality i expectatio (44) is proved i Birgé ad assart (2007) (see also Johstoe (2013)). For the proof of the iequality i probability (45), see Buea, Tsybakov ad Wegkamp (2004). Remarks. 1. Iequalities of Theorem 10 are sparsity oracle iequalities sice the remaider term depeds oly o θ 0. For istace, the i expectatio versio (44) is of the form E ˆf f 2 K mi θ R ( f θ f 2 2 +, (θ)) (47) where ˆf is a estimator of f, c is a costat, ad, > 0 oly depeds o θ If, depeds o θ 0 ad other features of θ, the the correspodig oracle iequality is sometimes referred to as a balaced oracle iequality. 19

20 2. The sparsity oracle iequalities of Theorem 10 are ot sharp, i.e., the leadig costat K is greater tha 1. I particular, we caot obtai a meaigful boud o the excess risk usig iequality (44). Ideed, sice it is of the form (47) with K > 1 the excess risk ca be oly bouded as E Θ ( ˆf, f) = E ˆf f 2 mi f θ f 2 (K 1) mi f θ f 2 + K sup, (θ). θ Θ θ Θ But this boud is useless i the aggregatio cotext because we have o cotrol of the miimum mi θ Θ f θ f 2 (it ca be arbitrarily large). 3. The oracle iequalities of Theorem 10 hold uder o assumptio o the dictioary f 1,..., f, ad (except for iequality (46)) uder o assumptio of f. 4. Iequality (46) gives a solutio to the questio aouced above, cf. (30). It cotais a oracle term C σ2 θ 0 multiplied by log ( e θ 0 1 ). This factor represets the price to pay for ot kowig the set of o-zero compoets of θ. 5. Istead of the pealty (43) implemeted above, we ca also use the pealty pe(θ) = Cσ 2 θ 0 log. This leads to oracle iequalities similar to those of Theorem 10 except for the logarithmic factors that become slightly suboptimal. ore precisely, log ( e θ 0 1 ) log for this type of pealty. 4.3 Sparsity oracle iequality for the Lasso As i the previous subsectio, here we cosider the geeral regressio model y = f + ξ. Let ˆθ L be the Lasso estimator ˆθ L = argmi θ R ( 1 y Xθ τ θ 1 ) = argmi θ R ( y f θ 2 + 2τ θ 1 ) where τ > 0 is a tuig parameter. Similarly to the BIC estimator, it ca be cosidered ot oly as a estimator for parametric Sceario 1 but also it geerates a oparametric estimator for fˆθl Sceario 2 ad a aggregate for Sceario 3. The followig theorem is a modificatio of a result fˆθl i Koltchiskii, Louici ad Tsybakov (2011). It provides a sparsity oracle iequality i probability with leadig costat 1 for the Lasso estimator. Theorem 11. Let ξ be i.i.d. radom variables, ξ i N (0, σ 2 ) ad let f i 1, j = 1,...,. Let ˆθ L be the Lasso estimator with the tuig parameter τ = Aσ log, A = t δ, t > 2, 0 < δ < 1. The, with probability at least 1 1 t2 /2 we have f fˆθl f 2 mi θ R θ Θ f θ f 2 + C mi σ2 µ 2 (θ) θ log log 0, σ θ 1 (48) 20

21 where C > 0 depeds oly o t ad δ, θ 0 µ(θ) = if µ > 0 J(θ) 1 µ X 2, C θ with C θ = { R J c (θ) δ 1 δ J(θ) 1 }. The Proof. Set for brevity ˆθ L = ˆθ ad G(θ) = y f θ 2 + 2τ θ 1. ˆθ = argmi θ R Deote by (, ) the ier product i R, ad set < f, g > def = 1 i=1 G(θ). f(x i )g(x i ). We ow recall the followig geeral fact from covex aalysis. Lemma 12. For ay covex fuctio G R R we have: ˆθ argmi θ R 0 G(ˆθ), where G(ˆθ) is the subdiffretial of G at poit ˆθ. G(θ) if ad oly if The coditio 0 G(ˆθ) of this lemma obviously implies: there exists B G(ˆθ) such that (B, ˆθ θ) = 0, for all θ R. I the sequel, we will use this property. I our case, ad thus ( y f θ 2 ) = ( 1 y Xθ 2 2) = 2 XT (y Xθ), ( ( y fˆθ 2 ), ˆθ θ) = 2 (XT (y X ˆθ), ˆθ θ) = 2 (X(ˆθ θ), y X ˆθ) = 2 < fˆθ θ, y fˆθ >. Applyig Lemma 12, we get that there exists ˆV ( ˆθ 1 ) such that 2 fˆθ θ, y fˆθ + 2τ( ˆV, ˆθ θ) = 0. (49) Let V be ay elemet of ( θ 1 ). It follows from (49) that 2 fˆθ θ, y fˆθ + 2τ( ˆV V, ˆθ θ) = 2τ(V, ˆθ θ). (50) We ow use the followig fact from covex aalysis applied to the fuctio g(θ) = θ 1. 21

22 Lemma 13. For ay covex fuctio g R R, we have for all V g(θ), V g(θ ). From Lemma 13 ad (50) we fid Sice y = f + ξ, we ca rewrite this i the form: (V V, θ θ ) 0, θ, θ R, 2 fˆθ θ, y fˆθ 2τ(V, ˆθ θ). 2 fˆθ θ, fˆθ f 2τ(V, ˆθ θ) + 2 ξ, fˆθ θ (51) for ay V ( θ 1 ) ad ay θ R. Next, elemetary argumet yields 2 fˆθ θ, fˆθ f = 2 fˆθ f θ, fˆθ f = fˆθ f θ 2 + fˆθ f 2 f θ f 2. (52) Fix some θ R ad let J = J(θ) be the set of o-zero compoets of θ. Write V = V J + V J c where V J R is the vector with compoets V j I(j J), j = 1,...,, where V j are the compoets of V, ad J c = {1,..., }/J is the complemet of J. The (V, ˆθ θ) = (V J, ˆθ θ) + (V J c, ˆθ θ) = (V J, ˆθ θ) + (V J c, ˆθ) sice the compoets of V J c vaish o the support of θ. O the other had, V is ay elemet of ( θ 1 ), ad thus the compoets of V satisfy { V j 1, j J c, V j = sig(θ j ), j J. Choose V such that V j = sig(ˆθ j ) for j J c. This is possible, sice V j ca be ay values satisfyig V j 1 for j J c. The (V, ˆθ θ) = (V J, ˆθ θ) + ˆθ J c 1 = (V J, ) + J c 1 = (V J, J ) + J c 1 where = ˆθ θ ad we used the fact that ˆθ J c 1 = J c 1. This ad (51) imply 2 fˆθ θ, fˆθ f 2τ J 1 2τ J c 1 + 2(H, ), (53) H 1 where H = 1 XT ξ = with H j = 1 H i=1 f j(x i )ξ i. We have used here the idetity ξ, fˆθ θ = (H, ˆθ θ). Note ow that if fˆθ θ, fˆθ f 0 the, i view of (52), we get fˆθ f 2 f θ f 2 22

23 ad the result of the theorem follows i a trivial way. fˆθ θ, fˆθ f 0. But i this case, i view of (53), So, it is eough to cosider the case Assume for the momet that τ J c 1 τ J 1 + H 1. H δτ for some 0 < δ < 1. The, sice 1 = J 1 + J c 1, we have J c δ 1 δ J 1. I other words, it suffices to cosider C θ, where C θ = { R J c δ 1 δ J 1 } ad J = J(θ) is the set of o-zero compoets of θ. We ow retur to (53), ad boud the terms o the right-had side of (53). Usig that H δτ ad C θ we get This ad (53) imply 2τ J 1 2τ J c 1 + 2(H, ) 2τ J 1 2τ J c H 1 = 2τ J 1 2τ J c 1 + 2δτ( J 1 + J c 1 ) Combiig this with (52) we get 2τ(1 + δ) J 1. (54) 2 fˆθ θ, fˆθ f 2τ(1 + δ) J 1. fˆθ f 2 f θ f 2 fˆθ f θ 2 + 2τ(1 + δ) J 1. (55) Sice C θ, we get θ 0 J 1 µ(θ) X 2 = µ(θ) θ 0 f = µ(θ) θ 0 fˆθ f θ. This ad the elemetary iequality 2ab a 2 + b 2 yield 2τ(1 + δ) J 1 2τ(1 + δ)µ(θ) θ 0 fˆθ f θ τ 2 (1 + δ) 2 µ 2 (θ) θ 0 + fˆθ f θ 2. (56) Combiig (55) ad (56) we obtai θ, f fˆθ f 2 f θ f 2 + τ 2 (1 + δ) 2 µ 2 (θ) θ 0. (57) Note that this iequality is proved for all θ R ad all f, uder the assumptio that H δτ. 23

24 Let us ow show that H δτ holds with probability at least 1 1 t2 /2. Cosider the radom evet A = { H δτ}. The probability of its complemet P (A c ) is estimated as follows Here, for each j, P (A c ) = P ( max 1 j 1 1 i=1 i=1 f j (X i )ξ i > δτ) P ( 1 f j (X i )ξ i N (0, σ2 f j 2 ). i=1 It follows similarly to Lemma 28 that, sice f j 2 1 for all j, we get P ( H δτ) = P log H > σt 1 t2 /2. To fiish the proof of the theorem, we show that, o the same evet A, f j (X i )ξ i > δτ). θ, f fˆθ f 2 f θ f 2 + C σ θ 1 log (58) for a costat C > 0 depedig oly o t ad δ. Ideed, sice G(ˆθ) G(θ) for all θ R, we get, by a simple algebra, fˆθ f 2 f θ f ξ, fˆθ f θ + 2τ θ 1 2τ ˆθ 1. (59) Sice ξ, fˆθ f θ = (H, ˆθ θ) ad H δτ o A, we fid 2 ξ, fˆθ f θ 2δτ ˆθ θ 1 + 2τ( θ 1 ˆθ 1 ) 2τ(1 + δ) θ 1. (60) Combiig (59) ad (60) we get (58) with C = 2t(1 + 1/δ). Fially, the theorem follows from (57) ad (58). The costat C i (48) ca be take equal to max(t 2 (1 + 1/δ) 2, 2t(1 + 1/δ)). 5 ixig with expoetial weights Let f 1,..., f be give fuctios formig a dictioary. Set ˆr j = y f j 2. This is the empirical risk of f j. The expoetially weighted aggregate is defied by where ˆθ EW = (ˆθ EW 1,..., EW ˆθ ) with ˆf EW = ˆθ EW j = ˆθ j EW f j = fˆθew exp( ˆr j /β)π j k=1 exp( ˆr k/β)π k 24

25 for some β > 0 ad some set of prior probabilities π k > 0, k=1 π k = 1. This defiitio has bee brought to achie Learig by Vovk (1990), Littlestoe ad Warmuth (1994). There exist two heuristic iterpretatios of expoetial weightig. 1. Quasi-bayesia iterpretatio. The weights ˆθ EW defie a posterior distributio (which is the Gibbs distributio if π k are uiform) i the phatom model Y i = f θ (X i ) + ξ i, i = 1,...,, where ξ i are i.i.d. N (0, β 2 ) radom variables, θ {e 1,..., e }, ad π j are prior probabilities of e j. 2. Variatioal iterpretatio. It is ot hard to check that ˆθ EW is a solutio of the followig miimizatio problem: ˆθ EW = argmi θ j ˆr j + β θ Λ K(θ, π) where K(θ, π) = θ j log θ j π j is a simplex. Note that is the Kullback-Lieibler divergece betwee θ ad π, ad Λ = {θ θ j 0, θ j ˆr j = θ j y f j 2 θ j = 1} y f θ 2. Jese Thus, ˆθ EW miimizes a upper approximatio of the empirical risk pealized by Kullback- Leibler divergece from π: y f θ 2 + β K(θ, π). Note that K(θ, π) 0 ad K(θ, π) = 0 θ = π. So, we pealize the solutio for beig too far from the prior π. I what follows we set for brevity w j = EW ˆθ j, Z = exp( ˆr k /β)π k. k=1 The followig propositio goes back to Vovk (1990) who cosidered a determiistic model. Ideed, o assumptio o the distributio of y is eeded. Propositio 14. The value ˆr = w j ˆr j satisfies As a cosequece, for all y, ˆr mi 1 j (ˆr j + β log 1 π j ). y ˆf EW 2 mi 1 j ( y f j 2 + β log 1 π j ). 25

26 Note that if π j = 1, j = 1,..., (the uiform prior), the β log 1 π j = β log, which is the optimal rate of S-aggregatio. But the boud is for the empirical risk y ˆf EW 2 ad ot for the risk E f ˆf EW 2. Also, o the RHS we have the empirical risk y f j 2 ad ot the discrepacy f f j 2 as expected i our oracle iequalities. Proof. Take logarithms of both sides of the equatio The, for ay k ad j, we have w j = exp( ˆr j/β)π j Z. so that log Z = ˆr k β + log 1 π k + log w k, log Z = ˆr j β + log 1 π j + log w j, ˆr k β Thus, usig that log w j 0, we get ˆr = = ˆr j β + log 1 π j log 1 π k + log w j log w k. k=1 w kˆr k ˆr j + β log 1 π j β k=1 w k log w k π k K(w,π) Sice K(w, π) 0 the first result of the propositio follows. The secod result is obtaied from the first oe usig the iequalities:. y EW ˆf = w jf j 2 = w j (y f j ) 2 Jese w j y f j 2 = w j ˆr j = ˆr. The ext propositio is ispired by the argumet i Leug ad Barro (2006). Propositio 15. (i) If β = 4σ 2, the ˆr σ 2 is a ubiased estimator of the risk : E ˆf EW f 2 = E(ˆr) σ 2. (ii) If β > 4σ 2, the with Proof. First, recall that E ˆf EW f 2 E(ˆr) σ 2. ˆf EW ( ) = w j f j ( ) w j = w j (y) = exp( β ˆr j)π j Z 26

27 where Z = k=1 exp( β ˆr k)π k, ˆr j = y f j 2. By Stei ubiased risk estimatio formula (see e.g. Tsybakov (2009), p. 157), the statistic ˆR def = y ˆf EW 2 + 2σ2 ˆf EW (X i ) σ 2 i=1 Y i is a ubiased estimator of the risk E f ˆf EW 2, i.e., E( ˆR) = E f ˆf EW 2. (61) Let us compute ˆR. Note that i the defiitio of ˆf EW oly the weights w j deped o Y 1,..., Y. So, we eed first to fid the derivative w j(y) Y i. Recall that ˆr j = 1 i=1 (Y i f j (X i )) 2. Hece, ˆr j Y i = 2 (Y i f j (X i )) ad we have w j = exp( β ˆr j)π j Y i Z 2 [ 2 β (Y i f j (X i ))Z + 2 β k=1 = 2w j β [(Y i f j (X i )) + (Y i f k (X i ))w k ] k=1 = 2 β (f j(x i ) ˆf EW (X i ))w j. (Y i f k (X i )) exp( β ˆr k)π k ] (62) O the other had, sice w j 0, w j = 1, we have, by the bias-variace decompositio with respect to the distributio defied by {w j }, Note also that, for all i, ˆf EW y 2 = Combiig (62) (64) we obtai ˆR = ˆr = ˆr = ˆr = w j f j y 2 w j f j ˆf EW 2 (63) w j ˆr j w j f j ˆf EW 2. =ˆr ( w j Y i ) ˆf EW (X i ) = ˆf EW (X i ) Y i w j f j ˆf EW 2 + 2σ2 w j f j ˆf EW 2 + 4σ2 β i=1 w j =1 = 0. (64) ( w j Y i ) f j (X i ) σ 2 (f j (X i ) ˆf EW (X i )) 2 w j σ 2 i=1 (1 4σ2 β ) w j f j ˆf EW 2 σ 2. 27

28 Takig expectatios of both sides of this iequality ad usig (61) we fid which implies the propositio. Theorem 16. For β 4σ 2 we have E ˆf EW f 2 = E(ˆr) σ 2 (1 4σ2 β ) E w j f j ˆf EW 2 I particular, if π j = 1, j = 1,...,, E ˆf EW f 2 mi 1 j ( f f j 2 + β log 1 π j ). E ˆf EW f 2 mi f f j 2 + β log. 1 j Proof. Propositios 14 ad 15, ad the fact that E(ˆr j ) = E y f j 2 = f f j 2 + σ 2 imply E ˆf EW f 2 E(ˆr) σ 2 mi (E(ˆr j) + β 1 j log 1 ) σ 2 π j = mi 1 j ( f f j 2 + β log 1 π j ). Remarks 1. Theorem 16 is proved i Dalalya ad Tsybakov (2007, 2008) where the result has a more geeral form: λ Λ E ˆf EW f 2 mi λ j f f j 2 + β K(λ, π). (65) Ideed, the right-had side of (65) does ot exceed mi λ {e 1,...,e } λ j f f j 2 + β K(λ, π) = mi 1 j ( f f j 2 + β log 1 π j ). 2. The right-had side of (65) is remiiscet of the variatioal iterpretatio of the expoetial weighted estimator. If we replace r j = f f j 2 by ˆr j = y f j 2, ˆf EW is obtaied by the miimizatio : mi λ j ˆr j + β λ Λ K(λ, π), which is the empirical aalog of the right-had side of (65). 28

29 3. Leug ad Barro (2006) have proved a result aalogous to Theorem 16 for the case where f j are ot ay fixed fuctios but rather the least squares estimators o liear subspaces of R. These estimators are costructed from the same sample y that is used to compute the weights. I their case, the expoetial weights are slightly differet. Namely, they take w j = exp ( ˆr j β k=1 exp ( ˆr k β dim(j) 2 ) π j dim(k) 2 ) π k where dim(j) is the dimesio of the space o which the jth least squares estimator projects. 6 Sparsity patter aggregatio I this sectio, we describe a aggregatio procedure that will be show to achieve uiversal aggregatio. Let P = {0, 1}. We call a sparsity patter ay biary vector p P. We deote by p def = p 0 the umber of oes i p. To each sparsity patter p P we associate a liear subspace S p of R : p S p def = spa {e j p j = 1}, dim(s p ) = p. From the iitial sample y, we cloe two radomized idepedet samples y (1) R ad y (2) R with radom errors N (0, 2σ 2 ), cf. Sectio 2. For each p P, we costruct a least squares estimator ˆθ p o S p based o the first sample y (1) : ˆθ p = argmi θ S p y (1) f θ 2. Set ˆr p = y (2) 2 ad defie a vector ˆθ SP A = (ˆθ SP A fˆθp p, p P) with compoets ˆθ SP A p = exp( ηˆr p /β)π p, p P. p P exp( ηˆr p /β)π p Here, {π p } is a prior probability measure o P with π p 0 (ot ecessarily π p > 0; π p = 0 is possible, o the differece from priors i Sectio 5). Note that ˆθ SP A R 2. The Sparsity Patter Aggregate is defied by ˆf SP A def = ˆθ p SP A. fˆθp p P From Theorem 16 we get: If β = 8σ 2 (because σ 2 2σ 2 after sample cloig) the From Propositio 2, f E ˆf SP A f 2 mi p P,π p 0 [E fˆθp f 2 + 8σ2 log 1 π p ]. (66) f 2 mi E fˆθp f θ S p θ f 2 + 2σ2 p. (67) 29

30 Combiig (66) ad (67), ad choosig a appropriate prior π p we obtai our mai result that will be stated below. Namely, we will use the prior π p = (( p )e p H) 1 if p R, 1/2 if p =, 0 otherwise, (68) where H = 2 R k=0 e k 2 k=0 e k = 2e/(e 1). Clearly, p P π p = 1. Ideed, R π p = ( p P, p R k=0 k ) 1 ( k )ek H = R k=0 e k H ES Defiitio 17. Expoetial Screeig (ES) estimator ˆf is defied as a sparsity patter aggregate ( ˆf SP A ) with the prior π p give i (68). The correspodig vector of weights is deoted by ˆθ ES. Remark. The prior (68) ca be called a sparsity prior because it dowweights expoetially the o-sparse vectors. The oly exceptio is doe for the most o-sparse vector (the oe with all o-zero compoets) for which we keep the global least squares estimator with weight 1/2. This poit is techical; we itroduce it for mathematical coveiece i order to simplify the proofs. From (66) with p = we obtai E ˆf ES f 2 E fˆθls f 2 + 8σ2 log 2 where we have used that ˆθ p for p = coicides with the global least squares estimator ˆθ LS. This iequality ad Propositio 2 imply: = 1 2. E ˆf ES f 2 mi f θ f 2 + 2σ2 R θ R + 8σ2 log 2. (69) Let p(θ) P be the sparsity patter of θ R, i.e., a vector with compoets p j (θ) = 1 if θ j 0, ad p j (θ) = 0 otherwise. Note that p(θ) = θ 0. Usig (66) ad (67), we get E ˆf ES f 2 mi [mi f p P p R θ S p θ f 2 + 2σ2 p + 8σ2 log 1 ] π p {θ p(θ)=p} S p = p(θ) = θ 0 mi p P p R mi [ f θ f 2 + 2σ2 p(θ) + 8σ2 θ p(θ)=p log ( 1 )] π p(θ) mi [ f θ f 2 + 2σ2 θ 0 + 8σ2 θ R θ 0 R log ( 1 )]. π p(θ) Now, we eed to boud log 1 π p(θ). We use the followig fact: ( k ) (e K ) k. 30

Aggregation and minimax optimality in highdimensional

Aggregation and minimax optimality in highdimensional Aggregatio ad miimax optimality i highdimesioal estimatio Alexadre B. Tsybakov Abstract. Aggregatio is a popular techique i statistics ad machie learig. Give a collectio of estimators, the problem of liear,

More information

Summary and Discussion on Simultaneous Analysis of Lasso and Dantzig Selector

Summary and Discussion on Simultaneous Analysis of Lasso and Dantzig Selector Summary ad Discussio o Simultaeous Aalysis of Lasso ad Datzig Selector STAT732, Sprig 28 Duzhe Wag May 4, 28 Abstract This is a discussio o the work i Bickel, Ritov ad Tsybakov (29). We begi with a short

More information

Convergence of random variables. (telegram style notes) P.J.C. Spreij

Convergence of random variables. (telegram style notes) P.J.C. Spreij Covergece of radom variables (telegram style otes).j.c. Spreij this versio: September 6, 2005 Itroductio As we kow, radom variables are by defiitio measurable fuctios o some uderlyig measurable space

More information

A survey on penalized empirical risk minimization Sara A. van de Geer

A survey on penalized empirical risk minimization Sara A. van de Geer A survey o pealized empirical risk miimizatio Sara A. va de Geer We address the questio how to choose the pealty i empirical risk miimizatio. Roughly speakig, this pealty should be a good boud for the

More information

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization ECE 90 Lecture 4: Maximum Likelihood Estimatio ad Complexity Regularizatio R Nowak 5/7/009 Review : Maximum Likelihood Estimatio We have iid observatios draw from a ukow distributio Y i iid p θ, i,, where

More information

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator
