AGGREGATION AND HIGH-DIMENSIONAL STATISTICS (preliminary notes of Saint-Flour lectures, July 8-20, 2013)

Size: px
Start display at page:

Download "AGGREGATION AND HIGH-DIMENSIONAL STATISTICS (preliminary notes of Saint-Flour lectures, July 8-20, 2013)"

Transcription

1 AGGREGATION AND HIGH-DIENSIONAL STATISTICS (prelimiary otes of Sait-Flour lectures, July 8-20, 2013) Alexadre B. Tsybakov (CREST-ENSAE) October 30, Itroductio Give a collectio of estimators, the problem of liear, covex or model selectio type aggregatio cosists i costructig a ew estimator, called the aggregate, which is early as good as the best amog them (or early as good as their best liear or covex combiatio), with respect to a give risk criterio. Whe the uderlyig model is sparse, which meas that it is well approximated by a liear combiatio of a small umber of fuctios i the dictioary, the aggregatio techiques tur out to be very useful i takig advatage of sparsity. O the other had, aggregatio is a geeral techique of producig adaptive oparametric estimators, which is more powerful tha the classical methods sice it allows oe to combie estimators of differet ature. Aggregates are usually costructed by mixig the iitial estimators or fuctios of the dictioary with datadepedet weights that ca be computed is several possible ways. Importat example is give by aggregates with expoetial weights. They satisfy sharp oracle iequalities that allow oe to treat i a uified way three differet problems: Adaptive oparametric estimatio, aggregatio ad sparse estimatio. To be able to demostrate the mai ideas without excessive techicalities, throughout this course we will deal with a simple model, amely the Gaussia regressio model with fixed desig. Suppose that we observe {(Y i, X i )} i=1 such that Y i = f(x i ) + ξ i, i = 1,...,, (1) where X is a arbitrary set, f X R is a ukow fuctio, X i X are oradom, ad the radom errors ξ i are i.i.d. Gaussia with mea zero ad variace σ 2, ξ i N (0, σ 2 ). The overall goal is to costruct a estimator ˆf for f based o the observatios {(Y i, X i )} i=1. To measure how good ˆf is, we use the squared error loss of the form ˆf f 2 = 1 i=1 ( ˆf(X i ) f(x i )) 2 ad we defie the risk of estimator ˆf as E ˆf f 2. The pseudo-orm f is referred to as the empirical orm of a fuctio defied o X. For vectors b R, we will also cosider the empirical l 2 -orm defied by b 2 = 1 i=1 b2 i, while b 2 2 = i=1 b2 i defies the usual l 2-orm b 2. Assume that we are give a collectio of fuctios {f 1,..., f } called the dictioary, where f j X R. Assume also that we are give a subset Θ of R. For θ = (θ 1,..., θ ) Θ we cosider 1

2 the liear combiatios f θ defied by f θ (x) def = θ j f j (x), x X. Fuctios f θ are thought to be approximatios of the ukow f. Assumig the dictioary {f 1,..., f } to be rich eough ad sufficietly large, these approximatios ca be satisfactory. Therefore, the estimatio of f may be reduced to estimatig θ j, leadig a estimator ˆf = fˆθ = ˆθ j f j, where ˆθ j are suitable estimators of θ j. The the aim is to mimimize the risk by choosig a optimal ˆθ j. However, depedig o the assumptios we make about the dictioary, the set Θ ad f, we are lead to differet optimality properties. We itroduce below three scearios ad discuss how these assumptios ifluce the costructio of the estimators ad the optimality framework. 1.1 Sceario 1: Liear Regressio ad Sparsity Assume that the true f is a liear combiatio of the fuctios from the dictioary: θ R f(x) = f θ (x) = θ j f j (x). (2) The we are i the usual liear regressoi framework, ad the observatios ca be writte i the followig form y = Xθ + ξ, where Y 1 f 1 (X 1 ) f (X 1 ) y =, ξ =, X =. (3) f 1 (X ) f (X ) Y ξ 1 ξ Estimatio of f is ow reduced to estimatio of θ. Classical theory of liear regressio deals with cases where, which is a ecessary coditio of idetifiability of θ whe we oly kow that θ R. However, i recet years there is a icreasig applied iterest i the problems where is greater tha ad ofte. I this case, f is ot idetifiable without additioal assumptios o θ. A atural ad most popular additioal assumptio is a sparsity costrait o θ. It cosists i restrictig the parameter θ to the class Θ = B 0 (s) where B 0 (s) is the l 0 -ball i R : Here, B 0 (s) = {θ R θ 0 s}, s = 1,...,. (4) def θ 0 = I(θ j 0) is the l 0 orm. Vectors θ belogig to B 0 (s) are called s-sparse. It turs out that, uder the s-sparisty restrictio, estimatio with reasoable accuracy is possible. We may ask ourselves the followig questio. 2

3 Questio 1. What is the optimal way to estimate θ if we kow that θ B 0 (s)? Let ˆθ be a estimator of θ. The correspodig estimator of f is the ˆf = fˆθ = ad the squared risk defied above takes the form ˆθ j f j E ˆf f 2 = E ( 1 X(ˆθ θ ) 2 2). This is kow uder the ame of predictio risk for liear regressio. The optimality is usually defied i a miimax sese. A estimator ˆθ is called optimal if there exists a sequece of positive umbers ψ,,s such that, for all ad, the followig two coditios are satisfied: if T sup E ( 1 θ B 0 (s) X(ˆθ θ ) 2 2) Cψ,,s (5) sup θ B 0 (s) E ( 1 X(T θ ) 2 2) cψ,,s (6) where C ad c are positive costats idepedet of,, s, ad if T deotes the miimum over all estimators of θ based o the sample {(Y i, X i )} i=1. This is commoly referred to as the miimax optimality. A sequece ψ,,s such that (5) ad (6) hold is called miimax rate of covergece (or optimal rate of covergece) o B 0 (s). To summarize, our mai goal i this sceario is to fid a miimax optimal estimator ˆθ o the class B 0 (s). Alog with B 0 (s), other classes ca be cosidered, such as l q -balls with 0 < q. This problem, i its simplest versio where X T X/ is the idetity (the Gaussia sequece model) ad with asymptotic poit of view, has bee i the focus of statistical literature from the 1990ies, with the mai developmets due to Dooho ad Johstoe. We are iterested here i a more geeral liear regressio settig ad we deal with o-asymptotic miimax optimality. 1.2 Sceario 2: Noparametric Regressio Let f F β,l where F β,l, typically, is a class of smooth fuctios parametrized by β > 0 ad L > 0. Roughly speakig, parameter β is the umber of derivatives of f that are assumed bouded i some orm by costat L. I this sceario, it is usually assumed that the dictioary {f 1,..., f } is composed of the first fuctios of some orthoormal basis. For example, it ca be the Fourier or wavelet basis. A key assumptio i the oparametric regressio settig is that the true fuctio f ca be approximated by a liear combiatio of the basis fuctios. It ca be stated, for example, i the followig form. Let f F β,l. The, for all = 1, 2,... there exists θ = θ (f) R such that f θj f j C β, (7) where C is a costat depedig oly o β, L. Here, i geeral, f def θ = θ j f j f, which is i cotrast with the liear regressio settig. Like i the liear regressio case, we are iterested i optimal estimatio of f. 3

4 Questio 2. What is the miimax optimal estimator of f o the class F β,l? As before, a miimax optimal estimator ˆf is the oe that satisfies if f sup E ˆf f 2 Cψ,β, (8) f F β,l sup E f f 2 cψ,β, (9) f F β,l where C ad c are positive costats idepedet of β ad L, ad if f deotes the miimum over all estimators of f based o the sample {(Y i, X i )} i=1. A sequece ψ,β such that (8) ad (9) hold is called miimax rate of covergece (or optimal rate of covergece) o F β,l. Questio 3. How to costruct a adaptive estimatio procedure? A adaptive estimator is a estimator ˆf which is idepedet of β ad L ad satisfies (8) with optimal rate of covergece ψ,β for all pairs (β, L) i a wide rage of values. 1.3 Sceario 3: Aggregatio of estimators The geeral mathematical framework of aggregatio is itroduced by Nemirovski i his Sait-Flour lectures i 1998 (published as Nemirovski (2000)). Nemirovskii (2000) outlied three problems: model selectio type aggregatio, covex aggregatio, ad liear aggregatio. ore geerally, the problem of aggregatio is stated as follows. Suppose that we are give a collectio of prelimiary estimators ˆf 1,..., ˆf of f ad a subset Θ of R. The goal is to fid a ew estimator f, called the aggregate, which is approximately at least as good as the best liear combiatio f θ = θ j ˆf j restricted to θ Θ. The best liear combiatio is defied as the oe that solves the problem mi θ Θ E f f θ 2 miimizig the squared risk. Ulike i the previous scearios, here f θ is a radom fuctio depedig o the data. I cotrast to those scearia, we do ot assume that f f θ is zero or small (see (2), (7)); it may happe that all f θ for some Θ are very far from the true f. So, the choice of Θ is importat for aggregatio problems. Some examples of Θ are listed below. 1. L-aggregatio (Liear aggregatio): Θ = R. The aim of liear aggregatio is to costruct a estimator f, which is approximately as good as the best liear combiatio of the iitial estimators ˆf 1,..., ˆf. 2. C-aggregatio (Covex aggregatio): Θ is the simplex Θ = Λ def = {θ R θ j 0, θ j = 1}. The aim of covex aggregatio is to costruct a estimator f, which is approximately as good as the best covex combiatio of the iitial estimators ˆf 1,..., ˆf. 3. S-aggregatio (odel Selectio type aggregatio): Θ = {e 1,..., e } where e i are the caoical basis vectors i R. The aim of S-aggregatio is to costruct a estimator f, which is approximately as good as the best amog the iitial estimators ˆf 1,..., ˆf. 4

5 4. s-sparse aggregatio: Θ = B 0 (s) def = {θ R θ 0 s} where s {1,..., }. 5. L q -aggregatio: Θ = B q (τ) def = {θ R θ q τ} where θ q = ( θ j q ) 1 q is the usual l q -orm. Other types of aggregatio will be discussed below as well. Note that for liear, covex ad S aggregatio the sets Θ ca be expressed as itersectios of l 0 ad l 1 balls. Ideed, for liear aggregatio, Θ = R = B 0 (), where B 0 () is the l 0 -ball of radius. For covex aggregatio, the simplex is icluded ito B 1 + (1) a itersectio of the l 1-ball B 1 (1) with the coe of positive coordiates. For S-aggregatio, Θ = {e 1,..., e } = B 0 (1) B 1 + (1). The goal of aggregatio is to mimic the best liear combiatio of iitial estimators with weights restricted to a give set Θ of possible weights. The word best here is formalized as choosig f with the smallest possible excess risk (also kow uder the ame of regret) defied by E Θ ( f, f) def = E f f 2 if θ Θ E f θ f 2. (10) Based o the excess risk, we ca itroduce the cocept of miimax optimality for aggregatio. A estimator f is called a optimal aggregate for the class Θ if there exists a sequece of positive umbers ψ, (Θ) such that sup { sup E Θ ( f, f)} Cψ, (Θ), (11) ˆf 1,..., ˆf f sup { if ˆf 1,..., ˆf ˆf sup E Θ ( ˆf, f)} cψ, (Θ). (12) f Here, if ˆf is the miimum over all estimators, C ad c are positive costats idepedet of ad, ad sup ˆf1,..., ˆf, sup f are the suprema over wide classes of prelimiary estimators ad fuctios f. I some cases, these will be all possible estimators ad all possible f with o restrictio; i other cases it will suffice to cosider classes of ˆf 1,..., ˆf ad f satisfyig a boudedess assumptio i the empirical orm. If (11) ad (12) hold for some sequece ψ, (Θ), this sequece is called a optimal rate of aggregatio for the class Θ. The questios arisig i this cotext are as follows. Questio 4. How to costruct a optimal aggregate f for a give class Θ? Questio 5. Is it possible to costruct a uiversal aggregate, i.e., a aggregate which is optimal simultaeously for a large scale of classes Θ? The last questio is of the same ature as Questio 3 cocerig adaptive oparametric estimatio. Iequalities (11) ad (12) establish upper ad lower bouds for the miimax risk, respectively. The upper bouds (11) ca be equivaletly writte i the form of oracle iequalities E f f 2 if θ Θ E f θ f 2 + Cψ, (Θ), ˆf 1,..., ˆf, f, (13) which say that the risk of the suggested aggregate f is at least as good as the risk of the ukow oracle θ miimizig E f θ f 2, up to a small remaider term of the order ψ, (Θ) (a price to pay for aggregatio ). Lower bouds (12) say that this is the miimal price; the remaider term caot be of a smaller order whatever is the aggregate. For the sparsity classes, for example, Θ = B 0 (s), the rate ψ, (Θ) is a fuctio of s; the correspodig oracle iequalities are called sparsity oracle iequalities. 5

6 1.4 Outlie The mai message of this course is that there are methods that solve problems described i Sectios 1.1, 1.2, ad 1.3 simultaeously. We will cosider methods like the BIC, the Lasso, ad the expoetial weightig, provide oracle iequalities ad discuss lower bouds for the three above scearios i a uified framework. We will establish the optimal rates of aggregatio. Aticipatig, for the mai types of aggregatio they are give i the followig table where R = Rak(X) deotes the rak of matrix X. Problem ψ, (Θ) S-aggregatio C-aggregatio L-aggregatio σ 2 R σ 2 R σ2 log σ 2 σ log (1 + ) σ 2 R Table 1. We will also show that the techique of expoetial weightig achieves uiversal aggregatio. 2 From aggregatio of estimators to aggregatio of fuctios Aggregates are usually costructed i the form f = ˆθ j ˆfj where ˆθ j are suitably chose statistics measurable with respect to the data. The aalysis is more ivolved if both ˆθ j ad the prelimiary estimators ˆf j are costructed from the same sample {(Y i, X i )} i=1. To avoid this, the idea put forward by Nemirovski (2000) is to obtai two idepedet samples from the iitial oe by radomizatio (sample cloig). The estimators ˆf j are costructed from the first sample while the secod oe is used to perform aggregatio, i.e., to compute the weights ˆθ j. To carry out the aalysis of aggregatio, it is eough to work coditioally o the first sample, so that ˆf j ca be cosidered as determiistic fuctios. Thus, the problem reduces to aggregatio of determiistic fuctios that we will deote as previously f j = ˆf j, j = 1,...,. A limitatio is that this type of radomizatio oly applies to Gaussia model with kow variace. Nevertheless, the idea of two-step procedures carries over to models with i.i.d. observatios where oe ca do direct sample splittig (see, e.g., Rigollet ad Tsybakov (2007); Lecué (2011)). Thus, i may cases aggregatio of estimators ca be achieved by reductio to aggregatio of determiistic fuctios. Alog with this approach, oe ca aggregate estimators usig the same observatios for estimatio ad aggregatio. While for geeral estimators this would clearly result i overfittig, the idea proved to be successful for certai types of estimators, first for projectio estimators (Leug ad Barro (2006)) ad more recetly for a more geeral class of liear (affie) estimators (Dalalya ad Salmo (2012)). The procedure of sample cloig by radomizatio is based o the followig elemetary lemma. 6

7 Lemma 1. Let Y i = f(x i ) + ξ i. Let ω i be a stadard ormal radom variable idepedet of ξ i. Set The we have Y i1 = Y i + σω i, Y i2 = Y i σω i. Y i1 = f(x i ) + ξ i1, Y i2 = f(x i ) + ξ i2, where ξ i1 N (0, 2σ 2 ), ξ i2 N (0, 2σ 2 ) ad ξ i1 is idepedet of ξ i2. Thus, we obtai two idepedet Gaussia samples D 1 = {(Y i1, X i )} i=1 ad D 2 = {(Y i2, X i )} i=1, where Y ik = f(x i )+ξ ik, k = 1, 2. Both samples are of the same form as the origial oe {(Y i, X i )} i=1, with the oly differece that the variace of the oise is doubled. Now, we use D 1 to costruct prelimiary estimators ˆf 1,..., ˆf ad we use D 2 to determie the weights ˆθ 1,..., ˆθ. Deotig by E (k) the expectatios with respect to the distributio of D k for k = 1, 2, we may write the oracle iequality (13) that we eed to prove i the form E (1) E (2) f f 2 if θ Θ E (1) f θ f 2 + Cψ, (Θ). (14) Clearly, to obtai (14) it suffices to show that, for ay fixed fuctios f 1,..., f, f (possibly satisfyig some mild assumptios), we have E (2) f f 2 if θ Θ f θ f 2 + Cψ, (Θ), (15) where f θ is a liear combiatio of f 1,..., f, ad f = ˆθ j f j with ˆθ j measurable with respect to D 2. Thus, usig the sample cloig device, we ca reduce aggregatio of estimators to its special case, which is aggregatio of fixed fuctios. The, the miimax framework modifies oly i that the excess risk takes the form E Θ ( f, f) def = E f f 2 if θ Θ f θ f 2 (16) (o expectatio i the term if θ Θ f θ f 2 ). I this settig, a estimator f is a optimal aggregate for the class Θ if there exists a sequece of positive umbers ψ, (Θ) such that (11) ad (12) are satisfied where ˆf j s are replaced by f j s. The upper boud o the maximum excess risk is equivalet to the oracle iequality E f f 2 if θ Θ f θ f 2 + Cψ, (Θ), f 1,..., f, f. (17) Oce such a oracle iequality is established, we ca obtai upper bouds for the miimax risk i Scearios 1 ad 2 as simple corollaries. Ideed, those scearios itroduce additioal strog restrictios o f, i particular, that the oracle risk if θ Θ f θ f 2 is either 0 (for Sceario 1) or admits a give boud, cf. (7) (for Sceario 2). 7

8 3 Least squares aggregatio A first simple idea is to costruct aggregates via the least squares (LS). Give a set Θ ad a collectio of determiistic fuctios f 1,..., f, we take ad we defie the LS aggregate as ˆθ LS (Θ) = argmi y f θ 2 θ Θ f = fˆθls (Θ) = ˆθ j LS (Θ)f j. We are goig to show that this idea works for liear ad covex aggregatio but fails for Saggregatio. Recall that we deote by X the matrix f 1 (X 1 ) f (X 1 ) X =. f 1 (X ) f (X ) Propositio 2 (Liear aggregatio). Let ˆθ LS def = ˆθ LS (R ) be a least squares estimator o the set Θ = R. The for all f, f 1,..., f we have where R = Rak(X). E fˆθls f 2 = mi θ R f θ f 2 + σ2 R. Proof. I what follows, with a slight abuse of otatio, we will deote by f ad f θ ot oly the fuctios from X to R but also the -vectors of values of these fuctios at poits X 1,..., X. The, with the otatio from (3), the model of observatios (1) ca be writte as y = f + ξ. Also, f θ = Xθ for all θ ad, i particular, fˆθls = X ˆθ LS = Ay where A is the orthogoal projector o Im(X). Sice y = f + ξ we have which yileds Sice A is the projector o Im(X), O the other had, Af f 2 = fˆθls f 2 = Ay f 2 = A(f + ξ) f 2, E fˆθls f 2 = Af f 2 + E Aξ 2. mi v v Im(X) f 2 = mi Xθ f 2 = mi f θ f 2. θ R θ R ad the propositio follows. E Aξ 2 = σ2 Tr(A) = σ2 R A similar result is valid for the LS estimator o ay covex subset of R. (18) 8

9 Propositio 3. Let Θ be a closed covex subset of R. squares estimator o Θ satisfies where R = Rak(X). E fˆθls (Θ) f 2 = mi θ Θ f θ f 2 + 4σ2 R The, for all f, f 1,..., f, the least Proof. Set for brevity ˆθ = ˆθ LS (Θ), f = fˆθls (Θ). First, by a simple algebra, for ay g = f θ with θ Θ, usig that y f 2 y g 2 ad y = f + ξ, we deduce that f f 2 f g < f g, ξ > where Thus, for ay θ Θ, < f, g > def = 1 i=1 f(x i )g(x i ). f f 2 f f θ < f f θ, ξ >. (19) We may write < f f θ, ξ >=< fˆθ f θ, ξ >= fˆθ f θ < u, ξ > where u = fˆθ f θ fˆθ f θ while belogs to Im(X), ad u = 1. Therefore, < f f θ, ξ > fˆθ f θ sup < u, ξ > u Im(X) u =1 sup < u, ξ > = sup < u, Aξ > Aξ u Im(X) u =1 u Im(X) u =1 where, as i the proof of Propositio 2, A deotes the orthogoal projector o Im(X). Hece Let θ be the miimizer of f θ f o Θ: The, i view of the covexity of Θ, 2 < f f θ, ξ > 2 f f θ Aξ 1 2 f f θ Aξ 2. (20) f θ f 2 = mi θ Θ f θ f 2. f f 2 f f θ 2 + f θ f 2. (21) Settig θ = θ i (19) (20) combiig these iequalities with (21) we obtai f f 2 f θ f Aξ 2. The result of the propositio ow follows by takig the expectatios of both sides of this iequality ad usig (18). We ow cosider the LS estimator o ay (ot ecessarily covex) subset of the simplex Λ. The, alog with the rate of covergece obtaied i Propositio 3 we ca obtai a differet rate as shows the ext result. 9

10 Propositio 4. Let Θ be a closed subset of Λ. The for all f ad all dictioaries f 1,..., f such that f j L, j = 1,...,, least squares estimator o Θ satisfies E fˆθls (Θ) f 2 mi θ Θ f θ f 2 + 4σL 2 log. Now, Proof. It follows from (19) that, for all θ Λ, Note that, for ay θ Λ we have E f f 2 f f θ 2 + 2E < f f θ, ξ >. E < f f θ, ξ > E max θ Λ < f θ f θ, ξ >= E max 1 j < f j f θ, ξ >. f θ 2 = 1 θ j f j (X i ) i=1 1 θ j fj 2 (X i ) = Therefore, f j f θ 2L. O the other had, i=1 θ j f j 2 L 2. η j def = < f j f θ, ξ > N (0, σ 2 ) where σ 2 = σ 2 f j f θ 2 /. Hece, usig Lemma 29 we obtai that ad the propositio follows. E max < f j f θ, ξ > = E max η j σ 2 log 1 j 1 j 2 log 2 log = σ f j f θ 2Lσ, We ow tur to covex aggregatio ad cosider the correspodig LS estimator ˆθ LS cov = argmi θ Λ y f θ 2. The followig theorem is straightforward i view of Propositios 3 ad 4. It states thatfˆθls attais cov the fastest of the two rates. Theorem 5 (Covex aggregatio). For all f ad all dictioaries f 1,..., f such that f j L, j = 1,...,, we have E fˆθls f 2 mi f θ f σ 2 R 2 log cov θ Λ σl. 2 10

11 Note that, up to a mior logarithmic discrepacy, the aggregate f achieves the target optimal rate of covex aggregatio give i Table 1. However, for S-aggregatio the situatio is differet. I this case, Θ is a fiite set ad the least squares estimator of f is defied by ˆf S = fĵ where ĵ = argmi 1 j y f j 2. The followig oracle iequality is a immediate cosequece of Propositio 4. Theorem 6 (S-aggregatio). For all f ad all f 1,..., f such that f j L, j = 1,...,, we have E ˆf 2 log S f 2 mi f j f 2 + 4σL. 1 j We see that the desired optimal rate for S-aggregatio, which is of the order (log )/ (cf. Table 1) is ot achieved ad the LS-aggregate ˆf S exhibits much poorer behavior. This is ot due to the techiques of the proof. I fact, the rate (log )/ give i Theorem 6 is the best that oe ca obtai for ˆf S. The followig result shows that this defect is itrisic ot oly for the least squares estimator but also for ay method that selects oly oe fuctio i the dictioary. This icludes methods of model selectio by pealized empirical risk miimizatio. We call estimators Ŝ takig values i {f 1,..., f } the selectors. Theorem 7 (Suboptimality of selectors). Assume that (σ 1) (log )/ C 0 (22) for 0 < C 0 < 1 small eough. The, there exists a dictioary {f 1,..., f } with f j 1, j = 1,...,, such that the followig holds. For ay selector Ŝ, ad i particular, for ay selector based o pealized empirical risk miimizatio, there exists a regressio fuctio f such that f 1 ad E Ŝ f 2 mi 1 j f j f 2 + C σ log (23) for some positive costat C. It follows from the lower boud (23) that selectig oe of the fuctios i a fiite dictioary to solve the problem of model selectio is suboptimal i the sese that it exhibits a too large remaider term, of the order (log )/. It turs out that we ca do better if we take a mixture, that is a covex combiatio of the fuctios i the dictioary. We will see below that uder a particular choice of weights i this covex combiatio, amely the expoetial weights, oe ca achieve oracle iequalities with the optimal rate (log )/. Proof of Theorem 7. Cosider a radom matrix X of size such that its elemets X i,j, i = 1,...,, j = 1,..., are i.i.d. Rademacher radom variables, i.e., radom variables takig values 1 ad 1 with probability 1/2. oreover, assume that 2 log (1 + e 2 ) < C 1. (24) for some positive costat C 1 < 1/2. Note that (24) follows from (22) if C 0 is chose small eough. Theorem 5.2 i Baraiuk et al (2008) [see also Subsectio i Rigollet ad Tsybakov (2011)] 11

12 implies that if (24) holds for C 1 small eough, the there exists a oempty set of matrices obtaied as realizatios of the matrix X that ejoy the followig weak restricted isometry property. For ay X, there exist costats κ κ > 0, such that for ay λ R with at most 2 ozero coordiates, κ 2 λ 2 2 Xλ 2 2 κ 2 λ 2 2, (25) whe (24) is satisfied. For X, let φ 1,..., φ be ay fuctios o X satisfyig φ j (X i ) = x i,j, i = 1,...,, j = 1,...,, where x i,j are the etries of X. Note that φ j = 1 sice x i,j { 1, 1}. Fix τ > 0 to be chose later ad set where we set for brevity α = (σ/3) f j = τ (1 + α) φ j, j = 1,...,, log κ 2. oreover, cosider the fuctios η j = ταφ j, j = 1,...,. Usig (22) we choose τ small eough to esure that η j 1 ad f j 1 for ay j = 1,...,. For ay fuctio g, we write for brevity R j (g) = g η j 2. Set also H = {f 1,..., f }. It is easy to check that mi R j(f) = R j (f j ) = f j η j 2. (26) f H We ow reduce our estimatio problem to a testig problem as follows. Let ψ {1,..., } be the radom variable, or test, defied by ψ = j if ad oly if Ŝ = f j. The, ψ j implies that there exists k j such that Ŝ = f k, so that Ŝ η j 2 f j η j 2 = f k f j f k f j, f j η j = τ 2 (1 + α) 2 φ j φ k 2 + 2τ 2 (1 + α)( φ j, φ k 1) τ 2 α φ j φ k 2. From (25), we fid that φ j φ k 2 2κ 2 so that Ŝ η j 2 f j η j 2 2τ 2 κ 2 σ log 3 κ Therefore, we coclude that ψ j implies that Hece, R j (Ŝ) mi f H R j(f) ν,. max P j {R j (Ŝ) mi R j(f) ν, } if 1 j f H ψ def = ν,. max P j(ψ j), (27) 1 j where the ifimum is take over all tests takig values i {1,..., } ad P j deotes the joit distributio of Y 1,..., Y that are idepedet Gaussia radom variables with variace σ 2 ad 12

13 meas η j (X 1 ),..., η j (X ) respectively. It follows from Propositio 2.3 ad Theorem 2.5 i Tsybakov (2009) that if for ay 1 j, k, the Kullback-Leibler divergece betwee P j ad P k satisfies K(P j, P k ) < log, (28) 8 the there exists a costat C > 0 such that To check (28), observe that, choosig τ 1 ad applyig (25), we get if ψ max P j(ψ j) C. (29) 1 j K(P j, P k ) = 2σ 2 η j η k 2 = τ 2 log 18 κ 2 φ j φ k 2 < log 8. Therefore, i view of (27) ad (29), we fid usig the arkov iequality that for ay selector Ŝ, max E j [R j (Ŝ) mi R j(f)] Cν, = C σ 1 j f H log, where E j deotes the expectatio with respect to P j. This proves the theorem. 4 Sparsity ad high dimesioal regressio Let us go back to Sceario 1 (sparse liear regressio). We assume that f = f θ for some θ R, ad θ is s-sparse. Usig Propositio 2 we obtai that the least squares estimator satisfies E fˆθls f 2 = E fˆθls f θ 2 = E ( 1 X(ˆθ LS θ ) 2 2) 1 = mi θ R X(θ θ ) σ2 ( ) = σ2 ( ) wheever the matrix X is of full rak. This result is useless i high-dimesioal problems whe > sice the remaider term is ot small. The sparsity s is ot ivolved i the expressio for the risk. So, the global least squares caot take advatage of sparsity, eve if the target vector is very sparse, i.e., s. O the other had, imagie that some oracle discloses to us the set of o-zero compoets of the target vector J(θ ) = {j θj 0}. The we ca use the least squares estimator restricted to the liear subspace of vectors with o-zero compoets i J(θ ). Deotig this estimator by ˆθ LS,J(θ ) ad applyig agai Propositio 2 we fid E ( 1 X(ˆθ LS,J(θ ) θ ) 2 2) σ2 θ 0 σ2 s where we have used that Card(J(θ )) = θ 0 ad that θ is s-sparse. This boud is much better, it takes advatage of sparsity ad ca be very small whe s. Ufortuately, ˆθ LS,J(θ ) is ot a estimator. It is a oracle; it depeds o the ukow θ ad caot be computed from the data. 13

14 A atural questio i this cotext is whether oe ca costruct a true estimator θ such that E ( 1 X( θ θ ) 2 2) σ2 θ 0 We will see that this is almost possible. I particular, we will exhibit a estimator θ such that E ( 1 X( θ θ ) 2 2) C σ2 θ 0? log ( θ 0 ) (30) for some costat C ad all 0 < θ 0 <. The additioal logarithmic factor i (30) characterizes the (modest) price to pay for the lack of kowledge of the set J(θ ). We will see that this factor caot be avoided i a miimax sese o the class of all s-sparse vectors. Iequality (30) is a example of sparsity oracle iequality. 4.1 Sparsity i Gaussia sequece model To give a idea how to costruct estimators θ satisfyig (30), we cosider a simple but istructive case whe the colums of matrix X are orthoormal. Assumptio (ORT). atrix X is such that 1 XT X = I where I is the idetity matrix, 2. This assumptio implies that sice otherwise X T X is degeerate. Usig the model y = Xθ + ξ we may write y 1 y def = 1 XT y = 1 XT Xθ + 1 XT ξ = θ + ζ, where ζ = 1 XT ξ is a Gaussia radom vector i R with mea zero ad covariace matrix V(ζ) = 1 2 E(XT ξξ T X) = σ2 I. Thus, the compoets ζ j of ζ are i.i.d. Gaussia radom variables that ca be writte i the form ζ j = εη j where ε = σ ad η 1,..., η are i.i.d. stadard ormal. We see that, uder Assumptio (ORT), we have a sequece of ew observatios y 1,..., y of the form y j = θ j + εη j, j = 1,...,, ε = σ, (31) where θ j is the jth compoet of θ ad η j are i.i.d. N (0, 1) radom variables. The model (31) is called the Gaussia sequece model ad has a simple sigal + oise iterpretatio. I the rest of this subsectio, we will forget the iitial model y = Xθ + ξ ad work with a sequece of observatios y 1,..., y satisfyig (31). Note first that, for ay θ, i view of Assumptio (ORT), 1 Xθ 2 2 = 1 θt X T Xθ = θ 2 2, so that the squared risk of a arbitrary estimator ˆθ simplifies to E ( 1 X(ˆθ θ ) 2 2) = E ˆθ θ 2 2. (32) 14

15 As discussed above, uder the sparsity assumptio o θ, it is crucial to detect the set of ozero compoets J(θ ). For the Gaussia sequece model (31), such a detectio is based o a very simple idea to keep oly the idices j such that the absolute values y j are large eough. To quatify the otio of large eough value, we will refer to the followig property (cf. Lemma 28 below): If η j are stadard Gaussia radom variables the max 1 j η j 2 log with probability close to 1 for large. Ituitively, the value 2 log characterizes the oise level. The observatio y j is uder the oise level, or is difficult to distiguish from the oise if y j ε 2 log. O the cotrary, if y j > cε log for some costat c > 2, the it is almost impossible to have θj = 0. Thus, all 2 log idices j such that y j > ε 2 log = σ belog to the set J(θ ) with probability close to 1 for large. These remarks lead us to estimatio of coefficiets θj by thresholdig. It meas that we use a suitable estimator of θj (for example, the least squares ad maximal likelihood estimator equal to y j ) for idices j such that y j > cσ log log ad we estimate by 0 all the coefficiets θ j such that y j is uder the oise level cσ. A basic realizatio of this idea is give by the hard thresholdig estimator ˆθ j H = y j I( y j > τ), wher τ > 0 is the threshold, typically chose of the order log. The followig theorem summarizes the mai properties of the hard thresholdig estimator ˆθ H = (ˆθ 1 H,..., ˆθ H ). Theorem 8. Cosider the liear regressio model uder Assumptio (ORT). The the followig holds. (i) (Oracle iequality i expectatio) If τ = σ 2 log ad θ 0, the E ˆθ H θ 2 2 2σ 2 θ 0 log (1 + 4 log ). (ii) (Oracle iequality i probablility) If τ = Aσ least 1 1 A2 /8 we have: log ˆθ H θ A2 σ 2 ( θ 0 (iii) (Selectio of variables) If τ = Bσ log, A > 2 2, the with probability at log ). with B > 2 ad mi j θj 0 θ j > 2τ, the, with probability at least 1 1 B2 /2 we have: Ĵ = J(θ ), where J(θ ) = {j θ j 0} ad Ĵ = {j ˆθ H j 0}. 15

16 Proof. (i). If θ j = 0, the ˆθ H j θ j = y j I( y j > τ) = ε η j I( η j > 2 log ), while for θ j 0 we have the boud ˆθ H j θ j = y j I( y j > τ) θ j y j θ j + y j I( y j τ) ε η j + τ. Therefore, E ˆθ H θ 2 2 = Sice E(η 2 1 ) = 1 ad E η 1 = 2/π, E ˆθ H j θ j 2 (33) ε 2 E[η 2 1I( η 1 > 2 log )] + θ 0 E[(ε η 1 + τ) 2 ]. E[(ε η 1 + τ) 2 ] = ε 2 + 4ετ 2π + τ 2 (34) = ε log π + 2 log. By Lemma 27 E[η 2 1I( η 1 > 2 log )] 2 π ( 1 2 log log ) 1. (35) 1 Pluggig (34) ad (35) i (33) ad usig that 1+ π log 4 log for all 2 ad the iequality 6/ π 4, we obtai the result. (ii). Set r = Aσ 2 log = A 2 ε log. Cosider the radom evet A = { y j θ j r, j = 1,..., }. By Lemma 28, the probability of the complemetary evet A c satisfies P (A c ) = P { max 1 j ζ j > r} = P {ε max 1 j η j > A 2 ε log } 1 A2 /8. O the evet A we have, i view of Lemma 30, y j I( y j > 2r) θ j 3 mi( θ j, r). π 16

17 Usig that r = τ/2 this implies ˆθ H θ 2 2 = ˆθ j H θj 2 9 mi ( θj 2, τ 2 4 ) 2 τ = 9 j θj 0 4 = τ 2 9 θ 0 4. (iii). Set B = A/2. The r defied i the proof of part (ii) has the form r = τ. Cosider the evet A defied i the proof of part (ii). Let us show that Ĵ J(θ ) o the evet A. Let ˆθ H j 0. I this case, ˆθ H j = y j y j > τ θ j + εη j > τ, which implies θ j > τ εη j τ r = 0 o the evet A. Therefore, θ j 0. Let us show that J(θ ) Ĵ o the evet A. Let θ j 0. The θ j > 2τ, which yields y j = θ j + εη j > 2τ εη j 2τ r = τ o the evet A. O the other had, by defiitio of ˆθ H, y j > τ ˆθ H j = y j Thus, ˆθ H j 0 with probability 1. There exist other thresholdig estimators behavig similarly as described i Theorem 8. For example, if τ is the same threshold, the soft thresholdig estimator defied as ad the o-egative garrotte estimator 1, defied as ˆθ S j = max (1 τ y j, 0) y j, j = 1,...,, (36) ˆθ j G = max 1 τ 2, 0 y j j = 1,...,, (37) y 2 j have similar risk ad selectio of variables behavior. We ca equivaletly defie the soft ad hard thresholdig estimators i terms of optimizatio programs as described below. Propositio 9. The soft ad hard thresholdig estimators are solutios to the followig optimizatio problems ˆθ H = argmi θ R ˆθ S = argmi θ R 1 This estimator is closely related to the James-Stei estimator. (y j θ j ) 2 + τ 2 θ 0, (38) (y j θ j ) 2 + 2τ θ 1. (39) 17

18 Furthermore, uder Assumptio (ORT), we ca express these two estimators as follows: ˆθ H = argmi θ R ( 1 y Xθ τ 2 θ 0 ), (40) ˆθ S = argmi θ R ( 1 y Xθ τ θ 1 ). (41) Ideed, sice we assume that 1 XT X = I (Assumptio (ORT)) ad we use the otatio 1 XT y, we may write (y j θ j ) 2 = 1 XT y θ 2 2 = θ θt X T y yt XX T y = 1 Xθ θt X T y + 1 yt XX T y y 1 y def = = 1 Xθ y yt XX T y 1 y 2 2 = 1 y Xθ c where c is a costat idepedet of θ. A importat observatio is that the estimators (40) ad (41) ca be used with geeral matrices X ad therefore ca be applied i full geerality i Scearios 1-3 ad ot oly i the Gaussia sequece model. For geeral X, the estimator defied by (40) is called the BIC estimator ad that defied by (41) is called the Lasso estimator. So, the BIC ad Lasso are atural extesios of the hard ad soft thresholdig estimators respectively. 4.2 Sparsity oracle iequality for the BIC We ow retur to the geeral regressio model y = f + ξ. Let τ > 0 be a give threshold. The origial BIC estimator is defied as follows ˆθ BIC = argmi θ R ( 1 y Xθ τ 2 θ 0 ) = argmi θ R ( y f θ 2 + τ 2 θ 0 ). Note that it ca be cosidered ot oly as a estimator for Sceario 1 but it also geerates a oparametric estimator fˆθbic for Sceario 2 ad a aggregate fˆθbic for Sceario 3. To get sharper bouds o the risk, it is coveiet to slightly modify the BIC by replacig the term τ 2 θ 0 by a pealty fuctio pe(θ) defied by pe( θ 0 ) = 2σ2 (1 + C C 2 1 L(θ) + ɛ L(θ)) θ 0 (42) where C 1, C 2 are suitable positive costats, ɛ > 0 is a arbitrary positive umber, ad L(θ) = log ( e θ 0 1 ). 18

19 We will cosider this pealty istead of τ 2 θ 0 ad use a modified defiitio of BIC: θ BIC = argmi θ R ( 1 y Xθ pe( θ 0 )). (43) Both versios of the BIC are pealized least squares estimators where the peality is imposed o the size of the support of θ. However, the BIC optimizatio problem is NP-hard. To see this, we ca reformulate the BIC program as follows mi θ R ( 1 y Xθ pe( θ 0 )) = mi 0 m mi ( 1 θ θ 0 =m y Xθ pe( θ 0 )) = mi ( mi 1 0 m θ θ 0 =m y Xθ pe(m)). Thus, we have to solve m=0 ( m ) = 2 possible least squares problems. Despite the computatioal ufeasibility, the theoretical properties of the BIC estimator ca be aalyzed i detail. I particular, it satisfies the oracle iequalities give i the ext theorem. Theorem 10 (Oracle Iequality for BIC). Fix ɛ > 0. Let θ BIC be defied i (42) (43) with sufficietly large C 1 ad C 2 ad let f BIC = f θbic. The there exists a costat C > 0 such that, for all f, E f BIC f 2 (1 + ɛ) mi ( f θ f 2 + C σ 2 θ 0 e Cσ2 log ( )) + θ R ɛ θ 0 1. (44) I additio, there exists a costat C > 0 such that, for ay 0 < δ < 1 with probability at least 1 δ, f f BIC f 2 (1 + ɛ) mi [ f θ f 2 + C σ 2 θ 0 e Cσ2 log ( )] + θ R ɛ θ 0 1 log (1 δ ). (45) I particular, if f(x) = f θ (x) with θ 0, E ( 1 X( θ BIC θ ) 2 2 ) C σ2 θ 0 log ( e θ 0 1 ). (46) The oracle iequality i expectatio (44) is proved i Birgé ad assart (2007) (see also Johstoe (2013)). For the proof of the iequality i probability (45), see Buea, Tsybakov ad Wegkamp (2004). Remarks. 1. Iequalities of Theorem 10 are sparsity oracle iequalities sice the remaider term depeds oly o θ 0. For istace, the i expectatio versio (44) is of the form E ˆf f 2 K mi θ R ( f θ f 2 2 +, (θ)) (47) where ˆf is a estimator of f, c is a costat, ad, > 0 oly depeds o θ If, depeds o θ 0 ad other features of θ, the the correspodig oracle iequality is sometimes referred to as a balaced oracle iequality. 19

20 2. The sparsity oracle iequalities of Theorem 10 are ot sharp, i.e., the leadig costat K is greater tha 1. I particular, we caot obtai a meaigful boud o the excess risk usig iequality (44). Ideed, sice it is of the form (47) with K > 1 the excess risk ca be oly bouded as E Θ ( ˆf, f) = E ˆf f 2 mi f θ f 2 (K 1) mi f θ f 2 + K sup, (θ). θ Θ θ Θ But this boud is useless i the aggregatio cotext because we have o cotrol of the miimum mi θ Θ f θ f 2 (it ca be arbitrarily large). 3. The oracle iequalities of Theorem 10 hold uder o assumptio o the dictioary f 1,..., f, ad (except for iequality (46)) uder o assumptio of f. 4. Iequality (46) gives a solutio to the questio aouced above, cf. (30). It cotais a oracle term C σ2 θ 0 multiplied by log ( e θ 0 1 ). This factor represets the price to pay for ot kowig the set of o-zero compoets of θ. 5. Istead of the pealty (43) implemeted above, we ca also use the pealty pe(θ) = Cσ 2 θ 0 log. This leads to oracle iequalities similar to those of Theorem 10 except for the logarithmic factors that become slightly suboptimal. ore precisely, log ( e θ 0 1 ) log for this type of pealty. 4.3 Sparsity oracle iequality for the Lasso As i the previous subsectio, here we cosider the geeral regressio model y = f + ξ. Let ˆθ L be the Lasso estimator ˆθ L = argmi θ R ( 1 y Xθ τ θ 1 ) = argmi θ R ( y f θ 2 + 2τ θ 1 ) where τ > 0 is a tuig parameter. Similarly to the BIC estimator, it ca be cosidered ot oly as a estimator for parametric Sceario 1 but also it geerates a oparametric estimator for fˆθl Sceario 2 ad a aggregate for Sceario 3. The followig theorem is a modificatio of a result fˆθl i Koltchiskii, Louici ad Tsybakov (2011). It provides a sparsity oracle iequality i probability with leadig costat 1 for the Lasso estimator. Theorem 11. Let ξ be i.i.d. radom variables, ξ i N (0, σ 2 ) ad let f i 1, j = 1,...,. Let ˆθ L be the Lasso estimator with the tuig parameter τ = Aσ log, A = t δ, t > 2, 0 < δ < 1. The, with probability at least 1 1 t2 /2 we have f fˆθl f 2 mi θ R θ Θ f θ f 2 + C mi σ2 µ 2 (θ) θ log log 0, σ θ 1 (48) 20

21 where C > 0 depeds oly o t ad δ, θ 0 µ(θ) = if µ > 0 J(θ) 1 µ X 2, C θ with C θ = { R J c (θ) δ 1 δ J(θ) 1 }. The Proof. Set for brevity ˆθ L = ˆθ ad G(θ) = y f θ 2 + 2τ θ 1. ˆθ = argmi θ R Deote by (, ) the ier product i R, ad set < f, g > def = 1 i=1 G(θ). f(x i )g(x i ). We ow recall the followig geeral fact from covex aalysis. Lemma 12. For ay covex fuctio G R R we have: ˆθ argmi θ R 0 G(ˆθ), where G(ˆθ) is the subdiffretial of G at poit ˆθ. G(θ) if ad oly if The coditio 0 G(ˆθ) of this lemma obviously implies: there exists B G(ˆθ) such that (B, ˆθ θ) = 0, for all θ R. I the sequel, we will use this property. I our case, ad thus ( y f θ 2 ) = ( 1 y Xθ 2 2) = 2 XT (y Xθ), ( ( y fˆθ 2 ), ˆθ θ) = 2 (XT (y X ˆθ), ˆθ θ) = 2 (X(ˆθ θ), y X ˆθ) = 2 < fˆθ θ, y fˆθ >. Applyig Lemma 12, we get that there exists ˆV ( ˆθ 1 ) such that 2 fˆθ θ, y fˆθ + 2τ( ˆV, ˆθ θ) = 0. (49) Let V be ay elemet of ( θ 1 ). It follows from (49) that 2 fˆθ θ, y fˆθ + 2τ( ˆV V, ˆθ θ) = 2τ(V, ˆθ θ). (50) We ow use the followig fact from covex aalysis applied to the fuctio g(θ) = θ 1. 21

22 Lemma 13. For ay covex fuctio g R R, we have for all V g(θ), V g(θ ). From Lemma 13 ad (50) we fid Sice y = f + ξ, we ca rewrite this i the form: (V V, θ θ ) 0, θ, θ R, 2 fˆθ θ, y fˆθ 2τ(V, ˆθ θ). 2 fˆθ θ, fˆθ f 2τ(V, ˆθ θ) + 2 ξ, fˆθ θ (51) for ay V ( θ 1 ) ad ay θ R. Next, elemetary argumet yields 2 fˆθ θ, fˆθ f = 2 fˆθ f θ, fˆθ f = fˆθ f θ 2 + fˆθ f 2 f θ f 2. (52) Fix some θ R ad let J = J(θ) be the set of o-zero compoets of θ. Write V = V J + V J c where V J R is the vector with compoets V j I(j J), j = 1,...,, where V j are the compoets of V, ad J c = {1,..., }/J is the complemet of J. The (V, ˆθ θ) = (V J, ˆθ θ) + (V J c, ˆθ θ) = (V J, ˆθ θ) + (V J c, ˆθ) sice the compoets of V J c vaish o the support of θ. O the other had, V is ay elemet of ( θ 1 ), ad thus the compoets of V satisfy { V j 1, j J c, V j = sig(θ j ), j J. Choose V such that V j = sig(ˆθ j ) for j J c. This is possible, sice V j ca be ay values satisfyig V j 1 for j J c. The (V, ˆθ θ) = (V J, ˆθ θ) + ˆθ J c 1 = (V J, ) + J c 1 = (V J, J ) + J c 1 where = ˆθ θ ad we used the fact that ˆθ J c 1 = J c 1. This ad (51) imply 2 fˆθ θ, fˆθ f 2τ J 1 2τ J c 1 + 2(H, ), (53) H 1 where H = 1 XT ξ = with H j = 1 H i=1 f j(x i )ξ i. We have used here the idetity ξ, fˆθ θ = (H, ˆθ θ). Note ow that if fˆθ θ, fˆθ f 0 the, i view of (52), we get fˆθ f 2 f θ f 2 22

23 ad the result of the theorem follows i a trivial way. fˆθ θ, fˆθ f 0. But i this case, i view of (53), So, it is eough to cosider the case Assume for the momet that τ J c 1 τ J 1 + H 1. H δτ for some 0 < δ < 1. The, sice 1 = J 1 + J c 1, we have J c δ 1 δ J 1. I other words, it suffices to cosider C θ, where C θ = { R J c δ 1 δ J 1 } ad J = J(θ) is the set of o-zero compoets of θ. We ow retur to (53), ad boud the terms o the right-had side of (53). Usig that H δτ ad C θ we get This ad (53) imply 2τ J 1 2τ J c 1 + 2(H, ) 2τ J 1 2τ J c H 1 = 2τ J 1 2τ J c 1 + 2δτ( J 1 + J c 1 ) Combiig this with (52) we get 2τ(1 + δ) J 1. (54) 2 fˆθ θ, fˆθ f 2τ(1 + δ) J 1. fˆθ f 2 f θ f 2 fˆθ f θ 2 + 2τ(1 + δ) J 1. (55) Sice C θ, we get θ 0 J 1 µ(θ) X 2 = µ(θ) θ 0 f = µ(θ) θ 0 fˆθ f θ. This ad the elemetary iequality 2ab a 2 + b 2 yield 2τ(1 + δ) J 1 2τ(1 + δ)µ(θ) θ 0 fˆθ f θ τ 2 (1 + δ) 2 µ 2 (θ) θ 0 + fˆθ f θ 2. (56) Combiig (55) ad (56) we obtai θ, f fˆθ f 2 f θ f 2 + τ 2 (1 + δ) 2 µ 2 (θ) θ 0. (57) Note that this iequality is proved for all θ R ad all f, uder the assumptio that H δτ. 23

24 Let us ow show that H δτ holds with probability at least 1 1 t2 /2. Cosider the radom evet A = { H δτ}. The probability of its complemet P (A c ) is estimated as follows Here, for each j, P (A c ) = P ( max 1 j 1 1 i=1 i=1 f j (X i )ξ i > δτ) P ( 1 f j (X i )ξ i N (0, σ2 f j 2 ). i=1 It follows similarly to Lemma 28 that, sice f j 2 1 for all j, we get P ( H δτ) = P log H > σt 1 t2 /2. To fiish the proof of the theorem, we show that, o the same evet A, f j (X i )ξ i > δτ). θ, f fˆθ f 2 f θ f 2 + C σ θ 1 log (58) for a costat C > 0 depedig oly o t ad δ. Ideed, sice G(ˆθ) G(θ) for all θ R, we get, by a simple algebra, fˆθ f 2 f θ f ξ, fˆθ f θ + 2τ θ 1 2τ ˆθ 1. (59) Sice ξ, fˆθ f θ = (H, ˆθ θ) ad H δτ o A, we fid 2 ξ, fˆθ f θ 2δτ ˆθ θ 1 + 2τ( θ 1 ˆθ 1 ) 2τ(1 + δ) θ 1. (60) Combiig (59) ad (60) we get (58) with C = 2t(1 + 1/δ). Fially, the theorem follows from (57) ad (58). The costat C i (48) ca be take equal to max(t 2 (1 + 1/δ) 2, 2t(1 + 1/δ)). 5 ixig with expoetial weights Let f 1,..., f be give fuctios formig a dictioary. Set ˆr j = y f j 2. This is the empirical risk of f j. The expoetially weighted aggregate is defied by where ˆθ EW = (ˆθ EW 1,..., EW ˆθ ) with ˆf EW = ˆθ EW j = ˆθ j EW f j = fˆθew exp( ˆr j /β)π j k=1 exp( ˆr k/β)π k 24

25 for some β > 0 ad some set of prior probabilities π k > 0, k=1 π k = 1. This defiitio has bee brought to achie Learig by Vovk (1990), Littlestoe ad Warmuth (1994). There exist two heuristic iterpretatios of expoetial weightig. 1. Quasi-bayesia iterpretatio. The weights ˆθ EW defie a posterior distributio (which is the Gibbs distributio if π k are uiform) i the phatom model Y i = f θ (X i ) + ξ i, i = 1,...,, where ξ i are i.i.d. N (0, β 2 ) radom variables, θ {e 1,..., e }, ad π j are prior probabilities of e j. 2. Variatioal iterpretatio. It is ot hard to check that ˆθ EW is a solutio of the followig miimizatio problem: ˆθ EW = argmi θ j ˆr j + β θ Λ K(θ, π) where K(θ, π) = θ j log θ j π j is a simplex. Note that is the Kullback-Lieibler divergece betwee θ ad π, ad Λ = {θ θ j 0, θ j ˆr j = θ j y f j 2 θ j = 1} y f θ 2. Jese Thus, ˆθ EW miimizes a upper approximatio of the empirical risk pealized by Kullback- Leibler divergece from π: y f θ 2 + β K(θ, π). Note that K(θ, π) 0 ad K(θ, π) = 0 θ = π. So, we pealize the solutio for beig too far from the prior π. I what follows we set for brevity w j = EW ˆθ j, Z = exp( ˆr k /β)π k. k=1 The followig propositio goes back to Vovk (1990) who cosidered a determiistic model. Ideed, o assumptio o the distributio of y is eeded. Propositio 14. The value ˆr = w j ˆr j satisfies As a cosequece, for all y, ˆr mi 1 j (ˆr j + β log 1 π j ). y ˆf EW 2 mi 1 j ( y f j 2 + β log 1 π j ). 25

26 Note that if π j = 1, j = 1,..., (the uiform prior), the β log 1 π j = β log, which is the optimal rate of S-aggregatio. But the boud is for the empirical risk y ˆf EW 2 ad ot for the risk E f ˆf EW 2. Also, o the RHS we have the empirical risk y f j 2 ad ot the discrepacy f f j 2 as expected i our oracle iequalities. Proof. Take logarithms of both sides of the equatio The, for ay k ad j, we have w j = exp( ˆr j/β)π j Z. so that log Z = ˆr k β + log 1 π k + log w k, log Z = ˆr j β + log 1 π j + log w j, ˆr k β Thus, usig that log w j 0, we get ˆr = = ˆr j β + log 1 π j log 1 π k + log w j log w k. k=1 w kˆr k ˆr j + β log 1 π j β k=1 w k log w k π k K(w,π) Sice K(w, π) 0 the first result of the propositio follows. The secod result is obtaied from the first oe usig the iequalities:. y EW ˆf = w jf j 2 = w j (y f j ) 2 Jese w j y f j 2 = w j ˆr j = ˆr. The ext propositio is ispired by the argumet i Leug ad Barro (2006). Propositio 15. (i) If β = 4σ 2, the ˆr σ 2 is a ubiased estimator of the risk : E ˆf EW f 2 = E(ˆr) σ 2. (ii) If β > 4σ 2, the with Proof. First, recall that E ˆf EW f 2 E(ˆr) σ 2. ˆf EW ( ) = w j f j ( ) w j = w j (y) = exp( β ˆr j)π j Z 26

27 where Z = k=1 exp( β ˆr k)π k, ˆr j = y f j 2. By Stei ubiased risk estimatio formula (see e.g. Tsybakov (2009), p. 157), the statistic ˆR def = y ˆf EW 2 + 2σ2 ˆf EW (X i ) σ 2 i=1 Y i is a ubiased estimator of the risk E f ˆf EW 2, i.e., E( ˆR) = E f ˆf EW 2. (61) Let us compute ˆR. Note that i the defiitio of ˆf EW oly the weights w j deped o Y 1,..., Y. So, we eed first to fid the derivative w j(y) Y i. Recall that ˆr j = 1 i=1 (Y i f j (X i )) 2. Hece, ˆr j Y i = 2 (Y i f j (X i )) ad we have w j = exp( β ˆr j)π j Y i Z 2 [ 2 β (Y i f j (X i ))Z + 2 β k=1 = 2w j β [(Y i f j (X i )) + (Y i f k (X i ))w k ] k=1 = 2 β (f j(x i ) ˆf EW (X i ))w j. (Y i f k (X i )) exp( β ˆr k)π k ] (62) O the other had, sice w j 0, w j = 1, we have, by the bias-variace decompositio with respect to the distributio defied by {w j }, Note also that, for all i, ˆf EW y 2 = Combiig (62) (64) we obtai ˆR = ˆr = ˆr = ˆr = w j f j y 2 w j f j ˆf EW 2 (63) w j ˆr j w j f j ˆf EW 2. =ˆr ( w j Y i ) ˆf EW (X i ) = ˆf EW (X i ) Y i w j f j ˆf EW 2 + 2σ2 w j f j ˆf EW 2 + 4σ2 β i=1 w j =1 = 0. (64) ( w j Y i ) f j (X i ) σ 2 (f j (X i ) ˆf EW (X i )) 2 w j σ 2 i=1 (1 4σ2 β ) w j f j ˆf EW 2 σ 2. 27

28 Takig expectatios of both sides of this iequality ad usig (61) we fid which implies the propositio. Theorem 16. For β 4σ 2 we have E ˆf EW f 2 = E(ˆr) σ 2 (1 4σ2 β ) E w j f j ˆf EW 2 I particular, if π j = 1, j = 1,...,, E ˆf EW f 2 mi 1 j ( f f j 2 + β log 1 π j ). E ˆf EW f 2 mi f f j 2 + β log. 1 j Proof. Propositios 14 ad 15, ad the fact that E(ˆr j ) = E y f j 2 = f f j 2 + σ 2 imply E ˆf EW f 2 E(ˆr) σ 2 mi (E(ˆr j) + β 1 j log 1 ) σ 2 π j = mi 1 j ( f f j 2 + β log 1 π j ). Remarks 1. Theorem 16 is proved i Dalalya ad Tsybakov (2007, 2008) where the result has a more geeral form: λ Λ E ˆf EW f 2 mi λ j f f j 2 + β K(λ, π). (65) Ideed, the right-had side of (65) does ot exceed mi λ {e 1,...,e } λ j f f j 2 + β K(λ, π) = mi 1 j ( f f j 2 + β log 1 π j ). 2. The right-had side of (65) is remiiscet of the variatioal iterpretatio of the expoetial weighted estimator. If we replace r j = f f j 2 by ˆr j = y f j 2, ˆf EW is obtaied by the miimizatio : mi λ j ˆr j + β λ Λ K(λ, π), which is the empirical aalog of the right-had side of (65). 28

29 3. Leug ad Barro (2006) have proved a result aalogous to Theorem 16 for the case where f j are ot ay fixed fuctios but rather the least squares estimators o liear subspaces of R. These estimators are costructed from the same sample y that is used to compute the weights. I their case, the expoetial weights are slightly differet. Namely, they take w j = exp ( ˆr j β k=1 exp ( ˆr k β dim(j) 2 ) π j dim(k) 2 ) π k where dim(j) is the dimesio of the space o which the jth least squares estimator projects. 6 Sparsity patter aggregatio I this sectio, we describe a aggregatio procedure that will be show to achieve uiversal aggregatio. Let P = {0, 1}. We call a sparsity patter ay biary vector p P. We deote by p def = p 0 the umber of oes i p. To each sparsity patter p P we associate a liear subspace S p of R : p S p def = spa {e j p j = 1}, dim(s p ) = p. From the iitial sample y, we cloe two radomized idepedet samples y (1) R ad y (2) R with radom errors N (0, 2σ 2 ), cf. Sectio 2. For each p P, we costruct a least squares estimator ˆθ p o S p based o the first sample y (1) : ˆθ p = argmi θ S p y (1) f θ 2. Set ˆr p = y (2) 2 ad defie a vector ˆθ SP A = (ˆθ SP A fˆθp p, p P) with compoets ˆθ SP A p = exp( ηˆr p /β)π p, p P. p P exp( ηˆr p /β)π p Here, {π p } is a prior probability measure o P with π p 0 (ot ecessarily π p > 0; π p = 0 is possible, o the differece from priors i Sectio 5). Note that ˆθ SP A R 2. The Sparsity Patter Aggregate is defied by ˆf SP A def = ˆθ p SP A. fˆθp p P From Theorem 16 we get: If β = 8σ 2 (because σ 2 2σ 2 after sample cloig) the From Propositio 2, f E ˆf SP A f 2 mi p P,π p 0 [E fˆθp f 2 + 8σ2 log 1 π p ]. (66) f 2 mi E fˆθp f θ S p θ f 2 + 2σ2 p. (67) 29

30 Combiig (66) ad (67), ad choosig a appropriate prior π p we obtai our mai result that will be stated below. Namely, we will use the prior π p = (( p )e p H) 1 if p R, 1/2 if p =, 0 otherwise, (68) where H = 2 R k=0 e k 2 k=0 e k = 2e/(e 1). Clearly, p P π p = 1. Ideed, R π p = ( p P, p R k=0 k ) 1 ( k )ek H = R k=0 e k H ES Defiitio 17. Expoetial Screeig (ES) estimator ˆf is defied as a sparsity patter aggregate ( ˆf SP A ) with the prior π p give i (68). The correspodig vector of weights is deoted by ˆθ ES. Remark. The prior (68) ca be called a sparsity prior because it dowweights expoetially the o-sparse vectors. The oly exceptio is doe for the most o-sparse vector (the oe with all o-zero compoets) for which we keep the global least squares estimator with weight 1/2. This poit is techical; we itroduce it for mathematical coveiece i order to simplify the proofs. From (66) with p = we obtai E ˆf ES f 2 E fˆθls f 2 + 8σ2 log 2 where we have used that ˆθ p for p = coicides with the global least squares estimator ˆθ LS. This iequality ad Propositio 2 imply: = 1 2. E ˆf ES f 2 mi f θ f 2 + 2σ2 R θ R + 8σ2 log 2. (69) Let p(θ) P be the sparsity patter of θ R, i.e., a vector with compoets p j (θ) = 1 if θ j 0, ad p j (θ) = 0 otherwise. Note that p(θ) = θ 0. Usig (66) ad (67), we get E ˆf ES f 2 mi [mi f p P p R θ S p θ f 2 + 2σ2 p + 8σ2 log 1 ] π p {θ p(θ)=p} S p = p(θ) = θ 0 mi p P p R mi [ f θ f 2 + 2σ2 p(θ) + 8σ2 θ p(θ)=p log ( 1 )] π p(θ) mi [ f θ f 2 + 2σ2 θ 0 + 8σ2 θ R θ 0 R log ( 1 )]. π p(θ) Now, we eed to boud log 1 π p(θ). We use the followig fact: ( k ) (e K ) k. 30

Aggregation and minimax optimality in highdimensional

Aggregation and minimax optimality in highdimensional Aggregatio ad miimax optimality i highdimesioal estimatio Alexadre B. Tsybakov Abstract. Aggregatio is a popular techique i statistics ad machie learig. Give a collectio of estimators, the problem of liear,

More information

Summary and Discussion on Simultaneous Analysis of Lasso and Dantzig Selector

Summary and Discussion on Simultaneous Analysis of Lasso and Dantzig Selector Summary ad Discussio o Simultaeous Aalysis of Lasso ad Datzig Selector STAT732, Sprig 28 Duzhe Wag May 4, 28 Abstract This is a discussio o the work i Bickel, Ritov ad Tsybakov (29). We begi with a short

More information

Convergence of random variables. (telegram style notes) P.J.C. Spreij

Convergence of random variables. (telegram style notes) P.J.C. Spreij Covergece of radom variables (telegram style otes).j.c. Spreij this versio: September 6, 2005 Itroductio As we kow, radom variables are by defiitio measurable fuctios o some uderlyig measurable space

More information

A survey on penalized empirical risk minimization Sara A. van de Geer

A survey on penalized empirical risk minimization Sara A. van de Geer A survey o pealized empirical risk miimizatio Sara A. va de Geer We address the questio how to choose the pealty i empirical risk miimizatio. Roughly speakig, this pealty should be a good boud for the

More information

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization ECE 90 Lecture 4: Maximum Likelihood Estimatio ad Complexity Regularizatio R Nowak 5/7/009 Review : Maximum Likelihood Estimatio We have iid observatios draw from a ukow distributio Y i iid p θ, i,, where

More information

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator
