Peter L. Bartlett 1, Shahar Mendelson 2, 3 and Petra Philips Introduction


ESAIM: Probability and Statistics
URL: Will be set by the publisher

ON THE OPTIMALITY OF SAMPLE-BASED ESTIMATES OF THE EXPECTATION OF THE EMPIRICAL MINIMIZER

Peter L. Bartlett 1, Shahar Mendelson 2, 3 and Petra Philips 4

Abstract. We study sample-based estimates of the expectation of the function produced by the empirical minimization algorithm. We investigate the extent to which one can estimate the rate of convergence of the empirical minimizer in a data dependent manner. We establish three main results. First, we provide an algorithm that upper bounds the expectation of the empirical minimizer in a completely data-dependent manner. This bound is based on a structural result due to Bartlett and Mendelson, which relates expectations to sample averages. Second, we show that these structural upper bounds can be loose, compared to previous bounds. In particular, we demonstrate a class for which the expectation of the empirical minimizer decreases as $O(1/n)$ for sample size $n$, although the upper bound based on structural properties is $\Omega(1)$. Third, we show that this looseness of the bound is inevitable: we present an example that shows that a sharp bound cannot be universally recovered from empirical data.

Mathematics Subject Classification. 62G08, 68Q32.

The dates will be set by the publisher.

1. Introduction

The empirical minimization algorithm is a statistical procedure that chooses a function that minimizes an empirical loss functional on a given class of functions. Known as an M-estimator in the statistical literature, it has been studied extensively [11,29,31]. Here, we investigate the limitations of estimates of the expectation of the function produced by the empirical minimization algorithm. To be more exact, let $F$ be a class of real-valued functions defined on a probability space $(\Omega,\mu)$ and set $X_1,\dots,X_n$ to be independent random variables distributed according to $\mu$. For $f \in F$ define $E_n f = \frac{1}{n}\sum_{i=1}^n f(X_i)$ and let $Ef$ be the expectation of $f$ with respect to $\mu$.
The goal is to find a function that minimizes $Ef$ over $F$, where the only information available about the unknown distribution $\mu$ is through the finite sample $X_1,\dots,X_n$. The empirical minimization algorithm produces the function $\hat f \in F$ that has the smallest empirical mean, that is, $\hat f$ satisfies
\[ E_n \hat f = \min\{E_n f : f \in F\}. \]
Throughout this article, we assume that such a minimum exists (the modifications required if this is not the case are obvious), that $F$ satisfies some minor measurability conditions, which we omit (see [8] for more details), and that for every $f \in F$, $Ef \ge 0$, which, as we explain later, is a natural assumption in the cases that interest us.

Keywords and phrases: error bounds; empirical minimization; data-dependent complexity
This work was supported in part by National Science Foundation grant
This work was supported in part by the Australian Research Council Discovery Grant DP
1 Computer Science Division and Department of Statistics, 367 Evans Hall #3860, University of California, Berkeley, CA, USA
2 Centre for Mathematics and its Applications (CMA), The Australian National University, Canberra, 0200 Australia
3 Department of Mathematics, Technion I.I.T., Haifa, 32000, Israel
4 Friedrich Miescher Laboratory of the Max Planck Society, Tübingen, 72076, Germany
© EDP Sciences, SMAI 1999

In statistical learning theory, this problem arises when one minimizes the empirical risk, or sample average of a loss incurred on a finite training sample. There, the aim is to ensure that the risk, or expected loss, is small. Thus, $f(X_i)$ represents the loss incurred on $X_i$. Performance guarantees are typically obtained through high probability bounds on the conditional expectation
\[ E\hat f = E\bigl(\hat f(X) \mid X_1,\dots,X_n\bigr). \tag{1} \]
In particular, one is interested in obtaining fast and accurate estimates of the rates of convergence of this expectation to 0 as a function of the sample size $n$.

Classical estimates of this expectation rely on the uniform convergence over $F$ of sample averages to expectations (see, for example, [31]). These estimates are essentially based on the analysis of the supremum of the empirical process $\sup_{f \in F}(Ef - E_n f)$ indexed by the whole class $F$. As opposed to these global estimates, it is possible to study local subsets of functions of $F$, for example, balls of a given radius with respect to a chosen metric. The supremum of the empirical process indexed by these local subsets, as a function of the radius of the balls, is called the modulus of continuity. Sharper localized estimates for the rate of convergence of the expectation can be obtained in terms of the fixed point of the modulus of continuity of the class [1,12,16,18,28]. Recent results [3] show that one can further significantly improve the high-probability estimates for the convergence rates for empirical minimizers.
These results are based on a new localized notion of complexity of subsets of $F$ containing functions with identical expectations and are therefore dependent on the underlying unknown distribution. In this article, we investigate the extent to which one can estimate these high-probability convergence rates in a data-dependent manner, an important aspect if one wants to make these estimates practically useful.

The results in [3] establish upper and lower bounds for the expectation $E\hat f$ using two different arguments. The first is a structural result relating the empirical (random) structure endowed on the class by the selection of the coordinates $(X_1,\dots,X_n)$, and the real structure, given by the measure $\mu$. The second is a direct analysis, which yields seemingly sharper bounds. In both cases (and under some mild structural assumptions on the class $F$), the bounds are given using a function that measures the localized complexity of subsets of $F$ consisting of functions with a fixed expectation $r$, denoted here by $F_r = \{f \in F : Ef = r\}$. For every integer $n$ and probability measure $\mu$ on $\Omega$, consider the following two sequences of functions, which measure the complexity of the sets $F_r$:
\[ \xi_{n,F,\mu}(r) = E\sup\{\,|Ef - E_n f| : f \in F_r\}, \tag{2} \]
\[ \xi'_{n,F,\mu}(r) = E\sup\{\,Ef - E_n f : f \in F_r\}. \tag{3} \]
In the following, in cases where the underlying probability measure $\mu$ and the class $F$ are clear, we will refer to these functions as $\xi_n$ and $\xi'_n$. It turns out that these two functions control the generalization ability in $F_r$ whenever one has a strong degree of concentration for the empirical process suprema $\sup_{f \in F_r}|Ef - E_n f|$ and $\sup_{f \in F_r}(Ef - E_n f)$ around their expectation. Thus, $\xi_n$ and $\xi'_n$ can be used to derive bounds on the performance of the empirical minimization algorithm as long as these suprema are sufficiently concentrated. Therefore, the main tool required in the proofs of the results in [3] that provide bounds using $\xi_n$ and $\xi'_n$ is Talagrand's concentration inequality for empirical processes (see Theorem A.1 in the appendix).
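As an illustration of definitions (2) and (3), the following minimal sketch estimates the two complexity functions by Monte Carlo for a finite level set of functions on a finite domain carrying the uniform measure. The function name `estimate_xi`, its parameters, and the toy class are assumptions made for this example; they do not come from the paper.

```python
import random
from itertools import combinations


def estimate_xi(values, n, n_trials=2000, seed=0):
    """Monte Carlo estimates of xi_n(r) and xi'_n(r) for a finite level set.

    values: list of rows; row j holds the values of f_j on the finite domain
            {0, ..., m-1}, which carries the uniform measure. All rows are
            assumed to have the same expectation r, so they play the role
            of F_r.
    n:      sample size.
    Returns (xi, xi_prime): estimates of E sup |Ef - E_n f| and
            E sup (Ef - E_n f), the suprema taken over the rows.
    """
    rng = random.Random(seed)
    m = len(values[0])
    true_means = [sum(row) / m for row in values]        # Ef for each f
    two_sided = one_sided = 0.0
    for _ in range(n_trials):
        sample = [rng.randrange(m) for _ in range(n)]    # X_1, ..., X_n
        devs = [mu - sum(row[x] for x in sample) / n     # Ef - E_n f
                for row, mu in zip(values, true_means)]
        two_sided += max(abs(d) for d in devs)
        one_sided += max(devs)
    return two_sided / n_trials, one_sided / n_trials


# Toy level set (hypothetical): indicators of all 2-element subsets of
# {0, ..., 5}; every such function has expectation 1/3.
m = 6
level_set = [[1.0 if i in s else 0.0 for i in range(m)]
             for s in combinations(range(m), 2)]
xi, xi_prime = estimate_xi(level_set, n=5)
```

By construction the one-sided estimate never exceeds the two-sided one, mirroring the pointwise inequality between the two suprema.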
To see how $\xi_n$ and $\xi'_n$ can be used to derive generalization bounds, observe that it suffices to find the critical point $r_0$ for which, with high probability, for a given $0 < \lambda < 1$, every $r \ge r_0$ and every $f \in F_r$,
\[ (1-\lambda)Ef \le E_n f \le (1+\lambda)Ef. \]
If the equivalence holds for a sample $(X_1,\dots,X_n)$ for such an $r_0$, then every $f \in F$ satisfies
\[ Ef \le \max\left\{\frac{E_n f}{1-\lambda},\; r_0\right\}, \tag{4} \]
and thus, an upper bound on the expectation of the empirical minimizer $\hat f$ can be established. It is possible to show that one can take $r_0$ as $r_n^*$, where
\[ r_n^* = \inf\{r : \xi_{n,G}(r) \le r/4\}, \tag{5} \]
where $G = \{\theta f : 0 \le \theta \le 1,\ f \in F\}$. In fact, since in (4) only a one-sided condition is required, one can actually use
\[ r_n^{\prime *} = \inf\{r : \xi'_{n,G}(r) \le r/4\}. \tag{6} \]
For the rest of this section we will assume that $F$ is star-shaped around 0 (that is, $G = F$), and we will explain the significance of this property later.

A more careful analysis, which uses the strength of Talagrand's concentration inequality for empirical processes, shows that the expectation of the empirical minimizer is governed by approximations of
\[ s_n = \sup\Bigl\{r : \xi'_n(r) - r = \max_s\{\xi'_n(s) - s\}\Bigr\}. \tag{7} \]
To see why $s_n$ is a likely candidate, note that for any empirical minimizer, the function of $r$ defined as
\[ \sup_{f \in F_r}(Ef - E_n f) - r = -\inf_{f \in F_r} E_n f \]
is maximized for the value $r = E\hat f$. Assume that one has a very strong concentration of empirical processes indexed by $F_r$ around their mean for every $r > 0$, that is, with high probability, for every $r > 0$,
\[ \sup_{f \in F_r}(Ef - E_n f) \approx E\sup_{f \in F_r}(Ef - E_n f) = \xi'_n(r). \]
Then, it would make sense to expect that, with high probability, $E\hat f \approx s_n$ for $s_n = \operatorname{argmax}\{\xi'_n(r) - r\}$. More precisely, and to overcome the fact that $E\sup_{f \in F_r}(Ef - E_n f)$ is only very close to $\sup_{f \in F_r}(Ef - E_n f)$, define for $\varepsilon > 0$,
\[ r_{n,\varepsilon,+} = \sup\Bigl\{r : \xi'_{n,F,\mu}(r) - r \ge \sup_s\bigl(\xi'_{n,F,\mu}(s) - s\bigr) - \varepsilon\Bigr\}, \tag{8} \]
\[ r_{n,\varepsilon,-} = \inf\Bigl\{r : \xi'_{n,F,\mu}(r) - r \ge \sup_s\bigl(\xi'_{n,F,\mu}(s) - s\bigr) - \varepsilon\Bigr\}. \tag{9} \]
Note that $r_{n,\varepsilon,+}$ and $r_{n,\varepsilon,-}$ are respectively upper and lower approximations of $s_n$ that become better as $\varepsilon \to 0$. They are close to $s_n$ if the function $\xi'_n(r) - r$ is peaked around its maximum. Under mild structural assumptions on $F$, $E\hat f$ can be upper bounded by either $r_n^*$ or $r_{n,\varepsilon,+}$, and lower bounded by $r_{n,\varepsilon,-}$ for a choice of $\varepsilon = O(\sqrt{\log n / n})$ (see the exact statement in Theorem 2.6 below).
Thus, these two parameters, the fixed point of $4\xi_n$ (denoted by $r_n^*$) and the points at which the maximum of $\xi'_n(r) - r$ is almost attained, are our main focus.

The first result we present here is that there is a true gap between $r_n^*$ and $s_n$, which implies that there is a true difference between the bound that could be obtained using the structural approach (i.e. $r_n^*$) and the true expectation of the empirical minimizer. We construct a class of functions satisfying the required structural assumptions and show that for any $n$, $r_n^{\prime *}$ is of the order of a constant (and thus $r_n^*$ is of the order of a constant), but the subsets $F_r$ are very rich when $r$ is close to 0, and $s_n$ and $r_{n,\varepsilon,+}$ are of the order of $1/n$. Let us mention that there is a construction related to this one in [3]: for every $n$ there is a function class $F_n$ for which this phenomenon occurs. The construction we present here is stronger, since it shows that, for some function class and probability distribution, the true convergence rate for a fixed class is far from the structural bound. The idea behind the construction is based on the one presented in [3], namely that one has complete freedom to choose the expectation of a function, while forcing it to have certain values on a given sample. For the class we construct and any large sample size $n$, estimates for the convergence rates of the empirical minimizers based on $r_n^*$ are asymptotically not optimal (as they are $\Theta(1)$ whereas the true convergence rate is $O(1/n)$), and thus the structural bound does not capture the true behavior of the empirical minimizer.

The second question we tackle concerns the estimation of the expectation of the empirical minimizer from data. To that end, in Section 4, we present an efficient algorithm that enables one to estimate $r_n^*$ in a completely data dependent manner. Then, in Section 5, we show that this type of data-dependent estimate is the best one can hope to have if one only has access to the function values on finite samples. We show that in such a case it is impossible to establish a data dependent upper bound on the expectation of the empirical minimizer that is asymptotically better than $r_n^*$. The general idea is to construct two classes of functions that look identical when restricted to any sample of finite size, but for one class both a typical expectation of the empirical minimizer and $r_n^*$ are of the order of an absolute constant, while for the other a typical expectation is of the order of $1/n$.

2. Definitions and Preliminary Results

2.1. Loss Classes

One of the main applications of our investigations is the analysis of prediction problems, like classification or regression, arising in machine learning. Suppose that one is presented with a sequence of observation-outcome pairs $(x,y) \in \mathcal{X} \times \mathcal{Y}$, and the aim is to select a function $g : \mathcal{X} \to \mathcal{Y}$ that makes an accurate prediction of the outcome for each observation.
We assume that $(X,Y),(X_1,Y_1),\dots,(X_n,Y_n)$ are chosen independently from a probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, but $P$ is unknown. The quality of the prediction is measured using a bounded loss function, $\ell : \mathcal{Y} \times \mathcal{Y} \to [0,b]$, where $\ell(\hat y,y)$ represents the cost incurred for prediction $\hat y$ when the true outcome is $y$. The risk of a function $g : \mathcal{X} \to \mathcal{Y}$ is defined as $E\ell(g(X),Y)$, and the aim is to use the sequence $(X_1,Y_1),\dots,(X_n,Y_n)$ to choose a function $g$ with minimal risk. Setting $f(x,y) = \ell(g(x),y)$, this task corresponds to minimizing $Ef$. In empirical risk minimization, one chooses the $g$ from a set $G$ that minimizes the sample average of $\ell(g(x),y)$, which corresponds to choosing $f \in F$ that minimizes $E_n f$, where $F$ is the loss class,
\[ F = \{(x,y) \mapsto \ell(g(x),y) : g \in G\}. \]

It is sometimes convenient to consider excess loss functions,
\[ f(x,y) = \ell(g(x),y) - \ell(g^*(x),y), \]
where $g^* \in G$ satisfies $E\ell(g^*(X),Y) = \inf_{g \in G} E\ell(g(X),Y)$. Since $g^*$ is fixed, choosing $g \in G$ that minimizes the risk (respectively, empirical risk) again corresponds to choosing $f \in F$ that minimizes $Ef$ (respectively, $E_n f$), where
\[ F = \{(x,y) \mapsto \ell(g(x),y) - \ell(g^*(x),y) : g \in G\}. \]
Thus, for this choice of $F$, $Ef \ge 0$ for all $f \in F$, but functions in $F$ can have negative values.

2.2. Assumptions on F

Throughout this article, we assume that $F$ is a class of functions defined on a probability space $(\Omega,\mu)$ satisfying the following conditions:
(1) Each function in $F$ maps to the bounded interval $[-b,b]$.
(2) Each function in $F$ has nonnegative expectation.
(3) $F$ contains 0.
(4) $F$ has Bernstein type $\beta > 0$.
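To make the excess loss construction concrete, the following sketch builds the loss class and the excess loss class for a small finite set of predictors under the 0-1 loss, on a finite domain with a known distribution, so that conditions (1) to (3) can be checked directly. The toy distribution `prob`, the predictors, and all function names are assumptions chosen for illustration.

```python
def zero_one(y_hat, y):
    """0-1 loss: 1 if the prediction is wrong, 0 otherwise."""
    return 0.0 if y_hat == y else 1.0


def expectation(f, prob):
    """Ef for f given as a dict (x, y) -> value, under prob: (x, y) -> mass."""
    return sum(prob[z] * f[z] for z in prob)


def loss_class(predictors, prob, loss=zero_one):
    """One loss function (x, y) -> loss(g(x), y) per predictor g."""
    return [{(x, y): loss(g[x], y) for (x, y) in prob} for g in predictors]


def excess_loss_class(predictors, prob, loss=zero_one):
    """Subtract the loss of a risk minimizer g* from every loss function."""
    losses = loss_class(predictors, prob, loss)
    risks = [expectation(f, prob) for f in losses]
    best = losses[risks.index(min(risks))]       # loss function of g*
    return [{z: f[z] - best[z] for z in prob} for f in losses]


# Hypothetical toy problem: two observations x in {0, 1}, labels in {-1, +1}.
prob = {(0, 1): 0.4, (0, -1): 0.1, (1, 1): 0.2, (1, -1): 0.3}
predictors = [{0: 1, 1: 1}, {0: 1, 1: -1}, {0: -1, 1: -1}]
F = excess_loss_class(predictors, prob)

excess_risks = [expectation(f, prob) for f in F]
```

In this toy problem every excess risk is nonnegative and the class contains the zero function, while individual excess loss functions still take the value $-1$ at some points, which is the situation described above.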

We shall see shortly why these conditions are natural for many practical nonparametric and machine learning methods. The Bernstein condition, defined precisely below, is that the second moment of every function is bounded by a power of its expectation, uniformly over the class.

Definition 2.1. We say that $F$ is a $(\beta,B)$-Bernstein class with respect to the probability measure $P$ (where $0 < \beta \le 2$ and $B \ge 1$), if every $f \in F$ satisfies
\[ Ef^2 \le B(Ef)^\beta. \]
We say that $F$ has Bernstein type $\beta$ with respect to $P$ if there is some constant $B$ for which $F$ is a $(\beta,B)$-Bernstein class.

These conditions are satisfied by a large variety of loss classes arising in statistical settings. One simple example is the loss class, $F = \{(x,y) \mapsto \ell(g(x),y) : g \in G\}$, in the case where some function $g^* \in G$ has zero loss, that is, $E\ell(g^*(X),Y) = 0$. Clearly, $F$ contains 0, functions in $F$ are bounded and have nonnegative expectations, and trivially $F$ has Bernstein type 1: $Ef^2 \le bEf$. However, in practical problems, the assumption that there is some function $g^* \in G$ that has zero loss is often unreasonable. More realistic examples are excess loss classes,
\[ F = \{(x,y) \mapsto \ell(g(x),y) - \ell(g^*(x),y) : g \in G\}, \]
where $g^*$ in $G$ achieves the minimal risk over $G$. Clearly, functions in $F$ are bounded and have nonnegative expectation, and $F$ contains zero. As the following examples show, the boundedness and Bernstein conditions also frequently arise naturally.

Low noise classification: In two-class pattern classification, we have $\mathcal{Y} = \{\pm 1\}$, and $\ell(\hat y,y)$ is the 0-1 loss, that is, the indicator of $\hat y \ne y$. Clearly, the boundedness condition holds. A key factor in the difficulty of a pattern classification problem is the behavior of the conditional probability $\eta(x) = \Pr(Y = 1 \mid X = x)$, and in particular how likely it is to be near the critical value of 1/2. Starting with Tsybakov [27], many authors have considered [2,4,5,11,19,26] pattern classification when there is a constant $\epsilon$ such that the conditional probability satisfies
\[ \Pr\left(\left|\eta(X) - \frac{1}{2}\right| < \epsilon\right) = 0. \tag{10} \]
Suppose that we assume, as in [27], that the class $G$ contains the minimizer $g^*$ of the expected loss (the Bayes classifier), which is the indicator of $\eta(x) > 1/2$. Then it is easy to show that this implies the excess loss class is of Bernstein type 1. Indeed, one can verify that (10) is equivalent to the assertion that all measurable functions $g : \mathcal{X} \to \{\pm 1\}$ satisfy
\[ \Pr(g(X) \ne g^*(X)) \le \frac{1}{2\epsilon}\,E\bigl(\ell(g(X),Y) - \ell(g^*(X),Y)\bigr) \]
(see, for example, Lemma 5 in [2]). Therefore,
\[ E\bigl(\ell(g(X),Y) - \ell(g^*(X),Y)\bigr)^2 = \Pr(g(X) \ne g^*(X)) \le \frac{1}{2\epsilon}\,E\bigl(\ell(g(X),Y) - \ell(g^*(X),Y)\bigr). \]

Similarly, if there is a constant $\kappa \ge 0$ such that
\[ \Pr\left(\left|\eta(X) - \frac{1}{2}\right| \le \epsilon\right) \le c\,\epsilon^\kappa \tag{11} \]
for some $c$ and all $\epsilon > 0$ (see [27]), and the class $G$ contains the Bayes classifier, then this implies the excess loss class is of Bernstein type $\kappa/(1+\kappa)$ (see, for example, Lemma 5 in [2]).

Boosting with an $\ell_1$ constraint: Large margin classification methods, such as AdaBoost and support vector machines, minimize the sample average of a convex criterion over a class of real-valued functions. For example, Lugosi and Vayatis [15] consider empirical minimization with an exponential loss over a class of $\ell_1$-constrained linear combinations of binary functions: Define, for a given class $H$ of $\{\pm 1\}$-valued functions, the class
\[ G_\lambda = \Bigl\{\sum_i \alpha_i h_i : h_i \in H \text{ and } \sum_i |\alpha_i| \le \lambda\Bigr\}. \]
Let $\ell(\hat y,y) = \exp(-y\hat y)$, and consider the excess loss class
\[ F_\lambda = \{(x,y) \mapsto \ell(g(x),y) - \ell(g^*(x),y) : g \in G_\lambda\}, \]
where $g^*$ is the minimizer in $G_\lambda$ of the risk. Then for all probability distributions, functions in $F_\lambda$ are bounded by $b = \exp(\lambda)$ and have Bernstein type 1 (see Lemma 7 and Table 1 in [2]).

Support vector machines with low noise: The support vector machine is a method for pattern classification that chooses a function $f : \mathcal{X} \to \mathbb{R}$ from a reproducing kernel Hilbert space (RKHS) $H$ with kernel $k : \mathcal{X}^2 \to \mathbb{R}$ so as to minimize the regularized empirical risk criterion
\[ \frac{1}{n}\sum_{i=1}^n \ell(f(x_i),y_i) + \lambda\|f\|^2, \]
where $y_i \in \{\pm 1\}$, the loss is the hinge loss,
\[ \ell(\hat y,y) = \max\{0,\,1 - \hat y y\}, \tag{12} \]
and $\|f\|$ denotes the norm in the RKHS. This is equivalent, for some $r$, to solving the constrained optimization problem
\[ \min_{f \in H}\ \frac{1}{n}\sum_{i=1}^n \ell(f(x_i),y_i) \quad \text{s.t.} \quad \|f\|^2 \le r^2. \]
Define $H_r = \{g \in H : \|g\|^2 \le r^2\}$ and the excess loss class
\[ F_r = \{(x,y) \mapsto \ell(g(x),y) - \ell(g^*(x),y) : g \in H_r\}, \tag{13} \]
where $g^* \in H_r$ is the minimizer of the risk. Then if the kernel of the RKHS satisfies
\[ k(X,X) \le B \quad \text{almost surely}, \tag{14} \]
all functions in $F_r$ are bounded by $2Br$. Furthermore, if the probability distribution satisfies the low noise condition (11) and $F_r$ contains the Bayes classifier, then Lemma 7 of [4] shows that $F_r$ has Bernstein type $\kappa/(1+\kappa)$.

Thus, our assumptions are satisfied in this case, and the results in this article give estimates of the excess risk, that is, the difference between the expected loss and the infimum over all measurable functions of the expected loss. In fact, this also leads to an estimate of the excess risk as measured by the 0-1 loss: for all large margin classification methods, which minimize the sample average of a surrogate loss function, there is a general, optimal inequality relating the excess risk as measured by the surrogate loss to the excess risk as measured by the 0-1 loss [2].

Kernel ridge regression for classification: If, in the support vector machine, we replace the hinge loss (12) with the quadratic loss, $\ell(\hat y,y) = (\hat y - y)^2$, we obtain the kernel ridge regression method for pattern classification. Defining the class $F_r$ as in (13), if the kernel satisfies the bound (14), then every function in $F_r$ is bounded by $2Br$. Furthermore, without any constraints on the probability distribution, the uniform convexity of the loss function implies that $F_r$ has Bernstein type 1 [14].

Kernel regression with convex loss: Similar examples can be obtained when the quadratic loss is replaced by any power loss (see [20]). In kernel regression also, if the response variable satisfies $|Y| \le B$ almost surely, then the boundedness of the kernel implies boundedness of functions in the excess loss class, and uniform convexity of the loss implies that the excess loss class is Bernstein.

Figure 1. The graph of a function $\xi'_n$ that is sub-linear (cf. Lemma 2.3). Horizontal axis: $r$; vertical axis: $\xi'_n(r)$; reference lines $\alpha_1 r$, $\alpha_2 r$, $\alpha_3 r$.

2.3. Star-shaped classes

We begin with the following definition:

Definition 2.2. $F$ is called star-shaped around 0 if for every $f \in F$ and $0 \le \alpha \le 1$, $\alpha f \in F$.

We will show below that if $F$ is an excess loss class, then any empirical minimizer in $F$ is also an empirical minimizer in the set $\mathrm{star}(F,0) = \{\alpha f : f \in F,\ 0 \le \alpha \le 1\}$. Hence, one can replace $F$ with $\mathrm{star}(F,0)$ in the analysis of the empirical minimization problem.
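A small numerical check of this claim, under the assumption that the class is finite, contains the zero function, and is represented only through its values on the sample: the minimal sample sum over the class coincides with the minimal sample sum over a grid discretization of its star-shaped hull. The function name and the toy values are hypothetical.

```python
def empirical_infima(sample_values, num_alphas=101):
    """Compare the empirical minimum over F and over star(F, 0).

    sample_values: list of rows; row j holds (f_j(x_1), ..., f_j(x_n)).
                   The class is assumed to contain the zero function,
                   i.e. one row consists of zeros.
    Returns (I1, I2): the minimal sample sum over F, and over the grid
    {alpha * f : alpha = 0, 1/(num_alphas-1), ..., 1} approximating star(F, 0).
    """
    sums = [sum(row) for row in sample_values]
    i1 = min(sums)
    alphas = [k / (num_alphas - 1) for k in range(num_alphas)]
    i2 = min(a * s for a in alphas for s in sums)
    return i1, i2


# Hypothetical values of three functions (including 0) on a sample of size 3.
rows = [[0.0, 0.0, 0.0], [0.5, -1.0, 0.2], [1.0, 1.0, -0.5]]
I1, I2 = empirical_infima(rows)
```

Because the class contains 0, the minimal sum is nonpositive, so shrinking a function towards 0 can never lower its sample sum below that minimum; the grid minimum is therefore attained at $\alpha = 1$.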
Moreover, since $Ef$ and $E_n f$ are linear functionals in $f$, the localized complexity of $\mathrm{star}(F,0)$ is not considerably larger than that of $F$ (for instance, in the sense of covering numbers). The advantage in considering star-shaped classes is that it adds some regularity to the class, and thus the analysis of the empirical minimization problem becomes simpler. For example, it is easy to see that for star-shaped classes the functions $\xi_n(r)/r$ and $\xi'_n(r)/r$ are non-increasing. Figure 1 illustrates the graph of a typical function with this sub-linear property, which is stated formally in the following lemma.

Lemma 2.3. If $F$ is star-shaped around 0, then for any $0 < r_1 < r_2$,
\[ \frac{\xi'_n(r_1)}{r_1} \ge \frac{\xi'_n(r_2)}{r_2}. \]
In particular, if for some $\alpha$, $\xi'_n(r) \ge \alpha r$ then for all $0 < r' \le r$, $\xi'_n(r') \ge \alpha r'$. Analogous assertions hold for $\xi_n$.

In other words, for every $r$, the graph of $\xi'_n$ in the interval $[0,r]$ is above the line connecting $(r,\xi'_n(r))$ and $(0,0)$. For the sake of completeness we include the proof of Lemma 2.3, which was originally stated in [3].

Proof. (of Lemma 2.3) Fix a sample $X_1,\dots,X_n$ and, without loss of generality, suppose that $\sup_{f \in F_{r_2}}(Ef - E_n f)$ is attained at $f$. Since $F$ is star-shaped, the function $f' = \frac{r_1}{r_2}f \in F_{r_1}$ satisfies
\[ Ef' - E_n f' = \frac{r_1}{r_2}\sup_{f \in F_{r_2}}(Ef - E_n f), \]
and the first part follows. The second part follows directly from the first part by noting that
\[ \xi'_n(r') \ge \frac{r'}{r}\,\xi'_n(r) \ge \frac{r'}{r}\,\alpha r = \alpha r'. \]
The proof for $\xi_n$ is analogous.

As an example, Figure 2 illustrates the graph of a function $\xi'_n$ for the star-shaped hull of a class that contains only functions with expectations equal to either $r_1$ or $r_2$.

Figure 2. An example of a graph of a function $\xi'_n$ for the class $\mathrm{star}(F,0)$, where $F$ contains only functions with expectations $r_1$ and $r_2$.

The following lemma allows one to use $\mathrm{star}(F,0)$ in the analysis of the empirical minimization problem and obtain results regarding the empirical minimization problem over $F$.

Lemma 2.4. Let $F$ be a class of functions that contains 0.
(1) If $F$ is a $(\beta,B)$-Bernstein class then $\mathrm{star}(F,0)$ is also a $(\beta,B)$-Bernstein class.
(2) For every $x_1,\dots,x_n$, set
\[ I_1 = \inf\Bigl\{\sum_{i=1}^n f(x_i) : f \in F\Bigr\}, \qquad I_2 = \inf\Bigl\{\sum_{i=1}^n f(x_i) : f \in \mathrm{star}(F,0)\Bigr\}. \]
Then $I_1 = I_2$. Moreover, for every $\varepsilon > 0$ the set $\{f \in \mathrm{star}(F,0) : \sum_{i=1}^n f(x_i) \le I_1 + \varepsilon\}$ has a nonempty intersection with $F$.

Note that by Lemma 2.4, if the set of $\varepsilon$-approximate empirical minimizers relative to $\mathrm{star}(F,0)$ is contained in some set $A$, then the set of $\varepsilon$-approximate empirical minimizers relative to $F$ is also contained in $A$. In particular, consider the set $A = \{f : \gamma \le Ef \le \beta\}$. Thus, upper and lower estimates of the expectation of the empirical minimizer in $\mathrm{star}(F,0)$ would imply the same fact for all empirical minimizers in $F$.

Proof. (of Lemma 2.4) Every $g \in \mathrm{star}(F,0)$ is of the form $g = \alpha f$ for some $f \in F$ and $0 \le \alpha \le 1$. Since $\beta \le 2$ and $F$ is a $(\beta,B)$-Bernstein class,
\[ Eg^2 = \alpha^2 Ef^2 \le B\alpha^2(Ef)^\beta \le B(E\alpha f)^\beta = B(Eg)^\beta. \]
To prove the second part, notice that $I_2 \le I_1$. Since $0 \in F$, we have $I_1 \le 0$ and thus, if $I_2 = 0$ then the claim is obvious. Therefore, assume that $I_2 < 0$ and for the sake of simplicity, assume that the infimum is attained in $g = \alpha f$ for some $f \in F$ and $0 < \alpha \le 1$. If $\alpha < 1$ then
\[ I_1 \le \sum_{i=1}^n f(x_i) = \alpha^{-1}\sum_{i=1}^n g(x_i) = \alpha^{-1} I_2 < I_2, \]
which is impossible. Thus $\alpha = 1$ and $I_1 = I_2$. The final claim of the lemma follows using a similar argument.

Motivated by these observations, we redefine the set $F_r$ as
\[ F_r = \{f \in \mathrm{star}(F,0) : Ef = r\}. \]
For the remainder of the article, we use this in the definitions of the complexity parameters $\xi_{n,F,\mu}(r)$, $\xi'_{n,F,\mu}(r)$ in (2)-(3), and hence in the definitions of $r_n^*$, $s_n$, $r_{n,\varepsilon,+}$, and $r_{n,\varepsilon,-}$ in (5)-(9) as well.

2.4. Preliminary Results

If $F$ is star-shaped around 0 one can derive the following estimates for the empirical minimizer. (Recall the definition $r_n^* = \inf\{r : \xi_n(r) \le r/4\}$ and $r_n^{\prime *} = \inf\{r : \xi'_n(r) \le r/4\}$, where $\xi_n$ and $\xi'_n$ were defined above in (2) and (3).)

Theorem 2.5. [3] Let $F$ be a $(\beta,B)$-Bernstein class of functions bounded by $b$ that contains 0. Then there is an absolute constant $c$ such that with probability at least $1 - e^{-x}$, any empirical minimizer $\hat f \in F$ satisfies
\[ E\hat f \le \max\left\{r_n^*,\ \frac{cbx}{n},\ c\left(\frac{Bx}{n}\right)^{1/(2-\beta)}\right\}. \]
Also, with probability at least $1 - e^{-x}$, any empirical minimizer $\hat f \in F$ satisfies
\[ E\hat f \le \max\left\{r_n^{\prime *},\ \frac{cbx}{n},\ c\left(\frac{Bx}{n}\right)^{1/(2-\beta)}\right\}. \]
Thus, with high probability, $r_n^*$ is an upper bound for $E\hat f$, as long as $r_n^* \ge c/n^{1/(2-\beta)}$, and the same holds for $r_n^{\prime *}$. Note that $r_n^{\prime *}$ can be much smaller than $r_n^*$, and so the convergence rates obtained through $r_n^{\prime *}$ are potentially better. For $\beta = 1$, the estimates based on $r_n^*$ and $r_n^{\prime *}$ are at best $1/n$, and in general at best $1/n^{1/(2-\beta)}$. Thus, the degree of control of the variance through the expectation, as measured by the Bernstein condition, influences the best rate of convergence one can obtain in terms of $r_n^*$ and $r_n^{\prime *}$ using this method whenever one requires a confidence that is exponentially close to 1. In particular, this approach recovers the better learning rates for convex function classes from [14] and for low noise classification from [19,27], as both convexity of $F$ for squared-loss and low noise conditions imply that the loss class is Bernstein.

It turns out that this structural bound can be improved using a direct analysis of the empirical minimization process. Indeed, the next theorem shows that one can directly bound $E\hat f$ for the empirical minimizer without trying to relate the empirical and actual structures of $F$. It states that $E\hat f$ is concentrated around $s_n$ and therefore, with high probability, $E\hat f \le r_{n,\varepsilon,+}$, where $\varepsilon$ can be taken smaller than $c\sqrt{\log n / n}$. In addition, if the class is not too rich around 0, then with high probability, $E\hat f \ge r_{n,\varepsilon,-}$. (To recall the definitions of $s_n$, $r_{n,\varepsilon,+}$, and $r_{n,\varepsilon,-}$ see (7)-(9).) The result follows immediately from the main result of [3], together with the observations above about star-shaped classes.

Theorem 2.6. For any $c_1 > 0$, there is a constant $c$ (depending only on $c_1$) such that the following holds. Let $F$ be a $(\beta,B)$-Bernstein class of functions bounded by $b$ that contains 0. For every $n$ and $\varepsilon > 0$ define $r_{n,\varepsilon,+}$ and $r_{n,\varepsilon,-}$ as above, fix $x > 0$ and set
\[ r' = \max\left\{r_n^{\prime *},\ \frac{cb(x + \log n)}{n},\ c\left(\frac{B(x + \log n)}{n}\right)^{1/(2-\beta)}\right\}. \]
If
\[ \varepsilon \ge c\left(\max\Bigl\{\sup_s\bigl(\xi'_{n,F,\mu}(s) - s\bigr),\ r'^{\beta}\Bigr\}\,\frac{(B + b)(x + \log n)}{n}\right)^{1/2}, \]
then
(1) With probability at least $1 - e^{-x}$,
\[ E\hat f \le \max\left\{\frac{1}{n},\ r_{n,\varepsilon,+}\right\}. \]
(2) If
\[ E\sup\{Ef - E_n f : f \in \mathrm{star}(F,0),\ Ef \le c_1/n\} < \sup_s\bigl(\xi'_{n,F,\mu}(s) - s\bigr) - \varepsilon, \]
then with probability at least $1 - e^{-x}$,
\[ E\hat f \ge r_{n,\varepsilon,-}. \]

To compare this result to the previous one, note that $s_n \le r_n^{\prime *}$. Indeed, $\xi'_n(r) \ge E(Ef - E_n f) = 0$ for any fixed function $f$, and thus $\xi'_n(0) \ge 0$, $\xi'_n(s_n) \ge s_n$ and
\[ 0 \le s_n \le \inf\{r : \xi'_n(r) \le r\} \le r_n^{\prime *} \]
(where the last inequality holds since $\xi'_n(r)/r$ is non-increasing, by Lemma 2.3). It follows that if $\xi'_n(r) - r$ is not flat around $s_n$, then the bound resulting from Theorem 2.6 improves the structural bound of Theorem 2.5.
Figure 3 illustrates graphically such a case.

3. A true gap between the expectation of the empirical minimizer and $r_n^*$

In this section, we construct a class of functions for which there is a clear gap between the structural result of Theorem 2.5 and the expectation of the empirical minimizer, as estimated in Theorem 2.6. The idea behind this construction (as well as in the other construction we present later) is that one has complete freedom to choose the expectation of a function, while forcing it to have certain values on a given sample.

Figure 3. The graph of a function $\xi'_n$, and the corresponding values for $r_n^{\prime *}$, $s_n$, $r_{n,\varepsilon,+}$, and $r_{n,\varepsilon,-}$. If $s_n < r_n^{\prime *}$ and $\xi'_n(r) - r$ is peaked around $s_n$, then $r_{n,\varepsilon,+}$ is smaller than $r_n^{\prime *}$.

Let us start with an outline of the construction. It is based on the idea (developed in [3]) of two Bernstein classes of functions satisfying the following for any fixed $n$. The functions are defined on a finite set $\{1,\dots,m\}$ with respect to the uniform probability measure, where $m$ depends on $n$. The first class contains all functions that vanish on a set of cardinality $n$, but have expectations equal to a given constant. The second class consists of functions that each take their minimal values on a set of cardinality $n$, but have expectations equal to $1/n$. By appropriately choosing the values of the functions, one can show that the star-shaped hull of the union of these two classes has $r_n^* \ge c$, whereas $s_n \sim r_{n,\varepsilon,+} \sim 1/n$. Thus, the estimate given by Theorem 2.6 is considerably better than the one resulting from Theorem 2.5 for that fixed value of $n$. To make this example uniform over $n$, we construct similar sets on $(0,1]$, take the star-shaped hull of the union of all such sets and show that $\xi'_{n,F,\mu}(r) - r$ still achieves its maximum at $1/n$ and decays rapidly for $r > 1/n$, ensuring that $r_{n,\varepsilon,+} \ll r_n^*$. The first step in the construction is the following lemma.

Lemma 3.1. Let $\mu$ be the Lebesgue measure on $(0,1]$. Then, for every positive integer $n$ and any $1/n \le \lambda \le 1/2$ there exists a function class $G^n_\lambda$ such that
(1) For every $g \in G^n_\lambda$, $-1 \le g(x) \le 1$, $Eg = \lambda$ and $Eg^2 \le 2Eg$.
(2) For every set $\tau \subset (0,1]$ with $|\tau| \le n$, there is some $g \in G^n_\lambda$ such that for every $s \in \tau$, $g(s) = -1$.
Also, there exists a function class $H^n_\lambda$ such that
(1) For every $h \in H^n_\lambda$, $0 \le h(x) \le 1$, $Eh = \lambda$, and $Eh^2 \le Eh$.
(2) For every set $\tau \subset (0,1]$ with $|\tau| \le n$, there is some $h \in H^n_\lambda$ such that for every $s \in \tau$, $h(s) = 0$.

12 12 TITLE WILL BE SET BY THE PUBLISHER Proof. Let m = 2( 2 + ). Cosider fuctios that are costat o the itervals ((i 1)/m,i/m], 1 i m, ad set G λ to be the fuctio class cotaiig all fuctios takig the value 1 o exactly such itervals; that is, each fuctio i G λ is defied as follows: Let J {1,...,m}, J = ad set { 1, if x ( j 1 g J (x) = m, j m ] ad j J, t λ, otherwise, where t λ = λm + m = 2λ(2 + ) (15) + Sice 0 λ 1/2, 0 t λ 1 ad thus g J : (0,1] [ 1,1]. It is easy to verify that all the fuctios i G λ have expectatio λ with respect to µ ad that G λ is (1,2)-Berstei, sice for ay g G λ, Eg 2 = 1 m ( + t 2 λ (m ) ) 1 m ( + t λ(m )) = λ + 1 2λ = 2Eg. + 1 The costructio of Hλ is similar, ad its fuctios take the values {0,t λ } for t λ = λm/(m ). Usig the otatio of the lemma, defie the followig fuctio classes: ad H = H1/4 i, F k = G k 1/k, G = F i, i=5 i=5 F = star(g H,0). (16) Sice F cotais 0 ad is a (1,2)-Berstei class, it satisfies the assumptios of Theorem 2.5 ad Theorem 2.6. Moreover, it is star-shaped aroud 0 ad for ay 5 ad ay X 1,...,X there is some f F with Ef = 1/4 ad E f = 0, ad some g F with Eg = 1/ ad E g = 1. Ideed, f ca be take from H 1/4 ad g from F = G 1/. The followig theorem shows that for the class F, for ay iteger, r = 1/4, while the empirical miimizer is likely to be smaller tha r,ε,+ c/. Theorem 3.2. For F defied by (16), the followig holds: (1) For every 5, r + rk if r (1/(k + 1),1/k], where k ξ,f,µ(r) = r if r (1/5,1/4] 0 if r > 1/4, ad i particular, r = 1/4. (2) There exists a costat c > 1, such that the followig holds: for every ε < 3/4, every N(ε) ad every k /c, ξ,f,µ(1/k) 1/k ξ,f,µ(1/) 1/ ε. I particular, r,ε,+ c/. Note that by the properties of F metioed above, for every sample of cardiality 5, the graph of ξ for the class F H1/4 (which is the same as for the class star(f H1/4,0)) is as i Figure 4, with r = 1/4 ad s = 1/. For the star-shaped hull of the uio of all these sets, the fuctio ξ ca still be described i

13 TITLE WILL BE SET BY THE PUBLISHER 13 ξ (r) r + 1 r s = 1/ r = 1/4 r Figure 4. ξ,f H,µ (as i the proof of Theorem 3.2). 1/4 closed-form for values of r > 1/5 ad r 1/, because sup f Fr (Ef E f) is idepedet of the sample ad is reached at a scaled-dow fuctio from H ad respectively G; this is proved i part 1 of the theorem. O the other had, for 1/ < r < 1/5 this supremum is o loger idepedet of the sample ad thus we caot provide a simple closed-form for ξ. Despite that, ξ (r) r still achieves its maximum at 1/ ad decays rapidly for r > 1/, esurig that r,ε,+ r, which is the secod part of the theorem. Figure 5 illustrates the qualitative behavior of ξ. Proof. (of Theorem 3.2) For the first part of the proof, observe that the subsets F r cosistig of fuctios with expectatio Ef = r are H r G r if r < 1/5 F r = H r if r (1/5,1/4] if r > 1/4, where H r ad G r are the scaled-dow versios of H ad G, ad G r = 1/r k=5 {krg : g Gk 1/k }. The first part of the Theorem follows from the defiitio of the fuctio ξ ad the fact that for ay fixed sample of size, the ifimum if f Fr E f is equal to 0 ad reached at a scaled-dow fuctio from H1/4 for r (1/5,1/4], ad it is equal to -1 ad reached at a scaled-dow fuctio from G k 1/k wheever r (1/(k + 1),1/k] ad k. Turig to the secod, ad more difficult part, ote that ideed r = 1/4 ad that the maximal value of ξ,f,µ (r) r is attaied at r = 1/. I order to estimate the value ξ,f,µ (1/k) for k <, cosider sup f G 1/k(Ef k E f) for a fixed X 1,...,X. Let m = 2(k 2 + k) ad ote that by the costructio of G k 1/k, each g Gk 1/k is of the form g J for some set J {1,...,m}, J = k. For each set J let A J be the uio of the itervals ( j 1 m, j ] m where j J, ad let Φ be the followig set of idicator fuctios Φ = {½ AJ : J {1,...,m}, J = k}. Clearly, for every φ Φ, Eφ = k/m ad vc(φ) k, sice o set of k+1 distict poits i (0,1] ca be shattered by Φ (actually, vc(φ) = k sice the set {1/k,1/(k 1),...,1} is shattered by Φ). Recall that if Φ is a class

of binary-valued functions and if the VC-dimension $\mathrm{vc}(\Phi) \le k$, then as a special case of Theorem A.5, the Rademacher averages (see equation (18) for the definition) can be bounded by

$$ER_n(\Phi) \le c_2 \sqrt{k/n} \qquad (17)$$

for some absolute constant $c_2$.

[Figure 5. Qualitative behavior of $\xi_{n,F,\mu}$. Horizontal axis: $r$, with $1/n$, $c/n$, $1/5$ and $1/4$ marked; vertical axis: $\xi_n(r)$.]

Define the random variable $l_J = \sum_{i=1}^n \mathbb{1}_{A_J}(X_i)$. Thus, $l_J$ is the cardinality of the set $\{i : g_J(X_i) = -1\}$. Note that

$$E_n g_J = \frac{-2(l_J/n)(k+1)^2 + 3k + 2}{k(2k+1)},$$

and therefore,

$$\sup_{f \in G^k_{1/k}} (Ef - E_n f) = \frac{1}{k} + \frac{2(k+1)^2 \sup_J (l_J/n) - 3k - 2}{k(2k+1)}.$$

From Talagrand's concentration inequality (Theorem A.1) applied to the set of functions $\Phi$, there exist absolute constants $c_1, c_2$ such that for any $0 < t \le 1$, with probability larger than $1 - e^{-c_1 n t^2}$,

$$\sup_{\varphi \in \Phi} \sum_{i=1}^n \varphi(X_i) \le \frac{nk}{m} + 2nER_n(\Phi) + 2nt \le \frac{nk}{m} + 2c_2\sqrt{kn} + 2nt,$$

where the last inequality holds by (17). Setting $t = 1/20$, and since $nk/m = n/(2(k+1)) < n/10$ for any $k \ge 5$, it is evident that there exists an absolute constant $c > 1$ such that for any $k \le n/c$, with probability at least $1 - e^{-c_1 n}$,

$$\sup_J l_J \le n/5 + 2c_2\sqrt{kn} \le n/4.$$

Therefore, applying the union bound for $5 \le k' \le k$, it follows that with probability at least $1 - e^{-cn}$,

$$\sup_{f \in \bigcup_{k'=5}^{k} \frac{k'}{k} G^{k'}_{1/k'}} (Ef - E_n f) \le \frac{1}{k} + \frac{(k+1)^2/2 - 3k - 2}{k(2k+1)} \le \frac{1}{k} + \frac{1}{4}$$

for every $k \le n/c$. Observe that scaled-down versions of functions from $H$ do not contribute to $\xi_{n,F,\mu}(1/k)$ and thus, one only has to take care of elements in $F$ with expectation $1/k$ that either come from $G^k_{1/k}$ or are scaled-down versions of functions from $G^{k'}_{1/k'}$ for $k' \le k$. Hence,

$$\xi_{n,F,\mu}(1/k) = E \sup_{f \in \bigcup_{k'=5}^{k} \frac{k'}{k} G^{k'}_{1/k'}} (Ef - E_n f) \le \left(\frac{1}{k} + \frac{1}{4}\right)(1 - e^{-cn}) + \left(\frac{1}{k} + 1\right) e^{-cn} = \frac{1}{k} + \frac{1}{4} + \frac{3}{4} e^{-cn}.$$

Thus, for $\varepsilon < 3/4$, if $n$ is sufficiently large that $\frac{3}{4} e^{-cn} \le \frac{3}{4} - \varepsilon$, we have

$$\xi_{n,F,\mu}(1/k) - 1/k \le 1 - \varepsilon = \xi_{n,F,\mu}(1/n) - 1/n - \varepsilon,$$

provided that $k \le n/c$.

To conclude, there exists a true gap between the bound that can be obtained via the structural result (the fixed point $r_n^*$ of the localized empirical process) and the true expectation of the empirical minimizer as captured by $s_n^*$.

Corollary 3.3. For $F$ defined in (16), there is an absolute constant $c > 0$ for which the following holds: for any $x > 0$ there is an integer $N(x)$ such that for any $n \ge N(x)$,

(1) With probability at least $1 - e^{-x}$, $E\hat f \le c/n \sim s_n^*$.

(2) $r_n^* = 1/4$.

4. Estimating $r_n^*$ from data

The next question we wish to address is how to estimate the function $\xi_n(r)$ and the fixed point

$$r_n^* = \inf\left\{r : \xi_n(r) \le \frac{r}{4}\right\}$$

empirically, in cases where the global complexity of the function class, as captured, for example, by the covering numbers or the combinatorial dimension, is not known. A way of estimating $r_n^*$ is to find an empirically computable function $\hat\xi_n(r)$ that is, with high probability, an upper bound for the function $\xi_n(r)$; its fixed point $\hat r_n^* = \inf\{r : \hat\xi_n(r) \le r/4\}$ is then an upper bound for $r_n^*$. We shall construct $\hat\xi_n$ for which $\hat\xi_n(r)/r$ is non-increasing, so that $\hat r_n^*$ can be determined using a binary search algorithm. To that end, we require the following result, which states that, for Bernstein classes, there is a phase transition in the behavior of coordinate projections around the point where $\xi_n(r) \sim r$.
Above this point, the local subsets $F_r = \{f \in \mathrm{star}(F,0) : Ef = r\}$ are small and the expectation and empirical means are close in a multiplicative sense. Below this point, the sets $F_r$ are too rich to allow this.

Theorem 4.1. [3] There is an absolute constant $c$ for which the following holds. Let $F$ be a class of functions such that for every $f \in F$, $\|f\|_\infty \le b$. Assume that $F$ is a $(\beta, B)$-Bernstein class. Suppose that $r \ge 0$, $0 < \lambda < 1$, and $0 < \alpha < 1$ satisfy

$$r \ge c \max\left\{\frac{bx}{n\alpha^2\lambda}, \left(\frac{Bx}{n\alpha^2\lambda^2}\right)^{1/(2-\beta)}\right\}.$$

(1) If $\xi_n(r) \ge (1+\alpha) r\lambda$, then with probability at least $1 - e^{-x}$, $\sup_{f \in F_r} |Ef - E_n f| \ge \lambda Ef$.

(2) If $\xi_n(r) \le (1-\alpha) r\lambda$, then with probability at least $1 - e^{-x}$, $\sup_{f \in F_r} |Ef - E_n f| \le \lambda Ef$.

(3) If $\xi_n(r) \ge (1+\alpha) r\lambda$, then with probability at least $1 - e^{-x}$, $\sup_{f \in F_r} (Ef - E_n f) \ge \lambda Ef$.

(4) If $\xi_n(r) \le (1-\alpha) r\lambda$, then with probability at least $1 - e^{-x}$, $\sup_{f \in F_r} (Ef - E_n f) \le \lambda Ef$.

We will make use of the following direct corollary of Theorem 4.1 applied to the case $\alpha = 1/2$, $\lambda = 1/2$.

Corollary 4.2. There is an absolute constant $c > 0$ for which the following holds. If $F$ is $(\beta, B)$-Bernstein,

$$r \ge c \max\left\{\frac{bx}{n}, \left(\frac{Bx}{n}\right)^{1/(2-\beta)}\right\}$$

and $\xi_n(r) \le r/4$, then with probability larger than $1 - e^{-x}$, every $f \in F_r$ satisfies $r/2 \le E_n f \le 3r/2$.

If we define the empirical shell

$$\hat F_{r/2,\,3r/2} := \{f \in \mathrm{star}(F,0) : r/2 \le E_n f \le 3r/2\},$$

the corollary shows that, for suitably large $r$, with high probability, $F_r \subset \hat F_{r/2,\,3r/2}$.

The following theorem shows that the empirical Rademacher average of an empirical shell is, with high probability, an upper bound for $\xi_n(r)$ for all $r$ larger than the fixed point $r_n^*$. For this, define the random variables

$$R_n f = \frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i) \quad \text{and} \quad R_n(F) = \sup_{f \in F} R_n f, \qquad (18)$$

where $\sigma_1,\dots,\sigma_n$ denote independent Rademacher random variables, that is, symmetric, $\{-1,1\}$-valued random variables. The Rademacher averages of the class $F$ are defined as $ER_n(F)$, where the expectation is taken with respect to all random variables $X_i$ and $\sigma_i$. An empirical version of the Rademacher averages is obtained by conditioning on $X_1,\dots,X_n$:

$$E_\sigma R_n(F) = E(R_n(F) \mid X_1,\dots,X_n).$$
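As a small illustration of definition (18), the following Python sketch computes $E_\sigma R_n$ exactly for a finite class by enumerating all $2^n$ sign vectors, and filters a finite class down to an empirical shell. It is only a sketch under stated assumptions: the class is represented by a finite list of value vectors $(f(X_1),\dots,f(X_n))$, whereas the shells in the text are subsets of the (infinite) star-shaped hull, and the function names `empirical_rademacher` and `empirical_shell` are hypothetical helpers, not part of the paper.

```python
from itertools import product


def empirical_rademacher(values):
    """Exact E_sigma R_n for a finite class (assumed representation).

    values[j][i] holds f_j(X_i). All 2^n sign vectors are enumerated,
    so this is only practical for small n.
    """
    n = len(values[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        # R_n(F) for this sign vector: sup over f of (1/n) sum_i sigma_i f(X_i)
        total += max(sum(s * v for s, v in zip(sigma, row)) for row in values) / n
    return total / 2 ** n


def empirical_shell(values, lo, hi):
    """Keep the value vectors whose empirical mean E_n f lies in [lo, hi]."""
    return [row for row in values if lo <= sum(row) / len(row) <= hi]
```

For larger $n$ one would presumably replace the full enumeration by an average over randomly drawn sign vectors; that variant is not shown here.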

Theorem 4.3. There are absolute constants $c$, $c_1$, $c_2$, and $c_3$ for which the following holds. Let $F$ be a $(\beta, B)$-Bernstein class that contains 0 and for which $\sup_{f \in F} \|f\|_\infty \le b$. If

$$r_n' = \max\left\{r_n^*, \frac{1}{n}, \frac{cbx}{n}, c\left(\frac{Bx}{n}\right)^{1/(2-\beta)}\right\},$$

then with probability at least $1 - 2(bn+1)e^{-x}$,

$$\xi_n(r) \le 8E_\sigma R_n\left(\hat F_{c_1 r,\, c_2 r}\right) + c_3 r$$

for every $r \in [r_n', b]$.

Proof. By Lemma 2.3, $\xi_n(r) \le r/4$ if and only if $r \ge r_n^*$. Thus, by Corollary 4.2 (for an appropriately chosen $c$), if $r \ge r_n'$, then with probability larger than $1 - e^{-x}$, $F_r \subset \hat F_{r/2,\,3r/2}$, which implies that

$$E_\sigma R_n(F_r) \le E_\sigma R_n\left(\hat F_{r/2,\,3r/2}\right).$$

By symmetrization (Theorem A.2) and concentration of Rademacher averages around their mean (Theorem A.3), and since $r \ge cbx/n$, it follows that with probability at least $1 - 2e^{-x}$,

$$\xi_n(r) \le 2ER_n(F_r) \le 4E_\sigma R_n(F_r) + \frac{4bx}{n} \le 4E_\sigma R_n\left(\hat F_{r/2,\,3r/2}\right) + c_3 r.$$

To find an upper bound on $\xi_n(r)$ that holds with high probability uniformly for all $r \ge r_n'$, we divide the interval $[1/n, b]$ into a set of $bn$ intervals of length at most $1/n$. (Note that the choice of the starting point $1/n$ restricts the estimates for $r_n^*$ to values that are larger than $1/n$. The proof can be easily modified to allow estimates up to the value $cbx/n$, but since we are only interested in estimates that are at best of the order of $O(1/n)$ we made this restriction in order to keep the proof simpler.) Let

$$A = \left\{\frac{1}{n}, \frac{2}{n}, \dots, \frac{bn}{n}\right\} \cap [c_n, b], \quad \text{where } c_n = c\max\left\{\frac{bx}{n}, \left(\frac{Bx}{n}\right)^{1/(2-\beta)}\right\}.$$

Since $|A| \le bn + 1$, the union bound shows that with probability at least $1 - 2(bn+1)e^{-x}$,

$$\xi_n(r) \le 4E_\sigma R_n\left(\hat F_{r/2,\,3r/2}\right) + c_3 r$$

for every $r \in A$ with $r \ge r_n^*$. By Lemma 2.3, for any $1 \le k \le bn$, if $r \in [k/n, (k+1)/n]$, then $\xi_n(r) \le \frac{rn}{k}\,\xi_n(k/n)$. Thus, with probability at least $1 - 2(bn+1)e^{-x}$, every $r \in [r_n', b]$ satisfies

$$\xi_n(r) \le \frac{rn}{k}\,\xi_n\left(\frac{k}{n}\right) \le \frac{rn}{k}\left(4E_\sigma R_n\left(\hat F_{\frac{k}{2n},\,\frac{3k}{2n}}\right) + \frac{c_3 k}{n}\right) \le 8E_\sigma R_n\left(\hat F_{c_1 r,\, c_2 r}\right) + c_3 r,$$

where $k$ satisfies $r \in [k/n, (k+1)/n]$ and $c_1$ and $c_2$ are absolute constants.

Therefore, one can define

$$\hat\xi_n(r) = 8E_\sigma R_n\left(\hat F_{c_1 r,\, c_2 r}\right) + c_3 r.$$

Let $\hat r_n^* = \inf\{r : \hat\xi_n(r) \le r/4\}$. By Theorem 4.3, with probability at least $1 - 2(bn+1)e^{-x}$, $\hat r_n^* \ge r_n^*$. Moreover, since $\hat\xi_n(r)/r$ is non-increasing, $r \ge \hat r_n^*$ if and only if $\hat\xi_n(r) \le r/4$. With this, given a sample of size $n$, consider the following algorithm to estimate the upper bound $\hat r_n^*$ based on the data:

Algorithm RSTAR($F, X_1,\dots,X_n$)
  Set $r_L = \max\{1/n, c_n\}$, $r_R = b$.
  If $\hat\xi_n(r_R) \le r_R/4$ then
    for $l = 0$ to $\lceil \log_2 bn \rceil$
      set $r = (r_L + r_R)/2$;
      if $\hat\xi_n(r) > r/4$ then set $r_L = r$, else set $r_R = r$.
  Output $\bar r = r_R$.

By the construction, $\bar r - 1/n \le \hat r_n^* \le \bar r$. Hence, for every $n$, with probability larger than $1 - 2(bn+1)e^{-x}$, $r_n^* \le \bar r$.

Theorem 4.4. There exists an absolute constant $c$ for which the following holds. Let $F$ be a $(\beta, B)$-Bernstein class of functions bounded by $b$ that contains 0. For every integer $n$, any $x > 0$, and any sample $X$ of size $n$, with probability at least $1 - (2bn+3)e^{-x}$,

$$E\hat f \le \mathrm{RSTAR}(F, X).$$

Note that RSTAR($F, X$) is essentially the fixed point of the function $r \mapsto E_\sigma R_n(\hat F_{c_1 r,\, c_2 r})$. This function measures the complexity of the function class $\hat F_{c_1 r,\, c_2 r}$, which can be determined empirically by looking at the functions whose empirical means fall in an interval whose length is proportional to $r$. The main difference between this and the data-dependent estimates in [1] is that instead of taking the whole empirical ball as in [1], here we only measure the complexity of an empirical shell around $r$. However, if the function class is not regular around the critical value of $r$, the complexity of the shell $\hat F_{c_1 r,\, c_2 r}$ might be very different from the complexity of $F_r$, in which case one would like to make $c_1$ and $c_2$ very close to 1.

Indeed, one can tighten this bound further by narrowing the size of the shell and replacing the empirical set $\hat F_{r/2,\,3r/2}$ with $\hat F_{(1-\varepsilon_n)r,\,(1+\varepsilon_n)r}$. This is done by selecting the isomorphism constant in Theorem 4.1 to depend on $n$ and tend to 1 as $n \to \infty$.

Theorem 4.5. Let $F$ be a $(\beta, B)$-Bernstein class that contains 0 such that $\sup_{f \in F} \|f\|_\infty \le b$. There is an absolute constant $c$ for which the following holds.
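If $0 < \varepsilon_n < 1$ and

$$r_n' = \max\left\{r_n^*, \frac{1}{n}, \frac{cbx}{n\varepsilon_n}, c\left(\frac{Bx}{n\varepsilon_n^2}\right)^{1/(2-\beta)}\right\},$$

then with probability at least $1 - 2(bn+1)e^{-x}$,

$$\xi_n(r) \le 4E_\sigma R_n\left(\hat F_{(1-\varepsilon_n)r,\,(1+\varepsilon_n)r}\right) + \frac{\varepsilon_n r}{c}$$

for every $r \in [r_n', b]$.

Proof. With the same reasoning as before, by Theorem 4.1 for $\alpha = 1/2$ and $\lambda = \varepsilon_n$, if $r \ge r_n'$ then with probability larger than $1 - e^{-x}$, $F_r \subset \hat F_{(1-\varepsilon_n)r,\,(1+\varepsilon_n)r}$. We define

$$\hat\xi_n(r) = \frac{rn}{k}\left(4E_\sigma R_n\left(\hat F_{(1-\varepsilon_n)r,\,(1+\varepsilon_n)r}\right) + \frac{k\varepsilon_n}{cn}\right) \quad \text{for } r \in \left[\frac{k}{n}, \frac{k+1}{n}\right].$$

Again, with probability at least $1 - 2(bn+1)e^{-x}$, for every $r \in [r_n', b]$, $\xi_n(r) \le \hat\xi_n(r)$.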
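The binary search in Algorithm RSTAR can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the callable `xi_hat` stands in for an empirically computed $\hat\xi_n$ and is assumed to satisfy that $\hat\xi_n(r)/r$ is non-increasing, the parameter `ratio` is a hypothetical generalization of the threshold $1/4$ (so that a different test of the form $\hat\xi_n(r) > \text{ratio} \cdot r$ can be plugged in), and the midpoint $(r_L + r_R)/2$ is used at each step.

```python
import math
from typing import Callable


def rstar(xi_hat: Callable[[float], float], n: int, b: float,
          c_n: float = 0.0, ratio: float = 0.25) -> float:
    """Binary search for an upper estimate of inf{r : xi_hat(r) <= ratio * r}.

    xi_hat is a user-supplied (hypothetical) upper bound on xi_n with
    xi_hat(r) / r non-increasing; n is the sample size and b the sup-norm
    bound of the class.
    """
    r_left = max(1.0 / n, c_n)
    r_right = b
    if xi_hat(r_right) <= ratio * r_right:
        # l = 0, ..., ceil(log2(b n)): enough halvings to reach width <= 1/n
        steps = math.ceil(math.log2(b * n)) + 1
        for _ in range(steps):
            r = 0.5 * (r_left + r_right)
            if xi_hat(r) > ratio * r:
                r_left = r    # r lies below the fixed point
            else:
                r_right = r   # r lies at or above the fixed point
    return r_right
```

For instance, with the constant function `xi_hat = lambda r: 0.05`, sample size `n = 100` and `b = 1.0`, the fixed point of the test is $0.2$ and the returned value lies within $1/n$ above it.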

Since $\hat\xi_n(r)/r$ is non-increasing, it is possible to define

$$\hat r_n^* = \inf\left\{r : \hat\xi_n(r) \le \frac{r\varepsilon_n}{2}\right\}$$

with a slight modification of RSTAR (we replace the test in the if-clause, $\hat\xi_n(r) > r/4$, with $\hat\xi_n(r) > r\varepsilon_n/2$). It follows that for every $n$ and every sample of size $n$, with probability larger than $1 - 2bne^{-x}$, $r_n^* \le \bar r$, where $\bar r$ is generated by the modified algorithm.

For example, one can choose $\varepsilon_n = 1/\log n$, which has the advantage that the empirical shells $\hat F_{r - \frac{r}{\log n},\, r + \frac{r}{\log n}}$ become, with growing sample size, closer to $F_r$. The price we pay for this advantage is an extra $\log n$ factor in the final estimate, since in this case the estimate of the expectation goes down at the rate of $O(\log n / n)$.

Remark 4.6. Note that a lower bound of a similar nature has to take into account the complexity of the class $F_{0,cr}$. This might happen because one may not have an inclusion $\hat F_r \subset F_{c_1 r,\, c_2 r}$ unless $c_1 = 0$. Indeed, if the class $F$ is very rich for $r$ close to 0, it is possible to have functions that have a very small expectation, but for which $E_n f \sim r$.

5. The limitations of estimating from data

Although the results in [3] show that it is possible to bound the expectation of the empirical minimizer in a far sharper way than by applying a structural result, it was not clear whether such a bound could be estimated from data. In the following we consider a scenario in which one only has access to the function class through the values that class members take on finite samples, that is, the finite dimensional coordinate projections of the class. In this case, we construct an example that shows that, in general, it is impossible to establish a data-dependent estimate of $s_n^*$ that is better than $r_n^*$. To be precise, we construct two function classes that have identical coordinate projections on every sample. For one class we have $r_n^* \sim c$, $s_n^* \sim c$ and the expectation of the empirical minimizer is of the order of $c$ with probability 1, while for the other class, $s_n^* \sim 1/n$.
If one only has access to the way the classes behave on finite dimensional coordinate projections, that is, samples, the classes are indistinguishable, and it is impossible to predict a better bound than an absolute constant, which could be much worse than the true behavior of the empirical minimizer.

Recall that for a given function class $F$ and a sample $\tau = \{x_1,\dots,x_n\}$, the coordinate projection of $F$ on $\tau$ is

$$P_\tau F = \{(f(x_1),\dots,f(x_n)) : f \in F\}.$$

Let $\mu$ be the Lebesgue measure on $(0,1]$. For each $k \in \mathbb{N}$ we construct two function classes $F_1^k$ and $F_2^k$, both $(1,c)$-Bernstein with respect to $\mu$ for a suitable absolute constant $c$, and taking values in $V = \{-1, 0, 1\}$. In both classes we construct, each function is constant on the intervals $((j-1)/m_k, j/m_k]$, where $m_k = k^2 + 3k$.

The class $F_1^k$ consists of all functions that take the value $-1$ on $k$ intervals, the value $1$ on $2k$ intervals and the value $0$ on $k^2$ intervals. It is easy to verify that for any $f \in F_1^k$, $Ef = k/(k^2 + 3k) \sim 1/k$ and $Ef^2 = 3k/(k^2 + 3k) \sim 1/k$, implying that indeed $F_1^k$ is a $(1,3)$-Bernstein class.

In contrast, $F_2^k$ consists of all functions that take the value $-1$ on $k$ intervals, the value $1$ on $k^2 + k$ intervals and $0$ on $k$ intervals. Therefore, for any function $f \in F_2^k$, $Ef = k^2/(k^2 + 3k) \ge 1/4$ and since $Ef^2 \le 1$, $F_2^k$ is a $(1,4)$-Bernstein class. Notice that functions in $F_1^k$ have expectations of the order of $1/k$ while functions in $F_2^k$ have expectations of the order of a constant.

Set

$$F_1 = \mathrm{star}\left(\bigcup_{k \in \mathbb{N}} F_1^k, 0\right), \qquad F_2 = \mathrm{star}\left(\bigcup_{k \in \mathbb{N}} F_2^k, 0\right).$$

It is easy to verify that for every finite set $\tau$, $P_\tau F_1 = P_\tau F_2$. Indeed, consider a set $\tau = \{x_1,\dots,x_n\}$. Without loss of generality, assume that $x_i \ne x_j$ if $i \ne j$. Let $l$ be large enough to ensure that the $x_i$'s fall in disjoint intervals $((j-1)/m_l, j/m_l]$ and that $l \ge n$, and thus,

$$P_\tau F_2^l = P_\tau F_1^l = \{-1, 0, 1\}^n.$$
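The arithmetic behind this construction can be checked with exact rational numbers. The Python sketch below computes $Ef$ and $Ef^2$ for a step function from its interval counts, and enumerates the coordinate projection of a single class $F_1^l$ or $F_2^l$ on $n$ sample points. It rests on two assumptions that are not spelled out in the text: the sample points are taken to lie in distinct intervals, and a value vector is treated as attainable exactly when it uses each value no more often than there are intervals carrying that value. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def class_stats(neg, pos, zero):
    """(Ef, Ef^2) for a step function on neg + pos + zero equal intervals
    taking the value -1 on neg of them, +1 on pos and 0 on zero."""
    m = neg + pos + zero
    return Fraction(pos - neg, m), Fraction(pos + neg, m)


def f1_counts(k):
    """Interval counts (number of -1, +1, 0 intervals) for F_1^k."""
    return (k, 2 * k, k * k)


def f2_counts(k):
    """Interval counts (number of -1, +1, 0 intervals) for F_2^k."""
    return (k, k * k + k, k)


def projection(counts, n):
    """Value vectors attainable on n sample points in distinct intervals."""
    neg, pos, zero = counts
    cap = {-1: neg, 1: pos, 0: zero}
    return {v for v in product((-1, 0, 1), repeat=n)
            if all(v.count(val) <= c for val, c in cap.items())}
```

With these helpers, `projection(f1_counts(3), 3)` and `projection(f2_counts(3), 3)` both equal the full set $\{-1,0,1\}^3$, while for $l = 2 < n = 3$ the two projections differ, which illustrates why $l \ge n$ is required.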

Therefore, $F_1$ and $F_2$ are star-shaped Bernstein classes that have identical coordinate projections, making it impossible to distinguish the two based solely on empirical data. On the other hand, the behavior of the empirical minimizer is very different in the two cases.

Theorem 5.1. For $F_1$ and $F_2$ defined as above, there is an absolute constant $c > 0$ for which the following holds. For any $x > 0$ there is some $N(x)$ such that for any $n \ge N(x)$,

(1) For $F_1$, with probability at least $1 - e^{-x}$, $E\hat f \le c/n \sim s_n^*(F_1)$.

(2) For $F_2$, with probability 1, $E\hat f \ge 1/4 \sim r_n^*(F_2)$.

Theorem 5.1 implies that the estimates for the convergence rate of the empirical minimization algorithm based on $s_n^*$ are significantly better for the class $F_1$ than for $F_2$. However, the classes have identical coordinate projections on any sample, and hence are indistinguishable empirically. Thus, one cannot get an empirical estimate of the convergence rate for $F_1$ that is significantly better than one based on an empirical estimate of $r_n^*$.

Proof. We will show that the expectation of the empirical minimizer in $F_1$ is likely to be smaller than $c/n$, as opposed to $F_2$ where it is likely to be of the order of a constant. For any $n$, $\inf_{f \in F_1} E_n f = -1$, and therefore

$$\xi_{n,F_1,\mu}(s_n) - s_n = 1,$$

where, for any $k$ and any $f \in F_1^k$,

$$s_k = Ef = \frac{k}{k^2 + 3k} \sim \frac{1}{k}.$$

Clearly, for a class of functions bounded by 1, $\xi_{n,F,\mu}(r) - r \le 1$, and thus the maximal value of $\xi_{n,F_1,\mu}(r) - r$ is attained at $s_n \sim 1/n$.

The main part of the proof is to show that there is some absolute constant $c > 1$ such that for large enough values of $n$ and for $r \ge c/n$, $\xi_{n,F_1,\mu}(r) - r \le 1/2$. This is the case because the sets $F_1^k$ are not rich enough when projected onto samples of size $n$ as long as $k \le n/c$. Indeed, the function class $F_1$ has low complexity in terms of the combinatorial dimension $\mathrm{vc}(F_1, \varepsilon)$ (see Definition A.4). In particular, the definitions imply that $\mathrm{vc}(F_1^k, \varepsilon) \le 2k$ for all $0 < \varepsilon \le 2$ and all $k$. Since the class of functions is bounded by 1, Theorem A.5 implies there is an absolute constant $c_2$ such that

$$ER_n(F_1^k) \le c_2\sqrt{k/n}.$$
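Applying the one-sided version of Talagrand's concentration inequality for the empirical process $Z = \sup_{f \in F_1^k}(Ef - E_n f)$, it follows that for $t = 1/4$, with probability at least $1 - e^{-c_1 n t^2} = 1 - e^{-c_1 n/16}$,

$$\sup_{f \in F_1^k}(Ef - E_n f) \le 2ER_n(F_1^k) + t \le 2c_2\sqrt{\frac{k}{n}} + t \le \frac{1}{2},$$

provided that $k \le n/c$ for some universal constant $c$. Let

$$A_k = \bigcup_{k' \le k} \frac{s_k}{s_{k'}} F_1^{k'},$$

that is, $A_k$ contains the functions in $F_1$ that have expectation $s_k$; those either come from $F_1^k$ or are scaled-down versions of functions from $F_1^{k'}$ for $k' < k$. Therefore, with probability at least $1 - e^{-c_1 n}$ (for a possibly different absolute constant $c_1$), for any $k \le n/c$,

$$\sup_{f \in A_k}(Ef - E_n f) \le \frac{1}{2}.$$

Taking the expectation,

$$\xi_{n,F_1,\mu}(s_k) \le \frac{1}{2}\left(1 - e^{-c_1 n}\right) + (1 + s_k)\,e^{-c_1 n} = \frac{1}{2} + \left(\frac{1}{2} + s_k\right) e^{-c_1 n},$$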

and thus, for all $\varepsilon < 1/2$, $n \ge N(\varepsilon)$ and $k \le n/c$,

$$\xi_{n,F_1,\mu}(s_k) - s_k \le 1 - \varepsilon - s_k = \xi_{n,F_1,\mu}(s_n) - s_n - \varepsilon - s_k.$$

This implies that $\xi_{n,F_1,\mu}(r) - r \le \xi_{n,F_1,\mu}(s_n) - s_n - \varepsilon$ for every $r \ge c'/n$, from which we conclude that $r_{n,\varepsilon,+} \le c'/n$.

On the other hand, it is easy to verify that for empirical minimization over $F_2$, $E\hat f \ge 1/4$. Indeed, as we saw for $F_1$, $\inf_{f \in F_2} E_n f = -1$, which implies $E_n \hat f = -1$. Since we can write

$$F_2 = \{\alpha f : f \in F_2^k,\ k \in \mathbb{N},\ \alpha \in [0,1]\},$$

and the empirical mean is linear in $\alpha$, it is clear that the empirical minimum will be attained at $\alpha = 1$ (using a similar argument to the one used in Lemma 2.4). Since all the functions in $\bigcup_{k \in \mathbb{N}} F_2^k$ have expectation at least $1/4$, with probability 1, $E\hat f \ge 1/4$ in this case.

Remark 5.2. Note that if one is given the function $\hat f$ that the algorithm produced, rather than just the coordinate projections, it becomes possible to distinguish whether the class at hand is $F_1$ or $F_2$. However, we can define an uncountable collection of function classes

$$\mathcal{F} = \left\{\mathrm{star}\left(\bigcup_{k \in \mathbb{N}} F^k_{\alpha_k}, 0\right) : \alpha_k \in \{1,2\} \text{ for } k \in \mathbb{N}\right\},$$

where if $\alpha_k = 1$ then $F^k_{\alpha_k} = F_1^k$ and if $\alpha_k = 2$ then $F^k_{\alpha_k} = F_2^k$. Clearly, for every $H, G \in \mathcal{F}$ and every finite $\sigma \subset \Omega$, $P_\sigma(G) = P_\sigma(H)$. If the learner knows that $F \in \mathcal{F}$, then even if $\hat f$ is given to him, the best thing that could be said is that a single component of $F$, say the $j$th component of $F$, is $F_1^j$ or $F_2^j$. It is impossible to say whether other components of $F$ are of type 1 or type 2 and in particular, the convergence rate for the expectation of the empirical minimizer can be as bad as for $F_2$.

The second observation worth noting is that the class $F_1$ is not a Glivenko-Cantelli class. The classes $F_1^k$ become richer as $k$ grows, that is, in the part of $F_1$ in which the expectation of functions is smaller. The reason one can still obtain a generalization bound even for classes that are not Glivenko-Cantelli is that the method of [3] uses the expectation of the empirical process indexed by $\{f \in \mathrm{star}(F,0) : Ef = r\}$, and each one of these sets is a Glivenko-Cantelli class.
If one were to try to bound the error of the empirical minimizer using the localization $\{f \in F : Ef \le r\}$ as in [1], it would be impossible.

Appendix A. Additional material

The main technical tool we require is Talagrand's celebrated concentration theorem for empirical processes [13, 24]. The version we use is due to Bousquet [7], building on Massart's argument (see also [10, 17, 22]).

Theorem A.1. Let $F$ be a class of functions defined on $\mathcal{X}$ and let $P$ be a probability measure such that for every $f \in F$, $\|f\|_\infty \le b$ and $Ef = 0$. Let $X_1,\dots,X_n$ be independent random variables distributed according to $P$ and set $\sigma^2 = n \sup_{f \in F} \mathrm{var} f$. Define

$$Z = \sup_{f \in F} \sum_{i=1}^n f(X_i), \qquad \bar Z = \sup_{f \in F} \left|\sum_{i=1}^n f(X_i)\right|.$$

For every $x > 0$ and every $\rho > 0$,

$$\Pr\left(\left\{Z \ge (1+\rho)EZ + \sigma\sqrt{Kx} + K(1 + \rho^{-1})bx\right\}\right) \le e^{-x},$$

$$\Pr\left(\left\{Z \le (1-\rho)EZ - \sigma\sqrt{Kx} - K(1 + \rho^{-1})bx\right\}\right) \le e^{-x},$$

and the same inequalities hold for $\bar Z$. Here, $K$ is an absolute constant.

The rest of this section is devoted to some results that allow one to estimate $E\sup_{f \in F}|Ef - E_n f|$ via the Rademacher process indexed by the class. Recall the definition of the Rademacher averages of a class from equation (18). A well-known symmetrization argument (due to Giné and Zinn) connects the expectation of $\sup_{f \in F}|Ef - E_n f|$ to the Rademacher averages of $F$ [30].

Theorem A.2. Let $F$ be a class of functions defined on $(\Omega, \mu)$ and let $X_1,\dots,X_n$ be independent random variables distributed according to $\mu$. Then,

$$E\sup_{f \in F}|Ef - E_n f| \le 2ER_n(F).$$

The next result, which follows directly from a self-bounding property of the Rademacher process and the methods developed in [6], shows that $E_\sigma R_n(F)$ is highly concentrated around its expectation; hence, the Rademacher averages of a class can be upper bounded by their empirical version. The following formulation can be found in [1].

Theorem A.3. Let $F$ be a class of bounded functions defined on $(\Omega, \mu)$ taking values in $[a,b]$ and let $X_1,\dots,X_n$ be independent random variables distributed according to $\mu$. Then, for any $0 < \alpha < 1$ and $x > 0$, with probability at least $1 - e^{-x}$,

$$ER_n(F) \le \frac{1}{1-\alpha} E_\sigma R_n(F) + \frac{(b-a)x}{4n\alpha(1-\alpha)}.$$

Also, with probability at least $1 - e^{-x}$,

$$\frac{1}{2} E_\sigma R_n(F) - \frac{cbx}{n} \le ER_n(F),$$

where $c$ is an absolute constant.

It is possible to bound $ER_n(F)$ using the combinatorial dimension of a set. Recall that a set $\sigma = \{x_1,\dots,x_n\}$ is shattered by a class of $\{0,1\}$-valued functions $F$ if $P_\sigma F = \{(f(x_1),\dots,f(x_n)) : f \in F\} = \{0,1\}^n$, and that the Vapnik-Chervonenkis dimension $d$ of $F$, denoted by $\mathrm{vc}(F)$, is the maximal cardinality of a subset of $\Omega$ that is shattered by $F$. In a similar way, one can define the combinatorial dimension of a class of real-valued functions.

Definition A.4.
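For every $\varepsilon > 0$, a set $\sigma = \{x_1,\dots,x_n\} \subset \Omega$ is said to be $\varepsilon$-shattered by $F$ if there is some function $s : \sigma \to \mathbb{R}$ such that for every $I \subset \{1,\dots,n\}$ there is some $f_I \in F$ for which $f_I(x_i) \ge s(x_i) + \varepsilon$ if $i \in I$, and $f_I(x_i) \le s(x_i) - \varepsilon$ if $i \notin I$. Let

$$\mathrm{vc}(F, \varepsilon) = \sup\{|\sigma| : \sigma \subset \Omega,\ \sigma \text{ is } \varepsilon\text{-shattered by } F\}.$$

The following result is a recent extension, due to Rudelson and Vershynin [23], of well-known estimates on $ER_n(F)$.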
