

ESAIM: Probability and Statistics
URL: Will be set by the publisher

ON THE OPTIMALITY OF SAMPLE-BASED ESTIMATES OF THE EXPECTATION OF THE EMPIRICAL MINIMIZER

Peter L. Bartlett 1, Shahar Mendelson 2, 3 and Petra Philips 4

Abstract. We study sample-based estimates of the expectation of the function produced by the empirical minimization algorithm. We investigate the extent to which one can estimate the rate of convergence of the empirical minimizer in a data-dependent manner. We establish three main results. First, we provide an algorithm that upper bounds the expectation of the empirical minimizer in a completely data-dependent manner. This bound is based on a structural result due to Bartlett and Mendelson, which relates expectations to sample averages. Second, we show that these structural upper bounds can be loose, compared to previous bounds. In particular, we demonstrate a class for which the expectation of the empirical minimizer decreases as $O(1/n)$ for sample size $n$, although the upper bound based on structural properties is $\Omega(1)$. Third, we show that this looseness of the bound is inevitable: we present an example that shows that a sharp bound cannot be universally recovered from empirical data.

Mathematics Subject Classification. 62G08, 68Q32.

The dates will be set by the publisher.

1. Introduction

The empirical minimization algorithm is a statistical procedure that chooses a function that minimizes an empirical loss functional on a given class of functions. Known as an M-estimator in the statistical literature, it has been studied extensively [11,29,31]. Here, we investigate the limitations of estimates of the expectation of the function produced by the empirical minimization algorithm.

To be more exact, let $F$ be a class of real-valued functions defined on a probability space $(\Omega,\mu)$ and set $X_1,\ldots,X_n$ to be independent random variables distributed according to $\mu$. For $f \in F$ define $\mathbb{E}_n f = \frac{1}{n}\sum_{i=1}^n f(X_i)$ and let $\mathbb{E}f$ be the expectation of $f$ with respect to $\mu$.
The goal is to find a function that minimizes $\mathbb{E}f$ over $F$, where the only information available about the unknown distribution $\mu$ is through the finite sample $X_1,\ldots,X_n$. The empirical minimization algorithm produces the function $\hat f \in F$ that has the smallest empirical mean, that is, $\hat f$ satisfies
\[ \mathbb{E}_n \hat f = \min\{\mathbb{E}_n f : f \in F\}. \]

Keywords and phrases: error bounds; empirical minimization; data-dependent complexity
This work was supported in part by National Science Foundation grant
This work was supported in part by the Australian Research Council Discovery Grant DP
1 Computer Science Division and Department of Statistics, 367 Evans Hall #3860, University of California, Berkeley, CA, USA
2 Centre for Mathematics and its Applications (CMA), The Australian National University, Canberra, 0200 Australia
3 Department of Mathematics, Technion I.I.T., Haifa, 32000, Israel
4 Friedrich Miescher Laboratory of the Max Planck Society, Tübingen, 72076, Germany
© EDP Sciences, SMAI 1999

Throughout this article, we assume that such a minimum exists (the modifications required if this is not the case are obvious), that $F$ satisfies some minor measurability conditions, which we omit (see [8] for more details), and that for every $f \in F$, $\mathbb{E}f \ge 0$, which, as we explain later, is a natural assumption in the cases that interest us.

In statistical learning theory, this problem arises when one minimizes the empirical risk, or sample average of a loss incurred on a finite training sample. There, the aim is to ensure that the risk, or expected loss, is small. Thus, $f(X_i)$ represents the loss incurred on $X_i$. Performance guarantees are typically obtained through high probability bounds on the conditional expectation
\[ \mathbb{E}\hat f = \mathbb{E}\bigl(\hat f(X) \mid X_1,\ldots,X_n\bigr). \tag{1} \]
In particular, one is interested in obtaining fast and accurate estimates of the rates of convergence of this expectation to 0 as a function of the sample size $n$.

Classical estimates of this expectation rely on the uniform convergence over $F$ of sample averages to expectations (see, for example, [31]). These estimates are essentially based on the analysis of the supremum of the empirical process $\sup_{f \in F}(\mathbb{E}f - \mathbb{E}_n f)$ indexed by the whole class $F$. As opposed to these global estimates, it is possible to study local subsets of functions of $F$, for example, balls of a given radius with respect to a chosen metric. The supremum of the empirical process indexed by these local subsets, as a function of the radius of the balls, is called the modulus of continuity. Sharper localized estimates for the rate of convergence of the expectation can be obtained in terms of the fixed point of the modulus of continuity of the class [1,12,16,18,28]. Recent results [3] show that one can further significantly improve the high-probability estimates for the convergence rates for empirical minimizers.
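As a concrete illustration of the empirical minimization algorithm described above, the following minimal sketch picks the function with the smallest empirical mean from a small finite class. The class $f_t(x) = |x - t|$, the uniform distribution on $(0,1)$ and all names in the code are hypothetical choices made only for this example; they do not come from the article.

```python
import numpy as np

def empirical_minimizer(losses):
    """Return the index of the function with the smallest empirical mean.

    losses: array of shape (num_functions, n); losses[j, i] = f_j(X_i).
    """
    empirical_means = losses.mean(axis=1)       # E_n f_j for every j
    return int(np.argmin(empirical_means)), empirical_means

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 1.0, size=n)               # sample X_1..X_n from mu = Uniform(0, 1)
thresholds = np.linspace(0.0, 1.0, 11)          # hypothetical finite class, indexed by t
# f_t(x) = |x - t| has expectation t^2 - t + 1/2 >= 1/4 under mu.
losses = np.abs(x[None, :] - thresholds[:, None])
j_hat, means = empirical_minimizer(losses)
true_means = thresholds**2 - thresholds + 0.5   # E f_t, known here in closed form
```

Here `true_means[j_hat]` plays the role of the conditional expectation $\mathbb{E}\hat f$ in (1): it is the quantity the bounds in this article aim to control, and in practice it is unknown.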
These results are based on a new localized notion of complexity of subsets of $F$ containing functions with identical expectations and are therefore dependent on the underlying unknown distribution. In this article, we investigate the extent to which one can estimate these high-probability convergence rates in a data-dependent manner, an important aspect if one wants to make these estimates practically useful.

The results in [3] establish upper and lower bounds for the expectation $\mathbb{E}\hat f$ using two different arguments. The first is a structural result relating the empirical (random) structure endowed on the class by the selection of the coordinates $(X_1,\ldots,X_n)$, and the real structure, given by the measure $\mu$. The second is a direct analysis, which yields seemingly sharper bounds. In both cases (and under some mild structural assumptions on the class $F$), the bounds are given using a function that measures the localized complexity of subsets of $F$ consisting of functions with a fixed expectation $r$, denoted here by $F_r = \{f \in F : \mathbb{E}f = r\}$. For every integer $n$ and probability measure $\mu$ on $\Omega$, consider the following two sequences of functions, which measure the complexity of the sets $F_r$:
\[ \xi_{n,F,\mu}(r) = \mathbb{E}\sup\{|\mathbb{E}f - \mathbb{E}_n f| : f \in F_r\}, \tag{2} \]
\[ \xi'_{n,F,\mu}(r) = \mathbb{E}\sup\{\mathbb{E}f - \mathbb{E}_n f : f \in F_r\}. \tag{3} \]
In the following, in cases where the underlying probability measure $\mu$ and the class $F$ are clear, we will refer to these functions as $\xi_n$ and $\xi'_n$.

It turns out that these two functions control the generalization ability in $F_r$ whenever one has a strong degree of concentration for the empirical process suprema $\sup_{f \in F_r}|\mathbb{E}f - \mathbb{E}_n f|$ and $\sup_{f \in F_r}(\mathbb{E}f - \mathbb{E}_n f)$ around their expectation. Thus, $\xi_n$ and $\xi'_n$ can be used to derive bounds on the performance of the empirical minimization algorithm as long as these suprema are sufficiently concentrated. Therefore, the main tool required in the proofs of the results in [3] that provide bounds using $\xi_n$ and $\xi'_n$ is Talagrand's concentration inequality for empirical processes (see Theorem A.1 in the appendix).
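To make the two complexity functions in (2) and (3) tangible, here is a rough Monte Carlo sketch for a finite class on a finite probability space with the uniform measure. The toy class of indicator functions, the number of trials and the function name `local_complexities` are assumptions made purely for illustration.

```python
import numpy as np
from itertools import combinations

def local_complexities(F, r_values, n, num_trials=500, seed=0):
    """Monte Carlo estimates of the two-sided (2) and one-sided (3) complexities.

    F: array (num_functions, m); row j lists the values of f_j on the
       points 0..m-1, which carry the uniform measure mu.
    Returns a dict r -> (two_sided_estimate, one_sided_estimate).
    """
    rng = np.random.default_rng(seed)
    m = F.shape[1]
    expectations = F.mean(axis=1)                    # E f under the uniform mu
    out = {}
    for r in r_values:
        F_r = F[np.isclose(expectations, r)]         # level set {f : E f = r}
        two_sided, one_sided = [], []
        for _ in range(num_trials):
            sample = rng.integers(0, m, size=n)      # X_1..X_n drawn i.i.d. from mu
            dev = r - F_r[:, sample].mean(axis=1)    # E f - E_n f for each f in F_r
            two_sided.append(np.abs(dev).max())
            one_sided.append(dev.max())
        out[r] = (float(np.mean(two_sided)), float(np.mean(one_sided)))
    return out

# Hypothetical toy class: indicators of all 1- and 2-point subsets of 10 points.
m = 10
rows = []
for size in (1, 2):
    for J in combinations(range(m), size):
        row = np.zeros(m)
        row[list(J)] = 1.0
        rows.append(row)
F = np.array(rows)
est = local_complexities(F, r_values=(0.1, 0.2), n=5, num_trials=200)
```

In this toy class every sample of size 5 misses at least five points, so some function in each level set vanishes on the whole sample and the one-sided supremum equals $r$ exactly. This is the same "rich level set" effect that the constructions later in the article exploit.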
To see how $\xi_n$ and $\xi'_n$ can be used to derive generalization bounds, observe that it suffices to find the critical point $r_0$ for which, with high probability, for a given $0 < \lambda < 1$, every $r \ge r_0$ and every $f \in F_r$,
\[ (1-\lambda)\mathbb{E}f \le \mathbb{E}_n f \le (1+\lambda)\mathbb{E}f. \]
If the equivalence holds for a sample $(X_1,\ldots,X_n)$ for such an $r_0$, then every $f \in F$ satisfies
\[ \mathbb{E}f \le \max\left\{\frac{\mathbb{E}_n f}{1-\lambda},\, r_0\right\}, \tag{4} \]
and thus, an upper bound on the expectation of the empirical minimizer $\hat f$ can be established. It is possible to show that one can take $r_0$ as $r_n^*$, where
\[ r_n^* = \inf\{r : \xi_{n,G}(r) \le r/4\}, \tag{5} \]
where $G = \{\theta f : 0 \le \theta \le 1,\ f \in F\}$. In fact, since in (4) only a one-sided condition is required, one can actually use
\[ r_n^{\prime *} = \inf\{r : \xi'_{n,G}(r) \le r/4\}. \tag{6} \]
For the rest of this section we will assume that $F$ is star-shaped around 0 (that is, $G = F$), and we will explain the significance of this property later.

A more careful analysis, which uses the strength of Talagrand's concentration inequality for empirical processes, shows that the expectation of the empirical minimizer is governed by approximations of
\[ s_n^* = \sup\Bigl\{r : \xi'_n(r) - r = \max_s\{\xi'_n(s) - s\}\Bigr\}. \tag{7} \]
To see why $s_n^*$ is a likely candidate, note that for any empirical minimizer, the function of $r$ defined as $\sup_{f \in F_r}(\mathbb{E}f - \mathbb{E}_n f) - r = -\inf_{f \in F_r}\mathbb{E}_n f$ is maximized for the value $r = \mathbb{E}\hat f$. Assume that one has a very strong concentration of empirical processes indexed by $F_r$ around their mean for every $r > 0$, that is, with high probability, for every $r > 0$,
\[ \sup_{f \in F_r}(\mathbb{E}f - \mathbb{E}_n f) \approx \mathbb{E}\sup_{f \in F_r}(\mathbb{E}f - \mathbb{E}_n f) = \xi'_n(r). \]
Then, it would make sense to expect that, with high probability, $\mathbb{E}\hat f \approx s_n^*$ for $s_n^* = \operatorname{argmax}\{\xi'_n(r) - r\}$. More precisely, and to overcome the fact that $\mathbb{E}\sup_{f \in F_r}(\mathbb{E}f - \mathbb{E}_n f)$ is only very close to $\sup_{f \in F_r}(\mathbb{E}f - \mathbb{E}_n f)$, define for $\varepsilon > 0$,
\[ r_{n,\varepsilon,+} = \sup\Bigl\{r : \xi'_{n,F,\mu}(r) - r \ge \sup_s\bigl(\xi'_{n,F,\mu}(s) - s\bigr) - \varepsilon\Bigr\}, \tag{8} \]
\[ r_{n,\varepsilon,-} = \inf\Bigl\{r : \xi'_{n,F,\mu}(r) - r \ge \sup_s\bigl(\xi'_{n,F,\mu}(s) - s\bigr) - \varepsilon\Bigr\}. \tag{9} \]
Note that $r_{n,\varepsilon,+}$ and $r_{n,\varepsilon,-}$ are respectively upper and lower approximations of $s_n^*$ that become better as $\varepsilon \to 0$. They are close to $s_n^*$ if the function $\xi'_n(r) - r$ is peaked around its maximum. Under mild structural assumptions on $F$, $\mathbb{E}\hat f$ can be upper bounded by either $r_n^*$ or $r_{n,\varepsilon,+}$, and lower bounded by $r_{n,\varepsilon,-}$ for a choice of $\varepsilon = O(\sqrt{\log n / n})$ (see the exact statement in Theorem 2.6 below).
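The relationship between the fixed point in (6), the maximizer in (7) and the approximations in (8) and (9) can be explored numerically. The sketch below evaluates all four on a grid for a hypothetical complexity function $\sqrt{r/n}$, chosen only because its ratio to $r$ is non-increasing; neither this function nor the helper `critical_radii` comes from the article. The grid starts strictly above 0 so that the trivial point $r = 0$ is excluded from the infimum.

```python
import numpy as np

def critical_radii(xi_prime, grid, eps):
    """Grid approximations of the quantities defined in (6), (7), (8), (9).

    xi_prime: vectorized callable r -> one-sided complexity at r.
    grid: increasing array of radii, all strictly positive.
    """
    values = xi_prime(grid)
    gap = values - grid                          # xi'(r) - r
    best = gap.max()
    r_star = grid[values <= grid / 4].min()      # first grid point with xi'(r) <= r/4
    s_star = grid[gap >= best].max()             # largest maximizer of the gap
    near_max = grid[gap >= best - eps]           # eps-approximate maximizers
    return r_star, s_star, near_max.max(), near_max.min()

n = 100
grid = np.linspace(1e-4, 1.0, 10_000)
# Hypothetical complexity function with xi(r)/r non-increasing.
r_star, s_star, r_plus, r_minus = critical_radii(
    lambda r: np.sqrt(r / n), grid, eps=1e-3)
```

For this choice the maximizer of the gap is $1/(4n)$ while the fixed point is $16/n$, so the values returned satisfy `r_minus <= s_star <= r_plus < r_star`, in line with the ordering discussed in the text.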
Thus, these two parameters, the fixed point of $4\xi_n$ (denoted by $r_n^*$) and the points at which the maximum of $\xi'_n(r) - r$ is almost attained, are our main focus.

The first result we present here is that there is a true gap between $r_n^*$ and $s_n^*$, which implies that there is a true difference between the bound that could be obtained using the structural approach (i.e. $r_n^*$) and the true expectation of the empirical minimizer. We construct a class of functions satisfying the required structural assumptions and show that for any $n$, $r_n^{\prime *}$ is of the order of a constant (and thus $r_n^*$ is of the order of a constant), but the subsets $F_r$ are very rich when $r$ is close to 0 and $s_n^*$ and $r_{n,\varepsilon,+}$ are of the order of $1/n$. Let us mention that there is a construction related to this one in [3]: for every $n$ there is a function class $F_n$ for which this phenomenon occurs. The construction we present here is stronger, since it shows that, for some function class and probability distribution, the true convergence rate for a fixed class is far from the structural bound. The idea behind the construction is based on the one presented in [3], namely that one has complete freedom to choose the expectation of a function, while forcing it to have certain values on a given sample. For the class we construct and any large sample size, estimates for the convergence rates of the empirical minimizers based on $r_n^*$ are asymptotically not optimal (as they are $\Theta(1)$ whereas the true convergence rate is $O(1/n)$), and thus the structural bound does not capture the true behavior of the empirical minimizer.

The second question we tackle concerns the estimation of the expectation of the empirical minimizer from data. To that end, in Section 4, we present an efficient algorithm that enables one to estimate $r_n^*$ in a completely data-dependent manner. Then, in Section 5, we show that this type of data-dependent estimate is the best one can hope to have if one only has access to the function values on finite samples. We show that in such a case it is impossible to establish a data-dependent upper bound on the expectation of the empirical minimizer that is asymptotically better than $r_n^*$. The general idea is to construct two classes of functions that look identical when restricted to any sample of finite size, but for one class both a typical expectation of the empirical minimizer and $r_n^*$ are of the order of an absolute constant, while for the other a typical expectation is of the order of $1/n$.

2. Definitions and Preliminary Results

Loss Classes

One of the main applications of our investigations is the analysis of prediction problems, like classification or regression, arising in machine learning. Suppose that one is presented with a sequence of observation-outcome pairs $(x,y) \in \mathcal{X} \times \mathcal{Y}$, and the aim is to select a function $g : \mathcal{X} \to \mathcal{Y}$ that makes an accurate prediction of the outcome for each observation.
We assume that $(X,Y),(X_1,Y_1),\ldots,(X_n,Y_n)$ are chosen independently from a probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, but $P$ is unknown. The quality of the prediction is measured using a bounded loss function, $\ell : \mathcal{Y} \times \mathcal{Y} \to [0,b]$, where $\ell(\hat y,y)$ represents the cost incurred for prediction $\hat y$ when the true outcome is $y$. The risk of a function $g : \mathcal{X} \to \mathcal{Y}$ is defined as $\mathbb{E}\ell(g(X),Y)$, and the aim is to use the sequence $(X_1,Y_1),\ldots,(X_n,Y_n)$ to choose a function $g$ with minimal risk. Setting $f(x,y) = \ell(g(x),y)$, this task corresponds to minimizing $\mathbb{E}f$. In empirical risk minimization, one chooses $g$ from a set $G$ that minimizes the sample average of $\ell(g(x),y)$, which corresponds to choosing $f \in F$ that minimizes $\mathbb{E}_n f$, where $F$ is the loss class,
\[ F = \{(x,y) \mapsto \ell(g(x),y) : g \in G\}. \]
It is sometimes convenient to consider excess loss functions, $f(x,y) = \ell(g(x),y) - \ell(g^*(x),y)$, where $g^* \in G$ satisfies $\mathbb{E}\ell(g^*(X),Y) = \inf_{g \in G}\mathbb{E}\ell(g(X),Y)$. Since $g^*$ is fixed, choosing $g \in G$ that minimizes the risk (respectively, empirical risk) again corresponds to choosing $f \in F$ that minimizes $\mathbb{E}f$ (respectively, $\mathbb{E}_n f$), where
\[ F = \{(x,y) \mapsto \ell(g(x),y) - \ell(g^*(x),y) : g \in G\}. \]
Thus, for this choice of $F$, $\mathbb{E}f \ge 0$ for all $f \in F$, but functions in $F$ can have negative values.

Assumptions on F

Throughout this article, we assume that $F$ is a class of functions defined on a probability space $(\Omega,\mu)$ satisfying the following conditions:
(1) Each function in $F$ maps to the bounded interval $[-b,b]$.
(2) Each function in $F$ has nonnegative expectation.
(3) $F$ contains 0.
(4) $F$ has Bernstein type $\beta > 0$.
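The passage from a class of predictors to its excess loss class can be sketched on a finite probability space as follows. The four-atom distribution, the three predictors and the helper name `excess_loss_class` are hypothetical and serve only to show that excess loss functions have nonnegative expectation while still taking negative values.

```python
import numpy as np

def excess_loss_class(predictions, outcomes, probs, loss):
    """Excess loss functions and their expectations for a finite class G.

    predictions: array (num_predictors, m), g_j evaluated on m atoms;
    outcomes: array (m,), the outcome attached to each atom;
    probs: array (m,), the probability P of each (x, y) atom.
    """
    F_loss = loss(predictions, outcomes[None, :])    # f_j(x, y) = l(g_j(x), y)
    risks = F_loss @ probs                           # E l(g_j(X), Y)
    best = int(np.argmin(risks))                     # index of the risk minimizer g*
    F_excess = F_loss - F_loss[best]                 # l(g, .) - l(g*, .)
    return F_excess, F_excess @ probs, best

zero_one = lambda pred, y: (pred != y).astype(float)  # 0-1 loss
outcomes = np.array([1, -1, 1, 1])
probs = np.array([0.4, 0.1, 0.3, 0.2])
predictions = np.array([[1, 1, 1, 1],
                        [-1, -1, -1, -1],
                        [1, -1, -1, 1]])
F_excess, expectations, best = excess_loss_class(
    predictions, outcomes, probs, zero_one)
```

In this example the third predictor has excess loss values `[0, -1, 1, 0]`: it beats the risk minimizer on one atom, yet its expectation is still nonnegative.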

We shall see shortly why these conditions are natural for many practical nonparametric and machine learning methods. The Bernstein condition, defined precisely below, is that the second moment of every function is bounded by a power of its expectation, uniformly over the class.

Definition 2.1. We say that $F$ is a $(\beta,B)$-Bernstein class with respect to the probability measure $P$ (where $0 < \beta \le 2$ and $B \ge 1$), if every $f \in F$ satisfies
\[ \mathbb{E}f^2 \le B(\mathbb{E}f)^{\beta}. \]
We say that $F$ has Bernstein type $\beta$ with respect to $P$ if there is some constant $B$ for which $F$ is a $(\beta,B)$-Bernstein class.

These conditions are satisfied by a large variety of loss classes arising in statistical settings. One simple example is the loss class, $F = \{(x,y) \mapsto \ell(g(x),y) : g \in G\}$, in the case where some function $g^* \in G$ has zero loss, that is, $\mathbb{E}\ell(g^*(X),Y) = 0$. Clearly, $F$ contains 0, functions in $F$ are bounded and have nonnegative expectations, and trivially $F$ has Bernstein type 1: $\mathbb{E}f^2 \le b\,\mathbb{E}f$. However, in practical problems, the assumption that there is some function $g^* \in G$ that has zero loss is often unreasonable. More realistic examples are excess loss classes,
\[ F = \{(x,y) \mapsto \ell(g(x),y) - \ell(g^*(x),y) : g \in G\}, \]
where $g^*$ in $G$ achieves the minimal risk over $G$. Clearly, functions in $F$ are bounded and have nonnegative expectation, and $F$ contains zero. As the following examples show, the boundedness and Bernstein conditions also frequently arise naturally.

Low noise classification: In two-class pattern classification, we have $\mathcal{Y} = \{\pm 1\}$, and $\ell(\hat y, y)$ is the 0-1 loss, that is, the indicator of $\hat y \ne y$. Clearly, the boundedness condition holds. A key factor in the difficulty of a pattern classification problem is the behavior of the conditional probability $\eta(x) = \Pr(Y = 1 \mid X = x)$, and in particular how likely it is to be near the critical value of 1/2. Starting with Tsybakov [27], many authors have considered [2, 4, 5, 11, 19, 26] pattern classification when there is a constant $\epsilon$ such that the conditional probability satisfies
\[ \Pr\left(\left|\eta(X) - \frac{1}{2}\right| < \epsilon\right) = 0. \tag{10} \]
Suppose that we assume, as in [27], that the class $G$ contains the minimizer $g^*$ of the expected loss (the Bayes classifier), which is the indicator of $\eta(x) > 1/2$. Then it is easy to show that this implies the excess loss class is of Bernstein type 1. Indeed, one can verify that (10) is equivalent to the assertion that all measurable functions $g : \mathcal{X} \to \{\pm 1\}$ satisfy
\[ \Pr\bigl(g(X) \ne g^*(X)\bigr) \le \frac{1}{2\epsilon}\,\mathbb{E}\bigl(\ell(g(X),Y) - \ell(g^*(X),Y)\bigr) \]
(see, for example, Lemma 5 in [2]). Therefore,
\[ \mathbb{E}\bigl(\ell(g(X),Y) - \ell(g^*(X),Y)\bigr)^2 = \Pr\bigl(g(X) \ne g^*(X)\bigr) \le \frac{1}{2\epsilon}\,\mathbb{E}\bigl(\ell(g(X),Y) - \ell(g^*(X),Y)\bigr). \]
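On a finite probability space, the inequality in Definition 2.1 can be checked directly. The following sketch is only an illustration: the helper `is_bernstein`, the randomly generated class and the chosen constants are assumptions, not part of the article.

```python
import numpy as np

def is_bernstein(F, probs, beta, B, tol=1e-12):
    """Check E f^2 <= B (E f)^beta for every row f of F.

    F: array (num_functions, m) of function values on m atoms;
    probs: array (m,) with the probability of each atom.
    Assumes every E f is nonnegative, as in condition (2) above;
    slightly negative values from rounding are clipped to 0.
    """
    first = F @ probs                  # E f
    second = (F ** 2) @ probs          # E f^2
    return bool(np.all(second <= B * np.clip(first, 0.0, None) ** beta + tol))

rng = np.random.default_rng(0)
b = 2.0
probs = np.full(8, 1.0 / 8)
# Nonnegative losses bounded by b: f^2 <= b f pointwise, hence Bernstein type 1.
F_loss = rng.uniform(0.0, b, size=(50, 8))
zero_loss_case = is_bernstein(F_loss, probs, beta=1.0, B=b)

# A mean-zero, non-zero function violates the condition for every B.
F_bad = np.array([[1.0, -1.0]])
bad_case = is_bernstein(F_bad, np.array([0.5, 0.5]), beta=1.0, B=10.0)
```

The first call returns `True`, matching the remark that a class of nonnegative losses bounded by $b$ satisfies $\mathbb{E}f^2 \le b\,\mathbb{E}f$. The second returns `False`, because a function with zero mean and positive second moment cannot satisfy the inequality.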

Similarly, if there is a constant $\kappa \ge 0$ such that
\[ \Pr\left(\left|\eta(X) - \frac{1}{2}\right| \le \epsilon\right) \le c\,\epsilon^{\kappa} \tag{11} \]
for some $c$ and all $\epsilon > 0$ (see [27]), and the class $G$ contains the Bayes classifier, then this implies the excess loss class is of Bernstein type $\kappa/(1+\kappa)$ (see, for example, Lemma 5 in [2]).

Boosting with an $\ell_1$ constraint: Large margin classification methods, such as AdaBoost and support vector machines, minimize the sample average of a convex criterion over a class of real-valued functions. For example, Lugosi and Vayatis [15] consider empirical minimization with an exponential loss over a class of $\ell_1$-constrained linear combinations of binary functions: Define, for a given class $H$ of $\{\pm 1\}$-valued functions, the class
\[ G_\lambda = \Bigl\{\sum_i \alpha_i h_i : h_i \in H \text{ and } \sum_i |\alpha_i| \le \lambda\Bigr\}. \]
Let $\ell(\hat y,y) = \exp(-y\hat y)$, and consider the excess loss class
\[ F_\lambda = \{(x,y) \mapsto \ell(g(x),y) - \ell(g^*(x),y) : g \in G_\lambda\}, \]
where $g^*$ is the minimizer in $G_\lambda$ of the risk. Then for all probability distributions, functions in $F_\lambda$ are bounded by $b = \exp(\lambda)$ and have Bernstein type 1 (see Lemma 7 and Table 1 in [2]).

Support vector machines with low noise: The support vector machine is a method for pattern classification that chooses a function $f : \mathcal{X} \to \mathbb{R}$ from a reproducing kernel Hilbert space (RKHS) $H$ with kernel $k : \mathcal{X}^2 \to \mathbb{R}$ so as to minimize the regularized empirical risk criterion
\[ \frac{1}{n}\sum_{i=1}^n \ell(f(x_i),y_i) + \lambda\|f\|^2, \]
where $y_i \in \{\pm 1\}$, the loss is the hinge loss,
\[ \ell(\hat y,y) = \max\{0, 1 - \hat y y\}, \tag{12} \]
and $\|f\|$ denotes the norm in the RKHS. This is equivalent, for some $r$, to solving the constrained optimization problem
\[ \min_{f \in H}\ \frac{1}{n}\sum_{i=1}^n \ell(f(x_i),y_i) \quad \text{s.t.}\quad \|f\|^2 \le r^2. \]
Define $H_r = \{g \in H : \|g\|^2 \le r^2\}$ and the excess loss class
\[ F_r = \{(x,y) \mapsto \ell(g(x),y) - \ell(g^*(x),y) : g \in H_r\}, \tag{13} \]
where $g^* \in H_r$ is the minimizer of the risk. Then if the kernel of the RKHS satisfies
\[ k(X,X) \le B \quad \text{almost surely}, \tag{14} \]
all functions in $F_r$ are bounded by $2Br$. Furthermore, if the probability distribution satisfies the low noise condition (11) and $F_r$ contains the Bayes classifier, then Lemma 7 of [4] shows that $F_r$ has Bernstein type $\kappa/(1+\kappa)$.

[Figure 1. The graph of a function $\xi'_n$ that is sub-linear (cf. Lemma 2.3).]

Thus, our assumptions are satisfied in this case, and the results in this article give estimates of the excess risk, that is, the difference between the expected loss and the infimum over all measurable functions of the expected loss. In fact, this also leads to an estimate of the excess risk as measured by the 0-1 loss: for all large margin classification methods, which minimize the sample average of a surrogate loss function, there is a general, optimal inequality relating the excess risk as measured by the surrogate loss to the excess risk as measured by the 0-1 loss [2].

Kernel ridge regression for classification: If, in the support vector machine, we replace the hinge loss (12) with the quadratic loss, $\ell(\hat y,y) = (\hat y - y)^2$, we obtain the kernel ridge regression method for pattern classification. Defining the class $F_r$ as in (13), if the kernel satisfies the bound (14), then every function in $F_r$ is bounded by $2Br$. Furthermore, without any constraints on the probability distribution, the uniform convexity of the loss function implies that $F_r$ has Bernstein type 1 [14].

Kernel regression with convex loss: Similar examples can be obtained when the quadratic loss is replaced by any power loss (see [20]). In kernel regression also, if the response variable satisfies $|Y| \le B$ almost surely, then the boundedness of the kernel implies boundedness of functions in the excess loss class, and uniform convexity of the loss implies that the excess loss class is Bernstein.

Star-shaped classes

We begin with the following definition:

Definition 2.2. $F$ is called star-shaped around 0 if for every $f \in F$ and $0 \le \alpha \le 1$, $\alpha f \in F$.

We will show below that if $F$ is an excess loss class, then any empirical minimizer in $F$ is also an empirical minimizer in the set
\[ \operatorname{star}(F,0) = \{\alpha f : f \in F,\ 0 \le \alpha \le 1\}. \]
Hence, one can replace $F$ with $\operatorname{star}(F,0)$ in the analysis of the empirical minimization problem.
Moreover, since $\mathbb{E}f$ and $\mathbb{E}_n f$ are linear functionals in $f$, the localized complexity of $\operatorname{star}(F,0)$ is not considerably larger than that of $F$ (for instance, in the sense of covering numbers). The advantage in considering star-shaped classes is that it adds some regularity to the class, and thus the analysis of the empirical minimization problem becomes simpler. For example, it is easy to see that for star-shaped classes the functions $\xi_n(r)/r$ and $\xi'_n(r)/r$ are non-increasing. Figure 1 illustrates the graph of a typical function with this sub-linear property, which is stated formally in the following lemma.

[Figure 2. An example of a graph of a function $\xi'_n$ for the class $\operatorname{star}(F,0)$, where $F$ contains only functions with expectations $r_1$ and $r_2$.]

Lemma 2.3. If $F$ is star-shaped around 0, then for any $0 < r_1 < r_2$,
\[ \frac{\xi'_n(r_1)}{r_1} \ge \frac{\xi'_n(r_2)}{r_2}. \]
In particular, if for some $\alpha$, $\xi'_n(r) \ge \alpha r$ then for all $0 < r' \le r$, $\xi'_n(r') \ge \alpha r'$. Analogous assertions hold for $\xi_n$.

In other words, for every $r$, the graph of $\xi'_n$ in the interval $[0,r]$ is above the line connecting $(r,\xi'_n(r))$ and $(0,0)$. For the sake of completeness we include the proof of Lemma 2.3, which was originally stated in [3].

Proof. (of Lemma 2.3) Fix a sample $X_1,\ldots,X_n$ and, without loss of generality, suppose that $\sup_{f \in F_{r_2}}(\mathbb{E}f - \mathbb{E}_n f)$ is attained at $f$. Since $F$ is star-shaped, then $f' = \frac{r_1}{r_2}f \in F_{r_1}$ satisfies
\[ \mathbb{E}f' - \mathbb{E}_n f' = \frac{r_1}{r_2}\sup_{f \in F_{r_2}}(\mathbb{E}f - \mathbb{E}_n f), \]
and the first part follows. The second part follows directly from the first part by noting that
\[ \xi'_n(r') \ge \frac{r'}{r}\,\xi'_n(r) \ge \frac{r'}{r}\,\alpha r = \alpha r'. \]
The proof for $\xi_n$ is analogous.

As an example, Figure 2 illustrates the graph of a function $\xi'_n$ for the star-shaped hull of a class that contains only functions with expectations that are equal either to $r_1$ or to $r_2$.

The following lemma allows one to use $\operatorname{star}(F,0)$ in the analysis of the empirical minimization problem and obtain results regarding the empirical minimization problem over $F$.

Lemma 2.4. Let $F$ be a class of functions that contains 0.
(1) If $F$ is a $(\beta,B)$-Bernstein class then $\operatorname{star}(F,0)$ is also a $(\beta,B)$-Bernstein class.
(2) For every $x_1,\ldots,x_n$, set
\[ I_1 = \inf\Bigl\{\sum_{i=1}^n f(x_i) : f \in F\Bigr\}, \qquad I_2 = \inf\Bigl\{\sum_{i=1}^n f(x_i) : f \in \operatorname{star}(F,0)\Bigr\}. \]
Then $I_1 = I_2$. Moreover, for every $\varepsilon \ge 0$ the set $\{f \in \operatorname{star}(F,0) : \sum_{i=1}^n f(x_i) \le I_1 + \varepsilon\}$ has a nonempty intersection with $F$.

Note that by Lemma 2.4, if the set of $\varepsilon$-approximate empirical minimizers relative to $\operatorname{star}(F,0)$ is contained in some set $A$, then the set of $\varepsilon$-approximate empirical minimizers relative to $F$ is also contained in $A$. In particular, consider the set $A = \{f : \gamma \le \mathbb{E}f \le \beta\}$. Thus, upper and lower estimates of the expectation of the empirical minimizer in $\operatorname{star}(F,0)$ would imply the same fact for all empirical minimizers in $F$.

Proof. (of Lemma 2.4) Every $g \in \operatorname{star}(F,0)$ is of the form $g = \alpha f$ for some $f \in F$ and $0 \le \alpha \le 1$. Since $\beta \le 2$ and $F$ is a $(\beta,B)$-Bernstein class,
\[ \mathbb{E}g^2 = \alpha^2\mathbb{E}f^2 \le B\alpha^2(\mathbb{E}f)^{\beta} \le B(\mathbb{E}\alpha f)^{\beta} = B(\mathbb{E}g)^{\beta}. \]
To prove the second part, notice that $I_2 \le I_1$. Since $0 \in F$, we have $I_1 \le 0$ and thus, if $I_2 = 0$ then the claim is obvious. Therefore, assume that $I_2 < 0$ and for the sake of simplicity, assume that the infimum is attained in $g = \alpha f$ for some $f \in F$ and $0 < \alpha \le 1$. If $\alpha < 1$ then
\[ I_1 \le \sum_{i=1}^n f(x_i) = \alpha^{-1}\sum_{i=1}^n g(x_i) = \alpha^{-1}I_2 < I_2, \]
which is impossible. Thus $\alpha = 1$ and $I_1 = I_2$. The final claim of the lemma follows using a similar argument.

Motivated by these observations, we redefine the set $F_r$ as
\[ F_r = \{f \in \operatorname{star}(F,0) : \mathbb{E}f = r\}. \]
For the remainder of the article, we use this in the definitions of the complexity parameters $\xi_{n,F,\mu}(r)$, $\xi'_{n,F,\mu}(r)$ in (2)-(3), and hence in the definitions of $r_n^*$, $r_n^{\prime *}$, $s_n^*$, $r_{n,\varepsilon,+}$, and $r_{n,\varepsilon,-}$ in (5)-(9) as well.

Preliminary Results

If $F$ is star-shaped around 0 one can derive the following estimates for the empirical minimizer. (Recall the definitions $r_n^* = \inf\{r : \xi_n(r) \le r/4\}$ and $r_n^{\prime *} = \inf\{r : \xi'_n(r) \le r/4\}$, where $\xi_n$ and $\xi'_n$ were defined above in (2) and (3).)

Theorem 2.5. [3] Let $F$ be a $(\beta,B)$-Bernstein class of functions bounded by $b$ that contains 0. Then there is an absolute constant $c$ such that with probability at least $1 - e^{-x}$, any empirical minimizer $\hat f \in F$ satisfies
\[ \mathbb{E}\hat f \le \max\left\{r_n^*,\ \frac{cbx}{n},\ c\left(\frac{Bx}{n}\right)^{1/(2-\beta)}\right\}. \]
Also, with probability at least $1 - e^{-x}$, any empirical minimizer $\hat f \in F$ satisfies
\[ \mathbb{E}\hat f \le \max\left\{r_n^{\prime *},\ \frac{cbx}{n},\ c\left(\frac{Bx}{n}\right)^{1/(2-\beta)}\right\}. \]
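To get a feel for how the three terms in the bound of Theorem 2.5 trade off, the following sketch evaluates the maximum for given parameter values. The theorem only asserts that an absolute constant $c$ exists; the default `c=1.0` below is a placeholder assumption, and the restriction to $\beta < 2$ is imposed here only so that the exponent $1/(2-\beta)$ is defined.

```python
def structural_bound(r_star, n, x, b, B, beta, c=1.0):
    """Evaluate max{r_star, c*b*x/n, c*(B*x/n)**(1/(2-beta))}.

    r_star: a fixed point such as the one in (5) or (6);
    c: stand-in for the unspecified absolute constant (placeholder value).
    """
    if not 0.0 < beta < 2.0:
        raise ValueError("this sketch needs 0 < beta < 2")
    deviation_term = c * b * x / n
    bernstein_term = c * (B * x / n) ** (1.0 / (2.0 - beta))
    return max(r_star, deviation_term, bernstein_term)

# With beta = 1 both sample-size terms scale like 1/n ...
fast = structural_bound(r_star=0.0, n=100, x=1.0, b=1.0, B=1.0, beta=1.0)
# ... while a constant fixed point dominates everything else.
stuck = structural_bound(r_star=0.25, n=100, x=1.0, b=1.0, B=1.0, beta=1.0)
# A weaker Bernstein exponent gives the slower rate n**(-1/(2-beta)).
slow = structural_bound(r_star=0.0, n=100, x=1.0, b=1.0, B=1.0, beta=0.5)
```

The second call shows the point of the constructions in Section 3: when the fixed point stays at a constant, the structural bound cannot improve with $n$, whatever the other two terms do.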
Thus, with high probability, $r_n^*$ is an upper bound for $\mathbb{E}\hat f$, as long as $r_n^* \ge c/n^{1/(2-\beta)}$, and the same holds for $r_n^{\prime *}$. Note that $r_n^{\prime *}$ can be much smaller than $r_n^*$, and so the convergence rates obtained through $r_n^{\prime *}$ are potentially better. For $\beta = 1$, the estimates based on $r_n^*$ and $r_n^{\prime *}$ are at best $1/n$, and in general at best $1/n^{1/(2-\beta)}$. Thus, the degree of control of the variance through the expectation, as measured by the Bernstein condition, influences the best rate of convergence one can obtain in terms of $r_n^*$ and $r_n^{\prime *}$ using this method whenever one requires a confidence that is exponentially close to 1. In particular, this approach recovers the better learning rates for convex function classes from [14] and for low noise classification from [19, 27], as both convexity of $F$ for squared-loss and low noise conditions imply that the loss class is Bernstein.

It turns out that this structural bound can be improved using a direct analysis of the empirical minimization process. Indeed, the next theorem shows that one can directly bound $\mathbb{E}\hat f$ for the empirical minimizer without trying to relate the empirical and actual structures of $F$. It states that $\mathbb{E}\hat f$ is concentrated around $s_n^*$ and therefore, with high probability, $\mathbb{E}\hat f \le r_{n,\varepsilon,+}$, where $\varepsilon$ can be taken smaller than $c\sqrt{\log n/n}$. In addition, if the class is not too rich around 0, then with high probability, $\mathbb{E}\hat f \ge r_{n,\varepsilon,-}$. (To recall the definitions of $s_n^*$, $r_{n,\varepsilon,+}$, and $r_{n,\varepsilon,-}$ see (7)-(9).) The result follows immediately from the main result of [3], together with the observations above about star-shaped classes.

Theorem 2.6. For any $c_1 > 0$, there is a constant $c$ (depending only on $c_1$) such that the following holds. Let $F$ be a $(\beta,B)$-Bernstein class of functions bounded by $b$ that contains 0. For every $n$ and $\varepsilon > 0$ define $r_{n,\varepsilon,+}$ and $r_{n,\varepsilon,-}$ as above, fix $x > 0$ and set
\[ r' = \max\left\{r_n^{\prime *},\ \frac{cb(x + \log n)}{n},\ c\left(\frac{B(x + \log n)}{n}\right)^{1/(2-\beta)}\right\}. \]
If
\[ \varepsilon \ge c\left(\max\Bigl\{\sup_s\bigl(\xi'_{n,F,\mu}(s) - s\bigr),\ (r')^{\beta}\Bigr\}\,\frac{(B+b)(x + \log n)}{n}\right)^{1/2}, \]
then
(1) With probability at least $1 - e^{-x}$,
\[ \mathbb{E}\hat f \le \max\left\{\frac{1}{n},\ r_{n,\varepsilon,+}\right\}. \]
(2) If
\[ \mathbb{E}\sup\{\mathbb{E}f - \mathbb{E}_n f : f \in \operatorname{star}(F,0),\ \mathbb{E}f \le c_1/n\} < \sup_s\bigl(\xi'_{n,F,\mu}(s) - s\bigr) - \varepsilon, \]
then with probability at least $1 - e^{-x}$,
\[ \mathbb{E}\hat f \ge r_{n,\varepsilon,-}. \]

To compare this result to the previous one, note that $s_n^* \le r_n^{\prime *}$. Indeed, $\xi'_n(r) \ge \mathbb{E}(\mathbb{E}f - \mathbb{E}_n f) = 0$ for any fixed function $f$, and thus $\xi'_n(0) \ge 0$, $\xi'_n(s_n^*) \ge s_n^*$ and
\[ 0 \le s_n^* \le \inf\{r : \xi'_n(r) \le r\} \le r_n^{\prime *} \]
(where the last inequality holds since $\xi'_n(r)/r$ is non-increasing, by Lemma 2.3). It follows that if $\xi'_n(r) - r$ is not flat around $s_n^*$, then the bound resulting from Theorem 2.6 improves the structural bound of Theorem 2.5.
Figure 3 illustrates graphically such a case.

3. A true gap between the expectation of the empirical minimizer and $r_n^*$

In this section, we construct a class of functions for which there is a clear gap between the structural result of Theorem 2.5 and the expectation of the empirical minimizer, as estimated in Theorem 2.6. The idea behind this construction (as well as in the other construction we present later) is that one has complete freedom to choose the expectation of a function, while forcing it to have certain values on a given sample.

[Figure 3. The graph of a function $\xi'_n$, and the corresponding values for $r_n^{\prime *}$, $s_n^*$, $r_{n,\varepsilon,+}$, and $r_{n,\varepsilon,-}$. If $s_n^* \ll r_n^{\prime *}$ and $\xi'_n(r) - r$ is peaked around $s_n^*$, then $r_{n,\varepsilon,+}$ is smaller than $r_n^{\prime *}$.]

Let us start with an outline of the construction. It is based on the idea (developed in [3]) of two Bernstein classes of functions satisfying the following for any fixed $n$. The functions are defined on a finite set $\{1,\ldots,m\}$ with respect to the uniform probability measure, where $m$ depends on $n$. The first class contains all functions that vanish on a set of cardinality $n$, but have expectations equal to a given constant. The second class consists of functions that each take their minimal values on a set of cardinality $n$, but have expectations equal to $1/n$. By appropriately choosing the values of the functions, one can show that the star-shaped hull of the union of these two classes has $r_n^* \ge c$, whereas $s_n^* \le r_{n,\varepsilon,+} \sim 1/n$. Thus, the estimate given by Theorem 2.6 is considerably better than the one resulting from Theorem 2.5 for that fixed value of $n$. To make this example uniform over $n$, we construct similar sets on $(0,1]$, take the star-shaped hull of the union of all such sets and show that $\xi'_{n,F,\mu}(r) - r$ still achieves its maximum at $1/n$ and decays rapidly for $r > 1/n$, ensuring that $r_{n,\varepsilon,+} \ll r_n^*$. The first step in the construction is the following lemma.

Lemma 3.1. Let $\mu$ be the Lebesgue measure on $(0,1]$. Then, for every positive integer $n$ and any $1/n \le \lambda \le 1/2$ there exists a function class $G_\lambda^n$ such that
(1) For every $g \in G_\lambda^n$, $-1 \le g(x) \le 1$, $\mathbb{E}g = \lambda$ and $\mathbb{E}g^2 \le 2\mathbb{E}g$.
(2) For every set $\tau \subset (0,1]$ with $|\tau| \le n$, there is some $g \in G_\lambda^n$ such that for every $s \in \tau$, $g(s) = -1$.
Also, there exists a function class $H_\lambda^n$ such that
(1) For every $h \in H_\lambda^n$, $0 \le h(x) \le 1$, $\mathbb{E}h = \lambda$, and $\mathbb{E}h^2 \le \mathbb{E}h$.
(2) For every set $\tau \subset (0,1]$ with $|\tau| \le n$, there is some $h \in H_\lambda^n$ such that for every $s \in \tau$, $h(s) = 0$.
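The proof that follows realizes $G_\lambda^n$ with functions that are constant on $m = 2(n^2+n)$ equal-length intervals, equal to $-1$ on $n$ of them and to a common value $t_\lambda$ on the rest. The sketch below checks the moment claims of Lemma 3.1 in exact rational arithmetic, taking $t_\lambda = (\lambda m + n)/(m - n)$, which is the value forced by the requirement $\mathbb{E}g = \lambda$. The helper name `g_class_moments` is hypothetical.

```python
from fractions import Fraction

def g_class_moments(n, lam):
    """Value t_lambda and the first two moments of any g_J in the construction.

    Every g_J equals -1 on n of the m = 2(n^2 + n) equal-length intervals
    and t_lambda on the remaining m - n, so all g_J share the same moments.
    lam: a Fraction with 1/n <= lam <= 1/2.
    """
    m = 2 * (n * n + n)
    t = (lam * m + n) / Fraction(m - n)      # value taken off the chosen intervals
    mean = (-n + t * (m - n)) / m            # E g
    second = (n + t * t * (m - n)) / m       # E g^2
    return t, mean, second

# Check the claims for a range of n at both endpoints of the allowed lambda range.
checks = []
for n in range(2, 30):
    for lam in (Fraction(1, n), Fraction(1, 2)):
        t, mean, second = g_class_moments(n, lam)
        checks.append(0 <= t <= 1 and mean == lam and second <= 2 * lam)
all_ok = all(checks)
```

Using `Fraction` avoids floating-point noise, so the equality $\mathbb{E}g = \lambda$ is verified exactly rather than up to a tolerance.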

Proof. Let $m = 2(n^2 + n)$. Consider functions that are constant on the intervals $((i-1)/m, i/m]$, $1 \le i \le m$, and set $G_\lambda^n$ to be the function class containing all functions taking the value $-1$ on exactly $n$ such intervals; that is, each function in $G_\lambda^n$ is defined as follows: Let $J \subset \{1,\ldots,m\}$, $|J| = n$ and set
\[ g_J(x) = \begin{cases} -1, & \text{if } x \in \bigl(\frac{j-1}{m}, \frac{j}{m}\bigr] \text{ and } j \in J, \\ t_\lambda, & \text{otherwise,} \end{cases} \]
where
\[ t_\lambda = \frac{\lambda m + n}{m - n} = \frac{2\lambda(n^2 + n) + n}{2n^2 + n}. \tag{15} \]
Since $0 \le \lambda \le 1/2$, $0 \le t_\lambda \le 1$ and thus $g_J : (0,1] \to [-1,1]$. It is easy to verify that all the functions in $G_\lambda^n$ have expectation $\lambda$ with respect to $\mu$ and that $G_\lambda^n$ is $(1,2)$-Bernstein, since for any $g \in G_\lambda^n$,
\[ \mathbb{E}g^2 = \frac{1}{m}\bigl(n + t_\lambda^2(m - n)\bigr) \le \frac{1}{m}\bigl(n + t_\lambda(m - n)\bigr) = \lambda + \frac{1}{n+1} \le 2\lambda = 2\mathbb{E}g. \]
The construction of $H_\lambda^n$ is similar, and its functions take the values $\{0, t_\lambda\}$ for $t_\lambda = \lambda m/(m - n)$.

Using the notation of the lemma, define the following function classes:
\[ F_k = G_{1/k}^k, \qquad G = \bigcup_{i=5}^{\infty} F_i, \qquad H = \bigcup_{i=5}^{\infty} H_{1/4}^i, \]
and
\[ F = \operatorname{star}(G \cup H, 0). \tag{16} \]
Since $F$ contains 0 and is a $(1,2)$-Bernstein class, it satisfies the assumptions of Theorem 2.5 and Theorem 2.6. Moreover, it is star-shaped around 0 and for any $n \ge 5$ and any $X_1,\ldots,X_n$ there is some $f \in F$ with $\mathbb{E}f = 1/4$ and $\mathbb{E}_n f = 0$, and some $g \in F$ with $\mathbb{E}g = 1/n$ and $\mathbb{E}_n g = -1$. Indeed, $f$ can be taken from $H_{1/4}^n$ and $g$ from $F_n = G_{1/n}^n$.

The following theorem shows that for the class $F$, for any integer $n$, $r_n^{\prime *} = 1/4$, while the expectation of the empirical minimizer is likely to be smaller than $r_{n,\varepsilon,+} \le c/n$.

Theorem 3.2. For $F$ defined by (16), the following holds:
(1) For every $n \ge 5$,
\[ \xi'_{n,F,\mu}(r) = \begin{cases} r + rk & \text{if } r \in (1/(k+1), 1/k], \text{ where } k \ge n, \\ r & \text{if } r \in (1/5, 1/4], \\ 0 & \text{if } r > 1/4, \end{cases} \]
and in particular, $r_n^{\prime *} = 1/4$.
(2) There exists a constant $c > 1$, such that the following holds: for every $\varepsilon < 3/4$, every $n \ge N(\varepsilon)$ and every $k \le n/c$,
\[ \xi'_{n,F,\mu}(1/k) - 1/k \le \xi'_{n,F,\mu}(1/n) - 1/n - \varepsilon. \]
In particular, $r_{n,\varepsilon,+} \le c/n$.

Note that by the properties of $F$ mentioned above, for every sample of cardinality $n \ge 5$, the graph of $\xi'_n$ for the class $F_n \cup H_{1/4}^n$ (which is the same as for the class $\operatorname{star}(F_n \cup H_{1/4}^n, 0)$) is as in Figure 4, with $r_n^{\prime *} = 1/4$ and $s_n^* = 1/n$. For the star-shaped hull of the union of all these sets, the function $\xi'_n$ can still be described in closed form for values of $r > 1/5$ and $r \le 1/n$, because $\sup_{f \in F_r}(\mathbb{E}f - \mathbb{E}_n f)$ is independent of the sample and is reached at a scaled-down function from $H$ and $G$, respectively; this is proved in part 1 of the theorem. On the other hand, for $1/n < r < 1/5$ this supremum is no longer independent of the sample and thus we cannot provide a simple closed form for $\xi'_n$. Despite that, $\xi'_n(r) - r$ still achieves its maximum at $1/n$ and decays rapidly for $r > 1/n$, ensuring that $r_{n,\varepsilon,+} \ll r_n^*$, which is the second part of the theorem. Figure 5 illustrates the qualitative behavior of $\xi'_n$.

[Figure 4. $\xi'_{n,F_n \cup H_{1/4}^n,\mu}$ (as in the proof of Theorem 3.2), with $s_n^* = 1/n$ and $r_n^{\prime *} = 1/4$ marked on the $r$ axis.]

Proof. (of Theorem 3.2) For the first part of the proof, observe that the subsets $F_r$ consisting of functions with expectation $\mathbb{E}f = r$ are
\[ F_r = \begin{cases} H_r \cup G_r & \text{if } r \le 1/5, \\ H_r & \text{if } r \in (1/5, 1/4], \\ \emptyset & \text{if } r > 1/4, \end{cases} \]
where $H_r$ and $G_r$ are the scaled-down versions of $H$ and $G$, and
\[ G_r = \bigcup_{k=5}^{\lfloor 1/r \rfloor}\{krg : g \in G_{1/k}^k\}. \]
The first part of the theorem follows from the definition of the function $\xi'_n$ and the fact that for any fixed sample of size $n$, the infimum $\inf_{f \in F_r}\mathbb{E}_n f$ is equal to 0 and reached at a scaled-down function from $H_{1/4}^n$ for $r \in (1/5, 1/4]$, and it is equal to $-kr$ and reached at a scaled-down function from $G_{1/k}^k$ whenever $r \in (1/(k+1), 1/k]$ and $k \ge n$.

Turning to the second, and more difficult part, note that indeed $r_n^{\prime *} = 1/4$ and that the maximal value of $\xi'_{n,F,\mu}(r) - r$ is attained at $r = 1/n$. In order to estimate the value $\xi'_{n,F,\mu}(1/k)$ for $k < n$, consider $\sup_{f \in G_{1/k}^k}(\mathbb{E}f - \mathbb{E}_n f)$ for a fixed $X_1,\ldots,X_n$. Let $m = 2(k^2 + k)$ and note that by the construction of $G_{1/k}^k$, each $g \in G_{1/k}^k$ is of the form $g_J$ for some set $J \subset \{1,\ldots,m\}$, $|J| = k$. For each set $J$ let $A_J$ be the union of the intervals $\bigl(\frac{j-1}{m}, \frac{j}{m}\bigr]$ where $j \in J$, and let $\Phi$ be the following set of indicator functions
\[ \Phi = \{\mathbb{1}_{A_J} : J \subset \{1,\ldots,m\},\ |J| = k\}. \]
Clearly, for every $\varphi \in \Phi$, $\mathbb{E}\varphi = k/m$ and $\mathrm{vc}(\Phi) \le k$, since no set of $k+1$ distinct points in $(0,1]$ can be shattered by $\Phi$ (actually, $\mathrm{vc}(\Phi) = k$ since the set $\{1/k, 1/(k-1), \ldots, 1\}$ is shattered by $\Phi$). Recall that if $\Phi$ is a class of binary-valued functions and if the VC-dimension $\mathrm{vc}(\Phi) \le k$, then as a special case of Theorem A.5, the Rademacher averages (see equation (18) for the definition) can be bounded by
\[ \mathbb{E}R_n(\Phi) \le c_2\sqrt{k/n} \tag{17} \]
for some absolute constant $c_2$.

[Figure 5. Qualitative behavior of $\xi'_{n,F,\mu}$.]

Define the random variable $l_J = \sum_{i=1}^n \mathbb{1}_{A_J}(X_i)$. Thus, $l_J$ is the cardinality of the set $\{i : g_J(X_i) = -1\}$. Note that
\[ \mathbb{E}_n g_J = \frac{-2(l_J/n)(k+1)^2 + 3k + 2}{k(2k+1)}, \]
and therefore,
\[ \sup_{f \in G_{1/k}^k}(\mathbb{E}f - \mathbb{E}_n f) = \frac{1}{k} + \frac{2(k+1)^2\sup_J l_J/n - 3k - 2}{k(2k+1)}. \]
From Talagrand's concentration inequality (Theorem A.1) applied to the set of functions $\Phi$, there exist absolute constants $c_1, c_2$ such that for any $0 < t \le 1$, with probability larger than $1 - e^{-c_1 n t^2}$,
\[ \sup_{f \in \Phi}\sum_{i=1}^n f(X_i) \le \frac{nk}{m} + 2n\,\mathbb{E}R_n(\Phi) + 2tn \le \frac{nk}{m} + 2c_2\sqrt{kn} + 2tn, \]
where the last inequality holds by (17). Setting $t = 1/20$, and since $nk/m = n/(2(k+1)) < n/10$ for any $k \ge 5$, it is evident that there exists an absolute constant $c > 1$ such that for any $k \le n/c$, with probability at least $1 - e^{-c_1 n}$,
\[ \sup_J l_J \le n/5 + 2c_2\sqrt{kn} \le n/4. \]

Therefore, applying the union bound for $5 \le k' \le k$, it follows that with probability at least $1 - e^{-c' n}$,

$$\sup_{f \in \bigcup_{k'=5}^{k} \frac{k'}{k} G^{k'}_{1/k'}} (Ef - E_n f) \le \frac{1}{k} + \frac{(k+1)^2/2 - 3k - 2}{k(2k+1)} \le \frac{1}{k} + \frac{1}{4}$$

for every $k \le n/c$. Observe that scaled-down versions of functions from $H$ do not contribute to $\xi_{n,F,\mu}(1/k)$, and thus one only has to take care of elements in $F$ with expectation $1/k$ that come either from $G^k_{1/k}$ or are scaled-down versions of $G^{k'}_{1/k'}$ for $k' \le k$. Hence, since $Ef - E_n f \le 1/k + 1$ on the complementary event,

$$\xi_{n,F,\mu}(1/k) = E \sup_{f \in \bigcup_{k'=5}^{k} \frac{k'}{k} G^{k'}_{1/k'}} (Ef - E_n f) \le \left( \frac{1}{k} + \frac{1}{4} \right) (1 - e^{-c' n}) + \left( \frac{1}{k} + 1 \right) e^{-c' n} = \frac{1}{k} + \frac{1}{4} + \frac{3}{4} e^{-c' n}.$$

Thus, for $\varepsilon < 3/4$, if $n$ is sufficiently large that $\frac{3}{4} e^{-c' n} \le \frac{3}{4} - \varepsilon$, we have

$$\xi_{n,F,\mu}(1/k) - 1/k \le 1 - \varepsilon = \xi_{n,F,\mu}(1/n) - 1/n - \varepsilon,$$

provided that $k \le n/c$.

To conclude, there exists a true gap between the bound that can be obtained via the structural result (the fixed point $r_n^*$ of the localized empirical process) and the true expectation of the empirical minimizer as captured by $s_n^*$.

Corollary 3.3. For $F$ defined in (16), there is an absolute constant $c > 0$ for which the following holds: for any $x > 0$ there is an integer $N(x)$ such that for any $n \ge N(x)$,
(1) With probability at least $1 - e^{-x}$, $E\hat f \le c/n \sim s_n^*$.
(2) $r_n^* = 1/4$.

4. Estimating $r_n^*$ from data

The next question we wish to address is how to estimate the function $\xi_n(r)$ and the fixed point

$$r_n^* = \inf \left\{ r : \xi_n(r) \le \frac{r}{4} \right\}$$

empirically, in cases where the global complexity of the function class, as captured, for example, by the covering numbers or the combinatorial dimension, is not known. A way of estimating $r_n^*$ is to find an empirically computable function $\hat\xi_n(r)$ that is, with high probability, an upper bound for the function $\xi_n(r)$; its fixed point $\hat r_n^* = \inf \{ r : \hat\xi_n(r) \le r/4 \}$ is then an upper bound for $r_n^*$. We shall construct $\hat\xi_n$ for which $\hat\xi_n(r)/r$ is non-increasing, so that $\hat r_n^*$ can be determined using a binary search algorithm. To that end, we require the following result, which states that, for Bernstein classes, there is a phase transition in the behavior of coordinate projections around the point where $\xi_n(r) \sim r$. Above this point, the local subsets $F_r = \{ f \in \mathrm{star}(F, 0) : Ef = r \}$ are small and the expectation and empirical means are close in a multiplicative sense. Below this point, the sets $F_r$ are too rich to allow this.

Theorem 4.1 [3]. There is an absolute constant $c$ for which the following holds. Let $F$ be a class of functions such that for every $f \in F$, $\|f\|_\infty \le b$. Assume that $F$ is a $(\beta, B)$-Bernstein class. Suppose that $r \ge 0$, $0 < \lambda < 1$, and $0 < \alpha < 1$ satisfy

$$r \ge c \max \left\{ \frac{bx}{n \alpha^2 \lambda},\ \left( \frac{Bx}{n \alpha^2 \lambda^2} \right)^{1/(2-\beta)} \right\}.$$

(1) If $\xi_n(r) \ge (1 + \alpha) r \lambda$, then with probability at least $1 - e^{-x}$, $\sup_{f \in F_r} |Ef - E_n f| \ge \lambda Ef$.
(2) If $\xi_n(r) \le (1 - \alpha) r \lambda$, then with probability at least $1 - e^{-x}$, $\sup_{f \in F_r} |Ef - E_n f| \le \lambda Ef$.
(3) If $\xi_n(r) \ge (1 + \alpha) r \lambda$, then with probability at least $1 - e^{-x}$, $\sup_{f \in F_r} (Ef - E_n f) \ge \lambda Ef$.
(4) If $\xi_n(r) \le (1 - \alpha) r \lambda$, then with probability at least $1 - e^{-x}$, $\sup_{f \in F_r} (Ef - E_n f) \le \lambda Ef$.

We will make use of the following direct corollary of Theorem 4.1 applied to the case $\alpha = 1/2$, $\lambda = 1/2$.

Corollary 4.2. There is an absolute constant $c > 0$ for which the following holds. If $F$ is $(\beta, B)$-Bernstein,

$$r \ge c \max \left\{ \frac{bx}{n},\ \left( \frac{Bx}{n} \right)^{1/(2-\beta)} \right\},$$

and $\xi_n(r) \le r/4$, then with probability larger than $1 - e^{-x}$, every $f \in F_r$ satisfies $r/2 \le E_n f \le 3r/2$.

If we define the empirical shell

$$\hat F_{\frac{r}{2}, \frac{3r}{2}} := \{ f \in \mathrm{star}(F, 0) : r/2 \le E_n f \le 3r/2 \},$$

the corollary shows that, for suitably large $r$, with high probability, $F_r \subset \hat F_{\frac{r}{2}, \frac{3r}{2}}$.

The following theorem shows that the empirical Rademacher average of an empirical shell is, with high probability, an upper bound for $\xi_n(r)$ for all $r$ larger than the fixed point $r_n^*$. For this, define the random variables

$$R_n f = \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(X_i) \quad \text{and} \quad R_n(F) = \sup_{f \in F} R_n f, \qquad (18)$$

where $\sigma_1, \dots, \sigma_n$ denote independent Rademacher random variables, that is, symmetric, $\{-1, 1\}$-valued random variables. The Rademacher averages of the class $F$ are defined as $E R_n(F)$, where the expectation is taken with respect to all random variables $X_i$ and $\sigma_i$. An empirical version of the Rademacher averages is obtained by conditioning on $X_1, \dots, X_n$:

$$E_\sigma R_n(F) = E\left( R_n(F) \mid X_1, \dots, X_n \right).$$
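As a small illustration of the empirical quantity in (18), the following sketch computes $E_\sigma R_n(F)$ exactly for a finite coordinate projection by enumerating all sign vectors. The function name `empirical_rademacher_average` and the toy classes are assumptions made for this example, not part of the paper, and exhaustive enumeration is only feasible for very small $n$.

```python
from itertools import product


def empirical_rademacher_average(projection):
    """Exact E_sigma sup_f (1/n) sum_i sigma_i f(X_i) for a finite projection.

    `projection` holds one tuple (f(X_1), ..., f(X_n)) per function f.
    All 2^n sign vectors are enumerated (hypothetical helper, small n only).
    """
    rows = [tuple(row) for row in projection]
    n = len(rows[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        # R_n(F) for this sign vector: supremum of the signed empirical mean.
        total += max(sum(s * v for s, v in zip(signs, row)) / n for row in rows)
    return total / 2 ** n


# Toy projections on a sample of size n = 2 (illustrative values only).
single = [(1, 1)]                              # one function
symmetric = [(1, 1), (-1, -1)]                 # a function and its negative
full_cube = list(product((-1, 1), repeat=2))   # all sign patterns

print(empirical_rademacher_average(single))
print(empirical_rademacher_average(symmetric))
print(empirical_rademacher_average(full_cube))
```

The richer the projection, the larger the average: a single function cannot correlate with random signs on average, while the full cube matches every sign vector.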

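Theorem 4.3. There are absolute constants $c$, $c_1$, $c_2$, and $c_3$ for which the following holds. Let $F$ be a $(\beta, B)$-Bernstein class that contains $0$ and for which $\sup_{f \in F} \|f\|_\infty \le b$. If

$$r' = \max \left\{ r_n^*,\ \frac{1}{n},\ \frac{cbx}{n},\ c \left( \frac{Bx}{n} \right)^{1/(2-\beta)} \right\},$$

then with probability at least $1 - 2(bn + 1) e^{-x}$,

$$\xi_n(r) \le 8 E_\sigma R_n\left( \hat F_{c_1 r, c_2 r} \right) + c_3 r$$

for every $r \in [r', b]$.

Proof. By Lemma 2.3, $\xi_n(r) \le r/4$ if and only if $r \ge r_n^*$. Thus, by Corollary 4.2 (for appropriately chosen $c$), if $r \ge r'$, then with probability larger than $1 - e^{-x}$, $F_r \subset \hat F_{\frac{r}{2}, \frac{3r}{2}}$, which implies that

$$E_\sigma R_n(F_r) \le E_\sigma R_n\left( \hat F_{\frac{r}{2}, \frac{3r}{2}} \right).$$

By symmetrization (Theorem A.2) and concentration of Rademacher averages around their mean (Theorem A.3), and since $r \ge cbx/n$, it follows that with probability at least $1 - 2e^{-x}$,

$$\xi_n(r) \le 2 E R_n(F_r) \le 4 E_\sigma R_n(F_r) + \frac{4bx}{n} \le 4 E_\sigma R_n\left( \hat F_{\frac{r}{2}, \frac{3r}{2}} \right) + c_3 r.$$

To find an upper bound on $\xi_n(r)$ that holds with high probability uniformly for all $r \ge r'$, we divide the interval $[1/n, b]$ into a set of $bn$ intervals of length at most $1/n$. (Note that the choice of the starting point $1/n$ restricts the estimates for $r_n^*$ to values that are larger than $1/n$. The proof can be easily modified to allow estimates up to the value $cbx/n$, but since we are only interested in estimates that are at best of the order of $O(1/n)$, we made this restriction in order to keep the proof simpler.) Let

$$A = \left\{ \frac{1}{n}, \frac{2}{n}, \dots, b \right\} \cap [c', b], \quad \text{where} \quad c' = c \max \left\{ \frac{bx}{n},\ \left( \frac{Bx}{n} \right)^{1/(2-\beta)} \right\}.$$

Since $|A| \le bn + 1$, the union bound shows that with probability at least $1 - 2(bn + 1) e^{-x}$,

$$\xi_n(r) \le 4 E_\sigma R_n\left( \hat F_{\frac{r}{2}, \frac{3r}{2}} \right) + c_3 r$$

for every $r \in A$. By Lemma 2.3, for any $k \ge 1$, if $r \in \left[ \frac{k}{n}, \frac{k+1}{n} \right]$, then $\xi_n(r) \le \xi_n\left( \frac{k}{n} \right) \frac{rn}{k}$. Thus, with probability at least $1 - 2(bn + 1) e^{-x}$, every $r \in [r', b]$ satisfies

$$\xi_n(r) \le \xi_n\left( \frac{k}{n} \right) \frac{rn}{k} \le \left( 4 E_\sigma R_n\left( \hat F_{\frac{k}{2n}, \frac{3k}{2n}} \right) + \frac{c_3 k}{n} \right) \frac{rn}{k} \le 8 E_\sigma R_n\left( \hat F_{c_1 r, c_2 r} \right) + c_3 r,$$

where $k$ satisfies $r \in [k/n, (k+1)/n]$ and $c_1$ and $c_2$ are absolute constants.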

Therefore, one can define

$$\hat\xi_n(r) = 8 E_\sigma R_n\left( \hat F_{c_1 r, c_2 r} \right) + c_3 r.$$

Let $\hat r_n^* = \inf \{ r : \hat\xi_n(r) \le r/4 \}$. By Theorem 4.3, with probability at least $1 - 2(bn + 1) e^{-x}$, $\hat r_n^* \ge r_n^*$. Moreover, since $\hat\xi_n(r)/r$ is non-increasing, $r \ge \hat r_n^*$ if and only if $\hat\xi_n(r) \le r/4$. With this, given a sample of size $n$, consider the following algorithm to compute an upper bound on $\hat r_n^*$ based on the data:

Algorithm RSTAR($F$, $X_1, \dots, X_n$)
    Set $r_L = \max\{1/n, c'\}$, $r_R = b$.
    If $\hat\xi_n(r_R) \le r_R/4$ then
        for $l = 0$ to $\lceil \log_2 bn \rceil$
            set $r = (r_L + r_R)/2$;
            if $\hat\xi_n(r) > r/4$ then set $r_L = r$, else set $r_R = r$.
    Output $\bar r = r_R$.

By the construction, $\bar r - \frac{1}{n} \le \hat r_n^* \le \bar r$. Hence, for every $n$, with probability larger than $1 - 2(bn + 1) e^{-x}$, $r_n^* \le \bar r$.

Theorem 4.4. There exists an absolute constant $c$ for which the following holds. Let $F$ be a $(\beta, B)$-Bernstein class of functions bounded by $b$ that contains $0$. For every integer $n$, any $x > 0$, and any sample $X$ of size $n$, with probability at least $1 - (2bn + 3) e^{-x}$,

$$E \hat f \le \mathrm{RSTAR}(F, X).$$

Note that $\mathrm{RSTAR}(F, X)$ is essentially the fixed point of the function $r \mapsto E_\sigma R_n\left( \hat F_{c_1 r, c_2 r} \right)$. This function measures the complexity of the function class $\hat F_{c_1 r, c_2 r}$, which is determined empirically: it consists of the functions whose empirical means fall in an interval whose length is proportional to $r$. The main difference between this and the data-dependent estimates in [1] is that instead of taking the whole empirical ball as in [1], here we only measure the complexity of an empirical shell around $r$. However, if the function class is not regular around the critical value of $r$, the complexity of the shell $\hat F_{c_1 r, c_2 r}$ might be very different from the complexity of $F_r$, in which case one would like to make $c_1$ and $c_2$ very close to 1.

Indeed, one can tighten this bound further by narrowing the size of the shell and replacing the empirical set $\hat F_{\frac{r}{2}, \frac{3r}{2}}$ with $\hat F_{(1 - \varepsilon_n) r, (1 + \varepsilon_n) r}$. This is done by selecting the isomorphism constant in Theorem 4.1 to depend on $n$ and tend to 1 as $n \to \infty$.

Theorem 4.5. Let $F$ be a $(\beta, B)$-Bernstein class that contains $0$ such that $\sup_{f \in F} \|f\|_\infty \le b$. There is an absolute constant $c$ for which the following holds. If $0 < \varepsilon_n < 1$ and

$$r' = \max \left\{ r_n^*,\ \frac{1}{n},\ \frac{cbx}{n \varepsilon_n},\ c \left( \frac{Bx}{n \varepsilon_n^2} \right)^{1/(2-\beta)} \right\},$$

then with probability at least $1 - 2(bn + 1) e^{-x}$,

$$\xi_n(r) \le 4 E_\sigma R_n\left( \hat F_{(1 - \varepsilon_n) r, (1 + \varepsilon_n) r} \right) + \frac{\varepsilon_n r}{c}$$

for every $r \in [r', b]$.

Proof. With the same reasoning as before, by Theorem 4.1 for $\alpha = 1/2$ and $\lambda = \varepsilon_n$, if $r \ge r'$ then with probability larger than $1 - e^{-x}$, $F_r \subset \hat F_{(1 - \varepsilon_n) r, (1 + \varepsilon_n) r}$. We define

$$\hat\xi_n(r) = \left( 4 E_\sigma R_n\left( \hat F_{(1 - \varepsilon_n) r, (1 + \varepsilon_n) r} \right) + \frac{k \varepsilon_n}{cn} \right) \frac{rn}{k}, \quad \text{for } r \in \left[ \frac{k}{n}, \frac{k+1}{n} \right].$$

Again, with probability at least $1 - 2(bn + 1) e^{-x}$, for every $r \in [r', b]$, $\xi_n(r) \le \hat\xi_n(r)$.
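The binary search in Algorithm RSTAR can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the empirical bound is passed in as a callable `xi_hat` (hypothetical name), the midpoint is taken as the average of the two endpoints, the test constant is exposed as a `threshold` parameter, and the case in which the initial test fails simply returns $b$, which the text does not spell out.

```python
import math


def rstar(xi_hat, n, b, c_prime, threshold=0.25):
    """Binary search for an upper estimate of inf{r : xi_hat(r) <= threshold * r}.

    Assumes xi_hat(r) / r is non-increasing on [max(1/n, c_prime), b] and b * n >= 1.
    All argument names are illustrative assumptions.
    """
    r_left = max(1.0 / n, c_prime)
    r_right = b
    if xi_hat(r_right) <= threshold * r_right:
        # l = 0, ..., ceil(log2(b n)): enough halvings to reach width below 1/n.
        for _ in range(math.ceil(math.log2(b * n)) + 1):
            r = (r_left + r_right) / 2
            if xi_hat(r) > threshold * r:
                r_left = r    # r is still below the fixed point
            else:
                r_right = r   # r is at or above the fixed point
    return r_right


# Toy example: a constant bound xi_hat(r) = 0.01 has fixed point r = 0.04.
r_bar = rstar(lambda r: 0.01, n=100, b=1.0, c_prime=0.0)
print(r_bar)
```

In the toy run the output lies within $1/n$ above the fixed point $0.04$, matching the guarantee stated for the algorithm's output. Passing a different `threshold` covers variants of the test in the if-clause.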

Since $\hat\xi_n(r)/r$ is non-increasing, it is possible to define

$$\hat r_n^* = \inf \left\{ r : \hat\xi_n(r) \le \frac{r \varepsilon_n}{2} \right\}$$

with a slight modification of RSTAR (we replace the test in the if-clause, $\hat\xi_n(r) > r/4$, with $\hat\xi_n(r) > r \varepsilon_n / 2$). It follows that for every $n$ and every sample of size $n$, with probability larger than $1 - 2bn e^{-x}$, $r_n^* \le \bar r$, where $\bar r$ is generated by the modified algorithm. For example, one can choose $\varepsilon_n = 1/\log n$, which has the advantage that the empirical shells $\hat F_{r - \frac{r}{\log n},\, r + \frac{r}{\log n}}$ become, with growing sample size, closer to $F_r$. The price we pay for this advantage is an extra $\log$ factor in the final estimate, since in this case the estimate of the expectation goes down at the rate of $O(\log n / n)$.

Remark 4.6. Note that a lower bound of a similar nature has to take into account the complexity of the class $F_{0, cr}$. This might happen because one may not have an inclusion $\hat F_r \subset F_{c_1 r, c_2 r}$ unless $c_1 = 0$. Indeed, if the class $F$ is very rich for $r$ close to $0$, it is possible to have functions that have a very small expectation, but for which $E_n f \sim r$.

5. The limitations of estimating from data

Although the results in [3] show that it is possible to bound the expectation of the empirical minimizer in a far sharper way than by applying a structural result, it was not clear whether such a bound could be estimated from data. In the following we consider a scenario in which one only has access to the function class through the values that class members take on finite samples, that is, the finite dimensional coordinate projections of the class. In this case, we construct an example that shows that, in general, it is impossible to establish a data-dependent estimate of $s_n^*$ that is better than $r_n^*$. To be precise, we construct two function classes that have identical coordinate projections on every sample. For one class we have $r_n^* \sim c$, $s_n^* \sim c$ and the expectation of the empirical minimizer is of the order of $c$ with probability 1, while for the other class, $s_n^* \sim 1/n$. If one only has access to the way the classes behave on finite dimensional coordinate projections, that is, samples, the classes are indistinguishable, and it is impossible to predict a better bound than an absolute constant, which could be much worse than the true behavior of the empirical minimizer.

Recall that for a given function class $F$ and a sample $\tau = \{x_1, \dots, x_n\}$, the coordinate projection of $F$ on $\tau$ is

$$P_\tau F = \{ (f(x_1), \dots, f(x_n)) : f \in F \}.$$

Let $\mu$ be the Lebesgue measure on $(0, 1]$. For each $k \in \mathbb{N}$ we construct two function classes $F_1^k$ and $F_2^k$, both $(1, c)$-Bernstein with respect to $\mu$ for a suitable absolute constant $c$, and taking values in $V = \{-1, 0, 1\}$. In both classes, each function is constant on the intervals $((j-1)/m_k, j/m_k]$, where $m_k = k^2 + 3k$.

The class $F_1^k$ consists of all functions that take the value $-1$ on $k$ intervals, the value $1$ on $2k$ intervals and the value $0$ on $k^2$ intervals. It is easy to verify that for any $f \in F_1^k$, $Ef = k/(k^2 + 3k) \sim 1/k$ and $Ef^2 = 3k/(k^2 + 3k) \sim 1/k$, implying that indeed $F_1^k$ is a $(1, 3)$-Bernstein class.

In contrast, $F_2^k$ consists of all functions that take the value $-1$ on $k$ intervals, the value $1$ on $k^2 + k$ intervals and $0$ on $k$ intervals. Therefore, for any function $f \in F_2^k$, $Ef = k^2/(k^2 + 3k) \ge 1/4$, and since $Ef^2 \le 1$, $F_2^k$ is a $(1, 4)$-Bernstein class.

Notice that functions in $F_1^k$ have expectations of the order of $1/k$ while functions in $F_2^k$ have expectations of the order of a constant. Set

$$F_1 = \mathrm{star}\left( \bigcup_{k \in \mathbb{N}} F_1^k, 0 \right), \qquad F_2 = \mathrm{star}\left( \bigcup_{k \in \mathbb{N}} F_2^k, 0 \right).$$

It is easy to verify that for every finite set $\tau$, $P_\tau F_1 = P_\tau F_2$. Indeed, consider a set $\tau = \{x_1, \dots, x_n\}$. Without loss of generality, assume that $x_i \ne x_j$ if $i \ne j$. Let $l$ be large enough to ensure that the $x_i$'s fall in disjoint intervals $((j-1)/m_l, j/m_l]$ and that $l \ge n$, and thus $P_\tau F_2^l = P_\tau F_1^l = \{-1, 0, 1\}^n$.
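The coincidence of the coordinate projections can be checked by brute force for a tiny instance. The sketch below is only an illustration under stated assumptions: it represents each function by its values on the $m_k$ intervals, takes $k = 2$ and a sample of two points lying in distinct intervals (the interval indices in `tau` are arbitrary choices), and uses hypothetical helper names.

```python
from fractions import Fraction
from itertools import combinations


def class_functions(m, n_minus, n_plus):
    """Yield every function on m intervals with value -1 on n_minus intervals,
    +1 on n_plus intervals and 0 elsewhere, as a tuple of interval values."""
    for minus in combinations(range(m), n_minus):
        rest = [j for j in range(m) if j not in minus]
        for plus in combinations(rest, n_plus):
            f = [0] * m
            for j in minus:
                f[j] = -1
            for j in plus:
                f[j] = 1
            yield tuple(f)


def projection(functions, sample_intervals):
    """Coordinate projection onto sample points lying in the given intervals."""
    return {tuple(f[j] for j in sample_intervals) for f in functions}


k = 2
m = k * k + 3 * k                                # m_k = k^2 + 3k intervals
F1 = list(class_functions(m, k, 2 * k))          # -1 on k, +1 on 2k, 0 on k^2
F2 = list(class_functions(m, k, k * k + k))      # -1 on k, +1 on k^2 + k, 0 on k

tau = (0, 5)                                     # two points in distinct intervals
P1 = projection(F1, tau)
P2 = projection(F2, tau)

mean1 = {Fraction(sum(f), m) for f in F1}        # expectations under Lebesgue measure
mean2 = {Fraction(sum(f), m) for f in F2}
second1 = {Fraction(sum(v * v for v in f), m) for f in F1}

print(P1 == P2, len(P1))
print(mean1, mean2, second1)
```

Both projections equal the full set $\{-1, 0, 1\}^2$, while the expectations differ ($1/5$ for $F_1^2$ against $2/5$ for $F_2^2$), and the second moment $3/5$ of functions in $F_1^2$ is exactly three times their expectation.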

Therefore, $F_1$ and $F_2$ are star-shaped Bernstein classes that have identical coordinate projections, making it impossible to distinguish the two based solely on empirical data. On the other hand, the behavior of the empirical minimizer is very different in the two cases.

Theorem 5.1. For $F_1$ and $F_2$ defined as above, there is an absolute constant $c > 0$ for which the following holds. For any $x > 0$ there is some $N(x)$ such that for any $n \ge N(x)$,
(1) For $F_1$, with probability at least $1 - e^{-x}$, $E\hat f \le c/n \sim s_n^*(F_1)$.
(2) For $F_2$, with probability 1, $E\hat f \ge 1/4 \sim r_n^*(F_2)$.

Theorem 5.1 implies that the estimates for the convergence rate of the empirical minimization algorithm based on $s_n^*$ are significantly better for the class $F_1$ than for $F_2$. However, the classes have identical coordinate projections on any sample, and hence are indistinguishable empirically. Thus, one cannot get an empirical estimate of the convergence rate for $F_1$ that is significantly better than one based on an empirical estimate of $r_n^*$.

Proof. We will show that the expectation of the empirical minimizer in $F_1$ is likely to be smaller than $c/n$, as opposed to $F_2$ where it is likely to be of the order of a constant.

For any $n$, $\inf_{f \in F_1} E_n f = -1$, and therefore $\xi_{n,F_1,\mu}(s_n^*) - s_n^* = 1$, where, for any $k$ and any $f \in F_1^k$,

$$s_k^* = Ef = \frac{k}{k^2 + 3k} \sim \frac{1}{k}.$$

Clearly, for a class of functions bounded by 1, $\xi_{n,F,\mu}(r) - r \le 1$, and thus the maximal value of $\xi_{n,F_1,\mu}(r) - r$ is attained at $s_n^* \sim 1/n$.

The main part of the proof is to show that there is some absolute constant $c > 1$ such that for large enough values of $n$ and for $r \ge c/n$, $\xi_{n,F_1,\mu}(r) - r \le 1/2$. This is the case because the sets $F_1^k$ are not rich enough when projected onto samples of size $n$ as long as $k \le n/c$. Indeed, the function class $F_1$ has low complexity in terms of the combinatorial dimension $\mathrm{vc}(F_1, \varepsilon)$ (see Definition A.4). In particular, the definitions imply that $\mathrm{vc}(F_1^k, \varepsilon) \le 2k$ for all $0 < \varepsilon \le 2$ and all $k$. Since the class of functions is bounded by 1, Theorem A.5 implies there is an absolute constant $c_2$ such that $E R_n(F_1^k) \le c_2 \sqrt{k/n}$. Applying the one-sided version of Talagrand's concentration inequality for the empirical process $Z = \sup_{f \in F_1^k} (Ef - E_n f)$, it follows that for $t = 1/4$, with probability at least $1 - e^{-c_1 t^2 n} = 1 - e^{-c_1' n}$,

$$\sup_{f \in F_1^k} (Ef - E_n f) \le 2 E R_n(F_1^k) + t \le 2 c_2 \sqrt{\frac{k}{n}} + t \le \frac{1}{2},$$

provided that $k \le n/c$ for some universal constant $c$. Let

$$A_k = \bigcup_{k' \le k} \frac{s_k^*}{s_{k'}^*} F_1^{k'},$$

that is, $A_k$ contains the functions in $F_1$ that have expectation $s_k^*$; those either come from $F_1^k$ or are scaled-down versions of functions from $F_1^{k'}$ for $k' < k$. Therefore, with probability at least $1 - e^{-c_1' n}$, for any $k \le n/c$,

$$\sup_{f \in A_k} (Ef - E_n f) \le \frac{1}{2}.$$

Taking the expectation,

$$\xi_{n,F_1,\mu}(s_k^*) \le \frac{1}{2} \left( 1 - e^{-c_1' n} \right) + (1 + s_k^*) e^{-c_1' n} = \frac{1}{2} + \left( \frac{1}{2} + s_k^* \right) e^{-c_1' n},$$

and thus, for all $\varepsilon < 1/2$, $n \ge N(\varepsilon)$ and $k \le n/c$,

$$\xi_{n,F_1,\mu}(s_k^*) - s_k^* \le 1 - \varepsilon - s_k^* = \xi_{n,F_1,\mu}(s_n^*) - s_n^* - \varepsilon - s_k^*.$$

This implies that $\xi_{n,F_1,\mu}(r) - r \le \xi_{n,F_1,\mu}(s_n^*) - s_n^* - \varepsilon$ for every $r \ge c'/n$, from which we conclude that $r_{n,\varepsilon,+} \le c'/n$.

On the other hand, it is easy to verify that for empirical minimization over $F_2$, $E\hat f \ge 1/4$. Indeed, as we saw for $F_1$, $\inf_{f \in F_2} E_n f = -1$, which implies $E_n \hat f = -1$. Since we can write

$$F_2 = \{ \alpha f : f \in F_2^k,\ k \in \mathbb{N},\ \alpha \in [0, 1] \},$$

and empirical minimization is a linear operation, it is clear that the empirical minimum will be attained at $\alpha = 1$ (using a similar argument to the one used in Lemma 2.4). Since all the functions in $\bigcup_{k \in \mathbb{N}} F_2^k$ have expectation at least $1/4$, with probability 1, $E\hat f \ge 1/4$ in this case.

Remark 5.2. Note that if one is given the function $\hat f$ that the algorithm produced, rather than just the coordinate projections, it becomes possible to distinguish whether the class at hand is $F_1$ or $F_2$. However, we can define an uncountable collection of function classes

$$\mathcal{F} = \left\{ \mathrm{star}\left( \bigcup_{k \in \mathbb{N}} F^k_{\alpha_k}, 0 \right) : \alpha_k \in \{1, 2\} \text{ for } k \in \mathbb{N} \right\},$$

where if $\alpha_k = 1$ then $F^k_{\alpha_k} = F_1^k$ and if $\alpha_k = 2$ then $F^k_{\alpha_k} = F_2^k$. Clearly, for every $H, G \in \mathcal{F}$ and every finite $\sigma \subset \Omega$, $P_\sigma(G) = P_\sigma(H)$. If the learner knows that $F \in \mathcal{F}$, then even if $\hat f$ is given, the best that can be said is that a single component of $F$, say the $j$th component of $F$, is $F_1^j$ or $F_2^j$. It is impossible to say whether other components of $F$ are of type 1 or type 2, and in particular, the convergence rate for the expectation of the empirical minimizer can be as bad as for $F_2$.

The second observation worth noting is that the class $F_1$ is not a Glivenko-Cantelli class. The classes $F_1^k$ become richer as $k$ grows, i.e., in the part of $F_1$ in which the expectation of functions is smaller. The reason one can still obtain a generalization bound even for classes that are not Glivenko-Cantelli is that the method of [3] uses the expectation of the empirical process indexed by $\{ f \in \mathrm{star}(F, 0) : Ef = r \}$, and each one of these sets is a Glivenko-Cantelli class. If one were to try to bound the error of the empirical minimizer using the localization $\{ f \in F : Ef \le r \}$ as in [1], it would be impossible.

Appendix A. Additional material

The main technical tool we require is Talagrand's celebrated concentration theorem for empirical processes [13, 24]. The version we use is due to Bousquet [7], building on Massart's argument (see also [10, 17, 22]).

Theorem A.1. Let $F$ be a class of functions defined on $\mathcal{X}$ and let $P$ be a probability measure such that for every $f \in F$, $\|f\|_\infty \le b$ and $Ef = 0$. Let $X_1, \dots, X_n$ be independent random variables distributed according to $P$ and set $\sigma^2 = n \sup_{f \in F} \mathrm{var} f$. Define

$$Z = \sup_{f \in F} \sum_{i=1}^{n} f(X_i), \qquad \bar Z = \sup_{f \in F} \left| \sum_{i=1}^{n} f(X_i) \right|.$$

For every $x > 0$ and every $\rho > 0$,

$$\Pr\left( \left\{ Z \ge (1 + \rho) EZ + \sigma \sqrt{Kx} + K(1 + \rho^{-1}) bx \right\} \right) \le e^{-x},$$

$$\Pr\left( \left\{ Z \le (1 - \rho) EZ - \sigma \sqrt{Kx} - K(1 + \rho^{-1}) bx \right\} \right) \le e^{-x},$$

and the same inequalities hold for $\bar Z$. Here, $K$ is an absolute constant.

The rest of this section is devoted to some results that allow one to estimate $E \sup_{f \in F} |Ef - E_n f|$ via the Rademacher process indexed by the class. Recall the definition of the Rademacher averages of a class from equation (18). A well-known symmetrization argument (due to Giné and Zinn) connects the expectation of $\sup_{f \in F} |Ef - E_n f|$ to the Rademacher averages of $F$ [30].

Theorem A.2. Let $F$ be a class of functions defined on $(\Omega, \mu)$ and let $X_1, \dots, X_n$ be independent random variables distributed according to $\mu$. Then,

$$E \sup_{f \in F} |Ef - E_n f| \le 2 E R_n(F).$$

The next result, which follows directly from a self-bounding property of the Rademacher process and the methods developed in [6], shows that $E_\sigma R_n(F)$ is highly concentrated around its expectation; hence, the Rademacher averages of a class can be upper bounded by their empirical version. The following formulation can be found in [1].

Theorem A.3. Let $F$ be a class of bounded functions defined on $(\Omega, \mu)$ taking values in $[a, b]$ and let $X_1, \dots, X_n$ be independent random variables distributed according to $\mu$. Then, for any $0 \le \alpha < 1$ and $x > 0$, with probability at least $1 - e^{-x}$,

$$E R_n(F) \le \frac{1}{1 - \alpha} E_\sigma R_n(F) + \frac{(b - a) x}{4 n \alpha (1 - \alpha)}.$$

Also, with probability at least $1 - e^{-x}$,

$$\frac{1}{2} E_\sigma R_n(F) - \frac{cbx}{n} \le E R_n(F),$$

where $c$ is an absolute constant.

It is possible to bound $E R_n(F)$ using the combinatorial dimension of a set. Recall that a set $\{x_1, \dots, x_n\}$ is shattered by a class of $\{0, 1\}$-valued functions $F$ if $P_\sigma F = \{ (f(x_1), \dots, f(x_n)) : f \in F \} = \{0, 1\}^n$, and that the Vapnik-Chervonenkis dimension $d$ of $F$, denoted by $\mathrm{vc}(F)$, is the maximal cardinality of a subset of $\Omega$ that is shattered by $F$. In a similar way, one can define the combinatorial dimension of a class of real-valued functions.

Definition A.4. For every $\varepsilon > 0$, a set $\sigma = \{x_1, \dots, x_n\} \subset \Omega$ is said to be $\varepsilon$-shattered by $F$ if there is some function $s : \sigma \to \mathbb{R}$, such that for every $I \subset \{1, \dots, n\}$ there is some $f_I \in F$ for which $f_I(x_i) \ge s(x_i) + \varepsilon$ if $i \in I$, and $f_I(x_i) \le s(x_i) - \varepsilon$ if $i \notin I$. Let

$$\mathrm{vc}(F, \varepsilon) = \sup \{ |\sigma| : \sigma \subset \Omega,\ \sigma \text{ is } \varepsilon\text{-shattered by } F \}.$$

The following result is a recent extension, due to Rudelson and Vershynin [23], of well-known estimates on $E R_n(F)$.


More information

Basics of Probability Theory (for Theory of Computation courses)

Basics of Probability Theory (for Theory of Computation courses) Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS MASSACHUSTTS INSTITUT OF TCHNOLOGY 6.436J/5.085J Fall 2008 Lecture 9 /7/2008 LAWS OF LARG NUMBRS II Cotets. The strog law of large umbers 2. The Cheroff boud TH STRONG LAW OF LARG NUMBRS While the weak

More information

Lecture 3: August 31

Lecture 3: August 31 36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,

More information

Sieve Estimators: Consistency and Rates of Convergence

Sieve Estimators: Consistency and Rates of Convergence EECS 598: Statistical Learig Theory, Witer 2014 Topic 6 Sieve Estimators: Cosistecy ad Rates of Covergece Lecturer: Clayto Scott Scribe: Julia Katz-Samuels, Brado Oselio, Pi-Yu Che Disclaimer: These otes

More information

Sequences. Notation. Convergence of a Sequence

Sequences. Notation. Convergence of a Sequence Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it

More information

Application to Random Graphs

Application to Random Graphs A Applicatio to Radom Graphs Brachig processes have a umber of iterestig ad importat applicatios. We shall cosider oe of the most famous of them, the Erdős-Réyi radom graph theory. 1 Defiitio A.1. Let

More information

Math Solutions to homework 6

Math Solutions to homework 6 Math 175 - Solutios to homework 6 Cédric De Groote November 16, 2017 Problem 1 (8.11 i the book): Let K be a compact Hermitia operator o a Hilbert space H ad let the kerel of K be {0}. Show that there

More information

Advanced Stochastic Processes.

Advanced Stochastic Processes. Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.

More information

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3 MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special

More information

18.657: Mathematics of Machine Learning

18.657: Mathematics of Machine Learning 8.657: Mathematics of Machie Learig Lecturer: Philippe Rigollet Lecture 0 Scribe: Ade Forrow Oct. 3, 05 Recall the followig defiitios from last time: Defiitio: A fuctio K : X X R is called a positive symmetric

More information

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A.

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A. Radom Walks o Discrete ad Cotiuous Circles by Jeffrey S. Rosethal School of Mathematics, Uiversity of Miesota, Mieapolis, MN, U.S.A. 55455 (Appeared i Joural of Applied Probability 30 (1993), 780 789.)

More information

Statistics 511 Additional Materials

Statistics 511 Additional Materials Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability

More information

If a subset E of R contains no open interval, is it of zero measure? For instance, is the set of irrationals in [0, 1] is of measure zero?

If a subset E of R contains no open interval, is it of zero measure? For instance, is the set of irrationals in [0, 1] is of measure zero? 2 Lebesgue Measure I Chapter 1 we defied the cocept of a set of measure zero, ad we have observed that every coutable set is of measure zero. Here are some atural questios: If a subset E of R cotais a

More information

IP Reference guide for integer programming formulations.

IP Reference guide for integer programming formulations. IP Referece guide for iteger programmig formulatios. by James B. Orli for 15.053 ad 15.058 This documet is iteded as a compact (or relatively compact) guide to the formulatio of iteger programs. For more

More information

5.1 A mutual information bound based on metric entropy

5.1 A mutual information bound based on metric entropy Chapter 5 Global Fao Method I this chapter, we exted the techiques of Chapter 2.4 o Fao s method the local Fao method) to a more global costructio. I particular, we show that, rather tha costructig a local

More information

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters

More information

Measure and Measurable Functions

Measure and Measurable Functions 3 Measure ad Measurable Fuctios 3.1 Measure o a Arbitrary σ-algebra Recall from Chapter 2 that the set M of all Lebesgue measurable sets has the followig properties: R M, E M implies E c M, E M for N implies

More information

Fall 2013 MTH431/531 Real analysis Section Notes

Fall 2013 MTH431/531 Real analysis Section Notes Fall 013 MTH431/531 Real aalysis Sectio 8.1-8. Notes Yi Su 013.11.1 1. Defiitio of uiform covergece. We look at a sequece of fuctios f (x) ad study the coverget property. Notice we have two parameters

More information

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1 EECS564 Estimatio, Filterig, ad Detectio Hwk 2 Sols. Witer 25 4. Let Z be a sigle observatio havig desity fuctio where. p (z) = (2z + ), z (a) Assumig that is a oradom parameter, fid ad plot the maximum

More information

It is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function.

It is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function. MATH 532 Measurable Fuctios Dr. Neal, WKU Throughout, let ( X, F, µ) be a measure space ad let (!, F, P ) deote the special case of a probability space. We shall ow begi to study real-valued fuctios defied

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

6.867 Machine learning

6.867 Machine learning 6.867 Machie learig Mid-term exam October, ( poits) Your ame ad MIT ID: Problem We are iterested here i a particular -dimesioal liear regressio problem. The dataset correspodig to this problem has examples

More information

Learnability with Rademacher Complexities

Learnability with Rademacher Complexities Learability with Rademacher Complexities Daiel Khashabi Fall 203 Last Update: September 26, 206 Itroductio Our goal i study of passive ervised learig is to fid a hypothesis h based o a set of examples

More information

Lecture 11: Decision Trees

Lecture 11: Decision Trees ECE9 Sprig 7 Statistical Learig Theory Istructor: R. Nowak Lecture : Decisio Trees Miimum Complexity Pealized Fuctio Recall the basic results of the last lectures: let X ad Y deote the iput ad output spaces

More information

McGill University Math 354: Honors Analysis 3 Fall 2012 Solutions to selected problems

McGill University Math 354: Honors Analysis 3 Fall 2012 Solutions to selected problems McGill Uiversity Math 354: Hoors Aalysis 3 Fall 212 Assigmet 3 Solutios to selected problems Problem 1. Lipschitz fuctios. Let Lip K be the set of all fuctios cotiuous fuctios o [, 1] satisfyig a Lipschitz

More information

Math 216A Notes, Week 5

Math 216A Notes, Week 5 Math 6A Notes, Week 5 Scribe: Ayastassia Sebolt Disclaimer: These otes are ot early as polished (ad quite possibly ot early as correct) as a published paper. Please use them at your ow risk.. Thresholds

More information

Discrete Mathematics for CS Spring 2008 David Wagner Note 22

Discrete Mathematics for CS Spring 2008 David Wagner Note 22 CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig

More information

The Borel hierarchy classifies subsets of the reals by their topological complexity. Another approach is to classify them by size.

The Borel hierarchy classifies subsets of the reals by their topological complexity. Another approach is to classify them by size. Lecture 7: Measure ad Category The Borel hierarchy classifies subsets of the reals by their topological complexity. Aother approach is to classify them by size. Filters ad Ideals The most commo measure

More information

1 of 7 7/16/2009 6:06 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 6. Order Statistics Defiitios Suppose agai that we have a basic radom experimet, ad that X is a real-valued radom variable

More information

Seunghee Ye Ma 8: Week 5 Oct 28

Seunghee Ye Ma 8: Week 5 Oct 28 Week 5 Summary I Sectio, we go over the Mea Value Theorem ad its applicatios. I Sectio 2, we will recap what we have covered so far this term. Topics Page Mea Value Theorem. Applicatios of the Mea Value

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

Solution. 1 Solutions of Homework 1. Sangchul Lee. October 27, Problem 1.1

Solution. 1 Solutions of Homework 1. Sangchul Lee. October 27, Problem 1.1 Solutio Sagchul Lee October 7, 017 1 Solutios of Homework 1 Problem 1.1 Let Ω,F,P) be a probability space. Show that if {A : N} F such that A := lim A exists, the PA) = lim PA ). Proof. Usig the cotiuity

More information

Mi-Hwa Ko and Tae-Sung Kim

Mi-Hwa Ko and Tae-Sung Kim J. Korea Math. Soc. 42 2005), No. 5, pp. 949 957 ALMOST SURE CONVERGENCE FOR WEIGHTED SUMS OF NEGATIVELY ORTHANT DEPENDENT RANDOM VARIABLES Mi-Hwa Ko ad Tae-Sug Kim Abstract. For weighted sum of a sequece

More information

HOMEWORK 2 SOLUTIONS

HOMEWORK 2 SOLUTIONS HOMEWORK SOLUTIONS CSE 55 RANDOMIZED AND APPROXIMATION ALGORITHMS 1. Questio 1. a) The larger the value of k is, the smaller the expected umber of days util we get all the coupos we eed. I fact if = k

More information

A statistical method to determine sample size to estimate characteristic value of soil parameters

A statistical method to determine sample size to estimate characteristic value of soil parameters A statistical method to determie sample size to estimate characteristic value of soil parameters Y. Hojo, B. Setiawa 2 ad M. Suzuki 3 Abstract Sample size is a importat factor to be cosidered i determiig

More information

Math 220A Fall 2007 Homework #2. Will Garner A

Math 220A Fall 2007 Homework #2. Will Garner A Math 0A Fall 007 Homewor # Will Garer Pg 3 #: Show that {cis : a o-egative iteger} is dese i T = {z œ : z = }. For which values of q is {cis(q): a o-egative iteger} dese i T? To show that {cis : a o-egative

More information

b i u x i U a i j u x i u x j

b i u x i U a i j u x i u x j M ath 5 2 7 Fall 2 0 0 9 L ecture 1 9 N ov. 1 6, 2 0 0 9 ) S ecod- Order Elliptic Equatios: Weak S olutios 1. Defiitios. I this ad the followig two lectures we will study the boudary value problem Here

More information

Output Analysis and Run-Length Control

Output Analysis and Run-Length Control IEOR E4703: Mote Carlo Simulatio Columbia Uiversity c 2017 by Marti Haugh Output Aalysis ad Ru-Legth Cotrol I these otes we describe how the Cetral Limit Theorem ca be used to costruct approximate (1 α%

More information

Estimation of the essential supremum of a regression function

Estimation of the essential supremum of a regression function Estimatio of the essetial supremum of a regressio fuctio Michael ohler, Adam rzyżak 2, ad Harro Walk 3 Fachbereich Mathematik, Techische Uiversität Darmstadt, Schlossgartestr. 7, 64289 Darmstadt, Germay,

More information

Rates of Convergence by Moduli of Continuity

Rates of Convergence by Moduli of Continuity Rates of Covergece by Moduli of Cotiuity Joh Duchi: Notes for Statistics 300b March, 017 1 Itroductio I this ote, we give a presetatio showig the importace, ad relatioship betwee, the modulis of cotiuity

More information

The Boolean Ring of Intervals

The Boolean Ring of Intervals MATH 532 Lebesgue Measure Dr. Neal, WKU We ow shall apply the results obtaied about outer measure to the legth measure o the real lie. Throughout, our space X will be the set of real umbers R. Whe ecessary,

More information

32 estimating the cumulative distribution function

32 estimating the cumulative distribution function 32 estimatig the cumulative distributio fuctio 4.6 types of cofidece itervals/bads Let F be a class of distributio fuctios F ad let θ be some quatity of iterest, such as the mea of F or the whole fuctio

More information

Law of the sum of Bernoulli random variables

Law of the sum of Bernoulli random variables Law of the sum of Beroulli radom variables Nicolas Chevallier Uiversité de Haute Alsace, 4, rue des frères Lumière 68093 Mulhouse icolas.chevallier@uha.fr December 006 Abstract Let be the set of all possible

More information

FIXED POINTS OF n-valued MULTIMAPS OF THE CIRCLE

FIXED POINTS OF n-valued MULTIMAPS OF THE CIRCLE FIXED POINTS OF -VALUED MULTIMAPS OF THE CIRCLE Robert F. Brow Departmet of Mathematics Uiversity of Califoria Los Ageles, CA 90095-1555 e-mail: rfb@math.ucla.edu November 15, 2005 Abstract A multifuctio

More information

Maximum Likelihood Estimation and Complexity Regularization

Maximum Likelihood Estimation and Complexity Regularization ECE90 Sprig 004 Statistical Regularizatio ad Learig Theory Lecture: 4 Maximum Likelihood Estimatio ad Complexity Regularizatio Lecturer: Rob Nowak Scribe: Pam Limpiti Review : Maximum Likelihood Estimatio

More information

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ.

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ. 2 5. Weighted umber of late jobs 5.1. Release dates ad due dates: maximimizig the weight of o-time jobs Oce we add release dates, miimizig the umber of late jobs becomes a sigificatly harder problem. For

More information

Lecture 9: Boosting. Akshay Krishnamurthy October 3, 2017

Lecture 9: Boosting. Akshay Krishnamurthy October 3, 2017 Lecture 9: Boostig Akshay Krishamurthy akshay@csumassedu October 3, 07 Recap Last week we discussed some algorithmic aspects of machie learig We saw oe very powerful family of learig algorithms, amely

More information

6.883: Online Methods in Machine Learning Alexander Rakhlin

6.883: Online Methods in Machine Learning Alexander Rakhlin 6.883: Olie Methods i Machie Learig Alexader Rakhli LECTURES 5 AND 6. THE EXPERTS SETTING. EXPONENTIAL WEIGHTS All the algorithms preseted so far halluciate the future values as radom draws ad the perform

More information

Notes 19 : Martingale CLT

Notes 19 : Martingale CLT Notes 9 : Martigale CLT Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces: [Bil95, Chapter 35], [Roc, Chapter 3]. Sice we have ot ecoutered weak covergece i some time, we first recall

More information

REAL ANALYSIS II: PROBLEM SET 1 - SOLUTIONS

REAL ANALYSIS II: PROBLEM SET 1 - SOLUTIONS REAL ANALYSIS II: PROBLEM SET 1 - SOLUTIONS 18th Feb, 016 Defiitio (Lipschitz fuctio). A fuctio f : R R is said to be Lipschitz if there exists a positive real umber c such that for ay x, y i the domai

More information

Lecture 15: Learning Theory: Concentration Inequalities

Lecture 15: Learning Theory: Concentration Inequalities STAT 425: Itroductio to Noparametric Statistics Witer 208 Lecture 5: Learig Theory: Cocetratio Iequalities Istructor: Ye-Chi Che 5. Itroductio Recall that i the lecture o classificatio, we have see that

More information

Notes on Snell Envelops and Examples

Notes on Snell Envelops and Examples Notes o Sell Evelops ad Examples Example (Secretary Problem): Coside a pool of N cadidates whose qualificatios are represeted by ukow umbers {a > a 2 > > a N } from best to last. They are iterviewed sequetially

More information