Peter L. Bartlett 1, Shahar Mendelson 2, 3 and Petra Philips Introduction
ESAIM: Probability and Statistics
URL: Will be set by the publisher

ON THE OPTIMALITY OF SAMPLE-BASED ESTIMATES OF THE EXPECTATION OF THE EMPIRICAL MINIMIZER

Peter L. Bartlett 1, Shahar Mendelson 2, 3 and Petra Philips 4

Abstract. We study sample-based estimates of the expectation of the function produced by the empirical minimization algorithm. We investigate the extent to which one can estimate the rate of convergence of the empirical minimizer in a data-dependent manner. We establish three main results. First, we provide an algorithm that upper bounds the expectation of the empirical minimizer in a completely data-dependent manner. This bound is based on a structural result due to Bartlett and Mendelson, which relates expectations to sample averages. Second, we show that these structural upper bounds can be loose, compared to previous bounds. In particular, we demonstrate a class for which the expectation of the empirical minimizer decreases as $O(1/n)$ for sample size $n$, although the upper bound based on structural properties is $\Omega(1)$. Third, we show that this looseness of the bound is inevitable: we present an example that shows that a sharp bound cannot be universally recovered from empirical data.

Mathematics Subject Classification. 62G08, 68Q32.
The dates will be set by the publisher.

Keywords and phrases: error bounds; empirical minimization; data-dependent complexity
This work was supported in part by National Science Foundation grant
This work was supported in part by the Australian Research Council Discovery Grant DP
1 Computer Science Division and Department of Statistics, 367 Evans Hall #3860, University of California, Berkeley, CA, USA
2 Centre for Mathematics and its Applications (CMA), The Australian National University, Canberra, 0200 Australia
3 Department of Mathematics, Technion I.I.T., Haifa, 32000, Israel
4 Friedrich Miescher Laboratory of the Max Planck Society, Tübingen, 72076, Germany
© EDP Sciences, SMAI 1999

1. Introduction

The empirical minimization algorithm is a statistical procedure that chooses a function that minimizes an empirical loss functional on a given class of functions. Known as an M-estimator in the statistical literature, it has been studied extensively [11,29,31]. Here, we investigate the limitations of estimates of the expectation of the function produced by the empirical minimization algorithm.

To be more exact, let $F$ be a class of real-valued functions defined on a probability space $(\Omega,\mu)$ and set $X_1,\ldots,X_n$ to be independent random variables distributed according to $\mu$. For $f \in F$ define $E_n f = \frac{1}{n}\sum_{i=1}^n f(X_i)$ and let $Ef$ be the expectation of $f$ with respect to $\mu$. The goal is to find a function that minimizes $Ef$ over $F$, where the only information available about the unknown distribution $\mu$ is through the finite sample $X_1,\ldots,X_n$. The empirical minimization algorithm produces the function $\hat f \in F$ that has the smallest empirical mean, that is, $\hat f$ satisfies

$$E_n \hat f = \min\{E_n f : f \in F\}.$$

Throughout this article, we assume that such a minimum exists (the modifications required if this is not the case are obvious), that $F$ satisfies some minor measurability conditions, which we omit (see [8] for more details), and that for every $f \in F$, $Ef \ge 0$, which, as we explain later, is a natural assumption in the cases that interest us.

In statistical learning theory, this problem arises when one minimizes the empirical risk, or sample average of a loss incurred on a finite training sample. There, the aim is to ensure that the risk, or expected loss, is small. Thus, $f(X_i)$ represents the loss incurred on $X_i$. Performance guarantees are typically obtained through high probability bounds on the conditional expectation

$$E\hat f = E\big(\hat f(X) \mid X_1,\ldots,X_n\big). \tag{1}$$

In particular, one is interested in obtaining fast and accurate estimates of the rates of convergence of this expectation to 0 as a function of the sample size $n$.

Classical estimates of this expectation rely on the uniform convergence over $F$ of sample averages to expectations (see, for example, [31]). These estimates are essentially based on the analysis of the supremum of the empirical process $\sup_{f \in F}(Ef - E_n f)$ indexed by the whole class $F$. As opposed to these global estimates, it is possible to study local subsets of functions of $F$, for example, balls of a given radius with respect to a chosen metric. The supremum of the empirical process indexed by these local subsets, as a function of the radius of the balls, is called the modulus of continuity. Sharper localized estimates for the rate of convergence of the expectation can be obtained in terms of the fixed point of the modulus of continuity of the class [1,12,16,18,28]. Recent results [3] show that one can further significantly improve the high-probability estimates for the convergence rates for empirical minimizers.
These results are based on a new localized notion of complexity of subsets of $F$ containing functions with identical expectations and are therefore dependent on the underlying unknown distribution. In this article, we investigate the extent to which one can estimate these high-probability convergence rates in a data-dependent manner, an important aspect if one wants to make these estimates practically useful.

The results in [3] establish upper and lower bounds for the expectation $E\hat f$ using two different arguments. The first is a structural result relating the empirical (random) structure endowed on the class by the selection of the coordinates $(X_1,\ldots,X_n)$, and the real structure, given by the measure $\mu$. The second is a direct analysis, which yields seemingly sharper bounds. In both cases (and under some mild structural assumptions on the class $F$), the bounds are given using a function that measures the localized complexity of subsets of $F$ consisting of functions with a fixed expectation $r$, denoted here by $F_r = \{f \in F : Ef = r\}$. For every integer $n$ and probability measure $\mu$ on $\Omega$, consider the following two sequences of functions, which measure the complexity of the sets $F_r$:

$$\xi_{n,F,\mu}(r) = E\sup\{|Ef - E_n f| : f \in F_r\}, \tag{2}$$

$$\xi'_{n,F,\mu}(r) = E\sup\{Ef - E_n f : f \in F_r\}. \tag{3}$$

In the following, in cases where the underlying probability measure $\mu$ and the class $F$ are clear, we will refer to these functions as $\xi_n$ and $\xi'_n$.

It turns out that these two functions control the generalization ability in $F_r$ whenever one has a strong degree of concentration for the empirical process suprema $\sup_{f \in F_r}|Ef - E_n f|$ and $\sup_{f \in F_r}(Ef - E_n f)$ around their expectation. Thus, $\xi_n$ and $\xi'_n$ can be used to derive bounds on the performance of the empirical minimization algorithm as long as these suprema are sufficiently concentrated. Therefore, the main tool required in the proofs of the results in [3] that provide bounds using $\xi_n$ and $\xi'_n$ is Talagrand's concentration inequality for empirical processes (see Theorem A.1 in the appendix).
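The complexity functions in (2) and (3) are expectations of suprema, so for a small finite class they can be approximated by simulation. The following is a minimal sketch under stated assumptions, not part of the original analysis: the domain is a finite set with the uniform measure, functions are stored as lists of values, and the name `xi_estimates` and the toy class `F_r` are hypothetical.

```python
import random

def empirical_mean(f, sample):
    """E_n f: average of f over the sampled points."""
    return sum(f[x] for x in sample) / len(sample)

def true_mean(f):
    """Ef under the uniform measure on {0, ..., m-1} (an assumption of this sketch)."""
    return sum(f) / len(f)

def xi_estimates(F_r, n, trials=2000, seed=0):
    """Monte Carlo estimates of E sup |Ef - E_n f| and E sup (Ef - E_n f) over F_r."""
    rng = random.Random(seed)
    m = len(F_r[0])
    means = [true_mean(f) for f in F_r]
    total_abs = total_one_sided = 0.0
    for _ in range(trials):
        sample = [rng.randrange(m) for _ in range(n)]
        devs = [mu - empirical_mean(f, sample) for f, mu in zip(F_r, means)]
        total_abs += max(abs(d) for d in devs)
        total_one_sided += max(devs)
    return total_abs / trials, total_one_sided / trials

# Toy level set: indicators of two adjacent points out of ten, so every Ef equals 0.2.
F_r = [[1.0 if x in (i, (i + 1) % 10) else 0.0 for x in range(10)] for i in range(10)]
xi, xi_prime = xi_estimates(F_r, n=20, trials=500)
```

Since $|d| \ge d$ for every deviation, the two-sided estimate can never fall below the one-sided one, which mirrors the relation between the two complexity functions.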
To see how $\xi_n$ and $\xi'_n$ can be used to derive generalization bounds, observe that it suffices to find the critical point $r_0$ for which, with high probability, for a given $0 < \lambda < 1$, every $r \ge r_0$ and every $f \in F_r$,

$$(1-\lambda)Ef \le E_n f \le (1+\lambda)Ef.$$

If this equivalence holds for a sample $(X_1,\ldots,X_n)$ for such an $r_0$, then every $f \in F$ satisfies

$$Ef \le \max\left\{\frac{E_n f}{1-\lambda},\, r_0\right\}, \tag{4}$$

and thus an upper bound on the expectation of the empirical minimizer $\hat f$ can be established. It is possible to show that one can take $r_0$ as $r_n^*$, where

$$r_n^* = \inf\{r : \xi_{n,G}(r) \le r/4\}, \tag{5}$$

where $G = \{\theta f : 0 \le \theta \le 1,\ f \in F\}$. In fact, since in (4) only a one-sided condition is required, one can actually use

$$r_n'^{*} = \inf\{r : \xi'_{n,G}(r) \le r/4\}. \tag{6}$$

For the rest of this section we will assume that $F$ is star-shaped around 0 (that is, $G = F$), and we will explain the significance of this property later.

A more careful analysis, which uses the strength of Talagrand's concentration inequality for empirical processes, shows that the expectation of the empirical minimizer is governed by approximations of

$$s_n = \sup\Big\{r : \xi'_n(r) - r = \max_s \{\xi'_n(s) - s\}\Big\}. \tag{7}$$

To see why $s_n$ is a likely candidate, note that for any empirical minimizer, the function of $r$ defined as $\sup_{f \in F_r}(Ef - E_n f) - r = -\inf_{f \in F_r} E_n f$ is maximized for the value $r = E\hat f$. Assume that one has a very strong concentration of empirical processes indexed by $F_r$ around their mean for every $r > 0$, that is, with high probability, for every $r > 0$,

$$\sup_{f \in F_r}(Ef - E_n f) \approx E\sup_{f \in F_r}(Ef - E_n f) = \xi'_n(r).$$

Then, it would make sense to expect that, with high probability, $E\hat f \approx s_n$ for $s_n = \operatorname{argmax}\{\xi'_n(r) - r\}$. More precisely, and to overcome the fact that $E\sup_{f \in F_r}(Ef - E_n f)$ is only very close to $\sup_{f \in F_r}(Ef - E_n f)$, define for $\varepsilon > 0$,

$$r_{n,\varepsilon,+} = \sup\Big\{r : \xi'_{n,F,\mu}(r) - r \ge \sup_s\big(\xi'_{n,F,\mu}(s) - s\big) - \varepsilon\Big\}, \tag{8}$$

$$r_{n,\varepsilon,-} = \inf\Big\{r : \xi'_{n,F,\mu}(r) - r \ge \sup_s\big(\xi'_{n,F,\mu}(s) - s\big) - \varepsilon\Big\}. \tag{9}$$

Note that $r_{n,\varepsilon,+}$ and $r_{n,\varepsilon,-}$ are respectively upper and lower approximations of $s_n$ that become better as $\varepsilon \to 0$. They are close to $s_n$ if the function $\xi'_n(r) - r$ is peaked around its maximum. Under mild structural assumptions on $F$, $E\hat f$ can be upper bounded by either $r_n^*$ or $r_{n,\varepsilon,+}$, and lower bounded by $r_{n,\varepsilon,-}$, for a choice of $\varepsilon = O(\sqrt{\log n / n})$ (see the exact statement in Theorem 2.6 below).
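To make the parameters in (6) to (9) concrete, they can be approximated on a grid once a complexity function is given. The sketch below is illustrative only: the toy function $\xi'(r) = a\sqrt r$ is an assumption chosen because $\xi'(r)/r$ is non-increasing, the grid is restricted to $r > 0$ (also an assumption, to exclude the trivial point $r = 0$), and the name `localization_parameters` is hypothetical.

```python
import math

def localization_parameters(xi, eps, r_max=1.0, steps=100_000):
    """Grid approximations of the fixed point of 4*xi, s_n, r_{n,eps,+} and r_{n,eps,-}."""
    grid = [r_max * i / steps for i in range(1, steps + 1)]  # r > 0 only
    gaps = [xi(r) - r for r in grid]
    best = max(gaps)
    # Fixed point: smallest grid point with xi(r) <= r/4, as in (6).
    fixed = min(r for r in grid if xi(r) <= r / 4)
    # s_n: largest grid point at which xi(r) - r attains its maximum, as in (7).
    s_n = max(r for r, g in zip(grid, gaps) if g == best)
    # Points where the maximum is attained up to eps, as in (8) and (9).
    near = [r for r, g in zip(grid, gaps) if g >= best - eps]
    return fixed, s_n, max(near), min(near)

a = 0.1  # toy scale parameter (assumption)
r_star, s_n, r_plus, r_minus = localization_parameters(
    lambda r: a * math.sqrt(r), eps=1e-4
)
```

For this toy function the closed forms are $s_n = a^2/4$, fixed point $16a^2$, and $r_{n,\varepsilon,\pm} = (a/2 \pm \sqrt\varepsilon)^2$, so the grid output can be compared against them; the fixed point is 64 times larger than $s_n$ here, which is the kind of gap the text goes on to discuss.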
Thus, these two parameters, the fixed point of $4\xi_n$ (denoted by $r_n^*$) and the points at which the maximum of $\xi'_n(r) - r$ is almost attained, are our main focus.

The first result we present here is that there is a true gap between $r_n^*$ and $s_n$, which implies that there is a true difference between the bound that could be obtained using the structural approach (i.e. $r_n^*$) and the true expectation of the empirical minimizer. We construct a class of functions satisfying the required structural assumptions and show that for any $n$, $r_n'^{*}$ is of the order of a constant (and thus $r_n^*$ is of the order of a constant), but the subsets $F_r$ are very rich when $r$ is close to 0 and $s_n$ and $r_{n,\varepsilon,+}$ are of the order of $1/n$. Let us mention that there is a construction related to this one in [3]: for every $n$ there is a function class $F_n$ for which this
phenomenon occurs. The construction we present here is stronger, since it shows that, for some function class and probability distribution, the true convergence rate for a fixed class is far from the structural bound. The idea behind the construction is based on the one presented in [3], namely that one has complete freedom to choose the expectation of a function, while forcing it to have certain values on a given sample. For the class we construct and any large sample size $n$, estimates for the convergence rates of the empirical minimizers based on $r_n^*$ are asymptotically not optimal (as they are $\Theta(1)$ whereas the true convergence rate is $O(1/n)$), and thus the structural bound does not capture the true behavior of the empirical minimizer.

The second question we tackle concerns the estimation of the expectation of the empirical minimizer from data. To that end, in Section 4, we present an efficient algorithm that enables one to estimate $r_n^*$ in a completely data-dependent manner. Then, in Section 5, we show that this type of data-dependent estimate is the best one can hope to have if one only has access to the function values on finite samples. We show that in such a case it is impossible to establish a data-dependent upper bound on the expectation of the empirical minimizer that is asymptotically better than $r_n^*$. The general idea is to construct two classes of functions that look identical when restricted to any sample of finite size, but for one class both a typical expectation of the empirical minimizer and $r_n^*$ are of the order of an absolute constant, while for the other a typical expectation is of the order of $1/n$.

2. Definitions and Preliminary Results

Loss Classes

One of the main applications of our investigations is the analysis of prediction problems, like classification or regression, arising in machine learning. Suppose that one is presented with a sequence of observation-outcome pairs $(x,y) \in \mathcal X \times \mathcal Y$, and the aim is to select a function $g : \mathcal X \to \mathcal Y$ that makes an accurate prediction of the outcome for each observation.
We assume that $(X,Y),(X_1,Y_1),\ldots,(X_n,Y_n)$ are chosen independently from a probability distribution $P$ on $\mathcal X \times \mathcal Y$, but $P$ is unknown. The quality of the prediction is measured using a bounded loss function, $\ell : \mathcal Y \times \mathcal Y \to [0,b]$, where $\ell(\hat y, y)$ represents the cost incurred for prediction $\hat y$ when the true outcome is $y$. The risk of a function $g : \mathcal X \to \mathcal Y$ is defined as $E\ell(g(X),Y)$, and the aim is to use the sequence $(X_1,Y_1),\ldots,(X_n,Y_n)$ to choose a function $g$ with minimal risk. Setting $f(x,y) = \ell(g(x),y)$, this task corresponds to minimizing $Ef$. In empirical risk minimization, one chooses $g$ from a set $G$ that minimizes the sample average of $\ell(g(x),y)$, which corresponds to choosing $f \in F$ that minimizes $E_n f$, where $F$ is the loss class,

$$F = \{(x,y) \mapsto \ell(g(x),y) : g \in G\}.$$

It is sometimes convenient to consider excess loss functions,

$$f(x,y) = \ell(g(x),y) - \ell(g^*(x),y),$$

where $g^* \in G$ satisfies $E\ell(g^*(X),Y) = \inf_{g \in G} E\ell(g(X),Y)$. Since $g^*$ is fixed, choosing $g \in G$ that minimizes the risk (respectively, empirical risk) again corresponds to choosing $f \in F$ that minimizes $Ef$ (respectively, $E_n f$), where

$$F = \{(x,y) \mapsto \ell(g(x),y) - \ell(g^*(x),y) : g \in G\}.$$

Thus, for this choice of $F$, $Ef \ge 0$ for all $f \in F$, but functions in $F$ can have negative values.

Assumptions on F

Throughout this article, we assume that $F$ is a class of functions defined on a probability space $(\Omega,\mu)$ satisfying the following conditions:
(1) Each function in $F$ maps to the bounded interval $[-b,b]$.
(2) Each function in $F$ has nonnegative expectation.
(3) $F$ contains 0.
(4) $F$ has Bernstein type $\beta > 0$.
We shall see shortly why these conditions are natural for many practical nonparametric and machine learning methods. The Bernstein condition, defined precisely below, is that the second moment of every function is bounded by a power of its expectation, uniformly over the class.

Definition 2.1. We say that $F$ is a $(\beta,B)$-Bernstein class with respect to the probability measure $P$ (where $0 < \beta \le 2$ and $B \ge 1$), if every $f \in F$ satisfies

$$Ef^2 \le B(Ef)^\beta.$$

We say that $F$ has Bernstein type $\beta$ with respect to $P$ if there is some constant $B$ for which $F$ is a $(\beta,B)$-Bernstein class.

These conditions are satisfied by a large variety of loss classes arising in statistical settings. One simple example is the loss class, $F = \{(x,y) \mapsto \ell(g(x),y) : g \in G\}$, in the case where some function $g^* \in G$ has zero loss, that is, $E\ell(g^*(X),Y) = 0$. Clearly, $F$ contains 0, functions in $F$ are bounded and have nonnegative expectations, and trivially $F$ has Bernstein type 1: $Ef^2 \le bEf$. However, in practical problems, the assumption that there is some function $g^* \in G$ that has zero loss is often unreasonable. More realistic examples are excess loss classes,

$$F = \{(x,y) \mapsto \ell(g(x),y) - \ell(g^*(x),y) : g \in G\},$$

where $g^*$ in $G$ achieves the minimal risk over $G$. Clearly, functions in $F$ are bounded and have nonnegative expectation, and $F$ contains zero. As the following examples show, the boundedness and Bernstein conditions also frequently arise naturally.

Low noise classification: In two-class pattern classification, we have $\mathcal Y = \{\pm 1\}$, and $\ell(\hat y, y)$ is the 0-1 loss, that is, the indicator of $\hat y \ne y$. Clearly, the boundedness condition holds. A key factor in the difficulty of a pattern classification problem is the behavior of the conditional probability $\eta(x) = \Pr(Y = 1 \mid X = x)$, and in particular how likely it is to be near the critical value of 1/2. Starting with Tsybakov [27], many authors have considered [2, 4, 5, 11, 19, 26] pattern classification when there is a constant $\epsilon$ such that the conditional probability satisfies

$$\Pr\left(\left|\eta(X) - \frac12\right| < \epsilon\right) = 0. \tag{10}$$

Suppose that we assume, as in [27], that the class $G$ contains the minimizer $g^*$ of the expected loss (the Bayes classifier), which is the indicator of $\eta(x) > 1/2$. Then it is easy to show that this implies the excess loss class is of Bernstein type 1. Indeed, one can verify that (10) is equivalent to the assertion that all measurable functions $g : \mathcal X \to \{\pm 1\}$ satisfy

$$\Pr\big(g(X) \ne g^*(X)\big) \le \frac{1}{2\epsilon}\, E\big(\ell(g(X),Y) - \ell(g^*(X),Y)\big)$$

(see, for example, Lemma 5 in [2]). Therefore,

$$E\big(\ell(g(X),Y) - \ell(g^*(X),Y)\big)^2 = \Pr\big(g(X) \ne g^*(X)\big) \le \frac{1}{2\epsilon}\, E\big(\ell(g(X),Y) - \ell(g^*(X),Y)\big).$$
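The Bernstein bound just derived can be checked by brute force on a small finite domain. The sketch below is an illustration under assumptions, not a proof: the domain has four points, `p` and `eta` are made-up values satisfying (10) with $\epsilon = 0.1$, the function name `bernstein_check` is hypothetical, and the excess risk is computed with the standard identity $E f = E\big[|2\eta(X) - 1|\, \mathbb 1\{g(X) \ne g^*(X)\}\big]$ for the 0-1 loss.

```python
from itertools import product

def bernstein_check(p, eta, eps):
    """Largest ratio E f^2 / E f over all classifiers on a finite domain."""
    assert all(abs(e - 0.5) >= eps for e in eta), "margin condition (10) violated"
    bayes = [1 if e > 0.5 else -1 for e in eta]  # Bayes classifier g*
    worst = 0.0
    for g in product([-1, 1], repeat=len(p)):
        disagree = [gi != bi for gi, bi in zip(g, bayes)]
        # E f^2 equals Pr(g != g*), since f is +-1 there and 0 elsewhere.
        second_moment = sum(px for px, d in zip(p, disagree) if d)
        # E f is the excess 0-1 risk of g over the Bayes classifier.
        excess = sum(px * abs(2 * e - 1) for px, e, d in zip(p, eta, disagree) if d)
        if second_moment > 0:
            worst = max(worst, second_moment / excess)
    return worst

p = [0.1, 0.2, 0.3, 0.4]      # toy marginal distribution of X (assumption)
eta = [0.9, 0.2, 0.7, 0.35]   # toy conditional probabilities (assumption)
eps = 0.1
worst_ratio = bernstein_check(p, eta, eps)
```

On this example the worst ratio is $1/\min_x |2\eta(x) - 1| = 1/0.3$, which stays below the constant $1/(2\epsilon) = 5$ from the displayed inequality.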
Similarly, if there is a constant $\kappa \ge 0$ such that

$$\Pr\left(\left|\eta(X) - \frac12\right| \le \epsilon\right) \le c\,\epsilon^\kappa \tag{11}$$

for some $c$ and all $\epsilon > 0$ (see [27]), and the class $G$ contains the Bayes classifier, then this implies the excess loss class is of Bernstein type $\kappa/(1+\kappa)$ (see, for example, Lemma 5 in [2]).

Boosting with an $\ell_1$ constraint: Large margin classification methods, such as AdaBoost and support vector machines, minimize the sample average of a convex criterion over a class of real-valued functions. For example, Lugosi and Vayatis [15] consider empirical minimization with an exponential loss over a class of $\ell_1$-constrained linear combinations of binary functions: Define, for a given class $H$ of $\{\pm 1\}$-valued functions, the class

$$G_\lambda = \left\{\sum_i \alpha_i h_i : h_i \in H \text{ and } \sum_i |\alpha_i| \le \lambda\right\}.$$

Let $\ell(\hat y, y) = \exp(-y\hat y)$, and consider the excess loss class

$$F_\lambda = \{(x,y) \mapsto \ell(g(x),y) - \ell(g^*(x),y) : g \in G_\lambda\},$$

where $g^*$ is the minimizer in $G_\lambda$ of the risk. Then for all probability distributions, functions in $F_\lambda$ are bounded by $b = \exp(\lambda)$ and have Bernstein type 1 (see Lemma 7 and Table 1 in [2]).

Support vector machines with low noise: The support vector machine is a method for pattern classification that chooses a function $f : \mathcal X \to \mathbb R$ from a reproducing kernel Hilbert space (RKHS) $H$ with kernel $k : \mathcal X^2 \to \mathbb R$ so as to minimize the regularized empirical risk criterion

$$\frac1n\sum_{i=1}^n \ell(f(x_i),y_i) + \lambda\|f\|^2,$$

where $y_i \in \{\pm 1\}$, the loss is the hinge loss,

$$\ell(\hat y, y) = \max\{0,\, 1 - \hat y y\}, \tag{12}$$

and $\|f\|$ denotes the norm in the RKHS. This is equivalent, for some $r$, to solving the constrained optimization problem

$$\min_{f \in H}\ \frac1n\sum_{i=1}^n \ell(f(x_i),y_i) \quad \text{s.t.} \quad \|f\|^2 \le r^2.$$

Define $H_r = \{g \in H : \|g\|^2 \le r^2\}$ and the excess loss class

$$F_r = \{(x,y) \mapsto \ell(g(x),y) - \ell(g^*(x),y) : g \in H_r\}, \tag{13}$$

where $g^* \in H_r$ is the minimizer of the risk. Then if the kernel of the RKHS satisfies

$$k(X,X) \le B \quad \text{almost surely}, \tag{14}$$

all functions in $F_r$ are bounded by $2Br$. Furthermore, if the probability distribution satisfies the low noise condition (11) and $F_r$ contains the Bayes classifier, then Lemma 7 of [4] shows that $F_r$ has Bernstein type $\kappa/(1+\kappa)$.
Thus, our assumptions are satisfied in this case, and the results in this article give estimates of the excess risk, that is, the difference between the expected loss and the infimum over all measurable functions of the expected loss. In fact, this also leads to an estimate of the excess risk as measured by the 0-1 loss: for all large margin classification methods, which minimize the sample average of a surrogate loss function, there is a general, optimal inequality relating the excess risk as measured by the surrogate loss to the excess risk as measured by the 0-1 loss [2].

Kernel ridge regression for classification: If, in the support vector machine, we replace the hinge loss (12) with the quadratic loss, $\ell(\hat y, y) = (\hat y - y)^2$, we obtain the kernel ridge regression method for pattern classification. Defining the class $F_r$ as in (13), if the kernel satisfies the bound (14), then every function in $F_r$ is bounded by $2Br$. Furthermore, without any constraints on the probability distribution, the uniform convexity of the loss function implies that $F_r$ has Bernstein type 1 [14].

Kernel regression with convex loss: Similar examples can be obtained when the quadratic loss is replaced by any power loss (see [20]). In kernel regression also, if the response variable satisfies $|Y| \le B$ almost surely, then the boundedness of the kernel implies boundedness of functions in the excess loss class, and uniform convexity of the loss implies that the excess loss class is Bernstein.

Figure 1. The graph of a function $\xi'_n$ that is sub-linear (cf. Lemma 2.3).

Star-shaped classes

We begin with the following definition:

Definition 2.2. $F$ is called star-shaped around 0 if for every $f \in F$ and $0 \le \alpha \le 1$, $\alpha f \in F$.

We will show below that if $F$ is an excess loss class, then any empirical minimizer in $F$ is also an empirical minimizer in the set

$$\operatorname{star}(F,0) = \{\alpha f : f \in F,\ 0 \le \alpha \le 1\}.$$

Hence, one can replace $F$ with $\operatorname{star}(F,0)$ in the analysis of the empirical minimization problem.
Moreover, since $Ef$ and $E_n f$ are linear functionals in $f$, the localized complexity of $\operatorname{star}(F,0)$ is not considerably larger than that of $F$ (for instance, in the sense of covering numbers). The advantage in considering star-shaped classes is that it adds some regularity to the class, and thus the analysis of the empirical minimization problem becomes simpler. For example, it is easy to see that for star-shaped classes the functions $\xi_n(r)/r$ and $\xi'_n(r)/r$ are non-increasing. Figure 1 illustrates the graph of a typical function with this sub-linear property, which is stated formally in the following lemma.
Figure 2. An example of a graph of a function $\xi'_n$ for the class $\operatorname{star}(F,0)$, where $F$ contains only functions with expectations $r_1$ and $r_2$.

Lemma 2.3. If $F$ is star-shaped around 0, then for any $0 < r_1 < r_2$,

$$\frac{\xi'_n(r_1)}{r_1} \ge \frac{\xi'_n(r_2)}{r_2}.$$

In particular, if for some $\alpha$, $\xi'_n(r) \ge \alpha r$, then for all $0 < r' \le r$, $\xi'_n(r') \ge \alpha r'$. Analogous assertions hold for $\xi_n$.

In other words, for every $r$, the graph of $\xi'_n$ in the interval $[0,r]$ is above the line connecting $(r,\xi'_n(r))$ and $(0,0)$. For the sake of completeness we include the proof of Lemma 2.3, which was originally stated in [3].

Proof. (of Lemma 2.3) Fix a sample $X_1,\ldots,X_n$ and, without loss of generality, suppose that $\sup_{f \in F_{r_2}}(Ef - E_n f)$ is attained at $f$. Since $F$ is star-shaped, the function $f' = \frac{r_1}{r_2} f \in F_{r_1}$ satisfies

$$Ef' - E_n f' = \frac{r_1}{r_2}\sup_{f \in F_{r_2}}(Ef - E_n f),$$

and the first part follows. The second part follows directly from the first part by noting that

$$\xi'_n(r') \ge \frac{r'}{r}\,\xi'_n(r) \ge \frac{r'}{r}\,\alpha r = \alpha r'.$$

The proof for $\xi_n$ is analogous.

As an example, Figure 2 illustrates the graph of a function $\xi'_n$ for the star-shaped hull of a class that contains only functions with expectations that are equal either to $r_1$ or to $r_2$.

The following lemma allows one to use $\operatorname{star}(F,0)$ in the analysis of the empirical minimization problem and obtain results regarding the empirical minimization problem over $F$.

Lemma 2.4. Let $F$ be a class of functions that contains 0.
(1) If $F$ is a $(\beta,B)$-Bernstein class then $\operatorname{star}(F,0)$ is also a $(\beta,B)$-Bernstein class.
(2) For every $x_1,\ldots,x_n$, set

$$I_1 = \inf\left\{\sum_{i=1}^n f(x_i) : f \in F\right\}, \qquad I_2 = \inf\left\{\sum_{i=1}^n f(x_i) : f \in \operatorname{star}(F,0)\right\}.$$

Then $I_1 = I_2$. Moreover, for every $\varepsilon \ge 0$ the set $\{f \in \operatorname{star}(F,0) : \sum_{i=1}^n f(x_i) \le I_1 + \varepsilon\}$ has a nonempty intersection with $F$.

Note that by Lemma 2.4, if the set of $\varepsilon$-approximate empirical minimizers relative to $\operatorname{star}(F,0)$ is contained in some set $A$, then the set of $\varepsilon$-approximate empirical minimizers relative to $F$ is also contained in $A$. In particular, consider the set $A = \{f : \gamma \le Ef \le \beta\}$. Thus, upper and lower estimates of the expectation of the empirical minimizer in $\operatorname{star}(F,0)$ would imply the same fact for all empirical minimizers in $F$.

Proof. (of Lemma 2.4) Every $g \in \operatorname{star}(F,0)$ is of the form $g = \alpha f$ for some $f \in F$ and $0 \le \alpha \le 1$. Since $\beta \le 2$ and $F$ is a $(\beta,B)$-Bernstein class,

$$Eg^2 = \alpha^2 Ef^2 \le B\alpha^2(Ef)^\beta \le B(E\alpha f)^\beta = B(Eg)^\beta.$$

To prove the second part, notice that $I_2 \le I_1$. Since $0 \in F$, we have $I_1 \le 0$ and thus, if $I_2 = 0$ then the claim is obvious. Therefore, assume that $I_2 < 0$ and, for the sake of simplicity, assume that the infimum is attained in $g = \alpha f$ for some $f \in F$ and $0 < \alpha \le 1$. If $\alpha < 1$ then

$$I_1 \le \sum_{i=1}^n f(x_i) = \alpha^{-1}\sum_{i=1}^n g(x_i) = \alpha^{-1} I_2 < I_2,$$

which is impossible. Thus $\alpha = 1$ and $I_1 = I_2$. The final claim of the lemma follows using a similar argument.

Motivated by these observations, we redefine the set $F_r$ as

$$F_r = \{f \in \operatorname{star}(F,0) : Ef = r\}.$$

For the remainder of the article, we use this in the definitions of the complexity parameters $\xi_{n,F,\mu}(r)$, $\xi'_{n,F,\mu}(r)$ in (2)-(3), and hence in the definitions of $r_n^*$, $r_n'^{*}$, $s_n$, $r_{n,\varepsilon,+}$, and $r_{n,\varepsilon,-}$ in (5)-(9) as well.

Preliminary Results

If $F$ is star-shaped around 0 one can derive the following estimates for the empirical minimizer. (Recall the definitions $r_n^* = \inf\{r : \xi_n(r) \le r/4\}$ and $r_n'^{*} = \inf\{r : \xi'_n(r) \le r/4\}$, where $\xi_n$ and $\xi'_n$ were defined above in (2) and (3).)

Theorem 2.5. [3] Let $F$ be a $(\beta,B)$-Bernstein class of functions bounded by $b$ that contains 0. Then there is an absolute constant $c$ such that with probability at least $1 - e^{-x}$, any empirical minimizer $\hat f \in F$ satisfies

$$E\hat f \le \max\left\{r_n^*,\ \frac{cbx}{n},\ c\left(\frac{Bx}{n}\right)^{1/(2-\beta)}\right\}.$$

Also, with probability at least $1 - e^{-x}$, any empirical minimizer $\hat f \in F$ satisfies

$$E\hat f \le \max\left\{r_n'^{*},\ \frac{cbx}{n},\ c\left(\frac{Bx}{n}\right)^{1/(2-\beta)}\right\}.$$
Thus, with high probability, $r_n^*$ is an upper bound for $E\hat f$, as long as $r_n^* \ge c/n^{1/(2-\beta)}$, and the same holds for $r_n'^{*}$. Note that $r_n'^{*}$ can be much smaller than $r_n^*$, and so the convergence rates obtained through $r_n'^{*}$ are potentially better. For $\beta = 1$, the estimates based on $r_n^*$ and $r_n'^{*}$ are at best $1/n$, and in general at best $1/n^{1/(2-\beta)}$. Thus, the degree of control of the variance through the expectation, as measured by the Bernstein condition, influences
the best rate of convergence one can obtain in terms of $r_n^*$ and $r_n'^{*}$ using this method whenever one requires a confidence that is exponentially close to 1. In particular, this approach recovers the better learning rates for convex function classes from [14] and for low noise classification from [19, 27], as both convexity of $F$ for squared loss and low noise conditions imply that the loss class is Bernstein.

It turns out that this structural bound can be improved using a direct analysis of the empirical minimization process. Indeed, the next theorem shows that one can directly bound $E\hat f$ for the empirical minimizer without trying to relate the empirical and actual structures of $F$. It states that $E\hat f$ is concentrated around $s_n$ and therefore, with high probability, $E\hat f \le r_{n,\varepsilon,+}$, where $\varepsilon$ can be taken smaller than $c\sqrt{\log n / n}$. In addition, if the class is not too rich around 0, then with high probability, $E\hat f \ge r_{n,\varepsilon,-}$. (To recall the definitions of $s_n$, $r_{n,\varepsilon,+}$, and $r_{n,\varepsilon,-}$, see (7)-(9).) The result follows immediately from the main result of [3], together with the observations above about star-shaped classes.

Theorem 2.6. For any $c_1 > 0$, there is a constant $c$ (depending only on $c_1$) such that the following holds. Let $F$ be a $(\beta,B)$-Bernstein class of functions bounded by $b$ that contains 0. For every $n$ and $\varepsilon > 0$ define $r_{n,\varepsilon,+}$ and $r_{n,\varepsilon,-}$ as above, fix $x > 0$ and set

$$r' = \max\left\{r_n'^{*},\ \frac{cb(x + \log n)}{n},\ c\left(\frac{B(x + \log n)}{n}\right)^{1/(2-\beta)}\right\}.$$

If

$$\varepsilon \ge c\left(\max\Big\{\sup_s\big(\xi'_{n,F,\mu}(s) - s\big),\ r'^{\beta}\Big\}\,\frac{(B + b)(x + \log n)}{n}\right)^{1/2},$$

then
(1) With probability at least $1 - e^{-x}$,

$$E\hat f \le \max\left\{\frac1n,\ r_{n,\varepsilon,+}\right\}.$$

(2) If

$$E\sup\{Ef - E_n f : f \in \operatorname{star}(F,0),\ Ef \le c_1/n\} < \sup_s\big(\xi'_{n,F,\mu}(s) - s\big) - \varepsilon,$$

then with probability at least $1 - e^{-x}$,

$$E\hat f \ge r_{n,\varepsilon,-}.$$

To compare this result to the previous one, note that $s_n \le r_n'^{*}$. Indeed, $\xi'_n(r) \ge E(Ef - E_n f) = 0$ for any fixed function $f$, and thus $\xi'_n(0) \ge 0$, $\xi'_n(s_n) \ge s_n$ and

$$0 \le s_n \le \inf\{r : \xi'_n(r) \le r\} \le r_n'^{*}$$

(where the last inequality holds since $\xi'_n(r)/r$ is non-increasing, by Lemma 2.3). It follows that if $\xi'_n(r) - r$ is not flat around $s_n$, then the bound resulting from Theorem 2.6 improves the structural bound of Theorem 2.5.
Figure 3 illustrates graphically such a case.

3. A true gap between the expectation of the empirical minimizer and $r_n^*$

In this section, we construct a class of functions for which there is a clear gap between the structural result of Theorem 2.5 and the expectation of the empirical minimizer, as estimated in Theorem 2.6. The idea behind this construction (as well as in the other construction we present later) is that one has complete freedom to choose the expectation of a function, while forcing it to have certain values on a given sample.
Figure 3. The graph of a function $\xi'_n$, and the corresponding values for $r_n'^{*}$, $s_n$, $r_{n,\varepsilon,+}$, and $r_{n,\varepsilon,-}$. If $s_n \ll r_n'^{*}$ and $\xi'_n(r) - r$ is peaked around $s_n$, then $r_{n,\varepsilon,+}$ is smaller than $r_n'^{*}$.

Let us start with an outline of the construction. It is based on the idea (developed in [3]) of two Bernstein classes of functions satisfying the following for any fixed $n$. The functions are defined on a finite set $\{1,\ldots,m\}$ with respect to the uniform probability measure, where $m$ depends on $n$. The first class contains all functions that vanish on a set of cardinality $n$, but have expectations equal to a given constant. The second class consists of functions that each take their minimal values on a set of cardinality $n$, but have expectations equal to $1/n$. By appropriately choosing the values of the functions, one can show that the star-shaped hull of the union of these two classes has $r_n'^{*} \ge c$, whereas $s_n \sim r_{n,\varepsilon,+} \sim 1/n$. Thus, the estimate given by Theorem 2.6 is considerably better than the one resulting from Theorem 2.5 for that fixed value of $n$. To make this example uniform over $n$, we construct similar sets on $(0,1]$, take the star-shaped hull of the union of all such sets and show that $\xi'_{n,F,\mu}(r) - r$ still achieves its maximum at $1/n$ and decays rapidly for $r > 1/n$, ensuring that $r_{n,\varepsilon,+} \ll r_n'^{*}$.

The first step in the construction is the following lemma.

Lemma 3.1. Let $\mu$ be the Lebesgue measure on $(0,1]$. Then, for every positive integer $n$ and any $1/n \le \lambda \le 1/2$ there exists a function class $G^n_\lambda$ such that
(1) For every $g \in G^n_\lambda$, $-1 \le g(x) \le 1$, $Eg = \lambda$ and $Eg^2 \le 2Eg$.
(2) For every set $\tau \subset (0,1]$ with $|\tau| \le n$, there is some $g \in G^n_\lambda$ such that for every $s \in \tau$, $g(s) = -1$.
Also, there exists a function class $H^n_\lambda$ such that
(1) For every $h \in H^n_\lambda$, $0 \le h(x) \le 1$, $Eh = \lambda$, and $Eh^2 \le Eh$.
(2) For every set $\tau \subset (0,1]$ with $|\tau| \le n$, there is some $h \in H^n_\lambda$ such that for every $s \in \tau$, $h(s) = 0$.
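The moment conditions in Lemma 3.1 can be verified with exact arithmetic for a piecewise-constant realisation of the two classes. The sketch below assumes the step-function construction on $m = 2(n^2 + n)$ equal intervals: a function of the first class equals $-1$ on $n$ intervals and $t = (\lambda m + n)/(m - n)$ elsewhere, and a function of the second class equals 0 on $n$ intervals and $\lambda m/(m - n)$ elsewhere. The function names are hypothetical.

```python
from fractions import Fraction

def g_class_moments(n, lam):
    """Value t, mean and second moment of a step function equal to -1 on n of m intervals."""
    m = 2 * (n * n + n)
    t = (lam * m + n) / Fraction(m - n)
    mean = (Fraction(-n) + t * (m - n)) / m
    second = (Fraction(n) + t * t * (m - n)) / m
    return t, mean, second

def h_class_moments(n, lam):
    """Value t, mean and second moment of a step function equal to 0 on n of m intervals."""
    m = 2 * (n * n + n)
    t = lam * m / Fraction(m - n)
    mean = t * (m - n) / m
    second = t * t * (m - n) / m
    return t, mean, second

# Exact check of properties (1) of the lemma on a range of n and lambda.
for n in range(2, 30):
    for lam in (Fraction(1, n), Fraction(1, 4) if n >= 4 else Fraction(1, 2), Fraction(1, 2)):
        t, mean, second = g_class_moments(n, lam)
        assert 0 <= t <= 1 and mean == lam and second <= 2 * mean
        t, mean, second = h_class_moments(n, lam)
        assert 0 <= t <= 1 and mean == lam and second <= mean
```

Using `Fraction` avoids floating-point noise, so the equalities $Eg = \lambda$ and $Eh = \lambda$ are checked exactly rather than up to a tolerance.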
Proof. Let $m = 2(n^2 + n)$. Consider functions that are constant on the intervals $((i-1)/m,\ i/m]$, $1 \le i \le m$, and set $G^n_\lambda$ to be the function class containing all functions taking the value $-1$ on exactly $n$ such intervals; that is, each function in $G^n_\lambda$ is defined as follows: Let $J \subset \{1,\ldots,m\}$, $|J| = n$ and set

$$g_J(x) = \begin{cases} -1, & \text{if } x \in \left(\frac{j-1}{m}, \frac{j}{m}\right] \text{ and } j \in J, \\ t_\lambda, & \text{otherwise,} \end{cases}$$

where

$$t_\lambda = \frac{\lambda m + n}{m - n} = \frac{2\lambda(n^2 + n) + n}{2n^2 + n}. \tag{15}$$

Since $0 \le \lambda \le 1/2$, $0 \le t_\lambda \le 1$ and thus $g_J : (0,1] \to [-1,1]$. It is easy to verify that all the functions in $G^n_\lambda$ have expectation $\lambda$ with respect to $\mu$ and that $G^n_\lambda$ is $(1,2)$-Bernstein, since for any $g \in G^n_\lambda$,

$$Eg^2 = \frac1m\big(n + t_\lambda^2(m - n)\big) \le \frac1m\big(n + t_\lambda(m - n)\big) = \lambda + \frac{1}{n+1} \le 2\lambda = 2Eg.$$

The construction of $H^n_\lambda$ is similar, and its functions take the values $\{0, t_\lambda\}$ for $t_\lambda = \lambda m/(m - n)$.

Using the notation of the lemma, define the following function classes:

$$H = \bigcup_{i=5}^{\infty} H^i_{1/4}, \qquad F_k = G^k_{1/k}, \qquad G = \bigcup_{i=5}^{\infty} F_i,$$

and

$$F = \operatorname{star}(G \cup H, 0). \tag{16}$$

Since $F$ contains 0 and is a $(1,2)$-Bernstein class, it satisfies the assumptions of Theorem 2.5 and Theorem 2.6. Moreover, it is star-shaped around 0 and for any $n \ge 5$ and any $X_1,\ldots,X_n$ there is some $f \in F$ with $Ef = 1/4$ and $E_n f = 0$, and some $g \in F$ with $Eg = 1/n$ and $E_n g = -1$. Indeed, $f$ can be taken from $H^n_{1/4}$ and $g$ from $F_n = G^n_{1/n}$.

The following theorem shows that for the class $F$, for any integer $n$, $r_n'^{*} = 1/4$, while the empirical minimizer is likely to be smaller than $r_{n,\varepsilon,+} \le c/n$.

Theorem 3.2. For $F$ defined by (16), the following holds:
(1) For every $n \ge 5$,

$$\xi'_{n,F,\mu}(r) = \begin{cases} r + rk & \text{if } r \in (1/(k+1),\ 1/k], \text{ where } k \ge n, \\ r & \text{if } r \in (1/5,\ 1/4], \\ 0 & \text{if } r > 1/4, \end{cases}$$

and in particular, $r_n'^{*} = 1/4$.
(2) There exists a constant $c > 1$ such that the following holds: for every $\varepsilon < 3/4$, every $n \ge N(\varepsilon)$ and every $k \le n/c$,

$$\xi'_{n,F,\mu}(1/k) - 1/k \le \xi'_{n,F,\mu}(1/n) - 1/n - \varepsilon.$$

In particular, $r_{n,\varepsilon,+} \le c/n$.

Note that by the properties of $F$ mentioned above, for every sample of cardinality $n \ge 5$, the graph of $\xi'_n$ for the class $F_n \cup H^n_{1/4}$ (which is the same as for the class $\operatorname{star}(F_n \cup H^n_{1/4}, 0)$) is as in Figure 4, with $r_n'^{*} = 1/4$ and $s_n = 1/n$. For the star-shaped hull of the union of all these sets, the function $\xi'_n$ can still be described in
closed form for values of $r > 1/5$ and $r \le 1/n$, because $\sup_{f \in F_r}(Ef - E_n f)$ is independent of the sample and is reached at a scaled-down function from $H$ and $G$, respectively; this is proved in part 1 of the theorem. On the other hand, for $1/n < r < 1/5$ this supremum is no longer independent of the sample and thus we cannot provide a simple closed form for $\xi'_n$. Despite that, $\xi'_n(r) - r$ still achieves its maximum at $1/n$ and decays rapidly for $r > 1/n$, ensuring that $r_{n,\varepsilon,+} \ll r_n'^{*}$, which is the second part of the theorem. Figure 5 illustrates the qualitative behavior of $\xi'_n$.

Figure 4. The graph of $\xi'_{n, F_n \cup H^n_{1/4}, \mu}$ (as in the proof of Theorem 3.2).

Proof. (of Theorem 3.2) For the first part of the proof, observe that the subsets $F_r$ consisting of functions with expectation $Ef = r$ are

$$F_r = \begin{cases} H_r \cup G_r & \text{if } r \le 1/5, \\ H_r & \text{if } r \in (1/5,\ 1/4], \\ \emptyset & \text{if } r > 1/4, \end{cases}$$

where $H_r$ and $G_r$ are the scaled-down versions of $H$ and $G$, and

$$G_r = \bigcup_{k=5}^{\lfloor 1/r \rfloor} \{krg : g \in G^k_{1/k}\}.$$

The first part of the theorem follows from the definition of the function $\xi'_n$ and the fact that for any fixed sample of size $n$, the infimum $\inf_{f \in F_r} E_n f$ is equal to 0 and reached at a scaled-down function from $H^n_{1/4}$ for $r \in (1/5, 1/4]$, and it is equal to $-rk$ and reached at a scaled-down function from $G^k_{1/k}$ whenever $r \in (1/(k+1), 1/k]$ and $k \ge n$.

Turning to the second, and more difficult part, note that indeed $r_n'^{*} = 1/4$ and that the maximal value of $\xi'_{n,F,\mu}(r) - r$ is attained at $r = 1/n$. In order to estimate the value $\xi'_{n,F,\mu}(1/k)$ for $k < n$, consider $\sup_{f \in G^k_{1/k}}(Ef - E_n f)$ for a fixed $X_1,\ldots,X_n$. Let $m = 2(k^2 + k)$ and note that by the construction of $G^k_{1/k}$, each $g \in G^k_{1/k}$ is of the form $g_J$ for some set $J \subset \{1,\ldots,m\}$, $|J| = k$. For each set $J$ let $A_J$ be the union of the intervals $\left(\frac{j-1}{m}, \frac{j}{m}\right]$ where $j \in J$, and let $\Phi$ be the following set of indicator functions

$$\Phi = \{\mathbb 1_{A_J} : J \subset \{1,\ldots,m\},\ |J| = k\}.$$

Clearly, for every $\varphi \in \Phi$, $E\varphi = k/m$ and $\operatorname{vc}(\Phi) \le k$, since no set of $k+1$ distinct points in $(0,1]$ can be shattered by $\Phi$ (actually, $\operatorname{vc}(\Phi) = k$ since the set $\{1/k, 1/(k-1), \ldots, 1\}$ is shattered by $\Phi$). Recall that if $\Phi$ is a class
of binary-valued functions and if the VC-dimension $\operatorname{vc}(\Phi) \le k$, then as a special case of Theorem A.5, the Rademacher averages (see page 16, equation (18) for the definition) can be bounded by

$$ER_n(\Phi) \le c_2\sqrt{k/n} \tag{17}$$

for some absolute constant $c_2$.

Figure 5. Qualitative behavior of $\xi'_{n,F,\mu}$.

Define the random variable $l_J = \sum_{i=1}^n \mathbb 1_{A_J}(X_i)$. Thus, $l_J$ is the cardinality of the set $\{i : g_J(X_i) = -1\}$. Note that

$$E_n g_J = \frac{-2l_J(k+1)^2 + 3kn + 2n}{nk(2k+1)},$$

and therefore,

$$\sup_{f \in G^k_{1/k}}(Ef - E_n f) = \frac1k + \frac{2(k+1)^2\sup_J l_J - 3kn - 2n}{nk(2k+1)}.$$

From Talagrand's concentration inequality (Theorem A.1) applied to the set of functions $\Phi$, there exist absolute constants $c_1, c_2$ such that for any $0 < t \le 1$, with probability larger than $1 - e^{-c_1 n t^2}$,

$$\sup_{f \in \Phi}\sum_{i=1}^n f(X_i) \le \frac{nk}{m} + 2nR_n(\Phi) + 2tn \le \frac{nk}{m} + 2c_2\sqrt{kn} + 2tn,$$

where the last inequality holds by (17). Setting $t = 1/20$, and since $nk/m = n/(2(k+1)) < n/10$ for any $k \ge 5$, it is evident that there exists an absolute constant $c > 1$ such that for any $k \le n/c$, with probability at least $1 - e^{-c_1 n}$,

$$\sup_J l_J \le n/5 + 2c_2\sqrt{kn} \le n/4.$$
Therefore, applying the union bound for $5 \le k' \le k$, it follows that with probability at least $1 - e^{-cn}$,

$$\sup_{f \in \bigcup_{k'=5}^{k} \frac{k'}{k} G^{k'}_{1/k'}} (Ef - E_n f) \le \frac{1}{k} + \frac{(k+1)^2/2 - 3k - 2}{k(2k+1)} \le \frac{1}{k} + \frac{1}{4}$$

for every $k \le n/c$. Observe that scaled-down versions of functions from $H$ do not contribute to $\xi_{n,F,\mu}(1/k)$, and thus one only has to take care of elements in $F$ with expectation $1/k$ that either come from $G^k_{1/k}$ or are scaled-down versions of $G^{k'}_{1/k'}$ for $k' \le k$. Hence,

$$\xi_{n,F,\mu}(1/k) = E \sup_{f \in \bigcup_{k'=5}^{k} \frac{k'}{k} G^{k'}_{1/k'}} (Ef - E_n f) \le \left(\frac{1}{k} + \frac{1}{4}\right)(1 - e^{-cn}) + \left(\frac{1}{k} + 1\right) e^{-cn} = \frac{1}{k} + \frac{1}{4} + \frac{3}{4} e^{-cn}.$$

Thus, for $\varepsilon < 3/4$, if $n$ is sufficiently large that $\frac{3}{4} e^{-cn} \le \frac{3}{4} - \varepsilon$, we have

$$\xi_{n,F,\mu}(1/k) - 1/k \le 1 - \varepsilon = \xi_{n,F,\mu}(1/n) - 1/n - \varepsilon,$$

provided that $k \le n/c$.

To conclude, there exists a true gap between the bound that can be obtained via the structural result (the fixed point $r_n^*$ of the localized empirical process) and the true expectation of the empirical minimizer as captured by $s_n^*$.

Corollary 3.3. For $F$ defined in (16), there is an absolute constant $c > 0$ for which the following holds: for any $x > 0$ there is an integer $N(x)$ such that for any $n \ge N(x)$,
(1) With probability at least $1 - e^{-x}$, $E\hat f \le c/n \sim s_n^*$.
(2) $r_n^* = 1/4$.

4. Estimating $r_n^*$ from data

The next question we wish to address is how to estimate the function $\xi_n(r)$ and the fixed point

$$r_n^* = \inf\left\{ r : \xi_n(r) \le \frac{r}{4} \right\}$$

empirically, in cases where the global complexity of the function class, as captured, for example, by the covering numbers or the combinatorial dimension, is not known. A way of estimating $r_n^*$ is to find an empirically computable function $\hat\xi_n(r)$ that is, with high probability, an upper bound for the function $\xi_n(r)$; its fixed point $\hat r_n^* = \inf\{r : \hat\xi_n(r) \le r/4\}$ is then an upper bound for $r_n^*$. We shall construct $\hat\xi_n$ for which $\hat\xi_n(r)/r$ is non-increasing, and thus $\hat r_n^*$ can be determined using a binary search algorithm. To that end, we require the following result, which states that, for Bernstein classes, there is a phase transition in the behavior of coordinate projections around the point where $\xi_n(r) \sim r$.
Above this point, the local subsets $F_r = \{f \in \mathrm{star}(F,0) : Ef = r\}$ are small and the expectation and empirical means are close in a multiplicative sense. Below this point, the sets $F_r$ are too rich to allow this.
Theorem 4.1 [3]. There is an absolute constant $c$ for which the following holds. Let $F$ be a class of functions such that for every $f \in F$, $\|f\|_\infty \le b$. Assume that $F$ is a $(\beta,B)$-Bernstein class. Suppose that $r \ge 0$, $0 < \lambda < 1$, and $0 < \alpha < 1$ satisfy

$$r \ge c \max\left\{ \frac{bx}{n\alpha^2\lambda},\ \left(\frac{Bx}{n\alpha^2\lambda^2}\right)^{1/(2-\beta)} \right\}.$$

(1) If $\xi_n(r) \ge (1+\alpha) r \lambda$, then with probability at least $1 - e^{-x}$, $\sup_{f \in F_r} |Ef - E_n f| \ge \lambda Ef$.
(2) If $\xi_n(r) \le (1-\alpha) r \lambda$, then with probability at least $1 - e^{-x}$, $\sup_{f \in F_r} |Ef - E_n f| \le \lambda Ef$.
(3) If $\xi_n(r) \ge (1+\alpha) r \lambda$, then with probability at least $1 - e^{-x}$, $\sup_{f \in F_r} (Ef - E_n f) \ge \lambda Ef$.
(4) If $\xi_n(r) \le (1-\alpha) r \lambda$, then with probability at least $1 - e^{-x}$, $\sup_{f \in F_r} (Ef - E_n f) \le \lambda Ef$.

We will make use of the following direct corollary of Theorem 4.1 applied to the case $\alpha = 1/2$, $\lambda = 1/2$.

Corollary 4.2. There is an absolute constant $c > 0$ for which the following holds. If $F$ is $(\beta,B)$-Bernstein,

$$r \ge c \max\left\{ \frac{bx}{n},\ \left(\frac{Bx}{n}\right)^{1/(2-\beta)} \right\}$$

and $\xi_n(r) \le r/4$, then with probability larger than $1 - e^{-x}$, every $f \in F_r$ satisfies $r/2 \le E_n f \le 3r/2$.

If we define the empirical shell

$$\hat F_{r/2,\,3r/2} := \{ f \in \mathrm{star}(F,0) : r/2 \le E_n f \le 3r/2 \},$$

the corollary shows that, for suitably large $r$, with high probability, $F_r \subset \hat F_{r/2,\,3r/2}$.

The following theorem shows that the empirical Rademacher average of an empirical shell is, with high probability, an upper bound for $\xi_n(r)$ for all $r$ larger than the fixed point $r_n^*$. For this, define the random variables

$$R_n f = \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i) \quad \text{and} \quad R_n(F) = \sup_{f \in F} R_n f, \qquad (18)$$

where $\sigma_1,\dots,\sigma_n$ denote independent Rademacher random variables, that is, symmetric, $\{-1,1\}$-valued random variables. The Rademacher averages of the class $F$ are defined as $E R_n(F)$, where the expectation is taken with respect to all random variables $X_i$ and $\sigma_i$. An empirical version of the Rademacher averages is obtained by conditioning on $X_1,\dots,X_n$:

$$E_\sigma R_n(F) = E\left( R_n(F) \mid X_1,\dots,X_n \right).$$
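To make definition (18) concrete, the following is a minimal sketch of how $E_\sigma R_n(F)$ could be computed for a finite class that is known only through its coordinate projection on the sample. The function names and the representation of the class as a list of rows `values[j][i] = f_j(X_i)` are assumptions made for illustration; they do not appear in the paper.

```python
import random
from itertools import product


def exact_empirical_rademacher(values):
    """E_sigma R_n(F) for a finite class, by enumerating all 2^n sign vectors.

    values[j][i] holds f_j(X_i); each row is one function restricted to the sample.
    Only practical for small n.
    """
    n = len(values[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        # R_n(F) = sup_f (1/n) * sum_i sigma_i f(X_i) for this sign vector
        total += max(sum(s * v for s, v in zip(sigma, row)) / n for row in values)
    return total / 2 ** n


def monte_carlo_empirical_rademacher(values, num_draws=2000, seed=0):
    """Monte Carlo approximation of the same conditional expectation."""
    rng = random.Random(seed)
    n = len(values[0])
    total = 0.0
    for _ in range(num_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(sum(s * v for s, v in zip(sigma, row)) / n for row in values)
    return total / num_draws
```

For an empirical shell, one would first keep only the rows whose empirical mean lies in the prescribed interval and then apply either function to the remaining rows.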
Theorem 4.3. There are absolute constants $c$, $c_1$, $c_2$, and $c_3$ for which the following holds. Let $F$ be a $(\beta,B)$-Bernstein class that contains $0$ and for which $\sup_{f \in F} \|f\|_\infty \le b$. If

$$\bar r = \max\left\{ r_n^*,\ \frac{1}{n},\ \frac{cbx}{n},\ c\left(\frac{Bx}{n}\right)^{1/(2-\beta)} \right\},$$

then with probability at least $1 - 2(bn+1)e^{-x}$,

$$\xi_n(r) \le 8 E_\sigma R_n\left(\hat F_{c_1 r,\, c_2 r}\right) + c_3 r$$

for every $r \in [\bar r, b]$.

Proof. By Lemma 2.3, $\xi_n(r) \le r/4$ if and only if $r \ge r_n^*$. Thus, by Corollary 4.2 (for appropriately chosen $c$), if $r \ge \bar r$, then with probability larger than $1 - e^{-x}$, $F_r \subset \hat F_{r/2,\,3r/2}$, which implies that

$$E_\sigma R_n(F_r) \le E_\sigma R_n\left(\hat F_{r/2,\,3r/2}\right).$$

By symmetrization (Theorem A.2) and concentration of Rademacher averages around their mean (Theorem A.3), and since $r \ge cbx/n$, it follows that with probability at least $1 - 2e^{-x}$,

$$\xi_n(r) \le 2 E R_n(F_r) \le 4 E_\sigma R_n(F_r) + \frac{4bx}{n} \le 4 E_\sigma R_n\left(\hat F_{r/2,\,3r/2}\right) + c_3 r.$$

To find an upper bound on $\xi_n(r)$ that holds with high probability uniformly for all $r \ge \bar r$, we divide the interval $[1/n, b]$ into a set of $bn$ intervals of length at most $1/n$. (Note that the choice of the starting point $1/n$ restricts the estimates for $r_n^*$ to values that are larger than $1/n$. The proof can be easily modified to allow estimates up to the value $cbx/n$, but since we are only interested in estimates that are at best of the order of $O(1/n)$, we made this restriction in order to keep the proof simpler.) Let

$$A = \left\{ \frac{1}{n}, \frac{2}{n}, \dots, b \right\} \cap [c_n, b], \quad \text{where } c_n = c \max\left\{ \frac{bx}{n},\ \left(\frac{Bx}{n}\right)^{1/(2-\beta)} \right\}.$$

Since $|A| \le bn + 1$, the union bound shows that with probability at least $1 - 2(bn+1)e^{-x}$,

$$\xi_n(r) \le 4 E_\sigma R_n\left(\hat F_{r/2,\,3r/2}\right) + c_3 r$$

for every $r \in A$. By Lemma 2.3, for any $k \ge 1$, if $r \in \left[\frac{k}{n}, \frac{k+1}{n}\right]$, then $\xi_n(r) \le \xi_n\left(\frac{k}{n}\right) \frac{r}{k/n}$. Thus, with probability at least $1 - 2(bn+1)e^{-x}$, every $r \in [\bar r, b]$ satisfies

$$\xi_n(r) \le \xi_n\left(\frac{k}{n}\right)\frac{r}{k/n} \le \left( 4 E_\sigma R_n\left(\hat F_{\frac{k}{2n},\,\frac{3k}{2n}}\right) + \frac{c_3 k}{n} \right)\frac{r}{k/n} \le 8 E_\sigma R_n\left(\hat F_{c_1 r,\, c_2 r}\right) + c_3 r,$$

where $k$ satisfies $r \in [k/n, (k+1)/n]$ and $c_1$ and $c_2$ are absolute constants.
Therefore, one can define

$$\hat\xi_n(r) = 8 E_\sigma R_n\left(\hat F_{c_1 r,\, c_2 r}\right) + c_3 r.$$

Let $\hat r_n^* = \inf\{ r : \hat\xi_n(r) \le r/4 \}$. By Theorem 4.3, with probability at least $1 - 2(bn+1)e^{-x}$, $\hat r_n^* \ge r_n^*$. Moreover, since $\hat\xi_n(r)/r$ is non-increasing, $r \ge \hat r_n^*$ if and only if $\hat\xi_n(r) \le r/4$. With this, given a sample of size $n$, consider the following algorithm to estimate the upper bound $\hat r_n^*$ based on the data:

Algorithm RSTAR($F$, $X_1,\dots,X_n$)
  Set $r_L = \max\{1/n, c_n\}$, $r_R = b$.
  If $\hat\xi_n(r_R) \le r_R/4$ then
    for $l = 0$ to $\lceil \log_2 (bn) \rceil$
      set $r = (r_L + r_R)/2$;
      if $\hat\xi_n(r) > r/4$ then set $r_L = r$, else set $r_R = r$.
  Output $\tilde r = r_R$.

By the construction, $\tilde r - 1/n \le \hat r_n^* \le \tilde r$. Hence, for every $n$, with probability larger than $1 - 2(bn+1)e^{-x}$, $r_n^* \le \tilde r$.

Theorem 4.4. There exists an absolute constant $c$ for which the following holds. Let $F$ be a $(\beta,B)$-Bernstein class of functions bounded by $b$ that contains $0$. For every integer $n$, any $x > 0$, and any sample $X$ of size $n$, with probability at least $1 - (2bn+3)e^{-x}$,

$$E\hat f \le \mathrm{RSTAR}(F, X).$$

Note that RSTAR($F$, $X$) is essentially the fixed point of the function $r \mapsto E_\sigma R_n(\hat F_{c_1 r,\, c_2 r})$. This function measures the complexity of the class $\hat F_{c_1 r,\, c_2 r}$, which is determined empirically as the set of functions whose empirical means fall in an interval whose length is proportional to $r$. The main difference between this and the data-dependent estimates in [1] is that instead of taking the whole empirical ball as in [1], here we only measure the complexity of an empirical shell around $r$. However, if the function class is not regular around the critical value of $r$, the complexity of the shell $\hat F_{c_1 r,\, c_2 r}$ might be very different from the complexity of $F_r$, in which case one would like to make $c_1$ and $c_2$ very close to 1. Indeed, one can tighten this bound further by narrowing the size of the shell and replacing the empirical set $\hat F_{r/2,\,3r/2}$ with $\hat F_{(1-\varepsilon_n)r,\,(1+\varepsilon_n)r}$. This is done by selecting the isomorphism constant in Theorem 4.1 to depend on $n$ and tend to 1 as $n \to \infty$.

Theorem 4.5. Let $F$ be a $(\beta,B)$-Bernstein class that contains $0$ such that $\sup_{f \in F} \|f\|_\infty \le b$. There is an absolute constant $c$ for which the following holds.
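If $0 < \varepsilon_n < 1$ and

$$\bar r = \max\left\{ r_n^*,\ \frac{1}{n},\ \frac{cbx}{n\varepsilon_n},\ c\left(\frac{Bx}{n\varepsilon_n^2}\right)^{1/(2-\beta)} \right\},$$

then with probability at least $1 - 2(bn+1)e^{-x}$,

$$\xi_n(r) \le 4 E_\sigma R_n\left(\hat F_{(1-\varepsilon_n)r,\,(1+\varepsilon_n)r}\right) + \frac{\varepsilon_n r}{c}$$

for every $r \in [\bar r, b]$.

Proof. With the same reasoning as before, by Theorem 4.1 for $\alpha = 1/2$ and $\lambda = \varepsilon_n$, if $r \ge \bar r$ then with probability larger than $1 - e^{-x}$, $F_r \subset \hat F_{(1-\varepsilon_n)r,\,(1+\varepsilon_n)r}$. We define

$$\hat\xi_n(r) = \left( 4 E_\sigma R_n\left(\hat F_{(1-\varepsilon_n)r,\,(1+\varepsilon_n)r}\right) + \frac{k\varepsilon_n}{cn} \right)\frac{r}{k/n}, \quad \text{for } r \in \left[\frac{k}{n}, \frac{k+1}{n}\right].$$

Again, with probability at least $1 - 2(bn+1)e^{-x}$, for every $r \in [\bar r, b]$, $\xi_n(r) \le \hat\xi_n(r)$.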
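The bisection performed by RSTAR can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: `xi_hat` is a hypothetical callable standing in for the empirically computable bound $\hat\xi_n$, it is assumed that `xi_hat(r) / r` is non-increasing and that $bn \ge 1$, and the `fraction` parameter is an added convenience so that the same routine covers a threshold other than $r/4$.

```python
import math


def rstar(xi_hat, r_left, r_right, n, fraction=0.25):
    """Bisection for the fixed point inf{r : xi_hat(r) <= fraction * r}.

    xi_hat   -- hypothetical callable r -> upper bound on xi_n(r); xi_hat(r)/r
                is assumed to be non-increasing
    r_left   -- starting left endpoint, e.g. max(1/n, c_n)
    r_right  -- starting right endpoint, e.g. the uniform bound b
    n        -- sample size; sets the number of bisection steps
    """
    if xi_hat(r_right) > fraction * r_right:
        # No fixed point in [r_left, r_right]; return the trivial bound.
        return r_right
    # "for l = 0 to ceil(log2(b n))" performs ceil(log2(b n)) + 1 halvings,
    # so the final bracket has length at most 1/n.
    steps = math.ceil(math.log2(r_right * n)) + 1
    for _ in range(steps):
        r = (r_left + r_right) / 2
        if xi_hat(r) > fraction * r:
            r_left = r
        else:
            r_right = r
    return r_right
```

Because the right endpoint always satisfies the test and the left endpoint never does, the returned value overshoots the fixed point by no more than the final bracket length.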
Since $\hat\xi_n(r)/r$ is non-increasing, it is possible to determine

$$\hat r_n^* = \inf\left\{ r : \hat\xi_n(r) \le \frac{r\varepsilon_n}{2} \right\}$$

with a slight modification of RSTAR (we replace the test in the if-clause, $\hat\xi_n(r) > r/4$, with $\hat\xi_n(r) > r\varepsilon_n/2$). It follows that for every $n$ and every sample of size $n$, with probability larger than $1 - 2bn\,e^{-x}$, $r_n^* \le \tilde r$, where $\tilde r$ is generated by the modified algorithm. For example, one can choose $\varepsilon_n = 1/\log n$, which has the advantage that the empirical shells $\hat F_{r - r/\log n,\; r + r/\log n}$ become, with growing sample size, closer to $F_r$. The price we pay for this advantage is an extra $\log n$ factor in the final estimate, since in this case the estimate of the expectation goes down at the rate of $O(\log n / n)$.

Remark 4.6. Note that a lower bound of a similar nature has to take into account the complexity of the class $F_{0,cr}$. This might happen because one may not have an inclusion $\hat F_r \subset F_{c_1 r,\, c_2 r}$ unless $c_1 = 0$. Indeed, if the class $F$ is very rich for $r$ close to 0, it is possible to have functions that have a very small expectation, but for which $E_n f \sim r$.

5. The limitations of estimating from data

Although the results in [3] show that it is possible to bound the expectation of the empirical minimizer in a far sharper way than by applying a structural result, it was not clear whether such a bound could be estimated from data. In the following we consider a scenario in which one only has access to the function class through the values that class members take on finite samples, that is, the finite dimensional coordinate projections of the class. In this case, we construct an example that shows that, in general, it is impossible to establish a data-dependent estimate of $s_n^*$ that is better than $r_n^*$. To be precise, we construct two function classes that have identical coordinate projections on every sample. For one class we have $r_n^* \sim c$, $s_n^* \sim c$ and the expectation of the empirical minimizer is of the order of $c$ with probability 1, while for the other class, $s_n^* \sim 1/n$.
If one only has access to the way the classes behave on finite dimensional coordinate projections, that is, samples, the classes are indistinguishable, and it is impossible to predict a better bound than an absolute constant, which could be much worse than the true behavior of the empirical minimizer.

Recall that for a given function class $F$ and a sample $\tau = \{x_1,\dots,x_n\}$, the coordinate projection of $F$ on $\tau$ is $P_\tau F = \{ (f(x_1),\dots,f(x_n)) : f \in F \}$. Let $\mu$ be the Lebesgue measure on $(0,1]$. For each $k \in \mathbb{N}$ we construct two function classes $F_1^k$ and $F_2^k$, both $(1,c)$-Bernstein with respect to $\mu$ for a suitable absolute constant $c$, and taking values in $V = \{-1,0,1\}$. In both classes, each function is constant on the intervals $((j-1)/m_k,\ j/m_k]$, where $m_k = k^2 + 3k$.

The class $F_1^k$ consists of all functions that take the value $-1$ on $k$ intervals, the value $1$ on $2k$ intervals and the value $0$ on $k^2$ intervals. It is easy to verify that for any $f \in F_1^k$, $Ef = k/(k^2+3k) \sim 1/k$ and $Ef^2 = 3k/(k^2+3k) \sim 1/k$, implying that indeed $F_1^k$ is a $(1,3)$-Bernstein class. In contrast, $F_2^k$ consists of all functions that take the value $-1$ on $k$ intervals, the value $1$ on $k^2+k$ intervals and $0$ on $k$ intervals. Therefore, for any function $f \in F_2^k$, $Ef = k^2/(k^2+3k) \ge 1/4$ and, since $Ef^2 \le 1$, $F_2^k$ is a $(1,4)$-Bernstein class. Notice that functions in $F_1^k$ have expectations of the order of $1/k$ while functions in $F_2^k$ have expectations of the order of a constant. Set

$$F_1 = \mathrm{star}\left( \bigcup_{k \in \mathbb{N}} F_1^k,\ 0 \right), \qquad F_2 = \mathrm{star}\left( \bigcup_{k \in \mathbb{N}} F_2^k,\ 0 \right);$$

it is easy to verify that for every finite set $\tau$, $P_\tau F_1 = P_\tau F_2$. Indeed, consider a set $\tau = \{x_1,\dots,x_n\}$. Without loss of generality, assume that $x_i \ne x_j$ if $i \ne j$. Let $l$ be large enough to ensure that the $x_i$'s fall in disjoint intervals $((j-1)/m_l,\ j/m_l]$ and that $l \ge n$, and thus $P_\tau F_2^l = P_\tau F_1^l = \{-1,0,1\}^n$.
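The moment computations and the projection argument above can be checked mechanically for small $k$ and $n$. The sketch below is only an illustration: the helper names are hypothetical, and a function is represented solely by how many of the $m_k$ intervals carry each value, which suffices when the sample points fall in distinct intervals.

```python
from fractions import Fraction
from itertools import product


def interval_counts(k, which):
    """How many of the m_k = k^2 + 3k intervals carry each value in F_which^k."""
    if which == 1:
        return {-1: k, 1: 2 * k, 0: k * k}
    return {-1: k, 1: k * k + k, 0: k}


def moments(k, which):
    """Exact (Ef, Ef^2) under Lebesgue measure for any f in F_which^k."""
    m = k * k + 3 * k
    counts = interval_counts(k, which)
    mean = Fraction(sum(v * c for v, c in counts.items()), m)
    second = Fraction(sum(v * v * c for v, c in counts.items()), m)
    return mean, second


def realizable(pattern, k, which):
    """True if some f in F_which^k takes the values `pattern` on sample points
    lying in pairwise distinct intervals: each value must be available often enough."""
    counts = interval_counts(k, which)
    return all(pattern.count(v) <= counts[v] for v in (-1, 0, 1))


def both_projections_full(n, k):
    """True if both F_1^k and F_2^k project onto all of {-1, 0, 1}^n."""
    return all(
        realizable(p, k, 1) and realizable(p, k, 2)
        for p in product((-1, 0, 1), repeat=n)
    )
```

For instance, `moments(5, 1)` gives $Ef = 1/8$ and $Ef^2 = 3/8$, matching $Ef^2 = 3\,Ef$, while `both_projections_full(3, 2)` is false, which shows why the argument takes $l \ge n$.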
Therefore, $F_1$ and $F_2$ are star-shaped Bernstein classes that have identical coordinate projections, making it impossible to distinguish the two based solely on empirical data. On the other hand, the behavior of the empirical minimizer is very different in the two cases.

Theorem 5.1. For $F_1$ and $F_2$ defined as above, there is an absolute constant $c > 0$ for which the following holds. For any $x > 0$ there is some $N(x)$ such that for any $n \ge N(x)$,
(1) For $F_1$, with probability at least $1 - e^{-x}$, $E\hat f \le c/n \sim s_n^*(F_1)$.
(2) For $F_2$, with probability 1, $E\hat f \ge 1/4 \sim r_n^*(F_2)$.

Theorem 5.1 implies that the estimates for the convergence rate of the empirical minimization algorithm based on $s_n^*$ are significantly better for the class $F_1$ than for $F_2$. However, the classes have identical coordinate projections on any sample, and hence are indistinguishable empirically. Thus, one cannot get an empirical estimate of the convergence rate for $F_1$ that is significantly better than one based on an empirical estimate of $r_n^*$.

Proof. We will show that the expectation of the empirical minimizer in $F_1$ is likely to be smaller than $c/n$, as opposed to $F_2$, where it is likely to be of the order of a constant. For any $n$, $\inf_{f \in F_1} E_n f = -1$, and therefore $\xi_{n,F_1,\mu}(s_n) - s_n = 1$, where, for any $k$ and any $f \in F_1^k$,

$$s_k = Ef = \frac{k}{k^2+3k} \sim \frac{1}{k}.$$

Clearly, for a class of functions bounded by 1, $\xi_{n,F,\mu}(r) - r \le 1$, and thus the maximal value of $\xi_{n,F_1,\mu}(r) - r$ is attained at $s_n \sim 1/n$.

The main part of the proof is to show that there is some absolute constant $c > 1$ such that for large enough values of $n$ and for $r \ge c/n$, $\xi_{n,F_1,\mu}(r) - r \le 1/2$. This is the case because the sets $F_1^k$ are not rich enough when projected onto samples of size $n$ as long as $k \le n/c$. Indeed, the function class $F_1$ has low complexity in terms of the combinatorial dimension $\mathrm{vc}(F_1,\varepsilon)$ (see Definition A.4). In particular, the definitions imply that $\mathrm{vc}(F_1^k,\varepsilon) \le 2k$ for all $0 < \varepsilon \le 2$ and all $k$. Since the class of functions is bounded by 1, Theorem A.5 implies that there is an absolute constant $c_2$ such that $E R_n(F_1^k) \le c_2\sqrt{k/n}$.
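Applying the one-sided version of Talagrand's concentration inequality for the empirical process $Z = \sup_{f \in F_1^k}(Ef - E_n f)$, it follows that for $t = 1/4$, with probability at least $1 - e^{-c_1 n t^2} = 1 - e^{-c_1 n}$ (with $c_1$ renamed to absorb the factor $t^2$),

$$\sup_{f \in F_1^k}(Ef - E_n f) \le 2 E R_n(F_1^k) + t \le 2c_2\sqrt{\frac{k}{n}} + t \le \frac{1}{2},$$

provided that $k \le n/c$ for some universal constant $c$. Let

$$A_k = \bigcup_{k' \le k} \frac{s_k}{s_{k'}} F_1^{k'},$$

that is, $A_k$ contains the functions in $F_1$ that have expectation $s_k$; those either come from $F_1^k$ or are scaled-down versions of functions from $F_1^{k'}$ for $k' < k$. Therefore, with probability at least $1 - e^{-c_1 n}$, for any $k \le n/c$,

$$\sup_{f \in A_k}(Ef - E_n f) \le \frac{1}{2}.$$

Taking the expectation,

$$\xi_{n,F_1,\mu}(s_k) \le \frac{1}{2}\left(1 - e^{-c_1 n}\right) + (1 + s_k)\,e^{-c_1 n} = \frac{1}{2} + \left(\frac{1}{2} + s_k\right) e^{-c_1 n},$$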
and thus, for all $\varepsilon < 1/2$, $n \ge N(\varepsilon)$ and $k \le n/c$,

$$\xi_{n,F_1,\mu}(s_k) - s_k \le 1 - \varepsilon - s_k = \xi_{n,F_1,\mu}(s_n) - s_n - \varepsilon - s_k.$$

This implies that $\xi_{n,F_1,\mu}(r) - r \le \xi_{n,F_1,\mu}(s_n) - s_n - \varepsilon$ for every $r \ge c'/n$, from which we conclude that $r_{n,\varepsilon,+} \le c'/n$.

On the other hand, it is easy to verify that for empirical minimization over $F_2$, $E\hat f \ge 1/4$. Indeed, as we saw for $F_1$, $\inf_{f \in F_2} E_n f = -1$, which implies $E_n \hat f = -1$. Since we can write

$$F_2 = \{ \alpha f : f \in F_2^k,\ k \in \mathbb{N},\ \alpha \in [0,1] \},$$

and the empirical mean $E_n(\alpha f)$ is linear in $\alpha$, it is clear that the empirical minimum will be attained at $\alpha = 1$ (using a similar argument to the one used in Lemma 2.4). Since all the functions in $\bigcup_{k \in \mathbb{N}} F_2^k$ have expectation at least $1/4$, with probability 1, $E\hat f \ge 1/4$ in this case.

Remark 5.2. Note that if one is given the function $\hat f$ that the algorithm produced, rather than just the coordinate projections, it becomes possible to distinguish whether the class at hand is $F_1$ or $F_2$. However, we can define an uncountable collection of function classes

$$\mathcal{F} = \left\{ \mathrm{star}\left( \bigcup_{k \in \mathbb{N}} F^k_{\alpha_k},\ 0 \right) : \alpha_k \in \{1,2\} \text{ for } k \in \mathbb{N} \right\},$$

where if $\alpha_k = 1$ then $F^k_{\alpha_k} = F_1^k$ and if $\alpha_k = 2$ then $F^k_{\alpha_k} = F_2^k$. Clearly, for every $H, G \in \mathcal{F}$ and every finite $\sigma \subset \Omega$, $P_\sigma(G) = P_\sigma(H)$. If the learner knows that $F \in \mathcal{F}$, then even if $\hat f$ is given to him, the best that can be said is that a single component of $F$, say the $j$th component, is $F_1^j$ or $F_2^j$. It is impossible to say whether other components of $F$ are of type 1 or type 2 and, in particular, the convergence rate for the expectation of the empirical minimizer can be as bad as for $F_2$.

The second observation worth noting is that the class $F_1$ is not a Glivenko-Cantelli class. The classes $F_1^k$ become richer as $k$ grows, i.e., in the part of $F_1$ in which the expectation of functions is smaller. The reason one can still obtain a generalization bound even for classes that are not Glivenko-Cantelli is that the method of [3] uses the expectation of the empirical process indexed by $\{f \in \mathrm{star}(F,0) : Ef = r\}$, and each one of these sets is a Glivenko-Cantelli class.
If one were to try to bound the error of the empirical minimizer using the localization $\{f \in F : Ef \le r\}$ as in [1], it would be impossible.

Appendix A. Additional material

The main technical tool we require is Talagrand's celebrated concentration theorem for empirical processes [13, 24]. The version we use is due to Bousquet [7], building on Massart's argument (see also [10,17,22]).

Theorem A.1. Let $F$ be a class of functions defined on $\mathcal{X}$ and let $P$ be a probability measure such that for every $f \in F$, $\|f\|_\infty \le b$ and $Ef = 0$. Let $X_1,\dots,X_n$ be independent random variables distributed according to $P$ and set $\sigma^2 = n \sup_{f \in F} \mathrm{var}\, f$. Define

$$Z = \sup_{f \in F} \sum_{i=1}^n f(X_i), \qquad \bar Z = \sup_{f \in F} \left| \sum_{i=1}^n f(X_i) \right|.$$
For every $x > 0$ and every $\rho > 0$,

$$\Pr\left(\left\{ Z \ge (1+\rho)EZ + \sigma\sqrt{Kx} + K(1+\rho^{-1})bx \right\}\right) \le e^{-x},$$

$$\Pr\left(\left\{ Z \le (1-\rho)EZ - \sigma\sqrt{Kx} - K(1+\rho^{-1})bx \right\}\right) \le e^{-x},$$

and the same inequalities hold for $\bar Z$. Here, $K$ is an absolute constant.

The rest of this section is devoted to some results that allow one to estimate $E\sup_{f \in F} |Ef - E_n f|$ via the Rademacher process indexed by the class. Recall the definition of the Rademacher averages of a class from equation (18). A well-known symmetrization argument (due to Giné and Zinn) connects the expectation of $\sup_{f \in F} |Ef - E_n f|$ to the Rademacher averages of $F$ [30].

Theorem A.2. Let $F$ be a class of functions defined on $(\Omega,\mu)$ and let $X_1,\dots,X_n$ be independent random variables distributed according to $\mu$. Then,

$$E \sup_{f \in F} |Ef - E_n f| \le 2 E R_n(F).$$

The next lemma, which follows directly from a self-bounding property of the Rademacher process and the methods developed in [6], shows that $E_\sigma R_n(F)$ is highly concentrated around its expectation; hence, the Rademacher averages of a class can be upper bounded by their empirical version. The following formulation can be found in [1].

Theorem A.3. Let $F$ be a class of bounded functions defined on $(\Omega,\mu)$ taking values in $[a,b]$ and let $X_1,\dots,X_n$ be independent random variables distributed according to $\mu$. Then, for any $0 < \alpha < 1$ and $x > 0$, with probability at least $1 - e^{-x}$,

$$E R_n(F) \le \frac{1}{1-\alpha} E_\sigma R_n(F) + \frac{(b-a)x}{4n\alpha(1-\alpha)}.$$

Also, with probability at least $1 - e^{-x}$,

$$\frac{1}{2} E_\sigma R_n(F) - \frac{cbx}{n} \le E R_n(F),$$

where $c$ is an absolute constant.

It is possible to bound $E R_n(F)$ using the combinatorial dimension of a set. Recall that a set $\{x_1,\dots,x_n\}$ is shattered by a class of $\{0,1\}$-valued functions $F$ if $P_\sigma F = \{(f(x_1),\dots,f(x_n)) : f \in F\} = \{0,1\}^n$, and that the Vapnik-Chervonenkis dimension $d$ of $F$, denoted by $\mathrm{vc}(F)$, is the maximal cardinality of a subset of $\Omega$ that is shattered by $F$. In a similar way, one can define the combinatorial dimension of a class of real-valued functions.

Definition A.4.
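For every $\varepsilon > 0$, a set $\sigma = \{x_1,\dots,x_n\} \subset \Omega$ is said to be $\varepsilon$-shattered by $F$ if there is some function $s : \sigma \to \mathbb{R}$ such that for every $I \subset \{1,\dots,n\}$ there is some $f_I \in F$ for which $f_I(x_i) \ge s(x_i) + \varepsilon$ if $i \in I$, and $f_I(x_i) \le s(x_i) - \varepsilon$ if $i \notin I$. Let

$$\mathrm{vc}(F,\varepsilon) = \sup\{ |\sigma| : \sigma \subset \Omega,\ \sigma \text{ is } \varepsilon\text{-shattered by } F \}.$$

The following result is a recent extension, due to Rudelson and Vershynin [23], of well-known estimates on $E R_n(F)$.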
More information18.657: Mathematics of Machine Learning
8.657: Mathematics of Machie Learig Lecturer: Philippe Rigollet Lecture 0 Scribe: Ade Forrow Oct. 3, 05 Recall the followig defiitios from last time: Defiitio: A fuctio K : X X R is called a positive symmetric
More informationRandom Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A.
Radom Walks o Discrete ad Cotiuous Circles by Jeffrey S. Rosethal School of Mathematics, Uiversity of Miesota, Mieapolis, MN, U.S.A. 55455 (Appeared i Joural of Applied Probability 30 (1993), 780 789.)
More informationStatistics 511 Additional Materials
Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability
More informationIf a subset E of R contains no open interval, is it of zero measure? For instance, is the set of irrationals in [0, 1] is of measure zero?
2 Lebesgue Measure I Chapter 1 we defied the cocept of a set of measure zero, ad we have observed that every coutable set is of measure zero. Here are some atural questios: If a subset E of R cotais a
More informationIP Reference guide for integer programming formulations.
IP Referece guide for iteger programmig formulatios. by James B. Orli for 15.053 ad 15.058 This documet is iteded as a compact (or relatively compact) guide to the formulatio of iteger programs. For more
More information5.1 A mutual information bound based on metric entropy
Chapter 5 Global Fao Method I this chapter, we exted the techiques of Chapter 2.4 o Fao s method the local Fao method) to a more global costructio. I particular, we show that, rather tha costructig a local
More informationEconomics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator
Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters
More informationMeasure and Measurable Functions
3 Measure ad Measurable Fuctios 3.1 Measure o a Arbitrary σ-algebra Recall from Chapter 2 that the set M of all Lebesgue measurable sets has the followig properties: R M, E M implies E c M, E M for N implies
More informationFall 2013 MTH431/531 Real analysis Section Notes
Fall 013 MTH431/531 Real aalysis Sectio 8.1-8. Notes Yi Su 013.11.1 1. Defiitio of uiform covergece. We look at a sequece of fuctios f (x) ad study the coverget property. Notice we have two parameters
More informationEECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1
EECS564 Estimatio, Filterig, ad Detectio Hwk 2 Sols. Witer 25 4. Let Z be a sigle observatio havig desity fuctio where. p (z) = (2z + ), z (a) Assumig that is a oradom parameter, fid ad plot the maximum
More informationIt is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function.
MATH 532 Measurable Fuctios Dr. Neal, WKU Throughout, let ( X, F, µ) be a measure space ad let (!, F, P ) deote the special case of a probability space. We shall ow begi to study real-valued fuctios defied
More information6.3 Testing Series With Positive Terms
6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial
More information6.867 Machine learning
6.867 Machie learig Mid-term exam October, ( poits) Your ame ad MIT ID: Problem We are iterested here i a particular -dimesioal liear regressio problem. The dataset correspodig to this problem has examples
More informationLearnability with Rademacher Complexities
Learability with Rademacher Complexities Daiel Khashabi Fall 203 Last Update: September 26, 206 Itroductio Our goal i study of passive ervised learig is to fid a hypothesis h based o a set of examples
More informationLecture 11: Decision Trees
ECE9 Sprig 7 Statistical Learig Theory Istructor: R. Nowak Lecture : Decisio Trees Miimum Complexity Pealized Fuctio Recall the basic results of the last lectures: let X ad Y deote the iput ad output spaces
More informationMcGill University Math 354: Honors Analysis 3 Fall 2012 Solutions to selected problems
McGill Uiversity Math 354: Hoors Aalysis 3 Fall 212 Assigmet 3 Solutios to selected problems Problem 1. Lipschitz fuctios. Let Lip K be the set of all fuctios cotiuous fuctios o [, 1] satisfyig a Lipschitz
More informationMath 216A Notes, Week 5
Math 6A Notes, Week 5 Scribe: Ayastassia Sebolt Disclaimer: These otes are ot early as polished (ad quite possibly ot early as correct) as a published paper. Please use them at your ow risk.. Thresholds
More informationDiscrete Mathematics for CS Spring 2008 David Wagner Note 22
CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig
More informationThe Borel hierarchy classifies subsets of the reals by their topological complexity. Another approach is to classify them by size.
Lecture 7: Measure ad Category The Borel hierarchy classifies subsets of the reals by their topological complexity. Aother approach is to classify them by size. Filters ad Ideals The most commo measure
More information1 of 7 7/16/2009 6:06 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 6. Order Statistics Defiitios Suppose agai that we have a basic radom experimet, ad that X is a real-valued radom variable
More informationSeunghee Ye Ma 8: Week 5 Oct 28
Week 5 Summary I Sectio, we go over the Mea Value Theorem ad its applicatios. I Sectio 2, we will recap what we have covered so far this term. Topics Page Mea Value Theorem. Applicatios of the Mea Value
More informationSingular Continuous Measures by Michael Pejic 5/14/10
Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable
More informationSolution. 1 Solutions of Homework 1. Sangchul Lee. October 27, Problem 1.1
Solutio Sagchul Lee October 7, 017 1 Solutios of Homework 1 Problem 1.1 Let Ω,F,P) be a probability space. Show that if {A : N} F such that A := lim A exists, the PA) = lim PA ). Proof. Usig the cotiuity
More informationMi-Hwa Ko and Tae-Sung Kim
J. Korea Math. Soc. 42 2005), No. 5, pp. 949 957 ALMOST SURE CONVERGENCE FOR WEIGHTED SUMS OF NEGATIVELY ORTHANT DEPENDENT RANDOM VARIABLES Mi-Hwa Ko ad Tae-Sug Kim Abstract. For weighted sum of a sequece
More informationHOMEWORK 2 SOLUTIONS
HOMEWORK SOLUTIONS CSE 55 RANDOMIZED AND APPROXIMATION ALGORITHMS 1. Questio 1. a) The larger the value of k is, the smaller the expected umber of days util we get all the coupos we eed. I fact if = k
More informationA statistical method to determine sample size to estimate characteristic value of soil parameters
A statistical method to determie sample size to estimate characteristic value of soil parameters Y. Hojo, B. Setiawa 2 ad M. Suzuki 3 Abstract Sample size is a importat factor to be cosidered i determiig
More informationMath 220A Fall 2007 Homework #2. Will Garner A
Math 0A Fall 007 Homewor # Will Garer Pg 3 #: Show that {cis : a o-egative iteger} is dese i T = {z œ : z = }. For which values of q is {cis(q): a o-egative iteger} dese i T? To show that {cis : a o-egative
More informationb i u x i U a i j u x i u x j
M ath 5 2 7 Fall 2 0 0 9 L ecture 1 9 N ov. 1 6, 2 0 0 9 ) S ecod- Order Elliptic Equatios: Weak S olutios 1. Defiitios. I this ad the followig two lectures we will study the boudary value problem Here
More informationOutput Analysis and Run-Length Control
IEOR E4703: Mote Carlo Simulatio Columbia Uiversity c 2017 by Marti Haugh Output Aalysis ad Ru-Legth Cotrol I these otes we describe how the Cetral Limit Theorem ca be used to costruct approximate (1 α%
More informationEstimation of the essential supremum of a regression function
Estimatio of the essetial supremum of a regressio fuctio Michael ohler, Adam rzyżak 2, ad Harro Walk 3 Fachbereich Mathematik, Techische Uiversität Darmstadt, Schlossgartestr. 7, 64289 Darmstadt, Germay,
More informationRates of Convergence by Moduli of Continuity
Rates of Covergece by Moduli of Cotiuity Joh Duchi: Notes for Statistics 300b March, 017 1 Itroductio I this ote, we give a presetatio showig the importace, ad relatioship betwee, the modulis of cotiuity
More informationThe Boolean Ring of Intervals
MATH 532 Lebesgue Measure Dr. Neal, WKU We ow shall apply the results obtaied about outer measure to the legth measure o the real lie. Throughout, our space X will be the set of real umbers R. Whe ecessary,
More information32 estimating the cumulative distribution function
32 estimatig the cumulative distributio fuctio 4.6 types of cofidece itervals/bads Let F be a class of distributio fuctios F ad let θ be some quatity of iterest, such as the mea of F or the whole fuctio
More informationLaw of the sum of Bernoulli random variables
Law of the sum of Beroulli radom variables Nicolas Chevallier Uiversité de Haute Alsace, 4, rue des frères Lumière 68093 Mulhouse icolas.chevallier@uha.fr December 006 Abstract Let be the set of all possible
More informationFIXED POINTS OF n-valued MULTIMAPS OF THE CIRCLE
FIXED POINTS OF -VALUED MULTIMAPS OF THE CIRCLE Robert F. Brow Departmet of Mathematics Uiversity of Califoria Los Ageles, CA 90095-1555 e-mail: rfb@math.ucla.edu November 15, 2005 Abstract A multifuctio
More informationMaximum Likelihood Estimation and Complexity Regularization
ECE90 Sprig 004 Statistical Regularizatio ad Learig Theory Lecture: 4 Maximum Likelihood Estimatio ad Complexity Regularizatio Lecturer: Rob Nowak Scribe: Pam Limpiti Review : Maximum Likelihood Estimatio
More informationLinear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d
Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y
More informationw (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ.
2 5. Weighted umber of late jobs 5.1. Release dates ad due dates: maximimizig the weight of o-time jobs Oce we add release dates, miimizig the umber of late jobs becomes a sigificatly harder problem. For
More informationLecture 9: Boosting. Akshay Krishnamurthy October 3, 2017
Lecture 9: Boostig Akshay Krishamurthy akshay@csumassedu October 3, 07 Recap Last week we discussed some algorithmic aspects of machie learig We saw oe very powerful family of learig algorithms, amely
More information6.883: Online Methods in Machine Learning Alexander Rakhlin
6.883: Olie Methods i Machie Learig Alexader Rakhli LECTURES 5 AND 6. THE EXPERTS SETTING. EXPONENTIAL WEIGHTS All the algorithms preseted so far halluciate the future values as radom draws ad the perform
More informationNotes 19 : Martingale CLT
Notes 9 : Martigale CLT Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces: [Bil95, Chapter 35], [Roc, Chapter 3]. Sice we have ot ecoutered weak covergece i some time, we first recall
More informationREAL ANALYSIS II: PROBLEM SET 1 - SOLUTIONS
REAL ANALYSIS II: PROBLEM SET 1 - SOLUTIONS 18th Feb, 016 Defiitio (Lipschitz fuctio). A fuctio f : R R is said to be Lipschitz if there exists a positive real umber c such that for ay x, y i the domai
More informationLecture 15: Learning Theory: Concentration Inequalities
STAT 425: Itroductio to Noparametric Statistics Witer 208 Lecture 5: Learig Theory: Cocetratio Iequalities Istructor: Ye-Chi Che 5. Itroductio Recall that i the lecture o classificatio, we have see that
More informationNotes on Snell Envelops and Examples
Notes o Sell Evelops ad Examples Example (Secretary Problem): Coside a pool of N cadidates whose qualificatios are represeted by ukow umbers {a > a 2 > > a N } from best to last. They are iterviewed sequetially
More information