Local Rademacher Complexities


Peter L. Bartlett, Department of Statistics and Division of Computer Science, University of California at Berkeley, 367 Evans Hall, Berkeley, CA

Olivier Bousquet, Empirical Inference Department, Max Planck Institute for Biological Cybernetics, Spemannstr. 38, Tübingen, Germany

Shahar Mendelson, Institute of Advanced Studies, The Australian National University, Canberra, ACT 0200, Australia

May 14, 2004

Abstract. We propose new bounds on the error of learning algorithms in terms of a data-dependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a subset of functions with small empirical error. We present some applications to classification and prediction with convex function classes, and with kernel classes in particular.

Keywords: Error Bounds, Rademacher Averages, Data-Dependent Complexity, Concentration Inequalities.

1 Introduction

Estimating the performance of statistical procedures is useful for providing a better understanding of the factors that influence their behavior, as well as for suggesting ways to improve them. Although asymptotic analysis is a crucial first step towards understanding the behavior, finite sample error bounds are of more value as they allow the design of model selection (or parameter tuning) procedures. These error bounds typically have the following form: with high probability, the error of the estimator (typically a function in a certain class) is bounded by an empirical estimate of error plus a penalty term depending on the complexity of the class of functions that can be chosen by the algorithm. The differences between the true and empirical errors of functions in that class can then be viewed as an empirical process. Many tools have been developed for understanding the behavior of such objects, and especially for evaluating their suprema, which can be thought of as a measure of how hard it is to estimate functions in the class at hand.
The goal is thus to obtain the sharpest possible estimates on the complexity of function classes. A problem arises since the notion of complexity might depend on the (unknown) underlying probability measure according to which the data is produced. Distribution-free notions of the complexity, such as the Vapnik-Chervonenkis dimension [35] or the metric entropy [28], typically give conservative estimates. Distribution-dependent estimates, based for example on entropy numbers in the L_2(P) distance, where P is the underlying distribution, are not useful when P is unknown. Thus, it is desirable to obtain data-dependent estimates which can readily be computed from the sample. One of the most interesting data-dependent complexity estimates is the so-called Rademacher average associated with the class. Although known for a long time to be related to the expected supremum of the empirical process (thanks to symmetrization inequalities), it was first proposed as an effective complexity measure by Koltchinskii [15], Bartlett, Boucheron and Lugosi [1] and Mendelson [25], and then further studied in [3]. Unfortunately, one of the shortcomings of the Rademacher averages is that they provide global estimates of the complexity of the function class, that is, they do not reflect the fact that the algorithm will likely pick functions that have a small error, and in particular, only a small subset of the function class will be used. As a result, the best error rate that can be obtained via the global Rademacher averages is at least of the order of 1/√n (where n is the sample size), which is suboptimal in some situations. Indeed, the type of algorithms we consider here are known in the statistical literature as M-estimators. They minimize an empirical loss criterion in a fixed class of functions. They have been extensively studied and their rate of convergence is known to be related to the modulus of continuity of the empirical process associated with the class of functions (rather than to the expected supremum of that empirical process). This modulus of continuity is well understood from the empirical processes theory viewpoint (see e.g. [34] and [33]).
Also, from the point of view of M-estimators, the quantity which determines the rate of convergence is actually a fixed point of this modulus of continuity. Results of this type have been obtained by van de Geer [31, 32] (among others), who also provides non-asymptotic exponential inequalities. Unfortunately, these are in terms of entropy (or random entropy) and hence are not useful when the probability distribution is unknown. The key property that allows one to prove fast rates of convergence is the fact that around the best function in the class, the variance of the increments of the empirical process (or the L_2(P) distance to the best function) is upper bounded by a linear function of the expectation of these increments. In the context of regression with squared loss, this happens as soon as the functions are bounded and the class of functions is convex. In the context of classification, Mammen and Tsybakov have shown [20] that this also happens under conditions on the conditional distribution (especially about its behavior around 1/2). They actually do not require the relationship (between variance and expectation of the increments) to be linear but allow for more general, power type inequalities. Their results, like those of van de Geer, are asymptotic. In order to exploit this key property and have finite sample bounds, rather than considering the Rademacher averages of the entire class as the complexity measure, it is possible to consider the Rademacher averages of a small subset of the class, usually the intersection of the class with a ball centered at a function of interest. These local Rademacher averages can serve as a complexity measure; clearly, they are always smaller than the corresponding global averages. Several authors have considered the use of local estimates of the complexity of the function class, in order to obtain better bounds. Before presenting their results, we introduce some notation which is used throughout the paper. Let (X, P) be a probability space.
Denote by F a class of measurable functions from X to R, and set X_1, ..., X_n to be independent random variables distributed according to P. Let σ_1, ..., σ_n be independent Rademacher random variables, that is, independent random variables for which Pr(σ_i = 1) = Pr(σ_i = −1) = 1/2. For a function f : X → R, define the following:

P_n f = (1/n) Σ_{i=1}^n f(X_i),   P f = E f(X),   and   R_n f = (1/n) Σ_{i=1}^n σ_i f(X_i).

For a class F, set R_n F = sup_{f ∈ F} R_n f. Define E_σ to be the expectation with respect to the random variables σ_1, ..., σ_n, conditioned on all of the other random variables. The Rademacher average of F is E R_n F, and the empirical (or conditional) Rademacher averages of F are

E_σ R_n F = (1/n) E( sup_{f ∈ F} Σ_{i=1}^n σ_i f(X_i) | X_1, ..., X_n ).

Some classical properties of Rademacher averages (and some simple lemmas which we use often) are listed in Appendix A. The simplest way to obtain the property allowing for fast rates of convergence is to consider nonnegative uniformly bounded functions (or increments with respect to a fixed null function). In that case, one trivially has, for all f ∈ F,

Var[f] ≤ c P f.

This is exploited by Koltchinskii and Panchenko [16], who consider the case of prediction with absolute loss when functions in F have values in [0, 1] and there is a perfect function f* in the class, i.e. P f* = 0. They introduce an iterative method involving local empirical Rademacher averages. They first construct a function

φ_n(r) = c_1 E_σ R_n { f : P_n f ≤ 2r } + c_2 √(rx/n) + c_3/n,

which can be computed from the data. For the sequence (ˆr_k) defined by ˆr_0 = 1, ˆr_{k+1} = φ_n(ˆr_k), they show that with probability at least 1 − 2N e^{−x},

P ˆf ≤ ˆr_N + 2x/n,

where ˆf is a minimizer of the empirical error, that is, a function in F satisfying P_n ˆf = inf_{f ∈ F} P_n f. Hence, this nonincreasing sequence of local Rademacher averages can be used as upper bounds on the error of the empirical minimizer ˆf. Furthermore, if ψ is a concave function such that ψ(r) ≥ E_σ R_n { f ∈ F : P_n f ≤ r }, and if the number of iterations N is at least 1 + ⌈log_2 log_2(n/x)⌉, then with probability at least 1 − N e^{−x},

ˆr_N ≤ c ( r* + x/n ),

where r* is a solution of the fixed-point equation ψ(r) = r.
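The empirical quantities just defined are straightforward to approximate when F is finite, by Monte Carlo sampling over the Rademacher signs. The following sketch is purely illustrative (the finite class, the sample, and all sizes are our own assumptions, not from the paper): it estimates the empirical Rademacher average E_σ R_n F and a localized version restricted to { f ∈ F : P_n f ≤ r }.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite class: each row of `values` holds (f(X_1), ..., f(X_n))
# for one function f in F, precomputed on a fixed sample X_1, ..., X_n.
n, n_funcs = 200, 50
values = rng.uniform(0.0, 1.0, size=(n_funcs, n))

def empirical_rademacher(values, n_sigma=2000, rng=rng):
    """Monte Carlo estimate of E_sigma R_n F = E_sigma sup_f (1/n) sum_i sigma_i f(X_i)."""
    n = values.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_sigma, n))
    # correlations[s, f] = (1/n) sum_i sigma[s, i] * f(X_i)
    correlations = sigma @ values.T / n
    return correlations.max(axis=1).mean()

def local_empirical_rademacher(values, r, n_sigma=2000, rng=rng):
    """Same quantity restricted to the empirical ball {f in F : P_n f <= r}."""
    mask = values.mean(axis=1) <= r
    if not mask.any():
        return 0.0
    return empirical_rademacher(values[mask], n_sigma, rng)

global_avg = empirical_rademacher(values)
local_avg = local_empirical_rademacher(values, r=0.5)
```

Since the localized supremum runs over a subset of F, the local estimate never exceeds the global one (up to Monte Carlo noise), which is the basic reason local averages give sharper complexity terms.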
Combining the above results, one has a procedure to obtain data-dependent error bounds that are of the order of the fixed point of the modulus of continuity at 0 of the empirical Rademacher averages. One limitation of this result is that it assumes that there is a function f* in the class with P f* = 0. In contrast, we are interested in prediction problems where P f is the error of an estimator, and in the presence of noise there may not be any perfect estimator (even the best in the class can have non-zero error).

More recently, Bousquet, Koltchinskii and Panchenko [9] obtained a more general result avoiding the iterative procedure. Their result is that for functions with values in [0, 1], with probability at least 1 − e^{−x}, for all f ∈ F,

P f ≤ c ( P_n f + ˆr + (x + log log n)/n ),   (1.1)

where ˆr is the fixed point of a concave function ψ satisfying ψ(0) = 0 and ψ(r) ≥ E_σ R_n { f ∈ F : P_n f ≤ r }. The main difference between this and the results of [16] is that there is no requirement that the class contain a perfect function. However, the local Rademacher averages are centered around the zero function instead of the one that minimizes P f. As a consequence, the fixed point ˆr cannot be expected to converge to zero when inf_{f ∈ F} P f > 0. In order to remove this limitation, Lugosi and Wegkamp [19] use localized Rademacher averages of a small ball around the minimizer ˆf of P_n. However, their result is restricted to nonnegative functions, and in particular functions with values in {0, 1}. Moreover, their bounds also involve some global information, in the form of the shatter coefficients S_F(X_1^n) of the function class (that is, the cardinality of the coordinate projections of the class F on the data X_1^n). They show that there are constants c_1, c_2 such that, with probability at least 1 − 8/n, the empirical minimizer ˆf satisfies

P ˆf ≤ inf_{f ∈ F} P f + 2 ψ_n(ˆr_n),

where

ψ_n(r) = c_1 ( E_σ R_n { f ∈ F : P_n f ≤ 16 P_n ˆf + 15r } + √( (P_n ˆf + r) log n / n ) + (log n)/n )

and ˆr_n = c_2 (log S_F(X_1^n) + log n)/n. The limitation of this result is that ˆr_n has to be chosen according to the (empirically measured) complexity of the whole class, which may not be as sharp as the Rademacher averages, and in general, is not a fixed point of ψ_n. Moreover, the balls over which the Rademacher averages are computed in ψ_n contain a factor of 16 in front of P_n ˆf. As we explain later, this induces a lower bound on ψ_n when there is no function with P f = 0 in the class. It seems that the only way to capture the right behavior in the general, noisy case is to analyze the increments of the empirical process, in other words, to directly consider the functions f − f*.
This approach was first proposed by Massart [22]; see also [26]. Massart introduces the following assumption:

Var[l_f(X) − l_{f*}(X)] ≤ d²(f, f*) ≤ B (P l_f − P l_{f*}),

where l_f is the loss associated with the function f (in other words, l_f(X, Y) = l(f(X), Y), which measures the discrepancy in the prediction made by f),¹ d is a pseudometric, and f* minimizes the expected loss. This is a more refined version of the assumption we mentioned earlier on the relationship between the variance and expectation of the increments of the empirical process. It is only satisfied for some loss functions l and function classes F. Under this assumption, Massart considers a nondecreasing function ψ satisfying

ψ(r) ≥ E sup_{f ∈ F, d²(f, f*) ≤ 2r} ( P_n l_f − P l_f − P_n l_{f*} + P l_{f*} ) + c x/n,

such that ψ(r)/√r is nonincreasing (we refer to this property as the sub-root property later in the paper). Then, with probability at least 1 − e^{−x}, for all f ∈ F,

P l_f − P l_{f*} ≤ c ( r* + x/n ),   (1.2)

where r* is the fixed point of ψ and c depends only on B and on the uniform bound on the range of functions in F. It can be proved that in many situations of interest, this bound suffices to prove minimax rates of convergence for penalized M-estimators. (Massart considers examples where the complexity term can be bounded using a priori global information about the function class.) However, the main limitation of this result is that it does not involve quantities that can be computed from the data. Finally, as we mentioned earlier, Mendelson [26] gives an analysis similar to that of Massart, in a slightly less general case (with no noise in the target values, i.e. the conditional distribution of Y given X is concentrated at one point). Mendelson introduces the notion of the star-hull of a class of functions (see the next section for a definition) and considers Rademacher averages of this star-hull as a localized measure of complexity. His results also involve a priori knowledge of the class, such as the rate of growth of covering numbers. We can now spell out our goal in more detail: in this paper we combine the increment-based approach of Massart and Mendelson (dealing with differences of functions, or more generally with bounded real-valued functions) with the empirical local Rademacher approach of Koltchinskii and Panchenko and of Lugosi and Wegkamp, in order to obtain data-dependent bounds which depend on a fixed point of the modulus of continuity of Rademacher averages computed around the empirically best function. Our first main result (Theorem 3.3) is a distribution-dependent result involving the fixed point r* of a local Rademacher average of the star-hull of the class F. This shows that functions with the sub-root property can readily be obtained from Rademacher averages, while in previous work the appropriate functions were obtained only via global information about the class.

¹ The previous results could also be stated in terms of loss functions, but we omitted this in order to simplify exposition. However, the extra notation is necessary to properly state Massart's result.
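The relationship between variance and expectation of the increments can be verified in closed form in a toy case. The following sketch is our own illustration, not an example from the paper: predict a constant c ∈ [0, 1] for a {0, 1}-valued label Y with P(Y = 1) = p, under the squared loss. The class of constant predictions is convex, the best constant is c* = p, and the increment l_c − l_{c*} satisfies Var[l_c − l_{c*}] = 4p(1 − p) · P(l_c − l_{c*}), so the condition holds with B = 4p(1 − p).

```python
import numpy as np

# Hypothetical setup: constant prediction c in [0,1] for a {0,1}-valued label Y
# with P(Y = 1) = p, under the squared loss l_c(Y) = (c - Y)^2.  The best
# predictor in this convex class is c* = p.
p = 0.3
cs = np.linspace(0.0, 1.0, 101)

# Exact moments of the increment l_c - l_{c*} = (c - p)(c + p - 2Y):
excess = (cs - p) ** 2                      # P(l_c - l_{c*})
variance = (cs - p) ** 2 * 4 * p * (1 - p)  # Var[l_c - l_{c*}]

B = 4 * p * (1 - p)  # constant in the condition Var <= B * P(l_c - l_{c*})
```

Here the variance of the increment is exactly proportional to its expectation, the ideal case of the linear relationship discussed above; for more general losses one only gets an inequality, or the power-type versions of Mammen and Tsybakov.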
The second main result (Theorems 4.1 and 4.2) is an empirical counterpart of the first one, where the complexity is the fixed point of an empirical local Rademacher average. We also show that this fixed point is within a constant factor of the non-empirical one. Equipped with this result, we can then prove (Theorem 5.4) a fully data-dependent analogue of Massart's result, where the Rademacher averages are localized around the minimizer of the empirical loss. We also show (Theorem 6.3) that in the context of classification, the local Rademacher averages of star-hulls can be approximated by solving a weighted empirical error minimization problem. Our final result (Corollary 6.7) concerns regression with kernel classes, that is, classes of functions that are generated by a positive definite kernel. These classes are widely used in interpolation and estimation problems as they yield computationally efficient algorithms. Our result gives a data-dependent complexity term that can be computed directly from the eigenvalues of the Gram matrix (the matrix whose entries are values of the kernel on the data). The sharpness of our results is demonstrated by the fact that we recover, in the distribution-dependent case (treated in Section 4), results similar to those of Massart [22], which, in the situations where they apply, give the minimax optimal rates or the best known results. Moreover, the data-dependent bounds that we obtain as counterparts of these results have the same rate of convergence (see Theorem 4.2).
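For kernel classes, the complexity term is computable from the spectrum of the Gram matrix. The sketch below is our illustration only (the Gaussian kernel, the synthetic data, and the exact form of the localized spectral quantity are assumptions, not the statement of Corollary 6.7): it computes the eigenvalues of the normalized Gram matrix and a localized quantity of the form √((2/n) Σ_j min(r, λ_j)), which is sub-root in r.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data and a Gaussian (RBF) kernel; K[i, j] = k(X_i, X_j).
n = 100
X = rng.normal(size=(n, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)

# Eigenvalues of the normalized Gram matrix (nonnegative, since K is PSD).
eigvals = np.clip(np.linalg.eigvalsh(K / n), 0.0, None)

def kernel_complexity(r):
    """A localized, data-dependent spectral complexity term of the form
    sqrt((2/n) * sum_j min(r, lambda_j)): small eigenvalues contribute fully,
    large ones are truncated at the localization radius r."""
    return np.sqrt(2.0 / n * np.minimum(r, eigvals).sum())
```

Note that since k(x, x) = 1 here, the eigenvalues of K/n sum to 1, and min(4r, λ) ≤ 4 min(r, λ) gives the sub-root property kernel_complexity(4r) ≤ 2 kernel_complexity(r) directly.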

The paper is organized as follows. In Section 2, we present some preliminary results obtained from concentration inequalities, which we use throughout. Section 3 establishes error bounds using local Rademacher averages and explains how to compute their fixed points from global information (e.g. estimates of the metric entropy or of the combinatorial dimensions of the indexing class), in which case the optimal estimates can be recovered. In Section 4, we give a data-dependent error bound using empirical and local Rademacher averages, and show the connection between the fixed points of the empirical and non-empirical Rademacher averages. In Section 5, we apply our results to loss classes. We give estimates that generalize the results of Koltchinskii and Panchenko by eliminating the requirement that some function in the class have zero loss, and are more general than those of Lugosi and Wegkamp, since there is no need, in our case, to estimate global shatter coefficients of the class. We also give a data-dependent extension of Massart's result where the local averages are computed around the minimizer of the empirical loss. Finally, Section 6 shows that the problem of estimating these local Rademacher averages in classification reduces to weighted empirical risk minimization. It also shows that the local averages for kernel classes can be sharply bounded in terms of the eigenvalues of the Gram matrix.

2 Preliminary Results

Recall that the star-hull of F around f_0 is defined by

star(F, f_0) = { f_0 + α(f − f_0) : f ∈ F, α ∈ [0, 1] }.

Throughout this paper, we will manipulate suprema of empirical processes, that is, quantities of the form sup_{f ∈ F} (P f − P_n f). We will always assume they are measurable without explicitly mentioning it. In other words, we assume that the class F and the distribution P satisfy appropriate (mild) conditions for measurability of this supremum (we refer to [11, 28] for a detailed account of such issues). The following theorem is the main result of this section and is at the core of all the proofs presented later.
It shows that if the functions in a class have small variance, the maximal deviation between empirical means and true means is controlled by the Rademacher averages of F. In particular, the bound improves as the largest variance of a class member decreases.

Theorem 2.1 Let F be a class of functions that map X into [a, b]. Assume that there is some r > 0 such that for every f ∈ F, Var[f(X_i)] ≤ r. Then, for every x > 0, with probability at least 1 − e^{−x},

sup_{f ∈ F} (P f − P_n f) ≤ inf_{α > 0} ( 2(1 + α) E R_n F + √(2rx/n) + (b − a)(1/3 + 1/α) x/n ),

and with probability at least 1 − 2e^{−x},

sup_{f ∈ F} (P f − P_n f) ≤ inf_{α ∈ (0,1)} ( 2 (1 + α)/(1 − α) E_σ R_n F + √(2rx/n) + (b − a)( 1/3 + 1/α + (1 + α)/(2α(1 − α)) ) x/n ).

Moreover, the same results hold for the quantity sup_{f ∈ F} (P_n f − P f).

This theorem, which is proved in Appendix B, is a more or less direct consequence of Talagrand's inequality for empirical processes [30]. However, the actual statement presented here is new in the sense that it displays the best known constants. Indeed, compared to the previous result of Koltchinskii and Panchenko [16], which was based on Massart's version of Talagrand's inequality [21], we have used the most refined concentration inequalities available: that of Bousquet [7] for the supremum of the empirical process and that of Boucheron, Lugosi and Massart [5] for the Rademacher averages. This last inequality is a powerful tool to obtain data-dependent bounds, since it allows one to replace the Rademacher average (which measures the complexity of the class of functions) by its empirical version, which can be efficiently computed in some cases. Details about these inequalities are given in Appendix A. When applied to the full function class F, the above theorem is not useful. Indeed, with only a trivial bound on the maximal variance, better results can be obtained via simpler concentration inequalities, such as the bounded difference inequality [23], which would allow √(rx/n) to be replaced by √(x/n). However, by applying Theorem 2.1 to subsets of F or to modified classes obtained from F, much better results can be obtained. Hence, the presence of an upper bound on the variance in the square root term is the key ingredient of this result. A last preliminary result that we will require is the following consequence of Theorem 2.1, which shows that if the local Rademacher averages are small, then balls in L_2(P) are probably contained in the corresponding empirical balls (that is, in L_2(P_n)) with a slightly larger radius.

Corollary 2.2 Let F be a class of functions that map X into [−b, b] with b > 0. For every x > 0 and every r that satisfies

r ≥ 10 b E R_n { f : f ∈ F, P f² ≤ r } + 11 b² x / n,

with probability at least 1 − e^{−x},

{ f ∈ F : P f² ≤ r } ⊆ { f ∈ F : P_n f² ≤ 2r }.
Proof: Since the range of any function in the set F_r = { f² : f ∈ F, P f² ≤ r } is contained in [0, b²], it follows that Var[f²(X_i)] ≤ P f⁴ ≤ b² P f² ≤ b² r. Thus, by the first part of Theorem 2.1 (with α = 1/4), with probability at least 1 − e^{−x}, every f ∈ F_r satisfies

P_n f² ≤ P f² + (5/2) E R_n { f² : f ∈ F, P f² ≤ r } + √(2 b² r x / n) + 13 b² x / (3n)
≤ r + (5/2) E R_n { f² : f ∈ F, P f² ≤ r } + r/2 + b² x/n + 13 b² x / (3n)
≤ (3/2) r + 5 b E R_n { f : f ∈ F, P f² ≤ r } + (16/3) b² x / n
≤ 2r,

where the second inequality follows from Lemma A.3, and we have used, in the second last inequality, Theorem A.6 (applied to φ(x) = x², with Lipschitz constant 2b on [−b, b]).

3 Error Bounds with Local Complexity

In this section, we show that the Rademacher averages associated with a small subset of the class may be considered as a complexity term in an error bound. Since these local Rademacher averages are always smaller than the corresponding global averages, they lead to sharper bounds.

We present a general error bound involving local complexities that is applicable to classes of bounded functions for which the variance is bounded by a fixed linear function of the expectation. In this case, the local Rademacher averages are defined as E R_n { f ∈ F : T(f) ≤ r }, where T(f) is an upper bound on the variance (typically chosen as T(f) = P f²). There is a trade-off between the size of the subset we consider in these local averages and its complexity; we shall see that the optimal choice is given by a fixed point of an upper bound on the local Rademacher averages. The functions we use as upper bounds are sub-root functions; among other useful properties, sub-root functions have a unique fixed point.

Definition 3.1 A function ψ : [0, ∞) → [0, ∞) is sub-root if it is nonnegative, nondecreasing, and if r ↦ ψ(r)/√r is nonincreasing for r > 0.

We only consider nontrivial sub-root functions, that is, sub-root functions that are not the constant function ψ ≡ 0.

Lemma 3.2 If ψ : [0, ∞) → [0, ∞) is a nontrivial sub-root function, then it is continuous on [0, ∞) and the equation ψ(r) = r has a unique positive solution. Moreover, if we denote the solution by r*, then for all r > 0, r ≥ ψ(r) if and only if r* ≤ r.

The proof of this lemma is in Appendix B. In view of the lemma, we will simply refer to the quantity r* as the unique positive solution of ψ(r) = r, or as the fixed point of ψ.

3.1 Error Bounds

We can now state and discuss the main result of this section. It is composed of two parts: in the first part, one requires a sub-root upper bound on the local Rademacher averages, and in the second part, it is shown that better results can be obtained when the class over which the averages are computed is slightly enlarged.

Theorem 3.3 Let F be a class of functions with ranges in [a, b] and assume that there are some functional T : F → R⁺ and some constant B such that for every f ∈ F, Var[f] ≤ T(f) ≤ B P f. Let ψ be a sub-root function and let r* be the fixed point of ψ.

1) Assume that ψ satisfies, for any r ≥ r*, ψ(r) ≥ B E R_n { f ∈ F : T(f) ≤ r }. Then, with c_1 = 704 and c_2 = 26, for any K > 1 and every x > 0, with probability at least 1 − e^{−x},

for all f ∈ F,   P f ≤ (K/(K − 1)) P_n f + (c_1 K / B) r* + x (11 (b − a) + c_2 B K) / n.

Also, with probability at least 1 − e^{−x},

for all f ∈ F,   P_n f ≤ ((K + 1)/K) P f + (c_1 K / B) r* + x (11 (b − a) + c_2 B K) / n.

2) If, in addition, for f ∈ F and α ∈ [0, 1], T(α f) ≤ α² T(f), and if ψ satisfies, for any r ≥ r*, ψ(r) ≥ B E R_n { f ∈ star(F, 0) : T(f) ≤ r }, then the same results hold true with c_1 = 6 and c_2 = 5.
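Lemma 3.2 also suggests a simple numerical recipe for the quantity r* appearing in the bound: since a nontrivial sub-root ψ satisfies ψ(r) > r below the fixed point and ψ(r) < r above it, iterating r ← ψ(r) converges to r*. A minimal sketch with a hypothetical sub-root function (the specific ψ is ours, chosen only because its fixed point has a closed form):

```python
import math

def fixed_point(psi, r0=1.0, tol=1e-12, max_iter=10000):
    """Iterate r <- psi(r); for a nontrivial sub-root psi this converges to the
    unique positive solution of psi(r) = r (Lemma 3.2)."""
    r = r0
    for _ in range(max_iter):
        r_next = psi(r)
        if abs(r_next - r) <= tol:
            return r_next
        r = r_next
    return r

# Hypothetical sub-root function psi(r) = a*sqrt(r) + b with a, b > 0: it is
# nonnegative, nondecreasing, and psi(r)/sqrt(r) = a + b/sqrt(r) is nonincreasing.
a, b = 0.5, 0.01
r_star = fixed_point(lambda r: a * math.sqrt(r) + b)
```

For this ψ the fixed point solves r = a√r + b, i.e. √r* = (a + √(a² + 4b))/2, which the iteration recovers; upper bounds on local Rademacher averages of the form a√r + b/n arise naturally, e.g. from Theorem 2.1.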

The proof of this theorem is given in Section 3.2. We can compare the results to our starting point (Theorem 2.1). The improvement comes from the fact that the complexity term, which was essentially sup_r ψ(r) in Theorem 2.1 (if we had applied it to the class F directly), is now reduced to r*, the fixed point of ψ. So the complexity term is always smaller (later, we show how to estimate r*). On the other hand, there is some loss since the constant in front of P_n f is strictly larger than one. Section 5.2 will show that this is not an issue in the applications we have in mind. In Sections 5.1 and 5.2, we investigate conditions that ensure the assumptions of this theorem are satisfied, and we provide applications of this result to prediction problems. The condition that the variance is upper bounded by the expectation turns out to be crucial to obtain these results. The idea behind Theorem 3.3 originates in the work of Massart [22], who proves a slightly different version of the first part. The difference is that we use local Rademacher averages instead of the expectation of the supremum of the empirical process on a ball. Moreover, we give smaller constants. As far as we know, the second part of Theorem 3.3 is new.

3.1.1 Choosing the Function ψ

Notice that the function ψ cannot be chosen arbitrarily and has to satisfy the sub-root property. One possible approach is to use classical upper bounds on the Rademacher averages, such as Dudley's entropy integral. This can give a sub-root upper bound and was used, for example, in [16] and in [22]. However, the second part of Theorem 3.3 indicates a possible choice for ψ, namely, one can take ψ as the local Rademacher averages of the star-hull of F around 0. The reason for this comes from the following lemma, which shows that if the class is star-shaped and T(f) behaves as a quadratic function, then the Rademacher averages are sub-root.
Lemma 3.4 If the class F is star-shaped around ˆf (which may depend on the data), and T : F → R⁺ is a (possibly random) function that satisfies T(α f) ≤ α² T(f) for any f ∈ F and any α ∈ [0, 1], then the (random) function ψ defined for r ≥ 0 by

ψ(r) = E_σ R_n { f ∈ F : T(f − ˆf) ≤ r }

is sub-root, and r ↦ E ψ(r) is also sub-root.

This lemma is proved in Appendix B. Notice that making a class star-shaped only increases it, so that

E R_n { f ∈ star(F, f_0) : T(f) ≤ r } ≥ E R_n { f ∈ F : T(f) ≤ r }.

However, this increase in size is moderate, as can be seen, for example, if one compares covering numbers of a class and its star-hull (see, for example, [26], Lemma 4.5).

3.1.2 Some Consequences

As a consequence of Theorem 3.3, we obtain an error bound when F consists of uniformly bounded nonnegative functions. Notice that in this case, the variance is trivially bounded by a constant times the expectation and one can directly use T(f) = P f.

Corollary 3.5 Let F be a class of functions with ranges in [0, 1]. Let ψ be a sub-root function such that for all r ≥ 0, E R_n { f ∈ F : P f ≤ r } ≤ ψ(r), and let r* be the fixed point of ψ. Then, for any K > 1 and every x > 0, with probability at least 1 − e^{−x}, every f ∈ F satisfies

P f ≤ (K/(K − 1)) P_n f + 704 K r* + x (11 + 26 K)/n.

Also, with probability at least 1 − e^{−x}, every f ∈ F satisfies

P_n f ≤ ((K + 1)/K) P f + 704 K r* + x (11 + 26 K)/n.

Proof: When f ∈ [0, 1], we have Var[f] ≤ P f, so that the result follows from applying Theorem 3.3 with T(f) = P f.

We also note that the same idea as in the proof of Theorem 3.3 gives a converse of Corollary 2.2, namely, that with high probability, the intersection of F with an empirical ball of a fixed radius is contained in the intersection of F with an L_2(P) ball with a slightly larger radius.

Lemma 3.6 Let F be a class of functions that map X into [−1, 1]. Fix x > 0. If

r ≥ 20 E R_n { f : f ∈ star(F, 0), P f² ≤ r } + 26 x/n,

then with probability at least 1 − e^{−x},

{ f ∈ star(F, 0) : P_n f² ≤ r } ⊆ { f ∈ star(F, 0) : P f² ≤ 2r }.

This result, proved in Section 3.2, will be useful later in the paper.

3.1.3 Estimating r* from Global Information

The error bounds involve fixed points of functions that define upper bounds on the local Rademacher averages. In some cases, these fixed points can be estimated from global information on the function class. We present a complete analysis only in a simple case, where F is a class of binary-valued functions with a finite VC dimension.

Corollary 3.7 Let F be a class of {0, 1}-valued functions with VC dimension d < ∞. Then for all K > 1 and every x > 0, with probability at least 1 − e^{−x}, every f ∈ F satisfies

P f ≤ (K/(K − 1)) P_n f + c K ( (d log(n/d))/n + x/n ).

The proof is in Appendix B. The above result is similar to results obtained by Vapnik and Chervonenkis [35] and by Lugosi and Wegkamp (Theorem 3.1 of [19]). However, they used inequalities for weighted empirical processes indexed by nonnegative functions. Our results have more flexibility since they can accommodate general functions, although this is not needed in this simple corollary. The proof uses a similar line of reasoning to proofs in [26, 27].
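To see the shape of the Corollary 3.7 bound concretely, the following sketch evaluates its right-hand side as a function of the sample size. The constant c is unspecified in the statement, so c = 1 below is a placeholder of ours, not the paper's constant:

```python
import math

def vc_local_bound(emp_risk, n, d, x, K=2.0, c=1.0):
    """Right-hand side of the Corollary 3.7 shape:
    K/(K-1) * P_n f + c*K*(d*log(n/d)/n + x/n).
    c = 1.0 is a placeholder, not the constant from the paper."""
    return K / (K - 1) * emp_risk + c * K * (d * math.log(n / d) / n + x / n)

bound_small_n = vc_local_bound(emp_risk=0.10, n=1000, d=10, x=5.0)
bound_large_n = vc_local_bound(emp_risk=0.10, n=100000, d=10, x=5.0)
```

The penalty decays at the rate d log(n/d)/n, faster than the 1/√n rate available from global Rademacher averages, which is the point of localizing.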
Clearly, it extends to any class of real-valued functions for which one has estimates for the entropy integral, such as classes with finite pseudo-dimension or a combinatorial dimension that grows more slowly than quadratically. See [26, 27] for more details. Notice also that the rate of d log(n/d)/n is the best known.

3.1.4 Proof Techniques

Before giving the proofs of the results mentioned above, let us sketch the techniques we use. The approach has its roots in classical empirical processes theory, where it was understood that the modulus of continuity of the empirical process is an important quantity (here, ψ plays this role). In order to obtain non-asymptotic results, two approaches have been developed: the first one consists of cutting the class F into smaller pieces, where one has control of the variance of the elements. This is the so-called peeling technique (see, for example, [31, 33, 34, 32] and references therein). The second approach consists of weighting the functions in F by dividing them by their variance. Many results have been obtained on such weighted empirical processes (see, for example, [28]). The results of Vapnik and Chervonenkis based on weighting [35] are restricted to classes of nonnegative functions. Also, most previous results, such as those of Pollard [28], van de Geer [32] or Haussler [13], give complexity terms that involve global measures of complexity of the class, such as covering numbers. None of these results use the recently introduced Rademacher averages as measures of complexity. It turns out that it is possible to combine the peeling and weighting ideas with concentration inequalities to obtain such results, as proposed by Massart in [22], and also used (for nonnegative functions) by Koltchinskii and Panchenko [16]. The idea is the following: first, apply Theorem 2.1 to the class of functions { f / w(f) : f ∈ F }, where w is some nonnegative weight of the order of the variance of f, so that the functions in this class have a small variance. Second, upper bound the Rademacher averages of this weighted class, by peeling off subclasses of F according to the variance of their elements, and bounding the Rademacher averages of these subclasses using ψ. Third, use the sub-root property of ψ, so that its fixed point gives a common upper bound on the complexity of all the subclasses (up to some scaling).
Finally, convert the upper bound for functions in the weighted class into a bound for functions in the initial class. The idea of peeling (that is, of partitioning the class F into slices where functions have variance within a certain range) is at the core of the proof of the first part of Theorem 3.3 (see, for example, Equation (3.1)). However, it does not appear explicitly in the proof of the second part. One explanation is that when one considers the star-hull of the class, it is enough to consider two subclasses: the functions with T(f) ≤ r and the ones with T(f) > r, and this is done by introducing the weighting factor T(f) ∨ r. This idea was exploited in the work of Mendelson [26] and, more recently, in [4]. Moreover, when one considers the set F_r = star(F, 0) ∩ { T(f) ≤ r }, any function f ∈ F with T(f) > r will have a scaled down representative in that set. So even though it seems that we look at the class star(F, 0) only locally, we still take into account all of the functions in F (with appropriate scaling).

3.2 Proofs

Before presenting the proof, let us first introduce some additional notation. Given a class F, λ > 1, and r > 0, let

w(f) = min { r λ^k : k ∈ N, r λ^k ≥ T(f) }

and set

G_r = { (r / w(f)) f : f ∈ F }.
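The weighting w(f) just defined is easy to compute for a given variance surrogate T(f). A small sketch (the numeric values of r, λ and T(f) are arbitrary illustrations of ours):

```python
import math

def weight(T_f, r, lam):
    """w(f) = min{ r*lam^k : k = 0, 1, 2, ..., r*lam^k >= T(f) } for T_f > 0,
    r > 0, lam > 1.  For T(f) <= r the weight is r, so g = r*f/w(f) = f;
    otherwise f is scaled down by lam^(-k), which puts g into star(F, 0)."""
    k = max(0, math.ceil(math.log(T_f / r, lam)))
    return r * lam ** k

r, lam = 0.1, 4.0
w_small = weight(0.05, r, lam)   # T(f) <= r: no scaling, w = r
w_large = weight(0.25, r, lam)   # T(f) > r: w = r*lam = 0.4, so g = f/4
```

By construction w(f) ≥ T(f) and w(f) < λ max(T(f), r), so the rescaled function g = r f / w(f) has variance at most r while f is shrunk by at most a factor λ relative to its own variance scale; this is exactly what lets the peeling argument bound all slices by ψ at comparable radii.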

Notice that w(f) ≥ r, so that G_r ⊆ { α f : f ∈ F, α ∈ [0, 1] } = star(F, 0). Define

V_r^+ = sup_{g ∈ G_r} (P g − P_n g)   and   V_r^− = sup_{g ∈ G_r} (P_n g − P g).

For the second part of the theorem, we need to introduce another class of functions,

G̃_r := { r f / (T(f) ∨ r) : f ∈ F },

and define

Ṽ_r^+ = sup_{g ∈ G̃_r} (P g − P_n g)   and   Ṽ_r^− = sup_{g ∈ G̃_r} (P_n g − P g).

Lemma 3.8 With the above notation, assume that there is a constant B > 0 such that for every f ∈ F, T(f) ≤ B P f. Fix K > 1, λ > 0 and r > 0. If V_r^+ ≤ r/(λBK), then

for all f ∈ F,   P f ≤ (K/(K − 1)) P_n f + r/(λBK).

Also, if V_r^− ≤ r/(λBK), then

for all f ∈ F,   P_n f ≤ ((K + 1)/K) P f + r/(λBK).

Similarly, if K > 1 and r > 0 are such that Ṽ_r^+ ≤ r/(BK), then

for all f ∈ F,   P f ≤ (K/(K − 1)) P_n f + r/(BK).

Also, if Ṽ_r^− ≤ r/(BK), then

for all f ∈ F,   P_n f ≤ ((K + 1)/K) P f + r/(BK).

Proof: Fix f ∈ F and define g = r f / w(f). Notice that for all g ∈ G_r, P g ≤ P_n g + V_r^+. When T(f) ≤ r, w(f) = r, so that g = f. Thus, the fact that P g ≤ P_n g + V_r^+ implies that P f ≤ P_n f + V_r^+ ≤ P_n f + r/(λBK). On the other hand, if T(f) > r, then w(f) = r λ^k with k > 0 and T(f) ∈ (r λ^{k−1}, r λ^k]. Moreover, g = f/λ^k, P g ≤ P_n g + V_r^+, and thus

(P f)/λ^k ≤ (P_n f)/λ^k + V_r^+.

Using the fact that T(f) > r λ^{k−1}, it follows that

P f ≤ P_n f + λ^k V_r^+ < P_n f + λ T(f) V_r^+ / r ≤ P_n f + P f / K.

Rearranging,

P f ≤ (K/(K − 1)) P_n f < (K/(K − 1)) P_n f + r/(λBK).

The proof of the second result is similar. For the third and fourth results, the reasoning is the same.

Proof of Theorem 3.3, first part: Let G_r be defined as above, where r is chosen such that r ≥ r*, and note that functions in G_r satisfy ‖g − P g‖_∞ ≤ b − a, since 0 ≤ r/w(f) ≤ 1. Also, we have Var[g] ≤ r. Indeed, if T(f) ≤ r, then g = f, and thus Var[g] = Var[f] ≤ r. Otherwise, when T(f) > r, g = f/λ^k (where k is such that T(f) ∈ (r λ^{k−1}, r λ^k]), so that Var[g] = Var[f]/λ^{2k} ≤ r. Applying Theorem 2.1, for all x > 0, with probability 1 − e^{−x},

V_r^+ ≤ 2(1 + α) E R_n G_r + √(2rx/n) + (b − a)(1/3 + 1/α) x/n.

Let F(x, y) := { f ∈ F : x ≤ T(f) ≤ y } and define k to be the smallest integer such that r λ^{k+1} ≥ B b. Then

E R_n G_r ≤ E R_n F(0, r) + E sup_{f ∈ F(r, Bb)} (r/w(f)) R_n f
≤ E R_n F(0, r) + Σ_{j=0}^{k} λ^{−j} E sup_{f ∈ F(r λ^j, r λ^{j+1})} R_n f
≤ ψ(r)/B + (1/B) Σ_{j=0}^{k} λ^{−j} ψ(r λ^{j+1}).   (3.1)

By our assumption it follows that for β ≥ 1, ψ(βr) ≤ √β ψ(r). Hence,

E R_n G_r ≤ (ψ(r)/B) ( 1 + √λ Σ_{j=0}^{k} λ^{−j/2} ),

and taking λ = 4, the right-hand side is upper bounded by 5 ψ(r)/B. Moreover, for r ≥ r*, ψ(r) ≤ √(r/r*) ψ(r*) = √(r r*), and thus

V_r^+ ≤ 10(1 + α) √(r r*)/B + √(2rx/n) + (b − a)(1/3 + 1/α) x/n.

Set A = 10(1 + α) √(r*)/B + √(2x/n) and C = (b − a)(1/3 + 1/α) x/n, and note that V_r^+ ≤ A √r + C. We now show that r can be chosen such that V_r^+ ≤ r/(λBK). Indeed, consider the largest solution r_0 of A √r + C = r/(λBK). It satisfies r_0 ≥ λ² A² B² K²/2 ≥ r* and r_0 ≤ (λBK)² A² + 2λBKC, so that applying Lemma 3.8, it follows that every f ∈ F satisfies

P f ≤ (K/(K − 1)) P_n f + λBK A² + 2C
= (K/(K − 1)) P_n f + λBK ( 100(1 + α)² r*/B² + 2x/n + 20(1 + α) √(2 x r*/n)/B ) + 2(b − a)(1/3 + 1/α) x/n.

Setting α = 1/10 and using Lemma A.3 (to show that √(2 x r*/n) ≤ B x/(5n) + 5 r*/(2B)) completes the proof of the first statement. The second statement is proved in the same way, by considering V_r^− instead of V_r^+.

Proof of Theorem 3.3, second part: The proof of this result uses the same argument as for the first part. However, we consider the class G̃_r defined above. One can easily check that G̃_r ⊆ {f ∈ star(F, 0) : T(f) ≤ r}, and thus E R_n G̃_r ≤ ψ(r)/B. Applying Theorem 2.1 to G̃_r, it follows that, for all x > 0, with probability 1 − e^{−x},

Ṽ_r^+ ≤ 2(1+α) ψ(r)/B + √(2rx/n) + (b−a)(1/3 + 1/α) x/n.

The reasoning is then the same as for the first part, and we use in the very last step that √(2xr*/n) ≤ Bx/n + r*/(2B), which gives the displayed constants.

Proof of Lemma 3.6: The map α ↦ α² is Lipschitz with constant 2 when α is restricted to [−1, 1]. Applying Theorem A.6, the assumption of the lemma implies

r ≥ 10 E R_n {f² : f ∈ star(F, 0), P f² ≤ r} + 26x/n.   (3.2)

Clearly, if f ∈ F, then f² maps to [0, 1] and Var[f²] ≤ P f². Thus, Theorem 2.1 can be applied to the class G_r = {r f²/(P f² ∨ r) : f ∈ F}, whose functions have range in [0, 1] and variance bounded by r. Therefore, with probability at least 1 − e^{−x}, every f ∈ F satisfies

(r/(P f² ∨ r)) (P f² − P_n f²) ≤ 2(1+α) E R_n G_r + √(2rx/n) + (1/3 + 1/α) x/n.

Select α = 1/4 and notice that √(2rx/n) ≤ r/4 + 2x/n to get

(r/(P f² ∨ r)) (P f² − P_n f²) ≤ (5/2) E R_n G_r + r/4 + 19x/(3n).

Hence, one either has P f² ≤ r, or, when P f² ≥ r, since it was assumed that P_n f² ≤ r,

P f² ≤ r + (P f²/r) ((5/2) E R_n G_r + r/4 + 19x/(3n)).

Now, if g ∈ G_r, there exists f_0 ∈ F such that g = r f_0²/(P f_0² ∨ r). If P f_0² ≤ r, then g = f_0². On the other hand, if P f_0² > r, then g = r f_0²/P f_0² = f_1², with f_1 = √(r/P f_0²) f_0 ∈ star(F, 0) and P f_1² ≤ r, which shows that

E R_n G_r ≤ E R_n {f² : f ∈ star(F, 0), P f² ≤ r}.

Thus, by Inequality (3.2), (5/2) E R_n G_r + r/4 + 19x/(3n) ≤ r/2, so that P f² ≤ r + P f²/2, that is, P f² ≤ 2r, which concludes the proof.

4 Data-Dependent Error Bounds

The results presented thus far use distribution-dependent measures of the complexity of the class at hand. Indeed, the sub-root function ψ of Theorem 3.3 is bounded in terms of the Rademacher averages of the star-hull of F, but these averages can only be computed if one knows the distribution P. Otherwise, we have seen that it is possible to compute an upper bound on the Rademacher averages using a priori global or distribution-free knowledge about the complexity of the class at hand (such as the VC dimension).
In this section, we present error bounds that can be computed directly from the data, without a priori information. Instead of computing ψ, we compute an estimate, ψ̂_n, of it. The function ψ̂_n is defined using the data and is an upper bound on ψ with high probability. To simplify the exposition, we restrict ourselves to the case where the functions have a range which is symmetric around zero, say [−1, 1]. Moreover, we can only treat the special case where T(f) = P f², but this is a minor restriction, as in most applications this is the function of interest (i.e., for which one can show T(f) ≤ B P f).

4.1 Results

We now present the main result of this section, which gives an analogue of the second part of Theorem 3.3, with a completely empirical bound (that is, the bound can be computed from the data only).

Theorem 4.1 Let F be a class of functions with ranges in [−1, 1] and assume that there is some constant B such that for every f ∈ F, P f² ≤ B P f. Let ψ̂_n be a sub-root function and let r̂* be the fixed point of ψ̂_n. Fix x > 0 and assume that ψ̂_n satisfies, for any r ≥ r̂*,

ψ̂_n(r) ≥ c_1 E_σ R_n {f ∈ star(F, 0) : P_n f² ≤ 2r} + c_2 x/n,

where c_1 = 2(10 ∨ B) and c_2 = c_1 + 11. Then, for any K > 1, with probability at least 1 − 3e^{−x},

∀ f ∈ F,   P f ≤ (K/(K−1)) P_n f + 6K r̂*/B + x(11 + 5BK)/n.

Also, with probability at least 1 − 3e^{−x},

∀ f ∈ F,   P_n f ≤ ((K+1)/K) P f + 6K r̂*/B + x(11 + 5BK)/n.

Although these are data-dependent bounds, they are not necessarily easy to compute. There are, however, favorable interesting situations where they can be computed efficiently, as Section 6 shows. It is natural to wonder how close the quantity r̂* appearing in the above theorem is to the quantity r* of Theorem 3.3. The next theorem shows that they are close with high probability.

Theorem 4.2 Let F be a class of functions with ranges in [−1, 1]. Fix x > 0 and consider the sub-root functions

ψ(r) = E R_n {f ∈ star(F, 0) : P f² ≤ r}

and

ψ̂_n(r) = c_1 E_σ R_n {f ∈ star(F, 0) : P_n f² ≤ 2r} + c_2 x/n,

with fixed points r* and r̂* respectively, and with c_1 = 2(10 ∨ B) and c_2 = c_1 + 11. Assume that r* ≥ c_3 x/n, where c_3 = 26 ∨ (c_2 + 2c_1)/3. Then, with probability at least 1 − 4e^{−x},

r* ≤ r̂* ≤ 9(1 + c_1)² r*.
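The quantities r̂* in Theorems 4.1 and 4.2 are defined only implicitly, as fixed points of sub-root functions. Since a sub-root function is nonnegative and nondecreasing, with ψ(r)/√r nonincreasing, the iteration r ← ψ(r) converges monotonically to the unique positive fixed point guaranteed by Lemma 3.2. The following minimal numeric sketch (not part of the original text; the function ψ used here is an arbitrary toy sub-root function, not an actual local Rademacher average) illustrates this in Python.

```python
import math

def fixed_point(psi, r0=1.0, n_iter=200):
    """Iterate r <- psi(r); for a sub-root psi this converges to its
    unique positive fixed point r* (uniqueness is Lemma 3.2)."""
    r = r0
    for _ in range(n_iter):
        r = psi(r)
    return r

# Toy sub-root function psi(r) = a*sqrt(r) + b (a, b chosen for illustration).
a, b = 1.0, 2.0
psi = lambda r: a * math.sqrt(r) + b

r_star = fixed_point(psi)
# Closed form here: sqrt(r*) = (a + sqrt(a^2 + 4b))/2, so r* = 4.
```

The same iteration applies verbatim to an empirically estimated ψ̂_n, since the empirical local Rademacher average of a star-shaped class is itself sub-root in r.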
Thus, with high probability, r̂* is an upper bound on r* and has the same asymptotic behavior. Notice that there was no attempt to optimize the constants in the above theorem. In addition, the constant 9(1 + c_1)² (equal to 3969 if B ≤ 10) in Theorem 4.2 does not appear in the upper bound of Theorem 4.1.

4.2 Proofs

The idea of the proofs is to show that one can upper bound ψ by an empirical estimate (with high probability). This requires two steps: the first one uses the concentration of the Rademacher averages to upper bound the expected Rademacher averages by their empirical versions. The second step uses Corollary 2.2 to prove that the ball over which the averages are computed (which is an L_2(P) ball) can be replaced by an empirical one. Thus, ψ̂_n is an upper bound on ψ, and one can apply Theorem 3.3, together with the following lemma, which shows how the fixed points of sub-root functions relate when the functions are ordered.

Lemma 4.3 Suppose that ψ, ψ̂ are sub-root. Let r* (resp. r̂*) be the fixed point of ψ (resp. ψ̂). If for some 0 ≤ α ≤ 1 we have α ψ̂(r*) ≤ ψ(r*) ≤ ψ̂(r*), then

α² r̂* ≤ r* ≤ r̂*.

Proof: Denote by r̂_α the fixed point of the sub-root function α ψ̂. Then, by Lemma 3.2, r̂_α ≤ r* ≤ r̂*. Also, since ψ̂ is sub-root, ψ̂(α² r̂*) ≥ α ψ̂(r̂*) = α r̂*, which means α ψ̂(α² r̂*) ≥ α² r̂*. Hence, Lemma 3.2 yields r̂_α ≥ α² r̂*.

Proof of Theorem 4.1: Consider the sub-root function

ψ_1(r) = (c_1/√2) E R_n {f ∈ star(F, 0) : P f² ≤ r} + (c_2 − c_1) x/n,

with fixed point r_1*. Applying Corollary 2.2 when r ≥ ψ_1(r), it follows that with probability at least 1 − e^{−x},

{f ∈ star(F, 0) : P f² ≤ r} ⊆ {f ∈ star(F, 0) : P_n f² ≤ 2r}.

Using this, together with the first inequality of Lemma A.4 (with α = 1/2), shows that if r ≥ ψ_1(r), then with probability at least 1 − 2e^{−x},

ψ_1(r) = (c_1/√2) E R_n {f ∈ star(F, 0) : P f² ≤ r} + (c_2 − c_1) x/n
       ≤ c_1 E_σ R_n {f ∈ star(F, 0) : P f² ≤ r} + c_2 x/n
       ≤ c_1 E_σ R_n {f ∈ star(F, 0) : P_n f² ≤ 2r} + c_2 x/n
       ≤ ψ̂_n(r).

Choosing r = r_1*, Lemma 4.3 shows that with probability at least 1 − 2e^{−x},

r_1* ≤ r̂*.   (4.1)

Also, for all r ≥ 0,

ψ_1(r) ≥ B E R_n {f ∈ star(F, 0) : P f² ≤ r},

and so from Theorem 3.3, with probability at least 1 − e^{−x}, every f ∈ F satisfies

P f ≤ (K/(K−1)) P_n f + 6K r_1*/B + (11 + 5BK) x/n.

Combining this with (4.1) gives the first result. The second result is proved in a similar manner.
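When the class is finite, the empirical quantity E_σ R_n {f ∈ star(F, 0) : P_n f² ≤ 2r} appearing in ψ̂_n can be estimated by Monte Carlo over the Rademacher signs. The sketch below is an added illustration (not from the original text) on a hypothetical random class; it exploits the star-shape around 0, under which the supremum over the intersection of star(F, 0) with the empirical ball is attained either by shrinking each f onto the ball or by taking g = 0.

```python
import numpy as np

def local_emp_rademacher(F_vals, r, n_mc=2000, seed=1):
    """Monte Carlo estimate of E_sigma R_n {g in star(F,0): P_n g^2 <= 2r}
    for a finite class given by its value matrix F_vals (one row = values
    of one f at the sample points X_1, ..., X_n)."""
    rng = np.random.default_rng(seed)
    n = F_vals.shape[1]
    norms = (F_vals ** 2).mean(axis=1)              # P_n f^2 for each f
    scale = np.minimum(1.0, np.sqrt(2 * r / norms)) # shrink onto the ball
    G = F_vals * scale[:, None]
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))
    # sup over the star hull restricted to the ball: best shrunk f, or g = 0
    sups = np.maximum((sigma @ G.T / n).max(axis=1), 0.0)
    return sups.mean()

rng = np.random.default_rng(0)
F_vals = rng.uniform(-1, 1, size=(30, 50))  # hypothetical class, n = 50
```

With the Rademacher draws held fixed across calls (same seed), the estimate is nondecreasing in r, as the sub-root property requires.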

Proof of Theorem 4.2: Consider the functions

ψ_1(r) = (c_1/√2) E R_n {f ∈ star(F, 0) : P f² ≤ r} + (c_2 − c_1) x/n

and

ψ_2(r) = c_1 E R_n {f ∈ star(F, 0) : P f² ≤ r} + c_3 x/n,

and denote by r_1* and r_2* the fixed points of ψ_1 and ψ_2 respectively. The proof of Theorem 4.1 shows that with probability at least 1 − 2e^{−x}, r_1* ≤ r̂*. Now apply Lemma 3.6 to show that if r ≥ ψ_2(r), then with probability at least 1 − e^{−x},

{f ∈ star(F, 0) : P_n f² ≤ r} ⊆ {f ∈ star(F, 0) : P f² ≤ 2r}.

Using this, together with the second inequality of Lemma A.4 (with α = 1/2), shows that if r ≥ ψ_2(r), then with probability at least 1 − 2e^{−x},

ψ̂_n(r) = c_1 E_σ R_n {f ∈ star(F, 0) : P_n f² ≤ 2r} + c_2 x/n
       ≤ √2 c_1 E_σ R_n {f ∈ star(F, 0) : P_n f² ≤ r} + c_2 x/n
       ≤ √2 c_1 E_σ R_n {f ∈ star(F, 0) : P f² ≤ 2r} + c_2 x/n
       ≤ 2 c_1 E R_n {f ∈ star(F, 0) : P f² ≤ 2r} + (c_2 + 2c_1) x/n
       ≤ 3 c_1 E R_n {f ∈ star(F, 0) : P f² ≤ r} + (c_2 + 2c_1) x/n
       ≤ 3 ψ_2(r),

where the sub-root property was used twice (in the first and second-last inequalities). Lemma 4.3 thus gives r̂* ≤ 9 r_2*. Also notice that for all r, ψ(r) ≤ ψ_1(r), and hence r* ≤ r_1*. Moreover, for all r ≥ ψ(r) (hence for r ≥ r* ≥ c_3 x/n), ψ_2(r) ≤ c_1 ψ(r) + r, so that ψ_2(r*) ≤ (c_1 + 1) r* = (c_1 + 1) ψ(r*). Lemma 4.3 implies that r_2* ≤ (1 + c_1)² r*.

5 Prediction with Bounded Loss

In this section, we discuss the application of our results to prediction problems, such as classification and regression. For such problems, there is an input space X and an output space Y, and the product X × Y is endowed with an unknown probability measure P. For example, classification corresponds to the case where Y is discrete, typically Y = {−1, 1}, and regression corresponds to the continuous case, typically Y = [−1, 1]. Note that assuming the boundedness of the target values is a typical assumption in theoretical analysis of regression procedures. To analyze the case of unbounded targets, one usually truncates the values at a certain threshold and bounds the probability of exceeding that threshold (see, for example, the techniques developed in [12]). The training sample is a sequence (X_1, Y_1), ..., (X_n, Y_n) of independent and identically distributed (i.i.d.) pairs sampled according to P.
A loss function ℓ : Y × Y → [0, 1] is defined, and the goal is to find a function f : X → Y from a class F that minimizes the expected loss E ℓ_f = E ℓ(f(X), Y).

Since the probability distribution P is unknown, one cannot directly minimize the expected loss over F. The key property that is needed to apply our results is the fact that Var[f] ≤ B P f (or P f² ≤ B P f, to obtain data-dependent bounds). This will trivially be the case for the class {ℓ_f : f ∈ F}, as all its functions are uniformly bounded and nonnegative. This case, studied in Section 5.1, is, however, not the most interesting. Indeed, it is when one studies the excess risk ℓ_f − ℓ_{f*} that our approach shows its superiority over previous ones; when the class {ℓ_f − ℓ_{f*}} satisfies the variance condition (and Section 5.2 gives examples of this), we obtain distribution-dependent bounds that are optimal in certain cases, and data-dependent bounds of the same order.

5.1 General Results without Assumptions

Define the following class of functions, called the loss class associated with F:

ℓ_F = {ℓ_f : f ∈ F} = {(x, y) ↦ ℓ(f(x), y) : f ∈ F}.

Notice that ℓ_F is a class of nonnegative functions. Applying Theorem 4.1 to this class of functions gives the following corollary.

Corollary 5.1 For a loss function ℓ : Y × Y → [0, 1], define

ψ̂_n(r) = 20 E_σ R_n {f ∈ star(ℓ_F, 0) : P_n f² ≤ 2r} + 31x/n,

with fixed point r̂*. Then, for any K > 1, with probability at least 1 − 3e^{−x},

∀ f ∈ F,   P ℓ_f ≤ (K/(K−1)) P_n ℓ_f + 6K r̂* + x(11 + 5K)/n.

A natural approach is to minimize the empirical loss P_n ℓ_f over the class F. The following result shows that this approach leads to an estimate with expected loss near minimal. How close it is to the minimal expected loss depends on the value of the minimum, as well as on the local Rademacher averages of the class.

Theorem 5.2 For a loss function ℓ : Y × Y → [0, 1], define ψ(r), ψ̂_n(r), r*, and r̂* as in Corollary 5.1. Let L* = inf_{f ∈ F} P ℓ_f. Then there is a constant c such that with probability at least 1 − 2e^{−x}, the minimizer f̂ ∈ F of P_n ℓ_f satisfies

P ℓ_{f̂} ≤ L* + c (√(L* r*) + r*).

Also, with probability at least 1 − 4e^{−x},

P ℓ_{f̂} ≤ L* + c (√(L* r̂*) + r̂*).

The proof of this theorem is given in Appendix B. This theorem has the same flavor as Theorem 4.2 of [19].
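The empirical risk minimization scheme analysed in Theorem 5.2 can be illustrated on a small, entirely hypothetical example (added here, not from the original text): threshold classifiers on [0, 1] with 0–1 loss, where the true risk of each classifier is known in closed form, so the true risk of f̂ can be compared with L*.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite class F: classifiers x -> 1{x > t} for thresholds t.
thresholds = np.linspace(0.0, 1.0, 21)
X = rng.uniform(size=500)
Y = (X > 0.3).astype(float)                      # noiseless target at t = 0.3

preds = (X[None, :] > thresholds[:, None]).astype(float)
emp_loss = (preds != Y).mean(axis=1)             # P_n l_f for each f in F
f_hat = np.argmin(emp_loss)                      # empirical risk minimizer

# Under uniform X, the true 0-1 risk of threshold t is exactly |t - 0.3|.
true_loss = np.abs(thresholds - 0.3)
L_star = true_loss.min()
```

Here L* = 0 (the target threshold lies on the grid), so Theorem 5.2 predicts a fast rate for the excess risk of f̂, rather than the slow √(d/n)-type rate that governs the case L* > 0.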
We have not used any property besides the positivity of the functions in the class. This indicates that there might not be a significant gain compared to earlier results (as, without further assumptions, the optimal rates are known).

Indeed, a careful examination of this result shows that when L* > 0, the difference between P ℓ_{f̂} and L* is essentially of order √(L* r*). For a class of {0, 1}-valued functions with VC-dimension d, for example, this would be √(L* d log n / n). On the other hand, the result of [19] is more refined, since the Rademacher averages are not localized around 0 (as they are here), but rather around the minimizer of the empirical error itself. Unfortunately, the small ball in [19] is not defined as {f : P_n ℓ_f ≤ P_n ℓ_{f̂} + r} but as {f : P_n ℓ_f ≤ 16 P_n ℓ_{f̂} + r}. This means that in the general situation where L* > 0, since P_n ℓ_{f̂} does not converge to 0 with increasing n (as it is expected to be close to P ℓ_{f̂}, which itself converges to L*), the radius of the ball around ℓ_{f̂} (which is 15 P_n ℓ_{f̂} + r) will not converge to 0. Thus, the localized Rademacher average over this ball will converge at speed √(d/n). In other words, our Theorem 5.2 and Theorem 4.2 of [19] essentially have the same behavior. But this is not surprising, as it is known that this is the optimal rate of convergence in this case. To get an improvement in the rates of convergence, one needs to make further assumptions on the distribution P or on the class F.

5.2 Improved Results for the Excess Risk

Consider a loss function ℓ and a function class F that satisfy the following conditions.

1. For every probability distribution P there is an f* ∈ F satisfying P ℓ_{f*} = inf_{f ∈ F} P ℓ_f.

2. There is a constant L such that ℓ is L-Lipschitz in its first argument: for all y, ŷ_1, ŷ_2,

|ℓ(ŷ_1, y) − ℓ(ŷ_2, y)| ≤ L |ŷ_1 − ŷ_2|.

3. There is a constant B ≥ 1 such that for every probability distribution and every f ∈ F,

P (f − f*)² ≤ B P (ℓ_f − ℓ_{f*}).

These conditions are not too restrictive, as they are met by several commonly used regularized algorithms with convex losses. Note that Condition 1 could be weakened, and one could consider a function which is only close to achieving the infimum, with an appropriate change to Condition 3. This generalization is straightforward, but it would make the results less readable, so we omit it. Condition 2 implies that, for all f ∈ F,

P (ℓ_f − ℓ_{f*})² ≤ L² P (f − f*)².
Condition 3 usually follows from a uniform convexity condition on ℓ. An important example is the quadratic loss, ℓ(y, y′) = (y − y′)², when the function class F is convex and uniformly bounded. In particular, if |f(x) − y| ∈ [0, 1] for all f ∈ F, x ∈ X and y ∈ Y, then the conditions are satisfied with L = 2 and B = 1 (see [18]). Other examples are described in [26] and in [2]. The first result we present is a direct but instructive corollary of Theorem 3.3.

Corollary 5.3 Let F be a class of functions with range in [−1, 1] and let ℓ be a loss function satisfying Conditions 1–3 above. Let f̂ be any element of F satisfying P_n ℓ_{f̂} = inf_{f ∈ F} P_n ℓ_f. Assume ψ is a sub-root function for which

ψ(r) ≥ B L E R_n {f ∈ F : L² P (f − f*)² ≤ r}.

Then for any x > 0 and any r ≥ ψ(r), with probability at least 1 − e^{−x},

P (ℓ_{f̂} − ℓ_{f*}) ≤ 705 r/B + (11L + 27B) x/n.

Proof: One applies Theorem 3.3 (first part) to the class ℓ_f − ℓ_{f*} with T(f) = L² P (f − f*)², and uses the fact that, by Theorem A.6 and by the symmetry of the Rademacher variables,

L E R_n {f : L² P (f − f*)² ≤ r} ≥ E R_n {ℓ_f − ℓ_{f*} : L² P (f − f*)² ≤ r}.

The result follows from noticing that P_n (ℓ_{f̂} − ℓ_{f*}) ≤ 0.

Instead of comparing the loss of f̂ to that of f*, one could compare it to the loss of the best measurable function (the regression function for regression function estimation, or the Bayes classifier for classification). The techniques proposed here can be adapted to this case. Using Corollary 5.3, one can (with minor modification) recover the results of [22] for model selection. These have been shown to match the minimax results in various situations. In that sense, Corollary 5.3 can be considered as sharp.

Next, we turn to the main result of this section. It is a version of Corollary 5.3 with a fully data-dependent bound. This is obtained by modifying ψ in three ways: the Rademacher averages are replaced by empirical ones, the radius of the ball is in L_2(P_n) norm instead of L_2(P), and finally, the center of the ball is f̂ instead of f*.

Theorem 5.4 Let F be a convex class of functions with range in [−1, 1] and let ℓ be a loss function satisfying Conditions 1–3 above. Let f̂ be any element of F satisfying P_n ℓ_{f̂} = inf_{f ∈ F} P_n ℓ_f. Define

ψ̂_n(r) = c_1 E_σ R_n {f ∈ F : P_n (f − f̂)² ≤ c_3 r} + c_2 x/n,   (5.1)

where c_1 = 2L(B ∨ 10L), c_2 = 11L² + c_1, and c_3 = (√2/L + c)², with c = √(2(705 + B(11L + 27B)/c_2)). Then with probability at least 1 − 4e^{−x},

P (ℓ_{f̂} − ℓ_{f*}) ≤ 705 r̂*/B + (11L + 27B) x/n,

where r̂* is the fixed point of ψ̂_n.

Remark 5.5 Unlike Corollary 5.3, the class F in Theorem 5.4 has to be convex. This ensures that it is star-shaped around any of its elements (which implies that ψ̂_n is sub-root even though f̂ is random).
However, convexity of the loss class is not necessary, so that this theorem still applies to many situations of interest, in particular to regularized regression, where the functions are taken in a vector space or a ball of a vector space.

Remark 5.6 Although the theorem is stated with explicit constants, there is no reason to think that these are optimal. The fact that the constant 705 appears is actually due to our failure to apply the second part of Theorem 3.3 to the initial loss class, which is not star-shaped (this would have given a 7 instead). However, with some additional effort, one can probably obtain much better constants.

As we explained earlier, although the statement of Theorem 5.4 is similar to Theorem 4.2 in [19], there is an important difference in the way the localized averages are defined: in our case the radius is a constant times r, while in [19] there is an additional term, involving the loss of the empirical risk minimizer, which may not converge to zero. Hence, the complexity decreases faster in our bound.
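Conditions 2 and 3 for the quadratic loss can be checked numerically on a small convex class. The sketch below is an added illustration with made-up data (not from the original text): it verifies that ℓ(ŷ, y) = (ŷ − y)² is 2-Lipschitz in its first argument on [0, 1], and that Condition 3 holds with B = 1 along a one-dimensional convex class, where it is in fact an equality.

```python
import numpy as np

rng = np.random.default_rng(0)

# Condition 2: quadratic loss is 2-Lipschitz in its first argument on [0, 1],
# since |(a - y)^2 - (b - y)^2| = |a - b| * |a + b - 2y| <= 2 |a - b|.
a_s, b_s, y_s = rng.uniform(size=(3, 1000))
lip_ok = np.all(np.abs((a_s - y_s) ** 2 - (b_s - y_s) ** 2)
                <= 2 * np.abs(a_s - b_s) + 1e-12)

# Condition 3 with B = 1 on the convex class {f_t = t * v : t in [0, 1]}
# over three equiprobable points (v and y are hypothetical values).
v = np.array([1.0, 0.5, 0.2])
y = np.array([0.0, 1.0, 0.5])

def risk(t):
    return ((t * v - y) ** 2).mean()          # P l_{f_t}

t_star = (v * y).mean() / (v ** 2).mean()     # minimizer; lies inside [0, 1]
ts = np.linspace(0.0, 1.0, 101)
lhs = np.array([((t - t_star) ** 2 * v ** 2).mean() for t in ts])  # P (f_t - f*)^2
rhs = np.array([risk(t) - risk(t_star) for t in ts])               # excess risk
```

On this one-dimensional class both sides equal (t − t*)² · mean(v²), so Condition 3 holds with B = 1 and cannot be improved, matching the constants quoted from [18].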

The additional property required in the proof of this result, compared to the proof of Theorem 4.1, is that, under the assumptions of the theorem, the minimizers of the empirical loss and of the true loss are close with respect to the L_2(P) and the L_2(P_n) distances (this has also been used in [20] and [31, 32]).

5.3 Proof of Theorem 5.4

Define the function ψ as

ψ(r) = (c_1/√2) E R_n {f ∈ F : L² P (f − f*)² ≤ r} + (c_2 − c_1) x/n.   (5.2)

Notice that since F is convex, and thus star-shaped around each of its points, Lemma 3.4 implies that ψ is sub-root. Now, for r ≥ ψ(r), Corollary 5.3 and Condition 3 on the loss function imply that, with probability at least 1 − e^{−x},

L² P (f̂ − f*)² ≤ B L² P (ℓ_{f̂} − ℓ_{f*}) ≤ 705 L² r + (11L + 27B) B L² x/n.   (5.3)

Denote the right-hand side by s. Since s ≥ r ≥ ψ(r), then s ≥ ψ(s) (by Lemma 3.2), and thus

s ≥ 10 L² E R_n {f ∈ F : L² P (f − f*)² ≤ s} + 11 L² x/n.

Therefore, Corollary 2.2 applied to the class LF yields that, with probability at least 1 − e^{−x},

{f ∈ F : L² P (f − f*)² ≤ s} ⊆ {f ∈ F : L² P_n (f − f*)² ≤ 2s}.

This, combined with (5.3), implies that with probability at least 1 − 2e^{−x},

P_n (f̂ − f*)² ≤ 2 (705 r + (11L + 27B) B x/n) ≤ 2 (705 + (11L + 27B) B / c_2) r,   (5.4)

where the second inequality follows from r ≥ ψ(r) ≥ c_2 x/n. Define c = √(2(705 + (11L + 27B) B / c_2)). By the triangle inequality in L_2(P_n), if (5.4) occurs, then any f ∈ F satisfies

√(P_n (f − f̂)²) ≤ √(P_n (f − f*)²) + √(P_n (f̂ − f*)²) ≤ √(P_n (f − f*)²) + c√r.

Appealing again to Corollary 2.2 applied to LF as before, but now for r ≥ ψ(r), it follows that with probability at least 1 − 3e^{−x},

{f ∈ F : L² P (f − f*)² ≤ r} ⊆ {f ∈ F : L² P_n (f − f̂)² ≤ (√2 + cL)² r}.


More information

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece,, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet as

More information

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information

On Random Line Segments in the Unit Square

On Random Line Segments in the Unit Square O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,

More information

10-701/ Machine Learning Mid-term Exam Solution

10-701/ Machine Learning Mid-term Exam Solution 0-70/5-78 Machie Learig Mid-term Exam Solutio Your Name: Your Adrew ID: True or False (Give oe setece explaatio) (20%). (F) For a cotiuous radom variable x ad its probability distributio fuctio p(x), it

More information

Sieve Estimators: Consistency and Rates of Convergence

Sieve Estimators: Consistency and Rates of Convergence EECS 598: Statistical Learig Theory, Witer 2014 Topic 6 Sieve Estimators: Cosistecy ad Rates of Covergece Lecturer: Clayto Scott Scribe: Julia Katz-Samuels, Brado Oselio, Pi-Yu Che Disclaimer: These otes

More information

MA131 - Analysis 1. Workbook 3 Sequences II

MA131 - Analysis 1. Workbook 3 Sequences II MA3 - Aalysis Workbook 3 Sequeces II Autum 2004 Cotets 2.8 Coverget Sequeces........................ 2.9 Algebra of Limits......................... 2 2.0 Further Useful Results........................

More information

Fall 2013 MTH431/531 Real analysis Section Notes

Fall 2013 MTH431/531 Real analysis Section Notes Fall 013 MTH431/531 Real aalysis Sectio 8.1-8. Notes Yi Su 013.11.1 1. Defiitio of uiform covergece. We look at a sequece of fuctios f (x) ad study the coverget property. Notice we have two parameters

More information

Learnability with Rademacher Complexities

Learnability with Rademacher Complexities Learability with Rademacher Complexities Daiel Khashabi Fall 203 Last Update: September 26, 206 Itroductio Our goal i study of passive ervised learig is to fid a hypothesis h based o a set of examples

More information

5.1 A mutual information bound based on metric entropy

5.1 A mutual information bound based on metric entropy Chapter 5 Global Fao Method I this chapter, we exted the techiques of Chapter 2.4 o Fao s method the local Fao method) to a more global costructio. I particular, we show that, rather tha costructig a local

More information

A Proof of Birkhoff s Ergodic Theorem

A Proof of Birkhoff s Ergodic Theorem A Proof of Birkhoff s Ergodic Theorem Joseph Hora September 2, 205 Itroductio I Fall 203, I was learig the basics of ergodic theory, ad I came across this theorem. Oe of my supervisors, Athoy Quas, showed

More information

Lecture 3: August 31

Lecture 3: August 31 36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,

More information

Math 61CM - Solutions to homework 3

Math 61CM - Solutions to homework 3 Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig

More information

Chapter 6 Infinite Series

Chapter 6 Infinite Series Chapter 6 Ifiite Series I the previous chapter we cosidered itegrals which were improper i the sese that the iterval of itegratio was ubouded. I this chapter we are goig to discuss a topic which is somewhat

More information

A statistical method to determine sample size to estimate characteristic value of soil parameters

A statistical method to determine sample size to estimate characteristic value of soil parameters A statistical method to determie sample size to estimate characteristic value of soil parameters Y. Hojo, B. Setiawa 2 ad M. Suzuki 3 Abstract Sample size is a importat factor to be cosidered i determiig

More information

Measure and Measurable Functions

Measure and Measurable Functions 3 Measure ad Measurable Fuctios 3.1 Measure o a Arbitrary σ-algebra Recall from Chapter 2 that the set M of all Lebesgue measurable sets has the followig properties: R M, E M implies E c M, E M for N implies

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Patter Recogitio Classificatio: No-Parametric Modelig Hamid R. Rabiee Jafar Muhammadi Sprig 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Ageda Parametric Modelig No-Parametric Modelig

More information

IP Reference guide for integer programming formulations.

IP Reference guide for integer programming formulations. IP Referece guide for iteger programmig formulatios. by James B. Orli for 15.053 ad 15.058 This documet is iteded as a compact (or relatively compact) guide to the formulatio of iteger programs. For more

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week Lecture: Cocept Check Exercises Starred problems are optioal. Statistical Learig Theory. Suppose A = Y = R ad X is some other set. Furthermore, assume P X Y is a discrete

More information

The Growth of Functions. Theoretical Supplement

The Growth of Functions. Theoretical Supplement The Growth of Fuctios Theoretical Supplemet The Triagle Iequality The triagle iequality is a algebraic tool that is ofte useful i maipulatig absolute values of fuctios. The triagle iequality says that

More information

6.867 Machine learning, lecture 7 (Jaakkola) 1

6.867 Machine learning, lecture 7 (Jaakkola) 1 6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit

More information

NYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018)

NYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018) NYU Ceter for Data Sciece: DS-GA 003 Machie Learig ad Computatioal Statistics (Sprig 208) Brett Berstei, David Roseberg, Be Jakubowski Jauary 20, 208 Istructios: Followig most lab ad lecture sectios, we

More information

1 Inferential Methods for Correlation and Regression Analysis

1 Inferential Methods for Correlation and Regression Analysis 1 Iferetial Methods for Correlatio ad Regressio Aalysis I the chapter o Correlatio ad Regressio Aalysis tools for describig bivariate cotiuous data were itroduced. The sample Pearso Correlatio Coefficiet

More information

Output Analysis and Run-Length Control

Output Analysis and Run-Length Control IEOR E4703: Mote Carlo Simulatio Columbia Uiversity c 2017 by Marti Haugh Output Aalysis ad Ru-Legth Cotrol I these otes we describe how the Cetral Limit Theorem ca be used to costruct approximate (1 α%

More information

ACO Comprehensive Exam 9 October 2007 Student code A. 1. Graph Theory

ACO Comprehensive Exam 9 October 2007 Student code A. 1. Graph Theory 1. Graph Theory Prove that there exist o simple plaar triagulatio T ad two distict adjacet vertices x, y V (T ) such that x ad y are the oly vertices of T of odd degree. Do ot use the Four-Color Theorem.

More information

Beurling Integers: Part 2

Beurling Integers: Part 2 Beurlig Itegers: Part 2 Isomorphisms Devi Platt July 11, 2015 1 Prime Factorizatio Sequeces I the last article we itroduced the Beurlig geeralized itegers, which ca be represeted as a sequece of real umbers

More information

Element sampling: Part 2

Element sampling: Part 2 Chapter 4 Elemet samplig: Part 2 4.1 Itroductio We ow cosider uequal probability samplig desigs which is very popular i practice. I the uequal probability samplig, we ca improve the efficiecy of the resultig

More information

The standard deviation of the mean

The standard deviation of the mean Physics 6C Fall 20 The stadard deviatio of the mea These otes provide some clarificatio o the distictio betwee the stadard deviatio ad the stadard deviatio of the mea.. The sample mea ad variace Cosider

More information

Sequences. Notation. Convergence of a Sequence

Sequences. Notation. Convergence of a Sequence Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it

More information

Rates of Convergence by Moduli of Continuity

Rates of Convergence by Moduli of Continuity Rates of Covergece by Moduli of Cotiuity Joh Duchi: Notes for Statistics 300b March, 017 1 Itroductio I this ote, we give a presetatio showig the importace, ad relatioship betwee, the modulis of cotiuity

More information

Seunghee Ye Ma 8: Week 5 Oct 28

Seunghee Ye Ma 8: Week 5 Oct 28 Week 5 Summary I Sectio, we go over the Mea Value Theorem ad its applicatios. I Sectio 2, we will recap what we have covered so far this term. Topics Page Mea Value Theorem. Applicatios of the Mea Value

More information

Intro to Learning Theory

Intro to Learning Theory Lecture 1, October 18, 2016 Itro to Learig Theory Ruth Urer 1 Machie Learig ad Learig Theory Comig soo 2 Formal Framework 21 Basic otios I our formal model for machie learig, the istaces to be classified

More information

Problem Set 4 Due Oct, 12

Problem Set 4 Due Oct, 12 EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios

More information

TR/46 OCTOBER THE ZEROS OF PARTIAL SUMS OF A MACLAURIN EXPANSION A. TALBOT

TR/46 OCTOBER THE ZEROS OF PARTIAL SUMS OF A MACLAURIN EXPANSION A. TALBOT TR/46 OCTOBER 974 THE ZEROS OF PARTIAL SUMS OF A MACLAURIN EXPANSION by A. TALBOT .. Itroductio. A problem i approximatio theory o which I have recetly worked [] required for its solutio a proof that the

More information

Glivenko-Cantelli Classes

Glivenko-Cantelli Classes CS28B/Stat24B (Sprig 2008 Statistical Learig Theory Lecture: 4 Gliveko-Catelli Classes Lecturer: Peter Bartlett Scribe: Michelle Besi Itroductio This lecture will cover Gliveko-Catelli (GC classes ad itroduce

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Support vector machine revisited

Support vector machine revisited 6.867 Machie learig, lecture 8 (Jaakkola) 1 Lecture topics: Support vector machie ad kerels Kerel optimizatio, selectio Support vector machie revisited Our task here is to first tur the support vector

More information

Lecture 2: Monte Carlo Simulation

Lecture 2: Monte Carlo Simulation STAT/Q SCI 43: Itroductio to Resamplig ethods Sprig 27 Istructor: Ye-Chi Che Lecture 2: ote Carlo Simulatio 2 ote Carlo Itegratio Assume we wat to evaluate the followig itegratio: e x3 dx What ca we do?

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

4 The Sperner property.

4 The Sperner property. 4 The Sperer property. I this sectio we cosider a surprisig applicatio of certai adjacecy matrices to some problems i extremal set theory. A importat role will also be played by fiite groups. I geeral,

More information

CHAPTER 10 INFINITE SEQUENCES AND SERIES

CHAPTER 10 INFINITE SEQUENCES AND SERIES CHAPTER 10 INFINITE SEQUENCES AND SERIES 10.1 Sequeces 10.2 Ifiite Series 10.3 The Itegral Tests 10.4 Compariso Tests 10.5 The Ratio ad Root Tests 10.6 Alteratig Series: Absolute ad Coditioal Covergece

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT

More information

Lecture Notes for Analysis Class

Lecture Notes for Analysis Class Lecture Notes for Aalysis Class Topological Spaces A topology for a set X is a collectio T of subsets of X such that: (a) X ad the empty set are i T (b) Uios of elemets of T are i T (c) Fiite itersectios

More information

Lecture 7: October 18, 2017

Lecture 7: October 18, 2017 Iformatio ad Codig Theory Autum 207 Lecturer: Madhur Tulsiai Lecture 7: October 8, 207 Biary hypothesis testig I this lecture, we apply the tools developed i the past few lectures to uderstad the problem

More information

Linear Regression Demystified

Linear Regression Demystified Liear Regressio Demystified Liear regressio is a importat subject i statistics. I elemetary statistics courses, formulae related to liear regressio are ofte stated without derivatio. This ote iteds to

More information

Math 155 (Lecture 3)

Math 155 (Lecture 3) Math 55 (Lecture 3) September 8, I this lecture, we ll cosider the aswer to oe of the most basic coutig problems i combiatorics Questio How may ways are there to choose a -elemet subset of the set {,,,

More information

MAT1026 Calculus II Basic Convergence Tests for Series

MAT1026 Calculus II Basic Convergence Tests for Series MAT026 Calculus II Basic Covergece Tests for Series Egi MERMUT 202.03.08 Dokuz Eylül Uiversity Faculty of Sciece Departmet of Mathematics İzmir/TURKEY Cotets Mootoe Covergece Theorem 2 2 Series of Real

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

Lecture 11: Decision Trees

Lecture 11: Decision Trees ECE9 Sprig 7 Statistical Learig Theory Istructor: R. Nowak Lecture : Decisio Trees Miimum Complexity Pealized Fuctio Recall the basic results of the last lectures: let X ad Y deote the iput ad output spaces

More information

Notes for Lecture 11

Notes for Lecture 11 U.C. Berkeley CS78: Computatioal Complexity Hadout N Professor Luca Trevisa 3/4/008 Notes for Lecture Eigevalues, Expasio, ad Radom Walks As usual by ow, let G = (V, E) be a udirected d-regular graph with

More information

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015 ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],

More information