Empirical Bernstein Inequality for Martingales: Application to Online Learning


Thomas Peel, Sandrine Anthoine, Liva Ralaivola

To cite this version: Thomas Peel, Sandrine Anthoine, Liva Ralaivola. Empirical Bernstein Inequality for Martingales: Application to Online Learning. 2013. <hal> Submitted on 5 Nov 2013.

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Empirical Bernstein Inequality for Martingales: Application to Online Learning

Thomas Peel (1,2), Sandrine Anthoine (1), Liva Ralaivola (2)
(1) Aix-Marseille Université - CNRS, LATP, UMR 7353, Marseille, France
(2) Aix-Marseille Université - CNRS, LIF, UMR 7279, Marseille, France

Abstract

In this article we present a new empirical Bernstein inequality for bounded martingale difference sequences. This inequality refines the one of Freedman [1975] and is then used to bound the average risk of the hypotheses produced during an online learning process. We provide theoretical and empirical evidence of the tightness of our result compared with the state-of-the-art bound of Cesa-Bianchi and Gentile [2008].

1 INTRODUCTION

The motivation behind this work comes from the wish to analyze the risk of the models (or hypotheses) produced by an online learning algorithm. Such an algorithm works incrementally on a sequence of independent and identically distributed (i.i.d.) random variables. At each step, it receives an example that is used to update the current model parameters. Once this update is done, the performance of the new hypothesis is measured by evaluating its loss on the next example of the sequence, and so on. By averaging these losses, one can define a statistic $\hat{R}_n$ called the average empirical instantaneous risk. The risk of a model is simply the expectation of its loss on a new unseen example, given the sequence of data used in its construction.

In their recent works, Cesa-Bianchi et al. [2004] and Cesa-Bianchi and Gentile [2008] show how the statistic $\hat{R}_n$ can be used to select a hypothesis with a low risk. The key tool in their analyses is the use of concentration inequalities for martingales (Azuma-Hoeffding, Bernstein). Indeed, the dependencies between the hypotheses that are inherent to online learning processes prevent the use of standard concentration inequalities, which require independence. Bernstein (second-order) inequalities are known to be tighter than their first-order counterparts. However, the variance is in general unknown and needs to be upper bounded. Recent works in the batch setting have proposed an empirical (data-dependent) version of the Bernstein inequality [Maurer and Pontil, 2009, Peel et al., 2010], where an estimator of the variance is used as the upper bound. However, these inequalities are not applicable to the online learning setting.

In this paper, we propose a new Bernstein inequality for bounded martingale difference sequences (Theorem 2) that takes advantage of the statistic $\hat{V}_n$, an instantaneous estimator of the variance. This inequality is then used to refine the tail bound of Cesa-Bianchi and Gentile [2008]. Briefly, we show that under the same assumptions they make, the average risk of the hypotheses produced by an online learning algorithm is bounded with high probability by

$$\hat{R}_n + \frac{\sqrt{\beta \ln\frac{2}{\delta}}}{n-1} + \frac{2}{3}\frac{\ln\frac{2}{\delta}}{n-1},$$

where $\beta$ is a function of $\hat{V}_n$ that we will detail later. This bound can be applied to any online algorithm, and as an example we show how to use it to characterize the average risk of the hypotheses produced by Pegasos [Shalev-Shwartz et al., 2011], a stochastic method for solving the SVM optimization problem. We want to emphasize that the scope of our new empirical Bernstein inequality for martingales goes far beyond any application to online learning processes.

The paper is organized as follows. In Section 2 we recall a few fundamental notions about martingales and the classical concentration inequalities associated with this kind of random processes. Section 3 presents the main result of this paper, a concentration inequality that takes advantage of second-order empirical information in the martingale setting. This result is then applied in Section 4 to get a bound on the mean generalization error made by the hypotheses learned during an online learning process; this bound substantially improves the results mentioned above. We end this paper with Section 5, a direct consequence of the previous inequalities that lets us bound the mean risk of the weight vectors generated during a run of the Pegasos algorithm.
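As a concrete illustration of the protocol just described, here is a minimal Python sketch (ours, not from the paper); the `update` and `loss` callables are placeholders for a concrete algorithm and loss function:

```python
import numpy as np

def average_instantaneous_risk(examples, h0, update, loss):
    """Run the online protocol and return R_hat, the average of the losses
    incurred by each intermediate hypothesis on the next incoming example."""
    h, losses = h0, []
    for z in examples[:-1]:        # z_{t+1}, for t = 0, ..., n-2
        losses.append(loss(h, z))  # evaluate h_t on an example it has not seen
        h = update(h, z)           # build h_{t+1} from h_t and z_{t+1}
    return np.mean(losses)
```

Note that each example is used first for evaluation and only then for training, which is exactly what makes each term of the average an unbiased estimate of the risk of the current hypothesis.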

2 PRELIMINARIES

This section briefly reviews basic notions about martingale theory and the classical concentration inequalities associated with this kind of stochastic processes.

2.1 Martingale and Martingale Difference Sequence

Definition 1 (Martingale). A sequence $\{M_n : 0 \leq n < \infty\}$ of random variables is said to be a martingale with respect to the sequence of random variables $\{X_n : 1 \leq n < \infty\}$ if the sequence $\{M_0, \ldots, M_n\}$ has two basic properties. The first one is that for each $n$ there is a function $f_n : \mathbb{R}^n \to \mathbb{R}$ such that $M_n = f_n(X_1, X_2, \ldots, X_n)$. The second property is that the sequence $\{M_n\}$ satisfies, for all $n$:

$$\mathbb{E}[|M_n|] < \infty \quad (1)$$
$$\mathbb{E}[M_{n+1} \mid X_1, \ldots, X_n] = M_n. \quad (2)$$

Given this definition of a martingale, we now define a martingale difference sequence.

Definition 2 (Martingale difference sequence). We say that a sequence of random variables $\{Y_n : 0 < n < \infty\}$ is a martingale difference sequence (mds) if the sequence $\{Y_n\}$ satisfies the following properties for all $n$:

$$\mathbb{E}[|Y_n|] < \infty \quad (3)$$
$$\mathbb{E}[Y_{n+1} \mid Y_1, \ldots, Y_n] = 0. \quad (4)$$

By construction, this implies that if the sequence $\{M_n\}$ is a martingale, then the sequence $\{Y_n = M_n - M_{n-1}\}$ is a martingale difference sequence. We now introduce two well-known concentration inequalities for the sum of the increments of an mds that we will use in the next sections.

2.2 Azuma-Hoeffding Inequality

The Azuma-Hoeffding inequality [Hoeffding, 1963, Azuma, 1967] bounds the deviation of a martingale with bounded increments from its initial value $M_0$.

Theorem 1 (Azuma-Hoeffding inequality). Let $\{M_n\}$ be a martingale and define $\{Y_n = M_n - M_{n-1}\}$ the associated martingale difference sequence, such that $|Y_i| \leq c_i$ for all $i$. Then, for all $\epsilon > 0$,

$$\mathbb{P}\Big[\sum_{i=1}^n Y_i = M_n - M_0 \geq \epsilon\Big] \leq \exp\Big(-\frac{\epsilon^2}{2\sum_{i=1}^n c_i^2}\Big). \quad (5)$$

This result extends the Hoeffding inequality [Hoeffding, 1963] to the case where the random variables of interest are not necessarily independent.

Corollary 1. Let $X_1, \ldots, X_n$ be a sequence of random variables such that for all $i$ we have $|\mathbb{E}[X_i \mid X_1, \ldots, X_{i-1}] - X_i| \leq c_i$. Set $S_n = \sum_{i=1}^n X_i$; then for all $\epsilon > 0$,

$$\mathbb{P}\Big[\sum_{i=1}^n \mathbb{E}[X_i \mid X_1, \ldots, X_{i-1}] - S_n \geq \epsilon\Big] \leq \exp\Big(-\frac{\epsilon^2}{2\sum_{i=1}^n c_i^2}\Big). \quad (6)$$

Proof. A direct application of Theorem 1 to the martingale difference sequence $\{Y_n\}$ such that $Y_i = \mathbb{E}[X_i \mid X_1, \ldots, X_{i-1}] - X_i$ gives the result.

2.3 Bernstein Inequality for Martingales

The inequality we recall in the following lemma is a consequence of the Bernstein inequality for martingales given in Freedman [1975]. This lemma extends the classical Bernstein inequality [Bennett, 1962], which requires independence between the random variables $X_i$ under consideration. This limitation is overcome by looking at the martingale difference sequence $\{Y_n = \mathbb{E}[X_n \mid X_1, \ldots, X_{n-1}] - X_n\}$.

Lemma 1 (Bernstein inequality for martingales). Suppose $X_1, \ldots, X_n$ is a sequence of random variables such that $0 \leq X_i \leq 1$. Define the martingale difference sequence $\{Y_n = \mathbb{E}[X_n \mid X_1, \ldots, X_{n-1}] - X_n\}$ and denote by $K_n$ the sum of the conditional variances:

$$K_n = \sum_{t=1}^n \mathbb{V}[X_t \mid X_1, \ldots, X_{t-1}]. \quad (7)$$

Let $S_n = \sum_{i=1}^n X_i$; then for all $\epsilon, k \geq 0$,

$$\mathbb{P}\Big[\sum_{t=1}^n \mathbb{E}[X_t \mid X_1, \ldots, X_{t-1}] - S_n \geq \epsilon,\; K_n \leq k\Big] \leq \exp\Big(-\frac{\epsilon^2}{2k + 2\epsilon/3}\Big). \quad (8)$$

As we shall see, this lemma is central to our analysis, as it was in the work of Cesa-Bianchi and Gentile [2008].
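As a quick numerical illustration (ours, not from the paper), the following snippet checks the Azuma-Hoeffding bound (5) on the simplest bounded mds, i.i.d. Rademacher increments with $c_i = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, runs = 200, 20.0, 100_000

# Y_i uniform on {-1, +1} is a martingale difference sequence with |Y_i| <= 1.
sums = rng.choice([-1.0, 1.0], size=(runs, n)).sum(axis=1)  # M_n - M_0

empirical = np.mean(sums >= eps)
azuma = np.exp(-eps ** 2 / (2 * n))  # Theorem 1 with c_i = 1 for all i
print(f"P[M_n - M_0 >= {eps:g}] ~= {empirical:.4f} <= bound {azuma:.4f}")
```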

3 EMPIRICAL BERNSTEIN INEQUALITY FOR MARTINGALES

Second-order Bernstein inequalities are known to be tighter than their first-order counterparts thanks to the variance term. However, in practice, this term often cannot be evaluated, and it is common to upper bound it by the expectation (under the assumption that the random variables of interest are bounded by 1) in order to compute the whole inequality. We propose another approach, based on the use of an instantaneous estimator of the variance. By doing so, we hope to get a tighter inequality without any a priori assumption on the underlying distribution of the random variables.

This section presents the main result of the paper: a refined version of the Bernstein inequality for martingales recalled above, in which the sum of conditional variances is upper bounded using an instantaneous estimator. We first introduce the inequality reversal lemma, which allows us to transform tail inequalities into upper bounds (or confidence intervals). This lemma was used by Peel et al. [2010] to prove their empirical Bernstein inequality for U-statistics.

Lemma 2 (Inequality reversal lemma). Let $X$ be a random variable and $a, b > 0$, $c, d \geq 0$ such that

$$\forall \epsilon > 0, \quad \mathbb{P}[|X| \geq \epsilon] \leq a \exp\Big\{-\frac{b\epsilon^2}{c + d\epsilon}\Big\}. \quad (9)$$

Then, with probability at least $1 - \delta$,

$$|X| \leq \sqrt{\frac{c}{b}\ln\frac{a}{\delta}} + \frac{d}{b}\ln\frac{a}{\delta}. \quad (10)$$

Proof. Solving for $\epsilon$ such that the right-hand side of (9) is equal to $\delta$ gives

$$\epsilon = \frac{1}{2b}\Big(d\ln\frac{a}{\delta} + \sqrt{d^2\ln^2\frac{a}{\delta} + 4bc\ln\frac{a}{\delta}}\Big).$$

Using $\sqrt{a + b} \leq \sqrt{a} + \sqrt{b}$ gives an upper bound on $\epsilon$ and provides the result.

We use the notation $f_{\{Z_t\}}$ to indicate a function determined by the sequence of random variables $\{Z_t\} = \{Z_1, \ldots, Z_t\}$, i.e. the expression of $f_{\{Z_t\}}$ is fixed by the sequence $\{Z_t\}$. The next theorem is the main result of this paper.

Theorem 2 (Empirical Bernstein inequality for martingales). Let $Z_1, \ldots, Z_n$ be a sequence of random variables following the same probability distribution $\mathcal{D}$, such that $Z_{t+1}$ and $Z_{t+2}$ are conditionally independent given $\{Z_t\}$, for all $t$. Suppose $\{f_{\{Z_t\}}\}$ is a family of functions which take their values in $[0, 1]$. Then for all $0 < \delta \leq 1$ we have, with probability at least $1 - \delta$,

$$\frac{1}{n-2}\sum_{t=1}^{n-2} \mathbb{E}\big[f_{\{Z_t\}}(Z_{t+1}) \mid Z_1, \ldots, Z_t\big] \leq \frac{1}{n-2}\sum_{t=1}^{n-2} f_{\{Z_t\}}(Z_{t+1}) + \frac{\sqrt{\beta\ln\frac{2}{\delta}}}{n-2} + \frac{2}{3}\frac{\ln\frac{2}{\delta}}{n-2}, \quad (11)$$

where

$$\beta = \hat{V}_n + \sqrt{\frac{n-2}{2}\ln\frac{2}{\delta}} \quad (12)$$

and

$$\hat{V}_n = \frac{1}{2}\sum_{t=1}^{n-2}\big(f_{\{Z_t\}}(Z_{t+1}) - f_{\{Z_t\}}(Z_{t+2})\big)^2. \quad (13)$$

In a nutshell, the message carried by this theorem is that it is possible to use an instantaneous variance estimator to quantify the deviation of the sum $\sum_{t=1}^{n-2} f_{\{Z_t\}}(Z_{t+1})$ from its expected value.
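For illustration, here is a small Python helper (a sketch of ours; the names are ours) that evaluates the right-hand side of (11) from the two streams of values $f_{\{Z_t\}}(Z_{t+1})$ and $f_{\{Z_t\}}(Z_{t+2})$, $t = 1, \ldots, n-2$:

```python
import numpy as np

def empirical_bernstein_bound(f_next, f_next2, delta):
    """f_next[t] = f_{Z_t}(Z_{t+1}) and f_next2[t] = f_{Z_t}(Z_{t+2}),
    both arrays of length n - 2 with values in [0, 1]."""
    f_next, f_next2 = np.asarray(f_next), np.asarray(f_next2)
    m = len(f_next)                                    # m = n - 2 terms
    log_term = np.log(2 / delta)
    v_hat = 0.5 * np.sum((f_next - f_next2) ** 2)      # Eq. (13)
    beta = v_hat + np.sqrt(m / 2 * log_term)           # Eq. (12)
    return (f_next.mean()                              # empirical average
            + np.sqrt(beta * log_term) / m             # second-order term
            + 2 * log_term / (3 * m))                  # first-order term, Eq. (11)
```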

In order to prove the previous concentration inequality, we need an intermediate result about the conditional variance estimator introduced in Equation (13). In essence, the following lemma allows us to quantify the deviation of this estimator from the sum $V_n$ of the conditional variances:

$$V_n = \sum_{t=1}^{n-2} \mathbb{V}\big[f_{\{Z_t\}}(Z) \mid Z_1, \ldots, Z_t\big]. \quad (14)$$

Lemma 3. Let $Z_1, \ldots, Z_n$ be a sequence of random variables following the same probability distribution $\mathcal{D}$, such that $Z_{t+1}$ and $Z_{t+2}$ are conditionally independent given $\{Z_t\}$, for all $t$. Suppose $\{f_{\{Z_t\}}\}$ is a family of functions which take their values in $[0, 1]$. Then for all $0 < \delta \leq 1$,

$$\mathbb{P}\Big[V_n \geq \hat{V}_n + \sqrt{\frac{n-2}{2}\ln\frac{1}{\delta}}\Big] \leq \delta. \quad (15)$$

Proof. We begin this proof by defining the sequence of random variables $\{M_n\}$ such that, for all $t$,

$$M_t = \frac{1}{2}\big(f_{\{Z_t\}}(Z_{t+1}) - f_{\{Z_t\}}(Z_{t+2})\big)^2,$$

and the associated martingale difference sequence $\{A_n = \mathbb{E}[M_n \mid Z_1, \ldots, Z_n] - M_n\}$. Using the fact that the $Z_t$ follow the same distribution and that $Z_{t+1}, Z_{t+2}$ are conditionally independent, we get that

$$\mathbb{E}[M_t \mid Z_1, \ldots, Z_t] = \mathbb{V}\big[f_{\{Z_t\}}(Z) \mid Z_1, \ldots, Z_t\big].$$

It follows that $\sum_t A_t = V_n - \hat{V}_n$. Noting that $M_t \in [0, 1/2]$, because $f$ takes its values in $[0, 1]$, entails $\mathbb{E}[M_t \mid Z_1, \ldots, Z_t] \in [0, 1/2]$, and furthermore each term of the sequence $\{A_n\}$ is bounded: $-1/2 \leq A_t \leq 1/2$. Consequently $\{A_n\}$ is a bounded martingale difference sequence, to which we can apply the Azuma-Hoeffding inequality (Theorem 1) to obtain

$$\mathbb{P}\big[V_n - \hat{V}_n \geq \epsilon\big] \leq \exp\Big(-\frac{2\epsilon^2}{n-2}\Big).$$

We conclude the proof by using Lemma 2.

Thanks to this first result, we can now prove Theorem 2.

Proof. (Theorem 2) Define the sequence of random variables $\{M_n\}$ such that $M_i = f_{\{Z_i\}}(Z_{i+1})$, and the associated martingale difference sequence $\{A_n = \mathbb{E}[M_n \mid Z_1, \ldots, Z_n] - M_n\}$. Remark that for $\beta$ as in Equation (12) and $s$ fixed,

$$\mathbb{P}\Big[\sum_t A_t \geq s\Big] = \mathbb{P}\Big[\sum_t A_t \geq s,\; V_n \geq \beta\Big] + \mathbb{P}\Big[\sum_t A_t \geq s,\; V_n < \beta\Big].$$

We need to upper bound the two parts of the right-hand side of the previous equation in order to get the desired bound on the left-hand side. Remark that $\mathbb{P}[\sum_t A_t \geq s, V_n \geq \beta] \leq \mathbb{P}[V_n \geq \beta]$. We use Lemma 3 to bound $\mathbb{P}[V_n \geq \beta]$ and obtain

$$\mathbb{P}\Big[\sum_t A_t \geq s,\; V_n \geq \beta\Big] \leq \frac{\delta}{2}. \quad (16)$$

Then, by using the Bernstein inequality for martingales (Lemma 1) on the martingale difference sequence $\{A_n\}$, we have

$$\mathbb{P}\Big[\sum_t A_t \geq s,\; V_n < b\Big] \leq \exp\Big(-\frac{s^2}{2b + 2s/3}\Big), \quad (17)$$

which we can write alternatively

$$\mathbb{P}\Big[\sum_t A_t \geq \sqrt{b\ln\frac{2}{\delta}} + \frac{2}{3}\ln\frac{2}{\delta},\; V_n < b\Big] \leq \frac{\delta}{2}, \quad (18)$$

thanks to Lemma 2. We conclude the proof by setting $b = \beta$ in (18) and $s = \sqrt{\beta\ln\frac{2}{\delta}} + \frac{2}{3}\ln\frac{2}{\delta}$ in Equation (16).

In the upcoming section, we use Theorem 2 in an online learning setting. More precisely, we employ our result with the intention of characterizing the mean of the risks $R(h_t)$ associated with the hypotheses learned during such a process.

4 APPLICATION TO ONLINE LEARNING

Before stating the main theorem of this section, we recall the online learning setting and define a new instantaneous estimator of the conditional variance well suited to an online learning procedure.

4.1 Online Learning and Instantaneous Conditional Variance Estimator

There is no formal definition of an online learning process, even in reference works such as Littlestone et al. [1995] or Shalev-Shwartz [2007]. One generally defines it as follows. Consider a dataset $Z_n = \{z_i\}_{i=1}^n = \{(x_i, y_i)\}_{i=1}^n$ of independent and identically distributed random variables with respect to an unknown probability distribution $\mathcal{D}$ on the product space $X \times Y$. An online learning algorithm working with the set $Z_n$ produces a set $\{h_0, \ldots, h_{n-1}\}$ of hypotheses, where each $h_t : X \to \tilde{Y}$ aims at predicting the class of a new example $x$ drawn from $\mathcal{D}$. From an initial hypothesis $h_0$ and the first datum $(x_1, y_1)$, the algorithm produces a new hypothesis $h_1$. This new hypothesis is a function of the random variable $z_1 = (x_1, y_1)$ (and of the hypothesis $h_0$). It then uses the next example $(x_2, y_2)$ and the hypothesis $h_1$ to generate a second hypothesis $h_2$, and so on. At the end of the learning process, the algorithm outputs the set $\{h_0, \ldots, h_{n-1}\}$, where each hypothesis $h_t$ is constructed using the previous hypothesis $h_{t-1}$ and the example $(x_t, y_t)$. Thus each hypothesis $h_t$ depends on the sequence of random variables $\{z_1, \ldots, z_t\}$.

We use a bounded loss function $\ell : \tilde{Y} \times Y \to \mathbb{R}^+$ to evaluate the performance of a hypothesis. The risk of the hypothesis $h_t$, denoted by $R(h_t) = \mathbb{E}[\ell(h_t(X), Y) \mid z_1, \ldots, z_t]$, is simply the expectation of the loss function $\ell$ conditionally on the random variables $\{z_1, \ldots, z_t\}$. Obviously, this quantity is unknown since $\mathcal{D}$ is unknown. In this article, we assume that the loss function is such that $\ell \in [0, 1]^{\tilde{Y} \times Y}$. It is important to notice that this assumption does not limit the scope of the results presented hereafter.

A common wish in online learning is to characterize the mean risk

$$\frac{1}{n-1}\sum_{t=0}^{n-2} R(h_t) = \frac{1}{n-1}\sum_{t=0}^{n-2} \mathbb{E}[\ell(h_t(X), Y) \mid z_1, \ldots, z_t] \quad (19)$$

associated with the hypotheses produced by an algorithm, using an online estimator $\hat{R}_n$ such that

$$\hat{R}_n = \hat{R}_n(Z_n) = \frac{1}{n-1}\sum_{t=0}^{n-2} \ell(h_t(x_{t+1}), y_{t+1}). \quad (20)$$

The hypothesis $h_{n-1}$ is discarded for purely technical reasons. The quantity $\hat{R}_n$ is often referred to as the average instantaneous risk. It is central in many online learning analyses (see for example Cesa-Bianchi et al. [2004]). Each term of the previous sum is an estimator of the risk $R(h_t)$ associated with the hypothesis $h_t$ (conditionally on the examples $z_1, \ldots, z_t$):

$$\mathbb{E}[\ell(h_t(x_{t+1}), y_{t+1}) \mid z_1, \ldots, z_t] = R(h_t).$$

The term instantaneous comes from the fact that $\hat{R}_n$ only relies on the example $(x_{t+1}, y_{t+1})$, appearing at iteration $t+1$, to evaluate the risk of $h_t$. A state-of-the-art result due to Cesa-Bianchi and Gentile [2008] links $\hat{R}_n$ to $\frac{1}{n-1}\sum_t R(h_t)$:

Proposition 1. Let $h_0, \ldots, h_{n-1}$ be the set of hypotheses generated by an online learning algorithm using the bounded loss function $\ell \in [0, 1]^{\tilde{Y} \times Y}$. Then, for all $0 < \delta \leq 1$, we have with probability at least $1 - \delta$:

$$\frac{1}{n-1}\sum_{t=0}^{n-2} R(h_t) \leq \hat{R}_n + 2\sqrt{\frac{\hat{R}_n}{n-1}\ln\frac{(n-1)\hat{R}_n + 3}{\delta}} + \frac{36}{n-1}\ln\frac{(n-1)\hat{R}_n + 3}{\delta}. \quad (21)$$

Remark 1. The Gibbs classifier [McAllester, 1999] is a stochastic classifier obtained by randomly selecting a hypothesis from a set of hypotheses, given a probability distribution on these hypotheses. $\frac{1}{n-1}\sum_t R(h_t)$ can thus be seen as the risk of the Gibbs classifier for a uniform distribution on the set $\{h_0, \ldots, h_{n-2}\}$.

The key to the result exposed in the previous proposition lies in the use of a second-order concentration inequality for martingales (proposed by Freedman [1975]), which introduces the sum $V_n$ of the conditional variances of the loss of each hypothesis:

$$V_n = \sum_{t=0}^{n-2} \mathbb{V}[\ell(h_t(x_{t+1}), y_{t+1}) \mid z_1, \ldots, z_t].$$

As the mean risk, this quantity cannot be computed since the distribution $\mathcal{D}$ is unknown. Cesa-Bianchi and Gentile [2008] proposed to upper bound this sum using a stratification process in order to get their inequality. In this section we improve their bound by employing Theorem 2 together with an online estimator $\hat{V}_n$ of the sum $V_n$, which allows for a better control of the former. The average empirical instantaneous variance $\hat{V}_n$ is simply defined as

$$\hat{V}_n = \frac{1}{2(n-2)}\sum_{t=0}^{n-3}\big(\ell(h_t(x_{t+1}), y_{t+1}) - \ell(h_t(x_{t+2}), y_{t+2})\big)^2. \quad (22)$$

Again, we discard the hypotheses $h_{n-2}$ and $h_{n-1}$ from this quantity for technical reasons. Each term of this sum is an estimator of the conditional variance of $\ell(h_t(x), y)$:

$$\mathbb{E}\big[\big(\ell(h_t(x_{t+1}), y_{t+1}) - \ell(h_t(x_{t+2}), y_{t+2})\big)^2 \mid z_1, \ldots, z_t\big] = 2\,\mathbb{V}[\ell(h_t(x), y) \mid z_1, \ldots, z_t]. \quad (23)$$

$\hat{V}_n$ may easily be computed during an online learning process, as sketched below, and plays a central role in the theorem we present here.
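Concretely, both statistics can be accumulated from the loss values observed during the run. The following sketch (ours, not from the paper) mirrors Equations (20) and (22), with `losses[t]` standing for $\ell(h_t(x_{t+1}), y_{t+1})$ and `losses_next[t]` for $\ell(h_t(x_{t+2}), y_{t+2})$:

```python
import numpy as np

def instantaneous_estimators(losses, losses_next):
    """losses: n-1 values, t = 0, ..., n-2; losses_next: n-2 values, t = 0, ..., n-3."""
    losses, losses_next = np.asarray(losses), np.asarray(losses_next)
    r_hat = losses.mean()                         # Eq. (20)
    d = losses[: len(losses_next)] - losses_next  # h_t's losses on z_{t+1} and z_{t+2}
    v_hat = np.sum(d ** 2) / (2 * len(d))         # Eq. (22)
    return r_hat, v_hat
```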

4.2 Empirical Bernstein Inequalities for Online Learning

In the following theorem, we use Theorem 2 and the instantaneous estimators $\hat{R}_n$ and $\hat{V}_n$ to bound $\frac{1}{n-1}\sum_t R(h_t)$, the mean of the risks of the hypotheses learned by an online algorithm.

Theorem 3 (Empirical Bernstein inequality for online learning). Let $h_0, \ldots, h_{n-1}$ be the set of hypotheses generated from the sample $Z_n = \{z_i\}_{i=1}^n = \{(x_i, y_i)\}_{i=1}^n$ of i.i.d. random variables by an online learning algorithm using the bounded loss function $\ell \in [0, 1]^{\tilde{Y} \times Y}$. Then, for all $0 < \delta \leq 1$, we have with probability at least $1 - \delta$:

$$\frac{1}{n-1}\sum_{t=0}^{n-2} R(h_t) \leq \hat{R}_n + \frac{\sqrt{\beta\ln\frac{2}{\delta}}}{n-1} + \frac{2}{3}\frac{\ln\frac{2}{\delta}}{n-1}, \quad (24)$$

where

$$\beta = (n-2)\hat{V}_n + \sqrt{\frac{n-2}{2}\ln\frac{2}{\delta}}. \quad (25)$$

Proof. (Theorem 3) The proof is direct. Consider the set $Z_n = \{z_i\}_{i=1}^n = \{(x_i, y_i)\}_{i=1}^n$ of i.i.d. random variables and the family of functions $\{\ell(h_i(\cdot), \cdot)\}_{i=0}^{n-1}$, where each function $\ell(h_t(\cdot), \cdot)$ only depends on the variables $z_1, \ldots, z_t$, by definition of $h_t$. Noting that $z_{t+1}, z_{t+2}$ are independent of $z_1, \ldots, z_t$ (by definition of $Z_n$), we simply apply Theorem 2 and adjust the indexes to obtain the result.

We now want to emphasize the comparison with the bound of Cesa-Bianchi and Gentile [2008] (Equation (21)). Our result firstly improves the constants involved in the bound, which is very appreciable when the bound is computed with a small number of hypotheses (when $n$ is small, the last term in the bound cannot be neglected). In order to analyze the behavior of our result when we have a sufficient number of hypotheses to omit the last term, we have to pay attention to

$$\frac{\sqrt{\beta\ln\frac{2}{\delta}}}{n-1} \leq \sqrt{\frac{\hat{V}_n}{n-1}\ln\frac{2}{\delta}} + \Big(\frac{1}{2(n-1)^3}\Big)^{1/4}\ln^{3/4}\frac{2}{\delta}, \quad (26)$$

where we used the fact that $\sqrt{a + b} \leq \sqrt{a} + \sqrt{b}$ to get the upper bound. Thus, omitting the constant terms, our bound tends to $\hat{R}_n$ at least in

$$O\Big(\sqrt{\frac{\hat{V}_n}{n}} + \frac{1}{n^{3/4}} + \frac{1}{n}\Big),$$

while the one of Cesa-Bianchi and Gentile [2008] tends to $\hat{R}_n$ in

$$O\Big(\sqrt{\frac{\hat{R}_n\ln(n\hat{R}_n)}{n}} + \frac{\ln(n\hat{R}_n)}{n}\Big).$$

In order to study the difference between the two rates of convergence, we need to compare the two terms $\hat{V}_n$ and $\hat{R}_n$:

$$\hat{V}_n = \frac{1}{2(n-2)}\sum_{t=0}^{n-3}\big(\ell(h_t(x_{t+1}), y_{t+1}) - \ell(h_t(x_{t+2}), y_{t+2})\big)^2$$
$$\leq \frac{1}{2(n-2)}\sum_{t=0}^{n-3}\big(\ell(h_t(x_{t+1}), y_{t+1})^2 + \ell(h_t(x_{t+2}), y_{t+2})^2\big)$$
$$\leq \frac{1}{2(n-2)}\sum_{t=0}^{n-3}\big(\ell(h_t(x_{t+1}), y_{t+1}) + \ell(h_t(x_{t+2}), y_{t+2})\big).$$

The last inequality is obtained by using $\ell \in [0, 1]^{\tilde{Y} \times Y}$. Suppose that the error made by each hypothesis $h_t$ on the example $z_{t+2}$ is not too different from the error made by the same hypothesis on $z_{t+1}$: $\ell(h_t(x_{t+2}), y_{t+2}) \approx \ell(h_t(x_{t+1}), y_{t+1})$. In this case, the previous right-hand side is almost $\frac{1}{n-2}\sum_{t=0}^{n-3}\ell(h_t(x_{t+1}), y_{t+1}) \approx \hat{R}_n$, and thus it follows that $\hat{V}_n \lesssim \hat{R}_n$.

A setting studied by Cesa-Bianchi and Gentile [2008] is when the empirical cumulative risk $n\hat{R}_n$ is in $O(1)$, i.e. is bounded. Their result then reaches an asymptotic behavior in $O(1/n)$ (the terms involving $\ln(n\hat{R}_n)$ vanish, being constant). With the assumption that $n\hat{V}_n$ is in $O(1)$ as well, our bound shows a slightly worse rate of convergence, in $O(1/n^{3/4})$. However, as soon as the cumulative risk $n\hat{R}_n$ increases with $n$, the bound of Cesa-Bianchi and Gentile [2008] converges at the rate $O(\sqrt{\ln n / n})$, whereas ours reaches an $O(\sqrt{1/n})$ rate.
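To make the comparison tangible, here is a small numerical sketch (ours; the values of $\hat{R}_n$ and $\hat{V}_n$ below are hypothetical) that evaluates the right-hand sides of (21) and of (24)-(25) for growing $n$:

```python
import numpy as np

def cbg_bound(r_hat, n, delta):
    """Right-hand side of Proposition 1, Eq. (21)."""
    log_term = np.log(((n - 1) * r_hat + 3) / delta)
    return r_hat + 2 * np.sqrt(r_hat * log_term / (n - 1)) + 36 * log_term / (n - 1)

def emp_bernstein_bound(r_hat, v_hat, n, delta):
    """Right-hand side of Theorem 3, Eqs. (24)-(25)."""
    log_term = np.log(2 / delta)
    beta = (n - 2) * v_hat + np.sqrt((n - 2) / 2 * log_term)
    return r_hat + np.sqrt(beta * log_term) / (n - 1) + 2 * log_term / (3 * (n - 1))

r_hat, v_hat, delta = 0.2, 0.05, 0.05  # hypothetical estimates
for n in (100, 1_000, 10_000, 100_000):
    print(n, cbg_bound(r_hat, n, delta), emp_bernstein_bound(r_hat, v_hat, n, delta))
```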

Case of a Convex Loss Function. When an online algorithm uses a convex loss function $\ell$, we can use Theorem 3 to characterize the risk associated with the mean hypothesis $\bar{h}$:

$$\bar{h} = \frac{1}{n-1}\sum_{t=0}^{n-2} h_t. \quad (27)$$

When the decision space $\tilde{Y}$ associated with the classifiers $h_t : X \to \tilde{Y}$ is convex, the hypothesis $\bar{h}$ belongs to the same function class as each of the $h_t$: $\bar{h} : X \to \tilde{Y}$. The mean hypothesis is thus a deterministic classifier, as opposed to the Gibbs classifier defined earlier, which shares the same bound on its risk.

Corollary 2. Let $h_0, \ldots, h_{n-1}$ be the set of all the hypotheses generated by an online learning algorithm using a convex loss function $\ell$ such that $\ell \in [0, 1]^{\tilde{Y} \times Y}$. Then, for all $0 < \delta \leq 1$, with probability at least $1 - \delta$,

$$R(\bar{h}) \leq \hat{R}_n + \frac{\sqrt{\beta\ln\frac{2}{\delta}}}{n-1} + \frac{2}{3}\frac{\ln\frac{2}{\delta}}{n-1}, \quad (28)$$

where

$$\beta = (n-2)\hat{V}_n + \sqrt{\frac{n-2}{2}\ln\frac{2}{\delta}}.$$

Proof. Using Jensen's inequality and the linearity of the expectation, it is easy to show that

$$R(\bar{h}) = \mathbb{E}\Big[\ell\Big(\frac{1}{n-1}\sum_{t=0}^{n-2} h_t(X), Y\Big)\Big] \leq \frac{1}{n-1}\sum_{t=0}^{n-2}\mathbb{E}[\ell(h_t(X), Y)] = \frac{1}{n-1}\sum_{t=0}^{n-2} R(h_t).$$

To conclude the proof, we just need to combine this result with Theorem 3.

5 BOUNDING THE AVERAGE RISK OF PEGASOS

In this section, we use the previous corollary to derive a bound on the mean risk of the hypotheses generated by the Pegasos algorithm [Shalev-Shwartz et al., 2011].

5.1 Pegasos

Pegasos is an algorithm designed to solve the primal SVM problem. Recall that given a sample $Z_n = \{z_i\}_{i=1}^n = \{(x_i, y_i)\}_{i=1}^n$, the SVM objective function is given by

$$F(w) := \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \ell_{\mathrm{hinge}}(w, x_i, y_i), \quad (29)$$

where $\ell_{\mathrm{hinge}}(w, x, y) = \max\{0, 1 - y\langle w, x\rangle\}$. Pegasos works in an online fashion by performing a stochastic subgradient descent on the SVM objective function. At time $t$, Pegasos randomly selects an example $z_{i_t} = (x_{i_t}, y_{i_t})$ and aims at minimizing the approximation

$$f(w_t, z_{i_t}) = \frac{\lambda}{2}\|w_t\|_2^2 + \ell_{\mathrm{hinge}}(w_t, x_{i_t}, y_{i_t})$$

of the SVM objective function. It considers the following subgradient of the previous function, taken at point $w_t$:

$$\nabla_t = \nabla_{w_t} f(w_t, z_{i_t}) = \lambda w_t - \mathbb{1}[y_{i_t}\langle w_t, x_{i_t}\rangle < 1]\, y_{i_t} x_{i_t},$$

and it updates the current weight vector $w_t$ to $w_{t+1}$ by $w_{t+1} \leftarrow w_t - \eta_t\nabla_t$, using a step $\eta_t = 1/(\lambda(t+1))$. So we get at each iteration the vector

$$w_{t+1} \leftarrow (1 - \eta_t\lambda)w_t + \eta_t\,\mathbb{1}[y_{i_t}\langle w_t, x_{i_t}\rangle < 1]\, y_{i_t} x_{i_t}.$$

An (optional) projection step, detailed in the sequel, ends iteration $t$. Pegasos stops when $t = T$, where $T$ is a number of iterations given as a parameter. Thus Pegasos can be seen as an online algorithm working with the sequence of examples $z_{i_1}, \ldots, z_{i_T}$ constructed by picking an example from $Z_n$ at random at each iteration. Algorithm 1 sums up the different steps of Pegasos.

Algorithm 1 Pegasos
Require: $\{(x_i, y_i)\}_{i=1}^n$, $\lambda > 0$ and $T > 0$
Ensure: $w_{T+1}$
  $w_0 \leftarrow 0$
  for $t \leftarrow 0$ to $T$ do
    Pick $i_t$ uniformly at random in $\{1, \ldots, n\}$
    Define $\eta_t = \frac{1}{\lambda(t+1)}$
    if $y_{i_t}\langle w_t, x_{i_t}\rangle < 1$ then
      $w_{t+1} \leftarrow (1 - \eta_t\lambda)w_t + \eta_t y_{i_t} x_{i_t}$
    else
      $w_{t+1} \leftarrow (1 - \eta_t\lambda)w_t$
    end if
    $w_{t+1} \leftarrow \min\Big[1, \frac{1/\sqrt{\lambda}}{\|w_{t+1}\|_2}\Big]\, w_{t+1}$
  end for

5.2 Bounding the Mean Risk of the Hypotheses Generated by Pegasos

To apply Theorem 3, we need the loss function to be bounded. It can be shown that $w^* = \arg\min_w F(w)$ satisfies $\|w^*\|_2 \leq 1/\sqrt{\lambda}$. Thus, we can limit the search space to the ball of radius $1/\sqrt{\lambda}$ by incorporating the projection step mentioned above:

$$w_{t+1} \leftarrow \min\Big[1, \frac{1/\sqrt{\lambda}}{\|w_{t+1}\|_2}\Big]\, w_{t+1}.$$

With the assumption that $\|x\|_2 \leq M$, we can bound the hinge loss function:

$$\ell_{\mathrm{hinge}}(w, x, y) \leq 1 + \|x\|_2\|w\|_2 \leq 1 + \frac{M}{\sqrt{\lambda}} = C.$$

Thereby, the loss function used by Pegasos can be rescaled to satisfy the assumption of Theorem 3, and we can use it to prove the following corollary.

Corollary 3. Let $w_0, \ldots, w_T$ be the sequence of weight vectors generated by the Pegasos algorithm from a sample $Z_n$ where $\|x_i\|_2 \leq M$ for all $i$. Then, for all $0 < \delta \leq 1$, we have with probability at least $1 - \delta$:

$$\frac{1}{T-1}\sum_{t=0}^{T-2} R(w_t) \leq \hat{R}_T + \frac{C\sqrt{\beta\ln\frac{2}{\delta}}}{T-1} + \frac{2C}{3}\frac{\ln\frac{2}{\delta}}{T-1}, \quad \text{where} \quad \beta = \frac{T-2}{C^2}\hat{V}_T + \sqrt{\frac{T-2}{2}\ln\frac{2}{\delta}}, \quad (30)$$

and where $\hat{R}_T$ and $\hat{V}_T$ denote the estimators (20) and (22) computed on the sequence $z_{i_1}, \ldots, z_{i_T}$.
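A compact Python rendition of Algorithm 1 with the projection step (a sketch under the conventions above; the data layout and the uniform sampling routine are our assumptions):

```python
import numpy as np

def pegasos(X, y, lam, T, rng=None):
    """Return the sequence w_0, ..., w_{T+1} of Pegasos weight vectors
    (Algorithm 1), including the optional projection onto the ball of
    radius 1/sqrt(lam)."""
    if rng is None:
        rng = np.random.default_rng()
    n, d = X.shape
    w = np.zeros(d)
    ws = [w.copy()]                                # w_0 = 0
    for t in range(T + 1):                         # t = 0, ..., T
        i = rng.integers(n)                        # pick i_t uniformly at random
        eta = 1.0 / (lam * (t + 1))                # step size eta_t
        margin = y[i] * (X[i] @ ws[-1])            # y_{i_t} <w_t, x_{i_t}>
        w = (1 - eta * lam) * w
        if margin < 1:                             # hinge loss is active
            w += eta * y[i] * X[i]
        norm = np.linalg.norm(w)
        w *= min(1.0, (1.0 / np.sqrt(lam)) / (norm or 1.0))  # projection step
        ws.append(w.copy())
    return ws
```

The intermediate vectors `ws[:-1]` are exactly the hypotheses $w_0, \ldots, w_T$ whose mean risk Corollary 3 controls; the instantaneous estimators of Section 4.1 can be accumulated from their hinge losses on the next sampled examples.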

[Figure 1 shows four log-scale panels, one per value of $\lambda$; each plots the Cesa-Bianchi (2008) bound and the empirical Bernstein bound (this work).]

Figure 1: Comparison of the bounds from Proposition 1 and Corollary 3 computed for the Pegasos algorithm on a toy linearly separable dataset.

5.3 Proof of Concept

In this section we want to highlight experimentally the performance of our empirical Bernstein inequality applied to online learning. To do so, we compare the bound provided by Corollary 3 for the Pegasos algorithm to the one given in Proposition 1. We use a linearly separable toy dataset and compare the convergence of the empirical risk to the mean risk of the hypotheses $w_0, \ldots, w_T$. We generate random vectors $x_i \in [-1, 1]^2$, to which we assign the class $y_i = \mathrm{sign}(\langle w^*, x_i\rangle) \in \{+1, -1\}$ for a vector $w^* \in [-1, 1]^2$, also randomly generated. We work with a learning sample of such points and report in Figure 1 the values of the right-hand sides appearing in Proposition 1 [Cesa-Bianchi and Gentile, 2008] and in Corollary 3, computed with a confidence of 95% ($\delta = 0.05$). We ran the experiment 20 times for several values of the parameter $\lambda$ and averaged the results. We can see that our inequality is far tighter than the one of Cesa-Bianchi and Gentile [2008] during the first iterations, as announced by the theoretical comparison done in Section 4.2. The gap between the two inequalities narrows as the number of hypotheses considered increases, but remains in our favor.

6 CONCLUSION AND OUTLOOKS

In this article, we presented a new empirical Bernstein concentration inequality for martingales. We applied this result to the online learning setting to bound the mean risk of the hypotheses learned during such learning processes. Because we introduce a new instantaneous variance estimator, our inequality is well suited to the online learning setting and improves the state of the art. This improvement is mainly noticeable when the number of hypotheses considered is small, as shown in the empirical section of this work.

This work opens many perspectives. First of all, one can think of a new online learning algorithm that aims at minimizing our empirical Bernstein bound, as is done in the batch setting [Variance Penalizing AdaBoost, Shivaswamy and Jebara, 2011], for example. Then, it would be of interest to derive new kinds of bounds for online algorithms taking advantage of our result (for example on the excess risk, as is done in the work of Kakade and Tewari [2009]). The last perspective that we want to mention is the comparison of our bound with the very recent PAC-Bayes-Empirical-Bernstein inequality of Tolstikhin and Seldin [2013].

References

Kazuoki Azuma. Weighted Sums of Certain Dependent Random Variables. Tohoku Mathematical Journal, 19(3):357-367, 1967.

George Bennett. Probability Inequalities for the Sum of Independent Random Variables. Journal of the American Statistical Association, 57(297):33-45, 1962.

Nicolò Cesa-Bianchi and Claudio Gentile. Improved Risk Tail Bounds for On-Line Algorithms. IEEE Transactions on Information Theory, 54(1):386-390, 2008.

Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the Generalization Ability of On-Line Learning Algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.

David A. Freedman. On Tail Probabilities for Martingales. The Annals of Probability, 3(1):100-118, 1975.

Wassily Hoeffding. Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

Sham M. Kakade and Ambuj Tewari. On the Generalization Ability of Online Strongly Convex Programming Algorithms. In Advances in Neural Information Processing Systems 21 (NIPS '08), 2009.

Nicholas Littlestone, Philip Long, and Manfred Warmuth. On-line Learning of Linear Functions. Computational Complexity, 5(1):1-23, 1995.

Andreas Maurer and Massimiliano Pontil. Empirical Bernstein Bounds and Sample Variance Penalization. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT '09), 2009.

David A. McAllester. PAC-Bayesian Model Averaging. In Proceedings of the 12th Annual Conference on Computational Learning Theory (COLT '99), pages 164-170, 1999.

Thomas Peel, Sandrine Anthoine, and Liva Ralaivola. Empirical Bernstein Inequalities for U-Statistics. In Advances in Neural Information Processing Systems 23 (NIPS '10), pages 1903-1911, 2010.

Shai Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University, 2007.

Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal Estimated Sub-Gradient Solver for SVM. Mathematical Programming, 127(1):3-30, 2011.

Pannagadatta K. Shivaswamy and Tony Jebara. Variance Penalizing AdaBoost. In Advances in Neural Information Processing Systems 24 (NIPS '11), 2011.

Ilya Tolstikhin and Yevgeny Seldin. PAC-Bayes-Empirical-Bernstein Inequality. In Advances in Neural Information Processing Systems (NIPS '13), 2013.
