Empirical Bernstein Inequality for Martingales: Application to Online Learning


Thomas Peel, Sandrine Anthoine, Liva Ralaivola

To cite this version: Thomas Peel, Sandrine Anthoine, Liva Ralaivola. Empirical Bernstein Inequality for Martingales: Application to Online Learning. 2013. <hal> Submitted on 5 Nov 2013.

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Empirical Bernstein Inequality for Martingales: Application to Online Learning

Thomas Peel (1,2), Sandrine Anthoine (1), Liva Ralaivola (2)
(1) Aix-Marseille Université - CNRS, LATP, UMR 7353, Marseille, France
(2) Aix-Marseille Université - CNRS, LIF, UMR 7279, Marseille, France

Abstract

In this article we present a new empirical Bernstein inequality for bounded martingale difference sequences. This inequality refines the one of Freedman [1975] and is then used to bound the average risk of the hypotheses produced during an online learning process. We provide theoretical and empirical evidence of the tightness of our result compared with the state-of-the-art bound of Cesa-Bianchi and Gentile [2008].

1 INTRODUCTION

The motivation behind this work comes from the wish to analyze the risk of the models (or hypotheses) produced by an online learning algorithm. Such an algorithm works incrementally on a sequence of independent and identically distributed (i.i.d.) random variables. At each step, it receives an example that is used to update the current model parameters. Once this update is done, the performance of the new hypothesis is measured by evaluating its loss on the next example of the sequence, and so on. By averaging these losses, one can define a statistic $\hat{R}_n$ called the average empirical instantaneous risk. The risk of a model is simply the expectation of its loss on a new unseen example, given the sequence of data used in its construction.

In their recent works, Cesa-Bianchi et al. [2004] and Cesa-Bianchi and Gentile [2008] show how the statistic $\hat{R}_n$ can be used to select a hypothesis with a low risk. The key tool in their analyses is the use of concentration inequalities for martingales (Azuma-Hoeffding, Bernstein). Indeed, the dependencies between the hypotheses that are inherent to online learning processes prevent the use of standard concentration inequalities, which require independence. Bernstein (second-order) inequalities are known to be tighter than their first-order counterparts. However, the variance is in general unknown and needs to be upper bounded. Recent works in the batch setting have proposed an empirical (data-dependent) version of the Bernstein inequality [Maurer and Pontil, 2009, Peel et al., 2010], where an estimator of the variance is used as the upper bound. However, these inequalities are not applicable to the online learning setting.

In this paper, we propose a new Bernstein inequality for bounded martingale difference sequences (Theorem 2) that takes advantage of the statistic $\hat{V}_n$, an instantaneous estimator of the variance. This inequality is then used to refine the tail bound of Cesa-Bianchi and Gentile [2008]. Briefly, we show that under the same assumptions they make, the average risk of the hypotheses produced by an online learning algorithm is bounded with high probability by

$$\hat{R}_n + \frac{\sqrt{\beta \ln\frac{2}{\delta}}}{n-1} + \frac{2}{3}\frac{\ln\frac{2}{\delta}}{n-1},$$

where $\beta$ is a function of $\hat{V}_n$ that we will detail later. This bound can be applied to any online algorithm, and as an example we show how to use it to characterize the average risk of the hypotheses produced by Pegasos [Shalev-Shwartz et al., 2011], a stochastic method for solving the SVM optimization problem. We want to emphasize that the scope of our new empirical Bernstein inequality for martingales goes far beyond any application to online learning processes.

The paper is organized as follows. In Section 2 we recall a few fundamental notions about martingales and the classical concentration inequalities associated with this kind of random processes. Section 3 presents the main result of this paper, a concentration inequality that takes advantage of second-order empirical information in the martingale setting. This result is then applied in Section 4 to get a bound on the mean generalization error made by the hypotheses learned during an online learning process; this bound substantially improves the results mentioned above. We end this paper with Section 5, a direct consequence of the previous inequalities that lets us bound the mean risk of the weight vectors generated during a run of the Pegasos algorithm.
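As a concrete illustration of the protocol just described, here is a minimal Python sketch (ours, not from the paper); the `update` and `loss` callables are placeholders for a concrete algorithm and loss function:

```python
import numpy as np

def average_instantaneous_risk(examples, h0, update, loss):
    """Run the online protocol and return R_hat, the average of the losses
    incurred by each intermediate hypothesis on the next incoming example."""
    h, losses = h0, []
    for z in examples[:-1]:        # z_{t+1}, for t = 0, ..., n-2
        losses.append(loss(h, z))  # evaluate h_t on an example it has not seen
        h = update(h, z)           # build h_{t+1} from h_t and z_{t+1}
    return np.mean(losses)
```

Note that each example is used first for evaluation and only then for training, which is exactly what makes each term of the average an unbiased estimate of the risk of the current hypothesis.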

2 PRELIMINARIES

This section briefly reviews basic notions about martingale theory and the classical concentration inequalities associated with this kind of stochastic processes.

2.1 Martingale and Martingale Difference Sequence

Definition 1 (Martingale). A sequence $\{M_n : 0 \leq n < \infty\}$ of random variables is said to be a martingale with respect to the sequence of random variables $\{X_n : 1 \leq n < \infty\}$ if the sequence $\{M_0, \ldots, M_n\}$ has two basic properties. The first one is that for each $n$ there is a function $f_n : \mathbb{R}^n \to \mathbb{R}$ such that $M_n = f_n(X_1, X_2, \ldots, X_n)$. The second property is that the sequence $\{M_n\}$ satisfies, for all $n$:

$$\mathbb{E}[|M_n|] < \infty \quad (1)$$
$$\mathbb{E}[M_{n+1} \mid X_1, \ldots, X_n] = M_n. \quad (2)$$

Given this definition of a martingale, we now define a martingale difference sequence.

Definition 2 (Martingale difference sequence). We say that a sequence of random variables $\{Y_n : 0 < n < \infty\}$ is a martingale difference sequence (mds) if the sequence $\{Y_n\}$ satisfies the following properties for all $n$:

$$\mathbb{E}[|Y_n|] < \infty \quad (3)$$
$$\mathbb{E}[Y_{n+1} \mid Y_1, \ldots, Y_n] = 0. \quad (4)$$

By construction, this implies that if the sequence $\{M_n\}$ is a martingale, then the sequence $\{Y_n = M_n - M_{n-1}\}$ is a martingale difference sequence. We now introduce two well-known concentration inequalities for the sum of the increments of an mds that we will use in the next sections.

2.2 Azuma-Hoeffding Inequality

The Azuma-Hoeffding inequality [Hoeffding, 1963, Azuma, 1967] bounds the deviation of a martingale with bounded increments from its initial value $M_0$.

Theorem 1 (Azuma-Hoeffding inequality). Let $\{M_n\}$ be a martingale and define $\{Y_n = M_n - M_{n-1}\}$ the associated martingale difference sequence, such that $|Y_i| \leq c_i$ for all $i$. Then, for all $\epsilon > 0$,

$$\mathbb{P}\Big[\sum_{i=1}^n Y_i = M_n - M_0 \geq \epsilon\Big] \leq \exp\Big(-\frac{\epsilon^2}{2\sum_{i=1}^n c_i^2}\Big). \quad (5)$$

This result extends the Hoeffding inequality [Hoeffding, 1963] to the case where the random variables of interest are not necessarily independent.

Corollary 1. Let $X_1, \ldots, X_n$ be a sequence of random variables such that for all $i$ we have $|\mathbb{E}[X_i \mid X_1, \ldots, X_{i-1}] - X_i| \leq c_i$. Set $S_n = \sum_{i=1}^n X_i$; then for all $\epsilon > 0$,

$$\mathbb{P}\Big[\sum_{i=1}^n \mathbb{E}[X_i \mid X_1, \ldots, X_{i-1}] - S_n \geq \epsilon\Big] \leq \exp\Big(-\frac{\epsilon^2}{2\sum_{i=1}^n c_i^2}\Big). \quad (6)$$

Proof. A direct application of Theorem 1 to the martingale difference sequence $\{Y_n\}$ such that $Y_i = \mathbb{E}[X_i \mid X_1, \ldots, X_{i-1}] - X_i$ gives the result.

2.3 Bernstein Inequality for Martingales

The inequality we recall in the following lemma is a consequence of the Bernstein inequality for martingales given in Freedman [1975]. This lemma extends the classical Bernstein inequality [Bennett, 1962], which requires independence between the random variables $X_i$ under consideration. This limitation is overcome by looking at the martingale difference sequence $\{Y_n = \mathbb{E}[X_n \mid X_1, \ldots, X_{n-1}] - X_n\}$.

Lemma 1 (Bernstein inequality for martingales). Suppose $X_1, \ldots, X_n$ is a sequence of random variables such that $0 \leq X_i \leq 1$. Define the martingale difference sequence $\{Y_n = \mathbb{E}[X_n \mid X_1, \ldots, X_{n-1}] - X_n\}$ and denote by $K_n$ the sum of the conditional variances:

$$K_n = \sum_{t=1}^n \mathbb{V}[X_t \mid X_1, \ldots, X_{t-1}]. \quad (7)$$

Let $S_n = \sum_{i=1}^n X_i$; then for all $\epsilon, k \geq 0$,

$$\mathbb{P}\Big[\sum_{t=1}^n \mathbb{E}[X_t \mid X_1, \ldots, X_{t-1}] - S_n \geq \epsilon,\; K_n \leq k\Big] \leq \exp\Big(-\frac{\epsilon^2}{2k + 2\epsilon/3}\Big). \quad (8)$$

As we shall see, this lemma is central to our analysis, as it was in the work of Cesa-Bianchi and Gentile [2008].
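As a quick numerical illustration (ours, not from the paper), the following snippet checks the Azuma-Hoeffding bound (5) on the simplest bounded mds, i.i.d. Rademacher increments with $c_i = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, runs = 200, 20.0, 100_000

# Y_i uniform on {-1, +1} is a martingale difference sequence with |Y_i| <= 1.
sums = rng.choice([-1.0, 1.0], size=(runs, n)).sum(axis=1)  # M_n - M_0

empirical = np.mean(sums >= eps)
azuma = np.exp(-eps ** 2 / (2 * n))  # Theorem 1 with c_i = 1 for all i
print(f"P[M_n - M_0 >= {eps:g}] ~= {empirical:.4f} <= bound {azuma:.4f}")
```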

3 EMPIRICAL BERNSTEIN INEQUALITY FOR MARTINGALES

Second-order Bernstein inequalities are known to be tighter than their first-order counterparts thanks to the variance term. However, in practice, this term often cannot be evaluated, and it is common to upper bound it by the expectation (under the assumption that the random variables of interest are bounded by 1) in order to compute the whole inequality. We propose another approach, based on the use of an instantaneous estimator of the variance. By doing so, we hope to get a tighter inequality without any a priori assumption on the underlying distribution of the random variables.

This section presents the main result of the paper: a refined version of the Bernstein inequality for martingales recalled above, in which the sum of conditional variances is upper bounded using an instantaneous estimator. We first introduce the inequality reversal lemma, which allows us to transform tail inequalities into upper bounds (or confidence intervals). This lemma was used by Peel et al. [2010] to prove their empirical Bernstein inequality for U-statistics.

Lemma 2 (Inequality reversal lemma). Let $X$ be a random variable and $a, b > 0$, $c, d \geq 0$ such that

$$\forall \epsilon > 0, \quad \mathbb{P}[|X| \geq \epsilon] \leq a \exp\Big\{-\frac{b\epsilon^2}{c + d\epsilon}\Big\}. \quad (9)$$

Then, with probability at least $1 - \delta$,

$$|X| \leq \sqrt{\frac{c}{b}\ln\frac{a}{\delta}} + \frac{d}{b}\ln\frac{a}{\delta}. \quad (10)$$

Proof. Solving for $\epsilon$ such that the right-hand side of (9) is equal to $\delta$ gives

$$\epsilon = \frac{1}{2b}\Big(d\ln\frac{a}{\delta} + \sqrt{d^2\ln^2\frac{a}{\delta} + 4bc\ln\frac{a}{\delta}}\Big).$$

Using $\sqrt{a + b} \leq \sqrt{a} + \sqrt{b}$ gives an upper bound on $\epsilon$ and provides the result.

We use the notation $f_{\{Z_t\}}$ to indicate a function determined by the sequence of random variables $\{Z_t\} = \{Z_1, \ldots, Z_t\}$, i.e. the expression of $f_{\{Z_t\}}$ is fixed by the sequence $\{Z_t\}$. The next theorem is the main result of this paper.

Theorem 2 (Empirical Bernstein inequality for martingales). Let $Z_1, \ldots, Z_n$ be a sequence of random variables following the same probability distribution $\mathcal{D}$, such that $Z_{t+1}$ and $Z_{t+2}$ are conditionally independent given $\{Z_t\}$, for all $t$. Suppose $\{f_{\{Z_t\}}\}$ is a family of functions which take their values in $[0, 1]$. Then for all $0 < \delta \leq 1$ we have, with probability at least $1 - \delta$,

$$\frac{1}{n-2}\sum_{t=1}^{n-2} \mathbb{E}\big[f_{\{Z_t\}}(Z_{t+1}) \mid Z_1, \ldots, Z_t\big] \leq \frac{1}{n-2}\sum_{t=1}^{n-2} f_{\{Z_t\}}(Z_{t+1}) + \frac{\sqrt{\beta\ln\frac{2}{\delta}}}{n-2} + \frac{2}{3}\frac{\ln\frac{2}{\delta}}{n-2}, \quad (11)$$

where

$$\beta = \hat{V}_n + \sqrt{\frac{n-2}{2}\ln\frac{2}{\delta}} \quad (12)$$

and

$$\hat{V}_n = \frac{1}{2}\sum_{t=1}^{n-2}\big(f_{\{Z_t\}}(Z_{t+1}) - f_{\{Z_t\}}(Z_{t+2})\big)^2. \quad (13)$$

In a nutshell, the message carried by this theorem is that it is possible to use an instantaneous variance estimator to quantify the deviation of the sum $\sum_{t=1}^{n-2} f_{\{Z_t\}}(Z_{t+1})$ from its expected value.
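For illustration, here is a small Python helper (a sketch of ours; the names are ours) that evaluates the right-hand side of (11) from the two streams of values $f_{\{Z_t\}}(Z_{t+1})$ and $f_{\{Z_t\}}(Z_{t+2})$, $t = 1, \ldots, n-2$:

```python
import numpy as np

def empirical_bernstein_bound(f_next, f_next2, delta):
    """f_next[t] = f_{Z_t}(Z_{t+1}) and f_next2[t] = f_{Z_t}(Z_{t+2}),
    both arrays of length n - 2 with values in [0, 1]."""
    f_next, f_next2 = np.asarray(f_next), np.asarray(f_next2)
    m = len(f_next)                                    # m = n - 2 terms
    log_term = np.log(2 / delta)
    v_hat = 0.5 * np.sum((f_next - f_next2) ** 2)      # Eq. (13)
    beta = v_hat + np.sqrt(m / 2 * log_term)           # Eq. (12)
    return (f_next.mean()                              # empirical average
            + np.sqrt(beta * log_term) / m             # second-order term
            + 2 * log_term / (3 * m))                  # first-order term, Eq. (11)
```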

In order to prove the previous concentration inequality, we need an intermediate result about the conditional variance estimator introduced in Equation (13). In essence, the following lemma allows us to quantify the deviation of this estimator from the sum $V_n$ of the conditional variances:

$$V_n = \sum_{t=1}^{n-2} \mathbb{V}\big[f_{\{Z_t\}}(Z) \mid Z_1, \ldots, Z_t\big]. \quad (14)$$

Lemma 3. Let $Z_1, \ldots, Z_n$ be a sequence of random variables following the same probability distribution $\mathcal{D}$, such that $Z_{t+1}$ and $Z_{t+2}$ are conditionally independent given $\{Z_t\}$, for all $t$. Suppose $\{f_{\{Z_t\}}\}$ is a family of functions which take their values in $[0, 1]$. Then for all $0 < \delta \leq 1$,

$$\mathbb{P}\Big[V_n \geq \hat{V}_n + \sqrt{\frac{n-2}{2}\ln\frac{1}{\delta}}\Big] \leq \delta. \quad (15)$$

Proof. We begin this proof by defining the sequence of random variables $\{M_n\}$ such that, for all $t$,

$$M_t = \frac{1}{2}\big(f_{\{Z_t\}}(Z_{t+1}) - f_{\{Z_t\}}(Z_{t+2})\big)^2,$$

and the associated martingale difference sequence $\{A_n = \mathbb{E}[M_n \mid Z_1, \ldots, Z_n] - M_n\}$. Using the fact that the $Z_t$ follow the same distribution and that $Z_{t+1}, Z_{t+2}$ are conditionally independent, we get that

$$\mathbb{E}[M_t \mid Z_1, \ldots, Z_t] = \mathbb{V}\big[f_{\{Z_t\}}(Z) \mid Z_1, \ldots, Z_t\big].$$

It follows that $\sum_t A_t = V_n - \hat{V}_n$. Noting that $M_t \in [0, 1/2]$, because $f$ takes its values in $[0, 1]$, entails $\mathbb{E}[M_t \mid Z_1, \ldots, Z_t] \in [0, 1/2]$, and furthermore each term of the sequence $\{A_n\}$ is bounded: $-1/2 \leq A_t \leq 1/2$. Consequently $\{A_n\}$ is a bounded martingale difference sequence, to which we can apply the Azuma-Hoeffding inequality (Theorem 1) to obtain

$$\mathbb{P}\big[V_n - \hat{V}_n \geq \epsilon\big] \leq \exp\Big(-\frac{2\epsilon^2}{n-2}\Big).$$

We conclude the proof by using Lemma 2.

Thanks to this first result, we can now prove Theorem 2.

Proof. (Theorem 2) Define the sequence of random variables $\{M_n\}$ such that $M_i = f_{\{Z_i\}}(Z_{i+1})$, and the associated martingale difference sequence $\{A_n = \mathbb{E}[M_n \mid Z_1, \ldots, Z_n] - M_n\}$. Remark that for $\beta$ as in Equation (12) and $s$ fixed,

$$\mathbb{P}\Big[\sum_t A_t \geq s\Big] = \mathbb{P}\Big[\sum_t A_t \geq s,\; V_n \geq \beta\Big] + \mathbb{P}\Big[\sum_t A_t \geq s,\; V_n < \beta\Big].$$

We need to upper bound the two parts of the right-hand side of the previous equation in order to get the desired bound on the left-hand side. Remark that $\mathbb{P}[\sum_t A_t \geq s, V_n \geq \beta] \leq \mathbb{P}[V_n \geq \beta]$. We use Lemma 3 to bound $\mathbb{P}[V_n \geq \beta]$ and obtain

$$\mathbb{P}\Big[\sum_t A_t \geq s,\; V_n \geq \beta\Big] \leq \frac{\delta}{2}. \quad (16)$$

Then, by using the Bernstein inequality for martingales (Lemma 1) on the martingale difference sequence $\{A_n\}$, we have

$$\mathbb{P}\Big[\sum_t A_t \geq s,\; V_n < b\Big] \leq \exp\Big(-\frac{s^2}{2b + 2s/3}\Big), \quad (17)$$

which we can write alternatively

$$\mathbb{P}\Big[\sum_t A_t \geq \sqrt{b\ln\frac{2}{\delta}} + \frac{2}{3}\ln\frac{2}{\delta},\; V_n < b\Big] \leq \frac{\delta}{2}, \quad (18)$$

thanks to Lemma 2. We conclude the proof by setting $b = \beta$ in (18) and $s = \sqrt{\beta\ln\frac{2}{\delta}} + \frac{2}{3}\ln\frac{2}{\delta}$ in Equation (16).

In the upcoming section, we use Theorem 2 in an online learning setting. More precisely, we employ our result with the intention of characterizing the mean of the risks $R(h_t)$ associated with the hypotheses learned during such a process.

4 APPLICATION TO ONLINE LEARNING

Before stating the main theorem of this section, we recall the online learning setting and define a new instantaneous estimator of the conditional variance well suited to an online learning procedure.

4.1 Online Learning and Instantaneous Conditional Variance Estimator

There is no formal definition of an online learning process, even in reference works such as Littlestone et al. [1995] or Shalev-Shwartz [2007]. One generally defines it as follows. Consider a dataset $Z_n = \{z_i\}_{i=1}^n = \{(x_i, y_i)\}_{i=1}^n$ of independent and identically distributed random variables with respect to an unknown probability distribution $\mathcal{D}$ on the product space $X \times Y$. An online learning algorithm working with the set $Z_n$ produces a set $\{h_0, \ldots, h_{n-1}\}$ of hypotheses, where each $h_t : X \to \tilde{Y}$ aims at predicting the class of a new example $x$ drawn from $\mathcal{D}$. From an initial hypothesis $h_0$ and the first datum $(x_1, y_1)$, the algorithm produces a new hypothesis $h_1$. This new hypothesis is a function of the random variable $z_1 = (x_1, y_1)$ (and of the hypothesis $h_0$). It then uses the next example $(x_2, y_2)$ and the hypothesis $h_1$ to generate a second hypothesis $h_2$, and so on. At the end of the learning process, the algorithm outputs the set $\{h_0, \ldots, h_{n-1}\}$, where each hypothesis $h_t$ is constructed using the previous hypothesis $h_{t-1}$ and the example $(x_t, y_t)$. Thus each hypothesis $h_t$ depends on the sequence of random variables $\{z_1, \ldots, z_t\}$.

We use a bounded loss function $\ell : \tilde{Y} \times Y \to \mathbb{R}^+$ to evaluate the performance of a hypothesis. The risk of the hypothesis $h_t$, denoted by $R(h_t) = \mathbb{E}[\ell(h_t(X), Y) \mid z_1, \ldots, z_t]$, is simply the expectation of the loss function $\ell$ conditionally on the random variables $\{z_1, \ldots, z_t\}$. Obviously, this quantity is unknown since $\mathcal{D}$ is unknown. In this article, we assume that the loss function is such that $\ell \in [0, 1]^{\tilde{Y} \times Y}$. It is important to notice that this assumption does not limit the scope of the results presented hereafter.

A common wish in online learning is to characterize the mean risk

$$\frac{1}{n-1}\sum_{t=0}^{n-2} R(h_t) = \frac{1}{n-1}\sum_{t=0}^{n-2} \mathbb{E}[\ell(h_t(X), Y) \mid z_1, \ldots, z_t] \quad (19)$$

associated with the hypotheses produced by an algorithm, using an online estimator $\hat{R}_n$ such that

$$\hat{R}_n = \hat{R}_n(Z_n) = \frac{1}{n-1}\sum_{t=0}^{n-2} \ell(h_t(x_{t+1}), y_{t+1}). \quad (20)$$

The hypothesis $h_{n-1}$ is discarded for purely technical reasons. The quantity $\hat{R}_n$ is often referred to as the average instantaneous risk. It is central in many online learning analyses (see for example Cesa-Bianchi et al. [2004]). Each term of the previous sum is an estimator of the risk $R(h_t)$ associated with the hypothesis $h_t$ (conditionally on the examples $z_1, \ldots, z_t$):

$$\mathbb{E}[\ell(h_t(x_{t+1}), y_{t+1}) \mid z_1, \ldots, z_t] = R(h_t).$$

The term instantaneous comes from the fact that $\hat{R}_n$ only relies on the example $(x_{t+1}, y_{t+1})$, appearing at iteration $t+1$, to evaluate the risk of $h_t$. A state-of-the-art result due to Cesa-Bianchi and Gentile [2008] links $\hat{R}_n$ to $\frac{1}{n-1}\sum_t R(h_t)$:

Proposition 1. Let $h_0, \ldots, h_{n-1}$ be the set of hypotheses generated by an online learning algorithm using the bounded loss function $\ell \in [0, 1]^{\tilde{Y} \times Y}$. Then, for all $0 < \delta \leq 1$, we have with probability at least $1 - \delta$:

$$\frac{1}{n-1}\sum_{t=0}^{n-2} R(h_t) \leq \hat{R}_n + 2\sqrt{\frac{\hat{R}_n}{n-1}\ln\frac{(n-1)\hat{R}_n + 3}{\delta}} + \frac{36}{n-1}\ln\frac{(n-1)\hat{R}_n + 3}{\delta}. \quad (21)$$

Remark 1. The Gibbs classifier [McAllester, 1999] is a stochastic classifier obtained by randomly selecting a hypothesis from a set of hypotheses, given a probability distribution on these hypotheses. $\frac{1}{n-1}\sum_t R(h_t)$ can thus be seen as the risk of the Gibbs classifier for a uniform distribution on the set $\{h_0, \ldots, h_{n-2}\}$.

The key to the result exposed in the previous proposition lies in the use of a second-order concentration inequality for martingales (proposed by Freedman [1975]), which introduces the sum $V_n$ of the conditional variances of the loss of each hypothesis:

$$V_n = \sum_{t=0}^{n-2} \mathbb{V}[\ell(h_t(x_{t+1}), y_{t+1}) \mid z_1, \ldots, z_t].$$

As the mean risk, this quantity cannot be computed since the distribution $\mathcal{D}$ is unknown. Cesa-Bianchi and Gentile [2008] proposed to upper bound this sum using a stratification process in order to get their inequality. In this section we improve their bound by employing Theorem 2 together with an online estimator $\hat{V}_n$ of the sum $V_n$, which allows for a better control of the former. The average empirical instantaneous variance $\hat{V}_n$ is simply defined as

$$\hat{V}_n = \frac{1}{2(n-2)}\sum_{t=0}^{n-3}\big(\ell(h_t(x_{t+1}), y_{t+1}) - \ell(h_t(x_{t+2}), y_{t+2})\big)^2. \quad (22)$$

Again, we discard the hypotheses $h_{n-2}$ and $h_{n-1}$ from this quantity for technical reasons. Each term of this sum is an estimator of the conditional variance of $\ell(h_t(x), y)$:

$$\mathbb{E}\big[\big(\ell(h_t(x_{t+1}), y_{t+1}) - \ell(h_t(x_{t+2}), y_{t+2})\big)^2 \mid z_1, \ldots, z_t\big] = 2\,\mathbb{V}[\ell(h_t(x), y) \mid z_1, \ldots, z_t]. \quad (23)$$

$\hat{V}_n$ may easily be computed during an online learning process, as sketched below, and plays a central role in the theorem we present here.
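Concretely, both statistics can be accumulated from the loss values observed during the run. The following sketch (ours, not from the paper) mirrors Equations (20) and (22), with `losses[t]` standing for $\ell(h_t(x_{t+1}), y_{t+1})$ and `losses_next[t]` for $\ell(h_t(x_{t+2}), y_{t+2})$:

```python
import numpy as np

def instantaneous_estimators(losses, losses_next):
    """losses: n-1 values, t = 0, ..., n-2; losses_next: n-2 values, t = 0, ..., n-3."""
    losses, losses_next = np.asarray(losses), np.asarray(losses_next)
    r_hat = losses.mean()                         # Eq. (20)
    d = losses[: len(losses_next)] - losses_next  # h_t's losses on z_{t+1} and z_{t+2}
    v_hat = np.sum(d ** 2) / (2 * len(d))         # Eq. (22)
    return r_hat, v_hat
```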

4.2 Empirical Bernstein Inequalities for Online Learning

In the following theorem, we use Theorem 2 and the instantaneous estimators $\hat{R}_n$ and $\hat{V}_n$ to bound $\frac{1}{n-1}\sum_t R(h_t)$, the mean of the risks of the hypotheses learned by an online algorithm.

Theorem 3 (Empirical Bernstein inequality for online learning). Let $h_0, \ldots, h_{n-1}$ be the set of hypotheses generated from the sample $Z_n = \{z_i\}_{i=1}^n = \{(x_i, y_i)\}_{i=1}^n$ of i.i.d. random variables by an online learning algorithm using the bounded loss function $\ell \in [0, 1]^{\tilde{Y} \times Y}$. Then, for all $0 < \delta \leq 1$, we have with probability at least $1 - \delta$:

$$\frac{1}{n-1}\sum_{t=0}^{n-2} R(h_t) \leq \hat{R}_n + \frac{\sqrt{\beta\ln\frac{2}{\delta}}}{n-1} + \frac{2}{3}\frac{\ln\frac{2}{\delta}}{n-1}, \quad (24)$$

where

$$\beta = (n-2)\hat{V}_n + \sqrt{\frac{n-2}{2}\ln\frac{2}{\delta}}. \quad (25)$$

Proof. (Theorem 3) The proof is direct. Consider the set $Z_n = \{z_i\}_{i=1}^n = \{(x_i, y_i)\}_{i=1}^n$ of i.i.d. random variables and the family of functions $\{\ell(h_i(\cdot), \cdot)\}_{i=0}^{n-1}$, where each function $\ell(h_t(\cdot), \cdot)$ only depends on the variables $z_1, \ldots, z_t$, by definition of $h_t$. Noting that $z_{t+1}, z_{t+2}$ are independent of $z_1, \ldots, z_t$ (by definition of $Z_n$), we simply apply Theorem 2 and adjust the indexes to obtain the result.

We now want to emphasize the comparison with the bound of Cesa-Bianchi and Gentile [2008] (Equation (21)). Our result firstly improves the constants involved in the bound, which is very appreciable when the bound is computed with a small number of hypotheses (when $n$ is small, the last term in the bound cannot be neglected). In order to analyze the behavior of our result when we have a sufficient number of hypotheses to omit the last term, we have to pay attention to

$$\frac{\sqrt{\beta\ln\frac{2}{\delta}}}{n-1} \leq \sqrt{\frac{\hat{V}_n}{n-1}\ln\frac{2}{\delta}} + \Big(\frac{1}{2(n-1)^3}\Big)^{1/4}\ln^{3/4}\frac{2}{\delta}, \quad (26)$$

where we used the fact that $\sqrt{a + b} \leq \sqrt{a} + \sqrt{b}$ to get the upper bound. Thus, omitting the constant terms, our bound tends to $\hat{R}_n$ at least in

$$O\Big(\sqrt{\frac{\hat{V}_n}{n}} + \frac{1}{n^{3/4}} + \frac{1}{n}\Big),$$

while the one of Cesa-Bianchi and Gentile [2008] tends to $\hat{R}_n$ in

$$O\Big(\sqrt{\frac{\hat{R}_n\ln(n\hat{R}_n)}{n}} + \frac{\ln(n\hat{R}_n)}{n}\Big).$$

In order to study the difference between the two rates of convergence, we need to compare the two terms $\hat{V}_n$ and $\hat{R}_n$:

$$\hat{V}_n = \frac{1}{2(n-2)}\sum_{t=0}^{n-3}\big(\ell(h_t(x_{t+1}), y_{t+1}) - \ell(h_t(x_{t+2}), y_{t+2})\big)^2$$
$$\leq \frac{1}{2(n-2)}\sum_{t=0}^{n-3}\big(\ell(h_t(x_{t+1}), y_{t+1})^2 + \ell(h_t(x_{t+2}), y_{t+2})^2\big)$$
$$\leq \frac{1}{2(n-2)}\sum_{t=0}^{n-3}\big(\ell(h_t(x_{t+1}), y_{t+1}) + \ell(h_t(x_{t+2}), y_{t+2})\big).$$

The last inequality is obtained by using $\ell \in [0, 1]^{\tilde{Y} \times Y}$. Suppose that the error made by each hypothesis $h_t$ on the example $z_{t+2}$ is not too different from the error made by the same hypothesis on $z_{t+1}$: $\ell(h_t(x_{t+2}), y_{t+2}) \approx \ell(h_t(x_{t+1}), y_{t+1})$. In this case, the previous right-hand side is almost $\frac{1}{n-2}\sum_{t=0}^{n-3}\ell(h_t(x_{t+1}), y_{t+1}) \approx \hat{R}_n$, and thus it follows that $\hat{V}_n \lesssim \hat{R}_n$.

A setting studied by Cesa-Bianchi and Gentile [2008] is when the empirical cumulative risk $n\hat{R}_n$ is in $O(1)$, i.e. is bounded. Their result then reaches an asymptotic behavior in $O(1/n)$ (the terms involving $\ln(n\hat{R}_n)$ vanish, being constant). With the assumption that $n\hat{V}_n$ is in $O(1)$ as well, our bound shows a slightly worse rate of convergence, in $O(1/n^{3/4})$. However, as soon as the cumulative risk $n\hat{R}_n$ increases with $n$, the bound of Cesa-Bianchi and Gentile [2008] converges at the rate $O(\sqrt{\ln n / n})$, whereas ours reaches an $O(\sqrt{1/n})$ rate.
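To make the comparison tangible, here is a small numerical sketch (ours; the values of $\hat{R}_n$ and $\hat{V}_n$ below are hypothetical) that evaluates the right-hand sides of (21) and of (24)-(25) for growing $n$:

```python
import numpy as np

def cbg_bound(r_hat, n, delta):
    """Right-hand side of Proposition 1, Eq. (21)."""
    log_term = np.log(((n - 1) * r_hat + 3) / delta)
    return r_hat + 2 * np.sqrt(r_hat * log_term / (n - 1)) + 36 * log_term / (n - 1)

def emp_bernstein_bound(r_hat, v_hat, n, delta):
    """Right-hand side of Theorem 3, Eqs. (24)-(25)."""
    log_term = np.log(2 / delta)
    beta = (n - 2) * v_hat + np.sqrt((n - 2) / 2 * log_term)
    return r_hat + np.sqrt(beta * log_term) / (n - 1) + 2 * log_term / (3 * (n - 1))

r_hat, v_hat, delta = 0.2, 0.05, 0.05  # hypothetical estimates
for n in (100, 1_000, 10_000, 100_000):
    print(n, cbg_bound(r_hat, n, delta), emp_bernstein_bound(r_hat, v_hat, n, delta))
```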

Case of a Convex Loss Function. When an online algorithm uses a convex loss function $\ell$, we can use Theorem 3 to characterize the risk associated with the mean hypothesis $\bar{h}$:

$$\bar{h} = \frac{1}{n-1}\sum_{t=0}^{n-2} h_t. \quad (27)$$

When the decision space $\tilde{Y}$ associated with the classifiers $h_t : X \to \tilde{Y}$ is convex, the hypothesis $\bar{h}$ belongs to the same function class as each of the $h_t$: $\bar{h} : X \to \tilde{Y}$. The mean hypothesis is thus a deterministic classifier, as opposed to the Gibbs classifier defined earlier, which shares the same bound on its risk.

Corollary 2. Let $h_0, \ldots, h_{n-1}$ be the set of all the hypotheses generated by an online learning algorithm using a convex loss function $\ell$ such that $\ell \in [0, 1]^{\tilde{Y} \times Y}$. Then, for all $0 < \delta \leq 1$, with probability at least $1 - \delta$,

$$R(\bar{h}) \leq \hat{R}_n + \frac{\sqrt{\beta\ln\frac{2}{\delta}}}{n-1} + \frac{2}{3}\frac{\ln\frac{2}{\delta}}{n-1}, \quad (28)$$

where

$$\beta = (n-2)\hat{V}_n + \sqrt{\frac{n-2}{2}\ln\frac{2}{\delta}}.$$

Proof. Using Jensen's inequality and the linearity of the expectation, it is easy to show that

$$R(\bar{h}) = \mathbb{E}\Big[\ell\Big(\frac{1}{n-1}\sum_{t=0}^{n-2} h_t(X), Y\Big)\Big] \leq \frac{1}{n-1}\sum_{t=0}^{n-2}\mathbb{E}[\ell(h_t(X), Y)] = \frac{1}{n-1}\sum_{t=0}^{n-2} R(h_t).$$

To conclude the proof, we just need to combine this result with Theorem 3.

5 BOUNDING THE AVERAGE RISK OF PEGASOS

In this section, we use the previous corollary to derive a bound on the mean risk of the hypotheses generated by the Pegasos algorithm [Shalev-Shwartz et al., 2011].

5.1 Pegasos

Pegasos is an algorithm designed to solve the primal SVM problem. Recall that given a sample $Z_n = \{z_i\}_{i=1}^n = \{(x_i, y_i)\}_{i=1}^n$, the SVM objective function is given by

$$F(w) := \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \ell_{\mathrm{hinge}}(w, x_i, y_i), \quad (29)$$

where $\ell_{\mathrm{hinge}}(w, x, y) = \max\{0, 1 - y\langle w, x\rangle\}$. Pegasos works in an online fashion by performing a stochastic subgradient descent on the SVM objective function. At time $t$, Pegasos randomly selects an example $z_{i_t} = (x_{i_t}, y_{i_t})$ and aims at minimizing the approximation

$$f(w_t, z_{i_t}) = \frac{\lambda}{2}\|w_t\|_2^2 + \ell_{\mathrm{hinge}}(w_t, x_{i_t}, y_{i_t})$$

of the SVM objective function. It considers the following subgradient of the previous function, taken at point $w_t$:

$$\nabla_t = \nabla_{w_t} f(w_t, z_{i_t}) = \lambda w_t - \mathbb{1}[y_{i_t}\langle w_t, x_{i_t}\rangle < 1]\, y_{i_t} x_{i_t},$$

and it updates the current weight vector $w_t$ to $w_{t+1}$ by $w_{t+1} \leftarrow w_t - \eta_t\nabla_t$, using a step $\eta_t = 1/(\lambda(t+1))$. So we get at each iteration the vector

$$w_{t+1} \leftarrow (1 - \eta_t\lambda)w_t + \eta_t\,\mathbb{1}[y_{i_t}\langle w_t, x_{i_t}\rangle < 1]\, y_{i_t} x_{i_t}.$$

An (optional) projection step, detailed in the sequel, ends iteration $t$. Pegasos stops when $t = T$, where $T$ is a number of iterations given as a parameter. Thus Pegasos can be seen as an online algorithm working with the sequence of examples $z_{i_1}, \ldots, z_{i_T}$ constructed by picking an example from $Z_n$ at random at each iteration. Algorithm 1 sums up the different steps of Pegasos.

Algorithm 1 Pegasos
Require: $\{(x_i, y_i)\}_{i=1}^n$, $\lambda > 0$ and $T > 0$
Ensure: $w_{T+1}$
  $w_0 \leftarrow 0$
  for $t \leftarrow 0$ to $T$ do
    Pick $i_t$ uniformly at random in $\{1, \ldots, n\}$
    Define $\eta_t = \frac{1}{\lambda(t+1)}$
    if $y_{i_t}\langle w_t, x_{i_t}\rangle < 1$ then
      $w_{t+1} \leftarrow (1 - \eta_t\lambda)w_t + \eta_t y_{i_t} x_{i_t}$
    else
      $w_{t+1} \leftarrow (1 - \eta_t\lambda)w_t$
    end if
    $w_{t+1} \leftarrow \min\Big[1, \frac{1/\sqrt{\lambda}}{\|w_{t+1}\|_2}\Big]\, w_{t+1}$
  end for

5.2 Bounding the Mean Risk of the Hypotheses Generated by Pegasos

To apply Theorem 3, we need the loss function to be bounded. It can be shown that $w^* = \arg\min_w F(w)$ satisfies $\|w^*\|_2 \leq 1/\sqrt{\lambda}$. Thus, we can limit the search space to the ball of radius $1/\sqrt{\lambda}$ by incorporating the projection step mentioned above:

$$w_{t+1} \leftarrow \min\Big[1, \frac{1/\sqrt{\lambda}}{\|w_{t+1}\|_2}\Big]\, w_{t+1}.$$

With the assumption that $\|x\|_2 \leq M$, we can bound the hinge loss function:

$$\ell_{\mathrm{hinge}}(w, x, y) \leq 1 + \|x\|_2\|w\|_2 \leq 1 + \frac{M}{\sqrt{\lambda}} = C.$$

Thereby, the loss function used by Pegasos can be rescaled to satisfy the assumption of Theorem 3, and we can use it to prove the following corollary.

Corollary 3. Let $w_0, \ldots, w_T$ be the sequence of weight vectors generated by the Pegasos algorithm from a sample $Z_n$ where $\|x_i\|_2 \leq M$ for all $i$. Then, for all $0 < \delta \leq 1$, we have with probability at least $1 - \delta$:

$$\frac{1}{T-1}\sum_{t=0}^{T-2} R(w_t) \leq \hat{R}_T + \frac{C\sqrt{\beta\ln\frac{2}{\delta}}}{T-1} + \frac{2C}{3}\frac{\ln\frac{2}{\delta}}{T-1}, \quad \text{where} \quad \beta = \frac{T-2}{C^2}\hat{V}_T + \sqrt{\frac{T-2}{2}\ln\frac{2}{\delta}}, \quad (30)$$

and where $\hat{R}_T$ and $\hat{V}_T$ denote the estimators (20) and (22) computed on the sequence $z_{i_1}, \ldots, z_{i_T}$.
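A compact Python rendition of Algorithm 1 with the projection step (a sketch under the conventions above; the data layout and the uniform sampling routine are our assumptions):

```python
import numpy as np

def pegasos(X, y, lam, T, rng=None):
    """Return the sequence w_0, ..., w_{T+1} of Pegasos weight vectors
    (Algorithm 1), including the optional projection onto the ball of
    radius 1/sqrt(lam)."""
    if rng is None:
        rng = np.random.default_rng()
    n, d = X.shape
    w = np.zeros(d)
    ws = [w.copy()]                                # w_0 = 0
    for t in range(T + 1):                         # t = 0, ..., T
        i = rng.integers(n)                        # pick i_t uniformly at random
        eta = 1.0 / (lam * (t + 1))                # step size eta_t
        margin = y[i] * (X[i] @ ws[-1])            # y_{i_t} <w_t, x_{i_t}>
        w = (1 - eta * lam) * w
        if margin < 1:                             # hinge loss is active
            w += eta * y[i] * X[i]
        norm = np.linalg.norm(w)
        w *= min(1.0, (1.0 / np.sqrt(lam)) / (norm or 1.0))  # projection step
        ws.append(w.copy())
    return ws
```

The intermediate vectors `ws[:-1]` are exactly the hypotheses $w_0, \ldots, w_T$ whose mean risk Corollary 3 controls; the instantaneous estimators of Section 4.1 can be accumulated from their hinge losses on the next sampled examples.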

[Figure 1 shows four log-scale panels, one per value of $\lambda$; each plots the Cesa-Bianchi (2008) bound and the empirical Bernstein bound (this work).]

Figure 1: Comparison of the bounds from Proposition 1 and Corollary 3 computed for the Pegasos algorithm on a toy linearly separable dataset.

5.3 Proof of Concept

In this section we want to highlight experimentally the performance of our empirical Bernstein inequality applied to online learning. To do so, we compare the bound provided by Corollary 3 for the Pegasos algorithm to the one given in Proposition 1. We use a linearly separable toy dataset and compare the convergence of the empirical risk to the mean risk of the hypotheses $w_0, \ldots, w_T$. We generate random vectors $x_i \in [-1, 1]^2$, to which we assign the class $y_i = \mathrm{sign}(\langle w^*, x_i\rangle) \in \{+1, -1\}$ for a vector $w^* \in [-1, 1]^2$, also randomly generated. We work with a learning sample of such points and report in Figure 1 the values of the right-hand sides appearing in Proposition 1 [Cesa-Bianchi and Gentile, 2008] and in Corollary 3, computed with a confidence of 95% ($\delta = 0.05$). We ran the experiment 20 times for several values of the parameter $\lambda$ and averaged the results. We can see that our inequality is far tighter than the one of Cesa-Bianchi and Gentile [2008] during the first iterations, as announced by the theoretical comparison done in Section 4.2. The gap between the two inequalities narrows as the number of hypotheses considered increases, but remains in our favor.

6 CONCLUSION AND OUTLOOKS

In this article, we presented a new empirical Bernstein concentration inequality for martingales. We applied this result to the online learning setting to bound the mean risk of the hypotheses learned during such learning processes. Because we introduce a new instantaneous variance estimator, our inequality is well suited to the online learning setting and improves the state of the art. This improvement is mainly noticeable when the number of hypotheses considered is small, as shown in the empirical section of this work.

This work opens many perspectives. First of all, one can think of a new online learning algorithm that aims at minimizing our empirical Bernstein bound, as is done in the batch setting [Variance Penalizing AdaBoost, Shivaswamy and Jebara, 2011], for example. Then, it would be of interest to derive new kinds of bounds for online algorithms taking advantage of our result (for example on the excess risk, as is done in the work of Kakade and Tewari [2009]). The last perspective that we want to mention is the comparison of our bound with the very recent PAC-Bayes-Empirical-Bernstein inequality of Tolstikhin and Seldin [2013].

References

Kazuoki Azuma. Weighted Sums of Certain Dependent Random Variables. Tohoku Mathematical Journal, 19(3):357-367, 1967.

George Bennett. Probability Inequalities for the Sum of Independent Random Variables. Journal of the American Statistical Association, 57(297):33-45, 1962.

Nicolò Cesa-Bianchi and Claudio Gentile. Improved Risk Tail Bounds for On-Line Algorithms. IEEE Transactions on Information Theory, 54(1):386-390, 2008.

Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the Generalization Ability of On-Line Learning Algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.

David A. Freedman. On Tail Probabilities for Martingales. The Annals of Probability, 3(1):100-118, 1975.

Wassily Hoeffding. Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

Sham M. Kakade and Ambuj Tewari. On the Generalization Ability of Online Strongly Convex Programming Algorithms. In Advances in Neural Information Processing Systems 21 (NIPS '08), 2009.

Nicholas Littlestone, Philip Long, and Manfred Warmuth. On-line Learning of Linear Functions. Computational Complexity, 5(1):1-23, 1995.

Andreas Maurer and Massimiliano Pontil. Empirical Bernstein Bounds and Sample Variance Penalization. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT '09), 2009.

David A. McAllester. PAC-Bayesian Model Averaging. In Proceedings of the 12th Annual Conference on Computational Learning Theory (COLT '99), pages 164-170, 1999.

Thomas Peel, Sandrine Anthoine, and Liva Ralaivola. Empirical Bernstein Inequalities for U-Statistics. In Advances in Neural Information Processing Systems 23 (NIPS '10), pages 1903-1911, 2010.

Shai Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University, 2007.

Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal Estimated Sub-Gradient Solver for SVM. Mathematical Programming, 127(1):3-30, 2011.

Pannagadatta K. Shivaswamy and Tony Jebara. Variance Penalizing AdaBoost. In Advances in Neural Information Processing Systems 24 (NIPS '11), 2011.

Ilya Tolstikhin and Yevgeny Seldin. PAC-Bayes-Empirical-Bernstein Inequality. In Advances in Neural Information Processing Systems (NIPS '13), 2013.
