On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities


Proceedings of Machine Learning Research vol 65:1–19, 2017

Alexander Rakhlin, University of Pennsylvania, rakhlin@wharton.upenn.edu
Karthik Sridharan, Cornell University, sridharan@cs.cornell.edu

Abstract

We study an equivalence of (i) deterministic pathwise statements appearing in the online learning literature (termed regret bounds), (ii) high-probability tail bounds for the supremum of a collection of martingales (of a specific form arising from uniform laws of large numbers), and (iii) in-expectation bounds for the supremum. By virtue of the equivalence, we prove exponential tail bounds for norms of Banach space valued martingales via deterministic regret bounds for the online mirror descent algorithm with an adaptive step size. We show that the phenomenon extends beyond the setting of online linear optimization and present the equivalence for the supervised online learning setting.

Keywords: martingale inequalities; online learning

1. Introduction

The paper investigates equivalence of regret inequalities that hold for all sequences and probabilistic inequalities for martingales. In recent years, it was shown that existence of regret-minimization strategies can be certified non-algorithmically by studying certain stochastic processes. In this paper, we make the connection in the opposite direction and show a certain equivalence. We present several new deviation inequalities that follow with surprising ease from pathwise regret inequalities, while it is far from clear how to prove them with other methods.

Arguably the simplest example of the equivalence between prediction of individual sequences and probabilistic inequalities can be found in the work of Cover (1965). Consider the task of predicting a binary sequence $y = (y_1,\dots,y_n) \in \{\pm 1\}^n$ in an online manner. Let $\phi: \{\pm 1\}^n \to [0,1]$ be $1/n$-Lipschitz with respect to the Hamming distance. Then there exists a randomized strategy such that
$$\forall y, \quad E\Big[\frac{1}{n}\sum_{t=1}^n \mathbf{1}\{\hat y_t \ne y_t\}\Big] \le \phi(y) \tag{1}$$
if and only if $E\phi(\epsilon) \ge 1/2$.
The expectation in (1) is with respect to the randomized predictions $\hat y_t = \hat y_t(y_1,\dots,y_{t-1}) \in \{\pm 1\}$ made by the algorithm, $\epsilon = (\epsilon_1,\dots,\epsilon_n)$ is a sequence of independent Rademacher random variables, and $\mathbf{1}\{\cdot\}$ is the indicator loss function. While this result is not difficult to prove by backward induction (see e.g. (Rakhlin and Sridharan, 2016)), the message is rather intriguing: existence of a prediction strategy with a given mistake bound $\phi$ is equivalent to a simple statement about the expected value of $\phi$ with respect to the uniform distribution. Furthermore, the Lipschitz condition on $\phi$ implies a high-probability bound for the deviation of $\phi$ from $E\phi$ via McDiarmid's inequality.

© 2017 A. Rakhlin & K. Sridharan.

Our second example of the equivalence is in the setting of online linear optimization. Consider the unit Euclidean ball $B$ in $\mathbb{R}^d$. Let $z_1,\dots,z_n \in B$ and define, recursively, the Euclidean projections
$$\hat y_{t+1} = \hat y_{t+1}(z_1,\dots,z_t) = \mathrm{Proj}_B\big(\hat y_t - n^{-1/2} z_t\big) \tag{2}$$
for each $t = 1,\dots,n$, with the initial value $\hat y_1 = 0$. Elementary algebra¹ shows that for any $f \in B$, the regret inequality $\sum_{t=1}^n \langle \hat y_t - f, z_t\rangle \le \sqrt{n}$ holds deterministically for any sequence $z_1,\dots,z_n \in B$. By optimally choosing $f$ in the direction of the sum, we re-write this statement equivalently as
$$\Big\|\sum_{t=1}^n z_t\Big\| - \sqrt{n} \le \sum_{t=1}^n \langle \hat y_t, -z_t\rangle. \tag{3}$$
Since the inequality holds pathwise, by applying it to a $B$-valued martingale difference sequence $Z_1,\dots,Z_n$, we conclude that
$$P\Big(\Big\|\sum_{t=1}^n Z_t\Big\| - \sqrt{n} > u\Big) \le P\Big(\sum_{t=1}^n \langle \hat y_t, -Z_t\rangle > u\Big) \le \exp\Big\{-\frac{u^2}{2n}\Big\}. \tag{4}$$
The latter upper bound is an application of the Azuma-Hoeffding inequality. Indeed, the process $(\hat y_t)$ is predictable with respect to $\sigma(Z_1,\dots,Z_{t-1})$, and thus $(\langle \hat y_t, -Z_t\rangle)$ is a $[-1,1]$-valued martingale difference sequence. It is worth emphasizing the conclusion: one-sided deviation tail bounds for a norm of a vector-valued martingale can be deduced from tail bounds for real-valued martingales with the help of a deterministic regret inequality. Next, integrating the tail bound in (4) yields a seemingly weaker in-expectation statement
$$E\Big\|\sum_{t=1}^n Z_t\Big\| \le c\sqrt{n} \tag{5}$$
for an appropriate constant $c$. The twist in this uncomplicated story comes next: with the help of the minimax theorem, (Abernethy et al., 2009; Rakhlin et al., 2010) established existence of strategies $(\hat y_t)$ such that
$$\forall z_1,\dots,z_n, f \in B, \quad \sum_{t=1}^n \langle \hat y_t - f, z_t\rangle \le \sup E\Big\|\sum_{t=1}^n Z_t\Big\|, \tag{6}$$
with the supremum taken over all $2B$-valued martingale difference sequences. In view of (5), this bound is $c\sqrt{n}$.

What have we achieved? Let us summarize. The deterministic inequality (3), which holds for all sequences, implies a tail bound (4).
The latter, in turn, implies an in-expectation bound (5), which implies (3) (with a worse constant) through a minimax argument, thus closing the loop.

¹ See the two-line proof in the Appendix, Lemma 12.

The equivalence studied in depth in this paper is informally stated below:

Informal: The following bounds imply each other: (a) an inequality that holds for all sequences; (b) a deviation tail probability for the size of a martingale; (c) an in-expectation bound on the size of a martingale.

The equivalence, in particular, allows us to amplify the in-expectation bounds to appropriate high-probability tail bounds.

While writing the paper, we learned of the trajectorial approach, extensively studied in recent years. In particular, it has been shown that Doob's maximal inequalities and Burkholder-Davis-Gundy inequalities have deterministic counterparts (Acciaio et al., 2013; Beiglböck and Nutz, 2014; Gushchin, 2014; Beiglböck and Siorpaes, 2015). The online learning literature contains a trove of pathwise inequalities, and further synthesis with the trajectorial approach (and the applications in mathematical finance) appears to be a promising direction.

This paper is organized as follows. In the next section, we extend the Euclidean result to martingales with values in Banach spaces and improve it by replacing $\sqrt{n}$ with the square root of the variation. In particular, we conclude a high-probability self-normalized tail bound, a statement that appears to be difficult to obtain with other methods (see (Bercu et al., 2015; de la Peña et al., 2008) for a survey of techniques in this area). Section 3 is devoted to the analysis of equivalence for supervised learning. Finally, Section 4 shows that it is enough to consider dyadic martingales if one is interested in general martingale inequalities of a certain form.

2. Adaptive Bounds and Probabilistic Inequalities in Banach Spaces

For the case of the Euclidean (or Hilbertian) norm, it is easy to see that the bound of (5) can be improved to a distribution-dependent quantity $\big(\sum_{t=1}^n E\|Z_t\|^2\big)^{1/2}$. Given the equivalence sketched earlier, one may wonder whether this upper bound is also equivalent to a gradient-descent-like online method with a sequence-dependent variation governing the rate of convergence.
Below, we indeed present such an equivalence for 2-smooth Banach spaces. Furthermore, the probabilistic tail bounds obtained this way appear to be novel.

Suppose that we have a norm $\|\cdot\|$ on some vector space such that $\|\cdot\|^2$ is a smooth function:
$$\|x + y\|^2 \le \|x\|^2 + \langle \nabla\|x\|^2, y\rangle + C\|y\|^2 \tag{7}$$
for some $C > 0$. Repeatedly using smoothness of the norm, we conclude that
$$E\Big\|\sum_{t=1}^n Z_t\Big\|^2 \le C \sum_{t=1}^n E\|Z_t\|^2 \tag{8}$$
for any martingale difference sequence taking values in that vector space, since the cross-terms vanish. Instead of (8), we will work with the tighter inequality
$$E\Big\|\sum_{t=1}^n Z_t\Big\| \le \sqrt{C}\, E\sqrt{\sum_{t=1}^n \|Z_t\|^2}. \tag{9}$$

Let $(B, \|\cdot\|)$ be a reflexive Banach space with dual space $(B^*, \|\cdot\|_*)$. Assume that $(B, \|\cdot\|)$ is 2-smooth (that is, $\rho(\delta) \triangleq \sup\{\tfrac{1}{2}(\|x+y\| + \|x-y\|) - 1 : \|x\| = 1, \|y\| = \delta\}$, the modulus of smoothness, behaves as $c\delta^2$). Then there exists an equivalent norm $\|\cdot\|_B$ (in the sense that $c_1\|\cdot\|_B \le \|\cdot\| \le c_2\|\cdot\|_B$ for some possibly dimension-dependent $c_1, c_2$) that is smooth. In this case, we can expect that (9) holds for martingale difference sequences taking values in $B$. Let us now argue this more formally, and also show equivalence to the existence of deterministic prediction strategies.

2.1. From regret inequality to expected value and back

Lemma 1 Existence of a (deterministic) prediction strategy $(\hat y_t)$, with values $\hat y_t(z_1,\dots,z_{t-1})$ in the unit ball $B_*$ of $B^*$ such that
$$\forall z_1,\dots,z_n \in B,\ f \in B_*, \quad \sum_{t=1}^n \langle \hat y_t - f, z_t\rangle \le \sqrt{C \sum_{t=1}^n \|z_t\|^2} \tag{10}$$
for some $C$ is equivalent to (9) (with a possibly different constant $C$) holding for all martingale difference sequences with values in $B$.

Proof Rearranging (10) as in (3), choosing a unit vector $f$, and taking an expectation on both sides yields (9) with the same constant $C$ as in (10). We now argue the reverse direction: (9) implies existence of a strategy with a regret bound (10). First, consider an arbitrary collection $(X_1,\dots,X_n)$ of random variables taking values in an $R$-radius centered ball of $B$ and define the conditional expectations $E_{t-1}[\cdot] = E[\cdot \mid X_1,\dots,X_{t-1}]$. Observe that the collection $(X_t - E_{t-1}X_t)$, $t = 1,\dots,n$, is a martingale difference sequence. Hence, by the triangle inequality and our assumption,
$$E\Big\|\sum_{t=1}^n X_t\Big\| - E\sum_{t=1}^n \|E_{t-1}X_t\| \le E\Big\|\sum_{t=1}^n (X_t - E_{t-1}X_t)\Big\| \le \sqrt{C}\, E\sqrt{\sum_{t=1}^n \|X_t - E_{t-1}X_t\|^2}. \tag{11}$$
The right-most expression in (11) can be upper bounded by
$$\sqrt{2C}\, E\sqrt{\sum_{t=1}^n \|X_t\|^2 + \|E_{t-1}X_t\|^2} \le \sqrt{8C}\, E\sqrt{\sum_{t=1}^n \|X_t\|^2}. \tag{12}$$
To justify the last inequality, first observe that $\|E_{t-1}X_t\| \le E_{t-1}\|X_t\|$. Second, the function $x \mapsto \sqrt{A + x^2}$ is convex, and (12) follows by Jensen's inequality. Combining (11) and (12), for any finite $R$ and any collection $(X_1,\dots,X_n)$ with values in $R \cdot B$,
$$E\Big[\Big\|\sum_{t=1}^n X_t\Big\| - \sum_{t=1}^n \|E_{t-1}X_t\| - \sqrt{C' \sum_{t=1}^n \|X_t\|^2}\Big] \le 0 \tag{13}$$
with $C' = 8C$. Writing
$$-\|E_{t-1}X_t\| = \inf_{\|\hat y_t\|_* \le 1} \langle \hat y_t, E_{t-1}X_t\rangle,$$
we conclude that
$$\sup E\Big[\sum_{t=1}^n \inf_{\|\hat y_t\|_* \le 1} \langle \hat y_t, E_{t-1}X_t\rangle - \inf_{\|f\|_* \le 1} \Big\langle f, \sum_{t=1}^n X_t\Big\rangle - \sqrt{C' \sum_{t=1}^n \|X_t\|^2}\Big] \le 0 \tag{14}$$

where the supremum is over the distributions of $(X_1,\dots,X_n)$ with values in $R \cdot B$. The rest of the argument can be seen as running the proof of (Abernethy et al., 2009) backwards. The minimax theorem holds because of finiteness of $R$, the radius of the support of the $X_t$'s, via arguments in (Rakhlin et al., 2014, Appendix A).

Thankfully, the strategy that guarantees (10) is already known: it is an adaptive version of Mirror Descent. For completeness, the proof is provided in the Appendix. To define the strategy, we need the following fact: if $B$ is a 2-smooth Banach space, then there is a function $R$ on the dual space $B^*$ which is strongly convex with respect to the norm $\|\cdot\|_*$. In fact, one can take the squared dual norm corresponding to the smooth equivalent norm on $B$ (Borwein et al., 2009). To avoid extra constants, let us simply assume that $R$ is 1-strongly convex on the unit ball $B_*$ of $B^*$. The function $R$ induces the Bregman divergence $D_R: B^* \times B^* \to \mathbb{R}$, defined as $D_R(f, g) = R(f) - R(g) - \langle \nabla R(g), f - g\rangle$.

Lemma 2 Let $F \subset B^*$ be a convex set. Define, recursively,
$$\hat y_{t+1} = \hat y_{t+1}(z_1,\dots,z_t) = \operatorname{argmin}_{f \in F}\ \eta_t \langle f, z_t\rangle + D_R(f, \hat y_t) \tag{15}$$
with $\hat y_1 = 0$, $\eta_t \triangleq R_{\max}\big(\sum_{s=1}^t \|z_s\|^2\big)^{-1/2}$, and $R^2_{\max} \triangleq \sup_{f,g \in F} D_R(f, g)$. Then for any $f \in F$ and any $z_1,\dots,z_n \in B$,
$$\sum_{t=1}^n \langle \hat y_t - f, z_t\rangle \le 2R_{\max}\sqrt{\sum_{t=1}^n \|z_t\|^2}.$$

Lemma 2 is complementary to Lemma 1, as it gives the algorithm whose existence was guaranteed by Lemma 1. In Section 3, we will not have the luxury of producing an explicit algorithm, yet the equivalence will still be established.

2.2. From regret inequalities to tail bounds and back

We now start from a regret-minimization strategy and deduce a new probabilistic inequality for martingales. We then conclude the in-expectation bound and use the equivalence of Lemma 1 to close the loop. The adaptive Mirror Descent algorithm of the previous section implies the following theorem:

Theorem 3 Let $Z_1,\dots,Z_n$ be a $B$-valued martingale difference sequence, and let $E_t$ stand for the conditional expectation given $Z_1,\dots,Z_t$.
Define
$$V = 2\sum_{t=1}^n \|Z_t\|^2 \quad\text{and}\quad W = 2\sum_{t=1}^n E_{t-1}\|Z_t\|^2, \tag{16}$$
which are assumed to have a finite expected value. For any $u > 0$, it holds that
$$P\left(\frac{\big\|\sum_{t=1}^n Z_t\big\| - 2R_{\max}\sqrt{V}}{\sqrt{V + W + \big(E\sqrt{V + W}\big)^2}} > u\right) \le \sqrt{2}\exp\{-u^2/16\}, \tag{17}$$
and for any $u \ge \sqrt{2}$, it holds that
$$P\left(\frac{\big\|\sum_{t=1}^n Z_t\big\| - 2R_{\max}\sqrt{V}}{\sqrt{(V + W + 1)\big(1 + \tfrac{1}{2}\log(V + W + 1)\big)}} \ge u\right) \le \exp\{-u^2/2\}. \tag{18}$$
Furthermore, both bounds also hold with $W \equiv 0$ and $V = \sum_{t=1}^n \|Z_t\|^2$ if the martingale differences are conditionally symmetric.²

² A martingale difference sequence $Z_1,\dots,Z_n$ is conditionally symmetric if the law $\mathcal{L}(Z_t \mid Z_1,\dots,Z_{t-1}) = \mathcal{L}(-Z_t \mid Z_1,\dots,Z_{t-1})$.

In addition to extending the Euclidean result of the previous section to Banach spaces, (17) and (18) offer several advantages. First, the bounds are $n$-independent. The deviations in (17) and (18) are self-normalized (that is, scaled by root-variation terms) and all the terms are either distribution-dependent or data-dependent, as in the case of the Student's $t$-statistic (de la Peña et al., 2008). The advantage of (18), especially in the case of conditional symmetry, is that all the terms, modulo the additive constants 1, are data-dependent. We are not aware of similar bounds for norms of random vectors in the literature, and we wish to stress that the proof of the result is almost immediate, given the regret inequality. We would also like to stress that Theorem 3 holds without any assumption on the martingale difference sequence beyond square integrability.

Proof [Theorem 3] We take $F$ in Lemma 2 to be the unit ball in $B^*$, ensuring $\|\hat y_t\|_* \le 1$. For any martingale difference sequence $(Z_t)$ with values in $B$, the above lemma implies, by the definition of the norm,
$$\Big\|\sum_{t=1}^n Z_t\Big\| - 2R_{\max}\sqrt{V} \le \sum_{t=1}^n \langle \hat y_t, -Z_t\rangle \tag{19}$$
deterministically for all sample paths. Dividing both sides by $\sqrt{V + W + (E\sqrt{V + W})^2}$, we conclude that the left-hand side in (17) is upper bounded by
$$P\left(\frac{\sum_{t=1}^n \langle \hat y_t, -Z_t\rangle}{\sqrt{V + W + \big(E\sqrt{V + W}\big)^2}} > u\right). \tag{20}$$
To control this probability, we recall the following results (de la Peña et al., 2008, Theorem 12.4, Corollary 12.5):

Theorem 4 ((de la Peña et al., 2008)) For a pair of random variables $A, B$, with $B > 0$, such that
$$E\exp\{\lambda A - \lambda^2 B^2/2\} \le 1 \quad \forall \lambda \in \mathbb{R}, \tag{21}$$
it holds that for any $u > 0$,
$$P\left(\frac{|A|}{\sqrt{B^2 + (EB)^2}} > u\right) \le \sqrt{2}\exp\{-u^2/4\}$$
and for any $y > 0$ and $u \ge \sqrt{2}$,
$$P\left(\frac{|A|}{\sqrt{(B^2 + y)\big(1 + \tfrac{1}{2}\log(B^2/y + 1)\big)}} \ge u\right) \le \exp\{-u^2/2\}.$$

To apply this theorem, we verify assumption (21):

Lemma 5 The random variables $A = \sum_{t=1}^n \langle \hat y_t, -Z_t\rangle$ and $B^2 = 2\sum_{t=1}^n \big(\|Z_t\|^2 + E_{t-1}\|Z_t\|^2\big)$ satisfy (21). Furthermore, if the $Z_t$'s are conditionally symmetric, then $A = \sum_{t=1}^n \langle \hat y_t, -Z_t\rangle$ and $B^2 = \sum_{t=1}^n \|Z_t\|^2$ satisfy (21).

The simple proof of the Lemma is postponed to the Appendix. Putting together (20) with Lemma 5 and Theorem 4 concludes the proof of Theorem 3.

To close the loop of equivalences, we need to deduce (9) from the tail bound inequality. Let us use the first part of Theorem 3. Denote the random variable in the numerator of the fraction in (17) as $Y$ and the denominator as a random variable $U$. Then (17) implies that $(Y/U)$ is a subgaussian random variable. Hence, its second moment is bounded by a constant: $E(Y/U)^2 \le c$. However, by the Cauchy-Schwarz inequality,
$$EY = E\Big(U \cdot \frac{Y}{U}\Big) \le (EU^2)^{1/2}\Big(E\frac{Y^2}{U^2}\Big)^{1/2} \le \sqrt{c\,EU^2},$$
and $EU^2 \le 2E(V + W) = 4EV$, implying
$$E\Big\|\sum_{t=1}^n Z_t\Big\| \le 2R_{\max}E\sqrt{V} + 2\sqrt{c\,EV}. \tag{22}$$
This almost closes the loop, except the last term in (22) has the expectation inside the square root rather than outside, and thus presents a weaker upper bound (in the sense of (8) rather than (9)). We conjecture that there is a way to prove the upper bound with the expectation outside the square root. Nonetheless, to keep the promise of closing the loop, we observe that the upper bound of (8) implies that the Banach space has martingale type 2, which implies, via (Srebro et al., 2011), existence of a strongly convex function on the dual space, and, hence, existence of a strategy that guarantees (10) with a constant $C$ that may depend at most logarithmically on $n$.

2.3. Remarks

We compare our result to that of Pinelis (1994). Let $Z_1,\dots,Z_n$ be a martingale difference sequence taking values in a separable $(2, D)$-smooth Banach space $(B, \|\cdot\|)$. Pinelis (1994) proved, through a significantly more difficult analysis, that for any $u > 0$,
$$P\Big(\sup_{n \ge 1}\Big\|\sum_{t=1}^n Z_t\Big\| \ge \sigma u\Big) \le 2\exp\Big\{-\frac{u^2}{2D^2}\Big\}, \tag{23}$$
where $\sigma$ is a constant satisfying $\sum_{t=1}^\infty \|Z_t\|_\infty^2 \le \sigma^2$.
In comparison to Theorem 3, this result involves a distribution-independent variation $\sigma$ as a worst-case pointwise upper bound.
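The regret bound of Lemma 2, which drives the pathwise inequality (19), can be checked numerically in the simplest setting. The sketch below assumes the Euclidean case, where $R(f) = \tfrac{1}{2}\|f\|^2$, so that update (15) on the unit ball reduces to a projected gradient step and $R_{\max} = \sqrt{2}$. The function names and the random test sequence are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def project_unit_ball(v):
    """Euclidean projection onto the unit ball."""
    length = norm(v)
    return list(v) if length <= 1.0 else [c / length for c in v]


def adaptive_regret_and_bound(zs, f):
    """Run update (15) with R(f) = ||f||^2 / 2 (an assumed Euclidean instance).

    Returns the regret sum_t <y_t - f, z_t> and the bound
    2 * R_max * sqrt(sum_t ||z_t||^2) from Lemma 2.
    """
    r_max = math.sqrt(2.0)  # R_max^2 = sup ||f - g||^2 / 2 over the unit ball = 2
    y = [0.0] * len(zs[0])
    sq_sum = 0.0
    regret = 0.0
    for z in zs:
        regret += sum((yi - fi) * zi for yi, fi, zi in zip(y, f, z))
        sq_sum += norm(z) ** 2
        if sq_sum > 0.0:  # eta_t is undefined while every z_s seen so far is zero
            eta = r_max / math.sqrt(sq_sum)
            y = project_unit_ball([yi - eta * zi for yi, zi in zip(y, z)])
    return regret, 2.0 * r_max * math.sqrt(sq_sum)


def worst_comparator(zs):
    """Unit vector f = -sum(z) / ||sum(z)||, the comparator that maximizes regret."""
    total = [sum(col) for col in zip(*zs)]
    length = norm(total)
    return [-c / length for c in total] if length > 0 else [0.0] * len(total)


rng = random.Random(0)
# Increments of varying scale, to exercise the data-dependent step size.
zs = [[rng.gauss(0.0, 1.0) * (1 + t % 5) for _ in range(3)] for t in range(200)]
regret, bound = adaptive_regret_and_bound(zs, worst_comparator(zs))
print(regret, bound, regret <= bound)
```

The step size uses only the increments seen so far, which is why neither the update nor the bound needs to know the horizon in advance.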

The reader will notice that the pathwise inequality (19) does not depend on $n$ and the construction of $\hat y_t$ is also oblivious to this value. A simple argument then allows us to lift the real-valued Burkholder-Davis-Gundy inequality (with the constant from (Burkholder, 2002)) to Banach space valued martingales:

Lemma 6 With the notation of Theorem 3,
$$E\max_{s=1,\dots,n}\Big\|\sum_{t=1}^s Z_t\Big\| \le \big(2R_{\max} + \sqrt{3}\big)E\sqrt{V}.$$

Remarkably, the constant in the resulting BDG inequality is, up to an additive constant, proportional to $R_{\max}$. Once again, we have not seen such results in the literature, yet they follow with ease from regret inequalities.

We also remark that Theorem 3 can be naturally extended to $p$-smooth Banach spaces $B$. This is accomplished in a straightforward manner by extending Lemma 2.

3. Probabilistic Inequalities and Supervised Learning

We now look beyond linear prediction and analyze supervised learning problems with side information. Here again we establish a strong connection between existence of prediction strategies, the in-expectation inequalities for martingales, and high-probability tail bounds. In contrast to Section 2, we will not present any algorithms. Note that the simplest example of the equivalence (for binary prediction and in the absence of side information) was already stated in the very beginning of this paper.

3.1. Supervised learning with side information

We let $y_1,\dots,y_n \in \{\pm 1\}$ and $x_1,\dots,x_n \in X$ for some abstract measurable set $X$. Let $F$ be a class of $[-1,1]$-valued functions on $X$. Fix a cost function $\ell: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, convex in the first argument. For a given function $B: F \times X^n \to \mathbb{R}$, we aim to construct $\hat y_t = \hat y_t(x_1,\dots,x_t, y_1,\dots,y_{t-1}) \in [-1,1]$ such that the following adaptive bound holds:
$$\forall (x_t, y_t)_{t=1}^n, \quad \sum_{t=1}^n \ell(\hat y_t, y_t) \le \inf_{f \in F}\Big\{\sum_{t=1}^n \ell(f(x_t), y_t) + B(f; x_1,\dots,x_n)\Big\}. \tag{24}$$
We may view $\hat y_t$ as a prediction of the next value $y_t$ having observed $x_t$ and all the data thus far. In this paper, we focus on the linear loss $\ell(a, b) = -ab$ (equivalently, absolute loss $|a - b| = 1 - ab$ when $a \in [-1,1]$ and $b \in \{\pm 1\}$) and the square loss $\ell(a, b) = (a - b)^2$.
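We write (24) for the linear cost function as
$$\sup_{f \in F}\Big\{\sum_{t=1}^n y_t f(x_t) - B(f; x_1,\dots,x_n)\Big\} \le \sum_{t=1}^n y_t \hat y_t \tag{25}$$
while for the square loss it becomes
$$\sup_{f \in F}\Big\{\sum_{t=1}^n \big(2y_t f(x_t) - f(x_t)^2\big) - B(f; x_1,\dots,x_n)\Big\} \le \sum_{t=1}^n \big(2y_t \hat y_t - \hat y_t^2\big). \tag{26}$$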
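For readers who want the intermediate step, the passage from (24) to (25) and (26) is a short rearrangement. The derivation below uses only the two loss definitions, $\ell(a,b) = -ab$ and $\ell(a,b) = (a-b)^2$, and abbreviates $B(f; x_1,\dots,x_n)$ as $B(f)$.

```latex
% Linear loss: substitute \ell(a,b) = -ab into (24) and negate both sides.
\begin{align*}
-\sum_{t=1}^n \hat y_t y_t \le \inf_{f \in F}\Big\{-\sum_{t=1}^n f(x_t) y_t + B(f)\Big\}
\iff
\sup_{f \in F}\Big\{\sum_{t=1}^n y_t f(x_t) - B(f)\Big\} \le \sum_{t=1}^n y_t \hat y_t .
\end{align*}
% Square loss: expand both squares; the common term \sum_t y_t^2 cancels.
\begin{align*}
\sum_{t=1}^n (\hat y_t - y_t)^2 \le \inf_{f \in F}\Big\{\sum_{t=1}^n (f(x_t) - y_t)^2 + B(f)\Big\}
&\iff
\sum_{t=1}^n \big(\hat y_t^2 - 2 y_t \hat y_t\big) \le \inf_{f \in F}\Big\{\sum_{t=1}^n \big(f(x_t)^2 - 2 y_t f(x_t)\big) + B(f)\Big\} \\
&\iff
\sup_{f \in F}\Big\{\sum_{t=1}^n \big(2 y_t f(x_t) - f(x_t)^2\big) - B(f)\Big\} \le \sum_{t=1}^n \big(2 y_t \hat y_t - \hat y_t^2\big).
\end{align*}
```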

Given a function $B$ and a class $F$, there are two goals we may consider: (a) certify the existence of $(\hat y_t) \triangleq (\hat y_1,\dots,\hat y_n)$ satisfying the pathwise inequality (24) for all sequences $(x_t, y_t)_{t=1}^n$; or (b) give an explicit construction of $(\hat y_t)$. Both questions have been studied in the online learning literature, but the non-constructive approach will play an especially important role. Indeed, explicit constructions such as the simple gradient descent update (2) might not be available in more complex situations, yet it is the existence of $(\hat y_t)$ that yields the sought-after tail bounds.

To certify the existence of a strategy $(\hat y_t)$, consider the following object:
$$A(F, B) = \Big\langle\!\Big\langle \sup_{x_t}\inf_{\hat y_t}\max_{y_t} \Big\rangle\!\Big\rangle_{t=1}^n \Big\{\sum_{t=1}^n \ell(\hat y_t, y_t) - \inf_{f \in F}\Big\{\sum_{t=1}^n \ell(f(x_t), y_t) + B(f; x_1,\dots,x_n)\Big\}\Big\} \tag{27}$$
where the notation $\langle\!\langle \cdots \rangle\!\rangle_{t=1}^n$ stands for the repeated application of the operators (the outer operators corresponding to $t = 1$). The variable $x_t$ ranges over $X$, $y_t$ is in the set $\{\pm 1\}$, and $\hat y_t$ ranges in $[-1,1]$. It follows that $A(F, B) \le 0$ is a necessary and sufficient condition for the existence of $(\hat y_t)$ such that (24) holds. Indeed, the optimal choice for $\hat y_1$ is made given $x_1$; the optimal choice for $\hat y_2$ is made given $x_1, y_1, x_2$, and so on. This choice defines the optimal strategy $(\hat y_t)$.³ The other direction is immediate.

Suppose we can find an upper bound on $A(F, B)$ and then prove that this upper bound is non-positive. This would serve as a sufficient condition for the existence of $(\hat y_t)$. Next, we present such an upper bound for the case when the cost function is linear. More general results for convex Lipschitz cost functions can be found in (Foster et al., 2015).

3.2. Linear loss

As in the introduction, let $\epsilon = (\epsilon_1,\dots,\epsilon_n)$ be a sequence of independent Rademacher random variables. Let $x = (x_1,\dots,x_n)$ and $y = (y_1,\dots,y_n)$ be predictable processes with respect to the dyadic filtration $(\sigma(\epsilon_1,\dots,\epsilon_t))_{t=0}^n$, with values in $X$ and $\{\pm 1\}$, respectively. In other words, $x_t = x_t(\epsilon_1,\dots,\epsilon_{t-1}) \in X$ and $y_t = y_t(\epsilon_1,\dots,\epsilon_{t-1}) \in \{\pm 1\}$ for each $t = 1,\dots,n$.
One can think of the collections $(x_t)$ and $(y_t)$ as trees labeled, respectively, by elements of $X$ and $\{\pm 1\}$.

Lemma 7 For the case of the linear cost function,
$$A(F, B) = \sup_{x} E\Big[\sup_{f \in F}\sum_{t=1}^n \epsilon_t f(x_t) - B(f; x_1,\dots,x_n)\Big]. \tag{28}$$
Therefore, the following are equivalent:

• For any predictable process $x = (x_1,\dots,x_n)$,
$$E\Big[\sup_{f \in F}\sum_{t=1}^n \epsilon_t f(x_t) - B(f; x_1,\dots,x_n)\Big] \le 0. \tag{29}$$

• There exists a strategy $(\hat y_t)$ such that the pathwise inequality (25) holds.

Furthermore, the strategy can be assumed to satisfy
$$|\hat y_t| \le \sup_{f \in F}|f(x_t)|. \tag{30}$$

³ If the infima are not achieved, a limiting argument can be employed.

The in-expectation bound of (29) is a necessary and sufficient condition for the existence of a strategy with the per-sequence bound (25). This latter bound, however, implies a high-probability statement, in the spirit of the other results in the paper. Below, we detail this amplification.

Take any $X$-valued predictable process $x = (x_1,\dots,x_n)$ with respect to the dyadic filtration. The deterministic inequality (25) applied to $x_t = x_t(\epsilon_1,\dots,\epsilon_{t-1})$ and $y_t = \epsilon_t$ becomes
$$\sup_{f \in F}\Big\{\sum_{t=1}^n \epsilon_t f(x_t) - B(f; x_1,\dots,x_n)\Big\} \le \sum_{t=1}^n \epsilon_t \hat y_t \tag{31}$$
for any sample path $(\epsilon_1,\dots,\epsilon_n)$, and thus we have the comparison of tails
$$P\Big(\sup_{f \in F}\Big\{\sum_{t=1}^n \epsilon_t f(x_t) - B(f; x_1,\dots,x_n)\Big\} > u\Big) \le P\Big(\sum_{t=1}^n \epsilon_t \hat y_t > u\Big). \tag{32}$$
Given the boundedness of the increments $\epsilon_t \hat y_t$, the tail bounds follow immediately from the Azuma-Hoeffding inequality or from Freedman's inequality (Freedman, 1975). More precisely, we use the fact that the martingale differences are bounded by $|\hat y_t| \le \sup_{f \in F}|f(x_t)|$, and conclude:

Lemma 8 If there exists a prediction strategy $(\hat y_t)$ that satisfies (25) and (30), then for any predictable process $x$, the Azuma-Hoeffding inequality implies that
$$P\Big(\sup_{f \in F}\Big\{\sum_{t=1}^n \epsilon_t f(x_t) - B(f; x_1,\dots,x_n)\Big\} > u\Big) \le \exp\Big(-\frac{u^2}{4\max_{\epsilon}\sum_{t=1}^n \sup_{f \in F} f(x_t(\epsilon))^2}\Big), \tag{33}$$
Freedman's inequality implies
$$P\Big(\sup_{f \in F}\Big\{\sum_{t=1}^n \epsilon_t f(x_t) - B(f; x_1,\dots,x_n)\Big\} > u,\ \sum_{t=1}^n \sup_{f \in F} f(x_t)^2 \le \sigma^2\Big) \le \exp\Big(-\frac{u^2}{2\sigma^2 + 2uM/3}\Big), \tag{34}$$
where $M = \sup_{f \in F,\ \epsilon \in \{\pm 1\}^n,\ t \le n}|f(x_t(\epsilon))|$, and we also have that for any $\alpha > 0$,
$$P\Big(\sup_{f \in F}\Big\{\sum_{t=1}^n \epsilon_t f(x_t) - B(f; x_1,\dots,x_n)\Big\} - \alpha\sum_{t=1}^n \sup_{f \in F} f(x_t)^2 > u\Big) \le \exp(-2\alpha u). \tag{35}$$
In view of Lemma 7, a sufficient condition for these inequalities is that (29) holds for all $x$.

Let us emphasize the conclusion of the above lemma: the non-positivity of the expected supremum of a collection of martingales, offset by a function $B$, implies existence of a regret-minimization strategy, which implies a high-probability tail bound. To close the loop, we integrate out the tails, obtaining an in-expectation bound of the form (29), but possibly with a somewhat larger $B$ function (this depends on the particular form of $B$).

In addition to describing the equivalence, let us capitalize on it and prove a new tail bound. The most basic $B$ is a constant that depends on the complexity of $F$, but not on $f$ or the data. Define the worst-case sequential Rademacher averages as
$$R_n(F) \triangleq \sup_{x} E\sup_{f \in F}\sum_{t=1}^n \epsilon_t f(x_t). \tag{36}$$
Clearly, $B = R_n(F)$ satisfies (29) and the following is immediate.

Corollary 9 For any $F \subseteq \mathbb{R}^X$ and an $X$-valued predictable process $x$ with respect to the dyadic filtration,
$$P\Big(\sup_{f \in F}\sum_{t=1}^n \epsilon_t f(x_t) > R_n(F) + u\Big) \le \exp\Big(-\frac{u^2}{4\max_{\epsilon}\sum_{t=1}^n \sup_{f \in F} f(x_t(\epsilon))^2}\Big). \tag{37}$$

Superficially, (37) looks like a one-sided version of a deviation bound for classical (i.i.d.) Rademacher averages (Boucheron et al., 2013). However, sequential Rademacher averages are not Lipschitz with respect to a flip of a sign, as all of the remaining path may change after a flip. It is unclear to the authors how to prove (37) through other existing methods.

3.3. Square loss

Due to limited space, we will not state the analogue of Lemma 7 and simply outline the implication from existence of regret minimization strategies to high probability tail bounds. As for the case of the linear loss function, take any $X$-valued predictable process $x = (x_1,\dots,x_n)$ with respect to the dyadic filtration. Fix $\alpha > 0$. The deterministic inequality (26) for $x_t = x_t(\epsilon_1,\dots,\epsilon_{t-1})$ and $y_t = \tfrac{1}{\alpha}\epsilon_t$ becomes
$$\sup_{f \in F}\Big\{\sum_{t=1}^n \Big(\frac{2}{\alpha}\epsilon_t f(x_t) - f^2(x_t)\Big) - B(f; x_1,\dots,x_n)\Big\} \le \sum_{t=1}^n \Big(\frac{2}{\alpha}\epsilon_t \hat y_t - \hat y_t^2\Big). \tag{38}$$
As in the proof of (35), we obtain a tail comparison
$$P\Big(\sup_{f \in F}\Big\{\sum_{t=1}^n \Big(\frac{2}{\alpha}\epsilon_t f(x_t) - f^2(x_t)\Big) - B(f; x_1,\dots,x_n)\Big\} > \frac{u}{\alpha}\Big) \le P\Big(\sum_{t=1}^n \Big(\frac{2}{\alpha}\epsilon_t \hat y_t - \hat y_t^2\Big) > \frac{u}{\alpha}\Big) \le \exp\Big\{-\frac{\alpha u}{2}\Big\} \tag{39}$$
where the last inequality follows via a standard analysis of the moment generating function.

As an example, consider the Azoury-Vovk-Warmuth forecaster for linear regression (see e.g. (Cesa-Bianchi and Lugosi, 2006, Sec. 11.8)).
Take the class $F$ to be the class of functions $F = \{x \mapsto \langle f, x\rangle : f \in B_2^d\}$, where $B_2^d$ is the unit Euclidean ball in $\mathbb{R}^d$. Assuming $X = B_2^d$, the regret bound for the forecaster is known to be
$$B(f; x_1,\dots,x_n) = \|f\|^2 + Y^2\sum_{t=1}^n x_t^\top A_t^{-1} x_t,$$

where $A_t = I + \sum_{s=1}^t x_s x_s^\top$ and $Y = \max_t |y_t|$. However, when $F$ is indexed by the unit ball, the supremum in (39) has a closed form expression, and the overall probability inequality takes on the form
$$P\Big(\frac{1}{2}\Big\|\sum_{t=1}^n \epsilon_t x_t\Big\|^2_{A_n^{-1}} \ge \frac{1}{2}\sum_{t=1}^n x_t^\top A_t^{-1} x_t + u\Big) \le \exp\{-u\}. \tag{40}$$
We point out that, being functions of Rademacher random variables, the $x_t$'s are random variables themselves, and the terms in the above expression are dependent in a non-trivial manner.

We would like to refer the reader to the full version of this paper (Rakhlin and Sridharan, 2015), which contains further implications of the equivalence between the existence of deterministic strategies and tail bounds. In particular, the amplification allows us to prove a characterization of a notion of martingale type beyond the linear case.

4. Symmetrization: dyadic filtration is enough

In Section 3, we presented connections between deterministic regret inequalities in the supervised setting and tail bounds for dyadic martingales. One may ask whether these tail bounds can be used for more general martingales indexed by some set. The purpose of this section is to prove that statements for the dyadic filtration can be lifted to general processes via sequential symmetrization.

Consider the martingale
$$M_n^g = \sum_{t=1}^n \big(g(Z_t) - E[g(Z_t) \mid Z_1,\dots,Z_{t-1}]\big)$$
indexed by $g \in G$. If $(Z_t)$ is adapted to a dyadic filtration $\mathcal{A}_t = \sigma(\epsilon_1,\dots,\epsilon_t)$, each increment $g(Z_t) - E[g(Z_t) \mid Z_1,\dots,Z_{t-1}]$ takes on the value
$$f_g(x_t(\epsilon_{1:t-1})) \triangleq \big(g(Z_t(\epsilon_{1:t-1}, +1)) - g(Z_t(\epsilon_{1:t-1}, -1))\big)/2$$
or its negation, where $x_t$ is a predictable process with values in $Z \times Z$ and $f_g \in F$ is defined by $(z, z') \mapsto g(z) - g(z')$. In Section 3, we worked directly with martingales of the form $M_n^f = \sum_{t=1}^n \epsilon_t f(x_t(\epsilon))$, indexed by an abstract class $F \subseteq \mathbb{R}^X$ and an abstract $X$-valued predictable process $x$.

We extend the symmetrization approach of Panchenko (Panchenko, 2003) to sequential symmetrization for the case of martingales.
In contrast to the more frequently-used Giné-Zinn symmetrization proof (via Chebyshev's inequality) (Giné and Zinn, 1984; Van Der Vaart and Wellner, 1996) that allows a direct tail comparison of the symmetrized and the original processes, Panchenko's approach allows for an indirect comparison. The following immediate extension of (Panchenko, 2003, Lemma 1) will imply that any $\exp\{-\mu(u)\}$ type tail behavior of the symmetrized process yields the same behavior for the original process.

Lemma 10 Suppose $\xi$ and $\nu$ are random variables and for some $\Gamma \ge 1$ and for all $u \ge 0$,
$$P(\nu \ge u) \le \Gamma\exp\{-\mu(u)\}.$$

Let $\mu: \mathbb{R}_+ \to \mathbb{R}_+$ be an increasing differentiable function with $\mu(0) = 0$ and $\mu(\infty) = \infty$. Suppose for all $a \in \mathbb{R}$ and $\phi(x) \triangleq \mu([x - a]_+)$ it holds that $E\phi(\xi) \le E\phi(\nu)$. Then for any $u \ge 0$,
$$P(\xi \ge u) \le \Gamma\exp\{-\mu(u - \mu^{-1}(1))\}.$$
In particular, if $\mu(b) = cb$, we have $P(\xi \ge u) \le \Gamma\exp\{1 - cu\}$; if $\mu(b) = cb^2$, then $P(\xi \ge u) \le \Gamma\exp\{1 - cu^2/4\}$.

As in (Panchenko, 2003), the lemma will be used with $\xi$ and $\nu$ as functions of a single sample and the double sample, respectively. The expression for the double sample will be symmetrized in order to pass to the dyadic filtration. However, unlike (Panchenko, 2003), we are dealing with a dependent sequence $Z_1,\dots,Z_n$, and the meaning ascribed to the second sample $Z'_1,\dots,Z'_n$ is that of a conditionally independent tangent sequence. That is, $Z_t, Z'_t$ are independent and have the same distribution conditionally on $Z_1,\dots,Z_{t-1}$. Let $E_{t-1}$ stand for the conditional expectation given $Z_1,\dots,Z_{t-1}$.

Corollary 11 Let $B: G \times Z^{2n} \to \mathbb{R}$ be a function that is symmetric with respect to the swap of the $i$-th pair $z_i, z'_i$, for any $i \in [n]$:
$$B(g; z_1, z'_1,\dots,z_i, z'_i,\dots,z_n, z'_n) = B(g; z_1, z'_1,\dots,z'_i, z_i,\dots,z_n, z'_n) \tag{41}$$
for all $g \in G$. Then, under the assumptions of Lemma 10 on $\mu$, a tail behavior
$$\forall (z, z'), \quad P\Big(\sup_{g \in G}\sum_{t=1}^n \epsilon_t\big(g(z_t) - g(z'_t)\big) - B\big(g; (z_1, z'_1),\dots,(z_n, z'_n)\big) > u\Big) \le \Gamma\exp\{-\mu(u)\}$$
for all $u > 0$ implies the tail bound
$$P\Big(\sup_{g \in G}\sum_{t=1}^n \big(g(Z_t) - E_{t-1}g(Z_t)\big) - E_{Z'_{1:n}}B(g; Z_1, Z'_1,\dots,Z_n, Z'_n) > u\Big) \le \Gamma\exp\{-\mu(u - \mu^{-1}(1))\}$$
for any sequence of random variables $Z_1,\dots,Z_n$ and the corresponding tangent sequence $Z'_1,\dots,Z'_n$. A direct comparison of the expected suprema also holds:
$$E\sup_{g \in G}\sum_{t=1}^n \big(g(Z_t) - E_{t-1}g(Z_t)\big) - E_{Z'_{1:n}}B(g; Z_1, Z'_1,\dots,Z_n, Z'_n) \le \sup_{z, z'} E\sup_{g \in G}\sum_{t=1}^n \epsilon_t\big(g(z_t) - g(z'_t)\big) - B\big(g; (z_1, z'_1),\dots,(z_n, z'_n)\big). \tag{42}$$
The supremum is taken over a pair of predictable processes $z, z'$ with respect to the dyadic filtration.

We conclude that it is enough to prove tail bounds for a supremum
$$\sup_{f \in F}\sum_{t=1}^n \epsilon_t f(x_t) - B(f; x_1,\dots,x_n)$$
of a martingale with respect to the dyadic filtration, offset by a function $B(f; x_1,\dots,x_n)$, as done in Section 3.
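When $X$ and $F$ are finite, the dyadic objects of Section 3 can be enumerated, so the recursion behind Lemma 7 and the strategy (52) from its proof can be checked by brute force. The sketch below uses a hypothetical toy instance (the sets `X`, `F` and the horizon `N` are made up for illustration) with the constant offset $B = R_n(F)$ from (36). It computes the value $A(F, B)$, builds the predictions, and verifies the pathwise inequality (25) and the range condition (30) on every sequence.

```python
from functools import lru_cache
from itertools import product

# Hypothetical toy instance: two side-information values, three functions, horizon 3.
X = (0, 1)
F = ((1.0, -1.0), (0.5, 0.5), (-1.0, 0.25))  # each f is stored as (f(0), f(1))
N = 3


def make_phi(offset):
    """Return phi with phi(xs, ys) = phi_t(x_{1:t}, y_{1:t}) for the constant B = offset."""

    @lru_cache(maxsize=None)
    def phi(xs, ys):
        if len(ys) == N:
            return max(sum(y * f[x] for x, y in zip(xs, ys)) for f in F) - offset
        # phi_t = sup over x_{t+1} of the average over the next sign
        return max(
            0.5 * (phi(xs + (x,), ys + (1,)) + phi(xs + (x,), ys + (-1,)))
            for x in X
        )

    return phi


# Worst-case sequential Rademacher average (36): the recursion with B = 0.
rademacher = make_phi(0.0)((), ())
phi = make_phi(rademacher)
minimax_value = phi((), ())  # A(F, B) for B = R_n(F)


def predict(xs, ys_past):
    """Prediction as in (52): half the difference of phi_t over the two next labels."""
    return 0.5 * (phi(xs, ys_past + (1,)) - phi(xs, ys_past + (-1,)))


# Check the pathwise inequality (25) and the range condition (30) on every sequence.
max_violation = float("-inf")
range_ok = True
for xs in product(X, repeat=N):
    for ys in product((1, -1), repeat=N):
        preds = [predict(xs[: t + 1], ys[:t]) for t in range(N)]
        lhs = max(sum(y * f[x] for x, y in zip(xs, ys)) for f in F) - rademacher
        rhs = sum(y * p for y, p in zip(ys, preds))
        max_violation = max(max_violation, lhs - rhs)
        range_ok = range_ok and all(
            abs(p) <= max(abs(f[x]) for f in F) + 1e-12 for p, x in zip(preds, xs)
        )

print(rademacher, minimax_value, max_violation, range_ok)
```

A non-positive `max_violation` means that (25) holds on every path of the toy instance, which is the deterministic statement that the tail bounds of Lemma 8 are built on.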

Acknowledgements

Research is supported in part by the NSF under grants no. CDS&E-MSS ... and ...

References

J. Abernethy, A. Agarwal, P. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. In Proceedings of the 22th Annual Conference on Learning Theory, 2009.

B. Acciaio, M. Beiglböck, F. Penkner, W. Schachermayer, and J. Temme. A trajectorial interpretation of Doob's martingale inequalities. Ann. Appl. Probab., 23(4), 2013.

M. Beiglböck and M. Nutz. Martingale inequalities and deterministic counterparts. Electron. J. Probab, 19(95):1–15, 2014.

M. Beiglböck and P. Siorpaes. Pathwise versions of the Burkholder-Davis-Gundy inequality. Bernoulli, 21(1), 2015.

B. Bercu, B. Delyon, and E. Rio. Concentration inequalities for sums and martingales, 2015.

J. Borwein, A. Guirao, P. Hájek, and J. Vanderwerff. Uniformly convex functions on Banach spaces. Proceedings of the American Mathematical Society, 137(3), 2009.

S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.

D. Burkholder. The best constant in the Davis inequality for the expectation of the martingale square function. Transactions of the American Mathematical Society, 354(1):91–105, 2002.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

T. Cover. Behaviour of sequential predictors of binary sequences. In Proc. 4th Prague Conf. Inform. Theory, Statistical Decision Functions, Random Processes, 1965.

V. H. de la Peña, T. L. Lai, and Q.-M. Shao. Self-normalized processes: Limit theory and Statistical Applications. Springer, 2008.

D. Foster, A. Rakhlin, and K. Sridharan. Adaptive online learning, 2015. In Submission.

D. A. Freedman. On tail probabilities for martingales. The Annals of Probability, 1975.

E. Giné and J. Zinn. Some limit theorems for empirical processes. Annals of Probability, 12(4), 1984.

A. Gushchin. On pathwise counterparts of Doob's maximal inequalities. Proceedings of the Steklov Institute of Mathematics, 1(287), 2014.

D. Panchenko. Symmetrization approach to concentration inequalities for empirical processes. Annals of Probability, 31(4), 2003.

I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, 22(4), 1994.

A. Rakhlin and K. Sridharan. On equivalence of martingale tail bounds and deterministic regret inequalities. arXiv preprint, 2015.

A. Rakhlin and K. Sridharan. A tutorial on online supervised learning with applications to node classification in social networks. CoRR, 2016.

A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. Advances in Neural Information Processing Systems 23, 2010.

A. Rakhlin, K. Sridharan, and A. Tewari. Online learning via sequential complexities. Journal of Machine Learning Research, 2014.

N. Srebro, K. Sridharan, and A. Tewari. On the universality of online mirror descent. In NIPS, 2011.

A. W. Van Der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series, March 1996.

Appendix A. Proofs

Lemma 12 The update in (2) satisfies
$$\forall z_1,\dots,z_n, f \in B, \quad \sum_{t=1}^n \langle \hat y_t - f, z_t\rangle \le \sqrt{n}.$$

Proof [Lemma 12] The following two-line proof is standard. By the property of a projection,
$$\|\hat y_{t+1} - f\|^2 = \big\|\mathrm{Proj}_B(\hat y_t - n^{-1/2} z_t) - f\big\|^2 \le \big\|(\hat y_t - n^{-1/2} z_t) - f\big\|^2 \tag{43}$$
$$= \|\hat y_t - f\|^2 + \frac{1}{n}\|z_t\|^2 - 2n^{-1/2}\langle \hat y_t - f, z_t\rangle. \tag{44}$$
Rearranging,
$$2n^{-1/2}\langle \hat y_t - f, z_t\rangle \le \|\hat y_t - f\|^2 - \|\hat y_{t+1} - f\|^2 + \frac{1}{n}\|z_t\|^2.$$
Summing over $t = 1,\dots,n$, the first two terms telescope to at most $\|\hat y_1 - f\|^2 \le 1$ and the last terms sum to at most 1, which yields the desired statement.
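Lemma 12 is easy to check numerically. The sketch below runs update (2) on sequences in the unit Euclidean ball and evaluates the gap in the rearranged inequality (3). The helper names and the two test sequences are illustrative choices, not part of the paper.

```python
import math
import random


def project_unit_ball(v):
    """Euclidean projection onto the unit ball."""
    length = math.sqrt(sum(c * c for c in v))
    return list(v) if length <= 1.0 else [c / length for c in v]


def gap_in_inequality_3(zs):
    """Return ||sum z_t|| - sqrt(n) - sum <y_t, -z_t>; inequality (3) says this is <= 0."""
    n = len(zs)
    step = 1.0 / math.sqrt(n)
    y = [0.0] * len(zs[0])
    total = [0.0] * len(zs[0])
    inner = 0.0
    for z in zs:
        inner += sum(-yi * zi for yi, zi in zip(y, z))  # uses y_t chosen before seeing z_t
        total = [a + b for a, b in zip(total, z)]
        y = project_unit_ball([yi - step * zi for yi, zi in zip(y, z)])
    return math.sqrt(sum(c * c for c in total)) - math.sqrt(n) - inner


def random_point_in_ball(rng, d):
    """A point of the unit ball: random direction, random radius."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    length = math.sqrt(sum(c * c for c in g))
    radius = rng.random()
    return [radius * c / length for c in g]


rng = random.Random(1)
random_gap = gap_in_inequality_3([random_point_in_ball(rng, 5) for _ in range(100)])
constant_gap = gap_in_inequality_3([[1.0, 0.0]] * 100)  # every z_t equal to e_1
print(random_gap, constant_gap)
```

Both gaps should be non-positive. The constant sequence is a useful sanity case because the sum grows linearly while the strategy needs about $\sqrt{n}$ rounds to reach the boundary of the ball.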

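Proof [Lemma 6] Because of the anytime property of the regret bound and the strategy definition, we can write (19) as
$$\max_{s=1,\dots,n}\Big\{\Big\|\sum_{t=1}^s Z_t\Big\| - \sum_{t=1}^s \langle \hat y_t, -Z_t\rangle\Big\} \le 2R_{\max}\sqrt{V} \tag{45}$$
simply because the right-hand side is largest for $s = n$. Sub-additivity of max implies
$$\max_{s=1,\dots,n}\Big\|\sum_{t=1}^s Z_t\Big\| - 2R_{\max}\sqrt{V} \le \max_{s=1,\dots,n}\sum_{t=1}^s \langle \hat y_t, -Z_t\rangle. \tag{46}$$
By the Burkholder-Davis-Gundy inequality (with the constant from Burkholder (2002)),
$$E\max_{s=1,\dots,n}\sum_{t=1}^s \langle \hat y_t, -Z_t\rangle \le \sqrt{3}\,E\Big(\sum_{t=1}^n \langle \hat y_t, Z_t\rangle^2\Big)^{1/2} \le \sqrt{3}\,E\sqrt{V}. \tag{47}$$

Proof [Lemma 2] Let $\hat y_{t+1}$ be the minimizer in (15). Because of the update form,
$$\forall f \in F, \quad \langle \hat y_{t+1} - f, z_t\rangle \le \frac{1}{\eta_t}\big(D_R(f, \hat y_t) - D_R(f, \hat y_{t+1}) - D_R(\hat y_{t+1}, \hat y_t)\big).$$
Summing over $t = 1,\dots,n$,
$$\sum_{t=1}^n \langle \hat y_{t+1} - f, z_t\rangle \le \eta_1^{-1}D_R(f, \hat y_1) + \sum_{t=2}^n (\eta_t^{-1} - \eta_{t-1}^{-1})D_R(f, \hat y_t) - \sum_{t=1}^n \eta_t^{-1}D_R(\hat y_{t+1}, \hat y_t)$$
$$\le \eta_1^{-1}R^2_{\max} + \sum_{t=2}^n (\eta_t^{-1} - \eta_{t-1}^{-1})R^2_{\max} - \sum_{t=1}^n \frac{\eta_t^{-1}}{2}\|\hat y_{t+1} - \hat y_t\|_*^2 = R^2_{\max}\eta_n^{-1} - \sum_{t=1}^n \frac{\eta_t^{-1}}{2}\|\hat y_{t+1} - \hat y_t\|_*^2,$$
where we used strong convexity of $R$ and the fact that $\eta_t$ is nonincreasing. Next, we write
$$\sum_{t=1}^n \langle \hat y_t - f, z_t\rangle = \sum_{t=1}^n \langle \hat y_{t+1} - f, z_t\rangle + \sum_{t=1}^n \langle \hat y_t - \hat y_{t+1}, z_t\rangle$$
and upper bound the second term by noting that
$$\langle \hat y_t - \hat y_{t+1}, z_t\rangle \le \|\hat y_t - \hat y_{t+1}\|_*\,\|z_t\| \le \frac{\eta_t^{-1}}{2}\|\hat y_t - \hat y_{t+1}\|_*^2 + \frac{\eta_t}{2}\|z_t\|^2.$$
Combining the bounds,
$$\sum_{t=1}^n \langle \hat y_t - f, z_t\rangle \le R^2_{\max}\eta_n^{-1} + \sum_{t=1}^n \frac{\eta_t}{2}\|z_t\|^2. \tag{48}$$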

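Using the fact (Cesa-Bianchi and Lugosi, 2006, Lemma 11.8) that for nonnegative $(\alpha_t)$

$$\sum_{t=1}^{n} \frac{\alpha_t}{\sqrt{\sum_{s=1}^{t} \alpha_s}} \le 2 \sqrt{\sum_{t=1}^{n} \alpha_t}$$

and the definition of $\eta_t$,

$$\sum_{t=1}^{n} \langle \hat{y}_t - f, z_t \rangle \le 2 R_{\max} \sqrt{\sum_{t=1}^{n} \|z_t\|^2}. \tag{49}$$

Proof [Lemma 5] Let $\mathbb{E}_{t-1}[\cdot] = \mathbb{E}[\,\cdot \mid Z_1, \ldots, Z_{t-1}]$ denote conditional expectation, and let $Z'_t$ denote a copy of $Z_t$ that is conditionally independent given $Z_1, \ldots, Z_{t-1}$. We have

$$\mathbb{E}_{t-1} \exp\left\{ \lambda \langle \hat{y}_t, Z_t - \mathbb{E}_{t-1} Z'_t \rangle - \lambda^2 \big( \|Z_t\|^2 + \mathbb{E}_{t-1} \|Z'_t\|^2 \big) \right\}$$

$$\le \mathbb{E}_{t-1} \exp\left\{ \lambda \langle \hat{y}_t, Z_t - Z'_t \rangle - \lambda^2 \big( \|Z_t\|^2 + \|Z'_t\|^2 \big) \right\}$$

$$= \mathbb{E}_{t-1} \mathbb{E}_{\epsilon} \exp\left\{ \lambda \epsilon \langle \hat{y}_t, Z_t - Z'_t \rangle - \lambda^2 \big( \|Z_t\|^2 + \|Z'_t\|^2 \big) \right\}.$$

Since exp is a convex function, the expression is

$$\mathbb{E}_{t-1} \mathbb{E}_{\epsilon} \exp\left\{ \tfrac{1}{2} \big( 2\lambda \epsilon \langle \hat{y}_t, Z_t \rangle - 2\lambda^2 \|Z_t\|^2 \big) + \tfrac{1}{2} \big( -2\lambda \epsilon \langle \hat{y}_t, Z'_t \rangle - 2\lambda^2 \|Z'_t\|^2 \big) \right\}$$

$$\le \tfrac{1}{2} \mathbb{E}_{t-1} \mathbb{E}_{\epsilon} \exp\left\{ 2\lambda \epsilon \langle \hat{y}_t, Z_t \rangle - 2\lambda^2 \|Z_t\|^2 \right\} + \tfrac{1}{2} \mathbb{E}_{t-1} \mathbb{E}_{\epsilon} \exp\left\{ -2\lambda \epsilon \langle \hat{y}_t, Z'_t \rangle - 2\lambda^2 \|Z'_t\|^2 \right\}$$

$$= \mathbb{E}_{t-1} \mathbb{E}_{\epsilon} \exp\left\{ 2\lambda \epsilon \langle \hat{y}_t, Z_t \rangle - 2\lambda^2 \|Z_t\|^2 \right\}$$

$$\le \mathbb{E}_{t-1} \exp\left\{ 2\lambda^2 \langle \hat{y}_t, Z_t \rangle^2 - 2\lambda^2 \|Z_t\|^2 \right\} \le 1$$

since $\|\hat{y}_t\| \le 1$. Repeating this argument for $t = n$ down to $t = 1$ yields the statement.

If $Z_t$ are conditionally symmetric, then $\langle \hat{y}_t, Z_t \rangle$ are also conditionally symmetric. Hence,

$$\mathbb{E}_{t-1} \exp\left\{ \lambda \langle \hat{y}_t, Z_t \rangle - \tfrac{\lambda^2}{2} \|Z_t\|^2 \right\} = \mathbb{E}_{t-1} \mathbb{E}_{\epsilon} \exp\left\{ \lambda \epsilon \langle \hat{y}_t, Z_t \rangle - \tfrac{\lambda^2}{2} \|Z_t\|^2 \right\}$$

$$\le \mathbb{E}_{t-1} \exp\left\{ \tfrac{\lambda^2}{2} \langle \hat{y}_t, Z_t \rangle^2 - \tfrac{\lambda^2}{2} \|Z_t\|^2 \right\} \le 1.$$

Proof [Lemma 7] For binary outcomes $y \in \{\pm 1\}$ and either absolute loss or linear loss,

$$A(F, B) = \sup_{x_1} \inf_{\hat{y}_1} \max_{y_1} \cdots \sup_{x_n} \inf_{\hat{y}_n} \max_{y_n} \left\{ \sum_{t=1}^{n} -y_t \hat{y}_t + \sup_{f \in F} \Big\{ \sum_{t=1}^{n} y_t f(x_t) - B(f; x_1, \ldots, x_n) \Big\} \right\},$$

where we shall restrict $\hat{y}_t$ to range over the interval $|\hat{y}_t| \le \sup_{f \in F} |f(x_t)|$ and $y_t$ to lie in $\{\pm 1\}$. Consider the last step $t = n$. Given $x_{1:n}$, $\hat{y}_{1:n-1}$, and $y_{1:n-1}$, we solve

$$\inf_{\hat{y}_n} \max_{y_n} \left\{ -\hat{y}_n y_n + \phi_n(x_{1:n}, y_{1:n}) \right\} \tag{50}$$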

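where

$$\phi_n(x_{1:n}, y_{1:n}) \triangleq \sup_{f \in F} \Big\{ \sum_{t=1}^{n} y_t f(x_t) - B(f; x_1, \ldots, x_n) \Big\}. \tag{51}$$

Since there are two possibilities for $y_n$, the closed form solution for $\hat{y}_n$ is given by

$$\hat{y}_n = \tfrac{1}{2} \big( \phi_n(x_{1:n}, y_{1:n-1}, +1) - \phi_n(x_{1:n}, y_{1:n-1}, -1) \big). \tag{52}$$

Importantly, this value satisfies $|\hat{y}_n| \le \sup_{f \in F} |f(x_n)|$. With this optimal choice, (50) is equal to $\mathbb{E}_{\epsilon_n} \phi_n(x_{1:n}, y_{1:n-1}, \epsilon_n)$. We now include the supremum over $x_n$ in the definition of $\phi_{n-1}$,

$$\phi_{n-1}(x_{1:n-1}, y_{1:n-1}) \triangleq \sup_{x_n} \mathbb{E}_{\epsilon_n} \phi_n(x_{1:n}, y_{1:n-1}, \epsilon_n),$$

and repeat the argument for $t = n - 1$. Since all the steps are equalities,

$$A(F, B) = \phi_0(\emptyset) = \sup_{x_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_n} \mathbb{E}_{\epsilon_n} \sup_{f \in F} \Big\{ \sum_{t=1}^{n} \epsilon_t f(x_t) - B(f; x_1, \ldots, x_n) \Big\},$$

which can be written as (28).

Proof [Lemma 10] We have

$$P(\xi \ge u) \le \frac{\mathbb{E} \phi(\xi)}{\phi(u)} \le \frac{\mathbb{E} \phi(\nu)}{\phi(u)} \le \frac{1}{\phi(u)} \Big( \phi(0) + \int_0^{\infty} \phi'(x) P(\nu \ge x)\, dx \Big).$$

Choose $a = u - \mu^{-1}(1)$, where $\mu^{-1}$ is the inverse function. If $a < 0$, the conclusion of the lemma is true since $\Gamma \ge 1$. In the case of $a \ge 0$, we have $\phi(0) = 0$. The above upper bound becomes

$$P(\xi \ge u) \le \frac{\Gamma}{\phi(u)} \int_0^{\infty} \phi'(x) \exp\{-\mu(x)\}\, dx = \frac{\Gamma}{\phi(u)} \int_a^{\infty} \mu'(x) \exp\{-\mu(x)\}\, dx$$

$$= \frac{\Gamma}{\mu(u - a)} \big[ -\exp\{-\mu(x)\} \big]_a^{\infty} = \Gamma \exp\{-\mu(a)\} = \Gamma \exp\{-\mu(u - \mu^{-1}(1))\}.$$

If $\mu(b) = cb$, we have

$$P(\xi \ge u) \le \Gamma \exp\{-c(u - 1/c)\} = \Gamma \exp\{1 - cu\}.$$

If $\mu(b) = cb^2$, we have

$$P(\xi \ge u) \le \Gamma \exp\{-c(u - 1/\sqrt{c})^2\} \le \Gamma \exp\{-cu^2/4\}$$

whenever $u \ge 2/\sqrt{c}$. If $u \le 2/\sqrt{c}$, the conclusion is valid since $\Gamma \ge 1$.

Proof [Corollary 11] Let

$$\xi(Z_1, \ldots, Z_n, Z'_1, \ldots, Z'_n) = \sup_{g} \sum_{t=1}^{n} \big( g(Z_t) - g(Z'_t) \big) - B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n)$$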

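and

$$\nu(Z_1, \ldots, Z_n) = \sup_{g} \sum_{t=1}^{n} \big( g(Z_t) - \mathbb{E}_{t-1} g(Z'_t) \big) - \mathbb{E}_{Z'_{1:n}} B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n).$$

Then for any convex $\phi : \mathbb{R} \to \mathbb{R}$, $\mathbb{E} \phi(\nu) \le \mathbb{E} \phi(\xi)$ using convexity of the supremum. The problem is now reduced to obtaining tail bounds for

$$P\Big( \sup_{g} \sum_{t=1}^{n} \big( g(Z_t) - g(Z'_t) \big) - B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) > u \Big).$$

Write the probability as

$$\mathbb{E}\, \mathbf{1}\{ \xi(Z_1, \ldots, Z_n, Z'_1, \ldots, Z'_n) > u \}.$$

We now proceed to replace the random variables from $n$ backwards with a dyadic filtration. Let us start with the last index. Renaming $Z_n$ and $Z'_n$ we see that

$$\mathbb{E}\, \mathbf{1}\Big\{ \sup_{g} \sum_{t=1}^{n} \big( g(Z_t) - g(Z'_t) \big) - B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) > u \Big\}$$

$$= \mathbb{E}\, \mathbf{1}\Big\{ \sup_{g} \sum_{t=1}^{n-1} \big( g(Z_t) - g(Z'_t) \big) + \big( g(Z_n) - g(Z'_n) \big) - B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) > u \Big\}$$

$$= \mathbb{E}\, \mathbb{E}_{\epsilon_n} \mathbf{1}\Big\{ \sup_{g} \sum_{t=1}^{n-1} \big( g(Z_t) - g(Z'_t) \big) + \epsilon_n \big( g(Z_n) - g(Z'_n) \big) - B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) > u \Big\}$$

$$\le \mathbb{E} \sup_{z_n, z'_n} \mathbb{E}_{\epsilon_n} \mathbf{1}\Big\{ \sup_{g} \sum_{t=1}^{n-1} \big( g(Z_t) - g(Z'_t) \big) + \epsilon_n \big( g(z_n) - g(z'_n) \big) - B(g; Z_1, Z'_1, \ldots, Z_{n-1}, Z'_{n-1}, z_n, z'_n) > u \Big\}.$$

Proceeding in this manner for step $n - 1$ and back to $t = 1$, we obtain an upper bound of

$$\sup_{z_1, z'_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{z_n, z'_n} \mathbb{E}_{\epsilon_n} \mathbf{1}\Big\{ \sup_{g} \sum_{t=1}^{n} \epsilon_t \big( g(z_t) - g(z'_t) \big) - B(g; z_1, z'_1, \ldots, z_n, z'_n) > u \Big\}$$

$$= \sup_{\mathbf{x}} \mathbb{E}\, \mathbf{1}\Big\{ \sup_{g} \sum_{t=1}^{n} \epsilon_t f_g(x_t) - B(g; x_1, \ldots, x_n) > u \Big\}.$$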
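The first symmetrization step in the proof above (inserting the sign $\epsilon_n$, then passing to the worst case over $z_n, z'_n$) can be verified exactly on a toy instance. The sketch below is only an illustration for $n = 1$ under assumptions that are not taken from the paper: $Z$ and $Z'$ are i.i.d. uniform on a three-point set, the class of functions is a hypothetical list of three maps, and the penalty $B$ is a made-up function chosen to be symmetric under swapping $z$ and $z'$, which is what the renaming argument relies on.

```python
from fractions import Fraction
from itertools import product

SUPPORT = [0, 1, 2]  # hypothetical finite support of Z and Z'
CLASS_G = [lambda z: z, lambda z: -z, lambda z: z * z / 2]  # hypothetical class

def penalty(g, z, z_prime):
    """A made-up B(g; z, z') that is symmetric in (z, z')."""
    return 0.25 * (g(z) - g(z_prime)) ** 2

def exceeds(z, z_prime, sign, u):
    """Indicator of sup_g { sign * (g(z) - g(z')) - B(g; z, z') } > u."""
    return max(sign * (g(z) - g(z_prime)) - penalty(g, z, z_prime)
               for g in CLASS_G) > u

def tail_original(u):
    """E 1{ sup_g (g(Z) - g(Z')) - B > u } for i.i.d. uniform Z, Z'."""
    pairs = list(product(SUPPORT, repeat=2))
    return Fraction(sum(exceeds(z, zp, 1, u) for z, zp in pairs), len(pairs))

def tail_symmetrized(u):
    """E E_eps 1{ sup_g eps * (g(Z) - g(Z')) - B > u }."""
    pairs = list(product(SUPPORT, repeat=2))
    hits = sum(exceeds(z, zp, s, u) for z, zp in pairs for s in (1, -1))
    return Fraction(hits, 2 * len(pairs))

def tail_worst_case(u):
    """sup over (z, z') of E_eps 1{ sup_g eps * (g(z) - g(z')) - B > u }."""
    return max(Fraction(sum(exceeds(z, zp, s, u) for s in (1, -1)), 2)
               for z, zp in product(SUPPORT, repeat=2))

for u in (0.0, 0.3, 0.5, 1.0):
    print(u, tail_original(u), tail_symmetrized(u), tail_worst_case(u))
```

In each printed row the first two probabilities should coincide (the equality step) and the third should be at least as large (the supremum step); for longer sequences the same two moves are applied from the last index backwards.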

On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities

Sasha Rakhlin, Department of Statistics, The Wharton School, University of Pennsylvania. Dec 16, 2015. Joint work with K. Sridharan. arXiv:1510.03925