On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities


Alexander Rakhlin (University of Pennsylvania), Karthik Sridharan (Cornell University)
October 17, 2015

Abstract

We study an equivalence of (i) deterministic pathwise statements appearing in the online learning literature (termed regret bounds), (ii) high-probability tail bounds for the supremum of a collection of martingales (of a specific form arising from uniform laws of large numbers for martingales), and (iii) in-expectation bounds for the supremum. By virtue of the equivalence, we prove exponential tail bounds for norms of Banach space valued martingales via deterministic regret bounds for the online mirror descent algorithm with an adaptive step size. We extend these results beyond the linear structure of the Banach space: we define a notion of martingale type for general classes of real-valued functions and show its equivalence (up to a logarithmic factor) to various sequential complexities of the class (in particular, the sequential Rademacher complexity and its offset version). For classes with the general martingale type 2, we exhibit a finer notion of variation that allows partial adaptation to the function indexing the martingale. Our proof technique rests on sequential symmetrization and on certifying the existence of regret minimization strategies for certain online prediction problems.

1 Introduction

Let $Z_1,\dots,Z_n$ be a martingale difference sequence taking values in a separable $(2,D)$-smooth Banach space $(\mathcal{B}, \|\cdot\|)$. A result due to Pinelis [17] asserts that for any $u > 0$

$$P\left(\left\|\sum_{t=1}^n Z_t\right\| \ge \sigma u\right) \le 2\exp\left\{-\frac{u^2}{2D^2}\right\}, \tag{1}$$

where $\sigma$ is a constant satisfying $\sum_{t=1}^n \|Z_t\|_\infty^2 \le \sigma^2$. Writing the norm $\|x\| = \sup_{\|y\|_* \le 1} \langle y, x\rangle$ as the supremum over the dual ball, we may re-interpret (1) as a one-sided tail control for the supremum of a stochastic process $\left\{ y \mapsto \left\langle y, \sum_{t=1}^n Z_t\right\rangle : \|y\|_* \le 1 \right\}$.

In this paper, we consider several extensions of (1), motivated by the following questions:
(a) Can (1) be strengthened by replacing $\sigma$ with a path-dependent version of variation?
(b) Does a version of (1) hold when we move away from the linear structure of the Banach space?
Positive answers to these questions constitute the first contribution of our paper. The second contribution involves the actual technique. The cornerstone of our analysis is a certain equivalence of martingale inequalities and deterministic pathwise statements. The latter inequalities are studied in the field of online learning (or, sequential prediction), and are referred to as regret bounds. We show that the existence (which can be certified via the minimax theorem) of prediction strategies that minimize regret yields predictable processes that help in answering (a) and (b). The equivalence is exploited in both directions, whereby stronger regret bounds are derived from the corresponding probabilistic bounds, and vice versa. To obtain one of the main results in the paper, we sharpen the bound by passing several times between the deterministic statements and probabilistic tail bounds. The equivalence asserts a strong connection between probabilistic inequalities for martingales and online learning algorithms.

In the remainder of this section, we present a simple example of the equivalence based on the gradient descent method, arguably the most popular convex optimization procedure. The example captures, loosely speaking, a correspondence between deterministic optimization methods and probabilistic bounds.

Consider the unit Euclidean ball $B$ in $\mathbb{R}^d$. Let $z_1,\dots,z_n \in B$ and define, recursively, the Euclidean projections

$$\hat y_{t+1} = \hat y_{t+1}(z_1,\dots,z_t) = \mathrm{Proj}_B\left(\hat y_t - n^{-1/2} z_t\right) \tag{2}$$

for each $t = 1,\dots,n$, with the initial value $\hat y_1 = 0$. Elementary algebra (see footnote 1) shows that for any $f \in B$, the regret inequality

$$\sum_{t=1}^n \langle \hat y_t - f, z_t\rangle \le \sqrt{n}$$

holds deterministically for any sequence $z_1,\dots,z_n \in B$. Taking the supremum over $f \in B$, we re-write this statement as

$$\left\|\sum_{t=1}^n z_t\right\| - \sqrt{n} \le \sum_{t=1}^n \langle \hat y_t, -z_t\rangle. \tag{3}$$

Applying the deterministic inequality to a $B$-valued martingale difference sequence $Z_1,\dots,Z_n$,

$$P\left(\left\|\sum_{t=1}^n Z_t\right\| - \sqrt{n} > u\right) \le P\left(\sum_{t=1}^n \langle \hat y_t, -Z_t\rangle > u\right) \le \exp\left\{-\frac{u^2}{2n}\right\}. \tag{4}$$

The latter upper bound is an application of the Azuma-Hoeffding inequality. Indeed, the process $(\hat y_t)$ is predictable with respect to $\sigma(Z_1,\dots,Z_t)$, and thus $(\langle \hat y_t, -Z_t\rangle)$ is a $[-1,1]$-valued martingale difference sequence. It is worth emphasizing the conclusion: one-sided deviation tail bounds for a norm of a vector-valued martingale can be deduced from tail bounds for real-valued martingales with the help of a deterministic inequality.
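The regret inequality behind (3) is easy to check numerically. The following is a minimal sketch in plain Python, assuming the fixed step size $n^{-1/2}$ of update (2); the helper names (`project_to_unit_ball`, `ogd_regret`, `random_point_in_ball`) are illustrative and do not come from the paper. The worst-case regret over $f \in B$ is computed in closed form as $\sum_t \langle \hat y_t, z_t\rangle + \|\sum_t z_t\|$.

```python
import math
import random


def project_to_unit_ball(v):
    """Euclidean projection of v onto the unit ball."""
    norm = math.sqrt(sum(c * c for c in v))
    return list(v) if norm <= 1.0 else [c / norm for c in v]


def ogd_regret(zs):
    """Run update (2) on z_1..z_n and return sup over f in B of the regret.

    sup_f sum_t <y_t - f, z_t> = sum_t <y_t, z_t> + || sum_t z_t ||.
    """
    n, d = len(zs), len(zs[0])
    step = 1.0 / math.sqrt(n)      # fixed step size n^{-1/2}
    y = [0.0] * d                  # initial value y_1 = 0
    inner_products = 0.0
    z_sum = [0.0] * d
    for z in zs:
        inner_products += sum(yi * zi for yi, zi in zip(y, z))
        z_sum = [s + zi for s, zi in zip(z_sum, z)]
        y = project_to_unit_ball([yi - step * zi for yi, zi in zip(y, z)])
    return inner_products + math.sqrt(sum(s * s for s in z_sum))


def random_point_in_ball(rng, d):
    """Hypothetical test input: a random direction scaled to norm at most 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    radius = rng.random()
    return [radius * c / norm for c in v]


if __name__ == "__main__":
    rng = random.Random(0)
    zs = [random_point_in_ball(rng, 3) for _ in range(200)]
    print(ogd_regret(zs), "<=", math.sqrt(len(zs)))
```

The bound is pathwise, so it should hold for adversarial inputs as well, for example a sequence that repeats the same unit vector.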
Next, integrating the tail bound in (4) yields a seemingly weaker in-expectation statement

$$E\left\|\sum_{t=1}^n Z_t\right\| \le c\sqrt{n} \tag{5}$$

for an appropriate constant $c$. The twist in this uncomplicated story comes next: with the help of the minimax theorem, [23] established existence of strategies $(\hat y_t)$ such that

$$\forall z_1,\dots,z_n,\ f \in B, \qquad \sum_{t=1}^n \langle \hat y_t - f, z_t\rangle \le \sup E\left\|\sum_{t=1}^n Z_t\right\|, \tag{6}$$

with the supremum taken over all martingale difference sequences with respect to a dyadic filtration. In view of (5), this bound is $c\sqrt{n}$.

What have we achieved? Let us summarize. The deterministic inequality (3), which holds for all sequences, implies a tail bound (4). The latter, in turn, implies an in-expectation bound (5), which implies (3) (with a worse constant) through a minimax argument, thus closing the loop. The equivalence studied in depth in this paper is informally stated below:

Footnote 1: See the two-line proof in the Appendix, Lemma 19.

Informal: The following bounds imply each other:
(a) an inequality that holds for all sequences;
(b) a deviation tail probability for the size of a martingale;
(c) an in-expectation bound on the size of a martingale.

The equivalence, in particular, allows us to amplify the in-expectation bounds to appropriate high-probability tail bounds.

As already mentioned, the pathwise inequalities, such as (3), are extensively studied in the field of online learning. In this paper, we employ some of the recently developed data-dependent (adaptive) regret inequalities to prove tail bounds for martingales. In turn, in view of the above equivalence, martingale inequalities shall give rise to novel deterministic regret bounds.

While writing the paper, we learned of the trajectorial approach, extensively studied in recent years. In particular, it has been shown that Doob's maximal inequalities and Burkholder-Davis-Gundy inequalities have deterministic counterparts [2, 3, 13, 4]. The online learning literature contains a trove of pathwise inequalities, and further synthesis with the trajectorial approach (and the applications in mathematical finance) appears to be a promising direction.

This paper is organized as follows. In the next section, we extend the Euclidean result to martingales with values in Banach spaces, and also improve it by replacing $\sqrt{n}$ with the square root of variation. We define a notion of martingale type for general classes of functions in Section 3, and exhibit a tight connection to the growth of sequential Rademacher complexity. Section 4 presents sequential symmetrization; here we prove that statements for the dyadic filtration automatically yield corresponding tail bounds for general discrete-time stochastic processes. In Section 5, we introduce the machinery for obtaining regret inequalities, and show how these inequalities allow one to amplify certain in-expectation bounds into high-probability statements (Section 6).
The last two sections contain some of the main results: in Section 7 we prove a high probability bound for the notion of martingale type, and in Section 8 we present a finer analysis of adaptivity of the variation term.

2 Results in Banach spaces

For the case of the Euclidean (or Hilbertian) norm, it is easy to see that the $\sqrt{n}$ bound of (5) can be improved to a distribution-dependent quantity $\left(\sum_{t=1}^n E\|Z_t\|^2\right)^{1/2}$. Given the equivalence sketched earlier, one may wonder whether this implies existence of a gradient-descent-like method with a sequence-dependent variation governing the rate of convergence of this optimization procedure. Below, we indeed present such a method for 2-smooth Banach spaces.

Let $(\mathcal{B}, \|\cdot\|)$ be a separable Banach space, and let $(\mathcal{B}^*, \|\cdot\|_*)$ denote its dual. $(\mathcal{B}, \|\cdot\|)$ is of martingale type $p$ (for $p \in [1,2]$) if there exists a constant $C$ such that

$$E\left\|\sum_{t=1}^n Z_t\right\|^p \le C^p \sum_{t=1}^n E\|Z_t\|^p \tag{7}$$

for any $\mathcal{B}$-valued martingale difference sequence. The best possible constant $C$ in this inequality (as well as its finiteness) is known to depend on the geometry of the Banach space. For instance, for a Hilbert space (7) holds for $p = 2$ with constant $C = 1$. On the other hand, the triangle inequality implies that any space has the trivial type $p = 1$.
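The Hilbert-space case of (7) can be verified by exact enumeration. The sketch below assumes a dyadic martingale difference sequence of the form $Z_t = \epsilon_t v_t(\epsilon_1,\dots,\epsilon_{t-1})$ with values in $\mathbb{R}^2$; the function names and the particular choice of `example_direction` are hypothetical and serve only as an illustration. Because the increments are orthogonal in expectation, the two sides of (7) coincide for $p = 2$, $C = 1$.

```python
import itertools
import math


def second_moments(n, direction):
    """Return (E||sum_t Z_t||^2, sum_t E||Z_t||^2) for Z_t = eps_t * direction(eps_{1:t-1}).

    The expectation is over independent Rademacher signs eps_1..eps_n,
    computed exactly by enumerating all 2^n sign sequences.
    """
    lhs = 0.0
    rhs = 0.0
    paths = list(itertools.product((-1, 1), repeat=n))
    for eps in paths:
        total = None
        for t in range(n):
            v = direction(eps[:t])              # predictable: uses past signs only
            z = [eps[t] * c for c in v]
            rhs += sum(c * c for c in z)
            total = z if total is None else [a + b for a, b in zip(total, z)]
        lhs += sum(c * c for c in total)
    return lhs / len(paths), rhs / len(paths)


def example_direction(prefix):
    """Hypothetical predictable process with values in R^2."""
    k = sum(prefix)
    return [math.cos(k), (1 + len(prefix)) * math.sin(k)]


if __name__ == "__main__":
    print(second_moments(6, example_direction))
```

The cross terms $E\langle Z_s, Z_t\rangle$, $s < t$, vanish because $E[\epsilon_t \mid \epsilon_1,\dots,\epsilon_{t-1}] = 0$, which is why no constant larger than one is needed here.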

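An equivalent way to define martingale type $p$ is to ask that there exist a constant $C$ such that

$$E \sup_{\|y\|_* \le 1}\left\langle y, \sum_{t=1}^n Z_t\right\rangle = E\left\|\sum_{t=1}^n Z_t\right\| \le C\left(\sum_{t=1}^n E\|Z_t\|^p\right)^{1/p}. \tag{8}$$

We now show that the strengthening to a sequence-dependent variation holds for any 2-smooth Banach space. Based on the equivalence mentioned earlier, we immediately obtain tail bounds.

Assume $\|\cdot\|$ is 2-smooth. Let $D_R : \mathcal{B}^* \times \mathcal{B}^* \to \mathbb{R}$ be the Bregman divergence with respect to a convex function $R$, which is assumed to be 1-strongly convex on the unit ball $B^*$ of $\mathcal{B}^*$. Denote $R_{\max}^2 \triangleq \sup_{f,g \in B^*} D_R(f,g)$. We extend and improve (4) as follows.

Theorem 1. Let $Z_1,\dots,Z_n$ be a $\mathcal{B}$-valued martingale difference sequence, and let $E_t$ stand for the conditional expectation given $Z_1,\dots,Z_t$. For any $u > 0$, it holds that

$$P\left(\frac{\left\|\sum_{t=1}^n Z_t\right\| - 2.5R_{\max}\left(\sqrt{V} + 1\right)}{\sqrt{V + W + \left(E\sqrt{V + W}\right)^2}} > u\right) \le \sqrt{2}\exp\left\{-u^2/16\right\}, \tag{9}$$

where

$$V = \sum_{t=1}^n \|Z_t\|^2 \quad\text{and}\quad W = \sum_{t=1}^n E_{t-1}\|Z_t\|^2. \tag{10}$$

Furthermore, the bound holds with $W \equiv 0$ if the martingale differences are conditionally symmetric.

In addition to extending the Euclidean result of the previous section to Banach spaces, (9) offers several advantages. First, it is $n$-independent. Second, deviations are self-normalized (that is, scaled by root-variation terms). We refer to Lemma 11 for other forms of probabilistic bounds.

To prove the theorem, we start with a deterministic inequality from [21, Corollary 2]. For completeness, the proof is provided in the Appendix.

Lemma 2. Let $F \subset \mathcal{B}^*$ be a convex set. Define, recursively,

$$\hat y_{t+1} = \hat y_{t+1}(z_1,\dots,z_t) = \operatorname*{argmin}_{f \in F}\ \left\{\eta_t \langle f, z_t\rangle + D_R(f, \hat y_t)\right\} \tag{11}$$

with $\hat y_0 = 0$,

$$\eta_t \triangleq R_{\max}\min\left\{1,\ \left(\sqrt{\sum_{s=1}^{t}\|z_s\|^2} + \sqrt{\sum_{s=1}^{t-1}\|z_s\|^2}\right)^{-1}\right\},$$

and with $R_{\max}^2 \triangleq \sup_{f,g \in F} D_R(f,g)$. Then for any $f \in F$ and any $z_1,\dots,z_n \in \mathcal{B}$,

$$\sum_{t=1}^n \langle \hat y_t - f, z_t\rangle \le 2.5R_{\max}\left(\sqrt{\sum_{t=1}^n \|z_t\|^2} + 1\right).$$

Proof of Theorem 1. We take $F$ to be the unit ball in $\mathcal{B}^*$, ensuring $\|\hat y_t\|_* \le 1$. For any martingale difference sequence $(Z_t)$ with values in $\mathcal{B}$, the above lemma implies, by definition of the norm,

$$\left\|\sum_{t=1}^n Z_t\right\| - 2.5R_{\max}\left(\sqrt{V} + 1\right) \le \sum_{t=1}^n \langle \hat y_t, -Z_t\rangle \tag{12}$$

for all sample paths. Dividing both sides by $\sqrt{V + W + (E\sqrt{V + W})^2}$, we conclude that the left-hand side in (9) is upper bounded by

$$P\left(\frac{\sum_{t=1}^n \langle \hat y_t, -Z_t\rangle}{\sqrt{V + W + \left(E\sqrt{V + W}\right)^2}} > u\right). \tag{13}$$

To control this probability, we recall the following result [8, Theorem 2.7]: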

Theorem 3 ([8]). For a pair of random variables $A, B$, with $B > 0$, such that

$$E\exp\left\{\lambda A - \lambda^2 B^2/2\right\} \le 1 \qquad \forall \lambda \in \mathbb{R}, \tag{14}$$

it holds that

$$P\left(\frac{|A|}{\sqrt{B^2 + (EB)^2}} > u\right) \le \sqrt{2}\exp\left\{-u^2/4\right\}.$$

To apply this theorem, we verify assumption (14):

Lemma 4. The random variables $A = \sum_{t=1}^n \langle \hat y_t, -Z_t\rangle$ and $B^2 = 4\sum_{t=1}^n\left(\|Z_t\|^2 + E_{t-1}\|Z_t\|^2\right)$ satisfy (14).

The proof of the Lemma, as well as most of the proofs in this paper, is postponed to the Appendix. This concludes the proof of Theorem 1.

Let us make several remarks. First, [21, Corollary 2] proves a more general deterministic inequality: for any collection of functions $M_t = M_t(z_1,\dots,z_{t-1})$, there exists a strategy $(\hat y_t)$ such that

$$\forall z_1,\dots,z_n \in \mathcal{B}, \qquad \sum_{t=1}^n \langle \hat y_t - f, z_t\rangle \le 4.5R_{\max}\left(\sqrt{\sum_{t=1}^n \|z_t - M_t\|^2} + 1\right).$$

Second, the reader will notice that the pathwise inequality (12) does not depend on $n$, and the construction of $\hat y_t$ is also oblivious to this value. A simple argument (Lemma 20 in the Appendix) then allows us to lift the real-valued Burkholder-Davis-Gundy inequality (with the constant from [6]) to Banach space valued martingales:

$$E\max_{s=1,\dots,n}\left\|\sum_{t=1}^s Z_t\right\| \le (2.5R_{\max} + 3)\,E\sqrt{V} + 2.5R_{\max}.$$

Notably, the constant in the resulting BDG inequality is proportional to $R_{\max}$. We also remark that Theorem 1 can be naturally extended to $p$-smooth Banach spaces $\mathcal{B}$. This is accomplished in a straightforward manner by extending Lemma 2.

In conclusion, we were able to replace the distribution-independent bound $\sqrt{n}$ with a sequence-dependent quantity $\sqrt{V}$. One may ask whether this phenomenon is general; that is, whether a sequence-dependent variation bound necessarily holds whenever the corresponding distribution-independent bound does. We prove in Theorem 5 below that this is indeed the case (up to a logarithmic factor), a result that holds for general classes of functions.

3 Martingale Type for a General Class of Functions

We now define the analogue of a martingale type for a class $\mathcal{G}$ of real-valued measurable functions on some abstract measurable space $\mathcal{Z}$. To this end, we assume that $(Z_1,\dots,Z_n)$ is a discrete time process on a probability space $(\Omega, \mathcal{A}, P)$.
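Let $E$ denote the expectation on this probability space, and let $E_{t-1}$ denote the conditional (given $Z_1,\dots,Z_{t-1}$) expectation with respect to $Z_t$. For any $g : \mathcal{Z} \to \mathbb{R}$,

$$\sum_{t=1}^n \left(g(Z_t) - E_{t-1}[g(Z_t)]\right) \tag{15}$$

is a sum of martingale differences $g(Z_t) - E_{t-1}[g(Z_t)]$. We let $Z'_1,\dots,Z'_n$ be a tangent sequence; that is, $Z_t$ and $Z'_t$ are independent and identically distributed conditionally on $Z_1,\dots,Z_{t-1}$. Let $E'_{t-1}$ denote the conditional (given $Z_1,\dots,Z_{t-1}$) expectation with respect to $Z'_t$.

Definition 1. A class $\mathcal{G} \subseteq \mathbb{R}^{\mathcal{Z}}$ has martingale type $p$ if there exists a constant $C$ such that

$$E\left[\sup_{g \in \mathcal{G}}\sum_{t=1}^n \left(g(Z_t) - E_{t-1}[g(Z_t)]\right)\right] \le C\,E\left(\sup_{g \in \mathcal{G}}\sum_{t=1}^n E'_{t-1}\left|g(Z_t) - g(Z'_t)\right|^p\right)^{1/p}. \tag{16}$$

Remark 3.1. We conjecture that the statements below also hold for the definition of martingale type where $E'_{t-1}|g(Z_t) - g(Z'_t)|^p$ on the right-hand side of (16) is replaced with a smaller and more natural quantity $|g(Z_t) - E_{t-1}\,g(Z'_t)|^p$.

In proving (16), we shall work with a dyadic filtration $(\mathcal{A}_t)$, $\mathcal{A}_t = \sigma(\epsilon_1,\dots,\epsilon_t)$, generated by independent Rademacher (symmetric $\{\pm 1\}$-valued) random variables $\epsilon_1,\dots,\epsilon_n$. Let $\mathbf{x} = (x_1,\dots,x_n)$ be a predictable process with respect to this filtration (that is, $x_t$ is $\mathcal{A}_{t-1}$-measurable) with values in some set $\mathcal{X}$. Sequential Rademacher complexity (see footnote 2) of an abstract class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ on $\mathbf{x}$ is defined as

$$R_n(\mathcal{F}; \mathbf{x}) = E\sup_{f \in \mathcal{F}}\left|\sum_{t=1}^n \epsilon_t f(x_t)\right|. \tag{17}$$

Definition 2. Let $r \in (1,2]$. We say that sequential Rademacher complexity of $\mathcal{F}$ exhibits an $n^{1/r}$ growth with constant $C$ if

$$\forall n \ge 1,\ \forall \mathbf{x}, \qquad R_n(\mathcal{F}; \mathbf{x}) \le C\,n^{1/r}\sup_{f \in \mathcal{F},\ \epsilon \in \{\pm 1\}^n,\ t \le n}|f(x_t(\epsilon))|. \tag{18}$$

We will work with a particular class of functions $\mathcal{F} = \{f_g : (z,z') \mapsto g(z) - g(z') : g \in \mathcal{G}\}$ defined on $\mathcal{X} = \mathcal{Z} \times \mathcal{Z}$. It is immediate that $\mathcal{F}$ exhibits an $n^{1/r}$ growth whenever $\mathcal{G}$ does, and vice versa, with at most doubling of the constant $C$. Using a sequential symmetrization technique, it holds (see [25]) that

$$E\left[\sup_{g \in \mathcal{G}}\sum_{t=1}^n \left(g(Z_t) - E_{t-1}[g(Z_t)]\right)\right] \le 2\sup_{\mathbf{z}} R_n(\mathcal{G}; \mathbf{z}). \tag{19}$$

Therefore, the statement "$\mathcal{G}$ has martingale type $r$ whenever $\mathcal{G}$ exhibits an $n^{1/r}$ growth" corresponds to the phenomenon that, loosely speaking, one may replace the distribution-independent $n^{1/r}$ bound with a sequence-dependent variation. The next theorem shows a tight connection between the complexity growth $n^{1/r}$ and martingale type.

Theorem 5. For any function class $\mathcal{G} \subseteq \mathbb{R}^{\mathcal{Z}}$, the following statements hold:
1. If for some $r \in (1,2]$ sequential Rademacher complexity exhibits an $n^{1/r}$ growth, then $\mathcal{G}$ has martingale type $p$ for every $p < r$.
2. If $\mathcal{G}$ has martingale type $p$, then the sequential complexity exhibits an $n^{1/p}$ growth.

Footnote 2: This complexity is defined in [25] without the absolute values; this difference is minor (and disappears if $0 \in \mathcal{F}$).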

The proof relies on the development in the next few sections, and especially on Lemma 15. The technique is partly inspired by the work of Burkholder [7] and Pisier [18]. In particular, a key tool is the reverse Hölder principle [19, Prop. 8.53].

In addition to Theorem 5, let us state informal versions of Theorems 17 and 18 which appear, respectively, in Sections 7 and 8. Define the random variables

$$\mathrm{Var}_p = E'\sup_{g \in \mathcal{G}}\sum_{t=1}^n |g(Z_t) - g(Z'_t)|^p, \qquad \mathrm{Var}_p(g) = E'\sum_{t=1}^n |g(Z_t) - g(Z'_t)|^p,$$

where $E'$ is expectation with respect to the tangent sequence, conditionally on $Z_{1:n}$. Then Theorem 17 states that with high probability controlled by $u > 0$,

$$\sup_{g \in \mathcal{G}}\sum_{t=1}^n \left(g(Z_t) - E_{t-1}[g(Z_t)]\right) \lesssim \log(n)\,\mathrm{Var}_r^{1/r} + u\,\mathrm{Var}_2^{1/2}$$

whenever $\mathcal{G}$ exhibits an $n^{1/r}$ growth of sequential Rademacher complexity. Theorem 18 addresses the case of martingale type 2 and states that with high probability controlled by $u > 0$,

$$\sup_{g \in \mathcal{G}}\left\{\sum_{t=1}^n \left(g(Z_t) - E_{t-1}[g(Z_t)]\right) - n^{q/4}\left(\mathrm{Var}_2(g)\right)^{(2-q)/4} - u\,\mathrm{Var}_2^{1/2}(g)\right\} \lesssim 0$$

whenever sequential entropy (defined below) at scale $\alpha$ behaves as $\alpha^{-q}$.

3.1 Other complexity measures

We see that the martingale type of $\mathcal{G}$ is described by the behavior of sequential Rademacher complexity. The latter behavior can, in turn, be quantified in terms of geometric quantities, such as sequential covering numbers and the sequential scale-sensitive dimension. We present the following two definitions from [25], both stated in terms of a predictable process $\mathbf{x} = (x_1,\dots,x_n)$ with respect to the dyadic filtration. It may be beneficial (at least it was for the authors of [25]) to think of $\mathbf{x}$ as a complete binary tree of depth $n$, decorated by elements of $\mathcal{X}$, and of $\epsilon \in \{\pm 1\}^n$ as specifying a path in this tree.

Definition 3 (Sequential covering number). Let $\mathbf{x} = (x_1,\dots,x_n)$ be an $\mathcal{X}$-valued predictable process with respect to the dyadic filtration, and let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$. A collection $V$ of $\mathbb{R}$-valued predictable processes is called an $\alpha$-cover (with respect to $\ell_p$) of $\mathcal{F}$ on $\mathbf{x}$ if

$$\forall f \in \mathcal{F},\ \forall \epsilon \in \{\pm 1\}^n,\ \exists v \in V \ \text{s.t.}\quad \left(\frac{1}{n}\sum_{t=1}^n |f(x_t(\epsilon)) - v_t(\epsilon)|^p\right)^{1/p} \le \alpha. \tag{20}$$

The cardinality of the smallest $\alpha$-cover is denoted by $N_p(\mathcal{F}, \alpha, \mathbf{x})$, and $N_p(\mathcal{F}, \alpha, n) = \sup_{\mathbf{x}} N_p(\mathcal{F}, \alpha, \mathbf{x})$; both are referred to as sequential covering numbers. Sequential entropy is defined as $\log N_p$.

Definition 4 (Sequential fat-shattering dimension). We say that $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ shatters the predictable process $\mathbf{x} = (x_1,\dots,x_n)$ at scale $\alpha > 0$ if there exists a real-valued predictable process $\mathbf{s}$ such that

$$\forall \epsilon \in \{\pm 1\}^n,\ \exists f \in \mathcal{F} \ \text{s.t.}\ \forall t \in [n], \quad \epsilon_t\left(f(x_t(\epsilon)) - s_t(\epsilon)\right) \ge \alpha/2.$$

The largest length $n$ of a shattered predictable process $\mathbf{x}$ is called the sequential fat-shattering dimension at scale $\alpha$ and is denoted $\mathrm{fat}_\alpha(\mathcal{F})$.
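The tree picture of a predictable process makes the sequential Rademacher complexity (17) directly computable for small $n$. The following sketch enumerates all sign paths; it assumes a finite class given as a list of Python callables and a tree given as a function of the sign prefix, and all names in it (`sequential_rademacher`, `binary_search_tree`, `threshold_class`) are illustrative rather than taken from the paper or from [25].

```python
import itertools


def sequential_rademacher(functions, tree, n):
    """Exact value of E sup_f | sum_t eps_t f(x_t(eps_{1:t-1})) | over a finite class.

    `tree(prefix)` returns the element decorating the node reached by the
    sign prefix (eps_1, ..., eps_{t-1}); the root corresponds to the empty tuple.
    """
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        xs = [tree(eps[:t]) for t in range(n)]   # the path selected by eps
        total += max(abs(sum(e * f(x) for e, x in zip(eps, xs))) for f in functions)
    return total / 2 ** n


def binary_search_tree(prefix):
    """Hypothetical [0, 1]-valued tree: each sign halves the current interval."""
    lo, hi = 0.0, 1.0
    for e in prefix:
        mid = (lo + hi) / 2
        if e == 1:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def threshold_class(n):
    """Thresholds x -> +1 if x >= theta else -1, theta on a grid fine enough for depth n."""
    thetas = [(2 * k + 1) / 2 ** (n + 1) for k in range(2 ** n)]
    return [lambda x, th=th: 1.0 if x >= th else -1.0 for th in thetas]


if __name__ == "__main__":
    n = 3
    print(sequential_rademacher(threshold_class(n), binary_search_tree, n))
```

On this particular tree every sign path is matched by some threshold, so the computed value equals $n$; this is one way to see why the worst case over trees, rather than over fixed sequences, matters in the definitions above.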

The sequential covering numbers and the fat-shattering dimension are natural extensions of the classical notions, as shown in [25]. In particular, a Dudley-type entropy integral upper bound in terms of sequential covering numbers holds for sequential Rademacher complexity. The sequential covering numbers, in turn, are upper bounded in terms of the fat-shattering dimensions, in a parallel to the way classical empirical covering numbers are controlled by the scale-sensitive version of the Vapnik-Chervonenkis dimension. We summarize the implications of these relationships in the following corollary:

Corollary 6. For any function class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$,
1. If for some $q > 0$ either $\forall \alpha$, $\log N_2(\mathcal{F}, \alpha, n) \le C\alpha^{-q}$ or $\forall \alpha$, $\mathrm{fat}_\alpha(\mathcal{F}) \le C\alpha^{-q}$, then $\mathcal{F}$ has martingale type $p$ for any $p < \frac{\max\{q,2\}}{\max\{q,2\} - 1}$.
2. If $\mathcal{F}$ has martingale type $r \in (1,2]$ then, for every $p < r$, there exists $C$ such that $\log N_2(\mathcal{F}, \alpha, n) \le C\alpha^{-\frac{p}{p-1}}$ and $\mathrm{fat}_\alpha(\mathcal{F}) \le C\alpha^{-\frac{p}{p-1}}$, for all $\alpha$.

We have established a relation between the martingale type of a function class $\mathcal{F}$ and several sequential complexities of the class. However, unlike our starting point (1) and Theorem 1, our results so far do not quantify the tail behavior for the difference between the supremum of the martingale process and the corresponding variation. A natural idea is to mimic the equivalence argument used in Section 2 to conclude the exponential tail bounds. Unfortunately, the deviation inequalities of the previous section rest on pathwise regret bounds that, in turn, rely on the linear structure of the associated Banach space, as well as on properties such as smoothness and uniform convexity. Without the linear structure, it is not clear whether the analogous pathwise statements hold.

The goal of the rest of the paper is to bring forth some of the tools recently developed within the online learning literature, and to apply these pathwise regret bounds to conclude high probability tail bounds associated to martingale type.
In addition to this goal, we will seek a version of Theorem 5(i) for bounded functions, where the $n^{1/r}$ growth of sequential Rademacher complexity implies martingale type $r$ (rather than any $p < r$), but with an additional $\log(n)$ factor. Our third goal will be to establish per-function variation bounds (similar to the notion of a weak variance [5]). We show that this latter bound is a finer version of the variation term, possible for classes that are not too large.

Our plan is as follows. First, we reduce the problem to one based on the dyadic filtration. After that, we shall introduce certain deterministic inequalities from the online learning literature that are already stated for the dyadic filtration.

4 Symmetrization: dyadic filtration is enough

The purpose of this section is to prove that statements for the dyadic filtration can be lifted to general processes via sequential symmetrization. Consider the martingale

$$M_g = \sum_{t=1}^n \left(g(Z_t) - E[g(Z_t) \mid Z_1,\dots,Z_{t-1}]\right)$$

indexed by $g \in \mathcal{G}$. If $(Z_t)$ is adapted to a dyadic filtration $\mathcal{A}_t = \sigma(\epsilon_1,\dots,\epsilon_t)$, each increment $g(Z_t) - E[g(Z_t) \mid Z_1,\dots,Z_{t-1}]$ takes on the value

$$f_g(x_t(\epsilon_{1:t-1})) \triangleq \left(g(Z_t(\epsilon_{1:t-1}, +1)) - g(Z_t(\epsilon_{1:t-1}, -1))\right)/2$$

or its negation, where $x_t$ is a predictable process with values in $\mathcal{Z} \times \mathcal{Z}$ and $f_g \in \mathcal{F}$ is defined by $(z,z') \mapsto g(z) - g(z')$. In the rest of the paper, we work directly with martingales of the form $M_f = \sum_{t=1}^n \epsilon_t f(x_t(\epsilon))$, indexed by an abstract class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and an abstract $\mathcal{X}$-valued predictable process $\mathbf{x}$.

We extend the symmetrization approach of Panchenko [15] to sequential symmetrization for the case of martingales. In contrast to the more frequently used Giné-Zinn symmetrization proof (via Chebyshev's inequality) [12, 26] that allows a direct tail comparison of the symmetrized and the original processes, Panchenko's approach allows for an indirect comparison. The following immediate extension of [15, Lemma 1] will imply that any $\exp\{-\mu(u)\}$ type tail behavior of the symmetrized process yields the same behavior for the original process.

Lemma 7. Suppose $\xi$ and $\nu$ are random variables and for some $\Gamma \ge 1$ and for all $u \ge 0$

$$P(\nu \ge u) \le \Gamma\exp\{-\mu(u)\}.$$

Let $\mu : \mathbb{R}_+ \to \mathbb{R}_+$ be an increasing differentiable function with $\mu(0) = 0$ and $\mu(\infty) = \infty$. Suppose for all $a \in \mathbb{R}$ and $\phi(x) \triangleq \mu([x - a]_+)$ it holds that $E\phi(\xi) \le E\phi(\nu)$. Then for any $u \ge 0$,

$$P(\xi \ge u) \le \Gamma\exp\{-\mu(u - \mu^{-1}(1))\}.$$

In particular, if $\mu(b) = cb$, we have $P(\xi \ge u) \le \Gamma\exp\{1 - cu\}$; if $\mu(b) = cb^2$, then $P(\xi \ge u) \le \Gamma\exp\{1 - cu^2/4\}$.

As in [15], the lemma will be used with $\xi$ and $\nu$ as functions of a single sample and the double sample, respectively. The expression for the double sample will be symmetrized in order to pass to the dyadic filtration. However, unlike [15], we are dealing with a dependent sequence $Z_1,\dots,Z_n$, and the meaning ascribed to the second sample $Z'_1,\dots,Z'_n$ is that of a tangent sequence. That is, $Z_t, Z'_t$ are independent and have the same distribution conditionally on $Z_1,\dots,Z_{t-1}$. Let $E_{t-1}$ stand for the conditional expectation given $Z_1,\dots,Z_{t-1}$.

Corollary 8. Let $B : \mathcal{G} \times \mathcal{Z}^{2n} \to \mathbb{R}$ be a function that is symmetric with respect to the swap of the $i$-th pair $z_i, z'_i$, for any $i \in [n]$:

$$B(g; z_1, z'_1, \dots, z_i, z'_i, \dots, z_n, z'_n) = B(g; z_1, z'_1, \dots, z'_i, z_i, \dots, z_n, z'_n) \tag{21}$$

for all $g \in \mathcal{G}$.
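Then, under the assumptions of Lemma 7 on $\mu$, a tail behavior

$$\forall (\mathbf{z}, \mathbf{z}'), \qquad P\left(\sup_{g \in \mathcal{G}}\left\{\sum_{t=1}^n \epsilon_t\left(g(z_t) - g(z'_t)\right) - B(g; (z_1,z'_1),\dots,(z_n,z'_n))\right\} > u\right) \le \Gamma\exp\{-\mu(u)\}$$

for all $u > 0$ implies the tail bound

$$P\left(\sup_{g \in \mathcal{G}}\left\{\sum_{t=1}^n \left(g(Z_t) - E_{t-1}g(Z_t)\right) - E_{Z'_{1:n}}B(g; Z_1, Z'_1,\dots,Z_n,Z'_n)\right\} > u\right) \le \Gamma\exp\{-\mu(u - \mu^{-1}(1))\}$$

for any sequence of random variables $Z_1,\dots,Z_n$ and the corresponding tangent sequence $Z'_1,\dots,Z'_n$. Here the pair $(\mathbf{z}, \mathbf{z}')$ ranges over predictable processes with respect to the dyadic filtration. A direct comparison of the expected suprema also holds:

$$E\sup_{g \in \mathcal{G}}\left\{\sum_{t=1}^n \left(g(Z_t) - E_{t-1}g(Z_t)\right) - E_{Z'_{1:n}}B(g; Z_1, Z'_1,\dots,Z_n,Z'_n)\right\} \le \sup_{\mathbf{z},\mathbf{z}'} E\sup_{g \in \mathcal{G}}\left\{\sum_{t=1}^n \epsilon_t\left(g(z_t) - g(z'_t)\right) - B(g; (z_1,z'_1),\dots,(z_n,z'_n))\right\}. \tag{22}$$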
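The two special cases stated at the end of Lemma 7 follow from elementary algebra, and a quick numerical check makes the loss in constants visible. The sketch below takes $\Gamma = 1$ and evaluates $\exp\{-\mu(u - \mu^{-1}(1))\}$ for $u \ge \mu^{-1}(1)$; the function names are illustrative. For $\mu(b) = cb$ the general bound equals $\exp\{1 - cu\}$ exactly, while for $\mu(b) = cb^2$ it is dominated by $\exp\{1 - cu^2/4\}$.

```python
import math


def shifted_tail(mu, mu_inverse_at_one, u):
    """Evaluate exp{-mu(u - mu^{-1}(1))}, the Lemma 7 bound with Gamma = 1.

    Only meaningful for u >= mu^{-1}(1), where the argument of mu is nonnegative.
    """
    return math.exp(-mu(u - mu_inverse_at_one))


def linear_case(c, u):
    """mu(b) = c*b, so mu^{-1}(1) = 1/c; returns (general bound, exp{1 - c*u})."""
    return shifted_tail(lambda b: c * b, 1.0 / c, u), math.exp(1.0 - c * u)


def quadratic_case(c, u):
    """mu(b) = c*b^2, so mu^{-1}(1) = c^{-1/2}; returns (general bound, exp{1 - c*u^2/4})."""
    general = shifted_tail(lambda b: c * b * b, 1.0 / math.sqrt(c), u)
    return general, math.exp(1.0 - c * u * u / 4.0)


if __name__ == "__main__":
    for c, u in [(0.5, 3.0), (2.0, 1.5), (1.0, 4.0)]:
        print(c, u, linear_case(c, u), quadratic_case(c, u))
```

The quadratic comparison reduces to $\tfrac{3}{4}cu^2 - 2\sqrt{c}\,u + 2 \ge 0$, which holds for every $u$ because the discriminant $4c - 6c$ is negative.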

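We conclude that it is enough to prove tail bounds for a supremum $\sup_{f \in \mathcal{F}}\left\{\sum_{t=1}^n \epsilon_t f(x_t) - B(f; x_1,\dots,x_n)\right\}$ of a martingale with respect to the dyadic filtration, offset by a function $B(f; x_1,\dots,x_n)$. This will be achieved with the help of deterministic regret inequalities.

5 Deterministic regret inequalities

5.1 Sequential prediction

We let $y_1,\dots,y_n \in \{\pm 1\}$ and $x_1,\dots,x_n \in \mathcal{X}$ for some abstract measurable set $\mathcal{X}$. Let $\mathcal{F}$ be a class of $[-1,1]$-valued functions on $\mathcal{X}$. Fix a cost function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, convex in the first argument. For a given function $B : \mathcal{F} \times \mathcal{X}^n \to \mathbb{R}$, we aim to construct $\hat y_t = \hat y_t(x_1,\dots,x_t, y_1,\dots,y_{t-1}) \in [-1,1]$ such that

$$\forall (x_t, y_t)_{t=1}^n, \qquad \sum_{t=1}^n \ell(\hat y_t, y_t) \le \inf_{f \in \mathcal{F}}\left\{\sum_{t=1}^n \ell(f(x_t), y_t) + B(f; x_1,\dots,x_n)\right\}. \tag{23}$$

We may view $\hat y_t$ as a prediction of the next value $y_t$ having observed $x_t$ and all the data thus far. In this paper, we focus on the linear loss $\ell(a,b) = -ab/2$ (equivalently, the absolute loss $\tfrac{1}{2}|a - b| = (1 - ab)/2$ when $b \in \{\pm 1\}$ and $a \in [-1,1]$) and the square loss $\ell(a,b) = (a - b)^2$. We equivalently write (23) for the linear cost function as

$$\sup_{f \in \mathcal{F}}\left\{\sum_{t=1}^n y_t f(x_t) - 2B(f; x_1,\dots,x_n)\right\} \le \sum_{t=1}^n y_t \hat y_t, \tag{24}$$

while for the square loss it becomes

$$\sup_{f \in \mathcal{F}}\left\{\sum_{t=1}^n \left(2y_t f(x_t) - f(x_t)^2\right) - B(f; x_1,\dots,x_n)\right\} \le \sum_{t=1}^n \left(2y_t\hat y_t - \hat y_t^2\right). \tag{25}$$

Given a function $B$ and a class $\mathcal{F}$, there are two goals we may consider: (a) certify the existence of $(\hat y_t) \triangleq (\hat y_1,\dots,\hat y_n)$ satisfying the pathwise inequality (23) for all sequences $(x_t, y_t)_{t=1}^n$; or (b) give an explicit construction of $(\hat y_t)$. Both questions have been studied in the online learning literature, but the non-constructive approach will play an especially important role. Indeed, explicit constructions, such as the simple gradient descent update (2), might not be available in more complex situations, yet it is the existence of $(\hat y_t)$ that yields the sought-after tail bounds.

5.2 Existence of strategies

To certify the existence of a strategy $(\hat y_t)$, consider the following object:

$$A(\mathcal{F}, B) = \left\langle\!\!\left\langle \sup_{x_t}\inf_{\hat y_t}\max_{y_t}\right\rangle\!\!\right\rangle_{t=1}^n \left\{\sum_{t=1}^n \ell(\hat y_t, y_t) - \inf_{f \in \mathcal{F}}\left\{\sum_{t=1}^n \ell(f(x_t), y_t) + B(f; x_1,\dots,x_n)\right\}\right\}, \tag{26}$$

where the notation $\langle\!\langle \cdots \rangle\!\rangle_{t=1}^n$ stands for the repeated application of the operators (the outer operators corresponding to $t = 1$).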
The variable $x_t$ ranges over $\mathcal{X}$, $y_t$ is in the set $\{\pm 1\}$, and $\hat y_t$ ranges in $[-1,1]$. It follows that $A(\mathcal{F}, B) \le 0$ is a necessary and sufficient condition for the existence of $(\hat y_t)$ such that (23) holds. Indeed, the optimal choice for $\hat y_1$ is made given $x_1$; the optimal choice for $\hat y_2$ is made given $x_1, y_1, x_2$, and so on. This choice defines the optimal strategy $(\hat y_t)$ (see footnote 3). The other direction is immediate.

Suppose we can find an upper bound on $A(\mathcal{F}, B)$ and then prove that this upper bound is non-positive. This would serve as a sufficient condition for the existence of $(\hat y_t)$. Next, we present such an upper bound for the case when the cost function is linear. More general results for convex Lipschitz cost functions can be found in [9].

As before, let $\epsilon = (\epsilon_1,\dots,\epsilon_n)$ be a sequence of independent Rademacher random variables. Let $\mathbf{x} = (x_1,\dots,x_n)$ and $\mathbf{y} = (y_1,\dots,y_n)$ be predictable processes with respect to the dyadic filtration $\sigma(\epsilon_1,\dots,\epsilon_t)$, with values in $\mathcal{X}$ and $\{\pm 1\}$, respectively. In other words, $x_t = x_t(\epsilon_1,\dots,\epsilon_{t-1}) \in \mathcal{X}$ and $y_t = y_t(\epsilon_1,\dots,\epsilon_{t-1}) \in \{\pm 1\}$ for each $t = 1,\dots,n$.

Lemma 9. For the case of the linear cost function,

$$A(\mathcal{F}, B) \le \sup_{\mathbf{x}} E\sup_{f \in \mathcal{F}}\left[\frac{1}{2}\sum_{t=1}^n \epsilon_t f(x_t) - B(f; x_1,\dots,x_n)\right]. \tag{27}$$

Therefore, whenever it holds that for any predictable process $\mathbf{x} = (x_1,\dots,x_n)$

$$E\sup_{f \in \mathcal{F}}\left[\sum_{t=1}^n \epsilon_t f(x_t) - 2B(f; x_1,\dots,x_n)\right] \le 0, \tag{28}$$

there exists a strategy $(\hat y_t)$ with values

$$|\hat y_t| \le \sup_{f \in \mathcal{F}}|f(x_t)| \tag{29}$$

such that the pathwise inequality (24) holds.

Condition (28) in the previous lemma implies the existence of a strategy for (24). However, there might be situations when (28) can be verified for a function $B(f; \mathbf{x})$ of the predictable process that does not have a corresponding representation in the sense of (24). The next lemma provides a variant of Lemma 9.

Lemma 10. Let $\mathbf{x}$ be an $\mathcal{X}$-valued predictable process with respect to the dyadic filtration. Let the function $B$ map the predictable process $\mathbf{x}$ and a function $f \in \mathcal{F}$ to a real value, with the property

$$\sup_{\mathbf{y}} B(f; \mathbf{x} \circ \mathbf{y}) \le B(f; \mathbf{x}), \tag{30}$$

where $\mathbf{y} = (y_1,\dots,y_n)$ is a $\{\pm 1\}$-valued predictable process, and $(\mathbf{x} \circ \mathbf{y})_t = x_t(y_{2:t}(\epsilon))$.
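If

$$E\sup_{f \in \mathcal{F}}\left[\sum_{t=1}^n \epsilon_t f(x_t) - 2B(f; \mathbf{x})\right] \le 0, \tag{31}$$

then there is a strategy $(\hat y_t)$ with $\hat y_t = \hat y_t(y_1,\dots,y_{t-1})$ and $|\hat y_t| \le \sup_{f \in \mathcal{F}}|f(x_t)|$ such that

$$\forall y_1,\dots,y_n \in \{\pm 1\}, \qquad \sup_{f \in \mathcal{F}}\left\{\sum_{t=1}^n y_t f(x_t(y_1,\dots,y_{t-1})) - 2B(f; \mathbf{x})\right\} \le \sum_{t=1}^n y_t\hat y_t. \tag{32}$$

Footnote 3: If the infima are not achieved, a limiting argument can be employed.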

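6 Amplification and equivalence

We now describe an interesting amplification phenomenon, already presented in the Introduction for the simple Euclidean case. Whenever (28) holds, the deterministic inequality (24) holds, and, therefore, we may apply it to a particular martingale difference sequence to obtain high-probability bounds. Below, we detail this amplification for both linear and square loss functions.

6.1 Linear loss

Take any $\mathcal{X}$-valued predictable process $\mathbf{x} = (x_1,\dots,x_n)$ with respect to the dyadic filtration. The deterministic inequality (24) applied to $x_t = x_t(\epsilon_1,\dots,\epsilon_{t-1})$ and $y_t = \epsilon_t$ becomes

$$\sup_{f \in \mathcal{F}}\left\{\sum_{t=1}^n \epsilon_t f(x_t) - 2B(f; x_1,\dots,x_n)\right\} \le \sum_{t=1}^n \epsilon_t\hat y_t \tag{33}$$

for any $\epsilon$, and thus we have the comparison of tails

$$P\left(\sup_{f \in \mathcal{F}}\left\{\sum_{t=1}^n \epsilon_t f(x_t) - 2B(f; x_1,\dots,x_n)\right\} > u\right) \le P\left(\sum_{t=1}^n \epsilon_t\hat y_t > u\right). \tag{34}$$

Given the boundedness of the increments $\epsilon_t\hat y_t$, the tail bounds follow immediately from the Azuma-Hoeffding inequality or from Freedman's inequality [10]. More precisely, we use the fact that the martingale differences are bounded by $|\hat y_t| \le \sup_{f \in \mathcal{F}}|f(x_t)|$, and conclude

Lemma 11. If there exists a prediction strategy $(\hat y_t)$ that satisfies (24) and (29), then for any predictable process $\mathbf{x}$ the Azuma-Hoeffding inequality implies that

$$P\left(\sup_{f \in \mathcal{F}}\left\{\sum_{t=1}^n \epsilon_t f(x_t) - 2B(f; x_1,\dots,x_n)\right\} > u\right) \le \exp\left(-\frac{u^2}{4\max_{\epsilon}\sum_{t=1}^n \sup_{f \in \mathcal{F}} f(x_t(\epsilon))^2}\right), \tag{35}$$

Freedman's inequality implies

$$P\left(\sup_{f \in \mathcal{F}}\left\{\sum_{t=1}^n \epsilon_t f(x_t) - 2B(f; x_1,\dots,x_n)\right\} > u,\ \sum_{t=1}^n \sup_{f \in \mathcal{F}} f(x_t)^2 \le \sigma^2\right) \le \exp\left(-\frac{u^2}{2\sigma^2 + 2uM/3}\right), \tag{36}$$

where $M = \sup_{f \in \mathcal{F},\ \epsilon \in \{\pm 1\}^n,\ t \le n}|f(x_t(\epsilon))|$, and we also have that for any $\alpha > 0$,

$$P\left(\sup_{f \in \mathcal{F}}\left\{\sum_{t=1}^n \epsilon_t f(x_t) - 2B(f; x_1,\dots,x_n)\right\} - \alpha\sum_{t=1}^n \sup_{f \in \mathcal{F}} f(x_t)^2 > u\right) \le \exp(-2\alpha u). \tag{37}$$

In view of Lemma 9, a sufficient condition for these inequalities is that (28) holds for all $\mathbf{x}$. The same inequalities hold with $B(f; \mathbf{x})$ if the conditions of Lemma 10 are verified for the given $\mathbf{x}$.

Let us emphasize the conclusion of the above lemma: the non-positivity of the expected supremum of a collection of martingales, offset by a function $2B$, implies existence of a regret-minimization strategy, which implies a high-probability tail bound.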
To close the loop, we integrate out the tails, obtaining an in-expectation bound of the form (28), but possibly with a larger $B$ function. This is a more general form of the equivalence promised in the introduction.
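The right-hand side of the tail comparison (34) involves only the real-valued martingale $\sum_t \epsilon_t \hat y_t$, so the bounds used in Lemma 11 can be checked exactly on a small dyadic tree. The following sketch enumerates all sign sequences for a $[-1,1]$-valued predictable strategy; the strategy `tanh_strategy` and the other names are hypothetical stand-ins, since the strategies whose existence Lemma 9 certifies are not constructed explicitly. It checks the textbook Azuma-Hoeffding bound $\exp\{-u^2/(2n)\}$ for increments bounded by one, and the offset bound $\exp(-2\alpha u)$ that underlies (37).

```python
import itertools
import math


def exact_tail(n, strategy, statistic, u):
    """P(statistic(eps, yhat) > u) for uniform eps in {-1, +1}^n, by enumeration.

    `strategy(prefix)` must return a value in [-1, 1] that depends only on
    the past signs, which makes (eps_t * yhat_t) a martingale difference sequence.
    """
    hits = 0
    for eps in itertools.product((-1, 1), repeat=n):
        yhat = [strategy(eps[:t]) for t in range(n)]
        if statistic(eps, yhat) > u:
            hits += 1
    return hits / 2 ** n


def martingale_sum(eps, yhat):
    """sum_t eps_t * yhat_t."""
    return sum(e * y for e, y in zip(eps, yhat))


def offset_sum(alpha):
    """Return the statistic sum_t eps_t * yhat_t - alpha * sum_t yhat_t^2."""
    def statistic(eps, yhat):
        return martingale_sum(eps, yhat) - alpha * sum(y * y for y in yhat)
    return statistic


def tanh_strategy(prefix):
    """Hypothetical predictable strategy with values in (-1, 1)."""
    return math.tanh(sum(prefix))


if __name__ == "__main__":
    n = 8
    for u in (1.0, 2.0, 4.0):
        print(u, exact_tail(n, tanh_strategy, martingale_sum, u), math.exp(-u * u / (2 * n)))
    for alpha in (0.25, 1.0):
        print(alpha, exact_tail(n, tanh_strategy, offset_sum(alpha), 1.0), math.exp(-2 * alpha))
```

The offset bound follows from the fact that $\exp\{\lambda\sum_t \epsilon_t\hat y_t - \tfrac{\lambda^2}{2}\sum_t \hat y_t^2\}$ is a supermartingale, applied with $\lambda = 2\alpha$.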

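The next goal is to find nontrivial functions $B$ such that (28) holds. The most basic $B$ is a constant that depends on the complexity of $\mathcal{F}$, but not on $f$ or the data. Define the worst-case sequential Rademacher averages as

$$R_n(\mathcal{F}) \triangleq \sup_{\mathbf{x}} E\sup_{f \in \mathcal{F}}\left|\sum_{t=1}^n \epsilon_t f(x_t)\right|. \tag{38}$$

Clearly, $B = R_n(\mathcal{F})/2$ satisfies (28). The following is immediate.

Corollary 12. For any $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and an $\mathcal{X}$-valued predictable process $\mathbf{x}$ with respect to the dyadic filtration,

$$P\left(\sup_{f \in \mathcal{F}}\sum_{t=1}^n \epsilon_t f(x_t) > R_n(\mathcal{F}) + u\right) \le \exp\left(-\frac{u^2}{4\max_{\epsilon}\sum_{t=1}^n \sup_{f \in \mathcal{F}} f(x_t(\epsilon))^2}\right). \tag{39}$$

Superficially, (39) looks like a one-sided version of the concentration bound for classical (i.i.d.) Rademacher averages [5]. However, sequential Rademacher averages are not Lipschitz with respect to a flip of a sign, as the whole remaining path may change after a flip.

6.2 Square loss

As for the case of the linear loss function, take any $\mathcal{X}$-valued predictable process $\mathbf{x} = (x_1,\dots,x_n)$ with respect to the dyadic filtration. Fix $\alpha > 0$. The deterministic inequality (25) for $x_t = x_t(\epsilon_1,\dots,\epsilon_{t-1})$ and $y_t = \frac{1}{\alpha}\epsilon_t$ becomes

$$\sup_{f \in \mathcal{F}}\left\{\sum_{t=1}^n \left(\frac{2}{\alpha}\epsilon_t f(x_t) - f^2(x_t)\right) - B(f; x_1,\dots,x_n)\right\} \le \sum_{t=1}^n \left(\frac{2}{\alpha}\epsilon_t\hat y_t - \hat y_t^2\right). \tag{40}$$

As in the proof of (37), we obtain a tail comparison

$$P\left(\sup_{f \in \mathcal{F}}\left\{\sum_{t=1}^n \left(\frac{2}{\alpha}\epsilon_t f(x_t) - f^2(x_t)\right) - B(f; x_1,\dots,x_n)\right\} > \frac{u}{\alpha}\right) \le P\left(\sum_{t=1}^n \left(\frac{2}{\alpha}\epsilon_t\hat y_t - \hat y_t^2\right) > \frac{u}{\alpha}\right) \le \exp\left\{-\frac{\alpha u}{2}\right\}.$$

Once again, the most basic choice for $B$ is the constant that depends on the complexity of the class. We recall the following result from [22].

Lemma 13 ([22]). Let $\kappa > 0$. For any class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$, there exists a prediction strategy $(\hat y_t)$ with values in $[-\kappa, \kappa]$ such that

$$\forall (x_1,y_1),\dots,(x_n,y_n) \in \mathcal{X} \times [-\kappa,\kappa], \qquad \sum_{t=1}^n (\hat y_t - y_t)^2 - \inf_{f \in \mathcal{F}}\sum_{t=1}^n (f(x_t) - y_t)^2 \le R^{\mathrm{off}}_n(\mathcal{F}, \kappa, 1),$$

where, analogously to (38), we define the offset Rademacher complexity

$$R^{\mathrm{off}}_n(\mathcal{F}, c_1, c_2) \triangleq \sup_{\mathbf{x},\boldsymbol{\mu}} E\sup_{f \in \mathcal{F}}\left\{\sum_{t=1}^n 4c_1\epsilon_t\left(f(x_t) - \mu_t\right) - c_2\left(f(x_t) - \mu_t\right)^2\right\}. \tag{41}$$

Here, the supremum is taken over $\mathcal{X}$-valued predictable processes $\mathbf{x} = (x_1,\dots,x_n)$ and $[-\kappa,\kappa]$-valued predictable processes $\boldsymbol{\mu}$, both with respect to the dyadic filtration.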

We conclude that (40) is satisfied with the data-independent constant $B = R^{\mathrm{off}}_n(\mathcal{F}, 1/\alpha, 1)$. Hence, the following analogue of Corollary 12 holds:

Corollary 14. Let $\mathcal{F} \subseteq [-1,1]^{\mathcal{X}}$. For any $\mathcal{X}$-valued predictable process $\mathbf{x}$ with respect to the dyadic filtration and for any $\alpha > 0$, it holds that

$$P\left(\sup_{f \in \mathcal{F}}\left\{\sum_{t=1}^n \left(2\epsilon_t f(x_t) - \alpha f^2(x_t)\right)\right\} - R^{\mathrm{off}}_n(\mathcal{F}, 1, \alpha) > u\right) \le \exp\left\{-\frac{\alpha u}{2}\right\}.$$

To summarize, in Section 5 we presented the machinery of regret inequalities, as well as sufficient conditions for existence of strategies. In the present section we used the pathwise statements, along with real-valued deviation inequalities, to conclude tail bounds, which, in turn, certify existence of regret-minimization strategies. In the next two sections we put these techniques to use.

7 Uniform variation and tail bounds for general martingale type

We now make extensive use of the amplification technique to prove in-probability versions of the martingale type definition. We start by working with dyadic martingales of the form $f \mapsto \sum_{t=1}^n \epsilon_t f(x_t)$, where $\mathbf{x} = (x_1,\dots,x_n)$ is a predictable process (with respect to the dyadic filtration) with values in $\mathcal{X}$. Once the results for these objects are established, we conclude the corresponding statements for general processes of the form (15) via the sequential symmetrization technique summarized in Corollary 8.

As in Section 3, we assume a growth condition $n^{1/r}$ on sequential Rademacher complexity.

Lemma 15. Let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and $r \in (1,2]$. Under the growth assumption (18), for any $p < r$ there exists $K_{r,p} < \infty$ such that

$$E\sup_{f \in \mathcal{F}}\sum_{t=1}^n \epsilon_t f(x_t) \le K_{r,p}\,E\left(\sup_{f \in \mathcal{F}}\sum_{t=1}^n |f(x_t)|^p\right)^{1/p}. \tag{42}$$

Further, if $\mathcal{F} \subseteq [-1,1]^{\mathcal{X}}$ and (18) holds with constant $D/2$, then

$$E\sup_{f \in \mathcal{F}}\sum_{t=1}^n \epsilon_t f(x_t) \le 32D\log n\ E\left(\sup_{f \in \mathcal{F}}\sum_{t=1}^n |f(x_t)|^r\right)^{1/r} + \phi_n, \tag{43}$$

where $\phi_n$ is a negligible term; its defining expression, which involves $64D\log n$ and $D^2\log n$, is not recoverable from this copy of the text.

The second part of the proof of Lemma 15 uses the amplification idea of the previous section.

Using Lemma 9, we can now conclude existence of prediction strategies whose regret is controlled by sequence-dependent variance.
This greatly exteds the scope of available variace-type bouds i the olie learig literature where results i this directio have bee obtaied for either fiite or liear classes. Corollary 16. Let F [ 1, 1] X ad r (1, 2]. If (18) holds with costat D/2, the there exists a predictio strategy (ŷ t ) such that ( ŷ t y t ) if ( f(x t ) y t ) 32 D log 2 () ( for ay sequece of (x t, y t ) (equivaletly, (24) holds). 14 1/r f(x t ) r ) + φ

In addition to being a novel result in the online learning domain, the above corollary serves as an amplification step to boost the in-expectation bound of Lemma 15 to a high-probability statement. We then invoke Corollary 8 and Lemma 21 to prove the following theorem.

Theorem 17. Let $Z_1, \ldots, Z_n$ be a stochastic process with values in $\mathcal{Z}$ and let $Z'_1, \ldots, Z'_n$ be a tangent sequence. Let $\mathcal{G} \subseteq [-1,1]^{\mathcal{Z}}$, $r \in (1, 2]$, and define the $r$-variation as

$$\mathrm{Var}_r = \mathbb{E}_{Z'_{1:n}} \sum_{t=1}^n \sup_{g\in\mathcal{G}} |g(Z_t) - g(Z'_t)|^r. \tag{44}$$

If (18) holds for $\mathcal{G}$ with constant $D/2$, then with probability at least $1 - e \log(n) \exp(-2u^2)$,

$$\sup_{g\in\mathcal{G}} \sum_{t=1}^n \big(g(Z_t) - \mathbb{E}_{t-1} g(Z_t)\big) \le 256 D \log_2(n)\, \mathrm{Var}_r^{1/r} + u\sqrt{8\,\mathrm{Var}_r} + \phi_n.$$

We remark that the tail bound can be viewed as a ratio inequality (see [16, 11]) of the form (9), where the deviations are scaled by the square root of the variance.

8 Finer control via per-function variation

From the point of view of the previous section (and Theorem 5), all classes with sequential Rademacher complexity growth $n^{1/2}$ are treated equally. However, classes with such a growth can be as simple as a set consisting of two functions, or as complex as a set of linear functions indexed by a ball in the infinite-dimensional Hilbert space. In this section, a different complexity measure will be used for the regime when the $n^{1/2}$ growth hides the difference in complexity. This measure will be given by sequential covering numbers (and, as a consequence, by the offset Rademacher complexity). In the regime $\alpha^{-q}$, $q \in [0, 2]$, for the growth of sequential entropy, we exhibit a finer analysis of the variation term that allows part of the variance to be adapted to the function.

Let $q \in (0, 2]$. We say that a class $\mathcal{F} \subseteq [-1,1]^{\mathcal{X}}$ has the $\gamma^{-q}$ growth (as $\gamma$ decreases) of sequential entropy if there is a constant $C$ such that for all $\gamma \in (0, 1]$,

$$\log \mathcal{N}_2(\mathcal{F}, \gamma, n) \le C\gamma^{-q}.$$

As for sequential Rademacher complexity, it is easy to check that the class $\mathcal{G}$ and the derived class of functions $(z, z') \mapsto f(z, z') = g(z) - g(z')$ have the same growth of sequential entropy. Moreover, this growth controls the rate of growth of the offset Rademacher complexity, as shown in [22]. In particular, for the finite function class,

$$\mathcal{R}^{\mathrm{off}}_n(\mathcal{F}, 1, \alpha) \le \frac{8 \log |\mathcal{F}|}{\alpha},$$

while for a parametric class of dimension $d$ (such that $\mathcal{N}_\infty(\mathcal{F}, \gamma, n) \le (Cn/\gamma)^d$ for some $C > 0$),

$$\mathcal{R}^{\mathrm{off}}_n(\mathcal{F}, 1, \alpha) \le \frac{C d \log(n)}{\alpha},$$

and for a class with sequential entropy growth $q \in (0, 2)$,

$$\mathcal{R}^{\mathrm{off}}_n(\mathcal{F}, 1, \alpha) \le C \alpha^{-\frac{2-q}{2+q}}\, n^{\frac{q}{2+q}}$$

for some absolute constant $C$ (the bound gains an extra logarithmic factor at $q = 2$). In this last nonparametric regime, Corollary 14 implies that for any $u > 0$,

$$P\Big(\sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n \Big(\epsilon_t f(x_t) - \frac{\alpha}{2} f(x_t)^2\Big) \Big\} - C \alpha^{-\frac{2-q}{2+q}}\, n^{\frac{q}{2+q}} > u\Big) \le \exp\{-\alpha u\},$$

and the analogous statements hold for the finite and parametric cases. As the next theorem shows, the offset Rademacher complexity $\mathcal{R}^{\mathrm{off}}_n$ brings out (for smaller classes) the finer complexity control obscured by the sequential Rademacher complexity, which only provides $\Omega(n^{1/2})$ bounds.

Theorem 18. Let $Z_1, \ldots, Z_n$ be a discrete-time process with values in $\mathcal{Z}$ and let $Z'_1, \ldots, Z'_n$ be a tangent sequence. Let $\mathcal{G} \subseteq [-1,1]^{\mathcal{Z}}$ and define the function-dependent variance as

$$\mathrm{Var}_2(g) = \mathbb{E}_{Z'_{1:n}} \sum_{t=1}^n \big(g(Z_t) - g(Z'_t)\big)^2. \tag{45}$$

If $\mathcal{G}$ exhibits a $\gamma^{-q}$ growth of sequential entropy, then there exists a constant $C$ such that for any $u > 0$, with probability at least $1 - e \log(n) \exp\{-u^2\}$, for all $g \in \mathcal{G}$,

$$\sum_{t=1}^n \big(g(Z_t) - \mathbb{E}_{t-1} g(Z_t)\big) \le C n^{\frac{q}{4}} \big(\mathrm{Var}_2(g) + 2\big)^{\frac{2-q}{4}} + 2\sqrt{2}\, u \sqrt{\mathrm{Var}_2(g) + 2}. \tag{46}$$

If $\mathcal{G}$ is finite, with the same probability it holds that for all $g \in \mathcal{G}$

$$\sum_{t=1}^n \big(g(Z_t) - \mathbb{E}_{t-1} g(Z_t)\big) \le C \sqrt{\log |\mathcal{G}| \, \big(\mathrm{Var}_2(g) + 2\big)} + 2\sqrt{2}\, u \sqrt{\mathrm{Var}_2(g) + 2}, \tag{47}$$

while for the parametric case,

$$\sum_{t=1}^n \big(g(Z_t) - \mathbb{E}_{t-1} g(Z_t)\big) \le C \sqrt{d \log n \, \big(\mathrm{Var}_2(g) + 2\big)} + 2\sqrt{2}\, u \sqrt{\mathrm{Var}_2(g) + 2}. \tag{48}$$

The finite and parametric cases can be thought of as a $q = 0$ regime. Here, we have a bound that depends on $n$ at most logarithmically. On the other hand, for $q \ge 2$ the term $n^{q/4} (\mathrm{Var}_2(g) + 2)^{(2-q)/4}$ is replaced with $n^{1 - 1/q}$, without any per-function adaptivity (as studied in the previous section). Between these two regimes, we obtain an interpolation, whereby the $n^{1/2}$ power is split into a nonadaptive part $n^{q/4}$ and the adaptive part $(\mathrm{Var}_2(g) + 2)^{(2-q)/4}$. This constitutes a finer analysis of classes with martingale type 2.

We may compare the bound of Theorem 18 in the finite case to the in-expectation bound of [14] in terms of weak variance for i.i.d. zero-mean random variables $Z_1, \ldots, Z_n \in \mathbb{R}^d$:

$$\mathbb{E} \Big[ \max_{j \le d} \sum_{t=1}^n Z_{t,j} \Big] \le \sqrt{2 \ln(2d)}\; \mathbb{E} \sqrt{\max_{j \le d} \sum_{t=1}^n Z_{t,j}^2}.$$

In contrast to this bound, Theorem 18 matches the coordinate $j$ on the left-hand side to the variance of the $j$th coordinate on the right-hand side. Further, our bound holds for martingale difference sequences rather than i.i.d. random vectors. Finally, Theorem 18 holds well beyond the finite case.
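The split of the $n^{1/2}$ rate into a nonadaptive and an adaptive part can be tabulated directly from the exponents in (46). The following sketch is illustrative only: the function names are hypothetical and the constant $C$ is set to 1 as a placeholder, since the source does not give its value.

```python
def variance_split(q):
    """Exponents in the leading term of (46): (power of n, power of Var_2(g) + 2)."""
    if not 0 <= q <= 2:
        raise ValueError("q must lie in [0, 2]")
    return q / 4, (2 - q) / 4


def leading_term(n, var2, q, C=1.0):
    """C * n^(q/4) * (var2 + 2)^((2-q)/4); C = 1.0 is a placeholder constant."""
    nonadaptive, adaptive = variance_split(q)
    return C * n ** nonadaptive * (var2 + 2) ** adaptive


# The two exponents always sum to 1/2: q = 0 is fully adaptive, q = 2 is not adaptive.
for q in (0, 1, 2):
    print(q, variance_split(q))
```

When the variance is of order $n$ the two regimes coincide, and the benefit of small $q$ appears only when $\mathrm{Var}_2(g)$ is much smaller than $n$.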

9 Some Open Questions

The following are a few open-ended questions raised by this work:

1. In the definition of martingale type, can we replace $\mathbb{E}\big(\sum_{t=1}^n \mathbb{E}_{t-1} \sup_{g} |g(Z_t) - g(Z'_t)|^p\big)^{1/p}$ with $\mathbb{E}\big(\sum_{t=1}^n \sup_{g} |g(Z_t) - \mathbb{E}_{t-1}[g]|^p\big)^{1/p}$ and reach the same conclusions? The latter version of variation is closer to the generalization of the martingale type for Banach spaces.

2. If for some $r \in (1, 2]$, sequential Rademacher complexity exhibits $n^{1/r}$ growth rate, then does $\mathcal{G}$ have martingale type $r$? Currently, we only prove martingale type $p$ for any $p < r$. For the case of Banach spaces (linear $g$), the above question is answered in the positive in the work of Pisier [18]. However, the result of [18] relies on the notions of uniform convexity or uniform smoothness, which are specific to linear functionals and Banach spaces.

3. Is it possible to get a mix of uniform and per-function variance for general function classes with martingale type 2? In Section 8, for martingale type 2 we prove a finer control through per-function variance. A natural question is whether one can replace the $n$-dependent part by uniform variance terms, thus giving a mix of per-function and uniform variance in the same bound.

A Proofs

Lemma 19. The update in (2) satisfies

$$\forall z_1, \ldots, z_n \in B, \qquad \sum_{t=1}^n \langle \hat{y}_t - f, z_t \rangle \le \sqrt{n}.$$

Proof of Lemma 19. The following two-line proof is standard. By the property of a projection,

$$\|\hat{y}_{t+1} - f\|^2 = \|\mathrm{Proj}_B(\hat{y}_t - n^{-1/2} z_t) - f\|^2 \le \|(\hat{y}_t - n^{-1/2} z_t) - f\|^2 = \|\hat{y}_t - f\|^2 + \frac{1}{n}\|z_t\|^2 - 2n^{-1/2} \langle \hat{y}_t - f, z_t \rangle.$$

Rearranging,

$$2n^{-1/2} \langle \hat{y}_t - f, z_t \rangle \le \|\hat{y}_t - f\|^2 - \|\hat{y}_{t+1} - f\|^2 + \frac{1}{n}\|z_t\|^2.$$

Summing over $t = 1, \ldots, n$ yields the desired statement.

Lemma 20. With the notation of Lemma 1,

$$\mathbb{E} \max_{s=1,\ldots,n} \Big\| \sum_{t=1}^s Z_t \Big\| \le (2.5 R_{\max} + 3)\, \mathbb{E}\sqrt{V_n} + 2.5 R_{\max}.$$

Proof of Lemma 20. Because of the anytime property of the regret bound and the strategy definition, we can write (12) as

$$\max_{s=1,\ldots,n} \Big\{ \Big\| \sum_{t=1}^s Z_t \Big\| - \sum_{t=1}^s \langle \hat{y}_t, Z_t \rangle \Big\} \le 2.5 R_{\max} \big(\sqrt{V_n} + 1\big) \tag{49}$$

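simply because the right-hand side is largest for $s = n$. Sub-additivity of max implies

$$\max_{s=1,\ldots,n} \Big\| \sum_{t=1}^s Z_t \Big\| \le 2.5 R_{\max} \big(\sqrt{V_n} + 1\big) + \max_{s=1,\ldots,n} \Big| \sum_{t=1}^s \langle \hat{y}_t, Z_t \rangle \Big|. \tag{50}$$

By the Burkholder-Davis-Gundy inequality (with the constant from [6]),

$$\mathbb{E} \max_{s=1,\ldots,n} \Big| \sum_{t=1}^s \langle \hat{y}_t, Z_t \rangle \Big| \le 3\, \mathbb{E} \Big( \sum_{t=1}^n \langle \hat{y}_t, Z_t \rangle^2 \Big)^{1/2} \le 3\, \mathbb{E}\sqrt{V_n}. \tag{51}$$

In view of (49), we conclude the statement.

Proof of Lemma 2. Because of the update form,

$$\forall f \in \mathcal{F}, \qquad \langle \hat{y}_{t+1} - f, z_t \rangle \le \frac{1}{\eta_t} \big( D_R(f, \hat{y}_t) - D_R(f, \hat{y}_{t+1}) - D_R(\hat{y}_{t+1}, \hat{y}_t) \big).$$

Summing over $t = 1, \ldots, n$,

$$\begin{aligned}
\sum_{t=1}^n \langle \hat{y}_{t+1} - f, z_t \rangle
&\le \eta_1^{-1} D_R(f, \hat{y}_1) + \sum_{t=2}^n (\eta_t^{-1} - \eta_{t-1}^{-1}) D_R(f, \hat{y}_t) - \sum_{t=1}^n \eta_t^{-1} D_R(\hat{y}_{t+1}, \hat{y}_t) \\
&\le \eta_1^{-1} R_{\max}^2 + \sum_{t=2}^n (\eta_t^{-1} - \eta_{t-1}^{-1}) R_{\max}^2 - \sum_{t=1}^n \frac{\eta_t^{-1}}{2} \|\hat{y}_{t+1} - \hat{y}_t\|^2 \\
&\le R_{\max}^2 (\eta_1^{-1} + \eta_n^{-1}) - \sum_{t=1}^n \frac{\eta_t^{-1}}{2} \|\hat{y}_{t+1} - \hat{y}_t\|^2,
\end{aligned}$$

where we used strong convexity of $R$ and the fact that $\eta_t$ is nonincreasing. Next, we write

$$\sum_{t=1}^n \langle \hat{y}_t - f, z_t \rangle = \sum_{t=1}^n \langle \hat{y}_{t+1} - f, z_t \rangle + \sum_{t=1}^n \langle \hat{y}_t - \hat{y}_{t+1}, z_t \rangle$$

and upper bound the second term by noting that

$$\langle \hat{y}_t - \hat{y}_{t+1}, z_t \rangle \le \|\hat{y}_t - \hat{y}_{t+1}\| \, \|z_t\| \le \frac{\eta_t^{-1}}{2} \|\hat{y}_t - \hat{y}_{t+1}\|^2 + \frac{\eta_t}{2} \|z_t\|^2.$$

Combining the bounds,

$$\sum_{t=1}^n \langle \hat{y}_t - f, z_t \rangle \le R_{\max}^2 (\eta_1^{-1} + \eta_n^{-1}) + \sum_{t=1}^n \frac{\eta_t}{2} \|z_t\|^2. \tag{52}$$

Now observe that

$$\eta_t = R_{\max} \min\Bigg\{ 1, \frac{\sqrt{\sum_{s=1}^t \|z_s\|^2} - \sqrt{\sum_{s=1}^{t-1} \|z_s\|^2}}{\|z_t\|^2} \Bigg\} \tag{53}$$

and thus the second term in (52) is upper bounded as

$$\sum_{t=1}^n \frac{\eta_t}{2} \|z_t\|^2 \le \frac{R_{\max}}{2} \sqrt{\sum_{s=1}^n \|z_s\|^2}.$$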

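For the first term, we use $\eta_1^{-1} = R_{\max}^{-1}$ and

$$\eta_n^{-1} \le R_{\max}^{-1} \max\Bigg\{ 1, 2\sqrt{\sum_{s=1}^n \|z_s\|^2} \Bigg\}.$$

Concluding,

$$\sum_{t=1}^n \langle \hat{y}_t - f, z_t \rangle \le 2.5 R_{\max} \Bigg( \sqrt{\sum_{s=1}^n \|z_s\|^2} + 1 \Bigg). \tag{54}$$

Proof of Lemma 4. We have

$$\begin{aligned}
\mathbb{E}_{t-1} \exp\{\lambda A - \lambda^2 B^2/2\}
&= \mathbb{E}_{t-1} \exp\big\{ \lambda \langle \hat{y}_t, Z_t - \mathbb{E}_{t-1} Z'_t \rangle - 2\lambda^2 (\|Z_t\|^2 + \mathbb{E}_{t-1} \|Z'_t\|^2) \big\} \\
&\le \mathbb{E}_{t-1} \exp\big\{ \lambda \langle \hat{y}_t, Z_t - Z'_t \rangle - 2\lambda^2 (\|Z_t\|^2 + \|Z'_t\|^2) \big\} \\
&= \mathbb{E}_{t-1} \mathbb{E}_{\epsilon} \exp\big\{ \lambda \epsilon_t \langle \hat{y}_t, Z_t - Z'_t \rangle - 2\lambda^2 (\|Z_t\|^2 + \|Z'_t\|^2) \big\}.
\end{aligned}$$

Since $\exp$ is a convex function,

$$\begin{aligned}
&\mathbb{E}_{t-1} \mathbb{E}_{\epsilon} \exp\Big\{ \tfrac{1}{2} \big(2\lambda \epsilon_t \langle \hat{y}_t, Z_t \rangle - 4\lambda^2 \|Z_t\|^2\big) + \tfrac{1}{2} \big(-2\lambda \epsilon_t \langle \hat{y}_t, Z'_t \rangle - 4\lambda^2 \|Z'_t\|^2\big) \Big\} \\
&\quad\le \tfrac{1}{2} \mathbb{E}_{t-1} \mathbb{E}_{\epsilon} \exp\big\{ 2\lambda \epsilon_t \langle \hat{y}_t, Z_t \rangle - 4\lambda^2 \|Z_t\|^2 \big\} + \tfrac{1}{2} \mathbb{E}_{t-1} \mathbb{E}_{\epsilon} \exp\big\{ -2\lambda \epsilon_t \langle \hat{y}_t, Z'_t \rangle - 4\lambda^2 \|Z'_t\|^2 \big\} \\
&\quad= \mathbb{E}_{t-1} \mathbb{E}_{\epsilon} \exp\big\{ 2\lambda \epsilon_t \langle \hat{y}_t, Z_t \rangle - 4\lambda^2 \|Z_t\|^2 \big\} \\
&\quad\le \mathbb{E}_{t-1} \exp\big\{ 4\lambda^2 \langle \hat{y}_t, Z_t \rangle^2 - 4\lambda^2 \|Z_t\|^2 \big\} \le 1
\end{aligned}$$

since $\|\hat{y}_t\| \le 1$.

Proof of Theorem 5. Let $Z_1, \ldots, Z_n$ be a discrete-time process. We have

$$\begin{aligned}
&\mathbb{E} \Big[ \sup_{g\in\mathcal{G}} \Big\{ \sum_{t=1}^n \big(g(Z_t) - \mathbb{E}_{t-1}[g(Z_t)]\big) \Big\} - C \Big( \sum_{t=1}^n \mathbb{E}_{t-1} \sup_{g\in\mathcal{G}} |g(Z_t) - g(Z'_t)|^p \Big)^{1/p} \Big] \\
&\quad\le \Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{Z_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{g\in\mathcal{G}} \Big\{ \sum_{t=1}^n \big(g(Z_t) - \mathbb{E}_{Z'_t \sim p_t}[g(Z'_t)]\big) \Big\} - C \Big( \sum_{t=1}^n \mathbb{E}_{Z'_t \sim p_t} \sup_{g\in\mathcal{G}} |g(Z_t) - g(Z'_t)|^p \Big)^{1/p} \Big]
\end{aligned}$$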

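where $\big\langle\!\big\langle \sup_{p_t} \mathbb{E}_{Z_t \sim p_t} \big\rangle\!\big\rangle_{t=1}^n$ stands for repeated application of the operators: $\sup_{p_1} \mathbb{E}_{Z_1 \sim p_1} \cdots \sup_{p_n} \mathbb{E}_{Z_n \sim p_n}$. By Jensen's inequality, we upper bound the above expression by

$$\Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{Z_t, Z'_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{g\in\mathcal{G}} \Big\{ \sum_{t=1}^n \big(g(Z_t) - g(Z'_t)\big) \Big\} - C \Big( \sum_{t=1}^n \sup_{g\in\mathcal{G}} |g(Z_t) - g(Z'_t)|^p \Big)^{1/p} \Big].$$

Introducing independent Rademacher random variables $\epsilon_1, \ldots, \epsilon_n$, the preceding expression is equal to

$$\begin{aligned}
&\Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{Z_t, Z'_t \sim p_t} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{g\in\mathcal{G}} \Big\{ \sum_{t=1}^n \epsilon_t \big(g(Z_t) - g(Z'_t)\big) \Big\} - C \Big( \sum_{t=1}^n \sup_{g\in\mathcal{G}} |g(Z_t) - g(Z'_t)|^p \Big)^{1/p} \Big] \\
&\quad\le \Big\langle\!\!\Big\langle \sup_{z_t, z'_t} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{g\in\mathcal{G}} \Big\{ \sum_{t=1}^n \epsilon_t \big(g(z_t) - g(z'_t)\big) \Big\} - C \Big( \sum_{t=1}^n \sup_{g\in\mathcal{G}} |g(z_t) - g(z'_t)|^p \Big)^{1/p} \Big].
\end{aligned}$$

The latter expression may be written as

$$\sup_{z, z'} \mathbb{E}_{\epsilon} \Big[ \sup_{g\in\mathcal{G}} \Big\{ \sum_{t=1}^n \epsilon_t \big(g(z_t) - g(z'_t)\big) \Big\} - C \Big( \sum_{t=1}^n \sup_{g\in\mathcal{G}} |g(z_t) - g(z'_t)|^p \Big)^{1/p} \Big] \tag{55}$$

with a supremum ranging over predictable processes $z = (z_1, \ldots, z_n)$ and $z' = (z'_1, \ldots, z'_n)$, each $z_t, z'_t : \{\pm 1\}^{t-1} \to \mathcal{Z}$. Now define the function class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{Z} \times \mathcal{Z}}$ as follows:

$$\mathcal{F} = \{ (z, z') \mapsto g(z) - g(z') : g \in \mathcal{G} \}.$$

Trivially, (55) can be written with this notation as

$$\sup_{z, z'} \mathbb{E}_{\epsilon} \Big[ \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t f(z_t, z'_t) \Big\} - C \Big( \sum_{t=1}^n \sup_{f\in\mathcal{F}} |f(z_t, z'_t)|^p \Big)^{1/p} \Big].$$

However, the complexity of $\mathcal{F}$ is not much larger than that of $\mathcal{G}$:

$$\mathcal{R}_n(\mathcal{F}; (z, z')) = \mathbb{E} \sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t f(z_t, z'_t) = \mathbb{E} \sup_{g\in\mathcal{G}} \sum_{t=1}^n \epsilon_t \big(g(z_t) - g(z'_t)\big) \le \mathcal{R}_n(\mathcal{G}; z) + \mathcal{R}_n(\mathcal{G}; z').$$

The first part of the theorem is concluded by applying Lemma 15 to the class $\mathcal{F}$.

To prove the second part, we modify the lower bound construction in [25, Theorem 2]. Assume that we are given a predictable process $x$ of length $n$ and that $x_0$ is any one of the $2^n - 1$ values in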

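the image of $x$. Since $\mathbb{E}\big[|\sum_{t=1}^n \epsilon_t|\big] \le \sqrt{n}$, we have that

$$\begin{aligned}
\mathcal{R}_n(\mathcal{F}; x) &= \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t f(x_t) \Big] \\
&\le \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t f(x_t) - \sum_{t=1}^n \epsilon_t f(x_0) \Big\} \Big] + \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t f(x_0) \Big] \\
&\le \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t \big(f(x_t) - f(x_0)\big) \Big] + \sup_{f\in\mathcal{F}} |f(x_0)| \; \mathbb{E} \Big[ \Big| \sum_{t=1}^n \epsilon_t \Big| \Big] \\
&\le \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t \big(f(x_t) - f(x_0)\big) \Big] + \sqrt{n} \sup_{f\in\mathcal{F}} |f(x_0)| \\
&= 2\, \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \Big( \frac{f(x_t) + f(x_0)}{2} - \frac{1 - \epsilon_t}{2} f(x_t) - \frac{1 + \epsilon_t}{2} f(x_0) \Big) \Big] + \sqrt{n} \sup_{f\in\mathcal{F}} |f(x_0)|.
\end{aligned}$$

Now consider the joint distribution over $X_1, \ldots, X_n$ such that, for every $t \in [n]$, $P(X_t = x_0 \mid \epsilon_{1:t-1}, \epsilon_t = 1) = 1$ and $P(X_t = x_t(\epsilon_{1:t-1}) \mid \epsilon_{1:t-1}, \epsilon_t = -1) = 1$. Under this distribution, we can rewrite the above inequality as

$$\mathcal{R}_n(\mathcal{F}; x) \le 2\, \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \big( \mathbb{E}_{t-1}[f(X_t)] - f(X_t) \big) \Big] + \sqrt{n} \sup_{f\in\mathcal{F}} |f(x_0)|.$$

Since $\mathcal{F}$ is of type $r$, taking $\epsilon'_1, \ldots, \epsilon'_n$ to be an independent Rademacher sequence, we further bound the above term as

$$\begin{aligned}
&2C\, \mathbb{E} \Big( \sum_{t=1}^n \mathbb{E}_{t-1} \sup_{f\in\mathcal{F}} |f(X_t) - f(X'_t)|^r \Big)^{1/r} + \sqrt{n} \sup_{f\in\mathcal{F}} |f(x_0)| \\
&\quad= 2C\, \mathbb{E} \Big( \sum_{t=1}^n \mathbb{E}_{\epsilon'_t} \sup_{f\in\mathcal{F}} \Big| \frac{1 - \epsilon_t}{2} f(x_t) + \frac{1 + \epsilon_t}{2} f(x_0) - \frac{1 - \epsilon'_t}{2} f(x_t) - \frac{1 + \epsilon'_t}{2} f(x_0) \Big|^r \Big)^{1/r} + \sqrt{n} \sup_{f\in\mathcal{F}} |f(x_0)| \\
&\quad\le 4C\, n^{1/r} \sup_{f\in\mathcal{F},\, t,\, \epsilon\in\{\pm 1\}^n} \big( |f(x_t)|^r + |f(x_0)|^r \big)^{1/r} + \sqrt{n} \sup_{f\in\mathcal{F}} |f(x_0)|.
\end{aligned}$$

Since $x_0$ is one of the elements of the tree $x$, we further upper bound the expression by

$$8C\, n^{1/r} \sup_{f\in\mathcal{F},\, t,\, \epsilon\in\{\pm 1\}^n} |f(x_t)| + \sqrt{n} \sup_{f\in\mathcal{F}} |f(x_0)| \le 16C\, n^{1/r} \sup_{f\in\mathcal{F},\, t,\, \epsilon\in\{\pm 1\}^n} |f(x_t)|.$$

In the last step we used the fact that $r \le 2$ and so $\sqrt{n} \le n^{1/r}$.

Proof of Lemma 7. We have

$$P(\xi \ge u) \le \frac{\mathbb{E}\phi(\xi)}{\phi(u)} \le \frac{\mathbb{E}\phi(\nu)}{\phi(u)} \le \frac{1}{\phi(u)} \Big( \phi(0) + \int_0^\infty \phi'(x) P(\nu \ge x)\, dx \Big).$$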

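Choose $a = u - \mu^{-1}(1)$, where $\mu^{-1}$ is the inverse function. If $a < 0$, the conclusion of the lemma is true since $\Gamma \ge 1$. In the case of $a \ge 0$, we have $\phi(0) = 0$. The above upper bound becomes

$$\begin{aligned}
P(\xi \ge u) &\le \frac{\Gamma}{\phi(u)} \int_0^\infty \phi'(x) \exp\{-\mu(x)\}\, dx = \frac{\Gamma}{\phi(u)} \int_a^\infty \mu'(x) \exp\{-\mu(x)\}\, dx \\
&\le \frac{\Gamma}{\mu(u - a)} \big[ -\exp\{-\mu(x)\} \big]_a^\infty = \Gamma \exp\{-\mu(a)\} = \Gamma \exp\{-\mu(u - \mu^{-1}(1))\}.
\end{aligned}$$

If $\mu(b) = cb$, we have

$$P(\xi \ge u) \le \Gamma \exp\{-c(u - 1/c)\} = \Gamma \exp\{1 - cu\}.$$

If $\mu(b) = cb^2$, we have

$$P(\xi \ge u) \le \Gamma \exp\{-c(u - 1/\sqrt{c})^2\} \le \Gamma \exp\{-cu^2/4\}$$

whenever $u \ge 2/\sqrt{c}$. If $u \le 2/\sqrt{c}$, the conclusion is valid since $\Gamma \ge 1$.

Proof of Corollary 8. Let

$$\xi(Z_1, \ldots, Z_n, Z'_1, \ldots, Z'_n) = \sup_{g\in\mathcal{G}} \Big\{ \sum_{t=1}^n \big(g(Z_t) - g(Z'_t)\big) - B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) \Big\}$$

and

$$\nu(Z_1, \ldots, Z_n) = \sup_{g\in\mathcal{G}} \Big\{ \sum_{t=1}^n \big(g(Z_t) - \mathbb{E}_{t-1} g(Z'_t)\big) - \mathbb{E}_{Z'_{1:n}} B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) \Big\}.$$

Then for any convex $\phi : \mathbb{R} \to \mathbb{R}$,

$$\mathbb{E}\phi(\nu) \le \mathbb{E}\phi(\xi)$$

using convexity of the supremum. The problem is now reduced to obtaining tail bounds for

$$P\Big( \sup_{g\in\mathcal{G}} \Big\{ \sum_{t=1}^n \big(g(Z_t) - g(Z'_t)\big) - B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) \Big\} > u \Big).$$

Write the probability as $\mathbb{E}\, \mathbf{I}\{\xi(Z_1, \ldots, Z_n, Z'_1, \ldots, Z'_n) > u\}$. We now proceed to replace the random variables from $n$ backwards with a dyadic filtration. Let us start with the last index. Renaming $Z_n$ and $Z'_n$, we see that

$$\begin{aligned}
&\mathbb{E}\, \mathbf{I}\Big\{ \sup_{g\in\mathcal{G}} \sum_{t=1}^n \big(g(Z_t) - g(Z'_t)\big) - B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) > u \Big\} \\
&\quad= \mathbb{E}\, \mathbf{I}\Big\{ \sup_{g\in\mathcal{G}} \sum_{t=1}^{n-1} \big(g(Z_t) - g(Z'_t)\big) + \big(g(Z_n) - g(Z'_n)\big) - B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) > u \Big\} \\
&\quad= \mathbb{E}\, \mathbb{E}_{\epsilon_n} \mathbf{I}\Big\{ \sup_{g\in\mathcal{G}} \sum_{t=1}^{n-1} \big(g(Z_t) - g(Z'_t)\big) + \epsilon_n \big(g(Z_n) - g(Z'_n)\big) - B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) > u \Big\} \\
&\quad\le \mathbb{E} \sup_{z_n, z'_n} \mathbb{E}_{\epsilon_n} \mathbf{I}\Big\{ \sup_{g\in\mathcal{G}} \sum_{t=1}^{n-1} \big(g(Z_t) - g(Z'_t)\big) + \epsilon_n \big(g(z_n) - g(z'_n)\big) - B(g; Z_1, Z'_1, \ldots, Z_{n-1}, Z'_{n-1}, z_n, z'_n) > u \Big\}.
\end{aligned}$$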

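Proceeding in this manner for step $n - 1$ and back to $t = 1$, we obtain an upper bound of

$$\begin{aligned}
&\sup_{z_1, z'_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{z_n, z'_n} \mathbb{E}_{\epsilon_n} \mathbf{I}\Big\{ \sup_{g\in\mathcal{G}} \sum_{t=1}^n \epsilon_t \big(g(z_t) - g(z'_t)\big) - B(g; z_1, z'_1, \ldots, z_n, z'_n) > u \Big\} \\
&\quad= \sup_{x} \mathbb{E}\, \mathbf{I}\Big\{ \sup_{g\in\mathcal{G}} \sum_{t=1}^n \epsilon_t f_g(x_t) - B(g; x_1, \ldots, x_n) > u \Big\}.
\end{aligned}$$

Proof of Lemma 10. To check the desired statement (32) for the given predictable process $x$, we verify that

$$\Big\langle\!\!\Big\langle \inf_{\hat{y}_t} \max_{y_t \in \{\pm 1\}} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sum_{t=1}^n -\hat{y}_t y_t - \inf_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n -f(x_t(y_{1:t-1}))\, y_t + B(f; x) \Big\} \Big] \le 0 \tag{56}$$

where each infimum is taken over the set $\{ \hat{y}_t : |\hat{y}_t| \le \sup_{f\in\mathcal{F}} |f(x_t(y_{1:t-1}))| \}$. To this end,

$$\begin{aligned}
&2 \Big\langle\!\!\Big\langle \inf_{\hat{y}_t} \max_{y_t \in \{\pm 1\}} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sum_{t=1}^n -\hat{y}_t y_t - \inf_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n -f(x_t(y_{1:t-1}))\, y_t + B(f; x) \Big\} \Big] \\
&\quad= \Big\langle\!\!\Big\langle \inf_{\hat{y}_t} \max_{y_t \in \{\pm 1\}} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sum_{t=1}^n (-\hat{y}_t y_t) - \inf_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n \big(-f(x_t(y_{1:t-1}))\, y_t\big) + 2B(f; x) \Big\} \Big] \\
&\quad= \Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{y_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sum_{t=1}^n \inf_{\hat{y}_t} \{ -\hat{y}_t\, \mathbb{E}[y_t] \} - \inf_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n -f(x_t(y_{1:t-1}))\, y_t + 2B(f; x) \Big\} \Big]
\end{aligned}$$

where $p_t$ ranges over distributions on $\{\pm 1\}$. In the last step, we have used the minimax theorem, and then the technique that can be found, for instance, in [1, 24]. Next, we replace the infima by (sub)optimal choices corresponding to the value of $f$. This yields

$$\begin{aligned}
&\Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{y_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n f(x_t(y_{1:t-1}))\, y_t + \sum_{t=1}^n \inf_{\hat{y}_t} \{ -\hat{y}_t\, \mathbb{E}[y_t] \} - 2B(f; x) \Big\} \Big] \\
&\quad\le \Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{y_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n f(x_t(y_{1:t-1})) \big(y_t - \mathbb{E}[y_t]\big) - 2B(f; x) \Big\} \Big] \\
&\quad\le \Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{y_t, y'_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n f(x_t(y_{1:t-1})) (y_t - y'_t) - 2B(f; x) \Big\} \Big]
\end{aligned}$$

where the last step is by Jensen's inequality. We further upper bound the above expression by

$$\Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{y_t, y'_t \sim p_t} \max_{y''_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n f(x_t(y''_{1:t-1})) (y_t - y'_t) - 2B(f; x) \Big\} \Big]$$

where $y''_t$ ranges over $\{\pm 1\}$. Since $y_t, y'_t$ can be renamed, we introduce the random signs:

$$\begin{aligned}
&\Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{y_t, y'_t \sim p_t} \mathbb{E}_{\epsilon_t} \max_{y''_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t (y_t - y'_t) f(x_t(y''_{1:t-1})) - 2B(f; x) \Big\} \Big] \\
&\quad\le \Big\langle\!\!\Big\langle \max_{y_t, y'_t} \mathbb{E}_{\epsilon_t} \max_{y''_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t (y_t - y'_t) f(x_t(y''_{1:t-1})) - 2B(f; x) \Big\} \Big] \\
&\quad\le \Big\langle\!\!\Big\langle \max_{b_t \in \{\pm 1\}} \mathbb{E}_{\epsilon_t} \max_{y''_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n 2 \epsilon_t b_t f(x_t(y''_{1:t-1})) - 2B(f; x) \Big\} \Big].
\end{aligned}$$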

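Since $b_t \epsilon_t$ has the same distribution as $\epsilon_t$ for any $b_t \in \{\pm 1\}$, we write the above expression as

$$\Big\langle\!\!\Big\langle \mathbb{E}_{\epsilon_t} \max_{y''_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n 2 \epsilon_t f(x_t(y''_{1:t-1})) - 2B(f; x) \Big\} \Big].$$

To be consistent with the notation of predictable processes, we shift the numbering on $y$ by one:

$$\begin{aligned}
\Big\langle\!\!\Big\langle \mathbb{E}_{\epsilon_t} \max_{y_{t+1}} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n 2 \epsilon_t f(x_t(y_{2:t})) - 2B(f; x) \Big\} \Big]
&= \sup_{y} \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n 2 \epsilon_t f(x_t(y_{2:t}(\epsilon))) - 2B(f; x) \Big] \\
&= \sup_{y} \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n 2 \epsilon_t f(x_t(y_{2:t}(\epsilon))) - 2B(f; x \circ y) \Big]
\end{aligned}$$

by (30). The last quantity is nonpositive by (31).

Proof of Lemma 11. The first two statements are immediate from the discussion preceding the lemma. For the third statement, we have that

$$P\Big( \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t f(x_t) - 2B(f; x_1, \ldots, x_n) \Big\} - \alpha \sum_{t=1}^n \sup_{f\in\mathcal{F}} f(x_t)^2 > u \Big) \le P\Big( \sum_{t=1}^n \hat{y}_t \epsilon_t - \alpha \sum_{t=1}^n \sup_{f\in\mathcal{F}} f(x_t)^2 > u \Big)$$

and the latter probability is further upper bounded by

$$\begin{aligned}
P\Big( \sum_{t=1}^n \big(\epsilon_t \hat{y}_t - \alpha \hat{y}_t^2\big) > u \Big)
&\le \inf_{\lambda > 0} \mathbb{E} \Big[ \exp\Big\{ \sum_{t=1}^n \big(\lambda \epsilon_t \hat{y}_t - \alpha \lambda \hat{y}_t^2\big) - \lambda u \Big\} \Big] \\
&\le \inf_{\lambda > 0} \max_{\hat{y}_1, \ldots, \hat{y}_n \in [-1, 1]} \exp\Big\{ \sum_{t=1}^n \big(\lambda^2 \hat{y}_t^2 / 2 - \alpha \lambda \hat{y}_t^2\big) - \lambda u \Big\} \le \exp\{-2\alpha u\}.
\end{aligned}$$

Lemma 21. Suppose we have a collection of random variables $(X(g), Y(g))_{g\in\mathcal{G}}$, with $0 \le Y(g) \le b$ almost surely for any $g \in \mathcal{G}$. Suppose for all $\alpha > 0$, $c > 0$, and some $0 \le a \le 1$ and $K \ge 0$ it holds that

$$P\Big( \sup_{g\in\mathcal{G}} \big\{ X(g) - \alpha^{-a} K - \alpha Y(g) \big\} > u \Big) \le \Gamma \exp\{-c\alpha u\}.$$

Then

$$P\Big( \sup_{g\in\mathcal{G}} \Big\{ X(g) - 4K^{\frac{1}{1+a}} (Y(g) + 1)^{\frac{a}{a+1}} - 4u\sqrt{Y(g) + 1} \Big\} > 0 \Big) \le \log(b)\, \Gamma \exp\{-cu^2\}.$$

Proof of Lemma 21. Fix $u > 0$ and consider a discretization over two regions $[d'_l, d'_u]$, $[d''_l, d''_u]$, given by $\alpha'_i = d'_l 2^{i-1}$, $i \in \{1, \ldots, N' = \log(d'_u / d'_l)\}$, and $\alpha''_j = d''_l 2^{j-1}$, $j \in \{1, \ldots, N'' = \log(d''_u / d''_l)\}$. Let $N = N' + N''$ be the total cardinality of the discretization, and let $I$ denote the discretized set. From our premise, we have that for every index $i$ and $t > 0$,

$$P\Big( \sup_{g\in\mathcal{G}} \big\{ X(g) - \alpha_i^{-a} K - \alpha_i Y(g) \big\} > t \Big) \le \Gamma \exp\{-c\alpha_i t\}.$$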

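Substituting $t = u\alpha_i^{-1} + \alpha_i$,

$$P\Big( \sup_{g\in\mathcal{G}} \big\{ X(g) - \alpha_i^{-a} K - \alpha_i Y(g) \big\} - \alpha_i > u\alpha_i^{-1} \Big) \le \Gamma \exp\{-cu - c\alpha_i^2\}.$$

By the union bound we conclude that

$$P\Big( \max_{\alpha_i \in I} \sup_{g\in\mathcal{G}} \big\{ X(g) - \alpha_i^{-a} K - \alpha_i Y(g) - \alpha_i - u\alpha_i^{-1} \big\} > 0 \Big) \le \sum_{\alpha_i \in I} \Gamma \exp\{-cu - c\alpha_i^2\} \le N \Gamma \exp\{-cu\}.$$

Therefore,

$$P\Big( \sup_{g\in\mathcal{G},\, \alpha} \big\{ X(g) - 2\alpha(Y(g) + 1) - \alpha^{-a} K - u\alpha^{-1} \big\} > 0 \Big) \le N \Gamma \exp\{-cu\}$$

with $\alpha$ taking values in $[d'_l, d'_u] \cup [d''_l, d''_u]$. However,

$$\inf_{\alpha} \big\{ 2\alpha(Y(g) + 1) + \alpha^{-a} K + u\alpha^{-1} \big\} \le \inf_{\alpha} \big\{ 2\alpha(Y(g) + 1) + 2\max\{\alpha^{-a} K, u\alpha^{-1}\} \big\}.$$

Passing to two balancing choices

$$\alpha' = \sqrt{\frac{u}{Y(g) + 1}}, \qquad \alpha'' = \Big( \frac{K}{Y(g) + 1} \Big)^{1/(a+1)},$$

we obtain

$$P\Big( \sup_{g\in\mathcal{G}} \Big\{ X(g) - 4\max\Big\{ \sqrt{(Y(g) + 1)u},\; K^{1/(a+1)} (Y(g) + 1)^{a/(a+1)} \Big\} \Big\} > 0 \Big) \le N \Gamma \exp\{-cu\}.$$

It remains to quantify $N$ such that, for any $g \in \mathcal{G}$, the choices of $\alpha', \alpha''$ are captured by the two corresponding regions of discretization. It is immediate that $N$ does not depend on $u$ or $K$, and depends logarithmically on $b$. This concludes the proof.

In the proofs, it is useful to work with a growth assumption (57), defined below, that is equivalent to (18).

Lemma 22. Suppose sequential Rademacher complexity exhibits an $n^{1/r}$ growth with constant $D/2 > 0$, in the sense of (18). Then the following holds for any $\{0, 1\}$-valued predictable process $b$ and any $\mathcal{X}$-valued predictable process $x$ (both with respect to the dyadic filtration):

$$\mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t b_t f(x_t) \Big] \le D \Big( \max_{\epsilon} \sum_{t=1}^n b_t \Big)^{1/r} \sup_{f\in\mathcal{F},\, t\in[n],\, \epsilon\in\{\pm 1\}^n} b_t |f(x_t)|. \tag{57}$$

Proof of Lemma 22. For any $f \in \mathcal{F}$,

$$\sum_{t=1}^n \epsilon_t b_t f(x_t) = \sum_{i=1}^N \epsilon_{\tau_i} f(x_{\tau_i}) \tag{58}$$

where $N = \max_{\epsilon} \sum_{t=1}^n b_t$ and $\tau_i = \min\{ s : \sum_{k=1}^s b_k \ge i \}$. For simplicity, assume $\sum_{t=1}^n b_t = N$ for all $\epsilon$ uniformly (the argument can be modified appropriately if not). Since $b_k$ is $\mathcal{A}_{k-1}$-measurable, the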

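event $\{\tau_i \le t\}$ is $\mathcal{A}_{t-1}$-measurable. Define $N$ random variables $\tilde{X}_i = x_{\tau_i}$ and $\tilde{\epsilon}_i = \epsilon_{\tau_i}$, as well as the filtration $\tilde{\mathcal{A}}_i = \mathcal{A}_{\tau_i}$. We have that for any $f$ and $i$

$$\mathbb{E} \big[ \tilde{\epsilon}_i f(\tilde{X}_i) \mid \tilde{\mathcal{A}}_{i-1} \big] = 0$$

and therefore $\sum_{i=1}^N \tilde{\epsilon}_i f(\tilde{X}_i)$ is a sum of martingale differences, indexed by $f$. By the result of [25], for any process $\tilde{X}_1, \ldots, \tilde{X}_N$ with values in $\mathrm{img}(x)$,

$$\mathbb{E} \sup_{f\in\mathcal{F}} \sum_{i=1}^N \tilde{\epsilon}_i f(\tilde{X}_i) \le 2 \sup_{y, x'} \mathbb{E}_{\gamma} \sup_{f\in\mathcal{F}} \sum_{i=1}^N \gamma_i y_i f(x'_i) = 2 \sup_{x'} \mathbb{E}_{\gamma} \sup_{f\in\mathcal{F}} \sum_{i=1}^N \gamma_i f(x'_i)$$

where $y, x'$ range, respectively, over $\{\pm 1\}$-valued and $\mathrm{img}(x)$-valued trees. The last equality follows from the rotation lemma (see [20]).

B Proof of Lemma 15

Lemma 23. Let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and $r \in (1, 2]$. If (18) holds with constant $D/2$, then

$$\mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t f(x_t) \Big] \le C_{r,p} \max_{\epsilon} \Big( \sum_{t=1}^n \sup_{f\in\mathcal{F}} |f(x_t)|^p \Big)^{1/p}$$

for any $1 \le p < r$ and $C_{r,p} \triangleq D \big(1 - 2^{-(r-p)/rp}\big)^{-1}$.

Proof of Lemma 23. Given a predictable process $x$, define for each $k = 0, 1, \ldots$ a predictable process $b^{(k)}$ by

$$b^{(k)}_t = \begin{cases} 1 & \text{if } 2^{-(k+1)/p} A < \sup_{f\in\mathcal{F}} |f(x_t)| \le 2^{-k/p} A, \\ 0 & \text{otherwise,} \end{cases}$$

where $A = \max_{\epsilon} \big( \sum_{t=1}^n \sup_{f\in\mathcal{F}} |f(x_t)|^p \big)^{1/p}$. Since $x_t$ is $\mathcal{A}_{t-1}$-measurable, so is $b^{(k)}_t$. From the definition, $\sum_{k \ge 0} b^{(k)}_t \le 1$. Hence

$$\sum_{t=1}^n \epsilon_t f(x_t) = \sum_{k \ge 0} \sum_{t=1}^n \epsilon_t b^{(k)}_t f(x_t).$$

Denoting $N_k(\epsilon) = |\{ t : b^{(k)}_t = 1 \}|$,

$$\begin{aligned}
\mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t f(x_t) \Big]
&\le \sum_{k \ge 0} \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t b^{(k)}_t f(x_t) \Big] \\
&\le D \sum_{k \ge 0} \Big( \max_{\epsilon} N_k(\epsilon) \Big)^{1/r} \sup_{f, \epsilon, t} \big\{ b^{(k)}_t |f(x_t)| \big\} \\
&\le DA \sum_{k \ge 0} \Big( \max_{\epsilon} N_k(\epsilon) \Big)^{1/r} 2^{-k/p}. \tag{59}
\end{aligned}$$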
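The dyadic level sets $b^{(k)}$ in the proof of Lemma 23 can be computed explicitly along a single path. The sketch below is a hedged illustration, not the authors' construction: it takes a plain list standing in for the values $\sup_f |f(x_t)|$ on one path and computes $A$ from that path alone (in the proof, $A$ is a maximum over all sign sequences); the function names are hypothetical.

```python
import math


def dyadic_level(value, A, p):
    """Index k with 2**(-(k+1)/p) * A < value <= 2**(-k/p) * A; None if value is 0.

    Values marginally above A (possible only through rounding here) go to level 0.
    Values sitting exactly on a level boundary may be shifted by floating-point error.
    """
    if value <= 0:
        return None
    return max(0, math.floor(-p * math.log2(value / A)))


def level_counts(values, p):
    """Return A = (sum_t value_t^p)^(1/p) and the number of indices in each level."""
    A = sum(v ** p for v in values) ** (1.0 / p)
    counts = {}
    for v in values:
        k = dyadic_level(v, A, p)
        if k is not None:
            counts[k] = counts.get(k, 0) + 1
    return A, counts


A, counts = level_counts([1.0, 1.0, 1.0, 1.0], p=1)
print(A, counts)  # 4.0 {2: 4}
```

Because the $p$-th powers of the values sum to $A^p$ and every index in level $k$ contributes more than $2^{-(k+1)} A^p$, level $k$ contains fewer than $2^{k+1}$ indices; this follows directly from the definition of $b^{(k)}$.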


On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities. Sasha Rakhlin, Department of Statistics, The Wharton School, University of Pennsylvania. Dec 16, 2015. Joint work with K. Sridharan. arXiv:1510.03925