On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities


Alexander Rakhlin, University of Pennsylvania
Karthik Sridharan, Cornell University

October 17, 2015

Abstract

We study an equivalence of (i) deterministic pathwise statements appearing in the online learning literature (termed regret bounds), (ii) high-probability tail bounds for the supremum of a collection of martingales (of a specific form arising from uniform laws of large numbers for martingales), and (iii) in-expectation bounds for the supremum. By virtue of the equivalence, we prove exponential tail bounds for norms of Banach space valued martingales via deterministic regret bounds for the online mirror descent algorithm with an adaptive step size. We extend these results beyond the linear structure of the Banach space: we define a notion of martingale type for general classes of real-valued functions and show its equivalence (up to a logarithmic factor) to various sequential complexities of the class (in particular, the sequential Rademacher complexity and its offset version). For classes with the general martingale type 2, we exhibit a finer notion of variation that allows partial adaptation to the function indexing the martingale. Our proof technique rests on sequential symmetrization and on certifying the existence of regret minimization strategies for certain online prediction problems.

1 Introduction

Let $Z_1, \ldots, Z_n$ be a martingale difference sequence taking values in a separable $(2, D)$-smooth Banach space $(\mathcal{B}, \|\cdot\|)$. A result due to Pinelis [17] asserts that for any $u > 0$
\[ P\Big( \Big\| \sum_{t=1}^n Z_t \Big\| \ge \sigma u \Big) \le 2 \exp\Big\{ -\frac{u^2}{2D^2} \Big\}, \tag{1} \]
where $\sigma$ is a constant satisfying $\sum_{t=1}^n \|Z_t\|_\infty^2 \le \sigma^2$. Writing the norm $\|x\| = \sup_{\|y\|_* \le 1} \langle y, x \rangle$ as the supremum over the dual ball, we may re-interpret (1) as a one-sided tail control for the supremum of a stochastic process $\{ y \mapsto \langle y, \sum_{t=1}^n Z_t \rangle : \|y\|_* \le 1 \}$.

In this paper, we consider several extensions of (1), motivated by the following questions: (a) Can (1) be strengthened by replacing $\sigma$ with a path-dependent version of variation? (b) Does a version of (1) hold when we move away from the linear structure of the Banach space?
Positive answers to these questions constitute the first contribution of our paper. The second contribution involves the actual technique. The cornerstone of our analysis is a certain equivalence of martingale inequalities and deterministic pathwise statements. The latter inequalities are studied in

the field of online learning (or, sequential prediction), and are referred to as regret bounds. We show that the existence (which can be certified via the minimax theorem) of prediction strategies that minimize regret yields predictable processes that help in answering (a) and (b). The equivalence is exploited in both directions, whereby stronger regret bounds are derived from the corresponding probabilistic bounds, and vice versa. To obtain one of the main results in the paper, we sharpen the bound by passing several times between the deterministic statements and probabilistic tail bounds. The equivalence asserts a strong connection between probabilistic inequalities for martingales and online learning algorithms.

In the remainder of this section, we present a simple example of the equivalence based on the gradient descent method, arguably the most popular convex optimization procedure. The example captures, loosely speaking, a correspondence between deterministic optimization methods and probabilistic bounds.

Consider the unit Euclidean ball $B$ in $\mathbb{R}^d$. Let $z_1, \ldots, z_n \in B$ and define, recursively, the Euclidean projections
\[ \hat{y}_{t+1} = \hat{y}_{t+1}(z_1, \ldots, z_t) = \mathrm{Proj}_B\big( \hat{y}_t - n^{-1/2} z_t \big) \tag{2} \]
for each $t = 1, \ldots, n$, with the initial value $\hat{y}_1 = 0$. Elementary algebra (see footnote 1) shows that for any $f \in B$, the regret inequality $\sum_{t=1}^n \langle \hat{y}_t - f, z_t \rangle \le \sqrt{n}$ holds deterministically for any sequence $z_1, \ldots, z_n \in B$. We re-write this statement as
\[ \Big\| \sum_{t=1}^n z_t \Big\| - \sqrt{n} \le \sum_{t=1}^n \langle \hat{y}_t, -z_t \rangle. \tag{3} \]
Applying the deterministic inequality to a $B$-valued martingale difference sequence $Z_1, \ldots, Z_n$,
\[ P\Big( \Big\| \sum_{t=1}^n Z_t \Big\| - \sqrt{n} > u \Big) \le P\Big( \sum_{t=1}^n \langle \hat{y}_t, -Z_t \rangle > u \Big) \le \exp\Big\{ -\frac{u^2}{2n} \Big\}. \tag{4} \]
The latter upper bound is an application of the Azuma-Hoeffding inequality. Indeed, the process $(\hat{y}_t)$ is predictable with respect to $\sigma(Z_1, \ldots, Z_t)$, and thus $(\langle \hat{y}_t, -Z_t \rangle)$ is a $[-1, 1]$-valued martingale difference sequence. It is worth emphasizing the conclusion: one-sided deviation tail bounds for a norm of a vector-valued martingale can be deduced from tail bounds for real-valued martingales with the help of a deterministic inequality.
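The deterministic inequality (3) can be checked numerically on any sequence in the unit ball. The following is a minimal Python sketch of update (2), not the authors' code: the function names and the random test sequence are illustrative assumptions, and only the standard library is used.

```python
import math
import random


def project_to_unit_ball(v):
    """Euclidean projection onto the unit ball: rescale only if the norm exceeds 1."""
    norm = math.sqrt(sum(c * c for c in v))
    return list(v) if norm <= 1.0 else [c / norm for c in v]


def both_sides_of_inequality_3(zs):
    """Run update (2) on z_1..z_n and return (lhs, rhs) of inequality (3)."""
    n, d = len(zs), len(zs[0])
    eta = 1.0 / math.sqrt(n)          # step size n^{-1/2}
    y_hat = [0.0] * d                 # initial value y_hat_1 = 0
    running_sum = [0.0] * d
    rhs = 0.0
    for z in zs:
        rhs -= sum(y * c for y, c in zip(y_hat, z))       # adds <y_hat_t, -z_t>
        running_sum = [s + c for s, c in zip(running_sum, z)]
        y_hat = project_to_unit_ball([y - eta * c for y, c in zip(y_hat, z)])
    lhs = math.sqrt(sum(s * s for s in running_sum)) - math.sqrt(n)
    return lhs, rhs


def random_point_in_unit_ball(d, rng):
    """Hypothetical helper: a Gaussian direction scaled to a random radius at most 1."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g)) or 1.0
    radius = rng.random()
    return [radius * c / norm for c in g]


rng = random.Random(0)
zs = [random_point_in_unit_ball(3, rng) for _ in range(200)]
lhs, rhs = both_sides_of_inequality_3(zs)
print(lhs <= rhs + 1e-9)  # inequality (3), up to floating-point tolerance
```

The inequality is pathwise, so the check does not rely on the sequence being random; an adversarial sequence such as a repeated unit vector can be passed in the same way.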
Next, integrating the tail bound in (4) yields a seemingly weaker in-expectation statement
\[ \mathbb{E} \Big\| \sum_{t=1}^n Z_t \Big\| \le c \sqrt{n} \tag{5} \]
for an appropriate constant $c$. The twist in this uncomplicated story comes next: with the help of the minimax theorem, [23] established existence of strategies $(\hat{y}_t)$ such that
\[ \forall z_1, \ldots, z_n, f \in B, \quad \sum_{t=1}^n \langle \hat{y}_t - f, z_t \rangle \le \sup \mathbb{E} \Big\| \sum_{t=1}^n Z_t \Big\|, \tag{6} \]
with the supremum taken over all martingale difference sequences with respect to a dyadic filtration. In view of (5), this bound is $c\sqrt{n}$.

What have we achieved? Let us summarize. The deterministic inequality (3), which holds for all sequences, implies a tail bound (4). The latter, in turn, implies an in-expectation bound (5), which implies (3) (with a worse constant) through a minimax argument, thus closing the loop.

Footnote 1. See the two-line proof in the Appendix, Lemma 19.

The equivalence studied in depth in this paper is informally stated below:

Informal: The following bounds imply each other: (a) an inequality that holds for all sequences; (b) a deviation tail probability for the size of a martingale; (c) an in-expectation bound on the size of a martingale.

The equivalence, in particular, allows us to amplify the in-expectation bounds to appropriate high-probability tail bounds. As already mentioned, the pathwise inequalities, such as (3), are extensively studied in the field of online learning. In this paper, we employ some of the recently developed data-dependent (adaptive) regret inequalities to prove tail bounds for martingales. In turn, in view of the above equivalence, martingale inequalities shall give rise to novel deterministic regret bounds.

While writing the paper, we learned of the trajectorial approach, extensively studied in recent years. In particular, it has been shown that Doob's maximal inequalities and Burkholder-Davis-Gundy inequalities have deterministic counterparts [2, 3, 13, 4]. The online learning literature contains a trove of pathwise inequalities, and further synthesis with the trajectorial approach (and the applications in mathematical finance) appears to be a promising direction.

This paper is organized as follows. In the next section, we extend the Euclidean result to martingales with values in Banach spaces, and also improve it by replacing $\sqrt{n}$ with the square root of variation. We define a notion of martingale type for general classes of functions in Section 3, and exhibit a tight connection to the growth of sequential Rademacher complexity. Section 4 presents sequential symmetrization; here we prove that statements for the dyadic filtration automatically yield corresponding tail bounds for general discrete-time stochastic processes. In Section 5, we introduce the machinery for obtaining regret inequalities, and show how these inequalities allow one to amplify certain in-expectation bounds into high-probability statements (Section 6).
The last two sections contain some of the main results: in Section 7 we prove a high probability bound for the notion of martingale type, and present a finer analysis of adaptivity of the variation term in Section 8.

2 Results in Banach spaces

For the case of the Euclidean (or Hilbertian) norm, it is easy to see that the $\sqrt{n}$ bound of (5) can be improved to a distribution-dependent quantity $\big( \sum_{t=1}^n \mathbb{E} \|Z_t\|^2 \big)^{1/2}$. Given the equivalence sketched earlier, one may wonder whether this implies existence of a gradient-descent-like method with a sequence-dependent variation governing the rate of convergence of this optimization procedure. Below, we indeed present such a method for 2-smooth Banach spaces.

Let $(\mathcal{B}, \|\cdot\|)$ be a separable Banach space, and let $(\mathcal{B}^*, \|\cdot\|_*)$ denote its dual. $(\mathcal{B}, \|\cdot\|)$ is of martingale type $p$ (for $p \in [1, 2]$) if there exists a constant $C$ such that
\[ \mathbb{E} \Big\| \sum_{t=1}^n Z_t \Big\|^p \le C^p \sum_{t=1}^n \mathbb{E} \|Z_t\|^p \tag{7} \]
for any $\mathcal{B}$-valued martingale difference sequence. The best possible constant $C$ in this inequality (as well as its finiteness) is known to depend on the geometry of the Banach space. For instance, for a Hilbert space (7) holds for $p = 2$ with constant $C = 1$. On the other hand, the triangle inequality implies that any space has the trivial type $p = 1$.
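The two boundary cases can be verified directly. The following worked derivation is a standard computation added for illustration; it assumes square-integrable increments and writes $\langle \cdot, \cdot \rangle$ for the Hilbert space inner product.

```latex
% Hilbert space, p = 2: cross terms vanish because Z_t has zero
% conditional mean given Z_1, ..., Z_{t-1}.
\begin{align*}
\mathbb{E}\Big\|\sum_{t=1}^n Z_t\Big\|^2
  &= \sum_{t=1}^n \mathbb{E}\|Z_t\|^2
     + 2\sum_{s<t} \mathbb{E}\langle Z_s, Z_t\rangle, \\
\mathbb{E}\langle Z_s, Z_t\rangle
  &= \mathbb{E}\big\langle Z_s,\,
     \mathbb{E}[Z_t \mid Z_1,\ldots,Z_{t-1}]\big\rangle = 0
     \qquad (s < t).
\end{align*}
% Any normed space, p = 1: the triangle inequality gives
\begin{align*}
\mathbb{E}\Big\|\sum_{t=1}^n Z_t\Big\|
  \le \sum_{t=1}^n \mathbb{E}\|Z_t\|.
\end{align*}
```

In the Hilbert case the first display shows that (7) with $p = 2$ and $C = 1$ holds with equality.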

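An equivalent way to define martingale type $p$ is to ask that there exist a constant $C$ such that
\[ \mathbb{E} \sup_{\|y\|_* \le 1} \Big\langle y, \sum_{t=1}^n Z_t \Big\rangle = \mathbb{E} \Big\| \sum_{t=1}^n Z_t \Big\| \le C \Big( \sum_{t=1}^n \mathbb{E} \|Z_t\|^p \Big)^{1/p}. \tag{8} \]
We now show that the strengthening to a sequence-dependent variation holds for any 2-smooth Banach space. Based on the equivalence mentioned earlier, we immediately obtain tail bounds.

Assume $\|\cdot\|$ is 2-smooth. Let $D_R : \mathcal{B}^* \times \mathcal{B}^* \to \mathbb{R}$ be the Bregman divergence with respect to a convex function $R$, which is assumed to be 1-strongly convex on the unit ball $B_*$ of $\mathcal{B}^*$. Denote $R_{\max}^2 \triangleq \sup_{f, g \in B_*} D_R(f, g)$. We extend and improve (4) as follows.

Theorem 1. Let $Z_1, \ldots, Z_n$ be a $\mathcal{B}$-valued martingale difference sequence, and let $\mathbb{E}_t$ stand for the conditional expectation given $Z_1, \ldots, Z_t$. For any $u > 0$, it holds that
\[ P\left( \frac{\big\| \sum_{t=1}^n Z_t \big\| - 2.5 R_{\max} (\sqrt{V} + 1)}{\sqrt{V + W + (\mathbb{E}\sqrt{V + W})^2}} > u \right) \le \sqrt{2} \exp\{ -u^2/16 \}, \tag{9} \]
where
\[ V = \sum_{t=1}^n \|Z_t\|^2 \quad \text{and} \quad W = \sum_{t=1}^n \mathbb{E}_{t-1} \|Z_t\|^2. \tag{10} \]
Furthermore, the bound holds with $W \equiv 0$ if the martingale differences are conditionally symmetric.

In addition to extending the Euclidean result of the previous section to Banach spaces, (9) offers several advantages. First, it is $n$-independent. Second, deviations are self-normalized (that is, scaled by root-variation terms). We refer to Lemma 11 for other forms of probabilistic bounds.

To prove the theorem, we start with a deterministic inequality from [21, Corollary 2]. For completeness, the proof is provided in the Appendix.

Lemma 2. Let $\mathcal{F} \subseteq \mathcal{B}^*$ be a convex set. Define, recursively,
\[ \hat{y}_{t+1} = \hat{y}_{t+1}(z_1, \ldots, z_t) = \operatorname*{argmin}_{f \in \mathcal{F}} \big\{ \eta_t \langle f, z_t \rangle + D_R(f, \hat{y}_t) \big\} \tag{11} \]
with $\hat{y}_0 = 0$, $\eta_t \triangleq R_{\max} \min\Big\{ 1, \Big( \sqrt{\sum_{s=1}^{t} \|z_s\|^2} + \sqrt{\sum_{s=1}^{t-1} \|z_s\|^2} \Big)^{-1} \Big\}$, and with $R_{\max}^2 \triangleq \sup_{f, g \in \mathcal{F}} D_R(f, g)$. Then for any $f \in \mathcal{F}$ and any $z_1, \ldots, z_n \in \mathcal{B}$,
\[ \sum_{t=1}^n \langle \hat{y}_t - f, z_t \rangle \le 2.5 R_{\max} \Big( \sqrt{\sum_{t=1}^n \|z_t\|^2} + 1 \Big). \]

Proof of Theorem 1. We take $\mathcal{F}$ to be the unit ball in $\mathcal{B}^*$, ensuring $\|\hat{y}_t\|_* \le 1$. For any martingale difference sequence $(Z_t)$ with values in $\mathcal{B}$, the above lemma implies, by definition of the norm,
\[ \Big\| \sum_{t=1}^n Z_t \Big\| - 2.5 R_{\max} (\sqrt{V} + 1) \le \sum_{t=1}^n \langle \hat{y}_t, -Z_t \rangle \tag{12} \]
for all sample paths. Dividing both sides by $\sqrt{V + W + (\mathbb{E}\sqrt{V + W})^2}$, we conclude that the left-hand side in (9) is upper bounded by
\[ P\left( \frac{\sum_{t=1}^n \langle \hat{y}_t, -Z_t \rangle}{\sqrt{V + W + (\mathbb{E}\sqrt{V + W})^2}} > u \right). \tag{13} \]
To control this probability, we recall the following result [8, Theorem 2.7]: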

Theorem 3 ([8]). For a pair of random variables $A, B$, with $B > 0$, such that
\[ \mathbb{E} \exp\{ \lambda A - \lambda^2 B^2 / 2 \} \le 1 \quad \forall \lambda \in \mathbb{R}, \tag{14} \]
it holds that
\[ P\left( \frac{A}{\sqrt{B^2 + (\mathbb{E} B)^2}} > u \right) \le \sqrt{2} \exp\{ -u^2/4 \}. \]

To apply this theorem, we verify assumption (14):

Lemma 4. The random variables $A = \sum_{t=1}^n \langle \hat{y}_t, -Z_t \rangle$ and $B^2 = 4 \sum_{t=1}^n \big( \|Z_t\|^2 + \mathbb{E}_{t-1} \|Z_t\|^2 \big)$ satisfy (14).

The proof of the Lemma, as well as most of the proofs in this paper, is postponed to the Appendix. This concludes the proof of Theorem 1.

Let us make several remarks. First, [21, Corollary 2] proves a more general deterministic inequality: for any collection of functions $M_t = M_t(z_1, \ldots, z_{t-1})$, there exists a strategy $(\hat{y}_t)$ such that
\[ \forall z_1, \ldots, z_n \in \mathcal{B}, \quad \sum_{t=1}^n \langle \hat{y}_t - f, z_t \rangle \le 4.5 R_{\max} \Big( \sqrt{\sum_{t=1}^n \|z_t - M_t\|^2} + 1 \Big). \]
Second, the reader will notice that the pathwise inequality (12) does not depend on $n$ and the construction of $\hat{y}_t$ is also oblivious to this value. A simple argument (Lemma 20 in the Appendix) then allows us to lift the real-valued Burkholder-Davis-Gundy inequality (with the constant from [6]) to Banach space valued martingales:
\[ \mathbb{E} \max_{s = 1, \ldots, n} \Big\| \sum_{t=1}^s Z_t \Big\| \le (2.5 R_{\max} + 3) \, \mathbb{E} \sqrt{V} + 2.5 R_{\max}. \]
Notably, the constant in the resulting BDG inequality is proportional to $R_{\max}$.

We also remark that Theorem 1 can be naturally extended to $p$-smooth Banach spaces $\mathcal{B}$. This is accomplished in a straightforward manner by extending Lemma 2.

In conclusion, we were able to replace the distribution-independent bound $\sqrt{n}$ with a sequence-dependent quantity $\sqrt{V}$. One may ask whether this phenomenon is general; that is, whether a sequence-dependent variation bound necessarily holds whenever the corresponding distribution-independent bound does. We prove in Theorem 5 below that this is indeed the case (up to a logarithmic factor), a result that holds for general classes of functions.

3 Martingale Type for a General Class of Functions

We now define the analogue of a martingale type for a class $\mathcal{G}$ of real-valued measurable functions on some abstract measurable space $\mathcal{Z}$. To this end, we assume that $(Z_1, \ldots, Z_n)$ is a discrete time process on a probability space $(\Omega, \mathcal{A}, P)$.
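Let $\mathbb{E}$ denote the expectation on this probability space, and let $\mathbb{E}_{t-1}$ denote the conditional (given $Z_1, \ldots, Z_{t-1}$) expectation with respect to $Z_t$. For any $g : \mathcal{Z} \to \mathbb{R}$,
\[ \sum_{t=1}^n \big( g(Z_t) - \mathbb{E}_{t-1}[g(Z_t)] \big) \tag{15} \]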

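is a sum of martingale differences $g(Z_t) - \mathbb{E}_{t-1}[g(Z_t)]$. We let $Z'_1, \ldots, Z'_n$ be a tangent sequence; that is, $Z_t$ and $Z'_t$ are independent and identically distributed conditionally on $Z_1, \ldots, Z_{t-1}$. Let $\mathbb{E}'_{t-1}$ denote the conditional (given $Z_1, \ldots, Z_{t-1}$) expectation with respect to $Z'_t$.

Definition 1. A class $\mathcal{G} \subseteq \mathbb{R}^{\mathcal{Z}}$ has martingale type $p$ if there exists a constant $C$ such that
\[ \mathbb{E} \Big[ \sup_{g \in \mathcal{G}} \sum_{t=1}^n \big( g(Z_t) - \mathbb{E}_{t-1}[g(Z_t)] \big) \Big] \le C \, \mathbb{E} \Big( \sum_{t=1}^n \mathbb{E}'_{t-1} \sup_{g \in \mathcal{G}} |g(Z_t) - g(Z'_t)|^p \Big)^{1/p}. \tag{16} \]

Remark 3.1. We conjecture that the statements below also hold for the definition of martingale type where $\mathbb{E}'_{t-1} \sup_{g \in \mathcal{G}} |g(Z_t) - g(Z'_t)|^p$ on the right-hand side of (16) is replaced with a smaller and more natural quantity $\sup_{g \in \mathcal{G}} |g(Z_t) - \mathbb{E}'_{t-1} g(Z'_t)|^p$.

In proving (16), we shall work with a dyadic filtration. Let $(\mathcal{A}_t)$, with $\mathcal{A}_t = \sigma(\epsilon_1, \ldots, \epsilon_t)$, be the filtration generated by independent Rademacher (symmetric $\{\pm 1\}$-valued) random variables $\epsilon_1, \ldots, \epsilon_n$. Let $\mathbf{x} = (x_1, \ldots, x_n)$ be a predictable process with respect to this filtration (that is, $x_t$ is $\mathcal{A}_{t-1}$-measurable) with values in some set $\mathcal{X}$. Sequential Rademacher complexity (see footnote 2) of an abstract class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ on $\mathbf{x}$ is defined as
\[ \mathcal{R}_n(\mathcal{F}; \mathbf{x}) = \mathbb{E} \sup_{f \in \mathcal{F}} \Big| \sum_{t=1}^n \epsilon_t f(x_t) \Big|. \tag{17} \]

Definition 2. Let $r \in (1, 2]$. We say that sequential Rademacher complexity of $\mathcal{F}$ exhibits an $n^{1/r}$ growth with constant $C$ if
\[ \forall n \ge 1, \ \forall \mathbf{x}, \quad \mathcal{R}_n(\mathcal{F}; \mathbf{x}) \le C n^{1/r} \sup_{f \in \mathcal{F}, \, \epsilon \in \{\pm 1\}^n, \, t \le n} |f(x_t(\epsilon))|. \tag{18} \]

We will work with a particular class of functions $\mathcal{F} = \{ f_g(z, z') = g(z) - g(z') : g \in \mathcal{G} \}$ defined on $\mathcal{X} \triangleq \mathcal{Z} \times \mathcal{Z}$. It is immediate that $\mathcal{F}$ exhibits $n^{1/r}$ growth whenever $\mathcal{G}$ does, and vice versa, with at most doubling of the constant $C$. Using a sequential symmetrization technique, it holds (see [25]) that
\[ \mathbb{E} \Big[ \sup_{g \in \mathcal{G}} \sum_{t=1}^n \big( g(Z_t) - \mathbb{E}_{t-1}[g(Z_t)] \big) \Big] \le 2 \sup_{\mathbf{z}} \mathcal{R}_n(\mathcal{G}; \mathbf{z}). \tag{19} \]
Therefore, the statement "$\mathcal{G}$ has martingale type $r$ whenever $\mathcal{G}$ exhibits an $n^{1/r}$ growth" corresponds to the phenomenon that, loosely speaking, one may replace the distribution-independent $n^{1/r}$ bound with a sequence-dependent variation. The next theorem shows a tight connection between the complexity growth $n^{1/r}$ and martingale type.

Theorem 5. For any function class $\mathcal{G} \subseteq \mathbb{R}^{\mathcal{Z}}$, the following statements hold:
1. If for some $r \in (1, 2]$ sequential Rademacher complexity exhibits $n^{1/r}$ growth, then $\mathcal{G}$ has martingale type $p$ for every $p < r$.
2. If $\mathcal{G}$ has martingale type $p$, then sequential complexity exhibits an $n^{1/p}$ growth.

Footnote 2. This complexity is defined in [25] without the absolute values; this difference is minor (and disappears if $0 \in \mathcal{F}$).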
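For small $n$, the sequential Rademacher complexity in (17) can be evaluated exactly by enumerating all $2^n$ sign paths. The following is a minimal Python sketch under stated assumptions: the class is finite and given as a list of callables, and the predictable process is given as a function of the sign prefix. The names `sequential_rademacher`, `bisection_tree`, and the threshold class are illustrative choices, not objects from the paper.

```python
from itertools import product


def sequential_rademacher(functions, tree, n):
    """Exact E sup_f |sum_t eps_t f(x_t(eps_1..eps_{t-1}))| over uniform signs in {-1,+1}^n."""
    total = 0.0
    for eps in product((-1, 1), repeat=n):
        # Predictability: the t-th element (0-indexed) sees only the first t signs.
        xs = [tree(t, eps[:t]) for t in range(n)]
        total += max(abs(sum(e * f(x) for e, x in zip(eps, xs))) for f in functions)
    return total / 2 ** n


def bisection_tree(t, prefix):
    """Illustrative predictable process on [0, 1]: bisect according to the signs seen so far."""
    lo, hi = 0.0, 1.0
    for e in prefix:
        mid = (lo + hi) / 2
        if e == 1:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


# Illustrative class: threshold functions x -> +1 if x >= theta else -1, theta on a dyadic grid.
thresholds = [lambda x, th=k / 8: 1.0 if x >= th else -1.0 for k in range(1, 9)]

print(sequential_rademacher([lambda x: 1.0], lambda t, prefix: 0, 4))  # single constant function
print(sequential_rademacher(thresholds, bisection_tree, 3))            # thresholds on the tree
```

On the bisection tree every sign path is matched exactly by one of the thresholds, so the second value equals $n = 3$, the largest value possible for a $\{\pm 1\}$-valued class. The enumeration costs $2^n$ evaluations per function and is meant only for small illustrations.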

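The proof relies on the development in the next few sections, and especially on Lemma 15. The technique is partly inspired by the work of Burkholder [7] and Pisier [18]. In particular, a key tool is the reverse Hölder principle [19, Prop. 8.53].

In addition to Theorem 5, let us state informal versions of Theorems 17 and 18 which appear, respectively, in Sections 7 and 8. Define the random variables
\[ \mathrm{Var}_p = \mathbb{E}' \sum_{t=1}^n \sup_{g \in \mathcal{G}} |g(Z_t) - g(Z'_t)|^p, \qquad \mathrm{Var}_p(g) = \mathbb{E}' \sum_{t=1}^n |g(Z_t) - g(Z'_t)|^p, \]
where $\mathbb{E}'$ is expectation with respect to the tangent sequence, conditionally on $Z_{1:n}$. Then Theorem 17 states that with high probability controlled by $u > 0$,
\[ \sup_{g \in \mathcal{G}} \sum_{t=1}^n \big( g(Z_t) - \mathbb{E}_{t-1}[g(Z_t)] \big) \lesssim \log(n) \, \mathrm{Var}_r^{1/r} + u \, \mathrm{Var}_2^{1/2} \]
whenever $\mathcal{G}$ exhibits $n^{1/r}$ growth of sequential Rademacher complexity. Theorem 18 addresses the case of martingale type 2 and states that, with high probability controlled by $u > 0$, the supremum over $g \in \mathcal{G}$ of the martingale $\sum_{t=1}^n ( g(Z_t) - \mathbb{E}_{t-1}[g(Z_t)] )$, offset by a variation term whose exponents depend on $q$ and which involves the per-function variation $\mathrm{Var}_2(g)$, and by $u \, \mathrm{Var}_2^{1/2}(g)$, is non-positive up to constants, whenever sequential entropy (defined below) at scale $\alpha$ behaves as $\alpha^{-q}$.

3.1 Other complexity measures

We see that the martingale type of $\mathcal{G}$ is described by the behavior of sequential Rademacher complexity. The latter behavior can, in turn, be quantified in terms of geometric quantities, such as sequential covering numbers and the sequential scale-sensitive dimension. We present the following two definitions from [25], both stated in terms of a predictable process $\mathbf{x} = (x_1, \ldots, x_n)$ with respect to the dyadic filtration. It may be beneficial (at least it was for the authors of [25]) to think of $\mathbf{x}$ as a complete binary tree of depth $n$, decorated by elements of $\mathcal{X}$, and of $\epsilon \in \{\pm 1\}^n$ as specifying a path in this tree.

Definition 3 (Sequential covering number). Let $\mathbf{x} = (x_1, \ldots, x_n)$ be an $\mathcal{X}$-valued predictable process with respect to the dyadic filtration, and let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$. A collection $V$ of $\mathbb{R}$-valued predictable processes is called an $\alpha$-cover (with respect to $\ell_p$) of $\mathcal{F}$ on $\mathbf{x}$ if
\[ \forall f \in \mathcal{F}, \ \forall \epsilon \in \{\pm 1\}^n, \ \exists \mathbf{v} \in V \ \text{s.t.} \ \Big( \frac{1}{n} \sum_{t=1}^n |f(x_t(\epsilon)) - v_t(\epsilon)|^p \Big)^{1/p} \le \alpha. \tag{20} \]
The cardinality of the smallest $\alpha$-cover is denoted by $\mathcal{N}_p(\mathcal{F}, \alpha, \mathbf{x})$ and $\mathcal{N}_p(\mathcal{F}, \alpha, n) = \sup_{\mathbf{x}} \mathcal{N}_p(\mathcal{F}, \alpha, \mathbf{x})$, and both are referred to as sequential covering numbers. Sequential entropy is defined as $\log \mathcal{N}_p$.

Definition 4 (Sequential fat-shattering dimension). We say that $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ shatters the predictable process $\mathbf{x} = (x_1, \ldots, x_n)$ at scale $\alpha > 0$ if there exists a real-valued predictable process $\mathbf{s}$ such that
\[ \forall \epsilon \in \{\pm 1\}^n, \ \exists f \in \mathcal{F} \ \text{s.t.} \ \forall t, \ \epsilon_t \big( f(x_t(\epsilon)) - s_t(\epsilon) \big) \ge \alpha/2. \]
The largest length $n$ of a shattered predictable process $\mathbf{x}$ is called the sequential fat-shattering dimension at scale $\alpha$ and denoted $\mathrm{fat}_\alpha(\mathcal{F})$.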

The sequential covering numbers and the fat-shattering dimension are natural extensions of the classical notions, as shown in [25]. In particular, a Dudley-type entropy integral upper bound in terms of sequential covering numbers holds for sequential Rademacher complexity. The sequential covering numbers, in turn, are upper bounded in terms of the fat-shattering dimensions, in a parallel to the way classical empirical covering numbers are controlled by the scale-sensitive version of the Vapnik-Chervonenkis dimension. We summarize the implications of these relationships in the following corollary:

Corollary 6. For any function class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$,
1. If for some $q > 0$ either $\forall \alpha, \ \log \mathcal{N}_2(\mathcal{F}, \alpha, n) \le C \alpha^{-q}$ or $\forall \alpha, \ \mathrm{fat}_\alpha(\mathcal{F}) \le C \alpha^{-q}$, then $\mathcal{F}$ has martingale type $p$ for any $p < \frac{\max\{q, 2\}}{\max\{q, 2\} - 1}$.
2. If $\mathcal{F}$ has martingale type $r \in (1, 2]$ then, for every $p < r$, there exists $C$ such that $\log \mathcal{N}_2(\mathcal{F}, \alpha, n) \le C \alpha^{-\frac{p}{p-1}}$ and $\mathrm{fat}_\alpha(\mathcal{F}) \le C \alpha^{-\frac{p}{p-1}}$, for all $\alpha$.

We have established a relation between the martingale type of a function class $\mathcal{F}$ and several sequential complexities of the class. However, unlike our starting point (1) and Theorem 1, our results so far do not quantify the tail behavior for the difference between the supremum of the martingale process and the corresponding variation. A natural idea is to mimic the equivalence argument used in Section 2 to conclude the exponential tail bounds. Unfortunately, the deviation inequalities of the previous section rest on pathwise regret bounds that, in turn, rely on the linear structure of the associated Banach space, as well as on properties such as smoothness and uniform convexity. Without the linear structure, it is not clear whether the analogous pathwise statements hold. The goal of the rest of the paper is to bring forth some of the tools recently developed within the online learning literature, and to apply these pathwise regret bounds to conclude high probability tail bounds associated to martingale type.
In addition to this goal, we will seek a version of Theorem 5(i) for bounded functions, where the $n^{1/r}$ growth of sequential Rademacher complexity implies martingale type $r$ (rather than any $p < r$), but with an additional $\log(n)$ factor. Our third goal will be to establish per-function variation bounds (similar to the notion of a weak variance [5]). We show that this latter bound is a finer version of the variation term, possible for classes that are not too large.

Our plan is as follows. First, we reduce the problem to one based on the dyadic filtration. After that, we shall introduce certain deterministic inequalities from the online learning literature that are already stated for the dyadic filtration.

4 Symmetrization: dyadic filtration is enough

The purpose of this section is to prove that statements for the dyadic filtration can be lifted to general processes via sequential symmetrization.

Consider the martingale
\[ M_g = \sum_{t=1}^n g(Z_t) - \mathbb{E}[g(Z_t) \mid Z_1, \ldots, Z_{t-1}] \]
indexed by $g \in \mathcal{G}$. If $(Z_t)$ is adapted to a dyadic filtration $\mathcal{A}_t = \sigma(\epsilon_1, \ldots, \epsilon_t)$, each increment $g(Z_t) - \mathbb{E}[g(Z_t) \mid Z_1, \ldots, Z_{t-1}]$ takes on the value
\[ f_g(x_t(\epsilon_{1:t-1})) \triangleq \big( g(Z_t(\epsilon_{1:t-1}, +1)) - g(Z_t(\epsilon_{1:t-1}, -1)) \big) / 2 \]

or its negation, where $x_t$ is a predictable process with values in $\mathcal{Z} \times \mathcal{Z}$ and $f_g \in \mathcal{F}$ is defined by $(z, z') \mapsto g(z) - g(z')$. In the rest of the paper, we work directly with martingales of the form $M_f = \sum_{t=1}^n \epsilon_t f(x_t(\epsilon))$, indexed by an abstract class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and an abstract $\mathcal{X}$-valued predictable process $\mathbf{x}$.

We extend the symmetrization approach of Panchenko [15] to sequential symmetrization for the case of martingales. In contrast to the more frequently-used Giné-Zinn symmetrization proof (via Chebyshev's inequality) [12, 26] that allows a direct tail comparison of the symmetrized and the original processes, Panchenko's approach allows for an indirect comparison. The following immediate extension of [15, Lemma 1] will imply that any $\exp\{-\mu(u)\}$ type tail behavior of the symmetrized process yields the same behavior for the original process.

Lemma 7. Suppose $\xi$ and $\nu$ are random variables and for some $\Gamma \ge 1$ and for all $u \ge 0$
\[ P(\nu \ge u) \le \Gamma \exp\{ -\mu(u) \}. \]
Let $\mu : \mathbb{R}_+ \to \mathbb{R}_+$ be an increasing differentiable function with $\mu(0) = 0$ and $\mu(\infty) = \infty$. Suppose for all $a \in \mathbb{R}$ and $\phi(x) \triangleq \mu([x - a]_+)$ it holds that $\mathbb{E} \phi(\xi) \le \mathbb{E} \phi(\nu)$. Then for any $u \ge 0$,
\[ P(\xi \ge u) \le \Gamma \exp\{ -\mu(u - \mu^{-1}(1)) \}. \]
In particular, if $\mu(b) = cb$, we have $P(\xi \ge u) \le \Gamma \exp\{ 1 - cu \}$; if $\mu(b) = cb^2$, then $P(\xi \ge u) \le \Gamma \exp\{ 1 - cu^2/4 \}$.

As in [15], the lemma will be used with $\xi$ and $\nu$ as functions of a single sample and the double sample, respectively. The expression for the double sample will be symmetrized in order to pass to the dyadic filtration. However, unlike [15], we are dealing with a dependent sequence $Z_1, \ldots, Z_n$, and the meaning ascribed to the second sample $Z'_1, \ldots, Z'_n$ is that of a tangent sequence. That is, $Z_t, Z'_t$ are independent and have the same distribution conditionally on $Z_1, \ldots, Z_{t-1}$. Let $\mathbb{E}_{t-1}$ stand for the conditional expectation given $Z_1, \ldots, Z_{t-1}$.

Corollary 8. Let $B : \mathcal{G} \times \mathcal{Z}^{2n} \to \mathbb{R}$ be a function that is symmetric with respect to the swap of the $i$-th pair $z_i, z'_i$, for any $i \in [n]$:
\[ B(g; z_1, z'_1, \ldots, z_i, z'_i, \ldots, z_n, z'_n) = B(g; z_1, z'_1, \ldots, z'_i, z_i, \ldots, z_n, z'_n) \tag{21} \]
for all $g \in \mathcal{G}$.
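Then, under the assumptions of Lemma 7 on $\mu$, a tail behavior
\[ \forall (\mathbf{z}, \mathbf{z}'), \quad P\Big( \sup_{g \in \mathcal{G}} \Big\{ \sum_{t=1}^n \epsilon_t \big( g(z_t) - g(z'_t) \big) - B(g; (z_1, z'_1), \ldots, (z_n, z'_n)) \Big\} > u \Big) \le \Gamma \exp\{ -\mu(u) \} \]
for all $u > 0$ implies the tail bound
\[ P\Big( \sup_{g \in \mathcal{G}} \Big\{ \sum_{t=1}^n \big( g(Z_t) - \mathbb{E}_{t-1} g(Z_t) \big) - \mathbb{E}_{Z'_{1:n}} B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) \Big\} > u \Big) \le \Gamma \exp\{ -\mu(u - \mu^{-1}(1)) \} \]
for any sequence of random variables $Z_1, \ldots, Z_n$ and the corresponding tangent sequence $Z'_1, \ldots, Z'_n$. Here $\mathbf{z}, \mathbf{z}'$ is a pair of predictable processes with respect to the dyadic filtration, over which the supremum in (22) is taken. A direct comparison of the expected suprema also holds:
\[ \mathbb{E} \sup_{g \in \mathcal{G}} \Big\{ \sum_{t=1}^n \big( g(Z_t) - \mathbb{E}_{t-1} g(Z_t) \big) - \mathbb{E}_{Z'_{1:n}} B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) \Big\} \le \sup_{\mathbf{z}, \mathbf{z}'} \mathbb{E} \sup_{g \in \mathcal{G}} \Big\{ \sum_{t=1}^n \epsilon_t \big( g(z_t) - g(z'_t) \big) - B(g; (z_1, z'_1), \ldots, (z_n, z'_n)) \Big\}. \tag{22} \]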

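We conclude that it is enough to prove tail bounds for a supremum
\[ \sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t f(x_t) - B(f; x_1, \ldots, x_n) \Big\} \]
of a martingale with respect to the dyadic filtration, offset by a function $B(f; x_1, \ldots, x_n)$. This will be achieved with the help of deterministic regret inequalities.

5 Deterministic regret inequalities

5.1 Sequential prediction

We let $y_1, \ldots, y_n \in \{\pm 1\}$ and $x_1, \ldots, x_n \in \mathcal{X}$ for some abstract measurable set $\mathcal{X}$. Let $\mathcal{F}$ be a class of $[-1, 1]$-valued functions on $\mathcal{X}$. Fix a cost function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, convex in the first argument. For a given function $B : \mathcal{F} \times \mathcal{X}^n \to \mathbb{R}$, we aim to construct $\hat{y}_t = \hat{y}_t(x_1, \ldots, x_t, y_1, \ldots, y_{t-1}) \in [-1, 1]$ such that
\[ \forall (x_t, y_t)_{t=1}^n, \quad \sum_{t=1}^n \ell(\hat{y}_t, y_t) \le \inf_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n \ell(f(x_t), y_t) + B(f; x_1, \ldots, x_n) \Big\}. \tag{23} \]
We may view $\hat{y}_t$ as a prediction of the next value $y_t$ having observed $x_t$ and all the data thus far. In this paper, we focus on the linear loss $\ell(a, b) = -ab/2$ (equivalently, the absolute loss $|a - b|/2 = (1 - ab)/2$ when $b \in \{\pm 1\}$ and $a \in [-1, 1]$) and the square loss $\ell(a, b) = (a - b)^2$. We equivalently write (23) for the linear cost function as
\[ \sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n y_t f(x_t) - 2 B(f; x_1, \ldots, x_n) \Big\} \le \sum_{t=1}^n y_t \hat{y}_t, \tag{24} \]
while for the square loss it becomes
\[ \sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n \big( 2 y_t f(x_t) - f(x_t)^2 \big) - B(f; x_1, \ldots, x_n) \Big\} \le \sum_{t=1}^n \big( 2 y_t \hat{y}_t - \hat{y}_t^2 \big). \tag{25} \]

Given a function $B$ and a class $\mathcal{F}$, there are two goals we may consider: (a) certify the existence of $(\hat{y}_t) \triangleq (\hat{y}_1, \ldots, \hat{y}_n)$ satisfying the pathwise inequality (23) for all sequences $(x_t, y_t)_{t=1}^n$; or (b) give an explicit construction of $(\hat{y}_t)$. Both questions have been studied in the online learning literature, but the non-constructive approach will play an especially important role. Indeed, explicit constructions such as the simple gradient descent update (2) might not be available in more complex situations, yet it is the existence of $(\hat{y}_t)$ that yields the sought-after tail bounds.

5.2 Existence of strategies

To certify the existence of a strategy $(\hat{y}_t)$, consider the following object:
\[ \mathcal{A}(\mathcal{F}, B) = \Big\langle\!\!\Big\langle \sup_{x_t} \inf_{\hat{y}_t} \max_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big\{ \sum_{t=1}^n \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n \ell(f(x_t), y_t) + B(f; x_1, \ldots, x_n) \Big\} \Big\} \tag{26} \]
where the notation $\langle\!\langle \cdots \rangle\!\rangle_{t=1}^n$ stands for the repeated application of the operators (the outer operators corresponding to $t = 1$). The variable $x_t$ ranges over $\mathcal{X}$, $y_t$ is in the set $\{\pm 1\}$, and $\hat{y}_t$ ranges in $[-1, 1]$. It follows that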

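$\mathcal{A}(\mathcal{F}, B) \le 0$ is a necessary and sufficient condition for the existence of $(\hat{y}_t)$ such that (23) holds. Indeed, the optimal choice for $\hat{y}_1$ is made given $x_1$; the optimal choice for $\hat{y}_2$ is made given $x_1, y_1, x_2$, and so on. This choice defines the optimal strategy $(\hat{y}_t)$ (see footnote 3). The other direction is immediate.

Suppose we can find an upper bound on $\mathcal{A}(\mathcal{F}, B)$ and then prove that this upper bound is non-positive. This would serve as a sufficient condition for the existence of $(\hat{y}_t)$. Next, we present such an upper bound for the case when the cost function is linear. More general results for convex Lipschitz cost functions can be found in [9].

As before, let $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$ be a sequence of independent Rademacher random variables. Let $\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{y} = (y_1, \ldots, y_n)$ be predictable processes with respect to the dyadic filtration $\sigma(\epsilon_1, \ldots, \epsilon_t)$, with values in $\mathcal{X}$ and $\{\pm 1\}$, respectively. In other words, $x_t = x_t(\epsilon_1, \ldots, \epsilon_{t-1}) \in \mathcal{X}$ and $y_t = y_t(\epsilon_1, \ldots, \epsilon_{t-1}) \in \{\pm 1\}$ for each $t = 1, \ldots, n$.

Lemma 9. For the case of the linear cost function,
\[ \mathcal{A}(\mathcal{F}, B) \le \sup_{\mathbf{x}} \mathbb{E} \Big[ \sup_{f \in \mathcal{F}} \Big\{ \frac{1}{2} \sum_{t=1}^n \epsilon_t f(x_t) - B(f; x_1, \ldots, x_n) \Big\} \Big]. \tag{27} \]
Therefore, whenever it holds that for any predictable process $\mathbf{x} = (x_1, \ldots, x_n)$
\[ \mathbb{E} \Big[ \sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t f(x_t) - 2 B(f; x_1, \ldots, x_n) \Big\} \Big] \le 0, \tag{28} \]
there exists a strategy $(\hat{y}_t)$ with values
\[ |\hat{y}_t| \le \sup_{f \in \mathcal{F}} |f(x_t)| \tag{29} \]
such that the pathwise inequality (24) holds.

Condition (28) in the previous lemma implies the existence of a strategy for (24). However, there might be situations when (28) can be verified for a function $B(f; \mathbf{x})$ of the predictable process that does not have a corresponding representation in the sense of (24). The next lemma provides a variant of Lemma 9.

Lemma 10. Let $\mathbf{x}$ be an $\mathcal{X}$-valued predictable process with respect to the dyadic filtration. Let the function $B$ map the predictable process $\mathbf{x}$ and a function $f \in \mathcal{F}$ to a real value, with the property
\[ \sup_{\mathbf{y}} B(f; \mathbf{x} \circ \mathbf{y}) \le B(f; \mathbf{x}) \tag{30} \]
where $\mathbf{y} = (y_1, \ldots, y_n)$ is a $\{\pm 1\}$-valued predictable process, and $(\mathbf{x} \circ \mathbf{y})_t = x_t(y_{2:t}(\epsilon))$. If
\[ \mathbb{E} \Big[ \sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t f(x_t) - 2 B(f; \mathbf{x}) \Big\} \Big] \le 0, \tag{31} \]
then there is a strategy $(\hat{y}_t)$ with $\hat{y}_t = \hat{y}_t(y_1, \ldots, y_{t-1})$ and $|\hat{y}_t| \le \sup_{f \in \mathcal{F}} |f(x_t)|$ such that
\[ \forall y_1, \ldots, y_n \in \{\pm 1\}, \quad \sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n y_t f(x_t(y_1, \ldots, y_{t-1})) - 2 B(f; \mathbf{x}) \Big\} \le \sum_{t=1}^n y_t \hat{y}_t. \tag{32} \]

Footnote 3. If the infima are not achieved, a limiting argument can be employed.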

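6 Amplification and equivalence

We now describe an interesting amplification phenomenon, already presented in the Introduction for the simple Euclidean case. Whenever (28) holds, the deterministic inequality (24) holds, and, therefore, we may apply it to a particular martingale difference sequence to obtain high-probability bounds. Below, we detail this amplification for both linear and square loss functions.

6.1 Linear loss

Take any $\mathcal{X}$-valued predictable process $\mathbf{x} = (x_1, \ldots, x_n)$ with respect to the dyadic filtration. The deterministic inequality (24) applied to $x_t = x_t(\epsilon_1, \ldots, \epsilon_{t-1})$ and $y_t = \epsilon_t$ becomes
\[ \sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t f(x_t) - 2 B(f; x_1, \ldots, x_n) \Big\} \le \sum_{t=1}^n \epsilon_t \hat{y}_t \tag{33} \]
for any $\epsilon$, and thus we have the comparison of tails
\[ P\Big( \sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t f(x_t) - 2 B(f; x_1, \ldots, x_n) \Big\} > u \Big) \le P\Big( \sum_{t=1}^n \epsilon_t \hat{y}_t > u \Big). \tag{34} \]
Given the boundedness of the increments $\epsilon_t \hat{y}_t$, the tail bounds follow immediately from the Azuma-Hoeffding inequality or from Freedman's inequality [10]. More precisely, we use the fact that the martingale differences are bounded by $|\hat{y}_t| \le \sup_{f \in \mathcal{F}} |f(x_t)|$, and conclude

Lemma 11. If there exists a prediction strategy $(\hat{y}_t)$ that satisfies (24) and (29), then for any predictable process $\mathbf{x}$ the Azuma-Hoeffding inequality implies that
\[ P\Big( \sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t f(x_t) - 2 B(f; x_1, \ldots, x_n) \Big\} > u \Big) \le \exp\left( -\frac{u^2}{4 \max_{\epsilon} \sum_{t=1}^n \sup_{f \in \mathcal{F}} f(x_t(\epsilon))^2} \right), \tag{35} \]
Freedman's inequality implies
\[ P\Big( \sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t f(x_t) - 2 B(f; x_1, \ldots, x_n) \Big\} > u, \ \sum_{t=1}^n \sup_{f \in \mathcal{F}} f(x_t)^2 \le \sigma^2 \Big) \le \exp\left( -\frac{u^2}{2\sigma^2 + 2uM/3} \right), \tag{36} \]
where $M = \sup_{f \in \mathcal{F}, \, \epsilon \in \{\pm 1\}^n, \, t} |f(x_t(\epsilon))|$, and we also have that for any $\alpha > 0$,
\[ P\Big( \sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t f(x_t) - 2 B(f; x_1, \ldots, x_n) \Big\} - \alpha \sum_{t=1}^n \sup_{f \in \mathcal{F}} f(x_t)^2 > u \Big) \le \exp(-2\alpha u). \tag{37} \]
In view of Lemma 9, a sufficient condition for these inequalities is that (28) holds for all $\mathbf{x}$. The same inequalities hold with $B(f; \mathbf{x})$ if the conditions of Lemma 10 are verified for the given $\mathbf{x}$.

Let us emphasize the conclusion of the above lemma: the non-positivity of the expected supremum of a collection of martingales, offset by a function $2B$, implies existence of a regret-minimization strategy, which implies a high-probability tail bound.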
To close the loop, we integrate out the tails, obtaining an in-expectation bound of the form (28), but possibly with a larger $B$ function. This is a more general form of the equivalence promised in the introduction.
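The integration step can be made explicit. As a sketch, assume a sub-Gaussian tail of the Azuma-Hoeffding form in Lemma 11, $P(X > u) \le \exp\{-u^2/(4S)\}$ for all $u > 0$, where $X$ denotes the offset supremum $\sup_{f}\{\sum_t \epsilon_t f(x_t) - 2B(f; x_1, \ldots, x_n)\}$ and $S > 0$ is the constant in the denominator; the shorthand $X$ and $S$ is introduced here for illustration only.

```latex
\begin{align*}
\mathbb{E}[X]
  \le \mathbb{E}[\max\{X, 0\}]
  = \int_0^\infty P(X > u)\, du
  \le \int_0^\infty \exp\Big\{ -\frac{u^2}{4S} \Big\}\, du
  = \sqrt{\pi S}.
\end{align*}
```

Under this assumption, a condition of the form (28) holds with $2B$ replaced by $2B + \sqrt{\pi S}$, which is the sense in which the recovered $B$ function may be larger.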

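The next goal is to find nontrivial functions $B$ such that (28) holds. The most basic $B$ is a constant that depends on the complexity of $\mathcal{F}$, but not on $f$ or the data. Define the worst-case sequential Rademacher averages as
\[ \mathcal{R}_n(\mathcal{F}) \triangleq \sup_{\mathbf{x}} \mathbb{E} \sup_{f \in \mathcal{F}} \Big| \sum_{t=1}^n \epsilon_t f(x_t) \Big|. \tag{38} \]
Clearly, $B = \mathcal{R}_n(\mathcal{F})/2$ satisfies (28). The following is immediate.

Corollary 12. For any $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and an $\mathcal{X}$-valued predictable process $\mathbf{x}$ with respect to the dyadic filtration,
\[ P\Big( \sup_{f \in \mathcal{F}} \sum_{t=1}^n \epsilon_t f(x_t) > \mathcal{R}_n(\mathcal{F}) + u \Big) \le \exp\left( -\frac{u^2}{4 \max_{\epsilon} \sum_{t=1}^n \sup_{f \in \mathcal{F}} f(x_t(\epsilon))^2} \right). \tag{39} \]

Superficially, (39) looks like a one-sided version of the concentration bound for classical (i.i.d.) Rademacher averages [5]. However, sequential Rademacher averages are not Lipschitz with respect to a flip of a sign, as the whole remaining path may change after a flip.

6.2 Square loss

As for the case of the linear loss function, take any $\mathcal{X}$-valued predictable process $\mathbf{x} = (x_1, \ldots, x_n)$ with respect to the dyadic filtration. Fix $\alpha > 0$. The deterministic inequality (25) for $x_t = x_t(\epsilon_1, \ldots, \epsilon_{t-1})$ and $y_t = \frac{1}{\alpha} \epsilon_t$ becomes
\[ \sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n \Big( \frac{2}{\alpha} \epsilon_t f(x_t) - f^2(x_t) \Big) - B(f; x_1, \ldots, x_n) \Big\} \le \sum_{t=1}^n \Big( \frac{2}{\alpha} \epsilon_t \hat{y}_t - \hat{y}_t^2 \Big). \tag{40} \]
As in the proof of (37), we obtain a tail comparison
\[ P\Big( \sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n \Big( \frac{2}{\alpha} \epsilon_t f(x_t) - f^2(x_t) \Big) - B(f; x_1, \ldots, x_n) \Big\} > \frac{u}{\alpha} \Big) \le P\Big( \sum_{t=1}^n \Big( \frac{2}{\alpha} \epsilon_t \hat{y}_t - \hat{y}_t^2 \Big) > \frac{u}{\alpha} \Big) \le \exp\Big\{ -\frac{\alpha u}{2} \Big\}. \]
Once again, the most basic choice for $B$ is the constant that depends on the complexity of the class. We recall the following result from [22].

Lemma 13 ([22]). Let $\kappa > 0$. For any class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$, there exists a prediction strategy $(\hat{y}_t)$ with values in $[-\kappa, \kappa]$ such that
\[ \forall (x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times [-\kappa, \kappa], \quad \sum_{t=1}^n (\hat{y}_t - y_t)^2 - \inf_{f \in \mathcal{F}} \sum_{t=1}^n (f(x_t) - y_t)^2 \le \mathcal{R}^{\mathrm{off}}(\mathcal{F}, \kappa, 1), \]
where, analogously to (38), we define the offset Rademacher complexity
\[ \mathcal{R}^{\mathrm{off}}(\mathcal{F}, c_1, c_2) \triangleq \sup_{\mathbf{x}, \boldsymbol{\mu}} \mathbb{E} \sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n 4 c_1 \epsilon_t (f(x_t) - \mu_t) - c_2 (f(x_t) - \mu_t)^2 \Big\}. \tag{41} \]
Here, the supremum is taken over $\mathcal{X}$-valued predictable processes $\mathbf{x} = (x_1, \ldots, x_n)$ and $[-\kappa, \kappa]$-valued predictable processes $\boldsymbol{\mu}$, both with respect to the dyadic filtration.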
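The exponential tail behind (37), reused in the square-loss comparison above, can be checked exactly on a small example: for any $[-1, 1]$-valued predictable $(\hat{y}_t)$, the probability that $\sum_t \epsilon_t \hat{y}_t - \alpha \sum_t \hat{y}_t^2$ exceeds $u$ is at most $\exp(-2\alpha u)$. The following minimal Python sketch enumerates all sign paths; the function names and the particular strategy are illustrative assumptions, not constructions from the paper.

```python
import math
from itertools import product


def offset_tail_probability(strategy, n, alpha, u):
    """Exact P( sum_t eps_t*yhat_t - alpha * sum_t yhat_t^2 > u ) over uniform signs in {-1,+1}^n."""
    hits = 0
    for eps in product((-1, 1), repeat=n):
        # Predictability: yhat_t is computed from eps_1..eps_{t-1} only.
        y_hats = [strategy(eps[:t]) for t in range(n)]
        value = sum(e * y for e, y in zip(eps, y_hats)) - alpha * sum(y * y for y in y_hats)
        hits += value > u
    return hits / 2 ** n


def example_strategy(prefix):
    """Hypothetical [-1, 1]-valued predictable strategy: average of the past signs (0 at the start)."""
    return sum(prefix) / len(prefix) if prefix else 0.0


for alpha in (0.1, 0.5, 1.0):
    for u in (0.5, 1.0, 2.0):
        p = offset_tail_probability(example_strategy, 8, alpha, u)
        print(alpha, u, p <= math.exp(-2 * alpha * u))
```

The check rests on the elementary bound $\cosh(\lambda y) \le \exp(\lambda^2 y^2 / 2)$, which makes $\exp\{\lambda \sum_t \epsilon_t \hat{y}_t - \frac{\lambda^2}{2} \sum_t \hat{y}_t^2\}$ a supermartingale; taking $\lambda = 2\alpha$ and applying Markov's inequality gives the stated tail.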

We conclude that (40) is satisfied with the data-independent constant $B = \mathcal{R}^{\mathrm{off}}(\mathcal{F}, 1/\alpha, 1)$. Hence, the following analogue of Corollary 12 holds:

Corollary 14. Let $\mathcal{F} \subseteq [-1, 1]^{\mathcal{X}}$. For any $\mathcal{X}$-valued predictable process $\mathbf{x}$ with respect to the dyadic filtration and for any $\alpha > 0$, it holds that
\[ P\Big( \sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n \big( 2 \epsilon_t f(x_t) - \alpha f^2(x_t) \big) \Big\} - \mathcal{R}^{\mathrm{off}}(\mathcal{F}, 1, \alpha) > u \Big) \le \exp\Big\{ -\frac{\alpha u}{2} \Big\}. \]

To summarize, in Section 5 we presented the machinery of regret inequalities, as well as sufficient conditions for existence of strategies. In the present section we used the pathwise statements, along with real-valued deviation inequalities, to conclude tail bounds, which, in turn, certify existence of regret-minimization strategies. In the next two sections we put these techniques to use.

7 Uniform variation and tail bounds for general martingale type

We now make extensive use of the amplification technique to prove in-probability versions of the martingale type definition. We start by working with dyadic martingales of the form $f \mapsto \sum_{t=1}^n \epsilon_t f(x_t)$ where $\mathbf{x} = (x_1, \ldots, x_n)$ is a predictable process (with respect to the dyadic filtration) with values in $\mathcal{X}$. Once the results for these objects are established, we conclude the corresponding statements for general processes of the form (15) via the sequential symmetrization technique summarized in Corollary 8. As in Section 3, we assume a growth condition $n^{1/r}$ on sequential Rademacher complexity.

Lemma 15. Let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and $r \in (1, 2]$. Under the growth assumption (18), for any $p < r$ there exists $K_{r,p} < \infty$ such that
\[ \mathbb{E} \sup_{f \in \mathcal{F}} \Big| \sum_{t=1}^n \epsilon_t f(x_t) \Big| \le K_{r,p} \, \mathbb{E} \Big( \sum_{t=1}^n \sup_{f \in \mathcal{F}} |f(x_t)|^p \Big)^{1/p}. \tag{42} \]
Further, if $\mathcal{F} \subseteq [-1, 1]^{\mathcal{X}}$ and (18) holds with constant $D/2$, then
\[ \mathbb{E} \sup_{f \in \mathcal{F}} \Big| \sum_{t=1}^n \epsilon_t f(x_t) \Big| \le 32 D \log n \ \mathbb{E} \Big( \sum_{t=1}^n \sup_{f \in \mathcal{F}} |f(x_t)|^r \Big)^{1/r} + \phi \tag{43} \]
where $\phi$ is a negligible term whose expression involves $64 D \log n$ and $D^2 \log n$.

The second part of the proof of Lemma 15 uses the amplification idea of the previous section. Using Lemma 9, we can now conclude existence of prediction strategies whose regret is controlled by sequence-dependent variance.
This greatly exteds the scope of available variace-type bouds i the olie learig literature where results i this directio have bee obtaied for either fiite or liear classes. Corollary 16. Let F [ 1, 1] X ad r (1, 2]. If (18) holds with costat D/2, the there exists a predictio strategy (ŷ t ) such that ( ŷ t y t ) if ( f(x t ) y t ) 32 D log 2 () ( for ay sequece of (x t, y t ) (equivaletly, (24) holds). 14 1/r f(x t ) r ) + φ
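For small $n$ and a finite class, the two sides of (42) can be evaluated exactly by enumerating sign sequences. The following sketch is illustrative only: the class, the tree `example_tree`, and all function names are hypothetical choices, and no claim is made about the constant $K_{r,p}$.

```python
import itertools


def tree_averages(functions, tree, n, p):
    """Return (E sup_f sum_t eps_t f(x_t), E (sum_t sup_f |f(x_t)|**p)**(1/p)).

    `tree(past)` gives x_t as a function of (eps_1, ..., eps_{t-1}), so the
    process is predictable with respect to the dyadic filtration.
    """
    lhs = 0.0
    rhs = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        path = [tree(eps[:t]) for t in range(n)]
        lhs += max(sum(e * f(x) for e, x in zip(eps, path)) for f in functions)
        rhs += sum(max(abs(f(x)) for f in functions) ** p for x in path) ** (1.0 / p)
    return lhs / 2 ** n, rhs / 2 ** n


def example_tree(past):
    """Hypothetical [-1, 1]-valued tree: a smoothed running mean of past signs."""
    return (1 + sum(past)) / (1 + len(past))


if __name__ == "__main__":
    # Hypothetical three-function class on [-1, 1].
    cls = [lambda x: x, lambda x: -x, lambda x: x * x]
    for p in (1.2, 1.5, 2.0):
        print(p, tree_averages(cls, example_tree, n=8, p=p))
```

A singleton class gives a left-hand side of zero, since $\sum_t \epsilon_t f(x_t)$ is then a martingale; the interesting behaviour comes from the supremum over several functions.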

In addition to being a novel result in the online learning domain, the above corollary serves as an amplification step to boost the in-expectation bound of Lemma 15 to a high probability statement. We then invoke Corollary 8 and Lemma 21 to prove the following theorem.

Theorem 17. Let $Z_1, \ldots, Z_n$ be a stochastic process with values in $\mathcal{Z}$ and let $Z'_1, \ldots, Z'_n$ be a tangent sequence. Let $\mathcal{G} \subseteq [-1, 1]^{\mathcal{Z}}$, $r \in (1, 2]$, and define the $r$-variation as

\[ \mathrm{Var}_r = \mathbb{E}_{Z'_{1:n}} \sum_{t=1}^n \sup_{g\in\mathcal{G}} |g(Z_t) - g(Z'_t)|^r. \tag{44} \]

If (18) holds for $\mathcal{G}$ with constant $D/2$, then with probability at least $1 - e \log(n) \exp(-2u^2)$,

\[ \sup_{g\in\mathcal{G}} \sum_{t=1}^n \big( g(Z_t) - \mathbb{E}_{t-1} g(Z_t) \big) \le 256 D \log^2(n)\, \mathrm{Var}_r^{1/r} + \cdots, \]

where the remaining terms involve $u$, a factor $8$, the variance, and $\phi$ [their exact form is illegible in the source].

We remark that the tail bound can be viewed as a ratio inequality (see [16, 11]) of the form (9), where the deviations are scaled by the square root of the variance.

8 Finer control via per-function variation

From the point of view of the previous section (and Theorem 5), all classes with sequential Rademacher complexity growth $n^{1/2}$ are treated equally. However, classes with such a growth can be as simple as a set consisting of two functions, or as complex as a set of linear functions indexed by a ball in the infinite-dimensional Hilbert space. In this section, a different complexity measure will be used for the regime when the $n^{1/2}$ growth hides the difference in complexity. This measure will be given by sequential covering numbers (and, as a consequence, by the offset Rademacher complexity). In the regime $\gamma^{-q}$, $q \in [0, 2]$, for the growth of sequential entropy, we exhibit a finer analysis of the variation term that allows part of the variance to be adapted to the function.

Let $q \in (0, 2]$. We say that a class $\mathcal{F} \subseteq [-1, 1]^{\mathcal{X}}$ has the $\gamma^{-q}$ growth (as $\gamma$ decreases) of sequential entropy if there is a constant $C$ such that for all $\gamma \in (0, 1]$,

\[ \log \mathcal{N}_2(\mathcal{F}, \gamma, n) \le C \gamma^{-q}. \]

As for sequential Rademacher complexity, it is easy to check that the class $\mathcal{G}$ and the derived class of functions $(z, z') \mapsto f(z, z') = g(z) - g(z')$ have the same growth of sequential entropy. Moreover, this growth controls the rate of growth of the offset Rademacher complexity, as shown in [22]. In particular, for the finite function class,

\[ \mathcal{R}^{\mathrm{off}}(\mathcal{F}, 1, \alpha) \le \frac{8 \log |\mathcal{F}|}{\alpha}, \]

while for a parametric class of dimension $d$ (such that $\mathcal{N}_\infty(\mathcal{F}, \gamma, n) \le (C'/\gamma)^d$ for some $C' > 0$),

\[ \mathcal{R}^{\mathrm{off}}(\mathcal{F}, 1, \alpha) \le \frac{C d \log(n)}{\alpha}, \]

and for a class with sequential entropy growth $q \in (0, 2)$,

\[ \mathcal{R}^{\mathrm{off}}(\mathcal{F}, 1, \alpha) \le C \alpha^{-\frac{2-q}{2+q}}\, n^{\frac{q}{2+q}} \]

for some absolute constant $C$ (the bound gains an extra logarithmic factor at $q = 2$). In this last nonparametric regime, Corollary 14 implies that for any $u > 0$,

\[ P\Big( \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t f(x_t) - \frac{\alpha}{2} f(x_t)^2 \Big\} - C \alpha^{-\frac{2-q}{2+q}}\, n^{\frac{q}{2+q}} > u \Big) \le \exp\{ -\alpha u \}, \]

and the analogous statements hold for the finite and parametric cases. As the next theorem shows, the offset Rademacher complexity $\mathcal{R}^{\mathrm{off}}$ brings out (for smaller classes) the finer complexity control obscured by the sequential Rademacher complexity, which only provides $\Omega(n^{1/2})$ bounds.

Theorem 18. Let $Z_1, \ldots, Z_n$ be a discrete-time process with values in $\mathcal{Z}$ and let $Z'_1, \ldots, Z'_n$ be a tangent sequence. Let $\mathcal{G} \subseteq [-1, 1]^{\mathcal{Z}}$ and define the function-dependent variance as

\[ \mathrm{Var}_2(g) = \mathbb{E}_{Z'_{1:n}} \sum_{t=1}^n \big( g(Z_t) - g(Z'_t) \big)^2. \tag{45} \]

If $\mathcal{G}$ exhibits a $\gamma^{-q}$ growth of sequential entropy, then there exists a constant $C$ such that for any $u > 0$, with probability at least $1 - e \log(n) \exp\{-u^2\}$, for all $g \in \mathcal{G}$,

\[ \sum_{t=1}^n \big( g(Z_t) - \mathbb{E}_{t-1} g(Z_t) \big) \le C n^{\frac{q}{4}} \big( \mathrm{Var}_2(g) + 2 \big)^{\frac{2-q}{4}} + 2\sqrt{2}\, u \sqrt{\mathrm{Var}_2(g) + 2}. \tag{46} \]

If $\mathcal{G}$ is finite, with the same probability it holds that for all $g \in \mathcal{G}$

\[ \sum_{t=1}^n \big( g(Z_t) - \mathbb{E}_{t-1} g(Z_t) \big) \le C \sqrt{\log |\mathcal{G}| \, \big( \mathrm{Var}_2(g) + 2 \big)} + 2\sqrt{2}\, u \sqrt{\mathrm{Var}_2(g) + 2}, \tag{47} \]

while for the parametric case,

\[ \sum_{t=1}^n \big( g(Z_t) - \mathbb{E}_{t-1} g(Z_t) \big) \le C \sqrt{d \log(n) \, \big( \mathrm{Var}_2(g) + 2 \big)} + 2\sqrt{2}\, u \sqrt{\mathrm{Var}_2(g) + 2}. \tag{48} \]

The finite and parametric cases can be thought of as a $q = 0$ regime. Here, we have a bound that depends on $n$ at most logarithmically. On the other hand, for $q \ge 2$ the term $n^{q/4} (\mathrm{Var}_2 + 1)^{(2-q)/4}$ is replaced with $n^{1 - 1/q}$, without any per-function adaptivity (as studied in the previous section). Between these two regimes, we obtain an interpolation, whereby the $1/2$ power is split into a nonadaptive part $n^{q/4}$ and the adaptive part $(\mathrm{Var}_2 + 1)^{(2-q)/4}$. This constitutes a finer analysis of classes with martingale type 2.

We may compare the bound of Theorem 18 in the finite case to the in-expectation bound of [14] in terms of weak variance for i.i.d. zero mean random variables $Z_1, \ldots, Z_n \in \mathbb{R}^d$:

\[ \mathbb{E} \Big[ \max_{j \le d} \Big| \sum_{t=1}^n Z_{t,j} \Big| \Big] \le 2 \sqrt{ \ln(2d)\, \mathbb{E} \max_{j \le d} \sum_{t=1}^n Z_{t,j}^2 }. \]

In contrast to this bound, Theorem 18 matches the coordinate $j$ on the left-hand side to the variance of the $j$th coordinate on the right-hand side. Further, our bound holds for martingale difference sequences rather than i.i.d. random vectors. Finally, Theorem 18 holds well beyond the finite case.
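The interpolation between the nonadaptive and adaptive parts discussed after Theorem 18 is easy to tabulate. The sketch below is an illustration under stated assumptions: it ignores the constant $C$ and the deviation term in $u$, uses the $(\mathrm{Var}_2 + 1)$ form from the discussion, and the function name `split_rate` is hypothetical.

```python
def split_rate(n, variance, q):
    """Return the pair (n**(q/4), (variance + 1)**((2 - q)/4)) for q in [0, 2].

    The first factor is the nonadaptive part and the second the adaptive
    (per-function) part of the leading term; constants are omitted.
    """
    if not 0 <= q <= 2:
        raise ValueError("q must lie in [0, 2]")
    nonadaptive = n ** (q / 4.0)
    adaptive = (variance + 1.0) ** ((2.0 - q) / 4.0)
    return nonadaptive, adaptive


if __name__ == "__main__":
    n = 10_000
    for q in (0.0, 0.5, 1.0, 1.5, 2.0):
        for variance in (0.0, 100.0, n - 1.0):
            a, b = split_rate(n, variance, q)
            print(f"q={q}, Var={variance}: {a:.2f} * {b:.2f} = {a * b:.2f}")
```

When the variance is of order $n$ the product is of order $n^{1/2}$ for every $q$, while for a function with small variance the product drops to the nonadaptive part $n^{q/4}$ alone.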

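9 Some Open Questions

The following are a few open-ended questions raised by this work:

1. In the definition of martingale type, can we replace $\mathbb{E}\big( \sum_{t=1}^n \mathbb{E}_{t-1} \sup_{g\in\mathcal{G}} |g(Z_t) - g(Z'_t)|^p \big)^{1/p}$ with $\mathbb{E}\big( \sum_{t=1}^n \sup_{g\in\mathcal{G}} |g(Z_t) - \mathbb{E}_{t-1}[g]|^p \big)^{1/p}$ and reach the same conclusions? The latter version of variation is closer to the generalization of the martingale type for Banach spaces.

2. If for some $r \in (1, 2]$, sequential Rademacher complexity exhibits $n^{1/r}$ growth rate, then does $\mathcal{G}$ have martingale type $r$? Currently, we only prove martingale type $p$ for any $p < r$. For the case of Banach spaces (linear $g$), the above question is answered in the positive in the work of Pisier [18]. However, the result of [18] relies on the notions of uniform convexity or uniform smoothness, which are specific to linear functionals and Banach spaces.

3. Is it possible to get a mix of uniform and per-function variance for general function classes with martingale type 2? In Section 8, for martingale type 2 we prove a finer control through per-function variance. A natural question is whether one can replace the $n$-dependent part by uniform variance terms, thus giving a mix of per-function and uniform variance in the same bound.

A Proofs

Lemma 19. The update in (2) satisfies, for any $f \in B$ and all $z_1, \ldots, z_n \in B$,

\[ \sum_{t=1}^n \langle \hat{y}_t - f, z_t \rangle \le \sqrt{n}. \]

Proof of Lemma 19. The following two-line proof is standard. By the property of a projection,

\[ \|\hat{y}_{t+1} - f\|^2 = \|\mathrm{Proj}_B(\hat{y}_t - n^{-1/2} z_t) - f\|^2 \le \|(\hat{y}_t - n^{-1/2} z_t) - f\|^2 = \|\hat{y}_t - f\|^2 + \frac{1}{n}\|z_t\|^2 - 2 n^{-1/2} \langle \hat{y}_t - f, z_t \rangle. \]

Rearranging,

\[ 2 n^{-1/2} \langle \hat{y}_t - f, z_t \rangle \le \|\hat{y}_t - f\|^2 - \|\hat{y}_{t+1} - f\|^2 + \frac{1}{n}\|z_t\|^2. \]

Summing over $t = 1, \ldots, n$ yields the desired statement.

Lemma 20. With the notation of Lemma 1,

\[ \mathbb{E} \max_{s=1,\ldots,n} \Big\| \sum_{t=1}^s Z_t \Big\| \le (2.5 R_{\max} + \sqrt{3})\, \mathbb{E} \sqrt{V_n} + 2.5 R_{\max}. \]

Proof of Lemma 20. Because of the anytime property of the regret bound and the strategy definition, we can write (12) as

\[ \max_{s=1,\ldots,n} \Big\{ \Big\| \sum_{t=1}^s Z_t \Big\| - \sum_{t=1}^s \langle \hat{y}_t, Z_t \rangle \Big\} \le 2.5 R_{\max} \big( \sqrt{V_n} + 1 \big) \tag{49} \]

simply because the right-hand side is largest for $s = n$. Sub-additivity of max implies

\[ \max_{s=1,\ldots,n} \Big\| \sum_{t=1}^s Z_t \Big\| \le 2.5 R_{\max} \big( \sqrt{V_n} + 1 \big) + \max_{s=1,\ldots,n} \Big| \sum_{t=1}^s \langle \hat{y}_t, Z_t \rangle \Big|. \tag{50} \]

By the Burkholder-Davis-Gundy inequality (with the constant from [6]),

\[ \mathbb{E} \max_{s=1,\ldots,n} \Big| \sum_{t=1}^s \langle \hat{y}_t, Z_t \rangle \Big| \le \sqrt{3}\, \mathbb{E} \Big( \sum_{t=1}^n \langle \hat{y}_t, Z_t \rangle^2 \Big)^{1/2} \le \sqrt{3}\, \mathbb{E} \sqrt{V_n}. \tag{51} \]

In view of (50), we conclude the statement.

Proof of Lemma 2. Because of the update form, for all $f \in \mathcal{F}$,

\[ \langle \hat{y}_{t+1} - f, z_t \rangle \le \frac{1}{\eta_t} \big( D_R(f, \hat{y}_t) - D_R(f, \hat{y}_{t+1}) - D_R(\hat{y}_{t+1}, \hat{y}_t) \big). \]

Summing over $t = 1, \ldots, n$,

\[ \sum_{t=1}^n \langle \hat{y}_{t+1} - f, z_t \rangle \le \eta_1^{-1} D_R(f, \hat{y}_1) + \sum_{t=2}^n (\eta_t^{-1} - \eta_{t-1}^{-1}) D_R(f, \hat{y}_t) - \sum_{t=1}^n \eta_t^{-1} D_R(\hat{y}_{t+1}, \hat{y}_t) \]
\[ \le \eta_1^{-1} R_{\max}^2 + \sum_{t=2}^n (\eta_t^{-1} - \eta_{t-1}^{-1}) R_{\max}^2 - \sum_{t=1}^n \frac{\eta_t^{-1}}{2} \|\hat{y}_{t+1} - \hat{y}_t\|^2 \]
\[ \le R_{\max}^2 (\eta_1^{-1} + \eta_n^{-1}) - \sum_{t=1}^n \frac{\eta_t^{-1}}{2} \|\hat{y}_{t+1} - \hat{y}_t\|^2, \]

where we used strong convexity of $R$ and the fact that $\eta_t$ is nonincreasing. Next, we write

\[ \sum_{t=1}^n \langle \hat{y}_t - f, z_t \rangle = \sum_{t=1}^n \langle \hat{y}_{t+1} - f, z_t \rangle + \sum_{t=1}^n \langle \hat{y}_t - \hat{y}_{t+1}, z_t \rangle \]

and upper bound the second term by noting that

\[ \langle \hat{y}_t - \hat{y}_{t+1}, z_t \rangle \le \|\hat{y}_t - \hat{y}_{t+1}\| \, \|z_t\| \le \frac{\eta_t^{-1}}{2} \|\hat{y}_t - \hat{y}_{t+1}\|^2 + \frac{\eta_t}{2} \|z_t\|^2. \]

Combining the bounds,

\[ \sum_{t=1}^n \langle \hat{y}_t - f, z_t \rangle \le R_{\max}^2 (\eta_1^{-1} + \eta_n^{-1}) + \sum_{t=1}^n \frac{\eta_t}{2} \|z_t\|^2. \tag{52} \]

Now observe that

\[ \eta_t = R_{\max} \min\Big\{ 1, \frac{ \sqrt{\sum_{s=1}^{t} \|z_s\|^2} - \sqrt{\sum_{s=1}^{t-1} \|z_s\|^2} }{ \|z_t\|^2 } \Big\} \tag{53} \]

and thus the second term in (52) is upper bounded as

\[ \sum_{t=1}^n \frac{\eta_t}{2} \|z_t\|^2 \le \frac{R_{\max}}{2} \sqrt{\sum_{s=1}^n \|z_s\|^2}. \]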

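For the first term, we use $\eta_1^{-1} = R_{\max}^{-1}$ and

\[ \eta_n^{-1} \le R_{\max}^{-1} \max\Big\{ 1, \, 2 \sqrt{\sum_{s=1}^n \|z_s\|^2} \Big\}. \]

Concluding,

\[ \sum_{t=1}^n \langle \hat{y}_t - f, z_t \rangle \le 2.5 R_{\max} \Big( \sqrt{\sum_{s=1}^n \|z_s\|^2} + 1 \Big). \tag{54} \]

Proof of Lemma 4. We have

\[ \mathbb{E}_{t-1} \exp\{ \lambda A - \lambda^2 B^2 / 2 \} = \mathbb{E}_{t-1} \exp\big\{ \lambda \langle \hat{y}_t, Z_t - \mathbb{E}_{t-1} Z'_t \rangle - 2\lambda^2 ( \|Z_t\|^2 + \mathbb{E}_{t-1} \|Z'_t\|^2 ) \big\} \]
\[ \le \mathbb{E}_{t-1} \exp\big\{ \lambda \langle \hat{y}_t, Z_t - Z'_t \rangle - 2\lambda^2 ( \|Z_t\|^2 + \|Z'_t\|^2 ) \big\} = \mathbb{E}_{t-1} \mathbb{E}_{\epsilon} \exp\big\{ \lambda \epsilon_t \langle \hat{y}_t, Z_t - Z'_t \rangle - 2\lambda^2 ( \|Z_t\|^2 + \|Z'_t\|^2 ) \big\}. \]

Since $\exp$ is a convex function, the last expression equals

\[ \mathbb{E}_{t-1} \mathbb{E}_{\epsilon} \exp\Big\{ \tfrac{1}{2} \big( 2\lambda \epsilon_t \langle \hat{y}_t, Z_t \rangle - 4\lambda^2 \|Z_t\|^2 \big) + \tfrac{1}{2} \big( -2\lambda \epsilon_t \langle \hat{y}_t, Z'_t \rangle - 4\lambda^2 \|Z'_t\|^2 \big) \Big\} \]
\[ \le \tfrac{1}{2} \mathbb{E}_{t-1} \mathbb{E}_{\epsilon} \exp\big\{ 2\lambda \epsilon_t \langle \hat{y}_t, Z_t \rangle - 4\lambda^2 \|Z_t\|^2 \big\} + \tfrac{1}{2} \mathbb{E}_{t-1} \mathbb{E}_{\epsilon} \exp\big\{ -2\lambda \epsilon_t \langle \hat{y}_t, Z'_t \rangle - 4\lambda^2 \|Z'_t\|^2 \big\} \]
\[ = \mathbb{E}_{t-1} \mathbb{E}_{\epsilon} \exp\big\{ 2\lambda \epsilon_t \langle \hat{y}_t, Z_t \rangle - 4\lambda^2 \|Z_t\|^2 \big\} \le \mathbb{E}_{t-1} \exp\big\{ 4\lambda^2 \langle \hat{y}_t, Z_t \rangle^2 - 4\lambda^2 \|Z_t\|^2 \big\} \le 1 \]

since $\|\hat{y}_t\| \le 1$.

Proof of Theorem 5. Let $Z_1, \ldots, Z_n$ be a discrete time process. We have

\[ \mathbb{E} \Big[ \sup_{g\in\mathcal{G}} \Big\{ \sum_{t=1}^n \big( g(Z_t) - \mathbb{E}_{t-1}[g(Z_t)] \big) \Big\} - C \Big( \sum_{t=1}^n \mathbb{E}_{t-1} \sup_{g\in\mathcal{G}} |g(Z_t) - g(Z'_t)|^p \Big)^{1/p} \Big] \]
\[ \le \Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{Z_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{g\in\mathcal{G}} \Big\{ \sum_{t=1}^n \big( g(Z_t) - \mathbb{E}_{Z'_t \sim p_t}[g(Z'_t)] \big) \Big\} - C \Big( \sum_{t=1}^n \mathbb{E}_{Z'_t \sim p_t} \sup_{g\in\mathcal{G}} |g(Z_t) - g(Z'_t)|^p \Big)^{1/p} \Big], \]

where $\langle\!\langle \sup_{p_t} \mathbb{E}_{Z_t \sim p_t} \rangle\!\rangle_{t=1}^n$ stands for repeated application of the operators: $\sup_{p_1} \mathbb{E}_{Z_1} \cdots \sup_{p_n} \mathbb{E}_{Z_n}$. By Jensen's inequality, we upper bound the above expression by

\[ \Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{Z_t, Z'_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{g\in\mathcal{G}} \Big\{ \sum_{t=1}^n \big( g(Z_t) - g(Z'_t) \big) \Big\} - C \Big( \sum_{t=1}^n \sup_{g\in\mathcal{G}} |g(Z_t) - g(Z'_t)|^p \Big)^{1/p} \Big]. \]

Introducing independent Rademacher random variables $\epsilon_1, \ldots, \epsilon_n$, the preceding expression is equal to

\[ \Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{Z_t, Z'_t \sim p_t} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{g\in\mathcal{G}} \Big\{ \sum_{t=1}^n \epsilon_t \big( g(Z_t) - g(Z'_t) \big) \Big\} - C \Big( \sum_{t=1}^n \sup_{g\in\mathcal{G}} |g(Z_t) - g(Z'_t)|^p \Big)^{1/p} \Big] \]
\[ \le \Big\langle\!\!\Big\langle \sup_{z_t, z'_t} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{g\in\mathcal{G}} \Big\{ \sum_{t=1}^n \epsilon_t \big( g(z_t) - g(z'_t) \big) \Big\} - C \Big( \sum_{t=1}^n \sup_{g\in\mathcal{G}} |g(z_t) - g(z'_t)|^p \Big)^{1/p} \Big]. \]

The latter expression may be written as

\[ \sup_{z, z'} \mathbb{E} \Big[ \sup_{g\in\mathcal{G}} \Big\{ \sum_{t=1}^n \epsilon_t \big( g(z_t) - g(z'_t) \big) \Big\} - C \Big( \sum_{t=1}^n \sup_{g\in\mathcal{G}} |g(z_t) - g(z'_t)|^p \Big)^{1/p} \Big] \tag{55} \]

with a supremum ranging over predictable processes $z = (z_1, \ldots, z_n)$ and $z' = (z'_1, \ldots, z'_n)$, each $z_t, z'_t : \{\pm 1\}^{t-1} \to \mathcal{Z}$. Now define the function class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{Z} \times \mathcal{Z}}$ as follows:

\[ \mathcal{F} = \{ (z, z') \mapsto g(z) - g(z') : g \in \mathcal{G} \}. \]

Trivially, (55) can be written with this notation as

\[ \sup_{(z, z')} \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t f(z_t, z'_t) \Big\} - C \Big( \sum_{t=1}^n \sup_{f\in\mathcal{F}} |f(z_t, z'_t)|^p \Big)^{1/p} \Big]. \]

However, the complexity of $\mathcal{F}$ is not much larger than that of $\mathcal{G}$:

\[ \mathcal{R}_n(\mathcal{F}; (z, z')) = \mathbb{E} \sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t f(z_t, z'_t) = \mathbb{E} \sup_{g\in\mathcal{G}} \sum_{t=1}^n \epsilon_t \big( g(z_t) - g(z'_t) \big) \le \mathcal{R}_n(\mathcal{G}; z) + \mathcal{R}_n(\mathcal{G}; z'). \]

The first part of the theorem is concluded by applying Lemma 15 to the class $\mathcal{F}$.

To prove the second part, we modify the lower bound construction in [25, Theorem 2]. Assume that we are given a predictable process $x$ of length $n$ and that $x_0$ is any one of the $2^n - 1$ values in the image of $x$. Since $\mathbb{E} | \sum_{t=1}^n \epsilon_t | \le \sqrt{n}$, we have that

\[ \mathcal{R}_n(\mathcal{F}; x) = \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t f(x_t) \Big] \le \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t \big( f(x_t) - f(x_0) \big) \Big] + \sqrt{n} \sup_{f\in\mathcal{F}} |f(x_0)| \]
\[ = 2\, \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \Big( \frac{f(x_t) + f(x_0)}{2} - \frac{1 - \epsilon_t}{2} f(x_t) - \frac{1 + \epsilon_t}{2} f(x_0) \Big) \Big] + \sqrt{n} \sup_{f\in\mathcal{F}} |f(x_0)|. \]

Now consider the joint distribution over $X_1, \ldots, X_n$ such that, for every $t \in [n]$,

\[ P( X_t = x_0 \mid \epsilon_{1:t-1}, \epsilon_t = 1 ) = 1 \quad \text{and} \quad P( X_t = x_t(\epsilon_{1:t-1}) \mid \epsilon_{1:t-1}, \epsilon_t = -1 ) = 1. \]

Under this distribution, we can rewrite the above inequality as

\[ \mathcal{R}_n(\mathcal{F}; x) \le 2\, \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \big( \mathbb{E}_{t-1}[f(X_t)] - f(X_t) \big) \Big] + \sqrt{n} \sup_{f\in\mathcal{F}} |f(x_0)|. \]

Since $\mathcal{F}$ is of type $r$, taking $\epsilon'_1, \ldots, \epsilon'_n$ to be an independent Rademacher sequence, we further bound the above term as

\[ 2C\, \mathbb{E} \Big( \sum_{t=1}^n \mathbb{E}_{t-1} \sup_{f\in\mathcal{F}} |f(X_t) - f(X'_t)|^r \Big)^{1/r} + \sqrt{n} \sup_{f\in\mathcal{F}} |f(x_0)| \]
\[ = 2C\, \mathbb{E} \Big( \sum_{t=1}^n \mathbb{E}_{\epsilon'_t} \sup_{f\in\mathcal{F}} \Big| \frac{1 - \epsilon_t}{2} f(x_t) + \frac{1 + \epsilon_t}{2} f(x_0) - \frac{1 - \epsilon'_t}{2} f(x_t) - \frac{1 + \epsilon'_t}{2} f(x_0) \Big|^r \Big)^{1/r} + \sqrt{n} \sup_{f\in\mathcal{F}} |f(x_0)| \]
\[ \le 4C\, n^{1/r} \sup_{f,\, t,\, \epsilon \in \{\pm 1\}^n} \big( |f(x_t(\epsilon))|^r + |f(x_0)|^r \big)^{1/r} + \sqrt{n} \sup_{f\in\mathcal{F}} |f(x_0)|. \]

Since $x_0$ is one of the elements of the tree $x$, we further upper bound the expression by

\[ 8C\, n^{1/r} \sup_{f,\, t,\, \epsilon \in \{\pm 1\}^n} |f(x_t(\epsilon))| + \sqrt{n} \sup_{f\in\mathcal{F}} |f(x_0)| \le 16C\, n^{1/r} \sup_{f,\, t,\, \epsilon \in \{\pm 1\}^n} |f(x_t(\epsilon))|. \]

In the last step we used the fact that $r \le 2$ and so $\sqrt{n} \le n^{1/r}$.

Proof of Lemma 7. We have

\[ P(\xi \ge u) \le \frac{\mathbb{E}\phi(\xi)}{\phi(u)} \le \frac{\mathbb{E}\phi(\nu)}{\phi(u)} \le \frac{1}{\phi(u)} \Big( \phi(0) + \int_0^\infty \phi'(x) P(\nu \ge x)\, dx \Big). \]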

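Choose $a = u - \mu^{-1}(1)$, where $\mu^{-1}$ is the inverse function. If $a < 0$, the conclusion of the lemma is true since $\Gamma \ge 1$. In the case of $a \ge 0$, we have $\phi(0) = 0$. The above upper bound becomes

\[ P(\xi \ge u) \le \frac{\Gamma}{\phi(u)} \int_0^\infty \phi'(x) \exp\{-\mu(x)\}\, dx = \frac{\Gamma}{\phi(u)} \int_a^\infty \mu'(x) \exp\{-\mu(x)\}\, dx \le \frac{\Gamma}{\mu(u - a)} \big[ -\exp\{-\mu(x)\} \big]_a^\infty = \Gamma \exp\{-\mu(a)\} = \Gamma \exp\{ -\mu(u - \mu^{-1}(1)) \}. \]

If $\mu(b) = cb$, we have

\[ P(\xi \ge u) \le \Gamma \exp\{ -c(u - 1/c) \} = \Gamma \exp\{ 1 - cu \}. \]

If $\mu(b) = cb^2$, we have

\[ P(\xi \ge u) \le \Gamma \exp\{ -c(u - 1/\sqrt{c})^2 \} \le \Gamma \exp\{ -cu^2/4 \} \]

whenever $u \ge 2/\sqrt{c}$. If $u \le 2/\sqrt{c}$, the conclusion is valid since $\Gamma \ge 1$.

Proof of Corollary 8. Let

\[ \xi(Z_1, \ldots, Z_n, Z'_1, \ldots, Z'_n) = \sup_{g\in\mathcal{G}} \Big\{ \sum_{t=1}^n \big( g(Z_t) - g(Z'_t) \big) - B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) \Big\} \]

and

\[ \nu(Z_1, \ldots, Z_n) = \sup_{g\in\mathcal{G}} \Big\{ \sum_{t=1}^n \big( g(Z_t) - \mathbb{E}_{t-1} g(Z'_t) \big) - \mathbb{E}_{Z'_{1:n}} B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) \Big\}. \]

Then for any convex $\phi : \mathbb{R} \to \mathbb{R}$,

\[ \mathbb{E}\phi(\nu) \le \mathbb{E}\phi(\xi) \]

using convexity of the supremum. The problem is now reduced to obtaining tail bounds for

\[ P\Big( \sup_{g\in\mathcal{G}} \Big\{ \sum_{t=1}^n \big( g(Z_t) - g(Z'_t) \big) - B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) \Big\} > u \Big). \]

Write the probability as

\[ \mathbb{E}\, \mathbb{I}\{ \xi(Z_1, \ldots, Z_n, Z'_1, \ldots, Z'_n) > u \}. \]

We now proceed to replace the random variables from $n$ backwards with a dyadic filtration. Let us start with the last index. Renaming $Z_n$ and $Z'_n$ we see that

\[ \mathbb{E}\, \mathbb{I}\Big\{ \sup_{g} \Big\{ \sum_{t=1}^n \big( g(Z_t) - g(Z'_t) \big) - B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) \Big\} > u \Big\} \]
\[ = \mathbb{E}\, \mathbb{I}\Big\{ \sup_{g} \Big\{ \sum_{t=1}^{n-1} \big( g(Z_t) - g(Z'_t) \big) + \big( g(Z_n) - g(Z'_n) \big) - B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) \Big\} > u \Big\} \]
\[ = \mathbb{E}\, \mathbb{E}_{\epsilon_n} \mathbb{I}\Big\{ \sup_{g} \Big\{ \sum_{t=1}^{n-1} \big( g(Z_t) - g(Z'_t) \big) + \epsilon_n \big( g(Z_n) - g(Z'_n) \big) - B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) \Big\} > u \Big\} \]
\[ \le \mathbb{E} \sup_{z_n, z'_n} \mathbb{E}_{\epsilon_n} \mathbb{I}\Big\{ \sup_{g} \Big\{ \sum_{t=1}^{n-1} \big( g(Z_t) - g(Z'_t) \big) + \epsilon_n \big( g(z_n) - g(z'_n) \big) - B(g; Z_1, Z'_1, \ldots, Z_{n-1}, Z'_{n-1}, z_n, z'_n) \Big\} > u \Big\}. \]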

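Proceeding in this manner for step $n - 1$ and back to $t = 1$, we obtain an upper bound of

\[ \sup_{z_1, z'_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{z_n, z'_n} \mathbb{E}_{\epsilon_n} \mathbb{I}\Big\{ \sup_{g} \Big\{ \sum_{t=1}^n \epsilon_t \big( g(z_t) - g(z'_t) \big) - B(g; z_1, z'_1, \ldots, z_n, z'_n) \Big\} > u \Big\} = \sup_{x} \mathbb{E}\, \mathbb{I}\Big\{ \sup_{g} \Big\{ \sum_{t=1}^n \epsilon_t f_g(x_t) - B(g; x_1, \ldots, x_n) \Big\} > u \Big\}. \]

Proof of Lemma 10. To check the desired statement (32) for the given predictable process $x$, we verify that

\[ \Big\langle\!\!\Big\langle \inf_{\hat{y}_t} \max_{y_t \in \{\pm 1\}} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sum_{t=1}^n (-\hat{y}_t y_t) - \inf_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n \big( -f(x_t(y_{1:t-1}))\, y_t \big) + 2B(f; x) \Big\} \Big] \le 0 \tag{56} \]

where each infimum is taken over the set $\{ \hat{y}_t : |\hat{y}_t| \le \sup_{f\in\mathcal{F}} |f(x_t(y_{1:t-1}))| \}$. To this end, the left-hand side equals

\[ \Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{y_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sum_{t=1}^n \inf_{\hat{y}_t} \{ -\hat{y}_t\, \mathbb{E}[y_t] \} - \inf_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n \big( -f(x_t(y_{1:t-1}))\, y_t \big) + 2B(f; x) \Big\} \Big] \]

where $p_t$ ranges over distributions on $\{\pm 1\}$. In the last step, we have used the minimax theorem, and then the technique that can be found, for instance, in [1, 24]. Next, we replace the infima by (sub)optimal choices corresponding to the value of $f$. This yields

\[ \Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{y_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n f(x_t(y_{1:t-1}))\, y_t + \sum_{t=1}^n \inf_{\hat{y}_t} \{ -\hat{y}_t\, \mathbb{E}[y'_t] \} - 2B(f; x) \Big\} \Big] \]
\[ \le \Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{y_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n f(x_t(y_{1:t-1})) \big( y_t - \mathbb{E}[y'_t] \big) - 2B(f; x) \Big\} \Big] \]
\[ \le \Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{y_t, y'_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n f(x_t(y_{1:t-1})) ( y_t - y'_t ) - 2B(f; x) \Big\} \Big] \]

where the last step is by Jensen's inequality. We further upper bound the above expression by

\[ \Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{y_t, y'_t \sim p_t} \max_{y''_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n f(x_t(y''_{1:t-1})) ( y_t - y'_t ) - 2B(f; x) \Big\} \Big] \]

where $y''_t$ ranges over $\{\pm 1\}$. Since $y_t, y'_t$ can be renamed, we introduce the random signs:

\[ \Big\langle\!\!\Big\langle \sup_{p_t} \mathbb{E}_{y_t, y'_t \sim p_t} \mathbb{E}_{\epsilon_t} \max_{y''_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t ( y_t - y'_t ) f(x_t(y''_{1:t-1})) - 2B(f; x) \Big\} \Big] \]
\[ \le \Big\langle\!\!\Big\langle \max_{y_t, y'_t} \mathbb{E}_{\epsilon_t} \max_{y''_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t ( y_t - y'_t ) f(x_t(y''_{1:t-1})) - 2B(f; x) \Big\} \Big] \]
\[ \le \Big\langle\!\!\Big\langle \max_{b_t \in \{\pm 1\}} \mathbb{E}_{\epsilon_t} \max_{y''_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{f\in\mathcal{F}} \Big\{ 2 \sum_{t=1}^n \epsilon_t b_t f(x_t(y''_{1:t-1})) - 2B(f; x) \Big\} \Big]. \]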

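Since $b_t \epsilon_t$ has the same distribution as $\epsilon_t$ for any $b_t \in \{\pm 1\}$, we write the above expression as

\[ \Big\langle\!\!\Big\langle \mathbb{E}_{\epsilon_t} \max_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{f\in\mathcal{F}} \Big\{ 2 \sum_{t=1}^n \epsilon_t f(x_t(y_{1:t-1})) - 2B(f; x) \Big\} \Big]. \]

To be consistent with the notation of predictable processes, we shift the numbering on $y$ by one:

\[ \Big\langle\!\!\Big\langle \mathbb{E}_{\epsilon_t} \max_{y_{t+1}} \Big\rangle\!\!\Big\rangle_{t=1}^n \Big[ \sup_{f\in\mathcal{F}} \Big\{ 2 \sum_{t=1}^n \epsilon_t f(x_t(y_{2:t})) - 2B(f; x) \Big\} \Big] = \sup_{y} \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \Big\{ 2 \sum_{t=1}^n \epsilon_t f(x_t(y_{2:t}(\epsilon))) - 2B(f; x) \Big\} \Big] \le \sup_{y} \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \Big\{ 2 \sum_{t=1}^n \epsilon_t f(x_t(y_{2:t}(\epsilon))) - 2B(f; x \circ y) \Big\} \Big] \]

by (30). The last quantity is nonpositive by (31).

Proof of Lemma 11. The first two statements are immediate from the discussion preceding the Lemma. For the third statement, we have that

\[ P\Big( \sup_{f\in\mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t f(x_t) - 2B(f; x_1, \ldots, x_n) \Big\} - \alpha \sum_{t=1}^n \sup_{f\in\mathcal{F}} f(x_t)^2 > u \Big) \le P\Big( \sum_{t=1}^n \hat{y}_t \epsilon_t - \alpha \sum_{t=1}^n \sup_{f\in\mathcal{F}} f(x_t)^2 > u \Big) \]

and the latter probability is further upper bounded by

\[ P\Big( \sum_{t=1}^n \big( \epsilon_t \hat{y}_t - \alpha \hat{y}_t^2 \big) > u \Big) \le \inf_{\lambda > 0} \mathbb{E} \Big[ \exp\Big\{ \sum_{t=1}^n \big( \lambda \epsilon_t \hat{y}_t - \alpha\lambda \hat{y}_t^2 \big) - \lambda u \Big\} \Big] \le \inf_{\lambda > 0} \max_{\hat{y}_1, \ldots, \hat{y}_n \in [-1, 1]} \exp\Big\{ \sum_{t=1}^n \big( \lambda^2 \hat{y}_t^2 / 2 - \alpha\lambda \hat{y}_t^2 \big) - \lambda u \Big\} \le \exp\{ -2\alpha u \}. \]

Lemma 21. Suppose we have a collection of random variables $(X(g), Y(g))_{g\in\mathcal{G}}$, with $0 \le Y(g) \le b$ almost surely for any $g \in \mathcal{G}$. Suppose for all $\alpha > 0$, $c > 0$, and some $0 \le a \le 1$ and $K \ge 0$ it holds that

\[ P\Big( \sup_{g\in\mathcal{G}} \{ X(g) - \alpha^{-a} K - \alpha Y(g) \} > u \Big) \le \Gamma \exp\{ -c\alpha u \}. \]

Then

\[ P\Big( \sup_{g\in\mathcal{G}} \Big\{ X(g) - 4 K^{\frac{1}{1+a}} (Y(g) + 1)^{\frac{a}{a+1}} - 4u \sqrt{Y(g) + 1} \Big\} > 0 \Big) \le \log(b)\, \Gamma \exp\{ -cu^2 \}. \]

Proof of Lemma 21. Fix $u > 0$ and consider a discretization over two regions $[d_l, d_u]$, $[d'_l, d'_u]$, given by $\alpha_i = d_l 2^{i-1}$, $i \in \{1, \ldots, N'\}$ with $N' = \log(d_u/d_l)$, and $\alpha'_j = d'_l 2^{j-1}$, $j \in \{1, \ldots, N''\}$ with $N'' = \log(d'_u/d'_l)$. Let $N = N' + N''$ be the total cardinality of the discretization, and let $I$ denote the discretized set. From our premise, we have that for every index $i$ and $t > 0$,

\[ P\Big( \sup_{g\in\mathcal{G}} \{ X(g) - \alpha_i^{-a} K - \alpha_i Y(g) \} > t \Big) \le \exp\{ -c\alpha_i t \}. \]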

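Substituting $t = u\alpha_i^{-1} + \alpha_i$,

\[ P\Big( \sup_{g\in\mathcal{G}} \{ X(g) - \alpha_i^{-a} K - \alpha_i Y(g) \} - \alpha_i > u\alpha_i^{-1} \Big) \le \exp\{ -cu - c\alpha_i^2 \}. \]

By the union bound we conclude that

\[ P\Big( \max_{\alpha_i \in I} \sup_{g\in\mathcal{G}} \big\{ X(g) - \alpha_i^{-a} K - \alpha_i Y(g) - \alpha_i - u\alpha_i^{-1} \big\} > 0 \Big) \le \sum_{\alpha_i \in I} \exp\{ -cu - c\alpha_i^2 \} \le N \exp\{ -cu \}. \]

Therefore,

\[ P\Big( \sup_{g,\, \alpha} \big\{ X(g) - 2\alpha (Y(g) + 1) - \alpha^{-a} K - u\alpha^{-1} \big\} > 0 \Big) \le N \exp\{ -cu \}, \]

with $\alpha$ taking values in $[d_l, d_u] \cup [d'_l, d'_u]$. However,

\[ \inf_{\alpha} \big\{ 2\alpha (Y(g) + 1) + \alpha^{-a} K + u\alpha^{-1} \big\} \le \inf_{\alpha} \big\{ 2\alpha (Y(g) + 1) + 2 \max\{ \alpha^{-a} K, u\alpha^{-1} \} \big\}. \]

Passing to the two balancing choices

\[ \alpha = \sqrt{\frac{u}{Y(g) + 1}}, \qquad \alpha = \Big( \frac{K}{Y(g) + 1} \Big)^{1/(a+1)}, \]

we obtain

\[ P\Big( \sup_{g\in\mathcal{G}} \Big\{ X(g) - 4 \max\Big\{ \sqrt{(Y(g) + 1) u}, \; K^{1/(a+1)} (Y(g) + 1)^{a/(a+1)} \Big\} \Big\} > 0 \Big) \le N \exp\{ -cu \}. \]

It remains to quantify $N$ such that, for any $g \in \mathcal{G}$, the two choices of $\alpha$ are captured by the two corresponding regions of discretization. It is immediate that $N$ does not depend on $u$ or $K$, and depends logarithmically on $b$. This concludes the proof.

In the proofs, it is useful to work with a growth assumption (57), defined below, that is equivalent to (18).

Lemma 22. Suppose sequential Rademacher complexity exhibits an $n^{1/r}$ growth with constant $D/2 > 0$, in the sense of (18). Then the following holds for any $\{0, 1\}$-valued predictable process $b$ and any $\mathcal{X}$-valued predictable process $x$ (both with respect to the dyadic filtration):

\[ \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t b_t f(x_t) \Big] \le D \Big( \max_{\epsilon} \sum_{t=1}^n b_t \Big)^{1/r} \sup_{f,\, t \in [n],\, \epsilon \in \{\pm 1\}^n} |b_t f(x_t)|. \tag{57} \]

Proof of Lemma 22. For any $f \in \mathcal{F}$,

\[ \sum_{t=1}^n \epsilon_t b_t f(x_t) = \sum_{i=1}^N \epsilon_{\tau_i} f(x_{\tau_i}) \tag{58} \]

where $N = \max_{\epsilon} \sum_{t=1}^n b_t$ and $\tau_i = \min\{ s : \sum_{k=1}^s b_k \ge i \}$. For simplicity, assume $\sum_{t=1}^n b_t = N$ for all $\epsilon$ uniformly (the argument can be modified appropriately if not). Since $b_k$ is $\mathcal{A}_{k-1}$-measurable, the event $\{\tau_i \le t\}$ is $\mathcal{A}_{t-1}$-measurable. Define $N$ random variables $\tilde{X}_i = x_{\tau_i}$ and $\tilde{\epsilon}_i = \epsilon_{\tau_i}$, as well as the filtration $\tilde{\mathcal{A}}_i = \mathcal{A}_{\tau_i}$. We have that for any $f$ and $i$

\[ \mathbb{E} \big[ \tilde{\epsilon}_i f(\tilde{X}_i) \mid \tilde{\mathcal{A}}_{i-1} \big] = 0 \]

and therefore $\sum_{i=1}^N \tilde{\epsilon}_i f(\tilde{X}_i)$ is a sum of martingale differences, indexed by $f$. By the result of [25], for any process $\tilde{X}_1, \ldots, \tilde{X}_N$ with values in $\mathrm{img}(x)$,

\[ \mathbb{E} \sup_{f\in\mathcal{F}} \sum_{i=1}^N \tilde{\epsilon}_i f(\tilde{X}_i) \le 2 \sup_{y, x'} \mathbb{E}_{\gamma} \sup_{f\in\mathcal{F}} \sum_{i=1}^N \gamma_i y_i f(x'_i) = 2 \sup_{x'} \mathbb{E}_{\gamma} \sup_{f\in\mathcal{F}} \sum_{i=1}^N \gamma_i f(x'_i) \]

where $y, x'$ range, respectively, over $\{\pm 1\}$-valued and $\mathrm{img}(x)$-valued trees. The last equality follows from the rotation lemma (see [20]).

B Proof of Lemma 15

Lemma 23. Let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and $r \in (1, 2]$. If (18) holds with constant $D/2$, then

\[ \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t f(x_t) \Big] \le C_{r,p} \max_{\epsilon} \Big( \sum_{t=1}^n \sup_{f\in\mathcal{F}} |f(x_t)|^p \Big)^{1/p} \]

for any $1 \le p < r$ and $C_{r,p} \triangleq D \big( 1 - 2^{-(r-p)/rp} \big)^{-1}$.

Proof of Lemma 23. Given a predictable process $x$, define for each $k = 0, 1, \ldots$ a predictable process $b^{(k)}$ by

\[ b^{(k)}_t = \begin{cases} 1 & \text{if } 2^{-(k+1)/p} A < \sup_{f\in\mathcal{F}} |f(x_t)| \le 2^{-k/p} A, \\ 0 & \text{otherwise,} \end{cases} \]

where $A = \max_{\epsilon} \big( \sum_{t=1}^n \sup_{f\in\mathcal{F}} |f(x_t)|^p \big)^{1/p}$. Since $x_t$ is $\mathcal{A}_{t-1}$-measurable, so is $b^{(k)}_t$. From the definition, $\sum_{k \ge 0} b^{(k)}_t \le 1$. Hence

\[ \sum_{t=1}^n \epsilon_t f(x_t) = \sum_{k \ge 0} \sum_{t=1}^n \epsilon_t b^{(k)}_t f(x_t). \]

Denoting $N_k(\epsilon) = |\{ t : b^{(k)}_t = 1 \}|$,

\[ \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t f(x_t) \Big] \le \sum_{k \ge 0} \mathbb{E} \Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t b^{(k)}_t f(x_t) \Big] \le D \sum_{k \ge 0} \Big( \max_{\epsilon} N_k(\epsilon) \Big)^{1/r} \sup_{f,\, \epsilon,\, t} \big\{ |b^{(k)}_t f(x_t)| \big\} \le D A \sum_{k \ge 0} \Big( \max_{\epsilon} N_k(\epsilon) \Big)^{1/r} 2^{-k/p}. \tag{59} \]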
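The level-set decomposition $b^{(k)}$ used in this proof can be illustrated along a single path. In the sketch below, `values[t]` stands in for $\sup_f |f(x_t)|$ on that path and `scale` for $A$; by default the scale is the $p$-norm of the path itself, which is an assumption of the example (in the proof, $A$ is the maximum over all paths). The function name `peel` is hypothetical. By a standard counting argument, level $k$ contains fewer than $2^{k+1}$ indices, because each of its members contributes more than $2^{-(k+1)} A^p$ to a sum that is at most $A^p$.

```python
import math


def peel(values, p, scale=None):
    """Group indices t by the level k with
    2**(-(k+1)/p) * scale < values[t] <= 2**(-k/p) * scale.

    Returns (scale, levels), where levels maps k to the list of indices at level k.
    Indices with values[t] == 0 belong to no level.
    """
    if scale is None:
        scale = sum(v ** p for v in values if v > 0) ** (1.0 / p)
    levels = {}
    for t, v in enumerate(values):
        if v <= 0:
            continue  # zero terms contribute nothing to the sum
        k = max(0, math.floor(-p * math.log2(v / scale)))
        # Correct possible off-by-one errors caused by floating-point rounding.
        while k > 0 and v > 2.0 ** (-k / p) * scale:
            k -= 1
        while v <= 2.0 ** (-(k + 1) / p) * scale:
            k += 1
        levels.setdefault(k, []).append(t)
    return scale, levels


if __name__ == "__main__":
    path_values = [0.9, 0.5, 0.5, 0.1, 0.0, 0.3, 0.05, 0.7]
    A, levels = peel(path_values, p=1.5)
    for k in sorted(levels):
        print(k, levels[k], "count bound:", 2 ** (k + 1))
```

Each nonzero index lands in exactly one level, which mirrors the property $\sum_{k \ge 0} b^{(k)}_t \le 1$ stated in the proof.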


Convergence of random variables. (telegram style notes) P.J.C. Spreij

Convergence of random variables. (telegram style notes) P.J.C. Spreij Covergece of radom variables (telegram style otes).j.c. Spreij this versio: September 6, 2005 Itroductio As we kow, radom variables are by defiitio measurable fuctios o some uderlyig measurable space

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machie Learig Theory (CS 6783) Lecture 3 : Olie Learig, miimax value, sequetial Rademacher complexity Recap: Miimax Theorem We shall use the celebrated miimax theorem as a key tool to boud the miimax rate

More information

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence Chapter 3 Strog covergece As poited out i the Chapter 2, there are multiple ways to defie the otio of covergece of a sequece of radom variables. That chapter defied covergece i probability, covergece i

More information

Rademacher Complexity

Rademacher Complexity EECS 598: Statistical Learig Theory, Witer 204 Topic 0 Rademacher Complexity Lecturer: Clayto Scott Scribe: Ya Deg, Kevi Moo Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved for

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014.

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014. Product measures, Toelli s ad Fubii s theorems For use i MAT3400/4400, autum 2014 Nadia S. Larse Versio of 13 October 2014. 1. Costructio of the product measure The purpose of these otes is to preset the

More information

Optimally Sparse SVMs

Optimally Sparse SVMs A. Proof of Lemma 3. We here prove a lower boud o the umber of support vectors to achieve geeralizatio bouds of the form which we cosider. Importatly, this result holds ot oly for liear classifiers, but

More information

6.883: Online Methods in Machine Learning Alexander Rakhlin

6.883: Online Methods in Machine Learning Alexander Rakhlin 6.883: Olie Methods i Machie Learig Alexader Rakhli LECTURE 23. SOME CONSEQUENCES OF ONLINE NO-REGRET METHODS I this lecture, we explore some cosequeces of the developed techiques.. Covex optimizatio Wheever

More information

Chapter 7 Isoperimetric problem

Chapter 7 Isoperimetric problem Chapter 7 Isoperimetric problem Recall that the isoperimetric problem (see the itroductio its coectio with ido s proble) is oe of the most classical problem of a shape optimizatio. It ca be formulated

More information

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4.

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4. 4. BASES I BAACH SPACES 39 4. BASES I BAACH SPACES Sice a Baach space X is a vector space, it must possess a Hamel, or vector space, basis, i.e., a subset {x γ } γ Γ whose fiite liear spa is all of X ad

More information

An Introduction to Randomized Algorithms

An Introduction to Randomized Algorithms A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis

More information

Empirical Processes: Glivenko Cantelli Theorems

Empirical Processes: Glivenko Cantelli Theorems Empirical Processes: Gliveko Catelli Theorems Mouliath Baerjee Jue 6, 200 Gliveko Catelli classes of fuctios The reader is referred to Chapter.6 of Weller s Torgo otes, Chapter??? of VDVW ad Chapter 8.3

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 3

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 3 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture 3 Tolstikhi Ilya Abstract I this lecture we will prove the VC-boud, which provides a high-probability excess risk boud for the ERM algorithm whe

More information

Chapter 6 Infinite Series

Chapter 6 Infinite Series Chapter 6 Ifiite Series I the previous chapter we cosidered itegrals which were improper i the sese that the iterval of itegratio was ubouded. I this chapter we are goig to discuss a topic which is somewhat

More information

Riesz-Fischer Sequences and Lower Frame Bounds

Riesz-Fischer Sequences and Lower Frame Bounds Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 1 (00), No., 305 314 Riesz-Fischer Sequeces ad Lower Frame Bouds P. Casazza, O. Christese, S. Li ad A. Lider Abstract.

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract I this lecture we derive risk bouds for kerel methods. We will start by showig that Soft Margi kerel SVM correspods to miimizig

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit theorems Throughout this sectio we will assume a probability space (Ω, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Measure and Measurable Functions

Measure and Measurable Functions 3 Measure ad Measurable Fuctios 3.1 Measure o a Arbitrary σ-algebra Recall from Chapter 2 that the set M of all Lebesgue measurable sets has the followig properties: R M, E M implies E c M, E M for N implies

More information

7 Sequences of real numbers

7 Sequences of real numbers 40 7 Sequeces of real umbers 7. Defiitios ad examples Defiitio 7... A sequece of real umbers is a real fuctio whose domai is the set N of atural umbers. Let s : N R be a sequece. The the values of s are

More information

If a subset E of R contains no open interval, is it of zero measure? For instance, is the set of irrationals in [0, 1] is of measure zero?

If a subset E of R contains no open interval, is it of zero measure? For instance, is the set of irrationals in [0, 1] is of measure zero? 2 Lebesgue Measure I Chapter 1 we defied the cocept of a set of measure zero, ad we have observed that every coutable set is of measure zero. Here are some atural questios: If a subset E of R cotais a

More information

Advanced Stochastic Processes.

Advanced Stochastic Processes. Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.

More information

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss ECE 90 Lecture : Complexity Regularizatio ad the Squared Loss R. Nowak 5/7/009 I the previous lectures we made use of the Cheroff/Hoeffdig bouds for our aalysis of classifier errors. Hoeffdig s iequality

More information

Empirical Process Theory and Oracle Inequalities

Empirical Process Theory and Oracle Inequalities Stat 928: Statistical Learig Theory Lecture: 10 Empirical Process Theory ad Oracle Iequalities Istructor: Sham Kakade 1 Risk vs Risk See Lecture 0 for a discussio o termiology. 2 The Uio Boud / Boferoi

More information

Lecture 3 The Lebesgue Integral

Lecture 3 The Lebesgue Integral Lecture 3: The Lebesgue Itegral 1 of 14 Course: Theory of Probability I Term: Fall 2013 Istructor: Gorda Zitkovic Lecture 3 The Lebesgue Itegral The costructio of the itegral Uless expressly specified

More information

Distribution of Random Samples & Limit theorems

Distribution of Random Samples & Limit theorems STAT/MATH 395 A - PROBABILITY II UW Witer Quarter 2017 Néhémy Lim Distributio of Radom Samples & Limit theorems 1 Distributio of i.i.d. Samples Motivatig example. Assume that the goal of a study is to

More information

On Random Line Segments in the Unit Square

On Random Line Segments in the Unit Square O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

Infinite Sequences and Series

Infinite Sequences and Series Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS MASSACHUSTTS INSTITUT OF TCHNOLOGY 6.436J/5.085J Fall 2008 Lecture 9 /7/2008 LAWS OF LARG NUMBRS II Cotets. The strog law of large umbers 2. The Cheroff boud TH STRONG LAW OF LARG NUMBRS While the weak

More information

A Proof of Birkhoff s Ergodic Theorem

A Proof of Birkhoff s Ergodic Theorem A Proof of Birkhoff s Ergodic Theorem Joseph Hora September 2, 205 Itroductio I Fall 203, I was learig the basics of ergodic theory, ad I came across this theorem. Oe of my supervisors, Athoy Quas, showed

More information

Lecture 10 October Minimaxity and least favorable prior sequences

Lecture 10 October Minimaxity and least favorable prior sequences STATS 300A: Theory of Statistics Fall 205 Lecture 0 October 22 Lecturer: Lester Mackey Scribe: Brya He, Rahul Makhijai Warig: These otes may cotai factual ad/or typographic errors. 0. Miimaxity ad least

More information

Lecture 3: August 31

Lecture 3: August 31 36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,

More information

Basics of Probability Theory (for Theory of Computation courses)

Basics of Probability Theory (for Theory of Computation courses) Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.

More information

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A.
