MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications

Content.

1. Exponential concentration for martingales with bounded increments
2. Concentration for Lipschitz continuous functions
3. Examples in statistics and random graph theory

1 Azuma-Hoeffding inequality

Suppose $X_n$ is a martingale with respect to a filtration $\mathcal{F}_n$ such that $X_0 = 0$. The goal of this lecture is to obtain bounds of the form $P(|X_n| \ge \delta n) \le \exp(-\Theta(n))$ under some condition on $X_n$. Note that since $E[X_n] = 0$, the deviation from zero is the right regime to look for rare events. It turns out that an exponential bound of the form above holds under the very simple assumption that the increments of $X_n$ are bounded. The theorem below is known as the Azuma-Hoeffding inequality.

Theorem 1 (Azuma-Hoeffding inequality). Suppose $X_n, n \ge 1$ is a martingale such that $X_0 = 0$ and $|X_i - X_{i-1}| \le d_i$, $1 \le i \le n$, almost surely for some constants $d_i$, $1 \le i \le n$. Then, for every $t > 0$,

$$P(|X_n| > t) \le 2\exp\left(-\frac{t^2}{2\sum_{i=1}^n d_i^2}\right).$$

Notice that in the special case when $d_i = d$, we can take $t = xn$ and obtain an upper bound $2\exp\left(-x^2 n/(2d^2)\right)$, which is of the form promised above. Note that this is consistent with the Chernoff bound for the special case when $X_n$ is a sum of i.i.d. zero mean terms, though it is applicable only in the special case of a.s. bounded increments.

Proof. The function $f(x) \triangleq \exp(\lambda x)$ is convex in $x$ for any $\lambda \in \mathbb{R}$. We have $f(-d_i) = \exp(-\lambda d_i)$ and $f(d_i) = \exp(\lambda d_i)$. Using convexity we have that

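when $|x/d_i| \le 1$,

$$\begin{aligned}
\exp(\lambda x) = f(x) &= f\left(\frac{1}{2}\left(\frac{x}{d_i} + 1\right)d_i + \frac{1}{2}\left(1 - \frac{x}{d_i}\right)(-d_i)\right) \\
&\le \frac{1}{2}\left(\frac{x}{d_i} + 1\right)f(d_i) + \frac{1}{2}\left(1 - \frac{x}{d_i}\right)f(-d_i) \\
&= \frac{f(d_i) + f(-d_i)}{2} + \frac{f(d_i) - f(-d_i)}{2d_i}\,x.
\end{aligned} \tag{1}$$

Further, for every $a$,

$$\frac{\exp(a) + \exp(-a)}{2} = \sum_{k=0}^{\infty}\frac{a^k}{2\,k!} + \sum_{k=0}^{\infty}\frac{(-1)^k a^k}{2\,k!} = \sum_{k=0}^{\infty}\frac{a^{2k}}{(2k)!} \le \sum_{k=0}^{\infty}\frac{a^{2k}}{2^k k!} = \sum_{k=0}^{\infty}\frac{(a^2/2)^k}{k!} = \exp\left(\frac{a^2}{2}\right), \tag{2}$$

where the inequality holds because $2^k k! \le (2k)!$. We conclude that for every $x$ such that $|x/d_i| \le 1$,

$$\exp(\lambda x) \le \exp\left(\frac{\lambda^2 d_i^2}{2}\right) + \frac{\exp(\lambda d_i) - \exp(-\lambda d_i)}{2d_i}\,x. \tag{3}$$

We now turn to our martingale sequence $X_n$. For every $t > 0$ and every $\lambda > 0$ we have

$$\begin{aligned}
P(X_n \ge t) &= P\big(\exp(\lambda X_n) \ge \exp(\lambda t)\big) \\
&\le \exp(-\lambda t)\,E[\exp(\lambda X_n)] \\
&= \exp(-\lambda t)\,E\Big[\exp\Big(\lambda\sum_{1\le i\le n}(X_i - X_{i-1})\Big)\Big],
\end{aligned}$$

where the inequality is Markov's inequality and $X_0 = 0$ was used in the last equality. Applying the tower property of conditional expectation we have

$$E\Big[\exp\Big(\lambda\sum_{1\le i\le n}(X_i - X_{i-1})\Big)\Big] = E\Big[E\Big[\exp\big(\lambda(X_n - X_{n-1})\big)\exp\Big(\lambda\sum_{1\le i\le n-1}(X_i - X_{i-1})\Big)\,\Big|\,\mathcal{F}_{n-1}\Big]\Big].$$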

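Now, since $X_i$, $i \le n-1$, are measurable with respect to $\mathcal{F}_{n-1}$,

$$\begin{aligned}
&E\Big[\exp\big(\lambda(X_n - X_{n-1})\big)\exp\Big(\lambda\sum_{1\le i\le n-1}(X_i - X_{i-1})\Big)\,\Big|\,\mathcal{F}_{n-1}\Big] \\
&\quad= \exp\Big(\lambda\sum_{1\le i\le n-1}(X_i - X_{i-1})\Big)\,E\big[\exp\big(\lambda(X_n - X_{n-1})\big)\,\big|\,\mathcal{F}_{n-1}\big] \\
&\quad\le \exp\Big(\lambda\sum_{1\le i\le n-1}(X_i - X_{i-1})\Big)\left(\exp\left(\frac{\lambda^2 d_n^2}{2}\right) + \frac{\exp(\lambda d_n) - \exp(-\lambda d_n)}{2d_n}\,E[X_n - X_{n-1}\,|\,\mathcal{F}_{n-1}]\right),
\end{aligned}$$

where (3) was used in the last inequality. The martingale property implies $E[X_n - X_{n-1}\,|\,\mathcal{F}_{n-1}] = 0$, and we have obtained the upper bound

$$E\Big[\exp\Big(\lambda\sum_{1\le i\le n}(X_i - X_{i-1})\Big)\Big] \le E\Big[\exp\Big(\lambda\sum_{1\le i\le n-1}(X_i - X_{i-1})\Big)\Big]\exp\left(\frac{\lambda^2 d_n^2}{2}\right).$$

Iterating further we obtain the following upper bound on $P(X_n \ge t)$:

$$\exp(-\lambda t)\exp\left(\frac{\lambda^2\sum_{1\le i\le n} d_i^2}{2}\right).$$

Optimizing over the choice of $\lambda$, we see that the tightest bound is obtained by setting $\lambda = t/\sum_i d_i^2 > 0$, leading to the upper bound

$$P(X_n \ge t) \le \exp\left(-\frac{t^2}{2\sum_i d_i^2}\right).$$

A similar approach using $\lambda < 0$ gives for every $t > 0$

$$P(X_n \le -t) \le \exp\left(-\frac{t^2}{2\sum_i d_i^2}\right).$$

Combining, we obtain the required result.

2 Application to Lipschitz continuous functions of i.i.d. random variables

Suppose $X_1, \ldots, X_n$ are independent random variables. Suppose $g : \mathbb{R}^n \to \mathbb{R}$ is a function and $d_1, \ldots, d_n$ are constants such that for any two vectors $x_1, \ldots, x_n$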

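and $y_1, \ldots, y_n$

$$|g(x_1, \ldots, x_n) - g(y_1, \ldots, y_n)| \le \sum_{i=1}^n d_i\,1\{x_i \ne y_i\}. \tag{4}$$

In particular, when a vector $x$ changes value only in its $i$-th coordinate, the amount of change in the function $g$ is at most $d_i$. As a special case, consider a subset of vectors $x = (x_1, \ldots, x_n)$ such that $|x_i| \le c_i$ and suppose $g$ is Lipschitz continuous with constant $K$. Namely, for every $x, y$, $|g(x) - g(y)| \le K|x - y|$, where $|x - y| = \sum_i |x_i - y_i|$. Then for any two such vectors

$$|g(x) - g(y)| \le K|x - y| \le 2K\sum_i c_i\,1\{x_i \ne y_i\},$$

since $|x_i - y_i| \le 2c_i$ whenever $x_i \ne y_i$, and therefore this fits into the previous framework with $d_i = 2Kc_i$.

Theorem 2. Suppose $X_i$, $1 \le i \le n$, are i.i.d. and the function $g : \mathbb{R}^n \to \mathbb{R}$ satisfies (4). Then for every $t \ge 0$

$$P\big(|g(X_1, \ldots, X_n) - E[g(X_1, \ldots, X_n)]| > t\big) \le 2\exp\left(-\frac{t^2}{2\sum_{i=1}^n d_i^2}\right).$$

Proof. Let $\mathcal{F}_i$ be the $\sigma$-field generated by the variables $X_1, \ldots, X_i$: $\mathcal{F}_i = \sigma(X_1, \ldots, X_i)$. For convenience, we also set $\mathcal{F}_0$ to be the trivial $\sigma$-field consisting of $\emptyset, \Omega$, so that $E[Z\,|\,\mathcal{F}_0] = E[Z]$ for every r.v. $Z$. Let $M_0 = E[g(X_1, \ldots, X_n)]$, $M_1 = E[g(X_1, \ldots, X_n)\,|\,\mathcal{F}_1]$, ..., $M_n = E[g(X_1, \ldots, X_n)\,|\,\mathcal{F}_n]$. Observe that $M_n$ is simply $g(X_1, \ldots, X_n)$, since $X_1, \ldots, X_n$ are measurable with respect to $\mathcal{F}_n$. By the tower property,

$$E[M_n\,|\,\mathcal{F}_{n-1}] = E\big[E[g(X_1, \ldots, X_n)\,|\,\mathcal{F}_n]\,\big|\,\mathcal{F}_{n-1}\big] = M_{n-1},$$

and the same argument applies to every index $i$. Thus, $M_i$ is a martingale. Since $E[g(X_1, \ldots, X_n)\,|\,\mathcal{F}_i]$ is measurable with respect to $\mathcal{F}_{i+1}$, we have

$$M_{i+1} - M_i = E[g(X_1, \ldots, X_n)\,|\,\mathcal{F}_{i+1}] - E[g(X_1, \ldots, X_n)\,|\,\mathcal{F}_i] = E\big[g(X_1, \ldots, X_n) - E[g(X_1, \ldots, X_n)\,|\,\mathcal{F}_i]\,\big|\,\mathcal{F}_{i+1}\big].$$

Since the $X_i$'s are independent, $M_i$ is a r.v. which on any vector $x = (x_1, \ldots, x_n) \in \Omega$ takes the value

$$M_i = \int_{x_{i+1}, \ldots, x_n} g(x_1, \ldots, x_n)\,dP(x_{i+1}) \cdots dP(x_n),$$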

and in particular only depends on the first $i$ coordinates of $x$. Similarly

$$M_{i+1} = \int_{x_{i+2}, \ldots, x_n} g(x_1, \ldots, x_n)\,dP(x_{i+2}) \cdots dP(x_n).$$

Thus, writing $y$ for the integration variable in coordinate $i+1$,

$$\begin{aligned}
|M_{i+1} - M_i| &= \left|\int_{x_{i+2}, \ldots, x_n}\left(g(x_1, \ldots, x_n) - \int_{y} g(x_1, \ldots, x_i, y, x_{i+2}, \ldots, x_n)\,dP(y)\right)dP(x_{i+2}) \cdots dP(x_n)\right| \\
&\le d_{i+1}\int_{x_{i+2}, \ldots, x_n} dP(x_{i+2}) \cdots dP(x_n) \\
&= d_{i+1},
\end{aligned}$$

where the inequality follows from (4), since the two arguments of $g$ differ only in coordinate $i+1$. This derivation represents the simple idea that $M_i$ and $M_{i+1}$ only differ in averaging out $X_{i+1}$ in $M_i$. Now defining $\hat{M}_i = M_i - M_0 = M_i - E[g(X_1, \ldots, X_n)]$, we have that $\hat{M}_i$ is also a martingale with differences bounded by $d_i$, but with the additional property $\hat{M}_0 = 0$. Applying Theorem 1 we obtain the required result.

3 Two examples

We now consider two applications of the concentration inequalities developed in the previous sections. Our first example concerns the convergence of empirical distributions to the true distributions of random variables. Specifically, suppose we have a distribution function $F$ and an i.i.d. sequence $X_1, \ldots, X_n$ with distribution $F$. From the sample $X_1, \ldots, X_n$ we can build an empirical distribution function

$$F_n(x) = n^{-1}\sum_{1\le i\le n} 1\{X_i \le x\}.$$

Namely, $F_n(x)$ is simply the frequency of observing values at most $x$ in our sample. We should realize that $F_n$ is a random function, since it depends on the sample $X_1, \ldots, X_n$. An important theorem, called Glivenko-Cantelli, says that $\sup_{x\in\mathbb{R}}|F_n(x) - F(x)|$ converges to zero almost surely and in expectation, the latter meaning of course that $E[\sup_{x\in\mathbb{R}}|F_n(x) - F(x)|] \to 0$. Proving this result is beyond our scope. However, applying the martingale concentration inequality we can bound the deviation of $\sup_{x\in\mathbb{R}}|F_n(x) - F(x)|$ around its expectation. For convenience let $L_n = L_n(X_1, \ldots, X_n) = \sup_{x\in\mathbb{R}}|F_n(x) - F(x)|$, which is commonly called empirical risk in the statistics and machine learning fields. We need to bound $P(|L_n - E[L_n]| > t)$. Observe that $L_n$ satisfies property (4) with $d_i = 1/n$. Indeed, changing one coordinate $X_i$ to some $X_i'$ changes $F_n$ by at most $1/n$, and thus the same applies to

$L_n$. Applying Theorem 2 we obtain

$$P(|L_n - E[L_n]| > t) \le 2\exp\left(-\frac{t^2}{2n(1/n)^2}\right) = 2\exp\left(-\frac{t^2 n}{2}\right).$$

Thus, we obtain a large deviations type bound on the difference $L_n - E[L_n]$.

For our second example we turn to combinatorial optimization on random graphs. We will use the so-called Max-Cut problem as an example, though the approach works for many other optimization and constraint satisfaction problems as well. Consider a simple undirected graph $G = (V, E)$. $V$ is the set of nodes, denoted by $1, 2, \ldots, n$, and $E$ is the set of edges, which we describe as a list of pairs $(i_1, j_1), \ldots, (i_{|E|}, j_{|E|})$, where $i_1, \ldots, i_{|E|}, j_1, \ldots, j_{|E|}$ are nodes. The graph is undirected, which means that the edges $(i_1, j_1)$ and $(j_1, i_1)$ are identical. We can also represent the graph as an $n \times n$ zero-one matrix $A$, where $A_{i,j} = 1$ if $(i, j) \in E$ and $A_{i,j} = 0$ otherwise. Then $A$ is a symmetric matrix, namely $A^T = A$, where $A^T$ is the transpose of $A$.

A cut in this graph is a partition of the nodes into two groups, encoded by a function $\sigma : V \to \{0, 1\}$. The value $MC(\sigma)$ of the cut associated with $\sigma$ is the number of edges between the two groups. Formally, $MC(\sigma) = |\{(i, j) \in E : \sigma(i) \ne \sigma(j)\}|$. Clearly $MC(\sigma) \le |E|$. At the same time, a random assignment $\sigma(i) = 0$ with probability $1/2$ and $\sigma(i) = 1$ with probability $1/2$ gives a cut with expected value $(1/2)|E|$. In fact there is a simple algorithm to construct such a cut explicitly. Now denote by $MC(G)$ the maximum possible value of the cut: $MC(G) = \max_\sigma MC(\sigma)$. Thus $1/2 \le MC(G)/|E| \le 1$.

Further, suppose we delete an arbitrary edge from the graph $G$ and obtain a new graph $G'$. Observe that in this case $|MC(G') - MC(G)| \le 1$: the Max-Cut value either stays the same or goes down by at most one. Similarly, when we add an edge, the Max-Cut value increases by at most one. Putting this together, if we replace an arbitrary edge $e \in E$ by a different edge $e'$ and leave all the other edges intact, the value of the Max-Cut changes by at most one.

Now suppose the graph $G = G(n, dn)$ is a random Erdős-Rényi graph with $|E| = dn$ edges.
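
The claim that replacing a single edge changes the Max-Cut value by at most one can be checked by brute force on a small graph. The Python sketch below is only an illustration under stated assumptions: the function names `max_cut` and `max_change_after_single_edge_swap` are hypothetical, nodes are labeled $0, \ldots, n-1$ rather than $1, \ldots, n$, repeated edges are allowed and counted with multiplicity, and the enumeration over all $2^n$ assignments is feasible only for very small $n$.

```python
from itertools import product


def max_cut(n, edges):
    """Brute-force Max-Cut value of a graph on nodes 0..n-1 (2^n assignments)."""
    best = 0
    for sigma in product((0, 1), repeat=n):
        # MC(sigma): number of edges whose endpoints land in different groups.
        value = sum(1 for i, j in edges if sigma[i] != sigma[j])
        best = max(best, value)
    return best


def max_change_after_single_edge_swap(n, edges):
    """Largest |MC(G') - MC(G)| over all replacements of one edge by any pair."""
    base = max_cut(n, edges)
    all_pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    worst = 0
    for k in range(len(edges)):
        for new_edge in all_pairs:
            # Replace the k-th edge and leave all other edges intact.
            swapped = edges[:k] + [new_edge] + edges[k + 1:]
            worst = max(worst, abs(max_cut(n, swapped) - base))
    return worst


# A 5-cycle: any cut of an odd cycle misses at least one edge, so MC(G) = 4.
cycle5 = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]
print("MC(G) =", max_cut(5, cycle5))
print("largest change after one swap =",
      max_change_after_single_edge_swap(5, cycle5))
```

On this example the largest change over all single-edge replacements does not exceed one, which is the bounded-difference property (4) with $d_i = 1$ that the concentration argument relies on.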
Specifically, suppose the edges $E_1, \ldots, E_{dn}$ of $G(n, dn)$ are chosen uniformly at random from the total set of $\binom{n}{2}$ possible edges, independently for these $dn$ choices. Denote by $MC_n$ the value of the maximum cut $MC(G(n, dn))$ on this random graph. Since the graph is random, $MC_n$ is a random variable. Furthermore, as we have just established, $d/2 \le MC_n/n \le d$. One of the major open problems in the theory of random graphs is computing the scaling limit $E[MC_n]/n$ as $n \to \infty$. However, we can easily obtain bounds

on the concentration of $MC_n$ around its expectation, using the Azuma-Hoeffding inequality. For this goal, think of the random edges $E_1, \ldots, E_{dn}$ as i.i.d. random variables in the space $\{1, 2, \ldots, \binom{n}{2}\}$ corresponding to the space of all possible edges on $n$ nodes. Let $g(E_1, \ldots, E_{dn}) = MC_n$. Observe that indeed $g$ is a function of $dn$ i.i.d. random variables. By our observation, replacing one edge $E_i$ by a different edge $E_i'$ changes $MC_n$ by at most one, so (4) holds with $d_i = 1$. Thus we can apply Theorem 2, which gives

$$P(|MC_n - E[MC_n]| \ge t) \le 2\exp\left(-\frac{t^2}{2dn}\right).$$

In particular, taking $t = rn$, where $r > 0$ is a constant, we obtain a large deviations type bound $2\exp\left(-\frac{r^2 n}{2d}\right)$. Taking instead $t = r\sqrt{n}$, we obtain a Gaussian type bound $2\exp\left(-\frac{r^2}{2d}\right)$. Namely, $MC_n = E[MC_n] + O(\sqrt{n})$ with high probability. This is a meaningful concentration around the mean since, as we have discussed above, $E[MC_n] = \Theta(n)$.
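
To get a feel for the size of these bounds, the short Python sketch below evaluates the bounded-difference tail bound $2\exp(-t^2/(2\sum_i d_i^2))$ for the two examples of this section. It is a numerical illustration only: the function names and the particular values of $n$, $d$, $r$ and $t$ are hypothetical choices, and $dn$ is assumed to be an integer.

```python
import math


def bounded_difference_tail_bound(t, d):
    """Evaluate 2 * exp(-t^2 / (2 * sum_i d_i^2)) for bounded differences d_i."""
    return 2.0 * math.exp(-t * t / (2.0 * sum(di * di for di in d)))


def max_cut_tail_bound(n, d, t):
    """Bound on P(|MC_n - E[MC_n]| >= t): dn edges, each with d_i = 1."""
    num_edges = d * n  # assumed to be an integer
    return bounded_difference_tail_bound(t, [1.0] * num_edges)


n, d = 400, 2  # illustrative values, not taken from the notes

# Large deviations scaling t = r * n: the bound decays exponentially in n.
large_deviation_bound = max_cut_tail_bound(n, d, 0.5 * n)

# Gaussian scaling t = r * sqrt(n): the bound does not depend on n.
gaussian_bound = max_cut_tail_bound(n, d, 3.0 * math.sqrt(n))

# Empirical distribution example: n differences, each d_i = 1/n.
empirical_cdf_bound = bounded_difference_tail_bound(0.1, [1.0 / 100] * 100)

print(large_deviation_bound, gaussian_bound, empirical_cdf_bound)
```

With these choices the first value agrees with $2\exp(-r^2 n/(2d))$ for $r = 0.5$, the second with $2\exp(-r^2/(2d))$ for $r = 3$, and the third with $2\exp(-t^2 n/2)$ for $n = 100$, $t = 0.1$. Note that the bound can exceed one for small $t$, in which case it carries no information.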

MIT OpenCourseWare http://ocw.mit.edu 15.070J / 6.265J Advanced Stochastic Processes Fall 2013 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.