MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.265/15.070J Fall 2013
Lecture 2 9/9/2013

Large Deviations for i.i.d. Random Variables

Content. Chernoff bound using exponential moment generating functions. Properties of moment generating functions. Legendre transforms.

1 Preliminary notes

The Weak Law of Large Numbers tells us that if X_1, X_2, ... is an i.i.d. sequence of random variables with mean µ ≜ E[X_1] < ∞, then for every ε > 0,

P(|(X_1 + ... + X_n)/n − µ| > ε) → 0,  as n → ∞.

But how quickly does this convergence to zero occur? We can try to use the Chebyshev inequality, which says

P(|(X_1 + ... + X_n)/n − µ| > ε) ≤ Var(X_1)/(n ε²).

This suggests a decay rate of order 1/n, if we treat Var(X_1) and ε as constants. Is this an accurate rate? Far from it. In fact, if a higher moment of X_1 is finite, say E[X_1^{2m}] < ∞ for some m ≥ 1, then using a similar bound one can show that the decay rate is at least 1/n^m (exercise). The goal of large deviations theory is to show that in many interesting cases the decay rate is in fact exponential: e^{−cn}. The exponent c > 0 is called the large deviations rate, and in many cases it can be computed explicitly or numerically.
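As a concrete numerical illustration of how loose the 1/n Chebyshev rate is, the following Python sketch compares the exact one-sided tail probability for averages of Bernoulli(1/2) variables against the Chebyshev bound. The Bernoulli(1/2) choice and the values of n are assumptions made purely so the tail can be computed exactly.

```python
from math import comb

def exact_tail(n, eps, p=0.5):
    # exact P(S_n/n - p > eps) for S_n ~ Binomial(n, p)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if k / n - p > eps)

def chebyshev_bound(n, eps, p=0.5):
    # Chebyshev: P(|S_n/n - p| > eps) <= Var(X_1) / (n eps^2)
    return p * (1 - p) / (n * eps**2)

# doubling n merely halves the Chebyshev bound, but shrinks the
# true tail probability by a much larger (exponential) factor
for n in (50, 100, 200):
    print(n, exact_tail(n, 0.1), chebyshev_bound(n, 0.1))
```

Doubling n only halves the Chebyshev bound, while the exact tail drops by an order of magnitude or more; this exponential behavior is exactly what the rest of the lecture quantifies.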
2 Large deviations upper bound (Chernoff bound)

Consider an i.i.d. sequence with a common probability distribution function F(x) = P(X ≤ x), x ∈ R. Fix a value a > µ, where µ is again the expectation corresponding to the distribution F. We consider the probability that the average of X_1, ..., X_n exceeds a. The WLLN tells us that this happens with probability converging to zero as n increases; we now obtain an estimate on this probability. Fix a positive parameter θ > 0. We have

P(n⁻¹ Σ_{1≤i≤n} X_i > a) = P(Σ_{1≤i≤n} X_i > na)
  = P(e^{θ Σ_{1≤i≤n} X_i} > e^{θna})
  ≤ E[e^{θ Σ_{1≤i≤n} X_i}]/e^{θna}   (Markov inequality)
  = E[Π_{1≤i≤n} e^{θX_i}]/e^{θna}.

But recall that the X_i's are i.i.d. Therefore E[Π_{1≤i≤n} e^{θX_i}] = (E[e^{θX_1}])^n. Thus we obtain the upper bound

P(n⁻¹ Σ_{1≤i≤n} X_i > a) ≤ (E[e^{θX_1}]/e^{θa})^n.   (1)

Of course this bound is meaningful only if the ratio E[e^{θX_1}]/e^{θa} is less than unity. We recognize E[e^{θX_1}] as the moment generating function of X_1 and denote it by M(θ). For the bound to be useful, we need E[e^{θX_1}] to be at least finite. If we could show that this ratio is less than unity, we would be done: exponentially fast decay of the probability would be established.

Similarly, suppose we want to estimate

P(n⁻¹ Σ_{1≤i≤n} X_i < a),

for some a < µ. Fixing now a negative θ < 0, we obtain

P(n⁻¹ Σ_{1≤i≤n} X_i < a) = P(e^{θ Σ_{1≤i≤n} X_i} > e^{θna}) ≤ (M(θ)/e^{θa})^n,
and now we need to find a negative θ such that M(θ) < e^{θa}. In particular, we need to focus on θ for which the moment generating function is finite. For this purpose let D(M) ≜ {θ : M(θ) < ∞}. Namely, D(M) is the set of values θ for which the moment generating function is finite. Thus we call D(M) the domain of M.

3 Moment generating function. Examples and properties

Let us consider some examples of computing the moment generating functions.

Exponential distribution. Consider an exponentially distributed random variable X with parameter λ. Then

M(θ) = ∫_0^∞ e^{θx} λe^{−λx} dx = λ ∫_0^∞ e^{−(λ−θ)x} dx.

When θ < λ this integral is equal to λ [−e^{−(λ−θ)x}/(λ−θ)]_0^∞ = λ/(λ−θ). But when θ ≥ λ, the integral is infinite. Thus the moment generating function is finite iff θ < λ, in which case M(θ) = λ/(λ−θ). The domain of the moment generating function is D(M) = (−∞, λ).

Standard Normal distribution. When X has the standard Normal distribution, we obtain

M(θ) = E[e^{θX}] = (1/√(2π)) ∫ e^{θx} e^{−x²/2} dx = e^{θ²/2} (1/√(2π)) ∫ e^{−(x²−2θx+θ²)/2} dx
     = e^{θ²/2} (1/√(2π)) ∫ e^{−(x−θ)²/2} dx.

Introducing the change of variables y = x − θ, we obtain that the integral is equal to (1/√(2π)) ∫ e^{−y²/2} dy = 1 (the integral of the density of the standard Normal distribution). Therefore M(θ) = e^{θ²/2}. We see that it is always finite and D(M) = R. In retrospect it is not surprising that in this case M(θ) is finite for all θ. The density of the standard Normal distribution decays like e^{−x²/2} and
this is faster than the exponential growth e^{θx}. So no matter how large θ is, the overall product is finite.

Poisson distribution. Suppose X has a Poisson distribution with parameter λ. Then

M(θ) = E[e^{θX}] = Σ_{m=0}^∞ e^{θm} e^{−λ} λ^m/m!
     = e^{−λ} Σ_{m=0}^∞ (e^θ λ)^m/m!
     = e^{e^θ λ − λ},

(where we use the formula Σ_{m≥0} t^m/m! = e^t). Thus again D(M) = R. This again has to do with the fact that λ^m/m! decays at a rate similar to 1/m!, which is faster than any exponential growth rate e^{θm}.

We now establish several properties of the moment generating functions.

Proposition 1. The moment generating function M(θ) of a random variable X satisfies the following properties:

(a) M(0) = 1. If M(θ) < ∞ for some θ > 0, then M(θ′) < ∞ for all θ′ ∈ [0, θ]. Similarly, if M(θ) < ∞ for some θ < 0, then M(θ′) < ∞ for all θ′ ∈ [θ, 0]. In particular, the domain D(M) is an interval containing zero.

(b) Suppose (θ_1, θ_2) ⊂ D(M). Then M(θ) as a function of θ is differentiable at every θ_0 ∈ (θ_1, θ_2), and furthermore,

(d/dθ) M(θ)|_{θ=θ_0} = E[X e^{θ_0 X}] < ∞.

Namely, the order of the differentiation and expectation operators can be interchanged.

Proof. Part (a) is left as an exercise. We now establish part (b). Fix any θ_0 ∈ (θ_1, θ_2) and consider the θ-indexed family of random variables

Y_θ ≜ (exp(θX) − exp(θ_0 X))/(θ − θ_0).
Since (d/dθ) exp(θx) = x exp(θx), almost surely Y_θ → X exp(θ_0 X) as θ → θ_0. Thus to establish the claim it suffices to show that convergence of the expectations holds as well, namely lim_{θ→θ_0} E[Y_θ] = E[X exp(θ_0 X)], and E[X exp(θ_0 X)] < ∞. For this purpose we will use the Dominated Convergence Theorem. Namely, we will identify a random variable Z such that |Y_θ| ≤ Z almost surely for all θ in some interval (θ_0 − ε, θ_0 + ε), and E[Z] < ∞.

Fix ε > 0 small enough so that (θ_0 − ε, θ_0 + ε) ⊂ (θ_1, θ_2). Let Z = ε^{−1} exp(θ_0 X + ε|X|). Using the Taylor expansion of the exp(·) function, for every θ ∈ (θ_0 − ε, θ_0 + ε) we have

Y_θ = exp(θ_0 X) (X + (1/2!)(θ − θ_0)X² + (1/3!)(θ − θ_0)²X³ + ⋯ + (1/n!)(θ − θ_0)^{n−1}X^n + ⋯),

which gives

|Y_θ| ≤ exp(θ_0 X) (|X| + (1/2!)|θ − θ_0||X|² + ⋯ + (1/n!)|θ − θ_0|^{n−1}|X|^n + ⋯)
     ≤ exp(θ_0 X) (|X| + (1/2!)ε|X|² + ⋯ + (1/n!)ε^{n−1}|X|^n + ⋯)
     = exp(θ_0 X) ε^{−1}(exp(ε|X|) − 1)
     ≤ exp(θ_0 X) ε^{−1} exp(ε|X|) = Z.

It remains to show that E[Z] < ∞. We have

E[Z] = ε^{−1} E[exp(θ_0 X + εX)1{X ≥ 0}] + ε^{−1} E[exp(θ_0 X − εX)1{X < 0}]
     ≤ ε^{−1} E[exp(θ_0 X + εX)] + ε^{−1} E[exp(θ_0 X − εX)]
     = ε^{−1} M(θ_0 + ε) + ε^{−1} M(θ_0 − ε) < ∞,

since ε was chosen so that (θ_0 − ε, θ_0 + ε) ⊂ (θ_1, θ_2) ⊂ D(M). This completes the proof of the proposition.

Problem 1. (a) Establish part (a) of Proposition 1.

(b) Construct an example of a random variable for which the corresponding interval is trivial: {0}. Namely, M(θ) = ∞ for every θ > 0.
(c) Construct an example of a random variable X such that D(M) = [θ_1, θ_2] for some θ_1 < 0 < θ_2. Namely, the domain D(M) is a closed interval of non-zero length containing zero.

Now suppose the i.i.d. sequence X_i, i ≥ 1 is such that 0 ∈ (θ_1, θ_2) ⊂ D(M), where M is the moment generating function of X_1. Namely, M is finite in a neighborhood of 0. Let a > µ = E[X_1]. Applying Proposition 1, let us differentiate the ratio M(θ)/e^{θa} with respect to θ at θ = 0:

(d/dθ)(M(θ)/e^{θa})|_{θ=0} = (E[X_1 e^{θX_1}] e^{θa} − a e^{θa} E[e^{θX_1}])/e^{2θa} |_{θ=0} = µ − a < 0.

Note that M(θ)/e^{θa} = 1 when θ = 0. Therefore, for sufficiently small positive θ, the ratio M(θ)/e^{θa} is smaller than unity, and (1) provides an exponential bound on the tail probability for the average of X_1, ..., X_n. Similarly, if a < µ, the ratio M(θ)/e^{θa} < 1 for sufficiently small negative θ. We now summarize our findings.

Theorem 1 (Chernoff bound). Given an i.i.d. sequence X_1, ..., X_n, suppose the moment generating function M(θ) is finite in some interval (θ_1, θ_2) ∋ 0. Let a > µ = E[X_1]. Then there exists θ > 0 such that M(θ)/e^{θa} < 1 and

P(n⁻¹ Σ_{1≤i≤n} X_i > a) ≤ (M(θ)/e^{θa})^n.

Similarly, if a < µ, then there exists θ < 0 such that M(θ)/e^{θa} < 1 and

P(n⁻¹ Σ_{1≤i≤n} X_i < a) ≤ (M(θ)/e^{θa})^n.

How small can we make the ratio M(θ)/exp(θa)? We have some freedom in choosing θ, as long as E[e^{θX_1}] is finite. So we could try to find the θ which minimizes the ratio M(θ)/e^{θa}. This is what we will do in the rest of the lecture. The surprising conclusion of large deviations theory is that very often such a minimizing value θ* exists and is tight. Namely, it provides the correct decay rate! In this case we will be able to say

P(n⁻¹ Σ_{1≤i≤n} X_i > a) ≈ exp(−n I(a, θ*)),

where I(a, θ*) = −log(M(θ*)/e^{θ* a}) = θ* a − log M(θ*).
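To get a feel for this minimization over θ, here is a minimal Python sketch that maximizes θa − log M(θ) over a grid (equivalently, minimizes the ratio M(θ)/e^{θa} from Theorem 1). The grid search and the standard Normal choice, for which log M(θ) = θ²/2, are assumptions made for illustration.

```python
def best_exponent(a, log_M, thetas):
    # maximize theta*a - log M(theta) over a grid of theta values,
    # i.e. minimize the ratio M(theta)/e^{theta*a} from Theorem 1
    theta_star = max(thetas, key=lambda t: t * a - log_M(t))
    return theta_star, theta_star * a - log_M(theta_star)

# standard Normal: log M(theta) = theta^2 / 2
thetas = [k / 1000 for k in range(1, 5000)]
theta_star, exponent = best_exponent(1.2, lambda t: t * t / 2, thetas)
print(theta_star, exponent)  # optimizer sits at theta* = a = 1.2, exponent a^2/2 = 0.72
```

The same routine works for any distribution whose log-moment generating function is available in closed form on the grid.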
4 Legendre transforms

Theorem 1 gave us a large deviations bound (M(θ)/e^{θa})^n, which we rewrite as e^{−n(θa − log M(θ))}. We now study in more detail the exponent θa − log M(θ).

Definition 1. The Legendre transform of a random variable X is the function

I(a) ≜ sup_{θ∈R} (θa − log M(θ)).

Let us go over some example distributions and compute their corresponding Legendre transforms.

Exponential distribution with parameter λ. Recall that M(θ) = λ/(λ−θ) when θ < λ and M(θ) = ∞ otherwise. Since aθ − log M(θ) = −∞ when θ ≥ λ, the supremum is achieved over θ < λ:

I(a) = sup_{θ<λ} (aθ − log(λ/(λ−θ))) = sup_{θ<λ} (aθ − log λ + log(λ−θ)).

Setting the derivative of g(θ) = aθ − log λ + log(λ−θ) equal to zero, we obtain the equation a − 1/(λ−θ) = 0, which has the unique solution θ* = λ − 1/a. For the boundary cases, we have aθ − log λ + log(λ−θ) → −∞ when either θ ↑ λ or θ → −∞ (check). Therefore

I(a) = a(λ − 1/a) − log λ + log(λ − λ + 1/a) = aλ − 1 − log λ + log(1/a) = aλ − 1 − log λ − log a.

The large deviations bound then tells us that when a > 1/λ,

P(n⁻¹ Σ_{1≤i≤n} X_i > a) ≤ e^{−n(aλ − 1 − log λ − log a)}.

Say λ = 1 and a = 1.2. Then the approximation gives us e^{−n(0.2 − log 1.2)}. Note that we can obtain an exact expression for this tail probability. Indeed, X_1, X_1 + X_2, ..., X_1 + X_2 + ⋯ + X_n, ... are the events of a Poisson process with parameter λ = 1. Therefore we can compute the probability P(n⁻¹ Σ_{1≤i≤n} X_i > 1.2) exactly: it is the probability that the Poisson
process has at most n − 1 events before time 1.2n. Thus

P(n⁻¹ Σ_{1≤i≤n} X_i > 1.2) = P(Σ_{1≤i≤n} X_i > 1.2n) = e^{−1.2n} Σ_{0≤k≤n−1} (1.2n)^k/k!.

It is not at all clear how revealing this expression is. In hindsight, we know that it is approximately e^{−n(0.2 − log 1.2)}, obtained via large deviations theory.

Standard Normal distribution. Recall that M(θ) = e^{θ²/2} when X_1 has the standard Normal distribution. The expected value is µ = 0. Thus we fix a > 0 and obtain

I(a) = sup_θ (aθ − θ²/2) = a²/2,

achieved at θ* = a. Thus for a > 0, large deviations theory predicts that

P(n⁻¹ Σ_{1≤i≤n} X_i > a) ≤ e^{−n a²/2}.

Again we could compute this probability directly. We know that n⁻¹ Σ_{1≤i≤n} X_i is distributed as a Normal random variable with mean zero and variance 1/n. Thus

P(n⁻¹ Σ_{1≤i≤n} X_i > a) = √(n/(2π)) ∫_a^∞ e^{−nt²/2} dt.

After a little bit of technical work one can show that this integral is dominated by its part around a, namely ∫_a^{a+ε}, which is further approximated by the value of the integrand itself at a, namely √(n/(2π)) e^{−na²/2}. This is consistent with the value given by large deviations theory: the lower-order-magnitude term √(n/(2π)) simply disappears in the approximation on the log scale.
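The Exp(1) case above can also be checked numerically. The following Python sketch evaluates the exact Poisson-process expression for P(n⁻¹ Σ X_i > 1.2) alongside the large deviations estimate e^{−n(0.2 − log 1.2)}, and prints the empirical exponents −n⁻¹ log P; the particular values of n are assumptions chosen for illustration.

```python
from math import exp, log, lgamma

def exact_tail(n, a=1.2):
    # exact P(sum of n Exp(1) variables > a*n) = P(Poisson(a*n) <= n-1),
    # the Poisson-process identity derived above; terms are computed
    # in log space to avoid overflow for large n
    lam = a * n
    return sum(exp(-lam + k * log(lam) - lgamma(k + 1)) for k in range(n))

def ld_estimate(n, a=1.2):
    # large deviations estimate exp(-n*I(a)) with I(a) = a - 1 - log a (lambda = 1)
    return exp(-n * (a - 1 - log(a)))

for n in (10, 100, 500):
    p, q = exact_tail(n), ld_estimate(n)
    print(n, p, q, -log(p) / n, -log(q) / n)
```

The optimized Chernoff bound guarantees exact_tail(n) ≤ ld_estimate(n) for every n, and the gap between the two empirical exponents narrows as n grows; the gap at moderate n reflects the sub-exponential prefactor that disappears on the log scale.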
Poisson distribution. Suppose X has a Poisson distribution with parameter λ. Recall that in this case M(θ) = e^{e^θ λ − λ}. Then

I(a) = sup_θ (aθ − (e^θ λ − λ)).

Setting the derivative to zero we obtain θ* = log(a/λ) and

I(a) = a log(a/λ) − (a − λ).

Thus for a > λ, large deviations theory predicts that

P(n⁻¹ Σ_{1≤i≤n} X_i > a) ≤ e^{−n(a log(a/λ) − a + λ)}.

In this case as well we can compute the large deviations probability explicitly. The sum X_1 + ⋯ + X_n of n i.i.d. Poisson random variables is also a Poisson random variable, with parameter nλ. Therefore

P(Σ_{1≤i≤n} X_i > na) = e^{−nλ} Σ_{m>na} (nλ)^m/m!.

But again it is hard to infer a more explicit rate of decay from this expression.

5 Additional reading materials

Chapter 0 of [2]. This is a non-technical introduction to the field which describes the motivation and various applications of large deviations theory. Soft reading.

Chapter 2.2 of [1].

References

[1] A. Dembo and O. Zeitouni, Large deviations techniques and applications, Springer, 1998.

[2] A. Shwartz and A. Weiss, Large deviations for performance analysis, Chapman and Hall, 1995.
MIT OpenCourseWare
http://ocw.mit.edu

15.070J / 6.265J Advanced Stochastic Processes
Fall 2013

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.