Entropy and Ergodic Theory
Notes 10: Large Deviations I

1 A change of convention

This is our first lecture on applications of entropy in probability theory. In probability theory, the convention is that all logarithms are natural unless stated otherwise. So, starting in this lecture, we adopt this convention for entropy: in the formula

H(p) = −∑_{a∈A} p(a) log p(a),

the logarithm is to base e.

2 A variant of the basic counting problem; divergence

This lecture starts with a probabilistic variant of our basic counting problem from the start of the course. Let p ∈ Prob(A), and suppose that X = (X_1, ..., X_n) ∼ p^{×n}. Suppose also that x ∈ A^n is a fixed string. We have seen that if p_x = p, then the probability P(X = x) is e^{−H(p)n} (previously it was written 2^{−H(p)n}, since that was before we switched from log_2 to natural logarithms). This fact was the basis of our original upper bound on the number of strings of type p.

But now suppose that p_x is some other element q ∈ Prob(A). Then a relative of the earlier calculation gives

P(X = x) = ∏_{a∈A} p(a)^{N(a | x)} = ∏_{a∈A} p(a)^{n q(a)}.
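As a sanity check, the identity P(X = x) = e^{−H(p)n} for a string x whose type equals p exactly can be verified numerically; with natural logarithms it is an exact identity, not just an asymptotic one. The following is a minimal sketch in which the two-letter alphabet and the distribution p are hypothetical choices, not from the text:

```python
import math

def string_prob(p, x):
    """P(X = x) when X = (X_1, ..., X_n) is drawn i.i.d. from p."""
    prob = 1.0
    for a in x:
        prob *= p[a]
    return prob

def entropy(p):
    """Shannon entropy H(p) in nats (natural log, per the new convention)."""
    return -sum(pa * math.log(pa) for pa in p.values() if pa > 0)

p = {0: 0.75, 1: 0.25}   # hypothetical distribution on a two-letter alphabet
x = (0, 0, 0, 1)         # a string of length 4 whose type p_x equals p exactly
n = len(x)

# When p_x = p, the identity P(X = x) = e^{-H(p) n} holds exactly.
assert abs(string_prob(p, x) - math.exp(-entropy(p) * n)) < 1e-12
```

The assertion holds up to floating-point rounding because ∏_a p(a)^{n p(a)} = exp(n ∑_a p(a) log p(a)) = e^{−H(p)n} algebraically.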
If there exists a ∈ A for which p(a) = 0 but q(a) ≠ 0, then this probability is zero. Otherwise, we may write it as

exp[(∑_{a∈A} q(a) log p(a)) n].   (1)

In fact, we can write it this way in both cases, provided we adopt the natural conventions log 0 := −∞ and exp(−∞) := 0. Therefore

P(p_X = q) = P(X ∈ T_n(q)) = |T_n(q)| · exp[(∑_{a∈A} q(a) log p(a)) n].

In case denom(q) divides n, our original estimates for |T_n(q)| give

exp[(−∑_a q(a) log q(a) + ∑_a q(a) log p(a)) n − o(n)]
  = exp[−(∑_a q(a) log (q(a)/p(a))) n − o(n)]
  ≤ P(p_X = q)
  ≤ exp[−(∑_a q(a) log (q(a)/p(a))) n].   (2)

That is, we have estimated the probability that a random string drawn i.i.d. from p has type equal to some other distribution q. The main negative term in the exponent has coefficient

∑_{a∈A} q(a) log (q(a)/p(a)).

This is called the divergence (or Kullback-Leibler divergence, information divergence, or relative entropy) of q with respect to p. We denote it by D(q‖p). If there exists a ∈ A such that p(a) = 0 but q(a) ≠ 0, then D(q‖p) := +∞ by definition.

Example. If p is the uniform distribution on A, then p(a) = 1/|A| for every a ∈ A, and

D(q‖p) = log|A| − H(q)   for all q ∈ Prob(A).   (3)
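The divergence and the identity (3) are easy to check numerically. Here is a minimal sketch; the three-letter alphabet and the distribution q are hypothetical choices made only for illustration:

```python
import math

def D(q, p):
    """Kullback-Leibler divergence D(q || p) in nats, for finite alphabets."""
    total = 0.0
    for a, qa in q.items():
        if qa == 0:
            continue                  # convention: 0 log 0 = 0
        if p.get(a, 0) == 0:
            return math.inf           # q is not absolutely continuous w.r.t. p
        total += qa * math.log(qa / p[a])
    return total

A = ['a', 'b', 'c']
unif = {a: 1 / len(A) for a in A}     # the uniform distribution on A
q = {'a': 0.5, 'b': 0.3, 'c': 0.2}    # a hypothetical element of Prob(A)
H_q = -sum(v * math.log(v) for v in q.values())

# Equation (3): divergence from the uniform distribution is log|A| - H(q).
assert abs(D(q, unif) - (math.log(len(A)) - H_q)) < 1e-12
# D(q || q) = 0, consistent with Lemma 1 below.
assert D(q, q) == 0.0
```

The early-return branch implements the convention D(q‖p) = +∞ when absolute continuity fails.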
In order to handle the case of infinite divergence succinctly, recall the following notion and notation from measure theory. If (X, B) is a measurable space and µ and ν are two probability measures on it, then ν is absolutely continuous with respect to µ, denoted ν ≪ µ, if ν(A) = 0 whenever µ(A) = 0 for A ∈ B. If X is countable and B is the power set of X, then it is enough to check this for all singletons A.

Lemma 1. For fixed p, the quantity D(q‖p) is a continuous and non-negative function of q on the set {q ∈ Prob(A) : q ≪ p}. It is zero if and only if q = p. The quantity D(q‖p) is strictly convex as a function of the pair (q, p) ∈ Prob(A) × Prob(A) (this is stronger than being strictly convex in each argument separately).

Proof. Continuity is clear. To prove non-negativity, recall that the function η(t) := t log t is strictly convex on (0, ∞). Since p is a probability distribution, Jensen's inequality gives

D(q‖p) = ∑_a p(a) η(q(a)/p(a)) ≥ η(∑_a p(a) · q(a)/p(a)) = η(1) = 0.

Equality occurs if and only if q(a)/p(a) is the same value for every a. Since q and p are both probability distributions, this occurs only if q = p. We leave joint strict convexity as an exercise for the reader (it requires another, slightly less obvious application of Jensen's inequality). □

In terms of divergence, we can re-write the calculations (1) and (2) as

P(X = x) = p^{×n}(x) = e^{−[H(p_x) + D(p_x‖p)] n}

and

e^{−D(q‖p) n − o(n)} ≤ P(p_X = q) ≤ e^{−D(q‖p) n}   in case denom(q) divides n.

One important example of divergence has already been introduced by another name.
Example. Let p ∈ Prob(A × B) have marginals p_A and p_B. Let (X, Y) ∼ p. In Lecture 2 we derived the formula

I(X ; Y) = ∑_{a,b} p(a, b) log [ p(a, b) / (p_A(a) p_B(b)) ]

(except with log_2 in place of log, similarly to the rest of this lecture). Thus, the mutual information between X and Y is equal to D(p‖p_A × p_B). It may be seen as comparing the true joint distribution p to the distribution of independent RVs that have the same individual distributions as X and Y.

3 Sanov's theorem

The calculation in (2) can be seen as a quantitative strengthening of the LLN for types. If q ≠ p, then D(q‖p) > 0, and so we have shown that for X ∼ p^{×n} the event {p_X = q} is not only unlikely, but exponentially unlikely, with exponent given by D(q‖p). By some simple estimates and summing over possible types, this can be turned into an estimate for the following more general question: If X ∼ p^{×n} and U ⊆ Prob(A) is such that {p_X ∈ U} is unlikely, then just how unlikely is it?

To lighten notation, let us now write

T_n(U) := {x ∈ A^n : p_x ∈ U} = ⋃_{p ∈ U} T_n(p)

for any n ∈ N and U ⊆ Prob(A).

Theorem 2 (Sanov's theorem). Every U ⊆ Prob(A) satisfies

exp([−inf_{q∈U°} D(q‖p)] n − o(n)) ≤ P(p_X ∈ U) ≤ exp([−inf_{q∈U} D(q‖p)] n + o(n)),

where U° is the interior of U for the usual topology of Prob(A).
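Sanov's exponent can be compared with an exactly computable probability in a small case. The following sketch uses hypothetical choices (a two-letter alphabet, p(1) = 0.3, and U = {q : q(1) ≥ 1/2}, so that {p_X ∈ U} is a binomial tail event); the Chernoff bound even guarantees that the finite-n rate is at least the infimum of the divergence over U:

```python
from math import comb, log

def kl_bernoulli(a, p):
    """D(Ber(a) || Ber(p)) in nats."""
    return a * log(a / p) + (1 - a) * log((1 - a) / (1 - p))

n, p1 = 400, 0.3   # hypothetical choices for illustration
# U = {q : q(1) >= 1/2}: the event {p_X in U} is {number of 1's >= n/2},
# whose probability is an exact binomial tail sum.
tail = sum(comb(n, k) * p1**k * (1 - p1)**(n - k) for k in range(n // 2, n + 1))
rate = -log(tail) / n   # the finite-n exponent -(1/n) log P(p_X in U)

d_star = kl_bernoulli(0.5, p1)   # inf over U of D(q || p), attained on the boundary
# Sanov: rate = d_star + o(1); Chernoff gives rate >= d_star at every n.
assert 0 <= rate - d_star < 0.02
```

At n = 400 the gap between the empirical rate and d_star is of order (log n)/n, which is the o(n)/n correction in the theorem.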
Corollary 3. If U ⊆ Prob(A) has the property that U is contained in the closure of U° (that is, the interior of U is dense in U), then

P(p_X ∈ U) = exp([−inf_{q∈U} D(q‖p)] n + o(n)).

Proof. Upper bound. This is another exercise in summing over typical sets. Let d := inf_{q∈U} D(q‖p). Then the upper bound in (2) gives

P(p_X ∈ U) = ∑_{q∈U, denom(q) | n} P(p_X ∈ T_n(q)) ≤ ∑_{q∈U, denom(q) | n} e^{−D(q‖p) n} ≤ |{q ∈ U : denom(q) divides n}| · e^{−dn} ≤ (n + 1)^{|A|} e^{−dn} = e^{−dn + o(n)}.

Lower bound. We need the following fact: for all sufficiently large n it holds that, for any p ∈ Prob(A), there exists q ∈ Prob(A) with denominator n such that ‖p − q‖ ≤ 2|A|/n. This was already used in Lecture 2, where we referred to it as (D).

Now let d° := inf_{q∈U°} D(q‖p). Let ε > 0, and let V ⊆ U° be a nonempty open subset such that any q ∈ V satisfies D(q‖p) < d° + ε. Since V is open, fact (D) promises that, provided n is sufficiently large, there exists q ∈ V of denominator n. For this q, we may now apply the lower bound from (2) to obtain

P(p_X ∈ U) ≥ P(p_X = q) = e^{−D(q‖p) n − o(n)} ≥ e^{−(d° + ε) n − o(n)}.

Since ε was arbitrary, this actually holds for some slowly decreasing sequence ε_n ↓ 0 in place of ε. This completes the proof. □

Using Lemma 1, we immediately deduce the following.

Corollary 4. If p ∈ Prob(A) and U ⊆ Prob(A) satisfy p ∉ cl(U), and if X ∼ p^{×n}, then P(p_X ∈ U) decays exponentially fast as n → ∞.

Example. Let p ∈ Prob(A × B) have marginals p_A and p_B. Let (X, Y) ∼ p^{×n}. Then we have seen that I(X_i ; Y_i) = D(p‖p_A × p_B) for each i (except with log_2 in place of log, similarly to the rest of this lecture).
Now suppose we draw random strings X ∼ p_A^{×n} and Y ∼ p_B^{×n} independently of each other. The LLN for types gives that X and Y are approximately typical for p_A and p_B with high probability. But if p ≠ p_A × p_B, then (X, Y) is unlikely to be approximately typical for p. Sanov's theorem lets us estimate this small probability: for δ > 0, that theorem gives

P((X, Y) ∈ T_{n,δ}(p)) = e^{−D(p‖p_A × p_B) n + O(δ) n + o(n)} = e^{−I(X_i ; Y_i) n + O(δ) n + o(n)}.

This can also be obtained from our earlier description of the typical set T_{n,δ}(p) in terms of conditional entropy.

4 The conditional limit theorem

Sanov's theorem can be used to answer the following variant of the previous question: Suppose that X ∼ p^{×n} and that E ⊆ Prob(A) is such that {p_X ∈ E} is unlikely. Conditionally on the unlikely occurrence {p_X ∈ E}, what is the most likely behaviour of p_X? (Put another way, what is the least unlikely way for this unlikely event to occur?)

We can answer this rather completely in case E is closed, convex, and has nonempty interior (in Prob(A)). In that case, if p_X does land in the unlikely set E, then it turns out that it is overwhelmingly likely to be close to a particular distinguished element of E.

Lemma 5. If p ∈ Prob(A) and if E ⊆ Prob(A) is closed and convex, then the function D(·‖p) has a unique minimizer on E.

Proof. This follows at once from the continuity and strict convexity of D(·‖p). □

Theorem 6. Let E ⊆ Prob(A) be closed, convex, and have nonempty interior, and suppose that p ∈ Prob(A) \ E. Let q* be the unique minimizer of D(·‖p) in E, and assume that D(q*‖p) < ∞. Then, for any ε > 0, we have

P(‖p_X − q*‖ < ε | p_X ∈ E) → 1 as n → ∞,

and the convergence is exponential in n.
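The minimizer q* in Lemma 5 (often called the I-projection of p onto E) can be located numerically when the alphabet is small. Here is a minimal sketch for the two-letter case, with hypothetical choices E = {q : q(1) ≥ 0.7} and p(1) = 0.3; since D(Ber(t)‖Ber(0.3)) is increasing in t for t ≥ 0.3, the minimizer sits on the boundary of E:

```python
import math

def D(q1, p1):
    """D(Ber(q1) || Ber(p1)) in nats, for a two-letter alphabet."""
    d = 0.0
    for qa, pa in ((q1, p1), (1 - q1, 1 - p1)):
        if qa > 0:
            d += qa * math.log(qa / pa)
    return d

p1 = 0.3
# Hypothetical closed convex set E = {q : q(1) >= 0.7} inside Prob({0, 1}),
# discretized to a fine grid for a brute-force search.
grid = [k / 1000 for k in range(700, 1001)]
q_star = min(grid, key=lambda t: D(t, p1))

# D(Ber(t) || Ber(0.3)) is increasing for t >= 0.3, so the minimizer over E
# lies on the boundary: q*(1) = 0.7.
assert q_star == 0.7
```

By Theorem 6, conditionally on {p_X ∈ E} the empirical distribution concentrates near this boundary point q*, not near the centre of E.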
Remark. The assumption that D(q*‖p) < ∞ merely excludes a degenerate case: if it does not hold, then the event {p_X ∈ E} simply has zero probability.

Proof. Let d := inf_{q∈E} D(q‖p) = D(q*‖p). Let us partition E as

(E ∩ B_ε(q*)) ∪ (E \ B_ε(q*)),

where B_ε(q*) is the open ε-ball around q* in total variation. Assume ε is small enough that both parts are nonempty (otherwise the desired probability is always equal to 1). Since q* is the unique minimizer on E, and since E \ B_ε(q*) is closed and hence compact, we have

d′ := min_{q ∈ E \ B_ε(q*)} D(q‖p) > d.

Now apply Sanov's theorem to both of the sets E and E \ B_ε(q*):

P(p_X ∈ E) = e^{−dn + o(n)} and P(p_X ∈ E \ B_ε(q*)) ≤ e^{−d′n + o(n)}.

The first of these is a true asymptotic because E is contained in the closure of E°, and so we may actually apply Corollary 3. The second estimate here is just the upper bound from Theorem 2. Combining these, we obtain

P(‖p_X − q*‖ ≥ ε | p_X ∈ E) = P(p_X ∈ E \ B_ε(q*)) / P(p_X ∈ E) ≤ e^{−(d′ − d) n + o(n)}.

This tends to 0 exponentially fast as n → ∞. □

Corollary 7 (The conditional limit theorem). Under the conditions of the previous theorem, we have

P(X_1 = a | p_X ∈ E) → q*(a) as n → ∞

for every a ∈ A.

Proof. The law of X and the event {p_X ∈ E} are both invariant under permuting the entries of the random string X = (X_1, ..., X_n). Therefore the value P(X_i = a | p_X ∈ E) is the same for every i ∈ [n], and so it equals

(1/n) ∑_{i=1}^n P(X_i = a | p_X ∈ E) = E[(1/n) ∑_{i=1}^n 1_{X_i = a} | p_X ∈ E] = E[p_X(a) | p_X ∈ E].
By Theorem 6, under the conditioned measure P(· | p_X ∈ E), the empirical distribution p_X stays within any ε of q* with high probability as n → ∞, and so the above expectation also stays within any ε of q*(a) as n → ∞; that is, it converges to q*(a). □

With a similar proof and just a little extra work, one obtains the following strengthening.

Corollary 8 (Multiple conditional limit theorem). Under the same conditions as above, and for any fixed m ∈ N, we have

P(X_1 = a_1, X_2 = a_2, ..., X_m = a_m | p_X ∈ E) → q*(a_1) q*(a_2) ··· q*(a_m) as n → ∞

for every a_1, ..., a_m ∈ A.

5 Notes and remarks

Our basic source for this lecture is [CT06]:

For divergence and its properties, Sections 2.5-2.7 (also, see the end of Section 2.5 for a conditional version of divergence);
For the probability of a type class: Theorem 11.1.4;
Sanov's theorem: Theorem 11.4.1;
The conditional limit theorem: Theorem 11.6.2.

See the rest of [CT06, Section 11.6] for a finer analysis of the relation between the quantities D(q‖p) and ‖q − p‖. See the remaining sections of [CT06, Chapter 11] for some valuable applications to statistics.

Sanov's theorem and the conditional limit theorem are just the start of the large subfield of probability called large deviations theory. Our next lecture is also given to results from this field. Several good sources are available to take the reader further in this direction:

The book [DZ10] is dedicated to large deviations theory, and introduces many aspects of the subject at quite a gentle pace. See, for instance, [DZ10, Theorem 3.3.3] for the generalization of the conditional limit theorem to non-convex sets E.
The lecture notes [Var] go rather faster, but are still clear and introduce several interesting advanced topics.

The monograph [Ell06] also goes quite gently and emphasizes the connection to statistical physics. I will aim to make this connection a bit later in the course. The classic article [Lan73] has a similar point of view, but requires a little more mathematical background. Section A4 of that article is on large deviations.

Several refinements are known to the conditional limit theorem. The papers [Csi75, Csi84, DZ96] cover several of these, and explain the subject very clearly.

See [Kul97] for much more on the application of information-theoretic quantities in statistics.

References

[Csi75] I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Ann. Probability, 3:146-158, 1975.

[Csi84] Imre Csiszár. Sanov property, generalized I-projection and a conditional limit theorem. Ann. Probab., 12(3):768-793, 1984.

[CT06] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition, 2006.

[DZ96] A. Dembo and O. Zeitouni. Refinements of the Gibbs conditioning principle. Probab. Theory Related Fields, 104(1):1-14, 1996.

[DZ10] Amir Dembo and Ofer Zeitouni. Large deviations techniques and applications, volume 38 of Stochastic Modelling and Applied Probability. Springer-Verlag, Berlin, 2010. Corrected reprint of the second (1998) edition.

[Ell06] Richard S. Ellis. Entropy, large deviations, and statistical mechanics. Classics in Mathematics. Springer-Verlag, Berlin, 2006. Reprint of the 1985 original.
[Kul97] Solomon Kullback. Information theory and statistics. Dover Publications, Inc., Mineola, NY, 1997. Reprint of the second (1968) edition.

[Lan73] O. E. Lanford III. Entropy and equilibrium states in classical statistical mechanics. In Statistical Mechanics and Mathematical Problems (conference proceedings). Springer, 1973.

[Var] S. R. S. Varadhan. Large deviations and applications. Number 46 in CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM).

TIM AUSTIN

Email: tim@math.ucla.edu
URL: math.ucla.edu/~tim