Entropy and Ergodic Theory Notes 10: Large Deviations I


1 A change of convention

This is our first lecture on applications of entropy in probability theory. In probability theory, the convention is that all logarithms are natural unless stated otherwise. So, starting in this lecture, we adopt this convention for entropy: in the formula
$$H(p) = -\sum_{a \in A} p(a)\log p(a),$$
the logarithm is to base $e$.

2 A variant of the basic counting problem; divergence

This lecture starts with a probabilistic variant of our basic counting problem from the start of the course. Let $p \in \mathrm{Prob}(A)$, and suppose that $X = (X_1, \dots, X_n) \sim p^{\times n}$. Suppose also that $x \in A^n$ is a fixed string. We have seen that if $p_x = p$, then the probability $P(X = x)$ is $e^{-H(p)n}$ (previously it was written $2^{-H(p)n}$, since that was before we switched from $\log_2$ to natural logarithms). This fact was the basis of our original upper bound on the number of strings of type $p$. But now suppose that $p_x$ is some other element $q \in \mathrm{Prob}(A)$. Then a relative of the earlier calculation gives
$$P(X = x) = \prod_a p(a)^{N(a\,|\,x)} = \prod_a p(a)^{nq(a)}.$$
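The identity $P(X = x) = \prod_a p(a)^{nq(a)}$ is easy to check numerically: the log-probability of a string depends only on its type. A minimal sketch, with an illustrative distribution $p$ and string $x$ (not taken from the notes):

```python
import math
from collections import Counter

def string_log_prob(x, p):
    """log P(X = x) for X drawn i.i.d. from p, computed letter by letter."""
    return sum(math.log(p[a]) for a in x)

def type_log_prob(x, p):
    """The same quantity via the type formula:
    log P(X = x) = n * sum_a q(a) log p(a), where q = p_x is the type of x."""
    n = len(x)
    return n * sum((c / n) * math.log(p[a]) for a, c in Counter(x).items())

# Illustrative distribution and string (invented for this example).
p = {"a": 0.5, "b": 0.3, "c": 0.2}
x = "aabac"
assert abs(string_log_prob(x, p) - type_log_prob(x, p)) < 1e-12
```

Any string with the same type as `x` has exactly the same probability, which is what makes counting type classes so effective.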

If there exists $a \in A$ for which $p(a) = 0$ but $q(a) \neq 0$, then this probability is zero. Otherwise, we may write it as
$$\exp\Big[\Big(\sum_a q(a)\log p(a)\Big)n\Big]. \qquad (1)$$
In fact, we can write it this way in both cases, provided we adopt the natural conventions $\log 0 := -\infty$ and $\exp(-\infty) := 0$. Therefore
$$P(p_X = q) = P(X \in T_n(q)) = |T_n(q)|\cdot\exp\Big[\Big(\sum_a q(a)\log p(a)\Big)n\Big].$$
In case $\mathrm{denom}(q)$ divides $n$, our original estimates for $|T_n(q)|$ give
$$\exp\Big[\Big(-\sum_a q(a)\log q(a) + \sum_a q(a)\log p(a)\Big)n - o(n)\Big] = \exp\Big[-\Big(\sum_a q(a)\log\frac{q(a)}{p(a)}\Big)n - o(n)\Big] \le P(p_X = q) \le \exp\Big[-\Big(\sum_a q(a)\log\frac{q(a)}{p(a)}\Big)n\Big]. \qquad (2)$$

That is, we have estimated the probability that a random string drawn i.i.d. from $p$ has type equal to some other distribution $q$. The main negative term in the exponent has coefficient
$$\sum_a q(a)\log\frac{q(a)}{p(a)}.$$
This is called the divergence (or Kullback–Leibler divergence, information divergence, or relative entropy) of $q$ with respect to $p$. We denote it by $D(q\,\|\,p)$. If there exists $a \in A$ such that $p(a) = 0$ but $q(a) \neq 0$, then $D(q\,\|\,p) = +\infty$ by definition.

Example. If $p$ is the uniform distribution on $A$, then $p(a) = 1/|A|$ for every $a \in A$, and
$$D(q\,\|\,p) = \log|A| - H(q) \quad \forall q \in \mathrm{Prob}(A). \qquad (3)$$
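Both the definition of $D(q\,\|\,p)$ and identity (3) are short to verify in code. A minimal sketch, with illustrative distributions on a four-letter alphabet:

```python
import math

def entropy(p):
    """Shannon entropy H(p), with natural logarithms as in this lecture."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def divergence(q, p):
    """Kullback-Leibler divergence D(q||p); +infinity unless q << p."""
    d = 0.0
    for a, qa in q.items():
        if qa == 0:
            continue                 # convention: 0 log 0 = 0
        pa = p.get(a, 0.0)
        if pa == 0:
            return math.inf          # q is not absolutely continuous w.r.t. p
        d += qa * math.log(qa / pa)
    return d

# Illustrative alphabet and distribution (invented for this example).
A = ["a", "b", "c", "d"]
uniform = {a: 1 / len(A) for a in A}
q = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

# Identity (3): D(q || uniform) = log|A| - H(q).
assert abs(divergence(q, uniform) - (math.log(len(A)) - entropy(q))) < 1e-12
```

The same function also exhibits the infinite case: if some letter has $p(a) = 0$ but $q(a) > 0$, the divergence is $+\infty$, matching the convention in the text.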

In order to handle the case of infinite divergence succinctly, recall the following notion and notation from measure theory. If $(X, \mathcal{B})$ is a measurable space and $\mu$ and $\nu$ are two probability measures on it, then $\nu$ is absolutely continuous with respect to $\mu$, denoted $\nu \ll \mu$, if $\nu(A) = 0$ whenever $\mu(A) = 0$ for $A \in \mathcal{B}$. If $X$ is countable and $\mathcal{B}$ is the power set of $X$, then it is enough to check this for all singletons $A$.

Lemma 1. For fixed $p$, the quantity $D(q\,\|\,p)$ is a continuous and non-negative function on the set $\{q \in \mathrm{Prob}(A) : q \ll p\}$. It is zero if and only if $q = p$. The quantity $D(q\,\|\,p)$ is strictly convex as a function of the pair $(q, p) \in \mathrm{Prob}(A) \times \mathrm{Prob}(A)$ (this is stronger than being strictly convex in each argument separately).

Proof. Continuity is clear. To prove non-negativity, recall that the function $\eta(t) := t\log t$ is strictly convex on $(0, \infty)$. Since $p$ is a probability distribution, Jensen's inequality gives
$$D(q\,\|\,p) = \sum_a p(a)\,\eta\Big(\frac{q(a)}{p(a)}\Big) \ge \eta\Big(\sum_a p(a)\,\frac{q(a)}{p(a)}\Big) = \eta(1) = 0.$$
Equality occurs if and only if $q(a)/p(a)$ is the same value for every $a$. Since $q$ and $p$ are both probability distributions, this occurs only if $q = p$. We leave joint strict convexity as an exercise for the reader (it requires another, slightly less obvious application of Jensen's inequality).

In terms of divergence, we can re-write the calculations (1) and (2) as
$$P(X = x) = p^{\times n}(x) = e^{-[H(p_x) + D(p_x\,\|\,p)]n}$$
and
$$e^{-D(q\,\|\,p)n - o(n)} \le P(p_X = q) \le e^{-D(q\,\|\,p)n}$$
in case $\mathrm{denom}(q)$ divides $n$.

One important example of divergence has already been introduced by another name.

Example. Let $p \in \mathrm{Prob}(A \times B)$ have marginals $p_A$ and $p_B$. Let $(X, Y) \sim p$. In Lecture 2 we derived the formula
$$I(X;Y) = \sum_{a,b} p(a,b)\log\frac{p(a,b)}{p_A(a)\,p_B(b)}$$
(except with $\log_2$ in place of $\log$, similarly to the rest of this lecture). Thus, the mutual information between $X$ and $Y$ is equal to $D(p\,\|\,p_A \times p_B)$. It may be seen as comparing the true joint distribution $p$ to the distribution of independent RVs that have the same individual distributions as $X$ and $Y$.

3 Sanov's theorem

The calculation in (2) can be seen as a quantitative strengthening of the LLN for types. If $q \neq p$, then $D(q\,\|\,p) > 0$, and so we have shown that for $X \sim p^{\times n}$ the event $\{p_X = q\}$ is not only unlikely, but exponentially unlikely, with exponent given by $D(q\,\|\,p)$. By some simple estimates and summing over possible types, this can be turned into an estimate for the following more general question: If $X \sim p^{\times n}$ and $U \subseteq \mathrm{Prob}(A)$ is such that $\{p_X \in U\}$ is unlikely, then just how unlikely is it?

To lighten notation, let us now write
$$T_n(U) := \{x \in A^n : p_x \in U\} = \bigcup_{p \in U} T_n(p)$$
for any $n \in \mathbb{N}$ and $U \subseteq \mathrm{Prob}(A)$.

Theorem 2 (Sanov's theorem). Every $U \subseteq \mathrm{Prob}(A)$ satisfies
$$\exp\Big[-\Big(\inf_{q \in U^\circ} D(q\,\|\,p)\Big)n + o(n)\Big] \le P(p_X \in U) \le \exp\Big[-\Big(\inf_{q \in U} D(q\,\|\,p)\Big)n + o(n)\Big],$$
where $U^\circ$ is the interior of $U$ for the usual topology of $\mathrm{Prob}(A)$.
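For a two-letter alphabet, the probability in Sanov's theorem can be computed exactly as a binomial tail, which makes the $o(n)$ correction in Theorem 2 visible. The sketch below uses the illustrative choices $p = (0.5, 0.5)$ and $U = \{q : q(a) \ge 0.7\}$ (invented for this example, not from the notes); there the infimum of $D(q\,\|\,p)$ over $U$ is attained at the boundary point $q(a) = 0.7$:

```python
import math

def sanov_exponent(p_a, t):
    """inf_{q in U} D(q||p) for U = {q : q(a) >= t} on a two-letter alphabet,
    assuming t > p_a, so the infimum is attained at the boundary q(a) = t."""
    return t * math.log(t / p_a) + (1 - t) * math.log((1 - t) / (1 - p_a))

def exact_log_prob(n, p_a, t):
    """log P(p_X in U), computed exactly as a binomial tail probability."""
    k_min = math.ceil(n * t)
    prob = sum(math.comb(n, k) * p_a**k * (1 - p_a)**(n - k)
               for k in range(k_min, n + 1))
    return math.log(prob)

p_a, t = 0.5, 0.7                     # illustrative p and threshold
d = sanov_exponent(p_a, t)            # the Sanov exponent, ~0.0823
rates = [-exact_log_prob(n, p_a, t) / n for n in (50, 200, 800)]
# The finite-n rate -(1/n) log P(p_X in U) exceeds d and decreases toward it,
# reflecting the o(n) correction in Theorem 2.
```

The decreasing gap between the finite-$n$ rates and `d` is exactly the polynomial prefactor absorbed into $o(n)$ in the statement of the theorem.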

Corollary 3. If $U \subseteq \mathrm{Prob}(A)$ has the property that $\overline{U^\circ} \supseteq U$ (that is, the interior of $U$ is dense in $U$), then
$$P(p_X \in U) = \exp\Big[-\Big(\inf_{q \in U} D(q\,\|\,p)\Big)n + o(n)\Big].$$

Proof. Upper bound. This is another exercise in summing over typical sets. Let $d := \inf_{q \in U} D(q\,\|\,p)$. Then the upper bound (2) gives
$$P(p_X \in U) = \sum_{q \in U} P(X \in T_n(q)) \le \sum_{q \in U,\ \mathrm{denom}(q)\,|\,n} e^{-D(q\,\|\,p)n} \le |\{q \in U : \mathrm{denom}(q) \text{ divides } n\}|\cdot e^{-dn} \le (n+1)^{|A|} e^{-dn} = e^{-dn + o(n)}.$$

Lower bound. We need the following fact: For all sufficiently large $n$ it holds that, for any $p \in \mathrm{Prob}(A)$, there exists $q \in \mathrm{Prob}(A)$ with denominator $n$ such that $\|p - q\| \le 2|A|/n$. This was already used in Lecture 2, where we referred to it as (D).

Now let $d' := \inf_{q \in U^\circ} D(q\,\|\,p)$. Let $\varepsilon > 0$, and let $V \subseteq U^\circ$ be a nonempty open subset such that any $q \in V$ satisfies $D(q\,\|\,p) < d' + \varepsilon$. Since $V$ is open, fact (D) promises that, provided $n$ is sufficiently large, there exists $q \in V$ of denominator $n$. For this $q$, we may now apply the lower bound from (2) to obtain
$$P(p_X \in U) \ge P(p_X = q) = e^{-D(q\,\|\,p)n - o(n)} \ge e^{-(d' + \varepsilon)n - o(n)}.$$
Since $\varepsilon$ was arbitrary, this actually holds for some slowly decreasing sequence $\varepsilon_n \downarrow 0$ in place of $\varepsilon$. This completes the proof.

Using Lemma 1, we immediately deduce the following.

Corollary 4. If $p \in \mathrm{Prob}(A)$ and $U \subseteq \mathrm{Prob}(A)$ satisfy $p \notin \overline{U}$, and if $X \sim p^{\times n}$, then $P(p_X \in U)$ decays exponentially fast as $n \to \infty$.

Example. Let $p \in \mathrm{Prob}(A \times B)$ have marginals $p_A$ and $p_B$. Let $(X, Y) \sim p^{\times n}$. Then we have seen that $I(X_i; Y_i) = D(p\,\|\,p_A \times p_B)$ for each $i$ (except with $\log_2$ in place of $\log$, similarly to the rest of this lecture).
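The identity $I(X;Y) = D(p\,\|\,p_A \times p_B)$ can be confirmed numerically by computing the mutual information through a different route, namely $H(p_A) + H(p_B) - H(p)$. A minimal sketch, with an illustrative joint distribution on $\{0,1\}\times\{0,1\}$ (invented for this example):

```python
import math

def entropy(p):
    """Shannon entropy with natural logs, matching this lecture's convention."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def divergence(q, p):
    """D(q||p) for distributions on a common finite set; assumes q << p."""
    return sum(qv * math.log(qv / p[k]) for k, qv in q.items() if qv > 0)

# Illustrative joint distribution p on {0,1} x {0,1} (not from the notes).
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_A = {a: sum(v for (x, _), v in p.items() if x == a) for a in (0, 1)}
p_B = {b: sum(v for (_, y), v in p.items() if y == b) for b in (0, 1)}
prod = {(a, b): p_A[a] * p_B[b] for a in p_A for b in p_B}

# I(X;Y) computed as H(p_A) + H(p_B) - H(p) agrees with D(p || p_A x p_B),
# and is strictly positive here since X and Y are dependent.
mi = entropy(p_A) + entropy(p_B) - entropy(p)
assert abs(mi - divergence(p, prod)) < 1e-12
```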

Now suppose we draw random strings $X \sim p_A^{\times n}$ and $Y \sim p_B^{\times n}$ independently of each other. The LLN for types gives that $X$ and $Y$ are approximately typical for $p_A$ and $p_B$ with high probability. But if $p \neq p_A \times p_B$, then $(X, Y)$ is unlikely to be approximately typical for $p$. Sanov's theorem lets us estimate this small probability: for $\delta > 0$, that theorem gives
$$P\big((X, Y) \in T_{n,\delta}(p)\big) = e^{-D(p\,\|\,p_A \times p_B)n + \varepsilon(\delta)n + o(n)} = e^{-I(X_i;Y_i)n + \varepsilon(\delta)n + o(n)}.$$
This can also be obtained from our earlier description of the typical set $T_{n,\delta}(p)$ in terms of conditional entropy.

4 The conditional limit theorem

Sanov's theorem can be used to answer the following variant of the previous question: Suppose that $X \sim p^{\times n}$ and that $E \subseteq \mathrm{Prob}(A)$ is such that $\{p_X \in E\}$ is unlikely. Conditionally on the unlikely occurrence $\{p_X \in E\}$, what is the most likely behaviour of $p_X$? (Put another way, what is the least unlikely way for this unlikely event to occur?)

We can answer this rather completely in case $E$ is closed, convex, and has nonempty interior (in $\mathrm{Prob}(A)$). In that case, if $p_X$ does land in the unlikely set $E$, then it turns out that it is overwhelmingly likely to be close to a particular distinguished element of $E$.

Lemma 5. If $p \in \mathrm{Prob}(A)$ and if $E \subseteq \mathrm{Prob}(A)$ is closed and convex, then the function $D(\cdot\,\|\,p)$ has a unique minimizer on $E$.

Proof. This follows at once from the continuity and strict convexity of $D(\cdot\,\|\,p)$.

Theorem 6. Let $E \subseteq \mathrm{Prob}(A)$ be closed, convex, and have nonempty interior, and suppose that $p \in \mathrm{Prob}(A) \setminus E$. Let $q^\star$ be the unique minimizer of $D(\cdot\,\|\,p)$ in $E$, and assume that $D(q^\star\,\|\,p) < \infty$. Then, for any $\varepsilon > 0$, we have
$$P\big(\|p_X - q^\star\| < \varepsilon \,\big|\, p_X \in E\big) \to 1 \quad \text{as } n \to \infty,$$
and the convergence is exponential in $n$.
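Both Lemma 5's minimizer and Theorem 6's concentration can be seen numerically in a tiny case. A minimal sketch, assuming a two-letter alphabet, $p = (0.5, 0.5)$, and the illustrative closed convex set $E = \{q : q(a) \ge 0.7\}$ (all parameters invented for this example); strict convexity forces the unique minimizer onto the boundary of $E$:

```python
import math
import random

def divergence_binary(qa, pa):
    """D(q||p) on a two-letter alphabet, parametrized by the mass on letter a."""
    total = 0.0
    for q, p in ((qa, pa), (1 - qa, 1 - pa)):
        if q > 0:
            total += q * math.log(q / p)
    return total

pa, thr = 0.5, 0.7                # p = (0.5, 0.5); E = {q : q(a) >= 0.7}

# Minimizer of D(.||p) on E, by grid search over E. Since D(.||p) is
# strictly convex with global minimum at p, outside E, the unique
# minimizer on E sits at the boundary: q* = (0.7, 0.3).
grid = [thr + (1 - thr) * k / 10_000 for k in range(10_001)]
q_star = min(grid, key=lambda qa: divergence_binary(qa, pa))

# Theorem 6 by simulation: conditionally on the rare event {p_X in E},
# the empirical distribution p_X concentrates near q*.
random.seed(0)
n, trials = 40, 100_000
hits = []
for _ in range(trials):
    q = sum(random.getrandbits(1) for _ in range(n)) / n
    if q >= thr:                   # the rare event occurred
        hits.append(q)
avg = sum(hits) / len(hits)        # close to q*(a) = 0.7
```

The conditional average `avg` lies just above $0.7$: the least unlikely way for the rare event to occur is for $p_X$ to land essentially at $q^\star$, with the small excess shrinking as $n$ grows.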

Remark. The assumption that $D(q^\star\,\|\,p) < \infty$ merely excludes degenerate cases: if this does not hold, then the event $\{p_X \in E\}$ simply has zero probability.

Proof. Let $d := \inf_{q \in E} D(q\,\|\,p) = D(q^\star\,\|\,p)$. Let us partition $E$ as
$$\big(E \cap B_\varepsilon(q^\star)\big) \cup \big(E \setminus B_\varepsilon(q^\star)\big),$$
where $B_\varepsilon(q^\star)$ is the open $\varepsilon$-ball around $q^\star$ in total variation. Assume $\varepsilon$ is small enough that both parts are nonempty (otherwise the desired probability is always equal to 1). Since $q^\star$ is the unique minimizer on $E$, and since $E \setminus B_\varepsilon(q^\star)$ is closed and hence compact, we have
$$d' := \min_{q \in E \setminus B_\varepsilon(q^\star)} D(q\,\|\,p) > d.$$
Now apply Sanov's theorem to both of the sets $E$ and $E \setminus B_\varepsilon(q^\star)$:
$$P(p_X \in E) = e^{-dn + o(n)} \quad \text{and} \quad P\big(p_X \in E \setminus B_\varepsilon(q^\star)\big) \le e^{-d'n + o(n)}.$$
The first of these is a true asymptotic because $\overline{E^\circ} \supseteq E$, and so we may actually apply Corollary 3. The second estimate here is just the upper bound from Theorem 2. Combining these, we obtain
$$P\big(\|p_X - q^\star\| \ge \varepsilon \,\big|\, p_X \in E\big) = \frac{P(p_X \in E \setminus B_\varepsilon(q^\star))}{P(p_X \in E)} \le e^{-(d' - d)n + o(n)}.$$
This tends to 0 exponentially fast as $n \to \infty$.

Corollary 7 (The conditional limit theorem). Under the conditions of the previous theorem, we have
$$P\big(X_1 = a \,\big|\, p_X \in E\big) \to q^\star(a) \quad \text{as } n \to \infty$$
for every $a \in A$.

Proof. The law of $X$ and the event $\{p_X \in E\}$ are both invariant under permuting the entries of the random string $X = (X_1, \dots, X_n)$. Therefore the value $P(X_i = a \mid p_X \in E)$ is the same for every $i \in [n]$, and so it equals
$$\frac{1}{n}\sum_{i=1}^n P\big(X_i = a \,\big|\, p_X \in E\big) = E\Big(\frac{1}{n}\sum_{i=1}^n 1_{\{X_i = a\}} \,\Big|\, p_X \in E\Big) = E\big(p_X(a) \,\big|\, p_X \in E\big).$$

By Theorem 6, under the conditioned measure $P(\,\cdot \mid p_X \in E)$, the empirical distribution stays within any $\varepsilon$ of $q^\star$ with high probability as $n \to \infty$, and so the above expectation also stays within any $\varepsilon$ of $q^\star(a)$ as $n \to \infty$; that is, it converges to $q^\star(a)$.

With a similar proof and just a little extra work, one obtains the following strengthening.

Corollary 8 (Multiple conditional limit theorem). Under the same conditions as above, and for any fixed $m \in \mathbb{N}$, we have
$$P\big(X_1 = a_1,\, X_2 = a_2,\, \dots,\, X_m = a_m \,\big|\, p_X \in E\big) \to q^\star(a_1)\,q^\star(a_2)\cdots q^\star(a_m) \quad \text{as } n \to \infty$$
for every $a_1, \dots, a_m \in A$.

5 Notes and remarks

Our basic source for this lecture is [CT06]:
- For divergence and its properties: Sections 2.5–2.7 (also, see the end of Section 2.5 for a conditional version of divergence);
- For the probability of a type class: Theorem 11.1.4;
- For Sanov's theorem: Theorem 11.4.1;
- For the conditional limit theorem: Theorem 11.6.2.

See the rest of [CT06, Section 11.6] for a finer analysis of the relation between the quantities $D(q\,\|\,p)$ and $\|q - p\|$. See the remaining sections of [CT06, Chapter 11] for some valuable applications to statistics.

Sanov's theorem and the conditional limit theorem are just the start of the large subfield of probability called large deviations theory. Our next lecture is also given to results from this field. Several good sources are available to take the reader further in this direction:

The book [DZ10] is dedicated to large deviations theory, and introduces many aspects of the subject at quite a gentle pace. See, for instance, [DZ10, Theorem 3.3.3] for the generalization of the conditional limit theorem to non-convex sets $E$.

The lecture notes [Var] go rather faster, but are still clear and introduce several interesting advanced topics.

The monograph [Ell06] also goes quite gently and emphasizes the connection to statistical physics. I will aim to make this connection a bit later in the course. The classic article [Lan73] has a similar point of view, but requires a little more mathematical background. Section A4 of that article is on large deviations.

Several refinements are known to the conditional limit theorem. The papers [Csi75, Csi84, DZ96] cover several of these, and explain the subject very clearly.

See [Kul97] for much more on the application of information-theoretic quantities in statistics.

References

[Csi75] I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Ann. Probability, 3:146–158, 1975.

[Csi84] Imre Csiszár. Sanov property, generalized I-projection and a conditional limit theorem. Ann. Probab., 12(3):768–793, 1984.

[CT06] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition, 2006.

[DZ96] A. Dembo and O. Zeitouni. Refinements of the Gibbs conditioning principle. Probab. Theory Related Fields, 104(1):1–14, 1996.

[DZ10] Amir Dembo and Ofer Zeitouni. Large deviations techniques and applications, volume 38 of Stochastic Modelling and Applied Probability. Springer-Verlag, Berlin, 2010. Corrected reprint of the second (1998) edition.

[Ell06] Richard S. Ellis. Entropy, large deviations, and statistical mechanics. Classics in Mathematics. Springer-Verlag, Berlin, 2006. Reprint of the 1985 original.

[Kul97] Solomon Kullback. Information theory and statistics. Dover Publications, Inc., Mineola, NY, 1997. Reprint of the second (1968) edition.

[Lan73] O. E. Lanford III. Entropy and equilibrium states in classical statistical mechanics. In conference proceedings on Statistical Mechanics and Mathematical Problems. Springer, 1973.

[Var] S. R. S. Varadhan. Large deviations and applications. Number 46 in CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM).

TIM AUSTIN
Email: tim@math.ucla.edu
URL: math.ucla.edu/~tim