
A Course on Large Deviations with an Introduction to Gibbs Measures

Firas Rassoul-Agha
Timo Seppäläinen

Department of Mathematics, University of Utah, 155 South 1400 East, Salt Lake City, UT 84112, USA

Mathematics Department, University of Wisconsin-Madison, 419 Van Vleck Hall, Madison, WI 53706, USA

© Copyright 2014 Firas Rassoul-Agha and Timo Seppäläinen

2000 Mathematics Subject Classification. Primary 60F10, 82B20

Key words and phrases. convex analysis, Gibbs measure, Ising model, large deviations, Markov chain, percolation, phase transition, random cluster model, random walk in a random medium, relative entropy, statistical mechanics, variational principle

To Alla, Maxim, and Kirill

To Celeste, David, Ansa, and Timo


Contents

Preface

Part I. Large deviations: general theory and i.i.d. processes

Chapter 1. Introductory discussion
1.1. Information-theoretic entropy
1.2. Thermodynamic entropy
1.3. Large deviations as useful estimates

Chapter 2. The large deviation principle
2.1. Precise asymptotics on an exponential scale
2.2. Lower semicontinuous and tight rate functions
2.3. Weak large deviation principle
2.4. Aspects of Cramér's theorem
2.5. Limits, deviations, and fluctuations

Chapter 3. Large deviations and asymptotics of integrals
3.1. Contraction principle
3.2. Varadhan's theorem
3.3. Bryc's theorem
3.4. Curie-Weiss model of ferromagnetism

Chapter 4. Convex analysis in large deviation theory
4.1. Some elementary convex analysis
4.2. Rate function as a convex conjugate
4.3. Multidimensional Cramér's theorem

Chapter 5. Relative entropy and large deviations for empirical measures
5.1. Relative entropy
5.2. Sanov's theorem
5.3. Maximum entropy principle

Chapter 6. Process level large deviations for i.i.d. fields
6.1. Setting
6.2. Specific relative entropy
6.3. Pressure and the large deviation principle

Part II. Statistical mechanics

Chapter 7. Formalism for classical lattice systems
7.1. Finite volume model
7.2. Potentials and Hamiltonians
7.3. Specifications
7.4. Phase transition
7.5. Extreme Gibbs measures
7.6. Uniqueness for small potentials

Chapter 8. Large deviations and equilibrium statistical mechanics
8.1. Thermodynamic limit of the pressure
8.2. Entropy and large deviations under Gibbs measures
8.3. Dobrushin-Lanford-Ruelle (DLR) variational principle

Chapter 9. Phase transition in the Ising model
9.1. One-dimensional Ising model
9.2. Phase transition at low temperature
9.3. Case of no external field
9.4. Case of nonzero external field

Chapter 10. Percolation approach to phase transition
10.1. Bernoulli bond percolation and random cluster measures
10.2. Ising phase transition revisited

Part III. Additional large deviation topics

Chapter 11. Further asymptotics for i.i.d. random variables
11.1. Refinement of Cramér's theorem
11.2. Moderate deviations

Chapter 12. Large deviations through the limiting generating function
12.1. Essential smoothness and exposed points
12.2. Gärtner-Ellis theorem
12.3. Large deviations for the current of particles

Chapter 13. Large deviations for Markov chains
13.1. Relative entropy for kernels
13.2. Countable Markov chains
13.3. Finite Markov chains

Chapter 14. Convexity criterion for large deviations

Chapter 15. Nonstationary independent variables
15.1. Generalization of relative entropy and Sanov's theorem
15.2. Proof of the large deviation principle

Chapter 16. Random walk in a dynamical random environment
16.1. Quenched large deviation principles
16.2. Proofs via the Baxter-Jain theorem

Appendixes

Appendix A. Analysis
A.1. Metric spaces and topology
A.2. Measure and integral
A.3. Product spaces
A.4. Separation theorem
A.5. Minimax theorem

Appendix B. Probability
B.1. Independence
B.2. Existence of stochastic processes
B.3. Conditional expectation
B.4. Weak topology of probability measures
B.5. First limit theorems
B.6. Ergodic theory
B.7. Stochastic ordering

Appendix C. Inequalities from statistical mechanics
C.1. Griffiths inequality
C.2. Griffiths-Hurst-Sherman inequality

Appendix D. Nonnegative matrices

Bibliography
Notation index
Author index
General index

Preface

This book arose from courses on large deviations and related topics given by the authors in the Departments of Mathematics at the Ohio State University (1993), at the University of Wisconsin-Madison (2006, 2013), and at the University of Utah (2008, 2013). Our goal has been to create an attractive collection of material for a semester's course which would also serve the broader needs of students from different fields. This goal has had two implications for the book.

(1) We have not aimed at anything like an encyclopedic coverage of different techniques for proving large deviation principles (LDPs). Part I of the book focuses on one classic line of reasoning: (i) upper bound by an exponential Markov-Chebyshev inequality, (ii) lower bound by a change of measure, and (iii) an argument to match the rates from the first two steps. Beyond this technique Part I covers Bryc's theorem and proves Cramér's theorem with the subadditive method. Part III of the book covers the Gärtner-Ellis theorem and an approach based on the convexity of a local rate function due to Baxter and Jain.

(2) We have not felt obligated to stay within the boundaries of large deviation theory but instead follow the trail of interesting material. Large deviation theory is a natural gateway to statistical mechanics. A discussion of statistical mechanics would be incomplete without some study of phase transitions. We prove the phase transition of the Ising model in two different ways: (i) first with classical techniques: the Peierls argument, Dobrushin's uniqueness condition, and correlation inequalities, and (ii) the second time with random cluster measures. This means leaving large deviation theory completely behind. Along the way we have the opportunity to learn coupling methods, which are central to modern probability theory but do not get serious application in the typical first graduate course in probability.

Here is a brief overview of the contents of the book. Part I covers core general large deviation theory, the relevant convex analysis, and the large deviations of i.i.d. processes on three levels: Cramér's theorem, Sanov's theorem, and the process level LDP for i.i.d. variables indexed by a multidimensional square lattice.

Part II introduces Gibbs measures and proves the Dobrushin-Lanford-Ruelle variational principle that characterizes translation-invariant Gibbs measures. After this we study the phase transition of the Ising model. Part II ends with a chapter on the Fortuin-Kasteleyn random cluster model and the percolation approach to Ising phase transition.

Part III develops the large deviation themes of Part I in several directions. Large deviations of i.i.d. variables are complemented with moderate deviations and with more precise large deviation asymptotics. The Gärtner-Ellis theorem is developed carefully, together with the necessary additional convex analysis beyond the basics covered in Part I. From large deviations of i.i.d. processes we move on to Markov chains, to nonstationary independent random variables, and finally to random walk in a dynamical random environment. The last two topics give us an opportunity to apply another approach to proving large deviation principles, namely the Baxter-Jain theorem. The Baxter-Jain theorem has not previously appeared in textbooks, and its application to random walk in random environment is new.

The ideal background for reading this book would be some familiarity with the language of measure-theoretic probability. Large deviation theory also requires a little analysis, point-set topology, and functional analysis. For example, readers should be comfortable with lower semicontinuity and the weak topology on probability measures. It should be possible for an instructor to accommodate students with quick lectures on technical prerequisites whenever needed. It is also possible to consider everything in the framework of concrete finite spaces, in which case probability measures become simply probability vectors.

In practice our courses have been populated by students with very diverse backgrounds, many with less than ideal knowledge of analysis and probability. This has turned out less problematic than one might initially fear. Mathematics students are typically fully satisfied only after every theoretical point is rigorously justified. But engineering students are content to set aside much of the theory and focus on the essentials of the phenomenon in question. There is great interest in probability theory among students of economics, engineering and sciences. This interest should be encouraged and nurtured with accessible courses.

The appendixes in the back of the book serve two purposes. There is a quick overview of some basic results of analysis and probability, without proofs, for the reader who wants a quick refresher. In particular, here the reader can look up textbook tools such as convergence theorems and inequalities that are referenced in the text. The other material in the appendixes consists of specialized results used in the text, such as a minimax theorem and inequalities from statistical mechanics. These are proved.

Since this book evolved in courses where we tried to actively engage the students, the development of the material relies on frequent exercises. We realize that this feature may not appeal to some readers. On the other hand, spelling out all the technical details left as exercises might make for tedious reading. Hopefully an instructor can fill in those details fairly easily if she wants to present full details in class. Exercises that are referred to in the text are marked with an asterisk.

One of us (TS) first learned large deviations from a course taught by Steven Orey at the University of Minnesota. We are greatly indebted to the existing books on the subject, especially those by Amir Dembo and Ofer Zeitouni [15], Frank den Hollander [16], Jean-Dominique Deuschel and Daniel Stroock [18], Richard Ellis [32] and Srinivasa Varadhan [79]. As a text that combines large deviations with equilibrium statistical mechanics, [32] is a predecessor of ours. There is obviously a good degree of overlap but the books are different. Ours is a textbook with a lighter touch while [32] is closer to a research monograph, covers more models in detail and explains much of the physics. We recommend [32] to our readers and students for further study. Our phase transition discussion covers the nearest-neighbor Ising model while [32] covers also long-range Ising models. On the other hand, [32] does not cover Dobrushin's uniqueness theorem, random cluster models, general lattice systems, or their large deviations.

Our literature references are sparse and sometimes do not assign credit to the originators of the ideas. We encourage the reader to consult the superb historical notes and references in the monographs of Dembo-Zeitouni, Ellis, and Georgii.

Here is a guide to the dependencies between the parts of the book. The opening sections of Chapters 2 and 3 are foundational for all discussions of large deviations. In addition, we have the following links. Chapter 5 relies on the earlier foundational sections, and Chapter 6 relies on Chapter 5. Chapter 8 relies on Chapters 6 and 7. Chapter 9 can be read independently of large deviations after the first sections of Chapter 7 together with Section 7.6. Section 10.2 makes sense only in the context of Chapter 9. Chapters 12 and 14 are independent of each other and both rely on the foundational sections. Chapter 13 relies on Chapter 5. Chapter 15 relies on Section 13.1 and Chapter 14. Chapter 16 relies on Chapter 14.

We thank Jeff Steif for lecture notes that helped shape the proof of Theorem 9.2, Jim Kuelbs for material for Chapter 11, and Chuck Newman for helpful discussions on the liquid-gas phase transition for Chapter 7. We also thank Davar Khoshnevisan for several valuable suggestions. We thank the team at AMS, and especially Ed Dunne, for patience in the face of serial breaches of agreed deadlines, and the several reviewers for valuable suggestions. Support from the National Science Foundation and the Wisconsin Alumni Research Foundation is gratefully acknowledged.

Firas Rassoul-Agha
Timo Seppäläinen

Part I. Large deviations: general theory and i.i.d. processes


Chapter 1. Introductory discussion

Toss a fair coin $n$ times. When $n$ is small there is nothing to say beyond enumerating all the outcomes and their probabilities. With a large number of tosses patterns and order emerge from the randomness: heads appear about 50% of the time and the histogram approaches a bell curve. As the number of tosses increases these patterns become more and more pronounced. But from time to time a random fluctuation might break the pattern: perhaps 10,000 tosses of a fair coin give 6000 heads. In fact, we know that there is a chance of $(1/2)^{10000}$ that all tosses yield heads. The point is that to understand the system well one cannot be satisfied with understanding only the most likely outcomes. One also needs to understand rare events.

But why care about an event that has a chance of $(1/2)^{10000}$? Here is a simplified example to illustrate the importance of probabilities of rare events. Imagine that an insurance company collects premiums at a steady rate of $c$ per month. Let $X_k$ be the random amount that the insurance company pays out in month $k$ to cover customer claims. Let $S_n = X_1 + \cdots + X_n$ be the total pay-out in $n$ months. Naturally the premiums must cover the average outlays, so $c > E[X_k]$. The company stays solvent as long as $S_n \le cn$. Quantifying the chances of the rare event $S_n > cn$ is then of obvious interest.

This is an introductory book on the methods of computing asymptotics of probabilities of rare events: the theory of large deviations. Let us start with a basic computation.

Example 1.1. Let $\{X_k\}_{k\in\mathbb{N}}$ be a sequence of independent and identically distributed (i.i.d.) Bernoulli random variables with success probability $p$ (each $X_k = 1$ with probability $p$ and $0$ with probability $1-p$). Denote the partial sum by $S_n = X_1 + \cdots + X_n$.

The strong law of large numbers says that, as $n \to \infty$, the sample mean $S_n/n$ converges to $p$ almost surely. But at any given $n$ there is a chance $p^n$ for all heads ($S_n = n$) and also a chance $(1-p)^n$ for all tails ($S_n = 0$). In fact, for any $s \in (0,1)$ there is a positive chance of a fraction of heads close to $s$. Let us compute the asymptotics of this probability.

Denote the integer part of $x \in \mathbb{R}$ by $\lfloor x \rfloor$, that is, $\lfloor x \rfloor$ is the largest integer less than or equal to $x$. From binomial probabilities

$$P\{S_n = \lfloor ns \rfloor\} = \frac{n!}{\lfloor ns\rfloor!\,(n - \lfloor ns\rfloor)!}\, p^{\lfloor ns\rfloor} (1-p)^{n-\lfloor ns\rfloor} \sim \beta_n\, \frac{n^n\, p^{\lfloor ns\rfloor} (1-p)^{n-\lfloor ns\rfloor}}{\lfloor ns\rfloor^{\lfloor ns\rfloor}\, (n-\lfloor ns\rfloor)^{n-\lfloor ns\rfloor}}\,.$$

We used Stirling's formula (Exercise 3.5)

$$(1.1)\qquad n! \sim e^{-n} n^n \sqrt{2\pi n}\,.$$

Notation $a_n \sim b_n$ means that $a_n / b_n \to 1$. Above we abbreviated

$$\beta_n = \sqrt{\frac{n}{2\pi \lfloor ns\rfloor (n - \lfloor ns\rfloor)}}$$

and, to get rid of integer parts, let also

$$\gamma_n = \frac{(ns)^{ns}\, (n-ns)^{n-ns}}{\lfloor ns\rfloor^{\lfloor ns\rfloor}\, (n-\lfloor ns\rfloor)^{n-\lfloor ns\rfloor}} \cdot \frac{p^{\lfloor ns\rfloor}\, (1-p)^{n-\lfloor ns\rfloor}}{p^{ns}\, (1-p)^{n-ns}}$$

to get

$$P\{S_n = \lfloor ns\rfloor\} \sim \beta_n\, \gamma_n\, \exp\Big\{\, ns \log\frac{p}{s} + n(1-s)\log\frac{1-p}{1-s} \Big\}.$$

Exercise 1.2. Show that there exists a constant $C$ such that $C^{-1} \le \sqrt{n}\,\beta_n \le C$ and $(Cn)^{-1} \le \gamma_n \le Cn$ for large enough $n$. By being a little more careful you can improve the second statement to $C^{-1} \le \gamma_n \le C$.

The asymptotics above gives the limit

$$(1.2)\qquad \lim_{n\to\infty} \frac{1}{n} \log P\{S_n = \lfloor ns\rfloor\} = -I_p(s)$$

with

$$I_p(s) = s \log\frac{s}{p} + (1-s)\log\frac{1-s}{1-p}\,.$$

Note the minus sign introduced in front of $I_p(s)$. This is a convention of large deviation theory. It is instructive to look at the graph of $I_p$ (Figure 1.1). $I_p$ extends continuously to $[0,1]$ with values $I_p(0) = \log\frac{1}{1-p}$ and $I_p(1) = \log\frac{1}{p}$ that match the exponential decay of the probabilities of the events $\{S_n = 0\}$ and $\{S_n = n\}$.

[Figure 1.1. The rate function for coin tosses.]
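Limit (1.2) is concrete enough to test numerically. The following short Python script (our illustration, not part of the original text) evaluates $-\frac{1}{n}\log P\{S_n = \lfloor ns\rfloor\}$ exactly via log-gamma functions and compares it with $I_p(s)$; the printed values approach $I_p(s)$ as $n$ grows.

```python
import math

def log_prob(n, k, p):
    # log P{S_n = k} for S_n ~ Binomial(n, p), computed with
    # log-gamma functions to avoid overflow for large n
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def rate(s, p):
    # the rate function I_p(s) of (1.2), for 0 < s < 1
    return s * math.log(s / p) + (1 - s) * math.log((1 - s) / (1 - p))

p, s = 0.5, 0.6
for n in (10, 100, 1000, 10000):
    k = math.floor(n * s)
    print(n, -log_prob(n, k, p) / n, rate(s, p))
```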

The unique zero of $I_p$ is at the law of large numbers limit $p$, which we would regard as the typical behavior of $S_n/n$. Increasing values of $I_p$ correspond to less likely outcomes. For $s \notin [0,1]$ it is natural to set $I_p(s) = \infty$.

The function $I_p$ in (1.2) is a large deviation rate function. We shall understand later that $I_p(s)$ is also the relative entropy of the coin with success probability $s$ relative to the one with success probability $p$. The choice of terminology is not a coincidence. This quantity is related to both information-theoretic and thermodynamic entropy. For this reason we go on a brief detour to discuss these well-known notions of entropy and to point out the link with the large deviation rate function $I_p$. The relative entropy that appears in large deviation theory will take center stage in Chapters 5-6, and again in Chapter 8 when we discuss statistical mechanics of lattice systems.

Limit (1.2) is our first large deviation result. One of the very last ones in the book is limit (16.2), which is the analogue of (1.2) for a random walk in a dynamical random environment, that is, in a setting where the success probability of the coin also fluctuates randomly.

1.1. Information-theoretic entropy

A coin that always comes up heads is not random at all, and the same of course for a coin that always comes up tails. On the other hand, we should probably regard a fair coin as the most random coin because we cannot predict whether we see more heads or tails in a sequence of tosses with better than even odds. We discuss here briefly the quantification of the degree of randomness of a sequence of coin flips. We take the point of view that the degree of randomness of the coin is reflected in the average number of bits needed to encode a sequence of tosses. This section is inspired by Chapter 2 of Ash [4].

Let $\Omega = \{0,1\}^n$ be the space of words $\omega \in \Omega$ of length $n$. A message is a concatenation of words. The message made of words $\omega^1, \omega^2, \dots, \omega^m$ is written $\omega^1\omega^2\cdots\omega^m$. A code is a map $C : \Omega \to \bigcup_{l \ge 1} \{0,1\}^l$ that assigns to each word $\omega \in \Omega$ a code word $C(\omega)$, which is a finite sequence of 0s and 1s. $|C(\omega)|$ denotes the length of the code word $C(\omega)$. A concatenation of code words is a code message. Thus, a message is encoded by concatenating the code words of its individual words to make a code message: $C(\omega^1 \cdots \omega^m) = C(\omega^1)\cdots C(\omega^m)$. A code should be uniquely decipherable. That is, for every finite sequence $c_1 \cdots c_l$ of 0s and 1s there exists at most one message $\omega^1 \cdots \omega^m$ such that $C(\omega^1)\cdots C(\omega^m) = c_1 \cdots c_l$.

Now sample words at random under a probability distribution $P$ on the space $\Omega$. In this discussion we employ the base 2 logarithm $\log_2 x = \log x / \log 2$.

Noiseless coding theorem. If $C$ is a uniquely decipherable code, then its average length satisfies

$$(1.3)\qquad \sum_{\omega\in\Omega} P(\omega)\, |C(\omega)| \;\ge\; -\sum_{\omega\in\Omega} P(\omega) \log_2 P(\omega)$$

with equality if and only if $P(\omega) = 2^{-|C(\omega)|}$.

In information theory the quantity on the right of (1.3) is called the Shannon entropy of the probability distribution $P$. For a simple proof of the theorem see [4, Theorem 2.5.1, page 37].

Consider the case where the $n$ characters of the word $\omega$ are chosen independently, and let $s \in [0,1]$ be the probability that a character is a 1. Then $P(\omega) = s^{N(\omega)}(1-s)^{n-N(\omega)}$, where $N(\omega)$ is the number of ones in $\omega$. (As usual, $0^0 = 1$.) By the noiseless coding theorem, the average length of a decipherable code $C$ satisfies

$$\sum_{\omega\in\Omega} |C(\omega)|\, s^{N(\omega)}(1-s)^{n-N(\omega)} \;\ge\; -\sum_{\omega\in\Omega} s^{N(\omega)}(1-s)^{n-N(\omega)} \log_2\!\big( s^{N(\omega)}(1-s)^{n-N(\omega)} \big).$$

Since $\sum_\omega s^{N(\omega)}(1-s)^{n-N(\omega)} = 1$ and $\sum_\omega N(\omega)\, s^{N(\omega)}(1-s)^{n-N(\omega)} = ns$, the right-hand side equals $nh(s)$, where

$$h(s) = -s \log_2 s - (1-s)\log_2(1-s) = 1 - \frac{I_{1/2}(s)}{\log 2}\,,$$

and we see the large deviation rate function from (1.2) appear.
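The identity $h(s) = 1 - I_{1/2}(s)/\log 2$ is a one-line computation, and it can also be confirmed numerically. Here is a quick check (ours, not from the text):

```python
import math

def h(s):
    # binary Shannon entropy in bits
    return -s * math.log2(s) - (1 - s) * math.log2(1 - s)

def I_half(s):
    # the rate function I_p(s) of (1.2) with p = 1/2
    return s * math.log(2 * s) + (1 - s) * math.log(2 * (1 - s))

for s in (0.1, 0.25, 0.5, 0.9):
    print(s, h(s), 1 - I_half(s) / math.log(2))  # the two columns agree
```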

Thus we have the lower bound

$$(1.4)\qquad \sum_{\omega\in\Omega} |C(\omega)|\, s^{N(\omega)}(1-s)^{n-N(\omega)} \;\ge\; nh(s).$$

In other words, any uniquely decipherable code for independent and identically distributed characters with probability $s$ for a 1 must use, on average, at least $h(s)$ bits per character. In this case the Shannon entropy and the rate function $I_{1/2}$ are related by

$$-\sum_{\omega\in\Omega} P(\omega)\log_2 P(\omega) = nh(s) = n\Big(1 - \frac{I_{1/2}(s)}{\log 2}\Big).$$

Here is a simplistic way to see the lower bound $nh(s)$ on the number of bits needed that makes an indirect appeal to large deviations, in the sense that deviant words are ignored. With probability $s$ for symbol 1, the typical word of length $n$ has about $ns$ ones. Suppose we use code words of length $L$ to code these typical words. Then $2^L \ge \binom{n}{\lfloor ns\rfloor}$ and the lower bound $L \ge nh(s) + O(\log n)$ follows from Stirling's formula.

The values $h(0) = h(1) = 0$ make asymptotic sense. For example, if $s = 0$, then a word of any length $n$ is all zeroes and can be encoded by a single bit, which in the $n \to \infty$ limit gives 0 bits per character. This is the case of complete order. At the other extreme of complete disorder is the case $s = 1/2$ of fair coin tosses where all $n$ bits are needed because all words of a given length are equally likely. For $s \ne 1/2$ a 1 is either more or less likely than a 0, and by exploiting this bias one can encode with less than 1 bit per character on average.

David A. Huffman [48], while a Ph.D. student at MIT, developed an optimal decipherable code; that is, a code $C$ whose average length cannot be improved upon. As $n \to \infty$, the average length of the code generated by this algorithm is exactly $h(s)$ per character, and so the lower bound (1.4) is achieved asymptotically. We illustrate the algorithm through an example. For a proof of its optimality and asymptotic average length see page 42 of [4].

Example 1.3 (Huffman's algorithm). Consider the case $n = 3$ and $s = 1/4$. There are 8 words. Word 111 comes with probability $1/4^3$, words 011, 101, and 110 come each with probability $3/4^3$, words 001, 010, and 100 come with probability $3^2/4^3$ each, and word 000 comes with probability $(3/4)^3$. These 8 words are the terminal leaves of a binary tree that we build.

[Figure 1.2. The tree for Huffman's algorithm in the case $n = 3$ and $s = 1/4$. The leftmost column shows the resulting codes.]

First, find the two leaves with the smallest probabilities. Ties can be resolved arbitrarily. Give these two leaves $a$ and $b$ a common ancestor labeled with a probability that is the sum of the probabilities of $a$ and $b$. In our example, leaves 111 and 011 are given a common parent labeled with probability $4/4^3$. Now leaves $a$ and $b$ are done with and their parent is regarded as a new leaf. Repeat the step. Continue until there is one leaf left. In our example, the second step gives a common ancestor to leaves 101 and 110. This new node is labeled $3/4^3 + 3/4^3 = 6/4^3$. And so on. Figure 1.2 presents the final tree.

To produce the code of a word, start at the root and follow the tree to the leaf of that word. At each fork encode a down step with a 0 and an up step with a 1 (in our figure). For instance, word 011 is reached from the root by three successive up steps followed by a single down step and then another up step. Thus word 011 is encoded as 11101.

The average length of the code is

$$\tfrac{27}{64}\cdot 1 + 3\cdot\tfrac{9}{64}\cdot 3 + 3\cdot\tfrac{3}{64}\cdot 5 + \tfrac{1}{64}\cdot 5 = \tfrac{158}{64}\,.$$

This is $158/192 \approx 0.82$ bits per character. As the number of characters $n$ grows, the average length of the encoding per character will converge to the information-theoretic entropy $h(1/4) \approx 0.81$.
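The arithmetic of Example 1.3 can be verified in a few lines of code. The sketch below (ours, not the book's) uses the standard fact that the expected length of a Huffman code equals the sum of the probabilities of the internal nodes created by the merges, so the tree itself need not be stored.

```python
import heapq
from fractions import Fraction

def huffman_average_length(probs):
    # Expected code length (bits per word) of a Huffman code.
    # Each merge adds 1 bit to every leaf below the merged node,
    # so the expected length is the sum of the merged probabilities.
    heap = list(probs)
    heapq.heapify(heap)
    total = Fraction(0)
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        heapq.heappush(heap, a + b)
        total += a + b
    return total

# the 8 words of Example 1.3: a word with j ones has probability (1/4)^j (3/4)^(3-j)
probs = [Fraction(1, 64)] + [Fraction(3, 64)] * 3 \
      + [Fraction(9, 64)] * 3 + [Fraction(27, 64)]
avg = huffman_average_length(probs)
print(avg)             # 79/32, i.e. 158/64 bits per word
print(float(avg) / 3)  # about 0.82 bits per character
```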

1.2. Thermodynamic entropy

The next discussion of thermodynamics is inspired by Schrödinger's lectures [70]. After some preliminary computations we use the first and second laws of thermodynamics to derive an expression for entropy. In the simplest case of a system with two energy levels this expression can be related to the rate function (1.2). The reader should be aware that this section is not mathematically rigorous.

Let $A$ denote a physical system whose possible energy levels are $\{\varepsilon_l : l \in \mathbb{N}\}$. Then consider a larger system $A_n$ made up of $n$ identical, physically independent copies of $A$. By physically independent we mean that these components of $A_n$ do not communicate with each other. Each component can be at any energy level $\varepsilon_l$. Immerse the whole system in a large heat bath at fixed absolute temperature $T$, which gives the system total energy $E = nU$. Let $a_l$ be the number of components at energy level $\varepsilon_l$. These numbers must satisfy the constraints

$$(1.5)\qquad \sum_l a_l = n \quad\text{and}\quad \sum_l a_l \varepsilon_l = E.$$

For given values $a_l$, the total number of possible arrangements of the components at different energy levels is

$$\frac{n!}{a_1!\, a_2! \cdots}\,.$$

When $n$ is large, it is reasonable to assume that the values $a_l$ that appear are the ones that maximize the number of arrangements, subject to the constraints (1.5). To find these optimal $a_l$ values, maximize the logarithm of the number of arrangements and introduce Lagrange multipliers $\alpha$ and $\beta$. Thus we wish to differentiate

$$(1.6)\qquad \log \frac{n!}{a_1!\, a_2!\cdots} - \alpha\Big(\sum_l a_l - n\Big) - \beta\Big(\sum_l a_l \varepsilon_l - E\Big)$$

with respect to $a_l$ and set the derivative equal to zero. To use calculus, pretend that the unknowns $a_l$ are continuous variables and use Stirling's formula (1.1) in the form $\log n! \approx n(\log n - 1)$. We arrive at

$$\log a_l + \alpha + \beta\varepsilon_l = 0 \quad \text{for all } l.$$

Thus $a_l = Ce^{-\beta\varepsilon_l}$. Since the total number of components is $n = \sum a_l$,

$$(1.7)\qquad a_l = \frac{n\, e^{-\beta\varepsilon_l}}{\sum_j e^{-\beta\varepsilon_j}}\,.$$

The second constraint gives

$$E = n\, \frac{\sum_l \varepsilon_l e^{-\beta\varepsilon_l}}{\sum_l e^{-\beta\varepsilon_l}}\,.$$

These equations should be understood to hold only asymptotically. Divide both equations by $n$ and take $n \to \infty$. We interpret the limit as saying that when a typical system $A$ is immersed in a heat bath at temperature $T$, the system takes energy $\varepsilon_l$ with probability

$$(1.8)\qquad p_l = \frac{e^{-\beta\varepsilon_l}}{\sum_j e^{-\beta\varepsilon_j}}$$

and then has average energy

$$(1.9)\qquad U = \frac{\sum_l \varepsilon_l e^{-\beta\varepsilon_l}}{\sum_l e^{-\beta\varepsilon_l}}\,.$$
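Formulas (1.8) and (1.9) are easy to explore numerically. The snippet below (our illustration; the three energy levels are made up for the example) shows how increasing $\beta$, that is, lowering the temperature, concentrates the Gibbs weights on the lowest energy level and pulls the average energy $U$ down.

```python
import math

def gibbs_weights(energies, beta):
    # the probabilities (1.8): p_l proportional to exp(-beta * eps_l)
    w = [math.exp(-beta * e) for e in energies]
    Z = sum(w)
    return [x / Z for x in w]

def average_energy(energies, beta):
    # (1.9): U = sum_l eps_l p_l
    return sum(e * p for e, p in zip(energies, gibbs_weights(energies, beta)))

energies = [0.0, 1.0, 2.0]  # a hypothetical three-level system
for beta in (0.1, 1.0, 10.0):
    print(beta, [round(p, 4) for p in gibbs_weights(energies, beta)],
          round(average_energy(energies, beta), 4))
```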

Expression (1.6) suggests that $\beta$ is a function of $\{\varepsilon_l\}$ and $U$. We argue with physical reasoning that $\beta$ is in fact a universal function of $T$ alone. Consider another system $B$ with energy levels $\{\tilde{\varepsilon}_m\}$. Let $B_n$ denote a composite system of $n$ identical and independent copies of $B$, also physically independent of $A_n$. Immersing $A_n$ in a heat bath with temperature $T$ specifies a value of $\beta$ for it. Since $\beta$ can a priori depend on $\{\varepsilon_l\}$, which is a characteristic of system $A$, we denote this value by $\beta_A$. Similarly, immersing $B_n$ in the same heat bath leads to a value $\beta_B$. We can also immerse $A_n$ and $B_n$ together in the heat bath and consider them together as consisting of $n$ independent and identical copies of a system $AB$. This system acquires its own value $\beta_{AB}$, which depends on the temperature $T$ and on the energies a system $AB$ can take. Since $A$ and $B$ are physically independent, $AB$ can take energies in the set $\{\varepsilon_l + \tilde{\varepsilon}_m : l, m \in \mathbb{N}\}$.

Let $a_{l,m}$ be the number of $AB$-components whose $A$-part is at energy level $\varepsilon_l$ and whose $B$-part is at energy level $\tilde{\varepsilon}_m$, when $A_n$ and $B_n$ are immersed together in the heat bath. Solving the Lagrange multipliers problem for the $AB$-system gives

$$a_{l,m} = \frac{n\, e^{-\beta_{AB}(\varepsilon_l + \tilde{\varepsilon}_m)}}{\sum_{i,j} e^{-\beta_{AB}(\varepsilon_j + \tilde{\varepsilon}_i)}} = n \cdot \frac{e^{-\beta_{AB}\varepsilon_l}}{\sum_j e^{-\beta_{AB}\varepsilon_j}} \cdot \frac{e^{-\beta_{AB}\tilde{\varepsilon}_m}}{\sum_i e^{-\beta_{AB}\tilde{\varepsilon}_i}}\,.$$

To obtain $a_l$, the number of $A$-components at energy $\varepsilon_l$, sum over $m$:

$$a_l = \sum_m a_{l,m} = \frac{n\, e^{-\beta_{AB}\varepsilon_l}}{\sum_j e^{-\beta_{AB}\varepsilon_j}}\,.$$

Since $A_n$ and $B_n$ do not interact, this must agree with the earlier outcome (1.7):

$$\frac{n\, e^{-\beta_A \varepsilon_l}}{\sum_j e^{-\beta_A \varepsilon_j}} = \frac{n\, e^{-\beta_{AB}\varepsilon_l}}{\sum_j e^{-\beta_{AB}\varepsilon_j}} \quad \text{for all } l \in \mathbb{N}.$$

It is reasonable to assume that system $A$ can take at least two different energies $\varepsilon_l \ne \varepsilon_{l'}$, for otherwise the discussion is trivial. Then the above gives $e^{-\beta_A(\varepsilon_l - \varepsilon_{l'})} = e^{-\beta_{AB}(\varepsilon_l - \varepsilon_{l'})}$ and so $\beta_A = \beta_{AB}$. Switching the roles of $A$ and $B$ leads to $\beta_B = \beta_{AB} = \beta_A$. Since system $B$ was arbitrary, we conclude that $\beta$ is a universal function of $T$.

We regard $\beta$ as the more fundamental quantity and view $T$ as a universal function of $\beta$. The state of the system is then determined by the energy levels $\{\varepsilon_l\}$ and $\beta$ by equations (1.8) and (1.9).

Next we derive the precise formula for the dependence of $T$ on $\beta$. Working with fixed energies $\varepsilon_l$ and considering $\beta$ to be the only variable will not help, since we can replace $\beta$ by any monotone function of it and nothing in the above reasoning changes. We need to make the energies $\varepsilon_l$ vary, which leads to the notion of work done by the system. The first law of thermodynamics states that if the parameters of the system (i.e. its energies $\varepsilon_l$) change, it will absorb an average amount of heat

$$dQ = dE + dW,$$

where $dW$ is the work done by the system. If the energies change by $d\varepsilon_l$, then

$$dW = -\sum_l a_l\, d\varepsilon_l \quad\text{and}\quad dQ = dE - \sum_l a_l\, d\varepsilon_l\,.$$

Let $nS$ be the entropy of the system $A_n$. By the second law of thermodynamics $dQ = nT\, dS$. Define the free energy

$$F = \log \sum_j e^{-\beta\varepsilon_j}\,.$$

Divide the two displays above by $n$ to write

$$(1.10)\qquad dS = \frac{1}{T}\Big(dU - \sum_l p_l\, d\varepsilon_l\Big) = \frac{1}{T\beta}\Big(d(\beta U) - U\, d\beta - \beta \sum_l p_l\, d\varepsilon_l\Big) = \frac{1}{T\beta}\Big(d(\beta U) + \frac{\partial F}{\partial \beta}\, d\beta + \sum_l \frac{\partial F}{\partial \varepsilon_l}\, d\varepsilon_l\Big) = \frac{1}{T\beta}\, d(\beta U + F).$$

Abbreviate $G = \beta U + F$ which, by the display above, has to be a function $f(S)$ such that $f'(S) = T\beta$. Recall that the three systems $A$, $B$, and $AB$ acquire the same $\beta$ when immersed in the heat bath. Consequently $F_A + F_B = F_{AB}$. Since $U = -\partial F/\partial\beta$, the same additivity holds for the function $G$, and so $f(S_A) + f(S_B) = f(S_{AB})$. Then by (1.10), since $T$ is a universal function of $\beta$, $dS_{AB} = dS_A + dS_B$, which implies $S_{AB} = S_A + S_B + c$. Now we have $f(S_A) + f(S_B) = f(S_A + S_B + c)$. Differentiate in $S_A$ and $S_B$ to see that $f'(S_A) = f'(S_B)$.

Since the system $B$ was chosen arbitrarily, entropy $S_B$ can be made equal to any number regardless of the value of temperature $T$. Therefore $f'(S)$ must be a universal constant, which we call $1/k$. (This constant cannot be zero because $T$ and $\beta$ vary with each other.) This implies $\beta = \frac{1}{kT}$ and $G = \frac{S}{k}$. Constant $k$ is called Boltzmann's constant. If $k < 0$, (1.8) would imply that as $T \searrow 0$ the system chooses the highest energy state, which goes against physical sense. Hence $k > 0$.

Let us compute $S$ for a system with two energy levels $\varepsilon_0$ and $\varepsilon_1$. By symmetry, recentering, and a change of units, we can assume that $\varepsilon_0 = 0$ and $\varepsilon_1 = 1$. The system takes energy 0 with probability $p_0$ and energy 1 with probability $p_1$. The average energy is $U = p_1$ and from (1.8) $p_1 = e^{-\beta}/(1 + e^{-\beta})$. Then

$$S = kG = k(\beta U + F) = k\big( p_1(\beta + F) + (1 - p_1)F \big) = -k\big( p_1 \log p_1 + (1 - p_1)\log(1 - p_1) \big) = k\log 2 - k\, I_{1/2}(p_1).$$

Thus the rate function $I_{1/2}$ of Example 1.1 is, up to a universal positive multiplicative factor and an additive constant, the negative thermodynamic entropy of a two-energy system. In the previous section we saw that $-I_{1/2}$ is a linear function (with positive slope) of information-theoretic entropy. Together these observations imply that the thermodynamic entropy of a physical system represents the amount of information needed to describe the system or, equivalently, the amount of uncertainty remaining in it.

The identity $(k\beta)^{-1} S = U + \beta^{-1} F$ expresses an energy-entropy balance and reappears several times later in various guises. It can be found in Exercise 5.9, as equation (7.8) for the Curie-Weiss model, and in Section 8.3 as part (c) of the Dobrushin-Lanford-Ruelle variational principle for lattice systems.
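The two-level computation above can be checked numerically: for energies $0$ and $1$, the quantity $S/k = \beta U + F$ should agree with the Shannon-type expression $\log 2 - I_{1/2}(p_1)$. A small script (ours) confirming this:

```python
import math

def entropy_over_k(beta):
    # S/k = beta*U + F for the two-level system with energies 0 and 1,
    # where U = p_1 from (1.8) and F = log(1 + exp(-beta))
    p1 = math.exp(-beta) / (1 + math.exp(-beta))
    F = math.log(1 + math.exp(-beta))
    return beta * p1 + F

def I_half(s):
    # the rate function of (1.2) with p = 1/2
    return s * math.log(2 * s) + (1 - s) * math.log(2 * (1 - s))

for beta in (0.5, 1.0, 2.0):
    p1 = math.exp(-beta) / (1 + math.exp(-beta))
    print(beta, entropy_over_k(beta), math.log(2) - I_half(p1))  # columns agree
```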

1.3. Large deviations as useful estimates

The subject of large deviations is about controlling probabilities of atypical events. There are two somewhat different forms of this activity. (i) Proofs of limit theorems in probability require estimates to rule out atypical behavior. Such estimates could be called ad hoc large deviations. (ii) Precise limits of vanishing probabilities on an exponential scale are stated as large deviation principles.

The subject of our book is the second kind of large deviations. The next chapter begins a systematic development of large deviation principles. Before that, let us look at two textbook examples to illustrate the use of independence in the derivation of estimates to prove limit theorems.

Example 1.4. Let $\{X_n\}$ be an i.i.d. sequence with $E[X] = 0$. (Common device: $X$ is a random variable that has the same distribution as all the $X_n$'s.) We wish to show that, under a suitable hypothesis,

$$(1.11)\qquad S_n/n^p \to 0 \quad P\text{-almost surely, for any } p > 1/2.$$

In order to illustrate a method, we make a strong assumption. Assume the existence of $\delta > 0$ such that $E[e^{\theta X}] < \infty$ for $|\theta| \le \delta$. When $p \ge 1$, limit (1.11) follows from the strong law of large numbers. So let us assume $p \in (1/2, 1)$.

For $t \ge 0$ Chebyshev's inequality gives

$$P\{S_n \ge \varepsilon n^p\} \le E\big[e^{tS_n - \varepsilon t n^p}\big] = \exp\big\{ -\varepsilon t n^p + n \log E[e^{tX}] \big\}.$$

The exponential moment assumption implies that $\sum_k E[|X|^k]\, t^k/k!$ is summable for $t \in [0, \delta]$. Recalling that $E[X] = 0$,

$$E[e^{tX}] = E[e^{tX} - 1 - tX] + 1 \le 1 + \sum_{k=2}^\infty \frac{t^k}{k!}\, E[|X|^k] \le 1 + \frac{t^2}{\delta^2} \sum_{k=2}^\infty \frac{\delta^k}{k!}\, E[|X|^k] \le 1 + ct^2 \quad\text{for } t \in [0, \delta].$$

Then, taking $t = \dfrac{\varepsilon n^p}{2nc}$ and $n$ large enough,

$$(1.12)\qquad P\{S_n \ge \varepsilon n^p\} \le \exp\big\{ -\varepsilon t n^p + n \log(1 + ct^2) \big\} \le \exp\big\{ -\varepsilon t n^p + nct^2 \big\} = \exp\Big\{ -\frac{\varepsilon^2}{4c}\, n^{2p-1} \Big\}.$$

Applying this to the sequence $\{-X_n\}$ gives the matching bound on the left:

$$(1.13)\qquad P\{S_n \le -\varepsilon n^p\} \le \exp\Big\{ -\frac{\varepsilon^2}{4c}\, n^{2p-1} \Big\}.$$

Inequalities (1.12)-(1.13) can be regarded as large deviation estimates. (Although later we see that since the scale is $n^p$ for $1/2 < p < 1$, technically these are called moderate deviations. But that distinction is not relevant here.) These estimates imply the summability $\sum_n P\{|S_n| \ge \varepsilon n^p\} < \infty$. The Borel-Cantelli lemma implies that for any $\varepsilon > 0$

$$P\{\exists\, n_0 : \forall\, n \ge n_0,\ |S_n|/n^p \le \varepsilon\} = 1.$$

A countable intersection over $\varepsilon = 1/k$ for $k \in \mathbb{N}$ gives

$$P\{\forall\, k\ \exists\, n_0 : \forall\, n \ge n_0,\ |S_n|/n^p \le 1/k\} = 1,$$

which says that $S_n/n^p \to 0$, $P$-a.s.

We used an unnecessarily strong assumption to illustrate the exponential Chebyshev method. We can achieve the same result with martingales under the assumption $E[X^2] < \infty$. Since $S_n$ is a martingale (relative to the filtration $\sigma(X_1, \dots, X_n)$), Doob's inequality (see [27] or (8.26) of [54]) gives

$$P\Big\{ \max_{k \le n} |S_k| \ge \varepsilon n^p \Big\} \le \frac{E[|S_n|^2]}{\varepsilon^2 n^{2p}} = \frac{n\, E[X^2]}{\varepsilon^2 n^{2p}} = \frac{c}{\varepsilon^2 n^{2p-1}}\,.$$

Pick $r > 0$ such that $r(2p - 1) > 1$. Then

$$P\Big\{ \max_{k \le m^r} |S_k| \ge \varepsilon m^{pr} \Big\} \le \frac{c}{\varepsilon^2 m^{r(2p-1)}}\,.$$

Hence $P\{\max_{k \le m^r} |S_k| \ge \varepsilon m^{pr}\}$ is summable over $m$, and the Borel-Cantelli lemma implies that $m^{-rp} \max_{k \le m^r} |S_k| \to 0$ $P$-a.s. as $m \to \infty$. To get the result for the full sequence pick $m_n$ such that $(m_n - 1)^r \le n < m_n^r$. Then

$$n^{-p} \max_{k \le n} |S_k| \le \Big( \frac{m_n^r}{n} \Big)^{p}\, m_n^{-rp} \max_{k \le m_n^r} |S_k| \to 0 \quad\text{as } n \to \infty$$

because $m_n^r / n \to 1$.

Example 1.5 (Longest run of heads). Let $\{X_n\}$ be an i.i.d. sequence of Bernoulli random variables with success probability $p$. For each $n$ let $R_n$ be the length of the longest success run among $(X_1, \dots, X_n)$. We derive estimates to prove a result of Rényi [66] that

$$(1.14)\qquad P\Big\{ \lim_{n\to\infty} \frac{R_n}{\log n} = -\frac{1}{\log p} \Big\} = 1.$$

Fix $b > 1$ and $r$ such that $r(b - 1) > 1$. Let $l_m = \lceil -br \log m / \log p \rceil$. ($\lceil x \rceil$ is the smallest integer larger than or equal to $x$.) If $R_{m^r} \ge l_m$, then there is an $i \le m^r$ such that $X_i = X_{i+1} = \cdots = X_{i + l_m - 1} = 1$. Therefore

$$(1.15)\qquad P\{R_{m^r} \ge l_m\} \le m^r p^{l_m} \le 1/m^{r(b-1)}.$$

By the Borel-Cantelli lemma, with probability one, $R_{m^r} < l_m$ for large enough $m$. (Though how large $m$ needs to be is random.) Consequently, with probability one,

$$\limsup_{m\to\infty} \frac{R_{m^r}}{\log m^r} \le -\frac{b}{\log p}\,.$$

Given $n$, let $m_n$ be such that $m_n^r \le n < (m_n + 1)^r$. Then $R_n \le R_{(m_n+1)^r}$ and

$$\limsup_{n\to\infty} \frac{R_n}{\log n} \le \limsup_{n\to\infty} \frac{\log (m_n + 1)^r}{\log m_n^r} \cdot \frac{R_{(m_n+1)^r}}{\log (m_n + 1)^r} \le -\frac{b}{\log p}\,.$$

Taking $b \searrow 1$ along a sequence shows that

$$P\Big\{ \limsup_{n\to\infty} \frac{R_n}{\log n} \le -\frac{1}{\log p} \Big\} = 1.$$

We have the upper bound for the goal (1.14).

Fix $a \in (0, 1)$ and let $l_n = \lceil -a \log n / \log p \rceil$. Let $A_i$ be the event that $X_{i l_n + 1} = \cdots = X_{(i+1) l_n} = 1$. Then

$$\{R_n < l_n\} \subset \bigcap_{i=0}^{\lfloor n/l_n \rfloor - 1} A_i^c\,.$$

By the independence of the $A_i$'s,

$$(1.16)\qquad P\{R_n < l_n\} \le \big(1 - p^{l_n}\big)^{\lfloor n/l_n \rfloor} \le e^{-p^{l_n} \lfloor n/l_n \rfloor} \le e^{-c\, n^{1-a}/l_n}$$

for a constant $c > 0$ and large enough $n$. Once again, by the Borel-Cantelli lemma, $R_n < l_n$ happens only finitely often, with probability one, and thus $\liminf_n R_n / \log n \ge -a/\log p$. Taking $a \nearrow 1$ proves that

$$P\Big\{ \liminf_{n\to\infty} \frac{R_n}{\log n} \ge -\frac{1}{\log p} \Big\} = 1.$$

Looking back, the proof relied again on a right-tail estimate (1.15) and a left-tail estimate (1.16). It might be a stretch to call (1.15) a large deviation bound since it is not exponential, but (1.16) can be viewed as a large deviation bound.

Remark 1.6. Combining the limit theorem above with the fact that the variance of $R_n$ remains bounded as $n \to \infty$ (see [10, 42]) provides a very accurate test of the hypothesis that the sequence $\{X_n\}$ is i.i.d. Bernoulli with probability of success $p$.
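Rényi's limit (1.14) shows up clearly in simulation. The following sketch (ours, not from the text) simulates coin sequences and compares $R_n/\log n$ with $-1/\log p$:

```python
import math
import random

def longest_run(bits):
    # length of the longest run of 1s in an iterable of 0s and 1s
    best = cur = 0
    for b in bits:
        cur = cur + 1 if b == 1 else 0
        best = max(best, cur)
    return best

random.seed(1)
p = 0.5
for n in (10**4, 10**5, 10**6):
    bits = (1 if random.random() < p else 0 for _ in range(n))
    print(n, longest_run(bits) / math.log(n), -1 / math.log(p))
```

The convergence is slow, which is consistent with Remark 1.6: since $R_n$ differs from $-\log n/\log p$ by an amount of order one, the ratio approaches its limit at rate $1/\log n$.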


Chapter 2. The large deviation principle

2.1. Precise asymptotics on an exponential scale

Since the 1960s a standard formalism has been employed to express limits of probabilities of rare events on an exponential scale. The term for these statements is large deviation principle (LDP). We introduce this in a fairly abstract setting and then return to the Bernoulli example.

There is a sequence $\{\mu_n\}$ of probability measures whose asymptotics we are interested in. These measures exist on some measurable space $(\mathcal{X}, \mathcal{B})$. Throughout our general discussion we take $\mathcal{X}$ to be a Hausdorff topological space, unless further assumptions are placed on it. $\mathcal{B} = \mathcal{B}_{\mathcal{X}}$ is the Borel $\sigma$-algebra of $\mathcal{X}$, and $\mathcal{M}_1(\mathcal{X})$ is the space of probability measures on the measurable space $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$. Thus $\{\mu_n\}$ is a sequence in the space $\mathcal{M}_1(\mathcal{X})$. In Example 1.1, $\mathcal{X} = \mathbb{R}$ and $\mu_n$ is the probability distribution of $S_n/n$: $\mu_n(A) = P\{S_n/n \in A\}$ for Borel subsets $A \subset \mathbb{R}$.

Remark on mathematical generality. A reader not familiar with point-set topology can assume that $\mathcal{X}$ is a metric space without any harm. Even taking $\mathcal{X} = \mathbb{R}$ or $\mathbb{R}^d$ will do for a while. However, later we will study large deviations on spaces of probability measures, and the more abstract point of view becomes a necessity. If the notion of a Borel set is not familiar, it is safe to think of Borel sets as all the reasonable sets for which a probability can be defined.

To formulate a general large deviation statement, let us look at result (1.2) of Example 1.1 for guidance. The first ingredient of interest in (1.2) is the normalization $1/n$ in front of the logarithm. Obviously this can change in a different example.

Thus we should consider probabilities $\mu_n(A)$ that decay roughly like $e^{-r_n C(A)}$ for some normalization $r_n$ and a constant $C(A) \in [0, \infty]$ that depends on the event $A$. In (1.2) we identified a rate function. How should the constant $C(A)$ relate to a rate function? Consider a finite set $A = \{x_1, \dots, x_m\}$. Then asymptotically

$$\frac{1}{r_n} \log \mu_n(A) = \frac{1}{r_n} \log \sum_i \mu_n\{x_i\} \approx \max_i \frac{1}{r_n} \log \mu_n\{x_i\}$$

so that $C(A) = \min_i C(x_i)$. This suggests that in general $C(A)$ should be the infimum of a rate function $I$ over $A$. The final technical point is that it is in general unrealistic to expect $r_n^{-1} \log \mu_n(A)$ to actually converge, on account of boundary effects, even if $A$ is a nice set. A reasonable goal is to expect statements in terms of limsup and liminf. From these considerations we arrive at the following tentative formulation of a large deviation principle: for Borel subsets $A$ of the space $\mathcal{X}$,

$$(2.1)\qquad -\inf_{x \in A^\circ} I(x) \le \liminf_{n\to\infty} \frac{1}{r_n} \log \mu_n(A) \le \limsup_{n\to\infty} \frac{1}{r_n} \log \mu_n(A) \le -\inf_{x \in \bar{A}} I(x),$$

where $A^\circ$ and $\bar{A}$ are, respectively, the topological interior and closure of $A$. This statement is basically what we want, except that we need to address the uniqueness of the rate function.

Example 2.1. Let us return to the i.i.d. Bernoulli sequence $\{X_n\}$ of Example 1.1. We claim that the probability measures $\mu_n(A) = P\{S_n/n \in A\}$ satisfy (2.1) with normalization $r_n = n$ and rate $I_p$ of (1.2). This follows from (1.2) with a small argument. For an open set $G$ and $s \in G \cap [0, 1]$, $\lfloor ns \rfloor / n \in G$ for large enough $n$. So

$$\liminf_{n\to\infty} \frac{1}{n} \log P\{S_n/n \in G\} \ge \lim_{n\to\infty} \frac{1}{n} \log P\{S_n = \lfloor ns \rfloor\} = -I_p(s).$$

This holds also for $s \in G \setminus [0, 1]$ because $I_p(s) = \infty$. Taking supremum over $s \in G$ on the right gives the inequality

$$\liminf_{n\to\infty} \frac{1}{n} \log P\{S_n/n \in G\} \ge \sup_{s \in G} \big( -I_p(s) \big) = -\inf_{s \in G} I_p(s).$$

With $G = A^\circ$ this gives the lower bound in (2.1).

Split a closed set $F$ into $F_1 = F \cap (-\infty, p]$ and $F_2 = F \cap [p, \infty)$. First prove the upper bound in (2.1) for $F_1$ and $F_2$ separately. Let $a = \sup F_1 \le p$ and $b = \inf F_2 \ge p$. (If $F_1$ is empty then $a = -\infty$ and if $F_2$ is empty then $b = \infty$.)

Assume first that $a \ge 0$. Then

$$\frac{1}{n} \log P\{S_n/n \in F_1\} \le \frac{1}{n} \log P\{S_n/n \in [0, a]\} = \frac{1}{n} \log \sum_{k=0}^{\lfloor na \rfloor} P\{S_n = k\}.$$

Exercise 2.2. Prove that $P\{S_n = k\}$ increases with $k$ for $k \le \lfloor na \rfloor$.

By the exercise above,

$$\limsup_{n\to\infty} \frac{1}{n} \log P\{S_n/n \in F_1\} \le \lim_{n\to\infty} \frac{1}{n} \log\big[ (\lfloor na \rfloor + 1)\, P\{S_n = \lfloor na \rfloor\} \big] = -I_p(a).$$

This formula is still valid even when $a < 0$ because the probability vanishes. A similar upper bound works for $F_2$. Next write

$$\frac{1}{n} \log P\{S_n/n \in F\} \le \frac{1}{n} \log\big( P\{S_n/n \in F_1\} + P\{S_n/n \in F_2\} \big) \le \frac{1}{n} \log 2 + \max\Big( \frac{1}{n} \log P\{S_n/n \in F_1\},\ \frac{1}{n} \log P\{S_n/n \in F_2\} \Big).$$

$I_p$ is decreasing on $[0, p]$ and increasing on $[p, 1]$. Hence $\inf_{F_1} I_p = I_p(a)$, $\inf_{F_2} I_p = I_p(b)$, and $\inf_F I_p = \min(I_p(a), I_p(b))$. Finally,

$$\limsup_{n\to\infty} \frac{1}{n} \log P\{S_n/n \in F\} \le -\min\big(I_p(a), I_p(b)\big) = -\inf_F I_p\,.$$

If we now take $F = \bar{A}$, the upper bound in (2.1) follows. We have shown that (2.1) holds with $I_p$ defined in (1.2). This is our first example of a full-fledged large deviation principle.

Remark 2.3. The limsup for closed sets and liminf for open sets in (2.1) remind us of weak convergence of probability measures, where the same boundary issue arises. Section B.4 gives the definition of weak convergence.

These exercises contain other instances where the rate function can be derived by hand.

Exercise 2.4. Prove (2.1) for the distribution of the sample mean of an i.i.d. sequence of real-valued normal random variables. Identifying $I$ is part of the task. Hint: The density of $S_n/n$ can be written down explicitly. This suggests $I(x) = (x - \mu)^2/(2\sigma^2)$, where $\mu$ is the mean and $\sigma^2$ is the variance of $X$.

Exercise 2.5. Prove (2.1) for the distribution of the sample mean of an i.i.d. sequence of exponential random variables and compute the rate function explicitly. Hint: Use Stirling's formula.
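For the reader who wants to see where the hint of Exercise 2.4 comes from, here is a quick sketch (ours; it ignores the boundary bookkeeping that the exercise asks for). The sample mean of $n$ i.i.d. normal variables with mean $\mu$ and variance $\sigma^2$ is normal with mean $\mu$ and variance $\sigma^2/n$, so for a Borel set $A$,

$$\mu_n(A) = \int_A \sqrt{\frac{n}{2\pi\sigma^2}}\; e^{-n(x-\mu)^2/(2\sigma^2)}\, dx.$$

The prefactor contributes nothing on the exponential scale, and a Laplace-type argument localizes the integral near the point of $A$ closest to $\mu$, suggesting

$$\frac{1}{n} \log \mu_n(A) \to -\inf_{x \in A} \frac{(x-\mu)^2}{2\sigma^2}\,,$$

which is the hinted rate function $I(x) = (x-\mu)^2/(2\sigma^2)$.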

2.2. Lower semicontinuous and tight rate functions

We continue with some general facts and then in Definition 2.12 state precisely what is meant by a large deviation principle. We recall the definition of a lower semicontinuous function.

Definition 2.6. A function $f : \mathcal{X} \to [-\infty, \infty]$ is lower semicontinuous if $\{f \le c\} = \{x \in \mathcal{X} : f(x) \le c\}$ is a closed subset of $\mathcal{X}$ for all $c \in \mathbb{R}$.

Exercise 2.7. Prove that if $\mathcal{X}$ is a metric space then $f$ is lower semicontinuous if and only if $\liminf_{y \to x} f(y) \ge f(x)$ for all $x$.

An important transformation produces a lower semicontinuous function $f^{\mathrm{lsc}}$ from an arbitrary function $f : \mathcal{X} \to [-\infty, \infty]$. This lower semicontinuous regularization of $f$ is defined by

$$(2.2)\qquad f^{\mathrm{lsc}}(x) = \sup\Big\{ \inf_{y \in G} f(y) : G \ni x \text{ and } G \text{ is open} \Big\}.$$

This turns out to be the maximal lower semicontinuous minorant of $f$.

Lemma 2.8. $f^{\mathrm{lsc}}$ is lower semicontinuous and $f^{\mathrm{lsc}}(x) \le f(x)$ for all $x$. If $g$ is lower semicontinuous and satisfies $g(x) \le f(x)$ for all $x$, then $g(x) \le f^{\mathrm{lsc}}(x)$ for all $x$. In particular, if $f$ is lower semicontinuous, then $f = f^{\mathrm{lsc}}$.

Proof. $f^{\mathrm{lsc}} \le f$ is clear. To show $f^{\mathrm{lsc}}$ is lower semicontinuous, let $x \in \{f^{\mathrm{lsc}} > c\}$. Then there is an open set $G$ containing $x$ such that $\inf_G f > c$. Hence, by the supremum in the definition of $f^{\mathrm{lsc}}$, $f^{\mathrm{lsc}}(y) \ge \inf_G f > c$ for all $y \in G$. Thus $G$ is an open neighborhood of $x$ contained in $\{f^{\mathrm{lsc}} > c\}$. So $\{f^{\mathrm{lsc}} > c\}$ is open.

To show $g \le f^{\mathrm{lsc}}$ one just needs to show that $g^{\mathrm{lsc}} = g$. For then

$$g(x) = \sup\Big\{ \inf_G g : x \in G \text{ and } G \text{ is open} \Big\} \le \sup\Big\{ \inf_G f : x \in G \text{ and } G \text{ is open} \Big\} = f^{\mathrm{lsc}}(x).$$

We already know that $g^{\mathrm{lsc}} \le g$. To show the other direction let $c$ be such that $g(x) > c$. Then $G = \{g > c\}$ is an open set containing $x$ and $\inf_G g \ge c$. Thus $g^{\mathrm{lsc}}(x) \ge c$. Now increase $c$ to $g(x)$.
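A quick illustration of the regularization (our example, not from the text): take $\mathcal{X} = \mathbb{R}$ and $f = \mathbf{1}_{[0,\infty)}$, that is, $f(x) = 1$ for $x \ge 0$ and $f(x) = 0$ for $x < 0$. Then $\{f \le 1/2\} = (-\infty, 0)$ is not closed, so $f$ is not lower semicontinuous. For $x \ne 0$ formula (2.2) gives $f^{\mathrm{lsc}}(x) = f(x)$, while at $x = 0$ every open $G \ni 0$ contains negative points, so $\inf_G f = 0$ and $f^{\mathrm{lsc}}(0) = 0$. Thus $f^{\mathrm{lsc}} = \mathbf{1}_{(0,\infty)}$, which is lower semicontinuous and is easily checked to be the largest lower semicontinuous minorant of $f$, as Lemma 2.8 asserts.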

The above can be reinterpreted in terms of epigraphs. The epigraph of a function $f$ is the set

$$\mathrm{epi}\, f = \{(x, t) \in \mathcal{X} \times \mathbb{R} : f(x) \le t\}.$$

For the next lemma we endow $\mathcal{X} \times \mathbb{R}$ with its product topology.

Lemma 2.9. The epigraph of $f^{\mathrm{lsc}}$ is the closure of $\mathrm{epi}\, f$.

Proof. Note that the epigraph of $f^{\mathrm{lsc}}$ is closed. That it contains the epigraph of $f$ (and thus also the closure of the epigraph of $f$) is immediate because $f^{\mathrm{lsc}} \le f$. For the other inclusion we need to show that any open set outside the epigraph of $f$ is also outside the epigraph of $f^{\mathrm{lsc}}$. Let $A$ be such a set and let $(x, t) \in A$. By the definition of the product topology, there is an open neighborhood $G$ of $x$ and an $\varepsilon > 0$ such that $G \times (t - \varepsilon, t + \varepsilon) \subset A$. So for any $y \in G$ and any $s \in (t - \varepsilon, t + \varepsilon)$, $s < f(y)$. In particular, $t + \varepsilon/2 \le \inf_G f \le f^{\mathrm{lsc}}(x)$. So $(x, t)$ is outside the epigraph of $f^{\mathrm{lsc}}$.

Lower semicontinuous regularization can also be expressed in terms of pointwise alterations of the values of $f$.

Exercise 2.10. Assume $\mathcal{X}$ is a metric space. Show that if $x_n \to x$, then $f^{\mathrm{lsc}}(x) \le \liminf f(x_n)$. Prove that for each $x \in \mathcal{X}$ there is a sequence $x_n \to x$ such that $f(x_n) \to f^{\mathrm{lsc}}(x)$. (The constant sequence $x_n = x$ is allowed here.) This gives the alternate definition $f^{\mathrm{lsc}}(x) = \min\big( f(x),\ \liminf_{y \to x} f(y) \big)$.

Now we apply this to large deviation rate functions. The next lemma shows that rate functions can be assumed to be lower semicontinuous.

Lemma 2.11. Suppose $I$ is a function such that (2.1) holds for all measurable sets $A$. Then (2.1) continues to hold if $I$ is replaced by $I^{\mathrm{lsc}}$.

Proof. $I^{\mathrm{lsc}} \le I$ and so the upper bound is immediate. For the lower bound observe that $\inf_G I^{\mathrm{lsc}} = \inf_G I$ when $G$ is open.

Due to Lemma 2.11 we will call a $[0, \infty]$-valued function $I$ a rate function only when it is lower semicontinuous. Here is the precise definition of a large deviation principle (LDP) for the remainder of the text.

Definition 2.12. Let $I : \mathcal{X} \to [0, \infty]$ be a lower semicontinuous function and $r_n$ a sequence of positive real constants. A sequence of probability measures $\{\mu_n\} \subset \mathcal{M}_1(\mathcal{X})$ is said to satisfy a large deviation principle with rate function $I$ and normalization $r_n$ if the following inequalities hold:

$$(2.3)\qquad \limsup_{n\to\infty} \frac{1}{r_n} \log \mu_n(F) \le -\inf_{x \in F} I(x) \quad\text{for all closed } F \subset \mathcal{X},$$

$$(2.4)\qquad \liminf_{n\to\infty} \frac{1}{r_n} \log \mu_n(G) \ge -\inf_{x \in G} I(x) \quad\text{for all open } G \subset \mathcal{X}.$$

We will abbreviate LDP$(\mu_n, r_n, I)$ if all of the above holds. When the sets $\{I \le c\}$ are compact for all $c \in \mathbb{R}$, we say $I$ is a tight rate function.

Lower semicontinuity makes a rate function unique. For this we assume of $\mathcal{X}$ a little bit more than Hausdorff. A topological space is regular if points and closed sets can be separated by disjoint open neighborhoods. In particular, metric spaces are regular topological spaces.

Theorem 2.13. If $\mathcal{X}$ is a regular topological space, then there is at most one (lower semicontinuous) rate function satisfying the large deviation bounds (2.3) and (2.4).

Proof. We show that $I$ satisfies

$$I(x) = \sup\Big\{ -\limsup_{n\to\infty} \frac{1}{r_n} \log \mu_n(B) : B \ni x \text{ and } B \text{ is open} \Big\}.$$

One direction is easy: for all open $B \ni x$,

$$\limsup_{n\to\infty} \frac{1}{r_n} \log \mu_n(B) \ge -\inf_B I \ge -I(x).$$

For the other direction, fix $x$ and let $c < I(x)$. One can separate $x$ from $\{I \le c\}$ by disjoint open neighborhoods. Thus there exists an open set $G$ containing $x$ and such that $\bar{G} \subset \{I > c\}$. (Note that this is true also for $c < 0$, which is relevant in case $I(x) = 0$.) Then

$$\sup\Big\{ -\limsup_{n\to\infty} \frac{1}{r_n} \log \mu_n(B) : B \ni x \text{ and } B \text{ is open} \Big\} \ge -\limsup_{n\to\infty} \frac{1}{r_n} \log \mu_n(G) \ge -\limsup_{n\to\infty} \frac{1}{r_n} \log \mu_n(\bar{G}) \ge \inf_{\bar{G}} I \ge c.$$

Increasing $c$ to $I(x)$ concludes the proof.

Remark 2.14. Tightness of a rate function is a very useful property, as illustrated by the two exercises below. In a large part of the large deviation literature a rate function $I$ is called good when the sets $\{I \le c\}$ are compact for $c \in \mathbb{R}$. We prefer the term tight as more descriptive and because of the connection with exponential tightness: see Theorem 2.19 below.

Exercise 2.15. Suppose $\mathcal{X}$ is a Hausdorff topological space and let $E \subset \mathcal{X}$ be a closed set. Assume that the relative topology on $E$ is metrized by the metric $d$. Let $I : E \to [0, \infty]$ be a tight rate function and fix an arbitrary closed set $F \subset E$. Prove that

$$\lim_{\varepsilon \searrow 0}\, \inf_{F^\varepsilon} I = \inf_F I,$$

where $F^\varepsilon = \{x \in E : \exists\, y \in F \text{ such that } d(x, y) < \varepsilon\}$.

Exercise 2.16. $\mathcal{X}$ and $E$ as in the exercise above. Suppose $\xi_n$ and $\eta_n$ are $E$-valued random variables defined on $(\Omega, \mathcal{F}, P)$, and for any $\delta > 0$ there exists an $n_0 < \infty$ such that $d(\xi_n(\omega), \eta_n(\omega)) < \delta$ for all $n \ge n_0$ and $\omega \in \Omega$.

(a) Show that if the distributions of $\xi_n$ satisfy the lower large deviation bound (2.4) with some rate function $I : E \to [0, \infty]$, then so do the distributions of $\eta_n$.

(b) Show that if the distributions of $\xi_n$ satisfy the upper large deviation bound (2.3) with some tight rate function $I : E \to [0, \infty]$, then so do the distributions of $\eta_n$.

2.3. Weak large deviation principle

It turns out that it is sometimes difficult to satisfy the upper bound (2.3) for all closed sets. A useful weakening of the LDP requires the upper bound only for compact sets.

Definition 2.17. A sequence of probability measures $\{\mu_n\} \subset \mathcal{M}_1(\mathcal{X})$ satisfies a weak large deviation principle with lower semicontinuous rate function $I : \mathcal{X} \to [0, \infty]$ and normalization $\{r_n\}$ if the lower large deviation bound (2.4) holds for all open sets $G \subset \mathcal{X}$ and the upper large deviation bound (2.3) holds for all compact sets $F \subset \mathcal{X}$.

With enough control on the tails of the measures $\mu_n$, a weak LDP is sufficient for the full LDP.

Definition 2.18. We say $\{\mu_n\} \subset \mathcal{M}_1(\mathcal{X})$ is exponentially tight with normalization $r_n$ if for each $0 < b < \infty$ there exists a compact set $K_b$ such that

$$(2.5)\qquad \limsup_{n\to\infty} \frac{1}{r_n} \log \mu_n(K_b^c) \le -b.$$

Theorem 2.19. Assume the upper bound (2.3) holds for compact sets and $\{\mu_n\}$ is exponentially tight with normalization $r_n$. Then the upper bound (2.3) holds for all closed sets with the same rate function $I$. If the weak LDP$(\mu_n, r_n, I)$ holds and $\{\mu_n\}$ is exponentially tight with normalization $r_n$, then the full LDP$(\mu_n, r_n, I)$ holds and $I$ is a tight rate function.

Proof. Let $F$ be a closed set. Then

$$\limsup_{n\to\infty} \frac{1}{r_n} \log \mu_n(F) \le \limsup_{n\to\infty} \frac{1}{r_n} \log\big( \mu_n(F \cap K_b) + \mu_n(K_b^c) \big) \le \max\Big( -b,\ \limsup_{n\to\infty} \frac{1}{r_n} \log \mu_n(F \cap K_b) \Big) \le \max\Big( -b,\ -\inf_{F \cap K_b} I \Big) \le \max\Big( -b,\ -\inf_F I \Big).$$

Letting $b \nearrow \infty$ proves the upper large deviation bound (2.3).

The weak LDP already contains the lower large deviation bound (2.4) and so we have both bounds. From the lower bound and exponential tightness follows

$$-\inf_{K_{b+1}^c} I \le \liminf_{n\to\infty} \frac{1}{r_n} \log \mu_n(K_{b+1}^c) \le -(b+1).$$

This implies that $\{I \le b\} \subset K_{b+1}$. As a closed subset of a compact set, $\{I \le b\}$ is compact.

The connection between a tight rate function and exponential tightness is an equivalence if we assume a little more of the space. To prove the other implication in Theorem 2.21 below we give an equivalent reformulation of exponential tightness in terms of open balls. In a metric space $(\mathcal{X}, d)$, $B(x, r) = \{y \in \mathcal{X} : d(x, y) < r\}$ is the open $r$-ball centered at $x$.

Lemma 2.20. Let $\{\mu_n\}$ be a sequence of probability measures on a Polish space $\mathcal{X}$. (A Polish space is a complete and separable metric space.) Then $\{\mu_n\}$ is exponentially tight if and only if for every $b < \infty$ and $\delta > 0$ there exist finitely many $\delta$-balls $B_1, \dots, B_m$ such that

$$\mu_n\Big( \Big[ \bigcup_{i=1}^m B_i \Big]^c\, \Big) \le e^{-r_n b} \quad\text{for all } n \in \mathbb{N}.$$

Proof. Ulam's theorem (page 280) says that on a Polish space an individual probability measure $\nu$ is tight, which means that for every $\varepsilon > 0$ there exists a compact set $A$ such that $\nu(A^c) < \varepsilon$. Consequently, on such a space exponential tightness is equivalent to the stronger statement that for all $b < \infty$ there exists a compact set $K_b$ such that $\mu_n(K_b^c) \le e^{-r_n b}$ for all $n \in \mathbb{N}$. Since a compact set can be covered by finitely many $\delta$-balls, the ball condition is a consequence of this stronger form of exponential tightness.

Conversely, assume the ball condition and let $b < \infty$. We need to produce the compact set $K_b$. For each $k \in \mathbb{N}$, find $m_k$ balls $B_{k,1}, \dots, B_{k,m_k}$ of radius $1/k$ such that

$$\mu_n\Big( \Big[ \bigcup_{i=1}^{m_k} B_{k,i} \Big]^c\, \Big) \le e^{-2k r_n b} \quad\text{for all } n \in \mathbb{N}.$$

Let $K = \bigcap_{k=1}^\infty \bigcup_{i=1}^{m_k} \overline{B}_{k,i}$. As a closed subset of $\mathcal{X}$, $K$ is complete. By its construction $K$ is totally bounded. This means that for any $\varepsilon > 0$ it can be covered by finitely many $\varepsilon$-balls. Completeness and total boundedness together are equivalent to compactness in a metric space [26, Theorem 2.3.1]. By explicitly evaluating the geometric series and some elementary estimation,

$$\mu_n(K^c) \le \sum_{k=1}^\infty e^{-2k r_n b} \le e^{-r_n b}.$$

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

Gärtner-Ellis Theorem and applications.

Gärtner-Ellis Theorem and applications. Gärtner-Ellis Theorem and applications. Elena Kosygina July 25, 208 In this lecture we turn to the non-i.i.d. case and discuss Gärtner-Ellis theorem. As an application, we study Curie-Weiss model with

More information

Large Deviations Techniques and Applications

Large Deviations Techniques and Applications Amir Dembo Ofer Zeitouni Large Deviations Techniques and Applications Second Edition With 29 Figures Springer Contents Preface to the Second Edition Preface to the First Edition vii ix 1 Introduction 1

More information

General Theory of Large Deviations

General Theory of Large Deviations Chapter 30 General Theory of Large Deviations A family of random variables follows the large deviations principle if the probability of the variables falling into bad sets, representing large deviations

More information

Lattice spin models: Crash course

Lattice spin models: Crash course Chapter 1 Lattice spin models: Crash course 1.1 Basic setup Here we will discuss the basic setup of the models to which we will direct our attention throughout this course. The basic ingredients are as

More information

Some Background Material

Some Background Material Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Large Deviations for Weakly Dependent Sequences: The Gärtner-Ellis Theorem

Large Deviations for Weakly Dependent Sequences: The Gärtner-Ellis Theorem Chapter 34 Large Deviations for Weakly Dependent Sequences: The Gärtner-Ellis Theorem This chapter proves the Gärtner-Ellis theorem, establishing an LDP for not-too-dependent processes taking values in

More information

2. The Concept of Convergence: Ultrafilters and Nets

2. The Concept of Convergence: Ultrafilters and Nets 2. The Concept of Convergence: Ultrafilters and Nets NOTE: AS OF 2008, SOME OF THIS STUFF IS A BIT OUT- DATED AND HAS A FEW TYPOS. I WILL REVISE THIS MATE- RIAL SOMETIME. In this lecture we discuss two

More information

An almost sure invariance principle for additive functionals of Markov chains

An almost sure invariance principle for additive functionals of Markov chains Statistics and Probability Letters 78 2008 854 860 www.elsevier.com/locate/stapro An almost sure invariance principle for additive functionals of Markov chains F. Rassoul-Agha a, T. Seppäläinen b, a Department

More information

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1 Chapter 2 Probability measures 1. Existence Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension to the generated σ-field Proof of Theorem 2.1. Let F 0 be

More information

Fourth Week: Lectures 10-12

Fourth Week: Lectures 10-12 Fourth Week: Lectures 10-12 Lecture 10 The fact that a power series p of positive radius of convergence defines a function inside its disc of convergence via substitution is something that we cannot ignore

More information

2a Large deviation principle (LDP) b Contraction principle c Change of measure... 10

2a Large deviation principle (LDP) b Contraction principle c Change of measure... 10 Tel Aviv University, 2007 Large deviations 5 2 Basic notions 2a Large deviation principle LDP......... 5 2b Contraction principle................ 9 2c Change of measure................. 10 The formalism

More information

Metric spaces and metrizability

Metric spaces and metrizability 1 Motivation Metric spaces and metrizability By this point in the course, this section should not need much in the way of motivation. From the very beginning, we have talked about R n usual and how relatively

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

Construction of a general measure structure

Construction of a general measure structure Chapter 4 Construction of a general measure structure We turn to the development of general measure theory. The ingredients are a set describing the universe of points, a class of measurable subsets along

More information

Estimates for probabilities of independent events and infinite series

Estimates for probabilities of independent events and infinite series Estimates for probabilities of independent events and infinite series Jürgen Grahl and Shahar evo September 9, 06 arxiv:609.0894v [math.pr] 8 Sep 06 Abstract This paper deals with finite or infinite sequences

More information

An introduction to basic information theory. Hampus Wessman

An introduction to basic information theory. Hampus Wessman An introduction to basic information theory Hampus Wessman Abstract We give a short and simple introduction to basic information theory, by stripping away all the non-essentials. Theoretical bounds on

More information

DETECTING PHASE TRANSITION FOR GIBBS MEASURES. By Francis Comets, University of California, Irvine

The Annals of Applied Probability 1997, Vol. 7, No. 2, 545–563. We propose a new empirical procedure for …

Introduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence

Michael R. Kosorok, Ph.D., Professor and Chair of Biostatistics, Professor of Statistics and Operations …

The main results about probability measures are the following two facts:

Chapter 2: Probability measures. Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F_0 then it has a …

Probability and Measure

Part II past-paper questions, 2005–2018. 2018, Paper 4, Section II, 26J: Let (X, A) be a measurable space, let T : X → X be a measurable map, and µ a probability …

Erdős–Rényi random graphs: basics

Nathanaël Berestycki, U.B.C., class on percolation. We take n vertices and a number p = p(n) with 0 < p < 1. Let G(n, p(n)) be the graph such that there is an edge between …
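
A minimal sketch of sampling G(n, p) in Python (the helper name sample_gnp is mine, not from the notes):

    import random

    def sample_gnp(n, p):
        # Each of the n*(n-1)/2 possible edges is present
        # independently with probability p.
        return [(i, j) for i in range(n) for j in range(i + 1, n)
                if random.random() < p]

    edges = sample_gnp(1000, 0.01)
    print(len(edges))  # concentrates near p*n*(n-1)/2 = 4995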

Measure and integration

Chapter 5. In calculus you have learned how to calculate the size of different kinds of sets: the length of a curve, the area of a region or a surface, the volume or mass of a solid …

1.1. MEASURES AND INTEGRALS

CHAPTER 1: MEASURE THEORY. In this chapter we define the notion of measure µ on a space, construct integrals on this space, and establish their basic properties under limits. The measure µ(E) will be defined …

Measures. Chapter 1: 1.1 Some prerequisites; 1.2 Introduction

Lecture notes for the course Analysis for PhD students, Uppsala University, Spring 2018, Rostyslav Kozhan. Chapter 1, Measures, 1.1 Some prerequisites: I will follow closely the textbook Real Analysis: Modern Techniques …

Lecture Notes Introduction to Ergodic Theory

Tiago Pereira, Department of Mathematics, Imperial College London. Our course consists of five introductory lectures on probabilistic aspects of dynamical systems …

The strictly 1/2-stable example

1 Direct approach: building a Lévy pure jump process on R. Bert Fristedt provided key mathematical facts for this example. A pure jump Lévy process X is a Lévy process such …

1 Stochastic Dynamic Programming

Formally, a stochastic dynamic program has the same components as a deterministic one; the only modification is to the state transition equation. When events in the future …
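
As a toy illustration of that remark (a hypothetical two-state example, not from the text): the deterministic transition s' = f(s, a) becomes a distribution over next states, and the Bellman backup simply averages over it.

    # P[s][a] = list of (prob, next_state, reward): the randomness enters
    # only through the state transition, as the snippet says.
    P = {
        0: {0: [(1.0, 0, 0.0)], 1: [(0.5, 0, 1.0), (0.5, 1, 0.0)]},
        1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
    }
    gamma = 0.9
    V = {0: 0.0, 1: 0.0}
    for _ in range(200):  # value iteration
        V = {s: max(sum(p * (r + gamma * V[t]) for p, t, r in P[s][a])
                    for a in P[s])
             for s in P}
    print(V)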

An introduction to Mathematical Theory of Control

Vasile Staicu, University of Aveiro. UNICA, May 2018 …

CHAPTER I THE RIESZ REPRESENTATION THEOREM

We begin our study by identifying certain special kinds of linear functionals on certain special vector spaces of functions. We describe these linear functionals …

A Note On Large Deviation Theory and Beyond

Jin Feng. In this set of notes, we will develop and explain a whole mathematical theory which can be highly summarized through one simple observation: lim …

1 Introduction (January 21)

CS 97: Concrete Models of Computation, Spring. 1. Deterministic Complexity. Consider a monotonically nondecreasing function f : {1, 2, …, n} → {0, 1}, where f(1) = 0 and f(n) = 1. We call f a step …
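
The truncated setup is the classic "find the step" problem: f is monotone with f(1) = 0 and f(n) = 1 (my reading of the garbled 0/1 values), so binary search finds the jump with about log2 n queries.

    def find_step(f, n):
        # f: {1..n} -> {0,1}, nondecreasing, f(1) = 0, f(n) = 1.
        # Returns the smallest i with f(i) = 1 in O(log n) evaluations.
        lo, hi = 1, n  # invariant: f(lo) = 0, f(hi) = 1
        while hi - lo > 1:
            mid = (lo + hi) // 2
            lo, hi = (lo, mid) if f(mid) == 1 else (mid, hi)
        return hi

    print(find_step(lambda i: int(i >= 42), 100))  # -> 42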

Chapter 2 Metric Spaces

The purpose of this chapter is to present a summary of some basic properties of metric and topological spaces that play an important role in the main body of the book. 2.1 Metrics …

Notes 1 : Measure-theoretic foundations I

Math 733-734: Theory of Probability. Lecturer: Sebastien Roch. References: [Wil91, Sections 1.0–1.8, 2.1–2.3, 3.1–3.11], [Fel68, Sections 7.2, 8.1, 9.6], [Dur10, …

Wiener Measure and Brownian Motion

Chapter 16. Diffusion of particles is a product of their apparently random motion. The density u(t, x) of diffusing particles satisfies the diffusion equation (16.1) …

Ex. 1. Verify that the function H(p_1, …, p_n) = −∑_k p_k log_2 p_k satisfies all 8 axioms on H.

Problem sheet. Ex. 1: Verify that the function H(p_1, …, p_n) = −∑_k p_k log_2 p_k satisfies all 8 axioms on H. Ex. 2 (not to be handed in): List as many of the 8 axioms as you can, without looking at the notes.
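
A quick numerical companion (not part of the problem sheet): computing H and checking that the uniform distribution maximizes it.

    from math import log2

    def H(p):
        # Shannon entropy, -sum p_k log2 p_k, with 0 log 0 = 0.
        return -sum(pk * log2(pk) for pk in p if pk > 0)

    print(H([0.5, 0.5]))       # 1.0 bit
    print(H([0.25] * 4))       # 2.0 bits, the maximum for n = 4
    print(H([0.7, 0.2, 0.1]))  # about 1.157 bits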

Sequences. Chapter 3. Examples: 1. lim (n+1)/(3n+2); 2. lim (sin n)/n; 3. lim (ln(n+1) − ln n); 4. lim (1+n)^(1/n). Answers: 1. 1/3; 2. 0; 3. 0; 4. 1.

Chapter 3: Sequences. Both the main elements of calculus (differentiation and integration) require the notion of a limit. Sequences will play a central role when we work with limits. Definition 3.1. A sequence …
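
For instance, the first two limits in the list work out as

    lim (n+1)/(3n+2) = lim (1 + 1/n)/(3 + 2/n) = 1/3,
    |(sin n)/n| ≤ 1/n → 0,

matching the answers 1/3 and 0 given above.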

Hard-Core Model on Random Graphs

Antar Bandyopadhyay, Theoretical Statistics and Mathematics Unit Seminar, Indian Statistical Institute, New Delhi Centre, New Delhi …

Metric Spaces and Topology

Chapter 2. From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies …

(Ω, F, P) Probability Theory. Kalle Kytölä

Contents: Foreword; Glossary of notations; Chapter O, Introduction: O.1 What are the basic objects of probability theory? O.2 Informal examples of …

RANDOM WALKS AND THE PROBABILITY OF RETURNING HOME

ELIZABETH G. OMBRELLARO. Abstract: This paper is expository in nature. It intuitively explains, using a geometrical and measure-theoretic perspective, why …
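
In the same intuitive spirit, a small simulation (my sketch, not from the paper): estimate the chance that a one-dimensional simple random walk revisits 0 within a fixed horizon; by Pólya's theorem the walk is recurrent in dimensions 1 and 2 and transient in dimension 3 and higher.

    import random

    def returns(steps):
        # One simple random walk on Z; True if it revisits 0.
        pos = 0
        for _ in range(steps):
            pos += random.choice((-1, 1))
            if pos == 0:
                return True
        return False

    trials = 10_000
    print(sum(returns(1000) for _ in range(trials)) / trials)  # close to 1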

Lecture 20 : Markov Chains

CSCI 3560 Probability and Computing. Instructor: Bogdan Chlebus. Lecture 20: Markov Chains. We consider stochastic processes. A process represents a system that evolves through incremental changes called …
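
A minimal two-state illustration of such a process (my toy chain, not the lecture's): simulate from a transition matrix and compare visit frequencies with the stationary distribution.

    import random

    P = [[0.9, 0.1],   # row s gives the distribution of the next state
         [0.5, 0.5]]
    s, visits = 0, [0, 0]
    for _ in range(100_000):
        s = 0 if random.random() < P[s][0] else 1
        visits[s] += 1
    print([v / sum(visits) for v in visits])  # near (5/6, 1/6)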

Appendix B. Topological vector spaces

B.1. Fréchet spaces. In this appendix we go through the definition of Fréchet spaces and their inductive limits, such as they are used for definitions of function …

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V: Lebesgue Integration Theory. 17 Introduction: what are measures and why measurable sets. Definition 17.1 (Preliminary). A measure on a set X is a function µ : 2^X → [0, ∞] such that: 1. µ(∅) = 0; 2. if {E_j} is a finite …

Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory

1 The intuitive meaning of entropy. Modern information theory was born in Shannon's 1948 paper, A Mathematical Theory of Communication …

Chapter 1. Measure Spaces. 1.1 Algebras and σ-algebras of sets. 1.1.1 Notation and preliminaries

We shall denote by X a nonempty set, by P(X) the set of all parts (i.e., subsets) of X, and by ∅ the empty set.

Existence and Uniqueness

Chapter 3. An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect …

BROWNIAN MOTION. JUSTIN HARTMANN

Abstract: This paper begins to explore a rigorous introduction to probability theory using ideas from algebra, measure theory, and other areas. We start with a basic explanation …

17. Convergence of Random Variables

In elementary mathematics courses (such as calculus) one speaks of the convergence of functions: for f_n : R → R, lim f_n = f if lim f_n(x) = f(x) for all x in R. This …

Module 1. Probability

1. Introduction. In our daily life we come across many processes whose nature cannot be predicted in advance. Such processes are referred to as random processes. The only way to derive …

LECTURE 15: COMPLETENESS AND CONVEXITY

1. The Hopf-Rinow Theorem. Recall that a Riemannian manifold (M, g) is called geodesically complete if the maximal defining interval of any geodesic is R. On the other …

Sample Spaces, Random Variables

Moulinath Banerjee, University of Michigan, August 2012. In talking about probabilities, the fundamental object is Ω, the sample space. Points (elements) in Ω are denoted …

Statistics for Financial Engineering Session 2: Basic Set Theory March 19 th, 2006

Topics: What is a set? Notations for sets; empty set; inclusion/containment and subsets; sample spaces and events; operations …

Connectedness. Proposition 2.2. The following are equivalent for a topological space (X, T ).

1 Motivation. Connectedness is the sort of topological property that students love. Its definition is intuitive and easy to understand, and it is a powerful tool in proofs of well-known results …

Continuum Probability and Sets of Measure Zero

Chapter 3. In this chapter, we provide a motivation for using measure theory as a foundation for probability. It uses the example of random coin tossing to …

Foundations of Analysis. Joseph L. Taylor. University of Utah

Contents: Preface; Chapter 1, The Real Numbers: 1.1 Sets and Functions; 1.2 The Natural Numbers; 1.3 Integers and Rational Numbers …

Notes 6 : First and second moment methods

Math 733-734: Theory of Probability. Lecturer: Sebastien Roch. References: [Roc, Sections 2.1-2.3]. Recall THM 6.1 (Markov's inequality): Let X be a non-negative …
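
A one-line numerical check of that inequality, P(X ≥ a) ≤ E[X]/a for nonnegative X (an exponential sample is my choice of example):

    import random

    xs = [random.expovariate(1.0) for _ in range(100_000)]  # E[X] = 1
    a = 3.0
    print(sum(x >= a for x in xs) / len(xs))   # about exp(-3) ~ 0.050
    print(sum(xs) / len(xs) / a)               # Markov bound, about 0.333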

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19

Here is an easy one, to serve as warmup. Assume M is a compact metric space and N is a metric space. Assume that f_n : M → N for each …

The expansion of random regular graphs

David Ellis. Introduction: our aim is now to show that for any d ≥ 3, almost all d-regular graphs on {1, 2, …, n} have edge-expansion ratio at least c·d (if nd is …

ON THE ZERO-ONE LAW AND THE LAW OF LARGE NUMBERS FOR RANDOM WALK IN MIXING RANDOM ENVIRONMENT

Elect. Comm. in Probab. 10 (2005), 36–44. FIRAS RASSOUL-AGHA, Department …

Topological properties

CHAPTER 4: Topological properties. 1. Connectedness: definitions and examples; basic properties; connected components; connected versus path-connected, again. 2. Compactness: definition and first examples; topological …

Measure Theoretic Probability. P.J.C. Spreij

P.J.C. Spreij, this version: September 16, 2009. Contents: 1 σ-algebras and measures: 1.1 σ-algebras; 1.2 Measures …

INTRODUCTION TO REAL ANALYSIS II MATH 4332 BLECHER NOTES

You will be expected to reread and digest these typed notes after class, line by line, trying to follow why the line is true, for example how it …

We are going to discuss what it means for a sequence to converge in three stages: First, we define what it means for a sequence to converge to zero

Chapter 1: Limits of Sequences. Calculus student: lim s_n = 0 means the s_n are getting closer and closer to zero but never gets there. Instructor: ARGHHHHH! Exercise: Think of a better response for the instructor.

Random geometric analysis of the 2d Ising model

Hans-Otto Georgii, Bologna, February 2001. Plan: Foundations: 1. Gibbs measures; 2. Stochastic order; 3. Percolation. The Ising model: 4. Random clusters and phase …

Lecture 2. We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales.

1 Martingales. We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales. 1.1 Doob's inequality. We have the following maximal …

Game Theory and Algorithms Lecture 7: PPAD and Fixed-Point Theorems

March 17, 2011. Summary: The ultimate goal of this lecture is to finally prove Nash's theorem. First, we introduce and prove Sperner's …

MORE ON CONTINUOUS FUNCTIONS AND SETS

Chapter 6. This chapter can be considered enrichment material, containing several more advanced topics, and may be skipped in its entirety. You can proceed directly …

Generalized Neyman Pearson optimality of empirical likelihood for testing parameter hypotheses

Ann. Inst. Stat. Math. (2009) 61:773–787, DOI 10.1007/s10463-008-0172-6. Taisuke Otsu. Received: 1 June 2007 / Revised: …

1 Sequences of events and their limits

O.H., Probability II (MATH 2647), M15. 1.1 Monotone sequences of events. Sequences of events arise naturally when a probabilistic experiment is repeated many times. For …

POSITIVE AND NEGATIVE CORRELATIONS FOR CONDITIONAL ISING DISTRIBUTIONS

CAMILLO CAMMAROTA. Abstract: In the Ising model at zero external field with ferromagnetic first-neighbor interaction, the Gibbs measure …

∑_{j=1}^n [F(b_j) − F(a_j)], where E ⊆ ⋃_{j=1}^n (a_j, b_j]   (4.1)

1.4. CONSTRUCTION OF LEBESGUE-STIELTJES MEASURES. In this section we shall put to use the Carathéodory-Hahn theory, in order to construct measures with certain desirable properties first on the real line …

DOMINO TILINGS and their INVARIANT GIBBS MEASURES

Scott Sheffield. References on arxiv.org: 1. Random Surfaces, to appear in Astérisque. 2. Dimers and amoebae, joint with Kenyon and Okounkov, to appear …

Exercises to Applied Functional Analysis

Exercises to Lecture 1. Here are some exercises about metric spaces. Some of the solutions can be found in my own additional lecture notes on Blackboard, as the …

FORMULATION OF THE LEARNING PROBLEM

MAXIM RAGINSKY. Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we …

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

1 Entropy. Since this course is about entropy maximization, …
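
"Matrix scaling" in the lecture title refers to scaling a positive matrix to a doubly stochastic one; Sinkhorn's alternating row/column normalization is the standard method, and a bare-bones sketch (my implementation, under the assumption of strictly positive entries) looks like this:

    def sinkhorn(A, iters=500):
        # Alternately normalize rows and columns of a positive matrix;
        # the iterates converge to a doubly stochastic matrix.
        M = [row[:] for row in A]
        n = len(M)
        for _ in range(iters):
            for row in M:
                s = sum(row)
                for j in range(n):
                    row[j] /= s
            for j in range(n):
                s = sum(M[i][j] for i in range(n))
                for i in range(n):
                    M[i][j] /= s
        return M

    M = sinkhorn([[1.0, 2.0], [3.0, 4.0]])
    print([sum(r) for r in M])  # row sums ~ 1 (and column sums ~ 1)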

ECE534, Spring 2018: Solutions for Problem Set #4 Due Friday April 6, 2018

Solutions for Problem Set #4. 1. MMSE Estimation, Data Processing and Innovations. The random variables X, Y, Z on a common probability space (Ω, F, P) are said to …

3 Integration and Expectation

3.1 Construction of the Lebesgue Integral. Let (Ω, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral ∫ f dµ …

Topology of the Real Line: Limit Points. The set of limit points of a set S is denoted L(S)

3.3 Limit Points. 3.3.1 Main Definitions. Intuitively speaking, a limit point of a set S in a space X is a point of X which can be approximated by points of S other …

Measurable Choice Functions

(January 19, 2013) Paul Garrett, garrett@math.umn.edu, http://www.math.umn.edu/~garrett/ [This document is http://www.math.umn.edu/~garrett/m/fun/choice functions.pdf] This note …

β_n = (1/n) ∑_{j=1}^n δ_{x_j}. Theorem 1.1. The sequence {P_n} satisfies a large deviation principle on M(X) with the rate function I(β) given by …

1. Sanov's Theorem. Here we consider a sequence of i.i.d. random variables with values in some complete separable metric space X with a common distribution α. Then the sample distribution β_n = (1/n) ∑_{j=1}^n δ_{x_j} maps X …
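
The β_n in the snippet is the empirical measure, and the rate function in Sanov's theorem is the relative entropy H(β | α). A quick sketch of both objects for coin flips (the helper names are mine):

    import random
    from math import log
    from collections import Counter

    def empirical_measure(xs):
        # beta_n = (1/n) sum of point masses at the samples.
        n = len(xs)
        return {x: c / n for x, c in Counter(xs).items()}

    def relative_entropy(beta, alpha):
        # H(beta | alpha) = sum beta(x) log(beta(x)/alpha(x)).
        return sum(b * log(b / alpha[x]) for x, b in beta.items() if b > 0)

    alpha = {"H": 0.5, "T": 0.5}
    xs = random.choices(list(alpha), weights=list(alpha.values()), k=10_000)
    beta = empirical_measure(xs)
    print(beta, relative_entropy(beta, alpha))  # entropy near 0 for large n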

Lecture 2: Review of Basic Probability Theory

ECE 830, Fall 2010, Statistical Signal Processing. Instructor: R. Nowak; scribe: R. Nowak. Probabilistic models will be used throughout the course to represent …

B553 Lecture 1: Calculus Review

Kris Hauser, January 10, 2012. This course requires a familiarity with basic calculus, some multivariate calculus, linear algebra, and some basic notions of metric topology …

A NICE PROOF OF FARKAS LEMMA

DANIEL VICTOR TAUSK. Abstract: The goal of this short note is to present a nice proof of Farkas' Lemma, which states that if C is the convex cone spanned by a finite set and if …

The Lebesgue Integral

Brent Nelson. In these notes we give an introduction to the Lebesgue integral, assuming only a knowledge of metric spaces and the Riemann integral. For more details see [1, Chapters …

Introduction to Dynamical Systems

France-Kosovo Undergraduate Research School of Mathematics, March 2017. This introduction to dynamical systems was a course given at the March 2017 edition of the France …

A VERY BRIEF REVIEW OF MEASURE THEORY

A brief philosophical discussion. Measure theory, as much as any branch of mathematics, is an area where it is important to be acquainted with the basic notions and …

Measure and Integration: Concepts, Examples and Exercises. INDER K. RANA Indian Institute of Technology Bombay India

INDER K. RANA, Department of Mathematics, Indian Institute of Technology Bombay, Powai, Mumbai 400076 …

EECS 229A, Spring 2007: Solutions to Homework 3

1. Problem 4.11 on pg. 93 of the text: stationary processes. (a) By stationarity and the chain rule for entropy, we have H(X_0) + H(X_n | X_0) = H(X_0, …
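
The quoted step is the two-variable chain rule for entropy,

    H(X_0, X_n) = H(X_0) + H(X_n | X_0),

so the truncated right-hand side above is H(X_0, X_n).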

Analysis Finite and Infinite Sets The Real Numbers The Cantor Set

Finite and Infinite Sets. Definition: an initial segment is {n ∈ N : n ≤ n_0}. Definition: a finite set can be put into one-to-one correspondence with an initial segment. The empty set is also considered …

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Real Analysis, Math 131AH, Rudin Chapter #1, Dominique Abdi. 1.1. If r is rational (r ≠ 0) and x is irrational, prove that r + x and rx are irrational. Solution: Assume the contrary, that r + x and rx are rational …

Fundamental Inequalities, Convergence and the Optional Stopping Theorem for Continuous-Time Martingales

Prakash Balachandran, Department of Mathematics, Duke University, April 2, 2008. 1 Review of Discrete-Time …

Infinite systems of interacting particles: notations

In this chapter we investigate infinite systems of interacting particles subject to Newtonian dynamics. Each particle is characterized by its position and velocity (x_i(t), v_i(t)) ∈ R^d × R^d at time t, in dimension d. The index i varies in a countable set I. We call configuration the family, denoted generically by Φ; the interaction enters through pair terms U(x_i(t) − x_j(t)) …

CIS 2033 Lecture 5, Fall

CIS 2033, Lecture 5, Fall 2016. Instructor: David Dobor, September 13, 2016. Supplemental reading from Dekking's textbook: Chapters 2, 3. We mentioned at the beginning of this class that calculus was a prerequisite …

Probability. Lecture Notes. Adolfo J. Rumbos

Adolfo J. Rumbos, October 20, 2014. Contents: 1 Introduction: 1.1 An example from statistical inference; 2 Probability Spaces: 2.1 Sample Spaces and σ-fields …

Quick Tour of the Topology of R. Steven Hurder, Dave Marker, & John Wood 1

Steven Hurder, Dave Marker, & John Wood, Department of Mathematics, University of Illinois at Chicago, April 17, 2003. Preface; Chapter 1, The Topology of R: 1. Open …

Math 564 Homework 1. Solutions.

Problem 1. Prove Proposition 0.2.2. A guide to this problem: start with the open set S = (a, b), for example. First assume that a > −∞, and show that the number a has the properties …

Examples of Dual Spaces from Measure Theory

Chapter 9. We have seen that L^1(X, A, µ) is a Banach space for any measure space (X, A, µ). We will extend that concept in the following section to identify an …

Continuity of convex functions in normed spaces

1. Continuity of convex functions in normed spaces. In this chapter, we consider continuity properties of real-valued convex functions defined on open convex sets in normed spaces. Recall that every infinite-dimensional …