Large deviations, weak convergence, and relative entropy


Markus Fischer, University of Padua

Revised June 20, 2013

1 Introduction

Rare event probabilities and large deviations: basic example and definition in Section 2. Essential tools for large deviations analysis: weak convergence of probability measures (Section 3) and relative entropy (Section 4). Weak convergence is especially useful in the Dupuis and Ellis [1997] approach; see the lectures!

Table 1: Notation

X : a topological space
B(X) : σ-algebra of Borel sets over X, i.e., the smallest σ-algebra containing all open (closed) subsets of X
P(X) : space of probability measures on B(X), endowed with the topology of weak convergence of probability measures
M_b(X) : space of all bounded Borel measurable functions X → R
C(X) : space of all continuous functions X → R
C_b(X) : space of all bounded continuous functions X → R
C_c(X) : space of all continuous functions X → R with compact support
C^k(R^d) : space of all continuous functions R^d → R with continuous partial derivatives up to order k

Preparatory lecture notes for a course on "Representations and weak convergence methods for the analysis and approximation of rare events" given by Prof. Paul Dupuis, Brown University, at the Doctoral School in Mathematics, University of Padua, May 2013.

C_c^k(R^d) : space of all continuous functions R^d → R with compact support and continuous partial derivatives up to order k
T : a subset of R, usually [0, T] or [0, ∞)
C(T : X) : space of all continuous functions T → X
D(T : X) : space of all càdlàg functions T → X (i.e., functions continuous from the right with limits from the left)
∧ : minimum (as binary operator)
∨ : maximum (as binary operator)

2 Large deviations

A standard textbook on the theory of large deviations is Dembo and Zeitouni [1998]; also see Ellis [1985], Deuschel and Stroock [1989], Dupuis and Ellis [1997], den Hollander [2000], and the references therein. A foundational work in the theory is Varadhan [1966].

2.1 Coin tossing

Consider the following random experiments: given a number n ∈ N, toss n coins of the same type and count the number of coins that land heads up. Denote that (random) number by S_n. Then S_n/n is the empirical mean, here equal to the empirical probability of getting heads. What can be said about S_n/n for n large?

To construct a mathematical model for the coin tossing experiments, let X_1, X_2, ... be {0,1}-valued independent and identically distributed (i.i.d.) random variables defined on some probability space (Ω, F, P). Thus each X_i has Bernoulli distribution with parameter p := P(X_1 = 1). Interpret X_i(ω) = 1 as saying that coin i at realization ω ∈ Ω lands heads up. Then S_n = Σ_{i=1}^n X_i. By the strong / weak law of large numbers (LLN), S_n/n → p with probability one / in probability. In particular, by the weak law of large numbers, for all ε > 0, P{|S_n/n − p| ≥ ε} → 0.

More can be said about the asymptotic behavior of those deviation probabilities. Observe that S_n has binomial distribution with parameters (n, p),

that is,

\[ P\{S_n = k\} = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}, \quad k \in \{0, \dots, n\}. \]

By Stirling's formula, asymptotically for large n,

\[ P\{S_n = k\} \sim \frac{\sqrt{2\pi n}\, n^n e^{-n}}{\sqrt{2\pi k}\, k^k e^{-k}\, \sqrt{2\pi (n-k)}\, (n-k)^{n-k} e^{-(n-k)}}\; p^k (1-p)^{n-k}. \]

Therefore, if k ≈ n x for some x ∈ (0, 1), then

\[ \log P\{S_n = k\} \approx -\tfrac{1}{2}\bigl(\log(2\pi) + \log(x) + \log(1-x) + \log(n)\bigr) - n x \log\Bigl(\frac{x}{p}\Bigr) - n (1-x) \log\Bigl(\frac{1-x}{1-p}\Bigr), \]

hence

\[ \frac{1}{n} \log P\{S_n = k\} \approx -\Bigl( x \log\Bigl(\frac{x}{p}\Bigr) + (1-x) \log\Bigl(\frac{1-x}{1-p}\Bigr) \Bigr). \]

The expression x log(x/p) + (1−x) log((1−x)/(1−p)) gives the relative entropy of the Bernoulli distribution with parameter x w.r.t. the Bernoulli distribution with parameter p, which is minimal and zero if and only if x = p. The asymptotic equivalence

\[ \frac{1}{n} \log P\{S_n = k\} \sim -\Bigl( x \log\Bigl(\frac{x}{p}\Bigr) + (1-x) \log\Bigl(\frac{1-x}{1-p}\Bigr) \Bigr), \quad k \approx n x, \]

shows that the probabilities of the events {|S_n/n − p| ≥ ε} converge to zero exponentially fast with rate (up to arbitrarily small corrections)

\[ (p+\varepsilon) \log\Bigl(\frac{p+\varepsilon}{p}\Bigr) + (1-p-\varepsilon) \log\Bigl(\frac{1-p-\varepsilon}{1-p}\Bigr), \]

which corresponds to x = p + ε, the rate of slowest convergence. Events like {|S_n/n − p| ≥ ε} describe large deviations from the law of large numbers limit, in contrast to the fluctuations ("normal deviations") captured by the central limit theorem, which says here that the distribution of √n (S_n/n − p) is asymptotically Gaussian with mean zero and variance p(1−p).
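The asymptotics above are easy to check numerically. The following Python sketch (not part of the original notes; all function names are ours) computes −(1/n) log P{S_n = k} exactly via log-factorials and compares it with the Bernoulli relative entropy, here for the arbitrary choices p = 1/2 and x = 3/4:

```python
import math

def log_binom_pmf(n, k, p):
    """Exact log of P{S_n = k} for S_n ~ Binomial(n, p), via log-factorials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def bernoulli_relative_entropy(x, p):
    """R(Ber(x) || Ber(p)) = x log(x/p) + (1-x) log((1-x)/(1-p)), for 0 < x < 1."""
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

p, x = 0.5, 0.75
for n in [100, 1000, 10000]:
    k = round(n * x)
    rate = -log_binom_pmf(n, k, p) / n
    print(n, rate, bernoulli_relative_entropy(x, p))
# the empirical rate approaches the relative entropy (about 0.1308) as n grows,
# the gap being the (log n)/n Stirling correction visible in the display above
```

The remaining gap of order (log n)/n is exactly the polynomial prefactor from Stirling's formula.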

2.2 The large deviation principle

The theory of large deviations will be developed in this course for random variables taking values in a Polish space. A Polish space is a separable topological space whose topology is compatible with a complete metric. Examples of Polish spaces are:

- R^d with the standard topology;
- any closed subset of R^d (or of another Polish space) equipped with the induced topology;
- the space C(T : X) of continuous functions, T ⊆ (−∞, ∞) an interval, X a complete and separable metric space, equipped with the topology of uniform convergence on compact subsets of T;
- the space D(T : X) of càdlàg functions, T ⊆ (−∞, ∞) an interval, X a complete and separable metric space, equipped with the Skorohod topology [e.g. Billingsley, 1999, Chapter 3];
- the space P(X) of probability measures on B(X), X a Polish space, equipped with the weak convergence topology (cf. Section 3).

Let (ξ_n)_{n∈N} be a family of random variables with values in a Polish space S. Let I : S → [0, ∞] be a function with compact sublevel sets, i.e., {x ∈ S : I(x) ≤ c} is compact for every c ∈ [0, ∞). Such a function is lower semicontinuous and is called a (good) rate function.

Definition 2.1. The sequence (ξ_n)_{n∈N} satisfies the large deviation principle with rate function I iff for all G ∈ B(S),

\[ -\inf_{x \in G^\circ} I(x) \le \liminf_{n\to\infty} \frac{1}{n} \log P\{\xi_n \in G\} \le \limsup_{n\to\infty} \frac{1}{n} \log P\{\xi_n \in G\} \le -\inf_{x \in \mathrm{cl}(G)} I(x). \]

The large deviation principle is a distributional property: writing P_n for Law(ξ_n), (ξ_n) satisfies the large deviation principle with rate function I if and only if for all G ∈ B(S),

\[ -\inf_{x \in G^\circ} I(x) \le \liminf_{n\to\infty} \frac{1}{n} \log P_n(G) \le \limsup_{n\to\infty} \frac{1}{n} \log P_n(G) \le -\inf_{x \in \mathrm{cl}(G)} I(x). \]
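Definition 2.1 can be illustrated with the coin-tossing example: for G = [x, 1] with x > p, the two infima in the definition coincide, and −(1/n) log P{S_n/n ≥ x} should approach I(x) = x log(x/p) + (1−x) log((1−x)/(1−p)). A small Python sketch (ours, not from the notes), computing the binomial tail exactly with a log-sum-exp:

```python
import math

def log_binom_pmf(n, k, p):
    """log P{S_n = k} for S_n ~ Binomial(n, p), via log-factorials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def log_upper_tail(n, x, p):
    """log P{S_n / n >= x}, summed exactly with the log-sum-exp trick."""
    logs = [log_binom_pmf(n, k, p) for k in range(math.ceil(n * x), n + 1)]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs))

def rate(x, p):
    """Rate function of the coin-tossing example, for 0 < x < 1."""
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

p, x = 0.5, 0.6
for n in [100, 1000, 10000]:
    print(n, -log_upper_tail(n, x, p) / n, rate(x, p))
# -(1/n) log P{S_n/n >= x} approaches I(x) (about 0.0201 here) as n grows
```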

The large deviation principle gives a rough description of the asymptotic behavior of the probabilities of rare events. For simplicity, consider only I-continuity sets; a set G ∈ B(S) is called an I-continuity set if inf_{x ∈ cl(G)} I(x) = inf_{x ∈ G°} I(x). For such G,

\[ \lim_{n\to\infty} \frac{1}{n} \log P\{\xi_n \in G\} = -\inf_{x \in G} I(x) =: -I(G), \]

hence

\[ P\{\xi_n \in G\} = P_n(G) = e^{-n (I(G) + o(1))}. \]

The probability of G w.r.t. the law of ξ_n therefore tends to zero exponentially fast as n → ∞ whenever I(G) > 0. Thus, if I(G) > 0, then G is a rare event (w.r.t. P_n for n large).

Example 1 (Coin tossing). The sequence ξ_n := S_n/n, n ∈ N, satisfies the large deviation principle in S := R (or S := [0, 1]) with rate function

\[ I(x) = \begin{cases} x \log\bigl(\frac{x}{p}\bigr) + (1-x) \log\bigl(\frac{1-x}{1-p}\bigr) & \text{if } x \in [0,1], \\ \infty & \text{otherwise.} \end{cases} \]

If p ∈ (0, 1), then I is finite and continuous on [0, 1], and convex on R.

A concept closely related to the large deviation principle is the so-called Laplace principle. Let (ξ_n)_{n∈N} be a family of S-valued random variables, P_n = Law(ξ_n).

Definition 2.2. The sequence (ξ_n) satisfies the Laplace principle with rate function I iff for all F ∈ C_b(S),

\[ \lim_{n\to\infty} \frac{1}{n} \log E\bigl[\exp\bigl(-n F(\xi_n)\bigr)\bigr] = -\inf_{x \in S} \{I(x) + F(x)\}. \]

In Definition 2.2 it is clearly equivalent to require that for all F ∈ C_b(S),

\[ \lim_{n\to\infty} \frac{1}{n} \log \int_S \exp\bigl(n F(x)\bigr)\, P_n(dx) = \sup_{x \in S} \{F(x) - I(x)\}. \]

If S = R^d and if we took F(x) = θ·x (such an F is clearly unbounded), then on the right-hand side above we would have the Legendre transform of I at θ ∈ R^d. Also notice the analogy with Laplace's method for asymptotically evaluating exponential integrals:

\[ \lim_{n\to\infty} \frac{1}{n} \log \int_0^1 e^{n f(x)}\, dx = \max_{x \in [0,1]} f(x) \quad \text{for all } f \in C([0,1]). \]

Basic results [for instance Dembo and Zeitouni, 1998, Chapter 4]:

1. The (good) rate function of a large deviation principle is uniquely determined.

2. If I is the rate function of a large deviation principle, then inf_{x∈S} I(x) = 0 and I(x_0) = 0 for some x_0 ∈ S. If I has a unique minimizer, then the large deviation principle implies a corresponding law of large numbers.

3. The large deviation principle holds if and only if the Laplace principle holds, and the (good) rate function is the same.

4. Contraction principle: Let Y be a Polish space and ψ : S → Y a measurable function. If (ξ_n) satisfies the large deviation principle with (good) rate function I and if ψ is continuous on {x ∈ S : I(x) < ∞}, then (ψ(ξ_n)) satisfies the large deviation principle with (good) rate function J(y) := inf_{x ∈ ψ^{-1}(y)} I(x).

2.3 Large deviations for empirical means

Let X_1, X_2, ... be R-valued i.i.d. random variables with common distribution µ such that m := ∫ x µ(dx) = E[X_1] is finite. As in the case of coin flipping, set S_n := Σ_{i=1}^n X_i and consider the asymptotic behavior of S_n/n, n ∈ N. By the law of large numbers, S_n/n → m with probability one. Let φ_µ be the moment generating function of µ (or of X_1, X_2, ...), that is,

\[ \varphi_\mu(t) := \int_{\mathbb R} e^{t x}\, \mu(dx) = E\bigl[e^{t X_1}\bigr], \quad t \in \mathbb R. \]

Theorem 2.1 (Cramér). Suppose that µ is such that φ_µ(t) is finite for all t ∈ R. Then (S_n/n)_{n∈N} satisfies the large deviation principle with rate function I given by

\[ I(x) := \sup_{t \in \mathbb R} \{t x - \log(\varphi_\mu(t))\}. \]

The rate function I in Theorem 2.1 is the Legendre transform of log φ_µ, the logarithmic moment generating function or cumulant generating function of the common distribution µ. Two particular cases:

1. Bernoulli distribution: µ the Bernoulli distribution on {0, 1} with parameter p. Then φ_µ(t) = 1 − p + p e^t and

\[ I(x) = x \log\Bigl(\frac{x}{p}\Bigr) + (1-x) \log\Bigl(\frac{1-x}{1-p}\Bigr), \quad x \in [0,1], \]

with I(x) = ∞ for x ∈ R \ [0, 1], as in Example 1.

2. Normal distribution: µ the normal distribution with mean 0 and variance σ². Then φ_µ(t) = e^{σ²t²/2} and

\[ I(x) = \frac{x^2}{2\sigma^2}, \quad x \in \mathbb R. \]

Some properties of log φ_µ and the associated rate function I from Theorem 2.1 (hypothesis: φ_µ(t) < ∞ for all t ∈ R):

- φ_µ is in C^∞(R : (0, ∞)), and log φ_µ is strictly convex;
- I(x) = ∞ for x ∉ [essinf X_1, esssup X_1] = Conv(supp(µ));
- I is non-negative and convex;
- I is strictly convex and infinitely differentiable on (essinf X_1, esssup X_1);
- I(m) = 0, I′(m) = 0, and I″(m) = 1/var(µ).

Here we give a sketch of the proof of Theorem 2.1; for a complete proof of Cramér's theorem under weaker assumptions on φ_µ see Section 2.2 in Dembo and Zeitouni [1998].

Proof of Theorem 2.1 (sketch). We may assume m = E[X_1] = 0. Given x > 0, consider the probabilities of the events {S_n/n ≥ x}. By Markov's inequality and since S_n is the sum of i.i.d. random variables with common distribution µ, we have for t > 0,

\[ P\{S_n \ge n x\} = P\bigl\{e^{t S_n} \ge e^{t n x}\bigr\} \le e^{-t n x}\, E\bigl[e^{t S_n}\bigr] = e^{-t n x}\, (\varphi_\mu(t))^n. \]

Since this inequality holds for any t > 0 and log is non-decreasing, we obtain

\[ \limsup_{n\to\infty} \frac{1}{n} \log P\{S_n/n \ge x\} \le -\sup_{t > 0} \{t x - \log(\varphi_\mu(t))\}. \]

By Jensen's inequality, log φ_µ(t) ≥ t E[X_1] = t m = 0. Since log φ_µ(0) = 0, we have I(m) = I(0) = 0 and

\[ t x - \log(\varphi_\mu(t)) \le t m - \log(\varphi_\mu(t)) \le 0 = I(m) \quad \text{for every } t \le 0. \]

It follows that

\[ \sup_{t > 0} \{t x - \log(\varphi_\mu(t))\} = \sup_{t \in \mathbb R} \{t x - \log(\varphi_\mu(t))\} = I(x), \]

which yields the large deviation upper bound for closed sets of the form [x, ∞), x > 0. A completely analogous argument gives the upper bound for closed sets of the form (−∞, x], x < 0. Let G ⊆ R be a closed set not containing zero. Then, thanks to the strict convexity of I and the fact that I ≥ 0 and I(0) = 0, there are x_+ > 0, x_− < 0 such that G ⊆ (−∞, x_−] ∪ [x_+, ∞) and inf_{x∈G} I(x) = inf_{x ∈ (−∞, x_−] ∪ [x_+, ∞)} I(x). This establishes the large deviation upper bound.

To obtain the large deviation lower bound, we first consider open sets of the form (x−δ, x+δ) for x ∈ (essinf X_1, esssup X_1), δ > 0. Fix such x, δ. Since φ_µ is everywhere finite by hypothesis, hence log φ_µ continuously differentiable and strictly convex on R, there exists a unique solution t_x ∈ R of the equation

\[ x = (\log \varphi_\mu)'(t_x) = \frac{\varphi_\mu'(t_x)}{\varphi_\mu(t_x)} = \frac{E\bigl[X_1 e^{t_x X_1}\bigr]}{E\bigl[e^{t_x X_1}\bigr]}, \]

and

\[ I(x) = t_x x - \log \varphi_\mu(t_x). \]

Define a probability measure µ̃ ∈ P(R), absolutely continuous with respect to µ, according to

\[ \frac{d\tilde\mu}{d\mu}(y) := \frac{e^{t_x y}}{\varphi_\mu(t_x)} = \exp\bigl(t_x y - \log \varphi_\mu(t_x)\bigr). \]

Let Y_1, Y_2, ... be i.i.d. random variables with common distribution µ̃, and set S̃_n := Σ_{i=1}^n Y_i, n ∈ N. Then E[Y_i] = x and, for ε ∈ (0, δ),

\[ \begin{aligned} P\{S_n/n \in (x-\varepsilon, x+\varepsilon)\} &= \int_{\{y \in \mathbb R^n : \sum_{i=1}^n y_i \in (n(x-\varepsilon),\, n(x+\varepsilon))\}} \mu^{\otimes n}(dy) \\ &\ge e^{-n (t_x(x-\varepsilon) \vee t_x(x+\varepsilon))} \int_{\{y \in \mathbb R^n : \sum_{i=1}^n y_i \in (n(x-\varepsilon),\, n(x+\varepsilon))\}} e^{t_x \sum_{i=1}^n y_i}\, \mu^{\otimes n}(dy) \\ &= e^{n \log \varphi_\mu(t_x)}\, e^{-n (t_x(x-\varepsilon) \vee t_x(x+\varepsilon))}\, P\bigl\{\tilde S_n/n \in (x-\varepsilon, x+\varepsilon)\bigr\}. \end{aligned} \]

Since Y_1, Y_2, ... are i.i.d. with E[Y_i] = x, we have S̃_n/n → x in probability as n → ∞ by the weak law of large numbers. It follows that

\[ \liminf_{n\to\infty} \frac{1}{n} \log P\{S_n/n \in (x-\varepsilon, x+\varepsilon)\} \ge \log \varphi_\mu(t_x) - \bigl(t_x(x-\varepsilon) \vee t_x(x+\varepsilon)\bigr). \]

Since ε > 0 was arbitrary and log φ_µ(t_x) − t_x x = −I(x), we obtain

\[ \liminf_{n\to\infty} \frac{1}{n} \log P\{S_n/n \in (x-\delta, x+\delta)\} \ge -I(x). \]

If G ⊆ R is an open set with G ⊆ (essinf X_1, esssup X_1), then the preceding inequality implies that

\[ \liminf_{n\to\infty} \frac{1}{n} \log P\{S_n/n \in G\} \ge \sup_{x \in G} (-I(x)) = -\inf_{x \in G} I(x). \]

□

The parameter t_x in the proof of Theorem 2.1 corresponds to an absolutely continuous change of measure with exponential density. Under the new measure, the random variable X_1 has expected value x instead of zero: the rare event {S_n/n ≈ x} becomes typical!

3 Weak convergence of probability measures

Here we collect some facts about weak convergence of probability measures. A standard reference on the topic is Billingsley [1968, 1999]; also see Chapter 3 in Ethier and Kurtz [1986] or Chapter 11 in Dudley [2002].

Let X be a Polish space, let B(X) be the Borel σ-algebra over X, and denote by P(X) the space of probability measures on B(X).

Definition 3.1. A sequence (θ_n)_{n∈N} ⊂ P(X) is said to converge weakly to θ ∈ P(X), in symbols $\theta_n \xrightarrow{w} \theta$, if

\[ \int_X f(x)\, \theta_n(dx) \to \int_X f(x)\, \theta(dx) \quad \text{for all } f \in C_b(X). \]

Remark 3.1. In the terminology of functional analysis, the weak convergence of Definition 3.1 would be called weak-∗ convergence. Denote by M_sgn(X) the space of finite signed measures on B(X). Then M_sgn(X) is a Banach space under the total variation norm ("strong convergence"), and P(X) is a closed convex subset of M_sgn(X) with respect to both the strong and the weak convergence topology. Moreover, M_sgn(X) can be identified with a subspace of the topological dual C_b(X)′ of the Banach space C_b(X) under the sup norm (if X is compact, then M_sgn(X) ≃ C_b(X)′; in general, C_b(X) is not reflexive, not even for X = [0, 1]). Thus P(X) ⊂ C_b(X)′, and the convergence of Definition 3.1 coincides with the weak-∗ convergence on the dual space induced by the original space C_b(X).

Example 2 (Dirac measures). Let (x_n)_{n∈N} ⊂ X be a convergent sequence with limit x ∈ X. Then the sequence of Dirac measures (δ_{x_n})_{n∈N} ⊂ P(X) converges weakly to the Dirac measure δ_x.
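Definition 3.1 can be tested concretely when the integrals are available in closed form. For normal laws θ_n = N(m_n, σ_n²) with m_n → m and σ_n² → σ², and the bounded continuous test function f = cos, one has ∫ cos dN(m, σ²) = cos(m) e^{−σ²/2} (the real part of the characteristic function at t = 1). A small Python sketch (ours, with illustrative parameter choices):

```python
import math

def gaussian_cos_integral(m, s2):
    """∫ cos(x) dN(m, s2)(x) = cos(m) * exp(-s2/2), the real part of the
    characteristic function of N(m, s2) evaluated at t = 1."""
    return math.cos(m) * math.exp(-s2 / 2.0)

# theta_n = N(1/n, 1 + 1/n) converges weakly to theta = N(0, 1)
vals = [gaussian_cos_integral(1.0 / n, 1.0 + 1.0 / n) for n in [1, 10, 100, 1000]]
limit = gaussian_cos_integral(0.0, 1.0)
print(vals, limit)   # ∫ f dθ_n approaches ∫ f dθ, as Definition 3.1 requires
```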

Example 3 (Normal distributions). Let (m_n)_{n∈N} ⊂ R, (σ²_n)_{n∈N} ⊂ [0, ∞) be convergent sequences with limits m and σ², respectively. Then the sequence of normal distributions (N(m_n, σ²_n))_{n∈N} ⊂ P(R) converges weakly to N(m, σ²).

Example 4 (Product measures). Suppose (θ_n)_{n∈N} ⊂ P(X), (µ_n)_{n∈N} ⊂ P(Y) are weakly convergent sequences with limits θ and µ, respectively, where Y is Polish, too. Then the sequence of product measures (θ_n ⊗ µ_n)_{n∈N} ⊂ P(X × Y) converges weakly to θ ⊗ µ.

Example 5 (Marginals vs. joint distribution). Let X, Y, Z be independent real-valued random variables defined on some probability space (Ω, F, P) such that Law(X) = N(0, 1) = Law(Y) (i.e., X, Y have standard normal distribution), and Z has Rademacher distribution, that is, P(Z = 1) = 1/2 = P(Z = −1). Define a sequence (θ_n)_{n∈N} ⊂ P(R²) by θ_{2n} := Law(X, Y), θ_{2n−1} := Law(X, Z·X), n ∈ N. Observe that the random variable Z·X has standard normal distribution N(0, 1). The marginal distributions of (θ_n) therefore converge weakly (they are constant and equal to N(0, 1)), while the sequence (θ_n) itself does not converge (indeed, the joint distribution of X and Z·X is not even Gaussian). To prove weak convergence of probability measures on a product space it is therefore not enough to check convergence of the marginal distributions. This suffices only in the case of product measures; cf. Example 4.

The limit of a weakly convergent sequence in P(X) is unique. Weak convergence induces a topology on P(X); under this topology, X being a Polish space, P(X) is a Polish space, too. Let d be a complete metric compatible with the topology of X; thus (X, d) is a complete and separable metric space. There are different choices for a complete metric on P(X) that is compatible with the topology of weak convergence. Two common choices are the Prohorov metric and the bounded Lipschitz metric, respectively. The Prohorov metric on P(X) is defined by

\[ \rho(\theta, \nu) := \inf\{\varepsilon > 0 : \theta(G) \le \nu(G^\varepsilon) + \varepsilon \text{ for all closed } G \subseteq X\}, \tag{3.1} \]

where G^ε := {x ∈ X : d(x, G) < ε}. Notice that ρ is indeed a metric. The bounded Lipschitz metric on P(X) is defined by

\[ \tilde\rho(\theta, \nu) := \sup\Bigl\{ \Bigl| \int f\, d\theta - \int f\, d\nu \Bigr| : f \in C_b(X) \text{ such that } \|f\|_{bl} \le 1 \Bigr\}, \tag{3.2} \]

where

\[ \|f\|_{bl} := \sup_{x \in X} |f(x)| + \sup_{x, y \in X,\; x \ne y} \frac{|f(x) - f(y)|}{d(x, y)}. \]

The following theorem gives a number of equivalent characterizations of weak convergence.

Theorem 3.1 ("Portemanteau theorem"). Let (θ_n)_{n∈N} ⊂ P(X) be a sequence of probability measures, and let θ ∈ P(X). Then the following are equivalent:

(i) $\theta_n \xrightarrow{w} \theta$ as n → ∞;
(ii) ρ(θ_n, θ) → 0 (Prohorov metric);
(iii) ρ̃(θ_n, θ) → 0 (bounded Lipschitz metric);
(iv) ∫ f dθ_n → ∫ f dθ for all bounded Lipschitz continuous f : X → R;
(v) lim inf θ_n(O) ≥ θ(O) for all open O ⊆ X;
(vi) lim sup θ_n(G) ≤ θ(G) for all closed G ⊆ X;
(vii) lim θ_n(B) = θ(B) for all B ∈ B(X) such that θ(∂B) = 0, where ∂B := cl(B) ∩ cl(B^c) denotes the boundary of the Borel set B;
(viii) ∫ f dθ_n → ∫ f dθ for all f ∈ M_b(X) such that θ(U_f) = 0, where U_f := {x ∈ X : f discontinuous at x}.

Proof. For the equivalence of conditions (i), (ii), (iii), and (iv) see Theorem 11.3.3 in Dudley [2002].

(i) ⇒ (v): Let O ⊆ X be open. If f ∈ C_b(X) is such that 0 ≤ f ≤ 1_O, then

\[ \liminf_{n\to\infty} \theta_n(O) \ge \int_X f\, d\theta, \]

since θ_n(O) ≥ ∫ f dθ_n for every n ∈ N and ∫ f dθ_n → ∫ f dθ as n → ∞ by (i). Since O is open, we can find (f_M)_{M∈N} ⊂ C_b(X) such that 0 ≤ f_M ≤ 1_O and f_M ↑ 1_O pointwise as M → ∞. Then ∫ f_M dθ → θ(O) as M → ∞ by monotone convergence. It follows that lim inf θ_n(O) ≥ θ(O).

(v) ⇔ (vi): This follows by taking complements (open/closed sets).

(v), (vi) ⇒ (vii): Let B ∈ B(X). Clearly, B° ⊆ B ⊆ cl(B), with B° open and cl(B) closed. Therefore, by (v) and (vi),

\[ \theta(B^\circ) \le \liminf_{n\to\infty} \theta_n(B^\circ) \le \liminf_{n\to\infty} \theta_n(B) \le \limsup_{n\to\infty} \theta_n(B) \le \limsup_{n\to\infty} \theta_n(\mathrm{cl}(B)) \le \theta(\mathrm{cl}(B)). \]

If θ(∂B) = 0, then θ(B°) = θ(cl(B)), hence lim θ_n(B) = θ(B).

(vii) ⇒ (viii): Let f ∈ M_b(X) be such that θ(U_f) = 0, and let A := {y ∈ R : θ(f^{-1}{y}) > 0} be the set of atoms of θ ∘ f^{-1}. Since θ ∘ f^{-1} is a finite measure, A is at most countable. Let ε > 0. Then there are N ∈ N and y_0, ..., y_N ∈ R \ A such that y_0 ≤ −‖f‖ < y_1 < ... < y_{N−1} < ‖f‖ ≤ y_N and y_i − y_{i−1} ≤ ε. For i ∈ {1, ..., N} set B_i := f^{-1}([y_{i−1}, y_i)). Then

\[ \theta(\partial B_i) \le \theta\bigl(f^{-1}\{y_{i-1}\}\bigr) + \theta\bigl(f^{-1}\{y_i\}\bigr) + \theta(U_f) = 0. \]

Using (vii) we obtain

\[ \limsup_{n\to\infty} \int f\, d\theta_n \le \limsup_{n\to\infty} \sum_{i=1}^N \theta_n(B_i)\, y_i = \sum_{i=1}^N \theta(B_i)\, y_i \le \varepsilon + \int f\, d\theta. \]

Since ε was arbitrary, it follows that lim sup ∫ f dθ_n ≤ ∫ f dθ. The same argument applied to −f yields the inequality lim inf ∫ f dθ_n ≥ ∫ f dθ. The implication (viii) ⇒ (i) is immediate. □

From Definition 3.1 it is clear that weak convergence is preserved under continuous mappings. The mapping theorem for weak convergence requires continuity only with probability one with respect to the limit measure; this should be compared to characterization (vii) in Theorem 3.1.

Theorem 3.2 (Mapping theorem). Let (θ_n)_{n∈N} ⊂ P(X), θ ∈ P(X). Let Y be a second Polish space, and let ψ : X → Y be a measurable mapping. If $\theta_n \xrightarrow{w} \theta$ and θ{x ∈ X : ψ discontinuous at x} = 0, then $\theta_n \circ \psi^{-1} \xrightarrow{w} \theta \circ \psi^{-1}$.

Proof. By part (v) of Theorem 3.1, it is enough to show that for every open O ⊆ Y,

\[ \liminf_{n\to\infty} \theta_n\bigl(\psi^{-1}(O)\bigr) \ge \theta\bigl(\psi^{-1}(O)\bigr). \]

Let O ⊆ Y be open. Set C := {x ∈ X : ψ continuous at x}. Then ψ^{-1}(O) ∩ C is contained in (ψ^{-1}(O))°, the interior of ψ^{-1}(O). Since (ψ^{-1}(O))° is open, θ(C) = 1, and $\theta_n \xrightarrow{w} \theta$ by hypothesis, it follows from part (v) of Theorem 3.1 that

\[ \theta\bigl(\psi^{-1}(O)\bigr) = \theta\bigl((\psi^{-1}(O))^\circ\bigr) \le \liminf_{n\to\infty} \theta_n\bigl((\psi^{-1}(O))^\circ\bigr) \le \liminf_{n\to\infty} \theta_n\bigl(\psi^{-1}(O)\bigr). \]

□
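The role of the continuity assumption in Theorem 3.2 can be seen in a one-line computation. In the Python sketch below (ours, not from the notes), θ_n = δ_{1/n} converges weakly to δ_0; pushing forward under a map ψ that is discontinuous exactly at 0, a point charged by the limit measure, destroys the convergence, while a continuous map preserves it:

```python
def integrate_dirac(point, f):
    """∫ f d(δ_point): integration against a Dirac measure is evaluation."""
    return f(point)

psi = lambda x: 1.0 if x > 0 else 0.0   # discontinuous exactly at 0
psi_cont = abs                           # continuous everywhere
f = lambda y: min(y, 1.0)                # bounded continuous test function

# ∫ f d(θ_n ∘ ψ⁻¹) = f(ψ(1/n)) = 1 for every n, but ∫ f d(δ_0 ∘ ψ⁻¹) = 0:
vals = [integrate_dirac(psi(1.0 / n), f) for n in [1, 10, 100]]
limit_val = integrate_dirac(psi(0.0), f)

# with a continuous ψ the conclusion of Theorem 3.2 holds:
vals_cont = [integrate_dirac(psi_cont(1.0 / n), f) for n in [1, 10, 100]]
print(vals, limit_val, vals_cont)
```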

The following generalization of Theorem 3.2 is sometimes useful.

Theorem 3.3 (Extended mapping theorem). Let (θ_n)_{n∈N} ⊂ P(X), θ ∈ P(X). Let Y be a second Polish space, and let ψ_n, n ∈ N, and ψ be measurable mappings X → Y. If $\theta_n \xrightarrow{w} \theta$ and

\[ \theta\bigl\{x \in X : \exists\, (x_n) \subset X \text{ such that } x_n \to x \text{ while } \psi_n(x_n) \not\to \psi(x)\bigr\} = 0, \]

then $\theta_n \circ \psi_n^{-1} \xrightarrow{w} \theta \circ \psi^{-1}$.

Proof. See Theorem 5.5 in Billingsley [1968, p. 34] or Theorem 4.27 in Kallenberg [2001].

In the following version of Fatou's lemma the probability measures converge weakly, while the integrand is fixed (the latter condition can be weakened).

Lemma 3.1 (Fatou). Let (θ_n)_{n∈N} ⊂ P(X), θ ∈ P(X). Let g : X → (−∞, ∞] be a lower semicontinuous function bounded from below. If $\theta_n \xrightarrow{w} \theta$, then

\[ \liminf_{n\to\infty} \int_X g\, d\theta_n \ge \int_X g\, d\theta. \]

Proof. See Theorem A.3.2 in Dupuis and Ellis [1997, p. 307].

A sequence (X_n)_{n∈N} of X-valued random variables (possibly defined on different probability spaces) is said to converge in distribution to some X-valued random variable X if the respective laws converge weakly. There is a version of the dominated convergence theorem in connection with convergence in distribution for real-valued (or R^d-valued) random variables; below, E_n denotes expectation on the probability space carrying X_n.

Theorem 3.4 (Dominated convergence). Let (X_n)_{n∈N} be a sequence of R-valued random variables. Suppose that (X_n) converges in distribution to X for some random variable X. If (X_n)_{n∈N} is uniformly integrable, then E[|X|] < ∞ and E_n[X_n] → E[X].

Proof. For M ∈ N, the functions x ↦ (−M) ∨ (x ∧ M) and x ↦ |x| ∧ M are in C_b(R). Since (X_n) converges in distribution to X by hypothesis, this implies

\[ E_n\bigl[|X_n| \wedge M\bigr] \to E\bigl[|X| \wedge M\bigr], \tag{3.3a} \]

\[ E_n\bigl[(-M) \vee (X_n \wedge M)\bigr] \to E\bigl[(-M) \vee (X \wedge M)\bigr]. \tag{3.3b} \]

By hypothesis, (X_n)_{n∈N} is uniformly integrable, that is,

\[ \lim_{M\to\infty} \sup_{n \in \mathbb N} E_n\bigl[|X_n|\, 1_{\{|X_n| \ge M\}}\bigr] = 0. \]

This implies, in particular, that sup_{n∈N} E_n[|X_n|] < ∞. By (3.3a), for every M ∈ N we can choose n_M ∈ N such that |E_{n_M}[|X_{n_M}| ∧ M] − E[|X| ∧ M]| ≤ 1. It follows that

\[ \sup_{M \in \mathbb N} E\bigl[|X| \wedge M\bigr] \le \sup_{M \in \mathbb N} E_{n_M}\bigl[|X_{n_M}| \wedge M\bigr] + 1 \le \sup_{n \in \mathbb N} E_n\bigl[|X_n|\bigr] + 1 < \infty. \]

This shows E[|X|] < ∞, that is, X is integrable, since E[|X| ∧ M] ↑ E[|X|] as M → ∞ by monotone convergence. Now, for any n ∈ N and any M ∈ N,

\[ \bigl|E[X] - E_n[X_n]\bigr| \le \bigl|E\bigl[(-M) \vee (X \wedge M)\bigr] - E_n\bigl[(-M) \vee (X_n \wedge M)\bigr]\bigr| + E\bigl[|X|\, 1_{\{|X| \ge M\}}\bigr] + E_n\bigl[|X_n|\, 1_{\{|X_n| \ge M\}}\bigr]. \]

Let ε > 0. By the integrability of X and the uniform integrability of (X_n) one finds M_ε ∈ N such that

\[ E\bigl[|X|\, 1_{\{|X| \ge M_\varepsilon\}}\bigr] + \sup_{k \in \mathbb N} E_k\bigl[|X_k|\, 1_{\{|X_k| \ge M_\varepsilon\}}\bigr] \le \frac{\varepsilon}{2}. \]

By (3.3b), we can choose n_ε = n(M_ε) such that for all n ≥ n_ε,

\[ \bigl|E\bigl[(-M_\varepsilon) \vee (X \wedge M_\varepsilon)\bigr] - E_n\bigl[(-M_\varepsilon) \vee (X_n \wedge M_\varepsilon)\bigr]\bigr| \le \frac{\varepsilon}{2}. \]

It follows that |E[X] − E_n[X_n]| ≤ ε for all n ≥ n_ε, which establishes the desired convergence since ε was arbitrary. □

Suppose we have a sequence (X_n)_{n∈N} of random variables that converges in distribution to some random variable X; thus $\mathrm{Law}(X_n) \xrightarrow{w} \mathrm{Law}(X)$. If the relation (in particular, the joint distribution) between X_1, X_2, ... is irrelevant, one may work instead with random variables that converge almost surely.

Theorem 3.5 (Skorohod representation). Let (θ_n)_{n∈N} ⊂ P(X). If $\theta_n \xrightarrow{w} \theta$ for some θ ∈ P(X), then there exists a probability space (Ω, F, P) carrying X-valued random variables X_n, n ∈ N, and X such that P ∘ X_n^{-1} = θ_n for every n ∈ N, P ∘ X^{-1} = θ, and X_n → X as n → ∞ P-almost surely.

Proof (for X finite). Assume that X = {1, ..., N} for some N ∈ N. Set Ω := [0, 1], F := B([0, 1]), and let P be Lebesgue measure on B([0, 1]). Set

\[ X(\omega) := \sum_{k=1}^N k\, 1_{(\theta\{i < k\},\, \theta\{i \le k\}]}(\omega), \qquad X_n(\omega) := \sum_{k=1}^N k\, 1_{(\theta_n\{i < k\},\, \theta_n\{i \le k\}]}(\omega). \]

The random variables thus defined have the desired properties since, for every k ∈ {1, ..., N},

\[ \theta_n\{i < k\} \to \theta\{i < k\}, \qquad \theta_n\{i \le k\} \to \theta\{i \le k\} \quad \text{as } n \to \infty. \]

For a general Polish space X (actually, completeness is not needed) one uses countable partitions and an approximation argument; cf. the proof of Theorem 4.30 in Kallenberg [2001] or of Theorem 11.7.2 in Dudley [2002]. □

A standard method for proving that a sequence (a_n)_{n∈N} of elements of a complete metric space S converges to a unique limit a is to proceed as follows. First show that (a_n)_{n∈N} is relatively compact (i.e., cl({a_n : n ∈ N}) is compact in S). Then, taking any convergent subsequence (a_{n(j)})_{j∈N} with limit ã, show that ã = a. This establishes a_n → a as n → ∞.

Relative compactness in P(X) with the topology of weak convergence, X Polish, is equivalent to uniform exhaustibility by compact sets, or tightness.

Definition 3.2. Let I be a non-empty index set. A family (θ_i)_{i∈I} ⊂ P(X) is called tight (or uniformly tight) if for any ε > 0 there is a compact set K_ε ⊆ X such that

\[ \inf_{i \in I} \theta_i(K_\varepsilon) \ge 1 - \varepsilon. \]

Theorem 3.6 (Prohorov). Let I be a non-empty index set, and let (θ_i)_{i∈I} ⊂ P(X), where X is Polish. Then (θ_i)_{i∈I} is tight if and only if (θ_i)_{i∈I} is relatively compact in P(X) with respect to the topology of weak convergence.

Proof. See, for instance, Section 1.5 in Billingsley [1999].

Depending on the structure of the underlying space X, conditions for tightness or relative compactness can be derived. Let us consider here the case X = C([0, ∞), R^d) with the topology of uniform convergence on compact time intervals. With this choice, X is the canonical path space for (R^d-valued) continuous processes. Let X be the canonical process on C([0, ∞), R^d), that is, X(t, ω) := ω(t) for t ≥ 0, ω ∈ C([0, ∞), R^d).
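The construction in the proof of Theorem 3.5 above for finite X can be implemented directly: on Ω = [0, 1] with Lebesgue measure, X_n is the generalized inverse of the cumulative distribution function of θ_n. A Python sketch (ours; the helper names and the concrete sequence θ_n → θ on {1, 2, 3} are illustrative):

```python
import bisect

def quantile_rv(law, omega):
    """X(omega) = sum_k k * 1_{(F(k-1), F(k)]}(omega) for a law on {1, ..., N}:
    the generalized inverse of the CDF, as in the finite-state proof."""
    values = sorted(law)
    cum, cdf = 0.0, []
    for v in values:
        cum += law[v]
        cdf.append(cum)
    return values[bisect.bisect_left(cdf, omega)]

theta = {1: 0.2, 2: 0.5, 3: 0.3}
def theta_n(n):
    # θ_n converges weakly to θ as n grows
    return {1: 0.2 + 1.0 / n, 2: 0.5 - 1.0 / n, 3: 0.3}

omega = 0.25                     # one fixed realization in Ω = [0, 1]
x = quantile_rv(theta, omega)
xs = [quantile_rv(theta_n(n), omega) for n in [10, 100, 1000]]
print(x, xs)                     # X_n(ω) = X(ω) for all n large enough
```

For every ω outside the finitely many jump points of the limiting CDF, X_n(ω) eventually equals X(ω), which is the almost sure convergence claimed in the theorem.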

Theorem 3.7. Let I be a non-empty index set, and let (θ_i)_{i∈I} ⊂ P(C([0, ∞), R^d)). Then (θ_i)_{i∈I} is relatively compact if and only if the following two conditions hold:

(i) (θ_i ∘ X(0)^{-1})_{i∈I} is tight in P(R^d), and

(ii) for every ε > 0 and every T ∈ N there is δ > 0 such that

\[ \sup_{i \in I} \theta_i\bigl(\bigl\{\omega \in C([0, \infty), \mathbb R^d) : w_T(\omega, \delta) \ge \varepsilon\bigr\}\bigr) \le \varepsilon, \]

where w_T(ω, δ) := sup_{s, t ∈ [0, T] : |t − s| ≤ δ} |ω(t) − ω(s)| is the modulus of continuity of ω with size δ over the time interval [0, T].

Proof. See, for instance, Billingsley [1999]; the extension from a compact time interval to [0, ∞) is straightforward.¹

¹ The issue is more delicate when passing from the Skorohod space D([0, T], X) to the Skorohod space D([0, ∞), X); cf. Chapter 3 in Billingsley [1999].

Theorem 3.7 should be compared to the Arzelà-Ascoli criterion for relative compactness in C([0, ∞), R^d). The next theorem gives a sufficient condition for relative compactness (or tightness) in P(C([0, ∞), R^d)); the result should be compared to the Kolmogorov-Chentsov continuity theorem.

Theorem 3.8 (Kolmogorov's sufficient condition). Let I be a non-empty index set, and let (θ_i)_{i∈I} ⊂ P(C([0, ∞), R^d)). Suppose that

(i) (θ_i ∘ X(0)^{-1})_{i∈I} is tight in P(R^d), and

(ii) there are strictly positive numbers C, α, β such that for all t, s ∈ [0, ∞) and all i ∈ I,

\[ E_{\theta_i}\bigl[|X(s) - X(t)|^\alpha\bigr] \le C\, |t - s|^{1+\beta}. \]

Then (θ_i)_{i∈I} is relatively compact in P(C([0, ∞), R^d)).

Proof. See, for instance, Corollary 6.5 in Kallenberg [2001].

Tightness of a family (θ_i)_{i∈I} ⊂ P(S) can often be established by showing that sup_{i∈I} G(θ_i) < ∞ for an appropriate function G. This works if G is a tightness function in the sense of the definition below.

Definition 3.3. A measurable function g : S → [0, ∞] is called a tightness function on S if g has relatively compact sublevel sets, that is, for every M ∈ R, cl{x ∈ S : g(x) ≤ M} is compact in S.

A way of constructing tightness functions on P(S) is provided by the following result.

Theorem 3.9. Suppose g is a tightness function on S. Define a function G on P(S) by G(θ) := ∫_S g(x) θ(dx). Then G is a tightness function on P(S).

Proof. The function G is well-defined with values in [0, ∞]. Let M ∈ [0, ∞), and set A := {θ ∈ P(S) : G(θ) ≤ M}. By Prohorov's theorem it is enough to show that A = A(M) is tight. Let ε > 0. Set K_ε := cl{x ∈ S : g(x) ≤ M/ε}. Then K_ε is compact since g is a tightness function, and for θ ∈ A,

\[ M \ge G(\theta) = \int_S g\, d\theta \ge \int_{S \setminus K_\varepsilon} g\, d\theta \ge \frac{M}{\varepsilon}\, \theta(S \setminus K_\varepsilon). \]

It follows that sup_{θ∈A} θ(S \ K_ε) ≤ ε, hence inf_{θ∈A} θ(K_ε) ≥ 1 − ε. □

4 Relative entropy

Here we collect properties of relative entropy for probability measures on Polish spaces. We will mostly refer to Dupuis and Ellis [1997]; also see the classical works by Kullback and Leibler [1951] and Kullback [1959], where relative entropy is introduced as an information measure and called directed divergence. Let S, X, Y be Polish spaces.

Definition 4.1. Let µ, ν ∈ P(S). The relative entropy of µ with respect to ν is given by

\[ R(\mu \,\|\, \nu) := \begin{cases} \int_S \log\bigl(\frac{d\mu}{d\nu}(x)\bigr)\, \mu(dx) & \text{if } \mu \ll \nu, \\ \infty & \text{else.} \end{cases} \]

Relative entropy is well-defined as a function P(S) × P(S) → [0, ∞]. Indeed, if µ ≪ ν, then a density f := dµ/dν exists by the Radon-Nikodym theorem, with f uniquely determined ν-almost surely. In this case,

\[ R(\mu \,\|\, \nu) = \int_S f(x) \log(f(x))\, \nu(dx). \]
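Definition 4.1 is easy to evaluate for discrete measures. A minimal Python sketch (ours, not from the notes): for laws given as atom-mass dictionaries it computes R(µ‖ν) and illustrates non-negativity, the characterization of equality, the convention R(µ‖ν) = ∞ when µ is not absolutely continuous with respect to ν, and the asymmetry of relative entropy:

```python
import math

def relative_entropy(mu, nu):
    """R(mu || nu) for discrete laws given as {atom: mass} dictionaries;
    returns math.inf unless mu is absolutely continuous w.r.t. nu."""
    r = 0.0
    for atom, m in mu.items():
        if m == 0.0:
            continue                  # convention: 0 log 0 = 0
        if nu.get(atom, 0.0) == 0.0:
            return math.inf           # mu not absolutely continuous w.r.t. nu
        r += m * math.log(m / nu[atom])
    return r

nu = {"a": 0.5, "b": 0.3, "c": 0.2}
mu = {"a": 0.1, "b": 0.6, "c": 0.3}
print(relative_entropy(mu, nu))          # > 0
print(relative_entropy(nu, nu))          # = 0 iff the two measures coincide
print(relative_entropy({"d": 1.0}, nu))  # inf: no absolute continuity
print(relative_entropy(nu, mu))          # differs from R(mu || nu): not a metric
```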

Clearly, lim_{x→0+} x log(x) = 0. Since ∫_S f dν = 1 and x log(x) ≥ x − 1 for all x ≥ 0, with equality if and only if x = 1, it follows that R(µ‖ν) ≥ 0, with R(µ‖ν) = 0 if and only if µ = ν. Relative entropy can actually be defined for σ-finite measures on an arbitrary measurable space.

Lemma 4.1 (Basic properties). Properties of relative entropy R(·‖·) for probability measures on a Polish space S:

(a) Relative entropy is a non-negative, convex, lower semicontinuous function P(S) × P(S) → [0, ∞].

(b) For ν ∈ P(S), R(·‖ν) is strictly convex on {µ ∈ P(S) : R(µ‖ν) < ∞}.

(c) For ν ∈ P(S), R(·‖ν) has compact sublevel sets.

(d) Let Π denote the set of finite measurable partitions of S. Then for all µ, ν ∈ P(S),

\[ R(\mu \,\|\, \nu) = \sup_{\pi \in \Pi} \sum_{A \in \pi} \mu(A) \log\Bigl(\frac{\mu(A)}{\nu(A)}\Bigr), \]

where x log(x/y) := 0 if x = 0, and x log(x/y) := ∞ if x > 0 and y = 0.

(e) In particular, taking π = {A, A^c} in (d): for every A ∈ B(S) and any µ, ν ∈ P(S),

\[ R(\mu \,\|\, \nu) \ge \mu(A) \log\Bigl(\frac{\mu(A)}{\nu(A)}\Bigr) + \mu(A^c) \log\Bigl(\frac{\mu(A^c)}{\nu(A^c)}\Bigr). \]

Proof. See Lemma 1.4.3, parts (b), (c), (g), in Dupuis and Ellis [1997].

Lemma 4.2 (Contraction property). Let ψ : Y → X be a Borel measurable mapping. Let η ∈ P(X), γ_0 ∈ P(Y). Then

\[ R\bigl(\eta \,\|\, \gamma_0 \circ \psi^{-1}\bigr) = \inf_{\gamma \in P(Y) :\, \gamma \circ \psi^{-1} = \eta} R(\gamma \,\|\, \gamma_0), \tag{4.1} \]

where inf ∅ := ∞ by convention.

Proof (sketch). The inequality "≤" is analogous to the proof of Lemma E.2.1 in Dupuis and Ellis [1997, p. 366]. For the opposite inequality, check that the probability measure γ̄ defined by

\[ \bar\gamma(dy) := \frac{d\eta}{d(\gamma_0 \circ \psi^{-1})}(\psi(y))\, \gamma_0(dy) \]

attains the infimum whenever that infimum is finite.²

² For more details and an application see, for instance, M. Fischer, On the form of the large deviation rate function for the empirical measures of weakly interacting systems, arXiv preprint [math.PR].
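Lemma 4.2 and its explicit minimizer can be checked on a small discrete example. The following Python sketch (ours; the measures and the map ψ are arbitrary illustrative choices) builds γ̄ from the formula in the proof sketch and verifies that it attains the infimum in (4.1), while another measure with the same pushforward does worse:

```python
import math

def relative_entropy(mu, nu):
    """R(mu || nu) for discrete laws {atom: mass}; math.inf unless mu << nu."""
    r = 0.0
    for a, m in mu.items():
        if m > 0.0:
            if nu.get(a, 0.0) == 0.0:
                return math.inf
            r += m * math.log(m / nu[a])
    return r

def pushforward(gamma, psi):
    out = {}
    for y, m in gamma.items():
        out[psi(y)] = out.get(psi(y), 0.0) + m
    return out

psi = lambda y: y % 2                       # measurable map Y = {0,1,2,3} -> X = {0,1}
gamma0 = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
eta = {0: 0.5, 1: 0.5}                      # target law on X

nu = pushforward(gamma0, psi)               # γ₀ ∘ ψ⁻¹
lhs = relative_entropy(eta, nu)             # left-hand side of (4.1)

# candidate minimizer from the proof sketch: γ̄(y) = (dη/dν)(ψ(y)) γ₀(y)
gamma_bar = {y: eta[psi(y)] / nu[psi(y)] * m for y, m in gamma0.items()}
print(lhs, relative_entropy(gamma_bar, gamma0))   # equal

# another γ with γ ∘ ψ⁻¹ = η has strictly larger relative entropy:
gamma_alt = {0: 0.3, 1: 0.1, 2: 0.2, 3: 0.4}
assert all(abs(pushforward(gamma_alt, psi)[a] - eta[a]) < 1e-12 for a in eta)
print(relative_entropy(gamma_alt, gamma0))
```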

Lemma 4.2 yields the invariance property of relative entropy established in Lemma E.2.1 [Dupuis and Ellis, 1997, p. 366] in the case when ψ is a bijective bi-measurable mapping; also cf. Theorem 4.1 in Kullback and Leibler [1951], where the inequality implied by Lemma 4.2 is derived.

Lemma 4.3 (Chain rule). Let X, Y be Polish spaces. Let α, β ∈ P(X × Y) and denote their marginal distributions on X by α_1 and β_1, respectively. Let α(·|·), β(·|·) be stochastic kernels on Y given X such that for all A ∈ B(X), B ∈ B(Y),

\[ \alpha(A \times B) = \int_A \alpha(B|x)\, \alpha_1(dx), \qquad \beta(A \times B) = \int_A \beta(B|x)\, \beta_1(dx). \]

Then the mapping x ↦ R(α(·|x) ‖ β(·|x)) is measurable and

\[ R(\alpha \,\|\, \beta) = R(\alpha_1 \,\|\, \beta_1) + \int_X R\bigl(\alpha(\cdot|x) \,\|\, \beta(\cdot|x)\bigr)\, \alpha_1(dx). \]

In particular, if α = α_1 ⊗ α_2 and β = β_1 ⊗ β_2 are product measures, then

\[ R(\alpha_1 \otimes \alpha_2 \,\|\, \beta_1 \otimes \beta_2) = R(\alpha_1 \,\|\, \beta_1) + R(\alpha_2 \,\|\, \beta_2). \]

Proof. See Appendix C.3 in Dupuis and Ellis [1997].

The variational representation for Laplace functionals given in Lemma 4.4 below is the starting point for the weak convergence approach to large deviations; its statement should be compared to Definition 2.2, the definition of the Laplace principle.

Lemma 4.4 (Laplace functionals). Let ν ∈ P(S). Then for all g ∈ M_b(S),

\[ -\log \int_S \exp(-g(x))\, \nu(dx) = \inf_{\mu \in P(S)} \Bigl\{ R(\mu \,\|\, \nu) + \int_S g(x)\, \mu(dx) \Bigr\}. \]

The infimum in the variational formula above is attained at µ_0 ∈ P(S) given by

\[ \frac{d\mu_0}{d\nu}(x) := \frac{\exp(-g(x))}{\int_S \exp(-g(y))\, \nu(dy)}, \quad x \in S. \]

Proof. Let g ∈ M_b(S), and define µ_0 through its density with respect to ν as above. Notice that µ_0 and ν are mutually absolutely continuous. Let µ ∈ P(S) be such that R(µ‖ν) < ∞. Then µ is absolutely continuous with respect to

ν with density dµ/dν, but also absolutely continuous with respect to µ_0 with density

\[ \frac{d\mu}{d\mu_0} = \frac{d\mu}{d\nu} \cdot \frac{d\nu}{d\mu_0}, \quad \text{where } \frac{d\nu}{d\mu_0} = e^{g} \int_S e^{-g}\, d\nu. \]

It follows that

\[ \begin{aligned} R(\mu \,\|\, \nu) + \int_S g\, d\mu &= \int_S \log\Bigl(\frac{d\mu}{d\nu}\Bigr)\, d\mu + \int_S g\, d\mu \\ &= \int_S \log\Bigl(\frac{d\mu}{d\mu_0}\Bigr)\, d\mu + \int_S \log\Bigl(\frac{d\mu_0}{d\nu}\Bigr)\, d\mu + \int_S g\, d\mu \\ &= R(\mu \,\|\, \mu_0) - \log \int_S e^{-g}\, d\nu. \end{aligned} \]

This yields the assertion since R(µ‖µ_0) ≥ 0, with R(µ‖µ_0) = 0 if and only if µ = µ_0. □

Lemma 4.4 also allows one to derive the Donsker-Varadhan variational formula for relative entropy itself.

Lemma 4.5 (Donsker-Varadhan). Let µ, ν ∈ P(S). Then

\[ R(\mu \,\|\, \nu) = \sup_{g \in M_b(S)} \Bigl\{ \int_S g(x)\, \mu(dx) - \log \int_S \exp(g(x))\, \nu(dx) \Bigr\}. \]

Proof. Let µ, ν ∈ P(S). By Lemma 4.4 (applied with −g in place of g), for every g ∈ M_b(S),

\[ R(\mu \,\|\, \nu) \ge \int_S g\, d\mu - \log \int_S e^{g}\, d\nu, \]

hence

\[ R(\mu \,\|\, \nu) \ge \sup_{g \in M_b(S)} \Bigl\{ \int_S g\, d\mu - \log \int_S e^{g}\, d\nu \Bigr\}. \]

For g ∈ M_b(S) set J(g) := ∫_S g dµ − log ∫_S e^g dν. Thus R(µ‖ν) ≥ sup_{g ∈ M_b(S)} J(g). To obtain equality, it is enough to find a sequence (g_M)_{M∈N} ⊂ M_b(S) such that lim sup_M J(g_M) = R(µ‖ν). We distinguish two cases.

First case: µ is not absolutely continuous with respect to ν. Then R(µ‖ν) = ∞ and there exists A ∈ B(S) such that µ(A) > 0 while ν(A) = 0. Choose such a set A and set g_M := M·1_A. Then, for every M ∈ N, g_M = 0 ν-almost surely, thus ∫_S e^{g_M} dν = ∫_S e^0 dν = 1, hence log ∫_S e^{g_M} dν = 0. It follows that

\[ \limsup_{M\to\infty} J(g_M) = \limsup_{M\to\infty} \int_S g_M\, d\mu = \limsup_{M\to\infty} M \mu(A) = \infty. \]

Second case: $\mu$ is absolutely continuous with respect to $\nu$. Then we can choose a measurable function $f: S \to [0,\infty)$ such that $f$ is a density for $\mu$ with respect to $\nu$ ($f$ a version of the Radon-Nikodym derivative $d\mu/d\nu$), and $R(\mu\,\|\,\nu) = \int_S f \log(f)\,d\nu$, where the value of the integral is in $[0,\infty]$. Set
$$g_M(x) \doteq \log(f(x))\,\mathbf{1}_{[1/M,M]}(f(x)) - M\,\mathbf{1}_{\{0\}}(f(x)), \qquad x \in S.$$
Then
$$\lim_{M\to\infty} \int_S g_M\,d\mu = \lim_{M\to\infty} \int_S f \log(f)\,\mathbf{1}_{[1/M,M]}(f)\,d\nu = \lim_{M\to\infty} \left( \int_S \left( f\log(f) + \mathbf{1}_{(0,\infty)}(f) \right) \mathbf{1}_{[1/M,M]}(f)\,d\nu - \int_S \mathbf{1}_{[1/M,M]}(f)\,d\nu \right) = \int_S f\log(f)\,d\nu + \nu\{f > 0\} - \nu\{f > 0\} = R(\mu\,\|\,\nu)$$
by dominated convergence and monotone convergence, since $t\log(t) \geq -1$ for every $t \geq 0$ and $t\log(t) = 0$ if $t = 0$, hence, for every $x \in S$,
$$\left( f(x)\log(f(x)) + \mathbf{1}_{(0,\infty)}(f(x)) \right) \mathbf{1}_{[1/M,M]}(f(x)) \nearrow f(x)\log(f(x)) + \mathbf{1}_{(0,\infty)}(f(x))$$
as $M \to \infty$. On the other hand, again using dominated and monotone convergence, respectively,
$$\lim_{M\to\infty} \log \int_S e^{g_M}\,d\nu = \log\left( \lim_{M\to\infty} \int_S \left( f\,\mathbf{1}_{[1/M,M]}(f) + \mathbf{1}_{(0,1/M)\cup(M,\infty)}(f) + e^{-M}\,\mathbf{1}_{\{0\}}(f) \right) d\nu \right) = \log \int_S f\,d\nu = \log(1) = 0.$$
It follows that $\limsup_{M\to\infty} J(g_M) = \lim_{M\to\infty} \int_S g_M\,d\mu = R(\mu\,\|\,\nu)$.

Remark 4.1. If the state space $S$ is Polish, as we assume, then the supremum in the Donsker-Varadhan formula of Lemma 4.5 can be restricted to bounded and continuous functions, that is,
$$\sup_{g \in M_b(S)} \left\{ \int_S g(x)\,\mu(dx) - \log \int_S \exp(g(x))\,\nu(dx) \right\} = \sup_{g \in C_b(S)} \left\{ \int_S g(x)\,\mu(dx) - \log \int_S \exp(g(x))\,\nu(dx) \right\};$$
see the proof of formula (C.1) in Dupuis and Ellis [1997].
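On a finite state space, all the integrals above reduce to finite sums, so the identities in Lemmas 4.3 through 4.5 can be sanity-checked numerically. The following sketch is not from the text; the helper names (`rel_entropy`, `J`) and the concrete probability vectors are our own choices. It verifies the variational formula of Lemma 4.4 together with its minimizer $\mu^*$, the Donsker-Varadhan formula of Lemma 4.5 with the maximizer $g = \log(d\mu/d\nu)$, and the product-measure case of the chain rule of Lemma 4.3.

```python
import math
import random

def rel_entropy(mu, nu):
    """R(mu || nu) = sum_i mu_i * log(mu_i / nu_i); +infinity if mu is not
    absolutely continuous with respect to nu (some mu_i > 0 where nu_i = 0)."""
    r = 0.0
    for m, n in zip(mu, nu):
        if m > 0.0:
            if n == 0.0:
                return math.inf
            r += m * math.log(m / n)
    return r

nu = [0.5, 0.3, 0.2]   # reference measure on a three-point space
g = [1.0, -0.5, 2.0]   # a bounded "test function"

# Lemma 4.4: -log int e^{-g} dnu = inf_mu { R(mu||nu) + int g dmu },
# attained at mu* with density dmu*/dnu proportional to e^{-g}.
z = sum(math.exp(-gi) * ni for gi, ni in zip(g, nu))
mu_star = [math.exp(-gi) * ni / z for gi, ni in zip(g, nu)]
lhs = -math.log(z)
val_at_star = rel_entropy(mu_star, nu) + sum(gi * mi for gi, mi in zip(g, mu_star))
assert abs(lhs - val_at_star) < 1e-12

random.seed(0)
for _ in range(1000):  # no randomly drawn mu beats the minimizer mu*
    w = [random.random() for _ in nu]
    mu = [wi / sum(w) for wi in w]
    assert rel_entropy(mu, nu) + sum(gi * mi for gi, mi in zip(g, mu)) >= lhs - 1e-12

# Lemma 4.5 (Donsker-Varadhan): R(mu||nu) = sup_g { int g dmu - log int e^g dnu };
# for mu << nu on a finite space the supremum is attained at g = log(dmu/dnu).
mu = [0.2, 0.5, 0.3]

def J(g):
    return (sum(gi * mi for gi, mi in zip(g, mu))
            - math.log(sum(math.exp(gi) * ni for gi, ni in zip(g, nu))))

g_opt = [math.log(m / n) for m, n in zip(mu, nu)]
assert abs(J(g_opt) - rel_entropy(mu, nu)) < 1e-12
for _ in range(1000):  # no randomly drawn g beats g_opt
    g_rand = [random.uniform(-5.0, 5.0) for _ in nu]
    assert J(g_rand) <= rel_entropy(mu, nu) + 1e-12

# Lemma 4.3, product-measure case: R(a1 x a2 || b1 x b2) = R(a1||b1) + R(a2||b2).
a1, a2 = [0.3, 0.7], [0.6, 0.4]
b1, b2 = [0.5, 0.5], [0.25, 0.75]
prod_a = [x * y for x in a1 for y in a2]
prod_b = [x * y for x in b1 for y in b2]
assert abs(rel_entropy(prod_a, prod_b)
           - (rel_entropy(a1, b1) + rel_entropy(a2, b2))) < 1e-12
```

On a finite space with $\mu \ll \nu$ the Donsker-Varadhan supremum over $M_b(S)$ reduces to a supremum over vectors in $\mathbb{R}^3$ and is attained, which is why the random-search checks above can only confirm inequalities while the explicit $\mu^*$ and $g_{\mathrm{opt}}$ witness equality.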

Remark 4.2. Lemmata 4.4 and 4.5 imply a relationship of convex duality between Laplace functionals and relative entropy. Let $\nu \in P(S)$. Then
$$R(\mu\,\|\,\nu) = \sup_{g \in M_b(S)} \left\{ \int_S g\,d\mu - \log \int_S e^{g}\,d\nu \right\}, \qquad \mu \in P(S),$$
$$\log \int_S e^{g}\,d\nu = \sup_{\mu \in P(S)} \left\{ \int_S g\,d\mu - R(\mu\,\|\,\nu) \right\}, \qquad g \in M_b(S),$$
that is, the functions $\mu \mapsto R(\mu\,\|\,\nu)$ and $g \mapsto \log \int_S e^{g}\,d\nu$ are convex conjugates.

References

P. Billingsley. Convergence of Probability Measures. Wiley Series in Probability and Statistics. John Wiley & Sons, New York, 1968.

P. Billingsley. Convergence of Probability Measures. Wiley Series in Probability and Statistics. John Wiley & Sons, New York, 2nd edition, 1999.

A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications, volume 38 of Applications of Mathematics. Springer, New York, 2nd edition, 1998.

F. den Hollander. Large Deviations, volume 14 of Fields Institute Monographs. American Mathematical Society, Providence, RI, 2000.

J.-D. Deuschel and D. W. Stroock. Large Deviations. Academic Press, Boston, 1989.

R. Dudley. Real Analysis and Probability. Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2002.

P. Dupuis and R. S. Ellis. A Weak Convergence Approach to the Theory of Large Deviations. Wiley Series in Probability and Statistics. John Wiley & Sons, New York, 1997.

R. S. Ellis. Entropy, Large Deviations and Statistical Mechanics, volume 271 of Grundlehren der mathematischen Wissenschaften. Springer, New York, 1985.

S. N. Ethier and T. G. Kurtz. Markov Processes: Characterization and Convergence. Wiley Series in Probability and Statistics. John Wiley & Sons, New York, 1986.

O. Kallenberg. Foundations of Modern Probability. Probability and Its Applications. Springer, New York, 2nd edition, 2002.

S. Kullback. Information Theory and Statistics. John Wiley and Sons, Inc., New York, 1959.

S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Statistics, 22:79–86, 1951.

S. R. S. Varadhan. Asymptotic probabilities and differential equations. Comm. Pure Appl. Math., 19:261–286, 1966.


More information

7 The Radon-Nikodym-Theorem

7 The Radon-Nikodym-Theorem 32 CHPTER II. MESURE ND INTEGRL 7 The Radon-Nikodym-Theorem Given: a measure space (Ω,, µ). Put Z + = Z + (Ω, ). Definition 1. For f (quasi-)µ-integrable and, the integral of f over is (Note: 1 f f.) Theorem

More information

Large Deviations, Linear Statistics, and Scaling Limits for Mahler Ensemble of Complex Random Polynomials

Large Deviations, Linear Statistics, and Scaling Limits for Mahler Ensemble of Complex Random Polynomials Large Deviations, Linear Statistics, and Scaling Limits for Mahler Ensemble of Complex Random Polynomials Maxim L. Yattselev joint work with Christopher D. Sinclair International Conference on Approximation

More information

Local semiconvexity of Kantorovich potentials on non-compact manifolds

Local semiconvexity of Kantorovich potentials on non-compact manifolds Local semiconvexity of Kantorovich potentials on non-compact manifolds Alessio Figalli, Nicola Gigli Abstract We prove that any Kantorovich potential for the cost function c = d / on a Riemannian manifold

More information

Chapter 4. Measure Theory. 1. Measure Spaces

Chapter 4. Measure Theory. 1. Measure Spaces Chapter 4. Measure Theory 1. Measure Spaces Let X be a nonempty set. A collection S of subsets of X is said to be an algebra on X if S has the following properties: 1. X S; 2. if A S, then A c S; 3. if

More information

ELEMENTS OF PROBABILITY THEORY

ELEMENTS OF PROBABILITY THEORY ELEMENTS OF PROBABILITY THEORY Elements of Probability Theory A collection of subsets of a set Ω is called a σ algebra if it contains Ω and is closed under the operations of taking complements and countable

More information

1 Stochastic Dynamic Programming

1 Stochastic Dynamic Programming 1 Stochastic Dynamic Programming Formally, a stochastic dynamic program has the same components as a deterministic one; the only modification is to the state transition equation. When events in the future

More information

Contents: 1. Minimization. 2. The theorem of Lions-Stampacchia for variational inequalities. 3. Γ -Convergence. 4. Duality mapping.

Contents: 1. Minimization. 2. The theorem of Lions-Stampacchia for variational inequalities. 3. Γ -Convergence. 4. Duality mapping. Minimization Contents: 1. Minimization. 2. The theorem of Lions-Stampacchia for variational inequalities. 3. Γ -Convergence. 4. Duality mapping. 1 Minimization A Topological Result. Let S be a topological

More information

ψ(t) := log E[e tx ] (1) P X j x e Nψ (x) j=1

ψ(t) := log E[e tx ] (1) P X j x e Nψ (x) j=1 Seminars 15/01/018 Ex 1. Exponential estimate Let X be a real valued random variable, and suppose that there exists r > 0 such that E[e rx ]

More information

Analysis of Probabilistic Systems

Analysis of Probabilistic Systems Analysis of Probabilistic Systems Bootcamp Lecture 2: Measure and Integration Prakash Panangaden 1 1 School of Computer Science McGill University Fall 2016, Simons Institute Panangaden (McGill) Analysis

More information

1. Aufgabenblatt zur Vorlesung Probability Theory

1. Aufgabenblatt zur Vorlesung Probability Theory 24.10.17 1. Aufgabenblatt zur Vorlesung By (Ω, A, P ) we always enote the unerlying probability space, unless state otherwise. 1. Let r > 0, an efine f(x) = 1 [0, [ (x) exp( r x), x R. a) Show that p f

More information

Continuous Functions on Metric Spaces

Continuous Functions on Metric Spaces Continuous Functions on Metric Spaces Math 201A, Fall 2016 1 Continuous functions Definition 1. Let (X, d X ) and (Y, d Y ) be metric spaces. A function f : X Y is continuous at a X if for every ɛ > 0

More information