A Note On Large Deviation Theory and Beyond


Jin Feng

In this set of notes, we will develop and explain a mathematical theory that can be summarized through one simple observation:
$$ \lim_{n \to +\infty} \frac{1}{n} \log \big( e^{na} + e^{nb} \big) = a \vee b. $$
Staring at the above identity for a moment, if you are sufficiently sensitive, you discover that two subjects of mathematics are shouting at you: on the left-hand side, you see summation, and hence probability theory; on the right-hand side, you see maximization, and hence the calculus of variations. Large deviation theory, an abstract framework that makes the above simple observation rigorous and extensive, has brought profound impacts to mathematics as well as to physics and engineering...

Copyright 200 Jin Feng
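The identity can be checked numerically. Below is a minimal sketch (my own illustration, not part of the notes): a log-sum-exp evaluation of $\frac{1}{n}\log(e^{na}+e^{nb})$ for growing $n$, showing convergence to $a \vee b = \max(a, b)$.

```python
import math

def log_sum_rate(a, b, n):
    """Compute (1/n) * log(e^{n a} + e^{n b}) stably via the log-sum-exp trick."""
    m = max(n * a, n * b)
    return (m + math.log(math.exp(n * a - m) + math.exp(n * b - m))) / n

a, b = 0.3, 1.7
for n in (1, 10, 100, 1000):
    print(n, log_sum_rate(a, b, n))
# the rate decreases toward max(a, b) = 1.7 as n grows
```

The log-sum-exp rescaling matters: for large $n$, $e^{na}$ overflows floating point, while the rescaled sum never exceeds 2.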

LECTURE 1
Sanov Theorem, from the Point of View of Boltzmann

1.1. Outline of Lecture
Sanov theorem, the mathematics.
Why did Boltzmann care: the concept of entropy, and an elementary proof of Sanov via the Stirling formula.
Gibbs conditioning and maximum entropy principles.

1.2. Sanov theorem, the abstract setup
Let
(a) $(S, d)$ be a complete separable metric space;
(b) $\{X_i : i = 1, 2, \dots\}$ be i.i.d. $S$-valued random variables with probability law $\gamma(dx) := P(X_1 \in dx)$;
(c) $\mu_n(dx) := \frac{1}{n} \sum_{i=1}^n \delta_{X_i}(dx) \in \mathcal{P}(S)$ be the measure-valued random variable (the empirical measure).

Define
$$ S(\rho \,\|\, \gamma) := \int_S \log \frac{d\rho}{d\gamma} \, d\rho. $$

Let $\mathcal{P}(S)$ be given the weak convergence topology with a compatible metric. Then

Theorem 1.1. For each $\rho \in \mathcal{P}(S)$,
$$ \lim_{\epsilon \to 0+} \lim_{n \to \infty} \frac{1}{n} \log P\big(\mu_n \in B(\rho; \epsilon)\big) = -S(\rho \,\|\, \gamma). $$

1.3. Boltzmann in 1877
Why did Boltzmann care? The setting of a discrete ideal gas:
(a) $S := \{x_1, x_2, \dots, x_m\}$;
(b) $\mathcal{P}(S) = \{\gamma = (\gamma_1, \dots, \gamma_m) : \sum_{k=1}^m \gamma_k = 1, \ \gamma_k \geq 0\}$;
(c) $\mu_n = (\mu_n(x_1), \dots, \mu_n(x_m))$, where $\mu_n(x) = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}(\{x\})$, $x \in S$.

$\mu_n$ is a model for the shape of a thin gas. Think about why.

Theorem 1.2 (Boltzmann).
$$ P(\mu_n \approx \rho) \approx \exp\{-n S(\rho \,\|\, \gamma)\}. $$

Proof.
$$ P(\mu_n \approx \rho) = P\Big( (\mu_n(x_1), \dots, \mu_n(x_m)) \approx \tfrac{1}{n}(n\rho_1, \dots, n\rho_m) \Big) = P\Big( (\#\{X_i = x_1\}, \dots, \#\{X_i = x_m\}) \approx (n\rho_1, \dots, n\rho_m) \Big) = \frac{n!}{(n\rho_1)! \cdots (n\rho_m)!} \, \gamma_1^{n\rho_1} \cdots \gamma_m^{n\rho_m}. $$
By Stirling's formula (we will revisit this issue using the Gamma function in the second lecture),
$$ \log(k!) = k \log k - k + O(\log k), $$
so
$$ \frac{1}{n} \log P(\mu_n \approx \rho) = \log n - 1 + O\Big(\frac{\log n}{n}\Big) - \sum_i \rho_i \log(n\rho_i) + \sum_i \rho_i + \sum_i O\Big(\frac{\log(n\rho_i)}{n}\Big) + \sum_i \rho_i \log \gamma_i = \sum_i \big( -\rho_i \log \rho_i + \rho_i \log \gamma_i \big) + O\Big(\frac{\log n}{n}\Big) = -S(\rho \,\|\, \gamma) + O\Big(\frac{\log n}{n}\Big). $$

Definition 1.3 (Relative Entropy).
$$ S(\rho \,\|\, \gamma) := \int_S \Big( \log \frac{d\rho}{d\gamma} \Big) \, d\rho. $$
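Boltzmann's computation can be checked directly on a finite state space. The sketch below (my addition, not part of the notes; the three-point state space and the measures are made up for illustration) evaluates the exact multinomial probability via `math.lgamma` and compares $-\frac{1}{n}\log P(\mu_n = \rho)$ with $S(\rho \,\|\, \gamma)$.

```python
import math

def rel_entropy(rho, gamma):
    """S(rho || gamma) = sum_i rho_i log(rho_i / gamma_i) on a finite state space."""
    return sum(r * math.log(r / g) for r, g in zip(rho, gamma) if r > 0)

def log_prob_empirical(counts, gamma):
    """Exact log P(mu_n = counts / n) for n i.i.d. samples from gamma (multinomial)."""
    n = sum(counts)
    out = math.lgamma(n + 1)
    for k, g in zip(counts, gamma):
        out += -math.lgamma(k + 1) + k * math.log(g)
    return out

gamma = [0.5, 0.3, 0.2]
rho = [0.2, 0.3, 0.5]              # target empirical shape, rational entries
for n in (10, 100, 1000, 10000):
    counts = [round(n * r) for r in rho]
    print(n, -log_prob_empirical(counts, gamma) / n, rel_entropy(rho, gamma))
# the finite-n rate approaches S(rho || gamma), with an O(log n / n) error
```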

From the above, we know that $\mu_n \to \gamma$ in probability (the law of large numbers). Indeed, we know more than just that.

Lemma 1.4. Let $A$ be a set whose closure $\bar{A}$ does not contain $\gamma$. Then
$$ P(\mu_n \in A) \leq C e^{-n I(A)} \to 0 \quad \text{as } n \to \infty, \qquad \text{where } I(A) := \inf_{\rho \in \bar{A}} S(\rho \,\|\, \gamma) > 0. $$

What does the distribution of $(X_1, \dots, X_K)$ converge to? The following is a complicated way to answer this extremely simple question. First of all, the limit is a product measure (why?). Note that
$$ \langle f, \mu_n \rangle := \int f \, d\mu_n = \frac{1}{n} \sum_{i=1}^n f(X_i), $$
so by the identical distribution property, and by the above lemma,
$$ E[f(X_1)] = E[\langle f, \mu_n \rangle] \to \int f \, d\gamma $$
(indeed, the last limit holds as an equality without the limit). Hence the answer is $\gamma^{\otimes K} := \gamma \otimes \cdots \otimes \gamma$.

1.4. Maximum entropy and Gibbs conditioning

Problem. Suppose that we made one observation regarding the samples $\{X_1, \dots, X_n\}$. Knowing such a priori information, how does it change the setup and conclusion of the above Sanov theorem?

For instance, suppose that $h$ is a function on $S$ and we observe
$$ H_n := \frac{h(X_1) + \dots + h(X_n)}{n} = \int_S h(x) \, \mu_n(dx) =: f(\mu_n). $$
What is
$$ \lim_n \frac{1}{n} \log P(\mu_n \approx \rho \mid H_n \approx e)? $$
Since $H_n = f(\mu_n)$ is a function of $\mu_n$, the event $\{H_n = e\} = \{\mu_n \in f^{-1}(e)\}$. Therefore we arrive at a more general question, which is answered by the following

Theorem 1.5 (Gibbs conditioning principle).
$$ \lim_n \frac{1}{n} \log P(\mu_n \in A \mid \mu_n \in B) = -I(A \cap B) + I(B). $$

Proof. By the Sanov theorem, for a large class of sets $A \subset \mathcal{P}(S)$,
$$ \lim_n \frac{1}{n} \log P(\mu_n \in A) = -I(A). $$
Therefore
$$ \frac{1}{n} \log P(\mu_n \in A \mid \mu_n \in B) = \frac{1}{n} \log P(\mu_n \in A \cap B) - \frac{1}{n} \log P(\mu_n \in B) \to -I(A \cap B) + I(B) = -\inf_{\rho \in A \cap B} S(\rho \,\|\, \gamma) + \inf_{\rho \in B} S(\rho \,\|\, \gamma). $$

What is the most likely state for $\mu_n$ under the conditional probability $P(\mu_n \in \cdot \mid \mu_n \in B)$?

Theorem 1.6 (Maximum entropy principle). Suppose that $\rho^*$ is the unique minimizer such that
$$ S(\rho^* \,\|\, \gamma) = \inf_{\rho \in B} S(\rho \,\|\, \gamma). $$
Then
$$ \lim_n P\big(d(\mu_n, \rho^*) > \delta \mid \mu_n \in B\big) = 0, \qquad \forall \delta > 0. $$

Proof. Let $A := \{\rho : d(\rho, \rho^*) > \delta\}$. Then
$$ M := \inf_{\rho \in B \cap \{\rho : d(\rho, \rho^*) > \delta\}} S(\rho \,\|\, \gamma) - \inf_{\rho \in B} S(\rho \,\|\, \gamma) > 0. $$
Hence
$$ P\big(d(\mu_n, \rho^*) > \delta \mid \mu_n \in B\big) \approx e^{-n M} \to 0. $$

We now consider the special case $\{\mu_n \in B\} := \{H_n \approx e\}$. By the maximum entropy principle, we would like to minimize $S(\rho \,\|\, \gamma)$ under the constraint $\langle h, \rho \rangle = e$. By the Lagrange multiplier method, we optimize the function
$$ F(\rho, \beta) := S(\rho \,\|\, \gamma) - \beta \big( \langle h, \rho \rangle - e \big) = S(\rho \,\|\, \gamma_\beta) - \log Z_\beta, $$
where the parametrized probability measure
$$ \gamma_\beta(dx) = \frac{1}{Z_\beta} e^{\beta (h(x) - e)} \gamma(dx). $$

From $\partial F = 0$, we get, for each $i$,
$$ \log \rho_i - \log \gamma_i - \beta h(x_i) = \text{const}. $$
That is,
$$ \rho_i^* := \frac{e^{\beta^* h(x_i)} \gamma_i}{\sum_j e^{\beta^* h(x_j)} \gamma_j} = \gamma_i^{\beta^*}, $$
where $\beta^*$ is determined by
$$ \langle h, \rho^* \rangle = \int_S h \, d\gamma_{\beta^*} = e. \tag{1.7} $$

For people familiar with the statistical theory of estimation and inference, one recognizes the exponential-family connection. It is therefore natural to introduce the pressure function (to be discussed more extensively in the next lecture)
$$ \Lambda(\beta) := \log \int e^{\beta h(x)} \, \gamma(dx). $$
It can be verified that
$$ \Lambda'(\beta) = \frac{\sum_i h(x_i) e^{\beta h(x_i)} \gamma_i}{\sum_j e^{\beta h(x_j)} \gamma_j} = \int_S h(x) \, \gamma_\beta(dx), $$
and that $\Lambda''(\beta) > 0$.

Corollary 1.8 (Macro-state). The most likely "macro"-state is
$$ \rho_i^* := \frac{e^{\beta^* h(x_i)} \gamma_i}{Z_{\beta^*}}, $$
with $\beta^*$ uniquely determined by $\Lambda'(\beta^*) = e$.

Next, we derive $\lim_n P(X_1 \in \cdot \mid H_n \approx e)$. As in the Sanov case, by symmetry,
$$ E[f(X_1) \mid H_n \approx e] = E[\langle f, \mu_n \rangle \mid H_n \approx e]. $$
By the law of large numbers,
$$ \lim_n E[f(X_1) \mid H_n \approx e] = \int_S f \, d\gamma_{\beta^*}. $$
By the de Finetti theorem (review the concept of exchangeability),
$$ P(X_1 \in dx_1, \dots, X_n \in dx_n \mid H_n) = \prod_{i=1}^n P(X_i \in dx_i \mid H_n). $$
Therefore
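On a finite state space, the tilted measure $\gamma_\beta$ and the multiplier $\beta^*$ can be computed explicitly. The sketch below (my own illustration; the four-level observable and the target mean $e = 1.0$ are made up) finds $\beta^*$ by bisection, which is justified because $\Lambda'$ is strictly increasing ($\Lambda'' > 0$), and then checks the constraint $\langle h, \rho^* \rangle = e$.

```python
import math

def pressure_derivative(beta, h, gamma):
    """Lambda'(beta) = sum_i h_i e^{beta h_i} gamma_i / sum_j e^{beta h_j} gamma_j."""
    w = [math.exp(beta * hi) * g for hi, g in zip(h, gamma)]
    return sum(hi * wi for hi, wi in zip(h, w)) / sum(w)

def tilt(beta, h, gamma):
    """The Gibbs measure gamma_beta with density e^{beta h} / Z_beta w.r.t. gamma."""
    w = [math.exp(beta * hi) * g for hi, g in zip(h, gamma)]
    z = sum(w)
    return [wi / z for wi in w]

def solve_beta(e, h, gamma, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection for Lambda'(beta) = e, using the monotonicity of Lambda'."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if pressure_derivative(mid, h, gamma) < e:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

gamma = [0.25, 0.25, 0.25, 0.25]   # uniform prior on four levels
h = [0.0, 1.0, 2.0, 3.0]           # observable h
beta_star = solve_beta(1.0, h, gamma)   # observed mean e = 1.0 < unconditioned mean 1.5
rho_star = tilt(beta_star, h, gamma)
print(beta_star, rho_star, sum(hi * ri for hi, ri in zip(h, rho_star)))
```

Since the observed mean 1.0 is below the unconditioned mean 1.5, the solver returns a negative $\beta^*$: conditioning pushes weight toward the low levels.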

Corollary 1.9. For each fixed $K$,
$$ \lim_n P\big( (X_1, \dots, X_K) \in \cdot \mid H_n \approx e \big) = (\gamma_{\beta^*})^{\otimes K}. $$


LECTURE 2
Free Energy and Entropy, à la Gibbs

2.1. Outline of Lecture
A duality between free energy and entropy.
Properties of relative entropy.

2.2. An entropy-free energy (pressure) duality
The Gibbs conditioning principle tells us the following. We start with a model $X_1, \dots, X_n \sim \gamma$ and make observations based on $H_n := \frac{1}{n} \sum_i h(X_i)$. Conditioning on the fact that we saw $H_n$, we should update our prior belief: the underlying measure becomes
$$ d\gamma_{\beta, h} := \frac{1}{Z_{\beta, h}} e^{\beta h} \, d\gamma $$
for some constant $\beta$. This is essentially a Bayes theorem. Since $\beta$ and $h$ always come together, we will just set $\beta = 1$ and write the renormalized new reference measure (a Gibbs measure)
$$ d\gamma_h := \frac{e^h}{Z_h} \, d\gamma $$
with normalizing (partition) constant $Z_h := \int e^h \, d\gamma$.

Let $h \in C_b(S)$. The log-partition functional
$$ \Lambda(h) := \log Z_h = \log \int e^h \, d\gamma $$
plays a key role as a dual functional to entropy. We first observe that
$$ S(\rho \,\|\, \gamma) - \langle h, \rho \rangle = \int \Big( \log \frac{d\rho}{d\gamma} - h \Big) \, d\rho = \int \log \frac{d\rho}{d\gamma_h} \, d\rho - \log Z_h = S(\rho \,\|\, \gamma_h) - \Lambda(h). $$

We have the following infinite-dimensional version of the Legendre-Fenchel transform.

Theorem 2.1 (Lanford-Varadhan).
$$ S(\rho \,\|\, \gamma) = \sup_{h \in C_b(S)} \{ \langle h, \rho \rangle - \Lambda(h) \}, \qquad \Lambda(h) = \sup_{\rho \in \mathcal{P}(S)} \{ \langle h, \rho \rangle - S(\rho \,\|\, \gamma) \}. $$
The supremum in the second identity is uniquely attained at $\gamma_h$.

Proof. Since
$$ S(\rho \,\|\, \gamma) + \Lambda(h) = \langle h, \rho \rangle + S(\rho \,\|\, \gamma_h) \geq \langle h, \rho \rangle, $$
and since $\rho = \gamma_h$ is the only solution of $S(\rho \,\|\, \gamma_h) = 0$, the conclusions follow.

2.3. Properties of entropy

Lemma 2.2. $S(\cdot \,\|\, \gamma) : \mathcal{P}(S) \to \mathbb{R}_+$ is convex.

Proof. This is because
$$ S(\rho \,\|\, \gamma) = \int_S \frac{d\rho}{d\gamma} \log \frac{d\rho}{d\gamma} \, d\gamma = \int_S h\Big( \frac{d\rho}{d\gamma} \Big) \, d\gamma, $$
where $h(r) = r \log r$ is convex. The positivity follows from Jensen's inequality.

Review the concept, and give examples, of semicontinuous functions.

Lemma 2.3. Let $f_\alpha(\cdot) : S \to \mathbb{R}$ be lower semicontinuous for each fixed $\alpha \in \Lambda$. Then
$$ f(x) := \sup_{\alpha \in \Lambda} f_\alpha(x) $$
is still lower semicontinuous.
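On a finite state space, Theorem 2.1 can be checked numerically: the supremum in the first identity is attained at $h = \log \frac{d\rho}{d\gamma}$, and no other $h$ does better. A small sketch (my illustration; the two measures are made up):

```python
import math, random

gamma = [0.5, 0.3, 0.2]
rho = [0.2, 0.3, 0.5]

def Lambda(h):
    """Log-partition functional Lambda(h) = log sum_i e^{h_i} gamma_i."""
    return math.log(sum(math.exp(hi) * g for hi, g in zip(h, gamma)))

def pairing(h, p):
    """<h, p> = sum_i h_i p_i."""
    return sum(hi * pi for hi, pi in zip(h, p))

S = sum(r * math.log(r / g) for r, g in zip(rho, gamma))  # S(rho || gamma)

random.seed(0)
best = max(pairing(h, rho) - Lambda(h)
           for h in ([random.uniform(-5, 5) for _ in gamma] for _ in range(2000)))

h_opt = [math.log(r / g) for r, g in zip(rho, gamma)]  # h = log(d rho / d gamma)
print(best, pairing(h_opt, rho) - Lambda(h_opt), S)
# random h stay strictly below S; the explicit optimizer attains it exactly
```

For the optimizer, $\Lambda(h_{\mathrm{opt}}) = \log \sum_i (\rho_i/\gamma_i)\gamma_i = \log 1 = 0$, so the value is exactly $S(\rho \,\|\, \gamma)$, matching the equality case in the proof above.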

Proof. This is a consequence of
$$ \{x : f(x) \leq c\} = \bigcap_{\alpha \in \Lambda} \{x : f_\alpha(x) \leq c\}. \tag{2.4} $$

Lemma 2.5. $S(\cdot \,\|\, \cdot) : \mathcal{P}(S) \times \mathcal{P}(S) \to \mathbb{R}_+$ is lower semicontinuous in the weak convergence topology.

Proof. This is a consequence of the variational representation in Theorem 2.1.


LECTURE 3
Large Deviation, General Theory

3.1. Outline of Lecture
Laplace lemma.
Large deviation principle, Laplace principle, and related notions.
Exponential tightness.
Rate functions and techniques for identifying them.
The situation for stochastic processes.

3.2. Laplace lemma
We will make sense of an infinite-dimensional generalization of
$$ \lim_{n \to \infty} \Big( \int_0^1 e^{-n f(x)} \, dx \Big)^{1/n} = \exp\Big\{ -\min_{0 \leq x \leq 1} f(x) \Big\} $$
and its far-reaching impacts on physical applications.

Lemma 3.1 (Laplace Lemma).
$$ \frac{1}{n} \log \int_S e^{n f(z)} \, \mu(dz) \to \sup_{z \in \mathrm{supp}(\mu)} f(z) \quad \text{as } n \to \infty, $$
with $f \in C_b(S)$, $\mu \in M_b(S)$.

Proof. Take-home exercise.

As an application, we prove the Stirling formula for the Gamma function
$$ \Gamma(\alpha) := \int_0^\infty x^{\alpha - 1} e^{-x} \, dx. $$
Note that $\Gamma(n) = (n - 1)!$. We are interested in the behavior of $\Gamma(\alpha)$ as $\alpha \to \infty$.

By the change of variable $x = \alpha y$, and by using the Laplace lemma,
$$ \Gamma(\alpha) = \alpha^\alpha \int_0^\infty \frac{1}{y} \exp\{-\alpha (y - \log y)\} \, dy \approx e^{\alpha \log \alpha - \alpha}, $$
since $\min_{y > 0} \{y - \log y\} = 1 - \log 1 = 1$.

To be more precise,
$$ \lim_{\alpha \to \infty} \Big( \frac{\Gamma(\alpha)}{\alpha^\alpha} \Big)^{1/\alpha} = e^{-1}. $$
The special case $\alpha = n + 1$ gives the well-known Stirling formula
$$ n! \approx e^{n \log n - n}. $$
Indeed, if we are more careful, we have the next-order expansion around the stationary point $y_0 = 1$:
$$ y - \log y = 1 + \frac{1}{2}(y - 1)^2 + O((y - 1)^3). $$
By Gaussian integral properties,
$$ \Gamma(\alpha) = \alpha^{\alpha - \frac{1}{2}} e^{-\alpha} (2\pi)^{\frac{1}{2}} \big( 1 + O(\alpha^{-1}) \big). $$

3.3. Large Deviation Principle and Laplace Principle
A rate (action) function is a lower semicontinuous function $I : S \to [0, +\infty]$. If $I$ has compact level sets, we call it good. We denote $I(A) := \inf_{x \in A} I(x)$.

Definition 3.2 (LDP). $\{X_n : n = 1, 2, \dots\}$ is said to satisfy the large deviation principle with rate function $I$ if and only if
(a) for each closed set $F \subset S$,
$$ \limsup_n \frac{1}{n} \log P(X_n \in F) \leq -I(F); $$
(b) for each open set $G \subset S$,
$$ \liminf_n \frac{1}{n} \log P(X_n \in G) \geq -I(G). $$

Definition 3.3. $\{X_n : n = 1, 2, \dots\}$ is said to satisfy the Laplace principle with rate function $I$ if
(a) for all $f \in C_b(S)$,
$$ \limsup_n \frac{1}{n} \log E[e^{n f(X_n)}] \leq \sup_{x \in S} \{f(x) - I(x)\}; $$
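The refined asymptotic can be sanity-checked against `math.gamma` (a small sketch of my own, not from the notes):

```python
import math

def stirling(alpha):
    """Leading-order Laplace approximation: sqrt(2 pi) * alpha^(alpha - 1/2) * e^(-alpha)."""
    return math.sqrt(2 * math.pi) * math.exp((alpha - 0.5) * math.log(alpha) - alpha)

for alpha in (5.0, 20.0, 100.0):
    print(alpha, math.gamma(alpha) / stirling(alpha))
# the ratio tends to 1 from above, with a 1 + O(1/alpha) correction
```

Working with `exp((alpha - 0.5) * log(alpha) - alpha)` rather than `alpha ** (alpha - 0.5) * exp(-alpha)` avoids overflow for moderately large `alpha`.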

(b) for each $f \in C_b(S)$,
$$ \liminf_n \frac{1}{n} \log E[e^{n f(X_n)}] \geq \sup_{x \in S} \{f(x) - I(x)\}. $$

Theorem 3.4. The Laplace principle is equivalent to the large deviation principle.

Proof. First, we prove that the large deviation principle implies the Laplace principle; this is due to Varadhan. Let
$$ F_{N,j} := \Big\{ x \in S : -\|f\| + \frac{j - 1}{N} 2\|f\| \leq f(x) \leq -\|f\| + \frac{j}{N} 2\|f\| \Big\}, $$
a closed set, and approximate $f$ from above by step functions:
$$ f_N(x) := \sum_{j=1}^N \Big( -\|f\| + \frac{j}{N} 2\|f\| \Big) \mathbf{1}(x \in F_{N,j}). $$
Note that the level sets of $f_N$ are closed. Therefore, by the large deviation upper bound,
$$ \limsup_n \frac{1}{n} \log E[e^{n f(X_n)}] \leq \limsup_n \frac{1}{n} \log E[e^{n f_N(X_n)}] \leq \max_{j=1,\dots,N} \Big\{ -\|f\| + \frac{j}{N} 2\|f\| - I(F_{N,j}) \Big\} \leq \max_{j=1,\dots,N} \sup_{x \in F_{N,j}} \{f(x) - I(x)\} + \frac{2\|f\|}{N} \leq \sup_{x \in S} \{f(x) - I(x)\} + \frac{2\|f\|}{N}. $$

Let $x_0 \in S$ and $\epsilon > 0$. Then $G := \{x : f(x) > f(x_0) - \epsilon\}$ is open, so by the large deviation lower bound,
$$ \liminf_n \frac{1}{n} \log E[e^{n f(X_n)}] \geq \liminf_n \frac{1}{n} \log E[\mathbf{1}(X_n \in G) e^{n f(X_n)}] \geq f(x_0) - \epsilon + \liminf_n \frac{1}{n} \log P(X_n \in G) \geq f(x_0) - \epsilon - I(G) \geq f(x_0) - I(x_0) - \epsilon. $$
The Laplace lower bound follows from the arbitrariness of $x_0 \in S$ and $\epsilon > 0$.

Next, we prove that the Laplace principle implies the large deviation principle. This seems to have been first realized by Dupuis and Ellis. ...
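A concrete instance of the upper and lower bounds is the coin-tossing case of Cramér's theorem. The sketch below (my illustration, not from the notes) computes the exact tail probability $P(S_n \geq a n)$ for a fair coin via `math.lgamma` and compares the empirical rate with $I(a) = a \log(2a) + (1 - a)\log(2(1 - a))$.

```python
import math

def log_binom_tail(n, a):
    """Exact log P(S_n >= a n) for S_n ~ Binomial(n, 1/2), via lgamma + log-sum-exp."""
    k0 = math.ceil(a * n)
    terms = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             - n * math.log(2) for k in range(k0, n + 1)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def rate(a):
    """Cramer rate function for a fair coin: I(a) = a log(2a) + (1-a) log(2(1-a))."""
    return a * math.log(2 * a) + (1 - a) * math.log(2 * (1 - a))

a = 0.7
for n in (100, 1000, 10000):
    print(n, -log_binom_tail(n, a) / n, rate(a))
# -(1/n) log P(S_n / n >= 0.7) decreases toward I(0.7)
```

Since the Chernoff bound $P(S_n \geq a n) \leq e^{-n I(a)}$ holds for every $n$, the finite-$n$ rate sits above $I(a)$ and only the lower bound needs the limit.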


LECTURE 4
Occupation Measure and Random Perturbation of ODEs

4.1. Outline of Lecture
The Donsker-Varadhan theory.
The Freidlin-Wentzell theory.


LECTURE 5
An HJB Equation Approach to Large Deviations of Markov Processes

5.1. Outline of Lecture
Martingale problems.
A nonlinear semigroup.
Hamilton-Jacobi-Bellman equations and viscosity solutions.
Convergence.
Variational problems through the view of optimal control.


LECTURE 6
Examples

6.1. Outline of Lecture
Examples: Freidlin-Wentzell, Donsker-Varadhan, multi-scale diffusions.
Applications to infinite dimensions: stochastic PDEs.
Another type of infinite dimensions: interacting particles.


LECTURE 7
Beyond Large Deviation

7.1. Outline of Lecture
Variational formulation of PDEs: compressible Euler equations.
Incompressible Navier-Stokes.
Lasry-Lions mean-field games.
Transition path theory.
An approach to large-time statistical structures of complex flows.


More information

Statistical Machine Learning Lectures 4: Variational Bayes

Statistical Machine Learning Lectures 4: Variational Bayes 1 / 29 Statistical Machine Learning Lectures 4: Variational Bayes Melih Kandemir Özyeğin University, İstanbul, Turkey 2 / 29 Synonyms Variational Bayes Variational Inference Variational Bayesian Inference

More information

SOLVABLE VARIATIONAL PROBLEMS IN N STATISTICAL MECHANICS

SOLVABLE VARIATIONAL PROBLEMS IN N STATISTICAL MECHANICS SOLVABLE VARIATIONAL PROBLEMS IN NON EQUILIBRIUM STATISTICAL MECHANICS University of L Aquila October 2013 Tullio Levi Civita Lecture 2013 Coauthors Lorenzo Bertini Alberto De Sole Alessandra Faggionato

More information

Large deviations and averaging for systems of slow fast stochastic reaction diffusion equations.

Large deviations and averaging for systems of slow fast stochastic reaction diffusion equations. Large deviations and averaging for systems of slow fast stochastic reaction diffusion equations. Wenqing Hu. 1 (Joint work with Michael Salins 2, Konstantinos Spiliopoulos 3.) 1. Department of Mathematics

More information

A relative entropy characterization of the growth rate of reward in risk-sensitive control

A relative entropy characterization of the growth rate of reward in risk-sensitive control 1 / 47 A relative entropy characterization of the growth rate of reward in risk-sensitive control Venkat Anantharam EECS Department, University of California, Berkeley (joint work with Vivek Borkar, IIT

More information

A D VA N C E D P R O B A B I L - I T Y

A D VA N C E D P R O B A B I L - I T Y A N D R E W T U L L O C H A D VA N C E D P R O B A B I L - I T Y T R I N I T Y C O L L E G E T H E U N I V E R S I T Y O F C A M B R I D G E Contents 1 Conditional Expectation 5 1.1 Discrete Case 6 1.2

More information

Connection to Branching Random Walk

Connection to Branching Random Walk Lecture 7 Connection to Branching Random Walk The aim of this lecture is to prepare the grounds for the proof of tightness of the maximum of the DGFF. We will begin with a recount of the so called Dekking-Host

More information

Contents: 1. Minimization. 2. The theorem of Lions-Stampacchia for variational inequalities. 3. Γ -Convergence. 4. Duality mapping.

Contents: 1. Minimization. 2. The theorem of Lions-Stampacchia for variational inequalities. 3. Γ -Convergence. 4. Duality mapping. Minimization Contents: 1. Minimization. 2. The theorem of Lions-Stampacchia for variational inequalities. 3. Γ -Convergence. 4. Duality mapping. 1 Minimization A Topological Result. Let S be a topological

More information

Information Theory and Predictability Lecture 6: Maximum Entropy Techniques

Information Theory and Predictability Lecture 6: Maximum Entropy Techniques Information Theory and Predictability Lecture 6: Maximum Entropy Techniques 1 Philosophy Often with random variables of high dimensional systems it is difficult to deduce the appropriate probability distribution

More information

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers.

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers. Chapter 3 Duality in Banach Space Modern optimization theory largely centers around the interplay of a normed vector space and its corresponding dual. The notion of duality is important for the following

More information

Other properties of M M 1

Other properties of M M 1 Other properties of M M 1 Přemysl Bejda premyslbejda@gmail.com 2012 Contents 1 Reflected Lévy Process 2 Time dependent properties of M M 1 3 Waiting times and queue disciplines in M M 1 Contents 1 Reflected

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence

Introduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence Introduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

Consistency of the maximum likelihood estimator for general hidden Markov models

Consistency of the maximum likelihood estimator for general hidden Markov models Consistency of the maximum likelihood estimator for general hidden Markov models Jimmy Olsson Centre for Mathematical Sciences Lund University Nordstat 2012 Umeå, Sweden Collaborators Hidden Markov models

More information

Week 6 Notes, Math 865, Tanveer

Week 6 Notes, Math 865, Tanveer Week 6 Notes, Math 865, Tanveer. Energy Methods for Euler and Navier-Stokes Equation We will consider this week basic energy estimates. These are estimates on the L 2 spatial norms of the solution u(x,

More information

From Boltzmann Equations to Gas Dynamics: From DiPerna-Lions to Leray

From Boltzmann Equations to Gas Dynamics: From DiPerna-Lions to Leray From Boltzmann Equations to Gas Dynamics: From DiPerna-Lions to Leray C. David Levermore Department of Mathematics and Institute for Physical Science and Technology University of Maryland, College Park

More information

Exercises Measure Theoretic Probability

Exercises Measure Theoretic Probability Exercises Measure Theoretic Probability 2002-2003 Week 1 1. Prove the folloing statements. (a) The intersection of an arbitrary family of d-systems is again a d- system. (b) The intersection of an arbitrary

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

The Way of Analysis. Robert S. Strichartz. Jones and Bartlett Publishers. Mathematics Department Cornell University Ithaca, New York

The Way of Analysis. Robert S. Strichartz. Jones and Bartlett Publishers. Mathematics Department Cornell University Ithaca, New York The Way of Analysis Robert S. Strichartz Mathematics Department Cornell University Ithaca, New York Jones and Bartlett Publishers Boston London Contents Preface xiii 1 Preliminaries 1 1.1 The Logic of

More information

LECTURE 15: COMPLETENESS AND CONVEXITY

LECTURE 15: COMPLETENESS AND CONVEXITY LECTURE 15: COMPLETENESS AND CONVEXITY 1. The Hopf-Rinow Theorem Recall that a Riemannian manifold (M, g) is called geodesically complete if the maximal defining interval of any geodesic is R. On the other

More information

Variational approach to mean field games with density constraints

Variational approach to mean field games with density constraints 1 / 18 Variational approach to mean field games with density constraints Alpár Richárd Mészáros LMO, Université Paris-Sud (based on ongoing joint works with F. Santambrogio, P. Cardaliaguet and F. J. Silva)

More information

Optimality Conditions for Constrained Optimization

Optimality Conditions for Constrained Optimization 72 CHAPTER 7 Optimality Conditions for Constrained Optimization 1. First Order Conditions In this section we consider first order optimality conditions for the constrained problem P : minimize f 0 (x)

More information

4 Expectation & the Lebesgue Theorems

4 Expectation & the Lebesgue Theorems STA 205: Probability & Measure Theory Robert L. Wolpert 4 Expectation & the Lebesgue Theorems Let X and {X n : n N} be random variables on a probability space (Ω,F,P). If X n (ω) X(ω) for each ω Ω, does

More information

Weak Convergence of Numerical Methods for Dynamical Systems and Optimal Control, and a relation with Large Deviations for Stochastic Equations

Weak Convergence of Numerical Methods for Dynamical Systems and Optimal Control, and a relation with Large Deviations for Stochastic Equations Weak Convergence of Numerical Methods for Dynamical Systems and, and a relation with Large Deviations for Stochastic Equations Mattias Sandberg KTH CSC 2010-10-21 Outline The error representation for weak

More information

Homework 1 Due: Thursday 2/5/2015. Instructions: Turn in your homework in class on Thursday 2/5/2015

Homework 1 Due: Thursday 2/5/2015. Instructions: Turn in your homework in class on Thursday 2/5/2015 10-704 Homework 1 Due: Thursday 2/5/2015 Instructions: Turn in your homework in class on Thursday 2/5/2015 1. Information Theory Basics and Inequalities C&T 2.47, 2.29 (a) A deck of n cards in order 1,

More information

(1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define

(1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define Homework, Real Analysis I, Fall, 2010. (1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define ρ(f, g) = 1 0 f(x) g(x) dx. Show that

More information

ELEMENTS OF PROBABILITY THEORY

ELEMENTS OF PROBABILITY THEORY ELEMENTS OF PROBABILITY THEORY Elements of Probability Theory A collection of subsets of a set Ω is called a σ algebra if it contains Ω and is closed under the operations of taking complements and countable

More information

Constrained Optimization Theory

Constrained Optimization Theory Constrained Optimization Theory Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. IMA, August 2016 Stephen Wright (UW-Madison) Constrained Optimization Theory IMA, August

More information

Reaction-Diffusion Equations In Narrow Tubes and Wave Front P

Reaction-Diffusion Equations In Narrow Tubes and Wave Front P Outlines Reaction-Diffusion Equations In Narrow Tubes and Wave Front Propagation University of Maryland, College Park USA Outline of Part I Outlines Real Life Examples Description of the Problem and Main

More information

6. Brownian Motion. Q(A) = P [ ω : x(, ω) A )

6. Brownian Motion. Q(A) = P [ ω : x(, ω) A ) 6. Brownian Motion. stochastic process can be thought of in one of many equivalent ways. We can begin with an underlying probability space (Ω, Σ, P) and a real valued stochastic process can be defined

More information

Robust control and applications in economic theory

Robust control and applications in economic theory Robust control and applications in economic theory In honour of Professor Emeritus Grigoris Kalogeropoulos on the occasion of his retirement A. N. Yannacopoulos Department of Statistics AUEB 24 May 2013

More information

Notes on Large Deviations in Economics and Finance. Noah Williams

Notes on Large Deviations in Economics and Finance. Noah Williams Notes on Large Deviations in Economics and Finance Noah Williams Princeton University and NBER http://www.princeton.edu/ noahw Notes on Large Deviations 1 Introduction What is large deviation theory? Loosely:

More information

Lattice spin models: Crash course

Lattice spin models: Crash course Chapter 1 Lattice spin models: Crash course 1.1 Basic setup Here we will discuss the basic setup of the models to which we will direct our attention throughout this course. The basic ingredients are as

More information

MAT 135B Midterm 1 Solutions

MAT 135B Midterm 1 Solutions MAT 35B Midterm Solutions Last Name (PRINT): First Name (PRINT): Student ID #: Section: Instructions:. Do not open your test until you are told to begin. 2. Use a pen to print your name in the spaces above.

More information