A Note On Large Deviation Theory and Beyond

Jin Feng

In this set of notes, we will develop and explain a whole mathematical theory which can be highly summarized through one simple observation:

    \lim_{n \to +\infty} \frac{1}{n} \log\left( e^{na} + e^{nb} \right) = a \vee b.

Staring at the above identity for a moment, if you are sufficiently over-sensitive, you discover that two subjects of mathematics are shouting at you: on the left-hand side, you see summation and hence probability theory; on the right-hand side, you see maximization and hence the calculus of variations. Large deviation theory, an abstract framework which makes the above simple observation rigorous and extensive, has brought profound impacts to mathematics as well as to physics and engineering...

Copyright © Jin Feng
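As a quick numerical sanity check of this identity (my own addition, not part of the original notes; the values of a and b below are arbitrary), one can watch the normalized log-sum converge to the maximum:

```python
import numpy as np

a, b = 0.3, 1.7   # arbitrary choices; the limit should be max(a, b) = 1.7
for n in [1, 10, 100, 1000, 10000]:
    # logaddexp computes log(e^{na} + e^{nb}) in a numerically stable way
    print(n, np.logaddexp(n * a, n * b) / n)
```

The sum of two exponentials is dominated by the larger exponent; this is exactly the mechanism the theory below formalizes.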

LECTURE 1
Sanov Theorem, from the point of view of Boltzmann

1.1. Outline of Lecture 1
- Sanov theorem, the mathematics
- Why did Boltzmann care: the concept of entropy, and an elementary proof of Sanov via Stirling's formula
- Gibbs conditioning and maximum entropy principles

1.2. Sanov theorem, the abstract setup

Let
(a) (S, d) be a complete separable metric space;
(b) {X_i : i = 1, 2, ...} be i.i.d. S-valued random variables with probability law γ(dx) := P(X_1 ∈ dx);
(c) μ_n be the measure-valued random variable

    \mu_n(dx) := \frac{1}{n} \sum_{i=1}^n \delta_{X_i}(dx) \in P(S).

Define

    S(\rho \mid \gamma) := \int_S \log \frac{d\rho}{d\gamma} \, d\rho.

Let P(S) be given the weak convergence topology with a compatible metric. Then

Theorem 1.1. For each ρ ∈ P(S),

    \lim_{\epsilon \to 0+} \lim_{n \to \infty} \frac{1}{n} \log P\big( \mu_n \in B(\rho; \epsilon) \big) = - S(\rho \mid \gamma).

1.3. Boltzmann in 1877

Why did Boltzmann care? The setting of a discrete ideal gas:
(a) S := {x_1, x_2, ..., x_m};
(b) P(S) = {γ := (γ_1, ..., γ_m) : Σ_{k=1}^m γ_k = 1, γ_k ≥ 0};
(c) μ_n = (μ_n(x_1), ..., μ_n(x_m)), where μ_n(x) = (1/n) Σ_{i=1}^n δ_{X_i}({x}), x ∈ S.

μ_n is a model for the shape of a thin gas. Think about why.

Theorem 1.2 (Boltzmann).

    P(\mu_n \approx \rho) \approx \exp\{ -n S(\rho \mid \gamma) \}.

Proof.

    P(\mu_n \approx \rho) = P\Big( (\mu_n(x_1), \dots, \mu_n(x_m)) \approx \tfrac{1}{n}(n\rho_1, \dots, n\rho_m) \Big)
      = P\Big( (\#\{i : X_i = x_1\}, \dots, \#\{i : X_i = x_m\}) \approx (n\rho_1, \dots, n\rho_m) \Big)
      = \frac{n!}{(n\rho_1)! \cdots (n\rho_m)!} \, \gamma_1^{n\rho_1} \cdots \gamma_m^{n\rho_m}.

By Stirling's formula (we will revisit this issue using the Gamma function in the second lecture),

    \log(k!) = k \log k - k + O(\log k),

so that

    \frac{1}{n} \log P(\mu_n \approx \rho)
      = \log n - 1 + O\Big(\frac{\log n}{n}\Big) - \sum_i \rho_i \log(n\rho_i) + \sum_i \rho_i + \sum_i O\Big(\frac{\log(n\rho_i)}{n}\Big) + \sum_i \rho_i \log \gamma_i
      = \sum_i \big( -\rho_i \log \rho_i + \rho_i \log \gamma_i \big) + O\Big(\frac{\log n}{n}\Big)
      = - S(\rho \mid \gamma) + O\Big(\frac{\log n}{n}\Big).

Definition 1.3 (Relative Entropy).

    S(\rho \mid \gamma) := \int_S \log\Big( \frac{d\rho}{d\gamma} \Big)\, d\rho.
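As a small illustration (my own addition, not from the notes), the exact multinomial probability appearing in Boltzmann's computation can be compared with exp{−n S(ρ|γ)} on a three-point alphabet; the specific choices of γ and ρ below are arbitrary:

```python
import math

gamma = [0.5, 0.3, 0.2]          # reference law γ on a three-point alphabet
rho   = [0.2, 0.3, 0.5]          # target empirical law ρ

def rel_entropy(rho, gamma):
    return sum(r * math.log(r / g) for r, g in zip(rho, gamma) if r > 0)

def neg_log_prob_over_n(n):
    # exact -(1/n) log P(mu_n = rho), assuming the counts n*rho_i are integers
    counts = [round(n * r) for r in rho]
    log_p = math.lgamma(n + 1)
    for k, g in zip(counts, gamma):
        log_p += -math.lgamma(k + 1) + k * math.log(g)
    return -log_p / n

for n in [10, 100, 1000, 10000]:
    print(n, neg_log_prob_over_n(n), rel_entropy(rho, gamma))
# the first column approaches S(rho | gamma) with an O(log n / n) error
```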

From the above, we know that μ_n → γ in probability (the law of large numbers). Indeed, we know more than just that.

Lemma 1.4. Let A be a set whose closure Ā does not contain γ. Then, with

    I(A) := \inf_{\rho \in \bar{A}} S(\rho \mid \gamma) > 0,

we have

    P(\mu_n \in A) \le C e^{-n I(A)} \to 0 \qquad \text{as } n \to \infty.

What does the distribution of (X_1, ..., X_K) converge to? The following is a complicated way to answer this extremely simple question. First of all, the limit is a product measure (why?). Note that

    \langle f, \mu_n \rangle := \int f \, d\mu_n = \frac{1}{n} \sum_{i=1}^n f(X_i),

so by the identical distribution property and by the above lemma,

    E[f(X_1)] = E[\langle f, \mu_n \rangle] \to \int f \, d\gamma

(indeed, the last limit holds as an equality without taking the limit). Hence the answer is γ^{⊗K} := γ ⊗ ⋯ ⊗ γ.

1.4. Maximum entropy and Gibbs conditioning

Problem. Suppose that we made one observation regarding the samples {X_1, ..., X_n}. Knowing such a priori information, how does it change the setup and the conclusion of the above Sanov theorem?

For instance, suppose that h is a function on S and we observe

    H_n := \frac{1}{n}\big( h(X_1) + \dots + h(X_n) \big) = \int_S h(x) \, \mu_n(dx) =: f(\mu_n).

What is

    \lim_{n\to\infty} \frac{1}{n} \log P(\mu_n \approx \rho \mid H_n \approx e) = \; ?

Note that H_n := f(μ_n) is a function of μ_n, and the event {H_n ≈ e} = {μ_n ∈ f^{-1}(e)}. Therefore we arrive at a more general question, which is answered by the following

Theorem 1.5 (Gibbs conditioning principle).

    \lim_{n\to\infty} \frac{1}{n} \log P(\mu_n \in A \mid \mu_n \in B) = - I(A \cap B) + I(B).

Proof. By the Sanov theorem, for a large class of sets A ⊂ P(S),

    \lim_{n\to\infty} \frac{1}{n} \log P(\mu_n \in A) = - I(A).

Therefore

    \frac{1}{n} \log P(\mu_n \in A \mid \mu_n \in B)
      = \frac{1}{n} \log P(\mu_n \in A \cap B) - \frac{1}{n} \log P(\mu_n \in B)
      \to - I(A \cap B) + I(B)
      = - \inf_{\rho \in A \cap B} S(\rho \mid \gamma) + \inf_{\rho \in B} S(\rho \mid \gamma).

What is the most likely state for μ_n under the conditional probability P(μ_n ∈ · | μ_n ∈ B)?

Theorem 1.6 (Maximum entropy principle). Suppose that ρ* is the unique minimizer such that

    S(\rho_* \mid \gamma) = \inf_{\rho \in B} S(\rho \mid \gamma).

Then

    \lim_{n\to\infty} P(d(\mu_n, \rho_*) > \delta \mid \mu_n \in B) = 0, \qquad \forall \delta > 0.

Proof. Let

    A := \{ \rho : d(\rho, \rho_*) > \delta \}.

Then

    M := \inf_{\rho \in B \cap \{d(\rho, \rho_*) > \delta\}} S(\rho \mid \gamma) - \inf_{\rho \in B} S(\rho \mid \gamma) > 0.

Hence

    P(d(\mu_n, \rho_*) > \delta \mid \mu_n \in B) \approx e^{-nM} \to 0.

We now consider the special case {μ_n ∈ B} := {H_n ≈ e}. By the maximum entropy principle, we would like to optimize S(ρ|γ) under the constraint ⟨h, ρ⟩ = e. By the Lagrange multiplier method, we optimize the function

    F(\rho, \beta) := S(\rho \mid \gamma) - \beta\big( \langle h, \rho \rangle - e \big) = S(\rho \mid \gamma_\beta) - \log Z_\beta,

where the parametrized probability measure is

    \gamma_\beta(dx) := \frac{1}{Z_\beta} e^{\beta (h(x) - e)} \gamma(dx), \qquad Z_\beta := \int_S e^{\beta(h(x)-e)} \, \gamma(dx),

and the constant log Z_β does not affect the minimization over ρ.

From ∂F/∂ρ_i = 0 (taking into account the normalization constraint Σ_i ρ_i = 1), we get

    \log \rho_i - \log \gamma_i - \beta^* h(x_i) = \text{const}.

That is,

    \rho_i := \frac{ e^{\beta^* h(x_i)} \gamma_i }{ \sum_j e^{\beta^* h(x_j)} \gamma_j } = \gamma^{\beta^*}_i,

where β* is chosen so that

(1.7)    \langle h, \rho \rangle = \int_S h \, d\gamma_{\beta^*} = e.

For people familiar with the advanced statistical theory of estimation and inference, one recognizes the exponential family connection. Therefore, it is natural to introduce the pressure function (to be discussed more extensively in the next lecture)

    \Lambda(\beta) := \log \int e^{\beta h(x)} \gamma(dx).

It can be verified that

    \Lambda'(\beta) = \frac{ \sum_i h(x_i) e^{\beta h(x_i)} \gamma_i }{ \sum_j e^{\beta h(x_j)} \gamma_j } = \int_S h(x) \, \gamma_\beta(dx),

and that Λ''(β) > 0.

Corollary 1.8 (Macro-state). The most likely "macro"-state is

    \rho_i := \frac{ e^{\beta^* h(x_i)} \gamma_i }{ Z_{\beta^*} },

with β* uniquely determined by

    \Lambda'(\beta^*) = e.

Next, we derive lim_{n→∞} P(X_1 ∈ · | H_n ≈ e). As in the Sanov case, by symmetry,

    E[f(X_1) \mid H_n \approx e] = E[\langle f, \mu_n \rangle \mid H_n \approx e].

By the law of large numbers,

    \lim_{n\to\infty} E[f(X_1) \mid H_n \approx e] = \int_S f \, d\gamma_{\beta^*}.

By the de Finetti theorem (review the concept of exchangeability),

    P(X_1 \in dx_1, \dots, X_n \in dx_n \mid H_n) = \prod_{i=1}^n P(X_i \in dx_i \mid H_n).

Therefore

Corollary 1.9. For each fixed K,

    \lim_{n\to\infty} P\big( (X_1, \dots, X_K) \in \cdot \mid H_n \approx e \big) = (\gamma_{\beta^*})^{\otimes K}.
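To make Corollary 1.8 concrete, here is a small numerical sketch (my own addition; the alphabet, γ, h and the observed mean e are arbitrary choices) that solves Λ'(β*) = e and forms the tilted measure γ_{β*}:

```python
import numpy as np
from scipy.optimize import brentq

gamma = np.array([0.5, 0.3, 0.2])   # reference law γ on a three-point alphabet
h     = np.array([0.0, 1.0, 2.0])   # observable h
e     = 1.2                         # observed mean, inside (min h, max h)

def Lambda_prime(beta):
    # Λ'(β) = ∫ h dγ_β, where γ_β is proportional to e^{βh} γ
    w = gamma * np.exp(beta * h)
    return np.sum(h * w) / np.sum(w)

beta_star = brentq(lambda b: Lambda_prime(b) - e, -50.0, 50.0)
gamma_star = gamma * np.exp(beta_star * h)
gamma_star /= gamma_star.sum()       # the most likely macro-state γ_{β*}
print(beta_star, gamma_star, gamma_star @ h)   # the last entry reproduces e
```

Because Λ''(β) > 0, Λ' is increasing and the root β* is unique, which is what makes this bracketing search well posed.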

LECTURE 2
Free Energy and Entropy, à la Gibbs

2.1. Outline of Lecture 2
- A duality between free energy and entropy
- Properties of relative entropy

2.2. An entropy-free energy (pressure) duality

The Gibbs conditioning principle tells us the following. We start with a model X_1, ..., X_n ~ γ (i.i.d.). We make observations based on H_n := (1/n) Σ_i h(X_i). Conditioning on the fact that we saw H_n, we should update our prior belief: the underlying measure should be

    d\gamma_{\beta, h} := \frac{1}{Z_{\beta, h}} e^{\beta h} \, d\gamma

for some constant β. This is essentially a Bayes theorem. Since β and h always come together, we will just set β = 1 and write the renormalized new reference measure (a Gibbs measure)

    d\gamma_h := \frac{e^h}{Z_h} \, d\gamma

with normalizing (partition) constant

    Z_h := \int e^h \, d\gamma.

Let h ∈ C_b(S). The log-partition functional

    \Lambda(h) := \log Z_h = \log \int e^h \, d\gamma

plays a key role as a kind of dual functional to the entropy. We first observe that

    S(\rho \mid \gamma) - \langle h, \rho \rangle
      = \int \log \frac{d\rho}{d\gamma_h} \, d\rho - \log Z_h
      = S(\rho \mid \gamma_h) - \log Z_h
      = S(\rho \mid \gamma_h) - \Lambda(h).

We have the following infinite-dimensional version of the Legendre-Fenchel transform.

Theorem 2.1 (Lanford-Varadhan).

    S(\rho \mid \gamma) = \sup_{h \in C_b(S)} \{ \langle h, \rho \rangle - \Lambda(h) \},
    \Lambda(h) = \sup_{\rho \in P(S)} \{ \langle h, \rho \rangle - S(\rho \mid \gamma) \}.

The supremum in the second identity is uniquely attained at γ_h.

Proof. Since

    S(\rho \mid \gamma) + \Lambda(h) = \langle h, \rho \rangle + S(\rho \mid \gamma_h) \ge \langle h, \rho \rangle,

and since ρ = γ_h is the only solution of S(ρ|γ_h) = 0, the conclusions follow.

2.3. Properties of entropy

Lemma 2.2. S(·|γ) : P(S) → [0, +∞] is convex.

Proof. This is because

    S(\rho \mid \gamma) = \int_S \frac{d\rho}{d\gamma} \log \frac{d\rho}{d\gamma} \, d\gamma,

and the function r ↦ r log r is convex. The nonnegativity follows from Jensen's inequality.

Review the concept of, and give examples of, semicontinuous functions.

Lemma 2.3. Let f_α(·) : S → R be lower semicontinuous for each fixed α ∈ Λ. Then

    f(x) := \sup_{\alpha \in \Lambda} f_\alpha(x)

is still lower semicontinuous.

Proof. This is a consequence of

(2.4)    \{ x : f(x) \le c \} = \bigcap_{\alpha \in \Lambda} \{ x : f_\alpha(x) \le c \}.

Lemma 2.5. S(·|·) : P(S) × P(S) → [0, +∞] is lower semicontinuous in the weak convergence topology.

Proof. This follows from the variational representation in Theorem 2.1, together with Lemma 2.3.
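The Legendre-Fenchel duality of Theorem 2.1 is easy to probe numerically on a finite alphabet. The sketch below (my own addition, with arbitrary choices of γ and ρ) checks that every h gives a lower bound ⟨h, ρ⟩ − Λ(h) ≤ S(ρ|γ) and that h = log(dρ/dγ) attains the supremum:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = np.array([0.5, 0.3, 0.2])
rho   = np.array([0.2, 0.3, 0.5])

def Lambda(h):                       # Λ(h) = log ∫ e^h dγ
    return np.log(np.sum(gamma * np.exp(h)))

def dual_pairing(h):                 # ⟨h, ρ⟩ − Λ(h)
    return rho @ h - Lambda(h)

S = np.sum(rho * np.log(rho / gamma))          # relative entropy S(ρ | γ)

print(S, dual_pairing(np.log(rho / gamma)))    # the optimal h attains S(ρ | γ)
print(all(dual_pairing(rng.normal(size=3)) <= S + 1e-12
          for _ in range(10_000)))             # every h gives a lower bound
```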

LECTURE 3
Large Deviation, General Theory

3.1. Outline of Lecture 3
- Laplace lemma
- Large deviation principle, Laplace principle and related notions
- Exponential tightness
- Rate functions and techniques for identifying them
- The situation of stochastic processes

3.2. Laplace lemma

We will make sense of an infinite-dimensional generalization of

    \lim_{n \to \infty} \Big( \int_0^1 e^{-n f(x)} \, dx \Big)^{1/n} = \exp\{ - \min_{0 \le x \le 1} f(x) \}

and its far-reaching impacts on physical applications.

Lemma 3.1 (Laplace Lemma). For f ∈ C_b(S) and μ ∈ M_b(S),

    \frac{1}{n} \log \int_S e^{n f(z)} \, \mu(dz) \to \sup_{z \in \mathrm{supp}(\mu)} f(z) \qquad \text{as } n \to \infty.

Proof. Take-home exercise.

As an application, we prove the Stirling formula for the Gamma function

    \Gamma(\alpha) := \int_0^\infty x^{\alpha - 1} e^{-x} \, dx.

Note that Γ(n) = (n−1)!. We are interested in the behavior of Γ(α) as α → ∞.

By the change of variable x = αy, and by using the Laplace lemma,

    \Gamma(\alpha) = \alpha^{\alpha} \int_0^\infty \frac{1}{y} \exp\{ -\alpha (y - \log y) \} \, dy \approx e^{\alpha \log \alpha - \alpha},

since min_{y>0} {y − log y} = 1 − log 1 = 1. To be more precise,

    \lim_{\alpha \to \infty} \Big( \frac{\Gamma(\alpha)}{\alpha^{\alpha}} \Big)^{1/\alpha} = e^{-1}.

The special case α = n + 1 gives the well-known Stirling formula

    n! \approx e^{n \log n - n}.

Indeed, if we are more careful, we have the next-order expansion around the stationary point y_0 = 1,

    y - \log y = 1 + \tfrac{1}{2} (y - 1)^2 + O((y - 1)^3),

and by Gaussian integral properties,

    \Gamma(\alpha) = \alpha^{\alpha - \frac{1}{2}} e^{-\alpha} (2\pi)^{\frac{1}{2}} \big( 1 + O(\alpha^{-1}) \big).

3.3. Large Deviation Principle and Laplace Principle

A rate (action) function is a function I : S → [0, +∞] which is lower semicontinuous. If I has compact level sets, we call it good. We denote I(A) := inf_{x ∈ A} I(x).

Definition 3.2 (LDP). {X_n : n = 1, 2, ...} is said to satisfy the large deviation principle with rate function I if and only if
(a) for each closed set F ⊂ S,

    \limsup_{n \to \infty} \frac{1}{n} \log P(X_n \in F) \le - I(F);

(b) for each open set G ⊂ S,

    \liminf_{n \to \infty} \frac{1}{n} \log P(X_n \in G) \ge - I(G).

Definition 3.3. {X_n : n = 1, 2, ...} is said to satisfy the Laplace principle with rate function I if
(a) for all f ∈ C_b(S),

    \limsup_{n \to \infty} \frac{1}{n} \log E[e^{n f(X_n)}] \le \sup_{x \in S} \{ f(x) - I(x) \};

(b) for each f ∈ C_b(S),

    \liminf_{n \to \infty} \frac{1}{n} \log E[e^{n f(X_n)}] \ge \sup_{x \in S} \{ f(x) - I(x) \}.

Theorem 3.4. The Laplace principle is equivalent to the large deviation principle.

Proof. First, we prove that the large deviation principle implies the Laplace principle. This was due to Varadhan. Define the closed sets

    F_{N,j} := \Big\{ x \in S : -\|f\| + (j - 1) \frac{2}{N} \|f\| \le f(x) \le -\|f\| + j \frac{2}{N} \|f\| \Big\},

and approximate f from above by step functions:

    f_N(x) := \sum_{j=1}^N \Big( -\|f\| + j \frac{2}{N} \|f\| \Big) \mathbf{1}(x \in F_{N,j}).

Note that the level sets of f_N are closed. Therefore, by the large deviation upper bound,

    \limsup_{n} \frac{1}{n} \log E[e^{n f(X_n)}]
      \le \limsup_{n} \frac{1}{n} \log E[e^{n f_N(X_n)}]
      \le \max_{j=1,\dots,N} \Big\{ -\|f\| + j \frac{2}{N} \|f\| - I(F_{N,j}) \Big\}
      \le \max_{j=1,\dots,N} \sup_{x \in F_{N,j}} \{ f(x) - I(x) \} + \frac{2}{N} \|f\|
      \le \sup_{x \in S} \{ f(x) - I(x) \} + \frac{2}{N} \|f\|.

Let x_0 ∈ S and ε > 0. Then G := {x : f(x) > f(x_0) − ε} is open, and by the large deviation lower bound,

    \liminf_{n} \frac{1}{n} \log E[e^{n f(X_n)}]
      \ge \liminf_{n} \frac{1}{n} \log E[\mathbf{1}(X_n \in G) e^{n f(X_n)}]
      \ge f(x_0) - \epsilon + \liminf_{n} \frac{1}{n} \log P(X_n \in G)
      \ge f(x_0) - \epsilon - I(G)
      \ge f(x_0) - I(x_0) - \epsilon.

The Laplace lower bound follows from the arbitrariness of x_0 ∈ S and ε > 0.

Next, we prove that the Laplace principle implies the large deviation principle. This seems to have been first realized by Dupuis and Ellis. ...
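As a quick numerical companion to the Laplace principle (my own addition, using the standard Gaussian example X_n ~ N(0, 1/n) with rate function I(x) = x²/2, which is not worked out in the notes), one can evaluate (1/n) log E[e^{n f(X_n)}] by quadrature and compare it with sup_x {f(x) − I(x)}:

```python
import numpy as np

f = lambda x: np.tanh(x)                     # a bounded test function

xs = np.linspace(-10.0, 10.0, 200001)
dx = xs[1] - xs[0]
target = np.max(f(xs) - xs**2 / 2)           # sup_x { f(x) - I(x) }

for n in [10, 100, 1000]:
    # log of the N(0, 1/n) density on the grid
    log_density = 0.5 * np.log(n / (2 * np.pi)) - n * xs**2 / 2
    log_integrand = n * f(xs) + log_density
    m = log_integrand.max()                  # shift for numerical stability
    log_E = m + np.log(np.sum(np.exp(log_integrand - m)) * dx)
    print(n, log_E / n, target)              # first column approaches the target
```

Quadrature is used instead of plain Monte Carlo because the expectation is dominated by exponentially rare samples; making Monte Carlo work here is precisely an importance sampling problem.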

LECTURE 4
Occupation Measure and Random Perturbation of ODEs

4.1. Outline of Lecture 4
- The Donsker-Varadhan theory
- The Freidlin-Wentzell theory

LECTURE 5
An HJB equation approach to large deviations of Markov processes

5.1. Outline of Lecture 5
- Martingale problems
- A nonlinear semigroup
- Hamilton-Jacobi-Bellman equations and viscosity solutions
- Convergence
- Variational problems through the view of optimal control

LECTURE 6
Examples

6.1. Outline of Lecture 6
- Examples: Freidlin-Wentzell, Donsker-Varadhan, multi-scale diffusions
- Applications to infinite dimensions: stochastic PDEs
- Another type of infinite dimensions: interacting particles

LECTURE 7
Beyond Large Deviation

7.1. Outline of Lecture 7
- Variational formulation of PDEs: compressible Euler equations
- Incompressible Navier-Stokes
- Lasry-Lions Mean-Field Games
- Transition Path Theory
- An approach to large-time statistical structures of complex flows