An inverse of Sanov's theorem


Ayalvadi Ganesh and Neil O'Connell
BRIMS, Hewlett-Packard Labs, Bristol

Abstract

Let (X_k) be a sequence of iid random variables taking values in a finite set, and consider the problem of estimating the law of X_1 in a Bayesian framework. We prove that the sequence of posterior distributions satisfies a large deviation principle, and give an explicit expression for the rate function. As an application, we obtain an asymptotic formula for the predictive probability of ruin in the classical gambler's ruin problem.

1 Introduction and preliminaries

Let X be a Hausdorff topological space with Borel σ-algebra B, and let (µ_n) be a sequence of probability measures on (X, B). A rate function is a non-negative lower semicontinuous function on X. We say that the sequence (µ_n) satisfies the large deviation principle (LDP) with rate function I if, for all B ∈ B,

    -\inf_{x \in B^\circ} I(x) \le \liminf_{n\to\infty} \frac{1}{n} \log \mu_n(B) \le \limsup_{n\to\infty} \frac{1}{n} \log \mu_n(B) \le -\inf_{x \in \bar{B}} I(x).

Here B° and B̄ denote the interior and closure of B, respectively.

Let Ω be a finite set and denote by M_1(Ω) the space of probability measures on Ω. Consider a sequence of independent random variables (X_k) taking values

in Ω, with common law µ ∈ M_1(Ω). Denote by L_n the empirical measure corresponding to the first n observations:

    L_n = \frac{1}{n} \sum_{k=1}^{n} \delta_{X_k},

where δ_{X_k} denotes unit mass at X_k. We denote the law of L_n by L(L_n). For ν ∈ M_1(Ω), define its relative entropy (relative to µ) by

    H(\nu \mid \mu) = \begin{cases} \int_\Omega \frac{d\nu}{d\mu} \log \frac{d\nu}{d\mu} \, d\mu, & \text{if } \nu \ll \mu, \\ \infty, & \text{otherwise.} \end{cases}

The statement of Sanov's theorem is that the sequence L(L_n) satisfies the LDP in M_1(Ω) with rate function H(· | µ). In this paper we present an inverse of this result, which arises naturally in a Bayesian setting. The underlying distribution (of the X_k's) is unknown, and has a prior distribution π ∈ M_1(M_1(Ω)). The posterior distribution, given the first n observations, is a function of the empirical measure L_n and will be denoted by π_n(L_n). We prove that, on the set {L_n → µ}, for any fixed µ in the support of the prior, the sequence π_n(L_n) satisfies the LDP in M_1(Ω) with rate function given by H(µ | ·) on the support of the prior (otherwise it is infinite). Note that the roles played by the arguments of the relative entropy function are interchanged. This result can be understood intuitively as follows: in Sanov's theorem we ask how likely the empirical distribution is to be close to ν, given that the true distribution is µ, whereas in the Bayesian context we ask how likely it is that the true distribution is close to ν, given that the empirical distribution is close to µ. We regard these questions as natural inverses of each other. As an application, we obtain an asymptotic formula for the predictive probability of ruin in the classical gambler's ruin problem.
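Sanov's direction can be checked numerically on a two-point alphabet, since the probability that L_n hits a fixed type is an explicit multinomial expression. The sketch below is a minimal illustration (the specific choices µ = (1/2, 1/2) and ν = (0.2, 0.8) are ours, not the paper's): it computes H(ν | µ) and the exact probability P(L_n = ν), and confirms that -(1/n) log P(L_n = ν) approaches H(ν | µ).

```python
import math

def rel_entropy(nu, mu):
    """H(nu | mu) on a finite set: sum of nu(x) log(nu(x)/mu(x)); 0 log 0 = 0."""
    h = 0.0
    for x in nu:
        if nu[x] > 0:
            if mu.get(x, 0) == 0:
                return math.inf  # nu not absolutely continuous w.r.t. mu
            h += nu[x] * math.log(nu[x] / mu[x])
    return h

# Exact check on Omega = {0, 1}:
# P(L_n = nu) = C(n, k) mu(1)^k mu(0)^(n-k), where k = n * nu(1).
mu = {0: 0.5, 1: 0.5}
nu = {0: 0.2, 1: 0.8}
for n in (100, 1000, 10000):
    k = round(n * nu[1])
    log_p = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + k * math.log(mu[1]) + (n - k) * math.log(mu[0]))
    print(n, -log_p / n, rel_entropy(nu, mu))  # first value approaches the second
```

The gap between the two printed values decays like (log n)/n, the polynomial correction absorbed by the (1/n) log scaling in the LDP.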

Admittedly, the assumption that Ω is finite is very strong. However, it is the only case where the result can be stated without making additional assumptions about the prior. To see that this is a delicate issue, note that, since H(µ | µ) = 0, the LDP implies consistency of the posterior distribution: it was shown by Freedman [5] that Bayes estimates can be inconsistent even on countable Ω and even when the true distribution is in the support of the prior; moreover, the sufficient conditions for consistency which exist in the literature are quite disparate and in general far from being necessary. The problem of extending our result would seem to be an interesting (and formidable!) challenge for future research. We have recently extended the result to the case where Ω is a compact metric space, under the condition that the prior is of Dirichlet form [8].

There is a considerable literature on the asymptotic behaviour of Bayes posteriors; see, for example, Le Cam [11, 12], Freedman [5], Lindley [13], Schwartz [14], Johnson [9, 10], Brenner et al. [1], Chen [2], Diaconis and Freedman [4] and Fu and Kass [6]. Freedman, and Diaconis and Freedman, study conditions under which the Bayes estimate (the mean of the posterior distribution) is consistent. Le Cam, Freedman, Brenner et al. and Chen establish asymptotic normality of the posterior distribution under different conditions on the parameter space and the prior. Schwartz and Johnson study asymptotic expansions of the posterior distribution in powers of n^{-1/2}, having a normal distribution as the leading term. Fu and Kass establish an LDP for the posterior distribution of the parameter in a one-dimensional parametric family, under certain regularity conditions.

2 The LDP

Let Ω be a finite set, and let M_1(Ω) denote the space of probability measures on Ω. Suppose X_1, X_2, ... are i.i.d. Ω-valued random variables. Let π ∈ M_1(M_1(Ω)) denote the prior distribution on the space M_1(Ω), with support denoted by supp π. For each n, set

    M_n(\Omega) = \left\{ \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i} : x \in \Omega^n \right\}.

Define a mapping π_n : M_n(Ω) → M_1(M_1(Ω)) by its Radon-Nikodym derivative on the support of π:

    \frac{d\pi_n(\mu_n)}{d\pi}(\nu) = \frac{\prod_{x \in \Omega} \nu(x)^{n\mu_n(x)}}{\int_{M_1(\Omega)} \prod_{x \in \Omega} \lambda(x)^{n\mu_n(x)} \, \pi(d\lambda)},    (1)

if the denominator is non-zero; π_n is undefined at µ_n if the denominator is zero. When defined, π_n(µ_n) denotes the posterior distribution, conditional on the observations X_1, ..., X_n having empirical distribution µ_n ∈ M_n(Ω). (This follows immediately from Bayes' formula; there are no measurability concerns here since Ω is finite.)

Theorem 1 Suppose x ∈ Ω^ℕ is such that the sequence µ_n = (1/n) Σ_{i=1}^n δ_{x_i} converges weakly to µ ∈ supp π and that µ(x) = 0 implies that µ_n(x) = 0 for all n. Then the sequence of laws π_n(µ_n) satisfies a large deviation principle (LDP) with good rate function

    I(\nu) = \begin{cases} H(\mu \mid \nu), & \text{if } \nu \in \operatorname{supp}\pi, \\ \infty, & \text{otherwise.} \end{cases}

The rate function I(·) is convex if supp π is convex.
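When the prior has finite support, the integral in (1) becomes a finite sum and Theorem 1 can be sanity-checked directly. The sketch below uses an illustrative setup of our own choosing (not from the paper): Ω = {0, 1}, true law µ = (1/2, 1/2), a uniform prior on the nine points ν_p = (1-p, p) for p = 0.1, ..., 0.9, and µ_n = µ exactly. It verifies that -(1/n) log π_n(µ_n)({ν_p}) is close to H(µ | ν_p), the rate function of Theorem 1.

```python
import math

# Illustrative setup (our assumption): Omega = {0,1}, mu = (1/2, 1/2),
# uniform prior on the grid nu_p = (1-p, p), p = 0.1, ..., 0.9.
grid = [i / 10 for i in range(1, 10)]
mu1 = 0.5                     # true P(X = 1)
n = 2000
counts1 = int(n * mu1)        # empirical measure mu_n equals mu exactly

def log_lik(p):
    # log of prod_x nu_p(x)^(n mu_n(x)), the numerator exponent in (1)
    return counts1 * math.log(p) + (n - counts1) * math.log(1 - p)

# log of the normalizing sum in (1), computed stably (log-sum-exp)
m = max(log_lik(p) for p in grid)
z = m + math.log(sum(math.exp(log_lik(p) - m) for p in grid))

def posterior_rate(p):
    # -(1/n) log pi_n(mu_n)({nu_p}); Theorem 1 predicts H(mu | nu_p)
    return -(log_lik(p) - z) / n

def H(q, p):
    # relative entropy H(mu | nu_p) on {0,1}, with mu = (1-q, q)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

print(posterior_rate(0.3), H(mu1, 0.3))  # nearly identical
print(posterior_rate(0.5))               # near 0, since H(mu | mu) = 0
```

Note the interchange of arguments: the decay rate of the posterior mass at ν_p is H(µ | ν_p), not H(ν_p | µ) as in Sanov's theorem.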

Proof: Observe that

    \frac{1}{n} \log \prod_{x} \lambda(x)^{n\mu_n(x)} = \sum_x \mu_n(x) \log \mu_n(x) - \sum_x \mu_n(x) \log \frac{\mu_n(x)}{\lambda(x)} = -H(\mu_n) - H(\mu_n \mid \lambda) \le -H(\mu_n).

Here, H(µ_n) = -Σ_x µ_n(x) log µ_n(x) denotes the entropy of µ_n. The last inequality follows from the fact that relative entropy is non-negative. It follows, since π is a probability measure on M_1(Ω), that

    \int_{M_1(\Omega)} \prod_x \lambda(x)^{n\mu_n(x)} \, \pi(d\lambda) \le \exp(-nH(\mu_n)).

Thus,

    \limsup_{n\to\infty} \frac{1}{n} \log \int_{M_1(\Omega)} \prod_x \lambda(x)^{n\mu_n(x)} \, \pi(d\lambda) \le \limsup_{n\to\infty} \, (-H(\mu_n)) = -H(\mu);    (2)

here we are using the fact that H(·) is continuous.

Next, since µ_n converges to µ ∈ supp π, we have that for all ε > 0, π(B(µ, ε)) > 0 and µ_n ∈ B(µ, ε) for all n sufficiently large. (Here, B(µ, ε) denotes the set of probability distributions on Ω that are within ε of µ in total variation distance; note that this generates the weak topology since Ω is finite.) Therefore,

    \frac{1}{n} \log \int_{M_1(\Omega)} \prod_x \lambda(x)^{n\mu_n(x)} \, \pi(d\lambda) \ge \frac{1}{n} \log \int_{B(\mu,\varepsilon)} \prod_x \lambda(x)^{n\mu_n(x)} \, \pi(d\lambda)
    \ge \frac{1}{n} \log \pi(B(\mu,\varepsilon)) + \sum_x \mu_n(x) \log \mu_n(x) - \sup_{\lambda \in B(\mu,\varepsilon)} \sum_{x : \mu(x) > 0} \mu_n(x) \log \frac{\mu_n(x)}{\lambda(x)}.

To obtain the last inequality, we have used the assumption that, if µ(x) = 0, then µ_n(x) = 0 for all n. We also use the convention that 0 log 0 = 0. Since π(B(µ, ε)) > 0, it follows from the above, again using the continuity of H(·),

that

    \liminf_{n\to\infty} \frac{1}{n} \log \int_{M_1(\Omega)} \prod_x \lambda(x)^{n\mu_n(x)} \, \pi(d\lambda) \ge -H(\mu) - \sup_{\lambda, \rho \in B(\mu,\varepsilon)} \sum_{x : \mu(x) > 0} \rho(x) \log \frac{\rho(x)}{\lambda(x)}.    (3)

Letting ε → 0, we get from (2) and (3) that

    \lim_{n\to\infty} \frac{1}{n} \log \int_{M_1(\Omega)} \prod_x \lambda(x)^{n\mu_n(x)} \, \pi(d\lambda) = -H(\mu).    (4)

LD upper bound: Let F be an arbitrary closed subset of M_1(Ω). If π(F) = 0, then π_n(µ_n)(F) = 0 for all n and the LD upper bound,

    \limsup_{n\to\infty} \frac{1}{n} \log \pi_n(\mu_n)(F) \le -\inf_{\nu \in F} I(\nu),

holds trivially. Otherwise, if π(F) > 0, observe from (1) and (4) that

    \limsup_{n\to\infty} \frac{1}{n} \log \pi_n(\mu_n)(F) = H(\mu) + \limsup_{n\to\infty} \frac{1}{n} \log \int_F \prod_x \lambda(x)^{n\mu_n(x)} \, \pi(d\lambda)
    \le H(\mu) + \limsup_{n\to\infty} \frac{1}{n} \log \pi(F) + \limsup_{n\to\infty} \sup_{\lambda \in F \cap \operatorname{supp}\pi} \sum_x \mu_n(x) \log \lambda(x)
    = H(\mu) + \limsup_{n\to\infty} \sup_{\lambda \in F \cap \operatorname{supp}\pi} [-H(\mu_n) - H(\mu_n \mid \lambda)]
    \le H(\mu) + \sup_{\rho \in B(\mu,\delta)} \sup_{\lambda \in F \cap \operatorname{supp}\pi} [-H(\rho) - H(\rho \mid \lambda)] \qquad \forall \delta > 0,

where the last inequality is because µ_n converges to µ. Letting δ → 0, we get from the continuity of H(·) and H(· | ·) that

    \limsup_{n\to\infty} \frac{1}{n} \log \pi_n(\mu_n)(F) \le -\inf_{\lambda \in F \cap \operatorname{supp}\pi} H(\mu \mid \lambda) = -\inf_{\lambda \in F} I(\lambda),    (5)

where the last equality follows from the definition of I in the statement of the theorem. This completes the proof of the large deviations upper bound.

LD lower bound: Fix ν ∈ supp π, and let B(ν, ε) denote the set of probability distributions on Ω that are within ε of ν in total variation. Then,

π(b(ν, ɛ)) > 0 for any ɛ > 0. δ (0, ɛ), Using () and (4) we thus have, for all ( ) n log πn (µ n ) B(ν, ɛ) H(µ) n B(ν,δ) log λ(x) nµn(x) π(dλ) ( ) n log π B(ν, δ) + inf µ n (x) log λ(x) λ B(ν,δ) inf inf [ H(ρ) H(ρ λ)]. ρ B(µ,δ) λ B(ν,δ) Letting δ 0, we have by the continuity of H( ) and H( ) that ( ) n log πn (µ n ) B(ν, ɛ) H(µ ν) = I(ν), (6) where the last equality is because we took ν to be in supp π. Let G be an arbitrary open subset of M (Ω). If G supp π is empty, then, by the definition of I in the statement of the theorem, inf ν G I(ν) =. Therefore, the large deviations lower bound, n log πn (µ n )(G) inf ν G I(ν), holds trivially. On the other hand, if G supp π isn t empty, then we can pick ν G such that I(ν) is arbitrarily close to inf λ G I(λ), and ɛ > 0 such ( ) that B(ν, ɛ) G. So π n (µ n )(G) π n (µ n ) B(ν, ɛ) for all n, and it follows from (6) that n log πn (µ n )(G) inf λ G I(λ). This completes the proof of the large deviations lower bound, and of the theorem. 7

3 Application to the gambler's ruin problem

Suppose now that Ω is a finite subset of ℝ. As before, (X_k) is a sequence of iid random variables with common law µ ∈ M_1(Ω), and we are interested in level-crossing probabilities for the random walk S_n = X_1 + ... + X_n. For Q > 0, denote by R(Q, µ) the probability that the walk ever exceeds the level Q. If a gambler has initial capital Q, and loses amount X_k on the k-th bet, then R(Q, µ) is the probability of ultimate ruin. If the underlying distribution µ is unknown, the gambler may wish to assess this probability based on experience: this leads to a predictive probability of ruin, given by the formula

    P_n(Q, \mu_n) = \int R(Q, \lambda) \, \pi_n(d\lambda),

where, as before, µ_n is the empirical distribution of the first n observations and π_n ≡ π_n(µ_n) is the posterior distribution as defined in equation (1). A standard refinement of Wald's approximation yields

    C \exp(-\delta(\mu) Q) \le R(Q, \mu) \le \exp(-\delta(\mu) Q),

for some C > 0, where

    \delta(\mu) = \sup\Big\{\theta \ge 0 : \int e^{\theta x} \, \mu(dx) \le 1\Big\}.

Thus,

    \int C \exp(-\delta(\lambda) Q) \, \pi_n(d\lambda) \le P_n(Q, \mu_n) \le \int \exp(-\delta(\lambda) Q) \, \pi_n(d\lambda),

and we can apply Varadhan's lemma (see, for example, [3, Theorem 4.3.1]) to obtain the asymptotic formula, for q > 0,

    \lim_{n\to\infty} \frac{1}{n} \log P_n(qn, \mu_n) = -\inf\{H(\mu \mid \nu) + \delta(\nu) q : \nu \in \operatorname{supp}\pi\},

on the set {µ_n → µ}. We are also assuming, as in Theorem 1, that µ_n(x) = 0, for all n, whenever µ(x) = 0, and using the easy (Ω is finite) fact that δ : M_1(Ω) → ℝ_+ is continuous. This formula can be simplified in special cases. Its implications for risk and network management are discussed in Ganesh et al. [7].

References

[1] D. Brenner, D. A. S. Fraser and P. McDunnough, On asymptotic normality of likelihood and conditional analysis, Canad. J. Statist., 10: 63-72, 1983.

[2] C. F. Chen, On asymptotic normality of limiting density functions with Bayesian implications, J. Roy. Statist. Soc. Ser. B, 47(3): 540-546, 1985.

[3] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications, Jones and Bartlett, 1993.

[4] Persi Diaconis and David Freedman, On the consistency of Bayes estimates, Ann. Statist., 14(1): 1-26, 1986.

[5] David A. Freedman, On the asymptotic behavior of Bayes estimates in the discrete case, Ann. Math. Statist., 34: 1386-1403, 1963.

[6] J. C. Fu and R. E. Kass, The exponential rates of convergence of posterior distributions, Ann. Inst. Statist. Math., 40(4): 683-691, 1988.

[7] Ayalvadi Ganesh, Peter Green, Neil O'Connell and Susan Pitts, Bayesian network management. To appear in Queueing Systems.

[8] A. J. Ganesh and Neil O'Connell, A large deviations principle for Dirichlet process posteriors, Hewlett-Packard Technical Report, HPL-BRIMS-98-05, 1998.

[9] R. A. Johnson, An asymptotic expansion for posterior distributions, Ann. Math. Statist., 38: 1899-1907, 1967.

[10] R. A. Johnson, Asymptotic expansions associated with posterior distributions, Ann. Math. Statist., 41: 851-864, 1970.

[11] L. Le Cam, On some asymptotic properties of maximum likelihood estimates and related Bayes estimates, Univ. Calif. Publ. Statist., 1: 227-330, 1953.

[12] L. Le Cam, Locally asymptotically normal families of distributions, Univ. Calif. Publ. Statist., 3: 37-98, 1957.

[13] D. V. Lindley, Introduction to Probability and Statistics from a Bayesian Viewpoint, Cambridge University Press, 1965.

[14] L. Schwartz, On Bayes procedures, Z. Wahrsch. Verw. Gebiete, 4: 10-26, 1965.