An inverse of Sanov's theorem

Ayalvadi Ganesh and Neil O'Connell

BRIMS, Hewlett-Packard Labs, Bristol

Abstract

Let $X_k$ be a sequence of iid random variables taking values in a finite set, and consider the problem of estimating the law of $X_1$ in a Bayesian framework. We prove that the sequence of posterior distributions satisfies a large deviation principle, and give an explicit expression for the rate function. As an application, we obtain an asymptotic formula for the predictive probability of ruin in the classical gambler's ruin problem.

1 Introduction and preliminaries

Let $X$ be a Hausdorff topological space with Borel $\sigma$-algebra $\mathcal{B}$, and let $\mu_n$ be a sequence of probability measures on $(X, \mathcal{B})$. A rate function is a non-negative lower semicontinuous function on $X$. We say that the sequence $\mu_n$ satisfies the large deviation principle (LDP) with rate function $I$ if, for all $B \in \mathcal{B}$,
\[
-\inf_{x \in B^\circ} I(x) \le \liminf_{n\to\infty} \frac{1}{n} \log \mu_n(B) \le \limsup_{n\to\infty} \frac{1}{n} \log \mu_n(B) \le -\inf_{x \in \bar{B}} I(x).
\]
Here $B^\circ$ and $\bar{B}$ denote the interior and closure of $B$, respectively.

Let $\Omega$ be a finite set and denote by $\mathcal{M}_1(\Omega)$ the space of probability measures on $\Omega$. Consider a sequence of independent random variables $X_k$ taking values
in $\Omega$, with common law $\mu \in \mathcal{M}_1(\Omega)$. Denote by $L_n$ the empirical measure corresponding to the first $n$ observations:
\[
L_n = \frac{1}{n} \sum_{k=1}^{n} \delta_{X_k},
\]
where $\delta_{X_k}$ denotes unit mass at $X_k$. We denote the law of $L_n$ by $\mathcal{L}(L_n)$.

For $\nu \in \mathcal{M}_1(\Omega)$, define its relative entropy (relative to $\mu$) by
\[
H(\nu|\mu) = \begin{cases} \displaystyle\int_\Omega \frac{d\nu}{d\mu} \log \frac{d\nu}{d\mu} \, d\mu & \nu \ll \mu, \\ \infty & \text{otherwise.} \end{cases}
\]
The statement of Sanov's theorem is that the sequence $\mathcal{L}(L_n)$ satisfies the LDP in $\mathcal{M}_1(\Omega)$ with rate function $H(\,\cdot\,|\mu)$.

In this paper we present an inverse of this result, which arises naturally in a Bayesian setting. The underlying distribution (of the $X_k$'s) is unknown, and has a prior distribution $\pi \in \mathcal{M}_1(\mathcal{M}_1(\Omega))$. The posterior distribution, given the first $n$ observations, is a function of the empirical measure $L_n$ and will be denoted by $\pi_n(L_n)$. We prove that, on the set $\{L_n \to \mu\}$, for any fixed $\mu$ in the support of the prior, the sequence $\pi_n(L_n)$ satisfies the LDP in $\mathcal{M}_1(\Omega)$ with rate function given by $H(\mu|\,\cdot\,)$ on the support of the prior (and infinite off it). Note that the roles played by the arguments of the relative entropy function are interchanged. This result can be understood intuitively as follows: in Sanov's theorem we ask how likely the empirical distribution is to be close to $\nu$, given that the true distribution is $\mu$, whereas in the Bayesian context we ask how likely it is that the true distribution is close to $\nu$, given that the empirical distribution is close to $\mu$. We regard these questions as natural inverses of each other. As an application, we obtain an asymptotic formula for the predictive probability of ruin in the classical gambler's ruin problem.
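Since the relative entropy $H(\nu|\mu)$ is the central quantity throughout, a minimal numerical sketch may help fix ideas. The Python snippet below is our illustration, not part of the original paper; the function name and the example distributions are arbitrary choices. It computes $H(\nu|\mu)$ on a finite set, with the conventions $0 \log 0 = 0$ and $H(\nu|\mu) = \infty$ when $\nu \not\ll \mu$, and exhibits the asymmetry of the two arguments on which the inverse result turns.

```python
import numpy as np

def relative_entropy(nu, mu):
    """H(nu|mu) = sum_x nu(x) log(nu(x)/mu(x)) on a finite set.

    Conventions: 0 log 0 = 0, and H(nu|mu) = +inf unless nu << mu.
    """
    nu, mu = np.asarray(nu, dtype=float), np.asarray(mu, dtype=float)
    if np.any((mu == 0) & (nu > 0)):   # nu not absolutely continuous w.r.t. mu
        return np.inf
    m = nu > 0
    return float(np.sum(nu[m] * np.log(nu[m] / mu[m])))

# The two arguments play different roles: H(nu|mu) != H(mu|nu) in general.
mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
print(relative_entropy(nu, mu))  # rate appearing in Sanov's theorem
print(relative_entropy(mu, nu))  # rate appearing in the inverse (posterior) LDP
```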
Admittedly, the assumption that $\Omega$ is finite is very strong. However, it is the only case where the result can be stated without making additional assumptions about the prior. To see that this is a delicate issue, note that, since $H(\mu|\mu) = 0$, the LDP implies consistency of the posterior distribution: it was shown by Freedman [5] that Bayes estimates can be inconsistent even on countable $\Omega$ and even when the true distribution is in the support of the prior; moreover, the sufficient conditions for consistency which exist in the literature are quite disparate and in general far from necessary. The problem of extending our result would seem to be an interesting (and formidable!) challenge for future research. We have recently extended the result to the case where $\Omega$ is a compact metric space, under the condition that the prior is of Dirichlet form [8].

There is a considerable literature on the asymptotic behaviour of Bayes posteriors; see, for example, Le Cam [11, 12], Freedman [5], Lindley [13], Schwartz [14], Johnson [9, 10], Brenner et al. [1], Chen [2], Diaconis and Freedman [4] and Fu and Kass [6]. Freedman, and Diaconis and Freedman, study conditions under which the Bayes estimate (the mean of the posterior distribution) is consistent. Le Cam, Freedman, Brenner et al. and Chen establish asymptotic normality of the posterior distribution under different conditions on the parameter space and the prior. Schwartz and Johnson study asymptotic expansions of the posterior distribution in powers of $n^{-1/2}$, with a normal distribution as the leading term. Fu and Kass establish an LDP for the posterior distribution of the parameter in a one-dimensional parametric family, under certain regularity conditions.
2 The LDP

Let $\Omega$ be a finite set, and let $\mathcal{M}_1(\Omega)$ denote the space of probability measures on $\Omega$. Suppose $X_1, X_2, \ldots$ are i.i.d. $\Omega$-valued random variables. Let $\pi \in \mathcal{M}_1(\mathcal{M}_1(\Omega))$ denote the prior distribution on the space $\mathcal{M}_1(\Omega)$, with support denoted by $\operatorname{supp} \pi$. For each $n$, set
\[
\mathcal{M}_n(\Omega) = \Bigl\{ \frac{1}{n} \sum_{i=1}^n \delta_{x_i} : x \in \Omega^n \Bigr\}.
\]
Define a mapping $\pi_n : \mathcal{M}_n(\Omega) \to \mathcal{M}_1(\mathcal{M}_1(\Omega))$ by its Radon-Nikodym derivative with respect to $\pi$:
\[
\frac{d\pi_n(\mu_n)}{d\pi}(\nu) = \frac{\prod_{x \in \Omega} \nu(x)^{n\mu_n(x)}}{\displaystyle\int_{\mathcal{M}_1(\Omega)} \prod_{x \in \Omega} \lambda(x)^{n\mu_n(x)} \, \pi(d\lambda)}, \tag{1}
\]
if the denominator is non-zero; $\pi_n$ is undefined at $\mu_n$ if the denominator is zero. When defined, $\pi_n(\mu_n)$ denotes the posterior distribution, conditional on the observations $X_1, \ldots, X_n$ having empirical distribution $\mu_n \in \mathcal{M}_n(\Omega)$. (This follows immediately from Bayes' formula; there are no measurability concerns here since $\Omega$ is finite.)

Theorem 1 Suppose $(x_i) \in \Omega^{\mathbb{N}}$ is such that the sequence $\mu_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$ converges weakly to $\mu \in \operatorname{supp} \pi$, and that, for every $x \in \Omega$, $\mu(x) = 0$ implies $\mu_n(x) = 0$ for all $n$. Then the sequence of laws $\pi_n(\mu_n)$ satisfies a large deviation principle (LDP) with good rate function
\[
I(\nu) = \begin{cases} H(\mu|\nu), & \text{if } \nu \in \operatorname{supp} \pi, \\ \infty, & \text{else.} \end{cases}
\]
The rate function $I(\cdot)$ is convex if $\operatorname{supp} \pi$ is convex.
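Before turning to the proof, the theorem can be checked numerically. The sketch below is ours, not from the paper; the grid prior and all variable names are illustrative assumptions. It implements formula (1) for a prior supported on a finite grid of laws on a three-point set, and compares $-\frac{1}{n} \log \pi_n(\mu_n)(\{\nu\})$ with $H(\mu_n|\nu)$ at a grid point $\nu$; the two agree approximately, up to a correction that shrinks as $n$ grows and the grid refines.

```python
import numpy as np

rng = np.random.default_rng(0)

# Prior: uniform over a finite grid of laws on a 3-point set, so supp(pi) is the grid.
grid = np.array([[a, b, 1.0 - a - b]
                 for a in np.arange(0.05, 0.95, 0.05)
                 for b in np.arange(0.05, 0.95, 0.05)
                 if a + b <= 0.95])
log_prior = np.full(len(grid), -np.log(len(grid)))

mu = np.array([0.5, 0.3, 0.2])       # true law, chosen to lie on the grid
n = 5000
counts = rng.multinomial(n, mu)      # counts = n * mu_n, the sufficient statistic
mu_n = counts / n

# Formula (1): posterior log-weights log pi_n(mu_n)({lambda}) over the grid.
log_lik = counts @ np.log(grid).T    # sum_x n mu_n(x) log lambda(x), per grid point
log_post = log_prior + log_lik
m = log_post.max()
log_post -= m + np.log(np.exp(log_post - m).sum())   # normalise via log-sum-exp

def relent(rho, lam):                # H(rho|lam) for strictly positive lam
    msk = rho > 0
    return float(np.sum(rho[msk] * np.log(rho[msk] / lam[msk])))

nu = grid[7]                         # an arbitrary grid point away from mu
print(-log_post[7] / n)              # empirical decay rate of the posterior at nu
print(relent(mu_n, nu))              # theoretical rate H(mu_n|nu), close to H(mu|nu)
```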
Proof: Observe that
\[
\frac{1}{n} \log \prod_{x \in \Omega} \lambda(x)^{n\mu_n(x)}
= \sum_{x \in \Omega} \mu_n(x) \log \mu_n(x) - \sum_{x \in \Omega} \mu_n(x) \log \frac{\mu_n(x)}{\lambda(x)}
= -H(\mu_n) - H(\mu_n|\lambda) \le -H(\mu_n).
\]
Here, $H(\mu_n) = -\sum_x \mu_n(x) \log \mu_n(x)$ denotes the entropy of $\mu_n$. The last inequality follows from the fact that relative entropy is non-negative. It follows, since $\pi$ is a probability measure on $\mathcal{M}_1(\Omega)$, that
\[
\int_{\mathcal{M}_1(\Omega)} \prod_x \lambda(x)^{n\mu_n(x)} \, \pi(d\lambda) \le \exp(-nH(\mu_n)).
\]
Thus,
\[
\limsup_{n\to\infty} \frac{1}{n} \log \int_{\mathcal{M}_1(\Omega)} \prod_x \lambda(x)^{n\mu_n(x)} \, \pi(d\lambda) \le \limsup_{n\to\infty} \bigl( -H(\mu_n) \bigr) = -H(\mu); \tag{2}
\]
here we are using the fact that $H(\cdot)$ is continuous.

Next, since $\mu_n$ converges to $\mu \in \operatorname{supp} \pi$, we have that for all $\epsilon > 0$, $\pi(B(\mu, \epsilon)) > 0$ and $\mu_n \in B(\mu, \epsilon)$ for all $n$ sufficiently large. (Here, $B(\mu, \epsilon)$ denotes the set of probability distributions on $\Omega$ that are within $\epsilon$ of $\mu$ in total variation distance; note that this generates the weak topology since $\Omega$ is finite.) Therefore,
\[
\frac{1}{n} \log \int_{\mathcal{M}_1(\Omega)} \prod_x \lambda(x)^{n\mu_n(x)} \, \pi(d\lambda)
\ge \frac{1}{n} \log \int_{B(\mu,\epsilon)} \prod_x \lambda(x)^{n\mu_n(x)} \, \pi(d\lambda)
\]
\[
\ge \frac{1}{n} \log \pi\bigl(B(\mu, \epsilon)\bigr) + \sum_x \mu_n(x) \log \mu_n(x) - \sup_{\lambda \in B(\mu,\epsilon)} \sum_{x :\, \mu(x) > 0} \mu_n(x) \log \frac{\mu_n(x)}{\lambda(x)}.
\]
To obtain the last inequality, we have used the assumption that, if $\mu(x) = 0$, then $\mu_n(x) = 0$ for all $n$. We also use the convention that $0 \log 0 = 0$. Since $\pi(B(\mu, \epsilon)) > 0$, it follows from the above, again using the continuity of $H(\cdot)$,
that
\[
\liminf_{n\to\infty} \frac{1}{n} \log \int_{\mathcal{M}_1(\Omega)} \prod_x \lambda(x)^{n\mu_n(x)} \, \pi(d\lambda)
\ge -H(\mu) - \sup_{\lambda, \rho \in B(\mu,\epsilon)} \sum_{x :\, \mu(x) > 0} \rho(x) \log \frac{\rho(x)}{\lambda(x)}. \tag{3}
\]
Letting $\epsilon \to 0$, we get from (2) and (3) that
\[
\lim_{n\to\infty} \frac{1}{n} \log \int_{\mathcal{M}_1(\Omega)} \prod_x \lambda(x)^{n\mu_n(x)} \, \pi(d\lambda) = -H(\mu). \tag{4}
\]

LD upper bound: Let $F$ be an arbitrary closed subset of $\mathcal{M}_1(\Omega)$. If $\pi(F) = 0$, then $\pi_n(\mu_n)(F) = 0$ for all $n$ and the LD upper bound,
\[
\limsup_{n\to\infty} \frac{1}{n} \log \pi_n(\mu_n)(F) \le -\inf_{\nu \in F} I(\nu),
\]
holds trivially. Otherwise, if $\pi(F) > 0$, observe from (1) and (4) that
\[
\limsup_{n\to\infty} \frac{1}{n} \log \pi_n(\mu_n)(F)
= H(\mu) + \limsup_{n\to\infty} \frac{1}{n} \log \int_F \prod_x \lambda(x)^{n\mu_n(x)} \, \pi(d\lambda)
\]
\[
\le H(\mu) + \limsup_{n\to\infty} \Bigl[ \frac{1}{n} \log \pi(F) + \sup_{\lambda \in F \cap \operatorname{supp} \pi} \sum_x \mu_n(x) \log \lambda(x) \Bigr]
\]
\[
= H(\mu) + \limsup_{n\to\infty} \sup_{\lambda \in F \cap \operatorname{supp} \pi} \bigl[ -H(\mu_n) - H(\mu_n|\lambda) \bigr]
\]
\[
\le H(\mu) + \sup_{\rho \in B(\mu,\delta)} \sup_{\lambda \in F \cap \operatorname{supp} \pi} \bigl[ -H(\rho) - H(\rho|\lambda) \bigr] \quad \text{for all } \delta > 0,
\]
where the last inequality is because $\mu_n$ converges to $\mu$. Letting $\delta \to 0$, we get from the continuity of $H(\cdot)$ and $H(\cdot|\cdot)$ that
\[
\limsup_{n\to\infty} \frac{1}{n} \log \pi_n(\mu_n)(F) \le -\inf_{\lambda \in F \cap \operatorname{supp} \pi} H(\mu|\lambda) = -\inf_{\lambda \in F} I(\lambda), \tag{5}
\]
where the last equality follows from the definition of $I$ in the statement of the theorem. This completes the proof of the large deviations upper bound.

LD lower bound: Fix $\nu \in \operatorname{supp} \pi$, and let $B(\nu, \epsilon)$ denote the set of probability distributions on $\Omega$ that are within $\epsilon$ of $\nu$ in total variation. Then,
π(b(ν, ɛ)) > 0 for any ɛ > 0. δ (0, ɛ), Using () and (4) we thus have, for all ( ) n log πn (µ n ) B(ν, ɛ) H(µ) n B(ν,δ) log λ(x) nµn(x) π(dλ) ( ) n log π B(ν, δ) + inf µ n (x) log λ(x) λ B(ν,δ) inf inf [ H(ρ) H(ρ λ)]. ρ B(µ,δ) λ B(ν,δ) Letting δ 0, we have by the continuity of H( ) and H( ) that ( ) n log πn (µ n ) B(ν, ɛ) H(µ ν) = I(ν), (6) where the last equality is because we took ν to be in supp π. Let G be an arbitrary open subset of M (Ω). If G supp π is empty, then, by the definition of I in the statement of the theorem, inf ν G I(ν) =. Therefore, the large deviations lower bound, n log πn (µ n )(G) inf ν G I(ν), holds trivially. On the other hand, if G supp π isn t empty, then we can pick ν G such that I(ν) is arbitrarily close to inf λ G I(λ), and ɛ > 0 such ( ) that B(ν, ɛ) G. So π n (µ n )(G) π n (µ n ) B(ν, ɛ) for all n, and it follows from (6) that n log πn (µ n )(G) inf λ G I(λ). This completes the proof of the large deviations lower bound, and of the theorem. 7
3 Application to the gambler's ruin problem

Suppose now that $\Omega$ is a finite subset of $\mathbb{R}$. As before, $X_k$ is a sequence of iid random variables with common law $\mu \in \mathcal{M}_1(\Omega)$, and we are interested in level-crossing probabilities for the random walk $S_n = X_1 + \cdots + X_n$. For $Q > 0$, denote by $R(Q, \mu)$ the probability that the walk ever exceeds the level $Q$. If a gambler has initial capital $Q$, and loses amount $X_k$ on the $k$th bet, then $R(Q, \mu)$ is the probability of ultimate ruin. If the underlying distribution $\mu$ is unknown, the gambler may wish to assess this probability based on experience: this leads to a predictive probability of ruin, given by the formula
\[
P_n(Q, \mu_n) = \int R(Q, \lambda) \, \pi_n(d\lambda),
\]
where, as before, $\mu_n$ is the empirical distribution of the first $n$ observations and $\pi_n \equiv \pi_n(\mu_n)$ is the posterior distribution as defined in equation (1). A standard refinement of Wald's approximation yields
\[
C \exp(-\delta(\mu) Q) \le R(Q, \mu) \le \exp(-\delta(\mu) Q),
\]
for some $C > 0$, where
\[
\delta(\mu) = \sup \Bigl\{ \theta \ge 0 : \int e^{\theta x} \, \mu(dx) \le 1 \Bigr\}.
\]
Thus,
\[
C \int \exp(-\delta(\lambda) Q) \, \pi_n(d\lambda) \le P_n(Q, \mu_n) \le \int \exp(-\delta(\lambda) Q) \, \pi_n(d\lambda),
\]
and we can apply Varadhan's lemma (see, for example, [3, Theorem 4.3.1]) to obtain the asymptotic formula, for $q > 0$,
\[
\lim_{n\to\infty} \frac{1}{n} \log P_n(qn, \mu_n) = -\inf \{ H(\mu|\nu) + \delta(\nu) q : \nu \in \operatorname{supp} \pi \},
\]
on the set $\{\mu_n \to \mu\}$. We are also assuming, as in Theorem 1, that $\mu_n(x) = 0$, for all $n$, whenever $\mu(x) = 0$, and using the easy fact ($\Omega$ is finite) that $\delta : \mathcal{M}_1(\Omega) \to \mathbb{R}_+$ is continuous. This formula can be simplified in special cases. Its implications for risk and network management are discussed in Ganesh et al. [7].
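The variational formula above is easy to evaluate numerically. The sketch below is ours; the unit-bet example and all names are illustrative assumptions. It computes the Lundberg exponent $\delta(\nu)$ by bisection and minimises $H(\mu|\nu) + q\,\delta(\nu)$ over a grid standing in for $\operatorname{supp} \pi$, for $\Omega = \{-1, +1\}$.

```python
import numpy as np

omega = np.array([-1.0, 1.0])        # unit bets: the gambler loses X_k each round
mu = np.array([0.6, 0.4])            # true law: P(X = -1) = 0.6, P(X = +1) = 0.4
q = 0.5                              # initial capital Q = qn

def delta(nu, hi=50.0):
    """Lundberg exponent sup{theta >= 0 : E_nu[e^{theta X}] <= 1}, by bisection."""
    f = lambda t: float(nu @ np.exp(t * omega)) - 1.0
    if f(hi) <= 0.0:                 # drift so negative that the root exceeds hi
        return hi
    lo = 0.0                         # invariant: f(lo) <= 0 < f(hi)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) <= 0.0 else (lo, mid)
    return lo

def relent(rho, lam):                # H(rho|lam), with +inf off the support of lam
    m = rho > 0
    if np.any(lam[m] == 0):
        return np.inf
    return float(np.sum(rho[m] * np.log(rho[m] / lam[m])))

# Minimise H(mu|nu) + q*delta(nu) over a grid standing in for supp(pi).
ps = np.linspace(0.01, 0.99, 99)     # candidate laws nu = (p, 1 - p)
rate = min(relent(mu, np.array([p, 1.0 - p])) + q * delta(np.array([p, 1.0 - p]))
           for p in ps)
print(rate)  # approximates -lim (1/n) log P_n(qn, mu_n)
```

For $\mu = (0.6, 0.4)$ one can check by hand that $\delta(\mu) = \log(0.6/0.4)$, so the minimum is at most $q \log 1.5 \approx 0.203$; the grid search then accounts for laws $\nu \ne \mu$ trading entropy cost against a smaller exponent.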
References

[1] D. Brenner, D. A. S. Fraser and P. McDunnough, On asymptotic normality of likelihood and conditional analysis, Canad. J. Statist., 10: 163–172, 1983.

[2] C. F. Chen, On asymptotic normality of limiting density functions with Bayesian implications, J. Roy. Statist. Soc. Ser. B, 47(3): 540–546, 1985.

[3] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications, Jones and Bartlett, 1993.

[4] Persi Diaconis and David Freedman, On the consistency of Bayes estimates, Ann. Statist., 14(1): 1–26, 1986.

[5] David A. Freedman, On the asymptotic behavior of Bayes estimates in the discrete case, Ann. Math. Statist., 34: 1386–1403, 1963.

[6] J. C. Fu and R. E. Kass, The exponential rates of convergence of posterior distributions, Ann. Inst. Statist. Math., 40(4): 683–691, 1988.

[7] Ayalvadi Ganesh, Peter Green, Neil O'Connell and Susan Pitts, Bayesian network management. To appear in Queueing Systems.
[8] A. J. Ganesh and Neil O'Connell, A large deviations principle for Dirichlet process posteriors, Hewlett-Packard Technical Report, HPL-BRIMS-98-05, 1998.

[9] R. A. Johnson, An asymptotic expansion for posterior distributions, Ann. Math. Statist., 38: 1899–1907, 1967.

[10] R. A. Johnson, Asymptotic expansions associated with posterior distributions, Ann. Math. Statist., 41: 851–864, 1970.

[11] L. Le Cam, On some asymptotic properties of maximum likelihood estimates and related Bayes estimates, Univ. Calif. Publ. Statist., 1: 277–330, 1953.

[12] L. Le Cam, Locally asymptotically normal families of distributions, Univ. Calif. Publ. Statist., 3: 37–98, 1957.

[13] D. V. Lindley, Introduction to Probability and Statistics from a Bayesian Viewpoint, Cambridge University Press, 1965.

[14] L. Schwartz, On Bayes procedures, Z. Wahrsch. Verw. Gebiete, 4: 10–26, 1965.