A large-deviation principle for Dirichlet posteriors

Size: px

Start display at page:

Download "A large-deviation principle for Dirichlet posteriors"

Mervin Woods
6 years ago
Views:

1 Bernoulli 6(6), 2000, 02±034 A large-deviation principle for Dirichlet posteriors AYALVADI J. GANESH and NEIL O'CONNELL 2 Microsoft Research, Guildhall Street, Cambridge CB2 3NH, UK. ajg@microsoft.com 2 BRIMS, Hewlett-Packard Laboratories, Filton Road, Bristol BS2 6QZ, UK. noc@hplb.hpl.hp.com Let X k be a sequence of independent and identically distributed random variables taking values in a compact metric space, and consider the problem of estimating the law of X in a Bayesian framework. A conjugate family of priors for nonparametric Bayesian inference is the Dirichlet process priors popularized by Ferguson. We prove that if the prior distribution is Dirichlet, then the sequence of posterior distributions satis es a large-deviation principle, and give an explicit expression for the rate function. As an application, we obtain an asymptotic formula for the predictive probability of ruin in the classical gambler's ruin problem. Keywords: asymptotics; Bayesian nonparametrics; Dirichlet process; large deviations. Introduction Let X be a Hausdorff topological space with Borel ó -algebra B, and let ì n be a sequence of probability measures on (X, B ). A rate function is a non-negative lower semicontinuous function on X. We say that the sequence ì n satis es the large-deviation principle (LDP) with rate function I, if for all B 2 B, inf I(x) < lim inf x2b n n log ì n(b) < lim sup n n log ì n(b) < inf x2b Let be a complete, separable metric space (Polish space) and denote by M () the space of probability measures on. Consider a sequence of independent random variables X k taking values in, with common law ì 2 M (). Denote by L n the empirical measure corresponding to the rst n observations: L n ˆ X n ä X n k : kˆ We denote the law of L n by L (L n ). For í 2 M () de ne its Kullback±Leibler distance or relative entropy (relative to ì) by I(x): 350±7265 # 2000 ISI/BS

2 022 A.J. Ganesh and N. O'Connell 8 < dí dí log dì, í ì, H(íjì) ˆ dì dì :, otherwise: The statement of Sanov's theorem is that the sequence L (L n ) satis es the LDP in M () equipped with the ô-topology (see Dembo and Zeitouni 993, Theorem 6.2.0), with rate function H(:jì). As a corollary, the LDP also holds in the weak topology on M (), which is weaker than the ô-topology. In an earlier paper (Ganesh and O'Connell 999), we proved an inverse of this result, which arises naturally in a Bayesian setting, for nite sets. The underlying distribution (of the X k ) is unknown, and has a prior distribution ð 2 M (M ()). The posterior distribution, given the rst n observations, is a function of the empirical measure L n and is denoted ð n (L n ). We showed that, on the set fl n! ìg, for any xed ì in the support of the prior, the sequence ð n (L n ) satis es the LDP in M () with rate function given by H( ìj:) on the support of the prior (otherwise it is in nite). Note that the roles played by the arguments of the relative entropy function are interchanged compared to Sanov's theorem. We pointed out that the extension of the result to more general would require additional assumptions about the prior. To see that this is a delicate issue, note that, since H( ìjì) ˆ 0, the LDP implies consistency of the posterior distribution in the topology generated by Kullback±Leibler neighbourhoods; in particular, it implies weak consistency. But it was shown by Freedman (963) that Bayes estimates can be inconsistent even for countable ; even if the true distribution is in the weak support of the prior, it does not follow that the posterior mass of each weak neighbourhood tends to (in fact, it can tend to zero!). There has recently been renewed interest in the consistency of nonparametric Bayes methods, prompted by their increasing popularity in applied work. A notable early result in this eld is due to Schwartz (965), who showed that if the prior assigns positive probability to every Kullback±Leibler neighbourhood of the true distribution, then the posterior is weakly consistent. If, in addition, the relevant space of probability distributions satis es a `metric entropy' condition, then Barron et al. (999) show that the posterior concentrates on neighbourhoods de ned by the Hellinger metric; these are ner than weak neighbourhoods. (The Hellinger distance between two densities f and g with respect to a reference measure ì is de ned by p p ( f g ) 2 dì.) Recent research on the consistency of Bayes methods is reviewed by Ghosal et al. (999) and Wasserman (998). Rates of convergence of the posterior have been investigated by Ghosal et al. (998) and Shen and Wasserman (998), but there is relatively little work on more re ned asymptotics. In this paper, we prove an LDP for the special (but nevertheless useful) case of Dirichlet process priors on a compact metric space. The problem of extending our results to an arbitrary Polish space remains open. An LDP with a similar avour for a sequence of Dirichlet processes has been derived by Lynch and Sethuraman (987); we compare our result with theirs following the statement of Theorem. The techniques we use in this paper are very different from those of Lynch and Sethuraman, who obtain their results as a consequence of an LDP they derive for processes

3 A large-deviation principle for Dirichlet posteriors 023 with stationary, independent increments. We believe that our methods are of independent interest, and also that they can be generalized to a wider class of prior distributions. The LDP for Dirichlet posteriors derived here has applications to queue and risk management that are discussed in Ganesh et al. (998). Some questions of interest in this context are posed in terms of the ruin probability in the classical gambler's ruin problem. In Section 3, we use the LDP for the posterior distributions to obtain an asymptotic formula for the predictive probability of ruin. 2. The large-deviation principle Let be a compact metric space with Borel ó -algebra F. Let M () denote the space of probability measures on (, F ), and B (M ()) the Borel ó -algebra induced by the weak topology on M (). In this case, it is not possible to establish an LDP for Bayes posteriors corresponding to arbitrary prior distributions, for reasons discussed above. Therefore, we shall work with a speci c family of priors, namely Dirichlet process priors; see Ferguson (973) for a detailed discussion of their properties. The n-dimensional Dirichlet distribution with parameter a ˆ (a,..., a n ), denoted D(a), is de ned to be the joint distribution of (Z =Z,..., Z n =Z), where Z i, i ˆ,..., n, are mutually independent, Z i has the gamma distribution with shape parameter a i and scale parameter, and Z ˆ Z... Z n. Denote by M () the space of nite non-negative measures on (, F ). The Dirichlet process with parameter á 2 M (), denoted D (á), is a probability distribution on M (). A random probability measure, ì, on is said to have law D (á) if, for every nite measurable partition (A,..., A n ) of, the vector ( ì(a ),..., ì(a n )) has the n- dimensional Dirichlet distribution D(á(A ),..., á(a n )). The distribution of ( ì(b ),..., ì(b n )) for arbitrary measurable B,..., B n follows in an obvious way from the distributions for partitions. Let ð be a Dirichlet process prior, D (á), on the space M (). Then, conditional on observing ù,..., ù n, the posterior distribution is also a Dirichlet process, but with parameter á P n iˆ ä ù i, where ä x denotes Dirac measure at x (see Ferguson 973; 974). In other words, the Dirichlet processes D (á), á 2 M (), are a conjugate family of priors. This property greatly facilitates computation of posterior distributions and is very useful in analytical work. We now prove an LDP for the sequence of distributions fd (á P n iˆ ä ù i ), n ˆ, 2,...g. Theorem. Let á be a nite non-negative measure on (, B ()), with support. Let ì be a probability measure on (, B ()), and let fx n g be an -valued sequence such that X n ä xi! ì weakly, n iˆ where ä xi denotes Dirac measure at x i. Then the sequence of probability measures

4 024 A.J. Ganesh and N. O'Connell D (á P n iˆ ä x i ) satis es an LDP in M () equipped with its weak topology, with rate function I(:) given by I(í) ˆ H( ìjí), where H( ìjí) denotes the relative entropy of ì with respect to í. Corollary. If X i, i 2 N, are independent and identically distributed with common law ì, then the sequence of empirical distributions (=n) P n iˆ ä X i converges weakly to ì with probability one. Hence, the sequence of random probability measures D (á P n iˆ ä X i ) almost surely satis es an LDP (on M () equipped with its weak topology) with rate function I(:) ˆ H( ìj:). Remark. There is no loss of generality in the assumption that the support of the prior, á, is. Indeed, if the prior were supported on some smaller set, then since the posterior assigns no mass outside, we can con ne ourselves to the closed set, which is also a compact metric space. Remark 2. Lynch and Sethuraman (987) prove an LDP for the sequence of Dirichlet distributions D (nì) on [0, ]. Their result is equivalent to our theorem, for ˆ [0, ], if D (nì) and D (á nì n ) are exponentially equivalent whenever ì n converges weakly to ì; however, establishing exponential equivalence does not appear to be trivial. We now sketch the main ideas behind the proof before proceeding with a formal derivation. Let ì n be a random element of M () with distribution D (á P n iˆ ä x i )as above. For bounded measurable functions f :! R, we de ne Ë n ( f ) ˆ log E exp f dì n : () We show in Lemma below that, for nite measurable partitions (A,..., A k )of, the vector ( ì n (A ),..., ì n (A k )) satis es the LDP in R k. We then use Varadhan's integral lemma (Dembo and Zeitouni 993, Theorem 4.3.) to infer the existence of the limit Ë( f ) ˆ lim Ë(nf ), n for simple functions f ˆ Pk iˆ c i Ai ; here Ai denotes the indicator of A i. This is extended in Lemma 3 to all bounded continuous functions on, using the continuity of Ë(:). By Theorem in Dembo and Zeitouni (993), the existence of the limiting logarithmic moment generating function, Ë, implies the large-deviation upper bound for the sequence fì n g, for all compact subsets of M (). Since was assumed to be compact, M () is compact in the weak topology and so the upper bound holds for all closed sets. The rate function for this upper bound is the convex conjugate of Ë, which we identify to be H( ìj:). We use the LDP for ( ì n (A ),..., ì n (A k )) and the contraction principle (Dembo and Zeitouni 993, Theorem 4.2.) to obtain an LDP for P k iˆ c iì n (A i ), for arbitrary constants c i. Thus, we obtain a large-deviation lower bound for sets of the form n!

5 A large-deviation principle for Dirichlet posteriors 025 U(ö k, x, ä) ˆ í 2 M (): ö k dí 2 (x ä, x ä), with rate function H k ( ìj:) given in Lemma 2. Here x 2 R and ä. 0 are arbitrary, and ö k is any simple function ö k ˆ Pk iˆ c i Ai, so that ö k dí ˆ Pc i í(a i ). We extend the lower bound to sets of the form U(ö, x, ä), where ö is any bounded continuous function on, by using increasingly ne partitions of to approximate ö by simple functions. The rate function for this lower bound is the limit of H k ( ìj:) as the partitions indexed by k get ner, which is shown in Lemma 2 to be H( ìj:). Since the sets U(ö, x, ä) constitute a base for the weak topology on M (), this establishes the large-deviation lower bound for all open sets. The proof of Theorem uses the following lemmas, whose proofs are in the Appendix. Lemma. Let (A,..., A k ) be a measurable partition of and suppose that the interior of A i is non-empty for each i ˆ,..., k. Let f be bounded and measurable with respect to ó (A,..., A k ), the ó algebra generated by the sets A,..., A k. Then Ë( f ) :ˆ lim n! n Ë n(nf ) (2) exists and is nite, and is given by Ë( f ) ˆ sup í2m () f dí H( ìjí) : (3) Lemma 2. Let A k ˆ (A k,..., Ak n k ), k 2 N, be a sequence of partitions of such that the corresponding ó-algebras, ó (A k ), increase to B (), the Borel ó-algebra on. Then, for all í 2 M (), we have H( ìjí) ˆ sup k where H k ( ìjí) :ˆ Pn k iˆ ì(ak i )log[ì(ak i )=í(ak i )]. H k ( ìjí) ˆ lim k! H k ( ìjí), This result is well known; see Georgii (988), for example. Lemma 3. For all bounded, continuous functions f :! R, the limit in (2) exists and is nite. The map Ë : C b ()! R is convex and continuous, and we have Ë( f ) ˆ sup í2m () f dí H( ìjí) : Here, C b () denotes the space of bounded continuous functions from to R, equipped with the supremum norm, k f k ˆ sup x2 j f (x)j. Proof of Theorem. We have from Lemma 3 that Ë is the convex conjugate of H( ìj:). But H( ìj:) is convex, and lower semicontinuous in the weak topology (see Dupuis and Ellis 997, Lemma.4.3, and recall that M () is Polish since is a Polish space). Hence, H( ìj:) and Ë(:) are convex duals of each other. The large-deviations upper bound for

6 026 A.J. Ganesh and N. O'Connell compact subsets of M () now follows from Dembo and Zeitouni (993, Theorem 4.5.3). But was assumed to be compact, hence M () is compact in the weak topology, so the upper bound holds for all closed sets in M (). We now turn to the proof of the largedeviations lower bound. The weak topology on M () is generated by the sets U ö,x,ä ˆ í 2 M (): ö dí x, ä, ö 2 C b (), x 2 R, ä. 0: Given such a set and å. 0, we can nd a sequence of measurable partitions A k ˆ (A k,..., Ak n k )of, and a sequence of simple functions ö k measurable with respect to ó (A k ), with the following properties: the ó -algebras ó (A k ) increase to B (), the Borel ó algebra on ; for all k and all i 2f,..., n k g, Ai k is a ì-continuity set with non-empty interior; for some K. 0 and all k. K, kö k ök, å. We shall assume that å, ä=3. We now have P( ì n 2 U ö,x,ä ) > P ö k dì n x, ä å, 8k. K: (4) Let ö k ˆ Pn k iˆ ck i A k i. Then ö k dì n ˆ Xn k ci k ì n(ai k ): iˆ It is shown in the proof of Lemma (see equation (3)) that the sequence ( ì n (A k ),..., ì n(a k n k )) n>0 satis es an LDP with rate function I k given by 8 >< P nk jˆ I k (y,..., y nk ) ˆ ì(a j) log ì(a j), if y 2 R n k y and P n k iˆ y i ˆ, j >:, otherwise: It follows from the contraction principle (Dembo and Zeitouni 993, Theorem 4.2.) that P nk iˆ ck i ì n(a k i ) satis es an LDP with rate function J k given by ( ) J k (x) ˆ inf I k (y): Xn k ci k y i ˆ x iˆ ˆ inf H k ( ìjí): í 2 M (), ö k dí ˆ x : In particular, we obtain the large-deviations lower bound,

7 A large-deviation principle for Dirichlet posteriors 027 lim inf n! n log P ö k dì n x, ä å > inffj k (y): jy xj, ä åg ˆ inf H k ( ìjí): í 2 M (), ö k dí x, ä å : (6) Now, kö ö k k, å for all k. K, so we have, for all í 2 M (), that ö dí x, ä 2å ) ö k dí x, ä å, 8k. K: It now follows from (4) and (6) that, for all k. K, lim inf n! n log P( ì n 2 U ö,x,ä ) > inf H k ( ìjí): ö dí x, ä 2å : Hence, we have from Lemma 2 that lim inf n! n log P( ì n 2 U ö,x,ä ) > inf H( ìjí): ö dí x, ä 2å : Since å. 0 was arbitrary, we can let å decrease to zero, to obtain lim inf n! n log P( ì n 2 U ö,x,ä ) > inf H( ìjí): ö dí x, ä, which is the desired large-deviations lower bound for the set U ö,x,ä, with rate function H( ìj:). We have thus established the large-deviations lower bound for a base of the weak topology on M (), and hence for all open sets in this topology. Combined with the upper bound above, this completes the proof of the theorem. h We have established an LDP for the sequence of Dirichlet posterior distributions in the weak topology on M (), with rate function I(í) ˆ H( ìjí). The rate function differs from that in Sanov's theorem in that its argument, í, enters as the second rather than the rst argument in the relative entropy function. (Sanov's theorem says that the empirical distribution of a sequence of independent and identically distributed -valued random variables with common law ì satis es an LDP with rate function J(í) ˆ H(íjì).) Intuitively, this is because in Sanov's theorem we are asking how likely we are to observe í, given that the true distribution is ì, whereas in this paper we are asking how likely it is that the true distribution is í, given that we observe ì. We believe that our result holds for a wider class of priors, of the form described below. Let P be the set of all nite measurable partitions of. ForP 2 P, let ó (P) denote the ó -algebra generated by P. The restriction of a measure í 2 M () to the ó-algebra ó (P) is denoted í P. In other words, í P ˆ E[íjó (P)]. For a prior ð 2 M (M ()) we denote by ð P the corresponding element in M (M (, ó (P))), thus the restriction of ð to the Borel ó-algebra B (M (, ó (P))). We x a subset P 9 of P and say that a prior measure (5)

8 028 A.J. Ganesh and N. O'Connell ð 2 M (M ()) is exchangeable with respect to nite projections in P 9 if, for every P 2 P 9, wehave [ð n ( ì n )] P ˆ ð n P ( ì n,p): Here ð n ( ì n ) denotes the posterior distribution on M (, B ()) corresponding to the prior ð and the empirical distribution ì n ; [ð n ( ì n )] P its restriction to ó (P); and ð n P ( ì n,p) the posterior distribution on M (, ó (P)) corresponding to the prior ð P and the empirical distribution restricted to ó (P). The essential property of the Dirichlet process that we have used in the proof of Theorem is its exchangeability with respect to P 9, where P 9 is the collection of nite partitions consisting of sets with non-empty interiors. This collection is large enough to generate the Borel ó-algebra on. We believe that our methods can be generalized to priors which are exchangeable with respect to nite projections in P 9, for some P 9 which generates the Borel ó-algebra on, although there do seem to be some technical dif culties which we hope to address in future work. An example of a class of priors which are exchangeable with respect to nite projections are the PoÂlya tree distributions studied by Mauldin et al. (992) and Lavine (992), which generalize the Dirichlet process. 3. Application to the gambler's ruin problem Suppose now that is a compact subset of R. As before, fx k g is a sequence of independent, identically distributed random variables with common law ì 2 M (), and we are interested in level-crossing probabilities for the random walk S n ˆ X... X n.forq. 0, denote by R(Q, ì) the probability that the walk ever exceeds the level Q. If a gambler has initial capital Q, and loses amount X k on the k th bet, then R(Q, ì) is the probability of ultimate ruin. If the underlying distribution ì is unknown, the gambler may wish to assess this probability based on experience: this leads to a predictive probability of ruin, given by the formula P n (Q, ì n ) ˆ R(Q, ë)ð n (dë), where, as before, ì n is the empirical distribution of the rst n observations and ð n ð n ( ì n ) is the posterior distribution corresponding to some prior, ð, and the empirical distribution, ì n. A standard re nement of Wald's approximation yields Ce ä( ì)q < R(Q, ì) < e ä( ì)q, for some C. 0, where ä( ì) ˆ sup è > 0: e èx ì(dx) < : Thus,

9 A large-deviation principle for Dirichlet posteriors 029 C e ä(ë)q ð n (dë) < P n (Q, ì n ) < e ä(ë)q ð n (dë): M () M () Now, if ð is the Dirichlet process D (á), parametrized by an arbitrary nite positive measure á whose support is all of, then the sequence ð n obeys an LDP by Theorem, and we can apply Varadhan's lemma (see, for example, Dembo and Zeitouni 993, Theorem 4.3.) to obtain the asymptotic formula, for q. 0, lim n! n log P n(qn, ì n ) ˆ inffh( ìjí) ä(í)q : í 2 supp ðg, on the set ì n! ì. Here, we are using the easy ( is compact) fact that ä : M ()! R is continuous. This formula can be simpli ed in special cases. Its implications for risk and network management are discussed in Ganesh et al. (998). 4. Conclusion In this paper, we establish a large-deviation principle for the sequence of Bayesian posteriors induced by a Dirichlet prior on a compact metric space. Can the result be extended to an arbitrary Polish space? Our approach yields the large-deviation lower bound for arbitrary open subsets of this space, and the upper bound for compact subsets. In other words, we can prove a weak LDP on a Polish space. This could be strengthened to a full LDP if the sequence of Dirichlet posteriors were exponentially tight. However, exponential tightness of this sequence would imply the goodness of the rate function H( ìj:), which we know not to be true in general. For example, take ˆ R, ì ˆ ä 0, the unit mass at 0, and í n ˆ ( 2 )ä 0 ( 2 )ä n. Then H( ìjí n ) ˆ log 2 for all n, but the sequence í n is not tight. This implies that H( ìj:) does not have compact level sets, that is, it is not a good rate function. Hence, our method cannot be easily extended to arbitrary Polish spaces. Finally, while we have worked with Dirichlet process priors, we believe that our approach can be extended to priors with the appropriate exchangeability properties, as discussed at the end of Section 2. However, there do appear to be some technical dif culties with this approach, which we hope to address in future work. Appendix: Proofs Proof of Lemma. Let (A,..., A k ) be a partition of such that the interior of A i is nonempty and that A i is a ì-continuity set for every i ˆ,..., k. Let f be bounded and measurable with respect to the ó-algebra generated by the partition. Then we can write f ˆ Xk c i Ai, (7) for some constants c i, where Ai denotes the indicator of A i. Then, by (), iˆ

10 030 A.J. Ganesh and N. O'Connell " # Ë n ( f ) ˆ log E exp Xk c i ì n (A i ) : (8) By the assumption that each A i has non-empty interior and that the support of á is, we have á n (A j ) :ˆ á(a j ) Xn ä xi (A j ). 0, 8n 2 N, j ˆ,..., k: (9) iˆ We have from the de nition of the Dirichlet distribution that! Z n Z k n ( ì n (A ),..., ì n (A k )) P k iˆ Z,..., i P k n iˆ Z, i n where the Z i n are independent gamma random variables, with Z n j G (á n(a j ), ), and á n is de ned in (9). Here G (á, ) denotes the gamma distribution with shape parameter á and scale parameter. It is straightforward to evaluate the cumulant generating functions of the Z n j.wehave ë j n (è) :ˆ log E[exp(èZ n j )] ˆ á n(a j )log( è), if è,,, otherwise: Since P n iˆ ä x i (A j )=n! ì(a j ) by assumption, we obtain ë j (è) :ˆ lim n! n ë j n (è) ˆ ì(a j)log( è), if è,,, otherwise: Hence, by the GaÈrtner±Ellis theorem (see Dembo and Zeitouni 993, Theorem 2.3.6), the sequence of random variables Z n j =n satis es an LDP in R with rate function ë j which is the convex dual of ë j, that is, 8 ë < j (x) ˆ sup[èx ë j (è)] ˆ x ì(a j) ì(a j )log ì(a j), if x. 0, x (0) è2r :, otherwise: If ì(a j ) ˆ 0, then the assumption of steepness of ë j is not satis ed, so the GaÈrtner±Ellis theorem does not apply. However, it is not hard to verify directly in this case that Z n j =n does indeed satisfy an LDP with the above rate function. Since fz n j, j ˆ,..., kg are independent, fz n j =n, j ˆ,..., kg jointly satisfy an LDP in R k with rate function ë (x) ˆ Pk jˆ ë j (x j), where x ˆ (x,..., x k ) and ë j is given by (0). De ne Y n j ˆ Z n j =P k iˆ Z i n. Since P k iˆ Z i n is strictly positive with probability, the maps (Z n,..., Z k n )! (Y n,..., Y k n ) are almost surely continuous for every n. It follows from the contraction principle (Dembo and Zeitouni 993, Theorem 4.2.) that fy n j, j ˆ,..., kg jointly satisfy an LDP with rate function I given by iˆ

11 A large-deviation principle for Dirichlet posteriors 03 8 < I(y,..., y k ) ˆ inf : X k jˆ ë j (z j): y j ˆ zj P k iˆ z i 9 =, j ˆ,..., k ; : () If y j, 0 for some j, then any z included in the in mum in () must have z i, 0 for some i and so, by (0) I(y) ˆ. Next, if y j ˆ 0 for all j or if P n iˆ y i 6ˆ, then there is no z 2 R k such that y j ˆ z j = P k iˆ z i for all j. Hence I(y), being the in mum of an empty set, is again. In the following, we shall con ne attention to y 2 R k such that y > 0 and P k iˆ y i ˆ. If z 2 R k is such that y j ˆ z j = P k iˆ z i for all j ˆ,..., k, then we can write z ˆ ây for some â. y. Now () gives X k I(y,..., y k ) ˆ inf ë j (ây j): (2) â.0 jˆ Setting the derivative of the sum on the right with respect to â equal to zero yields 0 ˆ Xk y j ì(a j) ˆ â â : jˆ To obtain the last equality, we have used the fact that P k jˆ y j ˆ by assumption, while P k jˆ ì(a j) ˆ asì is a probability distribution and A,..., A k partition. Since each ë j is convex, the above implies that the in mum in (2) is achieved at â ˆ, and I(y) ˆ Xk jˆ ˆ Xk jˆ ë j (y j) ˆ Xk jˆ ì(a j )log ì(a j) y j : y j ì(a j ) ì(a j )log ì(a j) y j The second equality above comes from (0) and the third follows from the fact that ì and y are both probability distributions, and hence sum to. It follows from the preceding discussion that the sequence of random vectors ( ì n (A ),..., ì n (A k )) satis es an LDP in R k with rate function 8 X >< k I(y) ˆ jˆ >:, otherwise: ì(a j )log ì(a j), if y 2 R k Xk and y i ˆ, y j iˆ Observe from (7) that j f dì n j < maxiˆ k jc ij as ì n is a probability distribution. Hence, we have from Varadhan's lemma (Dembo and Zeitouni 993, Theorem 4.3.) and the LDP for ( ì n (A ),..., ì n (A k )) that Ë( f ) :ˆ lim n! n Ë n(nf ) ˆ sup y2r k " # Xk c i y i I(y) : iˆ (3)

12 032 A.J. Ganesh and N. O'Connell Using (3), we can rewrite the above as " # X k Ë( f ) ˆ sup c i í(a i ) Xk ì(a i )log ì(a i) í2m () í(a iˆ iˆ i ) ˆ sup í2m () f dí H k ( ìjí), (4) where H k ( ìjí) is de ned in Lemma 2. We now show that we can replace H k ( ìjí) in the supremum above by H( ìjí). Let í 2 M () be arbitrary. If ì(a i ). 0 and í(a i ) ˆ 0 for some A i, then ì 6 í and H( ìjí) and H k ( ìjí) are both in nite. Hence, such í can be excluded from consideration of the supremum above, and we shall suppose without loss of generality that ì(a i ) ˆ 0 whenever í(a i ) ˆ 0. We now de ne ë 2 M () as follows. Set ë í on A i if ì(a i ) ˆ 0; if ì(a i ). 0, take ë to be absolutely continuous with respect to ì on A i, with Radon± Nikodym derivative dë dì í(a i) ì(a i ). 0: Then ì is absolutely continuous with respect to ë and we have Also, H( ìjë) ˆ ˆ f dí ˆ Xk dì log dì dë ˆ X i:ì(a i ). 0 iˆ X i:ì(a i ). 0 A i dì log dì dë ì(a i ) log ì(a i) í(a i ) ˆ H k( ìjí): (5) c i í(a i ) ˆ Xk iˆ c i ë(a i ) ˆ f dë: (6) Since í 2 M () was arbitrary, we obtain from (5) and (6) that sup f dí H k ( ìjí) < sup f d ë H( ìjë) : í2m () ë2m () The reverse inequality holds as well because H k ( ìjí) < H( ìjí) for all í 2 M () by Lemma 2 (or by the convexity of x 7! x log x on [0, )). Hence, equality holds above and the claim of the lemma follows from (4). h Proof of Lemma 3. Let å. 0 be given, and let f :! R be bounded and continuous. We can nd k. 0 and a simple function g ˆ Pk iˆ c i Ai such that k f gk, å. Since f is continuous, we can in fact choose the A i to be ì-continuity sets with non-empty interiors. Now, by () and the fact that each ì n is a probability distribution,

13 A large-deviation principle for Dirichlet posteriors 033 Ë n (nf ) ˆ log E exp nf dì n < log E exp ng dì n nå ˆ nå Ë n (ng), so that, by (2), lim sup n! n Ë n(nf ) < Ë(g) å: Likewise, lim inf n! Ë n (nf )=n > Ë(g) å. Since å. 0 is arbitrary, it follows that Ë( f ) :ˆ lim n! n Ë n(nf ) exists and is nite for all bounded, continuous f :! R. The arguments above also show that Ë : C b ()! R is continuous, with jë( f ) Ë(g)j < k f gk. For f 2 L (), de ne H ( f ) ˆ sup í2m () f dí H( ìjí), (7) that is, H is the convex conjugate of H( ìj:). Now j f díj < k f k for all í 2 M (), while H( ìj:) is non-negative, with H( ìjì) ˆ 0. Thus, jh ( f )j < k f k. Since H is a convex function with domain L (), which is bounded on the open neighbourhood f f : k f k, g, we have by Rockafellar (974, Theorem 8) that H is continuous on the interior of its domain, which is all of L (). By Lemma, H and Ë agree on functions of the form f ˆ Pk iˆ c i Ai, where the A i partition and each A i is a ì-continuity set with non-empty interior. Since such functions are dense in C b (), Ë was shown to be continuous on C b () and H to be continuous on L () C b (), it follows that Ë ˆ H on all of C b () and, consequently, that Ë is convex. h Acknowledgements The research of this paper was carried out in part while the rst named author was at BRIMS. We would like to thank the referees for their comments, which have helped to improve the presentation of the paper. References Barron, A., Schervish, M.J. and Wasserman, L. (999) The consistency of posterior distributions in nonparametric problems. Ann. Statist., 27, 536±56. Dembo, A. and Zeitouni, O. (993) Large Deviations Techniques and Applications. Boston and London: Jones and Bartlett.

14 034 A.J. Ganesh and N. O'Connell Dupuis, P. and Ellis, R.S. (997) A Weak Convergence Approach to the Theory of Large Deviations. New York: Wiley. Ferguson, T.S. (973) A Bayesian analysis of some non-parametric problems. Ann. Statist.,, 209± 230. Ferguson, T.S. (974) Prior distributions on spaces of probability measures. Ann. Statist., 2, 65±629. Freedman, D. (963) On the asymptotic behavior of Bayes estimates in the discrete case. Ann. Math. Statist., 34, 386±403. Ganesh, A.J. and O'Connell, N. (999) An inverse of Sanov's theorem. Statist. Probab. Lett., 42, 20± 206. Ganesh, A., Green, P., O'Connell, N. and Pitts, S. (998) Bayesian network management, Queueing Systems, 28, 267±282. Georgii, H.-O. (988) Gibbs Measures and Phase Transitions. Berlin: De Gruyter. Ghosal, S., Ghosh, J.K. and van der Vaart, A.W. (998) Convergence rates of posterior distributions. Preprint. Ghosal, S., Ghosh, J.K. and Ramamoorthi, R.V. (999) Consistency issues in Bayesian nonparametrics. In S. Ghosh (ed.), Asymptotics, Nonparametrics and Time Series. New York: Marcel Dekker. Lavine, M. (992) Some aspects of PoÂlya tree distributions for statistical modelling. Ann. Statist., 20, 222±235. Lynch, J. and Sethuraman, J. (987) Large deviations for processes with independent increments. Ann. Probab., 5, 60±627. Mauldin, R.D., Sudderth, W.D. and Williams, S.C. (992) PoÂlya trees and random distributions. Ann. Statist., 20, 203±22. Rockafellar, R.T. (974) Conjugate Duality and Optimization. Philadelphia: Society for Industrial and Applied Mathematics. Schwartz, L. (965) On Bayes procedures. Z. Wahrscheinlichkeitstheorie. Verw. Geb., 4, 0±26. Shen, X. and Wasserman, L. (998) Rates of convergence of posterior distributions. Technical report, Dept. of Statistics, Carnegie-Mellon University. Wasserman, L. (998) Asymptotic properties of nonparametric Bayesian procedures. Technical report, Dept. of Statistics, Carnegie-Mellon University. Received November 998 and revised November 999

An inverse of Sanov s theorem

An inverse of Sanov s theorem Ayalvadi Ganesh and Neil O Connell BRIMS, Hewlett-Packard Labs, Bristol Abstract Let X k be a sequence of iid random variables taking values in a finite set, and consider