Convergence Rates of Nonparametric Posterior Distributions

Size: px

Start display at page:

Download "Convergence Rates of Nonparametric Posterior Distributions"

Antony Warren
5 years ago
Views:

1 Convergence Rates of Nonparametric Posterior Distributions Yang Xing arxiv: v1 [math.st] 17 Apr 008 Swedish University of Agricultural Sciences We study the asymptotic behavior of posterior distributions. We present general posterior convergence rate theorems, which extend several results on posterior convergence rates provided by Ghosal and Van der Vaart 000, Shen and Wasserman 001 and Walker, Lijor and Prunster 007. Our main tools are the Hausdorff α-entropy introduced by Xing and Ranneby 008 and a new notion of prior concentration, which is a slight improvement of the usual prior concentration provided by Ghosal and Van der Vaart 000. We apply our results to several statistical models. 1. Introduction. Recently, a major theoretical advance has occurred in the theory of Bayesian consistency for infinite-dimensional models. Schwartz 1965 first proved that, if the true density function is in the Kullback-Leibler support of the prior distribution, then the sequence of posterior distributions accumulates in all weak neighborhoods of the true density function. It is known that the condition of positivity of prior mass on each Kullback-Leibler neighborhood in Schwartz s theorem is not a necessary condition. When one considers problems of density estimation, it is natural to ask for the strong consistency of Bayesian procedures. Sufficient conditions for the strong Hellinger consistency and for evaluating consistency rates have been currently developed by many authors. In this paper we study the problem of determining whether the posterior distributions accumulate in Hellinger neighborhoods of the true density function. The rate of this convergence can be measured by the size of the smallest shrinking Hellinger balls around the true density function on which posterior masses tend to zero as the sample size increases to infinity. By the fundamental works of Ghosal, Ghosh and Van der Vaart 000 and Shen and Wasserman 001, we know that the convergence rate of posterior distributions is completely determined by two quantities: the rate of the metric entropy and the prior concentration rate. Roughly speaking, the rate of the metric entropy describes how large the model is, and the prior concentration rate depends on prior masses near the true distribution. Since the true distribution is unknown, the later assumption actually requires that the prior AMS 000 Subject Classifications. 6G07, 6G0, 6F15. Key words. Hellinger consistency, posterior distribution, rate of convergence, sieve, infinite-dimensional model. 1

2 distribution spreads its mass uniformly over the whole density space. Another elegant approach for determination of the convergence rate was provided by Walker 004, who obtained a sufficient condition for strong consistency by using summability of square root of prior probability instead of the metric entropy method. In this paper, in dealing with the rate of metric entropies we shall apply the Hausdorff α-entropy introduced by Xing and Ranneby 008, which is much smaller than widely used metric entropies and the bracketing entropy. For some important prior distributions of statistical models the Hausdorff α-entropies of all sieves are uniformly bounded, whereas it is generally impossible to get uniform boundedness of metric entropies of large sieves. The application of the Hausdorff α-entropy leads refinements of several theorems on posterior convergence rates, for instance, the well known assumptions on metric entropies and summability of square root of prior probability have been weakened, which particularly yields that Theorem 5 of Ghosal et al.007b is strengthened into Corollary 3 of this paper. To handle the prior concentration rate, we shall apply a new notion of prior concentration. Our approach is a slight improvement of the prior concentration provided by Ghosal et al.000, and moreover the proof of Lemma 1 on which the approach bases is quite simple. Finally, to get posterior convergence at the optimal rate 1/ n, we give an extension of Ghosal et al.000, Theorem.4, in which the universal testing constant has been replaced by any fixed constant. An outline of this paper is as follows. In Section we define the Hausdorff α-entropy with respect to a given prior and then present general theorems for the determination of posterior convergence rates. We also give a new approach to compute concentration rates. In Section 3 we apply our results to Bernstein polynomial priors, priors based on uniform distribution, log spline models and finite-dimensional models, which leads some improvements on known results for these models. The proofs of the main results are contained in Section 4.. Notations and Theorems. We consider a family of probability measures dominated by a σ-finite measure µ in X, a Polish space endowed with a σ-algebra X. Let X 1,X,...,X n stand for an independent identically distributed i.i.d. sample of n random variables, taking values in X and having a common probability density function f 0 with respect to the measure µ. Denote by F0 the infinite product distribution of the probability distribution F 0 associated with f 0. For two probability densities f and g µdx 1/ we denote the Hellinger distance Hf,g = X fx gx and the Kullback-Leibler divergence Kf,g = X µdx. Assume that the space F of probability density functions is separable with respect to the Hellinger metric and that F is the Borel σ-algebra of F. Given a prior distribution Π on F, the posterior distribution Π n is a random probability measure with the following expression fxlog fx gx Π n A = Π A X1,X,...,X n = n fx i Πdf A i=1 F n fx i Πdf i=1 = AR nfπdf F R nfπdf

3 forallmeasurablesubsetsa F, wherer n f = n { fxi /f 0 X i isthelikelihoodratio. In other words, the posterior distribution Π n is the conditional distribution of Π given the observations X 1,X,...,X n. If the posterior distribution Π n concentrates on arbitrarily small neighborhoods of the true density function f 0 almost surely or in probability, then it is said to be consistent at f 0 almost surely and in probability respectively. Throughout this paper, almost sure convergence and convergence in probability should be understood as to be with respect to the infinite product distribution F 0 of F 0. Our aim of this article is to present general theorems on posterior convergence rates at f 0. By the posterior convergence rate theorems of Ghosal, Ghosh and Van der Vaart 000, we know that the prior concentration rate and the rate of metric entropy both completely determine the convergence rate of posterior distributions. More specifically, a key inequality to determine almost sure convergence rates of posterior distributions is that for each ε > 0, F R nfπdf e 3nε i=1 Π f : Hf 0,f f 0 /f < ε almost surely for all sufficiently large n, where g stands for the supremum norm of the function g on X. This inequality was obtained by Ghosal et al.000, Lemma 8.4 under mild assumptions. It appears almost in all of papers handling strong convergence rates of posterior distributions. The reason is that in order to get the convergence rate of posterior distributions one needs to find a suitable lower bound for the denominator in the expression of posterior distributions. This is successfully done in Ghosal et al.000, who suggested that the prior Π puts sufficiently amount of mass around the true density function f 0 in the sense: Π f : Hf 0,f f 0 /f < ε n e n ε n c for some fixed constant c. Such a sequence { ε n is referred to as the concentration rate of the prior Π around f 0. Here we give a slightly stronger result. We introduce a modification of the Hellinger distance H f 0,f = f0 x fx f 0 x X 3 fx µdx. 3 Itisclearthattheinequality f 0 /f 1holdsforalldensityfunctionsf andf 0 suchthat the supremum is well-defined, and the quality holds if and only if f = f 0 almost surely. Observe also that H f 0,f H f,f 0 and 3 1/ Hf 0,f H f 0,f. Moreover, we have H f 0,f Hf 0,f 3 which yields f0 /f / Hf 0,f f0 /f 1/4 Hf 0,f f0 /f 1/ { f F : H f 0,f ε n { f F : Hf0,f f0 / f < ε n { f F : Hf 0,f f0 / f < ε n. 3

4 The following simplelemma shows that, for ε n tobe a prior concentration rate, it isenough to assume Π W εn e n ε n c 3, where W ε = { f F : H f 0,f ε. Lemma 1. Let ε > 0 and c > 0. Then the inequality F0 F R nfπdf e nε 3+c Π W ε e nε c holds for all n. Lemma 1 provides a useful approach to compute prior concentration rates, particularly for models in which rate of convergence is governed by the prior concentration rate. It leads a simplification of the proof of Theorem. of Ghosal et al.000. Furthermore, we shall present general posterior convergence rate theorems in which the well known assumptions on metric entropies and summability of square root of prior probability are also weakened. We shall apply the Hausdorff α-entropy Jδ, G, α introduced by Xing et al.008. Denote by L µ the space of all nonnegative integrable functions with the norm f 1 = Xfxµdx. Write log0 =. Definition. Let α 0 and G F. For δ > 0, the Hausdorff α-entropy Jδ,G,α with respect to the prior distribution Π is defined as Jδ,G,α = log inf N ΠB j α, where the infimum is taken over all coverings {B 1,B,...,B N of G, where N may take the value, such that each B j is contained in some Hellinger ball {f : Hf j,f < δ of radius δ and center at f j L µ. Notethat theinfimum can beequivalentlytakenover allpartitions{p 1,P,...,P N of G such that the Hellinger radius of each subset P j does not exceed δ. It was proved in Xing et al.008, Lemma 1 that the Hausdorff α-entropy Jδ, G, α is an increasing subadditive function of G and satisfies Jδ,G,α log Nδ,G for all α 0, where Nδ,G stands for the minimal number of Hellinger balls of radius δ needed to cover G. For 0 α 1 and each G F, we also obtained the following useful inequality ΠG α e Jδ,G,α ΠG α Nδ,G 1 α. Our first result in this paper is the following general theorem on posterior strong convergence rates. Denote A ε = { f : Hf 0,f ε. Theorem 1. Let { ε n and { ε n be positive sequences such that n min ε n, ε n as n. Suppose that there exist constants c 1 > 0, c > 0, c 3 0, 0 α < 1 and a sequence {G n of subsets on F such that e n ε n c < and 1 e J ε n,g n,α n ε n c 1 <, 4

5 e n ε n 3+3c +c 3 ΠA εn \G n <, 3 Π f : H f 0,f ε n e n ε n c 3. 3α+αc Then for ε n = max ε n, ε n and each r > + +αc 3 +c 1 1 α, we have Π n Arεn 0 almost surely as n. As direct applications we have Corollary 1. Let c 1 0, c > 0, c 3 0 and 0 α < 1. Suppose that {ε n is a positive sequence satisfying e nε n c < and suppose that there exists a sequence {G n of subsets on F such that 1 Jε n,g n,α nε nc 1, ΠF\G n e nε n 3+3c +c 3, 3 Π f : H f 0,f ε n e nε n c 3. 3α+αc Then for each r > + +αc 3 +c 1 +c 1 α, we have that Π n Arεn 0 almost surely as n. Proof. It is clear that all conditions of Theorem 1 are fulfilled if we let ε n = ε n = ε n and replace the c 1 in Theorem 1 by c 1 +c of Corollary 1, and the proof is complete. Corollary 1 extends Theorem. of Ghosal et al.000, in which they have stronger conditions than 1 and 3 of Corollary 1. It is probably worth mentioning that for several important prior distributions of infinite-dimensional statistical models, the quantities Jε n,g n,α are uniformly bounded for all n and hence condition 1 of Corollary 1 is triviallyfulfilled, whereasgeneral metricentropiesofthesieveg n growtoinfinity asthesample size increases. A slightly different version of Theorem 1 is the following consequence, which is in fact an extension of Proposition 1 of Walker et all.007. Corollary. Let {ε n be a positive sequence such that nε n as n. Suppose that there exist constants c 1 > 0, c > 0, c 3 0, 0 α < 1 and a sequence { G nj with G nj F such that e nε n c < and 1 Nε n,g nj 1 α ΠG nj α e nε n c 1 < ; 5

6 e nε n 3+3c +c 3 ΠA εn \ G nj <, 3 Π f : H f 0,f ε n e nε n c 3. 3α+αc Then for each r > + +αc 3 +c 1, we have that Π 1 α n Arεn 0 almost surely as n. Proof. Let G n = G nj. We only need to verify condition 1 of Theorem 1 for such a sieve G n. By Lemma 1 of Xing et al.008 we have e Jε n,g n,α nε n c 1 Corollary then follows from Theorem 1. e Jε n,g nj,α nε n c 1 Nε n,g nj 1 α ΠG nj α e nε n c 1 <. The assertion of Theorem 1 is an almost sure statement that the posterior distributions outside a Hellinger ball with a multiple of ε n as radius converge to zero almost surely. Now we give an in-probability assertion under weaker conditions. Denote Vf,g = X fx log fx gx µdx. Theorem. Let { ε n and { ε n be positive sequences such that n min ε n, ε n as n. Suppose that there exist constants c 1 > 0, c 0, 0 α < 1 and a sequence {G n of subsets on F such that 1 J ε n,g n,α n ε nc 1 as n, e n ε n +c ΠA εn \G n 0 as n, 3 Π f : Kf 0,f < ε n and Vf 0,f < ε n e n ε n c. Then for ε n = max ε n, ε n and each r > + in probability as n. α+αc +c 1 1 α, we have that Π n Arεn 0 Observe that the conditions of Theorem 1 and Theorem are only used to ensure that Π n A εn \G n 0 as n. So one can replace the conditions of these theorems by Π n A εn \ G n 0 as n almost surely and in probability respectively. Theorem is an extended version of Theorem.1 of Ghosal et al.001 and Theorem 1 of Walker 6

7 et all.007. Furthermore, as a consequence of Theorem we obtain the following slight improvement of Theorem 5 of Ghosal et al.007b. Corollary 3. Let {ε n be a positive sequence such that nε n as n. Suppose that there exist constants c 1 > 0, c 0, 0 α < 1 and a sequence { G nj with G nj F. If 1 e nε n c 1 Nε n,g nj 1 α ΠG nj α 0 as n ; e nε n +c ΠA εn \ G nj 0 as n ; 3 Π f : Kf 0,f < ε n and Vf 0,f < ε n e nε n c, α+αc then for each r > + +c 1 1 α, we have that Π n Arεn 0 in probability as n. Proof. For G n = G nj, by Lemma 1 of Xing et al.008 we get e Jε n,g n,α nε n c 1 e Jε n,g nj,α nε n c 1 e nε n c 1 Nε n,g nj 1 α ΠG nj α which tends to zero as n and condition 1 of Theorem holds. Then by Theorem we conclude the proof. The above theorems cannot yield a convergence rate 1/ n because of the assumption nε n. Particularly, these theorems cannot well serve finite-dimensional models. Ghosal et al.000, 007a have obtained a nice theorem to handle such models. Denote B ε n = { f : Kf 0,f < ε n and Vf 0,f < ε n. Now our result is Theorem 3. Let {ε n be a positive sequence such that nε n are uniformly bounded away from zero, i.e., there exists a constant c 0 > 0 such that nε n c 0 for all n. Suppose that there exist constants 0 < α < 1, c 1 < 1 α 18 and a sequence {G n of subsets on F such that 1 e nε n ΠA ε n \G n ΠB ε n 0 as n, exp J jε n 3, { f G n : jε n Hf 0,f < jε n,α e c 1j nε n ΠBε n α for all sufficiently large positive integers j and n. Then for each r n we have that Π n Arn ε n 0 in probability as n. 7

8 Remark. Here we adopt the convention that if the denominator of a quotient equals zero thenthenumeratormust alsobezero. Hence, Theorem3isstilltrueevenwhenΠB ε n = 0 for some n. Corollary 4. Let {ε n be a positive sequence such that nε n are uniformly bounded away from zero. Suppose that there exist constants c 1, c and a sequence {G n of subsets on F such that 1 logn jε n 3, { f G n : jε n Hf 0,f < jε n c1 nε n for all integer j and n large enough, e nε n ΠA ε n \G n ΠB ε n 0 as n, 3 Π f G n :jε n Hf 0,f<jε n e c j nε n for all integer j and n large enough. ΠB ε n Then for each r n we have that Π n Arn ε n 0 in probability as n. Corollary 4 is a slightly stronger version of Theorem.4 of Ghosal et al.000. A notable improvement in Corollary 4 is that we have no restriction on the constant c, whereas their constant c equals half of some universal testing constant. Proof of Corollary 4. We only need to check condition of Theorem 3. It follows from Lemma 1 of Xing et al.008 and conditions 1 and 3 that J jε n 3,{ f G n : jε n Hf 0,f < jε n,α αlogπ f G n : jε n Hf 0,f < jε n +1 αlogn jε n 3,{ f G n : jε n Hf 0,f < jε n αlog e c j nε n ΠBε n +1 αc 1 nε n = αc + 1 αc 1 j j nε n +αlogπb ε n. Taking a small α in 0,1 and then letting j be large enough, we have that αc + 1 αc 1 1 α 18 j < and hence condition of Theorem 3 is fulfilled. The proof of Corollary4is complete. We will conclude this section by presenting an analogue of Theorem 3, which gives an almost sure assertion under stronger conditions. 8

9 Theorem 4. Let {ε n be a positive sequence such that there exists a constant c 0 > 0 such that nε n c 0 logn for all large n. Suppose that there exist constants 0 < α < 1, c 1 < 1 α 18, c > 1 c 0 and a sequence {G n of subsets on F such that 1 e nε n 3+c ΠA ε n \G n ΠW ε n <, exp J jε n 3, { f G n : jε n Hf 0,f < jε n,α e c 1j nε n ΠWεn α for all sufficiently large positive integers j and n. Then for each r large enough we have that Π n Arεn 0 almost surely as n. Completely following the proof of Corollary 4, we have the following consequence of Theorem 4. Corollary 5. Let {ε n be a positive sequence such that there exists a constant c 0 > 0 such that nε n c 0 logn for all large n. Suppose that there exist constants c 1, c > 1 c 0, c 3 and a sequence {G n of subsets on F such that 1 logn jε n 3, { f G n : jε n Hf 0,f < jε n c1 nε n for all integer j and n large enough, 3 e nε n 3+c ΠA ε n \G n ΠW ε n <, Π f G n :jε n Hf 0,f<jε n ΠW ε n e c 3j nε n for all integer j and n large enough. Then for each r large enough we have that Π n Arεn 0 almost surely as n. 3. Illustrations. In this section we apply our theorems to Bernstein polynomial priors, priors based on uniform distribution, log spline models and finite-dimensional models. This leads some improvements on known results for these models Bernstein polynomial prior. A Bernstein polynomial prior is a probability measure on the space of continuous probability distribution functions on [0, 1]. Petrone 1999 introduced the Bernstein polynomial prior Π by putting a prior distribution on the class of Bernstein densities in [0,1] in the following way: bx;k,f = k Fj/k Fj 1/k βx;j,k j +1, 9

10 where βx;a,b stands for the beta density βx;a,b = Γa+b ΓaΓb xa 1 1 x b 1, k has a distribution ρ, F is a random distribution independent of ρ. In other words, if B j stands for the set of all Bernstein densities of order j, then Π = ρjπ B j, where the probability measure Π Bj is the normalized restriction of Π on B j. We refer to Petrone and Wasserman 00 for a detailed description of the Bernstein polynomial prior, in which consistency of the posterior distribution for the Bernstein polynomial prior is discussed. Rates of convergence have been established under suitable tail conditions on ρ by Ghosal 001 and Walker et al.007, where the convergence is understood as convergence in F0 -probability. Ghosal 001, Theorem.3 proved that the prior concentration rate is logn 1/3 /n 1/3 and the entropy rate is logn 5/6 /n 1/3 under the tail assumption ρj e cj for all j, which yields the convergence rate logn 5/6 /n 1/3. Walker etal.007obtainedtheentropyratelogn 1/3 /n 1/3 under thelightertailconditionρj e 4j logj = 1/j j 4 for all j, which leads the convergence rate logn 1/3 /n 1/3. In the following we establish the entropy rate 1/n γ under the tail condition ρj 1/j j c 0 for all j, where c 0 is any fixed positive constant and γ is any fixed constant strictly less than 1/. Hence we also get the convergence rate logn 1/3 /n 1/3 under the weaker condition ρj 1/j j c 0. This improves the result of Walker et al.007. From Ghosal 001 it follows that there exists an absolute constant c > 0 such that Nε n,b j c/ε n j for all j. Given γ < 1/, choose G nj = B j and ε n = 1/n γ. Take 0 < α < 1 such that γ + 1/d < 1 with d = c 0 α/1 α. To verify condition 1 of Corollary 3, by ρj 1/j j c 0 we obtain that e nε n c 1 Nε n,b j 1 α ΠB j α e nε n c 1 c j1 α ρj α ε n e c 1n 1 γ e nε n c 1 1 j cn γ 1/d cn γ c ε n j c 0 α 1 α j d j1 α + j1 α j cn γ 1/d 1 j1 α, where the last sum on the right hand side tends to zero as n. To estimate the first term, note that for gt = cn γ /t d t we have that g t = cn γ /t d t logcn γ /t d d = 0 is equivalent to t = c 1/d v γ/d e 1. This implies gj e d 1n γ/d for all j, where d 1 stands for the constant dc 1/d e 1. Therefore, by the inequality x e x for x 0 we get e c 1n 1 γ 1 j cn γ 1/d cn γ j d j1 α e c 1 n 1 γ cn γ 1/d e d 11 αn γ/d e c 1n 1 γ + 1/d c 1/d +d 1 d 1 αn γ/d 0 as j, since 1 γ > γ/d, and hence condition1 ofcorollary3holds. Condition of Corollary 3 is trivially fulfilled and hence by Corollary 3 we obtain that the entropy rate is at least 1/n γ for any given γ < 1/. 10

11 3.. Prior based on uniform distribution. Ghosal et al.1997 established posterior consistency for prior distributions based on uniform distributions of finite subsets. Priors based on discrete uniform distributions were further studied in Ghosal et al.000, in which they used the bracketing entropy as a tool to compute the convergence rate of posterior distributions. As an application of Theorem 1, we now give a slight extension. Given ε n > 0, assume that there exist density functions f 1,...,f Nn such that all sets {f : H f,f j 3 1/ ε n form a covering of F. Denote by µ n the uniform discrete probability measure on the set {f 1,...,f Nn. Define a prior distribution Π = a j µ j for a given sequence a j with a j > 0 and a j = 1. Theorem 5. If logn n + logn + log 1 a n = Onε n as n, then the posterior distributions Π n converge almost surely at least at the rate ε n, that is, Π n f : Hf0,f rε n 0 as n almost surely for any given sufficiently large r. Proof. From logn = Onε n it follows that e nε n c < for all large c > 0. For any f with H f,f j 3 1/ ε n we have Hf,f j 3 1/ H f,f j ε n. Hence we obtain that {f : H f,f j 3 1/ ε n {f : Hf,f j < ε n and then Jε n,f,0 logn n = Onε n, which implies condition 1 of Theorem 1. Condition is trivially fulfilled. Condition 3 follows from the fact that Πf : H f 0,f ε n Πf : H f 0,f 3 1/ ε n a n /N n e nε n c 3 for some c 3 > 0, since {f : H f 0,f 3 1/ ε n contains at least some density function of the set {f 1,...,f Nn. The proof of Theorem 5 is complete. It seems to be unusual to find a covering of the density space with covering sets of type {f : H f,f j cε n. The most widely used norm for continuous functions should be the supremum norm. In fact, one can easily construct a new covering N n {f : H f,f j cε n of F in terms of a given covering N n {f : f g j ε n with nonnegative bounded functions g j not necessarily density functions, as shown in the following: Take f j x = g j x+ε n / X g j x+ε n µdx, where we assume that µ is a probability measure on X and ε n 1 for all n. Then for each f with f g j ε n, that is, gj x ε n fx g j x+ε n on X, we have fx f j x = fx g j x+ε n X = 1+4ε n +4ε n g j x+ε n µdx X X f j xµdx 1+ε n, fj x+ε µdx n where f j is some density function in {f : f g j ε n and the last inequality follows from f 1 f = 1. This implies that H f,f j 3 1+ε n Hf,f j Hf,f j H f, g j +ε n +H g j +ε n,f j 11

12 1 µdx 4ε n + g j x+ε n X Therefore, we have N n N { n f : H f,f j 8ε n 1 4ε n + 1+ε n 1 = 8ε n. { f : f gj ε n F. Observe that the numbers of covering subsets in both type coverings are equal. For models with H f,g controlled by a constant multiple of the Hellinger metric Hf, g such as the exponential family and a model with uniformly bounded supremum norm f/g for all density functions f and g, it is not necessary to assume that the probability measures µ n constructed above concentrate on a finite number of points. Here we give an extension of Theorem 5. Let c 0 1 and F c0 be a subfamily of F such that H f,g c 0 Hf,g for all f,g F c0. Given ε n > 0, let {P 1,...,P Kn be a partition of F c0 such that for each P i there exists f i in F c0 with P i {f : Hf i,f ε n /c 0. Take any probability measure µ n on F c0 with µ n P i = 1/K n for i = 1,,...,K n. Define then a prior distribution Π = a j µ j for a given sequence a j with a j > 0 and a j = 1. Now we have Theorem 6. Let f 0 F c0. If logk n + logn + log 1 a n = Onε n as n, then the posterior distributions Π n converge at least at the rate ε n almost surely. Proof. By the proof of Theorem 5, we only need to verify condition 3 of Theorem 1. Take f i0 F c0 such that Hf 0,f i0 ε n /c 0. Then, for all f F c0 with Hf i0,f ε n /c 0, we have that H f 0,f c 0 Hf 0,f c 0 Hf 0,f i0 + c 0 Hf i0,f ε n. Hence we get that Πf : H f 0,f ε n ΠP i0 a n /K n e nε n c 3 for some c 3 > 0, and the proof of Theorem 6 is complete. Observe that, given a covering {O 1,O,...,O Kn of F c0, one can easily construct a partition {P 1,P,...,P Kn of F c0 in the following way: P 1 = O 1 F c0 and P i = O i i 1 l=1 P l F c0 for i =,3,...,K n. Example Exponential families. We consider the exponential family of all density functions of the form e hx, where the function hx belongs to a fixed bounded subset in the Sobolev space C p [0,1] with p > 0. A subclass of this family has been recently studied by Scricciolo 006. Following a result of Kolmogorov and Tihomirov 1959, we know that the ε-entropy of this family with respect to the norm equals Oε 1/p. Thus, using the above argument we get that logk n = Onε n for ε n = n p/p+1 and hence by Theorem 6 the posterior distributions constructed above converge at the rate ε n = n p/p+1, which is known to be the optimal rate of convergence in the minimax sense under the Hellinger loss Log spline models. Log spline models for density estimation have been studied, among others, by Stone 1990 and Ghosal et al.000. Let [ k 1/K n,k/k n with 1

13 k = 1,,...,K n be a partition of the interval [0,1. The space of splines of order q relative to this partition is the set of all functions f : [0,1 R such that f is q times continuously differentiable on [0,1 and the restriction of f on each [ k 1/K n,k/k n is a polynomial of degree strictly less then q. Let J n = q+k n 1. This space of splines is a J n -dimensional vector space with a B-spline basis B 1 x,b x,...,b Jn x, see Ghosal et al.000 for the details of such a basis. Let F be the set of all density functions in C α [0,1]. Assume that the true density function f 0 x is bounded away from zero and infinity. We consider the J n -dimensional exponential subfamily of C α [0,1] of the form J n f θ x = exp θ j B j x cθ where θ = θ 1,θ,...,θ Jn Θ 0 = {θ 1,θ,...,θ Jn R J n : J n θ j = 0 and the constant cθ is chosen such that f θ x is a density function in [0,1]. Each prior on Θ 0 induces naturally a prior on F. Let θ = max j θ j be the infinity norm on Θ 0. Assume that a 1 n 1/α+1 K n a n 1/α+1 for two fixed positive constants a 1 and a. Assume that the prior Π for Θ 0 is supported on [ M,M] J n for some M 1 and has a density function with respect to the Lebesgue measure on Θ 0, which is bounded below by a J n 3 and above by a J n 4. Take a constant d > 0 such that d θ logf θ x for all θ Θ 0. Ghosal et al.000, Theorem 4.5 proved that, if f 0 C α [0,1] with q α 1/ and logf 0 x dm/, the posteriors Π n converge in probability at the rate n α/α+1. Using Corollary 5 we now get that under the same assumptions as in Ghosal et al.000, Theorem 4.5, the posteriors Π n are in fact convergent almost surely at the rate ε n = n α/α+1. To see this, take G n = F. Clearly, nε n logn for all large n. Condition 1 of Corollary 5 has been verified by Ghosal et al.000 and condition is trivially fulfilled. Condition 3 follows also from the proof of Theorem 4.5 in Ghosal et al.000, since the inequality H f 0,f θ Hf 0,f θ f 0 /f θ 1/ holds for all θ Θ Finite-dimensional models. Let β > 0 and let Θ be a bounded subset in R d with the Euclidean norm. Denote by F the family of all density functions f θ with the parameter θ in Θ satisfying a 1 θ 1 θ β Hf θ1,f θ 3H f θ1,f θ a θ 1 θ β for all θ 1, θ Θ, where a 1 and a are two fixed positive constants. Assume that the true value θ 0 is in Θ and that the density function of the prior distribution Π with respect to the Lebesgue measure on Θ is uniformly bounded away from zero and infinity. Under slightly weaker conditions, Ghosal et al.000 proved that the posterior distributions Π n converge in probability at the rate 1/ n. Now we give an almost sure assertion for this model. Theorem 7. Under the above assumptions, the posterior distributions Π n converge almost surely at least at the rate logn/ n. 13,

14 Proof. We shall apply Corollary 5 for G n = F. Clearly, nε n = logn for ε n = logn/ n. Condition 1 has been verified in the proof of Theorem 5.1 of Ghosal et al.000. Condition is trivially fulfilled. Using H f θ0,f θ 3 1/ a θ 0 θ β, we have that ΠW εn Π θ : θ θ 0 3ε n /a 1/β. Hence, the verification of condition 3 follows from the same lines as the proof of Ghosal et al.000, Theorem 5 and then by Corollary 5 we conclude the proof of Theorem Lemmas and Proofs. In this section we give proofs of our lemmas and theorems. For simplicity of notations, we assume throughout this section that ε n = ε n = ε n. Proof of Lemma 1. ItisnorestrictiontoassumethatΠ W ε > 0. UsingJensen sinequality for the convex function x 1/ for x > 0 and Chebyshev s inequality, we obtain that F 0 On the other hand, we have = 1+ X = 1+ X = 1+ X F R nfπdf e nε 3+c Π W ε F0 R n fπdf e nε W ε = F0 e nε 3 +c 1 R n fπdf ΠW ε W ε F0 e nε 3 +c 1 ΠW ε e nε 3 +c 1 ΠW ε E R n f 1 Πdf W ε = e nε 3 +c 1 ΠW ε E W ε E 3+c Π W ε 1 W ε R n f 1 Πdf f 0 X 1 fx 1 n Πdf. f 0 X 1 fx 1 = 1+E f0 X 1 fx 1 fx1 f0 x fx fx f0 x fx+ fx f0 xµdx f0 x f0 x fx µdx+ fx f0 x f0 x fx µdx+ 1 fx 14 X f0 x fxf 0 x µdx X f0 x fx µdx

15 = 1+ 3 H f 0,f e 3 H f 0,f e 3 ε, where the last inequality holds when f W ε. Hence we get F 0 F R nfπdf e nε 3+c Π W ε e nε 3 +c 1 ΠW ε The proof of Lemma 1 is complete. W ε e 3 nε Πdf = e nε c. In the proof of Theorem 1 we use the following Lemma, which is similar to Lemma 5 of Barron et al Lemma. Let c > 0 and c 3 0. Let {ε n be a positive sequence such that ΠW ε n e nε n c 3 for all n and e nε n c <. If a sequence {D n of subsets in F satisfies e nε n 3+3c +c 3 ΠD n <, then Π n D n 0 almost surely as n. Proof. From Chebyshev s inequality and Fubini s theorem it turns out that F 0 { D n R n fπdf e nε = e nε n 3+3c +c 3 n 3+3c +c 3 e nε n 3+3c +c 3 E R n fπdf D n ER n fπdf = e nε n 3+3c +c 3 ΠD n D n for all n. Hence by the first Borel-Cantelli Lemma we get that D n R n fπdf e nε n 3+3c +c 3 almost surely for all n large enough. On the other hand, Lemma 1 and the first Borel- Cantelli Lemma yield that F R nfπdf ΠW εn e nε n 3+c e nε n 3+c +c 3 almost surely for all n. Therefore, we obtain that with probability one, Π n D n = D n R n fπdf F R nfπdf e nε n c, which tends to zero as n and the proof of Lemma is complete. 15

16 Proof of Theorem 1. It isclear that if condition1 holds for some α = α 0 then it also holds 3α+αc for any α α 0. So we may assume that 0 < α < 1. Given r > + +αc 3 +c 1 1 α, we have Π n A rεn Π n Gn A rεn +Πn Aεn \G n. Itthen followsfrom Lemma that Π n Aεn \G n 0 almostsurelyasn. Soitsuffices toprovethatπ n Gn A rεn 0almostsurely asn. BythedefinitionofJεn,G n,α, for each fixed n there exist functions f 1,f,...,f N in L µ such that G n A rεn N B j, where B j = G n A rεn {f : Hf j, f < ε n and N ΠB j α e Jε n,g n,α. It is no restriction to assume that all the sets B j are disjoint and nonempty. Taking a fj B j we get that Hf j,f 0 Hfj,f 0 Hfj,f j r 1ε n. Now for each B j we have B j R n fπdf = ΠB j n 1 B j R k+1 fπdf B j R k fπdf = ΠB j n 1 f kbj X k+1 f 0 X k+1, wheref kbj x = B j fxr k fπdf / B j R k fπdfandr 0 f = 1. Thefunction f kbj was introduced by Walker 004 and can be considered as the predictive density of f with a normalized posterior distribution, restricted on the set B j. Clearly, Jensen s inequality yields that Hf kbj,f j ε n for each k. Hence Hf kbj,f 0 Hf j,f 0 Hf j,f kbj r ε n > 0. Since e nε n c <, it turns out from Lemma 1 and the first Borel- Cantelli Lemma that F R nfπdf e nε n 3+c Π W εn almost surely for all n large enough. Hence, by condition 3 we obtain that Π n Gn A rεn Πn G n A rεn α N Π n B j α N Π n B j α = N ΠB j α n 1 f kbj X k+1 α f 0 X k+1 α F R nfπdf α N ΠB j α n 1 f kbj X k+1 α f 0 X k+1 α α ΠW εn e nε n 3+c e nε n 3+c +c 3 α N n 1 ΠB j α almost surely for all n large enough. Since r > + f kbj X k+1 α f 0 X k+1 α 3α+αc +αc 3 +c 1, the inequality 1 α 3+c +c 3 α < 1 r 1 α c 1 holds. Take a constant b with 3+c +c 3 α < b < 1 r 1 α c 1. Denote F k = σ{x 1,X,...,X k. Then we have F 0 { N n 1 ΠB j α f kbj X k+1 α f 0 X k+1 α e nε n b 16

17 where E n 1 N e nε n b E n 1 ΠB j α N = e nε n b ΠB j α E n 1 n 1 e Jε n,g n,α+nε n b max E 1 j N = E f kbj X k+1 α n 1 f 0 X k+1 α = E E n f kbj X k+1 α f 0 X k+1 α f kbj X k+1 α f 0 X k+1 α f kbj X k+1 α f 0 X k+1 E fn 1Bj X n α α f 0 X n α f kbj X k+1 α f 0 X k+1 α, f kbj X k+1 α f 0 X k+1 α F n 1 F n 1. By the conditional Hölder s inequality we get that with probability one, fn 1Bj X n α E f 0 X n α F n 1 fn 1Bj X n α = E f 0 X n α f n 1Bj X n α f 0 X n α F n 1 fn 1Bj X n α α E f 0 X n α α F n 1 α fn 1Bj X n α α = E f 0 X n α α fn 1Bj X n α α E f 0 X n α α F n 1 α. F n 1 α Take the integer m with α 1 α m < α. Repeating the above procedure m 1 more 1 α times we obtain that with probability one, fn 1Bj X n α E f 0 X n α F n 1 fn 1Bj X n E f 0 X n α m 1 α+α α m 1 α+α F n 1 m 1 α+α m, which by the conditional Hölder s inequality is less than fn 1Bj X n 1 E f 0 X n 1 F n 1 = 1 Hf n 1B j,f 0 α m 1 α = f n 1Bj X n f 0 X n µdx n m 1 α m 1 1 r ε n e m r αε n e 1 r α 1ε n. 17 α m 1

18 Hence, with probability one, we have n 1 E f kbj X k+1 α f 0 X k+1 α n e 1 r α 1ε n E f kbj X k+1 α f 0 X k+1 α. Repeating the same argument n 1 times, we obtain that for each j, n 1 E f kbj X k+1 α f 0 X k+1 α Therefore, we have gotten that for all n, F 0 { N e Jε n,g n,α+nε n n 1 ΠB j α e 1 r α 1nε n. f kbj X k+1 α f 0 X k+1 α b e nε n b+ 1 r α 1 e Jε n,g n,α nε n c 1. Thus, together with condition 1, the first Borel-Cantelli Lemma yields that N n 1 ΠB j α f kbj X k+1 α f 0 X k+1 α e nε n b almost surely for all n large enough. Hence we have Π n Gn A rεn e nε n 3α+αc +αc 3 b, which tends to zero as n, since 3+c +c 3 α < b and nε n proof of Theorem 1 is complete. as n. The To prove Theorem, we need a replacement of Lemma under weaker conditions. Lemma 3. Let c 0 and let {ε n be a positive sequence such that ΠB ε n e nε n c for all n. If a sequence {D n of subsets in F satisfies e nε n +c ΠD n 0 as n, then Π n D n 0 in probability as n. Proof. From Lemma 1 of Shen et al. 001 or Lemma 8.1 of Ghosal et al. 000 it turns out that we have, with probability tending to 1, D Π n D n n R n fπdf e nε n +c R n fπdf. ΠB ε n e nε n D n Hence for any given δ > 0 we have that F 0 { Πn D n δ F 0 { e nε n +c 18 R n fπdf δ +o1 D n

19 1 δ enε n +c E R n fπdf+o1 D n = 1 δ enε n +c ΠD n +o1 0 as n, which concludes the proof of Lemma 3. Proof of Theorem. Assumethat0 < α < 1. TheproofofTheoremfollowsfromthesame lines as the proof of Theorem 1. By Lemma 3 it suffices to prove that Π n Gn A rεn 0 in probability as n. For any given δ > 0, following the proof of Theorem 1 we get F 0 { Πn Gn A rεn δ F 0 { e nε n +c α N n 1 ΠB j α f kbj X k+1 α f 0 X k+1 α δ +o1 1 N δ enε n +c α ΠB j α E n 1 f kbj X k+1 α f 0 X k+1 α +o1 n 1 δ ejε n,g n,α+nε n +cα max E 1 j N δ ejε n,g n,α+nε n +c α+ 1 r α 1 f kbj X k+1 α f 0 X k+1 α +o1 +o1 δ ejε n,g n,α nε n c +o1 0 as n, α+αc where the last inequality follows from r > + +c 1. The proof of Theorem is 1 α complete. Proof of Theorem 3. Since Π n A rn ε n Π n Gn A rn ε n +Πn Aεn \G n, it suffices that the terms on the right hand side both tend to zero in probability. Given δ > 0, the proof of Lemma 3 implies that F 0 { Πn Aεn \G n δ e nε n ΠAεn \G n δ ΠB ε n +o1, which by condition 1 tends to zero as n. Assume that [r n ] stands for the largest integer less than or equal to r n and assume that D j = { f G n : jε n Hf 0,f < jε n Indeed, D j is an empty set for j > /ε n since the Hellinger distance cannot exceed. Then we have F 0 { Πn Gn A rn ε n δ F 0 { j=[r n ] Π n D j δ 19

20 F 0 { j=[r n ] Π n D j α δ 1 δ E j=[r n ] Π n D j α 1 δ j=[r n ] EΠ n D j α. Take a partition N j i=1 D ji for each D j such that D ji {f : Hf ji, f < jε n 3 for some f ji in L µ and N j ΠD ji α exp J jε n 3,D j,α e c 1j nε n ΠBε n α, i=1 where the last inequality follows from condition. Using the same argument as the proof of Theorem 1, one can get Hf kdji,f 0 jε n /3 and hence we have with probability tending to 1 F 0 { Πn Gn A rn ε n δ 1 δ j=[r n ] E N j i=1 α Π n D ji δ 1 δ 3 δnε n j=[r n ] N j j=[r n ] i=1 j=[r n ] EΠ n D ji α eαnε n δπb ε n α e nε n α+c 1j j α 1 δ j=[r n ] N j j=[r n ] i=1 ΠD ji α e 1 18 j nε n α 1 1 nε n α+c1 j j α 1 1 c 1 j j 1 α 3 δnε n c α = 3 δnεn c α [r n ] 1 j=[r n ] 1 j 1 1 j which tends to zero as n, since nε n are uniformly bounded away from zero and r n, where the second inequality follows from α 0,1, the third from the proof of Theorem 1, the fifth from the elementary inequality e x < 1 x for x > 0 and some of the inequalities only hold for all large n. Thus, we have proved that Π n Gn A rn ε n converges to zero in probability. The proof of Theorem 3 is complete. Proofof Theorem 4. Fromnε n c 0 lognandc 0 c > 1itturnsoutthate nε n c 1/n c 0 c and hence, by the first Borel-Cantelli Lemma and Lemma 1, we obtain that F R nfπdf e nε n 3+c Π W εn almost surely for all large n. Then, following the proofs of Lemma 3 and Theorem 3, one can get that for any δ > 0 and r > 1, F 0 { { Πn Arεn δ F δ { 0 Πn Aεn \G n +F δ 0 Πn Gn A rεn 0

21 enε n 3+c ΠA εn \G n + 4 δπw εn δ j=[r] e nε n 3+c α+c 1 j j α 1 := a n +b n. Condition 1 yields that a n <. On the other hand, since c 1 < 1 α 18, we have that for all n and for all r so large that 3+c α+c 1 + α 1 18 r 1 c 0, 4nc 0 b n 4enε n 3+c α δ 3+c α+c 1 + α 1 18 [r] j=[r] e nε δ 4nc0 1 n c 0c 1 + α 1 18 n c 1+ α 1 18 j = 4enε n 3+c α+c 1 + α 1 18 [r] δ 1 e nε n c 1+ α c α+c 1 + α 1 18 r 1 δ 1 c 0c 1 + α 1 4n 18 δ 1 c 0c 1 + α and hence b n <. Thus, by the first Borel-Cantelli Lemma we obtain that Π n Arεn δ almost surely for all large n, which concludes the proof of Theorem 4. REFERENCES BARRON, A., SCHERVISH, M. and WASSERMAN, L The consistency of posterior distributions in nonparametric problems. Ann. Statist. 7, GHOSAL, S Convergence rates for density estimation with Bernstein polynomials. Ann. Statist. 9, GHOSAL, S., GHOSH, J. K. and RAMAMOORTHI, R. V Non-informative priors via sieves and packing numbers. In Advances in Statistical Decision Theory and Applications S. Panchapakeshan and N.Balakrishnan eds Birkhäuser, Boston. GHOSAL, S., GHOSH, J. K. and VAN DERVAART, A. W Convergence rates of posterior distributions. Ann. Statist. 8, GHOSAL,S.andVAN DERVAART,A.W.001. Entropiesand ratesof convergencefor maximum likelihood and Bayes estimation for mixtures of normal densities. Ann. Statist. 9, GHOSAL, S. and VAN DER VAART, A. W. 007a. Convergence rates of posterior distributions for noniid observations. Ann. Statist. 35, GHOSAL, S. and VAN DER VAART, A. W. 007b. Posterior convergence rates of Dirichlet mixtures at smooth densities. Ann. Statist. 35, KOLMOGOROV, A. N. and TIHOMIROV, V. M ε-entropy and ε-capacity of sets in function spaces. Uspekhi Mat. Nauk 14, 3-86 [in Russian; English transl. Amer. Math. Soc. Transl. Ser., 17, ]. LIJOI, A., PRUNSTER, I. and WALKER, S On convergence rates for nonparametric posterior distributions. Aust. N. Z. J. Stat. 493, PETRONE, S Random Bernstein polynomials. Scand. J. Statist. 6, PETRONE, S. and WASSERMAN, L. 00. Consistency of Bernstein polynomial posteriors. J. R. Stat. Soc. Ser. B Stat. Methodol. 64, SCHWARTZ, L On Bayes procedures Z. Wahr. verw. Geb. 4,

22 SCRICCIOLO, C Convergence rates for Bayesian density estimation of infinite-dimensional exponential families. Ann. Statist. 34, SHEN, X. and WASSERMAN, L Rates of convergence of posterior distributions. Ann. Statist. 9, STONE, C. J Lerge-sample inference for log-spline models. Ann. Statist. 18, WALKER, S New approaches to Bayesian consistency. Ann. Statist. 3, WALKER, S., LIJOI, A. and PRUNSTER, I On rates of convergence for posterior distributions in infinite-dimensional models. Ann. Statist. 35, XING, Y. and RANNEBY, B Sufficient conditions for Bayesian consistency. Preprint. Yang Xing Centre of Biostochastics Swedish University of Agricultural Sciences SE , Umeå Sweden address:

BERNSTEIN POLYNOMIAL DENSITY ESTIMATION 1265 bounded away from zero and infinity. Gibbs sampling techniques to compute the posterior mean and other po

BERNSTEIN POLYNOMIAL DENSITY ESTIMATION 1265 bounded away from zero and infinity. Gibbs sampling techniques to compute the posterior mean and other po The Annals of Statistics 2001, Vol. 29, No. 5, 1264 1280 CONVERGENCE RATES FOR DENSITY ESTIMATION WITH BERNSTEIN POLYNOMIALS By Subhashis Ghosal University of Minnesota Mixture models for density estimation