ASYMPTOTIC PROPERTIES OF PREDICTIVE RECURSION: ROBUSTNESS AND RATE OF CONVERGENCE


By Ryan Martin and Surya T. Tokdar
Indiana University-Purdue University Indianapolis and Duke University

Abstract. Here we explore general asymptotic properties of Predictive Recursion (PR) for nonparametric estimation of mixing distributions. Under fairly mild conditions, we prove that, when the mixture model is mis-specified, the estimated mixture converges almost surely in total variation to the mixture that minimizes the Kullback-Leibler divergence, and a bound on the (Hellinger contrast) rate of convergence is obtained. Moreover, when the model is identifiable, almost sure weak convergence of the mixing distribution estimate follows. PR assumes that the support of the mixing distribution is known. To remove this requirement, we propose a generalization that incorporates a sequence of supports, increasing with the sample size, that combines the efficiency of PR with the flexibility of mixture sieves. Under mild conditions, we obtain a bound on the rate of convergence of these new estimates.

AMS 2000 subject classifications: Primary 62G20; secondary 62G05, 62G07, 62G35.
Keywords and phrases: density estimation, Hellinger contrast, Kullback-Leibler projection, mixture models, recursive estimation.

1. Introduction. Despite a well-developed theory and numerous applications of mixture models, estimation of a mixing distribution remains a challenging statistical problem. However, some recent progress has been made through a computationally efficient nonparametric estimate due to Newton [12]; see also Newton, et al. [13] and Newton and Zhang [14]. A mixture model views data $(X_1, \ldots, X_n) \in \mathcal{X}^n$ as independent observations from a density $m(x)$ of the form

(1.1)  $m_f(x) = \int_\Theta p(x \mid \theta)\, f(\theta)\, d\mu(\theta), \quad f \in \mathcal{F},$

where $\mathcal{F} = \mathcal{F}(\Theta, \mu)$ is the class of densities dominated by a $\sigma$-finite measure $\mu$ on a parameter space $\Theta$, and $\mathcal{P} = \{p(\cdot \mid \theta) : \theta \in \Theta\}$ is a parametric family of densities on $\mathcal{X}$, dominated by a $\sigma$-finite measure $\nu$. More succinctly, a mixture model assumes $m \in \mathcal{M}_\Theta := \{m_f : f \in \mathcal{F}\}$, the convex hull of $\mathcal{P}$. To estimate $f$ from the data $X_1, \ldots, X_n$, Newton [12] proposed the following n-step recursive algorithm, called Predictive Recursion (PR):

Algorithm PR. Choose an initial density $f_0$ on $\Theta$ and a sequence of weights $\{w_i : i \ge 1\} \subset (0, 1)$. For $i = 1, \ldots, n$, compute

(1.2)  $f_i(\theta) = (1 - w_i)\, f_{i-1}(\theta) + w_i\, \dfrac{p(X_i \mid \theta)\, f_{i-1}(\theta)}{\int_\Theta p(X_i \mid \theta')\, f_{i-1}(\theta')\, d\mu(\theta')}, \quad \theta \in \Theta,$

and produce $f_n$ as the final estimate of $f$.

Until recently, very little was known about the large-sample behavior of the estimates $f_n$ and $m_n$. Ghosh and Tokdar [5] used a novel martingale argument to prove, when $m = m_f \in \mathcal{M}_\Theta$ for finite $\Theta$, that $f_n \to f$ almost surely. Martin and Ghosh [11] proved a slightly stronger consistency theorem via a stochastic approximation representation of $f_n$. Most recently, Tokdar, Martin and Ghosh [18] (henceforth, TMG) handled the case of a more general parameter space $\Theta$ by extending the martingale argument to the $\mathcal{X}$-space, proving that the mixture density estimate

(1.3)  $m_n(x) := m_{f_n}(x) = \int_\Theta p(x \mid \theta)\, f_n(\theta)\, d\mu(\theta), \quad x \in \mathcal{X},$

converges almost surely to $m_f(x)$ in the $L_1$ topology. From this, consistency of $f_n$ in the weak topology on $\Theta$ is obtained.

In this paper, we further extend the martingale-based arguments of TMG, establishing two important new results. First, in the more general context, where the mixture model $\mathcal{M}_\Theta$ need not contain the true density $m$, we show that the estimated mixture $m_n$ in (1.3) is asymptotically robust in the sense that it converges almost surely in the total variation (or Hellinger) topology to the mixture $m_{f^*} \in \mathcal{M}_\Theta$ that minimizes the Kullback-Leibler (KL) divergence $K(m, m_\varphi) = \int \log(m/m_\varphi)\, m\, d\nu$. When the mixing density is identifiable, we also obtain weak convergence of $f_n$.

Our second result is a bound on the rate of convergence for $m_n$. For a unified treatment of the well- and mis-specified cases, we consider the Hellinger contrast

(1.4)  $\rho(m_n, m_{f^*}) = \Bigl\{ \tfrac{1}{2} \int \bigl( \sqrt{m_n / m_{f^*}} - 1 \bigr)^2 m\, d\nu \Bigr\}^{1/2},$

where $m_{f^*}$ is the mixture that minimizes $K(m, m_\varphi)$. The Hellinger contrast has been used previously to study asymptotics of maximum likelihood and Bayes estimates under model mis-specification; see, for example, Patilea [15] and Kleijn and van der Vaart [6]. In Section 3, we show that if $\Theta$ is compact, then $\rho(m_n, m_{f^*}) \to 0$ almost surely at a rate faster than $a_n^{-1/2}$, where $a_n = \sum_{i=1}^n w_i$. This establishes a direct connection between the choice of the weights $w_i$ and the performance of the resulting PR estimate.
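Before turning to the theory, it may help to see the recursion concretely. The following is a minimal numerical sketch of Algorithm PR, not the authors' implementation, under illustrative assumptions of our own: Gaussian location mixands $p(x \mid \theta) = \mathrm{N}(x; \theta, 1)$, a compact $\Theta$ discretized on an equally spaced grid (so the integral in (1.2) becomes a Riemann sum), and weights $w_i \propto i^{-\gamma}$ with $\gamma$ in the range used in Section 3. All function and variable names here are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def predictive_recursion(x, theta_grid, f0=None, gamma=0.75):
    """Minimal sketch of Algorithm PR, eq. (1.2), for Gaussian location mixands
    p(x | theta) = N(x; theta, 1), with the mixing density represented on an
    equally spaced grid over a compact Theta (so d(mu) is a Riemann sum)."""
    dtheta = theta_grid[1] - theta_grid[0]
    if f0 is None:
        f = np.full(theta_grid.shape, 1.0 / (theta_grid[-1] - theta_grid[0]))  # uniform f_0
    else:
        f = f0.astype(float).copy()
    for i, xi in enumerate(x, start=1):
        w = (i + 1.0) ** (-gamma)                      # weights w_i ~ i^{-gamma}, gamma in (2/3, 1]
        lik = norm.pdf(xi, loc=theta_grid, scale=1.0)  # p(X_i | theta) on the grid
        denom = np.sum(lik * f) * dtheta               # denominator of (1.2): m_{i-1}(X_i)
        f = (1.0 - w) * f + w * lik * f / denom        # PR update (1.2)
    return f

def mixture_density(xs, f, theta_grid):
    """Plug-in mixture estimate m_n(x) of eq. (1.3)."""
    dtheta = theta_grid[1] - theta_grid[0]
    return np.array([np.sum(norm.pdf(x, loc=theta_grid, scale=1.0) * f) * dtheta for x in xs])

# Toy usage: data from a two-component normal mixture, Theta = [-5, 5].
rng = np.random.default_rng(0)
data = np.where(rng.random(500) < 0.3, rng.normal(-2.0, 1.0, 500), rng.normal(2.0, 1.0, 500))
grid = np.linspace(-5.0, 5.0, 201)
f_n = predictive_recursion(data, grid)
m_n = mixture_density(np.linspace(-6.0, 6.0, 61), f_n, grid)
```

Each observation is processed exactly once, so the cost is linear in $n$; this computational efficiency is the main practical appeal of PR. With this picture in mind, we return to the convergence rate just described.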

Moreover, this bound is derived without using any structural knowledge about the mixands $p(x \mid \theta)$, so it applies to a wide range of such densities, including the Normal, Gamma and Poisson. The conditions on $w_i$ required by this result are satisfied by $w_i \asymp i^{-\gamma}$ for $\gamma \in (2/3, 1]$, leading to $a_n \asymp n^{1-\gamma}$. Consequently, the Hellinger contrast convergence rate of $m_n$ is faster than $n^{-(1-\gamma)/2}$. How this relates to the rate of convergence for $f_n$ remains an open question.

Our nearly $n^{-1/6}$ bound on the convergence rate for $m_n$ matches the results in Genovese and Wasserman [3], derived for the special case of finite Gaussian mixtures, but it falls short of the nearly parametric rates obtained in Li and Barron [8] and Ghosal and van der Vaart [4]. However, Li and Barron's bound involves constants which can be infinite, and Ghosal and van der Vaart's proof seems to rely heavily on the smoothness of the Gaussian mixands. It is quite possible that the PR estimates converge faster than our bounds suggest, but at this moment we are unsure how sharp our bounds are.

A shortcoming of PR is that one needs to specify the compact set $\Theta$ a priori. To remove this requirement, we propose, in Section 4, a generalized PR (GPR) with an increasing sequence of supports $\Theta_i \subseteq \Theta_{i+1}$. We obtain a bound on the rate of convergence for GPR in the special case where $m = m_f \in \mathcal{M}_\Theta$ and $f$ has an unknown compact support.

2. KL projection of m onto $\mathcal{M}_\Theta$. It is quite natural to use the KL divergence to study the large-sample properties of PR. Indeed, Martin and Ghosh [11] remark that, for $\Theta$ finite, $\ell(\varphi) = K(m, m_\varphi)$ is a Lyapunov function for the differential equation that characterizes the asymptotics of PR: roughly, the KL divergence controls the dynamics of PR, driving the estimates toward a stable equilibrium. But when $m \notin \mathcal{M}_\Theta$, this equilibrium cannot be $m$. In such cases, we consider the mixture $m_{f^*}$ that is closest to $m$ in a KL sense; that is,

(2.1)  $K(m, m_{f^*}) = K(m, \mathcal{M}_\Theta) := \inf\{K(m, m_\varphi) : \varphi \in \mathcal{F}\}.$

We call $m_{f^*}$ the KL projection of $m$ onto $\mathcal{M}_\Theta$. Leroux [7], Shyamalkumar [17], Patilea [15], and Kleijn and van der Vaart [6] each use similar ideas. Existence of the KL projection is an important issue, and various results are available; see, for example, Liese and Vajda [9, Chap. 8]. Here we prove a simple result which gives sufficient conditions for the existence of a KL projection in our special case of mixtures. Assume the following:

A1. $\mathcal{F}$ is compact with respect to the weak topology.
A2. $\theta \mapsto p(x \mid \theta)$ is bounded and continuous for $\nu$-almost all $x$.

Lemma 2.1. Under A1-A2, there exists a mixing density $f^* \in \mathcal{F}$ such that $K(m, m_{f^*}) = K(m, \mathcal{M}_\Theta)$.

Proof. Choose any $\varphi \in \mathcal{F}$ and any $\{\varphi_s\} \subset \mathcal{F}$ such that $\varphi_s \to \varphi$ weakly. Then A2 and Scheffé's theorem imply $m_{\varphi_s} \to m_\varphi$ in the $L_1(\nu)$ and, hence, the weak topology. Further,

$\kappa(\varphi) := K(m, m_\varphi) \le \liminf_s K(m, m_{\varphi_s}) = \liminf_s \kappa(\varphi_s),$

where the inequality follows from weak lower semi-continuity of $K(m, \cdot)$; see Liese and Vajda [9]. Consequently, $\kappa(\cdot)$ is weakly lower semi-continuous and, therefore, must attain its infimum on the compact $\mathcal{F}$.

A similar proof may be given based on Lemma 4 of Brown, et al. [2]. Convexity of $\mathcal{M}_\Theta$ and of $K(m, \cdot)$ together imply that the KL projection $m_{f^*}$ in Lemma 2.1 is unique. However, in general, there could be many mixing densities $f^* \in \mathcal{F}$ that generate the KL projection $m_{f^*}$. Identifiability is needed to guarantee uniqueness of the mixing density.

3. Robustness and rate of convergence. We begin with our assumptions and some preliminary lemmas; proofs can be found in the Appendix. Let $\{w_i : i \ge 1\}$ be the user-specified weight sequence in the PR algorithm, and define the sequence of partial sums $a_n = \sum_{i=1}^n w_i$ (with $a_0 \equiv 0$). In addition to A1-A2 in Section 2, assume the following:

A3. $w_n > 0$, $w_n \to 0$, $\sum_n w_n = \infty$, $\sum_n w_n^2 < \infty$, and $\sum_n a_{n-1} w_n^2 < \infty$.
A4. There exists $B < \infty$ such that

$\sup_{\theta_1, \theta_2 \in \Theta} \int \Bigl[ \frac{p(x \mid \theta_1)}{p(x \mid \theta_2)} \Bigr]^2 m(x)\, d\nu(x) < B.$

Condition A3 is satisfied if $w_n \asymp n^{-\gamma}$, for $\gamma \in (2/3, 1]$. The square-integrability condition A4 is the strongest assumption, but it does hold for many common mixands, such as the Normal or Poisson, provided that $\Theta$ is compact and $m$ admits a moment-generating function on $\Theta$. If one is willing to assume that $m \in \mathcal{M}_\Theta$, then A4 can be replaced by a less restrictive condition, depending only on $p(x \mid \theta)$; cf. assumption A5 in TMG.

Our new results are partially based on calculations in TMG, and we begin by recording a few of these for future reference. First, let $R(x)$ be the remainder term of a first-order Taylor approximation of $\log(1+x)$ at $x = 0$; that is, $\log(1 + x) = x - x^2 R(x)$, $x > -1$, where $R(x)$ satisfies

(3.1)  $0 \le R(x) \le \max\{1, (1 + x)^{-2}\}, \quad x > -1.$
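As a side note, the claim above that polynomial weights satisfy A3 can be checked directly; the following worked verification is ours, not part of the paper's argument.

```latex
% Verification (ours) that A3 holds for w_n = n^{-\gamma}, 2/3 < \gamma \le 1:
\[
  \sum_n w_n = \sum_n n^{-\gamma} = \infty, \qquad
  \sum_n w_n^2 = \sum_n n^{-2\gamma} < \infty \quad (\text{since } 2\gamma > 4/3 > 1),
\]
\[
  a_n = \sum_{i=1}^n i^{-\gamma} \asymp
  \begin{cases} n^{1-\gamma}/(1-\gamma), & 2/3 < \gamma < 1, \\ \log n, & \gamma = 1, \end{cases}
  \qquad
  a_{n-1}\, w_n^2 \asymp n^{1-3\gamma} \ (\text{times a } \log n \text{ factor when } \gamma = 1),
\]
\[
  \sum_n n^{1-3\gamma} < \infty \iff 3\gamma - 1 > 1 \iff \gamma > 2/3,
  \qquad \text{and } \sum_n n^{-2}\log n < \infty \text{ when } \gamma = 1.
\]
```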

Next, write the PR estimate $m_n \in \mathcal{M}_\Theta$ in (1.3) as

$m_n(x) = (1 - w_n)\, m_{n-1}(x) + w_n\, h_{n, X_n}(x),$

where

$h_{n,y}(x) = \int_\Theta \frac{p(x \mid \theta)\, p(y \mid \theta)\, f_{n-1}(\theta)}{m_{n-1}(y)}\, d\mu(\theta), \quad x, y \in \mathcal{X}.$

For notational convenience, also define the function

$H_{n,y}(x) = \frac{h_{n,y}(x)}{m_{n-1}(x)} - 1, \quad x, y \in \mathcal{X}.$

Then the KL divergence $K_n = K(m, m_n)$ satisfies

$K_n - K_{n-1} = \int m \log(m_{n-1}/m_n)\, d\nu = -\int m \log(1 + w_n H_{n,X_n})\, d\nu = -w_n \int m H_{n,X_n}\, d\nu + w_n^2 \int m H_{n,X_n}^2\, R(w_n H_{n,X_n})\, d\nu,$

where $R(x)$ satisfies (3.1). Let $\mathscr{A}_{n-1}$ be the $\sigma$-algebra generated by the data sequence $X_1, \ldots, X_{n-1}$. Since $K_{n-1}$ is $\mathscr{A}_{n-1}$-measurable, upon taking conditional expectation with respect to $\mathscr{A}_{n-1}$ we get

(3.2)  $E(K_n \mid \mathscr{A}_{n-1}) - K_{n-1} = -w_n T(f_{n-1}) + w_n^2\, E(Z_n \mid \mathscr{A}_{n-1}),$

where $T(\cdot)$ and $Z_n$ are defined as

(3.3)  $T(\varphi) = \int_\Theta \Bigl\{ \int_{\mathcal{X}} \frac{m(x)\, p(x \mid \theta)}{m_\varphi(x)}\, d\nu(x) - 1 \Bigr\}^2 \varphi(\theta)\, d\mu(\theta), \quad \varphi \in \mathcal{F},$

(3.4)  $Z_n = \int m H_{n,X_n}^2\, R(w_n H_{n,X_n})\, d\nu.$

Note that $T(f_{n-1})$ is exactly $M_n$ defined in TMG. The following properties of $T(\cdot)$ will be critical in the proof of our main result.

Lemma 3.1. (a) $T(\varphi) \ge 0$, with equality iff $K(m, m_\varphi) = K(m, \mathcal{M}_\Theta)$. (b) Under A1, A2 and A4, $T$ is continuous in the weak topology on $\mathcal{F}$.
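For readers following along numerically, the functional $T(\varphi)$ in (3.3) is simple to approximate. The sketch below is ours, not the authors', and reuses the Gaussian-mixand and grid conventions of the earlier PR sketch, with a sample from $m$ (for instance, the observed data) standing in for the $d\nu$ integral; it can serve as a rough diagnostic for how close an iterate is to the KL projection, in the spirit of Lemma 3.1(a).

```python
import numpy as np
from scipy.stats import norm

def T_statistic(phi, theta_grid, x_sample):
    """Grid / Monte Carlo approximation of T(phi) in (3.3), assuming the same
    Gaussian location mixands and grid representation as in the PR sketch above.
    x_sample stands in for draws from the true density m (e.g., the observed data)."""
    dtheta = theta_grid[1] - theta_grid[0]
    lik = norm.pdf(x_sample[:, None], loc=theta_grid[None, :], scale=1.0)  # p(x_j | theta_k)
    m_phi = lik @ phi * dtheta                                             # m_phi(x_j)
    # D(theta; phi) = E_m[ p(X | theta) / m_phi(X) ] - 1, estimated by a sample mean over x_j
    D = (lik / m_phi[:, None]).mean(axis=0) - 1.0
    return np.sum(D ** 2 * phi) * dtheta                                   # T(phi) = integral of D^2 phi d(mu)
```

Evaluated along the PR iterates, for example `T_statistic(f_n, grid, data)`, this quantity should drift toward zero as the recursion approaches the KL projection.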

For some further insight into the relationship between $T(\varphi)$ and the KL divergence $K(m, m_\varphi)$, define

$D(\theta; \varphi) = \int \frac{m(x)\, p(x \mid \theta)}{m_\varphi(x)}\, d\nu(x) - 1$

and notice that $T(\varphi) = \int D^2(\theta; \varphi)\, \varphi(\theta)\, d\mu(\theta)$. Some analysis shows that $D(\theta; \varphi)$ is the Gâteaux derivative of $K(m, \eta)$ at $\eta = m_\varphi$ in the direction of $p(\cdot \mid \theta)$. Now, if $T(\varphi) = 0$, then $D(\theta; \varphi) = 0$ for $\mu$-almost all $\theta$ and, hence, $D(\psi; \varphi) = \int D(\theta; \varphi)\, \psi(\theta)\, d\mu(\theta)$, the Gâteaux derivative of $K(m, \eta)$ at $\eta = m_\varphi$ in the direction of $m_\psi$, is zero for all $\psi \in \mathcal{F}$. The fact that the Gâteaux derivative vanishes in all directions suggests that $m_\varphi$ is a point at which the infimum $K(m, \mathcal{M}_\Theta)$ is attained, and this is exactly the conclusion of Lemma 3.1(a). A similar characterization of the NPMLE is given in Lindsay [10].

According to (3.2) and Lemma 3.1(a), $K_n$ would be a non-negative supermartingale were it not for the term involving $Z_n$. Fortunately, since $Z_n$ is almost surely bounded, its presence causes no major difficulties.

Lemma 3.2. Under A4, $Z_n \le 1 + B$ a.s. for all $n$.

Our last preliminary result of this section gets us closer to a convergence rate for $K(m, m_n)$. Define $K_n^* = K(m, m_n) - K(m, \mathcal{M}_\Theta) \ge 0$.

Lemma 3.3. Under A3-A4, $\sum_n w_n K_n^* < \infty$ a.s.

We are ready to state and prove our main result. The convergence properties advertised in Section 1 are corollaries to Theorem 3.4, which follows.

Theorem 3.4. Under A1-A4, $a_n K_n^* \to 0$ a.s.

Proof. Define two sequences of random variables,

$\delta_n = \sum_{i=n+1}^\infty a_{i-1} w_i^2\, E(Z_i \mid \mathscr{A}_n) \quad \text{and} \quad \varepsilon_n = \sum_{i=n+1}^\infty w_i\, E(K_i^* \mid \mathscr{A}_n),$

and define $\tilde{K}_n = a_n K_n^* + \delta_n + \varepsilon_n \ge 0$. Using (3.2) we find that

$E(\tilde{K}_n \mid \mathscr{A}_{n-1}) - \tilde{K}_{n-1} = E(a_n K_n^* + \delta_n + \varepsilon_n \mid \mathscr{A}_{n-1}) - a_{n-1} K_{n-1}^* - \delta_{n-1} - \varepsilon_{n-1}$
$\quad = -w_n K(m, \mathcal{M}_\Theta) + a_n E(K_n \mid \mathscr{A}_{n-1}) + \sum_{i=n+1}^\infty a_{i-1} w_i^2\, E\bigl[E(Z_i \mid \mathscr{A}_n) \mid \mathscr{A}_{n-1}\bigr] + \sum_{i=n+1}^\infty w_i\, E\bigl[E(K_i^* \mid \mathscr{A}_n) \mid \mathscr{A}_{n-1}\bigr]$
$\qquad - a_{n-1} K_{n-1} - \sum_{i=n}^\infty a_{i-1} w_i^2\, E(Z_i \mid \mathscr{A}_{n-1}) - \sum_{i=n}^\infty w_i\, E(K_i^* \mid \mathscr{A}_{n-1}).$

After some simplification, using the fact that $\mathscr{A}_{n-1} \subset \mathscr{A}_n$, we have

$E(\tilde{K}_n \mid \mathscr{A}_{n-1}) - \tilde{K}_{n-1} \le a_n E(K_n \mid \mathscr{A}_{n-1}) - a_{n-1} K_{n-1} - a_{n-1} w_n^2\, E(Z_n \mid \mathscr{A}_{n-1}) - w_n E(K_n \mid \mathscr{A}_{n-1})$
$\quad = a_{n-1} \bigl[ E(K_n \mid \mathscr{A}_{n-1}) - K_{n-1} - w_n^2\, E(Z_n \mid \mathscr{A}_{n-1}) \bigr]$
$\quad = -a_{n-1} w_n T(f_{n-1}) \le 0.$

Thus, $\tilde{K}_n$ forms a non-negative supermartingale so, by the martingale convergence theorem, there is a limit $\tilde{K}_\infty \ge 0$ with finite expectation. But assumption A3, Lemma 3.2, and the dominated convergence theorem ensure that $\delta_n$ and $\varepsilon_n$ both converge a.s. to zero; therefore, $a_n K_n^* \to \tilde{K}_\infty$ a.s.

It remains to show that $\tilde{K}_\infty = 0$ a.s. Suppose, on the other hand, that $\tilde{K}_\infty > 0$ with positive probability. From the previous display we have

(3.5)  $\tilde{K}_{i-1} - E(\tilde{K}_i \mid \mathscr{A}_{i-1}) \ge a_{i-1} w_i T(f_{i-1}), \quad i \ge 2.$

Fix any two integers $N$ and $n$. Taking conditional expectation with respect to $\mathscr{A}_{N-1}$ in (3.5) and summing gives

(3.6)  $\sum_{i=N}^{N+n} a_{i-1} w_i\, E[T(f_{i-1}) \mid \mathscr{A}_{N-1}] \le \sum_{i=N}^{N+n} E\bigl[\tilde{K}_{i-1} - E(\tilde{K}_i \mid \mathscr{A}_{i-1}) \mid \mathscr{A}_{N-1}\bigr]$
$\quad = \sum_{i=N}^{N+n} \bigl\{ E(\tilde{K}_{i-1} \mid \mathscr{A}_{N-1}) - E\bigl[E(\tilde{K}_i \mid \mathscr{A}_{i-1}) \mid \mathscr{A}_{N-1}\bigr] \bigr\}$
$\quad = \sum_{i=N}^{N+n} E(\tilde{K}_{i-1} - \tilde{K}_i \mid \mathscr{A}_{N-1}) = \tilde{K}_{N-1} - E(\tilde{K}_{N+n} \mid \mathscr{A}_{N-1}) \le \tilde{K}_{N-1}.$

We are assuming $\tilde{K}_\infty > 0$, so there exists $\varepsilon > 0$ such that $K(m, m_n) > K(m, \mathcal{M}_\Theta) + \varepsilon$ for all but perhaps finitely many $n$. Recall the proof of Lemma 2.1, which shows that the mapping $\kappa(\varphi) = K(m, m_\varphi)$ is lower semi-continuous with respect to the weak topology on $\mathcal{F}$. Consequently,

$G_\varepsilon := \{\varphi \in \mathcal{F} : \kappa(\varphi) > K(m, \mathcal{M}_\Theta) + \varepsilon\} \subset \mathcal{F}$

is a weakly open set, and its closure $\bar{G}_\varepsilon$ is compact by A1. Since $f_n \in G_\varepsilon$ for all but finitely many $n$, Lemma 3.1 implies that $T(f_n)$ is bounded away from zero. This and A3 imply that the left-most sum in (3.6) goes to $\infty$ with positive probability as $n \to \infty$. But the upper bound $\tilde{K}_{N-1}$ is almost surely finite, a contradiction. Therefore, $\tilde{K}_\infty = 0$ almost surely.

Next we show that $K_n^* \to 0$ implies $\|m_n - m_{f^*}\| \to 0$, where $\|\cdot\|$ denotes the $L_1(\nu)$ norm.

Corollary 3.5. Under A1-A4, $m_n$ converges to $m_{f^*}$ in $L_1(\nu)$.

Proof. Suppose this is not true; that is, there is an $\varepsilon > 0$ and a subsequence $\{n_s\}$ such that $\|m_{n_s} - m_{f^*}\| > \varepsilon$ for all $s$. Then, by A1, this sequence has a further subsequence $\{n_{s(t)}\}$ such that $f_{n_{s(t)}}$ converges weakly to some $f_\infty$. By A2, $m_{n_{s(t)}}$ converges to $m_\infty := m_{f_\infty}$ pointwise, and in $L_1(\nu)$ by Scheffé's theorem; therefore, $m_\infty \ne m_{f^*}$. Define $u_t = m_{n_{s(t)}}/m_\infty - 1$. Then $u_t \to 0$ pointwise and $\{u_t\}$ is uniformly integrable with respect to $m\, d\nu$ by A4 and Jensen's inequality; indeed,

$\sup_t \int u_t^2\, m\, d\nu = \sup_t \int \Bigl( \frac{m_{n_{s(t)}}}{m_\infty} - 1 \Bigr)^2 m\, d\nu < 1 + B.$

Then Theorem 3.4, together with Vitali's theorem, implies that

$0 \le K(m, m_\infty) - K(m, m_{f^*}) = K(m, m_\infty) - \lim_t K(m, m_{n_{s(t)}}) = \lim_t \int m \log(m_{n_{s(t)}}/m_\infty)\, d\nu \le \lim_t \int u_t\, m\, d\nu = 0.$

Therefore, $K(m, m_\infty) = K(m, m_{f^*})$, and uniqueness of the KL projection implies $m_\infty = m_{f^*}$, which contradicts our supposition.

Theorem 3.4 and Corollary 3.5 seem to indicate that the estimates $f_n$ converge to some $f^* \in \mathcal{F}$ at which the infimum $K(m, \mathcal{M}_\Theta)$ is attained. However, to conclude weak convergence of $f_n$ from $L_1$ convergence of $m_n$, we need two additional conditions:

A5. Identifiability: $m_\varphi = m_\psi$ $\nu$-a.e. implies $\varphi = \psi$ $\mu$-a.e.
A6. For any $\varepsilon > 0$ and any compact $A \subset \mathcal{X}$, there exists a compact $B \subset \Theta$ such that $\int_A p(x \mid \theta)\, d\nu(x) < \varepsilon$ for all $\theta \notin B$.

With conditions A5-A6 and Theorem 3 of TMG, the next result follows immediately from Corollary 3.5.

Corollary 3.6. Under A1-A6, $f_n \to f^*$ a.s. in the weak topology, where $f^* \in \mathcal{F}$ is the unique mixing density that satisfies $K(m, m_{f^*}) = K(m, \mathcal{M}_\Theta)$. In particular, if $m \in \mathcal{M}_\Theta$, then $f_n$ is a consistent estimate of the true mixing density, in the weak topology.

In the mis-specified case, even though $K_n^* \to 0$ implies $\|m_n - m_{f^*}\| \to 0$, the $L_1(\nu)$ rate does not easily follow without extra assumptions, such as boundedness of $m_{f^*}/m$. But a Hellinger contrast rate is a direct consequence of Theorem 3.4. In the well-specified case, when $m_{f^*} = m$, the Hellinger contrast reduces to the usual Hellinger distance, so our convergence rate results are comparable to those of, say, Genovese and Wasserman [3].

Corollary 3.7. Choose $w_n \asymp n^{-\gamma}$, $\gamma \in (2/3, 1]$, and assume A1-A4.
(a) $\rho(m_n, m_{f^*}) = o(n^{-(1-\gamma)/2})$ a.s.
(b) If $m_{f^*}/m \in L_\infty(m\, d\nu)$, then $\|m_n - m_{f^*}\| = o(n^{-(1-\gamma)/2})$ a.s.

Proof. Let $\Gamma_n = \int (m_n/m_{f^*})\, m\, d\nu$. An important and well-known property of KL projections that will be used in the proof is $\Gamma_n \le 1$ (see, for example, Theorem 3 of Barron [1] or Lemma 2.1 of Patilea [15]).

Part (a) follows from Lemma 2.4 of Patilea [15]. Indeed,

$2\rho^2(m_n, m_{f^*}) = \int \Bigl( \sqrt{\frac{m_n}{m_{f^*}}} - 1 \Bigr)^2 m\, d\nu = (\Gamma_n - 1) + 2 \int \Bigl( 1 - \sqrt{\frac{m_n}{m_{f^*}}} \Bigr) m\, d\nu \le (\Gamma_n - 1) + \int \log\frac{m_{f^*}}{m_n}\, m\, d\nu = (\Gamma_n - 1) + K_n^* \le K_n^*,$

where the first inequality uses $1 - \sqrt{u} \le -\tfrac{1}{2}\log u$ and the second uses $\Gamma_n \le 1$; since $a_n K_n^* \to 0$ a.s. by Theorem 3.4 and $a_n \asymp n^{1-\gamma}$, part (a) follows.

For part (b), let $q_n = m\, m_n / (m_{f^*} \Gamma_n)$, and notice that $K_n^* \ge K(m, q_n)$. Then Pinsker's inequality gives

$\sqrt{2 K_n^*} \ge \|m - q_n\| = \Bigl\| 1 - \frac{m_n}{\Gamma_n\, m_{f^*}} \Bigr\|_{L_1(m\, d\nu)} = \frac{1}{\Gamma_n} \Bigl\| \Gamma_n - \frac{m_n}{m_{f^*}} \Bigr\|_{L_1(m\, d\nu)} \ge \Bigl\| \Gamma_n - \frac{m_n}{m_{f^*}} \Bigr\|_{L_1(m\, d\nu)}.$

The triangle inequality for $L_1(m\, d\nu)$ implies

$\Bigl\| \frac{m_n}{m_{f^*}} - 1 \Bigr\|_{L_1(m\, d\nu)} \le \Bigl\| \frac{m_n}{m_{f^*}} - \Gamma_n \Bigr\|_{L_1(m\, d\nu)} + (1 - \Gamma_n),$

and hence

$\Bigl\| \frac{m_n}{m_{f^*}} - 1 \Bigr\|_{L_1(m\, d\nu)} \le \sqrt{2 K_n^*} + (1 - \Gamma_n).$

The right-hand side is related to the $L_1(\nu)$ error via Hölder's inequality:

$\|m_n - m_{f^*}\| = \int |m_n - m_{f^*}|\, d\nu = \int \frac{m_{f^*}}{m} \Bigl| \frac{m_n}{m_{f^*}} - 1 \Bigr| m\, d\nu \le C\, \Bigl\| \frac{m_n}{m_{f^*}} - 1 \Bigr\|_{L_1(m\, d\nu)},$

where $C := \|m_{f^*}/m\|_{L_\infty(m\, d\nu)}$ is finite by assumption. Therefore,

$\|m_n - m_{f^*}\| \le C \bigl\{ \sqrt{2 K_n^*} + (1 - \Gamma_n) \bigr\},$

and part (b) follows from the fact that $(1 - \Gamma_n) \le K_n^*$.

4. A generalized PR algorithm. To satisfy the conditions of Theorem 3.4, one typically would need the mixing parameter space $\Theta$ to be compact. Also, in order to calculate the estimate in practice, this $\Theta$ must be known, since integration over $\Theta$ is required in (1.2). These requirements are similar to those in Ghosal and van der Vaart [4] and can be somewhat restrictive, particularly when there is no natural choice of $\Theta$. A potential solution is to use a mixture sieve, which allows the support of the estimated mixing distribution to grow with the sample size. A motivation for this dynamic choice of support is that, eventually, the support will be large enough so that the class of all mixtures over that support will be sufficiently rich. Borrowing on this idea, we propose a sieve-like extension of the PR algorithm which, instead of requiring $\Theta$ to be fixed and known, incorporates a sequence of compact mixing parameter spaces that increases with $n$.

Let $\Theta$ denote the mixing parameter space, which may or may not be compact. For example, if the model is a Gaussian location-scale mixture, then $\Theta = \mathbb{R} \times \mathbb{R}^+$. The generalized PR algorithm is as follows.

Algorithm GPR. Choose an increasing sequence of compact sets $\Theta_n$ such that $\Theta_n \subseteq \Theta$, a bounded, strictly positive, $\mu$-measurable function $g(\theta)$, and a sequence $c_n \downarrow 0$ that satisfies $\sum_n \log(1 + c_n) < \infty$. Define $g_n(\theta) = g(\theta)\, I_{\Theta_n \setminus \Theta_{n-1}}(\theta) / d_n$, where $d_n = \int_{\Theta_n \setminus \Theta_{n-1}} g\, d\mu$ is the normalizing constant. Start with an initial estimate $f_0$ on $\Theta_0$ and, for $n \ge 1$, define

$\tilde{f}_n(\theta) = (1 - w_n)\, f_{n-1}(\theta) + w_n\, \frac{p(X_n \mid \theta)\, f_{n-1}(\theta)}{m_{n-1}(X_n)}, \quad \theta \in \Theta_{n-1},$

and then

(4.1)  $f_n(\theta) = \begin{cases} \dfrac{1}{1+c_n}\, \tilde{f}_n(\theta), & \theta \in \Theta_{n-1}, \\ \dfrac{c_n}{1+c_n}\, g_n(\theta), & \theta \in \Theta_n \setminus \Theta_{n-1}, \\ 0, & \theta \in \Theta_n^c. \end{cases}$

As in (1.3), define $m_n := m_{f_n}$ as the final estimate of $m$.

The motivation for using $g_n$ is that $\Theta_n$ is small compared to $\Theta$, so the estimate should be padded near the boundary of $\Theta_n$ to compensate for the possibly heavy tails of $f$, which might assign considerable mass to $\Theta_n^c$. Asymptotically, any function $g$ would suffice, so we recommend the default choice $g(\theta) \equiv 1$, making $g_n(\theta)$ a Uniform density on $\Theta_n \setminus \Theta_{n-1}$.

Next we derive convergence properties of the GPR estimate $m_n$. The primary obstacle in extending the results in Section 3 is that now the support of the mixing densities is changing with $n$; this makes the comparisons of the mixing densities in the proof of Lemma 3.3 more difficult. Here we will consider only the case where $m = m_f \in \mathcal{M}_\Theta$, but the mixing density $f$ and its support $\Theta_f \subseteq \Theta$ are both unknown. Theorem 4.1 below establishes a bound on the rate of convergence in the case where $\Theta_f$ is a compact subset of $\Theta$. It is important to point out that, while we are restricting ourselves to the compact case, we need not assume $\Theta_f$ to be known.

Recall condition A4 in Section 3, requiring that the likelihood ratio be square integrable uniformly over $\Theta$. In this case, we have a sequence $\Theta_n$, and we require a bound similar to that of A4 for each $\Theta_n$. To this end, define the sequence

(4.2)  $B_n = \sup_{\theta_1, \theta_2, \theta_3 \in \Theta_n} \int \Bigl[ \frac{p(x \mid \theta_1)}{p(x \mid \theta_2)} \Bigr]^2 p(x \mid \theta_3)\, d\nu(x), \quad n \ge 0.$

Since $\Theta_n \subseteq \Theta_{n+1}$, the sequence $B_n$ is clearly increasing; if we are to push the proof of Theorem 3.4 through in this more general situation, we will need to control how fast $B_n$ increases.

Theorem 4.1. Assume that $\Theta_f \subset \Theta$ is compact and that conditions A2-A3 hold. Furthermore, assume that $\sum_n a_{n-1} w_n^2 B_n < \infty$. Then the GPR estimates $m_n$ satisfy $a_n K(m_f, m_n) \to 0$ a.s.

See the Appendix for the proof. In general, the conditions of the theorem are satisfied if

(4.3)  $w_n \asymp (n^\alpha \log n)^{-1}, \quad \alpha \in [2/3, 1], \quad \text{and} \quad B_n = O(\log n).$
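As with Algorithm PR, a small grid-based sketch may make the bookkeeping of the growing supports in (4.1) concrete. The code below is ours, again assuming Gaussian location mixands, symmetric nested supports $\Theta_i = [-r_i, r_i]$ carved out of one fine master grid, and the default $g \equiv 1$; the names and defaults are hypothetical, and on steps where the discretized shell contains no new grid points the padding step is skipped so the grid density stays normalized, a minor departure from (4.1).

```python
import numpy as np
from scipy.stats import norm

def generalized_pr(x, master_grid, radius, c_seq=lambda i: i ** -2.0, gamma=0.75):
    """Minimal sketch of Algorithm GPR with the padded update (4.1), for Gaussian
    location mixands on nested supports Theta_i = [-radius(i), radius(i)].
    With c_i = i^{-2}, sum_i log(1 + c_i) is finite, as the algorithm requires."""
    dtheta = master_grid[1] - master_grid[0]
    active = np.abs(master_grid) <= radius(0)              # indicator of Theta_0
    f = active.astype(float)
    f /= f.sum() * dtheta                                  # uniform initial estimate f_0 on Theta_0
    for i, xi in enumerate(x, start=1):
        w = (i + 1.0) ** (-gamma) / max(np.log(i + 1.0), 1.0)  # w_i ~ (i^gamma log i)^{-1}, cf. (4.3)
        lik = norm.pdf(xi, loc=master_grid, scale=1.0)
        denom = np.sum(lik * f) * dtheta                   # m_{i-1}(X_i), support Theta_{i-1}
        f_tilde = (1.0 - w) * f + w * lik * f / denom      # PR step on the current support
        new_active = np.abs(master_grid) <= radius(i)      # Theta_i
        shell = new_active & ~active                       # Theta_i \ Theta_{i-1}
        if shell.any():
            ci = c_seq(i)
            f = f_tilde / (1.0 + ci)                       # first branch of (4.1)
            f[shell] = (ci / (1.0 + ci)) / (shell.sum() * dtheta)  # uniform g_n on the shell
        else:
            f = f_tilde                                    # no new grid points: skip the padding
        active = new_active
    return f

# Toy usage: supports growing logarithmically inside a master grid on [-10, 10].
rng = np.random.default_rng(1)
data = rng.normal(2.5, 1.0, size=300)
grid = np.linspace(-10.0, 10.0, 401)
f_n = generalized_pr(data, grid, radius=lambda i: 1.0 + np.log(1.0 + i))
```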

Example 4.2. Suppose that $p(x \mid \theta) = e^{-\theta} \theta^x / x!$ is a Poisson density (with respect to counting measure). Then

$\sum_x \Bigl[ \frac{p(x \mid \theta_1)}{p(x \mid \theta_2)} \Bigr]^2 p(x \mid \theta_3) = \exp\{2(\theta_2 - \theta_1) - \theta_3 + \theta_1^2 \theta_3 / \theta_2^2\}.$

Take $\Theta_n = [\alpha_n, \beta_n]$, where $\alpha_n$ and $\beta_n$ are to be determined. Then

$B_n = \sup_{\theta_1, \theta_2, \theta_3 \in \Theta_n} \exp\{2(\theta_2 - \theta_1) - \theta_3 + \theta_1^2 \theta_3 / \theta_2^2\} \le \exp\{\beta_n^3 / \alpha_n^2\}.$

If $\beta_n \asymp (c + \log\log n)^{1/5}$, for some constant $c > 0$ such that $\Theta_0$ is suitably large, and $\alpha_n = \beta_n^{-1}$, then $\beta_n^3/\alpha_n^2 = \beta_n^5 \asymp c + \log\log n$, so $B_n = O(\log n)$. Therefore, the conditions of Theorem 4.1 are satisfied with $w_n$ as in (4.3).

Acknowledgments. The authors thank Professor J. K. Ghosh for his support and guidance throughout this and other projects.

APPENDIX: PROOFS

Proof of Lemma 3.1. For part (a), treat $\theta$ as a random element in $\Theta$ whose distribution has density $\varphi \in \mathcal{F}$, and define the random variable

(A.4)  $g_\varphi(\theta) = \int \frac{m(x)\, p(x \mid \theta)}{m_\varphi(x)}\, d\nu(x).$

Then $E_\varphi\{g_\varphi(\theta)\} = \int g_\varphi\, \varphi\, d\mu = 1$ and $T(\varphi) = V_\varphi\{g_\varphi(\theta)\} \ge 0$, with equality if and only if $g_\varphi = 1$ $\mu$-a.e. Next define

$\Lambda(\varphi) = \log\Bigl\{ \int_\Theta g_\varphi(\theta)\, f^*(\theta)\, d\mu(\theta) \Bigr\},$

where $f^* \in \mathcal{F}$ is such that $K(m, m_{f^*}) = K(m, \mathcal{M}_\Theta)$. Note that $T(\varphi) = 0$ implies $\Lambda(\varphi) = 0$. By Jensen's inequality,

$\Lambda(\varphi) = \log\Bigl\{ \int_{\mathcal{X}} \frac{m_{f^*}(x)}{m_\varphi(x)}\, m(x)\, d\nu(x) \Bigr\} \ge \int_{\mathcal{X}} \log\Bigl( \frac{m_{f^*}(x)}{m_\varphi(x)} \Bigr) m(x)\, d\nu(x) = K(m, m_\varphi) - K(m, m_{f^*}) \ge 0,$

so that $\Lambda(\varphi) = 0$ implies $K(m, m_\varphi) = K(m, m_{f^*})$.

For part (b), take a sequence $\{\varphi_s\} \subset \mathcal{F}$ such that $\varphi_s$ converges weakly to some $\varphi \in \mathcal{F}$, and let $g_{\varphi_s}$ and $g_\varphi$ be as in (A.4). Let $r_s(x, \theta) = p(x \mid \theta)/m_{\varphi_s}(x)$, so that $g_{\varphi_s}(\theta) = \int r_s(x, \theta)\, m(x)\, d\nu(x)$. By A4 and Jensen's inequality, $\sup_s \int r_s^2(x, \theta)\, m(x)\, d\nu(x) < B$, which implies $\{r_s(\cdot, \theta)\}$ is uniformly integrable with respect to $m\, d\nu$; therefore, $g_{\varphi_s} \to g_\varphi$ $\mu$-a.e. Since the weak topology on $\mathcal{F}$ is metrizable, to prove continuity of $T$ it suffices to show that $T(\varphi_s) \to T(\varphi)$. We have

(A.5)  $T(\varphi_s) - T(\varphi) = \int g_{\varphi_s}^2\, \varphi_s\, d\mu - \int g_\varphi^2\, \varphi\, d\mu = \int (g_{\varphi_s}^2 - g_\varphi^2)\, \varphi_s\, d\mu + \int g_\varphi^2\, (\varphi_s - \varphi)\, d\mu.$

The second term on the right-hand side of (A.5) goes to zero by definition of weak convergence, since $g_\varphi^2 \le B$ (by A4) and $g_\varphi^2$ is continuous (by A2). Next, we know the following: $|g_{\varphi_s}^2 - g_\varphi^2|\, \varphi_s \to 0$ $\mu$-a.e.; $|g_{\varphi_s}^2 - g_\varphi^2|\, \varphi_s \le 2B\varphi_s$; and $\varphi_s \to \varphi$ pointwise $\mu$-a.e. with $\lim_s \int \varphi_s\, d\mu = 1 = \int \varphi\, d\mu$. Now, for the first term on the right-hand side of (A.5),

$\Bigl| \int (g_{\varphi_s}^2 - g_\varphi^2)\, \varphi_s\, d\mu \Bigr| \le \int |g_{\varphi_s}^2 - g_\varphi^2|\, \varphi_s\, d\mu \to 0,$

where the convergence follows from the above properties and Theorem 1 of Pratt [16]. Therefore, $T(\varphi_s) \to T(\varphi)$ and, since the sequence $\{\varphi_s\}$ was arbitrary, $T$ must be continuous.

Proof of Lemma 3.2. Note that for $a > 0$ and $b \in (0, 1)$, we have

(A.6)  $(a - 1)^2 \max\{1, (1 + b(a - 1))^{-2}\} \le \max\{(a - 1)^2, (1/a - 1)^2\}.$

Combining inequalities (3.1) and (A.6), we see that

$H_{n,X_n}^2\, R(w_n H_{n,X_n}) \le \max\Bigl\{ \Bigl( \frac{h_{n,X_n}}{m_{n-1}} - 1 \Bigr)^2, \Bigl( \frac{m_{n-1}}{h_{n,X_n}} - 1 \Bigr)^2 \Bigr\}$

and, since both $h_{n,X_n}$ and $m_{n-1}$ belong to $\mathcal{M}_\Theta$ for each $n$, we conclude from A4 and Jensen's inequality that $Z_n \le 1 + B$ a.s.

Proof of Lemma 3.3. This was proved in TMG for the case $m \in \mathcal{M}_\Theta$. Here we only sketch the details for the more general case. Let $f^* \in \mathcal{F}$ be such that $K(m, m_{f^*}) = K(m, \mathcal{M}_\Theta)$. TMG's Taylor approximation and telescoping sum argument for the KL divergence between mixing densities reveals

$K(f^*, f_n) - K(f^*, f_0) = -\sum_{i=1}^n \int_\Theta f^* \log(f_i / f_{i-1})\, d\mu = \sum_{i=1}^n w_i V_i - \sum_{i=1}^n w_i M_i + \sum_{i=1}^n w_i^2 E_i,$

where, with a slight difference in notation compared to TMG,

$M_i = \int \Bigl( \frac{m_{f^*}}{m_{i-1}} - 1 \Bigr) m\, d\nu, \qquad V_i = 1 - \frac{m_{f^*}(X_i)}{m_{i-1}(X_i)} + M_i,$

and the $E_i$'s are uniformly bounded (by A4). From $\log(x) \le x - 1$ and the proof of Lemma 3.1(a), we get

$M_i \ge \int \log(m_{f^*}/m_{i-1})\, m\, d\nu = K_{i-1}^* \ge 0.$

To prove the claim, it suffices to show that $\sum_{i=1}^n w_i M_i$ converges almost surely to a finite random variable. The same arguments in TMG show that both $V := \sum_{i=1}^\infty w_i V_i$ and $E := \sum_{i=1}^\infty w_i^2 E_i$ are almost surely finite, and non-negativity of the KL divergence implies

$\sum_{i=1}^n w_i M_i \le K(f^*, f_0) + V + E < \infty \quad \text{for all } n.$

Then the left-hand side converges because the $M_i$'s are non-negative.

Proof of Theorem 4.1. The proof is essentially the same as that of Theorem 3.4; we simply need to check that Lemmas 3.2 and 3.3 continue to hold in this new context. Since $B_n$ is increasing, the condition $\sum_n a_{n-1} w_n^2 B_n < \infty$ implies that the conclusion of Lemma 3.2 holds (with $B_n$ in place of $B$), and the remainder term $Z_n$ in (3.4) satisfies $\sum_n w_n^2\, E(Z_n) < \infty$. The key observation is that, since $\Theta_f$ is compact, there is a number $N = N(f)$ such that $\Theta_f \subseteq \Theta_n$ for all $n \ge N$. Consequently, there are only $N$ iterates $f_0, \ldots, f_{N-1}$ such that $K(f, f_i) = \infty$ so, after appropriately shifting the time scale, we can apply the telescoping sum argument in

the proof of Lemma 3.3 to get

$K(f, f_{N+n}) - K(f, f_N) = -\sum_{i=N+1}^{N+n} \int f \log(f_i / f_{i-1})\, d\mu = \sum_{i=N+1}^{N+n} w_i V_i - \sum_{i=N+1}^{N+n} w_i M_i + \sum_{i=N+1}^{N+n} w_i^2 E_i + \sum_{i=N+1}^{N+n} \log(1 + c_i).$

The right-most sum in the display converges by assumption, and the others can be handled as in the proof of Lemma 3.3. Therefore, we can conclude that $\sum_{n=N}^\infty w_n K(m_f, m_n) < \infty$ a.s. To finish the proof, simply throw away the first $N$ iterations and apply the argument used to prove Theorem 3.4.

REFERENCES

[1] Barron, A. R. (2000). Limits of information, Markov chains, and projection. In IEEE International Symposium on Information Theory, 25.
[2] Brown, L. D., George, E. I. and Xu, X. (2008). Admissible predictive density estimation. Ann. Statist. 36, no. 3.
[3] Genovese, C. R. and Wasserman, L. (2000). Rates of convergence for the Gaussian mixture sieve. Ann. Statist. 28, no. 4.
[4] Ghosal, S. and van der Vaart, A. W. (2001). Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. Ann. Statist. 29, no. 5.
[5] Ghosh, J. K. and Tokdar, S. T. (2006). Convergence and consistency of Newton's algorithm for estimating mixing distribution. In Frontiers in Statistics. Imp. Coll. Press, London.
[6] Kleijn, B. J. K. and van der Vaart, A. W. (2006). Misspecification in infinite-dimensional Bayesian statistics. Ann. Statist. 34, no. 2.
[7] Leroux, B. G. (1992). Consistent estimation of a mixing distribution. Ann. Statist. 20, no. 3.
[8] Li, J. Q. and Barron, A. R. (2000). Mixture density estimation. In Advances in Neural Information Processing Systems (S. Solla, T. Leen and K.-R. Mueller, eds.). MIT Press, Cambridge, Massachusetts.
[9] Liese, F. and Vajda, I. (1987). Convex Statistical Distances. Teubner, Leipzig.
[10] Lindsay, B. G. (1983). The geometry of mixture likelihoods: a general theory. Ann. Statist. 11, no. 1.
[11] Martin, R. and Ghosh, J. K. (2008). Stochastic approximation and Newton's estimate of a mixing distribution. Statist. Sci. 23, no. 3.
[12] Newton, M. A. (2002). On a nonparametric recursive estimator of the mixing distribution. Sankhyā Ser. A 64, no. 2.
[13] Newton, M. A., Quintana, F. A. and Zhang, Y. (1998). Nonparametric Bayes methods using predictive updating. In Practical Nonparametric and Semiparametric Bayesian Statistics. Springer, New York.
[14] Newton, M. A. and Zhang, Y. (1999). A recursive algorithm for nonparametric analysis with missing data. Biometrika 86, no. 1.

[15] Patilea, V. (2001). Convex models, MLE and misspecification. Ann. Statist. 29, no. 1.
[16] Pratt, J. W. (1960). On interchanging limits and integrals. Ann. Math. Statist. 31.
[17] Shyamalkumar, N. (1996). Cyclic I_0 projections and its applications in statistics. Tech. Rep., Department of Statistics, Purdue University, West Lafayette, IN.
[18] Tokdar, S. T., Martin, R. and Ghosh, J. K. (2009). Consistency of a recursive estimate of mixing distributions. Ann. Statist. 37, no. 5A.

Department of Mathematical Sciences
Indiana University-Purdue University Indianapolis
402 North Blackford Street
Indianapolis, Indiana, USA
rgmartin@math.iupui.edu

Department of Statistical Science
Duke University
214 Old Chemistry Building
Durham, North Carolina, USA
st118@stat.duke.edu


More information

Functional Analysis HW #1

Functional Analysis HW #1 Functional Analysis HW #1 Sangchul Lee October 9, 2015 1 Solutions Solution of #1.1. Suppose that X

More information

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19 Introductory Analysis I Fall 204 Homework #9 Due: Wednesday, November 9 Here is an easy one, to serve as warmup Assume M is a compact metric space and N is a metric space Assume that f n : M N for each

More information

Copyright 2010 Pearson Education, Inc. Publishing as Prentice Hall.

Copyright 2010 Pearson Education, Inc. Publishing as Prentice Hall. .1 Limits of Sequences. CHAPTER.1.0. a) True. If converges, then there is an M > 0 such that M. Choose by Archimedes an N N such that N > M/ε. Then n N implies /n M/n M/N < ε. b) False. = n does not converge,

More information

Problem 3. Give an example of a sequence of continuous functions on a compact domain converging pointwise but not uniformly to a continuous function

Problem 3. Give an example of a sequence of continuous functions on a compact domain converging pointwise but not uniformly to a continuous function Problem 3. Give an example of a sequence of continuous functions on a compact domain converging pointwise but not uniformly to a continuous function Solution. If we does not need the pointwise limit of

More information

Random Bernstein-Markov factors

Random Bernstein-Markov factors Random Bernstein-Markov factors Igor Pritsker and Koushik Ramachandran October 20, 208 Abstract For a polynomial P n of degree n, Bernstein s inequality states that P n n P n for all L p norms on the unit

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS

ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS Bendikov, A. and Saloff-Coste, L. Osaka J. Math. 4 (5), 677 7 ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS ALEXANDER BENDIKOV and LAURENT SALOFF-COSTE (Received March 4, 4)

More information

BERNOULLI ACTIONS AND INFINITE ENTROPY

BERNOULLI ACTIONS AND INFINITE ENTROPY BERNOULLI ACTIONS AND INFINITE ENTROPY DAVID KERR AND HANFENG LI Abstract. We show that, for countable sofic groups, a Bernoulli action with infinite entropy base has infinite entropy with respect to every

More information

Probability and Measure

Probability and Measure Part II Year 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2018 84 Paper 4, Section II 26J Let (X, A) be a measurable space. Let T : X X be a measurable map, and µ a probability

More information

IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm

IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.

More information

THE ALTERNATIVE DUNFORD-PETTIS PROPERTY FOR SUBSPACES OF THE COMPACT OPERATORS

THE ALTERNATIVE DUNFORD-PETTIS PROPERTY FOR SUBSPACES OF THE COMPACT OPERATORS THE ALTERNATIVE DUNFORD-PETTIS PROPERTY FOR SUBSPACES OF THE COMPACT OPERATORS MARÍA D. ACOSTA AND ANTONIO M. PERALTA Abstract. A Banach space X has the alternative Dunford-Pettis property if for every

More information

LIMITS FOR QUEUES AS THE WAITING ROOM GROWS. Bell Communications Research AT&T Bell Laboratories Red Bank, NJ Murray Hill, NJ 07974

LIMITS FOR QUEUES AS THE WAITING ROOM GROWS. Bell Communications Research AT&T Bell Laboratories Red Bank, NJ Murray Hill, NJ 07974 LIMITS FOR QUEUES AS THE WAITING ROOM GROWS by Daniel P. Heyman Ward Whitt Bell Communications Research AT&T Bell Laboratories Red Bank, NJ 07701 Murray Hill, NJ 07974 May 11, 1988 ABSTRACT We study the

More information