On Dirichlet Multinomial Distributions

Dedicated to Professor Y. S. Chow on the Occasion of his 80th Birthday

By Robert W. Keener[1] and Wei Biao Wu[2]

Abstract. Let $Y$ have a symmetric Dirichlet multinomial distribution in $\mathbb{R}^m$, and let $S_m = h(Y_1) + \cdots + h(Y_m)$. We derive a central limit theorem for $S_m$ as the sample size $n$ and the number of cells $m$ tend to infinity at the same rate. The rate of convergence is shown to be of order $m^{-1/6}$. The approach is based on approximation of marginal distributions for the Dirichlet multinomial distribution by negative binomial distributions, and a blocking technique similar to that used to study renormalization groups in statistical physics. These theorems generalize and refine results for the classical occupancy problem.

Keywords: Occupancy problems; central limit theorem; exchangeable distributions.

1 Introduction and main results.

Let $Y$ have a multinomial distribution $M(n, p)$ with $n$ trials and success probabilities $p = (p_1, \ldots, p_m)$. Classical occupancy problems concern the counts $l_k = \#\{j : Y_j = k\}$ and the coverage $m - l_0$. If $m$ and $n$ tend to infinity at the same rate and the multinomial distribution is symmetric, $p_1 = \cdots = p_m = 1/m$, Weiss (1958) gives a central limit theorem for the number of cells covered, $m - l_0$. This result has been extended in various directions. Rényi (1962) gives proofs extending Weiss' result to more general limits, Kopocinska and Kopocinski (1992) prove joint asymptotic normality for a collection of the $l_k$, and Englund (1981) gives a Berry-Esseen bound for the error of normal approximation. In asymmetric cases where the cell probabilities $p_j$ vary, Esty (1983) gives a central limit theorem for the coverage, and Quine and Robinson (1982) obtain a Berry-Esseen bound. Most relevant to the results presented here, Chen (1980) introduces mixture models, described below, in which the multinomial cell probabilities are sampled from a Dirichlet distribution. Extensions are presented in Chen (1981a, 1981b).
[1] Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA. keener@umich.edu
[2] Department of Statistics, University of Chicago, Chicago, IL 60637, USA. wbwu@galton.uchicago.edu
This work was supported by National Science Foundation Grant DMS-.

The models and results here are of some interest in statistics, particularly in situations where multinomial data divide up a sample, but cells are discovered as the experiment is performed. Then quantities like $l_0$, the number of unobserved cells, will be unknown parameters, but the other counts $l_k$, $k \geq 1$, are observed, and various proposed estimators are based on these. See Fisher, Corbet and Williams (1943), Good and Toulmin (1956) and Keener, Rothman and Starr (1987) for further details. Related models also arise in studying bootstrap procedures, although the distributional questions of interest there are not closely related to the results here. See Rubin (1981) or Csörgő and Wu (2000).

If $G_i \sim \Gamma(A_i, 1)$, $i = 1, \ldots, m$, are independent, then

$$p = \frac{G}{G_1 + \cdots + G_m} \sim D_m(A),$$

the Dirichlet distribution. If the conditional distribution of $Y$ given $p$ is multinomial, $Y \mid p \sim M(n, p)$, then $Y$ has the Dirichlet multinomial distribution, $Y \sim DM(n, A)$. By smoothing (the law of total probability),

$$P(Y = y) = E\,P(Y = y \mid p) = E \binom{n}{y_1, \ldots, y_m} \prod_{i=1}^m p_i^{y_i} = \frac{n!\,\Gamma(A_1 + \cdots + A_m)}{\Gamma(n + A_1 + \cdots + A_m)} \prod_{i=1}^m \frac{\Gamma(y_i + A_i)}{y_i!\,\Gamma(A_i)}.$$

In the sequel, we will be particularly interested in the symmetric case in which $A = (a, \ldots, a) = a 1_m$. Special cases of interest include the Bose-Einstein distribution, in which $a = 1$ and $P(Y = y)$ is independent of $y$, and the Maxwell-Boltzmann distribution, which arises in the limit as $a \to \infty$. When $m = 2$, $p_1$ has the beta distribution $B(A_1, A_2)$ and $Y_1$ has the beta-binomial distribution $BB(n, A_1, A_2)$. In the limit theory developed in this paper, the negative binomial distribution $NB(a, \eta)$ with mass function

$$\frac{\Gamma(a + y)\, a^a \eta^y}{y!\, \Gamma(a)\, (a + \eta)^{a+y}}, \quad y = 0, 1, \ldots,$$

will play a central role. The shape parameter $a$ here is not restricted to be an integer, and the parameter $\eta$ is the mean, instead of the usual success probability $\eta/(a + \eta)$. The variance with this parameterization is $\eta(a + \eta)/a$.

Let $h$ be a function from $\mathbb{N}_+ = \{0, 1, \ldots\}$ to $\mathbb{R}$, and define

$$S_m = \sum_{i=1}^m h(Y_i).$$

The main result below is a central limit theorem for $S_m$ as $m \to \infty$ with $Y$ from a symmetric Dirichlet multinomial distribution. Note that when $h(x) = I\{x = k\}$, $S_m$ equals $l_k$, showing the connection to the occupancy problems mentioned above.

Theorem 1. Consider a limiting situation in which $m \to \infty$ and $n \to \infty$, with

$$\eta = \frac{n}{m} \to \eta_\infty \in (0, \infty). \tag{1}$$

Assume $Y \sim DM(n, a 1_m)$, and that $h$ is a nonlinear function with

$$\sup_{y \in \mathbb{N}_+} h^4(y)\, e^{-\Lambda y} < \infty,$$

where $\Lambda < \log(1 + a/\eta_\infty)$. Take $Z \sim NB(a, \eta)$ and define $\hat\mu = \hat\mu(\eta) = E_\eta h(Z)$. Also, let $\beta = \mathrm{Cov}[h(Z), Z]/\mathrm{Var}(Z)$, so that $\hat\mu + \beta(Z - \eta)$ is the best linear predictor of $h(Z)$, and define

$$\hat\sigma^2 = \hat\sigma^2(\eta) = \mathrm{Var}[h(Z) - \hat\mu - \beta(Z - \eta)].$$

(Note that since $h$ is nonlinear, $\hat\sigma^2 > 0$.) Then

$$\sup_{x \in \mathbb{R}} \left| P(S_m \leq x) - \Phi\!\left( \frac{x - m\hat\mu}{\sqrt{m}\,\hat\sigma} \right) \right| = O(1/m^{1/6}).$$

This result remains true if the corresponding moments of $h(Y_1)$, or the mean and standard deviation of $S_m$, are used to center and scale the normal approximation. The version stated seems a bit more convenient for explicit calculation.

The next result complements Theorem 1, providing an exponential bound for tail probabilities of $S_m$.
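As a concrete illustration of the quantities appearing in Theorem 1, the constants $\hat\mu$, $\beta$ and $\hat\sigma^2$ can be tabulated numerically from the $NB(a, \eta)$ mass function. The sketch below (our own; the helper names `nb_pmf` and `clt_constants` are invented, not from the paper) uses the occupancy choice $h(x) = I\{x = 0\}$, so that $S_m = l_0$, the empty-cell count:

```python
from math import lgamma, exp, log

def nb_pmf(y, a, eta):
    # NB(a, eta) in the mean parameterization used above:
    # Gamma(a+y) a^a eta^y / (y! Gamma(a) (a+eta)^(a+y)).
    return exp(lgamma(a + y) - lgamma(y + 1) - lgamma(a)
               + a * log(a) + y * log(eta) - (a + y) * log(a + eta))

def clt_constants(h, a, eta, ymax=2000):
    # muhat = E h(Z), beta = Cov(h(Z), Z)/Var(Z), and
    # sigmahat^2 = Var(h(Z) - muhat - beta (Z - eta)),
    # truncating the light-tailed sums at ymax.
    p = [nb_pmf(y, a, eta) for y in range(ymax)]
    muhat = sum(pi * h(y) for y, pi in enumerate(p))
    cov = sum(pi * (h(y) - muhat) * (y - eta) for y, pi in enumerate(p))
    beta = cov / (eta * (a + eta) / a)       # Var(Z) = eta (a + eta)/a
    s2 = sum(pi * (h(y) - muhat - beta * (y - eta)) ** 2
             for y, pi in enumerate(p))
    return muhat, beta, s2

# Occupancy choice h(x) = I{x = 0}: S_m counts the empty cells l_0.
h = lambda y: 1.0 if y == 0 else 0.0
muhat, beta, s2 = clt_constants(h, a=1.0, eta=1.0)
# For a = 1 (Bose-Einstein), NB(1, eta) is geometric: P(Z=0) = 1/(1+eta).
print(round(muhat, 6))   # 0.5
print(round(beta, 6))    # -0.25
print(round(s2, 6))      # 0.125
assert s2 > 0            # h is nonlinear, so sigmahat^2 > 0, as in Theorem 1
```

The values agree with the closed forms for the geometric case: $\hat\mu = 1/2$, $\mathrm{Cov}(h(Z), Z) = -1/2$, $\mathrm{Var}(Z) = 2$, hence $\beta = -1/4$ and $\hat\sigma^2 = 1/4 - 1/8 = 1/8$.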

Theorem 2. Assume

$$E_{\eta_\infty} e^{\epsilon_0 |h(Z)|} < \infty \tag{2}$$

for some $\epsilon_0 > 0$. Then for some constant $c > 0$,

$$P[|S_m - m\hat\mu| > m\epsilon] = O(\sqrt{m}\, e^{-c\epsilon^2 m})$$

in limit (1), uniformly for $\epsilon$ in any bounded subset of $[0, \infty)$.

2 Marginal Distributions

This section provides approximations for marginal distributions in the limiting situation described in Theorem 1. Throughout, $Y$ will have the symmetric Dirichlet multinomial distribution $DM(n, a 1_m)$, and $Z, Z_1, Z_2, \ldots$ will be i.i.d. from $NB(a, \eta)$ with $\eta = n/m$. In limit (1), the joint distributions of $(Y_1, \ldots, Y_k)$ and $(Z_1, \ldots, Z_k)$ converge. We will be interested in approximating moments of $S_k = h(Y_1) + \cdots + h(Y_k)$, and it will be convenient to measure the distance between measures using a variant of the total variation norm with exponential weights for large values. Specifically, if $\nu$ is a signed measure on $\mathbb{N}^k$, define

$$\|\nu\|_\Lambda = \sum_{y \in \mathbb{N}^k} |\nu(\{y\})|\, e^{\Lambda(y_1 + \cdots + y_k)}.$$

If $Q$ and $\hat Q$ are finite signed measures on $\mathbb{N}^k$, then

$$\int f\, dQ - \int f\, d\hat Q = \sum_{y \in \mathbb{N}^k} \left[ f(y)\, e^{-\Lambda(y_1 + \cdots + y_k)} \right] e^{\Lambda(y_1 + \cdots + y_k)} \left[ Q(\{y\}) - \hat Q(\{y\}) \right]$$

is at most

$$\|Q - \hat Q\|_\Lambda \sup_{y \in \mathbb{N}^k} |f(y)|\, e^{-\Lambda(y_1 + \cdots + y_k)} \tag{3}$$

in absolute value. The likelihood ratio between the marginal distribution of $(Y_1, \ldots, Y_k)$ and that of $(Z_1, \ldots, Z_k)$ is

$$L(y) = \frac{P(Y_1 = y_1, \ldots, Y_k = y_k)}{P(Z = y_1) \cdots P(Z = y_k)} = \frac{\Gamma(n+1)}{\Gamma(n+1-v)\, n^v} \cdot \frac{\Gamma(ma)}{\Gamma(ma - ka)\, (ma)^{ka}} \cdot \frac{\Gamma(n + ma - ka - v)\, (n + ma)^{ka+v}}{\Gamma(n + ma)},$$

where $v = y_1 + \cdots + y_k$. The three factors here can be approximated using the following lemma, which follows fairly easily from Stirling's approximation for the gamma function.

Lemma 1. If $a = a_x$ and $b = b_x$ are both $o(\sqrt{x})$, then

$$\frac{\Gamma(x + a)}{\Gamma(x + b)\, x^{a-b}} = 1 + \frac{(a - b)(a + b - 1)}{2x} + O\!\left( \frac{1 + a^4 + b^4}{x^2} \right) \tag{4}$$

as $x \to \infty$.

Using this lemma,

$$L(y) = 1 + \frac{1}{m} \left[ \frac{v(1 - v)}{2\eta} - \frac{k(1 + ka)}{2} + \frac{(ka + v)(ka + v + 1)}{2(a + \eta)} \right] + O\!\left( \frac{1 + v^4}{m^2} \right) \tag{5}$$

in limit (1), provided $v = o(\sqrt{m})$. When $v$ is large, the approximation to $L$ breaks down, and errors from these values will be estimated using Bernstein inequality bounds based on moment generating functions. The moment generating function for the negative binomial distribution,

$$M_Z(u) = E e^{uZ} = \left( \frac{a}{a + \eta - \eta e^u} \right)^a,$$

is finite provided $e^u < 1 + a/\eta$.

Lemma 2. Let $V_k = Y_1 + \cdots + Y_k$. If $e^u < 1 + a/\eta_\infty$, then in limit (1),

$$E e^{uV_k} \to \left( \frac{a}{a + \eta_\infty - \eta_\infty e^u} \right)^{ka}.$$

Proof. The moment generating function for the binomial distribution with $n$ trials and success probability $p$ is $[1 + p(e^u - 1)]^n$. Since $V_k \sim BB(n, ka, ma - ka)$, its distribution is a beta mixture of binomial distributions and

$$E e^{uV_k} = \frac{\Gamma(ma)}{\Gamma(ka)\Gamma(ma - ka)} \int_0^1 [1 + x(e^u - 1)]^n x^{ka-1} (1 - x)^{ma-ka-1}\, dx = \frac{\Gamma(ma)}{\Gamma(ka)\Gamma(ma - ka)} \int_0^1 \frac{x^{ka-1}}{(1 - x)^{ka+1}}\, e^{mf(x)}\, dx,$$

where

$$f(x) = \eta \log[1 + x(e^u - 1)] + a \log(1 - x) = -[a + \eta - \eta e^u]x + O(x^2)$$

as $x \to 0$. A change of variables rescaling $x$ by $m$ gives

$$E e^{u(Y_1 + \cdots + Y_k)} = \frac{\Gamma(ma)}{\Gamma(ma - ka)(ma)^{ka}} \cdot \frac{a^{ka}}{\Gamma(ka)} \int_0^m \frac{x^{ka-1}}{(1 - x/m)^{ka+1}}\, e^{mf(x/m)}\, dx.$$

The first factor here tends to one by Lemma 1, and the desired result then follows by a dominated convergence argument since $mf(x/m) \to -[a + \eta_\infty - \eta_\infty e^u]x$, with the error uniformly bounded for $x \in (0, \sqrt{m})$. To be careful, there should be a separate argument to show that the contribution from integrating over $x \in (\sqrt{m}, m)$ is negligible. Since this is fairly routine, the details are omitted.

Considering the approximation for the likelihood ratio $L$ given in (5), it is natural to approximate the joint distribution $Q$ of $(Y_1, \ldots, Y_k)$ by $\hat Q = \hat Q_0 + \frac{1}{m} \hat Q_1$, where $\hat Q_0 = NB(a, \eta)^k$, the joint distribution of $(Z_1, \ldots, Z_k)$, and $\hat Q_1(\{z\}) = q_1(z) \hat Q_0(\{z\})$ with

$$q_1(z) = \frac{v(1 - v)}{2\eta} - \frac{k(1 + ka)}{2} + \frac{(ka + v)(ka + v + 1)}{2(a + \eta)} = \frac{(a + 2\eta)\tilde v}{2\eta(a + \eta)} - \frac{a \tilde v^2}{2\eta(a + \eta)} + \frac{k}{2},$$

where $v = z_1 + \cdots + z_k$ and $\tilde v = v - k\eta$.

Theorem 3. If $\Lambda < \log(1 + a/\eta_\infty)$, then in limit (1), $\|Q - \hat Q\|_\Lambda = O(1/m^2)$.
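The centering $E q_1(Z_1, \ldots, Z_k) = 0$, used repeatedly below, follows from the NB moments $E\tilde v = 0$ and $E\tilde v^2 = \mathrm{Var}(v) = k\eta(a + \eta)/a$. A quick numerical check (our own sketch, under our reconstruction of $q_1$; for $k = 1$ it also sums directly against the NB mass function):

```python
from math import lgamma, exp, log

def nb_pmf(y, a, eta):
    # NB(a, eta) mass function in the mean parameterization.
    return exp(lgamma(a + y) - lgamma(y + 1) - lgamma(a)
               + a * log(a) + y * log(eta) - (a + y) * log(a + eta))

def q1(v, a, eta, k):
    # q1 in terms of vtilde = v - k*eta, as in the display above.
    vt = v - k * eta
    d = 2 * eta * (a + eta)
    return (a + 2 * eta) * vt / d - a * vt ** 2 / d + k / 2

a, eta, k = 2.0, 1.5, 4
# Exact moments: E vtilde = 0 and E vtilde^2 = k*eta*(a+eta)/a,
# so E q1 = -a * E vtilde^2 / (2 eta (a+eta)) + k/2 = -k/2 + k/2 = 0.
Evt2 = k * eta * (a + eta) / a
exact = -a * Evt2 / (2 * eta * (a + eta)) + k / 2
print(exact)  # 0.0

# For k = 1, confirm by summation: E q1(Z) = sum_y q1(y) P(Z = y).
s = sum(q1(y, a, eta, 1) * nb_pmf(y, a, eta) for y in range(2000))
print(abs(s) < 1e-8)  # True
```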

Proof. Let $B = B_m = \{y \in \mathbb{N}^k : y_1 + \cdots + y_k \leq n^{1/3}\}$. Then since $Q(\{y\}) = L(y) \hat Q_0(\{y\})$,

$$\|Q - \hat Q\|_\Lambda \leq \sum_{y \in B} \left| L(y) - 1 - \frac{q_1(y)}{m} \right| e^{\Lambda(y_1 + \cdots + y_k)} \hat Q_0(\{y\}) + \sum_{y \in B^c} \left[ Q(\{y\}) + \hat Q_0(\{y\}) + \frac{|q_1(y)|}{m} \hat Q_0(\{y\}) \right] e^{\Lambda(y_1 + \cdots + y_k)}.$$

The first term here is $O(1/m^2)$ by the approximation for $L$ in (5), and the second term is of order $e^{-\epsilon n^{1/3}}$ for some $\epsilon > 0$, since the moment generating functions in Lemma 2 converge for some $u > \Lambda$.

As a corollary, the next result provides approximations to moments of $h(Y_i)$. Let $\mu(\eta) = E h(Y_i)$ and $\hat\mu(\eta) = E h(Z_i)$.

Corollary 1. Assume $\Lambda < \log(1 + a/\eta_\infty)$ and that

$$\sup_{y \in \mathbb{N}_+} h^4(y)\, e^{-\Lambda y} < \infty.$$

Then in limit (1),

$$\mu(\eta) = \hat\mu(\eta) + O(1/m),$$
$$\mathrm{Var}(h(Y_1)) = \mathrm{Var}(h(Z)) + O(1/m),$$
$$\mathrm{Cov}(h(Y_1), h(Y_2)) = -\frac{a\, [E h(Z)(Z - \eta)]^2}{m \eta (a + \eta)} + O(1/m^2),$$
$$E(h(Y_1) - \mu(\eta))^4 = E(h(Z) - \hat\mu(\eta))^4 + O(1/m),$$
$$E(h(Y_1) - \mu(\eta))^3 (h(Y_2) - \mu(\eta)) = O(1/m),$$
$$E(h(Y_1) - \mu(\eta))^2 (h(Y_2) - \mu(\eta))^2 = \mathrm{Var}^2(h(Z)) + O(1/m),$$
$$E(h(Y_1) - \mu(\eta))^2 (h(Y_2) - \mu(\eta))(h(Y_3) - \mu(\eta)) = O(1/m),$$

and

$$E(h(Y_1) - \mu(\eta))(h(Y_2) - \mu(\eta))(h(Y_3) - \mu(\eta))(h(Y_4) - \mu(\eta)) = O(1/m^2).$$

As a consequence, in this limit

$$E S_m = m \hat\mu(\eta) + O(1),$$

8 ma[eh(z)(z η)]2 Var(S m ) = mvar(h(z)) η(a + η) = me [ h(z) ˆµ(η) = mˆσ 2 (η) + O(1), + O(1) 2 Cov(h(Z), Z) (Z η)] + O(1), Var(Z) and E(S m mµ(η)) 4 = O(m 2 ). Proof. Using (3) the initial assertions all follow from Theorem 3. Note that f d ˆQ = Ef(Z 1,..., Z k ) + 1 m Eq 1(Z 1,..., Z k )f(z 1,..., Z k ), and that q 1 is a quadratic function with Eq 1 (Z 1,..., Z k ) = 0. So, for instance, q 1 (Z 1,..., Z 4 ) can be written as a sum of quadratic functions of (Z i, Z j ), 1 i j 4 and Eq 1 (Z 1,..., Z 4 ) 4 (h(z i ) µ(η)) = 0. The results about moments of S m then follow directly after a bit of combinatorics. i=1 3 Partial Sums Given a subset B = {b 1,..., b j } (with b 1 < < b j ) of {1,..., m} let Y B denote the random vector (Y b1,..., Y bj ) and let Y +B = i B Y i. Define A B, G B and G +B similarly, and take S +B = i B h(y i). Lemma 3. Let B 1,..., B γ be sets partitioning {1,..., m}. If Y DM m (A), then Y B1,..., Y Bγ given Y +B1 = n 1,..., Y +Bγ = n γ are conditionally independent with Y Bj Y +B1 = n 1,..., Y +Bγ = n γ DM(n j, A Bj ). Proof. Since Y G = g M(n, p), the conditional joint mass function P (Y B1 = y B1,..., Y Bγ = y Bγ G = g, Y +B1 = n 1,..., Y +Bγ = n γ ) = P (Y = y G = g) P (Y +B1 = n 1,..., Y +Bγ = n γ ) 8

is a ratio of multinomial probabilities. Straightforward algebra then shows that $Y_{B_1}, \ldots, Y_{B_\gamma}$ given $G = g$ and $Y_{+B_1} = n_1, \ldots, Y_{+B_\gamma} = n_\gamma$ are conditionally independent with

$$Y_{B_i} \mid G = g,\ Y_{+B_1} = n_1, \ldots, Y_{+B_\gamma} = n_\gamma \sim M(n_i, g_{B_i}/g_{+B_i}).$$

The stated result now follows by integrating against the distribution of $G$. Conditional independence is preserved because $G_{B_1}, \ldots, G_{B_\gamma}$ are independent.

Given a partition $B_1, \ldots, B_\gamma$ of $\{1, \ldots, m\}$ we can write $S_m = S_{+B_1} + \cdots + S_{+B_\gamma}$, and by this lemma the summands are conditionally independent given $Y_{+B_1}, \ldots, Y_{+B_\gamma}$. Theorem 1 will be established using a Berry-Esseen limit theorem (in an independent but non-identically distributed setting) to argue that the conditional distribution of $S_m$ is approximately normal. The following two technical lemmas will be needed.

Lemma 4. The mean of the beta-binomial distribution $BB(n, A_1, A_2)$ is

$$\frac{n A_1}{A_1 + A_2},$$

and the variance is

$$\frac{n A_1 A_2 (n + A_1 + A_2)}{(A_1 + A_2)^2 (A_1 + A_2 + 1)}.$$

If $A_1 + A_2 = ma$ and $n = m\eta$, then the variance is at most $A_1 \eta(a + \eta)/a^2$. This result follows easily from a conditioning argument.

Lemma 5. For $c > 0$, $x \in \mathbb{R}$ and $y \in \mathbb{R}$,

$$|\Phi(cx) - \Phi(y)| \leq \frac{e^{-1/2} |\log c| + |x - y|}{\sqrt{2\pi}}.$$

Proof. By ordinary calculus, $|x| \phi(x) \leq \phi(1)$. If $c \geq 1$ then

$$|\Phi(cx) - \Phi(x)| = \left| \int_1^c x \phi(ux)\, du \right| \leq \int_1^c \frac{|ux| \phi(ux)}{u}\, du \leq \phi(1) \log c.$$

Similarly, $|\Phi(cx) - \Phi(x)| \leq \phi(1) |\log c|$ when $0 < c < 1$. Also, since $\Phi' = \phi \leq 1/\sqrt{2\pi}$,

$$|\Phi(x) - \Phi(y)| \leq |x - y|/\sqrt{2\pi}.$$

The desired result follows from these bounds and the identity $\phi(1) = e^{-1/2}/\sqrt{2\pi}$, because $|\Phi(cx) - \Phi(y)| \leq |\Phi(cx) - \Phi(x)| + |\Phi(x) - \Phi(y)|$.
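Lemma 4's moment formulas are easy to confirm numerically by summing the beta-binomial mass function over its support (a sketch of ours; `bb_pmf` is an invented helper, not from the paper):

```python
from math import lgamma, exp

def bb_pmf(y, n, a1, a2):
    # BB(n, A1, A2) mass function:
    # C(n, y) * B(y + A1, n - y + A2) / B(A1, A2),
    # computed on the log scale via lgamma for stability.
    return exp(lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)
               + lgamma(y + a1) + lgamma(n - y + a2) - lgamma(n + a1 + a2)
               + lgamma(a1 + a2) - lgamma(a1) - lgamma(a2))

n, a1, a2 = 20, 2.0, 3.0
p = [bb_pmf(y, n, a1, a2) for y in range(n + 1)]
mean = sum(y * pi for y, pi in enumerate(p))
var = sum((y - mean) ** 2 * pi for y, pi in enumerate(p))
print(round(mean, 6))  # n a1/(a1+a2) = 8.0
print(round(var, 6))   # n a1 a2 (n+a1+a2)/((a1+a2)^2 (a1+a2+1)) = 20.0
```

With $n = 20$, $A_1 = 2$, $A_2 = 3$, the formulas give mean $20 \cdot 2/5 = 8$ and variance $20 \cdot 2 \cdot 3 \cdot 25/(25 \cdot 6) = 20$, matching the summation.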

Proof of Theorem 1. Let $\gamma = \lceil m^{1/3} \rceil$, and let $B_1, \ldots, B_\gamma$ be an even partition of $\{1, \ldots, m\}$, i.e., a partition chosen so that $m_i = |B_i|$ equals $\lfloor \bar m \rfloor$ or $\lceil \bar m \rceil$, where $\bar m = m/\gamma = m^{2/3} + O(m^{1/3})$. Then $|m_i - \bar m| \leq 1$, so $m_i = \bar m + O(1)$. Define

$$n_i = Y_{+B_i} \quad \text{and} \quad \eta_i = \frac{n_i}{m_i},$$

and let $\mathcal{F}$ denote the sigma-field generated by $Y_{+B_1}, \ldots, Y_{+B_\gamma}$. Conditional moments of $S_{+B_i}$ given $\mathcal{F}$ will be approximated using Corollary 1. The approximations will be accurate when the variables $\eta_i$ are near the limiting value $\eta_\infty$. Define the event

$$F = \{|\eta_i - \eta| \leq m^{-1/12},\ i = 1, \ldots, \gamma\}.$$

Since $Y_{+B_i} \sim BB(n, m_i a, (m - m_i)a)$, by Lemma 4, $\eta_i$ has mean $\eta$ and variance at most $\eta(a + \eta)/(m_i a) = O(m^{-2/3})$. By Tchebysheff's inequality,

$$P(|\eta_i - \eta| > m^{-1/12}) = O(m^{-1/2}).$$

This bound, and the asymptotic expressions in the sequel for all quantities indexed by $i$, hold uniformly in $i$. By Boole's inequality,

$$P(F^c) \leq \sum_{i=1}^\gamma P(|\eta_i - \eta| > m^{-1/12}) = \gamma\, O(m^{-1/2}) = O(m^{-1/6}).$$

Let $\mu_i = E[S_{+B_i} \mid \mathcal{F}]$, $\sigma_i^2 = \mathrm{Var}(S_{+B_i} \mid \mathcal{F})$, and $\rho_i = E[|S_{+B_i} - \mu_i|^3 \mid \mathcal{F}]$. By Corollary 1, on $F$,

$$\mu_i = m_i \hat\mu(\eta_i) + O(1) \quad \text{and} \quad \sigma_i^2 = m_i \hat\sigma^2(\eta_i) + O(1).$$

Also, by the corollary, on $F$,

$$E[(S_{+B_i} - \mu_i)^4 \mid \mathcal{F}] = O(m_i^2),$$

and so, since $\rho_i \leq \{E[(S_{+B_i} - \mu_i)^4 \mid \mathcal{F}]\}^{3/4}$,

$$\rho_i = O(m_i^{3/2})$$

on $F$. The function $\hat\mu(\cdot)$ has a bounded second derivative in some neighborhood of $\eta_\infty$. Taylor expansion about $\eta_i = \eta$ gives

$$\mu_i = m_i \hat\mu(\eta) + (n_i - m_i \eta)\hat\mu'(\eta) + O(m_i (\eta_i - \eta)^2) + O(1)$$

on $F$, and summing over $i$ (the linear terms cancel since $\sum_i n_i = n = m\eta$),

$$\sum_i \mu_i = m \hat\mu(\eta) + V \cdot O(m^{2/3}) + O(m^{1/3}) \tag{6}$$

on $F$, where

$$V = \sum_i (\eta_i - \eta)^2.$$

Similarly, on $F$,

$$\sum_i \sigma_i^2 = m \hat\sigma^2(\eta) + V \cdot O(m^{2/3}) + O(m^{1/3}).$$

Next, by the Berry-Esseen theorem (cf. Feller (1971)),

$$\left| P(S_m \leq x \mid \mathcal{F}) - \Phi\!\left( \frac{x - \sum_i \mu_i}{\sqrt{\sum_i \sigma_i^2}} \right) \right| \leq \frac{6 \sum_i \rho_i}{\left( \sum_i \sigma_i^2 \right)^{3/2}}.$$

Now on $F$, $V = O(m^{1/6})$ (since $V \leq \gamma m^{-1/6}$), and so $\sum_i \sigma_i^2 \sim m \hat\sigma^2(\eta)$. Hence on $F$,

$$\frac{\sum_i \rho_i}{\left( \sum_i \sigma_i^2 \right)^{3/2}} = O(m^{-1/6}).$$

Since $P(S_m \leq x) = E\, P(S_m \leq x \mid \mathcal{F})$, the theorem will follow from the bounds presented provided

$$E\!\left[ \sup_x \left| \Phi\!\left( \frac{x - \sum_i \mu_i}{\sqrt{\sum_i \sigma_i^2}} \right) - \Phi\!\left( \frac{x - m\hat\mu}{\sqrt{m}\,\hat\sigma} \right) \right| 1_F \right] = O(m^{-1/6}).$$

By Lemma 5, the left hand side here is bounded by the sum of

$$\frac{e^{-1/2}}{\sqrt{2\pi}}\, E\!\left[ \left| \log \frac{\sqrt{m}\,\hat\sigma}{\sqrt{\sum_i \sigma_i^2}} \right| 1_F \right] \quad \text{and} \quad \frac{1}{\sqrt{2\pi}}\, E\!\left[ \frac{\left| \sum_i \mu_i - m\hat\mu \right|}{\sqrt{\sum_i \sigma_i^2}}\, 1_F \right].$$

Since $V = O(m^{1/6})$ on $F$, the argument of the expectation in the first of these expressions is $O(m^{-1/6})$. The second expression, by (6), is $O(m^{-1/6}) + O(m^{1/6}\, EV)$. But $\mathrm{Var}(\eta_i) = O(m^{-2/3})$, which implies $EV = O(m^{-1/3})$, and so the second expression is also $O(m^{-1/6})$.

Proof of Theorem 2. Since $Z_1 + \cdots + Z_m \sim NB(ma, m\eta)$,

$$P_\eta(Z_1 + \cdots + Z_m = n) = \frac{\Gamma(ma + n)\, (ma)^{ma} (m\eta)^n}{n!\, \Gamma(ma)\, (ma + m\eta)^{ma+n}}.$$

Using this, it is easy to check that

$$(Z_1, \ldots, Z_m) \mid Z_1 + \cdots + Z_m = n \sim DM(n, a 1_m), \tag{7}$$

noted as Lemma 1 of Chen (1980). Also, by Stirling's formula,

$$P_\eta(Z_1 + \cdots + Z_m = n) \sim \sqrt{\frac{a}{2\pi m \eta(a + \eta)}}$$

in limit (1). Using (7),

$$P[|S_m - m\hat\mu| > m\epsilon] \leq \frac{P_\eta\!\left[ \left| \sum_{i=1}^m W_i \right| > m\epsilon \right]}{P_\eta(Z_1 + \cdots + Z_m = n)},$$

where $W_i = h(Z_i) - \hat\mu$ (and $W = h(Z) - \hat\mu$ below), and the theorem will follow if

$$P_\eta\!\left[ \left| \sum_{i=1}^m W_i \right| > m\epsilon \right] = O(e^{-c\epsilon^2 m}).$$

This basically follows from Bernstein's inequality, but a bit of care is necessary to make sure the stated uniformity holds. Note that, adjusting $c$, it is sufficient to show that the asymptotic bound holds uniformly for all $\epsilon$ sufficiently small.

Let $\delta = \epsilon_0/4$. Since $e^x \leq 1 + x + x^2 e^{|x|}/2$ and $x^2 \leq 4 e^{|x|}/e^2$, for $0 \leq u \leq \delta$,

$$E_\eta e^{uW} \leq 1 + \frac{u^2}{2} E_\eta W^2 e^{\delta|W|} \leq 1 + \frac{2u^2}{\delta^2 e^2} E_\eta e^{2\delta|W|}.$$

Introducing a likelihood ratio and using the Schwarz inequality,

$$E_\eta e^{2\delta|W|} = E_{\eta_\infty}\!\left[ e^{2\delta|W|} \left( \frac{\eta}{\eta_\infty} \right)^{Z} \left( \frac{a + \eta_\infty}{a + \eta} \right)^{a+Z} \right] \leq \left\{ E_{\eta_\infty} \left( \frac{\eta}{\eta_\infty} \right)^{2Z} \left( \frac{a + \eta_\infty}{a + \eta} \right)^{2a+2Z} \right\}^{1/2} \left\{ E_{\eta_\infty} e^{\epsilon_0 |h(Z) - \hat\mu|} \right\}^{1/2}.$$

The first factor here converges to one by dominated convergence as $\eta \to \eta_\infty$, and the second factor remains bounded for $m$ and $n$ sufficiently large by (2). So there is a constant $c_0$ such that

$$E_\eta e^{uW} \leq 1 + c_0 u^2 \leq e^{c_0 u^2}, \quad 0 \leq u \leq \delta,$$

for $m$ and $n$ sufficiently large in limit (1). By Bernstein's inequality,

$$P_\eta\!\left[ \sum_{i=1}^m W_i > m\epsilon \right] \leq e^{m c_0 u^2 - m u \epsilon}$$

for $m$ and $n$ sufficiently large in limit (1). If $\epsilon \leq 2 c_0 \delta$, taking $u = \epsilon/(2c_0)$ this bound becomes $e^{-m\epsilon^2/(4c_0)}$. The theorem then follows from this bound and a corresponding bound for $P_\eta[\sum_{i=1}^m W_i < -m\epsilon]$.

Acknowledgment. We thank the referee for a careful reading of our manuscript.

References

Chen, Wen-Chen (1980). On the weak form of Zipf's law. Journal of Applied Probability, 17.
Chen, Wen-Chen (1981a). Limit theorems for general size distributions. Journal of Applied Probability, 18.
Chen, Wen-Chen (1981b). Some local limit theorems in the symmetric Dirichlet-multinomial urn models. Annals of the Institute of Statistical Mathematics, 33.
Csörgő, S. and Wu, W. B. (2000). Random graphs and the strong convergence of bootstrap means. Combinatorics, Probability and Computing.
Englund, Gunnar (1981). A remainder term estimate for the normal approximation in classical occupancy. Annals of Probability, 9.
Esty, Warren W. (1983). A normal limit law for a nonparametric estimator of the coverage of a random sample. Annals of Statistics.
Feller, W. (1971). An Introduction to Probability Theory and its Applications, Vol. II. Wiley, New York.
Fisher, R. A., Corbet, A. S. and Williams, C. B. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. J. Animal Ecol., 12.
Good, I. J. and Toulmin, G. H. (1956). The number of new species, and the increase in population coverage, when a sample is increased. Biometrika, 43.
Keener, R., Rothman, E. and Starr, N. (1987). Distributions on partitions. Annals of Statistics, 15.
Kopocinska, I. and Kopocinski, B. (1992). A new proof of generalized theorem of Irving Weiss. Periodica Mathematica Hungarica, 25.
Quine, M. P. (1979). A functional central limit theorem for a generalized occupancy problem. Stochastic Processes and their Applications, 9.

Quine, M. P. and Robinson, J. (1982). A Berry-Esseen bound for an occupancy problem. Annals of Probability, 10.
Rényi, A. (1962). Three new proofs and a generalization of a theorem of Irving Weiss. Magyar Tud. Akad. Mat. Kutato Int. Kozl., 7.
Rubin, D. B. (1981). The Bayesian bootstrap. Annals of Statistics, 9.
Weiss, Irving (1958). Limiting distributions in some occupancy problems. Annals of Mathematical Statistics.


Lecture 2 One too many inequalities

Lecture 2 One too many inequalities University of Illinois Department of Economics Spring 2017 Econ 574 Roger Koenker Lecture 2 One too many inequalities In lecture 1 we introduced some of the basic conceptual building materials of the course.

More information

Example continued. Math 425 Intro to Probability Lecture 37. Example continued. Example

Example continued. Math 425 Intro to Probability Lecture 37. Example continued. Example continued : Coin tossing Math 425 Intro to Probability Lecture 37 Kenneth Harris kaharri@umich.edu Department of Mathematics University of Michigan April 8, 2009 Consider a Bernoulli trials process with

More information

Refining the Central Limit Theorem Approximation via Extreme Value Theory

Refining the Central Limit Theorem Approximation via Extreme Value Theory Refining the Central Limit Theorem Approximation via Extreme Value Theory Ulrich K. Müller Economics Department Princeton University February 2018 Abstract We suggest approximating the distribution of

More information

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 59 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d

More information

Connectivity of addable graph classes

Connectivity of addable graph classes Connectivity of addable graph classes Paul Balister Béla Bollobás Stefanie Gerke January 8, 007 A non-empty class A of labelled graphs that is closed under isomorphism is weakly addable if for each graph

More information

Connectivity of addable graph classes

Connectivity of addable graph classes Connectivity of addable graph classes Paul Balister Béla Bollobás Stefanie Gerke July 6, 008 A non-empty class A of labelled graphs is weakly addable if for each graph G A and any two distinct components

More information

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations John R. Michael, Significance, Inc. and William R. Schucany, Southern Methodist University The mixture

More information

Concentration inequalities and the entropy method

Concentration inequalities and the entropy method Concentration inequalities and the entropy method Gábor Lugosi ICREA and Pompeu Fabra University Barcelona what is concentration? We are interested in bounding random fluctuations of functions of many

More information

Almost sure limit theorems for random allocations

Almost sure limit theorems for random allocations Almost sure limit theorems for random allocations István Fazekas and Alexey Chuprunov Institute of Informatics, University of Debrecen, P.O. Box, 400 Debrecen, Hungary, e-mail: fazekasi@inf.unideb.hu and

More information

A note on L convergence of Neumann series approximation in missing data problems

A note on L convergence of Neumann series approximation in missing data problems A note on L convergence of Neumann series approximation in missing data problems Hua Yun Chen Division of Epidemiology & Biostatistics School of Public Health University of Illinois at Chicago 1603 West

More information

Loose Hamilton Cycles in Random k-uniform Hypergraphs

Loose Hamilton Cycles in Random k-uniform Hypergraphs Loose Hamilton Cycles in Random k-uniform Hypergraphs Andrzej Dudek and Alan Frieze Department of Mathematical Sciences Carnegie Mellon University Pittsburgh, PA 1513 USA Abstract In the random k-uniform

More information

5 Introduction to the Theory of Order Statistics and Rank Statistics

5 Introduction to the Theory of Order Statistics and Rank Statistics 5 Introduction to the Theory of Order Statistics and Rank Statistics This section will contain a summary of important definitions and theorems that will be useful for understanding the theory of order

More information

X n D X lim n F n (x) = F (x) for all x C F. lim n F n(u) = F (u) for all u C F. (2)

X n D X lim n F n (x) = F (x) for all x C F. lim n F n(u) = F (u) for all u C F. (2) 14:17 11/16/2 TOPIC. Convergence in distribution and related notions. This section studies the notion of the so-called convergence in distribution of real random variables. This is the kind of convergence

More information

Math212a1413 The Lebesgue integral.

Math212a1413 The Lebesgue integral. Math212a1413 The Lebesgue integral. October 28, 2014 Simple functions. In what follows, (X, F, m) is a space with a σ-field of sets, and m a measure on F. The purpose of today s lecture is to develop the

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

Hyperparameter estimation in Dirichlet process mixture models

Hyperparameter estimation in Dirichlet process mixture models Hyperparameter estimation in Dirichlet process mixture models By MIKE WEST Institute of Statistics and Decision Sciences Duke University, Durham NC 27706, USA. SUMMARY In Bayesian density estimation and

More information

Existence and Uniqueness of Penalized Least Square Estimation for Smoothing Spline Nonlinear Nonparametric Regression Models

Existence and Uniqueness of Penalized Least Square Estimation for Smoothing Spline Nonlinear Nonparametric Regression Models Existence and Uniqueness of Penalized Least Square Estimation for Smoothing Spline Nonlinear Nonparametric Regression Models Chunlei Ke and Yuedong Wang March 1, 24 1 The Model A smoothing spline nonlinear

More information

Submitted to the Brazilian Journal of Probability and Statistics

Submitted to the Brazilian Journal of Probability and Statistics Submitted to the Brazilian Journal of Probability and Statistics Multivariate normal approximation of the maximum likelihood estimator via the delta method Andreas Anastasiou a and Robert E. Gaunt b a

More information

A noninformative Bayesian approach to domain estimation

A noninformative Bayesian approach to domain estimation A noninformative Bayesian approach to domain estimation Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 glen@stat.umn.edu August 2002 Revised July 2003 To appear in Journal

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

2. Function spaces and approximation

2. Function spaces and approximation 2.1 2. Function spaces and approximation 2.1. The space of test functions. Notation and prerequisites are collected in Appendix A. Let Ω be an open subset of R n. The space C0 (Ω), consisting of the C

More information

2 Inference for Multinomial Distribution

2 Inference for Multinomial Distribution Markov Chain Monte Carlo Methods Part III: Statistical Concepts By K.B.Athreya, Mohan Delampady and T.Krishnan 1 Introduction In parts I and II of this series it was shown how Markov chain Monte Carlo

More information

Bahadur representations for bootstrap quantiles 1

Bahadur representations for bootstrap quantiles 1 Bahadur representations for bootstrap quantiles 1 Yijun Zuo Department of Statistics and Probability, Michigan State University East Lansing, MI 48824, USA zuo@msu.edu 1 Research partially supported by

More information

Miscellanea Kernel density estimation and marginalization consistency

Miscellanea Kernel density estimation and marginalization consistency Biometrika (1991), 78, 2, pp. 421-5 Printed in Great Britain Miscellanea Kernel density estimation and marginalization consistency BY MIKE WEST Institute of Statistics and Decision Sciences, Duke University,

More information

A Bootstrap Test for Conditional Symmetry

A Bootstrap Test for Conditional Symmetry ANNALS OF ECONOMICS AND FINANCE 6, 51 61 005) A Bootstrap Test for Conditional Symmetry Liangjun Su Guanghua School of Management, Peking University E-mail: lsu@gsm.pku.edu.cn and Sainan Jin Guanghua School

More information

Stat 5101 Lecture Slides: Deck 8 Dirichlet Distribution. Charles J. Geyer School of Statistics University of Minnesota

Stat 5101 Lecture Slides: Deck 8 Dirichlet Distribution. Charles J. Geyer School of Statistics University of Minnesota Stat 5101 Lecture Slides: Deck 8 Dirichlet Distribution Charles J. Geyer School of Statistics University of Minnesota 1 The Dirichlet Distribution The Dirichlet Distribution is to the beta distribution

More information

The expansion of random regular graphs

The expansion of random regular graphs The expansion of random regular graphs David Ellis Introduction Our aim is now to show that for any d 3, almost all d-regular graphs on {1, 2,..., n} have edge-expansion ratio at least c d d (if nd is

More information

Experience Rating in General Insurance by Credibility Estimation

Experience Rating in General Insurance by Credibility Estimation Experience Rating in General Insurance by Credibility Estimation Xian Zhou Department of Applied Finance and Actuarial Studies Macquarie University, Sydney, Australia Abstract This work presents a new

More information

Order Statistics and Distributions

Order Statistics and Distributions Order Statistics and Distributions 1 Some Preliminary Comments and Ideas In this section we consider a random sample X 1, X 2,..., X n common continuous distribution function F and probability density

More information

Goodness-of-fit tests for the cure rate in a mixture cure model

Goodness-of-fit tests for the cure rate in a mixture cure model Biometrika (217), 13, 1, pp. 1 7 Printed in Great Britain Advance Access publication on 31 July 216 Goodness-of-fit tests for the cure rate in a mixture cure model BY U.U. MÜLLER Department of Statistics,

More information

UPPER DEVIATIONS FOR SPLIT TIMES OF BRANCHING PROCESSES

UPPER DEVIATIONS FOR SPLIT TIMES OF BRANCHING PROCESSES Applied Probability Trust 7 May 22 UPPER DEVIATIONS FOR SPLIT TIMES OF BRANCHING PROCESSES HAMED AMINI, AND MARC LELARGE, ENS-INRIA Abstract Upper deviation results are obtained for the split time of a

More information

On rate of convergence in distribution of asymptotically normal statistics based on samples of random size

On rate of convergence in distribution of asymptotically normal statistics based on samples of random size Annales Mathematicae et Informaticae 39 212 pp. 17 28 Proceedings of the Conference on Stochastic Models and their Applications Faculty of Informatics, University of Debrecen, Debrecen, Hungary, August

More information

Bayesian nonparametrics

Bayesian nonparametrics Bayesian nonparametrics 1 Some preliminaries 1.1 de Finetti s theorem We will start our discussion with this foundational theorem. We will assume throughout all variables are defined on the probability

More information

Distance between multinomial and multivariate normal models

Distance between multinomial and multivariate normal models Chapter 9 Distance between multinomial and multivariate normal models SECTION 1 introduces Andrew Carter s recursive procedure for bounding the Le Cam distance between a multinomialmodeland its approximating

More information

PCMI Introduction to Random Matrix Theory Handout # REVIEW OF PROBABILITY THEORY. Chapter 1 - Events and Their Probabilities

PCMI Introduction to Random Matrix Theory Handout # REVIEW OF PROBABILITY THEORY. Chapter 1 - Events and Their Probabilities PCMI 207 - Introduction to Random Matrix Theory Handout #2 06.27.207 REVIEW OF PROBABILITY THEORY Chapter - Events and Their Probabilities.. Events as Sets Definition (σ-field). A collection F of subsets

More information

Mathematics 426 Robert Gross Homework 9 Answers

Mathematics 426 Robert Gross Homework 9 Answers Mathematics 4 Robert Gross Homework 9 Answers. Suppose that X is a normal random variable with mean µ and standard deviation σ. Suppose that PX > 9 PX

More information

Chapter 1. Sets and probability. 1.3 Probability space

Chapter 1. Sets and probability. 1.3 Probability space Random processes - Chapter 1. Sets and probability 1 Random processes Chapter 1. Sets and probability 1.3 Probability space 1.3 Probability space Random processes - Chapter 1. Sets and probability 2 Probability

More information

Continuity. Chapter 4

Continuity. Chapter 4 Chapter 4 Continuity Throughout this chapter D is a nonempty subset of the real numbers. We recall the definition of a function. Definition 4.1. A function from D into R, denoted f : D R, is a subset of

More information

The properties of L p -GMM estimators

The properties of L p -GMM estimators The properties of L p -GMM estimators Robert de Jong and Chirok Han Michigan State University February 2000 Abstract This paper considers Generalized Method of Moment-type estimators for which a criterion

More information

5. Conditional Distributions

5. Conditional Distributions 1 of 12 7/16/2009 5:36 AM Virtual Laboratories > 3. Distributions > 1 2 3 4 5 6 7 8 5. Conditional Distributions Basic Theory As usual, we start with a random experiment with probability measure P on an

More information

Stochastic Comparisons of Weighted Sums of Arrangement Increasing Random Variables

Stochastic Comparisons of Weighted Sums of Arrangement Increasing Random Variables Portland State University PDXScholar Mathematics and Statistics Faculty Publications and Presentations Fariborz Maseeh Department of Mathematics and Statistics 4-7-2015 Stochastic Comparisons of Weighted

More information

In particular, if A is a square matrix and λ is one of its eigenvalues, then we can find a non-zero column vector X with

In particular, if A is a square matrix and λ is one of its eigenvalues, then we can find a non-zero column vector X with Appendix: Matrix Estimates and the Perron-Frobenius Theorem. This Appendix will first present some well known estimates. For any m n matrix A = [a ij ] over the real or complex numbers, it will be convenient

More information

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015 Part IA Probability Theorems Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.

More information

LECTURE 11: EXPONENTIAL FAMILY AND GENERALIZED LINEAR MODELS

LECTURE 11: EXPONENTIAL FAMILY AND GENERALIZED LINEAR MODELS LECTURE : EXPONENTIAL FAMILY AND GENERALIZED LINEAR MODELS HANI GOODARZI AND SINA JAFARPOUR. EXPONENTIAL FAMILY. Exponential family comprises a set of flexible distribution ranging both continuous and

More information

Lectures on Elementary Probability. William G. Faris

Lectures on Elementary Probability. William G. Faris Lectures on Elementary Probability William G. Faris February 22, 2002 2 Contents 1 Combinatorics 5 1.1 Factorials and binomial coefficients................. 5 1.2 Sampling with replacement.....................

More information

HYPERGRAPHS, QUASI-RANDOMNESS, AND CONDITIONS FOR REGULARITY

HYPERGRAPHS, QUASI-RANDOMNESS, AND CONDITIONS FOR REGULARITY HYPERGRAPHS, QUASI-RANDOMNESS, AND CONDITIONS FOR REGULARITY YOSHIHARU KOHAYAKAWA, VOJTĚCH RÖDL, AND JOZEF SKOKAN Dedicated to Professors Vera T. Sós and András Hajnal on the occasion of their 70th birthdays

More information

Lecture 4: Dynamic models

Lecture 4: Dynamic models linear s Lecture 4: s Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes hlopes@chicagobooth.edu

More information

Estimation of the functional Weibull-tail coefficient

Estimation of the functional Weibull-tail coefficient 1/ 29 Estimation of the functional Weibull-tail coefficient Stéphane Girard Inria Grenoble Rhône-Alpes & LJK, France http://mistis.inrialpes.fr/people/girard/ June 2016 joint work with Laurent Gardes,

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued

Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research

More information

Dependent hierarchical processes for multi armed bandits

Dependent hierarchical processes for multi armed bandits Dependent hierarchical processes for multi armed bandits Federico Camerlenghi University of Bologna, BIDSA & Collegio Carlo Alberto First Italian meeting on Probability and Mathematical Statistics, Torino

More information

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Submitted to the Annals of Statistics arxiv: arxiv:0000.0000 RATE-OPTIMAL GRAPHON ESTIMATION By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Network analysis is becoming one of the most active

More information

Convergence rates in weighted L 1 spaces of kernel density estimators for linear processes

Convergence rates in weighted L 1 spaces of kernel density estimators for linear processes Alea 4, 117 129 (2008) Convergence rates in weighted L 1 spaces of kernel density estimators for linear processes Anton Schick and Wolfgang Wefelmeyer Anton Schick, Department of Mathematical Sciences,

More information

Tail bound inequalities and empirical likelihood for the mean

Tail bound inequalities and empirical likelihood for the mean Tail bound inequalities and empirical likelihood for the mean Sandra Vucane 1 1 University of Latvia, Riga 29 th of September, 2011 Sandra Vucane (LU) Tail bound inequalities and EL for the mean 29.09.2011

More information

On the Optimum Asymptotic Multiuser Efficiency of Randomly Spread CDMA

On the Optimum Asymptotic Multiuser Efficiency of Randomly Spread CDMA On the Optimum Asymptotic Multiuser Efficiency of Randomly Spread CDMA Ralf R. Müller Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) Lehrstuhl für Digitale Übertragung 13 December 2014 1. Introduction

More information

Notes on Poisson Approximation

Notes on Poisson Approximation Notes on Poisson Approximation A. D. Barbour* Universität Zürich Progress in Stein s Method, Singapore, January 2009 These notes are a supplement to the article Topics in Poisson Approximation, which appeared

More information