

Biometrika Trust

Some Remarks on Overdispersion
Author(s): D. R. Cox
Source: Biometrika, Vol. 70, No. 1 (Apr., 1983)
Published by: Oxford University Press on behalf of Biometrika Trust

JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact support@jstor.org. Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use. Biometrika Trust, Oxford University Press are collaborating with JSTOR to digitize, preserve and extend access to Biometrika.

Biometrika (1983), 70, 1
Printed in Great Britain

Some remarks on overdispersion

BY D. R. COX
Department of Mathematics, Imperial College, London

SUMMARY

It is shown that maximum likelihood estimation of a simple model retains high efficiency in the presence of modest amounts of overdispersion. The main requirement is that the target parameter should be the moment parameter of an exponential family distribution, or more generally a parameter for which the order $n^{-1}$ bias of the maximum likelihood estimate is zero. Extensions for models with explanatory variables are outlined.

Some key words: Asymptotic theory; Dispersion test; Exponential distribution; Maximum likelihood; Negative binomial distribution; Pareto distribution; Poisson distribution; Quasilikelihood.

1. INTRODUCTION

Analysis of data via a single-parameter family of distributions implies in particular that the variance is determined by the mean. Familiar examples are the Poisson, binomial and exponential distributions. A very common practical complication is the presence of overdispersion, or more rarely underdispersion, leading to a failure of the variance-mean relation.

Overdispersion in general has two effects. One is that summary statistics have a larger variance than anticipated under the simple model. This has long been recognized and is commonly allowed for by an empirical inflation factor, either assumed from prior experience or estimated. The second effect is a possible loss of efficiency in using statistics appropriate for the single-parameter family.

There are two lines of approach. One is detailed representation of the overdispersion by a specific model. The other is to examine the effect on the conventional analysis of changes from the simple model. Two fairly familiar examples are studied in § 2 as a preliminary to a more general analysis.

2. TWO SIMPLE EXAMPLES

Suppose first that $Y_1, \ldots, Y_n$ are independently and identically distributed in a Poisson distribution of mean $\theta$, optimally estimated by $\bar{Y} = \sum Y_j/n$. Overdispersion is most simply represented by supposing that $Y_j$ has a Poisson distribution of mean $\Theta_j$, where $\Theta_1, \ldots, \Theta_n$ are independently and identically distributed with a gamma distribution of mean $\mu$ and index $\gamma$,

$$E(\Theta_j) = \mu, \qquad \operatorname{var}(\Theta_j) = \mu^2/\gamma. \tag{1}$$

Then $Y_1, \ldots, Y_n$ have a negative binomial distribution and the following conclusions hold.

(i) The sample mean $\bar{Y}$ remains efficient as an estimate of $\mu$, being the maximum likelihood estimate whether $\gamma$ is known or unknown.

(ii) If the Poisson distribution is parameterized in terms of some nonlinear function $\phi$ of $\theta$, for example $\phi = \log\theta$ or $e^{-\theta}$ or $1/\theta$, then the Poisson-based estimate, for example $\log\bar{Y}$ or $e^{-\bar{Y}}$, is not a consistent estimate of $E(\Phi)$, where $\Phi = \phi(\Theta)$, in the corresponding overdispersed model.

(iii) We have that

$$n \operatorname{var}(\bar{Y}) = \mu + \mu^2/\gamma \tag{2}$$

and the inflation is by a factor independent of $\mu$ if and only if $\gamma \propto \mu$. This means that in the compounding gamma distribution the variance is proportional to the mean, mimicking the Poisson distribution.

To some extent the properties (i)-(iii) are special both to the Poisson distribution and to the special choice of compounding distribution. We therefore consider a second example. Suppose that $Y_1, \ldots, Y_n$ are independently and identically exponentially distributed with mean $\theta$ and rate $\rho = 1/\theta$. Again $\bar{Y}$ estimates $\theta$. The simplest representation is to suppose the rate parameter to be a random variable $P$ having a gamma distribution of mean $\lambda$ and index $\gamma$:

$$E(P^r) = (\lambda/\gamma)^r\, \Gamma(\gamma + r)/\Gamma(\gamma).$$

Write $\mu = E(P^{-1}) = \gamma/\{\lambda(\gamma - 1)\}$. Then

$$E(\bar{Y}) = \mu, \qquad n \operatorname{var}(\bar{Y}) = \mu^2 \gamma/(\gamma - 2). \tag{3}$$

The individual $Y_j$ have density

$$\frac{\gamma\,(\gamma/\lambda)^{\gamma}}{(y + \gamma/\lambda)^{\gamma+1}} = \frac{\gamma\{(\gamma-1)\mu\}^{\gamma}}{\{y + (\gamma-1)\mu\}^{\gamma+1}}.$$

The analogues of (i)-(iii) for the Poisson distribution are as follows.

(i)' The sample mean $\bar{Y}$ is no longer fully efficient for estimating $\mu$. For known $\gamma$, the maximum likelihood estimate of $\mu$ has, by the usual calculations, asymptotically

$$n \operatorname{var}(\hat\mu) = \mu^2 (\gamma + 2)/\gamma, \tag{4}$$

so that the asymptotic efficiency of $\bar{Y}$ relative to $\hat\mu$ is $(\gamma^2 - 4)/\gamma^2$. The parameters $\mu$ and $\gamma$ are slightly nonorthogonal; if $\gamma$ is unknown, (4) and the asymptotic relative efficiency are increased by $O(1/\gamma^6)$. Recall that $1/\gamma^2$ is the fourth power of the coefficient of variation of $P$ and that if $v$ is the variance inflation factor, so that $\operatorname{var}(Y_j) = v\mu^2$, then $v = \gamma/(\gamma - 2)$; thus the asymptotic relative efficiency is $(2v - 1)/v^2$. Even when $v = 2$, $\gamma = 4$, representing substantial overdispersion, the asymptotic relative efficiency is 3/4. High asymptotic relative efficiency is retained for modest overdispersion.
(ii)' If the exponential distribution is parameterized in terms of some nonlinear function $\phi$ of $\theta$, for example $\phi = 1/\theta$ or $\phi = \log\theta$, then the exponential-based estimate, for example $1/\bar{Y}$ or $\log\bar{Y}$, is not a consistent estimate of $E(\Phi)$ in the corresponding overdispersed model.

(iii)' As already implicitly noted, for constant $\gamma$, the variance-mean relation for the compounding distribution mimics that for the exponential distribution. For constant $\gamma$, $n \operatorname{var}(\bar{Y}) = v\mu^2$, where $v$ is constant.

3. MORE GENERAL DISCUSSION

To treat the problem in a more general way, asymptotic arguments seem necessary. The limiting operations involved are, of course, purely technical devices for deriving approximations and care in formulation is needed. Here we consider a model with overdispersion on the borderline of detectability, i.e. such that there is a reasonable but not overwhelming chance of detecting the overdispersion from the data.

Suppose then that the initial model is that $Y_1, \ldots, Y_n$ are independently distributed with $Y_j$ having density $f_j(y; \theta)$, where $\theta$ is a scalar parameter. Suppose next that $\Theta_1, \ldots, \Theta_n$, the values of $\theta$ for the $n$ observations, are independently distributed with mean $\mu$ and variance $\tau/\sqrt{n}$. Note that this increases $\operatorname{var}(Y_j)$ by $O(1/\sqrt{n})$ and that this is on the borderline of detectability in the above sense.

Under suitable regularity conditions, the density of $Y_j$ in the overdispersed model is

$$E_\Theta\{f_j(y; \Theta)\} = f_j(y; \mu) + \frac{\tau}{2\sqrt{n}} \frac{\partial^2 f_j(y; \mu)}{\partial\mu^2} + O\!\left(\frac{1}{n}\right) = f_j(y; \mu)\left\{1 + \frac{\tau}{2\sqrt{n}}\, h_j(y; \mu) + O\!\left(\frac{1}{n}\right)\right\}, \tag{5}$$

where, with $g_j(y; \mu) = \log f_j(y; \mu)$,

$$h_j(y; \mu) = \left\{\frac{\partial g_j(y; \mu)}{\partial\mu}\right\}^2 + \frac{\partial^2 g_j(y; \mu)}{\partial\mu^2}. \tag{6}$$

Thus if $l_\tau$ and $l$ denote log likelihoods from respectively overdispersed and original models, then for a random vector $Y$

$$l_\tau(\mu, \tau; Y) = l(\mu; Y) + \frac{\tau}{2\sqrt{n}} \sum_j h_j(Y_j; \mu) + O_p(1) = l(\mu; Y) + \tfrac{1}{2}\tau\sqrt{n}\, k(\mu) + O_p(1), \tag{7}$$

where

$$n\, k(\mu) = E\left\{\sum_j h_j(Y_j; \mu);\ \mu_0\right\}. \tag{8}$$

In (8) the expectation is taken under the original model, $\tau = 0$, and at the true parameter-value, $\mu_0$ say. In (7), the remainder term is $O_p(1)$ for any fixed $\tau$ and for $\mu$ within $O(1/\sqrt{n})$ of its true value. Of course higher order terms in (7) could be evaluated.

Now a constant difference between $l_\tau$ and $l$ would be of no consequence. Thus we consider

$$\frac{\partial l_\tau}{\partial\mu} = \frac{\partial l}{\partial\mu} + \tfrac{1}{2}\tau\sqrt{n}\, k'(\mu) + \tfrac{1}{2}\frac{d\tau}{d\mu}\sqrt{n}\, k(\mu) + O_p(1), \tag{9}$$

where

$$k'(\mu) = 2 i_{110} + i_{001} = -i_{300} - i_{110}$$

in the fairly standard notation

$$n\, i_{rst} = \sum_j E\left[\left\{\frac{\partial g_j(Y_j; \mu)}{\partial\mu}\right\}^r \left\{\frac{\partial^2 g_j(Y_j; \mu)}{\partial\mu^2}\right\}^s \left\{\frac{\partial^3 g_j(Y_j; \mu)}{\partial\mu^3}\right\}^t\right],$$

so that $i_{rst}$ is an average generalized information. If $\tau$ is fixed or if $\tau$ is a parameter independent of $\mu$, $d\tau/d\mu = 0$. In fact because $k(\mu_0) = 0$ the term in $d\tau/d\mu$ is in any case negligible to the order considered below.
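As an illustrative check on this notation, not part of the original argument, $k'$ can be evaluated for a Poisson distribution under two parameterizations. The calculation below is a sketch: it assumes a common density $f_j = f$, uses only the standard Poisson central moments $E(Y-\mu)^2 = E(Y-\mu)^3 = \mu$, and introduces the symbol $\phi = \log\mu$ purely for illustration.

```latex
% Poisson with mean \mu: g(y;\mu) = y\log\mu - \mu - \log y!
\begin{align*}
\frac{\partial g}{\partial\mu} &= \frac{y-\mu}{\mu}, &
\frac{\partial^2 g}{\partial\mu^2} &= -\frac{y}{\mu^2}, \\
i_{300} &= \frac{E(Y-\mu)^3}{\mu^3} = \frac{1}{\mu^2}, &
i_{110} &= -\frac{E\{Y(Y-\mu)\}}{\mu^3} = -\frac{1}{\mu^2}, \\
k'(\mu) &= -i_{300} - i_{110} = 0.
\end{align*}
% Same distribution with \phi = \log\mu: g(y;\phi) = y\phi - e^{\phi} - \log y!
\begin{align*}
\frac{\partial g}{\partial\phi} &= y - e^{\phi}, &
\frac{\partial^2 g}{\partial\phi^2} &= -e^{\phi}, \\
i_{300} &= E(Y-\mu)^3 = \mu, &
i_{110} &= -\mu\,E(Y-\mu) = 0, \\
k'(\phi) &= -i_{300} - i_{110} = -\mu \neq 0.
\end{align*}
```

The derivative vanishes in the mean parameterization but not in the log parameterization, which matches the contrast between (i) and (ii) of § 2.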
To analyse the deviations of the maximum likelihood estimates $\hat\mu_\tau$ and $\hat\mu$ from the true value of $\mu$ we expand in Taylor series. It follows from (9) that $\hat\mu_\tau$ and $\hat\mu$ differ by an amount proportional to $\tau$ and of order $1/\sqrt{n}$, unless $k'(\mu) = 0$. This difference is of the same order as the standard error of the maximum likelihood estimate and hence cannot be ignored. The requirement that $k'(\mu) = 0$ is equivalent to choice of parameterization making the bias of the maximum likelihood estimate of $\mu$ under the simple model zero to order $n^{-1}$; see, for instance, Cox & Hinkley (1974, p. 310). When this condition holds, $\hat\mu_\tau$ and $\hat\mu$ differ by $O_p(1/n)$, whether $\tau$ is known or estimated. That is, simple maximum likelihood estimation retains full asymptotic efficiency when there is overdispersion on the borderline of detectability, provided that the target parameter is correctly chosen. In particular, in full exponential family problems, the target parameter is the expectation over the compounding distribution of the moment parameter of the exponential family. These results generalize (i) and (ii) of § 2.

The inflation in $\operatorname{var}(\hat\mu)$ induced by overdispersion can in principle be examined by more detailed expansions. In the full exponential family, with $\theta$ the moment parameter and $\bar{Y}$ the canonical statistic, $\hat\theta = \bar{Y}$ and in the simple model $n \operatorname{var}(\bar{Y}) = v(\theta)$, say. In the overdispersed model

$$n \operatorname{var}(\bar{Y}) = E\{v(\Theta)\} + \operatorname{var}(\Theta) \simeq v(\mu) + \operatorname{var}(\Theta)\{1 + \tfrac{1}{2} v''(\mu)\}.$$

There is thus inflation by an approximately constant factor if

$$\operatorname{var}(\Theta) \propto \frac{v(\mu)}{1 + \tfrac{1}{2} v''(\mu)}. \tag{10}$$

4. TEST FOR OVERDISPERSION

The family of densities derived from (5),

$$f(y; \mu)\{1 + \varepsilon h(y; \mu)\},$$

or in many ways preferably

$$f(y; \mu) \exp\{\varepsilon h(y; \mu)\}\, a(\varepsilon, \mu), \tag{11}$$

where $a(\varepsilon, \mu)$ is a normalizing constant, represents for positive $\varepsilon$ overdispersion, and for negative $\varepsilon$ underdispersion, relative to $f(y; \mu)$. This suggests as test statistic, for $\varepsilon = 0$ from a random sample $Y_1, \ldots, Y_n$, $\sum h(Y_j, \hat\mu_0)$, where $\hat\mu_0$ is the maximum likelihood estimate of $\mu$ when $\varepsilon = 0$. When $\varepsilon = 0$ the statistic has asymptotically zero mean and variance

$$n \operatorname{var}\left[\left\{\frac{\partial g(Y; \mu)}{\partial\mu}\right\}^2 + \frac{\partial^2 g(Y; \mu)}{\partial\mu^2} - \frac{i_{300} + i_{110}}{i_{200}} \frac{\partial g(Y; \mu)}{\partial\mu}\right] = n\{i_{400} + 2 i_{210} + i_{020} - (i_{300} + i_{110})^2/i_{200}\}, \tag{12}$$

where the $i$'s can be evaluated at $\hat\mu_0$. This is a rather general version of standard dispersion tests. When the parameter is the moment parameter or more generally defined as in § 3, $i_{300} + i_{110} = 0$ and the final term in (12) vanishes.

5. GENERALIZATIONS

The analysis sketched in §§ 2-4 can be generalized in various ways, the most important being as follows: (a) the parameter $\theta$ may be a vector; (b) each $Y_j$ may have its own parameter-value $\theta_j$, these being related by a regression model $\theta_j = \eta_j(x_j; \beta)$, where $x_j$ is a vector of explanatory variables for the $j$th individual, $\beta$ is a vector of regression coefficients, usually of dimension small compared with $n$, and $\eta_j$ is a function of known form; (c) the individuals may be grouped in 'clusters' in such a way that all individuals in the same 'cluster' have a common random term. This may be combined with the kind of dependence outlined in (b).

The generalization (a) is immediate. As a simple example suppose that the initial model is that $Y_1, \ldots, Y_n$ are independently normally distributed with mean $\lambda$ and standard deviation $\kappa$. The moment parameters of the normal distribution are the expected values of the canonical statistics $(Y_j, Y_j^2)$, that is are $\lambda$ and $\lambda^2 + \kappa^2$, so that in the overdispersed model in which $(\lambda, \kappa)$ becomes a random variable $(\Lambda, K)$, use of standard normal estimates leads to the estimation of

$$E(\Lambda), \qquad E(\Lambda^2 + K^2) = \{E(\Lambda)\}^2 + \operatorname{var}(\Lambda) + E(K^2).$$

Indeed it is clear that the standard estimates of mean and variance tend to $E(\Lambda)$ and $\operatorname{var}(\Lambda) + E(K^2)$. If in the compounding distribution $\Lambda$ and $K$ are independent, it is easy to show that the cumulant generating functions are related by

$$\psi_Y(t) = \psi_\Lambda(t) + \psi_{K^2}(\tfrac{1}{2} t^2),$$

where, for example, $\psi_Y(t) = \log E(e^{tY})$. Thus observation of $Y$ allows the estimation of odd order cumulants of $\Lambda$ and certain combinations of the even order cumulants of $\Lambda$ and of $K^2$, of which the sum of variances is the simplest case.

The discussion of § 3 can be adapted to apply to the regression model (b). For this we use (7) with, for $Y_j$, $\mu$ replaced by $\eta_j(x_j; \beta)$ and $\tau$ by $\tau\{\eta_j(x_j; \beta)\}$. We then examine, as in (9), the relation between the gradient vectors $\partial l_\tau/\partial\beta$ and $\partial l/\partial\beta$. The resulting maximum likelihood estimates differ by $O_p(1/n)$ if $E\{\partial h_j(Y_j; \mu)/\partial\mu\} = 0$; no special requirement is involved for the dependence of $\tau$ on $\eta$.
The implication is that the use of maximum likelihood estimation as for the standard model retains high asymptotic efficiency in the presence of modest overdispersion provided that the regression model being fitted is regarded as applying to expected values of parameters with zero $n^{-1}$ bias in simple estimation. Thus, for example, fitting by maximum likelihood of a log linear model for Poisson-distributed data retains high efficiency under borderline overdispersion, provided that the log linear model determines the expected value of the observed count. That is, if the log linear model specifies a Poisson distribution for $Y_j$ with $\log E(Y_j) = x_j^{\mathrm{T}}\beta$, the overdispersed model should have $E(Y_j) = \exp(x_j^{\mathrm{T}}\beta)$, with $\operatorname{var}(Y_j) > E(Y_j)$. An overdispersed model in which $Y_j$ is considered to have a Poisson distribution with $\log E(Y_j) = x_j^{\mathrm{T}}\beta + \delta_j$, where $\delta_j$ in turn is a random variable of expectation zero, would, however, lead to the inconsistencies exemplified in (ii) and (ii)' of § 2.

This discussion shows that the method of quasilikelihood (Wedderburn, 1974) is likely to have high efficiency for modest amounts of overdispersion.

Generalization (c), arising from 'clustering', will not be discussed in detail. If any explanatory variables are constant within clusters, the broad conclusions above will apply. If, however, there is a need to treat differently dependencies within and dependencies between clusters, a more elaborate discussion is necessary.
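The contrast between modelling $E(Y_j)$ directly and placing a zero-mean random term inside the logarithm can be illustrated numerically in the intercept-only case. The sketch below is not from the paper: it simulates the gamma-mixed Poisson model (1), compares $\bar{Y}$ and $\log\bar{Y}$ with their targets, and computes a standardized dispersion statistic, $(D - n)/\sqrt{2n}$ with $D = \sum (Y_j - \bar{Y})^2/\bar{Y}$, which is what (12) appears to reduce to for the Poisson mean parameterization. The function names, the sample size, and the values $\mu = 4$, $\gamma = 2$ are illustrative assumptions.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw one Poisson variate by Knuth's multiplication method (small means only)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_counts(n, mu, gamma, seed=0):
    """Return (thetas, ys) for gamma-mixed Poisson counts as in (1).

    gamma=None gives the simple Poisson model, with every theta equal to mu.
    """
    rng = random.Random(seed)
    if gamma is None:
        thetas = [mu] * n
    else:
        # random.gammavariate(shape, scale) has mean shape * scale = mu
        # and variance shape * scale**2 = mu**2 / gamma.
        thetas = [rng.gammavariate(gamma, mu / gamma) for _ in range(n)]
    ys = [poisson_draw(rng, theta) for theta in thetas]
    return thetas, ys


def dispersion_z(ys):
    """Standardized Poisson dispersion statistic (D - n) / sqrt(2n)."""
    n = len(ys)
    ybar = sum(ys) / n
    d = sum((y - ybar) ** 2 for y in ys) / ybar
    return (d - n) / math.sqrt(2 * n)


if __name__ == "__main__":
    mu, gamma, n = 4.0, 2.0, 20000  # assumed illustrative values
    thetas, ys = simulate_counts(n, mu, gamma, seed=1)
    ybar = sum(ys) / n
    mean_log_theta = sum(math.log(t) for t in thetas) / n
    print("sample mean of Y      :", round(ybar, 3), "(target mu =", mu, ")")
    print("log of sample mean    :", round(math.log(ybar), 3))
    print("average of log(Theta) :", round(mean_log_theta, 3))
    print("dispersion z statistic:", round(dispersion_z(ys), 1))
```

With these settings the sample mean should sit close to $\mu$, while $\log\bar{Y}$ should exceed the average of $\log\Theta_j$ by a clear margin, as Jensen's inequality suggests. Passing `gamma=None` gives a pure Poisson sample, for which the dispersion statistic should be of order one.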

REFERENCES

COX, D. R. & HINKLEY, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.
WEDDERBURN, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61.

[Received May 1982]

