
Generalized Linear Model under the Extended Negative Multinomial Model and Cancer Incidence

S. Lahiri & Sunil K. Dhar
Department of Mathematical Sciences, CAMS, New Jersey Institute of Technology, Newark, NJ-07111, USA

SUMMARY

The generalized linear model for a multi-way contingency table for several independent populations that follow the extended negative multinomial distribution is introduced. This model is an extension of the negative multinomial log-linear model. The parameters of the new model are estimated by the quasi-likelihood method and the corresponding score function, which gives a closed-form estimate of the regression parameters. The goodness-of-fit test for the model is also discussed. An application of the log-linear model under the generalized inverse sampling scheme to cancer incidence data is given as an example of this model.

Keywords: generalized inverse sampling, extended negative multinomial distribution, quasi-likelihood, asymptotic distribution.

1 INTRODUCTION

A generalized linear model (GLIM) for functions of the extended negative multinomial frequency counts for $s$ independent subpopulations is defined along the lines of Grizzle, Starmer and Koch (1969), Bonett (1985a) and Bonett (1985b). The extended negative multinomial model (ENMn) is defined in Dhar (1995), and the extended negative multinomial log-linear model is defined in Lahiri and Dhar (2008).

Let $(f_{-ri}, \ldots, f_{-2i}, f_{-1i}, f_{1i}, f_{2i}, \ldots, f_{ni})'$ be $s$ sets of count observations from $s$ subpopulations labeled $i = 1, 2, \ldots, s$, and let $j = -r, \ldots, -1, 1, \ldots, n$ be the set of response categories, as summarized in Table 1 below. Here, $'$ denotes the transpose of a matrix. To define the model, the following notation is needed. Denote the vectors $f_{-i} = (f_{-ri}, \ldots, f_{-2i}, f_{-1i})'$ and $f_{i} = (f_{1i}, f_{2i}, \ldots, f_{ni})'$, and their corresponding expected value vectors $\mu_{-i} = (\mu_{-ri}, \ldots, \mu_{-1i})'$ and $\mu_{i} = (\mu_{1i}, \ldots, \mu_{ni})'$, respectively. The vector $f^{(i)} = (f_{-i}', f_{i}')'$ is assumed to follow an extended negative multinomial distribution with parameters $k_i = \sum_{j=1}^{r} f_{-ji}$ and

$$\mu^{(i)} = (\mu_{-i}', \mu_{i}')' = k_i \Big(\sum_{j=1}^{r} p_{-ji}\Big)^{-1} p_i,$$

where $p_i = (p_{-ri}, \ldots, p_{-1i}, p_{1i}, \ldots, p_{ni})'$ is from Lahiri and Dhar (2008, Section 2). Let $f$ be the augmented vector defined by $f = (f^{(1)\prime}, f^{(2)\prime}, \ldots, f^{(s)\prime})'$ with expected value $\mu = (\mu^{(1)\prime}, \mu^{(2)\prime}, \ldots, \mu^{(s)\prime})'$.

The generalized linear model for count data is defined by

$$y = X\beta + \delta \tag{1.1}$$

where $y = (y_1', y_2', \ldots, y_s')'$ and $y_i = (g(f_{-ri}), \ldots, g(f_{-2i}), g(f_{-1i}), g(f_{1i}), g(f_{2i}), \ldots, g(f_{ni}))'$, $i = 1, 2, \ldots, s$, and $g$ is a function from $\mathbb{R}$ to $\mathbb{R}$ that has an inverse. Let $X$ be a known design matrix of full rank $q$ and dimensions $(r+n)s \times q$ ($q \le [r+n]s$), with intercept, main effects, and interaction effects; $\beta$ is a $q \times 1$ vector of unknown non-random parameters; and $\delta$ is an $(r+n)s \times 1$ unobservable error vector such that, with $y_0 = (g(k_1), g(k_2), \ldots, g(k_s))'$, $y_0 - A\beta = 0$ and $y - X\beta$ is asymptotically zero as $k \to \infty$. Here, $k = \sum_{i=1}^{s} k_i$; the fractions of the total sample, namely $k_i/k = w_i$; $f_0 = (k_1, k_2, \ldots, k_s)'$; and $A = (a_1, a_2, \ldots, a_s)'$, of dimensions $s \times q$, are known constants, with linearly independent constraints $y_0 - A\beta = 0$.

The data consist of observing the vector $f$ along with the design matrix $X$, as shown in the $r \times c$ contingency Table 1. This model is known as the generalized linear model under the extended inverse sampling scheme, or the extended negative multinomial distribution, with $g$ in (1.1) as the link function. It is specifically called the linear or the log-linear model according to whether $g$ is the linear or the log function. A closed-form estimator of the model parameters, an estimate of the covariance matrix, and the general Wald test are derived under the assumption of extended negative multinomial sampling. The commonly used GLIM does not accommodate contingency tables from inverse sampling schemes, as has also been observed by Bonett (1985a).
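As a minimal sketch of how the quantities entering model (1.1) can be assembled from a table of counts, consider the log link. The 2-population table, the variable names and the column ordering below are hypothetical and serve only as an illustration.

```python
import math

# Hypothetical table for s = 2 populations with r = 2 rare categories and
# n = 2 other categories. Columns are ordered (-2, -1, 1, 2); counts are made up.
r, n = 2, 2
table = [
    [2, 3, 10, 20],
    [1, 4, 15, 12],
]

# k_i is the total count over the r rare categories of population i.
k_i = [sum(row[:r]) for row in table]
k = sum(k_i)
w = [ki / k for ki in k_i]      # sample fractions w_i = k_i / k

g = math.log                    # log link; needs strictly positive counts

# y stacks g(f) population by population: a vector of length (r + n) * s.
y = [g(count) for row in table for count in row]

# y_0 = (g(k_1), ..., g(k_s))' is the right-hand side of the constraint A beta = y_0.
y0 = [g(ki) for ki in k_i]
```

With the log link a zero cell would make `g` undefined, so a sketch like this presumes every cell count is positive.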

Observations from a subpopulation are sampled following generalized inverse sampling, which is described below for the sake of completeness. The resulting observed frequency counts can be summarized in the following $s \times (r+n)$ contingency table.

1.1 Definition of ENMn

Consider a sequence of independent trials, where one of the events $A_i$ occurs with probability $p_i$, $i \in \{-r, \ldots, -1, 1, \ldots, n\}$, $\sum_{i=-r,\, i \ne 0}^{n} p_i = 1$. Suppose that $A_{-r}, A_{-(r-1)}, \ldots, A_{-1}$ are the rare events. Let $f_i$ represent the frequency with which $A_i$ occurs. The $f_i$'s are counted until a total of $k$ (a predetermined value) observations of the rare events $A_i$, $i \in \{-r, \ldots, -1\}$, has been obtained. Then $f = (f_{-r}, \ldots, f_{-1}, f_{1}, \ldots, f_{n})'$ is said to follow an extended negative multinomial distribution with parameters $k$ and $p = (p_{-r}, \ldots, p_{-1}, p_{1}, \ldots, p_{n})'$, with joint probability mass function

$$\frac{\big(\sum_{i=1}^{n} f_i + k - 1\big)!}{\prod_{i=1}^{n} f_i! \,(k-1)!} \Big(\sum_{i=1}^{r} p_{-i}\Big)^{k} p_1^{f_1} \cdots p_n^{f_n} \; \frac{k!}{\prod_{i=1}^{r} f_{-i}!} \,(p^{*}_{-1})^{f_{-1}} \cdots (p^{*}_{-r})^{f_{-r}},$$

$f_i \ge 0$, $i \in \{-r, \ldots, -1, 1, \ldots, n\}$, where $p^{*}_{-i} = p_{-i} / \sum_{j=1}^{r} p_{-j}$, $i = 1, \ldots, r$, $\sum_{i=-r,\, i \ne 0}^{n} p_i = 1$ and $k = \sum_{i=1}^{r} f_{-i}$. An observation that falls in cell $(j, i)$ has probability $p_{ji}$.
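The sampling scheme and the mass function above can be sketched in a few lines. This is an illustration only: the function names `sample_enm` and `enm_logpmf` and the probabilities used are hypothetical, and the sampler simply runs trials until $k$ rare events have occurred.

```python
import math
import random


def sample_enm(k, p_neg, p_pos, rng):
    """Draw one ENMn observation: run trials until k rare events have occurred.

    p_neg holds the probabilities of the r rare events, p_pos those of the
    n other events; together they must sum to 1.
    """
    r, n = len(p_neg), len(p_pos)
    weights = list(p_neg) + list(p_pos)
    f_neg, f_pos = [0] * r, [0] * n
    while sum(f_neg) < k:
        c = rng.choices(range(r + n), weights=weights)[0]
        if c < r:
            f_neg[c] += 1
        else:
            f_pos[c - r] += 1
    return f_neg, f_pos


def enm_logpmf(f_neg, f_pos, p_neg, p_pos):
    """Log of the joint mass function, with k = sum(f_neg) >= 1."""
    k = sum(f_neg)
    p0 = sum(p_neg)  # probability that a trial yields a rare event
    # Negative multinomial factor for the non-rare counts.
    out = math.lgamma(sum(f_pos) + k) - math.lgamma(k)
    out -= sum(math.lgamma(x + 1) for x in f_pos)
    out += k * math.log(p0)
    out += sum(x * math.log(p) for x, p in zip(f_pos, p_pos))
    # Multinomial factor for the rare counts, with p*_{-i} = p_{-i} / p0.
    out += math.lgamma(k + 1) - sum(math.lgamma(x + 1) for x in f_neg)
    out += sum(x * math.log(p / p0) for x, p in zip(f_neg, p_neg))
    return out


# Hypothetical parameters: r = 2 rare events, n = 2 other events.
p_neg, p_pos = [0.1, 0.2], [0.3, 0.4]
draw_neg, draw_pos = sample_enm(5, p_neg, p_pos, random.Random(0))
```

By construction the rare counts of every draw add up to $k$, which is the stopping rule of the generalized inverse sampling scheme.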

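Table 1: Frequency Distribution (columns are the categories of responses)

| Population | $-r$ | $\cdots$ | $-2$ | $-1$ | $1$ | $2$ | $\cdots$ | $n$ |
|---|---|---|---|---|---|---|---|---|
| 1 | $f_{-r1}$ | $\cdots$ | $f_{-21}$ | $f_{-11}$ | $f_{11}$ | $f_{21}$ | $\cdots$ | $f_{n1}$ |
| 2 | $f_{-r2}$ | $\cdots$ | $f_{-22}$ | $f_{-12}$ | $f_{12}$ | $f_{22}$ | $\cdots$ | $f_{n2}$ |
| $\vdots$ | $\vdots$ | | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | | $\vdots$ |
| $s$ | $f_{-rs}$ | $\cdots$ | $f_{-2s}$ | $f_{-1s}$ | $f_{1s}$ | $f_{2s}$ | $\cdots$ | $f_{ns}$ |

In order to define the covariance matrix of $f$, the following notation is introduced. Let $p^{*}_{-ji} = p_{-ji} / \sum_{j=1}^{r} p_{-ji}$, and let $D_{\mu_i}$ be a diagonal matrix with the elements of the vector $\mu_i$, $n \times 1$, along the main diagonal. Then the covariance matrix of $f$ is given by the $(r+n)s \times (r+n)s$ block diagonal matrix $\Sigma_f$ of rank $(r+n)s$ with

$$\Sigma_f^{(i)}(k_i) = \begin{pmatrix} \Sigma_1^{(i)}(k_i) & 0 \\ 0' & \Sigma_2^{(i)}(k_i) \end{pmatrix}, \quad i = 1, 2, \ldots, s \tag{1.2}$$

as the blocks, where $0$ is the $r \times n$ zero matrix and where

$$\Sigma_1^{(i)}(k_i) = \begin{pmatrix} k_i p^{*}_{-1i}(1 - p^{*}_{-1i}) & -k_i p^{*}_{-1i} p^{*}_{-2i} & \cdots & -k_i p^{*}_{-1i} p^{*}_{-ri} \\ -k_i p^{*}_{-1i} p^{*}_{-2i} & k_i p^{*}_{-2i}(1 - p^{*}_{-2i}) & \cdots & -k_i p^{*}_{-2i} p^{*}_{-ri} \\ \vdots & \vdots & \ddots & \vdots \\ -k_i p^{*}_{-1i} p^{*}_{-ri} & -k_i p^{*}_{-2i} p^{*}_{-ri} & \cdots & k_i p^{*}_{-ri}(1 - p^{*}_{-ri}) \end{pmatrix} \tag{1.3}$$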

and

$$\Sigma_2^{(i)}(k_i) = \frac{\mu_i \mu_i'}{k_i} + D_{\mu_i}. \tag{1.4}$$

2 Estimation

The quasi-likelihood estimator of $\beta$ (Myers et al. 2002, Section 5.4) with constraints is given in this section. An efficient estimator of $\beta$, under the constraint $A\beta = y_0$, minimizes

$$X^2 = (y - X\beta)' \hat\Omega (y - X\beta) - \lambda'(A\beta - y_0), \tag{2.5}$$

where $\hat\Omega^{-1} = [\partial y / \partial f]' \, \hat\Sigma_f^{*} \, [\partial y / \partial f]$; $\hat\Sigma_f^{*}$ is the estimate of the block diagonal matrix $\Sigma_f^{*}$ whose blocks are given by the matrix in (1.2) with $k_i = 1$, multiplied by $w_i$, $i = 1, 2, \ldots, s$, where $w_i = k_i / k$ is known; $\lambda$ is an $s \times 1$ vector of Lagrange multipliers; and $y_0 = (g(k_1), g(k_2), \ldots, g(k_s))'$. The maximum likelihood estimator of the covariance matrix, $\hat\Sigma_f^{*}$, is obtained by replacing the elements of $p$ and $\mu$ (which is a function of $p$) by their corresponding sample proportions based on $f$ (Dhar 1995). The expression for the true parameter $\Omega^{-1}$ is given in equation (3.7) in the following section.

Minimizing (2.5) with respect to $\beta$, using multivariate calculus, gives the quasi-likelihood estimator of $\beta$ with constraints in the following closed form:

$$\hat\beta = \tilde\beta - (X'\hat\Omega X)^{-1} A' \left[A (X'\hat\Omega X)^{-1} A'\right]^{-1} (A\tilde\beta - y_0), \tag{2.6}$$

where $\tilde\beta = (X'\hat\Omega X)^{-1} X'\hat\Omega y$ is the unconstrained estimator; this is shown more generally by Ferguson (1958). The estimated covariance matrix of $\hat\beta$ is

$$\hat\Sigma_{\hat\beta} = (X'\hat\Omega X)^{-1} - (X'\hat\Omega X)^{-1} A' \left[A (X'\hat\Omega X)^{-1} A'\right]^{-1} A (X'\hat\Omega X)^{-1}.$$

The cell frequencies that follow the extended negative multinomial constraints are predicted by $\hat y = X\hat\beta$.
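The closed form (2.6) is a weighted least squares fit followed by a projection onto the constraint set. A minimal NumPy sketch is given below; the function name `constrained_wls` and the randomly generated inputs are hypothetical, and the weight matrix is taken as given rather than estimated from an ENMn table.

```python
import numpy as np


def constrained_wls(X, y, Omega, A, y0):
    """Return (beta_tilde, beta_hat, cov_hat) following the closed form (2.6)."""
    XtOX_inv = np.linalg.inv(X.T @ Omega @ X)
    # Unconstrained weighted least squares estimator.
    beta_tilde = XtOX_inv @ X.T @ Omega @ y
    # M = (X'OX)^{-1} A' [A (X'OX)^{-1} A']^{-1}
    M = XtOX_inv @ A.T @ np.linalg.inv(A @ XtOX_inv @ A.T)
    # Constrained estimator: correct beta_tilde so that A beta_hat = y0.
    beta_hat = beta_tilde - M @ (A @ beta_tilde - y0)
    # Estimated covariance matrix of the constrained estimator.
    cov_hat = XtOX_inv - M @ A @ XtOX_inv
    return beta_tilde, beta_hat, cov_hat


# Made-up inputs, only to exercise the formula.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
y = rng.normal(size=6)
Omega = np.diag(rng.uniform(0.5, 2.0, size=6))  # any symmetric positive definite weight
A = np.array([[1.0, 1.0, 0.0]])
y0 = np.array([2.0])

beta_tilde, beta_hat, cov_hat = constrained_wls(X, y, Omega, A, y0)
```

Two properties follow directly from the algebra and make convenient checks: `A @ beta_hat` reproduces `y0`, and `A @ cov_hat` is zero, since the constrained directions carry no sampling variability.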

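3 Hypothesis Testing

In this section, the test of the general linear hypothesis $H_0 : H\beta = h$ versus its negation is derived, where $H$ is an $(n+r) \times q$ known matrix of rank $(n+r)$ and $h$ is a known $(n+r) \times 1$ vector. The Wald statistic

$$W = (H\tilde\beta - h)' \left(H (X'\hat\Omega X)^{-1} H'\right)^{-1} (H\tilde\beta - h) / k,$$

where $k = \sum_{i=1}^{s} k_i$ and $\tilde\beta = (X'\hat\Omega X)^{-1} X'\hat\Omega y$, is used to test these hypotheses. The asymptotic distribution of $W$ is now derived.

Theorem 1: Under the true model as described by (1.1), if $g$ is a differentiable function from $\mathbb{R}$ to $\mathbb{R}$ and $\sup_k E\|y - X\beta\|^{1+\eta} < \infty$ for some $\eta > 0$, where $\|\cdot\|$ represents the Euclidean norm, then $W \xrightarrow{d} \chi^2_{(n+r)}$ as $k \to \infty$.

Proof: Define

$$B_{ji}^{(m)} = \begin{cases} 1, & \text{if the } m\text{th unit drawn at random from the } i\text{th population belongs to the } (-j)\text{th category}, \\ 0, & \text{otherwise}, \end{cases}$$

for $m = 1, \ldots, k_i$; $j = 1, \ldots, r$; $i = 1, \ldots, s$, and

$$D_{ji}^{(m)} = \begin{cases} 1, & \text{if the } m\text{th unit drawn at random from the } i\text{th population belongs to the } j\text{th category}, \\ 0, & \text{otherwise}, \end{cases}$$

for $m = 1, \ldots, k_i$; $j = 1, \ldots, n$; $i = 1, \ldots, s$.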

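Then

$$\begin{aligned} f^{(i)} = (f_{-i}', f_{i}')' &= (f_{-1i}, f_{-2i}, \ldots, f_{-ri}, f_{1i}, f_{2i}, \ldots, f_{ni})' \\ &= \Big(\sum_{m=1}^{k_i} B_{1i}^{(m)}, \sum_{m=1}^{k_i} B_{2i}^{(m)}, \ldots, \sum_{m=1}^{k_i} B_{ri}^{(m)}, \sum_{m=1}^{k_i} D_{1i}^{(m)}, \ldots, \sum_{m=1}^{k_i} D_{ni}^{(m)}\Big)' \\ &= \sum_{m=1}^{k_i} \big(B_{1i}^{(m)}, B_{2i}^{(m)}, \ldots, B_{ri}^{(m)}, D_{1i}^{(m)}, \ldots, D_{ni}^{(m)}\big)' \\ &= \sum_{m=1}^{k_i} C_i^{(m)}, \quad i = 1, \ldots, s. \end{aligned}$$

Therefore, $C_i^{(m)}$, with components indexed by $j = -r, \ldots, -1, 1, \ldots, n$, follows an extended negative multinomial distribution with parameters $(1, \mu^{(i)}/k_i)$, where $\mu^{(i)}/k_i = \big(\sum_{j=1}^{r} p_{-ji}\big)^{-1} p_i$. Now, the $C_i^{(m)}$ are iid random vectors with mean $\mu^{(i)}/k_i$ and covariance matrix $\Sigma_f^{(i)}(1)$, for $m = 1, \ldots, k_i$. The central limit theorem (CLT) yields

$$\sqrt{k_i}\,\Big(\frac{1}{k_i}\sum_{m=1}^{k_i} C_i^{(m)} - \mu^{(i)}/k_i\Big) \xrightarrow{d} N\big(0, \Sigma_f^{(i)}(1)\big) \quad \text{as } k_i \to \infty, \; i = 1, 2, \ldots, s.$$

This implies that $\frac{1}{\sqrt{k_i}}(f^{(i)} - \mu^{(i)}) \xrightarrow{d} N\big(0, \Sigma_f^{(i)}(1)\big)$ as $k_i \to \infty$; hence $\frac{1}{\sqrt{k}}(f - \mu) \xrightarrow{d} N(0, \Sigma_f^{*})$ as $k \to \infty$, where $k_i/k = w_i$ is a known constant and $\Sigma_f^{*}$ is the block diagonal matrix, of rank $s(n+r)$, with blocks $w_i \Sigma_f^{(i)}(1)$, $i = 1, 2, \ldots, s$, along the diagonal. Therefore,

$$\frac{1}{\sqrt{k}}(y - \mu^{*}) \xrightarrow{d} N\big(0, D_{\mu}(\mu^{*})\, \Sigma_f^{*}\, D_{\mu}'(\mu^{*})\big) \tag{3.7}$$

as $k \to \infty$, where $\mu^{*}$ is equal to $g(\mu)$, i.e., $g$ applied to each component of $\mu$, and $D_{\mu}(\mu^{*})$ is the differential of $\mu^{*}$ with respect to $\mu$ (Theorem A, Serfling, 2002, p. 122). Here, $\Omega^{-1} = D_{\mu}(\mu^{*})\, \Sigma_f^{*}\, D_{\mu}'(\mu^{*})$. Now consider $\tilde\beta = (X'\hat\Omega X)^{-1} X'\hat\Omega y$, as described in Section 2, which converges as follows: $\frac{1}{\sqrt{k}}(\tilde\beta - \beta) \xrightarrow{d} N\big(0, (X'\Omega X)^{-1}\big)$ as $k \to \infty$ (Theorem A, Serfling, 2002, p. 122).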

Similarly, as $k \to \infty$, $\frac{1}{\sqrt{k}}(H\tilde\beta - H\beta) \xrightarrow{d} N\big(0, H(X'\Omega X)^{-1}H'\big)$, where $H$ is an $(n+r) \times q$ known matrix with rank $(n+r)$. Therefore, the fact that $x'\big(H(X'\Omega X)^{-1}H'\big)^{-1}x$ is a continuous function of $x$ gives, under $H_0$,

$$W = (H\tilde\beta - h)' \left(H (X'\hat\Omega X)^{-1} H'\right)^{-1} (H\tilde\beta - h) / k \xrightarrow{d} \chi^2_{(n+r)}$$

(Rao, 1973, p. 188). Hence the proof.

To evaluate the goodness of fit of this model with linearly independent constraints, one can consider the statistic $(y - X\hat\beta)' \hat\Omega (y - X\hat\beta) / k$. Similar to Bonett (1989 and 1985a) and Haber et al. (1986), one can show that this statistic converges in distribution to the chi-square distribution with $s(n+r) - q$ degrees of freedom. This statement holds under the conditions of Theorem 1, and its proof is similar to that of Theorem 1 as well as to Bonett (1985a, Section 4, page 128) or Bonett (1989), which uses Haber et al. (1986).

4 An Example

This section illustrates the application of an extended negative multinomial model to a given data set from $s = 3$ subpopulations. The application involves modeling the incidence of several related diseases in different cities. There is no maximum likelihood estimate for the shape parameter in general. An estimate based on quantiles of Pearson's chi-squared statistic is applied to model the cancer incidence for three cities in Ohio. This approach is similar to those proposed by Williams (1982) and Breslow (1984) and, more recently, by Waller and Zelterman (1997) for estimating over-dispersion parameters.
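Returning briefly to the statistics of Section 3, both the Wald statistic and the goodness-of-fit statistic are quadratic forms scaled by $k$. They can be sketched as follows; the function names and the toy numbers are hypothetical, and the inputs (the estimate, the weight matrix and the inverse of $X'\hat\Omega X$) are assumed to have been computed beforehand.

```python
import numpy as np


def wald_statistic(beta_tilde, XtOX_inv, H, h, k):
    """W = (H b - h)' [H (X'OX)^{-1} H']^{-1} (H b - h) / k."""
    d = H @ beta_tilde - h
    middle = H @ XtOX_inv @ H.T
    return float(d @ np.linalg.solve(middle, d)) / k


def goodness_of_fit(y, X, beta, Omega, k):
    """(y - X b)' Omega (y - X b) / k."""
    res = y - X @ beta
    return float(res @ Omega @ res) / k


# Toy numbers chosen so the results are easy to verify by hand.
W = wald_statistic(
    beta_tilde=np.array([1.0, 2.0]),
    XtOX_inv=np.eye(2),
    H=np.array([[1.0, 0.0]]),
    h=np.array([0.0]),
    k=4,
)
G = goodness_of_fit(
    y=np.array([1.0, 2.0, 3.0]),
    X=np.eye(3),
    beta=np.array([1.0, 2.0, 2.0]),
    Omega=np.eye(3),
    k=2,
)
```

In an application, `W` would be compared with a quantile of the chi-square distribution whose degrees of freedom equal the number of rows of `H`, and `G` with a quantile of the chi-square distribution with $s(n+r) - q$ degrees of freedom.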

4.1 Model of Cancer Incidence for Three Cities

Suppose that the following table gives the cancer incidence for the three largest cities according to the site of primary cancer during a particular year. It is assumed that the overall disease incidence may be higher (lower) in one location than in another, but that this increase (decrease) is not disease-specific; that is, the relative frequencies of the diseases do not change across cities. Waller and Zelterman (1997) used the log-linear model with the negative multinomial distribution to fit the cancer deaths, during 1989, in the three largest Ohio cities. The structure of the data used in this example is similar to the one described in Waller and Zelterman (1997), but hypothetical numbers have been used to illustrate the proposed model. The sites of primary tumors are as follows: 1 = eye, 2 = oral cavity, 3 = gallbladder, 4 = lung, 5 = breast, 6 = genitals, 7 = urinary organs, 8 = leukemia and 9 = lymphatic tissues.

Table 2: Cancer Deaths in the Three Different Cities (rows: City 1, City 2, City 3; the cell counts are missing from this copy)

Our objective is to fit the data with an appropriate model. Poisson models are appropriate when the sample mean and the sample variance are equal. Multinomial models can be used when the cell counts are negatively correlated, and negative multinomial models are used when the cell counts are positively correlated. It can be observed from the above data that some of the frequency counts are positively correlated, while others are negatively correlated. In this situation, the extended negative multinomial model is expected to be the most appropriate one to fit the data because of its covariance structure (equations 1.2 to 1.4).

Consider the log-linear model of means with no interaction between city and disease type,

$$\ln \mu_{js} = \mu + \alpha_s^{\text{city}} + \beta_j^{\text{disease}},$$

where $\alpha_s^{\text{city}}$, $s = 1, 2, 3$, is the effect due to city and $\beta_j^{\text{disease}}$, $j = 1, 2, \ldots, 9$, is the effect due to disease. Note that each of these two sets of effects adds up to zero. No interaction between city and disease type implies that an increase (decrease) in incidence of one disease is accompanied by a similar increase (decrease) in incidence of the same disease across all the other cities. It has been assumed that the state has a large amount of manufacturing and industry. So, if there is an environmental cause for a high incidence of one type of cancer, this may translate into a high incidence of another type of cancer.

To describe the design matrix $X$ for the above model, let $0$ be the $8 \times 1$ zero column vector, let $1$ represent the $8 \times 1$ vector of ones, and let $I$ represent the $8 \times 8$ identity matrix. Then $X$ is given by a block matrix composed of $0$, $1$ and $I$; only its three identity blocks $I$ survive in this copy.

Let $f_{js}$ denote the number of cancer deaths of site $j$ in city $s$, and use the extended negative multinomial distribution, since the counts exhibit both positive and negative covariances. The test statistic $\chi^2_s(k_s)$ has the following form:

$$\chi^2_s(k_s) = \sum_j \frac{(f_{js} - \hat\mu_{js})^2}{\hat\mu_{js}}, \tag{4.8}$$

where $k_s$ is the shape parameter of the extended negative multinomial distribution. The estimated means $\hat\mu_{js}$ are the same as described in Section 2. The above test statistic not only measures fit, but also suggests a method of estimating $k_s$. For each city, solve for $k_s$ such that the statistic in (4.8) equals the median of its limiting chi-squared distribution with 8 d.f. This method is similar to the one proposed by Waller and Zelterman (1997), which is used to find values of the shape parameter. Using these values of $k_s$, the model parameters are computed by maximizing the log-likelihood with penalized constraints, iteratively using the Newton-Raphson algorithm, implemented in an R program. Improving the estimate of $k_s$ improves the model.
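One design matrix consistent with the no-interaction model and its sum-to-zero constraints can be sketched as follows. The effects coding, the city-major ordering of the 27 cells and the name `effects_design` are assumptions made for illustration; this is not a reproduction of the authors' displayed matrix.

```python
import numpy as np


def effects_design(n_city=3, n_disease=9):
    """Design matrix for ln(mu_js) = mu + alpha_s + beta_j with sum-to-zero effects.

    Rows are ordered city by city, diseases 1..n_disease within each city.
    The last level of each factor is coded as minus the sum of the others.
    """
    def coding(levels):
        # (levels x (levels - 1)) block: identity stacked on a row of -1's.
        return np.vstack([np.eye(levels - 1), -np.ones((1, levels - 1))])

    ones_city = np.ones((n_city, 1))
    ones_disease = np.ones((n_disease, 1))
    intercept = np.ones((n_city * n_disease, 1))
    city_cols = np.kron(coding(n_city), ones_disease)      # alpha_1, alpha_2
    disease_cols = np.kron(ones_city, coding(n_disease))   # beta_1, ..., beta_8
    return np.hstack([intercept, city_cols, disease_cols])


X = effects_design()
```

Under this coding each city contributes one $8 \times 8$ identity block in the disease columns, followed by a row of $-1$'s for the ninth site, and the resulting matrix has 27 rows and $1 + 2 + 8 = 11$ columns.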

Acknowledgments

Professor Hira Lal Koul is my statistics guru. I wish to thank him for all the knowledge he has given me.

Bibliography

[1] Bonett, D.G. (1985a). A linear negative multinomial model. Statist. & Probability Letters, 3.
[2] Bonett, D.G. (1985b). The negative multinomial logit model. Commun. Statist.-Theory Methods, 3.
[3] Bonett, D.G. (1989). Pearson chi-square estimator and test for log-linear models with expected frequencies subject to linear constraints. Statist. & Probability Letters, 8.
[4] Breslow, N.E. (1984). Extra-Poisson variation in log-linear models. Applied Statistics, 33.
[5] Dhar, S.K. (1995). Extension of a negative multinomial model. Commun. Statist.-Theory Methods, 24(1).
[6] Ferguson, T.S. (1958). A method of generating best asymptotically normal estimates with application to the estimating of bacterial densities. Ann. Math. Stat., 33.
[7] Grizzle, J.E., Starmer, C.F. and Koch, G.G. (1969). Analysis of categorical variables by linear models. Biometrics, 25.
[8] Haber, M. and Brown, M.B. (1986). Maximum likelihood methods for log-linear models when expected frequencies are subject to linear constraints. J. Amer. Statist. Assoc., 81.
[9] Serfling, R.J. (2002). Approximation theorems of mathematical statistics. New York: Wiley.
[10] Lahiri, S. and Dhar, S.K. (2008). Log-linear Modeling under Generalized Inverse Sampling Scheme. Communications in Statistics - Theory and Methods, 37(8).
[11] Myers, R.H., Montgomery, D.C., Vining, G.G. and Robinson, T.J. (2002). Generalized linear models with application in engineering and the sciences, second edition. New York: Wiley.
[12] Rao, C.R. (1973). Linear statistical inference and its applications, second edition. New York: Wiley.
[13] Williams, D.A. (1982). Extra-binomial variation in logistic linear models. Applied Statistics, 31.
[14] Waller, L.A. and Zelterman, D. (1997). Log-linear modeling with the negative multinomial distribution. Biometrics, 53.
