

Generalized Linear Model under the Extended Negative Multinomial Model and Cancer Incidence

S. Lahiri & Sunil K. Dhar
Department of Mathematical Sciences, CAMS
New Jersey Institute of Technology, Newark, NJ-07111, USA
E-mail: dhar@njit.edu

SUMMARY

The generalized linear model for a multi-way contingency table for several independent populations that follow extended negative multinomial distributions is introduced. This model extends the negative multinomial log-linear model. The parameters of the new model are estimated by the quasi-likelihood method and the corresponding score function, which gives a closed-form estimate of the regression parameters. The goodness-of-fit test for the model is also discussed. An application of the log-linear model under the generalized inverse sampling scheme, representing cancer incidence data, is given as an example of this model.

Keywords: generalized inverse sampling, extended negative multinomial distribution, quasi-likelihood, asymptotic distribution.

1 INTRODUCTION

A generalized linear model (GLIM) for functions of the extended negative multinomial frequency counts for $s$ independent subpopulations is defined along the lines of Grizzle, Starmer and Koch (1969), Bonett (1985a) and Bonett (1985b). The extended negative multinomial model (ENMn) is defined in Dhar (1995) and the extended negative multinomial log-linear model is defined in Lahiri and Dhar (2008).

Let $(f_{-ri}, \ldots, f_{-2i}, f_{-1i}, f_{1i}, f_{2i}, \ldots, f_{ni})'$ be $s$ sets of count observations from $s$ subpopulations labeled $i = 1, 2, \ldots, s$, and let $j = -r, \ldots, -1, 1, \ldots, n$ be the set of response categories as summarized in Table 1 below. Here, $'$ denotes the transpose of a matrix. To define the model the following notation is needed. Denote the vectors $f_i^{-} = (f_{-ri}, \ldots, f_{-2i}, f_{-1i})'$ and $f_i = (f_{1i}, f_{2i}, \ldots, f_{ni})'$, and their corresponding expected value vectors $\mu_i^{-} = (\mu_{-ri}, \ldots, \mu_{-1i})'$ and $\mu_i = (\mu_{1i}, \ldots, \mu_{ni})'$, respectively. The vector $f^{(i)} = ((f_i^{-})', f_i')'$ is assumed to follow an extended negative multinomial distribution with parameters $k_i = \sum_{j=1}^{r} f_{-ji}$ and

$$\mu^{(i)} = ((\mu_i^{-})', \mu_i')' = k_i \Big(\sum_{j=1}^{r} p_{-ji}\Big)^{-1} p_i,$$

where $p_i = (p_{-ri}, \ldots, p_{-1i}, p_{1i}, \ldots, p_{ni})'$ is from Lahiri and Dhar (2008, Section 2). Let $f$ be the augmented vector defined by $f = (f^{(1)\prime}, f^{(2)\prime}, \ldots, f^{(s)\prime})'$ with expected value $\mu = (\mu^{(1)\prime}, \mu^{(2)\prime}, \ldots, \mu^{(s)\prime})'$.

The generalized linear model for count data is defined by

$$y = X\beta + \delta, \tag{1.1}$$

where $y = (y_1', y_2', \ldots, y_s')'$,

$$y_i = \big(g(f_{-ri}), \ldots, g(f_{-2i}), g(f_{-1i}), g(f_{1i}), g(f_{2i}), \ldots, g(f_{ni})\big)', \quad i = 1, 2, \ldots, s,$$

and $g$ is a function from $\mathbb{R}$ to $\mathbb{R}$ which has an inverse. Here $X$ is a known design matrix of full rank $q$ and of dimensions $(r+n)s \times q$ ($q \le [r+n]s$), with intercept, main effects, and interaction effects; $\beta$ is a $q \times 1$ vector of unknown non-random parameters; and $\delta$ is an $(r+n)s \times 1$ unobservable error vector. With $y_0 = (g(k_1), g(k_2), \ldots, g(k_s))'$, the model satisfies $y_0 - A\beta = 0$, and $y - X\beta$ is asymptotically zero as $k \to \infty$. Here, $k = \sum_{i=1}^{s} k_i$; the fractions of the total sample, namely $k_i/k = w_i$, the vector $f_0 = (k_1, k_2, \ldots, k_s)'$, and the $s \times q$ matrix $A = (a_1, a_2, \ldots, a_s)'$ are known constants, with linearly independent constraints $y_0 - A\beta = 0$. The data consist of the observed vector $f$ along with the design matrix $X$, as shown in the $r \times c$ contingency table (Table 1).

This model is known as the generalized linear model under the extended inverse sampling scheme, or the extended negative multinomial distribution, with $g$ in (1.1) as the link function. It is specifically called the linear or the log-linear model according to whether $g$ is the linear or the log function. A closed-form estimator of the model parameters, an estimate of the covariance matrix, and the general Wald test are derived under the assumption of extended negative multinomial sampling. The commonly used GLIM does not accommodate contingency tables from inverse sampling schemes, as has also been observed by Bonett (1985a).

Observations from a subpopulation are sampled following generalized inverse sampling, which is described below for the sake of completeness. The resulting observed frequency counts can be summarized in an $s \times (r + n)$ contingency table (Table 1).

1.1 Definition of ENMn

Consider a sequence of independent trials, where one of the events $A_i$ occurs with probability $p_i$, $i \in \{-r, \ldots, -1, 1, \ldots, n\}$, $\sum_{i=-r,\, i \neq 0}^{n} p_i = 1$. Suppose that $A_{-r}, A_{-(r-1)}, \ldots, A_{-1}$ are the rare events. Let $f_i$ represent the frequency with which $A_i$, $i \in \{-r, \ldots, -1\}$, occurs. Here, the $f_i$'s are counted until we get a total of $k$ (a predetermined value) observations of at least one of the $A_i$'s, $i \in \{-r, \ldots, -1\}$. Then $f = (f_{-r}, \ldots, f_{-1}, f_1, \ldots, f_n)'$ is said to follow an extended negative multinomial distribution with parameters $k$ and $p = (p_{-r}, \ldots, p_{-1}, p_1, \ldots, p_n)'$, with the joint probability density function given as

$$\frac{\big(\sum_{i=1}^{n} f_i + k - 1\big)!}{\prod_{i=1}^{n} f_i!\,(k-1)!}\,\Big(\sum_{i=1}^{r} p_{-i}\Big)^{k} p_1^{f_1} \cdots p_n^{f_n} \;\cdot\; \frac{k!}{\prod_{i=1}^{r} f_{-i}!}\,(p^{*}_{-1})^{f_{-1}} \cdots (p^{*}_{-r})^{f_{-r}},$$

for $f_i \ge 0$ and $f_{-i} \ge 0$, $i \in \{-r, \ldots, -1, 1, \ldots, n\}$, where

$$p^{*}_{-i} = \frac{p_{-i}}{\sum_{i=1}^{r} p_{-i}}, \quad i = 1, \ldots, r, \qquad \sum_{i=-r,\, i \neq 0}^{n} p_i = 1, \qquad k = \sum_{i=1}^{r} f_{-i}.$$

An observation that falls in cell $(j, i)$ has probability $p_{ji}$.
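The sampling scheme and the density above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `sample_enm` and `log_pmf_enm` and the probability values are assumptions made for the example, and the rare-event probabilities are passed in the order $(p_{-1}, \ldots, p_{-r})$.

```python
import math
import random


def sample_enm(k, p_rare, p_common, rng):
    """Draw one ENMn(k, p) observation by generalized inverse sampling.

    p_rare holds (p_{-1}, ..., p_{-r}); p_common holds (p_1, ..., p_n).
    Trials continue until the rare events have occurred k times in total.
    """
    probs = list(p_rare) + list(p_common)
    r = len(p_rare)
    counts = [0] * len(probs)
    rare_total = 0
    while rare_total < k:
        cell = rng.choices(range(len(probs)), weights=probs)[0]
        counts[cell] += 1
        if cell < r:
            rare_total += 1
    return counts[:r], counts[r:]


def log_pmf_enm(f_rare, f_common, p_rare, p_common):
    """Log of the ENMn density; k is recovered as the sum of the rare counts."""
    k = sum(f_rare)
    p0 = sum(p_rare)
    # Negative multinomial factor for the non-rare counts.
    out = math.lgamma(sum(f_common) + k) - math.lgamma(k)
    out -= sum(math.lgamma(f + 1) for f in f_common)
    out += k * math.log(p0)
    out += sum(f * math.log(p) for f, p in zip(f_common, p_common))
    # Multinomial factor for the rare counts, with p*_{-i} = p_{-i} / p0.
    out += math.lgamma(k + 1) - sum(math.lgamma(f + 1) for f in f_rare)
    out += sum(f * math.log(p / p0) for f, p in zip(f_rare, p_rare))
    return out


# Hypothetical probabilities: two rare events and two non-rare events.
rng = random.Random(2024)
f_rare, f_common = sample_enm(5, [0.05, 0.10], [0.35, 0.50], rng)
print(f_rare, f_common)
print(log_pmf_enm(f_rare, f_common, [0.05, 0.10], [0.35, 0.50]))
```

By construction the rare counts returned by `sample_enm` always sum to `k`, which matches the identity $k = \sum_{i=1}^{r} f_{-i}$ in the definition.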

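Table 1: Frequency Distribution (columns are the categories of responses)

$$
\begin{array}{c|cccccccc}
\text{Population} & -r & \cdots & -2 & -1 & 1 & 2 & \cdots & n \\ \hline
1 & f_{-r1} & \cdots & f_{-21} & f_{-11} & f_{11} & f_{21} & \cdots & f_{n1} \\
2 & f_{-r2} & \cdots & f_{-22} & f_{-12} & f_{12} & f_{22} & \cdots & f_{n2} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & \vdots & & \vdots \\
s & f_{-rs} & \cdots & f_{-2s} & f_{-1s} & f_{1s} & f_{2s} & \cdots & f_{ns}
\end{array}
$$

In order to define the covariance matrix of $f$, the following notation is introduced. Let $p^{*}_{-ji} = p_{-ji} / \sum_{j=1}^{r} p_{-ji}$ and let $D_{\mu_i}$ be a diagonal matrix with the elements of the $n \times 1$ vector $\mu_i$ along the main diagonal. Then the covariance matrix of $f$ is given by the $(r+n)s \times (r+n)s$ block diagonal matrix $\Sigma_f$ of rank $(r+n)s$ with

$$
\Sigma_f^{(i)}(k_i) = \begin{pmatrix} \Sigma_1^{(i)}(k_i) & 0 \\ 0' & \Sigma_2^{(i)}(k_i) \end{pmatrix}, \quad i = 1, 2, \ldots, s, \tag{1.2}
$$

as the blocks, where $0$ is the $r \times n$ zero matrix and where

$$
\Sigma_1^{(i)}(k_i) = \begin{pmatrix}
k_i p^{*}_{-1i}(1 - p^{*}_{-1i}) & -k_i p^{*}_{-1i} p^{*}_{-2i} & \cdots & -k_i p^{*}_{-1i} p^{*}_{-ri} \\
-k_i p^{*}_{-1i} p^{*}_{-2i} & k_i p^{*}_{-2i}(1 - p^{*}_{-2i}) & \cdots & -k_i p^{*}_{-2i} p^{*}_{-ri} \\
\vdots & \vdots & \ddots & \vdots \\
-k_i p^{*}_{-1i} p^{*}_{-ri} & -k_i p^{*}_{-2i} p^{*}_{-ri} & \cdots & k_i p^{*}_{-ri}(1 - p^{*}_{-ri})
\end{pmatrix} \tag{1.3}
$$

and

$$
\Sigma_2^{(i)}(k_i) = \frac{\mu_i \mu_i'}{k_i} + D_{\mu_i}. \tag{1.4}
$$

2 Estimation

The quasi-likelihood estimator of $\beta$ (Myers et al. 2002, Section 5.4) with constraints is given in this section. An efficient estimator of $\beta$, under the constraint $A\beta = y_0$, minimizes

$$
X^2 = (y - X\beta)' \hat\Omega (y - X\beta) - \lambda'(A\beta - y_0), \tag{2.5}
$$

where $\hat\Omega^{-1} = [\partial y / \partial f]' \hat\Sigma_f [\partial y / \partial f]$; $\hat\Sigma_f$ is the estimate of the block diagonal matrix whose $i$th block is $w_i$ times the matrix in (1.2) evaluated at $k_i = 1$, $i = 1, 2, \ldots, s$, with $w_i = k_i / k$ known; $\lambda$ is an $s \times 1$ vector of Lagrange multipliers; and $y_0 = (g(k_1), g(k_2), \ldots, g(k_s))'$. The maximum likelihood estimator of the covariance matrix, $\hat\Sigma_f$, is obtained by replacing the elements of $p$ and $\mu$ (which is a function of $p$) by their corresponding sample proportions based on $f$ (Dhar 1995). The expression for the true parameter $\Omega^{-1}$ is given in equation (3.7) of the following section.

Minimizing (2.5) with respect to $\beta$ using multivariate calculus, the quasi-likelihood estimator of $\beta$ with constraints has the following closed form:

$$
\hat\beta = \tilde\beta - (X'\hat\Omega X)^{-1} A' \big[A (X'\hat\Omega X)^{-1} A'\big]^{-1} (A\tilde\beta - y_0), \tag{2.6}
$$

where $\tilde\beta = (X'\hat\Omega X)^{-1} X'\hat\Omega y$, which is shown more generally by Ferguson (1958). The estimated covariance matrix of $\hat\beta$ is

$$
\hat\Sigma_{\hat\beta} = (X'\hat\Omega X)^{-1} - (X'\hat\Omega X)^{-1} A' \big[A (X'\hat\Omega X)^{-1} A'\big]^{-1} A (X'\hat\Omega X)^{-1}.
$$

The cell frequencies that satisfy the extended negative multinomial constraints are predicted by $\hat y = X\hat\beta$.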
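The closed form (2.6) can be sketched with plain Python lists. This is only an illustration under stated assumptions: the weight matrix `Omega` is passed in already estimated, the helper names (`transpose`, `matmul`, `solve`, `constrained_estimate`) are hypothetical, and the small linear solver stands in for whatever numerical library one would normally use.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Return a^{-1} b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, ra)) + list(map(float, rb)) for ra, rb in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(aug[i][c]))
        if abs(aug[piv][c]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[c], aug[piv] = aug[piv], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for i in range(n):
            if i != c:
                factor = aug[i][c]
                aug[i] = [vi - factor * vc for vi, vc in zip(aug[i], aug[c])]
    return [row[n:] for row in aug]


def constrained_estimate(X, Omega, y, A, y0):
    """Return (beta_tilde, beta_hat) as in (2.6), for a given weight matrix."""
    Xt_Omega = matmul(transpose(X), Omega)
    M = matmul(Xt_Omega, X)                              # X' Omega X
    beta_tilde = solve(M, matmul(Xt_Omega, [[v] for v in y]))
    Minv_At = solve(M, transpose(A))                     # (X' Omega X)^{-1} A'
    S = matmul(A, Minv_At)                               # A (X' Omega X)^{-1} A'
    gap = [[u[0] - v] for u, v in zip(matmul(A, beta_tilde), y0)]
    correction = matmul(Minv_At, solve(S, gap))
    beta_hat = [b[0] - c[0] for b, c in zip(beta_tilde, correction)]
    return [b[0] for b in beta_tilde], beta_hat


# Toy numbers (hypothetical): four cells, two parameters, one constraint.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
Omega = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 1, 0], [0, 0, 0, 2]]
y = [0.9, 2.1, 2.9, 4.2]
A = [[1, 1]]
y0 = [2.0]
beta_tilde, beta_hat = constrained_estimate(X, Omega, y, A, y0)
print(beta_tilde, beta_hat)
```

Whatever the data, the returned `beta_hat` satisfies the constraint $A\hat\beta = y_0$ up to rounding, which is a quick sanity check on any implementation of (2.6).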

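3 Hypothesis Testing

In this section, the test of the general linear hypothesis $H_0: H\beta = h$ versus its negation is derived, where $H$ is an $(n+r) \times q$ known matrix of rank $(n+r)$ and $h$ is a known $(n+r) \times 1$ vector. The Wald statistic

$$
W = (H\tilde\beta - h)' \big(H (X'\hat\Omega X)^{-1} H'\big)^{-1} (H\tilde\beta - h) / k,
$$

where $k = \sum_{i=1}^{s} k_i$ and $\tilde\beta = (X'\hat\Omega X)^{-1} X'\hat\Omega y$ is the estimator of Section 2, is used to test these hypotheses. The asymptotic distribution of $W$ is now derived.

Theorem 1: Under the true model described by (1.1), assume that $g$ is a differentiable function from $\mathbb{R}$ to $\mathbb{R}$ and that $\sup E\|y - X\beta\|^{1+\eta} < \infty$ for some $\eta > 0$, where $\|\cdot\|$ represents the Euclidean norm. Then $W \xrightarrow{d} \chi^2_{(n+r)}$ as $k \to \infty$.

Proof: Define

$$
B_{ji}^{(m)} = \begin{cases} 1, & \text{if the } m\text{th unit drawn at random from the } i\text{th population belongs to the } (-j)\text{th category}, \\ 0, & \text{otherwise}, \end{cases}
$$

for $m = 1, \ldots, k_i$; $j = 1, \ldots, r$; $i = 1, \ldots, s$, and

$$
D_{ji}^{(m)} = \begin{cases} 1, & \text{if the } m\text{th unit drawn at random from the } i\text{th population belongs to the } j\text{th category}, \\ 0, & \text{otherwise}, \end{cases}
$$

for $m = 1, \ldots, k_i$; $j = 1, \ldots, n$; $i = 1, \ldots, s$.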

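Then

$$
\begin{aligned}
f^{(i)} &= \big((f_i^{-})', f_i'\big)' = (f_{-1i}, f_{-2i}, \ldots, f_{-ri}, f_{1i}, f_{2i}, \ldots, f_{ni})' \\
&= \Big(\sum_{m=1}^{k_i} B_{1i}^{(m)}, \sum_{m=1}^{k_i} B_{2i}^{(m)}, \ldots, \sum_{m=1}^{k_i} B_{ri}^{(m)}, \sum_{m=1}^{k_i} D_{1i}^{(m)}, \ldots, \sum_{m=1}^{k_i} D_{ni}^{(m)}\Big)' \\
&= \sum_{m=1}^{k_i} \big(B_{1i}^{(m)}, B_{2i}^{(m)}, \ldots, B_{ri}^{(m)}, D_{1i}^{(m)}, \ldots, D_{ni}^{(m)}\big)' \\
&= \sum_{m=1}^{k_i} C_i^{(m)}, \quad i = 1, \ldots, s.
\end{aligned}
$$

Therefore $C_i^{(m)}$, with components indexed by $j = -r, \ldots, -1, 1, \ldots, n$, follows an extended negative multinomial distribution with parameters $(1, \mu^{(i)}/k_i)$, where $\mu^{(i)}/k_i = \big(\sum_{j=1}^{r} p_{-ji}\big)^{-1} p_i$. Now, the $C_i^{(m)}$ are iid random vectors with mean $\mu^{(i)}/k_i$ and covariance matrix $\Sigma_f^{(i)}(1)$, for $m = 1, \ldots, k_i$. The central limit theorem (CLT) yields

$$
\sqrt{k_i}\,\Big(\frac{1}{k_i}\sum_{m=1}^{k_i} C_i^{(m)} - \mu^{(i)}/k_i\Big) \xrightarrow{d} N\big(0, \Sigma_f^{(i)}(1)\big) \quad \text{as } k_i \to \infty, \; i = 1, 2, \ldots, s.
$$

This implies that

$$
\frac{1}{\sqrt{k_i}}\big(f^{(i)} - \mu^{(i)}\big) \xrightarrow{d} N\big(0, \Sigma_f^{(i)}(1)\big) \quad \text{as } k_i \to \infty,
$$

hence

$$
\frac{1}{\sqrt{k}}(f - \mu) \xrightarrow{d} N(0, \Sigma_f) \quad \text{as } k \to \infty,
$$

where $k_i/k = w_i$ is a known constant and $\Sigma_f$ is the block diagonal matrix of rank $s(n+r)$ with blocks $w_i \Sigma_f^{(i)}(1)$, $i = 1, 2, \ldots, s$, along the diagonal. Therefore,

$$
\frac{1}{\sqrt{k}}(y - \mu^{*}) \xrightarrow{d} N\big(0, D_\mu(\mu^{*})\, \Sigma_f\, D_\mu'(\mu^{*})\big) \tag{3.7}
$$

as $k \to \infty$, where $\mu^{*}$ is equal to $g(\mu)$, i.e., $g$ applied to each component of $\mu$, and $D_\mu(\mu^{*})$ is the differential of $\mu^{*}$ with respect to $\mu$ (Theorem A, Serfling, 2002, p. 122). Here, $\Omega^{-1} = D_\mu(\mu^{*})\, \Sigma_f\, D_\mu'(\mu^{*})$. Now consider $\tilde\beta = (X'\hat\Omega X)^{-1} X'\hat\Omega y$, as described in Section 2, which converges as follows:

$$
\frac{1}{\sqrt{k}}(\tilde\beta - \beta) \xrightarrow{d} N\big(0, (X'\Omega X)^{-1}\big) \quad \text{as } k \to \infty
$$

(Theorem A, Serfling, 2002, p. 122).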

Similarly, as $k \to \infty$,

$$
\frac{1}{\sqrt{k}}(H\tilde\beta - H\beta) \xrightarrow{d} N\big(0, H (X'\Omega X)^{-1} H'\big),
$$

where $H$ is an $(n+r) \times q$ known matrix with rank $(n+r)$. Therefore, since $x'\big(H (X'\Omega X)^{-1} H'\big)^{-1} x$ is a continuous function of $x$, under $H_0$

$$
W = (H\tilde\beta - h)' \big(H (X'\hat\Omega X)^{-1} H'\big)^{-1} (H\tilde\beta - h) / k \xrightarrow{d} \chi^2_{(n+r)}
$$

(Rao, 1973, p. 188). Hence the proof.

To evaluate the goodness-of-fit of this model with linearly independent constraints, one can consider the statistic $(y - X\hat\beta)' \hat\Omega (y - X\hat\beta) / k$. Similar to Bonett (1985a, 1989) and Haber and Brown (1986), one can show that this statistic converges in distribution to the chi-square distribution with $s(n+r) - q$ degrees of freedom. This statement holds under the conditions of Theorem 1, and its proof is similar to that of Theorem 1 as well as to Bonett (1985a, Section 4, page 128) or Bonett (1989), which uses Haber and Brown (1986).

4 An Example

This section applies the extended negative multinomial model to a data set from $s = 3$ subpopulations. The application involves modeling the incidence of several related diseases in different cities. There is no maximum likelihood estimate for the shape parameter $k$ in general. An estimate based on quantiles of Pearson's chi-squared statistic is applied to model the cancer incidence for three cities in Ohio. This approach is similar to those proposed by Williams (1982), Breslow (1984), and, more recently, Waller and Zelterman (1997) for estimating over-dispersion parameters.
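Before turning to the data, the first two moments that the proof of Theorem 1 relies on, namely the mean $k\,p / \sum_{j} p_{-j}$ and the covariance blocks (1.3) and (1.4), can be checked by simulation. The sketch below is an illustration only: the function names, the probability values, and the number of replicates are assumptions chosen for the example, and rare cells are listed first in the order $-1, \ldots, -r$.

```python
import random


def sample_enm(k, p_rare, p_common, rng):
    """One ENMn(k, p) draw: run trials until k rare events have occurred."""
    probs = list(p_rare) + list(p_common)
    r = len(p_rare)
    counts = [0] * len(probs)
    rare_total = 0
    while rare_total < k:
        cell = rng.choices(range(len(probs)), weights=probs)[0]
        counts[cell] += 1
        if cell < r:
            rare_total += 1
    return counts


def theoretical_moments(k, p_rare, p_common):
    """Mean vector and covariance matrix from (1.2)-(1.4), rare cells first."""
    p0 = sum(p_rare)
    r, n = len(p_rare), len(p_common)
    p_star = [p / p0 for p in p_rare]
    mu_common = [k * p / p0 for p in p_common]
    mean = [k * p for p in p_star] + mu_common
    cov = [[0.0] * (r + n) for _ in range(r + n)]
    for a in range(r):          # multinomial block (1.3)
        for b in range(r):
            delta = 1.0 if a == b else 0.0
            cov[a][b] = k * p_star[a] * (delta - p_star[b])
    for a in range(n):          # negative multinomial block (1.4)
        for b in range(n):
            diag = mu_common[a] if a == b else 0.0
            cov[r + a][r + b] = mu_common[a] * mu_common[b] / k + diag
    return mean, cov


def empirical_moments(draws):
    """Sample mean vector and unbiased sample covariance matrix."""
    m, d = len(draws), len(draws[0])
    mean = [sum(x[j] for x in draws) / m for j in range(d)]
    cov = [[sum((x[a] - mean[a]) * (x[b] - mean[b]) for x in draws) / (m - 1)
            for b in range(d)] for a in range(d)]
    return mean, cov


# Hypothetical setting: k = 4, two rare and two non-rare categories.
rng = random.Random(7)
draws = [sample_enm(4, [0.2, 0.3], [0.1, 0.4], rng) for _ in range(20000)]
print(theoretical_moments(4, [0.2, 0.3], [0.1, 0.4])[0])
print(empirical_moments(draws)[0])
```

The off-diagonal block between rare and non-rare cells is zero in (1.2), so the corresponding empirical covariances should be close to zero as well.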

4.1 Model of Cancer Incidence for Three Cities

Suppose that the following table gives the cancer incidence for the three largest cities according to the site of primary cancer during a particular year. It is assumed that the overall disease incidence may be higher (lower) in one location than in another, but that this increase (decrease) is not disease-specific; that is, the relative frequencies of disease do not change across cities. Waller and Zelterman (1997) used the log-linear model with the negative multinomial distribution to fit the cancer deaths, during 1989, in the three largest Ohio cities. The structure of the data used in this example is similar to the one described in Waller and Zelterman (1997), but hypothetical numbers have been used to illustrate the proposed model. The sites of primary tumors are as follows: 1 = eye, 2 = oral cavity, 3 = gallbladder, 4 = lung, 5 = breast, 6 = genitals, 7 = urinary organs, 8 = leukemia and 9 = lymphatic tissues.

Table 2: Cancer Deaths in the Three Different Cities

City      1    2    3    4    5    6    7    8    9
City 1   35   41   31  440  488  159  523  169  268
City 2   40   28   27  270  337  133  378  107  160
City 3   31   25   28  190  212   91  254   77  137

Our objective is to fit the data with an appropriate model. Poisson models are appropriate when the sample mean and the sample variance are equal. Multinomial models can be used when the cell counts are negatively correlated, and negative multinomial models are used when the cell counts are positively correlated. It can be observed from the above data that some of the frequency counts are positively correlated, while others are negatively correlated. In this situation, the extended negative multinomial model is expected to be the most appropriate one to fit the data due to its covariance structure (equations 1.2 to 1.4).

Consider the log-linear model of means with no interaction between city and disease type,

$$
\ln \mu_{js} = \mu + \alpha_s^{\text{city}} + \beta_j^{\text{disease}},
$$

where $\alpha_s^{\text{city}}$, $s = 1, 2, 3$, is the effect due to city and $\beta_j^{\text{disease}}$, $j = 1, 2, \ldots, 9$, is the effect due to disease. Note that the effects of each type add up to zero. No interaction between city and disease type implies that an increase (decrease) in incidence of one disease is accompanied by a similar increase (decrease) in incidence of the same disease across all the other cities. It has been assumed that the state has a large amount of manufacturing and industry. So, if there is an environmental cause for high incidence of one type of cancer, this may translate into high incidence of another type of cancer.

To describe the design matrix $X$ for the above model, let $\mathbf{0}$ be the $8 \times 1$ zero column vector, $\mathbf{1}$ the $8 \times 1$ vector of ones, and $I$ the $8 \times 8$ identity matrix. Then $X$ is given by

$$
X = \begin{pmatrix}
\mathbf{1} & \mathbf{1} & \mathbf{0} & I \\
1 & 1 & 0 & \mathbf{0}' \\
\mathbf{1} & \mathbf{0} & \mathbf{1} & I \\
1 & 0 & 1 & \mathbf{0}' \\
\mathbf{1} & \mathbf{0} & \mathbf{0} & I \\
1 & 0 & 0 & \mathbf{0}'
\end{pmatrix},
$$

where, for each city, the first row shown is a block of eight rows and the second is a single row.

Let $f_{js}$ denote the number of cancer deaths of site $j$ in city $s$, and use the extended negative multinomial distribution, since many of the counts have both positive and negative covariances. The test statistic $\chi_s^2(k_s)$ has the following form:

$$
\chi_s^2(k_s) = \sum_j \frac{(f_{js} - \hat\mu_{js})^2}{\hat\mu_{js}}, \tag{4.8}
$$

where $k_s$ is the shape parameter of the extended negative multinomial distribution. The estimated means $\hat\mu_{js}$ are the same as described in Section 2. The above test statistic not only measures fit, but also suggests a method of estimating $k_s$. For each city, solve for $k_s$ such that the statistic in (4.8) equals 7.34, the median of its limiting chi-squared distribution with 8 d.f. This method is similar to the one proposed by Waller and Zelterman (1997) for finding values of the shape parameter. Using these values of $k_s$, the model parameters are computed by maximizing the log-likelihood with penalized constraints, using the Newton-Raphson algorithm iteratively in an R program. Improving the estimate of $k_s$ improves the model.
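Two ingredients of this procedure can be checked numerically: the median 7.34 of the chi-squared distribution with 8 d.f., and the Pearson form in (4.8) evaluated on Table 2. The sketch below is illustrative only. It uses the closed-form chi-squared CDF that holds for even degrees of freedom, and, as a labeled assumption, it replaces the Section 2 estimates $\hat\mu_{js}$ by the ordinary no-interaction fit (row total times column total over grand total); it does not solve for $k_s$. All function names are hypothetical.

```python
import math

# Table 2 counts: rows are cities, columns are tumor sites 1 to 9.
DEATHS = {
    "City 1": [35, 41, 31, 440, 488, 159, 523, 169, 268],
    "City 2": [40, 28, 27, 270, 337, 133, 378, 107, 160],
    "City 3": [31, 25, 28, 190, 212, 91, 254, 77, 137],
}


def chi2_cdf_even(x, df):
    """CDF of the chi-squared distribution for a positive even number of d.f."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term = total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return 1.0 - math.exp(-half) * total


def chi2_median_even(df, tol=1e-10):
    """Median of the chi-squared distribution (even df) by bisection."""
    lo, hi = 0.0, 10.0 * df
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, df) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def no_interaction_fit(table):
    """Assumed stand-in for the fitted means: row total * column total / grand total."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand = sum(row_totals)
    return {city: [rt * ct / grand for ct in col_totals]
            for city, rt in zip(table, row_totals)}


def pearson_by_city(table, fitted):
    """Pearson statistic of the form (4.8), one value per city."""
    return {city: sum((f - m) ** 2 / m for f, m in zip(table[city], fitted[city]))
            for city in table}


fitted = no_interaction_fit(DEATHS)
print(round(chi2_median_even(8), 2))
print(pearson_by_city(DEATHS, fitted))
```

Comparing each city's statistic with the median gives a rough sense of how far the fit is from the target value that the shape parameter $k_s$ is tuned to reach.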

Acknowledgments

Professor Hira Lal Koul is my statistics guru. I wish to thank him for all the knowledge he has given me.

Bibliography

[1] Bonett, D.G. (1985a). A linear negative multinomial model. Statist. & Probability Letters, 3: 127-129.

[2] Bonett, D.G. (1985b). The negative multinomial logit model. Commun. Statist.-Theory Methods, 3: 127-129.

[3] Bonett, D.G. (1989). Pearson chi-square estimator and test for log-linear models with expected frequencies subject to linear constraints. Statist. & Probability Letters, 8: 175-177.

[4] Breslow, N.E. (1984). Extra-Poisson variation in log-linear models. Applied Statistics, 33: 38-44.

[5] Dhar, S.K. (1995). Extension of a negative multinomial model. Commun. Statist.-Theory Methods, 24(1): 39-57.

[6] Ferguson, T.S. (1958). A method of generating best asymptotically normal estimates with application to the estimation of bacterial densities. Ann. Math. Stat., 29: 1046-1062.

[7] Grizzle, J.E., Starmer, C.F. and Koch, G.G. (1969). Analysis of categorical variables by linear models. Biometrics, 25: 489-504.

[8] Haber, M. and Brown, M.B. (1986). Maximum likelihood methods for log-linear models when expected frequencies are subject to linear constraints. J. Amer. Statist. Assoc., 81: 477-482.

[9] Serfling, R.J. (2002). Approximation theorems of mathematical statistics. New York: Wiley.

[10] Lahiri, S. and Dhar, S.K. (2008). Log-linear modeling under generalized inverse sampling scheme. Communications in Statistics - Theory and Methods, 37(8): 1237-1244.

[11] Myers, R.H., Montgomery, D.C., Vining, G.G. and Robinson, T.J. (2002). Generalized linear models with applications in engineering and the sciences, second edition. New York: Wiley.

[12] Rao, C.R. (1973). Linear statistical inference and its applications, second edition. New York: Wiley.

[13] Williams, D.A. (1982). Extra-binomial variation in logistic linear models. Applied Statistics, 31: 144-148.

[14] Waller, L.A. and Zelterman, D. (1997). Log-linear modeling with the negative multinomial distribution. Biometrics, 53: 971-982.