On Properties of QIC in Generalized Estimating Equations

Shinpei Imori

Graduate School of Engineering Science, Osaka University
1-3 Machikaneyama-cho, Toyonaka, Osaka 560-8531, Japan
E-mail: imori.stat@gmail.com

Abstract: The generalized estimating equations (GEE) approach has attracted considerable interest in the analysis of correlated response data. An information criterion based on the quasi-likelihood in the GEE framework, called the quasi-likelihood under the independence model criterion (QIC), has been proposed in the literature. This paper studies the properties of the QIC. We establish a formal derivation of the QIC as an asymptotically unbiased estimator of the prediction risk based on the quasi-likelihood. In particular, when deriving the QIC, we explicitly take into account the effect of estimating the correlation matrix used in the GEE procedure. Furthermore, we discuss the adequacy of the risk function used in the derivation of the QIC.

Key words: Generalized estimating equations; Longitudinal data analysis; Quasi-likelihood under the independence model criterion; Variable selection.

1. Introduction

Longitudinal data, in which the responses of each individual are correlated, often arise in biomedical studies. The generalized estimating equations (GEE) approach developed by Liang & Zeger (1986) has attracted considerable interest for parameter estimation with such data. The GEE methodology avoids specifying the joint distribution of the observations: it assumes only a functional form for the marginal distribution at each time point and a correlation structure called the working correlation matrix. Furthermore, under some regularity conditions (see Xie & Yang, 2003; Balan & Schiopu-Kratina, 2005), the GEE estimator is consistent and asymptotically normal even when the working correlation matrix is misspecified. Because of these advantages, the GEE method is often used to analyze longitudinal data (e.g., Thall & Vail, 1990; Barnhart & Williamson, 1998; Vens & Ziegler, 2012).

As in ordinary regression analysis, we should select the best model among the candidate models in the GEE methodology. Model selection in the GEE framework has been discussed extensively (e.g., Pan, 2001b, 2002; Cantoni et al., 2005; Shen & Chen, 2012), and many model selection methods have been proposed in general (for details of statistical model selection, see Burnham & Anderson, 2002). A well-known approach is to measure the goodness of fit of a model by the risk function based on the expected Kullback-Leibler (KL) information (Kullback & Leibler, 1951). In practice we must estimate the risk function, because it depends on unknown parameters. Akaike's information criterion (AIC), proposed by Akaike (1973, 1974) as an estimator of the risk function based on the KL information, is used for selecting the best model among the candidate models. Since the AIC is simply defined as $-2 \times$ (the maximum log-likelihood) $+\ 2 \times$ (the number of parameters), it is widely applied in many fields for selecting appropriate models using a set of explanatory variables.

However, the AIC may not be directly applicable in the GEE procedure, because the joint distribution of the observations is not specified. Pan (2001a) extended the AIC to the GEE method by using the quasi-likelihood constructed from the estimating equations (Wedderburn, 1974); the resulting criterion is called the quasi-likelihood under the independence model criterion (QIC). The QIC is often used as an alternative to the AIC in applied longitudinal analysis and is a representative model selector in the GEE methodology. On the other hand, the theoretical properties of the QIC have received little attention so far. In addition, the QIC was derived by omitting the calculation of a particular term. The aim of this paper is to study these problems, particularly from the viewpoint of estimating the correlation parameter.

In the present paper, we establish a formal derivation of the QIC (called the formal QIC, or fQIC) as an asymptotically unbiased estimator of the prediction risk based on the quasi-likelihood. Notably, when deriving the formal QIC, we explicitly take into account the effect of estimating the correlation matrix used in the GEE procedure. Furthermore, we discuss the adequacy of the risk function used in the derivation of the QIC. Concretely, when we extend the independence quasi-likelihood to a more general case (i.e., the multivariate quasi-likelihood) in order to account for correlations among the responses, the risk function reduces to the one used in the derivation of the QIC. The meaning of this adequacy is detailed in Section 3.

The rest of the paper is organized as follows. Section 2 introduces the GEE approach and formally re-derives the QIC. Section 3 presents the properties of the QIC. Section 4 compares the formal QIC with the original QIC. Section 5 concludes the paper. Technical details are provided in the Appendix.

2. Modifications of the original QIC

In this section, we define the GEE and formally re-derive the QIC. For individuals $i = 1, \ldots, n$, we have an $m$-dimensional response vector $y_i = (y_{i1}, \ldots, y_{im})^\top$ and an $m \times p$ explanatory variable matrix $X_i = (x_{i1}, \ldots, x_{im})^\top$. Let $y = (y_1^\top, \ldots, y_n^\top)^\top$. In general, the components of $y_i$ are correlated, but $y_1, \ldots, y_n$ are independent. Furthermore, we do not specify the joint distribution of each $y_i$. In the GEE framework, we assume that each $y_{ij}$ follows a generalized linear model (GLM) developed by Nelder & Wedderburn (1972). Hence, the probability density function of $y_{ij}$ is expressed as

$$ f(y_{ij}; \theta_{ij}, \phi) = \exp\left\{ \frac{\theta_{ij} y_{ij} - b(\theta_{ij})}{a(\phi)} + c(y_{ij}, \phi) \right\}, $$

where $a(\cdot)$, $b(\cdot)$, and $c(\cdot)$ are known functions, $\theta_{ij}$ is an unknown location parameter, and $\phi$ is a known scale parameter. Let the first and second moments of $y_{ij}$ be

$$ E[y_{ij}] = \mu_{ij}, \qquad \mathrm{Var}[y_{ij}] = \sigma_{ij}^2, $$

respectively. From the GLM properties, we obtain

$$ \mu_{ij} = b'(\theta_{ij}), \qquad \sigma_{ij}^2 = \sigma^2(\mu_{ij}) = a(\phi)\, b''(\theta_{ij}). $$

For all $\theta_i = (\theta_{i1}, \ldots, \theta_{im})^\top$, we note that

$$ h(\mu_{ij}) = h(\mu(\theta_{ij})) = \eta_{ij}, $$

where $h(\cdot)$ is a link function, $\eta_{ij} = x_{ij}^\top \beta$ is a linear predictor, and $\beta$ is an unknown regression parameter. The GEE for $\beta$ is then expressed as

$$ g(\beta; R, y) = \sum_{i=1}^{n} D_i^\top V_i^{-1} (y_i - \mu_i) = 0_p, \tag{1} $$

where $D_i = \partial \mu_i / \partial \beta^\top = A_i \Delta_i X_i$, $\Delta_i = \mathrm{diag}(\partial\theta_{i1}/\partial\eta_{i1}, \ldots, \partial\theta_{im}/\partial\eta_{im})$, $A_i = a(\phi)\,\mathrm{diag}\{b''(\theta_{i1}), \ldots, b''(\theta_{im})\}$, $\mu_i = (\mu_{i1}, \ldots, \mu_{im})^\top$, $V_i = A_i^{1/2} R A_i^{1/2}$, $R = R(\alpha)$ is a working correlation matrix, and $\alpha$ is a $q$-dimensional parameter, referred to as the correlation parameter.

We can choose a working correlation matrix to suit the situation. The following structures are often used: independence (i.e., $(R)_{jk} = 0$, $j \neq k$), exchangeable (i.e., $(R)_{jk} = \alpha$, $j \neq k$), first-order autoregressive (AR-1) (i.e., $(R)_{jk} = (R)_{kj} = \alpha^{j-k}$, $j > k$), one-dependent (i.e., $(R)_{j,j+1} = (R)_{j+1,j} = \alpha$), and unstructured (i.e., $(R)_{jk} = (R)_{kj} = \alpha_{jk}$, $j > k$). By solving (1), we obtain $\hat\beta$, the GEE estimator of $\beta$. In order to guarantee its asymptotic properties, we assume appropriate regularity conditions.

The risk function based on the independence quasi-likelihood is

$$ \mathrm{Risk} = E_y E_{y^*}[-2Q(y^*; \hat\beta)], \tag{2} $$

where $y^*$ is a future observation and $Q(y; \beta)$ is the quasi-likelihood proposed by McCullagh & Nelder (1989). That is,

$$ Q(y; \beta) = \sum_{i=1}^{n} \sum_{j=1}^{m} \int_{y_{ij}}^{\mu_{ij}} \frac{y_{ij} - t}{\sigma^2(t)}\, dt, $$

where $\sigma^2(\cdot)$ is a variance function such that $\sigma^2(\mu_{ij}) = a(\phi)\, b''(\theta_{ij})$. Let $B$ be the bias when estimating (2) by $-2Q(y; \hat\beta)$. The QIC is defined as

$$ \mathrm{QIC} = -2Q(y; \hat\beta) + \hat{B}, \tag{3} $$

where $\hat{B}$ is an estimator of $B$. To evaluate $B$ precisely, we decompose it as follows:

$$ \begin{aligned} B &= E_y E_{y^*}[-2Q(y^*; \hat\beta) + 2Q(y; \hat\beta)] \\ &= E_y E_{y^*}[-2Q(y^*; \hat\beta) + 2Q(y^*; \beta)] && \text{(B1)} \\ &\quad + E_y E_{y^*}[-2Q(y^*; \beta) + 2Q(y; \beta)] && \text{(B2)} \\ &\quad + E_y[-2Q(y; \beta) + 2Q(y; \hat\beta)]. && \text{(B3)} \end{aligned} $$

Clearly (B2) $= 0$, since $y$ and $y^*$ have the same distribution. A Taylor expansion of $Q(y; \hat\beta)$ around $\hat\beta = \beta$ yields

$$ Q(y; \hat\beta) = Q(y; \beta) + \frac{\partial Q(y; \beta)}{\partial \beta^\top} (\hat\beta - \beta) + \frac{1}{2} (\hat\beta - \beta)^\top \frac{\partial^2 Q(y; \beta)}{\partial \beta\, \partial \beta^\top} (\hat\beta - \beta) + o_p(1). \tag{4} $$
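To make the definition of $Q(y; \beta)$ concrete, the following is a minimal sketch that evaluates the quasi-likelihood integral numerically and compares it with a closed form. It assumes a Poisson-type variance function $\sigma^2(t) = t$ (so $a(\phi) = 1$), which is an illustrative choice and not a model used in this paper; the function names and the data values are hypothetical.

```python
import numpy as np


def quasi_likelihood_numeric(y, mu, var_fn, n_grid=20001):
    """Trapezoidal approximation of sum_j int_{y_j}^{mu_j} (y_j - t) / var_fn(t) dt."""
    total = 0.0
    for y_j, mu_j in zip(np.ravel(y), np.ravel(mu)):
        t = np.linspace(y_j, mu_j, n_grid)   # integration path from y_j to mu_j
        f = (y_j - t) / var_fn(t)            # integrand (y - t) / sigma^2(t)
        dt = t[1] - t[0]                     # signed step (negative if mu_j < y_j)
        total += dt * (f.sum() - 0.5 * (f[0] + f[-1]))
    return total


def quasi_likelihood_poisson(y, mu):
    """Closed form for sigma^2(t) = t: sum_j {y_j log(mu_j / y_j) - (mu_j - y_j)}, y_j > 0."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return float(np.sum(y * np.log(mu / y) - (mu - y)))


# Hypothetical responses and fitted means for a single individual (m = 3).
y = np.array([1.0, 2.0, 4.0])
mu = np.array([1.5, 2.0, 3.0])

q_num = quasi_likelihood_numeric(y, mu, var_fn=lambda t: t)
q_closed = quasi_likelihood_poisson(y, mu)
print(q_num, q_closed)
```

Each summand is non-positive and vanishes when $\mu_{ij} = y_{ij}$, so $-2Q$ behaves like a deviance-type measure of lack of fit.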

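Note that

$$ \frac{\partial Q(y; \beta)}{\partial \beta} = \sum_{i=1}^{n} D_i^\top A_i^{-1} (y_i - \mu_i), \qquad \hat\beta = \beta + \frac{1}{n} \Omega_R^{-1} g(\beta; R, y) + o_p(n^{-1/2}), \qquad \frac{\partial^2 Q(y; \beta)}{\partial \beta\, \partial \beta^\top} = -n \Omega_I + o_p(n), \tag{5} $$

where

$$ \Omega_R = \frac{1}{n} \sum_{i=1}^{n} D_i^\top V_i^{-1} D_i, \qquad \Omega_I = \frac{1}{n} \sum_{i=1}^{n} D_i^\top A_i^{-1} D_i. $$

Substituting (4) and (5) into (B1) and (B3), respectively, we can show that

$$ \text{(B1)} = \mathrm{tr}(V_s \Omega_I) + o(1), \qquad \text{(B3)} = 2\,\mathrm{tr}(V_a \Omega_R^{-1}) - \mathrm{tr}(V_s \Omega_I) + o(1), \tag{6} $$

where

$$ V_a = \frac{1}{n} \sum_{i=1}^{n} D_i^\top V_i^{-1} \mathrm{Cov}[y_i] A_i^{-1} D_i, \qquad V_s = \Omega_R^{-1} \left( \frac{1}{n} \sum_{i=1}^{n} D_i^\top V_i^{-1} \mathrm{Cov}[y_i] V_i^{-1} D_i \right) \Omega_R^{-1}. \tag{7} $$

From (6), an expansion of $B$ is given as follows:

$$ B = 2\,\mathrm{tr}(V_a \Omega_R^{-1}) + o(1). \tag{8} $$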
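The trace terms in (6) and (8) can be computed directly from fitted quantities. The sketch below is an illustration only: it assumes that $\mathrm{Cov}[y_i]$ is replaced by the residual outer product $(y_i - \mu_i)(y_i - \mu_i)^\top$, which is a common plug-in but is not prescribed by the derivation above, and it uses randomly generated arrays in place of a fitted model. The function names `exchangeable` and `bias_terms` are hypothetical.

```python
import numpy as np


def exchangeable(m, alpha):
    """Exchangeable working correlation: 1 on the diagonal, alpha elsewhere."""
    return (1.0 - alpha) * np.eye(m) + alpha * np.ones((m, m))


def bias_terms(D, A_diag, resid, R):
    """Return (2 tr(V_a Omega_R^{-1}), 2 tr(V_s Omega_I)) as defined in (5)-(8).

    D      : (n, m, p) array holding D_i
    A_diag : (n, m) array holding the diagonal of A_i
    resid  : (n, m) array holding y_i - mu_i
    R      : (m, m) working correlation matrix
    """
    n, m, p = D.shape
    R_inv = np.linalg.inv(R)
    omega_R = np.zeros((p, p))
    omega_I = np.zeros((p, p))
    V_a = np.zeros((p, p))
    meat = np.zeros((p, p))
    for i in range(n):
        s = 1.0 / np.sqrt(A_diag[i])
        V_inv = s[:, None] * R_inv * s[None, :]   # V_i^{-1} = A_i^{-1/2} R^{-1} A_i^{-1/2}
        A_inv = np.diag(1.0 / A_diag[i])
        cov_i = np.outer(resid[i], resid[i])      # assumed plug-in for Cov[y_i]
        Dt_Vinv = D[i].T @ V_inv
        omega_R += Dt_Vinv @ D[i]
        omega_I += D[i].T @ A_inv @ D[i]
        V_a += Dt_Vinv @ cov_i @ A_inv @ D[i]
        meat += Dt_Vinv @ cov_i @ Dt_Vinv.T
    omega_R, omega_I, V_a, meat = (x / n for x in (omega_R, omega_I, V_a, meat))
    omega_R_inv = np.linalg.inv(omega_R)
    V_s = omega_R_inv @ meat @ omega_R_inv        # sandwich form in (7)
    return 2.0 * np.trace(V_a @ omega_R_inv), 2.0 * np.trace(V_s @ omega_I)


# Synthetic inputs standing in for fitted quantities (hypothetical values).
rng = np.random.default_rng(0)
n, m, p = 50, 4, 3
D = rng.normal(size=(n, m, p))
A_diag = rng.uniform(0.5, 2.0, size=(n, m))
resid = rng.normal(size=(n, m))

formal_ind, original_ind = bias_terms(D, A_diag, resid, np.eye(m))
formal_exc, original_exc = bias_terms(D, A_diag, resid, exchangeable(m, 0.3))
print(formal_ind, original_ind)
print(formal_exc, original_exc)
```

With the independence working correlation, $V_i = A_i$ and the two traces agree up to rounding; with another working structure they generally differ.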

By substituting (8) into (3) and ignoring the $o(1)$ term, we obtain

$$ \mathrm{fQIC} = -2Q(y; \hat\beta) + 2\,\mathrm{tr}(\hat{V}_a \hat\Omega_R^{-1}), $$

where $\hat{V}_a$, $\hat{V}_s$, $\hat\Omega_R$ and $\hat\Omega_I$ are obtained by substituting $\hat\beta$ for $\beta$ in $V_a$, $V_s$, $\Omega_R$ and $\Omega_I$, respectively. On the other hand, the original QIC proposed in Pan (2001a) is an estimator of (2) defined as follows:

$$ \text{original QIC} = -2Q(y; \hat\beta) + 2\,\mathrm{tr}(\hat{V}_s \hat\Omega_I). $$

We note that when $R$ is the independence structure, $V_i = A_i$, so that $\Omega_R = \Omega_I$ and $V_s = \Omega_I^{-1} V_a \Omega_I^{-1}$; hence $\mathrm{tr}(V_a \Omega_R^{-1}) = \mathrm{tr}(V_s \Omega_I)$. Furthermore, when $R$ includes the true correlation structure, $V_a = \Omega_I$ and $V_s = \Omega_R^{-1}$ hold, since $V_i = \mathrm{Cov}[y_i]$ in this situation. Hence, the original QIC coincides exactly with the formal QIC when the working correlation matrix is the independence structure, and is asymptotically equivalent to it when the working correlation structure includes the true one. More comparisons are given in Section 4.

3. Effect of estimating the correlation parameter

We present two properties of the formal QIC in this section. First, we consider the bias of the QIC. In Pan (2001a), the original QIC was derived by considering only the bias that arises when estimating $\beta$. However, we also need to estimate $\alpha$ when we use the QIC in real data analysis. The QIC can therefore be regarded as a function of $\beta$ and $\alpha$, and it is used after substituting an estimator for each unknown parameter. Hence, we should also consider the bias that arises when estimating $\alpha$. According to Liang & Zeger (1986), we can assume that $\alpha$ is defined as a function of the sample correlation matrix $\tilde{R}(\beta)$,

$$ \tilde{R}(\beta) = \frac{1}{n - p} \sum_{i=1}^{n} A_i^{-1/2} (y_i - \mu_i)(y_i - \mu_i)^\top A_i^{-1/2}. $$

For example, assuming the exchangeable correlation structure (i.e., $(R)_{jk} = \alpha$, $j \neq k$),

$$ \alpha = \frac{1}{n - p}\, \frac{2}{m(m-1)}\, \frac{1}{a(\phi)} \sum_{i=1}^{n} \sum_{j<k}^{m} \frac{y_{ij} - b'(\theta_{ij})}{b''(\theta_{ij})^{1/2}}\, \frac{y_{ik} - b'(\theta_{ik})}{b''(\theta_{ik})^{1/2}}. \tag{9} $$

When we explicitly take into account the effect of estimating $\alpha$, the simultaneous estimating equations for $\beta$ and $\alpha$ are given as follows:

$$ \alpha = \alpha(\beta) = h(\tilde{R}(\beta)), \qquad \sum_{i=1}^{n} D_i^\top V_i^{-1} (y_i - \mu_i) = 0_p, \tag{10} $$

where $h(\cdot)$ is a $q$-dimensional vector-valued function. Let $\hat\beta_s$ and $\hat\alpha_s$ be the solution of (10). In this situation, the risk function (2) is rewritten as

$$ E_y E_{y^*}[-2Q(y^*; \hat\beta_s)], $$

and we can show Theorem 1.
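The moment estimator (9) is the average of the off-diagonal entries of the sample correlation matrix $\tilde{R}(\beta)$. The following minimal sketch illustrates this; it assumes Gaussian responses with identity link (so $b''(\theta_{ij}) = 1$ and $a(\phi) = 1$), a known mean of zero in place of a fitted $\hat\mu_i$, and $p = 2$, all of which are illustrative choices rather than the setting of this paper. The function names are hypothetical.

```python
import numpy as np


def sample_correlation(y, mu, var, p, a_phi=1.0):
    """R~(beta) = (n - p)^{-1} sum_i A_i^{-1/2} (y_i - mu_i)(y_i - mu_i)^T A_i^{-1/2}."""
    n = y.shape[0]
    r = (y - mu) / np.sqrt(a_phi * var)   # rows hold A_i^{-1/2} (y_i - mu_i)
    return r.T @ r / (n - p)


def exchangeable_alpha(y, mu, var, p, a_phi=1.0):
    """Moment estimator (9); `var` holds b''(theta_ij) for each observation."""
    n, m = y.shape
    r = (y - mu) / np.sqrt(var)
    # sum_{j<k} r_ij r_ik = {(sum_j r_ij)^2 - sum_j r_ij^2} / 2
    cross = 0.5 * (r.sum(axis=1) ** 2 - (r ** 2).sum(axis=1))
    return 2.0 * cross.sum() / ((n - p) * m * (m - 1) * a_phi)


# Simulated exchangeable Gaussian data with true correlation 0.3 (illustrative).
rng = np.random.default_rng(1)
n, m, p, alpha0 = 5000, 4, 2, 0.3
shared = rng.normal(size=(n, 1))
noise = rng.normal(size=(n, m))
y = np.sqrt(alpha0) * shared + np.sqrt(1.0 - alpha0) * noise
mu = np.zeros((n, m))
var = np.ones((n, m))

alpha_hat = exchangeable_alpha(y, mu, var, p)
R_tilde = sample_correlation(y, mu, var, p)
off_diag_mean = R_tilde[np.triu_indices(m, k=1)].mean()
print(alpha_hat, off_diag_mean)
```

In an actual GEE fit, `mu` and `var` would be evaluated at the current value of $\beta$, and the estimate of $\alpha$ would be updated alternately with $\beta$ as in (10).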

Theorem 1. Assume that

$$ \frac{\partial\, \mathrm{vec}\{R(\alpha_0)^{-1}\}}{\partial \alpha^\top},\ \frac{\partial \alpha(\beta)}{\partial \beta^\top} = O_p(1), \tag{11} $$

where $\alpha_0 = h(R_0)$ and $R_0$ is the true correlation matrix. Then the formal QIC is an asymptotically unbiased estimator of the risk function even when the effect of estimating $\alpha$ is taken into account. That is,

$$ E_y E_{y^*}[-2Q(y^*; \hat\beta_s)] - E_y[\mathrm{fQIC}(y; \hat\beta_s)] = o(1). $$

A proof of Theorem 1 is given in Appendix A.1. This theorem shows that the formal QIC remains an asymptotically unbiased estimator of the risk function when the effect of estimating the correlation parameter is taken into account. This is an optimality property of the formal QIC. Liang & Zeger (1986) provided ways to estimate $\alpha$ that satisfy assumption (11). For example, in the case of the exchangeable correlation structure,

$$ R(\alpha)^{-1} = \frac{1}{1 - \alpha} (I_m - P_m) + \frac{1}{1 + (m-1)\alpha} P_m, \tag{12} $$

where $P_m = 1_m 1_m^\top / m$. The derivatives of (9) and (12) are expressed as follows:

$$ \frac{\partial \alpha(\beta)}{\partial \beta} = -\frac{1}{n - p}\, \frac{2}{m(m-1)}\, \frac{1}{a(\phi)} \sum_{i=1}^{n} \sum_{j \neq k}^{m} \left[ b''(\theta_{ij})^{1/2} + \frac{\{y_{ij} - b'(\theta_{ij})\}\, b'''(\theta_{ij})}{2\, b''(\theta_{ij})^{3/2}} \right] \frac{y_{ik} - b'(\theta_{ik})}{b''(\theta_{ik})^{1/2}}\, \frac{\partial \theta_{ij}}{\partial \eta_{ij}}\, x_{ij}, $$

$$ \frac{\partial\, \mathrm{vec}\{R(\alpha_0)^{-1}\}}{\partial \alpha} = \frac{1}{(1 - \alpha_0)^2}\, \mathrm{vec}(I_m - P_m) - \frac{m - 1}{\{1 + (m-1)\alpha_0\}^2}\, \mathrm{vec}(P_m). $$

Hence, assumption (11) is satisfied when $\alpha_0 \neq 1, -1/(m-1)$, which is usually assumed for the non-singularity of the working correlation matrix.

Next, we consider the risk function of the formal QIC. Although a correlation structure is assumed for the responses in the GEE procedure, the risk function (2) is based on the independence quasi-likelihood. To resolve this apparent contradiction, we attempt to extend the risk function to a quasi-likelihood that accounts for the correlations, namely the multivariate quasi-likelihood (McCullagh & Nelder, 1989). As mentioned in Pan (2001a), the GEE methodology is closely related to the multivariate quasi-likelihood. The $i$th multivariate quasi-likelihood $Q_m(y_i; M, \beta)$ is defined through the differential form

$$ \frac{\partial Q_m(y_i; M, \beta)}{\partial \mu_i} = M(\mu_i)^{-1} (y_i - \mu_i), $$

where $M(\cdot)$ is an $m \times m$ matrix-valued function. The multivariate quasi-likelihood is then obtained as the line integral

$$ Q_m(y_i; M, \beta) = \int_{t = y_i}^{t = \mu_i} (y_i - t)^\top M(t)^{-1}\, dt, \tag{13} $$

where $t = (t_1, \ldots, t_m)^\top$. In general, the value of this integral is not uniquely determined, since it depends on the chosen path of integration. This problem has already been discussed; Wang gave an example of the non-integrability of the GEE (see Wang, 1999, Example 2.4). The risk function should be uniquely specified, because it serves as a criterion for comparing statistical models. The following theorem gives a condition concerning the path-dependence of the multivariate quasi-likelihood.

Theorem 2. In the GEE approach, the multivariate quasi-likelihood is path-independent if and only if the correlation matrix has the independence structure and/or the variance function $\sigma^2(\cdot)$ is constant.

A proof of Theorem 2 is given in Appendix A.2. This theorem supports the adequacy of the risk function of the formal QIC. It may be possible to fix an ad hoc path for the multivariate quasi-likelihood (for an example of such a path, see McCullagh & Nelder, 1989). However, it is difficult to derive a new model selector based on a multivariate quasi-likelihood with a fixed path, because its derivation is more complicated than that of the formal QIC. In order to establish the uniqueness of the risk function within the class of multivariate quasi-likelihoods, we recommend assuming the independence model for the risk function, which leads to the same risk function as that of the formal QIC.

4. Comparison of the formal QIC and the original QIC

Let us compare the bias term of the formal QIC derived in Section 2 with that of the original QIC through a simulation study. We prepared four candidate models with $n = 500$ samples, each consisting of a 4-dimensional response vector $y_i = (y_{i1}, \ldots, y_{i4})^\top$ and a $4 \times 6$ explanatory variable matrix $X_i = (x_{i1}, \ldots, x_{i4})^\top$. Let $x_{ij} = (1, x_{ij1}, \ldots, x_{ij4})^\top$, where the $x_{ijk}$, $k = 1, \ldots, 4$, are independent and identically distributed random variables from the uniform distribution $U(0, 1)$. We assume that $y_{ij}$ follows the logistic regression model $B(1, p_{ij})$, where $p_{ij} = 1/\{1 + \exp(-x_{ij}^\top \beta)\}$ and $\beta = (1, 1, 0, 0, 0)^\top$. The explanatory variable matrix in the $k$th candidate model consists of the first $k$ columns of $X_i$, $k = 1, \ldots, 4$.

As mentioned in Section 2, the original QIC coincides exactly with the formal QIC when the working correlation matrix is the independence structure, and is asymptotically equivalent to it when the working correlation structure includes the true one. In other cases, however, the difference may be large. As an example, we assume that $R$ is one-dependent and $R_0$ is exchangeable with correlation parameter $\alpha = 0.3$. In this setting, we ran 10,000 repetitions in order to compare the formal QIC and the original QIC. The average value and bias of each QIC are shown in Table 1.

Table 1: Mean and bias of the formal QIC and original QIC in case 1

Candidate model            1          2          3          4
Risk function          2619.202   2620.125   2621.067   2622.005
Formal QIC     mean    2619.362   2620.284   2621.210   2622.128
               bias       0.160      0.159      0.142      0.123
Original QIC   mean    2619.730   2620.939   2622.152   2623.370
               bias       0.367      0.655      0.943      1.242

Figure 1: Values of the formal QIC and original QIC in case 1 (horizontal axis: model index, 1 to 4; vertical axis: 2620 to 2623; series: risk function, formal QIC, original QIC).

Table 1 shows that the bias of the original QIC grows as the number of parameters increases, whereas the bias of the formal QIC remains stable across the models. Hence, there may be a non-negligible difference between the formal QIC and the original QIC. Therefore, we recommend the use of the formal QIC rather than the original QIC for model selection in the GEE procedure.

5. Conclusions

In the present paper, we derived the formal QIC as an asymptotically unbiased estimator of the risk function based on the independence quasi-likelihood. Through the simulation study, we illustrated that the difference between the formal QIC and the original QIC may have a non-negligible effect on model selection, especially when the true correlation structure is completely different from the working correlation structure. Furthermore, we showed two theorems regarding the formal QIC. In Theorem 1, we proved the asymptotic unbiasedness of the formal QIC when the effect of estimating the correlation parameter is taken into account. This result may arise because the risk function of the formal QIC is based on the independence quasi-likelihood and therefore does not include the correlation parameter. In Theorem 2, we established the adequacy of the formal QIC within a wide class of risk functions based on the multivariate quasi-likelihood. A unique risk function can be established by assuming the independence structure for the multivariate quasi-likelihood. These theorems guarantee the adequacy of the formal QIC.

Appendix

A.1. Proof of Theorem 1

We can rewrite (10) as

$$ \sum_{i=1}^{n} D_i^\top A_i^{-1/2} R(\alpha(\beta))^{-1} A_i^{-1/2} (y_i - \mu_i) = 0_p. \tag{A.1} $$

Under the assumptions of Theorem 1, applying the chain rule to the derivative of (A.1), we can immediately show that

$$ \frac{\partial\, \mathrm{vec}\{R(\alpha(\beta))^{-1}\}}{\partial \beta^\top} = \frac{\partial\, \mathrm{vec}\{R(\alpha)^{-1}\}}{\partial \alpha^\top}\, \frac{\partial \alpha(\beta)}{\partial \beta^\top} = \frac{\partial\, \mathrm{vec}\{R(\alpha_0)^{-1}\}}{\partial \alpha^\top}\, \frac{\partial \alpha(\beta)}{\partial \beta^\top} + o_p(1) = O_p(1). \tag{A.2} $$

By combining (A.1) and (A.2), we obtain

$$ -\frac{1}{n}\, \frac{\partial}{\partial \beta^\top} \sum_{i=1}^{n} D_i^\top A_i^{-1/2} R(\alpha(\beta))^{-1} A_i^{-1/2} (y_i - \mu_i) = \Omega_R + o_p(1). $$

This leads to the following result:

$$ \sqrt{n}\,(\hat\beta_s - \beta) = \frac{1}{\sqrt{n}}\, \Omega_R^{-1} g(\beta; R, y) + o_p(1). $$
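The boundedness of $\partial\, \mathrm{vec}\{R(\alpha_0)^{-1}\} / \partial \alpha$ required in (A.2) can be checked numerically for the exchangeable structure. The sketch below is an illustration under assumed values $m = 4$ and $\alpha = 0.3$: it compares the closed-form inverse (12) with a direct matrix inversion and its derivative with a central finite difference. The function names are hypothetical.

```python
import numpy as np


def exch_inverse(m, alpha):
    """Closed-form inverse (12) of the exchangeable correlation matrix."""
    P = np.ones((m, m)) / m
    return (np.eye(m) - P) / (1.0 - alpha) + P / (1.0 + (m - 1) * alpha)


def exch_inverse_derivative(m, alpha):
    """Derivative of R(alpha)^{-1} with respect to alpha (matrix form of the vec expression)."""
    P = np.ones((m, m)) / m
    return (np.eye(m) - P) / (1.0 - alpha) ** 2 - (m - 1) * P / (1.0 + (m - 1) * alpha) ** 2


m, alpha = 4, 0.3                                  # illustrative values
R = (1.0 - alpha) * np.eye(m) + alpha * np.ones((m, m))
direct_inverse = np.linalg.inv(R)

h = 1e-6                                           # step for the central difference
numeric_derivative = (exch_inverse(m, alpha + h) - exch_inverse(m, alpha - h)) / (2.0 * h)

print(np.max(np.abs(exch_inverse(m, alpha) - direct_inverse)))
print(np.max(np.abs(exch_inverse_derivative(m, alpha) - numeric_derivative)))
```

Both expressions are finite as long as $\alpha \neq 1$ and $\alpha \neq -1/(m-1)$, in line with the condition stated in Section 3.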

The rest of the proof is very similar to the derivation of the QIC in Section 2. We obtain

$$ E_y E_{y^*}[Q(y^*; \hat\beta)] - E_y E_{y^*}[Q(y^*; \hat\beta_s)] = o(1). $$

A.2. Proof of Theorem 2

Applying Stokes' theorem, which is commonly used in differential geometry, we obtain the following necessary and sufficient condition for (13) to be path-independent:

$$ \sum_{j,k,l}^{m} (y_{ij} - t_j)\, \frac{\partial (M(t)^{-1})_{jk}}{\partial t_l} = 0, \tag{A.3} $$

for individuals $i = 1, \ldots, n$. Hence, we rewrite condition (A.3) as

$$ \sum_{j=1}^{m} (y_{ij} - t_j) \left\{ \frac{\partial (M(t)^{-1})_{jk}}{\partial t_l} - \frac{\partial (M(t)^{-1})_{jl}}{\partial t_k} \right\} = 0, \qquad k > l. $$

Since we assume $M(t) = A(t)^{1/2} R A(t)^{1/2}$ in the GEE methodology, this is equivalent to the following condition:

$$ (y_{il} - t_l)\, \sigma(t_k)^{-1}\, \frac{\partial \sigma(t_l)^{-1}}{\partial t_l}\, (R^{-1})_{lk} = (y_{ik} - t_k)\, \sigma(t_l)^{-1}\, \frac{\partial \sigma(t_k)^{-1}}{\partial t_k}\, (R^{-1})_{kl}, \tag{A.4} $$

where $\sigma(\mu_{ij})^2 = a(\phi)\, b''(\theta_{ij})$. Because (A.4) must hold for all values of $y_{il}$ and $y_{ik}$, if we assume $(R)_{kl} \neq 0$, then (A.4) is equivalent to

$$ \frac{\partial \sigma(t_l)^{-1}}{\partial t_l} = 0. $$

From the above results, we can see that (A.4) holds if and only if, for all $1 \le k < l \le m$, $(R)_{kl} = 0$ or the variance function $\sigma^2(\cdot)$ is constant. This completes the proof of Theorem 2.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (eds. Petrov, B. N. & Csáki, F.), 267–281, Akadémiai Kiadó, Budapest.

Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automatic Control, AC-19, 716–723.

Balan, R. M. & Schiopu-Kratina, I. (2005). Asymptotic results with generalized estimating equations for longitudinal data. Ann. Statist., 33, 522–541.

Barnhart, H. X. & Williamson, J. M. (1998). Goodness-of-fit tests for the GEE modeling with binary responses. Biometrics, 54, 720–729.

Burnham, K. P. & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach, 2nd edition. Springer-Verlag, New York.

Cantoni, E., Flemming, M. & Ronchetti, E. (2005). Variable selection for marginal longitudinal generalized linear models. Biometrics, 61, 507–514.

Kullback, S. & Leibler, R. (1951). On information and sufficiency. Ann. Math. Statist., 22, 79–86.

Liang, K-Y. & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.

McCullagh, P. & Nelder, J. A. (1989). Generalized linear models, 2nd edition. Chapman and Hall, London.

Nelder, J. A. & Wedderburn, R. W. M. (1972). Generalized linear models. J. R. Statist. Soc. A, 135, 370–384.

Pan, W. (2001a). Akaike's information criterion in generalized estimating equations. Biometrics, 57, 120–125.

Pan, W. (2001b). Model selection in estimating equations. Biometrics, 57, 529–534.

Pan, W. (2002). Goodness-of-fit tests for GEE with correlated binary data. Scand. J. Statist., 29, 101–110.

R Development Core Team (2012). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.r-project.org/.

Shen, C-W. & Chen, Y-H. (2012). Model selection for generalized estimating equations accommodating dropout missingness. Biometrics, 68, 1046–1054.

Thall, P. F. & Vail, S. C. (1990). Some covariance models for longitudinal count data with overdispersion. Biometrics, 46, 657–671.

Vens, M. & Ziegler, A. (2012). Generalized estimating equations and regression diagnostic for longitudinal controlled trials: A case study. Comput. Statist. Data Anal., 56, 1232–1242.

Wang, J. (1999). Artificial likelihoods for general nonlinear regressions (in Japanese). Proc. Inst. Statist. Math., 47, 49–61.

Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61, 439–447.

Xie, M. & Yang, Y. (2003). Asymptotics for generalized estimating equations with large cluster sizes. Ann. Statist., 31, 310–347.