Modeling the scale parameter ϕ A note on modeling correlation of binary responses Using marginal odds ratios to model association for binary responses

Size: px

Start display at page:

Download "Modeling the scale parameter ϕ A note on modeling correlation of binary responses Using marginal odds ratios to model association for binary responses"

Anthony Potter
5 years ago
Views:

1 Outline Marginal model Examples of marginal model GEE1 Augmented GEE GEE1.5 GEE2 Modeling the scale parameter ϕ A note on modeling correlation of binary responses Using marginal odds ratios to model association for binary responses 1

2 Marginal Model Marginal models (or population-average model) distinguish from the mixed effects models (or subjectspecific models) according to the interpretation of their regression coefficients. The marginal models are used to make inferences about population average. Marginal models emphasize the dependence of the mean response on the covariates of interest, but not on both the random effects and the covariates (mixed effects models), nor on previous responses (transition models). Marginal models do not require the joint distributional assumptions for the vector of responses, which may be difficult for the discrete data. The avoidance of distributional assumptions leads to a method of estimation of generalized estimating equations (GEE). For longitudinal data, the marginal model separately model the mean responses and within-subject association among the repeated responses. 2

3 Notation of the marginal models Let the random variable Y ij denote the responses for the ith individual, measured at time t ij, i = 1,..., m, j = 1,..., n i. The response variables for the ith subject is an n i 1 vector, Y i = (Y i1, Y i2,..., Y ini ). Let X ij denote p 1 vector of covariates X ij = (X ij1, X ij2,..., X ijp ) that are associated with each response, Y ij. X ij may include covariates whose values do not change over time, i.e. time-stationary or between-subjects covariates, and covariates whose values change over time, i.e., time-varying or within-subject covariates. The covariates can be grouped into an n i p matrix: X i = X i1 X i2. = X i11 X i12... X i1p X i21 X i22... X i2p, i = 1,..., m X in i X ini 1 X ini 2... X ini p 3

4 A marginal model has the following components: 1. Mean model: the marginal mean of each response, E(Y ij X ij ) = µ ij, depends on covariates via a known link function g(µ ij ) = η ij = Xijβ T 2. Correlation model (nuisance): Var(Y ij X i ) = ϕv (µ ij ) Cor(Y ij, Y ik X i ) = ρ ijk Cov(Y i X i ) = V i (ϕ, α) = ϕc 1/2 i R i C 1/2 i where R i is the correlation matrix and C i = diag(v (µ ij )) is a diagonal matrix of variances. The parameter α characterizes the correlation and ϕ is a scale parameter for variances. Estimation and valid inference are achieved by constructing an unbiased estimating function without likelihood function. 4

5 Example: marginal model for a continuous response Consider Y ij is a continuous response and we are interested in relating mean response to the covariates X i, The marginal mean model is: µ ij = η ij = Xijβ. T The variance model is: Var(Y ij X i ) = ϕv (µ ij ) = ϕ Cor(Y ij, Y ik X i ) = α k j where 0 α k j 1 is the pairwise correlation among the responses. Note in this example we assume the variance are homogeneous over time, other variance functions are also allowed, e.g. a separate scale parameter, ϕ j. We specify a first-order autoregressive correlation pattern in this example, other correlation structure, e.g. exchangeable or unstructured are also possible. 5

6 Example: marginal model for count data Consider Y ij is a count and we are interested to relate changes in the expected count to the covariates X i, The marginal mean model is: log(µ ij ) = η ij = Xijβ. T The variance model is: Var(Y ij X i ) = ϕµ ij Cor(Y ij, Y ik X i ) = α jk Note this is a log-linear regression model with an extra-poisson variance assumption which allows the variance to be inflated by a factor of ϕ to deal with the issue of over-dispersion. 6

7 Example: marginal model for a binary response Consider the effect of vitamin A deficiency (Xerophthalmia, X) on respiratory infection (RI, Y ) in the Indonesian Children s Health Study. Let i indicate the child and j the visit. The marginal mean model is: logit(µ ij ) = log Pr(Y ij = 1) Pr(Y ij = 0) = β 0 + β 1 I Xij =1. The variance model is: Var(Y ij ) = µ ij (1 µ ij ), Cor(Y ij, Y ik ) = α. The parameter of interest is β 1, exp(β 1 ) = Odds of RI among vitamin A deficient children. Odds of RI among non-deficient children When the prevalence of RI is low, the odds ratio (OR) is approximately the same as relative risk 7

8 (RR). The risk may be different for different children with the same covariates, so the parameter is a population average (assuming random sample). The correlation between two binary variables Y 1 and Y 2 has a constrained range that depends on µ 1 and µ 2. So it might be desirable to model the correlation differently, for example, using the marginal odds ratio. log OR(Y ij, Y ik ) = α jk. where OR(Y ij, Y ik ) = { } P r(yij = 1, Y ik = 1)P r(y ij = 0, Y ik = 0) P r(y ij = 1, Y ik = 0)P r(y ij = 0, Y ik = 1) 8

9 GEE1 Estimating function When (ϕ, α) are known, then the estimator ˆβ is defined by the estimating equation: 0 = m i=1 D T i V 1 i {Y i µ i (β)}, where D i (β) = µ i β, D i(j, k) = µ ij β k, V i (β, ϕ, α) = ϕc 1/2 i R i (α)c 1/2 i. where V i is known as a working covariance to distinguish it from the true underlying covariance among the Y ij s and R i is a working correlation. 9

10 For example, for logistic model with one covariate and exchangeable correlation, µ ij = exp(β 0 + β 1 x ij ) 1 + exp(β 0 + β 1 x ij ) ( µij D i (j) =, µ ) ij β 0 β 1 µ i1 (1 µ i1 ) µ C i = i2 (1 µ i2 ) µ ini (1 µ ini ) 1 α α α 1 α R i = α α 1 10

11 GEE1 estimation An iterative algorithm is used to find (ˆβ, ˆϕ, ˆα): 1. Starting with an initial estimate of β with an ordinary generalized linear model assuming independence. 2. Given ˆβ (l), calculate method-of-moments estimates for ϕ, α, and the estimate of covariance of V i : ˆV i = ˆϕĈ 1/2 i R i ( ˆα)Ĉ 1/2 i 3. Given estimates for ϕ, α and V i, solve the estimating equation using Fisher s scoring algorithm: ˆβ (l+1) = ˆβ (l) + ( m i=1 ˆD T i ˆV 1 i ˆD i ) 1 m i=1 ˆD T i ˆV 1 i {Y i ˆµ i }. 4. Iterate the above two steps until convergence is achieved. 11

12 GEE1 working correlation The model chosen for R i (α) is called the working correlation since it needs not to be the true correlation to obtain a valid point estimate ˆβ (consistent and asymptotically normal). If R i (α) is the correct correlation, then the model-based estimates of the standard errors for ˆβ can be used. Otherwise we use the empirical estimates of standard errors. Replacing α with a (any) consistent estimator does not affect the large sample properties of ˆβ. The asymptotic variance of ˆβ would be the same as if α is known. (Liang and Zeger, 1986). Some working correlation examples Independent correlation: assumes all pairwise correlation coefficients are zero: Cor(Y ij, Y ik ) = 0, j k. 12

13 Exchangeable correlation: assumes all pairwise correlation coefficients are equal and hence the components in the response vector Y are exchangeable (not ordered): Cor(Y ij, Y ik ) = α, j k. AR(1) correlation: the autoregressive correlation structure of order 1 assumes the correlation coefficients decay exponentially over time, and the responses are ordered in time and more correlated if they are closer to each other in time than if they are more distant: Cor(Y ij, Y ik ) = α j k, j k. Unstructured correlation: assumes all pairwise correlation coefficients are different parameters: Cor(Y ij, Y ik ) = α jk, j k. 13

14 GEE1 variance of ˆβ The solution ˆβ is consistent and asymptotically normal. If the correlation model is correct, then, the model-based estimate for the variance of ˆβ is ˆ Var(ˆβ) = A 1, where A = m i=1 Di T (ˆβ)V i 1 (ˆβ, ϕ, α)d i (ˆβ). If the correlation model is not correct, then we can use the empirical (so called sandwich ) variance estimate: Var(ˆβ) = A 1 BA 1, m B = U i Ui T = i=1 m Di T (ˆβ)V i 1 (ˆβ, ϕ, α) Cov(Y ˆ i )Vi 1 (ˆβ, ϕ, α)d i (ˆβ) i=1 14

15 where ˆ Cov(Y i ) = (Y i ˆµ i ) (Y i ˆµ i ) T. ˆ Cov(Y i ) is a poor estimator for Cov(Y i ). However, we do not need a good estimator for each Cov(Y i ). With sufficient independent replication (m large), the average covariance can be well estimated (consistency). What if (ϕ, α) are unknown? How can we estimate them and what is the impact on the estimation of β? Liang and Zeger (1986) proposed to use moment estimators for the unknown parameters (GEE1). 15

16 GEE1 estimating α using method of moments Let N = m i=1 n i. Recall that Var(Y ij X i ) = ϕv (µ ij ) where V is a known function. The scale parameter ϕ, if exists, can be estimated by ˆϕ = 1 N p m i=1 n i j=1 (Y ij ˆµ ij ) 2, V (ˆµ ij ) where p is the dimension of β. Binomial: V (ˆµ ij ) = ˆµ ij (1 ˆµ ij ). Poisson: V (ˆµ ij ) = ˆµ ij. 16

17 Define the residuals: r ij = Y ij ˆµ ij (ˆβ) [ ] 1/2, Var(Y ˆ ij ) where ˆ Var(Y ij ) = ˆϕV (ˆµ ij ). The correlation parameter α can be estimated as simple functions of r ij, e.g. Exchangeable correlation: ˆα = ( i n i(n i 1) p) ˆϕ m r ij r ik. i=1 j<k AR(1) correlation: Unstructured correlation: ˆα = 1 ( i (n i 1) p) ˆϕ 1 ˆα jk = (m p) ˆϕ m r ij r i,j+1. i=1 j n i 1 m r ij r ik. i=1 17

18 GEE1 more about α Recall that we use simple MoM estimators to estimate (ϕ, α) in GEE1. The scale parameter ϕ is often considered to be a nuisance and it does not affect the estimates of β. But what about α? 1. Shouldn t we consider the parameter as (β, α)? 2. Can t we improve upon the estimation of α? 3. Would better estimation of α help us to better estimate β? Shouldn t we consider the parameter as (β, α)? Answer: It depends. Is α a nuisance? If the covariance structure is of secondary interest (often the case) then GEE1 is usually fine. However, if the covariance matrix is of primary interest then GEE1 is not ideal. Are you willing to sacrifice some model robustness in order to let (β, α) be the target parameter? Note that in GEE1, the estimate ˆβ is consistent even if the model for α is wrong. Other approaches 18

19 that treat β and α on equal ground may not have this property. Can t we improve upon the estimation of α? Answer: Yes! We can adopt alternative association (dependence) models that are more suitable for categorical data. We can create estimators that are targeted at (β, α) jointly and are efficient for both. Would better estimation of α help us to better estimate β? Answer: Model for α is important for the efficiency of ˆβ. 19

20 GEE1 more about the sandwich estimator of Cov(ˆβ) It is a robust estimator as it provides valid standard error when the assumed model for the covariance among the repeated measures, i.e. working covariance, is misspecified. With the sandwich estimator, why bother expending effort to model the within-subject association in marginal model? Efficiency or precision could be improved with a working covariance more closer to the true underlying covariance. The robustness property depends on the large sample property, i.e., relative large number of subjects and relative small number of repeated measures. The sandwich estimator may be problematic for severely unbalanced data. The model-based variance estimator should be used instead. 20

21 GEE1 hypothesis testing Generalized Wald Test: H 0 : β j = 0 Write β = (β 1, β 2 ), H 0 : β 1 = 0 ˆβ T 1 ˆβ j s.e. ˆ N (0, 1). ˆ Var 1 (ˆβ) 1 ˆβ1 χ 2 r, where r is the dimension of β 1 and Var ˆ 1 (ˆβ) is the estimated variance matrix corresponding to ˆβ 1 (r r subprincipal matrix of Var(ˆβ)). ˆ H 0 : Lβ = 0, where L is a r p matrix, where ˆβ T L T (L ˆ Var(ˆβ)L T ) 1 Lˆβ χ 2 r, ˆ Var(ˆβ) is the estimated variance matrix for ˆβ. Generalized Score Test: H 0 : Lβ = 0 where L is a r p matrix. T = S( β) T AL T (LBL T ) 1 LAS( β) χ 2 r. where A and B are model-based and empirical covariance estimates, S( β) is the GEE evaluated at the regression parameter estimate β under H 0. Rotnitzky and Jewell (1990). 21

22 - In the generalized Wald test statistic, Var(ˆβ) ˆ using the sandwich form can be unstable if n i s are large and m is small. If the correlation structure has been carefully modelled, the model-based (or working ) Var(ˆβ) ˆ can be used. - The generalized score test is invariant under any differentiable transformation of the parameters, τ = τ (β). 22

23 Augmented GEE GEE1.5 Prentice (1988) s approach for binary data Idea: Estimation of both parameters vectors, β and α, in the marginal response model and the correlation model, respectively. GEE1 uses an estimating function U 1 based on the centered first moments (Y i µ i ) for the estimation of β. We can add a second estimation function based on the centered second moments to estimate α: [ where σ ijk = E ] (Y ij µ ij )(Y ik µ ik ) [µ ij (1 µ ij )µ ik (1 µ ik )] 1/2 (Y ij µ ij )(Y ik µ ik ) [µ ij (1 µ ij )µ ik (1 µ ik )] 1/2 σ ijk is the mean of the sample correlation. 23

24 Now, the joint estimating functions are: U 1 (β, α) = U 2 (β, α) = m i=1 m i=1 D T i (β)v 1 i (β, α) {Y i µ i (β)} (1) E T i (β, α)w 1 i (β, α) {S i σ i (β, α)}, (2) where S i = (S i1 S i2, S i1 S i3, S i1 S ini,, S ini 1S ini ) T S ij = (Y ij µ ij )/[µ ij (1 µ ij )] 1/2 σ i = E(S i ) E i = σ i α W i = diag(var(s i1 S i2 ),, Var(S ini 1S ini )) Var(S ij S ik ) = 1 + (1 2µ ij )(1 2µ ik )[µ ij (1 µ ij )µ ik (1 µ ik )] 1/2 σ ijk σ 2 ijk (Check this) 24

25 Note that the above W i is an n i (n i 1)/2-dimensional working independent variance matrix for S i. Other working matrices could readily be substituted. To model Cov(S i ) properly, we need specify models for higher moments (with more parameters), which is typically difficult. Iterative method can be used for solving the joint estimating equations 0 = U 1 (β, α) 0 = U 2 (β, α) Given (ˆβ (l), ˆα (l) ): 1. Fixed ˆα (l), solve 0 = U 1 (β, ˆα (l) ) to get ˆβ (l+1). 2. Fixed ˆβ (l+1), solve 0 = U 2 (ˆβ (l+1), α) to get ˆα (l+1). (ˆβ, ˆα) is consistent and asymptotically normal under correct model specification. Similar to GEE1, ˆβ is consistent even if the model for α is misspecified. 25

26 GEE2 Prentice and Zhao (1991) s appraoch They considered a systematic approach for generating estimating functions for δ = (β, α). Combine the response vector Y i and the pairwise crossproducts, S, into one outcome vector T T i = (Y T i, Z T i ), where Y T i = (y i1,..., y ini ), and Z T i = (y 2 i1, y i1y i2,..., y 2 i2, y i2y i3,..., y 2 in i ). The estimating equations for (β, α) can be generated under a Quadratic Exponential Family (QEF) model, where the loglikelihood Pr i (y i ; µ i, σ i ) θ T 1iY i + θ T 2iZ i + c i (Y i ), where the canonical parameters (θ 1i, θ 2i ) are functions of (β, α). 26

27 Optimal estimating equations for δ = (β T, α T ) T : U(δ) = = m i=1 µ i β 0 V i (1, 1) = Cov(Y i ) Di T (δ)vi 1 (δ)f i (δ) σ i β σ i α V i (1, 2) = Cov(Y i, S i ) V i (2, 2) = Cov(S i ) S ijk = (Y ij µ ij )(Y ik µ ik ) V i(1, 1) V i (1, 2) V i (1, 2) T V i (2, 2) σ i = E[S i ], σ ijk = Cov(Y ij, Y ik ). 1 y i µ i s i σ i Third and fourth order moments are contained in V i, specifically V i (1, 2) and V i (2, 2). 27

28 One can specify a working variance matrix in V i, in which the third and fourth moments are expressed as functions of (µ i, σ i ). Independence working models V i (1, 2) = 0 V i (2, 2) = diagonal matrix Gaussian working models V i (1, 2) = 0 V i (2, 2) : Cov(S ijk, S ij k ) = σ ijj σ ikk + σ ijk σ ikj 28

29 Estimation can be done using the Fisher scoring procedure ( ) 1 ( β(l+1) = β(l) + Di T Vi 1 D i i i α (l+1) α (l) ) Di T Vi 1 f i Sandwich variance estimator can be used to protect against the 3rd/4th moment model misspecification. Consistency of both ˆβ and ˆα depends on the correct modeling of both mean and covariance. The matrix V i has dimension M i M i where M i = n i + n i (n i + 1)/2, and its inverse is required! GEE1.5 is a special case of GEE2. The efficiency gain of GEE2 comparing with GEE1.5 depends on the correct specification of the 3rd/4th moments. Conclusion: GEE2 may not be worthwhile after all. If we want to specify higher order moments, why not use the likelihood? 29

30 Modeling the scale parameter ϕ Yan and Fine (2004) s approach Remember that Cov(Y i ) = ϕc 1/2 i R i C 1/2 i. When the scale parameter ϕ (i.e the over-dispersion parameter for heteroscedasticity) is important, Yan and Fine (2004) extended GEE1.5 by adding a third model for the scale parameter : g 3 (ϕ ij ) = X3ijγ, T where g 3 is the link function, e.g. log. The estimating equation is U ϕ = m i=1 D T 3iV 1 3i (T i ϕ i ) = 0, (3) where T i = (T i1,..., T ini ) T, T ij = (Y ij µ ij ) 2, V (µ ij ) D 3i = ϕ i γ, ϕ i = (ϕ i1,..., ϕ ini ) T. 30

31 V 3i also contains third and fourth moments. ˆβ is consistent with mis-specification of the correlation and scale parameters; ˆγ is consistent if the mean and the scale structures are correctly specified, regardless of mis-specification of the correlation. The method is implemented in R package geepack. 31

32 A Note on Modeling Correlation of Binary Responses Correlation for binary data are constrained by their means. Let E(Y 1 ) = µ 1, E(Y 2 ) = µ 2, ρ 12 = Cor(Y 1, Y 2 ), and π 12 = E(Y 1 Y 2 ), then π 12 < min(µ 1, µ 2 ), and hence, ρ 2 12 min { µ1 (1 µ 2 ) µ 2 (1 µ 1 ), µ } 2(1 µ 1 ), µ 1 (1 µ 2 ) because the pairwise correlation ρ 12 = π 12 µ 1 µ 2 [µ 1 (1 µ 1 )µ 2 (1 µ 2 )] 1/2, and π 12 is constrained to satisfy max(0, µ 1 + µ 2 1) < π 12 < min(µ 1, µ 2 ). 32

33 Modeling odds ratios: Ψ ijk = Pr(Y ij = 1, Y ik = 1) Pr(Y ij = 0, Y ik = 0) Pr(Y ij = 1, Y ik = 0) Pr(Y ij = 0, Y ik = 1), = Pr(Y ij = 1 Y ik = 1)/ Pr(Y ij = 0 Y ik = 1) Pr(Y ij = 1 Y ik = 0)/ Pr(Y ij = 0 Y ik = 0). The log odds ratios log Ψ (, ), are symmetric about 0 and not constrained by the marginal means. Interpretation: Ψ ijk = 1 or log Ψ ijk = 0 implies (Y ij, Y ik ) are uncorrelated. The odds ratio, Ψ ijk and the marginal means µ ij, µ ik determine π ijk = E(Y ij Y ik ), hence determine the correlation ρ ijk, and variance V i (β, α). 33

34 The odds ratio of Ψ ijk can be expressed as function of µ ij, µ ik and π ijk : Ψ ijk = π ijk(1 µ ij µ ik +π ijk ) (µ ij π ijk )(µ ik π ijk ) (4) Thus, π ijk can be solved from the above and expressed as function of µ ij, µ ik and Ψ ijk : π ijk = A [A2 4(Ψ ijk 1)Ψ ijk µ ij µ ik ] 1/2 2(Ψ ijk 1) when Ψ ijk 1 π ijk = Ψ ijk µ ij µ ik when Ψ ijk = 1 where A = 1 (µ ij + µ ik )(1 Ψ ijk ) Lipsitz et al (1991) 34

35 Using Marginal Odds Ratios to Model Association for Binary Responses Lipsitz et al (1991) s approach They modified the approach of Prentice (1988) using odds ratios. The joint models (Lipsitz et al, 1991) are: Mean model Correlation model logit(µ ij ) = X T ijβ log(ψ ijk ) = Z T ijkα Lipsitz et al. noted that β estimates from odds ratio model are slightly efficient comparing to that from correlation model by simulation and these two mdoels are computationally similar. 35

36 Carey, Zeger, and Diggle (1993) s alternating logistic regression (ALR) appraoch Let γ ijk = log Ψ ijk. Note that the pairwise conditional expectations: logit E(Y ij Y ik, X i ) = γ ijk Y ik + ijk, where ( ) µ ij π ijk ijk = log. 1 µ ij µ ik + π ijk (Check this using the fact in (4).) Suppose that all the odds ratios are the same, γ ijk = γ, then an estimate for γ could be obtained by a logistic regression of y ij on y ik, 1 j < k n i, i = 1,, m, using ijk as an offset. More generally, we assume log(ψ ijk ) = Zijk T α, where Z ijk is a set of covariates characterizing the log-odds ratio between observations j and k. For example, in family data, Z ijk would encode the type of family relationship for Y ij and Y ik : husband-wife, parent-sib, or sib-sib. We then estimate the vector α by a logistic regression of y ij on the product zijk T y ik with the offset ijk. 36

37 Note that the offset depends on both β and α so that iterations are required. We alternate the following two steps until convergence. 1. Given the current values of (β (l), α (l) ), calculate ˆV (l) and solve the U β (β, α (l) ) = 0 for an updated ˆβ (l+1), where U β (β, α) = m i=1 ( ) T µi V i (β, α) 1 (Y i µ i ). β 2. Given ˆβ (l+1) and ˆα (l), evaluate the offset and perform the offset logistic regression of Y ij on Z ijk Y ik with a total of m i=1 n i(n i 1)/2 observations to obtain ˆα (l+1). 37

38 Formally, the ALR uses the same model as in Lipsitz et al. (1991) but uses this estimating equation for α: U α (β, α) = m i=1 F T i (β, α) 1 W i (β, α)t i (β, α), where F i = ζ i α ζ ijk = E(Y ij Y ik ) T i s (j, k) entry: T ijk = Y ij ζ ijk W i (β, α) = diag(var(y ij Y ik )) = diag(ζ ijk (1 ζ ijk )). No working assumption about the 3rd/4th-order odds ratios are required. The efficiency is comparable to GEE2 but more computationally efficient for large clusters (does not require the inverse of large matrices). 38

39 Further Reading Chapters 7.1, 7.5, and 8 of DHLZ. References Cavey V, Zeger SL, and Diggle P (1993). Modelling multivariate binary data with alternating logistic regression. Biometrika 80: Ling KY and Zeger SL (1986). Longitudinal data analysis using generalized linear models. Biometrika 73: Liang KY, Zeger SL, and Qaquish B (1992). Multivariate regression analysse for categorical data (with discussion). JRSSB 54:3-40. Lipsitz SR, Laird NM, and Harrington DP (1991). Generalized estimating equations for correlated binary data: using the odds ratio as a measure of association. Biometrika 78: Prentice RL (1988). Correlated binary regression with covariates specific o each binary observations. Biometrics 44: Prentice RL and Zhao LP (1991). Estimating equations for parameters in means and covariances of multivariate discrete and continuous responses. Biometrics 47: Rotnitzky, A. and Jewell, N.P., (1990), Hypothesis Testing of Regression Parameters in Semiparametric Generalized Linear Models for Cluster Correlated Data. Biometrika, 77: Yan J and Fine J (2004). Estimating equations for association structures. Statistics in Medicine 23:

,..., θ(2),..., θ(n)

,..., θ(2),..., θ(n) Likelihoods for Multivariate Binary Data Log-Linear Model We have 2 n 1 distinct probabilities, but we wish to consider formulations that allow more parsimonious descriptions as a function of covariates.