Figure 36: Respiratory infection versus time for the first 49 children.

Maurice Ford
5 years ago

BINARY DATA MODELS

We devote an entire chapter to binary data since such data are challenging, both in terms of modeling the dependence and in terms of parameter interpretation. We again consider mixed effects model approaches (with inference from likelihood or Bayesian perspectives) and GEE.

Motivating Data - Indonesian Child Health

DHLZ describe a dataset in which the response is the absence/presence of respiratory illness in 275 children, with multiple measurements collected over time: quarterly measurements were taken for up to six consecutive quarters. Age was also recorded.

Question of interest: Is the prevalence of respiratory infection higher amongst children who suffer from xerophthalmia, an ocular manifestation of chronic vitamin A deficiency?

Figure 36 shows the infection indicator versus time for the first 49 children.

[Figure 36: Respiratory infection versus time for the first 49 children. One panel per child (id); horizontal axis: time; vertical axis: y.]

[Figure 37: Xerophthalmia versus time for the first 49 children. One panel per child (id); horizontal axis: time; vertical axis: xero.]

Exploratory Summaries

With binary responses and covariates, plots are not so informative; instead look at tables.

    > table(n)   # no of responses per child (i.e. 22 children had 1), 1200 in total
    > xtabs(~xerocross+ycross)   # cross-sectional comparison of respiratory
                                 # event (ycross) and xerophthalmia (xerocross)
                                 # at first time point
             ycross
    xerocross 0 1
    > 234/(9*31)
    [1]
    > summary(glm(cbind(ycross,1-ycross) ~ xerocross, family="binomial"))
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)                               <2e-16 ***
    xerocross
    > exp(-.1759)
    [1]

Now add in age (potential confounder).

    > modglm <- glm(cbind(ycross,1-ycross) ~ agecross+xerocross, family="binomial")
    > summary(modglm)   # Cross-sectional analyses
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)                               <2e-16 ***
    agecross                                         *
    xerocross
    Null deviance:      on 274 degrees of freedom
    Residual deviance:  on 272 degrees of freedom

    > modglmq <- glm(cbind(ycross,1-ycross) ~ agecross+xerocross, family=quasibinomial())
    > summary(modglmq)
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)                               <2e-16 ***
    agecross                                         *
    xerocross
    (Dispersion parameter for quasibinomial family taken to be )
    Null deviance:      on 274 degrees of freedom
    Residual deviance:  on 272 degrees of freedom

Overview of Modeling Clustered Binary Data

For a single binary variable, $Y$, all moments are determined by $p = \mathrm{E}[Y]$. Specifically $\mathrm{E}[Y^r] = p$ for $r \ge 1$.

Consider $n$ observations from a single individual, $Y = (Y_1, \ldots, Y_n)^T$. The saturated model has $2^n - 1$ parameters (in contrast to $n$ means, $n$ variances and $n(n-1)/2$ correlations for the saturated normal model).

There is no obvious or natural manner by which multivariate binary data can be specified. We begin with a mixed effects formulation that induces correlations.

Mixed Effects Models: Conjugate Model

The model

$$Y \mid p \sim \mathrm{Binomial}(n, p), \qquad p \sim \mathrm{Beta}(a, b)$$

gives

$$\Pr(Y = y) = \binom{n}{y} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \, \frac{\Gamma(a+y)\,\Gamma(b+n-y)}{\Gamma(a+b+n)}, \qquad y = 0, 1, \ldots, n.$$

Marginal moments:

$$\mathrm{E}[Y] = n\,\mathrm{E}[p] = n\,\frac{a}{a+b}, \qquad \mathrm{var}(Y) = n\,\mathrm{E}[p]\,(1 - \mathrm{E}[p])\,\frac{a+b+n}{a+b+1}.$$

Limitations: the likelihood is complex, extension to covariates increases complexity, and there is no tractable way to allow random effect slopes.

Mixed Effects Models: Logistic Model

A GLMM for binary data takes the binomial exponential family, with the canonical link being logistic. We have

Stage 1: $Y_{ij} \mid p_{ij} \sim_{ind} \mathrm{Binomial}(n_{ij}, p_{ij})$ with

$$\log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = x_{ij}\beta + z_{ij} b_i.$$

Stage 2: $b_i \mid D \sim_{iid} N(0, D)$.

Marginal moments are not available in closed form.
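The marginal moments of the conjugate model can be checked numerically. The following is a minimal sketch in base R; the helper name `dbetabinom` and the values of `n`, `a` and `b` are illustrative assumptions, not taken from the notes.

```r
# Beta-binomial pmf from Y | p ~ Binomial(n, p), p ~ Beta(a, b).
# dbetabinom is a hypothetical helper; n, a, b are assumed values.
dbetabinom <- function(y, n, a, b) {
  # Gamma(a+b)/(Gamma(a)Gamma(b)) * Gamma(a+y)Gamma(b+n-y)/Gamma(a+b+n)
  # equals B(a+y, b+n-y)/B(a, b); work on the log scale for stability
  exp(lchoose(n, y) + lbeta(a + y, b + n - y) - lbeta(a, b))
}

n <- 6; a <- 2; b <- 5
y   <- 0:n
pmf <- dbetabinom(y, n, a, b)

# moments computed directly from the pmf
mean_y <- sum(y * pmf)
var_y  <- sum(y^2 * pmf) - mean_y^2

# closed-form marginal moments
Ep      <- a / (a + b)
mean_cf <- n * Ep
var_cf  <- n * Ep * (1 - Ep) * (a + b + n) / (a + b + 1)

print(c(mean_y = mean_y, mean_cf = mean_cf, var_y = var_y, var_cf = var_cf))
```

The factor $(a+b+n)/(a+b+1)$ exceeds 1 whenever $n > 1$, which is the overdispersion relative to a binomial with the same mean.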

Likelihood Inference

Likelihood inference proceeds in the usual asymptotic fashion, following integration of the random effects. A Bayesian analysis adds priors for the fixed effects and variance components.

Model checking is very difficult for binary data.

    > library(lme4)
    > memmod1 <- lmer(cbind(y,1-y) ~ age+xero+(1|id), family=binomial, method="pql")
    > summary(memmod1)
    Random effects:
     Groups Name        Variance Std.Dev.
     id     (Intercept)
    Fixed effects:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)                               < 2.2e-16 ***
    age                                                 **
    xero

    > memmod2 <- lmer(cbind(y,1-y) ~ age+xero+(1|id), family=binomial, method="laplace")
    > summary(memmod2)
    Random effects:
     Groups Name        Variance Std.Dev.
     id     (Intercept)
    Fixed effects:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)                               < 2e-16 ***
    age                                               *
    xero
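To make "integration of the random effects" concrete, here is a minimal base R sketch that evaluates the marginal probability of one child's response vector under a random-intercept logistic model by one-dimensional numerical integration. The function name `marg_prob` and the values of `beta`, `x` and `sigma` are assumptions chosen for illustration; this is not the algorithm used internally by lme4.

```r
# Marginal probability of a response vector y under
#   logit p_j = eta_fixed_j + b,  b ~ N(0, sigma^2),
# obtained by integrating the conditional likelihood over b.
marg_prob <- function(y, eta_fixed, sigma) {
  integrand <- function(b) {
    sapply(b, function(bi) {
      p <- plogis(eta_fixed + bi)
      prod(p^y * (1 - p)^(1 - y)) * dnorm(bi, sd = sigma)
    })
  }
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

beta  <- c(-2, 0.5)        # intercept and slope (assumed)
x     <- c(0, 1, 1)        # covariate at three visits (assumed)
sigma <- 1.2               # random-intercept standard deviation (assumed)
eta   <- beta[1] + beta[2] * x

# marginal probabilities of all 2^3 response patterns
patterns <- as.matrix(expand.grid(rep(list(0:1), length(x))))
probs    <- apply(patterns, 1, marg_prob, eta_fixed = eta, sigma = sigma)
total    <- sum(probs)

# the shared intercept induces positive marginal dependence
p1  <- sum(probs[patterns[, 1] == 1])
p2  <- sum(probs[patterns[, 2] == 1])
p12 <- sum(probs[patterns[, 1] == 1 & patterns[, 2] == 1])
print(c(total = total, p12 = p12, p1_times_p2 = p1 * p2))
```

The full likelihood is the product of such integrals over children, which is why approximations such as PQL or Laplace are used in practice.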

Likelihoods for Multivariate Binary Data

Log-Linear Model

We have $2^n - 1$ distinct probabilities, but we wish to consider formulations that allow more parsimonious descriptions as a function of covariates.

One choice is the log-linear model:

$$\Pr(Y = y) = c(\theta) \exp\left( \sum_{j} \theta^{(1)}_j y_j + \sum_{j_1 < j_2} \theta^{(2)}_{j_1 j_2} y_{j_1} y_{j_2} + \cdots + \theta^{(n)}_{12\ldots n}\, y_1 \cdots y_n \right),$$

with $2^n - 1$ parameters

$$\theta = (\theta^{(1)}_1, \ldots, \theta^{(1)}_n, \theta^{(2)}_{12}, \ldots, \theta^{(2)}_{n-1,n}, \ldots, \theta^{(n)}_{12\ldots n})^T,$$

and where $c(\theta)$ is the normalizing constant.

This formulation allows calculation of cell probabilities (but is less useful for describing $\Pr(Y = y)$ as a function of $x$).

Note that we have $2^n - 1$ parameters and we have two aims: reduce this number, and introduce a regression model.

Example: $n = 2$. We have

$$\Pr(Y_1 = y_1, Y_2 = y_2) = c(\theta) \exp\left( \theta^{(1)}_1 y_1 + \theta^{(1)}_2 y_2 + \theta^{(2)}_{12} y_1 y_2 \right),$$

where $\theta = (\theta^{(1)}_1, \theta^{(1)}_2, \theta^{(2)}_{12})^T$ and

$$c(\theta)^{-1} = \sum_{y_1=0}^{1} \sum_{y_2=0}^{1} \exp\left( \theta^{(1)}_1 y_1 + \theta^{(1)}_2 y_2 + \theta^{(2)}_{12} y_1 y_2 \right).$$

    y_1   y_2   Pr(Y_1 = y_1, Y_2 = y_2)
    0     0     c(θ)
    1     0     c(θ) exp(θ^(1)_1)
    0     1     c(θ) exp(θ^(1)_2)
    1     1     c(θ) exp(θ^(1)_1 + θ^(1)_2 + θ^(2)_12)
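The $n = 2$ table can be reproduced numerically. The sketch below, in base R with assumed parameter values, builds the four cell probabilities, computes $c(\theta)$, and shows that ratios of cells recover $\exp(\theta^{(1)}_1)$ and $\exp(\theta^{(2)}_{12})$.

```r
# Cell probabilities for the n = 2 log-linear model.
# theta1, theta2, theta12 are illustrative assumptions.
theta1 <- 0.3; theta2 <- -0.5; theta12 <- 0.8

cells   <- expand.grid(y1 = 0:1, y2 = 0:1)
kernel  <- with(cells, exp(theta1 * y1 + theta2 * y2 + theta12 * y1 * y2))
c_theta <- 1 / sum(kernel)            # normalizing constant c(theta)
cells$prob <- c_theta * kernel

# helper to look up Pr(Y1 = a, Y2 = b)
p <- function(a, b) cells$prob[cells$y1 == a & cells$y2 == b]

odds_1_given_0 <- p(1, 0) / p(0, 0)                          # equals exp(theta1)
odds_ratio     <- p(1, 1) * p(0, 0) / (p(1, 0) * p(0, 1))    # equals exp(theta12)

print(cells)
print(c(odds_1_given_0 = odds_1_given_0, odds_ratio = odds_ratio))
```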

Hence we have (conditional) interpretations:

$$\exp(\theta^{(1)}_1) = \frac{\Pr(Y_1 = 1, Y_2 = 0)}{\Pr(Y_1 = 0, Y_2 = 0)} = \frac{\Pr(Y_1 = 1 \mid Y_2 = 0)}{\Pr(Y_1 = 0 \mid Y_2 = 0)},$$

the odds of an event at trial 1, given no event at trial 2;

$$\exp(\theta^{(1)}_2) = \frac{\Pr(Y_1 = 0, Y_2 = 1)}{\Pr(Y_1 = 0, Y_2 = 0)} = \frac{\Pr(Y_2 = 1 \mid Y_1 = 0)}{\Pr(Y_2 = 0 \mid Y_1 = 0)},$$

the odds of an event at trial 2, given no event at trial 1;

$$\exp(\theta^{(2)}_{12}) = \frac{\Pr(Y_1 = 1, Y_2 = 1)\Pr(Y_1 = 0, Y_2 = 0)}{\Pr(Y_1 = 1, Y_2 = 0)\Pr(Y_1 = 0, Y_2 = 1)} = \frac{\Pr(Y_2 = 1 \mid Y_1 = 1)/\Pr(Y_2 = 0 \mid Y_1 = 1)}{\Pr(Y_2 = 1 \mid Y_1 = 0)/\Pr(Y_2 = 0 \mid Y_1 = 0)},$$

the odds of an event at trial 2 given an event at trial 1, divided by the odds of an event at trial 2 given no event at trial 1. Hence if this parameter is larger than 1 we have positive dependence.

Quadratic Exponential Log-Linear Model

We describe three approaches to modeling binary data: conditional odds ratios, correlations, marginal odds ratios.

Zhao and Prentice (1990) consider the log-linear model with third and higher-order terms set to zero, so that

$$\Pr(Y = y) = c(\theta) \exp\left( \sum_{j} \theta^{(1)}_j y_j + \sum_{j<k} \theta^{(2)}_{jk} y_j y_k \right).$$

For this model

$$\frac{\Pr(Y_j = 1 \mid Y_k = y_k, Y_l = 0, l \ne j, k)}{\Pr(Y_j = 0 \mid Y_k = y_k, Y_l = 0, l \ne j, k)} = \exp(\theta^{(1)}_j + \theta^{(2)}_{jk} y_k).$$

Interpretation:

$\exp(\theta^{(1)}_j)$ is the odds of a success, given all other responses are zero.

$\exp(\theta^{(2)}_{jk})$ is the odds ratio describing the association between $Y_j$ and $Y_k$, given all other responses are fixed and equal to zero.
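The conditional odds result for the quadratic exponential model can be verified by enumeration. This is a minimal sketch for $n = 3$ in base R; all parameter values are illustrative assumptions.

```r
# Quadratic exponential model for n = 3: main effects and pairwise terms only.
# th1 and th2 hold assumed values of theta_j^(1) and theta_jk^(2).
th1 <- c(-0.4, 0.2, 0.1)
th2 <- matrix(0, 3, 3)                   # only the upper triangle (j < k) is used
th2[1, 2] <- 0.7; th2[1, 3] <- -0.3; th2[2, 3] <- 0.5

# enumerate the 2^3 outcomes and normalize
Y <- as.matrix(expand.grid(y1 = 0:1, y2 = 0:1, y3 = 0:1))
logkern <- apply(Y, 1, function(y) sum(th1 * y) + sum(th2 * outer(y, y)))
prob <- exp(logkern) / sum(exp(logkern))

# conditional odds of Y1 = 1 given Y2 = y2 and Y3 = 0
cond_odds <- function(y2) {
  num <- prob[Y[, 1] == 1 & Y[, 2] == y2 & Y[, 3] == 0]
  den <- prob[Y[, 1] == 0 & Y[, 2] == y2 & Y[, 3] == 0]
  num / den
}

print(c(cond_odds(0), exp(th1[1])))               # odds given all others zero
print(c(cond_odds(1), exp(th1[1] + th2[1, 2])))   # odds given Y2 = 1, Y3 = 0
```

Setting `Y[, 3] == 1` in the conditioning instead would change the odds by a factor $\exp(\theta^{(2)}_{13})$, which illustrates why the interpretation depends on the values of all other responses.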

Issues:

1. Suppose we now wish to model $\theta$ as a function of $x$. Example: $Y$ is respiratory infection and $x$ is mother's smoking (no/yes). Then we could let the parameters $\theta$ depend on $x$, i.e. $\theta(x)$. But the difference between $\theta^{(1)}_j(x = 1)$ and $\theta^{(1)}_j(x = 0)$ represents the effect of smoking on the conditional probability of respiratory infection at visit $j$, given that there was no infection at any other visit. Since smoking is likely to influence the response at all visits, a part of the smoking effect will be conditioned away. This is difficult to interpret, and we would rather model the marginal probability.

2. The interpretation of the $\theta$ parameters depends on the number of responses $n$, which is particularly a problem in a longitudinal setting with different $n_i$.

A Related Model

Fitzmaurice and Laird (1993) parameterize in terms of $\mu_j = \Pr(Y_j = 1)$ (the marginal mean) and the second and higher-order parameters, $\theta^{(2)}_{12}, \ldots, \theta^{(2)}_{n-1,n}, \ldots, \theta^{(n)}_{12\ldots n}$.

If we let $\theta_1 = (\theta^{(1)}_1, \ldots, \theta^{(1)}_n)$ and $\theta_2 = (\theta^{(2)}_{12}, \theta^{(2)}_{13}, \ldots, \theta^{(2)}_{n-1,n}, \ldots, \theta^{(n)}_{1\ldots n})$ and

$$w = (y_1 y_2, y_1 y_3, \ldots, y_{n-1} y_n, y_1 y_2 y_3, \ldots, y_1 y_2 \cdots y_n),$$

then the log-linear model can be written as

$$\Pr(Y = y) = c(\theta_1, \theta_2) \exp(y^T \theta_1 + w^T \theta_2).$$

The parameters of interest are $\mu = (\mu_1, \ldots, \mu_n) = \mu(\theta_1, \theta_2)$, so that we can reparameterize from $(\theta_1, \theta_2)$ to $(\mu, \theta_2)$.

GEE is used for inference (with a logistic mean model, for example).

The limitation of this parameterization is that the interpretation of $\theta_2$, as conditional odds ratios, depends on the number of other responses. So it is most useful for cases in which we have the same number of observations per cluster. If the association between responses is of interest then it is not so easy to interpret.

Bahadur Representation

Another approach to parameterizing a multivariate binary model was proposed by Bahadur (1961), who used marginal means, as well as second-order moments specified in terms of correlations.

Let

$$R_j = \frac{Y_j - \mu_j}{[\mu_j (1 - \mu_j)]^{1/2}},$$

$$\rho_{jk} = \mathrm{corr}(Y_j, Y_k) = \mathrm{E}[R_j R_k], \quad \rho_{jkl} = \mathrm{E}[R_j R_k R_l], \quad \ldots, \quad \rho_{1,\ldots,n} = \mathrm{E}[R_1 \cdots R_n].$$

Then we can write

$$\Pr(Y = y) = \prod_{j=1}^{n} \mu_j^{y_j} (1 - \mu_j)^{1 - y_j} \times \left( 1 + \sum_{j<k} \rho_{jk} r_j r_k + \sum_{j<k<l} \rho_{jkl} r_j r_k r_l + \cdots + \rho_{1,\ldots,n}\, r_1 r_2 \cdots r_n \right).$$

Appealing because we have the marginal means $\mu_j$ and nuisance parameters.

Limitations: Unfortunately, the correlations are constrained in complicated ways by the marginal means.

Example: consider measurements on a single individual, $Y_1$ and $Y_2$, with means $\mu_1$ and $\mu_2$. We have

$$\mathrm{corr}(Y_1, Y_2) = \frac{\Pr(Y_1 = 1, Y_2 = 1) - \mu_1 \mu_2}{\{\mu_1 (1 - \mu_1) \mu_2 (1 - \mu_2)\}^{1/2}}$$

and

$$\max(0, \mu_1 + \mu_2 - 1) \le \Pr(Y_1 = 1, Y_2 = 1) \le \min(\mu_1, \mu_2),$$

which implies complex constraints on the correlation. For example, if $\mu_1 = \ldots$ and $\mu_2 = \ldots$ then $0 \le \mathrm{corr}(Y_1, Y_2) \le \ldots$
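The constraint on the correlation follows directly from the bounds on $\Pr(Y_1 = 1, Y_2 = 1)$ and is easy to compute. A minimal base R sketch follows; the function name `corr_bounds` and the marginal means are illustrative assumptions and are not the values used in the original example.

```r
# Range of corr(Y1, Y2) permitted by the marginal means mu1 and mu2.
corr_bounds <- function(mu1, mu2) {
  s  <- sqrt(mu1 * (1 - mu1) * mu2 * (1 - mu2))
  lo <- (max(0, mu1 + mu2 - 1) - mu1 * mu2) / s   # smallest feasible joint prob
  hi <- (min(mu1, mu2) - mu1 * mu2) / s           # largest feasible joint prob
  c(lower = lo, upper = hi)
}

b_equal   <- corr_bounds(0.5, 0.5)   # equal means of 0.5: full range attainable
b_unequal <- corr_bounds(0.1, 0.6)   # unequal means: much narrower range
print(b_equal)
print(b_unequal)
```

With unequal means the upper limit is strictly below 1, so a working correlation chosen without reference to the means may correspond to no valid joint distribution.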

Marginal Odds Ratios

An alternative is to parameterize in terms of the marginal means and the marginal odds ratios defined by

$$\gamma_{jk} = \frac{\Pr(Y_j = 1, Y_k = 1)\Pr(Y_j = 0, Y_k = 0)}{\Pr(Y_j = 1, Y_k = 0)\Pr(Y_j = 0, Y_k = 1)} = \frac{\Pr(Y_j = 1 \mid Y_k = 1)/\Pr(Y_j = 0 \mid Y_k = 1)}{\Pr(Y_j = 1 \mid Y_k = 0)/\Pr(Y_j = 0 \mid Y_k = 0)},$$

which is the odds that the $j$-th observation is a 1, given the $k$-th observation is a 1, divided by the odds that the $j$-th observation is a 1, given the $k$-th observation is a 0.

Hence if $\gamma_{jk} > 1$ we have positive dependence between outcomes $j$ and $k$.

It is then possible to obtain the joint distribution in terms of the means $\mu$, where $\mu_j = \Pr(Y_j = 1)$, the odds ratios $\gamma = (\gamma_{12}, \ldots, \gamma_{n-1,n})$ and contrasts of odds ratios.

We need to find $\mathrm{E}[Y_j Y_k] = \mu_{jk}$, so that we can write down the likelihood function, or an estimating function.

For the case of $n = 2$ we have

$$\gamma_{12} = \frac{\Pr(Y_1 = 1, Y_2 = 1)\Pr(Y_1 = 0, Y_2 = 0)}{\Pr(Y_1 = 1, Y_2 = 0)\Pr(Y_1 = 0, Y_2 = 1)} = \frac{\mu_{12}(1 - \mu_1 - \mu_2 + \mu_{12})}{(\mu_1 - \mu_{12})(\mu_2 - \mu_{12})},$$

and so

$$\mu_{12}^2 (\gamma_{12} - 1) + \mu_{12}\, b + \gamma_{12} \mu_1 \mu_2 = 0,$$

where $b = (\mu_1 + \mu_2)(1 - \gamma_{12}) - 1$, to give

$$\mu_{12} = \frac{-b \pm \sqrt{b^2 - 4(\gamma_{12} - 1)\gamma_{12}\mu_1 \mu_2}}{2(\gamma_{12} - 1)},$$

with the root chosen so that $\max(0, \mu_1 + \mu_2 - 1) \le \mu_{12} \le \min(\mu_1, \mu_2)$. The implied $2 \times 2$ table is

               Y_2 = 0                  Y_2 = 1
    Y_1 = 0    1 - µ_1 - µ_2 + µ_12     µ_2 - µ_12     1 - µ_1
    Y_1 = 1    µ_1 - µ_12               µ_12           µ_1
               1 - µ_2                  µ_2
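The quadratic for $\mu_{12}$ can be solved and checked in a few lines. This is a minimal base R sketch; the function name `mu12_from_or` and the input values are illustrative assumptions.

```r
# Joint success probability mu12 implied by margins mu1, mu2 and a
# marginal odds ratio gamma.
mu12_from_or <- function(mu1, mu2, gamma) {
  if (isTRUE(all.equal(gamma, 1))) return(mu1 * mu2)   # independence
  b    <- (mu1 + mu2) * (1 - gamma) - 1
  disc <- b^2 - 4 * (gamma - 1) * gamma * mu1 * mu2
  # the "minus" root is the one that satisfies
  # max(0, mu1 + mu2 - 1) <= mu12 <= min(mu1, mu2)
  (-b - sqrt(disc)) / (2 * (gamma - 1))
}

mu1 <- 0.3; mu2 <- 0.4; gamma <- 2.5     # assumed values
mu12 <- mu12_from_or(mu1, mu2, gamma)

# recover the odds ratio from the implied 2 x 2 table as a check
gamma_back <- mu12 * (1 - mu1 - mu2 + mu12) / ((mu1 - mu12) * (mu2 - mu12))
print(c(mu12 = mu12, gamma_back = gamma_back))
```

Because $\gamma > 1$ here, the solution exceeds the independence value $\mu_1 \mu_2 = 0.12$.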

Limitations

In a longitudinal setting (we add an $i$ subscript to denote individuals), finding the $\mu_{ijk}$ terms is computationally complex (see Qaqish and Ivanova, 2006, Biometrika).

There are large numbers of nuisance odds ratios if the $n_i$ are large; assumptions such as $\gamma_{ijk} = \gamma$ for all $i, j, k$ may be made.

Another possibility is to take

$$\log \gamma_{ijk} = \alpha_0 + \alpha_1 |t_{ij} - t_{ik}|^{-1},$$

so that the degree of association is inversely proportional to the time between observations.

Modeling Multivariate Binary Data Using GEE

For a marginal Bernoulli outcome we have

$$\Pr(Y_{ij} = y_{ij} \mid x_{ij}) = \mu_{ij}^{y_{ij}} (1 - \mu_{ij})^{1 - y_{ij}} = \exp\left( y_{ij}\theta_{ij} - \log\{1 + e^{\theta_{ij}}\} \right),$$

where $\theta_{ij} = \log(\mu_{ij}/(1 - \mu_{ij}))$, an exponential family representation.

For independent responses we therefore have the likelihood

$$\Pr(Y \mid x) = \exp\left[ \sum_{i=1}^{m} \sum_{j=1}^{n_i} y_{ij}\theta_{ij} - \sum_{i=1}^{m} \sum_{j=1}^{n_i} \log\{1 + e^{\theta_{ij}}\} \right] = \exp\left[ \sum_{i=1}^{m} \sum_{j=1}^{n_i} l_{ij} \right].$$

To find the MLEs we consider the score equation:

$$G(\beta) = \frac{\partial l}{\partial \beta} = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \frac{\partial l_{ij}}{\partial \theta_{ij}} \frac{\partial \theta_{ij}}{\partial \beta} = \sum_{i=1}^{m} \sum_{j=1}^{n_i} x_{ij} (y_{ij} - \mu_{ij}) = \sum_{i=1}^{m} x_i^T (y_i - \mu_i).$$

So GEE with working independence can be implemented with standard software, though we need to fix up the standard errors via sandwich estimation.
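As a sketch of the working-independence case, the score equation above can be solved by Newton-Raphson and the standard errors corrected with a cluster-level sandwich estimator. The code uses simulated data (all simulation settings are assumptions) and base R only; it is an illustration, not the geese implementation.

```r
set.seed(1)
# simulated clustered binary data: m clusters of size n_i (assumed settings)
m <- 50; n_i <- 4
id <- rep(seq_len(m), each = n_i)
x  <- rnorm(m * n_i)
b  <- rep(rnorm(m, sd = 1), each = n_i)       # shared effect induces dependence
y  <- rbinom(m * n_i, 1, plogis(-1 + 0.5 * x + b))
X  <- cbind(1, x)

# Newton-Raphson on G(beta) = sum_i x_i^T (y_i - mu_i)
beta <- c(0, 0)
for (iter in 1:25) {
  mu   <- as.vector(plogis(X %*% beta))
  G    <- crossprod(X, y - mu)                # score
  A    <- crossprod(X, X * (mu * (1 - mu)))   # minus the derivative of G
  beta <- beta + as.vector(solve(A, G))
}

# the same point estimate from standard software
fit <- glm(y ~ x, family = binomial)

# sandwich variance A^{-1} B A^{-1}, with B summed over clusters
mu <- as.vector(plogis(X %*% beta))
A  <- crossprod(X, X * (mu * (1 - mu)))
U  <- rowsum(X * (y - mu), id)                # one score contribution per cluster
B  <- crossprod(U)
V  <- solve(A) %*% B %*% solve(A)

se_naive    <- sqrt(diag(solve(A)))
se_sandwich <- sqrt(diag(V))
print(rbind(beta = beta, glm = unname(coef(fit)), se_naive, se_sandwich))
```

The point estimates agree with `glm`; only the standard errors differ, because the naive ones assume independence within clusters.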

Non-independence GEE

Assuming working correlation matrices $R_i(\alpha)$, we have the estimating equation

$$G(\beta) = \sum_{i=1}^{m} D_i^T W_i^{-1} (y_i - \mu_i),$$

where $W_i = \Delta_i^{1/2} R_i(\alpha) \Delta_i^{1/2}$, with $\Delta_i$ the diagonal matrix of the variances of the elements of $Y_i$.

Here $\alpha$ are parameters for which we need a consistent estimator (Newey, 1990, shows that the choice of estimator for $\alpha$ has no effect on the asymptotic efficiency).

Define a set of $n_i(n_i - 1)/2$ empirical covariances

$$R_{ijk} = (Y_{ij} - \mu_{ij})(Y_{ik} - \mu_{ik}).$$

We can then define a set of moment-based estimating equations to obtain estimates of $\alpha$.

First Extension to GEE

Rather than have a method of moments estimator for $\alpha$, Prentice (1988) proposed using a second set of estimating equations for $\alpha$. In the context of data with $\mathrm{var}(Y_{ij}) = v(\mu_{ij})$ recall:

$$G_1(\beta, \alpha) = \sum_{i=1}^{m} D_i^T W_i^{-1} (Y_i - \mu_i),$$

$$G_2(\beta, \alpha) = \sum_{i=1}^{m} E_i^T H_i^{-1} (T_i - \Sigma_i),$$

where $R_{ij} = \{Y_{ij} - \mu_{ij}\}/v(\mu_{ij})^{1/2}$, to give data

$$T_i^T = (R_{i1}R_{i2}, \ldots, R_{i,n_i-1}R_{in_i}, R_{i1}^2, \ldots, R_{in_i}^2),$$

$\Sigma_i(\alpha) = \mathrm{E}[T_i]$ is a model for the correlations and variances of the standardized residuals, $E_i = \partial \Sigma_i / \partial \alpha$, and $H_i = \mathrm{cov}(T_i)$ is the working covariance model.

The vector $T_i$ has $n_i(n_i - 1)/2 + n_i$ elements in general; the working covariance model $H_i$ is in general complex.

In the context of binary data, Prentice (1988) notes that the variances are determined by the mean and so the last $n_i$ terms of $T_i$ are dropped; he also suggests taking a diagonal working covariance model, $H_i$, i.e. ignoring the covariances.

The theoretical variances are given by

$$\mathrm{var}(R_{ij}R_{ik}) = 1 + (1 - 2p_{ij})(1 - 2p_{ik})\{p_{ij}(1 - p_{ij})p_{ik}(1 - p_{ik})\}^{-1/2}\, \Sigma(\alpha)_{ijk} - \Sigma(\alpha)_{ijk}^2,$$

which depend on the assumed correlation model $\Sigma$; these may be taken as the diagonal elements of $H_i$.

Application of GEE Extension to Marginal Odds Model

We have the marginal mean model

$$\mathrm{logit}\, \mathrm{E}[Y_{ij} \mid X_{ij}] = \beta x_{ij}.$$

Suppose we specify a model for the associations in terms of the marginal log odds ratios:

$$\alpha_{ijk} = \log\left\{ \frac{\Pr(Y_{ij} = 1, Y_{ik} = 1)\Pr(Y_{ij} = 0, Y_{ik} = 0)}{\Pr(Y_{ij} = 1, Y_{ik} = 0)\Pr(Y_{ij} = 0, Y_{ik} = 1)} \right\}.$$

These are nuisance parameters, but how do we estimate them?

Carey et al. (1992) suggest the following approach for estimating $\beta$ and $\alpha$. Let

$$\mu_{ij} = \Pr(Y_{ij} = 1), \qquad \mu_{ik} = \Pr(Y_{ik} = 1), \qquad \mu_{ijk} = \Pr(Y_{ij} = 1, Y_{ik} = 1).$$

It is easy to show that

$$\frac{\Pr(Y_{ij} = 1 \mid Y_{ik} = y_{ik})}{\Pr(Y_{ij} = 0 \mid Y_{ik} = y_{ik})} = \exp(y_{ik}\alpha_{ijk})\, \frac{\Pr(Y_{ij} = 1, Y_{ik} = 0)}{\Pr(Y_{ij} = 0, Y_{ik} = 0)} = \exp(y_{ik}\alpha_{ijk}) \left( \frac{\mu_{ij} - \mu_{ijk}}{1 - \mu_{ij} - \mu_{ik} + \mu_{ijk}} \right),$$

which can be written as a logistic regression model in terms of conditional probabilities:

$$\mathrm{logit}\, \mathrm{E}[Y_{ij} \mid Y_{ik}] = \log\left( \frac{\Pr(Y_{ij} = 1 \mid Y_{ik} = y_{ik})}{\Pr(Y_{ij} = 0 \mid Y_{ik} = y_{ik})} \right) = y_{ik}\alpha_{ijk} + \log\left( \frac{\mu_{ij} - \mu_{ijk}}{1 - \mu_{ij} - \mu_{ik} + \mu_{ijk}} \right),$$

where the term on the right is treated as a known offset, evaluated at the current estimates ($\mu_{ij}$ and $\mu_{ik}$ are functions of $\beta$ only, while $\mu_{ijk}$ also depends on $\alpha_{ijk}$).

Suppose for simplicity that $\alpha_{ijk} = \alpha$. Then, given current estimates of $\beta$ and $\alpha$, we can fit a logistic regression model by regressing $Y_{ij}$ on $Y_{ik}$ for $1 \le j < k \le n_i$ to estimate $\alpha$; this can then be used within the working correlation model.

Carey et al. (1992) refer to this method as alternating logistic regressions.

Indonesian Children's Health Example

    > summary(geese(y ~ xero+age, corstr="independence", id=id, family="binomial"))
    Mean Model:
     Mean Link:                 logit
     Variance to Mean Relation: binomial
     Coefficients:
                estimate san.se wald     p
    (Intercept)                       e+00
    age                               e-07
    xero                              e-02
    Scale Model:
     Scale Link:                identity
     Estimated Scale Parameters:
                estimate san.se wald p
    (Intercept)
    Correlation Model:
     Correlation Structure:     independence
    Number of clusters: 275   Maximum cluster size: 6
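Returning to the conditional logit identity that underlies alternating logistic regressions, the relationship can be confirmed numerically for a single pair of responses. This is a minimal base R sketch with assumed margins and log odds ratio; it checks the identity only and does not implement the full alternating algorithm.

```r
# Assumed margins and marginal log odds ratio for one pair (Y_ij, Y_ik)
mu_j <- 0.25; mu_k <- 0.40; alpha <- log(3)

# joint probability mu_jk implied by the margins and the odds ratio
g     <- exp(alpha)
b     <- (mu_j + mu_k) * (1 - g) - 1
mu_jk <- (-b - sqrt(b^2 - 4 * (g - 1) * g * mu_j * mu_k)) / (2 * (g - 1))

# 2 x 2 joint distribution of (Y_ij, Y_ik)
p11 <- mu_jk
p10 <- mu_j - mu_jk
p01 <- mu_k - mu_jk
p00 <- 1 - mu_j - mu_k + mu_jk

# offset term: log{(mu_j - mu_jk) / (1 - mu_j - mu_k + mu_jk)}
offset_term <- log(p10 / p00)

# logit of Pr(Y_ij = 1 | Y_ik = yk) computed directly from the table
logit_given <- function(yk) {
  if (yk == 1) log(p11 / p01) else log(p10 / p00)
}

print(c(direct = logit_given(1), identity = alpha + offset_term))
```

In the alternating scheme the offset is recomputed at each iteration from the current estimates, and the coefficient of $Y_{ik}$ in the offset logistic regression gives the updated $\alpha$.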

    > summary(geese(y ~ age+xero, corstr="exchangeable", id=id, family="binomial"))
                estimate san.se wald     p
    (Intercept)                       e+00
    age                               e-06
    xero                              e-01
     Estimated Scale Parameters:
                estimate san.se wald p
    (Intercept)
     Estimated Correlation Parameters:
          estimate san.se wald p
    alpha

    > summary(geese(y ~ age+xero, corstr="ar1", id=id, family="binomial"))
     Coefficients:
                estimate san.se wald     p
    (Intercept)                       e+00
    age                               e-07
    xero                              e-01
     Estimated Scale Parameters:
                estimate san.se wald p
    (Intercept)
     Estimated Correlation Parameters:
          estimate san.se wald p
    alpha

Conditional Likelihood: Binary Longitudinal Data

Recall that conditional likelihood is a technique for eliminating nuisance parameters, here what we have previously modeled as random effects.

Consider individual $i$ with binary observations $y_{i1}, \ldots, y_{in_i}$ and assume the random intercepts model $Y_{ij} \mid \gamma_i, \beta \sim \mathrm{Bernoulli}(p_{ij})$, where

$$\log\left( \frac{p_{ij}}{1 - p_{ij}} \right) = x_{ij}\beta + \gamma_i,$$

where $\gamma_i = x_i\beta + b_i$ and the $x_{ij}$ (a slight change from our usual notation) are those covariates which change within an individual.

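We have

$$\Pr(y_{i1}, \ldots, y_{in_i} \mid \gamma_i, \beta) = \prod_{j=1}^{n_i} \frac{\exp(\gamma_i y_{ij} + x_{ij}\beta y_{ij})}{1 + \exp(\gamma_i + x_{ij}\beta)}$$

$$= \frac{\exp\left( \gamma_i \sum_{j=1}^{n_i} y_{ij} + \sum_{j=1}^{n_i} x_{ij} y_{ij} \beta \right)}{\prod_{j=1}^{n_i} [1 + \exp(\gamma_i + x_{ij}\beta)]}$$

$$= \frac{\exp(\gamma_i t_{2i} + t_{1i}\beta)}{\prod_{j=1}^{n_i} [1 + \exp(\gamma_i + x_{ij}\beta)]} = \frac{\exp(\gamma_i t_{2i} + t_{1i}\beta)}{k(\gamma_i, \beta)} = p(t_{1i}, t_{2i} \mid \gamma_i, \beta),$$

where

$$t_{1i} = \sum_{j=1}^{n_i} x_{ij} y_{ij}, \qquad t_{2i} = \sum_{j=1}^{n_i} y_{ij}, \qquad k(\gamma_i, \beta) = \prod_{j=1}^{n_i} [1 + \exp(\gamma_i + x_{ij}\beta)].$$

We have

$$L_c(\beta) = \prod_{i=1}^{m} p(t_{1i} \mid t_{2i}, \beta) = \prod_{i=1}^{m} \frac{p(t_{1i}, t_{2i} \mid \gamma_i, \beta)}{p(t_{2i} \mid \gamma_i, \beta)},$$

where

$$p(t_{2i} \mid \gamma_i, \beta) = \sum_{l=1}^{\binom{n_i}{y_{i+}}} \frac{\exp\left( \gamma_i \sum_{j=1}^{n_i} y_{ij} + \sum_{k=1}^{n_i} x_{ik} y^l_{ik} \beta \right)}{k(\gamma_i, \beta)},$$

where the summation is over the $\binom{n_i}{y_{i+}}$ ways of choosing $y_{i+}$ ones out of $n_i$, and $y^l_i = (y^l_{i1}, \ldots, y^l_{in_i})$, $l = 1, \ldots, \binom{n_i}{y_{i+}}$, is the collection of these ways.

Hence

$$L_c(\beta) = \prod_{i=1}^{m} \frac{\exp\left( \gamma_i \sum_{j=1}^{n_i} y_{ij} + \sum_{j=1}^{n_i} x_{ij} y_{ij} \beta \right)}{\sum_{l=1}^{\binom{n_i}{y_{i+}}} \exp\left( \gamma_i \sum_{j=1}^{n_i} y_{ij} + \sum_{k=1}^{n_i} x_{ik} y^l_{ik} \beta \right)} = \prod_{i=1}^{m} \frac{\exp\left( \sum_{j=1}^{n_i} x_{ij} y_{ij} \beta \right)}{\sum_{l=1}^{\binom{n_i}{y_{i+}}} \exp\left( \sum_{k=1}^{n_i} x_{ik} y^l_{ik} \beta \right)}.$$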

Notes

Can be computationally expensive to evaluate the likelihood if $n_i$ is large, e.g. if $n_i = 20$ and $y_{i+} = 10$, $\binom{n_i}{y_{i+}} = 184{,}756$.

There is no contribution to the conditional likelihood from individuals:
With $n_i = 1$.
With $y_{i+} = 0$ or $y_{i+} = n_i$.
For those covariates with $x_{i1} = \ldots = x_{in_i} = x_i$.

The conditional likelihood estimates $\beta$'s that are associated with within-individual covariates. If a covariate only varies between individuals, then it cannot be estimated using conditional likelihood. For covariates that vary both between and within individuals, only the within-individual contrasts are used.

The similarity to Cox's partial likelihood may be exploited to carry out computation.

We have not made a distributional assumption for the $\gamma_i$'s!

Examples:

If $n_i = 3$ and $y_i = (0, 0, 1)$ so that $y_{i+} = 1$ then $y^1_i = (1, 0, 0)$, $y^2_i = (0, 1, 0)$, $y^3_i = (0, 0, 1)$, and the contribution to the conditional likelihood is

$$\frac{\exp(x_{i3}\beta)}{\exp(x_{i1}\beta) + \exp(x_{i2}\beta) + \exp(x_{i3}\beta)}.$$

If $n_i = 3$ and $y_i = (1, 0, 1)$ so that $y_{i+} = 2$ then $y^1_i = (1, 1, 0)$, $y^2_i = (1, 0, 1)$, $y^3_i = (0, 1, 1)$, and the contribution to the conditional likelihood is

$$\frac{\exp(x_{i1}\beta + x_{i3}\beta)}{\exp(x_{i1}\beta + x_{i2}\beta) + \exp(x_{i1}\beta + x_{i3}\beta) + \exp(x_{i2}\beta + x_{i3}\beta)}.$$
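The two examples can be reproduced by brute-force enumeration of the arrangements. The following minimal base R sketch assumes a single scalar covariate; the function name `cond_lik_contrib` and the values of `x` and `beta` are illustrative assumptions.

```r
# Conditional likelihood contribution of one individual, enumerating all
# ways of placing y_i+ ones among the n_i visits (scalar covariate assumed).
cond_lik_contrib <- function(y, x, beta) {
  n <- length(y); s <- sum(y)
  if (s == 0 || s == n) return(1)        # constant in beta: no information
  idx   <- combn(n, s)                   # each column: positions of the ones
  denom <- sum(apply(idx, 2, function(pos) exp(sum(x[pos]) * beta)))
  exp(sum(x * y) * beta) / denom
}

x    <- c(0.2, 1.0, 1.5)                 # assumed covariate values
beta <- 0.7                              # assumed coefficient

c1 <- cond_lik_contrib(c(0, 0, 1), x, beta)   # first example, y_i+ = 1
c2 <- cond_lik_contrib(c(1, 0, 1), x, beta)   # second example, y_i+ = 2

# a covariate that is constant within the individual drops out
c_const <- cond_lik_contrib(c(1, 0, 1), c(2, 2, 2), beta)

n_terms <- choose(20, 10)                # terms in the sum when n_i = 20, y_i+ = 10
print(c(c1 = c1, c2 = c2, c_const = c_const, n_terms = n_terms))
```

Enumeration with `combn` is only practical for small $n_i$; for larger clusters a recursive computation of the denominator, or software for stratified Cox partial likelihood, would normally be used instead.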
