GENERAL REGRESSION MODELS
Jon Wakefield, Stat/Biostat 571 (2009)

We consider the class of Generalized Linear Mixed Models (GLMMs) and non-linear mixed effects models (NLMEMs). In this chapter we will again consider both a conditional approach to modeling, via the introduction of random effects, and a marginal approach using GEEs. Likelihood and Bayesian methods will be used for inference in the conditional approach. First we briefly review generalized linear models and non-linear models for independent data.

Generalized Linear Models

GLMs provide a very useful extension to the linear model class. GLMs specify stochastic and deterministic components: the responses Y_i follow an exponential family, so that the distribution is of the form

    p(y_i | θ_i, α) = exp( {y_i θ_i - b(θ_i)}/α + c(y_i, α) ),    (41)

where θ_i and α are scalars. This is sometimes referred to as a linear or natural exponential family. It is straightforward to show that

    E[Y_i | θ_i, α] = µ_i = b'(θ_i)   and   var(Y_i | θ_i, α) = b''(θ_i) α,

for i = 1, ..., n, with cov(Y_i, Y_j | θ_i, θ_j, α) = 0 for i ≠ j.

The link function g(·) provides the connection between µ = E[Y | θ, α] and the linear predictor xβ, via g(µ) = xβ, where x is a (k+1) × 1 vector of explanatory variables (including a 1 for the intercept) and β is a (k+1) × 1 vector of regression parameters.

If α is known, this is a one-parameter (natural) exponential family model and there is a (k+1)-dimensional sufficient statistic for β. If α is unknown, the distribution may or may not be a two-parameter exponential family model.

Likelihood Inference

We now derive the score vector and information matrix. For an independent sample from the exponential family (41), we have

    l(θ) = Σ_{i=1}^n [ {y_i θ_i - b(θ_i)}/α + c(y_i, α) ] = Σ_{i=1}^n l_i(θ),

where θ = [θ_1(β), ..., θ_n(β)] is the vector of canonical parameters. Using the chain rule we may write

    S(β) = ∂l/∂β = Σ_{i=1}^n (∂l_i/∂θ_i)(dθ_i/dµ_i)(∂µ_i/∂β) = Σ_{i=1}^n [ {Y_i - b'(θ_i)}/α ] V_i^{-1} (∂µ_i/∂β),

where V_i = var(Y_i | β)/α and d²b/dθ_i² = dµ_i/dθ_i = V_i.

Hence

    S(β) = Σ_{i=1}^n (∂µ_i/∂β)^T {Y_i - E[Y_i | µ_i]} / var(Y_i | µ_i) = D^T V^{-1} {Y - µ(β)},    (42)

where D is the n × p matrix with elements ∂µ_i/∂β_j, i = 1, ..., n, j = 1, ..., p, and V is the n × n diagonal matrix with i-th diagonal element var(Y_i | µ_i).

The MLE has asymptotic distribution

    I_n(β)^{1/2} (β̂_n - β) →_d N_p(0, I_p),

where

    I_n(β) = E[S(β) S(β)^T] = D^T V^{-1} D.

In practice we use

    I_n(β̂) = D̂^T V̂^{-1} D̂,

where V̂ and D̂ are V and D evaluated at β̂. Hence an estimator β̂ defined through S(β̂) = 0 will be consistent so long as the mean function is appropriate. The variance of the estimator is appropriately estimated if the second moment is correctly specified.
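Solving S(β̂) = 0 is conveniently done by Fisher scoring, iterating β ← β + I_n(β)^{-1} S(β). Below is a minimal R sketch for a Poisson GLM with log link, on simulated data chosen purely for illustration; for the canonical link this coincides with the iteratively reweighted least squares algorithm used by glm.

# Fisher scoring for a Poisson GLM with log link: solves
# S(beta) = D^T V^{-1} (y - mu) = 0 by iterating beta <- beta + I^{-1} S.
set.seed(1)
n <- 100
x <- cbind(1, runif(n))               # n x 2 design matrix (intercept + covariate)
beta.true <- c(0.5, 1)
y <- rpois(n, exp(x %*% beta.true))   # simulated responses
beta <- c(0, 0)
for (iter in 1:25) {
  mu <- as.vector(exp(x %*% beta))    # mean; for the Poisson, var(Y_i) = mu_i
  D  <- x * mu                        # D has elements dmu_i/dbeta_j = mu_i x_ij
  V  <- diag(mu)                      # diagonal variance matrix
  score <- t(D) %*% solve(V, y - mu)  # S(beta) = D^T V^{-1} (y - mu)
  info  <- t(D) %*% solve(V, D)       # I(beta) = D^T V^{-1} D
  beta  <- as.vector(beta + solve(info, score))
}
cbind(beta, coef(glm(y ~ x[, 2], family = poisson)))  # the two fits should agree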

Nonlinear Regression Models

Consider models of the form

    Y_i = µ_i(β) + ε_i,  i = 1, ..., n,

where µ_i(β) = µ(x_i, β) is nonlinear in x_i, β is assumed to be of dimension k, and E[ε_i | µ_i] = 0, var(ε_i | µ_i) = σ² f(µ_i), with cov(ε_i, ε_j | µ_i, µ_j) = 0. Such models are often used for positive responses, and if such data are modeled on the original scale it is common to find that the variance is of the form f(µ) = µ or f(µ) = µ². An alternative approach that is appropriate in the case of f(µ) = µ² is to log transform the responses and then assume constant errors.

Likelihood Inference

To obtain the likelihood function the probability model for the data must be fully specified. Assume

    Y_i | β, σ² ~ind N{µ_i(β), σ² µ_i(β)^r},  for i = 1, ..., n,

and known r, to give the log-likelihood

    l(β, σ) = -n log σ - (r/2) Σ_{i=1}^n log µ_i(β) - (1/2σ²) Σ_{i=1}^n {Y_i - µ_i(β)}²/µ_i(β)^r.

Differentiation with respect to β and σ yields the score equations

    S_1(β, σ) = -(r/2) Σ_{i=1}^n [1/µ_i(β)] ∂µ_i/∂β + (1/σ²) Σ_{i=1}^n [{Y_i - µ_i(β)}/µ_i(β)^r] ∂µ_i/∂β + (r/2σ²) Σ_{i=1}^n [{Y_i - µ_i(β)}²/µ_i(β)^{r+1}] ∂µ_i/∂β,

    S_2(β, σ) = -n/σ + (1/σ³) Σ_{i=1}^n {Y_i - µ_i(β)}²/µ_i(β)^r.

Notice that we have a pair of quadratic estimating functions here, and if the first two moments are correctly specified then E[S_1] = 0 and E[S_2] = 0.
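As a concrete illustration, the log-likelihood above can be maximized numerically. The R sketch below assumes an illustrative exponential-decay mean function and r = 2 (neither choice is from the notes), with σ log-parameterized to keep it positive:

# ML fit of Y_i ~ N(mu_i(beta), sigma^2 mu_i(beta)^r) with r = 2, for the
# illustrative mean function mu(x, beta) = beta1 * exp(-beta2 * x).
set.seed(2)
x <- seq(0.5, 10, length = 50)
mu.fn <- function(beta, x) beta[1] * exp(-beta[2] * x)
y <- rnorm(50, mu.fn(c(10, 0.4), x), sd = 0.1 * mu.fn(c(10, 0.4), x))

negll <- function(th, r = 2) {
  beta <- th[1:2]; sigma <- exp(th[3])   # log-parameterize sigma > 0
  mu <- mu.fn(beta, x)
  if (any(mu <= 0)) return(Inf)
  # minus the log-likelihood given in the text
  sum(log(sigma) + (r/2) * log(mu) + (y - mu)^2 / (2 * sigma^2 * mu^r))
}
fit <- optim(c(5, 0.1, 0), negll)
c(beta = fit$par[1:2], sigma = exp(fit$par[3]))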

Under the usual regularity conditions,

    I(θ)^{1/2} (θ̂ - θ) →_d N_{k+1}(0, I),

where θ = (β, σ) and I(θ) is Fisher's expected information. In the case of r = 0 we obtain

    l(β, σ) = -n log σ - (1/2σ²) Σ_{i=1}^n {Y_i - µ_i(β)}²
    S_1(β, σ) = (1/σ²) Σ_{i=1}^n {Y_i - µ_i(β)} ∂µ_i/∂β
    S_2(β, σ) = -n/σ + (1/σ³) Σ_{i=1}^n {Y_i - µ_i(β)}²
    I_11 = E[-∂S_1/∂β] = (1/σ²) Σ_{i=1}^n (∂µ_i/∂β)(∂µ_i/∂β)^T
    I_12 = E[-∂S_1/∂σ] = 0^T
    I_21 = E[-∂S_2/∂β] = 0
    I_22 = E[-∂S_2/∂σ] = 2n/σ²

Identifiability

For many nonlinear models identifiability is an issue. Specifically, the same curve may be obtained with different parameter values. For example, consider the sum-of-exponentials model

    µ(x, β) = β_0 exp(-x β_1) + β_2 exp(-x β_3),

where β = (β_0, β_1, β_2, β_3) and β_j > 0, j = 0, 1, 2, 3. The same curve results under the parameter sets (β_0, β_1, β_2, β_3) and (β_2, β_3, β_0, β_1), and so we have non-identifiability. A solution to this problem, and to ensure that the parameters can only take admissible values, is to work with the set

    γ = (log β_0, log(β_3 - β_1), log β_2, log β_1),

which constrains β_3 > β_1 > 0.
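A two-line check of the flip-flop in R, with illustrative parameter values of my choosing:

# The sum-of-exponentials flip-flop: swapped parameter pairs give one curve.
mu <- function(beta, x) beta[1] * exp(-x * beta[2]) + beta[3] * exp(-x * beta[4])
x  <- seq(0, 10, by = 0.1)
b1 <- c(2, 0.3, 5, 1.5)          # (beta0, beta1, beta2, beta3)
b2 <- c(5, 1.5, 2, 0.3)          # the swapped set (beta2, beta3, beta0, beta1)
max(abs(mu(b1, x) - mu(b2, x)))  # 0: the two curves are identical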

Example of a non-linear model: One-compartment open model

Let w_i(x) represent the amount of drug in compartment i, i = 0, 1, at time x. Assume:

    dw_0/dx = -k_a w_0
    dw_1/dx = k_a w_0 - k_e w_1

where k_a is the absorption rate and k_e is the elimination rate. This leads to

    µ(x) = [D k_a / {V (k_a - k_e)}] {exp(-k_e x) - exp(-k_a x)}.

Note: non-identifiability (flip-flop). Assume

    Y_i | β, σ² ~ LogNormal{µ(x_i), σ²}.

This mimics assay precision with a constant coefficient of variation.

Pharmacokinetic Data Analysis

Concentration (y) as a function of time (x), obtained from a new-born baby following the administration of a dose of Theophylline.

[Figure: observed concentration (mg/liter) versus time (hours), with the accompanying data table.]
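The mean function is easy to code, and the flip-flop can be exhibited directly: swapping k_a and k_e while rescaling V leaves the curve unchanged. The dose, volume and rate constants below are illustrative values, not those of the Theophylline data:

# One-compartment open model mean:
# mu(x) = D ka / (V (ka - ke)) * (exp(-ke x) - exp(-ka x))
onecomp <- function(x, D, V, ka, ke) {
  D * ka / (V * (ka - ke)) * (exp(-ke * x) - exp(-ka * x))
}
x <- seq(0, 24, by = 0.25)
plot(x, onecomp(x, D = 25, V = 10, ka = 1.5, ke = 0.1), type = "l",
     xlab = "Time (hours)", ylab = "Concentration")
# Flip-flop: swap ka and ke and rescale V by ke/ka; the curve is unchanged.
max(abs(onecomp(x, 25, 10, ka = 1.5, ke = 0.1) -
        onecomp(x, 25, 10 * 0.1 / 1.5, ka = 0.1, ke = 1.5)))  # ~0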

Generalized Linear Mixed Models

A GLMM is defined by:

1. Random Component: Y_ij | θ_ij, α ~ p(·), where p(·) is a member of the exponential family, that is,

    p(y_ij | θ_ij, α) = exp[ {y_ij θ_ij - b(θ_ij)}/a(α) + c(y_ij, α) ],

for i = 1, ..., m units and j = 1, ..., n_i measurements per unit.

2. Systematic Component: If µ_ij = E[Y_ij | θ_ij, α], then we have a link function g(·), with

    g(µ_ij) = x_ij β + z_ij b_i,

so that we have introduced random effects into the linear predictor.

The above defines the conditional part of the model. The random effects are then assigned a distribution, and in a GLMM this is assumed to be b_i ~iid N(0, D). We also have var(Y_ij | θ_ij, α) = α v(µ_ij).

Marginal Moments

Mean:

    E[Y_ij] = E{E[Y_ij | b_i]} = E[µ_ij] = E_b[g^{-1}(x_ij β + z_ij b_i)].

Variance:

    var(Y_ij) = E[var(Y_ij | b_i)] + var(E[Y_ij | b_i]) = α E_b[v{g^{-1}(x_ij β + z_ij b_i)}] + var_b[g^{-1}(x_ij β + z_ij b_i)].

Covariance:

    cov(Y_ij, Y_ik) = E[cov(Y_ij, Y_ik | b_i)] + cov(E[Y_ij | b_i], E[Y_ik | b_i]) = cov{g^{-1}(x_ij β + z_ij b_i), g^{-1}(x_ik β + z_ik b_i)},

for j ≠ k, due to shared random effects, and cov(Y_ij, Y_lk) = 0 for i ≠ l, as there are no shared random effects.

Example: Log-Linear Regression for Seizure Data

Data on seizures were collected on 59 epileptics. For each patient the number of epileptic seizures was recorded during a baseline period of eight weeks, after which patients were randomized to treatment with the anti-epileptic drug progabide, or to placebo. The number of seizures was then recorded in four consecutive two-week periods. The age of the patient was also available. Figures 18-20 contain summaries.

[Figure 18: Number of seizures for selected individuals over time, placebo group.]

[Figure 19: Number of seizures for selected individuals over time, progabide group.]

[Figure 20: Summaries for seizure data: baseline counts by treatment (0 = placebo, 1 = progabide); average seizures versus time; variance versus mean of seizures; seizures by age (years).]

A model for the seizure data

Let

    Y_ij  = number of seizures on patient i at occasion j,
    t_ij  = observation period on patient i at occasion j,
    x_i1  = 0/1 if patient i was assigned placebo/progabide,
    x_ij2 = 0/1 if j = 0 / j = 1, 2, 3, 4,

with t_ij = 8 if j = 0 and t_ij = 2 if j = 1, 2, 3, 4, for i = 1, ..., 59. The question of primary scientific interest here is whether progabide reduces the number of seizures. A marginal mean model is given by

    E[Y_ij] = t_ij exp(β_0 + β_1 x_i1 + β_2 x_ij2 + β_3 x_i1 x_ij2).

    Group       j = 0 period    j = 1, 2, 3, 4 period
    Placebo     β_0             β_0 + β_2
    Progabide   β_0 + β_1       β_0 + β_1 + β_2 + β_3

    Table 1: Parameter interpretation.

Precise definitions:

- exp(β_0) is the expected number of seizures in the placebo group in time period 0;
- exp(β_1) is the ratio of the expected seizure rate in the progabide group, compared to the placebo group, in time period 0;
- exp(β_2) is the ratio of the expected seizure rate at times j = 1, 2, 3, 4, as compared to j = 0, in the placebo group;
- exp(β_3) is the ratio of the expected seizure rates in the progabide group in the j = 1, 2, 3, 4 period, as compared to the placebo group, in the same period.

Hence exp(β_3) is the parameter of interest. More colloquially:

    β_0  INTERCEPT
    β_1  BASELINE TREATMENT GROUP EFFECT
    β_2  PERIOD EFFECT
    β_3  TREATMENT × PERIOD EFFECT

Mixed Effects Model for Seizure Data

Stage 1: Y_ij | β, b_i ~ind Poisson(µ_ij), with

    g(µ_ij) = log µ_ij = log t_ij + x_ij β + b_i,

where x_ij β = β_0 + β_1 x_i1 + β_2 x_ij2 + β_3 x_i1 x_ij2. Hence

    E[Y_ij | b_i] = µ_ij = t_ij exp(x_ij β + b_i),   var(Y_ij | b_i) = µ_ij.

Stage 2: b_i ~iid N(0, σ²).

The marginal mean is given by E[Y_ij] = t_ij exp(x_ij β + σ²/2), and the marginal median by t_ij exp(x_ij β).

The marginal variance is given by

    var(Y_ij) = E[µ_ij] + var(µ_ij) = E[Y_ij]{1 + E[Y_ij](e^{σ²} - 1)} = E[Y_ij](1 + E[Y_ij] κ),

where κ = e^{σ²} - 1 > 0, illustrating excess-Poisson variation, which increases as σ² increases. For the marginal covariance,

    cov(Y_ij, Y_ik) = cov{t_ij exp(x_ij β + b_i), t_ik exp(x_ik β + b_i)} = t_ij t_ik exp(x_ij β + x_ik β) e^{σ²}(e^{σ²} - 1) = E[Y_ij] E[Y_ik] κ.

Hence for individual i we have variance-covariance matrix

    [ µ_i1 + µ_i1² κ    µ_i1 µ_i2 κ       ...   µ_i1 µ_in_i κ       ]
    [ µ_i2 µ_i1 κ       µ_i2 + µ_i2² κ    ...   µ_i2 µ_in_i κ       ]
    [ ...                                                           ]
    [ µ_in_i µ_i1 κ     µ_in_i µ_i2 κ     ...   µ_in_i + µ_in_i² κ  ],

where κ = e^{σ²} - 1 > 0. A deficiency of this model is that we only have a single parameter (σ²) to control both excess-Poisson variability and dependence.
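These moment formulas are easily checked by simulation. A small sketch, for a single observation with illustrative values of t_ij, x_ij β and σ² (all my choices):

# Monte Carlo check of the marginal moments of the Poisson-lognormal model.
set.seed(3)
t.ij <- 2; xb <- 0.5; sigma2 <- 0.25
b <- rnorm(1e6, 0, sqrt(sigma2))
y <- rpois(1e6, t.ij * exp(xb + b))
kappa <- exp(sigma2) - 1
m <- t.ij * exp(xb + sigma2/2)        # theoretical marginal mean
c(mean(y), m)                         # simulated versus theoretical E[Y_ij]
c(var(y), m * (1 + m * kappa))        # simulated versus theoretical var(Y_ij)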

Likelihood Inference

In general there are two approaches to inference from a likelihood perspective:

1. Carry out conditional inference in order to eliminate the random effects.
2. Make a distributional assumption for b_i, and then carry out likelihood inference (using some form of approximation to evaluate the required integrals).

We first consider the conditional likelihood approach.

Conditional Likelihood

Recall the definition of conditional likelihood. Suppose the distribution of the data may be factored as

    p(y | β, γ) = h(y) p(t_1, t_2 | β, γ) = h(y) p(t_1 | t_2, β) p(t_2 | β, γ),

where we choose to ignore the second term and consider the conditional likelihood

    L_c(β) = p(t_1 | t_2, β) = p(t_1, t_2 | β, γ) / p(t_2 | β, γ).

Maximizing the conditional likelihood yields an estimator β̂_c with the usual properties, for example

    I_c(β)^{1/2} (β̂_c - β) →_d N(0, I),

where I_c(β) is the expected information derived from the conditional likelihood.

Conditional Likelihood for GLMMs

In the context of GLMMs we have

    L_c(β) = ∏_{i=1}^m p(t_1i | t_2i, β) = ∏_{i=1}^m p(t_1i, t_2i | β, b_i) / p(t_2i | β, b_i),

where

    p(t_1i, t_2i | β, b_i) ∝ p(y_i | β, b_i)

and

    p(t_2i | β, b_i) = Σ_{u_i ∈ S_2i} p(u_i, t_2i | β, b_i),

and S_2i is the set of values of y_i such that T_2i = t_2i, a set of disjoint events. The different notation is to emphasize that T_1i takes on values different to t_1i.

For simplicity we assume the canonical link function, g(µ_ij) = θ_ij = x_ij β + z_ij b_i, and assume α = 1. Viewing b_i as fixed effects we have the likelihood

    L(β, b) = exp{ Σ_{i=1}^m Σ_{j=1}^{n_i} [ y_ij x_ij β + y_ij z_ij b_i - b(x_ij β + z_ij b_i) ] },

so that

    t_1 = Σ_{i=1}^m t_1i = Σ_{i=1}^m Σ_{j=1}^{n_i} y_ij x_ij   and   t_2i = Σ_{j=1}^{n_i} y_ij z_ij.

We emphasize that no distribution has been specified for the b_i; they are being viewed as fixed effects.

Conditional Likelihood for the Poisson GLMM

Assume for simplicity that z_ij b_i = b_i, so that we have the random intercepts only model. Also, in an obvious change of notation,

    x_ij β + x_i β_1 + b_i = x_ij β + γ_i,

so that β are the regression parameters associated with covariates that change within an individual. Then

    p(y | β, γ) = ∏_{i=1}^m p(y_i | β, γ_i) = ∏_{i=1}^m exp( -Σ_{j=1}^{n_i} µ_ij + Σ_{j=1}^{n_i} y_ij log µ_ij ) / ∏_{j=1}^{n_i} y_ij!
                = c_1 ∏_{i=1}^m exp( -µ_i+ + y_i+ γ_i + Σ_{j=1}^{n_i} y_ij log{t_ij exp(x_ij β)} ),

where c_1 = 1/∏_i ∏_j y_ij! and µ_i+ = Σ_{j=1}^{n_i} t_ij exp(x_ij β + γ_i).

In this case the distribution of the conditioning statistic is straightforward:

    y_i+ | β, γ_i ~ Poisson(µ_i+),

so that

    p(y_1+, ..., y_m+ | β, γ) = c_2 ∏_{i=1}^m exp( -µ_i+ + y_i+ log µ_i+ )
                              = c_2 ∏_{i=1}^m exp( -µ_i+ + y_i+ γ_i + y_i+ log[ Σ_{j=1}^{n_i} t_ij exp(x_ij β) ] ),

where c_2 = 1/∏_i y_i+!. Hence

    p(y | y_1+, ..., y_m+, β) = p(y | β, γ) / p(y_1+, ..., y_m+ | β, γ),

which is given by

    [ c_1 ∏_{i=1}^m exp( -µ_i+ + y_i+ γ_i + Σ_{j=1}^{n_i} y_ij log{t_ij exp(x_ij β)} ) ] / [ c_2 ∏_{i=1}^m exp( -µ_i+ + y_i+ γ_i + y_i+ log Σ_{j=1}^{n_i} t_ij exp(x_ij β) ) ].

After simplification:

    p(y | y_1+, ..., y_m+, β) = (c_1/c_2) ∏_{i=1}^m [ ∏_{j=1}^{n_i} {t_ij exp(x_ij β)}^{y_ij} ] / [ Σ_{l=1}^{n_i} t_il exp(x_il β) ]^{y_i+}
                              = ∏_{i=1}^m ( y_i+ choose y_i1 ... y_in_i ) ∏_{j=1}^{n_i} [ t_ij exp(x_ij β) / Σ_{l=1}^{n_i} t_il exp(x_il β) ]^{y_ij},

which is a multinomial likelihood (we have conditioned a set of Poisson counts on their total, so obvious!):

    y_i | y_i+, β ~ Mult_{n_i}(y_i+, π_i),

where π_i^T = (π_i1, ..., π_in_i) and

    π_ij = t_ij exp(x_ij β) / Σ_{l=1}^{n_i} t_il exp(x_il β).

Conditional Likelihood for the Seizure Data

Recall:

    Y_ij  = number of seizures on patient i at occasion j,
    t_ij  = observation period on patient i at occasion j,
    x_i1  = 0/1 if patient i was assigned placebo/progabide,
    x_ij2 = 0/1 if j = 0 / j = 1, 2, 3, 4,

with t_ij = 8 if j = 0 and t_ij = 2 if j = 1, 2, 3, 4, i = 1, ..., 59. A log-linear random intercept model is given by

    log E[Y_ij | b_i] = log t_ij + β_0 + β_1 x_i1 + β_2 x_ij2 + β_3 x_i1 x_ij2 + b_i.

Precise definitions:

- exp(β_0) is the expected number of seizures for a typical individual in the placebo group in time period 0;
- exp(β_1) is the ratio of the expected seizure rate in the progabide group, compared to the placebo group, for a typical individual, i.e. one with b_i = 0, in time period 0;
- exp(β_2) is the ratio of the expected seizure rate at times j = 1, 2, 3, 4, as compared to j = 0, for a typical individual in the placebo group;
- exp(β_3) is the ratio of the expected seizure rates in the progabide group in the j = 1, 2, 3, 4 period, as compared to the placebo group, in the same period, for a typical individual.

Hence exp(β_3) is the parameter of interest.

In the conditional likelihood notation:

    log E[Y_ij | γ_i] = log t_ij + γ_i + β_2 x_ij2 + β_3 x_i1 x_ij2,

where γ_i = β_0 + β_1 x_i1 + b_i, so that we cannot estimate β_1, which is not a parameter of primary interest. Since x_i12 = x_i22 = x_i32 = x_i42 = 1 and t_i0 = 8 = Σ_{j=1}^4 t_ij, we effectively have two observation periods, which we label (slightly abusing our previous notation) j = 0, 1. Let Y_i1 = Σ_{j=1}^4 Y_ij.

For the placebo group:

    Y_i1 | Y_i+ ~ind Binomial(Y_i+, π_0), for the 28 placebo patients, with

    π_0 = exp(β_2) / {1 + exp(β_2)}.

For the progabide group:

    Y_i1 | Y_i+ ~ind Binomial(Y_i+, π_1), for the 31 progabide patients, where

    π_1 = exp(β_2 + β_3) / {1 + exp(β_2 + β_3)}.

This model is straightforward to fit in R (with y1 the count over the four treatment periods and y0 the baseline count):

> xcond <- c(rep(0,28),rep(1,31))
> condmod <- glm(cbind(y1,y0)~xcond,family=binomial)
> summary(condmod)

Call:
glm(formula = cbind(y1, y0) ~ xcond, family = binomial)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) [values lost in transcription]      *
xcond       [values lost in transcription]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: [value lost] on 58 degrees of freedom
Residual deviance: [value lost] on 57 degrees of freedom

Hence the treatment effect is exp(-0.10) = 0.90, so that the rate of seizures is estimated as 10% less in the progabide group, though this change is not statistically significant.

Conditional Likelihood for the Seizure Data

The overall fit of the random intercept model is poor (the residual deviance greatly exceeds its 57 degrees of freedom). One possibility is to extend the model to allow a random slope for the effect of the treatment period x_ij2, i.e. β_2i = β_2 + b_2i, but a conditional likelihood approach for this model will condition away the information relevant for estimation of β_3. We will examine such a model using a mixed effects approach.

Likelihood Inference in the Mixed Effects Model

As with the linear mixed effects model (LMEM), we maximize L(β, α), where α denotes the variance components in D, and

    L(β, α) = ∏_{i=1}^m ∫ p(y_i | β, b_i) p(b_i | α) db_i.

As with the NLMEM, the required integrals are not available in closed form, and so some sort of analytical or numerical approximation is required.

Example: Log-linear Poisson regression GLMM

With a single random effect we have α = σ².

    L(β, α) = ∏_{i=1}^m ∫ [ ∏_{j=1}^{n_i} exp(-µ_ij) µ_ij^{y_ij} / y_ij! ] (2πσ²)^{-1/2} exp( -b_i²/(2σ²) ) db_i
            ∝ ∏_{i=1}^m exp( Σ_{j=1}^{n_i} y_ij x_ij β ) ∫ exp( -e^{b_i} Σ_{j=1}^{n_i} e^{x_ij β} + y_i+ b_i ) (2πσ²)^{-1/2} exp( -b_i²/(2σ²) ) db_i
            = ∏_{i=1}^m exp( Σ_{j=1}^{n_i} y_ij x_ij β ) ∫ h(b_i) exp{-b_i²/(2σ²)} / (2πσ²)^{1/2} db_i,

with h(b_i) = exp( -e^{b_i} Σ_j e^{x_ij β} + y_i+ b_i ): an integral with respect to a normal random variable (which is analytically intractable).

Integration in the GLMM

As with the NLMEM there are a number of possible approaches for integrating out the random effects, including:

- Analytical approximations, including Laplace, and the closely-related penalized quasi-likelihood approach.
- Gaussian quadrature.
- Importance sampling Monte Carlo.

Overview of Integration Techniques

We describe a number of generic integration techniques, in particular:

- Laplace approximation (an analytical approximation).
- Quadrature (numerical integration).
- Importance sampling (a Monte Carlo method).

Before the MCMC revolution these techniques were used in a Bayesian context.

Laplace Approximation

Let

    I = ∫ exp{n g(θ)} dθ

denote a generic integral of interest, and suppose m is the maximum of g(·). We have

    n g(θ) = n Σ_{k=0}^∞ [(θ - m)^k / k!] g^(k)(m),

where g^(k)(m) represents the k-th derivative of g evaluated at m. Hence

    I = ∫ exp{ n Σ_{k=0}^∞ (θ - m)^k g^(k)(m)/k! } dθ
      ≈ e^{n g(m)} ∫ exp{ -(θ - m)² / (2/[-n g^(2)(m)]) } dθ
      = e^{n g(m)} (2πv)^{1/2} n^{-1/2},

where v = -1/g^(2)(m), and we have ignored terms in cubics or greater in the Taylor series.
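As a numerical illustration, take g(θ) = log θ - θ on (0, ∞), for which I = ∫ θ^n e^{-nθ} dθ = Γ(n+1)/n^{n+1} exactly; the choice of g is mine, not from the notes:

# Laplace approximation to I = int exp{n g(theta)} dtheta, g(theta) = log(theta) - theta.
g <- function(th) log(th) - th
n <- 20
m <- optimize(g, c(0.01, 10), maximum = TRUE)$maximum   # mode; here m = 1
v <- -1 / (-1 / m^2)                                    # v = -1/g''(m); g''(th) = -1/th^2
laplace <- exp(n * g(m)) * sqrt(2 * pi * v / n)
exact   <- integrate(function(th) exp(n * g(th)), 0, Inf)$value  # = gamma(n+1)/n^(n+1)
c(laplace = laplace, exact = exact)   # close agreement for moderate n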

Gaussian Quadrature

A general method of integration is provided by quadrature (numerical integration), in which an integral

    I = ∫ f(u) du

is approximated by

    Î = Σ_{i=1}^{n_w} f(u_i) w_i,

for design points u_1, ..., u_{n_w} and weights w_1, ..., w_{n_w}. Different choices of (u_i, w_i) lead to different integration rules. In mixed model applications we have integrals with respect to a normal density; Gauss-Hermite quadrature is designed for problems of this type. Specifically, it provides exact integration of

    ∫ g(u) e^{-u²} du,

where g(·) is a polynomial of degree 2n_w - 1.

The design points are the zeroes of the so-called Hermite polynomials. Specifically, for a rule of n_w points, u_i is the i-th zero of H_{n_w}(u), the Hermite polynomial of degree n_w, and

    w_i = 2^{n_w - 1} n_w! √π / ( n_w² [H_{n_w - 1}(u_i)]² ).

Now suppose θ is two-dimensional and we wish to evaluate

    I = ∫∫ f(θ) dθ = ∫ [ ∫ f(θ_1, θ_2) dθ_2 ] dθ_1 = ∫ f_1(θ_1) dθ_1,

where f_1(θ_1) = ∫ f(θ_1, θ_2) dθ_2. Now form

    Î = Σ_{i=1}^{m_1} w_i f̂_1(θ_1i),   where   f̂_1(θ_1i) = Σ_{j=1}^{m_2} w_j* f(θ_1i, θ_2j)

for a rule {θ_2j, w_j*} in the second dimension. Then we have

    Î = Σ_{i=1}^{m_1} Σ_{j=1}^{m_2} w_i w_j* f(θ_1i, θ_2j),

which is known as the Cartesian product.

Scaling and reparameterization

To implement this method the function must be centered and scaled in some way; for example, we could center and scale by the current estimates of the mean, m, and variance-covariance matrix, V; this is known as adaptive quadrature. We then form X = L^{-1}(θ - m), where L L^T = V, and carry out integration in the space of X. There is no guarantee that the most efficient rule is obtained by scaling in terms of the posterior mean and variance, but we note that the best normal approximation to a density (in terms of Kullback-Leibler divergence) has the same mean and variance.

Gauss-Hermite Code in R

Nodes and weights for n_w = 4 (using the statmod package):

> library(statmod)
> n <- 4
> quad <- gauss.quad(n,kind="hermite")
> quad$nodes
[1] -1.6506801 -0.5246476  0.5246476  1.6506801
> quad$weights
[1] 0.08131284 0.80491409 0.80491409 0.08131284

Nodes and weights for n_w = 5:

> n <- 5
> quad <- gauss.quad(n,kind="hermite")
> quad$nodes
[1] -2.0201829 -0.9585725  0.0000000  0.9585725  2.0201829
> quad$weights
[1] 0.01995324 0.39361932 0.94530872 0.39361932 0.01995324
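A quick check of the rule: for b ~ N(0, σ²), the substitution b = √2 σ u turns E[h(b)] into π^{-1/2} Σ_i w_i h(√2 σ u_i). The sketch below (again assuming statmod) recovers E[e^b] = e^{σ²/2}:

# Use the 5-point Gauss-Hermite rule to approximate E[exp(b)], b ~ N(0, sigma^2).
library(statmod)
sigma <- 0.8
quad  <- gauss.quad(5, kind = "hermite")
approx <- sum(quad$weights * exp(sqrt(2) * sigma * quad$nodes)) / sqrt(pi)
c(approx = approx, exact = exp(sigma^2 / 2))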

Importance Sampling

Rather than deterministically selecting points, we may randomly generate points from some density h(θ). We have

    I = ∫ f(θ) dθ = ∫ [f(θ)/h(θ)] h(θ) dθ = E[w(θ)],

where w(θ) = f(θ)/h(θ). Hence we have the obvious estimator

    Î = (1/m) Σ_{i=1}^m w(θ_i),

where θ_i ~iid h(·). We have E[Î] = I and V = var(Î) = (1/m) var{w(θ)}. From this expression it is clear that a good h(·) produces an approximately constant w(θ).

We may estimate V via

    V̂ = (1/m) [ (1/m) Σ_{i=1}^m f²(θ_i)/h²(θ_i) - Î² ],

and (appealing to the central limit theorem) Î is asymptotically normal, and so a 100(1 - α)% confidence interval is given by Î ± z_{1-α/2} V̂^{1/2}, where z_{1-α/2} is the 1 - α/2 point of an N(0, 1) random variable. Hence the accuracy of the approximation may be directly assessed, providing an advantage over analytical approximations and quadrature methods.

Notes on Importance Sampling

We require an h(·) with heavier tails than the integrand. We can carry out importance sampling with any h, but if the tails are lighter we will have an estimator with infinite variance (and hence an inconsistent procedure). Many suggestions for h have been made, including Student t distributions and mixtures of Student t distributions. Iteration may again be used to obtain an estimator with good properties.
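A small R sketch of the estimator, its variance estimate and the resulting confidence interval, for an integrand of my choosing with known answer (the N(0,1) density times e^{θ/2}, so that I = e^{1/8}), using a heavier-tailed Student t_3 proposal:

# Importance sampling with a t_3 proposal; the exact answer is exp(1/8).
set.seed(4)
m <- 1e5
theta <- rt(m, df = 3)                                  # draws from h
w <- dnorm(theta) * exp(theta / 2) / dt(theta, df = 3)  # w = f/h
I.hat <- mean(w)
V.hat <- var(w) / m                                     # estimated var(I.hat)
c(estimate = I.hat, exact = exp(1/8),
  lower = I.hat - 1.96 * sqrt(V.hat), upper = I.hat + 1.96 * sqrt(V.hat))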

Notes on Implementation

If the number of parameters is small, then numerical integration techniques (e.g. quadrature) are highly efficient in terms of the number of function evaluations required. Hence if, for example, obtaining a point on the likelihood surface is computationally expensive (as occurs if a large simulation is required), then such techniques are preferable to Monte Carlo methods.

The method employed will depend on whether it is for a one-off application, in which case ease of implementation is a consideration, or for a great deal of use, in which case an efficient method may be required.

In general it is difficult to assess the accuracy of Laplace/numerical integration techniques. For simulation methods we note that independent samples are ideal for assessing Monte Carlo error, since standard errors on expectations of interest may be simply calculated.

Evans and Swartz (1995, Statistical Science) provide a good review of integration techniques.

Penalized Quasi-Likelihood

Breslow and Clayton (1993) introduced the method of Penalized Quasi-Likelihood (PQL), which was an attempt to extend quasi-likelihood to GLMMs. One justification of the method is a first-order Laplace approximation. PQL is very poor for binary data, but may be OK for binomial and Poisson data (as long as the counts are not too small).

Within the lme4 package the lmer function may be used to fit GLMMs using MLE/REML; the required integrals can be approximated using penalized quasi-likelihood, Laplace, or adaptive Gaussian quadrature.

GLMMs for the Seizure Data

The PQL standard error for β_1 looks off here (it doesn't tie in with later analyses). The adaptive quadrature option is not available for this model.

> library(lme4) # Need Matrix package version
> lmermod <- lmer(y ~ x+x2+x3+(1|ID)+offset(log(time)),family=poisson,
+                 data=seiz,method="PQL")
> summary(lmermod)
Generalized linear mixed model fit using PQL

Random effects:
 Groups Name        Variance Std.Dev.
 ID     (Intercept) [values lost in transcription]

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) [values lost in transcription] < 2e-16 ***
x           [values lost in transcription]
x2          [values lost in transcription] *
x3          [values lost in transcription]

> lmermod2 <- lmer(y ~ x+x2+x3+(1|ID)+offset(log(time)),family=poisson,
+                  data=seiz,method="Laplace")
> summary(lmermod2)
Generalized linear mixed model fit using Laplace

Formula: y ~ x + x2 + x3 + (1 | ID) + offset(log(time))
   Data: seiz
 Family: poisson(log link)
 AIC  BIC  logLik  deviance: [values lost in transcription]

Random effects:
 Groups Name        Variance Std.Dev.
 ID     (Intercept) [values lost in transcription]
# of obs: 295, groups: ID, 59

Estimated scale (compare to 1) 1.674

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) [values lost in transcription] ***
x           [values lost in transcription]
x2          [values lost in transcription] *
x3          [values lost in transcription]

The Laplace approach gives significantly different (and more reliable) estimates.

Random intercepts and slopes

We may also allow the treatment effect to vary between individuals.

> lmermod4 <- lmer(y ~ x+x2+x3+(1+x2|ID)+offset(log(time)),
+                  family=poisson,data=seiz,method="Laplace")
> summary(lmermod4)
Generalized linear mixed model fit using Laplace

Random effects:
 Groups Name        Variance Std.Dev. Corr
 ID     (Intercept) [values lost in transcription]
        x2          [values lost in transcription]
# of obs: 295, groups: ID, 59

Estimated scale (compare to 1) 1.4377

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) [values lost in transcription] ***
x           [values lost in transcription]
x2          [values lost in transcription]
x3          [values lost in transcription] *

Bayesian Inference for GLMMs

A Bayesian approach to inference for a GLMM adds a prior distribution for β, α to the likelihood L(β, α). Again, a proper prior is required for the matrix D. In general a proper prior is not required for β: the exponential family and linear link lead to a likelihood that is well-behaved.

Priors for β and α in the GLMM: Lognormal Priors

It is convenient to specify lognormal priors for positive parameters θ, since one may specify two quantiles of the distribution and directly solve for the two parameters of the prior. In a GLMM we can often specify priors for more meaningful parameters than elements of β. For example, e^β is the relative risk/rate in a log-linear model, and is the odds ratio in a logistic model.

Suppose we wish to specify a lognormal prior for a generic parameter θ. Denote by LN(µ, σ) the lognormal distribution with E[log θ] = µ and var(log θ) = σ², and let θ_1 and θ_2 be the q_1 and q_2 quantiles of this prior. Then

    µ = [ log(θ_1) z_{q_2} - log(θ_2) z_{q_1} ] / (z_{q_2} - z_{q_1}),   σ = [ log(θ_2) - log(θ_1) ] / (z_{q_2} - z_{q_1}).    (43)

As an example, suppose that for θ we believe there is a 50% chance that the relative risk is less than 1 and a 95% chance that it is less than 5; with q_1 = 0.5, θ_1 = 1.0 and q_2 = 0.95, θ_2 = 5.0, we obtain lognormal parameters µ = 0 and σ = log 5/1.645 = 0.98.
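Solving (43) is a two-liner in R; the small function below reproduces the example:

# Lognormal prior parameters (mu, sigma) from two quantiles, following (43).
lnorm.prior <- function(theta1, q1, theta2, q2) {
  z1 <- qnorm(q1); z2 <- qnorm(q2)
  mu    <- (log(theta1) * z2 - log(theta2) * z1) / (z2 - z1)
  sigma <- (log(theta2) - log(theta1)) / (z2 - z1)
  c(mu = mu, sigma = sigma)
}
lnorm.prior(1, 0.5, 5, 0.95)   # mu = 0, sigma = log(5)/1.645 = 0.98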

Gamma Priors

Consider the random intercepts model with b_i ~iid N(0, σ²). It is not straightforward to specify a prior for σ, which represents the standard deviation of the residuals on the linear predictor scale and is consequently not easy to interpret. We specify a gamma prior, Ga(a, b), for the precision τ = 1/σ², with parameters a, b specified a priori. The choice of a gamma distribution is convenient since it produces a marginal distribution for the residuals in closed form. Specifically, the two-stage model

    b_i | σ ~iid N(0, σ²),   τ = σ^{-2} ~ Ga(a, b),

produces a marginal distribution for b_i which is t_d(0, λ²), a Student's t distribution with d = 2a degrees of freedom, location zero, and scale λ² = b/a.

We now consider a log link, in which case the above is equivalent to the residual relative risks following a log t distribution. We specify the range exp(±R) within which the residual relative risks will lie with probability q, and use the relationship ±t^d_{(1+q)/2} λ = ±R, where t^d_q is the q-th quantile of a Student t random variable with d degrees of freedom, to give

    a = d/2,   b = R² d / [ 2 (t^d_{(1+q)/2})² ].

For example, if we assume a priori that the residual relative risks follow a log Student t distribution with 2 degrees of freedom, and that 95% of these risks fall in the interval (0.5, 2.0), then we obtain the prior Ga(1, 0.026). In terms of σ this results in (2.5%, 97.5%) quantiles of (0.084, 1.01), with prior median 0.19. It is important to assess whether the prior allows all reasonable levels of variability in the residual relative risks; in particular, small values should not be excluded. The prior Ga(0.001, 0.001), which has previously been used (e.g. in the WinBUGS manual), should be avoided for this very reason (it corresponds to relative risks which follow a log Student t distribution with 0.002 degrees of freedom).
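The same recipe in R, together with the implied prior quantiles for σ; this reproduces the Ga(1, 0.026) example:

# Gamma prior Ga(a, b) for tau = 1/sigma^2 from (d, R, q): the residual log
# relative risks follow a t_d, with probability q of lying in (-R, R).
gamma.prior <- function(d, R, q) {
  lambda <- R / qt((1 + q) / 2, df = d)   # scale of the marginal t for b_i
  c(a = d / 2, b = (d / 2) * lambda^2)
}
ab <- gamma.prior(d = 2, R = log(2), q = 0.95)
ab                                        # a = 1, b = 0.026
# Implied (2.5%, 50%, 97.5%) prior quantiles for sigma = tau^{-1/2}:
1 / sqrt(qgamma(c(0.975, 0.5, 0.025), shape = ab["a"], rate = ab["b"]))
# 0.084, 0.19, 1.01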

Implementation

Closed-form inference is unavailable, but MCMC is almost as straightforward as in the linear mixed model case. The joint posterior is

    p(β, W, b | y) ∝ ∏_{i=1}^m { p(y_i | β, b_i) p(b_i | W) } × π(β) π(W).

Suppose we have priors:

    β ~ N_{q+1}(β_0, V_0),   W ~ W_{q+1}(r, R^{-1}).

The conditional distributions for τ and W are unchanged from the linear case. There is no closed-form conditional distribution for β, or for b_i, but a Metropolis-Hastings step can be used (or adaptive rejection sampling can be utilized; the conditional is log-concave).

Integrated Nested Laplace Approximation (INLA)

Recently an approach has emerged that combines Laplace approximations and numerical integration in a very efficient manner; see Rue, Martino and Chopin (2009) for more detail. The method is designed for latent Gaussian models; we describe it in the context of a GLMM.

Suppose Y_ij is of exponential family form: Y_ij | θ_ij, α ~ p(·), where p(·) is a member of the exponential family, that is,

    p(y_ij | θ_ij, α) = exp[ {y_ij θ_ij - b(θ_ij)}/a(α) + c(y_ij, α) ],

for i = 1, ..., m units and j = 1, ..., n_i measurements per unit. Let µ_ij = E[Y_ij | θ_ij, α] with

    g(µ_ij) = η_ij = x_ij β + z_ij b_i,

where b_i ~ N(0, D), and β is assigned a normal prior. We also have priors for α (if not a constant) and D; these priors are non-normal.

Let γ = (b, β) denote the p × 1 vector of Gaussian parameters, with π(γ | α) = N{0, Q(D)^{-1}}, and let α = (α, D) be the hyperparameters, which are not Gaussian. Then

    π(γ, α | y) ∝ π(α) π(γ | α) ∏_i p(y_i | γ, α)
               ∝ π(α) |Q(D)|^{1/2} exp( -(1/2) γ^T Q(D) γ + Σ_i log p(y_i | γ, α) ).

The INLA Algorithm

We wish to obtain the posterior marginals π(γ_i | y) and π(α_i | y). We have

    π(γ_i | y) = ∫ π(γ_i | α, y) π(α | y) dα,

which is evaluated via the approximation

    π̃(γ_i | y) = ∫ π̃(γ_i | α, y) π̃(α | y) dα    (44)
               ≈ Σ_k π̃(γ_i | α_k, y) π̃(α_k | y) Δ_k,    (45)

where π̃(γ_i | α, y) is approximated by Laplace (or other analytical approximations). For π̃(α | y) the mode is located and then the Hessian is approximated, from which a grid of points {α_k} is found that covers the density.

Pros and Cons of INLA

Advantages:

- Quite widely applicable: GLMMs, including temporal and spatial error terms.
- Very fast.
- An R package is available.

Disadvantages:

- Can't do NLMEMs (though it could in principle).
- Difficult to assess when the approximation is failing.
- The R package is new, and so there are wrinkles.

Bayesian Inference for the Seizure Data

We fit various models and begin with a discussion of prior specification. We fit five models to the seizure data.

Model 1: Random intercepts only, π(β) ∝ 1, τ ~ Ga(1, 0.026): corresponds to Student t_2 residuals with 95% of the residual relative risks in (0.5, 2).

Model 2: Random intercepts only, π(β) ∝ 1, τ ~ Ga(2, 1.376): corresponds to Student t_4 residuals and 95% in (0.1, 10).

Model 3: Random effects for the intercept and for x_2.

Model 4: We allow a bivariate Student t distribution for the pair of random effects introduced in Model 3.

Model 5: We introduce measurement error into the model.

WinBUGS code for Model 1:

model {
  for (i in 1:n) {
    for (j in 1:k) {
      Y[i,j] ~ dpois(mu[i,j])
      log(mu[i,j]) <- log(t[j]) + beta0 + beta1*x[i] + beta2*x2[j] +
                      beta3*x[i]*x2[j] + b[i]
    }
    b[i] ~ dnorm(0,tau)
  }
  tau ~ dgamma(1,0.026)
  sigma <- 1/tau
  beta0 ~ dflat()
  beta1 ~ dflat()
  beta2 ~ dflat()
  beta3 ~ dflat()
}

# Data
list(k = 5, n = 59, t = c(8,2,2,2,2), x2 = c(0,1,1,1,1),
     x = c(0,...,0,1,...,1),               # 28 zeros then 31 ones
     Y = structure(.Data = c(...), .Dim = c(59,5)))
# [the full 59 x 5 matrix of seizure counts is corrupted in this transcription]

# Initial estimates
list(b = c(0,0,...,0), beta0 = 0, beta1 = 0, beta2 = 0, beta3 = 0, tau = 1)

Estimates (standard deviations):

[Table 2: Posterior means and standard deviations for the Bayesian analysis of the seizure data, with rows β_0, β_1, β_2, β_3, σ_0, σ_1, ρ, σ_e and columns Models 1-5; σ_0 is the standard deviation of the random intercepts, σ_1 is the standard deviation of the random period effect, ρ is the correlation between these random effects, and σ_e is the standard deviation of the measurement error. The numerical entries are corrupted in this transcription.]

Poisson Model with a nugget effect

Recall that the model

    Y_ij | b_i ~ Poisson(t_ij exp(x_ij β + b_i)),   b_i ~ N(0, σ_0²),

has a single parameter only, σ_0², to allow for both excess-Poisson variability and between-individual variability. In the LMEM we have

    Y_ij = x_ij β + b_i + ε_ij,   b_i ~ N(0, σ_0²),   ε_ij ~ N(0, σ_e²),

with b_i and ε_ij independent.

By analogy we might consider the model:

    Y_ij | b_i, b_ij ~ Poisson(t_ij exp(x_ij β + b_i + b_ij)),
    b_i ~ N(0, σ_0²),   b_ij ~ N(0, σ_e²),

with b_i and b_ij independent. We now have two parameters to allow for between-individual variability, σ_0², and excess-Poisson variability, σ_e². Unfortunately there is no simple marginal interpretation of σ_0 and σ_e:

    E[Y_ij] = t_ij exp(x_ij β + σ_e²/2 + σ_0²/2) = µ_ij
    var(Y_ij) = µ_ij + µ_ij² ( e^{σ_e²} e^{σ_0²} - 1 )
    cov(Y_ij, Y_ik) = t_ij t_ik exp(x_ij β + x_ik β) e^{σ_e²} e^{σ_0²} ( e^{σ_0²} - 1 )

Another possibility would be to start with a negative binomial distribution, and then introduce a random effect, b_i. This reveals the heaven and hell of mixed-effects models: we have a lot of flexibility in the models we can fit, but many formulations that are similar produce different marginal mean and covariance structures, and often there is no obvious right choice.

WinBUGS code for the Poisson-lognormal (nugget) model:

# Model 3 - Poisson lognormal for nugget also
model {
  for (i in 1:n) {
    for (j in 1:k) {
      Y[i,j] ~ dpois(mu[i,j])
      log(mu[i,j]) <- log(t[j]) + beta0 + beta1*x[i] + beta2*x2[j] +
                      beta3*x[i]*x2[j] + b[i] + be[i,j]
      be[i,j] ~ dnorm(0,taue)
    }
    b[i] ~ dnorm(0,tau)
  }
  taue ~ dgamma(1,0.026)
  tau ~ dgamma(1,0.026)
  sigma <- sqrt(1/tau)
  sigmae <- 1/sqrt(taue)
  beta0 ~ dflat()
  beta1 ~ dflat()
  beta2 ~ dflat()
  beta3 ~ dflat()
}

Seizure Data

Patient 49 had counts 151, 102, 65, 72, 63 under progabide: very surprising. In DHLZ, dropping this individual gave a parameter of interest of -0.3. Posterior medians of b_ij for this individual (i = 49, j = 0, 1, 2, 3, 4) are: -.6, .6, .8, .27, .65, .5 [digits lost in transcription].

Conclusions: there is evidence of a statistically significant treatment effect. Under Model 4 the 95% credible interval on β_3 is (-.6, -.28), and under Model 5 it is (-.59, -.3) [some digits lost in transcription].

Seizure Data: INLA implementation

> require(glmmAK); data(epileptic)
> epileptic$Y = epileptic$seizure; epileptic$x = epileptic$trt
> epileptic$x2 = as.integer(epileptic$visit>0)
> epileptic$t = 8-6*epileptic$x2
> epileptic$rand = 1:nrow(epileptic)
> formula = Y ~ x + x2 + I(x*x2) +
+   f(id,model="iid",param=c(1,0.026)) + f(rand,model="iid",param=c(1,0.026))
> inla1 = inla(formula, family="poisson", data=epileptic, offset=I(log(t)))
> summary(inla1)

Fixed effects:
            mean sd 0.025quant 0.975quant kld
x           [values lost in transcription]
x2          [values lost in transcription]
I(x*x2)     [values lost in transcription]
intercept   [values lost in transcription]

Model hyperparameters:
                    mean sd 0.025quant 0.975quant
Precision.for..id   [values lost in transcription]
Precision.for..rand [values lost in transcription]

Generalized Estimating Equations (GEEs)

Liang and Zeger (1986, Biometrika) and Zeger and Liang (1986, Biometrics) considered GLMs with dependence within individuals (in the context of longitudinal data).

Theorem (Liang and Zeger, 1986): the estimator β̂ that satisfies

    G(β, α̂) = Σ_{i=1}^m D_i^T W_i^{-1} (Y_i - µ_i) = 0,

where D_i = ∂µ_i/∂β, W_i = W_i(β, α) is the working covariance model, µ_i = µ_i(β), and α̂ is a consistent estimator of α, is such that

    V_β^{-1/2} (β̂ - β) →_d N(0, I),

where V_β is given by

    ( Σ_{i=1}^m D_i^T W_i^{-1} D_i )^{-1} { Σ_{i=1}^m D_i^T W_i^{-1} cov(Y_i) W_i^{-1} D_i } ( Σ_{i=1}^m D_i^T W_i^{-1} D_i )^{-1}.

In practice an empirical estimator of cov(Y_i) is substituted, to give V̂_β.

Choice of Working Covariance Models

As in the linear case, various assumptions about the form of the working covariance may be made (what is a natural choice?); we write

    W_i = Δ_i^{1/2} R_i(α) Δ_i^{1/2},

where Δ_i = diag{var(Y_i1), ..., var(Y_in_i)} and R_i is a working correlation model: for example, independence, exchangeable, AR(1), unstructured.

For small m the sandwich estimator will have high variability, and so model-based variance estimators may be preferable (but would we trust asymptotic normality if m were small anyway?). Model-based estimators are more efficient if the model is correct.

Published comments:

- Liang and Zeger (1986): little difference when correlation is moderate.
- McDonald (1993): the independence estimator "may be recommended for practical purposes".
- Zhao, Prentice and Self (1992): assuming independence can lead to important losses of efficiency.
- Fitzmaurice, Laird and Rotnitzky (1993): it is important to obtain a close approximation to cov(Y_i) in order to achieve high efficiency.

GEE for the Seizure Data

The log-linear model is given by

    log E[Y_ij] = log µ_ij = log t_ij + β_0 + β_1 x_i1 + β_2 x_ij2 + β_3 x_i1 x_ij2,

with var(Y_ij) = α µ_ij. Recall that β_1 is the baseline comparison of rates, β_2 is the period effect in the placebo group, and β_3 is the treatment × period effect of interest. Both quasi-likelihood and working independence GEE have estimating equation

    G(β, α̂) = Σ_{i=1}^m x_i^T (Y_i - µ_i) = 0,

but differ in the manner in which the standard errors are calculated.

Estimates (standard errors):

           Poisson        Quasi-Lhd      GEE Ind        GEE Exch       GEE AR(1)
    β_0    1.35 (0.034)   1.35 (0.15)    1.35 (0.16)    1.35 (0.16)    1.3 (0.16)
    β_1    0.027 (0.047)  0.027 (0.21)   0.027 (0.22)   0.027 (0.22)   [?] ([?])
    β_2    0.11 (0.047)   0.11 (0.21)    0.11 (0.12)    0.11 (0.12)    [?] ([?])
    β_3    -0.10 (0.065)  -0.10 (0.29)   -0.10 (0.22)   -0.10 (0.22)   -0.13 (0.27)
    α_1    1              19.7           19.4           19.4           [?]
    α_2    -              -              -              0.78           0.89

Table 3: Parameter estimates and standard errors under various models; α_1 is a variance parameter, and α_2 a correlation parameter (cells marked [?] are corrupted in the transcription).

The point estimates under Poisson, quasi-likelihood and GEE working independence will always agree. The Poisson standard errors are clearly much too small. The quasi-likelihood standard errors are increased by √19.7 = 4.4, but do not acknowledge dependence between observations on the same individual (it is as if we have 59 × 5 independent observations). The standard errors of estimated parameters that are associated with time-varying covariates (β_2 and β_3) are reduced under GEE, since within-person comparisons are being made. The coincidence of the estimates and standard errors for independence and exchangeability is a consequence of the balanced design.
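For reference, fits of this kind can be reproduced with, for example, the geepack package (this is my sketch; the notes do not show the fitting code), assuming a long-format data frame seiz with columns y, x1, x2, time and patient identifier ID:

# Working-independence and exchangeable GEE fits for the seizure model.
library(geepack)
gee.ind  <- geeglm(y ~ x1 + x2 + I(x1 * x2) + offset(log(time)),
                   id = ID, family = poisson, corstr = "independence",
                   data = seiz)
gee.exch <- geeglm(y ~ x1 + x2 + I(x1 * x2) + offset(log(time)),
                   id = ID, family = poisson, corstr = "exchangeable",
                   data = seiz)
summary(gee.ind)   # sandwich standard errors allow for within-patient dependence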

Interpretation of Marginal and Conditional Coefficients

In a marginal model (which we consider under GEE), we have

    E[Y | x] = exp(γ_0 + γ_1 x),

in which case e^{γ_1} is the change in the average response when we increase x by 1 unit, in the population under consideration. Under the conditional (mixed effects) model the interpretation of regression coefficients is conditional on the value of the random effect. For the model

    E[Y | x, b] = exp(β_0 + β_1 x + b),  with b ~iid N(0, σ_0²),

the marginal mean is given by:

    E[Y | x] = E_b{E[Y | x, b]} = exp(β_0 + σ_0²/2 + β_1 x).

Hence for the log-linear model e^{β_1} has the same marginal interpretation as e^{γ_1} (the marginal intercept is γ_0 = β_0 + σ_0²/2), though estimation of the latter via GEE produces a consistent estimator in more general circumstances (though there is an efficiency loss if the random effects model is correct).

In the model

    E[Y | x, b] = exp{β_0 + b_0i + (β_1 + b_1i) x_i},

e^{β_1} is the relative risk between two populations with the same b but whose x values differ by one unit, that is:

    exp(β_1) = E[Y | x + 1, b] / E[Y | x, b].

An alternative interpretation is to say that it is the expected change between two typical individuals, that is, individuals with random effects b = 0. With b ~iid N(0, D) we have the marginal mean

    E[Y | x] = exp{ β_0 + D_00/2 + x(β_1 + D_01) + x² D_11/2 },

so that there is no marginal mean interpretation of exp(β_1) (the latter gives the marginal median).

Second Extension to GEE: Connected Estimating Equations, GEE2 (*)

In GEE2, there is a connected set of joint estimating equations for β and α. Such an approach was proposed by Zhao and Prentice (1990), and Prentice and Zhao (1991). This approach is particularly appealing if the variance-covariance model is of interest. To motivate such a pair, consider the following model for a single individual with n independent observations:

    Y_i | β, α ~ind N{µ_i(β), Σ_i(β, α)},

where, for example, we may have Σ_i(β, α) = α µ_i(β)², i = 1, ..., n. We have the log-likelihood

    l(β, α) = -(1/2) Σ_{i=1}^n log Σ_i - (1/2) Σ_{i=1}^n (Y_i - µ_i)²/Σ_i.

(*) GEE1 is the method in which we have a single estimating equation, and a consistent estimator of α.

The score equations are given by

    ∂l/∂β = Σ_{i=1}^n (∂µ_i/∂β)^T (Y_i - µ_i)/Σ_i + (1/2) Σ_{i=1}^n (∂Σ_i/∂β)^T [ (Y_i - µ_i)² - Σ_i ]/Σ_i² = 0    (46)

    ∂l/∂α = (1/2) Σ_{i=1}^n (∂Σ_i/∂α)^T [ (Y_i - µ_i)² - Σ_i ]/Σ_i² = 0    (47)

This pair of quadratic estimating functions is unbiased given correct specification of the first two moments. Note, though, that if the variance model is wrong we are no longer guaranteed a consistent estimator of β; if it is correct, however, there will be a gain in efficiency.

Let

    S_i = (Y_i - µ_i)²,

with, under the model,

    E[S_i] = Σ_i,
    var(S_i) = E[S_i²] - E[S_i]² = 3Σ_i² - Σ_i² = 2Σ_i².

Hence we can rewrite (46) and (47) as

    ∂l/∂β = Σ_{i=1}^n D_i^T V_i^{-1} (Y_i - µ_i) + Σ_{i=1}^n E_i^T W_i^{-1} (S_i - Σ_i),
    ∂l/∂α = Σ_{i=1}^n F_i^T W_i^{-1} (S_i - Σ_i),

where D_i = ∂µ_i/∂β, E_i = ∂Σ_i/∂β and F_i = ∂Σ_i/∂α. This can be compared with the usual estimating equation specification:

    ∂l/∂β = Σ_{i=1}^n D_i^T V_i^{-1} (Y_i - µ_i).

GEE2 Continued

The general form of the estimating equations, in the dependent data setting, is given by

    Σ_{i=1}^m [ D_i 0 ; E_i F_i ]^T [ V_i C_i ; C_i^T W_i ]^{-1} [ Y_i - µ_i ; S_i - Σ_i ] = 0

(block rows separated by semicolons), where D_i = ∂µ_i/∂β, E_i = ∂Σ_i/∂β, F_i = ∂Σ_i/∂α, and we have working variance-covariance structure

    V_i = var(Y_i),   C_i = cov(Y_i, S_i),   W_i = var(S_i).

When C_i = 0 we obtain

    G_1(β, α) = Σ_{i=1}^m D_i^T V_i^{-1} (Y_i - µ_i) + Σ_{i=1}^m E_i^T W_i^{-1} (S_i - Σ_i) = 0,
    G_2(β, α) = Σ_{i=1}^m F_i^T W_i^{-1} (S_i - Σ_i) = 0,

which are the dependent-data version of the normal score equations we obtained earlier. Prentice and Zhao show that these equations arise from a so-called quadratic exponential model:

    p(Y_i | θ_i, λ_i) = k_i exp[ Y_i^T θ_i + S_i^T λ_i + c_i(Y_i) ].

For example, c_i = 0 gives the multivariate normal. For consistency of β̂ we require the models for both Y_i and S_i to be correct; there is increased efficiency if the models are correct.


More information

Introduction (Alex Dmitrienko, Lilly) Web-based training program

Introduction (Alex Dmitrienko, Lilly) Web-based training program Web-based training Introduction (Alex Dmitrienko, Lilly) Web-based training program http://www.amstat.org/sections/sbiop/webinarseries.html Four-part web-based training series Geert Verbeke (Katholieke

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

Regression models. Generalized linear models in R. Normal regression models are not always appropriate. Generalized linear models. Examples.

Regression models. Generalized linear models in R. Normal regression models are not always appropriate. Generalized linear models. Examples. Regression models Generalized linear models in R Dr Peter K Dunn http://www.usq.edu.au Department of Mathematics and Computing University of Southern Queensland ASC, July 00 The usual linear regression

More information

Generalized Estimating Equations (gee) for glm type data

Generalized Estimating Equations (gee) for glm type data Generalized Estimating Equations (gee) for glm type data Søren Højsgaard mailto:sorenh@agrsci.dk Biometry Research Unit Danish Institute of Agricultural Sciences January 23, 2006 Printed: January 23, 2006

More information

9 Generalized Linear Models

9 Generalized Linear Models 9 Generalized Linear Models The Generalized Linear Model (GLM) is a model which has been built to include a wide range of different models you already know, e.g. ANOVA and multiple linear regression models

More information

Discussion of Maximization by Parts in Likelihood Inference

Discussion of Maximization by Parts in Likelihood Inference Discussion of Maximization by Parts in Likelihood Inference David Ruppert School of Operations Research & Industrial Engineering, 225 Rhodes Hall, Cornell University, Ithaca, NY 4853 email: dr24@cornell.edu

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Estimating the marginal likelihood with Integrated nested Laplace approximation (INLA)

Estimating the marginal likelihood with Integrated nested Laplace approximation (INLA) Estimating the marginal likelihood with Integrated nested Laplace approximation (INLA) arxiv:1611.01450v1 [stat.co] 4 Nov 2016 Aliaksandr Hubin Department of Mathematics, University of Oslo and Geir Storvik

More information

Repeated ordinal measurements: a generalised estimating equation approach

Repeated ordinal measurements: a generalised estimating equation approach Repeated ordinal measurements: a generalised estimating equation approach David Clayton MRC Biostatistics Unit 5, Shaftesbury Road Cambridge CB2 2BW April 7, 1992 Abstract Cumulative logit and related

More information

Modelling geoadditive survival data

Modelling geoadditive survival data Modelling geoadditive survival data Thomas Kneib & Ludwig Fahrmeir Department of Statistics, Ludwig-Maximilians-University Munich 1. Leukemia survival data 2. Structured hazard regression 3. Mixed model

More information

ML estimation: Random-intercepts logistic model. and z

ML estimation: Random-intercepts logistic model. and z ML estimation: Random-intercepts logistic model log p ij 1 p = x ijβ + υ i with υ i N(0, συ) 2 ij Standardizing the random effect, θ i = υ i /σ υ, yields log p ij 1 p = x ij β + σ υθ i with θ i N(0, 1)

More information

Statistics & Data Sciences: First Year Prelim Exam May 2018

Statistics & Data Sciences: First Year Prelim Exam May 2018 Statistics & Data Sciences: First Year Prelim Exam May 2018 Instructions: 1. Do not turn this page until instructed to do so. 2. Start each new question on a new sheet of paper. 3. This is a closed book

More information

Modeling Longitudinal Count Data with Excess Zeros and Time-Dependent Covariates: Application to Drug Use

Modeling Longitudinal Count Data with Excess Zeros and Time-Dependent Covariates: Application to Drug Use Modeling Longitudinal Count Data with Excess Zeros and : Application to Drug Use University of Northern Colorado November 17, 2014 Presentation Outline I and Data Issues II Correlated Count Regression

More information

Statistics in Environmental Research (BUC Workshop Series) II Problem sheet - WinBUGS - SOLUTIONS

Statistics in Environmental Research (BUC Workshop Series) II Problem sheet - WinBUGS - SOLUTIONS Statistics in Environmental Research (BUC Workshop Series) II Problem sheet - WinBUGS - SOLUTIONS 1. (a) The posterior mean estimate of α is 14.27, and the posterior mean for the standard deviation of

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Model comparison and selection

Model comparison and selection BS2 Statistical Inference, Lectures 9 and 10, Hilary Term 2008 March 2, 2008 Hypothesis testing Consider two alternative models M 1 = {f (x; θ), θ Θ 1 } and M 2 = {f (x; θ), θ Θ 2 } for a sample (X = x)

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Likelihood-Based Methods

Likelihood-Based Methods Likelihood-Based Methods Handbook of Spatial Statistics, Chapter 4 Susheela Singh September 22, 2016 OVERVIEW INTRODUCTION MAXIMUM LIKELIHOOD ESTIMATION (ML) RESTRICTED MAXIMUM LIKELIHOOD ESTIMATION (REML)

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems Principles of Statistical Inference Recap of statistical models Statistical inference (frequentist) Parametric vs. semiparametric

More information

Lecture 5: LDA and Logistic Regression

Lecture 5: LDA and Logistic Regression Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant

More information

Proportional hazards regression

Proportional hazards regression Proportional hazards regression Patrick Breheny October 8 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/28 Introduction The model Solving for the MLE Inference Today we will begin discussing regression

More information

PAPER 218 STATISTICAL LEARNING IN PRACTICE

PAPER 218 STATISTICAL LEARNING IN PRACTICE MATHEMATICAL TRIPOS Part III Thursday, 7 June, 2018 9:00 am to 12:00 pm PAPER 218 STATISTICAL LEARNING IN PRACTICE Attempt no more than FOUR questions. There are SIX questions in total. The questions carry

More information

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1 Parametric Modelling of Over-dispersed Count Data Part III / MMath (Applied Statistics) 1 Introduction Poisson regression is the de facto approach for handling count data What happens then when Poisson

More information

Metropolis-Hastings Algorithm

Metropolis-Hastings Algorithm Strength of the Gibbs sampler Metropolis-Hastings Algorithm Easy algorithm to think about. Exploits the factorization properties of the joint probability distribution. No difficult choices to be made to

More information

A Very Brief Summary of Bayesian Inference, and Examples

A Very Brief Summary of Bayesian Inference, and Examples A Very Brief Summary of Bayesian Inference, and Examples Trinity Term 009 Prof Gesine Reinert Our starting point are data x = x 1, x,, x n, which we view as realisations of random variables X 1, X,, X

More information

Lecture 3 Linear random intercept models

Lecture 3 Linear random intercept models Lecture 3 Linear random intercept models Example: Weight of Guinea Pigs Body weights of 48 pigs in 9 successive weeks of follow-up (Table 3.1 DLZ) The response is measures at n different times, or under

More information

Package HGLMMM for Hierarchical Generalized Linear Models

Package HGLMMM for Hierarchical Generalized Linear Models Package HGLMMM for Hierarchical Generalized Linear Models Marek Molas Emmanuel Lesaffre Erasmus MC Erasmus Universiteit - Rotterdam The Netherlands ERASMUSMC - Biostatistics 20-04-2010 1 / 52 Outline General

More information

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

More information

SB1a Applied Statistics Lectures 9-10

SB1a Applied Statistics Lectures 9-10 SB1a Applied Statistics Lectures 9-10 Dr Geoff Nicholls Week 5 MT15 - Natural or canonical) exponential families - Generalised Linear Models for data - Fitting GLM s to data MLE s Iteratively Re-weighted

More information

LOGISTIC REGRESSION Joseph M. Hilbe

LOGISTIC REGRESSION Joseph M. Hilbe LOGISTIC REGRESSION Joseph M. Hilbe Arizona State University Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of

More information

Semiparametric Generalized Linear Models

Semiparametric Generalized Linear Models Semiparametric Generalized Linear Models North American Stata Users Group Meeting Chicago, Illinois Paul Rathouz Department of Health Studies University of Chicago prathouz@uchicago.edu Liping Gao MS Student

More information

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science.

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science. Texts in Statistical Science Generalized Linear Mixed Models Modern Concepts, Methods and Applications Walter W. Stroup CRC Press Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint

More information

Open Problems in Mixed Models

Open Problems in Mixed Models xxiii Determining how to deal with a not positive definite covariance matrix of random effects, D during maximum likelihood estimation algorithms. Several strategies are discussed in Section 2.15. For

More information

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Jon Wakefield Departments of Statistics and Biostatistics University of Washington 1 / 37 Lecture Content Motivation

More information

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score

More information

Linear model A linear model assumes Y X N(µ(X),σ 2 I), And IE(Y X) = µ(x) = X β, 2/52

Linear model A linear model assumes Y X N(µ(X),σ 2 I), And IE(Y X) = µ(x) = X β, 2/52 Statistics for Applications Chapter 10: Generalized Linear Models (GLMs) 1/52 Linear model A linear model assumes Y X N(µ(X),σ 2 I), And IE(Y X) = µ(x) = X β, 2/52 Components of a linear model The two

More information

The consequences of misspecifying the random effects distribution when fitting generalized linear mixed models

The consequences of misspecifying the random effects distribution when fitting generalized linear mixed models The consequences of misspecifying the random effects distribution when fitting generalized linear mixed models John M. Neuhaus Charles E. McCulloch Division of Biostatistics University of California, San

More information

Lecture 9 STK3100/4100

Lecture 9 STK3100/4100 Lecture 9 STK3100/4100 27. October 2014 Plan for lecture: 1. Linear mixed models cont. Models accounting for time dependencies (Ch. 6.1) 2. Generalized linear mixed models (GLMM, Ch. 13.1-13.3) Examples

More information

2017 SISG Module 1: Bayesian Statistics for Genetics Lecture 7: Generalized Linear Modeling

2017 SISG Module 1: Bayesian Statistics for Genetics Lecture 7: Generalized Linear Modeling 2017 SISG Module 1: Bayesian Statistics for Genetics Lecture 7: Generalized Linear Modeling Jon Wakefield Departments of Statistics and Biostatistics University of Washington Outline Introduction and Motivating

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models Generalized Linear Models - part II Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs.

More information

EM Algorithm II. September 11, 2018

EM Algorithm II. September 11, 2018 EM Algorithm II September 11, 2018 Review EM 1/27 (Y obs, Y mis ) f (y obs, y mis θ), we observe Y obs but not Y mis Complete-data log likelihood: l C (θ Y obs, Y mis ) = log { f (Y obs, Y mis θ) Observed-data

More information

Chapter 4: Generalized Linear Models-II

Chapter 4: Generalized Linear Models-II : Generalized Linear Models-II Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM [Acknowledgements to Tim Hanson and Haitao Chu] D. Bandyopadhyay

More information

General Regression Model

General Regression Model Scott S. Emerson, M.D., Ph.D. Department of Biostatistics, University of Washington, Seattle, WA 98195, USA January 5, 2015 Abstract Regression analysis can be viewed as an extension of two sample statistical

More information

A short introduction to INLA and R-INLA

A short introduction to INLA and R-INLA A short introduction to INLA and R-INLA Integrated Nested Laplace Approximation Thomas Opitz, BioSP, INRA Avignon Workshop: Theory and practice of INLA and SPDE November 7, 2018 2/21 Plan for this talk

More information

Outline for today. Computation of the likelihood function for GLMMs. Likelihood for generalized linear mixed model

Outline for today. Computation of the likelihood function for GLMMs. Likelihood for generalized linear mixed model Outline for today Computation of the likelihood function for GLMMs asmus Waagepetersen Department of Mathematics Aalborg University Denmark likelihood for GLMM penalized quasi-likelihood estimation Laplace

More information

Linear Mixed Models. One-way layout REML. Likelihood. Another perspective. Relationship to classical ideas. Drawbacks.

Linear Mixed Models. One-way layout REML. Likelihood. Another perspective. Relationship to classical ideas. Drawbacks. Linear Mixed Models One-way layout Y = Xβ + Zb + ɛ where X and Z are specified design matrices, β is a vector of fixed effect coefficients, b and ɛ are random, mean zero, Gaussian if needed. Usually think

More information

An R # Statistic for Fixed Effects in the Linear Mixed Model and Extension to the GLMM

An R # Statistic for Fixed Effects in the Linear Mixed Model and Extension to the GLMM An R Statistic for Fixed Effects in the Linear Mixed Model and Extension to the GLMM Lloyd J. Edwards, Ph.D. UNC-CH Department of Biostatistics email: Lloyd_Edwards@unc.edu Presented to the Department

More information

Outline. Statistical inference for linear mixed models. One-way ANOVA in matrix-vector form

Outline. Statistical inference for linear mixed models. One-way ANOVA in matrix-vector form Outline Statistical inference for linear mixed models Rasmus Waagepetersen Department of Mathematics Aalborg University Denmark general form of linear mixed models examples of analyses using linear mixed

More information

Estimation and Model Selection in Mixed Effects Models Part I. Adeline Samson 1

Estimation and Model Selection in Mixed Effects Models Part I. Adeline Samson 1 Estimation and Model Selection in Mixed Effects Models Part I Adeline Samson 1 1 University Paris Descartes Summer school 2009 - Lipari, Italy These slides are based on Marc Lavielle s slides Outline 1

More information

1/15. Over or under dispersion Problem

1/15. Over or under dispersion Problem 1/15 Over or under dispersion Problem 2/15 Example 1: dogs and owners data set In the dogs and owners example, we had some concerns about the dependence among the measurements from each individual. Let

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) T In 2 2 tables, statistical independence is equivalent to a population

More information

Generalized, Linear, and Mixed Models

Generalized, Linear, and Mixed Models Generalized, Linear, and Mixed Models CHARLES E. McCULLOCH SHAYLER.SEARLE Departments of Statistical Science and Biometrics Cornell University A WILEY-INTERSCIENCE PUBLICATION JOHN WILEY & SONS, INC. New

More information

Now consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown.

Now consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown. Weighting We have seen that if E(Y) = Xβ and V (Y) = σ 2 G, where G is known, the model can be rewritten as a linear model. This is known as generalized least squares or, if G is diagonal, with trace(g)

More information