November 2002, STA216: Random Effects Selection in Linear Mixed Models

Introduction

It is common practice in many applications to collect multiple measurements on each subject. Linear mixed models (Laird and Ware, 1982; Longford, 1993) account for within-subject dependency in these measurements by including one or more subject-specific latent variables (i.e., random effects) in the regression model.

An important practical problem in applying linear mixed models is how to choose the random effects component. Should one use AIC or BIC? A likelihood ratio test? A score test?

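Bayesian Hierarchical Approach

We propose an approach for selecting random effects using a hierarchical Bayesian model. The key step is a decomposition of the random effects covariance matrix $D$:

$$D = \Lambda \Gamma \Gamma^T \Lambda, \tag{1}$$

where $\Lambda$ is a nonnegative diagonal matrix and $\Gamma$ is a lower triangular matrix with ones on the diagonal. We allow the diagonal elements of $\Lambda$ to have positive probability of being zero, so that random effects can have zero variance and effectively drop out of the model. Conditionally on the other parameters, the elements of either $\Lambda$ or $\Gamma$ can be regarded as regression coefficients in a normal linear model.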
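As a small illustration of decomposition (1), the sketch below builds $D$ from a given $\Lambda$ and $\Gamma$ and shows that setting one $\lambda_l$ to zero zeroes the corresponding row and column of $D$. The function name `covariance_from_factors` and the numerical values are hypothetical, chosen only for this example.

```python
def covariance_from_factors(lam, gamma):
    """Return D = Lambda Gamma Gamma^T Lambda as nested lists.

    lam   : list of q nonnegative values (the diagonal of Lambda)
    gamma : q x q lower triangular matrix with ones on the diagonal
    """
    q = len(lam)
    # L = Lambda * Gamma: row l of Gamma is scaled by lam[l].
    L = [[lam[l] * gamma[l][r] for r in range(q)] for l in range(q)]
    # D = L L^T
    return [[sum(L[l][r] * L[m][r] for r in range(q)) for m in range(q)]
            for l in range(q)]

# Illustrative values (assumed): the second random effect is switched off.
lam = [1.5, 0.0, 0.8]
gamma = [[1.0, 0.0, 0.0],
         [0.4, 1.0, 0.0],
         [-0.3, 0.7, 1.0]]
D = covariance_from_factors(lam, gamma)
for row in D:
    print(row)
```

Because row 2 of $\Lambda\Gamma$ is zero whenever $\lambda_2 = 0$, the second row and column of $D$ vanish regardless of the values in $\Gamma$.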

Linear Mixed Models

Suppose there are $n$ subjects, with subject $i$ contributing $n_i$ observations. For subject $i$ at observation $j$, let $y_{ij}$ denote the response variable, let $x_{ij}$ denote a $p \times 1$ vector of predictors, and let $z_{ij}$ denote a $q \times 1$ vector of predictors. In general, the linear mixed effects model is written as

$$y_i = X_i \alpha + Z_i \beta_i + \varepsilon_i, \tag{2}$$

where $y_i = (y_{i1}, \ldots, y_{in_i})^T$, $X_i = (x_{i1}, \ldots, x_{in_i})^T$, $Z_i = (z_{i1}, \ldots, z_{in_i})^T$, $\alpha$ is a $p \times 1$ vector of unknown population parameters, $\beta_i$ is a $q \times 1$ vector of unknown subject-specific random effects with $\beta_i \sim N(0, D)$, and the residual vector satisfies $\varepsilon_i \sim N(0, \sigma^2 I)$, independently of $\beta_i$. Integrating out the random effects $\beta_i$, the marginal distribution of $y_i$ is

$$y_i \sim N(X_i \alpha,\; Z_i D Z_i^T + \sigma^2 I).$$

Heterogeneity among subjects is accommodated by allowing the linear predictor, conditional on the covariates, to vary across subjects. When $z_{ij}$ is a subvector of $x_{ij}$, the model allows the regression coefficients for the covariates included in $z_{ij}$ to vary among subjects, while assuming that the remaining coefficients are fixed for all subjects.

Bayesian estimation of mixed models commonly uses an inverse-Wishart prior for $D$. The inverse-Wishart density tends to be restrictive, however, since it prescribes a common degrees of freedom for all the diagonal entries of $D$. In addition, it is only useful if the random effects component is known, since it forces every random effect variance to be positive.
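To make the marginal covariance concrete, the following sketch computes $Z_i D Z_i^T + \sigma^2 I$ for one subject with a random intercept and a random slope observed at three time points. The helper name `marginal_covariance` and all numbers are assumptions for illustration only.

```python
def marginal_covariance(Z, D, sigma2):
    """Return Z D Z^T + sigma2 * I for one subject, as nested lists."""
    n, q = len(Z), len(D)
    # ZD[j][m] = sum_l Z[j][l] * D[l][m]
    ZD = [[sum(Z[j][l] * D[l][m] for l in range(q)) for m in range(q)]
          for j in range(n)]
    V = [[sum(ZD[j][m] * Z[k][m] for m in range(q)) for k in range(n)]
         for j in range(n)]
    for j in range(n):
        V[j][j] += sigma2  # residual variance enters only on the diagonal
    return V

# Random intercept and slope at times 0, 1, 2 (illustrative values).
Z_i = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
D = [[1.0, 0.5], [0.5, 2.0]]
V_i = marginal_covariance(Z_i, D, sigma2=0.25)
for row in V_i:
    print(row)
```

The off-diagonal entries come entirely from the random effects, which is how the model induces within-subject dependence.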

Reparameterization

Starting with the model that has a random coefficient for each of the elements of $z_{ij}$, we adaptively select models having some random effects excluded. From model (2), it is clear that selecting a subset of random effects is equivalent to setting the variances of the nonselected random effects to 0.

Let $d_{lm}$ denote the $(l, m)$th entry of $D$, for $l, m = 1, \ldots, q$. The $l$th random effect $\beta_{il}$ is excluded if $d_{ll} = 0$ and is included if $d_{ll} > 0$.

Let $L$ be the lower triangular Cholesky factor of $D$, so that $D = L L^T$. We assume that $L$ has nonnegative diagonal elements so that it is unique (Seber, 1977, p. 388). Given $L$, the linear mixed model (2) can be reexpressed as

$$y_i = X_i \alpha + Z_i L b_i + \varepsilon_i,$$

where $b_i = (b_{i1}, \ldots, b_{iq})^T$ is a vector of independent standard normal latent variables. We further let $L = \Lambda \Gamma$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_q)$ and $\Gamma$ is a $q \times q$ matrix with $(l, m)$th element denoted by $\gamma_{lm}$. As minimal conditions on $\Lambda$ and $\Gamma$ so that they are uniquely defined, we assume that

$$\lambda_l \ge 0, \quad \gamma_{ll} = 1, \quad \gamma_{lm} = 0 \quad \text{for } l = 1, \ldots, q,\ m = l + 1, \ldots, q. \tag{3}$$

Specifically, we choose $\Lambda$ to be a nonnegative $q \times q$ diagonal matrix, and $\Gamma$ to be a lower triangular matrix with ones on the diagonal. This leads to the decomposition of $D$ in (1), and to the reparameterized linear mixed model,

$$y_i = X_i \alpha + Z_i \Lambda \Gamma b_i + \varepsilon_i. \tag{4}$$
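The mapping from $D$ to $(\Lambda, \Gamma)$ can be sketched as follows: take the lower triangular Cholesky factor $L$, read $\lambda_l$ off its diagonal, and divide each row by $\lambda_l$ to obtain the unit-diagonal $\Gamma$. This is a minimal sketch with hypothetical function names; rows with $\lambda_l = 0$ are set to the corresponding row of the identity, which is one convention consistent with (3).

```python
import math

def cholesky_lower(D):
    """Lower triangular L with nonnegative diagonal such that D = L L^T."""
    q = len(D)
    L = [[0.0] * q for _ in range(q)]
    for l in range(q):
        for m in range(l + 1):
            s = sum(L[l][r] * L[m][r] for r in range(m))
            if l == m:
                L[l][l] = math.sqrt(max(D[l][l] - s, 0.0))
            elif L[m][m] > 0.0:
                L[l][m] = (D[l][m] - s) / L[m][m]
            # If the pivot L[m][m] is zero, L[l][m] stays 0.
    return L

def lambda_gamma(D):
    """Split the Cholesky factor of D into (lam, gamma) with L = Lambda Gamma."""
    L = cholesky_lower(D)
    q = len(D)
    lam = [L[l][l] for l in range(q)]
    gamma = [[1.0 if l == m else 0.0 for m in range(q)] for l in range(q)]
    for l in range(q):
        if lam[l] > 0.0:
            for m in range(l):
                gamma[l][m] = L[l][m] / lam[l]
    return lam, gamma

lam, gamma = lambda_gamma([[4.0, 2.0], [2.0, 5.0]])
print(lam, gamma)
```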

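Implications of the Reparameterization

Following straightforward matrix algebra, the diagonal elements of $D$ are

$$d_{ll} = \lambda_l^2 \Big( 1 + \sum_{r=1}^{l-1} \gamma_{lr}^2 \Big) \quad \text{for } l = 1, \ldots, q. \tag{5}$$

The off-diagonal elements are

$$d_{lm} = d_{ml} = \lambda_l \lambda_m \Big( \gamma_{ml} + \sum_{r=1}^{l-1} \gamma_{lr} \gamma_{mr} \Big) \quad \text{for } l = 1, \ldots, q;\ m = l + 1, \ldots, q.$$

In the case where $\lambda_l = 0$, $\mathrm{var}(\beta_{il}) = 0$ and the $l$th random effect, $\beta_{il}$, is effectively dropped.

The parameters $\gamma \in \mathbb{R}^{q(q-1)/2}$ measure the degree of within-subject dependency in the random effects $\beta_i$, as is clear from the expression for the correlation coefficient between $\beta_{il}$ and $\beta_{im}$, for $l < m$ with $\lambda_l, \lambda_m > 0$,

$$\rho(\beta_{il}, \beta_{im}) = \frac{\gamma_{ml} + \sum_{r=1}^{l-1} \gamma_{lr} \gamma_{mr}}{\sqrt{\Big( 1 + \sum_{r=1}^{l-1} \gamma_{lr}^2 \Big) \Big( 1 + \sum_{r=1}^{m-1} \gamma_{mr}^2 \Big)}},$$

which does not depend on $\lambda$.

As functions of elements of the covariance matrix $D$, $\lambda$ and $\gamma$ are not independent. In particular, if $\lambda_l = 0$, then $\gamma_{ml} = 0$ for all $m \in \{l + 1, \ldots, q\}$ and $\gamma_{lm} = 0$ for all $m \in \{1, \ldots, l - 1\}$. For later use, we define

$$R_\lambda = \big\{ \gamma : \gamma_{ml} = 0 \text{ for } m = l + 1, \ldots, q \text{ and } \gamma_{lm} = 0 \text{ for } m = 1, \ldots, l - 1, \text{ whenever } \lambda_l = 0,\ l = 1, \ldots, q \big\}. \tag{6}$$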
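The claim that the correlation depends only on $\gamma$ can be checked numerically. The sketch below evaluates the correlation from the $\gamma$ formula and from the entries of $D = \Lambda\Gamma\Gamma^T\Lambda$ under two different choices of $\lambda$. Function names and values are hypothetical, and indices are 0-based in the code.

```python
import math

def rho_from_gamma(gamma, l, m):
    """Correlation of random effects l < m (0-based) using gamma alone."""
    num = gamma[m][l] + sum(gamma[l][r] * gamma[m][r] for r in range(l))
    den_l = 1.0 + sum(gamma[l][r] ** 2 for r in range(l))
    den_m = 1.0 + sum(gamma[m][r] ** 2 for r in range(m))
    return num / math.sqrt(den_l * den_m)

def rho_from_D(lam, gamma, l, m):
    """Same correlation computed from the entries of D (requires lam > 0)."""
    q = len(lam)
    L = [[lam[a] * gamma[a][r] for r in range(q)] for a in range(q)]

    def d(a, b):
        return sum(L[a][r] * L[b][r] for r in range(q))

    return d(l, m) / math.sqrt(d(l, l) * d(m, m))

gamma = [[1.0, 0.0, 0.0],
         [0.5, 1.0, 0.0],
         [-0.3, 0.7, 1.0]]
r_gamma = rho_from_gamma(gamma, 1, 2)
r_small = rho_from_D([0.2, 0.5, 1.0], gamma, 1, 2)
r_large = rho_from_D([3.0, 7.0, 0.1], gamma, 1, 2)
print(r_gamma, r_small, r_large)
```

All three printed values should agree up to rounding error, since the $\lambda_l \lambda_m$ factor cancels between the numerator and the denominator.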

Prior Specification

Our model is completed with a prior density for $\theta = (\alpha, \lambda, \gamma, \sigma^2)^T$. First, we assume

$$p(\theta) = p(\lambda, \gamma)\, p(\alpha)\, p(\sigma^2).$$

Following standard convention, we choose conjugate priors, with $N(\alpha_0, A_0)$ for $\alpha$ and $G(c_0, d_0) \propto (\sigma^{-2})^{c_0 - 1} \exp\{ -d_0 \sigma^{-2} \}$ for the precision $\sigma^{-2}$.

In choosing priors for $\Lambda$ and $\Gamma$, and hence for $D$, we wish to allocate positive probability to zero values for the random effects variances. In addition, for practical reasons, we want priors that facilitate posterior computation, so conditionally conjugate prior distributions are desirable. We assume that

$$p(\lambda, \gamma) = p(\gamma \mid \lambda)\, p(\lambda) \propto N(\gamma; \gamma_0, R_0)\, 1(\gamma \in R_\lambda)\, p(\lambda).$$

We further assume that the $\lambda_l$ are independent, so that $p(\lambda) = \prod_{l=1}^{q} p(\lambda_l)$. Let $\text{ZI-N}^+(\pi, \mu, \sigma^2)$ denote the density of a zero-inflated half-normal distribution consisting of a point mass at zero (with probability $\pi$) and a $N(\mu, \sigma^2)$ density truncated below by zero. To specify a model selection prior, we choose

$$p(\lambda_l) \overset{d}{=} \text{ZI-N}^+(p_{l0}, m_{l0}, s_{l0}^2) \quad \text{for each } l,$$

where $p_{l0}$, $m_{l0}$, and $s_{l0}^2$ are hyperparameters to be specified by the investigator. The prior probability that the $l$th random effect is excluded (i.e., its variance is zero) is $p_{l0}$, and the overall prior probability of excluding all the random effects is $\prod_{l=1}^{q} p_{l0}$.
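A draw from the zero-inflated half-normal prior can be sketched in two stages: a Bernoulli draw for the point mass, then a truncated normal draw. The sketch below uses simple rejection sampling for the truncated part, which is an implementation choice assumed here for brevity (it becomes slow when the mean lies far below zero). The function name and hyperparameter values are hypothetical.

```python
import random

def sample_zi_half_normal(rng, p_zero, mean, sd):
    """Draw from ZI-N+(p_zero, mean, sd^2).

    With probability p_zero return exactly 0; otherwise return a draw
    from N(mean, sd^2) truncated below by zero (rejection sampling).
    """
    if rng.random() < p_zero:
        return 0.0
    while True:
        draw = rng.gauss(mean, sd)
        if draw > 0.0:
            return draw

rng = random.Random(2002)
lam_draws = [sample_zi_half_normal(rng, 0.5, 0.0, 1.0) for _ in range(20000)]
frac_zero = sum(d == 0.0 for d in lam_draws) / len(lam_draws)
print(frac_zero)  # should be close to the prior exclusion probability 0.5
```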

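Posterior Computation

Letting $b = (b_1, \ldots, b_n)^T$ and $y = (y_1, \ldots, y_n)^T$, the likelihood is given by

$$L(\theta, b; y) = (2\pi\sigma^2)^{-\sum_{i=1}^{n} n_i / 2} \exp\Big\{ -\sigma^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n_i} (y_{ij} - x_{ij}^T \alpha - z_{ij}^T \Lambda \Gamma b_i)^2 / 2 \Big\}. \tag{7}$$

The posterior distribution is obtained by combining the priors and the likelihood in the usual way. However, direct evaluation of the posterior distribution is difficult. Instead we employ a Gibbs sampler (Gelfand and Smith, 1990), which works by alternately sampling from the full conditional distributions of the parameters $(\alpha, \sigma^2, \lambda, \gamma)$ and the latent variables $b$.

Bayesian linear model theory (Lindley and Smith, 1972) applies when deriving the full conditional distributions of $\alpha$, $\sigma^2$, and $b$. For $\alpha$,

$$p(\alpha \mid \lambda, \gamma, \sigma^2, b, y) \overset{d}{=} N(\hat\alpha, \hat A),$$

with

$$\hat A = \Big( \sigma^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n_i} x_{ij} x_{ij}^T + A_0^{-1} \Big)^{-1}, \qquad \hat\alpha = \hat A \Big\{ \sigma^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n_i} x_{ij} (y_{ij} - z_{ij}^T \Lambda \Gamma b_i) + A_0^{-1} \alpha_0 \Big\}.$$

For the precision $\sigma^{-2}$, the full conditional distribution is given by

$$p(\sigma^{-2} \mid \alpha, \lambda, \gamma, b, y) \overset{d}{=} G(\hat c, \hat d),$$

where

$$\hat c = c_0 + \sum_{i=1}^{n} n_i / 2, \qquad \hat d = d_0 + \sum_{i=1}^{n} \sum_{j=1}^{n_i} (y_{ij} - x_{ij}^T \alpha - z_{ij}^T \Lambda \Gamma b_i)^2 / 2.$$

Similar to $\alpha$, the full conditional distribution of the latent normal variables $b$ is

$$p(b \mid \lambda, \gamma, \sigma^2, \alpha, y) = \prod_{i=1}^{n} p(b_i \mid \lambda, \gamma, \sigma^2, \alpha, y_i), \qquad p(b_i \mid \lambda, \gamma, \sigma^2, \alpha, y_i) \overset{d}{=} N(\hat h_i, \hat H_i),$$

where

$$\hat H_i = \Big( \sigma^{-2} \sum_{j=1}^{n_i} v_{ij} v_{ij}^T + I \Big)^{-1}, \qquad \hat h_i = \sigma^{-2} \hat H_i \sum_{j=1}^{n_i} v_{ij} (y_{ij} - x_{ij}^T \alpha),$$

and $v_{ij} = \Gamma^T \Lambda z_{ij}$, so that $v_{ij}^T b_i = z_{ij}^T \Lambda \Gamma b_i$.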
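Of the three conjugate updates above, the one for the precision $\sigma^{-2}$ is the simplest to sketch. The code below computes $(\hat c, \hat d)$ from a flat list of current residuals $y_{ij} - x_{ij}^T\alpha - z_{ij}^T\Lambda\Gamma b_i$ and draws a new precision. The function names and numbers are hypothetical, and the gamma distribution is assumed to be parameterized by shape and rate, as the stated full conditional suggests.

```python
import random

def precision_full_conditional(residuals, c0, d0):
    """Shape c_hat and rate d_hat of the gamma full conditional for 1/sigma^2.

    residuals : flat list of y_ij - x_ij^T alpha - z_ij^T Lambda Gamma b_i
    """
    c_hat = c0 + len(residuals) / 2.0
    d_hat = d0 + sum(r * r for r in residuals) / 2.0
    return c_hat, d_hat

def draw_precision(rng, c_hat, d_hat):
    """Draw 1/sigma^2 from Gamma(shape=c_hat, rate=d_hat)."""
    # random.gammavariate takes (shape, scale), so the rate is inverted.
    return rng.gammavariate(c_hat, 1.0 / d_hat)

rng = random.Random(216)
c_hat, d_hat = precision_full_conditional([0.5, -1.0, 0.2, 0.3], c0=1.0, d0=1.0)
sigma2 = 1.0 / draw_precision(rng, c_hat, d_hat)  # back-transform to a variance
print(c_hat, d_hat, sigma2)
```

The updates for $\alpha$ and $b_i$ follow the same pattern but require a multivariate normal draw, which is omitted here to keep the sketch within the standard library.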

FCDs of λ and γ

The full conditional distributions of $\lambda$ and $\gamma$ appear complex, given the form of the likelihood in (7). However, upon rewriting expression (4) with constraint (3) as

$$y_{ij} = x_{ij}^T \alpha + \sum_{l=1}^{q} b_{il} \Big( \lambda_l z_{ijl} + \sum_{m=l+1}^{q} \lambda_m z_{ijm} \gamma_{ml} \Big) + \varepsilon_{ij},$$

we obtain two equations that characterize $\lambda$ and $\gamma$ as regression coefficients in a normal linear model. First define the $q(q-1)/2 \times 1$ vector

$$u_{ij} = \big( b_{il} \lambda_m z_{ijm} : l = 1, \ldots, q,\ m = l + 1, \ldots, q \big)^T,$$

with elements ordered to match the elements $\gamma_{ml}$ of $\gamma$. Then the rewritten expression implies

$$y_{ij} - x_{ij}^T \alpha - \sum_{l=1}^{q} b_{il} \lambda_l z_{ijl} = u_{ij}^T \gamma + \varepsilon_{ij}.$$

Since the error term is normally distributed and $\gamma$ has a multivariate normal prior distribution, after setting elements equal to zero to ensure that $\gamma \in R_\lambda$, the full conditional distribution for $\gamma$ is easy to derive. It is given by

$$p(\gamma \mid \alpha, \lambda, b, \sigma^2, y) \propto N(\hat\gamma, \hat R)\, 1(\gamma \in R_\lambda),$$

where

$$\hat R = \Big( \sigma^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n_i} u_{ij} u_{ij}^T + R_0^{-1} \Big)^{-1}, \qquad \hat\gamma = \hat R \Big\{ \sigma^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n_i} u_{ij} \Big( y_{ij} - x_{ij}^T \alpha - \sum_{l=1}^{q} b_{il} \lambda_l z_{ijl} \Big) + R_0^{-1} \gamma_0 \Big\}.$$

Similarly, on defining the $q \times 1$ vector

$$t_{ij} = \Big( z_{ijl} \Big( b_{il} + \sum_{m=1}^{l-1} b_{im} \gamma_{lm} \Big) : l = 1, \ldots, q \Big)^T,$$

it is easy to verify that the rewritten expression implies $y_{ij} - x_{ij}^T \alpha = t_{ij}^T \lambda + \varepsilon_{ij}$. Letting $\eta_{ijl} = y_{ij} - x_{ij}^T \alpha - \sum_{m \ne l} t_{ijm} \lambda_m$ for each $\lambda_l$, we have $\eta_{ijl} = t_{ijl} \lambda_l + \varepsilon_{ij}$. It follows from straightforward (but lengthy) algebra that the full conditional distribution of $\lambda_l$ is

$$p(\lambda_l \mid \lambda_{(-l)}, \alpha, b, \gamma, \sigma^2, y) \overset{d}{=} \text{ZI-N}^+(\hat p_l, \hat\lambda_l, \hat\sigma_l^2), \tag{8}$$

where $\hat p_l = P(\lambda_l = 0 \mid \lambda_{(-l)}, \alpha, b, \gamma, \sigma^2, y)$ is the conditional posterior probability that $\lambda_l = 0$, and $\hat\lambda_l$ and $\hat\sigma_l^2$ are the updated mean and variance in the normal component of the $\text{ZI-N}^+$ density.

To derive the expressions for $\hat p_l$, $\hat\lambda_l$ and $\hat\sigma_l^2$, first let $\omega_l^{-2} = \sum_{i=1}^{n} \sum_{j=1}^{n_i} t_{ijl}^2 / \sigma^2$, and let $\tilde\lambda_l$ be the maximum likelihood estimate of $\lambda_l$, so that

$$\tilde\lambda_l = \sum_{i=1}^{n} \sum_{j=1}^{n_i} t_{ijl} \eta_{ijl} \Big/ \sum_{i=1}^{n} \sum_{j=1}^{n_i} t_{ijl}^2.$$

Then,

$$\hat\lambda_l = \hat\sigma_l^2 \big( \omega_l^{-2} \tilde\lambda_l + s_{l0}^{-2} m_{l0} \big) \quad \text{and} \quad \hat\sigma_l^2 = \big( \omega_l^{-2} + s_{l0}^{-2} \big)^{-1}.$$

Define

$$a = \exp\Big\{ -\sum_{i=1}^{n} \sum_{j=1}^{n_i} \eta_{ijl}^2 / 2\sigma^2 \Big\}$$

and

$$b = \frac{\hat\sigma_l \{ 1 - \Phi(-\hat\lambda_l / \hat\sigma_l) \}}{s_{l0} \{ 1 - \Phi(-m_{l0} / s_{l0}) \}} \exp\Big\{ -\sum_{i=1}^{n} \sum_{j=1}^{n_i} (\eta_{ijl} - \tilde\lambda_l t_{ijl})^2 / 2\sigma^2 \Big\} \exp\Big\{ -\Big( \frac{\tilde\lambda_l^2}{2\omega_l^2} + \frac{m_{l0}^2}{2 s_{l0}^2} - \frac{\hat\lambda_l^2}{2\hat\sigma_l^2} \Big) \Big\}.$$

Here $b$ denotes a scalar constant, not the vector of latent variables. Then,

$$\hat p_l = \frac{p_{l0}\, a}{p_{l0}\, a + (1 - p_{l0})\, b}.$$

Distribution (8) is conditionally conjugate, following the same form as the prior for $\lambda_l$. Sampling from expression (8) can be implemented by (i) sampling $\delta_l$ from Bernoulli($\hat p_l$); and (ii) setting $\lambda_l = 0$ if $\delta_l = 1$ and otherwise sampling $\lambda_l$ from $N(\hat\lambda_l, \hat\sigma_l^2)$ truncated below by zero.

Given repeated samples from the posterior distribution, inference about the model parameters $\alpha$, $\gamma$, $\lambda$, and $\sigma^2$ proceeds as usual. In particular, one can report posterior means, posterior standard deviations, and highest posterior density (HPD) intervals. To compute the posterior probabilities of each of the $2^q$ models, we count the number of iterations in which each model occurs and divide by the total number of iterations. The prior and posterior probabilities can then be used to calculate Bayes factors for comparing individual models. Refer to Kass and Raftery (1995) for a review of the Bayes factor.
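The update of a single $\lambda_l$ and the tally of visited models can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: function names are hypothetical, `t` and `eta` are assumed to be the $t_{ijl}$ and $\eta_{ijl}$ flattened over all $i, j$, the prior exclusion probability is assumed to lie strictly between 0 and 1, and the truncated normal is drawn by simple rejection. The ratio $b/a$ is computed on the log scale, where the residual sums of squares cancel, to avoid underflow.

```python
import math
import random
from collections import Counter

def norm_cdf(x):
    """Standard normal CDF; note 1 - Phi(-x) = Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lambda_conditional(t, eta, sigma2, p0, m0, s0):
    """Return (p_hat, lam_hat, sd_hat) of the ZI-N+ full conditional."""
    prec = sum(x * x for x in t) / sigma2 + 1.0 / s0 ** 2  # 1 / sigma_hat^2
    var = 1.0 / prec
    lam_hat = var * (sum(x * e for x, e in zip(t, eta)) / sigma2 + m0 / s0 ** 2)
    sd_hat = math.sqrt(var)
    # log(b / a): the sums of squares in a and b cancel analytically.
    log_ratio = (math.log(sd_hat / s0)
                 + math.log(norm_cdf(lam_hat / sd_hat))
                 - math.log(norm_cdf(m0 / s0))
                 + 0.5 * lam_hat ** 2 / var
                 - 0.5 * m0 ** 2 / s0 ** 2)
    # Cap the exponent to avoid overflow; p_hat is then effectively zero.
    p_hat = 1.0 / (1.0 + (1.0 - p0) / p0 * math.exp(min(log_ratio, 700.0)))
    return p_hat, lam_hat, sd_hat

def draw_lambda(rng, p_hat, lam_hat, sd_hat):
    """Steps (i) and (ii): Bernoulli draw, then truncated normal if nonzero."""
    if rng.random() < p_hat:
        return 0.0
    while True:
        x = rng.gauss(lam_hat, sd_hat)
        if x > 0.0:
            return x

def model_probabilities(lambda_draws):
    """Posterior model probabilities as relative frequencies of visits.

    A model is identified by which lambda_l are positive in a draw.
    """
    counts = Counter(tuple(l > 0.0 for l in draw) for draw in lambda_draws)
    n = len(lambda_draws)
    return {model: c / n for model, c in counts.items()}
```

For example, `model_probabilities([[0.0, 1.2], [0.0, 0.9], [0.5, 1.1], [0.0, 0.0]])` assigns probability 0.5 to the model that keeps only the second random effect. When the data carry no information about $\lambda_l$ (all $t_{ijl} = 0$), `lambda_conditional` returns the prior values, which is a useful sanity check.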