Linear mixed models and penalized least squares


Douglas M. Bates
Department of Statistics, University of Wisconsin–Madison

Saikat DebRoy
Department of Biostatistics, Harvard School of Public Health

Abstract

Linear mixed-effects models are an important class of statistical models that are not only used directly in many fields of applications but also used as iterative steps in fitting other types of mixed-effects models, such as generalized linear mixed models. The parameters in these models are typically estimated by maximum likelihood (ML) or restricted maximum likelihood (REML). In general there is no closed-form solution for these estimates and they must be determined by iterative algorithms such as EM iterations or general nonlinear optimization. Many of the intermediate calculations for such iterations have been expressed as generalized least squares problems. We show that an alternative representation as a penalized least squares problem has many advantageous computational properties, including the ability to evaluate explicitly a profiled log-likelihood or log-restricted likelihood, the gradient and Hessian of this profiled objective, and an ECME update to refine this objective.

Key words: REML, gradient, Hessian, EM algorithm, ECME algorithm, maximum likelihood, profile likelihood, multilevel models

Preprint submitted to Journal of Multivariate Analysis, 25 September 2003

1 Introduction

We will establish some results for the penalized least squares representation of a general form of a linear mixed-effects model, then show how these results specialize for particular models. The general form we consider is

  y = Xβ + Zb + ε,   ε ~ N(0, σ²I),   b ~ N(0, σ²Ω⁻¹),   ε ⊥ b   (1)

where y is the n-dimensional response vector, X is an n × p model matrix for the p-dimensional fixed-effects vector β, Z is the n × q model matrix for the q-dimensional random-effects vector b, which has a Gaussian distribution with mean 0 and relative precision matrix Ω (i.e., Ω is the precision of b relative to the precision of ε), and ε is the random noise, assumed to have a spherical Gaussian distribution. The symbol ⊥ indicates independence of random variables. We assume that X has full column rank and that Ω is positive definite. (If X is rank deficient or if Ω is singular then the model can be transformed to an alternative model that fulfills the desired conditions.)

A relative precision factor, Δ, is any q × q matrix that satisfies Ω = Δ'Δ. One possible Δ is the Cholesky factor of Ω but others can be used. Because Ω is positive definite, any Δ will be non-singular. In general Δ (and hence Ω) depend on a k-dimensional parameter vector θ. Typically Z, Ω, and Δ are very large and sparse (mostly zeros) while k, the dimension of θ, is small.

The likelihood for the parameters β, θ, and σ², given y, in model (1) is

  L(β, θ, σ² | y) = |Ω|^{1/2} / (2πσ²)^{(n+q)/2} ∫ exp( −(‖y − Xβ − Zb‖² + b'Ωb) / (2σ²) ) db.   (2)

The restricted (or residual) maximum likelihood (REML) estimates of θ and σ² optimize a related criterion that can be written

  L_R(θ, σ²) = ∫ L(β, θ, σ²) dβ.   (3)

This integral is not the typical way to derive or to justify REML estimates but, as we shall see, it provides a convenient form in which to evaluate L_R.

In the next section we show that the integrals in (2) and (3) can be expressed succinctly using the solution to a penalized least squares problem. In particular, we derive a profiled log-likelihood and a profiled log-restricted-likelihood that depend on θ only. Maximizing the profiled log-likelihood (or profiled log-restricted-likelihood) is generally a much smaller and more stable optimization problem than attempting to optimize the log-likelihood for the full parameter vector. In section 3 we derive new expressions for the gradient and the Hessian of these profiled log-likelihoods. Using these derivatives, the profiled log-likelihood can be rapidly optimized, say by Newton steps, once the neighbourhood of the optimum has been determined. Some heuristics provide reasonable starting estimates for θ but these may not be sufficiently close to the optimum to ensure stable Newton steps. The expectation-maximization (EM) algorithm is a robust algorithm that approaches the neighbourhood of the optimum quickly but tends to converge slowly once it is in the neighbourhood. In section 4 we show how the penalized least squares results provide the update for a related algorithm called expectation-conditional maximization either (ECME). This allows the starting estimates to be updated by a moderate number of ECME iterations before starting the Newton steps.
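As a concrete illustration of model (1), the following R sketch simulates one small data set. All dimensions, parameter values, and object names (n, p, q, Omega, beta, sigma) are hypothetical choices made for illustration only; the objects created here (y, X, Z, Omega) are reused in the later sketches.

```r
## Simulate from model (1): y = X beta + Z b + e, e ~ N(0, sigma^2 I),
## b ~ N(0, sigma^2 Omega^{-1}).  All values below are illustrative.
set.seed(1)
n <- 150; p <- 2; q <- 3
sigma <- 0.5
X <- cbind(1, rnorm(n))                    # n x p fixed-effects model matrix
Z <- matrix(rnorm(n * q), n, q)            # n x q random-effects model matrix
Omega <- diag(q)                           # relative precision matrix of b
beta  <- c(1, -2)
L <- t(chol(solve(Omega)))                 # L L' = Omega^{-1}
b <- drop(sigma * L %*% rnorm(q))          # one draw of the random effects
y <- drop(X %*% beta + Z %*% b + rnorm(n, sd = sigma))
```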

These results are derived for the general model (1). In sections 5 and 6 we consider several common special cases. Our implementation of some of these methods is discussed in section 7.

2 A Penalized Least-Squares Problem

For a fixed value of θ we consider the penalized least squares problem defined by the augmented model matrix Φ(θ) and the augmented response vector ỹ:

  min_{b,β} ‖ ỹ − Φ(θ) [b; β] ‖²   where   Φ(θ) = [ Z, X; Δ(θ), 0 ]   and   ỹ = [ y; 0 ].   (4)

One way to solve problem (4) is to form Φ_e = [Φ, ỹ] and let R_e be the Cholesky decomposition of Φ_e'Φ_e,

  Φ_e'Φ_e = [ Z'Z + Ω, Z'X, Z'y; X'Z, X'X, X'y; y'Z, y'X, y'y ] = R_e'R_e   where   R_e = [ R_ZZ, R_ZX, r_Zy; 0, R_XX, r_Xy; 0, 0, r_yy ].   (5)

The matrices R_ZZ and R_XX are upper triangular of dimension q × q and p × p respectively. The corresponding vectors, r_Zy and r_Xy, are of dimension q and p, and r_yy is a scalar. The conditions that Ω be positive definite and X have full column rank ensure that Φ has full column rank, and hence that R_ZZ and R_XX are nonsingular.

Representation (5) is a particular form of the mixed model equations described in Henderson (1984) [1]. We write the blocks in the opposite order from which they are typically written because of computational advantages associated with this order.
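On a small problem the decomposition (5) can be reproduced with dense linear algebra, which is convenient for checking a sparse implementation. This sketch continues the simulated objects from the previous listing; Delta is one possible value of Δ(θ), here simply the Cholesky factor of Omega, and the block names are ours.

```r
## Form Phi_e = [Phi, ytilde] and take the Cholesky factor R_e of its
## crossproduct as in (5); then pick out the blocks named in the text.
Delta <- chol(Omega)                          # t(Delta) %*% Delta == Omega
Phi   <- rbind(cbind(Z, X), cbind(Delta, matrix(0, q, p)))
ytld  <- c(y, rep(0, q))
Re    <- chol(crossprod(cbind(Phi, ytld)))    # upper triangular R_e
RZZ <- Re[1:q, 1:q];             RZX <- Re[1:q, q + (1:p)]
rZy <- Re[1:q, q + p + 1];       RXX <- Re[q + (1:p), q + (1:p)]
rXy <- Re[q + (1:p), q + p + 1]; ryy <- Re[q + p + 1, q + p + 1]
```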

Using a' = [ −b', −β', 1 ] we can write the numerator of the exponent in the integral in (2) as

  ‖y − Xβ − Zb‖² + b'Ωb = a'Φ_e'Φ_e a = a'R_e'R_e a = ‖R_e a‖²
   = ‖r_Zy − R_ZX β − R_ZZ b‖² + ‖r_Xy − R_XX β‖² + r_yy²   (6)

and a simple change of variable allows us to evaluate

  (2πσ²)^{−q/2} ∫ exp( −‖r_Zy − R_ZX β − R_ZZ b‖² / (2σ²) ) db = 1 / abs|R_ZZ| = |Z'Z + Ω|^{−1/2}.   (7)

Combining (2), (6), and (7) and taking the logarithm produces the log-likelihood ℓ(β, θ, σ²) = log L(β, θ, σ²), which, for convenience, we write in the form of a deviance

  −2ℓ(β, θ, σ²) = log( |Z'Z + Ω| / |Ω| ) + n log(2πσ²) + ( r_yy² + ‖r_Xy − R_XX β‖² ) / σ²   (8)

leading to the following results for the maximum likelihood estimates (mle's):

(1) The conditional mle of the fixed effects, β̂(θ), satisfies

  R_XX β̂(θ) = r_Xy.   (9)

(2) The conditional mle of the variance, σ̂²(θ), is r_yy²/n.

(3) The profiled log-likelihood, ℓ̃(θ), a function of θ only, is given by

  −2ℓ̃(θ) = −2ℓ( β̂(θ), θ, σ̂²(θ) ) = log( |Z'Z + Ω| / |Ω| ) + n [ 1 + log( 2π r_yy² / n ) ].   (10)

(4) The conditional expected value of b, which we write as b̂(β, θ), satisfies

  R_ZZ b̂(β, θ) = r_Zy − R_ZX β.   (11)

Typically we evaluate b̂(β̂(θ), θ), which we write as b̂(θ).

(5) The conditional distribution of b is

  b | y, β, θ, σ² ~ N( b̂(β, θ), σ² (Z'Z + Ω)⁻¹ ).   (12)
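The conditional estimates (9) and (11) and the profiled deviance (10) follow directly from those blocks by triangular solves. A minimal sketch, continuing the previous listing (the object names are ours):

```r
## Evaluate (9)-(11) and the profiled deviance (10) from the blocks of R_e.
betahat   <- backsolve(RXX, rXy)                          # (9): R_XX betahat = r_Xy
sigma2hat <- ryy^2 / n                                    # conditional MLE of sigma^2
bhat      <- drop(backsolve(RZZ, rZy - RZX %*% betahat))  # (11)
logdet.ratio <- 2 * sum(log(abs(diag(RZZ)))) -            # log|Z'Z + Omega| ...
  as.numeric(determinant(Omega)$modulus)                  # ... minus log|Omega|
prof.dev <- logdet.ratio + n * (1 + log(2 * pi * ryy^2 / n))  # -2 profiled log-lik, (10)
```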

2.1 REML results

As in (7), we can use a simple change of variable to obtain

  (2πσ²)^{−p/2} ∫ exp( −‖r_Xy − R_XX β‖² / (2σ²) ) dβ = 1 / abs|R_XX|   (13)

providing the log-restricted-likelihood, ℓ_R(θ, σ²) = log L_R(θ, σ²), as

  −2ℓ_R(θ, σ²) = log( |Z'Z + Ω| |R_XX|² / |Ω| ) + (n − p) log(2πσ²) + r_yy² / σ².   (14)

Noting that |Z'Z + Ω| |R_XX|² = |Φ'Φ| we have the following results:

(1) The conditional REML estimate of the variance, σ̂_R²(θ), is r_yy²/(n − p).

(2) The profiled log-restricted-likelihood is given by

  −2ℓ̃_R(θ) = log( |Φ'Φ| / |Ω| ) + (n − p) [ 1 + log( 2π r_yy² / (n − p) ) ].   (15)

(3) The conditional distribution of b is

  b | y, θ, σ² ~ N( b̂(θ), σ² V_b )   (16)

where V_b is the upper-left q × q submatrix of (Φ'Φ)⁻¹,

  V_b = R_ZZ⁻¹ ( I + R_ZX R_XX⁻¹ R_XX⁻ᵀ R_ZX' ) R_ZZ⁻ᵀ.   (17)

2.2 A preliminary decomposition

Because we wish to evaluate the log-likelihood or log-restricted-likelihood for many different values of θ, we form a preliminary decomposition of the crossproducts of the model matrices and the response,

  [ Z'Z, Z'X, Z'y; X'Z, X'X, X'y; y'Z, y'X, y'y ] = T'T   where   T = [ T_ZZ, T_ZX, t_Zy; 0, T_XX, t_Xy; 0, 0, t_yy ]   (18)

so that the evaluation of (5) is equivalent to forming the orthogonal-triangular decomposition

  [ Δ(θ), 0, 0; T_ZZ, T_ZX, t_Zy; 0, T_XX, t_Xy; 0, 0, t_yy ] = Q [ R_ZZ, R_ZX, r_Zy; 0, R_XX, r_Xy; 0, 0, r_yy; 0, 0, 0 ].   (19)

To ensure well-defined estimates for the parameters we should have n ≥ q + p, which means that (18) produces a reduction, sometimes a substantial reduction, in the amount of data that must be stored and manipulated for each evaluation of the log-likelihood.
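A sketch of how (18) and (19) split the work on the dense simulated example: the triangular factor T is formed once from the data, and for each new θ only a small QR decomposition of the Δ(θ)-augmented matrix is needed. The helper eval.Re below is a hypothetical name introduced for this illustration.

```r
## Preliminary decomposition (18): one (q+p+1) x (q+p+1) triangular factor T.
TT <- chol(crossprod(cbind(Z, X, y)))
## Re-evaluation (19) for a new theta: append Delta(theta) and take a QR.
eval.Re <- function(Delta) {
  aug <- rbind(cbind(Delta, matrix(0, nrow(Delta), ncol(TT) - nrow(Delta))), TT)
  qr.R(qr(aug))            # equals R_e of (5) up to sign changes of its rows
}
Re2 <- eval.Re(chol(Omega))
```

The sign ambiguity in the QR factor is harmless here because (10) and (15) involve only the absolute values of the diagonal of R_ZZ and R_XX and the square r_yy².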

Furthermore, the matrices Z'Z and T_ZZ are sparse and we store and manipulate these matrices taking the sparsity into account (see sections 5 and 6 for specific examples). We will assume that Z'X, T_ZX, X'X, and T_XX are dense and are stored accordingly.

Equation (19) shows that t_yy² is a lower bound for r_yy². Similarly, the residual sum of squares from regressing y on X is an upper bound on r_yy². Thus the terms in (10) and (15) involving r_yy² are bounded. The ratio |Z'Z + Ω| / |Ω| is bounded below by unity (and approaches this bound as Ω⁻¹ → 0) so the profiled deviance (10) is bounded below by n[ 1 + log( 2π t_yy²/n ) ] and the profiled restricted deviance (15) is bounded below by (n − p)[ 1 + log( 2π t_yy²/(n − p) ) ].

As Ω approaches singularity (say the minimum eigenvalue of Ω approaches zero while the other eigenvalues are bounded above), |Z'Z + Ω| / |Ω| → ∞. Thus we know that the ML or REML estimates will not occur on the boundary of the set of positive definite Ω. It is possible that finite ML or REML estimates of Ω will not exist. The minimum profiled deviance may correspond to an infinite precision (unbounded Ω), which is to say that the ML or REML estimates of σ²Ω⁻¹, the variance-covariance of b, are singular.

3 Derivatives of the profiled log-likelihood

Expressions (10) and (15) provide extremely efficient ways to determine ML or REML estimates for a linear mixed-effects model because we can optimize these expressions as a function of θ only, instead of as a function of the complete parameter vector (θ, β, σ²). The reduction in the dimension of the parameter over which we are optimizing helps to improve the performance and reliability of numerical optimization routines.

Another way in which we can improve performance and reliability in numerical optimization is by providing analytic derivatives of the objective, which we do using results from Golub and Pereyra [2].

As in [2] we will use the operator D to indicate the Frechet derivative of an array or a scalar. For example, D Φ(θ) is an array of dimension (n + q) × (q + p) × k, which we treat as k matrices of size (n + q) × (q + p) when writing expressions involving matrices and arrays. When we need to indicate partial derivatives

with respect to particular parameters we will use the notation D_i Φ(θ) for the (n + q) × (q + p) matrix ∂Φ/∂θ_i.

Golub and Pereyra [2] provide derivatives of: the projection orthogonal to the column space of Φ, which, because Φ'Φ is nonsingular, we can write P = I − Φ(Φ'Φ)⁻¹Φ'; the pseudo-inverse of Φ, which is Φ⁺ = (Φ'Φ)⁻¹Φ'; and the residual sum of squares, r_yy²(θ) = ‖Pỹ‖². These derivatives are

  D P = −P DΦ Φ⁺ − ( P DΦ Φ⁺ )',   (20)
  ½ D r_yy²(θ) = −ỹ' P DΦ Φ⁺ ỹ,   (21)
  D Φ⁺ = −Φ⁺ DΦ Φ⁺ + Φ⁺ (Φ⁺)' DΦ' P.   (22)

(Equation (22) is derived from equation (4.2) in [2], which has another term. However, that term is identically zero when Φ has full column rank.)

For A a square, nonsingular matrix we have

  D( log|A(θ)| ) = tr[ D(A) A⁻¹ ]   (23)
  D( A⁻¹ ) = −A⁻¹ D(A) A⁻¹   (24)

where tr denotes the trace of a matrix. Notice that (24) is a special case of (22).
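Identity (23) is easy to check numerically. The sketch below compares a central finite difference of log|A(θ)| with tr[D(A)A⁻¹] for the illustrative one-parameter family A(θ) = Z'Z + θΩ built from the simulated objects above; the family itself is an assumption made for the check, not something used in the paper.

```r
## Finite-difference check of (23): d/dtheta log|A(theta)| = tr(DA %*% solve(A)).
A.fun <- function(theta) crossprod(Z) + theta * Omega
DA    <- Omega                                   # dA/dtheta for this family
th <- 2; h <- 1e-6
fd <- (as.numeric(determinant(A.fun(th + h))$modulus) -
       as.numeric(determinant(A.fun(th - h))$modulus)) / (2 * h)
an <- sum(diag(DA %*% solve(A.fun(th))))         # tr(DA A^{-1})
all.equal(fd, an, tolerance = 1e-4)              # should be TRUE
```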

In Appendix A we show that the gradient and Hessian terms for the penalized residual sum-of-squares, r_yy²(θ), are

  D r_yy²(θ) = b̂' DΩ b̂ = tr( DΩ b̂b̂' )   (25)
  D_j D_i r_yy²(θ) = b̂' [ D_j D_i Ω − 2 D_jΩ V_b D_iΩ ] b̂   (26)

providing the gradients

  D( −2ℓ̃ ) = tr{ DΩ [ (Z'Z + Ω)⁻¹ − Ω⁻¹ + (b̂/σ̂)(b̂/σ̂)' ] }   (27)
  D( −2ℓ̃_R ) = tr{ DΩ [ V_b − Ω⁻¹ + (b̂/σ̂_R)(b̂/σ̂_R)' ] }   (28)

and Hessian terms of the form

  D_j D_i ( −2ℓ̃ ) = tr{ D_j D_i Ω [ (Z'Z + Ω)⁻¹ − Ω⁻¹ + (b̂/σ̂)(b̂/σ̂)' ] }
   − tr[ D_jΩ (Z'Z + Ω)⁻¹ D_iΩ (Z'Z + Ω)⁻¹ ] + tr( D_jΩ Ω⁻¹ D_iΩ Ω⁻¹ )
   − 2 (b̂/σ̂)' D_jΩ V_b D_iΩ (b̂/σ̂) − (1/n) [ (b̂/σ̂)' D_jΩ (b̂/σ̂) ] [ (b̂/σ̂)' D_iΩ (b̂/σ̂) ]   (29)

  D_j D_i ( −2ℓ̃_R ) = tr{ D_j D_i Ω [ V_b − Ω⁻¹ + (b̂/σ̂_R)(b̂/σ̂_R)' ] }
   − tr[ D_jΩ V_b D_iΩ V_b ] + tr( D_jΩ Ω⁻¹ D_iΩ Ω⁻¹ )
   − 2 (b̂/σ̂_R)' D_jΩ V_b D_iΩ (b̂/σ̂_R) − (1/(n − p)) [ (b̂/σ̂_R)' D_jΩ (b̂/σ̂_R) ] [ (b̂/σ̂_R)' D_iΩ (b̂/σ̂_R) ].   (30)

4 An ECME algorithm

The EM algorithm [3] is a general iterative algorithm for computing maximum likelihood estimates in the presence of missing data. For linear mixed-effects models we formulate an EM algorithm by considering the random effects b to be unobserved data. In the terminology of the EM algorithm, we call the observed data, y, the incomplete data, and y augmented by b, the complete data. The EM algorithm has two steps: in the E step we compute Q, the expected log-likelihood (or deviance) for the complete data, and, in the M step, we maximize the expected log-likelihood (or minimize the expected deviance) with respect to the parameters in the model.

Liu and Rubin [4] derived the EM algorithm for linear mixed-effects models using b as the missing data. In the same paper they introduced expectation conditional maximization either (ECME) algorithms, which are an extension of the EM algorithm. In ECME algorithms the M step is broken down into a number of conditional maximization steps and in each conditional maximization step either the original log-likelihood, ℓ, or its conditional expectation, Q, is maximized. The maximization in each step is done by placing constraints on the parameters in such a way that the collection of all the maximization steps

is with respect to the full parameter space.

In describing an EM algorithm we must distinguish between current values of parameters and updated values. We denote the current values of the parameters by β₀, σ₀² and θ₀. These are either starting values or values obtained from the last E and M steps. The parameter estimates to be obtained after an E and an M step are β₁, σ₁² and θ₁. The log-likelihood for the complete data is

  −2ℓ(β, σ², θ | y, b) = (n + q) log(2πσ²) − log|Ω| + ( ‖y − Xβ − Zb‖² + b'Ωb ) / σ².   (31)

Because we can easily calculate σ̂²(θ) and β̂(θ) we define an update step in an ECME algorithm to be:

(1) Given θ₀, set β₁ = β̂(θ₀) and σ₁² = σ̂²(θ₀). Then the conditional distribution of b is N( b̂(θ₀), σ₁² (Z'Z + Ω(θ₀))⁻¹ ).

(2) Choose θ₁ to minimize the conditional expectation of −2ℓ,

  Q(θ | y, σ₁², β₁, θ₀) = E_{b|θ₀}[ −2ℓ(β₁, σ₁², θ | y, b) ]
   = E_{b|θ₀}[ c − log|Ω| + b'Ωb/σ₁² ]
   = c − log|Ω| + b̂'Ωb̂/σ₁² + tr[ Ω (Z'Z + Ω(θ₀))⁻¹ ].   (32)

Thus θ₁ satisfies

  ∇_θ Q = tr{ DΩ [ (b̂(θ₀)/σ₁)(b̂(θ₀)/σ₁)' + (Z'Z + Ω(θ₀))⁻¹ − Ω(θ₁)⁻¹ ] } = 0.   (33)

A similar derivation shows that, for the REML criterion, θ₁ satisfies

  ∇_θ Q_R = tr{ DΩ [ (b̂(θ₀)/σ̂_R)(b̂(θ₀)/σ̂_R)' + V_b(θ₀) − Ω(θ₁)⁻¹ ] } = 0.   (34)

From the similarity of (33) to (27) (and of (34) to (28)) we can see that a stationary point of this ECME algorithm will be a critical value of the profiled log-likelihood.

5 Computational methods for a single grouping factor

The results in the previous sections provide concise expressions for the profiled log-likelihood (10) and the profiled log-restricted-likelihood (15); the ECME

increments (33 or 34); and the gradient (27 or 28) and Hessian (29 or 30) of the profiled objective functions. All these expressions depend on being able to evaluate the initial decomposition (18) and, for several different θ, the decomposition (19), which can be a formidable computational problem because the matrices Z and Ω can be very large. However, these matrices generally are sparse and, by exploiting the sparsity, we can provide computationally feasible methods for all these results.

The sparsity in Z and Ω occurs when the random effects vector b is divided into small components associated with one or more factors that group the observations. In the simplest situation there is one grouping factor, or one set of experimental units, and the model can be written

  y_i = X_i β + Z_i b_i + ε_i,   b_i ~ N(0, σ²Ω₁⁻¹),   ε_i ~ N(0, σ²I),   i = 1, ..., m₁,
  ε_i ⊥ ε_j,  b_i ⊥ b_j,  i ≠ j;   ε_i ⊥ b_j,  all i, j   (35)

where y_i is the vector of length n_i of responses for unit i; X_i is the n_i × p fixed-effects model matrix for unit i; and Z_i is the n_i × q₁ model matrix for unit i and the random effects b_i corresponding to that unit. Because we only have one grouping factor in this model we say that we have one level of random effects. However, this terminology is not universal. In particular, this model is called the two-level model in the multilevel modeling literature (e.g. [5]) because it has two levels of random variation.

To convert model (35) to the form (1) we would set

  b = [ b₁; b₂; ...; b_{m₁} ],   y = [ y₁; y₂; ...; y_{m₁} ],   X = [ X₁; X₂; ...; X_{m₁} ],   Z = diag( Z₁, Z₂, ..., Z_{m₁} ),   (36)

and

  Ω = diag( Ω₁, Ω₁, ..., Ω₁ ) = I_{m₁} ⊗ Ω₁   (37)

where I_{m₁} is the m₁ × m₁ identity matrix and ⊗ is the Kronecker product. That is, Ω is a block-diagonal matrix whose diagonal is m₁ copies of Ω₁. Similarly, a relative precision factor is Δ = I_{m₁} ⊗ Δ₁ where Δ₁ is any q₁ × q₁ matrix satisfying Δ₁'Δ₁ = Ω₁.
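The Kronecker structure (37) is easy to verify directly for a toy case; all dimensions and the matrix Omega1 below are hypothetical values chosen for illustration.

```r
## Check the structure (37): Omega = I_{m1} (x) Omega1, Delta = I_{m1} (x) Delta1.
m1 <- 4; q1 <- 2
Omega1 <- matrix(c(2, 0.5, 0.5, 1), q1, q1)     # a small positive-definite Omega1
Delta1 <- chol(Omega1)                          # t(Delta1) %*% Delta1 == Omega1
Omega.big <- kronecker(diag(m1), Omega1)        # block diagonal: m1 copies of Omega1
Delta.big <- kronecker(diag(m1), Delta1)
all.equal(crossprod(Delta.big), Omega.big)      # TRUE: Delta'Delta = Omega
```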

The calculation of R_XX, of

  R_ZZ = diag( R_ZZ(1), R_ZZ(2), ..., R_ZZ(m₁) ),   R_ZX = [ R_ZX(1); R_ZX(2); ...; R_ZX(m₁) ],   (38)

and of r_Zy = [ r_Zy(1)', r_Zy(2)', ..., r_Zy(m₁)' ]' can be performed in blocks, using a series of QR decompositions based on the corresponding blocks in T,

  [ Δ₁; T_ZZ(i) ] = Q_i [ R_ZZ(i); 0 ]   followed by   Q_i' [ 0, 0; T_ZX(i), t_Zy(i) ] = [ R_ZX(i), r_Zy(i); R_XX(i), r_Xy(i) ],   (39)

and

  [ R_XX(1), r_Xy(1); ...; R_XX(m₁), r_Xy(m₁); T_XX, t_Xy; 0, t_yy ] = Q [ R_XX, r_Xy; 0, r_yy; 0, 0 ].   (40)

Noting that |Ω| = |Ω₁|^{m₁} = |Δ₁|^{2m₁} and that |Z'Z + Ω| = ∏_{i=1}^{m₁} |R_ZZ(i)|², we can evaluate the profiled log-likelihood (10) or the profiled log-restricted-likelihood (15).
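The per-group step in (39) amounts to a small QR decomposition for each level of the grouping factor. The function below is a hypothetical helper, written for this illustration, that operates on group-level blocks T_ZZ(i), T_ZX(i), t_Zy(i) of the preliminary decomposition; its rest component is what gets stacked into the final QR (40).

```r
## One block of (39): append Delta1 above T_ZZ(i), take its QR, and apply Q_i'
## to the remaining columns of that group's rows.
group.step <- function(TZZ.i, TZX.i, tZy.i, Delta1) {
  q1    <- nrow(Delta1)
  qr.i  <- qr(rbind(Delta1, TZZ.i))                  # left-hand QR in (39)
  rot   <- qr.qty(qr.i, rbind(matrix(0, q1, ncol(TZX.i) + 1),
                              cbind(TZX.i, tZy.i)))  # Q_i' [0 0; T_ZX(i) t_Zy(i)]
  list(RZZ  = qr.R(qr.i),                            # R_ZZ(i), q1 x q1 upper triangular
       RZX  = rot[1:q1, -ncol(rot), drop = FALSE],   # R_ZX(i)
       rZy  = rot[1:q1, ncol(rot)],                  # r_Zy(i)
       rest = rot[-(1:q1), , drop = FALSE])          # [R_XX(i) r_Xy(i)], feeds (40)
}
```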

To evaluate the ECME increments we note that tr[ D_iΩ Ω⁻¹ ] = m₁ tr[ D_iΩ₁ Ω₁⁻¹ ] = m₁ tr[ Δ₁⁻ᵀ D_iΩ₁ Δ₁⁻¹ ] and that

  tr{ D_iΩ [ (b̂(θ₀)/σ̂)(b̂(θ₀)/σ̂)' + (Z'Z + Ω(θ₀))⁻¹ ] } = tr[ A D_iΩ₁ A' ]
  tr{ D_iΩ [ (b̂(θ₀)/σ̂_R)(b̂(θ₀)/σ̂_R)' + V_b ] } = tr[ A_R D_iΩ₁ A_R' ]   (41)

where the matrices A and A_R are obtained from the QR decompositions

  [ b̂₁'/σ̂; R_ZZ(1)⁻ᵀ; ...; b̂_{m₁}'/σ̂; R_ZZ(m₁)⁻ᵀ ] = U [ A; 0 ]
  [ b̂₁'/σ̂_R; R_ZZ(1)⁻ᵀ; R_XX⁻ᵀ R_ZX(1)' R_ZZ(1)⁻ᵀ; ...; b̂_{m₁}'/σ̂_R; R_ZZ(m₁)⁻ᵀ; R_XX⁻ᵀ R_ZX(m₁)' R_ZZ(m₁)⁻ᵀ ] = U_R [ A_R; 0 ].   (42)

The matrix Ω₁ must be positive definite and symmetric. If no further structure is imposed on it then k = q₁(q₁ + 1)/2 and a suitable value of Δ₁(θ₁) can be calculated as Δ₁ = √m₁ A⁻ᵀ for ML estimation and Δ₁ = √m₁ A_R⁻ᵀ for REML, so that Ω₁(θ₁) = m₁ (A'A)⁻¹ or m₁ (A_R'A_R)⁻¹. If further structure is imposed on Ω₁(θ), so that k < q₁(q₁ + 1)/2, then the gradient equations, (33) or (34), must be solved for the ECME update.

Evaluation of the gradients (27 or 28) of the profiled objective functions can be simplified in the same way as the ECME increment is, as can the first term in the Hessian (29 or 30). The other terms in the Hessian can be simplified in various ways. Because D_iΩ = I_{m₁} ⊗ D_iΩ₁ and Ω⁻¹ = I_{m₁} ⊗ Ω₁⁻¹,

  tr( D_jΩ Ω⁻¹ D_iΩ Ω⁻¹ ) = m₁ tr( D_jΩ₁ Ω₁⁻¹ D_iΩ₁ Ω₁⁻¹ ).   (43)

From the vectors c_{li}, d_{li}, and g_{li}, l = 1, ..., m₁; i = 1, ..., k, defined as

  c_{li} = D_iΩ₁ b̂_l,   d_{li} = R_ZZ(l)⁻ᵀ c_{li},   g_{li} = R_XX⁻ᵀ R_ZX(l)' d_{li},   (44)

we can evaluate

  ( b̂' D_jΩ b̂ ) ( b̂' D_iΩ b̂ ) = ( Σ_{l=1}^{m₁} b̂_l' c_{lj} ) ( Σ_{l=1}^{m₁} b̂_l' c_{li} )
  b̂' D_jΩ V_b D_iΩ b̂ = Σ_{l=1}^{m₁} d_{lj}' d_{li} + ( Σ_{l=1}^{m₁} g_{lj} )' ( Σ_{l=1}^{m₁} g_{li} ).   (45)
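For an unstructured Ω₁ the ML form of the ECME update described above has a closed form: accumulate S = A'A of (42) group by group and set Ω₁(θ₁) = m₁ S⁻¹. The sketch below assumes hypothetical lists bhat.list and RZZ.list holding the b̂_l vectors and the R_ZZ(l) blocks; it is an illustration of the update, not the package implementation.

```r
## Closed-form ML ECME update for an unstructured Omega1 (sketch).
ecme.update <- function(bhat.list, RZZ.list, sigma2) {
  m1 <- length(bhat.list)
  q1 <- length(bhat.list[[1]])
  S  <- matrix(0, q1, q1)
  for (l in seq_len(m1)) {
    Rinv <- backsolve(RZZ.list[[l]], diag(q1))          # R_ZZ(l)^{-1}
    S <- S + tcrossprod(bhat.list[[l]]) / sigma2 +      # bhat_l bhat_l' / sigma^2
             tcrossprod(Rinv)                           # R_ZZ(l)^{-1} R_ZZ(l)^{-T}
  }
  m1 * solve(S)          # new Omega1; a new Delta1 is then chol() of this value
}
```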

The remaining terms in the Hessian expressions are evaluated as

  tr[ D_jΩ (Z'Z + Ω)⁻¹ D_iΩ (Z'Z + Ω)⁻¹ ] = Σ_{l=1}^{m₁} tr[ R_ZZ(l)⁻ᵀ D_jΩ₁ R_ZZ(l)⁻¹ R_ZZ(l)⁻ᵀ D_iΩ₁ R_ZZ(l)⁻¹ ]   (46)

and

  tr[ D_jΩ V_b D_iΩ V_b ] = Σ_{l=1}^{m₁} tr[ R_ZZ(l)⁻ᵀ D_jΩ₁ R_ZZ(l)⁻¹ R_ZZ(l)⁻ᵀ D_iΩ₁ R_ZZ(l)⁻¹ ]
   + 2 Σ_{l=1}^{m₁} tr[ R_ZZ(l)⁻ᵀ D_jΩ₁ R_ZZ(l)⁻¹ R_ZX(l) R_XX⁻¹ R_XX⁻ᵀ R_ZX(l)' R_ZZ(l)⁻ᵀ D_iΩ₁ R_ZZ(l)⁻¹ ]
   + tr[ ( Σ_{l=1}^{m₁} R_XX⁻ᵀ R_ZX(l)' R_ZZ(l)⁻ᵀ D_jΩ₁ R_ZZ(l)⁻¹ R_ZX(l) R_XX⁻¹ ) ( Σ_{l=1}^{m₁} R_XX⁻ᵀ R_ZX(l)' R_ZZ(l)⁻ᵀ D_iΩ₁ R_ZZ(l)⁻¹ R_ZX(l) R_XX⁻¹ ) ].   (47)

By working with the components of Ω₁ instead of the whole matrix Ω we save a considerable amount of storage and computation because q₁ is typically very small (values of one or two are common) while m₁ can be very large. We have worked with cases where m₁ is in the millions. In most cases p > q₁ but it is unusual for p to exceed, say, one hundred. After the pre-decomposition the storage required is approximately m₁ q₁ (q₁ + p) locations and the number of floating point operations (FLOPs) per iteration for the function evaluation is on the order of m₁ q₁³ for the decompositions and m₁ p q₁² for the multiplications in (39), followed by m₁ q₁ p² for (40). Evaluation of the determinants of R_ZZ(i), i = 1, ..., m₁, and R_XX is trivial because these matrices are triangular. The dominant term in the FLOP count is m₁ q₁ (q₁ + p)². This is also the order of the computation for the ECME update, the gradient calculation and the Hessian calculation.

We can simplify some of the expressions for the Hessian by noting that, for arbitrary q₁ × q₁ matrices U and V,

  tr[ D_iΩ₁ U D_jΩ₁ V ] = Σ_{k₁ k₂ k₃ k₄} d^{(i)}_{k₁k₂} d^{(j)}_{k₃k₄} tr[ e_{k₁} e_{k₂}' U e_{k₃} e_{k₄}' V ]
   = Σ_{k₁ k₂ k₃ k₄} d^{(i)}_{k₁k₂} d^{(j)}_{k₃k₄} ( e_{k₂}' U e_{k₃} ) ( e_{k₄}' V e_{k₁} )
   = Σ_{k₁ k₂ k₃ k₄} d^{(i)}_{k₁k₂} d^{(j)}_{k₃k₄} u_{k₂k₃} v_{k₄k₁}

where d^{(i)}_{jk} is the (j, k)th element of D_iΩ₁, e_k is the k-th column of the identity matrix of size q₁ × q₁, and u_{ij} and v_{ij} are the (i, j)th elements of U and V, respectively. For the one-level model, all the terms involved in the Hessian can be rewritten using this principle and so the full Hessian can be written as

  Σ_{k₁ k₂ k₃ k₄} d^{(i)}_{k₁k₂} d^{(j)}_{k₃k₄} H_{k₁k₂k₃k₄}

where H is an array of size q₁ × q₁ × q₁ × q₁.

6 Models with multiple grouping factors

We will consider models with multiple grouping factors where the random effects associated with each grouping factor are independent between groups and are i.i.d. within groups. That is, the matrix Ω consists of s blocks of the form I_{m_j} ⊗ Ω_j, j = 1, ..., s. We will give details for s = 2, in which case

  Ω = [ I_{m₁} ⊗ Ω₁, 0; 0, I_{m₂} ⊗ Ω₂ ].   (48)

Extensions to more than two grouping factors follow naturally.

The block-diagonal structure in (48) will also be present in the factor, Δ, and in all the derivatives with respect to components of θ. That is, Δ consists of s blocks of the form I_{m_j} ⊗ Δ_j, j = 1, ..., s, where Δ_j is of size q_j × q_j. We assume that each component of θ determines only one block in Ω and we designate the block associated with component j as b(j) where 1 ≤ b(j) ≤ s, j = 1, ..., k. Then D_jΩ is zero except for the b(j)th diagonal block, which is of the form I_{m_{b(j)}} ⊗ D_jΩ_{b(j)}. Notice that the second derivative, D_jD_iΩ, will be zero if b(i) ≠ b(j).

We divide the random effects vector b into s = 2 blocks and subdivide the jth block into m_j components of length q_j denoted by b_{ji}, j = 1, ..., s; i = 1, ..., m_j. If Z'Z is split into blocks corresponding to the blocks in Ω then the diagonal blocks in Z'Z are themselves block diagonal but, unlike the situation with a single grouping factor, Z'Z and T_ZZ will have non-zero off-diagonal blocks. These off-diagonal blocks can be sparse or dense according to whether the grouping factors are nested or crossed. We distinguish three cases: completely crossed, partially crossed, and strictly nested.

Completely crossed grouping factors usually occur in designed experiments. For example, biological assays are often conducted by measuring the optical density of liquid samples in wells arranged in a grid on a plate. A common arrangement is 96 wells in a grid of 8 rows by 12 columns. If we assigned one set of random effects to the rows and another set of random effects to the columns then each row would occur with each column, resulting in completely crossed random effects.

An example of nested grouping factors would be a longitudinal study, say records of annual achievement test scores, of students in several schools. For the student factor to be strictly nested within the school factor we require that each student attend only one school during the period of the study. In most large studies this will not be the case. We expect some students will attend more than one school but we do not expect every student to attend every school. In such a case the grouping factors are neither strictly nested nor completely crossed. We describe this situation as partially crossed grouping factors.

If we divide Z = [ Z₁, Z₂ ] according to the two levels of random effects and correspondingly divide

  T_ZZ = [ T_ZZ11, T_ZZ12; 0, T_ZZ22 ]   and   R_ZZ = [ R_ZZ11, R_ZZ12; 0, R_ZZ22 ]   (49)

then the block-diagonal structure of Z₁'Z₁ is also present in T_ZZ11 and we can store and manipulate it accordingly. Even though Z₂'Z₂ is block diagonal, there can be non-zero off-diagonals induced in T_ZZ22 from T_ZZ12 when taking the Cholesky decomposition. This is called fill-in [6].

If we consider the matrix T_ZZ12 to consist of m₁ × m₂ blocks of size q₁ × q₂ then the (i, j)th such block will necessarily be zero if the ith level of the first grouping factor does not occur in combination with the jth level of the second grouping factor. For strictly nested grouping factors there will be only one non-zero block in each of the m₁ sets of q₁ adjacent rows. Furthermore, this pattern guarantees that there will be no fill-in of T_ZZ22 and that it can be stored as a block-diagonal matrix consisting of m₂ blocks of size q₂ × q₂. The matrix R_ZZ has the same pattern of sparsity so T_ZZ and R_ZZ only require storage of size (m₁q₁ + m₂q₂)(q₁ + q₂ + p) and computing R_ZZ is of order (m₁q₁ + m₂q₂)(q₁ + q₂ + p)² FLOPs. The same sparsity pattern will be present in R_ZZ⁻¹, which we can calculate explicitly, allowing for the evaluation of all the formulas for the ECME update, the gradient, and the Hessian. These results generalize to an arbitrary number of strictly nested levels of grouping factors (and are

incorporated this way in our software). For fully crossed grouping factors T_ZZ12 and T_ZZ22 are both dense and would need to be stored and manipulated as such. Because we can condense the calculation for the first group of random effects only, we choose the order of the grouping factors so that m₁q₁ ≥ m₂q₂. Once the order of the grouping factors is established then the calculations involving T_ZZ12 and T_ZZ22 are essentially the same as those involving T_ZX and T_XX except that producing R_ZZ22 from T_ZZ22 also involves Δ₂. The amount of storage required for fully crossed grouping factors is on the order of (m₁q₁ + m₂q₂)(q₁ + m₂q₂ + p) locations and producing R_ZZ requires on the order of (m₁q₁ + m₂q₂)(q₁ + m₂q₂ + p)² FLOPs, which, obviously, can be considerably more than (m₁q₁ + m₂q₂)(q₁ + q₂ + p)². However, fully crossed grouping factors result in n ≥ m₁m₂ and usually come from designed experiments so we do not expect m₂q₂ to be extremely large.

The most interesting case is partially crossed grouping factors where T_ZZ12 is sparse but does not obey the pattern that there is only one non-zero q₁ × q₂ block in each of the m₁ sets of q₁ adjacent rows. Having more than one non-zero block in such a set of rows (say because a student attended more than one school during the course of the study) does not generate fill-in in T_ZZ12 but does generate fill-in in T_ZZ22 and in R_ZZ22. If the extent of the crossing is moderate (i.e. an individual student may attend more than one school during the study but no student attends a large proportion of all the schools in the study, so that most of the q₁ × q₂ blocks in any one of the m₁ sets of q₁ adjacent rows are zero), then it will be advantageous to use sparse matrix representations of T_ZZ12 (and R_ZZ12) and to generate R_ZZ22 using methods for the Cholesky decomposition of sparse semi-definite matrices. This particular calculation has been extensively studied because it is important in the implementation of interior-point methods in mathematical programming (Wright (1997) [6]).

A critical part of algorithms for the Cholesky decomposition of sparse semidefinite matrices is choosing an ordering of the columns of Z₂ so as to minimize fill-in in R_ZZ22. This only needs to be done once and can be based on the pattern of crossing of the grouping factors. The amount of storage and computation required to generate R_ZZ for partially crossed grouping factors will fall between that for strictly nested grouping factors and that for fully crossed grouping factors. It is unlikely that R_ZZ can be inverted in place when the grouping factors are partially crossed. We expect that it will be feasible to evaluate the objective function, the ECME update, and the gradient but that it may not be feasible to evaluate the Hessian.
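The distinction between the three cases can be read off the incidence pattern of the two grouping factors, which is also what determines the sparsity of T_ZZ12. The helper below is hypothetical, written for this illustration only.

```r
## Classify a pair of grouping factors from their incidence pattern.
crossing.type <- function(f1, f2) {
  pat <- table(f1, f2) > 0          # which (level of f1, level of f2) pairs occur
  if (all(rowSums(pat) == 1)) {
    "strictly nested: one non-zero q1 x q2 block per set of q1 rows, no fill-in"
  } else if (all(pat)) {
    "completely crossed: T_ZZ12 is dense"
  } else {
    "partially crossed: T_ZZ12 is sparse but fill-in occurs in R_ZZ22"
  }
}
## e.g. crossing.type(student, school) for the longitudinal example above
```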

7 Implementation

We have implemented the computational methods for evaluating the profiled objective function (either log-likelihood or log-restricted-likelihood), the ECME increment and the gradient in the lme4 package for R [7]. Some comparisons of the speed and stability of this implementation versus our previous implementation in the nlme package for R are available in DebRoy (2003) [8]. Future versions of this software will allow for crossed random effects and will provide the Hessian, at least for the completely crossed and strictly nested cases.
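As a usage note (not taken from the paper), the model-fitting interface of the current lme4 package looks like the sketch below. The sleepstudy data set ships with the package; the interface has evolved since the implementation described here, so treat this as an illustration rather than the paper's own example.

```r
## Fit a one-grouping-factor model by REML and by ML with lme4's lmer().
library(lme4)
fm.reml <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
fm.ml   <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy, REML = FALSE)
## beta and sigma^2 are profiled out internally, as in (10) and (15).
```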

8 Acknowledgements

This work was supported by U.S. Army Medical Research and Materiel Command under Contract No. DAMD7-02-C-09. The views, opinions and/or findings contained in this report are those of the authors and should not be construed as an official Department of the Army position, policy or decision unless so designated by other documentation. We thank Deepayan Sarkar for helpful discussions.

A Differentiating the penalized RSS

Using (21) and the relationships

  D Φ = [ 0, 0; DΔ, 0 ],   Φ⁺ỹ = [ b̂; β̂ ],   and   P ỹ = [ y − Xβ̂ − Zb̂; −Δb̂ ]   (A.1)

we obtain

  D r_yy²(θ) = 2 b̂' Δ' DΔ b̂ = b̂' DΩ b̂.   (A.2)

The (j, i)th element of the Hessian of r_yy²(θ), namely D_jD_i r_yy², is obtained by differentiating (21) again:

  ½ D_j D_i r_yy² = −D_j( ỹ' P D_iΦ Φ⁺ ỹ )
   = −ỹ'( D_jP D_iΦ Φ⁺ + P D_jD_iΦ Φ⁺ + P D_iΦ D_jΦ⁺ ) ỹ
   = ỹ'[ P D_jΦ Φ⁺ + (Φ⁺)' D_jΦ' P ] D_iΦ Φ⁺ ỹ + b̂' Δ' (D_jD_iΔ) b̂ + ỹ' P D_iΦ [ Φ⁺ D_jΦ Φ⁺ − Φ⁺ (Φ⁺)' D_jΦ' P ] ỹ.   (A.3)

Writing

  w_i = D_iΦ Φ⁺ ỹ = [ 0; D_iΔ b̂ ],   (A.4)
  u_i = (R')⁻¹ Φ' D_iΦ Φ⁺ ỹ = (R')⁻¹ [ Δ' D_iΔ b̂; 0 ],   (A.5)
  v_i = −(R')⁻¹ D_iΦ' P ỹ = (R')⁻¹ [ D_iΔ' Δ b̂; 0 ],   (A.6)

where R is the Cholesky factor of Φ'Φ = R'R, expression (A.3) becomes

  ½ D_j D_i r_yy²(θ) = b̂' Δ' (D_jD_iΔ) b̂ + w_j' w_i − (u_j + v_j)'(u_i + v_i).   (A.7)

Noting that u_i + v_i = (R')⁻¹ [ D_iΩ b̂; 0 ], we have

  D_j D_i r_yy²(θ) = b̂' [ D_jD_iΩ − D_jΩ V_b D_iΩ − D_iΩ V_b D_jΩ ] b̂.   (A.8)

References

[1] C. Henderson, Applications of Linear Models in Animal Breeding, University of Guelph, 1984.

[2] G. H. Golub, V. Pereyra, The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate, SIAM Journal on Numerical Analysis 10 (2) (1973).

[3] A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B 39 (1977) 1–22.

[4] C. Liu, D. B. Rubin, The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence, Biometrika 81 (1994).

[5] S. W. Raudenbush, A. S. Bryk, Hierarchical Linear Models: Applications and Data Analysis Methods, 2nd Edition, Sage, 2002.

[6] S. J. Wright, Primal-Dual Interior-Point Methods, SIAM, 1997.

[7] R. Ihaka, R. Gentleman, R: A language for data analysis and graphics, Journal of Computational and Graphical Statistics 5 (3) (1996).

[8] S. DebRoy, Computational methods for mixed-effects models, Ph.D. thesis, University of Wisconsin–Madison (2003).


More information

Regression. Oscar García

Regression. Oscar García Regression Oscar García Regression methods are fundamental in Forest Mensuration For a more concise and general presentation, we shall first review some matrix concepts 1 Matrices An order n m matrix is

More information

Homework 5. Convex Optimization /36-725

Homework 5. Convex Optimization /36-725 Homework 5 Convex Optimization 10-725/36-725 Due Tuesday November 22 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

The Conjugate Gradient Method

The Conjugate Gradient Method The Conjugate Gradient Method Classical Iterations We have a problem, We assume that the matrix comes from a discretization of a PDE. The best and most popular model problem is, The matrix will be as large

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Multivariate Gaussians Mark Schmidt University of British Columbia Winter 2019 Last Time: Multivariate Gaussian http://personal.kenyon.edu/hartlaub/mellonproject/bivariate2.html

More information

Likelihood Methods. 1 Likelihood Functions. The multivariate normal distribution likelihood function is

Likelihood Methods. 1 Likelihood Functions. The multivariate normal distribution likelihood function is Likelihood Methods 1 Likelihood Functions The multivariate normal distribution likelihood function is The log of the likelihood, say L 1 is Ly = π.5n V.5 exp.5y Xb V 1 y Xb. L 1 = 0.5[N lnπ + ln V +y Xb

More information

Fast Algorithms for SDPs derived from the Kalman-Yakubovich-Popov Lemma

Fast Algorithms for SDPs derived from the Kalman-Yakubovich-Popov Lemma Fast Algorithms for SDPs derived from the Kalman-Yakubovich-Popov Lemma Venkataramanan (Ragu) Balakrishnan School of ECE, Purdue University 8 September 2003 European Union RTN Summer School on Multi-Agent

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Maximum Likelihood Estimation Prof. C. F. Jeff Wu ISyE 8813 Section 1 Motivation What is parameter estimation? A modeler proposes a model M(θ) for explaining some observed phenomenon θ are the parameters

More information

Lecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices

Lecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices Lecture 3: Simple Linear Regression in Matrix Format To move beyond simple regression we need to use matrix algebra We ll start by re-expressing simple linear regression in matrix form Linear algebra is

More information

Outline of GLMs. Definitions

Outline of GLMs. Definitions Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

On Optimal Frame Conditioners

On Optimal Frame Conditioners On Optimal Frame Conditioners Chae A. Clark Department of Mathematics University of Maryland, College Park Email: cclark18@math.umd.edu Kasso A. Okoudjou Department of Mathematics University of Maryland,

More information

Numerical Methods in Matrix Computations

Numerical Methods in Matrix Computations Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon

More information

ECE 275B Homework #2 Due Thursday MIDTERM is Scheduled for Tuesday, February 21, 2012

ECE 275B Homework #2 Due Thursday MIDTERM is Scheduled for Tuesday, February 21, 2012 Reading ECE 275B Homework #2 Due Thursday 2-16-12 MIDTERM is Scheduled for Tuesday, February 21, 2012 Read and understand the Newton-Raphson and Method of Scores MLE procedures given in Kay, Example 7.11,

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

Homework 4. Convex Optimization /36-725

Homework 4. Convex Optimization /36-725 Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation

1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation 1 Outline. 1. Motivation 2. SUR model 3. Simultaneous equations 4. Estimation 2 Motivation. In this chapter, we will study simultaneous systems of econometric equations. Systems of simultaneous equations

More information