Penalized least squares versus generalized least squares representations of linear mixed models

Douglas Bates
Department of Statistics
University of Wisconsin-Madison

April 6, 2017

Abstract

The methods in the lme4 package for R for fitting linear mixed models are based on sparse matrix methods, especially the Cholesky decomposition of sparse positive-semidefinite matrices, in a penalized least squares representation of the conditional model for the response given the random effects. The representation is similar to that in Henderson's mixed-model equations. An alternative representation of the calculations is as a generalized least squares problem. We describe the two representations, show their equivalence, and explain why we feel that the penalized least squares approach is more versatile and more computationally efficient.

1 Definition of the model

We consider linear mixed models in which the random effects are represented by a q-dimensional random vector, B, and the response is represented by an n-dimensional random vector, Y. We observe a value, y, of the response. The random effects are unobserved. For our purposes, we will assume a spherical multivariate normal conditional distribution of Y, given B. That is, we assume the variance-covariance matrix of Y|B is simply σ² I_n, where I_n denotes the identity matrix of order n. (The term "spherical" refers to the fact that contours of the conditional density are concentric spheres.)

The conditional mean, E[Y | B = b], is a linear function of b and the p-dimensional fixed-effects parameter, β,

\[
\mathrm{E}[Y \mid B = b] = X\beta + Zb, \tag{1}
\]

where X and Z are known model matrices of sizes n × p and n × q, respectively. Thus

\[
\left(Y \mid B = b\right) \sim N\!\left(X\beta + Zb,\; \sigma^2 I_n\right). \tag{2}
\]

The marginal distribution of the random effects,

\[
B \sim N\!\left(0,\; \sigma^2 \Sigma(\theta)\right), \tag{3}
\]

is also multivariate normal, with mean 0 and variance-covariance matrix σ² Σ(θ). The scalar σ² in (3) is the same as the σ² in (2). As described in the next section, the relative variance-covariance matrix, Σ(θ), is a q × q positive semidefinite matrix depending on a parameter vector, θ. Typically the dimension of θ is much, much smaller than q.

1.1 Variance-covariance of the random effects

The relative variance-covariance matrix, Σ(θ), must be symmetric and positive semidefinite (i.e. x′Σx ≥ 0 for all x in R^q). Because the estimate of a variance component can be zero, it is important to allow for a semidefinite Σ. We do not assume that Σ is positive definite (i.e. x′Σx > 0 for all nonzero x in R^q) and, hence, we cannot assume that Σ⁻¹ exists.

A positive semidefinite matrix such as Σ has a Cholesky decomposition of the so-called "LDL′" form. We use a slight modification of this form,

\[
\Sigma(\theta) = T(\theta)\, S(\theta)\, S(\theta)\, T(\theta)', \tag{4}
\]

where T(θ) is a unit lower-triangular q × q matrix and S(θ) is a diagonal q × q matrix with nonnegative diagonal elements that act as scale factors. (They are the relative standard deviations of certain linear combinations of the random effects.) Thus, T is a triangular matrix and S is a scale matrix. Both T and S are highly patterned.
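To make the definitions concrete, here is a minimal R sketch of a hypothetical toy setting (not taken from the paper): a single grouping factor with a scalar random effect for each of its q levels, so that T(θ) = I_q, S(θ) = θ I_q and Σ(θ) = θ² I_q. All object names are ours. The final lines illustrate separately, for an arbitrary 3 × 3 example, that the factorization (4) still yields a valid, semidefinite Σ when one of the scale factors is zero.

```r
## Toy setting (assumed for illustration): a random intercept for each level of `grp`,
## so T(theta) = I_q, S(theta) = theta * I_q and Sigma(theta) = theta^2 * I_q.
set.seed(1)
n <- 60; q <- 6
grp   <- gl(q, n / q)                    # grouping factor with q levels
X     <- cbind(1, x = rnorm(n))          # n x p fixed-effects model matrix (p = 2)
Z     <- model.matrix(~ 0 + grp)         # n x q random-effects model matrix (indicator columns)
beta  <- c(2, 0.5); sigma <- 1; theta <- 0.8
b     <- rnorm(q, sd = sigma * theta)                             # B ~ N(0, sigma^2 Sigma), eq. (3)
y     <- as.numeric(X %*% beta + Z %*% b + rnorm(n, sd = sigma))  # Y | B = b, eqs (1)-(2)

## The factorization (4) with a zero scale factor still gives a valid (semidefinite) Sigma:
Tmat  <- matrix(c(1, 0.3, -0.2,  0, 1, 0.5,  0, 0, 1), 3, 3)      # unit lower-triangular T
Smat  <- diag(c(1.5, 0.7, 0))                                     # diagonal S with one zero scale
eigen(Tmat %*% Smat %*% Smat %*% t(Tmat), symmetric = TRUE)$values  # nonnegative; one is
                                                                    # (numerically) zero, so
                                                                    # Sigma^{-1} does not exist
```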

1.2 Orthogonal random effects

Let us define a q-dimensional random vector, U, of orthogonal random effects with marginal distribution

\[
U \sim N\!\left(0,\; \sigma^2 I_q\right) \tag{5}
\]

and, for a given value of θ, express B as a linear transformation of U,

\[
B = T(\theta)\, S(\theta)\, U. \tag{6}
\]

Note that the transformation (6) gives the desired distribution of B in that E[B] = T S E[U] = 0 and

\[
\mathrm{Var}(B) = \mathrm{E}[B B'] = T S\, \mathrm{E}[U U']\, S' T' = \sigma^2 T S S' T' = \sigma^2 \Sigma .
\]

The conditional distribution, Y | U, can be derived from Y | B as

\[
\left(Y \mid U = u\right) \sim N\!\left(X\beta + Z T S u,\; \sigma^2 I_n\right). \tag{7}
\]

We will write the transpose of ZTS as A. Because the matrices T and S depend on the parameter θ, A is also a function of θ,

\[
A'(\theta) = Z\, T(\theta)\, S(\theta). \tag{8}
\]

In applications, the matrix Z is derived from indicator columns of the levels of one or more factors in the data and is a sparse matrix, in the sense that most of its elements are zero. The matrix A is also sparse. In fact, the structures of T and S are such that the pattern of nonzeros in A is the same as that in Z.

1.3 Sparse matrix methods

The reason for defining A as the transpose of a model matrix is that A is stored and manipulated as a sparse matrix. In the compressed column-oriented storage format that we use for sparse matrices, there are advantages to storing A as a matrix of n columns and q rows. In particular, the CHOLMOD sparse matrix library allows us to evaluate the sparse Cholesky factor, L(θ), a sparse lower triangular matrix that satisfies

\[
L(\theta)\, L(\theta)' = P\left(A(\theta)\, A(\theta)' + I_q\right) P', \tag{9}
\]

directly from A(θ).
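The evaluation of L(θ) in (9) can be sketched with the Matrix package for R, which provides the CHOLMOD interface that lme4 builds on. This is only an illustration of the idea, reusing the toy objects Z, theta and q from the sketch above; it is not lme4's internal code, and the object names (A, chm, L) are ours.

```r
## Sketch of (9): sparse Cholesky factor of P (A A' + I_q) P' via the Matrix/CHOLMOD interface.
## In the toy model T = I_q and S = theta * I_q, so A(theta) = theta * Z'.
library(Matrix)
A <- theta * t(Matrix(Z, sparse = TRUE))          # q x n sparse matrix A(theta)

chm <- Cholesky(tcrossprod(A),                    # symmetric sparse A A'
                perm  = TRUE,                     # let CHOLMOD choose a fill-reducing P
                LDL   = FALSE,                    # plain L L' factorization
                Imult = 1)                        # factor A A' + 1 * I_q
L <- expand(chm)$L                                # sparse lower-triangular factor L(theta)

## log|L|^2 = log|A A' + I_q| regardless of the fill-reducing permutation P
all.equal(2 * sum(log(diag(L))),
          as.numeric(determinant(as.matrix(tcrossprod(A)) + diag(q))$modulus))
```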

In (9), the q × q matrix P is a fill-reducing permutation matrix determined from the pattern of nonzeros in Z. P does not affect the statistical theory (if U ~ N(0, σ²I) then P′U also has a N(0, σ²I) distribution because PP′ = P′P = I) but, because it affects the number of nonzeros in L, it can have a tremendous impact on the amount of storage required for L and the time required to evaluate L from A. Indeed, it is precisely because L(θ) can be evaluated quickly, even for complex models applied to large data sets, that the lmer function is effective in fitting such models.

2 The penalized least squares approach to linear mixed models

Given a value of θ, we form A(θ), from which we evaluate L(θ). We can then solve for the q × p matrix, R_ZX, in the system of equations

\[
L(\theta)\, R_{ZX} = P\, A(\theta)\, X \tag{10}
\]

and for the p × p upper triangular matrix, R_X, satisfying

\[
R_X' R_X = X'X - R_{ZX}' R_{ZX}. \tag{11}
\]

The conditional mode, ũ(θ), of the orthogonal random effects and the conditional mle, β̂(θ), of the fixed-effects parameters can be determined simultaneously as the solutions to a penalized least squares problem,

\[
\begin{bmatrix} \tilde{u}(\theta) \\ \hat{\beta}(\theta) \end{bmatrix}
= \arg\min_{u,\beta}
\left\|
\begin{bmatrix} y \\ 0 \end{bmatrix}
-
\begin{bmatrix} A'P' & X \\ I_q & 0 \end{bmatrix}
\begin{bmatrix} u \\ \beta \end{bmatrix}
\right\|^2 , \tag{12}
\]

for which the solution satisfies

\[
\begin{bmatrix} P(AA' + I)P' & PAX \\ X'A'P' & X'X \end{bmatrix}
\begin{bmatrix} \tilde{u}(\theta) \\ \hat{\beta}(\theta) \end{bmatrix}
=
\begin{bmatrix} PAy \\ X'y \end{bmatrix} . \tag{13}
\]

The Cholesky factor of the system matrix for the PLS problem can be expressed using L, R_ZX and R_X, because

\[
\begin{bmatrix} P(AA' + I)P' & PAX \\ X'A'P' & X'X \end{bmatrix}
=
\begin{bmatrix} L & 0 \\ R_{ZX}' & R_X' \end{bmatrix}
\begin{bmatrix} L' & R_{ZX} \\ 0 & R_X \end{bmatrix} . \tag{14}
\]
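The following dense sketch walks through (10), (11), (13) and (14) directly on the toy example, taking P = I_q for clarity; all object names (Ad, Ld, RZX, RX, M, Fmat) are ours, and the point is only to make the algebra concrete, not to mirror the sparse implementation.

```r
## Dense walk-through of (10)-(14) for the toy example, taking P = I_q.
Ad  <- theta * t(Z)                               # dense q x n version of A(theta)
Ld  <- t(chol(tcrossprod(Ad) + diag(q)))          # lower-triangular L with L L' = A A' + I_q
RZX <- forwardsolve(Ld, Ad %*% X)                 # eq. (10): L RZX = A X
RX  <- chol(crossprod(X) - crossprod(RZX))        # eq. (11): RX' RX = X'X - RZX' RZX

## System matrix of (13) and its block Cholesky factor (14)
M    <- rbind(cbind(tcrossprod(Ad) + diag(q), Ad %*% X),
              cbind(t(Ad %*% X),              crossprod(X)))
Fmat <- rbind(cbind(Ld,     matrix(0, q, ncol(X))),
              cbind(t(RZX), t(RX)))
all.equal(unname(M), unname(Fmat %*% t(Fmat)))    # TRUE: eq. (14)

## Conditional mode and conditional estimate from (13)
sol      <- solve(M, c(Ad %*% y, t(X) %*% y))
u_tilde  <- sol[1:q]                              # u~(theta)
beta_hat <- sol[-(1:q)]                           # beta^(theta)
```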

In the lme4 package the "mer" class is the representation of a mixed-effects model. Several slots in this class are matrices corresponding directly to the matrices in the preceding equations. The A slot contains the sparse matrix A(θ) and the L slot contains the sparse Cholesky factor, L(θ). The RZX and RX slots contain R_ZX(θ) and R_X(θ), respectively, stored as dense matrices.

It is not necessary to solve for ũ(θ) and β̂(θ) to evaluate the profiled log-likelihood, which is the log-likelihood evaluated at θ and the conditional estimates of the other parameters, β̂(θ) and σ̂²(θ). All that is needed for evaluation of the profiled log-likelihood is the (penalized) residual sum of squares, r², from the penalized least squares problem (12) and the determinant |AA′ + I| = |L|². Because L is triangular, its determinant is easily evaluated as the product of its diagonal elements. Furthermore, |L|² > 0 because it is equal to |AA′ + I|, which is the determinant of a positive definite matrix. Thus log(|L|²) is both well-defined and easily calculated from L.

The profiled deviance (negative twice the profiled log-likelihood), as a function of θ only (with β and σ² at their conditional estimates), is

\[
d(\theta \mid y) = \log\!\left(|L|^2\right) + n\left(1 + \log\!\left(\frac{2\pi r^2}{n}\right)\right). \tag{15}
\]

The maximum likelihood estimates, θ̂, satisfy

\[
\hat{\theta} = \arg\min_{\theta} d(\theta \mid y). \tag{16}
\]

Once the value of θ̂ has been determined, the mle of β is evaluated from (13) and the mle of σ² as σ̂²(θ̂) = r²/n.

Note that nothing has been said about the form of the sparse model matrix, Z, other than the fact that it is sparse. In contrast to other methods for linear mixed models, these results apply to models where Z is derived from crossed or partially crossed grouping factors, in addition to models with multiple, nested grouping factors.

The system (13) is similar to Henderson's mixed-model equations (reference?). One important difference between (13) and Henderson's formulation is that Henderson represented his system of equations in terms of Σ⁻¹ and, in important practical examples, Σ⁻¹ does not exist at the parameter estimates. Also, Henderson assumed that equations like (13) would need to be solved explicitly and, as we have seen, only the decomposition of the system matrix is needed for evaluation of the profiled log-likelihood. The same is true of the logarithm of the profiled REML criterion, which we define later.
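As an illustration of (15) and (16), the sketch below evaluates the profiled deviance for the toy random-intercept example as a function of the scalar θ and minimizes it with optimize(). Again P is taken to be I_q, the function name prof_dev is ours, and this is not lme4's implementation; with real data one would simply call lmer().

```r
## Profiled deviance (15) for the toy random-intercept example (P = I_q, scalar theta).
prof_dev <- function(theta, y, X, Z) {
  n  <- length(y); q <- ncol(Z)
  A  <- theta * t(Z)                                         # A(theta) for this simple model
  Ld <- t(chol(tcrossprod(A) + diag(q)))                     # L(theta)
  M  <- rbind(cbind(tcrossprod(A) + diag(q), A %*% X),
              cbind(t(A %*% X),              crossprod(X)))
  sol  <- solve(M, c(A %*% y, t(X) %*% y))                   # eq. (13)
  u    <- sol[1:q]; beta <- sol[-(1:q)]
  r2   <- sum((y - X %*% beta - t(A) %*% u)^2) + sum(u^2)    # penalized residual sum of squares
  2 * sum(log(diag(Ld))) + n * (1 + log(2 * pi * r2 / n))    # eq. (15)
}

opt <- optimize(prof_dev, interval = c(0, 5), y = y, X = X, Z = Z)
opt$minimum                                                  # theta^, eq. (16), for the toy data
```

The point of the sketch is only that the entire profiled deviance is available from L and the penalized residual sum of squares, without solving (13) repeatedly for its own sake.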

3 The generalized least squares approach to linear mixed models

Another common approach to linear mixed models is to derive the marginal variance-covariance matrix of Y as a function of θ and use that to determine the conditional estimates, β̂(θ), as the solution of a generalized least squares (GLS) problem. In the notation of §1, the marginal mean of Y is E[Y] = Xβ and the marginal variance-covariance matrix is

\[
\mathrm{Var}(Y) = \sigma^2\left(I_n + Z T S S' T' Z'\right) = \sigma^2\left(I_n + A'A\right) = \sigma^2 V(\theta), \tag{17}
\]

where V(θ) = I_n + A′A. The conditional estimates of β are often written as

\[
\hat{\beta}(\theta) = \left(X' V^{-1} X\right)^{-1} X' V^{-1} y, \tag{18}
\]

but, of course, this formula is not suitable for computation. The matrix V(θ) is a symmetric n × n positive definite matrix and hence has a Cholesky factor. However, this factor is n × n, not q × q, and n is always larger than q, sometimes orders of magnitude larger. Blithely writing a formula in terms of V⁻¹ when V is n × n, and n can be in the millions, does not a computational formula make.

3.1 Relating the GLS approach to the Cholesky factor

We can use the fact that

\[
V^{-1}(\theta) = \left(I_n + A'A\right)^{-1} = I_n - A'\left(I_q + AA'\right)^{-1} A \tag{19}
\]

to relate the GLS problem to the PLS problem. One way to establish (19) is simply to show that the product

\[
\begin{aligned}
\left(I + A'A\right)\left(I - A'\left(I + AA'\right)^{-1}A\right)
&= I + A'A - A'\left(I + AA'\right)\left(I + AA'\right)^{-1}A \\
&= I + A'A - A'A = I ,
\end{aligned}
\]

where the first equality uses (I + A′A)A′ = A′(I + AA′). Incorporating the permutation matrix P we have

\[
\begin{aligned}
V^{-1}(\theta) &= I_n - A'P'P\left(I_q + AA'\right)^{-1}P'PA \\
               &= I_n - A'P'\left(LL'\right)^{-1}PA \\
               &= I_n - \left(L^{-1}PA\right)'\left(L^{-1}PA\right). \tag{20}
\end{aligned}
\]

Even in this form we would not want to routinely evaluate V⁻¹. However, (20) does allow us to simplify many common expressions. For example, the variance-covariance of the estimator β̂, conditional on θ and σ, can be expressed as

\[
\begin{aligned}
\sigma^2\left(X' V^{-1}(\theta) X\right)^{-1}
&= \sigma^2\left(X'X - \left(L^{-1}PAX\right)'\left(L^{-1}PAX\right)\right)^{-1} \\
&= \sigma^2\left(X'X - R_{ZX}' R_{ZX}\right)^{-1} \\
&= \sigma^2\left(R_X' R_X\right)^{-1}. \tag{21}
\end{aligned}
\]
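A quick numerical check of (21) on the toy example (reusing Ad, RZX and RX from the sketch of (10)-(14), again with P = I_q) confirms that the GLS and PLS expressions for the conditional variance-covariance of β̂ agree, even though the GLS branch forms the n × n matrix V and the PLS branch never does.

```r
## Check of (17)-(21) on the toy example, reusing Ad, RZX and RX from the earlier sketch.
V   <- diag(n) + crossprod(Ad)                 # V(theta) = I_n + A'A, eq. (17); n x n
gls <- solve(t(X) %*% solve(V, X))             # (X' V^{-1} X)^{-1}: the "textbook" GLS form (18)
pls <- solve(crossprod(RX))                    # (RX' RX)^{-1}: the PLS form from (21)
all.equal(unname(gls), unname(pls))            # TRUE, up to rounding
```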

Incorporating the permutation matrix P we have V 1 (θ) =I n A P P (I q +AA ) 1 P PA =I n A P (LL ) 1 PA =I n ( L 1 PA ) L 1 PA. (20) Even in this form we would not want to routinely evaluate V 1. However, (20) does allow us to simplify many common expressions. For example, the variance-covariance of the estimator β, conditional on θ and σ, can be expressed as σ 2( X V 1 (θ)x ) 1 =σ 2 (X X ( L 1 PAX ) ( L 1 PAX )) 1 =σ 2 (X X R ZXR ZX ) 1 =σ 2 (R XR X ) 1. (21) 4 Trace of the hat matrix Another calculation that is of interest to some is the the trace of the hat matrix, which can be written as ( [A tr X ]([ ] A [ ]) X A 1 [ ] ) X A I 0 I 0 X ( [A = tr X ]([ ][ ]) L 0 L 1 [ ] ) R ZX A 0 R X X (22) R ZX R X 7