Mixed models in R using the lme4 package Part 4: Theory of linear mixed models


Mixed models in R using the lme4 package
Part 4: Theory of linear mixed models

Douglas Bates <Bates@R-project.org>
8th International Amsterdam Conference on Multilevel Analysis
2011-03-16


Outline

1. Definition of linear mixed models
2. The penalized least squares problem
3. The sparse Cholesky factor
4. Evaluating the likelihood

Definition of linear mixed models

As previously stated, we define a linear mixed model in terms of two random variables: the $n$-dimensional $Y$ and the $q$-dimensional $B$.

The probability model specifies the conditional distribution
$$(Y \mid B = b) \sim \mathcal{N}\!\left(Zb + X\beta,\ \sigma^2 I_n\right)$$
and the unconditional distribution
$$B \sim \mathcal{N}(0, \Sigma_\theta).$$
These distributions depend on the parameters $\beta$, $\theta$ and $\sigma$.

The probability model defines the likelihood of the parameters, given the observed data, $y$. In theory all we need to know is how to define the likelihood from the data so that we can maximize the likelihood with respect to the parameters. In practice we want to be able to evaluate it quickly and accurately.
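
For concreteness, fitting a model of this form in R looks like the following. The sleepstudy data and the formula are only an illustrative choice, not part of this slide; lmer() estimates $\beta$, $\theta$ and $\sigma$ by optimizing the criterion derived in the remainder of the talk.

```r
library(lme4)
## Illustrative fit: Y is Reaction, the grouping factor behind B is Subject
fm <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy, REML = FALSE)
fixef(fm)            # estimate of beta
sigma(fm)            # estimate of sigma
getME(fm, "theta")   # estimate of theta
```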

Properties of $\Sigma_\theta$; generating it

Because it is a variance-covariance matrix, the $q \times q$ $\Sigma_\theta$ must be symmetric and positive semi-definite, which means, in effect, that it has a square root: there must be another matrix that, when multiplied by its transpose, gives $\Sigma_\theta$.

We never really form $\Sigma_\theta$; we always work with the relative covariance factor, $\Lambda_\theta$, defined so that
$$\Sigma_\theta = \sigma^2 \Lambda_\theta \Lambda_\theta^T$$
where $\sigma^2$ is the same variance parameter as in $(Y \mid B = b)$.

We also work with a $q$-dimensional "spherical" or "unit" random-effects vector, $U$, such that
$$U \sim \mathcal{N}\!\left(0, \sigma^2 I_q\right), \qquad B = \Lambda_\theta U \ \Rightarrow\ \mathrm{Var}(B) = \sigma^2 \Lambda_\theta \Lambda_\theta^T = \Sigma_\theta.$$
The linear predictor expression becomes
$$Zb + X\beta = Z\Lambda_\theta u + X\beta.$$
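
Continuing the illustrative fit fm above, the relative covariance factor and the implied $\Sigma_\theta$ can be inspected directly; a brief sketch, assuming the getME accessors of current lme4:

```r
Lam   <- getME(fm, "Lambda")     # relative covariance factor Lambda_theta (sparse)
sig2  <- sigma(fm)^2
Sigma <- sig2 * tcrossprod(Lam)  # sigma^2 Lambda_theta Lambda_theta' = Sigma_theta
## B = Lambda_theta U, so Var(B) = Sigma_theta, as stated above
```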

The conditional mean $\mu_{U \mid Y}$

Although the probability model is defined from $(Y \mid U = u)$, we observe $y$, not $u$ (or $b$), so we want to work with the other conditional distribution, $(U \mid Y = y)$.

The joint distribution of $Y$ and $U$ is Gaussian with density
$$f_{Y,U}(y, u) = f_{Y \mid U}(y \mid u)\, f_U(u)
= \frac{\exp\!\left(-\tfrac{1}{2\sigma^2}\|y - X\beta - Z\Lambda_\theta u\|^2\right)}{(2\pi\sigma^2)^{n/2}}
\cdot \frac{\exp\!\left(-\tfrac{1}{2\sigma^2}\|u\|^2\right)}{(2\pi\sigma^2)^{q/2}}
= \frac{\exp\!\left(-\left[\|y - X\beta - Z\Lambda_\theta u\|^2 + \|u\|^2\right]/(2\sigma^2)\right)}{(2\pi\sigma^2)^{(n+q)/2}}.$$

$(U \mid Y = y)$ is also Gaussian, so its mean is its mode. I.e.
$$\mu_{U \mid Y} = \arg\min_u \left[\|y - X\beta - Z\Lambda_\theta u\|^2 + \|u\|^2\right].$$
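
After a fit, lme4 makes these conditional modes available at the estimated parameter values; a sketch, again using the illustrative fit fm and assuming the getME component names:

```r
u <- getME(fm, "u")   # mu_{U|Y}: conditional mode (= mean) of U given Y = y
b <- getME(fm, "b")   # Lambda_theta %*% u, the conditional mode of B
## ranef(fm) presents the same values of b, arranged by grouping factor
```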

Minimizing a penalized sum of squared residuals

An expression like $\|y - X\beta - Z\Lambda_\theta u\|^2 + \|u\|^2$ is called a penalized sum of squared residuals because $\|y - X\beta - Z\Lambda_\theta u\|^2$ is a sum of squared residuals and $\|u\|^2$ is a penalty on the size of the vector $u$.

Determining $\mu_{U \mid Y}$ as the minimizer of this expression is a penalized least squares (PLS) problem. In this case it is a penalized linear least squares problem that we can solve directly (i.e. without iterating).

One way to determine the solution is to rephrase it as a linear least squares problem for an extended residual vector
$$\mu_{U \mid Y} = \arg\min_u \left\| \begin{bmatrix} y - X\beta \\ 0 \end{bmatrix} - \begin{bmatrix} Z\Lambda_\theta \\ I_q \end{bmatrix} u \right\|^2.$$
This is sometimes called a pseudo-data approach because we create the effect of the penalty term, $\|u\|^2$, by adding "pseudo-observations" to $y$ and to the predictor, as illustrated in the sketch below.
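
A tiny dense illustration of the pseudo-data idea; ZLam, the residual vector r0, and the dimensions are all made up for this example:

```r
set.seed(1)
n <- 10; q <- 3
ZLam <- matrix(rnorm(n * q), n, q)   # stands in for Z Lambda_theta
r0   <- rnorm(n)                     # stands in for y - X beta
Xaug <- rbind(ZLam, diag(q))         # [ Z Lambda_theta ; I_q ]
yaug <- c(r0, rep(0, q))             # [ y - X beta ; 0 ]
mu_u <- qr.coef(qr(Xaug), yaug)      # PLS solution mu_{U|Y}
```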

Solving the linear PLS problem

The conditional mean satisfies the equations
$$\left(\Lambda_\theta^T Z^T Z \Lambda_\theta + I_q\right)\mu_{U \mid Y} = \Lambda_\theta^T Z^T (y - X\beta).$$
This would be interesting but not very important were it not for the fact that we actually can solve that system for $\mu_{U \mid Y}$ even when its dimension, $q$, is very, very large.

Because $Z$ is generated from indicator columns for the grouping factors, it is sparse. $Z\Lambda_\theta$ is also very sparse.

There are sophisticated and efficient ways of calculating a sparse Cholesky factor, which is a sparse, lower-triangular matrix $L_\theta$ that satisfies
$$L_\theta L_\theta^T = \Lambda_\theta^T Z^T Z \Lambda_\theta + I_q$$
and, from that, solving for $\mu_{U \mid Y}$.
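
Continuing the tiny dense example above, the same $\mu_{U \mid Y}$ solves the normal-equations form directly; in real problems this dense solve is replaced by the sparse methods described next:

```r
## dense normal-equations check, re-using ZLam, r0 and q from the pseudo-data sketch
mu_u2 <- solve(crossprod(ZLam) + diag(q), crossprod(ZLam, r0))
all.equal(as.vector(mu_u), as.vector(mu_u2))   # TRUE up to numerical error
```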

The sparse Cholesky factor, $L_\theta$

Because the ability to evaluate the sparse Cholesky factor, $L_\theta$, is the key to the computational methods in the lme4 package, we consider this in detail.

In practice we will evaluate $L_\theta$ for many different values of $\theta$ when determining the ML or REML estimates of the parameters.

As described in Davis (2006), §4.6, the calculation is performed in two steps: in the symbolic decomposition we determine the position of the nonzeros in $L$ from those in $Z\Lambda_\theta$; then, in the numeric decomposition, we determine the numerical values in those positions. Although the numeric decomposition may be done dozens, perhaps hundreds of times as we iterate on $\theta$, the symbolic decomposition is only done once.

A fill-reducing permutation, $P$

In practice it can be important while performing the symbolic decomposition to determine a fill-reducing permutation, which is written as a $q \times q$ permutation matrix, $P$. This matrix is just a re-ordering of the columns of $I_q$ and has an orthogonality property, $PP^T = P^T P = I_q$.

When $P$ is used, the factor $L_\theta$ is defined to be the sparse, lower-triangular matrix that satisfies
$$L_\theta L_\theta^T = P\left[\Lambda_\theta^T Z^T Z \Lambda_\theta + I_q\right] P^T.$$

In the Matrix package for R, the Cholesky method for a sparse, symmetric matrix (class dsCMatrix) performs both the symbolic and numeric decomposition. By default, it determines a fill-reducing permutation, $P$. The update method for a Cholesky factor (class CHMfactor) performs the numeric decomposition only, as in the sketch below.
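
A minimal sketch of the symbolic-plus-numeric factorization with the Matrix package; ZLam below is an arbitrary sparse stand-in for $Z\Lambda_\theta$, not a matrix from a real model:

```r
library(Matrix)
ZLam <- sparseMatrix(i = c(1, 2, 3, 3, 4, 5, 5, 6), j = c(1, 1, 1, 2, 2, 2, 3, 3),
                     x = rnorm(8), dims = c(6, 3))   # stand-in for Z Lambda_theta
A <- crossprod(ZLam)                                 # Lambda' Z' Z Lambda (dsCMatrix)

## Symbolic + numeric decomposition of A + I with a fill-reducing permutation P,
## so that L L' = P (A + I) P'
L <- Cholesky(A, perm = TRUE, LDL = FALSE, Imult = 1)

## For a new theta only the numeric step is needed; update() re-uses the
## symbolic analysis stored in L
L <- update(L, crossprod(2 * ZLam), mult = 1)
```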

The conditional density, $f_{U \mid Y}$

We know the joint density, $f_{Y,U}(y, u)$, and
$$f_{U \mid Y}(u \mid y) = \frac{f_{Y,U}(y, u)}{\int f_{Y,U}(y, u)\, du}$$
so we almost have $f_{U \mid Y}$. The trick is evaluating the integral in the denominator, which, it turns out, is exactly the likelihood, $L(\theta, \beta, \sigma^2 \mid y)$, that we want to maximize.

The Cholesky factor, $L_\theta$, is the key to doing this because
$$P^T L_\theta L_\theta^T P\, \mu_{U \mid Y} = \Lambda_\theta^T Z^T (y - X\beta).$$
Although the Matrix package provides a one-step solve method for this, we write it in stages:

Solve $L c_u = P\Lambda_\theta^T Z^T (y - X\beta)$ for $c_u$.
Solve $L^T P\mu = c_u$ for $P\mu_{U \mid Y}$, and obtain $\mu_{U \mid Y}$ as $P^T (P\mu_{U \mid Y})$.

These staged solves are sketched below.
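
The stages map directly onto the system argument of solve() for a CHMfactor; a sketch assuming L and ZLam are the objects from the previous sketch and rhs stands in for $\Lambda_\theta^T Z^T (y - X\beta)$:

```r
rhs <- crossprod(ZLam, rnorm(6))                            # made-up Lambda' Z' (y - X beta)
cu  <- solve(L, solve(L, rhs, system = "P"), system = "L")  # L cu = P rhs
Pmu <- solve(L, cu, system = "Lt")                          # L' (P mu) = cu
mu  <- solve(L, Pmu, system = "Pt")                         # mu = P' (P mu)
## one-step equivalent: solve(L, rhs, system = "A")
```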

Evaluating the likelihood

The exponent of $f_{Y,U}(y, u)$ can now be written
$$\|y - X\beta - Z\Lambda_\theta u\|^2 + \|u\|^2 = r^2(\theta, \beta) + \|L^T P(u - \mu_{U \mid Y})\|^2,$$
where $r^2(\theta, \beta) = \|y - X\beta - Z\Lambda_\theta \mu_{U \mid Y}\|^2 + \|\mu_{U \mid Y}\|^2$.

The first term doesn't depend on $u$ and the second is relatively easy to integrate. Use the change of variable $v = L^T P(u - \mu_{U \mid Y})$, with $dv = \mathrm{abs}(|L||P|)\, du$, in
$$\int \frac{\exp\!\left(\frac{-\|L^T P(u - \mu_{U \mid Y})\|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{q/2}}\, du
= \int \frac{\exp\!\left(\frac{-\|v\|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{q/2}}\, \frac{dv}{\mathrm{abs}(|L||P|)}
= \frac{1}{\mathrm{abs}(|L||P|)} = \frac{1}{|L|}$$
because $\mathrm{abs}(|P|) = 1$ and $|L|$, which is the product of its diagonal elements, all of which are positive, is positive.

Evaluating the likelihood (cont'd)

As is often the case, it is easiest to write the log-likelihood. On the deviance scale (negative twice the log-likelihood), $\ell(\theta, \beta, \sigma \mid y) = \log L(\theta, \beta, \sigma \mid y)$ becomes
$$-2\ell(\theta, \beta, \sigma \mid y) = n\log(2\pi\sigma^2) + \frac{r^2(\theta, \beta)}{\sigma^2} + \log\!\left(|L_\theta|^2\right).$$

We wish to minimize the deviance. Its dependence on $\sigma$ is straightforward. Given values of the other parameters, we can evaluate the conditional estimate
$$\widehat{\sigma^2}(\theta, \beta) = \frac{r^2(\theta, \beta)}{n},$$
producing the profiled deviance
$$-2\tilde{\ell}(\theta, \beta \mid y) = \log\!\left(|L_\theta|^2\right) + n\left[1 + \log\!\left(\frac{2\pi r^2(\theta, \beta)}{n}\right)\right].$$
However, an even greater simplification is possible because the deviance depends on $\beta$ only through $r^2(\theta, \beta)$.

Profiling the deviance with respect to $\beta$

Because the deviance depends on $\beta$ only through $r^2(\theta, \beta)$, we can obtain the conditional estimate, $\widehat{\beta}_\theta$, by extending the PLS problem to
$$r^2_\theta = \min_{u, \beta}\left[\|y - X\beta - Z\Lambda_\theta u\|^2 + \|u\|^2\right]$$
with the solution satisfying the equations
$$\begin{bmatrix} \Lambda_\theta^T Z^T Z \Lambda_\theta + I_q & \Lambda_\theta^T Z^T X \\ X^T Z \Lambda_\theta & X^T X \end{bmatrix}
\begin{bmatrix} \mu_{U \mid Y} \\ \widehat{\beta}_\theta \end{bmatrix}
= \begin{bmatrix} \Lambda_\theta^T Z^T y \\ X^T y \end{bmatrix}.$$
The profiled deviance, which is a function of $\theta$ only, is
$$-2\tilde{\ell}(\theta) = \log\!\left(|L_\theta|^2\right) + n\left[1 + \log\!\left(\frac{2\pi r^2_\theta}{n}\right)\right].$$

Solving the extended PLS problem

For brevity we will no longer show the dependence of matrices and vectors on the parameter $\theta$.

As before we use the sparse Cholesky decomposition, with $L$ and $P$ satisfying $LL^T = P(\Lambda_\theta^T Z^T Z \Lambda_\theta + I)P^T$, and $c_u$, the solution to $Lc_u = P\Lambda_\theta^T Z^T y$.

We extend the decomposition with the $q \times p$ matrix $R_{ZX}$, the upper triangular $p \times p$ matrix $R_X$, and the $p$-vector $c_\beta$ satisfying
$$LR_{ZX} = P\Lambda_\theta^T Z^T X$$
$$R_X^T R_X = X^T X - R_{ZX}^T R_{ZX}$$
$$R_X^T c_\beta = X^T y - R_{ZX}^T c_u$$
so that
$$\begin{bmatrix} P^T L & 0 \\ R_{ZX}^T & R_X^T \end{bmatrix}
\begin{bmatrix} L^T P & R_{ZX} \\ 0 & R_X \end{bmatrix}
= \begin{bmatrix} \Lambda_\theta^T Z^T Z \Lambda_\theta + I & \Lambda_\theta^T Z^T X \\ X^T Z \Lambda_\theta & X^T X \end{bmatrix}.$$

Solving the extended PLS problem (cont'd)

Finally we solve
$$R_X \widehat{\beta}_\theta = c_\beta$$
$$L^T P \mu_{U \mid Y} = c_u - R_{ZX}\widehat{\beta}_\theta.$$

The profiled REML criterion also can be expressed simply. The criterion is
$$L_R(\theta, \sigma^2 \mid y) = \int L(\theta, \beta, \sigma^2 \mid y)\, d\beta.$$
The same change-of-variable technique that evaluated the integral w.r.t. $u$ as $1/\mathrm{abs}(|L|)$ produces $1/\mathrm{abs}(|R_X|)$ here and removes $(2\pi\sigma^2)^{p/2}$ from the denominator. On the deviance scale, the profiled REML criterion is
$$-2\tilde{\ell}_R(\theta) = \log\!\left(|L|^2\right) + \log\!\left(|R_X|^2\right) + (n-p)\left[1 + \log\!\left(\frac{2\pi r^2_\theta}{n-p}\right)\right].$$
These calculations can be expressed in a few lines of R code, as in the sketch below.
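
A sketch of the profiled ML deviance as a function of $\theta$, following the steps above. This is not lme4's internal implementation: profDev, the arguments Z, X, y, and the user-supplied function Lambda(theta) that builds $\Lambda_\theta$ are all assumptions of this example.

```r
library(Matrix)
profDev <- function(theta, Z, X, y, Lambda) {
  n     <- length(y)
  ZLam  <- Z %*% Lambda(theta)                              # Z Lambda_theta (sparse)
  L     <- Cholesky(crossprod(ZLam), perm = TRUE, LDL = FALSE, Imult = 1)
  PLinv <- function(b)                                      # solve L x = P b for x
    solve(L, solve(L, b, system = "P"), system = "L")
  cu    <- PLinv(crossprod(ZLam, y))                        # L cu  = P Lambda' Z' y
  RZX   <- as.matrix(PLinv(crossprod(ZLam, X)))             # L RZX = P Lambda' Z' X
  RX    <- chol(as.matrix(crossprod(X)) - crossprod(RZX))   # RX' RX = X'X - RZX' RZX
  cbeta <- solve(t(RX), as.vector(crossprod(X, y)) - crossprod(RZX, as.vector(cu)))
  beta  <- solve(RX, cbeta)                                 # RX beta = cbeta
  u     <- solve(L, solve(L, cu - RZX %*% beta, system = "Lt"), system = "Pt")
  r2    <- sum((y - X %*% beta - as.vector(ZLam %*% u))^2) + sum(u^2)
  ldL2  <- 2 * sum(log(diag(expand(L)$L)))                  # log |L_theta|^2
  ldL2 + n * (1 + log(2 * pi * r2 / n))                     # profiled ML deviance
}
```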

Summary

For a linear mixed model, even one with a huge number of observations and random effects like the model for the grade point scores, evaluation of the ML or REML profiled deviance, given a value of $\theta$, is straightforward. It involves updating $\Lambda_\theta$, $L_\theta$, $R_{ZX}$ and $R_X$, calculating the penalized residual sum of squares, $r^2_\theta$, and two determinants of triangular matrices.

The profiled deviance can be optimized as a function of $\theta$ only. The dimension of $\theta$ is usually very small. For the grade point scores there are only three components to $\theta$.
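
Optimizing the hypothetical profDev() over $\theta$ alone is then a small, low-dimensional problem. A sketch only: lme4 itself uses specialized bound-constrained optimizers, and theta0, lower, Z, X, y and Lambda below are placeholders from the earlier example.

```r
opt <- optim(par = theta0, fn = profDev, Z = Z, X = X, y = y, Lambda = Lambda,
             method = "L-BFGS-B", lower = lower)   # e.g. lower = 0 for variance-like components
opt$par   # estimate of theta; beta-hat and sigma-hat then follow from the PLS solution
```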