MA 575 Linear Models: Cedric E. Ginestet, Boston University
Department of Mathematics and Statistics, Boston University
Mixed Effects Estimation, Residuals Diagnostics
Week 11, Lecture 1

1 Within-group Correlation

Let us recall the simple two-level hierarchical model that we have studied as a motivating example for mixed effects models. In this case, we had assumed that the single intercept in our original model was drawn from a population of intercepts, such that
\[
y_{ij} = \alpha_j + \beta x_{ij} + e_{ij}, \qquad \alpha_j = \alpha + b_j,
\]
for $i = 1, \dots, n_j$ and $j = 1, \dots, m$. In addition, we make the following standard distributional assumptions,
\[
b_j \sim N(0, \sigma_b^2), \qquad e_{ij} \sim N(0, \sigma_e^2).
\]
Moreover, we also assume that the random effects and the error terms are uncorrelated, $\mathrm{Cov}[b_j, e_{ij}] = 0$.

Clearly, this model contains two different sources of variation: within- and between-clusters. We may therefore wish to ask what is the correlation between two observations $y_{1j}$ and $y_{2j}$ belonging to the same group $j$. Here, we assume that homoscedasticity holds within and between groups, so that we can drop the dependence on $x_{1j}$ and $x_{2j}$, such that
\[
\mathrm{Cov}[Y_{1j}, Y_{2j} \mid x_{1j}, x_{2j}]
= \mathrm{Cov}[b_j + e_{1j}, b_j + e_{2j}]
= \mathrm{Cov}[b_j, b_j] + \mathrm{Cov}[b_j, e_{2j}] + \mathrm{Cov}[e_{1j}, b_j] + \mathrm{Cov}[e_{1j}, e_{2j}]
= \mathrm{Var}[b_j],
\]
since we have assumed that the $e_{ij}$'s are statistically independent of each other, and statistically independent of the $b_j$'s. It then suffices to normalize with respect to the product of the standard deviations of $Y_{1j}$ and $Y_{2j}$, which, since these two variances are equal, reduces to
\[
\big(\mathrm{Var}[Y_{1j} \mid x_{1j}] \, \mathrm{Var}[Y_{2j} \mid x_{2j}]\big)^{1/2} = \mathrm{Var}[b_j + e_{ij}] = \sigma_e^2 + \sigma_b^2.
\]
The within-group correlation between any two elements in a cluster is then given by
\[
\rho := \frac{\mathrm{Var}[b_j]}{\mathrm{Var}[b_j + e_{ij}]} = \frac{\sigma_b^2}{\sigma_e^2 + \sigma_b^2}.
\]
Here, $\rho$ measures the relative contribution of the between-group variability to the total variability. It is sometimes described in terms of shrinkage, since as $\rho$ increases, the group-specific predictions shrink toward their group means. Moreover, we may have the following two extreme situations:

1. As $\sigma_b^2 \to 0$, we have $\rho \to 0$, and we recover a standard regression with no random effects.
2. As $\sigma_e^2 \to 0$, we have $\rho \to 1$, and each group element is equal to its group mean.

Thus, the within-cluster correlation quantifies how influential the group structure is on the estimation of the fixed effects. This quantity is related to the shrinkage coefficient in Bayesian statistics, which is usually defined as $1 - \rho$. The latter represents the shrinkage of the group means toward the overall mean of the observations.
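To make the within-group correlation concrete, here is a minimal simulation sketch in Python (numpy only; all parameter values are hypothetical) that draws data from the two-level random-intercept model above and compares the empirical correlation between two observations from the same group with $\rho = \sigma_b^2/(\sigma_e^2 + \sigma_b^2)$.

```python
import numpy as np

rng = np.random.default_rng(575)

# Hypothetical variance components: rho = 4 / (4 + 1) = 0.8
sigma2_b, sigma2_e = 4.0, 1.0
m, n = 500, 2                # many groups, two observations per group
alpha, beta = 1.0, 0.5       # hypothetical fixed effects

b = rng.normal(0.0, np.sqrt(sigma2_b), size=m)        # random intercepts b_j
x = rng.uniform(0.0, 1.0, size=(m, n))                # covariates x_ij
e = rng.normal(0.0, np.sqrt(sigma2_e), size=(m, n))   # error terms e_ij
y = alpha + b[:, None] + beta * x + e                 # y_ij = alpha_j + beta x_ij + e_ij

# Subtracting the mean function leaves b_j + e_ij; two observations in the
# same group are then correlated only through the shared b_j.
resid = y - alpha - beta * x
print("theoretical rho:", sigma2_b / (sigma2_b + sigma2_e))
print("empirical rho:  ", np.corrcoef(resid[:, 0], resid[:, 1])[0, 1])
```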
2 Mixed Models Estimation

2.1 General Formulation

The main mixed effects model equation is generally written as follows,
\[
y_j = X_j \beta + Z_j b_j + e_j, \qquad j = 1, \dots, m. \tag{1}
\]
This general expression accommodates different choices of the $Z_j$'s, such as a model with both random intercepts and random slope coefficients,
\[
y_{ij} = \alpha_j + \beta_j x_{ij} + e_{ij}, \qquad \alpha_j = \alpha + a_j, \qquad \beta_j = \beta + b_j,
\]
for $i = 1, \dots, n_j$ and $j = 1, \dots, m$. This model would fit a separate regression line for every group.

2.2 Balanced Mixed Effects Designs

A mixed effects model is called balanced if, for every $j = 1, \dots, m$, we have both $Z_j = Z$ and $n_j = n$; the total number of observations is still denoted by $N := \sum_{j=1}^m n_j = mn$. Moreover, we have made the following two distributional assumptions,
\[
e_j \sim \mathrm{MVN}_n(0, \sigma^2 I_n), \qquad b_j \sim \mathrm{MVN}_k(0, \sigma^2 D),
\]
for every $j = 1, \dots, m$. Recall that we have expressed the general mixed effects model by stacking the group-specific matrices, such that
\[
y = X\beta + Zb + e, \tag{2}
\]
where $y$, $X$, $Z$, $b$, and $e$ are of dimensions $(N \times 1)$, $(N \times p)$, $(N \times mk)$, $(mk \times 1)$, and $(N \times 1)$, respectively. This was further simplified, using the definition $\eta := Zb + e$, as $y = X\beta + \eta$, where $y$ is an $N \times 1$ vector of observations, $X$ is a design matrix for the fixed effects of order $N \times p$, and $\beta$ is of order $p \times 1$. Finally, the error vector combines both the random effects and the error terms, such that
\[
\eta := \begin{bmatrix} \eta_1 \\ \vdots \\ \eta_m \end{bmatrix}
      = \begin{bmatrix} Z b_1 + e_1 \\ \vdots \\ Z b_m + e_m \end{bmatrix}.
\]
Moreover, we had seen that the variance of the error term is an $N \times N$ matrix with a block-diagonal structure,
\[
\mathrm{Var}[\eta \mid X] = \sigma^2
\begin{bmatrix}
I_n + ZDZ^T & 0 & \cdots & 0 \\
0 & I_n + ZDZ^T & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & I_n + ZDZ^T
\end{bmatrix}.
\]
Here, we may use the Kronecker product of matrices. For any two matrices $A$ and $B$, with $A$ of order $(r \times c)$, the product $A \otimes B$ is given by the block matrix,
\[
A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1c} B \\ \vdots & \ddots & \vdots \\ a_{r1} B & \cdots & a_{rc} B \end{bmatrix}.
\]
Hence, the covariance matrix of $\eta$ in a balanced mixed effects model can simply be written as $\mathrm{Var}[\eta \mid X] = \sigma^2 (I_m \otimes (I_n + ZDZ^T))$. Altogether, this gives the following mean and covariance functions for the general mixed effects model,
\[
\mathrm{E}[y \mid X] = X\beta, \qquad \mathrm{Var}[\eta \mid X] = \sigma^2 \big(I_m \otimes (I_n + ZDZ^T)\big).
\]
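As a quick numerical illustration, the following sketch (numpy only; the covariate values, $D$, and $\sigma^2$ are all hypothetical) assembles this block-diagonal covariance matrix via the Kronecker product, for a small balanced design with a random intercept and a random slope.

```python
import numpy as np

m, n, k = 3, 4, 2                        # groups, obs per group, random effects
x = np.array([0.1, 0.4, 0.6, 0.9])       # covariate values, shared across groups
Z = np.column_stack([np.ones(n), x])     # (n x k): random intercept and slope

sigma2 = 1.0                             # hypothetical error variance
D = np.array([[1.0, 0.2],                # hypothetical scaled covariance of b_j,
              [0.2, 0.5]])               # i.e. Var[b_j] = sigma2 * D

# Var[eta | X] = sigma^2 (I_m kron (I_n + Z D Z^T)): an N x N block-diagonal matrix
V = sigma2 * np.kron(np.eye(m), np.eye(n) + Z @ D @ Z.T)
print(V.shape)                           # (12, 12), with N = m * n = 12
```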
For notational convenience, let us denote the variance/covariance matrix of the error term by $\Sigma^{-1} := \mathrm{Var}[\eta \mid X]$, so that $\Sigma$ is the corresponding precision matrix. One can then formulate an RSS criterion for estimating $\beta$, such that
\[
\mathrm{RSS}(\beta; \Sigma) := (y - X\beta)^T \Sigma \, (y - X\beta).
\]
This expression can be straightforwardly minimized using the GLS framework that we have studied in lecture 6.2, in order to obtain
\[
\widehat{\beta} = (X^T \Sigma X)^{-1} X^T \Sigma y.
\]
However, this GLS estimator depends on $D$, which is unknown. Therefore, we need to resort to a more sophisticated estimation procedure.

2.3 MLE Estimation

We can express the full likelihood of our balanced model using the variance of $\eta$, such that
\[
y_j \overset{\text{ind}}{\sim} \mathrm{MVN}_n\!\big(X_j \beta, \, \sigma^2 (I_n + ZDZ^T)\big), \qquad j = 1, \dots, m.
\]
Here, there are three parameters to estimate, which can be summarized as follows,
\[
\Theta := \big\{\theta = (\beta, \sigma^2, D) : \beta \in \mathbb{R}^p, \ \sigma^2 > 0, \ D \succeq 0 \big\},
\]
where the last condition states that $D$ should be positive semi-definite. Strictly speaking, we are here dealing with a constrained maximization problem. Note, however, that the space $\Theta$ is convex, since any convex combination of two positive semi-definite matrices is itself positive semi-definite. The standard way of estimating the parameters of such general mixed effects models is through restricted maximum likelihood (REML).
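In practice, such models are seldom maximized by hand. As an illustration only, here is a minimal sketch using the Python package statsmodels (assuming it is installed; the data are simulated, and the column names y, x, and group are hypothetical), which fits a random-intercept-and-slope model by REML.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)

# Simulate a small two-level data set with hypothetical parameter values
m, n = 30, 10
g = np.repeat(np.arange(m), n)                  # group labels
x = rng.uniform(0.0, 1.0, m * n)
a = rng.normal(0.0, 1.0, m)                     # random intercepts a_j
b = rng.normal(0.0, 0.5, m)                     # random slopes b_j
y = 1.0 + a[g] + (0.5 + b[g]) * x + rng.normal(0.0, 1.0, m * n)
data = pd.DataFrame({"y": y, "x": x, "group": g})

# Random intercept and random slope for x in each group; REML is the default
model = smf.mixedlm("y ~ x", data, groups=data["group"], re_formula="~x")
result = model.fit(reml=True)
print(result.summary())                         # fixed effects and variance components
```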
2.4 Identifiability

In general, we want to ensure that the model of interest is identifiable. This simply means that once we know the probability distribution (or likelihood) $L(\theta; y, X)$ of our model, we also necessarily know the parameter $\theta$. This condition is therefore equivalent to the requirement that the map
\[
L(\,\cdot\,; y, X) : \Theta \to \mathbb{R}_+,
\]
is injective, or one-to-one. In that case, $L(\theta_1; y, X) = L(\theta_2; y, X)$, for all values of $y$, implies that $\theta_1 = \theta_2$ (see Demidenko, 2004, p. 117). An example of a non-identifiable model would be a linear regression with two intercepts,
\[
y_i = \alpha_1 + \alpha_2 + \beta x_i + e_i, \qquad i = 1, \dots, n.
\]
If one assumes, in addition, that $e_i \sim N(0, \sigma^2)$, we can compute the likelihood function of this model. However, since only the sum $\alpha_1 + \alpha_2$ enters the mean function, several combinations of the parameters $\alpha_1$ and $\alpha_2$ assign the same probability to the data, and the model is therefore not identifiable.

The injectivity requirement above constitutes the formal definition of identifiability. In statistical practice, however, this term is usually used in a more general sense. Often, one may use the phrase degrees of identifiability, which could be interpreted as the number of degrees of freedom still available after fitting the specified parameters.

3 Regression Diagnostics: Residuals

3.1 Errors and Residuals

In standard multiple regression, we make the following assumptions on the error terms,
\[
\mathrm{E}[e \mid X] = 0, \qquad \mathrm{Var}[e \mid X] = \sigma^2 I.
\]
This should be contrasted with the moments of the residuals, which were found to be
\[
\mathrm{E}[\widehat{e} \mid X] = 0, \qquad \mathrm{Var}[\widehat{e} \mid X] = \sigma^2 (I - H).
\]
Observe that the moments of the error terms do not depend on the values of the $x_i$'s, whereas the residuals depend on the $x_i$'s through the hat matrix. Although the variance of the error terms is assumed to be homoscedastic, the variances of the residuals differ, depending on the diagonal entries of $H$. Indeed, the variance of each individual residual is given by
\[
\mathrm{Var}[\widehat{e}_i \mid X] = \sigma^2 (1 - h_{ii}),
\]
where $h_{ii}$ is the $i$th diagonal entry of $H$; and the covariance of any two distinct residuals is given by the corresponding off-diagonal entry of $\sigma^2(I - H)$,
\[
\mathrm{Cov}[\widehat{e}_i, \widehat{e}_j \mid X] = -\sigma^2 h_{ij}, \qquad i \neq j.
\]
Thus, the properties of the residuals are intrinsically linked with those of the hat matrix.

3.2 Reminder: Properties of the Hat Matrix

Recall that for any multiple regression, the hat matrix is defined as $H := X(X^TX)^{-1}X^T$. It satisfies
\[
\widehat{y} = X\widehat{\beta} = Hy, \qquad \widehat{e} = y - \widehat{y} = (I - H)y.
\]
Putting these two equalities together, we obtain
\[
y = \widehat{y} + \widehat{e} = Hy + (I - H)y.
\]
Since $H$ is an orthogonal projection onto the column space of $X$, it also follows that
\[
HX = X \qquad \text{and} \qquad (I - H)X = 0,
\]
where $0$ is here a matrix of order $n \times p$.
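These properties are easy to verify numerically. The following sketch (numpy only; the design matrix is hypothetical) computes $H$ for a small simple-regression design, checks that $HX = X$ and that $H$ is idempotent, and confirms the residual variances and the simple-regression leverage formula derived in the next subsection.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 6
x = rng.uniform(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])       # intercept plus one predictor

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix H = X (X^T X)^{-1} X^T
h = np.diag(H)                             # leverages h_ii

print(np.allclose(H @ X, X))               # HX = X: True
print(np.allclose(H @ H, H))               # idempotent projection: True

sigma2 = 1.0                               # hypothetical error variance
V = sigma2 * (np.eye(n) - H)               # Var[e_hat | X] = sigma^2 (I - H)
print(np.allclose(np.diag(V), sigma2 * (1 - h)))   # True

# Simple-regression leverage formula: h_ii = 1/n + (x_i - xbar)^2 / S_xx
xbar = x.mean()
print(np.allclose(h, 1/n + (x - xbar)**2 / ((x - xbar)**2).sum()))  # True
```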
3.3 Leverages

Let $x_i := [x_{i0}, \dots, x_{ip}]^T$ denote the $i$th row of $X$, here defined as a column vector. Each diagonal entry of the hat matrix takes the form
\[
h_{ii} = x_i^T (X^TX)^{-1} x_i,
\]
and, moreover, it can be shown that for any $i = 1, \dots, n$,
\[
\frac{1}{n} \leq x_i^T (X^TX)^{-1} x_i \leq \frac{1}{r},
\]
where $r$ is the number of rows of $X$ that are identical to $x_i$. This quantity is called the leverage, because it controls the relative importance of each data point, through the following relationships:

1. As $h_{ii} \to 0$, that particular data point becomes irrelevant in the computation of $\widehat{y}_i$, since the variance of its residual, $\mathrm{Var}[\widehat{e}_i \mid X] = \sigma^2(1 - h_{ii})$, is high.
2. As $h_{ii} \to 1$, the residual for that data point becomes increasingly close to 0, because its variance also approaches 0. Therefore, $\widehat{y}_i \to y_i$.

A good example of this upper bound arises in a standard ANOVA model with $m$ groups. In that case, all $n_j$ rows of $X$ corresponding to the $j$th group are identical, so the leverage of any point in that group is $h_{ii} = 1/n_j$.

For every data point, the fitted value can be shown to be a function of the form,
\[
\widehat{y}_i = \sum_{j=1}^n h_{ij} y_j = h_{ii} y_i + \sum_{j \neq i} h_{ij} y_j,
\]
for every $i = 1, \dots, n$. If the mean function includes an intercept, every $h_{ii}$ admits the following decomposition,
\[
h_{ii} = \frac{1}{n} + (x_i^\ast - \bar{x}^\ast)^T (X_c^T X_c)^{-1} (x_i^\ast - \bar{x}^\ast),
\]
where $x_i = [1, x_i^{\ast T}]^T$, the mean covariate profile is defined as $\bar{x}^\ast := \frac{1}{n} \sum_{i=1}^n x_i^\ast$, and $X_c$ is the corrected (mean-centered) matrix of predictors,
\[
X_c := \begin{bmatrix} (x_{11} - \bar{x}_1) & \cdots & (x_{1p} - \bar{x}_p) \\ \vdots & \ddots & \vdots \\ (x_{n1} - \bar{x}_1) & \cdots & (x_{np} - \bar{x}_p) \end{bmatrix}.
\]
In simple linear regression, this yields the following formula, where one can observe that the $h_{ii}$'s are only a function of the $x_i$'s,
\[
h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}.
\]

References

Demidenko, E. (2004). Mixed Models: Theory and Applications. Wiley, London.