Fitting mixed models in R

Contents

1 Packages
2 Specifying the variance-covariance matrix (nlme package)
  2.1 Illustration
  2.2 Technical point: why use as.numeric(time) in corSymm
3 Handling missing data (nlme package)
  3.1 Illustration
4 Methods for gls objects
  4.1 Testing linear hypotheses
  4.2 Analysis of variance
    4.2.1 Recall
    4.2.2 In R
  4.3 Getting confidence intervals for the coefficients
  4.4 Displaying the variance-covariance matrix
  4.5 Predictions
    4.5.1 Point estimates
    4.5.2 Confidence intervals
References
1 Packages

Several packages can be used in R to fit mixed models. In this course we will use the two following packages:

- nlme package: it allows specifying the form of the correlation structure between residuals, modelling a potential heteroscedasticity, and including random effects. It is limited to Gaussian variables but can handle non-linear relationships (e.g. Y = exp(βx)).
  Main functions: gls, lme, nlme
  Reference book: (Pinheiro and Bates 2000)

- lme4 package: it is a numerically more efficient alternative to nlme which is recommended for large datasets or when several random effects are considered. Contrary to nlme, the correlation structure between residuals can only be modelled through random effects, and there is no option for dealing with heteroscedasticity. However, lme4 can model non-Gaussian dependent variables (e.g. binary, Poisson, ...).
  Main functions: lmer, glmer, nlmer
  Reference book: (D. M. Bates 2010)

See http://glmm.wikidot.com/faq or http://glmm.wikidot.com/pkg-comparison for a broader overview. The first link also includes interesting discussions on the practical use of mixed models.
2 Specifying the variance-covariance matrix (nlme package)

gls, lme and nlme use two arguments to construct the variance-covariance matrix that will be used to fit the mixed model:

- argument correlation: specifies the form of the correlation matrix (e.g. correlation = corCompSymm(form = ~1 | animal)). It is composed of three parts:
  - the structure of the correlation matrix (e.g. corCompSymm). See ?corClasses for a list of the available structures.
  - the position variable form = ~1: usually only an intercept, but one may specify explanatory variables here if, for instance, the correlation between observation times is assumed to differ with age or geographical position.
  - the grouping variable animal. Here we indicate that observations are correlated within the same animal.

- argument weights: can be used to model a potential heteroscedasticity, e.g. a variance depending on some variables. Like the correlation argument, it is composed of three parts:
  - the structure of the heteroscedasticity (e.g. varIdent). See ?varClasses for a list of the available structures.
  - the position variable form = ~1: usually only an intercept, but one may specify explanatory variables here if, for instance, the variance is assumed to depend on a given variable.
  - the grouping variable time. Here we indicate that observations observed at the same time have a common variance.

2.1 Illustration:

library(nlme)

setkey(dtl.data, Id) # sort data by Id
print(dtl.data[1:11]) # only 3 observations for patient 1

    Id Gender Treatment       Age  time    outcome
 1:  1   Male       Yes 1.2155138 time1  1.2176709
 2:  1   Male       Yes 1.2155138 time2  4.4445477
 3:  1   Male       Yes 1.2155138 time3  6.4400121
 4:  2   Male       Yes 0.3308765 time1  1.2291346
 5:  2   Male       Yes 0.3308765 time2  6.3963192
 6:  2   Male       Yes 0.3308765 time3  6.7737205
 7:  2   Male       Yes 0.3308765 time4 11.2973947
 8:  3 Female       Yes 1.3902751 time1  5.2878323
 9:  3 Female       Yes 1.3902751 time2 -0.0268401
10:  3 Female       Yes 1.3902751 time3 14.8827780
11:  3 Female       Yes 1.3902751 time4  8.4201817

# compound symmetry
fm1 <- gls(outcome ~ Gender + Age + Treatment,
           correlation = corCompSymm(form = ~1 | Id),
           data = dtl.data)

# unstructured with constant variance
fm2 <- gls(outcome ~ Gender + Age + Treatment,
           correlation = corSymm(form = ~as.numeric(time) | Id),
           data = dtl.data)

# unstructured with different variances
fm3 <- gls(outcome ~ Gender + Age + Treatment,
           correlation = corSymm(form = ~as.numeric(time) | Id),
           weights = varIdent(form = ~1 | time),
           data = dtl.data)

2.2 Technical point: why use as.numeric(time) in corSymm

Using ~as.numeric(time) | Id in corSymm may seem like odd syntax. as.numeric(time) is there to indicate the time repetition. In the absence of missing values, one can also use ~1 | Id. However, when dealing with missing values, this last syntax can lead to unexpected behaviour (Galecki and Burzykowski 2013):

# adapted from example R11.5 page 207, chapter 11.4
Sigma <- corAR1(value = 0.3, form = ~1 | Id)
iSigma <- Initialize(Sigma, dtl.data)
corMatrix(iSigma)[[2]]

      [,1] [,2] [,3]  [,4]
[1,] 1.000 0.30 0.09 0.027
[2,] 0.300 1.00 0.30 0.090
[3,] 0.090 0.30 1.00 0.300
[4,] 0.027 0.09 0.30 1.000
dtl.dataNA <- dtl.data[-5,]
dtl.dataNA[dtl.dataNA$Id == 2,] # time 2 is missing

   Id Gender Treatment       Age  time   outcome
1:  2   Male       Yes 0.3308765 time1  1.229135
2:  2   Male       Yes 0.3308765 time3  6.773721
3:  2   Male       Yes 0.3308765 time4 11.297395

iSigmaNA <- Initialize(Sigma, dtl.dataNA)
corMatrix(iSigmaNA)[[2]] # not ok

     [,1] [,2] [,3]
[1,] 1.00  0.3 0.09
[2,] 0.30  1.0 0.30
[3,] 0.09  0.3 1.00

Valid syntax:

Sigma <- corAR1(value = 0.3, form = ~as.numeric(time) | Id)
iSigmaNA <- Initialize(Sigma, dtl.dataNA)
corMatrix(iSigmaNA)[[2]] # ok

      [,1] [,2]  [,3]
[1,] 1.000 0.09 0.027
[2,] 0.090 1.00 0.300
[3,] 0.027 0.30 1.000
3 Handling missing data (nlme package)

gls, lme and nlme read the argument na.action to know how to handle missing values:

- na.fail (default option) leads to an error in the presence of missing values.
- na.omit deals with missing values by removing the corresponding lines from the dataset.
- na.exclude ignores the missing values but, compared to na.omit, it returns outputs with the same number of observations as the original dataset.
- na.pass continues the execution of the function without any change. If the function cannot manage missing values, this will lead to an error.

3.1 Illustration:

library(nlme)
data(Orthodont, package = "nlme")
dataset.NA <- rbind(NA, Orthodont)
NROW(dataset.NA) # 109 observations

[1] 109

na.fail

attributes(try(
  lme.fail <- lme(distance ~ age, data = dataset.NA),
  silent = TRUE
))$condition

<simpleError in na.fail.default(structure(list(age = c(NA, 8, 10, 12, 14, 8, 10, 12,

# same as: fm1 <- lme(distance ~ age, data = dataset.NA, na.action = na.fail)

na.omit

lme.omit <- lme(distance ~ age, data = rbind(NA, Orthodont),
                na.action = na.omit)
NROW(predict(lme.omit)) # 108 fitted values

[1] 108

na.exclude

lme.exclude <- lme(distance ~ age, data = rbind(NA, Orthodont),
                   na.action = na.exclude)
NROW(predict(lme.exclude)) # 109 fitted values (1 NA + 108 others)
[1] 109

predict(lme.exclude)[1]

<NA>
  NA

na.pass

attributes(try(
  lme.pass <- lme(distance ~ age, data = rbind(NA, Orthodont),
                  na.action = na.pass),
  silent = TRUE
))$condition

<simpleError in if (max(tmpDims$ZXlen[[1L]]) < tmpDims$qvec[1L]) { warning(gettext
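The same na.action mechanics apply to any R model-fitting function, not just those from nlme. A minimal base-R sketch with lm on a toy dataset (the data and object names are hypothetical, chosen only for illustration):

```r
# toy data with one missing response in the first row
d <- data.frame(x = 1:10, y = c(NA, 2 * (2:10) + 0.1))

fit.omit    <- lm(y ~ x, data = d, na.action = na.omit)
fit.exclude <- lm(y ~ x, data = d, na.action = na.exclude)

length(fitted(fit.omit))    # 9: the incomplete row is dropped from the output
length(fitted(fit.exclude)) # 10: the output is padded with NA at position 1
```

Both fits use the same 9 complete rows; only the shape of the returned fitted/predicted vectors differs, which is exactly the na.omit vs na.exclude distinction illustrated above for lme.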
4 Methods for gls objects

methods(class = class(fm1)) # find available methods for gls objects

 [1] ACF              anova            augPred          coef
 [5] comparePred      fitted           formula          getData
 [9] getGroups        getGroupsFormula getResponse      getVarCov
[13] intervals        logLik           nobs             plot
[17] predict          print            qqnorm           residuals
[21] summary          update           Variogram        vcov
see '?methods' for accessing help and source code

4.1 Testing linear hypotheses

# Wald tests
summary(fm1)$tTable

                  Value Std.Error   t-value      p-value
(Intercept)   4.9483437 0.6411074  7.718432 4.090771e-11
GenderMale   -1.2508565 0.6965154 -1.795878 7.654148e-02
Age           0.4590189 0.3588027  1.279307 2.047327e-01
TreatmentYes  1.9377465 0.6991199  2.771694 7.028789e-03

# to test a linear combination of coefficients
library(multcomp, verbose = FALSE, quietly = TRUE)

Attaching package: 'TH.data'

The following object is masked from 'package:MASS':

    geyser

allCoef <- coef(fm1)

# create a contrast matrix
C <- matrix(0, nrow = 1, ncol = length(allCoef),
            dimnames = list(NULL, names(allCoef)))
C[1,"GenderMale"] <- 1
C[1,"TreatmentYes"] <- 1
summary(glht(fm1, linfct = C))
Simultaneous Tests for General Linear Hypotheses

Fit: gls(model = outcome ~ Gender + Age + Treatment, data = dtl.data,
    correlation = corCompSymm(form = ~1 | Id))

Linear Hypotheses:
       Estimate Std. Error z value Pr(>|z|)
1 == 0   0.6869     1.0400    0.66    0.509
(Adjusted p values reported -- single-step method)

# combined tests
C <- matrix(0, nrow = 2, ncol = length(allCoef),
            dimnames = list(NULL, names(allCoef)))
C[1,"GenderMale"] <- 1
C[2,"TreatmentYes"] <- 1
anova(fm1, L = C) # no point estimate

Denom. DF: 75
F-test for linear combination(s)
  GenderMale TreatmentYes
1          1            0
2          0            1
  numDF  F-value p-value
1     2 6.078253  0.0036

summary(glht(fm1, linfct = C)) # performs separate tests

Simultaneous Tests for General Linear Hypotheses

Fit: gls(model = outcome ~ Gender + Age + Treatment, data = dtl.data,
    correlation = corCompSymm(form = ~1 | Id))

Linear Hypotheses:
       Estimate Std. Error z value Pr(>|z|)
1 == 0  -1.2509     0.6965  -1.796   0.1393
2 == 0   1.9377     0.6991   2.772   0.0111 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
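Behind glht, the Wald test of a single linear combination Cβ = 0 is plain matrix algebra: the estimate is Cβ̂, its variance is C V Cᵀ where V = vcov of the coefficients, and the z statistic is their ratio. A minimal base-R sketch on lm with the built-in cars data (not the course dataset; object names are illustrative):

```r
fit <- lm(dist ~ speed, data = cars)

C   <- matrix(c(0, 1), nrow = 1)             # contrast selecting the speed coefficient
est <- drop(C %*% coef(fit))                 # C beta-hat
se  <- sqrt(drop(C %*% vcov(fit) %*% t(C)))  # sqrt(C V C')
z   <- est / se
p   <- 2 * pnorm(-abs(z))                    # Normal approximation

c(estimate = est, se = se, z = z, p = p)
```

For a single row of C this reproduces the usual coefficient test, up to the reference distribution: summary.lm reports a t statistic with residual degrees of freedom, while the Normal approximation above matches the z values reported by glht for gls fits.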
4.2 Analysis of variance

4.2.1 Recall

Denote SS(A) the sum of squares explained by the variable A.

Sequential anova = Type I Anova:
- SS(A) for factor A.
- SS(B | A) = SS(A, B) - SS(A) for factor B.
- SS(AB | A, B) = SS(A, B, AB) - SS(A, B) for the interaction AB.

It will give different results depending on which main effect is considered first. It tests the first factor without controlling for the other factor(s).

Marginal anova = Type II/III Anova (depending on whether we consider an interaction or not):
- SS(A | B) for factor A if no interaction, else SS(A | B, AB).
- SS(B | A) for factor B if no interaction, else SS(B | A, AB).

This type tests each main effect after controlling for the other effects.
(from http://goanna.cs.rmit.edu.au/~fscholer/anova.php)

4.2.2 In R

fmI <- gls(outcome ~ Gender + Age * Treatment,
           correlation = corCompSymm(form = ~1 | Id),
           data = dtl.data)

By default, sequential ANOVA:

anova(fmI) # equivalent to anova(fmI, type = "sequential")

Denom. DF: 74
              numDF   F-value p-value
(Intercept)       1 265.09108  <.0001
Gender            1   5.77437  0.0188
Age               1   3.21283  0.0772
Treatment         1   9.06673  0.0036
Age:Treatment     1   3.97663  0.0498
Marginal ANOVA:

anova(fmI, type = "marginal")

Denom. DF: 74
              numDF  F-value p-value
(Intercept)       1 73.59580  <.0001
Gender            1  5.33790  0.0237
Age               1  5.82131  0.0183
Treatment         1  7.37577  0.0082
Age:Treatment     1  3.97663  0.0498

This is equivalent to separate F tests:

n.coef <- length(coef(fmI))
Contrast <- matrix(0, nrow = 1, ncol = n.coef)
colnames(Contrast) <- names(coef(fmI))
Contrast1 <- Contrast2 <- Contrast3 <- Contrast

Contrast1[,"GenderMale"] <- 1
anova(fmI, L = Contrast1)

Denom. DF: 74
F-test for linear combination(s)
[1] 1
  numDF  F-value p-value
1     1 5.337903  0.0237

Contrast2[,"Age"] <- 1
anova(fmI, L = Contrast2)

Denom. DF: 74
F-test for linear combination(s)
[1] 1
  numDF  F-value p-value
1     1 5.821315  0.0183

Contrast3[,"TreatmentYes"] <- 1
anova(fmI, L = Contrast3)

Denom. DF: 74
F-test for linear combination(s)
[1] 1
  numDF  F-value p-value
1     1 7.375766  0.0082
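The order dependence of the sequential (Type I) decomposition is easy to verify on a plain lm fit with a deliberately unbalanced toy design (all data and object names here are hypothetical, chosen only for illustration):

```r
set.seed(42)
# unbalanced two-factor design: cell counts 5/10 for a1, 15/10 for a2
A <- factor(rep(c("a1", "a2"), times = c(15, 25)))
B <- factor(c(rep("b1", 5), rep("b2", 10), rep("b1", 15), rep("b2", 10)))
y <- rnorm(40) + (A == "a2") + 0.5 * (B == "b2")

# Type I sums of squares for A depend on whether A enters first or last
ssA.first <- anova(lm(y ~ A + B))["A", "Sum Sq"]  # SS(A)
ssA.last  <- anova(lm(y ~ B + A))["A", "Sum Sq"]  # SS(A | B)
c(ssA.first, ssA.last)  # differ because A and B are not orthogonal
```

In a balanced design the two quantities coincide and the distinction between sequential and marginal ANOVA disappears.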
4.3 Getting confidence intervals for the coefficients

intervals(fm1)

Approximate 95% confidence intervals

 Coefficients:
                   lower       est.     upper
(Intercept)    3.6711924  4.9483437 6.2254951
GenderMale    -2.6383863 -1.2508565 0.1366733
Age           -0.2557527  0.4590189 1.1737905
TreatmentYes   0.5450281  1.9377465 3.3304648
attr(,"label")
[1] "Coefficients:"

 Correlation structure:
         lower       est.     upper
Rho -0.2131575 -0.0923524 0.1059207
attr(,"label")
[1] "Correlation structure:"

 Residual standard error:
   lower     est.    upper
3.058183 3.590414 4.215271

intervals(fm1, which = "coef")

Approximate 95% confidence intervals

 Coefficients:
                   lower       est.     upper
(Intercept)    3.6711924  4.9483437 6.2254951
GenderMale    -2.6383863 -1.2508565 0.1366733
Age           -0.2557527  0.4590189 1.1737905
TreatmentYes   0.5450281  1.9377465 3.3304648
attr(,"label")
[1] "Coefficients:"
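intervals() is specific to nlme fits, but the underlying Wald construction, estimate ± quantile × standard error, can be reproduced for any model exposing coef() and vcov(). A base-R sketch on lm with the built-in cars data (note that intervals() uses a t quantile, so its bounds differ slightly from the Normal quantile used below):

```r
fit <- lm(dist ~ speed, data = cars)

est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))  # standard errors of the coefficients
z   <- qnorm(0.975)           # Normal 97.5% quantile

cbind(lower = est - z * se, est = est, upper = est + z * se)
```

This makes explicit what "Approximate 95% confidence intervals" means in the output above: a symmetric interval around the estimate on the scale of its standard error.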
4.4 Displaying the variance covariance matrix

getVarCov extracts the variance-covariance matrix, here for patient 8:

# compound symmetry
getVarCov(fm1, individual = "8")

Marginal variance covariance matrix
        [,1]    [,2]    [,3]    [,4]
[1,] 12.8910 -1.1905 -1.1905 -1.1905
[2,] -1.1905 12.8910 -1.1905 -1.1905
[3,] -1.1905 -1.1905 12.8910 -1.1905
[4,] -1.1905 -1.1905 -1.1905 12.8910
  Standard Deviations: 3.5904 3.5904 3.5904 3.5904

# unstructured with constant variance
getVarCov(fm2, individual = "8")

Marginal variance covariance matrix
        [,1]    [,2]    [,3]    [,4]
[1,] 12.9280  1.3088  2.9393 -6.4916
[2,]  1.3088 12.9280 -6.3470  3.7202
[3,]  2.9393 -6.3470 12.9280 -2.8363
[4,] -6.4916  3.7202 -2.8363 12.9280
  Standard Deviations: 3.5955 3.5955 3.5955 3.5955

# unstructured with different variances
getVarCov(fm3, individual = "8")

Marginal variance covariance matrix
        [,1]    [,2]    [,3]    [,4]
[1,] 19.1600  3.2908  4.9295 -8.6339
[2,]  3.2908 11.0640 -3.7638  2.2375
[3,]  4.9295 -3.7638 10.1660 -3.9176
[4,] -8.6339  2.2375 -3.9176 12.9630
  Standard Deviations: 4.3772 3.3263 3.1884 3.6004

getVarCov(fm3, individual = "1") # patient with no observation at time 4

Marginal variance covariance matrix
        [,1]    [,2]    [,3]
[1,] 19.1600  3.2908  4.9295
[2,]  3.2908 11.0640 -3.7638
[3,]  4.9295 -3.7638 10.1660
  Standard Deviations: 4.3772 3.3263 3.1884
4.5 Predictions

4.5.1 Point estimates

When performing predictions, the safest way is to build a data.frame that precisely matches the format of the dataset used to train the model. Care must be taken in the presence of factors, since the predict function expects to find the same levels for the factor variables.

m.gls <- gls(outcome ~ Gender + Age + Treatment,
             correlation = corCompSymm(form = ~1 | Id),
             data = dtl.data)
str(dtl.data[,.(Gender, Age, Treatment)])

Classes 'data.table' and 'data.frame': 79 obs. of 3 variables:
 $ Gender   : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 1 1 1 ...
 $ Age      : num 1.216 1.216 1.216 0.331 0.331 ...
 $ Treatment: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
 - attr(*, ".internal.selfref")=<externalptr>

df.test <- data.frame(Gender = factor("Male", levels = c("Female","Male")),
                      Age = 25,
                      Treatment = factor("No", levels = c("No","Yes")))
predict(m.gls, newdata = df.test)

[1] 15.17296
attr(,"label")
[1] "Predicted values"

Note: when considering a mixed model, the prediction can be made either at the population level (unknown random effect) or at the individual level (known random effect).

Note: in the presence of missing values

dtl.dataNA <- dtl.data
dtl.dataNA[1:5, "outcome"] <- NA
m.gls <- gls(outcome ~ Gender + Age + Treatment,
             correlation = corCompSymm(form = ~1 | Id),
             data = dtl.dataNA, na.action = na.exclude)

there is a difference between:
pred <- predict(m.gls, newdata = df.test)
sum(is.na(pred)) # predicted values

[1] 0

and

fit <- predict(m.gls)
sum(is.na(fit)) # fitted values

[1] 5

4.5.2 Confidence intervals

Using the predictSE.gls function from the package AICcmodavg:

library(AICcmodavg, quietly = TRUE)
dfTempo <- data.frame(predictSE.gls(m.gls, newdata = dtl.dataNA))
quantile_norm <- qnorm(p = 0.975)
dfTempo$fit_lower <- dfTempo$fit - quantile_norm * dfTempo$se.fit
dfTempo$fit_upper <- dfTempo$fit + quantile_norm * dfTempo$se.fit
head(dfTempo)

       fit    se.fit fit_lower fit_upper
1 6.823494 0.8858756  5.087210  8.559778
2 6.823494 0.8858756  5.087210  8.559778
3 6.823494 0.8858756  5.087210  8.559778
4 6.289381 0.7280501  4.862429  7.716333
5 6.289381 0.7280501  4.862429  7.716333
6 6.289381 0.7280501  4.862429  7.716333

If we want to do it ourselves, we can follow "the general recipe for computing predictions from a linear or generalized linear model:
1. figure out the model matrix X corresponding to the new data;
2. matrix-multiply X by the parameter vector β to get the predictions;
3. extract the variance-covariance matrix of the parameters V;
4. compute XVX' to get the variance-covariance matrix of the predictions;
5. extract the diagonal of this matrix to get variances of predictions;
6. take the square-root of the variances to get the standard deviations (errors) of the predictions;
7. compute confidence intervals based on a Normal approximation;"
(from http://glmm.wikidot.com/faq)

# 1
Xmatrix <- model.matrix(~ Gender + Age + Treatment, dtl.dataNA)
# 2
predictions <- predict(m.gls, newdata = dtl.dataNA) # same as Xmatrix %*% coef(m.gls)
# 3
VCOV.beta <- vcov(m.gls)
# 4
VCOV.predictions <- Xmatrix %*% VCOV.beta %*% t(Xmatrix)
# 5
VAR.predictions <- diag(VCOV.predictions)
# 6
SE.predictions <- sqrt(VAR.predictions)
# 7
dtl.dataNA$predicted_lower <- predictions - quantile_norm * SE.predictions
dtl.dataNA$predicted_upper <- predictions + quantile_norm * SE.predictions

Check that both approaches agree:

identical(as.double(dfTempo$fit), as.double(predictions))

[1] TRUE

identical(as.double(dfTempo$se.fit), as.double(SE.predictions))

[1] TRUE
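The seven steps above can also be checked against a model where R returns the prediction standard errors directly. A base-R sketch with lm and the built-in cars data (not the course dataset; object names are illustrative):

```r
fit    <- lm(dist ~ speed, data = cars)
newdat <- data.frame(speed = c(10, 20))

X    <- model.matrix(~ speed, newdat)    # step 1: model matrix for the new data
pred <- drop(X %*% coef(fit))            # step 2: predictions X beta-hat
V    <- vcov(fit)                        # step 3: vcov of the parameters
se   <- sqrt(diag(X %*% V %*% t(X)))     # steps 4-6: sqrt of diag(X V X')
ci   <- cbind(fit   = pred,              # step 7: Normal-approximation CI
              lower = pred - qnorm(0.975) * se,
              upper = pred + qnorm(0.975) * se)

# matches the standard errors computed by predict.lm
ref <- predict(fit, newdata = newdat, se.fit = TRUE)
all.equal(unname(se), unname(ref$se.fit))  # TRUE
```

The agreement holds because predict.lm computes its se.fit from exactly this quadratic form; for gls the same algebra applies with the GLS coefficient covariance matrix, as in the code above.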
References

Bates, Douglas M. 2010. lme4: Mixed-Effects Modeling with R. http://lme4.r-forge.r-project.org/lMMwR/lrgprt.pdf.

Galecki, Andrzej, and Tomasz Burzykowski. 2013. Linear Mixed-Effects Models Using R: A Step-by-Step Approach. Springer.

Pinheiro, J. C., and D. M. Bates. 2000. Mixed-Effects Models in S and S-PLUS. Springer.