Fitting mixed models in R

Eleanore Daniel
6 years ago

Contents

1 Packages
2 Specifying the variance-covariance matrix (nlme package)
  2.1 Illustration
  2.2 Technical point: why to use as.numeric(time) in corSymm
3 Handling missing data (nlme package)
  3.1 Illustration
4 Methods for GLS objects
  4.1 Test linear hypothesis
  4.2 Analysis of variance
    4.2.1 Recall
    4.2.2 In R
  4.3 Getting confidence intervals for the coefficients
  4.4 Displaying the variance covariance matrix
  4.5 Predictions
    4.5.1 Punctual estimate
    4.5.2 Confidence intervals
References

1 Packages

Several packages can be used in R to fit mixed models. In this course we will use the following two packages:

- nlme package: it makes it possible to specify the form of the correlation structure between residuals, to model potential heteroscedasticity, and to include random effects. It is limited to Gaussian outcomes but can handle non-linear relationships (e.g. Y ∝ exp(βx)).
  Main functions: gls, lme, nlme
  Reference book: (Pinheiro and Bates 2000)

- lme4 package: it is a numerically more efficient alternative to nlme, recommended for large datasets or when several random effects are considered. Contrary to nlme, the correlation structure between residuals can only be modeled through random effects, and there is no option for dealing with heteroscedasticity. However, lme4 can model non-Gaussian dependent variables (e.g. binary, Poisson, ...).
  Main functions: lmer, glmer, nlmer
  Reference book: (D. M. Bates 2010)

See … or … for a broader overview. The first link also includes an interesting discussion of the practical use of mixed models.
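As a minimal sketch of how the nlme functions listed above relate to each other, the code below fits the same growth curve once as a marginal model (gls with a compound-symmetry correlation) and once as a mixed model (lme with a random intercept). It uses the Orthodont data bundled with nlme rather than the course data; the object names fit_gls and fit_lme are ours.

```r
library(nlme)  # recommended package, shipped with standard R installations

# Orthodont: growth data bundled with nlme (distance measured at ages 8, 10, 12, 14)
data(Orthodont, package = "nlme")

# Marginal model: compound-symmetry correlation within Subject
fit_gls <- gls(distance ~ age,
               correlation = corCompSymm(form = ~ 1 | Subject),
               data = Orthodont)

# Mixed model: random intercept per Subject
fit_lme <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)

# Fixed-effect estimates from the two fits, side by side
cbind(gls = coef(fit_gls), lme = fixef(fit_lme))
```

A random intercept induces a compound-symmetry marginal covariance when the within-subject correlation is positive, which is why the two fits are expected to give matching fixed effects on these data.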

2 Specifying the variance-covariance matrix (nlme package)

gls, lme and nlme use two arguments to construct the variance-covariance matrix that will be used to fit the mixed model:

- argument correlation: specifies the correlation structure (e.g. correlation = corCompSymm(form = ~1 | animal)). It is composed of three parts:
    - the structure of the correlation matrix (e.g. corCompSymm). See ?corClasses for a list of the available structures.
    - the position variable, form = ~1: usually only an intercept, but one may specify explanatory variables here if, for instance, the correlation between observation times is assumed to differ with age or geographical position.
    - the grouping variable, | animal: here we indicate that observations are correlated within the same animal.

- argument weights: can be used to model potential heteroscedasticity, e.g. a variance that depends on some variables. Like the correlation argument, it is composed of three parts:
    - the structure of the heteroscedasticity (e.g. varIdent). See ?varClasses for a list of the available structures.
    - the position variable, form = ~1: usually only an intercept, but one may specify explanatory variables here if, for instance, the variance is assumed to depend on a given variable.
    - the grouping variable, | time: here we indicate that observations made at the same time have a common variance.

2.1 Illustration:

    library(nlme)
    library(data.table)    # provides setkey()
    setkey(dtl.data, Id)   # sort data by Id
    print(dtl.data[1:11])  # only 3 observations for patient 1

        Id Gender Treatment Age time outcome
    [11 rows: Id 1 (Male, treated) has 3 rows; Id 2 (Male, treated) and Id 3 (Female, treated) have 4 rows each; the Age, time and outcome values are not preserved]

    # compound symmetry
    fm1 <- gls(outcome ~ Gender + Age + Treatment,
               correlation = corCompSymm(form = ~1 | Id),
               data = dtl.data)

    # unstructured with constant variance
    fm2 <- gls(outcome ~ Gender + Age + Treatment,
               correlation = corSymm(form = ~as.numeric(time) | Id),
               data = dtl.data)

    # unstructured with different variances
    fm3 <- gls(outcome ~ Gender + Age + Treatment,
               correlation = corSymm(form = ~as.numeric(time) | Id),
               weights = varIdent(form = ~1 | time),
               data = dtl.data)

2.2 Technical point: why to use as.numeric(time) in corSymm

Using ~as.numeric(time) | Id in corSymm may seem a weird syntax. as.numeric(time) indicates the position of each observation in the sequence of repetitions. In the absence of missing values, one can also use ~1 | Id. However, when dealing with missing values, this latter syntax can lead to unexpected behavior (Galecki and Burzykowski 2013):

    # adapted from example R11.5 page 207, chapter 11.4
    Sigma <- corAR1(value = 0.3, form = ~1 | Id)
    Isigma <- Initialize(Sigma, dtl.data)
    corMatrix(Isigma)[[2]]

    [4 x 4 correlation matrix for patient 2; numeric values not preserved]

    dtl.datana <- dtl.data[-5,]
    dtl.datana[dtl.datana$Id == 2,]  # time 2 is missing

       Id Gender Treatment Age time outcome
    [3 rows for Id 2 (Male, treated); the Age, time and outcome values are not preserved]

    IsigmaNA <- Initialize(Sigma, dtl.datana)
    corMatrix(IsigmaNA)[[2]]  # not ok

    [3 x 3 correlation matrix; numeric values not preserved]

Valid syntax:

    Sigma <- corAR1(value = 0.3, form = ~as.numeric(time) | Id)
    IsigmaNA <- Initialize(Sigma, dtl.datana)
    corMatrix(IsigmaNA)[[2]]  # ok

    [3 x 3 correlation matrix; numeric values not preserved]
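To make the difference concrete, here is a sketch of the two correlation matrices implied for patient 2, who is observed at times 1, 3 and 4 once time 2 is removed. The symbols R_position and R_time are ours; ρ is the AR(1) parameter (0.3 in the example). With form = ~1 | Id the three remaining observations are treated as consecutive positions 1, 2, 3, whereas form = ~as.numeric(time) | Id uses the actual time indices:

```latex
R_{\text{position}} =
\begin{pmatrix}
1      & \rho & \rho^2 \\
\rho   & 1    & \rho   \\
\rho^2 & \rho & 1
\end{pmatrix},
\qquad
R_{\text{time}} =
\begin{pmatrix}
1      & \rho^2 & \rho^3 \\
\rho^2 & 1      & \rho   \\
\rho^3 & \rho   & 1
\end{pmatrix}.
```

In R_time the correlation between times 1 and 3 is ρ² (lag 2) and between times 1 and 4 is ρ³ (lag 3), as an AR(1) process requires; R_position overstates both.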

3 Handling missing data (nlme package)

gls, lme and nlme read the argument na.action to know how to handle missing values:

- na.fail (default option) leads to an error in the presence of missing values.
- na.omit deals with missing values by removing the corresponding rows from the dataset.
- na.exclude ignores the missing values but, unlike na.omit, returns outputs (e.g. fitted values) with the same number of observations as the original dataset.
- na.pass continues the execution of the function without any change. If the function cannot manage missing values, this leads to an error.

3.1 Illustration:

    library(nlme)
    data(Orthodont, package = "nlme")
    dataset.NA <- rbind(NA, Orthodont)
    NROW(dataset.NA)  # 109 observations

    [1] 109

na.fail

    attributes(try(
        lme.fail <- lme(distance ~ age, data = dataset.NA), silent = TRUE
    ))$condition

    <simpleError in na.fail.default(structure(list(age = c(NA, 8, 10, 12, 14, 8, 10, 12,

    # same as: fm1 <- lme(distance ~ age, data = dataset.NA, na.action = na.fail)

na.omit

    lme.omit <- lme(distance ~ age, data = rbind(NA, Orthodont), na.action = na.omit)
    NROW(predict(lme.omit))  # 108 fitted values

    [1] 108

na.exclude

    lme.exclude <- lme(distance ~ age, data = rbind(NA, Orthodont), na.action = na.exclude)
    NROW(predict(lme.exclude))  # 109 fitted values (1 NA, 108 others)

    [1] 109

    predict(lme.exclude)[1]

    <NA>
      NA

na.pass

    attributes(try(
        lme.pass <- lme(distance ~ age, data = rbind(NA, Orthodont), na.action = na.pass), silent = TRUE
    ))$condition

    <simpleError in if (max(tmpDims$ZXlen[[1L]]) < tmpDims$qvec[1L]) { warning(gettext
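The practical benefit of na.exclude is that residuals and fitted values line up with the rows of the original data. The sketch below shows this with a small made-up data frame and base R's lm instead of nlme, on the assumption that the na.action mechanics are the same; the names d, fit_omit and fit_excl are ours.

```r
# Small illustrative dataset with one missing response (made-up values)
d <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                y = c(1.1, NA, 2.9, 4.2, 5.1, 5.8))

fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)

length(residuals(fit_omit))  # 5: the incomplete row is dropped
length(residuals(fit_excl))  # 6: padded with NA at the incomplete row

# Because the lengths match, residuals can be stored next to the data
d$res <- residuals(fit_excl)
d
```

Both fits use the same five complete rows, so the estimated coefficients are identical; only the shape of the extracted quantities differs.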

4 Methods for GLS objects

    methods(class = class(fm1))  # find available methods for gls objects

     [1] ACF              anova            augPred          coef
     [5] comparePred      fitted           formula          getData
     [9] getGroups        getGroupsFormula getResponse      getVarCov
    [13] intervals        logLik           nobs             plot
    [17] predict          print            qqnorm           residuals
    [21] summary          update           Variogram        vcov
    see '?methods' for accessing help and source code

4.1 Test linear hypothesis

    # Wald tests
    summary(fm1)$tTable

                 Value Std.Error t-value p-value
    [rows: (Intercept), GenderMale, Age, TreatmentYes; numeric values not preserved]

    # to test a linear combination of coefficients
    library(multcomp, verbose = FALSE, quietly = TRUE)

    Attaching package: 'TH.data'
    The following object is masked from 'package:MASS':
        geyser

    allCoef <- coef(fm1)

    # create a contrast matrix
    C <- matrix(0, nrow = 1, ncol = length(allCoef),
                dimnames = list(NULL, names(allCoef)))
    C[1, "GenderMale"] <- 1
    C[1, "TreatmentYes"] <- 1
    summary(glht(fm1, linfct = C))

    Simultaneous Tests for General Linear Hypotheses

    Fit: gls(model = outcome ~ Gender + Age + Treatment, data = dtl.data,
        correlation = corCompSymm(form = ~1 | Id))

    Linear Hypotheses:
           Estimate Std. Error z value Pr(>|z|)
    1 == 0 [numeric values not preserved]
    (Adjusted p values reported -- single-step method)

    # combined tests
    C <- matrix(0, nrow = 2, ncol = length(allCoef),
                dimnames = list(NULL, names(allCoef)))
    C[1, "GenderMale"] <- 1
    C[2, "TreatmentYes"] <- 1
    anova(fm1, L = C)  # no punctual estimate

    Denom. DF: 75
     F-test for linear combination(s)
      GenderMale TreatmentYes
      numDF F-value p-value
    [numeric values not preserved]

    summary(glht(fm1, linfct = C))  # performs separate tests

    Simultaneous Tests for General Linear Hypotheses

    Fit: gls(model = outcome ~ Gender + Age + Treatment, data = dtl.data,
        correlation = corCompSymm(form = ~1 | Id))

    Linear Hypotheses:
           Estimate Std. Error z value Pr(>|z|)
    1 == 0 [numeric values not preserved]
    2 == 0 [numeric values not preserved; flagged *]
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    (Adjusted p values reported -- single-step method)
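For reference, the two kinds of output above can be read as Wald-type tests of the hypothesis Cβ = 0. The notation below is ours rather than the course's: β̂ is the vector of estimated coefficients, V̂ its estimated variance-covariance matrix (vcov(fm1)), q the number of rows of C, N the number of observations and p the number of coefficients. This is a sketch under the usual large-sample approximations, not a description of the exact internals of either package.

```latex
% Joint test of all rows of C at once (as reported by anova(fm1, L = C))
F = \frac{(C\hat\beta)^{\top} \left( C \hat V C^{\top} \right)^{-1} (C\hat\beta)}{q}
\;\sim\; F_{q,\; N - p} \quad \text{under } H_0 : C\beta = 0 .

% Separate test of one row c^T of C (as reported by glht)
z = \frac{c^{\top}\hat\beta}{\sqrt{c^{\top} \hat V c}}
\;\sim\; \mathcal{N}(0, 1) \quad \text{under } H_0 : c^{\top}\beta = 0 .
```

The value printed as Denom. DF corresponds to N − p. The joint F test gives a single p-value for both constraints and no estimate, whereas glht reports one estimate and one multiplicity-adjusted p-value per row of C.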

4.2 Analysis of variance

4.2.1 Recall

Denote by SS(A) the sum of squares explained by the variable A.

Sequential anova = Type I Anova:
- SS(A) for factor A.
- SS(B | A) = SS(A, B) - SS(A) for factor B.
- SS(AB | A, B) = SS(A, B, AB) - SS(A, B) for interaction AB.

It gives different results depending on which main effect is entered first: the first factor is tested without controlling for the other factor(s).

Marginal anova = Type II/III Anova (depending on whether an interaction is considered):
- SS(A | B) for factor A if there is no interaction, otherwise SS(A | B, AB).
- SS(B | A) for factor B if there is no interaction, otherwise SS(B | A, AB).

This type tests each main effect after controlling for the other effects.

(from …)

4.2.2 In R

    fmi <- gls(outcome ~ Gender + Age * Treatment,
               correlation = corCompSymm(form = ~1 | Id),
               data = dtl.data)

By default, sequential ANOVA:

    anova(fmi)  # equivalent to anova(fmi, type = "sequential")

    Denom. DF: 74
                  numDF F-value p-value
    [rows: (Intercept), Gender, Age, Treatment, Age:Treatment; only the intercept p-value, <.0001, is preserved]

Marginal ANOVA:

    anova(fmi, type = "marginal")

    Denom. DF: 74
                  numDF F-value p-value
    [rows: (Intercept), Gender, Age, Treatment, Age:Treatment; only the intercept p-value, <.0001, is preserved]

This is equivalent to separate F tests:

    n.coef <- length(coef(fmi))
    Contrast <- matrix(0, nrow = 1, ncol = n.coef)
    colnames(Contrast) <- names(coef(fmi))
    Contrast1 <- Contrast2 <- Contrast3 <- Contrast

    Contrast1[, "GenderMale"] <- 1
    anova(fmi, L = Contrast1)

    Denom. DF: 74
     F-test for linear combination(s)
    [1] 1
      numDF F-value p-value
    [numeric values not preserved]

    Contrast2[, "Age"] <- 1
    anova(fmi, L = Contrast2)

    Denom. DF: 74
     F-test for linear combination(s)
    [1] 1
      numDF F-value p-value
    [numeric values not preserved]

    Contrast3[, "TreatmentYes"] <- 1
    anova(fmi, L = Contrast3)

    Denom. DF: 74
     F-test for linear combination(s)
    [1] 1
      numDF F-value p-value
    [numeric values not preserved]
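The relationship between the two tables can be checked on data that ship with nlme. The sketch below uses Orthodont and a model without interaction (both are our choices, not the course's), and compares the sequential table, the marginal table and the squared Wald t statistics.

```r
library(nlme)
data(Orthodont, package = "nlme")

fit <- gls(distance ~ Sex + age,
           correlation = corCompSymm(form = ~ 1 | Subject),
           data = Orthodont)

a_seq <- anova(fit)                     # sequential (type I)
a_mar <- anova(fit, type = "marginal")  # marginal

# The last term entered is adjusted for all other terms in both tables,
# so its F statistic should agree
c(sequential = tail(a_seq[["F-value"]], 1),
  marginal   = tail(a_mar[["F-value"]], 1))

# With one degree of freedom per term, the marginal F is the squared Wald t
t_sq <- summary(fit)$tTable[, "t-value"]^2
cbind(marginal_F = a_mar[["F-value"]], t_squared = unname(t_sq))
```

Swapping the order of Sex and age in the formula changes the sequential table but leaves the marginal one unchanged, which is a quick way to see which type a given output corresponds to.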

4.3 Getting confidence intervals for the coefficients

    intervals(fm1)

    Approximate 95% confidence intervals

     Coefficients:
                 lower est. upper
    [rows: (Intercept), GenderMale, Age, TreatmentYes; numeric values not preserved]
    attr(,"label")
    [1] "Coefficients:"

     Correlation structure:
        lower est. upper
    Rho [numeric values not preserved]
    attr(,"label")
    [1] "Correlation structure:"

     Residual standard error:
    lower est. upper
    [numeric values not preserved]

    intervals(fm1, which = "coef")

    Approximate 95% confidence intervals

     Coefficients:
                 lower est. upper
    [rows: (Intercept), GenderMale, Age, TreatmentYes; numeric values not preserved]
    attr(,"label")
    [1] "Coefficients:"

4.4 Displaying the variance covariance matrix

getVarCov extracts the variance-covariance matrix for a given individual, here patient 8:

    # compound symmetry
    getVarCov(fm1, individual = "8")

    Marginal variance covariance matrix
    [4 x 4 matrix; numeric values not preserved]
      Standard Deviations: [values not preserved]

    # unstructured with constant variance
    getVarCov(fm2, individual = "8")

    Marginal variance covariance matrix
    [4 x 4 matrix; numeric values not preserved]
      Standard Deviations: [values not preserved]

    # unstructured with different variances
    getVarCov(fm3, individual = "8")

    Marginal variance covariance matrix
    [4 x 4 matrix; numeric values not preserved]
      Standard Deviations: [values not preserved]

    getVarCov(fm3, individual = "1")  # patient with no observation at time 4

    Marginal variance covariance matrix
    [3 x 3 matrix; numeric values not preserved]
      Standard Deviations: [values not preserved]
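As a reading aid, the three matrices returned for fm1, fm2 and fm3 have the following forms for a patient i observed at all four times. The notation is ours: σ² is the residual variance, ρ the compound-symmetry correlation, R an unstructured correlation matrix, and δ_j the standard-deviation ratio that varIdent estimates for time j relative to the reference time.

```latex
% fm1: compound symmetry, constant variance
\Sigma_i = \sigma^2 \left[ (1 - \rho)\, I_4 + \rho\, \mathbf{1}_4 \mathbf{1}_4^{\top} \right]

% fm2: unstructured correlation, constant variance
\Sigma_i = \sigma^2 R

% fm3: unstructured correlation, one variance per time (varIdent)
\Sigma_i = \sigma^2 \, \Delta R \Delta ,
\qquad \Delta = \operatorname{diag}(\delta_1, \delta_2, \delta_3, \delta_4),
\quad \delta_1 = 1
```

For a patient with a missing time, the matrix is the corresponding submatrix, which is why patient 1 gets a 3 x 3 matrix. The reported standard deviations are the square roots of the diagonal of Σ_i, so they are all equal for fm1 and fm2 and may differ across times for fm3.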

4.5 Predictions

4.5.1 Punctual estimate

When performing prediction, the safest way is to build a data.frame that precisely matches the format of the dataset used to fit the model. Care must be taken with factors, since the predict function expects to find the same levels for the factor variables.

    m.gls <- gls(outcome ~ Gender + Age + Treatment,
                 correlation = corCompSymm(form = ~1 | Id),
                 data = dtl.data)
    str(dtl.data[, .(Gender, Age, Treatment)])

    Classes 'data.table' and 'data.frame': 79 obs. of 3 variables:
     $ Gender   : Factor w/ 2 levels "Female","Male": ...
     $ Age      : num ...
     $ Treatment: Factor w/ 2 levels "No","Yes": ...
     - attr(*, ".internal.selfref")=<externalptr>

    df.test <- data.frame(Gender = factor("Male", levels = c("Male", "Female")),
                          Age = 25,
                          Treatment = factor("No", levels = c("No", "Yes")))
    predict(m.gls, newdata = df.test)

    [predicted value not preserved]
    attr(,"label")
    [1] "Predicted values"

Note: when considering a mixed model, the prediction can be made either at the population level (unknown random effect) or at the individual level (known random effect).

Note: in the presence of missing values

    dtl.datana <- dtl.data
    dtl.datana[1:5, "outcome"] <- NA
    m.gls <- gls(outcome ~ Gender + Age + Treatment,
                 correlation = corCompSymm(form = ~1 | Id),
                 data = dtl.datana, na.action = na.exclude)

there is a difference between:

    pred <- predict(m.gls, newdata = df.test)
    sum(is.na(pred))  # predicted values

    [1] 0

and

    fit <- predict(m.gls)
    sum(is.na(fit))  # fitted values: NA for each of the 5 rows excluded by na.exclude

    [1] 5

4.5.2 Confidence intervals

Using the predictSE.gls function from the package AICcmodavg:

    library(AICcmodavg, quietly = TRUE)
    dftempo <- data.frame(predictSE.gls(m.gls, newdata = dtl.datana))
    quantile_norm <- qnorm(p = 0.975)
    dftempo$fit_lower <- dftempo$fit - quantile_norm * dftempo$se.fit
    dftempo$fit_upper <- dftempo$fit + quantile_norm * dftempo$se.fit
    head(dftempo)

      fit se.fit fit_lower fit_upper
    [numeric values not preserved]

If we want to do it ourselves, we can follow this recipe: "the general recipe for computing predictions from a linear or generalized linear model is to:
1- figure out the model matrix X corresponding to the new data;
2- matrix-multiply X by the parameter vector β to get the predictions;
3- extract the variance-covariance matrix of the parameters V;
4- compute X V X^T to get the variance-covariance matrix of the predictions;
5- extract the diagonal of this matrix to get variances of predictions;

6- take the square-root of the variances to get the standard deviations (errors) of the predictions;
7- compute confidence intervals based on a Normal approximation;" (from …)

    # 1
    Xmatrix <- model.matrix(~ Gender + Age + Treatment, dtl.datana)
    # 2
    predictions <- predict(m.gls, newdata = dtl.datana)
    # same as Xmatrix %*% coef(m.gls)
    # 3
    VCOV.beta <- vcov(m.gls)
    # 4
    VCOV.predictions <- Xmatrix %*% VCOV.beta %*% t(Xmatrix)
    # 5
    VAR.predictions <- diag(VCOV.predictions)
    # 6
    SE.predictions <- sqrt(VAR.predictions)
    # 7
    dtl.datana$predicted_lower <- predictions - quantile_norm * SE.predictions
    dtl.datana$predicted_upper <- predictions + quantile_norm * SE.predictions

Check that both approaches agree:

    identical(as.double(dftempo$fit), as.double(predictions))

    [1] TRUE

    identical(as.double(dftempo$se.fit), as.double(SE.predictions))

    [1] TRUE
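The same recipe can be tried on data bundled with nlme. The sketch below is self-contained and uses the Orthodont data with a hypothetical new_data frame of two profiles (both are our choices). It computes only the diagonal of X V X^T, which avoids building the full matrix when there are many rows to predict.

```r
library(nlme)
data(Orthodont, package = "nlme")

fit <- gls(distance ~ Sex + age,
           correlation = corCompSymm(form = ~ 1 | Subject),
           data = Orthodont)

# Hypothetical new observations; factor levels are copied from the training data
new_data <- data.frame(
  Sex = factor(c("Male", "Female"), levels = levels(Orthodont$Sex)),
  age = c(11, 11)
)

X    <- model.matrix(~ Sex + age, data = new_data)  # step 1: model matrix
pred <- drop(X %*% coef(fit))                       # step 2: point predictions
V    <- vcov(fit)                                   # step 3: covariance of the coefficients
se   <- sqrt(rowSums((X %*% V) * X))                # steps 4 to 6: sqrt of diag(X V X^T)
z    <- qnorm(0.975)

# step 7: Normal-approximation 95% confidence intervals
data.frame(new_data, fit = pred, lower = pred - z * se, upper = pred + z * se)
```

These intervals are for the mean response at the given covariate values; they do not include the residual variance and are therefore narrower than prediction intervals for a new individual observation.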

References

Bates, Douglas M. 2010. lme4: Mixed-effects modeling with R. org/lmmwr/lrgprt.pdf.

Galecki, Andrzej, and Tomasz Burzykowski. 2013. Linear Mixed-Effects Models Using R: A Step-by-Step Approach. Statistics. Springer.

Pinheiro, J. C., and D. M. Bates. 2000. Mixed Effects Models in S and S-PLUS.
