
Solution pigs exercise

Course repeated measurements - R exercise class 2

November 24, 2017

Contents

1 Question 1: Import data
  1.1 Data management
  1.2 Inspection of the dataset
2 Question 2: Descriptive statistics
  2.1 Raw data
  2.2 Summary statistics
3 Question 3: Modeling the group effect
  3.1 Fitting a model using an unstructured covariance matrix
  3.2 Inference on the mean parameters
  3.3 Technical note: nlme can give inaccurate results in small samples
  3.4 Inspection of the variance-covariance parameters
4 Question 5: Investigating the group effect in the first four weeks
5 Question 6: Modeling the treatment effect
  5.1 Definition of the new variables
    5.1.1 The treatment variable
    5.1.2 Number of weeks under treatment
    5.1.3 Interaction between time and treatment
  5.2 Model (a): non-parametric treatment effect
  5.3 Model (b): linear effect of the treatment
  5.4 Model (c): splitting the treatment effect into a linear effect and a non-linear effect
6 Question 7: Predicted weight profiles
  6.1 Compute individual predictions
  6.2 Graphical display
  6.3 Note

7 Question 8: Estimate of the difference in weight between the groups at the end of week 7
  7.1 Model (a)
  7.2 Model (b)
  7.3 Model (c)
8 Question 10: Specification of the covariance matrix (compound symmetry vs. unstructured)
  8.1 Comparison of the model fit
  8.2 Comparison of the fitted values

NOTE: This document contains an example of R code and related software output that answers the questions of the pigs exercise. The focus here is on the implementation using the R software and not on the interpretation - we refer to the SAS solution for a more detailed discussion of the results.

Load the packages that will be necessary for the analysis:

library(data.table)  # data management
library(nlme)        # models for repeated measurements (e.g. gls, lme)
library(ggplot2)     # graphical display
library(fields)      # graphical display: image.plot
library(multcomp)    # tests of linear hypotheses (glht function)
library(AICcmodavg)  # predictSE.gls

1 Question 1: Import data

1.1 Data management

We first specify the location of the data through a variable called path.data:

path.data <- "http://publicifsv.sund.ku.dk/~jufo/courses/rm2017/vitamin.txt"

Then we use the function fread to import the dataset:

dtl.vitamin <- fread(path.data, header = TRUE)
str(dtl.vitamin)

Classes 'data.table' and 'data.frame': 60 obs. of 4 variables:
 $ grp   : int 1 1 1 1 1 1 1 1 1 1 ...
 $ animal: int 1 1 1 1 1 1 2 2 2 2 ...
 $ week  : int 1 3 4 5 6 7 1 3 4 5 ...
 $ weight: int 455 460 510 504 436 466 467 565 610 596 ...
 - attr(*, ".internal.selfref")=<externalptr>

We recode the group variable using the function factor:

dtl.vitamin[, grp := factor(grp, levels = 1:2, labels = c("C","T"))]

and convert the animal and week variables to factor:

dtl.vitamin[, animal := as.factor(animal)]
dtl.vitamin[, week.factor := paste0("w", as.factor(week))]

1.2 Inspection of the dataset

The summary method provides useful information about the dataset:

summary(dtl.vitamin, maxsum = 10)

 grp    animal      week            weight       week.factor
 C:30   1 :6    Min.   :1.000   Min.   :436.0   Length:60
 T:30   2 :6    1st Qu.:3.000   1st Qu.:508.5   Class :character
        3 :6    Median :4.500   Median :565.0   Mode  :character
        4 :6    Mean   :4.333   Mean   :555.7
        5 :6    3rd Qu.:6.000   3rd Qu.:594.5
        6 :6    Max.   :7.000   Max.   :702.0
        7 :6
        8 :6
        9 :6
        10:6

We have a total of 60 observations divided into 2 groups of 30 observations each. Further insight can be obtained with the table function:

dtl.vitamin[, table(grp, animal)]

   animal
grp 1 2 3 4 5 6 7 8 9 10
  C 6 6 6 6 6 0 0 0 0  0
  T 0 0 0 0 0 6 6 6 6  6

Each group contains 5 animals and each animal has 6 measurements.

dtl.vitamin[, table(animal, week.factor)]

      week.factor
animal w1 w3 w4 w5 w6 w7
    1   1  1  1  1  1  1
    2   1  1  1  1  1  1
    3   1  1  1  1  1  1
    4   1  1  1  1  1  1
    5   1  1  1  1  1  1
    6   1  1  1  1  1  1
    7   1  1  1  1  1  1
    8   1  1  1  1  1  1
    9   1  1  1  1  1  1
    10  1  1  1  1  1  1

Each animal has been measured once at each of the 6 timepoints. Note that there are no missing values:

colSums(is.na(dtl.vitamin))

        grp      animal        week      weight week.factor
          0           0           0           0           0

2 Question 2: Descriptive statistics

2.1 Raw data

We can visualize the weight variable using a spaghetti plot:

gg.spaguetti <- ggplot(dtl.vitamin, aes(x = week.factor, y = weight,
                                        group = animal, color = animal))
gg.spaguetti <- gg.spaguetti + geom_line() + geom_point()
gg.spaguetti <- gg.spaguetti + facet_grid(~grp, labeller = label_both)
gg.spaguetti <- gg.spaguetti + xlab("week")
gg.spaguetti

Here we use ggplot2 instead of matplot since the data is in the long format. But one could convert dtl.vitamin to the wide format (e.g. using dcast) and use matplot.

The syntax of ggplot2 in the previous code chunk works as follows:
- first we specify the dataset to use and the variables corresponding to the x-axis and y-axis. We also specify that the points should be grouped and colored according to the variable animal.
- second we specify how to display the data: with points and lines.
- finally we request to split the elements to be plotted in two windows according to the variable grp.

[Figure: spaghetti plot of weight over weeks w1-w7, one line per animal, one panel per group (grp: C, grp: T)]
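As mentioned above, the same kind of display can be produced with matplot after reshaping to the wide format. A minimal sketch of this alternative (dtw is a hypothetical intermediate name; assumes dtl.vitamin has been prepared as in Question 1):

```r
## reshape to wide: one row per animal, one column per week
dtw <- dcast(dtl.vitamin, formula = grp + animal ~ week.factor,
             value.var = "weight")
## matplot draws one curve per column, so transpose the weight columns
## to get one curve per animal
matplot(t(dtw[, .(w1, w3, w4, w5, w6, w7)]), type = "b", lty = 1, pch = 16,
        xlab = "measurement occasion", ylab = "weight")
```

Unlike the ggplot2 version, splitting into one panel per group would here require two separate matplot calls.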

2.2 Summary statistics

We can compute the mean and standard deviation of the weight for each group at each time:

dt.descriptive <- dtl.vitamin[, .(n = .N, mean = mean(weight), sd = sd(weight)),
                              by = c("grp","week.factor")]
dt.descriptive

    grp week.factor n  mean       sd
 1:   C          w1 5 466.4 16.72722
 2:   C          w3 5 519.4 40.64234
 3:   C          w4 5 568.8 39.58788
 4:   C          w5 5 561.6 42.84040
 5:   C          w6 5 546.6 66.87900
 6:   C          w7 5 572.0 61.81828
 7:   T          w1 5 494.4 31.91081
 8:   T          w3 5 551.0 41.89272
 9:   T          w4 5 574.2 27.99464
10:   T          w5 5 567.0 62.06045
11:   T          w6 5 603.0 53.30572
12:   T          w7 5 644.0 57.54998

and plot it:

gg.mean <- ggplot(dt.descriptive, aes(x = week.factor, y = mean,
                                      group = grp, color = grp))
gg.mean <- gg.mean + geom_line() + geom_point()
gg.mean <- gg.mean + ylab("sample mean (weight)") + xlab("week")
gg.mean

[Figure: sample mean of weight by week, one curve per group]

gg.sd <- ggplot(dt.descriptive, aes(x = week.factor, y = sd,
                                    group = grp, color = grp))
gg.sd <- gg.sd + geom_line() + geom_point()
gg.sd <- gg.sd + ylab("sample standard deviation (weight)") + xlab("week")
gg.sd

[Figure: sample standard deviation of weight by week, one curve per group]

Instead of first computing the mean/variance and then plotting it, we could use ggplot to do both at the same time:

gg.mean2 <- ggplot(dtl.vitamin, aes(x = week.factor, y = weight,
                                    group = grp, color = grp))
gg.mean2 <- gg.mean2 + stat_summary(geom = "line", fun.y = mean,
                                    size = 3, fun.data = NULL)
gg.mean2

[Figure: mean weight by week.factor computed and drawn by stat_summary, one curve per group]

If we wanted to compute the correlation matrix, it would be easier to first move to the wide format:

dtw.vitamin <- dcast(dtl.vitamin, value.var = "weight",
                     formula = grp + animal ~ week.factor)
dtw.vitamin

    grp animal  w1  w3  w4  w5  w6  w7
 1:   C      1 455 460 510 504 436 466
 2:   C      2 467 565 610 596 542 587
 3:   C      3 445 530 580 597 582 619
 4:   C      4 485 542 594 583 611 612
 5:   C      5 480 500 550 528 562 576
 6:   T      6 514 560 565 524 552 597
 7:   T      7 440 480 536 484 567 569
 8:   T      8 495 570 569 585 576 677
 9:   T      9 520 590 610 637 671 702
10:   T     10 503 555 591 605 649 675

and then compute the correlation matrices relative to each group:

list("grp=T" = cor(dtw.vitamin[grp=="T", .(w1,w3,w4,w5,w6,w7)]),
     "grp=C" = cor(dtw.vitamin[grp=="C", .(w1,w3,w4,w5,w6,w7)]))

$`grp=T`
          w1        w3        w4        w5        w6        w7
w1 1.0000000 0.9505691 0.8271283 0.7324280 0.4525202 0.6711257
w3 0.9505691 1.0000000 0.8514018 0.8394614 0.4948230 0.8207430
w4 0.8271283 0.8514018 1.0000000 0.9521629 0.8698131 0.8880628
w5 0.7324280 0.8394614 0.9521629 1.0000000 0.8466142 0.9854188
w6 0.4525202 0.4948230 0.8698131 0.8466142 1.0000000 0.7803782
w7 0.6711257 0.8207430 0.8880628 0.9854188 0.7803782 1.0000000

$`grp=C`
            w1        w3        w4          w5        w6        w7
w1  1.00000000 0.2332189 0.2523425 -0.04856259 0.4263431 0.2441859
w3  0.23321886 1.0000000 0.9982323  0.93341452 0.7258512 0.8263880
w4  0.25234249 0.9982323 1.0000000  0.93923176 0.7595188 0.8489105
w5 -0.04856259 0.9334145 0.9392318  1.00000000 0.7265133 0.8502560
w6  0.42634312 0.7258512 0.7595188  0.72651331 1.0000000 0.9648446
w7  0.24418591 0.8263880 0.8489105  0.85025600 0.9648446 1.0000000

3 Question 3: Modeling the group effect

3.1 Fitting a model using an unstructured covariance matrix

We use the gls function to fit the mixed model, specifying the correlation and weights arguments to model the within-individual variability in weight using an unstructured covariance matrix:

gls.un <- gls(weight ~ week.factor + grp + grp:week.factor,
              data = dtl.vitamin,
              correlation = corSymm(form = ~1 | animal),
              weights = varIdent(form = ~1 | week.factor))
logLik(gls.un)

'log Lik.' -218.9236 (df=33)

3.2 Inference on the mean parameters

We can then extract the estimated coefficients:

summary(gls.un)$tTable

                   Value Std.Error    t-value      p-value
(Intercept)        466.4  11.39344 40.9358448 5.487679e-39
week.factorw3       53.0  13.58787  3.9005379 2.981543e-04
week.factorw4      102.4  13.55350  7.5552443 1.041548e-09
week.factorw5       95.2  20.38023  4.6711939 2.446059e-05
week.factorw6       80.2  24.73664  3.2421536 2.160099e-03
week.factorw7      105.6  23.37022  4.5185709 4.063580e-05
grpT                28.0  16.11275  1.7377538 8.866687e-02
week.factorw3:grpT   3.6  19.21615  0.1873424 8.521818e-01
week.factorw4:grpT -22.6  19.16754 -1.1790765 2.441793e-01
week.factorw5:grpT -22.6  28.82200 -0.7841233 4.368203e-01
week.factorw6:grpT  28.4  34.98290  0.8118253 4.208997e-01
week.factorw7:grpT  44.0  33.05048  1.3312967 1.893800e-01

their confidence intervals:

intervals(gls.un)[["coef"]]

                   lower  est.     upper
(Intercept)   443.491958 466.4 489.30804
week.factorw3  25.679757  53.0  80.32024
week.factorw4  75.148863 102.4 129.65114
week.factorw5  54.222804  95.2 136.17720
week.factorw6  30.463643  80.2 129.93636
week.factorw7  58.611022 105.6 152.58898
grpT           -4.396864  28.0  60.39686

week.factorw3:grpT -35.036658   3.6  42.23666
week.factorw4:grpT -61.138928 -22.6  15.93893
week.factorw5:grpT -80.550507 -22.6  35.35051
week.factorw6:grpT -41.937830  28.4  98.73783
week.factorw7:grpT -22.452450  44.0 110.45245
attr(,"label")
[1] "Coefficients:"

and the F-tests:

anova(gls.un, type = "marginal")

Denom. DF: 48
                numDF   F-value p-value
(Intercept)         1 1675.7434  <.0001
week.factor         5   42.9724  <.0001
grp                 1    3.0198  0.0887
week.factor:grp     5    5.2803  0.0006

3.3 Technical note: nlme can give inaccurate results in small samples

The last p-value does not match the one of the SAS output. Indeed, according to gls we have the following test:

1 - pf(5.2803, df1 = 5, df2 = 48)

[1] 0.0006067529

Here the degrees of freedom are clearly wrong. We perform a comparison between individuals, here the 10 pigs, so we do not really have 60 independent observations (minus 12 parameters) but something closer to 10 observations. The Satterthwaite approximation can be used to obtain a more sensible value for the degrees of freedom:

library(lavaSearch2) ## not (yet!) available on CRAN, see github.com/bozenne/lavaSearch2
system.time(
  df.satterthwaite <- dfVariance(gls.un, adjust.residuals = TRUE)
)

Loading required package: lava
lava version 1.5.1
Attaching package: 'lava'
The following object is masked from 'package:fields':
    surface

lavaSearch2 version 0.1.2
   user  system elapsed
 57.416   0.000  57.446

df.satterthwaite[names(coef(gls.un))]

       (Intercept)      week.factorw3      week.factorw4      week.factorw5      week.factorw6
          6.944444           6.944444           6.944444           6.944444           6.944444
     week.factorw7               grpT week.factorw3:grpT week.factorw4:grpT week.factorw5:grpT
          6.944444           6.944444           6.944444           6.944444           6.944444
week.factorw6:grpT week.factorw7:grpT
          6.944444           6.944444

We obtain something close to 7 degrees of freedom, so the p-value for the F-test of the interaction should be:

1 - pf(5.2803, df1 = 5, df2 = 7)

[1] 0.02505935

3.4 Inspection of the variance-covariance parameters

We can display the modeled variance-covariance matrix between the weight measurements within individuals using the getVarCov function:

Sigma.UN <- getVarCov(gls.un, individuals = 1)
Sigma.UN

Marginal variance covariance matrix
       [,1]    [,2]    [,3]    [,4]    [,5]    [,6]
[1,] 649.05  714.65  453.01  707.86  623.37  742.52
[2,] 714.65 1703.40 1302.30 1903.90 1539.00 2027.50
[3,] 453.01 1302.30 1175.50 1623.60 1654.50 1754.20
[4,] 707.86 1903.90 1623.60 2843.40 2441.20 2885.70
[5,] 623.37 1539.00 1654.50 2441.20 3657.20 3191.60
[6,] 742.52 2027.50 1754.20 2885.70 3191.60 3566.80
  Standard Deviations: 25.477 41.272 34.285 53.324 60.475 59.723

This matrix can be converted into a correlation matrix:

Cor.UN <- cov2cor(Sigma.UN)

A graphical representation of the correlation matrix can be obtained with the following code:

seqtime <- paste0("week", unique(dtl.vitamin$week))
seqtime.num <- as.numeric(as.factor(seqtime))
palette.z <- rev(heat.colors(12))

par(mar = c(4,4,5,5))
image(x = seqtime.num, y = seqtime.num, z = Cor.UN, main = "correlation matrix",
      axes = FALSE, col = palette.z, xlab = "", ylab = "")
axis(1, at = seqtime.num, labels = seqtime)
axis(2, at = seqtime.num, labels = seqtime, las = 2)
image.plot(x = seqtime.num, y = seqtime.num, z = Cor.UN,
           legend.only = TRUE, col = palette.z)

[Figure: heatmap of the correlation matrix between week1-week7, color scale from 0.4 to 1.0]

4 Question 5: Investigating the group effect in the first four weeks

We can create a new dataset containing the data of the first four weeks:

dt.tempo <- dtl.vitamin[week <= 4]
table(dt.tempo$week)

 1  3  4
10 10 10

So we can use a syntax similar to Question 3 to fit the mixed model using only the first weeks:

gls.un.w14 <- gls(weight ~ week.factor + grp:week.factor,
                  data = dtl.vitamin[week <= 4],
                  correlation = corSymm(form = ~1 | animal),
                  weights = varIdent(form = ~1 | week.factor))
logLik(gls.un.w14)

'log Lik.' -112.2367 (df=12)

We can then extract the estimated coefficients:

summary(gls.un.w14)$tTable

                   Value Std.Error    t-value      p-value
(Intercept)        466.4  11.39342 40.9359096 1.018777e-23
week.factorw3       53.0  13.58786  3.9005406 6.772109e-04
week.factorw4      102.4  13.55359  7.5551913 8.555732e-08
week.factorw1:grpT  28.0  16.11273  1.7377566 9.507109e-02
week.factorw3:grpT  31.6  26.10287  1.2105946 2.378384e-01
week.factorw4:grpT   5.4  21.68364  0.2490357 8.054520e-01

and the F-tests:

anova(gls.un.w14, type = "marginal")

Denom. DF: 24
                numDF   F-value p-value
(Intercept)         1 1675.7487  <.0001
week.factor         2   40.1472  <.0001
week.factor:grp     3    2.1888  0.1155

As before, the F-test should be computed with something close to 7 degrees of freedom instead of 24, e.g. for the interaction:

1 - pf(2.1888, df1 = 3, df2 = 7)

[1] 0.1772701

5 Question 6: Modeling the treatment effect

5.1 Definition of the new variables

We first define the new variables suggested in the exercise:

5.1.1 The treatment variable

This variable takes value:
- "No" in the control group.
- "No" in the treated group at week 4 and before.
- "Yes" in the treated group after week 4.

We can use the following syntax to obtain it:

dtl.vitamin[, treat := as.character(NA)]  # initialization to missing
dtl.vitamin[grp == "C", treat := "No"]
dtl.vitamin[week <= 4 & grp == "T", treat := "No"]
dtl.vitamin[week > 4 & grp == "T", treat := "Yes"]

We can display the result for the first observation of each group at each time:

dtl.vitamin[, .(treat = treat[1]), by = c("week","grp")]

    week grp treat
 1:    1   C    No
 2:    3   C    No
 3:    4   C    No
 4:    5   C    No
 5:    6   C    No
 6:    7   C    No
 7:    1   T    No
 8:    3   T    No
 9:    4   T    No
10:    5   T   Yes
11:    6   T   Yes
12:    7   T   Yes
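Since the three assignments above are mutually exclusive, the same variable can also be built in a single call. This is only a sketch; treat2 is a hypothetical name used here to check the equivalence:

```r
## treated pigs after week 4 get "Yes", everything else "No"
dtl.vitamin[, treat2 := ifelse(grp == "T" & week > 4, "Yes", "No")]
all(dtl.vitamin$treat2 == dtl.vitamin$treat)  # TRUE: both definitions agree
```

The multi-line version in the text is more explicit about the study design, which is why it is used in the solution.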

5.1.2 Number of weeks under treatment

This variable takes value:
- 0 when no treatment is given.
- 1 at week 5 when a treatment is given.
- 2 at week 6 when a treatment is given.
- 3 at week 7 when a treatment is given.

dtl.vitamin[, vitaweeks := as.integer(NA)]  # initialization to missing
dtl.vitamin[treat == "No", vitaweeks := 0]
dtl.vitamin[treat == "Yes" & week == 5, vitaweeks := 1]
dtl.vitamin[treat == "Yes" & week == 6, vitaweeks := 2]
dtl.vitamin[treat == "Yes" & week == 7, vitaweeks := 3]

We can display the result for the first observation of each group at each time:

dtl.vitamin[, .(vitaweeks = vitaweeks[1]), by = c("week","grp")]

    week grp vitaweeks
 1:    1   C         0
 2:    3   C         0
 3:    4   C         0
 4:    5   C         0
 5:    6   C         0
 6:    7   C         0
 7:    1   T         0
 8:    3   T         0
 9:    4   T         0
10:    5   T         1
11:    6   T         2
12:    7   T         3

A more concise syntax is:

setkeyv(dtl.vitamin, c("animal","week"))
dtl.vitamin[, vitaweeks2 := cumsum(treat == "Yes"), by = "animal"]

Here we count the number of weeks under treatment using the cumsum function. We can check that both definitions coincide:

all(dtl.vitamin$vitaweeks == dtl.vitamin$vitaweeks2)

[1] TRUE
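Note that cumsum counts within each animal in row order, which is why the data are first sorted by week within animal with setkeyv. The same idea can be sketched in base R with ave (vitaweeks3 is a hypothetical name introduced only for this check):

```r
## ave applies cumsum within each level of animal, preserving row order,
## so this reproduces vitaweeks on the sorted data
dtl.vitamin[, vitaweeks3 := ave(treat == "Yes", animal, FUN = cumsum)]
all(dtl.vitamin$vitaweeks3 == dtl.vitamin$vitaweeks)
```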

5.1.3 Interaction between time and treatment

To obtain interaction coefficients only at weeks 5, 6, and 7, we can define a new variable whose value is:
- "baseline" when the individual is not treated.
- the week number (e.g. w5, w6, w7) when the individual is treated.

dtl.vitamin[treat == "No", I.treat_week := "baseline"]
dtl.vitamin[treat == "Yes", I.treat_week := week.factor]

We can display the result for the first observation of each group at each time:

dtl.vitamin[, .(I.treat_week = I.treat_week[1]), by = c("week","grp")]

    week grp I.treat_week
 1:    1   C     baseline
 2:    3   C     baseline
 3:    4   C     baseline
 4:    5   C     baseline
 5:    6   C     baseline
 6:    7   C     baseline
 7:    1   T     baseline
 8:    3   T     baseline
 9:    4   T     baseline
10:    5   T           w5
11:    6   T           w6
12:    7   T           w7

We also define another interaction term with only 2 coefficients. Here we decided not to model an interaction at week 5:

dtl.vitamin[, I.treat_week67 := I.treat_week]
dtl.vitamin[week == 5, I.treat_week67 := "baseline"]

We can display the result for the first observation of each group at each time:

dtl.vitamin[, .(I.treat_week67 = I.treat_week67[1]), by = c("week","grp")]

    week grp I.treat_week67
 1:    1   C       baseline
 2:    3   C       baseline
 3:    4   C       baseline
 4:    5   C       baseline
 5:    6   C       baseline
 6:    7   C       baseline
 7:    1   T       baseline
 8:    3   T       baseline
 9:    4   T       baseline
10:    5   T       baseline
11:    6   T             w6
12:    7   T             w7

5.2 Model (a): non-parametric treatment effect

gls.un.a0 <- try(gls(weight ~ week.factor + treat:week.factor,
                     data = dtl.vitamin,
                     correlation = corSymm(form = ~1 | animal),
                     weights = varIdent(form = ~1 | week.factor)))

Error in glsEstimate(object, control = control) :
  computed "gls" fit is singular, rank 10

The gls function cannot fit the model since the model is not properly defined by the formula. To see that, let's look at how many coefficients gls is trying to estimate:

X <- model.matrix(weight ~ week.factor + treat:week.factor, data = dtl.vitamin)
summary(X)

  (Intercept) week.factorw3    week.factorw4    week.factorw5    week.factorw6    week.factorw7
 Min.   :1    Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
 1st Qu.:1    1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000
 Median :1    Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000
 Mean   :1    Mean   :0.1667   Mean   :0.1667   Mean   :0.1667   Mean   :0.1667   Mean   :0.1667
 3rd Qu.:1    3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000
 Max.   :1    Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
 week.factorw1:treatYes week.factorw3:treatYes week.factorw4:treatYes week.factorw5:treatYes
 Min.   :0              Min.   :0              Min.   :0              Min.   :0.00000
 1st Qu.:0              1st Qu.:0              1st Qu.:0              1st Qu.:0.00000
 Median :0              Median :0              Median :0              Median :0.00000
 Mean   :0              Mean   :0              Mean   :0              Mean   :0.08333
 3rd Qu.:0              3rd Qu.:0              3rd Qu.:0              3rd Qu.:0.00000
 Max.   :0              Max.   :0              Max.   :0              Max.   :1.00000
 week.factorw6:treatYes week.factorw7:treatYes
 Min.   :0.00000        Min.   :0.00000
 1st Qu.:0.00000        1st Qu.:0.00000
 Median :0.00000        Median :0.00000
 Mean   :0.08333        Mean   :0.08333
 3rd Qu.:0.00000        3rd Qu.:0.00000
 Max.   :1.00000        Max.   :1.00000

So gls is trying to estimate interactions before the start of treatment (e.g. week.factorw1:treatYes) even though they do not exist. The corresponding columns in the design matrix (X) contain only 0, making the design matrix singular.
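With the design matrix X computed above, the rank deficiency can be confirmed directly in base R (a small sketch):

```r
## qr() exposes the numerical rank of the design matrix
c(columns = ncol(X), rank = qr(X)$rank)  # rank is strictly smaller than ncol(X)
## the offending columns are exactly those that are identically zero
colnames(X)[colSums(abs(X)) == 0]
```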
We therefore need to manually define the interaction using the variable I.treat_week that we defined in the last subsection:

gls.un.a <- gls(weight ~ week.factor + I.treat_week,
                data = dtl.vitamin,
                correlation = corSymm(form = ~1 | animal),
                weights = varIdent(form = ~1 | week.factor))
logLik(gls.un.a)

'log Lik.' -232.0818 (df=30)

5.3 Model (b): linear effect of the treatment

gls.un.b <- gls(weight ~ week.factor + vitaweeks,
                data = dtl.vitamin,
                correlation = corSymm(form = ~1 | animal),
                weights = varIdent(form = ~1 | week.factor))
logLik(gls.un.b)

'log Lik.' -243.859 (df=28)

5.4 Model (c): splitting the treatment effect into a linear effect and a non-linear effect

Once again, if we try to fit the model with interactions, we get an overparametrized model. We therefore redefine the interactions such that there is one degree of freedom left for vitaweeks.

gls.un.c <- gls(weight ~ week.factor + vitaweeks + I.treat_week67,
                data = dtl.vitamin,
                correlation = corSymm(form = ~1 | animal),
                weights = varIdent(form = ~1 | week.factor))
logLik(gls.un.c)

'log Lik.' -232.0818 (df=30)

As suggested by the log-likelihood, this is the same model as (a) but parametrized in another way:

logLik(gls.un.a) - logLik(gls.un.c)

'log Lik.' 7.538006e-10 (df=30)
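Since the linear profile 1, 2, 3 of model (b) is a special case of the three free treatment coefficients of model (a), the two models are nested and can be compared with a likelihood ratio test. A sketch, assuming both models are refitted by maximum likelihood (REML log-likelihoods are not comparable when the fixed-effect structures differ); with only 10 animals the chi-square approximation should be interpreted cautiously:

```r
## refit with ML so that the likelihoods of models with different
## fixed effects are comparable
gls.un.a.ML <- update(gls.un.a, method = "ML")
gls.un.b.ML <- update(gls.un.b, method = "ML")
anova(gls.un.b.ML, gls.un.a.ML)  # LRT: linear vs. non-parametric treatment effect
```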

6 Question 7: Predicted weight profiles

6.1 Compute individual predictions

To compute the predicted profiles for all individuals, you can use the predict function:

dtl.vitamin[, weight.un.a := predict(gls.un.a, newdata = dtl.vitamin)]
dtl.vitamin[, weight.un.b := predict(gls.un.b, newdata = dtl.vitamin)]
dtl.vitamin[, weight.un.c := predict(gls.un.c, newdata = dtl.vitamin)]

6.2 Graphical display

We can directly display the predictions for a given model:

gg.prediction <- ggplot(dtl.vitamin, aes(x = week, y = weight.un.a,
                                         group = grp, color = grp))
gg.prediction <- gg.prediction + geom_point() + geom_line()
gg.prediction <- gg.prediction + ylab("model (a): week grptweek")
gg.prediction

[Figure: predicted weight profiles from model (a) by week, one curve per group]

To display the predictions of the three models on several panels, we need to stack the three prediction columns into the long format:

vec.name <- paste0("weight.", c("un.a","un.b","un.c"))
vec.name

[1] "weight.un.a" "weight.un.b" "weight.un.c"

dtl.prediction <- melt(dtl.vitamin, id.vars = c("grp","animal","week"),
                       value.name = "weight", variable.name = "model",
                       measure.vars = vec.name)
dtl.prediction

     grp animal week       model   weight
  1:   C      1    1 weight.un.a 480.4000
  2:   C      1    3 weight.un.a 535.2000
  3:   C      1    4 weight.un.a 571.5000
  4:   C      1    5 weight.un.a 570.4639
  5:   C      1    6 weight.un.a 537.8247
 ---
176:   T     10    3 weight.un.c 535.2000
177:   T     10    4 weight.un.c 571.5000
178:   T     10    5 weight.un.c 558.1363
179:   T     10    6 weight.un.c 611.7753
180:   T     10    7 weight.un.c 635.8557

We can then use facets to divide the window into three sub-windows, each displaying the result of a specific model:

gg.prediction2 <- ggplot(dtl.prediction, aes(x = week, y = weight,
                                             group = grp, color = grp))
gg.prediction2 <- gg.prediction2 + geom_point() + geom_line()
gg.prediction2 <- gg.prediction2 + facet_wrap(~model, labeller = label_both)
gg.prediction2

[Figure: predicted weight profiles by week and group, one panel per model (weight.un.a, weight.un.b, weight.un.c)]

6.3 Note

In the previous graph we have displayed the predicted values for all individuals. However we could only distinguish two curves for each model. This is because, given a group and a week, the predictions are the same for all individuals:
- we don't model individual-specific covariates like age.
- we use the marginal predictions and not predictions conditional on the random effects.

In other words, if we had already observed an individual at weeks 1 and 3, we could use these values to obtain a more accurate prediction at week 4 (prediction conditional on the individual random effect). Here we display the predicted values as if we were to perform prediction for a new individual (i.e. not already included in the study).

7 Question 8: Estimate of the difference in weight between the groups at the end of week 7

7.1 Model (a)

The estimated difference in weight is given by the interaction term:

CI.UN.a <- intervals(gls.un.a, which = "coef")
CI.UN.a[["coef"]]["I.treat_weekw7",]

   lower     est.    upper
17.15920 55.71152 94.26384

This matches the difference in predicted profiles:

dtl.vitamin[grp=="T" & week=="7", unique(weight.un.a)] -
  dtl.vitamin[grp=="C" & week=="7", unique(weight.un.a)]

[1] 55.71152

However, once again, the confidence intervals are computed using the wrong degrees of freedom:

beta <- summary(gls.un.a)$tTable["I.treat_weekw7","Value"]
sd.beta <- summary(gls.un.a)$tTable["I.treat_weekw7","Std.Error"]
CI.default <- c("lower" = beta + qt(0.025, df = 60-9) * sd.beta,
                "est." = beta,
                "upper" = beta + qt(0.975, df = 60-9) * sd.beta)
CI.default

   lower     est.    upper
17.15920 55.71152 94.26384

CI.corrected <- c("lower" = beta + qt(0.025, df = 7) * sd.beta,
                  "est." = beta,
                  "upper" = beta + qt(0.975, df = 7) * sd.beta)
CI.corrected

    lower      est.     upper
 10.30283  55.71152 101.12021

7.2 Model (b)

In this model the difference is three times the linear term:

coef(gls.un.b)["vitaweeks"] * 3

vitaweeks
 11.57111

One can check that this matches the difference in predicted profiles:

dtl.vitamin[grp=="T" & week=="7", unique(weight.un.b)] -
  dtl.vitamin[grp=="C" & week=="7", unique(weight.un.b)]

[1] 11.57111

To obtain the p-value and the standard error (and deduce the confidence interval) one can use the glht function. We first need to indicate that we are interested in 3 times the coefficient vitaweeks:

coef.un.b <- coef(gls.un.b)
C <- matrix(0, nrow = 1, ncol = length(coef.un.b),
            dimnames = list(NULL, names(coef.un.b)))
C[,"vitaweeks"] <- 3
C

     (Intercept) week.factorw3 week.factorw4 week.factorw5 week.factorw6 week.factorw7 vitaweeks
[1,]           0             0             0             0             0             0         3

and then call glht:

glht.un.b <- summary(glht(gls.un.b, linfct = C))
glht.un.b

Simultaneous Tests for General Linear Hypotheses

Fit: gls(model = weight ~ week.factor + vitaweeks, data = dtl.vitamin,
    correlation = corSymm(form = ~1 | animal),
    weights = varIdent(form = ~1 | week.factor))

Linear Hypotheses:
       Estimate Std. Error z value Pr(>|z|)
1 == 0    11.57      13.79   0.839    0.401
(Adjusted p values reported -- single-step method)

We can obtain the corresponding confidence interval using confint:

confint(glht(gls.un.b, linfct = C))

	 Simultaneous Confidence Intervals

Fit: gls(model = weight ~ week.factor + vitaweeks, data = dtl.vitamin, 
    correlation = corSymm(form = ~1 | animal), 
    weights = varIdent(form = ~1 | week.factor))

Quantile = 1.96
95% family-wise confidence level

Linear Hypotheses:
       Estimate lwr      upr     
1 == 0  11.5711 -15.4488  38.5910

Up to the use of a t-quantile instead of a normal quantile, this is simply three times the confidence interval of vitaweeks:

3*intervals(gls.un.b, which = "coef")[["coef"]]["vitaweeks",]

    lower      est.     upper 
-16.07991  11.57111  39.22212 

7.3 Model (c)

The results are the same as for model (a), but obtaining them is a bit more complex since the difference is the interaction term at week 7 plus three times the linear term. In this case glht greatly simplifies the implementation:

coef.un.c <- coef(gls.un.c)
C <- matrix(0, nrow = 1, ncol = length(coef.un.c),
            dimnames = list(NULL, names(coef.un.c)))
C[,"vitaweeks"] <- 3
C[,"I.treat_week67w7"] <- 1
C

     (Intercept) week.factorw3 week.factorw4 week.factorw5 week.factorw6 week.factorw7 vitaweeks
[1,]           0             0             0             0             0             0         3
     I.treat_week67w6 I.treat_week67w7
[1,]                0                1

glht.un.c <- summary(glht(gls.un.c, linfct = C))
glht.un.c

	 Simultaneous Tests for General Linear Hypotheses

Fit: gls(model = weight ~ week.factor + vitaweeks + I.treat_week67, data = dtl.vitamin, 
    correlation = corSymm(form = ~1 | animal), 
    weights = varIdent(form = ~1 | week.factor))

Linear Hypotheses:
       Estimate Std. Error z value Pr(>|z|)   
1 == 0    55.71      19.20   2.901  0.00372 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

8 Question 10: Specification of the covariance matrix (compound symmetry vs. unstructured)

8.1 Comparison of the model fit

Specifying a compound symmetry correlation matrix:

gls.cs <- gls(weight ~ week.factor + I.treat_week,
              data = dtl.vitamin,
              correlation = corCompSymm(form = ~1 | animal))
logLik(gls.cs)

'log Lik.' -258.8007 (df=11)

is equivalent to a "standard" mixed model fitted using lme:

e.lme <- lme(weight ~ week.factor + I.treat_week,
             data = dtl.vitamin,
             random = ~1 | animal)
logLik(e.lme)

'log Lik.' -258.8007 (df=11)

or lmer from the lme4 package:

library(lme4)
e.lmer <- lmer(weight ~ week.factor + I.treat_week + (1 | animal), data = dtl.vitamin)
logLik(e.lmer)

'log Lik.' -258.8007 (df=11)
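This equivalence is no coincidence: a shared random intercept with variance σ²_b plus an independent residual with variance σ²_ε induces exactly a compound symmetry covariance, with σ²_b + σ²_ε on the diagonal and σ²_b off the diagonal, so the within-animal correlation is ρ = σ²_b / (σ²_b + σ²_ε). A quick numerical illustration (a Python sketch, using the values 2196.278 and 1510.800 read off the fitted compound symmetry matrix shown below):

```python
import numpy as np

# Values read off the fitted compound symmetry covariance matrix
total_var = 2196.278   # diagonal: sigma_b^2 + sigma_eps^2
covar = 1510.800       # off-diagonal: sigma_b^2

sigma_b2 = covar                 # between-animal (random intercept) variance
sigma_eps2 = total_var - covar   # residual variance, about 685.5

# Random-intercept model implies Sigma = sigma_b^2 * J + sigma_eps^2 * I
n_weeks = 6
Sigma = sigma_b2 * np.ones((n_weeks, n_weeks)) + sigma_eps2 * np.eye(n_weeks)

# This reproduces the compound symmetry structure exactly
assert np.allclose(np.diag(Sigma), total_var)
assert np.isclose(Sigma[0, 1], covar)

# Within-animal correlation (intra-class correlation)
rho = sigma_b2 / total_var
print(round(rho, 3))  # 0.688
```

So under compound symmetry roughly 69% of the total variance is attributable to differences between animals.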

As we could expect, the variance-covariance structure is much simpler than in the previous models:

list("unstructured" = unclass(getVarCov(gls.un.a)),
     "compound symmetry" = unclass(getVarCov(gls.cs)))

$unstructured
         [,1]      [,2]      [,3]      [,4]      [,5]     [,6]
[1,] 794.7107  881.0221  444.6671  767.0827  417.5841  786.688
[2,] 881.0221 1791.5135 1205.0034 1847.9253 1213.9257 1945.202
[3,] 444.6671 1205.0034 1052.9483 1469.7741 1444.3001 1583.659
[4,] 767.0827 1847.9253 1469.7741 2676.8199 2114.2963 2692.863
[5,] 417.5841 1213.9257 1444.3001 2114.2963 3428.2684 2848.488
[6,] 786.6880 1945.2020 1583.6589 2692.8628 2848.4881 3346.580

$`compound symmetry`
         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
[1,] 2196.278 1510.800 1510.800 1510.800 1510.800 1510.800
[2,] 1510.800 2196.278 1510.800 1510.800 1510.800 1510.800
[3,] 1510.800 1510.800 2196.278 1510.800 1510.800 1510.800
[4,] 1510.800 1510.800 1510.800 2196.278 1510.800 1510.800
[5,] 1510.800 1510.800 1510.800 1510.800 2196.278 1510.800
[6,] 1510.800 1510.800 1510.800 1510.800 1510.800 2196.278

We can compare the two models using a likelihood ratio test:

anova(update(gls.cs, method = "REML"),
      update(gls.un.a, method = "REML"))

                                  Model df      AIC      BIC    logLik   Test  L.Ratio p-value
update(gls.cs, method = "REML")       1 11 539.6013 560.8514 -258.8007                        
update(gls.un.a, method = "REML")     2 30 524.1637 582.1184 -232.0818 1 vs 2 53.43764  <.0001

So the unstructured model gives a significantly better fit (p<0.0001).
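The numbers in the anova table fit together: AIC = -2·logLik + 2·df, the likelihood ratio statistic is twice the difference in log-likelihoods, and under the null hypothesis it follows a chi-squared distribution with 30 - 11 = 19 degrees of freedom. A Python sketch checking this arithmetic (values copied from the output above):

```python
from scipy.stats import chi2

# Log-likelihoods and numbers of parameters from the anova table
logLik_cs, df_cs = -258.8007, 11
logLik_un, df_un = -232.0818, 30

# AIC = -2*logLik + 2*df reproduces the AIC column
assert abs((-2 * logLik_cs + 2 * df_cs) - 539.6013) < 1e-3
assert abs((-2 * logLik_un + 2 * df_un) - 524.1637) < 1e-3

# Likelihood ratio statistic and its chi-squared p-value (df = 30 - 11 = 19)
lrt = 2 * (logLik_un - logLik_cs)
assert abs(lrt - 53.43764) < 1e-3
p = chi2.sf(lrt, df=df_un - df_cs)
assert p < 1e-4   # consistent with the reported p < .0001
print(f"LRT = {lrt:.2f} on 19 df")
```

Note that AIC would also favor the unstructured model here (524.2 vs 539.6), while BIC, which penalizes the 19 extra parameters more heavily, favors compound symmetry (560.9 vs 582.1).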

8.2 Comparison of the fitted values

Computation of the predicted values with confidence intervals using predictSE.gls (from the AICcmodavg package):

rescs.tempo <- predictSE.gls(gls.cs, newdata = dtl.vitamin)
dtl.vitamin[, weight.cs := rescs.tempo$fit]
dtl.vitamin[, weightinf.cs := rescs.tempo$fit - 1.96 * rescs.tempo$se.fit]
dtl.vitamin[, weightsup.cs := rescs.tempo$fit + 1.96 * rescs.tempo$se.fit]

resun.tempo <- predictSE.gls(gls.un.a, newdata = dtl.vitamin)
dtl.vitamin[, weight.un.a := resun.tempo$fit]
dtl.vitamin[, weightinf.un.a := resun.tempo$fit - 1.96 * resun.tempo$se.fit]
dtl.vitamin[, weightsup.un.a := resun.tempo$fit + 1.96 * resun.tempo$se.fit]

With the current dataset we could create one graph for each model, but putting the results from both models side by side makes it easier to visualize discrepancies between them. To do so we first convert the data to the long format. Since this involves reshaping several variables simultaneously, it may be easier to do it manually:

keep.colscs <- c("grp","animal","week","weight.cs","weightinf.cs","weightsup.cs")
keep.colsun <- c("grp","animal","week","weight.un.a","weightinf.un.a","weightsup.un.a")

dt.tempo1 <- dtl.vitamin[, .SD, .SDcols = keep.colscs]
setnames(dt.tempo1, old = names(dt.tempo1),
         new = c("grp","animal","week","estimate","lower","upper"))
dt.tempo1[, model := "CS"]

dt.tempo2 <- dtl.vitamin[, .SD, .SDcols = keep.colsun]
setnames(dt.tempo2, old = names(dt.tempo2),
         new = c("grp","animal","week","estimate","lower","upper"))
dt.tempo2[, model := "UN"]

dtl.prediction2 <- rbind(dt.tempo1, dt.tempo2)
dtl.prediction2

     grp animal week estimate    lower    upper model
  1:   C      1    1 480.4000 451.3531 509.4469    CS
  2:   C      1    3 535.2000 506.1531 564.2469    CS
  3:   C      1    4 571.5000 542.4531 600.5469    CS
  4:   C      1    5 571.0101 536.6110 605.4093    CS
  5:   C      1    6 556.0101 521.6110 590.4093    CS
 ---                                                 
116:   T     10    3 535.2000 508.9659 561.4341    UN
117:   T     10    4 571.5000 551.3878 591.6122    UN
118:   T     10    5 558.1361 522.8819 593.3904    UN
119:   T     10    6 611.7753 571.3432 652.2075    UN
120:   T     10    7 635.8558 595.3615 676.3500    UN
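The lower and upper columns are plain Wald bands, fit ± 1.96·se. As a sanity check, the standard error can be recovered from any displayed interval; a Python sketch for the first row (values copied from the table above):

```python
# First row of dtl.prediction2: estimate with its 95% Wald band
fit, lower, upper = 480.4000, 451.3531, 509.4469

# Recover the standard error from the half-width of the interval
se = (upper - lower) / (2 * 1.96)

# Rebuilding the band from fit and se reproduces the displayed values
assert abs(fit - 1.96 * se - lower) < 1e-6
assert abs(fit + 1.96 * se - upper) < 1e-6
print(round(se, 2))  # 14.82
```

This is the same construction applied programmatically in the data.table calls above via rescs.tempo$se.fit and resun.tempo$se.fit.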

melt also makes it possible to obtain a similar result in one operation. Note that the measure columns are taken directly from dtl.vitamin, where the predicted values were stored:

dtl.prediction2.bis <- melt(dtl.vitamin,
                            id.vars = c("grp","animal","week"),
                            measure.vars = patterns("weight\\.","weightinf\\.","weightsup\\."),
                            variable.name = "model",
                            value.name = c("estimate","lower","upper"))
dtl.prediction2.bis

(output: a 120-row data.table with columns grp, animal, week, model, estimate, lower, upper; the same content as dtl.prediction2, except that model is coded 1/2 instead of "CS"/"UN")
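For readers more used to pandas, the same multi-column wide-to-long reshape can be sketched with pd.wide_to_long. The data below are a hypothetical toy version of dtl.vitamin (two rows, made-up values), only the column-naming pattern matters:

```python
import pandas as pd

# Toy wide table: one estimate/lower/upper triple per model,
# suffixes "CS" and "UN" after a "." separator (made-up values)
wide = pd.DataFrame({
    "animal": [1, 1], "week": [1, 3],
    "weight.CS": [480.4, 535.2],    "weight.UN": [483.1, 533.0],
    "weightinf.CS": [451.4, 506.2], "weightinf.UN": [447.0, 507.1],
    "weightsup.CS": [509.4, 564.2], "weightsup.UN": [519.2, 558.9],
})

# Reshape all three variable groups to long format in one call
long = (pd.wide_to_long(wide,
                        stubnames=["weight", "weightinf", "weightsup"],
                        i=["animal", "week"], j="model",
                        sep=".", suffix=r"\w+")
        .reset_index()
        .rename(columns={"weight": "estimate",
                         "weightinf": "lower",
                         "weightsup": "upper"}))

print(long.shape)  # (4, 6): 2 rows x 2 models; animal, week, model, estimate, lower, upper
```

As with data.table's patterns(), the stub names select the column groups and the suffix after the separator becomes the model identifier.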

We can now use ggplot2 to display the predictions:

gg.predictionic <- ggplot(dtl.prediction2, aes(x = week, y = estimate, group = grp, color = grp))
gg.predictionic <- gg.predictionic + geom_point() + geom_line()
gg.predictionic <- gg.predictionic + geom_ribbon(aes(ymin = lower, ymax = upper, fill = grp), alpha = 0.33)
gg.predictionic <- gg.predictionic + facet_grid(grp ~ model, labeller = label_both)
gg.predictionic <- gg.predictionic + ylab("weight")
gg.predictionic

(figure: predicted weight over weeks with 95% confidence bands, one panel per combination of model (CS, UN) and group (C, T))