Solution: anti-fungal treatment exercise


Kory Austin

Course repeated measurements - R exercise class 5
December 5, 2017

Contents

1 Question 1: Import data
  1.1 Data management
  1.2 Inspection of the dataset
  1.3 Computation of the prevalence
2 Question 2: Scheduled vs. observed visit times
3 Question 3: Population average model
  3.1 Definition of the interaction
  3.2 Model fitting
  3.3 Inference
  3.4 Display of the fitted prevalence
4 Question 4: Continuous time population average model
  4.1 Model fitting
  4.2 Inference
5 Question 6: Subject specific model with random intercept
  5.1 Fitting the model
  5.2 Inference
6 Question 7: Subject specific model with random intercept and random slope
  6.1 Model fitting
  6.2 Inference

NOTE: This document contains an example of R code and related software outputs that answers the questions of the exercise. The focus here is on the implementation using the R software and not on the interpretation; we refer to the SAS solution for a more detailed discussion of the results.

Load the packages that will be necessary for the analysis:

    library(data.table) # data management
    library(lme4)       # subject specific model
    library(geepack)    # population average models
    library(doBy)       # esticon function
    library(ggplot2)    # graphical display

1 Question 1: Import data

1.1 Data management

We first specify the location of the data through a variable called path.data:

    path.data <- " jufo/courses/rm2017/onycholysis.txt"

Then we use the function fread to import the dataset:

    dtw.ony <- fread(path.data, header = TRUE, na.strings = ".")
    str(dtw.ony)

    Classes 'data.table' and 'data.frame': 294 obs. of 16 variables:
     $ id       : int
     $ treatment: int
     $ response1: int
     $ time1    : int
     $ response2: int
     $ time2    : num
     $ response3: int
     $ time3    : num
     $ response4: int
     $ time4    : num
     $ response5: int
     $ time5    : num
     $ response6: int
     $ time6    : num
     $ response7: int 0 NA
     $ time7    : num 13.1 NA
     - attr(*, ".internal.selfref")=<externalptr>

[optional] We ensure that all time variables are of type numeric by converting time1 from integer to numeric:

    dtw.ony[, time1 := as.double(time1)]

We convert the dataset to the long format:

    dtl.ony <- melt(dtw.ony, id.vars = c("id","treatment"),
                    measure.vars = list(paste0("time",1:7), paste0("response",1:7)),
                    value.name = c("time","response"),
                    variable.name = "visit")
    str(dtl.ony)

    Classes 'data.table' and 'data.frame': 2058 obs. of 5 variables:
     $ id       : int
     $ treatment: int
     $ visit    : Factor w/ 7 levels "1","2","3","4",..:
     $ time     : num
     $ response : int
     - attr(*, ".internal.selfref")=<externalptr>

We convert the id and treatment variables to factor:

    dtl.ony[, id := as.factor(id)]
    dtl.ony[, treatment := factor(treatment, levels = 0:1, labels = c("200mg","250mg"))]

We also add the scheduled visit times (in weeks) to the dataset:

    dtl.ony[, expected.timeweek := as.numeric(NA)]
    dtl.ony[visit=="1", expected.timeweek := 0]
    dtl.ony[visit=="2", expected.timeweek := 4]
    dtl.ony[visit=="3", expected.timeweek := 8]
    dtl.ony[visit=="4", expected.timeweek := 12]
    dtl.ony[visit=="5", expected.timeweek := 24]
    dtl.ony[visit=="6", expected.timeweek := 36]
    dtl.ony[visit=="7", expected.timeweek := 48]

and create a new visit variable, visit.num, which indexes the visit with numbers (and not characters):

    dtl.ony[, visit.num := as.numeric(visit)]
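As a minimal sketch of what the wide-to-long conversion does, the toy example below reshapes two visits for two made-up subjects. It uses base R's reshape instead of data.table's melt, so it illustrates the idea under that substitution rather than reproducing the code above; the objects wide and long and all values are hypothetical.

```r
# Toy wide data: 2 subjects, 2 visits (values are made up for illustration)
wide <- data.frame(id = 1:2,
                   time1 = c(0, 0), time2 = c(1.1, 0.9),
                   response1 = c(1L, 0L), response2 = c(0L, NA))

# One row per subject and visit; time* and response* are stacked in parallel
long <- reshape(wide, direction = "long", idvar = "id",
                varying = list(c("time1", "time2"),
                               c("response1", "response2")),
                v.names = c("time", "response"),
                timevar = "visit", times = 1:2)

# Sort by subject then visit, as is done later before fitting the GEE
long <- long[order(long$id, long$visit), ]
long
```

Each subject contributes one row per visit, including the visit with a missing response, which is why the long dataset above has 294 x 7 = 2058 rows.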

1.2 Inspection of the dataset

The summary method provides useful information about the dataset:

    summary(dtl.ony)

           id        treatment     visit       time          response     expected.timeweek  visit.num
     1      :   7   200mg:1022   1:294   Min.   :      Min.   :      Min.   : 0.00      Min.   :1
     2      :   7   250mg:1036   2:294   1st Qu.:      1st Qu.:      1st Qu.:           1st Qu.:2
     3      :   7                3:294   Median :      Median :      Median :12.00      Median :4
     4      :   7                4:294   Mean   :      Mean   :      Mean   :18.86      Mean   :4
     6      :   7                5:294   3rd Qu.:      3rd Qu.:      3rd Qu.:           3rd Qu.:6
     7      :   7                6:294   Max.   :      Max.   :      Max.   :48.00      Max.   :7
     (Other):2016                7:294   NA's   :150   NA's   :150

We see 294 patients, each having 7 visits. However, time and response each have 150 missing values, so data are not available for all the visits.

A closer inspection of the treatment variable:

    dtl.ony[, print(c(visit = .GRP, table(treatment))), by = "visit"]

    Empty data.table (0 rows) of 1 col: visit

shows that the treatment variable is in reality the group variable.

    setnames(dtl.ony, old = "treatment", new = "group")

The treatment variable is the group variable except at baseline, where we assume that 200mg was given to all patients:

    dtl.ony[, treatment := as.character(group)]
    dtl.ony[visit == "1", treatment := "200mg"]
    dtl.ony[, treatment := as.factor(treatment)]

We can check that we obtain the expected variable:

    dtl.ony[, print(c(visit = .GRP, table(treatment))), by = "visit"]

    Empty data.table (0 rows) of 1 col: visit

1.3 Computation of the prevalence

    dtl.ony[, .(n.row = .N,
                n.obs = sum(!is.na(response)),
                n.response = sum(response, na.rm = TRUE),
                prevalence = mean(response, na.rm = TRUE)),
            by = c("group","visit")]

        group visit n.row n.obs n.response prevalence
     1: 250mg
     2: 200mg
     3: 250mg
     4: 200mg
     5: 250mg
     6: 200mg
     7: 250mg
     8: 200mg
     9: 250mg
    10: 200mg
    11: 250mg
    12: 200mg
    13: 250mg
    14: 200mg
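Because response is coded 0/1, its mean over the non-missing values is the prevalence, i.e. n.response / n.obs. A small check on made-up values (the vector below is hypothetical and not taken from the dataset):

```r
# Hypothetical 0/1 responses for one group at one visit, with one missing value
response <- c(1L, 0L, NA, 1L, 0L, 0L)

n.row      <- length(response)             # rows, including missing
n.obs      <- sum(!is.na(response))        # observed responses
n.response <- sum(response, na.rm = TRUE)  # number of 1s
prevalence <- mean(response, na.rm = TRUE)

c(n.row = n.row, n.obs = n.obs, n.response = n.response, prevalence = prevalence)
```

The missing value counts towards n.row but not towards n.obs, so it does not enter the denominator of the prevalence.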

2 Question 2: Scheduled vs. observed visit times

Given that a year contains 52 weeks and 12 months, we convert the time variable from months to weeks using:

    dtl.ony[, timeweek := time * 52 / 12]

We can then compute the quantiles of the observed visit times and compare them to the expected ones:

    dtl.ony[, .("expected" = expected.timeweek[1],
                "observed (median)" = median(timeweek, na.rm = TRUE),
                "observed (min)" = min(timeweek, na.rm = TRUE),
                "observed (max)" = max(timeweek, na.rm = TRUE)),
            by = c("visit","group")]

        visit group expected observed (median) observed (min) observed (max)
     1:     1 250mg
     2:     1 200mg
     3:     2 250mg
     4:     2 200mg
     5:     3 250mg
     6:     3 200mg
     7:     4 250mg
     8:     4 200mg
     9:     5 250mg
    10:     5 200mg
    11:     6 250mg
    12:     6 200mg
    13:     7 250mg
    14:     7 200mg

We can also provide a graphical display:

    gg.time <- ggplot(dtl.ony, aes(x = as.factor(visit)))
    gg.time <- gg.time + geom_line(aes(y = expected.timeweek, group = "expected", size = "expected"))
    gg.time <- gg.time + geom_boxplot(aes(y = timeweek, color = as.factor(response)))
    gg.time <- gg.time + labs(x = "visit", y = "time (weeks)", color = "response", size = "")
    gg.time

[Figure: boxplots of the observed visit times (weeks) at each visit, colored by response (0/1), with a line showing the expected visit times.]

3 Question 3: Population average model

3.1 Definition of the interaction

We first consider the mean effects. Due to the baseline adjustment, the design matrix including an interaction is not of full rank:

    X <- model.matrix(~ treatment * visit, data = dtl.ony)
    groupeffect <- X[,"treatment250mg"]
    interaction <- rowSums(X[,paste0("treatment250mg:visit",2:7)])
    range(interaction - groupeffect)

    [1] 0 0

The treatment column equals the sum of the interaction columns, because there is no treatment effect at baseline. To solve this we manually create an interaction variable:

    dtl.ony[, treatmentIvisit := paste0(treatment,visit)]
    dtl.ony[treatment == "200mg", treatmentIvisit := "baseline"]
    dtl.ony[visit == 1, treatmentIvisit := "baseline"]

We convert the new variable to a factor, specifying that baseline is the reference level:

    dtl.ony[, treatmentIvisit := relevel(as.factor(treatmentIvisit),"baseline")]
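The rank problem and its fix can be reproduced on a small made-up design. The sketch below assumes 2 groups and 3 visits with one row per cell; the data frame toy and its values are hypothetical, and base R is used instead of data.table.

```r
# Toy design: 2 groups x 3 visits, one row per cell (made up for illustration)
toy <- expand.grid(visit = factor(1:3), group = c("200mg", "250mg"))

# Baseline adjustment: everyone is considered "200mg" at visit 1
toy$treatment <- factor(ifelse(toy$visit == "1", "200mg",
                               as.character(toy$group)))

# Full interaction: the treatment column is the sum of the interaction columns
X.full <- model.matrix(~ treatment * visit, data = toy)
c(columns = ncol(X.full), rank = qr(X.full)$rank)

# Manual interaction: one level per post-baseline visit in the 250mg group
toy$treatmentIvisit <- ifelse(toy$treatment == "250mg",
                              paste0("250mg", toy$visit), "baseline")
toy$treatmentIvisit <- relevel(factor(toy$treatmentIvisit), "baseline")
X.manual <- model.matrix(~ visit + treatmentIvisit, data = toy)
c(columns = ncol(X.manual), rank = qr(X.manual)$rank)
```

In this toy case the first design matrix has one more column than its rank, while the manually coded interaction gives a full-rank matrix with one coefficient per visit plus one treatment contrast per post-baseline visit.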

We can check that this variable gives the expected values:

    dtl.ony[, table(treatmentIvisit,visit,group)]

    , , group = 200mg

                   visit
    treatmentIvisit   1   2   3   4   5   6   7
           baseline
           250mg2
           250mg3
           250mg4
           250mg5
           250mg6
           250mg7

    , , group = 250mg

                   visit
    treatmentIvisit   1   2   3   4   5   6   7
           baseline
           250mg2
           250mg3
           250mg4
           250mg5
           250mg6
           250mg7

3.2 Model fitting

According to the documentation of the geeglm function, "Data are assumed to be sorted so that observations on a cluster are contiguous rows for all entities in the formula", so we need to reorder by id. To be safe, we also reorder by visit.

    setkeyv(dtl.ony, c("id","visit"))
    dtl.ony[1:10]

        id group visit time response expected.timeweek visit.num treatment timeweek treatmentIvisit
     1:  1 250mg                                                                        baseline
     2:  1 250mg                                                                          250mg2
     3:  1 250mg                                                                          250mg3
     4:  1 250mg                                                                          250mg4
     5:  1 250mg                                                                          250mg5
     6:  1 250mg                                                                          250mg6
     7:  1 250mg                                                                          250mg7
     8:  2 200mg                                                                        baseline
     9:  2 200mg                                                                        baseline
    10:  2 200mg                                                                        baseline
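Before fitting, the contiguity requirement quoted above can be checked directly. The helper below is a hypothetical sketch (is.contiguous is not a geepack function): it tests that each id forms a single uninterrupted run of rows.

```r
# Hypothetical helper: TRUE if every cluster id occupies one contiguous block
is.contiguous <- function(id) {
  run.values <- rle(as.character(id))$values  # id of each consecutive run
  anyDuplicated(run.values) == 0              # an id seen in two runs is split
}

id.sorted <- c(1, 1, 1, 2, 2, 3)  # clusters in contiguous rows
id.mixed  <- c(1, 2, 1, 3, 3, 2)  # clusters 1 and 2 are split

is.contiguous(id.sorted)
is.contiguous(id.mixed)
```

With the dataset above, one would call it as is.contiguous(dtl.ony$id) after setkeyv.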

We are now ready to fit the model:

    gee.un <- geeglm(response ~ visit + treatmentIvisit, id = id, data = dtl.ony,
                     family = binomial(link = "logit"), corstr = "unstructured")

Unlike "standard" mixed models, population average models are not estimated by maximum likelihood, and therefore logLik cannot return the value of the likelihood:

    logLik(gee.un)

    'log Lik.' (df=13)

We can still use summary to display the model coefficients:

    summary(gee.un)$coefficients

                          Estimate Std.err Wald Pr(>|W|)
    (Intercept)                                     e-06
    visit2                                          e-01
    visit3                                          e-02
    visit4                                          e-04
    visit5                                          e-08
    visit6                                          e-09
    visit7                                          e-08
    treatmentIvisit250mg2                           e-01
    treatmentIvisit250mg3                           e-01
    treatmentIvisit250mg4                           e-01
    treatmentIvisit250mg5                           e-01
    treatmentIvisit250mg6                           e-01
    treatmentIvisit250mg7                           e-02

A manual extraction is necessary to obtain the correlation matrix:

    corr.coef <- summary(gee.un)$corr[,"Estimate"]
    n.corr.coef <- length(corr.coef)
    Mcorr <- matrix(NA, 7, 7)
    Mcorr[lower.tri(Mcorr)] <- corr.coef
    Mcorr

         [,1] [,2] [,3] [,4] [,5] [,6] [,7]
    [1,]   NA   NA   NA   NA   NA   NA   NA
    [2,]        NA   NA   NA   NA   NA   NA
    [3,]             NA   NA   NA   NA   NA
    [4,]                  NA   NA   NA   NA
    [5,]                       NA   NA   NA
    [6,]                            NA   NA
    [7,]                                 NA
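The assignment through lower.tri relies on R filling matrices column by column, which matches coefficients ordered as alpha.1:2, alpha.1:3, alpha.2:3, and so on. A minimal sketch on a 3 x 3 case, with made-up correlation values and assuming that ordering:

```r
# Made-up correlations, in the order alpha.1:2, alpha.1:3, alpha.2:3
corr.toy <- c("alpha.1:2" = 0.5, "alpha.1:3" = 0.3, "alpha.2:3" = 0.4)

M <- diag(1, 3)                        # unit diagonal of a correlation matrix
M[lower.tri(M)] <- corr.toy            # fills [2,1], [3,1], [3,2] in that order
M[upper.tri(M)] <- t(M)[upper.tri(M)]  # mirror to obtain a symmetric matrix
M
```

Here alpha.1:3 ends up in M[3,1] and M[1,3], so the index pair in the coefficient name corresponds to (column, row) of the lower triangle.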

A more reliable extraction reads the position of the coefficients from their names instead of assuming they follow a specific order:

    name.corr.coef <- rownames(summary(gee.un)$corr)
    ls.position.corr.coef <- strsplit(gsub("alpha.","",name.corr.coef),":")
    position.corr.coef <- do.call(rbind,lapply(ls.position.corr.coef, as.numeric))
    rownames(position.corr.coef) <- name.corr.coef
    colnames(position.corr.coef) <- c("row","column")
    position.corr.coef[1:10,]

              row column
    alpha.1:2   1      2
    alpha.1:3   1      3
    alpha.1:4   1      4
    alpha.1:5   1      5
    alpha.1:6   1      6
    alpha.1:7   1      7
    alpha.2:3   2      3
    alpha.2:4   2      4
    alpha.2:5   2      5
    alpha.2:6   2      6

3.3 Inference

We can compute odds ratios as the exponential of the coefficients:

    ORvisit7 <- exp(coef(gee.un)["visit7"])
    ORvisit7

    visit7

We can also compute the confidence intervals for the coefficients from the standard errors given in the summary:

    beta.visit7 <- summary(gee.un)$coef["visit7","Estimate"]
    betase.visit7 <- summary(gee.un)$coef["visit7","Std.err"]
    quantile.norm <- qnorm(c(lower = 0.025, estimate = 0.5, upper = 0.975))
    CI.visit7 <- exp(beta.visit7 + quantile.norm * betase.visit7)
    CI.visit7

    lower estimate upper

Standard errors for any linear combination of the coefficients can be computed automatically using the esticon function of the doBy package. First we need to create a contrast matrix encoding the linear combinations we are interested in:

    name.coef <- names(coef(gee.un))
    n.coef <- length(name.coef)
    C <- matrix(0, nrow = 3, ncol = n.coef,
                dimnames = list(c("diffP-200mg","diffP-250mg","treatment-lastVisit"),
                                name.coef))
    C["diffP-200mg","visit7"] <- 1
    C["diffP-250mg",c("visit7","treatmentIvisit250mg7")] <- c(1,1)
    C["treatment-lastVisit","treatmentIvisit250mg7"] <- 1
    C

                        (Intercept) visit2 visit3 visit4 visit5 visit6 visit7 treatmentIvisit250mg2
    diffP-200mg                   0      0      0      0      0      0      1                     0
    diffP-250mg                   0      0      0      0      0      0      1                     0
    treatment-lastVisit           0      0      0      0      0      0      0                     0
                        treatmentIvisit250mg3 treatmentIvisit250mg4 treatmentIvisit250mg5 treatmentIvisit250mg6
    diffP-200mg                             0                     0                     0                     0
    diffP-250mg                             0                     0                     0                     0
    treatment-lastVisit                     0                     0                     0                     0
                        treatmentIvisit250mg7
    diffP-200mg                             0
    diffP-250mg                             1
    treatment-lastVisit                     1

Then we can apply esticon to obtain the point estimate and the standard error for each linear combination:

    resc <- esticon(gee.un, cm = C, conf.int = TRUE, joint.test = FALSE)
    resc

                        beta0 Estimate Std.Error X2.value DF Pr(>|X^2|) Lower Upper
    diffP-200mg
    diffP-250mg
    treatment-lastVisit

and then apply the exponential function to obtain the odds ratios with their confidence intervals:

    OR.CI <- cbind(exp(resc[,c("Estimate","Lower","Upper")]),
                   p.value = resc[,"Pr(>|X^2|)"])
    OR.CI

                        Estimate Lower Upper p.value
    diffP-200mg
    diffP-250mg
    treatment-lastVisit
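To make explicit what such a contrast computation amounts to, the sketch below evaluates three linear combinations by hand under a Wald (normal) approximation: the estimate is C %*% beta and the variance is C V C'. The coefficient vector beta and covariance matrix V are made up, only three coefficients are kept, and this is an illustration of the principle, not necessarily the exact internals of esticon.

```r
# Made-up coefficients and covariance matrix (3 coefficients only)
beta <- c("(Intercept)" = -0.5, "visit7" = -2.0, "treatmentIvisit250mg7" = -0.6)
V <- matrix(c( 0.04, -0.02,  0.00,
              -0.02,  0.09, -0.03,
               0.00, -0.03,  0.16),
            nrow = 3, dimnames = list(names(beta), names(beta)))

# Same three contrasts as in the text, restricted to these coefficients
C.toy <- rbind("diffP-200mg"         = c(0, 1, 0),
               "diffP-250mg"         = c(0, 1, 1),
               "treatment-lastVisit" = c(0, 0, 1))

est <- drop(C.toy %*% beta)                  # linear combinations
se  <- sqrt(diag(C.toy %*% V %*% t(C.toy)))  # their standard errors
z   <- qnorm(0.975)

# Odds ratios with 95% Wald confidence limits
exp(cbind(Estimate = est, Lower = est - z * se, Upper = est + z * se))
```

The covariance term matters for the second contrast: its variance is 0.09 + 0.16 + 2 x (-0.03), not simply the sum of the two variances.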

We can also compare the present model to the model without treatment effect using a Wald test:

    gee.un0 <- geeglm(response ~ visit, id = id, data = dtl.ony,
                      family = binomial(link = "logit"), corstr = "unstructured")
    anova(gee.un0,gee.un)

    Analysis of 'Wald statistic' Table

    Model 1 response ~ visit + treatmentIvisit
    Model 2 response ~ visit
      Df X2 P(>|Chi|)

So there is no evidence for a treatment effect.

3.4 Display of the fitted prevalence

We can obtain the fitted values using fitted:

    fitted.values <- fitted(gee.un)
    length(fitted.values)

    [1] 1908

The number of fitted values matches the number of non-missing observations:

    dtl.ony[!is.na(response), .N]

    [1] 1908

So we can store the fitted values in the original dataset, excluding the rows corresponding to missing observations:

    dtl.ony[!is.na(response), fitted := fitted.values]

We can then plot the fitted values by group:

    gg.gee <- ggplot(dtl.ony[!is.na(response)],
                     aes(x = expected.timeweek, y = fitted, group = group, color = group))
    gg.gee <- gg.gee + geom_point() + geom_line()
    gg.gee <- gg.gee + labs(x = "expected visit time (weeks)", y = "fitted prevalence",
                            color = "treatment group")
    gg.gee

[Figure: fitted prevalence against expected visit time (weeks), one line per treatment group (200mg, 250mg).]

We can make the same plot on the logit scale:

    dtl.ony[!is.na(response), fitted.logit := log(fitted.values/(1-fitted.values))]
    gg.gee_logit <- ggplot(dtl.ony[!is.na(response)],
                           aes(x = expected.timeweek, y = fitted.logit, group = group, color = group))
    gg.gee_logit <- gg.gee_logit + geom_point() + geom_line()
    gg.gee_logit <- gg.gee_logit + labs(x = "expected visit time (weeks)",
                                        y = "logit(fitted prevalence)",
                                        color = "treatment group")
    gg.gee_logit
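The logit transformation written out by hand above is also available in base R as qlogis, with plogis as its inverse. A short check on a few made-up probabilities:

```r
p <- c(0.05, 0.2, 0.5)           # made-up prevalences

logit.manual <- log(p / (1 - p)) # transformation used for the plot
logit.manual

# qlogis is the logit, plogis maps back to the probability scale
all.equal(logit.manual, qlogis(p))
all.equal(plogis(logit.manual), p)
```

A prevalence of 0.5 corresponds to a logit of 0, and prevalences below 0.5 give negative values.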

[Figure: logit of the fitted prevalence against expected visit time (weeks), one line per treatment group (200mg, 250mg).]

4 Question 4: Continuous time population average model

4.1 Model fitting

    geecont.un <- geeglm(response ~ treatment:time, id = id,
                         data = dtl.ony[visit.num < 6],
                         family = binomial(link = "logit"), corstr = "unstructured")
    summary(geecont.un)$coef

                        Estimate Std.err Wald Pr(>|W|)
    (Intercept)                                   e-07
    treatment200mg:time                           e-08
    treatment250mg:time                           e-09

4.2 Inference

    name.coef <- names(coef(geecont.un))
    n.coef <- length(name.coef)
    C2 <- matrix(0, nrow = 3, ncol = n.coef,
                 dimnames = list(c("200mg/month","250mg/month","250mg vs. 200mg/month"),
                                 name.coef))
    C2["200mg/month","treatment200mg:time"] <- 1
    C2["250mg/month","treatment250mg:time"] <- 1
    C2["250mg vs. 200mg/month",c("treatment250mg:time","treatment200mg:time")] <- c(1,-1)
    C2

                          (Intercept) treatment200mg:time treatment250mg:time
    200mg/month                     0                   1                   0
    250mg/month                     0                   0                   1
    250mg vs. 200mg/month           0                  -1                   1

    resc2 <- esticon(geecont.un, cm = C2, conf.int = TRUE, joint.test = FALSE)
    OR.CI2 <- cbind(exp(resc2[,c("Estimate","Lower","Upper")]),
                    p.value = resc2[,"Pr(>|X^2|)"])
    OR.CI2

                          Estimate Lower Upper p.value
    200mg/month
    250mg/month
    250mg vs. 200mg/month
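Since time is measured in months, the exponential of each slope is an odds ratio per month, and the third contrast is the ratio of the two per-month odds ratios. A small sketch with made-up slopes (the values in b are hypothetical, not the fitted coefficients):

```r
# Made-up slopes on the logit scale (per month)
b <- c("treatment200mg:time" = -0.14, "treatment250mg:time" = -0.17)

OR.month <- exp(b)               # odds ratio per month in each group
ratio    <- exp(b[2] - b[1])     # contrast c(-1, 1): 250mg vs. 200mg per month

OR.month
unname(ratio)
unname(OR.month[2] / OR.month[1])  # same quantity, as a ratio of odds ratios

exp(3 * b)                       # odds ratio over 3 months: the monthly OR cubed
```

A ratio below 1 would mean that the odds decrease faster per month in the 250mg group than in the 200mg group.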

5 Question 6: Subject specific model with random intercept

5.1 Fitting the model

    glmer.intercept <- glmer(formula = response ~ time:treatment + (1|id),
                             data = dtl.ony[visit.num < 6],
                             family = binomial(link = "logit"))
    logLik(glmer.intercept)

    'log Lik.' (df=4)

    summary(glmer.intercept)$coef

                        Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                         e-31
    time:treatment200mg                                 e-12
    time:treatment250mg                                 e-14

    VarCorr(glmer.intercept)

     Groups Name        Std.Dev.
     id     (Intercept)

5.2 Inference

Using the approach proposed for objects estimated with the geeglm function, we obtain:

    name.coef <- names(fixef(glmer.intercept))
    n.coef <- length(name.coef)
    C3 <- matrix(0, nrow = 3, ncol = n.coef,
                 dimnames = list(c("200mg/month","250mg/month","250mg vs. 200mg/month"),
                                 name.coef))
    C3["200mg/month",c("time:treatment200mg")] <- 1
    C3["250mg/month",c("time:treatment250mg")] <- 1
    C3["250mg vs. 200mg/month",c("time:treatment250mg","time:treatment200mg")] <- c(1,-1)
    C3

                          (Intercept) time:treatment200mg time:treatment250mg
    200mg/month                     0                   1                   0
    250mg/month                     0                   0                   1
    250mg vs. 200mg/month           0                  -1                   1

    resc3 <- esticon(glmer.intercept, cm = C3, conf.int = TRUE, joint.test = FALSE)
    resc3

                          beta0 Estimate Std.Error X2.value DF Pr(>|X^2|) Lower Upper
    200mg/month
    250mg/month
    250mg vs. 200mg/month

    OR.CI3 <- cbind(exp(resc3[,c("Estimate","Lower","Upper")]),
                    p.value = resc3[,"Pr(>|X^2|)"])
    OR.CI3

                          Estimate Lower Upper p.value
    200mg/month
    250mg/month
    250mg vs. 200mg/month

Note: Confidence intervals for the parameters can also be obtained using the confint method.

    CI.confint <- confint(glmer.intercept, method = "Wald")
    CI.confint

                        2.5 % 97.5 %
    .sig01                 NA     NA
    (Intercept)
    time:treatment200mg
    time:treatment250mg

Here we set the argument method to "Wald" to save computation time, but the default option "profile" should give more reliable confidence intervals. Confidence intervals for the odds ratios can be obtained by applying the exponential function:

    exp(CI.confint)

                        2.5 % 97.5 %
    .sig01                 NA     NA
    (Intercept)
    time:treatment200mg
    time:treatment250mg

6 Question 7: Subject specific model with random intercept and random slope

6.1 Model fitting

    system.time(
      glmer.slope <- glmer(formula = response ~ time:treatment + (time|id),
                           data = dtl.ony[visit.num < 6],
                           family = binomial(link = "logit"))
    )

       user  system elapsed

    logLik(glmer.slope)

    'log Lik.' (df=6)

    summary(glmer.slope)$coef

                        Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                         e-40
    time:treatment200mg                                 e-12
    time:treatment250mg                                 e-15

    VarCorr(glmer.slope)

     Groups Name        Std.Dev. Corr
     id     (Intercept)
            time

6.2 Inference

    name.coef <- names(fixef(glmer.slope))
    n.coef <- length(name.coef)
    C4 <- matrix(0, nrow = 3, ncol = n.coef,
                 dimnames = list(c("200mg/month","250mg/month","250mg vs. 200mg/month"),
                                 name.coef))
    C4["200mg/month",c("time:treatment200mg")] <- 1
    C4["250mg/month",c("time:treatment250mg")] <- 1
    C4["250mg vs. 200mg/month",c("time:treatment250mg","time:treatment200mg")] <- c(1,-1)
    C4

                          (Intercept) time:treatment200mg time:treatment250mg
    200mg/month                     0                   1                   0
    250mg/month                     0                   0                   1
    250mg vs. 200mg/month           0                  -1                   1

    resc4 <- esticon(glmer.slope, cm = C4, conf.int = TRUE, joint.test = FALSE)
    resc4

                          beta0 Estimate Std.Error X2.value DF Pr(>|X^2|) Lower Upper
    200mg/month
    250mg/month
    250mg vs. 200mg/month

    OR.CI4 <- cbind(exp(resc4[,c("Estimate","Lower","Upper")]),
                    p.value = resc4[,"Pr(>|X^2|)"])
    OR.CI4

                          Estimate Lower Upper p.value
    200mg/month
    250mg/month
    250mg vs. 200mg/month
