Repeated measures, part 1, simple methods


Allyson Walton
5 years ago


eNote 11

Contents

11 Repeated measures, part 1, simple methods
    11.1 Intro
        11.1.1 Main example: Activity of rats
    11.2 Separate analyses for each time point
        11.2.1 Example: Activity of rats analyzed separately for each month
    11.3 Analysis of summary statistics
        11.3.1 Example: Activity of rats analyzed via summary measure
    11.4 Random effects approach
        11.4.1 Example: Activity of rats analyzed via random effects model
    11.5 Pros and cons of simple approaches
    11.6 The R-package nlme and the function lme
    11.7 Exercises

11.1 Intro

This module describes various simple approaches for analyzing repeated measurements, and shows how these analyses can be carried out in R.

Data referred to as repeated measurements (or sometimes as longitudinal data) are characterized by several measurements on the same individuals or experimental units. These measurements are typically taken at different times, or at different positions within the individuals. Consider for instance the following experimental design to compare two drugs (A and B) for reducing blood pressure:

1. Twenty individuals were selected randomly from the relevant population.
2. Half of these were given drug A and half were given drug B (randomly selected).
3. For a period of two months these individuals had their blood pressure measured every week, which resulted in eight measurements on each individual.

The problem is that data collected this way may violate the standard assumption of independent measurements. It seems fair to expect two measurements from the same individual to be positively correlated, i.e. to be more similar than two measurements from different individuals. Furthermore, two measurements taken on the same individual might be highly correlated if they are taken at time points close to each other, but less correlated (or maybe independent) if they are taken far apart.

This module describes some fairly simple (and maybe crude) methods for analyzing these data types. These methods include:

* Separate analyses for each time point
* Analysis of summary statistic
* Random effects approach

The analysis of repeated measurements will continue in the next module, where some more explicit covariance models will be shown.

The simplest approach to analysing repeated measurements would be to include time as a factor and ignore the dependence between observations on the same individual. Such an approach may lead to completely wrong conclusions. The essence of the problem is that this amounts to pretending to have more observations than are actually available. Two correlated observations contain less information than two independent observations, because one is partly explained by the other. This approach is unacceptable.
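The effect of ignoring the dependence can be illustrated with a small simulation. The sketch below is not part of the original analysis: it assumes the blood pressure design described above (20 individuals, 8 measurements each), no true drug effect, and hypothetical variances of 1 for both the individual effect and the measurement noise. All object names are made up for the illustration.

```r
set.seed(1)
n_ind  <- 20   # individuals (10 per drug)
n_time <- 8    # weekly measurements per individual

id       <- rep(seq_len(n_ind), each = n_time)           # individual of each observation
drug_ind <- factor(rep(c("A", "B"), each = n_ind / 2))   # drug of each individual
drug     <- drug_ind[id]                                 # drug of each observation

one_sim <- function() {
  # Individual effect + measurement noise; there is NO drug effect
  y <- rnorm(n_ind)[id] + rnorm(n_ind * n_time)
  # Naive analysis: all 160 observations treated as independent
  p_naive <- anova(lm(y ~ drug))[1, "Pr(>F)"]
  # One value per individual: the average over the 8 weeks
  ybar   <- as.vector(tapply(y, id, mean))
  p_mean <- anova(lm(ybar ~ drug_ind))[1, "Pr(>F)"]
  c(naive = p_naive, per_individual = p_mean)
}

p <- replicate(1000, one_sim())
reject_rate <- rowMeans(p < 0.05)   # proportion of false positives at the 5% level
reject_rate
```

Under these assumptions the naive analysis rejects the true null hypothesis far more often than the nominal 5%, whereas the analysis based on one value per individual stays close to 5%.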

11.1.1 Main example: Activity of rats

To investigate the effect of a certain type of exposure on the activity of rats, the following experiment was carried out. The experimental unit was a cage with two rats. During the entire experimental period the rats were exposed daily to the matter under investigation, in a concentration of 1, 2 or 3 units (treatment 1, 2 and 3, respectively). Once per month during 10 months the activity of the rats was measured by placing the rats from one cage in a chamber in which each intersection of a light beam was counted. The total count through a period of 57 hours was used as the result for that cage. Notice that in this setting the "individual" is the cage.

Summary of experiment:

* 3 treatments: 1, 2, 3 (concentration)
* 10 cages per treatment
* 10 contiguous months

The response is activity (count of intersections of the light beam during 57 hours). Here y = log(counts) is used for the analysis, because a residual plot showed that this was the most reasonable scale.

The observations are listed in Table 11.1 and plotted in Figure 11.1. From the figure it seems that the activity is decreasing from month 1 to month 10 (maybe as a linear function?), and that there may be a small difference between the doses.

Plotting the individual curves is a very useful tool in the analysis of repeated measures. This should always be the first step. Quite often the main conclusions from the analysis can already be seen from a good plot of the data.

Here we use interaction.plot to plot the individual profiles; the resulting figure is seen in Figure 11.1:

    rats <- read.table("rats.txt", header = TRUE, sep = ",", dec = ".")
    rats$month  <- factor(rats$month)
    rats$treatm <- factor(rats$treatm)
    rats$cage   <- factor(rats$cage)
    # One line per cage; line type and colour identify the treatment
    with(rats, interaction.plot(month, cage, lnc, legend = FALSE, las = 1,
                                lty = rep(1:3, each = 10),
                                col = rep(2:4, each = 10)))

Figure 11.1: The log(counts) for each cage plotted against month (x-axis: month, y-axis: mean of lnc). The solid red lines are cages receiving dose=1, dashed green lines are dose=2, and dotted blue lines are dose=3.

Table 11.1: The rats data set (columns: Month, Dose, Cage); here the raw activity counts are listed.

Using the interaction.plot function here only produces the right plot because the time points are equidistant in these data. More generally one would need a quantitative copy of the time variable and use this as the x-axis in the plotting, e.g. using the ggplot2 package:

    library(ggplot2)
    rats$monthq <- as.numeric(rats$month)  # quantitative copy of month
    ggplot(rats, aes(x = monthq, y = lnc, group = cage, colour = treatm)) +
      geom_line()

The treatment group average time profiles can be plotted in the same way:

    library(plyr)
    mns <- ddply(rats, ~ treatm + month + monthq, summarize, lnc = mean(lnc))
    ggplot(mns, aes(x = monthq, y = lnc, group = treatm, colour = treatm)) +
      geom_point() + geom_line()
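The same kind of group means can also be computed without plyr. The following is a minimal sketch using aggregate from base R. Since rats.txt is not reproduced here, it runs on a simulated stand-in data set (rats_sim, with made-up values) that only mimics the layout of the rats data: 30 cages, 10 per treatment, measured in 10 months.

```r
# Simulated stand-in for the rats data (NOT the real measurements)
set.seed(11)
rats_sim <- expand.grid(monthq = 1:10, cage = 1:30)
rats_sim$treatm <- factor((rats_sim$cage - 1) %/% 10 + 1)  # cages 1-10, 11-20, 21-30
cage_effect <- rnorm(30, sd = 0.2)                         # hypothetical cage-to-cage variation
rats_sim$lnc <- 10 - 0.05 * rats_sim$monthq +
  cage_effect[rats_sim$cage] + rnorm(nrow(rats_sim), sd = 0.15)

# Mean of lnc for each combination of treatment and month
mns_base <- aggregate(lnc ~ treatm + monthq, data = rats_sim, FUN = mean)
head(mns_base)
```

The result has one row per combination of treatm and monthq, which is the layout needed for plotting the average profiles.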

In the plot of the treatment averages, the ddply function from the plyr package was used to compute the mean lnc for each combination of treatm, month and monthq. We will not describe the details of this function or package, but only note that it is very efficient for data manipulation.

11.2 Separate analyses for each time point

One way to avoid the problem of correlated measurements is to do a separate analysis for each point in time. This way only one observation from each individual is used in each analysis, and hence the observations are independent. This way of analyzing repeated measurements is not wrong, but it is very inefficient, as all the remaining observations are wasted in each analysis. This approach avoids the problem instead of dealing with it.

Separate analyses can be carried out for all the observed time points, but it will likely be very difficult to reach a coherent conclusion from all these sub tests. These sub tests will be correlated, and because the correlation structure is not part of the model, it is not possible to tell how strong this correlation is.
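One reason why many sub tests are hard to combine into a single conclusion is that the risk of at least one false positive grows with the number of tests. The following sketch shows the arithmetic under the simplifying assumption of independent tests (which the sub tests here are not), together with the effect of testing each at the Bonferroni level 0.05/n. The p values in p_raw are made up for the illustration.

```r
alpha   <- 0.05
n_tests <- c(1, 5, 10, 20)

# Probability of at least one false positive among n independent tests
fwer <- 1 - (1 - alpha)^n_tests

# Bonferroni: test each hypothesis at level alpha / n instead
bonf_level <- alpha / n_tests
fwer_bonf  <- 1 - (1 - bonf_level)^n_tests

round(data.frame(n_tests, fwer, bonf_level, fwer_bonf), 4)

# Equivalent formulation: multiply each p value by n (capped at 1)
p_raw <- c(0.004, 0.020, 0.300)   # hypothetical p values
p_adj <- p.adjust(p_raw, method = "bonferroni")
p_adj
```

With 20 independent tests at the 5% level the probability of at least one false positive is about 0.64, while it stays below 0.05 when each test is carried out at level 0.05/20.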

Separate analyses can be carried out for selected time points far apart. This will (hopefully) cause the separate sub tests to be uncorrelated, or at least less correlated. Even with uncorrelated tests it will be difficult to reach a coherent conclusion, because of a problem known as mass significance (or multiplicity). For instance, if 20 tests are carried out at a 5% significance level, one false positive (a falsely significant p value) is expected even when there is no real effect. This problem is partly solved by using the Bonferroni correction for performing n tests (one for each time point). The Bonferroni correction simply states that the significance level 0.05/n should be used instead of the usual 0.05 (which is sometimes implemented by multiplying the calculated p value by n).

When selecting time points far apart, it is important that the selection is done independently of the actual observations. Naturally the time points must not be selected systematically where there is a large (or small) difference between treatments. Ideally the time points should be selected before the data are collected.

11.2.1 Example: Activity of rats analyzed separately for each month

Consider the rats data available in the rats.txt file. The data frame has the columns treatm, cage, month, lnc and monthq:

    head(rats)

To analyze the rats data set separately for each month, a simple one-way analysis of variance model with treatment (treatm) as the only factor is used. The information about cage cannot be included, as we only have one observation from each cage in each monthly analysis. The model for each month is:

$$
\text{lnc}_i = \mu + \alpha(\text{treatm}_i) + \varepsilon_i, \quad \varepsilon_i \sim \text{i.i.d. } N(0, \sigma^2), \quad i = 1, \dots, 30
$$

To do this in R, we split the rats data frame into a list of data frames, one for each month. We then apply (using sapply) the function fn, which fits the linear model and extracts the F and p values, to each of the data frames in the list:

    ratsl <- split(rats, f = rats$month)  # a list of data.frames
    # Function to fit model, get F and p from anova table:
    fn <- function(df) {
      unlist(anova(lm(lnc ~ treatm, data = df))[1, c("F value", "Pr(>F)")])
    }
    # Alternative using plyr-functions:
    # round(t(daply(rats, ~ month, .fun = fn)), 2)
    round(sapply(ratsl, fn), 3)

The result has the rows F value and Pr(>F), with one column per month.

These F values should be compared with $F_{95\%,2,27} = 3.35$, or with $F_{99.5\%,2,27} = 6.49$ if the Bonferroni correction is used (with 10 tests the level is 0.05/10 = 0.005). A few significant values are found, and even one if the Bonferroni correction is used, so the conclusion is that there is weak evidence of a difference between the groups.

It is possible to make a correct analysis time by time, but it is weak and often confusing, because it does not combine all information into one test.

11.3 Analysis of summary statistics

Another way to avoid the problem of correlated measurements is to choose a single measure to summarize the individual curves, and then base the analysis on this measure. This again reduces the data set to independent observations, one for each individual. To analyze the summary data set, standard methods for independent observations, for instance analysis of variance, can be used.

The key is to choose a good summary measure. One possibility is to choose the value at a given time point, which reduces this summary method to the separate time point analysis described in the previous section. This choice is poor in most cases, because all other measurements are wasted.

It is difficult to give general advice about the choice of summary measure. Ideally, the summary measure should capture the most important feature of the curve. In some situations the most important feature is the net growth (last minus first), the average growth (slope), or the time to reach the maximum point. It depends on the problem at hand.
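As an illustration of how such summary measures can be computed, the sketch below reduces each individual curve to its net growth, slope and time of maximum. It uses simulated curves, not the rats data; the data frame d and its columns id, time and y are hypothetical names chosen for the example.

```r
# Simulated curves for 6 individuals at 10 time points (made-up values)
set.seed(2)
d <- expand.grid(time = 1:10, id = 1:6)
d$y <- 5 + 0.3 * d$time + rnorm(6)[d$id] + rnorm(nrow(d), sd = 0.5)

# Reduce each individual curve to one row of summary measures
summ <- do.call(rbind, lapply(split(d, d$id), function(di) {
  di <- di[order(di$time), ]
  data.frame(
    id          = di$id[1],
    net_growth  = di$y[nrow(di)] - di$y[1],                      # last minus first
    slope       = unname(coef(lm(y ~ time, data = di))["time"]), # average growth
    time_of_max = di$time[which.max(di$y)]                       # time of maximum
  )
}))
summ
```

Each row of summ belongs to a different individual, so any one of the columns can be analyzed with standard methods for independent observations.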

Some common choices of summary measures are:

* Average over time
* Slope in regression with time (or higher order polynomial coefficients)
* Total increase (last point minus first point)
* Area under curve (AUC)
* Maximum or minimum point

With the right choice of summary measure this type of analysis can be very useful, at least as a first step. These models have relatively few assumptions, and they can be checked via standard residual methods. Of course the downside of this method is that information may be lost by reducing each curve to one single measure.

11.3.1 Example: Activity of rats analyzed via summary measure

The choice of summary measure for the rats data set is partly inspired by Figure 11.1. It seems that the average slope is similar for the three treatments, but that the curves from dose=3 tend to be slightly higher than the rest of the curves. To see if this is a significant difference, the logarithm of the total count during all ten months, lntot = log(total count), is used as summary measure.

To calculate this summary measure from the previously described data set, the variable containing the log counts from each month (lnc) must be transformed back to the original counts, then the sum must be calculated, and finally the logarithm must be applied to the sum.

This summary data set consists of independent measurements, as each cage is only used to generate one summary observation. Because the observations are now independent, they can be analyzed with a simple one-way ANOVA model:

$$
\text{lntot}_i = \mu + \alpha(\text{treatm}_i) + \varepsilon_i, \quad \varepsilon_i \sim \text{i.i.d. } N(0, \sigma^2), \quad i = 1, \dots, 30
$$

These operations can be done in R by writing (the variable containing the logarithm of the total counts is called logsum_count):

    # One row per cage: back-transform, sum over months, take the logarithm
    rats_sum <- ddply(rats, .(cage, treatm), summarize,
                      logsum_count = log(sum(exp(lnc))))
    anova(lm(logsum_count ~ treatm, data = rats_sum))

    Analysis of Variance Table

    Response: logsum_count
              Df Sum Sq Mean Sq F value Pr(>F)
    treatm
    Residuals

The p value for no treatment effect in this summary model is 5.22%. This is above the standard 5% significance level, but only slightly. In this analysis the entire curve has been summarized into a single measure, so a lot of information has been lost. A p value this low for the crude summary analysis could indicate that a significant treatment effect might be found with a more sophisticated analysis.

11.4 Random effects approach

The two approaches described above both illustrated ways to reduce the data set to independent measures. This section explains the first step in modeling the actual covariance.

As seen in previous modules, for instance the module about hierarchical random effects, adding a random effect to a model allows two observations from the same level of that effect to be positively correlated. Adding the individual factor to the model as a random effect will therefore allow two observations from the same individual to be positively correlated.

11.4.1 Example: Activity of rats analyzed via random effects model

It is reasonable to assume that two observations from the same cage could be correlated, so the model with cage as random effect is used. The factor month and the interaction between month and treatment are included. This was not possible in the previous models, because each curve was reduced to one number. In this analysis all observations are included in one coherent analysis. The model is:

$$
\text{lnc}_i = \mu + \alpha(\text{treatm}_i) + \beta(\text{month}_i) + \gamma(\text{treatm}_i, \text{month}_i) + d(\text{cage}_i) + \varepsilon_i,
$$

where $i = 1, \dots, 300$, $d(\text{cage}_i) \sim N(0, \sigma_d^2)$, $\varepsilon_i \sim N(0, \sigma^2)$, and all are independent. Recall from previous modules that the covariance structure for this model is:

$$
\operatorname{cov}(y_{i_1}, y_{i_2}) =
\begin{cases}
0, & \text{if } \text{cage}_{i_1} \neq \text{cage}_{i_2} \text{ and } i_1 \neq i_2 \\
\sigma_d^2, & \text{if } \text{cage}_{i_1} = \text{cage}_{i_2} \text{ and } i_1 \neq i_2 \\
\sigma_d^2 + \sigma^2, & \text{if } i_1 = i_2
\end{cases}
$$

In other words, this is the variance structure where two observations from different cages are uncorrelated, and two observations from the same cage are positively correlated with correlation coefficient $\sigma_d^2 / (\sigma_d^2 + \sigma^2)$.

The following lines implement this model in R:

    library(lmerTest)
    model1 <- lmer(lnc ~ month + treatm + month:treatm + (1 | cage), data = rats)
    anova(model1)

    Analysis of Variance Table of type III with Satterthwaite
    approximation for degrees of freedom
                 Sum Sq Mean Sq NumDF DenDF F.value    Pr(>F)
    month                                           < 2.2e-16 ***
    treatm
    month:treatm                                              **
    ---
    Signif. codes: 0 *** ** 0.01 *

    VarCorr(model1)

     Groups   Name        Std.Dev.
     cage     (Intercept)
     Residual

    c(-2 * logLik(model1, REML = TRUE))  # REML = TRUE is default
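To make the covariance structure of the random effects model concrete, the sketch below builds the implied covariance and correlation matrices for four observations from one cage. The variance components sigma2_d and sigma2 are hypothetical values chosen for the illustration; they are not the estimates from model1.

```r
sigma2_d <- 0.03   # hypothetical between-cage variance
sigma2   <- 0.02   # hypothetical residual variance
n_obs    <- 4      # observations from the same cage

# sigma2_d in every cell, plus sigma2 on the diagonal
V <- sigma2_d * matrix(1, n_obs, n_obs) + sigma2 * diag(n_obs)
R <- cov2cor(V)    # implied correlation matrix

icc <- sigma2_d / (sigma2_d + sigma2)   # correlation within a cage
round(R, 2)
icc
```

Every pair of observations from the same cage gets the same correlation, regardless of how many months separate them.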

The lmer output gives estimates of the variance parameters ($\sigma_d^2$ and $\sigma^2$), twice the negative restricted/residual log likelihood ($-2\ell_{re} = 8.61$), and an ANOVA table for the fixed effects of the model. From this ANOVA table it is seen that the interaction between treatment and month is significant (marked ** in the table, i.e. a p value between 0.001 and 0.01). The conclusion from this model is that treatment does have an effect on the activity, but the effect is not the same in all ten months.

The main problem with this random effects approach is that all measurements on the same individual are assumed equally correlated, but some measurements are taken far apart and some are taken close to each other, so this assumption is not always valid. The next module will suggest a few ways to deal with this problem. However, the random effects approach may give reasonable results for short series (with 2, 3, or 4 measurements on each individual), since the assumption of equal correlation may be acceptable in those cases.

This random effects approach is also known as the split plot approach, or the split plot model. It is possible to view repeated measurements data as resulting from a kind of split plot experiment, with individuals as the main plots to which the treatments are applied. The sub plots are then the single measurements on each individual. This interpretation is a bit weak, as the single measurements on each individual (typically at different times) cannot be randomized within the individual.

11.5 Pros and cons of simple approaches

In this module a few simple approaches to the analysis of repeated measurements have been described. In many practical cases these simple approaches, especially the summary method, will give a sufficient and useful analysis of the data. Even in those cases where more sophisticated models are needed, it is often helpful to run a few simple models first.
Here follow a few pros (+) and cons (-) of the different methods:

Separate analysis for each time point
    + Not wrong
    - Can be confusing
    - Difficult to reach coherent conclusion
    - In general not very informative

Analysis of summary statistic
    + Good method with few and easily checked assumptions
    - Important to choose good summary measure(s)

Random effects approach
    + Good method for short series
    + Uses all observations
    - Usually not good for long series

11.6 The R-package nlme and the function lme

For what comes in the next module, the correlated residuals models, we will have to turn to the lme function of the nlme package. These model structures are not yet available in the lme4 package, although they may be implemented in the future. The lme function has a somewhat different syntax and also a somewhat different structure in the results. To run the simple split-plot version of the repeated measures model also given above:

    library(nlme)
    model2 <- lme(lnc ~ month + treatm + month:treatm,
                  random = ~ 1 | cage, data = rats)
    anova(model2)

                 numDF denDF F-value p-value
    (Intercept)                       <.0001
    month                             <.0001
    treatm
    month:treatm

    VarCorr(model2)

    cage = pdLogChol(1)
                Variance StdDev
    (Intercept)
    Residual

    c(-2 * logLik(model2))
    intervals(model2, which = "var-cov")

    Approximate 95% confidence intervals

     Random Effects:
      Level: cage
                    lower est. upper
    sd((Intercept))

     Within-group standard error:
     lower est. upper

The confidence intervals for the variance structure parameters produced here are not the same as those produced by the confint function (which cannot produce profile intervals for lme results). Instead they are Wald intervals constructed for the log-standard deviations. This matches the fact that the intervals are symmetric on the log scale:

    ins  <- intervals(model2, which = "var-cov")
    lins <- log(ins$sigma)  # lower, est., upper on the log scale
    # Distance from estimate to lower limit and from upper limit to estimate
    unname(c(lins[2] - lins[1], lins[3] - lins[2]))

11.7 Exercises

Exercise 1 Histamine concentration in dogs

In an experiment with 16 dogs the blood histamine concentration was measured 0, 1, 3, and 5 minutes after injection of morphine or trimethaphane. Before injection the dogs were classified into two groups according to their level of histamine (intact or depleted). The data are available in the file histamin.txt and partly listed below.

    treatm    level   dog  min  hist
    morphine  intact
    morphine  intact
    morphine  intact
    morphine  intact
    morphine  deplet
    morphine  deplet
    morphine  deplet
    morphine  deplet
    trimetha  intact
    trimetha  intact
    trimetha  intact
    trimetha  intact
    trimetha  deplet
    trimetha  deplet
    trimetha  deplet
    trimetha  deplet
    morphine  intact
    ...       (64 lines total)
    trimetha  deplet

The main focus of this experiment is to compare the effect of trimethaphane to the effect of morphine.

a) Make a plot of the data, for instance one line for each dog (maybe colored differently in each treatment group).

b) Analyze these data using one or more of the simple methods.

c) Formulate a conclusion about the treatment.

Exercise 2 Growth of guinea pigs

In an investigation of the effect of vitamin E on the growth of guinea pigs, 15 animals were observed for 7 weeks. In week one they were given a growth inhibiting substance. In the beginning of week five they received different amounts of vitamin E (dosage 0, 1, or 2). There were five animals in each treatment group, and each animal was weighed at the end of week 1, 3, 4, 5, 6, and 7. The data are available in the file guinea.txt (90 lines in total), with the columns animal, week, weight and dose.

The focus of this experiment is the effect of vitamin E on the growth of guinea pigs.

a) Plot the data.

b) What is the conclusion about vitamin E?

Repeated measures, part 2, advanced methods

eNote 12

Contents

12 Repeated measures, part 2, advanced methods
    12.1 Intro
    12.2 A