Hierarchical Random Effects


Contents

5.1 Introduction
5.2 Main example: Lactase measurements in piglets
5.3 Constant mean value structure
5.4 The one layer model
5.5 Two layer model
5.6 Three or more layers
5.7 Zero/Negative variance parameters
5.8 Comparing different variance structures
5.9 Better confidence limits in the balanced case
5.10 Prediction of random effect coefficients
5.11 R-TUTORIAL: The lactase example
5.12 Exercises

5.1 Introduction

This enote takes a closer look at a type of model where the experimental design leads to a natural hierarchy of the random effects. Consider for instance the following

experimental design to determine the fat content in milk (from a given area at a given time):

1. Ten farms were selected randomly from all milk producing farms in the area.
2. From each selected farm six cows were randomly selected.
3. From each selected cow two cups of milk were taken randomly from the daily production to be analyzed in the lab.

Intuitively, two cups of milk from the same cow would be expected to be more similar than two cups from different cows at different farms. Notice how data collected this way violate the standard assumption of independent measurements.

5.2 Main example: Lactase measurements in piglets

As part of a larger study of the intestinal health in newborn piglets, the gut enzyme lactase was measured in 20 piglets taken from 5 different litters. For each of the 20 piglets the lactase level was measured in three different regions. At the time the measurement was taken the piglet was either unborn (status=1) or newborn (status=2). These data are kindly provided by Charlotte Reinhard Bjørnvad, Department of Animal Science and Animal Health, Division of Animal Nutrition, Royal Veterinary and Agricultural University. The 60 log transformed measurements are listed below:

[Data listing with columns Obs, litter, pig, reg, status and loglact]

Notice the hierarchical structure in which the measurements are naturally ordered: first the five litters, then the piglets, and finally the individual measurements. This structure is illustrated in Figure 5.1 for the first two litters.

Figure 5.1: The hierarchical structure of the lactase data set for the first two litters.

5.3 Constant mean value structure

The mixed model framework is very flexible, and it is possible to include the mean value structures known from fixed effects models as well as the random effects structure. In this enote the focus is on the variance structure, and for this reason the same simple mean value structure is used when the models are presented. The mean structure is simply

denoted µ without any subscripts. This is done in order to keep the focus on the variance structure. In the analyzed examples more realistic mean value structures will be used.

5.4 The one layer model

The simplest hierarchical model is the one layer model, which is typically denoted:

y_i = µ + ε_i,   ε_i i.i.d. N(0, σ²)

Here all observations are independent, so this is really not a hierarchical model; it is listed here mainly for completeness. The mean value parameter(s) in a model with this simple variance structure are estimated via the least squares method, the maximum likelihood method, or the restricted/residual maximum likelihood method. In this case these methods all give the same result, namely:

β̂ = (X'X)⁻¹ X'y

Here β is the parameter(s) of the fixed effects part of the model, y is the observations, and X is the design matrix for the fixed effects part of the model, as described in the previous enote. The single variance parameter in the one layer model is estimated via the restricted/residual maximum likelihood method, which gives an unbiased estimate:

σ̂² = 1/(N−p) · Σ_{i=1}^{N} (y_i − µ̂_i)²

Here N is the total number of observations, p is the number of mean value parameters, and µ̂_i is the Best Linear Unbiased Estimate (BLUE) of the mean value for the i'th observation, which is calculated as µ̂_i = (Xβ̂)_i. The sum in the estimate for the variance is often referred to as the Sum of Squared Deviations (SSD), or simply as the Sum of Squares (SS).

Lactase measurements in piglets analyzed as independent observations

To analyse the lactase data set with a one layer model, the information about litter and pig is ignored. Remaining is the region (1, 2 or 3), the status (1 or 2), and the interaction between region and status ((1,1), (1,2), (2,1), (2,2), (3,1), or (3,2)). The one layer model for the lactase data becomes:

y_i = µ + α(status_i) + β(reg_i) + γ(status_i, reg_i) + ε_i,   ε_i i.i.d. N(0, σ²)

where i = 1, ..., 60 is the observation number.
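These closed-form estimates can be verified by hand. The following is a minimal sketch, assuming the lactase data frame has been read and its columns converted to factors as in the R-tutorial at the end of this enote (one log-lactase value is missing, so 59 observations enter, matching the model outputs later):

ok <- !is.na(lactase$loglact)                            # drop the single missing observation
X  <- model.matrix(~ reg + status + reg:status, data = lactase[ok, ])
y  <- lactase$loglact[ok]
beta_hat   <- solve(t(X) %*% X, t(X) %*% y)              # (X'X)^{-1} X'y
mu_hat     <- X %*% beta_hat                             # fitted mean values
sigma2_hat <- sum((y - mu_hat)^2) / (nrow(X) - ncol(X))  # SSD / (N - p)
beta_hat
sigma2_hat

These values should agree with coef(model1) and sigma(model1)^2 from the lm fit below.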

Figure 5.2: Factor structure in the lactase example when analyzed as independent measurements (one layer model).

The model is illustrated in the factor structure diagram in Figure 5.2. To analyze this model in R, write:

model1 <- lm(loglact ~ reg + status + reg:status, data = lactase)
anova(model1)

Analysis of Variance Table

Response: loglact
Df Sum Sq Mean Sq F value Pr(>F)
reg **
status *
reg:status
Residuals
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this case the mean value of the log-lactase measurements is described by region, status and the interaction between them. Notice how the intercept term µ is not specified; R adds the intercept term automatically. If the intercept term is undesired in a given model it must be removed by adding -1 after the last model term. From this output the estimate of the variance parameter σ̂² is obtained as the residual mean square, and two times the negative restricted/residual log likelihood function, 2ℓ_re, evaluated at the optimal parameters, can be found from

(-2) * logLik(model1, REML = TRUE)

which gives 2ℓ_re = 89.5 (the df=7 reported in the output is the number of model parameters). This is an important value as it is useful for comparing different variance structures. This enote is about hierarchical variance structures, and hence little attention is paid to the mean value structure. The ANOVA table indicates a p-value of 6.6% for removing the interaction term from the model, which is above the standard 5% significance level.

Type I and Type III tests - again

Type I (in anova): Successive tests
- Each term is corrected for the terms PREVIOUSLY listed in the model
- Sums of squares sum to the TOTAL sum of squares
- Depends on the order in which the terms are specified in the model

Type III (in drop1 and anova of lmerTest): Parallel tests
- Each term is corrected for ALL the other terms in the model
- Sums of squares do NOT generally sum to the TOTAL sum of squares
- Does NOT depend on the order in which the terms are specified in the model

If the data are balanced: the SAME!

drop1 is really Type II = Type III with the condition: do NOT test main effects which are part of (fixed) interactions!
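As a small illustration (a sketch assuming the lactase data frame is loaded): because one observation is missing, the data are slightly unbalanced, so the sequential Type I tests change with the order of the terms, while the drop1 test of the interaction does not.

anova(lm(loglact ~ reg + status + reg:status, data = lactase))  # reg entered first
anova(lm(loglact ~ status + reg + reg:status, data = lactase))  # status entered first
drop1(lm(loglact ~ reg * status, data = lactase), test = "F")   # does not depend on the order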

5.5 Two layer model

The simplest real hierarchical model is the two layer model, which is denoted:

y_i = µ + a(sub_i) + ε_i

where a(sub_i) ~ N(0, σ_a²) and ε_i ~ N(0, σ²), and all (both a's and ε's) are independent. In this two layer model the observations are no longer all independent; in fact the covariance between two observations is

cov(y_{i1}, y_{i2}) =
  0,            if sub_{i1} ≠ sub_{i2}
  σ_a²,         if sub_{i1} = sub_{i2} and i1 ≠ i2
  σ_a² + σ²,    if i1 = i2

which is easily calculated from the model. This means that two observations coming from the same subject are positively correlated. This is exactly what is needed in many situations where several observations are taken from the same subject. It corresponds well with the intuition behind the experiment, and the variance parameters can be interpreted in a natural way. The subject variance parameter σ_a² is the variance between subjects, and the residual variance parameter σ² is the variance between measurements on the same subject.

The variance parameters and the fixed effect parameter(s) in the two layer model are estimated via the restricted/residual maximum likelihood (REML) method. The REML method is preferred to the maximum likelihood method, because it gives less biased estimates of the variance parameters. For the fixed effect parameter(s) the REML estimate is given by:

β̂ = (X'V⁻¹X)⁻¹ X'V⁻¹y

Here β is the parameter(s) of the fixed effect part of the model, y is the observations, and X is the design matrix for the fixed effect part of the model. The matrix V is the covariance matrix for the observations y, which is estimated from the REML estimates of the variance parameters. In this two layer model the structure of V is:

V = | σ_a²+σ²  σ_a²     σ_a²     0        0        0       |
    | σ_a²     σ_a²+σ²  σ_a²     0        0        0       |
    | σ_a²     σ_a²     σ_a²+σ²  0        0        0       |
    | 0        0        0        σ_a²+σ²  σ_a²     σ_a²    |
    | 0        0        0        σ_a²     σ_a²+σ²  σ_a²    |
    | 0        0        0        σ_a²     σ_a²     σ_a²+σ² |

Here illustrated for the situation with two subjects and three observations on each subject. The general pattern is a block diagonal covariance matrix with blocks corresponding to each subject.
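As a small numerical sketch of this structure (with purely illustrative values σ_a² = 2 and σ² = 1, not estimates from the lactase data), the covariance matrix for two subjects with three observations each can be built in R as:

sigma2_a <- 2                                   # illustrative between-subject variance
sigma2   <- 1                                   # illustrative residual variance
n_sub    <- 3                                   # observations per subject
block    <- matrix(sigma2_a, n_sub, n_sub) + diag(sigma2, n_sub)  # one subject's block
V        <- kronecker(diag(2), block)           # two independent subjects -> block diagonal
V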

Figure 5.3: Factor structure in the lactase example when analyzed with a two layer model (random pig effect).

Lactase measurements in piglets analyzed with two layer model

To analyze the lactase data set with a two layer model the information about litter is still ignored, but the information about pig is now included. This corresponds to the two bottom layers in Figure 5.1 and is illustrated in the factor structure diagram in Figure 5.3. Alternatively the pig information could have been omitted and the litter information included. The two layer model with random pig effect is:

y_i = µ + α(status_i) + β(reg_i) + γ(status_i, reg_i) + d(pig_i) + ε_i,

where d(pig_i) ~ N(0, σ_d²) and ε_i ~ N(0, σ²), and all (both d's and ε's) are independent. Here i = 1, ..., 60 is the observation number. The idea behind this model is that pigs are different, and the pigs in this investigation are random representatives from the relevant population of newborn pigs. The variance between pigs is described by the σ_d² parameter, and the remaining variance between measurements on the same pig is described by the σ² parameter.

To analyze this model in R, write:

require(lmerTest)
model2 <- lmer(loglact ~ reg * status + (1 | pig), data = lactase)
anova(model2)

Analysis of Variance Table of type III with Satterthwaite approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
reg **
status
reg:status **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

print(summary(model2), corr = FALSE)

Linear mixed model fit by REML
t-tests use Satterthwaite approximations to degrees of freedom ['lmerMod']
Formula: loglact ~ reg * status + (1 | pig)
Data: lactase

REML criterion at convergence: 79.9

Scaled residuals:
Min 1Q Median 3Q Max

Random effects:
Groups   Name        Variance Std.Dev.
pig      (Intercept)
Residual
Number of obs: 59, groups: pig, 20

Fixed effects:
            Estimate Std. Error df t value Pr(>|t|)
(Intercept)                                e-13 ***
reg2
reg3

status2
reg2:status2
reg3:status2                               *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From this output it is evident that the variation between pigs is a major part of the total variation. The estimated variance between pigs, σ̂_d², and the estimated residual variance, σ̂², can be read off the random effects table above; a little less than half of the total variation in the data set is due to the variation between pigs. Twice the negative restricted/residual log likelihood is 2ℓ_re = 79.9.

The conclusion about the fixed effect parameters is different from the one layer model. In the one layer model the interaction term between status and region could be removed with a p-value of 6.6%, but now the interaction term is significant with a p-value of 0.76%. The important lesson to be learned from this example is that the variance structure matters. Even in two models with the exact same fixed effect structure, the conclusions can differ greatly if the variance structure is not correct. Later in this enote a method for comparing different variance structures will be shown.

5.6 Three or more layers

Hierarchical models with three or more levels are natural in some situations. Consider for instance the situation with randomly selected farms, randomly selected cows from each farm, and randomly selected cups of milk from the daily production of each cow. In this situation the natural model would be a three layer model. The basic three level model is denoted:

y_i = µ + a(block_i) + b(sub_i) + ε_i

where a(block_i) ~ N(0, σ_a²), b(sub_i) ~ N(0, σ_b²) and ε_i ~ N(0, σ²), all independent, and i is the observation index. The interpretation is similar to the interpretation of the two layer model. The parameter σ_a² is the variance between blocks (farms), σ_b² is the variance between subjects (cows) in the same block (farm), and σ² is the variance between observations from the same subject (cow). The parameters are estimated by the REML method, just like in the two layer model, and the structure of the covariance matrix is:

V = | σ_a²+σ_b²+σ²  σ_a²+σ_b²     σ_a²          σ_a²          0             0             0             0            |
    | σ_a²+σ_b²     σ_a²+σ_b²+σ²  σ_a²          σ_a²          0             0             0             0            |
    | σ_a²          σ_a²          σ_a²+σ_b²+σ²  σ_a²+σ_b²     0             0             0             0            |
    | σ_a²          σ_a²          σ_a²+σ_b²     σ_a²+σ_b²+σ²  0             0             0             0            |
    | 0             0             0             0             σ_a²+σ_b²+σ²  σ_a²+σ_b²     σ_a²          σ_a²         |
    | 0             0             0             0             σ_a²+σ_b²     σ_a²+σ_b²+σ²  σ_a²          σ_a²         |
    | 0             0             0             0             σ_a²          σ_a²          σ_a²+σ_b²+σ²  σ_a²+σ_b²    |
    | 0             0             0             0             σ_a²          σ_a²          σ_a²+σ_b²     σ_a²+σ_b²+σ² |

Here illustrated for the case with two blocks, two subjects from each block, and two observations from each subject. The first 4 by 4 block in V corresponds to the first block, and the next non-zero 4 by 4 block corresponds to the second block. Within each of these blocks there is a similar structure, where two observations from the same subject are more highly correlated than two observations from different subjects.

Lactase measurements in piglets analyzed with three layer model

The natural model for the lactase data set is a three layer model, with litter as the top layer, pig as the middle layer, and each single measurement as the bottom layer, as illustrated in the factor structure diagram in Figure 5.4. This hierarchy is illustrated in Figure 5.1 for the first two litters.

y_i = µ + α(status_i) + β(reg_i) + γ(status_i, reg_i) + d(litter_i) + e(pig_i) + ε_i,

where d(litter_i) ~ N(0, σ_d²), e(pig_i) ~ N(0, σ_e²), and ε_i ~ N(0, σ²), and all (both d's, e's, and ε's) are independent. i is the observation number.

Figure 5.4: Factor structure in the lactase example when analyzed with a three layer model (random pig and litter effect).

To analyze this model in R, write (here the litter information has been added to the random part of the model):

model3 <- lmer(loglact ~ reg * status + (1 | pig) + (1 | litter), data = lactase)
print(summary(model3), corr = FALSE)

Linear mixed model fit by REML
t-tests use Satterthwaite approximations to degrees of freedom ['lmerMod']
Formula: loglact ~ reg * status + (1 | pig) + (1 | litter)
Data: lactase

REML criterion at convergence: 79.9

Scaled residuals:
Min 1Q Median 3Q Max

Random effects:
Groups   Name        Variance Std.Dev.
pig      (Intercept)
litter   (Intercept)
Residual
Number of obs: 59, groups: pig, 20; litter, 5

Fixed effects:
            Estimate Std. Error df t value Pr(>|t|)
(Intercept)                                e-13 ***
reg2
reg3
status2
reg2:status2
reg3:status2                               *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(model3)

Analysis of Variance Table of type III with Satterthwaite approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
reg **
status
reg:status **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

confint(model3, 1:3, oldNames = FALSE)

Computing profile confidence intervals ...

                       2.5 %   97.5 %
sd_(Intercept)|pig
sd_(Intercept)|litter
sigma

Compared to the output from the two layer model there is basically no difference: the likelihood is exactly the same, the variance for the litter effect is estimated at zero, and the conclusions about the fixed effect parameters are unchanged.

5.7 Zero/Negative variance parameters

In some software, e.g. SAS Proc Mixed, an option can be set to allow for negative variance components - NOT in R though. When negative variance estimates occur, it is usually as an underestimate of a small (or zero) variance. In this case a reasonable way to handle the negative estimate is simply to leave out, or fix to zero, the corresponding variance component.

In some rare cases it is possible to interpret a negative variance parameter estimate. For instance in the case of several observations on the same litter, it could be possible that some of the animals would do well and eat all the available food, and some of the animals would do poorly, and because all the food was now gone they would do even worse. In this type of competitive situation there is a negative within-group correlation, and the variation within a group could be greater than the variation between groups, which would lead to negative variance estimates. In R these are instead set to zero, which really makes better sense in most cases.

5.8 Comparing different variance structures

To test a reduction in the variance structure a restricted/residual likelihood ratio test can be used. A likelihood ratio test is used to compare two models A and B, where B is a sub-model of A. Typically the model including some variance component (model A) and the model without this variance component (model B) are to be compared. To do this the negative restricted/residual log-likelihood values ℓ_re(A) and ℓ_re(B) from both models must be computed. The test statistic can now be computed as:

G_{A→B} = 2ℓ_re(B) − 2ℓ_re(A)

The likelihood ratio test statistic follows a χ²_df distribution, where df is the difference between the number of parameters in A and the number of parameters in B. In the case of removing a single variance component df = 1. For the lactase data set three different variance structures have been used; the comparison of these variance models is seen in the following table:

Model                            2ℓ_re   G value          df   P value
(A) Pig and litter included      79.9    G_{A→B} = 0.0    1    P_{A→B} = 1
(B) Only pig included            79.9    G_{B→C} = 9.6    1    P_{B→C} ≈ 0.002/2
(C) Independent observations     89.5

It follows that the litter variance component can be left out of the model, but the pig variance component is very significant. In other words: two observations taken on the same pig are correlated, but two observations from different pigs within the same litter are not correlated (positively or negatively).

Warning: It is really more correct to use df=0.5 instead of df=1 in these tests! OR similarly (approximately) divide the p values by 2.

To calculate these p values via R, either use the anova function with refit=FALSE (to get the restricted likelihood ratio tests):

anova(model2, model3, refit = FALSE)

Or compute the negative log-likelihood directly:

(-2) * logLik(model3)
(-2) * logLik(model2)
(-2) * logLik(model1, REML = TRUE)  # Need REML=TRUE for lm-objects

To compute the p values by hand:

G_value <- -2 * c(logLik(model1, REML = TRUE) - logLik(model2))
pchisq(G_value, df = 1, lower.tail = FALSE)

[1]
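A minimal sketch of this correction, reusing G_value from the code just above: the χ²_1 p-value is simply halved, corresponding to a 50:50 mixture of χ²_0 and χ²_1 as the reference distribution when the true variance component sits on the boundary of its parameter space.

p_naive   <- pchisq(G_value, df = 1, lower.tail = FALSE)  # standard chi-square(1) p-value
p_mixture <- 0.5 * p_naive                                # boundary-corrected p-value
c(naive = p_naive, corrected = p_mixture)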

5.9 Better confidence limits in the balanced case

In R we get profile likelihood based confidence intervals from the confint function. We have also seen the alternative method of using an approximate χ²-distribution based on the Satterthwaite approach; however, this method requires the asymptotic variance-covariance matrix of the (random part) parameters, and these are not so easily extracted from lmer results. (It will be built into the lmerTest package, but is not currently available.)

In the balanced case the variance components can often be expressed simply as linear combinations of independent Mean Squares, which are all chi-square distributed, and a simple chi-square approximation can be used to construct better confidence limits. Some of the theory behind this is given here.

A factor F is said to be balanced if the same number of observations are made for each of its levels. In this section it is assumed that all factors are balanced. As seen in the enote on factor structure diagrams, it is possible in nice cases to calculate the degrees of freedom from the factor structure diagram. The formula to calculate the degrees of freedom df_F for a factor F is:

df_F = |F| − Σ_{G: G<F} df_G

Here |F| is the number of levels of the factor F, and the last sum is the sum of the degrees of freedom from all factors coarser than F. In these nice cases a similar formula exists for computing the sums of squares (SS) to be used in the analysis of variance table. The sum of squares SS_F corresponding to a factor F is:

SS_F = (1/n_F) Σ_{j∈F} S_F(j)² − Σ_{G: G<F} SS_G

This formula requires some explaining. The term n_F is the number of observations on each level of the factor (as the factor is balanced this is the same on all levels). The term S_F(j) is the sum of all observations at the j'th level of the factor F. For instance if F is the factor indicating treatment ("no" or "yes"), then S_F(yes) is the sum of all observations from treated individuals. The last sum is the sum of the SS's from all factors coarser than F.

Example 5.1 Balanced two layer model

Consider again the two layer model:

y_i = µ + a(sub_i) + ε_i

where a(sub_i) ~ N(0, σ_a²) and ε_i ~ N(0, σ²), all independent. The number of subjects is |sub|, the number of observations on each subject is n_sub, and hence the total number of observations in the balanced case is N = |sub| · n_sub. The factor structure for this model is:

[Factor structure diagram: 0 → [sub] → [I], with degrees of freedom 1, |sub| − 1 and N − |sub|]

From this diagram the degrees of freedom for each factor can be calculated (in fact they are already printed in the diagram). For the constant factor 0 the degrees of freedom is always df_0 = 1, as it is the coarsest factor and has only one level. The subject factor has |sub| levels and only 0 is coarser, so df_sub = |sub| − 1. Finally for the finest factor I: I has N levels, but sub and 0 are coarser, so df_I = N − (|sub| − 1) − 1 = N − |sub|. Similarly the SS values can be calculated as:

SS_0   = (1/N) (Σ_{i=1,...,N} y_i)²
SS_sub = (1/n_sub) Σ_{j∈sub} S_sub(j)² − SS_0
SS_I   = Σ_{i=1,...,N} y_i² − SS_sub − SS_0

These formulas may seem cumbersome, but actually they are fairly simple to use, as they only depend on the total sum, the total sum of squares, and the subject sums. With these formulas in place the analysis of variance table can be constructed as:

Source    SS       df          MS                    EMS
Subject   SS_sub   |sub| − 1   SS_sub/(|sub| − 1)    σ² + n_sub σ_a²
Residual  SS_I     N − |sub|   SS_I/(N − |sub|)      σ²

So natural estimates of the two variance parameters are:

σ̂² = MS_I   and   σ̂_a² = (MS_sub − MS_I) / n_sub

In this balanced case it is also possible to write down better confidence intervals for the variance parameters than the general confidence intervals based on the normal approximation, which will be described later. The standard 95% confidence interval for the residual variance is:

(N − |sub|) σ̂² / χ²_{0.975; N−|sub|}  <  σ²  <  (N − |sub|) σ̂² / χ²_{0.025; N−|sub|}

Here N − |sub| are the degrees of freedom corresponding to MS_I. The parameter σ_a² is estimated by a linear combination of two MS values, so the exact degrees of freedom cannot be found. Satterthwaite's approximation is used to approximate the degrees of freedom: in this case Satterthwaite determines the degrees of freedom by making the corresponding χ² distribution have the same variance as the estimate. The 95% confidence interval is:

ν σ̂_a² / χ²_{0.975; ν}  <  σ_a²  <  ν σ̂_a² / χ²_{0.025; ν}

where ν is estimated as:

ν = (σ̂_a²)² / [ (MS_sub/n_sub)²/(|sub| − 1) + (MS_I/n_sub)²/(N − |sub|) ]

Slightly more generally, if fixed effects are also present in the model, or if possibly some imbalance occurs, one may still estimate variance components by a so-called moment method. This is actually a simple alternative to REML estimation. It amounts to solving the linear system where the theoretical expected random effect mean squares are set equal to the observed mean squares. In balanced experiments these theoretical expected mean squares can be deduced from the factor structure diagram, cf. enote 2, and the results of this will equal the REML estimates! In other cases, expected mean squares are almost impossible to deduce by hand calculation. This is not so easily available in R.

In general, IF a variance component is estimated as a linear combination of mean squares:

σ̂_A² = Σ_{k=1}^{K} b_k MS_k

where the mean squares have associated degrees of freedom f_1, ..., f_K, then an approximate 95% confidence band can be obtained by:

ν σ̂_A² / χ²_{0.975; ν}  <  σ_A²  <  ν σ̂_A² / χ²_{0.025; ν}

where ν is estimated as:

ν = (σ̂_A²)² / Σ_{k=1}^{K} (b_k MS_k)² / f_k

This is the basic Satterthwaite formula. In the case above, K = 2, b_1 = 1/n_sub, b_2 = −1/n_sub, f_1 = |sub| − 1 and f_2 = N − |sub|. In general, the REML estimates cannot be written as such linear combinations, so other methods are required; a different Satterthwaite-type method for REML estimates was described in enote 4.

Before computers revolutionized the world of statistics, the balanced case was extremely important. Great care was taken to obtain balanced experiments, but a single failed measurement could ruin a balanced design. With modern computers and software the balanced case seems less important.
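To make these formulas concrete, here is a hedged sketch of the moment estimates and the Satterthwaite interval, computed on the balanced reduced lactase data used in Example 5.3 later (pigs 3-8 removed, 42 complete observations; assuming the lactase data frame from the R-tutorial is loaded), with pig as the subject factor and ignoring all fixed effects except the overall mean. It only illustrates the mechanics and is not a reproduction of the REML fits above.

lactase_red <- subset(lactase, pig %in% c(1:2, 9:20))    # balanced subset, 14 pigs x 3 obs
y     <- lactase_red$loglact
N     <- length(y)                                       # 42 observations
n_sub <- 3                                               # observations per pig
S_sub <- tapply(y, droplevels(lactase_red$pig), sum)     # subject (pig) sums
n_pig <- length(S_sub)                                   # 14 pigs
SS0   <- sum(y)^2 / N
SSsub <- sum(S_sub^2) / n_sub - SS0
SSI   <- sum(y^2) - SSsub - SS0
MSsub <- SSsub / (n_pig - 1)
MSI   <- SSI / (N - n_pig)
sigma2_hat   <- MSI                                      # residual variance estimate
sigma2_a_hat <- (MSsub - MSI) / n_sub                    # between-pig variance estimate
# Satterthwaite degrees of freedom and 95% interval for the between-pig variance:
nu <- sigma2_a_hat^2 / ((MSsub/n_sub)^2/(n_pig - 1) + (MSI/n_sub)^2/(N - n_pig))
c(lower = nu * sigma2_a_hat / qchisq(0.975, nu),
  upper = nu * sigma2_a_hat / qchisq(0.025, nu))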

5.10 Prediction of random effect coefficients

This is reproduced from enote 4. A model with random effects is typically used in situations where the subjects (or blocks) in the study are to be considered as representatives from a greater population. As such, the main interest is usually not in the actual levels of the randomly selected subjects, but in the variation among them. It is however possible to obtain predictions of the individual subject levels u. The formula is given in matrix notation as:

û = G Z' V⁻¹ (y − Xβ̂)

If G and R are known, û can be shown to be the best linear unbiased predictor (BLUP) of u. Here, "best" means minimum mean squared error. In real applications G and R are always estimated, but the predictor is still referred to as the BLUP.
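This formula can be evaluated by hand using standard lme4 extractors. The following is a minimal sketch for model2 (the two layer model above), where the estimated G, R and V are plugged in for the unknown true matrices:

library(Matrix)                                         # lme4 design matrices are sparse
X <- getME(model2, "X")                                 # fixed effects design matrix
Z <- getME(model2, "Z")                                 # random effects design matrix
y <- getME(model2, "y")                                 # response used in the fit
G <- diag(as.numeric(VarCorr(model2)$pig), ncol(Z))     # covariance of the 20 pig effects
R <- diag(sigma(model2)^2, nrow(X))                     # residual covariance
V <- Z %*% G %*% t(Z) + R                               # marginal covariance of y
u_hat <- G %*% t(Z) %*% solve(V, y - X %*% fixef(model2))
cbind(by_hand = as.numeric(u_hat), ranef = unlist(ranef(model2)$pig))

In R the same BLUPs are of course obtained much more easily with the ranef extractor: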

ranef(model2)

$pig
   (Intercept)

5.11 R-TUTORIAL: The lactase example

The R function lmer from the package lme4 cannot be used for models with no random effects. Instead lm has to be used whenever versions of our models with no random effects are needed for comparison:

lactase <- read.table("lactase.txt", sep = ",", header = TRUE, na.strings = ".")
lactase$litter <- factor(lactase$litter)
lactase$pig    <- factor(lactase$pig)
lactase$reg    <- factor(lactase$reg)
lactase$status <- factor(lactase$status)
model1 <- lm(loglact ~ reg + status + reg:status, data = lactase)
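It can be worthwhile to verify the hierarchical structure of the data before moving on; a small sketch using the data frame as read above:

xtabs(~ litter + pig, data = lactase)   # each pig should appear in exactly one litter
xtabs(~ pig + reg, data = lactase)      # one measurement per pig and region
sum(is.na(lactase$loglact))             # one log-lactase value is missing, so 59 observations are used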

require(car)
Anova(model1, type = 3, test.statistic = "F")

Anova Table (Type III tests)

Response: loglact
            Sum Sq Df F value Pr(>F)
(Intercept)                    e-16 ***
reg
status
reg:status
Residuals
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Even though Type III tests for main effects are produced by the Anova function, one must be careful about the interpretation of these, and for unbalanced cases, and in fact also for models with covariates, we should not use such Type III tests from this function (as the interpretation is generally unclear). In the present case there is only one test for which we would always be sure to have a good interpretation, namely the test for the interaction term, and this would e.g. also be returned by the drop1 function:

drop1(model1, test = "F")

Single term deletions

Model:
loglact ~ reg + status + reg:status
           Df Sum of Sq RSS AIC F value Pr(>F)
<none>
reg:status
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In R the two-layer model is specified using the function lmer:

model2 <- lmer(loglact ~ reg * status + (1 | pig), data = lactase)
anova(model2)

Analysis of Variance Table of type III with Satterthwaite approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
reg **
status
reg:status **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

where the (1 | pig) part specifies that there is a random effect for each level of the variable pig. The (default) output from anova, using the lmerTest package, is Type III F statistics and p values. A Type I version is also an option (use type = 1).

Remark 5.2

WARNING - a confusing detail: in general the SS and MS columns from lme4 (and hence lmerTest) will NOT coincide with what is seen from lm, as the SS and MS columns from the lme4 anova are scaled by the proper mixed model error terms, such that when they are divided by the residual error variance, the proper mixed model F-statistics appear! Confused? See the example below.

Example 5.3 Comparing anova on lm and lmer

## We illustrate this on a balanced data set, so we remove pigs 3-8:
lactase_red <- subset(lactase, pig %in% c(1:2, 9:20))
testmodel      <- lmer(loglact ~ status + (1 | pig), data = lactase_red)
testmodelfixed <- lm(loglact ~ status + pig, data = lactase_red)
anova(testmodel)

Analysis of Variance Table of type III with Satterthwaite approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
status

anova(testmodelfixed)

Analysis of Variance Table

Response: loglact
          Df Sum Sq Mean Sq F value Pr(>F)
status
pig                                        *
Residuals
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that the proper mixed model F-statistic for the status effect (as pigs are nested within status) is

F_Mixed = MS_status / MS_pig

where the SS_status is defined the usual ANOVA way (as there are 21 observations for each status in the reduced data set):

statusmeans <- tapply(lactase_red$loglact, lactase_red$status, mean)
overallmean <- mean(lactase_red$loglact)
21 * sum((statusmeans - overallmean)^2)

[1]

But from anova(testmodel) we instead get a different SS for status, which comes about as:

SS_status^(lmer) = SS_status^(usual) · MS_Residual / MS_Pig = SS_status^(usual) · MS_Residual / (Mixed Error)

and finally note that the F-statistic for the status effect can then be obtained by taking the (so-called) MS from anova(testmodel) and dividing by the residual error:

F_Mixed = MS_status^(lmer) / MS_Residual

And by the way, the residual error MS from the fixed model is also the estimate of the residual error variance of the mixed model:

sigma(testmodel)^2

[1]

The function lmer returns a merMod object containing all information about the fitted model; in the examples above model2 is such a merMod object. To extract information from the object various functions, so-called extractors, can be applied to it. We have already met the two extractors anova and summary earlier. A list of all such methods is found as:

methods(class = "merMod")

 [1] anova            Anova            as.function      coef
 [5] confint          deltaMethod      deviance         df.residual
 [9] drop1            extractAIC       family           fitted
[13] fixef            formula          getL             getME
[17] hatvalues        isGLMM           isLMM            isNLMM
[21] isREML           linearHypothesis logLik           matchCoefs
[25] model.frame      model.matrix     ngrps            nobs
[29] plot             predict          print            profile
[33] ranef            refit            refitML          residuals
[37] show             sigma            simulate         summary
[41] terms            update           VarCorr          vcov
[45] weights
see '?methods' for accessing help and source code
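A few of these extractors in action (a sketch using model2 from above; parm = "theta_" is the lme4 shortcut for selecting the random effect parameters):

fixef(model2)                                         # fixed effect estimates
VarCorr(model2)                                       # variance component estimates
confint(model2, parm = "theta_", oldNames = FALSE)    # profile CIs for the variance parameters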

5.12 Exercises

Exercise 1 Blueberries

In an experiment with 4 sorts of blueberries the fructification (the number of berries as a percentage of the number of flowers earlier in the season) was determined for 5 twigs on each of 6 bushes for every sort. The data are available in the file blueberry.txt; see also the hierarchical Section 13.3 in enote 13.

a) Investigate if the 4 sorts have different fructification.

b) Draw the factor structure diagram corresponding to the analyzed model.

c) Test if it is important to include bush as a random effect.


More information

Lecture 3: Inference in SLR

Lecture 3: Inference in SLR Lecture 3: Inference in SLR STAT 51 Spring 011 Background Reading KNNL:.1.6 3-1 Topic Overview This topic will cover: Review of hypothesis testing Inference about 1 Inference about 0 Confidence Intervals

More information

A Handbook of Statistical Analyses Using R 2nd Edition. Brian S. Everitt and Torsten Hothorn

A Handbook of Statistical Analyses Using R 2nd Edition. Brian S. Everitt and Torsten Hothorn A Handbook of Statistical Analyses Using R 2nd Edition Brian S. Everitt and Torsten Hothorn CHAPTER 12 Analysing Longitudinal Data I: Computerised Delivery of Cognitive Behavioural Therapy Beat the Blues

More information

Lecture 22 Mixed Effects Models III Nested designs

Lecture 22 Mixed Effects Models III Nested designs Lecture 22 Mixed Effects Models III Nested designs 94 Introduction: Crossed Designs The two-factor designs considered so far involve every level of the first factor occurring with every level of the second

More information

Case of single exogenous (iv) variable (with single or multiple mediators) iv à med à dv. = β 0. iv i. med i + α 1

Case of single exogenous (iv) variable (with single or multiple mediators) iv à med à dv. = β 0. iv i. med i + α 1 Mediation Analysis: OLS vs. SUR vs. ISUR vs. 3SLS vs. SEM Note by Hubert Gatignon July 7, 2013, updated November 15, 2013, April 11, 2014, May 21, 2016 and August 10, 2016 In Chap. 11 of Statistical Analysis

More information

Multivariate Statistics in Ecology and Quantitative Genetics Summary

Multivariate Statistics in Ecology and Quantitative Genetics Summary Multivariate Statistics in Ecology and Quantitative Genetics Summary Dirk Metzler & Martin Hutzenthaler http://evol.bio.lmu.de/_statgen 5. August 2011 Contents Linear Models Generalized Linear Models Mixed-effects

More information

One-way ANOVA (Single-Factor CRD)

One-way ANOVA (Single-Factor CRD) One-way ANOVA (Single-Factor CRD) STAT:5201 Week 3: Lecture 3 1 / 23 One-way ANOVA We have already described a completed randomized design (CRD) where treatments are randomly assigned to EUs. There is

More information

Mixed-Models. version 30 October 2011

Mixed-Models. version 30 October 2011 Mixed-Models version 30 October 2011 Mixed models Mixed models estimate a vector! of fixed effects and one (or more) vectors u of random effects Both fixed and random effects models always include a vector

More information

Fitting mixed models in R

Fitting mixed models in R Fitting mixed models in R Contents 1 Packages 2 2 Specifying the variance-covariance matrix (nlme package) 3 2.1 Illustration:.................................... 3 2.2 Technical point: why to use as.numeric(time)

More information

PAPER 206 APPLIED STATISTICS

PAPER 206 APPLIED STATISTICS MATHEMATICAL TRIPOS Part III Thursday, 1 June, 2017 9:00 am to 12:00 pm PAPER 206 APPLIED STATISTICS Attempt no more than FOUR questions. There are SIX questions in total. The questions carry equal weight.

More information

Essential of Simple regression

Essential of Simple regression Essential of Simple regression We use simple regression when we are interested in the relationship between two variables (e.g., x is class size, and y is student s GPA). For simplicity we assume the relationship

More information

Exercise 5.4 Solution

Exercise 5.4 Solution Exercise 5.4 Solution Niels Richard Hansen University of Copenhagen May 7, 2010 1 5.4(a) > leukemia

More information

36-309/749 Experimental Design for Behavioral and Social Sciences. Sep. 22, 2015 Lecture 4: Linear Regression

36-309/749 Experimental Design for Behavioral and Social Sciences. Sep. 22, 2015 Lecture 4: Linear Regression 36-309/749 Experimental Design for Behavioral and Social Sciences Sep. 22, 2015 Lecture 4: Linear Regression TCELL Simple Regression Example Male black wheatear birds carry stones to the nest as a form

More information

Lecture 15. Hypothesis testing in the linear model

Lecture 15. Hypothesis testing in the linear model 14. Lecture 15. Hypothesis testing in the linear model Lecture 15. Hypothesis testing in the linear model 1 (1 1) Preliminary lemma 15. Hypothesis testing in the linear model 15.1. Preliminary lemma Lemma

More information

Regression, Part I. - In correlation, it would be irrelevant if we changed the axes on our graph.

Regression, Part I. - In correlation, it would be irrelevant if we changed the axes on our graph. Regression, Part I I. Difference from correlation. II. Basic idea: A) Correlation describes the relationship between two variables, where neither is independent or a predictor. - In correlation, it would

More information

Stat 5102 Final Exam May 14, 2015

Stat 5102 Final Exam May 14, 2015 Stat 5102 Final Exam May 14, 2015 Name Student ID The exam is closed book and closed notes. You may use three 8 1 11 2 sheets of paper with formulas, etc. You may also use the handouts on brand name distributions

More information

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010 1 Linear models Y = Xβ + ɛ with ɛ N (0, σ 2 e) or Y N (Xβ, σ 2 e) where the model matrix X contains the information on predictors and β includes all coefficients (intercept, slope(s) etc.). 1. Number of

More information