Short Course: Applied Linear and Nonlinear Mixed Models*


Introduction

Mixed-effects models (or simply "mixed models") are like classical ("fixed-effects") statistical models, except that some of the parameters describing group effects or covariate effects are replaced by random variables, or random effects. The model thus has both parameters, also known as fixed effects, and random effects; hence the model has mixed effects.

Random effects can be thought of as random versions of parameters. So, in some sense, a mixed model has both fixed and "random parameters." This can be a useful way to think about it, but it's not quite right and can lead to confusion. The word "parameter" means a fixed, unknown constant, so "random parameter" is really something of an oxymoron. As we'll see, the distinction between a parameter and a random effect goes well beyond vocabulary.

* Temple-Inland Forest Products, Inc., Jan.

Random effects arise when the observations being analyzed are heterogeneous and can be thought of as belonging to several groups or clusters. This happens when there is one observation per experimental unit (tree, patient, plot, animal) and the experimental units occur, or are measured, in different locations, at different time points, from different sires or genetic strains, etc. It also often occurs when repeated measurements are taken on each experimental unit. E.g., several observations are taken through time of the height of each of 100 trees; the repeated height measurements are grouped, or clustered, by tree.

The use of random effects in linear models leads to linear mixed models (LMMs). LMMs are not new. Some examples from this class are among the simplest, most familiar linear models and are very old. However, until recently, software and statistical methods for inference were not well-developed enough to handle the general case. Thus, only recently has the full flexibility and power of this class of models been realized.

Some Simple LMMs: The one-way random effects model

Railway Rails: (See Pinheiro and Bates, 1.1.) The data displayed below are from an experiment conducted to measure longitudinal (lengthwise) stress in railway rails. Six rails were chosen at random and tested three times each by measuring the time it took for a certain type of ultrasonic wave to travel the length of the rail.

(Data table: zero-force travel time (nanoseconds) for three trials on each of the six rails.)

Clearly, these data are grouped, or clustered, by rail. This clustering has two closely related implications:

1. (within-cluster correlation) we should expect that observations from the same rail will be more similar to one another than observations from different rails; and
2. (between-cluster heterogeneity) we should expect that the mean response will vary from rail to rail, in addition to varying from one measurement to the next.

These ideas are really flip sides of the same coin.
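
The rail data are available in R; a minimal sketch, assuming the Rail data set that ships with the nlme package matches the values used in these notes:

   library(nlme)                           ## Rail: 18 rows, columns Rail and travel
   data(Rail)
   with(Rail, tapply(travel, Rail, mean))  ## per-rail sample means of travel time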

Although it is fairly obvious that the clustering by rail must be incorporated into the modeling of these data somehow, we first consider a naive approach. The primary interest here is in measuring the mean travel time. Therefore, we might naively consider the model

   y_ij = µ + e_ij,   i = 1, ..., 6, j = 1, ..., 3,

where y_ij is the travel time for the j-th trial on the i-th rail, and we assume e_11, ..., e_63 ~ iid N(0, σ²). Here, the notation "iid N(0, σ²)" means "are independent, identically distributed random variables, each with a normal distribution with mean 0 and (constant) variance σ²." In addition, µ is the mean travel time, which we wish to estimate. Its maximum likelihood (ML)/ordinary least-squares (OLS) estimate is the grand sample mean of all observations in the data set: ȳ = 66.5. The mean square error (MSE), s², estimates the error variance σ².

However, an examination of the residuals from this model, plotted separately by rail, reveals the inadequacy of the model:

(Figure: boxplots of raw residuals by rail for the simple mean model.)

Clearly, the mean response is changing from rail to rail. Therefore, we consider a one-way ANOVA model:

   y_ij = µ + α_i + e_ij.   (*)

Here, µ is a grand mean across the rails included in the experiment, and α_i is an effect, up or down from the grand mean, specific to the i-th rail. Alternatively, we could define µ_i = µ + α_i as the mean response for the i-th rail and reparameterize this model as

   y_ij = µ_i + e_ij.

The OLS estimates of the parameters of this model are ˆµ_i = ȳ_i, giving

   (ˆµ_1, ..., ˆµ_6) = (54.00, 31.67, 84.67, 96.00, 50.00, 82.67),

and s² is now much smaller. The residual plot looks much better:

(Figure: boxplots of raw residuals by rail for the one-way fixed effects model.)
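
A sketch of both fixed-effects fits in R (the object names fit0 and fit1 are mine, not taken from intro.r):

   fit0 <- lm(travel ~ 1, data = Rail)          ## naive single-mean model
   coef(fit0)                                   ## grand mean, 66.5
   boxplot(resid(fit0) ~ Rail$Rail,             ## residuals cluster by rail
           xlab = "Rail", ylab = "Residual")
   fit1 <- lm(travel ~ Rail - 1, data = Rail)   ## one-way fixed effects model (*)
   coef(fit1)                                   ## the six rail means mu_i-hat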

However, there are still drawbacks to this one-way fixed effects model:

- It only models the specific sample of rails used in the experiment, while the main interest is in the population of rails from which these rails were drawn.
- It does not produce an estimate of the rail-to-rail variability in travel time, which is a quantity of significant interest in the study.
- The number of parameters increases linearly with the number of rails used in the experiment.

These deficiencies are overcome by the one-way random effects model. To motivate this model, consider again the one-way fixed effects model. Model (*) can be written as

   y_ij = µ + (µ_i − µ) + e_ij,

where, under the usual constraint Σ_i α_i = 0, (µ_i − µ) = α_i has mean 0 when averaged over the groups (rails).

The one-way random effects model replaces the fixed parameter (µ_i − µ) with a random effect b_i, a random variable specific to the i-th rail, which is assumed to have mean 0 and an unknown variance σ_b². This yields the model

   y_ij = µ + b_i + e_ij,   (**)

where b_1, ..., b_6 are independent random variables, each with mean 0 and variance σ_b². Often the b_i's are assumed normal, and they are usually assumed independent of the e_ij's. Thus we have

   b_1, ..., b_a ~ iid N(0, σ_b²), independent of e_11, ..., e_an ~ iid N(0, σ²),

where a is the number of rails and n is the number of observations on the i-th rail.
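
A sketch of the random-effects fit (**) with lme() from nlme; random = ~ 1 | Rail requests a rail-specific random intercept b_i:

   fit2 <- lme(travel ~ 1, random = ~ 1 | Rail, data = Rail)
   summary(fit2)   ## mu-hat, plus StdDev estimates for b_i (rail) and e_ij (residual)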

Note that the interpretation of µ now changes from the mean over the 6 rails included in the experiment (fixed effects model) to the mean over the population of all rails from which the six rails were sampled. That is, our scope of inference has changed from the six rails included in the study to the population of rails from which those six rails were drawn.

In addition, we no longer estimate µ_i, the mean response for rail i, which is not of interest. Instead we estimate the population mean µ and the variance from rail to rail in the population, σ_b². Moreover:

- we can estimate the rail-to-rail variability σ_b²; and
- the number of parameters no longer increases with the number of rails tested in the experiment. The parameters in the fixed-effects model were the grand mean µ, the rail-specific effects α_1, ..., α_a, and the error variance σ². In the random effects model, the only parameters are µ, σ², and σ_b².

σ_b² quantifies heterogeneity from rail to rail, which is one consequence of having observations that are grouped or clustered by rail; but what about within-rail correlation?

Unlike a purely fixed-effects model, the one-way random effects model does not assume that all of the responses are independent. Instead, it implies that observations that share the same random effect are correlated. E.g., for two observations from the i-th rail, y_i1 and y_i3 say, the model implies

   y_i1 = µ + b_i + e_i1   and   y_i3 = µ + b_i + e_i3.

That is, y_i1 and y_i3 share the random effect b_i, and are therefore correlated. Why? Because one can easily show that

   var(y_ij) = σ_b² + σ²,
   cov(y_ij, y_ij') = σ_b², for j ≠ j',
   corr(y_ij, y_ij') = ρ ≡ σ_b² / (σ_b² + σ²), for j ≠ j', and
   cov(y_ij, y_i'j') = 0, for i ≠ i'.

That is, if we stack up all of the observations from the i-th rail (the observations that share the random effect b_i) as y_i = (y_i1, ..., y_in)^T, then

   var(y_i) = (σ_b² + σ²) ×
       [ 1  ρ  ...  ρ ]
       [ ρ  1  ...  ρ ]
       [ ...          ]
       [ ρ  ρ  ...  1 ]    (***)

and groups of observations from different rails (those that do not share random effects) are independent.

The variance-covariance structure given by (***) has a special name: compound symmetry. This means that

- observations from the same rail all have constant variance equal to σ² + σ_b²; and
- all pairs of observations from the same rail have constant correlation equal to

   ρ = σ_b² / (σ² + σ_b²).

ρ, the correlation between any two observations from the same rail, is called the intraclass correlation coefficient. In addition, because the total variance of any observation, var(y_ij) = σ_b² + σ², is the sum of two terms, σ_b² and σ² are called variance components.
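
A sketch of extracting the variance components from fit2 and computing the intraclass correlation; this assumes the usual layout of nlme's VarCorr() table, which is returned as character, hence the as.numeric():

   vc  <- VarCorr(fit2)
   s2b <- as.numeric(vc["(Intercept)", "Variance"])   ## sigma_b^2
   s2  <- as.numeric(vc["Residual",    "Variance"])   ## sigma^2
   s2b / (s2b + s2)                                   ## intraclass correlation rho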

Both fixed-effects and random-effects versions of the one-way model are fit to these data in intro.r.

- For (*), the fixed-effects version of the one-way model, we obtain ˆµ = 66.5 with a standard error of about 0.95.
- For (**), the random-effects version of the one-way model, we obtain ˆµ = 66.5 with a standard error of about 10.2.

The standard error is larger in the random-effects model, because this model has a larger scope of inference. That is, the two models are estimating different µ's: the fixed effects model is estimating the grand mean for the six rails in the study; the random effects model is estimating the grand mean for all possible rails. It makes sense that we would be much less certain of (i.e., there would be more error in) our estimate of the latter quantity, especially if there is a lot of rail-to-rail variability.

The usual method of moments/ANOVA/REML estimates of the variance components in model (**) are ˆσ = 4.022 and ˆσ_b = 24.81 (on the standard-deviation scale), so here there is much more between-rail variability than within-rail variability.

The randomized complete block model

Stool Example: In the last example, the data were grouped by rail and we were interested in only one treatment (there was only one experimental condition under which the travel time along the rail was measured). Often, several treatments are of interest and the data are grouped. In a randomized complete block design (RCBD), each of a treatments is observed in each of n blocks.

As an example, consider the data displayed below. These data come from an experiment to compare the ergonomics of four different stool designs. n = 9 subjects were asked to sit in each of a = 4 stools. The response measured was the amount of effort required to stand up.

(Data table: effort required to arise (Borg scale) for each of stool types T1-T4, tested by each of the 9 subjects.)

Here, subjects form the blocks, and we have a complete set of treatments observed in each block (each subject tests each stool). Thus we have an RCBD.

Let y_ij be the response for the j-th stool type tested by the i-th subject. The classical fixed effects model for the RCBD assumes

   y_ij = µ + α_j + β_i + e_ij
        = µ_j + β_i + e_ij,   i = 1, ..., n, j = 1, ..., a,

where e_11, ..., e_na ~ iid N(0, σ²). Here, µ_j is the mean response for the j-th stool type, which can be broken apart into a grand mean µ and a stool-type effect α_j. β_i is a fixed subject effect.

Again, the scope of inference for this model is the set of 9 subjects used in this experiment. If we wish to generalize to the population from which the 9 subjects were drawn, we should consider the subject effects to be random.

The RCBD model with random block effects is

   y_ij = µ_j + b_i + e_ij,

where b_1, ..., b_n ~ iid N(0, σ_b²), independent of e_11, ..., e_na ~ iid N(0, σ²). Since the µ_j's are fixed and the b_i's are random, this is a mixed model.

The variance-covariance structure here is quite similar to that in the one-way random effects model. Again, the model implies that any two observations that share a random effect (i.e., any two observations from the same block) are correlated. In fact, the same compound symmetry structure holds. In particular, if y_i = (y_i1, ..., y_ia)^T is the vector of observations from the i-th block, then as in the last example,

   var(y_i) = (σ_b² + σ²) ×
       [ 1  ρ  ...  ρ ]
       [ ρ  1  ...  ρ ]
       [ ...          ]
       [ ρ  ρ  ...  1 ]

- All pairs of observations from the same block have correlation ρ = σ_b² / (σ² + σ_b²);
- all pairs of observations from different blocks are independent; and
- all observations have variance σ² + σ_b² (two components: the within-block and between-block variances).

The RCBD model is fit to these data in intro.r, first treating block effects as fixed, then as random; see the sketch below.
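
A sketch of the two fits, assuming the ergoStool data from nlme (columns effort, Type, Subject) correspond to this experiment:

   library(nlme)
   fitF <- lm(effort ~ Type + Subject, data = ergoStool)     ## fixed block effects
   fitR <- lme(effort ~ Type, random = ~ 1 | Subject,
               data = ergoStool)                             ## random block effects
   summary(fitR)   ## stool-type effects plus sigma_b and sigma estimates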

It is often stated that whether block effects are assumed random or fixed does not affect the analysis of the RCBD. This is not completely true.

It is true that whether or not blocks are treated as random does not affect the ANOVA F test for treatments. Either way, we test for equal treatment means with the test statistic

   F = MS_Trt / MS_E.

However, there are important differences in the analysis of the two designs, and these differences affect inferences on treatment means. For instance, the variance of a treatment mean is

   var(ȳ_j) = σ²/n                 for fixed block effects,
   var(ȳ_j) = (σ_b² + σ²)/n        for random block effects.

Substituting the usual method-of-moments/ANOVA estimators for σ² and σ_b² leads to a standard error of

   s.e.(ȳ_j) = sqrt( MS_E / n )                             for fixed block effects,
   s.e.(ȳ_j) = sqrt( (MS_Blocks + (a − 1) MS_E) / (na) )    for random block effects.

Again, the standard error of a treatment mean is larger in the random effects model, because the scope of inference is broader. For these data, s.e.(ˆµ_j) = .367 in the fixed block effects model, and s.e.(ˆµ_j) = .576 in the random block effects model.

For these data, the estimated between- and within-subject variance components (as standard deviations) are ˆσ_b = 1.332 and ˆσ = 1.100. This means that the estimated correlation between any pair of observations on the same subject is

   ˆρ = ˆσ_b² / (ˆσ² + ˆσ_b²) = 1.332² / (1.100² + 1.332²) ≈ 0.59.
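
A sketch reproducing the two standard errors from the ANOVA mean squares (a = number of treatments, n = number of blocks; row/column names follow the ergoStool fit above):

   av  <- anova(lm(effort ~ Type + Subject, data = ergoStool))
   MSE <- av["Residuals", "Mean Sq"]
   MSB <- av["Subject",   "Mean Sq"]
   n <- 9; a <- 4
   sqrt(MSE / n)                          ## s.e. of a treatment mean, fixed blocks
   sqrt((MSB + (a - 1) * MSE) / (n * a))  ## s.e. of a treatment mean, random blocks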

A Split-Plot Model

Grass Example: A split-plot experimental design is one in which two sizes of experimental unit are used. The larger experimental unit, known as the whole plot, is randomized to some experimental design (an RCBD, say). The whole plot is then subdivided into smaller units, known as split plots, which are assigned to a second experimental design within each whole plot.

Example: A study of the effects of three bacterial inoculation treatments and two cultivar types on grass yield was conducted as follows. Four fields, or blocks, were divided in half, and the two cultivars (A_1 and A_2) were assigned at random to be grown in the two halves of each field. Then each half-field (the whole plot) was divided into three subunits, or split plots, and the three inoculation treatments (B_1, B_2, and B_3) were randomly assigned to the three split plots in each whole plot.

(Diagram and data table: the four blocks, each divided into two whole plots (cultivars A_1, A_2), and each whole plot divided into three split plots receiving inoculation treatments B_1-B_3, with the observed yields.)

Here it was easier to randomize the planting of the two cultivars to a few large units (the whole plots) than to many small units (the split plots). Convenience is the motivation for this design.

Here the 8 columns within the four rectangles are the whole plots, and cultivar is the whole plot factor. The 24 smaller squares within the columns are the split plots, and inoculation type is the split plot factor.

The Data: y_ijk, i = 1, ..., a (levels of the whole-plot factor), j = 1, ..., n (blocks), k = 1, ..., b (levels of the split-plot factor). That is, y_ijk is the response for the i-th cultivar in the j-th block treated with the k-th inoculation type.

Model:

   y_ijk = µ + α_i + τ_j + b_ij + β_k + (αβ)_ik + e_ijk,

where

   α_i = effect of the i-th cultivar,
   τ_j = effect of the j-th block (treated here as fixed),
   β_k = effect of the k-th inoculation treatment,
   (αβ)_ik = cultivar-by-inoculation interaction effect.

In addition, the b_ij's ~ iid N(0, σ_b²), independent of the e_ijk's ~ iid N(0, σ²).

The b_ij's are sometimes described as whole-plot error terms. In a sense, that is what a random effect is: an additional error term in the model.

The b_ij's are random effects for the whole plots (one for each half-field). They account for:

- heterogeneity from one whole plot to the next (quantified by σ_b²); and
- correlation among the three split plots within a given whole plot.

Again, the variance-covariance structure in this model is compound symmetric. The model implies

   var(y_ij) = (σ_b² + σ²) ×
       [ 1  ρ  ...  ρ ]
       [ ρ  1  ...  ρ ]
       [ ...          ]
       [ ρ  ρ  ...  1 ]

where y_ij = (y_ij1, y_ij2, ..., y_ijb)^T (the vector of all observations on the (i, j)-th whole plot). This means:

- all pairs of observations from the same whole plot have correlation ρ = σ_b² / (σ² + σ_b²);
- all pairs of observations from different whole plots are independent; and
- all observations have variance σ² + σ_b² (two components: the within-whole-plot and between-whole-plot variances).

The split-plot model is fit to these data with the lme() function in S-PLUS/R in intro.r; a sketch follows below. Note that here we've treated blocks as fixed. Later, we'll return to this example and model block effects as random.
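
A minimal sketch of that fit, assuming a data frame grass with (hypothetical) columns yield, cult, inoc, and block; the grouping variable identifies the whole plot (a block-by-cultivar half-field):

   grass$wplot <- with(grass, interaction(block, cult))   ## whole-plot identifier
   grass.lme <- lme(yield ~ block + cult * inoc,
                    random = ~ 1 | wplot, data = grass)
   anova(grass.lme)   ## cultivar is tested against whole-plot error (3 d.f.)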

The Experimental Unit, Pseudoreplication, D.F., & Balance

The split-plot design involves two different experimental units: the whole plot and the split plot. Whole plots are randomly assigned to the whole plot factor; e.g., half-fields were randomized to the two cultivars.

There are many fewer whole-plot experimental units than there are observations (which equals the number of split plots in the experiment): only 8 half-fields, but 24 observations, in the grass experiment. With respect to cultivar, then, 8 experimental units are randomized, so degrees of freedom for testing cultivar are based on a sample size of 8. At the whole-plot level, we have an RCBD with two treatments (cultivars) and four blocks, so the error d.f. for testing cultivars is the error d.f. in an RCBD of this size, namely (2 − 1)(4 − 1) = 3.

With respect to cultivar, the measurements on the three split plots in each whole plot are pseudoreplicates (or subsamples). That is, they are not independently randomized to cultivars and thus provide no additional d.f. (information) regarding cultivar effects.

In some sense, modeling whole plots with random effects:

- identifies the appropriate error term for the whole plot factor;
- identifies the appropriate d.f. (amount of relevant information in the data/design) for testing the whole plot factor (cultivar); and
- identifies which units are true experimental units and which are pseudoreplicates with respect to each experimental factor.

If a purely fixed-effects model is used in the split-plot design,* then the usual MSE and DFE based upon the e_ijk error term will lead to incorrect inferences on the whole plot factor. See model grass.lm1 in intro.r. Correct inferences on whole plot factors can sometimes be obtained from a fixed-effects analysis, but you have to really know what you're doing, especially in complex situations like split-split-plot models, etc.; and the design has to be balanced (rare!).

So, the use of random effects is also motivated by:

- the use of multiple sizes of experimental units with distinct randomizations;
- the presence of pseudoreplication; and
- imbalance.

Mixed-effects models handle these complications much more automatically than fixed-effects models and, consequently, avoid the incorrect inferences to which fixed-effects models are prone in these situations.

* This would be done by modeling variability among whole plots with a fixed cult*block interaction effect.

A More Complex Example

PMRC Site Preparation Study: A study of various site preparation and intensive management regimes on the growth of slash pine, involving plots nested within 16 sites in the lower coastal plain of GA and FL. The data consist of repeated plot-level measurements of hd = dominant height (m), ba = basal area (m²/ha), tph (100s of trees per ha), derived volume (total volume outside bark in m³/ha), and other variables, at ages 2, 5, 8, 11, 14, 17, and 20 years.

At each site, plots were randomized to eleven treatments consisting of a subset of the 2^5 = 32 combinations of five two-level (absent/present) treatment factors:

   A = Chop, site prep with a single pass of a rolling drum chopper;
   B = Fert, fertilizer following the 1st, 12th, and 17th growing seasons;
   C = Burn, broadcast burn of the site prior to planting;
   D = Bed, a double-pass bedding of the site;
   E = Herb, vegetation control with chemical herbicide.

Here is a plot of the data. Each panel represents a site, and each panel contains the tph-over-time profiles for each plot on that site.

(Figure: trees/hectare (100s of trees) versus age (yrs), one profile per plot, one panel per site.)

These data are grouped, or clustered, by site. We would expect heterogeneity from site to site, and we would expect correlation among plots within the same site.

The data are also grouped by plot, since we have repeated measures through time on each plot. Again, we would expect plots to be heterogeneous, and we expect stronger correlation among observations from the same plot than among observations from different plots.

In addition, we'd like to make inferences about the population of plantation sites for which these sites are representative, not just these sites alone. We would also like to be able to generalize to the population from which these plots are drawn. Hence, it makes sense to model sites with random site effects and plots with random plot effects. Plots are nested within sites, so this is an example of a multilevel mixed model.

In addition, plots are randomized to treatments, and then repeated measures through time are taken on each plot. With respect to treatments, plots are the experimental unit, but measurement occurs at a finer scale: times within plots. These time-specific measurements are a bit like measurements on split plots. However, in a split-plot example, observations from the same whole plot are correlated due to shared characteristics of that whole plot, and these are captured by whole-plot random effects. In a repeated measures context, observations through time from the same unit are correlated due to shared characteristics of that unit, and are also subject to serial correlation (observations taken close together in time are more similar than observations taken far apart in time).

Thus, in a repeated measures context, we may want both random effects and serial correlation built into our model. We'll soon see how multilevel random effects, serial correlation, and other features can be handled in the general form of the LMM; a sketch of such a fit appears below.
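
A minimal sketch of such a model in lme(), with entirely hypothetical names (a data frame pmrc with columns tph, trt, age, site, plot): nested random intercepts for sites and for plots within sites, plus serial correlation among the repeated measures on a plot. corCAR1() (continuous-time AR(1)) is used here because the measurement ages are unequally spaced.

   fit <- lme(tph ~ trt * age,
              random      = ~ 1 | site/plot,                    ## multilevel random effects
              correlation = corCAR1(form = ~ age | site/plot),  ## serial correlation in time
              data        = pmrc)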

Fixed vs. random effects: The effects in the model account for variability in the response across levels of treatment and design factors. The decision as to whether fixed effects or random effects should be used depends upon what the appropriate scope of generalization is.

If it is appropriate to think of the levels of a factor as randomly drawn from, or otherwise representative of, a population to which we'd like to generalize, then random effects are suitable. Design or grouping factors are usually more appropriately modeled with random effects: e.g., blocks (sections of land) in an agricultural experiment, days when an experiment is conducted over several days, lab technicians when measurements are taken by several technicians, subjects in a repeated measures design, or locations/sites along a river when we desire to generalize to the entire river.

If, however, the specific levels of the factor are of interest in and of themselves, then fixed effects are more appropriate. Treatment factors are usually more appropriately modeled with fixed effects: e.g., in experiments to compare drugs, amounts of fertilizer, hybrids of corn, teaching techniques, or measurement devices, these factors are most appropriately modeled with fixed effects.

A good litmus test for whether the level of some factor should be treated as fixed is to ask whether it would be of broad interest to report a mean for that level. For example, if I'm conducting an experiment in which each of four different classes of third-grade students is taught with each of three methods of instruction (e.g., in a crossover design), then it will be of broad interest to report the mean response (level of learning, say) for a particular method of instruction, but not for a particular classroom of third graders. Here, fixed effects are appropriate for instruction method, random effects for class.

Preliminaries/Background

In order to really understand the LMM, we need to study it in its vector/matrix form. So, we need to discuss/review random vectors and the multivariate normal distribution. We also need to review the classical linear model (CLM) before generalizing to the LMM. Estimation in the CLM is based on least squares, but in the LMM, maximum likelihood (ML) estimation is used; therefore, we need to cover/review the basic ideas of ML estimation.

Random Vectors:

Random Vector: A vector whose elements are random variables, e.g.,

   y = (y_1, y_2, ..., y_n)^T,

where y_1, y_2, ..., y_n are each random variables. Random vectors we will be concerned with:

- a vector containing the response variable measured on n units in the sample: y = (y_1, ..., y_n)^T;
- a vector of error terms in a model for y: e = (e_1, ..., e_n)^T;
- a vector of random effects: b = (b_1, b_2, ..., b_q)^T.

Expected Value: The expected value (population mean) of a random vector is the vector of expected values, often denoted µ. For y (n × 1),

   E(y) = (E(y_1), E(y_2), ..., E(y_n))^T = (µ_1, µ_2, ..., µ_n)^T = µ.

(Population) Variance-Covariance Matrix: For a random vector y = (y_1, y_2, ..., y_n)^T (n × 1) with mean µ = (µ_1, µ_2, ..., µ_n)^T, the matrix

   E[(y − µ)(y − µ)^T] =
       [ var(y_1)       cov(y_1, y_2)  ...  cov(y_1, y_n) ]
       [ cov(y_2, y_1)  var(y_2)       ...  cov(y_2, y_n) ]
       [ ...                                              ]
       [ cov(y_n, y_1)  cov(y_n, y_2)  ...  var(y_n)      ]

with (i, j)-th element σ_ij (so σ_ii = var(y_i) and σ_ij = cov(y_i, y_j)) is called the variance-covariance matrix of y and is denoted var(y).

(Population) Correlation Matrix: For a random vector y (n × 1), the population correlation matrix is the matrix of correlations among the elements of y:

   corr(y) =
       [ 1               corr(y_1, y_2)  ...  corr(y_1, y_n) ]
       [ corr(y_2, y_1)  1               ...  corr(y_2, y_n) ]
       [ ...                                                 ]
       [ corr(y_n, y_1)  corr(y_n, y_2)  ...  1              ]

Recall: for random variables y_i and y_j,

   corr(y_i, y_j) = cov(y_i, y_j) / sqrt( var(y_i) var(y_j) )

measures the amount of linear association between y_i and y_j. Correlation matrices are symmetric.
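
A quick sketch of the sample versions of these matrices in R (simulated data; rows are observations of a 3-dimensional random vector):

   Y <- cbind(y1 = rnorm(100), y2 = rnorm(100), y3 = rnorm(100))
   S <- cov(Y)     ## sample variance-covariance matrix (variances on the diagonal)
   cov2cor(S)      ## corresponding correlation matrix (1's on the diagonal)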

Properties of expected value and variance: Let x, y be random vectors of the same dimension, and let C and c be a matrix and a vector, respectively, of constants. Then

1. E(y + c) = E(y) + c.
2. E(x + y) = E(x) + E(y).
3. E(Cy) = C E(y).
4. var(y + c) = var(y).
5. var(y + x) = var(y) + var(x) + cov(y, x) + cov(x, y), where the covariance terms are 0 if x and y are independent.
6. var(Cy) = C var(y) C^T.

Multivariate Normal Distribution: The multivariate normal distribution is to a random vector as the univariate (usual) normal distribution is to a random variable. It is the version of the normal distribution appropriate to the joint distribution of several random variables (collected and stacked as a vector) rather than a single random variable.

Recall that we write y ~ N(µ, σ²) to signify that the univariate random variable y has the normal distribution with mean µ and variance σ². This means that y has probability density function (p.d.f.)

   f_Y(y) = (1 / sqrt(2πσ²)) exp{ −(y − µ)² / (2σ²) }.

Meaning: for two values y_1 < y_2, the area under the graph of the p.d.f. between y_1 and y_2 gives Pr(y_1 < Y < y_2).

We write y ~ N_n(µ, Σ) to denote that the random vector y follows the n-dimensional multivariate normal distribution with mean µ and variance-covariance matrix Σ. E.g., for a bivariate random vector y = (y_1, y_2)^T ~ N_2(µ, Σ), the p.d.f. of y maps out a bell over the (y_1, y_2) plane, centered at µ, with spread described by Σ.

In the multivariate case, for y ~ N_n(µ, Σ), the p.d.f. of y is

   f(y) = (2π)^{−n/2} |Σ|^{−1/2} exp{ −(1/2) (y − µ)^T Σ^{−1} (y − µ) },

where |Σ| denotes the determinant of the var-cov matrix Σ.
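
A sketch of evaluating this density numerically; dmvnorm() from the mvtnorm package (an assumed add-on, not part of base R) implements the formula above:

   library(mvtnorm)
   mu    <- c(0, 0)
   Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)        ## bivariate example
   dmvnorm(c(0.3, -0.2), mean = mu, sigma = Sigma)  ## f(y) at y = (0.3, -0.2)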

Review of the Classical (Fixed-Effects) Linear Model

Assume we observe a sample of independent pairs (y_1, x_1), ..., (y_n, x_n), where y_i is a response variable and x_i = (x_i1, ..., x_ip)^T is a p × 1 vector of explanatory variables. The classical linear model can be written

   y_i = β_1 x_i1 + ... + β_p x_ip + e_i = x_i^T β + e_i,   i = 1, ..., n,

where e_1, ..., e_n ~ iid N(0, σ²). Equivalently, we can stack these n equations and write the model as

   y = Xβ + e,

where y = (y_1, ..., y_n)^T, X is the n × p matrix whose i-th row is x_i^T = (x_i1, x_i2, ..., x_ip), β = (β_1, ..., β_p)^T, and e = (e_1, ..., e_n)^T.

Our assumptions on e_1, ..., e_n can be equivalently restated as e ~ N_n(0, σ² I_n). Since y = Xβ + e and e ~ N_n(0, σ² I_n), it follows that y is n-variate normal too:

   y ~ N_n(Xβ, σ² I_n).

The var-cov matrix for y is σ² I_n, the diagonal matrix with σ² down the diagonal and 0 elsewhere: the y_i's are uncorrelated and have constant variance σ². Therefore, in the CLM, y is assumed to have a multivariate normal joint p.d.f.
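
A sketch of the stacked form in R: model.matrix() builds the design matrix X from a model formula (simulated data with hypothetical names x1, x2):

   d <- data.frame(x1 = rnorm(6), x2 = rnorm(6))
   X <- model.matrix(~ x1 + x2, data = d)   ## rows are x_i^T: intercept, x1, x2
   X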

Estimation of β and σ²:

Maximum likelihood estimation: In general, the likelihood function is just the probability density function, but thought of as a function of the parameters rather than of the data. Interpretation: the likelihood function quantifies how likely the data are for a given value of the parameters.

The idea behind maximum likelihood estimation is to find the values of β and σ² under which the data are most likely. That is, we find the β and σ² that maximize the likelihood function, or equivalently the loglikelihood function, for the value of y actually observed. These values are the maximum likelihood estimates (MLEs) of the parameters.

For the CLM, the loglikelihood is

   l(β, σ²; y) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) (y − Xβ)^T (y − Xβ),

where the first term is a constant and the remaining two terms are the kernel of l.

Notice that maximizing l(β, σ²; y) with respect to β is equivalent to maximizing the third term,

   −(1/(2σ²)) (y − Xβ)^T (y − Xβ),

which is equivalent to minimizing

   (y − Xβ)^T (y − Xβ) = Σ_{i=1}^n (y_i − x_i^T β)²   (the least-squares criterion).

(y − Xβ)^T (y − Xβ) is the squared distance between y and its mean Xβ. The parameter estimate ˆβ minimizes this distance; that is, ˆβ gives the estimated mean Xˆβ that is closest to y. So, the estimators of β given by ML and (ordinary) least squares (OLS) coincide: for β in the CLM, ML = OLS, and if X is of full rank (the model is not overparameterized), then

   ˆβ = (X^T X)^{−1} X^T y.
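
A sketch verifying the closed form against lm() on simulated data:

   set.seed(1)
   n <- 50; x <- rnorm(n); y <- 2 + 3 * x + rnorm(n)
   X <- cbind(1, x)                           ## design matrix with intercept column
   beta.hat <- solve(t(X) %*% X, t(X) %*% y)  ## (X'X)^{-1} X'y
   cbind(beta.hat, coef(lm(y ~ x)))           ## the two columns agree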

Estimation of σ²: Setting the partial derivative of l with respect to σ² to 0 and solving leads to the MLE of σ²:

   ˆσ²_ML = (1/n) (y − Xˆβ)^T (y − Xˆβ) = (1/n) Σ_i (y_i − x_i^T ˆβ)² = SS_E / n.

Problem: This estimator is biased for σ². The bias can be easily fixed, which leads to the generally preferred estimator

   ˆσ² = (1/(n − p)) (y − Xˆβ)^T (y − Xˆβ) = SS_E / (n − p) = SS_E / df_E = MS_E.

Note that the MLE of σ² is biased, and this is due to using the wrong value for df_E (the divisor for SS_E). df_E = n − p is the information in the data left for estimating σ² after having estimated β_1, ..., β_p. Because ˆσ²_ML uses n rather than n − p, it is often said that the MLE of σ² fails to account for the d.f. used (or lost) in estimating β.

MS_E, the preferred estimator of σ², is an example of what is known as a restricted ML (REML) estimator. As we'll see, REML is the preferred method of estimating variance components in LMMs. The method simply generalizes the use of ˆσ² = MS_E rather than ˆσ²_ML in the CLM.
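
A sketch contrasting the two estimates, continuing the simulated fit above (p = 2: intercept and slope):

   fit <- lm(y ~ x)
   SSE <- sum(resid(fit)^2)
   SSE / n         ## ML estimate of sigma^2 (biased downward)
   SSE / (n - 2)   ## MSE, the REML-type estimate (equals summary(fit)$sigma^2)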

Example — Volume of Cherry Trees: For 31 black cherry trees, the following measurements were obtained:

   V = volume of usable wood (cubic feet),
   H = height of tree (feet),
   D = diameter at breast height (inches).

Goal: Predict usable wood volume from diameter and height. See the S-PLUS script, backgrnd.r.

Here, we first consider a simple multiple regression model, cherry.lm1, for these data:

   V_i = β_0 + β_1 H_i + β_2 D_i + e_i,   i = 1, ..., 31.

Initial plots of V against both explanatory variables, D and H, look linear, so this model may be reasonable. cherry.lm1 gives a high R² of .941, and most residual plots look pretty good. However, the plot of residuals vs. diameter looks U-shaped, so we consider some other models for these data.
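
These data ship with R as the built-in trees data frame (Girth = D, Height = H, Volume = V), so a sketch of cherry.lm1 is:

   cherry.lm1 <- lm(Volume ~ Height + Girth, data = trees)
   summary(cherry.lm1)$r.squared          ## high R^2, about .94
   plot(trees$Girth, resid(cherry.lm1))   ## U-shaped pattern in diameter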

Inference in the CLM: Under the basic assumptions of the CLM (independence, homoscedasticity, normality), ˆβ, the ML/OLS estimator of β, has distribution

   ˆβ ~ N(β, σ² (X^T X)^{−1}).

That is,

- ˆβ is unbiased for β;
- ˆβ has var-cov matrix σ² (X^T X)^{−1}, so ˆβ_j has standard error s.e.(ˆβ_j) = sqrt( MS_E [(X^T X)^{−1}]_jj );
- ˆβ is normally distributed.

It can also be shown that ˆβ is an optimal estimator (BLUE, UMVUE). These properties lead to a number of normal-theory methods of inference:

1. t tests and confidence intervals for an individual regression coefficient β_j, based on

   (ˆβ_j − β_j) / s.e.(ˆβ_j) ~ t(n − p),

the t distribution with n − p d.f. A 100(1 − α)% CI for β_j is given by ˆβ_j ± t_{1−α/2}(n − p) s.e.(ˆβ_j). For an α-level test of H_0: β_j = β_j0 versus H_1: β_j ≠ β_j0, we use the rule: reject H_0 if

   |ˆβ_j − β_j0| / s.e.(ˆβ_j) > t_{1−α/2}(n − p).

Tests of H_0: β_j = 0 for each β_j are given by the summary() function in S-PLUS/R.

2. More generally, inference on linear combinations c^T β of the β_j's (e.g., contrasts) is based on the t distribution:

   (c^T ˆβ − c^T β) / sqrt( MS_E c^T (X^T X)^{−1} c ) ~ t(n − p).

E.g., a 100(1 − α)% C.I. for the expected response at a given value x_0 of the vector of explanatory variables is given by

   x_0^T ˆβ ± t_{1−α/2}(n − p) sqrt( MS_E x_0^T (X^T X)^{−1} x_0 ).

A 100(1 − α)% prediction interval for the response on a new subject with vector of explanatory variables x_0 is given by

   x_0^T ˆβ ± t_{1−α/2}(n − p) sqrt( MS_E (1 + x_0^T (X^T X)^{−1} x_0) ).

Confidence intervals for fitted and predicted values are given by the predict() function in S-PLUS/R; see the sketch below.

3. Inference on the entire vector β is based on the fact that

   (ˆβ − β)^T (X^T X) (ˆβ − β) / (p MS_E) ~ F(p, n − p),

the F distribution with p and n − p d.f. E.g., we can test any hypothesis of the form H_0: Aβ = c, where A is a k × p matrix of constants (e.g., contrast coefficients), with an F test. The appropriate test has rejection rule: reject if

   F = (Aˆβ − c)^T {A (X^T X)^{−1} A^T}^{−1} (Aˆβ − c) / (k MS_E) > F_{1−α}(k, n − p).

4. The fit of nested models can be compared via an F test. This is accomplished with the anova() function in S-PLUS/R.
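
A sketch of these intervals for the cherry fit above (the x_0 values are hypothetical):

   confint(cherry.lm1, level = 0.95)                 ## t-based CIs for each beta_j
   x0 <- data.frame(Height = 75, Girth = 12)
   predict(cherry.lm1, x0, interval = "confidence")  ## CI for the mean response at x0
   predict(cherry.lm1, x0, interval = "prediction")  ## PI for a new tree at x0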

Clustered Data: Clustered data are data collected on subjects/animals/trees/units which are heterogeneous, falling into natural groupings, or clusters, based upon characteristics of the units themselves or of the experimental design, but not on the basis of treatments or interventions.

The most common example of clustered data is repeated measures data. By repeated measures, people typically mean data consisting of multiple measurements of essentially the same variable on a given subject or unit of observation. Repeated measurements are typically taken through time, but can be at different spatial locations, or can arise from multiple measuring devices, observers, etc. When repeated measures are taken through time, the terms longitudinal data and panel data are roughly synonymous. We'll use the more generic term clustered data to refer to any of these situations.

Clustered data also include data from split-plot designs, crossover designs, hierarchical sampling, and designs with pseudoreplication/subsampling.

Advantages of longitudinal/clustered data:

- Allow study of individual patterns of change, i.e., growth.
- Economize on experimental units.
- Heterogeneous experimental units are often better representative of the population to which we'd like to generalize.
- Each subject/unit can serve as his or her own control. E.g., in a split-plot experiment or crossover design, comparisons between treatments can be done within the same subject; in a longitudinal study, comparisons of time effects (growth) can be made within a subject rather than between subjects. Between-unit heterogeneity can be eliminated when assessing treatment or time effects, leading to more power/efficiency (think paired t-test versus two-sample t-test).

Disadvantages:

- Correlation and multiple sources of heterogeneity in the data make statistical methods harder to understand and implement. LMMs are flexible enough to deal with these features.
- Imbalance and incompleteness in the data are more common. This can be hard for some statistical methods, especially if missing data are not missing at random. LMMs handle unbalanced data relatively easily and well.

Linear Mixed Models (LMMs)

We will present the LMM for clustered data. It can be presented and used in a somewhat more general context, but most applications are to clustered data, and this is a simpler case to discuss/understand.

Examples revisited:

Example 1, One-way random effects model — Rails: Recall that we had three observations on each of 6 rails. Model:

   y_ij = µ + b_i + e_ij,   i = 1, ..., 6, j = 1, ..., 3,

where y_ij is the response from the j-th measurement on the i-th rail, µ is the grand mean response across the population of all rails, b_i is the random effect for the i-th rail, and e_ij is an error term.

The data are clustered by rail. The model for all data from the i-th rail can be written in vector/matrix form:

   [ y_i1 ]   [ 1 ]       [ 1 ]        [ e_i1 ]
   [ y_i2 ] = [ 1 ] µ  +  [ 1 ] b_i +  [ e_i2 ]
   [ y_i3 ]   [ 1 ]       [ 1 ]        [ e_i3 ]

or

   y_i = X_i β + Z_i b_i + e_i.

Example 2, RCBD model — Stools: Recall that we had n = 9 subjects, each of whom tested all a = 4 stool designs under study. Model:

   y_ij = µ_j + b_i + e_ij,   i = 1, ..., n, j = 1, ..., a,

where y_ij is the response for the j-th stool tested by the i-th subject, µ_j is the mean response for stool type j across the population of all subjects, b_i is the random effect for the i-th subject, and e_ij is an error term.

The data are clustered by subject. The model for all data from the i-th subject can be written in vector/matrix form:

   [ y_i1 ]   [ 1 0 0 0 ] [ µ_1 ]   [ 1 ]       [ e_i1 ]
   [ y_i2 ] = [ 0 1 0 0 ] [ µ_2 ] + [ 1 ] b_i + [ e_i2 ]
   [ y_i3 ]   [ 0 0 1 0 ] [ µ_3 ]   [ 1 ]       [ e_i3 ]
   [ y_i4 ]   [ 0 0 0 1 ] [ µ_4 ]   [ 1 ]       [ e_i4 ]

or

   y_i = X_i β + Z_i b_i + e_i.

Example 3, Split-plot model — Grass: Recall that we had 8 whole plots (half-fields) randomized to an RCBD, each then split into 3 split plots, which were randomized to the 3 inoculation types. Model:

   y_ijk = µ + α_i + β_k + (αβ)_ik + τ_j + b_ij + e_ijk,

where y_ijk is the response from the split plot assigned to the k-th inoculation type within the (i, j)-th whole plot (which is assigned to the i-th cultivar in the j-th block). In addition,

   µ = grand mean,
   α_i = i-th cultivar effect (fixed),
   β_k = k-th inoculation type effect (fixed),
   (αβ)_ik = cultivar × inoculation interaction effect (fixed),
   τ_j = j-th block effect (treated as fixed, but could be random),
   b_ij = effect for the (i, j)-th whole plot (random),
   e_ijk = error term (random).

The data are clustered by whole plot. The model for all data from the (i, j)-th whole plot can be written in vector/matrix form as

   y_ij = X_ij β + Z_ij b_ij + e_ij,

where β = (µ, α_i, β_1, β_2, β_3, (αβ)_i1, (αβ)_i2, (αβ)_i3, τ_j)^T collects the fixed effects, X_ij is the corresponding 3-row design matrix of 0's and 1's, and Z_ij = (1, 1, 1)^T.

The Linear Mixed Model for Clustered Data: Notice that all 3 of the previous examples have the same form. They are all examples of LMMs with a single (univariate) random effect: a random cluster-specific intercept.

Suppose we have data on n clusters, where y_i = (y_i1, ..., y_i,ti)^T are the t_i observations available on the i-th cluster, i = 1, ..., n. Then the LMM with a random cluster-specific intercept is given (in general) by

   y_i = X_i β + Z_i b_i + e_i,   i = 1, ..., n,

where X_i is a t_i × p design matrix for the fixed effects β, Z_i is a t_i × 1 vector of ones, and e_i is a vector of error terms. If you're not comfortable with the vector/matrix representation, another way to write it is

   y_ij = β_1 x_1ij + β_2 x_2ij + ... + β_p x_pij + z_ij b_i + e_ij,

where z_ij = 1; the β terms are the fixed part, and z_ij b_i + e_ij is the random part.

Assumptions:

- Cluster effects: the b_i's are independent, normal, with variance (variance component) σ_b².
- Error terms: the e_ij's are independent, normal, with variance (variance component) σ². We will relax both the assumption of independence and that of constant variance (homoscedasticity) later.
- The b_i's and e_ij's are assumed independent of each other.

Often, it makes sense to have more than one random effect in the model. To motivate this, let's consider another example.

Example — Microfibril Angle in Loblolly Pine: Whole-disk cross-sectional microfibril angle (MFA) was measured at 1.4, 4.6, 7.6, 10.7, and 13.7 meters up the stem of 59 trees, sampled from four physiographic regions. The regions (no. of trees) were Atlantic Coastal Plain (24), Piedmont (17), Gulf Coastal Plain (9), and Hilly Coastal Plain (9).

(Figure: whole-disk cross-sectional microfibril angle (deg) versus height on stem (m), with separate panels for the Atlantic, Gulf, Hilly, and Piedmont regions.)

Here we have 4 or 5 repeated measures on each tree — repeated not through time, but through space, up the stem of the tree. Any reasonable model would account for:

- heterogeneity between individual trees;
- correlation among observations on the same tree; and
- dependence of MFA on the height at which it is measured.

From the plots it is clear that MFA decreases with height. For simplicity, suppose it decreases linearly with height (it doesn't, but let's keep things easy). Let y_ijk be the MFA on the j-th tree in the i-th region, measured at the k-th height. Then a reasonable model might be

   y_ijk = µ_i + β height_ijk + b_ij + e_ijk,

where µ_i + β height_ijk is the fixed part, and

   µ_i = mean response for the i-th region,
   β = slope for the linear effect of height on MFA,
   b_ij = random effect for the (i, j)-th tree,
   e_ijk = error term for the height-specific measurements.

The fixed part of the model says that MFA decreases linearly in height, with an intercept that depends on region; i.e., mean MFA is different from one region to the next. The random effects (the b_ij's) say that the intercept varies from tree to tree within region.

Rather than just random tree-specific intercepts, suppose we believe that the slope (the linear effect of height on MFA) also varies from tree to tree. This leads to a random intercept and slope model:

   y_ijk = (µ_i + b_1ij) + (β + b_2ij) height_ijk + e_ijk
         = µ_i + β height_ijk + b_1ij + b_2ij height_ijk + e_ijk,

where µ_i + b_1ij is the intercept and β + b_2ij is the slope. Now there are two random effects, b_1ij and b_2ij, or a bivariate random effect b_ij = (b_1ij, b_2ij)^T.

There is no reason to expect that an individual tree's effect on the intercept would be independent of that same tree's effect on the slope. So, we would assume b_1ij and b_2ij are correlated (probably negatively).

The model can be written as

   [ y_ij1 ]   [ 1  height_ij1 ]            [ 1  height_ij1 ]             [ e_ij1 ]
   [  ...  ] = [      ...      ] ( µ_i ) +  [      ...      ] ( b_1ij ) + [  ...  ]
   [ y_ij5 ]   [ 1  height_ij5 ] (  β  )    [ 1  height_ij5 ] ( b_2ij )   [ e_ij5 ]

or

   y_ij = X_ij β + Z_ij b_ij + e_ij.

So, the LMM in general may have more than one random effect, which leads us to the general form of the model:

   y_i = X_i β + Z_i b_i + e_i,   i = 1, ..., n,

where

   X_i = design matrix for the fixed effects,
   β = p × 1 vector of fixed effects (parameters),
   Z_i = design matrix for the random effects,
   b_i = q × 1 vector of random effects,
   e_i = vector of error terms.

If you're not comfortable with the vector/matrix representation, another way to write it is

   y_ij = β_1 x_1ij + β_2 x_2ij + ... + β_p x_pij + z_1ij b_1i + ... + z_qij b_qi + e_ij,

where the β terms are the fixed part and the remaining terms are the random part.

Assumptions:

- Cluster effects: the b_i's are normal and independent from cluster to cluster. We allow b_1i, ..., b_qi (random effects from the same cluster, e.g., a random intercept and slope) to be correlated, with var-cov matrix D: b_i ~ iid N_q(0, D).
- Error terms: the e_ij's are independent, normal, with variance (variance component) σ²; that is, e_i ~ iid N_ti(0, σ² I). We will relax both the assumption of independence and that of constant variance (homoscedasticity) later.
- The b_i's and e_ij's are assumed independent of each other.

Example — Microfibril Angle in Loblolly Pine (Continued)

Recall the original random-intercept (only) model:

   y_ijk = µ_i + β height_ijk + b_ij + e_ijk.

This model is fit with the lme() function in LMM.R:

> mfa.lme1 <- lme(mfa ~ regname + diskht - 1, data = mfa, random = ~ 1 | tree)
> summary(mfa.lme1)
Linear mixed-effects model fit by REML
 Data: mfa
       AIC  BIC  logLik
Random effects:
 Formula: ~1 | tree
        (Intercept)  Residual
StdDev:
Fixed effects: mfa ~ regname + diskht - 1
                  Value  Std.Error  DF  t-value  p-value
regnameAtlantic
regnameGulf
regnameHilly
regnamePiedmont
diskht
Correlation:
                 rgnmAt  rgnmGl  rgnmHl  rgnmPd
regnameGulf
regnameHilly
regnamePiedmont
diskht
Standardized Within-Group Residuals:
   Min    Q1    Med    Q3    Max
Number of Observations: 274
Number of Groups: 59

The random intercept and random slope model was

   y_ijk = µ_i + β height_ijk + b_1ij + b_2ij height_ijk + e_ijk.

This model can be fit with lme() too, but an easy way to refit a model with a slight change is via update():

> mfa.lme2 <- update(mfa.lme1, random = ~ diskht | tree)
> summary(mfa.lme2)
Linear mixed-effects model fit by REML
 Data: mfa
       AIC  BIC  logLik
Random effects:
 Formula: ~diskht | tree
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev  Corr
(Intercept)         (Intr)
diskht
Residual
Fixed effects: mfa ~ regname + diskht - 1
                  Value  Std.Error  DF  t-value  p-value
regnameAtlantic
regnameGulf
regnameHilly
regnamePiedmont
diskht
Correlation:
                 rgnmAt  rgnmGl  rgnmHl  rgnmPd
regnameGulf
regnameHilly
regnamePiedmont
diskht
Standardized Within-Group Residuals:
   Min    Q1    Med    Q3    Max
Number of Observations: 274
Number of Groups: 59

Questions:

- The models were fit by REML. What does that mean?
- Which model is better?
- How do we know whether the model assumptions are met (diagnostics)?
- How do we predict MFA at a given height for a given tree? For the population of all trees from a given region?

Estimation and Inference in the LMM:

Estimation: In the classical linear model, the usual method of estimation is ordinary least squares. However, we saw that if we assume normal errors, then OLS gives the same estimates of β as maximum likelihood (ML) estimation.

In the LMM, there are fixed effects β, but also parameters related to the distribution of the random effects (e.g., variance components such as σ_b²) as well as parameters related to the error terms (e.g., the error variance σ²). Least squares doesn't provide a framework for estimation and inference for all of these parameters, so ML and related likelihood-based methods (i.e., restricted maximum likelihood, or REML) are generally preferred.

ML: Recall that ML proceeds by finding the parameter values that maximize the loglikelihood, or joint p.d.f., of the data; it finds the parameter values under which the observed data are most likely.

Since the LMM assumes that the errors are normal, the random effects are normal, and the response y is linearly related to the errors and random effects via y = Xβ + Zb + e, it's not hard to show that the LMM implies that the response vector y is normal too. That is, it is easy to show that

- observations from different clusters are independent, with y_i ~ N(X_i β, V_i), where V_i = Z_i D Z_i^T + σ² I;
- the joint p.d.f. of the data is multivariate normal; and
- the loglikelihood is a sum, over clusters, of logs of multivariate normal p.d.f.'s.

This loglikelihood is easy to write down, but requires an iterative algorithm to maximize. It is implemented optionally in lme() with the method="ML" option.
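
A sketch of the implied marginal covariance V_i = Z_i D Z_i^T + σ² I, built by hand for one rail (t_i = 3) from the variance components of the earlier rail fit fit2 (D is 1 × 1 for a random intercept):

   s2b <- as.numeric(VarCorr(fit2)["(Intercept)", "Variance"])
   s2  <- as.numeric(VarCorr(fit2)["Residual",    "Variance"])
   Zi  <- matrix(1, 3, 1)                     ## random-intercept design for one rail
   Vi  <- s2b * Zi %*% t(Zi) + s2 * diag(3)   ## Z_i D Z_i' + sigma^2 I
   Vi                                         ## compound symmetric, as shown earlier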

REML: Recall from the classical linear model that the MLE of σ² was biased — it did not adjust for the d.f. lost in estimating β (the fixed effects) — and that we instead used MSE as the preferred estimator of σ².

REML was developed as a general likelihood-based methodology that would be applicable to all LMMs, but which would

- take account of the d.f. lost in estimating β, to produce less biased estimates of variance-covariance parameters (e.g., variance components) than ML; and
- generalize the old, well-known, unbiased estimators in those simple cases of the LMM where such estimators are known; e.g., REML yields MSE as its estimator of σ² in the CLM.

REML is based upon maximizing the restricted loglikelihood, which can be thought of as that portion of the loglikelihood that doesn't depend on β. Like ML estimation, it requires an iterative algorithm to produce estimates.

REML is the default estimation method for the lme() function and for PROC MIXED in SAS. It's generally regarded as the preferred method of estimation for LMMs. However, some aspects of model selection are easier with ML, so sometimes competing models are fit and compared with ML, and then the best model is refit with REML at the end.

Inference on Fixed Effects: Remember, the framework for estimation and inference in the LMM is ML or REML, not least squares as in the CLM. The standard methods of inference in a likelihood-based framework are Wald tests and likelihood ratio tests (LRTs).

LRTs and Wald tests are based upon asymptotic theory. That is, they provide methods that hold exactly as the sample size goes to infinity, and only approximately for finite sample sizes.

LRTs are useful for comparing nested models. They shouldn't be used for comparing random-effect structures/variance-covariance structures, and shouldn't be used with REML, only with ML.

Wald tests are useful for testing linear hypotheses (e.g., contrasts) on fixed effects. Wald tests yield approximate z and chi-square tests. These tests can be refined into t and F tests to produce better inferences in small samples.

Wald Tests: It can be shown that the approximate (i.e., large-sample) distribution of the (restricted) ML estimator ˆβ in the LMM is

   ˆβ ~ N( β, [ Σ_{i=1}^n X_i^T V_i^{−1} X_i ]^{−1} ),   (*)

where V_i = Z_i D Z_i^T + σ² I and the bracketed quantity is var(ˆβ). In practice, var(ˆβ) is estimated by plugging in the final (restricted) ML estimates obtained from fitting the model. The standard error of ˆβ_j, the j-th component of ˆβ, is obtained as the square root of the j-th diagonal element of v̂ar(ˆβ).

The distributional result (*) leads to the general Wald test on β. In particular, for A a k × p matrix, we reject H_0: Aβ = c at level α if

   (Aˆβ − c)^T {A [v̂ar(ˆβ)] A^T}^{−1} (Aˆβ − c) > χ²_{1−α}(k),

where χ²_{1−α}(k) is the 1 − α quantile of a chi-square distribution on k d.f.

As a special case, an approximate z test of H_0: β_j = 0 versus H_1: β_j ≠ 0 rejects H_0 if

   |ˆβ_j| / s.e.(ˆβ_j) > z_{1−α/2},

where z_{1−α/2} is the (1 − α/2) quantile of a standard normal distribution. In addition, an approximate 100(1 − α)% CI for β_j is given by ˆβ_j ± z_{1−α/2} s.e.(ˆβ_j).
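
A sketch of the Wald chi-square computed by hand for an lme fit, using the intercept-only rail fit fit2 from earlier (so A is 1 × 1, c = 0, and the test is of H_0: µ = 0):

   A  <- matrix(1, 1, 1)
   bh <- fixef(fit2)                ## beta-hat (here just mu-hat)
   Vb <- vcov(fit2)                 ## estimated var(beta-hat)
   W  <- t(A %*% bh) %*% solve(A %*% Vb %*% t(A)) %*% (A %*% bh)
   pchisq(as.numeric(W), df = nrow(A), lower.tail = FALSE)   ## approximate p-value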


More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

36-309/749 Experimental Design for Behavioral and Social Sciences. Dec 1, 2015 Lecture 11: Mixed Models (HLMs)

36-309/749 Experimental Design for Behavioral and Social Sciences. Dec 1, 2015 Lecture 11: Mixed Models (HLMs) 36-309/749 Experimental Design for Behavioral and Social Sciences Dec 1, 2015 Lecture 11: Mixed Models (HLMs) Independent Errors Assumption An error is the deviation of an individual observed outcome (DV)

More information

Estimation: Problems & Solutions

Estimation: Problems & Solutions Estimation: Problems & Solutions Edps/Psych/Soc 587 Carolyn J. Anderson Department of Educational Psychology c Board of Trustees, University of Illinois Spring 2019 Outline 1. Introduction: Estimation

More information

Topic 25 - One-Way Random Effects Models. Outline. Random Effects vs Fixed Effects. Data for One-way Random Effects Model. One-way Random effects

Topic 25 - One-Way Random Effects Models. Outline. Random Effects vs Fixed Effects. Data for One-way Random Effects Model. One-way Random effects Topic 5 - One-Way Random Effects Models One-way Random effects Outline Model Variance component estimation - Fall 013 Confidence intervals Topic 5 Random Effects vs Fixed Effects Consider factor with numerous

More information

Matrices and vectors A matrix is a rectangular array of numbers. Here s an example: A =

Matrices and vectors A matrix is a rectangular array of numbers. Here s an example: A = Matrices and vectors A matrix is a rectangular array of numbers Here s an example: 23 14 17 A = 225 0 2 This matrix has dimensions 2 3 The number of rows is first, then the number of columns We can write

More information

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals (SW Chapter 5) Outline. The standard error of ˆ. Hypothesis tests concerning β 3. Confidence intervals for β 4. Regression

More information

Introduction to Random Effects of Time and Model Estimation

Introduction to Random Effects of Time and Model Estimation Introduction to Random Effects of Time and Model Estimation Today s Class: The Big Picture Multilevel model notation Fixed vs. random effects of time Random intercept vs. random slope models How MLM =

More information

Biostatistics Workshop Longitudinal Data Analysis. Session 4 GARRETT FITZMAURICE

Biostatistics Workshop Longitudinal Data Analysis. Session 4 GARRETT FITZMAURICE Biostatistics Workshop 2008 Longitudinal Data Analysis Session 4 GARRETT FITZMAURICE Harvard University 1 LINEAR MIXED EFFECTS MODELS Motivating Example: Influence of Menarche on Changes in Body Fat Prospective

More information

Bayesian Linear Regression

Bayesian Linear Regression Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective

More information

Peter Hoff Linear and multilinear models April 3, GLS for multivariate regression 5. 3 Covariance estimation for the GLM 8

Peter Hoff Linear and multilinear models April 3, GLS for multivariate regression 5. 3 Covariance estimation for the GLM 8 Contents 1 Linear model 1 2 GLS for multivariate regression 5 3 Covariance estimation for the GLM 8 4 Testing the GLH 11 A reference for some of this material can be found somewhere. 1 Linear model Recall

More information

Introduction to Statistical modeling: handout for Math 489/583

Introduction to Statistical modeling: handout for Math 489/583 Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect

More information

Summer School in Statistics for Astronomers V June 1 - June 6, Regression. Mosuk Chow Statistics Department Penn State University.

Summer School in Statistics for Astronomers V June 1 - June 6, Regression. Mosuk Chow Statistics Department Penn State University. Summer School in Statistics for Astronomers V June 1 - June 6, 2009 Regression Mosuk Chow Statistics Department Penn State University. Adapted from notes prepared by RL Karandikar Mean and variance Recall

More information

Review of CLDP 944: Multilevel Models for Longitudinal Data

Review of CLDP 944: Multilevel Models for Longitudinal Data Review of CLDP 944: Multilevel Models for Longitudinal Data Topics: Review of general MLM concepts and terminology Model comparisons and significance testing Fixed and random effects of time Significance

More information

Simple linear regression

Simple linear regression Simple linear regression Biometry 755 Spring 2008 Simple linear regression p. 1/40 Overview of regression analysis Evaluate relationship between one or more independent variables (X 1,...,X k ) and a single

More information

Describing Change over Time: Adding Linear Trends

Describing Change over Time: Adding Linear Trends Describing Change over Time: Adding Linear Trends Longitudinal Data Analysis Workshop Section 7 University of Georgia: Institute for Interdisciplinary Research in Education and Human Development Section

More information

Linear Regression (9/11/13)

Linear Regression (9/11/13) STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter

More information

Stat 579: Generalized Linear Models and Extensions

Stat 579: Generalized Linear Models and Extensions Stat 579: Generalized Linear Models and Extensions Linear Mixed Models for Longitudinal Data Yan Lu April, 2018, week 15 1 / 38 Data structure t1 t2 tn i 1st subject y 11 y 12 y 1n1 Experimental 2nd subject

More information

Outline. Statistical inference for linear mixed models. One-way ANOVA in matrix-vector form

Outline. Statistical inference for linear mixed models. One-way ANOVA in matrix-vector form Outline Statistical inference for linear mixed models Rasmus Waagepetersen Department of Mathematics Aalborg University Denmark general form of linear mixed models examples of analyses using linear mixed

More information

Review of Statistics

Review of Statistics Review of Statistics Topics Descriptive Statistics Mean, Variance Probability Union event, joint event Random Variables Discrete and Continuous Distributions, Moments Two Random Variables Covariance and

More information

Linear Regression. Chapter 3

Linear Regression. Chapter 3 Chapter 3 Linear Regression Once we ve acquired data with multiple variables, one very important question is how the variables are related. For example, we could ask for the relationship between people

More information

An overview of applied econometrics

An overview of applied econometrics An overview of applied econometrics Jo Thori Lind September 4, 2011 1 Introduction This note is intended as a brief overview of what is necessary to read and understand journal articles with empirical

More information

Inference for Regression Simple Linear Regression

Inference for Regression Simple Linear Regression Inference for Regression Simple Linear Regression IPS Chapter 10.1 2009 W.H. Freeman and Company Objectives (IPS Chapter 10.1) Simple linear regression p Statistical model for linear regression p Estimating

More information

Outline for today. Maximum likelihood estimation. Computation with multivariate normal distributions. Multivariate normal distribution

Outline for today. Maximum likelihood estimation. Computation with multivariate normal distributions. Multivariate normal distribution Outline for today Maximum likelihood estimation Rasmus Waageetersen Deartment of Mathematics Aalborg University Denmark October 30, 2007 the multivariate normal distribution linear and linear mixed models

More information

Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation

Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation PRE 905: Multivariate Analysis Spring 2014 Lecture 4 Today s Class The building blocks: The basics of mathematical

More information

Stat 579: Generalized Linear Models and Extensions

Stat 579: Generalized Linear Models and Extensions Stat 579: Generalized Linear Models and Extensions Mixed models Yan Lu March, 2018, week 8 1 / 32 Restricted Maximum Likelihood (REML) REML: uses a likelihood function calculated from the transformed set

More information

Spatial inference. Spatial inference. Accounting for spatial correlation. Multivariate normal distributions

Spatial inference. Spatial inference. Accounting for spatial correlation. Multivariate normal distributions Spatial inference I will start with a simple model, using species diversity data Strong spatial dependence, Î = 0.79 what is the mean diversity? How precise is our estimate? Sampling discussion: The 64

More information

An Introduction to Multilevel Models. PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 25: December 7, 2012

An Introduction to Multilevel Models. PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 25: December 7, 2012 An Introduction to Multilevel Models PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 25: December 7, 2012 Today s Class Concepts in Longitudinal Modeling Between-Person vs. +Within-Person

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from

More information

1 Least Squares Estimation - multiple regression.

1 Least Squares Estimation - multiple regression. Introduction to multiple regression. Fall 2010 1 Least Squares Estimation - multiple regression. Let y = {y 1,, y n } be a n 1 vector of dependent variable observations. Let β = {β 0, β 1 } be the 2 1

More information

Repeated Measures ANOVA Multivariate ANOVA and Their Relationship to Linear Mixed Models

Repeated Measures ANOVA Multivariate ANOVA and Their Relationship to Linear Mixed Models Repeated Measures ANOVA Multivariate ANOVA and Their Relationship to Linear Mixed Models EPSY 905: Multivariate Analysis Spring 2016 Lecture #12 April 20, 2016 EPSY 905: RM ANOVA, MANOVA, and Mixed Models

More information

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij =

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij = K. Model Diagnostics We ve already seen how to check model assumptions prior to fitting a one-way ANOVA. Diagnostics carried out after model fitting by using residuals are more informative for assessing

More information

Probability and Statistics Notes

Probability and Statistics Notes Probability and Statistics Notes Chapter Seven Jesse Crawford Department of Mathematics Tarleton State University Spring 2011 (Tarleton State University) Chapter Seven Notes Spring 2011 1 / 42 Outline

More information

Regression Estimation - Least Squares and Maximum Likelihood. Dr. Frank Wood

Regression Estimation - Least Squares and Maximum Likelihood. Dr. Frank Wood Regression Estimation - Least Squares and Maximum Likelihood Dr. Frank Wood Least Squares Max(min)imization Function to minimize w.r.t. β 0, β 1 Q = n (Y i (β 0 + β 1 X i )) 2 i=1 Minimize this by maximizing

More information

The Standard Linear Model: Hypothesis Testing

The Standard Linear Model: Hypothesis Testing Department of Mathematics Ma 3/103 KC Border Introduction to Probability and Statistics Winter 2017 Lecture 25: The Standard Linear Model: Hypothesis Testing Relevant textbook passages: Larsen Marx [4]:

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

STAT 100C: Linear models

STAT 100C: Linear models STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 56 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix

More information

A brief introduction to mixed models

A brief introduction to mixed models A brief introduction to mixed models University of Gothenburg Gothenburg April 6, 2017 Outline An introduction to mixed models based on a few examples: Definition of standard mixed models. Parameter estimation.

More information

Introduction to Estimation Methods for Time Series models. Lecture 1

Introduction to Estimation Methods for Time Series models. Lecture 1 Introduction to Estimation Methods for Time Series models Lecture 1 Fulvio Corsi SNS Pisa Fulvio Corsi Introduction to Estimation () Methods for Time Series models Lecture 1 SNS Pisa 1 / 19 Estimation

More information

Bias Variance Trade-off

Bias Variance Trade-off Bias Variance Trade-off The mean squared error of an estimator MSE(ˆθ) = E([ˆθ θ] 2 ) Can be re-expressed MSE(ˆθ) = Var(ˆθ) + (B(ˆθ) 2 ) MSE = VAR + BIAS 2 Proof MSE(ˆθ) = E((ˆθ θ) 2 ) = E(([ˆθ E(ˆθ)]

More information

LECTURE 15: SIMPLE LINEAR REGRESSION I

LECTURE 15: SIMPLE LINEAR REGRESSION I David Youngberg BSAD 20 Montgomery College LECTURE 5: SIMPLE LINEAR REGRESSION I I. From Correlation to Regression a. Recall last class when we discussed two basic types of correlation (positive and negative).

More information

Random Intercept Models

Random Intercept Models Random Intercept Models Edps/Psych/Soc 589 Carolyn J. Anderson Department of Educational Psychology c Board of Trustees, University of Illinois Spring 2019 Outline A very simple case of a random intercept

More information

Homoskedasticity. Var (u X) = σ 2. (23)

Homoskedasticity. Var (u X) = σ 2. (23) Homoskedasticity How big is the difference between the OLS estimator and the true parameter? To answer this question, we make an additional assumption called homoskedasticity: Var (u X) = σ 2. (23) This

More information

Least Squares Estimation-Finite-Sample Properties

Least Squares Estimation-Finite-Sample Properties Least Squares Estimation-Finite-Sample Properties Ping Yu School of Economics and Finance The University of Hong Kong Ping Yu (HKU) Finite-Sample 1 / 29 Terminology and Assumptions 1 Terminology and Assumptions

More information

Course information: Instructor: Tim Hanson, Leconte 219C, phone Office hours: Tuesday/Thursday 11-12, Wednesday 10-12, and by appointment.

Course information: Instructor: Tim Hanson, Leconte 219C, phone Office hours: Tuesday/Thursday 11-12, Wednesday 10-12, and by appointment. Course information: Instructor: Tim Hanson, Leconte 219C, phone 777-3859. Office hours: Tuesday/Thursday 11-12, Wednesday 10-12, and by appointment. Text: Applied Linear Statistical Models (5th Edition),

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1

MA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 MA 575 Linear Models: Cedric E Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 1 Within-group Correlation Let us recall the simple two-level hierarchical

More information

Formal Statement of Simple Linear Regression Model

Formal Statement of Simple Linear Regression Model Formal Statement of Simple Linear Regression Model Y i = β 0 + β 1 X i + ɛ i Y i value of the response variable in the i th trial β 0 and β 1 are parameters X i is a known constant, the value of the predictor

More information

Mixed effects models

Mixed effects models Mixed effects models The basic theory and application in R Mitchel van Loon Research Paper Business Analytics Mixed effects models The basic theory and application in R Author: Mitchel van Loon Research

More information

Correlation and the Analysis of Variance Approach to Simple Linear Regression

Correlation and the Analysis of Variance Approach to Simple Linear Regression Correlation and the Analysis of Variance Approach to Simple Linear Regression Biometry 755 Spring 2009 Correlation and the Analysis of Variance Approach to Simple Linear Regression p. 1/35 Correlation

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

Weighted Least Squares

Weighted Least Squares Weighted Least Squares The standard linear model assumes that Var(ε i ) = σ 2 for i = 1,..., n. As we have seen, however, there are instances where Var(Y X = x i ) = Var(ε i ) = σ2 w i. Here w 1,..., w

More information

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 Lecture 2: Linear Models Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector

More information

Linear Methods for Prediction

Linear Methods for Prediction Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we

More information

Introduction to the Analysis of Hierarchical and Longitudinal Data

Introduction to the Analysis of Hierarchical and Longitudinal Data Introduction to the Analysis of Hierarchical and Longitudinal Data Georges Monette, York University with Ye Sun SPIDA June 7, 2004 1 Graphical overview of selected concepts Nature of hierarchical models

More information

Part 4: Multi-parameter and normal models

Part 4: Multi-parameter and normal models Part 4: Multi-parameter and normal models 1 The normal model Perhaps the most useful (or utilized) probability model for data analysis is the normal distribution There are several reasons for this, e.g.,

More information

BIOS 2083 Linear Models c Abdus S. Wahed

BIOS 2083 Linear Models c Abdus S. Wahed Chapter 5 206 Chapter 6 General Linear Model: Statistical Inference 6.1 Introduction So far we have discussed formulation of linear models (Chapter 1), estimability of parameters in a linear model (Chapter

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models Mixed effects models - II Henrik Madsen, Jan Kloppenborg Møller, Anders Nielsen April 16, 2012 H. Madsen, JK. Møller, A. Nielsen () Chapman & Hall

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7

MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 1 Random Vectors Let a 0 and y be n 1 vectors, and let A be an n n matrix. Here, a 0 and A are non-random, whereas y is

More information

11 Hypothesis Testing

11 Hypothesis Testing 28 11 Hypothesis Testing 111 Introduction Suppose we want to test the hypothesis: H : A q p β p 1 q 1 In terms of the rows of A this can be written as a 1 a q β, ie a i β for each row of A (here a i denotes

More information

Master s Written Examination

Master s Written Examination Master s Written Examination Option: Statistics and Probability Spring 016 Full points may be obtained for correct answers to eight questions. Each numbered question which may have several parts is worth

More information

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1 Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression

More information

Homework 3 - Solution

Homework 3 - Solution STAT 526 - Spring 2011 Homework 3 - Solution Olga Vitek Each part of the problems 5 points 1. KNNL 25.17 (Note: you can choose either the restricted or the unrestricted version of the model. Please state

More information

Section 4.6 Simple Linear Regression

Section 4.6 Simple Linear Regression Section 4.6 Simple Linear Regression Objectives ˆ Basic philosophy of SLR and the regression assumptions ˆ Point & interval estimation of the model parameters, and how to make predictions ˆ Point and interval

More information

Likelihood Ratio Tests. that Certain Variance Components Are Zero. Ciprian M. Crainiceanu. Department of Statistical Science

Likelihood Ratio Tests. that Certain Variance Components Are Zero. Ciprian M. Crainiceanu. Department of Statistical Science 1 Likelihood Ratio Tests that Certain Variance Components Are Zero Ciprian M. Crainiceanu Department of Statistical Science www.people.cornell.edu/pages/cmc59 Work done jointly with David Ruppert, School

More information

A (Brief) Introduction to Crossed Random Effects Models for Repeated Measures Data

A (Brief) Introduction to Crossed Random Effects Models for Repeated Measures Data A (Brief) Introduction to Crossed Random Effects Models for Repeated Measures Data Today s Class: Review of concepts in multivariate data Introduction to random intercepts Crossed random effects models

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

WU Weiterbildung. Linear Mixed Models

WU Weiterbildung. Linear Mixed Models Linear Mixed Effects Models WU Weiterbildung SLIDE 1 Outline 1 Estimation: ML vs. REML 2 Special Models On Two Levels Mixed ANOVA Or Random ANOVA Random Intercept Model Random Coefficients Model Intercept-and-Slopes-as-Outcomes

More information

Regression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin

Regression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin Regression Review Statistics 149 Spring 2006 Copyright c 2006 by Mark E. Irwin Matrix Approach to Regression Linear Model: Y i = β 0 + β 1 X i1 +... + β p X ip + ɛ i ; ɛ i iid N(0, σ 2 ), i = 1,..., n

More information