A First Look at Multilevel and Longitudinal Models York University Statistical Consulting Service (Revision 3, March 2, 2006) March 2006


Georges Monette, with help from Ernest Kwan, Alina Rivilis, Qing Shao and Ye Sun

Contents

1 Introduction
2 Preliminaries
   2.1 Is β1 the effect of X1?
   2.2 Visualizing multivariate variance
   2.3 Data example: Simpson and Robinson
   2.4 Matrix formulation of regression
3 General regression analysis
   3.1 Detailed description of data set
   3.2 Looking at school 4458
   3.3 Model to compare two schools
   3.4 Comparing two Sectors
   3.5 What's wrong with 3.3?
4 The Hierarchical Model
   4.1 Within School model
   4.2 Between School model
   4.3 A simulated example
   4.4 Between-School Model: What γ means
5 Combined (composite) model
   5.1 From the multilevel to the combined form
   5.2 GLS form of the model
   5.3 Matrix form
   5.4 Notational Babel
   5.5 The GLS fit

6 The simplest models
   One-way ANOVA with random effects
   Estimating the one-way ANOVA model
   Mixed model approach
   EBLUPs
7 Slightly more complex models
   Means as outcomes regression
   One-way ANCOVA with random effects
   Random coefficients model
   Intercepts and Slopes as outcomes
   Nonrandom slopes
   Asking questions: CONTRAST and ESTIMATE statements
8 A second look at multilevel models
   What is a mixed model really estimating: Var(Y|X) and T
   Random slope model
   Two random predictors
   Interpreting Chol(T)
   Recentering and balancing the model
   Random slopes and variance components parametrization
   Testing hypotheses about T
   Examples
   Fitting a multilevel model: contextual effects
   Example
9 Longitudinal Data

   9.1 The basic model
   Analyzing longitudinal data: Classical or Mixed models
   Pothoff and Roy
   Univariate ANOVA
   MANOVA repeated measures
   Random Intercept Model with Autocorrelation
   Comparing Different Covariance Models
   Using a REPEATED statement with missing occasions
   Exercises on Pothoff and Roy
Bibliography (to be changed)
Appendix: Synopsis of SAS commands in PROC MIXED
Appendix B: PROC MIXED all-dressed
SAS Exercises
Memory and speed when using PROC MIXED

1 Introduction

The last two decades have seen rapid growth in the development and use of models suitable for multilevel or longitudinal data. We can identify at least four broad approaches: the use of derived variables, econometric models, latent trajectory models using structural equation models, and mixed (fixed and random effects) models. This course focuses on the use of mixed models for multilevel and longitudinal data.

Mixed models have a wide potential for applications to data that are otherwise awkward or impossible to model. Some key applications are to nested data structures, e.g. students within classes within schools within school boards, and to longitudinal data, especially messy data where measurements are taken at irregular times, e.g. clinical data from patients measured at irregular times or panel data with changing membership. Another application is to panel data with time-varying covariates, e.g. age, where each cohort has a variety of ages and the researcher is interested in studying age effects.

New books and articles on multilevel and longitudinal models are being released much faster than one can read or afford them. One way to get started, for someone who intends to start with SAS, would be to read:

Judith D. Singer and John B. Willett (2003). Applied Longitudinal Data Analysis. Oxford University Press, New York. An accessible and comprehensive treatment of methods for longitudinal analysis. In addition to the use of mixed models, this book includes material on latent growth models and on event history analysis.

Stephen W. Raudenbush and Anthony S. Bryk (2002). Hierarchical Linear Models: Applications and Data Analysis Methods (2nd edition). Sage, Thousand Oaks, CA.

Peter J. Diggle, Patrick Heagerty, Kung-Yee Liang and Scott L. Zeger (2002). Analysis of Longitudinal Data (2nd edition). Oxford University Press, Oxford.

Tom A. B. Snijders and Roel J. Bosker (1999). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. Sage, London. An excellent book with a broad conceptual coverage.

Judith D. Singer (1998). "Using SAS PROC MIXED to fit Multilevel Models, Hierarchical Models, and Individual Growth Models."

Pinheiro and Bates (2000). Mixed-Effects Models in S and S-PLUS.

The choice of software to fit mixed models continues to grow. Many packages are in active development and offer new functionality. It is safe to hypothesize that no package is uniformly superior to all the others.

PROC MIXED in SAS is a solid workhorse, although it has been relatively static for some years. The ability to specify a wide variety of structures for both the within-cluster covariance matrix and the random effects covariance matrix is an important feature. Graphics in SAS are improving, and the description of version 9 mentions the ability to produce some important diagnostic plots automatically. NLMIXED is a more recent and interesting addition for non-linear models and for modeling binary and Poisson outcomes.

SPSS 11.5 has a MIXED command to fit mixed models. See Alastair Leyland (2004), "A review of multilevel modelling in SPSS."

An excellent special-purpose programme is MLwiN, developed by a group associated with Harvey Goldstein at the Institute of Education at the University of London. It uses a graphical interface that is very appealing for multilevel modelling and produces some very interesting plots quite easily. Its basic linear methods are well integrated with more advanced methods (logistic mixed models, MCMC estimation), making the transition relatively easy. One shortcoming is its unstructured parametrization of the variance matrix of random effects. MLwiN seems to be able to handle very large data sets as long as they fit in RAM.

NLME by Bates and Pinheiro is a library in R and S-Plus. This is my favourite working environment, but it's best for those who enjoy programming. It's a good environment for non-linear normal error models and can be adapted to fit logistic hierarchical models. R and S-Plus are very strong for graphics and for programmability, which allows you to easily refine and reuse analytic solutions.

GLLAMM in STATA is a very interesting program that is in active development and can fit a wide variety of models, including those with latent variables.

The current version of HLM has, like MLwiN, a graphical interface. It is developed by Bryk and Raudenbush and is a powerful and popular program.

Some resources on the web include:

The UCLA Academic Technology Services site contains an excellent collection of seminars on topics in statistical computing, including extensive materials on multilevel and longitudinal models.

Two multilevel statistics books are available free online: the Internet Edition of Harvey Goldstein's extensive but challenging Multilevel Statistical Models, and Applied Multilevel Analysis by Joop Hox.

Multilevel Modelling Newsletters are produced twice yearly by the Multilevel Models Project at the Institute of Education, University of London, which also maintains a web site and an email discussion list one can subscribe to.

2 Preliminaries

We begin by reviewing a few concepts from regression and multivariate data analysis. Our emphasis is on concepts that are frequently omitted or, at least, not well explored in typical courses.

2.1 Is β1 the effect of X1?

Consider the familiar regression equation:

    E(Y) = β0 + β1 X1 + β2 X2                        (1)

To make this more concrete, suppose we are studying the relationship between Y = Health, X1 = Weight and X2 = Height:

    E(Health) = β0 + β1 Weight + β2 Height           (2)

so that β1 is the effect of changing Weight. What if we are really interested in the effect of ExcessWeight? Maybe we should replace Weight with ExcessWeight. Let's suppose that

    ExcessWeight = Weight − (γ0 + γ1 Height)         (3)

where γ0 + γ1 Height is the normal Weight for a given Height. What happens if we fit the model

    E(Health) = β0* + β1* ExcessWeight + β2* Height  (4)

where we now expect β1* to be the effect of ExcessWeight?

If we substitute the definition of ExcessWeight in (3) into (4), we get:

    E(Health) = β0* + β1* ExcessWeight + β2* Height
              = β0* + β1* (Weight − (γ0 + γ1 Height)) + β2* Height
              = (β0* − β1* γ0) + β1* Weight + (β2* − β1* γ1) Height

and, using the fact that identical linear functions have identical coefficients, comparing with (2) shows that β1* = β1: the coefficient we expected to change is the only one that stays the same!

This illustrates the importance of thinking of β1 in (2) as the effect of changing Weight keeping Height constant, which turns out to be the same thing as changing ExcessWeight keeping Height constant, as long as ExcessWeight can be defined as a linear function of Weight and Height. In multiple regression, the other variables in the model are as important in defining the meaning of a coefficient as the variable to which the coefficient is attached. Often, the major role of the target variable is that of providing a measuring stick for units. This fact will play an important role when we consider controlling for contextual variables. See the appendix for an explanation of this phenomenon using the matrix formulation for regression.
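The algebra above is easy to check numerically. The sketch below is not from the notes; the coefficient values are invented for illustration. It builds the two parametrizations of the same linear function and confirms that the Weight/ExcessWeight coefficient is the one that does not move:

```python
# Hypothetical coefficients for E(Health) = b0 + b1*Weight + b2*Height
b0, b1, b2 = 1.0, -0.5, 0.3
# Hypothetical "normal weight" line: g0 + g1*Height
g0, g1 = -100.0, 1.1

# Coefficients of the equivalent model in (ExcessWeight, Height),
# read off from the substitution above:
c0 = b0 + b1 * g0        # new intercept
c1 = b1                  # coefficient of ExcessWeight: unchanged!
c2 = b2 + b1 * g1        # Height coefficient absorbs the difference

def health(weight, height):
    return b0 + b1 * weight + b2 * height

def health2(weight, height):
    excess = weight - (g0 + g1 * height)
    return c0 + c1 * excess + c2 * height

# The two parametrizations describe the same function everywhere:
for w, h in [(70.0, 175.0), (90.0, 160.0), (55.0, 180.0)]:
    assert abs(health(w, h) - health2(w, h)) < 1e-9
```

The check makes the point of the section concrete: reparametrizing Weight as ExcessWeight shifts the intercept and the Height coefficient, but leaves β1 alone.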

2.2 Visualizing multivariate variance

[Figure: the shadow of the standard ellipse onto any axis gives ±1 SD for the variable represented by projection onto that axis, i.e. any linear combination of X1 and X2.]

Understanding multivariate variance is important to unravel some mysteries of random coefficient models. With univariate random variables, it's generally easier to think in terms of the standard deviation SD = √VAR. What is the equivalent of the SD for a bivariate random vector? The easiest way of thinking about this uses the standard concentration ellipse (also known as the dispersion ellipse or the variance ellipse) for the bivariate normal. With a sample, we can define a standard sample data ellipse with the following properties:

- The centre of the standard ellipse is at the point of means.
- The shadow of the ellipse onto either axis gives the mean ± 1 SD.
- The shadow of the ellipse onto ANY axis gives the mean ± 1 SD of the corresponding linear combination.
- The vertical slice through the centre of the ellipse gives the residual SD of X2 given X1. When X1 and X2 have a multivariate normal

distribution, this is also the conditional SD of X2 given X1. Mutatis mutandis for X1 given X2.
- The line through the points of vertical tangency is the regression line of X2 on X1. Mutatis mutandis for X1 on X2.

The fact that the regression line goes through the points of vertical tangency, and NOT the corners of the rectangle containing the ellipse (NOR the principal axis of the ellipse), is the essence of the regression paradox. Indeed, the very word "regression" comes from the notion of "regression to mediocrity" embodied in the fact that the regression line (i.e. the expected value of X2 given X1) is flatter than the SD line (the line through the corners of the rectangle).

The bivariate SD: take any pair of conjugate axes of the ellipse as shown in the figure above. (In 2-D, axes are conjugate if each axis is parallel to the tangent of the ellipse at the point where the other axis intersects the ellipse... a picture is worth 26 words.) Make them column vectors in a matrix and you have a square root of the variance matrix, i.e. an SD for the bivariate random vector. The choice of conjugate axes is not unique; neither is the square root of a matrix. Depending on your choice of conjugate axes, you can get various factorizations of the variance matrix: upper or lower Choleski, the symmetric non-negative definite square root, the principal components factorization (singular value decomposition), etc.
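One way to make "a matrix square root is a bivariate SD" concrete is to compute a lower-Choleski factor by hand. This is a minimal pure-Python sketch; the 2×2 variance matrix entries are hypothetical:

```python
import math

# A hypothetical 2x2 variance matrix T = [[t11, t12], [t12, t22]]
t11, t12, t22 = 16.0, 2.0, 8.0

# Lower-Choleski factor L of T, computed directly for the 2x2 case.
# The columns of L are one choice of conjugate axes of the ellipse.
l11 = math.sqrt(t11)
l21 = t12 / l11
l22 = math.sqrt(t22 - l21 * l21)
L = [[l11, 0.0], [l21, l22]]

# Verify that L L' recovers T (the defining property of a square root):
LLt = [[L[0][0] ** 2,              L[0][0] * L[1][0]],
       [L[1][0] * L[0][0],         L[1][0] ** 2 + L[1][1] ** 2]]
T = [[t11, t12], [t12, t22]]
assert all(abs(LLt[i][j] - T[i][j]) < 1e-12
           for i in range(2) for j in range(2))
```

Replacing the Choleski factor with any other square root (e.g. the symmetric one) corresponds to picking a different pair of conjugate axes; all of them satisfy L L′ = T.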

2.3 Data example: Simpson and Robinson

We first introduce the data set we will use to illustrate concepts in multilevel modelling. We use an example from Bryk and Raudenbush (1992) in which mixed effects models are presented as hierarchical linear models. This seems to me to be the easiest way to present and develop the ideas. We will use a subset of the data used in their book. It consists of data on math achievement, SES, minority status and sex of students in 40 schools. The schools are classified as public or catholic. One goal of the analysis is to study the relationship between math achievement and SES in different settings.

(diagram from Bryk and Raudenbush, 1992)

Let's just consider the relationship between math achievement, Y, and SES, X. Consider three ways (we'll see more later) of estimating the relationship between X and Y:

1) An ecological regression of the individual school means of Y on the individual school means of X.
2) A pooled regression of Y on X ignoring school entirely. This estimates what is sometimes called the marginal relationship between Y and X.
3) A regression of Y on X adjusting for school, i.e. including school as a categorical variable or, equivalently, first partialing out the effect of schools. This estimates the conditional relationship between Y and X.

Robinson's paradox refers to the fact that 1) and 3) can have opposite signs. Simpson's paradox refers to the fact that 2) and 3) can also have opposite signs. In fact, estimate 2) will lie between 1) and 3). How much closer it is to 1) than to 3) depends on the variance of X and Y within schools.

2.4 Matrix formulation of regression

Y = Xβ + ε is such a universal and convenient shorthand that we need to spell out what it means and how it is used. Consider the equation for a single observation, assuming 2 X variables:

    Yi = β0 + β1 x1i + β2 x2i + εi,   i = 1, …, N

with εi iid N(0, σ²). We pile these equations one on top of the other:

    Y1 = β0 + β1 x11 + β2 x21 + ε1
    Y2 = β0 + β1 x12 + β2 x22 + ε2
     ⋮
    Yi = β0 + β1 x1i + β2 x2i + εi
     ⋮
    YN = β0 + β1 x1N + β2 x2N + εN

Note that the β's remain the same from line to line but the Y's, x's and ε's change. Using vectors and matrices and exploiting the rules for multiplying matrices:

    [Y1]   [1 x11 x21] [β0]   [ε1]
    [Y2] = [1 x12 x22] [β1] + [ε2]
    [ ⋮]   [⋮   ⋮   ⋮] [β2]   [ ⋮]
    [YN]   [1 x1N x2N]        [εN]

or, in shorthand: Y = Xβ + ε.

In multilevel models with, say, J schools indexed by j = 1, …, J and with the j-th school having nj students, we block students of the same school together. We can then write for the j-th school:

    [Y1j ]   [1 x1,1j  x2,1j ]         [ε1j ]
    [Y2j ] = [1 x1,2j  x2,2j ] [β0j]   [ε2j ]
    [  ⋮ ]   [⋮    ⋮      ⋮  ] [β1j] + [  ⋮ ]
    [Ynjj]   [1 x1,njj x2,njj] [β2j]   [εnjj]

or, in shorthand: Yj = Xj βj + εj.

We can stack schools on top of each other. If all schools are assumed to have the same value for β, i.e. βj = β, then we can stack the Xj's vertically:

    [Y1]   [X1]     [ε1]
    [Y2] = [X2] β + [ε2]
    [ ⋮]   [ ⋮]     [ ⋮]
    [YJ]   [XJ]     [εJ]

or, in shorter form: Y = Xβ + ε.

If the βj's are different we need to stack the Xj's diagonally:

    [Y1]   [X1 0  … 0 ] [β1]   [ε1]
    [Y2] = [0  X2 … 0 ] [β2] + [ε2]
    [ ⋮]   [⋮   ⋮    ⋮] [ ⋮]   [ ⋮]
    [YJ]   [0  0  … XJ] [βJ]   [εJ]

or, in shorter form: Y = Xβ + ε. Now you know why you see Y = Xβ + ε so often!
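The two ways of stacking can be sketched in a few lines of pure Python with toy numbers (two hypothetical schools, one predictor plus an intercept):

```python
# Per-school design matrices: each row is (1, SES) for one student.
X1 = [[1.0, 0.2], [1.0, -0.5], [1.0, 1.1]]   # school 1, n1 = 3
X2 = [[1.0, 0.7], [1.0, -1.2]]               # school 2, n2 = 2

# Common beta: stack the X_j's vertically -> one column block, one beta.
X_common = X1 + X2                            # (n1 + n2) x 2

# School-specific betas: stack the X_j's block-diagonally.
def block_diag(a, b):
    top = [row + [0.0] * len(b[0]) for row in a]
    bot = [[0.0] * len(a[0]) + row for row in b]
    return top + bot

X_block = block_diag(X1, X2)                  # (n1 + n2) x 4

# School-1 rows multiply only beta_1; school-2 rows only beta_2.
assert X_block[0][2:] == [0.0, 0.0]
assert X_block[3][:2] == [0.0, 0.0]
```

The vertical stack fits one β to everyone; the block-diagonal stack is exactly the "School as categorical predictor, with interaction" model, with one β per school.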

Something that gets used over and over again is the fact that, if ε ~ N(0, σ²I), then the OLS (ordinary least-squares) estimator is

    β̂_OLS = (X′X)⁻¹ X′Y

with variance σ²(X′X)⁻¹. If the components of ε are not iid but ε ~ N(0, Σ), where Σ is a known variance matrix (or, at least, known up to a proportional factor), then the GLS (generalized least-squares) estimator is

    β̂_GLS = (X′Σ⁻¹X)⁻¹ X′Σ⁻¹Y

with variance (X′Σ⁻¹X)⁻¹.

3 General regression analysis

3.1 Detailed description of data set

For multilevel modeling we will use a subset of a data set presented at length in Bryk and Raudenbush (1992). The major variables are math achievement (mathach) and socio-economic status (ses), measured for each student in a sample of high school students in each of 40 selected schools in the U.S. Each school belongs to one of two sectors: 25 are Public schools and 15 are Catholic schools. There are 1,720 students in the sample. The sample size from each school ranges from 14 students to 61 students. The data are available as a text file whose columns are:

    school minority female ses mathach size sector meanses


The first 48 lines are students belonging to the first school in the sample; the following lines are the first cases of the second school. The remaining variables are:

    minority  1 if a student belongs to a minority group
    female    1 if the student is female (note that school 1317 appears to be a girls' school)
    ses       a roughly standardized measure of socio-economic status
    mathach   performance on a math achievement test
    size      the size of the school
    sector    1 if the school is Catholic, 0 if Public
    meanses   a measure of the mean ses in the school, not just the mean ses of the sample

The uniform quantile plot of each variable gives a good snapshot of the data. Think of lining up each variable from shortest to tallest and plotting the result:

Figure 1: Uniform quantile plots of high school data (each variable plotted against the fraction of the 1,720 observations; N = 1720, no missing values).

Figure 2: "Trellis" plot of high school data (mathach against ses, one panel per school) with least-squares lines for each school.

Figure 3: Trellis plot of high school data with sex of students.

First suppose we are dealing with only one school. Let Yi and Xi be the math achievement score and SES, respectively, of the i-th student. We can formulate the model:

    Yi = β0 + β1 Xi + ri

where β1 is the average change in math achievement for a unit change in SES, β0 is the expected math achievement at SES = 0, and ri is the random deviation of the i-th student from the expected (linear) pattern. We assume the ri to be iid N(0, σ²).

3.2 Looking at school 4458

Public School 4458:

    > summary(lm(mathach ~ SES, zz1))
    Coefficients:
                  Value Std. Error t value Pr(>|t|)
    (Intercept)     ...        ...     ...      ...
    SES             ...        ...     ...      ...
    Residual standard error: ... on 46 degrees of freedom
    Multiple R-Squared: ...

P4458: estimates of intercept and slope. The estimated coefficients are plotted in beta space; the shadows of the confidence ellipse are ordinary 95% confidence intervals. The shadow on the horizontal axis is a 95% confidence interval for the true slope. Note that the interval includes β1 = 0, so we would accept H0: β1 = 0.

3.3 Model to compare two schools

We suppose that the relationship between math achievement and SES might be different in two schools. We can index the β's by school:

    Yij = β0j + β1j Xij + rij

where j = 1, 2 and i = 1, …, nj, with nj the number of students measured in school j. Again, we assume the rij to be iid N(0, σ²). We can fit this model and ask various questions:

1. Do the two schools have the same structure, i.e. are the βj's the same?
2. Perhaps the intercepts are different, with one school better than the other, but the slopes are the same.

3. Perhaps the slopes are different but one school is better than the other over the relevant range of X.
4. Maybe there's an essential interaction, with one school better for high X and the other for low X, i.e. the two lines cross over the observed range of X's.
5. We could also allow different σ's and test equality of variances (conditional given X).
6. We should also be ready to question the linearity implied by the model.

The following regression output allows us to answer some of these questions. If it seems reasonable, we could refit a model without an interaction term. Note that inferences based on ordinary regression only generalize to new samples of students from the same schools. They do not justify statements comparing the two sectors.

Comparing schools: P4458 and C1433

    > summary(lm(mathach ~ SES * Sector, dataset))
    Coefficients:
                  Value Std. Error t value Pr(>|t|)
    (Intercept)     ...        ...     ...      ...
    SES             ...        ...     ...      ...
    Sector          ...        ...     ...      ...
    SES:Sector      ...        ...     ...      ...
    Residual standard error: ... on 79 degrees of freedom
    F-statistic: ... on 3 and 79 degrees of freedom, the p-value is ...

    > fit$contrasts
    $Sector:
             Catholic
    Public          0
    Catholic        1

3.4 Comparing two Sectors

One possible approach: a two-stage analysis with "derived variables":

Stage 1: perform a regression on each school to get β̂j for each school.
Stage 2: analyze the β̂j's from each school using multivariate methods to estimate the mean line in each sector, and use multivariate methods to compare the two estimated mean lines.

Figure 4: The figure on the left ("All Schools: estimates of intercept and slope", data space) shows the mean lines from each sector. The figure on the right (beta space) shows each estimated intercept and slope from each school (i.e. the β̂j's). The large ellipses are the sample dispersion ellipses for the β̂j's from each sector. The smaller ellipses are 95% confidence ellipses for the mean lines from each sector.

3.5 What's wrong with 3.3?

Essentially, it gives the least-squares line of each school equal weight in estimating the mean line for each sector. If every school sample had:

1) the same number of subjects,
2) the same set of values of SES,
3) the same σ (we've been tacitly assuming this anyway),

then giving each least-squares line the same weight would make sense. In this case, the shape of the confidence ellipses in beta space would be identical for every school and the 2-stage procedure would give the same result as a classical two-level ANOVA.

These data have different nj's and different sets of values for SES in each school. The 2-stage process ignores the fact that some β̂j's are measured with greater precision than others (e.g. if the j-th school has a large nj or a large spread in SES, its β̂j is measured with more precision). We could attempt to correct this problem by using weights proportional to the inverse of

    Var(β̂j | βj) = σ²(Xj′Xj)⁻¹

but this is the variance of β̂j as an estimate of the true βj for the j-th school, NOT as an estimate of the underlying mean β for the sector. We would need to use weights proportional to the inverse of

    Var(β̂j) = Vj = σ²(Xj′Xj)⁻¹ + T

but we don't know Vj without knowing σ and T.

Another approach would be to use ordinary least-squares with School as a categorical predictor. This has other drawbacks. Comparing sectors is difficult: we can't just include a Sector effect because it would be collinear with Schools. We could construct a Sector contrast, but the confidence intervals and tests would not generalize to new samples of schools, only to new samples of students from the same sample of schools.
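A scalar caricature of the weighting problem may help. For a single slope, the sampling variance of β̂1j around the school's own true slope is σ²/Sxx_j (Sxx_j being the spread of SES in school j), while the true slopes themselves vary between schools with variance τ. Equal weights, "naive" precision weights, and the correct weights 1/(σ²/Sxx_j + τ) give three different sector estimates. All numbers below are invented for illustration:

```python
# Hypothetical per-school slope estimates and SES spreads
slopes = [2.0, 3.0, 2.5, 4.0]
Sxx    = [40.0, 5.0, 20.0, 2.0]
sigma2 = 36.0        # within-school error variance (hypothetical)
tau    = 1.5         # between-school variance of true slopes (hypothetical)

def wmean(x, w):
    return sum(xi * wi for xi, wi in zip(x, w)) / sum(w)

equal    = wmean(slopes, [1.0] * 4)                                  # 2-stage, equal weights
naive    = wmean(slopes, [S / sigma2 for S in Sxx])                  # 1 / Var(bhat | beta_j)
shrunken = wmean(slopes, [1.0 / (sigma2 / S + tau) for S in Sxx])    # 1 / (sigma2/Sxx + tau)

# The three weightings disagree: the naive weights over-reward precise
# schools, and adding tau to the denominator tempers this.
assert not (equal == naive == shrunken)
```

The third weighting is the scalar analogue of weighting by Vj⁻¹ above; it is what a mixed model effectively does once σ² and τ are estimated.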

4 The Hierarchical Model

We develop the ideas for mixed and multilevel modeling in two stages:

1. Multilevel models, as presented in Bryk and Raudenbush (1992), in which the unobserved parameters at the lower level are modeled at the higher level. This is the representation used in HLM, the software developed by Bryk and Raudenbush, and, to a limited extent, in MLwiN.
2. Mixed models, in which the levels are combined into a single equation with two parts: one for fixed effects and the other for random effects. This is the form used in SAS and in many other packages.

Although the former is more complex, it seems more natural and provides a more intuitive approach to the models. We will use the high school Math Achievement data mentioned above as an extensive example. We think of our data as structured in two levels: students within schools, and between schools. We also have two types of predictor variables:

1. within-school variables: individual student variables: SES, female, minority status. These variables are also known by many other names, e.g. inner variables, micro variables, level-1 variables (1), or time-varying variables in the longitudinal context.
2. between-school variables: Sector (Catholic or Public), meanses, size. These variables are also known as outer variables, macro variables, level-2 variables, or time-invariant variables in a longitudinal context.

A between-school variable can be created from a within-school variable by taking the average of the within-school variable within each school. Such a derived between-school variable is known as a contextual variable. Of course, such a variable is useful only if the average differs from school to school. Balanced data, in which the set of values of within-school variables would be the same in each school, does not give rise to contextual variables.

(1) In some hierarchical modeling traditions the numbering of levels is reversed, going from the top down instead of from the bottom up. One needs to check which approach an author is using.

Basic structure of the model:

1. Each school has a true regression line that is not directly observed.
2. The observations from each school are generated as random observations around the school's true regression line.
3. The true regression lines come from a population or populations of regression lines.

4.1 Within School model:

For school j (for now we suppose all schools come from the same population, e.g. only one Sector):

1) There is a true but unknown βj = (β0j, β1j)′ for each school.
2) The data are generated as Yij = Xij′βj + εij with εij ~ N(0, σ²) independent of the βj's, where Xij = (1, SESij)′.

4.2 Between School model:

We start by supposing that the βj = (β0j, β1j)′ are sampled from a single population of schools. In vector notation:

    βj = γ + uj,   uj ~ N(0, T)

where T is a variance matrix. Writing out the elements of the vectors:

    [β0j]   [γ0]   [u0j]        [u0j]     ( [0]  [τ00 τ01] )
    [β1j] = [γ1] + [u1j],  with [u1j] ~ N ( [0], [τ01 τ11] )

Note:

    Var(β0j) = τ00
    Var(β1j) = τ11
    Cov(β0j, β1j) = τ01

4.3 A simulated example

To generate an example we need to do something with SES, although its distribution is not part of the model: in the model the values of SES are taken as given constants. We take fixed values for γ and T. Once we have generated βj, we generate Nj ~ Poisson(30) and SESij ~ N(0, 1). Here's our first simulated school in detail:

For j = 1: a draw u1 gives the school's true line β1 = γ + u1; then, for i = 1, …, N1, we draw SESi1 ~ N(0, 1) and εi1 ~ N(0, σ²) and set

    Yi1 = β01 + β11 SESi1 + εi1
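The generative steps just described can be sketched in a few lines. The values of γ, T and σ below are hypothetical stand-ins, not the ones used to produce the figures:

```python
import math, random

random.seed(1)
gamma = [12.0, 2.0]                  # hypothetical population mean line
t11, t12, t22 = 16.0, 2.0, 8.0       # hypothetical T
sigma = 6.0                          # hypothetical within-school SD

# Choleski factor of T turns iid standard normals into u ~ N(0, T)
l11 = math.sqrt(t11)
l21 = t12 / l11
l22 = math.sqrt(t22 - l21 ** 2)

z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
u = [l11 * z1, l21 * z1 + l22 * z2]              # school random effect
beta = [gamma[0] + u[0], gamma[1] + u[1]]        # the school's true line

n = 30   # the notes draw N_j ~ Poisson(30); fixed here for simplicity
school = []
for _ in range(n):
    ses = random.gauss(0, 1)                     # SES ~ N(0, 1)
    y = beta[0] + beta[1] * ses + random.gauss(0, sigma)
    school.append((ses, y))
```

Each simulated school repeats this recipe with a fresh draw of u, which is exactly the two-level structure: one draw at the school level, then n draws at the student level.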

Figure 5 (simulated school, data space): the mean population regression line γ.

Figure 6 (simulated school, data space): the true regression line in School 1, β1 = γ + u1.

Figure 7 (simulated school, data space): School 1 regression line with data generated by Yi1 = β01 + β11 SESi1 + εi1.

Figure 8 (simulated school, data space): true regression line β1, data, and least-squares line β̂1.

Figure 9 (simulated school, beta space): the true mean line represented by a point.

Figure 10 (simulated school, beta space): population mean line with dispersion ellipse with matrix T for random slopes and intercepts. Note that the shadows of the ellipse yield the mean plus or minus 1 standard deviation.

Figure 11 (simulated school, beta space): a random true intercept and slope from the population. This one happens to be somewhat atypical but not wholly implausible.

Figure 12 (simulated school, beta space): true intercept and slope with dispersion ellipse with matrix σ²(X′X)⁻¹ for β̂.

Figure 13 (simulated school, beta space): observed value of β̂.

Figure 14 (simulated school, beta space): the blue dispersion ellipse with matrix V = T + σ²(X′X)⁻¹ is almost coincident with the dispersion ellipse with matrix T.

Note that with smaller N, larger σ, or smaller dispersion for SES, the dispersion ellipse for the true βj (with matrix T) and the dispersion ellipse for β̂j as an estimate of γ (with matrix V = T + σ²(X′X)⁻¹) could differ much more than they do here. Also note that the statistical design of the study can make σ²(X′X)⁻¹ smaller but, typically, not T.

4.4 Between-School Model: What γ means

Instead of supposing that we have a single population of schools, we now add the between-school model that will allow us to suppose that there are two populations of schools, Catholic and Public, and that the population mean slope and intercept may be different in the two sectors. Let Wj represent the between-school sector variable, the indicator variable for Catholic schools: Wj is equal to 1 if school j is Catholic and 0 if it is Public. (2) We have two regression models, one for the intercepts and one for the slopes:

    β0j = γ00 + γ01 Wj + u0j
    β1j = γ10 + γ11 Wj + u1j

We can work out the following interpretation of the γ coefficients by setting Wj to 0 for Public schools and then to 1 for Catholic schools. The interpretation is analogous to that of the ordinary regression to compare two schools, except that we are now comparing the two sectors.

In Public schools (Wj = 0):    β0j = γ00 + u0j,         β1j = γ10 + u1j
In Catholic schools (Wj = 1):  β0j = γ00 + γ01 + u0j,   β1j = γ10 + γ11 + u1j

(2) Between-school variables are not limited to indicator variables. Any variable suitable as a predictor in a linear model could be used, as long as it is a function of schools, i.e. has the same value for every subject within each school.

Thus:

1. γ00 is the mean achievement intercept for Public schools, i.e. the mean achievement when SES is 0.
2. γ00 + γ01 is the mean achievement intercept for Catholic schools, so that γ01 is the difference in mean intercepts between Catholic and Public schools.
3. γ10 is the mean slope in Public schools.
4. γ10 + γ11 is the mean slope in Catholic schools, so that γ11 is the mean difference in (or difference in mean) slopes between Catholic and Public schools.
5. u0j is the unique effect of school j on the achievement intercept, conditional given Wj.
6. u1j is the unique effect of school j on the slope, conditional given Wj.

Now, u0j and u1j are Level-2 random variables (random effects), which we assume to have 0 mean and variance-covariance matrix T. This is a multivariate model with the complication that the dependent variables, β0j and β1j, are not directly observable. As mentioned above, one way to proceed would be to use a two-stage process:

1. Estimate β0j, β1j with least-squares within each school, and
2. use the estimated values in a Level-2 analysis with the model above.

Some problems with this approach are:

1. Each β̂0j, β̂1j might have a different variance due to differing nj's and different predictor matrices Xj in each school. A

Level-2 analysis that uses OLS will not take these factors into consideration.
2. Even if Xj (thus nj) is the same for each school, we might be interested in getting information on T itself, not on Var(β̂j) = T + σ²(Xj′Xj)⁻¹.
3. β̂0j, β̂1j might be reasonable estimates of the parameters β0j and β1j but, as estimators of the random variables β0j and β1j, they ignore the information contained in the distribution of β0j and β1j.
4. Some level-1 models might not be estimable, so information from those schools would be lost.

5 Combined (composite) model

5.1 From the multilevel to the combined form

Since

    β0j = γ00 + γ01 Wj + u0j
    β1j = γ10 + γ11 Wj + u1j

we combine the models by substituting the between-school model into the within-school model:

    Yij = β0j + β1j Xij + rij

Substituting, we get

    Yij = (γ00 + γ01 Wj + u0j) + (γ10 + γ11 Wj + u1j) Xij + rij

We then rearrange the terms to separate fixed parameters from random coefficients:

    Yij = γ00 + γ01 Wj + γ10 Xij + γ11 Wj Xij + u0j + u1j Xij + rij

The right-hand side looks like the sum of two linear models:

1) an ordinary linear model with coefficients that are fixed parameters:

    γ00 + γ01 Wj + γ10 Xij + γ11 Wj Xij

with fixed parameters γ00, γ01, γ10, γ11, and

2) a linear model with random coefficients and an error term:

    u0j + u1j Xij + rij

with random parameters u0j and u1j.

Note the following:

1. The fixed model contains both outer variables and inner variables as well as an interaction between an inner and an outer variable. This kind of interaction is called a cross-level interaction. It allows the effect of X to be different in each Sector.
2. The random effects model only contains an intercept and an inner variable. There are very arcane situations in which it might make sense to include an outer variable in the random effects portion of the model, which we will consider briefly later.
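The decomposition amounts to building two design rows per observation: one with columns (1, W, X, WX) for the fixed part, and one with columns (1, X) for the random part. A minimal sketch (the function name and values are hypothetical):

```python
def design_row(x, w):
    """Design rows for one student: x is the inner variable (SES),
    w the outer variable (Sector indicator)."""
    fixed = [1.0, w, x, w * x]     # multiplies (g00, g01, g10, g11)
    random_part = [1.0, x]         # multiplies (u0j, u1j); a row of Z_j
    return fixed, random_part

# A Catholic-school student (w = 1) with SES = 0.5:
fx, zx = design_row(x=0.5, w=1.0)
assert fx == [1.0, 1.0, 0.5, 0.5]
assert zx == [1.0, 0.5]
```

Stacking the `fixed` rows gives the X matrix of the combined model and stacking the `random_part` rows within each school gives Zj, anticipating the matrix form of Section 5.3.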

Understanding the connection between the multilevel model and the combined model is useful because some packages require the model to be specified in its multilevel form (e.g. MLwiN), while others require the model to be specified in its combined form as two models, the fixed effects model and the random effects model (e.g. SAS PROC MIXED, R and S-Plus lme() and nlme()).

5.2 GLS form of the model

Another way of looking at this model is to see it as a linear model with a complex form of error. Let δij represent the combined error term, also known as the composite error term:

    δij = u0j + u1j Xij + rij

We can then write the model as:

    Yij = γ00 + γ01 Wj + γ10 Xij + γ11 Wj Xij + δij

This looks like an ordinary linear model except that the δij's are not identically N(0, σ²) and are not independent, since the same u0j and u1j contribute to the random error for all δij's in the j-th school. If we let δj be the vector of errors in the j-th school, we can express the distribution of the combined errors as follows:

    δj ~ N(0, σ²I + Zj T Zj′),   with δj and δk independent for j ≠ k

where Zj is the matrix whose i-th row is (1, Xij). If T and σ² were known, the variance-covariance matrix of the random errors could be computed and the model fitted with Generalized Least-Squares (GLS). With T and σ² unknown, we can iteratively estimate them and use the estimated values to fit the linear parameters by GLS. There are variants depending on the way in which T and σ² are estimated. Using the full likelihood yields what is often called IGLS, ML, or FIML. Using the conditional likelihood of residuals given Ŷ yields RIGLS or REML (R for "restricted" or "reduced").
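The covariance of the composite errors can be computed directly for a tiny school. A pure-Python sketch with hypothetical values of T, σ² and SES, computing Var(δj) = Zj T Zj′ + σ²I for three students:

```python
sigma2 = 36.0                         # hypothetical within-school variance
T = [[16.0, 2.0], [2.0, 8.0]]         # hypothetical variance of (u0j, u1j)
Z = [[1.0, -0.5], [1.0, 0.0], [1.0, 1.2]]   # rows (1, X_ij), hypothetical SES

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

Zt = [[Z[j][i] for j in range(3)] for i in range(2)]   # Z transposed
V = matmul(matmul(Z, T), Zt)                           # Z T Z'
for i in range(3):
    V[i][i] += sigma2                                  # + sigma^2 I

# Off-diagonal entries are nonzero: students sharing a school share
# the same (u0j, u1j), so their composite errors are correlated.
assert V[0][1] != 0.0
assert V[0][1] == V[1][0]                              # V is symmetric
```

The diagonal of V also varies with Xij, which is why the composite errors are not identically distributed either.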

5.3 Matrix form

Take all the observations in school j and assemble them into vectors and matrices (this is called the Laird-Ware formulation of the model, from Laird and Ware (1982)):

    Y_j = X_j γ + Z_j u_j + r_j

where

    Y_j = (Y_1j, ..., Y_{n_j j})',   r_j = (r_1j, ..., r_{n_j j})',   u_j = (u0j, u1j)',

    X_j = [ 1  W_j  X_1j      W_j X_1j     ;        Z_j = [ 1  X_1j     ;
            1  W_j  X_2j      W_j X_2j     ;                1  X_2j     ;
            ...                            ;                ...         ;
            1  W_j  X_{n_j j} W_j X_{n_j j} ]               1  X_{n_j j} ]

    γ = (γ00, γ01, γ10, γ11)',   j = 1, ..., J

The distribution of the random elements is:

    u_j ~ N(0, T),   r_j ~ N(0, σ² I)

with u_j independent of r_j. Now we put the school matrices together into big matrices:

    Y = X γ + Z u + r

where

    Y = (Y_1', ..., Y_J')',   X = (X_1', ..., X_J')',   u = (u_1', ..., u_J')',   r = (r_1', ..., r_J')'

and Z is block diagonal:

    Z = [ Z_1  0    ...  0   ;
          0    Z_2  ...  0   ;
          ...                ;
          0    0    ...  Z_J ]

with

    u ~ N( 0, [ T  0  ...  0 ;
                0  T  ...  0 ;
                ...          ;
                0  0  ...  T ] )   and   r ~ N(0, σ² I)

which might be deceptive because the I is now much larger than before. The new block diagonal matrix for the variance of u is often written with the same symbol, T, as the variance of a single u_j; to avoid confusion we can use a distinct symbol for the block diagonal version.

5.4 Notational Babel

Mixed models were simultaneously and semi-independently developed by researchers in many different disciplines, each developing its own notation. The notation we are using here is that of Bryk and Raudenbush (1992), which has been very influential in social research. Many publications use this notation. It differs from the notation used in the SAS documentation, whose development was more influenced by seminal statistical work in animal husbandry. It is, of course, perfectly normal to fit models in SAS but to report findings using the notation in common use in the subject matter area. A short bilingual dictionary follows. Fortunately, Y, X and Z are used with the same meaning.

    Quantity                            Bryk and Raudenbush    SAS
    Fixed effects parameters            γ                      β
    Cluster random effects              β                      b
    Cluster random effects (centered)   u                      γ
    Variance of random effects          T                      G
    Within cluster error variance       Σ                      R

5.5 The GLS fit

With the matrix formulation of the model, it is easy to express the GLS estimator of γ. First denote:

    V = Var(δ) = Z T Z' + σ² I

Then the GLS estimator is:

    γ̂ = (X'V⁻¹X)⁻¹ X'V⁻¹ Y

We will see that the presence of V can result in an estimate that is very different from its OLS analogue.³

³ One ironic twist concerns small estimated values of σ². Normally this would be a cause for rejoicing; however it can result in a nearly singular V. Although this need not imply that X'V⁻¹X is nearly singular, some algorithms seem not to take advantage of this.
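The GLS computation is easy to sketch numerically. The following Python fragment (an illustration only, not part of the course's SAS code; the number of schools, the parameter values and the data are all invented) exploits the block diagonal structure of V: it builds V_j = Z_j T Z_j' + σ²I school by school and accumulates X'V⁻¹X and X'V⁻¹Y.

```python
import numpy as np

rng = np.random.default_rng(2006)

J, n = 40, 8                      # hypothetical: 40 schools, 8 students each
gamma = np.array([2.0, 0.5])      # true fixed effects: intercept and slope
T = np.array([[1.0, 0.3],
              [0.3, 0.5]])        # Var(u_j): random intercept and slope
sigma2 = 1.0

XtVinvX = np.zeros((2, 2))
XtVinvY = np.zeros(2)
for j in range(J):
    x = rng.normal(size=n)
    Xj = np.column_stack([np.ones(n), x])   # fixed-effects design for school j
    Zj = Xj                                  # same columns for the random part
    u = rng.multivariate_normal(np.zeros(2), T)
    Yj = Xj @ gamma + Zj @ u + rng.normal(scale=np.sqrt(sigma2), size=n)
    Vj = Zj @ T @ Zj.T + sigma2 * np.eye(n)  # Var(delta_j) = Z_j T Z_j' + sigma^2 I
    Vinv = np.linalg.inv(Vj)
    XtVinvX += Xj.T @ Vinv @ Xj              # V is block diagonal, so the big
    XtVinvY += Xj.T @ Vinv @ Yj              # products accumulate school by school

gamma_hat = np.linalg.solve(XtVinvX, XtVinvY)   # (X'V^-1 X)^-1 X'V^-1 Y
```

In practice T and σ² are of course unknown; as described above, they are estimated by ML or REML and the GLS step is iterated with the variance estimation until convergence.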

6 The simplest models

We have now built up the notation and some theory for a fairly general form of the linear mixed model with both inner and outer variables and a random effects model with a random intercept and a random slope. We will now consider the interpretation of simpler models in which we keep only some components of the more general model. Even when we are interested in the larger model, it is important to understand the simple sub-models because they are used for hypothesis testing in the larger model. We will also consider some extensions of the concepts we have seen so far in the context of some of these simpler models.

6.1 One-way ANOVA with random effects

This is the simplest of random effects models and provides a good starting point to illustrate the special characteristics of these models.

Level 1 model:    Y_ij = β0j + r_ij
Level 2 model:    β0j = γ00 + u0j
Combined model:   Y_ij = γ00 + u0j + r_ij

    Var(Y_ij) = Var(u0j + r_ij) = τ00 + σ²

Note the intraclass correlation coefficient:

    ρ = τ00 / (τ00 + σ²)

Also note that within each school:

    E(Ȳ·j | β0j) = β0j,   Var(Ȳ·j | β0j) = σ²/n_j

but across the population:

    E(Ȳ·j) = γ00,   Var(Ȳ·j) = τ00 + σ²/n_j

This is an example of two very useful facts:

1. the unconditional (sometimes called marginal, but not by economists) mean is equal to the mean conditional mean,

2. the unconditional variance is equal to the mean of the conditional variance plus the variance of the conditional mean, i.e.:

    Var(Y_ij) = E(Var(Y_ij | β0j)) + Var(E(Y_ij | β0j)) = σ² + Var(β0j) = σ² + τ00

    Var(Ȳ·j) = E(Var(Ȳ·j | β0j)) + Var(E(Ȳ·j | β0j)) = σ²/n_j + τ00

6.2 Estimating the one-way ANOVA model

There are three kinds of parameters that need to be estimated:

1. fixed effect parameters: in this case there is only one, γ00,
2. variance-covariance components: τ00 and σ²,
3. random effects: β0j or, equivalently, combined with γ00, u0j.
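These identities are easy to check by simulation. The short Python sketch below (the parameter values are invented for illustration) simulates the one-way random-effects model and compares the simulated total variance with τ00 + σ², the variance of school means with τ00 + σ²/n, and computes the intraclass correlation τ00/(τ00 + σ²).

```python
import numpy as np

rng = np.random.default_rng(0)

J, n = 2000, 10          # many schools so the simulation checks are sharp
gamma00 = 12.0
tau00, sigma2 = 4.0, 9.0

u = rng.normal(scale=np.sqrt(tau00), size=J)          # u_0j, one per school
r = rng.normal(scale=np.sqrt(sigma2), size=(J, n))    # r_ij
Y = gamma00 + u[:, None] + r                          # combined model

total_var = Y.var()               # should be close to tau00 + sigma2 = 13
school_means = Y.mean(axis=1)     # their variance should be near tau00 + sigma2/n
var_of_means = school_means.var()

icc = tau00 / (tau00 + sigma2)    # intraclass correlation = 4/13
```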

We use a different approach for each type of parameter. The fixed effects parameters are like linear regression parameters except that they are estimated from observations that are not independent. Instead of using OLS (ordinary least-squares) we use GLS (generalized least-squares), using the estimates of the variance-covariance components as the variance matrix in the GLS procedure. The variance-covariance parameters are estimated using ML (maximum likelihood) or REML (restricted maximum likelihood). Note that each step above assumes that the other one has been completed. What really happens is that estimation goes back and forth between the two steps until convergence.

The random effects are not just parameters. They are realizations of random variables. This means that we have two sources of information about them: we can estimate them from the observed data and we can "guess" them from their distribution. Putting these two sources of information together is the essence of Bayesian estimation, or rather empirical Bayesian estimation because the distribution of the random effects, determined by τ00, is estimated from the data and model. The random effects are "predicted" (in contrast with "estimated") using EBLUPs (Empirical Best Linear Unbiased Predictors) with the empirical posterior expectation:

    Ê(β0j | Y_11, ..., Y_{n_J J})

i.e. the expected value of what is unknown given what is known. We will look at the estimation of the three types of parameters in detail in this example.

First we consider the analysis of the data using OLS, in which we treat β01, ..., β0J as non-random parameters. The coding of the school effect determines what is estimated by the intercept term. It is a weighted linear combination of the β0j's:

    μ_w = Σ_{j=1..J} w_j β0j,   Σ_j w_j = 1

If the coding uses true contrasts (each column of the coding matrix sums to 0) the weights are all equal to 1/J and μ_w is the ordinary mean of the β0j's. In this case

    μ_w = (1/J) Σ_{j=1..J} β0j

and

    μ̂_w = Ȳ_Schools = (1/J) Σ_{j=1..J} Ȳ·j

With sample size coding, e.g.

                   V_1        V_2       ...   V_{J-1}
    School 1        1          0        ...    0
    School 2        0          1        ...    0
    School 3        0          0        ...    0
    ...
    School J-1      0          0        ...    1
    School J    -n_1/n_J   -n_2/n_J    ...  -n_{J-1}/n_J

each column of the design matrix sums to 0 and the intercept will estimate:

    μ_w = Σ_{j=1..J} (n_j / n+) β0j,   n+ = Σ_j n_j

which weights each school according to its sample size. This can be thought of as the mean of the population of students instead of the population of schools. The estimator would be the overall average of Y:

    μ̂_w = Σ_j n_j Ȳ·j / n+ = Ȳ·· = Ȳ_Students

We are not limited to these two obvious choices. A more appropriate set of weights could be school size s_j, with coding:

                   V_1        V_2       ...   V_{J-1}
    School 1        1          0        ...    0
    School 2        0          1        ...    0
    School 3        0          0        ...    0
    ...
    School J-1      0          0        ...    1
    School J    -s_1/s_J   -s_2/s_J    ...  -s_{J-1}/s_J

so that the intercept would estimate:

    μ_w = Σ_{j=1..J} (s_j / s+) β0j,   s+ = Σ_j s_j

In each case the form of the estimate is a weighted mean of the individual school averages:

    μ̂_w = Σ_{j=1..J} w_j Ȳ·j

with variance:

    Var(μ̂_w | β01, ..., β0J) = σ² Σ_j w_j² / n_j

where the weights w_j sum to 1. Note that the variance is minimized when the weights are proportional to n_j, i.e. w_j = n_j / n+ where n+ is the total sample size: n+ = Σ_j n_j. In this case the variance is σ² / n+. Thus, the student mean is the parameter estimated with the least variance.
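The dependence of Var(μ̂_w) on the choice of weights is easy to tabulate. The Python sketch below (school sizes and variance components are invented for illustration) compares equal weights (the school mean) with size-proportional weights (the student mean), conditioning on the β0j's as in the display above; it also computes the precision weights proportional to 1/(τ00 + σ²/n_j) that reappear in the mixed model approach of the next section.

```python
import numpy as np

sigma2 = 9.0
tau00 = 4.0
n = np.array([5.0, 10.0, 20.0, 40.0, 80.0])   # hypothetical school sizes n_j

def var_fixed(w):
    # Var(mu_hat_w | beta_01, ..., beta_0J) = sigma^2 * sum_j w_j^2 / n_j
    return sigma2 * np.sum(w ** 2 / n)

w_equal = np.full(n.size, 1 / n.size)   # weights giving Y-bar over schools
w_size = n / n.sum()                    # weights giving Y-bar over students

# precision weights used when the beta_0j are treated as random:
prec = 1 / (tau00 + sigma2 / n)
w_prec = prec / prec.sum()
```

Size-proportional weights attain the minimum conditional variance σ²/n+, and the precision weights lie between the equal-weight and size-proportional extremes.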

6.2.1 Mixed model approach

With a mixed model we want to estimate γ00 instead of a particular linear combination of the β0j's. Any weighted mean μ̂_w = Σ_j w_j Ȳ·j of the Ȳ·j's will be unbiased for γ00 because

    E(μ̂_w) = E(Σ_j w_j Ȳ·j) = Σ_j w_j E(Ȳ·j) = Σ_j w_j γ00 = γ00

if the w_j's are weights with Σ_j w_j = 1. Now, to calculate the variance of μ̂_w as an estimator of γ00, we first need the variance of Ȳ·j as an estimator of γ00 with β0j random:

    Var(Ȳ·j) = τ00 + σ²/n_j

Thus:

    Var(μ̂_w) = Σ_j w_j² (τ00 + σ²/n_j)

The optimal estimator is obtained by taking weights inversely proportional to τ00 + σ²/n_j.

Consider the implications:

1. If τ00 is much larger than σ², the weights will be nearly constant and μ̂_w will be close to Ȳ_Schools.

2. Conversely, if τ00 is much smaller than σ², the weights will be nearly proportional to n_j and the estimator will be close to Ȳ_Students.

3. If it is not reasonable to treat the β0j's as a random sample from the same N(γ00, τ00) distribution then these two estimators could estimate two quantities with very different meanings. Consider, for example, what would happen if there is a strong relationship between the β0j's and the n_j's. What gets estimated is governed by the ratio of τ00 to σ², a purely statistical consideration quite disconnected from any interpretation of the estimator. It is important to appreciate that your estimator is determined by considerations that might not be relevant.

In SAS, the (minimal) commands would be⁴:

PROC MIXED DATA = MIXED.HS;
CLASS SCHOOL;
MODEL Y = ;
RANDOM INTERCEPT / SUBJECT=SCHOOL;
RUN;

6.3 EBLUPs

This interesting topic can, alas, be skipped. It played a central role in the early development of mixed models for animal husbandry, where an important practical problem was estimating the reproductive qualities of a bull from the characteristics of its progeny. In most applications of mixed models in the social sciences, the focus is on the estimation of the fixed parameters and much less so on the prediction of the random effects.

Estimating the u0j's involves using two sources of information: the data and their distribution as random variables. First consider the OLS estimator for β0j:

    β̂0j = Ȳ·j

Now, to get the Empirical Best Linear Unbiased Predictor of the u0j's, we pretend that the estimated values of τ00 and σ² are the true values and we calculate the conditional expectation of the u0j's given the Y_ij's. This is done most easily using the matrix formulation of the model and a formula for the conditional expectation in the multivariate case. We use partitioned matrices to express the joint distribution of Y and u:

⁴ To use the HS data set, download the self-extracting file from the course website. Save it in a convenient directory. Click on its icon to create the SAS data set HS.SD2. From SAS, create a library named MIXED that points to this directory. You can then use the data set using the syntax in this example.

    [ Y ]        ( [ Xγ ]   [ ZTZ' + σ²I   ZT ] )
    [ u ]   ~  N ( [ 0  ] , [ TZ'          T  ] )

A well-known formula gives:

    Ê(u | Y) = TZ'V⁻¹(Y − Xγ)

where V = ZTZ' + σ²I. This formula, with a bit more mechanical work, will give us the EBLUP below, but we will derive it intuitively:

1. We could estimate u0j with the obvious OLS estimate: û0j = Ȳ·j − γ̂00. As an estimate of u0j this has variance σ²/n_j.

2. We could also guess that u0j is equal to 0 (the mean of its distribution) and our guess would have variance τ00.

How can we best combine these independent sources of information? By using weights proportional to inverse variance! This gives us the EBLUP of u0j:

    ũ0j = [ (σ²/n_j)⁻¹ û0j + (τ00)⁻¹ · 0 ] / [ (σ²/n_j)⁻¹ + (τ00)⁻¹ ]
        = [ τ00 / (τ00 + σ²/n_j) ] û0j

This has the effect of shrinking û0j towards 0 by the factor τ00 / (τ00 + σ²/n_j).
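The shrinkage factor τ00/(τ00 + σ²/n_j) is simple enough to tabulate directly. A short Python sketch (the values of τ00, σ², the school sizes and the raw estimate are all invented for illustration):

```python
import numpy as np

tau00, sigma2 = 4.0, 9.0
n = np.array([2.0, 5.0, 10.0, 25.0, 100.0])   # hypothetical school sizes n_j

# lambda_j = tau00 / (tau00 + sigma2 / n_j):
# the factor multiplying u_hat_0j = Ybar_j - gamma00_hat
lam = tau00 / (tau00 + sigma2 / n)

# EBLUP of u_0j for a school with raw estimate u_hat_0j = 3.0 and n_j = 5:
u_hat = 3.0
u_eblup = (tau00 / (tau00 + sigma2 / 5.0)) * u_hat   # shrunk toward 0
```

The factor is always between 0 and 1 and grows toward 1 as n_j increases, matching the qualitative discussion that follows.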

Consider how the amount of shrinkage depends on the relative values of σ², τ00 and n_j. There will be more shrinkage if:

1. τ00 is small: i.e. the distribution of u0j is known to be concentrated close to 0,
2. σ² is large: i.e. Ȳ·j has large variation as an estimate of β0j,
3. n_j is small: ditto.

The EBLUP estimator of β0j (we'll call it β̃0j) works exactly the same way with the OLS estimator β̂0j = Ȳ·j (analyzing each school separately), which gets shrunk towards the overall estimator γ̂00. This is in exactly the same spirit as shrinkage estimators derived from Bayesian, Empirical Bayes or frequentist approaches. Bradley Efron and Carl Morris wrote an interesting article on the topic in Scientific American, Efron and Morris (1977).

7 Slightly more complex models

7.1 Means as outcomes regression

Level 1 model:    Y_ij = β0j + r_ij
Level 2 model:    β0j = γ00 + γ01 W_j + u0j
Combined model:   Y_ij = γ00 + γ01 W_j + u0j + r_ij

Note that

    Var(Y_ij | W_j) = Var(u0j + r_ij) = τ00 + σ²

as above but, in this model, Var(Y_ij | W_j) is a conditional variance, conditional given W_j.

In SAS, the commands for the means as outcomes model would be:

PROC MIXED DATA = MIXED.HS;
CLASS SCHOOL;
MODEL Y = W ;
RANDOM INTERCEPT / SUBJECT = SCHOOL;
RUN;

7.2 One-way ANCOVA with random effects

Level 1 model:    Y_ij = β0j + β1j X_ij + r_ij
Level 2 model:    β0j = γ00 + u0j
                  β1j = γ10
Combined model:   Y_ij = γ00 + γ10 X_ij + u0j + r_ij

In SAS, the commands for one-way ANCOVA with random effects are:

PROC MIXED DATA = MIXED.HS;
CLASS SCHOOL;
MODEL Y = X ;
RANDOM INTERCEPT / SUBJECT = SCHOOL;
RUN;

7.3 Random coefficients model

Level 1 model:    Y_ij = β0j + β1j X_ij + r_ij
Level 2 model:    β0j = γ00 + u0j
                  β1j = γ10 + u1j

with:

    Var [ u0j ]  =  T  =  [ τ00  τ01 ]
        [ u1j ]           [ τ01  τ11 ]

Combined model:   Y_ij = γ00 + γ10 X_ij + u0j + u1j X_ij + r_ij

In SAS, the commands for the random coefficients model are:

PROC MIXED DATA = MIXED.HS;
CLASS SCHOOL;
MODEL Y = X ;
RANDOM INTERCEPT X / SUBJECT = SCHOOL TYPE = UN;
RUN;

7.4 Intercepts and Slopes as outcomes

This corresponds to the full model presented in 4.4 above. The SAS commands for this model are:

PROC MIXED DATA = MIXED.HS;
CLASS SCHOOL;
MODEL Y = X W X*W;
RANDOM INTERCEPT X / SUBJECT = SCHOOL TYPE = UN;
RUN;

Note the X*W term. It is called a cross-level interaction. It has the function of allowing the mean slope with respect to X to vary with W.

7.5 Nonrandom slopes

Consider the full model but with τ11 = 0 (hence τ01 = 0 also, otherwise T would not be a variance matrix). This is a model in which the variation in β̂1j from school to school is wholly consistent with the expected variation within schools, so there is no need to postulate a random slope. It is left as an exercise to specify the SAS commands to produce this model.

7.6 Asking questions: CONTRAST and ESTIMATE statements

The CONTRAST and ESTIMATE statements in SAS allow you to ask specific questions about the model. We will use the Intercepts and slopes as outcomes model above and consider asking some specific questions such as: Is there a significant difference between Public and Catholic schools at various values of SES, e.g. some meaningful quantiles?

    Proportion        10%   50%   90%
    Quantile of SES

There are a few steps that make it easier to ask a clear question:

1. Sketch the fitted model.
2. Identify what you want to ask on the sketch.
3. Transform your question into coefficients for terms in the model.
4. Write a suitable CONTRAST or ESTIMATE statement.

We will first apply this approach to the question: Is there a significant difference between the two sectors at the 10% quantile? Let's have a look at a summary of the data:

             SCHOOL  MINORITY  FEMALE  SES  MATHACH  SIZE  SECTOR  MEANSES
    Min.
    1st Qu.
    Median
    Mean
    3rd Qu.
    Max.

Then we fit the intercepts and slopes as outcomes model with random slopes and have a look at the estimated coefficients. We replace Y and X with the names of the variables in the data set.

PROC MIXED DATA = MIXED.HS;
CLASS SCHOOL;
MODEL MATHACH = SES SECTOR SES*SECTOR / S DDFM=SATTERTH;
RANDOM INTERCEPT SES / SUBJECT = SCHOOL TYPE = FA0(2);
RUN;

Note that we use the S option in the MODEL statement to force printing of the estimated coefficients. Some of the output produced follows:

    Solution for Fixed Effects
                             Standard
    Effect       Estimate    Error       DF    t Value    Pr > |t|
    Intercept                                             <.0001
    SES                                                   <.0001
    SECTOR
    SES*SECTOR

    Type 3 Tests of Fixed Effects
                 Num    Den
    Effect       DF     DF      F Value    Pr > F
    SES                                    <.0001
    SECTOR
    SES*SECTOR

We specify the values of the X matrix for the two levels we want to compare:

                         Intercept    SES    SECTOR    SES*SECTOR
    Cath at SES =
    Public at SES =
    Difference

We can now modify PROC MIXED to get the desired estimate:

PROC MIXED DATA = MIXED.HS;
CLASS SCHOOL;
MODEL MATHACH = SES SECTOR SES*SECTOR / S DDFM=SATTERTH;
RANDOM INTERCEPT SES / SUBJECT = SCHOOL TYPE = FA0(2);
ESTIMATE 'Sector diff at 10 pctile of SES' SECTOR 1 SES*SECTOR / CL E;
RUN;

with partial output:

                                                Standard
    Label                            Estimate   Error     DF   t Value   Pr > |t|
    Sector diff at 10 pctile of SES

    Estimates
    Label                            Alpha      Lower     Upper
    Sector diff at 10 pctile of SES

Thus we see that the estimated MATHACH is higher in the Catholic sector than in the Public sector at the 10th percentile of SES.

Exercise 1: Estimate the difference at the 90th percentile of SES.

We now ask a question that requires two constraints: Are the two lines identical? The significant values for the SES*SECTOR interaction and for the SECTOR main effect suggest that they are not! However, these two tests don't constitute a formal test of the equality of the two lines. We could test equality at two points, say, SES = 1 and -1:

                           Intercept    SES    SECTOR    SES*SECTOR
    Cath at SES = 1
    Public at SES = 1
    Difference
    Cath at SES = -1
    Public at SES = -1
    Difference

To test two constraints simultaneously, we use the CONTRAST statement and separate each constraint with a comma:

PROC MIXED DATA = MIXED.HS;
CLASS SCHOOL;
MODEL MATHACH = SES SECTOR SES*SECTOR / S DDFM=SATTERTH;
RANDOM INTERCEPT SES / SUBJECT = SCHOOL TYPE = FA0(2);
CONTRAST 'No sector effect' SECTOR 1 SES*SECTOR 1,
                            SECTOR 1 SES*SECTOR -1 / E;
RUN;

Note that the CONTRAST statement does not take the CL option because it only performs a test; it does not produce confidence intervals.

Exercise 2: See if you get the same answer with different equivalent constraints: e.g. if the coefficients of SECTOR and SES*SECTOR are both 0 then the two lines are identical. Try the statement:

CONTRAST 'equivalent?' SECTOR 1, SES*SECTOR 1 / E;

Exercise 3: Writing ESTIMATE and CONTRAST statements is more challenging when a predictor is a CLASS variable. Turn SECTOR into a class variable by adding its name to the CLASS statement:

CLASS SCHOOL SECTOR;

and try to obtain the estimates and perform the tests in Exercises 1 and 2. Thinking in terms of rows of the X matrix is very useful to get the right specification of coefficients.
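The arithmetic behind an ESTIMATE statement is just a linear combination of the fixed-effect estimates. The generic Python sketch below (all numbers are invented, not SAS output for the HS data) computes L'γ̂, its standard error from the covariance matrix of γ̂, and the Wald t-ratio, i.e. the Estimate, Standard Error and t Value columns:

```python
import numpy as np

# invented fixed-effect estimates for (Intercept, SES, SECTOR, SES*SECTOR)
gamma_hat = np.array([12.0, 2.2, 1.5, -1.3])

# invented covariance matrix of gamma_hat (diagonal for simplicity)
V = np.diag([0.04, 0.01, 0.09, 0.04])

# L row for a "sector difference at SES = -1.5": the Catholic and Public
# rows of the X matrix at that SES differ only in SECTOR and SES*SECTOR
L = np.array([0.0, 0.0, 1.0, -1.5])

est = L @ gamma_hat              # the 'Estimate' column
se = np.sqrt(L @ V @ L)          # the 'Standard Error' column
t = est / se                     # the 't Value' column
```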

8 A second look at multilevel models

We will look more deeply into multilevel models to see what our estimators are really estimating. We will learn about the potential bias of estimators when between-cluster and within-cluster effects are different, and we will see the role of contextual variables (e.g. cluster averages of inner variables) and within-cluster residuals (i.e. inner variables centered by cluster means).

8.1 What is a mixed model really estimating?

See Handout M, which provides an introduction to the problem of bias of the mixed model parameter estimate when the between-subject and within-subject effects of a predictor are different. The problem can be solved in part by introducing a contextual variable.

8.2 Var(Y|X) and T

There are two ways of visualizing T: directly as Var(u_j), or through its effect on the shape of the variance of Y as a function of the predictor X. This relationship is more than a mathematical curiosity. It is important for an understanding of the interpretation of the components of T and for an understanding of the meaning of hypotheses involving T. We will consider the case of a single predictor and the case of two predictors.

8.2.1 Random slope model

It is easy to visualize how a random sample of lines would produce a graph with larger variance as we move to extreme values of X.
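This picture can be made concrete numerically. The Python sketch below (variance components invented for illustration) evaluates Var(Y|X) = [1 X] T [1 X]' + σ² on a grid of X values and locates its minimum, which falls at X = −τ01/τ11 with minimum value σ² + τ00 − τ01²/τ11:

```python
import numpy as np

# invented variance components
tau00, tau01, tau11 = 4.0, 1.2, 0.9
sigma2 = 1.0
T = np.array([[tau00, tau01],
              [tau01, tau11]])

def var_y_given_x(x):
    z = np.array([1.0, x])
    return z @ T @ z + sigma2     # tau00 + 2*tau01*x + tau11*x^2 + sigma2

xs = np.linspace(-10, 10, 2001)
v = np.array([var_y_given_x(x) for x in xs])

x_min_formula = -tau01 / tau11                        # location of minimum
v_min_formula = sigma2 + tau00 - tau01 ** 2 / tau11   # minimum variance
x_min_grid = xs[np.argmin(v)]                         # grid-search check
```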

The model is:

    Y_ij = γ00 + γ10 X_ij + u0j + u1j X_ij + r_ij

with

    Var [ u0j ]  =  T  =  [ τ00  τ01 ]
        [ u1j ]           [ τ01  τ11 ]

and Var(r_ij) = σ². Thus

    Var(Y | X) = [1  X] T [1  X]' + σ² = τ00 + 2 τ01 X + τ11 X² + σ²

which is a quadratic function in X. Quadratic functions have straightforward properties that we can exploit to understand the meaning of T. Using calculus we can show that the minimum variance occurs at X = −τ01/τ11. The minimum is σ² + τ00 − τ01²/τ11. An alternative way of showing this avoids calculus by using the technique of "completing the square":

    Var(Y | X) = σ² + τ00 + 2 τ01 X + τ11 X²
               = σ² + (τ00 − τ01²/τ11) + τ11 (X + τ01/τ11)²

The squared factor of τ11, (X + τ01/τ11)², achieves its minimum of 0 when X = −τ01/τ11.

The default form of T for the RANDOM statement in SAS, e.g.

RANDOM INTERCEPT SES / SUBJECT = SCHOOL;

is a diagonal matrix, that is a matrix in which τ01 = 0. Now forcing τ01 = 0 is equivalent to forcing the model to have minimum variance over the origin of X, which is in general a quite arbitrary assumption. Since the origin of X is usually arbitrary, the assumption τ01 = 0 violates a principle of invariance: the fitted model should not change when something arbitrary is changed. A familiar example of the principle of invariance is the principle of marginality, which says that you shouldn't drop main effects that are included in interactions in the model: the interpretation of these main effects depends on the choice of origin or coding of the other interacting variables. Of course, if 0 is not an arbitrary origin then the principle of invariance is not necessarily violated by considering the possibility that τ01 = 0.

8.2.2 Two random predictors

With two random predictors we can visualize the iso-variance contours in predictor space (see the additional notes entitled "Variance"):

    Var(Y | X) = [1  X1  X2] T [1  X1  X2]' + σ²

We can show that the minimum occurs at:

    [ X1 ]        =  − [ τ11  τ12 ]⁻¹ [ τ01 ]
    [ X2 ]_min         [ τ12  τ22 ]   [ τ02 ]

and the contour ellipses have the same shape as the variance ellipses for a distribution whose variance is the inverse of the slope block of T. If we set τ12 = 0, then the contour ellipses will not be tilted with respect to the X1 and X2 axes. This corresponds to an assumption that the values of β1j are independent of the values of β2j, i.e. the strength of the effect of X1 is not related to the strength of the effect of X2, a possibly reasonable hypothesis. Setting τ12 to 0 also affords an opportunity to reduce the number of parameters in T.

Making T simpler

With even larger random models it becomes interesting to consider T matrices in which the covariances among slopes are set to 0:

    T = [ τ00  τ01  τ02 ;
          τ01  τ11  0   ;
          τ02  0    τ22 ]

8.2.3 Interpreting Chol(T)

A very useful option for the RANDOM statement in PROC MIXED is GC, which prints the lower triangular Choleski root of the T (G in SAS) matrix. For a model with a single random slope we have:

    T = [ τ00  τ01 ;
          τ01  τ11 ]

where √τ00 is the SD of the height of the true regression lines when X = 0, √τ11 is the SD of the slopes and −τ01/τ11 is the value of X where the regression lines have minimum SD.

The lower triangular Choleski root of T is a lower triangular matrix C with non-negative diagonal elements that is a square root of T in the sense that CC' = T. The columns of C are conjugate axes of the variance ellipse. The elements of the matrix are:

    C = [ √τ00         0                 ;
          τ01/√τ00     √(τ11 − τ01²/τ00) ]

whose elements may be interesting but are not highly interpretable. If we fit the model with the intercept last, however, we get more interesting results. If we specify

RANDOM SES INT / GC ... etc.

the T matrix will be

    T = [ τ11  τ01 ;
          τ01  τ00 ]

and the lower-triangular Choleski root:

    C = [ c11  0   ]  =  [ √τ11         0                 ]
        [ c21  c22 ]     [ τ01/√τ11     √(τ00 − τ01²/τ11) ]

is a much more interesting matrix because its diagonal elements have interpretations that are invariant with respect to relocating X:

    √τ11 is the SD of the slopes,

    √(τ00 − τ01²/τ11) is the SD of the height of the true regression lines at the value of X where this SD is minimized.

Less directly interpretable but useful: −c21/c11 = −τ01/τ11 is the value of X where the minimum occurs.

8.2.4 Recentering and balancing the model

If the value of X (or the values of X1 and X2 if there are two random slopes) where the minimum SD of the true regression lines occurs is quite far from the origin, we have seen that this induces a strong correlation in T. If this is large enough, the matrix may be close to singular (a phenomenon analogous to multicollinearity in ordinary regression) and the optimization algorithm can fail to converge, or appear to converge but to a sub-optimal point. Recentering the X's can help considerably with convergence and with the numerical reliability of estimates. Also, if the SDs for different slopes are of different orders of magnitude, rescaling the X's to get similar SDs is desirable.

[See notes on using C to recenter and sphere T.]

[See notes on freeing the RANDOM model from the fixed MODEL: we can center the FIXED model and the RANDOM model independently of each other.]

8.2.5 Random slopes and variance components parametrization

[SEE NOTES]

8.2.6 Testing hypotheses about T

The obvious way of testing a hypothesis about T is to fit the full model and the restricted model and perform a likelihood ratio test to compare the two models. For example, suppose you want to test whether you need a random slope model with one random X effect or whether a simpler random intercept model would be adequate. Fit both models (with the same fixed effects) and record the deviance for each model. The deviance is a synonym for "-2 Log Likelihood" and is shown in the Fit Statistics in the SAS output. Suppose the random intercept model shows:

    Fit Statistics
    -2 Log Likelihood      44.6

and the random slope model shows

    Fit Statistics
    -2 Log Likelihood      49.9

Next we need to figure out the difference in the number of parameters. Since we used the same fixed model, the only difference in parameters comes from the variances of the random model. The smaller model has:

    T = [ τ00 ]

The larger model has:

    T = [ τ00  τ01 ;
          τ10  τ11 ]

and, since T is symmetric with τ01 = τ10, the larger model has two more parameters than the smaller model. The usual procedure for likelihood ratio tests calls for taking the difference of the two deviances and comparing it to a Chi-Squared distribution with q = degrees of freedom = difference in the number of parameters, here q = 2. Using a suitable table for the Chi-Square with two degrees of freedom, we find the p-value.

Although this is a very common approach to testing this kind of hypothesis, it is somewhat flawed. The reason is that the hypothesis τ11 = 0 is at the edge of the parameter space of the full model. (Technically, the null hypothesis is not in the interior of the full model.) The usual asymptotic theory for likelihood ratio tests cannot be relied upon. Intuitively, what happens is that, when τ11 = 0, the fit is as likely to show underdispersion as overdispersion for the between-cluster slope variance. When there is underdispersion, τ̂11 will be 0. Now, τ̂11 contributes to the value of the Chi-Square only when τ̂11 is positive (about 1/2 of the time). The hypothesis τ10 = 0 contributes to the Chi-Square through τ̂10 whether τ̂10 is positive or not. As a result, the distribution of the test statistic under the null hypothesis can be thought of as being a Chi-Square with 1 degree of freedom half the time and a Chi-Square with 2 degrees of freedom the other half. The standard approach above assumes the Chi-Square has 2 degrees of freedom all of the time. A Chi-Square with 1 degree of freedom has a smaller distribution than one with 2, so this has the effect of making the actual distribution of the statistic smaller than its reference distribution. The result is that the p-values computed with the reference distribution will be too large: if the statistic is too small then the probability in the upper tail will be too large. The p-value being too large implies that the results are interpreted as less significant than they ought to be, i.e. the true probability of Type I error is lower than you think it is. You might think this is fine because that means the test is overly conservative. But remember that this also means that the true probability of Type II error is higher than it would be if you used an exact test. A test that is conservative with respect to rejection is liberal with respect to acceptance. This might be seen as a problem since researchers are frequently interested in dropping random slopes to reduce the complexity of the random model.

We can find the correct p-value for the problem of testing whether to add one random term and all its covariances to a random model that already has k terms (intercept and slopes) together with all their covariances, i.e. the full model has:

    T = [ T_k           τ_(·,k+1)   ;
          τ_(·,k+1)'    τ_(k+1,k+1) ]

where T_k is the k × k variance matrix of the smaller model and τ_(·,k+1) = (τ_(1,k+1), ..., τ_(k,k+1))'. The restricted model has

    T = T_k

and provides a test of H0: τ_(1,k+1) = ... = τ_(k,k+1) = τ_(k+1,k+1) = 0. The first k parameters are covariance parameters, which can take values greater than or less than 0, but the last parameter, τ_(k+1,k+1), can only take non-negative values.

The Chi-Squared statistic for the likelihood ratio test has a nominal Chi-Squared distribution with q = k+1 degrees of freedom, but its real distribution under the null hypothesis is that of an equal mixture of a Chi-Square with k+1 degrees of freedom and a Chi-Square with k degrees of freedom. We can adjust for this in two ways. If you have output providing the nominal p-value, you can adjust the p-value downward using the factor shown in the following figure.
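For the common special case of testing a single random slope added to a random intercept model (k = 1: nominally χ² with 2 df, actually an equal mixture of χ²(1) and χ²(2)), the corrected p-value can be computed directly, since the χ² upper tails with 1 and 2 df have simple closed forms. A Python sketch:

```python
import math

def sf_chisq_1(x):
    # upper tail of chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x/2))
    return math.erfc(math.sqrt(x / 2))

def sf_chisq_2(x):
    # upper tail of chi-square with 2 df (an exponential with mean 2)
    return math.exp(-x / 2)

def p_mixture(x):
    # equal mixture of chi-square(1) and chi-square(2) under H0: tau11 = 0
    return 0.5 * sf_chisq_1(x) + 0.5 * sf_chisq_2(x)

x = 5.99                       # roughly the nominal 5% point of chi-square(2)
p_naive = sf_chisq_2(x)        # about 0.05: too large, as argued above
p_corrected = p_mixture(x)     # smaller: the evidence is stronger than nominal
```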

Alternatively, if you have computed the test statistic by taking the difference of two deviances, you can use the table below to find a range for the true p-value:

75 Critical values for mixture of two Chi Squares with df = q and q 1 df e For comparison, this is a table of critical values for the Chi-Square distribution e Examples We will illustrate many of the concepts introduced so far with our subsample of the Bryk and Raudenbush data. First we consider a fixed effects model for the relationship between MATHACH and SES: SAS input: proc glm data = mixed.hs; class school; model mathach = ses school / solution; run; Selected SAS output: 75

76 The GLM Procedure Dependent Variable: MATHACH Sum of Source DF Squares Mean Square F Value Pr > F Model <.0001 Error Corrected Total R Square Coeff Var Root MSE MATHACH Mean Source DF Type I SS Mean Square F Value Pr > F SES <.0001 SCHOOL <.0001 Source DF Type III SS Mean Square F Value Pr > F SES <.0001 SCHOOL <.0001 Standard Parameter Estimate Error t Value Pr > t Intercept B <.0001 SES <.0001 SCHOOL B SCHOOL B We note that the effect of SES is estimated as.3.this is the within-school effect of SES. It is a weighted average of the individual within-school effects of SES. 8.4 Fitting a multilevel model: contextual effects In this section we address two problems: 1. To what extent is the effect of SES an individual effect and to what Extent is it a school effect? Our models so far have 76

77 conflated these two effects.. Can we estimate the individual effect of SES without contaminating our estimate with the school effect? We solve both problems at once by introducing contextual effects. We decompose SES into two components: 1. average SES in each school (SES_MEAN) and. the deviation of a student from the average SES in the school (SES_ADJ). There are other ways of creating these two variables and we will discuss them later. The impact of these two variables lies in each variable providing a control for the other. In a model with these two variables, the coefficient of SES_ADJ measure the individual effect keeping the school effect constant. It can be shown that this is the within effect of a fixed model as the term is used in econometrics. The effect of SES_MEAN keeps SES_ADJ equal to 0, i.e. the school effect compares two schools and two students with constant SES_ADJ which is equivalent to having the individual SES of each student equal to the school SES_MEAN. Thus the effect of SES_MEAN is unconditional while the effect of SES_ADJ is conditional. We will use SAS to go through the steps of creating SES_MEAN and SES_ADJ. If nothing is gained by decomposing SES into SES_MEAN and SES_ADJ, the fitted coefficients for SES_MEAN and SES_ADJ will not differ significantly since, in this case, the original SES fits as well as the two separate variables: EYi... ( Xi X. ) X X i... Thus we can use an ESTIMATE statement to test equality. This test is analogous to the test for a random effects model in econometrics Example We will work through an analysis the HS data including a few extra touches to show some capabilities of SAS. We first define formats for some of the coded variables and summarize the data: These formats allow SAS to show the names of categories in some printed output: proc format; /* formating variables */ value sexfmt 1=female 0=male; value schfmt 1=Catholic 0=Public; 77

78 value minfmt 1=minority 0=non_minority; run; The formats are identified with their respective variables: data hs; set mixed.hs; format female sexfmt. minority minfmt. sector schfmt. ; label ses= social economic status mathach= mathematics achievement ; run; The dataset is sorted by the numerical variable SCHOOL so we can avoid declaring it as a CLASS variable in PROC MIXED: proc sort data=hs; /* sort by school so we can use it as a numeric var */ by school; run; We have a look at the data: proc contents data = hs; run; proc summary data=hs print min q1 median q3 max; var school minority female ses mathach size sector meanses; run; proc means data=hs ; var ses mathach size meanses; class sector ; run; 78

79 proc freq data=hs; tables sector female minority; run; We now try a first run with PROC MIXED deliberately using TYPE = UN in the RANDOM statement. ods output SolutionF=fixest; /*Data set fixest includes fixed estimates from fitted model*/ ods output SolutionR=ranest; /*Data set ranest includes random estimates of coefficients*/ proc mixed data=hs; model mathach = ses sector ses*sector / outp = predictsch outpm = predictpop solution solution ddfm = satterth corrb covb covbi cl; random int ses / sub = school type = un solution G GC GCORR GI ; run; The LOG output gives us a quiet warning: NOTE: Convergence criteria met. NOTE: Estimated G matrix is not positive definite. The output listing for G shows that SAS is very confused: Estimated G Matrix 79

Estimated Inv(G) Matrix

Estimated Chol(G) Matrix

Estimated G Correlation Matrix

The blank for the lower right element of the G matrix is supposed to indicate something close to 0 but, if that is the case, the matrix is not even a variance matrix. This becomes clear when we look at the Inv(G) matrix, which should also be a variance matrix but has a negative diagonal element. Under these circumstances the Chol(G) matrix and the Correlation matrix should not exist, but SAS valiantly reports some numbers for them.

We try again, this time with a better parametrization of the G matrix. We also reverse SES and the INT term in the RANDOM statement:

proc mixed data=hs;
model mathach = ses sector ses*sector /
      outp=predictsch outpm=predictpop
      solution ddfm=satterth corrb covb covbi cl;
random ses int / sub = school type = fa0(2) solution G GC GCORR GI;

run;

We now get a legitimate variance matrix for G but it is singular, i.e. the variation falls entirely on a line. The Choleski root tells us that the point of minimum variance is far from the range of the data.

Estimated G Matrix

Estimated Inv(G) Matrix

Estimated Chol(G) Matrix

Estimated G Correlation Matrix

This suggests that the variability in slopes is very small after accounting for the heterogeneity accounted for by SECTOR. We refit with only a random intercept:

proc mixed data=hs;
model mathach = ses sector ses*sector /
      outp=predictsch outpm=predictpop
      solution ddfm=satterth corrb covb covbi cl;
random int / sub=school type=fa0(1) solution;
run;

We compare the deviances (-2 Res Log Like) of the two models and find an increase of 1.2 for 2 degrees of freedom, so we have not lost anything significant by dropping the two variance parameters. Note that we are comparing two models that do not differ in

their MODEL statements, which allows us to compare deviances even though the models have been fit with the default METHOD = REML 5. The table of estimates of fixed effects shows an interesting story... so far. Everything seems quite significant:

Solution for Fixed Effects
  (Intercept, SES, SECTOR and SES*SECTOR)

We will now consider contextual effects: why do high SES schools do better? Is it the school or is it the kid? So far we haven't made a distinction. Is there an effect of school SES distinct from child SES? Although it doesn't seem likely, they could, in principle, even go in opposite directions! We turn SES into two variables: an aggregate between-school SES and a within-school residual or school-adjusted SES. We first create a data set with one observation per school containing the average SES_MEAN for each school:

proc means data=hs;
var ses;
by school;
output out=new mean=ses_mean;
run;
proc print data = new;
run;

Just to be on the safe side we sort before merging:

proc sort data=new;
by school;

5 Models using METHOD = REML, the default, can be compared only if they do not differ in their MODEL statements, i.e. only the RANDOM and REPEATED models differ. If the models differ in their MODEL statement as well, full likelihood must be used with METHOD = ML.

run;

We now merge the school averages back into the main data set and we create a variable, SES_ADJ, with each student's deviation in SES from the school average:

data hs;
merge hs new;
by school;
ses_adj = ses - ses_mean;
run;

We are ready to run PROC MIXED on the new variables. We add the new inner variable SES_ADJ to the RANDOM model:

proc mixed data = hs cl method = ml;
model mathach = ses_adj ses_mean sector ses_adj*sector ses_mean*sector /
      cl ddfm=satterth;
random ses_adj int / sub = school type = fa0(2) G GC GCORR;
run;

The G matrix is still close to singular and the point of minimum variance is far outside the range of SES:

Estimated Chol(G) Matrix

Estimated G Correlation Matrix

We refit with a random intercept and get the following estimates of fixed effects:

Solution for Fixed Effects

Effect              Pr > |t|
Intercept           <.0001
ses_adj             <.0001
ses_mean            <.0001
SECTOR
ses_adj*SECTOR
ses_mean*SECTOR

This output suggests that SES_MEAN and SES_ADJ have very different effects, at least within the Public school system, which is the reference level for the coding of SECTOR. There is also a suggestion that SECTOR might not be important in this new model. We rerun the model with the following CONTRAST statement:

contrast 'sector' SECTOR 1,
                  ses_adj*SECTOR 1,
                  ses_mean*SECTOR 1;

which allows us to test the hypothesis that all terms involving SECTOR are simultaneously equal to 0. The result is not entirely clear:

Contrast
Label      Num DF    Den DF    F Value    Pr > F
sector

Perhaps the apparent interaction between sector and SES might show up as an interaction between SES_MEAN and SES_ADJ, suggesting that high SES schools have flatter MATHACH/SES slopes than low SES schools. The apparently greater fairness of Catholic schools in the early analyses might have been due to the fact that they tend to have higher values of SES_MEAN. Another possible explanation is that failing to separate the within-school effect of SES from the between-school effect contaminated the estimate of the between-school effect with the lower within-school effect. This is, ironically, the reverse of the concern usually expressed in econometrics. Having a lower coefficient than it should, SES, as a proxy for SES_MEAN, underadjusted for the effect of SES in the between-school model and left variation to be explained by SECTOR. Once SES_MEAN was uncoupled from SES_ADJ, it accounted for the variation that had been attributed to SECTOR. Note also that residual plots show a ceiling effect in MATHACH which would tend to flatten the slope for schools with higher MATHACH. Thus the apparently greater fairness of Catholic schools could be an artifact of measurement.

Exercises: The following questions are left as exercises:

1. Use the ESTIMATE or CONTRAST statement to carry out a formal test of the equality of the coefficients of SES_MEAN and SES_ADJ. This test is an analogue of the Wu-Hausman test in econometrics.
2. Explore the usefulness of a SES_MEAN*SES_ADJ interaction. How would you interpret it?
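The point of the equality test in exercise 1 can be seen numerically: when the within and between effects differ, a model with raw SES alone returns a blend of the two. A Python sketch with made-up data and hypothetical effects of 2 (within) and 5 (between):

```python
import numpy as np

# Hypothetical two-school data (illustrative values, not the HS data).
school   = np.array([1, 1, 1, 2, 2, 2])
ses      = np.array([-0.5, 0.0, 0.5, 0.5, 1.0, 1.5])
ses_mean = np.array([ses[school == s].mean() for s in school])
ses_adj  = ses - ses_mean

# Suppose the within-school effect is 2 and the between-school effect is 5.
y = 2 * ses_adj + 5 * ses_mean

# The model with the decomposed variables recovers both effects exactly:
X2 = np.column_stack([np.ones(6), ses_adj, ses_mean])
b2, *_ = np.linalg.lstsq(X2, y, rcond=None)
print(b2)        # approximately [0, 2, 5]

# The model with raw SES yields a single blended coefficient:
X1 = np.column_stack([np.ones(6), ses])
b1, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(b1[1])     # 3.8, between the within (2) and between (5) effects
```

If the two coefficients in the first fit were equal, the single-SES model would fit just as well, which is what the ESTIMATE/CONTRAST test checks.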

9 Longitudinal Data

The structure of longitudinal data shares much with that of multilevel data. Turn the school into a subject and the students in the school into times or occasions on which the subject is measured and, except for the fact that measurements are ordered in time, the structures are essentially the same. Mixed models offer only one of a number of ways of approaching the analysis of longitudinal data, but mixed models are extensions of some of the traditional ways and provide more flexibility. We will briefly review traditional methods and then discuss the use of mixed models for longitudinal data. The best method depends on a number of factors:

The number of subjects: if the number is very small it might not be feasible to treat subjects as random. The analysis then focuses on an anecdotal description of the observed subjects and does not generalize to the population of subjects. Statistical models play a role in analyzing the structure of responses within subjects, especially if the subjects were measured on a large number of occasions. This situation is typical of research in psychotherapy where a few subjects are observed over a long time. With long series and few subjects, time series analysis can be an appropriate method.

The number of occasions and whether the timing of occasions is fixed or variable from subject to subject. With a small number (<10?) of fixed occasions and with the same within-subject design for each subject, the traditional methods of repeated measures analysis can be used. There are two common repeated measures models: univariate repeated measures and multivariate repeated measures. In univariate repeated measures models the random variability from subject to subject is (almost) equivalent to a random intercept mixed model. In multivariate repeated measures, no constraints are placed on the intercorrelations between occasions.
These two models can be thought of as extreme cases: a model with only one covariance parameter and a model with all possible covariance parameters. These two models turn out to be special cases of mixed models, which provide the ability to specify a whole range of models between these extremes: models that allow simpler covariance structures based on the structure of the data and on the fact that, having been observed in time, observations that are close in time are often expected to be more strongly correlated than observations that are farther apart in time. Mixed model methods are particularly well suited to data with many subjects and variable within-subject designs, either due to variable occasions or to time-varying covariates. We will look at four data examples to illustrate the breadth of applications of mixed models to longitudinal data:

The classical Potthoff and Roy (1964) data on jaw sizes of 27 children (16 boys and 11 girls) at ages 8, 10, 12 and 14. This is a simple yet interesting example of longitudinal data and helps understand the major concepts in a simple context.

PSID: Panel Study of Income Dynamics: An excerpt from the PSID conducted by ISR at the University of Michigan at Ann Arbor will illustrate the application of methods on a survey data set with fixed occasions and time-varying covariates.

IQ recovery of post-coma patients: Long-term non-linear model with highly variable timing for occasions.

Migraine diaries: A dichotomous dependent variable with a non-linear long-term response model including seasonal periodic components and long-term non-stationarity.

The four data sets can be downloaded as self-extracting files from the course web site. After downloading the files into a convenient directory in Microsoft Windows, double click on the file icon to extract the SAS ssd file. Then, from SAS, define the directory as a library. In the following code, it is assumed that the name of the SAS library is MIXED. The SAS data set names for the four data sets above are: PR, PSID, IQ and MIGRAINE, respectively.

Figure 15: Potthoff and Roy (1964) data on growth of jaw sizes (Y plotted against Age, one line per child) for 16 boys and 11 girls. Note possibly anomalous case M 20.
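Figure 15 also invites the derived-variables reading mentioned in the introduction: fit one least-squares line per child and compare the per-child slopes across genders. A Python sketch with made-up jaw sizes for two girls and two boys (not the Potthoff and Roy values):

```python
import numpy as np

# One least-squares line per child, then compare slopes by gender.
# Made-up jaw sizes for two girls and two boys (illustrative only).
ages = np.array([8.0, 10.0, 12.0, 14.0])
children = {
    ("F", 1): np.array([20.0, 21.0, 21.5, 23.0]),
    ("F", 2): np.array([21.0, 22.0, 23.0, 23.5]),
    ("M", 1): np.array([24.0, 25.0, 27.0, 28.0]),
    ("M", 2): np.array([23.0, 25.5, 26.0, 29.0]),
}

X = np.column_stack([np.ones(4), ages])
slopes = {k: np.linalg.lstsq(X, y, rcond=None)[0][1] for k, y in children.items()}

girls = np.mean([s for (g, _), s in slopes.items() if g == "F"])
boys  = np.mean([s for (g, _), s in slopes.items() if g == "M"])
print(girls, boys)   # average growth rate per year, by gender
```

The mixed models discussed below replace this two-stage summary with a single model for all children at once.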

Figure 16: Recovery of verbal IQ (viq) after coma, plotted against years post coma.

Figure 17: Recovery of verbal IQ (viq) post coma (first four years), plotted against months post coma.

9.1 The basic model

We can think of longitudinal data as multilevel data with t (for a time index) replacing the within-cluster index i. We could keep j for subjects but, to be consistent with many authors such as Snijders and Bosker (1999), we will switch from j to i. We also switch the order of the indices: Y_it denotes subject (cluster) i observed on occasion t. Not only do we change notation, we also change the order of the indices: the cluster index comes first. The other difference is that at least one of the X variables will be time or a function of time. A simple model for the Potthoff and Roy data with a linear trend in time that varies randomly from child to child would look just like a multilevel model with the following level 1 model:

Y_it = β_0i + β_1i X_it + ε_it

where i ranges over the 27 subjects, t is the time index: t = 1, 2, 3, 4, and X_it takes on the actual years: 8, 10, 12, 14. Now we come to the main departure from multilevel models: what to assume about ε_it. It no longer seems reasonable to unquestioningly assume that the ε_it's are independent as t varies for a given i. Observations close in time might be more strongly related than observations that are farther apart. (Note that this relationship does not necessarily lead to a positive correlation.) The Laird-Ware form of the model for a single child is:

Y_i = X_i β_i + ε_i = X_i γ + Z_i u_i + ε_i

with Z_i = X_i. With the matrices written out, we get:

[ Y_i1 ]   [ 1  X_i1 ]              [ ε_i1 ]
[ Y_i2 ] = [ 1  X_i2 ] [ β_0i ]  +  [ ε_i2 ]
[ Y_i3 ]   [ 1  X_i3 ] [ β_1i ]     [ ε_i3 ]
[ Y_i4 ]   [ 1  X_i4 ]              [ ε_i4 ]

           [ 1  X_i1 ]           [ 1  X_i1 ]              [ ε_i1 ]
         = [ 1  X_i2 ] [ γ_0 ] + [ 1  X_i2 ] [ u_0i ]  +  [ ε_i2 ]
           [ 1  X_i3 ] [ γ_1 ]   [ 1  X_i3 ] [ u_1i ]     [ ε_i3 ]
           [ 1  X_i4 ]           [ 1  X_i4 ]              [ ε_i4 ]

with

T = Var [ u_0i ]
        [ u_1i ]

With a standard multilevel model we would have:

Σ = Var(ε_i) = σ² I

but this is often not realistic with data ordered in time. A model that allows autocorrelation, with a form called AR(1) or autoregressive of order 1, would have the following model for Σ:

Σ = Var(ε_i) = σ² ×
[ 1    ρ    ρ²   ρ³ ]
[ ρ    1    ρ    ρ² ]
[ ρ²   ρ    1    ρ  ]
[ ρ³   ρ²   ρ    1  ]

Here we need to be careful because our model is identified only through the marginal variance:

V = Z T Z' + Σ

We now have 5 variance parameters in the model (τ_00, τ_01, τ_11, σ² and ρ) and 10 variance estimates. It might seem that identifiability should not be a problem, but we need to be on guard for possible near collinearities. We can't add arbitrary complexity to Σ without recognizing that the estimation of the parameters in both T and Σ takes place together. Writing out the V matrix we see that we don't have 10 independent variance estimates. If we centre the time variable and write out V, the upper triangle of the variance matrix is:

v_11  v_12  v_13  v_14
      v_22  v_23  v_24
            v_33  v_34
                  v_44

Heuristically, to estimate the variance parameters, we need to estimate V and then back-solve for a set of variance parameters that reproduce V. The transformation is not linear since it involves different powers of ρ, but we can consider the Jacobian

J = ∂v / ∂θ'

where v = (v_11, v_12, ..., v_44)' is the vector of the 10 distinct elements of V and θ = (τ_00, τ_01, τ_11, σ², ρ)'.

The information on θ is related to the Grammian matrix J'J of the Jacobian matrix J, whose normalized form has a condition index of over 1,000 for values of ρ around .6.

The fact that the variance parameters of this model cannot be estimated accurately individually does not mean that it is not a good model for the purpose of estimating fixed effects. To get a reasonable estimate of the autocorrelation, ρ, longer series are needed for at least some subjects.

As mentioned earlier, both univariate and multivariate repeated measures can be represented as extreme cases of mixed models. Univariate repeated measures are equivalent to a random intercept model, which generates a compound symmetric V matrix. In this model X generates a model for time, generally a saturated model (i.e. equivalent to a categorical model) for time. Z consists of a column of 1s. Using a cubic polynomial model for the centered Potthoff and Roy data:

Y_i = X γ + 1 u_0i + ε_i

with Var(u_0i) = τ_00 and Var(ε_i) = σ² I. Thus:

V = 1 τ_00 1' + σ² I

A drawback of this model is that it assumes the same covariance between pairs of observations whether they are close or far apart in time. Note that the random slopes model (without autocorrelation) may seem better in this regard. Setting τ_01 = 0, which is the same as assuming that the point of minimum variance is at the origin, gives:

V = τ_00 U + τ_11 x x' + σ² I

where U is a matrix of 1's and x is the column of centered times. Suitable values of τ_00, τ_01, τ_11 and σ² may approximate an AR(1) correlation structure, but the estimation of τ_00, τ_01 and τ_11 in the model is driven by the variability of the true regression lines between subjects. Any resulting correlation within subjects is an artifact and does not necessarily reflect the true value of autocorrelation in the population.

Using a random intercept and an AR(1) parameter uses one fewer parameter than (137) and yields better identification for ρ:

V = 1 τ_00 1' + σ² ×
[ 1    ρ    ρ²   ρ³ ]
[ ρ    1    ρ    ρ² ]
[ ρ²   ρ    1    ρ  ]
[ ρ³   ρ²   ρ    1  ]

The MANOVA repeated measures design is saturated for V. It can be represented as a mixed model through T or through Σ. For example, if X yields a saturated model, we could have:

Y_i = X γ + Z u_i + ε_i

which, with Z = I, yields

Y_i1 = (Xγ)_1 + u_0i + ε_i1
Y_i2 = (Xγ)_2 + u_1i + ε_i2
Y_i3 = (Xγ)_3 + u_2i + ε_i3
Y_i4 = (Xγ)_4 + u_3i + ε_i4

and

V = Var(Y_i) = Φ + σ² I

where Φ = Var(u_i) is free to be any variance matrix. However, σ² is not identifiable, and fitting this model using software for mixed models requires the ability to constrain σ² to be equal to 0 or some very small quantity. The other approach is to use a saturated model for Σ and no random part, yielding

Y_i = X γ + ε_i,   ε_i = (ε_i1, ε_i2, ε_i3, ε_i4)'

with

V = Σ

which is equivalent in form and substance to a MANOVA repeated measures model. We can therefore think of mixed models as extensions of univariate and multivariate repeated measures allowing models with variance matrices of intermediate complexity. They also allow one to fit univariate and multivariate repeated measures models to incomplete data, yielding valid estimates provided the missing cases are missing at random.

We fit these models by specifying both the Φ and the Σ matrices. In SAS the structure of the Σ matrix is controlled by the REPEATED statement. The options are listed in the SAS help documentation and a summary can be found below starting on page 13. Some of the most important options to specify the TYPE of covariance structure are:

Structure    Description              Parms       (i,j)th element
AR(1)        Autoregressive           2           σ² ρ^|i-j|
ARMA(1,1)    ARMA(1,1)                3           σ² γ ρ^(|i-j|-1) for i ≠ j; σ² for i = j
ARH(1)       Heterog. AR(1)           T+1         σ_i σ_j ρ^|i-j|
ANTE(1)      Ante-dependence          2T-1        σ_i σ_j ∏ ρ_k  (k between min(i,j) and max(i,j)-1)
FA0(T)       Choleski unstructured    T(T+1)/2    Σ λ_ik λ_jk  (k = 1 to min(i,j))
UN           Unstructured             T(T+1)/2    σ_ij
CS           Compound Symmetry        2           σ_1 + σ² δ_ij
SP(POW)(t)   Spatial covariance 6     2           σ² ρ^|t_i - t_j|
             structure implementing
             a continuous AR(1)
             model with respect to
             the time variable t

6 Spatial covariance structures are designed for geographic applications where the correlation between observations is a function of their spatial distance. This model applied to the one-dimensional time variable produces a continuous AR(1) model.
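The (i,j)th-element formulas in the table translate directly into matrices. A Python sketch (the values of σ², σ_1, ρ and the times are made up, and this is not SAS output):

```python
import numpy as np

# Building some of the tabled covariance structures element-wise.
sigma2, rho = 1.0, 0.5
T = 4

# AR(1): (i,j)th element sigma^2 * rho^|i-j|
ar1 = sigma2 * rho ** np.abs(np.subtract.outer(np.arange(T), np.arange(T)))

# CS (compound symmetry): sigma_1 off the diagonal, sigma_1 + sigma^2 on it
sigma1 = 0.6
cs = np.full((T, T), sigma1) + sigma2 * np.eye(T)

# SP(POW)(t): sigma^2 * rho^|t_i - t_j| using the actual times, e.g. 1, 2, 5, 7
times = np.array([1.0, 2.0, 5.0, 7.0])
sp_pow = sigma2 * rho ** np.abs(np.subtract.outer(times, times))

print(ar1[0, 3])     # rho**3 = 0.125
print(sp_pow[0, 2])  # rho**|1-5| = rho**4 = 0.0625
```

Note how SP(POW) reduces to AR(1) when the times are equally spaced one unit apart, which is why it serves as a continuous-time AR(1).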

Note that δ_ij represents the Kronecker delta, which is equal to 1 when i = j and 0 if i ≠ j.

AR(1) is used when the dependence over time follows an autoregressive process of order 1, which means that the random error for each observation is equal to a constant times the previous error plus a new error (innovation). ARMA combines an autoregressive process with a moving average process. It can be used to model situations where the observed error is composed of an underlying hidden AR(1) process plus independent errors. ARH(1) is similar to AR(1) except that it allows observations on each occasion to have different variances. The correlation structure is identical to AR(1). ANTE(1) is similar to ARH(1) except that the autocorrelation between adjoining occasions can vary from one adjoining pair to another. The autocorrelation between pairs that are more than one unit of time apart is determined by the autocorrelations between intermediate adjoining pairs.

A model that accommodates highly variable times from subject to subject is a continuous AR(1) model that can be obtained with the spatial covariance structure SP(POW)(t), where t is the name of the time variable.

FA0(T), where T is the number of occasions per subject, produces a Choleski parametrization which may be superior to an unstructured (UN) model since FA0(T) imposes a positive semi-definite structure for Σ. This approach, like UN, only works when the set of times is the same for each subject. Both FA0(T) and UN can accommodate some missing occasions for some subjects. Methods for doing this are discussed in the section on using a REPEATED statement with missing occasions below.

CS produces a T matrix with a compound symmetry structure which is (almost) identical to the matrix obtained with a random intercept model. One subtle difference is that CS can allow negative covariances, which would not be possible under SAS's positivity constraints for variances with a random intercept model.

The covariance structures for some of the types listed above:

AR(1):
σ² ×
[ 1    ρ    ρ²   ρ³ ]
[ ρ    1    ρ    ρ² ]
[ ρ²   ρ    1    ρ  ]
[ ρ³   ρ²   ρ    1  ]

ARMA(1,1):
σ² ×
[ 1     γ     γρ    γρ² ]
[ γ     1     γ     γρ  ]
[ γρ    γ     1     γ   ]
[ γρ²   γρ    γ     1   ]

ARH(1):
[ σ_1²        σ_1σ_2 ρ    σ_1σ_3 ρ²   σ_1σ_4 ρ³ ]
[ σ_2σ_1 ρ    σ_2²        σ_2σ_3 ρ    σ_2σ_4 ρ² ]
[ σ_3σ_1 ρ²   σ_3σ_2 ρ    σ_3²        σ_3σ_4 ρ  ]
[ σ_4σ_1 ρ³   σ_4σ_2 ρ²   σ_4σ_3 ρ    σ_4²      ]

ANTE(1):
[ σ_1²              σ_1σ_2 ρ_1       σ_1σ_3 ρ_1ρ_2    σ_1σ_4 ρ_1ρ_2ρ_3 ]
[ σ_2σ_1 ρ_1        σ_2²             σ_2σ_3 ρ_2       σ_2σ_4 ρ_2ρ_3    ]
[ σ_3σ_1 ρ_1ρ_2     σ_3σ_2 ρ_2       σ_3²             σ_3σ_4 ρ_3       ]
[ σ_4σ_1 ρ_1ρ_2ρ_3  σ_4σ_2 ρ_2ρ_3    σ_4σ_3 ρ_3       σ_4²             ]

SP(POW)(time), supposing a subject with times 1, 2, 5 and 7 (footnote 7):
σ² ×
[ 1    ρ    ρ⁴   ρ⁶ ]
[ ρ    1    ρ³   ρ⁵ ]
[ ρ⁴   ρ³   1    ρ² ]
[ ρ⁶   ρ⁵   ρ²   1  ]

9.2 Analyzing longitudinal data

9.2.1 Classical or Mixed models

A comment from a SAS FAQ comparing PROC MIXED with PROC GLM:

7 Note that the times and the number of times, hence the indices, can change from subject to subject, but ρ and σ² have the same value.

You can use PROC GLM or PROC MIXED in SAS to perform repeated measures ANOVA. Each procedure has strengths and weaknesses; one nice MIXED feature is its ability to perform comparisons involving within- and between-subjects factors in the same contrast. PROC GLM uses separate matrices for between-subjects effects versus within-subjects effects. This is not a problem if you are interested in between-subjects effects or within-subjects effects, but it can present complications if you want to generate contrasts that cross levels of both between- and within-subjects effects simultaneously. For this reason, this FAQ illustrates the more general contrast approach offered by PROC MIXED 8.

One problem in using both approaches is that the input data must be in a different form for each. An FAQ addresses the transformation from the classical form to that needed for PROC MIXED 9:

Question: I've run a repeated measures ANOVA using the SAS GLM procedure. My dataset has a single between-subjects grouping factor with two levels and I have four dependent variables that comprise my repeated measures effect. For follow-up analyses using the MIXED procedure, I need to rearrange my dataset so that there is a single dependent variable column and then a second column that refers to the measurement occasion of the dependent variable (e.g., 1, 2, 3, or 4). What's the best way for me to rearrange my dataset using SAS?

Answer: There are a number of ways you can use SAS to transform your data from multivariate to univariate form. Here is one approach you can use.

DATA one ;
INFILE cards ;
INPUT a b1 b2 b3 b4 ;
subid + 1 ;
CARDS ;

8, 9 From a SAS FAQ at the University of Texas.

102 ; RUN ; PROC SORT DATA = one ; BY a subid ; RUN ; PROC TRANSPOSE DATA=one OUT=two NAME=measure PREFIX=y_all_; VAR b1 b4 ; BY a subid ; RUN ; This example features a between-subects grouping factor a and four within-subects dependent variables, b1 through b4. The SAS syntax shown above first creates a counter variable called subid to denote subect ID number using the syntax SUBJID+1. The program then sorts the dataset by subect ID number within each level of group using the SORT procedure. The TRANSPOSE procedure then produces a single dependent variable, y_all_1, a character variable called measure that indicates the measurement occasion of the dependent variable, and it retains the correct group specification a. Notice that the TRANSPOSE procedure appends a number to the dependent variable name, which is specified by the PREFIX keyword. The prefix y_all_ is oined to the number 1 to create the new dependent variable, y_all_1. The new single dependent variable, the measure variable, and the group variable a are written to the temporary SAS dataset work.two by the TRANSPOSE procedure. 9.3 Pothoff and Roy A good illustration of a simple longitudinal model is provided by data in Potthoff and Roy (1964) consisting of measurements on the growth of the aw for 11 girls and 16 boys at ages 8, 10, 1, and 14. Since the data are complete and balanced, we can choose from many methods for their analysis. The following shows a SAS data step creating the data set. The data are entered in wide form (one row per subect) and then reshaped into long form (one row per occasion). In general this process is more complicated because all time-varying variables have to be treated like the dependent y variables in the example. (STATA has a function that does this very easily). Note that the following data set is already available in long form as MIXED.PR if you downloaded the files described on page 86 above. 10

data prwide;
input Person Gender $ y1 y2 y3 y4;
datalines;
... (27 data lines: Person, Gender (F for the 11 girls, M for the 16 boys) and the four jaw measurements y1-y4)
;

data mixed.pr; /* converting wide to long */

set prwide;
y=y1; Age=8;  output;
y=y2; Age=10; output;
y=y3; Age=12; output;
y=y4; Age=14; output;
drop y1-y4;
run;

Perhaps the best way to visualize the data is to plot Y against Age for each subject (Figure 15 on page 88). The plot immediately reveals at least one anomaly but we will not exclude it in the analysis. We will fit five models to these data, which we will use to estimate two features of the growth process: 1. the difference in the rate of growth between boys and girls and 2. the difference in size at age 14. The five models we will use are:

1. Univariate ANOVA repeated measures
2. MANOVA repeated measures
3. Random intercept with auto-correlation
4. Random slope model without auto-correlation
5. Random slope model with auto-correlation

Actually, the last two are exercises!

Univariate ANOVA

The univariate ANOVA repeated measures analysis is a random intercept analysis. We can fit an arbitrary growth curve by fitting a polynomial of degree one less than the number of time periods. If the higher order coefficients are not significant we drop them and fit a linear model. We will assume that this work has already been done and that fitting a linear model is adequate.

The random intercept model is specified as follows. Note the construction of the ESTIMATE statement because Gender is a CLASS variable:

Variable            INT   Gender=F   Gender=M   Age   Gender=F*Age   Gender=M*Age
Male at Age 14       1       0          1       14         0              14
Female at Age 14     1       1          0       14        14               0
Difference (F-M)     0       1         -1        0        14             -14

PROC MIXED generates dummy variables for Gender, one for Females and one for Males. These variables are collinear with the INTERCEPT, so when it comes to estimation (see Solution for Fixed Effects below) the estimated coefficient for the second dummy variable is forced to 0. However, we must take both dummy variables into account when we build ESTIMATE and CONTRAST statements. Partial output:

ods html;
ods graphics on; /* experimental in SAS v. 9.2; produces diagnostics */
proc mixed data = mixed.pr method = ml covtest;
class Person Gender;
model y = Gender Age Gender*Age /
      residual influence ( effect = person ) /* diagnostics in SAS v. 9 */
      s;
random intercept / type=un subject=person g gc v vcorr;
estimate 'gap at 14' Gender 1 -1 Age*Gender 14 -14;
run;

Estimated G Matrix

Estimated Chol(G) Matrix

Covariance Parameter Estimates
  (Cov Parm FA(1,1) with Subject Person, and Residual; both with Pr Z < .0001)

Fit Statistics
  (-2 Log Likelihood; AIC, AICC and BIC, smaller is better)

Estimated V Matrix for Person 1

Estimated V Correlation Matrix for Person 1

Null Model Likelihood Ratio Test
  (Pr > ChiSq < .0001)

Solution for Fixed Effects
  (Intercept and Age with Pr > |t| < .0001; the coefficients for the last
  level of Gender and of Age*Gender are set to 0)

Estimate

Label        Estimate    Standard Error    DF    t Value    Pr > |t|
gap at 14

To test the hypothesis that the two genders have the same slope we consider the p-value for the interaction between Age and Gender (0.0130), and to test the difference in sizes at age 14 we use the ESTIMATE output showing the estimated difference and its p-value. Note the variance matrix for person 1: it has constant values off the diagonal. The correlation matrix has a constant value off-diagonal.

Regression diagnostics are produced in SAS version 9.2. You must specify ODS HTML; ODS GRAPHICS ON; before invoking PROC MIXED, and you must use the INFLUENCE and/or the RESIDUAL options in the MODEL statement. The INFLUENCE option can be used to see the effect of dropping an occasion or an entire subject. To drop occasions, use INFLUENCE alone; to drop subjects, use INFLUENCE ( EFFECT = subject-variable ). Often it will be desirable to get diagnostics for both. As far as I can see this requires running PROC MIXED twice. The following output shows diagnostics for deleting entire subjects. The influential case most clearly identified is subject 24, who has the lowest starting value and the steepest growth. Subject 20, with zigzagging growth, is influential but not, generally, as much. Try rerunning the code with the following options: INFLUENCE ( EFFECT = PERSON ITER = 3 ). This will also produce output showing influence on the covariance structure.
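The idea behind INFLUENCE ( EFFECT = person ), refit with each subject deleted and see how far the estimates move, can be mimicked with ordinary least squares. A Python sketch with made-up data for three children (PROC MIXED's actual influence statistics are likelihood-based, not this simple OLS stand-in):

```python
import numpy as np

# Leave-one-subject-out influence: refit without each subject in turn.
ages = np.array([8.0, 10.0, 12.0, 14.0])
subjects = {                 # hypothetical jaw sizes for three children
    1: np.array([20.0, 21.0, 22.0, 23.0]),   # steady growth
    2: np.array([21.0, 22.0, 23.0, 24.0]),   # steady growth
    3: np.array([16.0, 25.0, 16.0, 25.0]),   # zigzag, like case M 20
}

def slope(ids):
    """Pooled OLS slope of y on age for the listed subjects."""
    x = np.concatenate([ages for i in ids])
    y = np.concatenate([subjects[i] for i in ids])
    X = np.column_stack([np.ones(x.size), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

full = slope([1, 2, 3])
for drop in subjects:
    kept = [i for i in subjects if i != drop]
    print(drop, round(slope(kept) - full, 3))  # change in slope when dropped
```

Dropping the zigzagging child moves the pooled slope the most, which is the kind of case the influence plots flag.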


More information

Biostatistics Workshop Longitudinal Data Analysis. Session 4 GARRETT FITZMAURICE

Biostatistics Workshop Longitudinal Data Analysis. Session 4 GARRETT FITZMAURICE Biostatistics Workshop 2008 Longitudinal Data Analysis Session 4 GARRETT FITZMAURICE Harvard University 1 LINEAR MIXED EFFECTS MODELS Motivating Example: Influence of Menarche on Changes in Body Fat Prospective

More information

Generalized Linear Models for Non-Normal Data

Generalized Linear Models for Non-Normal Data Generalized Linear Models for Non-Normal Data Today s Class: 3 parts of a generalized model Models for binary outcomes Complications for generalized multivariate or multilevel models SPLH 861: Lecture

More information

36-720: Linear Mixed Models

36-720: Linear Mixed Models 36-720: Linear Mixed Models Brian Junker October 8, 2007 Review: Linear Mixed Models (LMM s) Bayesian Analogues Facilities in R Computational Notes Predictors and Residuals Examples [Related to Christensen

More information

A Introduction to Matrix Algebra and the Multivariate Normal Distribution

A Introduction to Matrix Algebra and the Multivariate Normal Distribution A Introduction to Matrix Algebra and the Multivariate Normal Distribution PRE 905: Multivariate Analysis Spring 2014 Lecture 6 PRE 905: Lecture 7 Matrix Algebra and the MVN Distribution Today s Class An

More information

Tutorial 6: Tutorial on Translating between GLIMMPSE Power Analysis and Data Analysis. Acknowledgements:

Tutorial 6: Tutorial on Translating between GLIMMPSE Power Analysis and Data Analysis. Acknowledgements: Tutorial 6: Tutorial on Translating between GLIMMPSE Power Analysis and Data Analysis Anna E. Barón, Keith E. Muller, Sarah M. Kreidler, and Deborah H. Glueck Acknowledgements: The project was supported

More information

36-309/749 Experimental Design for Behavioral and Social Sciences. Dec 1, 2015 Lecture 11: Mixed Models (HLMs)

36-309/749 Experimental Design for Behavioral and Social Sciences. Dec 1, 2015 Lecture 11: Mixed Models (HLMs) 36-309/749 Experimental Design for Behavioral and Social Sciences Dec 1, 2015 Lecture 11: Mixed Models (HLMs) Independent Errors Assumption An error is the deviation of an individual observed outcome (DV)

More information

Random Effects. Edps/Psych/Stat 587. Carolyn J. Anderson. Fall Department of Educational Psychology. university of illinois at urbana-champaign

Random Effects. Edps/Psych/Stat 587. Carolyn J. Anderson. Fall Department of Educational Psychology. university of illinois at urbana-champaign Random Effects Edps/Psych/Stat 587 Carolyn J. Anderson Department of Educational Psychology I L L I N O I S university of illinois at urbana-champaign Fall 2012 Outline Introduction Empirical Bayes inference

More information

LINEAR MULTILEVEL MODELS. Data are often hierarchical. By this we mean that data contain information

LINEAR MULTILEVEL MODELS. Data are often hierarchical. By this we mean that data contain information LINEAR MULTILEVEL MODELS JAN DE LEEUW ABSTRACT. This is an entry for The Encyclopedia of Statistics in Behavioral Science, to be published by Wiley in 2005. 1. HIERARCHICAL DATA Data are often hierarchical.

More information

Stat 209 Lab: Linear Mixed Models in R This lab covers the Linear Mixed Models tutorial by John Fox. Lab prepared by Karen Kapur. ɛ i Normal(0, σ 2 )

Stat 209 Lab: Linear Mixed Models in R This lab covers the Linear Mixed Models tutorial by John Fox. Lab prepared by Karen Kapur. ɛ i Normal(0, σ 2 ) Lab 2 STAT209 1/31/13 A complication in doing all this is that the package nlme (lme) is supplanted by the new and improved lme4 (lmer); both are widely used so I try to do both tracks in separate Rogosa

More information

Estimation and Centering

Estimation and Centering Estimation and Centering PSYED 3486 Feifei Ye University of Pittsburgh Main Topics Estimating the level-1 coefficients for a particular unit Reading: R&B, Chapter 3 (p85-94) Centering-Location of X Reading

More information

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds Chapter 6 Logistic Regression In logistic regression, there is a categorical response variables, often coded 1=Yes and 0=No. Many important phenomena fit this framework. The patient survives the operation,

More information

Technical Appendix C: Methods

Technical Appendix C: Methods Technical Appendix C: Methods As not all readers may be familiar with the multilevel analytical methods used in this study, a brief note helps to clarify the techniques. The general theory developed in

More information

INTRODUCTION TO MULTILEVEL MODELLING FOR REPEATED MEASURES DATA. Belfast 9 th June to 10 th June, 2011

INTRODUCTION TO MULTILEVEL MODELLING FOR REPEATED MEASURES DATA. Belfast 9 th June to 10 th June, 2011 INTRODUCTION TO MULTILEVEL MODELLING FOR REPEATED MEASURES DATA Belfast 9 th June to 10 th June, 2011 Dr James J Brown Southampton Statistical Sciences Research Institute (UoS) ADMIN Research Centre (IoE

More information

Repeated Measures ANOVA Multivariate ANOVA and Their Relationship to Linear Mixed Models

Repeated Measures ANOVA Multivariate ANOVA and Their Relationship to Linear Mixed Models Repeated Measures ANOVA Multivariate ANOVA and Their Relationship to Linear Mixed Models EPSY 905: Multivariate Analysis Spring 2016 Lecture #12 April 20, 2016 EPSY 905: RM ANOVA, MANOVA, and Mixed Models

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math.

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math. Regression, part II I. What does it all mean? A) Notice that so far all we ve done is math. 1) One can calculate the Least Squares Regression Line for anything, regardless of any assumptions. 2) But, if

More information

Hypothesis Testing for Var-Cov Components

Hypothesis Testing for Var-Cov Components Hypothesis Testing for Var-Cov Components When the specification of coefficients as fixed, random or non-randomly varying is considered, a null hypothesis of the form is considered, where Additional output

More information

A Re-Introduction to General Linear Models

A Re-Introduction to General Linear Models A Re-Introduction to General Linear Models Today s Class: Big picture overview Why we are using restricted maximum likelihood within MIXED instead of least squares within GLM Linear model interpretation

More information

Longitudinal Data Analysis Using SAS Paul D. Allison, Ph.D. Upcoming Seminar: October 13-14, 2017, Boston, Massachusetts

Longitudinal Data Analysis Using SAS Paul D. Allison, Ph.D. Upcoming Seminar: October 13-14, 2017, Boston, Massachusetts Longitudinal Data Analysis Using SAS Paul D. Allison, Ph.D. Upcoming Seminar: October 13-14, 217, Boston, Massachusetts Outline 1. Opportunities and challenges of panel data. a. Data requirements b. Control

More information

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model EPSY 905: Multivariate Analysis Lecture 1 20 January 2016 EPSY 905: Lecture 1 -

More information

Model Estimation Example

Model Estimation Example Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions

More information

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture!

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture! Hierarchical Generalized Linear Models ERSH 8990 REMS Seminar on HLM Last Lecture! Hierarchical Generalized Linear Models Introduction to generalized models Models for binary outcomes Interpreting parameter

More information

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation Patrick Breheny February 8 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/27 Introduction Basic idea Standardization Large-scale testing is, of course, a big area and we could keep talking

More information

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 Lecture 2: Linear Models Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector

More information

Multilevel Analysis, with Extensions

Multilevel Analysis, with Extensions May 26, 2010 We start by reviewing the research on multilevel analysis that has been done in psychometrics and educational statistics, roughly since 1985. The canonical reference (at least I hope so) is

More information

Ronald Heck Week 14 1 EDEP 768E: Seminar in Categorical Data Modeling (F2012) Nov. 17, 2012

Ronald Heck Week 14 1 EDEP 768E: Seminar in Categorical Data Modeling (F2012) Nov. 17, 2012 Ronald Heck Week 14 1 From Single Level to Multilevel Categorical Models This week we develop a two-level model to examine the event probability for an ordinal response variable with three categories (persist

More information

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 1: August 22, 2012

More information

A Brief and Friendly Introduction to Mixed-Effects Models in Linguistics

A Brief and Friendly Introduction to Mixed-Effects Models in Linguistics A Brief and Friendly Introduction to Mixed-Effects Models in Linguistics Cluster-specific parameters ( random effects ) Σb Parameters governing inter-cluster variability b1 b2 bm x11 x1n1 x21 x2n2 xm1

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Describing Change over Time: Adding Linear Trends

Describing Change over Time: Adding Linear Trends Describing Change over Time: Adding Linear Trends Longitudinal Data Analysis Workshop Section 7 University of Georgia: Institute for Interdisciplinary Research in Education and Human Development Section

More information

Introduction to Random Effects of Time and Model Estimation

Introduction to Random Effects of Time and Model Estimation Introduction to Random Effects of Time and Model Estimation Today s Class: The Big Picture Multilevel model notation Fixed vs. random effects of time Random intercept vs. random slope models How MLM =

More information

Approximate analysis of covariance in trials in rare diseases, in particular rare cancers

Approximate analysis of covariance in trials in rare diseases, in particular rare cancers Approximate analysis of covariance in trials in rare diseases, in particular rare cancers Stephen Senn (c) Stephen Senn 1 Acknowledgements This work is partly supported by the European Union s 7th Framework

More information

Review of CLDP 944: Multilevel Models for Longitudinal Data

Review of CLDP 944: Multilevel Models for Longitudinal Data Review of CLDP 944: Multilevel Models for Longitudinal Data Topics: Review of general MLM concepts and terminology Model comparisons and significance testing Fixed and random effects of time Significance

More information

Additional Notes: Investigating a Random Slope. When we have fixed level-1 predictors at level 2 we show them like this:

Additional Notes: Investigating a Random Slope. When we have fixed level-1 predictors at level 2 we show them like this: Ron Heck, Summer 01 Seminars 1 Multilevel Regression Models and Their Applications Seminar Additional Notes: Investigating a Random Slope We can begin with Model 3 and add a Random slope parameter. If

More information

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Multilevel Models in Matrix Form Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Today s Lecture Linear models from a matrix perspective An example of how to do

More information

Review of Multiple Regression

Review of Multiple Regression Ronald H. Heck 1 Let s begin with a little review of multiple regression this week. Linear models [e.g., correlation, t-tests, analysis of variance (ANOVA), multiple regression, path analysis, multivariate

More information

MLMED. User Guide. Nicholas J. Rockwood The Ohio State University Beta Version May, 2017

MLMED. User Guide. Nicholas J. Rockwood The Ohio State University Beta Version May, 2017 MLMED User Guide Nicholas J. Rockwood The Ohio State University rockwood.19@osu.edu Beta Version May, 2017 MLmed is a computational macro for SPSS that simplifies the fitting of multilevel mediation and

More information

Exam Applied Statistical Regression. Good Luck!

Exam Applied Statistical Regression. Good Luck! Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.

More information

Statistical Distribution Assumptions of General Linear Models

Statistical Distribution Assumptions of General Linear Models Statistical Distribution Assumptions of General Linear Models Applied Multilevel Models for Cross Sectional Data Lecture 4 ICPSR Summer Workshop University of Colorado Boulder Lecture 4: Statistical Distributions

More information

y response variable x 1, x 2,, x k -- a set of explanatory variables

y response variable x 1, x 2,, x k -- a set of explanatory variables 11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate

More information

Multivariate Regression (Chapter 10)

Multivariate Regression (Chapter 10) Multivariate Regression (Chapter 10) This week we ll cover multivariate regression and maybe a bit of canonical correlation. Today we ll mostly review univariate multivariate regression. With multivariate

More information

Introduction to Matrix Algebra and the Multivariate Normal Distribution

Introduction to Matrix Algebra and the Multivariate Normal Distribution Introduction to Matrix Algebra and the Multivariate Normal Distribution Introduction to Structural Equation Modeling Lecture #2 January 18, 2012 ERSH 8750: Lecture 2 Motivation for Learning the Multivariate

More information

Exploiting TIMSS and PIRLS combined data: multivariate multilevel modelling of student achievement

Exploiting TIMSS and PIRLS combined data: multivariate multilevel modelling of student achievement Exploiting TIMSS and PIRLS combined data: multivariate multilevel modelling of student achievement Second meeting of the FIRB 2012 project Mixture and latent variable models for causal-inference and analysis

More information

Describing Within-Person Change over Time

Describing Within-Person Change over Time Describing Within-Person Change over Time Topics: Multilevel modeling notation and terminology Fixed and random effects of linear time Predicted variances and covariances from random slopes Dependency

More information

Investigating Models with Two or Three Categories

Investigating Models with Two or Three Categories Ronald H. Heck and Lynn N. Tabata 1 Investigating Models with Two or Three Categories For the past few weeks we have been working with discriminant analysis. Let s now see what the same sort of model might

More information

Panel Data. March 2, () Applied Economoetrics: Topic 6 March 2, / 43

Panel Data. March 2, () Applied Economoetrics: Topic 6 March 2, / 43 Panel Data March 2, 212 () Applied Economoetrics: Topic March 2, 212 1 / 43 Overview Many economic applications involve panel data. Panel data has both cross-sectional and time series aspects. Regression

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011) Ron Heck, Fall 2011 1 EDEP 768E: Seminar in Multilevel Modeling rev. January 3, 2012 (see footnote) Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October

More information

Estimation: Problems & Solutions

Estimation: Problems & Solutions Estimation: Problems & Solutions Edps/Psych/Stat 587 Carolyn J. Anderson Department of Educational Psychology c Board of Trustees, University of Illinois Fall 2017 Outline 1. Introduction: Estimation of

More information

Simple, Marginal, and Interaction Effects in General Linear Models: Part 1

Simple, Marginal, and Interaction Effects in General Linear Models: Part 1 Simple, Marginal, and Interaction Effects in General Linear Models: Part 1 PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 2: August 24, 2012 PSYC 943: Lecture 2 Today s Class Centering and

More information

9. Linear Regression and Correlation

9. Linear Regression and Correlation 9. Linear Regression and Correlation Data: y a quantitative response variable x a quantitative explanatory variable (Chap. 8: Recall that both variables were categorical) For example, y = annual income,

More information

Longitudinal Data Analysis Using Stata Paul D. Allison, Ph.D. Upcoming Seminar: May 18-19, 2017, Chicago, Illinois

Longitudinal Data Analysis Using Stata Paul D. Allison, Ph.D. Upcoming Seminar: May 18-19, 2017, Chicago, Illinois Longitudinal Data Analysis Using Stata Paul D. Allison, Ph.D. Upcoming Seminar: May 18-19, 217, Chicago, Illinois Outline 1. Opportunities and challenges of panel data. a. Data requirements b. Control

More information

Generalized Linear Models (GLZ)

Generalized Linear Models (GLZ) Generalized Linear Models (GLZ) Generalized Linear Models (GLZ) are an extension of the linear modeling process that allows models to be fit to data that follow probability distributions other than the

More information

Introducing Generalized Linear Models: Logistic Regression

Introducing Generalized Linear Models: Logistic Regression Ron Heck, Summer 2012 Seminars 1 Multilevel Regression Models and Their Applications Seminar Introducing Generalized Linear Models: Logistic Regression The generalized linear model (GLM) represents and

More information

DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective

DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective Second Edition Scott E. Maxwell Uniuersity of Notre Dame Harold D. Delaney Uniuersity of New Mexico J,t{,.?; LAWRENCE ERLBAUM ASSOCIATES,

More information

Longitudinal and Panel Data: Analysis and Applications for the Social Sciences. Table of Contents

Longitudinal and Panel Data: Analysis and Applications for the Social Sciences. Table of Contents Longitudinal and Panel Data Preface / i Longitudinal and Panel Data: Analysis and Applications for the Social Sciences Table of Contents August, 2003 Table of Contents Preface i vi 1. Introduction 1.1

More information

MLA Software for MultiLevel Analysis of Data with Two Levels

MLA Software for MultiLevel Analysis of Data with Two Levels MLA Software for MultiLevel Analysis of Data with Two Levels User s Guide for Version 4.1 Frank M. T. A. Busing Leiden University, Department of Psychology, P.O. Box 9555, 2300 RB Leiden, The Netherlands.

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

Linear Mixed Models. One-way layout REML. Likelihood. Another perspective. Relationship to classical ideas. Drawbacks.

Linear Mixed Models. One-way layout REML. Likelihood. Another perspective. Relationship to classical ideas. Drawbacks. Linear Mixed Models One-way layout Y = Xβ + Zb + ɛ where X and Z are specified design matrices, β is a vector of fixed effect coefficients, b and ɛ are random, mean zero, Gaussian if needed. Usually think

More information

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals (SW Chapter 5) Outline. The standard error of ˆ. Hypothesis tests concerning β 3. Confidence intervals for β 4. Regression

More information

Regression tree-based diagnostics for linear multilevel models

Regression tree-based diagnostics for linear multilevel models Regression tree-based diagnostics for linear multilevel models Jeffrey S. Simonoff New York University May 11, 2011 Longitudinal and clustered data Panel or longitudinal data, in which we observe many

More information

Technical Appendix C: Methods. Multilevel Regression Models

Technical Appendix C: Methods. Multilevel Regression Models Technical Appendix C: Methods Multilevel Regression Models As not all readers may be familiar with the analytical methods used in this study, a brief note helps to clarify the techniques. The firewall

More information

Modeling the Mean: Response Profiles v. Parametric Curves

Modeling the Mean: Response Profiles v. Parametric Curves Modeling the Mean: Response Profiles v. Parametric Curves Jamie Monogan University of Georgia Escuela de Invierno en Métodos y Análisis de Datos Universidad Católica del Uruguay Jamie Monogan (UGA) Modeling

More information

Longitudinal Data Analysis. with Mixed Models

Longitudinal Data Analysis. with Mixed Models SPIDA 09 Mixed Models with R Longitudinal Data Analysis with Mixed Models Georges Monette 1 June 09 e-mail: georges@yorku.ca web page: http://wiki.math.yorku.ca/spida_09 1 with thanks to many contributors:

More information

Notation Survey. Paul E. Johnson 1 2 MLM 1 / Department of Political Science

Notation Survey. Paul E. Johnson 1 2 MLM 1 / Department of Political Science MLM 1 / 52 Notation Survey Paul E. Johnson 1 2 1 Department of Political Science 2 Center for Research Methods and Data Analysis, University of Kansas 2015 MLM 2 / 52 Outline 1 Raudenbush and Bryk 2 Snijders

More information

Path Analysis. PRE 906: Structural Equation Modeling Lecture #5 February 18, PRE 906, SEM: Lecture 5 - Path Analysis

Path Analysis. PRE 906: Structural Equation Modeling Lecture #5 February 18, PRE 906, SEM: Lecture 5 - Path Analysis Path Analysis PRE 906: Structural Equation Modeling Lecture #5 February 18, 2015 PRE 906, SEM: Lecture 5 - Path Analysis Key Questions for Today s Lecture What distinguishes path models from multivariate

More information

A Non-parametric bootstrap for multilevel models

A Non-parametric bootstrap for multilevel models A Non-parametric bootstrap for multilevel models By James Carpenter London School of Hygiene and ropical Medicine Harvey Goldstein and Jon asbash Institute of Education 1. Introduction Bootstrapping is

More information

Introduction. Dottorato XX ciclo febbraio Outline. A hierarchical structure. Hierarchical structures: type 2. Hierarchical structures: type 1

Introduction. Dottorato XX ciclo febbraio Outline. A hierarchical structure. Hierarchical structures: type 2. Hierarchical structures: type 1 Outline Introduction to multilevel analysis. Introduction >. > Leonardo Grilli 3. Estimation > 4. Software & Books > Email: grilli@ds.unifi.it Web: http://www.ds.unifi.it/grilli/ Department of Statistics

More information

Part 8: GLMs and Hierarchical LMs and GLMs

Part 8: GLMs and Hierarchical LMs and GLMs Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course

More information

Introduction to. Multilevel Analysis

Introduction to. Multilevel Analysis Introduction to Multilevel Analysis Tom Snijders University of Oxford University of Groningen December 2009 Tom AB Snijders Introduction to Multilevel Analysis 1 Multilevel Analysis based on the Hierarchical

More information

Gov 2000: 9. Regression with Two Independent Variables

Gov 2000: 9. Regression with Two Independent Variables Gov 2000: 9. Regression with Two Independent Variables Matthew Blackwell Fall 2016 1 / 62 1. Why Add Variables to a Regression? 2. Adding a Binary Covariate 3. Adding a Continuous Covariate 4. OLS Mechanics

More information

Wiley. Methods and Applications of Linear Models. Regression and the Analysis. of Variance. Third Edition. Ishpeming, Michigan RONALD R.

Wiley. Methods and Applications of Linear Models. Regression and the Analysis. of Variance. Third Edition. Ishpeming, Michigan RONALD R. Methods and Applications of Linear Models Regression and the Analysis of Variance Third Edition RONALD R. HOCKING PenHock Statistical Consultants Ishpeming, Michigan Wiley Contents Preface to the Third

More information

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Multilevel Statistical Models: 3 rd edition, 2003 Contents Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction

More information

Ron Heck, Fall Week 3: Notes Building a Two-Level Model

Ron Heck, Fall Week 3: Notes Building a Two-Level Model Ron Heck, Fall 2011 1 EDEP 768E: Seminar on Multilevel Modeling rev. 9/6/2011@11:27pm Week 3: Notes Building a Two-Level Model We will build a model to explain student math achievement using student-level

More information

Chapter 3 ANALYSIS OF RESPONSE PROFILES

Chapter 3 ANALYSIS OF RESPONSE PROFILES Chapter 3 ANALYSIS OF RESPONSE PROFILES 78 31 Introduction In this chapter we present a method for analysing longitudinal data that imposes minimal structure or restrictions on the mean responses over

More information

Models for Clustered Data

Models for Clustered Data Models for Clustered Data Edps/Psych/Soc 589 Carolyn J Anderson Department of Educational Psychology c Board of Trustees, University of Illinois Spring 2019 Outline Notation NELS88 data Fixed Effects ANOVA

More information

Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation

Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation PRE 905: Multivariate Analysis Spring 2014 Lecture 4 Today s Class The building blocks: The basics of mathematical

More information

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 Lecture 3: Linear Models Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector of observed

More information

A Brief and Friendly Introduction to Mixed-Effects Models in Psycholinguistics

A Brief and Friendly Introduction to Mixed-Effects Models in Psycholinguistics A Brief and Friendly Introduction to Mixed-Effects Models in Psycholinguistics Cluster-specific parameters ( random effects ) Σb Parameters governing inter-cluster variability b1 b2 bm x11 x1n1 x21 x2n2

More information

Models for Clustered Data

Models for Clustered Data Models for Clustered Data Edps/Psych/Stat 587 Carolyn J Anderson Department of Educational Psychology c Board of Trustees, University of Illinois Fall 2017 Outline Notation NELS88 data Fixed Effects ANOVA

More information

Bayesian inference for sample surveys. Roderick Little Module 2: Bayesian models for simple random samples

Bayesian inference for sample surveys. Roderick Little Module 2: Bayesian models for simple random samples Bayesian inference for sample surveys Roderick Little Module : Bayesian models for simple random samples Superpopulation Modeling: Estimating parameters Various principles: least squares, method of moments,

More information

Contents. Preface to Second Edition Preface to First Edition Abbreviations PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1

Contents. Preface to Second Edition Preface to First Edition Abbreviations PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1 Contents Preface to Second Edition Preface to First Edition Abbreviations xv xvii xix PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1 1 The Role of Statistical Methods in Modern Industry and Services

More information

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 Introduction to Generalized Univariate Models: Models for Binary Outcomes EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 EPSY 905: Intro to Generalized In This Lecture A short review

More information