Erasmus Teaching staff mobility INTRODUCTION TO MULTILEVEL MODELLING

1 Erasmus Teaching staff mobility, Konstanz, June 2017
INTRODUCTION TO MULTILEVEL MODELLING
Leonardo Grilli, Dipartimento di Statistica, Informatica, Applicazioni "G. Parenti"

2 Outline
1. Introduction
2. Basics of the two-level linear model
   Case A: no covariates (random effects ANOVA)
   Case B: a single covariate at level 1
   Case C: introduction of a covariate at level 2
3. Between, within and contextual effects
4. Inference
5. Example: political trust in Europe
6. Fixed effects versus random effects
7. Sample size requirements
8. Multilevel logit models for binary responses
9. Software & Books
Multilevel models - L. Grilli

3 Introduction Multilevel structures Basic definitions NELS-88 example

4 Hierarchical structures & multilevel models
[Diagram: two regions (Reg. 1, Reg. 2) at level 2, each containing citizens s1, s2, ... at level 1]
Level 2 (group or cluster): e.g. region, index j
Level 1 (individual): e.g. citizen, index i
$y_{ij} = \alpha + \beta x_{ij} + \gamma w_j + u_j + e_{ij}$
Fixed part: $\alpha + \beta x_{ij} + \gamma w_j$. Random part: $u_j + e_{ij}$.
Error at cluster level $u_j$ (called random effect): it collects unobserved factors at cluster level, with std. dev. $\sigma_u$
Error at subject level $e_{ij}$: it collects unobserved factors at subject level, with std. dev. $\sigma_e$

5 A hierarchical structure
[Diagram: a four-level hierarchy. Level 4: district; level 3: schools 1-2; level 2: classes 1-4; level 1: students s1-s12]
Remark: levels are numbered bottom-up

6 Examples of hierarchical structures
- pupil, class, school
- patient, doctor, hospital
- worker, firm
- individual, family, region
- interviewee, interviewer
Often the sampling design reflects the hierarchical structure (multi-stage sampling), but this is not necessary.
Level 2: cluster, between unit, macro unit
Level 1: individual, within unit, micro unit
In cluster analysis the hierarchical structure is unknown: the aim of the analysis is precisely to discover the clusters.
In multilevel analysis the hierarchical structure (number of clusters, cluster membership) is defined a priori: the aim of the analysis is to understand the relationships within and between clusters.

7 Types of variables
Level 1
  Example: male/female, grade
Level 2 (or contextual):
- Global: feature of the cluster with no corresponding level 1 measure
  Example: public/private school, number of teachers
- Compositional: feature of the cluster obtained through aggregation of level 1 measures (summary of the features of the level 1 units)
  Example: average class size, proportion of females, average grade

8 Levels of the variables
In a two-level setting a level 1 variable has a double index, $X_{ij}$, where
- $j = 1, \ldots, J$ indexes the clusters (level 2 units)
- $i = 1, \ldots, n_j$ indexes the elementary (level 1) units in cluster j
while a level 2 variable has a single, level 2 index: $W_j$.
A level 2 variable is by definition constant within clusters, so its variation is only between clusters.
A level 1 variable has distinct values for the elementary units and, in general, its cluster mean changes from cluster to cluster, so its variation is both within clusters and between clusters:
$X_{ij} = \bar{X}_{.j} + (X_{ij} - \bar{X}_{.j})$
$\mathrm{var}(X_{ij}) = \mathrm{var}(\bar{X}_{.j}) + \mathrm{var}(X_{ij} - \bar{X}_{.j})$
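The between/within split of a level 1 variable can be checked numerically. What follows is a minimal Python sketch on made-up data: the values and the names `clusters` and `decompose` are illustrative assumptions, not part of the NELS-88 file. It verifies that the total sum of squares equals the between-cluster part plus the within-cluster part.

```python
from statistics import mean

# Hypothetical level 1 variable X_ij, stored by cluster j (unbalanced on purpose).
clusters = {
    "A": [1.0, 2.0, 3.0],
    "B": [4.0, 6.0],
    "C": [7.0, 8.0, 9.0, 12.0],
}

def decompose(data):
    """Return (total, between, within) sums of squares of X_ij."""
    values = [x for xs in data.values() for x in xs]
    grand_mean = mean(values)
    total = sum((x - grand_mean) ** 2 for x in values)
    # Between part: each unit carries the deviation of its cluster mean.
    between = sum(len(xs) * (mean(xs) - grand_mean) ** 2 for xs in data.values())
    # Within part: deviations of units from their own cluster mean.
    within = sum((x - mean(xs)) ** 2 for xs in data.values() for x in xs)
    return total, between, within

total_ss, between_ss, within_ss = decompose(clusters)
print(total_ss, between_ss + within_ss)
```

A level 2 variable would give a within part equal to zero, since it is constant inside each cluster.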

9 NELS-88 example
Levels: pupils (level 1); schools (level 2) [schnum]
Response variable Y [math]: score on a math test
Level 1 covariate X [homework]: hours per week spent on math homework
Level 2 covariate W [public]: binary indicator of public vs non-public school (a global level 2 variable)
We consider 10 handpicked schools from the NELS-88 data (Kreft and De Leeuw, Introducing Multilevel Modeling, Sage, 1998)

10 NELS-88 example: data
Here are some records (9 out of 260); each record refers to a pupil.
[Table: sample records with columns schnum, public, math, homework]

11 NELS-88 example: summary statistics
[Table: number of schools by size]
There are 10 schools (level 2 units) of different size (unbalanced design).
The total number of pupils (level 1 units) is 260.
[Table: Obs, Mean, Std. Dev., Min, Max for the variables math, homework, public]

12 Hierarchical structures for repeated measures (multivariate and longitudinal)
So far we have considered hierarchical structures for cross-sectional data with a single measure (response).
Multilevel models can be applied to any hierarchical structure, including data with repeated measures:
- cross-sectional data with a multivariate response
- longitudinal data (panel data)
[Diagram: subject j (level 2) with responses resp1, resp2, resp3 (level 1)]
Bottom-level units (level 1) are different responses of a given statistical unit (level 2).
It is also possible that a hierarchical structure represents both repeated measures and clustering into physical entities, e.g. test scores on reading, math and science for pupils nested into classes: Grilli L., Pennoni F., Rampichini C., Romeo I. (2015) Exploiting TIMSS and PIRLS combined data: multivariate multilevel modelling of student achievement (download via Research Gate).

13 Repeated measures and missing data
[Diagram, multivariate data: subject 1 with item1, item2, item3; subject 2 with item1, item3 (missing response)]
[Diagram, panel data: subject 1 with wave1, wave2, wave3; subject 2 with wave1, wave2 (drop-out)]
Missing data generate unbalanced structures, but this is not a problem: standard estimation methods for multilevel models yield unbiased estimates as long as missingness is non-informative (Little & Rubin's MAR: Missing At Random).

14 Different fields, different names
- Design of experiments: variance components models
- Statistics: mixed models (Harville, 1977), hierarchical linear models (HLM)
- Econometrics: random coefficients models (Swamy 1972), random effects models for panel data
- Biostatistics: mixed models for repeated measures (Laird and Ware, 1982), random effects models
- Educational statistics: multilevel models (Cronbach 1976, Aitkin and Longford 1986)
- Sociology, demography, small area estimation, ...

15 Basics of the two-level linear model Case A no covariates (random effects ANOVA)

16 Random effects ANOVA (RANOVA)
$y_{ij} = \mu + u_j + e_{ij}$
$e_{ij} \overset{iid}{\sim} N(0, \sigma_e^2)$, $u_j \overset{iid}{\sim} N(0, \sigma_u^2)$, $e_{ij} \perp u_j \;\; \forall i, j$
The random variable $u_j$ has J realizations, one for each cluster (level of the factor).
[Figure: cluster means $\mu + u_1$, $\mu + u_2$, $\mu + u_3$ scattered around the overall mean $\mu$]

17 Random effects ANOVA: variances and covariances
$y_{ij} = \mu + u_j + e_{ij}$
$\mathrm{Var}(y_{ij}) = \sigma_u^2 + \sigma_e^2$
$\mathrm{Cov}(y_{ij}, y_{i'j'}) = 0$ if $j \neq j'$; $\mathrm{Cov}(y_{ij}, y_{i'j'}) = \sigma_u^2$ if $j = j'$ and $i \neq i'$
The variance of $y_{ij}$ is decomposed into two components: cluster (between) level + individual (within) level.
Observations belonging to the same cluster are positively correlated.
Remark: the correlation is necessarily positive since it is generated by a shared latent variable $u_j$. It is the same basic idea as in factor models, where $u_j$ is called a factor; indeed the GLLAMM class of Rabe-Hesketh and Skrondal includes both multilevel and factor models as special cases.

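18 Random effects ANOVA: covariance matrix
Example: $J = 2$, $n_1 = 2$, $n_2 = 3$
$$\mathrm{Var}(y) = \begin{pmatrix}
\sigma_u^2 + \sigma_e^2 & \sigma_u^2 & 0 & 0 & 0 \\
\sigma_u^2 & \sigma_u^2 + \sigma_e^2 & 0 & 0 & 0 \\
0 & 0 & \sigma_u^2 + \sigma_e^2 & \sigma_u^2 & \sigma_u^2 \\
0 & 0 & \sigma_u^2 & \sigma_u^2 + \sigma_e^2 & \sigma_u^2 \\
0 & 0 & \sigma_u^2 & \sigma_u^2 & \sigma_u^2 + \sigma_e^2
\end{pmatrix}$$
This is a block diagonal matrix and each block has a compound symmetry structure.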
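To make the block diagonal structure concrete, here is a minimal Python sketch that builds the covariance matrix for arbitrary cluster sizes. The function name `ranova_covariance` and the variance values are assumptions chosen for illustration only.

```python
def ranova_covariance(cluster_sizes, sigma_u2, sigma_e2):
    """Covariance matrix of y under y_ij = mu + u_j + e_ij (nested list)."""
    # Cluster label of every observation, e.g. sizes [2, 3] -> [0, 0, 1, 1, 1].
    labels = [j for j, n in enumerate(cluster_sizes) for _ in range(n)]
    size = len(labels)
    cov = [[0.0] * size for _ in range(size)]
    for r in range(size):
        for c in range(size):
            if labels[r] == labels[c]:
                # Same cluster: shared u_j; the diagonal also gets e_ij.
                cov[r][c] = sigma_u2 + (sigma_e2 if r == c else 0.0)
    return cov

# Hypothetical variance components: sigma_u^2 = 1, sigma_e^2 = 3.
V = ranova_covariance([2, 3], sigma_u2=1.0, sigma_e2=3.0)
for row in V:
    print(row)
```

Entries linking units of different clusters stay at zero, while every block has the same value on the diagonal and the same value off the diagonal (compound symmetry).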

19 Random effects ANOVA: intraclass correlation coefficient
$\rho$ denotes the ICC (Intraclass Correlation Coefficient), also known as VPC (Variance Partitioning Coefficient):
$$\rho = \mathrm{Corr}(y_{ij}, y_{i'j}) = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2} = \frac{\text{cluster variance}}{\text{total variance}}, \qquad \rho \in [0, 1]$$
$\rho$ is a measure of the degree of homogeneity of units belonging to the same cluster.
The double nature of $\rho$ (correlation and variance ratio) does not hold in models with more than 2 levels.
Observations may be dependent, for instance, because they share some common feature, come from some common source, are affected by social interaction, or are arranged spatially or sequentially in time (Kenny and Judd, 1996).
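As a quick illustration of the variance-ratio reading of the ICC, the following Python sketch computes it from two variance components. The function name `icc` and the input values are hypothetical.

```python
def icc(sigma_u2, sigma_e2):
    """Intraclass correlation: share of total variance at cluster level."""
    if sigma_u2 < 0 or sigma_e2 < 0 or sigma_u2 + sigma_e2 == 0:
        raise ValueError("variances must be non-negative and not both zero")
    return sigma_u2 / (sigma_u2 + sigma_e2)

rho = icc(sigma_u2=1.0, sigma_e2=3.0)   # hypothetical variance components
print(rho)
```

The two limiting cases are a zero cluster variance (ICC = 0, no clustering effect) and a zero level 1 variance (ICC = 1, units of the same cluster are identical).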

20 Random effects ANOVA: example with NELS-88 data
Levels: pupils (level 1); schools (level 2)
Response variable Y: score on the final test
$y_{ij} = \mu + u_j + e_{ij}$
[Table: estimates of the intercept (= mean), the level 2 variance and the level 1 variance]
Total variance = level 2 variance + level 1 variance
ICC = 30.54 / total variance
The ICC is the percentage of the variance of math scores that is due to the clustering of pupils into schools.

21 Basics of the two-level linear model Case B a single covariate at level 1

22 Two-level linear model
Let us consider three specifications of increasing complexity:
- Standard regression model (OLS), which is actually single level
- Random intercept model
- Random (intercept and) slope model
Example about school effectiveness (NELS-88 data)
Levels: pupils (level 1); schools (level 2)
Response variable Y [math]: score on a math test
Level 1 covariate X [homework]: hours per week spent on math homework

23 Standard (OLS) regression model
$y_{ij} = \beta_0 + \beta_1 x_{ij} + e_{ij}$, with $e_{ij} \overset{iid}{\sim} N(0, \sigma_e^2)$
The regression line is the same for all clusters: fixed intercept and slope (standard regression model).
Homoscedasticity: $\mathrm{Var}(y_{ij} \mid x_{ij}) = \sigma_e^2$
No correlation, even within clusters: $\mathrm{Cov}(y_{ij}, y_{i'j} \mid x_{ij}, x_{i'j}) = 0$
[Figure: a single regression line of y on x]

24 Random intercept model /1
$y_{ij} = \beta_0 + \beta_1 x_{ij} + u_{0j} + e_{ij} = (\beta_0 + u_{0j}) + \beta_1 x_{ij} + e_{ij}$
$u_{0j} \overset{iid}{\sim} N(0, \sigma_{u0}^2)$, $e_{ij} \overset{iid}{\sim} N(0, \sigma_e^2)$, $e_{ij} \perp u_{0j}$
Each cluster has its own intercept $\beta_0 + u_{0j}$.
Random effect $u_{0j}$: unexplained deviation of the intercept of cluster j from the population mean intercept $\beta_0$.
The Normal distribution for the random effect $u_{0j}$ is the default since it has nice properties and works well in many cases. Other choices are possible, such as a different continuous parametric family or an arbitrary discrete distribution.
[Figure: parallel regression lines of y on x, one per cluster]
Grilli L., Rampichini C. (2015) Specification of random effects in multilevel models: a review. Quality & Quantity (download via Research Gate)

25 Random intercept model /2
The total error is $u_{0j} + e_{ij}$, thus the model has:
- homoscedasticity: $\mathrm{Var}(y_{ij} \mid x_{ij}) = \sigma_{u0}^2 + \sigma_e^2$, where $\sigma_{u0}^2$ is the between-cluster variance and $\sigma_e^2$ is the within-cluster variance
- homogeneous correlation among the responses of the units of the same cluster: $\mathrm{Cov}(y_{ij}, y_{i'j} \mid x_{ij}, x_{i'j}) = \sigma_{u0}^2$
- no correlation among the responses of units of different clusters: $\mathrm{Cov}(y_{ij}, y_{i'j'} \mid x_{ij}, x_{i'j'}) = 0$

26 Random intercept model /3
$y_{ij} = \beta_0 + \beta_1 x_{ij} + u_{0j} + e_{ij} = (\beta_0 + u_{0j}) + \beta_1 x_{ij} + e_{ij}$
$u_{0j} \overset{iid}{\sim} N(0, \sigma_{u0}^2)$, $e_{ij} \overset{iid}{\sim} N(0, \sigma_e^2)$, $e_{ij} \perp u_{0j}$
Homoscedasticity: $\mathrm{Var}(y_{ij} \mid x_{ij}) = \sigma_{u0}^2 + \sigma_e^2$
Equi-correlation within clusters: $\mathrm{Cov}(y_{ij}, y_{i'j} \mid x_{ij}, x_{i'j}) = \sigma_{u0}^2$
The variance of the intercept does not depend on the origin of X (i.e. centering X is irrelevant).
The regression lines are parallel, so the clusters can be ranked.

27 NELS-88 example: random intercept model
Parameter estimates:
- Intercept: mean intercept in the population of schools
- Homework: 2.21, estimated slope for any school
- Residual level 2 variance: 22.50
- Residual level 1 variance
Total residual variance = residual level 2 variance + residual level 1 variance
Residual ICC = 22.50 / total residual variance: the percentage of the residual variance of math scores, after adjusting for homework, that is due to the clustering of pupils into schools
Coverage interval (95%) for the intercepts: mean intercept $\pm\ 1.96 \sqrt{22.50}$ = (35.68, 54.28)
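The coverage interval for the intercepts is the mean intercept plus or minus 1.96 standard deviations of the random intercept. The Python sketch below reproduces the reported interval under one assumption: the residual level 2 variance 22.50 comes from the slide, whereas the mean intercept 44.98 is not shown there and is taken here as the midpoint of the reported interval (35.68, 54.28). The function name `coverage_interval` is illustrative.

```python
from math import sqrt

def coverage_interval(mean_coef, variance, z=1.96):
    """Range expected to contain about 95% of the cluster-specific coefficients."""
    half_width = z * sqrt(variance)
    return mean_coef - half_width, mean_coef + half_width

# 22.50: residual level 2 variance (from the slide).
# 44.98: mean intercept, assumed here as the midpoint of the reported interval.
low, high = coverage_interval(44.98, 22.50)
print(round(low, 2), round(high, 2))
```

The same function applies to random slopes, using the mean slope and the slope variance.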

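28 Random (intercept and) slope model /1
$y_{ij} = \beta_0 + \beta_1 x_{ij} + u_{0j} + u_{1j} x_{ij} + e_{ij} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j}) x_{ij} + e_{ij}$
$$\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \overset{iid}{\sim} N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{u0}^2 & \sigma_{u01} \\ \sigma_{u01} & \sigma_{u1}^2 \end{pmatrix} \right), \qquad e_{ij} \overset{iid}{\sim} N(0, \sigma_e^2), \qquad e_{ij} \perp (u_{0j}, u_{1j})$$
Each cluster has its own intercept $\beta_0 + u_{0j}$ and its own slope $\beta_1 + u_{1j}$.
Random effect $u_{0j}$: unexplained deviation of the intercept of cluster j from the population mean intercept $\beta_0$.
Random effect $u_{1j}$: unexplained deviation of the slope of cluster j from the population mean slope $\beta_1$.
[Figure: regression lines of Y on X with cluster-specific intercepts and slopes]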

29 Random (intercept and) slope model /2
The total error is $u_{0j} + u_{1j} x_{ij} + e_{ij}$, thus the model has:
- heteroscedasticity (quadratic function of x): $\mathrm{Var}(y_{ij} \mid x_{ij}) = \sigma_{u0}^2 + 2\sigma_{u01} x_{ij} + \sigma_{u1}^2 x_{ij}^2 + \sigma_e^2$, where the first three terms are the between-cluster variance and $\sigma_e^2$ is the within-cluster variance
- non-homogeneous covariance among the responses of the units of the same cluster: $\mathrm{Cov}(y_{ij}, y_{i'j} \mid x_{ij}, x_{i'j}) = \sigma_{u0}^2 + \sigma_{u01}(x_{ij} + x_{i'j}) + \sigma_{u1}^2 x_{ij} x_{i'j}$
- zero covariance among the responses of units of different clusters: $\mathrm{Cov}(y_{ij}, y_{i'j'} \mid x_{ij}, x_{i'j'}) = 0$
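The variance and covariance functions of the random slope model are easy to tabulate. The following Python sketch evaluates them at a few covariate values; the variance components are hypothetical (they are not the NELS-88 estimates) and all function names are illustrative. Arguments named `s_*` denote variances and covariances, not standard deviations.

```python
def conditional_variance(x, s_u0, s_u01, s_u1, s_e):
    """Var(y_ij | x_ij = x) in the random slope model (quadratic in x)."""
    return s_u0 + 2 * s_u01 * x + s_u1 * x ** 2 + s_e

def conditional_covariance(x1, x2, s_u0, s_u01, s_u1):
    """Cov between two units of the same cluster with covariate values x1, x2."""
    return s_u0 + s_u01 * (x1 + x2) + s_u1 * x1 * x2

def conditional_icc(x, s_u0, s_u01, s_u1, s_e):
    """Share of Var(y | x) due to the cluster level, for two units sharing x."""
    return conditional_covariance(x, x, s_u0, s_u01, s_u1) / conditional_variance(
        x, s_u0, s_u01, s_u1, s_e
    )

# Hypothetical variance components (not the NELS-88 estimates).
params = dict(s_u0=4.0, s_u01=-1.0, s_u1=1.0, s_e=5.0)
for x in (0.0, 1.0, 2.0, 4.0):
    print(x, conditional_variance(x, **params), round(conditional_icc(x, **params), 3))
```

With a negative intercept-slope covariance the total variance first decreases and then increases in x, so the share of variance at cluster level has no single value.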

30 Random (intercept and) slope model /3
$y_{ij} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j}) x_{ij} + e_{ij}$
The between-cluster variance is a function of x, so the (residual) ICC is a function of x: there is no unique value, and in fact it is called conditional ICC.
What happens to the variances and covariances if we change the origin of x? For example, $x' = x - \mathrm{mean}(x)$:
- the intercept variance and the slope-intercept covariance change (in fact, their values refer to x = 0)
- the slope variance does not change
[Figure: cluster-specific regression lines with the old origin (x = 0) and the new origin (x' = 0) marked on the x axis]
Recommendation n. 1: do not compare the slope variance with the intercept variance (it is just a comparison at x = 0, in addition to the issue of the measurement unit mentioned later).
Recommendation n. 2: the slope-intercept covariance should not be eliminated, i.e. constrained to zero, even if not significant (at $x \neq 0$ it could be significant).
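The effect of a change of origin can be worked out directly. The derivation below is a sketch in the notation of the slides, where $c$ is an added symbol for the shift (for instance $c = \mathrm{mean}(x)$), $x' = x - c$, and the primed quantities are the intercept parameters under the new origin.

```latex
\begin{aligned}
y_{ij} &= (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\,(x'_{ij} + c) + e_{ij} \\
       &= \underbrace{(\beta_0 + c\,\beta_1)}_{\beta_0'}
          + \underbrace{(u_{0j} + c\,u_{1j})}_{u_{0j}'}
          + (\beta_1 + u_{1j})\,x'_{ij} + e_{ij}, \\[4pt]
\operatorname{Var}(u_{0j}') &= \sigma_{u0}^2 + 2c\,\sigma_{u01} + c^2\,\sigma_{u1}^2, \\
\operatorname{Cov}(u_{0j}', u_{1j}) &= \sigma_{u01} + c\,\sigma_{u1}^2, \\
\operatorname{Var}(u_{1j}) &= \sigma_{u1}^2 \quad \text{(unchanged)}.
\end{aligned}
```

In particular, a covariance equal to zero at the old origin becomes $c\,\sigma_{u1}^2 \neq 0$ at the new one, which shows that a zero constraint on the covariance is not invariant to the choice of origin.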

31 Random (intercept and) slope model /4
$y_{ij} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j}) x_{ij} + e_{ij}$
In effectiveness evaluation, multilevel models are used to assess the effectiveness of the clusters (e.g. schools, hospitals).
Unfortunately, when the model has random slopes there is no unique ranking of the clusters: the ranking depends on the value of x.
A practical strategy to communicate the results is to report the rankings at selected values of x.
[Figure: crossing regression lines of clusters A, B and C; the ranking is A > B > C at one end of the x range and C > B > A at the other]
Grilli L. & Rampichini C. (2009) Multilevel models for the evaluation of educational institutions: a review. In: Monari, P.; Bini, M.; Piccolo, D.; Salmaso, L. (Eds.) Statistical Methods for the Evaluation of Educational Services and Quality of Products, Physica-Verlag (download via Research Gate)

32 NELS-88 example
Exploratory analysis: fit one OLS regression for each of the 10 schools.
The OLS lines have markedly different slopes, so we need a random slope model.

33 NELS-88 example: random slope model
Parameter estimates:
- Intercept
- Homework: 2.05, mean slope in the population of schools
- Residual level 2 var/cov: intercept variance, homework variance, intercept-homework covariance (it amounts to a correlation of 0.80)
- Residual level 1 variance
Here there is no unique ICC. We can compute a conditional ICC = 61.81 / (61.81 + level 1 variance), namely conditional on the covariate Homework = 0, which is not very useful.
Coverage interval (95%) for the slopes: $2.05 \pm 1.96 \times$ (std. dev. of the slopes) = (-6.71, 10.81)

34 Slope-intercept correlation
$$\mathrm{corr}\big((\beta_0 + u_{0j}), (\beta_1 + u_{1j})\big) = \mathrm{corr}(u_{0j}, u_{1j}) = \frac{\sigma_{u01}}{\sigma_{u0}\,\sigma_{u1}}$$
The slope-intercept covariance:
- is a parameter to be estimated (in most applications it turns out to be negative)
- depends on the origin of x and changes if x is centered
- should be a free parameter (do not constrain it to zero!)
[Figure: example of negative correlation between the cluster intercepts $\beta_{0j}$ and the cluster slopes $\beta_{1j}$]

35 How to choose the random part
$y_{ij} = \beta_0 + \beta_1 x_{ij} + u_{0j} + u_{1j} x_{ij} + e_{ij}$, with $\Sigma_u = \begin{pmatrix} \sigma_{u0}^2 & \sigma_{u01} \\ \sigma_{u01} & \sigma_{u1}^2 \end{pmatrix}$
Three competing models:
1. standard (OLS) model, no random parameters: $\Sigma_u = 0$
2. random intercept: $\sigma_{u1}^2 = 0$ and $\sigma_{u01} = 0$, i.e. $\Sigma_u = \begin{pmatrix} \sigma_{u0}^2 & 0 \\ 0 & 0 \end{pmatrix}$
3. random intercept and slope
Remark: model 3 is the most general. Models 1 and 2 are special cases corresponding to restrictions on the variance-covariance matrix of the random effects $\Sigma_u$.
The choice can be based on statistical tests, e.g. the Likelihood Ratio Test (see later).
The random intercept model has two fewer parameters than the random slope model.

36 Random coefficient on a binary covariate (level 2 heteroscedasticity)
Usually a random coefficient refers to a continuous covariate ("random slope").
A binary covariate may also have a random coefficient, but the interpretation is different.
Suppose d is a binary covariate (e.g. gender). Since $d_{ij}^2 = d_{ij}$,
$\mathrm{Var}(y_{ij} \mid d_{ij}) = \sigma_{u0}^2 + 2\sigma_{u01} d_{ij} + \sigma_{u1}^2 d_{ij}^2 + \sigma_e^2 = \sigma_{u0}^2 + (2\sigma_{u01} + \sigma_{u1}^2)\, d_{ij} + \sigma_e^2$
Thus the level 2 variance is heteroscedastic:
- $\sigma_{u0}^2$ for $d_{ij} = 0$
- $\sigma_{u0}^2 + 2\sigma_{u01} + \sigma_{u1}^2$ for $d_{ij} = 1$
It is more transparent to specify a model with heteroscedastic random effects, with a variance parameter for d = 0 and another variance parameter for d = 1.

37 Hierarchical equations
A multilevel model can be equivalently defined by a single equation (as usually required by the software) or by a set of hierarchical equations.
Hierarchical equations:
- Level 1 model: $y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + e_{ij}$
- Level 2 models: $\beta_{0j} = \gamma_{00} + u_{0j}$ and $\beta_{1j} = \gamma_{10} + u_{1j}$
Combined model (single equation):
$y_{ij} = \gamma_{00} + \gamma_{10} x_{ij} + u_{0j} + u_{1j} x_{ij} + e_{ij}$
Fixed part: $\gamma_{00} + \gamma_{10} x_{ij}$. Random part: $u_{0j} + u_{1j} x_{ij} + e_{ij}$.

38 Basics of the two-level linear model Case C: introduction of a covariate at level 2

39 The two-level linear model (one covariate at level 1 + one covariate at level 2)
Introduction of level 2 covariates: level 2 covariates represent features of the clusters that are useful to define a model for the level 1 parameters $(\beta_{0j}, \beta_{1j})$ and so reduce the level 2 variances $(\sigma_{u0}^2, \sigma_{u1}^2)$.
Example: W is a binary variable coded 1 = public school; 0 = private school

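40 The two-level linear model (one covariate at level 1 + one covariate at level 2)
Hierarchical equations:
- Level 1 model: $y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + e_{ij}$
- Level 2 models: $\beta_{0j} = \gamma_{00} + \gamma_{01} w_j + u_{0j}$ and $\beta_{1j} = \gamma_{10} + \gamma_{11} w_j + u_{1j}$
Here it becomes clear why the $\gamma$ have a double index.
Combined model (single equation):
$y_{ij} = \gamma_{00} + \gamma_{01} w_j + \gamma_{10} x_{ij} + \gamma_{11} w_j x_{ij} + u_{0j} + u_{1j} x_{ij} + e_{ij}$
Fixed part: $\gamma_{00} + \gamma_{01} w_j + \gamma_{10} x_{ij} + \gamma_{11} w_j x_{ij}$. Random part: $u_{0j} + u_{1j} x_{ij} + e_{ij}$.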

41 Cross-level interaction
The previous model has a cross-level interaction.
In the hierarchical formulation (system of equations) it corresponds to adding the level 2 covariate $w_j$ to the equation for the level 1 coefficient $\beta_{1j}$ of $x_{ij}$.
In the combined formulation (single equation) it corresponds to adding the new covariate $w_j x_{ij}$.
[Diagram: W modifies the effect of X on Y]
It may be the target of research even if it is often not significant. Indeed, estimating a cross-level interaction has high sample size requirements (in particular, a large number of higher-level units).
Aguinis and Culpepper (2015) suggest a preliminary step: computing a new type of ICC to assess the degree of lower-level outcome variance that is attributed to higher-level differences in slope coefficients.
Aguinis H, Culpepper SA (2015) An Expanded Decision-Making Procedure for Examining Cross-Level Interaction Effects With Multilevel Modeling. Organizational Research Methods.

42 Example: random slope model with the covariate public (NO cross-level interaction)
Parameter estimates:
- Intercept: mean intercept of non-public schools
- Homework: 1.94, mean slope of schools (regardless of public/non-public)
- Public: difference in the mean intercept (public vs. non-public)
- Residual level 2 var/cov: intercept variance, homework variance, intercept-homework covariance
- Residual level 1 variance
The effect of homework is assumed to be the same for public and non-public schools.

43 Example: random slope model with the covariate public AND cross-level interaction
Parameter estimates:
- Intercept
- Homework: 1.09, mean slope of non-public schools
- Public
- Homework*Public: 0.95, difference in the mean slope (public vs. non-public)
- Residual level 2 var/cov: intercept variance, homework variance, intercept-homework covariance
- Residual level 1 variance
The mean slope of homework in the population of schools is 1.09 for non-public schools and 1.09 + 0.95 = 2.04 for public schools.

44 Within, between and contextual effects Slopes: between, within and total Centering the covariates The contextual effect

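45 Total, within and between effects
[Table: artificial data with columns j, i, X_ij, X_.j, Y_ij, Y_.j]
[Figure: scatterplot of Y against X showing the within, total and between regression lines]
Example from Snijders & Bosker (2012):
- total regression: $\hat{y}_{ij}$ on $x_{ij}$
- between regression: $\hat{\bar{y}}_{.j}$ on $\bar{x}_{.j}$
- within regression: $\hat{y}_{ij} - \bar{y}_{.j} = 1.00\,(x_{ij} - \bar{x}_{.j})$
- multilevel regression: $\hat{y}_{ij}$ on $\bar{x}_{.j}$ and $(x_{ij} - \bar{x}_{.j})$
The difference between the between and within slopes is the source of the ecological fallacy or aggregation bias (the reason why it is not a good idea to avoid the issue of within-cluster correlation by analysing the cluster means).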
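The three slopes can differ sharply, even in sign. The Python sketch below computes them on small artificial data built for this illustration (they are not the values of the book example): the cluster means lie on a line with slope -1, while inside each cluster y increases with x at slope +1. The helper `ols_slope` and all variable names are illustrative.

```python
from statistics import mean

def ols_slope(xs, ys):
    """Slope of the simple least-squares regression of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Artificial data: 5 clusters of 2 units. Cluster means fall on a line with
# slope -1, while inside each cluster y rises with x at slope +1.
data = {}
for j in range(1, 6):
    x_bar, y_bar = float(j), 8.0 - j
    data[j] = [(x_bar - 1, y_bar - 1), (x_bar + 1, y_bar + 1)]

x_all = [x for units in data.values() for x, _ in units]
y_all = [y for units in data.values() for _, y in units]
x_means = [mean(x for x, _ in units) for units in data.values()]
y_means = [mean(y for _, y in units) for units in data.values()]
# Within deviations: subtract each unit's own cluster mean.
x_dev = [x - mean(u for u, _ in units) for units in data.values() for x, _ in units]
y_dev = [y - mean(v for _, v in units) for units in data.values() for _, y in units]

total_slope = ols_slope(x_all, y_all)
between_slope = ols_slope(x_means, y_means)
within_slope = ols_slope(x_dev, y_dev)
print(total_slope, between_slope, within_slope)
```

The total slope falls between the other two: it is a weighted combination of the between and within slopes, so it describes neither the cluster-level nor the individual-level relationship.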

46 Centering a covariate
A covariate can be centered with respect to:
- a given constant, such as the grand mean: this affects the intercept and, in a random slope model, the intercept variance and the intercept-slope covariance
- the cluster mean (CM centering): if the cluster means are different, the centering varies from cluster to cluster, and this affects the slope (total effect vs. within effect)

47 The Cronbach model
Cronbach model: CM centering & cluster mean
$y_{ij} = \alpha + \beta_{within}(x_{ij} - \bar{x}_{.j}) + \beta_{between}\,\bar{x}_{.j} + u_j + e_{ij}$
Between slope: $\bar{y}_{.j} = \alpha + \beta_{between}\,\bar{x}_{.j} + u_j + \bar{e}_{.j}$
Within slope: $y_{ij} - \bar{y}_{.j} = \beta_{within}(x_{ij} - \bar{x}_{.j}) + (e_{ij} - \bar{e}_{.j})$
Remark: researchers are usually interested in the effect at the individual level, namely the within effect.

48 The contextual model
Contextual model: no CM centering, but cluster mean
$y_{ij} = \alpha + \gamma x_{ij} + \delta \bar{x}_{.j} + u_j + e_{ij}$
Replacing $x_{ij}$ with $(x_{ij} - \bar{x}_{.j}) + \bar{x}_{.j}$ yields
$y_{ij} = \alpha + \gamma (x_{ij} - \bar{x}_{.j}) + (\gamma + \delta)\,\bar{x}_{.j} + u_j + e_{ij}$
This is just a reparameterization of the Cronbach model:
$\gamma = \beta_{within}$, $\delta = \beta_{between} - \beta_{within}$

49 The contextual effect
Compositional variable = cluster-level variable obtained by summarizing the within-cluster distribution of an individual-level variable.
The most important compositional variable is the cluster mean. E.g. if X = prior score, with pupils nested into schools, the school mean of the prior score is a compositional variable measuring the quality of the educational environment (peer effects).
In a model with both the individual variable X and its cluster mean, the slope of the cluster mean is the contextual effect:
$y_{ij} = \ldots + \beta_{within}\, x_{ij} + \delta\, \bar{x}_{.j} + \ldots$, with $\delta = \beta_{between} - \beta_{within}$
It is the additional effect of the school mean of X on Y that is not accounted for by the individual level X (usually X is prior score or Socio-Economic Status).
The estimate of the contextual effect of X will partially encompass the effects of all school level variables that are correlated with X, such as peer influences, school climate, allocation of resources, organizational and structural features of schools.

50 Interpreting the within effect
$y_{ij} = \alpha + \beta_{within}(x_{ij} - \bar{x}_{.j}) + \beta_{between}\,\bar{x}_{.j} + u_j + e_{ij}$
Let X = prior score; Y = final score.
Sam has X = 80 and attends a school with sch_mean(x) = 70.
Within effect: expected Y for another pupil with X = 81 and sch_mean(x) = 70 vs expected Y for Sam:
$E(y_{ij} \mid x_{ij} = 81, \bar{x}_{.j} = 70) - E(y_{ij} \mid x_{ij} = 80, \bar{x}_{.j} = 70)$
$= (\alpha + \beta_{within} \cdot 11 + \beta_{between} \cdot 70) - (\alpha + \beta_{within} \cdot 10 + \beta_{between} \cdot 70) = \beta_{within}$

51 Interpreting the contextual effect
$y_{ij} = \alpha + \beta_{within}(x_{ij} - \bar{x}_{.j}) + \beta_{between}\,\bar{x}_{.j} + u_j + e_{ij}$
Let X = prior score; Y = final score.
Sam has X = 80 and attends a school with sch_mean(x) = 70.
Contextual effect: expected Y for another pupil with X = 80 and sch_mean(x) = 71 vs expected Y for Sam:
$E(y_{ij} \mid x_{ij} = 80, \bar{x}_{.j} = 71) - E(y_{ij} \mid x_{ij} = 80, \bar{x}_{.j} = 70)$
$= (\alpha + \beta_{within} \cdot 9 + \beta_{between} \cdot 71) - (\alpha + \beta_{within} \cdot 10 + \beta_{between} \cdot 70) = \beta_{between} - \beta_{within}$
Contextual effect: you increase the school mean score by 1 and leave the individual score unchanged. This changes the deviation: the pupil to be compared with Sam has a different relative position, namely 9 points above the school mean instead of 10.
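The two comparisons involving Sam can be reproduced with a few lines of Python. In this sketch the coefficient values are hypothetical, chosen only to make the arithmetic visible, and the function name `expected_y` is illustrative; only the fixed part of the Cronbach model is evaluated.

```python
def expected_y(x, school_mean, alpha, beta_within, beta_between):
    """Fixed part of the Cronbach model: CM-centered x plus the cluster mean."""
    return alpha + beta_within * (x - school_mean) + beta_between * school_mean

# Hypothetical coefficients, chosen only to make the arithmetic visible.
coef = dict(alpha=10.0, beta_within=0.5, beta_between=0.8)

sam = expected_y(80, 70, **coef)
# Within effect: same school mean, individual score one point higher.
within_effect = expected_y(81, 70, **coef) - sam
# Contextual effect: same individual score, school mean one point higher.
contextual_effect = expected_y(80, 71, **coef) - sam
print(within_effect, contextual_effect)
```

With these values the within effect equals `beta_within` (0.5) and the contextual effect equals `beta_between - beta_within` (0.3, up to floating-point rounding).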

52 Some model specifications
1) $y_{ij} = \ldots + \beta_{total}\, x_{ij} + \ldots$
   It implicitly assumes that the between and within slopes are identical.
2) $y_{ij} = \ldots + \beta_{within}\, x_{ij} + (\beta_{between} - \beta_{within})\, \bar{x}_{.j} + \ldots$
3) $y_{ij} = \ldots + \beta_{within} (x_{ij} - \bar{x}_{.j}) + \beta_{between}\, \bar{x}_{.j} + \ldots$
   Models 2 and 3 estimate between and within slopes separately (the two models have the same fit).
4) $y_{ij} = \ldots + \beta_{within} (x_{ij} - \bar{x}_{.j}) + \ldots$
   It estimates only the within slope.
In most applications, within and between slopes are quite different, so model 1 is wrong.
In models with many covariates the parsimony principle may suggest disentangling the slopes only for the covariates of main interest.

53 References on centering and within/between/contextual effects
- Snijders TAB and Bosker RJ (2012) Multilevel analysis: An introduction to basic and advanced multilevel modeling. 2nd ed. London: Sage.
- Raudenbush SW and Bryk AS (2002) Hierarchical Linear Models. 2nd ed. Thousand Oaks: Sage.
- Kreft IGG, de Leeuw J and Aiken L (1995) The effect of different forms of centering in hierarchical linear models. Multivariate Behavioral Research.
- Hofmann DA and Gavin MB (1998) Centering Decisions in Hierarchical Linear Models: Implications for Research in Organizations. Journal of Management, 24(5).
- Paccagnella O (2006) Centering or not centering in multilevel models: The Role of the Group Mean and the Assessment of Group Effects. Evaluation Review, 30.
- Enders CK and Tofighi D (2007) Centering Predictor Variables in Cross-Sectional Multilevel Models: A New Look at an Old Issue. Psychological Methods, 12.
- Kelley J, Evans MDR, Lowman J and Lykes V (2017) Group-mean-centering independent variables in multi-level models is dangerous. Quality & Quantity, 51. [warning: they give an incorrect interpretation of the effects!]

54 Inference Maximum Likelihood inference Predicting the random effects (level 2 residuals) Bayesian inference

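55 Parameter estimation
$y_{ij} = \gamma_{00} + \gamma_{01} w_j + \gamma_{10} x_{ij} + \gamma_{11} w_j x_{ij} + u_{0j} + u_{1j} x_{ij} + e_{ij}$
Fixed part: $\gamma_{00} + \gamma_{01} w_j + \gamma_{10} x_{ij} + \gamma_{11} w_j x_{ij}$. Random part: $u_{0j} + u_{1j} x_{ij} + e_{ij}$.
Maximum Likelihood
- step 1: estimation of the fixed parameters $\gamma_{00}, \gamma_{01}, \gamma_{10}, \gamma_{11}$ and of the variance-covariance parameters $\sigma_e^2, \sigma_{u0}^2, \sigma_{u1}^2, \sigma_{u01}$
- step 2: prediction of the random effects $(u_{0j}, u_{1j}),\ j = 1, \ldots, J$
Bayesian inference
- the parameters are random variables with a prior distribution
- parameters and random effects are all random variables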

56 Likelihood
The equation defining a random effects model includes the random effects, but they are not observable (so they cannot appear in the likelihood): the random effects must be integrated out!
- $f(y \mid u, \theta)$: distribution of the responses, conditional on random effects and parameters
- $p(u \mid \theta)$: distribution of the random effects, conditional on parameters
Likelihood: $L(\theta) = \int f(y \mid u, \theta)\, p(u \mid \theta)\, du$
Problem: the integral has an analytical solution only for conjugate distributions (e.g. Normal-Normal, Binomial-Beta, ...)

57 Closed-form likelihood
Multiple random effects require a multivariate distribution: the Normal distribution is preferable (and in fact it is the standard choice in applications).
- Linear model: Normal-Normal (conjugate), so the integral has an analytical solution (closed-form likelihood)
- Non-linear model: e.g. Binomial-Normal (non-conjugate), so the likelihood must be evaluated through approximate integration methods
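For the Normal-Normal case the agreement between the integral and the closed form can be verified numerically. The Python sketch below does so for one cluster of size 2 in the RANOVA model: the random effect is integrated out with a plain trapezoidal rule (a simple stand-in for the quadrature methods used by real software) and the result is compared with the bivariate Normal density with compound symmetry. The data, the parameter values and all function names are hypothetical; `s_u` and `s_e` denote variances.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, var):
    return exp(-0.5 * (x - mean) ** 2 / var) / sqrt(2 * pi * var)

def cluster_lik_numeric(y, mu, s_u, s_e, points=4001, span=8.0):
    """Integrate the random effect out with the trapezoidal rule."""
    lo, hi = -span * sqrt(s_u), span * sqrt(s_u)
    step = (hi - lo) / (points - 1)
    total = 0.0
    for k in range(points):
        u = lo + k * step
        f = normal_pdf(u, 0.0, s_u)            # p(u | theta)
        for obs in y:
            f *= normal_pdf(obs, mu + u, s_e)  # f(y | u, theta)
        total += f * (0.5 if k in (0, points - 1) else 1.0)
    return total * step

def cluster_lik_closed_form(y, mu, s_u, s_e):
    """Bivariate Normal density with compound symmetry (cluster of size 2)."""
    d1, d2 = y[0] - mu, y[1] - mu
    var, cov = s_u + s_e, s_u
    det = var ** 2 - cov ** 2
    quad = (var * (d1 ** 2 + d2 ** 2) - 2 * cov * d1 * d2) / det
    return exp(-0.5 * quad) / (2 * pi * sqrt(det))

y_cluster = [1.2, 0.4]   # hypothetical responses of one cluster
numeric = cluster_lik_numeric(y_cluster, mu=0.5, s_u=1.0, s_e=2.0)
closed = cluster_lik_closed_form(y_cluster, mu=0.5, s_u=1.0, s_e=2.0)
print(numeric, closed)
```

In a non-conjugate model (e.g. a binary response with a Normal random effect) only the numerical route is available, which is why approximate integration is needed there.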

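58 Maximum Likelihood (ML) estimation
$$\hat{\theta}_{ML} = \arg\max_{\theta} L(\theta), \qquad L(\theta) = \int f(y \mid u, \theta)\, p(u \mid \theta)\, du$$
Once the ML estimates of fixed and random parameters have been obtained, the random effects are predicted by Empirical Bayes estimation:
$$p(u \mid y, \hat{\theta}_{ML}) = \frac{f(y \mid u, \hat{\theta}_{ML})\, p(u \mid \hat{\theta}_{ML})}{\int f(y \mid u, \hat{\theta}_{ML})\, p(u \mid \hat{\theta}_{ML})\, du}$$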

59 Properties of ML estimators
Under mild regularity conditions Maximum Likelihood estimators have good asymptotic properties:
- Consistency
- Normality
- Efficiency
Remark: here "asymptotic" requires increasing the number of clusters (increasing the cluster sizes is not enough), so the number of clusters J is the key quantity for asymptotics.

60 REstricted Maximum Likelihood
In multilevel modelling the ML method outlined before is sometimes referred to as Full Information Maximum Likelihood (FIML) to distinguish it from the restricted version (REML).
Full Information Maximum Likelihood (FIML)
- Full likelihood, joint estimation of fixed and random parameters
- Underestimates the variances of random effects since it treats the regression coefficients as known quantities (ignoring degrees of freedom)
REstricted Maximum Likelihood (REML)
- The variances and covariances are estimated by maximizing the restricted likelihood, i.e. the density of the residuals
- The estimators of the variances/covariances are approximately unbiased even in small samples since they account for the degrees of freedom
In a two-level model, REML and FIML yield similar estimates for the level 1 variance, but there may be large differences for level 2 variances if the number of clusters J is small (e.g. with J = 10 clusters FIML estimates of variances may be severely downward biased).
Unless the main aim is the estimation of the variances (e.g. experimental design), FIML is preferred because it has lower sampling variance (and often lower MSE).
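The degrees-of-freedom issue is easiest to see in the simplest special case: a single-level model with only an intercept, where ML divides the residual sum of squares by n and the restricted version divides it by n - 1. The Python sketch below uses made-up data and an illustrative function name; it is not a multilevel fit, only an illustration of the correction.

```python
from statistics import mean

def variance_estimates(y):
    """ML and REML estimates of the variance in the model y_i = mu + e_i."""
    n = len(y)
    rss = sum((v - mean(y)) ** 2 for v in y)
    ml = rss / n          # treats the estimated mean as if it were known
    reml = rss / (n - 1)  # accounts for the one estimated fixed parameter
    return ml, reml

sample = [4.0, 7.0, 5.0, 9.0, 10.0]   # hypothetical data
ml_var, reml_var = variance_estimates(sample)
print(ml_var, reml_var)
```

The gap between the two estimates shrinks as n grows; for level 2 variances the relevant sample size is the number of clusters, which is why the difference matters most when J is small.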

61 Hypothesis testing on the variance-covariance parameters
We consider the tests for the omission of a random effect:
- Case A: random intercept model vs ordinary (OLS) model, $H_0: \sigma_{u0}^2 = 0$
- Case B: random slope model vs random intercept model, $H_0: \sigma_{u1}^2 = \sigma_{u01} = 0$
The Wald test should not be used since the sampling distributions of the estimators of the random parameters are highly asymmetric (unless the number of clusters is huge): it is advisable to use the LRT (Likelihood Ratio Test).
However, the null hypothesis is on the boundary of the parameter space (variances cannot be negative!), so the usual asymptotic results for the LRT do not hold.

62 Testing the level 2 variance: LRT (asymptotic chi-square-bar distribution)
$H_0$: the variance of a random effect is 0 (so the random effect can be eliminated)
$LRT = -2\log L_0 - (-2\log L_1) = 2(\log L_1 - \log L_0)$
$L_0$ = likelihood of the restricted model with a variance at 0
$L_1$ = likelihood of the unrestricted model
Approximate distribution of the LRT under $H_0$: it equals 0 with probability 1/2 and follows a $\chi^2_{\#(restrictions)}$ with probability 1/2
degrees of freedom = #(restrictions) = 1 + #(corresponding covariances)
Practical rule: the p-value must be halved! (otherwise the test is conservative, i.e. the actual probability of type I error is lower than $\alpha$)
With REML the LRT can be used only if the fixed part of the model is unchanged!
This procedure is exact if $H_0$ involves only a variance. Otherwise, if $H_0$ involves both variances and covariances (e.g. testing for a random slope) there is no exact result, but it has been proved that the p-value from the standard chi-square is conservative.
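The practical rule for a single variance (one restriction) can be sketched in Python as follows. The log-likelihood values are hypothetical and the function names are illustrative; the tail probability of a chi-square with 1 degree of freedom is obtained from the standard Normal tail through `math.erfc`.

```python
from math import erfc, sqrt

def chi2_1_sf(x):
    """P(chi-square with 1 df > x), via the standard Normal tail."""
    return erfc(sqrt(x / 2.0))

def lrt_variance_pvalue(loglik_restricted, loglik_unrestricted):
    """Halved p-value for H0: a single random-effect variance equals 0."""
    lrt = 2.0 * (loglik_unrestricted - loglik_restricted)
    lrt = max(lrt, 0.0)   # guard against tiny negative values from rounding
    return lrt, 0.5 * chi2_1_sf(lrt)

# Hypothetical maximized log-likelihoods of the two nested models.
lrt, p = lrt_variance_pvalue(loglik_restricted=-950.0, loglik_unrestricted=-948.0)
print(lrt, round(p, 4))
```

Here the statistic is 4, the standard chi-square p-value would be about 0.0455, and the halved p-value is about 0.0228. This sketch covers only the case of one restriction, i.e. a test on a variance with no associated covariance.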

63 Level 2 residuals after ML estimation (i.e. assigning values to the random effects)
Given ML estimates of fixed and variance-covariance parameters, there are two main ways of computing the level 2 residuals:
- Naïve residual: $\hat{u}_{0j} = \bar{y}_{.j} - \hat{\bar{y}}_{.j} = \bar{y}_{.j} - (\hat{\gamma}_{00} + \hat{\gamma}_{10}\,\bar{x}_{.j} + \hat{\gamma}_{01}\, w_j)$
- Empirical Bayes (EB) residual: mean of the posterior distribution of the random effects (with ML estimates plugged in)
In the RANOVA model the EB residual has a simple formula:
$$\hat{u}_{0j}^{EB} = \frac{\sigma_{u0}^2}{\sigma_{u0}^2 + \sigma_e^2 / n_j}\; \hat{u}_{0j}$$
EB residuals minimize the prediction MSE, so they make better predictions than naïve residuals (an advantage of using random instead of fixed effects).
EB residuals are biased toward 0 (hence they are called shrunken residuals):
- the amount of shrinkage varies from cluster to cluster (given the variances, it is an inverse function of the cluster size)
- in applications it may happen that the shrinkage is negligible for large clusters and substantial for small clusters
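The dependence of the shrinkage on the cluster size can be seen with a short Python sketch of the RANOVA formula. The variance components, the naïve residual and the function names are hypothetical.

```python
def shrinkage_factor(n_j, sigma_u2, sigma_e2):
    """Weight applied to the naive residual of a cluster of size n_j."""
    return sigma_u2 / (sigma_u2 + sigma_e2 / n_j)

def eb_residual(naive_residual, n_j, sigma_u2, sigma_e2):
    """Empirical Bayes (shrunken) residual in the RANOVA model."""
    return shrinkage_factor(n_j, sigma_u2, sigma_e2) * naive_residual

# Hypothetical variance components and a naive residual of 3.0.
for n_j in (2, 10, 50):
    print(n_j, round(eb_residual(3.0, n_j, sigma_u2=1.0, sigma_e2=4.0), 3))
```

The same naïve residual is pulled strongly toward 0 in a cluster of 2 units and left almost unchanged in a cluster of 50 units.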

64 Diagnostics based on the residuals
Many residuals: Level 1: $\hat{e}_{ij}$; Level 2 (Empirical Bayes): $\hat{u}_{0j}, \hat{u}_{1j}, \ldots$
Extension of the techniques used in standard regression
Purposes:
Check the functional form and the distributional assumptions (normality, homoscedasticity, ...)
Look for influential units

65 Inference on the random effects
For any cluster, the random effect collects the unobserved factors: it can be interpreted as propensity, effectiveness etc.
A relevant inferential question: is cluster $j^*$ significantly different from the mean? Since the mean is 0, the question is whether $u_{j^*} \neq 0$
The level 2 residuals are predictions of the corresponding random effects $u_j$: they can be used to make inference

66 Caterpillar plot of the residuals for comparing each cluster with the mean
The residuals are ordered and endowed with 95% confidence bars (± 1.96 × the comparative standard errors): $\hat{u}_j \pm 1.96\, SE(\hat{u}_j)$
The width of the error bar of a given cluster depends on its size
[Figure: caterpillar plot, "Comparisons with the mean": EB predictions of the random effects against their ranks]
There are many clusters significantly above or below the mean
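The numbers behind a caterpillar plot can be sketched as follows, assuming the EB residuals and their comparative standard errors have already been produced by the fitting software. The cluster labels and all numeric values are hypothetical.

```python
def caterpillar(residuals: dict, comparative_se: dict, z: float = 1.96) -> list:
    """Return (cluster, residual, lower, upper, verdict) tuples ordered by residual."""
    rows = []
    for cluster, u_hat in residuals.items():
        half_width = z * comparative_se[cluster]
        lower, upper = u_hat - half_width, u_hat + half_width
        if lower > 0:
            verdict = "above the mean"
        elif upper < 0:
            verdict = "below the mean"
        else:
            verdict = "not different from the mean"
        rows.append((cluster, u_hat, lower, upper, verdict))
    # Ordering by residual gives the rank axis of the plot.
    rows.sort(key=lambda row: row[1])
    return rows

# Hypothetical EB residuals and comparative standard errors.
residuals = {"A": -0.80, "B": 0.10, "C": 0.65}
comparative_se = {"A": 0.20, "B": 0.30, "C": 0.25}
rows = caterpillar(residuals, comparative_se)
for cluster, u_hat, lower, upper, verdict in rows:
    print(f"{cluster}: {u_hat:+.2f} [{lower:+.2f}, {upper:+.2f}] {verdict}")
```

A cluster is flagged only when its whole bar lies on one side of 0, which is the visual rule applied when reading the plot.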

67 Bayesian inference for multilevel models
For Bayesian inference a prior distribution $p(\theta)$ is defined and the joint posterior distribution is used:
$p(u, \theta \mid y) = \dfrac{f(y \mid u, \theta)\, p(u \mid \theta)\, p(\theta)}{\int\!\!\int f(y \mid u, \theta)\, p(u \mid \theta)\, p(\theta)\, du\, d\theta}$
Inference on the parameters: $p(\theta \mid y) = \int p(\theta, u \mid y)\, du$
Inference on the random effects: $p(u \mid y) = \int p(\theta, u \mid y)\, d\theta$

68 Bayesian inference: pros and cons
Even for the linear model it is necessary to use approximate integration algorithms (e.g. Gibbs sampling)
The Bayesian approach has the usual disadvantages, whereas the main advantages here are:
Accurate estimates even for a small number of clusters
In complex multilevel models the estimates properly account for all the uncertainty
Bayesian methods yield accurate estimates of the variance components and confidence intervals with appropriate coverage even in highly complex models, where ML methods show a poor performance

69 Bayesian inference: references
Seltzer M.H., Wong W.H., Bryk A.S. (1996) Bayesian Analysis in Applications of Hierarchical Models: Issues and Methods. Journal of Educational and Behavioral Statistics 21(2).
Bian, Guarui (2002) Bayesian Estimates in a One-Way ANOVA Random Effects Model. Australian & New Zealand Journal of Statistics 44(1): 99-108. (simulation: 3 clusters of size 2)
Browne W.J., Draper D. (2006) A comparison of Bayesian and likelihood-based methods for fitting multilevel models. Bayesian Analysis, 1.
Good news: Bayesian inference can be quick (without losing accuracy) by using INLA (Integrated Nested Laplace Approximation), which approximates the posterior distribution without Monte Carlo re-sampling:
Fong Y., Rue H., Wakefield J. (2010) Bayesian inference for Generalized Linear Mixed Models. Biostatistics, 11.
Metelli S., Grilli L., Rampichini C. (2015) Bayesian estimation with INLA for logistic multilevel models. J. Statistical Computation and Simulation 85(13). (download via Research Gate)

70 Example: political trust in Europe (European Social Survey, ESS)

71 The European Social Survey
The European Social Survey (ESS) is a biennial multi-country survey covering over 30 nations. The first round was fielded in 2002/2003, the fifth in 2010/2011.
The project is funded jointly by the European Commission, the European Science Foundation and academic funding bodies in each participating country.
The questionnaire includes two main sections, each consisting of approximately 120 items: a 'core' module which remains relatively constant from round to round, plus two or more 'rotating' modules, repeated at intervals.
The core module aims to monitor change and continuity in a wide range of social variables, including media use; social and public trust; political interest and participation; socio-political orientations; governance and efficacy; moral, political and social values; social exclusion; national, ethnic and religious allegiances; well-being; health and security; human values; demographics and socio-economics.
[Figure: map of participating countries]

72 ESS Multilevel Data resource (ESS MD ) The ESS MD resource contains data at the individual level (the survey data from ESS respondents), the country level and the regional level. It incorporates contextual variables on a number of themes, including demography, geography, economy, health, education and crime. The entire dataset is available for download at: The ESS has published an ESS EduNet training package on Multilevel Models which provides guidance about this type of analysis: ESS Data Website The ESS MD was partly funded by the Descartes Research Prize awarded to the ESS in Multilevel models - L. Grilli 72

73 Political trust in Europe
Political trust can be explained by a combination of individual, cultural and institutional factors. For a deeper understanding of political trust, see Kenneth Newton's ESS EduNet module on Social and Political trust.
A summary scale of trust in the electoral system is computed as the mean of B4, B7 and B8 of the following set of questions on trust in institutions: "Using this card, please tell me on a scale from 0-10 how much you personally trust each of the institutions I read out; 0 means you do not trust an institution at all, and 10 means you have complete trust. Firstly, trust in the country's parliament (B4)... political parties (B8)... politicians (B7)."
Political trust ranges from 0 to 10, with average = 3.41 and variance = 5.21 (s.d. = 2.28).

74 A two-level model for political trust in EU
Sample: subjects clustered in 26 countries
Average number of subjects within country: 1953 (min 1083 Cyprus, max 3031 Germany)
Explanatory variables measured at the subject level (level 1): age, gender, educational level; coping well on present income; satisfied with the state of the economy, satisfied with the government, satisfied with democracy; most people can be trusted
Explanatory variables measured at country level (level 2): Human Development Index (2007); Corruption Perception Index (high=clean, 2008)

75 Step 0 null model: variance decomposition
$y_{ij} = \mu + u_j + e_{ij}$
Variance components: subject level variance $\sigma^2_e$, between-country variance $\sigma^2_u$, ICC $= \sigma^2_u / (\sigma^2_u + \sigma^2_e)$
Country means (CM) of POLTRUST: min 1.59 (Greece), max 5.49 (Sweden), average 3.51 (all countries)
The null model indicates that about 24% of the variation in political trust stems from variation between countries: a multilevel model is needed
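The variance decomposition of the null model can be sketched in Python as the share of the total variance lying between clusters. The two variance values below are hypothetical, chosen only so that the ICC equals the 24% reported for the political trust example; they are not the estimates from the slides.

```python
def icc(sigma2_u: float, sigma2_e: float) -> float:
    """Intraclass correlation: between-cluster share of the total variance."""
    return sigma2_u / (sigma2_u + sigma2_e)

# Hypothetical variance components giving an ICC of about 24%.
between_country, subject_level = 1.2, 3.8
share = icc(between_country, subject_level)
print(f"ICC = {share:.2f}")
```

The same function applies after covariates are added, using the residual variances of the larger model.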

76 General strategy of model building Step 0: null model to estimate the variance components Step 1: build the model at level 1, i.e. introduce level 1 covariates and test for significance, non-linear effects, interactions Step 2: build the model at level 2, i.e. introduce level 2 covariates and cross-level interactions Multilevel models - L. Grilli 76

77 Steps 1 and 2: explaining variances
Step 1: add subject covariates. The model with subject level covariates explains about 39% of the individual variance and 86% of the country level variance. The ICC reduces to 7%.
Step 2: add country covariates. Level 1 variance is unchanged, as expected; level 2 variance reduces further. The ICC is now 3%.
ICC by model: NULL 24%; subject covariates 7%; subject + country covariates 3%

78 Country level variables
UN's Human Development Index (HDI): operationalised using objective data on income (GDP per capita), education (literacy and enrolment rates), and health (estimated life expectancy at birth); average=0.92, min=0.79 (Ukraine), max=0.97 (Norway)
Corruption Perception Index (CPI) (high=clean): average=6.4, min=2.1 (Russian Federation), max=9.3 (Sweden)
Adding the Human Development Index and the Corruption Perception Index reduces the residual country level variance. This reduction is entirely due to the CPI.

79 Full model: fixed parameters estimates Covariates Estimate s.e. p-value Intercept Age in years, centered at 48 years Age centered squared Female gender Primary education (reference category) Secondary education Tertiary education Coping well on present income Satisfied state of the economy* (average 2.52) Satisfied with the government* (average 2.47) Satisfied with democracy* (average 2.56) Social trust (0-10, average 2.46) Human Development Index (range ) Corruption Perception Index ( high=clean) * How satisfied with (1=extremely unsatisfied, 10=extremely satisfied) Multilevel models - L. Grilli 79

80 Main findings: the full model Age, gender and levels of education have moderate effects on political trust: Trust increases with age, women trust political institutions slightly more than men and those with university level education show more political trust than people with only primary education. Of the performance variables, satisfaction with the government and satisfaction with democracy are the most important ones in relation to political trust. Social trust is important in that the difference between low (0) and high (10) social trust means a difference of one point in political trust. At country level, the CPI is by far the most important variable: an increase of one point on the CPI, i.e. one point less corrupt, increases political trust by 0.2 points. That means that the maximal effect of the CPI is 7*0.218 = 1.53 (remember: CPI range is about 2-9). Multilevel models - L. Grilli 80

81 EB residuals (endowed with 95% confidence bars: ± 1.96 × the comparative standard errors)
[Figure: caterpillar plot of the country EB residuals; the Russian Federation is labelled]
A country with a bar not overlapping zero has a significant residual. E.g., countries in the red box (upper right) show a level of political trust significantly higher than expected on the basis of model covariates, especially the Russian Federation (remember that we adjusted by CPI!)

82 TO BE OR NOT TO BE RANDOM? Fixed effects vs random effects

83 Assumptions on the model errors
Level 2 errors (random effects) are assumed to be $u_j \overset{iid}{\sim} N(0, \sigma^2_u)$
This is a complex assumption that can be decomposed as follows:
a) The level 2 errors are iid across level 2 units
For any level 2 unit, the conditional distribution of $u_j$ given $x_j = (x_{1j}, x_{2j}, \ldots, x_{n_j j})$ has
b) null mean (exogeneity): $E(u_j \mid x_j) = 0$ for any value of $x_j$
c) constant variance (homoscedasticity): $Var(u_j \mid x_j) = \sigma^2_u$ for any value of $x_j$
d) Normal distribution
A similar argument holds for level 1 errors: $e_{ij} \overset{iid}{\sim} N(0, \sigma^2_e)$
As for inference on the regression coefficients, violation of b is more serious than violation of a or c, which are more serious than violation of d

84 Role of the assumptions The assumptions of exogeneity (at level 1 and level 2) are crucial for the correct specification of the regression function (the mean of Y given X): when the errors are correlated with the covariates, the exogeneity assumption is violated and the estimators of the regression coefficients are biased (inconsistent) The other assumptions are not needed for consistent estimation of the regression coefficients: a violation yields biased standard errors (and thus tests with wrong size) but valid standard errors can be obtained with robust (sandwich) methods The other assumptions include: homoscedasticity and normality at both levels independence of level 2 errors across level 2 units (it may be violated when level 2 units are adjacent geographical areas) independence of level 1 errors across level 1 units (often violated in panel data) Multilevel models - L. Grilli 84

85 The fixed effects model (i.e. how to avoid doubtful assumptions on the random effects)
$y_{ij} = \beta x_{ij} + \alpha_j + e_{ij}$
The random effects $u_j$ are replaced by cluster-specific parameters $\alpha_j$ (fixed effects)
No assumptions on the distribution of the cluster effects: there is no need to worry about a possible correlation between effects and covariates (endogeneity), and other violations such as heteroscedasticity and non-normality
The slope β is not the total effect, but the within effect (in panel data the corresponding estimator is known as the fixed effects estimator): in fact, all the between variation is absorbed by the fixed effects $\alpha_j$, so the covariates can only explain the within variation
[Figure: y against x with cluster-specific intercepts $\alpha_1$, $\alpha_2$, $\alpha_3$]

86 NELS-88 example: fixed effects model
The model is fitted by OLS, inserting one dummy variable for each school and centering the covariate homework w.r.t. the grand mean 2.02 (centering is optional: it does not alter the slope, but it gives more interpretable fixed effects)
Estimates: slope (homework) 2.14; one fixed effect for each of the 10 schools (e.g. 47.1 for school 1); residual level 1 variance 64.59
Slope for any school (within effect): in any school one additional hour of homework is associated with 2.14 additional points on average at the math test
School fixed effects: e.g. in school 1 a pupil with a value of homework at the grand mean (2.02) has on average 47.1 points at the math test
Level 2 variation is accounted for by the fixed effects, whilst level 1 variation is accounted for by the level 1 variance (here estimated as 64.59)
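The claim that centering the covariate leaves the slope unchanged but alters the fixed effects can be checked with a small Python sketch. It uses the within (cluster-demeaned) estimator, which by the Frisch-Waugh-Lovell theorem gives the same slope as OLS with one dummy per cluster. The data are hypothetical (two schools, exact linear relation with slope 2), not the NELS-88 data, and the function name is illustrative.

```python
from collections import defaultdict

def fixed_effects_fit(data, center_at=0.0):
    """Within estimator for y = alpha_j + beta * (x - center_at) + e.

    data: iterable of (cluster, x, y). Returns (slope, {cluster: alpha_j}).
    """
    groups = defaultdict(list)
    for cluster, x, y in data:
        groups[cluster].append((x - center_at, y))
    sxy = sxx = 0.0
    means = {}
    for cluster, obs in groups.items():
        x_bar = sum(x for x, _ in obs) / len(obs)
        y_bar = sum(y for _, y in obs) / len(obs)
        means[cluster] = (x_bar, y_bar)
        for x, y in obs:
            # Only within-cluster deviations enter the slope.
            sxy += (x - x_bar) * (y - y_bar)
            sxx += (x - x_bar) ** 2
    slope = sxy / sxx
    effects = {c: y_bar - slope * x_bar for c, (x_bar, y_bar) in means.items()}
    return slope, effects

# Hypothetical data: school s1 has intercept 40, school s2 has intercept 50, slope 2.
data = [("s1", 1, 42), ("s1", 2, 44), ("s1", 3, 46), ("s2", 2, 54), ("s2", 4, 58)]
slope_raw, effects_raw = fixed_effects_fit(data)
slope_centered, effects_centered = fixed_effects_fit(data, center_at=2.0)
print(slope_raw, effects_raw)
print(slope_centered, effects_centered)
```

With centering at 2.0, each fixed effect becomes the expected response of a unit whose covariate equals 2.0, which is the interpretation given for the school effects.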

87 Fixed or random effects: pros and cons
Advantages of fixed effects:
avoiding assumptions on the cluster effects, in particular no need to assume that they are uncorrelated with the covariates
can be used even with very few clusters (e.g. 5 clusters)
Disadvantages of fixed effects:
Impossible to use level 2 covariates, a dramatic limitation in the (frequent) case where the research questions concern the effect of level 2 covariates!
Loss of efficiency (since number of fixed effects = number of clusters)
Inefficient estimation of cluster effects (for example, if a cluster has two units its fixed effect is estimated with just two observations)
Inconsistency of estimators in non-linear models

88 Disadvantages of fixed effects: no cluster-level covariates With fixed effects the level 2 (between subjects) variance is completely explained impossible to add level 2 covariates (in panel data: subject-specific or time-constant covariates) Technically, it is a problem of collinearity: any level 2 covariate is a linear combination of the J dummy variables for the clusters For example, in a model with students nested into schools we cannot add school-level covariates such as public/private In panel data models we cannot add subject-specific covariates such as gender or time-constant covariates (for example, education is usually time-constant since it does not change across the observation interval) Multilevel models - L. Grilli 88

89 Disadvantages of fixed effects: loss of efficiency Number of parameters: fixed effects model: K regr. coeff. + J cluster effects random effects model: K regr. coeff. + 1 cluster variance + 1 individual variance The fixed effects model has (J-2) more parameters implying a less efficient estimator of the regression coefficients (this loss of efficiency is exploited by the Hausman test) There is a loss of efficiency also in the estimation of the level 2 variance Note: the fixed effects model is fitted by OLS, where the residual variance does not enter and thus it is not counted as a parameter random effects: the level 2 variance is a model parameter fixed effects: the level 2 variance is not a model parameter and it is estimated as the variance of the J estimated fixed effects Multilevel models - L. Grilli 89

90 Disadvantages of fixed effects: incidental parameter problem The J fixed effects can be consistently estimated if the cluster size goes to infinity: this is unrealistic in panel data, where the cluster size is the number of occasions In the linear model the inconsistency in the estimation of the J fixed effects does not propagate to the estimation of the regression coefficients Unfortunately, this is true only for the linear model for example, in a fixed effects logit model the regression coefficients are not consistently estimated (solution: conditional logit model) Multilevel models - L. Grilli 90

91 Fixed vs random effects (different point of view)
Fixed effects: inference for the observed clusters
Random effects: inference for a population of clusters (the observed clusters are viewed as a random sample of clusters)
When the observed clusters are a sample of clusters (e.g. a subset of all schools) we should use random effects
When the observed clusters are a census (e.g. all European countries) we can choose between fixed and random effects
Rationale for using random effects in the census case: parsimonious description of the observed variability among clusters; accounting for measurement error; generalizability in space and time

92 Fixed vs random effects in research
Random effects are the standard choice when the main aim of the research is:
evaluating the role of the context, since the inclusion of level 2 covariates is possible only with random effects (this is typical of fields such as Sociology, Demography, Education, Epidemiology)
estimating the effects for small clusters, such as in small-area estimation (random effects make it possible to exploit Empirical Bayes prediction to borrow strength from other clusters)

93 Fixed vs random effects in research /cont. Fixed effects are the standard choice in Econometrics, likely because of the long tradition about the analysis of panel data, where fixed effects are often preferable: Most panel data studies focus on the coefficients of time-varying covariates, which are level 1 covariates the big limitation of preventing the inclusion of level 2 covariates is irrelevant In the analysis of panel data a major concern is that the individual unobserved effects may be correlated with the covariates (endogeneity) this is not a problem with fixed effects Due to the importance attributed to the issue of endogeneity, in Econometrics fixed effects are often used also with cross-sectional data [e.g. Rivkin S.G., Hanushek E.A., Kain J.F. (2005) Teachers, schools, and academic achievement. Econometrica, 73, ] A general discussion: Bell A., Fairbrother M., Jones K. (2016) Fixed and Random effects: making an informed choice. Draft. Download from Research Gate. Multilevel models - L. Grilli 93

94 Sample size requirements

95 How many clusters are needed?
A multilevel model requires a minimum number of clusters in order to get (approximately) unbiased estimates
The minimum depends on the type of model (linear vs non-linear, random intercept vs random slope), on the average size of the clusters and on the true parameter values
In the best situation (a simple linear random intercept model and a dataset with large clusters) the minimum is about 10 (but often we are far from the best situation!)
In case of few clusters: use a fixed effects model
An alternative solution to handle few clusters is to fit a random effects model with Bayesian methods, since they do not rely on asymptotics

96 Target of inference
The sample size requirements are different depending on the target of inference
The least demanding target is to get unbiased point estimates of regression coefficients (in favourable situations 10 clusters of size 2 may be enough)
More clusters (say 30 or 50) are needed for unbiased estimation of variance components and standard errors (especially the standard errors of the variance components)
The requirement is higher for models with random slopes and for non-linear models (e.g. binary responses)

97 Small clusters The cluster size is less relevant than the number of clusters: clusters of size 2 (as in a two-wave panel) are usually ok for a linear random effects model (but not for a logistic random effects model) Even clusters with a single unit can be retained However, small clusters worsen cluster-specific inferences (for example, the precision of Empirical Bayes predictions of random effects) Moreover, data with small clusters carry limited information on the variance-covariance structure at level 2, which should be kept simple (for example, random slopes often turn out to be non-significant) Multilevel models - L. Grilli 97

98 Don't delete clusters with a single unit!
Clusters with a single observation do not contribute to the estimation of the within-cluster variance, but they do contribute to the between-cluster variance and the overall mean
If clusters are deleted, we must assume that they are Missing Completely At Random (MCAR), which is often not plausible
If we leave such clusters in as incomplete cases, the relevant assumption is Missing At Random (MAR), which assumes that missingness may be correlated with other variables in the model but not with the unknown value that we failed to observe: this is a much weaker assumption

99 Degrees of freedom at level 2 degrees of freedom at level 2 = number of clusters minus number of level 2 covariates Given the number of level 2 covariates, the number of clusters should ensure enough degrees of freedom at level 2 For example, we cannot obtain good estimates with 20 clusters and 10 level 2 covariates Multilevel models - L. Grilli 99

100 References on sample size /1
Maas C.J.M., Hox J.J. (2005) Sufficient sample sizes for multilevel modeling. Methodology, 1.
Mathieu J.E., Aguinis H., Culpepper S.A., Chen G. (2012) Understanding and estimating the power to detect cross-level interaction effects in multilevel modeling. Journal of Applied Psychology, 97.
Bell B.A., Morgan G.B., Schoeneberger J.A., Kromrey J.D., Ferron J.M. (2012) How low can you go? An investigation of the influence of sample size and model complexity on point and interval estimates in two-level linear models. Methodology.
Moineddin R., Matheson F.I., Glazier R.H. (2007) A simulation study of sample size for multilevel logistic regression models. BMC Medical Research Methodology, 7.
Austin P.C. (2010) Estimating multilevel logistic regression models when the number of clusters is low: A comparison of different statistical software procedures. Intern. J. Biostatistics, 6(1), art. 16.
Paccagnella O. (2011) Sample Size and Accuracy of Estimates in Multilevel Models. New Simulation Results. Methodology, 7.
Stegmueller D. (2013) Comparison of Frequentist and Bayesian Approaches. American Journal of Political Science.
Schoeneberger J.A. (2015) The impact of sample size and other factors when estimating multilevel logistic models. Journal of Experimental Education.
Bryan M.L., Jenkins S.P. (2015) Multilevel Modelling of Country Effects: A Cautionary Tale. European Sociological Review.

101 References on sample size /2
Elff M., Heisig J.P., Schaeffer M., Shikano S. (2016) No Need to Turn Bayesian in Multilevel Analysis with Few Clusters: How Frequentist Methods Provide Unbiased Estimates and Accurate Inference. Draft.
SMALL CLUSTERS:
Raudenbush S.W. (2008) Many small groups. In: J. de Leeuw, E. Meijer (eds.), Handbook of Multilevel Analysis, Springer.
McNeish D.M. (2014) Modeling Sparsely Clustered Data: Design-Based, Model-Based, and Single-Level Methods. Psychological Methods.
OPTIMAL DESIGN:
Snijders T.A.B., Bosker R.J. (1993) Standard errors and sample sizes for two-level research. Journal of Educational Statistics, 18.
Snijders T.A.B. (2005) Power and Sample Size in Multilevel Linear Models. In: B.S. Everitt and D.C. Howell (eds.), Encyclopedia of Statistics in Behavioral Science, Vol. 3, Chichester: Wiley.
Moerbeek M., Van Breukelen G.J.P., Berger M.P.F. (2008) Optimal Designs for Multilevel Studies. In: J. de Leeuw, E. Meijer (eds.), Handbook of Multilevel Analysis, Springer.
Moerbeek M., Teerenstra S. (2016) Power Analysis of Trials with Multilevel Data. CRC.

102 Multilevel logit models for binary responses

103 Generalized Linear Mixed Models (GLMM)
A GLM with random effects is often named GLMM, where the first M stands for Mixed: in fact, the model has both fixed effects (the regression coefficients) and random effects
Components of a GLMM:
GLM for the distribution of Y conditional on the random effects
distribution of the random effects
Remark: the model of the marginal distribution of Y w.r.t. the random effects (usually) is not a GLM

104 GLMM for a binary response: the random intercept logit model
GLM for $Y \mid u$:
linear predictor: $\eta_{ij} = \beta_0 + \beta_1 x_{ij} + u_j$
logit link: $\mathrm{logit}(\pi_{ij}) = \eta_{ij}$
distribution: $Y_{ij} \mid x_{ij}, u_j \sim Bin(1, \pi_{ij})$
Distribution of the random effects $f_u(u)$: $u_j \overset{iid}{\sim} N(0, \sigma^2_u)$
individual $i = 1, 2, \ldots, n_j$; cluster $j = 1, 2, \ldots, J$
The linear predictor can be extended in many ways: several level 1 and level 2 covariates; cross-level interactions; random slopes

105 The random intercept logit model
$\mathrm{logit}(\pi_{ij}) = \log\dfrac{\pi_{ij}}{1 - \pi_{ij}} = \beta_0 + \beta_1 x_{ij} + u_j$
$\pi_{ij} = P(Y_{ij} = 1 \mid u_j, x_{ij}) = \dfrac{1}{1 + \exp\left(-(\beta_0 + \beta_1 x_{ij} + u_j)\right)}$
The β are the conditional effects of the covariates, given the value of the random effects $u_j$: cluster-specific effects
The marginal effects of the covariates are obtained integrating w.r.t. the distribution of the random effects $u_j$

106 Cluster-Specific vs Population-Averaged effects /1
cluster-specific model (random intercept): $P(Y_{ij} = 1 \mid x_{ij}, u_j) = \dfrac{1}{1 + \exp\left(-(\beta_0 + \beta_1 x_{ij} + u_j)\right)}$
population-averaged model (constant intercept): $P(Y_{ij} = 1 \mid x_{ij}) = \dfrac{1}{1 + \exp\left(-(\gamma_0 + \gamma_1 x_{ij})\right)}$
The effect of x is attenuated: $|\gamma_1| < |\beta_1|$
This is the case for any binary-response model (logit, probit, cloglog, ...): see Ritz J. & Spiegelman D. (2004) in Statistical Methods in Medical Research

107 Cluster-Specific vs Population-Averaged effects /2
The distinction between CS and PA effects is irrelevant in linear random intercept models since they are identical
cluster-specific model (random intercept): $E(Y_{ij} \mid x_{ij}, u_j) = \beta_0 + \beta_1 x_{ij} + u_j$
$E(Y_{ij} \mid x_{ij}) = E_{u_j}\left[E(Y_{ij} \mid x_{ij}, u_j)\right] = E_{u_j}\left[\beta_0 + \beta_1 x_{ij} + u_j\right] = \beta_0 + \beta_1 x_{ij} + E_{u_j}(u_j) = \beta_0 + \beta_1 x_{ij} + 0 = \beta_0 + \beta_1 x_{ij}$
So $\beta_1$ is both the CS effect and the PA effect!
The technical reason is that the linear operator E( ) is applied to a linear function of its argument: it is well known that E(c+U) = c + E(U)

108 Cluster-Specific vs Population-Averaged effects /3
The equality of CS and PA effects does not hold in non-linear models:
$P(Y_{ij} = 1 \mid x_{ij}) = E_{u_j}\left[P(Y_{ij} = 1 \mid x_{ij}, u_j)\right] = E_{u_j}\left[\dfrac{1}{1 + \exp\left(-(\beta_0 + \beta_1 x_{ij} + u_j)\right)}\right] \neq \dfrac{1}{1 + \exp\left(-(\beta_0 + \beta_1 x_{ij} + E_{u_j}(u_j))\right)} = \dfrac{1}{1 + \exp\left(-(\beta_0 + \beta_1 x_{ij})\right)}$
The ≠ sign is due to the following: if U is a random variable and f is a non-linear function, then $E[f(U)] \neq f[E(U)]$
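The gap between averaging the conditional probability over the random effect and plugging in the mean of the random effect can be checked numerically. The sketch below integrates the inverse logit against a normal density with a plain trapezoid rule; parameter values, grid settings and function names are hypothetical. The difference of marginal log-odds between x = 1 and x = 0 is used as a rough stand-in for a population-averaged slope (the marginal model is not exactly a logit model, so this is only indicative).

```python
import math

def expit(eta: float) -> float:
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def marginal_prob(beta0, beta1, x, sigma_u, n_grid=2001, width=8.0):
    """E_u[expit(beta0 + beta1*x + u)] for u ~ N(0, sigma_u^2), trapezoid rule."""
    lo, hi = -width * sigma_u, width * sigma_u
    step = (hi - lo) / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        u = lo + k * step
        density = math.exp(-0.5 * (u / sigma_u) ** 2) / (sigma_u * math.sqrt(2.0 * math.pi))
        end_weight = 0.5 if k in (0, n_grid - 1) else 1.0
        total += end_weight * expit(beta0 + beta1 * x + u) * density
    return total * step

# Hypothetical cluster-specific parameters.
beta0, beta1, sigma_u = 0.0, 1.0, 2.0
p0 = marginal_prob(beta0, beta1, 0.0, sigma_u)
p1 = marginal_prob(beta0, beta1, 1.0, sigma_u)
implied_pa_effect = logit(p1) - logit(p0)
print(f"marginal P(Y=1 | x=1) = {p1:.4f}, plug-in value = {expit(beta0 + beta1):.4f}")
print(f"cluster-specific slope = {beta1:.3f}, implied marginal log-odds difference = {implied_pa_effect:.3f}")
```

Setting `sigma_u` closer to 0 makes the two quantities converge, which is consistent with attenuation being driven by the level 2 variance.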

109 Logit model example: Cluster-specific vs Population-averaged /1
Example (Snijders & Bosker, 1999): data on 3432 pupils in 240 secondary schools, with $Y_{ij} = 1$ if the pupil chose at least one science subject for his/her final examination
Table: standard logit vs random intercept logit (rows: Intercept, Female, Level 2 variance)
The slope of Female in the standard logit is the marginal slope $\gamma_1$; the slope in the random intercept logit is the conditional slope $\beta_1$: $|\gamma_1| < |\beta_1|$

110 Logit model example: Cluster-specific vs Population-averaged /2
The Odds Ratios are computed by taking the exp of the slope
OR Male vs Female: standard logit (marginal OR) 4.043; random intercept logit (conditional OR) 4.513
The conditional Odds Ratios are stronger, namely farther from 1
Interpretation: in the whole population of pupils, the odds that a male chooses at least one science subject are 4.043 times the corresponding odds for a female; in any given school, the odds that a male chooses at least one science subject are 4.513 times the corresponding odds for a female
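The link between slopes and odds ratios can be made concrete with a short Python sketch that starts from the two male-vs-female odds ratios reported above (4.043 and 4.513) and inverts them. The female-vs-male odds ratios and the slopes printed here are derived by this inversion, assuming the covariate is coded 1 for females and 0 for males; they are not read from the slide.

```python
import math

def female_summary(or_male_vs_female: float):
    """Return (OR female vs male, slope of the Female dummy) implied by the male-vs-female OR."""
    or_female_vs_male = 1.0 / or_male_vs_female
    slope_female = math.log(or_female_vs_male)
    return or_female_vs_male, slope_female

marginal_or, marginal_slope = female_summary(4.043)        # standard logit
conditional_or, conditional_slope = female_summary(4.513)  # random intercept logit
print(f"marginal:    OR female vs male = {marginal_or:.4f}, slope = {marginal_slope:.3f}")
print(f"conditional: OR female vs male = {conditional_or:.4f}, slope = {conditional_slope:.3f}")
```

Both slopes are negative, and the conditional one is larger in absolute value, which matches the attenuation of the marginal effect.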

111 Cluster-specific model or Population-averaged model?
The distinction between cluster-specific effects and population-averaged effects concerns the model, not the estimation method
A cluster-specific model (such as a random effects model) incorporates the correlation structure: any estimation method suitable for random effects models yields correct standard errors (as long as the model is correctly specified)
A population-averaged (PA) model ignores the correlation structure: to get correct standard errors it should be fitted with an estimation method that accounts for the correlation structure, such as Generalized Estimating Equations (GEE)

112 Predicting the probability
Three types of prediction (later we only consider type A):
A. a unit in a hypothetical cluster (conditional probability)
B. a unit in a new, randomly sampled cluster
C. a unit in an existing cluster (i.e. a cluster of the sample)
References:
Afshartous D., de Leeuw J. (2005) Prediction in multilevel models. J. Educ. Behav. Statist., 30.
Skrondal A., Rabe-Hesketh S. (2009) Prediction in multilevel generalized linear models. JRSS A, 172.

113 Conditional probability: prediction for a unit in a hypothetical cluster
Probability for a unit with covariate value $x_0$ in a hypothetical cluster with random effect $u_j = u_0$:
$\hat{P}(Y_{ij} = 1 \mid u_0, x_0; \hat{\beta}_0, \hat{\beta}_1) = \dfrac{1}{1 + \exp\left(-(\hat{\beta}_0 + \hat{\beta}_1 x_0 + u_0)\right)}$
Try different values of $u_0$, for example:
$u_0 = -1.96\,\hat{\sigma}_u$: hypothetical 'bad' cluster
$u_0 = 0$: hypothetical mean (median) cluster
$u_0 = +1.96\,\hat{\sigma}_u$: hypothetical 'good' cluster
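The prediction for the three hypothetical clusters can be sketched in Python by applying the inverse logit to the linear predictor plus a chosen random effect value. The parameter values below are hypothetical placeholders, not estimates from the slides, and the function names are illustrative.

```python
import math

def conditional_prob(beta0: float, beta1: float, x0: float, u0: float) -> float:
    """P(Y = 1 | x0, u0) in a random intercept logit model."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x0 + u0)))

def three_hypothetical_clusters(beta0, beta1, x0, sigma_u):
    """Probabilities for a 'bad', a mean (median) and a 'good' hypothetical cluster."""
    return {
        "bad": conditional_prob(beta0, beta1, x0, -1.96 * sigma_u),
        "mean": conditional_prob(beta0, beta1, x0, 0.0),
        "good": conditional_prob(beta0, beta1, x0, 1.96 * sigma_u),
    }

# Hypothetical estimates: intercept, slope of a 0/1 covariate, level 2 standard deviation.
beta0_hat, beta1_hat, sigma_u_hat = 0.0, -1.5, 0.7
for x0 in (0, 1):
    probs = three_hypothetical_clusters(beta0_hat, beta1_hat, x0, sigma_u_hat)
    print(x0, {name: round(p, 3) for name, p in probs.items()})
```

Roughly 95% of clusters have a random effect between the 'bad' and 'good' values under the normality assumption, so the two extreme probabilities bracket the range expected for most clusters.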

114 Conditional probability example
Let us consider the random intercept model for the probability that a pupil chooses at least one science subject (estimated parameters: Intercept, Female, Level 2 variance)
The conditional probabilities for three hypothetical schools, $u_0 = -1.96\,\hat{\sigma}_u$, $u_0 = 0$ and $u_0 = +1.96\,\hat{\sigma}_u$, are computed for males and for females from the linear predictor $\hat{\beta}_0 + \hat{\beta}_1 x$

115 Likelihood
$Y_{ij} \mid x_{ij}, u_j \overset{iid}{\sim} Bin(1, \pi_{ij})$, with $\pi_{ij} = P(Y_{ij} = 1 \mid x_{ij}, u_j)$
The likelihood can be written in 3 steps:
Step 1: Conditional likelihood of the j-th cluster (exploits conditional independence within clusters): $L_j(\beta_0, \beta_1 \mid u_j) = \prod_{i=1}^{n_j} \pi_{ij}^{y_{ij}} (1 - \pi_{ij})^{1 - y_{ij}}$
Step 2: Marginal likelihood of the j-th cluster (not in closed form!): $L_j(\beta_0, \beta_1, \sigma^2_u) = \int L_j(\beta_0, \beta_1 \mid u_j)\, \phi(u_j; 0, \sigma^2_u)\, du_j$
Step 3: Overall likelihood (exploits independence between clusters): $L(\beta_0, \beta_1, \sigma^2_u) = \prod_{j=1}^{J} L_j(\beta_0, \beta_1, \sigma^2_u)$

116 Popular methods to estimate the parameters when the likelihood has intractable integrals
Linearization of the model via Taylor expansion
PQL (Penalized Quasi-Likelihood): MLwiN, HLM
ML via numerical integration
Laplace approximation: HLM, R (lme4)
(Adaptive) Gaussian Quadrature: SAS (nlmixed), Stata (xt-, me-, gllamm), Mplus, R (lme4, glmmml)
Monte Carlo integration: Mplus
Bayesian inference
MCMC: WinBUGS, OpenBUGS, R (glmmbugs, jags)
Integrated Nested Laplace Approximation: R (inla)
The convergence of the algorithm depends on: the data at hand, the complexity of the model, the initial values, the specific options of the algorithm (e.g. the number of quadrature points)

117 Gaussian Quadrature
The key idea of Gaussian Quadrature (also called Gauss-Hermite quadrature, or Ordinary quadrature) is to replace the continuous distribution of $u_j$ with a similar but discrete distribution, so that the integral becomes a summation:
$L_j(\beta, \sigma^2_u) = \int L_j(\beta \mid u_j)\, \phi(u_j; 0, \sigma^2_u)\, du_j \approx \sum_{r=1}^{R} L_j(\beta \mid u_j = a_r)\, w_r$
GQ locations (mass points): $a_1, \ldots, a_r, \ldots, a_R$
GQ weights: $w_1, \ldots, w_r, \ldots, w_R$
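A minimal Python sketch of ordinary (non-adaptive) Gaussian quadrature for the marginal likelihood of one cluster in a random intercept logit model follows. It uses the 3-point Gauss-Hermite rule for a normal density, whose locations are 0 and ±√3·σ_u with weights 2/3, 1/6, 1/6, and compares the result with a fine trapezoid integration. The data, parameter values and function names are hypothetical; real software uses more points and, usually, the adaptive version.

```python
import math

def expit(eta: float) -> float:
    return 1.0 / (1.0 + math.exp(-eta))

def conditional_lik(u, beta0, beta1, xs, ys):
    """Step 1: likelihood of one cluster given its random effect u."""
    lik = 1.0
    for x, y in zip(xs, ys):
        p = expit(beta0 + beta1 * x + u)
        lik *= p if y == 1 else 1.0 - p
    return lik

def gq3_marginal_lik(beta0, beta1, sigma_u, xs, ys):
    """Step 2 approximated by 3-point Gauss-Hermite quadrature."""
    locations = (-math.sqrt(3.0) * sigma_u, 0.0, math.sqrt(3.0) * sigma_u)
    weights = (1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0)
    return sum(w * conditional_lik(a, beta0, beta1, xs, ys) for a, w in zip(locations, weights))

def trapezoid_marginal_lik(beta0, beta1, sigma_u, xs, ys, n_grid=4001, width=8.0):
    """Reference value: fine trapezoid rule over [-width*sigma_u, width*sigma_u]."""
    lo = -width * sigma_u
    step = 2.0 * width * sigma_u / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        u = lo + k * step
        density = math.exp(-0.5 * (u / sigma_u) ** 2) / (sigma_u * math.sqrt(2.0 * math.pi))
        end_weight = 0.5 if k in (0, n_grid - 1) else 1.0
        total += end_weight * conditional_lik(u, beta0, beta1, xs, ys) * density
    return total * step

# Hypothetical cluster with three units.
xs, ys = [0, 1, 0], [1, 0, 1]
approx = gq3_marginal_lik(0.0, 0.5, 1.0, xs, ys)
reference = trapezoid_marginal_lik(0.0, 0.5, 1.0, xs, ys)
print(f"3-point GQ = {approx:.5f}, fine trapezoid = {reference:.5f}")
```

Increasing the number of quadrature points, or centering and scaling the locations on the integrand as adaptive quadrature does, would bring the approximation closer to the reference value.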

118 Ordinary Gaussian Quadrature and Adaptive Gaussian Quadrature (better approximation)
[Figure, two panels: density of the random effect $\zeta_j$ (density curve), normalized integrand (solid curve) and quadrature weights (bars) for ordinary quadrature and adaptive quadrature. Source: Rabe-Hesketh, Skrondal and Pickles, 2002]

119 ML with Gaussian Quadrature
Advantages:
- Accurate estimates in almost any case
- The quality of the approximation can be evaluated by changing the number of quadrature points
Disadvantages:
- Inefficient for continuous responses
- Computational time can be very long
Warning: the time is roughly proportional to the number of quadrature points, a number that rapidly increases as the model becomes more complex. For example, using 8 quadrature points per dimension:
- 1 random intercept + 1 random slope: 8^2 = 64 quadrature points
- 1 random intercept + 2 random slopes: 8^3 = 512 quadrature points
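The growth in the number of quadrature points can be reproduced with a small sketch. It assumes a full tensor-product grid, i.e. every combination of the per-dimension locations is evaluated; the function name is hypothetical.

```python
from itertools import product


def quadrature_grid(points_per_dimension, n_random_effects):
    """All combinations of location indices, one index per random effect."""
    one_dimension = range(points_per_dimension)
    return list(product(one_dimension, repeat=n_random_effects))


# 8 points per dimension with 1, 2 and 3 random effects.
for n_effects in (1, 2, 3):
    grid = quadrature_grid(8, n_effects)
    print(n_effects, "random effect(s):", len(grid), "quadrature points")
```

Each additional random effect multiplies the number of likelihood evaluations per cluster by the number of points per dimension, which is why computing time escalates so quickly.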

120 An application of the random effects logit model: Contraception in Brazil

121 Contraception in Brazil: research aims
How much of the individual-level variability in the use of contraceptives is due to the social context in which the women live?
Is it possible to explain the differences due to the social context?
Angeli A., Rampichini C., Salvini S. (1996) La contraccezione in Brasile: un'analisi attraverso un modello a componenti di varianza. Dept. of Statistics of the University of Florence, Working Papers n. 59

122 Data
1156 records, 18 variables
DHS 1986 Brazil: women in union (age range missing)
Y: use of contraceptives [use] (0=never, 1=at least once)
Hierarchical structure:
- Women [id]: 1156 level 1 units
- Area of residence [area]: 47 level 2 units
Examples of available covariates:
- Individual level: age at interview [age], listening to the radio (every day or not) [radio], education
- Contextual (area level): infant mortality rate, average number of desired children, percentage having a job

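123 Preliminary analysis
[Table: summary statistics (Obs, Mean, Std. Dev., Min, Max) for the variables area, use, age; values missing]
[Table: area of residence by use of contraception (1=yes; 0=no), with columns 0, 1, Total; values missing]
Proportions by area:
- Overall proportion $\pi$ (value missing)
- Area mean proportion $\pi_j = E(Y_{ij} \mid \text{area} = j)$: $\min(\pi_j) = 0.33$, $\max(\pi_j)$ (value missing)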

124 Null model: results
$\mathrm{logit}(\pi_{ij}) = \beta_0 + u_j$
Random-effects logistic regression (Stata output; numeric estimates missing)
Number of obs = 1156; Group variable: area; Number of groups = 47
Random effects u_i ~ Gaussian; Obs per group: min = 1, avg = 24.6, max = 116
Reported quantities:
- _cons: $\beta_0$
- /lnsig2u
- sigma_u: $\sigma_u$, the between-areas standard deviation
- rho: ICC (on the latent response scale)
- Likelihood-ratio test of rho=0: chibar2(01)
Estimated proportions:
- mean area ($u_j = 0$): $1/[1+\exp(-\beta_0)]$
- $\pi_j$ for high $u$: $1/[1+\exp(-(\beta_0+\sigma_u))]$
- $\pi_j$ for low $u$: $1/[1+\exp(-(\beta_0-\sigma_u))]$
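The quantities on this slide can be computed from the estimates with a few lines of code. The sketch below is only illustrative: the values of `beta0_example` and `sigma_u_example` are hypothetical placeholders, not the estimates of the application (which are not reproduced here). It relies on two standard facts: the level 1 variance on the latent response scale of a logit model is $\pi^2/3$, and Stata's `/lnsig2u` is the logarithm of $\sigma_u^2$.

```python
import math


def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def latent_icc(sigma_u):
    """ICC on the latent response scale: sigma_u^2 / (sigma_u^2 + pi^2 / 3)."""
    return sigma_u ** 2 / (sigma_u ** 2 + math.pi ** 2 / 3.0)


def sigma_u_from_lnsig2u(lnsig2u):
    """Convert the log-variance parameter into a standard deviation."""
    return math.exp(0.5 * lnsig2u)


# Hypothetical values, used only to show the calculations.
beta0_example = 1.0
sigma_u_example = 1.0

p_mean_area = inv_logit(beta0_example)                    # u_j = 0
p_high_area = inv_logit(beta0_example + sigma_u_example)  # u_j = +sigma_u
p_low_area = inv_logit(beta0_example - sigma_u_example)   # u_j = -sigma_u
print(p_low_area, p_mean_area, p_high_area, latent_icc(sigma_u_example))
```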

125 Two-level random intercept model with one covariate (age)
Inserting age (centered at 40, fixed effect):
$$P(Y_{ij} = 1 \mid u_j) = \pi_{ij} = \frac{1}{1 + \exp\left(-(\beta_0 + \beta_1 x_{ij} + u_j)\right)}$$
$$\mathrm{logit}(\pi_{ij}) = \log\frac{\pi_{ij}}{1 - \pi_{ij}} = \beta_0 + \beta_1 x_{ij} + u_j$$

126 Model results with age (centered at 40)
Random-effects logistic regression (Stata output; coefficient estimates missing)
Number of obs = 1156; Group variable: area; Number of groups = 47
Random effects u_i ~ Gaussian; Obs per group: min = 1, avg = 24.6, max = 116
Integration method: mvaghermite; Integration points = 12
Wald chi2(1) = 7.37
Reported quantities: coefficient of age (centered at 40), _cons, /lnsig2u, sigma_u, rho; Likelihood-ratio test of rho=0: chibar2(01)
Better model fit than the null model: LRT = 7.46, df = 1 (twice the difference between the two log likelihoods)
The between variance increased w.r.t. the null model (!)
Bauer, D.J. (2009). A note on comparing the estimates of models for cluster-correlated or longitudinal data with binary or ordinal outcomes. Psychometrika 74.
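To read the likelihood-ratio statistic, one can compute its p-value under a chi-square distribution with 1 degree of freedom. The sketch below uses only the reported LRT value of 7.46; the function name is hypothetical. It relies on the fact that a chi-square variable with 1 df is the square of a standard normal, so the tail probability can be written with the complementary error function.

```python
import math


def lrt_p_value_df1(lrt):
    """P(chi-square with 1 df > lrt) = P(|Z| > sqrt(lrt)) = erfc(sqrt(lrt / 2))."""
    return math.erfc(math.sqrt(lrt / 2.0))


p_age = lrt_p_value_df1(7.46)  # LRT for adding age, as reported on the slide
print(p_age)
```

The resulting p-value is well below 0.01, in line with the statement that the model with age fits better than the null model.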

127 Contextual effect of age
- Fit the model with the addition of the cluster mean of age (contextual effect)
- Now the between variance is less than in the null model
- Indeed, the between and within effects of age are significantly different (this follows from the Wald test on CMage_40, not reported)
[Table: estimates for the models m_null, m_age, m_age_ctx; rows: age (centered at 40), cluster mean of age (CMage_40), _cons, between variance, exp(_cons); values missing]

128 Two-level random intercept model with age and radio
Random-effects logistic regression (Stata output; coefficient estimates missing)
Number of obs = 1156; Group variable: area; Number of groups = 47
Random effects u_i ~ Gaussian; Obs per group: min = 1, avg = 24.6, max = 116
Reported quantities: coefficients of radio and age (centered at 40), _cons, /lnsig2u, sigma_u, rho; Likelihood-ratio test of rho=0: chibar2(01)
Estimated probability of contraceptive use at age = 40, $u_j = 0$:
- radio=0: 1/(1+exp(-_b[_cons])) = 0.76
- radio=1: 1/(1+exp(-_b[_cons]-_b[_radio])) = 0.86

129 Odds Ratio
$$OR = \frac{\mathrm{odds}(x = 1, u)}{\mathrm{odds}(x = 0, u)} = \frac{\exp(\beta_0 + \beta_1 + u)}{\exp(\beta_0 + u)} = \exp(\beta_1)$$
The OR is a measure of association between Y and X which does not depend on u.
In the application, OR(radio) is about 2: the odds of using contraceptives are about double for a woman listening to the radio every day (whichever area she lives in).
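As a quick check of the "about double" statement, the odds ratio can be recovered from the two conditional probabilities reported for age = 40 and $u_j = 0$ (0.76 without radio, 0.86 with radio). Since these probabilities are rounded, the result is only approximate. The second part of the sketch uses hypothetical coefficients to show that the OR is the same for every value of u; all function names are illustrative.

```python
import math


def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


# Odds ratio implied by the rounded probabilities reported for the application.
or_radio = odds(0.86) / odds(0.76)
print(round(or_radio, 2))


def odds_ratio_at(beta0, beta1, u):
    """OR between x = 1 and x = 0 for a cluster with random effect u."""
    return odds(inv_logit(beta0 + beta1 + u)) / odds(inv_logit(beta0 + u))


# Hypothetical coefficients: the OR equals exp(beta1) whatever the value of u.
for u in (-2.0, 0.0, 2.0):
    print(u, odds_ratio_at(0.3, 0.7, u), math.exp(0.7))
```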

130 Probabilities of contraceptive use by age and radio (0 left panel, 1 right panel): marginal, and conditional for $u \in \{-2, -1, 0, +1, +2\}$
[Figure: probability of contraceptive use (vertical axis) against age (horizontal axis); one curve for the marginal probability and one curve for each conditional probability at u = +2, +1, 0, -1, -2; panels by listening to the radio (1=yes; 0=no)]
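The difference between conditional and marginal probabilities can be sketched numerically. The conditional probability fixes u at a given value, while the marginal probability averages the conditional one over the normal distribution of u. In the sketch below the linear predictor is chosen so that the conditional probability at u = 0 equals 0.76 (the value reported for radio = 0 at age 40), while `sigma_u_example` is a hypothetical value, not the estimate of the application; the function names are illustrative and the integral is computed with a plain midpoint rule.

```python
import math


def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def conditional_probability(eta, u):
    """P(Y = 1 | u) for a unit with fixed-part linear predictor eta."""
    return inv_logit(eta + u)


def marginal_probability(eta, sigma_u, steps=4000):
    """P(Y = 1): average of the conditional probability over u ~ N(0, sigma_u^2)."""
    lower, upper = -8.0 * sigma_u, 8.0 * sigma_u
    h = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        u = lower + (k + 0.5) * h
        density = math.exp(-0.5 * (u / sigma_u) ** 2) / (sigma_u * math.sqrt(2.0 * math.pi))
        total += inv_logit(eta + u) * density * h
    return total


eta_example = math.log(0.76 / 0.24)  # gives a conditional probability of 0.76 at u = 0
sigma_u_example = 1.0                # hypothetical between-area standard deviation

for u in (-2, -1, 0, 1, 2):
    print("u =", u, "conditional:", round(conditional_probability(eta_example, u), 3))
marginal_example = marginal_probability(eta_example, sigma_u_example)
print("marginal:", round(marginal_example, 3))
```

The marginal probability lies between 0.5 and the conditional probability at u = 0, so the marginal curve is flatter than the conditional curve of the mean area.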

131 Software & Books

132 Specialized software
- MLwiN, Stat-JR (Goldstein, Rasbash, Browne)
- HLM (Raudenbush, Bryk, Congdon)
- SUPERMIX (Hedeker)

133 Procedures in general purpose software
- Stata (mixed, me commands, xt commands)
- R (packages lme4, nlme, HGLMMM, MCMCglmm, inla)
- SAS (PROC MIXED, NLMIXED, GLIMMIX, MCMC)
- SPSS
- WinBUGS, OpenBUGS (for Bayesian analysis)
- Stata gllamm (Rabe-Hesketh & Skrondal) and gsem, LISREL (Jöreskog), Mplus (Muthén): also structural equations
- Latent Gold (Vermunt)

134 Good introductory books
- Snijders & Bosker, 2nd ed.
- Hox, 2nd ed.
- Raudenbush & Bryk, 2nd ed.
Chapter 2 free to download

135 Books for learning models AND software
- Rabe-Hesketh and Skrondal (2012) Multilevel and longitudinal modeling using Stata, 3rd ed.
- Littell et al. (2006) SAS for mixed models, 2nd ed.
- Bates (2010) lme4: Mixed-effects Modeling with R
- Finch, Bolin, Kelley (2014) Multilevel Modeling Using R
- Gelman and Hill (2007) Data analysis using regression and multilevel/hierarchical models (mainly R and WinBUGS)
- Heck R.H., Thomas S.L., Tabata L.N. (2014) Multilevel and Longitudinal Modeling with IBM SPSS, 2nd ed.
- Heck R.H., Thomas S.L. (2015) An Introduction to Multilevel Modeling Techniques: MLM and SEM Approaches Using Mplus, 3rd ed.

136 Advanced books
- H. Goldstein: a classical book, with a great variety of models and applications
- E. Demidenko: statistical theory and estimation algorithms
- A. Skrondal & S. Rabe-Hesketh: unified framework for models with latent variables, including multilevel, factor, structural eq.

137 Handbooks
- J. de Leeuw, E. Meijer (eds.)
- J.J. Hox, J.K. Roberts (eds.)
- M.A. Scott, J.S. Simonoff, B.D. Marx (eds.)

138 Web resources
- Centre for Multilevel Modelling (Bristol)
- Multilevel Modeling Resources at UCLA
- GLLAMM page (Skrondal & Rabe-Hesketh)
- Some personal homepages:
  Raudenbush: sociology.uchicago.edu/people/faculty/raudenbush.shtml
  Goldstein
  Hox: joophox.net
  Snijders
- Mailing list
- Our homepages (also check the ResearchGate pages)

139 A web course
Detailed instructions on how to carry out a range of analyses in R, MLwiN and Stata.
It is free, but you will need to log on or register for the course to view all these practicals.
You can see short samples of these materials without registering.

140 More questions? Ask me via e-mail or come to Florence!
