Introduction to Multilevel Modelling

1 Introduction to Multilevel Modelling
Leonardo Trujillo, PhD. Luis Guillermo Díaz, PhD (c).
Departamento de Estadística, Universidad Nacional de Colombia
XXI Simposio de Estadística, 19-23 July 2011, Bogotá, Colombia

2 Course Outline
- Review. Why multilevel models?
- Fixed and Random Effects Models.
- Review. Random Intercepts and Random Slopes Models.
- Multilevel Models for Binary Response Data and Proportions (Multilevel Logistic Regression).

3 The Independence Assumption
Why does clustered data matter?
- Standard analyses assume independence.
- The standard errors of model parameters are estimated under this assumption.
- If observations within clusters are positively correlated, these standard errors will be underestimated.
Other names: hierarchical models, random-effects or random-coefficient models, mixed-effects models.

4 The Importance of Data Structures
Data in the real world have structure that tends to violate the assumptions of:
- independence;
- homogeneity of residual variance (continuous data).
That is why we need additional modelling techniques such as multilevel models.

5 The Independence Assumption
- Survey data do not always (in fact, rarely) come from a simple random sample (SRS).
- The data collection process can generate outcomes that cannot be considered independent.
- This is particularly true of social surveys, as these often have multi-stage designs.

6 The Independence Assumption
Traditional approaches treat this clustering as a nuisance that must be accounted for:
- Parameters are estimated in the usual manner.
- Standard error estimates need to be adjusted for the impact of the clustering.

7 The Independence Assumption
There may still be natural clustering in the population even if we have collected our data in an unclustered way.
Model-based approach: the population is regarded as generated under a model, and the data are selected from that population.
Two particular interests: the impact of variables that describe the context, as well as variables that relate to the individual response.

8 Examples
- Pupils within classes within schools. A pupil's performance will depend not only on their own characteristics but also on the class they are in and the school from which that class is drawn (Goldstein, 1995).
- Individuals within households within communities.
- Patients within wards within hospitals.
- Longitudinal data: multiple observations over time are nested within units, typically subjects.

9 General Framework
$Y_{ij}$ will represent the outcome of individual $i$ from area $j$.
$Y_{ij}$ will be expressed as a function of the individual variables $x_{0ij}, x_{1ij}, \ldots$
$i$ will represent level one and $j$ level two.
The level one units within the $j$th level two unit are indexed by $i = 1, \ldots, n_j$, and the level two units by $j = 1, \ldots, J$.

10 Hierarchical Structures
Only interested in simple population structures:
- Structures are strictly hierarchical.
- Level one within level two within level three.
This is NOT appropriate for all scenarios: pupils are nested within classes within schools, but pupils are also nested within communities.

11 Hierarchical Structures
[Figure: pupils grouped both by school and by community.]

12 Different Approaches
- Aggregate analysis
- Disaggregate analysis
- Fixed effects
- Standard models with robust standard errors (Stata; SAS)
- Multilevel models: MLwiN, R, Stata (xtreg, xtmixed, gllamm)
A free 30-day evaluation version of the MLwiN software is available.

13 Different Approaches
Alternative 1 - Ignore the problem.
The standard error estimates are wrong, and the problem gets worse depending on the nature of the clustering and the variable being analysed. In any case, how would we account for the impact of higher level variables in our analysis?
Alternative 2 - Standard models with robust standard errors.
This solves the problem with the standard error estimates but NOT the question about higher level variables.

14 Different Approaches
Example: Students nested in Schools
[Table: one row per student, with columns ID, School, AP 2011, Income, PS and Type; three students from UNC (public) and three from PUJ (private).]
This analysis ignores the variation at the school level. The performance of individuals belonging to the same school, e.g. UNC, could be highly correlated.

15 Different Approaches
Example: Students nested in Schools
[Table: one row per school (UNC, public; PUJ, private; UV, public), with columns ID, School, AAP, Average Income, Average PS and Type.]
It is impossible to predict individual outcomes.

16 Different Approaches
Example: Students nested in Schools
[Table: one row per school (UNC, public; PUJ, private; UV, public), with columns ID, School, AAP, Income, PS and Type.]
It is impossible to predict individual outcomes.

17 Aggregate Analysis
We have the data for each individual, for each level two unit $j$. We can still calculate
$\bar{y}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} y_{ij}, \quad \bar{x}_{0j} = \frac{1}{n_j} \sum_{i=1}^{n_j} x_{0ij}, \quad \bar{x}_{1j} = \frac{1}{n_j} \sum_{i=1}^{n_j} x_{1ij}, \ldots$
$\bar{y}_j$ is modelled as a linear function of $\bar{x}_{0j}, \bar{x}_{1j}, \ldots$
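The aggregation step can be sketched in a few lines of Python. This is only an illustration: the school labels and scores below are hypothetical toy values, and `group_means` is a helper name introduced here, not something from the course material.

```python
from collections import defaultdict


def group_means(groups, values):
    """Return {group label: mean of the values observed in that group}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for g, v in zip(groups, values):
        sums[g] += v
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in sums}


# Hypothetical toy data: five pupils in two schools.
school = ["A", "A", "A", "B", "B"]
score = [10.0, 12.0, 14.0, 20.0, 22.0]

y_bar = group_means(school, score)  # one aggregated value per level two unit
print(y_bar)
```

After this step only J rows remain (here J = 2), which is why the aggregate analysis has so few observations to work with.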

18 Aggregate Analysis
Three problems:
1. The analysis has much less power, as we only have $j = 1, \ldots, J$ observations.
2. Ecological fallacy: the relationship between aggregate-level variables is NOT necessarily the same as the relationship between the individual-level variables.
3. In fact, when you aggregate levels, correlations tend to increase.
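A small numerical illustration of problems 2 and 3, using made-up data (three hypothetical groups of three individuals). Within every group the relationship between x and y is perfectly negative, yet the correlation of the group means is perfectly positive and larger than the pooled individual-level correlation. The `pearson` helper is written out by hand so the sketch needs only the standard library.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: group id -> (x values, y values).
groups = {
    1: ([1, 2, 3], [3, 2, 1]),
    2: ([4, 5, 6], [6, 5, 4]),
    3: ([7, 8, 9], [9, 8, 7]),
}

x_all = [v for xs, _ in groups.values() for v in xs]
y_all = [v for _, ys in groups.values() for v in ys]
x_bar = [sum(xs) / len(xs) for xs, _ in groups.values()]
y_bar = [sum(ys) / len(ys) for _, ys in groups.values()]

r_within = [pearson(xs, ys) for xs, ys in groups.values()]  # inside each group
r_individual = pearson(x_all, y_all)                        # pooled individuals
r_aggregate = pearson(x_bar, y_bar)                         # group means only

print(r_within, r_individual, r_aggregate)
```

The data were chosen to make the contrast extreme; real data will rarely be this clean, but the direction of the effect is the point.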

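19 Two Stage (Multistage) Approach
Example: Students nested in Schools
[Table, stage one: students within one school (UNC), with columns ID, School, AP 2011, Income and PS.]
[Table, stage two: one row per school (UNC, public; PUJ, private; UV, public), with columns ID, School, b0, b_inc, b_ps and Type.]
Problems: small sample sizes in particular groups, and no account is taken of individual-group interactions.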

20 Disaggregate Analysis: Fixed Effects Model
One-way ANOVA model for the response $y_{ij}$ (a very simple model):
$y_{ij} = \mu + u_j + e_{ij}$
Fixed parameters: $\mu$, the overall mean, and $u_j$, the effect of group $j$, with $\sum_j n_j u_j = 0$.
Random parameters: $e_{ij}$, the individual level residual, with $e_{ij} \sim N(0, \sigma_e^2)$ and $\mathrm{Cov}(e_{ij}, e_{kl}) = 0$.

21 Fixed Effects Model
[Figure: group means $\mu + u_1$, $\mu + u_2$ and $\mu + u_3$ shown around the overall mean $\mu$.]

22 Fixed Effects Model
Advantages:
- Simple calculations.
- The distribution of the $u_j$'s is not specified.
- Ideal for coping with large and extreme between-group differences.

23 Fixed Effects Model
Disadvantages:
- If $J$ (the number of level two units) is large, there will be a large number of model parameters.
- If $n_j$ (the number of level one units within unit $j$) is small, each $u_j$ will be poorly estimated.
- If the $J$ level two units are a sample from a population and we want to make inferences about that population, the model does not make sense.

24 Disaggregate Analysis: Random Effects Model
Extension of the fixed effects model for the response $y_{ij}$ (one-way random effects model):
$y_{ij} = \mu + u_j + e_{ij}$
Fixed parameters: $\mu$, the overall mean.
Random parameters:
- $u_j$, the group level residual, with $u_j \sim N(0, \sigma_u^2)$ and $\mathrm{Cov}(u_j, u_l) = 0$;
- $e_{ij}$, the individual level residual, with $e_{ij} \sim N(0, \sigma_e^2)$ and $\mathrm{Cov}(e_{ij}, e_{kj}) = 0$.
Also, $\mathrm{Cov}(e_{ij}, u_j) = \mathrm{Cov}(e_{ij}, u_l) = 0$.

25 Random Effects Model
[Figure: group means $\mu + u_1$, $\mu + u_2$ and $\mu + u_3$ shown around the overall mean $\mu$.]
$u_j$ and $e_{ij}$ are random error terms after controlling for the fixed components in the model.
Works well when:
- the units are a random sample from the population and $J$ is large;
- the $u_j$'s are all relatively small, with no extremes.

26 Multilevel Model - Example
Bland and Altman (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet I: [...] (the most cited paper in the Lancet).
Peak expiratory flow rate (PEFR) measurements: a person's maximum speed of expiration.
PEFR was measured twice (in litres per minute) using the Wright peak flow meter, and twice using the Mini Wright peak flow meter (more portable, lower cost).
Q: How do we assess the quality of the two instruments?

27 Multilevel Model - Example
[Table: one row per subject, with first and second measurements for the Wright peak flow meter and for the Mini Wright peak flow meter.]
If the new method agrees sufficiently well with the old one, the old may be replaced.
The four measures are clustered within methods, and the two methods are clustered within each subject (three-level model):
$y_{ijk} = \mu + u_{jk} + u_k + e_{ijk}$

28 Multilevel Model - Example
First, we consider the problem of analysing the differences between the two sets of measurements taken with the Mini Wright peak flow meter:
$y_{ij} = \mu + u_j + e_{ij}$
Two repeated measures are nested in 17 clusters (individuals).
Random or fixed effects approach? The answer depends on the target of inference: the population of clusters (random) or the particular clusters in the dataset (fixed).

29 Multilevel Model - Example
It is often said that random effects should only be used if there are more than 20 clusters in the sample. This is true if the variance components are of interest, since $\sigma_u^2$ will otherwise be poorly estimated. However, if a random effects approach is used merely to make appropriate inferences regarding $\beta$, this is not strictly required.
The one-way fixed effects ANOVA model has 19 parameters ($\beta, \alpha_1, \ldots, \alpha_{17}, \sigma_e^2$) and one constraint.
The one-way random effects ANOVA model has only three parameters ($\beta, \sigma_u^2, \sigma_e^2$), so it is more parsimonious.

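30 Intraclass Correlation
$\rho = \mathrm{Corr}(y_{ij}, y_{kj}) = \frac{\mathrm{Cov}(y_{ij}, y_{kj})}{\sqrt{\mathrm{Var}(y_{ij})\,\mathrm{Var}(y_{kj})}} = \frac{\sigma_u^2}{\sqrt{(\sigma_u^2 + \sigma_e^2)(\sigma_u^2 + \sigma_e^2)}} = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}$
This is the correlation of individual level units within level two units, also referred to as the intra-cluster correlation.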
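The covariance and variance that enter this correlation follow directly from the random effects model $y_{ij} = \mu + u_j + e_{ij}$ and its independence assumptions. A short worked derivation, for two different individuals $i \neq k$ in the same cluster $j$:

```latex
\begin{aligned}
\operatorname{Cov}(y_{ij}, y_{kj})
  &= \operatorname{Cov}(u_j + e_{ij},\; u_j + e_{kj}) \\
  &= \operatorname{Var}(u_j) + \operatorname{Cov}(u_j, e_{kj})
     + \operatorname{Cov}(e_{ij}, u_j) + \operatorname{Cov}(e_{ij}, e_{kj}) \\
  &= \sigma_u^2 + 0 + 0 + 0 = \sigma_u^2, \\[4pt]
\operatorname{Var}(y_{ij})
  &= \operatorname{Var}(u_j) + \operatorname{Var}(e_{ij})
   = \sigma_u^2 + \sigma_e^2 .
\end{aligned}
```

The shared term $u_j$ is the only source of dependence, which is why $\rho$ is zero exactly when $\sigma_u^2 = 0$.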

31 Intraclass Correlation
The within-cluster variance has increased, giving a much smaller intraclass correlation. In contrast, the Pearson correlation appears to be similar.
The Pearson correlation is only defined for pairs of variables, whereas the intraclass correlation summarizes dependence for clusters of size larger than 2.

32 Multilevel Model - Example
. xtmixed wm || id:, mle
[Stata output: fixed-part table for wm (Coef., Std. Err., z, P>|z|, 95% Conf. Int.) with row _cons; random-effects table (Estimate, Std. Err., 95% Conf. Int.) with rows sd(_cons) and sd(Residual); log likelihood.]
Intracluster correlation = 0.97
The Mini Wright peak flow meter is very reliable.

33 Multilevel Models
Advantages:
- Only a few parameters are needed to estimate the structure (efficient estimation).
- When there are few observations per higher level unit, the model can still estimate overall effects.
- Inference from the sample to the population is possible.
Disadvantages:
- The assumptions about the distributions of the errors and about independence need to be checked.

34 Example (Brown, 2004)
We will study fertility in Bangladesh (measured by the number of children ever born to a particular woman, CEB) using data from the Bangladesh Fertility Survey (1988). As well as looking at the effect of a woman's individual characteristics, we are also interested in community-level and district-level differences in fertility.
Variables: DISTRICT, COMM, WOMAN, CEB, AGE, EDUC (1 to 4), HINDU (0, 1) and FIND (0 to 8).
Eight variables, concerning 401 women in 68 communities (villages) within 60 districts.

35 Example (Brown, 2004)
Data are the first three villages in the dataset. $y_{ij}$ is the number of children ever born to woman $i$ from Bangladesh village $j$ ($CEB_{ij}$).
[Table: observations $y_{ij}$, sample sizes $n_j$ and means $\bar{y}_j$ for villages $j = 1, 2, 3$ and overall.]
[ANOVA table: Source, df, SS, MS for Between, Within and Total.]

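36 Example: Fixed Effects Model
$y_{ij} = \mu + u_j + e_{ij}$, with $\sum_j n_j u_j = 0$ and $e_{ij} \sim N(0, \sigma_e^2)$
$\hat{\mu} = 4.08$, $\hat{\sigma}_e^2 =$ [...]
[Table: estimates $\hat{u}_j$ for $j = 1, 2, 3$.]
$H_0: u_1 = u_2 = u_3$
$F_{J-1,\,n-J}(0.05) = 3.44$ and $F = 4.45/10.81 = 0.41$
$H_0$ is NOT rejected at the 5% level!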

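37 Example: Random Effects Model
$y_{ij} = \mu + u_j + e_{ij}$, with $u_j \sim N(0, \sigma_u^2)$ and $e_{ij} \sim N(0, \sigma_e^2)$
$\hat{\mu} = 4.08$, $\hat{\sigma}_e^2 =$ [...], $\hat{\sigma}_u^2 =$ [...]
Test for $H_0$ as for the fixed effects model (but this does not really make sense when the estimated variance of the random effect is negative!).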

38 Random Intercepts Model
Extension of the last model in the fixed part.
Data: $y_{ij}, x_{0ij}, x_{1ij}$, for $j = 1, \ldots, J$ and $i = 1, \ldots, n_j$
Model: $y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + u_j x_{0ij} + e_{ij}$
Fixed part: $\beta_0, \beta_1$
Random part: $u_j, e_{ij}$
Same assumptions as before, plus covariates.

39 RIM Interpretation
[Figure: $y$ against $x$, with one parallel regression line per group $j$.]
The slope $\beta_1$ is constant.
The intercept varies across $j$ (hence the name) but is constant within $j$.

40 Classic Example
$y_{ij}$ = attainment at age 16, $x_{1ij}$ = attainment at age 11, $i$ = pupil and $j$ = school.
For a given pupil, $u_j$ is the school effect. This is constant for all pupils within school $j$.
For a given school, the effect of a one unit increase in $x_{1ij}$ for a pupil is an increase of $\beta_1$ units in $E(y_{ij})$.
Note: there are no interactions between school and $x_{1ij}$, since the school lines are parallel.

41 Alternative Formulation of the Model
Within-school model: $y_{ij} = \beta_{0j} x_{0ij} + \beta_1 x_{1ij} + e_{ij}$
Between-school model: $\beta_{0j} = \beta_0 + u_j$
Idea: between schools, the intercept varies; between and within schools, the slope is the same.
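A minimal sketch of what this formulation implies for the fitted lines. The coefficient values and school effects below are hypothetical, chosen only to show that two schools differ by a constant gap ($u_1 - u_2$) at every value of the covariate, which is what "parallel lines" means. It takes $x_{0ij} = 1$ (a constant), as is usual for an intercept term.

```python
def school_line(beta0, beta1, u_j):
    """Return the expected-response function for one school.

    Between-school model: beta_0j = beta0 + u_j (school-specific intercept).
    Within-school model:  E(y) = beta_0j + beta1 * x (common slope).
    """
    beta_0j = beta0 + u_j

    def predict(x):
        return beta_0j + beta1 * x

    return predict


# Hypothetical parameter values, for illustration only.
beta0, beta1 = 2.0, 0.5
line_school_1 = school_line(beta0, beta1, u_j=0.8)
line_school_2 = school_line(beta0, beta1, u_j=-0.3)

# The vertical gap between the two schools at several covariate values.
gaps = [line_school_1(x) - line_school_2(x) for x in (0.0, 5.0, 10.0)]
print(gaps)
```

A random slopes model would replace the common `beta1` with a school-specific slope, at which point the gap would change with x.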

42 Intra-Cluster Correlation
$\rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}$
This is the residual correlation of level one units within a level two unit after controlling for the effect of $x_{1ij}$.
In the example above, it is the residual school homogeneity after controlling for pupils' attainment at age 11.

43 Estimation - Fixed Effects Model
$y_{ij} = \mu + u_j + e_{ij}$
The parameters of the model are $\mu$, $u_j$ and $\sigma_e^2$. Use a standard ANOVA approach for the estimation.
Let $\bar{y}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} y_{ij}$, with $n = \sum_{j=1}^{J} n_j$, and a constraint on the $u_j$: $\sum_j n_j u_j = 0$ vs $\sum_j u_j = 0$.

44 Estimation - Fixed Effects Model
Under this model,
$\bar{y} = \frac{1}{n} \sum_{j=1}^{J} n_j \bar{y}_j = \frac{1}{n} \sum_{j=1}^{J} \sum_{i=1}^{n_j} y_{ij}$
is an unbiased estimator of $\mu$, and $\bar{y}_j - \bar{y}$ is an unbiased estimator of $u_j$.
ANOVA table:
Source             df     SS     MS
Between clusters   J-1    SSB    MSB = SSB/(J-1)
Within clusters    n-J    SSW    MSW = SSW/(n-J)
Total              n-1    SST
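The table entries can be computed directly from the clustered observations. The sketch below is a plain-Python illustration with hypothetical toy data; the function name `one_way_anova` and the dictionary keys are choices made here, not part of the course material.

```python
def one_way_anova(groups):
    """One-way ANOVA quantities for a list of clusters (lists of numbers)."""
    n = sum(len(g) for g in groups)          # total number of level one units
    J = len(groups)                          # number of level two units
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Between-cluster and within-cluster sums of squares.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)

    msb = ssb / (J - 1)
    msw = ssw / (n - J)
    return {
        "mu_hat": grand_mean,                      # estimate of mu
        "u_hat": [m - grand_mean for m in means],  # estimates of u_j
        "SSB": ssb, "SSW": ssw, "SST": ssb + ssw,
        "MSB": msb, "MSW": msw, "F": msb / msw,
    }


# Hypothetical toy data: three clusters of three observations each.
table = one_way_anova([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(table)
```

The F ratio returned here is the statistic compared with the critical value of the F distribution on (J-1, n-J) degrees of freedom.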

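45 Estimation - Random Effects Model
$y_{ij} = \mu + u_j + e_{ij}$
The parameters of the model are $\mu$, $\sigma_e^2$ and $\mathrm{Var}(u_j) = \sigma_u^2$.
$E(\bar{y}) = \mu$, so $\hat{\mu} = \bar{y}$.
$E(\mathrm{MSW}) = \sigma_e^2$, so $\hat{\sigma}_e^2 = \mathrm{MSW}$.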

46 Estimation - Random Effects Model
Also,
$\mathrm{Var}(\bar{y}_j) = \mathrm{Var}(E(\bar{y}_j \mid u_j)) + E(\mathrm{Var}(\bar{y}_j \mid u_j)) = \sigma_u^2 + \frac{\sigma_e^2}{n_j}$
To get an estimator for $\sigma_u^2$, we need to combine across the groups, taking into account that we have unequal within-group sample sizes:
$\hat{\sigma}_u^2 = \frac{\mathrm{MSB} - \mathrm{MSW}}{\left(n - \sum_{j=1}^{J} n_j^2 / n\right) / (J - 1)}$

47 Estimation - Random Effects Model
Problem: $\hat{\sigma}_u^2 < 0$ if MSB < MSW!
This happens in practice if $n_j$ is small (not uncommon). Usually, negative estimates are set to zero, but the result is then no longer an unbiased estimator of $\sigma_u^2$.
These are referred to as ANOVA estimators. There are many alternative estimators, and under normality assumptions we can get slightly more efficient ML estimators.
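A sketch of the ANOVA estimators with the usual truncation at zero. The mean squares 4.45 and 10.81 are the ones quoted in the three-village fixed effects example; the cluster sizes passed in below are hypothetical (the village sizes are not legible in this transcript), so the numerical result is illustrative only. The divisor is the standard unbalanced-design adjustment $n_0 = (n - \sum_j n_j^2 / n)/(J - 1)$, which reduces to the common cluster size when all $n_j$ are equal.

```python
def anova_variance_components(msb, msw, sizes):
    """ANOVA (method of moments) estimators for the one-way random effects model.

    msb, msw : between- and within-cluster mean squares
    sizes    : list of cluster sizes n_j
    """
    n = sum(sizes)
    J = len(sizes)
    # Effective cluster size for unequal n_j.
    n0 = (n - sum(s * s for s in sizes) / n) / (J - 1)
    raw = (msb - msw) / n0          # can be negative when MSB < MSW
    return {
        "n0": n0,
        "sigma2_e": msw,            # sigma_e^2 is estimated by MSW
        "sigma2_u_raw": raw,        # unbiased but possibly negative
        "sigma2_u": max(raw, 0.0),  # truncated at zero (no longer unbiased)
    }


# Mean squares from the three-village example; cluster sizes are assumed.
est = anova_variance_components(msb=4.45, msw=10.81, sizes=[8, 8, 8])
print(est)
```

With MSB below MSW the raw estimate is negative, so the truncated estimate is zero and the implied intraclass correlation is zero as well.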

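48 Estimation - RIM
Iterative Generalised Least Squares (IGLS):
1. Estimate $\hat{\beta}_{GLS}$, assuming some initial values for the variance parameters $\hat{\sigma}_u^2, \hat{\sigma}_e^2$.
2. Estimate $\hat{\sigma}_u^2, \hat{\sigma}_e^2$ by GLS, based on the current values of $\hat{\beta}$.
3. Re-estimate $\hat{\beta}_{GLS}$, based on the current values of $\hat{\sigma}_u^2, \hat{\sigma}_e^2$.
4. Re-estimate $\hat{\sigma}_u^2, \hat{\sigma}_e^2$ by GLS, based on the current values of $\hat{\beta}$.
5. Repeat the process until some convergence criterion is satisfied.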

49 Properties of Estimators
If the $u$'s and $e$'s have a normal distribution, then:
- IGLS = Maximum Likelihood.
- All estimators are asymptotically efficient ($J \to \infty$).
If not, then:
- IGLS gives consistent estimators.
- The estimator $\hat{\mu}$ is asymptotically efficient.
- The estimators $\hat{\sigma}_u^2$ and $\hat{\sigma}_e^2$ are NOT asymptotically efficient!
PROBLEM: $\hat{\sigma}_u^2$ (IGLS) can be negative!

50 Conclusions
- We can easily write down the analytic form of the ANOVA-type estimators for a simple model, BUT ANOVA-type estimators are NOT fully efficient.
- IGLS estimators are ML estimators under certain assumptions, with associated estimated standard errors.
- IGLS provides a framework within which a wide class of models can be incorporated.

51 Example - Linear Regression (OLS)
$ceb_{ij} \sim N(XB, \Omega)$
$ceb_{ij} = \beta_{0i}\,const$ + [...](0.004) $age_{ij}$
$\beta_{0i} = 3.957(0.039) + e_{0ij}$
$[e_{0ij}] \sim N(0, \Omega_e)$: $\Omega_e = [3.575(0.103)]$
-2*loglikelihood(IGLS) = [...] (401 of 401 cases in use).
This is a base model to build from!

52 Example - Random Intercept Model
$ceb_{ij} \sim N(XB, \Omega)$
$ceb_{ij} = \beta_{0ij}\,const$ + [...](0.004) $age_{ij}$
$\beta_{0ij} = 3.943(0.050) + u_{0j} + e_{0ij}$
$[u_{0j}] \sim N(0, \Omega_u)$: $\Omega_u = [0.257(0.056)]$
$[e_{0ij}] \sim N(0, \Omega_e)$: $\Omega_e = [3.314(0.101)]$
-2*loglikelihood(IGLS) = [...] (401 of 401 cases in use).

53 Interpretation
Fixed Part / Random Part
$\rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2} = 0.072$
The residual correlation of women within a community is 0.072 or, alternatively, 7.2% of the total residual variation is due to the community.
Within the population of communities, about 95% would have an expected number of children ever born to a woman of average age between 3 and 5.
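Both statements can be checked with a few lines of Python. The inputs are the random intercept model estimates: intercept 3.943 and level one variance 3.314 as printed, and a level two variance of 0.257, which is an assumption here (the printed digits are damaged, and 0.257 is the value that reproduces the reported 7.2%). The function names are introduced for this sketch only.

```python
import math


def intraclass_correlation(var_u, var_e):
    """Share of the total residual variance that lies between clusters."""
    return var_u / (var_u + var_e)


def coverage_interval(mean, var_u, z=1.96):
    """Range containing about 95% of cluster-specific intercepts,
    assuming the cluster effects u_j are normally distributed."""
    half_width = z * math.sqrt(var_u)
    return mean - half_width, mean + half_width


# Assumed estimates from the random intercept model (see note above).
var_u, var_e, intercept = 0.257, 3.314, 3.943

rho = intraclass_correlation(var_u, var_e)
low, high = coverage_interval(intercept, var_u)
print(round(rho, 3), round(low, 2), round(high, 2))
```

The interval uses the square root of the level two variance, not the standard error of the intercept, because it describes variation between communities rather than uncertainty about the mean.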

54 Interpretation
[Figure: ceb against age, with one parallel fitted line per community.]
Variation in the intercept is generated by the random effect term!
Slopes are constant!

55 Random Slopes Model
- Is the assumption of a constant slope across villages true? Is there an interaction between age and village?
- Extend our random intercepts model to also allow the slope on age to vary.
- Normality assumptions as before.
- The $u$'s and $e$'s are independent.

56 Random Slopes Model
[Figure: ceb against age, with village lines that differ in both intercept and slope.]
Within a village, the slope and intercept are NOT independent. In the graph, the correlation is positive: clusters with a higher than average $y$ also appear to have a faster increase in $y$ as age increases.

57 Example - Random Slopes Model
$ceb_{ij} \sim N(XB, \Omega)$
$ceb_{ij} = \beta_{0ij}\,const + \beta_{1j}\,age_{ij}$
$\beta_{0ij} = 3.978(0.05) + u_{0j} + e_{0ij}$
$\beta_{1j} = 0.41(0.005) + u_{1j}$
$[u_{0j}, u_{1j}] \sim N(0, \Omega_u)$, with lower triangle of $\Omega_u$: 0.33(0.061); [...](0.005), 0.00(0.001)
$[e_{0ij}] \sim N(0, \Omega_e)$: $\Omega_e = [3.11(0.098)]$
-2*loglikelihood(IGLS) = [...] (401 of 401 cases in use).

58 Further Developments
- We can expand the model to include additional explanatory variables: EDUC, HINDU, FIND.
- We can also expand the model to include additional levels: women within villages within districts, for example.
- Do a good initial analysis outside MLwiN.
- Get the basic multilevel structure (fit the model with age and see whether we need two or three levels).
- Develop the basic model, adding variables based on the initial modelling.
- Finally, look for random slopes based on substantive ideas.

59 Further Developments: Multilevel Logistic Regression
For multilevel models with non-normal (e.g. binary) responses, estimation is not quite so simple:
- Numerical integration: difficult to implement and only really feasible for models with a simple hierarchical structure.
- Markov chain Monte Carlo (MCMC) methods: computationally intensive methods based on Bayesian inference.
- Iterative bootstrap: a computationally intensive method for bias correction.

60 Example - Random Slopes Model
$ceb_{ijk} \sim N(XB, \Omega)$
$ceb_{ijk} = \beta_{0ijk}\,const + \beta_{1j}\,age_{ijk}$ + [...](0.107) $lower_{ijk}$ + [...](0.18) $upper_{ijk}$ + [...](0.108) $sec_{ijk}$ + $\beta_5\,hindu_{ijk}$
$\beta_{0ijk} = 4.084(0.068) + v_{0k} + u_{0jk} + e_{0ijk}$
$\beta_{1j} = 0.40(0.005) + u_{1jk}$
$\beta_{5j} = -0.44(0.14) + u_{5jk}$
-2*loglikelihood(IGLS) = [...] (401 of 401 cases in use).

61 Multilevel Model - PEF
Model 1. $y_{ijk} = \mu + u_{jk} + u_k + e_{ijk}$
Model 2. $y_{ijk} = \mu + \beta x_j + u_{jk} + u_k + e_{ijk}$
. xtmixed wm || id:, mle (b4)
. xtmixed w method || id: || method:, mle
[Stata output: fixed-part table for w (Coef., Std. Err., z, P>|z|, 95% Conf. Int.) with rows method and _cons; random-effects table (Estimate, Std. Err., 95% Conf. Int.) with rows id: sd(_cons), method: sd(_cons) and sd(Residual).]

62 Multilevel Model - PEF
Model 2. $y_{ijk} = \mu + \beta x_j + u_{jk} + u_k + e_{ijk}$
Model 3. $y_{ijk} = \mu + \beta x_j + u_k + e_{ijk}$
. lrtest model3 model2
LR chibar2(01) = 8.68, Prob > chibar2 = [...]
Model 4. $y_{ijk} = \mu + u_{jk} + u_k + e_{ijk}$
[Stata output: fixed-part table for w (Coef., Std. Err., z, P>|z|, 95% Conf. Int.) with row _cons; random-effects table (Estimate, Std. Err., 95% Conf. Int.) with rows id: sd(_cons), method: sd(_cons) and sd(Residual).]

63 Multilevel Model - PEF
Corr(method, subject) = [...] = 0.97
Corr(subject) = [...] = 0.94
There is no evidence of systematic bias between the methods.
The methods appear to have good test-retest reliability.

64 Binary Response Data
$y_{ij}$ = 0 or 1. Examples include being dead or alive, agreeing or disagreeing with a statement, and succeeding or failing to accomplish something.
Assume $y_{ij} \sim \mathrm{Bin}(n_{ij}, \pi_{ij})$, with $n_{ij} = 1$ for binary data,
where $\pi_{ij} = \Pr(y_{ij} = 1) = E(y_{ij})$ and $\mathrm{Var}(y_{ij}) = \pi_{ij}(1 - \pi_{ij})/n_{ij}$.

65 Binomial Response Data
$y_{ij}$ is a proportion, e.g. the unemployment rate in neighbourhood $i$ in region $j$.
$n_{ij}$ is the denominator for the proportion, e.g. the number eligible to work in neighbourhood $i$ in region $j$.
Models for binary data may be applied to proportions.

66 Binomial Response Data
We are interested in the expectation (mean) of the response as a function of the covariate:
$E(y_i \mid x_i) = \Pr(y_i = 1 \mid x_i)$
In linear regression, the conditional expectation of the response is modelled as a linear function of the covariate:
$E(y_i \mid x_i) = \beta_1 + \beta_2 x_i$
For dichotomous responses, this approach may be problematic because the probability must lie between 0 and 1, whereas a regression line increases (or decreases) indefinitely as the covariate increases (or decreases).

67 Binomial Response Data
Instead, a nonlinear function is specified, in one of two equivalent ways:
$\Pr(y_i = 1 \mid x_i) = h(\beta_1 + \beta_2 x_i)$
$g\{\Pr(y_i = 1 \mid x_i)\} = \beta_1 + \beta_2 x_i$
where $h(\cdot)$ is the inverse of the function $g(\cdot)$. Here $g(\cdot)$ is known as the link function and $h(\cdot)$ as the inverse link function.
A generalized linear model has three components: the linear predictor, the link function and the distribution of the response given the covariates. For dichotomous responses, the distribution is specified as Bernoulli($\pi_i$).

68 Binomial Response Data
Typical choices of link function are the logit or probit links. For the logit link,
$\Pr(y_i = 1 \mid x_i) = h(\beta_1 + \beta_2 x_i) = \frac{\exp(\beta_1 + \beta_2 x_i)}{1 + \exp(\beta_1 + \beta_2 x_i)}$
$\mathrm{logit}\{\Pr(y_i = 1 \mid x_i)\} = \ln\left\{\frac{\Pr(y_i = 1 \mid x_i)}{1 - \Pr(y_i = 1 \mid x_i)}\right\} = \beta_1 + \beta_2 x_i$
The term in curly braces represents the odds that $y_i = 1$ given $x_i$: the expected number of 1 responses for each 0 response.
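The link and inverse link are easy to verify numerically. In the sketch below the coefficients $\beta_1 = -1$ and $\beta_2 = 0.5$ are hypothetical values chosen for illustration; the function names are also choices made here.

```python
import math


def logit(p):
    """Link function g: log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Inverse link h: maps any linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients for the linear predictor beta1 + beta2 * x.
beta1, beta2 = -1.0, 0.5

for x in (-10.0, 0.0, 2.0, 10.0):
    eta = beta1 + beta2 * x
    p = inv_logit(eta)
    odds = p / (1.0 - p)  # expected number of 1 responses per 0 response
    print(x, round(p, 4), round(odds, 4))
```

However far the covariate moves, the fitted probability stays strictly between 0 and 1, which is the property the linear specification lacks.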

69 Guatemala Immunization Data
Pebley, Goldman and Rodriguez (1996). Prenatal and delivery care and childhood immunization in Guatemala: Do family and community matter? Demography, 33: [...]
National Survey of Maternal and Child Health (ENSMI), conducted in Guatemala in [...]
A nationally representative sample of 5,160 women aged between 15 and 44 was interviewed.

70 Guatemala Immunization Data
Beginning in 1986, the Guatemalan government undertook a series of campaigns to immunize the population against major childhood diseases.
The data cover 2,159 children aged 1-4 years for whom we have community data on health services and who received at least one immunization during the campaign.
Response variable: whether the child received the full set of immunizations.
Children $i$ are nested in mothers $j$, who are nested in communities $k$.

71 Guatemala Immunization Data
Level 1 (child):
- $y_{ijk}$: indicator variable, child received the full set of immunizations.
- $x_{2ijk}$: dummy variable, child is at least 2 years old and hence eligible for the full set of immunizations.
Level 2 (mother):
- Mom: identifier for mothers ($j$).
- Ethnicity: dummy variables with baseline Latino. $x_{3jk}$: mother is indigenous, not Spanish speaking. $x_{4jk}$: mother is indigenous, Spanish speaking.
- Mother's education: dummy variables with baseline no education. $x_{5jk}$: mother has primary education. $x_{6jk}$: mother has secondary education.

72 Guatemala Immunization Data
- Husband's education: dummy variables with baseline no education. $x_{7jk}$: husband has primary education. $x_{8jk}$: husband has secondary education. $x_{9jk}$: husband's education not known.
Level 3 (community):
- Cluster: identifier for communities ($k$).
- $x_{10k}$: dummy variable for the community being rural.
- $x_{11k}$: percentage of the population that was indigenous in 1981.

73 Guatemala Immunization Data
Three-level random intercept logit model:
$\mathrm{logit}\{\Pr(y_{ijk} = 1)\} = \beta_1 + \beta_2 x_{2ijk} + \cdots + \beta_{11} x_{11k} + u_{jk} + u_k$
$u_{jk} \sim N(0, \sigma_m^2)$ and $u_k \sim N(0, \sigma_c^2)$.
As usual, the random effects are assumed independent of each other and across clusters; $u_k$ is independent across units as well.
In Stata, we use gllamm. gllamm uses numerical integration and is recommended for multilevel generalized linear models rather than for simple linear models. xtreg and xtmixed exploit the closed form of the likelihood for random effects models with normally distributed continuous responses.

74 Guatemala Immunization Data
. gllamm immun kid2p indnospa indspa momedpri momedsec husedpri husedsec huseddk rural pcind81, family(binomial) link(logit) i(mom cluster) nip(5)
[Stata output: coefficient table for immun (Coef., Std. Err., z, P>|z|, 95% Conf. Interval) with rows kid2p, indnospa, indspa, momedpri, momedsec, husedpri, husedsec, huseddk, rural, pcind81 and _cons.]

75 Guatemala Immunization Data

76 Final Remarks
- Data in the real world have structure that tends to violate the assumptions of independence and homogeneity of the residual variance.
- The structures under which these data were generated depend on the data collection mechanism and on the natural structures within the population.
- Standard analyses assume independence and estimate the standard errors of model parameters accordingly. If observations within clusters are positively correlated, the standard errors will be underestimated.

77 Final Remarks
The problem with aggregate analysis, known as the ecological fallacy, is that the relationship between aggregate-level variables is NOT necessarily the same as the relationship between the individual-level variables; correlations tend to increase as you aggregate levels.

78 References
Goldstein, H. (2003). Multilevel Statistical Models. 3rd edn. London: Hodder Arnold.
Leyland, A.H. and Goldstein, H. (Eds.) (2001). Multilevel Modelling of Health Statistics. Wiley.
Pfeffermann, D., Skinner, C.J., Holmes, D.J., Goldstein, H. and Rasbash, J. (1998b). Weighting for unequal selection probabilities in multilevel models. Journal of the Royal Statistical Society B, 60, [...]
Rabe-Hesketh, S. and Skrondal, A. (2008). Multilevel and Longitudinal Modeling Using Stata. Stata Press.
Snijders, T.A. and Bosker, R.J. (1999). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modelling. Thousand Oaks: Sage Publications.

79 Thank you!
