Descriptive 1/76 - Simulating MLM
Paul E. Johnson
Department of Political Science
Center for Research Methods and Data Analysis, University of Kansas
2015
Descriptive 2/76 - Outline
1. Orientation: MLM
2. Explore Simulated Data
3. Case 2: A Larger Variance Component
Descriptive 3/76 - Introduction
- Manufacture example data sets with random intercepts
- Explore visual manifestations of the related problems
Descriptive 4/76 - Orientation: MLM - Outline
1. Orientation: MLM
2. Explore Simulated Data
3. Case 2: A Larger Variance Component
Descriptive 5/76 - Orientation: MLM - Ordinary Regression
An ordinary regression model:

\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ \vdots \\ y_N \end{bmatrix}
=
\begin{bmatrix} \beta_0 \\ \beta_0 \\ \beta_0 \\ \beta_0 \\ \vdots \\ \beta_0 \end{bmatrix}
+
\begin{bmatrix} X1_{1} \beta_1 \\ X1_{2} \beta_1 \\ X1_{3} \beta_1 \\ X1_{4} \beta_1 \\ \vdots \\ X1_{N} \beta_1 \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \vdots \\ \varepsilon_N \end{bmatrix}
\quad (1)
Descriptive 6/76 - Orientation: MLM - Shocks b_1, b_2, ... affect sets of rows

\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ \vdots \\ y_N \end{bmatrix}
=
\begin{bmatrix} \beta_0 \\ \beta_0 \\ \beta_0 \\ \beta_0 \\ \beta_0 \\ \vdots \\ \beta_0 \end{bmatrix}
+
\begin{bmatrix} X1_{1} \beta_1 \\ X1_{2} \beta_1 \\ X1_{3} \beta_1 \\ X1_{4} \beta_1 \\ X1_{5} \beta_1 \\ \vdots \\ X1_{N} \beta_1 \end{bmatrix}
+
Z \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_M \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \vdots \\ \varepsilon_N \end{bmatrix}
\quad (2)-(3)

Z is a matrix of contrast variables:

y = X \beta + Z b + \varepsilon \quad (4)
Descriptive 7/76 - Orientation: MLM - The ij-Subscripted Version (i = level 2, j = level 1)

\begin{bmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{21} \\ y_{22} \\ \vdots \\ y_{M n_i} \end{bmatrix}
=
\begin{bmatrix} \beta_0 \\ \beta_0 \\ \beta_0 \\ \beta_0 \\ \beta_0 \\ \vdots \\ \beta_0 \end{bmatrix}
+
\begin{bmatrix} X1_{11} \beta_1 \\ X1_{12} \beta_1 \\ X1_{13} \beta_1 \\ X1_{21} \beta_1 \\ X1_{22} \beta_1 \\ \vdots \\ X1_{M n_i} \beta_1 \end{bmatrix}
+
Z \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_M \end{bmatrix}
+
\begin{bmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{13} \\ \varepsilon_{21} \\ \varepsilon_{22} \\ \vdots \\ \varepsilon_{M n_i} \end{bmatrix}
\quad (5)-(6)
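The compact form y = Xβ + Zb + ε in equation (4) can be made concrete by building Z explicitly. A minimal numpy sketch (the slides use R and keep Z abstract; the sizes M = 3, n = 2 and the parameter values are my own illustrative assumptions): Z has one 0/1 indicator column per group, so Zb copies each group's shock onto that group's rows.

```python
import numpy as np

# Illustrative sketch of Z from y = X beta + Z b + eps (equation (4)).
# M = 3 groups with n = 2 observations each; values are assumptions.
rng = np.random.default_rng(0)
M, n = 3, 2
N = M * n
groups = np.repeat(np.arange(M), n)        # group index for each row: 0,0,1,1,2,2
Z = np.zeros((N, M))
Z[np.arange(N), groups] = 1.0              # one 0/1 indicator column per group
beta = np.array([3.0, 0.5])                # (beta0, beta1), as in the slides
X = np.column_stack([np.ones(N), rng.normal(25, 5, N)])  # intercept column + X1
b = rng.normal(0, 2.0, M)                  # random intercepts b_1, ..., b_M
eps = rng.normal(0, 4.0, N)
y = X @ beta + Z @ b + eps
# Z @ b just copies each group's shock onto that group's rows:
assert np.allclose(Z @ b, b[groups])
```

The assertion at the end is the whole point: multiplying by Z is bookkeeping that assigns the level-2 shock b_i to every row in group i.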
Descriptive 8/76 - Explore Simulated Data - Outline
1. Orientation: MLM
2. Explore Simulated Data
3. Case 2: A Larger Variance Component
Descriptive 9/76 - Explore Simulated Data - Generate Data - Data Generating Process
I can't understand any statistical problem until I can make up data to represent it. On GitHub, I started a rather large simulation exercise called mlmsim. It is in a repository called R-crmda. It is not done yet, but I'll share some ideas here.

I *think* the point of the simulations included here is the following:
- If we have lots of groups, they all have the same number of observations, and the distribution of X is the same in each group, then:
  - Analysis with ordinary one-level regression is not so horribly dangerous
  - Scatterplots are not entirely deceptive
- However, we can see situations where:
  - We might be badly deceived by the data
  - The MLM will fix the problem only sometimes, under conditions we can state
Descriptive 10/76 - Explore Simulated Data - Generate Data - Simple simulation code
This generates data for M = 10 sets of level-2 groupings, with n = 3 observations per group.

    gen1 <- function(beta = c(3, 0.5), xbari = 25, xsdi = 5, xsd = 4,
                     M = 10, n = 3, bsd = 2, esd = 4) {
        xbars <- rnorm(M, m = xbari, sd = xsdi)
        dat <- data.frame(i = rep(1:M, each = n))
        dat$x <- rep(xbars, each = n) + rnorm(M*n, m = 0, sd = xsd)
        b <- rnorm(M, m = 0, bsd)
        dat$b <- rep(b, each = n)
        error <- rnorm(M*n, m = 0, sd = esd)
        dat$ynob <- beta[1] + beta[2] * dat$x + error
        dat$ynoe <- beta[1] + beta[2] * dat$x + dat$b
        dat$y <- dat$ynoe + rnorm(M*n, m = 0, sd = esd)
        list(dat = dat, b = b)
    }

In case you saw a previous version of these notes, I've converted a script into a function so I can draw samples over and over later (if I want to).
Descriptive 11/76 - Explore Simulated Data - Generate Data - Simple simulation code...
Note: it is necessary to specify the mean & std. dev. of x for each group (xbari, xsdi), as well as the value of b_i.

Summary of arguments:
- beta: vector of fixed-effect coefficients (intercept & slope)
- xbari: the expected value of xbar_i
- xsdi: the standard deviation of xbar_i
- xsd: the standard deviation of x_ij (diversity within groups)
- M: number of level-2 groupings
- n: number of observations within each group
- bsd: the standard deviation of b_i (diversity of group intercepts)
- esd: the standard deviation of the error term ε_ij

If we want all of the level-2 groups to have the same distribution on x, then we set xsdi = 0.
Descriptive 12/76 - Explore Simulated Data - Generate Data - Simple simulation code...
I drew the b_i separately from a normal, b_i ~ N(0, σ_b²), but we could have made this more elaborate by writing out a covariance matrix and using an MVN random draw. We don't need to do that here because the univariate draw already gives b_i ~ N(0, bsd²).
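For the record, the "more elaborate" multivariate option can be sketched as follows (Python/numpy rather than the slides' R; the standard deviations and the correlation rho are assumed values for illustration only). Correlated random intercepts and slopes come from a 2x2 covariance matrix:

```python
import numpy as np

# Sketch of drawing correlated random effects (intercept b0_i, slope b1_i)
# from a multivariate normal. sd_b0, sd_b1, rho are illustrative assumptions.
rng = np.random.default_rng(0)
M = 10000
sd_b0, sd_b1, rho = 2.0, 0.4, 0.5
Sigma = np.array([[sd_b0**2,            rho * sd_b0 * sd_b1],
                  [rho * sd_b0 * sd_b1, sd_b1**2]])
bmat = rng.multivariate_normal([0.0, 0.0], Sigma, size=M)  # columns: b0_i, b1_i
r = np.corrcoef(bmat[:, 0], bmat[:, 1])[0, 1]
print(round(r, 2))   # sample correlation, close to rho = 0.5
```

With rho = 0, each column reduces to the independent univariate draws the slides actually use.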
Descriptive 13/76 - Outline
1. Orientation: MLM
2. Explore Simulated Data
3. Case 2: A Larger Variance Component
Descriptive 14/76 - Draw One Simulated Sample
Let's suppose there are 10 groups of 3 each.

    M <- 10; n <- 3; beta0 <- 3; beta1 <- 0.5
    simdata <- gen1(M = M, n = 3, beta = c(beta0, beta1))
    dat <- simdata[["dat"]]
    b <- simdata[["b"]]

That uses the other default settings of gen1, so we expect the means of x to vary across groups, as xbar_i ~ N(25, 5²). Thus, the groups are not identically distributed. The observed values x_ij are N(xbar_i, 4²).
Descriptive 15/76 - Here's how it appears to a Naive Researcher
[Figure: pooled scatterplot of y against x]
Descriptive 16-25/76 - One Group At A Time
[Figures: ten scatterplots of y against x, highlighting each group's observations one group at a time]
Descriptive 26/76 - Fitted OLS regression (recall the true slope β1 = 0.5)

    m1 <- lm(y ~ x, data = dat)
    outreg(list("True Slope 0.5" = m1), tight = FALSE)

                 True Slope 0.5
                 Estimate (S.E.)
    (Intercept)          (3.520)
    x            0.398** (0.124)
    N            30
    RMSE
    R^2
    * p ≤ 0.05, ** p ≤ 0.01, *** p ≤ 0.001
Descriptive 27/76 - Fitted OLS regression (recall the true slope β1 = 0.5)
[Figure: scatterplot of y against x with predicted values and 0.95 confidence interval]
Descriptive 28/76 - Stop and Think About Regression Assumptions
The OLS error term is a blend of ε_ij and b_i:

y_ij = β0 + β1 X1_ij + (ε_ij + b_i)

Let e_ij = ε_ij + b_i.
- In OLS, we assert E[e_ij] = 0 (the expected value of the error is 0).
  - In an unconditional sense, that is correct: E[ε_ij] = 0 and E[b_i] = 0, as long as ε_ij and b_i are uncorrelated with x.
  - And remember: given b_i, E[e_ij | b_i] = b_i.
- Homoskedasticity: Var(e_ij) = σ²_ε (the variance of each error draw is the same for every ij), which is obviously untrue, since Var(e_ij) = Var(b_i) + σ²_ε.
- The rows are uncorrelated: Cov(e_ij, e_kl) = 0 for all i ≠ k or j ≠ l, which is obviously untrue when i = k (two observations from the same group share b_i).
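The broken no-correlation assumption is easy to verify by simulation. A Python sketch (my own check, not the slides' R code, using gen1's default bsd = 2 and esd = 4): two composite errors from the same group share b_i, so their covariance is about Var(b_i) = 4, while errors from different groups are uncorrelated.

```python
import numpy as np

# Check: composite errors e_ij = b_i + eps_ij covary within a group by
# Var(b_i) = bsd**2 = 4, and ~0 across groups. The implied intraclass
# correlation is 4 / (4 + 16) = 0.2. Parameter values mirror gen1's defaults.
rng = np.random.default_rng(0)
M = 5000                                   # many groups, two observations each
bsd, esd = 2.0, 4.0
b = rng.normal(0, bsd, M)
e1 = b + rng.normal(0, esd, M)             # error of observation 1 in each group
e2 = b + rng.normal(0, esd, M)             # error of observation 2, same group
within_cov = np.cov(e1, e2)[0, 1]          # estimates Var(b_i) = 4
across_cov = np.cov(e1, np.roll(e2, 1))[0, 1]   # pairs from different groups
print(round(within_cov, 2), round(across_cov, 2))
```

The within-group covariance lands near 4 and the across-group covariance near 0, which is exactly the Cov(e_ij, e_ik) ≠ 0 violation described above.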
Descriptive 29/76 - Parallel True Lines [Figure: the true group lines]
Descriptive 30/76 - The blue line is β0 + β1 x (i.e., no b_i) [Figure]
Descriptive 31/76 - Parallel True Lines with observations, but no ε_ij [Figure]
Descriptive 32/76 - Parallel True Lines with errors ε_ij [Figure]
Descriptive 33/76 - Parallel True Lines with observations and predicted OLS estimate [Figure]
Descriptive 34/76 - OLS estimate of slope [Figure]
Descriptive 35/76 - Let's add the MLM estimate, for comparison [Figure: OLS and MLM fitted lines]
Descriptive 36/76 - Repeat with fresh sample

    simdata <- gen1(M = M, n = n, beta = c(beta0, beta1))
    b <- simdata[["b"]]
    dat <- simdata[["dat"]]
Descriptive 37/76 - Repeat with fresh sample

    m1 <- lm(y ~ x, data = simdata[["dat"]])
    outreg(list("Run 2, True Slope 0.5" = m1), tight = FALSE)

                 Run 2, True Slope 0.5
                 Estimate (S.E.)
    (Intercept)           (3.309)
    x            0.561*** (0.134)
    N            30
    RMSE
    R^2
    * p ≤ 0.05, ** p ≤ 0.01, *** p ≤ 0.001
Descriptive 38/76 - Repeat with fresh sample
[Figure: scatterplot with OLS and MLM fitted lines]
Descriptive 39/76 - Repeat with another fresh sample

    simdata <- gen1(M = M, n = n, beta = c(beta0, beta1))
    b <- simdata[["b"]]
    dat <- simdata[["dat"]]
Descriptive 40/76 - Repeat with another fresh sample

    m1 <- lm(y ~ x, data = simdata[["dat"]])
    outreg(list("Run 2, True Slope 0.5" = m1), tight = FALSE)

                 Run 2, True Slope 0.5
                 Estimate (S.E.)
    (Intercept)          (3.632)
    x            0.442** (0.144)
    N            30
    RMSE
    R^2
    * p ≤ 0.05, ** p ≤ 0.01, *** p ≤ 0.001
Descriptive 41/76 - Repeat with another fresh sample
[Figure: scatterplot with OLS and MLM fitted lines]
Descriptive 42/76 - Repeat with 1000 fresh samples

    if (runsim) {
        betahet <- matrix(NA, nrow = 1000, ncol = 2)
        betanob <- matrix(NA, nrow = 1000, ncol = 2)
        betamlm <- matrix(NA, nrow = 1000, ncol = 2)
        set.seed(234234)
        for (i in 1:1000) {
            simdata <- gen1(M = M, beta = c(beta0, beta1))
            m1 <- lm(y ~ x, data = simdata[["dat"]])
            betahet[i, ] <- coef(m1)
            m2 <- lm(ynob ~ x, data = simdata[["dat"]])
            betanob[i, ] <- coef(m2)
            m3 <- lmer(y ~ x + (1 | i), data = simdata[["dat"]])
            betamlm[i, ] <- fixef(m3)
        }
        saveRDS(betahet, paste0(workingdata, "betahet.rds"))
        saveRDS(betanob, paste0(workingdata, "betanob.rds"))
        saveRDS(betamlm, paste0(workingdata, "betamlm.rds"))
    } else {
        betahet <- readRDS(paste0(workingdata, "betahet.rds"))
        betanob <- readRDS(paste0(workingdata, "betanob.rds"))
        betamlm <- readRDS(paste0(workingdata, "betamlm.rds"))
    }
Descriptive 43/76 - OLS Estimates: 1000 fresh samples [Figure: fitted OLS lines from the 1000 samples]
Descriptive 44/76 - Histograms of Intercept & Slope Estimates [Figure: histograms of betahet[, 1] and betahet[, 2]]
Descriptive 45/76 - Scatterplot of Intercept & Slope Estimates [Figure: estimates of intercept against estimates of slope]
Descriptive 46/76 - Wondered what that looks like with True OLS Data [Figure: intercept and slope estimates from the no-b data]
Descriptive 47/76 - Wondered how the MLM Compared [Figure] (Display not great)
Descriptive 48/76 - OLS versus MLM Slope Estimates [Figure: histograms of betahet[, 2] and betamlm[, 2]] (Display not great)
Descriptive 49/76 - OLS versus MLM Slope Estimates [Figure: overlaid OLS and MLM densities of the slope estimates]
Descriptive 50/76 - OLS versus MLM Intercept Estimates [Figure: overlaid OLS and MLM densities of the intercept estimates]
Descriptive 51/76 - What I Think I Learned From That
If the variance among the intercepts is not huge compared to the variance of the error term, then OLS gives pretty good estimates of the slope of the fixed effect.
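That claim can be checked with a quick Monte Carlo. A Python sketch of gen1's data-generating process (the parameter values mirror gen1's defaults; the helper name is mine, not from the slides): across 1000 samples, the pooled OLS slope averages out very close to the true β1 = 0.5.

```python
import numpy as np

# Monte Carlo sketch of gen1's DGP: beta = (3, 0.5), M = 10 groups, n = 3,
# bsd = 2, esd = 4 (an assumed translation of the slides' R defaults).
rng = np.random.default_rng(1)
beta0, beta1 = 3.0, 0.5
M, n, xbari, xsdi, xsd, bsd, esd = 10, 3, 25.0, 5.0, 4.0, 2.0, 4.0

def ols_slope():
    """One simulated sample -> the pooled OLS slope estimate."""
    xbars = rng.normal(xbari, xsdi, M)
    x = np.repeat(xbars, n) + rng.normal(0, xsd, M * n)
    b = np.repeat(rng.normal(0, bsd, M), n)        # random intercepts b_i
    y = beta0 + beta1 * x + b + rng.normal(0, esd, M * n)
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # OLS slope = cov(x,y)/var(x)

slopes = np.array([ols_slope() for _ in range(1000)])
print(round(slopes.mean(), 3))   # averages close to the true slope 0.5
```

With bsd small relative to esd, the b_i just fatten the error term; they do not tilt the slope, which is the lesson of this section.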
Descriptive 52/76 - Case 2: A Larger Variance Component - Repeat with fresh sample

    simdata <- gen1(M = M, n = n, beta = c(beta0, beta1), bsd = 20)
    b <- simdata[["b"]]
    dat <- simdata[["dat"]]
Descriptive 53/76 - Case 2: A Larger Variance Component - Repeat with fresh sample

    m1 <- lm(y ~ x, data = simdata[["dat"]])
    outreg(list("Run 2, True Slope 0.5" = m1), tight = FALSE)

                 Run 2, True Slope 0.5
                 Estimate (S.E.)
    (Intercept)          (10.005)
    x                    (0.391)
    N            30
    RMSE
    R^2
    * p ≤ 0.05, ** p ≤ 0.01, *** p ≤ 0.001
Descriptive 54/76 - Case 2: A Larger Variance Component - Repeat with fresh sample
[Figure: scatterplot with OLS and MLM fitted lines]
Descriptive 55/76 - Case 2: A Larger Variance Component - Repeat with another fresh sample

    simdata <- gen1(M = M, n = n, beta = c(beta0, beta1), bsd = 20)
    b <- simdata[["b"]]
    dat <- simdata[["dat"]]
Descriptive 56/76 - Case 2: A Larger Variance Component - Repeat with another fresh sample

    m1 <- lm(y ~ x, data = simdata[["dat"]])
    outreg(list("Run 2, True Slope 0.5" = m1), tight = FALSE)

                 Run 2, True Slope 0.5
                 Estimate (S.E.)
    (Intercept)          (8.133)
    x                    (0.341)
    N            30
    RMSE
    R^2
    * p ≤ 0.05, ** p ≤ 0.01, *** p ≤ 0.001
Descriptive 57/76 - Case 2: A Larger Variance Component - Repeat with another fresh sample
[Figure: scatterplot with OLS and MLM fitted lines]
Descriptive 58/76 - Case 2: A Larger Variance Component - Repeat with 1000 fresh samples

    if (runsim) {
        betahet <- matrix(NA, nrow = 1000, ncol = 2)
        betanob <- matrix(NA, nrow = 1000, ncol = 2)
        betamlm <- matrix(NA, nrow = 1000, ncol = 2)
        set.seed(234234)
        for (i in 1:1000) {
            simdata <- gen1(M = M, beta = c(beta0, beta1), bsd = 20)
            m1 <- lm(y ~ x, data = simdata[["dat"]])
            betahet[i, ] <- coef(m1)
            m2 <- lm(ynob ~ x, data = simdata[["dat"]])
            betanob[i, ] <- coef(m2)
            m3 <- lmer(y ~ x + (1 | i), data = simdata[["dat"]])
            betamlm[i, ] <- fixef(m3)
        }
        saveRDS(betahet, paste0(workingdata, "vbetahet.rds"))
        saveRDS(betanob, paste0(workingdata, "vbetanob.rds"))
        saveRDS(betamlm, paste0(workingdata, "vbetamlm.rds"))
    } else {
        betahet <- readRDS(paste0(workingdata, "vbetahet.rds"))
        betanob <- readRDS(paste0(workingdata, "vbetanob.rds"))
        betamlm <- readRDS(paste0(workingdata, "vbetamlm.rds"))
    }
Descriptive 59/76 - Case 2: A Larger Variance Component - Wondered how the MLM Compared [Figure] (Hmm. Blue and Red are different.)
Descriptive 60/76 - Case 2: A Larger Variance Component - OLS versus MLM Slope Estimates [Figure: histograms of betahet[, 2] and betamlm[, 2]] (I see a difference there.)
Descriptive 61/76 - Case 2: A Larger Variance Component - OLS versus MLM Slope Estimates [Figure: overlaid OLS and MLM densities of the slope estimates]
Descriptive 62/76 - Case 2: A Larger Variance Component - Unfair to compare Intercept Estimates (but still will)
[Figure: histograms of OLS intercept estimates (betahet[, 1]) and MLM intercept estimates (betamlm[, 1])]
Seems unfair: OLS is the wrong model (it supposed β0 was the intercept for all data points).
Descriptive 63/76 - Deceptive Data Generators
If b_i is uncorrelated with the average of x in the groups, then we expect the OLS estimate to be unbiased.
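And conversely, when b_i is correlated with the group means of x, pooled OLS is biased. A Python sketch (my own construction, simpler than the slides' gen2: here b_i is set exactly equal to xbar_i - 25, so the correlation is perfect): the pooled slope converges to roughly β1 + xsdi²/(xsdi² + xsd²) = 0.5 + 25/41 ≈ 1.11 instead of 0.5.

```python
import numpy as np

# Evil-genie sketch (an assumption-labeled illustration, not the slides' R):
# the group shock b_i moves one-for-one with the group mean xbar_i, so
# pooled OLS absorbs the shock into the slope estimate.
rng = np.random.default_rng(2)
beta0, beta1 = 3.0, 0.5
M, n, xbari, xsdi, xsd, esd = 200, 3, 25.0, 5.0, 4.0, 4.0

def ols_slope():
    xbars = rng.normal(xbari, xsdi, M)
    b = xbars - xbari                      # b_i perfectly correlated with xbar_i
    x = np.repeat(xbars, n) + rng.normal(0, xsd, M * n)
    y = beta0 + beta1 * x + np.repeat(b, n) + rng.normal(0, esd, M * n)
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slopes = np.array([ols_slope() for _ in range(200)])
# plim of the slope: beta1 + xsdi**2 / (xsdi**2 + xsd**2) = 0.5 + 25/41
print(round(slopes.mean(), 2))
```

The average slope sits near 1.11, far from 0.5, which is the deception this section is about: the between-group variation in x carries the shock along with it.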
Descriptive 64/76 - Deceptive Data Generators - Recall the Parallel True Lines
The true lines vary due to the random intercept.
[Figure: parallel true group lines]
Descriptive 65/76 - Deceptive Data Generators - Put the true values of y on there
This is still the pleasant scenario, where σ²_ε = 0.
[Figure: points lying exactly on the parallel true lines]
That makes regression easy!
Descriptive 66/76 - Deceptive Data Generators - Suppose the Data-Generating Genie is Evil
Correlate the mean of observed X with the values of b. I'm doing this by first drawing the values b_i and then creating x_ij = b_i + N(0, xsd).

    gen2 <- function(beta = c(3, 0.5), xbari = 25, xsdi = 5, xsd = 4,
                     M = 10, n = 3, bsd = 2, esd = 4) {
        b <- rnorm(M, m = 0, bsd)
        xbars <- rnorm(M, m = xbari, sd = xsdi)
        dat <- data.frame(i = rep(1:M, each = n))
        dat$x <- rep(xbars, each = n) + rnorm(M*n, m = 0, sd = xsd)
        dat$b <- rep(b, each = n)
        dat$ynoe <- beta[1] + beta[2] * dat$x + dat$b
        dat$y <- dat$ynoe + rnorm(M*n, m = 0, sd = esd)
        list(dat = dat, b = b)
    }
Descriptive 67/76 - Deceptive Data Generators - Suppose the Data-Generating Genie is Evil...

    beta0 <- 3; beta1 <- 0.5; xbari <- 5; xsdi <- 5; M <- 4; n <- 3; bsd <- 2; esd <- 4
    xbars <- 20 + rnorm(M, m = xbari, sd = xsdi)
    b <- rnorm(M, m = 0, bsd)
    dat <- data.frame(i = rep(1:M, each = n))
    dat$b <- rep(b, each = n)
    dat$x <- unlist(lapply(xbars, function(xbar) { xbar + rnorm(n, m = 0, sd = 4) }))
    dat$y <- beta0 + beta1 * dat$x + rnorm(M*n, m = 0, sd = esd) + dat$b

Now we draw a sample:

    M <- 10; beta0 <- 3; beta1 <- 0.5
    simdata <- gen2(M = M, beta = c(beta0, beta1), bsd = 20, xsdi = 10)
    dat <- simdata[["dat"]]
    b <- simdata[["b"]]
Descriptive 68/76 - Deceptive Data Generators - The Parallel True Lines [Figure: parallel true group lines]
Descriptive 69/76 - Deceptive Data Generators - Put the points on the lines: no ε_ij [Figure: observations lying exactly on the true lines]
Descriptive 70/76 - Deceptive Data Generators - Recall the Parallel True Lines
Especially when we have a lot of groups, the data cloud is filled out well and the OLS estimate of β1 won't be horrible.

    M <- 10; beta0 <- 3; beta1 <- 0.5
    simdata <- gen1(M = M, beta = c(beta0, beta1), xsdi = 0)
    dat <- simdata[["dat"]]
    b <- simdata[["b"]]
    m15 <- lm(y ~ x, data = dat)
    summary(m15)
Descriptive 71/76 - Deceptive Data Generators - Recall the Parallel True Lines...

    Call:
    lm(formula = y ~ x, data = dat)

    Residuals:
         Min       1Q   Median       3Q      Max

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)                                     *
    x
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error:  on 28 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic:  on 1 and 28 DF, p-value:
Descriptive 72/76 - Deceptive Data Generators - Recall the Parallel True Lines...
[Figure: scatterplot of y against x with fitted line]
Descriptive 73/76 - Deceptive Data Generators - Parallel True Lines
If b_i and the mean of x within each sub-group fall into order, then the data generating process may not be so misleading.

    gen2 <- function(beta = c(3, 0.5), xbari = 25, xsdi = 5, xsd = 4,
                     M = 10, n = 3, bsd = 2, esd = 4) {
        b <- rnorm(M, m = 0, bsd)
        xbars <- rnorm(M, m = xbari, sd = xsdi)
        dat <- data.frame(i = rep(1:M, each = n))
        dat$x <- rep(xbars, each = n) + rnorm(M*n, m = 0, sd = xsd)
        dat$b <- rep(b, each = n)
        dat$ynoe <- beta[1] + beta[2] * dat$x + dat$b
        dat$y <- dat$ynoe + rnorm(M*n, m = 0, sd = esd)
        list(dat = dat, b = b)
    }
Descriptive 74/76 - Deceptive Data Generators - Parallel True Lines...

    beta0 <- 3; beta1 <- 0.5; xbari <- 5; xsdi <- 5; M <- 4; n <- 3; bsd <- 2; esd <- 4
    xbars <- 20 + rnorm(M, m = xbari, sd = xsdi)
    b <- rnorm(M, m = 0, bsd)
    dat <- data.frame(i = rep(1:M, each = n))
    dat$b <- rep(b, each = n)
    dat$x <- unlist(lapply(xbars, function(xbar) { xbar + rnorm(n, m = 0, sd = 4) }))
    dat$y <- beta0 + beta1 * dat$x + rnorm(M*n, m = 0, sd = esd) + dat$b
Descriptive 75/76 - Deceptive Data Generators - Parallel True Lines

    beta0 <- 3; beta1 <- 0.5; xbari <- 25; xsdi <- 5; M <- 4; n <- 3; bsd <- 2; esd <- 4
    xbars <- rnorm(M, m = xbari, sd = xsdi)
    b <- rnorm(M, m = (xbari - xbars), bsd)
    dat <- data.frame(i = rep(1:M, each = n))
    dat$b <- rep(b, each = n)
    dat$x <- unlist(lapply(xbars, function(xbar) { xbar + rnorm(n, m = 0, sd = 4) }))
    dat$y <- beta0 + beta1 * dat$x + rnorm(M*n, m = 0, sd = esd) + dat$b
Descriptive 76/76 - Deceptive Data Generators - Special Tool: Dotplot
Suggested graphical representation in Pinheiro & Bates.
[Figure: dotplot of the observed outcome by group]
More informationPOL 681 Lecture Notes: Statistical Interactions
POL 681 Lecture Notes: Statistical Interactions 1 Preliminaries To this point, the linear models we have considered have all been interpreted in terms of additive relationships. That is, the relationship
More informationCoping with Additional Sources of Variation: ANCOVA and Random Effects
Coping with Additional Sources of Variation: ANCOVA and Random Effects 1/49 More Noise in Experiments & Observations Your fixed coefficients are not always so fixed Continuous variation between samples
More informationMultiple Regression Analysis
Multiple Regression Analysis y = β 0 + β 1 x 1 + β 2 x 2 +... β k x k + u 2. Inference 0 Assumptions of the Classical Linear Model (CLM)! So far, we know: 1. The mean and variance of the OLS estimators
More informationScatter plot of data from the study. Linear Regression
1 2 Linear Regression Scatter plot of data from the study. Consider a study to relate birthweight to the estriol level of pregnant women. The data is below. i Weight (g / 100) i Weight (g / 100) 1 7 25
More informationLECTURE 2 LINEAR REGRESSION MODEL AND OLS
SEPTEMBER 29, 2014 LECTURE 2 LINEAR REGRESSION MODEL AND OLS Definitions A common question in econometrics is to study the effect of one group of variables X i, usually called the regressors, on another
More informationLecture 24: Weighted and Generalized Least Squares
Lecture 24: Weighted and Generalized Least Squares 1 Weighted Least Squares When we use ordinary least squares to estimate linear regression, we minimize the mean squared error: MSE(b) = 1 n (Y i X i β)
More informationRegression Analysis in R
Regression Analysis in R 1 Purpose The purpose of this activity is to provide you with an understanding of regression analysis and to both develop and apply that knowledge to the use of the R statistical
More informationSTAT 420: Methods of Applied Statistics
STAT 420: Methods of Applied Statistics Model Diagnostics Transformation Shiwei Lan, Ph.D. Course website: http://shiwei.stat.illinois.edu/lectures/stat420.html August 15, 2018 Department
More informationINFERENCE FOR REGRESSION
CHAPTER 3 INFERENCE FOR REGRESSION OVERVIEW In Chapter 5 of the textbook, we first encountered regression. The assumptions that describe the regression model we use in this chapter are the following. We
More informationNature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference.
Understanding regression output from software Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals In 1966 Cyril Burt published a paper called The genetic determination of differences
More informationIntroduction and Background to Multilevel Analysis
Introduction and Background to Multilevel Analysis Dr. J. Kyle Roberts Southern Methodist University Simmons School of Education and Human Development Department of Teaching and Learning Background and
More informationThe linear model. Our models so far are linear. Change in Y due to change in X? See plots for: o age vs. ahe o carats vs.
8 Nonlinear effects Lots of effects in economics are nonlinear Examples Deal with these in two (sort of three) ways: o Polynomials o Logarithms o Interaction terms (sort of) 1 The linear model Our models
More informationIntroduction to Simple Linear Regression
Introduction to Simple Linear Regression 1. Regression Equation A simple linear regression (also known as a bivariate regression) is a linear equation describing the relationship between an explanatory
More informationChapter 8. Linear Regression. The Linear Model. Fat Versus Protein: An Example. The Linear Model (cont.) Residuals
Chapter 8 Linear Regression Copyright 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Slide 8-1 Copyright 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Fat Versus
More informationApplied Regression Analysis
Applied Regression Analysis Chapter 3 Multiple Linear Regression Hongcheng Li April, 6, 2013 Recall simple linear regression 1 Recall simple linear regression 2 Parameter Estimation 3 Interpretations of
More informationScatter plot of data from the study. Linear Regression
1 2 Linear Regression Scatter plot of data from the study. Consider a study to relate birthweight to the estriol level of pregnant women. The data is below. i Weight (g / 100) i Weight (g / 100) 1 7 25
More informationFinal Exam. Name: Solution:
Final Exam. Name: Instructions. Answer all questions on the exam. Open books, open notes, but no electronic devices. The first 13 problems are worth 5 points each. The rest are worth 1 point each. HW1.
More informationLeverage. the response is in line with the other values, or the high leverage has caused the fitted model to be pulled toward the observed response.
Leverage Some cases have high leverage, the potential to greatly affect the fit. These cases are outliers in the space of predictors. Often the residuals for these cases are not large because the response
More informationIntroduction to LMER. Andrew Zieffler
Introduction to LMER Andrew Zieffler Traditional Regression General form of the linear model where y i = 0 + 1 (x 1i )+ 2 (x 2i )+...+ p (x pi )+ i yi is the response for the i th individual (i = 1,...,
More informationTopic 16 Interval Estimation
Topic 16 Interval Estimation Additional Topics 1 / 9 Outline Linear Regression Interpretation of the Confidence Interval 2 / 9 Linear Regression For ordinary linear regression, we have given least squares
More informationWEIGHTED LEAST SQUARES. Model Assumptions for Weighted Least Squares: Recall: We can fit least squares estimates just assuming a linear mean function.
1 2 WEIGHTED LEAST SQUARES Recall: We can fit least squares estimates just assuming a linear mean function. Without the constant variance assumption, we can still conclude that the coefficient estimators
More informationSimple and Multiple Linear Regression
Sta. 113 Chapter 12 and 13 of Devore March 12, 2010 Table of contents 1 Simple Linear Regression 2 Model Simple Linear Regression A simple linear regression model is given by Y = β 0 + β 1 x + ɛ where
More informationRegression and the 2-Sample t
Regression and the 2-Sample t James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Regression and the 2-Sample t 1 / 44 Regression
More informationChapter 8 Handout: Interval Estimates and Hypothesis Testing
Chapter 8 Handout: Interval Estimates and Hypothesis esting Preview Clint s Assignment: aking Stock General Properties of the Ordinary Least Squares (OLS) Estimation Procedure Estimate Reliability: Interval
More informationExam Applied Statistical Regression. Good Luck!
Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.
More informationThe Classical Linear Regression Model
The Classical Linear Regression Model ME104: Linear Regression Analysis Kenneth Benoit August 14, 2012 CLRM: Basic Assumptions 1. Specification: Relationship between X and Y in the population is linear:
More informationWeek 3: Simple Linear Regression
Week 3: Simple Linear Regression Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ALL RIGHTS RESERVED 1 Outline
More informationConfidence Intervals, Testing and ANOVA Summary
Confidence Intervals, Testing and ANOVA Summary 1 One Sample Tests 1.1 One Sample z test: Mean (σ known) Let X 1,, X n a r.s. from N(µ, σ) or n > 30. Let The test statistic is H 0 : µ = µ 0. z = x µ 0
More informationProperties of the least squares estimates
Properties of the least squares estimates 2019-01-18 Warmup Let a and b be scalar constants, and X be a scalar random variable. Fill in the blanks E ax + b) = Var ax + b) = Goal Recall that the least squares
More informationRegression in R. Seth Margolis GradQuant May 31,
Regression in R Seth Margolis GradQuant May 31, 2018 1 GPA What is Regression Good For? Assessing relationships between variables This probably covers most of what you do 4 3.8 3.6 3.4 Person Intelligence
More informationSimple Linear Regression for the MPG Data
Simple Linear Regression for the MPG Data 2000 2500 3000 3500 15 20 25 30 35 40 45 Wgt MPG What do we do with the data? y i = MPG of i th car x i = Weight of i th car i =1,...,n n = Sample Size Exploratory
More informationHeteroskedasticity. Part VII. Heteroskedasticity
Part VII Heteroskedasticity As of Oct 15, 2015 1 Heteroskedasticity Consequences Heteroskedasticity-robust inference Testing for Heteroskedasticity Weighted Least Squares (WLS) Feasible generalized Least
More informationProbability Distributions & Sampling Distributions
GOV 2000 Section 4: Probability Distributions & Sampling Distributions Konstantin Kashin 1 Harvard University September 26, 2012 1 These notes and accompanying code draw on the notes from Molly Roberts,
More information7.0 Lesson Plan. Regression. Residuals
7.0 Lesson Plan Regression Residuals 1 7.1 More About Regression Recall the regression assumptions: 1. Each point (X i, Y i ) in the scatterplot satisfies: Y i = ax i + b + ɛ i where the ɛ i have a normal
More informationBig Data Analysis with Apache Spark UC#BERKELEY
Big Data Analysis with Apache Spark UC#BERKELEY This Lecture: Relation between Variables An association A trend» Positive association or Negative association A pattern» Could be any discernible shape»
More informationMultiple Regression and Regression Model Adequacy
Multiple Regression and Regression Model Adequacy Joseph J. Luczkovich, PhD February 14, 2014 Introduction Regression is a technique to mathematically model the linear association between two or more variables,
More informationGenerating OLS Results Manually via R
Generating OLS Results Manually via R Sujan Bandyopadhyay Statistical softwares and packages have made it extremely easy for people to run regression analyses. Packages like lm in R or the reg command
More informationSTAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression
STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression Rebecca Barter April 20, 2015 Fisher s Exact Test Fisher s Exact Test
More informationMatrices and vectors A matrix is a rectangular array of numbers. Here s an example: A =
Matrices and vectors A matrix is a rectangular array of numbers Here s an example: 23 14 17 A = 225 0 2 This matrix has dimensions 2 3 The number of rows is first, then the number of columns We can write
More informationEconomics 113. Simple Regression Assumptions. Simple Regression Derivation. Changing Units of Measurement. Nonlinear effects
Economics 113 Simple Regression Models Simple Regression Assumptions Simple Regression Derivation Changing Units of Measurement Nonlinear effects OLS and unbiased estimates Variance of the OLS estimates
More informationHypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n =
Hypothesis testing I I. What is hypothesis testing? [Note we re temporarily bouncing around in the book a lot! Things will settle down again in a week or so] - Exactly what it says. We develop a hypothesis,
More information