Developing multilevel models for analysing contextuality, heterogeneity and change using MLwiN 2.2
Volume 2
Kelvyn Jones and SV Subramanian



Preface to Volume 2

The purpose of these volumes is to provide a thorough account of how to implement multilevel models from a social science perspective in general and a geographic perspective in particular. We use the MLwiN software throughout. Volume 1 introduces the software and then provides an extended account of the Normal-theory two-level and three-level model estimated by maximum likelihood and by Bayesian MCMC analysis.

Volume 2 extends the analysis in a number of ways. First we consider discrete outcomes: we provide an account of how to analyse the outcome of being employed or not in a multilevel logistic model, and this is followed by the analysis of count data, where multilevel Poisson and NBD models are used to consider the nature of HIV variations in India. Second, the models are extended to repeated-measures panel data, with an extended consideration of the two-level growth model that focuses on the flexibility of the model to estimate heterogeneity and dependency; this is further extended to analyse age and cohort effects. A final chapter considers spatial models and the simultaneous modelling of space-time data. This remains a work in progress, as further chapters are planned on multivariate models with more than one outcome, on item-response models which can be used to develop measurement models for people and places, on models for segregation, and on the analysis of randomized cluster trials.

Contents

12. Logistic modelling of proportions
    Introduction
    The data on teenage employment in Glasgow
    Model 1: null random intercepts model
    Model 2: with fixed part terms for qualifications and gender
    Model 2b: changing estimation
    Model 3: modelling the cross-level interaction between gender, qualifications and adult unemployment
    Estimating the VPC
    Characterising the between-area effects with the Median Odds Ratio
    Characterising the effect of a higher-level predictor with the Interval Odds Ratio
    Comparing models: using the DIC estimated in MCMC
    Some answers

13. HIV in India: an example of a Poisson and an NBD model
    Introduction
    The data
    Defining the exposure
    Model 1: a single-level model for Gender
    Model 2: a single-level model for Gender and Age main effects
    Model 3: a single-level model for Gender and Age with interactions
    Model 4: a single-level model for Gender-Age interactions and Education
    Model 5: a two-level model for States
    Comparing alternative estimates of Model 5: a two-level model for States
    Model 6: between-State differences for Urban-Rural
    More efficient MCMC samplers
    Some answers

14. Longitudinal analysis of repeated measures data
    Introduction
    A conceptual overview of the random-effects, subject-specific approach
    Algebraic specification of random-effects models for repeated measures
    Some covariance structures for repeated measures data
    Estimating two- and three-level random intercepts and slopes models
    A digression on orthogonal polynomials
    Elaborating the random part of the model: accommodating temporal dependence
    Discrete outcomes: population average versus subject specific
    Subject-specific and population average inferences in practice
    Fixed versus random effects
    What we have learnt
    Answers to questions

15. Modelling longitudinal and cross-sectional effects
    Introduction
    Age, cohort and period in the Madeira Growth Study
    Alternative designs for studying change
    The accelerated longitudinal design of the Madeira Growth Study
    Specifying and estimating cohort effects
    Modelling age, sex and cohort
    Modelling a two-level model: Mundlak formulation
    Changing gender ideology in the UK
    Building the base model
    Including longitudinal and cohort effects
    Including Gender as a main effect and as interactions
    Age, period and cohorts?
    What we have learnt

16. The analysis of spatial and space-time models
    Introduction: what do we mean by adjacency; defining spatial neighbours
    Three types of spatial models
    Spatial lag dependence or autoregressive models
    Spatial residual dependence models
    Spatial heterogeneity models
    The spatial multiple membership model
    Applying the spatial multiple membership model
    Low birth weights in South Carolina
    Respiratory cancer deaths in Ohio counties: space-time modelling
    Self-rated health of the elderly in China
    What we have learnt

12. Logistic modelling of proportions

Introduction

This chapter is the first of two chapters concerned with the analysis of discrete outcomes, and in particular with models for when the response is a proportion. Substantively the model is concerned with the proportion of teenagers that are in employment and what individual characteristics influence this outcome. We also consider the degree of variation in small areas of Glasgow, and the extent to which adult unemployment relates differentially to teenage employment. The model is estimated as a logistic model with a binomial level-1 random term. As such it uses many of the procedures that have been covered in Volume 1, such as model specification, testing, the use of cross-level interactions, and the calculation of the VPC. The initial model is estimated by maximum likelihood procedures and later models by MCMC methods. The same basic model can also be used to estimate binary models.

The data on teenage employment in Glasgow

Retrieve the data:
    File on main menu
    Open worksheet employ.ws

The variables are:
    Postcode is the neighbourhood in Glasgow
    Cell is the element of the table for each postcode
    Gender is male or female
    Qualif is unqualified or qualified
    Employed is the count of employed teenagers in a cell
    Total is the number of employed and unemployed teenagers in a cell
    Adunemp is adult unemployment in the neighbourhood
    Proportion is Employed/Total
    Code is a categorical variable: 1 = unqualified males, 2 = unqualified females, 3 = qualified males, 4 = qualified females

Highlight the Names of the data for all variables and press the View button. Ensure the data are sorted, cells within postcodes:
    Data Manipulation on main menu
    Sort on Postcode and Cell, carry the rest, and put back into the original variables

Model 1: null random intercepts model

    Model on main menu
    Equations
    Click on y and change to Proportion
    Choose 2 levels: postcode as level 2, cell as level 1; Done
    Click on N (for Normal theory model) and change to Binomial distribution, then choose Logit link
    Click on the red (ie unspecified) n_ij inside the Binomial brackets and choose Total to be the binomial denominator (= number of trials)
    Click on B0 and choose the Constant; tick fixed effect; tick j(postcode) to allow it to vary over postcodes (it is not allowed to vary at cell level, as we are assuming that all variation at this level is pure binomial variation)
    Click on Nonlinear in the bottom toolbar; this controls specification and estimation: Use Defaults [this gives an exact Binomial distribution for level 1, 1st-order linearization and MQL estimation]; Done
    Click on the + on the bottom toolbar to reveal the full specification

At this point the equations window shows the full model. The variable Proportion employed in cell i of postcode j is specified to come from a Binomial distribution with an underlying probability π_ij. The logit of the underlying probability is related to a fixed effect, β0, and an allowed-to-vary effect, u_0j, which as usual is assumed to come from a Normal distribution.
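Written out, the display in the equations window corresponds to (a plain-text rendering; σ²u0 is the between-postcode variance):

    Proportion_ij ~ Binomial(Total_ij, π_ij)
    logit(π_ij) = β0 + u_0j
    u_0j ~ N(0, σ²u0)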

The level-1, cell variation is assumed to be pure binomial variation, in that it depends on the underlying probability and the total number of teenagers in a cell; it is not a parameter that has to be estimated. It is worth looking at the worksheet, as MLwiN will have created two variables in the background: Denom is the number of trials, that is Total in our case, while Bcons is a constant associated with the level-1 cell which is used in the calculation of the binomial weights; we can ignore this.

Before estimating, it is important to check the hierarchy:
    Model on main menu
    Hierarchy viewer

Question 1: Why the variability in the number of cells? Answers at the end of the chapter.

Before proceeding to estimation we can check the location of the non-linear macros for discrete data:
    Options on main menu
    Directories

MLwiN creates a small file during estimation which has to be written temporarily to the current directory; this therefore has to be a place where files can be written, so you may have to change your current directory to somewhere writable. Do this now.

After pressing Start the model should converge to the following results; click on the lower Estimates button to see the numerical values.

Question 2: Who is the constant? What is 1.176? What is 0.270? Does the log-odds of teenage employment vary over the city?

We can store the estimates of this model as follows:
    Equations window
    Click on Store model results
    Type One in the pane; OK

To see the results:
    Model on main menu
    Compare stored models

This brings up the results in tabular form; they can be copied as a tab-delimited text file to the clipboard and pasted into Microsoft Word. In Word, highlight the pasted text, then select Table, Insert, Table.

The log-odds are rather difficult to interpret, but we can change an estimate to a probability using the Customised predictions window:
    Model on main menu
    Customised predictions
    In the Setup window: Confidence 95; button on for Probabilities; tick Medians; tick Means
    At the bottom of the pane: Fill grid, then Predict
    Switch to Predictions: all results have been stored in the worksheet

The Setup window should look like

The Predictions window should look like this. The cluster-specific estimated probability is given by the median of 0.764, with a 95% confidence interval whose upper limit is 0.789, while the population-average values are very similar (0.755; CI: 0.73, 0.78). If we use Descriptive statistics on the main menu we can compare these with the simple mean of the raw probabilities and with the median rate of employment for teenagers in Glasgow districts.

Returning to the Setup window, we can additionally tick the coverage for level-2 postcodes and request the 95% coverage.
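As a cross-check outside MLwiN, the cluster-specific and population-average summaries can be approximated by simulation from the Model 1 estimates (β0 = 1.176, σ²u0 = 0.270 from the equations window); this is a minimal sketch of the idea, not the MLwiN implementation:

    import numpy as np

    rng = np.random.default_rng(1)
    b0, sigma2_u = 1.176, 0.270  # Model 1 estimates from the equations window

    def invlogit(x):
        return 1.0 / (1.0 + np.exp(-x))

    u = rng.normal(0.0, np.sqrt(sigma2_u), size=100_000)  # simulated postcode effects
    p = invlogit(b0 + u)                                  # area-specific probabilities

    print(invlogit(b0))                   # cluster-specific (median) probability, ~0.764
    print(p.mean())                       # population-average probability (0.755 reported above)
    print(np.percentile(p, [2.5, 97.5]))  # 95% coverage interval across areas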

Click Predict and then go to the Predictions sub-window. The estimated average teenage employment probability is 0.753, and a 95% coverage interval for Glasgow areas is also given. As these values are derived from simulation, you can expect slightly different values from these.

Model 2: with fixed part terms for qualifications and gender

Returning to the equations window we can now distinguish between different types of teenagers. Add a term using Code, with Unmale (unqualified male) as the base or reference category, so that the revised model after convergence is:

We can store the estimates of this model as Two using the Store button on the equations window:
    Model on main menu
    Compare stored models
This brings up the results in tabular form.

We can now calculate the probability for all four types of teenager:
    Model on main menu
    Customised predictions
    In the Setup window: Clear [gets rid of previous choices]
    Highlight Code and request Change Range
    Click on Category and tick each and every category of teenager (Unmale etc)
    Confidence 95; button on for Probabilities; tick Medians; tick Means
    At the bottom of the pane: Fill grid, then Predict

The completed Setup window is:

Switch to the Predictions tab, and the Predict window gives the estimated values. These can be copied and pasted into Word to form a table with columns code.pred, constant, median.pred, median.low, median.high, mean.pred, mean.low and mean.high, and a row for each of Unmale, Unfem, Qualmale and Qualfem.

The higher employment is found for qualified teenagers; this is most easily seen by plotting the results:
    Predictions sub-window
    Plot Grid
    Y is mean.pred, that is the population averages
    Tick 95% confidence intervals; button error bars
    X variable: tick code.pred

Apply. After some re-labelling of the graph we get the plot (it is in customised graphs display D1). The wider confidence bands for the unqualified reflect that there are fewer such teenagers.

Staying with this random-intercepts model, we can see the 95% coverage across Glasgow neighbourhoods for different types of teenagers:
    Model on main menu
    Customised predictions
    In the Setup window, tick coverage for postcode, and 95% coverage interval
    Predict
    Predictions sub-window

Across Glasgow the average probability of employment for unqualified males is estimated to be 0.628; the probabilities in the 95% worst and best areas are also given in the predictions.

Sometimes it is preferred to interpret results from a logit model as relative odds, that is, relative to some base or reference group. This can also be achieved in the Customised predictions window. First we have to estimate differential logits by choosing a base category for our comparisons, and then we can exponentiate these values to get the relative odds of being employed. Here we choose unqualified males as the base category, so that other teenagers will be compared to that group:
    Customised predictions
    In the Setup window: button Logit (instead of Probabilities)
    Tick Differences; from variable Code, reference value Unmale
    Untick Means; untick coverage
    Predict

In the Predictions sub-window this gives the estimated differential cluster-specific logits. Note that the logit for Unmale has been set to zero and the other values are differential logits. These are the values given in the model equations window, as contrast coding has been used. We can now plot these values:
    Plot Grid
    Y is median.pred (not mean.pred)
    X is code.pred
    Tick 95% confidence intervals; button error bars

This will at first give the differential logits; to get odds we need to exponentiate the medians and the 95% low and high values (from the Names window we see these are stored in c15-c17):
    Data manipulation
    Command interface
    expo c15-c17 c15-c17

After some re-labelling of the graph:

In a relatively simple model with only one categorical predictor generating four main effects, we can achieve some of the above calculations by just using the Calculate command and the Expo and Alogit functions. Here is the logic of doing this by hand in the command interface:

    calc b1 = ...          stores the logit for unqualified males in a box (that is, a single value, in comparison to a variate in a column)
    calc b2 = alogit(b1)   derives the cluster-specific probability for unqualified males

    calc b1 = ...          stores the logit for qualified females (base + differential)
    calc b2 = alogit(b1)   derives the cluster-specific probability for qualified females

To calculate the odds of being employed for any category compared to the base we simply exponentiate the differential logit (do not include the term associated with the constant):

    calc b1 = ...          the differential logit for qualified females
    calc b2 = expo(b1)     the odds for qualified females

The full table is as follows, and agrees, with minor rounding error, with the simulated values.
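The same hand calculations are easy to reproduce outside MLwiN. In this sketch the intercept and the differential logit for qualified females are hypothetical placeholders standing in for the Model 2 estimates shown in the equations window:

    import math

    b0 = 1.0         # hypothetical: Model 2 logit for the base category (unqualified males)
    d_qualfem = 0.8  # hypothetical: differential logit for qualified females

    def alogit(x):
        # inverse logit (MLwiN's ALOGit): logit scale to probability scale
        return math.exp(x) / (1.0 + math.exp(x))

    p_unmale = alogit(b0)               # cluster-specific probability, unqualified males
    p_qualfem = alogit(b0 + d_qualfem)  # cluster-specific probability, qualified females
    odds_qualfem = math.exp(d_qualfem)  # odds of employment relative to the base category

    print(p_unmale, p_qualfem, odds_qualfem)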

The full table lists, for each of the four groups (unqualified males, unqualified females, qualified males, qualified females), the logit, the probability, the differential logit and the odds; the odds for the base category must always be 1.

We can use the Intervals and tests window to test for the significance of the difference between the genders for qualified and unqualified teenagers. NB for unqualified teenagers it is given directly; for qualified teenagers it is not, and it has to be derived as a difference (note the -1). The chi-square statistics are all small, indicating that there is little difference between the genders. In contrast, the differences between the levels of qualification for both males and females are highly significant.

Turning now to the random effects, an effective way of presenting these is to calculate the odds of being employed against an all-Glasgow average of 1. First calculate the level-2 residuals and store them in c300, then exponentiate these values (using the command interface) and plot them against their rank:
    Command interface
    Expo c300 c300

At one extreme some places have only 0.4 of the city-wide odds; at the other extreme the odds are increased to 1.8 times, with of course the all-Glasgow average being 1.

Model 2b: changing estimation

We have so far used the default non-linear options of MQL, 1st order and an exact binomial distribution; clicking on the Nonlinear button on the equations window we can change that to PQL, 2nd order, and allow extra-binomial variation. After more iterations the model converges.

Question 3: Have the results changed a great deal? Is there significant over-dispersion for the extra-binomial variation?

Note that we have tested the over-dispersion parameter (associated with the binomial weight bcons) against 1, and that there is no significant over-dispersion, as shown by the very low chi-square value. Use the Nonlinear button to set the distributional assumption back to an exact Binomial.

Model 3: modelling the cross-level interaction between gender, qualifications and adult unemployment

To estimate the effects of adult unemployment on teenage employment:
    In the equations window
    Add term to the model
    Choose Adunemp and centre this variable around a mean of 8% [the rounded, across-Glasgow average]; Done

This gives the main effect for adult unemployment. We want to see whether this interacts with the individual characteristics of qualification and gender:
    In the equations window
    Add term; Order 1 (first-order interactions)
    Code, choosing Unmale as the base
    Adunemp, the continuous variable (the software takes account of the centering)
    Done

After more iterations to convergence the results are:

Store the model as Three:
    Mstore "three"

Comparing the stored models brings up the results in tabular form, with a column (and standard errors) for each of Models One, Two and Three for the response Proportion. The fixed part contains the Constant, Unfem, Qualmale, Qualfem, (Adunemp-8) and the interactions Unfem.(Adunemp-8), Qualmale.(Adunemp-8) and Qualfem.(Adunemp-8); the random part contains the level-2 postcode variance (Constant/Constant).

The results are perhaps most easily appreciated as the probability of being employed in a cross-level interaction plot (Adunemp is a level-2 variable; Code is a level-1 variable):
    Model on main menu
    Customised predictions (this automatically takes account of interactions)
    In the Setup window
    Clear [gets rid of previous choices; this must be done as the specification has changed]

    Highlight Adunemp and request Change Range
    Nested means; level of nesting 1 (repeated calculation of means to get 3 characteristic values of the un-centred variable); Done
    Highlight Code and request Change Range
    Click on Category and tick each and every category of teenager (Unmale etc); Done
    Confidence 95; button on for Probabilities; tick Medians; tick Means
    At the bottom of the pane: Fill grid, then Predict
    Predictions sub-window

The predictions are for 12 rows (4 types of teenager for each of 3 characteristic values of adult unemployment). To get a plot:
    Plot Grid
    Y is median.pred (cluster specific)
    X is Adunemp (the continuous predictor)
    Grouped by code.pred (the 4 types of teenager)
    Tick off the 95% CIs (to see the lines clearly)

Thicken the lines and put labels on the graph.

Estimating the VPC

The next thing we would like to do for this model is to partition the variance, to see what percentage of the residual variation still lies between postcodes. This is not as straightforward as in the Normal-theory case.

One simple method is to use a threshold approach and to treat the level-1, between-cell variation as having the variance of a standard logistic distribution, which is π²/3 ≈ 3.29.¹ Then, with this model, the proportion of the variance lying between postcodes is

    calc b1 = 0.153 / (0.153 + 3.29)

That is, about 4% of the remaining unexplained variation lies at the district level. But this ignores the fact that the level-1 variance is not constant: it is a function of the mean probability, which depends on the predictors in the fixed part of the model. There is a macro called VPC.txt that will simulate the values given desired settings for the predictor variables:

    Input values to c151 for all the fixed predictor values (Data manipulation and View); eg a setting that represents unqualified males in an area of average adult unemployment, or one that represents qualified females in an area of average adult unemployment
    Input values to c152 for predictor variables which have random coefficients at level 2; eg c152 contains just a 1, because this is a random-intercepts model

To run the macro:
    File on main menu
    Change directory to something like C:\Program Files\Mlwin v2.1\samples
    Open macro vpc.txt, then Execute

The result is obtained by typing print b8 in the Command window and then looking in the Output window: first for the unqualified-male setting, then, re-running, for the qualified-female setting. So some 2 to 3% of the residual variance lies between postcodes.

1 Snijders T, Bosker R, 1999, Multilevel analysis: an introduction to basic and advanced multilevel modelling, London, Sage.
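The simulation approach used by the macro can be sketched outside MLwiN as follows (a minimal illustration of the method, not the macro itself): simulate many level-2 residuals, convert to probabilities, and compare the level-2 variance of the probabilities with the average binomial level-1 variance. The linear-predictor value xb below is a hypothetical placeholder for whatever fixed-part setting is chosen (eg unqualified males at average adult unemployment):

    import numpy as np

    rng = np.random.default_rng(0)
    sigma2_u = 0.153  # level-2 variance from Model 3
    xb = 0.5          # hypothetical fixed-part value for the chosen predictor setting

    u = rng.normal(0.0, np.sqrt(sigma2_u), size=500_000)
    p = 1.0 / (1.0 + np.exp(-(xb + u)))  # cell probabilities across simulated areas

    v2 = p.var()                 # level-2 variance on the probability scale
    v1 = (p * (1.0 - p)).mean()  # average binomial level-1 variance
    print(v2 / (v1 + v2))        # VPC for this predictor setting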

Characterising the between-area effects with the Median Odds Ratio

There is growing agreement in the multilevel modelling community that the Median Odds Ratio (MOR) of Larsen is a more effective way of portraying the higher-level variance in discrete models than the VPC.² The MOR transforms the between-area variance on the logit scale to a much more interpretable odds scale that can be compared to the relative odds of terms in the fixed part of the model. The MOR can be conceptualised as the increased risk (on average, hence the median) that would result from moving from a lower- to a higher-risk area, if two areas were chosen at random from the distribution with the estimated level-2 variance. The formula is as follows:

    MOR = exp( sqrt(2 σ²u) * Φ⁻¹(0.75) )

where σ²u is the level-2 between-postcode variance on the logit scale (this would be replaced with a variance function if random slopes are involved), and Φ⁻¹(0.75) ≈ 0.6745 is the 75th percentile of the cumulative distribution function of the Normal distribution with mean 0 and variance 1. The figure shows the relations between the three measures. Thus the MOR for Models 1 to 3 is:

    calc b1 = expo((2 * 0.270)**0.5 * 0.6745)    Model 1: MOR ≈ 1.64
    calc b1 = expo((2 * 0.237)**0.5 * 0.6745)    Model 2: MOR ≈ 1.59
    calc b1 = expo((2 * 0.153)**0.5 * 0.6745)    Model 3: MOR ≈ 1.45

2 Larsen K and Merlo J (2005) Appropriate assessment of neighbourhood effects on individual health: integrating random and fixed effects in multilevel logistic regression. Am J Epidemiol, 161, 81-8; Merlo J, Chaix B, Yang M, Lynch J, Rastam L (2005) A brief conceptual tutorial of multilevel analysis in social epidemiology: linking the statistical concept of clustering to the idea of contextual phenomenon. J Epidemiol Community Health, 59(6).
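For checking, the MOR is a one-liner in any language with a Normal quantile function; a sketch in Python using the three variances above:

    from math import exp, sqrt
    from scipy.stats import norm

    def mor(sigma2_u):
        # Median Odds Ratio from a level-2 variance on the logit scale
        return exp(sqrt(2.0 * sigma2_u) * norm.ppf(0.75))

    print(mor(0.270), mor(0.237), mor(0.153))  # Models 1 to 3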

According to this measure there is quite a bit of area heterogeneity, larger than the gender gap but smaller than the qualifications gap in the relative odds. The credible intervals for an MOR can be obtained by plugging in the credible intervals from an MCMC run. Thus, for the model estimated below with 10k MCMC monitoring simulations, the MOR and its 95% credible intervals are:

    calc b1 = expo((2 * 0.059)**0.5 * 0.6745)    ≈ 1.26, 2.5% lower
    calc b1 = expo((2 * 0.168)**0.5 * 0.6745)    ≈ 1.48
    calc b1 = expo((2 * 0.317)**0.5 * 0.6745)    ≈ 1.71, 97.5% higher

Characterising the effect of a higher-level predictor with the Interval Odds Ratio

Larsen has also introduced a statistic he calls the Interval Odds Ratio (IOR).³ This aims to assess the effect of higher-level cluster variables on an odds scale, taking into account the residual heterogeneity between areas. It is calculated as an interval for the odds ratio between two persons with differing values of the higher-level variable, covering the middle 80 percent of the odds ratios:⁴

    IOR lower = exp( β δ + sqrt(2 σ²u) * Φ⁻¹(0.10) )
    IOR upper = exp( β δ + sqrt(2 σ²u) * Φ⁻¹(0.90) )

where β is the coefficient of the higher-level variable, δ is the difference in that variable between the two areas being compared, and Φ⁻¹(0.10) and Φ⁻¹(0.90) are the 10th and 90th percentiles of the Normal distribution, which give the values -1.2816 and 1.2816. If the interval contains the value 1, the effect of the higher-level variable is not strong given the residual between-area

3 Larsen K, Petersen JH, Budtz-Jørgensen E, Endahl L (2000) Interpreting parameters in the logistic regression model with random effects, Biometrics, 56(3). 4 The 80% is arbitrary but commonly used.

variation. But if it does not contain 1, the effect of the higher-level variable is large in comparison to the unexplained between-neighbourhood variation; moving between neighbourhoods with different levels of adult unemployment is not going to be swamped by other (unexplained) neighbourhood effects.

In Model 3 the residual between-neighbourhood variance is 0.153, and the main effect for the level-2 variable of adult unemployment gives the effect for an unqualified male. We can therefore calculate the IOR-80% values for an unqualified male teenager who lives in a lower-quartile neighbourhood in comparison with an upper-quartile neighbourhood, a value for adult unemployment of 5.085% in comparison to 9.65%. We first calculate the simple odds ratio without taking into account potential neighbourhood differences (β below stands for the estimated Adunemp main effect):

    calc b1 = expo(β * (9.65 - 5.085))

so that moving between neighbourhoods does appear to change the odds of employment. However, when we additionally take into account the other potential neighbourhood differences, the IOR-80% is calculated to be

    calc b1 = expo(β * (9.65 - 5.085) + (2 * 0.153)**0.5 * (-1.2816))
    calc b1 = expo(β * (9.65 - 5.085) + (2 * 0.153)**0.5 * (1.2816))

and this straddles 1. This suggests that the difference between these two types of area is not large relative to the unexplained effect: there is quite a lot of chance that the teenager will not have an increased propensity of employment given the changed neighbourhood characteristics and the variation between neighbourhoods. If we look at a more marked neighbourhood change, from the 5% best, with adult unemployment of 2.958%, up to the 5% worst, the standard odds value is

    calc b1 = expo(β * (change in Adunemp))

and the IORs are

    calc b1 = expo(β * (change in Adunemp) + (2 * 0.153)**0.5 * (-1.2816))
    calc b1 = expo(β * (change in Adunemp) + (2 * 0.153)**0.5 * (1.2816))

This interval does not include 1, so this large-scale change in the neighbourhood characteristic does increase the propensity to be employed, even when account is taken of how much neighbourhoods differ. It must be stressed that the IOR is not a confidence interval, and care would be needed in plugging in the MCMC estimates, as there is no joint estimation of the credible intervals of the fixed and random terms.
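The IOR-80 computation is easy to script; in this sketch beta is a hypothetical stand-in for the Model 3 Adunemp coefficient (the actual estimate appears in the equations window):

    from math import exp, sqrt
    from scipy.stats import norm

    def ior80(beta, delta, sigma2_u):
        # 80% Interval Odds Ratio for a change of delta in a level-2 variable
        spread = sqrt(2.0 * sigma2_u)
        return (exp(beta * delta + spread * norm.ppf(0.10)),
                exp(beta * delta + spread * norm.ppf(0.90)))

    beta = -0.05  # hypothetical Adunemp effect for unqualified males
    low, high = ior80(beta, 9.65 - 5.085, 0.153)
    print(low, high)  # if this interval straddles 1, the area effect is weak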

Comparing models: using the DIC estimated in MCMC

Unfortunately, because of the way that logit models are estimated in MLwiN through quasi-likelihood, it is not possible to use the usual deviance to compare models. One could use the Intervals and tests procedures to test individual and sets of estimates for significance. But using MCMC methodology one can compare the overall fit of different models using the DIC diagnostic. Using the IGLS/RIGLS estimates as starting values:
    Estimation Control
    Switch to MCMC and use the default values of a burn-in of 500, followed by a monitoring length of 5000
    Start
    Store results as MCMC5k

To examine the estimates:
    Model on main menu
    Trajectories
    Select the level-2 variance (Postcode: Constant/Constant)
    Change Structured graph layout to 1 graph per row; Done

This gives the trajectory of the estimate for the last 500 simulated draws. Click in the middle of this graph to get the summary of these results:

You can see the mean of the estimate for the level-2 variance, and that the 95% credible interval does not include zero, running up to 0.308; the parameter distribution is positively skewed, and the asymmetric CIs reflect this. Note, however, that both the Raftery-Lewis and Brooks-Draper statistics suggest that we have not run the chain for long enough, as the chain is highly auto-correlated; we have requested a run of 5000 simulations but they are behaving as a much smaller effective sample size.⁵ Ignoring this for the moment, we want to get the DIC diagnostic:
    Model on main menu
    MCMC
    DIC diagnostic

This produces the Bayesian Deviance Information Criterion results in the output window: Dbar, D(thetabar), pD and DIC.

To increase the number of simulated draws:
    Estimation Control
    MCMC
    Change monitoring from 5000 to 10000; Done
    More iterations on top bar
    Store results as MCMC10k

5 There are a number of recently developed procedures (discussed in Volume 1, Chapter 10) that can be used to improve the efficiency of the sampling through the MCMC options. We found that for this model and this term there was no substantial improvement in efficiency even when orthogonal parameterization and hierarchical centring were used in combination.

The trajectories will be updated as the 5000 extra draws are performed (it makes good sense in large models to close the trajectories and equations windows, as they slow down the estimation without being really informative).

Click Update on the MCMC diagnostics to see that there are now effectively 246 independent draws. The MCMC results of the two runs can now be compared, and it would appear that there is very little change with the increased length of the monitoring run. The DIC diagnostic again reports Dbar, D(thetabar), pD and DIC.

Doubling the number of draws has changed the DIC diagnostic by only a very small amount. There are two key elements to the interpretation of the DIC:

pD: this gives the complexity of the model as the effective degrees of freedom consumed in the fit, taking into account both the fixed and random parts; here we know there are 8 fixed terms, and the rest of the effective degrees of freedom comes from treating the 122 postcodes as a distribution.

DIC: the Deviance Information Criterion, which is a generalisation of the Akaike Information Criterion (AIC). The AIC is the deviance + 2p, where p is the number of parameters fitted in the model, and the model with the smallest AIC is chosen as the most appropriate. The DIC diagnostic is simple to calculate from an MCMC run, as it simply involves calculating the value of the deviance at each iteration and the deviance at the expected values of the unknown parameters; we can then calculate the 'effective' number of parameters by subtracting the latter from the average deviance over the complete set of iterations. The DIC can then be used to compare models, as it consists of the sum of two terms that measure the 'fit' and the 'complexity' of a particular model. Models with a lower DIC are therefore to be preferred, as a trade-off between complexity and fit. Crucially, this measure can be used in the comparison of non-nested models and non-linear models.
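In symbols, writing Dbar for the average deviance over the monitored iterations and D(thetabar) for the deviance at the posterior means of the parameters:

    pD  = Dbar - D(thetabar)
    DIC = Dbar + pD

so the DIC penalises average fit by the effective number of parameters, just as the AIC penalises the deviance by 2p.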

Here are the results for a set of models, all based on 10k simulated draws. To change a model specification, you have to use IGLS/RIGLS estimation and then MCMC, and with single models you cannot use MQL and 2nd-order IGLS. The results come as a table of Model, Terms, pD and DIC, ordered in terms of increasing DIC with the simplest and yet best-fitting model at the top: the two-level Cons+Code+Ad-Unemp model (Model 4), followed by the two-level Cons+Code*Ad-Unemp, Cons+Code and Cons models, and finally the single-level Cons model. (The Mwipe command clears the stored estimates of the models.) In terms of DIC, the chosen model is a two-level one, with an additive effect for the 3 contrasted categories of Code and an additive effect for adult unemployment, although there is no substantive difference from the model with the cross-level interactions.

The plot for the final, most parsimonious model is given below for both logits and probabilities.

Some answers

Question 1: Why the variability in the number of cells?
In some postcode areas there is not the full complement of types of teenager; this is a form of imbalance. Usually, estimation is not troubled by it.

Question 2: Who is the constant?
All types of teenagers; there are no other terms in the fixed part of the model.
What is 1.176? The log-odds of being employed, on average, across all teenagers across all areas.
What is 0.270? The between-area variation on the logit scale.
Does the log-odds of teenage employment vary over the city? Yes, there appears to be evidence of this.

Question 3: Have the results changed a great deal?
No.
Is there significant over-dispersion for the extra-binomial variation?
No; the estimate is less than a standard error away from 1. We need to compare against 1, not 0.

13. HIV in India: an example of a Poisson and an NBD model

Introduction

This chapter aims to demonstrate the use of MLwiN in the fitting of two-level models to count data. Substantively, the study aims to investigate the State geography of HIV in terms of prevalence, and how this is characterized by age-group, gender, educational level and urbanity. We have tried to provide an outline of a realistic research project, and include single-level models; random-intercepts and random-slopes two-level models; Poisson, extra-Poisson and NBD models with an offset; estimation by quasi-likelihood (IGLS, PQL, 2nd order) and by MCMC samplers; interpretation and graphing of results both as log_e rates and as relative risks or incidence rates; significance testing; and even models that do not converge!

The data

These are based on the nationally representative, cross-sectional data for some 100k individuals from the India National Family Health Survey (NFHS-3), the first national survey to include HIV testing. The survey was designed to provide a national estimate of HIV in the population of women aged 15-49 and men aged 15-54, as well as separate HIV estimates for each of what were thought to be the five highest HIV-prevalence states, Andhra Pradesh, Karnataka, Maharashtra, Manipur and Tamil Nadu, and for one low-prevalence state, Uttar Pradesh. In the remaining 22 states, HIV testing was conducted in only a sub-sample of six households per enumeration area. The dependent variable indicates HIV sero-status. Details of this procedure and of the sampling design are given in Khan (2008) and the manual that accompanies the NFHS-3.⁶

The initial worksheet:
    File on main menu
    Open worksheet
    Filename Hivcounts

6 Khan, K T (2008) Social Determinants of HIV in India, Masters of Science Thesis, Harvard School of Public Health; National Family Health Survey (NFHS-3): India: Volumes I and II, 2007, IIPS: Mumbai.

The 1720 observations represent cells of a complex table: the cross-tabulation of 28 States by 4 Age-groups by 4 Educational levels by 2 Sexes by 2 Urbanity groups. Cells are therefore groups of people who share common characteristics. The potentially full table (28*4*4*2*2) of 1792 cells has not been observed, because not all combinations of these variables were found in the sample.

In the Names window:
    Highlight all 7 variables
    View

This gives the data extract. The first row shows that in the State of Andhra Pradesh, 32 people who shared the characteristics of being under 24, having been educated to a High level, being female and living in a rural area were interviewed, and that none of them tested sero-positive for HIV. The fourth row, again in the State of Andhra Pradesh, represents 1009 people who shared the characteristics of being under 24, having been educated to Secondary level, being female and living in an urban area; three such people tested sero-positive for HIV. Such data, with low counts based on large denominators, are highly suited to Poisson and NBD modelling.

To get an initial idea of how rare HIV is in the population, we can sum the number of Cases and the number of Cases+NonCases and calculate the overall ratio:
    Data Manipulation on main menu
    Command interface
    Enter the following commands into the lower box, one at a time, and press return:

    Sum 'Cases' b1
    Sum 'Cases+NonCases' b2
    calc b3 = b1/b2

B1, B2 and B3 represent boxes where the answers are stored; boxes are single values (scalars), as opposed to c1, c2 etc, which are columns (variables). In the output window you should see the results of each command:

    ->Sum 'Cases' b1
Overall only 467 sero-positives were found.
    ->Sum 'Cases+NonCases' b2
And this is from a survey of over 102 thousand people!
    ->calc b3 = b1/b2
Giving, thankfully, only a very low overall rate.

Question 1: Use the command interface to calculate the rate per 10,000.

Defining the exposure

The observed counts have to be compared to the number of people in the different groups who have been observed, or exposed. That is, while we may get a relatively high count of cases, this may simply reflect that we have observed a large number of people for this type of cell. Thus the relatively high count of 3 cases in row 4 may not represent a high prevalence, but simply that there are a lot of people who share these characteristics. We can overcome this problem by defining an expected count and comparing the observed count with this value. Using the command interface we can calculate the expected count as the national rate (b3) times the number of people interviewed (Cases+NonCases), and also the Standardised Morbidity Ratio (SMR) as the observed (Cases) divided by the Expected, if the national rate applied:

    calc c8 = b3 * 'Cases+NonCases'
    name c8 'Expected'
    calc c9 = 'Cases' / 'Expected'
    Name c9 'SMR'

The revised worksheet is

and the revised data extract includes the new variables. We can see that the observed count of 3 cases in row 4 is less than the expected number of 4.6 cases given who we had interviewed, and hence we have an SMR below 1; the morbidity in this group is less than the national average. While we can clearly calculate an SMR, we should not place a great deal of weight on it, as it is very sensitive to the small-number problem. Thus, if we look at row 6, we see that the SMR for this group of people is 1.543, that is, nearly 50% in excess of the national rate; but this rate is very unreliable, and small changes in the number of cases would lead to very different SMRs. If the one observed case had been zero, the SMR would plummet to zero, but if a single extra case had been observed, the SMR would be three times the national rate (2/0.648 equals 3.08). Clearly in this form the SMR is highly troubled by the stochastic nature of the data, and we require a model to provide a more robust inferential framework.
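The sensitivity of the SMR to single cases is easy to see numerically; a minimal sketch using the row-6 expected count quoted above:

    expected = 0.648  # expected cases for row 6 at the national rate
    for observed in (0, 1, 2):
        print(observed, observed / expected)  # SMRs: 0.0, ~1.54, ~3.09

One extra or one fewer case moves this cell's SMR between zero and three times the national rate, which is why a model-based estimate is preferred.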

As an aside, we have used a simple way of calculating the expected values on the basis of a single national rate. We could have used a more elaborate procedure: an expected value based on the national rate for each age-sex group, from a table of cases, respondents and rates for the female and male members of each age band. But doing so would effectively have removed the effect of age and sex, and this is something that we are interested in.

Model 1: a single-level model for Gender

The first model we will fit is a Poisson model for the observed count, with an offset of the expected value based on the national rate. The model will estimate the comparative rate for men and women separately. We begin by creating a Constant, a vector of 1s, stored in c10:
    Data manipulation
    Generate vector
and naming the c10 column Cons. We can then specify the model:
    Model on main menu
    Equations
    y, the response, is Cases
    single-level model, i, with Cons declared to be the level-1 identifier [we can do this at level 1; any arbitrary column can be used with this software]

    Click on N for Normal and change the response type to Poisson
    Change the red x0 to be Cons; Done
    Double click on Estimates

This gives the following non-converged model, where π_i is the underlying mean count, and the variance of the observed Cases is equal to this mean. Clicking on the π_i in the second line of the window allows us to specify the offset, but, as the window warns, we first have to take the log_e of the expected value:
    Data Manipulation on main menu
    Command interface
    Enter the following commands into the lower box, one at a time, and press return:

    Name c12 'LogeExp'
    calc 'LogeExp' = loge('Expected')

Returning to the π_i in the second line of the equations window, we can now specify the loge(offset), followed by Done, to get the revised equation.

It is noticeable that there is no coefficient associated with the offset, as this value is constrained to be 1. If you look at the Names window, you will see that MLwiN has created two new variables in the background: Offs holds the values of the offset variable, and bcons.1 is a placement for the Poisson weight which is used during estimation; do not delete these variables.

Return to the equations window:
    Add term
    Variable: choose Gender, with reference category Female
    Start; iterations to convergence

In the estimated model, the positive value (+0.445) informs us that the log_e mean count for Males is higher than that for the base category of Females; there is a higher prevalence of HIV across all of India for men, contrasted with the base of women. To help interpret the model, we will use customised predictions to get an Incidence Rate Ratio for men compared to women:
    Model on main menu
    Customised predictions
    In the Setup window, click on Gender, and Change Range
    Change Means to Categories by clicking on Male and Female; Done
    Choose Rate and click on Medians (in single-level models the medians and means of the predictive distribution give the same results)
    Click on Differences and choose the From variable to be Gender, with the reference value Female

The completed window should look like

    Fill grid
    Predict
    Move to the Predictions sub-window

On the log scale, the Males are higher than the Females. If we exponentiate this value and the associated upper and lower 95% confidence intervals in the command interface:

    expo c17-c19 c17-c19

(you should check in the Names window to see where the predictions are stored) we get the transformed predicted values (you may have to close and re-open the window to see the updated values).
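The headline figure can also be verified directly from the Gender coefficient, since a rate ratio is the exponential of a log_e-rate difference; a quick check using the +0.445 estimate reported above:

    import math

    beta_male = 0.445           # Model 1 differential (log_e scale) for Males vs Females
    print(math.exp(beta_male))  # incidence rate ratio, ~1.56: some 56% higher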

So when the rate for Females is set to 1, the rate for Males is some 56% higher. We can now plot this value and the 95% confidence intervals:
    Plot grid
    The y-axis variable is median.pred
    The x-axis variable is gender.pred
    Tick 95% confidence intervals; check error bars
    Apply

This sends the graph commands to Graph display 1 and plots the graph; with re-labelling and titling, choosing Red as the colour on the Plot style tab, and ticking off the group code on the Other tab, we get:

Clearly there is a significant difference between men and women in the incidence rate ratio of HIV, as the 95% confidence interval does not include 1.

Question 2: What would the results be if Males had been chosen as the reference category and not Females?

So far we have found that the Male rate is 1.56 times higher than the Female rate, when the Female rate is set to 1. We can also estimate both the Male and Female comparative rates (the Standardised Morbidity Ratio or relative risk) in relation to the overall population. That is, the overall population sero-positive rate of both men and women is set to 1, and we compare Men and Women to that figure. Using the results of Model 1, we change the settings of the customised predictions window:
    Model on main menu
    Customised predictions
    In the Setup window, click on Gender, and Change Range by clicking on Male and Female; Done
    Choose Rate and click on Medians (in single-level models the medians and means of the predictive distribution give the same results)
    Click off the Differences (this is the key change that allows the comparison with the overall rate)

The completed window should look like
    Fill grid

    Predict
    Move to the Predictions sub-window

So if the overall rate is 1 (this is given by constraining the log offset to 1), Males are 23% higher, while Females have a relative risk that is only 78% of that for the general population (your figures may differ slightly from these due to the simulation procedure used in MLwiN to derive them). We can relate these results to the earlier ones and calculate the relative risk for Men compared to Women by using the command interface:

    calc 1.228 / [the Female relative risk]

The Male rate is indeed 1.56 times higher than the Female rate, as we found earlier. We can plot the relative risk and the 95% confidence intervals for each gender in relation to the overall rate:
    Plot grid
    The y-axis variable is median.pred
    The x-axis variable is gender.pred
    Tick 95% confidence intervals; check error bars
    Apply

After a little editing this gives the graph. As the confidence bands do not overlap, Men are at a significantly higher risk of being sero-positive than Women. The similar lengths of the confidence bands reflect the similar numbers of men and women in the survey.

Model 2: a single-level model for Gender and Age main effects

Returning to the model with Females as the base, we can add the main effects for the different age groups (AgeGroup), choosing under 24 years of age as the reference group. After convergence we find that the positive values for the differential age categories and for Male mean that the lowest log_e rate is for females aged under 24 years. We can use the customised predictions window to calculate the relative risks for the different age-sex groups in comparison to the overall population (which is given by the log_e offset being constrained to 1):

Clear the old specification and then specify the window as follows, ensuring that AgeGroup is changed so that the predictions are made for all 4 categorical age groups as well as for both categories of sex; make the predictions for Rates for the medians and the 95% upper and lower confidence bands; Fill the grid and make the Predictions.

We get the predicted values, which are the Standardised Morbidity Ratios or relative risks of HIV for each age-sex group in relation to the overall population. When compared to the overall national rate set to 1, men in the middle age bands have nearly double the incidence, while Females under 24 have only a third of the national rate. To derive a plot that contrasts men and women at different ages:
    Plot grid
    The y-axis variable is median.pred
    The x-axis variable is agegrp.pred
    Grouped by gender.pred

    Tick 95% confidence intervals; check error bars
    Apply

Men generally have a higher rate than women, but the difference is only significant for the two middle age groups. The under-24 age group, for both men and women, has the lowest rates.

Question 3: Make a plot that contrasts the SMR at different ages for men and women separately.

Model 3: a single-level model for Gender and Age with interactions

Model 2 included Gender and Age as main effects, so that the model is additive on the log_e scale. The differences between men and women at different ages in the above diagram are simply a result of differential stretching when the log_e rate is exponentiated. Now we will fit a model with interactions between Age and Sex; this will allow the gender differences to differ by age group even on the log_e scale, or, equivalently, the differential age effects to differ by gender. Return to the equations window:

    Click on Add term
    Order 1 for a 1st-order interaction
    Variable Gender, choosing Female as the reference category
    Variable AgeGroup, choosing <24 as the reference category
    Done

After more iterations the following estimates are found.

Question 4: What type of person is the constant in this model?

Using the customised predictions we can make a set of predictions of the relative rates or SMRs, plotted for each age-sex category as shown below, stressing the age-group differences, or alternatively stressing the gender differences.

There have been quite a few changes as compared to the simpler main-effects model, with the lowest rates of all being found for young men; the biggest gender gap is in the middle age range, the males of this group having the highest rates of all.

Model 4: a single-level model for Gender-Age interactions and Education

Return to the equations window:
    Add term
    Variable: choose Educ, with reference category High
    More iterations to convergence

A quick inspection of the estimated model shows that the other three education categories have a significantly higher log_e rate than those who have received higher education. But the three

may not be significantly different from each other. We can test this using the Intervals and tests window. First we test that each of the contrasted categories is significantly different from the base category of Higher:
    Model on main menu
    Intervals and tests
    Tick Fixed and set the number of functions to 3
    Place a 1 in the request matrix for each of the hypotheses to be tested
    Click Calculate

Each separate chi-square is significant with one degree of freedom, and the joint test of the three is also significant. The p values are found using the cpro command in the command interface: the tests of whether those with Secondary, Primary or No education differ from Higher each give a very small p value (of the order of 1e-5 to 1e-6), so the null hypothesis of no difference is rejected, as it is for the joint test of all three. There is very strong evidence that Indians who have received a higher education have a significantly lower risk. We can now test whether the differences between the three lower educational categories are different:

    Intervals and tests
    Tick Fixed and set the number of functions to 0 to clear, and then to 3
    Place a 1 in the request matrix for each of the hypotheses to be tested, and a -1 to signify that we are testing a difference
    Click Calculate

The joint test that all three differences are equal to zero cannot be rejected; consequently, in the interests of parsimony, we will combine the four Education groups into two:
    In the equations window
    Click on the NoEd term
    Delete term, confirming that you want to delete all 3 categories from the model
    Data manipulation on main menu
    Recode
    By value
    Choose Educ as the source column
    A free column as the destination (here c27)
    Give the new values such that High gets a 0 and all others are set to 1
    Execute

In the Names window, highlight c27 and Edit name to Educ2. Keeping the variable highlighted, toggle Categorical to True, click on Categories, and give meaningful codes to the categories: High is 0, Low is 1; OK.

In the Equations window:
    Add term
    Variable: choose Educ2, with reference category High
    More iterations to convergence

This looks like a highly significant difference, and we can see whether this differential is found for each age-sex category:
    In the Equations window
    Click on Add term
    Order 1 for a 1st-order interaction
    Variable Educ2, choosing High as the reference category
    Variable Gender, choosing Female as the reference category
    Done

More iterations to convergence. Then:
    Click on Add term
    Order 1 for a 1st-order interaction
    Variable Educ2, choosing High as the reference category
    Variable AgeGroup, choosing <24 as the reference category
    Done
    More iterations; stop, as the model is not converging

On inspection, the Low-education interaction term is not converging, due to its large standard error. Click on this term and remove it from the fixed part by unticking (do not delete, as this would remove all interactions between Low education and the three contrasted age groups). More iterations to convergence; but this also results in a problem, so press Start instead of More (this starts model estimation from scratch). On inspection, we see that none of the Educ2 and AgeGroup interactions is significant:
    Click on Low
    Delete, confirming that all three interaction terms are to be deleted

After more iterations the following estimates are found.

We have gone back to a model involving only the interaction between Education and Gender, and we also know that Age effects are not differentiated by Education. (We also tried the second-order interactions between Age*Educ2*Gender, but none of these proved to be reliably estimated.) We can see these results as risks by using the customised predictions window to make predictions for all 16 groups (2 Education * 2 Gender * 4 AgeGroups). You will have to Clear the old specification and Change Range so that Means are replaced by Categories, and each and every category is ticked for all three variables.

To display a plot that shows the effects of the Education and Gender interaction:
    Plot grid
    The y-axis variable is median.pred
    The x-axis variable is Educ2.pred
    Grouped by agegrp.pred
    Trellis in the X direction: gender.pred
    Tick 95% confidence intervals; check error bars

Clearly, highly educated females have the lowest risk, and that holds across the age groups. As we have now completed the analysis of individual (cell-level) characteristics, we shall store the model:
    In the Equations window
    Store model, giving the label Four

Model 5: a two-level model for States

Return to the Equations window:
    Click on Cases
    N levels: change to 2, ij
    Choose States to be level 2 and Cons to be level 1; Done
    Click on Cons
    Tick on j(States); Done
    (you can now check the hierarchy for the 28 States; if it is wrong, the cause is probably unsorted data)
    Click on Nonlinear estimation at the bottom of the equations pane

    Click on 2nd-order linearization
    Click on PQL estimation

Using the IGLS quasi-likelihood approach, the preferred estimation is PQL, 2nd order (MQL has a tendency to overestimate the higher-level variance in a Poisson model). You will see the State level-2 residuals and the higher-level variance being included in the model. More iterations to convergence. We can store this model as FivePQL.

There is clearly a sizeable between-State variance. We can test this for significance:
    Model on main menu
    Intervals and tests
    Tick Random and set the number of functions to 1
    Place a 1 in the request matrix for the hypothesis to be tested, that is, picking out the level-2 variance
    Click Calculate

The chi-square test with one degree of freedom (the p value is obtained with CPRO) confirms that there is significant between-State variation, with a small p value.

Just as we did for the binomial model of the last chapter, we can transform the level-2 variance on the log scale to a ratio scale, to more readily characterise the higher-level variance. Larsen calls the equivalent of the MOR for the Poisson model the Median Mean Ratio (MMR).⁷ The measure is always greater than or equal to 1. If the MMR is 1, there is no variation between States; if there is considerable between-cluster variation, the MMR will be large. The measure is directly comparable with fixed-effects relative risk ratios. It is calculated in the same way as the MOR:

    MMR = exp( sqrt(2 σ²u) * Φ⁻¹(0.75) )

where σ²u is the level-2 between-State variance on the log scale (this would be replaced with a variance function if random slopes are involved):

    calc b1 = expo((2 * 0.861)**0.5 * 0.6745)    ≈ 2.42

Thus there are considerable differences remaining between States even after the Age, Gender and Education of individuals are taken into account; indeed, this median difference between States is comparable to the largest difference between the fixed effects.

Having calculated an overall measure of the difference between States, we will now see the extent to which the relative risk of positive status varies by State:
    Model on main menu
    Residuals
    Change to 1.96 times the SD (comparative) error to give 95% confidence intervals
    Level: change to 2: States
    Set columns, noting where the different values are being stored
    Calc

We have previously stored the State names in short form (one for each State, not one for each cell) in c299, so we can view the State name, the residual (the difference from the national rate on the log_e scale), the 1.96 standardised residual, and the rank. We can plot the residuals and the 95% confidence intervals against the rank.

Clearly, 5 States are significantly above the national rate, represented by the value 0. We can turn the residuals into relative risks by exponentiating them, and also calculate the upper and lower confidence intervals on this scale. First close the graph display, then in the command interface window:

    calc c302 = expo(c300 - c301) - (expo(c300))    the differential lower 95% RR
    calc c301 = expo(c300 + c301) - (expo(c300))    the differential upper 95% RR
    calc c300 = expo(c300)                          the relative risk

Then change the customised graphics in D10, so that Stateshort is given as the group.

    The lower error bars for the 95% CIs are in c302
    Tick the group code on in the Other sub-window
    For the labels, use text labelling with the on-the-graph method
    Done, Apply
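The logic of the three calc commands (with c300 holding the residual and c301 the 1.96 * comparative SD) is to convert to the relative-risk scale and express the interval limits as offsets from the relative risk, which is what the error-bar plot expects; in Python terms, with hypothetical values:

    import numpy as np

    res = np.array([0.50, -0.30])  # hypothetical level-2 residuals on the log scale (c300)
    half = np.array([0.40, 0.35])  # hypothetical 1.96 * comparative SD (c301)

    rr = np.exp(res)                     # relative risk
    lower_off = np.exp(res - half) - rr  # offset to the lower 95% limit (negative)
    upper_off = np.exp(res + half) - rr  # offset to the upper 95% limit
    print(rr, lower_off, upper_off)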

To get the actual values used in this graph, in the command interface (the three calc transformations above having already been applied):

    Copy 3 c299 c300 c301 c302

where the 3 in the last command results in the copying of textual labels where appropriate. After sorting we get the results as a table of State, relative risk, upper 95% CI, lower 95% CI and rank, with the States ordered from the highest relative risk downwards: Mani, AndP, Maha, Karn, Mizo, Goa, Madh, Tami, HimP, Jam&, Assa, Punj, Oris, WBen, Guju, Delh, Jkar, Hary, Megh, Chat,

Raja, Trip, AruP, Sikk, UttP, Utta, Biha and Kera.

As anticipated, there are high relative risks for Manipur, Andhra Pradesh, Maharashtra and Karnataka, and a confirmed low risk in Uttar Pradesh; but the risk for Tamil Nadu is not particularly high, while that of Mizoram is rather unexpectedly elevated, being the fourth highest rate (the large CI is due to a relatively small sample of 647 individuals; this was not a State that was highly sampled). It might be useful to have a look at the raw counts and the simple un-modelled risks; this requires a little work, as the data are organized by cell and not by State. In the command interface window:

    MLSUm 'States' 'Cases' c31                multilevel sum over cells by States for cases
    MLSUm 'States' 'Cases+NonCases' c32       multilevel sum over cells by States for respondents
    TAKE 'States' c31 c32 c31 c32             take (ie unreplicate) the first entry for each State and put back into c31, c32
    calc c33 = c31/c32                        calculate the rate
    Name c31 'Sero' c32 'Samples' c33 'Rates'
    copy 3 c299 c31 c32 c33                   copy the State names and values

This gives a table of Stateshort, Sero, Samples and Rates, with a row for each of the 28 States from AndP through to WBen.
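The MLSUM/TAKE pair is MLwiN's equivalent of a group-by aggregation; outside MLwiN the same State-level table could be built along these lines (a sketch assuming the cell data have been exported with columns named as in the worksheet; the file name is hypothetical):

    import pandas as pd

    df = pd.read_csv("hivcounts.csv")  # hypothetical export of the MLwiN worksheet

    by_state = df.groupby("States").agg(
        Sero=("Cases", "sum"),              # MLSUm 'States' 'Cases'
        Samples=("Cases+NonCases", "sum"),  # MLSUm 'States' 'Cases+NonCases'
    )
    by_state["Rates"] = by_state["Sero"] / by_state["Samples"]
    print(by_state)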

66 Madh  Maha  Mani  Megh  Mizo  Oris  Punj  Raja  Sikk  Tami  Trip  Utta  UttP  WBen

The elevated modelled risk for Mizoram is based on only 6 cases out of a sample of 647 respondents, but should clearly be investigated further.

Comparing alternative estimates of Model 5: a two-level model for States

It is interesting at this point to see the degree of over-dispersion in the Poisson model, then to have a look at the NBD model, and then to use MCMC methods to assess the robustness of the estimates. In the Equations window

Click on Nonlinear
Tick on extra Poisson
Done
More iterations to convergence

In the Equations window

Store the model and label as Extra

The results are as follows

67 There is some residual under-dispersion at the cell level. This may be a result of not taking account of the clustering of cells within enumeration areas within States, which reflects the multistage design of the Health Survey. We can test whether this under-dispersion parameter is different from the null value of 1 (specified in the Intervals and tests window by the constant(k)), and find, using the CPRO command, that this under-dispersion is not significant at conventional levels.

Proceeding to the NBD model

In the Equations window
Click on Nonlinear
Tick off extra Poisson; Done
In the Equations window
Click on Poisson; choose -ve Binomial (we still have the offset)
More iterations to convergence
In the Equations window
Store the model and label as NBD
In the command interface, type mcomp NBD to display and copy estimates

Examining the stored model reveals that the over-dispersion parameter

68 Fixed Part: Cons, Mal terms, Low, Low.Mal (Model NBD estimates and standard errors omitted)
Random Part: States: Cons/Cons; bcons.1/bcons.1; bcons2.1/bcons2.1

The over-dispersion parameter, labelled bcons2.1/bcons2.1, is not very different from the null value of zero, given its standard error. A Poisson model would seem appropriate for these data. Finally, in this comparison of the estimation of the two-level model, we will now use MCMC methods to fit a model with an assumed Poisson distribution at the cell level.

In the Equations window, click on -ve Binomial
Choose Poisson instead [MCMC is not enabled for NBD models]
More iterations to convergence, as the IGLS results are required as starting values for the MCMC procedure
Click on Estimation control
Click on the MCMC Estimation tab
Increase the chain monitoring length to 50k, as Poisson models tend to have correlated chains
Change thinning to 10, so that only 1 in 10 simulations are stored [calculations are done on the complete chain]
Done

69 Start After the 50k has been completed, in the equations window Store the model and label as MCMC The program uses the IGLS, PQL, 2 nd order estimates as starting values, burns in for 500 simulations which are discarded, followed by 50k monitoring simulations. On completion the estimates are as follows (they are still blue because MCMC has stochastic convergence to a distribution, not as in IGLS which has deterministic convergence to point estimates). As the focus of interest is the State-level variance we need to see whether we have run a sufficient length of chain, before comparing the results from different models. Model on main menu Trajectories Click on Select on the bottom of the Trajectories window Choose States: Cons/Cons, that is the higher-level variance Change the structured graph layout to 1 graph per row to see the plot in detail Done To get a detailed plot of the simulations for this parameter

70 Click in this graph to get a summary of the simulated values, the MCMC diagnostics. This plot shows a pretty healthy trace (top left) as the MCMC sampler explores the parameter space; the estimates are not too highly auto-correlated (middle left) and the Effective Sample Size is very respectably large. All this suggests that we have run the chain sufficiently. The smoothed histogram (top right) shows a marked positive skew in the degree of support for the parameter. The posterior mean is larger than the mode (reflecting the skew); the 95% credible intervals (0.461 to 2.380) do not include 0, so we have strong evidence that there are genuine differences between the States in terms of the relative risk of being sero-positive. We can also see what a poorly estimated parameter looks like:

Model on main menu
Trajectories
Click on Select on the bottom of the Trajectories window
Choose Fixed: Male, that is the male differential for <24 years
Change the structured graph layout to 1 graph per row to see the plot in detail
Done

71 To get a detailed plot of the simulations for this parameter, click in this graph to get a summary of the simulated values, the MCMC diagnostics. This plot shows an unhealthy trace (top left); there is evidence of slow drift, as if the sampler is getting stuck in one part of the parameter space rather than rapidly exploring it; successive estimates are highly auto-correlated (middle left) and the Effective Sample Size is only 58 even after 50k simulations. The smoothed histogram (top right) shows a distribution of support that includes zero. If this were a crucial parameter we might want to run the sampler for a large number of further iterations; but it simply looks as if there is little evidence that this parameter is a sizeable one; a sizeable difference between Male and Female for the under-24s receives very little empirical support. We will briefly come back to more efficient samplers at the end of this chapter. We now compare the estimates from the different procedures:

Model on main menu
Compare stored models

72 (or just to get the 2-level models, mcomp in the Command interface). Copy puts out the values so that they can be pasted into a word-processor to form a table. There has not been a great deal of change in the fixed part estimates and the standard errors across PQL 2nd order (Poisson, extra-Poisson, NBD) and MCMC (Poisson). The higher-level random-part estimate for the between-State variance is also very compatible across the three PQL 2nd order estimates, reflecting the insignificance of the extra-Poisson term. The mean of the MCMC estimates is somewhat higher, reflecting the positive skew of the support distribution, but so is the computed standard error. Given that there are only 28 higher-level units, it would be sensible to report the MCMC results, while pointing out that PQL 2nd order modelling had found no evidence of over- or under-dispersion. You could obtain more summary detail about the MCMC estimates by ticking on the extended MCMC information in the Manage stored models window.

Question 5 Using the MCMC results, what are the State differential risks on the loge scale, and as relative risks; do they change a great deal from the Poisson model estimated by 2nd order PQL?

Model 6: between State differences for Urban-Rural

The final model to be fitted estimates the differences in relative risk between Urban and Rural areas, and allows these differences to vary over States. Urban/rural status is really a higher-level variable measured at the enumeration district, but this is not available to us here, so we are in effect treating it (somewhat inappropriately) as an individual cell-level effect. We will use the MCMC procedure but we have to begin with an IGLS model to get starting values. We will also use a specification that gives the variances directly (rather than as a base and a contrasted dummy) in the level-2 random part, but contrast Urban with

73 Rural areas in the fixed part of the model. 8 This specification allows us to calculate and plot the correlation between the relative risks for urban and rural areas in each State. We have first to remove the random intercepts associated with the constant at the State level.

In main menu, click on Estimation, and change back to IGLS
Equations window
Click on Cons and remove the tick for j(State); this will remove the level 2 variance term for the State random intercepts, Done
Add term, choose variable to be Urbanity; reference category to be None (scroll up to get this); Done; this will create two separate dummies and place both in the model with the names Rural and Urban
Click on Rural, and tick off Fixed parameter (Urban will now be contrasted with Rural in the fixed part), and tick on j(States), so that a new random term will be included at the higher level, Done
Click on Urban, and leave the tick for Fixed parameter, and tick on j(States), so that a new random term will be included at the higher level, Done
Start iterations to convergence (it has a problem getting started but it does get there)

From this we can see that Urban-based individuals have a significantly higher risk on the loge scale (the estimate of 0.597 is more than twice 0.198), but rural-based individuals have the greater between-State variance. Now estimate this model with MCMC

Click on Estimation control
Click on the MCMC Estimation tab

8 The major drawback of this separate specification approach to random effects is that developing the model does not have a parsimonious route. If we wanted to include differential random effects for States for gender and urbanity, we would need to estimate variances and co-variances for 4 random effects (2 genders times 2 urbanity groups).

74 Increase the chain monitoring length to 50k, as Poisson models tend to have correlated chains
Change thinning to 10, so that only 1 in 10 simulations are stored
Done
Start

After the 50k has been completed, the results look as follows; store the model and label as UrbRur. Again the MCMC-estimated random terms are larger than the IGLS ones, but they also have larger calculated standard errors. Comparing the two MCMC models

mcom "MCMC" "UrbRur"

Fixed Part: Cons, Mal terms, Low, Low.Mal, Urban (Model MCMC and Model UrbRur estimates and standard errors omitted)
Random Part: States: Cons/Cons; Rural/Rural; Urban/Rural

75 Urban/Urban; Level: Cell: Cons/Cons; DIC

We see that the DIC has decreased even though there has been an increase of 3 nominal parameters (one fixed part term and 2 additional variance-covariance terms); the new model is a better fit to the data. (Recall that the DIC is the posterior mean deviance penalised by the effective number of parameters, so a smaller value indicates a better fit after allowing for model complexity.) We will start the interpretation with the fixed part term for Urban; here is the monitored chain for 50k, and then for 100k and for 200k [three sets of MCMC diagnostics follow]

76 Although, as would be expected, the Effective Sample Size increases substantially, there is no substantive change in the parameter estimates in terms of mean, mode, standard deviation and quantiles. There is some support that the loge rate is higher in urban than in rural areas, but the evidence is not overwhelming (and that is without taking account that the standard error is likely to be underestimated due to the multistage design). In the customised predictions window we can estimate the median log rate for urban living as a difference from rural. Using the command interface we can exponentiate to get the relative risk or incidence rate ratio of Urban living compared with Rural

expo c23 c30 c40 c23 c30 c40

(check columns for median, low and high) and plot the results; by using the median we are using the cluster-specific estimates and not the population-average ones

77 Clearly the relative risk is higher for Urban living (the point estimate for the relative risk is 88% higher compared to rural living), but there are wide confidence intervals. Turning now to the State differentials, we can examine the pair of residuals Model on main menu Residuals Request the level 2 State residuals and the 1.96 SD(Comparative residual)

78 And plot each pair as caterpillar plots, which, after putting State names on both graphs (in the Plot what? tab, Group is Stateshort; in the Other tab, tick on group code), look as follows

79 The differences are more easily seen on a pairwise plot of the residuals that graphs the State differentials against each other (again using Stateshort as the group code). From examining the axes, it is clear that the differentials in the relative risks are much greater for rural as opposed to urban living. The patterns are similar for both types of living, with something of an exception for Tamil Nadu, in which rural living has a comparatively higher risk. Using the Estimate tables, we can see that the estimated correlation between the two differentials on the log scale is 0.87, so that the patterns are very similar

80 Finally we can exponentiate the residuals to get the relative risks and update the plot. These risks are for different States relative to living in urban and rural areas nationally.

Question 7 You could now fit models that include additional fixed part terms to estimate interactions between gender and urbanity, and between age and urbanity. It is also possible to investigate other between-State differences in relation to gender and age-group.

More efficient MCMC samplers

In the Urban-Rural model the fixed part Urban differential with a chain length of 50k had an Effective Sample Size of only 201; this is a very inefficient sampler. To access the new sampling schemes

Model on Main menu

81 MCMC
MCMC options
Click on Use hierarchical centring at the highest level, here 2
Click on Use orthogonal parameterisation
Done
Start

Using the Trajectories window, re-examine the monitored chain for the fixed part differential for Urban. In effect the efficiency of the sampler has been doubled by combining the two procedures. Further details are given in the 2009 MCMC manual. Our advice is that you routinely use these options
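To give a flavour of why hierarchical centring helps (a schematic sketch, not the manual's own notation): instead of sampling State effects centred on zero alongside a separate intercept, the sampler works with effects centred on the intercept itself, which typically reduces the correlation between the two chains:

u_j \sim N(0, \sigma^2_u),\ \eta_{ij} = \beta_0 + u_j + \ldots \quad\Longrightarrow\quad u^*_j \sim N(\beta_0, \sigma^2_u),\ \eta_{ij} = u^*_j + \ldots

Orthogonal parameterisation similarly replaces correlated fixed-part predictors with an orthogonal set during sampling, back-transforming the estimates afterwards.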

82 Some answers

Question 1 Use the command interface to calculate the rate per 10,000

calc b4 = b3 * 10000

In sampling 10,000 individuals we could anticipate finding only some 45 people with HIV.

Question 2 What would be the results if Males had been chosen for the reference category and not Females? The estimated log values are now negative for Females, so that the log rate for females is lower than that for males. The customised predictions need to be compared to males

83 The results are then exponentiated

expo c19-c21 c19-c21

to get the IRR and the 95% CIs

Gender.pred   Cons.pred   median.pred   median.low.pred   median.high.pred
Fem   Mal

and a revised plot made with appropriate re-labelling and re-scaling. When the IRR for males is set to 1, that for women is 0.64, a significant and sizeable reduction. Before proceeding, re-specify the model so that Female is the base.

Question 3 Make a plot that contrasts the SMR at different ages for men and women separately. This can be achieved by plotting the median predictions against Gender on the X axis and grouping by Age

84 While the patterns are the same for both sexes, consistently higher rates are found for men in each age group.

Question 4 What type of person is the constant in this model? The type of person for whom all the other variables are zero, that is a Female aged under 24 years.

Question 5 Using the MCMC results, what are the State differential risks on the loge scale, and as relative risks; do they change a great deal from the Poisson model estimated by 2nd order PQL?

85 State   RR-MCMC   +95% CIs   -95% CIs   Rank
Mani  AndP  Maha  Karn  Mizo  Goa  Madh  Tami  HimP  Jam&  Assa  Punj

86 Oris  WBen  Guju  Delh  Jkar  Hary  Chat  Megh  Raja  UttP  AruP  Trip  Sikk  Utta  Biha  Kera

Both estimates are effectively the same, with the MCMC ones being very slightly larger in absolute magnitude.

Question 6 Have all three higher-level variances had a sufficiently long monitoring chain?

87 [MCMC diagnostic plots for the three higher-level variances]

88 All three estimates have an Effective Sample Size of close to or substantially over 500; this should be sufficient to characterize their distributions
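For reference, the Effective Sample Size reported in these diagnostics is, in the usual formulation, the monitored chain length deflated by the autocorrelation in the chain:

\mathrm{ESS} = \frac{n}{1 + 2\sum_{k=1}^{\infty} \hat{\rho}_k}

where n is the number of monitored simulations and \hat{\rho}_k is the estimated lag-k autocorrelation; an independent sampler would give ESS = n.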

89 14. Longitudinal analysis of repeated measures data

Introduction

Recent decades have seen remarkable advances in methods for analysing longitudinal data. Classical time series analysis has a long and rich history and was originally developed for lengthy and detailed series such as the annual temperature record for central England over some hundreds of years, or the value of stocks and shares on a daily basis. This approach was typically developed for a single object, hence the jibe that it is really the analysis of a sample of one! In contrast we are dealing with multiple time-based measures for multiple subjects. Some paradigmatic cases are labour economics, where one might observe 30 years of annual income for thousands of individuals (large N panels); biometric growth studies, where some hundreds of children may be measured on 5 occasions; and comparative political economy, where there may be thirty years of observations for twenty-five countries (small N panels or temporally dominated data). In the social sciences this is known as panel data, in medicine as repeated measures, and in economics and political economy as cross-sectional, time-series data. The distinctive features are that data on repeated occasions are naturally clustered within individuals, that the ordering of the data means something so that observations can be later or earlier in time, and that there is usually dependence so that later observations are related to earlier ones. While classical time series was concerned to model variation over time, and particular care was taken to model dependence over time (serial correlation) faithfully, the models that we are going to examine aim to account for this variation and how it varies between subjects, be they people or countries. With repeated measures on individuals we can capture within-individual change, that is longitudinal effects, and separate them from differences among people in their baseline values, that is cross-sectional effects. Both may be of substantive interest: the outcome may respond rapidly to changing values for individuals, but there could also be outcomes where the response is more closely related to the enduring nature of the subjects.

The research problem

We have deliberately chosen a rather straightforward main example in which children are repeatedly measured in terms of aerobic performance over a period, to assess and understand their physical development. It is straightforward in that the outcome can be treated as Normally distributed and the research problem has a strict hierarchical structure, in that repeated measures are seen as nested within children who are nested within schools. Aerobic performance is measured by a 12-minute run test, the distance someone can run or walk in 12 minutes. The test was developed as an easily-applied method for assessing cardiovascular endurance. The measure is an important one in its own right for children and adolescents, and additionally it is known to be a key determinant of adult

90 health. Ruiz et al (2009) reviewed seventeen high-quality studies and concluded that there was strong evidence that higher levels of cardiorespiratory fitness in childhood and adolescence are associated with a healthier cardiovascular profile later in life in terms of such risk factors as abnormal blood lipids, high blood pressure, and overall and central adiposity. 9 This study aims to model aerobic performance as children age and develop, and to consider which demographic characteristics are associated with this change. The original study was undertaken in Madeira at a time of rapid and substantial social and economic change. 10 The data that we are using are realistic in that they are based on the Madeira Growth Study, but we have simplified the data somewhat, as that study is based on an accelerated growth design in which repeated measures are taken of multiple overlapping cohorts. We consider the appropriate analysis for such data with multiple cohorts in the next chapter. Throughout we will be referring to individual or subject-specific change as our example is based on development in children, but there is no reason why the subject cannot be a firm or indeed a country. While the example we are using is about development and growth, the same procedures are used to model decline, the economist's negative growth.

Three types of longitudinal data analysis

There are basically three types of approach to modelling change.

Marginal approaches: these focus exclusively on getting the general trend (for everybody or for subgroups), as the alternative name of population-average models makes clear. With such models the nature of the dependency over time is not seen as of substantive interest and is treated as a nuisance that gets in the way of good estimates of the trend. Consequently individual differences are averaged over and not explicitly modelled. One method of estimation is to ignore totally the structure of occasions nested within individuals, just fit means, and treat observations as if the data are independent over time; this can be achieved with standard OLS regression. The trouble with this is that the standard errors are affected by the dependency, and while the estimates are consistent (approaching the true population parameters as sample size increases), they are not efficient (they are not the estimates with the smallest possible sampling variance). Consequently, a more efficient procedure known as Generalized Estimating Equations (GEE) is often deployed, in which a working correlation matrix is defined for the dependency and

9 Ruiz, J.R., Castro-Piñero, J., Artero, E.G., Ortega, F.B., Sjöström, M., Suni, J., Castillo, M J (2009) Predictive validity of health-related fitness in youth: a systematic review, Br J Sports Med, 43. 10 Freitas, D L, Gaston Beunen, José A. Maia, Johan Lefevre, Albrecht L. Claessens, Martine Thomis, Élvio R. Gouveia, Jones, K (2011) Body fatness, maturity, environmental factors and aerobic performance: a multilevel analysis of children and adolescents, mimeo

91 is used iteratively in the estimation process. 11 The advantage of this procedure is that it is relaxed about distributional assumptions; it does not require an explicit specification of a statistical model. The disadvantages are that the procedure is not readily extendible to more complex models, there is no way within the standard GEE procedure to make formal inferences about the dependency structure over time, and you cannot determine how between-subject heterogeneity may be changing.

Subject-specific models are a flexible approach that can capture, in a reasonably parsimonious way, the complexity of growth processes as they occur in individuals. The focus is on modelling individual growth trajectories with explicit terms for both the average trajectory and how individuals depart from it. Specification comes in two forms: in addition to the fixed terms for the overall average trajectory, there can be a fixed term for each individual or random terms for them. In the latter formulation it is also possible to include subject-level variables to try and account for the departures from the general trajectory. This random-effects formulation is the multilevel or mixed longitudinal model. These random effects not only capture individual differences in growth trajectories but also simultaneously induce residual dependence within individuals over occasions, which is thereby modelled. This is the reason why this model is also known as the conditional model, as we estimate the fixed part averages while taking account of the random effects. In the Normal-theory model the marginal and conditional models will generally give very similar results, but this is not generally the case when discrete data are analysed; the difference is greatest when there are large differences between individuals. It is possible (and MLwiN has facilities for this) to derive the marginal from the conditional but not vice-versa.

Autoregressive or lagged-response models, in which the outcome depends on the response on a previous occasion so that the same values (albeit lagged) can appear on both sides of the equation. These more complex models, known as transition models in the bio-statistical literature and dynamic or state dependence models in econometrics, should in general only be used if dependence on previous responses is of substantive interest, or to put it another way, if lagged effects have the possibility of a causal interpretation. An example might be that the number of infected people in an area this week (the response at t) might depend on the number infected last week (a lagged predictor at t-1). A very approachable applied account of these models is given in Schmid (2001). 12

The sole focus here is on the subject-specific approach, where there is an explicit model with random terms for each person. Before proceeding any further it is helpful to get a feel for the subject-specific approach that will be used. Figure 1 shows the three

11 For a hand-worked example see Hanley JA, Negassa A, Edwardes MD, Forrester JE (2003) Statistical analysis of correlated data using generalized estimating equations: an orientation. Am J Epidemiol, 157. 12 Schmid, C H (2001) Marginal and dynamic regression models for longitudinal data, Statistics in Medicine, 20

92 components of the two-level model (occasions nested within individuals) in a schematic way. The response could be a sum score for 13 questions on happiness measured weekly over a two-year period. The three components are given in Table 1, with Crowder and Hand's (1990) rather colourful descriptors. 13

Table 1 Three components of the two-level growth model

Term in the model: Fixed part with linear trend
What is shown: A generally increasing secular trend of happiness, a rising population average
Crowder and Hand: the immutable constant of the universe

Term in the model: Random effects for individuals
What is shown: Person 2 is consistently above the trend; Person 1 is consistently below; the heterogeneity between individuals (the gap between them) is increasing with time. Needs random slopes for the latter.
Crowder and Hand: the lasting characteristic of the individuals

Term in the model: Random departure for occasion
What is shown: Both individuals have their good and bad days, which seem patternless; there is no evidence of either individual becoming more volatile with time.
Crowder and Hand: the fleeting aberration of the moment, one day unrelated to the next

Figure 1 Three components of the subject-specific model

13 Crowder, M J and Hand D J (1990) Analysis of repeated measures, Chapman and Hall, London

93 What is meant by time?

In every subject-specific model for change, some metric of time has to be included in the model as a predictor and used to define occasion. The specific choice of how time is measured and specified depends largely on the substantive research questions, and commonly there are several plausible metrics. In the initial endurance study we have used two forms of time: the chronological age of the children at the planned time of measurement in decimal years (a continuous measure which potentially gives greater precision as it contains more information), and the year in which the measurement was made (five discrete periods). Other possibilities are age in months since the inception of the study, the birth-year cohort of when they were born, or indeed the height in centimetres of the children when endurance was measured. The measure of time has to make sense in the context of the study. Detailed psychological or physiological response studies generated by digital data loggers may operate in minutes and seconds, while generational change may operate over decades. The one technical requirement is monotonicity: the measure cannot reverse direction. Thus a variable like height is a possible measure of developmental time, as you cannot become shorter, but weight is problematic as you could become lighter as time unfolds. A specific analysis, as we shall see, may use more than one measure of time in different parts of the model, so that discrete time in years may be used to define occasion, while continuous age is used in the fixed and random parts to model individual developmental trajectories. It is also worth stressing that the dependent variable has to be measured reliably and validly over time and must not change its meaning, its construct validity, with time. While this is likely to be the case with hard measures like aerobic performance, it is much more demanding for outcomes such as skills, attitudes and behavioural measurements.

What this chapter covers

The major sections of this chapter are as follows. A general conceptual overview of the multilevel approach to modelling repeated measures data; this overview aims to be brief but comprehensive and makes the argument in terms of small illustrative examples and graphs. We consider hierarchical subject-specific models and provide an algebraic specification of the two- and three-level model using the random-effects approach. This is followed by a more detailed consideration of alternative forms of dependence in terms of alternative covariance patterns. The models are then applied using MLwiN to the development of aerobic performance during childhood and adolescence; we consider some determinants of development and alternative covariance structures to model the dependency over time. It is shown how the marginal model estimates differ from the conditional estimates when the response is discrete, how both can be obtained in MLwiN, and how

94 they differ in their interpretation. The chapter concludes with a consideration of the fixed versus random effects approach to longitudinal data in the context of robustness to endogeneity. This section also considers the true role of the commonly misused Hausman test. Given that researchers may wish just to use such growth models, we have tried to make the chapter relatively self-contained, but inevitably we use material that has been covered in previous chapters.

What this chapter is not about (and where to look for help)

It is worth stressing that this chapter, despite being long, concentrates on the basics. Other possibilities for the analysis of measurements over time include the following. If you only have two time points we recommend simply using the standard model, specifying the dependent variable as the variable measured on the later occasion and the measurement on the first occasion as another predictor; you will then be modelling change. This is a more flexible procedure than subtracting the past from the current value and modelling a change score. This is the approach used in the early chapters of the MLwiN user manual, in which educational progress is modelled by regressing Normexam on Standlrt. 14 If you have repeated cohorts, see Chapter 11 of Volume 1, which considers the changing performance of London schools as different cohorts of pupils pass through the schools. That is the analysis to use when there are just two measurements on individuals, on entry to and departure from the school. In the next chapter we will illustrate what to do when there are multiple repeated measurements on multiple cohorts. In reality children can be expected to change schools over time, necessitating a non-hierarchical approach as children can be seen as belonging to more than one school. This development will not be considered here, but the MLwiN MCMC manual (Chapter 16) considers such multiple membership models, in which weights can be included for the amount of time a child spent in a particular school. 15 For such models MCMC estimation is essential. It is possible to analyse more than one outcome simultaneously; thus Goldstein (2011) considers two responses: children's heights as they are growing and their full adult height. 16 An example with two discrete outcomes would be one response for whether the patient got better over time and, simultaneously, whether they were affected by side effects of the treatment. This can be achieved using a multivariate multilevel model, which is considered in the MLwiN User Manual (Chapter 14) and even more flexibly in the Realcom software, which can handle a variety of different types of response

14 Rasbash, J., Steele, F., Browne, W.J. and Goldstein, H. (2009) A User's Guide to MLwiN, v2.1, Centre for Multilevel Modelling, University of Bristol. 15 Browne, W.J. (2009) MCMC Estimation in MLwiN, v2.1, Centre for Multilevel Modelling, University of Bristol. 16 Goldstein, H (2011) Multilevel statistical models, 4th Edition, Wiley, Chichester

95 (continuous and discrete) simultaneously. 17 To avoid confusion, this multivariate approach can also be used to model a single outcome with multiple occasions, and we shall use it here to model, in a very flexible (but non-parsimonious) way, complex dependency over time. While we shall be considering temporal dependence in some detail, we will not combine this with spatial dependence; such modelling is possible by using cross-classified and multiple membership models, see Chapter 17 of the MLwiN MCMC Manual, and Lawson et al (2003). 18 We must also stress that we are not considering the multilevel approach to event history analysis, where attention focuses on the time to an event, be it single (death) or multiple states (single/cohabiting/married/divorced). Materials on that topic are available from the Centre for Multilevel Modelling. 19 Another topic that is not covered (and is not available in MLwiN) is group trajectory modelling, in which the aim is to identify groups of children that have similar growth trajectories (Nagin, 2005). 20 This is achieved by assuming that the random effects for children at the higher level follow a discrete distribution instead of the usual Normal distribution. A linked multinomial model can then be used to try and account for group membership. This approach has been implemented in SAS 21 and could be estimated in MPlus, LatentGold and GLLAMM, for these programs have facilities for non-parametric estimation of latent effects, the random child differences. The approach has been used not just on individuals but on countries and areas, to ascertain distinctive trajectories of life expectancy and voting outcomes. 22

In all the models that follow, the quality of the estimates requires that the missing observations do not represent informative dropout. That is, the standard application of the random-coefficient approach requires Missingness at Random (Rubin, 1976) if the estimates are to be unbiased. 23 That is, conditional on the predictors in the fixed part of the model, there should be no patterning to the missingness. A more

17 Information on Realcom is available from the Centre for Multilevel Modelling. 18 Lawson, A B, Browne, W J and Vidal Rodeiro, C L (2003) Disease mapping with WinBUGS and MLwiN, Wiley & Sons, New York. 20 Nagin, D. S. (2005) Group-based Modeling of Development, Harvard University Press, Cambridge, MA. 21 Jones, B., Nagin, D. S., & Roeder, K. (2001) A SAS procedure based on mixture models for estimating developmental trajectories, Sociological Research and Methods, 29. 22 Jen, M-H, Johnston, R, Jones, K et al (2010) International variations in life expectancy: a spatio-temporal analysis, Tijdschrift voor Economische en Sociale Geografie, 101, 73-90; Johnston, RJ, Jones, K & Jen, M. (2009) Regional variations in voting at British general elections: group-based latent trajectory analysis, Environment and Planning A, 41. 23 Rubin, D B (1976) Inference and missing data, Biometrika, 63

96 sophisticated approach involving imputation is also available and has been implemented in MLwiN; for a tutorial see Goldstein (2009). 24 The chapter concentrates on Normal-theory outcome models, but it is possible to use MLwiN to analyse counts (in Poisson and Negative Binomial models) and discrete outcomes (in binomial and multinomial models). A key issue is that, unlike the Normal-theory case, the results for the fixed part of the random-effects model do not give the population average values directly, but they can be obtained. At the end of this chapter we discuss this issue further, and in the next chapter we show some examples of these models in the analysis of changing gender ideologies in the UK and happiness in the USA. With these models we will also consider how it is possible to analyse age, period and cohort effects simultaneously.

A conceptual overview of the random-effects, subject-specific approach

A great deal has been written in recent years on the random-effects approach to longitudinal data analysis since the foundational statement of Laird and Ware (1982). 25 In addition to the treatment of this approach in specific chapters of the standard multilevel books (Raudenbush and Bryk, 2002, Chapter 6; Goldstein, 2011, Chapter 5; Hox, 2011, Chapter 5; Snijders and Bosker, 1999, Chapter 12), 26 there are a number of excellent book-length, introductory-to-intermediate treatments largely dedicated to the random-effects approach. These include Singer and Willet (2003), Fitzmaurice et al (2004), and Hedeker and Gibbons (2006). 27 More advanced book-length treatments include Verbeke and Molenberghs (2000, 2006; the latter concentrates on discrete outcomes) and Fitzmaurice et al (2008), a handbook that covers the field more broadly. 28 Focussed, highly recommended expository articles that pay particular attention to practical issues include Singer (1998), Maas and Snijders (2003), Cheng et al (2010), and Goldstein and de Stavola (2010). 29

24 Goldstein, H (2009) Handling attrition and non-response in longitudinal data, Longitudinal and Life Course Studies, 1. Excellent practical advice on missing data is also available online. 25 Laird NM, Ware JH (1982) Random-effects models for longitudinal data, Biometrics, 38. 26 Hox, J (2011) Multilevel analysis: techniques and applications, Second Edition, Routledge, New York; Raudenbush, S and Bryk, A. (2002) Hierarchical linear models: applications and data analysis methods, 2nd ed., Sage Publications, Newbury Park CA; Snijders, T. and Bosker, R. (1999) Multilevel analysis: an introduction to basic and advanced multilevel modeling, Sage Publications, London. 27 Singer, J D and Willet, J B (2003) Applied longitudinal data analysis: modeling change and event occurrence, Oxford University Press, New York; Fitzmaurice, G M, Laird, N M and Ware, J H (2004) Applied longitudinal analysis, Wiley, New Jersey; Hedeker, D and Gibbons, R D (2006) Longitudinal data analysis, Wiley, New Jersey. All three books have very helpful websites which include data, software and presentations. 28 Verbeke, G. and Molenberghs, G. (2000) Linear Mixed Models for Longitudinal Data, Springer, New York; Molenberghs, G and Verbeke, G (2006) Models for discrete longitudinal data, Springer, Berlin; Fitzmaurice, G, Davidian, M, Verbeke, G and Molenberghs, G (eds.) (2008) Longitudinal Data Analysis, Chapman & Hall/CRC, Boca Raton.
29 Singer, J D (1998) Using SAS PROC MIXED to fit multilevel models, hierarchical models, and individual growth models, Journal of Educational and Behavioural Statistics, 23; Maas, CJM and Snijders, TAB (2003)

97 This wealth of material reflects the fact that the random-effects multilevel model has become the paradigmatic method for the analysis of repeated measurements. 30 We think that the reasons for the widespread adoption of this approach are sixfold.

1 Repeated measures data are naturally hierarchical and can be extended to more levels

As Figure 2 shows, we can readily conceptualise a simple repeated measures design, with repeated measures of aerobic performance as level 1 observations hierarchically nested within individual children at level 2. Note the imbalance, which reflects missing values, so that not all children are measured on all occasions. This can be extended to include further structures, so that children are seen as belonging to schools at level 3, and again there can be imbalance with a potentially differing number of children observed in each school.

Figure 2 Repeated measures as a hierarchical structure (schools; children within schools; occasions within children)

A typical dataframe for this study is shown in Tables 2 and 3. The first table gives the wide form in which columns represent different occasions: column 1 gives the unique child identifier, followed by 5 columns of endurance (the distance covered in 12 minutes) on 5 separate occasions. The next five columns give the chronological age of the child in decimal years at the time when they were measured, and this is followed by the sex of the child and a School identifier. Missing values are shown by *.

(Footnote 29 continued) The multilevel approach to repeated measures for complete and incomplete data, Quality & Quantity, 37, 71-89; Cheng, J, Edwards, L J, Maldonado-Molina, M M, Komro, K A and Muller, K E (2010) Real longitudinal data analysis for real people: building a good enough mixed model, Statistics in Medicine, 29; Goldstein, H and de Stavola, B (2010) Statistical modelling of repeated measurement data, Longitudinal and Life Course Studies, 1. 30 For an apparently dissenting voice see Allison, P D (2009) Fixed Effects Regression Models, Quantitative Applications in the Social Sciences Series, Sage, Thousand Oaks, California. We will return to this issue at the end of the chapter.

98 Table 2 An extract of the endurance data in wide form

Unique   End1  End2  End3  End4  End5   Age1  Age2  Age3  Age4  Age5   Sex   School
(five example rows: four Boys and a Girl with their School identifiers; one child has a missing endurance measurement shown by *; numeric entries omitted)

The form that is needed for multilevel modelling is the vectorised or long form shown in Table 3. The first column gives the unique child identifier, which will be the level 2 identifier in the model. This is followed by the occasion identifier, which here is the calendar year of the planned measurement occasion; clearly the intention was to observe the children every two years. This will be the level 1 identifier, while School will be the level 3 identifier. This structure in its two-level version is sometimes known as the person-period dataset, as compared to the wide form, the person-level dataset. The data must be sorted in this form for correct estimation in MLwiN, so that all sequential observations for a child are next to each other. The remaining three columns give the endurance measure, the age of the child when measured, and their sex. The row with the missing measurement (here child 3 in 1998) has simply been deleted. As we shall see, two operations have been used in the move from Table 2 to Table 3: the time-varying observations (such as Endurance and Age) have been vectorised, while the time-invariant observations (such as Sex) have been replicated.

Table 3 The data in long format

Child   Year   School   Endur   Age   Sex
(rows for child 1, a Boy, across the five years; child 2, a Girl; child 3, a Boy, with the missing 1998 row deleted; numeric entries omitted)
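Mechanically, the reshape from Table 2 to Table 3 is the standard wide-to-long operation; a hedged pandas sketch (the frame wide and the stub names End and Age mirror Table 2 and are assumptions) is:

import pandas as pd

# wide: one row per child, columns Unique, End1..End5, Age1..Age5, Sex, School
long = pd.wide_to_long(wide, stubnames=['End', 'Age'], i='Unique', j='Occasion')
long = (long.reset_index()
            .dropna(subset=['End'])                # delete rows with a missing measurement
            .sort_values(['Unique', 'Occasion']))  # sort so occasions are nested within children

Time-invariant columns such as Sex and School are carried across and replicated automatically.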

99 Boy * 11.8 Boy Boy Boy 2 The fixed part of the model estimates the general trend over time The fixed part of the model generally consists of polynomials of age (or some metric of time) and interactions which allow us to model trends. To take a simple example, the quadratic equation of Age: y Age Age i i 2 i (1) can take on the four different forms of Figure 3 depending on the different values of the linear and quadratic parameters as given in Table 4. The rule is that the number of bends or inflection points is one less than the order of the polynomial so that that the linear equation has no bends and the quadratic has 1. More complex shapes can be captured by a cubic polynomial which allows two bends. More flexible shapes can be achieved by replacing polynomials by splines. 31 Figure 3 Four alternative forms from a quadratic relationship (Hedeker and Gibbons, 2006) 31 Pan, H., and Goldstein, H. (1998). Multi-level repeated measures growth modelling using extended spline functions. Statistics in Medicine, 17,

100 Table 4 The parameter values for the four alternative forms of Figure 3

Graph / Form                         Intercept (β0)   Linear slope (β1)   Quadratic slope (β2)
a) Decelerating positive slope
b) Accelerating positive slope
c) Positive to negative slope
d) Inverted U shape
(parameter values omitted)

3 The fixed part accommodates time-varying and time-independent predictors

As we saw in Table 3, there may be time-varying predictors such as Age and time-invariant predictors such as Sex. From the multilevel perspective these are variables measured at the lower (occasion) level and the higher (child) level respectively. It is common practice for these variables to be included in the fixed part of the model both as main effects and as interactions. If Sex (measured for each child at level 2, represented by the subscript j) and Age (measured for each child on each occasion at level 1, represented by the subscript ij) are included as interactions, they form a cross-level interaction. A full two-level quadratic interaction is specified as follows:

y_ij = \beta_0 + \beta_1 Age_ij + \beta_2 Age_ij^2 + \beta_3 Boy_j + \beta_4 Boy_j * Age_ij + \beta_5 Boy_j * Age_ij^2   (2)

Figure 4 shows some characteristic plots when the plot for Girls (the reference category) is the decelerating positive slope of Figure 3a. The values that were used to generate the data are given in Table 5. Again it is clear that this straightforward specification is quite flexible in the different types of trends it is able to handle.

Figure 4 Four alternative forms for a differential Age by Gender interaction

101 Table 5 The parameter values for the four alternative forms of Figure 4

Graph / Form of Gender gap           Intercept for Girls (β0)   Linear slope for Girls (β1)   Quadratic slope for Girls (β2)   Differential intercept for Boys (β3)   Differential linear slope for Boys (β4)   Differential quadratic slope for Boys (β5)
a) Constant differential
b) Linearly increasing differential
c) Quadratically increasing differential
d) Increasing then narrowing gap
(parameter values omitted)

4 The random-effects approach allows complex heterogeneity in growth trajectories

The fixed part gives the general trends across all children; the random part allows for between-child and within-child variations around this general line. Beginning with the child higher level and returning to the overall general trend of equation (1), we can add a random child intercept (u_0j) and a random slope departure (u_1j) associated with the Age of the child, and assume that the random slopes and intercepts come from a joint Normal distribution:

y_ij = \beta_0 + \beta_1 Age_ij + \beta_2 Age_ij^2 + (u_0j + u_1j Age_ij + e_0ij)

[u_0j, u_1j]' ~ N(0, \Omega_u),   \Omega_u = [\sigma^2_{u0}; \sigma_{u0u1}, \sigma^2_{u1}]   (3)

This specification in effect allows each child to have their own distinctive growth trajectory, but we are not fitting a separate line for each child (as we would in a fixed-effects specification with a dummy and age interaction for each and every child); rather, each child's trajectory is seen as an allowed-to-vary departure from a general line. Figure 5 shows a number of different characteristic patterns that can be achieved with different values of the level-2 variance-covariance matrix when the underlying trend is the decelerating positive slope of Figure 3a. Each of the graphs on the far left shows the trajectories for six children around the general trend, with the horizontal axis being time since baseline, which is given a value of zero

102 Figure 5 Varying relations for growth models

Table 6 Interpreting the form and parameters of Figure 5

Interpretation                                                                    Intercept variance (σ²_u0)   Slope variance (σ²_u1)   Covariance (σ_u0u1)
a) Differences at baseline maintained as children grow older                      Yes                          No                       -
b) Small differences at baseline become accentuated as children grow older        Yes                          Yes                      Positive
c) Large differences at baseline attenuate as children grow older                 Yes                          Yes                      Negative
d) Differences at baseline unrelated to subsequent growth                         Yes                          Yes                      Zero

Taking each graph in turn, we get the interpretations and associated values for the variance-covariance terms as shown in Table 6. (These graphs are of course simple variants on the varying-relations plots we have used extensively in these volumes.) Notice that the covariance summarises the association between status when time is zero (typically baseline or average age) and the rate of change
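The patterns in Figure 5 are easy to generate for yourself; the following sketch simulates six children from the random-slopes model of equation (3), with all parameter values invented for display (a positive covariance, as in panel b, makes the trajectories fan out):

import numpy as np

rng = np.random.default_rng(1)
age = np.linspace(0, 4, 5)                  # occasions, time since baseline
b0, b1, b2 = 20.0, 2.0, -0.1                # fixed part of equation (3)

omega_u = np.array([[1.0, 0.3],             # sigma2_u0, sigma_u0u1
                    [0.3, 0.2]])            # sigma_u0u1, sigma2_u1
u = rng.multivariate_normal([0.0, 0.0], omega_u, size=6)   # six children

for u0, u1 in u:
    child_line = (b0 + u0) + (b1 + u1) * age + b2 * age ** 2
    print(np.round(child_line, 1))

Setting the off-diagonal of omega_u negative or to zero reproduces panels c) and d) respectively.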

103 Turning now to the within-child, between-occasion variation, we can add a random intercept (e_0ij) and a random departure (e_1ij) associated with the Age of the child, and assume that these departures come from a joint Normal distribution:

y_ij = \beta_0 + \beta_1 Age_ij + \beta_2 Age_ij^2 + (u_0j + u_1j Age_ij + e_0ij + e_1ij Age_ij)

[u_0j, u_1j]' ~ N(0, \Omega_u),   \Omega_u = [\sigma^2_{u0}; \sigma_{u0u1}, \sigma^2_{u1}]
[e_0ij, e_1ij]' ~ N(0, \Omega_e),   \Omega_e = [\sigma^2_{e0}; \sigma_{e0e1}, \sigma^2_{e1}]   (4)

This specification allows differential occasion departures around the child-specific line. Figure 6 shows a number of different characteristic patterns that can be achieved by different values of the level-1 variance-covariance matrix when the underlying child relationship is again the positive decelerating one. Table 7 gives the interpretation and nature of the associated parameters.

Figure 6 Three plots differentiated by level 1 characteristics

Table 7 Interpreting the form and parameters of Figure 6

Graph / Interpretation                                                                  (σ²_e0)   (σ²_e1)   (σ_e0e1)
a) Homoscedasticity: constant variability between occasions around the general trend   Yes       0         -
b) Heteroscedasticity: increasing variability between occasions around the trend       Yes       Yes       Positive
c) Heteroscedasticity: decreasing variability between occasions around the trend       Yes       Yes       Negative

Another way to characterise the random part is to plot the variance function at each level. A quite complex example is shown in Figure 7, which examines the trend in life satisfaction over time in Germany with a differential trend in the East and West. The original data

104 structure was 15 occasions for 16k individuals in 16 Länder over the period, with Life Satisfaction on a 10-point score.

Figure 7 Changing life satisfaction in East and West Germany

The graphs show:
a) National trend: decline throughout the period in the West; initial improvement in the East, then decline; the East is always less satisfied with life than the West;
b) Between-Länder variance: very small differences between Länder around the national trend;
c) Within-Länder, between-individual variance: over the period there is at first greater equality in between-individual Satisfaction, but then the heterogeneity between individuals grows in both East and West; this must mean that not all have shared in the decline in Satisfaction of the general trend, while for others it is more marked. The West is consistently slightly more unequal;
d) Within individuals over time: in the East there is a decline in volatility to below the levels of the West; individual Life Satisfactions have become more consistent over time; individuals have become more stuck in their groove.

Figure 8 shows the model specification in MLwiN that was used to fit the model

105 Figure 8 The model used to estimate the changes in German Life Satisfaction

It is clear that the fixed part is a cubic polynomial of time with East/West interactions, so that both parts of the country can have their own general trend with 2 possible bends. Each of the variance functions consists of a variance term for when Time is zero (e.g. the between-Länder variation), a variance for Linear Time, and a differential variance for the East, together with the associated covariances, so that a quadratic function of Linear Time is allowed in both East and West at all three levels.

5 The random part models the dependency in the response over time

The distinctive feature of repeated measures data is that the response on one occasion is likely to be related to the response on another occasion; that is, the response is auto- or self-correlated, or dependent. Thus, for example, the height of a child on one occasion is likely to be highly related to height on a previous occasion. Similarly, we could expect the income of individuals to be moderately dependent, while voting choice is less dependent, with quite a bit of churning over time. Substantively we want to know this degree of autocorrelation to see the extent and nature of the volatility in the outcome. Technically, dependent data do not provide as much information as you might think. Thus, if there are 300 children and 5 measurement occasions there are not 300 * 5 independent pieces of information, but somewhat fewer. It is well known that a model that does not take account of dependency will result in the overestimation of the standard errors of time-varying variables (such as Age) and underestimation for time-invariant variables (such as Sex)

106 Table 8 The correlation for Endurance on different occasions

      Y1   Y2   Y3   Y4   Y5
Y1
Y2
Y3
Y4
Y5
(correlation values omitted)

In order to get some feel for this dependency, consider again the data in the wide form of Table 2. If we correlate the responses on each of the 5 occasions (Y1 to Y5) we get the correlations in Endurance for children on each and every pair of occasions, as shown in Table 8. Clearly there is quite a strong correlation of greater than 0.65, which must be taken into account in the modelling process. To get some feel for the reduction in the degrees of freedom we can use the equation for the effective sample size (n_eff): 32

n_eff = total number of observations / (1 + \rho(m - 1))   (5)

where \rho is the degree of dependency over time (say, 0.65) and m is the size of the cluster (here typically 5 occasions are nested within each individual), so that the equation becomes

n_eff = 1500 / (1 + 0.65 * (5 - 1)) = 417   (6)

We do not have 1500 independent observations but substantially less, under a third of that. This degree of correlation can change as we model or condition on the fixed part of the model, so it is vital to model explicitly the degree of residual autocorrelation or dependency. As we shall see in more detail later, we can have a number of different patterns of correlation. The simplest possible form of dependency is the so-called compound symmetry approach (this turns out to be the standard random-intercepts model), in which the degree of correlation is unchanging with occasion. This means that the correlation is assumed to be the same between occasions 1 and 2 (a lag of 1), between occasions 1 and 3 (a lag of 2), between 2 and 3 (another lag of 1), and indeed between each and every pair of occasions. The most complex possible form of dependency is called unstructured, and this permits a different correlation at each and every lag. Both of these types of dependency structure can be modelled in MLwiN, as well as variants that lie between these two extremes. To get a feeling for this dependency, consider again the three components of the subject-specific model, now shown in Figure 9. In comparison to Figure 1, the population average trend and the subject-specific trends are the same, but the within-person, between-occasion values show much more structure than the white noise of the earlier figure. That is, there is a tendency for a positive happiness score at a particular time to be followed by another positive

32 Cochran, W G (1977) Sampling Techniques (Third edition), Wiley, New York
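The arithmetic of equations (5) and (6) is easily checked with a couple of lines of Python, under the stated assumptions (rho = 0.65, clusters of m = 5 occasions, 1500 observations in total):

rho, m, n = 0.65, 5, 1500
n_eff = n / (1 + rho * (m - 1))   # equation (5)
print(round(n_eff))               # 417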

107 score; similarly a negative score below the subject-specific line tends to be followed by further negative scores.

Figure 9 Three components of the subject-specific model with serial dependency between occasions

Figure 10 Auto-correlograms for Figure 9 and Figure 1

Thus there is clear, marked positive autocorrelation for each person, in that points close together in time tend to be differentially high or low. This type of random variation is usually a decreasing function of the time separation between measurements (the lag), and this can be highlighted in an autocorrelation plot where dependency is plotted against lag. Figure 10a shows a characteristic plot of marked serial correlation with its defining decay of dependency with lag, while Figure 10b shows the auto-correlogram for independent white noise

108 Here, because of the lengthy time sequence, we are able to estimate the degree of correlation over a number of lags. Such dependency is of substantive interest and has to be modelled properly to get correct standard errors for the fixed part coefficients.

6 The random-effects approach allows unbiased estimation even with missing data

The random-effects model has the valuable property that missing values on particular occasions can simply be dropped without biasing the estimates (Goldstein and de Stavola, 2010). 33 This allows the efficient use of all the information that we have; we do not need to omit children who have anything less than a complete set of observations. This property is based on the assumption of Missing at Random (Little and Rubin, 2002), 34 so that the missingness itself is un-informative and the non-response process depends only on observed and not on unobserved variables; in particular, the probability of being missing must not depend on responses we would have observed had they not been missing (e.g. in a study of alcohol consumption, the heavy drinkers not showing up for appointments). The assumption is a reasonably forgiving one and applies not to the response but to the response conditional on the fixed part, that is the residuals. Thus, if the dropout is solely due to a failure to measure older girls, and terms for gender and age interactions are included in the fixed part of the model, the estimates can be expected to be unbiased. Moreover, with only a few missing values and no severe imbalance, any imparted bias is likely to be small (Maas and Snijders, 2003). While we cannot generally assess the assumption conclusively, as we lack the very data we need, it is possible to use a pattern-mixture approach to give some insights into what is going on. The procedure works by identifying a small number of patterns of missingness (e.g. those missing on occasions 2 and 3; those missing on 4 and 5) and then using dummy variables to represent the different patterns of missingness in a stratified analysis. The use of this approach in multilevel analysis is explained by Hedeker and Gibbons (1997) and applied by Hedeker and Rose (2000). 35 If the data are not MAR we have three main options: pattern-mixture models, multiple imputation, 36 and selection models in which you

33 The time trend for subjects with missing observations is estimated by borrowing strength from subjects with similar characteristics. 34 Little, R. J. A. and Rubin, D (2002) Statistical analysis with missing data, 2nd ed., John Wiley, New York. 35 Hedeker, D., and Gibbons, R.D. (1997) Application of random-effects pattern-mixture models for missing data in longitudinal studies, Psychological Methods, 2, 64-78; Hedeker, D., and Rose, J.S. (2000) The natural history of smoking: a pattern-mixture random-effects regression model, in Rose, J.S., Chassin, L., Presson, C.C., and Sherman, S.J. (Eds.), Multivariate Applications in Substance Use Research, Lawrence Erlbaum Associates, Mahwah, NJ. 36 Goldstein, H (2009) Handling attrition and non-response in longitudinal data, Longitudinal and Life Course Studies, 1

fit a model for the complete data and another model for the selection process that gives rise to the missingness.37

Algebraic specification of random effects model for repeated measures

This section provides a detailed step-by-step specification of the Normal-theory two- and three-level model.

Two-level model: random intercepts

We start with the micro-model for occasions within children:

$y_{ij} = \beta_{0j}x_{0ij} + \beta_1 x_{1ij} + e_{0ij}x_{0ij}$   (7)

where the observed variables are

$y_{ij}$   the response, the distance covered in hundreds of metres on occasion i by child j;

$x_{1ij}$   the age of the child on each occasion; that is, age is a time-varying variable. We chose to centre this value around its grand mean across all occasions and all children to produce a more interpretable intercept and to aid model convergence (Snijders and Bosker, 1999, 80);38

$x_{0ij}$   the Constant, a set of 1's for each and every child on all occasions.

The parameters to be estimated are

$\beta_1$   the general slope across all children; this gives the change in distance as children become one year older, a general measure of development;

$\beta_{0j}$   the intercept for each child; because we have centred age around its grand mean, this gives an estimate of how much each child would have run at that age.

This specification essentially fits a regression line for each and every child in such a way that they have the same slope - the linear increase with age - but a different intercept, the amount they would achieve at the average age of the sample. This is a parallel lines approach in which children differ but the difference does not change with age.

37 See Rabe-Hesketh, S. (2002) Multilevel selection models using gllamm, at
38 Snijders, T. and Bosker, R. (1999) Multilevel analysis: an introduction to basic and advanced multilevel modeling. Sage Publications, London

The equation for the estimated lines is

$\hat{y}_{ij} = \hat{\beta}_{0j}x_{0ij} + \hat{\beta}_1 x_{1ij}$   (8)

Consequently, the unexplained random part of the micro-model is $e_{0ij}$, which gives the differences from the child-specific line on each occasion; it can be seen in equation (7) that this unobserved latent variable is associated with the Constant.

The associated macro model is the between-children model and can be specified as follows:

$\beta_{0j} = \beta_0 + \beta_2 x_{2j} + u_{0j}$   (9)

The observed variable is $x_{2j}$, a time-invariant variable for each child, here a dummy indicator variable with a 1 for a girl and 0 for a boy. The additional parameters to be estimated are

$\beta_0$   the grand-mean intercept which, because of the indicator dummy, gives the mean distance achieved by the reference category of a boy of average age;

$\beta_2$   the differential achieved by a girl on average.

The unexplained random part of the model is $u_{0j}$, which gives the differential intercept for each child that remains after taking account of the gender of the child. Substituting the macro equation (9) into the micro equation (7) we obtain the combined model:

$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + \beta_2 x_{2j}x_{0ij} + (u_{0j}x_{0ij} + e_{0ij}x_{0ij})$   (10)

The $\beta$'s are the fixed part of the model and give the averages. Here, $\beta_0$ is the mean distance achieved by a boy of average age, $\beta_1$ is the linear change in distance if the child (of either sex) is one year older, and $\beta_2$ is the girl-boy differential at all ages. The random part is composed of two parts: $u_{0j}$, which is the unexplained differential achievement for a child

given their age and gender; and $e_{0ij}$, which is the unexplained occasion-specific differential given the child's age, gender and differential performance.

Distributional assumptions complete the model. Here we assume, given the continuous measurement of performance, that both the level-1 and level-2 random terms follow a Normal distribution and that there is no covariance between the occasion and child differentials:

$u_{0j} \sim N(0, \sigma^2_{u0}); \quad e_{0ij} \sim N(0, \sigma^2_{e0}); \quad \mathrm{Cov}[u_{0j}, e_{0ij}] = 0$   (11)

The variance at level 2, $\sigma^2_{u0}$, summarizes the differences between children conditional on the terms in the fixed part of the model. If we include additional terms at either the occasion or the child level (that is, time-varying or time-invariant) and they are good predictors of the response, we would anticipate that this value would become smaller. The variance at level 1, $\sigma^2_{e0}$, summarizes the between-occasion differences. This can only reduce if we include time-varying variables that are good predictors of the distance that has been run.

The combined model of equation (10) is known as a random-intercepts model, as the allowed-to-vary intercept is treated as a latent or random variable and the child differential, $u_{0j}$, is assumed to come from a common distribution with the same variance. Thus, in the random-effects specification the target of inference is not each child but the between-child variance. If $u_{0j}$ is positive, the total residual ($u_{0j} + e_{0ij}$) will tend to be positive, leading to greater endurance than predicted by the covariates; if the random intercept is negative, the total residual will also tend to be negative. Since $u_{0j}$ is shared by all responses for the same child, this induces within-child dependence among the total residuals. The larger the level-2 variance relative to the total variance (this is what ρ measures), the greater this similarity effect will be and the greater the dependence.

Two-level model: varying slopes

The structure of the basic model can be extended in a number of other ways. An important development is the random-slopes model, in which the slope term in the micro-model is additionally indexed so as to allow each child to have their own development trajectory:

$y_{ij} = \beta_{0j}x_{0ij} + \beta_{1j}x_{1ij} + e_{0ij}x_{0ij}$   (12)

There are now two macro-models, one for the allowed-to-vary intercepts:

$\beta_{0j} = \beta_0 + \beta_2 x_{2j} + u_{0j}$   (13)

and one for the allowed-to-vary slopes:

$\beta_{1j} = \beta_1 + \beta_3 x_{2j} + u_{1j}$   (14)

When both macro-models are substituted into the micro-model to form the combined model

$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + \beta_2 x_{2j}x_{0ij} + \beta_3 x_{2j}x_{1ij} + (u_{0j}x_{0ij} + u_{1j}x_{1ij} + e_{0ij}x_{0ij})$   (15)

a cross-level interaction is required in the fixed part between time-varying age and time-invariant gender. As before the $\beta$'s give the averages: $\beta_0$ is the mean distance achieved by an average-aged boy, $\beta_1$ is the linear change in distance if boys are one year older, $\beta_2$ is the girl-boy differential at the average age, and $\beta_3$ is the differential slope for girls in comparison to boys. The random part is now composed of three parts: $u_{0j}$, which is the unexplained differential achievement at average age for a child; $u_{1j}$, which is the differential child-specific slope; and $e_{0ij}$, which is the unexplained occasion-specific differential given the child's age, gender and differential performance. The distributional assumptions are

$\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} \sim N(0, \Omega_u), \quad \Omega_u = \begin{bmatrix} \sigma^2_{u0} & \\ \sigma_{u0u1} & \sigma^2_{u1} \end{bmatrix}; \quad e_{0ij} \sim N(0, \sigma^2_{e0})$   (16)

so that the level-2 random intercepts and slopes are assumed to come from a joint Normal distribution, with $\sigma^2_{u0}$ being the variance of the intercepts, $\sigma^2_{u1}$ the variance of the slopes, and $\sigma_{u0u1}$ the covariance of the intercepts and slopes. The total variance between children at level 2 is then given by a quadratic variance function (Bullen et al, 1997):39

$\mathrm{Var}(u_{0j}x_{0ij} + u_{1j}x_{1ij}) = \sigma^2_{u0}x_{0ij}^2 + 2\sigma_{u0u1}x_{0ij}x_{1ij} + \sigma^2_{u1}x_{1ij}^2$   (17)

This specification can accommodate variation between children that increases over time, decreases, or remains steady, as we have seen in Figure 5.

Non-linear growth curves

In the above specification the underlying growth curve has been specified to have a linear form. The simplest way to specify a non-linear curve, and thereby allow development to be non-constant, is to include a polynomial of age in the micro model, such as the square of age:

$y_{ij} = \beta_{0j}x_{0ij} + \beta_1 x_{1ij} + \beta_2 x_{1ij}^2 + e_{0ij}x_{0ij}$   (18)

It is also possible to allow the parameters associated with age to be random at level 2:

$y_{ij} = \beta_{0j}x_{0ij} + \beta_{1j}x_{1ij} + \beta_{2j}x_{1ij}^2 + e_{0ij}x_{0ij}$   (19)

so that this model allows curvilinearity at both the population and individual levels. In this model

39 Bullen, N., Jones, K. and Duncan, C. (1997) Modelling complexity: analysing between-individual and between-place variation - a multilevel tutorial, Environment and Planning A, 29(4)

$\beta_{0j}$   is the performance for person j at the centred age;

$\beta_{1j}$   is the linear change with age for person j; this is known as the instantaneous rate of change;

$\beta_{2j}$   is the term associated with the quadratic of age for person j; it is sometimes known as the curvature. A positive estimate for this new term would indicate acceleration of growth (and a convex curve) while a negative value would indicate deceleration (and a concave curve).

The choice between equations (18) and (19) is often an empirical one: what degree of complexity is supported by the data? In general it is perfectly acceptable to have quite complex fixed effects while only allowing the lower-order terms such as the linear one to vary randomly between subjects; indeed a random linear term for age already specifies a quadratic variance function, as we saw in equation (17). It is a simple matter in both quadratic models to calculate the point at which the trend flattens out, the inflection or stationary point, through the derivative, and this can be done across the complete growth trajectory:

$\frac{\partial y_{ij}}{\partial x_{1ij}} = \beta_{1j} + 2\beta_{2j}x_{1ij}$   (20)

which equals zero for

$x_{1ij} = -\beta_{1j}/(2\beta_{2j})$   (21)

This may of course be outside the range of the data. Other forms of growth can be fitted, so that a more curvy result is possible by adding a cubic term (so that there are two inflection points), while more complex models dealing with growth spurts or growth falling off as maturity is reached may require something more complex than polynomials. Splines are a particularly flexible choice. A key paper is Verbyla et al (1999), who show that a mixed-model (ie random coefficient) methodology can be used to derive an optimal smoothing parameter for estimating the curviness of the splines from a single additional variance component.40 Here, given the continuous growth that can be expected for physical measures of performance, it is unlikely that we will need more complex forms such as piecewise relations, which can accommodate a sudden jump in development (Snijders and Bosker, 1999, 187; Holt, 2008).41 There are also intrinsically non-linear models that cannot be transformed to linearity.42

40 Verbyla, A.P., Cullis, B.R., Kenward, M.G. and Welham, S.J. (1999) The analysis of designed experiments and longitudinal data using smoothing splines. Applied Statistics, 48
41 Holt, J.K. (2008) Modeling growth using multilevel and alternative approaches. In A.A. O'Connell and D.B. McCoach (Eds.) Multilevel Analysis of Educational Data, IAP, Charlotte
42 Davidian, M. and Giltinan, D.M. (1995) Nonlinear models for repeated measurement data. New York, Chapman Hall; see also chapter 9 of Goldstein (2011)
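A hedged sketch of equations (18), (20) and (21) in Python, with invented coefficients, evaluating a quadratic growth curve, its instantaneous rate of change and its stationary point:

```python
import numpy as np

# Invented coefficients for a quadratic growth curve on centred age:
# distance = b0 + b1*age_c + b2*age_c**2
b0, b1, b2 = 18.0, 0.8, -0.05

age_c = np.linspace(-4.0, 6.0, 6)
distance = b0 + b1 * age_c + b2 * age_c ** 2   # fixed part of equation (18)
rate = b1 + 2 * b2 * age_c                     # derivative, equation (20)
stationary = -b1 / (2 * b2)                    # equation (21): rate equals zero

print(distance.round(2))
print(rate.round(2))   # positive but shrinking: decelerating (concave) growth
print(stationary)      # 8.0 on the centred scale, possibly outside the data
```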

Dummy variables can also be added to the equation to mark particular time-based processes, so that in a detailed study of happiness on an hourly basis, a dummy could signify the weekend, and another dummy Monday morning! Another example is to include a dummy for the first measurement occasion, in addition to the time trend, so as to model a learning effect from having completed the task once.

Modelling dependence in the random intercepts and slopes model

The random-intercepts specification of equation (10) implies a degree of correlation between the children's measures on different occasions. Based on the assumptions of equation (11), the covariance between measures on child j on occasions 1 and 2 (Goldstein, 2011, 19) is

$\mathrm{Cov}(y_{1j}, y_{2j}) = \mathrm{Cov}(u_{0j} + e_{01j},\ u_{0j} + e_{02j}) = \sigma^2_{u0}$   (22)

Consequently, the degree of correlation between occasions is given by

$\rho = \frac{\sigma^2_{u0}}{\sigma^2_{u0} + \sigma^2_{e0}}$   (23)

which is the usual ratio of the between-child variance to the total variance. This type of dependency is known as compound symmetry and constrains the correlation to be the same between any pair of occasions. This model therefore imposes a block-diagonal structure on the data, so that observations within the same child on different occasions have the same correlation, ρ, whatever the lag apart, while observations from different children have correlation 0, that is, they are independent.

Figure 11 Block diagonal correlation structure - two level random intercepts model
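Equation (23) can be verified by simulation; a minimal Python sketch (all parameter values invented for illustration) draws data from a null random-intercepts model and confirms that any pair of occasions is correlated at roughly $\sigma^2_{u0}/(\sigma^2_{u0}+\sigma^2_{e0})$:

```python
import numpy as np

rng = np.random.default_rng(1)
n_children, n_occasions = 5000, 5
sigma2_u0, sigma2_e0 = 6.0, 4.0          # illustrative variances

u0 = rng.normal(0, np.sqrt(sigma2_u0), size=(n_children, 1))   # child effects
e0 = rng.normal(0, np.sqrt(sigma2_e0), size=(n_children, n_occasions))
y = 18.0 + u0 + e0                        # grand mean plus the two residuals

# Empirical correlation between two occasions vs the compound-symmetry value
print(np.corrcoef(y[:, 0], y[:, 1])[0, 1])      # close to 0.6
print(sigma2_u0 / (sigma2_u0 + sigma2_e0))      # exactly 0.6
```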

The random-slopes model relaxes this by allowing the degree of similarity to vary with age (Snijders and Bosker, 1999, 172). It is also possible to model the between-occasion dependency explicitly through covariance terms, and a number of different formulations are available which we will discuss later.

Three-level model

Differential school effects can be handled by a three-level model in which occasions are nested in individuals who are in turn nested in schools. The combined model in its random-intercepts form (building on equation 10) is

$y_{ijk} = \beta_0 x_{0ijk} + \beta_1 x_{1ijk} + \beta_2 x_{2jk} + (v_{0k} + u_{0jk} + e_{0ijk})$   (24)

where the dependent variable $y_{ijk}$ is the distance covered on occasion i by child j of school k. The random part has three elements: $v_{0k}$, which is the unexplained differential achievement for a school around the linear trend; $u_{0jk}$, which is the unexplained differential achievement for a child given their age, gender and school; and $e_{0ijk}$, which is the unexplained occasion-specific differential given the child's age, gender, and differential performance. The distributional assumptions complete the model:

$v_{0k} \sim N(0, \sigma^2_{v0}); \quad u_{0jk} \sim N(0, \sigma^2_{u0}); \quad e_{0ijk} \sim N(0, \sigma^2_{e0})$   (25)

The terms give the residual variance between schools, between children and between occasions.

Figure 12 Block diagonal correlation structure - three level random-intercepts model

Figure 12 shows the correlation structure of a three-level random intercepts model. There are now two measures of dependency: the intra-school correlation (same school, different child)

$\rho_{school} = \frac{\sigma^2_{v0}}{\sigma^2_{v0} + \sigma^2_{u0} + \sigma^2_{e0}}$   (26)

and the intra-child correlation over occasions (same school and same child)

$\rho_{child} = \frac{\sigma^2_{v0} + \sigma^2_{u0}}{\sigma^2_{v0} + \sigma^2_{u0} + \sigma^2_{e0}}$   (27)

Some covariance structures for repeated measures data

The multilevel model automatically corrects the standard errors of the fixed part of the model for the degree of dependency in the repeated measures, and thereby protects against incorrectly finding statistically significant results. This is especially the case for time-invariant variables, which only vary at the higher level. But it can only do this if the nature of the dependency has been modelled correctly. Consequently, it is worth considering alternative forms of dependency, for as Cheng et al (2010, 506) argue, "even when the covariance structure is not the primary interest, an appropriate covariance model is essential to obtain valid inference for the fixed effects". Modelling the dependency appropriately is likely to result in smaller standard errors, that is, more efficient model-based inferences. It is possible to estimate a number of different dependency structures with repeated measures data by specifying different covariance patterns and therefore different correlations over time. A number of commonly employed formulations are now described.43

Homogeneous without dependency

The simplest possible covariance pattern, here shown for four occasions, is:

$\begin{matrix} Time\,1 \\ Time\,2 \\ Time\,3 \\ Time\,4 \end{matrix}\begin{bmatrix} \sigma^2 & & & \\ 0 & \sigma^2 & & \\ 0 & 0 & \sigma^2 & \\ 0 & 0 & 0 & \sigma^2 \end{bmatrix}$

In this formulation, there are

- equal variances along the main diagonal, that is the same variance at each occasion, so that the structure is called a homogeneous one;
- zero covariances in the off-diagonal positions, so there is no dependency over time;

43 The same algebraic symbols are used in the different formulations but they can have a different meaning; there is internal coherence within a specification but not necessarily between them

- the need for only one term, the pooled variance $\sigma^2$, to be estimated; this is a standard ANOVA model that can be estimated by Ordinary Least Squares;
- because there is no dependency, the specification is not a multilevel one.

A heterogeneous form of this simple structure would have a separate variance for each occasion, so that there are four terms to be estimated, but the covariances remain at zero so there is still no dependency.

Compound symmetry (random-intercepts model)

The simplest form that allows dependency is the standard multilevel random-intercepts model (as in equation 10), which is known as the compound symmetry formulation:

$\begin{matrix} Time\,1 \\ Time\,2 \\ Time\,3 \\ Time\,4 \end{matrix}\begin{bmatrix} \sigma^2_{u0}+\sigma^2_{e0} & & & \\ \sigma^2_{u0} & \sigma^2_{u0}+\sigma^2_{e0} & & \\ \sigma^2_{u0} & \sigma^2_{u0} & \sigma^2_{u0}+\sigma^2_{e0} & \\ \sigma^2_{u0} & \sigma^2_{u0} & \sigma^2_{u0} & \sigma^2_{u0}+\sigma^2_{e0} \end{bmatrix}$

This formulation has:

- only two terms in the random part (the variance between children, $\sigma^2_{u0}$, and the variance within children, between occasions, $\sigma^2_{e0}$), irrespective of how many occasions there are;
- the covariance given by $\sigma^2_{u0}$, and the degree of correlation between any pair of occasions given by $\sigma^2_{u0}/(\sigma^2_{u0}+\sigma^2_{e0})$; consequently, the same degree of dependency - equal correlation - between occasions is imposed irrespective of lag;
- the property that it can readily be estimated as a two-level hierarchical model with occasions nested within children.

Un-structured multivariate model

The least parsimonious model has un-structured co-variances:

$\begin{matrix} Time\,1 \\ Time\,2 \\ Time\,3 \\ Time\,4 \end{matrix}\begin{bmatrix} \sigma^2_{1} & & & \\ \sigma_{12} & \sigma^2_{2} & & \\ \sigma_{13} & \sigma_{23} & \sigma^2_{3} & \\ \sigma_{14} & \sigma_{24} & \sigma_{34} & \sigma^2_{4} \end{bmatrix}$

This multivariate repeated measures structure is the most complex, with separate variances on the diagonal and separate co-variances on the off-diagonal:

- the variance is estimated separately for each occasion (that is, it is a heterogeneous formulation), with separate co-variances for each pair of occasions;

- this allows the dependency structure to change with occasion, so that the dependency between observations the same lag apart may be higher earlier or later in the sequence, when the process under study has 'bedded in';
- the number of parameters grows rapidly with the number of occasions: m(m+1)/2 parameters are needed in the model, where m is the number of occasions; in the Endurance example, we need to estimate 15 parameters for just 5 occasions, which leads to less precise estimation; if there are 20 waves, 210 parameters will be needed;
- the model can be estimated as a multivariate multilevel model with each occasion as a separate response; consequently, in this saturated model, there will be no level-1 within-person, between-occasion variance;
- the co-variances can be used to estimate the correlation as usual, the covariance divided by the product of the square roots of the variances, such that the correlation between Occasions 1 and 2 is given by $\sigma_{12}/\sqrt{\sigma^2_1 \sigma^2_2}$ (see the sketch following the Toeplitz description below).

Toeplitz model

Finally, we will consider three middle-ground models in terms of parsimony. The Toeplitz model is formulated so that the covariance between measurements is assumed to change with the length of lag:

$\begin{matrix} Time\,1 \\ Time\,2 \\ Time\,3 \\ Time\,4 \end{matrix}\begin{bmatrix} \sigma^2_{0} & & & \\ \sigma_{1} & \sigma^2_{0} & & \\ \sigma_{2} & \sigma_{1} & \sigma^2_{0} & \\ \sigma_{3} & \sigma_{2} & \sigma_{1} & \sigma^2_{0} \end{bmatrix}$

It can be seen that there are:

- the same variances on the diagonal at each occasion;
- co-variances on the off-diagonal arranged so that occasions that are the same lag apart have the same dependency;
- the same banded structure as the AR1 model (described next), but less restrictive, as the off-diagonal elements are not forced to be an identical fraction of the elements of the prior band: they are separately estimated;
- a number of parameters that grows linearly with the number of occasions (so not as rapidly as the unstructured model); in the Endurance example, there are only 5 parameters compared to the 15 of the unstructured model; when there is a long series, the correlation at large lags can be set to zero;
- estimation in MLwiN either as a multivariate multilevel model with linear constraints on the random parameters imposed through the RCON command, or by an appropriate design structure imposed on a standard multilevel model using the SETD feature. A heterogeneous form with different variances is also possible.
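Returning to the correlation calculation mentioned for the unstructured model, a minimal numpy sketch (matrix values invented) converts a covariance matrix into the corresponding correlation matrix:

```python
import numpy as np

# An invented symmetric 4-occasion unstructured covariance matrix
omega = np.array([[9.0,  5.0,  4.0,  3.0],
                  [5.0, 10.0,  6.0,  5.0],
                  [4.0,  6.0, 11.0,  7.0],
                  [3.0,  5.0,  7.0, 12.0]])

# Correlation = covariance / product of the square roots of the variances
sd = np.sqrt(np.diag(omega))
corr = omega / np.outer(sd, sd)
print(corr.round(3))   # e.g. corr(Occ1, Occ2) = 5 / sqrt(9 * 10) = 0.527
```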

First order autoregressive residual model

The first order autoregressive residual model (AR1) structure is so called to distinguish it from a true autoregressive model with a lagged response. It has the following covariance structure (shown for fixed occasions):

$\begin{matrix} Time\,1 \\ Time\,2 \\ Time\,3 \\ Time\,4 \end{matrix}\begin{bmatrix} \sigma^2 & & & \\ \sigma^2\rho & \sigma^2 & & \\ \sigma^2\rho^2 & \sigma^2\rho & \sigma^2 & \\ \sigma^2\rho^3 & \sigma^2\rho^2 & \sigma^2\rho & \sigma^2 \end{bmatrix}$

which is equivalent to the correlation structure

$\begin{matrix} Time\,1 \\ Time\,2 \\ Time\,3 \\ Time\,4 \end{matrix}\begin{bmatrix} 1 & & & \\ \rho^{(t_2-t_1)} & 1 & & \\ \rho^{(t_3-t_1)} & \rho^{(t_3-t_2)} & 1 & \\ \rho^{(t_4-t_1)} & \rho^{(t_4-t_2)} & \rho^{(t_4-t_3)} & 1 \end{bmatrix}$

This formulation has:

- equal variances on the main diagonal;
- off-diagonal covariance terms representing the variance multiplied by the autoregressive coefficient, ρ, raised to increasing powers as the observations become increasingly separated in time; such increasing powers mean decreasing co-variances: the greater the time span between two measurements, the lower the correlation is assumed to be; clearly this formulation can handle continuous time as well as fixed occasions;
- with fixed occasions, a banded structure (like the Toeplitz) so that there is the same correlation between responses one lag apart (that is, occasions 1 and 2, 2 and 3, and 3 and 4); similarly the same correlation is imposed between occasions two lags apart (occasions 1 and 3, 2 and 4);
- only one additional term, the autoregressive parameter ρ, compared to the compound symmetry model; in practice it has been found for many data sets that this rather simple model imposes too strong a decay on the dependency with lag;
- MLwiN has facilities for the estimation of homogeneous and heterogeneous AR(1) models.

Autoregressive weights model

The final model (which is not generally used in the literature) can be called an autoregressive weights model. Given its degree of parsimony, it is a very flexible specification that can be estimated readily in MLwiN with likelihood procedures, but not MCMC:

$\begin{matrix} Time\,1 \\ Time\,2 \\ Time\,3 \\ Time\,4 \end{matrix}\begin{bmatrix} \sigma^2 & & & \\ \alpha w_1\sigma^2 & \sigma^2 & & \\ \alpha w_2\sigma^2 & \alpha w_1\sigma^2 & \sigma^2 & \\ \alpha w_3\sigma^2 & \alpha w_2\sigma^2 & \alpha w_1\sigma^2 & \sigma^2 \end{bmatrix}$

This formulation has:

- the same variances on the diagonal at each occasion;
- co-variances on the off-diagonal arranged so that occasions the same lag apart have the same dependency; dependency depends only on lag with fixed occasions, or on time span with continuous time;
- with fixed occasions, the same banded structure as the Toeplitz model, but more parsimonious in that the covariances are a function of the single parameter α and a set of chosen weights w; thus the weights could be an inverse function of lag (1/1, 1/2, 1/3) so that there is less and less dependency the further apart the lags are; a steeper decline in the dependency could be achieved by specifying inverse power weights, so that the three required weights, representing the three different lags, are $1/1^2$, $1/2^2$ and $1/3^2$;
- only one additional parameter, α, in comparison to the compound symmetry model;
- unlike the AR1 model, no rigid structure in which the off-diagonal elements are forced to be an identical fraction of the elements of the prior band;
- estimation in MLwiN by choosing an appropriate design structure on a standard multilevel model using the SETD feature; you have to choose the weights, they are not part of the estimation process.

Block-diagonal structures

To help avoid confusion, it must be remembered that we have been showing the variance-covariance matrix for a single child. In reality this matrix is imposed as a symmetric block-diagonal structure, with a block for each and every child. Schematically, the full matrix for 3 children each measured on 4 occasions, with an unstructured matrix imposed, is

$\begin{bmatrix} \Omega & 0 & 0 \\ 0 & \Omega & 0 \\ 0 & 0 & \Omega \end{bmatrix}$

where each $\Omega$ is the 4-by-4 unstructured covariance matrix given above, the rows and columns are ordered $t_1C_1, \ldots, t_4C_1, t_1C_2, \ldots, t_4C_3$ (so that $t_4C_3$ is the fourth occasion for the third child), and the zero blocks show that observations from different children are independent.
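To make these patterns concrete, a short numpy/scipy sketch (outside MLwiN, with invented parameter values) constructs the compound-symmetry, Toeplitz and AR(1) covariance matrices for four occasions, and stacks one of them into the block-diagonal form shown above:

```python
import numpy as np
from scipy.linalg import toeplitz, block_diag

m = 4                                    # occasions per child
var_u0, var_e0, rho = 6.0, 4.0, 0.7      # illustrative parameters

# Compound symmetry: equal covariance between all pairs of occasions
cs = np.full((m, m), var_u0) + var_e0 * np.eye(m)

# Toeplitz: one freely chosen covariance per lag (values invented)
tp = toeplitz([10.0, 6.0, 4.5, 3.5])

# AR(1): covariance decays as rho ** lag
lags = np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
ar1 = (var_u0 + var_e0) * rho ** lags

# Full matrix for 3 children: independent blocks down the diagonal
full = block_diag(cs, cs, cs)
print(full.shape)                        # (12, 12)
```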

We will subsequently fit a number of these models in MLwiN, but will only do so after fitting a random-intercepts, compound-symmetry model. It is worth stressing again that not all of the models can be fitted in MLwiN. The MCMC estimation does not currently allow the Toeplitz model, nor any user-defined constraints (that may have been set by the RCON command) beyond a menu of alternatives; the SETD command cannot be used with MCMC estimation, but only with the IGLS and RIGLS procedures.

Estimating a two and three level random intercepts and slopes model

This section details how to restructure the data from wide to long and how to check the quality of the data, and then fits a series of models of increasing complexity. The approach to model fitting that we have adopted is a hypothesis-informed, bottom-up, exploratory approach (in the manner of Chapter 11). We want to know the general trajectory of development, whether the form is linear or non-linear, and the extent to which boys and girls living in urban and rural areas have differential development trends. This means that the initial focus is on a simple random part with an elaboration of the fixed part of the model to derive a concise formulation of the mean effects. Attention then focuses on developing the random part to include higher-order random effects at the school level and a search for the most appropriate dependency structure. In practice it is difficult to know the precise form of the dependency, so we have to try out a number of different forms to obtain the best fit and to see the extent to which this makes a difference to the other estimates. There is considerable tension in this section between what you would do in a real investigation and what we have done here for pedagogic purposes, in that we have tried to show a great variety of models that could be needed for growth modelling in general but are not needed for the specific data we are modelling.

Singer and Willett (2003, 75) recommend always fitting two initial models: the unconditional means model and the unconditional growth model. The former is simply a variance-components model without any predictors; it calculates an overall mean and partitions the variance into between- and within-child. The latter then includes some predictor for time, which in our case would be a function of age. These are two useful baselines, but we must remember that as we add fixed-part predictors (and this includes polynomials and interactions) the nature of the random part can change quite markedly. Thus, as we shall see, apparent increasing heterogeneity between children in their growth trajectories may in reality be due to growing mean differences by gender. It is thus important to model the fixed part well before considering complex random parts. We do not recommend an exhaustive machine-based automatic procedure, as the machine will not know your research questions nor the literature on which they are based.
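For readers working outside MLwiN, the two baseline models can be fitted in Python with statsmodels (a sketch under stated assumptions: statsmodels is not MLwiN, its default REML estimates will differ slightly from IGLS, and the data here are simulated rather than the Endurance worksheet):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a small long-format data set: 100 children, 5 occasions each
rng = np.random.default_rng(0)
n, m = 100, 5
child = np.repeat(np.arange(n), m)
age = np.tile(np.arange(m), n) + 7.0
u0 = np.repeat(rng.normal(0, 2.5, n), m)        # child random intercepts
endur = 18.0 + 0.8 * (age - age.mean()) + u0 + rng.normal(0, 2.0, n * m)
df = pd.DataFrame({"child": child, "age": age, "endur": endur})

# Unconditional means model: grand mean plus between- and within-child variance
m0 = smf.mixedlm("endur ~ 1", df, groups=df["child"]).fit()

# Unconditional growth model: add a fixed linear effect of centred age
df["age_c"] = df["age"] - df["age"].mean()
m1 = smf.mixedlm("endur ~ age_c", df, groups=df["child"]).fit()

print(m0.summary())
print(m1.summary())
```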

Structuring the data from wide to long

The original data are given in wide format, with columns giving the Endurance distances at different occasions and rows giving the individual children.

File on main menu
Open worksheet
Filename Endurancewide.wsz

This will bring up the Names window, which gives a summary of the data. There are four blocks of variables:

- Endurance measures: the distance covered by each child in hundreds of metres on 5 separate occasions; we have chosen hundreds of metres as the scale of measurement as this will result in sensible estimates for the coefficients; use of metres or kilometres would have produced very small or very large absolute values that may have hindered convergence. Notice that there are some missing values for the four later measurement occasions, but all children are measured in the first year of recruitment;
- Age variables: the age of the child in decimal chronological years at each planned measurement occasion;
- Occasion indicators: the calendar year of the planned measurement occasion;
- Child variables: these include a Child and a School identifier, a Sex and a Rurality dummy, and a set of 1's, the Constant.

It is helpful to have a look at the original data; in the Names window:

Highlight all 20 variables
Data in toolbar and click on View

to get the following data extract. The missing values are clearly seen, as is the categorical nature of Sex and Rurality; it is a good idea to use a different label for categorical outcomes than that used for the column name (thus Sex as column and Boy/Girl as labels), as MLwiN will use the label to name any columns of dummy variables created from the categories; otherwise there could be a naming conflict.

To fit a simple repeated measures model in MLwiN we have to turn these data into a long format where the rows represent different occasions. We also have to deal differently with the occasion-varying variables (Endurance, Age, Occasion) and the occasion-invariant variables (Sex, Rurality, the Constant, and the Child and School identifiers).

Data manipulation on Main menu
Split records (this restructures data from wide to long)
Number of occasions set to 5
Number of variables set to 3, as there are 3 occasion-varying variables
Specify End1 to End5 for Variable 1, Age1 to Age5 for Variable 2, and Occ1 to Occ5 for Variable 3
Stack the repeated measures into the empty columns c22, c23 and c24; this will give the interleaved data
In the Repeat(carried) data, highlight the occasion-invariant variables Child, Sex, Rurality, Constant, and School, choosing the Output columns to be the Same as input
Tick on Generate indicator column and choose to store in the empty column c21; this will give you the occasion number of the 5 occasions
Split (No to Save worksheet)
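As an aside, the same wide-to-long restructuring can be done outside MLwiN; a hedged pandas sketch, with column names echoing the worksheet but the data values invented:

```python
import pandas as pd

# Two children in wide format: End1..End2 and Age1..Age2 (a small extract)
wide = pd.DataFrame({
    "Child": [1, 2], "School": [1, 1], "Sex": [0, 1], "Rurality": [1, 0],
    "End1": [19.1, 15.2], "End2": [26.0, 17.8],
    "Age1": [7.3, 7.6],   "Age2": [8.3, 8.6],
})

# Stack the occasion-varying variables; occasion-invariant columns are carried
long = pd.wide_to_long(wide, stubnames=["End", "Age"],
                       i="Child", j="Occasion").reset_index()
long = long.sort_values(["School", "Child", "Occasion"])  # impose the hierarchy
print(long)
```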

Name the columns in the Names window as follows (use different names from those of the existing columns):

C21: Occasion
C22: Endur
C23: Age
C24: Year

After this naming, the Names window should look as follows (we have also added a description to the new variables), and you can see that the new variables now consist of 1500 observations (300 respondents by 5 occasions). Save the worksheet with a name like EndurLong.wsz.

The correct structure of Occasions within Child within School now has to be imposed on the data, and this is done by sorting on the 3 key variables and carrying the rest of the data (that is, all the variables of length 1500 must be involved in this process).

Data Manipulation on main menu
Sort
Number of key codes 3
School as highest
Child as middle
Occasion as lowest
Select all the long variables (Child to Year)
Choose Same as input to put back the sorted data into the original columns
Add to Action List
Execute

Highlight these long columns in the Names window and View them

It can be seen that Child 1 attends School 1; he is a Boy living in a Rural area, and he was aged 7.3 years in 1994 at Occasion 1. There are no missing endurance values for him and he covered 19.1, 26.0, 26.8, 26.0 and 30.3 hundreds of metres in successive 12-minute run tests.

Checking the data: graphing and tabulating

Before modelling the data, it is sensible to get some feel for the values involved by carrying out a descriptive analysis. To plot a histogram of Endurance, the sequence is:

Graphs on Main menu
Customised graphs
Choose Endur for Y on the 'plot what?' tab
Choose Plot type to be a Histogram
Apply

to get the following graph, once labelled. There is some positive skew, but remember that the required assumption is only approximate Normality conditional on the fixed part of the model; we can check this after model estimation.

To obtain a real feel for the variation in the data, plot a line for each child showing their development over time:

Graphs on Main menu
Customised graphs
Choose Endur for Y on the 'plot what?' tab
Choose Age for X
Choose Plot type to be a Line
Choose Group to be Child
Apply

to get the following graph, once labelled. While there is some general increase in Endurance with age, there is also substantial between-child variation.

Question 1: obtain a plot of Endurance against Age separately for boys and girls (Hint: use Col codes on the graph); what do you find?44

To obtain the averages for Boys and Girls by the occasion of measurement:

Basic statistics on Main menu
Tabulate
Click on Means
Choose the Variate column to be Endur
Columns of the cross-tabulation to be Sex
Tick on Rows of the cross-tabulation to be Occasion
Tabulate

44 Answers are at the end of this chapter

The results will appear in the output window, from where they can be copied to a word processor:

Variable tabulated is Endur
Columns are levels of Sex
Rows are levels of Occasion
(N, MEANS and SD'S for Boys, Girls and TOTALS at Occasions 1 to 5)

It can be seen that the data are well balanced in terms of gender, with approximately the same number of Boys and Girls on each occasion. Boys consistently have a higher mean than Girls, and the distance run increases with each occasion in a way that is more marked for Boys than for Girls.

Question 2: use the Tabulate command to examine the means for Endurance for the cross-tabulation of Sex and Rurality; what do you find?

It is worth examining the missingness in terms of the other observed characteristics. To do this we first have to recode the Endurance values into Missing or Not.

Data Manipulation on main menu
Recode by Range (as Endur is a continuous variable)
Value in range Missing to Missing to New Value 1 (just type in M)
Value in range 0 to 100 to New Value 0 (a range encompassing all the observed data)
Input column Endur
Free columns
Add to Action List
Execute

In the Names window, re-name c25 to be Missing and toggle it to categorical, giving the labels Not to 0 and Yes to 1. Then use Tabulate to examine how this missingness relates to Sex:45

Columns are levels of Missing
Rows are levels of Sex
(N and ROW % for Boy, Girl and TOTALS against Not and Yes)

Some 5% of both Boys and Girls are missing, so there does not appear to be a selective process in terms of Sex.

Question 3: are there any distinctive patterns of missingness in terms of Rurality and Occasion?

45 June 2011: I had to use a workaround so that the Variate in the previous Tabulate, when Means were chosen, was replaced by Missing; or try closing and re-opening the Tabulate window
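The same missingness check can be run outside MLwiN in one line; a pandas sketch with a few invented rows:

```python
import pandas as pd

df = pd.DataFrame({
    "Sex":   ["Boy", "Boy", "Girl", "Girl", "Boy", "Girl"],
    "Endur": [19.1, None, 21.0, None, 23.4, 18.8],
})

# Row-percentage cross-tabulation of Sex against missingness on the response
print(pd.crosstab(df["Sex"], df["Endur"].isna().rename("Missing"),
                  normalize="index"))
```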

Listwise Deletion

As one of the procedures that we will use later (the SETD command) only operates with non-missing data for the dependent variable, we will remove the missing values from the long vectors of the dataset using the Listwise delete facility. As we will be operating on the long vectors, this means that only the missing data from specific occasions will be dropped. Thus we are retaining all the information that we have.

Data Manipulation on main menu
Listwise
Listwise delete on value Missing (just type in M)
Input columns: highlight all the long variables (Child to Year)
Choose Same as input to put back the non-deleted data into the original columns
Add to Action List
Execute

so that the data in the View window now look as follows.

There are now only 1421 (instead of 1500) observations, and you can see that the missing responses (eg for child 3 on occasion 3) have been deleted.

The two-level null model

To set this up in MLwiN:

From the Model menu, select Equations
Click Notation at the bottom of the Equations window; ensure that only the General box and the Multiple subscripts box are ticked
In the Equations window, click on y and select Endur from the drop-down list
Click on N levels and select 2-ij to specify a 2-level model
For level 2(j) select Child
For level 1(i) select Occasion
Click Done
Click on the β0x0 term, which is currently shown in red (signifying that it is unspecified), and choose Cons from the drop-down list; click on j(child), i(occasion), followed by Done
Click the + button at the bottom of the Equations window twice to get the full specification

The model represents a null, empty or unconditional 2-level random-intercepts model. Before proceeding to estimate the model, it is a good idea to check that the data structure is as expected:

Model on Main Menu
Hierarchy viewer

to get the following results:

There are, as expected, 300 children with up to 5 measurement occasions nested within each child. The missing observations for the dependent variable are shown by children with fewer than 5 observations.46

To estimate the model:

Click Start on the main menu to begin the estimation from scratch
Click Estimates at the bottom of the Equations window twice to see the parameter estimates; the green colour signifies that the model has converged

You are strongly encouraged to Save the models as we proceed under a unique, distinctive and memorable filename; when you do so, the current model, estimates, any graphical commands and the current worksheet are stored in a zipped format which can subsequently be Opened, but only by MLwiN.

46 Now that the structure has been defined we can give the following commands in the Command Window to see the pattern of this missingness:
MLCOunt 'Child' c26 [counts the number of observations in each child block, stored long in c26]
TAKE 'Child' 'c26' 'c26' [takes the first count for each Child block; c26 is now short]
TABUlate 1 'c26' [tabulates the short c26; 1 requests row percentages]
There are only 10 children with two observations missing, none with four.

This model gives us the grand mean, the mean endurance in hundreds of metres for all children and all occasions irrespective of Sex, Age and Rurality. The between-child variance is large in absolute terms and in comparison to its standard error, and is indeed larger than the between-occasion, within-child variance. This substantial difference between children suggests a great deal of similarity within them, and we can readily calculate the degree of similarity for this random-intercepts model:

$\rho = \frac{\sigma^2_{u0}}{\sigma^2_{u0} + \sigma^2_{e0}}$   (28)

Using the Command Interface (accessed through Data Manipulation), we can calculate this to be

calc b1 = / ( )

Some 56% of the variance in Endurance lies at the child level; or equivalently, if we pick any pair of Endurance measures on different occasions for the same child, we have an expectation that they will be correlated with a value of 0.56. Currently, we are not allowing different correlations between different pairs of occasions, as this is a compound symmetry model. At the bottom of the Equations window, Store these results as Model 1: Null.

Elaborating the fixed part of the model

Polynomial terms for Age

We are now going to fit a sequence of growth models and compare them. The next model includes continuous Age as a linear main effect centred around its grand mean.47

Equations window
Click Add term at the bottom of the window
Choose the variable to be Age, choosing Centering on the grand mean (to avoid convergence problems and to give a meaningful interpretation to the estimate associated with the intercept)
Done
More to convergence (use More as the estimates are likely to have changed little)
Store on the bottom tool bar, naming the model 2+age

On convergence the estimates are as follows.

47 If you only have discrete time, such as the calendar year in which the measurement was taken, you are well advised to use orthogonal polynomials, which make for more stable estimation. You are also advised to think about the scale of the time metric so that the associated parameter estimates do not become too large or small. Thus, Verbeke and Molenberghs (2000, 50) specified their time variable as decades rather than years in their analysis of prostate cancer data; when the latter was used, the model estimated (in SAS) failed to converge

Question 4: why does Age-gm have an ij subscript and not a j subscript?

The new term means that Endurance increases on average, by the estimated coefficient in hundreds of metres, as children (Boys and Girls, Urban and Rural) grow older by one year. The average distance covered by a child of average age (11.4 years; use Average under Basic Statistics to get this value) is given by the intercept, in hundreds of metres. Comparing these results with the previous unconditional model shows that the between-child variance has gone up, but only trivially so in comparison to its standard error, while the between-occasion variance has come down substantially in comparison to its standard error. Taking account of their age, children are very similar over occasions, and the degree of dependency in this linear growth model is given by:

calc b1 = / ( )

There is clearly a great deal of dependency over time. Development researchers call this 'tracking', as each child follows their own path in a dependent way. The deviance has also dropped substantially, suggesting a better fit of the model.

We can now see whether this growth is linear or non-linear, beginning with a quadratic model:

Equations window
Click on Age-GM and choose Modify term
In the pop-up window, tick on Polynomial and select Poly degree to be 2
Done
Start to convergence

to get the following estimates.

There is some evidence of a decelerating positive slope, as there is a positive coefficient for the linear Age term and a negative coefficient for Age^2 (although the latter is not greatly bigger than its standard error). Store this model as 3+AgeSq.

Question 5: repeat the above procedure to see if a cubic term is necessary.

Gender effects

The next research question is whether boys are different from girls, so the next model includes Sex as a main effect with Boy as the reference category:

Equations window
Click Add term at the bottom of the window
Choose the variable to be Sex
Choose the Reference category to be Boy, to create a single dummy variable called Girl
Done
More to convergence

to give the following results.

Question 6: why does Girl have a j subscript and not an ij subscript?

The average-aged Boy covers a substantially and significantly greater distance than the average Girl. Store the estimates from this model as 4+Sex.

Gender and Age interactions

We have found that on average boys have greater endurance than girls; we now want to know if their trajectories diverge as they grow older. Consequently, we next try to estimate differential polynomial growth trajectories for Boys and Girls:

Equations window
Click Add term at the bottom of the window
Order 1 so as to be able to specify an interaction between Age and Sex
Choose the top variable to be Age; you will automatically get a 2nd order polynomial
Choose the bottom variable to be Sex; you will automatically get the reference category of Boys
Done
More to convergence

An informal inspection shows that there is a slightly accelerating positive slope for Boys' age (though the quadratic term is not large in comparison to its standard error), while both the differential linear and quadratic slopes for Girls are large in comparison to their standard errors. Thus, the predictive equations are:

Predicted Endurance = β0 + β1*Age-GM + β2*(Age-GM)^2 for Boys
Predicted Endurance = (β0 + β3) + (β1 + β4)*Age-GM + (β2 + β5)*(Age-GM)^2 for Girls, that is
Predicted Endurance = (β0 + β3) + (0.413)*Age-GM + (-0.049)*(Age-GM)^2 for Girls

We can perform a set of Wald tests on the linear and quadratic slopes, the differential Girl intercept, and the differential linear and quadratic slopes for Girls. We are not really interested in whether the average-aged Boy term is different from zero, so there are 5 terms we wish to test. Each test is of the form

$H_0: \beta_m = 0$, the null hypothesis that the population term associated with variable m is zero.

Model on Main menu
Intervals and Tests
Choose Fixed part with 5 functions; this gives the number of tests to be carried out

Placing a 1 in a column means that the associated estimate is involved in the testing; a 0.00 in the row labelled Constant(k) means that we are testing against a null hypothesis of 0. Thus, in the first column, the function result (f) gives the result of (1 * βAge), while f-k gives how far away the result is from the null hypothesis of zero. The chi-square value on the next line gives the result of the test with 1 degree of freedom. At the bottom of the results table is the chi-square value for the joint test that all 5 parameters are zero. Inspecting the results we see that all the individual coefficients have large chi-square values, with the exception of the quadratic slope for Boys. We can calculate the p values associated with the smallest and second smallest chi-squares with 1 degree of freedom by entering the following commands:

cpro

The differential quadratic term for Girls is highly significant

cpro

The quadratic term for Boys is not significant by conventional standards. However, we would normally retain the Boys' quadratic term in the model, as a higher-order term - the differential quadratic term for Girls - is needed, and this is contrasted against the Boys' quadratic term.48 We can also test whether this full model is a significant improvement over the empty model:

cpro

This gives a vanishingly small p value (of the order of e-284) with 5 degrees of freedom. A very small p value is found, but it is known that the log-likelihood test has better properties for model comparison, and we can compare all the models we have fitted and look at the change in the deviance between the first and fifth models.

Model on Main window
Manage stored models
Compare models 1 and 5

(The comparison table lists, for Model 1: Null and Model 5: +poly(age,2)*girl, the estimates and standard errors of the fixed part - Constant, (Age-gm)^1, (Age-gm)^2, Girl, (Age-gm)^1.Girl, (Age-gm)^2.Girl - the random part at the Child and Occasion levels, the -2*loglikelihood, and the number of units at each level.)

48 It is well known that lower-order terms should be included if higher-order ones are in the fixed part of the model, a requirement that Peixoto (1990) calls a well-formulated specification. This applies to both interactions and higher-order polynomials. Morrell et al (1997) have generalised this to random-effects models, so that if we want to include quadratic random time effects we must also include linear random effects and random intercepts. Morrell, C.H., Pearson, J.D. and Brant, L.J. (1997) Linear transformations of linear mixed-effects models, The American Statistician, 51; Peixoto, J.L. (1990) A property of well-formulated polynomial regression models, The American Statistician, 44(1). See also Braumoeller, B.F. (2004) Hypothesis testing and multiplicative interaction terms, International Organization, 58
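MLwiN's CPRO command returns the upper-tail probability of a chi-square statistic; the same values can be obtained in Python with scipy (a sketch with illustrative statistics, since the fitted values are not reproduced above):

```python
from scipy.stats import chi2

# Upper-tail p-value for a single Wald chi-square with 1 degree of freedom
# (the statistic here is illustrative, not taken from the fitted model)
print(chi2.sf(28.6, df=1))

# Likelihood-ratio test: difference in deviance (-2*loglikelihood) between
# nested models, with df equal to the difference in number of parameters
deviance_null, deviance_full, extra_params = 7685.2, 6729.8, 5  # invented
print(chi2.sf(deviance_null - deviance_full, df=extra_params))
```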

The change in the deviance can now be calculated, and we can obtain the associated p value for this difference by treating it as a chi-square with 5 degrees of freedom (the difference in the number of parameters):

calc b1 =
cpro b1

A huge difference and a very small p (of the order of e-200), but the difference is considerably less extreme than that suggested by the more approximate Wald test. The best way of appreciating the nature and size of effects with polynomials and interactions is a plot of predictions and their confidence intervals. The easiest way of doing this is through the Customised predictions facility, which in its setup phase requires a 'wish list' of values for the predictor variables.

Model on Main menu
Customised predictions

In the Setup window we have to consider each of the variables in turn and choose the values that we want predictions of Endurance for; there is no need to consider polynomials and interactions, as those specified in the Equations window are automatically included. Here, we specify Age from 7 to 17 for both Boys and Girls.

Click on Sex
Change Range
Select Mean not Category
Tick on Boy and Girl
Done
Leave Constant at the value 1
Click on Age
Change Range
Select Range
Upper bound 17, Lower 7, Increment 1
Done

Click on Fill Grid [to get the values for the predictor variables you specified]
Click on Predict Grid [to get the predicted values and CI's you specified, here 95%]

Move to the Predictions sub-window, which gives the following results.

The first 3 columns on the left give the variables and the values we have requested, the 'wish list'. The three remaining columns give the mean predicted Endurance with the 95% lower and upper confidence intervals.49 All 5 of these columns are named and stored in the worksheet, and the values can be copied to a table in Word.50

To display the results graphically, in the Predictions sub-window:

Click on Plot Grid
Choose Y axis to be mean.pred [the predicted values]
Choose X axis to be Age.pred [which contains the values 7 to 17]
Choose Grouped by to be Sex.pred [to plot a line for each Boy/Girl group]
Choose Graph Display to be 1 [ie D1; existing graphs will be overwritten]
Tick for 95% confidence intervals, displayed as lines as the X is continuous
Apply

After some re-labelling we get a very clear graph of the results (choosing the Scale for X to be user-defined and set to 7 to 17 with 10 ticks, ticking off Show margin titles, and requesting that the labels for the Group - Boy/Girl - are plotted on the graph).

49 Other confidence interval ranges could have been specified in the Set-up window
50 These predictions are not in fact calculations (which they could be in this case) but simulations; this approach allows great flexibility in non-linear models, and the precision of the simulation can be increased simply by increasing the number of requested simulations. Very importantly, they fully take account of the standard errors of the estimated coefficients, including the covariances

The mean endurance of Boys and Girls increases with age, Boys have greater endurance throughout the 7 to 17 age range, and the difference between Boys and Girls increases with Age. The development for Boys between 7 and 17 looks to be largely linear, while Girls show a positively decelerating curve, with not much development beyond 15 or so years of age.

The turning point for girls can be calculated by using equation (21). The overall equation for Girls is

Predicted Endurance = (0.413)*Age-GM + (-0.049)*(Age-GM)^2

so the turning point on the centred scale is

calc b1 = -0.413 / (2 * (-0.049))

and to this we need to add the average age of 11.4:

calc b1 = b1 + 11.4

Maximum endurance is reached for Girls at around 15.6 years, while Boys continue to develop beyond that age. It is also worth stressing that as this is a panel and not a cross-sectional study, for the same children have been followed over time, the evidence for genuine development of endurance (and not some artefact) is strong.
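The same arithmetic as the two calc commands, as a quick check in Python using the Girls' coefficients quoted above:

```python
# Stationary point of a quadratic growth curve: age* = -b_linear / (2 * b_quad)
b_linear, b_quad = 0.413, -0.049     # Girls' combined slopes from the model
grand_mean_age = 11.4                # approximate grand mean of Age

turning_centred = -b_linear / (2 * b_quad)
print(turning_centred)                    # about 4.2 years above the mean age
print(turning_centred + grand_mean_age)   # about 15.6 years: Girls' peak endurance
```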

Question 7: repeat the above procedure, building on Model 5, to see if a Rural main effect is needed and if there is an interaction between Rurality and Age; choose Urban as the base.

Rural-Gender Interactions

In completing Question 7 you will have found that there is a main effect for Rurality but not a cross-level interaction effect with Age. Rurality has conferred some benefit in endurance that is maintained throughout childhood and adolescence. The next research question is whether this benefit applies equally to boys and girls, the idea being that boys rather than girls may be expected in traditional farming communities to engage in more demanding physical work and therefore have greater endurance. This is also the final elaboration of the fixed part that is possible with the available data. The current model has a main Rural effect and an age-by-sex interaction. To add the Rural by Sex interaction:

Equations window
Click Add term at the bottom of the window
Order 1 so as to be able to specify an interaction between Sex and Rurality
Choose the top variable to be Sex; you will automatically get the reference category of Boys
Choose the bottom variable to be Rural; you will automatically get the reference category of Urban
Done
More to convergence

to get the following results, which you can Store as 8:Sex*Rural.

A boy of average age who lives in a Rural area has an endurance that is some 300 metres greater than his Urban counterpart, while for a girl the rural benefit is some 200 metres less than this. We can use the Wald test to evaluate three null hypotheses:

$H_0: \beta_{Rural} = 0$   the null hypothesis that the population term associated with Rurality is zero; because of the interaction with Sex, this is a test of the difference between Rural and Urban Boys.

$H_0: \beta_{Girl.Rural} = 0$   this is a test of the differential (Rural) difference between Boys and Girls.

$H_0: \beta_{Rural} + \beta_{Girl.Rural} = 0$   this is a test of the difference between Rural and Urban Girls.

We use the Intervals and Tests procedure, specifying 3 tests of the fixed part of the model.

To test the third hypothesis we have included two 1's in the third column to get the function result of (1 * 3.056) + (1 * -1.954), which equals 1.102, that is, the Rural-Urban difference for Girls. All three effects are significant at conventional levels, so that Rural Boys have greater endurance than Urban Boys, Rural Boys have greater endurance than Rural Girls, and Rural Girls have greater endurance than Urban Girls. Again the best way to appreciate the size of the effects is a customised predictions plot, where the wish list consists of the two Sex categories, the 2 Rurality categories and the 11 ages from 7 to 17.

First, a plot that emphasises the Sex differences by grouping on Sex on the same graph but trellising into different columns, with different graphs for the Rural and Urban children.

Second, a plot that emphasises the Rural/Urban differences by grouping by Urban/Rural on the same graph and trellising by Sex.

We already know that the Rural-Urban difference in general (that is, for both sexes) does not change with Age; we could, however, fit a model that allows this differential to change with Age differentially by Sex. This requires us to add a first-order interaction between Rurality and Age (which we had previously removed) and a 2nd-order interaction between Age, Gender and Rurality. Fortunately for parsimony and our sanity, these effects get little empirical support: if we do a likelihood ratio test against the previous model, the difference in deviance with 4 degrees of freedom receives a high, insignificant p value. A word of warning: it is all too easy to try out complex interactions and then justify them post hoc by theoretical musings. This is equivalent to the Texas sharpshooter who first fires at the barn door and then draws the target around where the shots have accidentally concentrated! The most complex model that receives empirical support is Model 8, and that now provides the base for considering elaborations of the random part of the model.

Interpreting and elaborating the random part

The random intercepts model

Staying with the framework of the two-level multilevel model, we can see how the random part has changed after elaborating the fixed part of the model by comparing the original null model with a model that has fixed terms for Age, Age^2, Sex, Age by Sex, Rurality, and Sex by Rurality.

(The comparison table lists, for Model 1: Null and Model 8: +Sex*Rural, the estimates and standard errors of the fixed part - Constant, (Age-gm)^1, (Age-gm)^2, Girl, (Age-gm)^1.Girl, (Age-gm)^2.Girl, Rural, Girl.Rural - the random part at the Child and Occasion levels, and the -2*loglikelihood.)

Both the between-child and the between-occasion variance have come down as the fixed part has been elaborated, and the fixed part elaborations account for some 45% of the original unexplained variation. The majority of the remaining residual variation lies at the child level (the child-level variance is 6.885), so that some 60% of the variation is between children. In this conditional compound symmetry model there is considerable similarity or dependence over occasions, with endurance being auto-correlated with a value of 0.6 over occasions. An informal check of the level-2 residuals can be obtained by plotting a Normal plot of the child-level residuals:

Model on Main Menu
Residuals
Settings
Level 2: Child
Calc
Plots
Standardised Residuals by Normal Scores
Graph to D10
Apply

The residuals plot very much as a straight line, suggesting that the shrunken residuals are not markedly non-Normal and that there are no substantial child outliers.51

Question 8: what would happen to the standard errors of the fixed part if it was assumed there was no dependence?

There are two ways we can characterise the scale of the between-children differences. The first uses the coverage facility in the customised predictions window. The second uses

51 Verbeke and Molenberghs (2000, 83-92) suggest that this procedure should be treated with some scepticism, as shrinkage may result in the estimated residuals being made more Normal when the true random effects are not. They suggest that the only effective way to assess this is to fit more complex models with relaxed distributional assumptions. Verbeke, G. and Lesaffre, E. (1996) A linear mixed effects model with heterogeneity in the random effects population, Journal of the American Statistical Association, 91,

the predictions window to make a plot of the modelled estimates of the children's growth, which have been purged of their occasion-level idiosyncrasies and/or measurement error. We will start with the second method and plot the growth curves for Boys of different ages who live in urban areas. (As the model has the same variance for all types of children, plots for other groups will only show a shift up and down, not greater or lesser variation.)

Model on Main Menu
Predictions
Fixed: click on the terms associated with the Constant and the linear and quadratic Age terms; leave the other terms greyed out
Random level 2: click on the random intercept, u0j
Random level 1: leave the level-1 term greyed out
Output from prediction to c50 [an empty column]
Calc

To obtain a plot of the modelled growth trajectories:

Graphs on Main menu
Customised graphs
Choose c50 for Y on the 'plot what?' tab
Choose Age for X
Choose Plot type to be a Line
Choose Group to be Child
Apply

The scale of the between-child heterogeneity is apparent: even when we have modelled out the Sex and Rurality differences, there are considerable differences between children of the same age. The parallel lines are of course the random-intercepts assumption that the variance does not change with Age.

Turning now to the other approach, we can calculate the coverage interval for the average child:

Model on Main menu
Customised predictions
In the Setup sub-window, change the range of each variable to get the average (mean) values
Tick on Level 2 (Child) Coverage and choose a 95% coverage interval
Fill Grid
Predict
Switch to the Predictions sub-window

to get the following results.

The typical child covers 18.0 hundreds of metres in 12 minutes, and the 95% confidence interval around this average runs from to . In the population, the child with the best 2.5% of all endurance distances will cover 23.32, while the child in the poorest 2.5% will only cover . Once again individual child variability is seen to be substantial; children are very variable in their endurance.

The random slopes model with complex heterogeneity between occasions

All the models so far have assumed a constant, unchanging between-child and within-child variation as the children age. We can allow for more complex heterogeneity by allowing the linear term for Age to have a child differential slope and the level-1 residuals to also depend on linear Age. This will allow the different characteristic patterns that were illustrated in Figures 5 and 6. The modified model therefore has the random part

$u_{0j} + u_{1j}(\text{Age-gm})_{ij}$ at level 2 and $e_{0ij} + e_{1ij}(\text{Age-gm})_{ij}$ at level 1.

This on convergence gives the following.

We can use a Wald test to evaluate the approximate significance of the four new terms in the random part of the model. There is some evidence that the linear part of the variance function is worth investigating at both levels. The positive covariance at the child level suggests that differences between children are growing with age, while the negative covariance at level 1 suggests that volatility around an individual child's growth curve is decreasing with age. 52

Footnote 52: There is nothing untoward about variance terms at level 1 being negative, as long as the overall variance function does not become negative.

We can use the Variance function window to store the variance function at each level, and the Calculate window to calculate the VPC as the ratio of the level 2 variance to the sum of the level 1 and level 2 variances, just as we did in Chapter 5 of Volume 1. We have named some free columns as Lev2Var, Lev1Var and VPC, and Lev1VarCI and Lev2VarCI to hold the 95% confidence intervals. First for level 1,

and then for level 2, and then calculate the VPC. Graphing each function against Age with their confidence intervals (you must use the Offsets capability on the graphs, not the Values option, as the output from the confidence interval on the Variance function is departures from the general line), we get the following graphs:

We see some evidence that the between-occasion volatility is decreasing with Age and that the between-child differences are increasing, so that the VPC, which gives the proportion of the variance at the child level, is also increasing. However, the intervals are quite wide enough to thread a straight line through. Unfortunately it is only possible to calculate the confidence bounds for the VPC in MCMC (see Chapter 12 of this volume: Functions of parameters).

It is also possible to calculate the correlation between any pair of continuous-time observations, such as a 7.5 year old (t_1) and a 17.5 year old (t_2), using the formula given in Snijders and Bosker (1999, 172). As usual the correlation is the covariance divided by the product of the square roots of the variances, so for times t_1 and t_2 the correlation is

$\rho(t_1, t_2) = \dfrac{\text{cov}(t_1, t_2)}{\sqrt{\text{var}(t_1)\,\text{var}(t_2)}}$   (29)

The covariance is given by

$\text{cov}(t_1, t_2) = \sigma^2_{u0} + \sigma_{u01}(t_1 + t_2) + \sigma^2_{u1}\, t_1 t_2$   (30)

and the total variance at time t is given by the sum of the level 1 and level 2 quadratic variances, so for t_1 it is

$\text{var}(t_1) = \sigma^2_{u0} + 2\sigma_{u01} t_1 + \sigma^2_{u1} t_1^2 + \sigma^2_{e0} + 2\sigma_{e01} t_1 + \sigma^2_{e1} t_1^2$   (31)

It must of course be remembered that in the formulation of the random slopes model, Age, which is our t variable, has been centred around its grand mean of 11.439, so that a 7.5 year old equals -3.939, while the equivalent value for a 17.5 year old is 6.061. The covariance is therefore given by substituting -3.939 and 6.061 into (30), and the total variance at each of these two ages by substituting them into (31), with the estimated child-level covariance of 0.127 and the level-1 quadratic term of -0.018 entering the calculations; the correlation is then the covariance divided by the square root of the product of the two variances. These results again suggest that there is quite a high dependency even in endurance measurements that are 10 years apart. There is thus strong evidence of tracking: individual children tend to follow a distinctive course. We will in a later section consider more complex dependencies over time.

Question 9: make a plot of the modelled growth trajectories for Urban Boys using the above random slopes model; what does it show?
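As a check on this arithmetic, the correlation in (29)-(31) can be computed directly; a minimal Python sketch, where the variance and covariance values other than 0.127 and -0.018 are hypothetical stand-ins for the estimates from the random slopes model:

import numpy as np

# Hypothetical random-part estimates (sigma_u01 = 0.127 and the level-1 quadratic
# term -0.018 are from the text; the rest are illustrative stand-ins).
s2_u0, s_u01, s2_u1 = 6.9, 0.127, 0.05     # child level: var(int), cov(int, slope), var(slope)
s2_e0, s_e01, s2_e1 = 4.6, -0.10, -0.018   # occasion level (negative terms are permissible)

def cov(t1, t2):
    # Equation (30): child-level covariance between measurements at times t1, t2
    return s2_u0 + s_u01 * (t1 + t2) + s2_u1 * t1 * t2

def var(t):
    # Equation (31): total (level 1 + level 2) quadratic variance at time t
    return (s2_u0 + 2 * s_u01 * t + s2_u1 * t**2 +
            s2_e0 + 2 * s_e01 * t + s2_e1 * t**2)

t1, t2 = 7.5 - 11.439, 17.5 - 11.439          # ages centred on the grand mean
rho = cov(t1, t2) / np.sqrt(var(t1) * var(t2))   # equation (29)
print(f"correlation between ages 7.5 and 17.5: {rho:.2f}")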

Three level model: between schools variation

It is straightforward to extend our two-level model to a three-level one, with Schools as the higher, third level.

Equations on Main menu
Click on Endur_ij
N levels: choose 3-ijk
Lev3(k): choose Schools
Done
Click on Constant
Tick on k(Schools)
Done

This gives a random intercepts term at level 3, so that the specified model is as follows. The hierarchy viewer shows that there are 10 children in each of the 30 schools that have been sampled. After convergence the model estimates are as follows.

The change in the deviance is from to , a sizeable improvement for a single extra term. Chapter 9 of Volume 1 warns us that in comparing two models that differ in their random parts we should be using the REML/RIGLS estimates and not the current FIML/IGLS ones. Using the Estimation Control button on the Main menu, change to RIGLS and re-estimate the models. The change in the deviance from including School effects is then from to , so that there is only a small departure from the maximum likelihood IGLS results. We can calculate the associated p value, and because we are testing a variance that cannot take values of less than zero (the null hypothesis) we can follow the advice of Chapter 6 and halve the resultant p value: 53

calc b1 =            the change in deviance
cpro b1 1 b2         giving a p value of e-005
calc b2 = b2/2       giving a halved p value of e-006

Footnote 53: The logic is that with replicated data sets generated under the null hypothesis we would get positive values half of the time and negative values the other half, but these would be estimated as positive half of the time and zero the other half. Consequently, with random intercepts the sampling distribution is made up of a 50:50 mixture of a chi-square distribution with 1 df and a spike at zero; the correct p value can be obtained by halving the incorrect p value based on chi-square with 1 degree of freedom.

If we believe in such a frequentist approach to probability, there is evidence of significant unexplained school effects. However, this depends on treating 30 schools as sufficient to get reliable results, while some software (such as lme in R) does not even give the standard error for the variance of higher-level random terms, as the distribution of the estimator may be quite asymmetric. Moreover, while the REML/RIGLS estimates are an improvement in taking account of the uncertainty of the fixed part when estimating the random part, the MCMC estimates are needed to take account of uncertainty in all parameters simultaneously, and they also allow for asymmetric credible intervals. The two-level model was run in REML and in MCMC, and the three-level model was also run with both types of estimation. Here are the comparisons when a burn-in of 500 has been used with a monitoring run of 50k simulations. The two-level results are as follows.

Question 10: what do you conclude from the results given below; has the monitoring chain been run for a sufficiently long length; are there differences between the results obtained from the two types of estimation?
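The halving logic of footnote 53 is easy to reproduce outside MLwiN; a minimal Python sketch, where the deviance change of 9.5 is a hypothetical value standing in for the one computed above:

from scipy.stats import chi2

# Likelihood-ratio test for a single variance component. Because a variance
# cannot be negative, the null sampling distribution is a 50:50 mixture of a
# chi-square(1) and a point mass at zero, so the naive p value is halved.
lr_change = 9.5                      # hypothetical change in deviance (-2*loglik)
p_naive = chi2.sf(lr_change, df=1)   # incorrect p value from chi-square(1)
p_mixture = p_naive / 2              # correct p value under the 50:50 mixture
print(f"naive p = {p_naive:.2e}, halved p = {p_mixture:.2e}")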

[Table: two-level model, RIGLS versus MCMC (50k monitoring run). Columns: RIGLS estimate and S.E.; MCMC estimate, S.E., Median, CI(2.5%), CI(97.5%) and ESS. Rows: the fixed part (Constant, (Age-gm), (Age-gm)^2, Girl, (Age-gm).Girl, (Age-gm)^2.Girl, Rural, Girl.Rural); the random part at the Child level (Cons/Cons, (Age-gm)/Cons, (Age-gm)/(Age-gm)) and at the Occasion level (Cons/Cons, (Age-gm)/Cons, (Age-gm)/(Age-gm)); Deviance and DIC.]

The results for the 3-level model (again with 50k simulations) are as follows.

[Table: three-level model, RIGLS versus MCMC, with the same columns and rows as above, plus the Schools-level variance (Constant/Constant).]

Question 11: what do you conclude from these results; has the monitoring chain been run for a sufficiently long length; are there differences between the results obtained from the two types of estimation? Is the three-level model an improvement on the two-level model?

There appears to be some evidence that there are differences between schools, and we can examine the caterpillar plot to characterise the size and nature of these differences.

Model on main menu
Residuals
In Settings sub-window
Start output at column c300
SD (comparative) of residual to c301 [to compare pairs of schools]
Level 3: Schools
Calc
In Plots sub-window
Choose residual +/- 1.96 sd x rank
Apply

After modifying the title and changing the scale on the horizontal axis, we obtain the following plot. The extreme schools add some 150 metres (Schools 15 and 19) and subtract some 210 metres (Schools 20 and 6) from the overall average. These estimates are derived formulaically from the

between-school variance; we could also have stored the residuals during the MCMC estimation and examined their distribution. 54

Footnote 54: This can only be done before the start of the MCMC procedure: Model on main menu; MCMC; Store residuals. The resultant long vector will have to be unsplit to get the estimates for each school; see Chapter .

We can also calculate the proportion of the variance that is due to differences between schools. We do this using the MCMC estimates, as they have better properties, and do so for the average-aged child (to avoid the complication of the random slopes at level 1 and level 2):

$\text{VPC}_{\text{school}} = \dfrac{\sigma^2_{v0}}{\sigma^2_{v0} + \sigma^2_{u0} + \sigma^2_{e0}}$   (32)

Calc b1 = 1.339/( )

Some 11 per cent of the variation lies at the school level; alternatively, randomly selected pairs of pupils from the same school will be correlated 0.11. Unfortunately, we do not have any variables measured at the school level (e.g. the physical exercise regime), so we are unable to proceed to try and account for this variation.

A digression on orthogonal polynomials

We break off from our sequence of developing the model at this point for pedagogic reasons. The models we have fitted so far have used continuous time in the form of age. It is not uncommon, however, to only have discrete time, such as the calendar year, the period in which the measurement was taken. In effect the measurement is ordinal, with 5 being later than 4, which is later than 3, and so on. When this is the case there are advantages in using a transform of this discrete time, here the Occasion variable, in both the fixed and the random part of the model; this transform is known as an orthogonal polynomial. The transformation can be linear, quadratic, cubic, etc., thereby allowing a straight-line relationship, a single curve, a curve with two bends, and so on. The resultant sets of new variables have a mean of zero and are orthogonal to each other, which makes for much more stable estimation. Due to these properties they are highly recommended for the analysis of trends in repeated measures models. 55

Footnote 55: Orthogonal polynomials are discussed in more detail in the MLwiN User manual supplement and in Hedeker, D. and Gibbons, R.D. (2006) Longitudinal Data Analysis. New York: Wiley.

Here we will pretend that we do not have data in continuous time, and delete the Age terms and all interactions involving Age to get the following model (you will have to use RIGLS estimation and not MCMC to re-specify the model).

We will also clear the current model estimates before including the orthogonal polynomials.

Model on main menu
Manage stored models
Clear all

Then we will introduce the 1st order polynomial of Occasion:

Model on main menu
Equations window
Click Add term at the bottom of the window
Choose variable to be Occasion
Tick on Orthogonal polynomial (this option is only permitted if the variable, as here, is categorical)
Choose degree to be 1 for a linear relation between endurance and occasion
Done
Start to convergence
Store on bottom toolbar, naming the model Orth1

For speed in this exploratory phase we will use the RIGLS estimates and not the MCMC ones. We now modify the equations to the 2nd order, and then the 3rd and 4th order polynomials, storing the estimates as we go. The 4th order polynomial is the most complicated form that we can fit to these data with their 5 occasions; it is equivalent to fitting a separate dummy variable for each occasion contrasted against a base category. We hope that the model can be made more parsimonious than this, that is, we hope to find insignificant higher-order polynomials. The orthogonal polynomial readily permits such flexibility, whereas a dummy-variables approach for each and every occasion does not.

Equations window
Click on orthog_occ^1 and choose to Modify term

Choose degree to be 2 for a quadratic relation between endurance and occasion
Done
More to convergence
Store on bottom toolbar, naming the model Orth2

Before proceeding to look at the estimates, it is worth examining the orthogonal polynomials themselves. We can first view them in relation to the untransformed Occasion variable, and then calculate their means, SDs and correlations.

[Table: the means and SDs of orthog_occ^1 to orthog_occ^4, and the correlations between each pair of these variables.]

Question 12: what are the (approximate) characteristics of these orthogonal polynomials?
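The construction can be mimicked to see what these variables look like; a minimal Python sketch, using a QR decomposition of the powers of occasion (one standard way to build orthogonal polynomial contrasts, and an assumption about the exact scaling MLwiN uses):

import numpy as np

# Build orthogonal polynomial contrasts for 5 equally spaced occasions by
# orthogonalising the columns 1, t, t^2, t^3, t^4 with a QR decomposition.
occasion = np.arange(1, 6, dtype=float)
V = np.vander(occasion, 5, increasing=True)   # columns: 1, t, t^2, t^3, t^4
Q, _ = np.linalg.qr(V)
orth = Q[:, 1:]                               # drop the constant column

print(np.round(orth.mean(axis=0), 10))        # means: all zero
print(np.round(orth.T @ orth, 10))            # cross-products: identity (orthonormal)

Each column has mean zero and the columns are mutually orthogonal; MLwiN may scale them differently, but the pattern is the same.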

Turning now to the 4 sets of estimates, we can compare the models.

[Table: fixed part estimates (Constant, Girl, Rural, Girl.Rural, orthog_occ^1 to orthog_occ^4) and random part variances (Schools, Child and Occasion Constant/Constant), with standard errors and the -2*loglikelihood, for models Orth1 to Orth4.]

Comparing the deviance of the models:

[Table: comparisons of the 1st and 2nd, 2nd and 3rd, and 3rd and 4th order models, giving the two deviances, the difference in deviance, the difference in df, and the chi-square p value.]

We see that by conventional standards the 2nd and 3rd order polynomials are needed, but not the fourth order. Looking at the specific estimates of the 4th order model (3.925, , 0.303, and 0.169), and remembering that the estimates are directly comparable (the orthogonal variables have approximately the same standard deviation), we can see that the linear term dominates. The easiest way to interpret the estimates is as a plot of the fixed part predictions. Here are the results for the 3rd order polynomial for urban boys plotted against Year. The strong underlying linear increase is very evident.

The model can be developed in the same manner as for continuous time, to include interactions with Sex and Rurality and random slopes at each level.

Elaborating the random part of the model: accommodating temporal dependence

Compound symmetry

The majority of our elaborations of the continuous-time model have concerned the fixed part of the model, and with the exception of the random slopes model we have been content to estimate a compound symmetry model in which there is only one variance at each level; this implies that the correlation between Endurance measurements for a child is constant, unchanging with how far apart in time the measurements were taken. We now turn to the random part and fit a number of different structures. As is common, we do not have a strong theory specifying the exact form of the residual dependency over time, so we will fit a series of models of differing parsimony to assess empirically an appropriate form. We are going to operate without a term for between-school differences and without any random slopes, as we want to concentrate on the nature of the dependency over occasions. If we return to Model 8, with the complex fixed part involving the 2nd order polynomial of time and the Sex and Rurality interactions, we get the following results using RIGLS. We could of course compare these models with a random slopes model to see which is the better empirical fit.

This random-intercepts model has a compound symmetry correlation structure, so that every pair of occasions has the same degree of correlation. We have stored the estimates of this model as CS. Using the Command Interface we can calculate the correlation to be

calc b1 = 6.987/( )

and we can use this information to begin building a table of the degree of correlation for each pair of occasions, identifying the different lags. In this model, the same correlation is imposed at each and every lag.

Pair of Occasions   Lag   CS
1 and 2             1
2 and 3             1
3 and 4             1
4 and 5             1
1 and 3             2
2 and 4             2
3 and 5             2
1 and 4             3
2 and 5             3
1 and 5             4
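The structure this imposes is easy to visualise; a minimal Python sketch of a compound symmetry correlation matrix, using a hypothetical level-1 variance alongside the 6.987 quoted above:

import numpy as np

# Compound symmetry: covariance sigma2_u between every pair of occasions,
# total variance sigma2_u + sigma2_e on the diagonal.
sigma2_u, sigma2_e = 6.987, 4.6   # 6.987 from the text; 4.6 is a hypothetical level-1 value
n_occ = 5

cov = np.full((n_occ, n_occ), sigma2_u) + sigma2_e * np.eye(n_occ)
corr = cov / (sigma2_u + sigma2_e)
print(np.round(corr, 2))          # identical off-diagonal correlation at every lag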

Unstructured: the multivariate model

The next model to be fitted is the least parsimonious model that is possible for these data, an unstructured model. Its distinguishing feature is that a separate variance is allowed for each occasion, and a separate covariance (and therefore correlation) between each and every pair of occasions.

Equations window
Click Add term
Choose the Variable to be Occasion and select None as the reference category
Click on each of the five created dummies in turn, tick off the fixed parameter, and tick on the j(Child) differential random term
Click on the Constant and tick off the differential for j(Child), to avoid multicollinearity
Done
Start to convergence
Store on bottom toolbar, naming the model UN

After convergence the estimates are as follows. There are now estimates for each of the five occasions at level 2, and there is potentially a different covariance between each pair of occasions. The variance at level 1 is zero. This is to be expected: as we are estimating a differential for each child at each occasion at level 2, there is no unexplained variation left at level 1 in this saturated model. 56

We can use the Estimate Tables window at level 2 (accessed through Model on the Main menu) to examine the conditional correlations of the children's Endurance at different occasions over time, and update our table.

Footnote 56: We are in fact fitting a multivariate multilevel model where the response is the stacked endurance measurements and level 1 exists solely to define the multivariate structure (Goldstein, H., 2011, Multilevel Statistical Models, 4th edition, Chapter 6: Multivariate multilevel data, Arnold, London).

[Table: the correlations for each pair of occasions and lag, now with both the CS and the unstructured (UN) estimates.]

We can then plot both sets of estimates against the lag. There is roughly the same degree of correlation of about 0.6 between any pair of occasions, and the only barely discernible pattern is that the correlations between occasions after the first are a little larger. The unstructured formulation has the advantage of uncovering the nature and degree of dependence of the Endurance measurements over time, conditional on the fixed part variables. Indeed, it would be possible to discern quite complex regime changes, such as before and after puberty, but there are also real dangers of over-fitting, in that the covariance matches the observed pattern exactly but may not be the true pattern. These are the estimates for the compound symmetry and unstructured models.

[Table: fixed part estimates (Constant, (Age-gm)^1, (Age-gm)^2, Girl, (Age-gm)^1.Girl, (Age-gm)^2.Girl, Rural, Girl.Rural), with standard errors, for the CS and UN models. For CS the between-child part is a single Constant/Constant variance plus a between-occasion variance; for UN it is the full 5 x 5 covariance matrix of the occasion dummies (Occ1r/Occ1r to Occ5r/Occ5r) with a zero between-occasion variance. The -2*loglikelihood is given for each model.]

Question 13: what are the differences between the fixed estimates and their standard errors in the two models? Is the more complicated unstructured model a significantly better fit to the data?

Toeplitz as a constrained model

Substantively, we would have completed our analysis at this point, as the sensitivity analysis using the unstructured model has found no distinctive patterns of dependency that require modelling beyond the compound symmetry dependency (see the answers to Question 13). However, in the interests of pedagogy we are going to continue modelling, as in other data sets you will often find a distinctive form of dependency, such as that portrayed in the figure below, in which the dependency reduces with the lag, so that occasions that are further apart are less correlated. Consequently we fit a further set of models.

The next model to be fitted has the Toeplitz structure; this is more parsimonious than the unstructured covariance (with its 15 separate random parameters) but less parsimonious than the compound symmetry model (with its 2 parameters). The distinctive feature of the Toeplitz is that the same correlation is imposed when pairs of occasions are separated by the same lag. 57 This gives a banded structure in which pairs 1 and 2, 2 and 3, 3 and 4, and 4 and 5 are constrained to have the same correlation, as they are lag 1 apart, while pairs 1 and 3, 2 and 4, and 3 and 5 are constrained to have the same, but potentially different, correlation, reflecting that they are 2 lags apart. It is clear, therefore, that the Toeplitz structure is a constrained version of the unstructured model, and this permits one form of estimation that can be used in MLwiN. 58

Footnote 57: Named after Otto Toeplitz (1881-1940); a Toeplitz matrix has the structure such that each descending diagonal from left to right is constant.

Footnote 58: In MLwiN it is not possible to fit a heterogeneous Toeplitz by un-constraining the variances when RCON has been used. While this would result in banded covariances, it would not result in banded correlations, as the latter are calculated in relation to the respective variances. MLwiN does not have specific parameters for correlations, only for covariances.

The number and nature of the constraints can be appreciated from the following table, where the letters (A, B, etc.) signify the estimates that must be the same for the Toeplitz assumptions to apply.

       Occ1  Occ2  Occ3  Occ4  Occ5
Occ1   A
Occ2   B     A1
Occ3   C     B1    A2
Occ4   D     C1    B2    A3
Occ5   E     D1    C2    B3    A4

Consequently, the constraints for a homogenous Toeplitz model are:

Variances: 4 constraints, as the later variances have to be constrained to equal the variance of occasion 1 (A);
Lag 1: 3 constraints, as the 3 other covariances one lag apart are constrained to equal the covariance of occasions 1 and 2 (B);
Lag 2: 2 constraints, as the 2 other covariances two lags apart are constrained to equal the covariance of occasions 1 and 3 (C);
Lag 3: 1 constraint, as the 1 other covariance three lags apart is constrained to equal the covariance of occasions 1 and 4 (D).

This gives a total of 10 constraints.

Model on Main menu
Constrain parameters
Choose random part
Number of constraints to be 10
Put a 1 for the parameter to be involved in the constraint, a -1 for the parameter involved as a difference, and a 0 as the value the difference must equal; the first four constraints involve the variances; the next six constrain the covariances such that the differences are zero 59
Store constraint matrix for random parameters in c200
Attach random constraints
More to convergence
Store model as Toep

Footnote 59: The process is akin to the Intervals and tests procedure: instead of testing against a difference of zero, you are constraining a difference to be zero.
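The banded pattern these constraints impose can be previewed directly; a minimal Python sketch using scipy's toeplitz constructor, with hypothetical values for the common variance and the lag covariances:

import numpy as np
from scipy.linalg import toeplitz

# Homogeneous Toeplitz: one variance (A) and one covariance per lag (B, C, D, E).
A, B, C, D, E = 11.6, 7.0, 6.8, 6.5, 6.3   # hypothetical estimates
cov = toeplitz([A, B, C, D, E])            # each descending diagonal is constant

corr = cov / A                              # common variance, so correlations are banded too
print(np.round(corr, 2))                    # the same correlation at each lag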

The results show the same variance at every occasion and the banded form of the covariances, as do the correlations, which we include in our table with the other estimates.

[Table: the correlations for each pair of occasions and lag for the CS, UN and Toeplitz models; the Toeplitz column has a single value at each lag.]

Question 14: is the Toeplitz model a significantly better fit than the compound symmetry model? What are the implications of this result?

Toeplitz with SETD

It is useful at this point to estimate the Toeplitz model in another way: starting from the compound symmetry model of Model 8 and including the additional covariances by imposing a structured design matrix, based on the lags, on the random part of the model in addition to the existing variances. 60 We first have to create the lag structure

$\omega_{(i_1,i_2)j} = |t_{i_1 j} - t_{i_2 j}|$   (33)

where t_ij is the occasion of the i-th measurement for the j-th person. These values form a symmetric matrix of dimension 5 (the 5 occasions) for each and every child j, which has zeroes on the main diagonal, while the off-diagonal terms are the time differences, which here are simply lags.

Footnote 60: It may be easier to appreciate the nature of the SETD function if we look at a multilevel model in its mixed-model formulation: Y = Xβ + ZU + E, where Y is the vector of responses, β is the vector of unknown fixed part terms, X is the matrix of predictor variables, U is a vector of unknown random effects and Z is the specified design matrix for the random effects. In a single-level model this design matrix is an identity matrix with 1s on the main diagonal (to obtain the parameter σ²_e0) and zeros elsewhere, which means that there is no covariance between units. It is this design matrix that is specified in a very flexible way when the SETD command is used. Here, to achieve a Toeplitz structure, four sets of block-diagonal structures, reflecting lags of 1, 2, 3 and 4, are imposed.

After opening the Model 8 worksheet, this matrix is created via the Command Interface using the SUBS command, which produces, in long form, a stacked half-symmetric matrix. The relevant form of the command here is

Subs Child -1 Occasion Occasion c210

where Child defines the level; -1 results in the elements on the major diagonal of each child's matrix being set to zero; the values in the first Occasion are subtracted from the second Occasion to give the lag differences; and the resultant stacked matrix is stored in c210, for use as a design vector for the random parameters. To view the stacked lower-triangular matrix for each child, give the command

Mview Child c210

and in the Output window you will get a long listing of a matrix for each and every child. It is worth looking at these; the extract below gives the last three.

BLOCK ID : 298   (Child 298, who was only measured on 3 occasions)

BLOCK ID : 299   (Child 299, who was measured on all 5 occasions)
BLOCK ID : 300   (Child 300, who was only measured on 4 occasions)

It is clear that this block-diagonal matrix gives the lag indicator for each observed occasion for each child, so that child 300 was not measured on occasion 4 (there is no lag that is 3 away from occasion 1). Staying in the Command Interface, we can create four sets of block-diagonal matrices, one for each lag, with a 1 signifying that the observation is involved in the variance-covariance term and 0 otherwise, view them, and then apply them.

Name c211 LagOne c212 LagTwo c213 LagThree c214 LagFour

Change 2 c210 0 LagOne        create the lag-1 indicator
Change 3 LagOne 0 LagOne
Change 4 LagOne 0 LagOne
Mview Child LagOne            check it

Here is the matrix for Child 299, with a complete record; we see that occasions only 1 lag apart have a 1.

BLOCK ID : 299

Change 1 c210 0 LagTwo        create the lag-2 indicator
Change 2 LagTwo 1 LagTwo
Change 3 LagTwo 0 LagTwo
Change 4 LagTwo 0 LagTwo
Mview Child LagTwo            check it

Here is the matrix for Child 299; we see that only occasions 2 lags apart have a 1.

BLOCK ID : 299

Change 1 c210 0 LagThree      create the lag-3 indicator
Change 2 LagThree 0 LagThree
Change 3 LagThree 1 LagThree
Change 4 LagThree 0 LagThree
Mview Child LagThree          check it

Change 1 c210 0 LagFour       create the lag-4 indicator
Change 2 LagFour 0 LagFour
Change 3 LagFour 0 LagFour
Change 4 LagFour 1 LagFour
Mview Child LagFour           check it

Here is the matrix for Child 299 for lag 4; we see that only occasions 1 and 5, which are 4 lags apart, have a 1.

BLOCK ID : 299

We now have to impose the created design matrices at the child level: 61

Setd 2 LagOne                 impose the design matrices
Setd 2 LagTwo
Setd 2 LagThree
Setd 2 LagFour
Start to convergence
Store the estimated model as ToepSD

Footnote 61: If there are any missing values for the response variable, you will get an error that the design matrix is not of the correct size; this is why we used the Listwise deletion command earlier. Notice that the dependency required is between occasions, that is at level 1, but the design matrices are imposed at level 2, for this is how we impose similarity within units in MLwiN. To remove a design structure, use a command such as CLRDesign 2 c .

To get the estimates of the new terms (they are not shown in the Stored models nor in the

Equations graphical interface), we have to give the command Rand in the Command Interface, which will display the following in the Output window:

LEV.  PARAMETER (NCONV)        ESTIMATE   S. ERROR(U)   PREV. ESTIM
 2    Constant /Constant (2)
 2    LagOne * (0)
 2    LagTwo * (1)
 2    LagThree * (3)
 2    LagFour * (0)
 1    Constant /Constant (1)

The term associated with the Constant at level 2 gives the variance between children, while the term associated with the Constant at level 1 gives the variance within children, between occasions. The four new terms associated with the Lag variables give the covariance for each lag. Consequently, the degree of correlation can be calculated as

$\text{Corr}(\text{Lag}_m) = \dfrac{\sigma^2_{u0} + \text{Lag}_m}{\sigma^2_{u0} + \sigma^2_{e0}}$   (34)

To derive the correlations at lags 1, 2, 3 and 4, use the Command Interface:

calc b1 = ( ) / ( )       Lag one
calc b1 = ( ) / ( )       Lags two and three
calc b1 = ( ) / ( )       Lag four

and update the table of results for dependency with these values.

[Table: the correlations for each pair of occasions and lag for the CS, UN, constrained Toeplitz and Toeplitz SETD models.]
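Equation (34) just re-assembles the pieces the design matrices contribute; a minimal Python sketch, with hypothetical values standing in for the variances and the lag covariance terms:

# Correlation by lag in the SETD version of the Toeplitz model, equation (34):
# covariance at lag m = between-child variance + the lag-m design term;
# total variance = between-child + within-child variance.
sigma2_u0, sigma2_e0 = 7.0, 4.6                   # hypothetical level 2 and level 1 variances
lag_cov = {1: 0.05, 2: -0.15, 3: -0.45, 4: 0.0}   # hypothetical Lag1 to Lag4 estimates

for m, lam in lag_cov.items():
    corr = (sigma2_u0 + lam) / (sigma2_u0 + sigma2_e0)
    print(f"lag {m}: correlation = {corr:.3f}")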

On completing the summary table, it is clear that exactly the same results have been found for the Toeplitz model by imposing constraints on the unstructured model, thereby reducing the number of random parameters from 15 to 5, as by adding 3 additional terms to the compound symmetry model, increasing the number of parameters from 2 to 5. 62 If you compare the deviances of the two versions of the Toeplitz model you will also see that the models are indistinguishable, both having the value . Consequently, neither version is a significant improvement over the simpler compound symmetry form.

Footnote 62: Not all four terms can be estimated at level 2, as there is a linear dependency in the terms and one of them must be estimated to be zero; it was included here for pedagogical purposes. The total number of estimable random terms in the Toeplitz model equals the number of occasions, in this case 5. Consequently one of the estimates is a structural zero and another is an estimated zero.

Autoregressive weights model using lags

The next model, the autoregressive weights model, has one more parameter than the compound symmetry model. The dependency that we want to impose in this model is such that the covariance decreases as the time distance, here the lag, between measurements increases. We are again going to do this by imposing a structured design matrix on the random part of the model, in addition to the existing compound symmetric structure of Model 8. In a two-level model of repeated measures within individuals, such a time-dependent structure can be defined as follows:

$\text{cov}(e_{i_1 j}, e_{i_2 j}) = \alpha \, \dfrac{1}{|t_{i_1 j} - t_{i_2 j}|}$   (35)

The covariance for person j at two occasions i_1 and i_2 depends on a distance-decay or inverse weight function, where α, the autoregressive parameter, is to be estimated and t_ij is the time of the i-th measurement for the j-th child. 63

Footnote 63: This distinction between occasion and time, although not needed here, permits a very general specification where time could be continuous, so that for some children the second occasion is 1.1 years later while for others it is 1.9 years later.

Returning to the specification of Model 8, we first have to create the time-difference structure 64

$\omega_{(i_1,i_2)j} = |t_{i_1 j} - t_{i_2 j}|$   (36)

Footnote 64: We could also start with ToepSD and then clear the design matrix completely with the commands CLRD 2 LagOne; CLRD 2 LagTwo; CLRD 2 LagThree; CLRD 2 LagFour.

These values form a symmetric matrix of dimension 5 (the 5 occasions) for each and every child j, which has zeroes on the main diagonal, while the off-diagonal terms are the time differences, which here are simply lags. As before, this is created in the Command Interface using the

command SUBS, which produces, in long form, a stacked half-symmetric matrix. As before, the relevant form of the command here is

Subs Child -1 Occasion Occasion c210

where Child defines the level of the parameter; -1 results in the elements on the major diagonal of each child's matrix being set to zero; the values in the first Occasion are subtracted from the second Occasion to give the time differences; and the resultant stacked matrix is stored in c210. Staying in the Command Interface, we now need to calculate the inverse weights

$1 / |t_{i_1 j} - t_{i_2 j}|$

We have to change the zeros to -1 to avoid a zero divide before we calculate the weights, and then set the diagonal back to zero once the weights have been calculated.

Change 0 c210 -1 c210
calc c220 = 1/c210
change -1 c220 0 c220
Mview Child c220

The Output window will show the inverse weights, which are now stored in c220; here are the values for Child 299, who was measured on all 5 occasions.

BLOCK ID : 299

Clearly, occasions that are further apart have a smaller weight. The single set of weights is now imposed on the model as a design matrix which structures the random part of the model. Again in the Command Interface:

SetDesign 2 c220
Start to convergence
Store the estimated model as Auto

You will notice that the estimate for the α parameter is not included in the comparison table, and it is not included in the Equations window. To see the estimate we have to issue the following in the Command Interface:

Rand

which will display the random parameter estimates in the Output window:

LEV.  PARAMETER (NCONV)        ESTIMATE   S. ERROR(U)   PREV. ESTIM
 2    Constant /Constant (1)
 2    c220 * (0)
 1    Constant /Constant (1)

The estimate for α is . To appreciate the meaning of this value we can generate a small data set of the lags and then calculate the correlation structure. In the Command Interface:

gene 1 4 c100             generate the lags 1 to 4
calc c101 = (1/c100)      calculate the weights
calc c102 = ( * c101)     covariance as a function of α and the inverse weights
calc c103 = c102 +        add in the compound symmetry element
calc c104 = c103/( )      calculate the correlation elements
print c101-c104           print out the results

[Output: c101 to c104 for the four lags.]

We can see that the autoregressive weights model based on lags has imposed a slightly declining dependency with lag. Updating our table, we get the following results.

[Table: the correlations for each pair of occasions and lag for the CS, UN, constrained Toeplitz, Toeplitz SETD and autoregressive (lag) models.]
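The same lag-by-lag calculation is a few lines of Python; a minimal sketch with hypothetical values for α and the two variances:

import numpy as np

# Autoregressive weights model: covariance at lag m = sigma2_u + alpha * (1/m),
# so the implied correlation declines (slowly, for small alpha) with the lag.
alpha = 0.4                       # hypothetical autoregressive parameter
sigma2_u, sigma2_e = 6.9, 4.6     # hypothetical level 2 and level 1 variances

lags = np.arange(1, 5)
corr = (sigma2_u + alpha / lags) / (sigma2_u + sigma2_e)
print(np.round(corr, 3))          # slightly declining dependency with lag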

If we compare the deviances of the compound symmetry and the autoregressive weights models, there is a very small difference which, with 1 degree of freedom (due to the additional parameter), is associated with a highly insignificant p value. Again we have no reason to reject the simpler CS model.

This autoregressive weights model can be modified in a number of ways. 65 Thus we could use the inverse of the squared lags to get a steeper decline in the dependency (the unstructured model may have suggested this), or we could use, instead of the lags, the distances apart in continuous time. We will illustrate the latter.

Footnote 65: In many ways the autoregressive weights procedure is a more flexible procedure than AR(1) models, where the lag-2 correlation is the lag-1 correlation squared and the lag-3 correlation is the lag-1 correlation cubed. There are MLwiN macros available for fitting AR(1) models as part of a very general specification, but they have not been updated to work with version 2.1 as they have been found to be rather unstable and sensitive to the declared and required starting value for the autoregressive parameter. The macros can still be found at ; they need updating, not least because the random parameter estimates are no longer stored in c96 but in c .

Starting again with Model 8, we first get the time distances apart between measurement occasions:

Subs Child -1 Age Age c210
Mview Child c210

Here are the time intervals for Child 299, who was measured on all 5 occasions.

BLOCK ID : 299

So occasion 1 and occasion 5 were 7.6 years apart. We again have to change the zeros to -1 to avoid a zero divide before we calculate the weights, and then set the diagonal back to zero once the weights have been calculated.

Change 0 c210 -1 c210
calc c220 = 1/c210
change -1 c220 0 c220
Mview Child c220

The weights for Child 299 are then

BLOCK ID : 299

These are imposed on the model as a design matrix, and the following commands given:

SetDesign 2 c220                        in the Command window
Start to convergence                    in the Main window
Store the estimated model as AutoCont   in the Equations window
Rand                                    in the Command window

to obtain the results:

LEV.  PARAMETER (NCONV)        ESTIMATE   S. ERROR(U)   PREV. ESTIM
 2    Constant /Constant (1)
 2    c220 * (0)
 1    Constant /Constant (1)

Again we need to generate a short set of weights to see how the dependency changes with the distance between observations; here we will use 1 to 9 years apart:

gene 1 9 c100             generate the time intervals
calc c101 = (1/c100)      calculate the weights
calc c102 = ( * c101)     covariance as a function of α and the inverse weights
calc c103 = c102 +        add in the compound symmetry element
calc c104 = c103/( )      calculate the correlation elements
print c100-c104           print out the results

[Output: c100 to c104 for intervals of 1 to 9 years.]

The degree of dependency hardly changes at all, and it is no surprise that the continuous-time autoregressive weights model, with its extra parameter, is not significantly different from the compound symmetry model.

MCMC estimation of complex dependency

The models in this section have so far all been estimated by IGLS/RIGLS procedures, because the MCMC implementation does not currently recognise any user-defined constraints that have been set, or any weights matrix imposed by the SETD command. Of course the

compound symmetry and the multivariate unstructured models can be estimated by both maximum likelihood and MCMC procedures. Moreover, it is also possible to fit an explicit multivariate model that does not have any term for level 1 variation and then impose one of a number of alternative correlational structures at level 2. These are:

Full covariance matrix - the unstructured multivariate model;
All correlations equal/all variances equal - the homogenous compound symmetry model;
All correlations equal/independent variances - the heterogeneous compound symmetry model;
AR1 structure/all variances equal - the homogenous 1st order autoregressive model;
AR1 structure/independent variances - the heterogeneous 1st order autoregressive model.

However, there is no Toeplitz model available in MCMC, and we have to trick MLwiN into modelling time-varying variables, as the way the multivariate model is constructed from the wide data means that there is no way of indicating that a particular predictor variable belongs to a set of variables, that is Age1 to Age5. We first fit an unstructured formulation as a multivariate model, and then impose restrictions on the nature of the dependency between occasions.

Begin by retrieving the original wide data:

File on main menu
Open worksheet
endurancewide.wsz
Equations window
Clear on bottom toolbar to remove any existing model
Estimation control to RIGLS
Click on Responses in the bottom toolbar, highlight all the variables representing the 5 Endurance measures, and then Done. (This must be done in the order End1, End2, to End5.)

This will have started to create a sensu-stricto multivariate model in the Equations window with 5 responses; we now need to set up the random-effects structure at level 2 for children, as there will be no level 1 variance in this saturated model.

Equations window
Click on any of the responses
N levels: 2-ij
Level 2(j): Child, leaving level 1 to be the response indicator
Done

Add term
Variable: choose the Constant
Add separate coefficients, to create a dummy for each response
Click on each Response variable in turn, tick off the fixed parameter and tick on j(child_long), to build the level 2 variance-covariance matrix
Add term
Variable: Constant
Add Common coefficient, to create an overall constant
Include all
Done

The specification of the model should now look like the following. There is an overall mean value (β_5), 5 variances, one for each occasion, and 10 covariances between the occasions; the h_j formulation allows a common fixed-effects model at the child level. On estimation we get the following unconditional results.

The mean Endurance across all children and all occasions is 15.9, and the variance at occasion one is . The variance increases with occasion, so that at occasion 5 the variance of 41.9 represents substantially bigger differences between children. It is worth looking at this point at the worksheet via the Names window, to see that MLwiN has created a number of new variables of length 1500 rows. Viewing these variables reveals that the endurance values have been stacked into a long column, Resp, with indicator variables picking out which occasion and which child the particular values belong to.

We now need to add in the fixed part terms of the model, and we shall aim to reproduce Model 8, the most complex fixed-effects model supported by the data. This is straightforward for the time-invariant child-level variables, but more complex for the time-varying ones.

Equations window

Add term
Variable: Sex
Reference category: Boy
Add Common coefficient, to create an overall Girl dummy for all occasions
Include all
Done
Add term
Variable: Rurality
Reference category: Urban
Add Common coefficient, to create an overall Rural dummy for all occasions
Include all
Done
Add term, Order 1
Variable: Sex
Variable: Rurality
Add Common coefficient, to create an overall Girl*Rural dummy
Include all
Done

We now start on the time-varying Age variables.

Equations window
Add term
Variable: Age1
Centred around grand mean
Add Common coefficient
Include all
Done
Add term
Variable: Age2
Centred around grand mean
Add Common coefficient
Include all
Done

And the Sex-Age interactions:

Add term, Order 1
Variable: Age1

Variable: Sex
Add Common coefficient
Include all
Done
Add term, Order 1
Variable: Age2
Variable: Sex
Add Common coefficient
Include all
Done

This creates the following specification. Unfortunately, Age1-gm and Age2-gm are not the variables that we want, for they represent time-invariant variables: the replicated values of how old each child was on occasion 1 and on occasion 2. Looking at the Equations window you can see the subscript j, not the required ij. The trick is to create the variables in the form we want outside of the Equations window and then replace the specified values. We first need to create a long vector of the Age variables and store it in a free column, here c37.

Data Manipulation on main Menu
Join

If you inspect this variable you will not find it in the order we need, of age nested within children, but rather the Ages of occasion 1 stacked on occasion 2, etc. So we have to create a stacked Child variable in c38, generate an occasion index as a repeated sequence, 66 and then sort on these variables, carrying the long Age variable.

join 'child' 'child' 'child' 'child' 'child' c38

Footnote 66: We need both the Child and the occasion index, as MLwiN's sort command does not maintain the original order when sorting; this generate-vector facility of course can only work when the data are completely balanced, that is, before missing values have been removed.
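What these join-and-sort steps accomplish is an ordinary wide-to-long reshape; a minimal Python sketch of the same logic, with hypothetical column names and values:

import pandas as pd

# Hypothetical wide file: one row per child, one Age column per occasion.
wide = pd.DataFrame({"child": [1, 2],
                     "Age1": [7.5, 7.9], "Age2": [9.4, 9.8], "Age3": [11.5, 11.9]})

# Stack the Age columns, then sort so age is nested within child,
# mirroring the join / generate-index / sort sequence in MLwiN.
long = wide.melt(id_vars="child", var_name="occasion", value_name="age")
long["occasion"] = long["occasion"].str.replace("Age", "").astype(int)
long = long.sort_values(["child", "occasion"]).reset_index(drop=True)
print(long)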

Finally, we centre the long Age variable around its overall mean of 11.439 and replace the values in the equation variables by the correct ones. In the Command window:

CALC c37 = c37 - 11.439
CALC '(Age1-gm).12345' = c37
CALC '(Age2-gm).12345' = c37**2
CALC '(Age1-gm).Girl.12345' = c37 * 'Girl.12345'
CALC '(Age2-gm).Girl.12345' = c37 * c37 * 'Girl.12345'

You should notice that the subscript changes to ij. After estimation you should get the following results, which are the same as those from the unstructured model that we obtained earlier.

We are now going to use MCMC estimation for a range of models.

Estimation Control on main menu
MCMC estimation
Burn in 500
Monitoring chain length 50,000
Done
Equations on main menu
MCMC
MCMC methods
Choose Univariate Metropolis Hastings estimation for Fixed and Random effects 67
Done
Equations on main menu
MCMC
Correlated residuals
Choose Full covariance matrix - the unstructured multivariate model
Done
Start 68

storing the estimates as Unstruct. Here are the variance and correlation matrices (accessed through Model on main menu, Estimate tables).

Footnote 67: The default Gibbs sampling will not work correctly when estimating a common fixed coefficient.

Footnote 68: It is not possible to use the MCMC option for hierarchical centering to speed the estimation of this model.

The variances and correlations are all separately estimated, and it can be seen that there is a conditional correlation between occasions of about 0.6, with a variance of around 12 at each occasion; the increasing between-child variance found in the unconditional multivariate model has disappeared with the explicit modelling of the different growth of boys and girls. We can now impose some alternative restrictions on this model:

Model on main menu
MCMC
Correlated residuals

This will bring up a window with the set of alternative structures we can impose on the full covariance structure of the multivariate model. Choose each in turn and Start to get the MCMC estimates, and examine the variance and correlation estimates. It is also useful to give the command BDIC in the Command window after each model has been estimated, so as to get a complexity-penalised badness-of-fit measure.

Here are the correlations for All correlations equal/all variances equal, that is the homogenous compound symmetry model. It is clear that all the variances have been constrained to be the same (11.6), as have all the covariances and hence the correlations (0.61).

The results from the heterogeneous compound symmetry model, which is All correlations equal/independent variances, are:

Notice that the variances vary, as do the covariances, but the correlations are the same between each pair of occasions.

The results for the homogenous 1st order autoregressive model, that is AR1 residual structure/all variances equal (this is not a true autoregressive model with lagged responses on the right-hand side of the equation), are as follows. Notice that the lag-2 correlation is the lag-1 correlation squared, and the lag-3 correlation is the lag-1 correlation cubed.

The results for the heterogeneous 1st order autoregressive residual structure, that is AR1 structure/independent variances (again not a true autoregressive model), are:

We again see the imposed, rigidly declining correlation with lag. We can now compare, after checking that the monitoring chain has been run long enough, the Deviance Information Criterion for each model, and the estimated degrees of freedom consumed in the fit (pd), against the homogenous compound symmetry model.

[Table: for each model for the random part (Unstructured; Homogenous compound symmetry; Heterogeneous compound symmetry; Homogenous 1st order autoregressive; Heterogeneous 1st order autoregressive), the MLwiN description (Full covariance matrix; All correlations equal/all variances equal; All correlations equal/independent variances; AR1 structure/all variances equal; AR1 structure/independent variances), pd, DIC, and the difference in DIC from the homogenous CS model.]

Question 15: have the models been run for long enough? Do the estimates for pd make sense? Which is the preferred model if the DIC criterion is used?

Question 16: make a plot of the predictions from the fixed part estimates of the results for boys and girls in urban and rural areas at different stages of development. What do the results show?
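For reference, the two correlation patterns being compared in the table above can be generated directly; a minimal Python sketch, with a hypothetical common correlation of 0.6:

import numpy as np

rho, n_occ = 0.6, 5
lags = np.abs(np.subtract.outer(np.arange(n_occ), np.arange(n_occ)))

cs = np.where(lags == 0, 1.0, rho)   # compound symmetry: same correlation at every lag
ar1 = rho ** lags                    # AR1: correlation decays geometrically with lag

print(np.round(cs, 2))
print(np.round(ar1, 2))              # lag 2 = rho**2, lag 3 = rho**3, ...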

These results complete the analysis of the Madeira Growth Study; there is no evidence that a more complex model than the compound symmetry approach is needed. We now proceed to more technical matters. First we consider the difference between population-average and subject-specific models, and then the vexed question of whether fixed or random coefficient formulations should be used for subject-specific models. The Madeira data will be used to illuminate both debates.

Discrete outcomes: population average versus subject specific

Earlier in this chapter we briefly considered the difference between two types of modelling: the conditional, random-effects, subject-specific approach, which has been the main focus of this chapter, and the marginal or population-average approach epitomised by the GEE form of estimation. 69 In the Normal-theory model for a continuous response there may be differences in the estimation procedures for the fixed part of the model, but the resultant conditional and marginal estimates, and their interpretation, are very unlikely to be substantively different. This is usually not the case, however, when the response is discrete and a non-linear transformation has been made of the response variable (see Chapter 12, this volume). The values for the population average are likely to be smaller in absolute value, and the difference between the two estimates will be largest when there is substantial between-subject heterogeneity or, equivalently, a high degree of dependency, which is commonly the case when analysing longitudinal data.

Figure 13: Subject-specific and population average estimates on the logit and the probability scale.

Footnote 69: Marginal is used to emphasise that the mean response is modelled conditional only on covariates, and not on other responses or random effects.

Figure 13 aims to highlight and explain these differences. On the left-hand side of the figure is the log-odds of a Yes outcome for a binary response; on the right-hand side is a plot of the probability of a Yes outcome. On the logit graph there are six lines. The five parallel lines represent the results from a conditional model, where the middle line is the fixed-part estimate representing the subject-specific mean, and each of the subject lines in this random intercepts model departs from this line by an amount u_0j. The greater the level 2 between-child variance, the further apart the lines will be. Also shown on this logit graph is the marginal mean estimate of the relationship. The marginal approach treats the dependency between occasions as a nuisance, and there is no explicit term in the model for subject-level differences; there are no u_0j's. Consequently, we cannot plot subject-specific lines to look at modelled growth trajectories in the marginal model. They do not exist, and we only have the marginal mean estimate, the population average. This is akin to a single-level model estimate with the standard errors appropriately inflated due to the between-occasion dependence. 70

Footnote 70: When the number of subjects is large and missing data are not an issue, the single-level model estimates will be the same as the GEE estimates.

Clearly the marginal mean relationship has a much shallower slope than the subject-specific mean. The departure depends on the between-child variance: if there is no higher-level variance we are back to a single-level model, and the marginal and conditional estimates will be the same. An approximate formula that relates the two types of slopes is

$\beta^{\text{PA}} \approx \beta^{\text{SS}} / \sqrt{1 + 0.346\,\sigma^2_{u0}}$   (37)

Turning now to the probability scale in Figure 13, there are again four subject-specific relations, and these have been obtained by anti-logiting the values on the left-hand side of the diagram. The line that passes through the middle of the four subject curves can be obtained either by anti-logiting the conditional fixed-part estimate or, equivalently, by taking the median of the subject-specific probabilities at each value of the predictor variable, here Age. In contrast, the mean probability is obtained either by anti-logiting the marginal mean or, equivalently, by averaging over the subject-specific probabilities. Thus the probabilities obtained from a conditional model are the medians of the subject-specific curves, while those from the marginal model are the means of the subject-specific probabilities. In a two-level longitudinal model, the one gives you the results for the median person, the other the result for the mean person. In a three-level model of occasions nested within children within schools, one gives you the probabilities for the median person in the median school, the other the mean result in the mean school.

The cause of these differences is technical: the mean of a non-linear function (the logit) does not equal the non-linear function of the mean, whereas the median of the logits does equal the logit of the median, as it is still the middle value. In Normal-theory models

you can get the population average in two exactly equivalent ways: either as the curve of the means or as the mean of the curves. In the non-linear model used for discrete data, however, this works for the median but not for the mean.

There is considerable disagreement in the literature over which is the appropriate model for discrete data, some favouring the population-average approach, others the subject-specific. The debate can be resolved by separating how the model is specified and estimated on the one hand, and how it is used for predictions or inferences on the other. Thus, Lee and Nelder (2004) argue that the conditional model specification is more fundamental, as you can estimate a conditional model and make marginal predictions from it, if these are wanted. 71 But they caution that the difference between the models is not just one of estimation, in that the marginal model is fundamentally limited: it is not readily extendible to higher levels, nor can it include time-varying heterogeneity (random slopes), which may be of genuine substantive interest. They approvingly quote nine drawbacks of the marginal model listed by Lindsey and Lambert (1998), who provide an example where a treatment can be superior on the average while being poorer for every individual. Lindsey and Lambert conclude that the "statistical" argument, that we should directly model margins if scientific interest centres on them, is not acceptable on scientific grounds, for it implies that we are generally imposing more unrealistic physiological mechanisms on our data than by direct conditional modelling, and that these are most likely rendering simple marginal models greatly biased. 72 Lee and Nelder (2004), in their turn, draw clear conclusions: the use of marginal models can be dangerous, even when marginal inferences are of interest, and the usefulness of marginal inferences requires the absence of interactions [i.e. random slopes], checkable only via conditional models.

Footnote 71: Lee, Y-J. and Nelder, J.A. (2004) Conditional and marginal models: another view. Statistical Science, 19,

Footnote 72: Lindsey, J.K. and Lambert, P. (1998) On the appropriateness of marginal models for repeated measurements in clinical trials. Statistics in Medicine,

The problem of the conditional versus the marginal model can therefore be resolved by fitting the more flexible conditional model, but using it to make inferences, if needed, by marginalising the mixed model. The sole advantage of marginal estimation is robustness: random-effects estimates and standard errors may be more sensitive to assumptions about the nature of the residual structure. But of course this advantage only applies if between-child heterogeneity is not changing with time, so that a random slopes model is not needed, and

there is no higher-level variance of interest. 73 We therefore require a method of turning an estimated, more flexible conditional model into population-average predictions. Analytically this is difficult, as you have to integrate over the continuous random effects, but it is relatively easy to provide a solution based on simulation. Consequently, a very general simulation approach is adopted in MLwiN: the conditional model is estimated, and the fixed-part logits can be anti-logited to get the median probabilities. To obtain the population-average values, thousands of subject-specific curves are simulated, based on the fixed-part conditional estimates, to which are added subject-specific departures drawn from a distribution with the same variance as that estimated in the conditional model. These are turned into probabilities and their mean taken to get the population values. At its simplest this can be seen as a 4-step procedure: 74

Step 1: simulate M (5000, say) values for the random effects u_0j from a Normal distribution with the same variance as the estimated level 2 variance;

Step 2: use these 5000 u_0j's, combined with the fixed part estimates and particular values of the predictor variables, to get the predicted logits: $\text{logit}^{(m)} = x^{*T}\hat{\beta} + u^{(m)}$;

Step 3: anti-logit to get probabilities: $\pi^{(m)} = [1 + \exp(-(x^{*T}\hat{\beta} + u^{(m)}))]^{-1}$;

Step 4: take the mean of the probabilities to get the population-average predictions for the particular values of the predictor variables.

All of this is done fairly automatically in the customised predictions window of MLwiN. The procedure is very general, in that it can be applied in MLwiN to probit and logit models for binary outcomes, to ordinal response models, to models where the response is a count, and to multivariate models with more than one response.

Footnote 73: For more on the debate in the longitudinal case, see Ritz, J. and Spiegelman, D. (2004) Equivalence of conditional and marginal regression models for clustered and longitudinal data. Statistical Methods in Medical Research, 13(4), ; and Hu, F.B., Goldberg, J., Hedeker, D., Flay, B.R. and Pentz, M-A. (1998) Comparison of population-averaged and subject-specific approaches for analyzing repeated binary outcomes. American Journal of Epidemiology, 147, . The debate has recently spread to non-longitudinal settings, where Subramanian and O'Malley (2010) make the case that the choice depends on the purpose of the analysis, to counter Hubbard et al.'s (2010) claim that population-average models provide a more useful approximation of the truth. Subramanian, S.V. and O'Malley, A.J. (2010) The futility in comparing mixed and marginal approaches to modeling neighborhood effects. Epidemiology, 21(4), ; Hubbard, A.E. et al. (2010) To GEE or not to GEE: comparing population average and mixed models for estimating the associations between neighbourhood risk factors and health. Epidemiology, 21,

Footnote 74: In practice there is an additional step, in that simulations are also made of the fixed part estimates, based on the variance-covariances of the estimates, so that we can get confidence intervals for the predictions. More details are in the MLwiN Manual Supplement.
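The 4-step procedure is easy to reproduce outside MLwiN; a minimal Python sketch, anticipating the fabricated disease example discussed below (a fixed-part logit of -1.5 and a level 2 variance of 1), which also shows the analytic attenuation of equation (37):

import numpy as np

rng = np.random.default_rng(1)

logit_fixed = -1.5     # fixed-part (subject-specific) logit
sigma2_u = 1.0         # level 2 between-subject variance
M = 5000

u = rng.normal(0.0, np.sqrt(sigma2_u), M)    # Step 1: simulate random effects
logits = logit_fixed + u                     # Step 2: subject-specific logits
probs = 1.0 / (1.0 + np.exp(-logits))        # Step 3: anti-logit
print(f"population average (mean): {probs.mean():.3f}")    # Step 4
print(f"subject specific (median): {np.median(probs):.3f}")

# Equation (37): the approximate attenuation of the subject-specific slope
print(f"attenuation factor: {1 / np.sqrt(1 + 0.346 * sigma2_u):.3f}")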

There are therefore two sets of results. The subject-specific approach models the evolution of each individual, and consequently the conditional slope is the effect, for a particular individual, of a unit change in a predictor. In contrast, the population-average slope (which can be derived from the conditional estimate) is the effect of change in the whole population if everyone's predictor variable changed by one unit. So the answer to the question of which is better depends on the use you want to make of the different estimates. Allison (2009, 36) makes the following contrast:75 if you are a doctor and you want an estimate of how much a statin drug will lower your patient's odds of getting a heart attack, the subject-specific coefficient is the clear choice. On the other hand, if you are a state health official and you want to know how the number of people who die of heart attacks would change if everyone in the at-risk population took the statin drug, you would probably want to use the population-averaged coefficients. Even in the public health case, it could be argued that the subject-specific coefficient is more fundamental. To put it another way, marginal estimates should not be used to make inferences about individuals, as that would be committing the ecological fallacy, while conditional estimates should not be used to make inferences about populations, as that would be an atomistic fallacy. Or, to put it yet another way, the population-average approach underestimates the individual risk and vice-versa.

Subject-specific and population average inferences in practice

To appreciate these concepts we first take a simple fabricated example and then apply the ideas to the Madeira data. Beginning with the simple example, say we have estimated in a conditional model the log-odds of having a disease in a random-intercepts, variance-components model to be -1.5 and the level-2 variance to be 0, as in the model below.

75 Allison, P D (2009) Fixed Effects Regression Models, Quantitative Applications in the Social Sciences Series, Sage, Thousand Oaks, California
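As a one-line check of the subject-specific value quoted in the next paragraph, anti-logiting the estimated log-odds gives

$\pi = \frac{1}{1+\exp(-(-1.5))} = \frac{1}{1+e^{1.5}} \approx \frac{1}{1+4.48} \approx 0.18$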

If we then calculate the subject-specific median probability via the customised predictions procedure, the probability of the disease for a typical individual (the subject-specific estimate) is 0.18, while the prevalence of disease in the population is also 0.18: as there are no differences between individuals, the conditional results are the population values. However, if the level-2 between-individual variance increases to 1 as in Table 9 (equivalent to a VPC of 23 per cent, based on a standard logistic level-1 variance of 3.29; see Chapter 12 of this volume), the individual risk is the same at 0.18, but this equates to a higher population prevalence. As the between-person variance grows (the level-1 variance cannot go down in a binomial logistic model), implying that there is substantial between-person variability in the propensity for the binary outcome, the difference between the mean and the median grows. So when the VPC is 86 per cent, signifying extreme between-people heterogeneity, the estimate of the population prevalence is 38 per cent while the individual risk remains at 18 per cent.76

Table 9 Subject-specific and population-average probabilities for different levels of between variance (the fixed-part logit is -1.5 throughout; cells left blank were not preserved in the transcription)77

Between variance   VPC    Subject-specific (median)   Population average (mean)
0                   0%    0.18                        0.18
1                  23%    0.18
                   86%    0.18                        0.38

We can see also how these ideas play out in a simple case with the Madeira data. We first recode endurance into a 1 for above the mean and a 0 for below. Then we fit a single-level logit model with a Bernoulli level-1 distribution; this will approximate a population-average model, but with incorrect standard errors, and is done for simplicity with just the linear term of centred age.

76 A very large between-variance means that a large number of subjects had the same response over all occasions, so that many have 00000 while others have 11111. This is known as a mover-stayer situation. When this is the case, it is questionable whether the Normality assumption for the level-2 distribution is sensible. It may be more appropriate to fit a non-parametric model whereby the level-2 distribution is discrete rather than continuous; see Aitkin, M (1999) A general maximum likelihood analysis of variance components in generalized linear models, Biometrics 55, ; one such implementation is the npmlreg package in R; others are to be found in Latent Gold, gllamm in Stata, and Mplus.
77 The table was produced by using MLwiN as a sandpit. A two-level variance-components model was set up, the desired value of -1.5 was put into column c1098 where the fixed estimate is stored, and level-2 variance values of 0, 1, … were successively put into c1096 where the variance estimates are stored. The customised predictions window was then used to get the median and mean values by simulation.
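The pattern in Table 9 can be verified outside MLwiN by integrating the anti-logit over the random-effects distribution; a minimal sketch using Gauss-Hermite quadrature (the grid of variances is our own illustrative choice) reproduces values close to those quoted:

import numpy as np

def marginal_prevalence(fixed_logit, sigma2_u, n_nodes=40):
    # Population-average probability: integrate the anti-logit over the
    # N(0, sigma2_u) random-effects distribution by Gauss-Hermite quadrature
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    u = np.sqrt(2.0 * sigma2_u) * nodes   # change of variable for the Normal
    probs = 1.0 / (1.0 + np.exp(-(fixed_logit + u)))
    return float(np.sum(weights * probs) / np.sqrt(np.pi))

for var in (0.0, 1.0, 20.0):              # illustrative between-person variances
    vpc = var / (var + 3.29)              # 3.29 is the standard logistic level-1 variance
    print(f"variance {var:5.1f}  VPC {vpc:4.0%}  median 0.18  "
          f"mean {marginal_prevalence(-1.5, var):.2f}")

The median stays at 0.18 whatever the variance, while the mean rises towards the values in the table as the between-person heterogeneity grows.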

This is followed by a random-intercepts, subject-specific model. Finally we make customised predictions based on both the medians and the means and graph the results. Both models are estimated by MCMC with 50k monitoring chains so that we can compare the DIC.

To recode the response variable to a binary outcome, use the recode command. The single-level model is estimated to be

The customised predictions for ages 7 to 17 in steps of 1, using both the means and the medians, are as follows; as expected, these are the same values in this population-average model. The confidence intervals should be treated with caution, as the dependency over time is not being modelled in any way.

[Table of customised predictions: Age; Median with lower and upper 95% CI; Mean with lower and upper 95% CI]

Here are the estimates of the random-intercepts, subject-specific logit model, and we can see a very substantial between-child variance.

Making the assumption that the level-1 distribution is a standard logistic distribution, the degree of dependency over occasions can be calculated as

$\rho = \dfrac{\sigma^2_{u0}}{\sigma^2_{u0} + 3.29} = \dfrac{8.101}{8.101 + 3.29} \approx 0.711$    (38)

calc b1 = 8.101/(8.101 + 3.29)

which is clearly a high level of dependency; equivalently, there are substantial differences between children of the same age. Comparing the two sets of estimates

Response: HiLoAerobic
                        Single level   S.E.   Random intercepts   S.E.
Fixed Part
  Constant
  (Age-gm)
Random Part
  Level: Child
  Constant/Constant                           8.101
DIC:

it can be seen that the DIC is considerably lower in the model that takes account of between-child heterogeneity, and that the slope with Age is more than twice as steep in the subject-specific model. It is also clear that the standard error of this time-varying variable has been

underestimated in the single-level model. The success of the approximate prediction formula, in which the subject-specific slope is divided by $\sqrt{1 + 0.346\,\sigma^2_{u0}}$, can be seen in that

calc b1 = 0.455/((1 + 0.346*8.101)**0.5)

is quite close to the single-level estimate, even in this rather extreme case with a very large between-child variance of over 8 on the logit scale. Here are the customised predictions for the subject-specific model:

[Table of customised predictions: Age; Median with lower and upper 95% CI; Mean with lower and upper 95% CI]

The greater steepness of the median subject-specific line is clear. Here is a plot of both the mean and median relationships and the subject-specific modelled probability curves for all the children.

It is therefore quite straightforward to use MLwiN to fit the mixed model for estimation and to use the predictions facility to report both the subject-specific and the population-average results. It is also a simple matter to report odds, logits or probabilities by choosing the desired metric in the customised predictions window.
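For reference, the approximate conversion used in the calc command above is the standard logistic-Normal attenuation result (stated here for completeness; only the calc line appears in the original):

$\beta_{\text{PA}} \approx \dfrac{\beta_{\text{SS}}}{\sqrt{1 + c^{2}\sigma^{2}_{u0}}}, \qquad c = \dfrac{16\sqrt{3}}{15\pi}, \quad c^{2} \approx 0.346$

so that here $0.455/\sqrt{1 + 0.346 \times 8.101} \approx 0.455/1.95 \approx 0.23$.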

Heagerty and Zeger (2000) have been able to develop multilevel models with random effects in a marginal framework rather than the usual conditional one.78 The important advantage of this approach is that the estimates of the regression parameters are less sensitive to assumptions about the distribution of the random effects. In the subsequent discussion of this paper, Raudenbush argues that although this work is very important, the choice of which procedure to use depends, as always, on the target of inference, a view he later elaborated in Raudenbush (2009).79 In particular he argues that if you are interested in the size and nature of the random effects then the conditional model is the only choice that fits conceptually, and he recommends a sensitivity analysis to evaluate distributional assumptions.

Fixed versus Random effects

This section is about the relative merits of a fixed versus a random approach to including subject-specific effects in longitudinal models. This is another area where there has been a lot of quite trenchant debate about both technical and interpretative matters, in which the sides have tended to talk past each other. Fortunately (and as in the marginal and mixed case) it is possible to have your cake and eat it: the mixed random-effects model can be used to derive what the fixed-effects proponents are seeking, and a lot more besides. We will consider the form of both models, what both sides write about each other, the prize that the fixed-effects advocates are seeking, and why this is apparently not achievable by a random-effects model. We will then show that it is achievable with a re-specification of the multilevel model that has separate terms for the longitudinal and cross-sectional effects of time-varying variables. This brings both substantive and technical benefits, and it puts the Hausman test, which is commonly employed to choose between fixed and random effects, in a new light. We will illustrate these ideas using a linear growth model for endurance in Madeira children. First, however, we have to make two digressions: we consider the meaning of endogeneity, and then the notion of within- and between- regressions. This will allow us to appreciate fully why some researchers prefer fixed to random effects, and why they are mistaken to do so.

A digression on endogeneity

Endogeneity, the violation of exogeneity, is committed if an explanatory, supposedly independent, variable is correlated with the residuals. More formally this can be specified as the zero conditional mean assumption, which in a two-level model with a level-1 predictor applies to both sets of residuals:

$E(u_{0j} \mid x_{ij}) = 0$    (39)

78 Heagerty, P J and Zeger, S L (2000) Marginalized multilevel models and likelihood inference, Statistical Science 15(1),
79 Raudenbush, S W (2009) Targets of inference in hierarchical models for longitudinal data, in Fitzmaurice et al (eds.) Longitudinal Data Analysis, CRC Press, Boca Raton

and

$E(e_{0ij} \mid x_{ij}) = 0$    (40)

There are in fact three underlying causes of endogeneity (which can occur in combination):

Simultaneity: this is when we can conceive of a reciprocal causal flow, so that y determines x and x determines y; or, to put it another way, the dependent and explanatory variables are jointly determined. We are not going to discuss this any further, but note that one form of it is the dynamic type of longitudinal model where the outcome occurs on both the left- and right-hand sides of the equation as lagged responses. This can be handled by a multivariate multilevel model in which more than one response can be modelled simultaneously; see Steele (2011).80

Omitted-variable bias: this occurs when we have not measured, or do not know of, important predictor variables that affect both the outcome and the included predictor variables. This is a very pervasive problem, as it is difficult, for example, to measure ability, aptitude, proficiency, parental-rearing practices and genetic makeup in developmental studies. In labour economics, relating the outcome of wages to years of schooling may be problematic as both variables may be correlated with unobserved ability. Similarly, in comparative political economy the institutional features of a country may be important, and there may be aspects of a country's culture, history and geography that are not readily measurable. If a variable is correlated with the outcome and with the predictors included in the model, then its omission will impart bias to the slope estimates of the included variables. We will be particularly concerned with level-2 or cluster-level endogeneity, which arises in the longitudinal case from correlations between time-varying predictors and omitted child characteristics, so that the level-2 random effects are correlated with level-1 covariates (equation 39). Random-effects models stand accused of being very susceptible to such bias, and that is why fixed effects are often preferred. It will be shown, however, that it is straightforward to deal with such cluster endogeneity within the random-effects framework, which offers a much more flexible approach. For other types of omitted-variable bias the recommended approach is multivariate multilevel models, which require an instrumental variable to be identifiable.81

80 Steele, F (2011) A multilevel simultaneous equations model for within-cluster dynamic effects, with an application to reciprocal relationships between maternal depression and child behaviour, Symposium on Recent Advances in Methods for the Analysis of Panel Data, June 2011, Lisbon University Institute.
81 In attempting to estimate the causal effect of some variable x on another y, an instrument is a third variable z which affects y only through its effect on x. See Ebbes, P, Bockenholt, U and Wedel, M (2004) Regressor and random effects dependencies in multilevel models, Statistica Neerlandica 58, and Steele, F, Vignoles, A and Jenkins, A (2007) The impact of school resources on pupil attainment: a multilevel simultaneous equation modelling approach, Journal of the Royal Statistical Society, A, 170(3).

Measurement error: this occurs when we want to measure a particular predictor variable but can only measure it imperfectly. Depending on the form of this mismeasurement, endogeneity may result. The case of multilevel models where a level-1 covariate is measured with error is considered by Ferrão and Goldstein (2009), while Grilli and Rampichini (2011), as we discuss later, consider two sources of endogeneity simultaneously: cluster-level omitted variables and measurement error.82

A digression on interpreting within and between regressions

For the moment we will move away from longitudinal models, consider the standard multilevel approach to analysing contextual effects, and focus on peer-group effects for pupils in a class. A standard multilevel model for this is:

$y_{ij} = \beta_0 + \beta_1 x_{ij} + (u_{0j} + e_{0ij})$    (41)

where $y_{ij}$ is the current achievement of pupil i in class j and $x_{ij}$ is past achievement. It is a common occurrence that, as we move away from a null model and include $x_{ij}$, not only does the between-pupil variance, $\sigma^2_{e0}$, decrease but so does the between-class variance, $\sigma^2_{u0}$. This must mean that the predictor has an element within it that varies systematically from class to class. That is, there are some classes where the average prior achievement is high and others where it is low, and we would like to get at these differences to assess the size and nature of the peer-group effect. From this perspective, the standard model is a confusion, in that the slope $\beta_1$ is a mixture of the effects that are going on at the class and the pupil level. To disentangle these we can take a level-1 predictor and decompose it into two elements, the between-class means, $\bar{x}_j$, and the within-class deviations from those means, $(x_{ij} - \bar{x}_j)$.83 If both the class-mean-centred predictor and the class-mean values are included in the model, we can estimate both the within- and between- influences in a single multilevel model (Snijders and Bosker 1999, section 3.6):

$y_{ij} = \beta_0 + \beta_W (x_{ij} - \bar{x}_j) + \beta_B \bar{x}_j + (u_{0j} + e_{0ij})$    (42)

where $\beta_W$ is the within-class slope effect and $\beta_B$ is the between-class slope effect.

82 Ferrão, M E and Goldstein, H (2009) Adjusting for measurement error in the value added model: evidence from Portugal, Quality and Quantity 43, Grilli, L and Rampichini, C (2011) The role of sample cluster means in multilevel models: a view on endogeneity and measurement errors issues, to appear
83 We can imagine two limiting cases: one where the class means are all the same, so that there is no between-class variation, and, at the other extreme, the case where there are no deviations around the class means. This can be revealed by a two-level variance-components model where the response is $x_{ij}$: the between-class variance is zero in the first case and the within-class variance is zero in the second. It is only in the former case that there can be no possibility of peer-group effects.
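A minimal sketch of this decomposition in Python (the data frame and column names are invented for illustration):

import pandas as pd

# Toy data: pupils (rows) nested in classes, with prior attainment x
df = pd.DataFrame({
    "class_id": [1, 1, 1, 2, 2, 2],
    "x":        [4.0, 5.0, 6.0, 8.0, 9.0, 10.0],
})

# Between element: the class mean of x
df["x_classmean"] = df.groupby("class_id")["x"].transform("mean")

# Within element: each pupil's deviation from their class mean
df["x_within"] = df["x"] - df["x_classmean"]

# The two elements partition x exactly: x = x_classmean + x_within
print(df)

The two constructed columns sum back to the raw predictor, which is why the group-mean-centred and the contextual formulations that follow carry the same information.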

An alternative formulation is sometimes used:

$y_{ij} = \beta_0 + \beta_W x_{ij} + \beta_C \bar{x}_j + (u_{0j} + e_{0ij})$    (43)

where the raw predictor is included and not the class-mean-centred one. The slope term associated with the group-mean term, $\beta_C$, is known as the contextual effect, as it estimates the difference the characteristics of the group make to the outcome over and above the individual effect. Thus if $\beta_C$ is positive then children progress more if the children around them are of high ability; if the coefficient is negative the children are somehow deterred if they are in a high-ability class; sometimes it is better to be a big fish in a small pond! A model that incorrectly assumes a common effect, as in the standard model above (equation 41), can result in a misleading assessment of the influence of a predictor on the response.84

The contextual effect is thus formally defined as the difference between the between-group and within-group coefficients, $\beta_C = \beta_B - \beta_W$, so that it can either be derived directly from the contextual formulation of equation (43) or indirectly as a difference in the group-mean-centred version of (42). Similarly, $\beta_B = \beta_W + \beta_C$. It is worth spelling out in detail the different and specific meanings of these three coefficients:

$\beta_W$ equals the estimated difference in progress for two pupils in the same class who differ by one unit on prior attainment;

$\beta_B$ equals the estimated difference in class mean achievement between two classes that differ by one unit on the class mean of prior attainment;

$\beta_C$ is the estimated difference in progress for pupils who have the same prior attainment but attend classes that differ by one unit on prior mean attainment.

We will see later that both these specifications allow us to deal with level-2 cluster endogeneity. For further discussion of the effects of different types of centering in multilevel modelling see Enders and Tofighi (2007), who also consider including group means for binary predictors, which gives, of course, the proportion of cases in the non-referent category.85

Fixed and random effects specifications

The random-intercepts linear growth model can be specified as usual as:

$y_{ij} = \beta_0 + \beta_1 x_{1ij} + (u_{0j} + e_{0ij})$    (44)

84 The standard multilevel model of (41) estimates a weighted average of the within- and between- group effects, where the weight for the within-effect becomes more important when there are a large number of level-1 observations, the level-1 residual variance is small, and the level-2 variance becomes large; the relevant formulae are given in Rabe-Hesketh and Skrondal (2008, 114). The standard multilevel estimate will only be the same when the between effect equals the within effect or, equivalently, when $\beta_C$ equals zero. When this is the case the common slope will be more precisely estimated, as it pools the within and between information.
85 Enders, C K and Tofighi, D (2007) Centering predictors in cross-sectional multilevel models: a new look at an old issue, Psychological Methods 12, See also Biesanz, J C et al (2004) The role of coding time in estimating and interpreting growth curve models, Psychological Methods 9,

$u_{0j} \sim N(0, \sigma^2_{u0}); \qquad e_{0ij} \sim N(0, \sigma^2_{e0})$

where there is a single time-varying predictor, $x_{1ij}$, the age of child j on occasion i. The key distinctive element is that the differential intercepts, one for each child, are seen as coming from a Normal distribution with a common variance. This is a very parsimonious model, as we only have to estimate a single variance term and not hundreds or thousands of separate terms for each child. In contrast, in the fixed-effects counterpart, which is the dominant approach in, for example, comparative political economy and much of economics:

$y_{ij} = \beta_0 + \beta_1 x_{1ij} + \sum_{k=2}^{m}\alpha_k D_{kj} + e_{0ij}, \qquad e_{0ij} \sim N(0, \sigma^2_{e0})$    (45)

there are m-1 additional fixed effects $\alpha_k$ associated with m-1 dummy variables $D_{kj}$, one for each child, where m is the number of units, the children at level 2. Views on these two models can be much polarised. Thus Molenberghs and Verbeke (2006, 47), from a biostatistics background, contend that the fixed-effects approach is subject to severe criticisms as it leaves several sources of variability unaccounted for and, to worsen matters, the number of fixed-effects parameters increases with sample size, jeopardizing the consistency of such approaches.86 In addition, the conceptual argument can be made that the random-effects model allows generalization to the population of children, and not just to the specific children as in the fixed-effects model. Moreover, the fixed-effects model cannot include subject-level variables (that is, time-invariant variables such as Sex and Rurality) as all the degrees of freedom have been consumed at the child level.87 As Fielding, something of a renegade economist with these views, writes (2004, 4-5):88

It is only a random effects specification that can handle level-two covariates and the extent to which level-two covariates can explain level-two variation. It is clear that fixed effects specifications for u_j [subject-specific differences] are unsuitable for many of the complex questions to which multilevel methodology has been addressed.

86 Molenberghs, G and Verbeke, G (2006) Models for Discrete Longitudinal Data, Springer, Berlin
87 Equivalently, the child-level variables will be collinear with the set of child-level dummies, rendering the coefficients un-identifiable.
88 Fielding, A (2004) The role of the Hausman test and whether higher level effects should be treated as random or fixed, Multilevel Modelling Newsletter, 16(2),

And yet Allison (2009, 2) can argue that such characterisations are very unhelpful in a non-experimental setting, however, because they suggest that a random effects approach is nearly always preferable. Nothing could be further from the truth.89 The prize must be very great indeed if the obvious relative advantages of random effects are going to be spurned, and it is. To understand the nature of the prize it is helpful to consider an experimental situation where the $x_{1ij}$ variable is a treatment which can be switched on and off, and the treatment is randomly allocated.90 This random allocation guarantees, provided the protocols are followed, that all other sources of systematic influence on the response and the intervention are held at bay. Consequently, those being treated and those in the control group not receiving the treatment are in equipoise, so that all potential confounding factors related to the treatment and to the outcome are held off. Any observed differences between the two groups must then be due to the intervention. Internal validity is guaranteed to a known degree, dependent only on the size (the number of replicates) of the experiment. Randomisation controls for confounders even if they have not, or indeed cannot, be measured. This is the prize: by including fixed coefficients in an observational study for each child, all child-level effects, known or unknown, have been removed. Each individual has become their own control. Cluster-level endogeneity has been solved, as all of the variance at the child level has been wiped away. The gold-standard methodology of random allocation can be achieved in observational studies, but only if we adopt the fixed-effects approach. From this perspective the random-effects model is seen as seriously deficient. Instead of a method to deal with endogeneity, the random-coefficient model is based on the fundamental and required assumption that there is no endogeneity or, equivalently, that omnisciently we have included all subject-level covariates that influence the response.

In practice, to compute the fixed-effects model we could either include the set of dummies (but this gets cumbersome with thousands of children) or get exactly the same results by the mean-deviations method. In this procedure, we calculate for all time-varying variables (predictors and response) the mean value for each child over time, and then subtract this child-specific mean from the observed values of each variable and regress the de-meaned response on the de-meaned predictors:

$(y_{ij} - \bar{y}_j) = \beta_1 (x_{1ij} - \bar{x}_{1j}) + (e_{0ij} - \bar{e}_{0j})$    (46)

89 Allison, P D (2009) Fixed Effects Regression Models, Quantitative Applications in the Social Sciences Series, Sage, Thousand Oaks, California; similar arguments are made at Effects-Regression-Methods
90 The editor's introduction to Allison (2009), referring to the fixed-effects approach, writes that these statistical models perform neatly the same function as random assignment in a designed experiment, p.ix
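Before noting some features of this formulation, a minimal simulation sketch (our own invented data, not the Madeira study) confirms that the mean-deviation estimator reproduces the dummy-variable (LSDV) slope exactly:

import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 5                                 # children, occasions per child
child = np.repeat(np.arange(m), n)

alpha = rng.normal(0, 2, m)                  # child-specific intercepts
x = rng.normal(0, 1, m * n) + alpha[child]   # time-varying x, correlated with child effects
y = 1.0 + 0.63 * x + alpha[child] + rng.normal(0, 1, m * n)

# LSDV: regress y on x plus a dummy for every child
D = np.eye(m)[child]
beta_lsdv = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)[0][0]

# Mean-deviation (within) estimator: de-mean y and x by child, then regress
xbar = np.bincount(child, weights=x) / n
ybar = np.bincount(child, weights=y) / n
xw, yw = x - xbar[child], y - ybar[child]
beta_within = (xw @ yw) / (xw @ xw)

print(beta_lsdv, beta_within)                # identical to machine precision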

We can note three things about this formulation:

We treat the child differences as a complete nuisance and calculate away the subject-specific values, so that there are no $u_{0j}$ terms in this equation; this means we can estimate it as a single-level model using OLS;

Child-level variables, that is stable time-invariant variables, will be reduced to a set of 0's and can no longer be modelled, as they will not vary;

The mean-difference approach sweeps out all the between-children variation, and control has been achieved for unobserved variation at the child level; the prize has been achieved.

Mundlak formulation

The third point is key for our argument. Instead of de-meaning to remove child characteristics, we could also control for them by modelling them away, that is, by including in the equation the group mean of each predictor that is time-varying. This model is called the Mundlak (1978) specification.91 It is exactly the same as the contextual model formulation of earlier, in that we include the group mean in the model:

$y_{ij} = \beta_0 + \beta_1 x_{1ij} + \beta_2 \bar{x}_{1j} + (u_{0j} + e_{0ij})$    (47)

Consequently the estimate $\beta_1$ will not be biased by the omission of group-level variation associated with that variable, as it has been modelled out through $\bar{x}_{1j}$. To stress the point we are making: when group means are included, as in the Mundlak approach, the random-effects model will yield the same within estimate of the slope of a time-varying variate as a fixed-effects specification. But there is an added advantage. As we have not expunged all child variation (the response has not been de-meaned) and we have unexplained variation at the child level ($u_{0j}$), we can include further time-invariant child predictors to account for this variation. We can also have a much less restrictive structure than the fixed-effects model, in that we can fit three-level models and can model explicitly complex heterogeneity and dependence, which can be of substantive interest. However, we do need a multilevel model with its random effects so as to ensure correct standard errors for the fixed-part estimates. The downside, of course, is that for each time-varying variable we have to include an extra term, the group mean, in the model. But this is not a great deal of trouble.92

91 Mundlak, Y (1978) On the pooling of time series and cross section data, Econometrica, 46, He could not have been clearer (p70): the whole approach which calls for a decision on the nature of the effect, whether it is random or fixed, is both arbitrary and unnecessary.
92 Although Allison's (2009) monograph is called Fixed Effects Regression Models, he does realize the power of the Mundlak approach, which he calls a hybrid specification. While it has the intent of a fixed-effects model in controlling for unobserved confounders, there is no disguising that it is in fact a random-effects model with group-mean centering.
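As a minimal sketch of the Mundlak specification in action (simulated data and statsmodels' mixed-model routine; all names and values are our own assumptions, not MLwiN output):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
m, n = 100, 5
child = np.repeat(np.arange(m), n)
alpha = rng.normal(0, 2, m)                       # omitted child characteristics

df = pd.DataFrame({"child": child})
df["x"] = rng.normal(0, 1, m * n) + alpha[child]  # x correlated with child effects
df["y"] = 0.63 * df["x"] + alpha[child] + rng.normal(0, 1, m * n)
df["x_mean"] = df.groupby("child")["x"].transform("mean")  # Mundlak group mean

# Naive random-intercepts model: the slope is biased by cluster-level endogeneity
naive = smf.mixedlm("y ~ x", df, groups=df["child"]).fit()

# Mundlak specification: adding the group mean recovers the within (fixed-effects) slope
mundlak = smf.mixedlm("y ~ x + x_mean", df, groups=df["child"]).fit()

print(naive.params["x"], mundlak.params["x"])     # biased versus approximately 0.63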

As earlier, instead of the contextual formulation we can also use the alternative within- and between- formulation:

$y_{ij} = \beta_0 + \beta_W (x_{1ij} - \bar{x}_{1j}) + \beta_B \bar{x}_{1j} + (u_{0j} + e_{0ij})$    (48)

so that the group mean is included and, additionally, the time-varying predictor is de-meaned or group-centred. Consequently the within-group coefficient is given by $\beta_W$, and this measures the longitudinal effect of the predictor. In contrast, the between-group coefficient is given by $\beta_B$, and this measures the cross-sectional effect of the predictor. As usual, there could be contradictory processes going on at each level, which may be of substantive interest; this would have been lost altogether in the fixed-effects approach. What we had previously discussed as a method to disentangle within and between effects in a random-coefficient model is now seen as a solution to level-2 endogeneity. Technically, the group mean $\bar{x}_{1j}$ is an instrumental variable for $x_{1ij}$, as it is correlated with $x_{1ij}$ but uncorrelated with the random intercept; equivalently, the random intercept can now be conceived as being uncorrelated with $x_{1ij}$. When group means are included, the within estimate of the random-effects model will yield the same estimated effect as a fixed-effects specification.

From this perspective the fixed-effects approach is deficient, as it disregards the cross-sectional variance of all the variables in the model and only uses variation over time to estimate the parameters of the model. While the fixed estimator is the dominant approach in econometrics and comparative political economy, it is seldom used by educational researchers.93 This is because it wipes out the variables of key focal interest that lie at the higher level. So, in studying pupils in classes in schools, if we included pupil fixed effects we would not be able to estimate peer-group effects, teaching styles or school climate. In the longitudinal situation, if we used fixed effects we would not be able to study the effect of time-invariant variables or their interactions, so that we cannot estimate and study even quite fundamental aspects like differential growth rates for boys and girls. This can also be an issue when the variable of causal interest is nearly time-invariant, as is often found when longitudinal models are used to estimate the effect of changes at the country level; conceptually they are time-varying

93 As recently as 2007, Beck and Katz were able to identify only two applications of the random-coefficient model to time-series cross-section data in comparative political economy, and both of them were by the same author and did not include time as a predictor! Their paper goes on to show in a set of simulations how effective the model is for estimating the overall slope and country-specific values, even when there is non-normality of between-country differences and outliers are imposed. Moreover, they find that the approach will not mislead, so if there is no country heterogeneity present, the random-effects model will not falsely find it. The simulation was designed to examine a typical comparative political economy study with 20 countries and 5 to 50 occasions, with the models being estimated by maximum likelihood. With a larger number of higher-level units and MCMC estimation that takes account of the uncertainty in both the fixed and random parameters, even better results could be anticipated.
This was indeed found by Shor et al (2007) in their Monte Carlo evaluation of MCMC estimation of time-series cross-section data. Beck, N and Katz, J N (2007) Random coefficient models for time-series cross-section data: Monte Carlo experiments, Political Analysis, 15, Shor, B, Bafumi, J, Keele, L and Park, D (2007) A Bayesian multilevel modelling approach to time-series cross-sectional data, Political Analysis 15,

predictors, but in practice the changes are only slight and may be confined to only a few units. Such slow-moving variables include the presence of democracy at the country level, or the presence of a particular institution in a State over time in US politics. If this is the case, and the variables of interest show much more variation between units than within, the fixed-effects estimator will be very inefficient, and considerable efforts have been made to get improved fixed estimates (Plümper and Troeger, 2007).94 But as Shor et al (2007) found in their Monte Carlo evaluation of MCMC estimation of time-series cross-section data, random-effects models are able to recover the effects of such partially time-invariant, slow-moving predictors.95

Finally, it is worth stressing that the approach advocated here deals with only one form of endogeneity bias: the correlation that arises between included occasion-level variables and omitted child characteristics, that is, cross-level or level-2 endogeneity between $x_{1ij}$ and $u_{0j}$. It does not protect from same-level endogeneity, such as the bias from correlation between included child-level variables and omitted child characteristics; thus the coefficients for other level-2 variables, such as gender, may be subject to bias, as we are not controlling for unmeasured predictors at level 2. Nor does it protect from correlations between included variables and level-1 residuals. These would usually require the approach of instrumental variables or simultaneous equations; see Ebbes et al (2004) and Kim and Frees (2007).96

The Hausman test

Since the 1980s most applications of panel-data analysis have made the choice between random and fixed effects based on Hausman's (1978) test.97 A fixed-effects model is fitted and then an equivalent random-effects specification is estimated. The difference between the two sets of results is the test statistic, and if the differences are significant, cluster-level endogeneity is deemed to be present and the random effects are abandoned in favour of the supposedly more robust-to-endogeneity fixed effects. In fact the Hausman test, which really is a very general test that can be applied to a wide variety of different model misspecifications, is in the longitudinal case (Baltagi, 2005, Section 4.3) exactly equivalent to testing that the within- and between- estimates are different or, equivalently, that the contextual effect is zero.98 In this light, the Hausman test should not be seen as just a

94 Plümper, T and Troeger, V (2007) Efficient estimation of time-invariant and rarely changing variables in finite sample panel analyses with unit fixed effects, Political Analysis, 15,
95 Shor, B, Bafumi, J, Keele, L and Park, D (2007) A Bayesian multilevel modelling approach to time-series cross-sectional data, Political Analysis 15,
96 Ebbes, P, Bockenholt, U and Wedel, M (2004) Regressor and random effects dependencies in multilevel models, Statistica Neerlandica 58, They recognize 16 types of correlation between included level-1 and level-2 variables and level-1 and level-2 residuals; only one of these is no endogeneity bias, and only one of these is tackled by the Mundlak specification. Kim, J-S and Frees, E W (2007) Multilevel modelling with correlated effects, Psychometrika, 72,
97 Hausman, J A (1978) Specification tests in econometrics, Econometrica, 46(6),
98 Baltagi, B H (2005) Econometric Analysis of Panel Data, Wiley, Chichester

technical decision but is of very substantive importance: a significant result means that there is evidence that different processes are operating longitudinally from those that are operating cross-sectionally. To take an example, life satisfaction may not just be affected by the change to unemployment (the longitudinal effect) but also by the proportion of time spent in unemployment. Methodologically, a significant contextual effect is a warning not to commit the atomistic or ecological fallacy: a cross-sectional analysis would miss important longitudinal processes, while a purely longitudinal analysis would miss the importance of stable or enduring effects. Moreover, its standard use in choosing one formulation over another is beside the point, as the group-mean-centred formulation achieves the fixed-effects goals in a much more parsimonious way.99 Given the very widespread application of the Hausman test as a purely technical arbiter, it is very clear there is a great deal of confusion about the relative merits of the fixed and random approaches.

Hanchane and Mostafa (2010) provide a lovely illustration of these arguments, albeit in a non-longitudinal setting.100 They examine the level-2 endogeneity that arises from correlations between student characteristics and omitted school variables. They argue that the cause of this is stratification, such that poor households are likely to live in poor communities due to the functioning of the housing market, so that local schools do not have a random allocation of children: the entry will be selective and non-homogeneous. Apparent school effects may then be a by-product of the social mix of their pupils. Their new twist in the argument is that this may differ by the organisation of the prevailing education system: being lessened in comprehensive Finland, heightened in Germany with its early selection, and in-between with the English liberal educational management system. Consequently the degree of endogeneity will vary from system to system. They then evaluate this by analysing the international PISA data, with student mathematics scores in the 2003 survey as their dependent variable. They find that if they omit the school peer effects (the group means of the Mundlak formulation), the Hausman test does find endogeneity in the manner specified, so that Finland has the lowest value and Germany the highest. Moreover, including the group means results in the Hausman test becoming zero in all three countries.

99 Snijders, T A B and Berkhof, J (2008) Diagnostic checks for multilevel models, in de Leeuw, J and Meijer, E (eds.) Handbook of Multilevel Analysis, Springer, New York, describe the Hausman test (p147) as slightly beside the point.
100 Hanchane, S and Mostafa, T (2010) Endogeneity problems in multilevel estimation of education production functions: an analysis using PISA data, LLAKES Research Paper 14
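A minimal sketch of this equivalence, reusing the style of the earlier simulation (again our own invented data): rather than running a formal Hausman test, fit the contextual formulation and test whether the group-mean coefficient is zero.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
m, n = 100, 5
group = np.repeat(np.arange(m), n)
alpha = rng.normal(0, 1, m)

df = pd.DataFrame({"group": group})
df["x"] = rng.normal(0, 1, m * n) + alpha[group]
df["y"] = 0.5 * df["x"] + alpha[group] + rng.normal(0, 1, m * n)
df["x_mean"] = df.groupby("group")["x"].transform("mean")

# Contextual formulation: y ~ x + group mean of x, with a random intercept
fit = smf.mixedlm("y ~ x + x_mean", df, groups=df["group"]).fit()

# The test on the x_mean (contextual) coefficient plays the role of the
# Hausman test: a significant value signals that within and between differ
print(fit.params["x_mean"], fit.pvalues["x_mean"])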

Exemplifying the Mundlak approach

Neuhaus and Kalbfleisch (1998) report a two-level multilevel model of birth weights (in grams) for 880 women (j) who have each had 5 births (i), where the predictor is the Age of the mother at the birth.101 They report a standard two-level compound-symmetry model that presumes that the within- and between- results are identical:

$y_{ij} = \beta_0 + \beta_1 \text{Age}_{ij} + (u_{0j} + e_{0ij})$

and a model with group-mean-centred Age and mean Age:

$y_{ij} = \beta_0 + \beta_W (\text{Age}_{ij} - \overline{\text{Age}}_j) + \beta_B \overline{\text{Age}}_j + (u_{0j} + e_{0ij})$

It is clear that the three slope estimates are very different, and that the slope in the standard model is an un-interpretable amalgam of the slope for group-mean-centred Age and the slope for mean Age. Both terms in the latter model have a clear interpretation. The within or longitudinal effect shows that as a given woman ages by one year her children increase by an average of three grams. The between or cross-sectional effect estimates the difference in average birth weight, in grams, between two women whose average age at birth differs by one year. This coefficient estimates differences in birth weight for women who had children at different periods of their lives (the median age at first child was only 17). A Wald test of the difference in the between- and within- slopes (equivalent to a Hausman test) is highly significant, but that should not mean that we abandon the random-coefficient model for the fixed-effects procedure, as we already have the within longitudinal effect corrected for level-2 endogeneity; rather, we pay attention to the substantive interpretation of the two processes at work.

Measurement error of the cluster means

The inclusion of the level-2 cluster means in the Mundlak formulation is of course based on the assumption that this is the true mean, whereas in practice it is a sample-based value which may be based on relatively few observations. Ignoring this results in a measurement-error problem in which the contextual effect will be attenuated and the unexplained level-2 variance inflated, and there is the possibility of biased estimates for other cluster-level variables. The within-slope estimate remains unbiased, as does the level-1 between-occasion variance. We have resolved one form of level-2 endogeneity only to be confronted by another. Grilli and Rampichini (2011) show how it is possible to correct for this measurement error post-estimation by taking account of the reliability of the sample group mean.102 They estimate this from another multilevel model where the outcome is the predictor of interest, $x_{1ij}$. Thus

101 Neuhaus, J M and Kalbfleisch, J M (1998) Between- and within-cluster covariate effects in the analysis of clustered data, Biometrics 54,
102 Grilli, L and Rampichini, C (2011) The role of sample cluster means in multilevel models: a view on endogeneity and measurement errors issues, to appear

$x_{1ij} = \gamma_0 + (v_j + \varepsilon_{ij}); \qquad v_j \sim N(0, \sigma^2_v), \; \varepsilon_{ij} \sim N(0, \sigma^2_\varepsilon)$    (49)

The reliability of the group means in a balanced model is then

$\lambda = \dfrac{\sigma^2_v}{\sigma^2_v + \sigma^2_\varepsilon / n}$    (50)

so that in a typical longitudinal setting where the level-2 variance equals the level-1 variance and n equals 5 occasions, the reliability would be 0.83, whereas if there were only 2 occasions the reliability would fall to 0.67. Their method is to inflate the estimated attenuated contextual effect by dividing by this reliability:

$\tilde{\beta}_C = \hat{\beta}_C / \lambda$    (51)

When there are unbalanced data they recommend computing the reliability for each cluster and averaging it. A modified procedure is needed when the level-1 units are a finite sample of the higher-level units (there are only so many children in the class), for as the sampling fraction (n/N) increases, so the reliability improves. They suggest that you need more than 30 clusters for this approach to be effective. It is probably best to regard this very easy-to-use procedure as a warning device, alerting us to potential attenuation in a particular analysis. We can then, if the problem is found to be severe, adopt the more complex modelling procedure developed by Shin and Raudenbush (2010). They tackle this issue by treating the group mean as a latent variable. In its simplest form this involves replacing the group mean by the shrunken level-2 residual from the above variance-components model (equation 49), where $x_{1ij}$ is the response. They go on to develop a more flexible formulation in a multivariate multilevel framework that additionally deals with the situation where the sample group mean is based in part on missing data.103

103 Shin, Y and Raudenbush, S W (2010) A latent cluster-mean approach to the contextual effects model with missing data, Journal of Educational and Behavioural Statistics 35(1)
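A minimal helper for this reliability correction, a direct transcription of equations (49) to (51) rather than code from the paper:

def group_mean_reliability(var_level2, var_level1, n):
    # Reliability of a sample group mean in a balanced design (equation 50)
    return var_level2 / (var_level2 + var_level1 / n)

def corrected_contextual(beta_contextual, reliability):
    # Inflate an attenuated contextual effect (equation 51)
    return beta_contextual / reliability

# Worked check against the text: equal variances, 5 versus 2 occasions
print(group_mean_reliability(1.0, 1.0, 5))   # 0.833...
print(group_mean_reliability(1.0, 1.0, 2))   # 0.667...
print(corrected_contextual(0.30, 0.833))     # 0.30 is an illustrative attenuated estimate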

Re-analysing the Madeira growth data

To compare a full range of models we fitted 4 different specifications:

1. a fixed-effects linear growth model, fitted by OLS with 299 dummies; individual Age is grand-mean centred;

2. a random-intercepts two-level model fitted by RIGLS; individual Age is grand-mean centred; this is the random-effects equivalent of 1;

3. a random-intercepts two-level model fitted by RIGLS with the contextual formulation; group-mean Age is included as a grand-mean-centred variable and individual Age is also included as a grand-mean-centred variable;

4. a random-intercepts two-level model fitted by RIGLS with the within- and between- formulation; group-mean Age is included as a grand-mean-centred variable, but individual Age is included as a group-mean-centred variable.

The grand-mean centering of the group Age variable is adopted to get more interpretable intercepts and to allow more direct comparison between models. Here are the specifications in MLwiN:

Model 1 is a fixed-effects analysis with Child 1 as the reference category; to achieve this, Child must be toggled to categorical.104

Model 2: equivalent to model 1 but a random-intercepts formulation with grand-mean-centred individual Age

104 It is possible in MLwiN to fit all 300 dummies in the fixed part and to include a level-1 differential based on the overall Constant.

Model 3: the contextual formulation; group-mean Age is included as a grand-mean-centred variable and individual Age is also included as a grand-mean-centred variable

Data manipulation on main menu
Multilevel data manipulations
  Operation: Average
  On blocks defined by: Child
  Input columns: Age
  Output columns: a free column; here c28 is chosen
Add to action list
Execute

Naming c28 as ChildAvAge, add it to the model and then centre it around the grand mean to aid the interpretation of the overall intercept.

Model 4: the within- and between- formulation with group-mean centering of Age, achieved by modifying the term for individual Age and choosing group-mean centering on the Child index variable

to get the following specification.

The results of the four models are shown in Table 10 (cells left blank were not preserved in the transcription).

Table 10 Estimates of alternative specifications

                     1: Fixed  S.E.   2: Random  S.E.   3: Contextual  S.E.   4: within-between  S.E.
Fixed part
  Constant
  (Age-gm)             0.63            0.63              0.63
  (Age-(Child))                                                                 0.63
  (ChildAvAge-gm)                                        0.36                   0.99
Random part
  Level 2
  Level 1
Deviance

Comparing the fixed estimate of the individual Age term in models 1 and 2, the slope is exactly the same, 0.63, whether the fixed or the standard random-effects model is used. There is therefore no evidence of cluster-level endogeneity. When the contextual model 3 is fitted with Child average Age, the within-child Age effect of 0.63 is exactly the same as in the standard multilevel model, because the contextual effect of 0.36 is not significant, as a Wald test shows.

cpro

This test is exactly equivalent to a Hausman test; again there is no evidence of endogeneity. This is also confirmed by the more reliable LRT test of the change in deviance between models 2 and 3:

calc b1 =
cpro b1

Finally, we can test the difference between the sizes of the between and within effects of Age; again the difference of 0.358, the size of the contextual effect,

cpro b1

is far from significantly different from zero at conventional levels.

Overall there is no evidence of endogeneity, and the standard multilevel specification of the fixed part that we have been using until this section is supported. You may like to reflect on why this is the case. If you undertake a variance-components model with individual Age as the response, the answer is clear. There is no substantive variance between children: all the variance around the overall average age of 11.4 is between occasions. The design of the study is such that we have followed a single cohort, aged 7 at the start, for 8 years, so that we only have data to detect individual change; the baseline cross-sectional effect in terms of age is roughly the same for all children. In the next chapter we will see how it is possible to analyse a design that does allow us to separate these elements (indeed they are the focus of the study), for the real Madeira Growth Study allows us to look at cohort change.

What we have learnt

The random-coefficient model is a highly flexible procedure for modelling growth and development. It allows for heterogeneity and serial dependence in a parsimonious fashion.

The conditional formulation is more fundamental than the marginal, as it is possible to estimate the former to give the latter; in the discrete-response case, different answers can be obtained, but these relate to the different questions that are being asked. MLwiN, through the customised predictions window, can provide both; they will not differ a great deal unless the higher-level variance is substantial.

Subject-level confounding can be tackled through the Mundlak specification by including subject means of time-varying covariates, with the considerable advantage that the effects of between-subject covariates can still be estimated.

Other forms of endogeneity need to be tackled by instrumental variables, although finding good instruments (variables that have no direct effect on the response and account for a substantial proportion of the variation in the predictors) is always difficult in

substantive work. That is why in epidemiological research there has been recent interest in Mendelian randomization.105

The Hausman test does not provide a reason to prefer fixed effects over random effects, but really assesses the extent to which cross-sectional, between estimates differ from longitudinal, within ones.

When sample group means are used, the contextual effect can be attenuated if the sample size within a cluster is relatively small.

105 Davey Smith, G and Ebrahim, S (2005) What can Mendelian randomisation tell us about modifiable behavioural and environmental exposures? British Medical Journal 330: ; Didelez, V and Sheehan, N (2007) Mendelian randomization as an instrumental variable approach to causal inference, Statistical Methods in Medical Research 16:

Answers to Questions

Question 1: obtain a plot of Endurance against Age separately for men and women (hint: use Col codes on the graph); what do you find?

Both sexes show some general increase over time, but this is more marked for the Boys than the Girls; both sexes show substantial between-child differences at all ages.

Question 2: use the Tabulate command to examine the means for Endurance for the cross-tabulation of Sex and Rurality; what do you find?

[Tabulated output: variable tabulated is Endur; columns are levels of Sex (Boy, Girl, TOTALS); rows are levels of Rurality; each cell reports N, MEANS and SD's]

Sex differences in the means are greater than Rurality differences, but the Rurality differences are greater for Boys than for Girls; rural Boys have the highest mean.

Question 3: are there any distinctive patterns of missingness in terms of Rurality and Occasion?

Command: TABUlate 14 'Missing' 'Rurality'

[Tabulated output: columns are levels of Missing (Not, Yes, TOTALS); rows are levels of Rurality (Urban, Rural); each cell reports N, ROW % and CHI, where CHI is the signed square root of the chi-squared contribution; the overall chi-squared and degrees of freedom follow]

The Urban and Rural areas both have missingness of around 5 per cent, and there is not a significant difference between them, as shown by the low chi-square value.

TABUlate 14 'Missing' 'Occasion'

[Tabulated output: columns are levels of Missing (Not, Yes, TOTALS); rows are levels of Occasion (Occ1r to Occ5r); each cell reports N, ROW % and CHI; the overall chi-squared and degrees of freedom follow]

There is a significant difference between occasions, but this is largely driven by the lack of missingness at occasion 1; indeed, occasion 5 has lower missingness than occasions 2 to 4.

Question 4: why does Age-gm have an ij subscript and not a j subscript?

It is an occasion-varying variable.

Question 5: repeat the above procedure to see if a cubic term is necessary

This requires modifying the Age term and choosing a polynomial of degree 3. The results of the converged model show that the coefficient of the cubic term is not large in comparison to its standard error (0.006 compared to 0.004). A Wald test (Intervals and Tests window) returned a chi-square value on 1 degree of freedom with a p value that was not significant at conventional levels. In the interests of parsimony, we decided not to keep the term nor store the results of this model.

Question 6: why does Girl have a j subscript and not an ij subscript?

It is measured only at the child level; it is a time-invariant variable.

Question 7: repeat the above procedure, building on Model 5, to see if a Rural main effect is needed and if there is an interaction between Rurality and Age; choose urban as the base

The question ascertains whether the Rurality effect on Endurance changes with Age. The main-effect model 6 with added Rurality is

Children from rural areas have a higher endurance: at the average age the difference is 2.03 hundreds of metres. This is significant at conventional levels, and it is smaller than the Boy-Girl difference (-3.63) at this age. Note that the Rural dummy has a j subscript, reflecting a time-invariant variable; Rurality was measured at occasion 1.

Model 7 includes the added first-order interaction between Rurality and the second polynomial of Age. The two new terms are not large in relation to their standard errors, so there is no strong evidence that Rurality has a differential effect as children age. The joint Wald test with 2 degrees of freedom returns a p value of 0.572, while a likelihood-ratio test of the two models returns a difference in the deviances

which again is not significant, with the 2 degrees of freedom consumed by the more complex second model. Again there is no evidence of the Rural effect changing with Age. The size of the effects is again readily appreciated through the customised predictions. The wish list now consists of 2 Sexes by 2 Rural/Urban groups by 11 Age groups (7 to 17); here is an extract of the predictions.

To get a plot of all the effects, the additional requirement is to use the Rurality predictions to Trellis in the Y dimension so that the graphs are in different columns.

Given the non-importance of the Age-by-Rurality interactions, we remove them from the model in the interests of parsimony.

Question 8: what would happen to the standard errors of the fixed part if it was assumed there was no dependence?

The current model is a multilevel random-intercepts model; a model assuming independence requires that the level-2 random part is removed. Here is the revised model (which can be estimated by OLS) and a comparison of the estimates (the numerical values were not preserved in the transcription).

                               ML Estimate   ML S.E.   OLS Estimate   OLS S.E.
Fixed Part
  Constant
  (Age-gm)^1
  (Age-gm)^2
  Girl
  (Age-gm)^1.Girl
  (Age-gm)^2.Girl
  Rural
  Girl.Rural
Random Part
  Level: Child      Constant/Constant
  Level: Occasion   Constant/Constant
-2*loglikelihood:
Ratio of OLS to ML SE

The standard errors of the time-invariant estimates are deflated, that is, they are spuriously precise, when the independence assumption is made: thus the SE associated with Girl is smaller under OLS than under the multilevel model. In contrast, the SEs of the time-varying parameters are too imprecise when the independence assumption is made. Also notice, according to the deviance, that the OLS model is a very substantially worse fit. There is very strong evidence of the need to model the dependency.

Question 9: make a plot of the modelled growth trajectories for Urban Boys using the above random-slopes model; what does it show?

First make the predictions for the base category of male urban children, including the random intercepts and slopes at level 2; do not include the level-1 random terms. To obtain a plot of the modelled growth trajectories:

Graphs on Main menu
Customised graphs
  Choose c50 for Y on the 'plot what?' tab
  Choose Age for X
  Choose Plot type to be a Line
  Choose Group to be Child
Apply

The scale of the between-child heterogeneity is again apparent. Visually there is some evidence of increasing between-child heterogeneity as the children develop, but in truth there is not a great deal of departure from the parallel lines of the random-intercepts assumption that the variance does not change with Age.

Question 10: what do you conclude from these results; has the monitoring chain been run for a sufficiently long length; are there differences between the results obtained from the two types of estimation?

The Effective Sample Size for the most dependent chain is equivalent to 470 independent draws, which suggests that the MCMC has been run sufficiently long. To check, we can examine the trajectory and diagnostics for this parameter.

There is no evidence of trending, so it looks as if this 50k monitoring run is sufficient even for this parameter, which has a great deal of imprecision (the mean of the estimate is about the same size as its standard error). A detailed examination of the two different sets of estimates finds that the results of both models are exceedingly similar and we would not reach different conclusions. The deviance from the RIGLS model cannot be compared with the DIC of the MCMC model.

Question 11: what do you conclude from these results; has the monitoring chain been run for a sufficiently long length; are there differences between the results obtained from the two types of estimation? Is the three-level model an improvement on the two-level model?

The Effective Sample Size for the most dependent chain is equivalent to 471 independent draws, which suggests that the MCMC has been run sufficiently long. Despite there being only some 30 schools, the ESS for the between-school variance is large. If we examine the trajectory and diagnostics for this parameter, we see a markedly skewed distribution, so that while the mode is 1.196 and the lower 2.5% quantile is 0.467, the upper 97.5% quantile is considerably larger. There does look to be evidence of a school effect, but the evidence is not overwhelming, as there is some support in the smoothed histogram for the value being zero. A detailed examination of the two different sets of estimates finds that the results of both models are very similar and we would not reach different conclusions. We saw earlier that the deviance from the RIGLS three-level model was significantly lower than that of the two-level model, suggesting genuine between-school differences. If we compare the DICs of the two models, there is some evidence that there are differences between schools, but such a difference does not bring overwhelming evidential support.

Question 12: what are the (approximate) characteristics of these orthogonal polynomials?

They have approximately a mean of zero, the same standard deviation, and a correlation of zero (meaning that the relative sizes of the effects can be compared); the departure from zero is due to imbalance in the data structure and rounding error.

Question 13: what are the differences between the fixed estimates and their standard errors in the two models? Is the more complicated unstructured model a significantly better fit to the data?

There are no differences of substance between the two models in either their fixed-part estimates or their associated standard errors. The difference in the deviance is

calc b1 =

The difference in the degrees of freedom is due solely to the nature of the random part: there are 2 estimated parameters in the CS specification and 15 estimable parameters in the UN specification, a difference of 13 more terms.

cpro b1 13

This, even when halved, does not provide convincing support that we need the considerably more complex unstructured model.

Question 14: is the Toeplitz model a significantly better fit than the compound-symmetry model? What are the implications of this result?

The difference in the deviance is

calc b1 =

while the difference in the degrees of freedom is due solely to the nature of the random part: there are 2 estimated parameters in the CS specification and 5 estimable parameters (the number of occasions) in the Toeplitz specification, a difference of 3 more terms.

cpro b1 3

With such a high p value, even when halved, we should in the interests of parsimony prefer the simpler model. There is little evidence that a more complicated dependency structure than compound symmetry is needed.
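For readers working outside MLwiN, the same deviance comparisons can be sketched in Python; the deviance values below are illustrative stand-ins for the lost output:

from scipy.stats import chi2

def lrt_pvalue(deviance_simple, deviance_complex, extra_params):
    # p value for a likelihood-ratio test of nested covariance structures
    diff = deviance_simple - deviance_complex
    return chi2.sf(diff, df=extra_params)

# CS (2 random parameters) versus UN (15): 13 extra terms
print(lrt_pvalue(5000.0, 4985.0, 13))   # illustrative deviances
# CS versus Toeplitz (5 parameters): 3 extra terms
print(lrt_pvalue(5000.0, 4998.0, 3))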

Question 15: have the models been run for long enough? Do the estimates for pD make sense? Which is the preferred model if the DIC criterion is used?

The ESS of the parameters for all the models suggests that 50k is a sufficient monitoring length, and an examination of some individual trajectories suggests that there is no problematic trending. The pD values have good face validity, with the homogeneous compound symmetry model consuming the fewest degrees of freedom and the unstructured model the most. Moreover, each of the homogeneous versions of the models is correctly identified as having the simpler form, with fewer degrees of freedom consumed in the fit. The best model in terms of its DIC (the ability to predict a replicate dataset which has the same structure as that currently observed) is the homogeneous compound symmetry model. Both forms of the autoregressive model are substantially worse fits. There is little to choose between the homogeneous and heterogeneous compound symmetry models. Overall there is no evidence to prefer any model other than the simpler homogeneous compound symmetry.

Question 16: make a plot of the predictions from the fixed part estimates of the results for boys and girls in urban and rural areas at different stages of development. What do the results show?

Predict the fixed part into c50 (note that standard errors are not implemented in MCMC models). Then make a composite indicator of Sex and Rurality of the correct length:

CALCulate c51 = 'Girl.12345' + (2* 'Rural.12345')
TABUlate 0 'c51'

Toggle categorical for c51 and edit the labels to be 0: Urban boy; 1: Urban girl; 2: Rural boy; 3: Rural girl. Make a new variable for chronological age (this also has to be of the correct length):

CALCulate c52 = '(Age1-gm).12345'

Use customised plots to plot the graph of the predictions against age for the four groups.

Boys have greater endurance than girls over the whole age range, and the difference is greatest after the age of eleven as the girls' development begins to level off while the boys continue to improve. Rural children have the greatest endurance and, while this is true for both boys and girls, the greater Rurality difference is for boys. The greatest endurance is found among rural boys.
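The pD and DIC values on which Question 15 turns are computed by MLwiN during the MCMC run. As a minimal sketch of the calculation (with a randomly generated stand-in deviance chain and a hypothetical deviance at the posterior mean):

import numpy as np

deviance_chain = np.random.default_rng(2).normal(4000, 10, size=50_000)  # hypothetical
d_bar = deviance_chain.mean()  # Dbar, the posterior mean deviance
d_at_mean = 3990.0             # D(theta-bar), the deviance at the posterior mean estimates
p_d = d_bar - d_at_mean        # pD, the effective degrees of freedom consumed
dic = d_bar + p_d              # DIC = Dbar + pD, equivalently D(theta-bar) + 2pD
print(round(p_d, 1), round(dic, 1))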

15. Modelling longitudinal and cross-sectional effects

Introduction

This chapter is an extension of the last in that it extends longitudinal analysis to deal simultaneously with age and cohort effects. The argument is made by analysing two case studies. The first is an extension of the Madeira Growth Study, whereby its accelerated design allows us to estimate cohort effects over time in addition to variation within and between children. The second considers changing gender ideology in the UK using the British Household Panel Survey. This is a repeated measures design and we will simultaneously model longitudinal and cross-sectional effects. A short section at the end will consider recent developments in the analysis of age, cohort and period. We begin by defining what we mean by age, period and cohort and consider the capacity of different designs to isolate empirically these different conceptual effects.

Age, cohort and period in the Madeira Growth Study 106

In studying change and development in human capability, it is possible to recognize three separate components:

Age: this is the capability associated with how old the individual is, and how capacity develops as individuals mature; age effects are internal to individuals;

Cohort: this is a group who share the same events within the same time interval; cohort effects are associated with changes across groups of individuals who experience common events. Cohort effects arise from the systematic differences in the study population that are associated with age at enrolment;

Period: this is the specific time at which the level of achievement or performance is measured; period effects are external to the individuals and represent variation over time periods regardless of age and of cohort.

The focus here is on birth cohorts who were born in the 1980s and who have lived through Madeira's economic transformation. The case for the centrality of cohorts for social change was made by Ryder (1965). 107 A birth cohort moves through life together and encounters the same historical and social events at the same ages; in this way they are changed and society is also changed via replacement.

106 This section was written with Duarte Freitas, University of Madeira.
107 Ryder, N B (1965) The cohort as a concept in the study of social change, American Sociological Review 30.

Alternative designs for studying change

While growth researchers emphasize the maturation effects of aging, the observed changes may also be a result of the child's year of birth (the cohort) and the actual year the observation was made (the period). The three terms are of course linked by the identity Age = Period - Cohort, so that a child aged 12 will have been born in 1980 if they have been measured in 1992. The ability to disentangle the effects of these different time scales depends on the research design that is used.

A cross-sectional study only provides data for just one period (everybody is measured at the same time), so that it is impossible to estimate the effect of period, which is held constant. Moreover, at a fixed point in time, age and birth cohort differences are confounded; it is not empirically possible to disentangle age changes and cohort variations with a cross-sectional design. If we were to find in a cross-sectional study that older children have greater aerobic performance, we cannot tell whether this is due to the maturation effect associated with age, or a cohort effect such that this older group, born into an earlier cohort, has always had a better performance.

A longitudinal design is one in which children are measured on more than one occasion. Consequently, the growth rate is directly observed for each child. But a pure longitudinal study, as in the previous chapter, that follows a particular cohort is also limited. The cohort has been kept constant by sampling, so any change could be explained by either age or period. Moreover, in a single cohort, age and cohort are again confounded, so that the improvement in aerobic performance may not be the natural individual process of maturation, but specific to this cohort as they collectively age. Consequently we need a longitudinal study that follows multiple cohorts.

The most efficient approach (Cook and Ware, 1983) is an accelerated longitudinal design (also known as a mixed design or cohort sequential design). 108 Distinct, sequential age cohorts are sampled and longitudinal data are collected on members of each cohort. It is accelerated as it allows the study of development over a long age span in a short duration. This design usually employs an overlap between the cohorts, so that for each cohort there is at least one age at which one of the other cohorts is also observed. This overlap allows the splicing together of one overall trajectory of development from the growth segments obtained for each cohort. The design has a number of important advantages. As it is of short duration, there is a shorter wait for findings, and less opportunity for sample attrition to accumulate. There is also less time for the measurement team to be in the field, which reduces costs. Moreover,

108 Cook, N R and Ware, J H (1983) Design and analysis methods for longitudinal research, Annual Review of Public Health 4.

as pointed out by Schaie (1965), age, period and cohort are potentially un-confounded in this design, so that we can compare the development of, say, 12 to 14 year olds in more than one cohort. 109 The downside of an accelerated design is that the data are sparser at the earliest and latest ages where there is no overlap. The design is also problematic when there is substantial migration, as this process can erroneously produce cohort effects. This is not a problem, however, if attrition is kept low.

The raison d'être of a longitudinal study is to characterise within-individual changes in the outcome over time. The primary aim of the accelerated design has been to do this efficiently. Bell (1953) recognized that the piecing together of cohort-specific growth curves is only legitimate if there is convergence, that is, no significant differences among cohorts in the age-specific means where the cohorts overlap. 110 For him and many others, cohort effects are a threat. Consequently there are tests to guard against debilitating cohort effects (Miyazaki and Raudenbush, 2000; Fitzmaurice et al, 2004), and comparisons to see if an accelerated design can recover the trajectories that would have been produced by a proper long-duration study with a single age cohort (Duncan et al, 1996). 111 The perspective adopted here is radically different in that the accelerated design is seen as an opportunity to assess the size and nature of social change through studying overlapping cohorts. Although it may be conceptually possible to separate all three elements in an accelerated design, we focus on age and cohort only here and ignore period completely. That is, we follow Guire and Kowalski (1979) who argue that although stability over time is usually not true for sociological studies, it is often true for studies of physical growth, and that the short-term time effect can safely be assumed to be zero. 112

Table 11: The accelerated longitudinal design structure of the Madeira Growth Study (columns: Cohort; Year born; School Grade when measured in each year of the study)

109 Schaie, K W (1965) A general model for the study of developmental problems, Psychological Bulletin, 64.
110 Bell, R Q (1953) Convergence: an accelerated longitudinal approach, Child Development 24.
111 Miyazaki, Y and Raudenbush, S W (2000) Tests for linkage of multiple cohorts in an accelerated longitudinal design, Psychological Methods 5: 44-63; Fitzmaurice, G M, Laird, N M and Ware, J H (2004) Applied Longitudinal Analysis, Wiley, New Jersey; Duncan, S C, Duncan, T E and Hops, H (1996) Analysis of longitudinal data within accelerated longitudinal designs, Psychological Methods, 1.
112 Guire, K E and Kowalski, C I (1979) Mathematical description and representation of developmental change functions on the intra- and inter-individual levels, in Nesselroade, J R and Baltes, B P (eds) Longitudinal Research in the Study of Behavior and Development, New York: Academic.

The Accelerated Longitudinal design of the Madeira Growth Study

The Madeira Growth Study is an accelerated longitudinal design (Table 11) in which five overlapping age cohorts (participants aged 8, 10, 12, 14, and 16 years at the starting school grade) were observed concurrently for 3 years, thus providing information spanning 10 years of development from a study lasting just three. A stratified sampling procedure was used to ensure the representativeness of the subjects. At the first stage, 29 state-run schools were selected taking into account the geographical area, the school grade and sport facilities. At the second stage, a total of 507 students were recruited according to the proportion of the population by age and sex in each of Madeira's eleven districts. Complete records were obtained for 498 children, so that the dropout is only 9 children, or less than 2 per cent; these 9 had completed only the first aerobic test and had not answered the social questions, and they were simply excluded from the study. The MGS is naturally hierarchical with measurements on up to 3 occasions at level 1 on 498 children at level 2.

Specifying and estimating cohort effects

Differential cohort effects can be handled by a three-level model in which occasions are nested in individuals who are in turn nested in birth cohorts. The combined model in its random-intercepts form is

$y_{ijk} = \beta_0 x_{0ijk} + \beta_1 x_{1ijk} + \beta_2 x_{2jk} + \beta_3 x_{3k} + (v_{0k} + u_{0jk} + e_{0ijk})$   (1)

where the dependent variable $y_{ijk}$ is the distance covered on occasion i by child j of cohort k. There are three predictors: Age ($x_{1ijk}$), which is time-varying at the occasion level; Gender ($x_{2jk}$), which is a child-level variable; and the Cohort number ($x_{3k}$), centred around 1984 so that 1980 is -2, 1982 is -1, 1984 is 0, 1986 is 1, and 1988 is 2. Once again the $\beta$'s are the averages: $\beta_0$ is the mean distance achieved by a boy of average age born in 1984, and $\beta_3$ is the linear change in distance achieved for a cohort that is two years later. The random part has three elements: $v_{0k}$, which is the unexplained differential achievement for a cohort around the linear trend; $u_{0jk}$, which is the unexplained differential achievement for a child given their age, gender and cohort; and $e_{0ijk}$, which is the unexplained occasion-specific differential given the child's age, gender, cohort and differential performance. The distributional assumptions complete the model:

$v_{0k} \sim N(0, \sigma^2_{v0}); \quad u_{0jk} \sim N(0, \sigma^2_{u0}); \quad e_{0ijk} \sim N(0, \sigma^2_{e0})$   (2)

These terms give the residual variance between cohorts, between children and between occasions. If we want to see how sub-groups of children have changed differentially through time we can include terms in the fixed part of the model to represent different groupings of children and then form a cross-level interaction with the linear cohort variable.
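For readers who wish to approximate model (1) outside MLwiN, a minimal sketch in Python with statsmodels is given below. The file name and the column names (distance, age_c, girl, cohort_c, cohort, child) are hypothetical stand-ins for the MGS worksheet, and this is REML estimation rather than the MCMC used later:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mgs.csv")  # hypothetical file holding the occasion-level MGS data
model = smf.mixedlm(
    "distance ~ age_c + girl + cohort_c",
    data=df,
    groups="cohort",                       # level 3: random intercepts for cohorts
    vc_formula={"child": "0 + C(child)"},  # level 2: children nested within cohorts
)
fit = model.fit(reml=True)
print(fit.summary())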

The conventional wisdom is that when there are only a small number of higher-level units, they are more appropriately specified as fixed-effects dummies. However, this is contradicted by the comparative study of Yang and Land (2008) that assessed the extent to which the values of an Age-Period-Cohort model could be recovered. 113 They found unequivocally that random effects achieved better results even when the number of units was small and comparable to the present study (5 cohorts). The Yang and Land study was based on restricted maximum likelihood (RIGLS) estimation, which would give empirical Bayes estimates of the random effects. These estimates can be improved upon by adopting a Bayesian approach using Markov chain Monte Carlo procedures (Browne and Draper, 2006). 114 There are two aspects to this. First, the full Bayesian analysis better accounts for uncertainty in that inference about every parameter fully takes into account the uncertainty associated with all other parameters. Second, when there are few higher-level units the sampling or posterior distribution of the variance of the random effects is likely to be positively skewed, as negative estimates of the variance are not possible. Both these problems are at their worst when there are a small number of units at a level and there is substantial imbalance. While the latter is not the case in the present design, the former certainly is. This has been recognized in the APC literature and Yang (2006) evaluated the performance of REML empirical Bayes estimates against fully Bayesian ones, finding that the latter provide a more secure base for inference. 115

The Bayesian approach brings its own difficulties, notably ensuring that the estimates have converged to a distribution, and that the required prior distributions have to be specified before estimation. Common practice is now to specify diffuse priors that impart little influence to the estimates and allow the observed data to be the overwhelming determinant of the results. Yang (2006) experimented with a number of alternative diffuse priors and found that the APC estimates were fortunately largely insensitive to the particular choice that was made. In practice we are going to estimate the models with the MLwiN software, which provides REML estimates and Bayesian estimates using MCMC with default diffuse priors. The Bayesian approach is highly computer intensive and we will use the REML estimates as good starting points for the MCMC simulation. We ran the MCMC procedure for a burn-in of 500 simulations, to get away from the REML estimates, and then for a subsequent initial set of monitoring draws. At the end of this monitoring period each and every estimate was checked for convergence, which is characterised by white noise. The existence of a trend would mean that the sampler has not reached its equilibrium position and a longer burn-in would be required.

113 Yang, Y and Land, K C (2008) Age period cohort analysis of repeated cross-section surveys: fixed or random effects? Sociological Methods and Research, 36.
114 Browne, W J and Draper, D (2006) A comparison of Bayesian and likelihood-based methods for fitting multilevel models, Bayesian Analysis 1.
115 Yang, Y (2006) Bayesian inference for hierarchical age-period-cohort models of repeated cross-section survey data, Sociological Methodology 36: 39-74.

The monitored estimates were also assessed for information content, and further monitoring simulations were undertaken until the effective sample size of the Markov draws was equivalent to 500 independent draws. Once sufficient draws had been made, the estimates were summarised in terms of their mean and their 2.5% lowest and highest values to give the Bayesian 95% credible intervals. A sensitivity analysis was undertaken using a number of diffuse priors but this made little difference.

The overall approach to model development is based on fitting models of increasing complexity and assessing whether the more parsimonious form should be retained. As the models are estimated by Bayesian procedures we have used the Deviance Information Criterion (Spiegelhalter et al, 2002). 116 This is a goodness-of-fit measure penalized for model complexity. As such it is a generalization of the Akaike Information Criterion but, unlike that measure, the degree of complexity is estimated during the fitting process. Lower values of the DIC suggest a better, more parsimonious model. Any reduction in the DIC is an improvement but, following experience with the AIC, differences greater than 4 suggest that the model with the higher DIC has considerably less support.

Table 12: Results for MGS: age and cohort (columns: estimates and SEs for Models One to Four, with the 95% credible intervals and ESS for Model Four; rows: the fixed-part terms Cons, (Age-13), Girl and their interactions, and the Cohort terms (Cohort-2)*Girl, (Cohort-2)*(Age-13) and (Cohort-2)*Girl*(Age-13); the random-part variances between cohorts, between children and between occasions; and the DIC with the difference in DIC)

Modelling age, sex and cohort

This section reports the results of a series of models involving age, sex and cohort. The results are shown in Table 12. Model 1 is a two-level growth model where the distance covered in the running test is modelled as a function of quadratic chronological age, with an interaction with gender. As it is a two-level model, the data have effectively been pooled across cohorts. The shape of the growth curve is most readily appreciated in Figure 12, which shows the curves with 95% credible intervals.

116 Spiegelhalter, D J, Best, N G, Carlin, B P and van der Linde, A (2002) Bayesian measures of model complexity and fit, Journal of the Royal Statistical Society, Series B 64.

Clearly, at younger ages there is little difference between the sexes and, while both boys and girls develop with age, the rate of change is greater for boys. The curve is convex for girls and there is a flattening of growth by the age of 18. The general shape of the trajectory is consistent with past research on the association between age and performance, as are the differentials by sex. The random part of this model consists of a between-child random-intercept variance and a within-child between-occasion variance. The residual correlation between occasions in this compound symmetry model is equal to 0.50, so that there is quite a lot of dependency across time; children with a relatively high performance on one occasion tend to have a relatively high performance on other occasions. In fact this is the most complicated two-level model that is supported by the data when DIC values are compared. There is no evidence of a cubic term, nor that a random-slopes model is needed. Children differ in their aerobic performance but there is no evidence that this variance increases or decreases with age. Moreover, a model with the more complex unstructured covariance between children was not an improvement on this compound symmetry model. Consequently, there is no evidence that the residual dependency over time is changing with occasion.

The rest of the estimates in Table 12 are for a three-level model where random intercepts are additionally included at the cohort level in the manner of equation (1). Model 2 simply includes the random intercepts for cohorts, Model 3 additionally includes a linear cohort term, while Model 4 includes the full three-way interaction of linear age by sex by cohort. This latter model allows for the possibility that the cohort effects are differential for boys and girls and for different ages. Comparing the DIC of each of the three-level models to that of the two-level model shows that the more complex models are an improvement over a model with no cohort effects. Moreover, the model with the lowest DIC is Model 4, in which the cohort effects are differentiated by age and sex. The table also gives the 95% credible intervals for this model and the effective degrees of freedom (an MCMC simulation of 50k draws was used). It is noticeable that the 95% credible intervals for each of the terms involving cohort do not include zero, suggesting that each term is

required. Even with less than a decade separating the earliest and latest birth cohorts there is evidence of a difference.

Model 4, with its complex interaction terms, is most readily appreciated graphically by plotting the estimated growth curves for each cohort separately for boys and girls. The plots for girls show convergence and there is no evidence whatsoever of cohort effects, as each segment of the growth curve overlaps the next. However, for boys there is evidence of cohort differences and these are bigger for the earlier cohorts. The nature of the change can be seen by plotting the predicted mean performance and 95% intervals for boys and girls aged 14 using MLwiN's customised predictions facility. The decline in aerobic performance with later cohorts for the boys is marked, and this contrasts with the lack of change for girls.

Using an appropriate multilevel analysis we have been able to ascertain the intra-cohort development in aerobic performance and evaluate inter-cohort changes. The research design facilitated simultaneous estimation of age effects, cohort variations, and age-by-cohort interaction effects on aerobic performance. Both sexes show increased capacity as they develop and this is most marked for boys, so that on average there are substantial differences between boys and girls at older ages. There is quite strong consistency between measures for the same child on different occasions, so that children with high or low capacity appear to maintain this over time. Moreover, the variance around the average growth does not change with age, so that there is no evidence of convergence or divergence in aerobic capacity across the years of childhood and adolescence. There are, however, marked differences between cohorts, so that boys born less than a decade apart have noticeably poorer aerobic performance. Girls, in contrast, have not experienced this decline and have maintained their generally lower level. For boys the decline is most noticeable in the older age groups. This decline occurred to cohorts that experienced rapid social change, and we can speculate that it was related to a move away from the land, more sedentary forms of living and much greater use of motorized forms of transport.

Modelling a two-level model: the Mundlak formulation

The three-level model we have just fitted is the most appropriate model for examining the nature of cohort change. But it is interesting, in the light of the previous chapter, to fit a two-level model without and with the group means (the Mundlak formulation in its contextual form). The results are shown in Table 13. The first model includes a quadratic polynomial of age for both boys and girls, and the second model additionally includes Child Average Age (grand-mean centred to allow an interpretable intercept) and, in the light of the modelling above, an interaction between the Girl dummy and Child Average Age. The models were estimated by MCMC with 50k monitoring simulations after a burn-in of 500.

Table 13: Comparing models with and without group means for Age (columns: estimates and SEs for the Base model and for the model adding (AveAge-gm) and Girl.(AveAge-gm); rows: the fixed-part terms Cons, (Age-13)^1, (Age-13)^2, Girl, (Age-13)^1.Girl, (Age-13)^2.Girl, (AveAge-gm) and Girl.(AveAge-gm); the random-part variances between children

and occasions; and the DIC)

There are a number of points to note:

The model with the two extra parameters is an improvement in the DIC of around 5, so that there is evidence of a contextual effect; a Hausman test would show significance.

The contextual effect for boys is positive, so as the Child Mean Age goes up (in effect, a cohort that is one year later) the endurance goes up. If we mix types of inference and conduct a Wald test on this parameter, it is highly significant (the p value, given by CPRO, is 0.004). For boys their baseline age, the cohort into which they were born, matters.

The comparable differential slope for girls shows that baseline age matters much less for them, and we can use a Wald test to see if the cross-sectional slope for girls is different from zero (note the two 1's needed to specify the full slope effect for girls). The slope for girls is small, with a clearly insignificant p value; there is no evidence of a contextual effect for girls.

The linear effect of Age for boys reduces, and the linear differential for Girl attenuates, once the group means are included; not including the means would be to commit omitted-variable bias.

We can use the customised predictions to make a plot of the predicted endurance for boys and girls from the first and second models.

Figure 14: The effect of individual Age on endurance, estimated with and without Child Mean Age

There is no difference whatsoever for girls, but the longitudinal relationship with Age for boys in the model without Child Average Age is too steep; it has been biased upwards. We can make sense of these results by correlating Child Average Age with Cohort: the correlation is large and negative, for they are measuring the same thing; children with an older average age in the accelerated design were born into an earlier cohort. Moreover, if we fit a variance components model with child Age as the outcome, we see that the cross-sectional between-children variance in Age is much larger than the between-occasions variance. This is a consequence of the accelerated design: we have followed individuals for only three years but have selected children aged 8 to 16 at baseline to follow.
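A minimal sketch of constructing the Mundlak terms of Table 13 in Python; the file and the column names (child, age, girl) are hypothetical stand-ins for the MGS worksheet:

import pandas as pd

df = pd.read_csv("mgs.csv")  # hypothetical occasion-level file
df["ave_age"] = df.groupby("child")["age"].transform("mean")  # child mean age
df["ave_age_gm"] = df["ave_age"] - df["ave_age"].mean()       # (AveAge-gm), grand-mean centred
df["girl_x_aveage"] = df["girl"] * df["ave_age_gm"]           # Girl.(AveAge-gm)
df["age_within"] = df["age"] - df["ave_age"]                  # the purely longitudinal component

The group mean carries the cross-sectional (here cohort-like) information, while the deviation from it carries the purely within-child, longitudinal information.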

Thus it is necessary for a predictor variable to have a between-children, time-invariant element to estimate cross-sectional effects, but it is not sufficient, as we did not find a cross-sectional effect for girls. Importantly, the Mundlak method allows us to model out the contextual cohort effect and distinguish it from individual longitudinal growth. The model with Child Mean Age included was used to make the set of predictions shown in Figure 15; the left-hand side is the predicted longitudinal within-child growth with the Mean Age held steady at 13, while the right-hand side shows the effect of mean change with individual Age also held steady at 13. The vertical and horizontal axes are set to the same scale on both graphs.

Figure 15: The longitudinal and cohort effects of Age in predicting endurance

The accelerated design has done its job when allied to this type of modelling. We have gathered information efficiently in a short period, we have identified that there is an element of cohort change for boys, so that the younger-aged cohorts have less endurance, and we have also now characterised individual growth across the age span. Compare this with the fixed-effects approach of the previous chapter, where we would not even have been able to examine boy-girl differences, as these are time-invariant! A significant Hausman test does not simply mean we have an endogeneity problem, but gives an opportunity to explore the different processes that are at work.

Changing gender ideology in the UK 117

This study uses data from the British Household Panel Survey, a nationally representative sample of some 5500 households drawn at the start of the survey in 1991, giving close to 10,000 individual respondents. Subsequently, individuals have been traced and re-interviewed each year, generating annual panel data. The BHPS includes an extensive range of questions and every two years a section on beliefs, values and attitudes is incorporated. This contains 8 items which can be used to measure an individual's gender ideology. These items, given in the table below, consist of a statement such as 'A pre-school child is likely to suffer if his or her mother works', to which a respondent may answer strongly agree, agree, neither agree nor disagree, disagree, or strongly disagree.

Gender role attitude items in the BHPS
Item 1: A pre-school child is likely to suffer if his or her mother works
Item 2: All in all, family life suffers when the woman has a full time job
Item 3: A woman and her family would all be happier if she goes out to work
Item 4: Both the husband and wife should contribute to the household income
Item 5: Having a full-time job is the best way for a woman to be an independent person
Item 6: A husband's job is to earn money; a wife's job is to look after the home and family
Item 7: Children need a father to be as closely involved in their upbringing as the mother
Item 8: Employers should make special arrangements to help mothers combine jobs and childcare

In this study, the 5-point Likert scale on which responses are measured is recoded to a 3-point scale and the coding for items 1, 2 and 6 is inverted for consistency. The response alternatives used are Traditional, Neutral or Egalitarian, with high values denoting an egalitarian response. A multilevel item response model 118 of the eight items found that the items gauge gender role attitudes in very different ways and cannot usefully be combined into a single measure. The most discriminating of these items (in the sense of being most strongly related to the underlying latent ideology score) is 'All in all, family life suffers when the woman has a full time job', suggesting that it makes the best single measure. This response was then used in a multilevel model to assess a

117 This section reports work undertaken with Laura Steele.
118 Adams, R J, Wilson, M and Wu, M (1997) Multilevel item response models: an approach to errors in variables regression, Journal of Educational and Behavioral Statistics, 22.

number of questions about change. The BHPS is naturally a multiple cohort study as, at the start of the survey, individuals were aged from 18 to 90 plus. We will also include area effects as we are interested in the geography of the outcome. The model is built in several stages:

1 an unordered multinomial logit to deal with the 3 unordered categories of the response; this is treated as a special case of the multivariate model;
2 repeated measures to deal with responses on different occasions; this allows for dependence over time and the correct modelling of the age or developmental effect;
3 random cohort effects to model the differences between birth cohorts;
4 a cross-classified model, as respondents can relocate so that individuals can be seen as belonging to different local authority areas at different times;
5 steps 1 to 4 form the base or empty model which effectively models the individual, temporal and spatial variation; subsequently, fixed effects for age, cohort and for time-varying and time-invariant variables for individuals and their neighbourhoods are included as main effects and interactions to account for this variation.

Building the base model

Stage 1: Unordered multinomial model

The unordered single-level multinomial model can be written succinctly as follows:

$\log[\pi_i^{(s)} / \pi_i^{(t)}] = \beta_0^{(s)} + \beta_1^{(s)} x_{1i}$   (3)

where i refers to an individual, the response has t categories, $\pi_i^{(s)} = E(y_i^{(s)})$ is the underlying probability of being in category s, $x_{1i}$ is a predictor variable and the $\beta^{(s)}$'s are regression-like parameters linking the predictor to the outcome. The Expectation operator is used to signify 'on average' as we have no stochastic element in this model. One of the categories, signified by t, is taken as a reference category and this plays the same function as the 0 category in a binary outcome model. In effect there is then a set of t-1 equations where, if a logit link is chosen, the log-odds of each of the other categories in comparison to the base category is regressed on predictor variables. With t equal to 3 categories as here, there are 2 equations:

$\log[\pi_i^{(1)} / \pi_i^{(3)}] = \beta_0^{(1)} + \beta_1^{(1)} x_{1i}$   (4)

$\log[\pi_i^{(2)} / \pi_i^{(3)}] = \beta_0^{(2)} + \beta_1^{(2)} x_{1i}$   (5)

Typically, separate intercepts and slopes, and the same predictor variables, are included for each line of the specification. A logit formulation is used to constrain the predicted probabilities to lie between 0 and 1. Each slope parameter is interpreted as the additive effect of a one unit increase in the associated predictor variable on the log-odds of being in category s in comparison to the referent category t. It is convenient, and recommended, that the model is interpreted by converting the estimated logits to probabilities, as the logits can mislead even about the sign of the relation. 119 The predicted values for a non-referent category are given as follows:

$\pi_i^{(s)} = \frac{\exp(\beta_0^{(s)} + \beta_1^{(s)} x_{1i})}{1 + \sum_{r=1}^{t-1} \exp(\beta_0^{(r)} + \beta_1^{(r)} x_{1i})}$   (6)

so that it is clear that all the equations have to be involved in the transformation to probabilities; it is not just the equation involving a single category, and that is why the logits alone can be uninformative. The probability of being in the base category is the remainder, 1 minus the summed probability of all the other categories:

$\pi_i^{(t)} = 1 - \sum_{s=1}^{t-1} \pi_i^{(s)}$   (7)

MLwiN uses the customised prediction facility to transform from log-odds to probabilities, and in random-effects models these can be subject-specific or population-average values. With these models with multiple nominal outcomes it is natural to use a multinomial residual distribution:

$\mathrm{var}(y_i^{(s)}) = \pi_i^{(s)}(1 - \pi_i^{(s)}) / n_i; \quad \mathrm{cov}(y_i^{(s)}, y_i^{(r)}) = -\pi_i^{(s)} \pi_i^{(r)} / n_i \ \text{when} \ s \neq r$   (8)

where, if $y_i$ is a vector of 0/1 responses, the denominator $n_i$ is equal to 1, or, if the response is an observed proportion in each category, $n_i$ is the denominator of the proportion. This is achieved in MLwiN by regarding the set of outcomes at level 1 as nested within individuals at level 2 and, in a similar fashion to the binomial logit, by calculating a multinomial weight based on the predicted probability in each category and constraining the variances of these weights to 1. This is an exact residual multinomial distribution; overdispersion may be permitted when modelling proportions.

119 Retherford, R D and Choe, M K (1993) Statistical Models for Causal Analysis, John Wiley & Sons, New York.
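A minimal sketch of equations (6) and (7) in Python: converting the two estimated logit equations to the three category probabilities for a given predictor value. The coefficients here are hypothetical:

import numpy as np

def multinomial_probs(x, betas):
    # betas holds one (intercept, slope) pair per non-referent category
    exps = np.array([np.exp(b0 + b1 * x) for b0, b1 in betas])
    p_nonref = exps / (1 + exps.sum())              # equation (6)
    return np.append(p_nonref, 1 - p_nonref.sum())  # equation (7): the base category

print(multinomial_probs(0.0, [(-0.1, 0.3), (0.2, -0.4)]))  # Neutral, Egalitarian, Traditional

The probabilities necessarily sum to one, and each involves the coefficients of both equations, which is why a single logit can mislead.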

The modelling begins by specifying that the response is a categorical variable with 3 outcomes and that there is a two-level structure in which level 1 is the choice category and level 2 is a unique identifier for each person at each wave. We choose the base category to be a Traditional viewpoint on gender ideology and therefore model the log-odds of Neutral in comparison to Traditional, and of Egalitarian in comparison to Traditional. The fixed part of the model consists of two parameters associated with the Constant, and these give the average log-odds across all waves and across all individuals. This is a multivariate model, so that there is no level 1 variance, while the level 2 variance is an exact multinomial distribution in which the variance depends on the mean probability of being in each category. The 54 thousand observations, representing some 9 thousand individuals measured over 9 biennial waves from 1991, double to 108 thousand observations when account is taken of the two non-referent responses. The model was estimated initially by IGLS and then by MCMC estimation with an initial burn-in of 500 simulations followed by 5k simulations. The estimates are as follows.

When the customised predictions facility is used to transform the values back to a probability scale we get the following results (the table gives, for each of the categories Traditional, Neutral and Egalitarian, the median probability with its lower and upper 95% limits, and the mean probability with its limits).

As there are no higher-level random effects, the median and the mean (the subject-specific and the population-average results, Chapter 14) give the same estimates. These can be compared with a simple tabulation of the raw percentages in the Traditional, Neutral and Egalitarian categories: the most commonly declared category is the Traditional (by a small margin), while the least common is the Neutral standpoint.

Stage 2: a model with hierarchical structure for waves nested in individuals

The second model sees the repeated measures as being nested within individuals. The general form of the model is

$\log[\pi_{ij}^{(s)} / \pi_{ij}^{(t)}] = \beta_0^{(s)} + \beta_1^{(s)} x_{1ij} + u_j^{(s)}$   (9)

where $u_j^{(s)}$ is an individual-level random effect for each contrasted non-referent category; these are assumed (on the logit scale) to have a mean of zero and a variance of $\sigma^2_{u(s)}$. The individual random effects may be correlated through a covariance term. The two variance terms here represent the between-individual variance in the Egalitarian:Traditional and the Neutral:Traditional log-odds ratios. There are of course missing observations and we are invoking the MAR assumption (Chapter 14), so that the response on gender ideology is presumed not to affect the missingness, and therefore the missingness mechanism does not require explicit modelling. In practice in MLwiN this is modelled as a three-level hierarchical structure with the choice set at level 1 nested within person-waves at level 2 nested within individuals at level 3. Again the fixed part of the model is kept simple, with just two constant terms to specify the averages, but additionally we now have variance terms for individuals for each of the contrasted response outcomes.
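Model (9) also implies that subject-specific (median) and population-average (mean) probabilities can now differ, because the mean averages equation (6) over the distribution of the random effects. A minimal simulation sketch, with hypothetical intercepts and between-individual standard deviations:

import numpy as np

rng = np.random.default_rng(3)
b = np.array([-0.1, -0.3])        # hypothetical Neutral and Egalitarian intercepts (logit scale)
s = np.array([1.2, 1.6])          # hypothetical between-individual standard deviations
u = rng.normal(size=(100_000, 2)) * s
exps = np.exp(b + u)
probs = np.column_stack([exps, np.ones(len(u))]) / (1 + exps.sum(axis=1))[:, None]
print(probs.mean(axis=0))         # population-average Neutral, Egalitarian, Traditional
print(np.median(probs, axis=0))   # subject-specific (median) values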

The monitoring phase of the MCMC simulation was increased to 15k. On completion of the monitoring period the following results were obtained.

The between-individual variances on the logit scale are very large, particularly for the Egalitarian response. This means that in this unconditional model there is a great deal of similarity within a person over waves; individuals are not changing their category but are staying with their preference, and this is particularly the case for the Egalitarian response.

We can again use the customised predictions window, and we will choose both the subject-specific medians and the population-average means with their 95% confidence intervals. The median gives the median probability of an individual across the 9 waves, and the category with the highest value remains the Neutral one, while the Egalitarian is quite a bit lower. Reasonably similar values are found for the mean probability despite the size of the random effects. Notice too that both sets of confidence intervals have widened in comparison to the previous model, as we are now inferring not to individuals in general but to the average individual across waves.

(The table gives, for the Traditional, Neutral and Egalitarian categories, the median probability with its 95% interval and the mean probability with its 95% interval.)

Stage 3: a model with random cohort effects

The next model additionally includes random effects for cohorts, which are defined by the calendar year in which the person was born. There are 78 birth cohorts, the earliest born in 1894. Again we keep the fixed part of the model simple, with just the two intercepts for the non-referent categories, but include variance-covariances at the cohort level. We increased the monitoring phase of the MCMC estimates to 50k, and here are the results.

Clearly there are sizeable variations between cohorts, particularly for the Egalitarian category, and the between-individual variances also remain very large. The estimated cohort residuals show a clear outcome.

(Figure: estimated differential logit against birth year for the Neutral and Egalitarian categories.)

There has been a marked and consistent rise in the Egalitarian category with successive birth cohorts (the random effects, of course, have no knowledge of earlier or later), while the rise in the Neutral category has also been consistent if less marked. This must mean that the Traditionalists have declined.

Stage 4: a cross-classified model for area effects

The next model includes a random effect for nearly 400 local authority areas. As individuals can be expected to re-locate across the sixteen years of this study, we now require a cross-classified model. Again we keep just two constants in the fixed part but additionally include two random effects for the non-referent categories at the LAD level.

The model now has the following five classifications. After 50k MCMC monitoring simulations, the following results were obtained.

In this unconditional model there are now quite large area effects at this macro scale of groups of LADs, although they are much smaller than the between-individual and between-cohort random effects. 120 We can use Larsen's MOR procedure to get some handle on these, and calculate the MOR to be 2.1 for the Neutral category and 3.1 for the Egalitarian. Geography appears to make quite sizeable differences to attitudes to gender ideology. The plot of the residual differential logits shows that there is a quite strong positive correlation between the Neutral and Egalitarian effects, as confirmed by the correlation between these latent values obtained from the estimates table.

120 The LADs in the BHPS are not actual local authority districts but groups of such areas; LADs were combined if their population fell below 120,000 in 1991 (for reasons of preventing disclosure).
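Larsen's MOR is a direct function of the estimated area-level variance on the logit scale: MOR = exp(sqrt(2 * sigma2) * 0.6745), where 0.6745 is the 75th percentile of the standard Normal. A minimal sketch, with an illustrative variance:

import numpy as np
from scipy.stats import norm

def mor(sigma2):
    return np.exp(np.sqrt(2 * sigma2) * norm.ppf(0.75))

print(mor(0.60))  # an area-level variance of about 0.6 gives a MOR near 2.1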

Including longitudinal and cohort effects

Having estimated this base model with its temporal and spatial effects, we are now going to include terms in the fixed part of the model that try to account for these variations. We begin with longitudinal and cross-sectional effects. Due to the linear dependency between Cohort, Age and Period (here represented by Wave) we can only include two of these elements in the model, and clearly we have to be careful not to misinterpret one for the other. Cohort effects are cross-sectional terms and these could be included in the model in a number of equivalent ways, such as time-invariant Age measured in 1991 at the start of the survey, the group mean age of the respondents, or the year of birth of the respondent. We have chosen the latter as it gives a simple metric by which to portray change, but the choice is solely one of convenience. This variable is entered into the model initially as a linear term centred on its grand mean. To specify the longitudinal effect we could have chosen the time-varying variable Age, but of course we have not observed the maturing effect over the adult lifespan, as the panel survey is limited to the years of measurement from 1991 onwards. We have therefore chosen to include the year of the survey as the longitudinal effect, centred on 1998 and initially entered as a linear term. This model is equivalent to the contextual Mundlak formulation of the previous chapter. The initial model is specified as follows:

After 50k monitoring simulations the estimates are as follows.

All four slopes represent the change in the logit of the outcome for one year and in that sense are directly comparable. There are some interesting patterns. The Egalitarian outcome has a positive slope associated with Date of Birth that is larger in absolute value than the negative slope for Year. For the Neutral outcome both slopes are positive. It is easiest to appreciate the scale and nature of the effects by characterising the change for some stereotypical people. Here we will choose the 10th, 25th, 50th, 75th and 90th percentiles of the birth cohorts and plot them against year, distinguishing the propensity of all three outcomes. We can do this through the customised predictions window.

The probability of choosing the Egalitarian outcome is most marked by birth cohort, with older cohorts having a much lower chance of agreeing with this position. With the passage of time, however, all the cohorts show a decline in their probability of agreeing with the Egalitarian stance. The proportion agreeing with the Traditionalist position similarly shows strong birth cohort effects, and also a decline over time in support for this position. The probability of choosing the Neutral stance is much less differentiated by when people were born, and this position has seen an increase in support over time.

An exactly equivalent way of looking at this is to take the specified birth year away from the calendar year of the survey to derive the time-varying age of the respondent. We can then plot the predicted probabilities against changing age; they may look very different but they are the same model results, just portrayed in another way. Unlike the aerobic performance example, it is quite plausible that events associated with period may have affected the response as well as the processes of ageing per se.

We then fitted more complex quadratic and cubic models for Year and Birth, finding that the DIC improved substantially with the quadratic compared to the linear but worsened with the cubic. There was also no evidence of an interaction effect between Year of Birth and Year. The predicted probabilities for the quadratic model show essentially the same results as the linear model.

Including Gender as a main effect and as interactions

The next model includes a time-invariant predictor for Sex, and does so in a quadratic interaction with Year.

The basic patterns are the same, with a large cohort effect for both extreme categories, but females in the earlier cohorts show less traditional views.

Age, period and cohorts?

It has long been held that, while it is conceptually possible to separate age, period and cohort, it is not technically possible to do so due to the linear dependence between the three terms: the identification problem. This has recently been challenged in a set of papers by Yang and Land. 121 Their solution has two parts. First, micro-survey data such as the BHPS are used to form bespoke groupings such that, while period remains on an annual basis, cohorts consist of data for a five-year period; this breaks the linear dependence between the terms. Second, they use random-coefficient modelling to analyse a complex cross-classification of individual respondents (level 1) nested within cells created by a cross-classification of birth cohorts and time periods. They are in fact using a main-effects cross-classification of random terms for some elements of the APC model and a fixed term for others. Thus we could fit a model with a quadratic term for Birth year in the fixed part of the model and random effects for each Wave alongside random effects for (say) five-year age groups. The problem with this approach is that it is unclear which term should be put in the fixed part and what is a meaningful grouping of age, period or cohorts to make for identifiability. Glenn (1989) has been highly critical of what he calls mechanical approaches: 122

The reason why no mechanical or automatic method of cohort analysis can always properly sort out age, period, and cohort effects is simple: when the effects are linear, several different combinations of the different kinds of effects can produce the same pattern of variation in the data. In other words, different combinations of age, period, and cohort influences can have identical empirical consequences. Therefore, a proper labelling of the effects requires a priori knowledge of the nature of the effects or some kind of information from outside the cohort table.

If different Age, Period and Cohort processes can yield identical observable outcomes, it must be that the observed data, of themselves, cannot tell us which combination is at work; it is not possible to make bricks without straw.

121 Yang, Y and Land, K C (2006) A mixed models approach to the age-period-cohort analysis of repeated cross-section surveys, with an application to data on trends in verbal test scores, Sociological Methodology 36: 75-98; Yang, Y and Land, K C (2008) Age period cohort analysis of repeated cross-section surveys: fixed or random effects? Sociological Methods and Research, 36; Yang, Y (2006) Bayesian inference for hierarchical age-period-cohort models of repeated cross-section survey data, Sociological Methodology 36: 39-74; Yang, Y (2007) Is old age depressing? Growth trajectories and cohort variations in late life depression, Journal of Health and Social Behavior 48: 16-32; Yang, Y (2008) Social inequalities in happiness in the United States, 1972 to 2004: an age-period-cohort analysis, American Sociological Review, 73; Smith, H L (2008) Advances in age-period-cohort analysis, Sociological Methods & Research, 36.
122 Glenn, N D (1989) A caution about mechanical solutions to the identification problem in cohort analysis: comment on Sasaki and Suzuki, American Journal of Sociology 95(3); Glenn, N D (1976) Cohort analysts' futile quest: statistical attempts to separate Age, Period, and Cohort effects, American Sociological Review, 41.
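The identification problem is easy to demonstrate: because Age = Period - Cohort, the columns of a design matrix containing all three linear terms are exactly collinear, so the matrix is rank deficient and no unique set of linear APC effects exists. A minimal sketch with hypothetical survey years and birth years:

import numpy as np

period = np.repeat(np.arange(1991, 2008, 2), 50)  # hypothetical biennial survey years
cohort = np.random.default_rng(4).integers(1920, 1980, size=period.size)
age = period - cohort                             # the identity that causes the problem
X = np.column_stack([np.ones_like(age), age, period, cohort])
print(np.linalg.matrix_rank(X))                   # 3, not 4: one column is redundant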

Indeed, despite considerable experimentation with different terms in the fixed part and with different alternative groupings of age and birth year, we were unable to fit all three APC terms simultaneously without the MCMC estimates 'blowing up'; this suggests that identifiability remained a real problem.

What we have learnt

The multilevel model can estimate differential cross-sectional and longitudinal effects within an overall repeated-measures framework. It is important to do so because the relatively enduring processes may have quite different effects from the changeable longitudinal elements. The multilevel model can do this for discrete responses, such as the unordered categorical response used in the gender ideology example, and for models with added area effects. Separately estimating Age, Period and Cohort elements is technically difficult; using grouping and random effects remains a mechanical approach. In the study of the development of aerobic performance we can effectively rule out period effects a priori, but in political voting behaviour we could not discount events like 'Jennifer's ear' or 'where's the beef', alongside the long-standing socialization of being in a post-war cohort, nor maturation processes as people age. Quantitative technique has to be tempered by real-world understanding. The entity being studied makes a real difference, and that would account for the lack of debate about APC in the biostatistics literature, where it is often entirely plausible to discount period effects.

Chapter 16: The analysis of spatial and space-time models

Introduction

The standard multilevel models treat space in a rather rudimentary way, so that individuals at level 1 are seen as nested in neighbourhoods at level 2 and districts at level 3. This forms a hierarchical partitioning of space and we can calculate what percentage of the variance lies at each level, and hence the degree of dependence or autocorrelation at that level. Indeed, this random-effects approach based on a null model was the basis for an early classic paper on geographical analysis. 123 The rudimentary nature of the model can be appreciated from Figure 17. In this standard model each neighbourhood is treated as a separate entity; there is no knowledge in the model of which areas are next to each other. The random-effects model is based on exchangeability and the results are invariant to the location of areas; we can move areas about without affecting the results, as in Figure 18. 124 In contrast to the multilevel approach, the spatial econometric tradition 125 emphasizes that the location of areas does matter, and we can conceive of interactions between areas that must be accommodated in the model as interactions or spillovers between adjacent areas, as in Figure 19. Although this notion is not generally used in the literature, it is helpful to think in terms of the wider areas that surround each area, what we will call spatial 'patches'. The red lines on Figure 20 show the ties of adjacency for a particular neighbourhood, area 10, and consequently they define a wider area or patch, that is areas 10, 7 and 11. There are 13 areas on this map, so that there are 13 patches that overlap to some extent. The spatial multilevel model allows exchangeability of information within these pre-defined patches, and that is how additional spatial dependence is accommodated in the multilevel model. Although the two traditions of multilevel modelling and spatial modelling have evolved separately, there are now a number of applications in which both approaches are used simultaneously. These include the following papers by authors who are from the multilevel side of the house: Verbitsky-Savitz and Raudenbush (2009); Leyland (2001); Leyland et al (2000); Langford et al (1999); Jackson et al (2006) and Best et al (2005). 126

123 Moellering, H and Tobler, W (1972) Geographical variances, Geographical Analysis, 4.
124 More formally, exchangeability means that there is no systematic reason to distinguish particular areas. In essence we are assuming that we can permute the labels of the areas without affecting the results. De Finetti in 1930 derived much of the apparatus of modern Bayesian statistics from this assumption.
125 Anselin, L (1988) Spatial Econometrics: Methods and Models, Dordrecht: Kluwer Academic Publishers.
126 Verbitsky-Savitz, N and Raudenbush, S W (2009) Exploiting spatial dependence to improve measurement of neighbourhood social processes, Sociological Methodology, 39(1); Leyland, A H (2001) Spatial analysis, in Leyland, A H and Goldstein, H (eds) Multilevel Modelling of Health Statistics, Chichester: John Wiley & Sons; Leyland, A, Langford, I H, Rasbash, J and Goldstein, H (2000) Multivariate spatial models for event data, Statistics in Medicine, 19; Langford, I H, Leyland, A H, Rasbash, J and Goldstein, H (1999) Multilevel modelling of the geographical distributions of diseases, Applied Statistics, 48; Jackson, C, Best, N G and Richardson, S (2006) Improving ecological inference using individual-level data,
Statistics in Medicine, 25; Best, N, Richardson, S and Thomson, A (2005) A comparison of Bayesian spatial models for disease mapping, Statistical Methods in Medical Research, 14.

Figure 17: Neighbourhood influence in the standard multilevel model (Elffers, 2003) 127
Figure 18: Invariance over location in the standard model
Figure 19: Interacting areas in a spatial model based on adjacency
Figure 20: The spatial 'patch' based on area 10

127 Elffers, H (2003) Analysing neighbourhood influence in criminology, Statistica Neerlandica, 57(3).

Book-length treatments include Lawson et al (2003) and Banerjee et al (2004). 128 However, it must be noted, and as we shall see, that not all the models that have been developed in spatial econometrics are readily implementable as multilevel models.

What do we mean by adjacency: defining spatial neighbours

Spatial multilevel models include extra spatial autocorrelation or dependence over and above that from a strict hierarchy. This situation is akin to the standard model giving a compound symmetry approach to dependency in repeated-measures time-series analysis, with more elaborate models being used to estimate more complex forms (Chapter 14, this volume). Spatial analysis is, however, much more demanding than time series. With repeated measures the current value can only depend on the past, but in the analysis of spatial series the dependency could go in any direction. We tackle this by defining adjacent neighbours and specifying spatial weights that give the connectivity between places. Thus, in the South Wales valleys you could specify connectivity up and down the valley but not across from one valley to another, with weights for these spatial neighbours defined as the inverse of the road distance between them. The set of spatial patches with their additional spatial dependency is defined by these spatial neighbours, while the weights give the degree of connectivity between areas.

Figure 21: Three types of join: a) Bishop's case; b) Rook's horizontal case; c) Queen's case

The form of the spatial neighbours will have a major role in determining the patterns that will be found. Figure 21 shows a chessboard with three types of adjacency structure. A Bishop's case joins along the diagonals. This would give positive spatial autocorrelation (the usual geographical case), as the Black areas will be joined to the Black and the White to the White. However, a Rook's case, where the connectivity is either along the rows or the columns, would give negative autocorrelation; Black areas are joined to the White ones. A Queen's case connectivity, where each area is joined to the next area by horizontal, diagonal and vertical joins, would show no dependence, with its White-White, Black-Black and Black-White joins effectively cancelling each other out.

128 Lawson, A B, Browne, W J and Vidal Rodeiro, C L (2003) Disease Mapping using WinBUGS and MLwiN, Wiley, London; Banerjee, S, Carlin, B P and Gelfand, A E (2004) Hierarchical Modelling and Analysis for Spatial Data, Chapman Hall, Boca Raton.
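A minimal sketch of building the Rook's, Bishop's and Queen's case weights for an r-by-c grid of cells, of the kind that has to be supplied before a spatial model can be estimated:

import numpy as np

def grid_weights(r, c, case="rook"):
    steps = {"rook":   [(-1, 0), (1, 0), (0, -1), (0, 1)],
             "bishop": [(-1, -1), (-1, 1), (1, -1), (1, 1)]}
    steps["queen"] = steps["rook"] + steps["bishop"]
    W = np.zeros((r * c, r * c))
    for i in range(r):
        for j in range(c):
            for di, dj in steps[case]:
                ni, nj = i + di, j + dj
                if 0 <= ni < r and 0 <= nj < c:
                    W[i * c + j, ni * c + nj] = 1
    return W / W.sum(axis=1, keepdims=True)  # row-standardised weights

W = grid_weights(8, 8, "queen")  # the chessboard example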

You have to specify these neighbourhood identifiers and weights before you model, but this allows the use of multiple sets of join structures to evaluate a range of geographical hypotheses. A classic example is Haggett's (1976) study of measles in Cornwall. 129 If we look at Figure 22 we can colour a district black if it has a measles infection in a particular week, white otherwise. The maps (a) and (b) show two characteristic map patterns; (c) and (d) show the places joined in neighbourhood space, so that places that are near each other are joined up. The pattern of map (a) is then seen to have strong patterning, with lots of same-colour joins, whereas map (b) has little dependency: the presence of measles in a district does not make the presence of the disease in a neighbouring district any more or less likely. Maps (e) and (f) show the connectivity not in neighbourhood space but in hierarchical space, where places are connected in terms of the size of their population; in this space, it is map (b) that has the strong spatial patterning.

Figure 22 Two patterns of measles in Cornwall, and two types of join structure

Haggett took the county of Cornwall and produced seven different join structures representing seven different plausible mechanisms of spatial spread (Figure 23). He then calculated the degree of dependency for each week for forty weeks as the epidemic passed through the county (Figure 24). The graphs on the left show a time-series plot of the

129 Haggett, P. (1976) Hybridizing alternative models of an epidemic diffusion process. Economic Geography, 52(2).

epidemic rising and falling, reaching a peak before week 20. The graphs on the right-hand side show a measure of spatial autocorrelation for each of the seven sets of weights. The dotted vertical line is the peak of the epidemic, while the dotted horizontal line represents the critical p value. He found that the early weeks of the epidemic were characterised by hierarchical spread, as the disease jumped between large places (see G-6, G-5 and G8). However, once past the epidemic peak, the dependency showed more local contagion (G1 and G7). The policy implications are clear: at the outset of an epidemic it is not sufficient to vaccinate locally; there is a need to make sure that the large population centres are covered if there is to be a possibility of containing the disease. Methodologically, it is vital to specify join and weight structures appropriate to the process being studied. Unfortunately, this geographical imagination is rarely brought to bear in applied research.

Figure 23 Seven alternative join structures corresponding to different spatial processes

Figure 24 Weekly degree of spatial autocorrelation according to the seven join structures

Three types of spatial models

We can recognise three basic types of spatial model. 130

Spatial lag dependence or autoregressive models, in which the outcome depends on the response in neighbouring areas, so that the same values (albeit lagged) can appear on both sides of the equation. An example would be the prices of houses in this district depending on the prices of houses in neighbouring districts. The spatially weighted sum of neighbourhood housing prices (the spatial lag) enters the model as an explanatory variable. 131 Schematically one can think of this as

y = ρWy + Xβ + e

the response y depending on the spatially lagged response Wy through the weights (W) and degree of dependency, ρ, and additional predictors X with regression weights, β, plus some unstructured residual term, e. Because of their simultaneity these are demanding models to fit, especially when, as would be common in the multilevel approach, there are levels below the area level, such as people and occasions. At the time of writing they cannot be estimated in MLwiN and should in general only be used if dependence on previous responses is of substantive interest; or, to put it another way, lagged effects should have the possibility of a causal interpretation as spillover effects. An example might be that the number of infected people (the response at time t) in an area this week might depend on the number infected last week (the lagged predictor at t-1) in the area and on counts in neighbouring areas. For a discussion of estimation see Corrado and Fingleton (2011). 132

Spatial residual dependence models, in which the residual part of the model is partitioned into at least two parts: unstructured and structured area effects. Schematically this can be written as

y = Xβ + u + u*

where the residual variation in the response, conditional on the fixed part of the model, has an unstructured residual u and a structured residual u*; the connectivity of the structured effects is defined by the spatial weights and the degree of dependency by the relative size of the structured variance. MLwiN can readily estimate such models, which allow extra spatial dependence, by specifying the spatial neighbours and associated weights as a multiple membership model. Such models have two important properties. First, the standard errors of the fixed part of the model are automatically corrected for the degree of spatial dependency. The underlying cause of the dependency

130 Other models are possible, for example the spatial Durbin model in which the predictors are additionally involved in a lagged relationship: y = ρWy + Xβ + WXγ + e.
131 For an explanatory video on this model, see
132 Corrado, L. and Fingleton, B. (2011) Multilevel modelling with spatial effects. Strathclyde Discussion Papers in Economics, no.

could be spatially correlated omitted variables and spatially correlated errors of measurement. Second, because of the shrinkage properties of the estimated residual area effects, the differences between areas (in the structured case) are effectively smoothed to local means. A lot of spatial modelling is used to ascertain local hotspots for sudden infant death or childhood leukaemia, which typically have low incidence and hence a high stochastic component. Treating the structured residuals as a distribution which is bound together by a spatial neighbourhood weights matrix results in the estimate of relative risk for an area borrowing strength from the surrounding areas. 133 The success of this strategy of course depends on the appropriateness of the adjacency and weights matrix. Figure 25 shows a typical example of this in modelling small-for-age children in Zambia. There is a clear patterning in the structured spatial effects, with a concentration in the north-east part of the country.

Figure 25 Structured and unstructured spatial effects for 'small' children in Zambia 134

Spatial heterogeneity models allow the relationship between a predictor and a response to vary across the map, and it is possible to put in higher-level variables to account for this variation. This is the idea behind Geographically Weighted Regression. 135 This is an exploratory data analysis technique that allows the relationship between an outcome and a set of predictor variables to vary locally across the map. The approach aims to find spatial

133 An excellent pedagogic account using baseball averages and a map of blood toxoplasmosis is Efron, B. and Morris, C. (1977) Stein's paradox in statistics. Scientific American, 236(5).
134 Source is
135 There is a website at and a book-length treatment is Fotheringham, A.S., Brunsdon, C. and Charlton, M.E. (2002) Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Chichester: Wiley.

non-stationarity and distinguish this from mere chance. 136 The GWR technique works by identifying spatial sub-samples of the data and fitting a set of local regressions. Taking each area across a map in turn, a set of nearby areas that form the local surrounding region are selected; a regression is then fitted to the data in this region, but in such a way that nearby areas are given greater weight in the estimation of the regression coefficients. The surrounding region is known as the spatial kernel. This can be of a fixed spatial size across the whole map, but that will result in unstable estimation in regions where there are relatively few areas on which to base the local regression, and will possibly miss important small-scale patterns where many areas are clustered together spatially. Consequently, an adaptive spatial kernel is often preferred, so that a minimum number of areas that form the region can be specified and the kernel extends out until this number has been achieved. Changing the kernel changes the spatial weighting scheme, which in turn produces estimates that vary more or less rapidly over space. A number of techniques have been developed for selecting an appropriate kernel and indeed for testing for spatial stationarity. 137 Once a model has been calibrated, a set of local parameter estimates for each predictor variable can be mapped to see how the relation varies.

From the perspective of random coefficient modelling this procedure is inefficient in that a separate model is being fitted to each area. The multilevel modelling approach is, as usual, to fit a general fixed relation and allow this to vary from place to place as a random slope as part of an overall model. The difference in these spatial models is that the random slope is allowed to vary for each patch centred on each area, so that the observations for the predictor variable become in effect the values for that patch. Poorly estimated local slopes, due to a small number of areas in the patch or a lack of variation in the predictor for that patch, will be shrunk back to the general line, so that the procedure has some built-in protection against over-interpretation. It is possible to fit such models in MLwiN through multiple-membership models with additional random slopes associated with particular predictors.

A warning that applies across all three types of model: it is worth stressing that apparent spatial dependence can be produced by model misspecification, in that an omitted variable with a distinctive spatial pattern could show up as spatial autocorrelation amongst the residuals. Similarly, an incorrect functional form, such as fitting a linear relationship when the true relationship is concave, could also show up as apparent spatial dependence. This will occur when the predictor variable involved in the non-linearity has spatial patterning, so that the over- and under-estimation varies geographically. You are well

136 Brunsdon, C., Fotheringham, S. and Charlton, M. (1996) Geographically weighted regression: modelling spatial non-stationarity. Geographical Analysis, 28.
137 Paez, A., Uchida, T. and Miyamoto, K. (2002) A general framework for estimation and inference of geographically weighted regression models: 1: location-specific kernel bandwidths and a test for locational heterogeneity; 2: spatial association and model specification tests. Environment and Planning A, 34.

advised to consider such mis-specification before embarking on these more complex spatial models.

The spatial multiple membership model

MLwiN can fit both the spatial residual and the spatial heterogeneity model. 138 The model is simply conceived as a classification structure where a lower-level observation is nested in a higher-level area and is also a member of a varying number of local or neighbouring areas, as shown in Figure 26. Further insight is given by considering the town of Squareville in Figure 27. Say we are dealing with an outbreak of swine flu, so that a person in district H is conceived as being put at risk by the number of cases in the previous week in district H (a strict hierarchical relationship) and by the disease counts in districts E, I, K and G (through multiple membership relations). Similarly, we could conceive of someone in district A being affected by the disease counts in district A, a strict hierarchy, and by disease presence in B, C and D, a multiple membership relation. The linkage is therefore defined by including only districts that are adjacent in the multiple membership relation. This can be created from a map by the Adjacency for WinBUGS Tool if the map is stored in ArcMap format. 139

We can also place weights on the multiple membership relations to further emphasize the degree of connectivity or even to define the membership. For example, we could use the inverse distance between centroids of areas or, for a more rapid fall-off of the influence of other places, the inverse of the square of the distance. We would probably want to define a maximum distance so that the entire map is not involved for each place; a sketch of such a weighting scheme is given below. It is more natural to include the home area in the set of multiple membership areas (it is contributing to its own 'local mean') and this has an important advantage with the software, as it is then straightforward to obtain the residuals for the structured spatial effects, with one for each area.

Figure 26: The spatial model as a multiple membership model

138 Software in this area includes GeoDa and the Python-based PySAL, which has grown out of Anselin's work. Roger Bivand maintains the R Task View: Analysis of Spatial Data, which has a wide range of facilities. Stata has SPMLREG to estimate the spatial lag, the spatial error, the spatial Durbin and general spatial models, and SPPACK for spatial-autoregressive models. In the R package spdep there are Lagrange multiplier tests for distinguishing between spatial lag and spatial residual dependence models. In addition to MLwiN, random coefficient multilevel approaches are available with MCMC estimation in BayesX and GeoBUGS; both of these packages have tools for mapping.
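As an illustration of the weighting options just described, here is a minimal Python sketch (the centroids and cutoff are hypothetical, not taken from any worksheet) that builds row-standardised inverse-distance weights with a maximum-distance cutoff, including the home area in its own patch.

import numpy as np

centroids = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.5], [3.0, 3.0]])  # hypothetical
max_dist = 2.0

n = len(centroids)
dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)

w = np.zeros((n, n))
for i in range(n):
    w[i, i] = 1.0                        # home area contributes to its own local mean
    for j in range(n):
        if i != j and dist[i, j] <= max_dist:
            w[i, j] = 1.0 / dist[i, j]   # or 1/dist**2 for a faster fall-off
    w[i] /= w[i].sum()                   # weights in each patch sum to 1

print(np.round(w, 3))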

Figure 27 Multiple membership linkages in Squareville

Applying the spatial multiple membership model

We will now apply the spatial multiple membership model to three examples. 140

Low birth weights in South Carolina: the dependent variable is the proportion of low-birth-weight children in counties, and there are separate proportions for White and Black children. The predictor variable is the percent of the population in poverty in the county. The idea behind this analysis is not just to estimate the general relation between low birth weight and poverty but also to identify residual hot spots, in the manner of Sargent et al (1997), who found some low-deprivation areas with high

140 MLwiN is also able to fit another spatial model, known as the conditional autoregressive (CAR) model (Besag et al, 1991). This, like the spatial multiple membership model, handles residual dependency; it is not a spatial lag model. The distinctive feature of this model is that there is one set of random effects, which have an expected value of the average of the surrounding random effects:

u_i ~ N(ū_i, σ²_u / n_i), where ū_i = Σ_{j ∈ neigh(i)} w_{i,j} u_j / n_i

where n_i is the number of neighbours for area i and the weights are typically all 1. MLwiN has only limited capabilities for the CAR model (it can be specified at only one level), although it is possible to include an additional set of unstructured random effects in what is known as a convolution model. The BUGS software allows more complex models. Browne (2009) shows how to set up the CAR model in MLwiN and how it can be exported to BUGS for modification. Besag, J., York, J. and Mollié, A. (1991) Bayesian image restoration with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43: 1-59; Browne, W.J. (2009) MCMC Estimation in MLwiN, v2.13. Centre for Multilevel Modelling, University of Bristol.

relative risk, probably due to gentrification resulting in the removal of lead-based paint. 141 The model is a binomial logistic one with added spatial dependence.

Respiratory cancer deaths in Ohio counties: the response is annual repeated measures for 1979 to 1988, the aim being to discover hotspot counties with distinctive trends. The model is a Poisson log model with an offset to take account of the number of people exposed. This is an example of space-time analysis, as the model accommodates the repeated measures.

Self-rated health of the elderly in China: the response is whether an individual is in good as opposed to fair/poor health. The model is a binomial logit model based on individual survey data with additional spatial effects between Chinese provinces. This short example aims to show that the models can be applied to more than just aggregate data.

For the South Carolina and China data, basic knowledge of the binomial model as fitted in MLwiN with MCMC estimation is presumed; for the Ohio data, you need to know about the Poisson model. Chapters 12 and 13 cover this material.

Low birth weights in South Carolina 142

The data

Retrieve the saved MLwiN worksheet SCarolBweight.wsz. In the Names window you will see the following data summary.

141 Sargent, J., Bailey, A.J., Simon, P., Dalton, M. and Blake, M. (1997) Census tract analysis of lead exposure in Rhode Island children. Environmental Research, 74.
142 We thank Beatriz B Caicedo Velasquez for help with this section.

There are 46 counties with two entries for each, in that the Black and White proportions are going to be modelled in a single model. The response variable is PropLBW, which has been calculated as the LBW count divided by TotBirths. The latter will form the denominator. The % in poverty is a predictor, and the Race variable identifies whether the low-weight proportion applies to White or Black children. The new concept is the set of 9 neighbourhood identifiers, which follow immediately after the numerical county number in column 8. It is a requirement of the software that these identifiers form a strict consecutive sequence; here columns 8 to 17. If we look at an extract of these identifiers, you will see that row one is the county of Abbeville, which is numbered 1. It has adjacency with 5 other neighbourhoods, which have the numbers 33, 30, 24, 23 and 4. The rest of the row is filled with zeroes, and it is important to do this and not leave missing values. The second row is exactly the same, for this is the multiple membership relation for the Black proportion of low birth weights, as opposed to the White proportion, for Abbeville. Row 3 shows the adjacent neighbourhoods for county 2, Aiken; it also has 5 neighbours in addition to itself. The county with the most adjacent neighbourhoods, Orangeburg, has 9.

The spatially unstructured model

We will first fit a two-level binomial logistic model with PropLBW as the response and with Cons defining level 1 and County defining level 2. This is a useful device so that level 1 is in effect the children who are nested within areas. This allows the modelling of level-1 binomial variation (the denominator of the proportion, the number of trials, is declared to be TotBirths) and level 2 to be the between-county variation on the logit scale (see Chapter 12, this volume). 143 This formulation does not allow extra-binomial variation at level 1 even though we are modelling proportions. Another way of looking at this model is that there are two outcomes, differentiated by Race, in each county.

143 This is exactly the same model, with the same results, as would be found if you had binary data and the single predictor Black/White. The huge binary dataset of close to 55,000 births has been reduced to 92 rows without any loss of information.
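Footnote 143's point, that modelling proportions with a binomial denominator is equivalent to modelling the underlying binary data, can be illustrated outside MLwiN with a single-level analogue in Python's statsmodels (the numbers below are made up; the multilevel version is what is actually fitted in MLwiN).

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical extract: one White and one Black row per county
df = pd.DataFrame({
    "PropLBW":   [0.06, 0.12, 0.05, 0.11],
    "TotBirths": [900, 400, 1200, 350],
    "Race":      ["White", "Black", "White", "Black"],
})

# Proportion response with the number of trials supplied as var_weights;
# "Race - 1" fits two separate dummies rather than a constant and a contrast
m = smf.glm("PropLBW ~ Race - 1", data=df,
            family=sm.families.Binomial(),
            var_weights=df["TotBirths"]).fit()
print(m.params)   # logits of the two Race proportions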

Given the form of the data, it makes for a more readily interpretable model for the Race variable to be added to the model as two separate dummies and not as a constant and a contrast, so that the fixed part gives directly the logit of the proportion of low birth weights for each Race averaged across the counties. We can then allow between-county variance for each Race. The two variances will summarise the between-county variations, and the covariance will allow us to estimate the correlation between the area patterns for Black and White babies. We will also include the %poverty variable in an interaction with Race. After initial estimation with IGLS, the model was run in MCMC with a burn-in of 500 simulations and an initial monitoring run of 5k. The MCMC options were modified so that hierarchical centering was deployed at level 2 in the expectation of less-correlated chains.

(Note this has nothing to do with centering % poverty, which was left un-centred here.) The IGLS estimates were:

On checking the initial run of the MCMC sampler, it was decided, on the basis of the effective sample size, which was as low as 100 from 5k simulations, to increase the monitoring run to 15k. The model was re-estimated; the results table has columns MCMC15k, S.E., Median, CI(2.5%), CI(97.5%) and ESS, with rows:

Fixed Part: White; Black; White.% in poverty; Black.% in poverty
Random Part, Level: County: White/White; Black/White; Black/Black
Level: Cons
DIC:
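The effective sample size that guided the decision to lengthen the run discounts the chain for its autocorrelation. A rough Python sketch of the usual calculation (our own implementation, not MLwiN's exact rule) is:

import numpy as np

def effective_sample_size(chain):
    # ESS = N / (1 + 2 * sum of positive-lag autocorrelations)
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    acov = np.correlate(x, x, mode="full")[n - 1:] / n   # autocovariances
    acf = acov / acov[0]                                 # autocorrelations
    s = 0.0
    for rho in acf[1:]:
        if rho < 0:          # truncate at the first negative autocorrelation
            break
        s += rho
    return n / (1 + 2 * s)

# e.g. effective_sample_size(np.random.randn(5000)) is close to 5000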

The fixed estimates are most easily appreciated via the graphing of predictions; we have done this on the logit and the odds scale. 144 Clearly Black babies have a higher risk than White babies, and this differential increases with county poverty. To interpret the between-counties random part we can plot the residuals and their 95% confidence intervals. There looks to be little residual variation between counties for Blacks and some variation for Whites. These areas might now be subject to further investigation.

144 To make predictions with confidence intervals of the fixed part, temporarily switch to IGLS estimation, but do not re-estimate the model. To convert from logits to odds, add the minimum value of the logit estimate to make the base value zero, then exponentiate the result, so that this base is set to 1 (see the sketch below). If you are using the Customised predictions window, the main effect for % poverty has to be specified but the fixed part ticked off to avoid exact multicollinearity when the interactions with Race are included in the model.
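Footnote 144's rescaling from logits to odds can be written in a couple of lines; this Python sketch uses hypothetical logit predictions.

import numpy as np

logits = np.array([-3.1, -2.8, -2.2, -1.9])   # hypothetical fixed-part predictions
odds = np.exp(logits - logits.min())          # shift so the base value is zero...
print(odds)                                   # ...so the base category has odds 1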

For what it is worth, 145 we can look at the correlation between the differential logits of the areas, finding that, unsurprisingly, there is no correlation. We can also do the pairwise plot of the residuals, and here we have turned them into odds with the South Carolina average being 1. We have set the scale to be the same for both axes so that the unchanging nature of the county Black relative odds is clearly seen. There is some element of differentials for the White babies, with the lowest rates being 25 percent below the county average and the worst rates getting on for 50% above.

The spatially structured model

Before we can fit the model with spatially-structured random effects we have to create a weight to go with the identifier for each county multiple membership. In the absence of anything better we will use equal weights that sum to 1 for each set of neighbourhood joins. We start by naming 10 adjacent empty columns with the names wt0 (to hold the original

145 It is a bit silly looking at the correlation with a variable, the latent differential for Black, that does not really vary!

county weight with itself), wt1 for the weight for the first neighbour, and so on up to wt9. We then take each neighbour in turn and recode all non-zero values to 1. Then create a new variable, TotWeight, which is the sum of the ten weight variables. Finally we need to divide each of the ten weights by this total to get an appropriate weight. Here is the procedure for wt0, with a code sketch of the whole construction given below,
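A compact version of this recoding, summing and dividing, written in Python/pandas rather than through the MLwiN windows (the file name and neighbour column names are hypothetical):

import pandas as pd

df = pd.read_csv("SCarolBweight.csv")                  # hypothetical export of the worksheet
ids = ["CountyNo"] + [f"neigh{k}" for k in range(1, 10)]

indicator = (df[ids] != 0).astype(float)               # 1 for self and each real neighbour
df["TotWeight"] = indicator.sum(axis=1)
for k, col in enumerate(ids):
    df[f"wt{k}"] = indicator[col] / df["TotWeight"]    # wt0..wt9; each row sums to 1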

and the same has to be done for all the nine other weights. Once this is completed, the weights columns should look like the following extract. Thus the place that has 9 neighbours plus itself has 10 equal weights of 0.1, whereas a place with 3 neighbours plus itself has 4 equal weights of 0.25.

To specify the model, begin in the Equations window by clicking on the response and increasing the number of levels to 3 (you have to be in IGLS/RIGLS to do this), choosing CountyNO as the level-3 identifier. Then click on the White and Black dummies in the fixed part and allow variation at level 3 to get the following model.

Staying with the IGLS/RIGLS estimation, Start the model to convergence. These estimates should be dismissed as they completely ignore the cross-classified structure. Now switch to MCMC in the Estimation Control window. This will allow you to specify the multiple membership cross-classification at level 3:

Model on main menu
MCMC
Classifications
Choose to Treat Levels as cross-classified [leave level 2 to be the unstructured effects]
Choose Multiple classifications at level 3
Specify the Number of columns to be 10 [the largest number of neighbours]
Specify that Weights start Column Wt0 [the identifiers will be understood as the 10 - 1 columns following CountyNo; both the identifiers and weights have to be in consecutive columns]
Done

After this, the model will still look like a hierarchical one on the Equations screen and will not convey the information that the model is now cross-classified with multiple membership weights. To overcome this, MLwiN has a different notation to that which we have been using so far. 146 This is based on classifications rather than subscripts. The response variable subscript i is still used to index the lowest-level units, but classification names are used for the subscripts of random effects. As there are several classifications, they are given a superscript to represent the classification number, starting at 2. This notation also clearly indicates weights and multiple memberships; a schematic version for the current model is given below. In the bottom toolbar of the equations window, choose Notation and ensure multiple subscripts is turned off.

146 Browne, W.J., Goldstein, H. and Rasbash, J. (2001a) Multiple membership multiple classification (MMMC) models. Statistical Modelling, 1; Rasbash, J. and Browne, W.J. (2002) Non-hierarchical multilevel models. In: de Leeuw, J. and Meijer, E. (eds) Handbook of Multilevel Analysis. Springer.
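Schematically (this is our paraphrase, not MLwiN's exact on-screen display), the classification notation for the current model can be written as

logit(π_i) = β1 White_i + β2 Black_i + β3 White_i (% in poverty)_i + β4 Black_i (% in poverty)_i
             + Σ_{j ∈ CountyNo(i)} w^(3)_{i,j} (u^(3)_{White,j} White_i + u^(3)_{Black,j} Black_i)
             + u^(2)_{White,county(i)} White_i + u^(2)_{Black,county(i)} Black_i

with the weighted summation giving the spatially structured (multiple membership) residuals and the final two terms the unstructured county residuals.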

It can be clearly seen that the 3rd classification (CountyNo) is based on two sets of weighted residuals (these are the spatially structured residuals), while the 2nd classification (County) does not have any weights. The Hierarchy window will now reflect this classification view of the model. Start to estimate, keeping a burn-in of 500 and a monitoring run of 15k, and choosing the hierarchical centering to be at level 3 to hopefully speed up estimation. You will probably get a software error at this point due to poor starting values from the IGLS estimation, which of course does not know the spatial membership relations. The IGLS estimation found all level-2 random variances to be 0.0. This is an important issue in successful estimation that we must consider in more detail.

A short digression on MCMC starting values

A common problem in MCMC estimation is the non-positive definite matrix warning, or that the prior matrix is non-positive definite. Both problems will be flagged by the software and the estimation will not proceed. The underlying cause of the problem is that either the correlation between the random effects has been estimated by IGLS to be outside the range of +1 to -1 or, as here, there is an estimated variance of zero. This can usually be overcome by changing the covariance to an initial value of 0, implying a non-offending correlation of zero, and/or by changing the variance to a small positive value. This can be achieved in practice by editing c1096, which contains the IGLS estimates for the random part of the

model. 147 This should be done in IGLS before switching to MCMC, as the IGLS values are used to define a prior distribution; we suggest that small values are used so as not to impose too much prior information. This is particularly a concern when there are few higher-level units, as you are effectively adding extra data (which you have made up!) to the problem. You may need to experiment with values for the variance that are big enough to allow estimation but not large enough to affect the results. We proceed by editing c1096, replacing the zeros for the variances with a small positive value and leaving the covariance at 0.0. A relatively low value was used so as not to have a major impact on the final estimates and DIC values. Here are the problematic initial IGLS estimates with three variances of zero, together with the values in c1096 before and after editing (the value of 1 in row 7 is the binomial constraint).

147 A useful trick if there is a complex random part with problematic covariances is to click on the random effects variance-covariance matrix for a particular classification and choose a diagonal matrix, then click again to request a full matrix. This action results in all covariances being initially set to zero for that classification.

Switching to MCMC estimation and requesting the full Bayesian specification of the model by clicking on + in the lower toolbar, the MCMC starting model is shown below. You can see that the initial estimates have been used to create a not-uninformative prior matrix, in the form of an inverse Wishart distribution, for the unstructured and spatially structured random effects.

Back to the spatially structured model

After 15k simulations the effective sample size of several parameters was rather low, suggesting slow convergence, so the run was increased to 100k with a thinning of 10, so that only 1 in 10 values were stored but all were used in the calculations. The resultant model estimates are shown below.

Comparing the estimates of the two models (aspatial and spatial, see below), the spatial model is a substantial improvement, as the DIC has gone down from 792 to 779. The most important new term is the spatially structured unexplained variance between counties for Whites, which is four times larger than the equivalent value for Blacks. There is not much unstructured unexplained variance for either Blacks or Whites between counties. In the fixed part of the model little has changed, except that the standard errors are higher now that the residual spatial dependence has been explicitly modelled. The comparison table has columns Aspatial, S.E., CI(2.5), CI(97.5), ESS and Spatial, S.E., CI(2.5), CI(97.5), ESS, with rows:

Fixed Part: White; Black; White.% in poverty; Black.% in poverty
Random Part, Level: CountyNo: White/White; Black/White; Black/Black
Level: County: White/White; Black/White; Black/Black
DIC:

The revised fixed part estimates are shown graphically below. There is now a positive relationship with poverty for both Blacks and Whites, although the line for Whites has a particularly large uncertainty.

We can store the estimated level-3 spatial residuals in c400 and c401 and the level-2 aspatial residuals in c300 and c301, and then exponentiate these values to get odds compared to the South Carolina average of 1. The plot has been made with the same scale on both sets of axes to allow direct comparison. The baseline of 1 has also been added to each graph. The relative importance of the spatial differentials is immediately apparent, particularly for Whites; there is a threefold differential when the best is compared to the worst areas. There appears to be some correlation between the spatial differentials, and the correlation (taken from the Estimate Tables) is positive. This covariance between the two sets of spatial effects will mean

that borrowing strength will occur between the map patterns for Blacks and Whites. 148 The differentials between the aspatial odds are very much smaller, with a negative correlation. There would appear to be patches of the map with elevated unexplained differentials that are more marked for Whites than Blacks, but patches that are high for one group tend also to be high for the other. Within a patch the differences are small, but where they are slightly elevated for one group they are slightly reduced for the other. If one examines the standard errors of these residuals they are large in comparison to the estimates, and one reaction is to say that we can tell little about even the spatial patch differences. But another view is that this involves an element of double counting. Clayton and Kaldor (1987), in their classic paper on lip cancer in Scotland, simply map the residuals, as their uncertainty has already been accommodated through precision-weighted estimation. The estimates have already been shrunk, where there is a lack of evidence, towards the overall mean. 149 Certainly in exploratory work, with an emphasis on the precautionary principle of not missing true high rates, there is a great deal of sense in this. Consequently it would be sensible to produce a table of values that could be printed or exported to a mapping package. Currently the column labels in column 1 are long, so we first need to un-replicate them to produce a short column.

148 Jones, K. and Bullen, N. (1994) Contextual models of urban house prices: a comparison of fixed- and random-coefficient models developed by expansion. Economic Geography, 70(3).
149 Clayton, D. and Kaldor, J. (1987) Empirical Bayes estimates of age-standardized relative risks for use in disease mapping. Biometrics, 43.

In the Names window the Categories of column 1 can be copied and then pasted on to the new short column. 150 A printout of the county labels and estimates can then be made; here we have added the ranks. The table has, for each county, Odds and Rank columns for White and Black under both the Aspatial and Spatial models, with one row per county: Abbeville, Aiken, Allendale, Anderson, Bamberg, Barnwell, Beaufort, Berkeley, Calhoun, Charleston, Cherokee, Chester, Chesterfield, Clarendon, Colleton, Darlington, Dillon, Dorchester, Edgefield, Fairfield, Florence, Georgetown, Greenville, Greenwood, Hampton, Horry, Jasper, Kershaw, Lancaster, Laurens, Lee, Lexington, McCormick, Marion, Marlboro, Newberry,

150 Copy 3 colnumber in the Command window copies labels to the clipboard. Because of the option

Oconee, Orangeburg, Pickens, Richland, Saluda, Spartanburg, Sumter, Union, Williamsburg and York.

Fairfield County in the spatial model has the highest relative odds, 86 per cent above the South Carolina average for Whites, which must mean it is in a patch of high areas (its patch being Union (1.40), Richland (1.07), Newberry, the only one below 1 (0.88), Lancaster (1.64), Kershaw (1.03) and Chester (1.08)). Fairfield is also ranked first for the aspatial values, with an additional 7 percent above this patch-induced value. Remembering that poverty has been conditioned on, it would be interesting to see what is distinctive about this county. If we plot the odds for Whites we see that the spatially structured map shows a concentration of higher risk in the north of the State once poverty is taken into account. Both the Black and White maps in the spatially structured case show lower risk in the south-west of the map.

Spatial heterogeneity: a random effects GWR

We have in fact already fitted a model with spatial heterogeneity, in that the effects for Black and White, two individual attributes, have been allowed to vary over the map between counties, aspatially and spatially. But we can also allow the higher-order effect of %poverty to vary spatially between counties. In estimation and interpretative terms this is a tall order, as we will need an additional seven random terms to model what we could call a random effects geographically weighted regression. In IGLS estimation, click in the Equations window on the fixed terms involving %poverty and allow them to vary at level 3. The resultant model should look like the following.

Set a diagonal matrix at level 3 by clicking on Ω(3), edit the column c1096 to give a small positive starting value for any variances that are zero, and then click on Ω(3) again and choose the full set of covariances. Do this in IGLS before switching to MCMC. The model before MCMC estimation is shown below. After 200k MCMC simulations we get the following results and stored estimates; even with this long run, the ESS is not really big enough.

The results table has columns REGWR200, S.E., Median, CI 2.5%, CI 97.5% and ESS, with rows:

Fixed Part: White; Black; White.% in poverty; Black.% in poverty
Random Part, Level: CountyNo: White/White; Black/White; Black/Black; White.% in poverty/White; White.% in poverty/Black; White.% in poverty/White.% in poverty; Black.% in poverty/White; Black.% in poverty/Black; Black.% in poverty/White.% in poverty; Black.% in poverty/Black.% in poverty
Level: County: White/White; Black/White; Black/Black
DIC:

The DIC shows that this randomised GWR model is not an improvement over the model with just White and Black spatially structured random effects, with its DIC of 779. However, both slope terms are in fact quite well estimated, with a reasonable effective sample size. Moreover, the smoothed histograms of the values suggest that there is not much support for the parameters being zero, suggesting that the relationship between % in poverty and low birth weight does vary across the map. Emboldened by this, we estimated the residual differential intercepts and slopes and plotted them on a varying relations graph and as maps. The table lists, for each county, the differential intercepts and differential slopes for White and Black births: Abbeville, Aiken, Allendale,

Anderson, Bamberg, Barnwell, Beaufort, Berkeley, Calhoun, Charleston, Cherokee, Chester, Chesterfield, Clarendon, Colleton, Darlington, Dillon, Dorchester, Edgefield, Fairfield, Florence, Georgetown, Greenville, Greenwood, Hampton, Horry, Jasper, Kershaw, Lancaster, Laurens, Lee, Lexington, McCormick, Marion, Marlboro, Newberry, Oconee, Orangeburg, Pickens, Richland, Saluda, Spartanburg, Sumter, Union, Williamsburg and York.

These estimated differential intercepts and slopes have to be seen in the context of the overall intercepts and slopes that are found generally across South Carolina, namely the White and Black intercepts where %poverty is zero and the respective general slopes between low birth weight and % in poverty. The varying relations plot given below cannot simply be derived by the Predictions or Customised Predictions window, as there is only a single value of % in poverty for a county; instead we need the range of values that are found in the spatial patch that surrounds that county (see the sketch below). Thus Spartanburg has a low and narrow range of % in poverty, but the relation in this patch is a steep positive one.
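A sketch of how one county's line on such a plot can be constructed (all numbers are hypothetical, since the point is the patch-specific range of % in poverty):

import numpy as np

b0, b1 = -3.0, 0.02            # hypothetical general White intercept and slope (logit scale)
d0, d1 = 0.10, 0.04            # hypothetical differential intercept and slope for one county

patch_poverty = np.array([12.0, 15.5, 18.0, 14.2])     # % in poverty across the patch
x = np.linspace(patch_poverty.min(), patch_poverty.max(), 25)
county_line = (b0 + d0) + (b1 + d1) * x                # plotted only over the patch range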

The maps show the variation of both the White and Black slopes. What are we to make of these quite marked variations? Even though GWR is supposed to be an exploratory technique, we suggest that the results are treated with considerable circumspection. The DIC suggests that the more complex model is not a more parsimonious fit than the simpler model. All the fixed terms, including the general slope terms for % in poverty, have relatively small effective sample sizes. The differential slopes for each county patch are somewhat implausible, in that they are large enough to overwhelm the general positive relation between poverty and low birth weight for both Whites and Blacks, so that in some counties higher area-based poverty is associated with a lowering of the risk of low-birth-weight children. Moreover, there is a real danger of capitalising and maximising on model mis-specification and creating a cross-level fallacy, in that we do not have data on whether the individual child is or is not in poverty, only that they live in poor areas. 151

Respiratory cancer deaths in Ohio counties: space-time modelling

The data

Retrieve the saved MLwiN worksheet OhioCancer79-88.wsz

151 Subramanian, S.V., Jones, K., Kaddour, A. and Krieger, N. (2009) Revisiting Robinson: the perils of individualistic and ecologic fallacy. International Journal of Epidemiology, 38(2).

In the Names window you will see the following data summary. There are 880 observations, which represent 10 years of observation for 88 counties. The contents of the columns are as follows:

C1 a numerical code for each and every county, 1 to 88, which has been replicated 10 times;
C2-C9 the neighbourhood identifiers; there are a maximum of 8 neighbourhoods adjoining a county in addition to the county itself;
C10 the constant, as usual just a set of 1s;
C11 the number of observed respiratory cancer deaths in each county in each of the 10 years;
C12 the expected number of deaths if State-wide age-sex rates applied;
C13 time as a numerical variable, 0 to 9, representing 1979 to 1988;
C14 another time variable which takes the numerical values 0 to 9, but is a categorical variable;
C15 the Standardised Mortality Ratio, a measure of the risk of dying from respiratory cancer in a particular county in a particular year; it has simply been calculated as the observed number of deaths divided by the expected, so that the State risk is set to 1, and a value of 2 is double the all-State risk, while a value of 0.5 is half the State risk.

At the outset we need to create another column, named County2, which is a duplicate of column 1, so that we have a different identifier for the structured and unstructured classifications: 152

calc County2 = c1

152 If this is not done, the Stored model functionality will become confused.

The aim is to model the SMR, rather than to treat it as a simple descriptive value, so as to identify hotspot counties with distinctive trends. The problem with the SMR is that as a ratio it is highly unstable, and when the expected value is close to zero (signifying a rare event) any positive count will lead to a very large ratio. 153 The SMR also does not take into account that its variance is proportional to 1/Expected, so that with rare events and small populations we have a great deal of natural heterogeneity, which makes spotting hotspots difficult. From a modelling perspective, it is like fitting a saturated model with, in this case, 880 separately estimated parameters. Respiratory cancer deaths are quite rare and therefore we are going to model the observed outcome, taking account of the expected counts, as a log Poisson model. Such a model will explicitly take account of the stochastic nature of the counts and their heterogeneity, and we will additionally include a spatial multiple membership relation so that poorly estimated counties (with small numbers) will borrow strength from neighbouring, more precisely-estimated counties with larger numbers of deaths. In comparison to the descriptive SMRs, we will be looking for the degree of evidence that supports a county having high rates or an upward trajectory, based on explicit modelling of trends and area differences.

Unstructured Poisson area-effects models

We will start the modelling with an aspatial two-level Poisson model. We specify the model as follows:

Model on main menu
Equations
y, the response is Obs
two-level model, ij, with County2 declared to be the level-2 identifier and Time as level 1, done
Change the red x0 to be Cons, done, to get an overall grand mean value; click on j(County2) and i(Time) random effects, done
Click on N for Normal and change the response type to be Poisson
Click on Estimates to get the following model

153 Jones, K. and Kirby, A. (1980) The use of chi-square maps in the analysis of census data. Geoforum, 11(4).

Here π_ij is the underlying mean count, and we are going to model the log_e of this count (this means that on exponentiation we cannot have an estimated risk below 0). At level 1 the variance of the observed deaths is equal to this mean; that is, it is assumed to follow a Poisson distribution. There is also a level-2 variance, σ²_u, which summarises the between-county differences. We also fitted a model with extra-Poisson variation, but there was not substantial evidence for this when the time trend and area effects were taken into account.

Clicking on the π_ij term in the second line of the window allows us to specify the offset; this will allow the modelling of the rate Obs/Exp by taking the log_e of the expected value onto the right-hand side of the equation and constraining its coefficient to 1:

log_e(π_ij) = log_e(Exp_ij) + β0 + u_j

As the window warns, we first have to take the log_e of the expected value:

Data Manipulation on Main Menu
Command interface
enter the following commands into the lower box, one at a time, and press return
Name c17 'LogeExp'
calc 'LogeExp' = loge('exp')

Returning to the π_ij term in the second line of the Equations window, we can now specify the loge offset, followed by done, to get the revised equation. We can estimate this null model and then switch to MCMC estimation with a burn-in of 500, a monitoring run of 50,000, a thinning of 10, and requesting in the MCMC options hierarchical centering at level 2. The Poisson model is known to have MCMC chains that mix slowly, which is why we have started with a monitoring run of 50k. These values can then be stored as 2LevNull. The model estimates are as follows.
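Before looking at those estimates, note that the offset trick can be illustrated with a single-level analogue in Python's statsmodels (hypothetical counts; MLwiN fits the multilevel version):

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({"Obs": [12, 7, 30, 15],
                   "Exp": [10.2, 9.8, 21.4, 16.0]})    # hypothetical county-years

m = smf.glm("Obs ~ 1", data=df, family=sm.families.Poisson(),
            offset=np.log(df["Exp"])).fit()            # log(Exp) enters with coefficient 1
print(np.exp(m.params["Intercept"]))                   # the average relative risk (SMR)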

Not surprisingly, the log_e grand mean estimate is close to zero, indicating that the age standardisation has worked, as this corresponds to an average SMR of 1 when we exponentiate this value. We can now ascertain whether there is a linear trend in these values or whether something more complicated is needed.

Model on main menu
Equations
Set back the estimation to IGLS so as to modify the model
Add term, choose Year, tick on orthogonal polynomial, and choose degree 1

This specifies a linear trend; IGLS estimation to convergence and then MCMC estimation, storing the model as Linear. We can now see if there is evidence that this linear trend varies from place to place. Click on the orthog_year^1 variable and allow random-slopes variation for the linear trend parameter over county, as in the specification below.
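As an aside on what the orthogonal polynomial option constructs: one standard way to build such scores is via a QR decomposition of the raw polynomial basis, sketched below in Python (MLwiN's internal scores may differ by a scaling).

import numpy as np

time = np.arange(10, dtype=float)                      # Year = 0..9
X = np.vander(time, 3, increasing=True)                # columns: 1, t, t^2
Q, _ = np.linalg.qr(X)
orthog_linear, orthog_quadratic = Q[:, 1], Q[:, 2]     # mutually orthogonal trend scores
print(np.round(orthog_linear @ orthog_quadratic, 12))  # 0: uncorrelated by construction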

Estimate to convergence with IGLS and then with MCMC estimation, storing the model as LinSlopes. We will now see if there is any evidence for more complex trends by modifying the orthog_year^1 variable, requesting a 2nd-order polynomial, and allowing this quadratic parameter to vary over County2, as in the specification below. Again estimate to convergence with IGLS and then using MCMC, storing the results as Qslopes. 154

We can now compare the results of the four models, and we can see that the most parsimonious model with the highest predictive capacity (lowest DIC) is the one in which the linear slope parameter is allowed to vary over counties. That is, while there is no evidence of a strong general trend (the linear model is not an improvement over the null random-intercepts model), there is evidence of between-county differences in the linear trend. That must mean that in some counties the trend is upwards and in others downwards, given that the general term is flat. In fact, of course, we should not have anticipated a general trend, as the SMR has been calculated on the basis of the expected value for that year. The linear slope model will now be the base for the spatial modelling. Interestingly, a model without any random effects has a DIC of 7186, providing strong evidence that there are differences between Ohio counties in their risk of respiratory cancer.

154 If you get a warning message on starting MCMC estimation, switch back to IGLS, click on Ω and set to a diagonal matrix, and then click on it again to get the full covariance matrix; this will set the covariances to zero. Then switch to MCMC and estimate the model.

The comparison table has columns 2levNull, S.E., Linear, S.E., LinSlope, S.E., Qslopes, S.E., with rows:

Fixed Part: Cons; orthog_year^1; orthog_year^2
Random Part, Level: County2: cons/cons; orthog_year^1/cons; orthog_year^1/orthog_year^1; orthog_year^2/cons; orthog_year^2/orthog_year^1; orthog_year^2/orthog_year^2
Level: Time: bcons.1/bcons.1
DIC:

Spatially structured Poisson area-effects models

Although we already have the neighbourhood identifiers that define the patch, we also have to create a weight to go with these identifiers. Again, in the absence of anything better, we will use equal weights that sum to 1 for each set of neighbourhood joins, as we did with the Carolina data. We start by naming adjacent empty columns to hold the 9 sets of weights, with the names wt0 (to hold the original county weight with itself), wt1 for the weight for the first neighbour, and up to wt8. We then take each neighbour in turn and recode all non-zero values (that is, values 1 to 88) to a new value of 1, summing the values across the rows to create a new variable TotWeight. Finally we need to divide each of the nine weights in turn by this total to get an appropriate weight. Once this is completed, the weights columns should look like the following extract.

Thus County 1, which has 4 neighbours plus itself, has 5 equal weights of 0.2. You can see that the weights are replicated 10 times to reflect the repeated-measures structure of years within counties. To specify the model, begin in the Equations window by clicking on the response and increasing the number of levels to 3 (you have to be in IGLS/RIGLS to do this), choosing County as the level-3 identifier. Then click on the Constant in the fixed part and allow variation at level 3 to get the following model.

Staying with the IGLS/RIGLS estimation, Start the model to convergence. These estimates should be dismissed as they completely ignore the cross-classified structure. Now switch to MCMC in the Estimation Control window. This will allow you to specify the multiple membership cross-classification at level 3:

Model on main menu
MCMC
Classifications
Choose to Treat Levels as cross-classified [leave level 2 to be the unstructured effects]
Choose Multiple classifications at level 3
Specify the Number of columns to be 9 [the largest number of neighbours]
Specify that Weights start Column Wt0 [the identifiers will be understood as the 8 columns following County; both the identifiers and weights have to be in consecutive columns]
Done

After this, the model will still look like a hierarchical one on the Equations screen and will not convey the information that the model is now cross-classified with multiple membership weights. To overcome this, use the classification notation. In the bottom toolbar of the equations window, choose Notation and ensure multiple subscripts is turned off.

It can be clearly seen that the 3rd classification is based on a set of weighted residuals (these are the spatially structured residuals), while the 2nd classification does not have any weights; these are the unstructured residuals. Start to estimate, keeping a burn-in of 500 and a monitoring run of 50k, and choosing the hierarchical centering to be at level 3 to hopefully speed up estimation. You may get an error at this point due to poor starting values from the IGLS estimation, which of course does not know the spatial membership relations. This can be changed by editing the column (c1096) where the estimates are stored and replacing the zeroes for variances with 0.01, leaving the covariances alone. A low value, 0.01, was used so as not to have a major impact on the final result. You may have to experiment with this (we found that a smaller value was too small and we still got a warning). The initial estimates before running the MCMC chains, but after modifying c1096 and requesting more details on the model specification (by clicking on + on the toolbar at the bottom), are as follows.

After 50k simulations with hierarchical centering at level 3, the results were stored in a model labelled Spat1. A more complicated model was then fitted in which the parameter for the linear trend was allowed to vary over the spatially structured neighbourhoods. This was achieved by clicking on the linear trend parameter in the fixed part and allowing it to vary at level 3. The same process was adopted: estimating in IGLS, changing variances that

are zero to 0.01, and then switching to MCMC with a burn-in of 500 followed by 50k simulations, to get the following results, which are stored as Spat2. Comparing the three sets of models, they have very similar DIC values; there is not a great deal of evidence that a spatial model is needed. The comparison table has columns LinSlope, S.E., SpatMod1, S.E., SpatMod2, S.E., with rows:

Fixed Part: Cons; orthog_year^1
Random Part, Level: County2: cons/cons; orthog_year^1/cons; orthog_year^1/orthog_year^1
Level: Time: bcons.1/bcons.1
Level: County: cons/cons; orthog_year^1/cons; orthog_year^1/orthog_year^1
DIC:

However, in pedagogic mode, we will continue by looking at the variance functions at level 2 for the unstructured area effects and at level 3 for the spatially structured effects, storing them in c31 and c32 respectively.

We can then plot them against time to see if the rates are diverging or converging: the differences between the spatial patches are growing, while the aspatial differences are much smaller and declining. However, we have to be careful not to over-interpret the results when there is weak evidence for these additional structured spatial effects. Consequently we will return to the MCMC unstructured model with varying slopes for the linear trend over areas, which is a substantial improvement over simpler models.
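For reference, the variance functions stored in c31 and c32 are quadratic in the trend variable for a random-intercepts-and-slopes model; a minimal Python sketch with hypothetical (co)variance values:

import numpy as np

var_u0, cov_u01, var_u1 = 0.12, 0.01, 0.05     # hypothetical level (co)variance estimates
t = np.linspace(-0.5, 0.5, 10)                 # orthogonal linear trend scores
between_area_variance = var_u0 + 2 * cov_u01 * t + var_u1 * t**2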

Using this model we can estimate and plot the between-area variance function with 95% CIs and the predictions for each place. There do appear to be genuine differences between counties, but there is not strong evidence that these are increasing over time. The predictions for each county can be obtained on the log scale and turned into a relative risk by exponentiation:

calc c41 = expo(c41)

and plotted against Year. There are therefore half a dozen or so counties that are potential hotspots with a rising relative risk. This model could then be used as a basis for more focussed hypothesis testing by including distance from known pollution sources in the model.

An appreciation of what we have achieved comes from comparing the results for the raw unmodelled relative risk with the modelled relative risk. We have highlighted two counties in all three graphs plotted below. County 80, which is coloured blue, has an SMR of 0.31 in 1988, based on an observed-to-expected ratio of 6 to 19 deaths. This is modelled to be consistently low. County 41, which is coloured red, has an SMR of 1.57 in 1988, derived from an observed-to-expected ratio of 77 to 49. It is consistently high and rising. The raw data has some extreme peaks for county 34, where the SMR reaches a very high risk; the modelled data for this county is shrunk through precision-weighted estimation towards the general trend, as the rate is based on an expected count of only 9 or so deaths. With small expected counts we can expect non-substantive high relative risks just by chance.

Self-rated health of the elderly in China 155

This short section is meant to be illustrative of what can be achieved, not to provide a detailed account of the modelling process. The distinctive feature of this study is that individual data are used in the model and not already-aggregated data. It is based on survey data from the Chinese Longitudinal Healthy Longevity Survey, a large sample of elders aged over 60 in 22 provinces of China. The response is self-rated health, a subjective assessment of one's own health status. We are going to model this in a binary form: in good health as opposed to poor/fair health. The data has a three-level hierarchical structure, with individuals nested within provinces nested in patches of provinces. A null or empty binomial logit model was specified, and 100,000 MCMC simulations resulted in the following estimates. There is a moderately large spatially structured variance on the logit scale, which using Larsen's MOR procedure (Chapter 12) translates into an odds of 1.75 for the average effect of conceptually re-locating a person from a low to a high patch (a sketch of the calculation is given below). The 95% credible intervals for the odds ratio, using the MCMC 2.5% lowest and highest values, however show a lot of uncertainty about the size of this patch effect, as the MOR credible intervals range upwards from 1.23. Equivalent values for the unstructured variance gave a MOR of 1.19 and 95% credible intervals of 1.09 and 1.35, which are clearly smaller.

The plot below shows the estimated odds for each province in the null model, and the bigger effects of the spatially structured effects are clearly visible. An interesting case revealed by this plot is Henan province: the Henan patch is below the all-China average of 1, but the aspatial unstructured area effect for Henan is clearly well above 1. The core area of the patch, Henan itself, has better health for the elderly than its neighbours.

155 We thank Zhixin Feng for his research reported in this section.
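Larsen's MOR used here is a deterministic function of the between-patch variance; a short Python sketch (the variance value below is back-solved to reproduce the 1.75 quoted above, since the estimate itself is not reproduced here):

from math import exp, sqrt
from scipy.stats import norm

def median_odds_ratio(variance):
    # Larsen's MOR: exp(sqrt(2 * variance) * Phi^{-1}(0.75))
    return exp(sqrt(2 * variance) * norm.ppf(0.75))

print(round(median_odds_ratio(0.344), 2))   # about 1.75, matching the text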

A more complex model was fitted in which a large number of individual and family circumstances were included. The spatially structured effects remained moderate in their size, but there was a large standard error, so we must treat them with caution. Here are the plots of the spatially structured and unstructured residuals.

It is noticeable that, while the effects are attenuated once account is taken of the differential composition of the provinces in terms of individuals, the Henan patch is still generally low health, while Henan itself is even more of an outlier. There is something that is making those sampled in Henan report better health than its provincial neighbours, even when account is taken of individual age, sex, education, income, source of finance, who lives with them, and whether they think they have sufficient finance. Here are the maps of the spatially structured residuals for the null and the complex model.

Null model: spatially structured province effects
Full model: spatially structured province effects

While care must be taken in interpretation, as different cut-offs have been used on the maps (both use quartiles), essentially the same map patterns are found.
