Alexina Mason. Department of Epidemiology and Biostatistics, Imperial College, London. 16 February 2010

Strategy for modelling non-random missing data mechanisms in longitudinal studies using Bayesian methods: application to income data from the Millennium Cohort Study

Alexina Mason
Department of Epidemiology and Biostatistics, Imperial College, London
16 February 2010

With thanks to Nicky Best, Ian Plewis and Sylvia Richardson. This work was supported by an ESRC PhD studentship.

Outline
1 Motivation
  - Introduction
  - MCS income example
2 Modelling Strategy
  - Overview
  - Construct a base model
  - Sensitivity analysis
3 Application
  - Construct a base model for MCS income
  - Sensitivity analysis for MCS income

Why do we need a missing data strategy?
- Inevitably, longitudinal studies lose members over time and generally suffer from missing data
- Analysis of such data is complicated by missing covariates and missing responses
- Many approaches have been proposed
- The appropriateness of a particular approach depends on the mechanism that leads to the missing data, but this cannot be determined from the data
- So researchers are forced to make assumptions, and are strongly recommended to check the robustness of their conclusions to alternative plausible assumptions
- This can be complicated, so a flexible strategy can help

Why does the strategy use Bayesian methods?
- Bayesian full probability modelling is a statistically principled method for dealing with missing data, i.e. it
  - combines information in the observed data with assumptions about the missing value mechanism
  - accounts for the uncertainty introduced by the missing data
- Bayesian methods allow complex models to be constructed in a modular way; for example, a Bayesian joint model may consist of submodels for
  - analysing the question of interest
  - imputing missing covariates
  - allowing the missingness mechanism to be informative
- They enable coherent model estimation
- They facilitate sensitivity analysis
- However, the principles of the strategy could be adapted for a non-Bayesian framework

Millennium Cohort Study (MCS) example
- MCS has 18,000+ cohort members born in the UK at the beginning of the Millennium
- Using sweeps 1 and 2, our example predicts income for main respondents (usually the cohort member's mother) meeting the criteria:
  - single in sweep 1
  - in work
  - not self-employed
- Motivating questions about income include:
  - does ethnicity affect rate of pay?
  - how much extra do individuals earn if they have a degree?
  - does change in partnership status affect income?

Missingness in the MCS income dataset
- Initial dataset has 559 records

  Sweep 1
                   covariates observed   covariates missing
  pay observed     505                   7
  pay missing      43                    4

- Restrict dataset to individuals fully observed in sweep 1

  Sweep 2 (for the remaining 505 individuals)
                   covariates observed   covariates missing
  pay observed     320                   0
  pay missing      19                    166

- Do not distinguish between item and sweep non-response
- All the covariate missingness comes from sweep non-response

Types of missing data
- Following Rubin, missing data are generally classified into 3 types
- Consider the mechanism that led to missing pay in sweep 2 ($pay_2$), defining $p_i$ to be the probability that $pay_2$ is missing for individual $i$
- Missing Completely at Random (MCAR): missingness does not depend on observed or unobserved data
  $p_i = \theta_0$
- Missing at Random (MAR): missingness depends only on observed data
  $p_i = \theta_0 + \theta_1 pay_{1i}$
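To make the distinction concrete, here is a minimal sketch (not from the talk) that computes the probability of missing $pay_2$ under each mechanism. It rests on several assumptions: a logit link for $p_i$, made-up coefficient values, and the inclusion of the third of Rubin's types, Missing Not at Random (MNAR), in which $p_i$ also depends on the possibly unobserved $pay_2$ through a hypothetical coefficient `delta`.

```python
import math
import random


def inv_logit(x):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def prob_missing(pay1, pay2, theta0, theta1=0.0, delta=0.0):
    """Probability that pay2 is missing for one individual.

    theta1 = delta = 0  -> MCAR (depends on no data)
    delta = 0           -> MAR  (depends only on observed pay1)
    delta != 0          -> MNAR (depends on the possibly unobserved pay2)
    The logit link and all coefficient values are illustrative assumptions.
    """
    return inv_logit(theta0 + theta1 * pay1 + delta * pay2)


# Hypothetical hourly pay values for five individuals.
rng = random.Random(1)
pay1 = [rng.uniform(4.0, 20.0) for _ in range(5)]
pay2 = [p + rng.gauss(0.0, 2.0) for p in pay1]

for x1, x2 in zip(pay1, pay2):
    mcar = prob_missing(x1, x2, theta0=-1.0)
    mar = prob_missing(x1, x2, theta0=-1.0, theta1=-0.1)
    mnar = prob_missing(x1, x2, theta0=-1.0, theta1=-0.1, delta=0.05)
    print(f"pay1={x1:5.2f} pay2={x2:5.2f} MCAR={mcar:.3f} MAR={mar:.3f} MNAR={mnar:.3f}")
```

Under MCAR the probability is the same for everyone; under MAR it varies with sweep 1 pay only; under MNAR two individuals with identical sweep 1 pay can still differ, which is why the mechanism cannot be identified from the observed data alone.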

Schematic Diagram
[Figure: flowchart of the modelling strategy, with the following elements]
- 1: select MoI using complete cases (note plausible alternatives)
- 2: add CMoM
- 3: add RMoM
- 4: seek additional data
- 5: elicit expert knowledge
- steps 1 to 5 lead to the BASE MODEL
- 6: ASSUMPTION SENSITIVITY
- 7: PARAMETER SENSITIVITY
- 8: Are conclusions robust?
  - YES: report robustness
  - NO: determine region of high plausibility; recognise uncertainty
- assess fit of validation sample; calculate DIC

Schematic Diagram
The strategy can be thought of as consisting of two parts:
- Constructing a base model
- Assessing conclusions from this base model against a selection of well-chosen sensitivity analyses

Construct a base model I
1 Form an initial Model of Interest (MoI) using only complete cases, which includes choosing
  - transform for the response
  - model structure
  - set of explanatory variables
2 Add a Covariate Model of Missingness (CMoM) to produce realistic imputations of any missing covariates
3 Add a Response Model of Missingness (RMoM) to allow informative missingness in the response

The joint model: schematic diagram
[Figure: schematic of the joint model, made up of three linked submodels]
- Submodels: model of interest; covariate model of missingness; response model of missingness
- Nodes: model of interest parameters; covariate model of missingness parameters; response model of missingness parameters; response with missingness; covariates with missingness; fully observed covariates; probability of missingness; missingness indicator
- The response model of missingness is the part required for non-ignorable missingness in the response
- Information from additional sources may help with the estimation of the parameters

Construct a base model II
4 Additional data can help with parameter estimation. Possible sources include
  - earlier/later sweeps of the longitudinal study not under investigation
  - another study on individuals with similar characteristics
5 Expert knowledge can be incorporated using informative priors. Information relating to the RMoM has the potential to make a large impact, particularly regarding its functional form.

Perform a sensitivity analysis
6 Form alternative models from the base model by changing key assumptions, including:
  - MoI error distribution
  - MoI response transform
  - functional form of the RMoM
7 Run the base model with the parameters controlling the extent of the departure from MAR fixed to a range of plausible values.
8 Use the results of both types of sensitivity analysis to establish variability in the quantities of interest.

Initial model of interest
We choose
- log of hourly net pay as our response
- 6 explanatory variables
- a t distribution with 4 degrees of freedom ($t_4$) for the errors, for robustness to outliers

Description of explanatory variables
  short name   description              details
  age          main respondent's age    continuous (a)
  edu          educational level        3 levels (1=none/NVQ1; 2=NVQ2/3; 3=NVQ4/5) (b)
  eth          ethnic group             2 levels (1=white; 2=non-white)
  sing         single/partner           2 levels (1=single; 2=partner) (c)
  reg          region of country        2 levels (1=London; 2=other)
  stratum      ward type by country     9 levels (d)

  (a) centred and standardised
  (b) the level of National Vocational Qualification (NVQ) equivalence of the individual's highest academic or vocational educational qualification (level 3 has a degree)
  (c) always single in sweep 1
  (d) three strata for England (advantaged, disadvantaged and ethnic minority); two strata each for Wales, Scotland and Northern Ireland (advantaged and disadvantaged)

Initial model of interest: the equations

$y_{it} \sim t_4(\mu_{it}, \sigma^2)$
$\mu_{it} = \alpha_i + \gamma_{s(i)} + \sum_{k=1}^{p} \beta_k x_{kit} + \sum_{k=p+1}^{q} \beta_k z_{ki}$

- $y_{it}$: log of hourly pay (hpay). Alternative (AS3): cube root transform
- $t_4$ errors: robustness to outliers. Alternative (AS1): Normal errors
- $\alpha_i$: individual random effects
- $\gamma_{s(i)}$: stratum specific intercepts
- covariates: eth (ethnic group), age (main respondent's age), edu (educational level), reg (London/other), sing (single/partner). Alternative (AS2): include $age^2$ and $age \times edu$ interaction terms
- vague priors, e.g. $\beta_k \sim N(0, 10000^2)$
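As an illustration of the equations above, the following sketch evaluates the linear predictor $\mu_{it}$ for one record and compares the $t_4$ and Normal (AS1) log densities for an outlying observation. It is not the author's implementation: the function names, the intercept values and the treatment of $\sigma$ as a scale parameter are assumptions made for this example.

```python
import math


def linear_predictor(alpha_i, gamma_s, betas, covariates):
    """mu_it = alpha_i + gamma_s(i) + sum_k beta_k * covariate_k."""
    return alpha_i + gamma_s + sum(b * x for b, x in zip(betas, covariates))


def t_logpdf(y, mu, sigma, df=4):
    """Log density of a location-scale Student t (sigma is the scale)."""
    z = (y - mu) / sigma
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi) - math.log(sigma))
    return log_norm - (df + 1) / 2 * math.log1p(z * z / df)


def normal_logpdf(y, mu, sigma):
    """Log density of a Normal distribution (the AS1 alternative)."""
    z = (y - mu) / sigma
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * z * z


# Hypothetical record: made-up random effect and stratum intercept, two
# indicator covariates with illustrative coefficients.
mu = linear_predictor(alpha_i=0.1, gamma_s=1.5, betas=[0.35, -0.06], covariates=[1, 1])

# An observation 5 scale units from its mean is penalised far less by t_4.
outlier = mu + 5.0
print(t_logpdf(outlier, mu, 1.0), normal_logpdf(outlier, mu, 1.0))
```

The heavier tails of the $t_4$ distribution mean that a single extreme pay value pulls the fitted mean less than it would under Normal errors, which is the robustness the slide refers to.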

Conclusions based on complete cases
- Higher hourly pay is associated with having a degree
- Little evidence of an association between pay and ethnicity
- Lower pay is associated with gaining a partner between sweeps

Key parameter estimates based on a complete case analysis
                     Complete Cases
  β_edu[nvq2&3]      0.15 (0.06, 0.25)
  β_edu[nvq4&5]      0.35 (0.24, 0.45)
  β_eth              -0.04 (-0.18, 0.10)
  β_sing             -0.07 (-0.14, 0.00)
Table shows the posterior mean, with the 95% interval in brackets

Covariate model of missingness
- Assume covariates are missing at random (MAR)
- stratum and eth do not change between sweeps
- Imputation of missing sweep 2 values is required for the other 4 covariates
  - reg: assign sweep 1 value
  - age, edu and sing: set up a joint imputation model using latent variables with Normal distributions for edu and sing
- Fully observed sweep 1 covariates are used as explanatory variables in the imputation model

Response model of missingness (selection model)
- Allow informative missingness in the response by modelling a missing value indicator ($m_i$) for sweep 2 pay ($hpay_{i2}$) such that
  $m_i = \begin{cases} 1 & : hpay_{i2} \text{ observed} \\ 0 & : hpay_{i2} \text{ missing} \end{cases}$
- Use a logit model for response, i.e. $m_i \sim \text{Bernoulli}(p_i)$; $\text{logit}(p_i) = ?$
- Previous work in this area informs the choice of
  - predictors of missing income
  - functional form
- Untransformed hourly pay is used in this sub-model

Response model of missingness: the equations

$m_i \sim \text{Bernoulli}(p_i)$
$\text{logit}(p_i) = \theta_0 + \text{Piecewise}(level_i) + \text{Piecewise}(change_i) + \sum_k \theta_k w_{ki}$

- $level_i = hpay_{i1}$
- $change_i = hpay_{i2} - hpay_{i1}$
- $w_{ki}$: eth (ethnic group), sc (social class), ctry (country)

$\text{Piecewise}(level_i) = \begin{cases} \theta_{level[1]} (level_i - 10) & : level_i < 10 \\ \theta_{level[2]} (level_i - 10) & : level_i \geq 10 \end{cases}$
$\text{Piecewise}(change_i) = \begin{cases} \delta_1 \, change_i & : change_i < 0 \\ \delta_2 \, change_i & : change_i \geq 0 \end{cases}$

- The choice of functional form and the position of the knots are based on expert knowledge. Alternative (AS4): linear functional form
- vague priors
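The piecewise linear predictor can be sketched in a few lines. This is an illustrative reading of the slide rather than the fitted model: the knot at 10 for level and at 0 for change follows the equations as reconstructed from the slide, the function names are hypothetical, and the coefficient values in the usage lines are the base case posterior means reported in the talk's parameter table, used as fixed numbers.

```python
import math


def piecewise_level(level, theta_low, theta_high, knot=10.0):
    """Slope theta_low below the knot and theta_high at or above it."""
    slope = theta_low if level < knot else theta_high
    return slope * (level - knot)


def piecewise_change(change, delta1, delta2):
    """Slope delta1 for falls in pay and delta2 for rises (knot at zero)."""
    return (delta1 if change < 0 else delta2) * change


def prob_observed(hpay1, hpay2, theta0, theta_level, delta, other=0.0):
    """p_i = P(m_i = 1), the probability that sweep 2 pay is observed.

    `other` stands in for the sum of theta_k * w_ki over the remaining
    covariates (eth, sc, ctry).
    """
    level = hpay1
    change = hpay2 - hpay1
    eta = (theta0 + piecewise_level(level, *theta_level)
           + piecewise_change(change, *delta) + other)
    return 1.0 / (1.0 + math.exp(-eta))


# Base case posterior means used as plug-in values.
theta0, theta_level, delta = 2.73, (0.29, 0.59), (0.67, -0.21)

no_change = prob_observed(10.0, 10.0, theta0, theta_level, delta)
pay_fall = prob_observed(10.0, 5.0, theta0, theta_level, delta)
pay_rise = prob_observed(10.0, 15.0, theta0, theta_level, delta)
print(no_change, pay_fall, pay_rise)
```

With these values, both a fall and a rise in pay reduce the probability that sweep 2 pay is observed, which a single linear slope (AS4) cannot represent.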

Response model of missingness parameter estimates
- Sweep 2 pay is more likely to be missing for individuals
  - who are non-white
  - who have low levels of pay in sweep 1
  - whose pay changes substantially between sweeps
- Social class and country make little difference

                                 BASE CASE
  θ_0                            2.73 (1.80, 3.89)
  θ_level[1]                     0.29 (0.10, 0.52)
  θ_level[2]                     0.59 (0.29, 0.91)
  δ_change[1]                    0.67 (0.39, 0.97)
  δ_change[2]                    -0.21 (-0.36, -0.06)
  θ_ctry[2:Wales]                -0.16 (-0.82, 0.52)
  θ_ctry[3:Scotland]             0.17 (-0.46, 0.81)
  θ_ctry[4:Northern Ireland]     0.29 (-0.36, 0.97)
  θ_eth                          -1.13 (-1.82, -0.46)
  θ_sc[2]                        0.06 (-0.70, 0.85)
  θ_sc[3]                        -0.06 (-1.08, 1.00)
  θ_sc[4]                        0.15 (-0.62, 0.95)
Table shows posterior mean (95% interval)

- The 95% intervals of the change parameters (δ) do not include zero: evidence of informative missingness given the model assumptions

Impact on substantive questions
- Conclusions regarding education and ethnicity unchanged
- Evidence of an association between hourly pay and gaining a partner between sweeps has strengthened

Comparison of parameter estimates from model of interest (complete cases) and joint model (BASE CASE)
                     Complete Cases          BASE CASE
  β_edu[nvq2&3]      0.15 (0.06, 0.25)       0.17 (0.09, 0.25)
  β_edu[nvq4&5]      0.35 (0.24, 0.45)       0.35 (0.25, 0.44)
  β_eth              -0.04 (-0.18, 0.10)     -0.06 (-0.18, 0.06)
  β_sing             -0.07 (-0.14, 0.00)     -0.11 (-0.20, -0.02)
Table shows the posterior mean, with the 95% interval in brackets

Incorporating additional sources of information
- Where information is limited, some parameters in the joint model can be difficult to estimate
- But we can increase the amount of information available by incorporating data from other sources, e.g.
  - data from other studies
  - expert opinion

Seek additional data
- For example, imputing edu is difficult because few individuals gain qualifications between sweeps
- Seek another study with individuals with similar characteristics which includes education variables
- Expand the Covariate Model of Missingness (CMoM) to simultaneously model data from the original study (MCS) and the additional study by fitting 2 sets of equations with common coefficients
  - 1 set for imputing the missing MCS covariates
  - 1 set for modelling the additional data
- The extra data allow the parameters in the CMoM to be estimated with greater accuracy

Elicit expert knowledge
- The Bayesian approach provides the option of including additional information through informative priors
- This is of greatest potential value for the parameters associated with informative missingness
- Informative priors can be formed through elicitation, but are difficult to elicit directly
- Instead
  - elicit information about the probability of response at design points
  - convert this to informative priors
- A good elicitation strategy would
  - identify and concentrate on weakly identified variables
  - allow for correlation between these variables
  - focus on functional form

Base model fit using validation sample
- Data were collected from 7 individuals who were originally non-contacts or refusals in sweep 2, after they were re-issued by the fieldwork agency
- We set these data to missing before fitting our models, so they can now be used for model checking
- For BASE, hourly pay is well estimated for all 7 individuals

[Figure: BASE posterior predictive distribution and observed value of hourly pay of 4 re-issued individuals, panels A to D; x-axis: hourly pay (£); y-axis: probability density. The true value of hourly pay is indicated by the red line.]

Assumption sensitivity analysis: description
BASE CASE key features:
- MoI: $t_4$ error distribution
- MoI: covariates {age, edu, eth, reg, sing}
- MoI: log transform of the response
- RMoM: piecewise linear functional form for level and change

Assumption sensitivity analysis, differences from BASE CASE:
- AS1: MoI, Normal error distribution
- AS2: MoI, additional covariates $age^2$ and $age \times edu$
- AS3: MoI, cube root transform of response
- AS4: RMoM, linear functional form for level and change

MoI = Model of Interest; RMoM = Response Model of Missingness

Assumption sensitivity analysis: results
Based on this sensitivity analysis
- ethnicity: conclusions from BASE CASE are robust
- gaining a partner: consistent evidence of association with lower pay, but strength is unclear

Comparison of parameters associated with being non-white and gaining a partner between sweeps
                                  β_eth                   β_sing
  CC                              -0.04 (-0.18, 0.10)     -0.07 (-0.14, 0.00)
  BASE                            -0.06 (-0.18, 0.06)     -0.11 (-0.20, -0.02)
  AS1 (Normal errors)             0.01 (-0.12, 0.15)      -0.12 (-0.23, -0.01)
  AS2 (additional covariates)     -0.05 (-0.18, 0.07)     -0.11 (-0.20, -0.01)
  AS4 (linear level & change)     -0.06 (-0.18, 0.07)     -0.16 (-0.25, -0.07)
  AS3 (cube root transform)       -0.04 (-0.12, 0.04)     -0.08 (-0.14, -0.02)
Table shows the posterior mean, with the 95% interval in brackets.

AS1-AS4: model fit using validation sample
- The mean square error (MSE) of the fit of hourly pay for the 7 re-issues is a summary measure of model performance
- The models with the linear functional form for the RMoM (AS4) and with the cube root transform (AS3) fit the re-issued individuals best

MSE of the fit of hourly pay for the 7 re-issued individuals
                                  median     95% interval
  BASE                            18.7       (3.1, 367.0)
  AS1 (Normal errors)             16.8       (3.2, 108.8)
  AS2 (additional covariates)     14.2       (2.8, 295.3)
  AS3 (cube root transform)       8.0        (1.9, 73.6)
  AS4 (linear level & change)     8.8        (2.9, 21.7)
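The slides do not spell out how the MSE summary is computed. One plausible reading, sketched below as an assumption, is that the MSE across the 7 re-issued individuals is evaluated for every posterior predictive draw and then summarised by its median and central 95% interval. All data in the example are made up, and the function names are hypothetical.

```python
import random
import statistics


def mse_per_draw(pred_draws, observed):
    """MSE across individuals, computed separately for each posterior draw."""
    return [
        sum((p - y) ** 2 for p, y in zip(draw, observed)) / len(observed)
        for draw in pred_draws
    ]


def summarise(values):
    """Median and central 95% interval (2.5% and 97.5% quantiles)."""
    cuts = statistics.quantiles(values, n=40, method="inclusive")
    return statistics.median(values), (cuts[0], cuts[-1])


# Hypothetical inputs: observed hourly pay for 7 re-issued individuals and
# 1000 posterior predictive draws for each of them.
rng = random.Random(0)
observed = [6.5, 8.0, 5.2, 12.1, 7.4, 9.9, 15.3]
draws = [[y + rng.gauss(0.0, 3.0) for y in observed] for _ in range(1000)]

median, (low, high) = summarise(mse_per_draw(draws, observed))
print(f"MSE median {median:.1f}, 95% interval ({low:.1f}, {high:.1f})")
```

Summarising the distribution of MSE values rather than a single number shows how uncertain the fit is: a wide upper limit, as reported for BASE, indicates that some draws predict the re-issued individuals poorly.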

Parameter sensitivity analysis: description
- Recall $change_i = hpay_{i2} - hpay_{i1}$, and
  $\text{logit}(p_i) = \begin{cases} \ldots + \delta_1 \, change_i + \ldots & : change_i < 0 \\ \ldots + \delta_2 \, change_i + \ldots & : change_i \geq 0 \end{cases}$
- The values of $\delta_1$ and $\delta_2$ control the degree of departure from MAR
- $\delta_1$ and $\delta_2$ are difficult for the model to estimate
- A series of models is run with these two parameters fixed
  - 81 variants formed by combining 9 values of $\delta_1$ with 9 values of $\delta_2$
  - The value set for both $\delta_1$ and $\delta_2$ is $\{-1, -0.75, -0.5, -0.25, 0, 0.25, 0.5, 0.75, 1\}$
  - 9 variants have a linear functional form of change, i.e. $\delta_1 = \delta_2$
  - The $\delta_1 = \delta_2 = 0$ variant is equivalent to assuming the response is MAR
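The grid of variants can be enumerated directly, which also confirms the counts quoted on the slide. This is only a sketch of the bookkeeping; fitting the joint model for each pair is not shown, and the variable names are chosen for this example.

```python
from itertools import product

# Value set used for both delta1 and delta2.
DELTA_VALUES = [-1, -0.75, -0.5, -0.25, 0, 0.25, 0.5, 0.75, 1]

# Every (delta1, delta2) combination defines one variant of the base model.
variants = list(product(DELTA_VALUES, repeat=2))

# Variants with equal slopes have a linear functional form for change.
linear = [(d1, d2) for d1, d2 in variants if d1 == d2]

# The single variant with both slopes fixed at zero corresponds to MAR.
mar = [(d1, d2) for d1, d2 in variants if d1 == 0 and d2 == 0]

print(len(variants), len(linear), len(mar))  # prints: 81 9 1
```

Each pair would then be passed as fixed values to the response model of missingness, with the joint model rerun once per variant.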

Parameter sensitivity analysis: results - tabular
Proportional increase in pay associated with selected covariates for PS variants compared with base model (BASE)

                 PS variants                                           MAR (a)         BASE
                 minimum         δ1      δ2      maximum         δ1      δ2
  edu[nvq2&3]    1.15            -1      1       1.19            1       -0.25   1.18            1.19
                 (1.06, 1.25)                    (1.10, 1.29)                    (1.06, 1.26)    (1.09, 1.29)
  edu[nvq4&5]    1.31            -1      1       1.44            1       -0.5    1.35            1.41
                 (1.18, 1.44)                    (1.31, 1.58)                    (1.19, 1.46)    (1.28, 1.56)
  eth            0.93            0.25    -0.75   0.97            -1      0.75    0.94            0.94
                 (0.82, 1.05)                    (0.85, 1.12)                    (0.84, 1.09)    (0.83, 1.06)
  sing           0.77            1       1       1.32            -1      -1      0.94            0.90
                 (0.71, 0.84)                    (1.18, 1.47)                    (0.88, 0.99)    (0.82, 0.98)
Table shows the posterior mean, with the 95% interval in brackets.
(a) $\delta_1 = 0$ and $\delta_2 = 0$ is MAR
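Because the response is the log of hourly pay, a coefficient $\beta$ corresponds to a proportional change of $e^{\beta}$ in pay. The sketch below applies this conversion to the BASE CASE posterior means reported earlier. It is an approximation made for illustration: the talk reports the posterior mean of $e^{\beta}$, which need not equal the exponential of the posterior mean of $\beta$, so small differences from the table are expected.

```python
import math


def proportional_change(beta):
    """Multiplicative effect on pay of a coefficient on the log-pay scale."""
    return math.exp(beta)


# BASE CASE posterior means of the coefficients (log scale).
base = {"edu[nvq2&3]": 0.17, "edu[nvq4&5]": 0.35, "eth": -0.06, "sing": -0.11}

for name, beta in base.items():
    print(f"{name}: {proportional_change(beta):.2f}")
```

A value below 1 indicates lower pay; for example, a proportional change of about 0.90 for sing corresponds to roughly 10% lower hourly pay for those gaining a partner.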

Parameter sensitivity analysis: results - graphical I
Estimated proportional change in pay associated with being non-white versus $\delta_1$ conditional on $\delta_2$ from PS variants
[Figure: five panels, one for each of $\delta_2$ = -1, -0.5, 0, 0.5, 1; x-axis: $\delta_1$; y-axis: $e^{\beta_{eth}}$; posterior mean and 95% interval shown.]

Parameter sensitivity analysis: results - graphical II
Estimated proportional change in pay associated with gaining a partner between sweeps versus $\delta_1$ conditional on $\delta_2$ from PS variants
[Figure: five panels, one for each of $\delta_2$ = -1, -0.5, 0, 0.5, 1; x-axis: $\delta_1$; y-axis: $e^{\beta_{sing}}$; posterior mean and 95% interval shown.]

Parameter sensitivity analysis: results - graphical III
Posterior mean of proportional change in pay associated with selected covariates versus $\delta_1$ and $\delta_2$ from PS variants
[Figure: two contour plots with $\delta_1$ on the x-axis and $\delta_2$ on the y-axis, each running from -1.0 to 1.0. Panels: non-white ($e^{\beta_{eth}}$); gaining a partner ($e^{\beta_{sing}}$).]
The points at the δ values relating to MAR, BASE and AS4 are marked with a red circle, blue triangle and green diamond respectively.

Reporting robustness: ethnicity question
Does ethnicity affect rate of pay?
Key points to report are:
- No evidence of an association between ethnicity and hourly pay
- Base model results: the proportional change in hourly pay associated with being non-white has a posterior mean of 0.94, with a 95% interval from 0.83 to 1.06
- These conclusions are very robust to our sensitivity analysis

But results relating to gaining a partner are not robust, so we need to investigate the plausibility of different models

Assess fit using validation sample
Mean square error of the fit of hourly pay for the 7 re-issued individuals versus $\delta_1$ and $\delta_2$ from PS variants
[Figure: contour plot of MSE with $\delta_1$ on the x-axis and $\delta_2$ on the y-axis, each running from -1.0 to 1.0; contour levels from 10 to 100.]
The points at the δ values relating to MAR, BASE and AS4 are marked with a red circle, blue triangle and green diamond respectively.

Determine region of high plausibility
[Figure: two contour plots with $\delta_1$ on the x-axis and $\delta_2$ on the y-axis, each running from -1.0 to 1.0. Panels: MSE for re-issues; proportional change in pay with gaining a partner.]
The points at the δ values relating to MAR, BASE and AS4 are marked with a red circle, blue triangle and green diamond respectively.

Recognising uncertainty: partnership question
Does change in partnership status affect income?
Key points to report are:
- There is evidence that gaining a partner is associated with a decrease in hourly pay
- However, the magnitude of this decrease is uncertain
- Our analysis suggests that the proportional change in pay lies in the region 0.77 (0.71, 0.84) to 0.94 (0.88, 0.99), i.e. a decrease of roughly 6% to 23%
- Some models run as part of the sensitivity analysis suggest that change in partnership status is associated with an increase in pay, but these models do not fall in the region of high plausibility

Why should gaining a partner be associated with a decrease in hourly pay?
- proxy for additional child?
- reverse causality?

Summary
- Compared to a complete case analysis, implementing this strategy
  - is time-consuming
  - but allows realistic assumptions about the missingness mechanism to be explored
  - and provides confidence in conclusions to questions of interest
- The proposed strategy is flexible
  - steps can be omitted if appropriate
  - it can be extended if necessary
  - it can be applied to other types of studies

Relevant literature
- The BIAS project. www.bias-project.org.uk/.
- Best, N. G., Spiegelhalter, D. J., Thomas, A., and Brayne, C. E. G. (1996). Bayesian Analysis of Realistically Complex Models. Journal of the Royal Statistical Society, Series A (Statistics in Society), 159, (2), 323-42.
- Daniels, M. J. and Hogan, J. W. (2008). Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis. Chapman & Hall.
- Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data, (2nd edn). John Wiley and Sons.
- Plewis, I. (2007). Non-Response in a Birth Cohort Study: The Case of the Millennium Cohort Study. International Journal of Social Research Methodology, 10, (5), 325-34.
