
Strategy for modelling non-random missing data mechanisms in longitudinal studies using Bayesian methods: application to income data from the Millennium Cohort Study
Alexina Mason, Department of Epidemiology and Biostatistics, Imperial College London. 16 February 2010.
With thanks to Nicky Best, Ian Plewis and Sylvia Richardson. This work was supported by an ESRC PhD studentship.

Outline
1 Motivation: Introduction; MCS income example
2 Modelling Strategy: Overview; Construct a base model; Sensitivity analysis
3 Application: Construct a base model for MCS income; Sensitivity analysis for MCS income

Why do we need a missing data strategy?
Inevitably, longitudinal studies lose members over time and generally suffer from missing data. Analysis of such data is complicated by missing covariates and missing responses. Many approaches have been proposed, but the appropriateness of a particular approach depends on the mechanism that leads to the missing data, and this mechanism cannot be determined from the data. So researchers are forced to make assumptions, and are strongly recommended to check the robustness of their conclusions to alternative plausible assumptions. This can be complicated, so a flexible strategy can help.

Why does the strategy use Bayesian methods?
Bayesian full probability modelling is a statistically principled method for dealing with missing data: it combines information in the observed data with assumptions about the missing value mechanism, and accounts for the uncertainty introduced by the missing data. Bayesian methods allow complex models to be constructed in a modular way; for example, a Bayesian joint model may consist of sub-models for analysing the question of interest, imputing missing covariates, and allowing the missingness mechanism to be informative. They enable coherent model estimation and facilitate sensitivity analysis. However, the principles of the strategy could be adapted to a non-Bayesian framework.

Millennium Cohort Study (MCS) example
The MCS has 18,000+ cohort members born in the UK at the beginning of the millennium. Using sweeps 1 and 2, our example predicts income for main respondents (usually the cohort member's mother) meeting the criteria: single in sweep 1; in work; not self-employed.
Motivating questions about income include: does ethnicity affect rate of pay? how much extra do individuals earn if they have a degree? does change in partnership status affect income?

Missingness in the MCS income dataset
The initial dataset has 559 records. Sweep 1:

                   covariates observed   covariates missing
  pay observed              505                   7
  pay missing                43                   4

We restrict the dataset to individuals fully observed in sweep 1. Sweep 2, for the remaining 505 individuals:

                   covariates observed   covariates missing
  pay observed              320                   0
  pay missing                19                 166

We do not distinguish between item and sweep non-response. All the covariate missingness comes from sweep non-response.

Types of missing data
Following Rubin, missing data are generally classified into 3 types. Consider the mechanism that led to missing pay in sweep 2 (pay2), defining p_i to be the probability that pay2 is missing for individual i.
Missing Completely at Random (MCAR): missingness does not depend on observed or unobserved data, e.g. p_i = θ_0
Missing at Random (MAR): missingness depends only on observed data, e.g. p_i = θ_0 + θ_1 pay_1i
Missing Not at Random (MNAR): neither MCAR nor MAR holds, e.g. p_i = θ_0 + δ pay_2i or p_i = θ_0 + θ_1 pay_1i + δ(pay_2i − pay_1i)
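The distinction matters because MAR missingness can in principle be corrected using the observed data alone, whereas MNAR missingness cannot. As a concrete illustration (not part of the talk), the Python sketch below simulates toy sweep 1 and sweep 2 pay values and then deletes sweep 2 values under each of the three mechanisms; the logistic form of the missingness probabilities and all numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2010)
n = 500

# Hypothetical pay data: pay1 fully observed, pay2 to be made partly missing
pay1 = np.exp(rng.normal(2.0, 0.4, size=n))
pay2 = np.exp(np.log(pay1) + rng.normal(0.05, 0.2, size=n))

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# MCAR: probability of missingness is constant
p_mcar = np.full(n, 0.3)
# MAR: probability of missingness depends only on observed sweep 1 pay
p_mar = expit(-2.0 + 0.15 * pay1)
# MNAR: probability of missingness also depends on the (possibly unobserved)
# sweep 2 pay, here through the change in pay between sweeps
p_mnar = expit(-2.0 + 0.15 * pay1 + 0.3 * (pay2 - pay1))

pay2_mcar = np.where(rng.uniform(size=n) < p_mcar, np.nan, pay2)
pay2_mar = np.where(rng.uniform(size=n) < p_mar, np.nan, pay2)
pay2_mnar = np.where(rng.uniform(size=n) < p_mnar, np.nan, pay2)

for label, y in [("MCAR", pay2_mcar), ("MAR", pay2_mar), ("MNAR", pay2_mnar)]:
    # Compare the complete-case mean of pay2 with the true mean
    print(label,
          "proportion missing:", np.isnan(y).mean().round(2),
          "complete-case mean:", np.nanmean(y).round(2),
          "true mean:", pay2.mean().round(2))
```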

Outline
1 Motivation: Introduction; MCS income example
2 Modelling Strategy: Overview; Construct a base model; Sensitivity analysis
3 Application: Construct a base model for MCS income; Sensitivity analysis for MCS income

Schematic Diagram
The strategy is summarised by a flow diagram with the following steps:
1: select the MoI using complete cases (noting plausible alternatives)
2: add a CMoM
3: add an RMoM (these three steps form the BASE MODEL)
4: seek additional data
5: elicit expert knowledge
6: ASSUMPTION SENSITIVITY
7: PARAMETER SENSITIVITY
8: Are conclusions robust? If YES, report robustness; if NO, determine the region of high plausibility (assess fit of a validation sample, calculate DIC) and recognise the uncertainty.
The strategy can be thought of as consisting of two parts: constructing a base model, and assessing conclusions from this base model against a selection of well-chosen sensitivity analyses.

Construct a base model I
1 Form an initial Model of Interest (MoI) using only complete cases; this includes choosing the transform for the response, the model structure and the set of explanatory variables.
2 Add a Covariate Model of Missingness (CMoM) to produce realistic imputations of any missing covariates.
3 Add a Response Model of Missingness (RMoM) to allow for informative missingness in the response.

The joint model: schematic diagram
The joint model links three sub-models. The model of interest relates the response (which has missing values) to the fully observed covariates and to the covariates with missing values, through the model of interest parameters. The covariate model of missingness, with its own parameters, provides imputations for the covariates with missing values. The response model of missingness relates the missingness indicator to the probability of missingness through the model of missingness parameters; this part is required for non-ignorable missingness in the response, and information from additional sources may help with the estimation of these parameters.
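To make this modularity concrete, here is a minimal sketch of such a joint model (model of interest plus response model of missingness as a selection model) written in PyMC with simulated toy data. This is not the model or software from the talk: there are no random effects, stratum intercepts or covariate model of missingness, and all variable names, priors and data-generating values are illustrative assumptions.

```python
import numpy as np
import pymc as pm

# Toy data (hypothetical): covariate x and sweep 1 log-pay y1 fully observed;
# sweep 2 log-pay y2 is missing for some individuals (m = 1 means observed).
rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
y1 = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=n)
y2 = 2.1 + 0.5 * x + rng.normal(scale=0.3, size=n)
m = rng.binomial(1, 1.0 / (1.0 + np.exp(-(2.0 - 1.5 * (y2 - y1)))))
obs, mis = np.flatnonzero(m == 1), np.flatnonzero(m == 0)

with pm.Model() as joint_model:
    # --- model of interest: regression for y2 with heavy-tailed (t4) errors ---
    beta = pm.Normal("beta", mu=0.0, sigma=100.0, shape=2)   # vague priors
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    mu = beta[0] + beta[1] * x

    # observed responses enter the likelihood as usual
    pm.StudentT("y2_obs", nu=4, mu=mu[obs], sigma=sigma, observed=y2[obs])
    # missing responses are treated as unknown quantities to be imputed
    y2_mis = pm.StudentT("y2_mis", nu=4, mu=mu[mis], sigma=sigma, shape=len(mis))

    # --- response model of missingness (selection model) ---
    theta0 = pm.Normal("theta0", mu=0.0, sigma=10.0)
    theta1 = pm.Normal("theta1", mu=0.0, sigma=10.0)
    delta = pm.Normal("delta", mu=0.0, sigma=10.0)           # departure from MAR

    eta = pm.math.concatenate([
        theta0 + theta1 * y1[obs] + delta * (y2[obs] - y1[obs]),
        theta0 + theta1 * y1[mis] + delta * (y2_mis - y1[mis]),
    ])
    pm.Bernoulli("m", p=pm.math.invlogit(eta),
                 observed=np.concatenate([m[obs], m[mis]]))

    idata = pm.sample(1000, tune=1000, target_accept=0.9, random_seed=1)
```

The point of the sketch is the modular structure: the imputed missing responses (y2_mis) appear both in the model of interest and as predictors in the model of missingness, which is what allows the missingness mechanism to be informative.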

Construct a base model II
4 Additional data can help with parameter estimation. Possible sources include earlier/later sweeps of the longitudinal study not under investigation, or another study on individuals with similar characteristics.
5 Expert knowledge can be incorporated using informative priors. Information relating to the RMoM has the potential to make a large impact, particularly regarding its functional form.

Perform a sensitivity analysis
6 Form alternative models from the base model by changing key assumptions, including: the MoI error distribution; the MoI response transform; the functional form of the RMoM.
7 Run the base model with the parameters controlling the extent of the departure from MAR fixed to a range of plausible values.
8 Use the results of both types of sensitivity analysis to establish the variability in the quantities of interest.

Outline
1 Motivation: Introduction; MCS income example
2 Modelling Strategy: Overview; Construct a base model; Sensitivity analysis
3 Application: Construct a base model for MCS income; Sensitivity analysis for MCS income


Initial model of interest
We choose the log of hourly net pay as our response, with 6 explanatory variables:

  short name   description              details
  age          main respondent's age    continuous (a)
  edu          educational level        3 levels (1=none/NVQ1; 2=NVQ2/3; 3=NVQ4/5) (b)
  eth          ethnic group             2 levels (1=white; 2=non-white)
  sing         single/partner           2 levels (1=single; 2=partner) (c)
  reg          region of country        2 levels (1=London; 2=other)
  stratum      ward type by country     9 levels (d)

  (a) centred and standardised
  (b) the level of National Vocational Qualification (NVQ) equivalence of the individual's highest academic or vocational educational qualification (individuals at level 3 have a degree)
  (c) always single in sweep 1
  (d) three strata for England (advantaged, disadvantaged and ethnic minority); two strata for Wales, Scotland and Northern Ireland (advantaged and disadvantaged)

We use a t distribution with 4 degrees of freedom (t_4) for the errors, for robustness to outliers.

Initial model of interest: the equations
The response y_it is the log of hourly pay (hpay) for individual i at sweep t:

  y_it ~ t_4(µ_it, σ²)
  µ_it = α_i + γ_s(i) + Σ_{k=1}^{p} β_k x_kit + Σ_{k=p+1}^{q} β_k z_ki

where α_i are individual random effects, γ_s(i) are stratum-specific intercepts, and the explanatory variables are eth (ethnic group), age (main respondent's age), edu (educational level), reg (London/other) and sing (single/partner). Vague priors are used, e.g. β_k ~ N(0, 10000²). The t_4 errors give robustness to outliers.
Alternatives: AS1 - Normal errors; AS2 - include age² and age × edu interaction terms; AS3 - cube root transform of the response.

Conclusions based on complete cases
Higher hourly pay is associated with having a degree. There is little evidence of an association between pay and ethnicity. Lower pay is associated with gaining a partner between sweeps.
Key parameter estimates based on a complete case analysis:

                   Complete Cases
  β_edu[NVQ2&3]     0.15 (0.06, 0.25)
  β_edu[NVQ4&5]     0.35 (0.24, 0.45)
  β_eth            -0.04 (-0.18, 0.10)
  β_sing           -0.07 (-0.14, 0.00)

The table shows the posterior mean, with the 95% interval in brackets.

Covariate model of missingness
We assume the covariates are missing at random (MAR). stratum and eth do not change between sweeps, so imputation of missing sweep 2 values is required only for the other 4 covariates:
reg: assign the sweep 1 value
age, edu and sing: set up a joint imputation model, using latent variables with Normal distributions for edu and sing
Fully observed sweep 1 covariates are used as explanatory variables in the imputation model.

Response model of missingness (selection model)
Allow for informative missingness in the response by modelling a missing value indicator (m_i) for sweep 2 pay (hpay_i2), such that m_i = 1 if hpay_i2 is observed and m_i = 0 if hpay_i2 is missing.
Use a logit model for response, i.e. m_i ~ Bernoulli(p_i); logit(p_i) = ?
Previous work in this area informs the choice of predictors of missing income and the functional form. Untransformed hourly pay is used in this sub-model.

Response model of missingness: the equations

  m_i ~ Bernoulli(p_i)
  logit(p_i) = θ_0 + Piecewise(level_i) + Piecewise(change_i) + Σ_k θ_k w_ki

where level_i = hpay_i1, change_i = hpay_i2 − hpay_i1, and the w covariates are eth (ethnic group), sc (social class) and ctry (country).

  Piecewise(level_i) = θ_level[1] (level_i − 10) if level_i < 10;  θ_level[2] (level_i − 10) if level_i ≥ 10
  Piecewise(change_i) = δ_1 change_i if change_i < 0;  δ_2 change_i if change_i ≥ 0

The choice of functional form and the position of the knots are based on expert knowledge, and vague priors are used.
Alternative (AS4): linear functional form for level and change.
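For illustration only (this code is not from the talk), the sketch below evaluates this piecewise-linear predictor in Python. The knot at 10 is taken from the equations above, the covariate terms for eth, sc and ctry are omitted, and the example call plugs in the BASE CASE posterior means reported on the next slide; the example pay values are hypothetical.

```python
import numpy as np

def rmom_linear_predictor(hpay1, hpay2, theta0, theta_level, delta, knot=10.0):
    """Piecewise-linear predictor for the response model of missingness (sketch).

    level  = sweep 1 hourly pay, with a knot at `knot`
    change = sweep 2 pay minus sweep 1 pay, with a knot at 0
    Covariate terms (eth, sc, ctry) are omitted here for brevity.
    """
    level = np.asarray(hpay1, dtype=float)
    change = np.asarray(hpay2, dtype=float) - level

    level_term = np.where(level < knot,
                          theta_level[0] * (level - knot),
                          theta_level[1] * (level - knot))
    change_term = np.where(change < 0,
                           delta[0] * change,
                           delta[1] * change)
    return theta0 + level_term + change_term

# Illustration using the BASE CASE posterior means from the next slide
logit_p = rmom_linear_predictor(hpay1=[6.0, 12.0], hpay2=[9.0, 10.0],
                                theta0=2.73, theta_level=(0.29, 0.59),
                                delta=(0.67, -0.21))
prob_observed = 1.0 / (1.0 + np.exp(-logit_p))   # p_i = P(sweep 2 pay observed)
print(prob_observed.round(2))
```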

Response model of missingness: parameter estimates
Sweep 2 pay is more likely to be missing for individuals who are non-white, who have low levels of pay in sweep 1, or whose pay changes substantially between sweeps. Social class and country make little difference.

                                BASE CASE
  θ_0                            2.73 (1.80, 3.89)
  θ_level[1]                     0.29 (0.10, 0.52)
  θ_level[2]                     0.59 (0.29, 0.91)
  δ_change[1]                    0.67 (0.39, 0.97)
  δ_change[2]                   -0.21 (-0.36, -0.06)
  θ_ctry[2: Wales]              -0.16 (-0.82, 0.52)
  θ_ctry[3: Scotland]            0.17 (-0.46, 0.81)
  θ_ctry[4: Northern Ireland]    0.29 (-0.36, 0.97)
  θ_eth                         -1.13 (-1.82, -0.46)
  θ_sc[2]                        0.06 (-0.70, 0.85)
  θ_sc[3]                       -0.06 (-1.08, 1.00)
  θ_sc[4]                        0.15 (-0.62, 0.95)

The table shows the posterior mean (95% interval). The 95% intervals of the change parameters (δ) do not include zero: there is evidence of informative missingness, given the model assumptions.

Impact on substantive questions
Conclusions regarding education and ethnicity are unchanged. Evidence of an association between hourly pay and gaining a partner between sweeps has strengthened.
Comparison of parameter estimates from the model of interest (complete cases) and the joint model (BASE CASE):

                   Complete Cases         BASE CASE
  β_edu[NVQ2&3]     0.15 (0.06, 0.25)      0.17 (0.09, 0.25)
  β_edu[NVQ4&5]     0.35 (0.24, 0.45)      0.35 (0.25, 0.44)
  β_eth            -0.04 (-0.18, 0.10)    -0.06 (-0.18, 0.06)
  β_sing           -0.07 (-0.14, 0.00)    -0.11 (-0.20, -0.02)

The table shows the posterior mean, with the 95% interval in brackets.

Incorporating additional sources of information
Where information is limited, some parameters in the joint model can be difficult to estimate. However, we can increase the amount of information available by incorporating data from other sources, e.g. data from other studies or expert opinion.

Seek additional data
For example, imputing edu is difficult because few individuals gain qualifications between sweeps. We therefore seek another study with individuals with similar characteristics which includes education variables, and expand the Covariate Model of Missingness (CMoM) to simultaneously model data from the original study (MCS) and the additional study, by fitting 2 sets of equations with common coefficients: 1 set for imputing the missing MCS covariates and 1 set for modelling the additional data. The extra data allow the parameters in the CMoM to be estimated with greater accuracy.

Elicit expert knowledge
The Bayesian approach provides the option of including additional information through informative priors. This is of greatest potential value for the parameters associated with informative missingness. Informative priors can be formed through elicitation, but are difficult to elicit directly. Instead, elicit information about the probability of response at design points and convert this into informative priors (one possible way of doing this is sketched below). A good elicitation strategy would identify and concentrate on weakly identified variables, allow for correlation between these variables, and focus on functional form.
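The Python sketch below shows one plausible way (not described in the talk) to turn elicited response probabilities at design points into candidate prior means for the RMoM coefficients: put the elicited probabilities on the logit scale and solve for the piecewise coefficients by least squares. Repeating the exercise over several experts or scenarios would suggest prior spreads. The design points, elicited probabilities and knot location are all hypothetical.

```python
import numpy as np

# Hypothetical elicited probabilities that sweep 2 pay is observed at a few
# design points (level = sweep 1 pay, change = pay change between sweeps).
design = np.array([
    # level, change, elicited P(observed)
    [ 5.0, -3.0, 0.80],
    [ 5.0,  4.0, 0.93],
    [15.0, -2.0, 0.70],
    [15.0,  3.0, 0.96],
    [10.0,  0.0, 0.92],
])

knot = 10.0
level, change, p = design[:, 0], design[:, 1], design[:, 2]

# Design matrix for the piecewise-linear predictor used in the RMoM:
# intercept, (level - knot) below/above the knot, change below/above zero.
X = np.column_stack([
    np.ones_like(level),
    np.where(level < knot, level - knot, 0.0),
    np.where(level >= knot, level - knot, 0.0),
    np.where(change < 0, change, 0.0),
    np.where(change >= 0, change, 0.0),
])
logit_p = np.log(p / (1.0 - p))

# Least-squares solution on the logit scale gives candidate prior means for
# (theta0, theta_level[1], theta_level[2], delta1, delta2).
prior_means, *_ = np.linalg.lstsq(X, logit_p, rcond=None)
print(prior_means.round(2))
```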

Base model fit using validation sample
Data were collected from 7 individuals who were originally non-contacts or refusals in sweep 2, after they were re-issued by the fieldwork agency. We set these data to missing before fitting our models, so they can now be used for model checking. For BASE, hourly pay is well estimated for all 7 individuals.
[Figure: posterior predictive distributions and observed values of hourly pay (£) for 4 of the re-issued individuals (panels A-D); the true value of hourly pay is indicated by a red line.]


Assumption sensitivity analysis: description
BASE CASE key features:
MoI - t_4 error distribution; covariates {age, edu, eth, reg, sing}; log transform of the response
RMoM - piecewise linear functional form for level and change
Assumption sensitivity analysis, differences from the BASE CASE:
AS1: MoI - Normal error distribution
AS2: MoI - additional covariates age² and age × edu
AS3: MoI - cube root transform of the response
AS4: RMoM - linear functional form for level and change
(MoI = Model of Interest; RMoM = Response Model of Missingness)

Assumption sensitivity analysis: results
Based on this sensitivity analysis: for ethnicity, the conclusions from the BASE CASE are robust; for gaining a partner, there is consistent evidence of an association with lower pay, but its strength is unclear.
Comparison of parameters associated with being non-white and gaining a partner between sweeps:

                                  β_eth                   β_sing
  CC                              -0.04 (-0.18, 0.10)     -0.07 (-0.14, 0.00)
  BASE                            -0.06 (-0.18, 0.06)     -0.11 (-0.20, -0.02)
  AS1 (Normal errors)              0.01 (-0.12, 0.15)     -0.12 (-0.23, -0.01)
  AS2 (additional covariates)     -0.05 (-0.18, 0.07)     -0.11 (-0.20, -0.01)
  AS3 (cube root transform)       -0.04 (-0.12, 0.04)     -0.08 (-0.14, -0.02)
  AS4 (linear level & change)     -0.06 (-0.18, 0.07)     -0.16 (-0.25, -0.07)

The table shows the posterior mean, with the 95% interval in brackets.

AS1-AS4: model fit using validation sample
The mean square error (MSE) of the fit of hourly pay for the 7 re-issues is a summary measure of model performance. The models with the linear functional form for the RMoM (AS4) and with the cube root transform (AS3) fit the re-issued individuals best.
MSE of the fit of hourly pay for the 7 re-issued individuals:

                                  median    95% interval
  BASE                            18.7      (3.1, 367.0)
  AS1 (Normal errors)             16.8      (3.2, 108.8)
  AS2 (additional covariates)     14.2      (2.8, 295.3)
  AS3 (cube root transform)        8.0      (1.9, 73.6)
  AS4 (linear level & change)      8.8      (2.9, 21.7)
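The talk does not spell out how this MSE summary is computed; a plausible reading is that the squared error over the 7 re-issues is averaged within each posterior predictive draw and the resulting distribution is summarised by its median and 95% interval. The Python sketch below illustrates that calculation with made-up numbers.

```python
import numpy as np

def mse_summary(pred_draws, observed):
    """Summarise the MSE of the fit of hourly pay for a validation sample.

    pred_draws: array of shape (n_draws, n_individuals) of posterior predictive
                draws of hourly pay for the re-issued individuals
    observed:   their observed hourly pay

    Computing the MSE draw-by-draw and then summarising gives a posterior
    median and 95% interval; this is one plausible way to obtain such a
    summary, not a formula stated in the talk.
    """
    mse_per_draw = ((pred_draws - observed) ** 2).mean(axis=1)
    median = np.median(mse_per_draw)
    lo, hi = np.percentile(mse_per_draw, [2.5, 97.5])
    return median, (lo, hi)

# Toy illustration with simulated draws (hypothetical numbers)
rng = np.random.default_rng(0)
observed = np.array([6.5, 8.0, 9.2, 5.4, 7.7, 11.3, 6.1])   # 7 re-issued individuals
pred_draws = observed + rng.standard_t(df=4, size=(4000, 7)) * 3.0
print(mse_summary(pred_draws, observed))
```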

Parameter sensitivity analysis: description
Recall that change_i = hpay_i2 − hpay_i1, and

  logit(p_i) = ... + δ_1 change_i + ...   if change_i < 0
  logit(p_i) = ... + δ_2 change_i + ...   if change_i ≥ 0

The values of δ_1 and δ_2 control the degree of departure from MAR, and δ_1 and δ_2 are difficult for the model to estimate. A series of models is therefore run with these two parameters fixed: 81 variants formed by combining 9 values of δ_1 with 9 values of δ_2. The value set for both δ_1 and δ_2 is {−1, −0.75, −0.5, −0.25, 0, 0.25, 0.5, 0.75, 1}. 9 of the variants have a linear functional form for change, i.e. δ_1 = δ_2; the δ_1 = δ_2 = 0 variant is equivalent to assuming the response is MAR.
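As a sketch of how such a grid might be organised in code (not taken from the talk), the loop below enumerates the 81 (δ_1, δ_2) combinations; fit_base_model_with_fixed_deltas is a hypothetical placeholder for refitting the base joint model with the two parameters held fixed rather than estimated.

```python
import numpy as np

# Grid of fixed departures from MAR for the change parameters in the RMoM
delta_values = np.array([-1, -0.75, -0.5, -0.25, 0, 0.25, 0.5, 0.75, 1])

def fit_base_model_with_fixed_deltas(delta1, delta2):
    """Hypothetical placeholder: in practice this would refit the base joint
    model by MCMC with delta1 and delta2 held fixed at the supplied values,
    and return posterior summaries of the quantities of interest."""
    return {"exp_beta_sing": np.nan, "mse_reissues": np.nan}

results = {}
for d1 in delta_values:
    for d2 in delta_values:            # 9 x 9 = 81 variants
        results[(d1, d2)] = fit_base_model_with_fixed_deltas(d1, d2)

# (d1, d2) = (0, 0) corresponds to MAR; the variants with d1 == d2 have a
# linear (rather than piecewise) functional form for change.
```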

Parameter sensitivity analysis: results - tabular
Proportional increase in pay associated with selected covariates for the PS variants, compared with the base model (BASE):

                 minimum              at (δ_1, δ_2)    maximum              at (δ_1, δ_2)    MAR (a)              BASE
  edu[NVQ2&3]    1.15 (1.06, 1.25)    (-1, 1)          1.19 (1.10, 1.29)    (1, -0.25)       1.18 (1.06, 1.26)    1.19 (1.09, 1.29)
  edu[NVQ4&5]    1.31 (1.18, 1.44)    (-1, 1)          1.44 (1.31, 1.58)    (1, -0.5)        1.35 (1.19, 1.46)    1.41 (1.28, 1.56)
  eth            0.93 (0.82, 1.05)    (0.25, -0.75)    0.97 (0.85, 1.12)    (-1, 0.75)       0.94 (0.84, 1.09)    0.94 (0.83, 1.06)
  sing           0.77 (0.71, 0.84)    (1, 1)           1.32 (1.18, 1.47)    (-1, -1)         0.94 (0.88, 0.99)    0.90 (0.82, 0.98)

The table shows the posterior mean, with the 95% interval in brackets. (a) δ_1 = 0 and δ_2 = 0 is MAR.

Parameter sensitivity analysis: results - graphical I
[Figure: estimated proportional change in pay associated with being non-white (e^β_eth) versus δ_1, conditional on δ_2 (panels for δ_2 = -1, -0.5, 0, 0.5, 1), from the PS variants; posterior mean with 95% interval.]

Parameter sensitivity analysis: results - graphical II
[Figure: estimated proportional change in pay associated with gaining a partner between sweeps (e^β_sing) versus δ_1, conditional on δ_2 (panels for δ_2 = -1, -0.5, 0, 0.5, 1), from the PS variants; posterior mean with 95% interval.]

Parameter sensitivity analysis: results - graphical III
[Figure: contour plots of the posterior mean of the proportional change in pay associated with selected covariates (non-white: e^β_eth; gaining a partner: e^β_sing) versus δ_1 and δ_2, from the PS variants. The points at the δ values relating to MAR, BASE and AS4 are marked with a red circle, blue triangle and green diamond respectively.]

Reporting robustness: ethnicity question
Does ethnicity affect rate of pay? Key points to report are:
There is no evidence of an association between ethnicity and hourly pay.
Base model results: the proportional change in hourly pay associated with being non-white has a posterior mean of 0.94, with a 95% interval from 0.83 to 1.06.
These conclusions are very robust to our sensitivity analysis.
However, the results relating to gaining a partner are not robust, so we need to investigate the plausibility of the different models.

Assess fit using validation sample
[Figure: contour plot of the mean square error of the fit of hourly pay for the 7 re-issued individuals versus δ_1 and δ_2, from the PS variants. The points at the δ values relating to MAR, BASE and AS4 are marked with a red circle, blue triangle and green diamond respectively.]

Determine region of high plausibility
[Figure: side-by-side contour plots, versus δ_1 and δ_2, of the MSE for the re-issues and of the proportional change in pay associated with gaining a partner. The points at the δ values relating to MAR, BASE and AS4 are marked with a red circle, blue triangle and green diamond respectively.]

Recognising uncertainty: partnership question
Does change in partnership status affect income? Key points to report are:
There is evidence that gaining a partner is associated with a decrease in hourly pay; however, the magnitude of this decrease is uncertain. Our analysis suggests that the proportional decrease lies in the region 0.77 (0.71, 0.84) to 0.94 (0.88, 0.99). Some models run as part of the sensitivity analysis suggest that change in partnership status is associated with an increase in pay, but these models do not fall in the region of high plausibility.
Why should gaining a partner be associated with a decrease in hourly pay? A proxy for an additional child? Reverse causality?

Summary
Compared to a complete case analysis, implementing this strategy is time-consuming, but it allows realistic assumptions about the missingness mechanism to be explored and provides confidence in the conclusions to the questions of interest. The proposed strategy is flexible: steps can be omitted if appropriate, and it can be extended if necessary or applied to other types of studies.

Relevant literature
The BIAS project. www.bias-project.org.uk/
Best, N. G., Spiegelhalter, D. J., Thomas, A., and Brayne, C. E. G. (1996). Bayesian Analysis of Realistically Complex Models. Journal of the Royal Statistical Society, Series A (Statistics in Society), 159 (2), 323-342.
Daniels, M. J. and Hogan, J. W. (2008). Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis. Chapman & Hall.
Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data (2nd edn). John Wiley and Sons.
Plewis, I. (2007). Non-Response in a Birth Cohort Study: The Case of the Millennium Cohort Study. International Journal of Social Research Methodology, 10 (5), 325-334.