Unbiased estimation of exposure odds ratios in complete records logistic regression

Similar documents
Missing covariate data in matched case-control studies: Do the usual paradigms apply?

Bayesian methods for missing data: part 1. Key Concepts. Nicky Best and Alexina Mason. Imperial College London

A comparison of fully Bayesian and two-stage imputation strategies for missing covariate data

Estimating the Marginal Odds Ratio in Observational Studies

Ignoring the matching variables in cohort studies - when is it valid, and why?

Journal of Biostatistics and Epidemiology

Whether to use MMRM as primary estimand.

Correction for classical covariate measurement error and extensions to life-course studies

Comparison of multiple imputation methods for systematically and sporadically missing multilevel data

Structural Nested Mean Models for Assessing Time-Varying Effect Moderation. Daniel Almirall

Alexina Mason. Department of Epidemiology and Biostatistics Imperial College, London. 16 February 2010

Systematically missing data in individual participant data meta-analysis: a semiparametric inverse probability weighting approach

Measurement Error in Spatial Modeling of Environmental Exposures

Additive and multiplicative models for the joint effect of two risk factors

Analyzing Pilot Studies with Missing Observations

Statistics in medicine

Extending causal inferences from a randomized trial to a target population

7 Sensitivity Analysis

BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY

Ph.D. course: Regression models. Introduction. 19 April 2012

Structural Nested Mean Models for Assessing Time-Varying Effect Moderation. Daniel Almirall

Ph.D. course: Regression models. Regression models. Explanatory variables. Example 1.1: Body mass index and vitamin D status

Commentary: Incorporating concepts and methods from causal inference into life course epidemiology

Confounding, mediation and colliding

Downloaded from:

Estimating direct effects in cohort and case-control studies

Multiple imputation in Cox regression when there are time-varying effects of covariates

Causal inference in epidemiological practice

Double Robustness. Bang and Robins (2005) Kang and Schafer (2007)

Causal inference in biomedical sciences: causal models involving genotypes. Mendelian randomization genes as Instrumental Variables

A weighted simulation-based estimator for incomplete longitudinal data models

Longitudinal analysis of ordinal data

Known unknowns : using multiple imputation to fill in the blanks for missing data

Sensitivity analysis and distributional assumptions

Harvard University. Harvard University Biostatistics Working Paper Series

Nonrespondent subsample multiple imputation in two-phase random sampling for nonresponse

Combining multiple observational data sources to estimate causal eects

Empirical Bayes Moderation of Asymptotically Linear Parameters

Guideline on adjustment for baseline covariates in clinical trials

Bootstrapping Sensitivity Analysis

Measurement error effects on bias and variance in two-stage regression, with application to air pollution epidemiology

Causal Hazard Ratio Estimation By Instrumental Variables or Principal Stratification. Todd MacKenzie, PhD

2 Naïve Methods. 2.1 Complete or available case analysis

Methods for inferring short- and long-term effects of exposures on outcomes, using longitudinal data on both measures

Causal mediation analysis: Multiple mediators

When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data?

Missing Covariate Data in Matched Case-Control Studies

Flexible mediation analysis in the presence of non-linear relations: beyond the mediation formula.

Simulating from Marginal Structural Models with Time-Dependent Confounding

ABSTRACT INTRODUCTION. SESUG Paper

arxiv: v1 [stat.me] 19 Aug 2016

MISSING or INCOMPLETE DATA

11 November 2011 Department of Biostatistics, University of Copengen. 9:15 10:00 Recap of case-control studies. Frequency-matched studies.

Estimating and contextualizing the attenuation of odds ratios due to non-collapsibility

Specification Errors, Measurement Errors, Confounding

Mediation analyses. Advanced Psychometrics Methods in Cognitive Aging Research Workshop. June 6, 2016

Don t be Fancy. Impute Your Dependent Variables!

Online supplement. Absolute Value of Lung Function (FEV 1 or FVC) Explains the Sex Difference in. Breathlessness in the General Population

Causal Inference in Observational Studies with Non-Binary Treatments. David A. van Dyk

Modification and Improvement of Empirical Likelihood for Missing Response Problem

Statistical Methods. Missing Data snijders/sm.htm. Tom A.B. Snijders. November, University of Oxford 1 / 23

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness

Mediation Analysis for Health Disparities Research

Effect Modification and Interaction

ADVANCED STATISTICAL ANALYSIS OF EPIDEMIOLOGICAL STUDIES. Cox s regression analysis Time dependent explanatory variables

Integrated approaches for analysis of cluster randomised trials

Comparative effectiveness of dynamic treatment regimes

Chapter 4. Parametric Approach. 4.1 Introduction

Strategy of Bayesian Propensity. Score Estimation Approach. in Observational Study

Confidence Intervals for the Odds Ratio in Logistic Regression with Two Binary X s

Lecture 2: Poisson and logistic regression

Part IV Statistics in Epidemiology

Lab 8. Matched Case Control Studies

Instrumental variables estimation in the Cox Proportional Hazard regression model

Lecture 15 (Part 2): Logistic Regression & Common Odds Ratio, (With Simulations)

Confidence Intervals for the Interaction Odds Ratio in Logistic Regression with Two Binary X s

Causal Inference with General Treatment Regimes: Generalizing the Propensity Score

IP WEIGHTING AND MARGINAL STRUCTURAL MODELS (CHAPTER 12) BIOS IPW and MSM

Survival Analysis with Time- Dependent Covariates: A Practical Example. October 28, 2016 SAS Health Users Group Maria Eberg

Chapter 20: Logistic regression for binary response variables

Variable selection and machine learning methods in causal inference

Causality II: How does causal inference fit into public health and what it is the role of statistics?

When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Panel Data?

Targeted Maximum Likelihood Estimation in Safety Analysis

Lecture 12: Effect modification, and confounding in logistic regression

More Statistics tutorial at Logistic Regression and the new:

Statistical Methods for Causal Mediation Analysis

arxiv: v2 [stat.me] 17 Jan 2017

Statistical Analysis of Randomized Experiments with Nonignorable Missing Binary Outcomes

Flexible modelling of the cumulative effects of time-varying exposures

Survival Analysis I (CHL5209H)

University of Bristol - Explore Bristol Research. Peer reviewed version. Link to published version (if available): /rssa.

Marginal Structural Cox Model for Survival Data with Treatment-Confounder Feedback

Lecture 5: Poisson and logistic regression

Dirichlet process Bayesian clustering with the R package PReMiuM

This is the submitted version of the following book chapter: stat08068: Double robustness, which will be

Jun Tu. Department of Geography and Anthropology Kennesaw State University

Harvard University. A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome. Eric Tchetgen Tchetgen

Inferences on missing information under multiple imputation and two-stage multiple imputation

An Efficient Estimation Method for Longitudinal Surveys with Monotone Missing Data

Transcription:

Unbiased estimation of exposure odds ratios in complete records logistic regression Jonathan Bartlett London School of Hygiene and Tropical Medicine www.missingdata.org.uk Centre for Statistical Methodology 30th January 2015 1/35

Acknowledgements This is joint work with: Ofer Harel (University of Connecticut) James Carpenter (LSHTM) We are grateful to the UK Civil Aviation Authority for allowing the use of the data for the illustrative analysis, and to Bianca De Stavola for suggesting and facilitating the use of these data and providing useful comments on the manuscript. I am grateful for support from the Medical Research Council (MR/K02180X/1). 2/35

Outline When is the exposure odds ratio unbiased? Application in practice Illustrative example Concluding comments 3/35

Outline When is the exposure odds ratio unbiased? Application in practice Illustrative example Concluding comments 4/35

Setting and question Binary outcome Y, exposure X, confouders C. In general X and/or C could be continuous, and we can use logistic regression for Y X,C. We may have missing values in Y, X or one or more components of C. Let R = 1 denote complete records and R = 0 denote incomplete records. If we drop those with missing values, when will the complete records analysis (CRA) exposure (log) odds ratio ˆβ X be (asymptotically) unbiased? 5/35

A simplification First we assume that there no confounders C, and that exposure X is binary. We can display the full data as Odds ratio = a d b c Y 0 1 X 0 a b 1 c d 6/35

Outcome dependent missingness Suppose missingness is outcome dependent, with P(R = 1 X,Y = y) = k y, y = 0,1. Then the expected complete records table is Odds ratio = k 0a k 1 d k 1 b k 0 c = a d b c i.e. unbiased Y 0 1 X 0 k 0a k 1 b 1 k 0 c k 1 d This result of course justifies the validity of odds ratios in case-control studies. 7/35

Exposure dependent missingness Now suppose missingness is exposure dependent, with P(R = 1 X = x,y) = k x, x = 0,1. Then the expected complete records table is Odds ratio = k 0a k 1 d k 0 b k 1 c = a d b c i.e. unbiased Y 0 1 X 0 k 0a k 0 b 1 k 1 c k 1 d This result (covariate dependent missingness) holds in fact more generally, e.g. for risk ratios and other regression models. 8/35

Confounders Now suppose we have a categorical confounder C with levels c = 1,..,L. Let the full data exposure/outcome table for level c of the confounder be Y 0 1 b c X 0 a c 1 c c d c The full data odds ratio for the exposure effect at confounder level C = c is ac dc b c c c Assuming no effect modification, we then estimate the odds ratio by pooling across confounder levels, e.g. with Mantel-Haenszel or logistic regression. 9/35

Outcome/confounder dependent missingness Now suppose we have missingness dependent on Y and C, i.e. P(R = 1 Y = y,x,c = c) = k y,c, y = 0,1, c = 1,..,L. The expected complete record table is then Y 0 1 X 0 k 0,ca c k 1,c b c 1 k 0,c c c k 1,c d c Odds ratio = k 0,ca c k 1,c d c k 1,c b c k 0,c c c = ac dc b c c c. i.e. still unbiased 10/35

Exposure/confounder dependent missingness Now suppose we have missingness dependent on X and C, i.e. P(R = 1 Y,X = x,c = c) = k x,c, x = 0,1, c = 1,..,L. The expected complete record table is then Y 0 1 X 0 k 0,ca c k 0,c b c 1 k 1,c c c k 1,c d c Odds ratio = k 0,ca c k 1,c d c k 0,c b c k 1,c c c = ac dc b c c c. i.e. still unbiased 11/35

Exposure/outcome dependent missingness In general if missingness depends on X and Y, we will obtain biased estimates of the exposure effect. A special case exists however where we still obtain unbiased estimates. Specifically if P(R = 1 Y,X,C) = s(x,c) t(y,c) for some functions s(x,c), t(y,c). This result can be viewed as applying the previous results in turn. 12/35

Exposure/outcome dependent missingness continued This could arise for example if two variables are partially observed with one mechanism depending on (X,C) and the second depending on (Y,C). If we let R 1 and R 2 denote the observation indicators for the two variables, we need P(R = 1 Y,X,C) = P(R 1 = 1,R 2 = 1 Y,X,C) = P(R 1 = 1 X,C) P(R 2 = 1 Y,C) 13/35

Summary The results apply more generally, to continuous exposure X, multiple confounders C, with logistic regression used for estimation. Letting ˆβ 0, ˆβ X, ˆβ C denote the corresponding log odds ratios, we have Missingness dependent on ˆβ0 ˆβX ˆβC Neither Y, X nor C Unbiased Unbiased Unbiased Y Biased Unbiased Unbiased X, C Unbiased Unbiased Unbiased Y, C Biased Unbiased Biased Y,X,C Biased Biased* Biased * in general 14/35

MCAR/MAR/MNAR So far we have not said which variable(s) have missing values. Depending on what type of mechanism is assumed to be in action and which variable(s) have missing values, missingness could be MCAR, MAR or even MNAR. e.g. if exposure is partially observed, with missingness dependent on exposure level, and possibly confounders, this is MNAR. But so long as missingness is independent of Y, given X and C, ˆβ X is unbiased. 15/35

Extensions If we include interactions between X and some components of C, the results continue to apply. With survival data, if the event rate is low and follow-up similar for subjects, the results also apply approximately due to link between logistic and Cox regression. In this case the event indicator takes the place of the binary outcome Y. 16/35

Caveat The arguments implicitly assume that the outcome model is correctly specified. i.e. no effect interactions (if not included). Provided any misspecifications are not severe, results should still approximately apply (see illustrative) example. 17/35

Outline When is the exposure odds ratio unbiased? Application in practice Illustrative example Concluding comments 18/35

Application in practice The results depend on how the probability of being a complete record depend on (Y,X,C). In some cases our study specific knowledge may support one of the sufficient conditions needed for β X to be estimated without bias. 19/35

Missingness in a confounder Divide the confounders into C = (C 1,C 2 ), with C 1 partially observed Missingness in C 1 found related to Plausible missingness ˆβX unbiased mechanism C 2 C 2 Yes X and possibly C 2 X and C 2 Yes Y and possibly C 2 Y and C 2 Yes X, Y and possibly C 2 C and X Yes X and Y Generally no C and Y Yes 20/35

Missingness in exposure Missingness in X Plausible missingness ˆβX unbiased found related to mechanism C C Yes Y Y Yes C and Y Yes C and Y X and Y Generally no X and C Yes 21/35

Missingness in outcome Missingness in Y Plausible missingness ˆβX unbiased found related to mechanism X X Yes C C Yes X and C Yes X and C Y and C Yes X and Y Generally no 22/35

Investigating missingness So in some situations we may from the data be able to make some tentative conclusions regarding missingness and thus the unbiasedness of ˆβ X. With missingness in multiple variables things inevitably become more complex. Here our substantive knowledge is crucial to judge the plausibility of assumptions. 23/35

Outline When is the exposure odds ratio unbiased? Application in practice Illustrative example Concluding comments 24/35

Illustrative example We illustrate using data from a cohort study of professional pilots in the UK [1]. This study s aim was to include all professional flight crew who held a license at some point between 1989 and 1999. For our analyses, follow-up starts at recruitment, where various variables of interest were recorded. Crew data was linked to UK health registers to obtain vital status up to 2006. 25/35

Outcome model specification For this illustrative analysis, we consider a binary outcome Y representing death in the first 15 years of follow-up. We consider a exposure number of accrued flying hours at baseline, categorised into < 400 hours, 400 5499, and > 5500 hours. Confounders were age at entry, smoking status, BMI and type of route. Our analyses here use data from 11,841 male crew, who had complete data at baseline for exposure and confounders. 26/35

Simulation setup We first fitted a logistic regression model to the full data. For each of 8 missingness mechanisms, we make data missing and perform the complete records analysis. For each mechanism we repeat 10,000 times, and present the mean estimates across 10,000 realizations of missingness process. 27/35

Results Log odds ratios Missingness dependent on 400-5499 hours 5500 hours vs < 400 hours vs < 400 hours Full data 0.55 0.59 MCAR 0.57 0.61 Event indicator (Y) 0.56 0.61 Age (C) 0.51 0.54 Age and flying hours (C,X) 0.52 0.56 Event indicator and age (Y,C) 0.66 0.74 Event indicator and flying hours (Y,X) 1.59 2.64 Event indicator and flying hours* (Y, X) 0.59 0.67 * P(R = 1 Y,X,C) = s(x)t(y) 28/35

Interpretation For the most part, results are as expected from theory. For missingness dependent on outcome and confounder (age), estimates are somewhat different to full data estimates. This is likely due to the fact the outcome model is not exactly correctly specified: the exposure effect varies somewhat by age. Missingness dependent on C means complete records have different C distribution, leading to some small bias. 29/35

Outline When is the exposure odds ratio unbiased? Application in practice Illustrative example Concluding comments 30/35

Summary Estimates of exposure effect from complete records logistic regression are unbiased under a wide range of missingness assumptions [2]. Most of the results presented are not new [3, 4, 5], but perhaps are not widely appreciated. Directed acyclic graphs can be useful for considering possible missingness mechanisms and thus plausiblity of assumptions [6]. 31/35

Summary Of course even when CRA is unbiased, it is not efficient. If missingness is in confounders, alternative approaches will gain efficiency. If the data are missing at random, multiple imputation can be used. If a confounder is missing not at random, but independently of outcome, a weighting/imputation approach can be used to improve upon CRA efficiency [7]. 32/35

References I [1] B DeStavola, C Pizzi, F Clemens, S A Evans, A D Evans, and I dos Santos Silva. Cause-specific mortality in professional flight crew and air traffic control officers: findings from two uk population-based cohorts of over 20,000 subjects. International Archives of Occupational and Environmental Health, 85:283 293, 2012. [2] J W Bartlett, O Harel, and J R Carpenter. Unbiased estimation of exposure odds ratios in complete records logistic regression. under review, 2015. 33/35

References II [3] W Vach and M Blettner. Biased Estimation of the Odds Ratio in Case-Control Studies due to the Use of Ad Hoc methods of Correcting for Missing Values for Confounding Variables. American Journal of Epidemiology, 134:895 907, 1991. [4] M A Hernn, S H Hernndez-Diaz, and J M Robins. A structural approach to selection bias. Epidemiology, 15:615 625, 2004. [5] D Westreich. Berksons Bias, Selection Bias, and Missing Data. Epidemiology, 23:159 164, 2012. [6] R M Daniel, M G Kenward, S N Cousens, and B L de Stavola. Using causal diagrams to guide analysis in missing data problems. Statistical Methods in Medical Research, 21:243 256, 2012. 34/35

References III [7] J W Bartlett, J R Carpenter, K Tilling, and S Vansteelandt. Improving upon the efficiency of complete case analysis when covariates are MNAR. Biostatistics, 15:719 730, 2014. 35/35