Confounding, mediation and colliding

Confounding, mediation and colliding: what types of shared covariates does the sibling comparison design control for?
Arvid Sjölander and Johan Zetterqvist

Causal effects and confounding
A common aim of epidemiologic research is to estimate the causal effect of an exposure on an outcome.
In observational studies, confounding is always a concern.

Sibling comparison designs
A popular way to reduce confounding is to use a sibling comparison design, in which the exposure-outcome association is studied within families instead of between unrelated individuals.
It is routinely argued that the within-sibling association is controlled for all measured and unmeasured covariates that are shared (constant) within families, e.g. socioeconomic status and parental genetic makeup.
But the sibling comparison design has subtle issues that are not present in studies of unrelated subjects.

Amplification of bias due to unmeasured non-shared confounders
Frisell, T., Öberg, S., Kuja-Halkola, R., Sjölander, A. (2012). Sibling comparison designs: bias from non-shared confounders and measurement error. Epidemiology, 23(5), 713-720.
[Causal diagram with nodes $U_i$, $X_{i1}$, $X_{i2}$, $C_{i1}$, $C_{i2}$, $Y_{i1}$, $Y_{i2}$]

Bias due to carryover effects
Sjölander, A., Frisell, T., Kuja-Halkola, R., Öberg, S., Zetterqvist, J. (2016). Carryover effects in sibling comparison designs. Epidemiology, 27(6), 852-858.
[Causal diagram with nodes $U_i$, $X_{i1}$, $X_{i2}$, $C_{i1}$, $C_{i2}$, $Y_{i1}$, $Y_{i2}$]

Outline Motivating example Main results Correctly specified models Misspecified models Concluding remarks

Association between mother's age and ADHD in offspring
Attention deficit/hyperactivity disorder (ADHD) is the most common neurodevelopmental disorder in childhood (prevalence 5%).
Studies have shown that young mothers are more likely than older mothers to have children with ADHD.
But it is not clear whether the association is causal; it is possibly confounded by familial factors, e.g. socioeconomic status and genetics.

A sibling comparison study
To examine the role of familial confounding, Chang et al. (2014) carried out a sibling comparison study.
The cohort comprised 1,495,543 children born to 896,389 mothers between 1988 and 2003 (1.7 children per mother, on average).
Each child was followed from birth to ADHD diagnosis or 31 December 2009, whichever came first.

Analyses
Standard analysis: an ordinary Cox proportional hazards model.
Sibling comparison: a stratified Cox model with one stratum per family, which automatically controls for covariates that are shared (constant) in the family (model details later).

Results
There was a statistically significant association between mother's age and offspring ADHD in the standard analysis.
There was no (or an inverse) association in the sibling comparison.
The authors concluded: The association between early maternal age and offspring ADHD is mainly explained by genetic confounding.

How about shared mediators?
The (shared) familial environment is affected by the mother's age at all her childbirths; e.g. having multiple children at an early age may lead to a financially difficult family situation.
If the familial environment in turn affects the risk of ADHD, then this covariate is a shared mediator.
Controlling for this covariate may therefore remove part of the true causal effect and attenuate the association.

How about shared colliders?
Familial environment is a dynamic condition that may change as the children develop; e.g. an ADHD diagnosis in one of the siblings may lead to a stressful situation for the whole family.
If so, then familial environment could also play the role of a shared collider.
Controlling for this covariate may induce collider-stratification bias, which may also attenuate the association.

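Confounding, mediation and colliding
[Causal diagram with nodes $U_i$ (shared confounder), $X_{i1}$, $X_{i2}$, $M_i$ (shared mediator), $Y_{i1}$, $Y_{i2}$, $W_i$ (shared collider)]
What types of shared covariates does the sibling comparison design really control for?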

Statistical models that condition on the family
Linear model for a continuous outcome: $E(Y_{ij} \mid \text{family } i, X_{ij}) = \alpha_i + \beta X_{ij}$
Logistic model for a binary outcome: $\text{logit}\{p(Y_{ij} = 1 \mid \text{family } i, X_{ij})\} = \alpha_i + \beta X_{ij}$
Cox proportional hazards model for a time-to-event outcome: $\log\{\lambda(y_{ij} \mid \text{family } i, X_{ij})\} = \alpha_i(y_{ij}) + \beta X_{ij}$
We will focus on the linear model, but all arguments and results hold for the logistic model and the Cox proportional hazards model as well.

The role of the intercept
$E(Y_{ij} \mid \text{family } i, X_{ij}) = \alpha_i + \beta X_{ij}$
The family-specific intercept $\alpha_i$ is intended to absorb those covariates that are shared in the family.
The exposure effect, $\beta$, is controlled for the covariates absorbed into $\alpha_i$.
Does $\alpha_i$ only absorb shared confounders, or shared mediators and colliders as well?

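In other words
When fitting the conditional (on family) model $E(Y_{ij} \mid \text{family } i, X_{ij}) = \alpha_i + \beta X_{ij}$, which of the following $\beta$'s is really estimated? (Superscripts index the models; they are not powers.)
$E(Y_{ij} \mid U_i, X_{ij}) = \alpha_i^{1} + \beta^{1} X_{ij}$
$E(Y_{ij} \mid M_i, X_{ij}) = \alpha_i^{2} + \beta^{2} X_{ij}$
$E(Y_{ij} \mid W_i, X_{ij}) = \alpha_i^{3} + \beta^{3} X_{ij}$
$E(Y_{ij} \mid U_i, M_i, X_{ij}) = \alpha_i^{4} + \beta^{4} X_{ij}$
$E(Y_{ij} \mid U_i, W_i, X_{ij}) = \alpha_i^{5} + \beta^{5} X_{ij}$
$E(Y_{ij} \mid M_i, W_i, X_{ij}) = \alpha_i^{6} + \beta^{6} X_{ij}$
$E(Y_{ij} \mid U_i, M_i, W_i, X_{ij}) = \alpha_i^{7} + \beta^{7} X_{ij}$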

Technical note
If $(U_i, X_{i1}, X_{i2}, M_i, Y_{i1}, Y_{i2}, W_i)$ have a multivariate normal distribution, then all the models below are correct:
$E(Y_{ij} \mid U_i, X_{ij}) = \alpha_i^{1} + \beta^{1} X_{ij}$
$E(Y_{ij} \mid M_i, X_{ij}) = \alpha_i^{2} + \beta^{2} X_{ij}$
$E(Y_{ij} \mid W_i, X_{ij}) = \alpha_i^{3} + \beta^{3} X_{ij}$
$E(Y_{ij} \mid U_i, M_i, X_{ij}) = \alpha_i^{4} + \beta^{4} X_{ij}$
$E(Y_{ij} \mid U_i, W_i, X_{ij}) = \alpha_i^{5} + \beta^{5} X_{ij}$
$E(Y_{ij} \mid M_i, W_i, X_{ij}) = \alpha_i^{6} + \beta^{6} X_{ij}$
$E(Y_{ij} \mid U_i, M_i, W_i, X_{ij}) = \alpha_i^{7} + \beta^{7} X_{ij}$
So this is not a problem of model misspecification per se.

Conditional maximum likelihood
The answer lies in the way the model is fitted, rather than in the model itself.
The standard way to fit conditional (on family) models is to use conditional maximum likelihood (CML). This leads to conditional logistic regression (for the logistic model) and stratified Cox regression (for the Cox proportional hazards model).
CML eliminates the family-specific intercept $\alpha_i$ by conditioning on a sufficient statistic.
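To make the elimination of $\alpha_i$ concrete for the linear model, the following is a minimal sketch. It assumes normally distributed errors, in which case conditioning on the family sum of the outcomes amounts to within-family least squares. The simulated data, the function name `within_family_slope` and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def within_family_slope(x, y):
    """Within-family least squares slope for arrays of shape (families, siblings).

    Centering each row removes anything that is constant within the family,
    including the family-specific intercept.
    """
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return float((xc * yc).sum() / (xc ** 2).sum())

rng = np.random.default_rng(0)
n = 5000                                    # number of sibling pairs (assumed)
beta = 0.5                                  # true within-family effect (assumed)
alpha = 3.0 * rng.normal(size=(n, 1))       # family-specific intercepts
x = alpha + rng.normal(size=(n, 2))         # exposure correlated with the intercept
y = alpha + beta * x + rng.normal(size=(n, 2))

beta_within = within_family_slope(x, y)
# Ordinary least squares on the pooled data, ignoring the family structure.
beta_pooled = float(np.polyfit(x.ravel(), y.ravel(), 1)[0])

print(f"within-family: {beta_within:.3f}, pooled: {beta_pooled:.3f}")
```

In this sketch the within-family estimate is unchanged if any family-level constant is added to the outcome, which is the sense in which the intercept is eliminated rather than estimated.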

Underlying assumptions in CML
Let $Z_i$ be the set of shared covariates that we control for by conditioning on family $i$.
We may thus reformulate the model $E(Y_{ij} \mid \text{family } i, X_{ij}) = \alpha_i + \beta X_{ij}$ as $E(Y_{ij} \mid Z_i, X_{ij}) = \alpha_i + \beta X_{ij}$.
The CML estimator of $\beta$ in this model is consistent if
A: $Y_{i1} \perp Y_{i2} \mid X_{i1}, X_{i2}, Z_i$
B: $Y_{ij} \perp X_{ij'} \mid X_{ij}, Z_i$ for $j' \neq j$

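Comparison with conditional logistic regression
Suppose that $X_{ij}$ and $Y_{ij}$ are binary, and that we use the model $\text{logit}\{p(Y_{ij} = 1 \mid Z_i, X_{ij})\} = \alpha_i + \beta X_{ij}$.
Let $D$ denote the exposure-discordant pairs. Arbitrarily order the siblings so that $D \Rightarrow X_{i1} = 1$ and $X_{i2} = 0$.
Counts among the pairs in $D$:

              Y_i2 = 0    Y_i2 = 1
  Y_i1 = 0    n_00        n_01
  Y_i1 = 1    n_10        n_11

The conditional logistic regression estimator satisfies
\[
\exp(\hat{\beta}) = \frac{n_{10}}{n_{01}} \rightarrow \frac{p(Y_{i1} = 1, Y_{i2} = 0 \mid D)}{p(Y_{i1} = 0, Y_{i2} = 1 \mid D)},
\]
which is not generally equal to $\exp(\beta)$.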

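Comparison with conditional logistic regression, cont'd
However, under
A: $Y_{i1} \perp Y_{i2} \mid X_{i1}, X_{i2}, Z_i$
B: $Y_{ij} \perp X_{ij'} \mid X_{ij}, Z_i$ for $j' \neq j$
we have:
\[
\begin{aligned}
\frac{p(Y_{i1}=1, Y_{i2}=0 \mid D)}{p(Y_{i1}=0, Y_{i2}=1 \mid D)}
&= \frac{E\{p(Y_{i1}=1, Y_{i2}=0 \mid D, Z_i) \mid D\}}{E\{p(Y_{i1}=0, Y_{i2}=1 \mid D, Z_i) \mid D\}} \\
&\overset{A}{=} \frac{E\{p(Y_{i1}=1 \mid D, Z_i)\,p(Y_{i2}=0 \mid D, Z_i) \mid D\}}{E\{p(Y_{i1}=0 \mid D, Z_i)\,p(Y_{i2}=1 \mid D, Z_i) \mid D\}} \\
&\overset{B}{=} \frac{E\{p(Y_{i1}=1 \mid X_{i1}=1, Z_i)\,p(Y_{i2}=0 \mid X_{i2}=0, Z_i) \mid D\}}{E\{p(Y_{i1}=0 \mid X_{i1}=1, Z_i)\,p(Y_{i2}=1 \mid X_{i2}=0, Z_i) \mid D\}} \\
&= \frac{E\left\{ \dfrac{p(Y_{i1}=1 \mid X_{i1}=1, Z_i)\,p(Y_{i2}=0 \mid X_{i2}=0, Z_i)}{p(Y_{i1}=0 \mid X_{i1}=1, Z_i)\,p(Y_{i2}=1 \mid X_{i2}=0, Z_i)}\; p(Y_{i1}=0 \mid X_{i1}=1, Z_i)\,p(Y_{i2}=1 \mid X_{i2}=0, Z_i) \,\middle|\, D \right\}}{E\{p(Y_{i1}=0 \mid X_{i1}=1, Z_i)\,p(Y_{i2}=1 \mid X_{i2}=0, Z_i) \mid D\}} \\
&= \frac{E\{\exp(\beta)\,p(Y_{i1}=0 \mid X_{i1}=1, Z_i)\,p(Y_{i2}=1 \mid X_{i2}=0, Z_i) \mid D\}}{E\{p(Y_{i1}=0 \mid X_{i1}=1, Z_i)\,p(Y_{i2}=1 \mid X_{i2}=0, Z_i) \mid D\}} \\
&= \exp(\beta)\,\frac{E\{p(Y_{i1}=0 \mid X_{i1}=1, Z_i)\,p(Y_{i2}=1 \mid X_{i2}=0, Z_i) \mid D\}}{E\{p(Y_{i1}=0 \mid X_{i1}=1, Z_i)\,p(Y_{i2}=1 \mid X_{i2}=0, Z_i) \mid D\}} \\
&= \exp(\beta)
\end{aligned}
\]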
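The discordant-pair ratio $n_{10}/n_{01}$ can be checked numerically. The sketch below simulates sibling pairs in which A and B hold by construction: given a shared covariate $Z_i$, the two outcomes are drawn independently and each depends only on the sibling's own exposure. The data-generating values (a standard normal $Z_i$, an intercept of -1, a true odds ratio of 2) are assumptions made for this illustration only.

```python
import numpy as np

def expit(v):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(1)
n = 200_000                      # number of sibling pairs (assumed)
beta = np.log(2.0)               # true conditional log odds ratio (assumed)

z = rng.normal(size=n)                                   # shared covariate Z_i
x = rng.random((n, 2)) < expit(z)[:, None]               # exposures depend on Z_i only
p_y = expit(-1.0 + z[:, None] + beta * x)                # own exposure and Z_i only
y = rng.random((n, 2)) < p_y                             # outcomes independent given (X, Z)

# Keep exposure-discordant pairs and relabel so that "exposed" plays the role of sibling 1.
discordant = x[:, 0] != x[:, 1]
first_is_exposed = x[:, 0]
y_exposed = np.where(first_is_exposed, y[:, 0], y[:, 1])[discordant]
y_unexposed = np.where(first_is_exposed, y[:, 1], y[:, 0])[discordant]

n10 = int(np.sum(y_exposed & ~y_unexposed))   # exposed sibling is the only case
n01 = int(np.sum(~y_exposed & y_unexposed))   # unexposed sibling is the only case
or_hat = n10 / n01

print(f"n10 = {n10}, n01 = {n01}, n10/n01 = {or_hat:.3f}, exp(beta) = {np.exp(beta):.3f}")
```

Only pairs that are discordant on both exposure and outcome contribute, which is why `n00` and `n11` do not appear in the estimator.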

Matching by intervention vs nature
A: $Y_{i1} \perp Y_{i2} \mid X_{i1}, X_{i2}, Z_i$
B: $Y_{ij} \perp X_{ij'} \mid X_{ij}, Z_i$ for $j' \neq j$
That the CML estimator requires assumptions A and B is rarely mentioned in standard textbooks.
This is probably because A and B hold by design in studies in which unrelated subjects have been matched on $Z_i$ by intervention.
They are not guaranteed to hold in studies that have been matched by nature, e.g. sibling comparison studies; whether A and B hold or not depends on what we consider $Z_i$ to be.

Implication
A: $Y_{i1} \perp Y_{i2} \mid X_{i1}, X_{i2}, Z_i$
B: $Y_{ij} \perp X_{ij'} \mid X_{ij}, Z_i$ for $j' \neq j$
Since A and B are sufficient for the CML estimator to be consistent, we can turn the argument around and use A and B to identify $Z_i$:
If there is a $Z_i$ such that A and B hold, then the CML estimator of $\beta$ in model $E(Y_{ij} \mid \text{family } i, X_{ij}) = \alpha_i + \beta X_{ij}$ consistently estimates $\beta$ in model $E(Y_{ij} \mid Z_i, X_{ij}) = \alpha_i + \beta X_{ij}$.

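Causal diagram revisited
[Causal diagram with nodes $U_i$, $X_{i1}$, $X_{i2}$, $M_i$, $Y_{i1}$, $Y_{i2}$, $W_i$]
A: $Y_{i1} \perp Y_{i2} \mid X_{i1}, X_{i2}, Z_i$
B: $Y_{ij} \perp X_{ij'} \mid X_{ij}, Z_i$ for $j' \neq j$
A and B hold if and only if $Z_i = (U_i, M_i)$.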

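Conclusion
$E(Y_{ij} \mid U_i, X_{ij}) = \alpha_i^{1} + \beta^{1} X_{ij}$
$E(Y_{ij} \mid M_i, X_{ij}) = \alpha_i^{2} + \beta^{2} X_{ij}$
$E(Y_{ij} \mid W_i, X_{ij}) = \alpha_i^{3} + \beta^{3} X_{ij}$
$E(Y_{ij} \mid U_i, M_i, X_{ij}) = \alpha_i^{4} + \beta^{4} X_{ij}$
$E(Y_{ij} \mid U_i, W_i, X_{ij}) = \alpha_i^{5} + \beta^{5} X_{ij}$
$E(Y_{ij} \mid M_i, W_i, X_{ij}) = \alpha_i^{6} + \beta^{6} X_{ij}$
$E(Y_{ij} \mid U_i, M_i, W_i, X_{ij}) = \alpha_i^{7} + \beta^{7} X_{ij}$
The CML estimator of $\beta$ in model $E(Y_{ij} \mid \text{family } i, X_{ij}) = \alpha_i + \beta X_{ij}$ consistently estimates $\beta^{4}$ in model $E(Y_{ij} \mid U_i, M_i, X_{ij}) = \alpha_i^{4} + \beta^{4} X_{ij}$.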
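The conclusion can be illustrated with a small linear simulation. The sketch below is not taken from the slides: it assumes a jointly normal system with a shared confounder `u`, a shared mediator `m` that depends on both siblings' exposures, and a shared collider `w` that depends on both siblings' exposures and outcomes. All coefficients are hypothetical. The within-pair estimator is then compared with ordinary regressions that adjust for different sets of shared covariates.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000                                   # number of sibling pairs (assumed)
beta, gamma, delta = 1.0, 2.0, 3.0           # direct, mediator and confounder effects (assumed)

u = rng.normal(size=n)                                        # shared confounder U_i
x = u[:, None] + rng.normal(size=(n, 2))                      # exposures X_i1, X_i2
m = x.mean(axis=1) + rng.normal(size=n)                       # shared mediator M_i
y = (beta * x + gamma * m[:, None] + delta * u[:, None]
     + rng.normal(size=(n, 2)))                               # outcomes Y_i1, Y_i2
w = x.sum(axis=1) + y.sum(axis=1) + rng.normal(size=n)        # shared collider W_i

def x_coefficient(*shared):
    """OLS coefficient of X_ij when adjusting for the given shared covariates."""
    columns = [np.ones(2 * n), x.ravel()]
    columns += [np.repeat(s, 2) for s in shared]   # one copy per sibling
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, y.ravel(), rcond=None)
    return float(coef[1])

# Within-pair (CML) estimator: regress outcome differences on exposure differences.
dx = x[:, 0] - x[:, 1]
dy = y[:, 0] - y[:, 1]
beta_cml = float((dx * dy).sum() / (dx ** 2).sum())

beta_1 = x_coefficient(u)          # adjusts for the confounder only
beta_4 = x_coefficient(u, m)       # adjusts for confounder and mediator
beta_7 = x_coefficient(u, m, w)    # additionally adjusts for the collider

print(f"CML: {beta_cml:.3f}  beta1: {beta_1:.3f}  beta4: {beta_4:.3f}  beta7: {beta_7:.3f}")
```

Under these assumptions the regression on `u` alone targets the total effect `beta + gamma / 2 = 2`, while the within-pair estimate agrees with the regression that adjusts for both `u` and `m`, and differs from the one that also adjusts for `w`.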

Causal interpretation
Let $Y_{ij}^{x,m}$ be the potential outcome for a subject with levels $X_{ij} = x$ and $M_i = m$.
In the absence of non-shared confounders, $\beta$ is the controlled direct effect $E(Y_{ij}^{x+1,m} \mid U_i) - E(Y_{ij}^{x,m} \mid U_i)$.

The ADHD study revisited
The null finding by Chang et al. (2014) does not imply the absence of a causal effect.
It may also be explained by a causal effect that is mediated through familial environment, because the sibling comparison design implicitly controls for all shared mediators.
It cannot be explained by collider-stratification bias, because the sibling comparison design does not implicitly control for any shared colliders.

Recap
The CML estimator of $\beta$ in model $E(Y_{ij} \mid \text{family } i, X_{ij}) = \alpha_i + \beta X_{ij}$ consistently estimates $\beta^{4}$ in model $E(Y_{ij} \mid U_i, M_i, X_{ij}) = \alpha_i^{4} + \beta^{4} X_{ij}$, provided that the latter model is correctly specified.
What if the latter model is incorrect?

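General result for linear model
Suppose that $X_{ij}$ is binary and define
$\Delta(U_i, M_i) = E(Y_{ij} \mid U_i, M_i, X_{ij} = 1) - E(Y_{ij} \mid U_i, M_i, X_{ij} = 0)$
The CML estimator of $\beta$ in model $E(Y_{ij} \mid \text{family } i, X_{ij}) = \alpha_i + \beta X_{ij}$ generally converges to $E\{\Delta(U_i, M_i) \mid D\}$, where $D$ denotes the exposure-discordant pairs.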
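The following sketch illustrates this limit under simplifying assumptions that are not in the slides: there is only a shared confounder (no mediator), the exposure effect varies with it as `1 + u`, and exposure is more likely when `u` is large. The within-pair estimator is compared with the average effect among exposure-discordant pairs and with the average effect in all pairs.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000                                        # number of sibling pairs (assumed)

u = rng.normal(size=n)                             # shared covariate U_i
p_exposed = 1.0 / (1.0 + np.exp(-(u - 1.5)))       # exposure probability depends on U_i
x = (rng.random((n, 2)) < p_exposed[:, None]).astype(float)

effect = 1.0 + u                                   # heterogeneous exposure effect (assumed)
y = u[:, None] + effect[:, None] * x + rng.normal(size=(n, 2))

# Within-pair estimator; only exposure-discordant pairs have dx != 0.
dx = x[:, 0] - x[:, 1]
dy = y[:, 0] - y[:, 1]
beta_cml = float((dx * dy).sum() / (dx ** 2).sum())

discordant = dx != 0
mean_effect_discordant = float(effect[discordant].mean())   # average effect given D
mean_effect_all = float(effect.mean())                       # average effect in all pairs

print(f"CML: {beta_cml:.3f}")
print(f"mean effect among discordant pairs: {mean_effect_discordant:.3f}")
print(f"mean effect among all pairs: {mean_effect_all:.3f}")
```

Because discordance is more common for some values of `u` than for others, the average among discordant pairs need not equal the population average, and the within-pair estimate tracks the former.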

General result for logistic model
Suppose that $X_{ij}$ is binary and define
\[
OR(U_i, M_i) = \frac{p(Y_{ij} = 1 \mid U_i, M_i, X_{ij} = 1)\,p(Y_{ij} = 0 \mid U_i, M_i, X_{ij} = 0)}{p(Y_{ij} = 0 \mid U_i, M_i, X_{ij} = 1)\,p(Y_{ij} = 1 \mid U_i, M_i, X_{ij} = 0)}
\]
The CML estimator of $\exp(\beta)$ in model $\text{logit}\{p(Y_{ij} = 1 \mid \text{family } i, X_{ij})\} = \alpha_i + \beta X_{ij}$ generally converges to $E\{d(U_i, M_i)\,OR(U_i, M_i) \mid D\}$, where $d(U_i, M_i)$ is a weight that is proportional to $p(Y_{ij} = 0 \mid U_i, M_i, X_{ij} = 1)\,p(Y_{ij} = 1 \mid U_i, M_i, X_{ij} = 0)$.
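A small exact calculation may help to see how the weighted average works. The sketch below assumes that $(U_i, M_i)$ takes only two values, that the siblings' exposures are independent and identically distributed given the stratum, and that A and B hold within each stratum. All probabilities and odds ratios are hypothetical. It computes the limit of the discordant-pair ratio $n_{10}/n_{01}$ directly and compares it with the weighted average of the stratum-specific odds ratios.

```python
import numpy as np

# Two strata of the shared covariates (hypothetical values).
p_stratum = np.array([0.5, 0.5])        # P(stratum)
p_x = np.array([0.2, 0.6])              # P(X = 1 | stratum)
p_y0 = np.array([0.1, 0.3])             # P(Y = 1 | X = 0, stratum)
odds_ratio = np.array([1.5, 4.0])       # stratum-specific odds ratios

# P(Y = 1 | X = 1, stratum) implied by the odds ratios.
odds_y1 = odds_ratio * p_y0 / (1.0 - p_y0)
p_y1 = odds_y1 / (1.0 + odds_y1)

# Distribution of the stratum among exposure-discordant pairs:
# P(D | stratum) is proportional to p_x * (1 - p_x).
unnormalised = p_stratum * p_x * (1.0 - p_x)
p_stratum_given_d = unnormalised / unnormalised.sum()

# Limit of n10 / n01, computed directly from the cell probabilities.
numerator = np.sum(p_stratum_given_d * p_y1 * (1.0 - p_y0))
denominator = np.sum(p_stratum_given_d * (1.0 - p_y1) * p_y0)
limit_direct = float(numerator / denominator)

# The same limit written as a weighted average of the odds ratios.
weight = (1.0 - p_y1) * p_y0 / denominator
limit_weighted = float(np.sum(p_stratum_given_d * weight * odds_ratio))

print(f"direct: {limit_direct:.4f}, weighted average: {limit_weighted:.4f}")
```

The weights average to one over the discordant pairs, so the limit always lies between the smallest and the largest stratum-specific odds ratio.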

General result for Cox proportional hazards model
Suppose that $X_{ij}$ is binary and define
\[
HR(U_i, M_i) = \frac{\lambda(y_{ij} \mid U_i, M_i, X_{ij} = 1)}{\lambda(y_{ij} \mid U_i, M_i, X_{ij} = 0)},
\]
where we have assumed that the hazard ratio is constant across time $y_{ij}$, but not across $U_i$ and $M_i$.
The CML estimator of $\exp(\beta)$ in model $\log\{\lambda(y_{ij} \mid \text{family } i, X_{ij})\} = \alpha_i(y_{ij}) + \beta X_{ij}$ generally converges to $E\{b(U_i, M_i)\,HR(U_i, M_i) \mid D\}$, where $b(U_i, M_i)$ is a weight that is proportional to $\{1 + HR(U_i, M_i)\}^{-1}$.
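This limit can be checked by simulation in a simple special case. The sketch below assumes exposure-discordant pairs only, exponential failure times, no censoring and no ties. In that setting the stratified Cox partial likelihood depends only on which sibling fails first, and its maximiser for $\exp(\beta)$ is the number of pairs in which the exposed sibling fails first divided by the number in which the unexposed sibling fails first. The two hazard ratio levels and the gamma-distributed family baseline rates are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000                                   # exposure-discordant pairs (assumed)

hr_levels = np.array([1.5, 4.0])              # stratum-specific hazard ratios (assumed)
stratum = rng.integers(0, 2, size=n)          # each stratum equally likely among pairs
hr = hr_levels[stratum]

baseline = rng.gamma(shape=2.0, scale=1.0, size=n)          # family-specific baseline rate
t_exposed = rng.exponential(scale=1.0 / (baseline * hr))    # exposed sibling
t_unexposed = rng.exponential(scale=1.0 / baseline)         # unexposed sibling

# Stratified Cox estimate of exp(beta) for pairs without censoring or ties.
exposed_first = t_exposed < t_unexposed
hr_cml = float(exposed_first.sum() / (~exposed_first).sum())

# Weighted average of the hazard ratios with weights proportional to 1 / (1 + HR).
weight = 1.0 / (1.0 + hr)
weight = weight / weight.mean()
hr_weighted = float(np.mean(weight * hr))

print(f"CML: {hr_cml:.3f}, weighted HR: {hr_weighted:.3f}, simple mean HR: {hr.mean():.3f}")
```

Since the weights decrease with the hazard ratio, the limit is pulled towards the smaller hazard ratios relative to the simple average.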

No clear-cut distinction
The sibling comparison design controls for all shared confounders and mediators, but no shared colliders.
In practice, the distinction between confounders, mediators and colliders may be somewhat artificial; e.g. familial environment may act as all three types of covariates.
It is still a useful distinction, since it helps to understand the theoretical properties of the design and models.

Other estimation alternatives
Alternatives to CML: ML, GEE, the random effects/frailty model and the between-within (BW) model.
ML gives results identical to CML in linear models, but is inconsistent in non-linear models (the incidental parameter problem).
GEE does not control for any shared confounders, since it estimates the marginal (over families) association.
The random effects/frailty model is inconsistent if there is shared confounding, since the model assumes that the intercept is independent of the exposure, thus ruling out shared confounding a priori.
The BW model gives results identical to CML in linear models and nearly identical results in non-linear models.
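For the linear case, the agreement between the BW model and CML can be verified directly. The sketch below assumes balanced sibling pairs, a single shared confounder and a BW model fitted by ordinary least squares with the family mean of the exposure as an extra covariate. The data-generating values are hypothetical. A pooled regression that ignores the family structure is included for contrast.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000                                   # number of sibling pairs (assumed)
beta = 0.5                                   # true within-family effect (assumed)

u = rng.normal(size=n)                       # shared confounder
x = u[:, None] + rng.normal(size=(n, 2))
y = beta * x + 2.0 * u[:, None] + rng.normal(size=(n, 2))

# Within-family (CML) estimator.
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
beta_within = float((xc * yc).sum() / (xc ** 2).sum())

# Between-within model: regress Y on X and on the family mean of X.
x_bar = np.repeat(x.mean(axis=1), 2)
design_bw = np.column_stack([np.ones(2 * n), x.ravel(), x_bar])
coef_bw, *_ = np.linalg.lstsq(design_bw, y.ravel(), rcond=None)
beta_bw = float(coef_bw[1])

# Pooled regression that ignores the family structure.
design_pooled = np.column_stack([np.ones(2 * n), x.ravel()])
coef_pooled, *_ = np.linalg.lstsq(design_pooled, y.ravel(), rcond=None)
beta_pooled = float(coef_pooled[1])

print(f"within: {beta_within:.4f}, BW: {beta_bw:.4f}, pooled: {beta_pooled:.4f}")
```

The equality of the within and BW estimates here is exact up to floating-point error, because the family-centered exposure is orthogonal to both the intercept and the family mean.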

References
Chang, Z., Lichtenstein, P., D'Onofrio, B., Almqvist, C., Kuja-Halkola, R., Sjölander, A., and Larsson, H. (2014). Maternal age at childbirth and risk for ADHD in offspring: a population-based cohort study. International Journal of Epidemiology, 43, 1815-1824.
Sjölander, A., and Zetterqvist, J. (In press). Confounders, mediators or colliders: what types of shared covariates does the sibling comparison design control for? Epidemiology.