Causal modelling in Medical Research Debashis Ghosh Department of Biostatistics and Informatics, Colorado School of Public Health Biostatistics Workshop Series
Goals for today: Introduction to the potential outcomes framework; Exposure to the causal inference analytic pipeline
Causal Inference Asks how a person-specific outcome would change under different scenarios For pop-culture illustrations, see http://bit.ly/2kdfjpv and http://bit.ly/q3qqzl
Causal Inference The fundamental notion: counterfactual Also called potential outcome
Father of the Counterfactual
Hume's definition of causal effect: "We may define a cause to be an object followed by another, and where all the objects, similar to the first, are followed by objects similar to the second. Or, in other words, where, if the first object had not been, the second never had existed." (from Section 7 of An Enquiry Concerning Human Understanding, 1748)
Causality is a topic researched in many fields Philosophy Computer Science Economics Sociology Education Statistics
Motivating example
Motivating example (cont'd.) Question asked by investigators: Is exposure to a statin in the early perioperative period associated with reduced postoperative complications after noncardiac surgery? The causal effect studied in the paper: the causal relative risk of 30-day mortality associated with use of a statin on the day of or the day after surgery Question: What is the ideal way to estimate this effect?
Motivating example (cont'd.) Here, the investigators used data from an observational database on 180,478 patients undergoing noncardiac surgery who were admitted within 7 days of surgery and sampled by the Veterans Affairs Surgical Quality Improvement Program (VASQIP) Investigators modelled treatment assignment with propensity scores and used matching to estimate causal effects
A recipe for causal inference 1 Define the exposure 2 Define the causal parameter of interest 3 Model the assignment mechanism using propensity scores 4 Estimate the causal effect conditioning on the propensity score in some way 5 Get a variance estimate 6 Perform sensitivity analyses for unobserved confounding
Causal effects Back to the examples from the movies: let Y be the outcome in the movie From The Matrix, one causal effect is Y(Neo takes blue pill) - Y(Neo takes red pill) From Sliding Doors, a causal effect is Y(Helen catches subway) - Y(Helen doesn't catch subway) Why are these causal effects?
Causal effect definition from potential outcomes Comparison of potential outcomes within the same individuals Requires envisioning a Y for a given person under each scenario The observed data will be Y under only one scenario Fundamental Problem of Causal Inference
Causal effects (cont'd.) Suppose a person can get a treatment (T = 1) or control (T = 0) Then the data for causal inference can be visualized as

Unit #   Y(0)   Y(1)   T
1        ?      y1     1
2        y2     ?      0
3        ?      y3     1
4        ?      y4     1
...
n        yn     ?      0

The question marks define a missing data mechanism If we can fill in the question marks with y-values, then we can calculate causal effects.
Potential Outcomes Setup: let T denote a binary treatment that takes values 0 and 1
1 Let {Y(0), Y(1)} denote the potential outcomes for an individual
2 If we want to have a probability model, then we need to formulate a bivariate distribution
3 This is two-dimensional, in contrast to a one-dimensional distribution
Terminology Causality is tied to an action (also called manipulation, treatment or intervention) Determine a list of possible actions (here encoded by T, so that we have a list of two actions) The action is applied to a unit The presumption is that at any point in time a unit could have been subject to either action (potential outcomes) We associate each unit-action pair with a potential outcome
Ideal data The full data for causal inference can be visualized as

Unit #   Y(0)   Y(1)
1        y10    y11
2        y20    y21
3        y30    y31
4        y40    y41
...
n        yn0    yn1
Treatments Lots of controversy as to how to define this Examples: race, gender Rubin/Holland: no causation without manipulation Think carefully about mechanisms that would lead to manipulation
Causal effects By definition, these are comparisons of potential outcomes made within a unit Regression models compare subpopulations and thus do not enjoy causal interpretations without further assumptions Example (correlation, not causation): E(SBP) = β0 + β1 I(Age > 45)
One causal effect definition

Unit #   Y(0)   Y(1)   Y(1) - Y(0)
1        y10    y11    y11 - y10
2        y20    y21    y21 - y20
3        y30    y31    y31 - y30
4        y40    y41    y41 - y40
...
n        yn0    yn1    yn1 - yn0
Another causal effect definition

Unit #   Y(0)   Y(1)   Y(1)/Y(0)
1        y10    y11    y11/y10
2        y20    y21    y21/y20
3        y30    y31    y31/y30
4        y40    y41    y41/y40
...
n        yn0    yn1    yn1/yn0
Causal effects What were shown on the previous pages were individual-level causal effects A natural thing to do is average the unit-level causal effects across units What assumption are we making when we do this???
Subgroup-specific causal effects We can also define causal effects conditional on subgroups The subgroups could be based on observed covariates or even on potential outcomes What assumption are we making when we do this???
Causal inference assumptions Strongly ignorable treatment assignment (SITA) Stable unit treatment value assumption (SUTVA) Treatment positivity assumption Consistency assumption
Causal inference assumptions Strongly ignorable treatment assignment Math: {Y(0), Y(1)} ⊥ T | X, where X denotes a vector of confounders In words: there exists a rich set of confounders such that the potential outcomes can effectively be treated as baseline variables.
SITA implication Data visualization

Y(0)   Y(1)   T   X1 ... Xp
?      y11    1   x11 ... x1p
y20    ?      0   x21 ... x2p
?      y31    1   x31 ... x3p
?      y41    1   x41 ... x4p
...
yn0    ?      0   xn1 ... xnp

SITA also means that we can use the x's to fill in the question marks
SUTVA Data visualization How does SUTVA show up here??

Y(0)   Y(1)   T   X1 ... Xp
?      y11    1   x11 ... x1p
y20    ?      0   x21 ... x2p
?      y31    1   x31 ... x3p
?      y41    1   x41 ... x4p
...
yn0    ?      0   xn1 ... xnp
Treatment Positivity Assumption This means that 0 < P(T = 1 | X) < 1 for all subjects and for all X values This is needed to have well-defined counterfactuals for defining causal effects When is this violated???
How does violating SUTVA affect the data?

Y(0)   Y(1)   T   X1 ... Xp
?      y11    1   x11 ... x1p
y20    ?      0   x21 ... x2p
?      y31    1   x31 ... x3p
?      y41    1   x41 ... x4p
...
yn0    ?      0   xn1 ... xnp
Consistency Assumption We formulate Y(0) and Y(1) for everybody, but we only observe one of these. The Consistency Assumption means that the observed outcome equals the potential outcome under the treatment actually received.
Assumptions There are a lot of assumptions needed to identify causal effects from observational data These assumptions cannot be tested empirically
London et al., 2017, JAMA Internal Medicine Treatment: T = exposure to statin on day of or day after surgery (1 if received, 0 if not) Outcome: Y = death within 30 days Causal parameter: the causal relative risk E[Y(1)]/E[Y(0)]
A recipe for causal inference 1 Define the exposure 2 Define the causal parameter of interest 3 Model the assignment mechanism using propensity scores 4 Estimate the causal effect conditioning on the propensity score in some way 5 Get a variance estimate 6 Perform sensitivity analyses for unobserved confounding
Assignment Mechanism We have shown how to conceptualize data for causal inference problems Let Yi^obs denote the observed response for subject i. Then

Yi^obs = Ti Yi(1) + (1 - Ti) Yi(0)

If T were the result of a coin flip, then P(T = 1) = P(T = 0) = 1/2 This is what a randomized clinical trial does! This is not true for an observational study
Assignment Mechanism (cont'd.) Definition: An assignment mechanism refers to how each unit came to receive the treatment level actually received The propensity score is one attempt at modelling the assignment mechanism. It is defined as e(X) = P(T = 1 | X) Conditional on the propensity score, one achieves covariate balance on observed covariates
Potential Outcomes (cont'd.) Use of the propensity score leads to a three-step approach
1 Estimate the propensity score
2 Check for balance and adjust accordingly
3 Based on the estimated propensity score, estimate the average causal effect
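As an illustration of step 1, a minimal sketch (on simulated data, not from the slides) of estimating the propensity score by logistic regression, fit here by Newton-Raphson; the coefficients and sample size are my own assumptions:

```python
import numpy as np

# Simulate two baseline confounders and a treatment assigned by a
# logistic model, then recover e(X) = P(T = 1 | X) by logistic regression.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
logit_true = 0.8 * X[:, 0] - 0.5 * X[:, 1]       # true assignment model
T = rng.binomial(1, 1 / (1 + np.exp(-logit_true)))

# Newton-Raphson for the logistic MLE (design matrix with intercept).
Z = np.column_stack([np.ones(n), X])
beta = np.zeros(Z.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-Z @ beta))
    grad = Z.T @ (T - p)
    hess = Z.T @ (Z * (p * (1 - p))[:, None])
    beta += np.linalg.solve(hess, grad)

e_hat = 1 / (1 + np.exp(-Z @ beta))              # estimated propensity scores
print(beta.round(2))                             # should be near (0, 0.8, -0.5)
```

In practice one would check balance (step 2) before using `e_hat` for estimation.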
Variable selection for propensity score Rubin's recommendation is to include all possible confounders of treatment, along with their interactions and quadratic terms One type of variable to avoid: instruments Intuition: we want to include as many variables as possible that are strongly predictive of the potential outcomes We will need to check for balance
Balance [Figure: Distribution of covariate X for treatment and control groups. The blue line denotes the kernel density estimate for X in the T = 1 group, while the magenta line represents the kernel density estimate for X in the T = 0 group. The bars represent the histogram of X regardless of treatment group.]
Recipe for causal analysis (Stuart, 2010, Stat Sci) 1 Decide on covariates for which balance must be achieved 2 Estimate the distance measure (e.g., propensity score) 3 Condition on the distance measure (e.g., using matching, weighting, or subclassification) 4 Assess balance on the covariates of interest; if poor, repeat steps 2-4 5 Estimate the treatment effect in the conditioned sample
Covariate balance Let T ∈ {0, 1} denote a binary treatment Let X denote confounders Ideally, we have that X | T = 1 and X | T = 0 are equal in distribution If this holds in the original dataset, the dataset looks like what you might find in a randomized study
Covariate balance (cont'd.) Typical ways to assess: report standardized mean differences, Kolmogorov-Smirnov statistics, or t-tests Better covariate balance corresponds to smaller values of these statistics, or to larger (less significant) p-values
Covariate balance stats within a matched dataset Start with n observations Assume n_d observations get discarded, so that n - n_d observations remain Let M denote the indices of the observations in the matched dataset
Covariate balance stats within a matched dataset (cont'd.) Comparison of means of covariate k:

[Σ_{i=1}^n I(i ∈ M) T_i X_ik] / [Σ_{i=1}^n I(i ∈ M) T_i] - [Σ_{i=1}^n I(i ∈ M)(1 - T_i) X_ik] / [Σ_{i=1}^n I(i ∈ M)(1 - T_i)],   k = 1, ..., p

Can standardize by the standard deviation in the controls Rule of thumb: want the standardized difference to be < 0.25
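A minimal sketch of the standardized difference for one covariate, standardizing by the control-group standard deviation as on the slide; the simulated group means and sample sizes are my own assumptions:

```python
import numpy as np

# Simulated covariate values in the matched treated and control groups;
# the true standardized imbalance here is 0.3 control-SD units.
rng = np.random.default_rng(1)
x_treated = rng.normal(0.3, 1.0, size=2000)   # covariate among T = 1
x_control = rng.normal(0.0, 1.0, size=2000)   # covariate among T = 0

std_diff = (x_treated.mean() - x_control.mean()) / x_control.std(ddof=1)
print(f"standardized difference = {std_diff:.3f}")

# Rule of thumb from the slide: flag the covariate if |std_diff| >= 0.25
balanced = abs(std_diff) < 0.25
```

In a real analysis this would be computed for each of the p covariates within the matched set M.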
London et al., 2017, JAMA Internal Medicine Propensity score matching (1:1) was conducted by matching pairs of patients with statin exposure and no exposure using a greedy matching algorithm and a caliper width of 0.2 SD of the log odds of the estimated propensity score. Covariate balance between matched pairs for continuous and dichotomous categorical variables was assessed using the standardized difference, with values equal to or less than 10% indicating minimal imbalance.
A recipe for causal inference 1 Define the exposure 2 Define the causal parameter of interest 3 Model the assignment mechanism using propensity scores 4 Estimate the causal effect conditioning on the propensity score in some way 5 Get a variance estimate 6 Perform sensitivity analyses for unobserved confounding
Inverse weighting estimators Use the observation that, under SITA and treatment positivity,

E[T_i Y_i / e(X_i)] = E[Y_i(1)]   and   E[(1 - T_i) Y_i / (1 - e(X_i))] = E[Y_i(0)]
Inverse weighting estimators (cont'd.) The inverse-weighted estimator of the average causal effect is given by

ACE^ = n⁻¹ Σ_{i=1}^n [ T_i Y_i / ê(X_i) - (1 - T_i) Y_i / (1 - ê(X_i)) ]

These estimators can be sensitive to the presence of extreme weights, i.e., when ê(X_i) is close to zero or one
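A sketch of the inverse-weighted estimator above on simulated data, assuming for simplicity that the true propensity score is known (in practice ê(X) is plugged in); the data-generating model and effect size are my own assumptions:

```python
import numpy as np

# Simulate confounded data where the true average causal effect is 2.
rng = np.random.default_rng(2)
n = 20000
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))                 # propensity score e(X) = P(T=1|X)
T = rng.binomial(1, e)
Y = 2.0 * T + X + rng.normal(size=n)     # X confounds T and Y

# Inverse-weighted average causal effect estimator from the slide.
ace_ipw = np.mean(T * Y / e - (1 - T) * Y / (1 - e))
print(round(ace_ipw, 2))                 # should be near 2
```

Note that the naive difference in means `Y[T==1].mean() - Y[T==0].mean()` would be biased upward here because X raises both treatment probability and outcome; a common guard against extreme weights is truncating e at, say, [0.01, 0.99].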
One hybrid approach: double-robust estimation Combine inverse weighting and regression adjustment The double-robust estimator is given by

ACE^_DR = n⁻¹ Σ_{i=1}^n [ T_i Y_i / ê(X_i) - (T_i - ê(X_i)) / ê(X_i) · m̂_1(X_i) ]
        - n⁻¹ Σ_{i=1}^n [ (1 - T_i) Y_i / (1 - ê(X_i)) + (T_i - ê(X_i)) / (1 - ê(X_i)) · m̂_0(X_i) ],

where m̂_1(X) and m̂_0(X) are the estimated functions for E(Y | T = 1, X) and E(Y | T = 0, X), respectively Asymptotic theory can give the variance (or use the bootstrap)
Double-robust estimation Very popular in statistical research Note that m̂_t(X) (t = 0, 1) are the fitted values from a regression model of Y on X in the T = 0 and T = 1 subgroups The m̂_t are used in the regression estimators for causal effects Pro: the double-robust estimator can be consistent if either the propensity or the outcome regression is correctly specified. Con: the double-robust estimator can perform poorly if both are misspecified.
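A sketch of the double-robust estimator on simulated data. To illustrate the "pro" above, the propensity model is deliberately misspecified (a constant), while the outcome regressions m̂_0, m̂_1 are correct, so the estimator stays consistent; the helper `ols_fit` and the data-generating model are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
X = rng.normal(size=n)
e_true = 1 / (1 + np.exp(-X))
T = rng.binomial(1, e_true)
Y = 2.0 * T + X + rng.normal(size=n)     # true average causal effect = 2

e_hat = np.full(n, T.mean())             # deliberately wrong propensity model

def ols_fit(x, y):
    """Least-squares fit of y on (1, x); returns (intercept, slope)."""
    Z = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(Z, y, rcond=None)[0]

# Correct outcome models: OLS of Y on X within each treatment arm.
b1 = ols_fit(X[T == 1], Y[T == 1])
b0 = ols_fit(X[T == 0], Y[T == 0])
m1 = b1[0] + b1[1] * X                   # m̂_1(X)
m0 = b0[0] + b0[1] * X                   # m̂_0(X)

# Double-robust (AIPW) estimator from the slide.
ace_dr = np.mean(T * Y / e_hat - (T - e_hat) / e_hat * m1) \
       - np.mean((1 - T) * Y / (1 - e_hat) + (T - e_hat) / (1 - e_hat) * m0)
print(round(ace_dr, 2))                  # near 2 despite the bad propensity model
```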
Matching estimators Based on the estimated propensity score, for each subject i with T = 1, find one (or more) subjects with T = 0 with a similar propensity score and compare their responses; call this the matched set M_i for subject i Estimate the average causal effect as

n_t⁻¹ Σ_{i: T_i = 1} ( Y_i - m_i⁻¹ Σ_{j ∈ M_i: T_j = 0} Y_j ),

where M_i is the matched set for subject i, m_i is its size, and n_t is the number of subjects with T = 1 A more robust approach is to coarsen the propensity score (subclassification)
Matching Matching to balance the covariate distribution Goal: to make the treated and control subjects look alike before treatment Equivalently, to produce a study design which resembles a completely randomized design in terms of the observed covariates
Matching (cont'd.) Need a notion of distance Typical idea: define a distance between covariates, propensity scores, or functions thereof Example: log[ ê(x) / (1 - ê(x)) ] For discrete covariates, could use exact matching, which corresponds to the Hamming distance
Matching (cont'd.) Nearest neighbor algorithm
1 Iteratively find the pair of subjects with the shortest distance between cases and controls
2 Easy to understand and implement
3 Offers good results in practice
4 Fast running time
5 Rarely offers the best matching results (compared to optimal matching)
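A sketch of greedy 1:1 nearest-neighbor matching on the logit of the propensity score, with the 0.2-SD caliper used by London et al.; the simulated scores are my own assumptions:

```python
import numpy as np

# Simulated estimated propensity scores and treatment indicators.
rng = np.random.default_rng(4)
n = 400
e_hat = rng.uniform(0.1, 0.9, size=n)
T = rng.binomial(1, e_hat)

logit = np.log(e_hat / (1 - e_hat))
caliper = 0.2 * logit.std(ddof=1)          # 0.2 SD of the logit score

treated = np.flatnonzero(T == 1)
controls = set(np.flatnonzero(T == 0))
pairs = []
for i in treated:                          # greedy: first come, first matched
    if not controls:
        break
    j = min(controls, key=lambda c: abs(logit[i] - logit[c]))
    if abs(logit[i] - logit[j]) <= caliper:  # discard poor-quality matches
        pairs.append((i, j))
        controls.remove(j)                 # matching without replacement

print(len(pairs), "matched pairs")
```

Optimal matching would instead minimize the total distance over all pairings, which greedy matching generally does not achieve.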
Matching (cont d.) Phrased often as a bipartite matching problem Conceptually, have a graph with two sets of vertices: one set represents controls and one set represents treated Algorithm needs to find edges from one set to the other, where the presence of an edge denotes a matched set
Other matching schemes Bipartite matching:
1 Pair matching: used when the numbers of treated and controls are comparable
2 1:K matching: used when the control group is huge compared to the treated
3 Variable matching: more flexible than 1:K; the number of matched controls is set to be between a and b for each treated subject
4 Optimal matching: phrase the bipartite problem as minimizing a so-called max-flow/min-cut (network flow) problem in graph theory
Matching options Mahalanobis metric matching:

d_M(x, x′) = (x - x′)ᵀ [ (n_c Σ̂_c + n_t Σ̂_t) / (n_c + n_t) ]⁻¹ (x - x′),

where n_c and n_t are the numbers of observations in the T = 0 and T = 1 groups, and Σ̂_c and Σ̂_t are the estimated variance-covariance matrices of the covariates/confounders Propensity score matching:

d_l(x, x′) = ( log[ ê(x) / (1 - ê(x)) ] - log[ ê(x′) / (1 - ê(x′)) ] )²

Exact matching Can combine these into a hybrid
Matching options (cont'd.) Sometimes matches of poor quality are excluded (relevant to causal effect estimation) For continuous covariates, one might do what is known as caliper matching: define an acceptable match as one having d(x, x′) ≤ d_cal, where d_cal is a preset number
Matching estimators: variance No consensus Some options: 1 Adjust for matching status as a fixed effect 2 Adjust for matching status as a random effect 3 Asymptotic theory (Abadie and Imbens, 2006) 4 Modified bootstrap
Subclassification Idea: categorize propensity score into K strata and to estimate causal effect within each stratum separately If there exists no stratum interaction with the causal effect, then can combine stratum-specific causal effect estimators to get an overall average causal effect
Subclassification: # of strata Pick a certain number for K Cochran (1965): set K = 5 A more data-driven choice of K is given in the book (Chapter 13):
1 Start with K = 1
2 Do a pooled two-sample t-test for a difference in propensity scores between treated and controls
3 If there is a significant difference, split into more strata, subject to a minimum sample size for each group (treated and control) in each stratum
4 Iterate over steps 2-3, stopping when there is nonsignificance
Subclassification: computation Given strata of the form (b_{j-1}, b_j], j = 1, ..., K, where b_0 = 0 and b_K = 1, define B_ij (i = 1, ..., n; j = 1, ..., K) by

B_ij = 1 if b_{j-1} < ê(X_i) ≤ b_j, and 0 otherwise

Define, for j = 1, ..., K,

n_1j = Σ_{i: T_i = 1} B_ij,   n_0j = Σ_{i: T_i = 0} B_ij,   n_j = n_1j + n_0j
Subclassification: computation (cont'd.) The average causal effect in stratum j is then

τ̂_j = n_1j⁻¹ Σ_{i: T_i = 1} B_ij Y_i - n_0j⁻¹ Σ_{i: T_i = 0} B_ij Y_i

The subclassification (or blocking) average causal effect is

τ̂_block = Σ_{j=1}^K (n_j / n) τ̂_j
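A sketch of the subclassification estimator with K = 5 propensity-score strata (Cochran's choice), on simulated data with true average causal effect 2; the quintile cut points and data-generating model are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10000
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))                  # propensity score (known here)
T = rng.binomial(1, e)
Y = 2.0 * T + X + rng.normal(size=n)      # true ACE = 2

K = 5
edges = np.quantile(e, np.linspace(0, 1, K + 1))  # strata cut points b_j
edges[0], edges[-1] = 0.0, 1.0
tau_block, n_used = 0.0, 0
for j in range(K):
    in_j = (e > edges[j]) & (e <= edges[j + 1])   # B_ij = 1 for stratum j
    t, c = in_j & (T == 1), in_j & (T == 0)
    if t.sum() == 0 or c.sum() == 0:
        continue                                  # skip strata with an empty arm
    tau_j = Y[t].mean() - Y[c].mean()             # stratum-specific effect
    tau_block += in_j.sum() * tau_j               # weight by n_j
    n_used += in_j.sum()
tau_block /= n_used
print(round(tau_block, 2))                        # near 2, small residual bias
```

With only five strata a little within-stratum confounding remains, which is why the estimate is close to, but not exactly, the truth.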
London et al., 2017, JAMA Internal Medicine After creating a matched pair dataset using propensity scores, the authors estimated effects using matched pair methods Propensity score used to match observations Reduction in size (from >180,000 to 96,486)
A recipe for causal inference 1 Define the exposure 2 Define the causal parameter of interest 3 Model the assignment mechanism using propensity scores 4 Estimate the causal effect conditioning on the propensity score in some way 5 Get a variance estimate 6 Perform sensitivity analyses for unobserved confounding
Inference for surveys (finite-sample) versus superpopulation Recall the earlier data structure:

Y(0)   Y(1)   T   X1 ... Xp
?      y11    1   x11 ... x1p
y20    ?      0   x21 ... x2p
?      y31    1   x31 ... x3p
?      y41    1   x41 ... x4p
...
yn0    ?      0   xn1 ... xnp

In this setup, the question marks occur randomly, but other than that, there is no randomness This is the finite-sample setup.
Finite-sample inferences Based on randomization distribution of T (i.e., shuffle the label of T ) Use estimated standard deviations conditional on propensity score
From finite-sample to superpopulation Everything discussed so far has been a finite-sample mode of inference Generalize to the superpopulation The idea now is that the observed units are a random sample from a superpopulation Recall τ_fs = n⁻¹ Σ_{i=1}^n {Y_i(1) - Y_i(0)} Expectations are now taken with respect to the superpopulation sampling mechanism, and not T
Superpopulation inferences Extensive derivations can be found in the Appendix of Chapter 6 of Imbens and Rubin (2015) Key results:

E(τ̂_dif) = E(Y(1) - Y(0)),   Var(τ̂_dif) = σ_c²/n_c + σ_t²/n_t,

where σ_c² is the variance of Y(0) with respect to the superpopulation sampling, and σ_t² is the variance of Y(1) with respect to the superpopulation sampling
Other variance formulae Subclassification: Assume J strata based on the propensity score Use least squares of Y on T and X within each subclass The coefficient for T is τ̂_j, and σ̂_j² is obtained from the sum of squares of residuals Assuming no stratum-by-treatment interaction,

τ̂_s = Σ_{j=1}^J [ (n_c(j) + n_t(j)) / n ] τ̂(j)

Then the estimated variance is given by

V̂(τ̂_s) = Σ_{j=1}^J [ (n_c(j) + n_t(j)) / n ]² σ̂_j²
To bootstrap or not to bootstrap? The form τ̂ = τ̂(Y, T, X) suggests that one could do a bootstrap Algorithm:
1 Resample a bootstrapped dataset from the observed data (Y_i, T_i, X_i), i = 1, ..., n
2 Compute τ_b = τ̂(Y_b, T_b, X_b) for the bth bootstrapped dataset
3 Repeat steps 1 and 2 B times
4 Use the empirical distribution of τ_1, ..., τ_B to compute standard errors and confidence intervals
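The steps above can be sketched for the inverse weighting estimator (where, per the next slide, the bootstrap is well behaved); simulated data, and for brevity the true propensity score is reused rather than re-estimated within each bootstrap replicate, as one would do in practice:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))
T = rng.binomial(1, e)
Y = 2.0 * T + X + rng.normal(size=n)      # true ACE = 2

def tau_hat(Y, T, e):
    """Inverse-weighted average causal effect estimator."""
    return np.mean(T * Y / e - (1 - T) * Y / (1 - e))

B = 500
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)      # step 1: resample with replacement
    boot[b] = tau_hat(Y[idx], T[idx], e[idx])  # step 2: recompute estimator

se = boot.std(ddof=1)                     # step 4: empirical standard error
ci = np.percentile(boot, [2.5, 97.5])     # percentile confidence interval
print(f"SE = {se:.3f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```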
To bootstrap or not to bootstrap? (cont'd.) Might be problematic for matching estimators Matching is an example of a nonregular problem, so the bootstrap might fail For inverse weighting problems, it works fine Counterpoint arguing for the bootstrap: Samworth (2003, Biometrika)
London et al., 2017, JAMA Internal Medicine After creating a matched pair dataset using propensity scores, the authors estimated variances of effects using matched pair methods Random effects models are superpopulation in spirit Conditional methods are finite-sample in spirit.
A recipe for causal inference 1 Define the exposure 2 Define the causal parameter of interest 3 Model the assignment mechanism using propensity scores 4 Estimate the causal effect conditioning on the propensity score in some way 5 Get a variance estimate 6 Perform sensitivity analyses for unobserved confounding
Sensitivity Analysis: basic setup Assume that T is not conditionally independent of {Y(0), Y(1)} given X (i.e., SITA is violated) Assume instead that T ⊥ {Y(0), Y(1)} | X, U, where U is an unmeasured random variable Treat U as a synthetic covariate (i.e., simulated and under the control of the user) Write down models for Y | X, U and T | X, U Let the coefficient for U vary and estimate the coefficients for X in the two models Hard to implement, as it requires a grid search
Sensitivity Analysis: simplification One simplification is due to Lin et al. (1998, Biometrics) Idea: use model misspecification theory Recall that we have invoked consistency for parameter estimates: θ̂ →p θ, or in words, theta hat (the estimate) converges in probability to θ (the true value of the parameter) This assumes that the model for the data is correctly specified What happens when this is not true?
Lin et al. (1998) When the model being fit to the data is false, θ̂ →p θ*, where θ* is called the least false parameter θ* minimizes the Kullback-Leibler divergence: https://en.wikipedia.org/wiki/kullback-leibler_divergence
Lin et al. (1998, cont d.) Cast the violation of SITA as one of fitting the incorrect model: True model: Y depends on T, X and U False model: Y depends on T and X This approach allows for some simple approximate sensitivity analysis formulae
London et al., 2017, JAMA Internal Medicine Authors performed several sensitivity analyses Interestingly, nothing for unmeasured confounding
Summary Causal inference aims to use observational data to mimic a randomized experimental design using observed covariates The propensity score plays the role of the coin flip (a probability based on observed covariates) We have given a six-step pipeline for analysis Note that many unverifiable assumptions are needed for performing causal inference, and thus the sensitivity analysis step is key.
References
Abadie, A. and Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74, 235-267.
Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge: Cambridge University Press.
Lin, D. Y., Psaty, B. M., and Kronmal, R. A. (1998). Assessing the sensitivity of regression results to unmeasured confounders in observational studies. Biometrics 54, 948-963.
London, M. J., Schwartz, G. G., Hur, K., and Henderson, W. G. (2017). Association of perioperative statin use with mortality and morbidity after major noncardiac surgery. JAMA Internal Medicine 177, 231-242.
Stuart, E. A. (2010). Matching methods for causal inference: a review and a look forward. Statistical Science 25, 1-21.