Bayesian regression tree models for causal inference: regularization, confounding and heterogeneity

P. Richard Hahn, Jared Murray, and Carlos Carvalho

June 22, 2017
The problem setting

We want to estimate treatment effects using a regression model, assuming:

- observational data (not from an experiment),
- conditional unconfoundedness (we've measured everything we need to),
- covariate-dependent treatment effects (people can have different responses to treatment according to their covariates),
- a binary treatment variable (you either got the drug or you didn't).
More formally

We assume strong ignorability:

    Y(0), Y(1) ⊥ Z | X,    0 < Pr(Z_i = 1 | x_i) < 1 for all i.

Therefore

    E(Y_i(1) | x_i) = E(Y_i | x_i, Z_i = 1),
    E(Y_i(0) | x_i) = E(Y_i | x_i, Z_i = 0),

and the treatment effect is

    α(x_i) := E(Y_i | x_i, Z_i = 1) − E(Y_i | x_i, Z_i = 0).
Additive, homoskedastic errors

Here we consider mean-zero, additive error representations

    Y_i = f(x_i, z_i) + ε_i,

so that E(Y_i | x_i, z_i) = f(x_i, z_i). The treatment effect of setting z_i = 1 versus z_i = 0 is expressed as

    α(x_i) := f(x_i, 1) − f(x_i, 0).

nb: In this context, conditional ignorability is often expressed as ε_i ⊥ Z_i | x_i.
Regression trees

[Figure: an example tree T_h with a root split on x_1 < c and a second split on x_3 < d, shown alongside the rectangular partition of the (x_1, x_3) plane that it induces.]

Leaf/end-node parameters: M_h = (µ_h1, µ_h2, µ_h3).
Partition: A_h = {A_h1, A_h2, A_h3}.

    g(x, T_h, M_h) = µ_ht if x ∈ A_ht (for 1 ≤ t ≤ b_h).
Bayesian Additive Regression Trees (BART)

Bayesian additive regression trees (BART; Chipman, George and McCulloch, 2010):

    y_i = f(x_i) + ε_i,    ε_i ~ N(0, σ²),

    f(x) = Σ_{h=1}^{m} g(x, T_h, M_h).

The g are basis functions determined by a binary tree T_h and a vector of leaf parameters M_h.

Hill (2011) specifically proposes BART for causal inference: in several simulation studies it works really, really well.
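To make the sum-of-trees representation concrete, here is a minimal Python sketch of evaluating g(x, T_h, M_h) and the additive fit f(x). This is not the BART sampler; the two tree structures and the cutoffs c and d are made up purely for illustration.

```python
# A tree is either a leaf (a scalar mu) or an internal node:
# (feature index, cutoff, subtree for x[feat] < cut, subtree otherwise).
def g(x, tree):
    """Look up the leaf parameter mu_t of the partition cell A_ht containing x."""
    while isinstance(tree, tuple):
        feat, cut, left, right = tree
        tree = left if x[feat] < cut else right
    return tree

# Two illustrative trees; BART sums m such basis functions.
c, d = 0.0, 0.5
T1 = (0, c, -1.0, (2, d, 0.5, 1.0))   # split on x1 < c, then on x3 < d
T2 = (1, 0.0, -0.2, 0.2)              # a second, shallower tree on x2

def f(x, trees=(T1, T2)):
    """f(x) = sum over h of g(x, T_h, M_h)."""
    return sum(g(x, t) for t in trees)

print(f([0.3, -0.4, 0.7]))            # leaf 1.0 from T1 plus leaf -0.2 from T2
```

Each tree contributes only a small piece of the fit; regularization in BART comes from priors that keep the individual trees shallow.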
Improving BART for causal inference

BART is excellent for causal inference, but it exhibits undesirable behavior in certain situations:

1. Under strong confounding, estimates of the average treatment effect from BART exhibit severe bias.
2. On synthetic data with a known homogeneous effect, BART produces individual effect estimates that are over-dispersed.

Our goal is to develop a modified BART model that addresses these two weaknesses.
Example of problem one

Consider a problem with p = 2, n = 1,000, and homogeneous effects. The true treatment effect is α = 1:

    Y_i = µ_i + Z_i + ε_i,    µ_i = 1(x_i1 < x_i2) − 1(x_i1 ≥ x_i2),
    P(Z_i = 1 | x_i1, x_i2) = Φ(µ_i),
    ε_i ~ iid N(0, 0.7²),    x_i1, x_i2 ~ iid N(0, 1).

Y: measure of heart distress; Z: took heart medication; x_1 and x_2 are blood pressure measurements.

This example demonstrates targeted treatment: patients with x_i1 < x_i2 are about five times as likely to receive the new drug precisely because they are more likely to have higher levels of heart distress.
BART shows substantial bias

BART (white) exhibits substantial bias.

[Figure 1: Sampling distributions of the treatment effect estimate. BART (white) misses the truth, by a lot, across 250 simulations. (I will explain the blue and pink in a moment.)]
Regularization-induced confounding

Why is BART biased in this example?

- µ(x) needs many axis-aligned splits to be approximated by trees.
- The response surface can be parsimoniously approximated with just one axis-aligned split in the treatment variable and an over-stated treatment effect.
- Therefore, priors over f that penalize the total number of splits tend to over-attribute changes in E(Y | x, Z) to a treatment effect.
Regularization-induced confounding

[Figure 2: It takes many axis-aligned splits for a tree to fit a steep response gradient along a diagonal. Our example response surface was +1 above the diagonal and −1 below it.]
The fix: ps-BART

We can fix this by estimating π(x) = P(Z = 1 | x) (using BART) and then using π̂(x) as an extra predictor variable in a BART model for the response surface.

Table 1: The BART prior exhibits substantial bias in estimating the treatment effect. Modified BART priors allowing splits in either the true propensity score (Oracle BART) or an estimated propensity score (ps-BART) perform markedly better.

    Prior         Bias   Coverage   RMSE
    BART          0.14   31%        0.15
    Oracle BART   0.00   98%        0.05
    ps-BART       0.06   85%        0.08
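The two-stage structure can be sketched as follows. To keep the example self-contained, a simple k-nearest-neighbor average stands in for BART in both stages; this is a crude substitute chosen only to show the plumbing, and the neighborhood sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.standard_normal((n, 2))
mu = np.where(x[:, 0] < x[:, 1], 1.0, -1.0)
z = (rng.random(n) < np.where(mu > 0, 0.8413, 0.1587)).astype(float)
y = mu + z + 0.7 * rng.standard_normal(n)

def knn_mean(train_x, train_y, query, k):
    """Average train_y over the k nearest training points (stand-in smoother)."""
    d = np.linalg.norm(train_x - query, axis=1)
    return train_y[np.argsort(d)[:k]].mean()

# Stage 1: estimate the propensity score pi(x) = P(Z = 1 | x).
pihat = np.array([knn_mean(x, z, xi, k=50) for xi in x])

# Stage 2: fit the response surface with pihat appended as a predictor, and
# read off alpha(x) = fhat(x, 1) - fhat(x, 0) from treated/control fits.
xa = np.column_stack([x, pihat])
treated, control = z == 1, z == 0
alpha_hat = np.array([
    knn_mean(xa[treated], y[treated], q, k=25) -
    knn_mean(xa[control], y[control], q, k=25)
    for q in xa
])
ate_hat = alpha_hat.mean()
print(ate_hat)
```

The point of stage 1 is that π̂(x) hands the response-surface model the confounding direction in a single covariate, so a split-penalizing prior no longer needs many splits to capture it.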
R.I.C. in linear regression

Consider the response model

    Y_i = β₀ + αZ_i + βᵗx_i + ε_i,    Z_i = γᵗx_i + ν_i.

For a flat prior on α and a ridge prior on β, the bias is

    bias(α̂_rr) = ((ZᵗZ)⁻¹ZᵗX)(I_p + Xᵗ(X − X̂_Z))⁻¹β.

For Ẑ_i = γ̂ᵗx_i, augmenting to Z̃ = (Z, Ẑ) gives bias

    bias(α̂_rr) = {(Z̃ᵗZ̃)⁻¹Z̃ᵗX}₁ (I_p + Xᵗ(X − X̂_Z̃))⁻¹β = 0,

where {·}₁ denotes the first row, i.e. the coefficient on Z.
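The claim is easy to check numerically. The sketch below computes the exact sampling bias E[α̂] − α for a partially penalized ridge (flat prior on the unpenalized columns, unit ridge on β), first with design (Z, X) and then with Ẑ appended; since the estimator is linear in Y, plugging in the noiseless mean response gives the bias directly. The particular n, p, γ, β values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
gamma = np.ones(p)
Z = X @ gamma + rng.standard_normal(n)         # treatment confounded with X
alpha, beta = 1.0, np.ones(p)

def alpha_bias(unpen, lam=1.0):
    """Exact E[alpha_hat] - alpha for ridge with flat prior on `unpen` columns."""
    W = np.column_stack([unpen, X])
    D = np.diag([0.0] * unpen.shape[1] + [1.0] * p)
    EY = Z * alpha + X @ beta                  # noiseless mean response
    theta = np.linalg.solve(W.T @ W + lam * D, W.T @ EY)
    return theta[0] - alpha

bias_plain = alpha_bias(Z[:, None])            # ridge on (Z, X): biased
Zhat = X @ np.linalg.lstsq(X, Z, rcond=None)[0]
bias_aug = alpha_bias(np.column_stack([Z, Zhat]))   # append Zhat: bias vanishes
print(bias_plain, bias_aug)
```

The mechanism is that the OLS residual Z − Ẑ is exactly orthogonal to the columns of X, so once Ẑ is in the unpenalized span, regressing X on (Z, Ẑ) puts zero weight on Z and the first row in the bias formula is identically zero.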
Example of problem two

ps-BART fixes the severe bias under strong confounding, but offers no direct control over the prior on treatment effects. Individual effect estimates are quite variable: stronger regularization would likely improve average estimation error, but how to impose it?

[Figure 3: ps-BART (in pink) gives widely variable individual treatment effect estimates, even when the true effect is homogeneous (here it is α = 0.25). This is a single data set with 250 individuals. Our new approach is shown in gray.]
The fix: Bayesian causal forests

In order to regularize treatment effects directly, our new model is

    f(x_i, z_i) = m(x_i, π̂(x_i)) + α(x_i, π̂(x_i)) z_i,

where m and α are given independent BART priors. Now the treatment effect is

    E(Y_i | x_i, Z_i = 1) − E(Y_i | x_i, Z_i = 0) = {m(x_i, π̂) + α(x_i, π̂)} − m(x_i, π̂) = α(x_i, π̂),

and we can shrink towards homogeneity with stronger regularization on α than on m. In fact, we can directly set the prior probability of homogeneity. Even setting it to 1% corresponds to much more aggressive regularization than the default BART prior.
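The role of the separate prior on α can be illustrated with a linear-basis stand-in for the two BART priors. This is a hypothetical toy, not the actual model: a fixed polynomial basis B(x) replaces the trees, ridge penalties replace the BART priors, and π̂ is omitted. Writing α(x) = a₀ + B(x)c with a₀ unpenalized, a heavy penalty on c shrinks the fit toward a homogeneous effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.standard_normal((n, 2))
mu = np.where(x[:, 0] < x[:, 1], 1.0, -1.0)
z = (rng.random(n) < np.where(mu > 0, 0.8413, 0.1587)).astype(float)
y = mu + 0.25 * z + 0.7 * rng.standard_normal(n)   # homogeneous effect 0.25

# Polynomial basis standing in for the trees' (x, pihat) inputs.
B = np.column_stack([x, x**2, x[:, :1] * x[:, 1:]])

def fit_alpha(lam_alpha, lam_m=0.1):
    """Fit f(x, z) = m(x) + alpha(x) z with m(x) = m0 + B a, alpha(x) = a0 + B c.
    A heavier ridge penalty lam_alpha on c shrinks alpha(x) toward homogeneity."""
    ones = np.ones((n, 1))
    W = np.column_stack([ones, B, z[:, None], B * z[:, None]])
    pen = np.concatenate([[0.0], lam_m * np.ones(B.shape[1]),
                          [0.0], lam_alpha * np.ones(B.shape[1])])
    theta = np.linalg.solve(W.T @ W + np.diag(pen), W.T @ y)
    a0, c = theta[1 + B.shape[1]], theta[2 + B.shape[1]:]
    return a0 + B @ c                              # alpha(x_i) for each i

spread_weak = fit_alpha(lam_alpha=0.1).std()
spread_strong = fit_alpha(lam_alpha=1e4).std()
print(spread_weak, spread_strong)
```

Separating m from α is exactly what makes this possible: the heterogeneity of the effect gets its own set of coefficients, and hence its own penalty, rather than being entangled with the prognostic surface.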
Heterogeneous effect example

Consider the previous data-generating process, but let the treatment effect vary with an observable covariate x_3 ~ N(0, 1):

    α_i = 1(x_i3 > 1/4) + (1/4)·1(x_i3 > 1/2) + (1/2)·1(x_i3 > 3/4),

so α_i ∈ {0, 1, 1.25, 1.75} according to the level of x_i3. We consider a smaller sample size here, n = 250.

Table 2: BART vs. ps-BART vs. BCF on the heterogeneous effect vector α(x) over 250 replicates.

    Prior     Coverage of ATE   Ave. RMSE ATE   Ave. RMSE ITE
    BART      3%                0.53            0.63
    ps-BART   96%               0.11            0.34
    BCF       94%               0.10            0.25
Notable related approaches

- van der Laan, M. J. (2010). Targeted maximum likelihood based causal inference. The International Journal of Biostatistics.
- McCaffrey, D. F., Griffin, B. A., Almirall, D., Slaughter, M. E., Ramchand, R. and Burgette, L. F. (2013). A tutorial on propensity score estimation for multiple treatments using generalized boosted models. Statistics in Medicine.
- Zigler, C. M. and Dominici, F. (2014). Uncertainty in propensity score estimation: Bayesian methods for variable selection and model-averaged causal effects. Journal of the American Statistical Association.
- Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C. et al. (2016). Double machine learning for treatment and causal parameters.
- Wager, S. and Athey, S. (2017). Estimation and inference of heterogeneous treatment effects using gradient forests.
2017 ACIC Data Analysis Challenge

Treatment-response pairs were simulated according to 32 distinct data-generating processes (DGPs), given fixed covariates (n = 4,302, p = 58) from an empirical study.

We varied three parameters, each at two levels:

- High or Low noise level,
- Strong or Weak confounding,
- Small or Large effect size.

The error distributions were one of four types:

- Additive, homoskedastic, independent,
- Nonadditive, homoskedastic, independent,
- Additive, heteroskedastic, independent.

To assess coverage, 250 replicate data sets were generated for each DGP.
Results: Inference for ATE on homoskedastic DGPs
Results: Estimation for ATE on homoskedastic DGPs
Results: Inference for ITE on homoskedastic DGPs
Results: Inference for ATE on easy DGPs
Results: Estimation for ATE on easy DGPs
Results: Inference for ITE on easy DGPs
Results: Inference for ATE on difficult DGPs
Results: Estimation for ATE on difficult DGPs
Results: Inference for ITE on difficult DGPs
Results: Inference for ATE on heteroskedastic DGPs
Results: Estimation for ATE on heteroskedastic DGPs
Results: Inference for ITE on heteroskedastic DGPs
1987 National Medical Expenditure Survey (NMES)

What is the effect of smoking on medical expenditures?

- Outcome variable Y: medical expenses (verified, log-transformed).
- Treatment variable Z: indicates heavy smoking (> 1/2 pack per day).
- n = 7.7k (complete-case analysis for Y > 0).

Covariates include:

- age: age in years at the time of the survey
- smoke age: age in years when the individual started smoking
- gender: male or female
- race: other, black or white
- marriage status: married, widowed, divorced, separated, never married
- education level: college graduate, some college, high school graduate, other
- census region: Northeast, Midwest, South, West
- poverty status: poor, near poor, low income, middle income, high income
- seat belt: does the respondent regularly use a seat belt when in a car
Prediction and ITE estimates: BART vs. BCF

[Figure: Scatterplots of BCF against vanilla BART. Left panel: in-sample predictions (both axes roughly 5.0 to 8.0). Right panel: individual treatment effect estimates; the BCF axis spans roughly 0.05 to 0.20 while the vanilla BART axis spans roughly −0.2 to 0.3.]
Posterior of ATE

[Figure: Posterior histogram of the average treatment effect, with the horizontal axis spanning roughly 0.00 to 0.12.]
Subgroup inference

[Figure: Individual treatment effect estimates (vertical axis, roughly −0.4 to 0.4) plotted against individual index (0 to 8000).]
Subgroup inference

CART (applied to the posterior) suggests that less educated (high school graduate), married individuals with a relatively high propensity for smoking (π̂(x_i) > 0.63) have a higher treatment effect than everyone else.

[Figure: Posterior density for the subgroup effect, with the horizontal axis spanning roughly −0.1 to 0.3.]

The corresponding subgroup ATE is indeed significant.
Takeaways

- BART is an impressive response surface method for causal inference; our new BCF model improves on BART in key respects.
- Regularization-induced confounding can adversely bias treatment effect estimates.
- Explicitly modeling selection allows regularization to be imposed robustly and directly on the estimand of interest.
- Expect an R package soon. The paper is up on my web page; check it out!
Extensions

- Interesting connections to covariate-dependent g-priors.
- Incorporating uncertainty in the estimated propensity score?
- Applications to mediation analysis.
- Model improvements, including heteroskedasticity and more aggressive shrinkage priors (when many garbage variables are plausible).

Thank you for your time.