Bayesian causal forests: dealing with regularization induced confounding and shrinking towards homogeneous effects

P. Richard Hahn, Jared Murray, and Carlos Carvalho

July 29, 2018
Regularization induced confounding

Suppose the treatment effect is homogeneous and the response and treatment models are both linear:

Y_i = τ Z_i + βᵗx_i + ε_i,
Z_i = γᵗx_i + ν_i.

The bias of the treatment effect estimator τ̂_rr ≡ E(τ | Y, z, X) is the corresponding element of

bias(θ̂_rr) = −(M + XᵗX)⁻¹ M θ,    (1)

where the bias expectation is taken over Y, conditional on X and all model parameters.
Regularization induced confounding

Let the prior precision be

M = [ 0   0
      0  I_p ],

which gives a ridge prior on the control variables and a non-informative flat prior on the first element (τ, the treatment effect). Then

bias(τ̂_rr) = ((zᵗz)⁻¹zᵗX) (I + Xᵗ(X − X̂_z))⁻¹ β,

where X̂_z = z(zᵗz)⁻¹zᵗX.
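The two bias expressions can be checked against each other numerically. The sketch below (illustrative parameter values, numpy only) compares the top-row formula against the exact expectation −(M + WᵗW)⁻¹Mθ of the posterior mean, with W = (z X):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
beta = np.ones(p)
gamma = 0.5 * np.ones(p)                   # targeted selection: gamma parallel to beta
tau = 1.0

X = rng.normal(size=(n, p))
z = X @ gamma + 0.1 * rng.normal(size=n)   # small Var(nu): strong selection

# Top-row bias formula under prior precision M = diag(0, I_p)
Xhat_z = np.outer(z, z @ X) / (z @ z)      # z (z'z)^{-1} z'X
top_row = (z @ X) / (z @ z)                # (z'z)^{-1} z'X
bias_theory = top_row @ np.linalg.solve(np.eye(p) + X.T @ (X - Xhat_z), beta)

# Exact bias of the posterior mean: -(M + W'W)^{-1} M theta, with W = (z X)
W = np.column_stack([z, X])
M = np.eye(p + 1)
M[0, 0] = 0.0                              # flat prior on tau
theta = np.concatenate([[tau], beta])
bias_exact = -np.linalg.solve(M + W.T @ W, M @ theta)

print(bias_theory, bias_exact[0])          # the two expressions agree
```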
Targeted selection

Three components dictate the degree of RIC:
1. the coefficients defining the propensity function, E(Z | x) = γᵗx,
2. the coefficients defining the prognostic function, E(Y | Z = 0, X = x) = βᵗx,
3. the strength of the selection, as measured by Var(Z | x) = Var(ν).

These are not in the analyst's control.
Targeted selection

Consider the identity

E(Y | x, Z) = (τ + b)Z + (β − bγ)ᵗx − b(Z − γᵗx)
            = τ̂Z + β̂ᵗx − ε̂.

If β̂ = (β − bγ) has higher prior probability than β, and Var(ε̂) = b²Var(ν) is small relative to σ², then τ will be biased toward τ̂ = τ + b. The bias is largest when:
- confounding is strong: b²Var(ν) is smallest when Var(ν) is small,
- selection is targeted: for shrinkage priors on β, the (β − bγ) term is most favorable with respect to the prior when the vectors β and γ point in the same direction!
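The identity above can be verified numerically for arbitrary values of (τ, b, β, γ); a quick check:

```python
import numpy as np

rng = np.random.default_rng(4)
p = 4
beta, gamma, x = rng.normal(size=(3, p))   # arbitrary coefficient and covariate vectors
tau, b, Z = 1.0, 0.7, rng.normal()

# Original parametrization vs. the reparametrized right-hand side
lhs = tau * Z + beta @ x
rhs = (tau + b) * Z + (beta - b * gamma) @ x - b * (Z - gamma @ x)
print(np.isclose(lhs, rhs))   # True: the two parametrizations are observationally identical
```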
De-bias with a propensity score estimate

Estimate the propensity function, ẑ_i = γ̂ᵗx_i, and include both z and ẑ in the regression. The design matrix becomes X̃ = (z ẑ X). Plugging into our previous bias expression gives

bias(τ̂_rr) = {(z̃ᵗz̃)⁻¹z̃ᵗX}₁ (I + Xᵗ(X − X̂_z̃))⁻¹ β = 0,

where z̃ = (z ẑ) and {(z̃ᵗz̃)⁻¹z̃ᵗX}₁ denotes the top row of (z̃ᵗz̃)⁻¹z̃ᵗX.
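A numerical sketch of this de-biasing step (illustrative values; γ̂ is estimated by least squares): adding the fitted ẑ as an extra column with a flat prior drives the bias of τ̂ to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
beta, gamma, tau = np.ones(p), 0.5 * np.ones(p), 1.0
X = rng.normal(size=(n, p))
z = X @ gamma + 0.1 * rng.normal(size=n)

gamma_hat, *_ = np.linalg.lstsq(X, z, rcond=None)
zhat = X @ gamma_hat                         # estimated propensity function

# Augmented design (z, zhat, X): flat prior on (tau, coef of zhat), ridge on beta
W = np.column_stack([z, zhat, X])
M = np.eye(p + 2)
M[0, 0] = M[1, 1] = 0.0
theta = np.concatenate([[tau, 0.0], beta])   # the true coefficient on zhat is 0

# Bias of the posterior mean: -(M + W'W)^{-1} M theta
bias = -np.linalg.solve(M + W.T @ W, M @ theta)
print(abs(bias[0]))                          # ~0: tau is de-biased
```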
The nonlinear case

Suppose that Y is a continuous biometric measure of heart distress, Z is an indicator for having received a heart medication, and x_1 and x_2 are systolic and diastolic blood pressure (in standardized units). Suppose that the difference between these two measurements is prognostic of high distress levels, with x_1 − x_2 > 0 being a critical threshold. Prescribers target the drug toward patients with high risk, so the probability of receiving the drug is an increasing function of µ.
Nonlinear targeted selection

[Figure]
RIC with BART

We simulated 200 datasets of size n = 250 according to this data generating process with τ = 1.

Prior   bias   coverage   rmse
BART    0.27   65%        0.31
BCF     0.14   95%        0.21

BART gives clearly biased inference. Why?
RIC with BART

Strong confounding and targeted selection imply that µ is approximately a monotone function of π alone. However, π (and hence µ) is difficult to learn via regression trees: it takes many axis-aligned splits to approximate the shelf across the diagonal.

[Figure: the partition of the (x_1, x_2) plane induced by many axis-aligned splits approximating the diagonal shelf]

Meanwhile, a single split on Z can stand in for the many splits on x_1 and x_2 that would be required to approximate µ(x).
Regularizing heterogeneous effects

Two common strategies:
- treat z as just another covariate and specify a prior on f(x_i, z_i) (Hill, 2011);
- fit entirely separate models to the treatment and control data, (Y | Z = z, x) ~ N(f_z(x_i), σ²_z), with independent priors over the parameters in each.

The model we propose is

f(x_i, z_i) = µ(x_i, π̂_i) + τ(x_i) z_i,

where π̂_i is an estimate of the propensity score. This splits the difference, compromising between the two approaches.
Give τ(x) its own prior

By analogy, consider a two-group difference-in-means problem:

Y_i1 ~ iid N(µ_1, σ²),   Y_j2 ~ iid N(µ_2, σ²).

If the estimand of interest is µ_1 − µ_2, the implied prior over this quantity has variance strictly greater than the variances over µ_1 or µ_2 individually (under independent priors, the variances add). Instead, if we know that µ_1 ≈ µ_2, it makes more sense to parametrize as

Y_i1 ~ iid N(µ + τ, σ²),   Y_j2 ~ iid N(µ, σ²).

Now τ can be given an informative prior centered at zero and µ can be given a very vague prior.
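A quick Monte Carlo illustration of the inflated implied prior (assuming, for illustration only, independent unit-variance priors on µ_1 and µ_2):

```python
import numpy as np

rng = np.random.default_rng(2)
s = 1.0                                    # common prior sd for mu_1 and mu_2
mu1 = rng.normal(0.0, s, size=100_000)     # independent prior draws
mu2 = rng.normal(0.0, s, size=100_000)
var_diff = np.var(mu1 - mu2)
print(var_diff)                            # ~2 s^2: the variances add under independence

# Reparametrized version: mu gets a vague prior, tau an informative one;
# now the prior on the group difference is exactly the prior on tau, e.g. N(0, 0.1^2).
```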
Simulation study

We toggle three two-level settings:
- homogeneous versus heterogeneous treatment effects,
- a linear versus nonlinear conditional expectation function, and
- two different sample sizes (n = 250 and n = 500).

Five variables comprise x; the first three are continuous, drawn as standard normal random variables, the fourth is a dichotomous variable, and the fifth is unordered categorical, taking three levels (denoted 1, 2, 3).
Simulation study

The treatment effect is either

τ(x) = 3 (homogeneous) or τ(x) = 1 + 2 x_2 x_5 (heterogeneous),

the prognostic function is either

µ(x) = 1 + g(x_4) + x_1 x_3 (linear) or µ(x) = −6 + g(x_4) + 6 |x_3 − 1| (nonlinear),

where g(1) = 2, g(2) = −1, and g(3) = −4, and the propensity function is

π(x_i) = 0.8 Φ(3 µ(x_i)/s − 0.5 x_1) + 0.05 + u_i/10,

where s is the standard deviation of µ taken over the observed sample and u_i ~ Uniform(0, 1).
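This DGP can be sketched directly. A minimal sketch, with two caveats: the extraction garbles which covariate index g applies to, so (since g is defined on three levels) it is applied here to the three-level categorical covariate, and the heterogeneous effect treats that covariate's level as numeric, following the formulas above; all names are illustrative.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))  # standard normal CDF

def simulate(n=250, heterogeneous=True, nonlinear=True, sigma=1.0):
    x1, x2, x3 = rng.normal(size=(3, n))
    x4 = rng.integers(0, 2, size=n)        # dichotomous covariate
    x5 = rng.integers(1, 4, size=n)        # unordered categorical, levels 1, 2, 3
    g = np.select([x5 == 1, x5 == 2, x5 == 3], [2.0, -1.0, -4.0])  # g on the 3-level covariate

    mu = -6.0 + g + 6.0 * np.abs(x3 - 1.0) if nonlinear else 1.0 + g + x1 * x3
    tau = 1.0 + 2.0 * x2 * x5 if heterogeneous else np.full(n, 3.0)

    s = mu.std()                           # sd of mu over the observed sample
    pi = 0.8 * Phi(3.0 * mu / s - 0.5 * x1) + 0.05 + rng.uniform(size=n) / 10.0
    z = rng.binomial(1, pi)                # targeted treatment assignment
    y = mu + tau * z + sigma * rng.normal(size=n)
    return y, z, np.column_stack([x1, x2, x3, x4, x5]), tau

y, z, X, tau = simulate()
print(X.shape, z.mean())
```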
Simulation study

To evaluate each method we consider three criteria, applied to two different estimands. First, we consider how each method does at estimating the (sample) average treatment effect (ATE), according to root mean square error, coverage, and average interval length. Then we apply the same criteria to estimates of the conditional average treatment effect (CATE), averaged over the sample. Results are based on 200 independent replications for each DGP.
Simulation results

The important trends are as follows:
- BCF and ps-BART both benefit dramatically from explicitly protecting against RIC;
- BART-(f_0, f_1) and causal random forests both exhibit subpar performance in this simulation;
- all methods improve with a larger sample;
- BCF priors are especially helpful at the smaller sample size (when estimation is more difficult);
- the linear model dominates when correct, but fares extremely poorly when wrong;
- BCF's improvements over ps-BART are more pronounced in the nonlinear DGP;
- BCF's average interval length is notably smaller than the ps-BART interval, usually (but not always) with comparable coverage.
Simulation results

                     Homogeneous effect                  Heterogeneous effects
                     ATE              CATE               ATE              CATE
  n  Method          rmse cover len   rmse cover len     rmse cover len   rmse cover len
 250 BCF             0.21 0.92  0.91  0.48 0.96  2.0     0.27 0.84  0.99  1.09 0.91  3.3
     ps-BART         0.22 0.94  0.97  0.44 0.99  2.3     0.31 0.90  1.13  1.30 0.89  3.5
     BART            0.34 0.73  0.94  0.54 0.95  2.3     0.45 0.65  1.10  1.36 0.87  3.4
     BART (f_0,f_1)  0.56 0.41  0.99  0.92 0.93  3.4     0.61 0.44  1.14  1.47 0.90  4.5
     Causal RF       0.34 0.73  0.98  0.47 0.84  1.3     0.49 0.68  1.25  1.58 0.68  2.4
     LM + HS         0.14 0.96  0.83  0.26 0.99  1.7     0.17 0.94  0.89  0.33 0.99  1.9
 500 BCF             0.16 0.88  0.60  0.38 0.95  1.4     0.16 0.90  0.64  0.79 0.89  2.4
     ps-BART         0.18 0.86  0.63  0.35 0.99  1.8     0.16 0.90  0.69  0.86 0.95  2.8
     BART            0.27 0.61  0.61  0.42 0.95  1.8     0.25 0.76  0.67  0.88 0.94  2.8
     BART (f_0,f_1)  0.47 0.21  0.66  0.80 0.93  3.1     0.42 0.42  0.75  1.16 0.92  3.9
     Causal RF       0.36 0.47  0.69  0.52 0.75  1.2     0.40 0.59  0.88  1.30 0.71  2.1
     LM + HS         0.11 0.96  0.54  0.18 0.99  1.0     0.12 0.93  0.59  0.22 0.98  1.2
Simulation results

                     Homogeneous effect                    Heterogeneous effects
                     ATE               CATE                ATE               CATE
  n  Method          rmse cover  len   rmse cover len      rmse cover  len   rmse cover len
 250 BCF             0.26 0.945  1.3   0.63 0.94  2.5      0.30 0.930  1.4   1.3  0.93  4.5
     ps-BART         0.54 0.780  1.6   1.00 0.96  4.3      0.56 0.805  1.7   1.7  0.91  5.4
     BART            0.84 0.425  1.5   1.20 0.90  4.1      0.84 0.430  1.6   1.8  0.87  5.2
     BART (f_0,f_1)  1.48 0.035  1.5   2.42 0.80  6.4      1.44 0.085  1.6   2.6  0.83  7.1
     Causal RF       0.81 0.425  1.5   0.84 0.70  2.0      1.10 0.305  1.8   1.8  0.66  3.4
     LM + HS         1.77 0.015  1.8   2.13 0.54  4.4      1.65 0.085  1.9   2.2  0.62  4.8
 500 BCF             0.20 0.945  0.97  0.47 0.94  1.9      0.23 0.910  0.97  1.0  0.92  3.4
     ps-BART         0.24 0.910  1.07  0.62 0.99  3.3      0.26 0.890  1.06  1.1  0.95  4.1
     BART            0.31 0.790  1.00  0.63 0.98  3.0      0.33 0.760  1.00  1.1  0.94  3.9
     BART (f_0,f_1)  1.11 0.035  1.18  2.11 0.81  5.8      1.09 0.065  1.17  2.3  0.82  6.2
     Causal RF       0.39 0.650  1.00  0.54 0.87  1.7      0.59 0.515  1.18  1.5  0.73  2.8
     LM + HS         1.76 0.005  1.34  2.19 0.40  3.5      1.71 0.000  1.34  2.2  0.45  3.7
ACIC 2017

[Figure: panels comparing CRF, TL, psb, and BCF on the 2017 ACIC challenge: interval length versus coverage for the CATE and the ATT, and rmse for the CATEs and the ATT]
Papers

RIC in the linear model is discussed in:
- "Regularization and confounding in linear regression for treatment effect estimation." Hahn, Carvalho, Puelz, and He. Bayesian Analysis (2018).

The Bayesian causal forest model is developed in:
- "Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects." Hahn, Murray, and Carvalho. In review after revision at JASA.

An exciting applied paper using these ideas is:
- "Where Does a Scalable Growth-Mindset Intervention Improve Adolescents' Educational Trajectories?" Under revision at Nature.
Most importantly: code

The R package bcf went live just today. Give it a try. Thanks for your time.
Our setting

We'll assume:
- observational data (not from randomized experiments),
- conditional unconfoundedness/ignorability (we've measured all the factors causally influencing treatment and response),
- covariate-dependent treatment effects (individuals can have different responses to treatment according to their covariates),
- binary treatments.
Our assumptions, more formally

Strong ignorability: (Y_i(0), Y_i(1)) ⊥ Z_i | X_i = x_i,
Positivity: 0 < Pr(Z_i = 1 | X_i = x_i) < 1 for all i.

Therefore E(Y_i(z) | x_i) = E(Y_i | x_i, Z_i = z), so the conditional average treatment effect (CATE) is

α(x_i) := E(Y_i(1) − Y_i(0) | x_i) = E(Y_i | x_i, Z_i = 1) − E(Y_i | x_i, Z_i = 0).
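Under these two assumptions the CATE is identified by conditional group means; a small plug-in illustration with one binary covariate (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
x = rng.integers(0, 2, size=n)
pi = np.where(x == 1, 0.8, 0.3)            # positivity: 0 < Pr(Z=1|x) < 1
z = rng.binomial(1, pi)                    # treatment depends only on x (ignorability)
alpha_true = np.where(x == 1, 2.0, -1.0)   # true CATE alpha(x)
y = 0.5 * x + alpha_true * z + rng.normal(size=n)

# Plug-in estimate: E(Y | x, Z=1) - E(Y | x, Z=0), by group means
ests = {}
for xv in (0, 1):
    ests[xv] = y[(x == xv) & (z == 1)].mean() - y[(x == xv) & (z == 0)].mean()
print(ests)                                # close to {0: -1.0, 1: 2.0}
```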
Modeling assumptions

We write E(Y_i | x_i, z_i) = f(x_i, z_i), so that

α(x_i) := f(x_i, 1) − f(x_i, 0).

We assume iid Gaussian errors:

Y_i = f(x_i, z_i) + ε_i,   ε_i ~ N(0, σ²).

NB: strong ignorability means ε_i ⊥ Z_i | x_i. What prior on f?
Regression Trees

[Figure: an example tree T_h, with internal decision rules x_1 < c and x_3 < d leading to leaves µ_h1, µ_h2, µ_h3, alongside the corresponding rectangular partition of the (x_1, x_3) plane]

Leaf/end-node parameters: M_h = (µ_h1, µ_h2, µ_h3)
g(x, T_h, M_h) = µ_ht if x ∈ A_ht (for 1 ≤ t ≤ b_h)
Partition: A_h = {A_h1, A_h2, A_h3}
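A tree of this kind can be evaluated in a few lines; a sketch in which the cutpoints c, d and the leaf values are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class Node:
    var: int = None          # index of the splitting variable (internal node)
    cut: float = None        # split threshold
    left: "Node" = None      # branch taken when x[var] < cut
    right: "Node" = None
    mu: float = None         # leaf parameter mu_ht (leaf node)

def g(x, tree):
    """Return the leaf parameter mu_ht of the cell A_ht containing x."""
    node = tree
    while node.mu is None:
        node = node.left if x[node.var] < node.cut else node.right
    return node.mu

# The tree from the slide: split on x_1 < c at the root, then x_3 < d
c, d = 0.0, 0.5
tree = Node(var=0, cut=c,
            left=Node(mu=1.0),                   # mu_h1: x_1 < c
            right=Node(var=2, cut=d,
                       left=Node(mu=2.0),        # mu_h2: x_1 >= c, x_3 < d
                       right=Node(mu=3.0)))      # mu_h3: x_1 >= c, x_3 >= d

print(g([-1.0, 0.0, 0.0], tree))   # 1.0
print(g([1.0, 0.0, 2.0], tree))    # 3.0
```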
Bayesian Additive Regression Trees (BART)

Bayesian additive regression trees (Chipman, George, & McCulloch, 2008):

y_i = f(x_i, z_i) + ε_i,   ε_i ~ N(0, σ²),

f(x, z) = Σ_{h=1}^m g(x, z, T_h, M_h).

Hill (2011) proposes adopting Bayesian additive regression trees (BART) for causal inference.
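The sum-of-trees form is just an ensemble of step functions; a toy sketch in which the trees are hand-written rather than drawn from the BART sampler:

```python
# Each "tree" maps (x, z) to the parameter of the leaf containing it.
def tree1(x, z):
    return 0.5 if x[0] < 0.0 else -0.5     # a single split on x_1

def tree2(x, z):
    return 1.0 if z == 1 else 0.0          # a single split on the treatment z

def f(x, z, trees):
    """Sum-of-trees fit: f(x, z) = sum_h g(x, z, T_h, M_h)."""
    return sum(g(x, z) for g in trees)

print(f([-1.0, 0.2], 1, [tree1, tree2]))   # 0.5 + 1.0 = 1.5
```

Note that tree2 illustrates the earlier point about RIC: a single split on Z can absorb variation that would otherwise require many splits on x.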
2017 ACIC Data Analysis Challenge

Treatment-response pairs were simulated according to 32 distinct data generating processes (DGPs), given fixed covariates (n = 4,302, p = 58) from an empirical study. We varied three parameters among two levels:
- high or low noise level,
- strong or weak confounding,
- small or large effect size.

The error distributions were one of four types:
- additive, homoskedastic, independent,
- nonadditive, homoskedastic, independent,
- additive, heteroskedastic, independent.

To assess coverage, 250 replicate data sets were generated for each DGP.