Bayesian Causal Forests

Size: px

Start display at page:

Download "Bayesian Causal Forests"

Edwin Harvey
5 years ago
Views:

1 Bayesian Causal Forests for estimating personalized treatment effects P. Richard Hahn, Jared Murray, and Carlos Carvalho December 8, 2016

2 Overview We consider observational data, assuming conditional unconfoundedness, and are interested in individual treatment effects. Three key statistical concepts will be: prediction, regularization, and selection. Our regression model will take the novel form D i = g(x i ) + ɛ i, k Y i = m(x i ) + b j (x)d j i + ν i. j=1 where g, m and b are represented as regression forests. 1

3 Accurate prediction is necessary but not sufficient Higher predictive accuracy is associated with more reliable information about the underlying data mechanism. Weak predictive accuracy can lead to questionable conclusions. Algorithmic models can give better predictive accuracy than data models, and provide better information about the underlying mechanism. Leo Breiman Prediction by itself is only occasionally sufficient.the post office is happy with any method that predicts correct addresses from hand-written scrawls...[but] most statistical surveys have the identification of causal factors as their ultimate goal. Bradley Efron 2

4 Flexible models are necessary Taken from OpenIntro Statistics 2nd Edition. 3

5 Flexible models aren t sufficient An expensive car doesn t cause good teeth. 4

6 Concrete example Suppose we re interested in the treatment effect of dietary kale intake. We want to know how effective it is at lowering cholesterol, which is our outcome variable. Unfortunately, we have only observational data (i.e., not a randomized study). 5

7 Kale intake predicts exercise Only gym-rats seem to eat much kale and exercise is known to lower cholesterol. Y i = β 0 + αd i + ε i, Because cov(d i, ε i ) 0, we can write Y i = β 0 + αd i + ωd i + ε. Since cov(d i, ε i ) = 0, we mis-estimate α as α + ω. But if we include exercise in the model, Conditional on x i, cov(d i, ε i ) = 0. Y i = β 0 + αd i + βx i + ε i. 6

8 Unmeasured confounding and control variables Randomized experiments are the special case where prediction and causation are equivalent. ɛ ν ɛ D Y D Y x Holding x constant, one has experimental variation from which to infer causal impacts. (We have to assume Y doesn t cause D.) 7

9 Regularization Y i = β 0 + αd i + βx i + ε i. a flat prior on the treatment effect: (α, σ 2 ε) 1/σ ε, shrinkage prior on β (e.g., the horseshoe prior). Now turn the Bayesian crank... 8

10 Something goes wrong GoodBayes NaiveBayes OLS Oracle It turns out that this obvious approach is really bad at getting reasonable estimates of the treatment effect α. 9

11 Regularization-induced confounding The bias of the treatment parameter under ridge regression is: bias(ˆα rr ) = ( (D t D) 1 D t X ) (I p + X t (X ˆX D )) 1 β. This is a function of all the other unknown regression parameters. 10

12 Our reparametrization: a latent error approach We reparametrize as α α β + αγ β d. γ β c which gives the new equations Selection Eq.: D = x t β c + ɛ, ɛ N(0, σ 2 ɛ), Response Eq.: Y = α(d x t β c ) + x t β d + ν, ν N(0, σ 2 ν). We may now shrink β d and β c with impunity. 11

13 Control of bias The expression for ridge treatment effect bias is now: bias(ˆα) = ( (R t R) 1 R t X ) (I p + X t (X ˆX R )) 1 β d. where R = D x t β c. By construction, (R t R) 1 R t X will be close to the zero vector, because R is independent of x. 12

14 ρ 2 Bias Coverage I.L. MSE 0.1 New Approach OLS Naive Regularization Oracle OLS New Approach OLS Naive Regularization Oracle OLS New Approach OLS Naive Regularization Oracle OLS New Approach OLS -8e Naive Regularization Oracle OLS New Approach OLS Naive Regularization Oracle OLS Table 1: n = 50, p = 30, k = 3. κ 2 = φ 2 = σ 2 ν =

15 Nonlinear regressions for deconfounding Consider the nonlinear model Y i = f (x i, D i ) + ε i. Now our de-confoundedness condition is stronger D i ε i x i. And our causal estimand is more general α(x i, D i, D i ) = E(Y i x i, D i ) E(Y i x i, D i ), = f (x i, D i ) f (x i, D i ). 14

16 Regression Trees no x 1 < c Tree T h yes g(x, T h, M h ) µ h2 µ h1 x 3 < d no yes x 3 d µ h3 µ h1 µ h2 µ h3 x 1 c Leaf/End node parameters M h = (µ h1, µ h2, µ h3 ) Partition A h = {A h1, A h2, A h3 } g(x, T h, M h ) = µ ht if x A ht (for 1 t b h ). 15

17 Bayesian Additive Regression Trees (BART) Bayesian additive regression trees (BART) (Chipman, George and McCulloch, 2010): y i = f (x i ) + ɛ i, ɛ i N(0, σ 2 ) m f (x) = g(x, T h, M h ) h=1 g are basis functions determined by a binary tree T h and vector of parameters M h. These models were specifically advocated for causal inference in Hill (2012). 16

18 BART exhibits regularization-induced confounding Estimate EYhat4 0 0 EYhat New Causal FLM 2 State of the Art (Hill) EY EY Treatment effect Figure 1: Left: BART predictions. Middle: BCF predictions. Right: treatment effect estimates; BART (blue) and BCF (orange). 17

19 Taylor expanding f (x, D) We expand f (x, D) about the point D = E(D x) = g(x): f (j) (x, g(x))(d g(x)) j f (x, D) = j! j=0 where f (j) denotes the jth derivative of f with respect to D, and reparametrize as f (x, D) = f (x, g(x)) + = m(x, g(x)) + k b j (x, g(x))(d g(x)) j j=1 k b j (x, g(x))(d g(x)) j. j=1 18

20 Taylor Shifts One can transform back via b j (x) = j h k for 0 j k and b 0 (x) := m(x, g(x)). In this parametrization ( ) h b h (x, g(x))g(x) h j. j α(x i, D i, D i ) = f (x i, D i ) f (x i, D i ), k k = b j (x)d j b j (x)d j, = j=0 j=0 k b j (x) ( D j D j). j=0 This suggests k j=0 b j (x) is a nice summary of the causal effect at x. 19

21 Bayesian causal forests Take k <, give g, m and b independent BART priors and assume normal additive errors: D i = g(x i ) + ɛ i, Y i = m(x i, g(x i )) + k b j (x, g(x))(d i g(x i )) j + ν i. j=1 This model allows for the various roles of control variables to be regularized independently: g(x) represents selection, m(x, g(x)) represents direct association, and b j (x, g(x)) represents nonlinearity as it varies over j, b j (x, g(x)) represents heterogeneous effects as it varies in x. Confusion of these roles is confounding. 20

22 Advantages of BCF The analogies with the linear case are clear, D i = g(x i ) + ɛ i, Y i = m(x i, g(x i )) + Advantages are that k b j (x, g(x i ))(D i g(x i )) j + ν i. j=1 Independent regularization of g, m, and b. easy summarization of treatment effects, transparent interpolation/extrapolation. By comparison, BART s regularization is implicit, treatment effects are difficult to extract, and interpolation/extrapolation is poor. 21

23 1987 National Medical Expenditure Survey (NMES) What is the effect of smoking on medical expenditures? outcome variable Y is medical expenses (verified, log transformed), treatment variable D is cigarettes smoked per day, n = 7.7k complete-case analysis for Y > 0, covariates include age at time of the survey (19-94), age started smoking, gender (male, female), race (hispanic, black, other), marriage status (married, widowed, divorced, separated, never married), education level (college graduate, some college, high school graduate, other), census region (Northeast, Midwest, South, West), poverty status (poor, near poor, low income, middle income, high income), and seat belt usage (rarely, sometimes, always/almost always) 22

24 Average treatment effect posterior Frequency Average Treatment effect Figure 2: The average treatment effect is statistically significant, but small. 23

25 Individual treatment effect estimates bhat[ind] Index Figure 3: precision. Individual level treatment effects are difficult to estimate with any 24

26 Seeing the forests through a tree Fitting a single regression tree to posterior summaries of g, m, and k j=0 b j (x) allows us to find interesting sub-groups for follow-up study. Save posterior samples of each subject-specific parameter, compute ĝ, ˆm, and b j, fit single regression trees (with little or no regularization) with these summaries as the outcome, Leaf nodes define subgroups for further examination (such as subgroup average treatment effects). 25

27 Tree fit to π(g i ) yes % RACE3 < 2.5 no % AGESMOKE >= % MALE < % 2.5 6% % % RACE3 < 1.5 LASTAGE < 32 AGESMOKE >= 18 AGESMOKE >= % % % % 3 19% MALE < 0.5 LASTAGE >= 76 LASTAGE < 28 LASTAGE < 38 LASTAGE < % LASTAGE >= % 2.2 6% 2.4 3% 2.3 2% 2.6 4% 2.3 2% % 2.6 6% % 2.7 5% 2.7 2% % 2.8 6% % Figure 4: Age, gender and race appear to be the dominant determinants of treatment/smoking level. 26

28 Tree fit to π(m i ) yes % LASTAGE < 50 no % MALE >= % LASTAGE < % LASTAGE < % LASTAGE < % 6.4 3% % % 7 15% 7.3 7% Figure 5: Age appears to be the dominant driver of medical expenditures. 27

29 Tree fit to π(b i ) yes % RACE3 < 2.5 no % % educate < 1.5 educate < % educate >= % educate < % % % % % % Figure 6: Race and educational level appear to modulate the treatment effect. 28

30 What about BART? 0.10 bhat mhat + bhat * (z - ghat) BART and BCF give similar predictions bartfit$yhat.train.mean bhat_bart Same predictions, different conclusions. 29

31 Summary Regularization-induced confounding can adversely bias treatment effect estimates. Explicitly modeling de-confoundedness allows regularization to be imposed robustly. The intuitive parametrization makes posterior exploration quite convenient. Bayesian causal forests permit finer control of regularization. Easily modified to handle categorical outcomes and treatments. Future work considers non-normal errors. Thank you for your time. 30

32 Some related work Machine learning for causal inference: Hill, Van der Laan, Athey and Imbens, Athey and Wager, Hansen et al. Earlier work looked at joint modeling: Zigler et al (2013), Whitney and Powell (2003), Robins, Mark and Newey (1992), McCandless, Gustafson and Austin (2009). So-called double-robust estimators: Robins, Hernan, and Wasserman (2015). Our work differs from these works in its explicit focus on regularization and its novel forested-linear-model structure. 31

Bayesian regression tree models for causal inference: regularization, confounding and heterogeneity

Bayesian regression tree models for causal inference: regularization, confounding and heterogeneity P. Richard Hahn, Jared Murray, and Carlos Carvalho June 22, 2017 The problem setting We want to estimate