Bayesian regression tree models for causal inference: regularization, confounding and heterogeneity


Bayesian regression tree models for causal inference: regularization, confounding and heterogeneity P. Richard Hahn, Jared Murray, and Carlos Carvalho June 22, 2017

The problem setting

We want to estimate treatment effects using a regression model, assuming:

observational data (not from an experiment),
conditional unconfoundedness (we've measured everything we need to),
covariate-dependent treatment effects (people can respond differently to treatment according to their covariates),
a binary treatment variable (you either got the drug or you didn't).

More formally

We assume strong ignorability:

Y(0), Y(1) ⊥ Z | X,   0 < Pr(Z_i = 1 | x_i) < 1 for all i.

Therefore

E(Y_i(1) | x_i) = E(Y_i | x_i, Z_i = 1),   E(Y_i(0) | x_i) = E(Y_i | x_i, Z_i = 0),

and the treatment effect is

α(x_i) := E(Y_i | x_i, Z_i = 1) − E(Y_i | x_i, Z_i = 0).

Additive, homoskedastic errors

Here we consider mean-zero, additive error representations

Y_i = f(x_i, z_i) + ε_i,   so that   E(Y_i | x_i, z_i) = f(x_i, z_i).

The treatment effect of setting z_i = 1 versus z_i = 0 is expressed as

α(x_i) := f(x_i, 1) − f(x_i, 0).

nb: In this context, conditional ignorability is often expressed as ε_i ⊥ Z_i | x_i.
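The setup can be made concrete with a small sketch. The response surface and effect function below are made up for illustration (they are not from the talk); the point is only that α(x) is read off as f(x, 1) − f(x, 0):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, z):
    # Hypothetical response surface: a nonlinear prognostic part sin(x)
    # plus a covariate-dependent treatment effect 0.5 + 0.25 * x.
    return np.sin(x) + (0.5 + 0.25 * x) * z

def alpha(x):
    # The treatment effect function: difference of the two arms.
    return f(x, 1) - f(x, 0)

n = 1000
x = rng.normal(size=n)
z = rng.integers(0, 2, size=n)            # binary treatment
eps = rng.normal(scale=0.3, size=n)       # additive, homoskedastic error
y = f(x, z) + eps                         # E(Y | x, z) = f(x, z)

# By construction alpha(x) = 0.5 + 0.25 * x here:
assert np.allclose(alpha(x), 0.5 + 0.25 * x)
```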

Regression trees

[Diagram: a binary tree T_h with splits x_1 < c and x_3 < d, and the corresponding rectangular partition of the (x_1, x_3) plane.]

Leaf/end-node parameters: M_h = (µ_h1, µ_h2, µ_h3). Partition: A_h = {A_h1, A_h2, A_h3}.

g(x, T_h, M_h) = µ_ht if x ∈ A_ht (for 1 ≤ t ≤ b_h).
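A single tree basis function g(x, T_h, M_h) is just a lookup of the leaf parameter for the partition cell containing x. A minimal sketch, with made-up split points c, d and leaf values:

```python
# One tree: root split "x1 < c", then "x3 < d" on the right branch.
# The split points and leaf parameters below are hypothetical.
c, d = 0.0, 0.5
M_h = {"A_h1": -1.0, "A_h2": 0.3, "A_h3": 1.2}   # leaf parameters mu_ht

def g(x):
    """Evaluate the tree at a point x (a dict of covariate values):
    return the leaf parameter of the cell A_ht containing x."""
    if x["x1"] < c:
        return M_h["A_h1"]
    elif x["x3"] < d:
        return M_h["A_h2"]
    else:
        return M_h["A_h3"]
```

Each tree thus defines a piecewise-constant function over an axis-aligned rectangular partition of covariate space.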

Bayesian Additive Regression Trees (BART)

Bayesian additive regression trees (BART) (Chipman, George and McCulloch, 2010):

y_i = f(x_i) + ε_i,   ε_i ~ N(0, σ²),

f(x) = Σ_{h=1}^m g(x, T_h, M_h),

where the g's are basis functions determined by a binary tree T_h and a vector of leaf parameters M_h. Hill (2011) specifically proposes BART for causal inference: in several simulation studies it works really, really well.
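The additive structure f(x) = Σ_h g(x, T_h, M_h) can be sketched with stump "trees" (one split each, random made-up split points and small leaf values, mimicking BART's shrinkage toward shallow trees). This is only the sum-of-trees representation, not the Bayesian fitting procedure:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 50                                        # number of trees
splits = rng.uniform(-2, 2, size=m)           # one split point per stump
leaves = rng.normal(scale=0.1, size=(m, 2))   # (left, right) leaf values

def f(x):
    """Sum of m one-split trees evaluated at the points in x."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for s, (lo, hi) in zip(splits, leaves):
        out += np.where(x < s, lo, hi)        # g(x, T_h, M_h) for one tree
    return out

y_hat = f(np.linspace(-3, 3, 7))
```

Because each tree contributes only a small step, the sum is a flexible step function; BART puts a prior over the trees and leaf values and averages over the posterior.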

Improving BART for causal inference

BART is excellent for causal inference, but it exhibits undesirable behavior in certain situations.

1. In cases with strong confounding, BART's estimates of the average treatment effect exhibit severe bias.
2. On synthetic data with a known homogeneous effect, the individual effect estimates are over-dispersed.

Our goal is to develop a modified BART model that addresses these two weaknesses.

Example of problem one

Consider a problem with p = 2 and n = 1,000, with homogeneous effects; the true treatment effect is α = 1:

Y_i = µ_i + Z_i + ε_i,   µ_i = 1(x_i1 < x_i2) − 1(x_i1 ≥ x_i2),

P(Z_i = 1 | x_i1, x_i2) = Φ(µ_i),   ε_i ~iid N(0, 0.7²),   x_i1, x_i2 ~iid N(0, 1).

Interpretation: Y is a measure of heart distress, Z indicates taking heart medication, and x_1 and x_2 are blood pressure measurements. This example demonstrates targeted treatment: patients with x_i1 < x_i2 are 5 times as likely to receive the new drug precisely because they are more likely to have higher levels of heart distress.
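This DGP is easy to simulate, and doing so shows both the "5 times as likely" claim (Φ(1)/Φ(−1) ≈ 5.3) and how badly a naive difference in means is confounded:

```python
import numpy as np
from math import erf, sqrt

# Standard normal CDF; since mu_i is +/-1, only Phi(1), Phi(-1) are needed.
Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))

rng = np.random.default_rng(42)
n = 1000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
mu = np.where(x1 < x2, 1.0, -1.0)
z = rng.uniform(size=n) < np.where(mu > 0, Phi(1), Phi(-1))
y = mu + 1.0 * z + rng.normal(scale=0.7, size=n)   # true effect alpha = 1

# Treatment probability ratio between the two groups:
ratio = Phi(1) / Phi(-1)                            # about 5.3

# Naive difference in means overshoots the true effect of 1 badly,
# because the high-distress patients are the ones who get treated:
naive = y[z].mean() - y[~z].mean()
```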

BART shows substantial bias

[Density plot of treatment-effect estimates, roughly 0.8 to 1.2.]

Figure 1: BART (white) misses the truth, by a lot, across 250 simulations. (I will explain the blue and pink in a moment.)

Regularization-induced confounding

Why is BART biased in this example?

µ(x) requires many axis-aligned splits to be approximated by trees;
the response surface can instead be parsimoniously approximated with just one axis-aligned split, in the treatment variable, paired with an overstated treatment effect;
therefore, priors over f that penalize the total number of splits tend to over-attribute changes in E(Y | x, Z) to the treatment effect.

Regularization-induced confounding

[Plot of the (x_1, x_2) plane showing the diagonal decision boundary and an axis-aligned partition approximating it.]

Figure 2: It takes many axis-aligned splits for a tree to fit a steep response gradient along a diagonal. Our example response surface was 1 above the diagonal and −1 below it.

The fix: ps-bart We can fix this by estimating π(x) = P(Z = 1 x) (using BART) and then using ˆπ(x) as an extra predictor variable in a BART model for the response surface. Table 1: The BART prior exhibits substantial bias in estimating the treatment effect. Modified BART priors allowing splits in either the true propensity score (Oracle BART) or an estimated propensity score (ps-bart) perform markedly better. Prior Bias Coverage RMSE BART 0.14 31% 0.15 Oracle BART 0.00 98% 0.05 ps-bart 0.06 85% 0.08 11

R.I.C. in linear regression

Consider the response model

Y_i = β_0 + αZ_i + βᵗx_i + ε_i,   Z_i = γᵗx_i + ν_i.

For a flat prior on α and a ridge prior on β the bias is

bias(α̂_rr) = ((ZᵗZ)⁻¹ZᵗX)(I_p + Xᵗ(X − X̂_Z))⁻¹β.

For Ẑ_i ≈ γᵗx_i, Z̃ = (Z, Ẑ) gives bias

bias(α̂_rr) = {(Z̃ᵗZ̃)⁻¹Z̃ᵗX}_1 (I_p + Xᵗ(X − X̂_Z̃))⁻¹β ≈ 0.
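The two displays can be checked numerically. The sketch below (dimensions, coefficients and penalty are all made up) fits ridge regression with α unpenalized, with and without an extra unpenalized column Ẑ = γᵗx (the oracle value, standing in for an estimated propensity direction); the added column essentially removes the bias:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam, alpha = 100, 5, 200.0, 1.0
beta = np.full(p, 3.0)     # strong confounding signal
gamma = np.ones(p)

def ridge(W, y, penalized):
    """Ridge solution with the penalty applied only where penalized == 1."""
    D = np.diag(penalized.astype(float))
    return np.linalg.solve(W.T @ W + lam * D, W.T @ y)

est_plain, est_aug = [], []
for _ in range(200):
    X = rng.normal(size=(n, p))
    z = X @ gamma + 0.5 * rng.normal(size=n)
    y = alpha * z + X @ beta + rng.normal(size=n)
    zhat = X @ gamma                          # oracle gamma' x
    W1 = np.column_stack([z, X])              # plain: [Z, X]
    W2 = np.column_stack([z, zhat, X])        # augmented: [Z, Zhat, X]
    est_plain.append(ridge(W1, y, np.r_[0, np.ones(p)])[0])
    est_aug.append(ridge(W2, y, np.r_[0, 0, np.ones(p)])[0])

bias_plain = np.mean(est_plain) - alpha       # large positive bias
bias_aug = np.mean(est_aug) - alpha           # roughly zero
```

Shrinking β leaves part of the confounding signal unexplained, and the unpenalized Z column absorbs it; the unpenalized Ẑ column gives that signal somewhere else to go.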

Example of problem two

ps-BART fixes the severe bias under strong confounding, but it offers no direct control over the prior on treatment effects. Individual effect estimates are quite variable: stronger regularization would likely improve average estimation error, but how do we impose it?

[Histogram of individual treatment effect (ITE) estimates, roughly 0.0 to 0.7.]

Figure 3: ps-BART (in pink) gives widely variable individual treatment effect estimates, even when the true effect is homogeneous (here it is α = 0.25). This is a single data set with 250 individuals. Our new approach is shown in gray.

The fix: Bayesian causal forests

To regularize treatment effects directly, our new model is

f(x_i, z_i) = m(x_i, π̂(x_i)) + α(x_i, π̂(x_i)) z_i,

where m and α are given independent BART priors. Now the treatment effect is

E(Y_i | x_i, Z_i = 1) − E(Y_i | x_i, Z_i = 0) = {m(x_i, π̂) + α(x_i, π̂)} − m(x_i, π̂) = α(x_i, π̂),

and we can shrink towards homogeneity by regularizing α more strongly than m. In fact, we can directly set the prior probability of homogeneity; even setting it to 1% corresponds to much more aggressive regularization than the default BART prior.
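The key structural idea, separate regularization strengths for the prognostic part m and the effect part α, can be sketched outside of BART entirely. Below the two BART priors are replaced by ridge-penalized linear bases purely for illustration (everything here is made up, not the paper's model): a large λ_α shrinks α(x) toward a homogeneous constant effect while leaving m flexible:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 400
x = rng.normal(size=(n, 3))
z = rng.integers(0, 2, size=n).astype(float)
# Truly homogeneous effect of 0.25 on top of a linear prognostic surface:
y = x @ np.array([1.0, -1.0, 0.5]) + 0.25 * z + rng.normal(scale=0.5, size=n)

def fit_bcf_linear(lam_m, lam_alpha):
    """f(x, z) = m(x) + alpha(x) * z with separate ridge penalties."""
    Bm = np.column_stack([np.ones(n), x])               # basis for m(x)
    Ba = np.column_stack([np.ones(n), x]) * z[:, None]  # basis for alpha(x) * z
    W = np.column_stack([Bm, Ba])
    # Penalize slopes only; both intercepts stay unpenalized.
    pen = np.r_[0, lam_m * np.ones(3), 0, lam_alpha * np.ones(3)]
    theta = np.linalg.solve(W.T @ W + np.diag(pen), W.T @ y)
    return np.column_stack([np.ones(n), x]) @ theta[4:]  # alpha_hat(x_i)

spread_weak = fit_bcf_linear(1.0, 1.0).std()      # noisy, heterogeneous-looking
alpha_strong = fit_bcf_linear(1.0, 1e6)
spread_strong = alpha_strong.std()                # shrunk toward a constant
ate_hat = alpha_strong.mean()                     # near the true 0.25
```

With weak regularization on α the fitted effects scatter around 0.25 even though the truth is constant; the strong penalty recovers near-homogeneity directly, which is the behavior BCF's prior on α buys.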

Heterogeneous effect example

Consider the previous data-generating process, but let the treatment effect vary with an observable covariate x_3 ~ N(0, 1):

α_i = 1(x_i3 > 1/4) + (1/4) 1(x_i3 > 1/2) + (1/2) 1(x_i3 > 3/4),

so α_i ∈ {0, 1, 1.25, 1.75} according to the level of x_i3. We consider a smaller sample size here, n = 250.

Table 2: BART vs. ps-BART vs. BCF on the heterogeneous effect vector α(x), over 250 replicates.

Prior   | Coverage of ATE | Ave. RMSE, ATE | Ave. RMSE, ITE
BART    | 3%              | 0.53           | 0.63
ps-BART | 96%             | 0.11           | 0.34
BCF     | 94%             | 0.10           | 0.25
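To see where the effect levels come from, evaluate the step function on a grid with one point in each of the four regimes of x_3 (a quick check, not from the talk):

```python
import numpy as np

# One grid point per regime: x3 <= 1/4, (1/4, 1/2], (1/2, 3/4], > 3/4.
grid = np.array([0.0, 0.3, 0.6, 0.9])

# Cumulative jumps of sizes 1, 1/4 and 1/2 at the three thresholds:
alpha = (grid > 0.25) * 1.0 + (grid > 0.5) * 0.25 + (grid > 0.75) * 0.5
# alpha == [0.0, 1.0, 1.25, 1.75]
```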

Notable related approaches van der Laan, M. J. (2010). Targeted maximum likelihood based causal inference. The International Journal of Biostatistics. McCaffrey, D. F., Griffin, B. A., Almirall, D., Slaughter, M. E., Ramchand, R. and Burgette, L. F. (2013). A tutorial on propensity score estimation for multiple treatments using generalized boosted models. Statistics in Medicine. Zigler, C. M. and Dominici, F. (2014). Uncertainty in propensity score estimation: Bayesian methods for variable selection and model-averaged causal effects. Journal of the American Statistical Association. Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C. et al. (2016). Double machine learning for treatment and causal parameters. Wager, S. and Athey, S. (2017). Estimation and Inference of Heterogeneous Treatment Effects using Gradient Forests. 16

2017 ACIC Data Analysis Challenge

Treatment–response pairs were simulated according to 32 distinct data-generating processes (DGPs), given fixed covariates (n = 4,302, p = 58) from an empirical study. We varied three parameters between two levels:

High or Low noise level,
Strong or Weak confounding,
Small or Large effect size.

The error distributions were one of four types:

Additive, homoskedastic, independent,
Nonadditive, homoskedastic, independent,
Additive, heteroskedastic, independent.

To assess coverage, 250 replicate data sets were generated for each DGP.

Results: Inference for ATE on homoskedastic DGPs

Results: Estimation for ATE on homoskedastic DGPs

Results: Inference for ITE on homoskedastic DGPs

Results: Inference for ATE on easy DGPs

Results: Estimation for ATE on easy DGPs

Results: Inference for ITE on easy DGPs

Results: Inference for ATE on difficult DGPs

Results: Estimation for ATE on difficult DGPs

Results: Inference for ITE on difficult DGPs

Results: Inference for ATE on heteroskedastic DGPs

Results: Estimation for ATE on heteroskedastic DGPs

Results: Inference for ITE on heteroskedastic DGPs

1987 National Medical Expenditure Survey (NMES)

What is the effect of smoking on medical expenditures?

The outcome variable Y is medical expenses (verified, log-transformed); the treatment variable Z indicates heavy smoking (> 1/2 pack per day); n = 7.7k in a complete-case analysis for Y > 0. Covariates include:

age: age in years at the time of the survey
smoke age: age in years when the individual started smoking
gender: male or female
race: other, black or white
marriage status: married, widowed, divorced, separated, never married
education level: college graduate, some college, high school graduate, other
census region: geographic location; Northeast, Midwest, South, West
poverty status: poor, near poor, low income, middle income, high income
seat belt: whether the respondent regularly uses a seat belt when in a car

Prediction and ITE estimates: BART vs. BCF

[Scatterplots comparing BCF against vanilla BART: fitted values (left) and individual treatment effect estimates (right).]

Posterior of ATE

[Histogram of the posterior distribution of the average treatment effect, roughly 0.00 to 0.12.]

Subgroup inference

[Plot of individual treatment effect estimates across all individuals (roughly 8,000).]

Subgroup inference

CART (applied to the posterior) suggests that less-educated (high school graduate), married individuals with a relatively high propensity for smoking (π̂(x_i) > 0.63) have a higher treatment effect than everyone else.

[Density plot of the subgroup treatment effect, roughly −0.1 to 0.3.]

The corresponding subgroup ATE is indeed significant.

Takeaways

BART is an impressive response-surface method for causal inference; our new BCF model improves on BART in key respects.

Regularization-induced confounding can adversely bias treatment effect estimates.

Explicitly modeling selection allows regularization to be imposed robustly and directly on the estimand of interest.

Expect an R package soon. The paper is up on my web page; check it out!

Extensions

Interesting connections to covariate-dependent g-priors.

Incorporating uncertainty in the estimated propensity score?

Applications to mediation analysis.

Model improvements, including heteroskedasticity and more aggressive shrinkage priors (when many garbage variables are plausible).

Thank you for your time.