
Bayesian causal forests: dealing with regularization induced confounding and shrinking towards homogeneous effects

P. Richard Hahn, Jared Murray, and Carlos Carvalho

July 29, 2018

Regularization induced confounding

Suppose the treatment effect is homogeneous and the response and treatment models are both linear:

    Y_i = τ z_i + β^t x_i + ε_i,    Z_i = γ^t x_i + ν_i.

The bias of the treatment effect estimator τ̂_rr ≡ E(τ | Y, z, X) follows from

    bias(θ̂_rr) = -(M + X^t X)^{-1} M θ,    (1)

where θ = (τ, β^t)^t and the expectation defining the bias is taken over Y, conditional on X and all model parameters.

Regularization induced confounding

Let the prior precision be

    M = ( 0   0
          0   I_p ),

which gives a ridge prior on the control variables and a non-informative flat prior over the first element (τ, the treatment effect). Then

    bias(τ̂_rr) = ((z^t z)^{-1} z^t X) (I + X^t (X - X̂_z))^{-1} β,

where X̂_z = z (z^t z)^{-1} z^t X.
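
A minimal numeric check of this bias expression, as a base-R sketch; the dimensions, coefficient values, and noise scales below are illustrative choices, not values from the slides. It evaluates the closed-form bias -(M + W^t W)^{-1} M θ for the combined design W = (z, X) and compares it to a Monte Carlo average of the regularized estimate; β and γ are deliberately aligned, so selection is targeted and the bias is visible.

```r
# Minimal numeric check of the RIC bias formula (base R; all numbers illustrative).
set.seed(1)
n <- 200; p <- 10
X     <- matrix(rnorm(n * p), n, p)
gamma <- rep(0.5, p)                 # selection coefficients
beta  <- rep(0.5, p)                 # prognostic coefficients, same direction: targeted selection
tau   <- 1
z     <- as.vector(X %*% gamma + rnorm(n, sd = 0.5))
W     <- cbind(z, X)
M     <- diag(c(0, rep(1, p)))       # flat prior on tau, unit-precision ridge on beta

# Closed-form bias: -(M + W'W)^{-1} M theta ; its first element is bias(tau_hat)
theta     <- c(tau, beta)
bias_form <- -solve(M + crossprod(W), M %*% theta)[1]

# Monte Carlo check: average posterior mean of tau over repeated draws of Y | X, z
tau_hat <- replicate(2000, {
  Y <- tau * z + X %*% beta + rnorm(n)
  solve(M + crossprod(W), crossprod(W, Y))[1]
})
c(closed_form = bias_form, monte_carlo = mean(tau_hat) - tau)
```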

Targeted selection

Three components dictate the degree of RIC:

1. the coefficients defining the propensity function, E(Z | x) = γ^t x,
2. the coefficients defining the prognostic function, E(Y | Z = 0, x) = β^t x,
3. the strength of the selection, as measured by Var(Z | x) = Var(ν).

These are not in the analyst's control.

Targeted selection

Consider the identity

    E(Y | x, Z) = (τ + b) Z + (β - bγ)^t x - b(Z - γ^t x) = τ̂ Z + β̂^t x - ε̂.

If β̂ = (β - bγ) has higher prior probability than β, and Var(ε̂) = b² Var(ν) is small relative to σ², then τ will be biased toward τ̂ = τ + b. The bias is largest when:

- confounding is strong: b² Var(ν) is smallest when Var(ν) is small;
- selection is targeted: for shrinkage priors on β, the (β - bγ) term is most favorable with respect to the prior when the vectors β and γ have the same direction!

De-bias with a propensity score estimate

Estimate the propensity function, ẑ_i ≈ γ^t x_i (e.g., by least squares), and include both z and ẑ in the regression. The design matrix becomes X̃ = (z  ẑ  X). Plugging into our previous bias expression gives

    bias(τ̂_rr) = {(z̃^t z̃)^{-1} z̃^t X}_1 (I + X^t (X - X̂_z̃))^{-1} β = 0,

where z̃ = (z  ẑ), X̂_z̃ = z̃ (z̃^t z̃)^{-1} z̃^t X, and {·}_1 denotes the top row of (z̃^t z̃)^{-1} z̃^t X. That top row is zero because, when ẑ is the in-sample least-squares fit, z - ẑ is orthogonal to the columns of X.
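
Continuing the numeric sketch from above (it reuses X, z, beta, tau, and p from that block), adding the in-sample least-squares fit ẑ to the design drives the closed-form bias of τ̂ to zero:

```r
# Continuing the sketch above: add the in-sample least-squares propensity fit z_hat to the design.
z_hat  <- X %*% solve(crossprod(X), crossprod(X, z))   # OLS fit of z on X
W2     <- cbind(z, z_hat, X)
M2     <- diag(c(0, 0, rep(1, p)))                     # flat priors on the z and z_hat coefficients
theta2 <- c(tau, 0, beta)                              # true coefficients in the augmented model

# First element of -(M2 + W2'W2)^{-1} M2 theta2 is bias(tau_hat); it is numerically zero
-solve(M2 + crossprod(W2), M2 %*% theta2)[1]
```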

The nonlinear case

Suppose that Y is a continuous biometric measure of heart distress, Z is an indicator for having received a heart medication, and x_1 and x_2 are systolic and diastolic blood pressure (in standardized units). Suppose that the difference between these two measurements is prognostic of high distress levels, with x_1 - x_2 > 0 being a critical threshold. Prescribers target the drug towards patients at high risk, so the probability of receiving the drug is an increasing function of µ, the prognostic function.

Nonlinear targeted selection

[Figure omitted.]

RIC with BART

We simulated 200 datasets of size n = 250 according to this data generating process with τ = 1.

    Prior   bias   coverage   rmse
    BART    0.27   65%        0.31
    BCF     0.14   95%        0.21

BART gives clearly biased inference. Why?

RIC with BART

Strong confounding and targeted selection imply that µ is approximately a monotone function of π alone. However, π (and hence µ) is difficult to learn via regression trees: it takes many axis-aligned splits to approximate the shelf across the diagonal.

[Figure: µ(x) over the (x_1, x_2) plane, showing the shelf across the diagonal.]

Meanwhile, a single split on Z can stand in for the many splits on x_1 and x_2 that would be required to approximate µ(x).

Regularizing heterogeneous effects

Two common strategies:

- treat z as just another covariate and specify a prior on f(x_i, z_i) (Hill, 2011);
- fit entirely separate models to the treatment and control data: (Y | Z = z, x) ~ N(f_z(x_i), σ²_z), with independent priors over the parameters in each.

The model we propose is

    f(x_i, z_i) = µ(x_i, π̂_i) + τ(x_i) z_i,

where π̂_i is an estimate of the propensity score. This splits the difference, compromising between the two approaches.

Give τ(x) its own prior

By analogy, consider a two-group difference-in-means problem:

    Y_i1 ~ N(µ_1, σ²),    Y_j2 ~ N(µ_2, σ²)    (iid).

If the estimand of interest is µ_1 - µ_2, the implied prior over this quantity (under independent priors on µ_1 and µ_2) has variance strictly greater than the variance over µ_1 or µ_2 individually. Instead, if we know that µ_1 ≈ µ_2, it makes more sense to parametrize as

    Y_i1 ~ N(µ + τ, σ²),    Y_j2 ~ N(µ, σ²)    (iid).

Now τ can be given an informative prior centered at zero and µ can be given a very vague prior.
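
A small prior-simulation sketch of this point in base R; the prior scales are illustrative choices, not values from the talk.

```r
# Illustrative prior draws; all scale choices here are assumptions for the sketch.
set.seed(1)
mu1 <- rnorm(1e5, 0, 1)
mu2 <- rnorm(1e5, 0, 1)        # independent priors on the two group means
var(mu1 - mu2)                 # about 2: the implied prior on the estimand is looser than either prior

# Reparametrize: a common level mu plus a contrast tau that gets its own informative prior
mu  <- rnorm(1e5, 0, 10)       # very vague prior on the common level
tau <- rnorm(1e5, 0, 0.5)      # informative prior centered at zero, directly on the estimand
var(tau)                       # 0.25: shrinkage applied to the quantity of interest
```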

Simulation study

We toggle three two-level settings:

- homogeneous versus heterogeneous treatment effects,
- a linear versus nonlinear conditional expectation function, and
- two different sample sizes (n = 250 and n = 500).

Five variables comprise x; the first three are continuous, drawn as standard normal random variables, the fourth is dichotomous, and the fifth is unordered categorical, taking three levels (denoted 1, 2, 3).

Simulation study

The treatment effect is either

    τ(x) = 3 (homogeneous)    or    τ(x) = 1 + 2 x_2 x_5 (heterogeneous),

the prognostic function is either

    µ(x) = 1 + g(x_4) + x_1 x_3 (linear)    or    µ(x) = -6 + g(x_4) + 6 |x_3 - 1| (nonlinear),

where g(1) = 2, g(2) = -1, and g(3) = -4, and the propensity function is

    π(x_i) = 0.8 Φ(3 µ(x_i)/s - 0.5 x_1) + 0.05 + u_i/10,

where s is the standard deviation of µ taken over the observed sample and u_i ~ Uniform(0, 1).
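
A base-R sketch of the heterogeneous-effect, nonlinear arm of this data-generating process. Two details are assumptions on my part: the slides define g on three levels while describing x_4 as dichotomous, so the sketch generates x_4 as the three-level factor (and x_5 as the binary variable) so that g(x_4) is well defined; and the residual standard deviation is set to 1, which the slides do not state.

```r
# Sketch of the simulation DGP (heterogeneous effect, nonlinear prognostic function).
# Assumptions: x4 is treated as the three-level factor so that g(x4) is defined,
# x5 as the binary covariate, and the residual sd is set to 1 (not stated on the slides).
set.seed(1)
n  <- 250
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
x4 <- sample(1:3, n, replace = TRUE)            # three-level categorical (assumed)
x5 <- rbinom(n, 1, 0.5)                         # dichotomous (assumed)

g   <- function(v) c(2, -1, -4)[v]              # g(1) = 2, g(2) = -1, g(3) = -4
mu  <- -6 + g(x4) + 6 * abs(x3 - 1)             # nonlinear prognostic function
tau <- 1 + 2 * x2 * x5                          # heterogeneous treatment effect
s   <- sd(mu)                                   # sd of mu over the observed sample
pihat_true <- 0.8 * pnorm(3 * mu / s - 0.5 * x1) + 0.05 + runif(n) / 10
z   <- rbinom(n, 1, pihat_true)
y   <- mu + tau * z + rnorm(n)                  # sigma = 1 assumed
```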

Simulation study

To evaluate each method we consider three criteria, applied to two different estimands. First, we consider how well each method estimates the (sample) average treatment effect (ATE), according to root mean square error, coverage, and average interval length. Then we apply the same criteria to estimates of the conditional average treatment effect (CATE), averaged over the sample. Results are based on 200 independent replications for each DGP.
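
As a concrete illustration of these criteria, here is a small, hypothetical helper (not from any package) that computes rmse, coverage, and average 95% interval length for the CATEs from a matrix of posterior draws:

```r
# Hypothetical helper illustrating the evaluation criteria (not part of any package).
# draws: nsim x n matrix of posterior draws of tau(x_i); truth: length-n vector of true effects.
cate_metrics <- function(draws, truth, level = 0.95) {
  est <- colMeans(draws)
  lo  <- apply(draws, 2, quantile, probs = (1 - level) / 2)
  hi  <- apply(draws, 2, quantile, probs = 1 - (1 - level) / 2)
  c(rmse  = sqrt(mean((est - truth)^2)),
    cover = mean(lo <= truth & truth <= hi),
    len   = mean(hi - lo))
}
```

The ATE versions of the same criteria would be computed from the row means of the draws against the mean of the true effects, once per replication.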

Simulation results

The important trends are as follows:

- BCF and ps-BART benefit dramatically from explicitly protecting against RIC;
- BART-(f_0, f_1) and causal random forests both exhibit subpar performance in this simulation;
- all methods improve with a larger sample;
- the BCF priors are especially helpful at the smaller sample size (when estimation is more difficult);
- the linear model dominates when correct, but fares extremely poorly when wrong;
- BCF's improvements over ps-BART are more pronounced in the nonlinear DGP;
- BCF's average interval length is notably smaller than the ps-BART interval, usually (but not always) with comparable coverage.

Simulation results (linear prognostic function)

                        Homogeneous effect                  Heterogeneous effects
                        ATE               CATE              ATE               CATE
  n    Method           rmse cover len    rmse cover len    rmse cover len    rmse cover len
  250  BCF              0.21 0.92  0.91   0.48 0.96  2.0    0.27 0.84  0.99   1.09 0.91  3.3
       ps-BART          0.22 0.94  0.97   0.44 0.99  2.3    0.31 0.90  1.13   1.30 0.89  3.5
       BART             0.34 0.73  0.94   0.54 0.95  2.3    0.45 0.65  1.10   1.36 0.87  3.4
       BART (f_0, f_1)  0.56 0.41  0.99   0.92 0.93  3.4    0.61 0.44  1.14   1.47 0.90  4.5
       Causal RF        0.34 0.73  0.98   0.47 0.84  1.3    0.49 0.68  1.25   1.58 0.68  2.4
       LM + HS          0.14 0.96  0.83   0.26 0.99  1.7    0.17 0.94  0.89   0.33 0.99  1.9
  500  BCF              0.16 0.88  0.60   0.38 0.95  1.4    0.16 0.90  0.64   0.79 0.89  2.4
       ps-BART          0.18 0.86  0.63   0.35 0.99  1.8    0.16 0.90  0.69   0.86 0.95  2.8
       BART             0.27 0.61  0.61   0.42 0.95  1.8    0.25 0.76  0.67   0.88 0.94  2.8
       BART (f_0, f_1)  0.47 0.21  0.66   0.80 0.93  3.1    0.42 0.42  0.75   1.16 0.92  3.9
       Causal RF        0.36 0.47  0.69   0.52 0.75  1.2    0.40 0.59  0.88   1.30 0.71  2.1
       LM + HS          0.11 0.96  0.54   0.18 0.99  1.0    0.12 0.93  0.59   0.22 0.98  1.2

Simulation results (nonlinear prognostic function)

                        Homogeneous effect                  Heterogeneous effects
                        ATE               CATE              ATE               CATE
  n    Method           rmse cover len    rmse cover len    rmse cover len    rmse cover len
  250  BCF              0.26 0.945 1.3    0.63 0.94  2.5    0.30 0.930 1.4    1.3  0.93  4.5
       ps-BART          0.54 0.780 1.6    1.00 0.96  4.3    0.56 0.805 1.7    1.7  0.91  5.4
       BART             0.84 0.425 1.5    1.20 0.90  4.1    0.84 0.430 1.6    1.8  0.87  5.2
       BART (f_0, f_1)  1.48 0.035 1.5    2.42 0.80  6.4    1.44 0.085 1.6    2.6  0.83  7.1
       Causal RF        0.81 0.425 1.5    0.84 0.70  2.0    1.10 0.305 1.8    1.8  0.66  3.4
       LM + HS          1.77 0.015 1.8    2.13 0.54  4.4    1.65 0.085 1.9    2.2  0.62  4.8
  500  BCF              0.20 0.945 0.97   0.47 0.94  1.9    0.23 0.910 0.97   1.0  0.92  3.4
       ps-BART          0.24 0.910 1.07   0.62 0.99  3.3    0.26 0.890 1.06   1.1  0.95  4.1
       BART             0.31 0.790 1.00   0.63 0.98  3.0    0.33 0.760 1.00   1.1  0.94  3.9
       BART (f_0, f_1)  1.11 0.035 1.18   2.11 0.81  5.8    1.09 0.065 1.17   2.3  0.82  6.2
       Causal RF        0.39 0.650 1.00   0.54 0.87  1.7    0.59 0.515 1.18   1.5  0.73  2.8
       LM + HS          1.76 0.005 1.34   2.19 0.40  3.5    1.71 0.000 1.34   2.2  0.45  3.7

ACIC 2017

[Figures: interval length and coverage for the CATEs and for the ATT, and rmse for the CATEs and for the ATT, comparing CRF, TL, psb, and BCF.]

Papers

RIC in the linear model is discussed in:
Hahn, Carvalho, Puelz, and He. Regularization and confounding in linear regression for treatment effect estimation. Bayesian Analysis (2018).

The Bayesian causal forest model is developed in:
Hahn, Murray, and Carvalho. Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. In review after revision at JASA.

An exciting applied paper using these ideas is:
Where Does a Scalable Growth-Mindset Intervention Improve Adolescents' Educational Trajectories? Under revision at Nature.

Most importantly: code

The R package bcf went live just today. Give it a try.

Thanks for your time.
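
A minimal usage sketch, run on data simulated as in the earlier DGP sketch (it reuses x1..x5, z, and y from that block). The argument names (y, z, x_control, x_moderate, pihat, nburn, nsim) and the tau component of the fit reflect my recollection of the package interface and may differ across versions; treat this as indicative rather than definitive and check against the package documentation.

```r
# Indicative bcf call on the simulated data above.
# Argument names and return components are assumptions to be checked against the bcf docs.
library(bcf)

X     <- cbind(x1, x2, x3, x4, x5)
pihat <- fitted(glm(z ~ X, family = binomial))   # plug-in propensity score estimate

fit <- bcf(y = y, z = z, x_control = X, x_moderate = X, pihat = pihat,
           nburn = 1000, nsim = 1000)

tau_draws <- fit$tau                 # posterior draws of tau(x_i) (assumed: one column per observation)
mean(colMeans(tau_draws))            # posterior mean of the sample ATE
```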

Our setting

We'll assume:

- observational data (not from randomized experiments),
- conditional unconfoundedness/ignorability (we've measured all the factors causally influencing treatment and response),
- covariate-dependent treatment effects (individuals can have different responses to treatment according to their covariates),
- binary treatments.

Our assumptions, more formally

Strong ignorability: Y_i(0), Y_i(1) ⊥ Z_i | X_i = x_i.
Positivity: 0 < Pr(Z_i = 1 | X_i = x_i) < 1 for all i.

Therefore E(Y_i(z) | x_i) = E(Y_i | x_i, Z_i = z), so the conditional average treatment effect (CATE) is

    α(x_i) := E(Y_i(1) - Y_i(0) | x_i) = E(Y_i | x_i, Z_i = 1) - E(Y_i | x_i, Z_i = 0).

Modeling assumptions

We write E(Y_i | x_i, z_i) = f(x_i, z_i), so that

    α(x_i) := f(x_i, 1) - f(x_i, 0).

We assume iid Gaussian errors:

    Y_i = f(x_i, z_i) + ε_i,    ε_i ~ N(0, σ²).

NB: Strong ignorability means ε_i ⊥ Z_i | x_i.

What prior on f?

Regression Trees

[Figure: an example tree T_h, splitting on x_1 < c and then on x_3 < d, together with the induced partition of the covariate space.]

Leaf/end-node parameters: M_h = (µ_h1, µ_h2, µ_h3).

    g(x, T_h, M_h) = µ_ht  if  x ∈ A_ht  (for 1 ≤ t ≤ b_h).

Partition: A_h = {A_h1, A_h2, A_h3}.
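
To make g(x, T_h, M_h) concrete, here is the example tree written as a small lookup function in R; the split points, leaf values, and the assignment of leaves to regions are placeholders, not values from the slides.

```r
# Hypothetical encoding of the example tree: split on x1 < c, then on x3 < d.
# Split points c, d and leaf parameters (mu1, mu2, mu3) are placeholders.
g_example <- function(x, c = 0, d = 0, M = c(mu1 = -1, mu2 = 0, mu3 = 1)) {
  if (x[["x1"]] < c) {
    if (x[["x3"]] < d) M[["mu1"]] else M[["mu2"]]
  } else {
    M[["mu3"]]                     # the region x1 >= c is a single leaf
  }
}
g_example(c(x1 = -0.5, x3 = 0.7))  # returns mu2: the leaf with x1 < c and x3 >= d
```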

Bayesian Additive Regression Trees (BART)

Bayesian additive regression trees (Chipman, George, & McCulloch, 2008):

    y_i = f(x_i, z_i) + ε_i,    ε_i ~ N(0, σ²),
    f(x, z) = Σ_{h=1}^{m} g(x, z, T_h, M_h).

Hill (2011) proposes adopting Bayesian additive regression trees (BART) for causal inference.

2017 ACIC Data Analysis Challenge

Treatment-response pairs were simulated according to 32 distinct data generating processes (DGPs), given fixed covariates (n = 4,302, p = 58) from an empirical study. We varied three parameters, each with two levels:

- high or low noise level,
- strong or weak confounding,
- small or large effect size.

The error distributions were one of four types:

- additive, homoskedastic, independent,
- nonadditive, homoskedastic, independent,
- additive, heteroskedastic, independent.

To assess coverage, 250 replicate data sets were generated for each DGP.