Bayesian Additive Regression Trees (BART) with application to controlled trial data analysis

Bayesian Additive Regression Trees (BART) with application to controlled trial data analysis
Weilan Yang
wyang@stat.wisc.edu
May 2015

Background

The estimand of interest is the conditional average treatment effect (CATE):

$$\mathrm{CATE}_i = E\big(Y_i(Z_1) - Y_i(Z_0) \mid X_i\big)$$

Background

Previous work by Jennifer L. Hill [1] suggested that in causal inference it is more convenient to use a flexible non-parametric method such as BART: a better fit can be obtained without parametric assumptions. BART can also perform variable selection and construct confidence intervals (from posterior samples).

Imputation and extrapolation are the main challenges in causal inference. If BART can give a reasonable estimate of the treatment effect, it should also be able to extrapolate well in controlled trial data. In other words, we can evaluate the prediction performance of BART in the region where the data do not overlap.

[1] Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1).

BART overview [2]

- Place CART within a Bayesian framework by specifying a prior on tree space.
- Obtain multiple tree realizations via a tree-changing proposal distribution: birth/death/change/swap.
- Obtain multiple realizations of the single tree and average over the posterior to form predictions.

[2] http://www.stat.osu.edu/~comp_exp/jour.club/trees.pdf
Chipman, H. A., George, E. I., & McCulloch, R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443), 935-948.
Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1), 266-298.

BART overview, compared with CART

Drawbacks of CART:
- It splits nodes greedily by an impurity criterion such as entropy, which may result in highly similar trees within a random forest (RF);
- It is hard to obtain an interval estimate of a statistic.

BART overview

A general regression form:

$$y = f(x) + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$$

Regression tree form:

$$y = g(x; M, T) + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$$

From a Bayesian perspective, we need to specify the joint prior distribution of the parameters [3]:

$$\pi(T, M, \sigma^2) = \pi(M, \sigma^2 \mid T)\,\pi(T), \quad M \sim N(\mu, \Sigma), \quad \sigma^2 \sim \mathrm{IG}(\nu/2, \nu\lambda/2)$$

[3] Here we assume equal variance, i.e. the mean-shift model. Alternatively, we could use a mean-variance shift model.
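
To make the tree notation concrete, here is a minimal sketch (illustrative, not from the slides; all names are my own) of evaluating a single regression tree $g(x; M, T)$, where $T$ stores the split structure and $M$ the vector of leaf means:

    import numpy as np

    class Node:
        # Internal nodes hold a split rule (variable index, threshold);
        # leaves hold an index into the leaf-mean vector M.
        def __init__(self, var=None, threshold=None, left=None, right=None, leaf_id=None):
            self.var, self.threshold = var, threshold
            self.left, self.right = left, right
            self.leaf_id = leaf_id

    def g(x, M, root):
        # Evaluate g(x; M, T): route x down the tree, return its leaf mean.
        node = root
        while node.leaf_id is None:
            node = node.left if x[node.var] <= node.threshold else node.right
        return M[node.leaf_id]

    # depth-1 example: split on x[0] at 0.0, leaf means M = (-1.0, 2.0)
    root = Node(var=0, threshold=0.0, left=Node(leaf_id=0), right=Node(leaf_id=1))
    print(g(np.array([0.7]), np.array([-1.0, 2.0]), root))  # -> 2.0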

BART overview

$\pi(T)$ doesn't have a closed form; it is defined by a generative process:
- $P_{\mathrm{SPLIT}}(\eta, T)$: does leaf node $\eta$ split further?
- $P_{\mathrm{RULE}}(\rho \mid \eta, T)$: split on which variable, at which value?

Complexity penalty: $P_{\mathrm{SPLIT}}(\eta, T) = \alpha(1 + d_\eta)^{-\beta}$, where $d_\eta$ is the depth of node $\eta$.
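
A minimal sketch of drawing a tree skeleton from this prior, assuming the default BART hyperparameters $\alpha = 0.95$, $\beta = 2$ (the $P_{\mathrm{RULE}}$ draw of split variables and cutpoints is omitted):

    import numpy as np

    rng = np.random.default_rng(0)

    def p_split(depth, alpha=0.95, beta=2.0):
        # Prior probability that a node at this depth splits: alpha * (1 + depth)^(-beta)
        return alpha * (1.0 + depth) ** (-beta)

    def sample_tree_skeleton(depth=0, max_depth=10):
        # Recursively grow a skeleton from pi(T); deeper nodes split with
        # rapidly decreasing probability, penalizing complex trees.
        if depth < max_depth and rng.random() < p_split(depth):
            return {"left": sample_tree_skeleton(depth + 1, max_depth),
                    "right": sample_tree_skeleton(depth + 1, max_depth)}
        return {"leaf": True}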

BART overview

Then, integrate out $\Theta = (M, \sigma^2)$:

$$p(Y \mid X, T) = \int p(Y \mid X, \Theta, T)\,p(\Theta \mid T)\,d\Theta$$

Posterior of the tree:

$$p(T \mid X, Y) \propto p(Y \mid X, T)\,p(T)$$
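
Under the conjugate mean-shift prior this integral has a closed form. The sketch below is my own derivation of the standard conjugate result, assuming per-leaf means $\mu_i \sim N(\bar\mu, \sigma^2/a)$ and a shared $\sigma^2 \sim \mathrm{IG}(\nu/2, \nu\lambda/2)$; see Chipman et al. (1998) for the exact expressions used in practice:

    import numpy as np
    from scipy.special import gammaln

    def log_marginal_likelihood(leaf_data, a=1.0, mu_bar=0.0, nu=3.0, lam=1.0):
        # log p(Y | X, T): M and sigma^2 integrated out analytically.
        # leaf_data: list of 1-D arrays, the y-values falling in each leaf of T.
        n = sum(len(y) for y in leaf_data)
        log_ml = -0.5 * n * np.log(2.0 * np.pi)
        ss = 0.0  # accumulated sum of squares across leaves
        for y in leaf_data:
            n_i, ybar = len(y), y.mean()
            log_ml += 0.5 * np.log(a / (n_i + a))                # from integrating mu_i
            ss += np.sum((y - ybar) ** 2)                        # within-leaf scatter
            ss += n_i * a / (n_i + a) * (ybar - mu_bar) ** 2     # shrinkage toward mu_bar
        alpha0, beta0 = nu / 2.0, nu * lam / 2.0                 # sigma^2 ~ IG(nu/2, nu*lambda/2)
        log_ml += alpha0 * np.log(beta0) - gammaln(alpha0)       # from integrating sigma^2
        log_ml += gammaln(alpha0 + n / 2.0) - (alpha0 + n / 2.0) * np.log(beta0 + ss / 2.0)
        return log_ml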

BART overview

Posterior of the tree: $p(T \mid X, Y) \propto p(Y \mid X, T)\,p(T)$.

We cannot enumerate all possible trees, so we use Metropolis-Hastings [4] to sample a chain $T^0, T^1, T^2, \ldots$ The probability of accepting a proposed move from $T^i$ to $T^*$ (setting $T^{i+1} = T^*$) is

$$\alpha(T^i, T^*) = \min\left\{ \frac{q(T^*, T^i)}{q(T^i, T^*)} \frac{p(Y \mid X, T^*)\,p(T^*)}{p(Y \mid X, T^i)\,p(T^i)},\ 1 \right\}$$

where $q(T^i, T^{i+1})$ is the probability of proposing $T^{i+1}$ from $T^i$, produced by four move types: GROW, PRUNE, CHANGE, SWAP.

[4] http://nitro.biosci.arizona.edu/courses/eeb519a-2007/pdfs/gibbs.pdf
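
A minimal sketch of the accept/reject step in log space; the log marginal likelihoods can come from a function like log_marginal_likelihood above, while the log prior and proposal probabilities are assumed to be supplied by the GROW/PRUNE/CHANGE/SWAP move code:

    import numpy as np

    def mh_accept(rng, log_lik_cur, log_prior_cur, log_lik_prop, log_prior_prop,
                  log_q_forward, log_q_backward):
        # log_q_forward  = log q(T_i, T*): probability of proposing T* from T_i
        # log_q_backward = log q(T*, T_i): probability of proposing T_i back from T*
        log_alpha = (log_q_backward - log_q_forward
                     + log_lik_prop + log_prior_prop
                     - log_lik_cur - log_prior_cur)
        return np.log(rng.random()) < min(log_alpha, 0.0)  # accept T* with prob alpha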

BART overview

Aggregate $m$ trees:

$$y = \sum_{j=1}^{m} g(x; M_j, T_j) + \epsilon$$

The posterior $p\big((T_1, M_1), (T_2, M_2), \ldots, (T_m, M_m), \sigma \mid y\big)$ is produced by a Gibbs sampler:

$$(T_j, M_j) \mid T_{(j)}, M_{(j)}, \sigma^2, y$$
$$\sigma \mid T_1, \ldots, T_m, M_1, \ldots, M_m, y \sim \mathrm{IG}$$

where $T_{(j)}, M_{(j)}$ denote all trees and leaf parameters except the $j$-th. The former draw can be simplified as

$$(T_j, M_j) \mid R_j, \sigma, \quad R_j \equiv y - \sum_{k \neq j} g(x; T_k, M_k)$$
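
A schematic of one backfitting Gibbs sweep (a sketch under the priors above; update_tree is a hypothetical callable implementing the $(T_j, M_j) \mid R_j, \sigma$ draw):

    import numpy as np

    def gibbs_sweep(rng, y, tree_preds, sigma2, update_tree, nu=3.0, lam=1.0):
        # tree_preds: (m, n) array; row j holds g(x; T_j, M_j) at the training points.
        m, n = tree_preds.shape
        total = tree_preds.sum(axis=0)
        for j in range(m):
            R_j = y - (total - tree_preds[j])    # partial residual: y minus all other trees
            new_pred = update_tree(j, R_j, sigma2)
            total += new_pred - tree_preds[j]
            tree_preds[j] = new_pred
        resid = y - total
        # sigma^2 | rest ~ IG((nu + n)/2, (nu*lambda + sum(resid^2))/2)
        sigma2 = (nu * lam + np.sum(resid ** 2)) / (2.0 * rng.gamma((nu + n) / 2.0))
        return tree_preds, sigma2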

Simulation study [5]

Data generating mechanism:

$$Y_Z = \beta_0^Z + \sum_{i \in A} \beta_{1i}^Z X_i + \sum_{(i,j) \in B} \beta_{2ij}^Z X_i X_j + \sum_{(i,j,k) \in C} \beta_{3ijk}^Z X_i X_j X_k + \varepsilon$$

- Previous work (Hill, 2011) indicates that BART captures non-linear trends;
- When $\beta_i^{Z=0} \neq \beta_i^{Z=1}$, $Z$ and $X_i$ interact, the number of parameters in the data generating model doubles, and the model is more complex;
- When some $\beta_i$'s are set to 0, we can examine BART's variable selection ability.

[5] https://github.com/williamdotyang/bart_simulation.git
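
A minimal sketch of this data generating mechanism with three covariates (the coefficient values here are illustrative, not those used in the slides; the linked repository holds the actual simulation code):

    import numpy as np

    rng = np.random.default_rng(42)

    def generate_data(n=500, sigma=1.0):
        # Interaction DGP with treatment-arm-specific coefficients (beta^{Z=0} != beta^{Z=1}).
        X = rng.normal(size=(n, 3))
        Z = rng.integers(0, 2, size=n)                    # treatment indicator
        beta0 = {0: 0.0, 1: 5.0}                          # intercepts
        beta1 = {0: np.array([1.0, -2.0, 0.5]),           # main effects, set A
                 1: np.array([3.0, 1.0, -1.5])}
        beta2 = {0: 2.0, 1: -2.0}                         # pairwise term, pair (0, 1) in B
        beta3 = {0: 1.0, 1: 4.0}                          # three-way term, triple (0, 1, 2) in C
        Y = np.empty(n)
        for arm in (0, 1):
            idx = Z == arm
            Y[idx] = (beta0[arm] + X[idx] @ beta1[arm]
                      + beta2[arm] * X[idx, 0] * X[idx, 1]
                      + beta3[arm] * X[idx, 0] * X[idx, 1] * X[idx, 2])
        return X, Z, Y + rng.normal(scale=sigma, size=n)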

Simulation study: capturing interaction

Figure: when $X_i$ and $Z$ interact, different trends across $Z$ and $X_i$, and unequal variance among the $X_i$'s.

Simulation study: capturing interaction

Figure: residual plot.

Conclusion 1: BART captures the interaction between $X_i$ and $Z$.

Simulation study: capturing interaction

Setting: 3 covariates, with all interactions significant and $\beta_i^{Z=0} \neq \beta_i^{Z=1}$.

Table: MSE of OLS and BART on the test dataset.

           fit on overall data   fit on separate labels
    OLS    1041.30               113.39
    BART   10.63                 6.87

Conclusion 2: when $\beta_i^{Z=0} \neq \beta_i^{Z=1}$ (more parameters, a more complex model), fitting BART separately by treatment label gives better results.
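
A sketch of the "fit on overall data vs. fit on separate labels" comparison, assuming a generic fit(X, y) callable that returns an object with a predict(X) method; this interface is hypothetical and stands in for any BART implementation:

    import numpy as np

    def mse_overall_vs_separate(fit, X_tr, Z_tr, y_tr, X_te, Z_te, y_te):
        # fit on overall data: append Z as an extra column
        pooled = fit(np.column_stack([X_tr, Z_tr]), y_tr)
        pred_overall = pooled.predict(np.column_stack([X_te, Z_te]))
        # fit on separate labels: one model per treatment arm
        pred_separate = np.empty_like(y_te, dtype=float)
        for arm in (0, 1):
            model = fit(X_tr[Z_tr == arm], y_tr[Z_tr == arm])
            pred_separate[Z_te == arm] = model.predict(X_te[Z_te == arm])
        mse = lambda p: float(np.mean((y_te - p) ** 2))
        return mse(pred_overall), mse(pred_separate)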

Simulation study: variable selection

Setting: 30 candidate covariates, 3 of them significant ($|\beta_i| > 1$) and 27 not significant ($|\beta_i| = 0.001$); 3 second-order interactions (significant) and 1 third-order interaction (significant).

Table: MSE of OLS and BART on the test dataset.

           fit on overall data   fit on separate labels
    OLS    1037.11               111.42
    BART   30.42                 21.76

Conclusion 3: BART has a strong variable selection ability.

Simulation study: extrapolation

Simulation study: extrapolation

Conclusion 4: with only one covariate, BART cannot extrapolate well: the predicted trend follows the existing data when the labels are fitted together, and a constant prediction is made when they are fitted separately.

Simulation study: extrapolation

Figure: real data.

Simulation study: extrapolation

Figure: fitting the overall data.

Conclusion 5: multiple covariates with interactions help BART extrapolate.

Simulation study: extrapolation

Figure: residual plot.

However, the systematic error is unavoidable.