Bayesian Classification and Regression Trees


Bayesian Classification and Regression Trees
James Cussens
York Centre for Complex Systems Analysis & Dept of Computer Science, University of York, UK

Outline
- Problems for …
- Lessons from Bayesian phylogeny
- Results

Problems for … / Lessons from Bayesian phylogeny / Results

Trees are partition models
[Figure: an example classification tree; internal nodes carry split rules such as "=< 28.7 / > 28.7" on attribute values, and leaves carry class counts, showing how the tree partitions the attribute space.]

Classification trees as probability models. Tree structure T partitions the attribute space. Each partition (= leaf) i has its own class distribution θ_i = (p_i1, ..., p_ik). Let Θ = (θ_1, θ_2, ..., θ_b) be the complete parameter vector for a tree T with b leaves. Let x be the vector of attributes for an example, and y its class label. (Θ, T) defines a conditional probability model P(y | Θ, T, x).
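As a concrete reading of this model, here is a minimal sketch (not the author's code; the Leaf/Split classes and the toy tree are purely illustrative) of how a pair (Θ, T) yields P(y | Θ, T, x): route x through the splits of T to a leaf and read off that leaf's class probability.

```python
# Minimal sketch: a classification tree (Theta, T) as a conditional model P(y | Theta, T, x).
from dataclasses import dataclass
from typing import Sequence, Union

@dataclass
class Leaf:
    probs: Sequence[float]          # theta_i = (p_i1, ..., p_ik) for this partition cell

@dataclass
class Split:
    attr: int                       # index into the attribute vector x
    threshold: float                # rule "x[attr] =< threshold" vs "> threshold"
    left: "Node"
    right: "Node"

Node = Union[Leaf, Split]

def prob_class(tree: Node, x, y: int) -> float:
    """P(y | Theta, T, x): route x to its leaf, read off the class probability."""
    node = tree
    while isinstance(node, Split):
        node = node.left if x[node.attr] <= node.threshold else node.right
    return node.probs[y]

# Toy tree with one split, two leaves, two classes.
toy = Split(attr=0, threshold=28.7,
            left=Leaf(probs=[0.9, 0.1]),
            right=Leaf(probs=[0.2, 0.8]))
print(prob_class(toy, x=[12.0], y=0))   # 0.9
```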

The Bayesian approach. Given: a prior distribution P(Θ, T) = P(T) P(Θ | T) and data (X, Y). Compute: the posterior distribution P(Θ, T | X, Y). We just care about structure: P(T | X, Y) ∝ P(T | X) P(Y | T, X).

Defining tree structure priors with a sampler. Instead of specifying a closed-form expression for the tree prior P(T | X), we specify P(T | X) implicitly by a tree-generating stochastic process. Each realization of such a process can simply be considered a random draw from this prior. (Chipman et al., JASA, 1998) Grow by splitting leaf η with probability α(1 + d_η)^(−β), where d_η is the depth of η. Splitting rules are chosen uniformly.
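A hedged sketch of this tree-generating process, using the Chipman et al. split probability α(1 + d_η)^(−β) and a uniformly chosen split rule; the attribute count and threshold range below are made up for illustration. The defaults α = 0.95, β = 1 plausibly correspond to the a0_95b1 tag in the run names later in the talk.

```python
# Sketch: draw one tree T ~ P(T | X) from the grow-by-splitting prior.
import random

def sample_tree_from_prior(depth=0, alpha=0.95, beta=1.0, n_attrs=3, rng=random):
    """Return a nested dict representing one random draw from the tree prior."""
    p_split = alpha * (1 + depth) ** (-beta)     # probability of splitting this leaf
    if rng.random() >= p_split:
        return {"leaf": True}
    # Splitting rule chosen uniformly (here: a random attribute, threshold in [0, 1)).
    rule = {"attr": rng.randrange(n_attrs), "threshold": rng.random()}
    return {"leaf": False, "rule": rule,
            "left": sample_tree_from_prior(depth + 1, alpha, beta, n_attrs, rng),
            "right": sample_tree_from_prior(depth + 1, alpha, beta, n_attrs, rng)}

random.seed(0)
print(sample_tree_from_prior())
```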

Sampling (approximately) from the posterior. Produce an approximate sample from the posterior P(T | X, Y) by generating a Markov chain with the Metropolis-Hastings algorithm. If at tree T, propose T' with probability q(T' | T) and accept T' with probability

α(T, T') = min{ [P(T' | X, Y) q(T | T')] / [P(T | X, Y) q(T' | T)], 1 }

Our proposals. We propose a new tree T' by pruning the current tree T^(i) at a random node and re-growing according to the prior, giving:

α(T^(i), T') = min{ (d_{T^(i)} / d_{T'}) · P(Y | T', X) / P(Y | T^(i), X), 1 }

where d_T is the depth of T. So big jumps are possible.
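The following sketch shows one Metropolis-Hastings step written directly from the acceptance ratio above. The helpers depth_of, prune_and_regrow and log_marginal_likelihood are placeholders for the model-specific pieces, not functions from the talk's software.

```python
# Sketch of one MH step with the prune-and-regrow proposal, in log space.
import math, random

def mh_step(tree, X, Y, depth_of, prune_and_regrow, log_marginal_likelihood,
            rng=random):
    """Accept/reject T' with alpha = min{ (d_T / d_T') * P(Y | T', X) / P(Y | T, X), 1 }."""
    proposal = prune_and_regrow(tree)            # prune at a random node, regrow from prior
    log_ratio = (math.log(depth_of(tree)) - math.log(depth_of(proposal))
                 + log_marginal_likelihood(proposal, X, Y)
                 - log_marginal_likelihood(tree, X, Y))
    if math.log(rng.random()) < min(log_ratio, 0.0):
        return proposal                          # accepted: chain moves to T'
    return tree                                  # rejected: chain stays at T
```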

Sometimes it's easy. Kyphosis dataset (81 datapoints, 3 attributes, 2 classes), 50,000 MCMC iterations, no tempering:

Tree   p̂_seed1(T_i)   p̂_seed2(T_i)   p̂_seed3(T_i)
T_1    0.08326        0.07898        0.08338
T_2    0.05900        0.06154        0.06170
T_3    0.05574        0.05664        0.05610
T_4    0.02466        0.02724        0.02790
T_5    0.02564        0.02674        0.02504
T_6    0.01494        0.01682        0.01530
T_7    0.01390        0.01410        0.01524
T_8    0.01208        0.01324        0.01288
T_9    0.01212        0.01284        0.01168

Computing class probabilities for new data. Given training data (X, Y), the posterior probability that x has class y is:

p(y | x, X, Y) = Σ_T P(T | X, Y) ∫ p(y | x, Θ, T) P(Θ | T, X, Y) dΘ

We use the MCMC sample to estimate P(T | X, Y); the rest is analytically soluble.
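A sketch of this estimate from the MCMC output, under the assumption that visited trees are stored in a hashable form (e.g. canonical string encodings) so that visit counts can stand in for P(T | X, Y); predict_proba is a placeholder for the closed-form integral over Θ.

```python
# Sketch: posterior predictive class probabilities averaged over the MCMC sample.
from collections import Counter

def posterior_class_probs(visited_trees, x, n_classes, predict_proba):
    counts = Counter(visited_trees)              # visit counts estimate P(T | X, Y)
    total = sum(counts.values())
    probs = [0.0] * n_classes
    for tree, n in counts.items():
        p = predict_proba(tree, x)               # E[p(y | x, Theta, T) | T, X, Y], closed form
        for k in range(n_classes):
            probs[k] += (n / total) * p[k]
    return probs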

Comparing class probabilities in an easy case
[Scatter plot of predicted class probabilities from two MCMC runs, seeds 512 vs. 883 (run tag tr_uc_rm_idsd_a0_95b1_i50k). Dataset=K, iterations=50,000, tempering=false.]

Problems for … / Lessons from Bayesian phylogeny / Results

Usually it's not easy. "... the algorithm gravitates quickly towards [regions of large posterior probability] and then stabilizes, moving locally in that region for a long time. Evidently, this is a consequence of a proposal distribution that makes local moves over a sharply peaked multimodal posterior. Once a tree has reasonable fit, the chain is unlikely to move away from a sharp local mode by small steps. ... although different move types might be implemented, we believe that any MH algorithm for CART models will have difficulty moving between local modes." (Chipman et al., 1998)

Where there is room for improvement
[Scatter plot of predicted class probabilities from two MCMC runs, seeds 447 vs. 938 (run tag tr_uc_rm_idsd_a0_95b1_i250k). Dataset=BCW, iterations=250,000, tempering=false.]

Where there is room for improvement
[Scatter plot of predicted class probabilities from two MCMC runs, seeds 938 vs. 447 (run tag tr_uc_rm_idsd_a0_95b1_i250k). Dataset=BCW, iterations=250,000, tempering=false.]

Problems for … / Lessons from Bayesian phylogeny / Results

The same problem for Bayesian phylogeny. "The posterior probability of trees can contain multiple peaks. ... MCMC can be prone to entrapment in local optima; a Markov chain currently exploring a peak of high probability may experience difficulty crossing valleys to explore other peaks." (Altekar et al., 2004) MrBayes is at http://morphbank.ebc.uu.se/mrbayes/

A solution: (power) tempering. As well as the cold chain with stationary distribution P(T | X, Y), run hot chains with stationary distributions P(T | X, Y)^β for 0 < β < 1, and swap states between chains. Only states visited by the cold chain count.

Acceptance probabilities for tempering. Within a chain at inverse temperature β:

α_β(T^(i), T') = min{ (d_{T^(i)} / d_{T'}) · (P(Y | T', X) / P(Y | T^(i), X))^β, 1 }

Swapping the states T_1, T_2 of two chains at inverse temperatures β_1, β_2:

α_swap = min{ (P(Y | T_2, X) / P(Y | T_1, X))^(β_1 − β_2), 1 }
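Putting the two acceptance rules together, here is a sketch of one tempering sweep: each chain makes a within-chain move at its own β, then a randomly chosen adjacent pair of chains proposes to swap states using the α_swap rule above. log_ml and mh_step_tempered are placeholders, not the talk's implementation.

```python
# Sketch of one sweep of (power) tempering over several chains.
import math, random

def tempering_sweep(states, betas, log_ml, mh_step_tempered, rng=random):
    # 1. Within-chain Metropolis-Hastings updates at each inverse temperature.
    states = [mh_step_tempered(T, beta) for T, beta in zip(states, betas)]
    # 2. Propose swapping the states of a random adjacent pair of chains.
    i = rng.randrange(len(states) - 1)
    T1, T2 = states[i], states[i + 1]
    log_alpha_swap = (betas[i] - betas[i + 1]) * (log_ml(T2) - log_ml(T1))
    if math.log(rng.random()) < min(log_alpha_swap, 0.0):
        states[i], states[i + 1] = T2, T1
    return states          # states[0] is the cold chain; only its states are recorded
```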

Problems for … / Lessons from Bayesian phylogeny / Results

The small print. Copied MrBayes defaults: β_i = 1 / (1 + T(i − 1)) for i = 1, 2, 3, 4, where T = 0.2.
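As a worked instance of this schedule (purely to show the numbers it produces):

```python
# The heating schedule above, evaluated for T = 0.2 and four chains.
T = 0.2
betas = [1 / (1 + T * (i - 1)) for i in (1, 2, 3, 4)]
print(betas)   # [1.0, 0.8333..., 0.7142..., 0.625]
```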

Datasets

Name   Size    #attributes (x)  #classes (Y)  Pos%(Tr)  Pos%(HO)
K      81      3                2             81.5%     68.8%
BCW    683     9                2             66.2%     60.3%
PIMA   768     8                2             65.4%     64.1%
LR     20000   16               26            3.85%     4.3%
WF     5000    40               3             35.6%     33.4%

The holdout set (HO) is 20% of the data.

BCW: 50K, Temp=F vs 50K, Temp=T
[Two scatter plots of predicted class probabilities from 50K-iteration runs, seeds 512 vs. 209 (run tag tr_uc_rm_idsd_a0_95b1_i50k): left without tempering, right with tempering.]

PIMA: 50K, Temp=F vs 50K, Temp=T
[Two scatter plots of predicted class probabilities from 50K-iteration runs, seeds 883 vs. 512 (run tag tr_uc_rm_idsd_a0_95b1_i50k): left without tempering, right with tempering.]

PIMA: 250K, Temp=F vs 50K, Temp=T
[Two scatter plots of predicted class probabilities: left, 250K iterations without tempering (seeds 938 vs. 447); right, 50K iterations with tempering (seeds 883 vs. 512).]

PIMA: 250K, Temp=F vs 250K, Temp=T
[Two scatter plots of predicted class probabilities from 250K-iteration runs: left without tempering (seeds 938 vs. 447); right with tempering (seeds 447 vs. 938).]

LR: 50K, Temp=F vs 50K, Temp=T
[Two scatter plots of predicted class probabilities from 50K-iteration runs: left without tempering (seeds 512 vs. 209), right with tempering (seeds 883 vs. 209).]

WF: 50K, Temp=F vs 50K, Temp=T
[Two scatter plots of predicted class probabilities from 50K-iteration runs: left without tempering (seeds 512 vs. 209), right with tempering (seeds 883 vs. 209).]

Stability of classification accuracy on the hold-out set for 3 MCMC runs, with and without tempering

        Temp=F            Temp=T
Data    acc     σ_acc     acc     σ_acc    rpart   Time per 1000
K       68.8%   0.0%      68.8%   0.0%     75.0%   5s
BCW     96.1%   1.2%      95.8%   0.3%     95.5%   17s
PIMA    76.9%   3.2%      73.6%   1.6%     76.4%   129s
LR      62.4%   3.6%      66.9%   0.1%     46.1%   2368s
WF      71.0%   3.7%      72.5%   2.9%     74.1%   1151s

Materials. This SLP and the other materials used are available from http://www-users.cs.york.ac.uk/aig/slps/mcmcms/ Look in the pbl/icml05 directory of the MCMCMS distribution; it includes scripts for reproducing the figures in this paper.

Future work: tempering plus informative priors. Currently applying to MCMC for Bayesian nets.

Bayesian Additive Regression Trees. BART uses a sum-of-trees model:

Y = g_1(x) + g_2(x) + ... + g_m(x) + ε,   ε ~ N(0, σ²)

where each g_i is a regression tree. Do MCMC by Gibbs sampling: get a new tree for g_i conditional on all the others. The distribution for the new tree only depends on the residual produced from the other trees.
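A sketch of the Gibbs ("backfitting") sweep this describes, with placeholders (predict, sample_tree_given_residual, sample_sigma) standing in for BART's actual conditional draws:

```python
# Sketch: one BART Gibbs sweep; each tree is resampled against the residual
# of Y after subtracting the fits of all the other trees.
import numpy as np

def bart_gibbs_sweep(trees, predict, X, Y, sigma,
                     sample_tree_given_residual, sample_sigma):
    fits = np.array([predict(g, X) for g in trees])      # m x n matrix of g_i(x) values
    for i in range(len(trees)):
        residual = Y - (fits.sum(axis=0) - fits[i])       # Y minus the other trees' fits
        trees[i] = sample_tree_given_residual(residual, X, sigma)
        fits[i] = predict(trees[i], X)
    sigma = sample_sigma(Y - fits.sum(axis=0))            # update the noise level
    return trees, sigma
```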

Getting that posterior. Bayes' theorem:

P(Θ, T | X, Y) ∝ P(Θ, T | X) P(Y | Θ, T, X)

We restrict attention to tree structures, so integrate away the parameters Θ:

P(T | X, Y) ∝ P(T | X) ∫_Θ P(Y | Θ, T, X) P(Θ | T, X) dΘ = P(T | X) P(Y | T, X)

The marginal likelihood P(Y | T, X) is easy to compute, since we use Dirichlet distributions for P(Θ | T, X) (and other standard assumptions).
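A sketch of that closed-form marginal likelihood under the assumptions just stated: independent leaves, each with a Dirichlet prior on its class probabilities, giving a product of Dirichlet-multinomial terms over the leaf class counts. The function and the test values are illustrative, not taken from the talk.

```python
# Sketch: log P(Y | T, X) as a product of per-leaf Dirichlet-multinomial terms.
from math import lgamma

def log_marginal_likelihood(leaf_counts, alpha):
    """leaf_counts: one length-K class-count vector per leaf; alpha: Dirichlet prior."""
    A = sum(alpha)
    total = 0.0
    for counts in leaf_counts:
        n = sum(counts)
        total += lgamma(A) - lgamma(A + n)
        total += sum(lgamma(a + c) - lgamma(a) for a, c in zip(alpha, counts))
    return total

# Two leaves, two classes, uniform Dirichlet(1, 1) prior at each leaf:
print(log_marginal_likelihood([[5, 1], [0, 4]], alpha=[1.0, 1.0]))
```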

Comparing class probabilities in an easy case
[Scatter plot of predicted class probabilities from two 50K-iteration runs, seeds 883 vs. 209. Dataset=K, iterations=50,000, tempering=false.]

BCW: 50K, Temp=F vs 50K, Temp=T
[Two scatter plots of predicted class probabilities from 50K-iteration runs, seeds 883 vs. 512 (run tag tr_uc_rm_idsd_a0_95b1_i50k): left without tempering, right with tempering.]