Fully Nonparametric Bayesian Additive Regression Trees


Ed George, Prakash Laud, Brent Logan, Robert McCulloch, Rodney Sparapani

Ed: Wharton, U Penn; Prakash, Brent, Rodney: Medical College of Wisconsin; Rob: Arizona State

1. Basic BART Ideas
2. A Simple Simulated Example
3. Out of Sample Prediction
4. The BART Model and Prior
5. BART MCMC
6. Fully Nonparametric BART
7. Simulated Examples
8. Real Data
9. More on DPM
10. BART Papers

1. Basic BART Ideas

BART stands for Bayesian Additive Regression Trees. The original BART model (Chipman, George, and McCulloch) is:

Y_i = f(x_i) + ε_i, ε_i ~ N(0, σ²), iid,

where the function f is represented as the sum of many regression trees.

BART was inspired by the Boosting literature, in particular the work of Jerry Friedman. The connection to boosting is obvious in that the model is based on a sum of trees. However, BART is a fundamentally different algorithm with some consequent pros and cons.

BART is a Bayesian MCMC procedure. We:

put a prior on the model parameters (f, σ),
run a Markov chain with state (f, σ) whose stationary distribution is the posterior of (f, σ) | D, where D = {(x_i, y_i)}_{i=1}^n,
examine the draws as a representation of the full posterior.

In particular, we can look at marginals of σ and of f(x) at any given x.

Note: at the d-th MCMC iteration we have (f_d, σ_d).

For σ we can simply look at the sequence of draws σ_d. We can't just "look at" the f_d draws, since each f_d is a function. But for any x we can look at {f_d(x)}. For example, f̂(x) could be the average of the numbers {f_d(x)}, which is our MCMC estimate of the posterior mean of the random variable f(x) | D.

2. A Simple Simulated Example

Simulate data from the model:

Y_i = x_i³ + ε_i, ε_i ~ N(0, σ²) iid

n = 100
sigma = .1
f = function(x) {x^3}
set.seed(14)
x = sort(2*runif(n)-1)
y = f(x) + sigma*rnorm(n)
xtest = seq(-.95,.95,length.out=20)

Here, xtest will be the out of sample x values at which we wish to infer f or make predictions.

plot(x,y)
points(xtest,rep(0,length(xtest)),col="red",pch=16)

[Figure: y vs. x; the red points mark the xtest values.]

library(BART)
rb = wbart(x,y,xtest)
length(xtest)
[1] 20
dim(rb$yhat.test)
[1] 1000   20

The (d, j) element of yhat.test is f_d evaluated at the j-th value of xtest: 1,000 draws of f, each evaluated at the 20 xtest values.

plot(x,y)
lines(xtest,xtest^3,col="blue")
lines(xtest,apply(rb$yhat.test,2,mean),col="red")
qm = apply(rb$yhat.test,2,quantile,probs=c(.025,.975))
lines(xtest,qm[1,],col="red",lty=2)
lines(xtest,qm[2,],col="red",lty=2)

[Figure: the data with the true f (blue), the posterior mean of f (red), and pointwise 95% intervals (dashed red).]

[Figure: y vs. x, n = 5.]

{σ_d} draws. There are 100 draws counted as burn-in plus 1,000 additional draws. In all our previous f(x) inference, we dropped the first 100 iterations.

[Figure: rb$sigma plotted against the draw index.]

You can see that it looks burned in after the first 100 draws.

3. Out of Sample Prediction

Did out of sample predictive comparisons on 42 data sets (thanks to Wei-Yin Loh!!), with p = 3 to 65 and n = 100 to 7,000.

For each data set:
+ 20 random splits into 5/6 train and 1/6 test
+ use 5-fold cross-validation on train to pick hyperparameters (except BART-default!)
+ gives 20*42 = 840 out-of-sample predictions; for each prediction, divide the rmse of each method by the smallest
+ each boxplot represents 840 relative rmses for a method; a value of 1.2 means you are 20% worse than the best (a sketch of this bookkeeping follows below)
+ BART-cv best
+ BART-default (use default prior) does amazingly well!!

[Figure: boxplots of relative rmse for Random Forests, Neural Net, Boosting, BART-cv, BART-default.]
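The following is a minimal sketch (not code from the talk) of the relative-rmse bookkeeping behind the boxplots; the rmse matrix of out-of-sample errors per (data set, split) and method is a hypothetical stand-in for the actual results.

set.seed(1)
methods <- c("RandomForests", "NeuralNet", "Boosting", "BART-cv", "BART-default")
# hypothetical stand-in: one row per (data set, split) pair, one column per method
rmse <- matrix(abs(rnorm(840 * length(methods), mean = 1, sd = .2)),
               ncol = length(methods), dimnames = list(NULL, methods))
rel <- rmse / apply(rmse, 1, min)              # divide each row's rmses by the best (smallest)
boxplot(rel, ylab = "rmse relative to best")   # 1.2 means 20% worse than the best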

4. The BART Model and Prior

Regression Trees: First, we review regression trees to set the notation for BART. Note, however, that even in the simple regression tree case, our Bayesian approach is very different from the usual CART type approach. The model will have parameters and corresponding priors.

Regression Tree: Let T denote the tree structure including the decision rules. Let M = {µ_1, µ_2, ..., µ_b} denote the set of bottom node µ's. Let g(x; θ), θ = (T, M), be a regression tree function that assigns a µ value to x.

[Tree diagram: first split on x5 < c vs. x5 ≥ c; the x5 < c branch splits on x2 < d vs. x2 ≥ d; bottom node values µ_1 = -2, µ_2 = 5, µ_3 = 7.]

A single tree model: y = g(x; θ) + ε.

A coordinate view of g(x; θ):

[Figure: the same tree alongside its partition of the (x2, x5) plane, with regions taking the values µ_1 = -2, µ_2 = 5, µ_3 = 7.]

Easy to see that g(x; θ) is just a step function.

Here is an example of a simple tree with one x fit using standard CART methodology.

Here is an example with 2 x variables.

And here is the corresponding function (our g).

What's Boosting???

For numeric y:

(i) Set f̂(x) = 0 and r_i = y_i for all i in the training set.

(ii) For b = 1, 2, ..., B, repeat:
  - Fit a tree f̂^b with d splits (d + 1 terminal nodes) to the training data (X, r).
  - Update f̂ by adding in a shrunken version of the new tree: f̂(x) ← f̂(x) + λ f̂^b(x).
  - Update the residuals: r_i ← r_i − λ f̂^b(x_i).

(iii) Output the boosted model: f̂(x) = Σ_{b=1}^B λ f̂^b(x).

(A sketch of this recipe in R follows below.)

An Introduction to Statistical Learning, James, Witten, Hastie, Tibshirani.
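Here is a minimal sketch of the recipe above in R, using rpart stumps on the simulated (x, y) from Section 2; the function name and settings (boost_fit, B, lambda, d) are illustrative, not from the talk.

library(rpart)
boost_fit <- function(x, y, B = 500, lambda = 0.1, d = 1) {
  r <- y
  trees <- vector("list", B)
  for (b in 1:B) {
    # fit a small tree (maxdepth = d) to the current residuals
    fit <- rpart(r ~ x, data = data.frame(x = x, r = r),
                 control = rpart.control(maxdepth = d, cp = 0))
    trees[[b]] <- fit
    r <- r - lambda * predict(fit)        # update the residuals
  }
  # the boosted model: sum over b of lambda * fhat^b(x)
  function(xnew) lambda * rowSums(sapply(trees, predict,
                                         newdata = data.frame(x = xnew)))
}
fhat <- boost_fit(x, y)
plot(x, y); lines(xtest, fhat(xtest), col = "green")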

"... it is rather amazing that an ensemble of trees leads to the state of the art in black-box predictors!"

Bradley Efron and Trevor Hastie, Computer Age Statistical Inference, chapter 17.

The BART Model

Y = g(x; T_1, M_1) + g(x; T_2, M_2) + ... + g(x; T_m, M_m) + σz, z ~ N(0,1)

[Cartoon: m trees, each with its own bottom-node µ's.]

m = 200, 1000, ..., big, ....

f(x) is the sum of all the corresponding µ's, one from a bottom node of each tree. Such a model combines additive and interaction effects.

All parameters but σ are unidentified!!!!

... the connection to Boosting is obvious ... But:

Rather than simply adding in fit in an iterative scheme, we will explicitly specify a prior on the model which directly impacts the performance.

We will have an MCMC which infers each tree model in the sum. In particular, the depth of each tree is inferred.

Complete the Model with a Regularization Prior

π(θ) = π((T_1, M_1), (T_2, M_2), ..., (T_m, M_m), σ)
     = π(σ) Π_{j=1}^m π(T_j) π(M_j | T_j).

Have to specify: π(σ), π(T), π(M | T).

π wants:
- each T small,
- each µ small,
- a "nice" σ (smaller than the least squares estimate).

We refer to π as a regularization prior because it restrains the overall fit. In addition, it keeps the contribution of each g(x; T_i, M_i) model component small.

Prior on T

We specify a process we can use to draw a tree from the prior. The probability that a current bottom node, at depth d, gives birth to a left and right child is

α / (1 + d)^β.

The usual BART defaults are α = base = .95, β = power = 2. This makes non-null but small trees likely.

[Figure: implied prior on the number of bottom nodes (nbottom).]

Splitting variables and cutpoints are drawn uniformly from the set of available ones.
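A quick illustration (mine, not from the slides) of how fast this birth probability decays with depth under the defaults:

alpha <- 0.95; beta <- 2            # base and power defaults
d <- 0:4                            # node depth
round(alpha / (1 + d)^beta, 3)      # 0.950 0.238 0.106 0.059 0.038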

Prior on M

Let θ denote all the parameters.

f(x | θ) = µ_1 + µ_2 + ... + µ_m,

where µ_i is the µ in the bottom node x falls to in the i-th tree.

Let µ_i ~ N(0, τ²), iid. Then f(x | θ) ~ N(0, m τ²).

In practice we often, unabashedly, use the data by first centering and then choosing τ so that f(x | θ) ∈ (y_min, y_max) with high probability. This gives τ ∝ 1/√m.
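A minimal sketch of one common way to implement this choice (assuming the k-sigma rule used in the BART papers, with k = 2 covering roughly 95% of the prior mass); the numbers here use the simulated y from Section 2.

k <- 2; m <- 200                      # k sds of coverage, number of trees
# after centering y, choose tau so f(x) lands in (ymin, ymax) with high probability:
tau <- (max(y) - min(y)) / (2 * k * sqrt(m))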

Prior on σ

σ² ~ ν λ / χ²_ν.    Default: ν = 3.

λ: Get a reasonable estimate σ̂ of σ, then choose λ to put σ̂ at a specified quantile of the σ prior. Default: quantile = .9.

Default: if p < n, σ̂ is the usual least squares estimate, else sd(y).
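A minimal sketch of the λ calculation under this prior (my reconstruction: with σ² ~ νλ/χ²_ν, putting σ̂ at the q quantile of the prior gives λ = σ̂² qchisq(1 − q, ν)/ν), using the simulated data from Section 2:

nu <- 3; q <- 0.9
sigmahat <- summary(lm(y ~ x))$sigma                # least squares estimate of sigma
lambda <- sigmahat^2 * qchisq(1 - q, df = nu) / nu
# check: P(sigma <= sigmahat) under the prior should be about q
mean(sqrt(nu * lambda / rchisq(1e5, df = nu)) <= sigmahat)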

[Figure: prior density of σ; solid blue line at σ̂.]

Conjecture: most failures of BART are due to this default.

5. BART MCMC

Y = g(x; T_1, M_1) + g(x; T_2, M_2) + ... + g(x; T_m, M_m) + σz, z ~ N(0,1)

First, it is a simple Gibbs sampler:

(T_i, M_i) | (T_1, M_1, ..., T_{i-1}, M_{i-1}, T_{i+1}, M_{i+1}, ..., T_m, M_m, σ)
σ | (T_1, M_1, ..., T_m, M_m)

To draw σ we subtract all the trees off to get the residuals ε_i = y_i − f(x_i). To draw (T_i, M_i) we subtract the contributions of the other trees from both sides to get a simple one-tree model. We integrate out M to draw T and then draw M | T.
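Below is a minimal sketch of this backfitting bookkeeping (not the real sampler): the tree update is replaced by a plain rpart fit to the partial residuals as a stand-in for the Metropolis-Hastings/Gibbs draw of (T_i, M_i), while the σ draw uses the conjugate full conditional σ² | rest ~ (νλ + Σ e_i²)/χ²_{ν+n}.

library(rpart)
m <- 50; nu <- 3; lambda <- 0.01; n <- length(y)
fits <- matrix(0, n, m)              # current fit of each of the m trees at the data
sigma <- sd(y)
for (iter in 1:100) {
  for (i in 1:m) {
    # partial residuals: subtract the contributions of the other trees
    ri <- y - rowSums(fits[, -i, drop = FALSE])
    # stand-in for the Bayesian draw of (T_i, M_i): a small tree fit to ri
    ti <- rpart(ri ~ x, data = data.frame(x = x, ri = ri),
                control = rpart.control(maxdepth = 2, cp = 0))
    fits[, i] <- predict(ti)
  }
  e <- y - rowSums(fits)             # subtract all trees off to get the residuals
  sigma <- sqrt((nu * lambda + sum(e^2)) / rchisq(1, df = nu + n))
}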

To draw T we use a Metropolis-Hastings within Gibbs step. We use various moves, but the key is a birth-death step:

birth => propose a more complex tree
death => propose a simpler tree

32 Y = g(x;t 1,M 1 ) g(x;t m,m m ) + & z plus #((T 1,M 1 ),...(T m,m m ),&) Connections to Other Modeling Ideas: Bayesian Nonparametrics: - Lots of parameters to make model flexible. - A strong prior to shrink towards a simple structure. - BART shrinks towards additive models with some interaction. Dynamic Random Basis: - g(x; T 1, M 1 ), g(x; T 2, M 2 ),..., g(x; T m, M m ) are dimensionally adaptive. Gradient Boosting: - Overall fit becomes the cumulative effort of many weak learners. 30

Why does it work???

Build up the fit by adding up tiny bits of fit...

Boosting: Freund and Schapire, Jerome Friedman.

Note: I really want to be able to pick a (data based) default prior so I can put out my R package and people can get good results without too much effort. Contrast this with Deep Neural Nets, which are hard to fit.

But you can pretty easily choose a prior for f(x) and σ!!! Contrast this with Deep Neural Nets, where it is very hard to think about the prior.

6. Fully Nonparametric BART

BART: Y_i = f(x_i) + ε_i, ε_i ~ N(0, σ²), where f is a sum of trees.

- Normal errors are embarrassing.
- The prior on σ is flawed.
- Normal errors may lead to influential observations and poorly calibrated predictive intervals.

Obvious Solution: Use DPM (Dirichlet Process Mixtures) in the classic Escobar and West manner to model the errors nonparametrically.

Tried this in the past with mixed success. The DPM stuff is tricky... not at all obvious that you can get away with flexible f and flexible errors!!!

The Goal: Goes in the R package so people can use it with automatic priors and reliably get sensible results.

The MCW crowd (Prakash is a long-time nonparametric Bayesian) have a lot of experience with DPM.

Prakash has recent work on choosing priors for DPM: Low Information Omnibus (LIO) Priors for Dirichlet Process Mixture Models (Yushu Shi, Michael Martens, Anjishnu Banerjee, and Purushottam Laud).

Cautiously optimistic that we have a scheme that is close to working.

DPMBART

Y_i = f(x_i) + µ_i + σ_i Z_i, Z_i ~ N(0, 1):

each observation gets to have its own (µ_i, σ_i).

But the DPM machinery allows us to uncover a set of pairs (µ_j, σ_j), j = 1, 2, ..., I, such that for each i, (µ_i, σ_i) = (µ_j, σ_j) for some j. In our real example, n = 1,479 and I ≈ 100.

Even though each observation can have its own (µ_i, σ_i), subsets of the observations share the same (µ, σ), so there is a relatively small number of unique values.

Markov Chain Monte Carlo (MCMC):

{(µ_i, σ_i)} | f,   then   f | {(µ_i, σ_i)}.

At each draw d we have f_d and {(µ_i^d, σ_i^d)}, i = 1, 2, ..., n, where at each draw many of the (µ, σ) pairs are repeats.

For example, f̂(x) = (1/D) Σ_{d=1}^D f_d(x).

Connection to Mixture of Normals

At each draw d we have f and {(µ_i, σ_i)}, i = 1, 2, ..., n.

Let {(µ_j, σ_j)}, j = 1, 2, ..., I, be the unique (µ, σ) pairs, and let

p_j = #[(µ_i, σ_i) = (µ_j, σ_j)] / n.

Then

ε ~ Σ_{j=1}^I p_j N(µ_j, (σ_j)²).
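A small sketch (with a made-up toy draw, not data from the talk) of how one draw's (µ_i, σ_i) values collapse to the unique pairs, the weights p_j, and the implied mixture error density:

mu  <- c(0, 0, 1, 1, 1, -2)                     # hypothetical draw, n = 6
sig <- c(1, 1, 2, 2, 2, 0.5)
u <- unique(data.frame(mu, sig))                # the I unique (mu, sigma) pairs
p <- tabulate(match(paste(mu, sig), paste(u$mu, u$sig)), nbins = nrow(u)) / length(mu)
err_dens <- function(e) colSums(p * sapply(e, dnorm, mean = u$mu, sd = u$sig))
curve(err_dens(x), -5, 5, ylab = "mixture density")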

7. Simulated Examples

Simulated data with t_20 (essentially normal) errors.

[Figure: y vs. x with the top 5% of the µ_i and the top 5% of the σ_i highlighted; dpmbart f̂ and bart f̂ overlaid.]

[Figure panels: alpha draws by draw number; draws of the number of unique (mu, sigma) pairs; E(mu_i) vs. |y − f(x)|; E(mu_i) vs. E(sigma_i).]

Inference for the error distribution:

[Figure panels: dpmbart error distribution inference with pointwise 95% intervals; dpm inference from the true errors with pointwise 95% intervals; dpm, dpmbart, and bart error distribution inference together (dpm from true errors); dpmbart and density smooths of the true errors (adjust = .5 and 1). The true t density is shown in each panel.]

Simulated data with t_3 errors.

[Figure: y vs. x with the top 5% of the µ_i and the top 5% of the σ_i highlighted; dpmbart f̂ and bart f̂ overlaid.]

[Figure panels: alpha draws by draw number; draws of the number of unique (mu, sigma) pairs; E(mu_i) vs. |y − f(x)|; E(mu_i) vs. E(sigma_i).]

Inference for the error distribution:

[Figure panels: dpmbart error distribution inference with pointwise 95% intervals; dpm inference from the true errors; dpm, dpmbart, and bart together (dpm from true errors); dpmbart and density smooths of the true errors (adjust = .5 and 1). The true t density is shown in each panel.]

Three basic examples: t20, t3, skewed.

If the error is close to normal, then dpmbart is close to bart. If the error is non-normal, dpmbart is much closer to the truth, but shrunk a bit towards bart.

In these examples, f̂ for dpmbart and bart are pretty much the same, but with lower signal/sample sizes this does not have to be the case.

8. Real Data

Using one month of a much larger data set I am working on.

y: return on a cross-section of firms.
x: things about the firm measured the previous month.

[Figure: y plotted against observation index.]

Multiple regression results:

Coefficients:        Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                           **
r1_                                            e-07   ***
r12_                                           e-05   ***
idiosyncraticvol
seasonality
industrymom
ln_turn                                               ***
me                                                    **
an_cbprofitability
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05

Residual standard error: on 1470 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 8 and 1470 DF, p-value: 7.626e-15

It's like looking for a needle in a haystack!!!

Compare the f̂'s: linear, bart, dpmbart.

[Figure: pairwise scatterplots of the linear, bart, and dpmbart fitted values.]

dpmbart is a little different from bart because it is not pulled around by the outliers???

Note: the errors are now the residuals from the multiple regression, since we don't have the y − f(x) we had for the simulated data.

[Figure panels: alpha draws by draw number; draws of the number of unique (mu, sigma) pairs; E(mu_i) vs. |lm error|; E(mu_i) vs. E(sigma_i).]

[Figure: dpmbart error distribution inference with pointwise 95% intervals, compared with bart.]

9. More on DPM

Prior on α:

Used the construction of Conley, Hanson, McCulloch, and Rossi: a discrete distribution for α; you get to pick (I_min, I_max), the range for the number of unique θ values.

Default was I_min = 1, I_max = .1 n.

In our examples, draws of α bumped up against the upper limit. This could be good in that we want the prior conservative.

(µ, τ):

τ: For τ = 1/σ² we used an approach similar to the BART default, but we tighten it up a bit.

σ² ~ ν λ / χ²_ν,  ν = 2 α_o,  λ = β_o / α_o.

ν: bart: ν = 3; dpmbart: ν = 10.
bart: choose λ to put σ̂ at quantile = .9; dpmbart: quantile = .95.
The bart default gets σ̂ from the multiple regression.

µ:

µ ~ √λ k_o t_ν.

Let e_i be the residuals from the multiple regression. Let k_s be the scaling for the µ marginal. Let k_o solve

max |e_i| = k_s √λ k_o.

Default: k_s = 10.

Comments:

You can't be too diffuse on the base measure.

Would prefer not to extend the hierarchy and put priors on the base hyperparameters (a common practice).

The BART default depends on the standard deviation of the regression residuals; DPMBART depends on the sd of the residuals and the overall scale of the residuals.

k_s = 10 may seem large, but you don't have to cover the residual range: as µ gets bigger, σ gets bigger, and you can't be too spread out.

We would be happy to keep the dpm prior somewhat conservative, in that we nail the normal error case but miss slightly on the non-normal cases: DO NO HARM.

10. BART Papers

Log-Linear Bayesian Additive Regression Trees for Categorical and Count Responses, Jared Murray.

Bayesian regression trees for high-dimensional prediction and variable selection, Tony Linero.

Posterior Concentration for Bayesian Regression Trees and Their Ensembles, Rockova and van der Pas.

Nonparametric survival analysis using Bayesian Additive Regression Trees (BART), Rodney Sparapani, Brent Logan, Robert McCulloch, and P. Laud.

Accelerated Bayesian Additive Regression Trees, Jingyu He, Saar Yalov, and P. R. Hahn.

Heteroscedastic BART via Multiplicative Regression Trees, M. T. Pratola, H. A. Chipman, E. I. George, and R. McCulloch.

Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects, P. Richard Hahn, Jared S. Murray, and Carlos M. Carvalho.

High-dimensional nonparametric monotone function estimation using BART, H. A. Chipman, E. George, R. McCulloch, and T. S. Shively.
