
Transportation Research Part B 44 (2010) 686–698

Bayesian flexible modeling of trip durations

Hugh Chipman a, Edward George b, Jason Lemp c, Robert McCulloch d,*

a Department of Mathematics and Statistics, Acadia University, Wolfville, NS, Canada
b Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA, United States
c Department of Civil, Architectural, and Environmental Engineering, The University of Texas at Austin, TX, United States
d IROM Department, McCombs School of Business, The University of Texas at Austin, TX, United States

* Corresponding author. Address: IROM Department, McCombs School of Business, The University of Texas at Austin, 1 University Station, B6500, Austin, TX, United States. E-mail address: robert.mcculloch1@gmail.com (R. McCulloch).

Article history: Received 18 January 2010; Accepted 19 January 2010.
Keywords: Markov Chain Monte Carlo; Boosting; Ensemble modeling

Abstract. Recent advances in Bayesian modeling have led to stunning improvements in our ability to flexibly and easily model complex high-dimensional data. Flexibility comes from the use of a very large number of parameters without fixed dimension. Priors are placed on the parameters to avoid over-fitting and to sensibly guide the search in model space for appropriate data-driven model choice. Modern computational, high-dimensional search methods (in particular Markov Chain Monte Carlo) then allow us to search the parameter space. This paper introduces the application of BART, Bayesian Additive Regression Trees, to modeling trip durations. We have survey data on characteristics of trips in the Austin area. We seek to relate the trip duration to features of the household and trip characteristics. BART enables one to make inferences about the relationship with minimal assumptions and user decisions.
© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

The workhorse model of applied statistics is the multiple regression model. Multiple regression allows us to relate a single y to many x variables. This is often the goal in applied work. However, multiple regression makes the fundamental assumption of a linear relationship between y and x. With many x variables, this assumption may not be tenable and it is hard to check. Good researchers are trained to use diagnostic plots to assess model adequacy. In case of problems, a wide variety of transformations of both the dependent and independent variables are available in the statistics literature. In practice, we often fail to carefully check the model. If problems are found with the linear specification, the choice of possible transformations is so overwhelming that most applied workers limit themselves, quite reasonably, to a few transformations, such as taking the log of y and using polynomial-type terms for some of the explanatory variables. Even with a moderate number of explanatory variables the task of searching for a reasonable specification quickly becomes overwhelming.

In this paper we illustrate the use of BART, Bayesian Additive Regression Trees (Chipman et al., 2006, 2010), with particular emphasis on its role and impact on transportation research. BART combines recent advances in Bayesian modeling with ideas from machine learning to sensibly search the (potentially) high-dimensional space of possible models relating y to a high-dimensional x. The model is estimated for trip duration data from Austin, Texas. The goal of the study is to investigate how the reported time to take a trip in an automobile (y) depends on characteristics of the trip and the people making the trip (the x's). Abou Zeid et al. (2006) and Popuri et al.
(2008) modeled trip durations using classical linear regression techniques. Special considerations were needed for entering the time-of-day variable in the model. Clearly, travel times will not vary in a linear way from hour to hour.

Abou Zeid et al. (2006) and Popuri et al. (2008) both employed collections of sinusoidal functions of departure time in the hope that such functions would be able to approximate the relationship between travel times and departure time. With BART, there is no need to experiment with different transformations of the explanatory variables. BART automatically detects and models nonlinear relationships between dependent and explanatory variables, including interactions between explanatory variables.

The remainder of the paper is organized as follows: Section 2 describes the BART model. In Section 3 the Markov Chain Monte Carlo (MCMC) algorithm used to search the model space is outlined. Section 4 illustrates the use of BART in modeling trip-duration data. Section 5 concludes.

2. The BART model

The model consists of two parts: a sum-of-trees model, called BART (Bayesian Additive Regression Trees), and a regularization prior.

2.1. A sum-of-trees model

The central element of our model is a regression tree, a predictive model that seeks to accomplish the same task as linear regression: predict a response y given the values of a vector of independent variables x = (x1, ..., xp). What distinguishes the regression tree from a linear regression model is how the regression tree generates the prediction. An illustration is given in Fig. 1, with x = (x1, x2). The tree consists of a root node containing a question about one of the independent variables, here whether x2 < 1. Depending on the answer to this question, we would follow the left (x2 < 1) or right (x2 ≥ 1) branch of the tree, arriving at a child node. To generate a prediction, we continue branching based on our value of x until a terminal node is reached, and an output μ is returned. The output parameter μb in terminal node b plays the role of the predicted response in regression.

The tree model partitions the x space into rectangular regions, and associates a single predicted value of the response y with each region. In Fig. 1, the tree partitions the (x1, x2) space into three rectangular regions, and produces outputs of 0.1, 0.8 or 0.3, depending on the region in which a value x falls. This particular tree represents an interaction between x1 and x2, since the relationship between y and x1 is constant (the value 0.3) if x2 ≥ 1, but changes as a function of x1 (the values 0.1 and 0.8) if x2 < 1.

A tree model must be estimated from data, similar to the coefficients of a linear regression. We must estimate the tree structure itself (e.g. splitting rules like x2 < 1 associated with interior nodes) and the terminal parameters associated with the tree (the μ's). Estimation will be discussed in Section 3.

To develop a sum-of-trees model, we establish notation that represents the informal description of a single tree model. Let T denote a binary tree consisting of a set of interior node decision rules (questions) and a set of terminal nodes, and let M = {μ1, μ2, ..., μB} denote a set of parameter values associated with each of the B terminal nodes of T. Prediction for a particular value of the input vector x is accomplished as follows: if x is associated with terminal node b of T by the sequence of decision rules from top to bottom, it is then assigned the value μb. We use g(x; T, M) to denote the function corresponding to (T, M) which assigns a μb ∈ M to x.
Fig. 1. An illustration of a tree model (a) and its predictions (b). In panel (a) the root node splits on x2 < 1 and its left child splits on x1 < 1.5; the terminal node values are μ1 = 0.1, μ2 = 0.8 and μ3 = 0.3. Panel (b) shows the corresponding partition of the (x1, x2) space.
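To make the tree-as-function idea concrete, here is a minimal R sketch of the tree in Fig. 1 written out as the prediction rule g(x; T, M); the function name g_fig1 is ours, introduced only for illustration.

```r
# The single tree of Fig. 1 as a prediction function g(x; T, M):
# the splitting rules form the tree T, the terminal-node values (0.1, 0.8, 0.3) form M.
g_fig1 <- function(x1, x2) {
  if (x2 >= 1) return(0.3)   # mu_3: right branch of the root (x2 >= 1)
  if (x1 < 1.5) return(0.1)  # mu_1: x2 < 1 and x1 < 1.5
  0.8                        # mu_2: x2 < 1 and x1 >= 1.5
}

g_fig1(x1 = 0.5, x2 = 0.5)   # 0.1
g_fig1(x1 = 1.8, x2 = 0.5)   # 0.8
g_fig1(x1 = 1.8, x2 = 1.5)   # 0.3
```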

Using this notation, our sum-of-trees model can more explicitly be expressed as

Y = g(x; T1, M1) + g(x; T2, M2) + ... + g(x; Tm, Mm) + ε,   ε ~ N(0, σ²).   (1)

Thus, our model is of the form Y = f(x) + ε, where the function f is flexibly represented as f(x) = g(x; T1, M1) + g(x; T2, M2) + ... + g(x; Tm, Mm). In a single tree model, the conditional mean of Y given x is composed of a single μ value associated with one terminal node (i.e. the output of one g). Unlike the single tree model, the sum-of-trees model (1) uses m different μ values to compose the conditional mean of Y given x. Such terminal node parameters will represent interaction effects when their assignment depends on more than one component of x (i.e., more than one independent variable). Because (1) may be based on trees of varying sizes, the sum-of-trees model can incorporate both direct effects and interaction effects of varying orders. In the special case where every terminal node assignment depends on just a single component of x, the sum-of-trees model reduces to a simple additive function.

In the machine learning literature the term ensemble is used to describe a collection of model pieces that add up to a bigger model. Thus, in the BART model the ensemble is the collection {g(·; Ti, Mi), i = 1, ..., m}. The overall intuition is that a good way to find the fit is by adding little bits at a time. There are different ensemble methods in the literature, with boosting (Freund and Schapire, 1997) being the lead example. BART is related to and partially motivated by boosting but has fundamental differences (see Section 3 below). With a large number of trees, a sum-of-trees model gains increased representational flexibility, which, when coupled with our regularization prior, gives excellent out-of-sample predictive performance. The default value for m, used in the application in this paper, is 200.

Note that with m large there are hundreds of parameters, of which only σ is identified. For example, swapping (T1, M1) for (T2, M2) in (1) gives a different parameterization but the same predictive model. This is not a problem for our Bayesian analysis as long as we use a proper prior. It just means that inferential statements cannot be made about individual μ and T parameters. Instead, we shall draw inferences on the predictions, that is, on the function f and on σ. This formulation can be considered as a mechanism for placing a prior distribution on functions, even though individual parameters are not identified. Indeed, this lack of identification is the reason our MCMC mixes well. Even when m is much larger than needed to capture f (effectively, we have an "over-complete basis"), the procedure still works well. One of the key reasons the procedure works well with so many parameters is an effective specification of prior distributions for these parameters. We explore this in the next section.
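A toy R sketch of the sum-of-trees structure in (1) may help fix ideas: the prediction is simply the sum of the outputs of m small trees, plus normal noise. The three stumps below are invented purely to show the structure; they are not estimated from anything.

```r
# A toy sum-of-trees model in the spirit of Eq. (1).
tree1 <- function(x1, x2) if (x2 < 1) 0.10 else 0.20
tree2 <- function(x1, x2) if (x1 < 1.5) -0.05 else 0.15
tree3 <- function(x1, x2) if (x1 < 0.5 && x2 < 0.5) 0.30 else 0.00
ensemble <- list(tree1, tree2, tree3)

f <- function(x1, x2) sum(sapply(ensemble, function(g) g(x1, x2)))
f(1.0, 0.5)                              # 0.10 - 0.05 + 0.00 = 0.05
set.seed(1)
y <- f(1.0, 0.5) + rnorm(1, sd = 0.1)    # one draw of Y given x = (1.0, 0.5)
```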
2.2. A regularization prior

In many Bayesian analyses, one seeks relatively uninformative prior distributions for unknown parameters in order to let the data speak for itself. In our model, however, there are so many free parameters that uninformative priors would give the data too much of a voice: the prediction for observation i would be the observed response yi, interpolating the training data perfectly. The enormous capacity of the model to represent the data must be reined in by prior distributions that control the model's adaptability. In machine learning, this process of constraining parameters is called regularization, hence our term regularization priors.

In this section we outline how to place a prior on each tree T and its terminal node parameters M. The complexity of the prior specification is vastly simplified by letting the Ti be a priori independent and identically distributed (i.i.d.), the μi,b (node b of tree i) be i.i.d. given all the T's, and σ be independent of all T and μ. Given these independence assumptions we need only choose priors for a single tree T, a single μ, and σ. Motivated by our desire to make each g(x; Ti, Mi) a small contribution to the overall fit, we put prior weight on small trees and small μi,b. In the machine learning literature, the individual g(x; Ti, Mi) are often called weak learners. They are learners in that each g(x; Ti, Mi) fits or learns something about the relationship between y and x. They are weak in that each g(x; Ti, Mi) makes a small contribution to the overall fit.

For the tree prior, we use the same specification as in Chipman et al. (1998). In this prior, the probability that a node is nonterminal is α(1 + d)^(−β), where d is the depth of the node. In all examples we use the same prior corresponding to the choice α = 0.95 and β = 2. With this choice, a root node (d = 0) has probability α = 0.95 of having children and a node at depth 1 has probability 0.95/(1 + 1)² = 0.2375 of having children. The corresponding probability distribution on tree size (number of terminal nodes) gives probabilities of 0.05, 0.55, 0.28, 0.09, and 0.03 for trees of size 1, 2, 3, 4, and ≥5. Note that even with this prior, trees with many terminal nodes can be grown if the data demands it. At any non-terminal node, the prior on the associated decision rule puts equal probability on each available independent variable and then equal probability on each available rule given the variable. Thus for the tree in Fig. 1, assuming only two predictors X1 and X2, each taking possible values 0, 0.1, 0.2, ..., 2.0, we have prior probability

  0.95              (root node is nonterminal)
  × 0.5             (split is on X2, one of two variables)
  × 0.05            (split is at 1 of 20 possible locations)
  × 0.2375          (left child is nonterminal)
  × 0.5             (left child splits on X1, one of two variables)
  × 0.05            (split is at 1 of 20 possible locations)
  × (1 − 0.1056)²   (its two children are terminal)
  × (1 − 0.2375)    (right child of the root node is terminal)
  ≈ 8.6 × 10⁻⁵.   (2)
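The same calculation can be reproduced in a few lines of R, which also makes the depth penalty of the tree prior explicit:

```r
# Prior probability that a node at depth d is nonterminal: alpha * (1 + d)^(-beta).
alpha <- 0.95; beta <- 2
p_split <- function(d) alpha * (1 + d)^(-beta)
round(p_split(0:2), 4)              # 0.9500 0.2375 0.1056 (root, depth 1, depth 2)

# Reproducing the prior probability (2) of the particular tree in Fig. 1.
prob_tree <-
  p_split(0) * 0.5 * (1 / 20) *     # root nonterminal, splits on X2, 1 of 20 cutpoints
  p_split(1) * 0.5 * (1 / 20) *     # left child nonterminal, splits on X1, 1 of 20 cutpoints
  (1 - p_split(2))^2 *              # its two children are terminal
  (1 - p_split(1))                  # right child of the root is terminal
prob_tree                           # approximately 8.6e-05
```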

For the prior on a μ, we first shift and rescale Y so there is high prior probability that E(Y|x) ∈ (−0.5, 0.5). We let μ ~ N(0, σμ²), where μ is the output of any one terminal node of any one tree. Given the Ti and an x, E(Y|x) is the sum of m independent μ's (recall Eq. (1)). The standard deviation of the sum is √m σμ. We must choose σμ so that the standard deviation of the sum, √m σμ, ensures high probability that E(Y|x) is in (−0.5, 0.5). We choose σμ so that 0.5 is within k standard deviations of zero: k √m σμ = 0.5. For example, if k = 2 there is a 95% (conditional) prior probability that the mean of Y is in (−0.5, 0.5). k = 2 is our default choice, and in practice we typically rescale the response y so that its observed values range from −0.5 to 0.5. Note that this prior increases the shrinkage of μi,b (toward zero) as m increases. As more trees are used in the ensemble, each one is permitted to contribute a smaller amount to the overall prediction.

For the prior on σ we start from the usual inverted-chi-squared prior: σ² ~ νλ/χ²_ν. To choose the hyper-parameters ν and λ, we begin by obtaining a rough overestimate σ̂ of σ. We then pick a degrees of freedom value ν between 3 and 10. Finally, we pick a value of q such as 0.75, 0.90 or 0.99, and set λ so that the qth quantile of the prior on σ is located at σ̂, that is P(σ < σ̂) = q. Fig. 2 illustrates the priors corresponding to three (ν, q) settings when the rough overestimate is σ̂ = 2. We refer to these three settings, (ν, q) = (10, 0.75), (3, 0.90) and (3, 0.99), as conservative, default and aggressive, respectively. For automatic use, we recommend the default setting (ν, q) = (3, 0.90), which tends to avoid extremes. Simple data-driven choices of σ̂ that we have used in practice are the estimate from a linear regression or the sample standard deviation of Y. Note that this prior choice can be influential. Strong prior beliefs that σ is very small could lead to over-fitting.

Fig. 2. Three priors on σ when σ̂ = 2: conservative (ν = 10, q = 0.75), default (ν = 3, q = 0.90) and aggressive (ν = 3, q = 0.99).
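A short R sketch of the two calibration rules just described may be helpful; it computes σμ from k and m, and solves P(σ < σ̂) = q for λ under the inverted-chi-squared prior, checking the result by simulation.

```r
# Calibrating sigma_mu: k * sqrt(m) * sigma_mu = 0.5.
m <- 200; k <- 2
sigma_mu <- 0.5 / (k * sqrt(m))
sigma_mu                                # about 0.0177: each tree contributes a small amount

# Inverted-chi-squared prior sigma^2 ~ nu * lambda / chisq_nu, with lambda chosen
# so that P(sigma < sigma_hat) = q for a rough overestimate sigma_hat.
nu <- 3; q <- 0.90; sigma_hat <- 2      # the "default" setting of Fig. 2
lambda <- sigma_hat^2 * qchisq(1 - q, df = nu) / nu

# Check the calibration by simulation:
sigma_draws <- sqrt(nu * lambda / rchisq(1e5, df = nu))
mean(sigma_draws < sigma_hat)           # approximately 0.90
```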

3. A back-fitting MCMC algorithm

Given the observed data y, our Bayesian setup induces a posterior distribution p((T1, M1), ..., (Tm, Mm), σ | y) on all the unknowns that determine a sum-of-trees model. Although the sheer size of this parameter space precludes exhaustive calculation, a back-fitting MCMC algorithm (Hastie and Tibshirani, 2000) can be used to sample from this posterior. At a general level, our algorithm is a Gibbs sampler. For notational convenience, let T(i) be the set of all trees in the sum except Ti, and similarly define M(i). The Gibbs sampler here entails m successive draws of (Ti, Mi) conditionally on (T(i), M(i), σ):

(T1, M1) | T(1), M(1), σ, y
(T2, M2) | T(2), M(2), σ, y
...
(Tm, Mm) | T(m), M(m), σ, y,   (3)

followed by a draw of σ from the full conditional:

σ | T1, ..., Tm, M1, ..., Mm, y.   (4)

The back-fitting MCMC algorithm repeatedly re-samples the parameters of each tree in the ensemble, conditional on the current parameter values of the other m − 1 trees. This approach has some similarities to, and differences from, the boosting algorithm of Freund and Schapire (1997). Boosting also produces an ensemble-of-trees model (1). The boosting algorithm also updates one tree conditional on all others, but it does so only once, rather than repeatedly resampling as in MCMC. This yields a single estimated model, rather than a posterior distribution on the model.

Evaluation of the full conditionals required for Gibbs sampling is simplified by rearranging (1). For example, to sample (T1, M1) | T(1), M(1), σ, y, we can write

Y − g(x; T2, M2) − ... − g(x; Tm, Mm) = g(x; T1, M1) + ε.

Given (T(1), M(1)) and σ, we may subtract the fit from (T(1), M(1)) from both sides of (1), leaving us with a single tree model, g(x; T1, M1), with known error variance. This draw may be made following the approach of Chipman et al. (1998). These methods draw (Ti, Mi) | T(i), M(i), σ, y as Ti | T(i), M(i), σ, y followed by Mi | Ti, T(i), M(i), σ, y. The idea is that we can draw a (T, M) by drawing from the marginal of T after integrating out M, and then from the conditional of M given the draw of T. The structure of the BART model and prior are carefully chosen to make this possible. The first draw is done by the Metropolis-Hastings algorithm after integrating out Mi, and the second is a set of normal draws. The draw of σ is easily accomplished by subtracting all the fit from both sides of (1), so that the errors are effectively observed. Given all the (Ti, Mi) we know f, so we can compute εi = yi − f(xi). The draw is then a standard inverted-chi-squared one, since our prior is conditionally conjugate. Subtracting off fits and fitting the residuals is often called backfitting. Our Gibbs sampler iteratively and stochastically backfits (Hastie and Tibshirani, 2000).

The Metropolis-Hastings draw of Ti | T(i), M(i), σ, y is complex and lies at the heart of our method. The algorithm of Chipman et al. (1998) proposes a new tree based on the current tree using one of four moves. The moves and their associated proposal probabilities are: growing a terminal node (0.25), pruning a pair of terminal nodes (0.25), changing a non-terminal rule (0.40), and swapping a rule between parent and child (0.10). Note that the grow and prune moves change the implicit dimensionality of the proposed tree in terms of the number of terminal nodes. Some readers may be more familiar with simulated annealing, a stochastic search algorithm that can be obtained by manipulation of the acceptance probabilities of the Metropolis-Hastings algorithm as it runs. Both algorithms have in common the proposal of a new state as a random perturbation of the current state, followed by a randomized accept/reject step.

We initialize the chain with m single node trees, and then iterations are repeated until satisfactory convergence is obtained. We illustrate convergence assessment by monitoring the σ draws in Section 4. At each iteration, each tree may increase or decrease its number of terminal nodes by one, or change one or two decision rules. Each μ will change (or cease to exist or be born), and σ will change. It is not uncommon for a tree to grow large and then subsequently collapse back down to a single node as the algorithm iterates.
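The following runnable R miniature illustrates the stochastic back-fitting loop under strong simplifications: the tree structures are held fixed (each "tree" is a stump with a preset split), so the Metropolis-Hastings tree moves are omitted and only the terminal-node μ's and σ are Gibbs-updated. It is an illustration of the loop structure, not the BART algorithm itself.

```r
set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

splits   <- seq(0.1, 0.9, by = 0.1)        # one fixed stump per split point
m        <- length(splits)                 # m = 9 "trees"
sigma_mu <- 0.5 / (2 * sqrt(m))            # the k = 2 rule (y is not rescaled here)
nu <- 3; lambda <- 0.1                     # inverted-chi-squared hyper-parameters
mus   <- matrix(0, nrow = m, ncol = 2)     # mus[i, 1]: left node, mus[i, 2]: right node
sigma <- sd(y)

fit_one <- function(i) ifelse(x < splits[i], mus[i, 1], mus[i, 2])
fit_sum <- function(idx) Reduce("+", lapply(idx, fit_one))

draw_node_mu <- function(r) {              # conjugate normal draw given partial residuals r
  prec <- length(r) / sigma^2 + 1 / sigma_mu^2
  rnorm(1, mean = (sum(r) / sigma^2) / prec, sd = sqrt(1 / prec))
}

n_iter <- 500
sigma_draws <- numeric(n_iter)
for (it in seq_len(n_iter)) {
  for (i in seq_len(m)) {
    partial <- y - fit_sum(seq_len(m)[-i]) # subtract the fit of the other m - 1 stumps
    left <- x < splits[i]
    mus[i, 1] <- draw_node_mu(partial[left])
    mus[i, 2] <- draw_node_mu(partial[!left])
  }
  eps   <- y - fit_sum(seq_len(m))                               # full residuals
  sigma <- sqrt((nu * lambda + sum(eps^2)) / rchisq(1, nu + n))  # conjugate sigma draw
  sigma_draws[it] <- sigma
}
plot(sigma_draws, type = "l")  # burn-in, then draws settling a little above the 0.3 noise sd
```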
The sum-of-trees model, with its abundance of unidentified parameters, allows the fit to be freely reallocated from one tree to another. Because each move makes only small incremental changes to the fit, we can imagine the algorithm as analogous to sculpting a complex figure by adding and subtracting small dabs of clay. Compared to the single tree model MCMC approach of Chipman et al. (1998), the back-fitting MCMC algorithm mixes dramatically better. When only single tree models are considered, the MCMC algorithm tends to quickly gravitate toward a single large tree and then gets stuck in a local neighborhood of that tree. In sharp contrast, we have found that restarts of the back-fitting MCMC algorithm give remarkably similar results even in difficult problems. Consequently, we run one long chain rather than multiple starts.

In some ways back-fitting MCMC is a stochastic alternative to boosting algorithms for fitting linear combinations of trees. It is distinguished by the ability to sample from a posterior distribution. At each iteration, we get a new draw

f* = g(x; T1*, M1*) + g(x; T2*, M2*) + ... + g(x; Tm*, Mm*)   (5)

corresponding to the draws {Tj*} and {Mj*}. These draws are a (dependent) sample from the posterior distribution on the true f. Rather than pick the best f* from these draws, the set of multiple draws can be used to further enhance inference. In contrast, boosting generates a single estimate of the model, rather than a sample of possible values. We estimate f(x) by the posterior mean of f(x), which is approximated by averaging the f*(x) over the draws. Further, we can gauge our uncertainty about the actual underlying f by the variation across the draws. For example, we can use the 5% and 95% quantiles of the f*(x) draws to obtain 90% posterior intervals for f(x).
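In R, these posterior summaries amount to column-wise means and quantiles of the matrix of draws. The sketch below assumes a matrix fdraws with one row per post burn-in MCMC draw and one column per x of interest (for instance, the yhat.train component returned by bart() in the BayesTree package is laid out this way); a random stand-in matrix is used so the lines run as written.

```r
fdraws <- matrix(rnorm(2000 * 5), nrow = 2000)        # stand-in: 2000 draws, 5 x's of interest

f_hat   <- colMeans(fdraws)                           # posterior mean estimate of f(x)
f_lower <- apply(fdraws, 2, quantile, probs = 0.05)   # 5% quantile of the draws
f_upper <- apply(fdraws, 2, quantile, probs = 0.95)   # 95% quantile of the draws
# (f_lower[j], f_upper[j]) is a 90% posterior interval for f at the j-th x.
```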

4. Fitting trip duration data with BART

4.1. The trip duration data

The goal of the study is to investigate how the reported time to take a trip in an automobile depends on characteristics of the trip and the people making the trip. Each observation in our data set corresponds to a trip, made by car, in the Austin area. Each trip is made by a person identified as the trip-maker from a household. Several variables are measured for each trip: we have variables describing the household, the trip-maker, and the trip itself. Variables describing the household are: number of people in the household, income, number of children under five, number of children between 5 and 15, and number of children of ages 16 or 17. Variables describing the trip-maker are: age, primary occupation and student status. Variables describing the trip are: month, day of the week, type of trip (e.g. "home based work trip"), departure time, number of households in the departure zone, number of households in the destination zone, retail employment in the departure zone, retail employment in the destination zone, free-flow distance, free-flow duration and trip duration. The free-flow variables are meant to capture the distance and time taken for such a trip under free-flow conditions.

Our dependent variable y is the log of the ratio of trip duration over free-flow trip duration. Trip duration is simply the reported time taken to complete the trip. Free-flow trip duration is an attempt to measure the time it would take to complete the trip if there were no traffic-related inhibitions. Our trip durations are really approximations of trip durations and suffer from rounding error. Trips with short distances are most prone to high relative error, since reported durations are often rounded to 5, 10, or even 15 min. A large y means high traffic congestion or travel delay, but not necessarily a long duration, since y compares the duration to the free-flow duration. Since the actual ratios are highly right-skewed, we take the log. This still gives us an interpretable quantity, as it is approximately the difference between the two durations in percentage terms. The histogram of our dependent variable is given in Fig. 3. Note that while we have transformed our dependent variable, there is no need to consider transformations of the explanatory variables when using BART. Our explanatory variables consist of the remaining 17 listed above. Fig. 4 displays the marginal distributions of three important independent variables.

Many of the trips are made by the same trip-maker, so a difficult issue arises with regard to independence assumptions. Even though BART allows for great flexibility in the form of f, the current formulation does make the basic assumption of i.i.d. normal errors. It may be that the errors from trips made by the same person are not exchangeable with those made by others. There is no obvious way to resolve this question without considerably more modeling. We randomly choose one trip from those made by each trip-maker. This gives us 3244 observations.
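A sketch of how the dependent variable and the one-trip-per-maker sample could be built follows. The data frame trips and its columns duration, ff_duration and maker_id are hypothetical stand-ins for the survey data just described.

```r
trips$y <- log(trips$duration / trips$ff_duration)    # log of duration over free-flow duration

set.seed(1)
one_per_maker <- do.call(rbind, lapply(split(trips, trips$maker_id), function(d)
  d[sample(nrow(d), 1), ]))                           # keep one randomly chosen trip per trip-maker
nrow(one_per_maker)                                   # 3244 in the data used here
hist(one_per_maker$y, xlab = "y = log(dur/ffd)", main = "")   # cf. Fig. 3
```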
4.2. BART results, all variables

In this section we report the results obtained by running BART with all 17 explanatory variables. Here we show how BART, relatively automatically, fits the patterns in the data. In the next section, we focus on a small number of variables and interpret the BART fit. We emphasize that all BART results are obtained simply by calling the function bart in the R package BayesTree. No decisions need be made about how to manipulate the information in the data. The default prior was used. A few fairly obvious decisions are made about how to run the Markov Chain, as noted below.

Fig. 5 shows the time series plot of the draws of σ from each iteration of the Markov Chain. The initial part of the plot, where the draws are declining, is the burn-in period of the Markov Chain. The algorithm stochastically searches the high-dimensional space representing the unknown f to find functions that fit the data well, with our prior stopping us from gravitating towards functions which overfit. After the σ draws level off, we are exploring the posterior. As the chain iterates, the variation in the current draw of f (as represented by the current (Ti, Mi)) explores the set of f which could plausibly have generated the observed data. We see that the chain burns in very quickly, so that after a few hundred draws we are estimating the posterior.
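A sketch of the kind of call behind Fig. 5, using bart() from the R package BayesTree with its default prior (200 trees, k = 2, and the (ν, q) = (3, 0.90) σ prior). The objects xdf (a data frame with the 17 explanatory variables, factors included) and y are assumed to have been built as above; argument and component names follow the BayesTree interface and should be checked against its documentation.

```r
library(BayesTree)
set.seed(1)
fit <- bart(x.train = xdf, y.train = y,
            ntree  = 200,       # trees in the sum (the default)
            ndpost = 2000,      # draws kept after burn-in
            nskip  = 300)       # burn-in draws discarded

plot(fit$sigma, type = "l")     # sigma draws, cf. Fig. 5
yhat <- fit$yhat.train.mean     # posterior mean of f at the training x's
cor(y, yhat)^2                  # in-sample R-squared (about 48% in this application)
mean(fit$sigma)                 # estimate of sigma (about 0.5 here)
```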

Fig. 3. The histogram of the log ratio of trip duration over free flow duration, y = log(dur/ffd).

Running the entire set of 2300 iterations took 226 s, or about 4 min (2.93 GHz, Core 2 Duo processor). Note that if we just needed a point estimate we could use a much shorter run. After discarding the burn-in draws, we estimate f(x) by simply averaging the fi(x), where i indexes the MCMC draws. If we do this for the x in our sample we obtain the BART fitted values. The squared correlation between y and the BART fits (or R²) is 48%. Similarly, we can estimate σ by the average of the post burn-in draws. Doing this we obtain σ̂ = 0.5. In Fig. 5 this can be seen as the value at which the draws level off. The marginal posterior distribution of σ would be given by the histogram of the post burn-in σ draws.

By comparison, a linear regression fit (in which all categorical variables are dummied up in the usual way) gives an estimate of σ of 0.58 and an R² of 28%. The horizontal line drawn in Fig. 5 is at this σ estimate. Clearly, no one would simply run a linear regression with this data. As noted, the departure time variable would require some kind of transformation at a minimum. Our point in making this comparison is that BART automatically seeks out reasonable functions without user input. With 17 independent variables, choosing which transformations to try and which interactions to include becomes a daunting task in model selection. In Section 4.4 we try a transformation strategy and compare the results to the BART fit. Of course the reader may wonder if BART has over-fit the data. In Chipman et al. (2010), evidence is given on 42 real data sets that the out-of-sample predictive performance of BART using the default prior is as good as or better than that of leading data-mining techniques tuned using cross-validation. The default BART prior regularizes the fit so that we do not build too complex a model given our data.

4.3. Interpreting BART results

If all we want to do is use BART as a predictive device, there is no problem. The prediction of y given x = x* may be given by the average of the fi(x*) values, where i indexes post burn-in draws of f. However, we often want to learn about the nature of f. How is y related to x? In this case the lack of interpretability of the sum-of-trees representation of f becomes an issue. All flexible approaches (e.g., neural nets) have this problem. Note also that those who think they can interpret the results of a linear regression with a large number of transformations, interactions, and a blizzard of associated t-values are usually fooling themselves. Methods for extracting interpretable information about f are the subject of ongoing research, but we illustrate a few possible approaches.

A simple approach to variable selection is to see how often a variable is used in the sum-of-trees representation (see Chipman et al. (2010)). For each draw, we compute the fraction of tree decision rules that use a given variable and then average that fraction over the MCMC draws. Using this criterion, the variable which stands out the most is free flow distance. This variable is used, on average, in 19% of the tree decision rules. The next two variables are trip type (4%) and departure time (3.4%). There are some subtle issues involving the choice of prior when using BART for variable selection. Chipman et al. (2010) recommend using fewer trees in the sum when doing variable selection than when using BART for prediction. The percentages above were calculated from a BART run using 20 trees in the sum, while the σ draws displayed in Fig. 5 were obtained from a run using 200 trees in the sum (200 is the default choice).
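A sketch of the variable-use calculation just described, run with a smaller ensemble (ntree = 20) as suggested for variable selection. It assumes the varcount component of the bart() output in BayesTree: a matrix with one row per draw and one column per variable, counting the splitting rules that use each variable.

```r
fit20 <- bart(x.train = xdf, y.train = y, ntree = 20, ndpost = 2000, nskip = 300)
use_frac <- fit20$varcount / rowSums(fit20$varcount)   # per-draw fraction of rules using each variable
sort(colMeans(use_frac), decreasing = TRUE)            # averaged over draws; ff_dist leads at about 19%
```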
In order to simplify the analysis, we reran BART using just the independent variables free flow distance, trip type, and departure time, as suggested by our variable selection analysis. Note that our variable selection was done without making strong assumptions about the nature of f. Even with just these three variables, consider the difficulties involved in using a multiple regression based approach. What transformations should be applied to the two numeric variables free flow distance and departure time? The categorical variable trip type has 10 different levels.

Fig. 4. Marginal distributions of free flow distance (ff_dist), trip type (triptype), and departure time (deptime). Distance is measured in miles, departure time with a 24 h daily clock, and trip type is categorical with ten different levels. For example, the first level hw denotes a home based work trip.

What possible interaction terms should be considered for inclusion in the model? If a large number of transformations and interactions are considered, how will the model selection be done? Given a selected model, how do we express our uncertainty? How relevant are the usual t-values given the model search and the possible inclusion of interaction terms? An illustration of a transformation based approach is given in Section 4.4.

Sophisticated users of multiple regression know that the dependence between independent variables affects the inference in important ways. Usually, we think in terms of collinearity, or linear dependence between x variables. Fig. 6 explores the relationship between two of our x variables, departure time and trip type. The left histogram displays departure times for work trips and the right histogram displays departure times for retail trips. Thus, the figure shows some of the dependence between trip type and departure time. Clearly, the dependence is strong and of an unusual type (i.e., with a varying number of modes) which could not possibly be captured by linear thinking. The BayesTree package includes functions for partial dependence plots. These plots can aid in interpreting the effect of individual x variables on y. However, when there is complex dependence between x variables and f is not additive, they can be misleading. Thus, they are not a good choice for our trip duration analysis.

Fig. 7 compares x values that have a large BART estimate of f with those that have a small estimate. The three rows in the figure correspond to our three variables free flow distance, trip type, and departure time, going from top to bottom. The left hand plots use only those observations such that f̂(x) is in the bottom 10% and the right hand plots use only those observations where the fit is in the top 10%. The left hand plots tell us what kinds of x give us a small y and the right hand plots tell us what kinds of x give us a big y.

Fig. 5. Draws of σ from the BART Markov Chain Monte Carlo. The draws initially decrease quickly as the fit improves and the chain burns in. Subsequent variation reflects posterior uncertainty about the value of σ. The chain was iterated 2300 times, with the first 300 draws discarded as burn-in and the last 2000 used for posterior inference. All 2300 draws are shown in the plot.

Fig. 6. Distribution of departure time for home based work trips (on the left) and home based retail trips (on the right).

Such a plot could have been constructed directly from the y values, but by plotting the fits we hope to have a sharper picture, less influenced by noise. We clearly see that longer trips are associated with smaller free flow distance, with trip types hw = home based work trips, hs = home based school trips, and ho = home based other trips ("other" is a catch-all category), and with departure times close to the two rush hours. Note that the afternoon rush hour looks like it is different from the morning one. The morning rush hour is more focused. Intuitively, this rush hour is driven more by work related trips, while in the afternoon a greater variety of activities generate trips.

Perhaps the most logically compelling way to learn about the nature of a function f is to compare its values at carefully chosen x. For a very nice example see Abrevaya and McCulloch (2010). In this case we chose the following nine x configurations:

ff_dist   triptype   deptime
1         hw         8
1         hw         17
1         hw         20
1         hr         17
1         hr         20
1         hsoc       17
1         hsoc       20
1         ho         17
1         ho         20

Each row above specifies a choice of x. We have fixed free flow distance at 1. The first three x's examine home based work trips at 8 am, 5 pm, and 8 pm. The remaining six x's examine home based retail trips, home based social trips and home based other trips, each at 5 pm and 8 pm.
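A sketch of how these nine configurations could be pushed through the three-variable fit, assuming the x.test / yhat.test interface of bart() in BayesTree (yhat.test holds one column of posterior draws of f per test row). The training data frame xdf3, holding ff_dist, triptype and deptime, is hypothetical, and the test rows must use the same factor coding for triptype as the training data.

```r
xgrid <- data.frame(
  ff_dist  = 1,
  triptype = factor(c("hw", "hw", "hw", "hr", "hr", "hsoc", "hsoc", "ho", "ho"),
                    levels = levels(xdf3$triptype)),
  deptime  = c(8, 17, 20, 17, 20, 17, 20, 17, 20))

fit3 <- bart(x.train = xdf3, y.train = y, x.test = xgrid, ntree = 200)
boxplot(as.data.frame(fit3$yhat.test),
        names = c("w8", "w17", "w20", "r17", "r20", "s17", "s20", "o17", "o20"),
        ylab  = "posterior draws of f(x)")    # cf. Fig. 8
```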

Fig. 7. Distributions of explanatory variables for small and large fitted values from BART. The three rows depict the marginal distribution of free flow distance, trip type, and departure time. On the left we use the subset of observations corresponding to the bottom 10% of fitted values from BART. On the right we use the top 10% of fitted values.

Fig. 8. Posterior distributions of f(x) for nine selected x. The first three boxplots (from the left) depict draws from the posterior of f(x) for x which correspond to home based work trips with departure times 7:30 am, 5 pm, and 8 pm. The next two are for home based retail trips at 5 pm and 8 pm. The next two are for home based social trips at 5 pm and 8 pm. The last two are for other trips at 5 pm and 8 pm.

Fig. 8 displays the results. Each boxplot depicts the draws fi(x) for post burn-in MCMC draws i of f. The nine boxplots correspond to the nine different x configurations given above. The symbols w, r, s, and o denote the four different trip types (work, retail, social, and other), and the time is given with the 24 h clock.

The boxplot labelled r17 displays the posterior distribution of f(x), where x corresponds to a home based retail trip at 5 pm (the fourth row in the table above). We can easily see things like work trips (boxplots 1-3) being longer than social trips (boxplots 6 and 7). For each kind of trip, we expect longer durations at 5 pm than at 8 pm. There is some evidence that the difference in trip duration between 5 pm and 8 pm is less for social trips than for the other three kinds of trips, and the difference seems quite similar for work, retail, and other trips. Thus, there is some evidence of a particular kind of interaction, in that the effect of departure time does depend on the type of trip. This makes intuitive sense. We can also see that there is more uncertainty associated with our inference for the social trips than for the other three trip types.

By carefully choosing the x at which to infer f, we can learn much about f. Care is needed in choosing appropriate x. For example, asking about a work trip at 11 am is not a good idea given this data. While this takes effort, it is honest effort in that it involves thinking hard about what questions we really want to ask, and what questions make sense. Figs. 7 and 8 give sensible results.

Fig. 9. Comparison of fits from different modeling strategies. All pairwise plots between y, bart (fits from BART), naivereg (linear regression without transformations), logreg (linear regression with free flow distance logged), and trigreg (linear regression with the log of free flow distance and 16 transformations of departure time).

4.4. Transforming independent variables

In this section we try to improve the fit obtained using standard linear models technology by transforming some of the independent variables. This fit is compared with the BART fit. We consider transformations of the variables departure time and free flow distance. All of the variables are used, as in Section 4.2. Similar results are obtained if we use just the three variables considered in Section 4.3. We do not consider interaction terms. Given the number of variables, there are a great many possible interaction effects that could be considered.

Logging free flow distance improves the linear model fit considerably. Without the log, the R² was 28% and the estimate of σ was 0.58. Replacing free flow distance with its log increases R² to 40% and decreases the estimate of σ. Transforming departure time is more complex. Following Popuri et al. (2008), we let

g1(T) = exp(sin(2πT/24)),   g2(T) = exp(cos(2πT/24)),
g3(T) = exp(sin(4πT/24)),   g4(T) = exp(cos(4πT/24)),

where T is departure time. Each of the four gi is raised to the powers 1, 2, 3, and 4. Thus, a total of 16 transformations of the variable T = departure time are included in the multiple regression: gi(T)^j, i = 1, 2, 3, 4 and j = 1, 2, 3, 4. Note that in the Popuri et al. (2008) analysis, an additional variable called delay is used and all of the gi(T)^j are multiplied by delay, so that their analysis is focused on the interaction between departure time and delay. Our analysis does not use the variable delay, so our transformation strategy is not identical to theirs. Nevertheless, by using the same functional form g(T) = Σi,j βi,j gi(T)^j, we hope to build upon their work in a reasonable way. Note that Popuri et al. (2008) are able to focus in on a specific kind of interaction based on subject matter insight. They obtain quite reasonable and interpretable representations of the relationship, as exhibited in their Fig. 2.

Adding the 16 transformations of departure time to the multiple regression with free flow distance logged increases R² to just 41% (from 40%). The estimate of σ is 0.52, which is virtually the same as that obtained with just the log of free flow distance. Recall that the BART R² is 48% and the BART estimate of σ is 0.50. Fig. 9 compares the fits from the different modeling strategies. We see that simply logging free flow distance goes a long way towards capturing f. There is very little difference between the fits with the log alone and those with the departure time transformations added. The BART fits are similar to those obtained with the transformations, but there is a suggestion of some differences in the last plot in the second row. If we test the null hypothesis that all the coefficients for the departure time transformations are equal to 0, the p-value indicates statistical significance. Typically in applications this is used to argue that including the transformations is useful in uncovering the relationship. Fig. 10 plots ĝ(T) = Σi,j β̂i,j gi(T)^j vs. T. It is hard to see how this curve makes sense intuitively. Again, Popuri et al. (2008) did obtain sensible results using the set of transformations we have employed here. In this application, the transformations give results which are statistically significant but practically insignificant and intuitively unappealing. Of course, some other transformation approach might give better results.

Fig. 10. The estimated additive component contributed by the transformations of departure time.
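A sketch of the transformation strategy just described: the four sinusoidal transforms of departure time, each raised to the powers 1-4, entered in a linear model together with the log of free flow distance. The data frame dat and its columns y, ff_dist and deptime are hypothetical stand-ins for the variables above.

```r
tt <- dat$deptime
g4 <- cbind(exp(sin(2 * pi * tt / 24)), exp(cos(2 * pi * tt / 24)),
            exp(sin(4 * pi * tt / 24)), exp(cos(4 * pi * tt / 24)))
G  <- do.call(cbind, lapply(1:4, function(j) g4^j))        # the 16 terms g_i(T)^j
colnames(G) <- paste0("g", rep(1:4, times = 4), "_pow", rep(1:4, each = 4))

trigreg <- lm(dat$y ~ log(dat$ff_dist) + G)
summary(trigreg)$r.squared      # about 0.41 here, versus 0.40 with the log alone
# Plotting the fitted combination of the 16 terms against departure time reproduces
# the estimated additive component shown in Fig. 10.
```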

5. Conclusions

This paper introduces and illustrates the application of the BART model to trip duration modeling. BART quickly and easily models the dependent variable, y, as some generic function f of the explanatory variables. Note that this is, in a sense, a nonparametric approach, since the functional form of f is specified in such a flexible form. This is particularly useful when some explanatory variables are not expected to have linear relationships with y, such as the case of departure time's relationship with trip durations. The application demonstrated here clearly shows the nonlinear relationship between these variables, with trip duration peaks occurring during the AM and PM peak travel periods, as expected. Further, the nature of the two peaks seems to be quite different, with a more focused rush hour in the morning. We also find evidence of interaction between two of our key variables: the effect of departure time may depend on the type of trip taken. Future research will focus on identifying other transportation research areas where BART could be employed, as well as extending the ideas in this paper to gain other insights into the actual relationship between departure time and trip duration.

References

Abou Zeid, M., Rossi, T.F., Gardner, B., 2006. Modeling time-of-day choice in the context of tour- and activity-based models. Transportation Research Record: Journal of the Transportation Research Board 1981.
Abrevaya, J., McCulloch, R., 2010. Reversal of Fortune: A Statistical Analysis of Penalty Calls in the National Hockey League.
Chipman, H., George, E., McCulloch, R., 1998. Bayesian CART model search. Journal of the American Statistical Association 93 (443).
Chipman, H., George, E., McCulloch, R., 2006. Bayesian ensemble learning. In: Scholkopf, B., Platt, J., Hoffman, T. (Eds.), Advances in Neural Information Processing Systems, vol. 19. MIT Press, Cambridge, MA.
Chipman, H., George, E., McCulloch, R., 2010. BART: Bayesian Additive Regression Trees. Annals of Applied Statistics.
Freund, Y., Schapire, R.E., 1997. A decision theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55 (1).
Hastie, T., Tibshirani, R., 2000. Bayesian backfitting. Statistical Science 15 (3).
Popuri, Y., Ben-Akiva, M., Proussaloglou, K., 2008. Time-of-day modeling in a tour-based context: the Tel-Aviv experience. Transportation Research Record: Journal of the Transportation Research Board 2076.


More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

where Female = 0 for males, = 1 for females Age is measured in years (22, 23, ) GPA is measured in units on a four-point scale (0, 1.22, 3.45, etc.

where Female = 0 for males, = 1 for females Age is measured in years (22, 23, ) GPA is measured in units on a four-point scale (0, 1.22, 3.45, etc. Notes on regression analysis 1. Basics in regression analysis key concepts (actual implementation is more complicated) A. Collect data B. Plot data on graph, draw a line through the middle of the scatter

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Making rating curves - the Bayesian approach

Making rating curves - the Bayesian approach Making rating curves - the Bayesian approach Rating curves what is wanted? A best estimate of the relationship between stage and discharge at a given place in a river. The relationship should be on the

More information

Chapter 10. Optimization Simulated annealing

Chapter 10. Optimization Simulated annealing Chapter 10 Optimization In this chapter we consider a very different kind of problem. Until now our prototypical problem is to compute the expected value of some random variable. We now consider minimization

More information

ECE 5984: Introduction to Machine Learning

ECE 5984: Introduction to Machine Learning ECE 5984: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting Readings: Murphy 16.4; Hastie 16 Dhruv Batra Virginia Tech Administrativia HW3 Due: April 14, 11:55pm You will implement

More information

A Study into Mechanisms of Attitudinal Scale Conversion: A Randomized Stochastic Ordering Approach

A Study into Mechanisms of Attitudinal Scale Conversion: A Randomized Stochastic Ordering Approach A Study into Mechanisms of Attitudinal Scale Conversion: A Randomized Stochastic Ordering Approach Zvi Gilula (Hebrew University) Robert McCulloch (Arizona State) Ya acov Ritov (University of Michigan)

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revised on April 24, 2017 Today we are going to learn... 1 Markov Chains

More information

Ensemble Methods and Random Forests

Ensemble Methods and Random Forests Ensemble Methods and Random Forests Vaishnavi S May 2017 1 Introduction We have seen various analysis for classification and regression in the course. One of the common methods to reduce the generalization

More information

CIVL 7012/8012. Collection and Analysis of Information

CIVL 7012/8012. Collection and Analysis of Information CIVL 7012/8012 Collection and Analysis of Information Uncertainty in Engineering Statistics deals with the collection and analysis of data to solve real-world problems. Uncertainty is inherent in all real

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Designing Information Devices and Systems I Spring 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Spring 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Spring 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

Automatic Differentiation and Neural Networks

Automatic Differentiation and Neural Networks Statistical Machine Learning Notes 7 Automatic Differentiation and Neural Networks Instructor: Justin Domke 1 Introduction The name neural network is sometimes used to refer to many things (e.g. Hopfield

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Prediction of Data with help of the Gaussian Process Method

Prediction of Data with help of the Gaussian Process Method of Data with help of the Gaussian Process Method R. Preuss, U. von Toussaint Max-Planck-Institute for Plasma Physics EURATOM Association 878 Garching, Germany March, Abstract The simulation of plasma-wall

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

Chapter 10 Nonlinear Models

Chapter 10 Nonlinear Models Chapter 10 Nonlinear Models Nonlinear models can be classified into two categories. In the first category are models that are nonlinear in the variables, but still linear in terms of the unknown parameters.

More information

Section 3: Simple Linear Regression

Section 3: Simple Linear Regression Section 3: Simple Linear Regression Carlos M. Carvalho The University of Texas at Austin McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Regression: General Introduction

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017 CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).

More information

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI Introduction of Data Analytics Prof. Nandan Sudarsanam and Prof. B Ravindran Department of Management Studies and Department of Computer Science and Engineering Indian Institute of Technology, Madras Module

More information

Graphical Models and Kernel Methods

Graphical Models and Kernel Methods Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.

More information

Urban Transportation Planning Prof. Dr.V.Thamizh Arasan Department of Civil Engineering Indian Institute of Technology Madras

Urban Transportation Planning Prof. Dr.V.Thamizh Arasan Department of Civil Engineering Indian Institute of Technology Madras Urban Transportation Planning Prof. Dr.V.Thamizh Arasan Department of Civil Engineering Indian Institute of Technology Madras Module #03 Lecture #12 Trip Generation Analysis Contd. This is lecture 12 on

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information

Lecture 8. Instructor: Haipeng Luo

Lecture 8. Instructor: Haipeng Luo Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine

More information

Decision Tree Learning Lecture 2

Decision Tree Learning Lecture 2 Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

More information

Inferential statistics

Inferential statistics Inferential statistics Inference involves making a Generalization about a larger group of individuals on the basis of a subset or sample. Ahmed-Refat-ZU Null and alternative hypotheses In hypotheses testing,

More information

Sampling Distribution Models. Chapter 17

Sampling Distribution Models. Chapter 17 Sampling Distribution Models Chapter 17 Objectives: 1. Sampling Distribution Model 2. Sampling Variability (sampling error) 3. Sampling Distribution Model for a Proportion 4. Central Limit Theorem 5. Sampling

More information

Understanding Travel Time to Airports in New York City Sierra Gentry Dominik Schunack

Understanding Travel Time to Airports in New York City Sierra Gentry Dominik Schunack Understanding Travel Time to Airports in New York City Sierra Gentry Dominik Schunack 1 Introduction Even with the rising competition of rideshare services, many in New York City still utilize taxis for

More information

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Fall 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate it

More information

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Anthony Trubiano April 11th, 2018 1 Introduction Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability

More information

Stochastic Processes

Stochastic Processes qmc082.tex. Version of 30 September 2010. Lecture Notes on Quantum Mechanics No. 8 R. B. Griffiths References: Stochastic Processes CQT = R. B. Griffiths, Consistent Quantum Theory (Cambridge, 2002) DeGroot

More information

DAG models and Markov Chain Monte Carlo methods a short overview

DAG models and Markov Chain Monte Carlo methods a short overview DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex

More information

Bayesian phylogenetics. the one true tree? Bayesian phylogenetics

Bayesian phylogenetics. the one true tree? Bayesian phylogenetics Bayesian phylogenetics the one true tree? the methods we ve learned so far try to get a single tree that best describes the data however, they admit that they don t search everywhere, and that it is difficult

More information

A Re-Introduction to General Linear Models (GLM)

A Re-Introduction to General Linear Models (GLM) A Re-Introduction to General Linear Models (GLM) Today s Class: You do know the GLM Estimation (where the numbers in the output come from): From least squares to restricted maximum likelihood (REML) Reviewing

More information

Exploratory quantile regression with many covariates: An application to adverse birth outcomes

Exploratory quantile regression with many covariates: An application to adverse birth outcomes Exploratory quantile regression with many covariates: An application to adverse birth outcomes June 3, 2011 eappendix 30 Percent of Total 20 10 0 0 1000 2000 3000 4000 5000 Birth weights efigure 1: Histogram

More information

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007 Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

More information

Introduction to Optimization

Introduction to Optimization Introduction to Optimization Blackbox Optimization Marc Toussaint U Stuttgart Blackbox Optimization The term is not really well defined I use it to express that only f(x) can be evaluated f(x) or 2 f(x)

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

6.867 Machine learning, lecture 23 (Jaakkola)

6.867 Machine learning, lecture 23 (Jaakkola) Lecture topics: Markov Random Fields Probabilistic inference Markov Random Fields We will briefly go over undirected graphical models or Markov Random Fields (MRFs) as they will be needed in the context

More information

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:

More information

Fitting a Straight Line to Data

Fitting a Straight Line to Data Fitting a Straight Line to Data Thanks for your patience. Finally we ll take a shot at real data! The data set in question is baryonic Tully-Fisher data from http://astroweb.cwru.edu/sparc/btfr Lelli2016a.mrt,

More information