
Transportation Research Part B 44 (2010) 686–698

Bayesian flexible modeling of trip durations

Hugh Chipman a, Edward George b, Jason Lemp c, Robert McCulloch d,*

a Department of Mathematics and Statistics, Acadia University, Wolfville, NS, Canada
b Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA, United States
c Department of Civil, Architectural, and Environmental Engineering, The University of Texas at Austin, TX, United States
d IROM Department, McCombs School of Business, The University of Texas at Austin, TX, United States

* Corresponding author. Address: IROM Department, McCombs School of Business, The University of Texas at Austin, 1 University Station, B6500, Austin, TX, United States. E-mail address: robert.mcculloch1@gmail.com (R. McCulloch).

Article history: Received 18 January 2010; Accepted 19 January 2010.
Keywords: Markov Chain Monte Carlo; Boosting; Ensemble modeling

Abstract. Recent advances in Bayesian modeling have led to stunning improvements in our ability to flexibly and easily model complex high-dimensional data. Flexibility comes from the use of a very large number of parameters without fixed dimension. Priors are placed on the parameters to avoid over-fitting and to sensibly guide the search in model space for appropriate data-driven model choice. Modern computational, high-dimensional search methods (in particular Markov Chain Monte Carlo) then allow us to search the parameter space. This paper introduces the application of BART, Bayesian Additive Regression Trees, to modeling trip durations. We have survey data on characteristics of trips in the Austin area. We seek to relate the trip duration to features of the household and trip characteristics. BART enables one to make inferences about the relationship with minimal assumptions and user decisions.
© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

The workhorse model of applied statistics is the multiple regression model. Multiple regression allows us to relate a single y to many x variables. This is often the goal in applied work. However, multiple regression makes the fundamental assumption of a linear relationship between y and x. With many x variables, this assumption may not be tenable and it is hard to check. Good researchers are trained to use diagnostic plots to assess model adequacy. In case of problems, a wide variety of transformations of both the dependent and independent variables are available in the statistics literature. In practice, we often fail to carefully check the model. If problems are found with the linear specification, the choice of possible transformations is so overwhelming that most applied workers limit themselves, quite reasonably, to a few transformations, such as taking the log of y and using polynomial-type terms for some of the explanatory variables. Even with a moderate number of explanatory variables the task of searching for a reasonable specification quickly becomes overwhelming.

In this paper we illustrate the use of BART, Bayesian Additive Regression Trees (Chipman et al., 2006, 2010), with particular emphasis on its role and impact on transportation research. BART combines recent advances in Bayesian modeling with ideas from machine learning to sensibly search the (potentially) high-dimensional space of possible models relating y to a high-dimensional x. The model is estimated for trip duration data from Austin, Texas. The goal of the study is to investigate how the reported time to take a trip in an automobile (y) depends on characteristics of the trip and the people making the trip (the x's). Abou Zeid et al. (2006) and Popuri et al.
(2008) modeled trip durations using classical linear regression techniques. Special considerations were needed for entering the time-of-day variable in the model. Clearly, travel times will not vary in a linear way from hour to hour.

Abou Zeid et al. (2006) and Popuri et al. (2008) both employed collections of sinusoidal functions of departure time in the hope that such functions would be able to approximate the relationship between travel times and departure time. With BART, there is no need to experiment with different transformations of the explanatory variables. BART automatically detects and models nonlinear relationships between dependent and explanatory variables, including interactions between explanatory variables.

The remainder of the paper is organized as follows: Section 2 describes the BART model. In Section 3 the Markov Chain Monte Carlo (MCMC) algorithm used to search the model space is outlined. Section 4 illustrates the use of BART in modeling trip-duration data. Section 5 concludes.

2. The BART model

The model consists of two parts: a sum-of-trees model, called BART (Bayesian Additive Regression Trees), and a regularization prior.

2.1. A sum-of-trees model

The central element of our model is a regression tree, a predictive model that seeks to accomplish the same task as linear regression: predict a response y given the values of a vector of independent variables x = (x1, ..., xp). What distinguishes the regression tree from a linear regression model is how the regression tree generates the prediction. An illustration is given in Fig. 1, with x = (x1, x2). The tree consists of a root node containing a question about one of the independent variables, here whether x2 < 1. Depending on the answer to this question, we would follow the left (x2 < 1) or right (x2 ≥ 1) branch of the tree, arriving at a child node. To generate a prediction, we continue branching based on our value of x until a terminal node is reached, and an output μ is returned. The output parameter μb in terminal node b plays the role of the predicted response in regression.

The tree model partitions the x space into rectangular regions, and associates a single predicted value of the response y with each region. In Fig. 1, the tree partitions the (x1, x2) space into three rectangular regions, and produces outputs of 0.1, 0.8 or 0.3, depending on the region in which a value x falls. This particular tree represents an interaction between x1 and x2, since the relationship between y and x1 is constant (the value 0.3) if x2 ≥ 1, but changes as a function of x1 (the values 0.1 and 0.8) if x2 < 1.

A tree model must be estimated from data, similar to the coefficients of a linear regression. We must estimate the tree structure itself (e.g. splitting rules like x2 < 1 associated with interior nodes) and the terminal parameters associated with the tree (the μ's). Estimation will be discussed in Section 3.

To develop a sum-of-trees model, we establish notation that represents the informal description of a single tree model. Let T denote a binary tree consisting of a set of interior node decision rules (questions) and a set of terminal nodes, and let M = {μ1, μ2, ..., μB} denote a set of parameter values associated with each of the B terminal nodes of T. Prediction for a particular value of the input vector x is accomplished as follows: if x is associated with terminal node b of T by the sequence of decision rules from top to bottom, it is then assigned the value μb. We use g(x; T, M) to denote the function corresponding to (T, M) which assigns a μb ∈ M to x.
Fig. 1. An illustration of a tree model (a) and its predictions (b). In panel (a) the root node splits on x2 < 1 and its left child splits on x1 < 1.5; the terminal node values are μ1 = 0.1, μ2 = 0.8 and μ3 = 0.3. Panel (b) shows the corresponding partition of the (x1, x2) space.
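To make the tree-as-function idea concrete, here is a minimal R sketch of the tree in Fig. 1 written out as the prediction rule g(x; T, M); the function name g_fig1 is ours, introduced only for illustration.

```r
# The single tree of Fig. 1 as a prediction function g(x; T, M):
# the splitting rules form the tree T, the terminal-node values (0.1, 0.8, 0.3) form M.
g_fig1 <- function(x1, x2) {
  if (x2 >= 1) return(0.3)   # mu_3: right branch of the root (x2 >= 1)
  if (x1 < 1.5) return(0.1)  # mu_1: x2 < 1 and x1 < 1.5
  0.8                        # mu_2: x2 < 1 and x1 >= 1.5
}

g_fig1(x1 = 0.5, x2 = 0.5)   # 0.1
g_fig1(x1 = 1.8, x2 = 0.5)   # 0.8
g_fig1(x1 = 1.8, x2 = 1.5)   # 0.3
```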

Using this notation, our sum-of-trees model can more explicitly be expressed as

Y = g(x; T1, M1) + g(x; T2, M2) + ... + g(x; Tm, Mm) + ε,   ε ~ N(0, σ²).   (1)

Thus, our model is of the form Y = f(x) + ε, where the function f is flexibly represented as f(x) = g(x; T1, M1) + g(x; T2, M2) + ... + g(x; Tm, Mm). In a single tree model, the conditional mean of Y given x is composed of a single μ value associated with one terminal node (i.e. the output of one g). Unlike the single tree model, the sum-of-trees model (1) uses m different μ values to compose the conditional mean of Y given x. Such terminal node parameters will represent interaction effects when their assignment depends on more than one component of x (i.e., more than one independent variable). Because (1) may be based on trees of varying sizes, the sum-of-trees model can incorporate both direct effects and interaction effects of varying orders. In the special case where every terminal node assignment depends on just a single component of x, the sum-of-trees model reduces to a simple additive function.

In the machine learning literature the term ensemble is used to describe a collection of model pieces that add up to a bigger model. Thus, in the BART model the ensemble is the collection {g(·; Ti, Mi), i = 1, ..., m}. The overall intuition is that a good way to find the fit is by adding little bits at a time. There are different ensemble methods in the literature, with boosting (Freund and Schapire, 1997) being the lead example. BART is related to and partially motivated by boosting but has fundamental differences (see Section 3 below). With a large number of trees, a sum-of-trees model gains increased representational flexibility, which, when coupled with our regularization prior, gives excellent out-of-sample predictive performance. The default value for m, used in the application in this paper, is 200.

Note that with m large there are hundreds of parameters, of which only σ is identified. For example, swapping (T1, M1) for (T2, M2) in (1) gives a different parameterization but the same predictive model. This is not a problem for our Bayesian analysis as long as we use a proper prior. It just means that inferential statements cannot be made about individual μ and T parameters. Instead, we shall draw inferences on the predictions, that is, on the function f and on σ. This formulation can be considered as a mechanism for placing a prior distribution on functions, even though individual parameters are not identified. Indeed, this lack of identification is the reason our MCMC mixes well. Even when m is much larger than needed to capture f (effectively, we have an "over-complete basis"), the procedure still works well. One of the key reasons the procedure works well with so many parameters is an effective specification of prior distributions for these parameters. We explore this in the next section.
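A toy R sketch of the sum-of-trees structure in (1) may help fix ideas: the prediction is simply the sum of the outputs of m small trees, plus normal noise. The three stumps below are invented purely to show the structure; they are not estimated from anything.

```r
# A toy sum-of-trees model in the spirit of Eq. (1).
tree1 <- function(x1, x2) if (x2 < 1) 0.10 else 0.20
tree2 <- function(x1, x2) if (x1 < 1.5) -0.05 else 0.15
tree3 <- function(x1, x2) if (x1 < 0.5 && x2 < 0.5) 0.30 else 0.00
ensemble <- list(tree1, tree2, tree3)

f <- function(x1, x2) sum(sapply(ensemble, function(g) g(x1, x2)))
f(1.0, 0.5)                              # 0.10 - 0.05 + 0.00 = 0.05
set.seed(1)
y <- f(1.0, 0.5) + rnorm(1, sd = 0.1)    # one draw of Y given x = (1.0, 0.5)
```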
2.2. A regularization prior

In many Bayesian analyses, one seeks relatively uninformative prior distributions for unknown parameters in order to let the data speak for itself. In our model, however, there are so many free parameters that uninformative priors would give the data too much of a voice: the prediction for observation i would be the observed response yi, interpolating the training data perfectly. The enormous capacity of the model to represent the data must be reined in by prior distributions that control the model's adaptability. In machine learning, this process of constraining parameters is called regularization, hence our term regularization priors.

In this section we outline how to place a prior on each tree T and its terminal node parameters M. The complexity of the prior specification is vastly simplified by letting the Ti be a priori independent and identically distributed (i.i.d.), the μi,b (node b of tree i) be i.i.d. given all the T's, and σ be independent of all T and μ. Given these independence assumptions we need only choose priors for a single tree T, a single μ, and σ. Motivated by our desire to make each g(x; Ti, Mi) a small contribution to the overall fit, we put prior weight on small trees and small μi,b. In the machine learning literature, the individual g(x; Ti, Mi) are often called weak learners. They are learners in that each g(x; Ti, Mi) fits or learns something about the relationship between y and x. They are weak in that each g(x; Ti, Mi) makes a small contribution to the overall fit.

For the tree prior, we use the same specification as in Chipman et al. (1998). In this prior, the probability that a node is nonterminal is α(1 + d)^(−β), where d is the depth of the node. In all examples we use the same prior corresponding to the choice α = 0.95 and β = 2. With this choice, a root node (d = 0) has probability α = 0.95 of having children and a node at depth 1 has probability 0.95/(1 + 1)² = 0.2375 of having children. The corresponding probability distribution on tree size (number of terminal nodes) gives probabilities of 0.05, 0.55, 0.28, 0.09, and 0.03 for trees of size 1, 2, 3, 4, and ≥5. Note that even with this prior, trees with many terminal nodes can be grown if the data demands it. At any non-terminal node, the prior on the associated decision rule puts equal probability on each available independent variable and then equal probability on each available rule given the variable. Thus for the tree in Fig. 1, assuming only two predictors X1 and X2, each taking possible values 0, 0.1, 0.2, ..., 2.0, we have prior probability

  0.95              (root node is nonterminal)
  × 0.5             (split is on X2, one of two variables)
  × 0.05            (split is at 1 of 20 possible locations)
  × 0.2375          (left child is nonterminal)
  × 0.5             (left child splits on X1, one of two variables)
  × 0.05            (split is at 1 of 20 possible locations)
  × (1 − 0.1056)²   (its two children are terminal)
  × (1 − 0.2375)    (right child of the root node is terminal)
  ≈ 8.6 × 10⁻⁵.   (2)
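The same calculation can be reproduced in a few lines of R, which also makes the depth penalty of the tree prior explicit:

```r
# Prior probability that a node at depth d is nonterminal: alpha * (1 + d)^(-beta).
alpha <- 0.95; beta <- 2
p_split <- function(d) alpha * (1 + d)^(-beta)
round(p_split(0:2), 4)              # 0.9500 0.2375 0.1056 (root, depth 1, depth 2)

# Reproducing the prior probability (2) of the particular tree in Fig. 1.
prob_tree <-
  p_split(0) * 0.5 * (1 / 20) *     # root nonterminal, splits on X2, 1 of 20 cutpoints
  p_split(1) * 0.5 * (1 / 20) *     # left child nonterminal, splits on X1, 1 of 20 cutpoints
  (1 - p_split(2))^2 *              # its two children are terminal
  (1 - p_split(1))                  # right child of the root is terminal
prob_tree                           # approximately 8.6e-05
```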

For the prior on a μ, we first shift and rescale Y so there is high prior probability that E(Y|x) ∈ (−0.5, 0.5). We let μ ~ N(0, σμ²), where μ is the output of any one terminal node of any one tree. Given the Ti and an x, E(Y|x) is the sum of m independent μ's (recall Eq. (1)). The standard deviation of the sum is √m σμ. We must choose σμ so that the standard deviation of the sum, √m σμ, ensures high probability that E(Y|x) is in (−0.5, 0.5). We choose σμ so that 0.5 is within k standard deviations of zero: k √m σμ = 0.5. For example, if k = 2 there is a 95% (conditional) prior probability that the mean of Y is in (−0.5, 0.5). k = 2 is our default choice, and in practice we typically rescale the response y so that its observed values range from −0.5 to 0.5. Note that this prior increases the shrinkage of μi,b (toward zero) as m increases. As more trees are used in the ensemble, each one is permitted to contribute a smaller amount to the overall prediction.

For the prior on σ we start from the usual inverted-chi-squared prior: σ² ~ νλ/χ²_ν. To choose the hyper-parameters ν and λ, we begin by obtaining a rough overestimate σ̂ of σ. We then pick a degrees of freedom value ν between 3 and 10. Finally, we pick a value of q such as 0.75, 0.90 or 0.99, and set λ so that the qth quantile of the prior on σ is located at σ̂, that is P(σ < σ̂) = q. Fig. 2 illustrates the priors corresponding to three (ν, q) settings when the rough overestimate is σ̂ = 2. We refer to these three settings, (ν, q) = (10, 0.75), (3, 0.90) and (3, 0.99), as conservative, default and aggressive, respectively. For automatic use, we recommend the default setting (ν, q) = (3, 0.90), which tends to avoid extremes. Simple data-driven choices of σ̂ that we have used in practice are the estimate from a linear regression or the sample standard deviation of Y. Note that this prior choice can be influential. Strong prior beliefs that σ is very small could lead to over-fitting.

Fig. 2. Three priors on σ when σ̂ = 2: conservative (ν = 10, q = 0.75), default (ν = 3, q = 0.90) and aggressive (ν = 3, q = 0.99).
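A short R sketch of the two calibration rules just described may be helpful; it computes σμ from k and m, and solves P(σ < σ̂) = q for λ under the inverted-chi-squared prior, checking the result by simulation.

```r
# Calibrating sigma_mu: k * sqrt(m) * sigma_mu = 0.5.
m <- 200; k <- 2
sigma_mu <- 0.5 / (k * sqrt(m))
sigma_mu                                # about 0.0177: each tree contributes a small amount

# Inverted-chi-squared prior sigma^2 ~ nu * lambda / chisq_nu, with lambda chosen
# so that P(sigma < sigma_hat) = q for a rough overestimate sigma_hat.
nu <- 3; q <- 0.90; sigma_hat <- 2      # the "default" setting of Fig. 2
lambda <- sigma_hat^2 * qchisq(1 - q, df = nu) / nu

# Check the calibration by simulation:
sigma_draws <- sqrt(nu * lambda / rchisq(1e5, df = nu))
mean(sigma_draws < sigma_hat)           # approximately 0.90
```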

3. A back-fitting MCMC algorithm

Given the observed data y, our Bayesian setup induces a posterior distribution p((T1, M1), ..., (Tm, Mm), σ | y) on all the unknowns that determine a sum-of-trees model. Although the sheer size of this parameter space precludes exhaustive calculation, a back-fitting MCMC algorithm (Hastie and Tibshirani, 2000) can be used to sample from this posterior. At a general level, our algorithm is a Gibbs sampler. For notational convenience, let T(i) be the set of all trees in the sum except Ti, and similarly define M(i). The Gibbs sampler here entails m successive draws of (Ti, Mi) conditionally on (T(i), M(i), σ):

(T1, M1) | T(1), M(1), σ, y
(T2, M2) | T(2), M(2), σ, y
...
(Tm, Mm) | T(m), M(m), σ, y,   (3)

followed by a draw of σ from the full conditional:

σ | T1, ..., Tm, M1, ..., Mm, y.   (4)

The back-fitting MCMC algorithm repeatedly re-samples the parameters of each tree in the ensemble, conditional on the current parameter values of the other m − 1 trees. This approach has some similarities to, and differences from, the boosting algorithm of Freund and Schapire (1997). Boosting also produces an ensemble-of-trees model (1). The boosting algorithm also updates one tree conditional on all others, but it does so only once, rather than repeatedly resampling as in MCMC. This yields a single estimated model, rather than a posterior distribution on the model.

Evaluation of the full conditionals required for Gibbs sampling is simplified by rearranging (1). For example, to sample (T1, M1) | T(1), M(1), σ, y, we can write

Y − g(x; T2, M2) − ... − g(x; Tm, Mm) = g(x; T1, M1) + ε.

Given (T(1), M(1)) and σ, we may subtract the fit from (T(1), M(1)) from both sides of (1), leaving us with a single tree model, g(x; T1, M1), with known error variance. This draw may be made following the approach of Chipman et al. (1998). These methods draw (Ti, Mi) | T(i), M(i), σ, y as Ti | T(i), M(i), σ, y followed by Mi | Ti, T(i), M(i), σ, y. The idea is that we can draw a (T, M) by drawing from the marginal of T after integrating out M, and then from the conditional of M given the draw of T. The structure of the BART model and prior are carefully chosen to make this possible. The first draw is done by the Metropolis-Hastings algorithm after integrating out Mi, and the second is a set of normal draws. The draw of σ is easily accomplished by subtracting all the fit from both sides of (1), so that the errors are effectively observed. Given all the (Ti, Mi) we know f, so we can compute εi = yi − f(xi). The draw is then a standard inverted-chi-squared one, since our prior is conditionally conjugate. Subtracting off fits and fitting the residuals is often called backfitting. Our Gibbs sampler iteratively and stochastically backfits (Hastie and Tibshirani, 2000).

The Metropolis-Hastings draw of Ti | T(i), M(i), σ, y is complex and lies at the heart of our method. The algorithm of Chipman et al. (1998) proposes a new tree based on the current tree using one of four moves. The moves and their associated proposal probabilities are: growing a terminal node (0.25), pruning a pair of terminal nodes (0.25), changing a non-terminal rule (0.40), and swapping a rule between parent and child (0.10). Note that the grow and prune moves change the implicit dimensionality of the proposed tree in terms of the number of terminal nodes. Some readers may be more familiar with simulated annealing, a stochastic search algorithm that can be obtained by manipulation of the acceptance probabilities of the Metropolis-Hastings algorithm as it runs. Both algorithms have in common the proposal of a new state as a random perturbation of the current state, followed by a randomized accept/reject step.

We initialize the chain with m single node trees, and then iterations are repeated until satisfactory convergence is obtained. We illustrate convergence assessment by monitoring the σ draws in Section 4. At each iteration, each tree may increase or decrease its number of terminal nodes by one, or change one or two decision rules. Each μ will change (or cease to exist or be born), and σ will change. It is not uncommon for a tree to grow large and then subsequently collapse back down to a single node as the algorithm iterates.
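The following runnable R miniature illustrates the stochastic back-fitting loop under strong simplifications: the tree structures are held fixed (each "tree" is a stump with a preset split), so the Metropolis-Hastings tree moves are omitted and only the terminal-node μ's and σ are Gibbs-updated. It is an illustration of the loop structure, not the BART algorithm itself.

```r
set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

splits   <- seq(0.1, 0.9, by = 0.1)        # one fixed stump per split point
m        <- length(splits)                 # m = 9 "trees"
sigma_mu <- 0.5 / (2 * sqrt(m))            # the k = 2 rule (y is not rescaled here)
nu <- 3; lambda <- 0.1                     # inverted-chi-squared hyper-parameters
mus   <- matrix(0, nrow = m, ncol = 2)     # mus[i, 1]: left node, mus[i, 2]: right node
sigma <- sd(y)

fit_one <- function(i) ifelse(x < splits[i], mus[i, 1], mus[i, 2])
fit_sum <- function(idx) Reduce("+", lapply(idx, fit_one))

draw_node_mu <- function(r) {              # conjugate normal draw given partial residuals r
  prec <- length(r) / sigma^2 + 1 / sigma_mu^2
  rnorm(1, mean = (sum(r) / sigma^2) / prec, sd = sqrt(1 / prec))
}

n_iter <- 500
sigma_draws <- numeric(n_iter)
for (it in seq_len(n_iter)) {
  for (i in seq_len(m)) {
    partial <- y - fit_sum(seq_len(m)[-i]) # subtract the fit of the other m - 1 stumps
    left <- x < splits[i]
    mus[i, 1] <- draw_node_mu(partial[left])
    mus[i, 2] <- draw_node_mu(partial[!left])
  }
  eps   <- y - fit_sum(seq_len(m))                               # full residuals
  sigma <- sqrt((nu * lambda + sum(eps^2)) / rchisq(1, nu + n))  # conjugate sigma draw
  sigma_draws[it] <- sigma
}
plot(sigma_draws, type = "l")  # burn-in, then draws settling a little above the 0.3 noise sd
```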
The sum-of-trees model, with its abundance of unidentified parameters, allows the fit to be freely reallocated from one tree to another. Because each move makes only small incremental changes to the fit, we can imagine the algorithm as analogous to sculpting a complex figure by adding and subtracting small dabs of clay. Compared to the single tree model MCMC approach of Chipman et al. (1998), the back-fitting MCMC algorithm mixes dramatically better. When only single tree models are considered, the MCMC algorithm tends to quickly gravitate toward a single large tree and then gets stuck in a local neighborhood of that tree. In sharp contrast, we have found that restarts of the back-fitting MCMC algorithm give remarkably similar results even in difficult problems. Consequently, we run one long chain rather than multiple starts.

In some ways back-fitting MCMC is a stochastic alternative to boosting algorithms for fitting linear combinations of trees. It is distinguished by the ability to sample from a posterior distribution. At each iteration, we get a new draw

f* = g(x; T1*, M1*) + g(x; T2*, M2*) + ... + g(x; Tm*, Mm*)   (5)

corresponding to the draws {Tj*} and {Mj*}. These draws are a (dependent) sample from the posterior distribution on the true f. Rather than pick the best f* from these draws, the set of multiple draws can be used to further enhance inference. In contrast, boosting generates a single estimate of the model, rather than a sample of possible values. We estimate f(x) by the posterior mean of f(x), which is approximated by averaging the f*(x) over the draws. Further, we can gauge our uncertainty about the actual underlying f by the variation across the draws. For example, we can use the 5% and 95% quantiles of the f*(x) draws to obtain 90% posterior intervals for f(x).
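In R, these posterior summaries amount to column-wise means and quantiles of the matrix of draws. The sketch below assumes a matrix fdraws with one row per post burn-in MCMC draw and one column per x of interest (for instance, the yhat.train component returned by bart() in the BayesTree package is laid out this way); a random stand-in matrix is used so the lines run as written.

```r
fdraws <- matrix(rnorm(2000 * 5), nrow = 2000)        # stand-in: 2000 draws, 5 x's of interest

f_hat   <- colMeans(fdraws)                           # posterior mean estimate of f(x)
f_lower <- apply(fdraws, 2, quantile, probs = 0.05)   # 5% quantile of the draws
f_upper <- apply(fdraws, 2, quantile, probs = 0.95)   # 95% quantile of the draws
# (f_lower[j], f_upper[j]) is a 90% posterior interval for f at the j-th x.
```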

4. Fitting trip duration data with BART

4.1. The trip duration data

The goal of the study is to investigate how the reported time to take a trip in an automobile depends on characteristics of the trip and the people making the trip. Each observation in our data set corresponds to a trip, made by car, in the Austin area. Each trip is made by a person identified as the trip-maker from a household. Several variables are measured for each trip: we have variables describing the household, the trip-maker, and the trip itself. Variables describing the household are: number of people in the household, income, number of children under five, number of children between 5 and 15, and number of children of ages 16 or 17. Variables describing the trip-maker are: age, primary occupation and student status. Variables describing the trip are: month, day of the week, type of trip (e.g. "home based work trip"), departure time, number of households in the departure zone, number of households in the destination zone, retail employment in the departure zone, retail employment in the destination zone, free-flow distance, free-flow duration and trip duration. The free-flow variables are meant to capture the distance and time taken for such a trip under free-flow conditions.

Our dependent variable y is the log of the ratio of trip duration over free-flow trip duration. Trip duration is simply the reported time taken to complete the trip. Free-flow trip duration is an attempt to measure the time it would take to complete the trip if there were no traffic-related inhibitions. Our trip durations are really approximations of trip durations and suffer from rounding error. Trips with short distances are most prone to high relative error, since reported durations are often rounded to 5, 10, or even 15 min. A large y means high traffic congestion or travel delay, but not necessarily a long duration, since y compares the duration to the free-flow duration. Since the actual ratios are highly right-skewed, we take the log. This still gives us an interpretable quantity, as it is approximately the difference between the two durations in percentage terms. The histogram of our dependent variable is given in Fig. 3. Note that while we have transformed our dependent variable, there is no need to consider transformations of the explanatory variables when using BART. Our explanatory variables consist of the remaining 17 listed above. Fig. 4 displays the marginal distributions of three important independent variables.

Many of the trips are made by the same trip-maker, so a difficult issue arises with regard to independence assumptions. Even though BART allows for great flexibility in the form of f, the current formulation does make the basic assumption of i.i.d. normal errors. It may be that the errors from trips made by the same person are not exchangeable with those made by others. There is no obvious way to resolve this question without considerably more modeling. We randomly choose one trip from those made by each trip-maker. This gives us 3244 observations.
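A sketch of how the dependent variable and the one-trip-per-maker sample could be built follows. The data frame trips and its columns duration, ff_duration and maker_id are hypothetical stand-ins for the survey data just described.

```r
trips$y <- log(trips$duration / trips$ff_duration)    # log of duration over free-flow duration

set.seed(1)
one_per_maker <- do.call(rbind, lapply(split(trips, trips$maker_id), function(d)
  d[sample(nrow(d), 1), ]))                           # keep one randomly chosen trip per trip-maker
nrow(one_per_maker)                                   # 3244 in the data used here
hist(one_per_maker$y, xlab = "y = log(dur/ffd)", main = "")   # cf. Fig. 3
```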
4.2. BART results, all variables

In this section we report the results obtained by running BART with all 17 explanatory variables. Here we show how BART, relatively automatically, fits the patterns in the data. In the next section, we focus on a small number of variables and interpret the BART fit. We emphasize that all BART results are obtained simply by calling the function bart in the R package BayesTree. No decisions need be made about how to manipulate the information in the data. The default prior was used. A few fairly obvious decisions are made about how to run the Markov Chain, as noted below.

Fig. 5 shows the time series plot of the draws of σ from each iteration of the Markov Chain. The initial part of the plot, where the draws are declining, is the burn-in period of the Markov Chain. The algorithm stochastically searches the high-dimensional space representing the unknown f to find functions that fit the data well, with our prior stopping us from gravitating towards functions which overfit. After the σ draws level off, we are exploring the posterior. As the chain iterates, the variation in the current draw of f (as represented by the current (Ti, Mi)) explores the set of f which could plausibly have generated the observed data. We see that the chain burns in very quickly, so that after a few hundred draws we are estimating the posterior.
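A sketch of the kind of call behind Fig. 5, using bart() from the R package BayesTree with its default prior (200 trees, k = 2, and the (ν, q) = (3, 0.90) σ prior). The objects xdf (a data frame with the 17 explanatory variables, factors included) and y are assumed to have been built as above; argument and component names follow the BayesTree interface and should be checked against its documentation.

```r
library(BayesTree)
set.seed(1)
fit <- bart(x.train = xdf, y.train = y,
            ntree  = 200,       # trees in the sum (the default)
            ndpost = 2000,      # draws kept after burn-in
            nskip  = 300)       # burn-in draws discarded

plot(fit$sigma, type = "l")     # sigma draws, cf. Fig. 5
yhat <- fit$yhat.train.mean     # posterior mean of f at the training x's
cor(y, yhat)^2                  # in-sample R-squared (about 48% in this application)
mean(fit$sigma)                 # estimate of sigma (about 0.5 here)
```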

Fig. 3. The histogram of the log ratio of trip duration over free flow duration, y = log(dur/ffd).

Running the entire set of 2300 iterations took 226 s, or about 4 min (2.93 GHz, Core 2 Duo processor). Note that if we just needed a point estimate we could use a much shorter run. After discarding the burn-in draws, we estimate f(x) by simply averaging the fi(x), where i indexes the MCMC draws. If we do this for the x in our sample we obtain the BART fitted values. The squared correlation between y and the BART fits (or R²) is 48%. Similarly, we can estimate σ by the average of the post burn-in draws. Doing this we obtain σ̂ = 0.5. In Fig. 5 this can be seen as the value at which the draws level off. The marginal posterior distribution of σ would be given by the histogram of the post burn-in σ draws.

By comparison, a linear regression fit (in which all categorical variables are dummied up in the usual way) gives an estimate of σ of 0.58 and an R² of 28%. The horizontal line drawn in Fig. 5 is at this σ estimate. Clearly, no one would simply run a linear regression with this data. As noted, the departure time variable would require some kind of transformation at a minimum. Our point in making this comparison is that BART automatically seeks out reasonable functions without user input. With 17 independent variables, choosing which transformations to try and which interactions to include becomes a daunting task in model selection. In Section 4.4 we try a transformation strategy and compare the results to the BART fit. Of course the reader may wonder if BART has over-fit the data. In Chipman et al. (2010), evidence is given on 42 real data sets that the out-of-sample predictive performance of BART using the default prior is as good as or better than that of leading data-mining techniques tuned using cross-validation. The default BART prior regularizes the fit so that we do not build too complex a model given our data.

4.3. Interpreting BART results

If all we want to do is use BART as a predictive device, there is no problem. The prediction of y given x = x* may be given by the average of the fi(x*) values, where i indexes post burn-in draws of f. However, we often want to learn about the nature of f. How is y related to x? In this case the lack of interpretability of the sum-of-trees representation of f becomes an issue. All flexible approaches (e.g., neural nets) have this problem. Note also that those who think they can interpret the results of a linear regression with a large number of transformations, interactions, and a blizzard of associated t-values are usually fooling themselves. Methods for extracting interpretable information about f are the subject of ongoing research, but we illustrate a few possible approaches.

A simple approach to variable selection is to see how often a variable is used in the sum-of-trees representation (see Chipman et al. (2010)). For each draw, we compute the fraction of tree decision rules that use a given variable and then average that fraction over the MCMC draws. Using this criterion, the variable which stands out the most is free flow distance. This variable is used, on average, in 19% of the tree decision rules. The next two variables are trip type (4%) and departure time (3.4%). There are some subtle issues involving the choice of prior when using BART for variable selection. Chipman et al. (2010) recommend using fewer trees in the sum when doing variable selection than when using BART for prediction. The percentages above were calculated from a BART run using 20 trees in the sum, while the σ draws displayed in Fig. 5 were obtained from a run using 200 trees in the sum (200 is the default choice).
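A sketch of the variable-use calculation just described, run with a smaller ensemble (ntree = 20) as suggested for variable selection. It assumes the varcount component of the bart() output in BayesTree: a matrix with one row per draw and one column per variable, counting the splitting rules that use each variable.

```r
fit20 <- bart(x.train = xdf, y.train = y, ntree = 20, ndpost = 2000, nskip = 300)
use_frac <- fit20$varcount / rowSums(fit20$varcount)   # per-draw fraction of rules using each variable
sort(colMeans(use_frac), decreasing = TRUE)            # averaged over draws; ff_dist leads at about 19%
```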
In order to simplify the analysis, we reran BART using just the independent variables free flow distance, trip type, and departure time, as suggested by our variable selection analysis. Note that our variable selection was done without making strong assumptions about the nature of f. Even with just these three variables, consider the difficulties involved in using a multiple regression based approach. What transformations should be applied to the two numeric variables free flow distance and departure time? The categorical variable trip type has 10 different levels.

Fig. 4. Marginal distributions of free flow distance (ff_dist), trip type (triptype), and departure time (deptime). Distance is measured in miles, departure time with a 24 h daily clock, and trip type is categorical with ten different levels. For example, the first level hw denotes a home based work trip.

What possible interaction terms should be considered for inclusion in the model? If a large number of transformations and interactions are considered, how will the model selection be done? Given a selected model, how do we express our uncertainty? How relevant are the usual t-values given the model search and the possible inclusion of interaction terms? An illustration of a transformation based approach is given in Section 4.4.

Sophisticated users of multiple regression know that the dependence between independent variables affects the inference in important ways. Usually, we think in terms of collinearity, or linear dependence between x variables. Fig. 6 explores the relationship between two of our x variables, departure time and trip type. The left histogram displays departure times for work trips and the right histogram displays departure times for retail trips. Thus, the figure shows some of the dependence between trip type and departure time. Clearly, the dependence is strong and of an unusual type (i.e., with a varying number of modes) which could not possibly be captured by linear thinking. The BayesTree package includes functions for partial dependence plots. These plots can aid in interpreting the effect of individual x variables on y. However, when there is complex dependence between x variables and f is not additive, they can be misleading. Thus, they are not a good choice for our trip duration analysis.

Fig. 7 compares x values that have a large BART estimate of f with those that have a small estimate. The three rows in the figure correspond to our three variables free flow distance, trip type, and departure time, going from top to bottom. The left hand plots use only those observations such that f̂(x) is in the bottom 10% and the right hand plots use only those observations where the fit is in the top 10%. The left hand plots tell us what kinds of x give us a small y and the right hand plots tell us what kinds of x give us a big y.

Fig. 5. Draws of σ from the BART Markov Chain Monte Carlo. The draws initially decrease quickly as the fit improves and the chain burns in. Subsequent variation reflects posterior uncertainty about the value of σ. The chain was iterated 2300 times, with the first 300 draws discarded as burn-in and the last 2000 used for posterior inference. All 2300 draws are shown in the plot.

Fig. 6. Distribution of departure time for home based work trips (on the left) and home based retail trips (on the right).

Such a plot could have been constructed directly from the y values, but by plotting the fits we hope to have a sharper picture, less influenced by noise. We clearly see that longer trips are associated with smaller free flow distance, with trip types hw = home based work trips, hs = home based school trips, and ho = home based other trips ("other" is a catch-all category), and with departure times close to the two rush hours. Note that the afternoon rush hour looks like it is different from the morning one. The morning rush hour is more focused. Intuitively, this rush hour is driven more by work related trips, while in the afternoon a greater variety of activities generate trips.

Perhaps the most logically compelling way to learn about the nature of a function f is to compare its values at carefully chosen x. For a very nice example see Abrevaya and McCulloch (2010). In this case we chose the following nine x configurations:

ff_dist   triptype   deptime
1         hw         8
1         hw         17
1         hw         20
1         hr         17
1         hr         20
1         hsoc       17
1         hsoc       20
1         ho         17
1         ho         20

Each row above specifies a choice of x. We have fixed free flow distance at 1. The first three x's examine home based work trips at 8 am, 5 pm, and 8 pm. The remaining six x's examine home based retail trips, home based social trips and home based other trips, each at 5 pm and 8 pm.
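A sketch of how these nine configurations could be pushed through the three-variable fit, assuming the x.test / yhat.test interface of bart() in BayesTree (yhat.test holds one column of posterior draws of f per test row). The training data frame xdf3, holding ff_dist, triptype and deptime, is hypothetical, and the test rows must use the same factor coding for triptype as the training data.

```r
xgrid <- data.frame(
  ff_dist  = 1,
  triptype = factor(c("hw", "hw", "hw", "hr", "hr", "hsoc", "hsoc", "ho", "ho"),
                    levels = levels(xdf3$triptype)),
  deptime  = c(8, 17, 20, 17, 20, 17, 20, 17, 20))

fit3 <- bart(x.train = xdf3, y.train = y, x.test = xgrid, ntree = 200)
boxplot(as.data.frame(fit3$yhat.test),
        names = c("w8", "w17", "w20", "r17", "r20", "s17", "s20", "o17", "o20"),
        ylab  = "posterior draws of f(x)")    # cf. Fig. 8
```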

Fig. 7. Distributions of explanatory variables for small and large fitted values from BART. The three rows depict the marginal distribution of free flow distance, trip type, and departure time. On the left we use the subset of observations corresponding to the bottom 10% of fitted values from BART. On the right we use the top 10% of fitted values.

Fig. 8. Posterior distributions of f(x) for nine selected x. The first three boxplots (from the left) depict draws from the posterior of f(x) for x which correspond to home based work trips with departure times 7:30 am, 5 pm, and 8 pm. The next two are for home based retail trips at 5 pm and 8 pm. The next two are for home based social trips at 5 pm and 8 pm. The last two are for other trips at 5 pm and 8 pm.

Fig. 8 displays the results. Each boxplot depicts the draws fi(x) for post burn-in MCMC draws i of f. The nine boxplots correspond to the nine different x configurations given above. The symbols w, r, s, and o denote the four different trip types (work, retail, social, and other), and the time is given with the 24 h clock.

The boxplot labelled r17 displays the posterior distribution of f(x), where x corresponds to a home based retail trip at 5 pm (the fourth row in the table above). We can easily see things like work trips (boxplots 1-3) being longer than social trips (boxplots 6 and 7). For each kind of trip, we expect longer durations at 5 pm than at 8 pm. There is some evidence that the difference in trip duration between 5 pm and 8 pm is less for social trips than for the other three kinds of trips, and the difference seems quite similar for work, retail, and other trips. Thus, there is some evidence of a particular kind of interaction, in that the effect of departure time does depend on the type of trip. This makes intuitive sense. We can also see that there is more uncertainty associated with our inference for the social trips than for the other three trip types.

By carefully choosing the x at which to infer f, we can learn much about f. Care is needed in choosing appropriate x. For example, asking about a work trip at 11 am is not a good idea given this data. While this takes effort, it is honest effort in that it involves thinking hard about what questions we really want to ask, and what questions make sense. Figs. 7 and 8 give sensible results.

Fig. 9. Comparison of fits from different modeling strategies. All pairwise plots between y, bart (fits from BART), naivereg (linear regression without transformations), logreg (linear regression with free flow distance logged), and trigreg (linear regression with the log of free flow distance and 16 transformations of departure time).

4.4. Transforming independent variables

In this section we try to improve the fit obtained using standard linear models technology by transforming some of the independent variables. This fit is compared with the BART fit. We consider transformations of the variables departure time and free flow distance. All of the variables are used, as in Section 4.2. Similar results are obtained if we use just the three variables considered in Section 4.3. We do not consider interaction terms. Given the number of variables, there are a great many possible interaction effects that could be considered.

Logging free flow distance improves the linear model fit considerably. Without the log, the R² was 28% and the estimate of σ was 0.58. Replacing free flow distance with its log increases R² to 40% and decreases the estimate of σ. Transforming departure time is more complex. Following Popuri et al. (2008), we let

g1(T) = exp(sin(2πT/24)),   g2(T) = exp(cos(2πT/24)),
g3(T) = exp(sin(4πT/24)),   g4(T) = exp(cos(4πT/24)),

where T is departure time. Each of the four gi is raised to the powers 1, 2, 3, and 4. Thus, a total of 16 transformations of the variable T = departure time are included in the multiple regression: gi(T)^j, i = 1, 2, 3, 4 and j = 1, 2, 3, 4. Note that in the Popuri et al. (2008) analysis, an additional variable called delay is used and all of the gi(T)^j are multiplied by delay, so that their analysis is focused on the interaction between departure time and delay. Our analysis does not use the variable delay, so our transformation strategy is not identical to theirs. Nevertheless, by using the same functional form g(T) = Σi,j βi,j gi(T)^j, we hope to build upon their work in a reasonable way. Note that Popuri et al. (2008) are able to focus in on a specific kind of interaction based on subject matter insight. They obtain quite reasonable and interpretable representations of the relationship, as exhibited in their Fig. 2.

Adding the 16 transformations of departure time to the multiple regression with free flow distance logged increases R² to just 41% (from 40%). The estimate of σ is 0.52, which is virtually the same as that obtained with just the log of free flow distance. Recall that the BART R² is 48% and the BART estimate of σ is 0.50. Fig. 9 compares the fits from the different modeling strategies. We see that simply logging free flow distance goes a long way towards capturing f. There is very little difference between the fits with the log alone and those with the departure time transformations added. The BART fits are similar to those obtained with the transformations, but there is a suggestion of some differences in the last plot in the second row. If we test the null hypothesis that all the coefficients for the departure time transformations are equal to 0, the p-value indicates statistical significance. Typically in applications this is used to argue that including the transformations is useful in uncovering the relationship. Fig. 10 plots ĝ(T) = Σi,j β̂i,j gi(T)^j vs. T. It is hard to see how this curve makes sense intuitively. Again, Popuri et al. (2008) did obtain sensible results using the set of transformations we have employed here. In this application, the transformations give results which are statistically significant but practically insignificant and intuitively unappealing. Of course, some other transformation approach might give better results.

Fig. 10. The estimated additive component contributed by the transformations of departure time.
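A sketch of the transformation strategy just described: the four sinusoidal transforms of departure time, each raised to the powers 1-4, entered in a linear model together with the log of free flow distance. The data frame dat and its columns y, ff_dist and deptime are hypothetical stand-ins for the variables above.

```r
tt <- dat$deptime
g4 <- cbind(exp(sin(2 * pi * tt / 24)), exp(cos(2 * pi * tt / 24)),
            exp(sin(4 * pi * tt / 24)), exp(cos(4 * pi * tt / 24)))
G  <- do.call(cbind, lapply(1:4, function(j) g4^j))        # the 16 terms g_i(T)^j
colnames(G) <- paste0("g", rep(1:4, times = 4), "_pow", rep(1:4, each = 4))

trigreg <- lm(dat$y ~ log(dat$ff_dist) + G)
summary(trigreg)$r.squared      # about 0.41 here, versus 0.40 with the log alone
# Plotting the fitted combination of the 16 terms against departure time reproduces
# the estimated additive component shown in Fig. 10.
```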

5. Conclusions

This paper introduces and illustrates the application of the BART model to trip duration modeling. BART quickly and easily models the dependent variable, y, as some generic function f of the explanatory variables. Note that this is, in a sense, a nonparametric approach, since the functional form of f is specified in such a flexible form. This is particularly useful when some explanatory variables are not expected to have linear relationships with y, such as the case of departure time's relationship with trip durations. The application demonstrated here clearly shows the nonlinear relationship between these variables, with trip duration peaks occurring during the AM and PM peak travel periods, as expected. Further, the nature of the two peaks seems to be quite different, with a more focused rush hour in the morning. We also find evidence of interaction between two of our key variables: the effect of departure time may depend on the type of trip taken. Future research will focus on identifying other transportation research areas where BART could be employed, as well as extending the ideas in this paper to gain other insights into the actual relationship between departure time and trip duration.

References

Abou Zeid, M., Rossi, T.F., Gardner, B., 2006. Modeling time-of-day choice in the context of tour- and activity-based models. Transportation Research Record: Journal of the Transportation Research Board 1981.
Abrevaya, J., McCulloch, R., 2010. Reversal of Fortune: A Statistical Analysis of Penalty Calls in the National Hockey League.
Chipman, H., George, E., McCulloch, R., 1998. Bayesian CART model search. Journal of the American Statistical Association 93 (443).
Chipman, H., George, E., McCulloch, R., 2006. Bayesian ensemble learning. In: Scholkopf, B., Platt, J., Hoffman, T. (Eds.), Advances in Neural Information Processing Systems, vol. 19. MIT Press, Cambridge, MA.
Chipman, H., George, E., McCulloch, R., 2010. BART: Bayesian Additive Regression Trees. Annals of Applied Statistics.
Freund, Y., Schapire, R.E., 1997. A decision theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55 (1).
Hastie, T., Tibshirani, R., 2000. Bayesian backfitting. Statistical Science 15 (3).
Popuri, Y., Ben-Akiva, M., Proussaloglou, K., 2008. Time-of-day modeling in a tour-based context: the Tel-Aviv experience. Transportation Research Record: Journal of the Transportation Research Board 2076.


More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

where Female = 0 for males, = 1 for females Age is measured in years (22, 23, ) GPA is measured in units on a four-point scale (0, 1.22, 3.45, etc.

where Female = 0 for males, = 1 for females Age is measured in years (22, 23, ) GPA is measured in units on a four-point scale (0, 1.22, 3.45, etc. Notes on regression analysis 1. Basics in regression analysis key concepts (actual implementation is more complicated) A. Collect data B. Plot data on graph, draw a line through the middle of the scatter

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Making rating curves - the Bayesian approach

Making rating curves - the Bayesian approach Making rating curves - the Bayesian approach Rating curves what is wanted? A best estimate of the relationship between stage and discharge at a given place in a river. The relationship should be on the

More information

Chapter 10. Optimization Simulated annealing

Chapter 10. Optimization Simulated annealing Chapter 10 Optimization In this chapter we consider a very different kind of problem. Until now our prototypical problem is to compute the expected value of some random variable. We now consider minimization

More information

ECE 5984: Introduction to Machine Learning

ECE 5984: Introduction to Machine Learning ECE 5984: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting Readings: Murphy 16.4; Hastie 16 Dhruv Batra Virginia Tech Administrativia HW3 Due: April 14, 11:55pm You will implement

More information

A Study into Mechanisms of Attitudinal Scale Conversion: A Randomized Stochastic Ordering Approach

A Study into Mechanisms of Attitudinal Scale Conversion: A Randomized Stochastic Ordering Approach A Study into Mechanisms of Attitudinal Scale Conversion: A Randomized Stochastic Ordering Approach Zvi Gilula (Hebrew University) Robert McCulloch (Arizona State) Ya acov Ritov (University of Michigan)

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revised on April 24, 2017 Today we are going to learn... 1 Markov Chains

More information

Ensemble Methods and Random Forests

Ensemble Methods and Random Forests Ensemble Methods and Random Forests Vaishnavi S May 2017 1 Introduction We have seen various analysis for classification and regression in the course. One of the common methods to reduce the generalization

More information

CIVL 7012/8012. Collection and Analysis of Information

CIVL 7012/8012. Collection and Analysis of Information CIVL 7012/8012 Collection and Analysis of Information Uncertainty in Engineering Statistics deals with the collection and analysis of data to solve real-world problems. Uncertainty is inherent in all real

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Designing Information Devices and Systems I Spring 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Spring 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Spring 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

Automatic Differentiation and Neural Networks

Automatic Differentiation and Neural Networks Statistical Machine Learning Notes 7 Automatic Differentiation and Neural Networks Instructor: Justin Domke 1 Introduction The name neural network is sometimes used to refer to many things (e.g. Hopfield

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Prediction of Data with help of the Gaussian Process Method

Prediction of Data with help of the Gaussian Process Method of Data with help of the Gaussian Process Method R. Preuss, U. von Toussaint Max-Planck-Institute for Plasma Physics EURATOM Association 878 Garching, Germany March, Abstract The simulation of plasma-wall

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

Chapter 10 Nonlinear Models

Chapter 10 Nonlinear Models Chapter 10 Nonlinear Models Nonlinear models can be classified into two categories. In the first category are models that are nonlinear in the variables, but still linear in terms of the unknown parameters.

More information

Section 3: Simple Linear Regression

Section 3: Simple Linear Regression Section 3: Simple Linear Regression Carlos M. Carvalho The University of Texas at Austin McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Regression: General Introduction

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017 CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).

More information

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI Introduction of Data Analytics Prof. Nandan Sudarsanam and Prof. B Ravindran Department of Management Studies and Department of Computer Science and Engineering Indian Institute of Technology, Madras Module

More information

Graphical Models and Kernel Methods

Graphical Models and Kernel Methods Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.

More information

Urban Transportation Planning Prof. Dr.V.Thamizh Arasan Department of Civil Engineering Indian Institute of Technology Madras

Urban Transportation Planning Prof. Dr.V.Thamizh Arasan Department of Civil Engineering Indian Institute of Technology Madras Urban Transportation Planning Prof. Dr.V.Thamizh Arasan Department of Civil Engineering Indian Institute of Technology Madras Module #03 Lecture #12 Trip Generation Analysis Contd. This is lecture 12 on

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information

Lecture 8. Instructor: Haipeng Luo

Lecture 8. Instructor: Haipeng Luo Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine

More information

Decision Tree Learning Lecture 2

Decision Tree Learning Lecture 2 Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

More information

Inferential statistics

Inferential statistics Inferential statistics Inference involves making a Generalization about a larger group of individuals on the basis of a subset or sample. Ahmed-Refat-ZU Null and alternative hypotheses In hypotheses testing,

More information

Sampling Distribution Models. Chapter 17

Sampling Distribution Models. Chapter 17 Sampling Distribution Models Chapter 17 Objectives: 1. Sampling Distribution Model 2. Sampling Variability (sampling error) 3. Sampling Distribution Model for a Proportion 4. Central Limit Theorem 5. Sampling

More information

Understanding Travel Time to Airports in New York City Sierra Gentry Dominik Schunack

Understanding Travel Time to Airports in New York City Sierra Gentry Dominik Schunack Understanding Travel Time to Airports in New York City Sierra Gentry Dominik Schunack 1 Introduction Even with the rising competition of rideshare services, many in New York City still utilize taxis for

More information

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Fall 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate it

More information

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Anthony Trubiano April 11th, 2018 1 Introduction Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability

More information

Stochastic Processes

Stochastic Processes qmc082.tex. Version of 30 September 2010. Lecture Notes on Quantum Mechanics No. 8 R. B. Griffiths References: Stochastic Processes CQT = R. B. Griffiths, Consistent Quantum Theory (Cambridge, 2002) DeGroot

More information

DAG models and Markov Chain Monte Carlo methods a short overview

DAG models and Markov Chain Monte Carlo methods a short overview DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex

More information

Bayesian phylogenetics. the one true tree? Bayesian phylogenetics

Bayesian phylogenetics. the one true tree? Bayesian phylogenetics Bayesian phylogenetics the one true tree? the methods we ve learned so far try to get a single tree that best describes the data however, they admit that they don t search everywhere, and that it is difficult

More information

A Re-Introduction to General Linear Models (GLM)

A Re-Introduction to General Linear Models (GLM) A Re-Introduction to General Linear Models (GLM) Today s Class: You do know the GLM Estimation (where the numbers in the output come from): From least squares to restricted maximum likelihood (REML) Reviewing

More information

Exploratory quantile regression with many covariates: An application to adverse birth outcomes

Exploratory quantile regression with many covariates: An application to adverse birth outcomes Exploratory quantile regression with many covariates: An application to adverse birth outcomes June 3, 2011 eappendix 30 Percent of Total 20 10 0 0 1000 2000 3000 4000 5000 Birth weights efigure 1: Histogram

More information

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007 Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

More information

Introduction to Optimization

Introduction to Optimization Introduction to Optimization Blackbox Optimization Marc Toussaint U Stuttgart Blackbox Optimization The term is not really well defined I use it to express that only f(x) can be evaluated f(x) or 2 f(x)

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

6.867 Machine learning, lecture 23 (Jaakkola)

6.867 Machine learning, lecture 23 (Jaakkola) Lecture topics: Markov Random Fields Probabilistic inference Markov Random Fields We will briefly go over undirected graphical models or Markov Random Fields (MRFs) as they will be needed in the context

More information

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:

More information

Fitting a Straight Line to Data

Fitting a Straight Line to Data Fitting a Straight Line to Data Thanks for your patience. Finally we ll take a shot at real data! The data set in question is baryonic Tully-Fisher data from http://astroweb.cwru.edu/sparc/btfr Lelli2016a.mrt,

More information