Transportation Research Part B
Transportation Research Part B 44 (2010)

Contents lists available at ScienceDirect: Transportation Research Part B

Bayesian flexible modeling of trip durations

Hugh Chipman (a), Edward George (b), Jason Lemp (c), Robert McCulloch (d,*)

(a) Department of Mathematics and Statistics, Acadia University, Wolfville, NS, Canada
(b) Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA, United States
(c) Department of Civil, Architectural, and Environmental Engineering, The University of Texas at Austin, TX, United States
(d) IROM Department, McCombs School of Business, The University of Texas at Austin, TX, United States

Article history: Received 18 January 2010; accepted 19 January 2010.

Keywords: Markov Chain Monte Carlo; Boosting; Ensemble modeling

Abstract: Recent advances in Bayesian modeling have led to stunning improvements in our ability to flexibly and easily model complex high-dimensional data. Flexibility comes from the use of a very large number of parameters without fixed dimension. Priors are placed on the parameters to avoid over-fitting and to sensibly guide the search in model space for appropriate data-driven model choice. Modern computational, high-dimensional search methods (in particular Markov Chain Monte Carlo) then allow us to search the parameter space. This paper introduces the application of BART, Bayesian Additive Regression Trees, to modeling trip durations. We have survey data on characteristics of trips in the Austin area. We seek to relate the trip duration to features of the household and trip characteristics. BART enables one to make inferences about the relationship with minimal assumptions and user decisions. © 2010 Elsevier Ltd. All rights reserved.

1. Introduction

The workhorse model of applied statistics is the multiple regression model. Multiple regression allows us to relate a single y to many x variables. This is often the goal in applied work.
However, multiple regression makes the fundamental assumption of a linear relationship between y and x. With many x variables, this assumption may not be tenable, and it is hard to check. Good researchers are trained to use diagnostic plots to assess model adequacy. In case of problems, a wide variety of transformations of both the dependent and independent variables are available in the statistics literature. In practice, we often fail to carefully check the model. If problems are found with the linear specification, the choice of possible transformations is so overwhelming that most applied workers limit themselves, quite reasonably, to a few transformations, such as taking the log of y and using polynomial-type terms for some of the explanatory variables. Even with a moderate number of explanatory variables, the task of searching for a reasonable specification quickly becomes overwhelming.

In this paper we illustrate the use of BART, Bayesian Additive Regression Trees (Chipman et al., 2006, 2010), with particular emphasis on its role and impact in transportation research. BART combines recent advances in Bayesian modeling with ideas from machine learning to sensibly search the (potentially) high-dimensional space of possible models relating y to a high-dimensional x. The model is estimated for trip duration data from Austin, Texas. The goal of the study is to investigate how the reported time to take a trip in an automobile (y) depends on characteristics of the trip and the people making the trip (the x's).

Abou Zeid et al. (2006) and Popuri et al. (2008) modeled trip durations using classical linear regression techniques. Special considerations were needed for entering the time-of-day variable in the model. Clearly, travel times will not vary in a linear way from hour to hour. Abou Zeid et al. (2006) and Popuri et al. (2008) both employed collections of sinusoidal functions of departure time in the hope that such functions would be able to approximate the relationship between travel times and departure time. With BART, there is no need to experiment with different transformations of the explanatory variables. BART automatically detects and models nonlinear relationships between dependent and explanatory variables, including interactions between explanatory variables.

The remainder of the paper is organized as follows: Section 2 describes the BART model. Section 3 outlines the Markov Chain Monte Carlo (MCMC) algorithm used to search the model space. Section 4 illustrates the use of BART in modeling trip-duration data. Section 5 concludes.

* Corresponding author. Address: IROM Department, McCombs School of Business, The University of Texas at Austin, 1 University Station, B6500 Austin, TX, United States. E-mail address: robert.mcculloch1@gmail.com (R. McCulloch).

2. The BART model

The model consists of two parts: a sum-of-trees model, called BART (Bayesian Additive Regression Trees), and a regularization prior.

2.1. A sum-of-trees model

The central element of our model is a regression tree, a predictive model that seeks to accomplish the same task as linear regression: predict a response y given the values of a vector of independent variables x = (x_1, ..., x_p). What distinguishes the regression tree from a linear regression model is how the regression tree generates the prediction. An illustration is given in Fig. 1, with x = (x_1, x_2). The tree consists of a root node containing a question about one of the independent variables, here whether x_2 < 1. Depending on the answer to this question, we would follow the left (x_2 < 1) or right (x_2 ≥ 1) branch of the tree, arriving at a child node. To generate a prediction, we continue branching based on our value of x until a terminal node is reached, and an output μ is returned. The output parameter μ_b in terminal node b plays the role of a response in regression.
The tree model partitions the x space into rectangular regions, and associates a single predicted value for the response y with each region. In Fig. 1, the tree partitions the (x_1, x_2) space into three rectangular regions, and produces outputs of 0.1, 0.8 or 0.3, depending on which region a value x falls in. This particular tree represents an interaction between x_1 and x_2, since the relationship between y and x_1 is constant (the value 0.3) if x_2 ≥ 1, but changes as a function of x_1 (the values 0.1 and 0.8) if x_2 < 1. A tree model must be estimated from data, similar to the coefficients of a linear regression. We must estimate the tree structure itself (e.g. splitting rules like x_2 < 1 associated with interior nodes) and the terminal node parameters associated with the tree (the μ's). Estimation will be discussed in Section 3.

To develop a sum-of-trees model, we establish notation that represents the informal description of a single tree model. Let T denote a binary tree consisting of a set of interior node decision rules (questions) and a set of terminal nodes, and let M = {μ_1, μ_2, ..., μ_B} denote a set of parameter values associated with each of the B terminal nodes of T. Prediction for a particular value of the input vector x is accomplished as follows: if x is associated with terminal node b of T by the sequence of decision rules from top to bottom, it is assigned the value μ_b. We use g(x; T, M) to denote the function corresponding to (T, M) which assigns a μ_b ∈ M to x.

Fig. 1. An illustration of a tree model (a) and its predictions (b). The tree splits on x_2 < 1 at the root and on x_1 < 1.5 at its left child, producing terminal node outputs μ_1 = 0.1, μ_2 = 0.8, and μ_3 = 0.3.
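To make the prediction rule concrete, here is a minimal sketch (our own illustration, not code from the paper or the BayesTree package) of a regression tree g(x; T, M) that reproduces the tree of Fig. 1:

```python
# A minimal sketch of how a regression tree g(x; T, M) generates predictions,
# using the tree of Fig. 1. The Node class and g() are illustrative only.

class Node:
    """An interior node splits on (var, cut); a terminal node holds mu."""
    def __init__(self, var=None, cut=None, left=None, right=None, mu=None):
        self.var, self.cut = var, cut
        self.left, self.right = left, right
        self.mu = mu

def g(x, tree):
    """Follow the decision rules from the root to a terminal node; return its mu."""
    node = tree
    while node.mu is None:
        node = node.left if x[node.var] < node.cut else node.right
    return node.mu

# The tree of Fig. 1: root splits on x_2 < 1, its left child on x_1 < 1.5.
fig1_tree = Node(var=1, cut=1.0,
                 left=Node(var=0, cut=1.5,
                           left=Node(mu=0.1), right=Node(mu=0.8)),
                 right=Node(mu=0.3))

print(g((0.5, 0.5), fig1_tree))  # x_2 < 1 and x_1 < 1.5 -> 0.1
print(g((2.0, 0.5), fig1_tree))  # x_2 < 1 and x_1 >= 1.5 -> 0.8
print(g((0.5, 1.5), fig1_tree))  # x_2 >= 1 -> 0.3
```

The while loop is exactly the "sequence of decision rules from top to bottom" in the text: prediction is a table lookup into the rectangular partition, not a linear combination of the x's.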
Using this notation, our sum-of-trees model can more explicitly be expressed as

Y = g(x; T_1, M_1) + g(x; T_2, M_2) + ... + g(x; T_m, M_m) + ε,   ε ~ N(0, σ²).   (1)

Thus, our model is of the form Y = f(x) + ε, where the function f is flexibly represented as f(x) = g(x; T_1, M_1) + g(x; T_2, M_2) + ... + g(x; T_m, M_m). In a single tree model, the conditional mean of Y given x is composed of a single μ value associated with one terminal node (i.e. the output of one g). Unlike the single tree model, the sum-of-trees model (1) uses m different μ values to compose the conditional mean of Y given x. Such terminal node parameters will represent interaction effects when their assignment depends on more than one component of x (i.e., more than one independent variable). Because (1) may be based on trees of varying sizes, the sum-of-trees model can incorporate both direct effects and interaction effects of varying orders. In the special case where every terminal node assignment depends on just a single component of x, the sum-of-trees model reduces to a simple additive function.

In the machine learning literature the term ensemble is used to describe a collection of model pieces that add up to a bigger model. Thus, in the BART model the ensemble is the collection {g(·; T_i, M_i)}, i = 1, ..., m. The overall intuition is that a good way to find the fit is by adding little bits at a time. There are different ensemble methods in the literature, with boosting (Freund and Schapire, 1997) being the lead example. BART is related to and partially motivated by boosting but has fundamental differences (see Section 3 below). With a large number of trees, a sum-of-trees model gains increased representation flexibility, which, when coupled with our regularization prior, gives excellent out-of-sample predictive performance. The default value for m, used in the application in this paper, is 200.
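The additive structure of Eq. (1) can be sketched as follows; each "tree" here is reduced to a hypothetical one-split stump, and the μ values are invented for illustration:

```python
# A hedged sketch of the sum-of-trees mean f(x) = sum_i g(x; T_i, M_i).
# Each tree is a single-split "stump" for brevity; the cut points and mu
# values below are made up and are not from the paper.

def stump(var, cut, mu_left, mu_right):
    """Return g(.; T, M) for a one-split tree on component `var` of x."""
    return lambda x: mu_left if x[var] < cut else mu_right

# An ensemble of m = 3 small trees; each contributes a little to the fit.
ensemble = [
    stump(0, 1.5, -0.25, 0.25),
    stump(1, 1.0, 0.125, -0.125),
    stump(0, 0.5, 0.0, 0.125),
]

def f(x):
    """Conditional mean of Y given x in Eq. (1), i.e. the sum without the noise."""
    return sum(g(x) for g in ensemble)

print(f((2.0, 0.5)))  # 0.25 + 0.125 + 0.125 -> 0.5
```

Because the first and third stumps split on component 0 and the second on component 1, the sum is additive in (x_1, x_2) here; a deeper tree whose terminal assignment used both components would contribute an interaction, as described above.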
Note that with m large there are hundreds of parameters, of which only σ is identified. For example, swapping (T_1, M_1) for (T_2, M_2) in (1) gives a different parameterization but the same predictive model. This is not a problem for our Bayesian analysis as long as we use a proper prior. It just means that inferential statements cannot be made about individual μ and T parameters. Instead, we shall draw inferences on the predictions, that is, on the function f and σ. This formulation can be considered as a mechanism for placing a prior distribution on functions, even though individual parameters are not identified. Indeed, this lack of identification is the reason our MCMC mixes well. Even when m is much larger than needed to capture f (effectively, we have an "over-complete basis"), the procedure still works well. One of the key reasons the procedure works well with so many parameters is an effective specification of prior distributions for these parameters. We explore this in the next section.

2.2. A regularization prior

In many Bayesian analyses, one seeks relatively uninformative prior distributions for unknown parameters in order to let the data speak for itself. In our model, however, there are so many free parameters that uninformative priors would give the data too much of a voice: the prediction for observation i would be the observed response y_i, interpolating the training data perfectly. The enormous capacity of the model to represent the data must be reined in by prior distributions that control the model's adaptability. In machine learning, this process of constraining parameters is called regularization, hence our term regularization priors. In this section we outline how to place a prior on each tree T and its terminal node parameters M.
The complexity of the prior specification is vastly simplified by letting the T_i be a priori independent and identically distributed (i.i.d.), the μ_{i,b} (node b of tree i) be i.i.d. given all T's, and σ be independent of all T and μ. Given these independence assumptions, we need only choose priors for a single tree T, a single μ, and σ. Motivated by our desire to make each g(x; T_i, M_i) a small contribution to the overall fit, we put prior weight on small trees and small μ_{i,b}. In the machine learning literature, the individual g(x; T_i, M_i) are often called weak learners. They are learners in that each g(x; T_i, M_i) fits or learns something about the relationship between y and x. They are weak in that each g(x; T_i, M_i) makes a small contribution to the overall fit.

For the tree prior, we use the same specification as in Chipman et al. (1998). In this prior, the probability that a node at depth d is nonterminal is α(1 + d)^(−β). In all examples we use the same prior, corresponding to the choice α = 0.95 and β = 2. With this choice, a root node (d = 0) has probability α = 0.95 of having children, and a node at depth 1 has probability 0.95/4 = 0.2375 of having children. The corresponding probability distribution on tree size (number of terminal nodes) gives probabilities of 0.05, 0.55, 0.28, 0.09, and 0.03 for trees of size 1, 2, 3, 4, and ≥5, respectively. Note that even with this prior, trees with many terminal nodes can be grown if the data demands it. At any non-terminal node, the prior on the associated decision rule puts equal probability on each available independent variable and then equal probability on each available rule given the variable. Thus for the tree in Fig.
1, assuming only two predictors X_1 and X_2, each taking possible values 0, 0.1, 0.2, ..., 2.0, we have prior probability

0.95            (root node is nonterminal)
× 0.5           (split is on X_2, one of two variables)
× 0.05          (split is on 1 of 20 possible locations)
× 0.2375        (left child is nonterminal)
× 0.5           (left child splits on X_1, one of two variables)
× 0.05          (split is on 1 of 20 possible locations)
× (1 − 0.1056)² (its two children are terminal)
× (1 − 0.2375)  (right child of root node is terminal)
≈ 8.6 × 10⁻⁵.   (2)
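Both the probability in Eq. (2) and the quoted tree-size distribution can be checked numerically. The following sketch (our own, not part of the BayesTree package) uses the node-splitting probability α(1 + d)^(−β) with α = 0.95, β = 2:

```python
# Check two numbers quoted above: the prior probability of the Fig. 1 tree
# (Eq. (2)) and the prior distribution of tree size under alpha=.95, beta=2.
import random

ALPHA, BETA = 0.95, 2.0

def p_nonterminal(depth):
    """Prior probability that a node at the given depth has children."""
    return ALPHA * (1.0 + depth) ** (-BETA)

# Eq. (2): the Fig. 1 tree with 2 predictors and 20 candidate cut points each.
p_tree = (p_nonterminal(0) * 0.5 * 0.05      # root splits on X_2, 1 of 20 cuts
          * p_nonterminal(1) * 0.5 * 0.05    # left child splits on X_1
          * (1 - p_nonterminal(2)) ** 2      # its two children are terminal
          * (1 - p_nonterminal(1)))          # right child of root is terminal
print(p_tree)  # approximately 8.6e-05

def draw_tree_size(depth=0):
    """Simulate the number of terminal nodes of a tree drawn from the prior."""
    if random.random() < p_nonterminal(depth):
        return draw_tree_size(depth + 1) + draw_tree_size(depth + 1)
    return 1

random.seed(1)
draws = [draw_tree_size() for _ in range(100000)]
for size in (1, 2, 3, 4):
    print(size, round(draws.count(size) / len(draws), 2))
# Should come out near the quoted 0.05, 0.55, 0.28, 0.09.
```

Note that the size-1 probability is exactly 1 − α = 0.05, and the size-2 probability is 0.95 × (1 − 0.2375)² ≈ 0.55, matching the text.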
For the prior on a μ, we first shift and rescale Y so that there is high prior probability that E(Y|x) ∈ (−0.5, 0.5). We let μ ~ N(0, σ_μ²), where μ is the output of any one terminal node of any one tree. Given the T_i and an x, E(Y|x) is the sum of m independent μ's (recall Eq. (1)). The standard deviation of the sum is √m σ_μ. We must choose σ_μ so that this standard deviation ensures high probability that E(Y|x) is in (−0.5, 0.5). We choose σ_μ so that 0.5 is within k standard deviations of zero: k √m σ_μ = 0.5. For example, if k = 2 there is a 95% (conditional) prior probability that the mean of Y is in (−0.5, 0.5). k = 2 is our default choice, and in practice we typically rescale the response y so that its observed values range from −0.5 to 0.5. Note that this prior increases the shrinkage of μ_{i,b} (toward zero) as m increases. As more trees are used in the ensemble, each one is permitted to contribute a smaller amount to the overall prediction.

For the prior on σ, we start from the usual inverted-chi-squared prior: σ² ~ νλ/χ²_ν. To choose the hyper-parameters ν and λ, we begin by obtaining a rough overestimate σ̂ of σ. We then pick a degrees-of-freedom value ν between 3 and 10. Finally, we pick a value of q such as 0.75, 0.90 or 0.99, and set λ so that the qth quantile of the prior on σ is located at σ̂, that is, P(σ < σ̂) = q. Fig. 2 illustrates priors corresponding to three (ν, q) settings when the rough overestimate is σ̂ = 2. We refer to these three settings, (ν, q) = (10, 0.75), (3, 0.90), (3, 0.99), as conservative, default and aggressive, respectively. For automatic use, we recommend the default setting (ν, q) = (3, 0.90), which tends to avoid extremes. Simple data-driven choices of σ̂ that we have used in practice are the estimate from a linear regression or the sample standard deviation of Y. Note that this prior choice can be influential.
Strong prior beliefs that σ is very small could lead to over-fitting.

Fig. 2. Three priors on σ when σ̂ = 2: conservative (df = 10, quantile = 0.75), default (df = 3, quantile = 0.9), and aggressive (df = 3, quantile = 0.99).

3. A back-fitting MCMC algorithm

Given the observed data y, our Bayesian setup induces a posterior distribution

p((T_1, M_1), ..., (T_m, M_m), σ | y)

on all the unknowns that determine a sum-of-trees model. Although the sheer size of this parameter space precludes exhaustive calculation, a back-fitting MCMC algorithm (Hastie and Tibshirani, 2000) can be used to sample from this posterior. At a general level, our algorithm is a Gibbs sampler. For notational convenience, let T_(i) denote the set of all trees in the sum except T_i, and similarly define M_(i). The Gibbs sampler here entails m successive draws of (T_i, M_i) conditionally on (T_(i), M_(i), σ):

(T_1, M_1) | T_(1), M_(1), σ, y
(T_2, M_2) | T_(2), M_(2), σ, y
...
(T_m, M_m) | T_(m), M_(m), σ, y,   (3)

followed by a draw of σ from the full conditional:
σ | T_1, ..., T_m, M_1, ..., M_m, y.   (4)

The back-fitting MCMC algorithm repeatedly re-samples the parameters of each tree in the ensemble, conditional on the current parameter values of the other m − 1 trees. This approach has some similarities to, and differences from, the boosting algorithm of Freund and Schapire (1997). Boosting also produces an ensemble-of-trees model (1). The boosting algorithm also updates one tree conditional on all others, but it does so only once, rather than repeatedly resampling as in MCMC. This yields a single estimated model, rather than a posterior distribution on the model.

Evaluation of the full conditionals required for Gibbs sampling is simplified by rearranging (1). For example, to sample (T_1, M_1) | T_(1), M_(1), σ, y, we can write

Y − g(x; T_2, M_2) − ... − g(x; T_m, M_m) = g(x; T_1, M_1) + ε.   (5)

Given (T_(1), M_(1)) and σ, we may subtract the fit from (T_(1), M_(1)) from both sides of (1), leaving us with a single tree model, g(x; T_1, M_1), with known error variance. This draw may be made following the approach of Chipman et al. (1998). These methods draw (T_i, M_i) | T_(i), M_(i), σ, y as T_i | T_(i), M_(i), σ, y followed by M_i | T_i, T_(i), M_(i), σ, y. The idea is that we can draw a (T, M) by drawing from the marginal of T after integrating out M, and then from the conditional of M given the draw of T. The structure of the BART model and prior are carefully chosen to make this possible. The first draw is done by the Metropolis-Hastings algorithm after integrating out M_i, and the second is a set of normal draws. The draw of σ is easily accomplished by subtracting all the fit from both sides of (1), so that the errors ε_i are considered to be observed. Given all the (T_i, M_i) we know f, so we can compute ε_i = y_i − f(x_i). The draw is then a standard inverted-chi-squared draw, since our prior is conditionally conjugate. Subtracting off fits and fitting the residuals is often called backfitting.
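The backfitting idea, refitting each component against the residuals left by the others, can be sketched outside the Bayesian setting as follows. Each "tree" here is a least-squares stump on a single hypothetical predictor, whereas the actual algorithm draws (T_i, M_i) from its full conditional rather than optimizing:

```python
# A simplified, non-Bayesian sketch of backfitting: each component is
# repeatedly refit to the residuals left by all the other components.
# Here each "tree" is a one-split stump fit by least squares; BART instead
# *samples* (T_i, M_i) from its full conditional, as described above.

def fit_stump(xs, resid):
    """Least-squares single-split stump on a 1-d predictor (illustrative helper)."""
    best = None
    for cut in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, resid) if x < cut]
        right = [r for x, r in zip(xs, resid) if x >= cut]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, cut, ml, mr)
    _, cut, ml, mr = best
    return lambda x: ml if x < cut else mr

def backfit(xs, ys, m=10, sweeps=20):
    """Cycle through m stumps, refitting each to the residuals of the others."""
    trees = [lambda x: 0.0] * m
    for _ in range(sweeps):
        for i in range(m):
            # Residual after subtracting the fit of all trees except tree i,
            # as in the rearrangement of Eq. (1) above.
            resid = [y - sum(t(x) for j, t in enumerate(trees) if j != i)
                     for x, y in zip(xs, ys)]
            trees[i] = fit_stump(xs, resid)
    return lambda x: sum(t(x) for t in trees)

xs = [i / 10 for i in range(21)]
ys = [x * x for x in xs]            # a smooth nonlinear target
fhat = backfit(xs, ys)
err = max(abs(fhat(x) - y) for x, y in zip(xs, ys))
print(err)  # training error of the backfitted sum of stumps (small)
```

Each refit can only lower the residual sum of squares, so the sum of stumps settles into a piecewise-constant approximation of the target; the MCMC version replaces each deterministic refit with a posterior draw, which is why it yields a sample of fits rather than one fit.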
Our Gibbs sampler iteratively and stochastically backfits (Hastie and Tibshirani, 2000). The Metropolis-Hastings draw of T_i | T_(i), M_(i), σ, y is complex and lies at the heart of our method. The algorithm of Chipman et al. (1998) proposes a new tree based on the current tree using one of four moves. The moves and their associated proposal probabilities are: growing a terminal node (0.25), pruning a pair of terminal nodes (0.25), changing a non-terminal rule (0.40), and swapping a rule between parent and child (0.10). Note that the grow and prune moves change the implicit dimensionality of the proposed tree in terms of the number of terminal nodes. Some readers may be more familiar with simulated annealing, a stochastic search algorithm that can be obtained by manipulating the acceptance probabilities of the Metropolis-Hastings algorithm as it runs. Both algorithms have in common the proposal of a new state as a random perturbation of the current state, followed by a randomized accept/reject step.

We initialize the chain with m single-node trees, and then iterations are repeated until satisfactory convergence is obtained. We illustrate convergence assessment by monitoring the σ draws in Section 4. At each iteration, each tree may increase or decrease its number of terminal nodes by one, or change one or two decision rules. Each μ will change (or cease to exist, or be born), and σ will change. It is not uncommon for a tree to grow large and then subsequently collapse back down to a single node as the algorithm iterates. The sum-of-trees model, with its abundance of unidentified parameters, allows the fit to be freely reallocated from one tree to another. Because each move makes only small incremental changes to the fit, we can imagine the algorithm as analogous to sculpting a complex figure by adding and subtracting small dabs of clay. Compared to the single tree model MCMC approach of Chipman et al. (1998), the back-fitting MCMC algorithm mixes dramatically better.
When only single tree models are considered, the MCMC algorithm tends to quickly gravitate toward a single large tree and then gets stuck in a local neighborhood of that tree. In sharp contrast, we have found that restarts of the back-fitting MCMC algorithm give remarkably similar results even in difficult problems. Consequently, we run one long chain rather than multiple starts. In some ways back-fitting MCMC is a stochastic alternative to boosting algorithms for fitting linear combinations of trees. It is distinguished by the ability to sample from a posterior distribution. At each iteration, we get a new draw

f* = g(x; T_1, M_1) + g(x; T_2, M_2) + ... + g(x; T_m, M_m)

corresponding to the draw of {T_j} and {M_j}. These draws are a (dependent) sample from the posterior distribution on the true f. Rather than pick the best f* from these draws, the set of multiple draws can be used to further enhance inference. In contrast, boosting generates a single estimate of the model rather than a sample of possible values. We estimate f(x) by the posterior mean of f(x), which is approximated by averaging the f*(x) over the draws. Further, we can gauge our uncertainty about the actual underlying f by the variation across the draws. For example, we can use the 5% and 95% quantiles of the f*(x) draws to obtain 90% posterior intervals for f(x).

4. Fitting trip duration data with BART

4.1. The trip duration data

The goal of the study is to investigate how the reported time to take a trip in an automobile depends on characteristics of the trip and the people making the trip. Each observation in our data set corresponds to a trip, made by car, in the Austin area. Each trip is made by a person identified as the trip maker from a household. Several variables are measured for each trip. We have variables describing the household, the trip-maker, and the trip itself. Variables describing the household are:
the number of people in the household, income, the number of children under five, the number of children between 5 and 15, and the number of children aged 16 or 17. Variables describing the trip-maker are: age, primary occupation and student status. Variables describing the trip are: month, day of the week, type of trip (e.g. home-based work trip), departure time, number of households in the departure zone, number of households in the destination zone, retail employment in the departure zone, retail employment in the destination zone, free-flow distance, free-flow duration and trip duration. The free-flow variables are meant to capture the distance and time taken for such a trip under free-flow conditions.

Our dependent variable y is the log of the ratio of trip duration over free-flow trip duration. Trip duration is simply the reported time taken to complete the trip. Free-flow trip duration is an attempt to measure the time it would take to complete the trip if there were no traffic-related inhibitions. Our trip durations are really approximations of trip durations and suffer from rounding error. Trips with short distances are most prone to high relative error, since reported durations are often rounded to 5, 10, or even 15 min. A large y means high traffic congestion or travel delay, but not necessarily a long duration, since y is based on the ratio duration/free-flow duration. Since the actual ratios are highly right-skewed, we take the log. This still gives us an interpretable quantity, as it is approximately the difference between the two durations in percentage terms. The histogram of our dependent variable is given in Fig. 3. Note that while we have transformed our dependent variable, there is no need to consider transformations of the explanatory variables when using BART. Our explanatory variables consist of the remaining 17 listed above. Fig.
4 displays the marginal distributions of three important independent variables.

Many of the trips are made by the same trip-maker, so a difficult issue arises with regard to independence assumptions. Even though BART allows for great flexibility in the form of f, the current formulation does make the basic assumption of i.i.d. normal errors. It may be that the errors from trips made by the same person are not exchangeable with those made by others. There is no obvious way to resolve this question without considerably more modeling. We randomly choose one trip from those made by each trip-maker. This gives us 3244 observations.

4.2. BART results, all variables

In this section we report the results obtained by running BART with all 17 explanatory variables. Here we show how BART, relatively automatically, fits the patterns in the data. In the next section, we focus on a small number of variables and interpret the BART fit. We emphasize that all BART results are obtained simply by calling the function bart in the R package BayesTree. No decisions need be made about how to manipulate the information in the data. The default prior was used. A few fairly obvious decisions are made about how to run the Markov chain, as noted below.

Fig. 5 shows the time series plot of the draws of σ from each iteration of the Markov chain. The initial part of the plot, where the draws are declining, is the burn-in period of the Markov chain. The algorithm stochastically searches the high-dimensional space representing the unknown f to find functions that fit the data well, with our prior stopping us from gravitating towards functions which overfit. After the σ draws level off, we are exploring the posterior. As the chain iterates, the variation in the current draw of f (as represented by the current (T_i, M_i)) explores the set of f which could plausibly have generated the observed data. We see that the chain burns in very quickly, so that after a few hundred draws we are estimating the posterior.
Fig. 3. The histogram of the dependent variable y = log(duration/free-flow duration), the log ratio of trip duration over free-flow duration.

To run the entire set of 2300 iterations took 226 s, or about 4 min (2.93 GHz, Core 2 Duo processor). Note that if we just needed a point estimate we could use a much shorter run. After discarding the burn-in draws, we estimate f(x) by simply averaging the draws f_i(x), where i indexes the MCMC draws. If we do this for the x in our sample, we obtain the BART fitted values. The squared correlation between y and the BART fits (or R²) is 48%. Similarly, we can estimate σ by the average of the post-burn-in draws. Doing this, we obtain σ̂ = 0.5. In Fig. 5 this can be seen as the value at which the draws level off. The marginal posterior distribution of σ would be given by the histogram of post-burn-in σ draws.

By comparison, a linear regression fit (in which all categorical variables are dummied up in the usual way) gives an estimate of σ of 0.58 and an R² of 28%. The horizontal line drawn in Fig. 5 is at this σ estimate. Clearly, no one would simply run a linear regression with this data. As noted, the departure time variable would require some kind of transformation at a minimum. Our point in making this comparison is that BART automatically seeks out reasonable functions without user input. With 17 independent variables, choosing which transformations to try and which interactions to include becomes a daunting task in model selection. In Section 4.4 we try a transformation strategy and compare the results to the BART fit. Of course, the reader may wonder if BART has over-fit the data. In Chipman et al. (2010), evidence is given on 42 real data sets that the out-of-sample predictive performance of BART using the default prior is as good as or better than leading data-mining techniques tuned using cross-validation.
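The posterior summaries just described can be sketched as follows, using invented draw values in place of real BART output:

```python
# A small illustration (with made-up numbers, not the paper's data) of the
# summaries above: the fitted value at each x is the average of the
# post-burn-in draws f_i(x), and R^2 is the squared correlation between y
# and those fitted values.
from statistics import mean

def posterior_mean_fits(draws):
    """draws[i][j] = f_i(x_j); average over draws i for each observation j."""
    n = len(draws[0])
    return [mean(d[j] for d in draws) for j in range(n)]

def squared_correlation(y, fits):
    """R^2 as the squared sample correlation between y and the fitted values."""
    my, mf = mean(y), mean(fits)
    cov = sum((a - my) * (b - mf) for a, b in zip(y, fits))
    vy = sum((a - my) ** 2 for a in y)
    vf = sum((b - mf) ** 2 for b in fits)
    return cov * cov / (vy * vf)

# Three hypothetical post-burn-in draws of f evaluated at four observations.
draws = [[0.1, 0.4, 0.9, 0.2],
         [0.2, 0.5, 1.0, 0.1],
         [0.0, 0.6, 1.1, 0.3]]
y = [0.15, 0.55, 0.95, 0.25]

fits = posterior_mean_fits(draws)
print(fits)  # column-by-column averages, approximately [0.1, 0.5, 1.0, 0.2]
print(squared_correlation(y, fits))
```

An estimate of σ would be formed the same way, by averaging the post-burn-in σ draws rather than the f_i(x) draws.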
The default BART prior regularizes the fit so that we do not build too complex a model given our data.

4.3. Interpreting BART results

If all we want to do is use BART as a predictive device, there is no problem. The prediction of y given x = x* may be given by the average of the f_i(x*) values, where i indexes the post-burn-in draws of f. However, we often want to learn about the nature of f. How is y related to x? In this case the lack of interpretability of the sum-of-trees representation of f becomes an issue. All flexible approaches (e.g., neural nets) have this problem. Note also that those who think they can interpret the results of a linear regression with a large number of transformations, interactions, and a blizzard of associated t-values are usually fooling themselves. Methods for extracting interpretable information about f are the subject of ongoing research, but we illustrate a few possible approaches.

A simple approach to variable selection is to see how often a variable is used in the sum-of-trees representation (see Chipman et al. (2010)). For each draw, we compute the fraction of tree decision rules that use a given variable, and then average this fraction over the MCMC draws. Using this criterion, the variable which stands out the most is free-flow distance. This variable is used, on average, in 19% of the tree decision rules. The next two variables are trip type (4%) and departure time (3.4%). There are some subtle issues involving the choice of prior when using BART for variable selection. Chipman et al. (2010) recommend using fewer trees in the sum when doing variable selection than when using BART for prediction. The percentages above were calculated from a BART run using 20 trees in the sum, while the σ draws displayed in Fig. 5 were obtained from a run using 200 trees in the sum (200 is the default choice).
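The usage measure can be sketched as follows; the rule lists here are hypothetical stand-ins for the decision rules extracted from each posterior draw:

```python
# A toy sketch (hypothetical rule lists, not BayesTree output) of the
# variable-usage measure described above: for each posterior draw, compute
# the fraction of decision rules that split on each variable, then average
# those fractions over the draws.
from collections import Counter

def usage_fractions(draws, n_vars):
    """draws[i] = list of the variable indices used by the rules in draw i."""
    totals = [0.0] * n_vars
    for rules in draws:
        counts = Counter(rules)
        for v in range(n_vars):
            totals[v] += counts[v] / len(rules)
    return [t / len(draws) for t in totals]

# Two draws of a small ensemble; variable 0 dominates the splitting rules,
# playing the role that free-flow distance plays in the real analysis.
draws = [[0, 0, 0, 1],   # draw 1: four rules, three split on variable 0
         [0, 0, 2, 0]]   # draw 2: four rules, three split on variable 0
print(usage_fractions(draws, 3))  # -> [0.75, 0.125, 0.125]
```

Averaging the per-draw fractions (rather than pooling all rules) weights each posterior draw equally, which matches the description in the text.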
In order to simplify the analysis, we reran BART using just the independent variables free-flow distance, trip type, and departure time, as suggested by our variable selection analysis. Note that our variable selection was done without making strong assumptions about the nature of f.

Even with just these three variables, consider the difficulties involved in using a multiple-regression-based approach. What transformations should be applied to the two numeric variables, free-flow distance and departure time? The categorical
variable trip type has 10 different levels. What possible interaction terms should be considered for inclusion in the model? If a large number of transformations and interactions are considered, how will the model selection be done? Given a selected model, how do we express our uncertainty? How relevant are the usual t-values given the model search and the possible inclusion of interaction terms? An illustration of a transformation-based approach is given in Section 4.4.

Fig. 4. Marginal distributions of free-flow distance (ff_dist), trip type (triptype), and departure time (deptime). Distance is measured in miles, departure time on a 24 h daily clock, and trip type is categorical with ten different levels (hw, hs, hr, hm, hsoc, ho, nhw, nhs, nhr, nho). For example, the first level, hw, denotes a home-based work trip.

Sophisticated users of multiple regression know that dependence between independent variables affects the inference in important ways. Usually, we think in terms of collinearity, or linear dependence between x variables. Fig. 6 explores the relationship between two of our x variables, departure time and trip type. The left histogram displays departure times for work trips and the right histogram displays departure times for retail trips. Thus, the figure shows some of the dependence between trip type and departure time. Clearly, the dependence is strong and of an unusual type (i.e., with a varying number of modes) which could not possibly be captured by linear thinking. The BayesTree package includes functions for partial dependence plots. These plots can aid in interpreting the effect of individual x variables on y. However, when there is complex dependence between x variables and f is not additive, they can be misleading. Thus, they are not a good choice for our trip duration analysis. Fig.
7 compares x values with have a large BART estimate of f with those that have a small estimate. The three rows in the figure correspond to our three variables free flow distance trip type, and departure time going from top to bottom. The left hand plots use only those observations such that ^f ðxþ is in the bottom 10% and the right hand plots use only those observations where the fit is in the top 10%. The left hand plots tell us what kinds of x give us a small y and the right hand plot tell us what kinds of x give us a big y. Such a plot could have been constructed directly from the y values but
9 694 H. Chipman et al. / Transportation Research Part B 44 (2010) Fig. 5. Draws of r from the BART Markov Chain Monte Carlo. The draws initially decrease quickly as the fit improves and the chain burns in. Subsequent variation reflects posterior uncertainty about the value of r. The chain was iterated 2300 times with the first 300 discarded as burn-in and the last 2000 used for posterior inference. All 2300 draws are shown in the plot departure time, work trips departure time, retail trips Fig. 6. Distribution of departure time for home based work trips (on the left) and home based retail trips (on the right). by plotting the fits we have hope to have a sharper picture, less influenced by noise. We clearly see that longer trips are associated with smaller free flow distance, trip types hw = home base work trips, hs = home based school trips, and ho = home based other trips ( other is a catch-all category), and departure time close to the two rush hours. Note that the afternoon rush hour looks like it is different from the morning one. The morning rush hour is more focused. Intuitively, this rush hour is driven more by work related trips while in the afternoon a greater variety of activities generate trips. Perhaps the most logically compelling way to learn about the nature of a function f is to compare its values at carefully chosen x. For a very nice example see Abrevaya and McCulloch (2010). In this case we chose the following nine x configurations: ff_dist triptype deptime 1 hw 8 1 hw 17 1 hw 20 1 hr 17 1 hr 20 1 hsoc 17 1 hsoc 20 1 ho 17 1 ho 20
Fig. 7. Distributions of explanatory variables for small and large fitted values from BART. The three rows depict the marginal distribution of free flow distance, trip type, and departure time. On the left we use the subset of observations corresponding to the bottom 10% of fitted values from BART. On the right we use the top 10% of fitted values.

Fig. 8. Posterior distributions of f(x) for nine selected x. The first three boxplots (from the left) depict draws from the posterior of f(x) for x which correspond to home based work trips with departure times 7:30 am, 5 pm, and 8 pm. The next two are for home based retail trips at 5 pm and 8 pm. The next two are for home based social trips at 5 pm and 8 pm. The last two are for other trips at 5 pm and 8 pm.

Each row above specifies a choice of x. We have fixed free flow distance at 1. The first three x's examine home based work trips at 8 am, 5 pm, and 8 pm. The remaining six x's examine home based retail trips, home based social trips, and home based other trips, each at 5 pm and 8 pm. Fig. 8 displays the results. Each boxplot depicts the draws f_i(x) for post burn-in MCMC draws i of f. The nine boxplots correspond to the nine different x configurations given above. The symbols w, r, s, and o denote the four different trip types (work, retail, social, and other), and the time is given with the 24 h clock. The boxplot labelled r17 displays the posterior distribution for f(x), where x denotes a home based retail trip at 5 pm (the fourth row in the table above).

We can easily see things such as: work trips (boxplots 1-3) are longer than social trips (boxplots 6 and 7). For each kind of trip, we expect longer durations at 5 pm than at 8 pm. There is some evidence that the difference in trip duration between 5 pm and 8 pm is smaller for social trips than for the other three kinds of trips, and the difference seems quite similar for work, retail, and other trips. Thus, there is some evidence of a particular kind of interaction, in that the effect of departure time depends on the type of trip. This makes intuitive sense. We can also see that there is more uncertainty associated with our inference for the social trips than for the other three trip types.

By carefully choosing x at which to infer f, we can learn much about f. Care is needed in choosing appropriate x. For example, asking about a work trip at 11 am is not a good idea given this data. While this takes effort, it is honest effort in that it involves thinking hard about what questions we really want to ask, and what questions make sense. Figs. 7 and 8 give sensible results.

Fig. 9. Comparison of fits from different modeling strategies. All pairwise plots between y, bart (fits from BART), naivereg (linear regression without transformations), logreg (linear regression with free flow distance logged), and trigreg (linear regression with log of free flow distance and 16 transformations of departure time).
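Once the post burn-in MCMC draws of f at the chosen x configurations are in hand, the Fig. 8 style comparisons are simply functions of those draws. The sketch below (in Python) illustrates the pattern with simulated draws standing in for actual BART output; the labels, means, and spreads are made up for illustration, not taken from the paper's data.

```python
import numpy as np

# Hypothetical post burn-in MCMC draws of f(x) at nine chosen x
# configurations (rows: draws, columns: configurations). Real draws
# would come from a fitted BART model rather than rng.normal.
rng = np.random.default_rng(1)
labels = ["w7:30", "w17", "w20", "r17", "r20", "s17", "s20", "o17", "o20"]
draws = rng.normal(loc=[3.4, 3.5, 3.1, 3.2, 2.9, 3.0, 2.9, 3.3, 3.0],
                   scale=0.1, size=(2000, 9))

# Posterior summaries per configuration: mean and central 90% interval.
post_mean = draws.mean(axis=0)
lo, hi = np.percentile(draws, [5, 95], axis=0)
for lab, m, a, b in zip(labels, post_mean, lo, hi):
    print(f"{lab:>6}: mean {m:.2f}, 90% interval ({a:.2f}, {b:.2f})")

# Pairwise comparisons are also just functions of the draws, e.g. the
# posterior probability that a 5 pm retail trip is longer than an 8 pm one:
p_r17_gt_r20 = (draws[:, 3] > draws[:, 4]).mean()
```

Boxplotting the columns of `draws` side by side reproduces the layout of Fig. 8.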
4.4. Transforming independent variables

In this section we try to improve the fit obtained using standard linear models technology by transforming some of the independent variables. This fit is compared with the BART fit. We consider transformations of the variables departure time and free flow distance. All of the variables are used, as in Section 4.2. Similar results are obtained if we use just the three variables considered in Section 4.3. We do not consider interaction terms; given the number of variables, there are a great many possible interaction effects that could be considered.

Logging free flow distance improves the linear model fit considerably. Without the log the R² was 28%. Replacing free flow distance with its log increases R² to 40%, and the estimate of σ decreases from 0.58 to about 0.52.

Transforming departure time is more complex. Following Popuri et al. (2008) we let

g1(T) = exp(sin(2πT/24)),  g2(T) = exp(cos(2πT/24)),

and

g3(T) = exp(sin(4πT/24)),  g4(T) = exp(cos(4πT/24)),

where T is departure time. Each of the four g_i is raised to the powers 1, 2, 3, and 4. Thus, a total of 16 transformations of the variable T = departure time are included in the multiple regression: g_i(T)^j, i = 1, 2, 3, 4, and j = 1, 2, 3, 4.

Note that in the Popuri et al. (2008) analysis, an additional variable called delay is used and all of the g_i(T)^j are multiplied by delay, so that their analysis is focused on the interaction between departure time and delay. Our analysis does not use the variable delay, so our transformation strategy is not identical to theirs. Nevertheless, by using the same functional form g(T) = Σ_{i,j} b_{i,j} g_i(T)^j, we hope to build upon their work in a reasonable way. Note that Popuri et al. (2008) are able to focus in on a specific kind of interaction based on subject matter insight. They obtain quite reasonable and interpretable representations of the relationship, as exhibited in their Fig. 2.
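The 16 transformations above are straightforward to construct. A minimal sketch in Python (the function name is ours; T is assumed to be departure time in hours on a 24 h clock):

```python
import numpy as np

def departure_time_basis(T):
    """Build the 16 departure-time transformations g_i(T)^j,
    i, j = 1..4, following the functional form in the text.
    T: array-like departure times on a 24 h clock, in hours.
    Returns an (n, 16) design block, columns ordered
    g1^1..g1^4, g2^1..g2^4, g3^1..g3^4, g4^1..g4^4."""
    T = np.asarray(T, dtype=float)
    g = [
        np.exp(np.sin(2 * np.pi * T / 24)),
        np.exp(np.cos(2 * np.pi * T / 24)),
        np.exp(np.sin(4 * np.pi * T / 24)),
        np.exp(np.cos(4 * np.pi * T / 24)),
    ]
    return np.column_stack([gi ** j for gi in g for j in (1, 2, 3, 4)])

X = departure_time_basis([7.5, 17.0, 20.0])
print(X.shape)  # (3, 16)
```

These columns are then appended to the regression design matrix alongside the other (logged) covariates.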
Adding the 16 transformations of departure time to the multiple regression with free flow distance logged increases R² to just 41% (from 40%). The estimate of σ is 0.52, which is virtually the same as obtained with just the log of free flow distance. Recall that the BART R² is 48% and the BART estimate of σ is 0.50.

Fig. 9 compares the fits from the different modeling strategies. We see that simply logging free flow distance goes a long way towards capturing f. There is very little difference between the fits with the log and the departure time transformations. The BART fits are similar to those obtained with the transformations, but there is a suggestion of some differences in the last plot in the second row.

If we test the null hypothesis that all the coefficients for the departure time transformations are equal to 0, the p-value is very small. Typically in application this is used to argue that including the transformations is useful in uncovering the relationship. Fig. 10 plots ĝ(T) = Σ_{i,j} b̂_{i,j} g_i(T)^j vs. T. It is hard to see how this estimate makes sense intuitively. Again, Popuri et al. (2008) did obtain sensible results using the set of transformations we have employed here. In this application, the transformations give results which are statistically significant but practically insignificant and intuitively unappealing. Of course, some other transformation approach might give better results.

Fig. 10. The estimated additive component contributed by transformations of departure time.
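The test that all 16 transformation coefficients are zero is a standard nested-model (partial) F-test. A generic sketch, assuming ordinary least squares and design matrices that each include an intercept column (illustrative only, not the paper's actual data):

```python
import numpy as np
from scipy import stats

def partial_f_test(y, X_restricted, X_full):
    """Partial F-test that the coefficients of the extra columns of
    X_full (beyond those in X_restricted) are all zero.
    Both design matrices should include an intercept column.
    Returns the F statistic and its p-value."""
    def rss(X):
        beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return resid @ resid
    n = len(y)
    q = X_full.shape[1] - X_restricted.shape[1]   # number of tested coefficients
    df2 = n - X_full.shape[1]                     # residual degrees of freedom
    F = ((rss(X_restricted) - rss(X_full)) / q) / (rss(X_full) / df2)
    return F, stats.f.sf(F, q, df2)
```

As the text notes, a small p-value from such a test establishes statistical significance, not that the added terms improve the fit in any practically meaningful way.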
5. Conclusions

This paper introduces and illustrates application of the BART model to trip duration modeling. BART quickly and easily models the dependent variable, y, as some generic function f of the explanatory variables. Note that this is, in a sense, a nonparametric approach, since the functional form of f is specified in such a flexible form. This is particularly useful when some explanatory variables are not expected to have linear relationships with y, as is the case for departure time's relationship with trip durations. The application demonstrated here clearly shows the nonlinear relationship between these variables, with trip duration peaks occurring during the AM and PM peak travel periods, as expected. Further, the nature of the two peaks seems to be quite different, with a more focused rush hour in the morning. We also find evidence of interaction between two of our key variables: the effect of departure time may depend on the type of trip taken. Future research will focus on identifying other transportation research areas where BART could be employed, as well as extending the ideas in this paper to gain other insights into the actual relationship between departure time and trip duration.

References

Abou Zeid, M., Rossi, T.F., Gardner, B., 2006. Modeling time-of-day choice in context of tour- and activity-based models. Transportation Research Record: Journal of the Transportation Research Board 1981.
Abrevaya, J., McCulloch, R., 2010. Reversal of Fortune: A Statistical Analysis of Penalty Calls in the National Hockey League.
Chipman, H., George, E., McCulloch, R., 1998. Bayesian CART model search. Journal of the American Statistical Association 93 (443), 935-948.
Chipman, H., George, E., McCulloch, R., 2007. Bayesian ensemble learning. In: Scholkopf, B., Platt, J., Hoffman, T. (Eds.), Advances in Neural Information Processing Systems, vol. 19. MIT Press, Cambridge, MA.
Chipman, H., George, E., McCulloch, R., 2010. BART: Bayesian Additive Regression Trees. Annals of Applied Statistics.
Freund, Y., Schapire, R.E., 1997. A decision theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55 (1), 119-139.
Hastie, T., Tibshirani, R., 2000. Bayesian backfitting. Statistical Science 15 (3).
Popuri, Y., Ben-Akiva, M., Proussaloglou, K., 2008. Time-of-day modeling in a tour-based context: the Tel-Aviv experience. Transportation Research Record: Journal of the Transportation Research Board 2076.
More informationLecture 8. Instructor: Haipeng Luo
Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine
More informationDecision Tree Learning Lecture 2
Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over
More informationInferential statistics
Inferential statistics Inference involves making a Generalization about a larger group of individuals on the basis of a subset or sample. Ahmed-Refat-ZU Null and alternative hypotheses In hypotheses testing,
More informationSampling Distribution Models. Chapter 17
Sampling Distribution Models Chapter 17 Objectives: 1. Sampling Distribution Model 2. Sampling Variability (sampling error) 3. Sampling Distribution Model for a Proportion 4. Central Limit Theorem 5. Sampling
More informationUnderstanding Travel Time to Airports in New York City Sierra Gentry Dominik Schunack
Understanding Travel Time to Airports in New York City Sierra Gentry Dominik Schunack 1 Introduction Even with the rising competition of rideshare services, many in New York City still utilize taxis for
More informationDesigning Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way
EECS 16A Designing Information Devices and Systems I Fall 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate it
More informationMarkov Chain Monte Carlo The Metropolis-Hastings Algorithm
Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Anthony Trubiano April 11th, 2018 1 Introduction Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability
More informationStochastic Processes
qmc082.tex. Version of 30 September 2010. Lecture Notes on Quantum Mechanics No. 8 R. B. Griffiths References: Stochastic Processes CQT = R. B. Griffiths, Consistent Quantum Theory (Cambridge, 2002) DeGroot
More informationDAG models and Markov Chain Monte Carlo methods a short overview
DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex
More informationBayesian phylogenetics. the one true tree? Bayesian phylogenetics
Bayesian phylogenetics the one true tree? the methods we ve learned so far try to get a single tree that best describes the data however, they admit that they don t search everywhere, and that it is difficult
More informationA Re-Introduction to General Linear Models (GLM)
A Re-Introduction to General Linear Models (GLM) Today s Class: You do know the GLM Estimation (where the numbers in the output come from): From least squares to restricted maximum likelihood (REML) Reviewing
More informationExploratory quantile regression with many covariates: An application to adverse birth outcomes
Exploratory quantile regression with many covariates: An application to adverse birth outcomes June 3, 2011 eappendix 30 Percent of Total 20 10 0 0 1000 2000 3000 4000 5000 Birth weights efigure 1: Histogram
More informationA Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007
Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized
More informationIntroduction to Optimization
Introduction to Optimization Blackbox Optimization Marc Toussaint U Stuttgart Blackbox Optimization The term is not really well defined I use it to express that only f(x) can be evaluated f(x) or 2 f(x)
More informationLecture 4: Types of errors. Bayesian regression models. Logistic regression
Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture
More information6.867 Machine learning, lecture 23 (Jaakkola)
Lecture topics: Markov Random Fields Probabilistic inference Markov Random Fields We will briefly go over undirected graphical models or Markov Random Fields (MRFs) as they will be needed in the context
More informationComments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms
Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:
More informationFitting a Straight Line to Data
Fitting a Straight Line to Data Thanks for your patience. Finally we ll take a shot at real data! The data set in question is baryonic Tully-Fisher data from http://astroweb.cwru.edu/sparc/btfr Lelli2016a.mrt,
More information