Gaussian Processes for Short-Term Traffic Volume Forecasting
Yuanchang Xie, Kaiguang Zhao, Ying Sun, and Dawei Chen

The accurate modeling and forecasting of traffic flow data such as volume and travel time are critical to intelligent transportation systems. Many forecasting models have been developed for this purpose since the 1970s. Recently, kernel-based machine learning methods such as support vector machines (SVMs) have gained special attention in traffic flow modeling and other time series analyses because of their outstanding generalization capability and superior nonlinear approximation. In this study, a novel kernel-based machine learning method, the Gaussian process (GP) model, was proposed to perform short-term traffic flow forecasting. The GP model was evaluated and compared with SVM and autoregressive integrated moving average (ARIMA) models on four sets of traffic volume data collected from three interstate highways in Seattle, Washington. The comparative results showed that the GP and SVM models consistently outperformed the ARIMA model. This study also showed that because the GP model is formulated in a full Bayesian framework, it allows an explicit probabilistic interpretation of forecasting outputs. This capacity gives the GP model an advantage over SVMs in modeling and forecasting traffic flow.

The accurate modeling and forecasting of traffic flow data, such as volume, speed, and travel time, are critical to intelligent transportation systems (ITS), especially advanced traveler information systems (ATIS) and advanced traffic management systems (ATMS). Given reliable real-time traffic flow predictions, travelers can choose the best routes dynamically. Such information can also be used by traffic management personnel to develop proactive traffic control strategies that make better use of the available road network resources.
The success of many ATIS and ATMS applications depends largely on the accuracy of the selected traffic flow modeling and forecasting algorithms. Numerous methods have been developed and compared since the 1970s to improve the accuracy of traffic flow forecasting. These methods can generally be categorized into the following groups: autoregressive integrated moving average (ARIMA) models (1-3), nonparametric regression (4, 5), Kalman filtering theory (6-8), neural networks (9-15), support vector machines (SVMs) (16, 17), and hybrid models (18). Of the existing traffic flow forecasting methods, neural networks are the most widely used. One major reason is that neural networks have a strong function approximation capability and can better model the complicated relationship between historical and future traffic flow data than other methods (19). In addition, the application of neural networks does not require an explicit model formulation to be specified, as is usually required by other approaches. Despite the many attractive features of neural networks, their application is not an easy task. Model training and selection involve tricky decisions with regard to network architecture, type of transfer (activation) function, learning rate, and number of hidden neurons (20). Caution must be taken during the training of neural networks to prevent overfitting the training data and to avoid local minima. To address these problems, SVMs have been introduced (16, 17).

Y. Xie, Civil and Mechanical Engineering Technology, South Carolina State University, Orangeburg, SC. K. Zhao, Spatial Science Lab, and Y. Sun, Department of Statistics, Texas A&M University, College Station, TX. D. Chen, School of Transportation, Southeast University, Nanjing, Jiangsu, China. Corresponding author: Y. Xie, yxie@scsu.edu.

Transportation Research Record: Journal of the Transportation Research Board, No. 2165, Transportation Research Board of the National Academies, Washington, D.C., 2010, pp. 69-78. DOI: 10.3141/2165-08
Similar to neural networks, SVMs have superior function approximation capability and do not require the specification of model formulations. In addition, they are developed on the structural risk minimization (SRM) principle (21), as opposed to the empirical risk minimization (ERM) principle used in conventional neural networks. Theoretically, then, SVMs can better address the overfitting problem and have better generalization capabilities than conventional neural networks. Another important feature of SVMs is their capacity to guarantee a globally optimal solution for a given training data set (16, 17). A v-support vector machine (v-SVM) model was previously compared with multilayer, feed-forward neural networks by using traffic volume data collected from interstates (I-5, I-90, and I-405) in the Seattle area, and the comparison favored the v-SVM model (16).

Gaussian processes (GPs) are another important class of kernel-based learning algorithms that have attracted attention in the machine learning community (22, 23). Similar to other popular kernel machines, such as SVMs, GP models are powerful tools to explore implicit relationships between a set of variables based on a training data set, which makes GPs especially useful for difficult nonlinear regression and classification problems (20). A particularly attractive feature of GPs is their formulation in a full Bayesian framework, which allows an explicit probabilistic interpretation of model outputs (22). Moreover, GP model parameters (e.g., kernel parameters) can be computed naturally by means of Bayesian learning, as opposed to the grid-searching, trial-and-error method commonly used to optimize classical SVMs (21). Hence, some researchers refer to GPs as Bayesian SVMs (24).
The superior performance of GPs for difficult supervised learning problems has been demonstrated in many domain-specific applications in comparisons against conventional methods such as neural networks and against other advanced learning algorithms such as SVMs (20, 25). In the study reported here, a GP regression model was adopted to model and predict traffic volume data; the model can be used to forecast travel speed and travel time as well. Like SVMs, GP models are kernel-based machine learning methods and possess many of the same desirable features. More important, they produce more informative outputs than SVMs and neural networks, which makes GP prediction results easier to interpret.
GAUSSIAN PROCESSES

General Formulation of GPs

GPs provide a Bayesian paradigm to learn an implicit functional relationship ŷ = f(x) from a given training data set D = {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^d represents the vector of observed input variables (i.e., predictors) in a d-dimensional feature space, and y_i is the one-dimensional observed target value (i.e., response variable), which is either continuous or discrete. Unlike most classical Bayesian models, GPs directly elicit a prior distribution on the whole function f(x). Specifically, f(x) is treated as a random field and is assumed to be a GP a priori:

p(f(x) | θ) = GP(m(x), k(x, x′))    (1)

where the prior GP is fully specified by a mean function m(x) and a covariance function k(x, x′), and θ denotes the prior's hyperparameters used to parameterize the covariance function; that is, k(x, x′) = k(x, x′; θ). Strictly speaking, a GP model can be treated as a probability distribution defined over functions such that E[f(x)] = m(x) and Cov[f(x), f(x′)] = k(x, x′), where f(x) and f(x′) are random variables indexed by any pair of x and x′. In this sense, a GP prior can be roughly deemed a probability distribution over an infinite number of random variables. Furthermore, a collection of function values indexed by any finite set of inputs X = [x_1, x_2, ..., x_n]^T, i.e., f(X) = [f(x_1), f(x_2), ..., f(x_n)]^T, assumes a multivariate normal distribution

p(f(X)) = N(m(X), K(X, X))    (2)

where the mean vector m(X) and covariance matrix K(X, X) are determined directly from m(·) and k(·, ·); namely, m(X) = [m(x_1), m(x_2), ..., m(x_n)]^T and K_ij = k(x_i, x_j), i, j = 1, ..., n. For ease of presentation but without loss of generality, m(x) = 0 is assumed, because in practice the data can always be centered with respect to the sample mean. In machine learning terms, k(x, x′) is often called a kernel function, or simply a kernel, rather than a covariance function.
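The finite-dimensional view of Equation 2 can be made concrete with a short sketch. This is an illustration rather than anything from the study itself; it assumes NumPy is available and uses an RBF covariance of the form discussed later:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma0=1.0, length=1.0):
    """Covariance k(x, x') = sigma0^2 * exp(-(x - x')^2 / (2 * length^2))."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return sigma0 ** 2 * np.exp(-d2 / (2.0 * length ** 2))

rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 50)              # any finite set of input locations
K = rbf_kernel(X, X) + 1e-9 * np.eye(50)   # K(X, X) plus jitter for stability
# With m(x) = 0, f(X) ~ N(0, K(X, X)); draw three sample functions at once:
samples = rng.multivariate_normal(np.zeros(50), K, size=3)
print(samples.shape)  # (3, 50)
```

Each row of `samples` is one realization of the random function f evaluated at the 50 chosen inputs; smoother kernels (larger `length`) produce smoother sample paths.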
As detailed later, kernel functions usually take certain forms that are parameterized by one or more parameters. Accordingly, specifying a GP prior p(f(x) | θ) = GP(m(x), k(x, x′)) amounts to determining a specific type of kernel (covariance function) and the associated θ values. Once a GP prior p(f | θ) and a noise model p(y | f) are specified, the posterior distribution of f given the training data D, p(f | D, θ), can be readily derived by updating the prior according to Bayes' theorem:

p(f | D, θ) = p(y | f) p(f | X, θ) / p(D | θ)    (3)

where the input variables X (i.e., the indices for f) have been made explicit in the prior. The term p(D | θ) is called the marginal likelihood, as it is a function of θ given D. The noise model p(y | f) is also known as the likelihood, a function of f for a fixed set of observations y; it is introduced because in practice y_i is a corrupted version of f(x_i) as the result of noise or measurement errors. With the posterior p(f | D, θ), the predictive distribution at a new input x_* is obtained by using

p(f_* | x_*, D, θ) = ∫ p(f_* | x_*, f, D, θ) p(f | D, θ) df    (4)

By combining Equation 4 and the noise model, the predictive distribution for y_* can also be obtained:

p(y_* | x_*, D, θ) = ∫ p(y_* | f_*) p(f_* | x_*, D, θ) df_*    (5)

from which not only the predicted mean but also the associated uncertainty (error bar) can be computed. Note that in GP modeling it is the collection of function values f(X), not x itself, that needs to be Gaussian. In fact, the input variables x are assumed to be distribution-free; in other words, the GP model theoretically can handle data with any kind of distribution. Interested readers can refer to Rasmussen and Williams (20), MacKay (22), and Seeger (23) for more information.

GP Regression Model

The aforementioned GP models solve nonlinear regression problems when the response variables y_i are continuous and a normal distribution is assumed for the noise model p(y | f). Specifically, y_i is subject to independent and identically distributed (i.i.d.)
normal errors with a mean of zero and a variance of σ²:

y_i = f(x_i) + ε_i,  ε_i ~ N(0, σ²)    (6)

In such a case, the inference of GP models becomes analytically tractable as a result of the Gaussianity of p(y | f); accordingly, the posterior and predictive distributions given in Equations 3 through 5 all reduce to normal distributions. For a new input x_*, the predictive mean and variance associated with ŷ_* = f(x_*) = f_* are given by Equations 7 and 8, respectively (20):

f̄_* = k(x_*, X) [K(X, X) + σ²I]^{-1} y    (7)

Var(f_*) = k(x_*, x_*) − k(x_*, X) [K(X, X) + σ²I]^{-1} k(X, x_*)    (8)

where

X and y = observed predictors and response variables in D = {(x_i, y_i)}_{i=1}^{n},
I = n × n identity matrix, and
k(x_*, X) = 1 × n vector whose ith element is k(x_*, x_i), denoting the covariance of f_* with f(X), with k(X, x_*) = k(x_*, X)^T.

Kernels and Learning Hyperparameters

Equations 7 and 8 show that fitting and applying a GP regression model amounts to the choice of a kernel and the specification of its parameters (i.e., hyperparameters). In machine learning, the most commonly used kernels include the polynomial kernel, the radial basis function (RBF) kernel, and the automatic relevance determination (ARD) kernel, as given by Rasmussen and Williams (20):

k_poly(x, x′; σ_0, Σ_p, p) = (σ_0² + x^T Σ_p x′)^p    (9)
k_RBF(x, x′; σ_0, l) = σ_0² exp(−‖x − x′‖² / (2l²))    (10)

k_ARD(x, x′; σ_0, l_1, ..., l_d) = σ_0² exp(−Σ_{i=1}^{d} (x_i − x_i′)² / (2l_i²))    (11)

where σ_0, p, l, l_i, and Σ_p are hyperparameters of the corresponding kernels, symbolized collectively as θ in Equations 9 through 11. The θ is called a hyperparameter because, in the Bayesian framework of GPs, the unknown function f itself is a parameter as a result of the prior p(f) placed on f. A common hyperparameter of the above kernels is the variance σ_0², which plays the same role as the tradeoff parameter of SVMs. However, a GP kernel and its hyperparameters are more interpretable than those of SVMs because the GP kernel represents the degree of correlation between function values at two inputs. For example, the hyperparameter l in Equation 10, or l_i in Equation 11, refers to a characteristic length that represents a distance in the input space beyond which function values become less relevant. The magnitude of l_i in the ARD kernel indicates the inferential relevance of the ith input variable: very large values of l_i downplay or eliminate the influence of irrelevant input dimensions. As such, the ARD kernel provides a parameterization scheme for automatic feature reduction, which has proved effective in handling high-dimensional problems (25). Most studies have confirmed the superior performance of the RBF kernel (16, 17, 20, 25). Therefore, only the RBF kernel was examined, and no comparison between kernels was made in the study reported here.

In practice, rather than guess at an initial value for the hyperparameter θ, it is advantageous to learn an informative value θ̂ from the training data D. In the Bayesian formulation of GP models, the posterior of θ is given by

p(θ | D) = p(D | θ) p(θ) / p(D)    (12)

The optimal value θ̂ can then be obtained naturally as the maximum a posteriori (MAP) estimate of p(θ | D) (20). Because of the lack of prior knowledge, p(θ) is assumed to be flat (i.e., a noninformative prior).
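As an illustration of how the predictive equations and the hyperparameter learning step fit together, the following is a minimal sketch, not the authors' implementation (which used the GPML MATLAB package with a conjugate-gradient optimizer). NumPy is an assumed dependency, all hyperparameter values are illustrative, and a crude grid search over the length-scale stands in for gradient-based Type II maximum likelihood:

```python
import numpy as np

def rbf_kernel(X1, X2, sigma0=1.0, length=1.0):
    """RBF covariance of Equation 10 for one-dimensional inputs."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return sigma0 ** 2 * np.exp(-d2 / (2.0 * length ** 2))

def gp_predict(X, y, X_star, sigma_n=0.1, sigma0=1.0, length=1.0):
    """Predictive mean (Equation 7) and variance (Equation 8)."""
    Ky = rbf_kernel(X, X, sigma0, length) + sigma_n ** 2 * np.eye(len(X))
    alpha = np.linalg.solve(Ky, y)                # [K(X,X) + sigma^2 I]^{-1} y
    K_s = rbf_kernel(X_star, X, sigma0, length)   # k(x_*, X)
    mean = K_s @ alpha
    v = np.linalg.solve(Ky, K_s.T)                # [K(X,X) + sigma^2 I]^{-1} k(X, x_*)
    var = rbf_kernel(X_star, X_star, sigma0, length).diagonal() - np.sum(K_s * v.T, axis=1)
    return mean, var

def log_marginal_likelihood(X, y, sigma0=1.0, length=1.0, sigma_n=0.1):
    """Log marginal likelihood with the noise variance absorbed into the
    covariance: -1/2 y^T Ky^{-1} y - 1/2 log|Ky| - n/2 log(2 pi)."""
    n = len(X)
    Ky = rbf_kernel(X, X, sigma0, length) + sigma_n ** 2 * np.eye(n)
    L = np.linalg.cholesky(Ky)
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return float(-0.5 * y @ a - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi))

# Toy usage: noisy sine data, grid search over the length-scale,
# then prediction with the selected hyperparameter.
rng = np.random.default_rng(1)
X = np.linspace(0.0, 2.0 * np.pi, 40)
y = np.sin(X) + 0.1 * rng.standard_normal(40)
best_l = max([0.1, 0.5, 1.0, 2.0, 5.0],
             key=lambda l: log_marginal_likelihood(X, y, length=l))
mean, var = gp_predict(X, y, np.array([np.pi / 2, np.pi]), length=best_l)
```

The Cholesky factorization mirrors standard practice for the marginal likelihood; the predictive variance comes out alongside the mean at no extra conceptual cost, which is the feature the paper exploits later to build prediction boundaries.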
In such a case, the MAP estimate θ̂ is pinpointed by maximizing the following marginal likelihood p(D | θ):

p(D | θ) = ∫ p(y | f) p(f | X, θ) df    (13)

This procedure is also known as Type II maximum likelihood. For the GP regression models of Equations 6 through 8, the log marginal likelihood can be expressed as in Rasmussen and Williams (20):

log p(D | θ) = −(1/2) y^T K_y^{-1} y − (1/2) log |K_y| − (n/2) log 2π    (14)

where K_y = K(X, X) + σ²I and |·| denotes the determinant of a matrix. The gradient of log p(D | θ) with respect to θ is

∂ log p(D | θ)/∂θ_i = (1/2) y^T K_y^{-1} (∂K_y/∂θ_i) K_y^{-1} y − (1/2) tr(K_y^{-1} ∂K_y/∂θ_i)    (15)

The maximization of log p(D | θ) with respect to θ can be implemented with any general gradient-based optimization technique; in this study a conjugate gradient optimization method was employed, similar to the one used by Rasmussen and Williams (20). The optimization may become trapped in local maxima if there are a large number of hyperparameters (e.g., when the ARD kernel is used for feature selection over high-dimensional inputs). As a remedy, it is common practice to perform the optimization multiple times with random initial values and to select the solution that yields the highest marginal likelihood.

MODEL TESTING AND RESULT ANALYSIS

Data Description

To facilitate model comparison, the same data set used in Zhang and Xie (16) was used again here. The traffic volume data were obtained from the traffic data acquisition and distribution (TDAD) database maintained by an ITS research group at the University of Washington, Seattle. Specifically, traffic volume data from four detectors located on three interstate highways in the Seattle area were used. The approximate locations of the four detectors are shown in Figure 1. Detailed information about the four detectors follows.

Data Set  Direction   Detector Name  Data Collection Period
1         Southbound  ES-088D        June 6, 2005, to July 3, 2005
2         Eastbound   ES-855D        June 6, 2005, to July 3, 2005
3         Northbound  ES-645D        June 6, 2005, to July 3, 2005
4         Northbound  ES-708D        June 6, 2005, to July 3, 2005

A total of four sets of traffic volume data was obtained from these detectors. Each data set contained 28 days of data.
The raw traffic volume data were aggregated by using 15-min intervals, so a single day generated 96 data points. The first 14 days of data from each data set are plotted in Figure 2 to show the general trends.

FIGURE 1 Approximate locations of detectors ES-088D, ES-855D, ES-645D, and ES-708D.
FIGURE 2 First 14 days of data from each detector: (a) Data Set 1 (detector ES-088D, southbound), (b) Data Set 2 (detector ES-855D, eastbound), (c) Data Set 3 (detector ES-645D, northbound), and (d) Data Set 4 (detector ES-708D, northbound).
FIGURE 3 Predictions: one-step-ahead and two-step-ahead (each window of L consecutive observed volumes serves as the model input x; the target is the next observation for one-step-ahead prediction and the observation after that for two-step-ahead prediction).

Data Sets 1 through 3 showed similar patterns but different traffic volume levels. Their weekday traffic clearly had two peak periods. In Data Set 4, the effect of the morning rush hour was not as obvious.

Model Fitting

The same data sets discussed above were used by Zhang and Xie to evaluate v-SVMs and to compare them with a multilayer, feed-forward neural network model (16). Because their results showed that the v-SVM model consistently outperformed the neural network model, only the v-SVM model was compared with the proposed GP model here. ARIMA models were also fitted and compared with the GP and v-SVM models; thus, three types of models were compared in the study reported here. For all three models, the first 3 weeks of data were used for fitting, and the last week of data was used for prediction tests. The three types of models were compared primarily on the basis of their prediction performance. Both one- and two-step-ahead prediction results were compared. Figure 3 shows the difference between one- and two-step-ahead predictions, where n is the total number of observed traffic volume data points and L is the model input length. In Figure 3, v_i represents the aggregate traffic count for a 15-min period. Using the first input as an example, both predictions take the same vector x = [v_1, v_2, ..., v_L] as the input; however, the values to be predicted (outputs) for the one- and two-step-ahead predictions are v_{L+1} and v_{L+2}, respectively.

v-SVM and GP Models

As discussed earlier, fitting the v-SVM and GP models is conceptually straightforward and does not require users to specify an explicit model formulation.
Take the one-step-ahead prediction as an example: for each data point to be predicted or modeled as output, the 14 data points immediately preceding it are used as model input. Thus, a training series of n data points generates n − 14 training input-output pairs. The input dimension of 14 was determined on the basis of an autocorrelation function (ACF) method (16). For each of the four data sets, the ACF values were evaluated at different time lags, and the first time lag at which the ACF value reached zero was selected as the input dimension. Based on the ACF values, an input dimension of 14 was selected for all four data sets. According to the notation set out earlier in this paper, the one-step-ahead models to be fitted can be symbolized as

ŷ_i = vôl(i) = f(x_i) = f([vol(i−1), vol(i−2), ..., vol(i−14)]^T)    (16)

where vol(i) refers to the traffic volume at time step i, and f(x_i) is the implicit model form to be learned by either the v-SVM or the GP algorithm.

Compared with the GP model, fitting the v-SVM model requires more effort. A validation data set usually is needed for model selection, that is, to help find appropriate parameters for the v-SVM model. For the v-SVM model, therefore, the 3-week fitting data were separated further into a training data set (the first 2 weeks of data) and a validation data set (the third week of data). Given the training and validation data sets, a genetic algorithm tool was used to find the optimal parameters for the v-SVM model. Details on the parameters to be determined and the genetic algorithm tool can be found in Zhang and Xie (16) and are not replicated here.

The GP model was implemented by using a widely accepted package called Gaussian Processes for Machine Learning (GPML) (26), which is based on the MATLAB programming language platform. Customized MATLAB code was also developed to process the raw and output data and to call functions in the GPML package.
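The sliding-window construction described above can be sketched as follows. This is an illustrative helper in Python rather than the authors' MATLAB code; the lag of 14 mirrors the ACF-selected input dimension:

```python
def make_lagged_pairs(series, lag=14, horizon=1):
    """Build (input, target) pairs from a volume series: the input is the
    `lag` values immediately preceding the target, and the target lies
    `horizon` steps ahead (horizon=1 for one-step-ahead, horizon=2 for
    two-step-ahead), mirroring Figure 3."""
    inputs, targets = [], []
    for i in range(lag, len(series) - horizon + 1):
        inputs.append(series[i - lag:i])
        targets.append(series[i + horizon - 1])
    return inputs, targets

# A series of 30 points yields 30 - 14 = 16 one-step pairs and 15 two-step pairs.
X1, y1 = make_lagged_pairs(list(range(30)), lag=14, horizon=1)
X2, y2 = make_lagged_pairs(list(range(30)), lag=14, horizon=2)
print(len(X1), len(X2))  # 16 15
```

Note that the one- and two-step-ahead models share the same inputs; only the target column shifts by one time step.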
The training and testing process of the GP and v-SVM models is illustrated in Figure 4: the first 3 weeks of data (both x and y) are used to train the models, which are then given the fourth week's input data (x only) to produce the fourth week's predicted values ŷ.

FIGURE 4 Training and evaluation of GP and v-SVM models.

ARIMA Model

ARIMA models were also fitted for the four data sets because of their popularity in traffic flow forecasting research (1-3, 18). The auto.arima forecasting function of the R Project for Statistical Computing was used to select the best-fit ARIMA model for each test data set (27). The best models selected by auto.arima and their corresponding Akaike information criterion (AIC), Bayesian information criterion (BIC), and log likelihood values are listed in Table 1. Both AIC and BIC are commonly accepted criteria for model selection. In general, models
with lower AIC and BIC values should be selected. Although the collected data sets showed seasonal patterns, seasonal ARIMA models were not selected by the auto.arima program.

TABLE 1 Best-Fit ARIMA Models Selected by auto.arima for the Four Data Sets, with Their AIC, BIC, and Log Likelihood Values

Measurements of Effectiveness

Mean absolute percentage error (MAPE) and root mean square error (RMSE) are two commonly used criteria to evaluate and compare prediction methods. As adopted in this study, MAPE and RMSE are defined as

MAPE = (1/N) Σ_{k=1}^{N} |vôl(k) − vol(k)| / vol(k) × 100%    (17)

RMSE = sqrt( (1/N) Σ_{k=1}^{N} (vôl(k) − vol(k))² )    (18)

where vol(k) is the observed traffic volume at time step k and vôl(k) is the corresponding predicted traffic volume. Each time step in this study was equivalent to 15 min, and N is the size of the testing data set (total number of time steps).

Results Analysis and Comparison

The one- and two-step-ahead forecasting results are listed in Tables 2 and 3, respectively.

TABLE 2 Comparison of One-Step-Ahead Forecasting Results (MAPE and RMSE of the GP, v-SVM, and ARIMA Models for Each Data Set)

TABLE 3 Comparison of Two-Step-Ahead Forecasting Results (MAPE and RMSE of the GP, v-SVM, and ARIMA Models for Each Data Set)

It can be easily seen that, for all data sets, the GP and v-SVM models performed consistently better than the ARIMA models for both one- and two-step-ahead forecasting. In some cases, the improvements in performance were quite significant: the two-step-ahead forecasting MAPEs of the GP and v-SVM models for Data Set 2 were markedly lower than that of the corresponding ARIMA model. Tables 2 and 3 also show that the superior performance of the GP and v-SVM models was more pronounced in the two-step-ahead than in the one-step-ahead forecasting.
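The error measures of Equations 17 and 18 translate directly into code. A brief Python sketch (function names are ours, not from the paper; observed volumes must be nonzero for MAPE):

```python
import math

def mape(observed, predicted):
    """Mean absolute percentage error, Equation 17."""
    n = len(observed)
    return 100.0 * sum(abs(p - o) / o for o, p in zip(observed, predicted)) / n

def rmse(observed, predicted):
    """Root mean square error, Equation 18."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)

print(mape([100, 200], [110, 180]))  # 10.0
print(rmse([100, 200], [110, 180]))  # ~15.81
```

MAPE weights each error by the observed volume, so it penalizes misses during low-volume periods more heavily than RMSE does; reporting both, as the paper does, guards against either measure's bias.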
The GP and v-SVM models performed about equally in terms of MAPE and RMSE in all cases; it was difficult to tell which method was absolutely better on the basis of the MAPE and RMSE values alone. However, the GP model possesses a desirable feature that distinguishes it from the v-SVM model: not only does it produce point traffic flow estimates, it also generates standard deviations (i.e., error bars) for the predicted traffic flow values. Using the one-step-ahead forecasting results as an example, Figures 5 and 6 show the predicted traffic volumes and their standard deviations for Tuesday's data in Week 4 with the GP model. As Figures 5 and 6 show, the predicted standard deviations seemed not to be directly related to the magnitude of the observed traffic volumes: when the observed traffic volume is at its peak, the corresponding predicted standard deviation is not necessarily at its largest (see Figure 6). Usually, the predicted standard deviations become larger when there are drastic changes in the observed traffic volume data. This additional standard deviation information from the GP model could help traffic control and management personnel better assess the quality and reliability of the predicted data and use the predicted values wisely.

From the predicted and observed traffic volumes plotted in Figures 5 and 6, it can be seen that the predicted traffic volumes (solid lines) closely followed the observed traffic flow data (red dots) for all test data sets. The predicted traffic volumes and standard deviations were combined to create upper and lower boundaries for the four test data sets; specifically, the boundaries were created by adding the standard deviations to, or subtracting them from, the predicted traffic volumes. The resultant boundaries are shown in Figures 7 and 8. These two figures clearly show that in most cases the observed traffic data points fell within the upper and lower boundaries generated from the GP model outputs.
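The boundary construction just described (predicted mean plus or minus one standard deviation) amounts to a simple coverage computation. An illustrative Python sketch, not from the paper:

```python
def interval_coverage(observed, means, sds, width=1.0):
    """Fraction of observations falling inside [mean - width*sd, mean + width*sd],
    the kind of boundary check illustrated by Figures 7 and 8."""
    inside = sum(1 for o, m, s in zip(observed, means, sds)
                 if m - width * s <= o <= m + width * s)
    return inside / len(observed)

# Toy numbers: two of the three observations fall inside the +/- 1 SD bands.
print(interval_coverage([10, 12, 20], [11, 11, 11], [2, 2, 2]))
```

Widening the bands (e.g., `width=2.0`) trades tighter intervals for higher empirical coverage, which is the practical knob a traffic manager would tune when using the GP error bars.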
This result was encouraging. It confirms that the predicted traffic volume data closely follow the
FIGURE 5 One-step-ahead predicted volumes for (a) Data Set 1 and (b) Data Set 2 with the GP model; predicted standard deviations for (c) Data Set 1 and (d) Data Set 2.

FIGURE 6 One-step-ahead predicted volumes for (a) Data Set 3 and (b) Data Set 4 with the GP model; predicted standard deviations for (c) Data Set 3 and (d) Data Set 4.
FIGURE 7 One-step-ahead prediction boundaries for Data Set 1 and Data Set 2 with the GP model.

FIGURE 8 One-step-ahead prediction boundaries for Data Set 3 and Data Set 4 with the GP model.
observed traffic trends. In addition, it suggests that the standard deviations are credible and can be used to generate useful intervals for the predicted traffic volume data.

DISCUSSION AND CONCLUSIONS

This paper introduces a GP model into traffic flow modeling and forecasting. The proposed GP model was tested on four data sets collected on three interstate highways in Seattle, Washington. Two other promising traffic flow forecasting models, v-SVM and ARIMA, were also tested with the same data sets, and their forecasting results were compared with those produced by the GP model. Two types of forecasting were conducted: one- and two-step-ahead. The results indicated that the GP and v-SVM models outperformed the ARIMA model in all cases and that their advantage over the ARIMA models was more significant in two-step-ahead than in one-step-ahead forecasting.

The GP and v-SVM models are distribution-free learning algorithms and can be applied to many types of data that are not necessarily normally distributed, for both classification and regression purposes (20, 25). However, the formulations of the two algorithms are based on very different modeling frameworks: the v-SVM model formulates and solves a nonlinear optimization problem, whereas the GP model is based on a full Bayesian framework. Nevertheless, the overall forecasting performances of the GP and v-SVM models in this study were similar, which probably can be explained by the fact that both are kernel-based machine learning methods.

The full Bayesian framework enables the GP model to generate standard deviation estimates in addition to the predicted traffic flow volumes. This information could be useful to assess the reliability of the traffic flow predictions and to make better use of the predicted data. Such information cannot be readily obtained from the v-SVM model.
The estimated standard deviations were further plotted against the observed traffic volume data in Figures 5 and 6. The results suggest that the estimated standard deviations become larger when there are drastic changes in the observed traffic volume data. The predicted traffic volume and standard deviation data from the GP model were combined to generate upper and lower boundaries for each test data set; most observed traffic flow data points fell within these boundaries (see Figures 7 and 8).

The overall forecasting performance of the proposed GP model was satisfying. The model comparison results suggest that the GP model significantly outperforms the commonly used ARIMA model. Previously, the v-SVM model was shown to outperform multilayer, feed-forward neural network models in prediction accuracy and generalization capability (16). The results in this study indicate that the proposed GP model performs slightly better than the v-SVM model in most cases and that it can generate useful standard deviation estimates, which the v-SVM model cannot. In summary, the GP model offers a promising way to model and forecast traffic flow, and it has emerged as a serious competitor to the v-SVM model.

FUTURE WORK

This study showed that the GP and v-SVM models consistently outperformed the ARIMA model on all four data sets. Additional tests on other data sets are necessary to further confirm the superiority of these kernel-based machine learning methods over conventional modeling tools such as ARIMA. In particular, the GP model should be tested on data sets that exhibit clear seasonal patterns with erroneous values or missing data points, and then compared with other models such as the seasonal ARIMA model.

ACKNOWLEDGMENTS

The authors thank Daniel J. Dailey, University of Washington, Seattle, for permission to use the TDAD database and the James E.
Clyburn University Transportation Center, South Carolina State University, Orangeburg, for its financial support.

REFERENCES

1. Williams, B. M., P. K. Durvasula, and D. E. Brown. Urban Freeway Traffic Flow Prediction: Application of Seasonal Autoregressive Integrated Moving Average and Exponential Smoothing Models. In Transportation Research Record 1644, TRB, National Research Council, Washington, D.C., 1998.
2. Ahmed, M. S., and A. R. Cook. Analysis of Freeway Traffic Time-Series Data by Using Box-Jenkins Techniques. In Transportation Research Record 722, TRB, National Research Council, Washington, D.C., 1979.
3. Nihan, N. L., and K. O. Holmesland. Use of the Box and Jenkins Time Series Technique in Traffic Forecasting. Transportation, Vol. 9, 1980.
4. Davis, G. A., and N. L. Nihan. Nonparametric Regression and Short-Term Freeway Traffic Forecasting. Journal of Transportation Engineering, Vol. 117, No. 2, 1991.
5. Smith, B. L., and M. J. Demetsky. Traffic Flow Forecasting: Comparison of Modeling Approaches. Journal of Transportation Engineering, Vol. 123, No. 4, 1997.
6. Okutani, I., and Y. J. Stephanedes. Dynamic Prediction of Traffic Volume Through Kalman Filtering Theory. Transportation Research Part B, Vol. 18, No. 1, 1984.
7. Stathopoulos, A., and M. G. Karlaftis. A Multivariate State Space Approach for Urban Traffic Flow Modeling and Prediction. Transportation Research Part C, Vol. 11, No. 2, 2003.
8. Xie, Y., Y. Zhang, and Z. Ye. Short-Term Traffic Volume Forecasting Using Kalman Filter with Discrete Wavelet Decomposition. Computer-Aided Civil and Infrastructure Engineering, Vol. 22, No. 5, 2007.
9. Smith, B. L., and M. J. Demetsky. Short-Term Traffic Flow Prediction: Neural Network Approach. In Transportation Research Record 1453, TRB, National Research Council, Washington, D.C., 1994.
10. Park, B., C. J. Messer, and T. Urbanik II. Short-Term Freeway Traffic Volume Forecasting Using Radial Basis Function Neural Network. In Transportation Research Record 1651, TRB, National Research Council, Washington, D.C., 1998.
11. Yin, H. B., S. C. Wong, J. M. Xu, and C. K. Wong. Urban Traffic Flow Prediction Using a Fuzzy-Neural Approach. Transportation Research Part C, Vol. 10, No. 2, 2002.
12. Xie, Y., and Y. Zhang. A Wavelet Network Model for Short-Term Traffic Volume Forecasting. Journal of Intelligent Transportation Systems: Technology, Planning, and Operations, Vol. 10, No. 3, 2006.
13. Van Lint, J. W. C., S. P. Hoogendoorn, and H. J. Van Zuylen. Accurate Freeway Travel Time Prediction with State-Space Neural Networks Under Missing Data. Transportation Research Part C, Vol. 13, Nos. 5-6, 2005.
14. Park, D., and L. R. Rilett. Forecasting Freeway Link Travel Times with a Multilayer Feedforward Neural Network. Computer-Aided Civil and Infrastructure Engineering, Vol. 14, No. 5, 1999.
15. Vlahogianni, E. I., M. G. Karlaftis, and J. C. Golias. Optimized and Meta-Optimized Neural Networks for Short-Term Traffic Flow Prediction: A Genetic Approach. Transportation Research Part C, Vol. 13, No. 3, 2005.
16. Zhang, Y., and Y. Xie. Forecasting of Short-Term Freeway Volume with v-Support Vector Machines. In Transportation Research Record: Journal of the Transportation Research Board, No. 2024, Transportation Research Board of the National Academies, Washington, D.C., 2007.
17. Wu, C. H., J. M. Ho, and D. T. Lee. Travel-Time Prediction with Support Vector Regression. IEEE Transactions on Intelligent Transportation Systems, Vol. 5, No. 4, 2004.
18. Van Der Voort, M., M. Dougherty, and S. Watson. Combining Kohonen Maps with ARIMA Time Series Models to Forecast Traffic Flow. Transportation Research Part C, Vol. 4, No. 5, 1996.
19. Hornik, K., M. Stinchcombe, and H. White. Multilayer Feedforward Networks Are Universal Approximators. Neural Networks, Vol. 2, No. 5, 1989.
20. Rasmussen, C. E., and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, Mass., 2006.
21. Suykens, J. A. K., T. V. Gestel, J. D. Brabanter, B. D. Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific Publishing Co. Pte. Ltd., Singapore, 2002.
22. MacKay, D. J. C. Gaussian Processes: A Replacement for Supervised Neural Networks? Tutorial, Neural Information Processing Systems Foundation, La Jolla, Calif., 1997.
23. Seeger, M. Gaussian Processes for Machine Learning. International Journal of Neural Systems, Vol. 14, No. 2, 2004.
24. Chu, W., S. S. Keerthi, and C. J. Ong. Bayesian Trigonometric Support Vector Classifier. Neural Computation, Vol. 15, No. 9, 2003.
25. Zhao, K., S. C. Popescu, and X. Zhang. Bayesian Learning with Gaussian Processes for Supervised Classification of Hyperspectral Data. Photogrammetric Engineering & Remote Sensing, Vol. 74, No. 10, 2008.
26. Documentation for GPML MATLAB Code. http://www.gaussianprocess.org/gpml/code/matlab/doc/. Accessed Nov. 2009.
27. The R Project for Statistical Computing. http://www.r-project.org/. Accessed July 2009.

The Statistical Methods Committee peer-reviewed this paper.