Regression Methods for Spatially Extending Traffic Data


Roberto Iannella, Mariano Gallo, Giuseppina de Luca, Fulvio Simonelli
Università del Sannio

ABSTRACT

Traffic monitoring and network state estimation are key issues for ITS applications. Monitoring systems usually cover a limited number of network elements, so the detected information on traffic flows and their characteristics must be integrated with estimation models able to provide the state of the whole network. Usually, this process involves estimating the demand matrix, even though knowledge of OD flows is often unnecessary for several ITS applications. This paper analyses different types of regression models for spatially extending traffic data as a function of the data detected on the monitored links; in particular, it focuses on methods that estimate all link flows from a limited number of measurements, although the same approach can be used to estimate other traffic-related measures (e.g. mean speed, travel times, pollutant emissions).

Keywords: traffic estimation, regression models, ITS

INTRODUCTION

Effective network state estimation is a crucial component of many ITS systems. Even though some ITS systems require knowledge of OD flows, and sometimes also of their dynamic evolution, for many applications estimates of link characteristics such as flows, speeds and travel times are sufficient for effective ITS strategies, such as flow-responsive traffic lights or route guidance based on instantaneous travel times. In the last decade road traffic monitoring technologies have evolved considerably [1-5], and the wide diffusion of permanent monitoring devices (inductive loop sensors, cameras, radar, etc.) has provided large amounts of traffic data continuously over time. Nevertheless, because of budget limitations, the number of monitored links is usually very small compared with the size of the network.

For this reason network state estimation methods are a crucial need for ITS applications, and many authors have addressed this aspect. Unfortunately, the number of available measurements is usually not sufficient for full observability of the transportation system, and several studies aim to define observability conditions. Network state estimation and the forecasting of its evolution over time are mainly based on OD flow estimation; indeed, knowledge of travel demand allows all link variables (flows, speeds, travel times, pollutant emissions, etc.) to be calculated through traffic assignment methods. OD matrix estimation based on traffic measurements has been widely studied in the literature in both static and dynamic contexts but, unfortunately, OD flow updating is a highly underdetermined problem [6], and several hypotheses have to be introduced in order to limit the number of variables [7] or to consider more relations between variables. This is possible mainly in a dynamic context, where some assumptions can be introduced about the time evolution of travel demand [8][9].

Considering that in many applications knowledge of travel demand is not necessary, herein we assess an alternative approach to network state estimation based only on link data, without involving OD flow estimation. In particular, several regression methods, both parametric and non-parametric, are proposed and evaluated for estimating traffic flows on some links of the network as a function of the data measured on other links equipped with measurement devices.

PROBLEM DESCRIPTION

This paper aims to assess some parametric and non-parametric regression models that spatially extend the flow information available on monitored links in order to estimate the traffic flow on other links in the network. In more detail, let:

$L_c$ be the set of monitored links and $f_c$ the vector of flows on these links, with dimension $n_{lc}$;
$L_r$ be the set of non-monitored relevant links and $f_r$ the vector of flows on these links, with dimension $n_{lr}$;
$d$ be the vector of origin-destination flows, with dimension $n_{od}$;
$M$ be the assignment matrix mapping OD flows to link flows, with dimension $(n_l \times n_{od})$;
$M_c$ be the assignment sub-matrix referring to the links in $L_c$, with dimension $(n_{lc} \times n_{od})$;
$M_r$ be the assignment sub-matrix referring to the links in $L_r$, with dimension $(n_{lr} \times n_{od})$.

The assignment matrix $M$ depends on the route choices of travellers and is usually estimated through an assignment model that considers path costs, formalized by the following equation:

$M = A\,P(C)$    (1.a)

where:

$A$ is the link-path incidence matrix (number of links x number of paths), with elements $a_{i,j}$ equal to 1 if link $i$ belongs to path $j$, and 0 otherwise;
$P$ is the route choice probability matrix (number of paths x number of OD flows), calculated as a function of the path costs $C$.
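As an illustration of equation (1.a), the following minimal sketch builds $M = A\,P(C)$ for a toy network, assuming a multinomial logit route choice model. The function name, the `path_od` index vector, the dispersion parameter `theta` and the toy numbers are all hypothetical choices made for this example and are not taken from the paper.

```python
import numpy as np

def logit_assignment_matrix(A, path_costs, path_od, theta=1.0):
    """Return M = A @ P, with P from a multinomial logit on path costs.

    A          : (n_links, n_paths) 0/1 link-path incidence matrix
    path_costs : (n_paths,) cost of each path
    path_od    : (n_paths,) index of the OD pair served by each path
    theta      : logit dispersion parameter (hypothetical value)
    Assumes every OD pair is served by at least one path.
    """
    A = np.asarray(A, dtype=float)
    path_costs = np.asarray(path_costs, dtype=float)
    path_od = np.asarray(path_od)
    n_od = int(path_od.max()) + 1
    P = np.zeros((A.shape[1], n_od))
    for od in range(n_od):
        idx = np.flatnonzero(path_od == od)
        # Shift by the minimum cost so the exponentials cannot overflow.
        weights = np.exp(-(path_costs[idx] - path_costs[idx].min()) / theta)
        P[idx, od] = weights / weights.sum()
    return A @ P

# Toy network: 3 links, 3 paths, 2 OD pairs (illustrative numbers only).
A = np.array([[1, 0, 0],
              [0, 1, 1],
              [1, 0, 1]])
costs = [10.0, 10.0, 12.0]   # paths 0 and 1 have equal cost
path_od = [0, 0, 1]          # paths 0, 1 serve OD pair 0; path 2 serves OD pair 1
M = logit_assignment_matrix(A, costs, path_od)
print(M)
```

Each column of `M` gives, for one OD pair, the share of its demand that uses each link, so multiplying `M` by a demand vector yields link flows.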
Assuming that the assignment matrix $M$ is known, error-free and independent of travel demand (non-congested network), the following equations can be written:

$f_c = M_c\, d$    (2.a)
$f_r = M_r\, d$    (2.b)

If the rank of matrix $M_c$ is equal to the dimension of $d$, equation (2.a) allows the demand vector $d$ to be calculated, and the flow vector $f_r$ can be expressed as a function of the vector $f_c$ through the following equation:

$f_r = M_r M_c^{-1} f_c$    (2.c)

Unfortunately, very often the rank of $M_c$ is significantly lower than the dimension of the demand vector, so that estimating the demand, and consequently the relevant flows, requires further information. The most common strategy is based on prior knowledge of demand, which is updated using link flow measurements through a Maximum Likelihood [10], Generalized Least Squares (GLS) [11] or Bayesian [12] approach. Referring to the GLS approach, the estimate of travel demand $d^*$ is given by the following equation:

$d^* = d_p + S M_c^T (M_c S M_c^T)^{-1} (f_c - M_c d_p)$    (3.a)

where:

$d_p$ is a prior estimate of $d$;
$S$ is the dispersion matrix of $d_p$;
$M_c^T$ is the transpose of the matrix $M_c$.

By using equations (2.b) and (3.a), the flows $f_r$ can be estimated by equation (3.b):

$f_r = M_r d^* = M_r \left[ d_p + S M_c^T (M_c S M_c^T)^{-1} (f_c - M_c d_p) \right]$    (3.b)

Equation (3.b) establishes a linear relation between the measured link flows $f_c$ and the relevant but non-measured link flows $f_r$, and it can be viewed as a spatial extension model for flow data based on the GLS formulation. Notably, equation (3.b) requires a prior estimate of travel demand as well as knowledge of its distribution and dispersion matrix. In the following, several regression models aimed at estimating the flow vector $f_r$ directly from the vector $f_c$ are tested in laboratory experiments and compared with the results provided by equation (3.b), used here as a benchmark.
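Equations (3.a) and (3.b) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `gls_extend` and the toy matrices below are hypothetical, and the sketch assumes that $M_c S M_c^T$ is invertible (for example, $M_c$ with full row rank and $S$ positive definite).

```python
import numpy as np

def gls_extend(f_c, M_c, M_r, d_p, S):
    """GLS demand update (3.a) and flows on non-monitored links (3.b)."""
    residual = f_c - M_c @ d_p            # mismatch between counts and prior
    G = M_c @ S @ M_c.T                   # assumed invertible
    d_star = d_p + S @ M_c.T @ np.linalg.solve(G, residual)
    return d_star, M_r @ d_star

# Toy example: 3 OD pairs, 2 monitored links, 1 non-monitored link.
M_c = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])
M_r = np.array([[1.0, 1.0, 0.0]])
d_p = np.array([100.0, 200.0, 50.0])      # prior demand (illustrative)
S = np.diag((0.2 * d_p) ** 2)             # prior dispersion, CV = 0.2 (assumed)
d_true = np.array([110.0, 190.0, 60.0])   # "true" demand used to simulate counts
f_c = M_c @ d_true

d_star, f_r = gls_extend(f_c, M_c, M_r, d_p, S)
print("updated demand:", d_star)
print("estimated f_r:", f_r, "true f_r:", M_r @ d_true)
```

With error-free counts the updated demand reproduces the measured flows exactly, that is, `M_c @ d_star` equals `f_c`, while the estimate of `f_r` still depends on the quality of the prior.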

In the case of congested networks, link travel costs depend on link flows, so the assignment model should be formalized as a fixed-point problem:

$f = A\,P(A^T c(f))\, d$    (4)

Equation (4) establishes a non-linear relationship between travel demand and link flows that does not admit a closed-form expression for the inverse problem, namely the estimation of travel demand $d$ from link flows. Consistently with the GLS model, the demand estimate can be obtained by solving the following optimization problem:

$d^* = \arg\min_d\; (d - d_p)^T S^{-1} (d - d_p)$    (5.a)

subject to:

$f = A\,P(A^T c(f))\, d$    (5.b)
$\hat{f}_c = f_c$    (5.c)

where $\hat{f}_c$ is the vector of assigned flows on monitored links, extracted from the vector $f$ of all assigned link flows. The spatial extension of the $f_c$ data can be obtained by assigning the estimated travel demand $d^*$ and considering the set of non-monitored relevant links. It is worth noting that, for large networks, solving problem (5.a)-(5.c) requires a computational effort that is often incompatible with real-time applications; for this reason, synthetic models based on regression and preprocessed data appear more suitable in this context. For congested networks too, several regression models are tested in order to obtain an effective spatial extension of the monitored data, and the results are compared with the GLS formulation.

REGRESSION MODELS FOR EXTENSION OF LINK FLOW DATA

The aim of this paper is to analyse different regression methods able to define a relationship between the measured traffic flows and the predicted ones. Since equation (3.b), based on the GLS formulation, provides a linear relationship between measured and predicted flow variables, linear regression, as the simplest method, has been tested first. Furthermore, this paper describes non-parametric formulations, which are suitable for describing virtually any non-linear relationship.

In this paper we propose and test three different methods: Kernel Methods, Artificial Neural Networks (ANNs) and Support Vector Machine (SVM) Methods. A theoretical reference on the use of kernel methods is provided by [13], [14] and [15]; some examples of applications of neural networks in transportation engineering are reported in [16] and [17], while for SVM methods the principal past studies are reported in [17] and [18].

Linear regression

Regression analysis is a technique used to analyse a series of data consisting of a dependent variable and one or more independent variables. The aim is to estimate a possible functional relationship between the dependent variable and the independent variables. Linear regression assumes a linear model:

$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_{n_c} X_{n_c}$    (6)

where:

$b_0, b_1, \dots, b_{n_c}$ are the model parameters;
$Y$ is the predicted dependent variable (non-measured flow in our case);
$X_1, X_2, \dots, X_{n_c}$ are the independent variables (measured flows in our case).

Consistently with the minimum square error paradigm, for each relevant link flow $Y_r$ the vector of parameters $b$ is given by:

$b = (X^T X)^{-1} X^T Y_r$    (7)

where:

$X$ is the matrix of observations of the independent variables, with dimensions $N \times (n_c + 1)$, where $N$ is the number of observations;
$Y_r$ is the vector of observations of the dependent variable (i.e. the flow on link $r$), with dimension $N \times 1$.

Notably, in the case of full observability (rank of $M_c$ equal to $n_{od}$), equation (7) must provide the same results as equation (2.c) if the number of observations is at least equal to $n_c + 1$.

Kernel regression

The objective of kernel regression is to find a non-linear relation between random variables $X$ and $Y$. In any non-parametric regression, the conditional expectation of a variable $Y$ relative to a variable $X$ may be written:

$E(Y \mid X = x) = m(x)$

where $m$ is an unknown function. Nadaraya and Watson, both in 1964, proposed to estimate $m$ as a locally weighted average, using a kernel as the weighting function. The Nadaraya-Watson estimator is:

$\hat{m}_h(x) = \frac{\sum_{i=1}^{n} K_h(x - x_i)\, y_i}{\sum_{i=1}^{n} K_h(x - x_i)} = \sum_{i=1}^{n} W_h(x, x_i)\, y_i$

A kernel is a non-negative real-valued function $K$ mapping from $X \times X$ to $\mathbb{R}$ and satisfying the following two requirements:

$\int_{-\infty}^{+\infty} K(u)\, du = 1$;
$K(-u) = K(u)$ for all values of $u$;

with $u = x - x_i$, and where the bandwidth $h$ is a smoothing parameter that limits the size of the interval within which values are considered in the weighted average. The first requirement ensures that the method of kernel density estimation results in a probability density function. The second requirement ensures that the average of the corresponding distribution is equal to that of the sample used. Several types of kernel functions are commonly used: uniform, triangle, Epanechnikov, quartic (biweight), tricube, triweight, Gaussian, quadratic and cosine.
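The Nadaraya-Watson estimator can be sketched as follows for a single explanatory variable, using two of the kernel types listed above (Epanechnikov and Gaussian). The function names, the bandwidth value and the fallback to the sample mean when no training point receives a positive weight are assumptions made for this illustration; with several measured links, as in this paper, a multivariate kernel would be needed instead.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 3/4 (1 - u^2) on |u| <= 1, zero elsewhere."""
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)

def gaussian(u):
    """Standard Gaussian kernel."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def nadaraya_watson(x_query, x_train, y_train, h, kernel=gaussian):
    """Locally weighted average of y_train at each query point (1-D inputs)."""
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    u = (x_query[:, None] - x_train[None, :]) / h   # scaled distances
    K = kernel(u)                                   # kernel weights
    denom = K.sum(axis=1)
    safe = np.where(denom > 0, denom, 1.0)          # avoid division by zero
    est = (K @ y_train) / safe
    # Assumption: if no point falls inside the window, use the sample mean.
    return np.where(denom > 0, est, y_train.mean())

# Illustrative data: a noiseless linear relation on an evenly spaced grid.
x_train = np.linspace(0.0, 10.0, 21)
y_train = 2.0 * x_train + 1.0
for name, k in [("Epanechnikov", epanechnikov), ("Gaussian", gaussian)]:
    print(name, nadaraya_watson(5.0, x_train, y_train, h=2.0, kernel=k))
```

The normalising factor $1/h$ of the scaled kernel cancels between numerator and denominator, which is why it does not appear in the code.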
Herein we consider the Epanechnikov kernel, named $K_1$ in the following:

$K(u) = \frac{3}{4}\,(1 - u^2)\,\mathbf{1}_{\{|u| \le 1\}}$

and the Gaussian kernel, named $K_2$ in the following:

$K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}u^2}$

Neural Networks

Artificial neural networks (ANNs) are a computational model widely used in computer science and other research disciplines, based on a large collection of simple neural units (artificial neurons), loosely analogous to the observed behaviour of a biological brain's neurons. A single artificial neuron can be implemented in many different ways. The general mathematical definition is:

$y(x) = g\left(\sum_{i=0}^{n} w_i x_i\right)$

where $x$ is a neuron with $n$ input dendrites $(x_0, \dots, x_n)$, the elements that connect the nucleus to other nuclei, and one output $y(x)$, and $(w_0, \dots, w_n)$ are weights calibrated in the training procedure. The activation function $g$ determines how strong the output of the neuron (if any) should be, based on the weighted sum of the inputs. If the artificial neuron is meant to mimic a real neuron, the activation function $g$ should be a simple threshold function returning 0 or 1. In this paper we have chosen to implement a multilayer feed-forward ANN, which is the most common kind of ANN. In a multilayer feed-forward ANN, the neurons are ordered in layers, starting with an input layer and ending with an output layer. Between these two layers there are a number of hidden layers. Connections in this kind of network only go forward from one layer to the next. Implementing multilayer feed-forward ANNs requires two different phases: a training phase (sometimes also referred to as the learning phase) and an execution phase. In the training phase the ANN is trained to return a specific output when given a specific input; this is done by repeated training on a set of training data. In the execution phase the ANN returns outputs on the basis of inputs.

SVM Method

Support Vector Machine (SVM) analysis is a popular machine learning tool for classification and regression. SVM methods were originally designed for non-linear classification problems and were later extended to non-linear regression problems (SVR). Suppose we have the training data $\{(x_1, y_1), \dots, (x_n, y_n)\}$ with $n$ patterns; in the first step we map the input pattern $x$ into a higher-dimensional feature space using a non-linear mapping function $\varphi$. The non-linear regression problem between $x$ and $y$ becomes a linear regression problem between $\varphi(x)$ and $y$, i.e.:

$f(x; w) = \langle w, \varphi(x) \rangle + b$

where $\langle \cdot, \cdot \rangle$ denotes the inner product, and $w$ and $b$ are the regression coefficients obtained by minimizing the error between $f$ and the observed values $y$. The error between $f$ and $y$ is not evaluated with the MSE (mean squared error) norm, but with the $\varepsilon$-insensitive error norm:

$|f(x; w) - y|_\varepsilon = \begin{cases} 0 & \text{if } |f(x; w) - y| < \varepsilon \\ |f(x; w) - y| - \varepsilon & \text{otherwise} \end{cases}$

where small errors $|f(x; w) - y| < \varepsilon$ are ignored, while for large errors this norm approximates the mean absolute error (MAE). The regression coefficients ($w$ and $b$) are estimated by minimizing the objective function:

$J = \frac{C}{n} \sum_{i=1}^{n} |f(x_i; w) - y_i|_\varepsilon + \|w\|^2$

where $C$, which controls the regularization, and $\varepsilon$ are the hyper-parameters.

NUMERICAL EXPERIMENTS

The laboratory experiments have been carried out considering a network with 16 nodes, 48 oriented links and 16 OD pairs. In the following, the flows on 6 links are assumed known (monitored links) and the flows on the other 42 links are the variables to be estimated. Starting from a prior seed matrix $d_p$, several draws $d_k$ of the OD matrix have been generated through random sampling with different levels of the coefficient of variation (CV). In more detail, for each CV level, the training set was created by assigning 200 randomly generated origin-destination matrices to the network, so that the training dataset was composed of 200 examples with 6 independent variables (measured link flows) and the corresponding 42 dependent variables (non-measured relevant link flows). Similarly, the validation set was generated by assigning 100 origin-destination matrices, in order to obtain 100 pairs of independent variables and dependent target variables. In this first set of experiments the network has been considered uncongested, and the assignment matrix $M$ was calculated through a logit-based route choice model, under the assumption that link travel times are independent of link flows. Table 1 summarizes the results on the validation set for the different models in terms of Mean Absolute Percentage Error (MAPE) and Root Mean Square Error (RMSE) for overall link flow estimation.

Table 1. Mean errors in uncongested network of predicted link flows for different coefficients of variation in validation dataset

MAPE
CV
0,1     2,41%    2,45%    3,45%    3,51%    2,58%    2,46%
0,2     5,01%    5,06%    6,47%    6,49%    5,33%    5,09%
0,3     7,47%    7,65%   10,21%   10,24%    7,96%    7,67%
0,4    11,67%   11,96%   15,55%   15,61%   12,27%   11,99%
0,5    15,56%   16,13%   22,33%   22,41%   16,59%   16,13%
0,6    17,88%   18,04%   26,90%   27,01%   19,09%   18,04%
0,7    25,10%   26,33%   33,84%   33,92%   27,81%   25,95%
0,8    31,30%   32,21%   39,37%   39,37%   32,65%   32,35%
0,9    38,77%   37,92%   56,32%   56,66%   39,24%   37,02%
1      56,23%   58,85%   77,95%   78,02%   60,51%   57,79%

RMSE
CV
0,1     3,81     3,90     6,20     6,34     4,29     3,90
0,2     7,81     7,92    11,47    11,53     0,00     0,00
0,3    11,20    11,42    17,25    17,25    12,43    11,47
0,4    15,47    15,77    22,18    22,28    16,50    15,83
0,5    18,95    19,62    29,27    29,34    20,46    19,71
0,6    21,27    21,59    33,03    33,00    22,39    21,83
0,7    25,24    25,69    38,07    38,04    27,34    25,88
0,8    28,16    28,66    41,83    41,71     0,00     0,00
0,9    30,72    30,98    46,65    46,57    31,78    31,
1        ,51    32,68    47,98    47,80     0,61    32,96

The results show that, for the uncongested network, a classical GLS estimation performs well up to high values of the dispersion of the true OD matrix around the prior OD matrix estimate $d_p$. Notably, linear regression based on simulated data, which does not explicitly consider the prior estimate and its dispersion, provides results very similar to GLS estimation; it can therefore be considered an alternative approach that, relying on a pre-calculated dataset, reduces the computational effort for large networks in real-time applications. In this context, the ANN and the SVM models also appear suitable for link flow estimation because, without making any assumption about the relationship between measured and predicted flows, they lead to errors not significantly different from those of the linear models. The worst performance is given by the kernel regression models, whose errors do not change significantly with the choice of kernel function and are always higher than the errors of the other models. Figure 1 reports the cumulative distributions of the MAPE of predicted link flows.

The results highlight that, for coefficients of variation lower than 0.5, over 70% of predicted link flows show a percentage error lower than 20% in all models, except for the kernel models, where the error is nearly 25%; this percentage quickly decreases for higher coefficients of variation, leading to an unacceptable error for a significant number of links.
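The experimental procedure for the uncongested case (sampling OD vectors around a prior with a given CV, assigning them, fitting a linear regression on the training set and computing MAPE and RMSE on the validation set) can be sketched as below. This is only an outline under stated assumptions: the assignment matrix and prior demand are random stand-ins rather than the network used in the paper, the OD draws are normal and truncated at zero, and the first six links are arbitrarily taken as the monitored ones, so the printed errors are not comparable with Table 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_od, n_links, n_monitored = 16, 48, 6

# Hypothetical stand-ins for the network (not the paper's data):
# a strictly positive random assignment matrix and a random prior demand.
M = rng.uniform(0.05, 1.0, size=(n_links, n_od))
d_p = rng.uniform(50.0, 150.0, size=n_od)

def sample_link_flows(n_samples, cv):
    """Draw OD vectors around d_p with the given CV and assign them (f = M d)."""
    d = rng.normal(d_p, cv * d_p, size=(n_samples, n_od)).clip(min=0.0)
    return d @ M.T                                  # (n_samples, n_links)

def fit_linear(X, Y):
    """Least-squares fit with intercept, one column of coefficients per target."""
    X1 = np.column_stack([np.ones(len(X)), X])
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return B

def predict_linear(B, X):
    return np.column_stack([np.ones(len(X)), X]) @ B

def mape(y_true, y_pred):
    mask = y_true > 0                               # skip zero flows
    return 100.0 * float(np.mean(np.abs(y_pred[mask] - y_true[mask]) / y_true[mask]))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

cv = 0.2
train = sample_link_flows(200, cv)                  # 200 training examples
valid = sample_link_flows(100, cv)                  # 100 validation examples
X_tr, Y_tr = train[:, :n_monitored], train[:, n_monitored:]
X_va, Y_va = valid[:, :n_monitored], valid[:, n_monitored:]

B = fit_linear(X_tr, Y_tr)
Y_hat = predict_linear(B, X_va)
print(f"CV={cv}: MAPE={mape(Y_va, Y_hat):.2f}%  RMSE={rmse(Y_va, Y_hat):.2f}")
```

Repeating the last block for each CV level gives one row of errors per level; any of the other regressors could be substituted for `fit_linear` and `predict_linear` without changing the rest of the pipeline.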

Fig 1: Cumulative distribution of MAPE in uncongested network among predicted link flows (panels: CV = 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)

Similar experiments have been carried out considering a congested network. In particular, a BPR cost function has been used in order to take into account the dependency of link travel time on link flows. Given the non-linear relationship between measured and predicted flows, linear regression has not been considered in this case. As shown in Table 2 and Figure 2, the results are generally worse than for the uncongested network, consistently with the more complex relationship among the problem variables. It is worth noting that, in both the uncongested and the congested network, the ANN model and the SVM model perform very close to, and sometimes better than, the GLS model; this suggests that non-parametric regression models are suitable mainly for real-time applications, where the computation time required by the GLS formulation is a critical issue. However, the performance of GLS as well as of the non-parametric models decreases significantly as the dispersion in the dataset increases.

To overcome this problem, the opportunity of coupling a non-parametric regression model with a cluster preprocessing analysis has been investigated. In more detail, with reference to the scenarios with a coefficient of variation from 0.3 to 0.9, the training dataset has been divided into several clusters on the basis of the values of the independent variables (measured flows on six links). To define the clusters, a classical method based on Euclidean distance has been used, and for each cluster an ANN and an SVM model have been trained. The validation has been carried out by assigning each pair of measured and relevant link flows to the closest cluster and applying the corresponding trained models. The MAPE values resulting from the combination of cluster analysis and non-parametric regression, with reference to all relevant links, are reported in Table 3. These results suggest that preprocessing the data through cluster analysis allows the data to be split into several demand scenarios, leading to a significantly better final estimate, particularly for the SVM model, which seems to perform better than the ANN approach.

Table 2. Mean errors in congested network of predicted link flows for different coefficients of variation in validation dataset

MAPE
CV
0,1     3,52%    3,71%    3,71%    3,90%    3,72%
0,2     7,90%    7,85%    7,85%    8,08%    8,26%
0,3    12,29%   10,08%   10,10%   12,23%   13,14%
0,4    19,41%   14,96%   14,97%   18,41%   18,13%
0,5    21,71%   20,04%   20,09%   23,50%   22,72%
0,6    24,99%   24,73%   24,79%   29,70%   27,02%
0,7    38,29%   32,99%   33,03%   45,63%   44,07%
0,8    39,39%   36,54%   36,43%   47,96%   42,17%
0,9    69,82%   53,14%   53,24%   70,44%   67,12%
1      68,10%   62,97%   63,58%   77,74%   80,12%

RMSE
CV
0,1     3,81     6,20     6,34     7,51     6,52
0,2    13,48    13,63    13,63    13,79    13,90
0,3    20,12    19,45    19,50    23,42    21,40
0,4    28,78    25,74    25,76    32,77    31,64
0,5    34,08    32,69    32,64    40,05    36,45
0,6    38,38    36,77    36,78    45,72    42,30
0,7    44,11    44,25    44,38    56,52    50,34
0,8    45,27    46,49    46,38    60,04    51,52
0,9    54,93    51,22    51,19    61,69    56,
1        ,69    54,70    54,72    65,53    62,16

International Journal of Engineering Technology, Management and Applied Sciences

Fig 2: Cumulative distribution of MAPE in congested network among predicted link flows (panels: CV = 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)

Table 3. MAPE in congested network of predicted link flows with combined cluster analysis and non-parametric regressions.

CV
0,4    10,50%    8,90%
0,5    14,45%   11,05%
0,6    17,29%   15,03%
0,7    23,45%   18,90%
0,8    29,60%   21,82%
0,9    39,83%   34,06%
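The combination of cluster analysis and per-cluster regression can be outlined as follows. Two substitutions are made here and should be read as assumptions: the clustering is a plain k-means with a deterministic farthest-point initialisation (the paper only states that a classical method based on Euclidean distance was used), and each cluster is given an affine least-squares model as a simple stand-in for the per-cluster ANN and SVM models. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def nearest_center(X, centers):
    """Index of the closest center (Euclidean distance) for each row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        labels = nearest_center(X, centers)
        for j in range(k):
            if np.any(labels == j):                 # keep old center if empty
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def fit_affine(X, Y):
    X1 = np.column_stack([np.ones(len(X)), X])
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return B

def fit_clustered(X, Y, k):
    """Cluster on the measured flows X, then fit one model per cluster."""
    centers = kmeans(X, k)
    labels = nearest_center(X, centers)
    models = {j: fit_affine(X[labels == j], Y[labels == j])
              for j in range(k) if np.any(labels == j)}
    return centers, models, fit_affine(X, Y)        # global model as fallback

def predict_clustered(X, centers, models, fallback):
    """Assign each observation to the closest cluster and apply its model."""
    labels = nearest_center(X, centers)
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.array([X1[i] @ models.get(int(j), fallback)
                     for i, j in enumerate(labels)])

# Synthetic data: two demand scenarios with different flow relationships.
rng = np.random.default_rng(1)
X_a = rng.uniform(0.0, 1.0, size=(40, 2))
X_b = rng.uniform(10.0, 11.0, size=(40, 2))
X = np.vstack([X_a, X_b])
Y = np.concatenate([X_a @ [1.0, 2.0] + 3.0, X_b @ [-1.0, 0.5] + 7.0])[:, None]

centers, models, fallback = fit_clustered(X, Y, k=2)
Y_hat = predict_clustered(X, centers, models, fallback)
Y_glob = np.column_stack([np.ones(len(X)), X]) @ fallback
print("max error, per-cluster models:", np.abs(Y_hat - Y).max())
print("max error, single global model:", np.abs(Y_glob - Y).max())
```

In this synthetic case each scenario follows its own exact affine relation, so the per-cluster models recover it while a single global model cannot; real data would of course show a smaller gap.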

CONCLUSIONS

In this paper several regression models for the spatial extension of traffic data have been evaluated and compared with the traditional approach based on travel demand estimation. The results show that, even though none of the approaches appears reliable when the traffic scenario departs from the main pattern, the ANN and SVM models are suitable for application, mainly in real-time contexts where the computational time required by the optimization approach is a critical aspect. Furthermore, the combination of cluster analysis, aimed at recognizing different demand scenarios, with non-parametric regression, mainly SVM regression, provides the best results.

REFERENCES

[1] C.-H. Chong, S.P. Kumar, Sensor networks: evolution, opportunities, and challenges, Proceedings of the IEEE, vol. 91, August 2003.
[2] Z. Sun, G. Bebis, R. Miller, On-road vehicle detection using optical sensors: a review, 2004 IEEE Intelligent Transportation Systems Conference, Washington, D.C., USA, October 2004.
[3] A. Sharma, R. Chaki, U. Bhattacharya, Applications of wireless sensor network in Intelligent Traffic System: a review, 3rd International Conference on Electronics Computer Technology (ICECT), April 2011.
[4] G.S. Tewolde, Sensor and network technology for Intelligent Transportation Systems, 2012 IEEE International Conference on Electro/Information Technology (EIT), pp. 1-7, May 2012.
[5] B. Tian, B.T. Morris, M. Tang, Y. Yao, C. Gou, D. Shen, S. Tang, Hierarchical and networked vehicle surveillance in ITS: a survey, IEEE Transactions on Intelligent Transportation Systems, vol. 16, April 2015.
[6] V. Marzano, A. Papola, F. Simonelli, Limits and perspectives of effective o-d matrix correction using traffic counts, Transportation Research Part C, 17(5).
[7] T. Djukic, G. Flötteröd, H. van Lint, S. Hoogendoorn, Efficient real time OD matrix estimation based on Principal Component Analysis, Proceedings of the International IEEE Conference on Intelligent Transportation Systems (ITSC), 2012.
[8] K. Ashok, M. Ben-Akiva, Alternative approaches for real-time estimation and prediction of time-dependent origin-destination flows, Transportation Science, 34(1).
[9] E. Cascetta, A. Papola, V. Marzano, F. Simonelli, I. Vitiello, Quasi-dynamic estimation of o-d flows from traffic counts: formulation, statistical validation and performance analysis on real data, Transportation Research Part B, vol. 55.
[10] M.G.H. Bell, The estimation of origin destination matrix from traffic counts, Transportation Science, 17(2), 1983.
[11] E. Cascetta, Estimation of trip matrices from traffic counts and survey data: a generalized least squares estimator, Transportation Research Part B, 18(4/5), 1984.
[12] M. Maher, Inferences on trip matrices from observations on link volumes: a Bayesian statistical approach, Transportation Research Part B, 17(6), 1983.
[13] R. Blundell, A. Duncan, Kernel Regressions in Empirical Microeconomics, Journal of Human Resources, vol. 33, no. 1, 1995.
[14] E.A. Nadaraya, On Estimating Regression, Theory of Probability and Its Applications, vol. 9, no. 1, 1964.
[15] G.S. Watson, Smooth Regression Analysis, Sankhya Series A, vol. 26, no. 4, 1964.
[16] M. Gallo, F. Simonelli, G. De Luca, C. Della Porta, An Artificial Neural Network approach for spatially extending road traffic monitoring measures, Proceedings of 2016 IEEE Workshop on Environmental, Energy, and Structural Monitoring Systems (EESMS 2016), Bari, Italy, June 2016.
[17] G. De Luca, M. Gallo, Artificial neural networks for forecasting user flows in transportation networks: Literature review, limits, potentialities and open challenges, IEEE International Conference on Models and Technologies for Intelligent Transportation Systems (MT-ITS).
[18] M. Sciandrone, Support Vector Machines.
[19] H.T. Lin, C.J. Lin, A study on Sigmoid Kernels for SVM and the Training of non-PSD Kernels by SMO-type Methods, Department of Computer Science and Information Engineering, National Taiwan University.
