Transportation Big Data Analytics

1 Transportation Big Data Analytics: Regularization. Xiqun (Michael) Chen, College of Civil Engineering and Architecture, Zhejiang University, Hangzhou, China. Fall 2016. Xiqun (Michael) Chen (Zhejiang University) Transportation Big Data Analytics 1 / 86

2 Outline
1 Subset Selection: Best Subset Selection; Stepwise Selection; Choosing the Optimal Model
2 Shrinkage Methods: Ridge Regression; Lasso; Selecting the Tuning Parameter
3 Dimension Reduction Methods
4 Applications


4 Subset Selection / Best Subset Selection
Fit a separate least squares regression for each possible combination of the p predictors: all p models that contain exactly one predictor, all p(p-1)/2 models that contain exactly two predictors, and so forth. We then look at all of the resulting models, with the goal of identifying the one that is best. The problem of selecting the best model from among the 2^p possibilities considered by best subset selection is not trivial.
Algorithm 1: Best subset selection
1. Let M_0 denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
2. For k = 1, 2, ..., p:
(a) Fit all C(p, k) models that contain exactly k predictors.
(b) Pick the best among these C(p, k) models, and call it M_k. Here "best" is defined as having the smallest RSS or, equivalently, the largest R².
3. Select a single best model from among M_0, ..., M_p using cross-validated prediction error, C_p (AIC), BIC, or adjusted R².
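Algorithm 1 can be sketched directly with numpy; the helper names and the synthetic data below are illustrative, not from the lecture.

```python
# Best subset selection (Algorithm 1): for each size k, fit all C(p, k) models
# and keep the one with smallest RSS. Exhaustive, so only feasible for small p.
import itertools
import numpy as np

def rss(X, y):
    """RSS of the least squares fit of y on X (with an intercept column)."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    return float(resid @ resid)

def best_subset(X, y):
    """Return {k: (best predictor subset of size k, its RSS)} for k = 0..p."""
    n, p = X.shape
    models = {0: ((), float(((y - y.mean()) ** 2).sum()))}  # M_0: null model
    for k in range(1, p + 1):
        best = min(((subset, rss(X[:, subset], y))
                    for subset in itertools.combinations(range(p), k)),
                   key=lambda t: t[1])
        models[k] = best
    return models

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=100)
models = best_subset(X, y)
print(models[2][0])  # the two truly active predictors: (0, 2)
```

Step 3 of the algorithm (choosing among M_0, ..., M_p) would then use a criterion such as C_p, BIC, or cross-validation rather than RSS.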

5 Subset Selection / Best Subset Selection
For each possible model containing a subset of all the predictors, the red frontier tracks the best model for a given number of predictors, according to RSS and R². The model improves as the number of variables increases; however, from the three-variable model on, there is little improvement in RSS and R².
[Figure: residual sum of squares and R² versus number of predictors; the red frontier marks the best model of each size]


7 Subset Selection / Stepwise Selection: Forward Stepwise Selection
Forward stepwise selection is a computationally efficient alternative to best subset selection. While best subset selection considers all 2^p possible models containing subsets of the p predictors, forward stepwise considers a much smaller set of models. It begins with a model containing no predictors, and then adds predictors to the model, one at a time, until all of the predictors are in the model.
Algorithm 2: Forward stepwise selection
1. Let M_0 denote the null model, which contains no predictors.
2. For k = 0, 1, ..., p-1:
(a) Consider all p-k models that augment the predictors in M_k with one additional predictor.
(b) Choose the best among these p-k models, and call it M_{k+1}. Here "best" is defined as having the smallest RSS or highest R².
3. Select a single best model from among M_0, ..., M_p using cross-validated prediction error, C_p (AIC), BIC, or adjusted R².
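A minimal numpy sketch of Algorithm 2; the helper and data are illustrative stand-ins. It fits only p-k candidate models per step instead of all C(p, k).

```python
# Forward stepwise selection: grow the model one predictor at a time,
# at each step adding the predictor that most reduces the RSS.
import numpy as np

def rss(X, y):
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    r = y - Xc @ beta
    return float(r @ r)

def forward_stepwise(X, y):
    """Return the nested sequence of selected index sets M_0, M_1, ..., M_p."""
    n, p = X.shape
    selected, path = [], [[]]
    for _ in range(p):
        remaining = [j for j in range(p) if j not in selected]
        best_j = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
        selected = selected + [best_j]
        path.append(list(selected))
    return path

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
y = 4.0 * X[:, 3] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=80)
path = forward_stepwise(X, y)
print(path[1], path[2])  # strongest predictor enters first: [3] [3, 1]
```

Note the models are nested (M_k is always contained in M_{k+1}), which is exactly why forward stepwise can miss the best model of a given size that exhaustive search would find.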

8 Subset Selection / Stepwise Selection: Backward Stepwise Selection
Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection. Unlike forward stepwise selection, it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one at a time.
Algorithm 3: Backward stepwise selection
1. Let M_p denote the full model, which contains all p predictors.
2. For k = p, p-1, ..., 1:
(a) Consider all k models that contain all but one of the predictors in M_k, for a total of k-1 predictors.
(b) Choose the best among these k models, and call it M_{k-1}. Here "best" is defined as having the smallest RSS or highest R².
3. Select a single best model from among M_0, ..., M_p using cross-validated prediction error, C_p (AIC), BIC, or adjusted R².
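The mirror image of the forward procedure can be sketched the same way (again with illustrative helper names and data):

```python
# Backward stepwise selection (Algorithm 3): start from the full model and
# repeatedly drop the predictor whose removal increases the RSS the least.
import numpy as np

def rss(X, y):
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    r = y - Xc @ beta
    return float(r @ r)

def backward_stepwise(X, y):
    n, p = X.shape
    current = list(range(p))
    path = [list(current)]                      # M_p first
    while len(current) > 1:
        drop = min(current,
                   key=lambda j: rss(X[:, [i for i in current if i != j]], y))
        current = [i for i in current if i != drop]
        path.append(list(current))
    return path                                  # [M_p, ..., M_1]

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 4))
y = 5.0 * X[:, 2] + rng.normal(scale=0.1, size=80)
path = backward_stepwise(X, y)
print(path[-1])  # the last surviving predictor: [2]
```

Backward selection requires n > p so that the full model can be fit; forward stepwise has no such restriction.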


10 Subset Selection / Choosing the Optimal Model
The model containing all of the predictors will always have the smallest RSS and the largest R², since these quantities are related to the training error. Instead, we wish to choose a model with a low test error. The training error can be a poor estimate of the test error; therefore, RSS and R² are not suitable for selecting the best model among a collection of models with different numbers of predictors.
Two common approaches:
1. Indirectly estimate the test error by making an adjustment to the training error to account for the bias due to overfitting.
2. Directly estimate the test error, using either a validation set approach or a cross-validation approach.

11 Subset Selection / Choosing the Optimal Model: Indirect Estimates of Test Error
C_p statistic: an unbiased estimate of test MSE. For a fitted least squares model containing d predictors, the C_p estimate of test MSE is given by

C_p = (1/n)(RSS + 2dσ̂²)    (1)

where σ̂² is an estimate of the variance of the error associated with each response measurement.
Akaike information criterion (AIC):

AIC = (1/(nσ̂²))(RSS + 2dσ̂²)    (2)

Bayesian information criterion (BIC):

BIC = (1/n)(RSS + log(n)dσ̂²)    (3)

Adjusted R²:

Adjusted R² = 1 - [RSS/(n - d - 1)] / [TSS/(n - 1)]    (4)



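Formulas (1)-(4) are straightforward to compute for any fitted subset; the sketch below estimates σ̂² from the full model, as is common practice (function and variable names are illustrative).

```python
# C_p, AIC, BIC, and adjusted R^2 for a least squares fit with d predictors,
# following equations (1)-(4) on the slide.
import numpy as np

def criteria(X, y, cols, sigma2_hat):
    n = len(y)
    Xc = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    rss_val = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    d = len(cols)
    cp = (rss_val + 2 * d * sigma2_hat) / n                      # (1)
    aic = (rss_val + 2 * d * sigma2_hat) / (n * sigma2_hat)      # (2)
    bic = (rss_val + np.log(n) * d * sigma2_hat) / n             # (3)
    adj_r2 = 1 - (rss_val / (n - d - 1)) / (tss / (n - 1))       # (4)
    return cp, aic, bic, adj_r2

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=60)

# Estimate sigma^2 from the residuals of the full model.
Xc_full = np.column_stack([np.ones(60), X])
b, *_ = np.linalg.lstsq(Xc_full, y, rcond=None)
sigma2 = float(((y - Xc_full @ b) ** 2).sum()) / (60 - 3 - 1)

cp0, _, _, ar0 = criteria(X, y, [0], sigma2)  # the true predictor
cp1, _, _, ar1 = criteria(X, y, [1], sigma2)  # a noise predictor
print(cp0 < cp1, ar0 > ar1)  # True True: the true predictor wins on both
```

Note the opposite directions: C_p, AIC, and BIC are minimized, while adjusted R² is maximized.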

15 Subset Selection / Choosing the Optimal Model: Best Subset Selection
For least squares models, C_p and AIC are proportional to each other, so only C_p is displayed. C_p and BIC are estimates of test MSE. BIC shows an increase after four variables are selected; the other two plots are rather flat after four variables are included.
[Figure: C_p, BIC, and adjusted R² versus number of predictors]

16 Subset Selection / Choosing the Optimal Model: Validation and Cross-Validation
Directly estimating the test error makes fewer assumptions about the true underlying model. Cross-validation is a very attractive approach for selecting from among a number of models under consideration. Results are shown for d ranging from 1 to 11; the overall best model is shown as a blue cross.
[Figure: square root of BIC, validation set error, and cross-validation error versus number of predictors]

17 Shrinkage Methods
Subset selection methods involve using least squares to fit a linear model that contains a subset of the predictors. As an alternative, shrinkage methods fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. The estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance. Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero; hence, shrinkage methods can also perform variable selection.


19 Shrinkage Methods / Ridge Regression
Standard linear model:

Y = β_0 + β_1 X_1 + ... + β_p X_p + ε    (5)

Least squares:

RSS = Σ_{i=1}^n (y_i - β_0 - Σ_{j=1}^p β_j x_ij)²    (6)

Ridge regression objective function:

Σ_{i=1}^n (y_i - β_0 - Σ_{j=1}^p β_j x_ij)² + λ Σ_{j=1}^p β_j² = RSS + λ Σ_{j=1}^p β_j²    (7)

where λ ≥ 0 is a tuning parameter, to be determined separately. The second term is called a shrinkage penalty; it is small when the β_j are close to zero. Ridge regression will produce a different set of coefficient estimates, β̂^R_λ, for each value of λ.


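The minimizer of (7) has a closed form, β̂^R_λ = (XᵀX + λI)⁻¹Xᵀy. A short sketch with made-up data (predictors standardized and the intercept left unpenalized, as is conventional):

```python
# Ridge regression via its closed form. Larger lambda => more shrinkage.
import numpy as np

def ridge_fit(X, y, lam):
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize predictors
    yc = y - y.mean()                           # center y: intercept unpenalized
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] + rng.normal(size=100)

b_small = ridge_fit(X, y, lam=0.01)   # nearly the least squares fit
b_large = ridge_fit(X, y, lam=1e4)    # heavily shrunken towards zero
print(np.abs(b_large).max() < np.abs(b_small).max())  # True
```

Standardizing first matters because the ridge penalty is not scale-invariant: it penalizes all coefficients on the same footing only if the predictors share a common scale.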

22 Shrinkage Methods / Ridge Regression: Why Does Ridge Regression Improve Over Least Squares?
Ridge regression's advantage over least squares is rooted in the bias-variance trade-off: as λ increases, the flexibility of the ridge fit decreases, leading to decreased variance but increased bias.
[Figure: squared bias (black), variance (green), and test mean squared error (purple) for ridge regression predictions on a simulated data set, as functions of λ and of ‖β̂^R_λ‖₂/‖β̂‖₂. The horizontal dashed lines indicate the minimum possible MSE; the purple crosses indicate the ridge models with smallest MSE]


24 Shrinkage Methods / Lasso
Motivation: a disadvantage of ridge regression. Ridge regression includes all p predictors in the final model: the penalty will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless λ = ∞). This may not be a problem for prediction accuracy, but it can create a challenge in model interpretation in settings in which the number of variables p is quite large.
Lasso objective function:

Σ_{i=1}^n (y_i - β_0 - Σ_{j=1}^p β_j x_ij)² + λ Σ_{j=1}^p |β_j| = RSS + λ Σ_{j=1}^p |β_j|    (8)

where λ Σ_j |β_j| is the lasso penalty (an ℓ1 norm instead of the ℓ2 norm used in ridge regression). The lasso has the effect of forcing some of the coefficient estimates to be exactly equal to zero (variable selection) when the tuning parameter λ is sufficiently large.

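Objective (8) has no closed form, but it can be minimized by cyclic coordinate descent with soft-thresholding, the idea behind standard lasso solvers. A compact illustrative sketch (data, λ, and the fixed iteration count are all made up; no intercept, roughly standardized columns assumed):

```python
# Lasso via coordinate descent: each coordinate update is an exact
# one-dimensional minimization, solved by soft-thresholding.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]          # partial residual
            z = X[:, j] @ r
            beta[j] = soft_threshold(z, lam / 2) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = 4.0 * X[:, 0] + rng.normal(scale=0.5, size=100)
beta = lasso_cd(X, y, lam=50.0)
print((np.abs(beta) < 1e-8).sum() >= 3)  # True: irrelevant coefficients hit exactly 0
```

The exact zeros, which ridge regression can never produce, are what make the lasso a variable-selection method.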

26 Shrinkage Methods / Lasso: Comparison of the Lasso with Ridge Regression
The standardized lasso coefficients are illustrated as functions of λ and of ‖β̂^L_λ‖₁/‖β̂‖₁. When λ = 0, the lasso simply gives the least squares fit; when λ becomes sufficiently large, the lasso gives the null model, in which all coefficient estimates equal zero.
[Figure: standardized lasso coefficient paths for the Income, Limit, Rating, and Student variables]

27 Shrinkage Methods / Lasso: Example
A simple special case with n = p and X the identity matrix. Least squares estimate:

β̂_j = argmin_β Σ_{j=1}^p (y_j - β_j)² = y_j

Ridge regression:

β̂^R_j = argmin_β Σ_{j=1}^p (y_j - β_j)² + λ Σ_{j=1}^p β_j² = y_j / (1 + λ)

Lasso:

β̂^L_j = argmin_β Σ_{j=1}^p (y_j - β_j)² + λ Σ_{j=1}^p |β_j|, which gives the soft-thresholding rule

β̂^L_j = y_j - λ/2 if y_j > λ/2;  y_j + λ/2 if y_j < -λ/2;  0 if |y_j| ≤ λ/2


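The two closed forms of this special case can be checked numerically in a few lines (λ and y chosen arbitrarily for illustration):

```python
# n = p, X = I special case: ridge scales every y_j by 1/(1 + lambda),
# while the lasso soft-thresholds each y_j by lambda/2.
import numpy as np

lam = 2.0
y = np.array([3.0, -0.5, 1.2, -4.0])

ridge = y / (1 + lam)                                       # proportional shrinkage
lasso = np.sign(y) * np.maximum(np.abs(y) - lam / 2, 0.0)   # soft-thresholding

# ridge shrinks every entry by the same factor 1/3;
# lasso sets the entry with |y_j| <= lambda/2 exactly to zero.
print(ridge)
print(lasso)
```

This is the picture behind the next slide: ridge shrinks proportionally, the lasso by a constant amount with a dead zone at the origin.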

30 Shrinkage Methods / Lasso: Comparison of the Lasso with Ridge Regression
Ridge regression more or less shrinks every dimension of the data by the same proportion; the lasso shrinks all coefficients toward zero by a similar amount, and sufficiently small coefficients are shrunken all the way to zero.
[Figure: ridge and lasso coefficient estimates as functions of y_j, compared with the least squares estimates]


32 Shrinkage Methods / Selecting the Tuning Parameter λ
Grid search:
1. Choose a grid of λ values, and compute the k-fold cross-validation error for each λ.
2. Select the tuning-parameter value for which the cross-validation error is smallest.
3. Re-fit the model using all of the available observations and the selected value of λ.
[Figure: cross-validation error and standardized coefficients as functions of λ]
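The three steps above can be sketched with the ridge closed form; the grid, fold count, and data below are illustrative choices.

```python
# Select lambda by k-fold cross-validation over a grid, then refit on all data.
import numpy as np

def ridge_beta(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(X, y, lam, k=5):
    n = len(y)
    idx = np.arange(n)
    err = 0.0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        beta = ridge_beta(X[train], y[train], lam)
        err += float(((y[fold] - X[fold] @ beta) ** 2).sum())
    return err / n

rng = np.random.default_rng(6)
X = rng.normal(size=(60, 10))
y = X @ rng.normal(size=10) + rng.normal(size=60)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]                    # step 1: grid of lambdas
errors = {lam: cv_error(X, y, lam) for lam in grid}
best_lam = min(errors, key=errors.get)                  # step 2: smallest CV error
beta_final = ridge_beta(X, y, best_lam)                 # step 3: refit on all data
```

The same loop works for the lasso by swapping the fitting routine; only the inner solver changes.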

33 Shrinkage Methods / Selecting the Tuning Parameter λ
Comparison of lasso coefficients with the least squares estimates. Left: ten-fold cross-validation MSE for the lasso. Right: the corresponding lasso coefficient estimates. Vertical dashed lines indicate where the lasso cross-validation error is smallest.
[Figure: cross-validation error and standardized coefficients as functions of ‖β̂^L_λ‖₁/‖β̂‖₁]

34 Dimension Reduction Methods
Motivation: both subset selection and shrinkage methods control variance, either by using a subset of the original variables or by shrinking their coefficients toward zero. All of these methods are defined using the original predictors, X_1, X_2, ..., X_p. Dimension reduction methods work in two steps:
1. The transformed predictors (reduced dimensions) are obtained; the transformation can be chosen in different ways, e.g. principal components regression and partial least squares.
2. The model is fit using these transformed predictors.
These methods will be introduced in Unsupervised Learning.


36 Example 1: Multi-Model Ensemble for Freeway Traffic State Estimations
Reference: Li, L., Chen, X., and Zhang, L. (2014). Multimodel ensemble for traffic state estimations. IEEE Transactions on Intelligent Transportation Systems, 15(3).
Background and research highlights:
Freeway traffic state estimation is a vital component of traffic management and information systems. The inherent randomness of traffic flow and uncertainties in the initial conditions of models, in model parameters, and in model structures all influence traffic state estimations. The paper presents an ensemble learning framework to appropriately combine estimation results from multiple macroscopic traffic flow models, and discusses three weighting algorithms: least squares regression, ridge regression, and lasso. A field test indicates that the lasso ensemble best handles various uncertainties and improves estimation accuracy significantly.
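The ensemble idea, combining the estimates of several imperfect models with regression weights, can be sketched as follows. This is a hedged stand-in in the spirit of the paper's Algorithm II (ridge weighting), with synthetic candidate models, not the paper's actual setup.

```python
# Combine M candidate model estimates with ridge-regression weights.
import numpy as np

rng = np.random.default_rng(7)
T = 200
truth = np.sin(np.linspace(0, 6, T)) * 40 + 60           # synthetic "true" density

# Three imperfect candidate models: biased, noisy, and time-lagged estimates.
est = np.column_stack([
    truth + 5.0,                                         # constant bias
    truth + rng.normal(scale=8.0, size=T),               # heavy noise
    np.roll(truth, 3),                                   # time lag
])

lam = 1.0
M = est.shape[1]
w = np.linalg.solve(est.T @ est + lam * np.eye(M), est.T @ truth)
combined = est @ w

def rmse(a, b):
    return float(np.sqrt(((a - b) ** 2).mean()))

print(rmse(combined, truth) < rmse(est[:, 1], truth))  # True
```

In the paper, the regression target is the available measurements rather than the unknown truth, and the lasso variant additionally zeroes out the weights of unhelpful candidate models.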

37 Popular Traffic State Models
Cell Transmission Model (CTM): CTM is a direct discretization of the first-order LWR model. It was proposed by Daganzo (1994) using the Godunov scheme (Lebacque, 1996) and is now widely used.
Papageorgiou et al.'s model: a second-order macroscopic traffic flow model was chosen in Papageorgiou et al. (1990), Wang and Papageorgiou (2005), and Wang et al. (2006, 2007, 2009). We can update estimations of traffic flow states directly from the linearized system dynamic model; otherwise, we resort to unscented Kalman filtering (UKF), extended Kalman filtering (EKF), or particle filtering.

38 Testing Data I
Performance Measurement System (PeMS): loop detector data collected at three Vehicle Detection Stations (denoted by L1, L2, and L3) on Highway SR101 southbound, Hollywood, California.
[Figure: study site layout with cells 1-6 between De Soto and Winnetka, with the PeMS VDS locations L1, L2, and L3 marked by postmile]

39 Testing Data II
[Figure: mainline mean speed per lane at L1, L2, and L3, and on-ramp flow measurements (veh/h) for the upstream and downstream on-ramps, from 4 AM to 12 PM]

40 Evaluation Indices
Root mean square error (RMSE):

RMSE = sqrt( (1/T) Σ_{t=1}^T [x_i(t) - x̂_i(t)]² )    (9)

Normalized root mean squared error (NRMSE):

NRMSE = sqrt( Σ_{t=1}^T [x_i(t) - x̂_i(t)]² / Σ_{t=1}^T [x_i(t)]² ) × 100%    (10)

Symmetric mean absolute percentage error (SMAPE):

SMAPE1 = (1/T) Σ_{t=1}^T |x_i(t) - x̂_i(t)| / [x_i(t) + x̂_i(t)] × 100%    (11)

SMAPE2 = Σ_{t=1}^T |x_i(t) - x̂_i(t)| / Σ_{t=1}^T [x_i(t) + x̂_i(t)] × 100%    (12)
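The four indices (9)-(12) translate directly into code; x is the measured series and xh the estimate (toy values below are illustrative).

```python
# Evaluation indices (9)-(12) for comparing an estimated series to measurements.
import numpy as np

def rmse(x, xh):
    return float(np.sqrt(((x - xh) ** 2).mean()))                        # (9)

def nrmse(x, xh):
    return float(np.sqrt(((x - xh) ** 2).sum() / (x ** 2).sum())) * 100  # (10)

def smape1(x, xh):
    return float((np.abs(x - xh) / (x + xh)).mean()) * 100               # (11)

def smape2(x, xh):
    return float(np.abs(x - xh).sum() / (x + xh).sum()) * 100            # (12)

x = np.array([100.0, 120.0, 80.0])
xh = np.array([110.0, 115.0, 90.0])
print(rmse(x, xh), nrmse(x, xh), smape1(x, xh), smape2(x, xh))
```

SMAPE1 averages per-time-step ratios while SMAPE2 takes the ratio of sums, so SMAPE2 down-weights errors that occur at times with large totals.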

41 Comparison of Ensemble Algorithms: Part I
Experiment I: ensemble of 10 SMM models with various FD parameter combinations. Algorithm I: least-squares ensemble; Algorithm II: ridge regression ensemble; Algorithm III: lasso ensemble.
Estimation errors of ensemble SMM models using integrated weights Ŵ (Experiment I):

Variable    Error    Algorithm I   Algorithm II   Algorithm III
Density     RMSE
            NRMSE                  44.17%         42.16%
            SMAPE1   47.25%        16.94%         15.46%
            SMAPE2                 18.07%         16.56%
Flow rate   RMSE
            NRMSE    18.80%        24.11%         28.51%
            SMAPE1   8.46%         10.24%         12.07%
            SMAPE2   7.13%         9.57%          11.90%

42 Comparison of Ensemble Algorithms: Part II
Experiment II (linear models): ensemble of 20 SMM models with the following improvements: a small-scale zero-mean Gaussian random noise is added to the estimated state vector, and the estimation results and the available measurements of density and flow rate are separated.

43 Comparison of Ensemble Algorithms: Part III
Estimation errors of ensemble SMM models using separate weights Ŵρ and Ŵq (Experiment II):

Variable    Error    Algorithm I   Algorithm II   Algorithm III
Density     RMSE
            NRMSE    210.98%       87.22%         39.50%
            SMAPE1   30.29%                       15.28%
            SMAPE2                 40.66%         15.91%
Flow rate   RMSE
            NRMSE    22.00%        18.05%         18.89%
            SMAPE1   8.92%         7.85%          8.41%
            SMAPE2   8.18%         7.02%          7.97%

44 Comparison of Ensemble Algorithms: Part IV
Experiment III (nonlinear models): ensemble of 20 EKF models. Stochastic resampling: a small-scale Gaussian white noise vector was added to the estimated state vector after the EKF predict and update procedures at each time step.
Estimation errors of the single deterministic EKF model and the ensemble EKF models using separate weights Ŵρ and Ŵq (Experiment III):

Variable    Error    EKF       Algorithm II   Algorithm III
Density     RMSE
            NRMSE    43.68%    41.62%         38.96%
            SMAPE1                            13.68%
            SMAPE2   16.77%    11.72%         14.78%
Flow rate   RMSE
            NRMSE    23.93%    33.80%         33.73%
            SMAPE1   9.99%     12.97%         15.07%
            SMAPE2   9.38%     16.48%         15.04%

45 Comparison of Ensemble Algorithms: Part V
Experiment IV: ensemble of 10 candidate SMM models (deterministic models without initial-condition noise) and 10 candidate EKF models (stochastic models with initial-condition noise). Stochastic resampling: a small-scale Gaussian white noise vector was added to the estimated state vector after the EKF predict and update procedures at each time step.

46 Comparison of Ensemble Algorithms: Part VI
Estimation errors of ensemble EKF models using separate weights Ŵρ and Ŵq (Experiment IV):

Variable    Error    EKF       Algorithm II   Algorithm III
Density     RMSE
            NRMSE    43.68%    42.35%         36.48%
            SMAPE1                            13.31%
            SMAPE2   16.77%    16.63%         14.39%
Flow rate   RMSE
            NRMSE    23.93%    21.80%         21.28%
            SMAPE1   9.99%     9.29%          9.18%
            SMAPE2   9.38%     8.49%          8.61%

47 Influence of the Regularization Parameter λ
Sensitivity analysis: all experiments show that the lasso ensemble performs best in this case study. In practice, it is suggested to adaptively choose the best regularization scalar λ online.
Sensitivity analysis of the regularization coefficient in the lasso EKF ensemble model (Experiment III, Algorithm III):

Variable    Error    λ = 0.01λmax   λ = 0.1λmax   λ = 0.2λmax   λ = 0.5λmax
Density     RMSE
            NRMSE    47.69%         40.03%        50.66%        66.44%
            SMAPE1   17.55%         13.28%        15.98%        27.56%
            SMAPE2   18.83%         14.51%        19.65%        32.54%
Flow rate   RMSE
            NRMSE    30.31%         27.92%        46.71%        75.64%
            SMAPE1   12.34%         12.73%        23.99%        56.81%
            SMAPE2   11.27%         12.21%        25.24%        57.14%

48 Estimation Results of Experiments I-IV Using the Lasso Ensemble (Part 1)
[Figure: density (veh/km) and flow (veh/h) estimations for Experiments I and II at L1, L2, and L3, compared with measurements, from 4 AM to 12 PM]

49 Estimation Results of Experiments I-IV Using the Lasso Ensemble (Part 2)
[Figure: density (veh/km) and flow (veh/h) estimations for Experiments III and IV at L1, L2, and L3, compared with measurements, from 4 AM to 12 PM]

50 Example 2: Feature Selection for Prediction I
Reference: Yang, S. (2013). On feature selection for traffic congestion prediction. Transportation Research Part C: Emerging Technologies, 26.
Objectives: Traffic congestion prediction plays an important role in route guidance and traffic management; we formulate it as a binary classification problem. Through extensive experiments with real-world data, we found that a large number of sensors, usually over 100, are relevant to the prediction task at one sensor, which implies wide-area correlation and high dimensionality of the data. This paper investigates the feature selection problem for traffic congestion prediction. By applying feature selection, the data dimensionality can be reduced remarkably while the performance remains the same. Besides, a new traffic jam probability scoring method is proposed that decomposes the high-dimensional computation into many one-dimensional probabilities and their combination.
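The slides do not reproduce the paper's exact selector, but the underlying idea, scoring each candidate sensor's relevance to a target sensor's congestion label and keeping the top k, can be illustrated with a simple correlation filter. This is a generic stand-in with synthetic data, not the paper's method.

```python
# Filter-style feature selection: rank candidate sensor series by |correlation|
# with the target's binary congestion label, then keep the top-k features.
import numpy as np

def rank_features(X, y, k):
    """X: (T, sensors) volume history; y: (T,) binary congestion label."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    score = np.abs(Xc.T @ yc) / denom            # |Pearson correlation| per sensor
    return np.argsort(score)[::-1][:k]           # indices of the top-k sensors

rng = np.random.default_rng(8)
T = 500
signal = rng.normal(size=T)
y = (signal > 0.5).astype(float)                  # congestion label at the target
X = rng.normal(size=(T, 20))                      # 19 irrelevant sensors...
X[:, 7] = signal + rng.normal(scale=0.2, size=T)  # ...and one genuinely relevant one
print(7 in rank_features(X, y, k=3))  # True
```

The paper's experiments then sweep k and track prediction precision, which is exactly the precision-versus-feature-number analysis on the following slides.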

51 Example 2: Feature Selection for Prediction II
Data: Traffic Management Center of the Minnesota Department of Transportation; 30-s interval data from over 4000 loop detectors located around the Twin Cities Metro freeways, 7 days per week, from January 1 to September 22, 2010. Sensors containing missing values, weekends, and days with incomplete records are not taken into account. Finally, we have a data set of 156 days with the traffic volume summed per 10-min interval for 4584 sensors. We use the first 126 days for learning and the remaining 30 days for testing.

52 Example 2: Feature Selection for Prediction III
[Figure: mean prediction precision against feature number, for time lags of 10, 20, 60, and 300 min]

53 Example 2: Feature Selection for Prediction IV
Mean prediction precision and feature number at the turning point and the end point:

                          Time lag:  10 min    20 min    60 min    300 min
Turning point  Precision             56%       56.45%    56.17%    55.52%
               #Features
End point      Precision             60.13%    60.58%    59.89%    59.62%
               #Features             3386

54 Example 2: Feature Selection for Prediction V
[Figure: actual traffic volumes at sensors 1266 and 1473 with a time lag of 20 min, following the prediction with all features]

55 Example 2: Feature Selection for Prediction VI
[Figure: precision p(l) and recall rate r(l) for the two sensors. Sensor 1266: true jam number = 86, precision(top 86) = 91.86%. Sensor 1473: true jam number = 106, precision(top 106) = 65.09%]

Example 2: Feature Selection for Prediction VII

Figure: distribution over sensors of the precision of prediction based on all features (time lag = 20 min); number of sensors on the y-axis.

Example 2: Feature Selection for Prediction VIII

Figure: prediction precision against feature number for two sensors: Sensor 1266 (optimal feature number = 156) and Sensor 1473 (optimal feature number = 26).
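Picking the per-sensor "optimal feature number" amounts to scanning the precision curve and taking its argmax. A minimal sketch, with made-up precision values for illustration:

```python
import numpy as np

# Hypothetical precision achieved as features are added one at a time.
precision_by_k = np.array([0.40, 0.52, 0.58, 0.61, 0.60, 0.59, 0.57])

optimal_k = int(np.argmax(precision_by_k)) + 1  # feature counts start at 1
print(optimal_k, precision_by_k[optimal_k - 1])  # 4 0.61
```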

Example 2: Feature Selection for Prediction IX

Figure: distribution of the 888 sensors of interest over the maximum precision, for time lags of 10, 20, 60, and 300 min (number of sensors against maximum precision).

Example 2: Feature Selection for Prediction X

Figure: distribution of the 888 sensors of interest over the optimal number of features, for time lags of 10, 20, 60, and 300 min (number of sensors against optimal feature number).

Example 2: Feature Selection for Prediction XI

Table: comparison, over the 888 sensors of interest, of the mean precision (%) achieved with the optimal number of features against that with all features, at each time lag (10, 20, 60, 300 min); and distribution of the 888 sensors over the optimal number of features, binned as [0, 10), [10, 20), [20, 30), [30, 40), [40, 50), [50, 100), [100, 3386).

Example 3: Forecasting Urban Travel Times I

Reference: Haworth, J., Shawe-Taylor, J., Cheng, T. and Wang, J., 2014. Local online kernel ridge regression for forecasting of urban travel times. Transportation Research Part C: Emerging Technologies, 46.

Example 3: Forecasting Urban Travel Times II

Highlights:
- Local online kernel ridge regression (LOKRR) is developed for forecasting urban travel times.
- LOKRR accounts for the time-varying characteristics of traffic series through locally defined kernels.
- LOKRR outperforms ARIMA, Elman ANN, and SVR in forecasting travel times on London's road network.
- The model is based on regularised linear regression, and clear guidelines are given for parameter training.

Example 3: Forecasting Urban Travel Times III

Merits over the standard single-kernel approach:
- Parameters can vary by time of day, capturing the time-varying distribution of traffic data.
- Smaller kernels can be defined that contain only the relevant traffic patterns.
- The model is online, allowing new traffic data to be incorporated as they arrive.
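At its core, each of LOKRR's locally defined kernels solves a standard kernel ridge regression. A generic sketch of that solve, not the authors' implementation; the Gaussian kernel, gamma, and lambda values are assumptions:

```python
import numpy as np

def krr_fit_predict(X_train, y_train, X_test, gamma=1.0, lam=0.1):
    """Kernel ridge regression with a Gaussian (RBF) kernel:
    alpha = (K + lam*I)^-1 y;  f(x*) = k(x*, X_train) @ alpha."""
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    K = rbf(X_train, X_train)
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)
    return rbf(X_test, X_train) @ alpha

# Toy use: learn y = x1 + x2 from a few points.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X.sum(axis=1)
pred = krr_fit_predict(X, y, X[:5])
print(pred.shape)
```

In LOKRR, one such kernel matrix would be built per time-of-day window from the corresponding days' data, and updated online as new days arrive.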

Example 3: Forecasting Urban Travel Times IV

Data: Unit Travel Times (UTTs, seconds/metre) collected on London's road network as part of the London Congestion Analysis Project (LCAP), coordinated by Transport for London (TfL). LCAP travel times are observed using automatic number plate recognition (ANPR) technology. Individual vehicle travel times are aggregated at 5-min intervals to produce a regularly spaced time series with 288 observations per day; only data collected between 6 AM and 9 PM are used in the analysis (180 observations per day). In total there are 154 days of data, collected between January and July 2011. To test the models, the data are divided into three sets: a training set, a testing set, and a validation set of 80 days (52%), 37 days (24%), and 37 days (24%), respectively.

Example 3: Forecasting Urban Travel Times V

Figure: illustration of variability in traffic data; each line in the plot is the Unit Travel Time (UTT) profile recorded on a single link (link 1815) on a different day (5 days total).

Example 3: Forecasting Urban Travel Times VI

Figure: diagram of the training data construction.

Example 3: Forecasting Urban Travel Times VII

Observing travel times using ANPR: a vehicle passes camera l1 at time t1 and its number plate is read; it then traverses link l and passes camera l2 at time t2, where its number plate is read again. The two number plates are matched using inbuilt software, and the travel time is calculated as t2 - t1. Raw travel times are converted to UTTs by dividing by len(l), the length of link l.
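The observation step above reduces to a subtraction and a normalisation; as a sketch (the function and field names are illustrative):

```python
from datetime import datetime

def unit_travel_time(t1, t2, link_length_m):
    """UTT in seconds/metre from two matched ANPR reads on one link."""
    tt_seconds = (t2 - t1).total_seconds()  # raw travel time t2 - t1
    return tt_seconds / link_length_m       # divide by len(l)

utt = unit_travel_time(datetime(2011, 3, 1, 8, 0, 0),
                       datetime(2011, 3, 1, 8, 1, 30),
                       link_length_m=450.0)
print(utt)  # 0.2
```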

Example 3: Forecasting Urban Travel Times VIII

Table: the test links and their patch rates and frequency, with columns Link ID, % missing, average frequency, and length (m).

Example 3: Forecasting Urban Travel Times IX

Figure: location of the test links on the LCAP network (legend: test links, ANPR camera locations, LCAP network; ITN data, Crown Copyright; scale in miles).

Example 3: Forecasting Urban Travel Times X

Figure: time series of each of the test links over the first 10 weeks of the training period.

Example 3: Forecasting Urban Travel Times XI

Table: training errors at (a) the 15 and 30 min forecast horizons and (b) the 45 and 60 min horizons, reported per link as RMSE, NRMSE, MAPE, and MASE.
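The four error measures reported in these tables can be written down directly. A standard formulation as a sketch: NRMSE normalisation conventions vary (range, mean, or standard deviation), and range normalisation plus a one-step naive forecast for the MASE scaling are assumptions here, not details taken from the paper.

```python
import numpy as np

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def nrmse(y, yhat):
    # RMSE normalised by the range of the observations (one common convention).
    return rmse(y, yhat) / float(y.max() - y.min())

def mape(y, yhat):
    return float(np.mean(np.abs((y - yhat) / y)) * 100)

def mase(y, yhat, y_train):
    # Scale by the in-sample MAE of the one-step naive forecast.
    naive_mae = float(np.mean(np.abs(np.diff(y_train))))
    return float(np.mean(np.abs(y - yhat))) / naive_mae

y_train = np.array([10.0, 12.0, 11.0, 13.0])
y = np.array([10.0, 20.0])
yhat = np.array([12.0, 16.0])
print(rmse(y, yhat), mape(y, yhat), mase(y, yhat, y_train))
```

A MASE below 1 means the forecast beats the naive benchmark on average, which makes it convenient for comparing links with very different travel-time scales.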

Example 3: Forecasting Urban Travel Times XII

Table: fitted model parameters (r, k, w) per link at each forecast horizon (15, 30, 45, 60 min). Note: r values are quantiles of ||x - x0||^2; k values are multiples of k0.

Example 3: Forecasting Urban Travel Times XIII

Table: testing errors of the LOKRR model (RMSE, NRMSE, MAPE, MASE) at the 15 and 30 min horizons.

Example 3: Forecasting Urban Travel Times XIV

Table: testing errors of the LOKRR model (continued) at the 45 and 60 min horizons.

Example 3: Forecasting Urban Travel Times XV

Table: comparison with benchmark models: average training and testing errors (RMSE, NRMSE, MAPE) of LOKRR, SVR, ANN, and ARIMA at (a) 15 min, (b) 30 min, (c) 45 min, and (d) 60 min forecast horizons. Note: training errors are not shown for the ARIMA model due to the difference in the model fitting procedure.

Example 3: Forecasting Urban Travel Times XVI

Figure: RMSE of each of the models at (a) 15 min; (b) 30 min; (c) 45 min; and (d) 60 min.

Example 3: Forecasting Urban Travel Times XVII

Figure: MAPE of each of the models at (a) 15 min; (b) 30 min; (c) 45 min; and (d) 60 min.

Example 3: Forecasting Urban Travel Times XVIII

Figure: time series plots of the observed series (thick black line) against the forecast series at the 15 min interval on (a) Monday 6th June; (b) 7th June; (c) 8th June; (d) 9th June; (e) 10th June; (f) 11th June.

Example 3: Forecasting Urban Travel Times XIX

(Figure continued from the previous slide.)

Example 3: Forecasting Urban Travel Times XX

Figure: time series plots of the observed series (thick black line) against the forecast series at the 60 min interval on (a) Monday 6th June; (b) 7th June; (c) 8th June; (d) 9th June; (e) 10th June; (f) 11th June.

Example 3: Forecasting Urban Travel Times XXI

(Figure continued from the previous slide.)

Example 3: Forecasting Urban Travel Times XXII

Figure: comparison of LOKRR with benchmark models at the 15 min interval on link 442, (a) on a typical weekday and (b) during a non-recurrent congestion event.

Example 3: Forecasting Urban Travel Times XXIII

Figure (continued): comparison of LOKRR with benchmark models at the 15 min interval on link 442, (a) on a typical weekday and (b) during a non-recurrent congestion event.

Example 3: Forecasting Urban Travel Times XXIV

Figure: comparison of LOKRR with benchmark models at the 60 min interval on link 442, (a) on a typical weekday and (b) during a non-recurrent congestion event.


More information

New Introduction to Multiple Time Series Analysis

New Introduction to Multiple Time Series Analysis Helmut Lütkepohl New Introduction to Multiple Time Series Analysis With 49 Figures and 36 Tables Springer Contents 1 Introduction 1 1.1 Objectives of Analyzing Multiple Time Series 1 1.2 Some Basics 2

More information

Analysis Methods for Supersaturated Design: Some Comparisons

Analysis Methods for Supersaturated Design: Some Comparisons Journal of Data Science 1(2003), 249-260 Analysis Methods for Supersaturated Design: Some Comparisons Runze Li 1 and Dennis K. J. Lin 2 The Pennsylvania State University Abstract: Supersaturated designs

More information

Research Article Accurate Multisteps Traffic Flow Prediction Based on SVM

Research Article Accurate Multisteps Traffic Flow Prediction Based on SVM Mathematical Problems in Engineering Volume 2013, Article ID 418303, 8 pages http://dx.doi.org/10.1155/2013/418303 Research Article Accurate Multisteps Traffic Flow Prediction Based on SVM Zhang Mingheng,

More information

Relevance Vector Machines for Earthquake Response Spectra

Relevance Vector Machines for Earthquake Response Spectra 2012 2011 American American Transactions Transactions on on Engineering Engineering & Applied Applied Sciences Sciences. American Transactions on Engineering & Applied Sciences http://tuengr.com/ateas

More information

Empirical Study of Traffic Velocity Distribution and its Effect on VANETs Connectivity

Empirical Study of Traffic Velocity Distribution and its Effect on VANETs Connectivity Empirical Study of Traffic Velocity Distribution and its Effect on VANETs Connectivity Sherif M. Abuelenin Department of Electrical Engineering Faculty of Engineering, Port-Said University Port-Fouad,

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Linear Models for Regression

Linear Models for Regression Linear Models for Regression Machine Learning Torsten Möller Möller/Mori 1 Reading Chapter 3 of Pattern Recognition and Machine Learning by Bishop Chapter 3+5+6+7 of The Elements of Statistical Learning

More information

Logistic Regression with the Nonnegative Garrote

Logistic Regression with the Nonnegative Garrote Logistic Regression with the Nonnegative Garrote Enes Makalic Daniel F. Schmidt Centre for MEGA Epidemiology The University of Melbourne 24th Australasian Joint Conference on Artificial Intelligence 2011

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information

How the mean changes depends on the other variable. Plots can show what s happening...

How the mean changes depends on the other variable. Plots can show what s happening... Chapter 8 (continued) Section 8.2: Interaction models An interaction model includes one or several cross-product terms. Example: two predictors Y i = β 0 + β 1 x i1 + β 2 x i2 + β 12 x i1 x i2 + ɛ i. How

More information

Lecture 6: Methods for high-dimensional problems

Lecture 6: Methods for high-dimensional problems Lecture 6: Methods for high-dimensional problems Hector Corrada Bravo and Rafael A. Irizarry March, 2010 In this Section we will discuss methods where data lies on high-dimensional spaces. In particular,

More information

Iterative Selection Using Orthogonal Regression Techniques

Iterative Selection Using Orthogonal Regression Techniques Iterative Selection Using Orthogonal Regression Techniques Bradley Turnbull 1, Subhashis Ghosal 1 and Hao Helen Zhang 2 1 Department of Statistics, North Carolina State University, Raleigh, NC, USA 2 Department

More information

Graphical LASSO for local spatio-temporal neighbourhood selection

Graphical LASSO for local spatio-temporal neighbourhood selection Graphical LASSO for local spatio-temporal neighbourhood selection James Haworth 1,2, Tao Cheng 1 1 SpaceTimeLab, Department of Civil, Environmental and Geomatic Engineering, University College London,

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Business Statistics. Tommaso Proietti. Model Evaluation and Selection. DEF - Università di Roma 'Tor Vergata'

Business Statistics. Tommaso Proietti. Model Evaluation and Selection. DEF - Università di Roma 'Tor Vergata' Business Statistics Tommaso Proietti DEF - Università di Roma 'Tor Vergata' Model Evaluation and Selection Predictive Ability of a Model: Denition and Estimation We aim at achieving a balance between parsimony

More information

Chapter 3: Regression Methods for Trends

Chapter 3: Regression Methods for Trends Chapter 3: Regression Methods for Trends Time series exhibiting trends over time have a mean function that is some simple function (not necessarily constant) of time. The example random walk graph from

More information

Cross-Validation with Confidence

Cross-Validation with Confidence Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation

More information

... SPARROW. SPARse approximation Weighted regression. Pardis Noorzad. Department of Computer Engineering and IT Amirkabir University of Technology

... SPARROW. SPARse approximation Weighted regression. Pardis Noorzad. Department of Computer Engineering and IT Amirkabir University of Technology ..... SPARROW SPARse approximation Weighted regression Pardis Noorzad Department of Computer Engineering and IT Amirkabir University of Technology Université de Montréal March 12, 2012 SPARROW 1/47 .....

More information

Online Appendix for Price Discontinuities in an Online Market for Used Cars by Florian Englmaier, Arno Schmöller, and Till Stowasser

Online Appendix for Price Discontinuities in an Online Market for Used Cars by Florian Englmaier, Arno Schmöller, and Till Stowasser Online Appendix for Price Discontinuities in an Online Market for Used Cars by Florian Englmaier, Arno Schmöller, and Till Stowasser Online Appendix A contains additional tables and figures that complement

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Computational statistics

Computational statistics Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

More information

Regression I: Mean Squared Error and Measuring Quality of Fit

Regression I: Mean Squared Error and Measuring Quality of Fit Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving

More information

Iterative ARIMA-Multiple Support Vector Regression models for long term time series prediction

Iterative ARIMA-Multiple Support Vector Regression models for long term time series prediction and Machine Learning Bruges (Belgium), 23-25 April 24, i6doccom publ, ISBN 978-2874995-7 Available from http://wwwi6doccom/fr/livre/?gcoi=2843244 Iterative ARIMA-Multiple Support Vector Regression models

More information

Making Our Cities Safer: A Study In Neighbhorhood Crime Patterns

Making Our Cities Safer: A Study In Neighbhorhood Crime Patterns Making Our Cities Safer: A Study In Neighbhorhood Crime Patterns Aly Kane alykane@stanford.edu Ariel Sagalovsky asagalov@stanford.edu Abstract Equipped with an understanding of the factors that influence

More information

Applied Machine Learning Annalisa Marsico

Applied Machine Learning Annalisa Marsico Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature

More information

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression

More information

1 Spatio-temporal Variable Selection Based Support

1 Spatio-temporal Variable Selection Based Support Spatio-temporal Variable Selection Based Support Vector Regression for Urban Traffic Flow Prediction Yanyan Xu* Department of Automation, Shanghai Jiao Tong University & Key Laboratory of System Control

More information

Urban Link Travel Time Estimation Using Large-scale Taxi Data with Partial Information

Urban Link Travel Time Estimation Using Large-scale Taxi Data with Partial Information Urban Link Travel Time Estimation Using Large-scale Taxi Data with Partial Information * Satish V. Ukkusuri * * Civil Engineering, Purdue University 24/04/2014 Outline Introduction Study Region Link Travel

More information

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html

More information