Transportation Big Data Analytics


Transportation Big Data Analytics: Regularization
Xiqun (Michael) Chen
College of Civil Engineering and Architecture, Zhejiang University, Hangzhou, China
Fall 2016
http://mypage.zju.edu.cn/en/xiqun

Outline
1. Subset Selection: Best Subset Selection; Stepwise Selection; Choosing the Optimal Model
2. Shrinkage Methods: Ridge Regression; Lasso; Selecting the Tuning Parameter
3. Dimension Reduction Methods
4. Applications

Best Subset Selection

Fit a separate least squares regression for each possible combination of the p predictors. That is, we fit all p models that contain exactly one predictor, all p(p-1)/2 models that contain exactly two predictors, and so forth. We then look at all of the resulting models, with the goal of identifying the one that is best. The problem of selecting the best model from among the 2^p possibilities considered by best subset selection is not trivial.

Algorithm 1: Best subset selection
1. Let M_0 denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
2. For k = 1, 2, ..., p:
   (a) Fit all C(p,k) = p!/(k!(p-k)!) models that contain exactly k predictors.
   (b) Pick the best among these C(p,k) models, and call it M_k. Here "best" is defined as having the smallest RSS, or equivalently the largest R^2.
3. Select a single best model from among M_0, ..., M_p using cross-validated prediction error, C_p (AIC), BIC, or adjusted R^2.
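The enumeration in Algorithm 1 can be written down directly. Below is a minimal illustrative sketch in Python with NumPy and scikit-learn; the data matrix X and response y, and the function name, are assumptions for illustration.

    import numpy as np
    from itertools import combinations
    from sklearn.linear_model import LinearRegression

    def best_subset_selection(X, y):
        """Return the best model M_k (lowest RSS) for each subset size k."""
        n, p = X.shape
        best_per_size = {0: ((), np.sum((y - y.mean()) ** 2))}  # M_0: null model
        for k in range(1, p + 1):
            best_rss, best_vars = np.inf, None
            for subset in combinations(range(p), k):          # all C(p, k) models
                cols = list(subset)
                fit = LinearRegression().fit(X[:, cols], y)
                rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
                if rss < best_rss:
                    best_rss, best_vars = rss, subset
            best_per_size[k] = (best_vars, best_rss)
        return best_per_size  # step 3 (choosing k) uses C_p, AIC, BIC, or CV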

Best Subset Selection

For each possible model containing a subset of all the predictors, the red frontier tracks the best model for a given number of predictors, according to RSS and R^2. The model improves as the number of variables increases; however, from the three-variable model on, there is little improvement in RSS and R^2.

[Figure: RSS and R^2 of all candidate subsets versus number of predictors (1-10), with the best-per-size frontier highlighted.]

Forward Stepwise Selection

Forward stepwise selection is a computationally efficient alternative to best subset selection. While the best subset selection procedure considers all 2^p possible models containing subsets of the p predictors, forward stepwise considers a much smaller set of models. Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one at a time, until all of the predictors are in the model.

Algorithm 2: Forward stepwise selection
1. Let M_0 denote the null model, which contains no predictors.
2. For k = 0, 1, ..., p-1:
   (a) Consider all p-k models that augment the predictors in M_k with one additional predictor.
   (b) Choose the best among these p-k models, and call it M_{k+1}. Here "best" is defined as having the smallest RSS or highest R^2.
3. Select a single best model from among M_0, ..., M_p using cross-validated prediction error, C_p (AIC), BIC, or adjusted R^2.
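Step 2 only ever fits 1 + p(p+1)/2 models instead of 2^p. A hedged sketch of the greedy loop (again Python with scikit-learn; X and y are assumed inputs):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def forward_stepwise(X, y):
        """Greedily grow the active set; return the chosen order of predictors."""
        n, p = X.shape
        active, remaining, path = [], list(range(p)), []
        for _ in range(p):
            best_rss, best_j = np.inf, None
            for j in remaining:                      # p - k candidate additions
                cols = active + [j]
                fit = LinearRegression().fit(X[:, cols], y)
                rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
                if rss < best_rss:
                    best_rss, best_j = rss, j
            active.append(best_j)
            remaining.remove(best_j)
            path.append((list(active), best_rss))    # M_1, ..., M_p
        return path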

Backward Stepwise Selection

Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection. Unlike forward stepwise selection, it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one at a time.

Algorithm 3: Backward stepwise selection
1. Let M_p denote the full model, which contains all p predictors.
2. For k = p, p-1, ..., 1:
   (a) Consider all k models that contain all but one of the predictors in M_k, for a total of k-1 predictors.
   (b) Choose the best among these k models, and call it M_{k-1}. Here "best" is defined as having the smallest RSS or highest R^2.
3. Select a single best model from among M_0, ..., M_p using cross-validated prediction error, C_p (AIC), BIC, or adjusted R^2.

Choosing the Optimal Model

The model containing all of the predictors will always have the smallest RSS and the largest R^2, since these quantities are related to the training error. Instead, we wish to choose a model with a low test error. The training error can be a poor estimate of the test error. Therefore, RSS and R^2 are not suitable for selecting the best model among a collection of models with different numbers of predictors.

Two common approaches:
1. Indirectly estimate the test error by making an adjustment to the training error to account for the bias due to overfitting.
2. Directly estimate the test error, using either a validation set approach or a cross-validation approach.

Indirect Estimates of Test Error

C_p statistic: an unbiased estimate of test MSE. For a fitted least squares model containing d predictors, the C_p estimate of test MSE is given by

    C_p = \frac{1}{n}\left(\mathrm{RSS} + 2 d \hat{\sigma}^2\right)    (1)

where \hat{\sigma}^2 is an estimate of the variance of the error associated with each response measurement.

Akaike Information Criterion (AIC):

    \mathrm{AIC} = \frac{1}{n \hat{\sigma}^2}\left(\mathrm{RSS} + 2 d \hat{\sigma}^2\right)    (2)

Bayesian Information Criterion (BIC):

    \mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\, d \hat{\sigma}^2\right)    (3)

Adjusted R^2:

    \text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}    (4)
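Equations (1)-(4) are straightforward to compute once a model has been fitted. A small illustrative helper (assumed names; RSS and TSS are taken as already computed, and sigma2_hat is an estimate of the error variance, e.g. from the full model):

    import numpy as np

    def selection_criteria(rss, tss, n, d, sigma2_hat):
        """C_p, AIC, BIC and adjusted R^2 for a least squares fit with d predictors."""
        cp  = (rss + 2 * d * sigma2_hat) / n                      # Eq. (1)
        aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)       # Eq. (2)
        bic = (rss + np.log(n) * d * sigma2_hat) / n              # Eq. (3)
        adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))        # Eq. (4)
        return {"Cp": cp, "AIC": aic, "BIC": bic, "AdjR2": adj_r2}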


Best Subset Selection: C_p, BIC, and Adjusted R^2

For least squares models, C_p and AIC are proportional to each other, so only C_p is displayed. C_p and BIC are estimates of test MSE. BIC shows an increase after four variables are selected; the other two plots are rather flat after four variables are included.

[Figure: C_p, BIC, and adjusted R^2 versus number of predictors (1-10).]

Validation and Cross-Validation

These give a direct estimate of the test error, with fewer assumptions about the true underlying model. Cross-validation is a very attractive approach for selecting from among a number of models under consideration. For d ranging from 1 to 11, the overall best model is shown as a blue cross.

[Figure: square root of BIC, validation set error, and cross-validation error versus number of predictors (1-10).]
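One way to implement the direct approach is to score each candidate subset size with k-fold cross-validation and keep the size with the lowest error. A sketch (Python/scikit-learn; `models` is assumed to be the output of the best-subset or stepwise routines above):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    def choose_size_by_cv(X, y, models, n_folds=10):
        """models: dict {k: (column indices, rss)}; return k with lowest CV MSE."""
        cv_mse = {}
        for k, (cols, _) in models.items():
            if k == 0:
                cv_mse[k] = np.mean((y - y.mean()) ** 2)   # null model baseline
                continue
            scores = cross_val_score(LinearRegression(), X[:, list(cols)], y,
                                     scoring="neg_mean_squared_error", cv=n_folds)
            cv_mse[k] = -scores.mean()
        return min(cv_mse, key=cv_mse.get), cv_mse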

Shrinkage Methods

Subset selection methods involve using least squares to fit a linear model that contains a subset of the predictors. As an alternative, shrinkage can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. The estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance. Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero; hence, shrinkage methods can also perform variable selection.

Ridge Regression

Standard linear model:

    Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon    (5)

Least squares minimizes

    \mathrm{RSS} = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2    (6)

Ridge regression objective function:

    \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p}\beta_j^2    (7)

where \lambda \ge 0 is a tuning parameter, to be determined separately. The second term is called a shrinkage penalty; it is small when the \beta_j are close to zero. Ridge regression will produce a different set of coefficient estimates, \hat{\beta}^R_\lambda, for each value of \lambda.
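For centered, standardized predictors, the minimizer of (7) has the closed form \hat{\beta}^R_\lambda = (X^T X + \lambda I)^{-1} X^T y. A minimal sketch of both the closed form and the scikit-learn equivalent (X and y are assumed standardized/centered; the intercept is handled separately and is not penalized):

    import numpy as np
    from sklearn.linear_model import Ridge

    def ridge_closed_form(X, y, lam):
        """Ridge coefficients (X^T X + lambda I)^{-1} X^T y for standardized X, centered y."""
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    # Equivalent with scikit-learn (alpha plays the role of lambda here):
    # beta_hat = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_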


Why Does Ridge Regression Improve Over Least Squares?

Ridge regression's advantage over least squares is rooted in the bias-variance trade-off. As \lambda increases, the flexibility decreases, leading to decreased variance but increased bias.

[Figure: squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions on a simulated data set, as a function of \lambda (left) and of \|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2 (right). The horizontal dashed lines indicate the minimum possible MSE; the purple crosses indicate the ridge models with the smallest MSE.]

Lasso

Motivation: a disadvantage of ridge regression is that it includes all p predictors in the final model. The penalty will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless \lambda = \infty). This may not be a problem for prediction accuracy, but it can create a challenge in model interpretation in settings in which the number of variables p is quite large.

Lasso objective function:

    \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}|\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p}|\beta_j|    (8)

where \lambda \sum_{j=1}^{p}|\beta_j| is the lasso penalty (an l_1 norm instead of the l_2 norm used in ridge regression). The lasso has the effect of forcing some of the coefficient estimates to be exactly equal to zero (variable selection) when the tuning parameter \lambda is sufficiently large.
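A hedged sketch of fitting (8) with scikit-learn and reading off the selected variables; note that sklearn's Lasso minimizes (1/(2n))·RSS + alpha·||beta||_1, so alpha is a rescaled version of the \lambda above, and the function name is an assumption:

    import numpy as np
    from sklearn.linear_model import Lasso

    def fit_lasso(X, y, alpha):
        """Fit the lasso and return coefficients plus indices of nonzero (selected) ones."""
        model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
        selected = np.flatnonzero(model.coef_)      # variables kept in the model
        return model.coef_, selected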

Comparison of the Lasso with Ridge Regression

The standardized lasso coefficients are illustrated as a function of \lambda and of \|\hat{\beta}^L_\lambda\|_1 / \|\hat{\beta}\|_1. When \lambda = 0, the lasso simply gives the least squares fit; when \lambda becomes sufficiently large, the lasso gives the null model (all coefficients equal zero).

[Figure: standardized lasso coefficient paths (Income, Limit, Rating, Student) versus \lambda (left) and versus \|\hat{\beta}^L_\lambda\|_1 / \|\hat{\beta}\|_1 (right).]

Example

A simple special case: n = p and X is the identity matrix. The least squares estimate is

    \hat{\beta}_j = \arg\min_{\beta} \sum_{j=1}^{p}(y_j - \beta_j)^2 = y_j

Ridge regression:

    \hat{\beta}^R_j = \arg\min_{\beta} \left\{ \sum_{j=1}^{p}(y_j - \beta_j)^2 + \lambda \sum_{j=1}^{p}\beta_j^2 \right\} = y_j / (1 + \lambda)

Lasso (soft thresholding):

    \hat{\beta}^L_j = \arg\min_{\beta} \left\{ \sum_{j=1}^{p}(y_j - \beta_j)^2 + \lambda \sum_{j=1}^{p}|\beta_j| \right\}
                    = \begin{cases} y_j - \lambda/2 & \text{if } y_j > \lambda/2 \\ y_j + \lambda/2 & \text{if } y_j < -\lambda/2 \\ 0 & \text{if } |y_j| \le \lambda/2 \end{cases}
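These closed forms make the qualitative difference between the two penalties easy to check numerically. A small sketch comparing the two shrinkage rules on a vector of least squares estimates y (values chosen arbitrarily for illustration):

    import numpy as np

    def ridge_shrink(y, lam):
        """Orthonormal case: ridge scales every coefficient by the same factor."""
        return y / (1.0 + lam)

    def lasso_shrink(y, lam):
        """Orthonormal case: lasso soft-thresholds, zeroing small coefficients."""
        return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

    y = np.array([2.0, 0.6, -0.3, -1.5])
    print(ridge_shrink(y, lam=1.0))   # all entries shrunk proportionally
    print(lasso_shrink(y, lam=1.0))   # entries with |y_j| <= 0.5 mapped exactly to zero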


Comparison of the Lasso with Ridge Regression

Ridge regression more or less shrinks every dimension of the data by the same proportion; the lasso more or less shrinks all coefficients toward zero by a similar amount, and sufficiently small coefficients are shrunken all the way to zero.

[Figure: ridge and lasso coefficient estimates versus y_j in the special case above, compared with the least squares estimates.]

Selecting the Tuning Parameter λ

Grid search:
1. Choose a grid of \lambda values, and compute the k-fold cross-validation error for each \lambda.
2. Select the tuning parameter value for which the cross-validation error is smallest.
3. Re-fit the model using all of the available observations and the selected value of \lambda.

[Figure: cross-validation error and standardized ridge coefficients versus \lambda.]
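This grid search is what RidgeCV and LassoCV automate in scikit-learn; a manual version (assumed helper name, Python) makes the three steps explicit:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    def select_lambda_ridge(X, y, lambdas=np.logspace(-3, 3, 50), n_folds=10):
        """Steps 1-2: CV error over a lambda grid; step 3: refit on all data."""
        cv_err = [-cross_val_score(Ridge(alpha=lam), X, y,
                                   scoring="neg_mean_squared_error", cv=n_folds).mean()
                  for lam in lambdas]
        best_lam = lambdas[int(np.argmin(cv_err))]
        final_model = Ridge(alpha=best_lam).fit(X, y)
        return best_lam, final_model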

Selecting the Tuning Parameter λ

Comparison of lasso coefficients with the least squares estimates. Left: ten-fold cross-validation MSE for the lasso. Right: the corresponding lasso coefficient estimates. Vertical dashed lines indicate where the lasso cross-validation error is smallest.

[Figure: cross-validation error and standardized lasso coefficients versus \|\hat{\beta}^L_\lambda\|_1 / \|\hat{\beta}\|_1.]

Dimension Reduction Methods

Motivation: both subset selection and shrinkage methods control variance by using a subset of the original variables, or by shrinking their coefficients toward zero. All of these methods are defined using the original predictors X_1, X_2, ..., X_p.

All dimension reduction methods work in two steps (see the sketch below):
1. The transformed predictors (reduced dimensions) are obtained; the transformation can be chosen in different ways, e.g. principal components regression and partial least squares.
2. The model is fit using these transformed predictors.

Details will be introduced in the Unsupervised Learning lecture.
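As an illustration of the two-step recipe, a minimal principal components regression sketch (Python/scikit-learn; the number of components n_comp is an assumed tuning parameter, normally chosen by cross-validation):

    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    def pcr(X, y, n_comp=3):
        """Step 1: derive transformed predictors (principal components).
           Step 2: fit least squares on those components."""
        model = make_pipeline(StandardScaler(), PCA(n_components=n_comp), LinearRegression())
        return model.fit(X, y)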

Example 1: Multi-Model Ensemble for Freeway Traffic State Estimation

Reference: Li, L., Chen, X., and Zhang, L. (2014). Multimodel ensemble for traffic state estimations. IEEE Transactions on Intelligent Transportation Systems, 15(3), 1323-1336.

Background and research highlights: Freeway traffic state estimation is a vital component of traffic management and information systems. The inherent randomness of traffic flow and uncertainties in the initial conditions of models, model parameters, and model structures all influence traffic state estimation. The paper presents an ensemble learning framework to appropriately combine estimation results from multiple macroscopic traffic flow models, and discusses three weighting algorithms: least squares regression, ridge regression, and lasso. A field test indicates that the lasso ensemble best handles the various uncertainties and improves estimation accuracy significantly.
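To make the weighting idea concrete, the sketch below regresses observed states on the estimates produced by several candidate models and uses the lasso coefficients as ensemble weights. This is a hedged illustration of the general idea, not the paper's exact formulation; the array names and the use of scikit-learn are assumptions.

    import numpy as np
    from sklearn.linear_model import Lasso

    def lasso_ensemble_weights(estimates, measurements, alpha=0.1):
        """estimates: (T, M) array, column m = state estimates from candidate model m.
           measurements: (T,) array of observed states.
           Returns sparse ensemble weights and the combined estimate."""
        lasso = Lasso(alpha=alpha, max_iter=10000).fit(estimates, measurements)
        weights = lasso.coef_                     # many weights driven to zero
        combined = estimates @ weights + lasso.intercept_
        return weights, combined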

Popular Traffic State Models

Cell Transmission Model (CTM): CTM is a direct discretization of the first-order LWR model. It was proposed by Daganzo (1994) using the Godunov scheme (Lebacque, 1996) and is now widely used.

Papageorgiou et al.'s model: a second-order macroscopic traffic flow model was chosen in Papageorgiou et al. (1990), Wang and Papageorgiou (2005), and Wang et al. (2006, 2007, 2009). We can update estimations of traffic flow states directly from the linearized system dynamic model; otherwise, we resort to unscented Kalman filtering (UKF), extended Kalman filtering (EKF), or particle filtering.

Testing Data I

Performance Measurement System (PeMS): loop detector data collected at Vehicle Detection Stations (VDS) 717518, 717516, and 772610 (denoted L1, L2, and L3) on Highway SR101 southbound, Hollywood, California.

[Figure: test site layout. L1 = VDS 717518 (PM 25.56 mile), L2 = VDS 717516 (PM 24.55 mile), L3 = VDS 772610 (PM 24.03 mile); the segment is divided into six cells (near the De Soto and Winnetka on-ramps) with lengths of 0.15, 0.60, 0.60, 0.40, 0.40, and 0.44 km.]

Testing Data II

[Figure: mainline mean speed per lane at L1, L2, and L3, and on-ramp inflow rates (upstream and downstream on-ramps), 4 AM to 12 PM.]

Evaluation Indices

Root mean square error (RMSE):

    \mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left[x_i(t) - \hat{x}_i(t)\right]^2}    (9)

Normalized root mean squared error (NRMSE):

    \mathrm{NRMSE} = \sqrt{\frac{\sum_{t=1}^{T}\left[x_i(t) - \hat{x}_i(t)\right]^2}{\sum_{t=1}^{T} x_i(t)^2}} \times 100\%    (10)

Symmetric mean absolute percentage error (SMAPE):

    \mathrm{SMAPE1} = \frac{1}{T}\sum_{t=1}^{T}\frac{\left|x_i(t) - \hat{x}_i(t)\right|}{x_i(t) + \hat{x}_i(t)} \times 100\%    (11)

    \mathrm{SMAPE2} = \frac{\sum_{t=1}^{T}\left|x_i(t) - \hat{x}_i(t)\right|}{\sum_{t=1}^{T}\left[x_i(t) + \hat{x}_i(t)\right]} \times 100\%    (12)
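The four indices in (9)-(12) can be computed directly from the observed and estimated series; a small NumPy helper (function and variable names are assumed):

    import numpy as np

    def evaluation_indices(x, x_hat):
        """x, x_hat: arrays of observed and estimated states over T time steps."""
        err = x - x_hat
        rmse   = np.sqrt(np.mean(err ** 2))                               # Eq. (9)
        nrmse  = np.sqrt(np.sum(err ** 2) / np.sum(x ** 2)) * 100         # Eq. (10)
        smape1 = np.mean(np.abs(err) / (x + x_hat)) * 100                 # Eq. (11)
        smape2 = np.sum(np.abs(err)) / np.sum(x + x_hat) * 100            # Eq. (12)
        return {"RMSE": rmse, "NRMSE": nrmse, "SMAPE1": smape1, "SMAPE2": smape2}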

Comparison of Ensemble Algorithms: Part I

Experiment I: ensemble of 10 SMM models with various fundamental diagram (FD) parameter combinations.
Algorithm I: least squares ensemble; Algorithm II: ridge regression ensemble; Algorithm III: lasso ensemble.

Estimation errors of ensemble SMM models using integrated weights Ŵ (Experiment I):

    Variable   Error    Algorithm I   Algorithm II   Algorithm III
    Density    RMSE     350.05        8.71           8.32
               NRMSE    1774.11%      44.17%         42.16%
               SMAPE1   -47.25%       16.94%         15.46%
               SMAPE2   -160.99%      18.07%         16.56%
    Flow rate  RMSE     234.63        300.94         355.91
               NRMSE    18.80%        24.11%         28.51%
               SMAPE1   8.46%         10.24%         12.07%
               SMAPE2   7.13%         9.57%          11.90%

Comparison of Ensemble Algorithms: Part II

Experiment II (linear models): ensemble of 20 SMM models with the following improvements:
- A small-scale zero-mean Gaussian random noise is added to the estimated state vector;
- The estimation results and the available measurements of density and flow rate are separated.

Comparison of Ensemble Algorithms: Part III

Estimation errors of ensemble SMM models using separate weights Ŵ_ρ and Ŵ_q (Experiment II):

    Variable   Error    Algorithm I   Algorithm II   Algorithm III
    Density    RMSE     41.63         17.21          7.79
               NRMSE    210.98%       87.22%         39.50%
               SMAPE1   -30.29%       -179.44%       15.28%
               SMAPE2   109.24%       40.66%         15.91%
    Flow rate  RMSE     274.58        225.27         235.73
               NRMSE    22.00%        18.05%         18.89%
               SMAPE1   8.92%         7.85%          8.41%
               SMAPE2   8.18%         7.02%          7.97%

Comparison of Ensemble Algorithms: Part IV

Experiment III (nonlinear models): ensemble of 20 EKF models. Stochastic resampling: a small-scale Gaussian white noise vector is added to the estimated state vector after the EKF predict and update procedures at each time step.

Estimation errors of the single deterministic EKF model and of the ensemble EKF models using separate weights Ŵ_ρ and Ŵ_q (Experiment III):

    Variable   Error    EKF       Algorithm II   Algorithm III
    Density    RMSE     8.62      8.21           7.69
               NRMSE    43.68%    41.62%         38.96%
               SMAPE1   16.36%    16.65%         13.68%
               SMAPE2   16.77%    11.72%         14.78%
    Flow rate  RMSE     298.70    421.86         420.96
               NRMSE    23.93%    33.80%         33.73%
               SMAPE1   9.99%     12.97%         15.07%
               SMAPE2   9.38%     16.48%         15.04%

Comparison of Ensemble Algorithms: Part V

Experiment IV: ensemble of 10 candidate SMM models (deterministic models without initial-condition noise) and 10 candidate EKF models (stochastic models with initial-condition noise). Stochastic resampling: a small-scale Gaussian white noise vector is added to the estimated state vector after the EKF predict and update procedures at each time step.

Comparison of Ensemble Algorithms: Part VI

Estimation errors of the ensemble models using separate weights Ŵ_ρ and Ŵ_q (Experiment IV):

    Variable   Error    EKF       Algorithm II   Algorithm III
    Density    RMSE     8.62      8.36           7.20
               NRMSE    43.68%    42.35%         36.48%
               SMAPE1   16.36%    16.93%         13.31%
               SMAPE2   16.77%    16.63%         14.39%
    Flow rate  RMSE     298.70    272.16         265.65
               NRMSE    23.93%    21.80%         21.28%
               SMAPE1   9.99%     9.29%          9.18%
               SMAPE2   9.38%     8.49%          8.61%

Influence of the Regularization Parameter λ

Sensitivity analysis: all experiments show that the lasso ensemble performs best in this case study. In practice, it is suggested to adaptively choose the best regularization scalar λ online.

Sensitivity of the lasso EKF ensemble model (Experiment III, Algorithm III) to the regularization coefficient:

    Variable   Error    λ = 0.01 λ_max   λ = 0.1 λ_max   λ = 0.2 λ_max   λ = 0.5 λ_max
    Density    RMSE     9.41             7.90            10.00           13.11
               NRMSE    47.69%           40.03%          50.66%          66.44%
               SMAPE1   17.55%           13.28%          15.98%          27.56%
               SMAPE2   18.83%           14.51%          19.65%          32.54%
    Flow rate  RMSE     378.36           348.47          583.03          944.07
               NRMSE    30.31%           27.92%          46.71%          75.64%
               SMAPE1   12.34%           12.73%          23.99%          56.81%
               SMAPE2   11.27%           12.21%          25.24%          57.14%

Model Estimation Results of Experiments I-IV Using the Lasso Ensemble (Experiments I and II)

[Figure: density (veh/km) and flow rate (veh/h) estimations versus measurements at L1, L2, and L3 from 4 AM to 12 PM, for Experiment I (left) and Experiment II (right).]

Model Estimation Results of Experiments I-IV Using the Lasso Ensemble (Experiments III and IV)

[Figure: density (veh/km) and flow rate (veh/h) estimations versus measurements at L1, L2, and L3 from 4 AM to 12 PM, for Experiment III (left) and Experiment IV (right).]

Example 2: Feature Selection for Prediction I

Reference: Yang, S. (2013). On feature selection for traffic congestion prediction. Transportation Research Part C: Emerging Technologies, 26, pp. 160-169.

Objectives: Traffic congestion prediction plays an important role in route guidance and traffic management. It is formulated as a binary classification problem. Through extensive experiments with real-world data, it was found that a large number of sensors, usually over 100, are relevant to the prediction task at one sensor, which means wide-area correlation and high dimensionality of the data. The paper investigates the feature selection problem for traffic congestion prediction. By applying feature selection, the data dimensionality can be reduced remarkably while the performance remains the same. Besides, a new traffic jam probability scoring method is proposed to decompose the high-dimensional computation into many one-dimensional probabilities and their combination.

Example 2: Feature Selection for Prediction II

Data: Traffic Management Center of the Minnesota Department of Transportation; 30 s interval data from over 4,000 loop detectors located around the Twin Cities Metro freeways, 7 days per week from January 1 to September 22, 2010. Sensors containing missing values, weekends, and days with incomplete records are not taken into account. Finally, the data set covers 156 days, with traffic volumes summed per 10-minute interval for 4,584 sensors. The first 126 days are used for learning and the remaining 30 days for testing.

Example 2: Feature Selection for Prediction III

[Figure: mean prediction precision against feature number (0-4,000) for time lags of 10, 20, 60, and 300 min.]

Example 2: Feature Selection for Prediction IV

Mean prediction precision and feature number at the turning point and end point:

    Time lag                     10 min   20 min   60 min   300 min
    Turning point  Precision     56%      56.45%   56.17%   55.52%
                   #Features     286      336      326      276
    End point      Precision     60.13%   60.58%   59.89%   59.62%
                   #Features     3386     3386     3386     3386

Mean prediction precision (%) against feature number:

    #Features   6       26      56      106     156     286     876     1726    3146    3386
    10 min      32.67   39.6    48.86   52.26   53.29   56      58.01   59.02   60      60.13
    20 min      31.75   43.3    48.61   51.72   53.4    55.86   58.54   59.48   60.47   60.58
    60 min      31.86   43.32   48.82   52.26   53.96   55.75   58.01   58.93   59.74   59.89
    300 min     31.21   43.58   48.66   52.18   53.84   55.55   57.60   58.61   59.5    59.62

Example 2: Feature Selection for Prediction V

[Figure: actual traffic volumes against prediction rank (time lag = 20 min, all features) for sensors 1266 and 1473.]

Example 2: Feature Selection for Prediction VI

[Figure: precision p(l) and recall r(l) curves for the two sensors. Sensor 1266: true jam number = 86, precision(top 86) = 91.86%. Sensor 1473: true jam number = 106, precision(top 106) = 65.09%.]

Example 2: Feature Selection for Prediction VII

[Figure: distribution of precision over sensors when predicting with all features (time lag = 20 min).]

Example 2: Feature Selection for Prediction VIII

[Figure: prediction precision against feature number for two sensors. Sensor 1266: optimal feature number = 156. Sensor 1473: optimal feature number = 26.]

Example 2: Feature Selection for Prediction IX

[Figure: distribution of the 888 sensors of interest over the maximum precision, for time lags of 10, 20, 60, and 300 min.]

Example 2: Feature Selection for Prediction X

[Figure: distribution of the 888 sensors of interest over the optimal number of features, for time lags of 10, 20, 60, and 300 min.]

Example 2: Feature Selection for Prediction XI

Comparison of the mean precision over the 888 sensors of interest achieved with the optimal number of features versus all features:

    Time lag (min)                            10      20      60      300
    Mean of max precision (%)                 62.21   62.33   61.79   61.46
    Mean of precision with all features (%)   60.13   60.58   59.89   59.62

Distribution of the 888 sensors of interest over the optimal number of features:

    #Features   [0,10)  [10,20)  [20,30)  [30,40)  [40,50)  [50,100)  [100,3386)  3386
    10 min      11      18       10       18       9        58        753         11
    20 min      8       12       8        7        8        48        788         9
    60 min      11      4        11       16       5        61        769         11
    300 min     13      8        20       13       10       60        761         3

Mean prediction precision (%):

    Time lag (min)   10      20      60      300
    Optimal          52.08   51.92   51.73   51.30
    All              52.07   51.80   51.57   50.99

Example 3: Forecasting Urban Travel Times I

Reference: Haworth, J., Shawe-Taylor, J., Cheng, T., and Wang, J. (2014). Local online kernel ridge regression for forecasting of urban travel times. Transportation Research Part C: Emerging Technologies, 46, pp. 151-178.

Example 3: Forecasting Urban Travel Times II

Highlights:
- Local online kernel ridge regression (LOKRR) is developed for forecasting urban travel times.
- LOKRR takes into account the time-varying characteristics of traffic series through the use of locally defined kernels.
- LOKRR outperforms ARIMA, Elman ANN, and SVR in forecasting travel times on London's road network.
- The model is based on regularized linear regression, and clear guidelines are given for parameter training.

Example 3: Forecasting Urban Travel Times III

Merits over the standard single-kernel approach:
- It allows parameters to vary by time of day, capturing the time-varying distribution of traffic data;
- It allows smaller kernels to be defined that contain only the relevant traffic patterns;
- It is online, allowing new traffic data to be incorporated as they arrive.
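To give a flavour of the local-kernel idea, the sketch below fits one kernel ridge regression per time-of-day slot, each trained only on historical windows observed around that time. This is a simplified illustration under assumed names and data structures, not the paper's LOKRR algorithm.

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge

    def fit_local_models(windows_by_slot, targets_by_slot, alpha=1.0, gamma=0.1):
        """windows_by_slot[t]: (N_t, w) array of lagged UTT windows observed near slot t;
           targets_by_slot[t]: (N_t,) array of the UTT to forecast for slot t.
           One Gaussian-kernel ridge model is fitted per time-of-day slot."""
        return {t: KernelRidge(alpha=alpha, kernel="rbf", gamma=gamma)
                     .fit(windows_by_slot[t], targets_by_slot[t])
                for t in windows_by_slot}

    def forecast(models, t, recent_window):
        """Forecast with the model local to time-of-day slot t."""
        return models[t].predict(np.asarray(recent_window).reshape(1, -1))[0]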

Example 3: Forecasting Urban Travel Times IV

Data: unit travel times (UTTs, seconds/metre) collected on London's road network as part of the London Congestion Analysis Project (LCAP), coordinated by Transport for London (TfL). LCAP travel times are observed using automatic number plate recognition (ANPR) technology. Individual vehicle travel times are aggregated at 5 min intervals to produce a regularly spaced time series with 288 observations per day. Only data collected between 6 AM and 9 PM are used in the analysis (180 observations per day). In total there are 154 days for which data are available, collected between January and July 2011. To test the models, the data are divided into three sets: a training set, a testing set, and a validation set, which are 80 days (52%), 37 days (24%), and 37 days (24%) in length, respectively.

Example 3: Forecasting Urban Travel Times V

[Figure: illustration of variability in traffic data. Each line is the unit travel time (UTT) profile recorded on a single link (link 1815) on a different day (5 days total).]

Example 3: Forecasting Urban Travel Times VI

[Figure: diagram of the training data construction.]

Example 3: Forecasting Urban Travel Times VII

Observing travel times using ANPR: a vehicle passes camera l_1 at time t_1, and its number plate is read. It then traverses link l and passes camera l_2 at time t_2, where its number plate is read again. The two number plates are matched using inbuilt software, and the travel time is calculated as t_2 - t_1. Raw travel times are converted to UTTs by dividing by len(l), the length of link l.

Example 3: Forecasting Urban Travel Times VIII

The test links and their patch rates and frequency:

    Link ID         24      26      442     454     1815     1799    453     1798     2448     881
    % Missing       0.8     0.8     11.5    12      12.2     13      13.9    14.4     14.4     14.5
    Avg frequency   209.5   154.9   140.9   117.4   688.2    316.9   137.4   285.3    93.7     189.6
    Length (m)      366.7   907.5   899.4   710.1   4033.1   110.1   892.6   4670.2   1216.5   3744.3

Example 3: Forecasting Urban Travel Times IX

[Figure: location of the test links on the LCAP network, with ANPR camera locations marked. ITN data, Crown Copyright 2013.]

Example 3: Forecasting Urban Travel Times X

[Figure: time series of each of the test links over the first 10 weeks of the training period.]

Example 3: Forecasting Urban Travel Times XI

Training errors at (a) 15 and 30 min forecast horizons and (b) 45 and 60 min horizons:

(a) 15 min / 30 min
    Link ID   RMSE     NRMSE    MAPE    MASE     |  RMSE     NRMSE    MAPE    MASE
    24        0.1081   0.0964   24.91   0.9351   |  0.1254   0.1119   29.42   0.8869
    26        0.0477   0.0690   15.90   0.8302   |  0.0505   0.0731   16.78   0.7617
    442       0.0553   0.0539   13.69   0.9218   |  0.0665   0.0647   15.93   0.8467
    453       0.0385   0.0622   13.41   0.8983   |  0.0413   0.0668   14.34   0.8304
    454       0.1051   0.0366   16.62   0.9311   |  0.1259   0.0439   19.48   0.8171
    881       0.0189   0.0307   11.27   0.8798   |  0.0222   0.0360   13.16   0.8286
    1798      0.0209   0.0350   9.38    0.8412   |  0.0304   0.0508   13.94   0.7861
    1799      0.0257   0.0360   8.58    0.8453   |  0.0347   0.0487   9.91    0.7602
    1815      0.0183   0.0219   4.71    0.9510   |  0.0334   0.0399   8.35    1.0853
    2448      0.0839   0.0607   16.86   0.9334   |  0.1038   0.0751   20.76   0.8874

(b) 45 min / 60 min
    Link ID   RMSE     NRMSE    MAPE    MASE     |  RMSE     NRMSE    MAPE    MASE
    24        0.1308   0.1167   30.78   0.8343   |  0.1343   0.1198   32.14   0.8007
    26        0.0516   0.0747   17.16   0.7226   |  0.0526   0.0761   17.62   0.6937
    442       0.0745   0.0725   17.82   0.8315   |  0.0811   0.0790   18.94   0.7990
    453       0.0420   0.0679   15.13   0.7883   |  0.0484   0.0524   16.01   0.7721
    454       0.1379   0.0481   20.75   0.7417   |  0.1461   0.0509   21.17   0.7023
    881       0.0260   0.0422   14.38   0.7743   |  0.0276   0.0448   15.03   0.7479
    1798      0.0349   0.0584   17.16   0.7451   |  0.0388   0.0650   19.56   0.6938
    1799      0.0401   0.0563   10.60   0.6950   |  0.0421   0.0592   11.04   0.6528
    1815      0.0389   0.0464   10.62   1.0491   |  0.0437   0.0522   12.38   1.0305
    2448      0.1162   0.0841   23.66   0.8637   |  0.1241   0.0897   25.28   0.8319

Example 3: Forecasting Urban Travel Times XII

Fitted model parameters (r, k, w) at each of the forecast horizons:

    Link    15 min           30 min           45 min           60 min
    24      0.75, 1/8, 3     0.75, 1/8, 3     0.5, 1/8, 1      0.75, 1/8, 2
    26      0.5, 1/8, 2      0.5, 1/8, 2      0.5, 1/8, 2      0.5, 1/8, 3
    442     0.5, 1/8, 3      0.5, 1/8, 3      0.75, 1/8, 3     0.75, 1/8, 3
    453     0.75, 1/8, 3     0.75, 1/8, 3     0.75, 1/8, 3     0.75, 1/8, 3
    454     0.5, 1/8, 3      0.75, 1/8, 2     0.75, 1/8, 3     0.75, 1/8, 3
    881     0.5, 1/8, 3      0.5, 1/8, 3      0.75, 1/8, 3     0.75, 1/8, 3
    1798    0.5, 1/8, 1      0.5, 1/8, 2      0.75, 1/8, 3     0.75, 1/8, 3
    1799    0.5, 1/8, 3      0.75, 1/8, 3     0.75, 1/8, 3     0.75, 1/8, 3
    1815    0.5, 1/8, 3      0.5, 1/8, 2      0.5, 1/8, 2      0.5, 1/8, 2
    2448    0.25, 1/8, 2     0.5, 1/8, 2      0.75, 1/8, 3     0.75, 1/8, 3

Note: r values are quantiles of ||x - x'||^2; k values are multiples of k_0.

Example 3: Forecasting Urban Travel Times XIII

Testing errors of the LOKRR model:

15 min
    Link    RMSE       NRMSE      MAPE       MASE
    1798    0.015056   0.061842   9.896842   0.86972
    1799    0.020865   0.041542   9.04533    0.86287
    1815    0.021719   0.028766   5.557989   1.066146
    2448    0.078816   0.063965   17.51586   0.951652
    24      0.135792   0.050655   23.8859    0.887596
    26      0.056001   0.050071   16.67646   0.838566
    442     0.077271   0.026585   16.10907   0.900641
    453     0.05       0.024215   12.96563   0.842421
    454     0.083559   0.052184   16.19808   0.906937
    881     0.027327   0.030507   12.72943   0.872443

30 min
    Link    RMSE       NRMSE      MAPE       MASE
    1798    0.021784   0.089477   14.50633   0.778719
    1799    0.025402   0.050759   10.22777   0.789565
    1815    0.0303     0.040132   8.325483   1.079486
    2448    0.104256   0.084611   21.61859   0.897533
    24      0.155577   0.058035   28.45854   0.843025
    26      0.062764   0.056117   17.64926   0.756886
    442     0.095284   0.032783   19.62794   0.832451
    453     0.053543   0.025931   14.08368   0.77528
    454     0.112563   0.070298   20.42523   0.841052
    881     0.036773   0.041052   14.75696   0.817719

Example 3: Forecasting Urban Travel Times XIV

Testing errors of the LOKRR model (continued):

45 min
    Link    RMSE       NRMSE      MAPE       MASE
    1798    0.02598    0.111417   18.02658   0.72652
    1799    0.026217   0.052198   11.11021   0.742244
    1815    0.035621   0.04718    11.29372   1.116573
    2448    0.117528   0.095382   24.21025   0.869196
    24      0.168476   0.062847   31.1086    0.825435
    26      0.066129   0.059126   18.19328   0.70734
    442     0.09972    0.034309   21.12433   0.751533
    453     0.054154   0.026227   14.58123   0.742946
    454     0.124311   0.077635   23.43005   0.795166
    881     0.042819   0.047802   15.85057   0.756781

60 min
    Link    RMSE       NRMSE      MAPE       MASE
    1798    0.028534   0.117203   20.60327   0.671616
    1799    0.025176   0.062291   11.66314   0.735095
    1815    0.037106   0.049147   12.53772   1.061102
    2448    0.124704   0.101206   25.82053   0.833048
    24      0.17265    0.064404   32.58387   0.781458
    26      0.066611   0.059556   18.4175    0.6703
    442     0.103471   0.0356     22.76551   0.720822
    453     0.055945   0.027094   15.38157   0.738591
    454     0.130278   0.081362   25.11916   0.768823
    881     0.046543   0.051959   17.09304   0.721626

Example 3: Forecasting Urban Travel Times XV

Comparison with benchmark models - average errors at (a) 15 min, (b) 30 min, (c) 45 min, and (d) 60 min forecast horizons:

                         Training                          Testing
    Model                RMSE       NRMSE      MAPE        RMSE       NRMSE      MAPE
(a) LOKRR                0.051578   0.049348   13.24606    0.056641   0.043033   14.05806
    SVR                  0.063477   0.060735   13.35697    0.067139   0.050494   14.16037
    ANN                  0.062049   0.059943   14.58549    0.062892   0.056895   14.91349
    ARIMA                -          -          -           0.057455   0.048411   14.89455
(b) LOKRR                0.062015   0.059734   15.67754    0.069825   0.05492    16.96798
    SVR                  0.076575   0.073219   16.06394    0.081147   0.062619   17.26276
    ANN                  0.077809   0.075266   17.25284    0.081186   0.071484   18.05018
    ARIMA                -          -          -           0.074985   0.065109   18.90466
(c) LOKRR (Online KRR)   0.067464   0.064846   17.27787    0.076096   0.061412   18.89288
    SVR                  0.083118   0.079573   17.7488     0.088623   0.069456   19.18715
    ANN                  0.085629   0.083494   18.49986    0.089811   0.070026   19.29111
    ARIMA                -          -          -           0.090186   0.074243   21.68558
(d) LOKRR                0.071581   0.066783   18.40614    0.079102   0.064982   20.19853
    SVR                  0.088056   0.081424   18.83573    0.093475   0.074376   20.17787
    ANN                  0.092487   0.090328   19.94198    0.097288   0.074856   21.07094
    ARIMA                -          -          -           0.089645   0.07822    23.0101

Note: training errors are not shown for the ARIMA model due to the difference in the model fitting procedure.

Example 3: Forecasting Urban Travel Times XVI

[Figure: RMSE of each of the models at (a) 15 min, (b) 30 min, (c) 45 min, and (d) 60 min forecast horizons.]

Example 3: Forecasting Urban Travel Times XVII

[Figure: MAPE of each of the models at (a) 15 min, (b) 30 min, (c) 45 min, and (d) 60 min forecast horizons.]

Example 3: Forecasting Urban Travel Times XVIII

[Figure: time series plots of the observed series (thick black line) against the forecast series at the 15 min interval on (a) Monday 6th June; (b) 7th June; (c) 8th June; (d) 9th June; (e) 10th June; (f) 11th June.]

Example 3: Forecasting Urban Travel Times XIX

[Figure.]

Example 3: Forecasting Urban Travel Times XX

[Figure: time series plots of the observed series (thick black line) against the forecast series at the 60 min interval on (a) Monday 6th June; (b) 7th June; (c) 8th June; (d) 9th June; (e) 10th June; (f) 11th June.]

Example 3: Forecasting Urban Travel Times XXI

[Figure.]

Example 3: Forecasting Urban Travel Times XXII

[Figure: comparison of LOKRR with benchmark models at the 15 min interval on link 442, (a) on a typical weekday and (b) during a non-recurrent congestion event.]

Example 3: Forecasting Urban Travel Times XXIII

[Figure: comparison of LOKRR with benchmark models at the 15 min interval on link 442, (a) on a typical weekday and (b) during a non-recurrent congestion event (continued).]

Example 3: Forecasting Urban Travel Times XXIV

[Figure: comparison of LOKRR with benchmark models at the 60 min interval on link 442, (a) on a typical weekday and (b) during a non-recurrent congestion event.]