Time-series Extreme Event Forecasting with Neural Networks at Uber


Nikolay Laptev, Jason Yosinski, Li Erran Li, Slawek Smyl
Uber Technologies, San Francisco, CA, USA. Correspondence to: Nikolay Laptev <nlaptev@uber.com>, Jason Yosinski <yosinski@uber.com>, Li Erran Li <erranli@uber.com>, Slawek Smyl <slawek@uber.com>.
ICML 2017 Time Series Workshop, Sydney, Australia. Copyright 2017 by the author(s).

Abstract

Accurate time-series forecasting during high-variance segments (e.g., holidays) is critical for anomaly detection, optimal resource allocation, budget planning and other related tasks. At Uber, accurate prediction of completed trips during special events can lead to more efficient driver allocation, resulting in decreased wait times for riders. State-of-the-art methods for this task often rely on a combination of univariate forecasting models (e.g., Holt-Winters) and machine learning methods (e.g., random forests). Such a system, however, is hard to tune and scale, and makes it difficult to add exogenous variables. Motivated by the recent resurgence of Long Short Term Memory networks, we propose a novel end-to-end recurrent neural network architecture that outperforms the current state-of-the-art event forecasting methods on Uber data and generalizes well to the public M3 dataset used for time-series forecasting competitions.

1. Introduction

Accurate demand time-series forecasting during high-variance segments (e.g., holidays, sporting events) is critical for anomaly detection, optimal resource allocation, budget planning and other related tasks. The problem is challenging because extreme event prediction depends on numerous external factors, including weather, city population growth and marketing changes such as driver incentives (Horne & Manzenreiter, 2004).

Classical time-series models, such as those found in the standard R forecast package (Hyndman & Khandakar, 2008), are popular methods for providing a univariate base-level forecast. To incorporate exogenous variables, a machine learning approach, often based on a Quantile Random Forest (Meinshausen, 2006), is employed. This state-of-the-art approach is effective at accurately modeling special events; however, it is not flexible and does not scale, due to its high retraining frequency. Classical time-series models usually require manual tuning to set seasonality and other parameters. Furthermore, while there are time-series models that can incorporate exogenous variables (Wei, 1994), they suffer from the curse of dimensionality and require frequent retraining.

To deal more effectively with exogenous variables, a combination of univariate modeling and a machine-learned model that handles residuals was introduced in (Opitz, 2015). The resulting two-stage model, however, is hard to tune, requires manual feature extraction and needs frequent retraining, which is prohibitive for millions of time-series.

Relatively recently, time-series modeling based on the Long Short Term Memory (LSTM) technique (Hochreiter & Schmidhuber, 1997) has gained popularity due to its end-to-end modeling, ease of incorporating exogenous variables and automatic feature extraction abilities (Assaad et al., 2008). Given a large amount of data across numerous dimensions, it has been shown that an LSTM approach can model complex nonlinear feature interactions (Ogunmolu et al., 2016), which is critical for modeling complex extreme events. Our initial LSTM implementation, however, did not show superior performance relative to the state-of-the-art approach described above.
In Section 4 we discuss the key changes to our initial LSTM architecture that were required to achieve good performance at scale for single-model, heterogeneous time-series forecasting. This paper makes the following contributions:

- We propose a new LSTM-based architecture and train a single model using heterogeneous time-series.
- We present experiments on proprietary and public data showing the generalization and scalability of the proposed model.

The rest of this paper is structured as follows: Section 2 provides a brief background on classical and neural network based time-series forecasting models. Section 3 describes the data, and more specifically how it was constructed and preprocessed to serve as input to the LSTM model. Section 4 describes the architectural changes to our initial LSTM model. Sections 5 and 6 provide results and subsequent discussion.

2. Background

Extreme event prediction has become a popular topic for estimating peak electricity demand, traffic jam severity and surge pricing for ride sharing, among other applications (Friederichs & Thorarinsdottir, 2012). In fact, there is a branch of statistics, known as extreme value theory (EVT) (de Haan & Ferreira, 2006), that deals directly with this challenge.

To address the peak forecasting problem, both univariate time-series and machine learning approaches have been proposed. While univariate time-series approaches directly model the temporal domain, they suffer from a frequent retraining requirement (Ye & Keogh, 2009). Machine learning models are often used in conjunction with univariate time-series models, resulting in a bulky two-step process for addressing the extreme event forecasting problem (Opitz, 2015). LSTMs, like traditional time-series approaches, can model the temporal domain well, while also modeling nonlinear feature interactions and residuals (Assaad et al., 2008). We found, however, that the vanilla LSTM model's performance is worse than our baseline. Thus, we propose a new architecture that leverages an autoencoder for feature extraction, achieving superior performance compared to our baseline.

3. Data

At Uber we have anonymized access to rider and driver data from hundreds of cities. While we have a plethora of data, challenges arise due to the data sparsity found in new cities and for special events. To circumvent the lack of data, we use additional features, including weather information (e.g., precipitation, wind speed, temperature) and city-level information (e.g., current trips, current users, local holidays). An example of a raw dataset is shown in Figure 1 (b).

Creating a training dataset requires two sliding windows, X (input) and Y (output), of, respectively, the desired look-back and forecast horizon. X and Y are comprised of (batch, time, features). See Figure 1 (a) for an example of X and Y.

Figure 1. Real-world time-series examples. (a) Creating an input for the model requires two sliding windows, for x and for y. (b) A scaled sample input to our model.

Neural networks are sensitive to unscaled data (Hochreiter & Schmidhuber, 1997); therefore we normalize every minibatch. Furthermore, we found that de-trending the data, as opposed to de-seasoning it, produces better results.
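The sliding-window construction above can be sketched as follows. This is a minimal illustration, not Uber's production pipeline; the function name make_windows, the window lengths, and the per-window standardization used as a stand-in for the minibatch normalization described in the text are our assumptions.

    import numpy as np

    def make_windows(series, lookback, horizon):
        """Slide two windows over a (time, features) array to build
        X of shape (batch, lookback, features) and Y of shape
        (batch, horizon), taking the first column as the target."""
        X, Y = [], []
        for t in range(len(series) - lookback - horizon + 1):
            x = series[t:t + lookback].astype(float)
            # Neural networks are sensitive to unscaled data, so each
            # window is standardized (one plausible normalization choice).
            x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
            X.append(x)
            Y.append(series[t + lookback:t + lookback + horizon, 0])
        return np.array(X), np.array(Y)

    # Example: 1000 time steps, 8 features, 64-step look-back, 7-step horizon.
    data = np.random.randn(1000, 8)
    X, Y = make_windows(data, lookback=64, horizon=7)
    print(X.shape, Y.shape)  # (930, 64, 8) (930, 7)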
4. Modeling

In this section we first present the strategy used for uncertainty computation in our model; then, in Section 4.2, we propose a new scalable neural network architecture for time-series forecasting.

4.1. Uncertainty estimation

The extreme event problem is probabilistic in nature, and robust uncertainty estimation in neural network based time-series forecasting is therefore critical. A number of approaches exist for uncertainty estimation, ranging from Bayesian methods to those based on bootstrap theory (Gal, 2016). In our work we combine Bootstrap and Bayesian approaches to produce a simple, robust and tight uncertainty bound with good coverage and provable convergence properties (Li & Maddala, 1996).

Listing 1. Practical implementation of estimating the uncertainty bound

    import numpy as np

    # `model` is the trained forecaster, evaluated with dropout
    # active at inference time (Monte Carlo dropout).
    vals = []
    for r in range(100):
        vals.append(model.eval(input, dropout=np.random.uniform(0, 1)))
    mean = np.mean(vals)
    var = np.var(vals)

The implementation of this approach is extremely simple and practical (see Listing 1). Figures 2 (a) and (b) describe the uncertainty derivation and the underlying model used. The uncertainty calculation above is included for completeness of the proposed end-to-end forecasting model and can be replaced by other uncertainty measures.
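From the mean and variance estimated in Listing 1, an approximate prediction interval can be derived. The Gaussian-style bound below is an illustrative assumption on our part; the paper's exact combination of the Bootstrap and Bayesian estimates is deferred to its longer version.

    import numpy as np

    # mean and var come from the Monte Carlo dropout passes in Listing 1.
    std = np.sqrt(var)
    # Assumed ~95% uncertainty bound under a Gaussian approximation.
    lower, upper = mean - 1.96 * std, mean + 1.96 * std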

Figure 2. Model and forecast uncertainty. (a) Model and forecast uncertainty derivation. (b) Model uncertainty is estimated via the architecture on the left, while the forecast uncertainty is estimated via the architecture on the right.

We leave the discussion of the approximation bound, the comparison with other methods (Kendall & Gal, 2017) and other detailed uncertainty experiments for a longer version of the paper.

4.2. Heterogeneous forecasting with a single model

It is impractical to train a model per time-series for millions of metrics. Furthermore, training a single vanilla LSTM does not produce competitive results. Thus, we propose a novel model architecture that provides a single model for heterogeneous forecasting. As Figure 3 (b) shows, the model first primes the network via automatic feature extraction, which is critical for capturing complex time-series dynamics during special events at scale. This is contrary to standard feature extraction methods, where the features are manually derived; see Figure 3 (a). Feature vectors are then aggregated via an ensemble technique (e.g., averaging or other methods). The final vector is then concatenated with the new input and fed to the LSTM forecaster for prediction. Using this approach, we have achieved an average 14.09% improvement over a multi-layer LSTM model trained on a set of raw inputs.

Note that there are different ways to include the extra features produced by the auto-encoder in Figure 3 (b). The extra features can be included by extending the input size, or by increasing the depth of the LSTM Forecaster in Figure 3 (b) and thereby removing the LSTM auto-encoder. Having a separate auto-encoder module, however, produced better results in our experience. Other details on design choices are left for the longer version of the paper.
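The paper gives no reference implementation of this architecture. The sketch below, written with tf.keras, shows one way the two components described above could be wired together: an LSTM auto-encoder whose embedding is ensembled (here, simply averaged) and concatenated with the new input window before the LSTM forecaster. All layer sizes, variable names and window lengths are illustrative assumptions, not the authors' production design.

    from tensorflow.keras import layers, Model

    LOOKBACK, HORIZON, N_FEATURES, EMBED = 64, 7, 8, 32

    # 1) LSTM auto-encoder: primes the network via automatic feature
    #    extraction; the encoder output is the feature vector.
    enc_in = layers.Input(shape=(LOOKBACK, N_FEATURES))
    code = layers.LSTM(EMBED)(enc_in)
    dec = layers.RepeatVector(LOOKBACK)(code)
    dec = layers.LSTM(EMBED, return_sequences=True)(dec)
    dec_out = layers.TimeDistributed(layers.Dense(N_FEATURES))(dec)
    autoencoder = Model(enc_in, dec_out)
    encoder = Model(enc_in, code)
    autoencoder.compile(optimizer="adam", loss="mse")

    # 2) LSTM forecaster: the aggregated feature vector is broadcast
    #    over time and concatenated with the new input window.
    fc_in = layers.Input(shape=(LOOKBACK, N_FEATURES))
    feat_in = layers.Input(shape=(EMBED,))
    feat_rep = layers.RepeatVector(LOOKBACK)(feat_in)
    x = layers.Concatenate()([fc_in, feat_rep])
    x = layers.LSTM(64)(x)
    forecast = layers.Dense(HORIZON)(x)
    forecaster = Model([fc_in, feat_in], forecast)
    forecaster.compile(optimizer="adam", loss="mse")

    # Usage sketch: extract and average feature vectors, then forecast.
    # feats = encoder.predict(X_windows).mean(axis=0, keepdims=True)
    # y_hat = forecaster.predict([x_new, feats])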
5. Results

This section provides empirical results of the described model for special event and general time-series forecasting accuracy. Training was conducted on an AWS GPU instance with TensorFlow.¹ Unless otherwise noted, SMAPE was used as the forecast error metric, defined as

    $\mathrm{SMAPE} = \frac{100}{n} \sum_{t=1}^{n} \frac{|\hat{y}_t - y_t|}{(|\hat{y}_t| + |y_t|)/2}$

The described production neural network model was trained on thousands of time-series with thousands of data points each.

¹ In production, the learned weights and the TensorFlow graph were exported into an equivalent target language.

5.1. Special Event Forecasting Accuracy

A five-year daily history of completed trips across the top US cities by population was used to produce forecasts for all major US holidays. Figure 4 shows the average SMAPE with the corresponding uncertainty, where uncertainty is measured by the coefficient of variation, $c_v = \sigma / \mu$. We find that one of the hardest holidays for predicting expected Uber trips is Christmas Day, which corresponds to the greatest error and uncertainty. The longer version of the paper will contain a more detailed error and uncertainty evaluation per city. The results show a 2%-18% forecast accuracy improvement compared to the current proprietary method, which combines a univariate time-series model with a machine-learned model.

5.2. General Time-Series Forecasting Accuracy

This section describes the forecasting accuracy of the trained model on general time-series. Figure 5 shows the forecasting performance of the model on new time-series relative to the current proprietary forecasting solution. Note that we train a single neural network, compared to the per-query training requirement of the proprietary model. The preprocessing described in Section 3 was applied to each time-series. Figure 6 shows the performance of the same model on the public M3 benchmark, consisting of 1500 monthly time-series (Makridakis & Hibon, 2000). Both experiments indicate an exciting opportunity in the time-series field: a single, generic neural network model capable of producing high-quality forecasts for heterogeneous time-series, relative to specialized classical time-series models.
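For reference, the SMAPE metric and the coefficient of variation used in Section 5 transcribe directly into code; a minimal sketch:

    import numpy as np

    def smape(y_true, y_pred):
        """SMAPE from Section 5: (100/n) * sum |yhat_t - y_t| / ((|yhat_t| + |y_t|)/2)."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        denom = (np.abs(y_pred) + np.abs(y_true)) / 2.0
        return 100.0 * np.mean(np.abs(y_pred - y_true) / denom)

    def coefficient_of_variation(values):
        """c_v = sigma / mu, the uncertainty measure of Section 5.1."""
        values = np.asarray(values, dtype=float)
        return np.std(values) / np.mean(values)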

Figure 3. Single-model heterogeneous forecast. (a) Classical time-series features that are manually derived (Hyndman et al., 2015). (b) An auto-encoder can provide powerful feature extraction used for priming the neural network.

Figure 4. Individual holiday performance.

Figure 5. Forecasting errors for production queries relative to the current proprietary model.

Figure 6. Forecast on the public M3 dataset. A single neural network was trained on Uber data and compared against the M3-specialized models.

6. Discussion

We have presented an end-to-end neural network architecture for special event forecasting at Uber, and have shown its performance and scalability on Uber data. Finally, we have demonstrated the model's general forecasting applicability on both Uber data and the public M3 monthly data.

From our experience, there are three criteria for choosing a neural network model for time-series: (a) the number of time-series, (b) the length of the time-series and (c) the correlation among the time-series. If (a), (b) and (c) are high, then a neural network may be the right choice; otherwise a classical time-series approach may work best.

Our future work will center on utilizing the uncertainty information for neural network debugging, and on further research towards a general forecasting model for heterogeneous time-series forecasting and feature extraction, with use-cases similar to the generic ImageNet model used for general image feature extraction and classification (Deng et al., 2009).

References

Assaad, Mohammad, Boné, Romuald, and Cardot, Hubert. A new boosting algorithm for improved time-series forecasting with recurrent neural networks. Information Fusion, 2008.

de Haan, L. and Ferreira, A. Extreme Value Theory: An Introduction. Springer Series in Operations Research and Financial Engineering, 2006.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Friederichs, Petra and Thorarinsdottir, Thordis L. Forecast verification for extreme value distributions with an application to probabilistic peak wind prediction. Environmetrics, 2012.

Gal, Yarin. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 1997.

Horne, John D. and Manzenreiter, Wolfram. Accounting for mega-events. International Review for the Sociology of Sport, 39(2):187-203, 2004.

Hyndman, Rob J. and Khandakar, Yeasmin. Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 26(3):1-22, 2008.

Hyndman, Rob J., Wang, Earo, and Laptev, Nikolay. Large-scale unusual time series detection. In ICDM, pp. 1616-1619, 2015.

Kendall, Alex and Gal, Yarin. What uncertainties do we need in Bayesian deep learning for computer vision? 2017.

Li, Hongyi and Maddala, G. S. Bootstrapping time series models. Econometric Reviews, 15(2):115-158, 1996.

Makridakis, Spyros and Hibon, Michèle. The M3-Competition: results, conclusions and implications. International Journal of Forecasting, 16(4):451-476, 2000.

Meinshausen, Nicolai. Quantile regression forests. Journal of Machine Learning Research, 7:983-999, 2006.

Ogunmolu, Olalekan P., Gu, Xuejun, Jiang, Steve B., and Gans, Nicholas R. Nonlinear systems identification using deep dynamic neural networks. CoRR, 2016.

Opitz, T. Modeling asymptotically independent spatial extremes based on Laplace random fields. ArXiv e-prints, 2015.

Wei, William Wu-Shyong. Time Series Analysis. Addison-Wesley, Reading, 1994.

Ye, Lexiang and Keogh, Eamonn. Time series shapelets: a new primitive for data mining. In KDD, ACM, 2009.