Online Learning for Time Series Prediction
Mehryar Mohri, Courant Institute & Google Research
Joint work with Vitaly Kuznetsov (Google Research)
Motivation
Time series prediction arises in many domains: stock values, earthquakes, energy demand, signal processing, sales forecasting, election forecasting, economic variables, and weather (e.g., local/global temperature).
Time Series: NY unemployment rate (figure).
Time Series: US presidential election 2016 (figure).
Online Learning
Advantages:
active research area (Cesa-Bianchi and Lugosi, 2006).
no distributional assumption.
algorithms with tight regret guarantees.
flexibility: e.g., non-static competitor classes (Herbster and Warmuth, 1998, 2001; Koolen et al., 2015; Rakhlin and Sridharan, 2015).
Online Learning
Drawbacks:
real-world time series data is not adversarial.
the stochastic process must be taken into account.
the quantity of interest is the conditional expected loss, not the regret.
Can we leverage online algorithms for time series forecasting?
On-Line Learning Setup
Adversarial setting with hypothesis/action set $H$.
For $t = 1$ to $T$:
  player receives $x_t \in \mathcal{X}$;
  player selects $h_t \in H$;
  adversary selects $y_t \in \mathcal{Y}$;
  player incurs loss $L(h_t(x_t), y_t)$.
Objective: minimize the (external) regret
$\mathrm{Reg}_T = \sum_{t=1}^T L(h_t, z_t) - \inf_{h \in H} \sum_{t=1}^T L(h, z_t).$
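As a concrete instance of this loop, the sketch below implements the exponentially weighted average (Hedge) forecaster over a finite set of experts, assuming losses in $[0, 1]$; the function name, the learning rate `eta`, and the loss-matrix interface are illustrative choices, not part of the talk.

```python
import numpy as np

def hedge(expert_losses, eta=0.5):
    """Exponentially weighted average forecaster over T rounds.

    expert_losses: (T, N) array; entry (t, i) is the loss of expert i
    at round t.  Returns the player's cumulative loss and the regret
    against the best fixed expert in hindsight.
    """
    T, N = expert_losses.shape
    log_w = np.zeros(N)                      # log-weights, for numerical stability
    player_loss = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                         # current mixture over experts
        player_loss += p @ expert_losses[t]  # expected loss of the mixture
        log_w -= eta * expert_losses[t]      # multiplicative weight update
    best = expert_losses.sum(axis=0).min()
    return player_loss, player_loss - best
```

With a tuned `eta`, this forecaster is known to achieve regret in $O(\sqrt{T \log N})$, which is the kind of tight guarantee referred to on the previous slide.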
On-Line-to-Batch Problem
Problem: use $\mathbf{h} = (h_1, \ldots, h_T)$ to derive a hypothesis $h \in H$ with small path-dependent expected loss
$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) = \mathbb{E}_{Z_{T+1}}\big[L(h, Z_{T+1}) \mid \mathbf{Z}_1^T\big].$
The i.i.d. case is standard: (Littlestone, 1989), (Cesa-Bianchi et al., 2004).
How do we handle general stochastic processes?
Previous Work
Theory and algorithms for time series prediction (Kuznetsov and MM, 2015):
general non-stationary, non-mixing stochastic processes.
convex optimization algorithms.
algorithms perform well in experiments.
generalization bounds based on a notion of discrepancy.
But how do we tackle some difficult time series problems, such as ensemble learning or model selection?
Questions
Theoretical:
can we derive learning guarantees for a convex combination $\sum_{t=1}^T q_t h_t$?
can we derive guarantees for a hypothesis $h$ selected among $(h_1, \ldots, h_T)$?
Algorithmic:
on-line-to-batch conversion.
learning ensembles.
model selection.
Theory
Learning Guarantee
Problem: given hypotheses $(h_1, \ldots, h_T)$, give a bound on
$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) = \mathbb{E}_{Z_{T+1}}\big[L(h, Z_{T+1}) \mid \mathbf{Z}_1^T\big],$
where $h = \sum_{t=1}^T q_t h_t$.
Standard Assumptions
Stationarity: $(Z_t, \ldots, Z_{t+m})$ and $(Z_{t+k}, \ldots, Z_{t+m+k})$ have the same distribution for all $t$, $m$, $k$.
Mixing: the dependence between events separated by a gap of $k$ time steps decays as $k$ grows.
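As an illustrative formalization (one of several mixing notions in the literature; the talk does not fix a particular coefficient), the $\beta$-mixing coefficient of a process $(Z_t)_{t \ge 1}$ can be written as

```latex
\beta(k) \;=\; \sup_{t}\;
\mathbb{E}\Big[\,
  \sup_{A \in \sigma(Z_{t+k}, Z_{t+k+1}, \ldots)}
  \big|\, \mathbb{P}\big(A \mid \sigma(Z_1, \ldots, Z_t)\big) - \mathbb{P}(A) \,\big|
\Big],
```

and the process is $\beta$-mixing when $\beta(k) \to 0$ as $k \to \infty$; this is the kind of coefficient that is hard to estimate from a single sample path.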
Problem
Stationarity and mixing assumptions are widely adopted: (Alquier and Wintenberger, 2010, 2014), (Agarwal and Duchi, 2013), (Lozano et al., 1997), (Vidyasagar, 1997), (Yu, 1994), (Meir, 2000), (MM and Rostamizadeh, 2000), (Kuznetsov and MM, 2014).
But:
they often do not hold (think of trends or periodic signals).
they are not testable.
estimating mixing parameters can be hard, even if the general functional form is known.
the hypothesis set and loss function are ignored.
We need a new tool for the analysis.
Relevant Quantity
Key difference: $\mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})$ versus $\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T)$.
Average difference:
$\frac{1}{T} \sum_{t=1}^T \big[\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})\big].$
On-line Discrepancy
Definition:
$\mathrm{disc}(\mathbf{q}) = \sup_{\mathbf{h} \in H_A} \sum_{t=1}^T q_t \big[\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})\big],$
where $H_A$ is the set of hypothesis sequences that the algorithms in $A$ can return and $\mathbf{q} = (q_1, \ldots, q_T)$ is an arbitrary weight vector.
natural measure of non-stationarity or dependency.
captures the hypothesis set and the loss function.
can be efficiently estimated under mild assumptions.
generalizes the definition of (Kuznetsov and MM, 2015).
Learning Guarantee
Theorem: let $L$ be a convex loss bounded by $M$ and $\mathbf{h}_1^T$ a hypothesis sequence adapted to $\mathbf{Z}_1^T$. Fix $\mathbf{q} \in \Delta$. Then, for any $\delta > 0$, the following holds with probability at least $1 - \delta$ for the hypothesis $h = \sum_{t=1}^T q_t h_t$:
$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^T q_t L(h_t, z_t) + \mathrm{disc}(\mathbf{q}) + M \|\mathbf{q}\|_2 \sqrt{2 \log \tfrac{1}{\delta}}.$
Proof
By convexity of the loss:
$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^T q_t \mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T).$
By definition of the on-line discrepancy:
$\sum_{t=1}^T q_t \big[\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})\big] \le \mathrm{disc}(\mathbf{q}).$
$A_t = q_t \big[\mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) - L(h_t, z_t)\big]$ is a martingale difference sequence, thus by Azuma's inequality, with high probability,
$\sum_{t=1}^T q_t \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \le \sum_{t=1}^T q_t L(h_t, z_t) + M \|\mathbf{q}\|_2 \sqrt{2 \log \tfrac{1}{\delta}}.$
Learning Guarantee
Theorem: let $L$ be a convex loss bounded by $M$ and $\mathcal{H}$ a set of hypothesis sequences adapted to $\mathbf{Z}_1^T$. Fix $\mathbf{q} \in \Delta$. Then, for any $\delta > 0$, the following holds with probability at least $1 - \delta$ for the hypothesis $h = \sum_{t=1}^T q_t h_t$:
$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) \le \inf_{\mathbf{h}^* \in \mathcal{H}} \mathcal{L}_{T+1}(\mathbf{h}^*, \mathbf{Z}_1^T) + 2\,\mathrm{disc}(\mathbf{q}) + \frac{\mathrm{Reg}_T}{T} + M \|\mathbf{q} - \mathbf{u}\|_1 + 2 M \|\mathbf{q}\|_2 \sqrt{2 \log \tfrac{2}{\delta}},$
where $\mathbf{u}$ denotes the uniform weight vector.
Notes
Learning guarantees with the same flavor as those of (Kuznetsov and MM, 2015), but with simpler proofs and no complexity measure.
The bounds admit as special cases the learning guarantees for:
the i.i.d. scenario (Littlestone, 1989), (Cesa-Bianchi et al., 2004).
the drifting scenario (MM and Muñoz Medina, 2012).
Extension: Non-Convex Losses
The theorems extend to non-convex losses when $h$ is selected as follows (Cesa-Bianchi et al., 2004):
$h = \operatorname{argmin}_{h_t} \Big\{ \sum_{s=t}^T q_s L(h_t, z_s) + \mathrm{disc}(\mathbf{q}_t^T) + M \|\mathbf{q}_t^T\|_2 \sqrt{2 \log \tfrac{2(T+1)}{\delta}} \Big\}.$
Extension: General Regret
General regret definition:
$\mathrm{Reg}_T = \sum_{t=1}^T L(h_t, z_t) - \inf_{\mathbf{h}^* \in \mathcal{H}} \Big[ \sum_{t=1}^T L_t(h_t^*, z_t) + \mathcal{R}(\mathbf{h}^*) \Big].$
standard regret: $\mathcal{R} = 0$, $\mathcal{H}$ the set of constant sequences.
tracking: $\mathcal{H} \subseteq H^T$.
$\mathcal{R}$ can be a kernel-based regularization (Herbster and Warmuth, 2001).
Extension: Non-Adapted Sequences
Hypotheses no longer adapted, but output by a uniformly stable algorithm.
Stable hypotheses: $\mathcal{H} = \{\mathbf{h} \in H^T : \text{there exists } A \in \mathcal{A} \text{ such that } \mathbf{h} = A(\mathbf{Z}_1^T)\}$.
$\beta_t$: stability coefficient of the algorithm returning $h_t$.
Similar learning bounds hold, with an additional term $\sum_{t=1}^T q_t \beta_t$.
They admit as special cases the results of (Agarwal and Duchi, 2013) for asymptotically stationary mixing processes.
Discrepancy Estimation
Batch estimation method: (Kuznetsov and MM, 2015).
On-line estimation method:
assume that the loss is $\mu$-Lipschitz.
assume that there exists an accurate hypothesis $h^*$, i.e. that
$\inf_{h^*} \mathbb{E}\big[L(Z_{T+1}, h^*(X_{T+1})) \mid \mathbf{Z}_1^T\big]$ is small.
Batch Estimation
Decomposition: $\mathrm{disc}(\mathbf{q}) \le \mathrm{disc}_0(\mathbf{q}) + \mathrm{disc}_1(\mathbf{q})$, with
$\mathrm{disc}_0(\mathbf{q}) = \sup_{h \in H} \Big[ \frac{1}{s} \sum_{t=T-s+1}^T \mathcal{L}_t(h, \mathbf{Z}_1^{t-1}) - \sum_{t=1}^T q_t \mathcal{L}_t(h, \mathbf{Z}_1^{t-1}) \Big]$
and
$\mathrm{disc}_1(\mathbf{q}) = \sup_{h \in H} \Big[ \mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) - \frac{1}{s} \sum_{t=T-s+1}^T \mathcal{L}_t(h, \mathbf{Z}_1^{t-1}) \Big],$
where the first term can be estimated from data and the second is small when the last $s$ points are representative of the future.
Online Estimation
Lemma: assume that $L$ is $\mu$-Lipschitz. Fix a sequence $\mathbf{Z}_1^T$ in $\mathcal{Z}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $\alpha > 0$:
$\mathrm{disc}(\mathbf{q}) \le \widehat{\mathrm{disc}}_{H_A}(\mathbf{q}) + \mu \inf_{h^*} \mathbb{E}\big[L(Z_{T+1}, h^*(X_{T+1})) \mid \mathbf{Z}_1^T\big] + 2\alpha + M \|\mathbf{q}\|_2 \sqrt{2 \log \tfrac{\mathbb{E}[\mathcal{N}_1(\alpha, \mathcal{G}, \mathbf{z})]}{\delta}},$
where
$\widehat{\mathrm{disc}}_{H_A}(\mathbf{q}) = \sup_{h \in H,\, \mathbf{h} \in H_A} \sum_{t=1}^T q_t \big[ L\big(h_t(X_{T+1}), h(X_{T+1})\big) - L(h_t, z_t) \big].$
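When the supremum ranges over a finite set of candidates, $\widehat{\mathrm{disc}}_{H_A}(\mathbf{q})$ reduces to a maximum of $\mathbf{q}$-weighted loss differences. A minimal sketch, where the array layout (`future_losses` holding the $L(h_t(x_{T+1}), h(x_{T+1}))$ terms, one row per candidate $h$; `past_losses` holding the observed $L(h_t, z_t)$) is an assumed encoding rather than the talk's notation:

```python
import numpy as np

def discrepancy_estimate(q, future_losses, past_losses):
    """Estimated on-line discrepancy for a finite candidate set.

    q:             (T,) weight vector on the simplex.
    future_losses: (H, T) array; row h holds the surrogate losses
                   L(h_t(x_{T+1}), h(x_{T+1})) for each round t.
    past_losses:   (T,) array of the observed losses L(h_t, z_t).
    Returns the maximum over candidates of the q-weighted difference.
    """
    diffs = (future_losses - past_losses) @ q   # one value per candidate h
    return diffs.max()
```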
Proof Sketch
$\mathrm{disc}(\mathbf{q}) = \sup_{\mathbf{h} \in H_A} \sum_{t=1}^T q_t \big[\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})\big]$
$\le \sup_{\mathbf{h} \in H_A} \sum_{t=1}^T q_t \Big[ \mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathbb{E}\big[L(h_t(X_{T+1}), h^*(X_{T+1})) \mid \mathbf{Z}_1^T\big] \Big]$
$\quad + \sup_{\mathbf{h} \in H_A} \sum_{t=1}^T q_t \Big[ \mathbb{E}\big[L(h_t(X_{T+1}), h^*(X_{T+1})) \mid \mathbf{Z}_1^T\big] - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \Big].$
The second supremum relates to $\mathrm{disc}_{H_A}(\mathbf{q})$; by the $\mu$-Lipschitz property of $L$, the first is bounded by
$\mu \sup_{\mathbf{h} \in H_A} \sum_{t=1}^T q_t \, \mathbb{E}\big[L(h^*(X_{T+1}), Y_{T+1}) \mid \mathbf{Z}_1^T\big] = \mu \, \mathbb{E}\big[L(h^*(X_{T+1}), Y_{T+1}) \mid \mathbf{Z}_1^T\big].$
Algorithms
On-Line-to-Batch (OTB)
Problem: use $\mathbf{h} = (h_1, \ldots, h_T)$ to derive a hypothesis $h \in H$ with small path-dependent expected loss
$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) = \mathbb{E}_{Z_{T+1}}\big[L(h, Z_{T+1}) \mid \mathbf{Z}_1^T\big].$
OTB Algorithm
Idea: choose the weights $\mathbf{q}$ to minimize the bound for $h = \sum_{t=1}^T q_t h_t$:
$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^T q_t L(h_t, z_t) + \mathrm{disc}(\mathbf{q}) + M \|\mathbf{q}\|_2 \sqrt{2 \log \tfrac{1}{\delta}}.$
Optimization problem:
$\min_{\mathbf{q} \in \Delta} \ \sum_{t=1}^T q_t L(h_t, z_t) + \widehat{\mathrm{disc}}_{H_A}(\mathbf{q}) + \lambda \|\mathbf{q}\|_2.$
Solution: $h = \sum_{t=1}^T q_t h_t$.
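Any convex solver applies to this optimization problem. Purely as a sketch, the code below runs exponentiated gradient over the simplex, folding the discrepancy term into a fixed linear surrogate `disc_grad`; this simplification, the solver choice, and all parameter names are illustrative assumptions rather than the talk's prescription.

```python
import numpy as np

def otb_weights(losses, disc_grad, lam=0.1, eta=0.1, steps=500):
    """Minimize  q . losses + q . disc_grad + lam * ||q||_2  over the
    simplex by exponentiated gradient (a stand-in for a generic solver).

    losses:    (T,) per-round losses L(h_t, z_t).
    disc_grad: (T,) linear surrogate for the discrepancy term.
    """
    T = len(losses)
    q = np.full(T, 1.0 / T)                  # start from uniform weights
    for _ in range(steps):
        grad = losses + disc_grad + lam * q / max(np.linalg.norm(q), 1e-12)
        q = q * np.exp(-eta * grad)          # multiplicative update
        q /= q.sum()                         # renormalize onto the simplex
    return q
```

With `disc_grad = 0`, the weights concentrate on the low-loss rounds, as one would expect from the bound.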
Model Selection
Problem: given $N$ time series models, how should we use the sample $\mathbf{Z}_1^T$ to select a single best model?
in the i.i.d. case, cross-validation can be shown to be close to the structural risk minimization solution.
but how do we select a validation set for general stochastic processes?
models may have been pre-trained on $\mathbf{Z}_1^T$.
Model Selection Algorithm
Idea: use the learning bound in terms of the on-line discrepancy and the regret per round for the hypothesis
$h = \operatorname{argmin}_{h_t} \Big\{ \sum_{s=t}^T q_s L(h_t, z_s) + \mathrm{disc}(\mathbf{q}_t^T) + M \|\mathbf{q}_t^T\|_2 \sqrt{2 \log \tfrac{2(T+1)}{\delta}} \Big\}.$
Model Selection Algorithm
Algorithm:
choose $\mathbf{q} \in \Delta$ to minimize the estimated discrepancy: $\min_{\mathbf{q} \in \Delta} \widehat{\mathrm{disc}}_H(\mathbf{q})$.
use an on-line algorithm for prediction with expert advice to generate a sequence of hypotheses $\mathbf{h} \in H^T$, with $H$ the set of $N$ models.
select the model
$h = \operatorname{argmin}_{h_t} \Big\{ \sum_{s=t}^T q_s L(h_t, z_s) + \mathrm{disc}(\mathbf{q}_t^T) + M \|\mathbf{q}_t^T\|_2 \sqrt{2 \log \tfrac{2(T+1)}{\delta}} \Big\}.$
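The selection rule can be computed directly from suffix-weighted empirical losses. In the sketch below, `loss_matrix[t, s]` stands for $L(h_t, z_s)$ and the per-suffix discrepancies are passed in as a vector `disc`; these interfaces and the default constants are illustrative assumptions.

```python
import numpy as np

def select_model(loss_matrix, q, disc, M=1.0, delta=0.05):
    """Return the index t minimizing the bound-based selection score
    sum_{s>=t} q_s L(h_t, z_s) + disc(q_t^T)
        + M ||q_t^T||_2 sqrt(2 log(2(T+1)/delta)).
    """
    T = len(q)
    penalty = np.sqrt(2.0 * np.log(2.0 * (T + 1) / delta))
    scores = np.empty(T)
    for t in range(T):
        qt = q[t:]                                   # suffix weights q_t^T
        scores[t] = (qt * loss_matrix[t, t:]).sum() \
                    + disc[t] + M * np.linalg.norm(qt) * penalty
    return int(np.argmin(scores))
```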
Learning Ensembles
Problem: given a hypothesis set $H$ and a sample $\mathbf{Z}_1^T$, find a convex combination $h = \sum_{t=1}^T q_t h_t$ with $\mathbf{h} \in H_A$ and small path-dependent expected loss.
in the most general case, the hypotheses may have been pre-trained on $\mathbf{Z}_1^T$.
Ensemble Learning Algorithm
Algorithm:
run a regret minimization algorithm on $\mathbf{Z}_1^T$ to return $\mathbf{h} = (h_1, \ldots, h_T)$.
minimize the learning bound: for $\beta \ge 0$,
$\min_{\mathbf{q}} \ \widehat{\mathrm{disc}}_{H_A}(\mathbf{q}) + \beta \sum_{t=1}^T q_t L(h_t, z_t) \quad \text{subject to } \|\mathbf{q} - \mathbf{u}\|_2 \le \lambda.$
this is a convex optimization problem, by convexity of
$\widehat{\mathrm{disc}}_{H_A}(\mathbf{q}) = \sup_{h \in H,\, \mathbf{h} \in H_A} \sum_{t=1}^T q_t \big[ L\big(h_t(X_{T+1}), h(X_{T+1})\big) - L(h_t, z_t) \big]$
(a supremum of functions linear in $\mathbf{q}$).
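For intuition on the trade-off, with $T = 2$ the objective can be brute-forced over the simplex. The grid search below is a didactic stand-in for a proper convex solver: it drops the $\|\mathbf{q} - \mathbf{u}\|_2$ constraint and assumes a finite-$H$ encoding in which `disc_rows` holds one row of loss differences per candidate $h$ (all names are illustrative).

```python
import numpy as np

def ensemble_objective(q, disc_rows, losses, beta):
    """disc-hat(q) (max over candidate rows of row . q) + beta * q . losses."""
    return (disc_rows @ q).max() + beta * (q @ losses)

def grid_search_weights(disc_rows, losses, beta, steps=101):
    """Brute-force minimization over the T = 2 simplex on a uniform grid."""
    best_q, best_val = None, np.inf
    for a in np.linspace(0.0, 1.0, steps):
        q = np.array([a, 1.0 - a])
        val = ensemble_objective(q, disc_rows, losses, beta)
        if val < best_val:
            best_q, best_val = q, val
    return best_q, best_val
```

With symmetric discrepancy rows and equal losses, the minimizer is the uniform vector, matching the intuition that the $\|\mathbf{q} - \mathbf{u}\|_2$ constraint keeps $\mathbf{q}$ near uniform.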
Ensemble Learning Algorithm
If the hypothesis set $H$ is finite, the supremum
$\sup_{h \in H,\, \mathbf{h} \in H_A} \sum_{t=1}^T q_t \big[ L\big(h_t(X_{T+1}), h(X_{T+1})\big) - L(h_t, z_t) \big]$
can be computed straightforwardly.
If $H$ is not finite but is convex and the loss is convex, the maximization can be cast as a DC-programming problem and solved using the DC algorithm (Tao and An, 1998); for the squared loss, a global optimum is obtained.
Conclusion
Time series prediction using on-line algorithms:
new learning bounds for non-stationary, non-mixing processes.
an on-line discrepancy measure that can be estimated.
a general on-line-to-batch conversion.
application to model selection.
application to learning ensembles.
tools for tackling other time series problems.