Time Series Prediction & Online Learning


1 Time Series Prediction & Online Learning
Joint work with Vitaly Kuznetsov (Google Research).
Mehryar Mohri, Courant Institute & Google Research.

2 Motivation
Time series prediction:
- stock values
- earthquakes
- energy demand
- signal processing
- sales forecasting
- election forecasts
- economic variables
- weather, e.g., local/global temperature

3 Time Series
[Figure: NY unemployment rate over time]

4 Time Series
[Figure: US Presidential Election 2016]

5 Two Learning Scenarios
Stochastic scenario:
- distributional assumption
- performance measure: expected loss
- guarantees: generalization bounds
On-line scenario:
- no distributional assumption
- performance measure: regret
- guarantees: regret bounds
- active research area: (Cesa-Bianchi and Lugosi, 2006; Anava et al., 2013, 2015, 2016; Bousquet and Warmuth, 2002; Herbster and Warmuth, 1998, 2001; Koolen et al., 2015)

6 On-Line Learning

7 On-Line Learning Setup
Adversarial setting with hypothesis/action set H.
For t = 1 to T:
- player receives x_t \in X
- player selects h_t \in H
- adversary selects y_t \in Y
- player incurs loss L(h_t(x_t), y_t)
Objective: minimize the (external) regret
Reg_T = \sum_{t=1}^T L(h_t(x_t), y_t) - \min_{h \in H} \sum_{t=1}^T L(h(x_t), y_t).
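As a concrete illustration of the quantity being minimized, here is a minimal Python sketch of external regret computed from recorded losses for a finite comparison class; the array shapes and NumPy usage are our own illustration, not part of the slides:

```python
import numpy as np

def external_regret(player_losses, expert_losses):
    """Cumulative player loss minus the cumulative loss of the best fixed comparator.

    player_losses: array of shape (T,), loss L(h_t(x_t), y_t) at each round.
    expert_losses: array of shape (T, N), loss each fixed comparator would have incurred.
    """
    player_losses = np.asarray(player_losses)
    expert_losses = np.asarray(expert_losses)
    return player_losses.sum() - expert_losses.sum(axis=0).min()
```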

8 Example: Exponential Weights (EW)
Expert set H' = {E_1, ..., E_N}, H = conv(H').
EW({E_1, ..., E_N})   (parameter \eta > 0)
    for i = 1 to N do
        w_{1,i} <- 1
    for t = 1 to T do
        Receive(x_t)
        h_t <- (\sum_{i=1}^N w_{t,i} E_i) / (\sum_{i=1}^N w_{t,i})
        Receive(y_t)
        Incur-Loss(L(h_t(x_t), y_t))
        for i = 1 to N do
            w_{t+1,i} <- w_{t,i} e^{-\eta L(E_i(x_t), y_t)}
    return h_T
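A short Python sketch of the EW forecaster above, for real-valued expert predictions and a user-supplied loss in [0, 1]; the function names and the vectorized form are illustrative assumptions, not from the slides:

```python
import numpy as np

def exponential_weights(expert_preds, outcomes, loss, eta):
    """Exponential Weights over T rounds with N experts.

    expert_preds: array (T, N), expert i's prediction at round t.
    outcomes: array (T,), the outcomes y_t.
    loss: function (prediction, outcome) -> value in [0, 1], convex in its first argument.
    eta: learning rate eta > 0.
    Returns the player's predictions h_t(x_t), one per round.
    """
    T, N = expert_preds.shape
    w = np.ones(N)                              # w_{1,i} = 1
    player_preds = np.empty(T)
    for t in range(T):
        p = w / w.sum()                         # normalized weights
        player_preds[t] = p @ expert_preds[t]   # h_t: weighted combination of experts
        losses_t = np.array([loss(expert_preds[t, i], outcomes[t]) for i in range(N)])
        w = w * np.exp(-eta * losses_t)         # w_{t+1,i} = w_{t,i} exp(-eta L(E_i(x_t), y_t))
    return player_preds

# Example usage with the tuned rate discussed on the next slide:
# T, N = expert_preds.shape
# preds = exponential_weights(expert_preds, outcomes,
#                             lambda p, y: (p - y) ** 2, np.sqrt(8 * np.log(N) / T))
```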

9 EW Guarantee
Theorem: assume that L is convex in its first argument and takes values in [0, 1]. Then, for any \eta > 0 and any sequence y_1, ..., y_T \in Y, the regret of EW at time T satisfies
Reg_T \le \frac{\log N}{\eta} + \frac{\eta T}{8}.
For \eta = \sqrt{8 \log N / T},
Reg_T \le \sqrt{(T/2) \log N},
i.e., \frac{Reg_T}{T} = O\Big(\sqrt{\frac{\log N}{T}}\Big).
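The stated value of \eta is the one balancing the two terms of the bound; the elementary optimization step, not spelled out on the slide, is:

```latex
\min_{\eta>0}\left(\frac{\log N}{\eta}+\frac{\eta T}{8}\right):
\qquad
-\frac{\log N}{\eta^{2}}+\frac{T}{8}=0
\;\Longrightarrow\;
\eta^{\star}=\sqrt{\frac{8\log N}{T}},
\qquad
\frac{\log N}{\eta^{\star}}+\frac{\eta^{\star}T}{8}=\sqrt{\frac{T\log N}{2}}.
```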

10 EW - Proof
Potential: \Phi_t = \log \sum_{i=1}^N w_{t,i}.
Upper bound:
\Phi_t - \Phi_{t-1}
  = \log \frac{\sum_{i=1}^N w_{t-1,i} e^{-\eta L(E_i(x_t), y_t)}}{\sum_{i=1}^N w_{t-1,i}}
  = \log E_{w_{t-1}}[e^{-\eta L(E_i(x_t), y_t)}]
  \le \log \exp\big(-\eta E_{w_{t-1}}[L(E_i(x_t), y_t)] + \eta^2/8\big)   (Hoeffding's inequality)
  = -\eta E_{w_{t-1}}[L(E_i(x_t), y_t)] + \eta^2/8
  \le -\eta L\big(E_{w_{t-1}}[E_i(x_t)], y_t\big) + \eta^2/8   (convexity of L in its first argument)
  = -\eta L(h_t(x_t), y_t) + \eta^2/8.

11 EW - Proof
Upper bound: summing up the inequalities yields
\Phi_T - \Phi_0 \le -\eta \sum_{t=1}^T L(h_t(x_t), y_t) + \frac{\eta^2 T}{8}.
Lower bound:
\Phi_T - \Phi_0 = \log \sum_{i=1}^N e^{-\eta \sum_{t=1}^T L(E_i(x_t), y_t)} - \log N
  \ge \log \max_{i=1}^N e^{-\eta \sum_{t=1}^T L(E_i(x_t), y_t)} - \log N
  = -\eta \min_{i=1}^N \sum_{t=1}^T L(E_i(x_t), y_t) - \log N.
Comparison:
\sum_{t=1}^T L(h_t(x_t), y_t) - \min_{i=1}^N \sum_{t=1}^T L(E_i(x_t), y_t) \le \frac{\log N}{\eta} + \frac{\eta T}{8}.

12 Questions
Can we exploit both stochastic and on-line results?
Can we tackle notoriously difficult time series problems?
- on-line-to-batch conversion
- model selection
- learning ensembles

13 On-line-to-Batch Conversion

14 On-Line-to-Batch (OTB)
Input: sequence of hypotheses h = (h_1, ..., h_T) returned after T rounds by an on-line algorithm A minimizing the general regret
Reg_T = \sum_{t=1}^T L(h_t, Z_t) - \inf_{h^* \in H} \sum_{t=1}^T L(h^*, Z_t).
Problem: use h = (h_1, ..., h_T) to derive a hypothesis h \in H with small path-dependent expected loss
L_{T+1}(h, Z_1^T) = E_{Z_{T+1}}[ L(h, Z_{T+1}) | Z_1^T ].
The IID case is standard: (Littlestone, 1989), (Cesa-Bianchi et al., 2004).
What about a general stochastic process?

15 Standard Assumptions
Stationarity: (Z_t, ..., Z_{t+m}) and (Z_{t+k}, ..., Z_{t+m+k}) have the same distribution.
Mixing: dependence between events A and B decays with the gap k between them.
[Figure: events A and B on the time axis, separated by a gap between positions n and n + k]

16 Problem
Stationarity and mixing assumptions:
- widely adopted: (Alquier and Wintenberger, 2010, 2014), (Agarwal and Duchi, 2013), (Lozano et al., 1997), (Vidyasagar, 1997), (Yu, 1994), (Meir, 2000), (MM and Rostamizadeh, 2000), (Kuznetsov and MM, 2014).
But:
- they often do not hold (think of trends or periodic signals)
- they are not testable
- estimating mixing parameters can be hard, even if the general functional form is known
- the hypothesis set and loss function are ignored
We need a new tool for the analysis.

17 Relevant Quantity
Key difference: L_t(h_t, Z_1^{t-1}) versus L_{T+1}(h_t, Z_1^T).
[Figure: time axis with marks 1, ..., t, t+1, ..., T, T+1]
Average difference:
\frac{1}{T} \sum_{t=1}^T \big[ L_{T+1}(h_t, Z_1^T) - L_t(h_t, Z_1^{t-1}) \big].

18 On-line Discrepancy
Definition:
disc(q) = \sup_{h \in H_A} \sum_{t=1}^T q_t \big[ L_{T+1}(h_t, Z_1^T) - L_t(h_t, Z_1^{t-1}) \big],
where H_A is the set of hypothesis sequences that A can return and q = (q_1, ..., q_T) is an arbitrary weight vector.
- natural measure of non-stationarity or dependency
- captures the hypothesis set and the loss function
- can be efficiently estimated under mild assumptions
- generalizes the definition of (Kuznetsov and MM, 2015)

19 Discrepancy Estimation
Batch discrepancy estimation method (Kuznetsov and MM, 2015).
Alternative method:
- assume that the loss is \mu-Lipschitz
- assume that there exists an accurate hypothesis h^*, i.e.,
\eta = \inf_{h^* \in H} E\big[ L(h^*(X_{T+1}), Y_{T+1}) \,\big|\, Z_1^T \big] is small.

20 Discrepancy Estimation
Lemma: fix a sequence Z_1^T in Z. Then, for any \delta > 0, with probability at least 1 - \delta, the following holds for all \alpha > 0:
disc(q) \le \widehat{disc}_H(q) + \mu\eta + 2\alpha + M \|q\|_2 \sqrt{2 \log \frac{E[N_1(\alpha, G, z)]}{\delta}},
where
\widehat{disc}_H(q) = \sup_{h \in H_A, h^* \in H} \sum_{t=1}^T q_t \big[ L(h_t(X_{T+1}), h^*(X_{T+1})) - L(h_t, Z_t) \big].
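A rough Python sketch of how \widehat{disc}_H(q) can be approximated when both suprema are taken over finite candidate sets; the candidate sets, callables, and loss are illustrative assumptions, and in general the supremum requires solving an optimization problem rather than this enumeration:

```python
import numpy as np
from itertools import product

def empirical_discrepancy(q, candidate_seqs, candidate_hs, x_next, sample, loss):
    """Approximate sup over (h in H_A, h* in H) of
       sum_t q_t [ L(h_t(x_{T+1}), h*(x_{T+1})) - L(h_t, z_t) ].

    q: weight vector of length T.
    candidate_seqs: finite list of hypothesis sequences, each a list of T callables h_t.
    candidate_hs: finite list of callables h* evaluated at x_{T+1}.
    x_next: the next input x_{T+1}.
    sample: list of T observed pairs (x_t, y_t), i.e. Z_1^T.
    loss: function (prediction, target) -> real.
    """
    best = -np.inf
    for h_seq, h_star in product(candidate_seqs, candidate_hs):
        target = h_star(x_next)
        val = sum(
            q[t] * (loss(h_t(x_next), target) - loss(h_t(x_t), y_t))
            for t, (h_t, (x_t, y_t)) in enumerate(zip(h_seq, sample))
        )
        best = max(best, val)
    return best
```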

21 Proof Sketch
disc(q) = \sup_{h \in H_A} \sum_{t=1}^T q_t \big[ L_{T+1}(h_t, Z_1^T) - L_t(h_t, Z_1^{t-1}) \big]
\le \sup_{h \in H_A} \sum_{t=1}^T q_t \Big[ L_{T+1}(h_t, Z_1^T) - E\big[ L(h_t(X_{T+1}), h^*(X_{T+1})) \,\big|\, Z_1^T \big] \Big]
  + \sup_{h \in H_A} \sum_{t=1}^T q_t \Big[ E\big[ L(h_t(X_{T+1}), h^*(X_{T+1})) \,\big|\, Z_1^T \big] - L_t(h_t, Z_1^{t-1}) \Big].
The second term is bounded in terms of \widehat{disc}_H(q); by the \mu-Lipschitzness of the loss, the first term satisfies
\sup_{h \in H_A} \sum_{t=1}^T q_t \Big[ L_{T+1}(h_t, Z_1^T) - E\big[ L(h_t(X_{T+1}), h^*(X_{T+1})) \,\big|\, Z_1^T \big] \Big]
\le \mu \sup_{h \in H_A} \sum_{t=1}^T q_t E\big[ L(h^*(X_{T+1}), Y_{T+1}) \,\big|\, Z_1^T \big]
= \mu E\big[ L(h^*(X_{T+1}), Y_{T+1}) \,\big|\, Z_1^T \big].

22 Lemma
Lemma: let L be a convex loss bounded by M and h a hypothesis sequence adapted to Z_1^T. Fix q \in \Delta. Then, for any \delta > 0, the following holds with probability at least 1 - \delta for the hypothesis h = \sum_{t=1}^T q_t h_t:
L_{T+1}(h, Z_1^T) \le \sum_{t=1}^T q_t L(h_t, Z_t) + disc(q) + M \|q\|_2 \sqrt{2 \log \frac{1}{\delta}}.

23 Proof
By convexity of the loss:
L_{T+1}(h, Z_1^T) \le \sum_{t=1}^T q_t L_{T+1}(h_t, Z_1^T).
By definition of the on-line discrepancy,
\sum_{t=1}^T q_t \big[ L_{T+1}(h_t, Z_1^T) - L_t(h_t, Z_1^{t-1}) \big] \le disc(q).
A_t = q_t \big[ L_t(h_t, Z_1^{t-1}) - L(h_t, Z_t) \big] is a martingale difference sequence, thus by Azuma's inequality, with high probability,
\sum_{t=1}^T q_t L_t(h_t, Z_1^{t-1}) \le \sum_{t=1}^T q_t L(h_t, Z_t) + M \|q\|_2 \sqrt{2 \log \frac{1}{\delta}}.

24 Learning Guarantee
Theorem: let L be a convex loss bounded by M and H a set of hypothesis sequences adapted to Z_1^T. Fix q \in \Delta. Then, for any \delta > 0, the following holds with probability at least 1 - \delta for the hypothesis h = \sum_{t=1}^T q_t h_t:
L_{T+1}(h, Z_1^T) \le \inf_{h^* \in H} L_{T+1}(h^*, Z_1^T) + 2\,disc(q) + \frac{Reg_T}{T} + M \|q - u\|_1 + 2 M \|q\|_2 \sqrt{2 \log \frac{2}{\delta}},
where u is the uniform weight vector (1/T, ..., 1/T).
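The hypothesis to which this guarantee applies is simply the q-weighted combination of the online iterates; a minimal Python sketch of that conversion step, assuming the hypotheses are callables returning real predictions:

```python
def otb_ensemble(hypotheses, q):
    """On-line-to-batch conversion: return h(x) = sum_t q_t * h_t(x).

    hypotheses: list of T callables h_1, ..., h_T produced by the online algorithm.
    q: list of T nonnegative weights summing to one.
    """
    def h(x):
        return sum(q_t * h_t(x) for q_t, h_t in zip(q, hypotheses))
    return h
```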

25 Notes
The theorem extends to non-convex losses when h is selected as follows:
h = \mathrm{argmin}_{h_t} \Big\{ \sum_{s=t}^T q_s L(h_t, Z_s) + disc(q_t^T) + M \|q_t^T\|_2 \sqrt{2 \log \frac{2(T+1)}{\delta}} \Big\},
where q_t^T = (q_t, ..., q_T).
Learning guarantees with the same flavor as those of (Kuznetsov and MM, 2015), but with simpler proofs and no complexity measure.
They admit as special cases the learning guarantees for:
- the i.i.d. scenario (Littlestone, 1989), (Cesa-Bianchi et al., 2004)
- the drifting scenario (MM and Muñoz Medina, 2012)

26 Extension
General regret definition:
Reg_T = \sum_{t=1}^T L(h_t, Z_t) - \inf_{h^* \in H} \Big[ \sum_{t=1}^T L(h^*_t, Z_t) + R(h^*) \Big].
- standard regret: R = 0, H the set of constant sequences
- tracking: H \subseteq H^T
- R can be a kernel-based regularization (Herbster and Warmuth, 2001)

27 Stable Hypothesis Sequences
Hypotheses are no longer adapted, but output by a uniformly stable algorithm.
Stable hypotheses: H = { h \in H : there exists A \in \mathcal{A} such that h = A(Z_1^T) }.
\beta_t = \beta_{h_t}: stability coefficient of the algorithm returning h_t.
Similar learning bounds hold, with an additional term \sum_{t=1}^T q_t \beta_t.
They admit as special cases the results of (Agarwal and Duchi, 2013) for asymptotically stationary mixing processes.

28 Applications

29 Model Selection
Problem: given N time series models, how should we use the sample Z_1^T to select a single best model?
- in the i.i.d. case, cross-validation can be shown to be close to the structural risk minimization solution.
- but how do we select a validation set for a general stochastic process? use the most recent data? use various splits? use the most distant data?
- the models may have been pre-trained on Z_1^T.

30 Model Selection
Algorithm:
- choose q \in \Delta to minimize the empirical discrepancy: \min_{q \in \Delta} \widehat{disc}_H(q).
- use an on-line algorithm for prediction with expert advice to generate a sequence of hypotheses h \in H^T, with H the set of N models.
- select the model according to
h = \mathrm{argmin}_{h_t} \Big\{ \sum_{s=t}^T q_s L(h_t, Z_s) + disc(q_t^T) + M \|q_t^T\|_2 \sqrt{2 \log \frac{2(T+1)}{\delta}} \Big\}.
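A Python sketch of this selection rule; the precomputed loss matrix, the `disc_tail` estimator for disc(q_t^T), and the constants M and delta are all assumptions made for illustration:

```python
import numpy as np

def select_model(losses, q, disc_tail, M, delta):
    """Select t* minimizing the penalized tail-weighted loss of h_t.

    losses: array (T, T) with losses[t, s] = L(h_t, Z_s).
    q: weight vector of length T.
    disc_tail: function t -> estimate of disc(q_t^T), the discrepancy for the tail weights q[t:].
    M: bound on the loss; delta: confidence parameter in (0, 1).
    Returns the index t* of the selected hypothesis/model.
    """
    T = len(q)
    scores = np.empty(T)
    for t in range(T):
        tail_q = q[t:]
        weighted_loss = float(tail_q @ losses[t, t:])
        penalty = disc_tail(t) + M * np.linalg.norm(tail_q) * np.sqrt(2 * np.log(2 * (T + 1) / delta))
        scores[t] = weighted_loss + penalty
    return int(np.argmin(scores))
```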

31 Learning Ensembles
Problem: given a hypothesis set H and a sample Z_1^T, find an accurate convex combination h = \sum_{t=1}^T q_t h_t with h \in H_A and q \in \Delta.
In the most general case, the hypotheses may have been pre-trained on Z_1^T.

32 Learning Ensembles
Algorithm:
- run regret minimization on Z_1^T to return h.
- minimize the learning bound: for \lambda \ge 0,
\min_{q} \; \widehat{disc}_H(q) + \sum_{t=1}^T q_t L(h_t, Z_t) \quad subject to \quad \|q - u\|_2 \le \lambda.
- for a convex loss and convex H, this can be cast as a DC-programming problem and solved using the DC algorithm (Tao and An, 1998).
- for the squared loss, a global optimum can be found.

33 Conclusion
Time series prediction using on-line algorithms:
- new learning bounds for non-stationary, non-mixing processes
- on-line discrepancy measure that can be estimated
- general on-line-to-batch conversion
- application to model selection
- application to learning ensembles
- tools for tackling other time series problems
