Advanced Machine Learning


1 Advanced Machine Learning: Time Series Prediction. VITALY KUZNETSOV, GOOGLE RESEARCH. MEHRYAR MOHRI, COURANT INSTITUTE & GOOGLE RESEARCH.

2 Motivation Time series prediction: stock values. earthquakes. energy demand. signal processing. sales forecasting. economic variables. weather: e.g., local and global temperature. sensors: Internet-of-Things. page 2

3 Google Trends page 3

4 Google Trends page 4

5 Google Trends page 5

6 Challenges Standard Supervised Learning: IID assumption. Same distribution for training and test data. Distributions fixed over time (stationarity). none of these assumptions holds for time series! page 6

7 Outline Introduction to time series analysis. Learning theory for forecasting non-stationary time series. Algorithms for forecasting non-stationary time series. Time series prediction and on-line learning. page 7

8 Introduction to Time Series Analysis

9 Classical Framework Postulate a particular form of a parametric model that is assumed to generate data. Use given sample to estimate unknown parameters of the model. Use estimated model to make predictions. page 9

10 Autoregressive (AR) Models Definition: the AR(p) model is a linear generative model based on the p-th order Markov assumption:
∀t, Y_t = Σ_{i=1}^p a_i Y_{t−i} + ε_t,
where the ε_t's are zero-mean uncorrelated random variables with common variance σ², a_1, ..., a_p are the autoregressive coefficients, and Y_t is the observed stochastic process. page 10
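As a concrete companion to the definition above, here is a minimal NumPy sketch (not from the lecture) that simulates an AR(2) process and recovers the coefficients by ordinary least squares; the coefficient values, sample size, and noise scale are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(2) process: Y_t = a_1 Y_{t-1} + a_2 Y_{t-2} + eps_t
a = np.array([0.6, -0.3])          # illustrative autoregressive coefficients
T, p = 2000, len(a)
Y = np.zeros(T)
eps = rng.normal(size=T)
for t in range(p, T):
    Y[t] = a @ Y[t - p:t][::-1] + eps[t]   # a_1 Y_{t-1} + ... + a_p Y_{t-p}

# Least-squares estimate of a from the lagged regressors
X = np.column_stack([Y[p - i - 1:T - i - 1] for i in range(p)])  # cols: Y_{t-1}, ..., Y_{t-p}
y = Y[p:]
a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("true a:", a, "estimated a:", a_hat)
```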

11 Moving Averages (MA) Definition: the MA(q) model is a linear generative model for the noise term based on the q-th order Markov assumption:
∀t, Y_t = ε_t + Σ_{j=1}^q b_j ε_{t−j},
where b_1, ..., b_q are the moving average coefficients. page 11

12 ARMA model (Whittle, 1951; Box & Jenkins, 1971) Definition: the ARMA(p, q) model is a generative linear model that combines the AR(p) and MA(q) models:
∀t, Y_t = Σ_{i=1}^p a_i Y_{t−i} + ε_t + Σ_{j=1}^q b_j ε_{t−j}. page 12

13 ARMA page 13

14 Stationarity Definition: a sequence of random variables Z = {Z_t}_{t=−∞}^{+∞} is stationary if its distribution is invariant to shifting in time, that is, (Z_t, ..., Z_{t+m}) and (Z_{t+k}, ..., Z_{t+m+k}) have the same distribution for all t, m, and k. page 14

15 Weak Stationarity Definition: a sequence of random variables Z = {Z_t}_{t=−∞}^{+∞} is weakly stationary if its first and second moments are invariant to shifting in time, that is, E[Z_t] is independent of t and E[Z_t Z_{t−j}] = f(j) for some function f. page 15

16 Lag Operator The lag operator L is defined by L Y_t = Y_{t−1}. ARMA model in terms of the lag operator:
(1 − Σ_{i=1}^p a_i L^i) Y_t = (1 + Σ_{j=1}^q b_j L^j) ε_t.
The characteristic polynomial
P(z) = 1 − Σ_{i=1}^p a_i z^i
can be used to study properties of this stochastic process. page 16

17 Weak Stationarity of ARMA Theorem: an ARMA( p, q) process is weakly stationary if the roots of the characteristic polynomial P (z) are outside the unit circle. page 17
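The theorem's condition is easy to check numerically once the AR coefficients are given. A small sketch (illustrative coefficients, not from the slides): form P(z) = 1 − a_1 z − ... − a_p z^p and test whether all of its roots lie strictly outside the unit circle.

```python
import numpy as np

def ar_part_weakly_stationary(a):
    """Check whether all roots of P(z) = 1 - a_1 z - ... - a_p z^p
    lie strictly outside the unit circle."""
    # np.roots expects coefficients ordered from the highest degree down:
    # P(z) = -a_p z^p - ... - a_1 z + 1
    coeffs = np.concatenate([-np.asarray(a, dtype=float)[::-1], [1.0]])
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(ar_part_weakly_stationary([0.6, -0.3]))   # True: roots outside the unit circle
print(ar_part_weakly_stationary([1.2, -0.1]))   # False: one root lies inside
```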

18 Proof If the roots of the characteristic polynomial are outside the unit circle, then
P(z) = 1 − Σ_{i=1}^p a_i z^i = c (λ_1 − z) ⋯ (λ_p − z) = c′ (1 − λ_1^{−1} z) ⋯ (1 − λ_p^{−1} z),
where |λ_i| > 1 for all i = 1, ..., p and c, c′ are constants. page 18

19 Proof Therefore, the ARMA(p, q) process
(1 − Σ_{i=1}^p a_i L^i) Y_t = (1 + Σ_{j=1}^q b_j L^j) ε_t
admits an MA(∞) representation:
Y_t = (1 − λ_1^{−1} L)^{−1} ⋯ (1 − λ_p^{−1} L)^{−1} (1 + Σ_{j=1}^q b_j L^j) ε_t,
where
(1 − λ_i^{−1} L)^{−1} = Σ_{k=0}^∞ λ_i^{−k} L^k
is well-defined since |λ_i^{−1}| < 1. page 19

20 Proof Therefore, it suffices to show that
Y_t = Σ_{j=0}^∞ ψ_j ε_{t−j}
is weakly stationary. The mean is constant:
E[Y_t] = Σ_{j=0}^∞ ψ_j E[ε_{t−j}] = 0.
The covariance function E[Y_t Y_{t−l}] only depends on the lag l:
E[Y_t Y_{t−l}] = Σ_{k=0}^∞ Σ_{j=0}^∞ ψ_k ψ_j E[ε_{t−j} ε_{t−l−k}] = σ² Σ_{j=0}^∞ ψ_j ψ_{j+l}. page 20

21 ARIMA Non-stationary processes can be modeled using processes whose characteristic polynomial has unit roots. A characteristic polynomial with unit roots can be factored as
P(z) = R(z) (1 − z)^D,
where R(z) has no unit roots. Definition: the ARIMA(p, D, q) model is an ARMA(p, q) model for (1 − L)^D Y_t:
(1 − Σ_{i=1}^p a_i L^i) (1 − L)^D Y_t = (1 + Σ_{j=1}^q b_j L^j) ε_t. page 21
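A small sketch of the transformation behind this definition (illustrative, not from the slides): apply (1 − L)^D by repeated differencing to obtain a series suitable for ARMA modeling, and invert the transform with cumulative sums; the random-walk example and D = 1 are arbitrary choices.

```python
import numpy as np

def difference(y, D=1):
    """Apply (1 - L)^D: return the differenced series together with the
    leading values needed to invert the transform."""
    heads = []
    for _ in range(D):
        heads.append(y[0])
        y = np.diff(y)
    return y, heads

def undifference(dy, heads):
    """Invert (1 - L)^D using the stored leading values."""
    for head in reversed(heads):
        dy = np.concatenate([[head], head + np.cumsum(dy)])
    return dy

# Example: a random walk (one unit root) becomes stationary after one difference.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=500))      # an ARIMA(0, 1, 0) sample path
dy, heads = difference(y, D=1)           # white noise: fit an ARMA model to this
assert np.allclose(undifference(dy, heads), y)
```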

23 Other Extensions Further variants: models with seasonal components (SARIMA). models with side information (ARIMAX). models with long-memory (ARFIMA). multi-variate time series models (VAR). models with time-varying coefficients. other non-linear models. page 23

24 Modeling Variance Definition: the generalized autoregressive conditional heteroscedasticity GARCH(p, q) model is an ARMA(p, q) model for the variance σ_t² of the noise term ε_t:
∀t, σ_{t+1}² = ω + Σ_{i=0}^{p−1} α_i ε_{t−i}² + Σ_{j=0}^{q−1} β_j σ_{t−j}²,
where the ε_t's are zero-mean Gaussian random variables with variance σ_t² conditioned on {Y_{t−1}, Y_{t−2}, ...}, and ω > 0 is the mean parameter. (Engle, 1982; Bollerslev, 1986) page 24
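For concreteness, a minimal sketch (not from the slides) simulating the most common special case, GARCH(1,1), with arbitrary illustrative parameters ω, α, β chosen so that α + β < 1.

```python
import numpy as np

def simulate_garch_1_1(T, omega=0.1, alpha=0.1, beta=0.8, seed=0):
    """Simulate eps_t with conditional variance
    sigma_{t+1}^2 = omega + alpha * eps_t^2 + beta * sigma_t^2."""
    rng = np.random.default_rng(seed)
    eps, sigma2 = np.zeros(T), np.zeros(T)
    sigma2[0] = omega / (1.0 - alpha - beta)    # start at the unconditional variance
    for t in range(T):
        eps[t] = rng.normal(scale=np.sqrt(sigma2[t]))
        if t + 1 < T:
            sigma2[t + 1] = omega + alpha * eps[t] ** 2 + beta * sigma2[t]
    return eps, sigma2

eps, sigma2 = simulate_garch_1_1(5000)
print("sample variance:", eps.var(), "unconditional variance:", 0.1 / (1 - 0.9))
```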

26 State-Space Models Continuous state space version of Hidden Markov Models:
X_{t+1} = B X_t + U_t,    Y_t = A X_t + ε_t,
where X_t is an n-dimensional state vector, Y_t is the observed stochastic process, A and B are model parameters, and U_t and ε_t are noise terms. page 26
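A short simulation sketch of this model (illustrative dimensions, noise scales, and parameter matrices; not from the slides): a two-dimensional latent state evolving by a damped rotation, observed through a one-dimensional linear map.

```python
import numpy as np

def simulate_state_space(A, B, T, state_noise=0.1, obs_noise=0.1, seed=0):
    """Simulate X_{t+1} = B X_t + U_t and Y_t = A X_t + eps_t
    with Gaussian noise terms U_t and eps_t."""
    rng = np.random.default_rng(seed)
    n, m = B.shape[0], A.shape[0]
    X, Y = np.zeros((T, n)), np.zeros((T, m))
    X[0] = rng.normal(size=n)
    for t in range(T):
        Y[t] = A @ X[t] + obs_noise * rng.normal(size=m)
        if t + 1 < T:
            X[t + 1] = B @ X[t] + state_noise * rng.normal(size=n)
    return X, Y

theta = 0.05
B = 0.99 * np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])   # damped rotation of the state
A = np.array([[1.0, 0.0]])                                # observe the first coordinate
X, Y = simulate_state_space(A, B, T=500)
```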

27 State-Space Models page 27

28 Estimation Different methods for estimating model parameters: Maximum likelihood estimation: Requires further parametric assumptions on the noise distribution (e.g. Gaussian). Method of moments (Yule-Walker estimator). Restricted to certain models. Conditional and unconditional least square estimation. page 28
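As an example of the method of moments mentioned above, here is a hedged sketch of the Yule-Walker estimator for an AR(p) model: estimate the autocovariances from the sample and solve the resulting Toeplitz system (the simulated series and its coefficient are illustrative).

```python
import numpy as np

def yule_walker(y, p):
    """Yule-Walker (method of moments) estimate of AR(p) coefficients:
    solve Gamma a = g, where Gamma[i, j] = gamma(|i - j|) and g[k] = gamma(k + 1),
    with gamma(k) the sample autocovariance at lag k."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    T = len(y)
    gamma = np.array([np.dot(y[: T - k], y[k:]) / T for k in range(p + 1)])
    Gamma = np.array([[gamma[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(Gamma, gamma[1:])

# Illustrative usage on a simulated AR(1) path with coefficient 0.7.
rng = np.random.default_rng(1)
y = np.zeros(2000)
for t in range(1, len(y)):
    y[t] = 0.7 * y[t - 1] + rng.normal()
print(yule_walker(y, p=1))   # approximately [0.7]
```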

29 Invertibility of ARMA Definition: an ARMA(p, q) process is invertible if the roots of the polynomial
Q(z) = 1 + Σ_{j=1}^q b_j z^j
are outside the unit circle. page 29

30 Learning guarantee Theorem: assume that Y_t ~ ARMA(p, q) is weakly stationary and invertible. Let â_T denote the least squares estimate of a = (a_1, ..., a_p) and assume that p is known. Then, ‖â_T − a‖ converges in probability to zero. Similar results hold for other estimators and other models. page 30

31 Notes Many other generative models exist. Learning guarantees are asymptotic. Model needs to be correctly specified. Non-stationarity needs to be modeled explicitly. page 31

32 Theory

33 Time Series Forecasting Training data: a finite sample realization of some stochastic process,
(X_1, Y_1), ..., (X_T, Y_T) ∈ Z = X × Y.
Loss function: L: H × Z → [0, 1], where H is a hypothesis set of functions mapping from X to Y. Problem: find h ∈ H with small path-dependent expected loss,
L(h, Z_1^T) = E_{Z_{T+1}}[ L(h, Z_{T+1}) | Z_1^T ]. page 33

34 Standard Assumptions Stationarity: (Z_t, ..., Z_{t+m}) and (Z_{t+k}, ..., Z_{t+m+k}) have the same distribution. Mixing: the dependence between events A and B, separated by a gap of k time steps, decays with k. page 34

35 β-mixing Definition: a sequence of random variables Z = {Z_t}_{t=−∞}^{+∞} is β-mixing if
β(k) = sup_n E_{B ∈ σ(Z_{−∞}^n)} [ sup_{A ∈ σ(Z_{n+k}^{+∞})} | P[A | B] − P[A] | ] → 0,
that is, the dependence between events decays with the gap k. page 35

36 Learning Theory Stationary and β-mixing process: generalization bounds. PAC-learning preserved in that setting (Vidyasagar, 1997). covering number bounds for regression (Meir, 2000). VC-dimension bounds for binary classification (Yu, 1994). Rademacher complexity bounds for general loss functions (MM and Rostamizadeh, 2000). PAC-Bayesian bounds (Alquier et al., 2014). page 36

37 Learning Theory Stationarity and mixing: algorithm-dependent bounds. AdaBoost (Lozano et al., 1997). general stability bounds (MM and Rostamizadeh, 2010). regularized ERM (Steinwart and Christmann, 2009). stable on-line algorithms (Agarwal and Duchi, 2013). page 37

38 Problem Stationarity and mixing assumptions: often do not hold (think trend or periodic signals). not testable. estimating mixing parameters can be hard, even if general functional form known. hypothesis set and loss function ignored. page 38

39 Questions Is learning with general (non-stationary, non-mixing) stochastic processes possible? Can we design algorithms with theoretical guarantees? need a new tool for the analysis. page 39

40 Key Quantity - Fixed h Key difference (for 1 ≤ t ≤ T, with prediction at time T + 1):
L(h, Z_1^{t−1}) − L(h, Z_1^T).
Key average quantity:
(1/T) Σ_{t=1}^T [ L(h, Z_1^T) − L(h, Z_1^{t−1}) ]. page 40

41 Discrepancy Definition:
Δ = sup_{h∈H} [ L(h, Z_1^T) − (1/T) Σ_{t=1}^T L(h, Z_1^{t−1}) ].
captures hypothesis set and loss function. Δ = 0 in the IID case or for weakly stationary processes with linear hypotheses and squared loss. can be estimated from data, under mild assumptions (K and MM, 2014). page 41

42 Weighted Discrepancy Definition: extension to weights q = (q_1, ..., q_T):
Δ(q) = sup_{h∈H} [ L(h, Z_1^T) − Σ_{t=1}^T q_t L(h, Z_1^{t−1}) ].
strictly extends the discrepancy definition in drifting (MM and Muñoz Medina, 2012) or domain adaptation (Mansour, MM, Rostamizadeh 2009; Cortes and MM 2011, 2014); or for the binary loss (Devroye et al., 1996; Ben-David et al., 2007). admits upper bounds in terms of relative entropy, or in terms of β-mixing coefficients and asymptotic stationarity for an asymptotically stationary process. page 42

43 Estimation Decomposition: Δ(q) ≤ Δ_0(q) + Δ_s, with
Δ(q) ≤ sup_{h∈H} [ (1/s) Σ_{t=T−s+1}^T L(h, Z_1^{t−1}) − Σ_{t=1}^T q_t L(h, Z_1^{t−1}) ] + sup_{h∈H} [ L(h, Z_1^T) − (1/s) Σ_{t=T−s+1}^T L(h, Z_1^{t−1}) ],
where s is the number of most recent observations used. page 43

44 Learning Guarantee Theorem: for any δ > 0, with probability at least 1 − δ, for all h ∈ H and all α > 0,
L(h, Z_1^T) ≤ Σ_{t=1}^T q_t L(h, Z_t) + Δ(q) + 2α + ‖q‖_2 √( 2 log ( E_z[N_1(α, G, z)] / δ ) ),
where G = {z ↦ L(h, z) : h ∈ H}. page 44

45 Bound with Emp. Discrepancy Corollary: for any δ > 0, with probability at least 1 − δ, for all h ∈ H and all α > 0,
L(h, Z_1^T) ≤ Σ_{t=1}^T q_t L(h, Z_t) + Δ̂(q) + Δ_s + 4α + [ ‖q‖_2 + ‖q − u_s‖_2 ] √( 2 log ( E_z[N_1(α, G, z)] / δ ) ),
where
Δ̂(q) = sup_{h∈H} [ (1/s) Σ_{t=T−s+1}^T L(h, Z_t) − Σ_{t=1}^T q_t L(h, Z_t) ],
u_s is the uniform distribution over the last s points, and G = {z ↦ L(h, z) : h ∈ H}. page 45

46 Weighted Sequential α-cover (Rakhlin et al., 2010; K and MM, 2015) Definition: let z be a Z-valued full binary tree of depth T. Then, a set of trees V is an l_1-norm q-weighted α-cover of a function class G on z if
∀g ∈ G, ∀σ ∈ {±1}^T, ∃v ∈ V:  Σ_{t=1}^T q_t | v_t(σ) − g(z_t(σ)) | ≤ α ‖q‖_1.
[Illustration: a path σ through the tree z visits nodes such as z_1, z_3, z_6, z_12, ...; the vector of deviations (v_1 − g(z_1), v_3 − g(z_3), v_6 − g(z_6), v_12 − g(z_12), ...) along that path has weighted l_1-norm at most α ‖q‖_1.] page 46

47 Sequential Covering Numbers Definitions: sequential covering number:
N_1(α, G, z) = min{ |V| : V is an l_1-norm q-weighted α-cover of G on z }.
expected sequential covering number: E_{z∼Z_T}[ N_1(α, G, z) ], where Z_T is the distribution over trees induced by the Z_t's: the root is drawn as (Z_1, Z'_1) ∼ D_1, its children as (Z_2, Z'_2) ∼ D_2(· | Z_1) and (Z_3, Z'_3) ∼ D_2(· | Z'_1), then (Z_4, Z'_4) ∼ D_3(· | Z_1, Z_2), (Z_5, Z'_5) ∼ D_3(· | Z_1, Z'_2), and so on. page 47

48 Proof Key quantities:
Δ(q) = sup_{h∈H} [ L(h, Z_1^T) − Σ_{t=1}^T q_t L(h, Z_1^{t−1}) ],
Φ(Z_1^T) = sup_{h∈H} [ L(h, Z_1^T) − Σ_{t=1}^T q_t L(h, Z_t) ].
Chernoff technique: for any t > 0,
P[ Φ(Z_1^T) − Δ(q) > ε ]
≤ P[ sup_{h∈H} Σ_{t=1}^T q_t ( L(h, Z_1^{t−1}) − L(h, Z_t) ) > ε ]   (sub-add. of sup)
= P[ exp( t sup_{h∈H} Σ_{t=1}^T q_t ( L(h, Z_1^{t−1}) − L(h, Z_t) ) ) > e^{tε} ]   (t > 0)
≤ e^{−tε} E[ exp( t sup_{h∈H} Σ_{t=1}^T q_t ( L(h, Z_1^{t−1}) − L(h, Z_t) ) ) ].   (Markov's ineq.) page 48

49 Symmetrization Key tool: decoupled tangent sequence Z'_1, ..., Z'_T associated to Z_1^T: Z_t and Z'_t are i.i.d. given Z_1^{t−1}.
P[ Φ(Z_1^T) − Δ(q) > ε ]
≤ e^{−tε} E[ exp( t sup_{h∈H} Σ_{t=1}^T q_t ( L(h, Z_1^{t−1}) − L(h, Z_t) ) ) ]
= e^{−tε} E[ exp( t sup_{h∈H} Σ_{t=1}^T q_t ( E[L(h, Z'_t) | Z_1^{t−1}] − L(h, Z_t) ) ) ]   (tangent seq.)
= e^{−tε} E[ exp( t sup_{h∈H} E[ Σ_{t=1}^T q_t ( L(h, Z'_t) − L(h, Z_t) ) | Z_1^T ] ) ]   (lin. of expectation)
≤ e^{−tε} E[ exp( t sup_{h∈H} Σ_{t=1}^T q_t ( L(h, Z'_t) − L(h, Z_t) ) ) ].   (Jensen's ineq.) page 49

50 Symmetrization
P[ Φ(Z_1^T) − Δ(q) > ε ]
≤ e^{−tε} E[ exp( t sup_{h∈H} Σ_{t=1}^T q_t ( L(h, Z'_t) − L(h, Z_t) ) ) ]
= e^{−tε} E_{(z,z′)} E_σ[ exp( t sup_{h∈H} Σ_{t=1}^T q_t σ_t ( L(h, z'_t(σ)) − L(h, z_t(σ)) ) ) ]   (tangent seq. prop.)
≤ e^{−tε} E_{(z,z′)} E_σ[ exp( t sup_{h∈H} Σ_{t=1}^T q_t σ_t L(h, z'_t(σ)) + t sup_{h∈H} Σ_{t=1}^T (−q_t σ_t) L(h, z_t(σ)) ) ]   (sub-add. of sup)
≤ e^{−tε} E_{(z,z′)} E_σ[ ½ exp( 2t sup_{h∈H} Σ_{t=1}^T q_t σ_t L(h, z'_t(σ)) ) + ½ exp( 2t sup_{h∈H} Σ_{t=1}^T (−q_t σ_t) L(h, z_t(σ)) ) ]   (convexity of exp)
= e^{−tε} E_z E_σ[ exp( 2t sup_{h∈H} Σ_{t=1}^T q_t σ_t L(h, z_t(σ)) ) ]. page 50

51 Covering Number
P[ Φ(Z_1^T) − Δ(q) > ε ]
≤ e^{−tε} E_z E_σ[ exp( 2t sup_{h∈H} Σ_{t=1}^T q_t σ_t L(h, z_t(σ)) ) ]
≤ e^{−tε} E_z E_σ[ exp( 2t ( max_{v∈V} Σ_{t=1}^T q_t σ_t v_t(σ) + α ) ) ]   (α-covering)
≤ e^{−t(ε−2α)} E_z[ Σ_{v∈V} E_σ[ exp( 2t Σ_{t=1}^T q_t σ_t v_t(σ) ) ] ]   (monotonicity of exp)
≤ e^{−t(ε−2α)} E_z[ Σ_{v∈V} exp( 2t² ‖q‖_2² ) ]   (Hoeffding's ineq.)
≤ E_z[ N_1(α, G, z) ] exp( −t(ε − 2α) + 2t² ‖q‖_2² ). page 51

52 Algorithms

53 Review Theorem: for any δ > 0, with probability at least 1 − δ, for all h ∈ H and all α > 0,
L(h, Z_1^T) ≤ Σ_{t=1}^T q_t L(h, Z_t) + Δ̂(q) + Δ_s + 4α + [ ‖q‖_2 + ‖q − u_s‖_2 ] √( 2 log ( E_z[N_1(α, G, z)] / δ ) ).
This bound can be extended to hold uniformly over q at the price of an additional term of order Õ( ‖q − u‖_1 √( log_2 log_2 (1 − ‖q − u‖_1)^{−1} ) ). Data-dependent learning guarantee. page 53

54 Discrepancy-Risk Minimization Key Idea: directly optimize the upper bound on generalization over q and h. This problem can be solved efficiently for some L and H. page 54

55 Kernel-Based Regression Squared loss function: L(y, y′) = (y − y′)². Hypothesis set: for a PDS kernel K with feature map Φ_K,
H = { x ↦ w · Φ_K(x) : ‖w‖_H ≤ Λ }.
The complexity term can be bounded by O( Λ (log^{3/2} T) sup_x √(K(x, x)) ‖q‖_2 ). page 55

56 Instantaneous Discrepancy The empirical discrepancy can be further upper bounded in terms of instantaneous discrepancies:
Δ̂(q) ≤ Σ_{t=1}^T q_t d_t + M ‖q − u‖_1,
where M = sup_{y,y′} L(y, y′) and
d_t = sup_{h∈H} ( (1/s) Σ_{t′=T−s+1}^T L(h, Z_{t′}) − L(h, Z_t) ). page 56

57 Proof By sub-additivity of the supremum,
Δ̂(q) = sup_{h∈H} [ (1/s) Σ_{t′=T−s+1}^T L(h, Z_{t′}) − Σ_{t=1}^T q_t L(h, Z_t) ]
= sup_{h∈H} [ Σ_{t=1}^T q_t ( (1/s) Σ_{t′=T−s+1}^T L(h, Z_{t′}) − L(h, Z_t) ) + ( 1 − Σ_{t=1}^T q_t ) (1/s) Σ_{t′=T−s+1}^T L(h, Z_{t′}) ]
≤ Σ_{t=1}^T q_t sup_{h∈H} ( (1/s) Σ_{t′=T−s+1}^T L(h, Z_{t′}) − L(h, Z_t) ) + M | 1 − Σ_{t=1}^T q_t |
≤ Σ_{t=1}^T q_t d_t + M ‖u − q‖_1. page 57

58 Computing Discrepancies Instantaneous discrepancy for kernel-based hypotheses with the squared loss:
d_t = sup_{‖w′‖≤Λ} ( Σ_{s=1}^T u_s ( w′ · Φ_K(x_s) − y_s )² − ( w′ · Φ_K(x_t) − y_t )² ).
This is a difference of convex (DC) functions; the global optimum can be found via DC-programming (Tao and Ahn, 1998). page 58
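The DC-programming approach gives the global optimum. As a rough, purely illustrative alternative (not from the slides), one can lower-bound each d_t by evaluating the objective over a random sample of weight vectors from the ball ‖w′‖ ≤ Λ; the feature matrix, Λ, and the number of samples below are placeholders.

```python
import numpy as np

def approx_instantaneous_discrepancy(Phi, y, t, s, Lam=1.0, n_samples=2000, seed=0):
    """Crude lower bound on
        d_t = sup_{||w'|| <= Lam} ( weighted average squared loss over the recent points
                                     - squared loss at point t )
    obtained by sampling w' uniformly from the ball of radius Lam
    (an approximation, not the DC-programming solution of the slides).
    Phi: (T, d) array of feature vectors; y: (T,) array of targets."""
    rng = np.random.default_rng(seed)
    T, d = Phi.shape
    recent = slice(T - s, T)
    best = -np.inf
    for _ in range(n_samples):
        w = rng.normal(size=d)
        w *= Lam * rng.uniform() ** (1.0 / d) / np.linalg.norm(w)   # uniform in the ball
        losses = (Phi @ w - y) ** 2
        best = max(best, losses[recent].mean() - losses[t])
    return best

# Illustrative call: d_5 for random data, comparing the last s = 50 points to point t = 5.
Phi = np.random.default_rng(1).normal(size=(200, 4))
y = np.random.default_rng(2).normal(size=200)
print(approx_instantaneous_discrepancy(Phi, y, t=5, s=50))
```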

59 Discrepancy-Based Forecasting Theorem: for any δ > 0, with probability at least 1 − δ, for all kernel-based hypotheses h ∈ H and all 0 < ‖q − u‖_1 ≤ 1,
L(h, Z_1^T) ≤ Σ_{t=1}^T q_t L(h, Z_t) + Δ̂(q) + Δ_s + Õ( Λ (log^{3/2} T) sup_x √(K(x, x)) ‖q‖_2 + ‖q − u‖_1 ).
Corresponding optimization problem:
min_{q ∈ [0,1]^T, w} { Σ_{t=1}^T q_t ( w · Φ_K(x_t) − y_t )² + λ_1 Σ_{t=1}^T q_t d_t + λ_2 ‖w‖_H² + λ_3 ‖q − u‖_1 }. page 59

64 Convex Problem Change of variable: r_t = 1/q_t. Upper bound: | r_t^{−1} − 1/T | ≤ | r_t − T | / T for r_t ≥ 1. The problem becomes
min_{r ∈ D, w} { Σ_{t=1}^T [ ( w · Φ_K(x_t) − y_t )² + λ_1 d_t ] / r_t + λ_2 ‖w‖_H² + (λ_3 / T) Σ_{t=1}^T | r_t − T | },
where D = { r : r_t ≥ 1 }. This is a convex optimization problem: each term ( w · Φ_K(x_t) − y_t )² / r_t is jointly convex in (w, r_t) as a quadratic-over-linear function. page 64

65 Two-Stage Algorithm Minimize the empirical discrepancy Δ̂(q) over q (convex optimization). Then solve the (weighted) kernel ridge regression problem:
min_w { Σ_{t=1}^T q*_t ( w · Φ_K(x_t) − y_t )² + λ ‖w‖_H² },
where q* is the solution to the discrepancy minimization problem. page 65
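A minimal sketch of the second stage, assuming the weights q* from the first stage are given (here replaced by an arbitrary weight profile) and a Gaussian kernel: with f(x) = Σ_t α_t K(x_t, x), setting the gradient of the weighted objective to zero gives (diag(q) K + λ I) α = diag(q) y. The kernel choice, data, and hyperparameters are illustrative.

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def weighted_krr_fit(X, y, q, lam=1.0, gamma=1.0):
    """Weighted kernel ridge regression: minimize
        sum_t q_t (f(x_t) - y_t)^2 + lam * ||f||_H^2
    over f(x) = sum_t alpha_t K(x_t, x); the stationarity condition is
    (diag(q) K + lam I) alpha = diag(q) y."""
    K = gaussian_kernel(X, X, gamma)
    Q = np.diag(q)
    return np.linalg.solve(Q @ K + lam * np.eye(len(y)), Q @ y)

def weighted_krr_predict(X_train, alpha, X_new, gamma=1.0):
    return gaussian_kernel(X_new, X_train, gamma) @ alpha

# Toy usage with weights that emphasize recent points (q* would come from stage one).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.1 * rng.normal(size=200)
q = np.linspace(0.2, 1.0, 200); q /= q.sum()
alpha = weighted_krr_fit(X, y, q, lam=0.1, gamma=0.5)
print(weighted_krr_predict(X, alpha, X[-5:], gamma=0.5))
```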

66 Preliminary Experiments Artificial data sets. page 66

67 True vs. Empirical Discrepancies Discrepancies Weights page 67

68 Running MSE page 68

69 Real-world Data Commodity prices, exchange rates, temperatures & climate. page 69

70 Time Series Prediction & On-line Learning

71 Two Learning Scenarios Stochastic scenario: distributional assumption. performance measure: expected loss. guarantees: generalization bounds. On-line scenario: no distributional assumption. guarantees: regret bounds. performance measure: regret. active research area: (Cesa-Bianchi and Lugosi, 2006; Anava et al. 2013, 2015, 2016; Bousquet and Warmuth, 2002; Herbster and Warmuth, 1998, 2001; Koolen et al., 2015). page 71

72 On-Line Learning Setup Adversarial setting with hypothesis/action set H. For t = 1 to T do: player receives x_t ∈ X; player selects h_t ∈ H; adversary selects y_t ∈ Y; player incurs loss L(h_t(x_t), y_t). Objective: minimize the (external) regret
Reg_T = Σ_{t=1}^T L(h_t(x_t), y_t) − min_{h∈H} Σ_{t=1}^T L(h(x_t), y_t). page 72

73 Example: Exp. Weights (EW) Expert set H_0 = {E_1, ..., E_N}, H = conv(H_0).
EW({E_1, ..., E_N})   (parameter η > 0)
  for i ← 1 to N do
      w_{1,i} ← 1
  for t ← 1 to T do
      Receive(x_t)
      h_t ← ( Σ_{i=1}^N w_{t,i} E_i ) / ( Σ_{i=1}^N w_{t,i} )
      Receive(y_t)
      Incur-Loss(L(h_t(x_t), y_t))
      for i ← 1 to N do
          w_{t+1,i} ← w_{t,i} e^{−η L(E_i(x_t), y_t)}
  return h_T
page 73
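The pseudocode above as a small runnable Python sketch (an illustration, not the lecture's code): the experts are arbitrary prediction functions, the loss is the squared loss with predictions and labels in [0, 1] so that it is convex and bounded by 1, and η is set as on the next slide.

```python
import numpy as np

def exponential_weights(experts, xs, ys, eta):
    """Exponentiated-weights forecaster: predict with the weight-averaged expert,
    then multiply each weight by exp(-eta * loss of that expert)."""
    loss = lambda p, y: (p - y) ** 2          # convex, in [0, 1] for p, y in [0, 1]
    N, T = len(experts), len(xs)
    w = np.ones(N)
    total, expert_totals = 0.0, np.zeros(N)
    for t in range(T):
        preds = np.array([E(xs[t]) for E in experts])
        h_t = np.dot(w, preds) / w.sum()      # convex combination of the experts
        total += loss(h_t, ys[t])
        expert_losses = (preds - ys[t]) ** 2
        expert_totals += expert_losses
        w *= np.exp(-eta * expert_losses)
    return total - expert_totals.min()        # external regret

# Toy usage: constant experts tracking a noisy level in [0, 1].
rng = np.random.default_rng(0)
T = 1000
xs = np.zeros(T)                              # inputs are unused by constant experts
ys = np.clip(0.5 + 0.1 * rng.normal(size=T), 0.0, 1.0)
experts = [lambda x, c=c: c for c in np.linspace(0.0, 1.0, 11)]
eta = np.sqrt(8 * np.log(len(experts)) / T)
print("regret:", exponential_weights(experts, xs, ys, eta),
      "bound:", np.sqrt(T / 2 * np.log(len(experts))))
```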

74 EW Guarantee Theorem: assume that L is convex in its first argument and takes values in [0, 1]. Then, for any η > 0 and any sequence y_1, ..., y_T ∈ Y, the regret of EW at time T satisfies
Reg_T ≤ (log N)/η + ηT/8.
For η = √(8 log N / T),
Reg_T ≤ √( (T/2) log N ),   i.e.,   Reg_T / T = O( √( (log N)/T ) ). page 74

75 EW - Proof Potential: Φ_t = log Σ_{i=1}^N w_{t,i}. Upper bound:
Φ_t − Φ_{t−1} = log ( Σ_{i=1}^N w_{t−1,i} e^{−η L(E_i(x_t), y_t)} / Σ_{i=1}^N w_{t−1,i} )
= log E_{w_{t−1}}[ e^{−η L(E_i(x_t), y_t)} ]
≤ log exp( −η E_{w_{t−1}}[ L(E_i(x_t), y_t) ] + η²/8 )   (Hoeffding's ineq.)
= −η E_{w_{t−1}}[ L(E_i(x_t), y_t) ] + η²/8
≤ −η L( E_{w_{t−1}}[ E_i(x_t) ], y_t ) + η²/8   (convexity of first arg. of L)
= −η L(h_t(x_t), y_t) + η²/8. page 75

76 EW - Proof Upper bound: summing up the inequalities yields
Φ_T − Φ_0 ≤ −η Σ_{t=1}^T L(h_t(x_t), y_t) + η²T/8.
Lower bound:
Φ_T − Φ_0 = log Σ_{i=1}^N e^{−η Σ_{t=1}^T L(E_i(x_t), y_t)} − log N
≥ log max_{i=1}^N e^{−η Σ_{t=1}^T L(E_i(x_t), y_t)} − log N
= −η min_{i=1}^N Σ_{t=1}^T L(E_i(x_t), y_t) − log N.
Comparison:
Σ_{t=1}^T L(h_t(x_t), y_t) − min_{i=1}^N Σ_{t=1}^T L(E_i(x_t), y_t) ≤ (log N)/η + ηT/8. page 76

77 Questions Can we exploit both the batch and the on-line scenarios to design flexible algorithms for time series prediction with stochastic guarantees? Can we tackle notoriously difficult time series problems, e.g., model selection and learning ensembles? page 77

78 Model Selection Problem: given N time series models, how should we use the sample Z_1^T to select a single best model? in the i.i.d. case, cross-validation can be shown to be close to the structural risk minimization solution. but, how do we select a validation set for general stochastic processes? use the most recent data? use various splits? use the most distant data? models may have been pre-trained on Z_1^T. page 78

79 Learning Ensembles Problem: given a hypothesis set H and a sample Z_1^T, find an accurate convex combination h = Σ_{t=1}^T q_t h_t with h = (h_1, ..., h_T) ∈ H_A and q ∈ Δ. in the most general case, hypotheses may have been pre-trained on Z_1^T. on-line-to-batch conversion for general non-stationary non-mixing processes. page 79

80 On-Line-to-Batch (OTB) Input: sequence of hypotheses h = (h_1, ..., h_T) returned after T rounds by an on-line algorithm A minimizing the general regret
Reg_T = Σ_{t=1}^T L(h_t, Z_t) − inf_{h*∈H} Σ_{t=1}^T L(h*, Z_t). page 80

81 On-Line-to-Batch (OTB) Problem: use h = (h_1, ..., h_T) to derive a hypothesis h ∈ H with small path-dependent expected loss,
L_{T+1}(h, Z_1^T) = E_{Z_{T+1}}[ L(h, Z_{T+1}) | Z_1^T ].
the i.i.d. problem is standard: (Littlestone, 1989), (Cesa-Bianchi et al., 2004). but, how do we design solutions for the general time-series scenario? page 81

82 Questions Is OTB with general (non-stationary, non-mixing) stochastic processes possible? Can we design algorithms with theoretical guarantees? need a new tool for the analysis. page 82

83 Relevant Quantity Key difference (for 1 ≤ t ≤ T, with prediction at time T + 1):
L_t(h_t, Z_1^{t−1}) − L_{T+1}(h_t, Z_1^T).
Average difference:
(1/T) Σ_{t=1}^T [ L_{T+1}(h_t, Z_1^T) − L_t(h_t, Z_1^{t−1}) ]. page 83

84 On-line Discrepancy Definition:
disc(q) = sup_{h∈H_A} Σ_{t=1}^T q_t [ L_{T+1}(h_t, Z_1^T) − L_t(h_t, Z_1^{t−1}) ],
where H_A is the set of hypothesis sequences that A can return and q = (q_1, ..., q_T) is an arbitrary weight vector. natural measure of non-stationarity or dependency. captures hypothesis set and loss function. can be efficiently estimated under mild assumptions. generalization of the definition of (Kuznetsov and MM, 2015). page 84

85 Discrepancy Estimation Batch discrepancy estimation method. Alternative method: assume that the loss is μ-Lipschitz. assume that there exists an accurate hypothesis h*, that is, that
inf_{h*} E[ L(Z_{T+1}, h*(X_{T+1})) | Z_1^T ]
is small. page 85

86 Discrepancy Estimation Lemma: fix a sequence Z_1^T in Z. Then, for any δ > 0, with probability at least 1 − δ, the following holds for all α > 0:
disc(q) ≤ disc_H(q) + μ inf_{h*} E[ L(Z_{T+1}, h*(X_{T+1})) | Z_1^T ] + 2α + M ‖q‖_2 √( 2 log ( E_z[N_1(α, G, z)] / δ ) ),
where the empirical discrepancy is
disc_H(q) = sup_{h′∈H, h∈H_A} Σ_{t=1}^T q_t [ L( h_t(X_{T+1}), h′(X_{T+1}) ) − L( h_t, Z_t ) ]. page 86

87 Proof Sketch
disc(q) = sup_{h∈H_A} Σ_{t=1}^T q_t [ L_{T+1}(h_t, Z_1^T) − L_t(h_t, Z_1^{t−1}) ]
≤ sup_{h∈H_A} Σ_{t=1}^T q_t [ L_{T+1}(h_t, Z_1^T) − E[ L(h_t(X_{T+1}), h*(X_{T+1})) | Z_1^T ] ]
  + sup_{h∈H_A} Σ_{t=1}^T q_t [ E[ L(h_t(X_{T+1}), h*(X_{T+1})) | Z_1^T ] − L_t(h_t, Z_1^{t−1}) ]
≤ μ E[ L(h*(X_{T+1}), Y_{T+1}) | Z_1^T ] + sup_{h∈H_A} Σ_{t=1}^T q_t [ E[ L(h_t(X_{T+1}), h*(X_{T+1})) | Z_1^T ] − L_t(h_t, Z_1^{t−1}) ],
using the μ-Lipschitzness of the loss for the first term. page 87

88 Learning Guarantee Lemma: let L be a convex loss bounded by M and h_1^T = (h_1, ..., h_T) a hypothesis sequence adapted to Z_1^T. Fix q ∈ Δ. Then, for any δ > 0, the following holds with probability at least 1 − δ for the hypothesis h = Σ_{t=1}^T q_t h_t:
L_{T+1}(h, Z_1^T) ≤ Σ_{t=1}^T q_t L(h_t, Z_t) + disc(q) + M ‖q‖_2 √( 2 log (1/δ) ). page 88

89 Proof By definition of the on-line discrepancy,
Σ_{t=1}^T q_t [ L_{T+1}(h_t, Z_1^T) − L_t(h_t, Z_1^{t−1}) ] ≤ disc(q).
A_t = q_t [ L_t(h_t, Z_1^{t−1}) − L(h_t, Z_t) ] is a martingale difference, thus by Azuma's inequality, with high probability,
Σ_{t=1}^T q_t L_t(h_t, Z_1^{t−1}) ≤ Σ_{t=1}^T q_t L(h_t, Z_t) + M ‖q‖_2 √( 2 log (1/δ) ).
By convexity of the loss:
L_{T+1}(h, Z_1^T) ≤ Σ_{t=1}^T q_t L_{T+1}(h_t, Z_1^T). page 89

90 Learning Guarantee Theorem: let L be a convex loss bounded by M and H a set of hypothesis sequences adapted to Z_1^T. Fix q ∈ Δ. Then, for any δ > 0, the following holds with probability at least 1 − δ for the hypothesis h = Σ_{t=1}^T q_t h_t:
L_{T+1}(h, Z_1^T) ≤ inf_{h*∈H} L_{T+1}(h*, Z_1^T) + 2 disc(q) + Reg_T / T + M ‖q − u‖_1 + 2M ‖q‖_2 √( 2 log (2/δ) ). page 90
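A minimal sketch of the on-line-to-batch conversion used in this theorem, assuming the sequence h_1, ..., h_T returned by the on-line algorithm is already available: the final predictor is simply the q-weighted convex combination of those hypotheses (q defaults to uniform; the stand-in hypotheses below are illustrative).

```python
import numpy as np

def online_to_batch(hypotheses, q=None):
    """Return h = sum_t q_t h_t, the q-weighted convex combination of the
    hypotheses produced along an on-line run (q defaults to uniform)."""
    T = len(hypotheses)
    q = np.full(T, 1.0 / T) if q is None else np.asarray(q, dtype=float)
    q = q / q.sum()                            # normalize to a point of the simplex
    def h(x):
        return sum(q_t * h_t(x) for q_t, h_t in zip(q, hypotheses))
    return h

# Toy usage: combine the per-round predictors of an on-line run
# (stand-in linear predictors with increasing slopes, for illustration).
hyps = [lambda x, c=c: c * x for c in np.linspace(0.0, 1.0, 50)]
h_final = online_to_batch(hyps)
print(h_final(2.0))   # average slope (0.5) times x = 2.0 gives 1.0
```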

91 Conclusion Time series forecasting: key learning problem in many important tasks. very challenging: theory, algorithms, applications. new and general data-dependent learning guarantees for non-mixing non-stationary processes. algorithms with guarantees. Time series prediction and on-line learning: guarantees for flexible solutions derived via OTB. application to model selection. application to learning ensembles. page 91

92 References T. M. Adams and A. B. Nobel. Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling. The Annals of Probability, 38(4): , A. Agarwal, J. Duchi. The generalization ability of online algorithms for dependent data. Information Theory, IEEE Transactions on, 59(1): , P. Alquier, X. Li, O. Wintenberger. Prediction of time series by statistical learning: general losses and fast rates. Dependence Modelling, 1:65-93, Oren Anava, Elad Hazan, Shie Mannor, and Ohad Shamir. Online learning for time series prediction. COLT, page 92

93 References O. Anava, E. Hazan, and A. Zeevi. Online time series prediction with missing data. ICML, O. Anava and S. Mannor. Online Learning for heteroscedastic sequences. ICML, P. L. Bartlett. Learning with a slowly changing distribution. COLT, R. D. Barve and P. M. Long. On the complexity of learning from drifting distributions. Information and Computation, 138(2): , page 93

94 References P. Berti and P. Rigo. A Glivenko-Cantelli theorem for exchangeable random variables. Statistics & Probability Letters, 32(4): , S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS T. Bollerslev. Generalized autoregressive conditional heteroskedasticity. J Econometrics, G. E. P. Box, G. Jenkins. (1990). Time Series Analysis, Forecasting and Control. page 94

95 References O. Bousquet and M. K. Warmuth. Tracking a small set of experts by mixing past posteriors. COLT, N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Trans. on Inf. Theory, 50(9), N. Cesa-Bianchi and C. Gentile. Tracking the best hyperplane with a simple budget perceptron. COLT, C. Cortes and M. Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519, page 95

96 References V. H. De la Pena and E. Gine. (1999) Decoupling: from dependence to independence: randomly stopped processes, U- statistics and processes, martingales and beyond. Probability and its applications. Springer, NY. P. Doukhan. (1994) Mixing: properties and examples. Lecture notes in statistics. Springer-Verlag, New York. R. Engle. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4): , page 96

97 References D. P. Helmbold and P. M. Long. Tracking drifting concepts by minimizing disagreements. Machine Learning, 14(1): 27-46, M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32(2), M. Herbster and M. K. Warmuth. Tracking the best linear predictor. JMLR, D. Hsu, A. Kontorovich, and C. Szepesvári. Mixing time estimation in reversible Markov chains from a single sample path. NIPS, page 97

98 References W. M. Koolen, A. Malek, P. L. Bartlett, and Y. Abbasi. Minimax time series prediction. NIPS, V. Kuznetsov, M. Mohri. Generalization bounds for time series prediction with non-stationary processes. In ALT, V. Kuznetsov, M. Mohri. Learning theory and algorithms for forecasting non-stationary time series. In NIPS, V. Kuznetsov, M. Mohri. Time series prediction and on-line learning. In COLT, page 98

99 References A. C. Lozano, S. R. Kulkarni, and R. E. Schapire. Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. In NIPS, pages , Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT R. Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, pages 5 34, page 99

100 References D. Modha, E. Masry. Memory-universal prediction of stationary random processes. Information Theory, IEEE Transactions on, 44(1): , Jan M. Mohri, A. Munoz Medina. New analysis and algorithm for learning with drifting distributions. In ALT, M. Mohri, A. Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In NIPS, M. Mohri, A. Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11: , page 100

101 References V. Pestov. Predictive PAC learnability: A paradigm for learning from exchangeable input data. GRC, A. Rakhlin, K. Sridharan, A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In NIPS, L. Ralaivola, M. Szafranski, G. Stempfel. Chromatic PAC- Bayes bounds for non-iid data: Applications to ranking and stationary beta-mixing processes. JMLR 11: , C. Shalizi and A. Kontorovitch. Predictive PAC-learning and process decompositions. NIPS, page 101

102 References A. Rakhlin, K. Sridharan, A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. NIPS, I. Steinwart, A. Christmann. Fast learning from non-i.i.d. observations. NIPS, M. Vidyasagar. (1997). A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems. Springer-Verlag New York, Inc. V. Vovk. Competing with stationary prediction strategies. COLT page 102

103 References B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94 116, page 103
