Advanced Machine Learning


1 Advanced Machine Learning: Time Series Prediction. VITALY KUZNETSOV, GOOGLE RESEARCH. MEHRYAR MOHRI, COURANT INSTITUTE & GOOGLE RESEARCH.

2 Motivation Time series prediction: stock values. earthquakes. energy demand. signal processing. sales forecasting. economic variables. weather: e.g., local and global temperature. sensors: Internet-of-Things. page 2

3 Google Trends page 3

4 Google Trends page 4

5 Google Trends page 5

6 Challenges Standard Supervised Learning: IID assumption. Same distribution for training and test data. Distributions fixed over time (stationarity). none of these assumptions holds for time series! page 6

7 Outline Introduction to time series analysis. Learning theory for forecasting non-stationary time series. Algorithms for forecasting non-stationary time series. Time series prediction and on-line learning. page 7

8 Introduction to Time Series Analysis

9 Classical Framework Postulate a particular form of a parametric model that is assumed to generate data. Use given sample to estimate unknown parameters of the model. Use estimated model to make predictions. page 9

10 Autoregressive (AR) Models Definition: the AR(p) model is a linear generative model based on the p-th order Markov assumption:
∀t, Y_t = Σ_{i=1}^p a_i Y_{t−i} + ε_t,
where the ε_t's are zero-mean uncorrelated random variables with common variance σ², a_1, ..., a_p are the autoregressive coefficients, and Y_t is the observed stochastic process. page 10
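As a concrete companion to the definition above, here is a minimal NumPy sketch (not from the lecture) that simulates an AR(2) process and recovers the coefficients by ordinary least squares; the coefficient values, sample size, and noise scale are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(2) process: Y_t = a_1 Y_{t-1} + a_2 Y_{t-2} + eps_t
a = np.array([0.6, -0.3])          # illustrative autoregressive coefficients
T, p = 2000, len(a)
Y = np.zeros(T)
eps = rng.normal(size=T)
for t in range(p, T):
    Y[t] = a @ Y[t - p:t][::-1] + eps[t]   # a_1 Y_{t-1} + ... + a_p Y_{t-p}

# Least-squares estimate of a from the lagged regressors
X = np.column_stack([Y[p - i - 1:T - i - 1] for i in range(p)])  # cols: Y_{t-1}, ..., Y_{t-p}
y = Y[p:]
a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("true a:", a, "estimated a:", a_hat)
```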

11 Moving Averages (MA) Definition: the MA(q) model is a linear generative model for the noise term based on the q-th order Markov assumption:
∀t, Y_t = ε_t + Σ_{j=1}^q b_j ε_{t−j},
where b_1, ..., b_q are the moving average coefficients. page 11

12 ARMA model (Whittle, 1951; Box & Jenkins, 1971) Definition: the ARMA(p, q) model is a generative linear model that combines the AR(p) and MA(q) models:
∀t, Y_t = Σ_{i=1}^p a_i Y_{t−i} + ε_t + Σ_{j=1}^q b_j ε_{t−j}. page 12

13 ARMA page 13

14 Stationarity Definition: a sequence of random variables Z = {Z_t}_{t=−∞}^{+∞} is stationary if its distribution is invariant to shifting in time, that is, (Z_t, ..., Z_{t+m}) and (Z_{t+k}, ..., Z_{t+m+k}) have the same distribution for all t, m, and k. page 14

15 Weak Stationarity Definition: a sequence of random variables Z = {Z_t}_{t=−∞}^{+∞} is weakly stationary if its first and second moments are invariant to shifting in time, that is, E[Z_t] is independent of t and E[Z_t Z_{t−j}] = f(j) for some function f. page 15

16 Lag Operator The lag operator L is defined by L Y_t = Y_{t−1}. ARMA model in terms of the lag operator:
(1 − Σ_{i=1}^p a_i L^i) Y_t = (1 + Σ_{j=1}^q b_j L^j) ε_t.
The characteristic polynomial
P(z) = 1 − Σ_{i=1}^p a_i z^i
can be used to study properties of this stochastic process. page 16

17 Weak Stationarity of ARMA Theorem: an ARMA( p, q) process is weakly stationary if the roots of the characteristic polynomial P (z) are outside the unit circle. page 17
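The theorem's condition is easy to check numerically once the AR coefficients are given. A small sketch (illustrative coefficients, not from the slides): form P(z) = 1 − a_1 z − ... − a_p z^p and test whether all of its roots lie strictly outside the unit circle.

```python
import numpy as np

def ar_part_weakly_stationary(a):
    """Check whether all roots of P(z) = 1 - a_1 z - ... - a_p z^p
    lie strictly outside the unit circle."""
    # np.roots expects coefficients ordered from the highest degree down:
    # P(z) = -a_p z^p - ... - a_1 z + 1
    coeffs = np.concatenate([-np.asarray(a, dtype=float)[::-1], [1.0]])
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(ar_part_weakly_stationary([0.6, -0.3]))   # True: roots outside the unit circle
print(ar_part_weakly_stationary([1.2, -0.1]))   # False: one root lies inside
```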

18 Proof If the roots of the characteristic polynomial are outside the unit circle, then
P(z) = 1 − Σ_{i=1}^p a_i z^i = c (λ_1 − z) ⋯ (λ_p − z) = c′ (1 − λ_1^{−1} z) ⋯ (1 − λ_p^{−1} z),
where |λ_i| > 1 for all i = 1, ..., p and c, c′ are constants. page 18

19 Proof Therefore, the ARMA(p, q) process
(1 − Σ_{i=1}^p a_i L^i) Y_t = (1 + Σ_{j=1}^q b_j L^j) ε_t
admits an MA(∞) representation:
Y_t = (1 − λ_1^{−1} L)^{−1} ⋯ (1 − λ_p^{−1} L)^{−1} (1 + Σ_{j=1}^q b_j L^j) ε_t,
where
(1 − λ_i^{−1} L)^{−1} = Σ_{k=0}^∞ λ_i^{−k} L^k
is well-defined since |λ_i^{−1}| < 1. page 19

20 Proof Therefore, it suffices to show that
Y_t = Σ_{j=0}^∞ ψ_j ε_{t−j}
is weakly stationary. The mean is constant:
E[Y_t] = Σ_{j=0}^∞ ψ_j E[ε_{t−j}] = 0.
The covariance function E[Y_t Y_{t−l}] only depends on the lag l:
E[Y_t Y_{t−l}] = Σ_{k=0}^∞ Σ_{j=0}^∞ ψ_k ψ_j E[ε_{t−j} ε_{t−l−k}] = σ² Σ_{j=0}^∞ ψ_j ψ_{j+l}. page 20

21 ARIMA Non-stationary processes can be modeled using processes whose characteristic polynomial has unit roots. A characteristic polynomial with unit roots can be factored as
P(z) = R(z) (1 − z)^D,
where R(z) has no unit roots. Definition: the ARIMA(p, D, q) model is an ARMA(p, q) model for (1 − L)^D Y_t:
(1 − Σ_{i=1}^p a_i L^i) (1 − L)^D Y_t = (1 + Σ_{j=1}^q b_j L^j) ε_t. page 21
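A small sketch of the transformation behind this definition (illustrative, not from the slides): apply (1 − L)^D by repeated differencing to obtain a series suitable for ARMA modeling, and invert the transform with cumulative sums; the random-walk example and D = 1 are arbitrary choices.

```python
import numpy as np

def difference(y, D=1):
    """Apply (1 - L)^D: return the differenced series together with the
    leading values needed to invert the transform."""
    heads = []
    for _ in range(D):
        heads.append(y[0])
        y = np.diff(y)
    return y, heads

def undifference(dy, heads):
    """Invert (1 - L)^D using the stored leading values."""
    for head in reversed(heads):
        dy = np.concatenate([[head], head + np.cumsum(dy)])
    return dy

# Example: a random walk (one unit root) becomes stationary after one difference.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=500))      # an ARIMA(0, 1, 0) sample path
dy, heads = difference(y, D=1)           # white noise: fit an ARMA model to this
assert np.allclose(undifference(dy, heads), y)
```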

23 Other Extensions Further variants: models with seasonal components (SARIMA). models with side information (ARIMAX). models with long-memory (ARFIMA). multi-variate time series models (VAR). models with time-varying coefficients. other non-linear models. page 23

24 Modeling Variance Definition: the generalized autoregressive conditional heteroscedasticity GARCH(p, q) model is an ARMA(p, q) model for the variance σ_t² of the noise term ε_t:
∀t, σ_{t+1}² = ω + Σ_{i=0}^{p−1} α_i ε_{t−i}² + Σ_{j=0}^{q−1} β_j σ_{t−j}²,
where the ε_t's are zero-mean Gaussian random variables with variance σ_t² conditioned on {Y_{t−1}, Y_{t−2}, ...}, and ω > 0 is the mean parameter. (Engle, 1982; Bollerslev, 1986) page 24
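For concreteness, a minimal sketch (not from the slides) simulating the most common special case, GARCH(1,1), with arbitrary illustrative parameters ω, α, β chosen so that α + β < 1.

```python
import numpy as np

def simulate_garch_1_1(T, omega=0.1, alpha=0.1, beta=0.8, seed=0):
    """Simulate eps_t with conditional variance
    sigma_{t+1}^2 = omega + alpha * eps_t^2 + beta * sigma_t^2."""
    rng = np.random.default_rng(seed)
    eps, sigma2 = np.zeros(T), np.zeros(T)
    sigma2[0] = omega / (1.0 - alpha - beta)    # start at the unconditional variance
    for t in range(T):
        eps[t] = rng.normal(scale=np.sqrt(sigma2[t]))
        if t + 1 < T:
            sigma2[t + 1] = omega + alpha * eps[t] ** 2 + beta * sigma2[t]
    return eps, sigma2

eps, sigma2 = simulate_garch_1_1(5000)
print("sample variance:", eps.var(), "unconditional variance:", 0.1 / (1 - 0.9))
```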

26 State-Space Models Continuous state space version of Hidden Markov Models:
X_{t+1} = B X_t + U_t,    Y_t = A X_t + ε_t,
where X_t is an n-dimensional state vector, Y_t is the observed stochastic process, A and B are model parameters, and U_t and ε_t are noise terms. page 26
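A short simulation sketch of this model (illustrative dimensions, noise scales, and parameter matrices; not from the slides): a two-dimensional latent state evolving by a damped rotation, observed through a one-dimensional linear map.

```python
import numpy as np

def simulate_state_space(A, B, T, state_noise=0.1, obs_noise=0.1, seed=0):
    """Simulate X_{t+1} = B X_t + U_t and Y_t = A X_t + eps_t
    with Gaussian noise terms U_t and eps_t."""
    rng = np.random.default_rng(seed)
    n, m = B.shape[0], A.shape[0]
    X, Y = np.zeros((T, n)), np.zeros((T, m))
    X[0] = rng.normal(size=n)
    for t in range(T):
        Y[t] = A @ X[t] + obs_noise * rng.normal(size=m)
        if t + 1 < T:
            X[t + 1] = B @ X[t] + state_noise * rng.normal(size=n)
    return X, Y

theta = 0.05
B = 0.99 * np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])   # damped rotation of the state
A = np.array([[1.0, 0.0]])                                # observe the first coordinate
X, Y = simulate_state_space(A, B, T=500)
```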

27 State-Space Models page 27

28 Estimation Different methods for estimating model parameters: Maximum likelihood estimation: Requires further parametric assumptions on the noise distribution (e.g. Gaussian). Method of moments (Yule-Walker estimator). Restricted to certain models. Conditional and unconditional least square estimation. page 28
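As an example of the method of moments mentioned above, here is a hedged sketch of the Yule-Walker estimator for an AR(p) model: estimate the autocovariances from the sample and solve the resulting Toeplitz system (the simulated series and its coefficient are illustrative).

```python
import numpy as np

def yule_walker(y, p):
    """Yule-Walker (method of moments) estimate of AR(p) coefficients:
    solve Gamma a = g, where Gamma[i, j] = gamma(|i - j|) and g[k] = gamma(k + 1),
    with gamma(k) the sample autocovariance at lag k."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    T = len(y)
    gamma = np.array([np.dot(y[: T - k], y[k:]) / T for k in range(p + 1)])
    Gamma = np.array([[gamma[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(Gamma, gamma[1:])

# Illustrative usage on a simulated AR(1) path with coefficient 0.7.
rng = np.random.default_rng(1)
y = np.zeros(2000)
for t in range(1, len(y)):
    y[t] = 0.7 * y[t - 1] + rng.normal()
print(yule_walker(y, p=1))   # approximately [0.7]
```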

29 Invertibility of ARMA Definition: an ARMA(p, q) process is invertible if the roots of the polynomial
Q(z) = 1 + Σ_{j=1}^q b_j z^j
are outside the unit circle. page 29

30 Learning guarantee Theorem: assume that Y_t ~ ARMA(p, q) is weakly stationary and invertible. Let â_T denote the least squares estimate of a = (a_1, ..., a_p) and assume that p is known. Then, ‖â_T − a‖ converges in probability to zero. Similar results hold for other estimators and other models. page 30

31 Notes Many other generative models exist. Learning guarantees are asymptotic. Model needs to be correctly specified. Non-stationarity needs to be modeled explicitly. page 31

32 Theory

33 Time Series Forecasting Training data: a finite sample realization of some stochastic process,
(X_1, Y_1), ..., (X_T, Y_T) ∈ Z = X × Y.
Loss function: L: H × Z → [0, 1], where H is a hypothesis set of functions mapping from X to Y. Problem: find h ∈ H with small path-dependent expected loss,
L(h, Z_1^T) = E_{Z_{T+1}}[ L(h, Z_{T+1}) | Z_1^T ]. page 33

34 Standard Assumptions Stationarity: (Z_t, ..., Z_{t+m}) and (Z_{t+k}, ..., Z_{t+m+k}) have the same distribution. Mixing: the dependence between events A and B, separated by a gap of k time steps, decays with k. page 34

35 β-mixing Definition: a sequence of random variables Z = {Z_t}_{t=−∞}^{+∞} is β-mixing if
β(k) = sup_n E_{B ∈ σ(Z_{−∞}^n)} [ sup_{A ∈ σ(Z_{n+k}^{+∞})} | P[A | B] − P[A] | ] → 0,
that is, the dependence between events decays with the gap k. page 35

36 Learning Theory Stationary and β-mixing process: generalization bounds. PAC-learning preserved in that setting (Vidyasagar, 1997). covering number bounds for regression (Meir, 2000). VC-dimension bounds for binary classification (Yu, 1994). Rademacher complexity bounds for general loss functions (MM and Rostamizadeh, 2000). PAC-Bayesian bounds (Alquier et al., 2014). page 36

37 Learning Theory Stationarity and mixing: algorithm-dependent bounds. AdaBoost (Lozano et al., 1997). general stability bounds (MM and Rostamizadeh, 2010). regularized ERM (Steinwart and Christmann, 2009). stable on-line algorithms (Agarwal and Duchi, 2013). page 37

38 Problem Stationarity and mixing assumptions: often do not hold (think trend or periodic signals). not testable. estimating mixing parameters can be hard, even if general functional form known. hypothesis set and loss function ignored. page 38

39 Questions Is learning with general (non-stationary, non-mixing) stochastic processes possible? Can we design algorithms with theoretical guarantees? need a new tool for the analysis. page 39

40 Key Quantity - Fixed h Key difference (for 1 ≤ t ≤ T, with prediction at time T + 1):
L(h, Z_1^{t−1}) − L(h, Z_1^T).
Key average quantity:
(1/T) Σ_{t=1}^T [ L(h, Z_1^T) − L(h, Z_1^{t−1}) ]. page 40

41 Discrepancy Definition:
Δ = sup_{h∈H} [ L(h, Z_1^T) − (1/T) Σ_{t=1}^T L(h, Z_1^{t−1}) ].
captures hypothesis set and loss function. Δ = 0 in the IID case or for weakly stationary processes with linear hypotheses and squared loss. can be estimated from data, under mild assumptions (K and MM, 2014). page 41

42 Weighted Discrepancy Definition: extension to weights q = (q_1, ..., q_T):
Δ(q) = sup_{h∈H} [ L(h, Z_1^T) − Σ_{t=1}^T q_t L(h, Z_1^{t−1}) ].
strictly extends the discrepancy definition in drifting (MM and Muñoz Medina, 2012) or domain adaptation (Mansour, MM, Rostamizadeh 2009; Cortes and MM 2011, 2014); or for the binary loss (Devroye et al., 1996; Ben-David et al., 2007). admits upper bounds in terms of relative entropy, or in terms of β-mixing coefficients and asymptotic stationarity for an asymptotically stationary process. page 42

43 Estimation Decomposition: Δ(q) ≤ Δ_0(q) + Δ_s, with
Δ(q) ≤ sup_{h∈H} [ (1/s) Σ_{t=T−s+1}^T L(h, Z_1^{t−1}) − Σ_{t=1}^T q_t L(h, Z_1^{t−1}) ] + sup_{h∈H} [ L(h, Z_1^T) − (1/s) Σ_{t=T−s+1}^T L(h, Z_1^{t−1}) ],
where s is the number of most recent observations used. page 43

44 Learning Guarantee Theorem: for any δ > 0, with probability at least 1 − δ, for all h ∈ H and all α > 0,
L(h, Z_1^T) ≤ Σ_{t=1}^T q_t L(h, Z_t) + Δ(q) + 2α + ‖q‖_2 √( 2 log ( E_z[N_1(α, G, z)] / δ ) ),
where G = {z ↦ L(h, z) : h ∈ H}. page 44

45 Bound with Emp. Discrepancy Corollary: for any δ > 0, with probability at least 1 − δ, for all h ∈ H and all α > 0,
L(h, Z_1^T) ≤ Σ_{t=1}^T q_t L(h, Z_t) + Δ̂(q) + Δ_s + 4α + [ ‖q‖_2 + ‖q − u_s‖_2 ] √( 2 log ( E_z[N_1(α, G, z)] / δ ) ),
where
Δ̂(q) = sup_{h∈H} [ (1/s) Σ_{t=T−s+1}^T L(h, Z_t) − Σ_{t=1}^T q_t L(h, Z_t) ],
u_s is the uniform distribution over the last s points, and G = {z ↦ L(h, z) : h ∈ H}. page 45

46 Weighted Sequential α-cover (Rakhlin et al., 2010; K and MM, 2015) Definition: let z be a Z-valued full binary tree of depth T. Then, a set of trees V is an l_1-norm q-weighted α-cover of a function class G on z if
∀g ∈ G, ∀σ ∈ {±1}^T, ∃v ∈ V:  Σ_{t=1}^T q_t | v_t(σ) − g(z_t(σ)) | ≤ α ‖q‖_1.
[Illustration: a path σ through the tree z visits nodes such as z_1, z_3, z_6, z_12, ...; the vector of deviations (v_1 − g(z_1), v_3 − g(z_3), v_6 − g(z_6), v_12 − g(z_12), ...) along that path has weighted l_1-norm at most α ‖q‖_1.] page 46

47 Sequential Covering Numbers Definitions: sequential covering number:
N_1(α, G, z) = min{ |V| : V is an l_1-norm q-weighted α-cover of G on z }.
expected sequential covering number: E_{z∼Z_T}[ N_1(α, G, z) ], where Z_T is the distribution over trees induced by the Z_t's: the root is drawn as (Z_1, Z'_1) ∼ D_1, its children as (Z_2, Z'_2) ∼ D_2(· | Z_1) and (Z_3, Z'_3) ∼ D_2(· | Z'_1), then (Z_4, Z'_4) ∼ D_3(· | Z_1, Z_2), (Z_5, Z'_5) ∼ D_3(· | Z_1, Z'_2), and so on. page 47

48 Proof Key quantities:
Δ(q) = sup_{h∈H} [ L(h, Z_1^T) − Σ_{t=1}^T q_t L(h, Z_1^{t−1}) ],
Φ(Z_1^T) = sup_{h∈H} [ L(h, Z_1^T) − Σ_{t=1}^T q_t L(h, Z_t) ].
Chernoff technique: for any t > 0,
P[ Φ(Z_1^T) − Δ(q) > ε ]
≤ P[ sup_{h∈H} Σ_{t=1}^T q_t ( L(h, Z_1^{t−1}) − L(h, Z_t) ) > ε ]   (sub-add. of sup)
= P[ exp( t sup_{h∈H} Σ_{t=1}^T q_t ( L(h, Z_1^{t−1}) − L(h, Z_t) ) ) > e^{tε} ]   (t > 0)
≤ e^{−tε} E[ exp( t sup_{h∈H} Σ_{t=1}^T q_t ( L(h, Z_1^{t−1}) − L(h, Z_t) ) ) ].   (Markov's ineq.) page 48

49 Symmetrization Key tool: decoupled tangent sequence Z'_1, ..., Z'_T associated to Z_1^T: Z_t and Z'_t are i.i.d. given Z_1^{t−1}.
P[ Φ(Z_1^T) − Δ(q) > ε ]
≤ e^{−tε} E[ exp( t sup_{h∈H} Σ_{t=1}^T q_t ( L(h, Z_1^{t−1}) − L(h, Z_t) ) ) ]
= e^{−tε} E[ exp( t sup_{h∈H} Σ_{t=1}^T q_t ( E[L(h, Z'_t) | Z_1^{t−1}] − L(h, Z_t) ) ) ]   (tangent seq.)
= e^{−tε} E[ exp( t sup_{h∈H} E[ Σ_{t=1}^T q_t ( L(h, Z'_t) − L(h, Z_t) ) | Z_1^T ] ) ]   (lin. of expectation)
≤ e^{−tε} E[ exp( t sup_{h∈H} Σ_{t=1}^T q_t ( L(h, Z'_t) − L(h, Z_t) ) ) ].   (Jensen's ineq.) page 49

50 Symmetrization
P[ Φ(Z_1^T) − Δ(q) > ε ]
≤ e^{−tε} E[ exp( t sup_{h∈H} Σ_{t=1}^T q_t ( L(h, Z'_t) − L(h, Z_t) ) ) ]
= e^{−tε} E_{(z,z′)} E_σ[ exp( t sup_{h∈H} Σ_{t=1}^T q_t σ_t ( L(h, z'_t(σ)) − L(h, z_t(σ)) ) ) ]   (tangent seq. prop.)
≤ e^{−tε} E_{(z,z′)} E_σ[ exp( t sup_{h∈H} Σ_{t=1}^T q_t σ_t L(h, z'_t(σ)) + t sup_{h∈H} Σ_{t=1}^T (−q_t σ_t) L(h, z_t(σ)) ) ]   (sub-add. of sup)
≤ e^{−tε} E_{(z,z′)} E_σ[ ½ exp( 2t sup_{h∈H} Σ_{t=1}^T q_t σ_t L(h, z'_t(σ)) ) + ½ exp( 2t sup_{h∈H} Σ_{t=1}^T (−q_t σ_t) L(h, z_t(σ)) ) ]   (convexity of exp)
= e^{−tε} E_z E_σ[ exp( 2t sup_{h∈H} Σ_{t=1}^T q_t σ_t L(h, z_t(σ)) ) ]. page 50

51 Covering Number
P[ Φ(Z_1^T) − Δ(q) > ε ]
≤ e^{−tε} E_z E_σ[ exp( 2t sup_{h∈H} Σ_{t=1}^T q_t σ_t L(h, z_t(σ)) ) ]
≤ e^{−tε} E_z E_σ[ exp( 2t ( max_{v∈V} Σ_{t=1}^T q_t σ_t v_t(σ) + α ) ) ]   (α-covering)
≤ e^{−t(ε−2α)} E_z[ Σ_{v∈V} E_σ[ exp( 2t Σ_{t=1}^T q_t σ_t v_t(σ) ) ] ]   (monotonicity of exp)
≤ e^{−t(ε−2α)} E_z[ Σ_{v∈V} exp( 2t² ‖q‖_2² ) ]   (Hoeffding's ineq.)
≤ E_z[ N_1(α, G, z) ] exp( −t(ε − 2α) + 2t² ‖q‖_2² ). page 51

52 Algorithms

53 Review Theorem: for any δ > 0, with probability at least 1 − δ, for all h ∈ H and all α > 0,
L(h, Z_1^T) ≤ Σ_{t=1}^T q_t L(h, Z_t) + Δ̂(q) + Δ_s + 4α + [ ‖q‖_2 + ‖q − u_s‖_2 ] √( 2 log ( E_z[N_1(α, G, z)] / δ ) ).
This bound can be extended to hold uniformly over q at the price of an additional term of order Õ( ‖q − u‖_1 √( log_2 log_2 (1 − ‖q − u‖_1)^{−1} ) ). Data-dependent learning guarantee. page 53

54 Discrepancy-Risk Minimization Key Idea: directly optimize the upper bound on generalization over q and h. This problem can be solved efficiently for some L and H. page 54

55 Kernel-Based Regression Squared loss function: L(y, y′) = (y − y′)². Hypothesis set: for a PDS kernel K with feature map Φ_K,
H = { x ↦ w · Φ_K(x) : ‖w‖_H ≤ Λ }.
The complexity term can be bounded by O( Λ (log^{3/2} T) sup_x √(K(x, x)) ‖q‖_2 ). page 55

56 Instantaneous Discrepancy The empirical discrepancy can be further upper bounded in terms of instantaneous discrepancies:
Δ̂(q) ≤ Σ_{t=1}^T q_t d_t + M ‖q − u‖_1,
where M = sup_{y,y′} L(y, y′) and
d_t = sup_{h∈H} ( (1/s) Σ_{t′=T−s+1}^T L(h, Z_{t′}) − L(h, Z_t) ). page 56

57 Proof By sub-additivity of the supremum,
Δ̂(q) = sup_{h∈H} [ (1/s) Σ_{t′=T−s+1}^T L(h, Z_{t′}) − Σ_{t=1}^T q_t L(h, Z_t) ]
= sup_{h∈H} [ Σ_{t=1}^T q_t ( (1/s) Σ_{t′=T−s+1}^T L(h, Z_{t′}) − L(h, Z_t) ) + ( 1 − Σ_{t=1}^T q_t ) (1/s) Σ_{t′=T−s+1}^T L(h, Z_{t′}) ]
≤ Σ_{t=1}^T q_t sup_{h∈H} ( (1/s) Σ_{t′=T−s+1}^T L(h, Z_{t′}) − L(h, Z_t) ) + M | 1 − Σ_{t=1}^T q_t |
≤ Σ_{t=1}^T q_t d_t + M ‖u − q‖_1. page 57

58 Computing Discrepancies Instantaneous discrepancy for kernel-based hypotheses with the squared loss:
d_t = sup_{‖w′‖≤Λ} ( Σ_{s=1}^T u_s ( w′ · Φ_K(x_s) − y_s )² − ( w′ · Φ_K(x_t) − y_t )² ).
This is a difference of convex (DC) functions; the global optimum can be found via DC-programming (Tao and Ahn, 1998). page 58
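The DC-programming approach gives the global optimum. As a rough, purely illustrative alternative (not from the slides), one can lower-bound each d_t by evaluating the objective over a random sample of weight vectors from the ball ‖w′‖ ≤ Λ; the feature matrix, Λ, and the number of samples below are placeholders.

```python
import numpy as np

def approx_instantaneous_discrepancy(Phi, y, t, s, Lam=1.0, n_samples=2000, seed=0):
    """Crude lower bound on
        d_t = sup_{||w'|| <= Lam} ( weighted average squared loss over the recent points
                                     - squared loss at point t )
    obtained by sampling w' uniformly from the ball of radius Lam
    (an approximation, not the DC-programming solution of the slides).
    Phi: (T, d) array of feature vectors; y: (T,) array of targets."""
    rng = np.random.default_rng(seed)
    T, d = Phi.shape
    recent = slice(T - s, T)
    best = -np.inf
    for _ in range(n_samples):
        w = rng.normal(size=d)
        w *= Lam * rng.uniform() ** (1.0 / d) / np.linalg.norm(w)   # uniform in the ball
        losses = (Phi @ w - y) ** 2
        best = max(best, losses[recent].mean() - losses[t])
    return best

# Illustrative call: d_5 for random data, comparing the last s = 50 points to point t = 5.
Phi = np.random.default_rng(1).normal(size=(200, 4))
y = np.random.default_rng(2).normal(size=200)
print(approx_instantaneous_discrepancy(Phi, y, t=5, s=50))
```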

59 Discrepancy-Based Forecasting Theorem: for any δ > 0, with probability at least 1 − δ, for all kernel-based hypotheses h ∈ H and all 0 < ‖q − u‖_1 ≤ 1,
L(h, Z_1^T) ≤ Σ_{t=1}^T q_t L(h, Z_t) + Δ̂(q) + Δ_s + Õ( Λ (log^{3/2} T) sup_x √(K(x, x)) ‖q‖_2 + ‖q − u‖_1 ).
Corresponding optimization problem:
min_{q ∈ [0,1]^T, w} { Σ_{t=1}^T q_t ( w · Φ_K(x_t) − y_t )² + λ_1 Σ_{t=1}^T q_t d_t + λ_2 ‖w‖_H² + λ_3 ‖q − u‖_1 }. page 59

64 Convex Problem Change of variable: r_t = 1/q_t. Upper bound: | r_t^{−1} − 1/T | ≤ | r_t − T | / T for r_t ≥ 1. The problem becomes
min_{r ∈ D, w} { Σ_{t=1}^T [ ( w · Φ_K(x_t) − y_t )² + λ_1 d_t ] / r_t + λ_2 ‖w‖_H² + (λ_3 / T) Σ_{t=1}^T | r_t − T | },
where D = { r : r_t ≥ 1 }. This is a convex optimization problem: each term ( w · Φ_K(x_t) − y_t )² / r_t is jointly convex in (w, r_t) as a quadratic-over-linear function. page 64

65 Two-Stage Algorithm Minimize the empirical discrepancy Δ̂(q) over q (convex optimization). Then solve the (weighted) kernel ridge regression problem:
min_w { Σ_{t=1}^T q*_t ( w · Φ_K(x_t) − y_t )² + λ ‖w‖_H² },
where q* is the solution to the discrepancy minimization problem. page 65
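A minimal sketch of the second stage, assuming the weights q* from the first stage are given (here replaced by an arbitrary weight profile) and a Gaussian kernel: with f(x) = Σ_t α_t K(x_t, x), setting the gradient of the weighted objective to zero gives (diag(q) K + λ I) α = diag(q) y. The kernel choice, data, and hyperparameters are illustrative.

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def weighted_krr_fit(X, y, q, lam=1.0, gamma=1.0):
    """Weighted kernel ridge regression: minimize
        sum_t q_t (f(x_t) - y_t)^2 + lam * ||f||_H^2
    over f(x) = sum_t alpha_t K(x_t, x); the stationarity condition is
    (diag(q) K + lam I) alpha = diag(q) y."""
    K = gaussian_kernel(X, X, gamma)
    Q = np.diag(q)
    return np.linalg.solve(Q @ K + lam * np.eye(len(y)), Q @ y)

def weighted_krr_predict(X_train, alpha, X_new, gamma=1.0):
    return gaussian_kernel(X_new, X_train, gamma) @ alpha

# Toy usage with weights that emphasize recent points (q* would come from stage one).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.1 * rng.normal(size=200)
q = np.linspace(0.2, 1.0, 200); q /= q.sum()
alpha = weighted_krr_fit(X, y, q, lam=0.1, gamma=0.5)
print(weighted_krr_predict(X, alpha, X[-5:], gamma=0.5))
```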

66 Preliminary Experiments Artificial data sets. page 66

67 True vs. Empirical Discrepancies Discrepancies Weights page 67

68 Running MSE page 68

69 Real-world Data Commodity prices, exchange rates, temperatures & climate. page 69

70 Time Series Prediction & On-line Learning

71 Two Learning Scenarios Stochastic scenario: distributional assumption. performance measure: expected loss. guarantees: generalization bounds. On-line scenario: no distributional assumption. guarantees: regret bounds. performance measure: regret. active research area: (Cesa-Bianchi and Lugosi, 2006; Anava et al. 2013, 2015, 2016; Bousquet and Warmuth, 2002; Herbster and Warmuth, 1998, 2001; Koolen et al., 2015). page 71

72 On-Line Learning Setup Adversarial setting with hypothesis/action set H. For t = 1 to T do: player receives x_t ∈ X; player selects h_t ∈ H; adversary selects y_t ∈ Y; player incurs loss L(h_t(x_t), y_t). Objective: minimize the (external) regret
Reg_T = Σ_{t=1}^T L(h_t(x_t), y_t) − min_{h∈H} Σ_{t=1}^T L(h(x_t), y_t). page 72

73 Example: Exp. Weights (EW) Expert set H_0 = {E_1, ..., E_N}, H = conv(H_0).
EW({E_1, ..., E_N})   (parameter η > 0)
  for i ← 1 to N do
      w_{1,i} ← 1
  for t ← 1 to T do
      Receive(x_t)
      h_t ← ( Σ_{i=1}^N w_{t,i} E_i ) / ( Σ_{i=1}^N w_{t,i} )
      Receive(y_t)
      Incur-Loss(L(h_t(x_t), y_t))
      for i ← 1 to N do
          w_{t+1,i} ← w_{t,i} e^{−η L(E_i(x_t), y_t)}
  return h_T
page 73
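The pseudocode above as a small runnable Python sketch (an illustration, not the lecture's code): the experts are arbitrary prediction functions, the loss is the squared loss with predictions and labels in [0, 1] so that it is convex and bounded by 1, and η is set as on the next slide.

```python
import numpy as np

def exponential_weights(experts, xs, ys, eta):
    """Exponentiated-weights forecaster: predict with the weight-averaged expert,
    then multiply each weight by exp(-eta * loss of that expert)."""
    loss = lambda p, y: (p - y) ** 2          # convex, in [0, 1] for p, y in [0, 1]
    N, T = len(experts), len(xs)
    w = np.ones(N)
    total, expert_totals = 0.0, np.zeros(N)
    for t in range(T):
        preds = np.array([E(xs[t]) for E in experts])
        h_t = np.dot(w, preds) / w.sum()      # convex combination of the experts
        total += loss(h_t, ys[t])
        expert_losses = (preds - ys[t]) ** 2
        expert_totals += expert_losses
        w *= np.exp(-eta * expert_losses)
    return total - expert_totals.min()        # external regret

# Toy usage: constant experts tracking a noisy level in [0, 1].
rng = np.random.default_rng(0)
T = 1000
xs = np.zeros(T)                              # inputs are unused by constant experts
ys = np.clip(0.5 + 0.1 * rng.normal(size=T), 0.0, 1.0)
experts = [lambda x, c=c: c for c in np.linspace(0.0, 1.0, 11)]
eta = np.sqrt(8 * np.log(len(experts)) / T)
print("regret:", exponential_weights(experts, xs, ys, eta),
      "bound:", np.sqrt(T / 2 * np.log(len(experts))))
```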

74 EW Guarantee Theorem: assume that L is convex in its first argument and takes values in [0, 1]. Then, for any η > 0 and any sequence y_1, ..., y_T ∈ Y, the regret of EW at time T satisfies
Reg_T ≤ (log N)/η + ηT/8.
For η = √(8 log N / T),
Reg_T ≤ √( (T/2) log N ),   i.e.,   Reg_T / T = O( √( (log N)/T ) ). page 74

75 EW - Proof Potential: Φ_t = log Σ_{i=1}^N w_{t,i}. Upper bound:
Φ_t − Φ_{t−1} = log ( Σ_{i=1}^N w_{t−1,i} e^{−η L(E_i(x_t), y_t)} / Σ_{i=1}^N w_{t−1,i} )
= log E_{w_{t−1}}[ e^{−η L(E_i(x_t), y_t)} ]
≤ log exp( −η E_{w_{t−1}}[ L(E_i(x_t), y_t) ] + η²/8 )   (Hoeffding's ineq.)
= −η E_{w_{t−1}}[ L(E_i(x_t), y_t) ] + η²/8
≤ −η L( E_{w_{t−1}}[ E_i(x_t) ], y_t ) + η²/8   (convexity of first arg. of L)
= −η L(h_t(x_t), y_t) + η²/8. page 75

76 EW - Proof Upper bound: summing up the inequalities yields
Φ_T − Φ_0 ≤ −η Σ_{t=1}^T L(h_t(x_t), y_t) + η²T/8.
Lower bound:
Φ_T − Φ_0 = log Σ_{i=1}^N e^{−η Σ_{t=1}^T L(E_i(x_t), y_t)} − log N
≥ log max_{i=1}^N e^{−η Σ_{t=1}^T L(E_i(x_t), y_t)} − log N
= −η min_{i=1}^N Σ_{t=1}^T L(E_i(x_t), y_t) − log N.
Comparison:
Σ_{t=1}^T L(h_t(x_t), y_t) − min_{i=1}^N Σ_{t=1}^T L(E_i(x_t), y_t) ≤ (log N)/η + ηT/8. page 76

77 Questions Can we exploit both the batch and the on-line scenarios to design flexible algorithms for time series prediction with stochastic guarantees? Can we tackle notoriously difficult time series problems, e.g., model selection and learning ensembles? page 77

78 Model Selection Problem: given N time series models, how should we use the sample Z_1^T to select a single best model? in the i.i.d. case, cross-validation can be shown to be close to the structural risk minimization solution. but, how do we select a validation set for general stochastic processes? use the most recent data? use various splits? use the most distant data? models may have been pre-trained on Z_1^T. page 78

79 Learning Ensembles Problem: given a hypothesis set H and a sample Z_1^T, find an accurate convex combination h = Σ_{t=1}^T q_t h_t with h = (h_1, ..., h_T) ∈ H_A and q ∈ Δ. in the most general case, hypotheses may have been pre-trained on Z_1^T. on-line-to-batch conversion for general non-stationary non-mixing processes. page 79

80 On-Line-to-Batch (OTB) Input: sequence of hypotheses h = (h_1, ..., h_T) returned after T rounds by an on-line algorithm A minimizing the general regret
Reg_T = Σ_{t=1}^T L(h_t, Z_t) − inf_{h*∈H} Σ_{t=1}^T L(h*, Z_t). page 80

81 On-Line-to-Batch (OTB) Problem: use h = (h_1, ..., h_T) to derive a hypothesis h ∈ H with small path-dependent expected loss,
L_{T+1}(h, Z_1^T) = E_{Z_{T+1}}[ L(h, Z_{T+1}) | Z_1^T ].
the i.i.d. problem is standard: (Littlestone, 1989), (Cesa-Bianchi et al., 2004). but, how do we design solutions for the general time-series scenario? page 81

82 Questions Is OTB with general (non-stationary, non-mixing) stochastic processes possible? Can we design algorithms with theoretical guarantees? need a new tool for the analysis. page 82

83 Relevant Quantity Key difference (for 1 ≤ t ≤ T, with prediction at time T + 1):
L_t(h_t, Z_1^{t−1}) − L_{T+1}(h_t, Z_1^T).
Average difference:
(1/T) Σ_{t=1}^T [ L_{T+1}(h_t, Z_1^T) − L_t(h_t, Z_1^{t−1}) ]. page 83

84 On-line Discrepancy Definition:
disc(q) = sup_{h∈H_A} Σ_{t=1}^T q_t [ L_{T+1}(h_t, Z_1^T) − L_t(h_t, Z_1^{t−1}) ],
where H_A is the set of hypothesis sequences that A can return and q = (q_1, ..., q_T) is an arbitrary weight vector. natural measure of non-stationarity or dependency. captures hypothesis set and loss function. can be efficiently estimated under mild assumptions. generalization of the definition of (Kuznetsov and MM, 2015). page 84

85 Discrepancy Estimation Batch discrepancy estimation method. Alternative method: assume that the loss is μ-Lipschitz. assume that there exists an accurate hypothesis h*, that is, that
inf_{h*} E[ L(Z_{T+1}, h*(X_{T+1})) | Z_1^T ]
is small. page 85

86 Discrepancy Estimation Lemma: fix a sequence Z_1^T in Z. Then, for any δ > 0, with probability at least 1 − δ, the following holds for all α > 0:
disc(q) ≤ disc_H(q) + μ inf_{h*} E[ L(Z_{T+1}, h*(X_{T+1})) | Z_1^T ] + 2α + M ‖q‖_2 √( 2 log ( E_z[N_1(α, G, z)] / δ ) ),
where the empirical discrepancy is
disc_H(q) = sup_{h′∈H, h∈H_A} Σ_{t=1}^T q_t [ L( h_t(X_{T+1}), h′(X_{T+1}) ) − L( h_t, Z_t ) ]. page 86

87 Proof Sketch
disc(q) = sup_{h∈H_A} Σ_{t=1}^T q_t [ L_{T+1}(h_t, Z_1^T) − L_t(h_t, Z_1^{t−1}) ]
≤ sup_{h∈H_A} Σ_{t=1}^T q_t [ L_{T+1}(h_t, Z_1^T) − E[ L(h_t(X_{T+1}), h*(X_{T+1})) | Z_1^T ] ]
  + sup_{h∈H_A} Σ_{t=1}^T q_t [ E[ L(h_t(X_{T+1}), h*(X_{T+1})) | Z_1^T ] − L_t(h_t, Z_1^{t−1}) ]
≤ μ E[ L(h*(X_{T+1}), Y_{T+1}) | Z_1^T ] + sup_{h∈H_A} Σ_{t=1}^T q_t [ E[ L(h_t(X_{T+1}), h*(X_{T+1})) | Z_1^T ] − L_t(h_t, Z_1^{t−1}) ],
using the μ-Lipschitzness of the loss for the first term. page 87

88 Learning Guarantee Lemma: let L be a convex loss bounded by M and h_1^T = (h_1, ..., h_T) a hypothesis sequence adapted to Z_1^T. Fix q ∈ Δ. Then, for any δ > 0, the following holds with probability at least 1 − δ for the hypothesis h = Σ_{t=1}^T q_t h_t:
L_{T+1}(h, Z_1^T) ≤ Σ_{t=1}^T q_t L(h_t, Z_t) + disc(q) + M ‖q‖_2 √( 2 log (1/δ) ). page 88

89 Proof By definition of the on-line discrepancy,
Σ_{t=1}^T q_t [ L_{T+1}(h_t, Z_1^T) − L_t(h_t, Z_1^{t−1}) ] ≤ disc(q).
A_t = q_t [ L_t(h_t, Z_1^{t−1}) − L(h_t, Z_t) ] is a martingale difference, thus by Azuma's inequality, with high probability,
Σ_{t=1}^T q_t L_t(h_t, Z_1^{t−1}) ≤ Σ_{t=1}^T q_t L(h_t, Z_t) + M ‖q‖_2 √( 2 log (1/δ) ).
By convexity of the loss:
L_{T+1}(h, Z_1^T) ≤ Σ_{t=1}^T q_t L_{T+1}(h_t, Z_1^T). page 89

90 Learning Guarantee Theorem: let L be a convex loss bounded by M and H a set of hypothesis sequences adapted to Z_1^T. Fix q ∈ Δ. Then, for any δ > 0, the following holds with probability at least 1 − δ for the hypothesis h = Σ_{t=1}^T q_t h_t:
L_{T+1}(h, Z_1^T) ≤ inf_{h*∈H} L_{T+1}(h*, Z_1^T) + 2 disc(q) + Reg_T / T + M ‖q − u‖_1 + 2M ‖q‖_2 √( 2 log (2/δ) ). page 90
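A minimal sketch of the on-line-to-batch conversion used in this theorem, assuming the sequence h_1, ..., h_T returned by the on-line algorithm is already available: the final predictor is simply the q-weighted convex combination of those hypotheses (q defaults to uniform; the stand-in hypotheses below are illustrative).

```python
import numpy as np

def online_to_batch(hypotheses, q=None):
    """Return h = sum_t q_t h_t, the q-weighted convex combination of the
    hypotheses produced along an on-line run (q defaults to uniform)."""
    T = len(hypotheses)
    q = np.full(T, 1.0 / T) if q is None else np.asarray(q, dtype=float)
    q = q / q.sum()                            # normalize to a point of the simplex
    def h(x):
        return sum(q_t * h_t(x) for q_t, h_t in zip(q, hypotheses))
    return h

# Toy usage: combine the per-round predictors of an on-line run
# (stand-in linear predictors with increasing slopes, for illustration).
hyps = [lambda x, c=c: c * x for c in np.linspace(0.0, 1.0, 50)]
h_final = online_to_batch(hyps)
print(h_final(2.0))   # average slope (0.5) times x = 2.0 gives 1.0
```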

91 Conclusion Time series forecasting: key learning problem in many important tasks. very challenging: theory, algorithms, applications. new and general data-dependent learning guarantees for non-mixing non-stationary processes. algorithms with guarantees. Time series prediction and on-line learning: guarantees for flexible solutions derived via OTB. application to model selection. application to learning ensembles. page 91

92 References T. M. Adams and A. B. Nobel. Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling. The Annals of Probability, 38(4): , A. Agarwal, J. Duchi. The generalization ability of online algorithms for dependent data. Information Theory, IEEE Transactions on, 59(1): , P. Alquier, X. Li, O. Wintenberger. Prediction of time series by statistical learning: general losses and fast rates. Dependence Modelling, 1:65-93, Oren Anava, Elad Hazan, Shie Mannor, and Ohad Shamir. Online learning for time series prediction. COLT, page 92

93 References O. Anava, E. Hazan, and A. Zeevi. Online time series prediction with missing data. ICML, O. Anava and S. Mannor. Online Learning for heteroscedastic sequences. ICML, P. L. Bartlett. Learning with a slowly changing distribution. COLT, R. D. Barve and P. M. Long. On the complexity of learning from drifting distributions. Information and Computation, 138(2): , page 93

94 References P. Berti and P. Rigo. A Glivenko-Cantelli theorem for exchangeable random variables. Statistics & Probability Letters, 32(4): , S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS T. Bollerslev. Generalized autoregressive conditional heteroskedasticity. J Econometrics, G. E. P. Box, G. Jenkins. (1990). Time Series Analysis, Forecasting and Control. page 94

95 References O. Bousquet and M. K. Warmuth. Tracking a small set of experts by mixing past posteriors. COLT, N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Trans. on Inf. Theory, 50(9), N. Cesa-Bianchi and C. Gentile. Tracking the best hyperplane with a simple budget perceptron. COLT, C. Cortes and M. Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519, page 95

96 References V. H. De la Pena and E. Gine. (1999) Decoupling: from dependence to independence: randomly stopped processes, U- statistics and processes, martingales and beyond. Probability and its applications. Springer, NY. P. Doukhan. (1994) Mixing: properties and examples. Lecture notes in statistics. Springer-Verlag, New York. R. Engle. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4): , page 96

97 References D. P. Helmbold and P. M. Long. Tracking drifting concepts by minimizing disagreements. Machine Learning, 14(1): 27-46, M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32(2), M. Herbster and M. K. Warmuth. Tracking the best linear predictor. JMLR, D. Hsu, A. Kontorovich, and C. Szepesvári. Mixing time estimation in reversible Markov chains from a single sample path. NIPS, page 97

98 References W. M. Koolen, A. Malek, P. L. Bartlett, and Y. Abbasi. Minimax time series prediction. NIPS, V. Kuznetsov, M. Mohri. Generalization bounds for time series prediction with non-stationary processes. In ALT, V. Kuznetsov, M. Mohri. Learning theory and algorithms for forecasting non-stationary time series. In NIPS, V. Kuznetsov, M. Mohri. Time series prediction and on-line learning. In COLT, page 98

99 References A. C. Lozano, S. R. Kulkarni, and R. E. Schapire. Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. In NIPS, pages , Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT R. Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, pages 5 34, page 99

100 References D. Modha, E. Masry. Memory-universal prediction of stationary random processes. Information Theory, IEEE Transactions on, 44(1): , Jan M. Mohri, A. Munoz Medina. New analysis and algorithm for learning with drifting distributions. In ALT, M. Mohri, A. Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In NIPS, M. Mohri, A. Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11: , page 100

101 References V. Pestov. Predictive PAC learnability: A paradigm for learning from exchangeable input data. GRC, A. Rakhlin, K. Sridharan, A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In NIPS, L. Ralaivola, M. Szafranski, G. Stempfel. Chromatic PAC- Bayes bounds for non-iid data: Applications to ranking and stationary beta-mixing processes. JMLR 11: , C. Shalizi and A. Kontorovitch. Predictive PAC-learning and process decompositions. NIPS, page 101

102 References A. Rakhlin, K. Sridharan, A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. NIPS, I. Steinwart, A. Christmann. Fast learning from non-i.i.d. observations. NIPS, M. Vidyasagar. (1997). A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems. Springer-Verlag New York, Inc. V. Vovk. Competing with stationary prediction strategies. COLT page 102

103 References B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94 116, page 103
