Advanced Machine Learning
1 Advanced Machine Learning: Time Series Prediction. Vitaly Kuznetsov (Google Research), Mehryar Mohri (Courant Institute & Google Research).
2 Motivation Time series prediction arises in many applications: stock values, earthquakes, energy demand, signal processing, sales forecasting, economic variables, weather (e.g., local and global temperature), sensors (Internet of Things).
3-5 Google Trends (example figures: Google Trends time series).
6 Challenges Standard supervised learning assumes: the IID assumption; the same distribution for training and test data; distributions fixed over time (stationarity). None of these assumptions holds for time series!
7 Outline Introduction to time series analysis. Learning theory for forecasting non-stationary time series. Algorithms for forecasting non-stationary time series. Time series prediction and on-line learning.
8 Introduction to Time Series Analysis
9 Classical Framework Postulate a particular form of parametric model that is assumed to generate the data. Use the given sample to estimate the unknown parameters of the model. Use the estimated model to make predictions.
10 Autoregressive (AR) Models Definition: the AR(p) model is a linear generative model based on the p-th order Markov assumption:
$$\forall t, \quad Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t,$$
where the $\epsilon_t$ are zero-mean uncorrelated random variables with variance $\sigma^2$, $a_1, \ldots, a_p$ are the autoregressive coefficients, and $Y_t$ is the observed stochastic process.
11 Moving Averages (MA) Definition: the MA(q) model is a linear generative model for the noise term based on the q-th order Markov assumption:
$$\forall t, \quad Y_t = \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j},$$
where $b_1, \ldots, b_q$ are the moving-average coefficients.
12 ARMA Model (Whittle, 1951; Box & Jenkins, 1971) Definition: the ARMA(p, q) model is a generative linear model that combines the AR(p) and MA(q) models:
$$\forall t, \quad Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}.$$
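The ARMA(p, q) recursion just defined is easy to simulate directly. The following sketch is illustrative only (the coefficient values and burn-in length are arbitrary choices, not from the slides); setting q = 0 gives a pure AR model and p = 0 a pure MA model:

```python
import numpy as np

def simulate_arma(a, b, T, sigma=1.0, burn_in=200, seed=0):
    """Simulate Y_t = sum_i a_i Y_{t-i} + eps_t + sum_j b_j eps_{t-j}."""
    rng = np.random.default_rng(seed)
    p, q = len(a), len(b)
    n = T + burn_in
    eps = rng.normal(0.0, sigma, size=n)
    y = np.zeros(n)
    for t in range(n):
        ar = sum(a[i] * y[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(b[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        y[t] = ar + eps[t] + ma
    # Discard the burn-in so the returned series is near its stationary regime.
    return y[burn_in:]

y = simulate_arma(a=[0.5], b=[0.3], T=1000)
```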
13 ARMA (figure).
14 Stationarity Definition: a sequence of random variables $Z = \{Z_t\}_{-\infty}^{+\infty}$ is stationary if its distribution is invariant to shifting in time: $(Z_t, \ldots, Z_{t+m})$ has the same distribution as $(Z_{t+k}, \ldots, Z_{t+m+k})$.
15 Weak Stationarity Definition: a sequence of random variables $Z = \{Z_t\}_{-\infty}^{+\infty}$ is weakly stationary if its first and second moments are invariant to shifting in time, that is, $E[Z_t]$ is independent of $t$, and $E[Z_t Z_{t-j}] = f(j)$ for some function $f$.
16 Lag Operator The lag operator $L$ is defined by $L Y_t = Y_{t-1}$. The ARMA model in terms of the lag operator:
$$\Big(1 - \sum_{i=1}^{p} a_i L^i\Big) Y_t = \Big(1 + \sum_{j=1}^{q} b_j L^j\Big) \epsilon_t.$$
The characteristic polynomial
$$P(z) = 1 - \sum_{i=1}^{p} a_i z^i$$
can be used to study properties of this stochastic process.
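The root condition on the characteristic polynomial can be checked numerically. A minimal sketch (not from the slides) using `numpy.roots`:

```python
import numpy as np

def is_weakly_stationary_ar(a):
    """Check whether all roots of P(z) = 1 - a_1 z - ... - a_p z^p
    lie strictly outside the unit circle."""
    # numpy.roots expects coefficients from highest degree to lowest:
    # P(z) = -a_p z^p - ... - a_1 z + 1.
    coeffs = [-ai for ai in reversed(a)] + [1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))
```

For example, an AR(1) process with $a_1 = 0.5$ has the single root $z = 2$ and passes the test, while a random walk ($a_1 = 1$, a unit root) fails it.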
17 Weak Stationarity of ARMA Theorem: an ARMA(p, q) process is weakly stationary if the roots of the characteristic polynomial $P(z)$ are outside the unit circle.
18 Proof If the roots of the characteristic polynomial are outside the unit circle, then it factors as
$$P(z) = 1 - \sum_{i=1}^{p} a_i z^i = c\,(\lambda_1 - z)\cdots(\lambda_p - z) = c'\,(1 - \lambda_1^{-1} z)\cdots(1 - \lambda_p^{-1} z),$$
where $|\lambda_i| > 1$ for all $i = 1, \ldots, p$ and $c, c'$ are constants.
19 Proof Therefore, the ARMA(p, q) process
$$\Big(1 - \sum_{i=1}^{p} a_i L^i\Big) Y_t = \Big(1 + \sum_{j=1}^{q} b_j L^j\Big)\epsilon_t$$
admits an MA($\infty$) representation:
$$Y_t = \frac{1}{(1 - \lambda_1^{-1} L)\cdots(1 - \lambda_p^{-1} L)}\Big(1 + \sum_{j=1}^{q} b_j L^j\Big)\epsilon_t,$$
where each
$$\frac{1}{1 - \lambda_i^{-1} L} = \sum_{k=0}^{\infty} \lambda_i^{-k} L^k$$
is well-defined since $|\lambda_i^{-1}| < 1$.
20 Proof Therefore, it suffices to show that
$$Y_t = \sum_{j=0}^{\infty} \psi_j \epsilon_{t-j}$$
is weakly stationary. The mean is constant:
$$E[Y_t] = \sum_{j=0}^{\infty} \psi_j E[\epsilon_{t-j}] = 0.$$
The covariance function $E[Y_t Y_{t-l}]$ only depends on the lag $l$:
$$E[Y_t Y_{t-l}] = \sum_{k=0}^{\infty}\sum_{j=0}^{\infty} \psi_k \psi_j E[\epsilon_{t-j}\epsilon_{t-l-k}] = \sigma^2 \sum_{j=0}^{\infty} \psi_j \psi_{j+l}.$$
21 ARIMA Non-stationary processes can be modeled using processes whose characteristic polynomial has unit roots. A characteristic polynomial with unit roots can be factored:
$$P(z) = R(z)(1 - z)^D,$$
where $R(z)$ has no unit roots. Definition: the ARIMA(p, D, q) model is an ARMA(p, q) model for $(1 - L)^D Y_t$:
$$\Big(1 - \sum_{i=1}^{p} a_i L^i\Big)(1 - L)^D Y_t = \Big(1 + \sum_{j=1}^{q} b_j L^j\Big)\epsilon_t.$$
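Differencing is how the $(1 - L)^D$ factor is applied in practice. A small sketch (illustrative, not from the slides): a random walk is ARIMA(0, 1, 0), and one application of `numpy.diff` recovers its stationary increments exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500
steps = rng.normal(size=T)     # stationary white-noise increments
y = np.cumsum(steps)           # random walk: characteristic polynomial has a unit root
dy = np.diff(y)                # (1 - L) Y_t: first differencing removes the unit root
```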
23 Other Extensions Further variants: models with seasonal components (SARIMA); models with side information (ARIMAX); models with long memory (ARFIMA); multivariate time series models (VAR); models with time-varying coefficients; other non-linear models.
24 Modeling Variance (Engle, 1982; Bollerslev, 1986) Definition: the generalized autoregressive conditional heteroscedasticity GARCH(p, q) model is an ARMA(p, q) model for the variance $\sigma_t^2$ of the noise term $\epsilon_t$:
$$\forall t, \quad \sigma_{t+1}^2 = \omega + \sum_{i=0}^{p-1} \alpha_i \epsilon_{t-i}^2 + \sum_{j=0}^{q-1} \beta_j \sigma_{t-j}^2,$$
where the $\epsilon_t$ are zero-mean Gaussian random variables with variance $\sigma_t^2$ conditioned on $\{Y_{t-1}, Y_{t-2}, \ldots\}$, and $\omega > 0$ is the mean parameter.
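A minimal GARCH(1, 1) simulation sketch (the parameter values are arbitrary illustrations; the recursion is the standard one, initialized at the unconditional variance, which requires $\alpha + \beta < 1$):

```python
import numpy as np

def simulate_garch11(omega, alpha, beta, T, seed=0):
    """Simulate eps_t = sigma_t z_t with
    sigma_t^2 = omega + alpha * eps_{t-1}^2 + beta * sigma_{t-1}^2."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=T)
    eps = np.zeros(T)
    sigma2 = np.zeros(T)
    sigma2[0] = omega / (1.0 - alpha - beta)  # unconditional variance
    eps[0] = np.sqrt(sigma2[0]) * z[0]
    for t in range(1, T):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
        eps[t] = np.sqrt(sigma2[t]) * z[t]
    return eps, sigma2

eps, sigma2 = simulate_garch11(omega=0.1, alpha=0.1, beta=0.8, T=2000)
```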
26 State-Space Models Continuous state-space version of hidden Markov models:
$$X_{t+1} = B X_t + U_t, \qquad Y_t = A X_t + \epsilon_t,$$
where $X_t$ is an $n$-dimensional state vector, $Y_t$ is the observed stochastic process, $A$ and $B$ are model parameters, and $U_t$ and $\epsilon_t$ are noise terms.
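The state-space recursion above can be simulated directly. A small sketch (the matrices and noise scales are arbitrary illustrations; `B` is chosen with spectral radius below 1 so the hidden state stays stable):

```python
import numpy as np

def simulate_state_space(A, B, T, state_noise=0.1, obs_noise=0.1, seed=0):
    """X_{t+1} = B X_t + U_t (hidden state), Y_t = A X_t + eps_t (observation)."""
    rng = np.random.default_rng(seed)
    n = B.shape[0]
    x = np.zeros(n)
    ys = []
    for _ in range(T):
        y = A @ x + obs_noise * rng.normal(size=A.shape[0])
        ys.append(y)
        x = B @ x + state_noise * rng.normal(size=n)
    return np.array(ys)

B = np.array([[0.9, 0.1], [0.0, 0.8]])  # stable transition matrix
A = np.array([[1.0, 0.0]])              # observe the first state coordinate
Y = simulate_state_space(A, B, T=300)
```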
27 State-Space Models (figure).
28 Estimation Different methods for estimating model parameters: maximum likelihood estimation (requires further parametric assumptions on the noise distribution, e.g., Gaussian); method of moments (Yule-Walker estimator; restricted to certain models); conditional and unconditional least-squares estimation.
29 Invertibility of ARMA Definition: an ARMA(p, q) process is invertible if the roots of the polynomial
$$Q(z) = 1 + \sum_{j=1}^{q} b_j z^j$$
are outside the unit circle.
30 Learning Guarantee Theorem: assume $Y_t \sim$ ARMA(p, q) is weakly stationary and invertible. Let $\hat{a}_T$ denote the least-squares estimate of $a = (a_1, \ldots, a_p)$ and assume that $p$ is known. Then, $\|\hat{a}_T - a\|$ converges in probability to zero. Similar results hold for other estimators and other models.
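The least-squares estimator referred to above can be sketched in a few lines: regress $Y_t$ on its $p$ lagged values. This is an illustrative check of consistency on a simulated AR(1) path (the true coefficient 0.6 and the sample size are arbitrary choices):

```python
import numpy as np

def fit_ar_least_squares(y, p):
    """Least-squares estimate of AR(p) coefficients from one sample path."""
    # Design matrix: row t holds (Y_{t-1}, ..., Y_{t-p}), target is Y_t.
    X = np.column_stack([y[p - i - 1 : len(y) - i - 1] for i in range(p)])
    target = y[p:]
    a_hat, *_ = np.linalg.lstsq(X, target, rcond=None)
    return a_hat

rng = np.random.default_rng(0)
T = 5000
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.6 * y[t - 1] + rng.normal()
a_hat = fit_ar_least_squares(y, p=1)   # should be close to 0.6
```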
31 Notes Many other generative models exist. Learning guarantees are asymptotic. The model needs to be correctly specified. Non-stationarity needs to be modeled explicitly.
32 Theory
33 Time Series Forecasting Training data: a finite sample realization of some stochastic process, $(X_1, Y_1), \ldots, (X_T, Y_T) \in Z = X \times Y$. Loss function: $L: H \times Z \to [0, 1]$, where $H$ is a hypothesis set of functions mapping from $X$ to $Y$. Problem: find $h \in H$ with small path-dependent expected loss,
$$L(h, Z_1^T) = E\big[L(h, Z_{T+1}) \mid Z_1^T\big].$$
34 Standard Assumptions Stationarity: $(Z_t, \ldots, Z_{t+m})$ and $(Z_{t+k}, \ldots, Z_{t+m+k})$ have the same distribution. Mixing: dependence between events decays with the gap $k$ between them.
35 β-Mixing Definition: a sequence of random variables $Z = \{Z_t\}_{-\infty}^{+\infty}$ is β-mixing if
$$\beta(k) = \sup_n \; E_{B \in \sigma(Z_1^n)}\Big[\sup_{A \in \sigma(Z_{n+k}^{\infty})} \big|P[A \mid B] - P[A]\big|\Big] \to 0,$$
that is, the dependence between events decays with $k$.
36 Learning Theory Stationary and β-mixing processes: generalization bounds. PAC learning is preserved in that setting (Vidyasagar, 1997). Covering-number bounds for regression (Meir, 2000). VC-dimension bounds for binary classification (Yu, 1994). Rademacher complexity bounds for general loss functions (MM and Rostamizadeh, 2009). PAC-Bayesian bounds (Alquier et al., 2014).
37 Learning Theory Stationarity and mixing: algorithm-dependent bounds. AdaBoost (Lozano et al., 2006). General stability bounds (MM and Rostamizadeh, 2010). Regularized ERM (Steinwart and Christmann, 2009). Stable on-line algorithms (Agarwal and Duchi, 2013).
38 Problem Stationarity and mixing assumptions: often do not hold (think of trends or periodic signals); are not testable; estimating mixing parameters can be hard, even if the general functional form is known; the hypothesis set and loss function are ignored.
39 Questions Is learning with general (non-stationary, non-mixing) stochastic processes possible? Can we design algorithms with theoretical guarantees? We need a new tool for the analysis.
40 Key Quantity - Fixed h Key difference: $L(h, Z_1^{t-1})$ versus $L(h, Z_1^T)$, for $1 \le t \le T$. Key average quantity:
$$\frac{1}{T} \sum_{t=1}^{T} \Big[L(h, Z_1^T) - L(h, Z_1^{t-1})\Big].$$
41 Discrepancy Definition:
$$\Delta = \sup_{h \in H} \Big[ L(h, Z_1^T) - \frac{1}{T}\sum_{t=1}^{T} L(h, Z_1^{t-1}) \Big].$$
It captures the hypothesis set and the loss function. $\Delta = 0$ in the IID case, and also for weakly stationary processes with linear hypotheses and squared loss (K and MM, 2014). It can be estimated from data under mild assumptions.
42 Weighted Discrepancy Definition: extension to weights $q = (q_1, \ldots, q_T)$:
$$\Delta(q) = \sup_{h \in H} \Big[ L(h, Z_1^T) - \sum_{t=1}^{T} q_t L(h, Z_1^{t-1}) \Big].$$
It strictly extends the discrepancy definition used for drifting distributions (MM and Muñoz Medina, 2012), for domain adaptation (Mansour, MM, and Rostamizadeh, 2009; Cortes and MM, 2011, 2014), and for the binary loss (Devroye et al., 1996; Ben-David et al., 2007). It admits upper bounds in terms of relative entropy, or in terms of β-mixing coefficients of asymptotic stationarity for an asymptotically stationary process.
43 Estimation Decomposition: $\Delta(q) \le \Delta_0(q) + \Delta_s$, where
$$\Delta_0(q) = \sup_{h \in H}\Big[\frac{1}{s}\sum_{t=T-s+1}^{T} L(h, Z_1^{t-1}) - \sum_{t=1}^{T} q_t L(h, Z_1^{t-1})\Big],$$
$$\Delta_s = \sup_{h \in H}\Big[L(h, Z_1^T) - \frac{1}{s}\sum_{t=T-s+1}^{T} L(h, Z_1^{t-1})\Big].$$
The first term compares the weighted average to the average over the $s$ most recent points; the second is small when the process is close to stationary over the recent past.
44 Learning Guarantee Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H$ and all $\alpha > 0$,
$$L(h, Z_1^T) \le \sum_{t=1}^{T} q_t L(h, Z_t) + \Delta(q) + 2\alpha + \|q\|_2 \sqrt{2 \log \frac{E_z[N_1(\alpha, G, z)]}{\delta}},$$
where $G = \{z \mapsto L(h, z) : h \in H\}$.
45 Bound with Empirical Discrepancy Corollary: for any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H$ and all $\alpha > 0$,
$$L(h, Z_1^T) \le \sum_{t=1}^{T} q_t L(h, Z_t) + \hat{\Delta}(q) + \Delta_s + 4\alpha + \big[\|q\|_2 + \|q - u_s\|_2\big]\sqrt{2\log\frac{E_z[N_1(\alpha, G, z)]}{\delta}},$$
where
$$\hat{\Delta}(q) = \sup_{h \in H}\Big[\frac{1}{s}\sum_{t=T-s+1}^{T} L(h, Z_t) - \sum_{t=1}^{T} q_t L(h, Z_t)\Big],$$
$u_s$ is the uniform distribution over the last $s$ points, and $G = \{z \mapsto L(h, z) : h \in H\}$.
46 Weighted Sequential α-Cover (Rakhlin et al., 2010; K and MM, 2015) Definition: let $z$ be a $Z$-valued full binary tree of depth $T$. Then a set of trees $V$ is an $l_1$-norm $q$-weighted $\alpha$-cover of a function class $G$ on $z$ if
$$\forall g \in G, \forall \sigma \in \{\pm 1\}^T, \exists v \in V: \quad \sum_{t=1}^{T} q_t \big| v_t(\sigma) - g(z_t(\sigma)) \big| \le \alpha.$$
(A figure illustrates the binary tree $z$, with the pair $g(z_t), v_t$ at each node and the weighted $l_1$ distance along a path bounded by $\alpha$.)
47 Sequential Covering Numbers Definitions: the sequential covering number is
$$N_1(\alpha, G, z) = \min\big\{|V| : V \text{ is an } l_1\text{-norm } q\text{-weighted } \alpha\text{-cover of } G \text{ on } z\big\},$$
and the expected sequential covering number is $E_{z \sim Z_T}[N_1(\alpha, G, z)]$, where $Z_T$ is a distribution over trees based on the $Z_t$'s: the root pair is drawn as $Z_1, Z_1' \sim D_1$, the next level as $Z_2, Z_2' \sim D_2(\cdot \mid Z_1)$ and $Z_3, Z_3' \sim D_2(\cdot \mid Z_1')$, and so on.
48 Proof Key quantities:
$$\Delta(q) = \sup_{h \in H}\Big[L(h, Z_1^T) - \sum_{t=1}^T q_t L(h, Z_1^{t-1})\Big], \qquad \Phi(Z_1^T) = \sup_{h \in H}\sum_{t=1}^T q_t\big[L(h, Z_1^{t-1}) - L(h, Z_t)\big].$$
Chernoff technique: for any $\lambda > 0$,
$$P\big(\Phi(Z_1^T) > \epsilon\big) = P\Big[\exp\Big(\lambda \sup_{h \in H}\sum_{t=1}^T q_t\big[L(h, Z_1^{t-1}) - L(h, Z_t)\big]\Big) > e^{\lambda\epsilon}\Big] \le e^{-\lambda\epsilon}\, E\Big[\exp\Big(\lambda \sup_{h \in H}\sum_{t=1}^T q_t\big[L(h, Z_1^{t-1}) - L(h, Z_t)\big]\Big)\Big] \quad \text{(Markov's inequality)}.$$
49 Symmetrization Key tool: the decoupled tangent sequence $Z'^T_1$ associated to $Z_1^T$: $Z_t$ and $Z_t'$ are i.i.d. given $Z_1^{t-1}$. Then
$$P\big(\Phi(Z_1^T) > \epsilon\big) \le e^{-\lambda\epsilon}\, E\Big[\exp\Big(\lambda \sup_{h}\sum_{t} q_t\big[L(h, Z_1^{t-1}) - L(h, Z_t)\big]\Big)\Big]$$
$$= e^{-\lambda\epsilon}\, E\Big[\exp\Big(\lambda \sup_{h}\sum_{t} q_t\big[E[L(h, Z_t') \mid Z_1^{t-1}] - L(h, Z_t)\big]\Big)\Big] \quad \text{(tangent sequence)}$$
$$= e^{-\lambda\epsilon}\, E\Big[\exp\Big(\lambda \sup_{h} E\Big[\sum_{t} q_t\big(L(h, Z_t') - L(h, Z_t)\big) \,\Big|\, Z_1^T\Big]\Big)\Big] \quad \text{(linearity of expectation)}$$
$$\le e^{-\lambda\epsilon}\, E\Big[\exp\Big(\lambda \sup_{h}\sum_{t} q_t\big(L(h, Z_t') - L(h, Z_t)\big)\Big)\Big] \quad \text{(Jensen's inequality)}.$$
50 Symmetrization Continuing, with $(z, z')$ the pair of trees built from the tangent sequences and $\sigma$ i.i.d. Rademacher variables:
$$P\big(\Phi(Z_1^T) > \epsilon\big) \le e^{-\lambda\epsilon}\, E_{(z,z')} E_\sigma\Big[\exp\Big(\lambda \sup_{h}\sum_{t} q_t \sigma_t \big[L(h, z_t'(\sigma)) - L(h, z_t(\sigma))\big]\Big)\Big] \quad \text{(tangent sequence property)}$$
$$\le e^{-\lambda\epsilon}\, E_{(z,z')} E_\sigma\Big[\tfrac{1}{2}\exp\Big(2\lambda \sup_{h}\sum_{t} q_t \sigma_t L(h, z_t'(\sigma))\Big) + \tfrac{1}{2}\exp\Big(2\lambda \sup_{h}\sum_{t} q_t \sigma_t L(h, z_t(\sigma))\Big)\Big] \quad \text{(sub-additivity of sup, convexity of exp)}$$
$$= e^{-\lambda\epsilon}\, E_z E_\sigma\Big[\exp\Big(2\lambda \sup_{h}\sum_{t} q_t \sigma_t L(h, z_t(\sigma))\Big)\Big].$$
51 Covering Number
$$P\big(\Phi(Z_1^T) > \epsilon\big) \le e^{-\lambda\epsilon}\, E_z E_\sigma\Big[\exp\Big(2\lambda \sup_{h}\sum_{t} q_t \sigma_t L(h, z_t(\sigma))\Big)\Big]$$
$$\le e^{-\lambda\epsilon}\, E_z E_\sigma\Big[\exp\Big(2\lambda \max_{v \in V}\sum_{t} q_t \sigma_t v_t(\sigma) + 2\lambda\alpha\Big)\Big] \qquad (\alpha\text{-covering})$$
$$\le e^{-\lambda(\epsilon - 2\alpha)}\, E_z\Big[\sum_{v \in V} E_\sigma\Big[\exp\Big(2\lambda\sum_{t} q_t \sigma_t v_t(\sigma)\Big)\Big]\Big] \qquad \text{(monotonicity of exp)}$$
$$\le E_z\big[N_1(\alpha, G, z)\big]\,\exp\Big(-\lambda(\epsilon - 2\alpha) + \frac{\lambda^2\|q\|_2^2}{2}\Big) \qquad \text{(Hoeffding's inequality)}.$$
52 Algorithms
53 Review Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H$ and all $\alpha > 0$,
$$L(h, Z_1^T) \le \sum_{t=1}^{T} q_t L(h, Z_t) + \hat{\Delta}(q) + \Delta_s + 4\alpha + \big[\|q\|_2 + \|q - u_s\|_2\big]\sqrt{2\log\frac{E_z[N_1(\alpha, G, z)]}{\delta}}.$$
This bound can be extended to hold uniformly over $q$ at the price of an additional term of order $\tilde{O}\big(\|q - u\|_1\sqrt{\log_2\log_2(1 - \|q - u\|_1)^{-1}}\big)$. This is a data-dependent learning guarantee.
54 Discrepancy-Risk Minimization Key idea: directly optimize the upper bound on generalization over $q$ and $h$. This problem can be solved efficiently for some loss functions $L$ and hypothesis sets $H$.
55 Kernel-Based Regression Squared loss function: $L(y, y') = (y - y')^2$. Hypothesis set: for a PDS kernel $K$,
$$H = \big\{x \mapsto w \cdot \Phi_K(x) : \|w\|_H \le \Lambda\big\}.$$
The complexity term can be bounded by
$$O\Big((\log^{3/2} T)\, \sup_x \sqrt{K(x, x)}\, \|q\|_2\Big).$$
56 Instantaneous Discrepancy The empirical discrepancy can be further upper bounded in terms of instantaneous discrepancies:
$$\hat{\Delta}(q) \le \sum_{t=1}^{T} q_t d_t + M\|q - u\|_1,$$
where $M = \sup_{y, y'} L(y, y')$ and
$$d_t = \sup_{h \in H}\Big(\frac{1}{s}\sum_{t'=T-s+1}^{T} L(h, Z_{t'}) - L(h, Z_t)\Big).$$
57 Proof Adding and subtracting $\sum_t q_t$ times the recent average and using the sub-additivity of the supremum:
$$\hat{\Delta}(q) = \sup_{h \in H}\Big(\frac{1}{s}\sum_{t=T-s+1}^{T} L(h, Z_t) - \sum_{t=1}^{T} q_t L(h, Z_t)\Big)$$
$$\le \sup_{h \in H}\sum_{t=1}^{T} q_t\Big(\frac{1}{s}\sum_{t'=T-s+1}^{T} L(h, Z_{t'}) - L(h, Z_t)\Big) + \Big|1 - \sum_{t=1}^{T} q_t\Big|\, M$$
$$\le \sum_{t=1}^{T} q_t \sup_{h \in H}\Big(\frac{1}{s}\sum_{t'=T-s+1}^{T} L(h, Z_{t'}) - L(h, Z_t)\Big) + M\|u - q\|_1 = \sum_{t=1}^{T} q_t d_t + M\|u - q\|_1,$$
where the last step uses $|1 - \sum_t q_t| = |\sum_t (u_t - q_t)| \le \|u - q\|_1$.
58 Computing Discrepancies The instantaneous discrepancy for kernel-based hypotheses with the squared loss is
$$d_t = \sup_{\|w'\| \le \Lambda}\Big(\frac{1}{s}\sum_{t'=T-s+1}^{T}\big(w' \cdot \Phi_K(x_{t'}) - y_{t'}\big)^2 - \big(w' \cdot \Phi_K(x_t) - y_t\big)^2\Big),$$
a difference of convex (DC) functions. The global optimum can be found via DC programming (Tao and An, 1998).
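For intuition, the supremum can be crudely approximated by sampling candidate hypotheses on the norm constraint; this Monte-Carlo sketch (not from the slides, and only a lower bound on the true supremum, unlike the exact DC-programming solution) shows the quantity being optimized, in the linear-kernel case:

```python
import numpy as np

def instantaneous_discrepancy_mc(X, y, t, s, Lambda, n_candidates=500, seed=0):
    """Monte-Carlo lower bound on
    d_t = sup_{||w|| <= Lambda} [ (1/s) sum_recent (w.x - y)^2 - (w.x_t - y_t)^2 ].
    The exact supremum is a difference of convex functions (DC programming)."""
    rng = np.random.default_rng(seed)
    T, d = X.shape
    recent = slice(T - s, T)
    best = -np.inf
    for _ in range(n_candidates):
        w = rng.normal(size=d)
        w *= Lambda / np.linalg.norm(w)      # sample on the constraint boundary
        recent_loss = np.mean((X[recent] @ w - y[recent]) ** 2)
        best = max(best, recent_loss - (X[t] @ w - y[t]) ** 2)
    return best

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
d5 = instantaneous_discrepancy_mc(X, y, t=5, s=10, Lambda=1.0)
```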
59 Discrepancy-Based Forecasting Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, for all kernel-based hypotheses $h \in H$ and all $0 < \|q - u\|_1 \le 1$,
$$L(h, Z_1^T) \le \sum_{t=1}^{T} q_t L(h, Z_t) + \hat{\Delta}(q) + \tilde{O}\Big((\log^{3/2} T)\sup_x \sqrt{K(x, x)} + \|q - u\|_1\Big).$$
Corresponding optimization problem:
$$\min_{q \in [0,1]^T,\, w}\Big\{\sum_{t=1}^{T} q_t\big(w \cdot \Phi_K(x_t) - y_t\big)^2 + \lambda_1\sum_{t=1}^{T} q_t d_t + \lambda_2\|w\|_H + \lambda_3\|q - u\|_1\Big\}.$$
64 Convex Problem Change of variable: $r_t = 1/q_t$. Upper bound: $|r_t^{-1} - T^{-1}| \le |r_t - T|/T$ for $r_t \ge 1$. The resulting problem
$$\min_{r \in D,\, w}\Big\{\sum_{t=1}^{T}\frac{\big(w \cdot \Phi_K(x_t) - y_t\big)^2 + \lambda_1 d_t}{r_t} + \lambda_2\|w\|_H + \lambda_3\sum_{t=1}^{T}\frac{|r_t - T|}{T}\Big\}, \qquad D = \{r : r_t \ge 1\},$$
is a convex optimization problem (each quadratic-over-linear term is jointly convex in $(w, r_t)$).
65 Two-Stage Algorithm First, minimize the empirical discrepancy $\hat{\Delta}(q)$ over $q$ (a convex optimization problem). Then solve the (weighted) kernel ridge regression problem
$$\min_{w}\Big\{\sum_{t=1}^{T} q^*_t\big(w \cdot \Phi_K(x_t) - y_t\big)^2 + \lambda\|w\|_H^2\Big\},$$
where $q^*$ is the solution of the discrepancy minimization problem.
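The second stage has a closed-form solution. A minimal sketch for the linear-kernel case (illustrative; with uniform weights and a vanishing regularizer it reduces to ordinary least squares):

```python
import numpy as np

def weighted_ridge(X, y, q, lam):
    """Solve min_w sum_t q_t (w . x_t - y_t)^2 + lam ||w||^2 (linear kernel)."""
    d = X.shape[1]
    Xq = X * q[:, None]                         # row-scale by the weights q_t
    # Normal equations: (X^T Q X + lam I) w = X^T Q y.
    return np.linalg.solve(X.T @ Xq + lam * np.eye(d), Xq.T @ y)

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])                   # exactly y = 2 x
q = np.full(3, 1.0 / 3.0)                       # uniform weights for the sanity check
w = weighted_ridge(X, y, q, lam=1e-8)           # recovers w close to 2
```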
66 Preliminary Experiments Artificial data sets.
67 True vs. Empirical Discrepancies (figures: discrepancies and weights).
68 Running MSE (figure).
69 Real-World Data Commodity prices, exchange rates, temperatures & climate.
70 Time Series Prediction & On-line Learning
71 Two Learning Scenarios Stochastic scenario: distributional assumption; performance measure: expected loss; guarantees: generalization bounds. On-line scenario: no distributional assumption; performance measure: regret; guarantees: regret bounds. Active research area: (Cesa-Bianchi and Lugosi, 2006; Anava et al., 2013, 2015, 2016; Bousquet and Warmuth, 2002; Herbster and Warmuth, 1998, 2001; Koolen et al., 2015).
72 On-Line Learning Setup Adversarial setting with hypothesis/action set $H$. For $t = 1$ to $T$: the player receives $x_t \in X$; the player selects $h_t \in H$; the adversary selects $y_t \in Y$; the player incurs loss $L(h_t(x_t), y_t)$. Objective: minimize the (external) regret
$$\mathrm{Reg}_T = \sum_{t=1}^{T} L(h_t(x_t), y_t) - \min_{h \in H}\sum_{t=1}^{T} L(h(x_t), y_t).$$
73 Example: Exponentiated Weights (EW) Expert set $H_0 = \{E_1, \ldots, E_N\}$, $H = \mathrm{conv}(H_0)$. Algorithm (parameter $\eta > 0$):

    EW({E_1, ..., E_N}):
        for i = 1 to N:
            w_{1,i} <- 1
        for t = 1 to T:
            Receive(x_t)
            h_t <- (sum_i w_{t,i} E_i) / (sum_i w_{t,i})
            Receive(y_t)
            Incur-Loss(L(h_t(x_t), y_t))
            for i = 1 to N:
                w_{t+1,i} <- w_{t,i} * exp(-eta * L(E_i(x_t), y_t))
        return h_T
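The EW pseudocode above translates directly into a few lines of NumPy. This sketch uses the squared loss and a toy two-expert setup of our own (one perfect expert, one always-wrong expert), which are illustrative assumptions, not part of the slides:

```python
import numpy as np

def exponentiated_weights(expert_preds, ys, eta):
    """EW forecaster: predict with the weighted average of the experts, then
    exponentially down-weight each expert by its loss (assumed in [0, 1])."""
    T, N = expert_preds.shape
    w = np.ones(N)
    losses = np.zeros(T)
    for t in range(T):
        h_t = float(np.dot(w, expert_preds[t]) / w.sum())  # convex combination
        losses[t] = (h_t - ys[t]) ** 2
        w = w * np.exp(-eta * (expert_preds[t] - ys[t]) ** 2)
    return losses

T = 200
ys = np.tile([0.0, 1.0], T // 2)                 # 0/1 outcome sequence
experts = np.column_stack([ys, 1.0 - ys])        # perfect expert vs. adversarial one
eta = np.sqrt(8.0 * np.log(2) / T)               # tuning from the guarantee slide
player_loss = exponentiated_weights(experts, ys, eta).sum()
best_expert_loss = 0.0                            # expert 0 predicts perfectly
```

On this toy sequence, the player's cumulative loss stays within the $\sqrt{(T/2)\log N}$ regret bound of the best expert.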
74 EW Guarantee Theorem: assume that $L$ is convex in its first argument and takes values in $[0, 1]$. Then, for any $\eta > 0$ and any sequence $y_1, \ldots, y_T \in Y$, the regret of EW at time $T$ satisfies
$$\mathrm{Reg}_T \le \frac{\log N}{\eta} + \frac{\eta T}{8}.$$
For $\eta = \sqrt{8\log N / T}$,
$$\mathrm{Reg}_T \le \sqrt{(T/2)\log N}, \qquad \frac{\mathrm{Reg}_T}{T} = O\Big(\sqrt{\frac{\log N}{T}}\Big).$$
75 EW - Proof Potential: $\Phi_t = \log\sum_{i=1}^{N} w_{t,i}$. Upper bound:
$$\Phi_t - \Phi_{t-1} = \log\frac{\sum_{i=1}^{N} w_{t-1,i}\, e^{-\eta L(E_i(x_t), y_t)}}{\sum_{i=1}^{N} w_{t-1,i}} = \log E_{w_{t-1}}\big[e^{-\eta L(E_i(x_t), y_t)}\big]$$
$$\le -\eta\, E_{w_{t-1}}\big[L(E_i(x_t), y_t)\big] + \frac{\eta^2}{8} \qquad \text{(Hoeffding's inequality)}$$
$$\le -\eta\, L\big(E_{w_{t-1}}[E_i(x_t)], y_t\big) + \frac{\eta^2}{8} = -\eta\, L(h_t(x_t), y_t) + \frac{\eta^2}{8} \qquad \text{(convexity of $L$ in its first argument)}.$$
76 EW - Proof Upper bound: summing the inequalities yields
$$\Phi_T - \Phi_0 \le -\eta\sum_{t=1}^{T} L(h_t(x_t), y_t) + \frac{\eta^2 T}{8}.$$
Lower bound:
$$\Phi_T - \Phi_0 = \log\sum_{i=1}^{N} e^{-\eta\sum_{t=1}^{T} L(E_i(x_t), y_t)} - \log N \ge \log\max_{i=1}^{N} e^{-\eta\sum_{t=1}^{T} L(E_i(x_t), y_t)} - \log N = -\eta\min_{i=1}^{N}\sum_{t=1}^{T} L(E_i(x_t), y_t) - \log N.$$
Comparison:
$$\sum_{t=1}^{T} L(h_t(x_t), y_t) - \min_{i=1}^{N}\sum_{t=1}^{T} L(E_i(x_t), y_t) \le \frac{\log N}{\eta} + \frac{\eta T}{8}.$$
77 Questions Can we exploit both the batch and on-line scenarios to design flexible algorithms for time series prediction with stochastic guarantees? Can we tackle notoriously difficult time series problems, e.g., model selection and learning ensembles?
78 Model Selection Problem: given $N$ time series models, how should we use the sample $Z_1^T$ to select a single best model? In the i.i.d. case, cross-validation can be shown to be close to the structural risk minimization solution. But how do we select a validation set for general stochastic processes? Use the most recent data? Use various splits? Use the most distant data? The models may have been pre-trained on $Z_1^T$.
79 Learning Ensembles Problem: given a hypothesis set $H$ and a sample $Z_1^T$, find an accurate convex combination $h = \sum_{t=1}^{T} q_t h_t$ with $h_t \in H_A$ and $q \in \Delta$. In the most general case, the hypotheses may have been pre-trained on $Z_1^T$. This calls for an on-line-to-batch conversion for general non-stationary, non-mixing processes.
80 On-Line-to-Batch (OTB) Input: a sequence of hypotheses $h = (h_1, \ldots, h_T)$ returned after $T$ rounds by an on-line algorithm $A$ minimizing the general regret
$$\mathrm{Reg}_T = \sum_{t=1}^{T} L(h_t, Z_t) - \inf_{h^* \in H}\sum_{t=1}^{T} L(h^*, Z_t).$$
81 On-Line-to-Batch (OTB) Problem: use $h = (h_1, \ldots, h_T)$ to derive a hypothesis $h \in H$ with small path-dependent expected loss,
$$L_{T+1}(h, Z_1^T) = E\big[L(h, Z_{T+1}) \mid Z_1^T\big].$$
The i.i.d. problem is standard: (Littlestone, 1989), (Cesa-Bianchi et al., 2004). But how do we design solutions for the general time-series scenario?
82 Questions Is OTB conversion with general (non-stationary, non-mixing) stochastic processes possible? Can we design algorithms with theoretical guarantees? We need a new tool for the analysis.
83 Relevant Quantity Key difference: $L_t(h_t, Z_1^{t-1})$ versus $L_{T+1}(h_t, Z_1^T)$, for $1 \le t \le T$. Average difference:
$$\frac{1}{T}\sum_{t=1}^{T}\Big[L_{T+1}(h_t, Z_1^T) - L_t(h_t, Z_1^{t-1})\Big].$$
84 On-line Discrepancy Definition:
$$\mathrm{disc}(q) = \sup_{h \in H_A}\sum_{t=1}^{T} q_t\Big[L_{T+1}(h_t, Z_1^T) - L_t(h_t, Z_1^{t-1})\Big],$$
where $H_A$ is the set of hypothesis sequences that $A$ can return and $q = (q_1, \ldots, q_T)$ is an arbitrary weight vector. It is a natural measure of non-stationarity or dependency; it captures the hypothesis set and the loss function; it can be efficiently estimated under mild assumptions; it generalizes the definition of (Kuznetsov and MM, 2015).
85 Discrepancy Estimation The batch discrepancy estimation method applies. Alternative method: assume that the loss is $\mu$-Lipschitz and that there exists an accurate hypothesis $h^*$, i.e., that
$$\eta = \inf_{h^*} E\big[L(Z_{T+1}, h^*(X_{T+1})) \mid Z_1^T\big]$$
is small.
86 Discrepancy Estimation Lemma: fix a sequence $Z_1^T$ in $Z$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $\alpha > 0$:
$$\mathrm{disc}(q) \le \widehat{\mathrm{disc}}_H(q) + \mu\eta + 2\alpha + M\|q\|_2\sqrt{2\log\frac{E[N_1(\alpha, G, z)]}{\delta}},$$
where
$$\widehat{\mathrm{disc}}_H(q) = \sup_{h^* \in H,\, h \in H_A}\sum_{t=1}^{T} q_t\Big[L\big(h_t(X_{T+1}), h^*(X_{T+1})\big) - L(h_t, Z_t)\Big].$$
87 Proof Sketch
$$\mathrm{disc}(q) = \sup_{h \in H_A}\sum_{t=1}^{T} q_t\Big[L_{T+1}(h_t, Z_1^T) - L_t(h_t, Z_1^{t-1})\Big]$$
$$\le \sup_{h \in H_A}\sum_{t=1}^{T} q_t\Big[L_{T+1}(h_t, Z_1^T) - E\big[L(h_t(X_{T+1}), h^*(X_{T+1})) \mid Z_1^T\big]\Big] + \sup_{h \in H_A}\sum_{t=1}^{T} q_t\Big[E\big[L(h_t(X_{T+1}), h^*(X_{T+1})) \mid Z_1^T\big] - L_t(h_t, Z_1^{t-1})\Big]$$
$$\le \mu\, E\big[L(h^*(X_{T+1}), Y_{T+1}) \mid Z_1^T\big] + \sup_{h \in H_A}\sum_{t=1}^{T} q_t\Big[E\big[L(h_t(X_{T+1}), h^*(X_{T+1})) \mid Z_1^T\big] - L_t(h_t, Z_1^{t-1})\Big],$$
where the first term is bounded using the $\mu$-Lipschitzness of the loss and the accuracy of $h^*$.
88 Learning Guarantee Lemma: let $L$ be a convex loss bounded by $M$ and $h_1^T$ a hypothesis sequence adapted to $Z_1^T$. Fix $q \in \Delta$. Then, for any $\delta > 0$, the following holds with probability at least $1 - \delta$ for the hypothesis $h = \sum_{t=1}^{T} q_t h_t$:
$$L_{T+1}(h, Z_1^T) \le \sum_{t=1}^{T} q_t L(h_t, Z_t) + \mathrm{disc}(q) + M\|q\|_2\sqrt{2\log\frac{1}{\delta}}.$$
89 Proof By definition of the on-line discrepancy,
$$\sum_{t=1}^{T} q_t\Big[L_{T+1}(h_t, Z_1^T) - L_t(h_t, Z_1^{t-1})\Big] \le \mathrm{disc}(q).$$
$A_t = q_t\big[L_t(h_t, Z_1^{t-1}) - L(h_t, Z_t)\big]$ is a martingale difference sequence, so by Azuma's inequality, with high probability,
$$\sum_{t=1}^{T} q_t L_t(h_t, Z_1^{t-1}) \le \sum_{t=1}^{T} q_t L(h_t, Z_t) + M\|q\|_2\sqrt{2\log\frac{1}{\delta}}.$$
By convexity of the loss:
$$L_{T+1}(h, Z_1^T) \le \sum_{t=1}^{T} q_t L_{T+1}(h_t, Z_1^T).$$
90 Learning Guarantee Theorem: let $L$ be a convex loss bounded by $M$ and $H$ a set of hypothesis sequences adapted to $Z_1^T$. Fix $q \in \Delta$. Then, for any $\delta > 0$, the following holds with probability at least $1 - \delta$ for the hypothesis $h = \sum_{t=1}^{T} q_t h_t$:
$$L_{T+1}(h, Z_1^T) \le \inf_{h^* \in H} L_{T+1}(h^*, Z_1^T) + 2\,\mathrm{disc}(q) + \frac{\mathrm{Reg}_T}{T} + M\|q - u\|_1 + 2M\|q\|_2\sqrt{2\log\frac{2}{\delta}}.$$
91 Conclusion Time series forecasting: a key learning problem in many important tasks; very challenging in theory, algorithms, and applications; new and general data-dependent learning guarantees for non-mixing, non-stationary processes; algorithms with guarantees. Time series prediction and on-line learning: proofs for flexible solutions derived via OTB; application to model selection; application to learning ensembles.
92 References
T. M. Adams and A. B. Nobel. Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling. The Annals of Probability, 38(4), 2010.
A. Agarwal and J. Duchi. The generalization ability of online algorithms for dependent data. IEEE Transactions on Information Theory, 59(1), 2013.
P. Alquier, X. Li, and O. Wintenberger. Prediction of time series by statistical learning: general losses and fast rates. Dependence Modelling, 1:65-93, 2014.
O. Anava, E. Hazan, S. Mannor, and O. Shamir. Online learning for time series prediction. COLT, 2013.
O. Anava, E. Hazan, and A. Zeevi. Online time series prediction with missing data. ICML, 2015.
O. Anava and S. Mannor. Online learning for heteroscedastic sequences. ICML, 2016.
P. L. Bartlett. Learning with a slowly changing distribution. COLT, 1992.
R. D. Barve and P. M. Long. On the complexity of learning from drifting distributions. Information and Computation, 138(2), 1997.
P. Berti and P. Rigo. A Glivenko-Cantelli theorem for exchangeable random variables. Statistics & Probability Letters, 32(4), 1997.
S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. NIPS, 2007.
T. Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 1986.
G. E. P. Box and G. Jenkins. Time Series Analysis, Forecasting and Control. 1990.
O. Bousquet and M. K. Warmuth. Tracking a small set of experts by mixing past posteriors. COLT, 2001.
N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9), 2004.
N. Cesa-Bianchi and C. Gentile. Tracking the best hyperplane with a simple budget perceptron. COLT, 2006.
C. Cortes and M. Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519, 2014.
V. H. de la Peña and E. Giné. Decoupling: From Dependence to Independence. Probability and Its Applications. Springer, New York, 1999.
P. Doukhan. Mixing: Properties and Examples. Lecture Notes in Statistics. Springer-Verlag, New York, 1994.
R. Engle. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4), 1982.
D. P. Helmbold and P. M. Long. Tracking drifting concepts by minimizing disagreements. Machine Learning, 14(1):27-46, 1994.
M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32(2), 1998.
M. Herbster and M. K. Warmuth. Tracking the best linear predictor. JMLR, 2001.
D. Hsu, A. Kontorovich, and C. Szepesvári. Mixing time estimation in reversible Markov chains from a single sample path. NIPS, 2015.
W. M. Koolen, A. Malek, P. L. Bartlett, and Y. Abbasi. Minimax time series prediction. NIPS, 2015.
V. Kuznetsov and M. Mohri. Generalization bounds for time series prediction with non-stationary processes. ALT, 2014.
V. Kuznetsov and M. Mohri. Learning theory and algorithms for forecasting non-stationary time series. NIPS, 2015.
V. Kuznetsov and M. Mohri. Time series prediction and on-line learning. COLT, 2016.
A. C. Lozano, S. R. Kulkarni, and R. E. Schapire. Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. NIPS, 2006.
Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: learning bounds and algorithms. COLT, 2009.
R. Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1):5-34, 2000.
D. Modha and E. Masry. Memory-universal prediction of stationary random processes. IEEE Transactions on Information Theory, 44(1), 1998.
M. Mohri and A. Muñoz Medina. New analysis and algorithm for learning with drifting distributions. ALT, 2012.
M. Mohri and A. Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. NIPS, 2009.
M. Mohri and A. Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11, 2010.
V. Pestov. Predictive PAC learnability: a paradigm for learning from exchangeable input data. GrC, 2010.
A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: random averages, combinatorial parameters, and learnability. NIPS, 2010.
L. Ralaivola, M. Szafranski, and G. Stempfel. Chromatic PAC-Bayes bounds for non-IID data: applications to ranking and stationary β-mixing processes. JMLR, 11, 2010.
C. Shalizi and A. Kontorovich. Predictive PAC-learning and process decompositions. NIPS, 2013.
I. Steinwart and A. Christmann. Fast learning from non-i.i.d. observations. NIPS, 2009.
M. Vidyasagar. A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems. Springer-Verlag, New York, 1997.
V. Vovk. Competing with stationary prediction strategies. COLT, 2007.
B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94-116, 1994.
More informationCOMS 4771 Introduction to Machine Learning. Nakul Verma
COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric
More informationBennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence
Bennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence Chao Zhang The Biodesign Institute Arizona State University Tempe, AZ 8587, USA Abstract In this paper, we present
More informationAdvanced Machine Learning
Advanced Machine Learning Learning Kernels MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Kernel methods. Learning kernels scenario. learning bounds. algorithms. page 2 Machine Learning
More informationAgnostic Online learnability
Technical Report TTIC-TR-2008-2 October 2008 Agnostic Online learnability Shai Shalev-Shwartz Toyota Technological Institute Chicago shai@tti-c.org ABSTRACT We study a fundamental question. What classes
More informationUniform concentration inequalities, martingales, Rademacher complexity and symmetrization
Uniform concentration inequalities, martingales, Rademacher complexity and symmetrization John Duchi Outline I Motivation 1 Uniform laws of large numbers 2 Loss minimization and data dependence II Uniform
More informationYevgeny Seldin. University of Copenhagen
Yevgeny Seldin University of Copenhagen Classical (Batch) Machine Learning Collect Data Data Assumption The samples are independent identically distributed (i.i.d.) Machine Learning Prediction rule New
More informationIntroduction to Machine Learning Lecture 11. Mehryar Mohri Courant Institute and Google Research
Introduction to Machine Learning Lecture 11 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Boosting Mehryar Mohri - Introduction to Machine Learning page 2 Boosting Ideas Main idea:
More informationWorst-Case Bounds for Gaussian Process Models
Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis
More informationAdvanced Machine Learning
Advanced Machine Learning Learning and Games MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Normal form games Nash equilibrium von Neumann s minimax theorem Correlated equilibrium Internal
More informationLearning with Rejection
Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,
More informationOn-Line Learning with Path Experts and Non-Additive Losses
On-Line Learning with Path Experts and Non-Additive Losses Joint work with Corinna Cortes (Google Research) Vitaly Kuznetsov (Courant Institute) Manfred Warmuth (UC Santa Cruz) MEHRYAR MOHRI MOHRI@ COURANT
More informationAccelerating Online Convex Optimization via Adaptive Prediction
Mehryar Mohri Courant Institute and Google Research New York, NY 10012 mohri@cims.nyu.edu Scott Yang Courant Institute New York, NY 10012 yangs@cims.nyu.edu Abstract We present a powerful general framework
More informationAdvanced Machine Learning
Advanced Machine Learning Bandit Problems MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Multi-Armed Bandit Problem Problem: which arm of a K-slot machine should a gambler pull to maximize his
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationLearnability, Stability, Regularization and Strong Convexity
Learnability, Stability, Regularization and Strong Convexity Nati Srebro Shai Shalev-Shwartz HUJI Ohad Shamir Weizmann Karthik Sridharan Cornell Ambuj Tewari Michigan Toyota Technological Institute Chicago
More informationBetter Algorithms for Selective Sampling
Francesco Orabona Nicolò Cesa-Bianchi DSI, Università degli Studi di Milano, Italy francesco@orabonacom nicolocesa-bianchi@unimiit Abstract We study online algorithms for selective sampling that use regularized
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More informationMachine Learning Lecture 7
Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant
More informationExponential Weights on the Hypercube in Polynomial Time
European Workshop on Reinforcement Learning 14 (2018) October 2018, Lille, France. Exponential Weights on the Hypercube in Polynomial Time College of Information and Computer Sciences University of Massachusetts
More informationStochastic Optimization
Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic
More informationStructured Prediction
Structured Prediction Ningshan Zhang Advanced Machine Learning, Spring 2016 Outline Ensemble Methods for Structured Prediction[1] On-line learning Boosting AGeneralizedKernelApproachtoStructuredOutputLearning[2]
More informationThe No-Regret Framework for Online Learning
The No-Regret Framework for Online Learning A Tutorial Introduction Nahum Shimkin Technion Israel Institute of Technology Haifa, Israel Stochastic Processes in Engineering IIT Mumbai, March 2013 N. Shimkin,
More informationNo-Regret Algorithms for Unconstrained Online Convex Optimization
No-Regret Algorithms for Unconstrained Online Convex Optimization Matthew Streeter Duolingo, Inc. Pittsburgh, PA 153 matt@duolingo.com H. Brendan McMahan Google, Inc. Seattle, WA 98103 mcmahan@google.com
More informationLecture 16: Perceptron and Exponential Weights Algorithm
EECS 598-005: Theoretical Foundations of Machine Learning Fall 2015 Lecture 16: Perceptron and Exponential Weights Algorithm Lecturer: Jacob Abernethy Scribes: Yue Wang, Editors: Weiqing Yu and Andrew
More informationOn the Complexity of Best Arm Identification with Fixed Confidence
On the Complexity of Best Arm Identification with Fixed Confidence Discrete Optimization with Noise Aurélien Garivier, Emilie Kaufmann COLT, June 23 th 2016, New York Institut de Mathématiques de Toulouse
More informationRevisiting linear and non-linear methodologies for time series prediction - application to ESTSP 08 competition data
Revisiting linear and non-linear methodologies for time series - application to ESTSP 08 competition data Madalina Olteanu Universite Paris 1 - SAMOS CES 90 Rue de Tolbiac, 75013 Paris - France Abstract.
More informationThe Online Approach to Machine Learning
The Online Approach to Machine Learning Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Approach to ML 1 / 53 Summary 1 My beautiful regret 2 A supposedly fun game I
More informationBandit models: a tutorial
Gdt COS, December 3rd, 2015 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions) Bandit game: a each round t, an agent chooses
More informationAdvanced Machine Learning
Advanced Machine Learning Follow-he-Perturbed Leader MEHRYAR MOHRI MOHRI@ COURAN INSIUE & GOOGLE RESEARCH. General Ideas Linear loss: decomposition as a sum along substructures. sum of edge losses in a
More informationAdvanced Machine Learning
Advanced Machine Learning Learning with Large Expert Spaces MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Problem Learning guarantees: R T = O( p T log N). informative even for N very large.
More informationLecture 2: Univariate Time Series
Lecture 2: Univariate Time Series Analysis: Conditional and Unconditional Densities, Stationarity, ARMA Processes Prof. Massimo Guidolin 20192 Financial Econometrics Spring/Winter 2017 Overview Motivation:
More informationAn Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI
An Introduction to Statistical Theory of Learning Nakul Verma Janelia, HHMI Towards formalizing learning What does it mean to learn a concept? Gain knowledge or experience of the concept. The basic process
More informationNonparametric Risk Bounds for Time-Series Forecasting
Journal of Machine Learning Research 18 2017 1-40 Submitted 8/13; Revised 2/17; Published 4/17 Nonparametric Risk Bounds for Time-Series Forecasting Daniel J. McDonald Department of Statistics Indiana
More informationLearning with Large Number of Experts: Component Hedge Algorithm
Learning with Large Number of Experts: Component Hedge Algorithm Giulia DeSalvo and Vitaly Kuznetsov Courant Institute March 24th, 215 1 / 3 Learning with Large Number of Experts Regret of RWM is O( T
More informationMulti-armed bandit models: a tutorial
Multi-armed bandit models: a tutorial CERMICS seminar, March 30th, 2016 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions)
More informationFull-information Online Learning
Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2
More informationRisk Bounds for Lévy Processes in the PAC-Learning Framework
Risk Bounds for Lévy Processes in the PAC-Learning Framework Chao Zhang School of Computer Engineering anyang Technological University Dacheng Tao School of Computer Engineering anyang Technological University
More informationGeneralization Bounds for Online Learning Algorithms with Pairwise Loss Functions
JMLR: Workshop and Conference Proceedings vol 3 0 3. 3. 5th Annual Conference on Learning Theory Generalization Bounds for Online Learning Algorithms with Pairwise Loss Functions Yuyang Wang Roni Khardon
More informationIFT Lecture 7 Elements of statistical learning theory
IFT 6085 - Lecture 7 Elements of statistical learning theory This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s): Brady Neal and
More informationOnline Learning with Feedback Graphs
Online Learning with Feedback Graphs Claudio Gentile INRIA and Google NY clagentile@gmailcom NYC March 6th, 2018 1 Content of this lecture Regret analysis of sequential prediction problems lying between
More informationTime Series: Theory and Methods
Peter J. Brockwell Richard A. Davis Time Series: Theory and Methods Second Edition With 124 Illustrations Springer Contents Preface to the Second Edition Preface to the First Edition vn ix CHAPTER 1 Stationary
More informationOnline Learning and Sequential Decision Making
Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Online Learning
More informationNonparametric regression with martingale increment errors
S. Gaïffas (LSTA - Paris 6) joint work with S. Delattre (LPMA - Paris 7) work in progress Motivations Some facts: Theoretical study of statistical algorithms requires stationary and ergodicity. Concentration
More informationNew bounds on the price of bandit feedback for mistake-bounded online multiclass learning
Journal of Machine Learning Research 1 8, 2017 Algorithmic Learning Theory 2017 New bounds on the price of bandit feedback for mistake-bounded online multiclass learning Philip M. Long Google, 1600 Amphitheatre
More informationOptimal and Adaptive Online Learning
Optimal and Adaptive Online Learning Haipeng Luo Advisor: Robert Schapire Computer Science Department Princeton University Examples of Online Learning (a) Spam detection 2 / 34 Examples of Online Learning
More informationAdaptive Online Gradient Descent
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 6-4-2007 Adaptive Online Gradient Descent Peter Bartlett Elad Hazan Alexander Rakhlin University of Pennsylvania Follow
More informationLecture 6: Univariate Volatility Modelling: ARCH and GARCH Models
Lecture 6: Univariate Volatility Modelling: ARCH and GARCH Models Prof. Massimo Guidolin 019 Financial Econometrics Winter/Spring 018 Overview ARCH models and their limitations Generalized ARCH models
More informationFast Rates for Estimation Error and Oracle Inequalities for Model Selection
Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@cs.berkeley.edu
More informationTime Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY
Time Series Analysis James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY PREFACE xiii 1 Difference Equations 1.1. First-Order Difference Equations 1 1.2. pth-order Difference Equations 7
More informationIntroduction to ARMA and GARCH processes
Introduction to ARMA and GARCH processes Fulvio Corsi SNS Pisa 3 March 2010 Fulvio Corsi Introduction to ARMA () and GARCH processes SNS Pisa 3 March 2010 1 / 24 Stationarity Strict stationarity: (X 1,
More informationUnderstanding Generalization Error: Bounds and Decompositions
CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the
More informationStructured Prediction Theory and Algorithms
Structured Prediction Theory and Algorithms Joint work with Corinna Cortes (Google Research) Vitaly Kuznetsov (Google Research) Scott Yang (Courant Institute) MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE
More informationTime Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY
Time Series Analysis James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY & Contents PREFACE xiii 1 1.1. 1.2. Difference Equations First-Order Difference Equations 1 /?th-order Difference
More informationFoundations of Machine Learning Multi-Class Classification. Mehryar Mohri Courant Institute and Google Research
Foundations of Machine Learning Multi-Class Classification Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation Real-world problems often have multiple classes: text, speech,
More informationLecture 14: Approachability and regret minimization Ramesh Johari May 23, 2007
MS&E 336 Lecture 4: Approachability and regret minimization Ramesh Johari May 23, 2007 In this lecture we use Blackwell s approachability theorem to formulate both external and internal regret minimizing
More informationOn Minimaxity of Follow the Leader Strategy in the Stochastic Setting
On Minimaxity of Follow the Leader Strategy in the Stochastic Setting Wojciech Kot lowsi Poznań University of Technology, Poland wotlowsi@cs.put.poznan.pl Abstract. We consider the setting of prediction
More informationLearning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley
Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization.
More informationFrom Bandits to Experts: A Tale of Domination and Independence
From Bandits to Experts: A Tale of Domination and Independence Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Domination and Independence 1 / 1 From Bandits to Experts: A
More informationOnline Passive-Aggressive Algorithms
Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il
More informationSimultaneous Model Selection and Optimization through Parameter-free Stochastic Learning
Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning Francesco Orabona Yahoo! Labs New York, USA francesco@orabona.com Abstract Stochastic gradient descent algorithms
More informationIntroduction to Machine Learning. Introduction to ML - TAU 2016/7 1
Introduction to Machine Learning Introduction to ML - TAU 2016/7 1 Course Administration Lecturers: Amir Globerson (gamir@post.tau.ac.il) Yishay Mansour (Mansour@tau.ac.il) Teaching Assistance: Regev Schweiger
More informationNew Algorithms for Contextual Bandits
New Algorithms for Contextual Bandits Lev Reyzin Georgia Institute of Technology Work done at Yahoo! 1 S A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R.E. Schapire Contextual Bandit Algorithms with Supervised
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic
More informationOnline Bounds for Bayesian Algorithms
Online Bounds for Bayesian Algorithms Sham M. Kakade Computer and Information Science Department University of Pennsylvania Andrew Y. Ng Computer Science Department Stanford University Abstract We present
More informationOptimistic Rates Nati Srebro
Optimistic Rates Nati Srebro Based on work with Karthik Sridharan and Ambuj Tewari Examples based on work with Andy Cotter, Elad Hazan, Tomer Koren, Percy Liang, Shai Shalev-Shwartz, Ohad Shamir, Karthik
More informationAdaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade
Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random
More informationLearning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013
Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description
More informationECONOMICS 7200 MODERN TIME SERIES ANALYSIS Econometric Theory and Applications
ECONOMICS 7200 MODERN TIME SERIES ANALYSIS Econometric Theory and Applications Yongmiao Hong Department of Economics & Department of Statistical Sciences Cornell University Spring 2019 Time and uncertainty
More informationEmpirical Market Microstructure Analysis (EMMA)
Empirical Market Microstructure Analysis (EMMA) Lecture 3: Statistical Building Blocks and Econometric Basics Prof. Dr. Michael Stein michael.stein@vwl.uni-freiburg.de Albert-Ludwigs-University of Freiburg
More informationU Logo Use Guidelines
Information Theory Lecture 3: Applications to Machine Learning U Logo Use Guidelines Mark Reid logo is a contemporary n of our heritage. presents our name, d and our motto: arn the nature of things. authenticity
More informationDoes Unlabeled Data Help?
Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationThe Learning Problem and Regularization
9.520 Class 02 February 2011 Computational Learning Statistical Learning Theory Learning is viewed as a generalization/inference problem from usually small sets of high dimensional, noisy data. Learning
More informationLittlestone s Dimension and Online Learnability
Littlestone s Dimension and Online Learnability Shai Shalev-Shwartz Toyota Technological Institute at Chicago The Hebrew University Talk at UCSD workshop, February, 2009 Joint work with Shai Ben-David
More informationUsing Additive Expert Ensembles to Cope with Concept Drift
Jeremy Z. Kolter and Marcus A. Maloof {jzk, maloof}@cs.georgetown.edu Department of Computer Science, Georgetown University, Washington, DC 20057-1232, USA Abstract We consider online learning where the
More informationADANET: adaptive learning of neural networks
ADANET: adaptive learning of neural networks Joint work with Corinna Cortes (Google Research) Javier Gonzalo (Google Research) Vitaly Kuznetsov (Google Research) Scott Yang (Courant Institute) MEHRYAR
More information