Online Learning for Time Series Prediction


Online Learning for Time Series Prediction. Joint work with Vitaly Kuznetsov (Google Research). Mehryar Mohri, Courant Institute & Google Research.

Motivation

Time series prediction: stock values, earthquakes, energy demand, signal processing, sales forecasting, election forecasting, economic variables, weather (e.g., local/global temperature).

Time Series: NY Unemployment Rate [figure]

Time Series: US Presidential Election 2016 [figure]

Online Learning

Advantages:
- active research area (Cesa-Bianchi and Lugosi, 2006).
- no distributional assumptions.
- algorithms with tight regret guarantees.
- flexibility: e.g., non-static competitor classes (Herbster and Warmuth, 1998, 2001; Koolen et al., 2015; Rakhlin and Sridharan, 2015).

Online Learning

Drawbacks:
- real-world time series data is not adversarial.
- the stochastic process must be taken into account.
- the quantity of interest is the conditional expected loss, not the regret.

Can we leverage online algorithms for time series forecasting?

On-Line Learning Setup

Adversarial setting with hypothesis/action set H.

For t = 1 to T:
- player receives x_t ∈ X.
- player selects h_t ∈ H.
- adversary selects y_t ∈ Y.
- player incurs loss L(h_t(x_t), y_t).

Objective: minimize the (external) regret

Reg_T = Σ_{t=1}^T L(h_t, z_t) − inf_{h∈H} Σ_{t=1}^T L(h, z_t).
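As a concrete instance of this setup, the exponentially weighted average (Hedge) forecaster is one standard algorithm with tight regret guarantees. A minimal sketch, assuming squared loss with predictions and outcomes in [0, 1] and a fixed learning rate `eta` (all illustrative choices, not prescribed by the talk):

```python
import math

def hedge(expert_preds, labels, eta=0.5):
    """Exponentially weighted average forecaster over N experts.

    expert_preds[t][i]: prediction of expert i at round t;
    labels[t]: outcome y_t.  Squared loss on [0, 1].
    Returns (player loss, best expert loss, regret).
    """
    n = len(expert_preds[0])
    log_w = [0.0] * n                       # log-weights, for numerical stability
    player_loss, expert_loss = 0.0, [0.0] * n
    for preds, y in zip(expert_preds, labels):
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        z = sum(w)
        p = [wi / z for wi in w]            # normalized weights over experts
        y_hat = sum(pi * xi for pi, xi in zip(p, preds))
        player_loss += (y_hat - y) ** 2     # player's loss this round
        for i, xi in enumerate(preds):
            li = (xi - y) ** 2
            expert_loss[i] += li
            log_w[i] -= eta * li            # multiplicative (Hedge) update
    best = min(expert_loss)
    return player_loss, best, player_loss - best
```

With a learning rate of order √(log N / T), the cumulative regret of this scheme grows as O(√(T log N)).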

On-Line-to-Batch Problem

Problem: use h = (h_1, ..., h_T) to derive a hypothesis h ∈ H with small path-dependent expected loss

L_{T+1}(h, Z_1^T) = E_{Z_{T+1}}[L(h, Z_{T+1}) | Z_1^T].

The i.i.d. case is standard: (Littlestone, 1989), (Cesa-Bianchi et al., 2004). How do we handle general stochastic processes?

Previous Work

Theory and algorithms for time series prediction (Kuznetsov and MM, 2015):
- general non-stationary, non-mixing stochastic processes.
- convex optimization algorithms.
- algorithms that perform well in experiments.
- generalization bounds based on a notion of discrepancy.

But how do we tackle some difficult time series problems, such as ensemble learning or model selection?

Questions

Theoretical:
- can we derive learning guarantees for a convex combination Σ_{t=1}^T q_t h_t?
- can we derive guarantees for an h selected in (h_1, ..., h_T)?

Algorithmic:
- on-line-to-batch conversion.
- learning ensembles.
- model selection.

Theory

Learning Guarantee

Problem: given hypotheses (h_1, ..., h_T), give a bound on

L_{T+1}(h, Z_1^T) = E_{Z_{T+1}}[L(h, Z_{T+1}) | Z_1^T],

where h = Σ_{t=1}^T q_t h_t.

Standard Assumptions

Stationarity: (Z_t, ..., Z_{t+m}) has the same distribution as (Z_{t+k}, ..., Z_{t+m+k}).

Mixing: dependence between events separated by a gap of k, decaying with k.

Problem

Stationarity and mixing assumptions are widely adopted: (Alquier and Wintenberger, 2010, 2014), (Agarwal and Duchi, 2013), (Lozano et al., 1997), (Vidyasagar, 1997), (Yu, 1994), (Meir, 2000), (MM and Rostamizadeh, 2000), (Kuznetsov and MM, 2014). But:
- they often do not hold (think of trends or periodic signals).
- they are not testable.
- estimating mixing parameters can be hard, even if the general functional form is known.
- the hypothesis set and loss function are ignored.

We need a new tool for the analysis.

Relevant Quantity

Key difference: L_t(h_t, Z_1^{t−1}) vs. L_{T+1}(h_t, Z_1^T).

Average difference: (1/T) Σ_{t=1}^T [L_{T+1}(h_t, Z_1^T) − L_t(h_t, Z_1^{t−1})].

On-line Discrepancy

Definition:

disc(q) = sup_{h∈H_A} Σ_{t=1}^T q_t [L_{T+1}(h_t, Z_1^T) − L_t(h_t, Z_1^{t−1})],

where H_A is the set of hypothesis sequences the algorithms in A can return, and q = (q_1, ..., q_T) is an arbitrary weight vector.
- natural measure of non-stationarity or dependency.
- captures the hypothesis set and the loss function.
- can be efficiently estimated under mild assumptions.
- generalizes the definition of (Kuznetsov and MM, 2015).

Learning Guarantee

Theorem: let L be a convex loss bounded by M and h_1^T a hypothesis sequence adapted to Z_1^T. Fix q ∈ Δ. Then, for any δ > 0, the following holds with probability at least 1 − δ for the hypothesis h = Σ_{t=1}^T q_t h_t:

L_{T+1}(h, Z_1^T) ≤ Σ_{t=1}^T q_t L(h_t, z_t) + disc(q) + M‖q‖_2 √(2 log(1/δ)).

Proof

By convexity of the loss: L_{T+1}(h, Z_1^T) ≤ Σ_{t=1}^T q_t L_{T+1}(h_t, Z_1^T).

By definition of the on-line discrepancy: Σ_{t=1}^T q_t [L_{T+1}(h_t, Z_1^T) − L_t(h_t, Z_1^{t−1})] ≤ disc(q).

A_t = q_t [L_t(h_t, Z_1^{t−1}) − L(h_t, z_t)] is a martingale difference sequence, thus by Azuma's inequality, with high probability,

Σ_{t=1}^T q_t L_t(h_t, Z_1^{t−1}) ≤ Σ_{t=1}^T q_t L(h_t, z_t) + M‖q‖_2 √(2 log(1/δ)).
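The concentration step invokes the Azuma–Hoeffding inequality; spelled out for this martingale difference sequence (a standard statement, added here for completeness):

```latex
% Azuma-Hoeffding: for a martingale difference sequence (A_t) with |A_t| \le c_t,
\Pr\Big[\sum_{t=1}^{T} A_t \ge \epsilon\Big]
  \le \exp\Big(\frac{-\epsilon^2}{2\sum_{t=1}^{T} c_t^2}\Big).
% Here A_t = q_t\,[L_t(h_t, Z_1^{t-1}) - L(h_t, z_t)] satisfies |A_t| \le M q_t
% since the loss is bounded by M, so \sum_t c_t^2 = M^2\|q\|_2^2. Choosing
% \epsilon = M\|q\|_2\sqrt{2\log(1/\delta)} makes the right-hand side exactly \delta.
```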

Learning Guarantee

Theorem: let L be a convex loss bounded by M and H a set of hypothesis sequences adapted to Z_1^T. Fix q ∈ Δ. Then, for any δ > 0, the following holds with probability at least 1 − δ for the hypothesis h = Σ_{t=1}^T q_t h_t:

L_{T+1}(h, Z_1^T) ≤ inf_{h*∈H} L_{T+1}(h*, Z_1^T) + 2 disc(q) + Reg_T / T + M‖q − u‖_1 + 2M‖q‖_2 √(2 log(2/δ)),

where u is the uniform weight vector.

Notes

Learning guarantees with the same flavor as those of (Kuznetsov and MM, 2015), but with simpler proofs and no complexity measure.

The bounds admit as special cases the learning guarantees for:
- the i.i.d. scenario (Littlestone, 1989), (Cesa-Bianchi et al., 2004).
- the drifting scenario (MM and Muñoz Medina, 2012).

Extension: Non-Convex Loss

The theorems extend to non-convex losses when h is selected as follows (Cesa-Bianchi et al., 2004):

h = argmin_{h_t} { Σ_{s=t}^T q_s L(h_t, z_s) + disc(q_t^T) + M‖q_t^T‖_2 √(2 log(2(T+1)/δ)) },

where q_t^T = (q_t, ..., q_T).
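The computable part of this selection rule is a straightforward scan over candidates. A minimal sketch with the disc(q_t^T) term omitted, since it requires a separate estimate (that omission, and the inputs `losses`, `M`, `delta`, are assumptions of this illustration):

```python
import math

def select_hypothesis(losses, q, M=1.0, delta=0.05):
    """Pick t* minimizing the penalized suffix loss from the slide,
    without the disc(q_t^T) term (which needs its own estimator):

        sum_{s=t}^T q_s L(h_t, z_s) + M ||q_t^T||_2 sqrt(2 log(2(T+1)/delta)).

    losses[t][s]: loss of hypothesis h_t on sample z_s; q: weight vector.
    """
    T = len(q)
    best_t, best_val = 0, float("inf")
    for t in range(T):
        suffix_loss = sum(q[s] * losses[t][s] for s in range(t, T))
        penalty = M * math.sqrt(sum(q[s] ** 2 for s in range(t, T))) \
                  * math.sqrt(2 * math.log(2 * (T + 1) / delta))
        if suffix_loss + penalty < best_val:
            best_t, best_val = t, suffix_loss + penalty
    return best_t
```

Note how the penalty M‖q_t^T‖_2 shrinks for later t, so the rule trades suffix loss against confidence in the shorter evaluation window.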

Extension: General Regret

General regret definition:

Reg_T = Σ_{t=1}^T L(h_t, z_t) − inf_{h*∈H'} [Σ_{t=1}^T L(h*_t, z_t) + R(h*)].

- standard regret: R = 0, H' the set of constant sequences.
- tracking: H' ⊆ H^T.
- R can be a kernel-based regularization (Herbster and Warmuth, 2001).

Extension: Non-Adapted Sequences

Hypotheses are no longer adapted, but are output by a uniformly stable algorithm.

Stable hypotheses: H_A = {h ∈ H^T : there exists A ∈ A such that h = A(Z_1^T)}.

β_t: stability coefficient of the algorithm returning h_t.

Similar learning bounds hold, with the additional term Σ_{t=1}^T q_t β_t. They admit as special cases the results of (Agarwal and Duchi, 2013) for asymptotically stationary mixing processes.

Discrepancy Estimation

Batch estimation method: (Kuznetsov and MM, 2015).

On-line estimation method:
- assume that the loss is µ-Lipschitz.
- assume that there exists an accurate hypothesis h*, i.e., inf_{h*} E[L(Z_{T+1}, h*(X_{T+1})) | Z_1^T] is small.

Batch Estimation

Decomposition: disc(q) ≤ disc_s(q) + Δ_s, with

disc_s(q) ≤ sup_{h∈H} [ (1/s) Σ_{t=T−s+1}^T L(h, Z_1^{t−1}) − Σ_{t=1}^T q_t L(h, Z_1^{t−1}) ]
+ sup_{h∈H} [ L_{T+1}(h, Z_1^T) − (1/s) Σ_{t=T−s+1}^T L(h, Z_1^{t−1}) ],

where the last s observations serve as a proxy for the near future.

Online Estimation

Lemma: assume that L is µ-Lipschitz. Fix a sequence Z_1^T in Z. Then, for any δ > 0, with probability at least 1 − δ, the following holds for all ε > 0:

disc(q) ≤ d̂isc_{H_A}(q) + µ inf_{h*} E[L(Z_{T+1}, h*(X_{T+1})) | Z_1^T] + 2ε + M‖q‖_2 √(2 log(E[N_1(ε, G, z)] / δ)),

where

d̂isc_{H_A}(q) = sup_{h'∈H, h∈H_A} Σ_{t=1}^T q_t [L(h_t(X_{T+1}), h'(X_{T+1})) − L(h_t, z_t)].
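For finite candidate sets, d̂isc_{H_A}(q) is a maximum over finitely many terms, each linear in q, and can be computed by direct enumeration. A sketch under the squared loss (the finite grid `surrogate_vals` for h'(x_{T+1}) and the input encoding are assumptions of this illustration):

```python
def empirical_discrepancy(q, seq_preds_next, losses_past, surrogate_vals):
    """Compute  max over (sequence, surrogate) pairs of
        sum_t q_t [ (h_t(x_{T+1}) - h'(x_{T+1}))^2 - L(h_t, z_t) ]
    i.e. the empirical discrepancy estimate for finite sets, squared loss.

    seq_preds_next[a][t]: prediction of the t-th hypothesis of sequence a
      at the next point x_{T+1};
    losses_past[a][t]: its realized loss L(h_t, z_t) on the sample;
    surrogate_vals: candidate values h'(x_{T+1}) for the surrogate h'.
    """
    best = float("-inf")
    for preds, past in zip(seq_preds_next, losses_past):
        for v in surrogate_vals:
            val = sum(qt * ((p - v) ** 2 - l)
                      for qt, p, l in zip(q, preds, past))
            best = max(best, val)
    return best
```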

Proof Sketch

disc(q) = sup_{h∈H_A} Σ_{t=1}^T q_t [L_{T+1}(h_t, Z_1^T) − L_t(h_t, Z_1^{t−1})]
≤ sup_{h∈H_A} Σ_{t=1}^T q_t [L_{T+1}(h_t, Z_1^T) − E[L(h_t(X_{T+1}), h*(X_{T+1})) | Z_1^T]]
+ sup_{h∈H_A} Σ_{t=1}^T q_t [E[L(h_t(X_{T+1}), h*(X_{T+1})) | Z_1^T] − L_t(h_t, Z_1^{t−1})].

By the µ-Lipschitz property, the first term is bounded by µ E[L(h*(X_{T+1}), Y_{T+1}) | Z_1^T]. The second term is disc_{H_A}(q), which concentrates around its empirical counterpart d̂isc_{H_A}(q).

Algorithms

On-Line-to-Batch (OTB)

Problem: use h = (h_1, ..., h_T) to derive a hypothesis h ∈ H with small path-dependent expected loss

L_{T+1}(h, Z_1^T) = E_{Z_{T+1}}[L(h, Z_{T+1}) | Z_1^T].

OTB Algorithm

Idea: choose the weights q to minimize the bound for h = Σ_{t=1}^T q_t h_t:

L_{T+1}(h, Z_1^T) ≤ Σ_{t=1}^T q_t L(h_t, z_t) + disc(q) + M‖q‖_2 √(2 log(1/δ)).

Optimization problem:

min_{q∈Δ} Σ_{t=1}^T q_t L(h_t, z_t) + d̂isc_{H_A}(q) + λ‖q‖_2.

Solution: h = Σ_{t=1}^T q_t h_t.
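This objective is convex in q, so simple first-order methods over the simplex Δ apply. A sketch that keeps the empirical-loss and λ‖q‖_2 terms and treats the discrepancy term as a constant (a simplification of the slide's objective; in the full algorithm a subgradient of d̂isc_{H_A}(q) would be added to the gradient):

```python
import math

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for i, ui in enumerate(u, 1):
        css += ui
        if ui - (css - 1.0) / i > 0:
            theta = (css - 1.0) / i
    return [max(vi - theta, 0.0) for vi in v]

def otb_weights(losses, lam=0.1, lr=0.05, steps=500):
    """Minimize  sum_t q_t * losses[t] + lam * ||q||_2  over the simplex
    by projected gradient descent (discrepancy term treated as constant)."""
    T = len(losses)
    q = [1.0 / T] * T
    for _ in range(steps):
        norm = math.sqrt(sum(x * x for x in q)) or 1.0
        grad = [l + lam * x / norm for l, x in zip(losses, q)]
        q = project_simplex([x - lr * g for x, g in zip(q, grad)])
    return q
```

With a small λ, the weights concentrate on the rounds whose hypotheses incurred small loss, as the bound suggests.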

Model Selection

Problem: given N time series models, how should we use the sample Z_1^T to select a single best model?
- in the i.i.d. case, cross-validation can be shown to be close to the structural risk minimization solution.
- but how do we select a validation set for general stochastic processes?
- models may have been pre-trained on Z_1^T.

Model Selection [figure]

Model Selection Algorithm

Idea: use the learning bound in terms of the on-line discrepancy and the per-round regret, for the hypothesis

h = argmin_{h_t} { Σ_{s=t}^T q_s L(h_t, z_s) + disc(q_t^T) + M‖q_t^T‖_2 √(2 log(2(T+1)/δ)) }.

Model Selection Algorithm

Algorithm:
- choose q ∈ Δ to minimize the discrepancy: min_{q∈Δ} d̂isc_H(q).
- use an on-line algorithm for prediction with expert advice to generate a sequence of hypotheses h ∈ H^T, with H the set of N models.
- select the model according to

h = argmin_{h_t} { Σ_{s=t}^T q_s L(h_t, z_s) + disc(q_t^T) + M‖q_t^T‖_2 √(2 log(2(T+1)/δ)) }.

Learning Ensembles

Problem: given a hypothesis set H and a sample Z_1^T, find a convex combination h = Σ_{t=1}^T q_t h_t, with h ∈ H_A, that has small path-dependent expected loss.
- in the most general case, the hypotheses may have been pre-trained on Z_1^T.

Ensemble Learning Algorithm

Algorithm:
- run regret minimization on Z_1^T to return h = (h_1, ..., h_T).
- minimize the learning bound: for λ ≥ 0,

min_q d̂isc_{H_A}(q) + Σ_{t=1}^T q_t L(h_t, z_t) subject to ‖q − u‖_2 ≤ λ.

This is a convex optimization problem, by convexity of

d̂isc_{H_A}(q) = sup_{h'∈H, h∈H_A} Σ_{t=1}^T q_t [L(h_t(X_{T+1}), h'(X_{T+1})) − L(h_t, z_t)].

Ensemble Learning Algorithm

If the hypothesis set H is finite, then the supremum can be computed straightforwardly.

If the hypothesis set is not finite but is convex and the loss is convex, then the maximization

sup_{h'∈H, h∈H_A} Σ_{t=1}^T q_t [L(h_t(X_{T+1}), h'(X_{T+1})) − L(h_t, z_t)]

can be cast as a DC-programming problem and solved using the DC algorithm (Tao and An, 1998); for the squared loss, a global optimum is obtained.

Conclusion

Time series prediction using on-line algorithms:
- new learning bounds for non-stationary, non-mixing processes.
- an on-line discrepancy measure that can be estimated.
- general on-line-to-batch conversion.
- application to model selection.
- application to learning ensembles.
- tools for tackling other time series problems.