Online Prediction: Bayes versus Experts



Marcus Hutter - 1 - Online Prediction: Bayes versus Experts

Online Prediction: Bayes versus Experts

Marcus Hutter
Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA)
Galleria 2, CH-6928 Manno-Lugano, Switzerland
marcus@idsia.ch, http://www.idsia.ch/~marcus

PASCAL-2004, July 19-21

Slide 2: Table of Contents

- Sequential/online prediction: Setup
- Bayesian Sequence Prediction (Bayes)
- Prediction with Expert Advice (PEA)
- PEA Bounds versus Bayes Bounds
- PEA Bounds reduced to Bayes Bounds
- Open Problems, Discussion, More

Slide 3: Abstract

We derive a very general regret bound in the framework of prediction with expert advice, which challenges the best known regret bound for Bayesian sequence prediction. Both bounds, of the form Loss ≤ Loss(best) + O(sqrt(Loss × complexity)), hold for any bounded loss function, any prediction and observation spaces, arbitrary expert/environment classes and weights, and unknown sequence length.

Keywords: Bayesian sequence prediction; prediction with expert advice; general weights, alphabet and loss.

Slide 4: Sequential/Online Predictions

In sequential or online prediction, for t = 1, 2, 3, ..., our predictor p makes a prediction y_t^p ∈ Y based on past observations x_1, ..., x_{t-1}. Thereafter x_t ∈ X is observed and p suffers loss l(x_t, y_t^p). The goal is to design predictors with small total (cumulative) loss

  Loss_{1:T}(p) := sum_{t=1}^T l(x_t, y_t^p).

Applications are abundant, e.g. weather or stock market forecasting.

Example: Y = {umbrella, sunglasses}, X = {sunny, rainy}, loss l(x, y):

  l(x, y)      sunny   rainy
  umbrella      0.1     0.3
  sunglasses    0.0     1.0

The setup also includes classification and regression problems.
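The protocol on this slide can be sketched in a few lines. This is a minimal illustration, not part of the talk: the loss table is the slide's weather example, while the helper names and the naive "always umbrella" predictor are my own choices.

```python
# l(x, y): keyed by (observation x, prediction y), values from slide 4.
LOSS = {
    ("sunny", "umbrella"): 0.1, ("rainy", "umbrella"): 0.3,
    ("sunny", "sunglasses"): 0.0, ("rainy", "sunglasses"): 1.0,
}

def cumulative_loss(predictor, observations):
    """Loss_{1:T}(p) = sum_t l(x_t, y_t^p); the predictor sees only x_{<t}."""
    total, past = 0.0, []
    for x_t in observations:
        y_t = predictor(past)       # prediction based on x_1 ... x_{t-1}
        total += LOSS[(x_t, y_t)]   # x_t is revealed after predicting
        past.append(x_t)
    return total

xs = ["sunny", "rainy", "sunny", "rainy"]
always_umbrella = lambda past: "umbrella"
print(cumulative_loss(always_umbrella, xs))  # 0.1 + 0.3 + 0.1 + 0.3 ≈ 0.8
```

Any classifier or regressor that predicts one label at a time fits this interface, which is the sense in which the setup "includes classification and regression problems".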

Slide 5: Bayesian Sequence Prediction

Slide 6: Bayesian Sequence Prediction - Setup

Assumption: the sequence x_1 ... x_T is sampled from some distribution µ, i.e. the probability of x_{<t} := x_1 ... x_{t-1} is µ(x_{<t}). The probability of the next symbol being x_t, given x_{<t}, is µ(x_t | x_{<t}).

Goal: minimize the µ-expected loss =: Loss.

More generally: define the Bayes^ρ sequence prediction scheme

  y_t^ρ := arg min_{y_t ∈ Y} sum_{x_t} ρ(x_t | x_{<t}) l(x_t, y_t),

which minimizes the ρ-expected loss.

If µ is known, Bayes^µ is obviously the best predictor in the sense of achieving minimal expected loss:

  Loss_{1:T}(Bayes^µ) ≤ Loss_{1:T}(Any p).
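The Bayes^ρ decision rule above is a one-liner once ρ's predictive probabilities are given. A minimal sketch, reusing the weather loss table from slide 4; the particular predictive distributions fed in below are illustrative assumptions, not from the talk.

```python
# Loss table from slide 4: keyed by (observation x, prediction y).
LOSS = {
    ("sunny", "umbrella"): 0.1, ("rainy", "umbrella"): 0.3,
    ("sunny", "sunglasses"): 0.0, ("rainy", "sunglasses"): 1.0,
}
X, Y = ["sunny", "rainy"], ["umbrella", "sunglasses"]

def bayes_action(rho_pred):
    """y^rho = argmin_y sum_x rho(x | x_<t) * l(x, y).
    rho_pred maps each x in X to its predictive probability."""
    exp_loss = lambda y: sum(rho_pred[x] * LOSS[(x, y)] for x in X)
    return min(Y, key=exp_loss)

# With rain probability 0.5: umbrella costs 0.5*0.1 + 0.5*0.3 = 0.2,
# sunglasses cost 0.5*0.0 + 0.5*1.0 = 0.5, so Bayes takes the umbrella.
print(bayes_action({"sunny": 0.5, "rainy": 0.5}))    # umbrella
print(bayes_action({"sunny": 0.95, "rainy": 0.05}))  # sunglasses
```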

Slide 7: The Bayes-Mixture Distribution ξ

Assumption: the true (objective) environment µ is unknown.

Bayesian approach: replace the true probability distribution µ by a Bayes-mixture ξ.

Assumption: we know that the true environment µ is contained in some known (finite or countable) set M of environments.

The Bayes-mixture ξ is defined as

  ξ(x_{1:m}) := sum_{ν ∈ M} w_ν ν(x_{1:m})   with   sum_{ν ∈ M} w_ν = 1,  w_ν > 0 ∀ν.

The weights w_ν may be interpreted as the prior degree of belief that the true environment is ν, or k_ν = ln(1/w_ν) as a complexity penalty (prefix code length) of environment ν. Then ξ(x_{1:m}) may be interpreted as the prior subjective belief probability of observing x_{1:m}.

Slide 8: Bayesian Loss Bound

Under certain conditions, Loss_{1:T}(Bayes^ξ) is bounded in terms of Loss_{1:T}(Any p) (and hence in terms of the loss of the best predictor in hindsight, Bayes^µ):

  Loss_{1:T}(Bayes^ξ) ≤ Loss_{1:T}(Any p) + 2 sqrt(Loss_{1:T}(Any p) k_µ) + 2 k_µ   ∀µ ∈ M.

Note that Loss_{1:T} depends on µ.

Proven for countable M and X, finite Y, any k_µ, and any bounded loss function l : X × Y → [0, 1] [H'01-03].

For finite M, the uniform choice k_ν = ln|M| ∀ν ∈ M is common. For infinite M, k_ν = complexity of ν is common (Occam, Solomonoff).

Slide 9: Prediction with Expert Advice

Slide 10: Prediction with Expert Advice (PEA) - Setup

Given a countable class E of experts, each Expert^e ∈ E at times t = 1, 2, ... makes a prediction y_t^e. The goal is to construct a master algorithm which exploits the experts and predicts asymptotically as well as the best expert in hindsight.

More formally, a PEA master is defined as: For t = 1, 2, ..., T
- Predict y_t^PEA := PEA(x_{<t}, y_t, Loss)
- Observe x_t := Env(y_{<t}, x_{<t}, y_{<t}^PEA?)
- Receive Loss_t(Expert^e) := l(x_t, y_t^e) for each Expert^e ∈ E
- Suffer Loss_t(PEA) := l(x_t, y_t^PEA)

Notation: x_{<t} := (x_1, ..., x_{t-1}) and y_t := (y_t^e)_{e ∈ E}.

Slide 11: Goals

BEH := Best Expert in Hindsight = expert of minimal total loss:

  Loss_{1:T}(BEH) = min_{e ∈ E} Loss_{1:T}(Expert^e).

0) Regret := Loss_{1:T}(PEA) - Loss_{1:T}(BEH) shall be small, namely O(sqrt(Loss_{1:T}(BEH))).

1) Any bounded loss function (w.l.o.g. 0 ≤ Loss_t ≤ 1). Literature: mostly specific losses (absolute, 0/1, log, square).

2) Neither a (non-trivial) upper bound on the total loss nor the sequence length T is known. Solution: adaptive learning rate.

3) Infinite number of experts. Motivation:
- Expert^e = polynomial of degree e = 1, 2, 3, ... through the data, or
- E = class of all computable (or finite-state or ...) experts.

Slide 12: Weighted Majority (WM)

Take the expert which performed best in the past with high probability, and the others with smaller probability. At time t, select Expert I_t^WM with probability

  P[I_t^WM = e] ∝ w^e exp[-η_t Loss_{<t}(Expert^e)],

where η_t = learning rate and w^e = initial weight.

[Littlestone&Warmuth'90 (classical)]: 0/1 loss and η_t = const.
[Freund&Schapire'97 (Hedge)] and others: general loss, but η_t = const.
[Cesa-Bianchi et al.'97]: piecewise constant η_t; only 1/w^e = |E| < ∞.
[Auer&Gentile'00, Yaroshinsky et al.'04]: smooth η_t → 0, but only 0/1 loss and 1/w^e = |E| < ∞.
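The WM selection probabilities are easy to compute. A sketch of the rule P[I_t = e] ∝ w^e exp(-η_t Loss_{<t}(e)); the expert names, losses and the fixed η below are illustrative assumptions, not from the talk.

```python
import math
import random

def wm_probs(past_losses, weights, eta):
    """past_losses[e] = Loss_<t(Expert^e); returns P[I_t = e] for each e."""
    scores = {e: weights[e] * math.exp(-eta * past_losses[e])
              for e in past_losses}
    z = sum(scores.values())        # normalizer
    return {e: s / z for e, s in scores.items()}

losses = {"a": 3.0, "b": 1.0, "c": 1.0}
p = wm_probs(losses, {"a": 1.0, "b": 1.0, "c": 1.0}, eta=1.0)
# Experts with smaller past loss get larger probability; ties share mass.
print(p["b"] > p["a"], abs(p["b"] - p["c"]) < 1e-12)

# Sampling the expert index I_t^WM:
random.seed(0)
I_t = random.choices(list(p), weights=list(p.values()))[0]
```

Raising η makes the distribution concentrate on the past leader; η → 0 gives the uniform distribution, exactly the two limits noted on slide 13.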

Slide 13: Follow the Perturbed Leader (FPL)

Select the expert of minimal perturbed and penalized loss. Let Q_t^e be i.i.d. random variables and k^e a complexity penalty. Select expert

  I_t^FPL := arg min_{e ∈ E} { η_t Loss_{<t}(Expert^e) + k^e + Q_t^e }.

[Hannan'57]: Q_t^e ∼ Uniform[0, 1]. [Kalai&Vempala'03]: P[Q_t^e = u] = (1/2) exp(-|u|).
Both: k^e = 0, |E| < ∞, η_t ∝ 1/sqrt(t) ⇒ Regret = O(sqrt(|E| T)).

[Hutter&Poland'04]: P[Q_t^e = u] = exp(-u) (u ≥ 0), general k^e and E, and η_t ∝ 1/sqrt(Loss) ⇒ Regret = O(sqrt(k^e Loss)).

For all PEA variants (WM & FPL & others) it holds:

  P[I_t = e] is { large / small } if Expert^e has { small / large } loss;
  I_t → Best Expert in Past as η → ∞ (η = learning rate);
  I_t → uniform distribution among experts as η → 0.
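The FPL rule is a perturbed argmin. A sketch under stated assumptions: I use Exp(1) perturbations as in the Hutter&Poland'04 variant, added as on the slide (sign and scaling conventions for the perturbation differ across FPL variants), and an illustrative two-expert class with uniform penalties k^e = ln 2.

```python
import math
import random

def fpl_choose(past_losses, k, eta_t, rng):
    """I_t = argmin_e { eta_t * Loss_<t(e) + k[e] + Q_t^e }, Q_t^e ~ Exp(1)."""
    perturbed = {e: eta_t * past_losses[e] + k[e] + rng.expovariate(1.0)
                 for e in past_losses}
    return min(perturbed, key=perturbed.get)

rng = random.Random(0)
losses = {"a": 50.0, "b": 5.0}          # expert "b" is the past leader
k = {"a": math.log(2), "b": math.log(2)}
# With a large loss gap and moderate eta, FPL follows the leader almost always;
# the perturbation only matters when past losses are close.
picks = [fpl_choose(losses, k, eta_t=1.0, rng=rng) for _ in range(1000)]
print(picks.count("b") > 950)  # True
```

Shrinking eta_t toward 0 makes the perturbation dominate, recovering the near-uniform limit stated on the slide.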

Slide 14: FPL Regret Bounds for |E| < ∞ and k^e = ln|E|

Since FPL is randomized, we need to consider the expected loss =: Loss.

  Regret := Loss_{1:T}(FPL) - Loss_{1:T}(BEH).

Static: η_t = sqrt(ln|E| / T) ⇒ Regret ≤ 2 sqrt(T ln|E|).

Dynamic: η_t = sqrt(ln|E| / (2t)) ⇒ Regret ≤ 2 sqrt(2 T ln|E|).

Self-confident: η_t = sqrt(ln|E| / (2(Loss_{<t}(FPL) + 1))) ⇒ Regret ≤ 2 sqrt(2 (Loss_{1:T}(BEH) + 1) ln|E|) + 8 ln|E|.

Adaptive: η_t = (1/2) min{1, sqrt(ln|E| / Loss_{<t}(BEH))} ⇒ Regret ≤ 2 sqrt(2 Loss_{1:T}(BEH) ln|E|) + 5 ln|E| ln(Loss_{1:T}(BEH)) + 3 ln|E| + 6.

No hidden O() terms!

Slide 15: FPL Regret Bounds for |E| = ∞ and General k^e

Assume a complexity penalty k^e such that sum_{e ∈ E} exp(-k^e) ≤ 1. We expect ln|E| ⇝ k^e, i.e. Regret = O(sqrt(k^e × (Loss or T))).

Problem: the choice η_t = sqrt(k^e / ...) depends on e, so the proofs break down. Choosing η_t = sqrt(1/...) instead gives Regret ∝ k^e, i.e. k^e is no longer under the square root.

Solution: a two-level hierarchy of experts.
- Group all experts of (roughly) equal complexity: FPL^K runs over the subclass of experts with complexity k^e ∈ (K-1, K].
- Choose η_t^K = sqrt(K / (2 Loss_{<t})) = constant within each subclass.
- Regard each FPL^K as a (meta-)expert, and construct from them a (meta-)FPL with η_t = 1/sqrt(Loss_{<t}) and k^K = 1/2 + 2 ln K.

⇒ Regret ≤ 2 sqrt(2 k^e Loss_{1:T}(Expert^e)) (1 + O(ln k^e / sqrt(k^e))) + O(k^e).

Slide 16: PEA versus Bayes

Slide 17: PEA versus Bayes Bounds - Formal

The formal similarity and duality between the Bayes and PEA bounds is striking:

  Loss_{1:T}(Bayes^ξ) ≤ Loss_{1:T}(Any p) + 2 sqrt(Loss_{1:T}(Any p) k_µ) + 2 k_µ
  Loss_{1:T}(PEA)     ≤ Loss_{1:T}(Expert^e) + c sqrt(Loss_{1:T}(Expert^e) k^e) + b k^e

with c = 2 sqrt(2) and b = 8 for PEA = FPL.

         beats           in environment   expectation w.r.t.   function of
  Bayes  all p           µ ∈ M            environment µ        M
  PEA    Expert^e ∈ E    any x_1 ... x_T  prob. prediction     E

Apart from this formal duality, there is a real connection between the two bounds.

Slide 18: PEA Bound Reduced to Bayes Bound

Regard the class of Bayes predictors {Bayes^ν : ν ∈ M} as the class of experts E. The corresponding FPL algorithm then satisfies the PEA bound

  Loss_{1:T}(PEA) ≤ Loss_{1:T}(Bayes^µ) + c sqrt(Loss_{1:T}(Bayes^µ) k_µ) + b k_µ.

Take the µ-expectation, and use Loss_{1:T}(Bayes^µ) ≤ Loss_{1:T}(Any p) and Jensen's inequality, to get a Bayes-like bound for PEA:

  Loss_{1:T}(PEA) ≤ Loss_{1:T}(Any p) + c sqrt(Loss_{1:T}(Any p) k_µ) + b k_µ   ∀µ ∈ M.

Ignoring details, instead of using Bayes^ξ one may use PEA with the same/similar performance guarantees as Bayes^ξ. Additionally, PEA has worst-case guarantees, which Bayes lacks. So why use Bayes at all?
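The reduction on this slide can be sketched concretely: treat each Bayes^ν (one per environment ν ∈ M) as an expert and run an expert algorithm over them. Everything specific below is my illustrative assumption, not from the talk: Bernoulli environments, 0/1 loss (for which Bayes^ν simply predicts ν's more likely bit), and a simple Hedge-style master in place of FPL.

```python
import math
import random

thetas = [0.1, 0.5, 0.9]            # M = three Bernoulli(theta) environments
w = {th: 1 / 3 for th in thetas}    # uniform prior weights

def bayes_prediction(theta, past):
    """Bayes^nu under 0/1 loss: predict the bit nu considers more likely.
    (For i.i.d. Bernoulli with known theta, the past is irrelevant.)"""
    return 1 if theta >= 0.5 else 0

def run_master(xs, eta=1.0, seed=0):
    rng = random.Random(seed)
    loss = {th: 0.0 for th in thetas}   # Loss_<t per Bayes-expert
    master_loss = 0.0
    for t, x_t in enumerate(xs):
        # Hedge-style sampling: P[I_t = e] ∝ w^e exp(-eta * Loss_<t(e)).
        scores = {th: w[th] * math.exp(-eta * loss[th]) for th in thetas}
        pick = rng.choices(thetas, weights=list(scores.values()))[0]
        master_loss += int(bayes_prediction(pick, xs[:t]) != x_t)
        for th in thetas:
            loss[th] += int(bayes_prediction(th, xs[:t]) != x_t)
    return master_loss, min(loss.values())

xs = [1] * 30                        # data from a "mostly ones" source
m_loss, beh_loss = run_master(xs)
print(beh_loss, m_loss)              # BEH loss is 0; the master stays close
```

The master's excess loss over the best Bayes-expert in hindsight is exactly the regret that the PEA bound controls; taking the µ-expectation then yields the Bayes-like bound on the slide.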

Slide 19: Open Problems

- We only compared bounds on PEA and Bayes. What about the actual (practical or theoretical) relative performance?
- Can the FPL regret constant c = 2 sqrt(2) be improved to c = 2? For Hedge/FPL? Conjecture: yes for Hedge, since Bayes has c = 2.
- Generalize existing bounds for WM-type masters (e.g. Hedge) to general X, Y, E, and l ∈ [0, 1], similarly to FPL.
- Generalize the FPL bound to infinite E and general k^e without the hierarchy trick (as for Bayes), perhaps with an expert-dependent η_t^e. Try first to prove weaker regret bounds with Loss_{1:T} replaced by T.

Slide 20: More on the (PEA) Regret Constant

Constant c in Regret = c sqrt(Loss × k^e) for various settings and algorithms:

  η        Loss   Optimal   LowBnd       Upper Bound
  static   0/1    1?        1?           2 [V'95]
  static   any    2!        2 [V'95]     2 [FS'97], 2 [FPL]
  dynamic  0/1    2?        1 [H'03]?    2 [YEYS'04], 2 sqrt(2) [ACBG'02]
  dynamic  any    2?        2 [V'95]     2 sqrt(2) [FPL], 2 [H'03]

Major open problems:
- Elimination of the hierarchy (trick)
- Lower regret bound for an infinite number of experts
- Same results (dynamic η_t, any loss, |E| = ∞) for WM

Slide 21: Some More FPL Results

Lower bound: Loss_{1:T}(FPL) ≥ Loss_{1:T}(BEH) + ln|E| / η_T if k^e = ln|E|.

Bounds with high probability (Chernoff-Hoeffding):

  P[ |Loss_{1:T} - Loss_{1:T}| ≥ sqrt(3c Loss_{1:T}) ] ≤ 2 exp(-c),

which is tiny for e.g. c = 5.

Computational aspects: it is trivial to generate the randomized decision of FPL; if we want to explicitly compute the selection probability, we need to compute a one-dimensional integral.

Deterministic prediction: FPL can be derandomized if the prediction space Y and the loss function Loss(x, y) are convex.

Slide 22: Thanks! Questions?

Details:
http://www.idsia.ch/~marcus/ai/expert.htm [ALT 2004]
http://www.idsia.ch/~marcus/ai/spupper.htm [IEEE-TIT 2003]