THE WEIGHTED MAJORITY ALGORITHM


Csaba Szepesvári, University of Alberta, CMPUT 654
E-mail: szepesva@ualberta.ca
UofA, October 3, 2006

OUTLINE
1 Prediction with expert advice
2 Halving: find the perfect expert! (0/1 loss)
3 No perfect expert? (0/1 loss)
4 Predicting continuous outcomes
5 Bibliography

FRAMEWORK: Prediction with Expert Advice
Outcomes: $y_1, y_2, \ldots \in \mathcal{Y}$
Decisions: $\hat{p}_1, \hat{p}_2, \ldots \in \mathcal{D}$
Loss function: $\ell : \mathcal{D} \times \mathcal{Y} \to \mathbb{R}$
Advice of expert $i$: $f_{i,1}, f_{i,2}, \ldots \in \mathcal{D}$, $i \in J$
(Total) loss of expert $i$: $L_{i,n} = \sum_{t=1}^{n} \ell(f_{i,t}, y_t)$
(Total) loss of the algorithm: $\hat{L}_n = \sum_{t=1}^{n} \ell(\hat{p}_t, y_t)$
(Total) regret (excess loss) relative to expert $i$: $R_n = \hat{L}_n - L_{i,n}$
Goal: design an algorithm that keeps the regret small.

A PERFECT WORLD
$y_t \in \{0, 1\}$, $\hat{p}_t \in \{0, 1\}$ ($\mathcal{Y} = \mathcal{D} = \{0, 1\}$)
Loss: $\ell(p, y) = \mathbb{I}\{p \neq y\}$ (0/1, binary, or classification loss)
$N$ experts ($J = \{1, \ldots, N\}$)
Expert predictions: $f_{i,1}, f_{i,2}, \ldots \in \{0, 1\}$
Assumption: there is an expert that never makes a mistake.
How can we keep the regret small?

HALVING ALGORITHM
Keeping the regret small means finding the perfect expert quickly, i.e., with few mistakes.
Idea: immediately eliminate the experts that make a mistake; predict with the majority vote of the remaining experts.
Halving Algorithm [Barzdin and Freivalds, 1972, Angluin, 1988]
Claim: whenever the algorithm makes a mistake, at least half of the remaining experts are eliminated!
Since there is a perfect expert, the set of experts cannot be halved more than $\log_2 N$ times.
Theorem: the regret never grows above $\log_2 N$ (finite!), and this holds for any sequence $y_1, y_2, \ldots$!
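The elimination-and-vote scheme above can be sketched in a few lines (a minimal illustration, not from the slides; the function name and input encoding are my own):

```python
def halving(expert_preds, outcomes):
    """Halving algorithm: predict by majority vote over the experts
    that are still consistent with every outcome seen so far.

    expert_preds[t][i] is expert i's {0,1} prediction in round t;
    outcomes[t] is the true {0,1} outcome. Assumes at least one
    expert never errs. Returns the number of mistakes made."""
    alive = set(range(len(expert_preds[0])))
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        votes_for_1 = sum(preds[i] for i in alive)
        p_hat = 1 if votes_for_1 > len(alive) - votes_for_1 else 0
        if p_hat != y:
            mistakes += 1
        # eliminate every expert that erred in this round
        alive = {i for i in alive if preds[i] == y}
    return mistakes
```

With $N$ experts, one of them perfect, the returned mistake count never exceeds $\log_2 N$.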

FORMAL ANALYSIS
Weights $w_{i,t} \in \{0, 1\}$: is expert $i$ alive at time $t$ (after $y_t$ is received)? Initially $w_{i,0} = 1$.
$W_t = \sum_{i=1}^{N} w_{i,t}$: the number of alive experts at time $t$.
$\hat{L}_t$: the number of mistakes up to and including time $t$.
Claim: if the algorithm makes a mistake at time $t$ ($\ell(\hat{p}_t, y_t) = 1$), then $W_t \le W_{t-1}/2$. Also, $W_t$ never grows. Hence $W_t \le W_0 / 2^{\hat{L}_t} = N / 2^{\hat{L}_t}$.
Lower bound: $1 \le W_t$, since the perfect expert is never eliminated.
Putting these together: $1 \le N / 2^{\hat{L}_t}$, hence $\hat{L}_t \le \log_2 N$.

NO PERFECT EXPERT: WEIGHTED MAJORITY
Elimination is too strong if there is no perfect expert: keep the weights positive!
Let the weight of an expert that makes a mistake decay: $w_{i,t} = \beta w_{i,t-1}$ if $f_{i,t} \neq y_t$ ($0 < \beta < 1$).
Keep the majority vote:
$\hat{p}_t = \mathbb{I}\left\{\sum_i w_{i,t-1} \mathbb{I}\{f_{i,t}=0\} < \sum_i w_{i,t-1} \mathbb{I}\{f_{i,t}=1\}\right\}$
$= \mathbb{I}\left\{\sum_i w_{i,t-1} (1 - f_{i,t}) < \sum_i w_{i,t-1} f_{i,t}\right\}$
$= \mathbb{I}\left\{\sum_i w_{i,t-1} < 2 \sum_i w_{i,t-1} f_{i,t}\right\}$
$= \mathbb{I}\left\{ \frac{\sum_i w_{i,t-1} f_{i,t}}{\sum_i w_{i,t-1}} > \frac{1}{2} \right\}$
Weighted Majority [Littlestone and Warmuth, 1994]
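The update just described can be sketched as follows (illustrative code, not from the slides; the default $\beta$ and the names are my own):

```python
def weighted_majority(expert_preds, outcomes, beta=0.5):
    """Weighted Majority: predict by weighted majority vote and
    multiply the weight of every erring expert by beta, 0 < beta < 1.

    expert_preds[t][i] is expert i's {0,1} prediction in round t;
    outcomes[t] is the true {0,1} outcome. Returns the algorithm's
    mistake count and the final weights."""
    w = [1.0] * len(expert_preds[0])
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        weight_for_1 = sum(wi for wi, p in zip(w, preds) if p == 1)
        p_hat = 1 if weight_for_1 > sum(w) / 2 else 0
        if p_hat != y:
            mistakes += 1
        # decay the weights of the experts that erred; never eliminate
        w = [wi * beta if p != y else wi for wi, p in zip(w, preds)]
    return mistakes, w
```

Unlike Halving, no expert's weight ever reaches zero, so the algorithm remains competitive even when every expert errs occasionally.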

WEIGHTED MAJORITY: ANALYSIS/1
Notation: $J_{t,\mathrm{bad}} = \{i : f_{i,t} \neq y_t\}$, $J_{t,\mathrm{good}} = \{i : f_{i,t} = y_t\}$, $W_{t,J} = \sum_{i \in J} w_{i,t}$.
Then $W_t = W_{t-1, J_{t,\mathrm{good}}} + \beta W_{t-1, J_{t,\mathrm{bad}}}$.
Claim: $W_t \le W_{t-1}$, and if $\hat{p}_t \neq y_t$ then $W_t \le \frac{1+\beta}{2} W_{t-1}$.
Proof: $W_t = W_{t-1, J_{t,\mathrm{good}}} + \beta W_{t-1, J_{t,\mathrm{bad}}}$. Since $\beta < 1$, $W_t \le W_{t-1}$.
Assume $\hat{p}_t \neq y_t$. Then $W_{t-1, J_{t,\mathrm{good}}} \le W_{t-1}/2$, since the weighted majority voted for the wrong outcome. Hence
$W_t = W_{t-1, J_{t,\mathrm{good}}} + \beta (W_{t-1} - W_{t-1, J_{t,\mathrm{good}}}) = (1-\beta) W_{t-1, J_{t,\mathrm{good}}} + \beta W_{t-1} \le (1-\beta) W_{t-1}/2 + \beta W_{t-1} = \frac{1+\beta}{2} W_{t-1}$.

WEIGHTED MAJORITY: ANALYSIS/2
Claim: $W_t \le W_{t-1}$, and if $\hat{p}_t \neq y_t$ then $W_t \le \frac{1+\beta}{2} W_{t-1}$.
Lower bound: for any $i$, $\beta^{L_{i,t}} = w_{i,t} \le W_t$.
Putting these together: $\beta^{L_{i,t}} \le W_t \le \left(\frac{1+\beta}{2}\right)^{\hat{L}_t} W_0$.
Taking logarithms and reordering (with $W_0 = N$): $\hat{L}_t \le \frac{\log_2(1/\beta)\, L_{i,t} + \log_2 N}{\log_2\left(\frac{2}{1+\beta}\right)}$.

PREDICTING CONTINUOUS OUTCOMES
What if $\mathcal{Y} = \mathcal{D} = [0, 1]$ or $\mathbb{R}^d$? More generally, let $\mathcal{Y} = \mathcal{D}$ be a convex subset of some vector space ($\lambda_1 y_1 + \lambda_2 y_2 \in \mathcal{Y}$ whenever $\lambda_1, \lambda_2 \ge 0$, $\lambda_1 + \lambda_2 = 1$, $y_1, y_2 \in \mathcal{Y}$).
Loss: $\ell : \mathcal{D} \times \mathcal{Y} \to [0, 1]$ (bounded).
Example: $\mathcal{D} = \mathcal{Y} = [0, 1]$, $\ell(p, y) = \frac{1}{2}|p - y|$.
Can we generalize the previous idea? Combine the advice of the experts:
$\hat{p}_t = \frac{\sum_{i=1}^{N} w_{i,t-1} f_{i,t}}{\sum_{i=1}^{N} w_{i,t-1}}$
How to set the weights? Let them decay exponentially as a function of the losses:
$w_{i,t} = w_{i,t-1} e^{-\eta \ell(f_{i,t}, y_t)}$.

PREDICTING CONTINUOUS OUTCOMES/2
For numerical stability we might want to normalize the weights:
$w_{i,t} = \frac{w_{i,t-1} e^{-\eta \ell(f_{i,t}, y_t)}}{\sum_{j=1}^{N} w_{j,t-1} e^{-\eta \ell(f_{j,t}, y_t)}}$.
Note: this resembles the Bayes update! For the analysis we do not normalize.
Plan of the analysis:
Lower bound the sum of the weights using the individual total losses of the experts.
Upper bound the sum of the weights in terms of the total loss of the algorithm.
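The exponentially weighted average (EWA) forecaster with normalized weights can be sketched as follows (an illustration under the running example $\ell(p, y) = \frac{1}{2}|p - y|$; names are my own):

```python
import math

def ewa_forecast(expert_preds, outcomes, eta):
    """EWA forecaster for predictions and outcomes in [0, 1] with the
    convex, bounded loss l(p, y) = 0.5 * |p - y|.

    Weights are normalized every round for numerical stability; the
    normalization does not change the predictions. Returns the
    algorithm's total loss and the experts' total losses."""
    n_experts = len(expert_preds[0])
    w = [1.0 / n_experts] * n_experts
    alg_loss = 0.0
    expert_loss = [0.0] * n_experts
    for preds, y in zip(expert_preds, outcomes):
        # prediction: weighted average of the experts' advice
        p_hat = sum(wi * p for wi, p in zip(w, preds))
        alg_loss += 0.5 * abs(p_hat - y)
        # exponential weight update, then normalize
        w = [wi * math.exp(-eta * 0.5 * abs(p - y))
             for wi, p in zip(w, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
        for i, p in enumerate(preds):
            expert_loss[i] += 0.5 * abs(p - y)
    return alg_loss, expert_loss
```

With $\eta = \sqrt{8 \ln N / n}$ this matches the setting of the theorem, so the total loss stays within $\sqrt{(n/2) \ln N}$ of the best expert's.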

ANALYSIS/1
Lower bound: $W_n = \sum_{i=1}^{N} w_{i,n} = \sum_{i=1}^{N} e^{-\eta L_{i,n}} \ge e^{-\eta L_{i,n}}$ for any $i$.
Upper bound: bound $W_t / W_{t-1}$ in terms of $\ell(\hat{p}_t, y_t)$ ($W_t \le \mathrm{const} \cdot W_{t-1}$; what is the constant?):
$\frac{W_t}{W_{t-1}} = \sum_i \frac{w_{i,t-1}}{W_{t-1}} e^{-\eta \ell_{i,t}} = \sum_i \hat{w}_{i,t-1} e^{-\eta \ell_{i,t}}$,
where $\ell_{i,t} \stackrel{\mathrm{def}}{=} \ell(f_{i,t}, y_t)$ and $\hat{w}_{i,t-1} \stackrel{\mathrm{def}}{=} w_{i,t-1}/W_{t-1}$.

ANALYSIS/2
$\frac{W_t}{W_{t-1}} = \sum_i \hat{w}_{i,t-1} e^{-\eta \ell_{i,t}}$ looks like an expectation!
Let $P(I = i) = \hat{w}_{i,t-1}$, $I \in J$. Then $\frac{W_t}{W_{t-1}} = \mathbb{E}\left[e^{-\eta \ell_{I,t}}\right]$.
Observe: $\ell(\hat{p}_t, y_t) = \ell(\mathbb{E}[f_{I,t}], y_t)$, while $\mathbb{E}[\ell_{I,t}] = \mathbb{E}[\ell(f_{I,t}, y_t)]$.

ANALYSIS/3
$\frac{W_t}{W_{t-1}} = \mathbb{E}\left[e^{-\eta \ell_{I,t}}\right]$; how does this relate to $\ell(\hat{p}_t, y_t) = \ell(\mathbb{E}[f_{I,t}], y_t)$?
What if $\ell(p, y) = \frac{1}{2}|p - y|$, $p, y \in [0, 1]$? Then $\ell(\cdot, y)$ is convex for any $y$, so
$\mathbb{E}[\ell(f_{I,t}, y_t)] \ge \ell(\mathbb{E}[f_{I,t}], y_t) = \ell(\hat{p}_t, y_t)$ (Jensen's inequality).

ANALYSIS/4
LEMMA (HOEFFDING'S INEQUALITY): let $0 \le X \le 1$. Then for all $s \in \mathbb{R}$, $\mathbb{E}[e^{sX}] \le e^{s\mathbb{E}[X] + s^2/8}$.
Applying the lemma with $s = -\eta$, then the inequality from the previous slide:
$\frac{W_t}{W_{t-1}} = \mathbb{E}\left[e^{-\eta \ell_{I,t}}\right] \le e^{-\eta \mathbb{E}[\ell_{I,t}] + \eta^2/8}$ (Hoeffding's inequality)
$\le e^{-\eta \ell(\hat{p}_t, y_t) + \eta^2/8}$ (Jensen's inequality).

ANALYSIS/5
We have $W_n \ge e^{-\eta L_{i,n}}$ for all $i \in J$, and $\frac{W_t}{W_{t-1}} \le e^{-\eta \ell(\hat{p}_t, y_t) + \eta^2/8}$.
Hence, using $W_0 = N$ ($w_{i,0} = 1$):
$\frac{1}{N} e^{-\eta L_{i,n}} \le \frac{W_n}{W_0} = \frac{W_n}{W_{n-1}} \cdots \frac{W_1}{W_0} \le e^{-\eta \hat{L}_n + n\eta^2/8}$.
THEOREM (LOSS BOUND FOR THE EWA FORECASTER): assume that $\mathcal{D}$ is a convex subset of some vector space and that $\ell : \mathcal{D} \times \mathcal{Y} \to [0, 1]$ is convex in its first argument. Then the EWA forecaster satisfies
$\hat{L}_n \le \min_{i \in J} L_{i,n} + \frac{\ln N}{\eta} + \frac{\eta n}{8}$.
With $\eta = \sqrt{8 \ln N / n}$: $\hat{L}_n \le \min_{i \in J} L_{i,n} + \sqrt{(n/2) \ln N}$.

NOTES
Small losses. Loss bound for WM (0/1 predictions): $\hat{L}_n \le \frac{\log_2(1/\beta)\, L_{i,n} + \log_2 N}{\log_2\left(\frac{2}{1+\beta}\right)}$. If $L_{i,n} = 0$ for some expert, the regret stays finite!
Continuous prediction spaces (EWA): $\hat{L}_n \le \min_{i \in J} L_{i,n} + \sqrt{(n/2) \ln N}$. This bound grows to infinity even if $L_{i,n} = 0$ for some $i$! :-( Can this be improved? If there is a perfect expert, the regret should be finite.
How to select $\eta$ if the horizon $n$ is not given a priori? Would $\eta_t = \sqrt{8 (\ln N)/t}$ work? (Yes.) A cheap solution: the doubling trick.
Related: can the doubling trick be used to improve the bound in the case of small losses? (Yes.)
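The doubling trick mentioned above can be sketched as a schedule of restarts (an illustrative sketch; the function name and interface are my own):

```python
import math

def doubling_trick_schedule(total_rounds, n_experts):
    """Doubling trick for an unknown horizon: run EWA on epochs of
    length 1, 2, 4, 8, ..., tuning eta = sqrt(8 ln N / m) for each
    epoch of length m and restarting the weights at every epoch
    boundary. Returns the (epoch_length, eta) schedule that covers
    at least `total_rounds` rounds."""
    schedule, m, covered = [], 1, 0
    while covered < total_rounds:
        schedule.append((m, math.sqrt(8 * math.log(n_experts) / m)))
        covered += m
        m *= 2  # double the assumed horizon
    return schedule
```

Summing the per-epoch regret bounds $\sqrt{(m/2) \ln N}$ over the epochs yields an overall bound that is only a constant factor worse than the one achievable with a known horizon.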

REFERENCES
Angluin, D. (1988). Queries and concept learning. Machine Learning, 2:319–342.
Barzdin, Y. and Freivalds, R. (1972). On the prediction of general recursive functions. Soviet Mathematics (Doklady), 13:1224–1228.
Littlestone, N. and Warmuth, M. (1994). The weighted majority algorithm. Information and Computation, 108:212–261.