THE WEIGHTED MAJORITY ALGORITHM
Csaba Szepesvári
University of Alberta, CMPUT 654
E-mail: szepesva@ualberta.ca
UofA, October 3, 2006
OUTLINE
1 PREDICTION WITH EXPERT ADVICE
2 HALVING: FIND THE PERFECT EXPERT! (0/1 LOSS)
3 NO PERFECT EXPERT? (0/1 LOSS)
4 PREDICTING CONTINUOUS OUTCOMES
5 BIBLIOGRAPHY
FRAMEWORK
Prediction with Expert Advice
Outcomes: $y_1, y_2, \ldots \in \mathcal{Y}$
Decisions: $\hat{p}_1, \hat{p}_2, \ldots \in \mathcal{D}$
Loss function: $\ell : \mathcal{D} \times \mathcal{Y} \to \mathbb{R}$
Advice of expert $i$: $f_{i1}, f_{i2}, \ldots \in \mathcal{D}$, $i \in J$
(Total) loss of expert $i$: $L_{i,n} = \sum_{t=1}^n \ell(f_{it}, y_t)$
(Total) loss of algorithm: $\hat{L}_n = \sum_{t=1}^n \ell(\hat{p}_t, y_t)$
(Total) regret (excess loss): $R_n = \hat{L}_n - \min_{i \in J} L_{i,n}$
Goal: Design an algorithm that keeps the regret small
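To make the interaction concrete, here is a minimal Python sketch of this protocol. The forecaster interface (predict/update) and modeling experts as callables are illustrative choices, not part of the slides:

```python
# A minimal sketch of the prediction-with-expert-advice protocol.
# The interface names (predict, update) are illustrative assumptions.
def run_protocol(forecaster, experts, outcomes, loss):
    """Play one round per outcome; return algorithm loss, expert losses, regret."""
    alg_loss = 0.0
    expert_loss = [0.0] * len(experts)
    for t, y in enumerate(outcomes):
        advice = [expert(t) for expert in experts]  # f_{1t}, ..., f_{Nt}
        p = forecaster.predict(advice)              # decision \hat{p}_t
        alg_loss += loss(p, y)                      # accumulates \hat{L}_n
        for i, f in enumerate(advice):
            expert_loss[i] += loss(f, y)            # accumulates L_{i,n}
        forecaster.update(advice, y)                # outcome y_t revealed
    return alg_loss, expert_loss, alg_loss - min(expert_loss)  # R_n
```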
A PERFECT WORLD
$y_t \in \{0, 1\}$, $\hat{p}_t \in \{0, 1\}$ ($\mathcal{Y} = \mathcal{D} = \{0, 1\}$)
Loss: $\ell(p, y) = \mathbb{I}\{p \neq y\}$ (0/1, binary or classification loss)
$N$ experts ($J = \{1, \ldots, N\}$)
Expert predictions: $f_{i1}, f_{i2}, \ldots \in \{0, 1\}$
Assumption: There is an expert that never makes a mistake.
How to keep the regret small?
HALVING ALGORITHM
Keep regret small: find the perfect expert quickly, i.e., with few mistakes.
Idea:
Eliminate experts immediately once they make a mistake
Take the majority vote of the remaining experts
Halving Algorithm [Barzdin and Freivalds, 1972, Angluin, 1988]
Claim: Whenever the algorithm makes a mistake, at least half of the remaining experts are eliminated!
There is a perfect expert, hence we cannot halve more than $\log_2 N$ times!
Theorem: The regret never grows above $\log_2 N$ (finite!)
Holds for any sequence $y_1, y_2, \ldots$!
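A minimal Python sketch of Halving under the perfect-expert assumption (function and variable names are illustrative):

```python
# A sketch of the Halving Algorithm for 0/1 predictions.
# Assumes at least one expert never errs; advice and outcomes are 0/1.
def halving(advice_rounds, outcomes):
    """advice_rounds[t][i] is expert i's prediction at round t."""
    n_experts = len(advice_rounds[0])
    alive = set(range(n_experts))      # the experts not yet eliminated
    mistakes = 0
    for advice, y in zip(advice_rounds, outcomes):
        votes_for_1 = sum(advice[i] for i in alive)
        p = 1 if 2 * votes_for_1 > len(alive) else 0  # majority (ties -> 0)
        if p != y:
            mistakes += 1              # >= half of alive were wrong here
        alive = {i for i in alive if advice[i] == y}  # drop erring experts
    return mistakes
```

On any sequence satisfying the assumption, `mistakes` never exceeds $\log_2 N$, matching the theorem.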
FORMAL ANALYSIS
Weights $w_{it} \in \{0, 1\}$: Is expert $i$ alive at time $t$ (after $y_t$ is received)?
$w_{i0} = 1$. $W_t = \sum_{i=1}^N w_{it}$: Number of alive experts at time $t$
$\hat{L}_t$: number of mistakes up to time $t$ (including time $t$)
Claim: If the algorithm makes a mistake ($\ell(\hat{p}_t, y_t) = 1$) then $W_t \le W_{t-1}/2$. Also: $W_t$ never grows.
Hence $W_t \le W_0 / 2^{\hat{L}_t} = N / 2^{\hat{L}_t}$.
Lower bound: $1 \le W_t$ (the perfect expert is always alive).
Putting together: $1 \le N / 2^{\hat{L}_t}$, hence $\hat{L}_t \le \log_2 N$.
NO PERFECT EXPERT: WEIGHTED MAJORITY
Elimination is too strong if there is no perfect expert! Keep the weights positive!
Have the weights of experts making a mistake decay: $w_{it} = \beta w_{i,t-1}$ if $f_{it} \neq y_t$ ($0 < \beta < 1$)
Keep the majority vote:
$\hat{p}_t = \mathbb{I}\left\{ \sum_i w_{i,t-1} \mathbb{I}\{f_{it}=0\} < \sum_i w_{i,t-1} \mathbb{I}\{f_{it}=1\} \right\}
= \mathbb{I}\left\{ \sum_i w_{i,t-1}(1 - f_{it}) < \sum_i w_{i,t-1} f_{it} \right\}
= \mathbb{I}\left\{ \sum_i w_{i,t-1} < 2 \sum_i w_{i,t-1} f_{it} \right\}
= \mathbb{I}\left\{ \frac{\sum_i w_{i,t-1} f_{it}}{\sum_i w_{i,t-1}} > \frac{1}{2} \right\}$
Weighted Majority [Littlestone and Warmuth, 1994]
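A minimal Python sketch of Weighted Majority (names and the default value of $\beta$ are illustrative):

```python
# A sketch of the Weighted Majority algorithm of Littlestone & Warmuth.
# beta in (0, 1) is the decay factor; advice and outcomes are 0/1.
def weighted_majority(advice_rounds, outcomes, beta=0.5):
    n_experts = len(advice_rounds[0])
    w = [1.0] * n_experts                  # w_{i0} = 1
    mistakes = 0
    for advice, y in zip(advice_rounds, outcomes):
        weight_on_1 = sum(wi for wi, f in zip(w, advice) if f == 1)
        p = 1 if weight_on_1 / sum(w) > 0.5 else 0  # weighted majority vote
        if p != y:
            mistakes += 1
        w = [wi * beta if f != y else wi   # decay the erring experts
             for wi, f in zip(w, advice)]
    return mistakes
```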
WEIGHTED MAJORITY: ANALYSIS/1
Notation: $J_{t,\text{bad}} = \{i : f_{it} \neq y_t\}$, $J_{t,\text{good}} = \{i : f_{it} = y_t\}$, $W_{t,J} = \sum_{i \in J} w_{it}$
$W_t = W_{t-1, J_{t,\text{good}}} + \beta W_{t-1, J_{t,\text{bad}}}$
Claim: $W_t \le W_{t-1}$, and if $\hat{p}_t \neq y_t$ then $W_t \le \frac{1+\beta}{2} W_{t-1}$
Proof: $W_t = W_{t-1, J_{t,\text{good}}} + \beta W_{t-1, J_{t,\text{bad}}}$. Since $\beta < 1$, $W_t \le W_{t-1}$.
Assume $\hat{p}_t \neq y_t$. Then $W_{t-1, J_{t,\text{good}}} \le W_{t-1}/2$ (majority vote).
$W_t = W_{t-1, J_{t,\text{good}}} + \beta (W_{t-1} - W_{t-1, J_{t,\text{good}}})
= (1 - \beta) W_{t-1, J_{t,\text{good}}} + \beta W_{t-1}
\le (1 - \beta) W_{t-1}/2 + \beta W_{t-1}
= \frac{1+\beta}{2} W_{t-1}$
WEIGHTED MAJORITY: ANALYSIS/2
CLAIM: $W_t \le W_{t-1}$, and if $\hat{p}_t \neq y_t$ then $W_t \le \frac{1+\beta}{2} W_{t-1}$
Lower bound: For any $i$, $\beta^{L_{it}} = w_{it} \le W_t$.
Putting together: $\beta^{L_{it}} \le W_t \le \left(\frac{1+\beta}{2}\right)^{\hat{L}_t} W_0$.
Take logs, reorder: $\hat{L}_t \le \frac{\log_2(1/\beta) \, L_{it} + \log_2 N}{\log_2\left(\frac{2}{1+\beta}\right)}$.
PREDICTING CONTINUOUS OUTCOMES
What if $\mathcal{Y} = \mathcal{D} = [0, 1]$ or $\mathbb{R}^d$?
More generally: let $\mathcal{Y} = \mathcal{D}$ be convex subsets of some vector space ($\lambda_1 y_1 + \lambda_2 y_2 \in \mathcal{Y}$ whenever $\lambda_1, \lambda_2 \ge 0$, $\lambda_1 + \lambda_2 = 1$, $y_1, y_2 \in \mathcal{Y}$).
Loss: $\ell : \mathcal{D} \times \mathcal{Y} \to [0, 1]$ (bounded)
Example: $\mathcal{D} = \mathcal{Y} = [0, 1]$, $\ell(p, y) = \frac{1}{2}|p - y|$.
Can we generalize the previous idea? Combine the advice of the experts!
$\hat{p}_t = \frac{\sum_{i=1}^N w_{i,t-1} f_{it}}{\sum_{i=1}^N w_{i,t-1}}$
How to set the weights? Let them decay exponentially as a function of the losses:
$w_{i,t} = w_{i,t-1} e^{-\eta \ell(f_{it}, y_t)}$.
PREDICTING CONTINUOUS OUTCOMES/2
For numerical stability we might want to normalize the weights:
$w_{i,t} = \frac{w_{i,t-1} e^{-\eta \ell(f_{it}, y_t)}}{\sum_{j=1}^N w_{j,t-1} e^{-\eta \ell(f_{jt}, y_t)}}$.
Note: resembles Bayes updates! For the analysis we do not normalize.
Analysis? The plan:
Lower bound the sum of weights using the individual total losses of the experts
Upper bound the sum of weights in terms of the total loss of the algorithm
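A minimal Python sketch of this exponentially weighted average (EWA) forecaster using the normalized update above; the signature is illustrative, and starting from uniform weights $1/N$ yields the same predictions as $w_{i0} = 1$:

```python
# A sketch of the EWA forecaster for convex D = Y (here, e.g., [0, 1]).
import math

def ewa(advice_rounds, outcomes, loss, eta):
    """advice_rounds[t][i] = f_{it}; returns the total loss \\hat{L}_n."""
    n_experts = len(advice_rounds[0])
    w = [1.0 / n_experts] * n_experts        # uniform initial weights
    total_loss = 0.0
    for advice, y in zip(advice_rounds, outcomes):
        # weighted-average prediction \hat{p}_t (weights sum to 1)
        p = sum(wi * f for wi, f in zip(w, advice))
        total_loss += loss(p, y)
        # exponential decay, then normalize for numerical stability
        new_w = [wi * math.exp(-eta * loss(f, y)) for wi, f in zip(w, advice)]
        z = sum(new_w)
        w = [wi / z for wi in new_w]
    return total_loss

# Example loss from the slides: lambda p, y: 0.5 * abs(p - y)
```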
ANALYSIS/1
Lower bound: $W_n = \sum_{i=1}^N w_{in} = \sum_{i=1}^N e^{-\eta L_{in}} \ge e^{-\eta L_{in}}$ for any $i$.
Upper bound: Bound $W_t / W_{t-1}$ in terms of $\ell(\hat{p}_t, y_t)$! ($W_t \le \text{const} \cdot W_{t-1}$, $\text{const} = ?$)
$\frac{W_t}{W_{t-1}} = \sum_i e^{-\eta \ell_{it}} \frac{w_{i,t-1}}{W_{t-1}} = \sum_i \hat{w}_{i,t-1} e^{-\eta \ell_{it}}$
where $\ell_{it} \stackrel{\text{def}}{=} \ell(f_{it}, y_t)$ and $\hat{w}_{i,t-1} \stackrel{\text{def}}{=} w_{i,t-1} / W_{t-1}$.
ANALYSIS/2
$\frac{W_t}{W_{t-1}} = \sum_i \hat{w}_{i,t-1} e^{-\eta \ell_{it}}$
Looks like an expectation! Let $P(I = i) = \hat{w}_{i,t-1}$, $I \in J$.
Then $\frac{W_t}{W_{t-1}} = E\left[e^{-\eta \ell_{I,t}}\right]$.
Observe: $\ell(\hat{p}_t, y_t) = \ell(E[f_{I,t}], y_t)$ and $E[\ell_{I,t}] = E[\ell(f_{I,t}, y_t)]$.
ANALYSIS/3
$\frac{W_t}{W_{t-1}} = E\left[e^{-\eta \ell_{I,t}}\right]$, and we want to relate this to $\ell(\hat{p}_t, y_t)$.
Recall: $\ell(\hat{p}_t, y_t) = \ell(E[f_{I,t}], y_t)$ and $E[\ell_{I,t}] = E[\ell(f_{I,t}, y_t)]$.
What if $\ell(p, y) = \frac{1}{2}|p - y|$, $p, y \in [0, 1]$?
$\ell(\cdot, y)$ is convex for any $y$, so
$E[\ell(f_{I,t}, y_t)] \ge \ell(E[f_{I,t}], y_t) = \ell(\hat{p}_t, y_t)$ (Jensen's inequality)
ANALYSIS/4
So far: $\frac{W_t}{W_{t-1}} = E\left[e^{-\eta \ell_{I,t}}\right]$ and $E[\ell(f_{I,t}, y_t)] \ge \ell(E[f_{I,t}], y_t) = \ell(\hat{p}_t, y_t)$.
LEMMA (HOEFFDING'S INEQUALITY)
Let $0 \le X \le 1$. Then for all $s \in \mathbb{R}$, $E[e^{sX}] \le e^{s E[X] + s^2/8}$.
Hence
$\frac{W_t}{W_{t-1}} = E\left[e^{-\eta \ell_{I,t}}\right] \le e^{-\eta E[\ell_{I,t}] + \eta^2/8}$ (Hoeffding's inequality with $s = -\eta$)
$\le e^{-\eta \ell(\hat{p}_t, y_t) + \eta^2/8}$ (Jensen's inequality from the previous slide)
ANALYSIS/5
$W_n \ge e^{-\eta L_{in}}$ for all $i \in J$, and $\frac{W_t}{W_{t-1}} \le e^{-\eta \ell(\hat{p}_t, y_t) + \frac{\eta^2}{8}}$.
Hence, using $W_0 = N$ ($w_{i0} = 1$),
$\frac{1}{N} e^{-\eta L_{in}} \le \frac{W_n}{W_0} = \frac{W_n}{W_{n-1}} \cdots \frac{W_1}{W_0} \le e^{-\eta \hat{L}_n + \frac{\eta^2 n}{8}}$
THEOREM (LOSS BOUND FOR THE EWA FORECASTER)
Assume that $\mathcal{D}$ is a convex subset of some vector space. Let $\ell : \mathcal{D} \times \mathcal{Y} \to [0, 1]$ be convex in its first argument. Then, for the EWA forecaster it holds that
$\hat{L}_n \le \min_{i \in J} L_{in} + \frac{\ln N}{\eta} + \frac{\eta n}{8}$.
With $\eta = \sqrt{\frac{8 \ln N}{n}}$,
$\hat{L}_n \le \min_{i \in J} L_{in} + \sqrt{(n/2) \ln N}$.
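As a quick sanity check of the theorem, one can plug in concrete values for $N$ and $n$ (the numbers below are made up for illustration):

```python
# Numeric check of the EWA bound; N and n are illustrative values.
import math

N, n = 100, 10_000
eta = math.sqrt(8 * math.log(N) / n)           # the tuning from the theorem
regret_bound = math.sqrt(n / 2 * math.log(N))  # sqrt((n/2) ln N)
print(f"eta = {eta:.4f}, regret <= {regret_bound:.1f} over n = {n} rounds")
# Per-round excess loss is at most regret_bound / n, vanishing as n grows.
```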
NOTES
Small losses. The loss bound for WM (0/1 predictions):
$\hat{L}_n \le \frac{\log_2(1/\beta) \, L_{in} + \log_2 N}{\log_2\left(\frac{2}{1+\beta}\right)}$.
If $L_{in} = 0$ for some expert, the regret stays finite!
Continuous prediction spaces (EWA): $\hat{L}_n \le \min_{i \in J} L_{in} + \sqrt{(n/2) \ln N}$.
This bound grows to infinity even if $L_{in} = 0$ for some $i$! :-(
Can this be improved? If there is a perfect expert, the regret should be finite!
How to select $\eta$ if the horizon $n$ is not given a priori? Would $\eta_t = \sqrt{8 (\ln N)/t}$ work? (yes)
Cheap solution: the doubling trick (a sketch follows below)
Related: Can the doubling trick be used to improve the bound in case of small losses? (yes)
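A sketch of the doubling trick: restart the forecaster in epochs of doubling length, re-tuning $\eta$ for each epoch's horizon. This reuses the hypothetical ewa() sketch from earlier and shows one standard form of the trick, not necessarily the exact variant intended here:

```python
# A sketch of the doubling trick for an unknown horizon n.
import math

def ewa_doubling(advice_rounds, outcomes, loss, n_experts):
    total_loss, t, epoch = 0.0, 0, 0
    while t < len(outcomes):
        length = 2 ** epoch                 # this epoch's assumed horizon
        eta = math.sqrt(8 * math.log(n_experts) / length)
        total_loss += ewa(advice_rounds[t:t + length],
                          outcomes[t:t + length], loss, eta)
        t += length                         # restart with doubled horizon
        epoch += 1
    return total_loss
```

The restarts only inflate the regret bound by a constant factor, since the per-epoch bounds form a geometric series.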
REFERENCES
Angluin, D. (1988). Queries and concept learning. Machine Learning, 2:319–342.
Barzdin, Y. and Freivalds, R. (1972). On the prediction of general recursive functions. Soviet Mathematics (Doklady), 13:1224–1228.
Littlestone, N. and Warmuth, M. (1994). The weighted majority algorithm. Information and Computation, 108:212–261.