Online Learning with Experts & Multiplicative Weights Algorithms
1 Online Learning with Experts & Multiplicative Weights Algorithms. CS 159, lecture #2. Stephan Zheng, April 1, 2016, Caltech.
2 Table of contents. 1. Online Learning with Experts (with a perfect expert; without perfect experts). 2. Multiplicative Weights algorithms (regret of Multiplicative Weights). 3. Learning with Experts revisited (continuous vector predictions; subtlety in the discrete case).
3 Today. What should you take away from this lecture? How should you base your prediction on expert predictions? What are the characteristics of multiplicative weights algorithms?
4 References. Online Learning [Ch. 2-4], Gabor Bartok, David Pal, Csaba Szepesvari, and Istvan Szita. The Multiplicative Weights Update Method: A Meta-Algorithm and Applications, Sanjeev Arora, Elad Hazan, and Satyen Kale. Theory of Computing, 8:121-164, 2012.
5 Online Learning with Experts
6 Leo at the Oscars: an online binary prediction problem. Will Leo win an Oscar this year? (A running question since ...)
7 Leo at the Oscars: an online binary prediction problem. ... 2005: No. 2006: No. 2007: No. 2008: No. 2009: No. 2010: No.
8 Leo at the Oscars: an online binary prediction problem. 2011: No. 2012: No. 2013: No. 2014: No. 2015: No. 2016: Yes!
9 Leo at the Oscars: an online binary prediction problem. A frivolous extrapolation: 2016: Yes! 2017: No. 2018: Yes! 2019: No. 2020: Yes. 2021: Yes.
10 Leo at the Oscars: an online binary prediction problem. Imagine you're a showbiz fan and want to predict the answer every year t. But you don't know all the ins and outs of Hollywood. Instead, you are lucky enough to have access to experts $\{i\}_{i \in I}$, each of which makes a (possibly wrong!) prediction $p_{i,t}$. How do you use the experts for your own prediction? How do you incorporate the feedback from Nature? How do we achieve sublinear regret?
11 Formal description
12 Online Learning with Experts. Formal setup: there are T rounds in which we predict a binary label $\hat p_t \in Y = \{0, 1\}$. At every timestep t: 1. Each expert $i = 1 \dots n$ predicts $p_{i,t} \in Y$. 2. You make a prediction $\hat p_t \in Y$. 3. Nature reveals $y_t \in Y$. 4. We suffer a loss $\ell(\hat p_t, y_t) = 1[\hat p_t \neq y_t]$. 5. Each expert suffers a loss $\ell(p_{i,t}, y_t) = 1[p_{i,t} \neq y_t]$. Loss: $\hat L_T = \sum_{t=1}^T \ell(\hat p_t, y_t)$, $L_{i,T} = \sum_{t=1}^T \ell(p_{i,t}, y_t)$ (= # of mistakes made). Regret: $R_T = \hat L_T - \min_i L_{i,T}$. How do you decide what $\hat p_t$ to predict? How do you incorporate the feedback from Nature? How do we achieve sublinear regret?
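The protocol above can be sketched as a small Python harness (an illustration, not the lecture's code): `learner` is any rule mapping the experts' predictions to $\hat p_t$, and the unweighted majority-vote rule below is a hypothetical placeholder strategy.

```python
# Toy harness for the experts protocol: play T rounds, track the learner's
# mistakes L_hat and each expert's mistakes L_i, and return the regret
# R_T = L_hat - min_i L_i.
def run_protocol(expert_preds, outcomes, learner):
    """expert_preds[t][i] = p_{i,t} in {0,1}; outcomes[t] = y_t."""
    n = len(expert_preds[0])
    L_hat, L = 0, [0] * n
    for preds, y in zip(expert_preds, outcomes):
        p_hat = learner(preds)           # the learner's prediction \hat p_t
        L_hat += int(p_hat != y)         # 0/1 loss for the learner
        for i in range(n):
            L[i] += int(preds[i] != y)   # 0/1 loss for expert i
    return L_hat - min(L)                # regret R_T


# A simple (unweighted) majority-vote learner as a placeholder.
def majority(preds):
    return int(sum(preds) > len(preds) / 2)
```

With three experts and two rounds, `run_protocol([[1, 0, 0], [1, 1, 0]], [1, 1], majority)` returns the regret against the best expert in hindsight.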
13 Online Learning with Experts. Experts are a way to abstract a hypothesis class. For the most part, we'll deal with a finite, discrete number of experts, because that is easier to analyze. In general, there can be a continuous space of experts, which corresponds to using a standard hypothesis class. Boosting uses a collection of n weak classifiers as experts. At every time t it adds one weak classifier with weight $\alpha_t$ to the ensemble classifier $h_{1:t}$. The prediction $\hat p_t$ is based on the ensemble $h_{1:t}$ that we have collected so far.
14 Weighted Majority: the halving algorithm. Let's assume that there is a perfect expert $i^*$ that is guaranteed to know the right answer (say, a mind-reader that can read the minds of the voting Academy members). That is, $\forall t: p_{i^*,t} = y_t$ and $\ell_{i^*,t} = 0$. To keep regret (= # mistakes) small, find the perfect expert quickly, with few mistakes: eliminate experts that make a mistake, and take a majority vote of the alive experts. Let's define some groups of experts. The alive set: $E_t = \{i : i \text{ did not make a mistake until time } t\}$, $E_0 = \{1, \dots, n\}$. The nay-sayers: $E^0_{t-1} = \{i \in E_{t-1} : p_{i,t} = 0\}$. The yay-sayers: $E^1_{t-1} = \{i \in E_{t-1} : p_{i,t} = 1\}$.
15 Weighted Majority: the halving algorithm. (Worked example: a table of rounds t with the incurred loss $\hat\ell_t$, Nature's answer $y_t$, the prediction $\hat p_t$, and the expert predictions $p_{1,t}, \dots, p_{12,t}$.)
16 Weighted Majority: the halving algorithm.
The halving algorithm:
1: for t = 1 ... T do
2:   Receive expert predictions $(p_{1,t}, \dots, p_{n,t})$
3:   Split $E_{t-1}$ into $E^0_{t-1} \cup E^1_{t-1}$
4:   if $|E^1_{t-1}| > |E^0_{t-1}|$ then (follow the majority)
5:     Predict $\hat p_t = 1$
6:   else
7:     Predict $\hat p_t = 0$
8:   end if
9:   Receive Nature's answer $y_t$ (and incur loss $\ell(\hat p_t, y_t)$)
10:  Update $E_t$ with experts that continue to be right
11: end for
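As a sanity check, the pseudocode above can be sketched in a few lines of Python (an illustration, not the lecture's code). With a perfect expert present, the mistake count should never exceed $\log_2 n$.

```python
# Sketch of the halving algorithm: predict with the majority of the alive
# set E_{t-1}, then eliminate every expert that was wrong this round.
def halving(expert_preds, outcomes):
    n = len(expert_preds[0])
    alive = set(range(n))                            # E_0 = {1, ..., n}
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        yay = [i for i in alive if preds[i] == 1]    # E^1_{t-1}
        p_hat = int(len(yay) > len(alive) / 2)       # majority vote
        mistakes += int(p_hat != y)
        alive = {i for i in alive if preds[i] == y}  # keep correct experts
    return mistakes
```

For example, with four experts of which expert 0 is perfect, `halving([[1, 0, 0, 0], [1, 0, 0, 0]], [1, 1])` makes one mistake in the first round, eliminates the three wrong experts, and is correct from then on: at most $\log_2 4 = 2$ mistakes, as the bound promises.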
17 Weighted Majority: the halving algorithm. Notice that if halving makes a mistake, then at least half of the experts in $E_{t-1}$ were wrong: $W_t = |E_t| = |E^{y_t}_{t-1}| \le |E_{t-1}|/2$. Since there is always a perfect expert, $W_t \ge 1$, so the algorithm makes no more than $\log_2 |E_0| = \log_2 n$ mistakes: sublinear regret. Qualitatively speaking: $W_t$ decreases multiplicatively when halving makes a mistake. If $W_t$ can't shrink too much, then halving can't make too many mistakes. There is a lower bound on $W_t$ for all t, since there is a perfect expert, so $W_t$ can't shrink too much starting from its initial value. We'll see similar behavior later.
18 Weighted Majority. So what should we do if we don't know anything about the experts? The majority vote might always be wrong! Give each expert a weight $w_{i,t}$, initialized as $w_{i,1} = 1$. Decide based on a weighted sum of experts: $\hat p_t = 1[\text{weighted sum of yay-sayers} > \text{weighted sum of nay-sayers}]$, i.e. $\hat p_t = 1\left[\sum_{i=1}^n w_{i,t-1}\, p_{i,t} > \sum_{i=1}^n w_{i,t-1} (1 - p_{i,t})\right]$.
19 Weighted Majority. Recall: with a perfect expert we had the halving algorithm (previous slides): predict with the majority of the alive set $E_{t-1}$ and eliminate every expert that makes a mistake.
20 Weighted Majority. Without knowledge about the experts: choose a decay factor $\beta \in [0, 1)$.
The Weighted Majority algorithm:
1: Initialize $w_{i,1} = 1$, $W_1 = n$
2: for t = 1 ... T do
3:   Receive expert predictions $(p_{1,t}, \dots, p_{n,t})$
4:   if $\sum_{i=1}^n w_{i,t-1}\, p_{i,t} > \sum_{i=1}^n w_{i,t-1} (1 - p_{i,t})$ then
5:     Predict $\hat p_t = 1$
6:   else
7:     Predict $\hat p_t = 0$
8:   end if
9:   Receive Nature's answer $y_t$ (and incur loss $\ell(\hat p_t, y_t)$)
10:  $w_{i,t} \leftarrow w_{i,t-1}\, \beta^{1[p_{i,t} \neq y_t]}$
11: end for
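The algorithm above can be sketched in Python (an illustrative sketch; the function and variable names are my own): weights start at 1, wrong experts are decayed by $\beta$, and the prediction compares the weighted yay and nay totals.

```python
# Sketch of the Weighted Majority algorithm with decay factor beta in [0, 1).
def weighted_majority(expert_preds, outcomes, beta=0.5):
    n = len(expert_preds[0])
    w = [1.0] * n                        # w_{i,1} = 1
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        yay = sum(w[i] for i in range(n) if preds[i] == 1)
        nay = sum(w[i] for i in range(n) if preds[i] == 0)
        p_hat = int(yay > nay)           # weighted vote
        mistakes += int(p_hat != y)
        for i in range(n):
            if preds[i] != y:            # w_{i,t} <- w_{i,t-1} * beta^{1[p_{i,t} != y_t]}
                w[i] *= beta
    return mistakes
```

For instance, with one always-right and one always-wrong expert, the first (tied) round can go wrong, but after a single decay step the good expert dominates the vote.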
21 Weighted Majority. Halving is equivalent to choosing $\beta = 0$: then $w_{i,t} = 1$ if i is alive, and 0 otherwise. What happens as $\beta \to 1$? Faulty experts are not punished that much.
22 Weighted Majority. Define the potential $W_t = \sum_{i=1}^n w_{i,t}$ (the weighted size of the set of experts). Theorem 1: $W_t \le W_{t-1}$, and if $\hat p_t \neq y_t$ then $W_t \le \frac{1+\beta}{2} W_{t-1}$. Compare with before: $W_t$ decreases multiplicatively when the algorithm makes a mistake. Is there always a non-zero lower bound on $W_t$ for all t? Yes, with finite T. No, with infinite time: all weights could decay towards 0. Note that we didn't normalize the $w_{i,t}$; we'll fix this in the next part. What about the regret behavior? Is it sublinear? We'll leave this for now.
23 Multiplicative Weights algorithms
24 Multiplicative Weights. High-level intuition: a more general case: choose from n options; stochastic strategies are allowed; experts are not always present.
25 Multiplicative Weights. Setup: at each timestep t = 1 ... T: 1. You want to take a decision $d \in D = \{1, \dots, n\}$. 2. You choose a distribution $p_t$ over D and sample randomly from it. 3. Nature reveals the cost vector $c_t$, where each cost $c_{i,t}$ is bounded in $[-1, 1]$. 4. The expected cost is then $\mathbb{E}_{i \sim p_t}[c_{i,t}] = c_t \cdot p_t$. Goal: minimize regret with respect to the best decision in hindsight, $\min_i \sum_{t=1}^T c_{i,t}$. In halving, decision = choose expert, cost = mistake, and $p_t$ was concentrated on the majority of alive experts (note that $p_t$ is a distribution here, not a prediction).
26 The Multiplicative Weights algorithm. Comes in many forms!
The multiplicative weights algorithm:
1: Fix $\eta \le \frac{1}{2}$
2: Initialize $w_{i,1} = 1$
3: Initialize $W_1 = n$
4: for t = 1 ... T do
5:   $W_t = \sum_i w_{i,t}$
6:   Choose a decision $i \sim p_t$, where $p_{i,t} = \frac{w_{i,t}}{W_t}$
7:   Nature reveals $c_t$
8:   $w_{i,t+1} \leftarrow w_{i,t} (1 - \eta c_{i,t})$
9: end for
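The update above can be sketched in Python (an illustrative sketch, not the lecture's code; the cost sequence in the usage example is made up): the distribution is the normalized weight vector, a decision is sampled from it, and each weight is scaled by $(1 - \eta c_{i,t})$.

```python
import random

# Sketch of Multiplicative Weights over n decisions with costs in [-1, 1].
# Returns the total expected cost sum_t c_t . p_t.
def multiplicative_weights(costs, eta=0.5, seed=0):
    """costs[t][i] = c_{i,t}."""
    rng = random.Random(seed)
    n = len(costs[0])
    w = [1.0] * n                                           # w_{i,1} = 1
    total = 0.0
    for c in costs:
        W = sum(w)
        p = [wi / W for wi in w]                            # p_t = w_t / W_t
        decision = rng.choices(range(n), weights=p)[0]      # sample i ~ p_t
        total += sum(ci * pi for ci, pi in zip(c, p))       # expected cost c_t . p_t
        w = [wi * (1 - eta * ci) for wi, ci in zip(w, c)]   # multiplicative update
    return total
```

On the toy cost sequence `[[1, 0], [1, 0], [1, 0]]` with $\eta = 1/2$, the expected cost is $1/2 + 1/3 + 1/5 \approx 1.03$, comfortably below the theorem's bound of $\log(2)/\eta \approx 1.39$ for the zero-cost decision.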
27 Regret of Multiplicative Weights. Theorem (regret of multiplicative weights): assume each cost $c_{i,t}$ is bounded in $[-1, 1]$ and $\eta \le \frac{1}{2}$. Then after T rounds, for any decision i: $\sum_{t=1}^T c_t \cdot p_t \le \sum_{t=1}^T c_{i,t} + \eta \sum_{t=1}^T |c_{i,t}| + \frac{\log n}{\eta}$.
28 Regret of Multiplicative Weights. In the bound $\sum_{t=1}^T c_t \cdot p_t \le \sum_{t=1}^T c_{i,t} + \eta \sum_{t=1}^T |c_{i,t}| + \frac{\log n}{\eta}$: the first term is the cost of the fixed decision i; the second term relates to how quickly the MWA updates itself after observing the costs at time t; the third term measures how conservative the MWA is (if $\eta$ is too small, the algorithm updates too cautiously and this term blows up).
29 Regret of Multiplicative Weights. About the bound $\sum_{t=1}^T c_t \cdot p_t \le \sum_{t=1}^T c_{i,t} + \eta \sum_{t=1}^T |c_{i,t}| + \frac{\log n}{\eta}$: $\eta \le \frac{1}{2}$ is needed to make some inequalities work. Generality: no assumptions are made about the sequence of events (they may be correlated, and the costs may be adversarial)! The bound holds for all i, so taking a min over decisions, we see: $R_T \le \frac{\log n}{\eta} + \eta \min_i \sum_{t=1}^T |c_{i,t}|$.
30 Regret of Multiplicative Weights. What is the regret behavior? $R_T \le \frac{\log n}{\eta} + \eta \min_i \sum_{t=1}^T |c_{i,t}|$. Since the sum is O(T), the regret is sublinear with the appropriate choice $\eta = \sqrt{\frac{\log n}{T}}$. Then $R_T \le O\left(\sqrt{T \log n}\right)$. What is the issue with this? We need to know the horizon T up front! If T is unknown, use $\eta_t = \min\left\{\sqrt{\frac{\log n}{t}}, \frac{1}{2}\right\}$ or the doubling trick.
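The doubling trick mentioned above can be sketched as follows (my own hypothetical helper, not from the lecture): run the algorithm in phases k = 0, 1, 2, ..., where phase k lasts $2^k$ rounds and restarts with the rate tuned for that known phase length, $\eta_k = \min\{\sqrt{\log n / 2^k},\, 1/2\}$.

```python
import math

# Sketch of a doubling-trick schedule: phase k lasts 2^k rounds and uses
# the learning rate tuned for that (known) phase length.
def doubling_schedule(T, n):
    """Return (phase_length, eta_k) pairs covering at least T rounds."""
    schedule, covered, k = [], 0, 0
    while covered < T:
        length = 2 ** k
        eta = min(math.sqrt(math.log(n) / length), 0.5)  # eta_k, clipped at 1/2
        schedule.append((length, eta))
        covered += length
        k += 1
    return schedule
```

For T = 10 and n = 4 this yields phases of length 1, 2, 4, 8 (covering 15 >= 10 rounds) with non-increasing rates; summing the per-phase regret bounds recovers the $O(\sqrt{T \log n})$ rate up to a constant.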
31 Regret of Multiplicative Weights. What if we don't choose $\eta \propto \frac{1}{\sqrt{T}}$ in $R_T \le \frac{\log n}{\eta} + \eta \min_i \sum_t |c_{i,t}|$? What if $\eta$ is constant w.r.t. T? What if $\eta < \frac{1}{\sqrt{T}}$? There is a fundamental tension between fitting to the experts and learning from Nature. Optimality of MW: in the online setting you can't do better: the regret is lower bounded as $R_T \ge \Omega\left(\sqrt{T \log n}\right)$. See Theorem 4.1 in Arora et al.
32 Regret of Multiplicative Weights. $R_T \le \frac{\log n}{\eta} + \eta \min_i \sum_t |c_{i,t}|$. Why can't you achieve O(log T) regret as with FTL in this case? Recall that for Follow The Leader with strongly convex loss, the difference between two consecutive decisions scaled as $\|p_t - p_{t-1}\| = O(\frac{1}{t})$. In MW, the decision-differential scales with $\|p_t - p_{t-1}\| = O(\eta)$, which is $O(\frac{1}{\sqrt{t}})$, so the MW algorithm needs to be prepared to make much bigger changes than FTL over time.
33 Proof of MW. The steps in the proof are: 1. Upper bound $W_t$ in terms of cumulative decay factors. 2. Lower bound $W_t$ using convexity arguments. 3. Combine the upper and lower bounds on $W_t$ to get the answer. Let's go through the proof carefully.
34 Proof of MW: getting the upper bound on $W_t$.
$W_{t+1} = \sum_i w_{i,t+1} = \sum_i w_{i,t}(1 - \eta c_{i,t}) = W_t - \eta W_t \sum_i c_{i,t}\, p_{i,t}$ (using $w_{i,t} = W_t\, p_{i,t}$) $= W_t (1 - \eta\, c_t \cdot p_t) \le W_t \exp(-\eta\, c_t \cdot p_t)$ (convexity: $\forall x \in \mathbb{R}: 1 - x \le e^{-x}$).
Recursively using this inequality: $W_{T+1} \le W_1 \exp\left(-\eta \sum_{t=1}^T c_t \cdot p_t\right) = n \exp\left(-\eta \sum_{t=1}^T c_t \cdot p_t\right)$.
35 Proof of MW: getting the lower bound on $W_t$.
$W_{T+1} \ge w_{i,T+1} = \prod_{t \le T}(1 - \eta c_{i,t})$ (multiplicative updates and $w_{i,1} = 1$) $= \prod_{t: c_{i,t} \ge 0}(1 - \eta c_{i,t}) \prod_{t: c_{i,t} < 0}(1 - \eta c_{i,t})$ (split into positive and negative costs).
Now we use the facts that $\forall x \in [0, 1]: (1 - \eta)^x \le 1 - \eta x$ and $\forall x \in [-1, 0]: (1 + \eta)^{-x} \le 1 - \eta x$. Since $c_{i,t} \in [-1, 1]$, we can apply this to each factor in the product above:
$W_{T+1} \ge (1 - \eta)^{\sum_{t: c_{i,t} \ge 0} c_{i,t}}\, (1 + \eta)^{-\sum_{t: c_{i,t} < 0} c_{i,t}}$.
This is the lower bound we want.
36 Proof of MW: combine the upper and lower bounds. We get:
$(1 - \eta)^{\sum_{t: c_{i,t} \ge 0} c_{i,t}}\, (1 + \eta)^{-\sum_{t: c_{i,t} < 0} c_{i,t}} \le W_{T+1} \le n \exp\left(-\eta \sum_t c_t \cdot p_t\right)$.
Take logs:
$\sum_{t: c_{i,t} \ge 0} c_{i,t} \log(1 - \eta) - \sum_{t: c_{i,t} < 0} c_{i,t} \log(1 + \eta) \le \log n - \eta \sum_t c_t \cdot p_t$.
37 Proof of MW: combine the upper and lower bounds. Now we'll massage this into the form we want:
$\eta \sum_t c_t \cdot p_t \le \log n - \sum_{t: c_{i,t} \ge 0} c_{i,t} \log(1 - \eta) + \sum_{t: c_{i,t} < 0} c_{i,t} \log(1 + \eta) = \log n + \sum_{t: c_{i,t} \ge 0} c_{i,t} \log\left(\frac{1}{1 - \eta}\right) + \sum_{t: c_{i,t} < 0} c_{i,t} \log(1 + \eta)$.
38 Proof of MW: combine the upper and lower bounds. Since $\eta \le \frac{1}{2}$, we can use $\log\left(\frac{1}{1-\eta}\right) \le \eta + \eta^2$ and $\log(1 + \eta) \ge \eta - \eta^2$:
$\eta \sum_t c_t \cdot p_t \le \log n + \sum_{t: c_{i,t} \ge 0} c_{i,t} \log\left(\frac{1}{1-\eta}\right) + \sum_{t: c_{i,t} < 0} c_{i,t} \log(1 + \eta) \le \log n + \sum_{t: c_{i,t} \ge 0} c_{i,t} (\eta + \eta^2) + \sum_{t: c_{i,t} < 0} c_{i,t} (\eta - \eta^2)$ (use the inequalities) $= \log n + \eta \sum_t c_{i,t} + \eta^2 \sum_{t: c_{i,t} \ge 0} c_{i,t} - \eta^2 \sum_{t: c_{i,t} < 0} c_{i,t} = \log n + \eta \sum_t c_{i,t} + \eta^2 \sum_t |c_{i,t}|$ (combine the sums).
39 Proof of MW: combine the upper and lower bounds. Dividing by $\eta$, we get what we want:
$\sum_t c_t \cdot p_t \le \frac{\log n}{\eta} + \sum_t c_{i,t} + \eta \sum_t |c_{i,t}|$.
40 Some remarks. There is a matrix form of MW, and a version with gains instead of losses. See Arora et al.'s paper for more details.
56 Learning with Experts revisited
57 Multiplicative Weights with continuous label spaces.
58 Exponential Weighted Average: continuous prediction case. Formal setup: there are T rounds in which you predict a label $\hat p_t \in C$, where C is a convex subset of some vector space. At every timestep t: 1. Each expert $i = 1 \dots n$ predicts $p_{i,t} \in C$. 2. You make a prediction $\hat p_t \in C$. 3. Nature reveals $y_t \in Y$. 4. We suffer a loss $\ell(\hat p_t, y_t)$. 5. Each expert suffers a loss $\ell(p_{i,t}, y_t)$. Loss: $\hat L_T = \sum_{t=1}^T \ell(\hat p_t, y_t)$, $L_{i,T} = \sum_{t=1}^T \ell(p_{i,t}, y_t)$. Regret: $R_T = \hat L_T - \min_i L_{i,T}$.
59 Exponential Weighted Average: continuous prediction case. We assume that the loss $\ell : C \times Y \to \mathbb{R}$ is bounded, $\forall p \in C, y \in Y: \ell(p, y) \in [0, 1]$, and that $\ell(\cdot, y)$ is convex for any fixed $y \in Y$.
60 Exponential Weighted Average. This brings us to a familiar algorithm. Choose an $\eta > 0$ (we'll make this choice more precise later).
Exponential Weighted Average algorithm:
1: Initialize $w_{i,0} = 1$, $W_0 = n$
2: for t = 1 ... T do
3:   Receive expert predictions $(p_{1,t}, \dots, p_{n,t})$
4:   Predict $\hat p_t = \frac{\sum_{i=1}^n w_{i,t-1}\, p_{i,t}}{W_{t-1}}$
5:   Receive Nature's answer $y_t$ (and incur loss $\ell(\hat p_t, y_t)$)
6:   $w_{i,t} \leftarrow w_{i,t-1}\, e^{-\eta \ell(p_{i,t}, y_t)}$
7:   $W_t \leftarrow \sum_{i=1}^n w_{i,t}$
8: end for
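A minimal Python sketch of the algorithm above (illustrative, not the lecture's code): the prediction is the weight-averaged expert prediction, and each weight decays by $e^{-\eta \ell}$. The squared loss used in the default is just my example of a convex, [0, 1]-bounded loss when predictions and labels lie in [0, 1].

```python
import math

# Sketch of Exponential Weighted Average: predict the weighted average of
# the experts' predictions, then decay each weight by exp(-eta * loss).
def ewa(expert_preds, outcomes, eta=1.0, loss=lambda p, y: (p - y) ** 2):
    n = len(expert_preds[0])
    w = [1.0] * n                            # w_{i,0} = 1
    total_loss = 0.0
    for preds, y in zip(expert_preds, outcomes):
        W = sum(w)
        p_hat = sum(wi * pi for wi, pi in zip(w, preds)) / W  # weighted average
        total_loss += loss(p_hat, y)
        w = [wi * math.exp(-eta * loss(pi, y)) for wi, pi in zip(w, preds)]
    return total_loss
```

With one expert always predicting 0 and one always predicting 1, and outcomes all 1, the prediction starts at 0.5 and shifts toward the good expert as the bad expert's weight decays.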
61 Exponential Weighted Average. Theorem (regret of EWA): assume that the loss is a function $\ell : C \times Y \to [0, 1]$, where C is convex and $\ell(\cdot, y)$ is convex for any fixed $y \in Y$. Then $R_T = \hat L_T - \min_i L_{i,T} \le \frac{\log n}{\eta} + \frac{\eta T}{8}$. Hence, if we choose $\eta = \sqrt{\frac{8 \log n}{T}}$, the regret is $R_T \le \sqrt{\frac{T \log n}{2}} = O(\sqrt{T})$. The proof follows from an application of Jensen's inequality and Hoeffding's lemma; see chapter 3 of Bartok et al. for details. Again note that we need to know T in advance.
62 So how about the regret in the discrete case?
63 Weighted Majority. Theorem (regret of Weighted Majority): let $C = Y$, let Y have at least 2 elements, and let $\ell(p, y) = 1[p \neq y]$. Let $L^*_{i,T} = \min_i L_{i,T}$. Then $\hat L_T \le \frac{\log_2\left(\frac{1}{\beta}\right) L^*_{i,T} + \log_2 N}{\log_2\left(\frac{2}{1+\beta}\right)}$. This bound is of the form $R_T = a L^*_{i,T} + b = O(T)$! The discrete case is harder than the continuous case if we stick to deterministic algorithms. The worst-case regret is achieved by a set of 2 experts and an outcome sequence $y_t$ such that $\hat L_T = T$. See chapter 4 of Bartok et al. for details.
64 Today. What should you (at least) take away from this lecture? How should you base your prediction on expert predictions? You can use a weighted majority algorithm, which is a type of MWA; this is a general framework for online learning with experts (e.g. boosting). What are the characteristics of multiplicative weights algorithms? Few assumptions on costs / Nature (they can be adversarial), therefore broadly applicable. There is a fundamental tension between fitting to the decision and learning from Nature. MW tends to have an instantaneous decision-differential of $O(\frac{1}{\sqrt{t}})$, worse than the $O(\frac{1}{t})$ of the version of FTL that we saw in lecture 1.
65 Questions?
More informationGeneralization bounds
Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question
More informationIntroduction to Machine Learning
Introduction to Machine Learning Vapnik Chervonenkis Theory Barnabás Póczos Empirical Risk and True Risk 2 Empirical Risk Shorthand: True risk of f (deterministic): Bayes risk: Let us use the empirical
More informationHybrid Machine Learning Algorithms
Hybrid Machine Learning Algorithms Umar Syed Princeton University Includes joint work with: Rob Schapire (Princeton) Nina Mishra, Alex Slivkins (Microsoft) Common Approaches to Machine Learning!! Supervised
More informationTDT4173 Machine Learning
TDT4173 Machine Learning Lecture 9 Learning Classifiers: Bagging & Boosting Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline
More informationLecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;
CSCI699: Topics in Learning and Game Theory Lecture 2 Lecturer: Ilias Diakonikolas Scribes: Li Han Today we will cover the following 2 topics: 1. Learning infinite hypothesis class via VC-dimension and
More informationOnline Convex Optimization
Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed
More informationBagging. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL 8.7
Bagging Ryan Tibshirani Data Mining: 36-462/36-662 April 23 2013 Optional reading: ISL 8.2, ESL 8.7 1 Reminder: classification trees Our task is to predict the class label y {1,... K} given a feature vector
More informationFoundations of Machine Learning and Data Science. Lecturer: Avrim Blum Lecture 9: October 7, 2015
10-806 Foundations of Machine Learning and Data Science Lecturer: Avrim Blum Lecture 9: October 7, 2015 1 Computational Hardness of Learning Today we will talk about some computational hardness results
More informationOnline Sparse Linear Regression
JMLR: Workshop and Conference Proceedings vol 49:1 11, 2016 Online Sparse Linear Regression Dean Foster Amazon DEAN@FOSTER.NET Satyen Kale Yahoo Research SATYEN@YAHOO-INC.COM Howard Karloff HOWARD@CC.GATECH.EDU
More informationGame Theory, On-line prediction and Boosting (Freund, Schapire)
Game heory, On-line prediction and Boosting (Freund, Schapire) Idan Attias /4/208 INRODUCION he purpose of this paper is to bring out the close connection between game theory, on-line prediction and boosting,
More informationLearning theory. Ensemble methods. Boosting. Boosting: history
Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over
More informationLecture 19: UCB Algorithm and Adversarial Bandit Problem. Announcements Review on stochastic multi-armed bandit problem
Lecture 9: UCB Algorithm and Adversarial Bandit Problem EECS598: Prediction and Learning: It s Only a Game Fall 03 Lecture 9: UCB Algorithm and Adversarial Bandit Problem Prof. Jacob Abernethy Scribe:
More informationECE 5424: Introduction to Machine Learning
ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple
More informationLecture 3: Lower Bounds for Bandit Algorithms
CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationTDT4173 Machine Learning
TDT4173 Machine Learning Lecture 3 Bagging & Boosting + SVMs Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline 1 Ensemble-methods
More informationMidterm, Fall 2003
5-78 Midterm, Fall 2003 YOUR ANDREW USERID IN CAPITAL LETTERS: YOUR NAME: There are 9 questions. The ninth may be more time-consuming and is worth only three points, so do not attempt 9 unless you are
More informationAn Algorithms-based Intro to Machine Learning
CMU 15451 lecture 12/08/11 An Algorithmsbased Intro to Machine Learning Plan for today Machine Learning intro: models and basic issues An interesting algorithm for combining expert advice Avrim Blum [Based
More information2 Upper-bound of Generalization Error of AdaBoost
COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #10 Scribe: Haipeng Zheng March 5, 2008 1 Review of AdaBoost Algorithm Here is the AdaBoost Algorithm: input: (x 1,y 1 ),...,(x m,y
More informationTuring Machine Recap
Turing Machine Recap DFA with (infinite) tape. One move: read, write, move, change state. High-level Points Church-Turing thesis: TMs are the most general computing devices. So far no counter example Every
More information15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #16: Gradient Descent February 18, 2015
5-859E: Advanced Algorithms CMU, Spring 205 Lecture #6: Gradient Descent February 8, 205 Lecturer: Anupam Gupta Scribe: Guru Guruganesh In this lecture, we will study the gradient descent algorithm and
More informationEmpirical Risk Minimization Algorithms
Empirical Risk Minimization Algorithms Tirgul 2 Part I November 2016 Reminder Domain set, X : the set of objects that we wish to label. Label set, Y : the set of possible labels. A prediction rule, h:
More informationNondeterministic finite automata
Lecture 3 Nondeterministic finite automata This lecture is focused on the nondeterministic finite automata (NFA) model and its relationship to the DFA model. Nondeterminism is an important concept in the
More informationExtracting Certainty from Uncertainty: Regret Bounded by Variation in Costs
Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Elad Hazan IBM Almaden Research Center 650 Harry Rd San Jose, CA 95120 ehazan@cs.princeton.edu Satyen Kale Yahoo! Research 4301
More informationPAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht
PAC Learning prof. dr Arno Siebes Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht Recall: PAC Learning (Version 1) A hypothesis class H is PAC learnable
More informationarxiv: v4 [cs.lg] 27 Jan 2016
The Computational Power of Optimization in Online Learning Elad Hazan Princeton University ehazan@cs.princeton.edu Tomer Koren Technion tomerk@technion.ac.il arxiv:1504.02089v4 [cs.lg] 27 Jan 2016 Abstract
More informationNo-Regret Algorithms for Unconstrained Online Convex Optimization
No-Regret Algorithms for Unconstrained Online Convex Optimization Matthew Streeter Duolingo, Inc. Pittsburgh, PA 153 matt@duolingo.com H. Brendan McMahan Google, Inc. Seattle, WA 98103 mcmahan@google.com
More informationStat 260/CS Learning in Sequential Decision Problems. Peter Bartlett
Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Adversarial bandits Definition: sequential game. Lower bounds on regret from the stochastic case. Exp3: exponential weights
More informationFrom Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018
From Binary to Multiclass Classification CS 6961: Structured Prediction Spring 2018 1 So far: Binary Classification We have seen linear models Learning algorithms Perceptron SVM Logistic Regression Prediction
More informationLearning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley
Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization.
More informationLecture 9: Generalization
Lecture 9: Generalization Roger Grosse 1 Introduction When we train a machine learning model, we don t just want it to learn to model the training data. We want it to generalize to data it hasn t seen
More informationThe No-Regret Framework for Online Learning
The No-Regret Framework for Online Learning A Tutorial Introduction Nahum Shimkin Technion Israel Institute of Technology Haifa, Israel Stochastic Processes in Engineering IIT Mumbai, March 2013 N. Shimkin,
More informationAdvanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12
Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Lecture 4: Multiarmed Bandit in the Adversarial Model Lecturer: Yishay Mansour Scribe: Shai Vardi 4.1 Lecture Overview
More informationBoosting: Foundations and Algorithms. Rob Schapire
Boosting: Foundations and Algorithms Rob Schapire Example: Spam Filtering problem: filter out spam (junk email) gather large collection of examples of spam and non-spam: From: yoav@ucsd.edu Rob, can you
More informationDecision Trees: Overfitting
Decision Trees: Overfitting Emily Fox University of Washington January 30, 2017 Decision tree recap Loan status: Root 22 18 poor 4 14 Credit? Income? excellent 9 0 3 years 0 4 Fair 9 4 Term? 5 years 9
More informationAdvanced Machine Learning
Advanced Machine Learning Bandit Problems MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Multi-Armed Bandit Problem Problem: which arm of a K-slot machine should a gambler pull to maximize his
More informationAn analogy from Calculus: limits
COMP 250 Fall 2018 35 - big O Nov. 30, 2018 We have seen several algorithms in the course, and we have loosely characterized their runtimes in terms of the size n of the input. We say that the algorithm
More informationExponentiated Gradient Descent
CSE599s, Spring 01, Online Learning Lecture 10-04/6/01 Lecturer: Ofer Dekel Exponentiated Gradient Descent Scribe: Albert Yu 1 Introduction In this lecture we review norms, dual norms, strong convexity,
More informationLecture 17: Trees and Merge Sort 10:00 AM, Oct 15, 2018
CS17 Integrated Introduction to Computer Science Klein Contents Lecture 17: Trees and Merge Sort 10:00 AM, Oct 15, 2018 1 Tree definitions 1 2 Analysis of mergesort using a binary tree 1 3 Analysis of
More informationCOS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #24 Scribe: Jad Bechara May 2, 2018
COS 5: heoretical Machine Learning Lecturer: Rob Schapire Lecture #24 Scribe: Jad Bechara May 2, 208 Review of Game heory he games we will discuss are two-player games that can be modeled by a game matrix
More informationThe Perceptron algorithm
The Perceptron algorithm Tirgul 3 November 2016 Agnostic PAC Learnability A hypothesis class H is agnostic PAC learnable if there exists a function m H : 0,1 2 N and a learning algorithm with the following
More information