Online Learning under Full and Bandit Information


Online Learning under Full and Bandit Information
Artem Sokolov, Computerlinguistik, Universität Heidelberg

Outline:
1. Motivation
2. Adversarial Online Learning: Hedge, EXP3
3. Stochastic Bandits: ε-greedy, UCB

Real-world online learning tasks

- advertising (which ad to display)
- medical treatment (which drug to prescribe)
- design/functionality rollouts (works or not)
- spam/malware filtering (filter or keep)
- stock market (sell or acquire bonds)
- network routing (which path to take)
- compression, weather, etc.

In every task there is a decision to be made under missing information.

Goals

1. introduce online learning
2. explain the distinction between full- and partial-information tasks
3. introduce the notion of regret
4. present basic algorithms for these cases

Relation to batch learning

Batch learning:
- many i.i.d. examples D = {(x_i, y_i)}_{i=1}^N
- define some loss l(D) (e.g. negative log-likelihood, squared error)
- learn a model by minimizing l(D)
- deploy on a test set

Online learning:
- receive one example x_t
- predict ŷ_t
- get feedback, suffer some penalty l_t(x_t, ŷ_t)
- improve the model
- repeat

Note: no training/testing set distinction.

Examples

                          input x_t ∈ X   truth y_t ∈ Y   prediction ŷ_t ∈ P   penalty/loss (X × Y × P → R)
  online regression       R^d             R               R                    |y_t − ŷ_t|
  online classification   R^d             {1,...,K}       {1,...,K}            [y_t ≠ ŷ_t]
  expert advice           R^N             R^d             {1,...,N}            y_t[ŷ_t]
  structured prediction   K^m             K^m             K^m                  \sum_{i=1}^m [y_i^t ≠ ŷ_i^t]

Why online learning?

- early days, 50s-70s: online learning was a requirement
  - first computers, very low memory, very slow CPUs
  - the perceptron from 1957 is originally an online algorithm!
- later, 70s-90s: batch learning became possible
  - reasonable CPU power, reasonable memory
  - great convergence guarantees!
- 2000s-now: computers are very powerful, memory is cheap
  - batch algorithms explode in memory and time
  - easy access to data made datasets practically infinite
  - discarding data is a bad idea, we want it all!
  - some people say that data acquisition outpaced Moore's law
  - effectively we are back in the 50s

Why online learning?

Not only a question of resources:
- the larger the data, the harder it is to guarantee stationarity
- the data cannot be cramped into a fixed-size dataset
- hence algorithms need to be adaptive, and frequent re-training is not an option (because resources...)

Advantages

Online learning:
- receive one example x_t
- predict ŷ_t
- get feedback, suffer some penalty l_t(x_t, ŷ_t)
- improve the model
- repeat

Advantages: small memory footprint, faster updates, faster adaptation, better test performance (in a certain sense).

Domain of online learning

Environment:
- the i.i.d. assumption is convenient, but often cannot be guaranteed or is obviously violated
- sometimes we assume nothing about the distribution: the adversarial case

Feedback:
- full information is best, but correct labels are expensive and slow to get
- often partial feedback is all you have: the bandit case

Structure:
- no state (important but rare case)
- usually there is some state or context
- structured spaces (actions change the environment)

Resources:
- batch learning is costly, saving everything is impractical
- learn from one x and discard it

The space of online learning algorithms

[Diagram after [Seldin 15]: the space is spanned by environment (adversarial vs. i.i.d.), feedback (full vs. bandit), and structure (no state, context, structured); expert advice, adversarial bandits, stochastic bandits, and reinforcement learning occupy different corners of it.]

Adversarial Environment with Full Information
(just means there are no statistical assumptions)

Measure of success

Online learning protocol
1: for t = 0, 1, ... do
2:   observe x_t (if available)
3:   predict ŷ_t
4:   suffer loss l_t(ŷ_t)
5:   update

- the l_t are arbitrary (e.g., it does not mean they are uniformly distributed)
- they could be random or non-random, and may depend on the previous history
- we want algorithms that work in any case

What about the goal?
- there is no training set, so we cannot minimize loss over a training set
- even if we could, it does not always make sense, as l_t can be anything
- the measure of success has to be calculated w.r.t. the whole interaction, not just some end objective

Regret

What do we want to achieve?
- in principle we want to minimize our total loss
- still not ideal, because l_t can scale arbitrarily, so we need a relative measure
- e.g., relative to some fixed (but unknown) arm-pulling strategy h = (h_t), or relative to the best arm-pulling strategy from a set H
- note: the larger H is, the harder the task

We measure a cost of ignorance, or regret, for not following that strategy:

    R_T = \sum_{t=1}^T l_t(ŷ_t) − \sum_{t=1}^T l_t(h_t)    or    R_T = \sum_{t=1}^T l_t(ŷ_t) − \min_{h ∈ H} \sum_{t=1}^T l_t(h)

Our ultimate goal: the average regret R_T / T → 0 as fast as possible; as the learning goes on, our loss differs less and less from the alternative one.

Adversary restriction

Why do we need different definitions of regret?
- on the one hand, it is a tool to analyze a problem, to test it under different assumptions
- on the other, without any restrictions online learning is too hard (or impossible)
- we need to restrict the power of the adversary and vary R_T accordingly

Different definitions then reflect our knowledge about the environment:

1. if we believe that the true data is generated by some fixed function h*, i.e. y_t = h*(x_t), it is reasonable to minimize R_T w.r.t. that function:

       R_T(h*) = \sum_{t=1}^T l_t(ŷ_t) − \sum_{t=1}^T l_t(h*)

2. if not, the adversary must at least not change his mind at will, i.e. he has to commit to some y_t before seeing ŷ_t; then it makes sense to optimize R_T w.r.t. the best function from some set H:

       R_T(H) = \sum_{t=1}^T l_t(ŷ_t) − \min_{h ∈ H} \sum_{t=1}^T l_t(h)

Adversary restriction

What happens if we do not have the commitment requirement?

Example: online classification
1: for t = 0, 1, ... do
2:   observe x_t
3:   predict ŷ_t ∈ {0, 1}
4:   receive true y_t
5:   suffer loss l_t(ŷ_t) = |y_t − ŷ_t|
6:   update w_{t+1}

    R_T = \sum_{t=1}^T l_t(ŷ_t) − \min_{h ∈ H} \sum_{t=1}^T l_t(h(x_t)) = \sum_{t=1}^T |y_t − ŷ_t| − \min_{h ∈ H} \sum_{t=1}^T |y_t − h(x_t)|

Take the simplest H = {h_0, h_1}, where h_a ≡ a (constant functions).

Exercise 1: can you make the learner always lose? [Shalev-Shwartz 12]
Solution: wait until ŷ_t is announced and set y_t = 1 − ŷ_t.

Algorithm for the realizable case

Realizability assumption: ∃ h* ∈ H s.t. ∀t: y_t = h*(x_t). Also |H| < ∞.

Consistent
1: initialize V_0 = H
2: for t = 0, 1, ... do
3:   observe x_t
4:   choose any h ∈ V_t
5:   predict ŷ_t = h(x_t)
6:   receive true y_t = h*(x_t)
7:   update V_{t+1} = {h ∈ V_t : h(x_t) = y_t}

Analysis: at every t, at least one h is removed if there was an error (and none if not), so 1 ≤ |V_t| ≤ |H| − #errors and

    R_T = #errors − 0 = #errors ≤ |H| − 1

Can we do better? Hint: purge hypotheses faster.
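A minimal runnable sketch of Consistent in Python; the toy hypothesis class of thresholds and the data stream are illustrative assumptions, not part of the slides.

```python
# Sketch of the Consistent algorithm on a finite, realizable hypothesis class.
def consistent(hypotheses, stream):
    """hypotheses: list of callables h(x) -> label; stream: iterable of (x, y) with y = h_star(x)."""
    V = list(hypotheses)                         # version space V_0 = H
    errors = 0
    for x, y in stream:
        h = V[0]                                 # choose any hypothesis still in the version space
        errors += int(h(x) != y)
        V = [g for g in V if g(x) == y]          # keep only hypotheses consistent with the new example
    return errors

# toy example: threshold functions on a grid, labels produced by a threshold at 0.35
H = [lambda x, t=t: int(x >= t) for t in [0.1 * i for i in range(11)]]
h_star = lambda x: int(x >= 0.35)
xs = [0.05, 0.5, 0.3, 0.8, 0.2, 0.4]
print(consistent(H, [(x, h_star(x)) for x in xs]))   # number of mistakes, at most |H| - 1
```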

Algorithm for the realizable case

Realizability assumption: ∃ h* ∈ H s.t. ∀t: y_t = h*(x_t). Also |H| < ∞.

Halving
1: initialize V_0 = H
2: for t = 0, 1, ... do
3:   observe x_t
4:   choose by majority vote: ŷ_t = arg max_{r ∈ {0,1}} |{h ∈ V_t : h(x_t) = r}|
5:   predict ŷ_t
6:   receive true y_t = h*(x_t)
7:   update V_{t+1} = {h ∈ V_t : h(x_t) = y_t}

Analysis: at every t, at least one half of V_t is removed if there was an error, so 1 ≤ |V_t| ≤ |H| / 2^{#errors} and

    R_T(h*) = #errors ≤ log_2 |H|
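The same sketch with the only change Halving requires (a majority-vote prediction); it can be run on the toy H and stream from the previous snippet.

```python
# Sketch of the Halving algorithm: predict with the majority of the version space.
from collections import Counter

def halving(hypotheses, stream):
    V = list(hypotheses)
    errors = 0
    for x, y in stream:
        y_hat = Counter(h(x) for h in V).most_common(1)[0][0]   # majority vote over V_t
        errors += int(y_hat != y)
        V = [h for h in V if h(x) == y]                          # a mistake removes at least half of V
    return errors
```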

Failure of realizability for infinite H

Finiteness of H is crucial.

Example: take the real line X = (0, 1) and thresholds H = {h_θ : (0,1) → {0,1}}, h_θ(x) = sign(θ − x). There is a sequence of (x_t, y_t) generated by some θ on which Halving will have R_T = T.

Exercise 2: construct such a sequence. [Shalev-Shwartz 12]

Solution: maintain endpoints L_t (left) and R_t (right), with L_0 = 0, R_0 = 1:
- pick a random x_t ∈ (L_t, R_t)
- receive ŷ_t and report y_t = 1 − ŷ_t
- set R_{t+1} = x_t y_t + R_t ŷ_t and L_{t+1} = x_t ŷ_t + L_t y_t
- then R_t − L_t > 0 for all t, so some threshold remains consistent with all reported labels
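An illustrative sketch of this adversary, assuming the labeling convention h_θ(x) = 1 if x < θ and 0 otherwise (the slide leaves the sign convention implicit); it forces a mistake on every round while keeping a non-empty interval of consistent thresholds.

```python
# Adversary for Exercise 2: always answer the opposite of the learner's prediction,
# shrinking (L, R) so that some threshold theta in (L, R) explains all reported labels.
import random

def run_adversary(learner_predict, T=20, seed=0):
    rng = random.Random(seed)
    L, R, mistakes = 0.0, 1.0, 0
    for _ in range(T):
        x = rng.uniform(L, R)
        y = 1 - learner_predict(x)     # force a mistake on this round
        mistakes += 1
        if y == 1:
            L = x                      # label 1 means x < theta, so theta must lie in (x, R)
        else:
            R = x                      # label 0 means x >= theta, so theta must lie in (L, x]
    return mistakes                    # = T, while the best h_theta makes zero mistakes

print(run_adversary(lambda x: int(x < 0.5)))   # any learner; here a fixed threshold at 0.5
```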

Randomization

- the realizability assumption may be too harsh for our application
- add an element of surprise to our predictions!
- remember to require the adversary to commit to y_t before seeing ŷ_t
- this changes lines 4 and 5 of Consistent

Randomized
1: initialize V_0 = H
2: for t = 0, 1, ... do
3:   observe x_t
4:   choose a probability p_t
5:   predict ŷ_t = 1 with probability p_t
6:   receive true y_t = h*(x_t)
7:   update V_{t+1} = {h ∈ V_t : h(x_t) = y_t}

    R_T = \sum_{t=1}^T E_{p_t}[|ŷ_t − y_t|] − \min_{h ∈ H} \sum_{t=1}^T |h(x_t) − y_t|
        = \sum_{t=1}^T |p_t − y_t| − \min_{h ∈ H} \sum_{t=1}^T |h(x_t) − y_t| ≤ \sqrt{2T ln |H|}

Note: the regret definition changed again; the proof comes later.

Summary so far

- so far we had full information (i.e., we received the true y_t)
- different adversary restrictions help to get regret bounds:

  realizability + finiteness   Consistent    R_T ≤ |H| − 1
                               Halving       R_T ≤ log_2 |H|
  commitment                   Randomized    R_T ≤ \sqrt{2T ln |H|}

Learning with Expert Advice

Learning with Expert Advice

- imagine horse races
- you know nothing about horses
- luckily you have knowledgeable friends willing to give you advice
- you apportion a fixed sum of money between them
- goal: minimize losses / maximize profit

Expert Advice

- stateless case (you have the friend's identity, but not the horse's breakfast menu or the expert's history)
- N friends
- loss vector l_t ∈ [0, 1]^N, e.g. l_t[i] = 0.3 if the i-th friend lost 30 cents
- prediction p_t ∈ [0, 1]^N with \sum_{i=1}^N p_t[i] = 1: your distribution of money
- loss \sum_{i=1}^N p_t[i] l_t[i] = ⟨p_t, l_t⟩
- goal: R_T = \sum_{t=1}^T ⟨p_t, l_t⟩ − \min_{i=1,...,N} \sum_{t=1}^T l_t[i]   (the second term is the loss of the best friend)

Note: you do not know how good your friends are, and horses/friends can conspire against you. But in the limit you can do as well as the best friend in hindsight (in terms of average loss per race)!

Hedge algorithm

- if one of the friends is perfect, we can get at most log_2 N mistakes with Halving
- but making a mistake does not necessarily mean we should disqualify a friend

Hedge
1: init a vector w_1 ∈ R_+^N s.t. \sum_{i=1}^N w_1[i] = 1, learning rate µ > 0
2: for t = 1, 2, ... do
3:   compute p_t = w_t / \sum_{i=1}^N w_t[i]
4:   receive loss l_t
5:   update w_{t+1}[i] = w_t[i] e^{−µ l_t[i]}   (soft disqualification)

Theorem. For any l_1, ..., l_T and any i ∈ {1, ..., N}:

    R_T = \sum_{t=1}^T ⟨p_t, l_t⟩ − \min_j \sum_{t=1}^T l_t[j] ≤ \sqrt{2T ln N} + ln N
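A minimal runnable sketch of Hedge in Python; the synthetic loss matrix is an illustrative assumption, and the learning rate follows the slides' choice with the trivial bound L = T.

```python
# Sketch of Hedge: maintain weights, play the normalized weights, exponentially downweight losses.
import numpy as np

def hedge(losses, mu):
    """losses: (T, N) array with entries in [0, 1]; returns the algorithm's total loss."""
    T, N = losses.shape
    w = np.ones(N) / N                       # w_1
    total = 0.0
    for t in range(T):
        p = w / w.sum()                      # p_t
        total += p @ losses[t]               # suffer <p_t, l_t>
        w = w * np.exp(-mu * losses[t])      # soft disqualification
    return total

rng = np.random.default_rng(0)
T, N = 1000, 5
L = rng.uniform(size=(T, N)); L[:, 2] *= 0.3                 # expert 2 is better on average
mu = np.log(1 + np.sqrt(2 * np.log(N) / T))                  # slides' choice of mu with L = T
print(hedge(L, mu) - L.sum(axis=0).min())                    # regret vs. the best expert
```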

Hedge Analysis

We will upper- and lower-bound the quantity \sum_{i=1}^N w_{t+1}[i].

Upper bound:

    \sum_{i=1}^N w_{t+1}[i] = \sum_{i=1}^N w_t[i] e^{−µ l_t[i]}
                            ≤ \sum_{i=1}^N w_t[i] (1 − (1 − e^{−µ}) l_t[i])                 (use e^{−αx} ≤ 1 − (1 − e^{−α}) x)
                            = (\sum_{i=1}^N w_t[i]) (1 − (1 − e^{−µ}) ⟨p_t, l_t⟩)

    ln \sum_{i=1}^N w_{t+1}[i] ≤ ln (\sum_{i=1}^N w_t[i]) + ln(1 − (1 − e^{−µ}) ⟨p_t, l_t⟩)
                               ≤ ln (\sum_{i=1}^N w_t[i]) − (1 − e^{−µ}) ⟨p_t, l_t⟩          (use ln(1 − x) ≤ −x)

    ln \sum_{i=1}^N w_{T+1}[i] ≤ ln (\sum_{i=1}^N w_1[i]) − (1 − e^{−µ}) \sum_{t=1}^T ⟨p_t, l_t⟩    (telescope)

Hedge Analysis

Lower bound: for any j ∈ {1, ..., N},

    \sum_{i=1}^N w_{T+1}[i] ≥ w_{T+1}[j] = w_1[j] e^{−µ \sum_{s=1}^T l_s[j]}

Combining (and using ln \sum_{i=1}^N w_1[i] = ln 1 = 0):

    ln w_1[j] − µ \sum_{t=1}^T l_t[j] ≤ ln \sum_{i=1}^N w_{T+1}[i] ≤ − (1 − e^{−µ}) \sum_{t=1}^T ⟨p_t, l_t⟩

Rearranging:

    \sum_{t=1}^T ⟨p_t, l_t⟩ ≤ (−ln w_1[j] + µ \sum_{t=1}^T l_t[j]) / (1 − e^{−µ})

Remember that j was arbitrary, and let w_1[i] = 1/N:

    \sum_{t=1}^T ⟨p_t, l_t⟩ ≤ (ln N + µ \min_j \sum_{t=1}^T l_t[j]) / (1 − e^{−µ})

Choice of µ

Suppose we can upper bound the minimal loss (we are sure about at least one of our friends):

    \min_j \sum_{t=1}^T l_t[j] ≤ L

If we set µ = ln(1 + \sqrt{2 ln N / L}), then from

    \sum_{t=1}^T ⟨p_t, l_t⟩ ≤ (ln N + µ \min_j \sum_{t=1}^T l_t[j]) / (1 − e^{−µ})

after rearranging we get a regret bound:

    \sum_{t=1}^T ⟨p_t, l_t⟩ ≤ \min_j \sum_{t=1}^T l_t[j] + \sqrt{2 L ln N} + ln N

Regret bound

Trivially, L ≤ T (yes, a very loose bound):

    \sum_{t=1}^T ⟨p_t, l_t⟩ − \min_j \sum_{t=1}^T l_t[j] ≤ \sqrt{2T ln N} + ln N

Much better if L ≪ T (e.g., there is a friend that almost never errs, L ≈ 0):

    \sum_{t=1}^T ⟨p_t, l_t⟩ − \min_j \sum_{t=1}^T l_t[j] ≈ ln N

Not surprisingly, this looks similar to the Halving bound (realizability and finiteness hold).

Exercise

Hedge
1: init a vector w_1 ∈ R_+^N s.t. \sum_{i=1}^N w_1[i] = 1, learning rate µ > 0
2: for t = 1, 2, ... do
3:   compute p_t = w_t / \sum_{i=1}^N w_t[i]
4:   receive loss l_t
5:   update w_{t+1}[i] = w_t[i] e^{−µ l_t[i]}

Exercise 3: [Marchetti-Spaccamela 11]
- 3 experts: the 1st always plays Rock, the 2nd Scissors, and the 3rd Paper
- your opponent plays Rock for the first T/3 rounds, then Scissors for T/3 rounds, and then Paper for the last T/3 rounds
- loss: −1 if you won, +1 if you lost, 0 if tie
- describe roughly 1) the most probable strategies played by Hedge, 2) when they switch, and 3) the final distribution
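An illustrative simulation of this exercise; rescaling the {−1, 0, +1} losses to [0, 1] before the exponential update is an assumption (the exercise does not fix this detail), and the printout lets you watch the weight mass move between the three experts.

```python
# Hedge with three constant experts (Rock, Scissors, Paper) against the scheduled opponent.
import numpy as np

BEATS = {"Rock": "Scissors", "Scissors": "Paper", "Paper": "Rock"}
EXPERTS = ["Rock", "Scissors", "Paper"]

def loss(my_move, opp_move):
    if my_move == opp_move:
        return 0.0
    return -1.0 if BEATS[my_move] == opp_move else 1.0

T, mu = 3000, 0.05
w = np.ones(3)
for t in range(T):
    opp = EXPERTS[0] if t < T // 3 else (EXPERTS[1] if t < 2 * T // 3 else EXPERTS[2])
    l = np.array([loss(e, opp) for e in EXPERTS])
    w *= np.exp(-mu * (l + 1) / 2)              # shift losses from [-1, 1] to [0, 1]
    if t in (0, T // 3, 2 * T // 3, T - 1):
        print(t, np.round(w / w.sum(), 3))      # the dominant expert switches with the opponent
```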

Relation to batch learning

- Hedge inspired Boosting, a powerful concept of combining weak algorithms into a strong one
- idea: treat your training examples as experts
- changing weights focuses attention on difficult examples
- Gödel Prize 2003

Adversarial Multi-Armed Bandits
(how to use only one friend at a time in horse races)

Why the name?

One-armed bandits:
- you are in a casino with slot machines ("one-armed bandits")
- you have to find the machine that gives you the most money
- you can try one machine at a time
- on the one hand, you should play the best machine so far: exploitation
- on the other hand, there might be better machines, not yet tried: exploration

"Considered by Allied scientists in WW2, it proved so intractable that the problem was proposed to be dropped over Germany so that German scientists could also waste their time on it." [Peter 79]

Adversarial Multi-Armed Bandits

- similar setup to expert advice
- but you can bid on only one of the friends (experts) at a time
- consequently, no full loss vector l_t is received (you do not know how much your other friends lost)
- you get only the loss l_t[i_t] of the chosen friend i_t

Hedge vs. Exp3

A slight change to the Hedge algorithm:

Hedge
1: init w_1 s.t. \sum_{i=1}^N w_1[i] = 1, µ > 0
2: for t = 1, 2, ... do
3:   play all friends: p_t = w_t / \sum_{i=1}^N w_t[i]
4:   receive l_t
5:   w_{t+1}[i] = w_t[i] e^{−µ l_t[i]}

Exp3
1: init w_1[i] = 1, γ ∈ (0, 1]
2: for t = 1, 2, ... do
3:   draw a friend i_t according to p_t[i] = (1 − γ) w_t[i] / \sum_{j=1}^N w_t[j] + γ/N   (random "surprise" actions added)
4:   receive l_t[i_t]
5:   w_{t+1}[i] = w_t[i] e^{−(γ/N) l_t[i] / p_t[i]} if i = i_t, and w_{t+1}[i] = w_t[i] otherwise

Exp3 = Exponential-weight algorithm for Exploration and Exploitation.
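A minimal runnable sketch of Exp3 in Python; the Bernoulli bandit environment is an illustrative assumption, and the exponent follows the loss formulation reconstructed above.

```python
# Sketch of Exp3: explore with probability gamma, update only the played arm with an
# importance-weighted loss estimate.
import numpy as np

def exp3(loss_fn, N, T, gamma, rng=np.random.default_rng(0)):
    w = np.ones(N)
    total = 0.0
    for t in range(T):
        p = (1 - gamma) * w / w.sum() + gamma / N       # mix in uniform exploration
        i = rng.choice(N, p=p)
        loss = loss_fn(t, i)                             # only the chosen arm's loss is observed
        total += loss
        lhat = loss / p[i]                               # importance-weighted (unbiased) estimate
        w[i] *= np.exp(-gamma * lhat / N)                # update only the played arm
    return total

means = np.array([0.9, 0.8, 0.7, 0.2, 0.6])              # arm 3 has the smallest expected loss
env = lambda t, i, rng=np.random.default_rng(1): float(rng.random() < means[i])
print(exp3(env, N=5, T=5000, gamma=0.1))
```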

Connection to importance sampling

In Exp3, denote

    l̂_t[i] = l_t[i] / p_t[i] if i = i_t, and 0 otherwise.

Then

    E[ l̂_t[i] | i_1, ..., i_{t−1} ] = p_t[i] · l_t[i] / p_t[i] + (1 − p_t[i]) · 0 = l_t[i]

This is a recurring theme in online learning: unbiased estimates of losses.
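A quick numerical check (illustrative values) that the importance-weighted estimator is unbiased: averaging l̂ over many draws of i_t recovers the full loss vector even though only one entry is observed per draw.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])        # p_t
l = np.array([0.7, 0.1, 0.4])        # true (mostly unobserved) losses l_t
est, M = np.zeros(3), 100_000
for _ in range(M):
    i = rng.choice(3, p=p)
    lhat = np.zeros(3)
    lhat[i] = l[i] / p[i]            # importance-weighted estimate of the whole vector
    est += lhat
print(est / M)                        # approximately [0.7, 0.1, 0.4]
```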

Exp3 Regret

Theorem. For any γ ∈ (0, 1], N > 0 and any sequence l_1, ..., l_T:

    E[ \sum_{t=1}^T l_t[i_t] ] − \min_i \sum_{t=1}^T l_t[i] ≤ 2 \sqrt{(e − 1) T N ln N}

Comparison to the full-information Hedge: O(\sqrt{T ln N}); Exp3: O(\sqrt{T N ln N}) — the extra \sqrt{N} is the price of bandit information.

Proof

- very similar to the Hedge proof; see the appendix
- Exercise 4 (optional): explain why the exploration is necessary; point to where the proof would fail if p_t[i] = w_t[i] / \sum_{j=1}^N w_t[j] (i.e., γ = 0)

Stochastic Bandits

Stochastic Bandits

- restrict the adversarial setting somewhat; as we have seen, restrictions lead to nicer regret (possibly under a different definition)
- N arms as before
- this time arm i's loss l_t[i] is sampled i.i.d. from D_i (unknown and fixed)
- l_t[i] and l_t[j] are independent for i ≠ j
- mean loss µ_i = E[l_t[i]], µ* = min_i µ_i, i* = arg min_i µ_i
- suffered loss: l_t[i_t]
- note: the distributions D_{i,t} may change over time as long as the means µ_i stay fixed

Stochastic Bandits

- mean loss µ_i = E[l_t[i]], µ* = min_i µ_i
- n_i(T): number of pulls of the i-th arm over the first T plays

Goal: now it makes sense to speak of expected regret:

    R_T = E[ \sum_{t=1}^T l_t[i_t] ] − \min_i E[ \sum_{t=1}^T l_t[i] ]
        = E[ \sum_{t=1}^T \sum_{i=1}^N l_t[i] [i = i_t] ] − T µ*
        = \sum_{i=1}^N µ_i E[ \sum_{t=1}^T [i = i_t] ] − T µ*
        = \sum_{i=1}^N µ_i E[n_i(T)] − T µ*

Stochastic Bandits

Gaps: ∆_i = µ_i − µ* ≥ 0, so ∆_{i*} = 0, and 0 ≤ ∆_j ≤ ∆_i whenever µ* ≤ µ_j ≤ µ_i.

    R_T = \sum_{i=1}^N µ_i E[n_i(T)] − T µ* = \sum_{i=1}^N ∆_i E[n_i(T)]

Next theorem(s):

    R_T ≤ C \sum_{i: ∆_i > 0} (ln T) / ∆_i + o(ln T)

- smaller gaps mean bigger regret
- intuition: it takes more time to distinguish between close arms
- compare to Exp3: O(\sqrt{N T ln N})

Regret bound

    R_T ≤ C \sum_{i: ∆_i > 0} (ln T) / ∆_i + o(ln T)

Why 1/∆_i? An intuitive example:
- imagine Bernoulli losses with parameters p_i
- the mean estimates µ̂_i = (1/T) \sum_{t=1}^T l_t[i] have variance σ² = p_i(1 − p_i)/T
- if T ≈ 1/∆_i^α then σ² ≈ ∆_i^α p_i(1 − p_i)
- if α ≈ 1, it is hard to say whether there is a real difference between µ* and µ_i or whether it is just variance
- so we rather need α ≈ 2
- since on every pull of arm i we lose about ∆_i, for α = 2 we pay about ∆_i · (1/∆_i²) = 1/∆_i

ε-greedy

The simplest strategy:

ε-greedy
1: ε > 0; µ̂_i = 0 (empirical means of the losses)
2: for t = 1, 2, ... do
3:   with prob. 1 − ε play the current best arm i_t = arg min_i µ̂_i
4:   with prob. ε play a random arm
5:   receive l_t[i_t]
6:   update the empirical mean: µ̂_{i_t} ← (µ̂_{i_t} n_{i_t} + l_t[i_t]) / (n_{i_t} + 1)

Regret: because of the constant ε, R_T ≈ εT; we need to let ε go to zero at a certain rate.
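A minimal runnable sketch of ε-greedy; the Bernoulli loss environment is an illustrative assumption.

```python
# Sketch of epsilon-greedy on Bernoulli losses: exploit the empirically best arm, explore at rate eps.
import numpy as np

def eps_greedy(means, T, eps, rng=np.random.default_rng(0)):
    N = len(means)
    mu_hat, n, total = np.zeros(N), np.zeros(N), 0.0
    for t in range(T):
        if rng.random() < eps:
            i = int(rng.integers(N))                   # explore a random arm
        else:
            i = int(np.argmin(mu_hat))                 # exploit the current best (smallest loss)
        loss = float(rng.random() < means[i])          # Bernoulli loss with mean means[i]
        total += loss
        mu_hat[i] = (mu_hat[i] * n[i] + loss) / (n[i] + 1)
        n[i] += 1
    return total

means = [0.9, 0.8, 0.7, 0.2, 0.6]
print(eps_greedy(means, T=10_000, eps=0.1) - 10_000 * min(means))   # regret grows roughly like eps*T
```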

ε_t-decreasing

The modified simplest strategy:

ε_t-decreasing
1: c > 0, δ > 0; µ̂_i = 0
2: for t = 1, 2, ... do
3:   set ε_t = min{1, cN / (δ² t)}
4:   with prob. 1 − ε_t play the current best arm i_t = arg min_i µ̂_i
5:   with prob. ε_t play a random arm
6:   receive l_t[i_t]
7:   update the means

Theorem. If 0 < δ ≤ \min_{i: ∆_i > 0} ∆_i < 1 and T ≥ cN/δ, then

    P{i_t ≠ i*} ≤ c/(δ² t) + o(1/t)   and hence   R_T = O(ln T)
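An illustrative variant of the ε-greedy sketch above with the decreasing exploration rate ε_t = min{1, cN/(δ²t)}; c and δ are assumed tuning constants.

```python
import numpy as np

def eps_decreasing(means, T, c=5.0, delta=0.3, rng=np.random.default_rng(0)):
    N = len(means)
    mu_hat, n, total = np.zeros(N), np.zeros(N), 0.0
    for t in range(1, T + 1):
        eps_t = min(1.0, c * N / (delta ** 2 * t))                 # exploration decays like 1/t
        i = int(rng.integers(N)) if rng.random() < eps_t else int(np.argmin(mu_hat))
        loss = float(rng.random() < means[i])
        total += loss
        mu_hat[i] = (mu_hat[i] * n[i] + loss) / (n[i] + 1)
        n[i] += 1
    return total
```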

UCB (upper confidence bound)

UCB-style strategies:
- pull arms and maintain empirical averages of their losses
- also compute confidence intervals [µ̂ − U, µ̂ + U] (a region around our estimate such that the true value lies inside with high probability)
- repeated plays shrink the confidence bound; the average becomes more reliable
- now stick to the principle of optimism in the face of uncertainty: play the arm whose mean loss combined with its confidence bound promises the least loss
- eventually the most optimistic arm will change, either because another arm really is better or because it was not sampled often enough
- a deterministic algorithm, unlike ε-greedy

UCB

UCB
1: play each arm once
2: for t = 1, 2, ... do
3:   play the arm i_t = arg min_i ( µ̂_i − \sqrt{2 ln t / n_i(t)} )
4:   receive l_t[i_t]
5:   update the averages

Theorem. For N > 0, T > 0 and arbitrary distributions D_{i,t} with fixed means µ_i:

    R_T ≤ 8 \sum_{i: µ_i > µ*} (ln T) / ∆_i + (1 + π²/3) \sum_{i=1}^N ∆_i
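A minimal runnable sketch of UCB adapted to losses (play the arm with the smallest optimistic estimate); the Bernoulli environment is an illustrative assumption.

```python
# Sketch of UCB on losses: subtract the confidence bonus from the empirical mean and take the argmin.
import numpy as np

def ucb(means, T, rng=np.random.default_rng(0)):
    N = len(means)
    mu_hat = np.array([float(rng.random() < m) for m in means])   # play each arm once
    n = np.ones(N)
    total = mu_hat.sum()
    for t in range(N, T):
        bonus = np.sqrt(2 * np.log(t + 1) / n)
        i = int(np.argmin(mu_hat - bonus))                         # optimism in the face of uncertainty
        loss = float(rng.random() < means[i])
        total += loss
        mu_hat[i] += (loss - mu_hat[i]) / (n[i] + 1)
        n[i] += 1
    return total

means = [0.9, 0.8, 0.7, 0.2, 0.6]
print(ucb(means, T=10_000) - 10_000 * min(means))    # regret grows only logarithmically in T
```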

Lessons learned

- lots of applications where traditional batch learning may fail
- goals are formulated in terms of various regrets
- the slower the upper bound on regret grows with T, the better

  full information      realizability            Consistent      R_T ≤ |H| − 1
                                                 Halving         R_T ≤ log_2 |H|
                        commitment + surprise    Randomized      R_T ≤ \sqrt{2T ln |H|}
                                                 Hedge           R_T = O(\sqrt{T ln N})
  adversarial bandits                            EXP3            R_T = O(\sqrt{N T ln N})
  stochastic bandits                             ε-decreasing    R_T = O(ln T / δ)
                                                 UCB             R_T = O(ln T / ∆)

Conclusion

- we have barely scratched the surface, covering only the most important algorithms
- advanced algorithms use EXP3 and UCB as building blocks

Recommended Reading

Material for this lecture:
1. S. Shalev-Shwartz. Online Learning and Online Convex Optimization, 2012
2. Y. Seldin. The Space of Online Learning Algorithms, 2015
3. P. Auer et al. The nonstochastic multiarmed bandit problem, 2002
4. P. Auer et al. Finite-time Analysis of the Multiarmed Bandit Problem, 2002

Advanced contextual algorithms:
1. EXP4: P. Auer et al. The nonstochastic multiarmed bandit problem, 2002
2. LinUCB: Li et al. A contextual-bandit approach to personalized news article recommendation, 2010
3. Epoch-greedy: J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information, 2008

Appendix

Analysis of EXP3

- similar to the Hedge analysis
- idea: lower- and upper-bound \sum_{i=1}^N w_{t+1}[i] / \sum_{i=1}^N w_t[i]
- denote l̂_t[i] = l_t[i] / p_t[i] if i = i_t, and 0 otherwise

Analysis of EXP3

Upper bound (use the definition of w_{t+1}[i]; recall w_t[i] / \sum_j w_t[j] = (p_t[i] − γ/N)/(1 − γ) and note the bound (γ/N) l̂_t[i] ≤ 1, since p_t[i] ≥ γ/N):

    \frac{\sum_{i=1}^N w_{t+1}[i]}{\sum_{i=1}^N w_t[i]}
      = \sum_{i=1}^N \frac{p_t[i] − γ/N}{1 − γ} e^{−(γ/N) l̂_t[i]}
      ≤ \sum_{i=1}^N \frac{p_t[i] − γ/N}{1 − γ} [ 1 − (γ/N) l̂_t[i] + (e − 2)((γ/N) l̂_t[i])² ]      (using e^{−x} ≤ 1 − x + (e − 2)x² for x ≤ 1)
      ≤ 1 − \frac{γ/N}{1 − γ} \sum_{i=1}^N p_t[i] l̂_t[i] + \frac{(e − 2)(γ/N)²}{1 − γ} \sum_{i=1}^N p_t[i] l̂_t[i]²
      ≤ 1 − \frac{γ/N}{1 − γ} l_t[i_t] + \frac{(e − 2)(γ/N)²}{1 − γ} \sum_{i=1}^N l̂_t[i]

and therefore, using ln(1 + x) ≤ x,

    ln \frac{\sum_{i=1}^N w_{t+1}[i]}{\sum_{i=1}^N w_t[i]} ≤ − \frac{γ/N}{1 − γ} l_t[i_t] + \frac{(e − 2)(γ/N)²}{1 − γ} \sum_{i=1}^N l̂_t[i]

Analysis of EXP3

Upper bound (cont.): summing over t = 1, ..., T and telescoping,

    ln \frac{\sum_{i=1}^N w_{T+1}[i]}{\sum_{i=1}^N w_1[i]} ≤ − \frac{γ/N}{1 − γ} \sum_{t=1}^T l_t[i_t] + \frac{(e − 2)(γ/N)²}{1 − γ} \sum_{t=1}^T \sum_{i=1}^N l̂_t[i]

Lower bound: for any j = 1, ..., N,

    ln \frac{\sum_{i=1}^N w_{T+1}[i]}{\sum_{i=1}^N w_1[i]} ≥ ln \frac{w_{T+1}[j]}{\sum_{i=1}^N w_1[i]} = − \frac{γ}{N} \sum_{t=1}^T l̂_t[j] − ln N

Combining (and simplifying):

    \sum_{t=1}^T l_t[i_t] ≤ (1 − γ) \sum_{t=1}^T l̂_t[j] + \frac{N ln N}{γ} + (e − 2) \frac{γ}{N} \sum_{t=1}^T \sum_{i=1}^N l̂_t[i]

Analysis of EXP3

Expectation to the rescue:

    E[ l̂_t[i] | i_1, ..., i_{t−1} ] = p_t[i] · l_t[i] / p_t[i] + (1 − p_t[i]) · 0 = l_t[i]

Taking expectations:

    E[ \sum_{t=1}^T l_t[i_t] ] ≤ (1 − γ) \sum_{t=1}^T l_t[j] + \frac{N ln N}{γ} + (e − 2) \frac{γ}{N} \sum_{t=1}^T \sum_{i=1}^N l_t[i]

Since j was arbitrary, and by assumption \sum_{t=1}^T \sum_{i=1}^N l_t[i] ≤ N L:

    E[ \sum_{t=1}^T l_t[i_t] ] ≤ (1 − γ) \min_j \sum_{t=1}^T l_t[j] + \frac{N ln N}{γ} + (e − 2) γ L

Choosing γ = min{ 1, \sqrt{ N ln N / ((e − 1) L) } } gives the regret bound 2 \sqrt{(e − 1) L N ln N}, which for L ≤ T yields the theorem.
