Advanced Machine Learning


1 Advanced Machine Learning: Bandit Problems. MEHRYAR MOHRI, COURANT INSTITUTE & GOOGLE RESEARCH.

2 Multi-Armed Bandit Problem
Problem: which arm of a K-slot machine should a gambler pull to maximize his cumulative reward over a sequence of trials?
- stochastic setting.
- adversarial setting.

3 Motivation
- Clinical trials: potential treatments for a disease to select from, with a new patient or category at each round (Thompson, 1933).
- Ad placement: selection of the ad to display out of a finite set (which could vary with time) for each new web page visitor.
- Adaptive routing: alternative paths for routing packets through a series of tubes, or alternative roads for driving from a source to a destination.
- Games: different moves at each round of a game such as chess or Go.

4 Key Problem
Exploration vs. exploitation dilemma (or trade-off):
- exploration: inspect new arms with possibly better rewards.
- exploitation: use existing information to select the best arm.

5 Outline
- Stochastic bandits.
- Adversarial bandits.

6 Stochastic Model
K arms: for each arm $i \in \{1, \ldots, K\}$,
- reward distribution $P_i$.
- reward mean $\mu_i$.
- gap to best: $\Delta_i = \mu^* - \mu_i$, where $\mu^* = \max_{i \in [1, K]} \mu_i$.

7 Bandit Setting
For $t = 1$ to $T$ do
- player selects action $I_t \in \{1, \ldots, K\}$ (randomized).
- player receives reward $X_{I_t, t} \sim P_{I_t}$.
Equivalent descriptions:
- on-line learning with partial information ($\neq$ full information).
- one-state MDPs (Markov Decision Processes).

8 Objectives
Expected regret:
$E[R_T] = E\Big[\max_{i \in [1, K]} \sum_{t=1}^T X_{i,t} - \sum_{t=1}^T X_{I_t,t}\Big].$
Pseudo-regret:
$\bar{R}_T = \max_{i \in [1, K]} E\Big[\sum_{t=1}^T X_{i,t} - \sum_{t=1}^T X_{I_t,t}\Big] = \mu^* T - E\Big[\sum_{t=1}^T X_{I_t,t}\Big].$
By Jensen's inequality, $\bar{R}_T \le E[R_T]$.

9 Expected Regret
If the deviations $X_{i,t} - \mu_i$ take values in $[-r, +r]$, then
$E\Big[\max_{i \in [1, K]} \sum_{t=1}^T (X_{i,t} - \mu_i)\Big] \le r \sqrt{2 T \log K}.$
The $O(\sqrt{T})$ dependency cannot be improved; better guarantees can be achieved for the pseudo-regret.

10 Pseudo-Regret
Expression in terms of the gaps $\Delta_i$:
$\bar{R}_T = \sum_{i=1}^K E[T_i(T)] \, \Delta_i,$
where $T_i(t)$ denotes the number of times arm $i$ was pulled up to time $t$: $T_i(t) = \sum_{s=1}^t 1_{I_s = i}$.
Proof:
$\bar{R}_T = \mu^* T - E\Big[\sum_{t=1}^T X_{I_t,t}\Big] = E\Big[\sum_{t=1}^T (\mu^* - X_{I_t,t})\Big] = E\Big[\sum_{t=1}^T \sum_{i=1}^K (\mu^* - X_{i,t}) 1_{I_t = i}\Big]$
$= \sum_{i=1}^K \sum_{t=1}^T E[\mu^* - X_{i,t}] \, E[1_{I_t = i}] = \sum_{i=1}^K (\mu^* - \mu_i) \, E\Big[\sum_{t=1}^T 1_{I_t = i}\Big] = \sum_{i=1}^K E[T_i(T)] \, \Delta_i,$
where the factorization uses the fact that $X_{i,t}$ is independent of $I_t$ (the player's choice depends only on the past).

11 ε-greedy Strategy
At time t,
- with probability $\epsilon_t$, select a random arm.
- with probability $1 - \epsilon_t$, select the arm with the best empirical mean.
(Auer et al., 2002a): for $\epsilon_t = \min\big(\frac{6K}{\Delta^2 t}, 1\big)$, with $\Delta = \min_{i : \Delta_i > 0} \Delta_i$, and for $t \ge \frac{6K}{\Delta^2}$,
$\Pr[I_t \ne i^*] \le \frac{C}{\Delta^2 t}$ for some $C > 0$.
Thus, $E[T_i(T)] \le \frac{C}{\Delta^2} \log T$ and $\bar{R}_T \le \sum_{i : \Delta_i > 0} \frac{C \, \Delta_i}{\Delta^2} \log T$.
Logarithmic regret, but:
- requires knowledge of $\Delta$.
- sub-optimal arms treated similarly (naive search).
A simulation sketch of this strategy is given below.
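The following is a minimal NumPy sketch of ε-greedy with the schedule above, under the assumption of Bernoulli rewards; the function name, the means argument, and the pseudo-regret bookkeeping are illustrative, not from the slides.

import numpy as np

def epsilon_greedy(means, T, Delta, rng=None):
    # eps_t = min(6K / (Delta^2 t), 1), as in (Auer et al., 2002a).
    rng = rng or np.random.default_rng(0)
    K = len(means)
    counts = np.zeros(K)   # T_i(t): number of pulls of arm i
    sums = np.zeros(K)     # cumulative reward of arm i
    total = 0.0
    for t in range(1, T + 1):
        eps_t = min(6 * K / (Delta ** 2 * t), 1.0)
        if rng.random() < eps_t:
            i = int(rng.integers(K))            # explore: random arm
        else:
            emp = np.where(counts > 0, sums / np.maximum(counts, 1), np.inf)
            i = int(np.argmax(emp))             # exploit: best empirical mean
        x = float(rng.random() < means[i])      # Bernoulli reward draw
        counts[i] += 1; sums[i] += x; total += x
    return max(means) * T - total               # realized pseudo-regret

For instance, epsilon_greedy([0.5, 0.6], T=100000, Delta=0.1) estimates the pseudo-regret on a two-armed instance with gap 0.1.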

12 UCB Strategy
Optimism in the face of uncertainty: at each time $t \in [1, T]$,
- compute an upper confidence bound $\mathrm{UCB}_i$ on the expected reward $\mu_i$ of each arm $i \in [1, K]$.
- select the arm with the largest UCB.
(Lai and Robbins, 1985; Agrawal, 1995; Auer et al., 2002a)
Idea: a wrong arm $i$ cannot be selected for too long:
- by definition, $\mu_i \le \mu^* \le \mathrm{UCB}_{i^*}$.
- pulling arm $i$ often brings $\mathrm{UCB}_i$ closer to $\mu_i$.

13 Note on Concentration Inequalities
Let $X$ be a random variable such that for all $t \ge 0$, $\log E[e^{t(X - E[X])}] \le \psi(t)$, where $\psi$ is a convex function. For Hoeffding's inequality and $X \in [a, b]$, $\psi(t) = \frac{t^2 (b - a)^2}{8}$.
Then,
$\Pr[X - E[X] > \epsilon] = \Pr[e^{t(X - E[X])} > e^{t\epsilon}] \le \inf_{t > 0} e^{-t\epsilon} E[e^{t(X - E[X])}] \le \inf_{t > 0} e^{-t\epsilon} e^{\psi(t)} = e^{-\sup_{t > 0} (t\epsilon - \psi(t))} = e^{-\psi^*(\epsilon)},$
where $\psi^*$ is the Legendre transform of $\psi$.
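As a worked example (a standard calculation, not on the original slide): in the Hoeffding case,
$\psi^*(\epsilon) = \sup_{t > 0} \Big(t\epsilon - \frac{t^2 (b - a)^2}{8}\Big) = \frac{2 \epsilon^2}{(b - a)^2},$
the supremum being attained at $t = \frac{4\epsilon}{(b - a)^2}$. For rewards in $[0, 1]$, this gives $\psi^*(\epsilon) = 2\epsilon^2$, the value used for $\alpha$-UCB below.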

14 UCB Strategy
Average reward estimate for arm $i$ by time $t$: $\hat{\mu}_{i,t} = \frac{1}{T_i(t)} \sum_{s=1}^t X_{i,s} 1_{I_s = i}$.
Concentration inequality (e.g., Hoeffding's inequality): $\Pr\big[\mu_i - \frac{1}{t} \sum_{s=1}^t X_{i,s} > \epsilon\big] \le e^{-t \, \psi^*(\epsilon)}$.
Thus, for any $\delta > 0$, with probability at least $1 - \delta$,
$\mu_i < \frac{1}{t} \sum_{s=1}^t X_{i,s} + {\psi^*}^{-1}\Big(\frac{1}{t} \log \frac{1}{\delta}\Big).$

15 (α, ψ)-ucb Strategy >0 Parameter ; -UCB strategy consists of selecting at time t I t 2 argmax i2[1,k] (, ) apple bµ i,t log t T i (t 1). page 15

16 (α, ψ)-ucb Guarantee Theorem: for, the pseudo-regret of -UCB satisfies R T apple for Hoeffding s lemma, -UCB, R T apple >2 X i : i>0 X i : i>0 i 2 ( i (, ) i 2 ) log T + 2 ( ) =2 2 log T (Auer et al. 2002a), page 16

17 Proof
Lemma: for any $s \ge 0$,
$\sum_{t=1}^T 1_{I_t = i} \le s + \sum_{t = s + 1}^T 1_{I_t = i} \, 1_{T_i(t - 1) \ge s}.$
Proof: observe that
$\sum_{t=1}^T 1_{I_t = i} = \sum_{t=1}^T 1_{I_t = i} \, 1_{T_i(t - 1) < s} + \sum_{t=1}^T 1_{I_t = i} \, 1_{T_i(t - 1) \ge s}.$
Now, for $t_0 = \max \{t \le T : 1_{I_t = i} \, 1_{T_i(t - 1) < s} \ne 0\}$,
$\sum_{t=1}^T 1_{I_t = i} \, 1_{T_i(t - 1) < s} = \sum_{t=1}^{t_0} 1_{I_t = i} \, 1_{T_i(t - 1) < s}.$
By definition of $t_0$, the number of non-zero terms in that sum is at most $s$: each non-zero term corresponds to a pull of arm $i$, and $T_i(t_0 - 1) < s$.

18 Proof
For any $i$ and $t$, define $\epsilon_{i,t-1} = {\psi^*}^{-1}\big(\frac{\alpha \log t}{T_i(t - 1)}\big)$.
At time $t$, if arm $i$ is selected, then
$(\hat{\mu}_{i,t-1} + \epsilon_{i,t-1}) - (\hat{\mu}_{i^*,t-1} + \epsilon_{i^*,t-1}) \ge 0$
$\Longleftrightarrow [\hat{\mu}_{i,t-1} - \mu_i - \epsilon_{i,t-1}] + [2 \epsilon_{i,t-1} - \Delta_i] + [\mu^* - \hat{\mu}_{i^*,t-1} - \epsilon_{i^*,t-1}] \ge 0,$
using $\Delta_i = \mu^* - \mu_i$. Thus, at least one of these three bracketed terms must be non-negative.

19 Proof
To bound the pseudo-regret, we bound $E[T_i(T)]$. Observe first that, with $s = \big\lceil \frac{\alpha \log T}{\psi^*(\Delta_i / 2)} \big\rceil$,
$T_i(t - 1) \ge s \implies \frac{\alpha \log t}{T_i(t - 1)} \le \psi^*\Big(\frac{\Delta_i}{2}\Big) \implies \epsilon_{i,t-1} \le \frac{\Delta_i}{2} \implies 2 \epsilon_{i,t-1} - \Delta_i \le 0.$
Thus,
$E[T_i(T)] = E\Big[\sum_{t=1}^T 1_{I_t = i}\Big] \le s + E\Big[\sum_{t = s + 1}^T 1_{I_t = i} \, 1_{T_i(t - 1) \ge s}\Big]$
$\le s + \sum_{t = s + 1}^T \Big(\Pr[\hat{\mu}_{i,t-1} - \mu_i - \epsilon_{i,t-1} \ge 0] + \Pr[\mu^* - \hat{\mu}_{i^*,t-1} - \epsilon_{i^*,t-1} \ge 0]\Big),$
since, by the previous slide, when arm $i$ is selected and $2 \epsilon_{i,t-1} - \Delta_i \le 0$, one of the other two bracketed terms must be non-negative.

20 Proof
Each of the two probability terms can be bounded as follows using the union bound:
$\Pr[\mu^* - \hat{\mu}_{i^*,t-1} - \epsilon_{i^*,t-1} \ge 0] \le \Pr\bigg[\exists s \in [1, t] : \mu^* - \hat{\mu}_{i^*,s} \ge {\psi^*}^{-1}\Big(\frac{\alpha \log t}{s}\Big)\bigg] \le \sum_{s=1}^t e^{-s \cdot \frac{\alpha \log t}{s}} = \sum_{s=1}^t t^{-\alpha} = \frac{1}{t^{\alpha - 1}}.$
The final constant of the bound is obtained by further simple calculations.

21 Lower Bound
Theorem (Lai and Robbins, 1985): for any strategy such that $E[T_i(T)] = o(T^a)$ for any arm $i$ with $\Delta_i > 0$ and any $a > 0$, for any set of Bernoulli reward distributions, the following holds for all Bernoulli reward distributions:
$\liminf_{T \to +\infty} \frac{\bar{R}_T}{\log T} \ge \sum_{i : \Delta_i > 0} \frac{\Delta_i}{D(\mu_i \, \| \, \mu^*)}.$
A more general result holds for general distributions.

22 Notes
Observe that
$\sum_{i : \Delta_i > 0} \frac{\Delta_i}{D(\mu_i \, \| \, \mu^*)} \ge \mu^* (1 - \mu^*) \sum_{i : \Delta_i > 0} \frac{1}{\Delta_i},$
since, using $\log x \le x - 1$,
$D(\mu_i \, \| \, \mu^*) = \mu_i \log \frac{\mu_i}{\mu^*} + (1 - \mu_i) \log \frac{1 - \mu_i}{1 - \mu^*} \le \mu_i \, \frac{\mu_i - \mu^*}{\mu^*} + (1 - \mu_i) \, \frac{\mu^* - \mu_i}{1 - \mu^*} = \frac{(\mu_i - \mu^*)^2}{\mu^* (1 - \mu^*)} = \frac{\Delta_i^2}{\mu^* (1 - \mu^*)}.$

23 Outline
- Stochastic bandits.
- Adversarial bandits.

24 Adversarial Model
K arms: for each arm $i \in \{1, \ldots, K\}$,
- no stochastic assumption.
- rewards in $[0, 1]$.

25 Bandit Setting
For $t = 1$ to $T$ do
- player selects action $I_t \in \{1, \ldots, K\}$ (randomized).
- player receives reward $x_{I_t, t}$.
Notes:
- rewards $x_{i,t}$ for all arms are determined by the adversary simultaneously with the selection of an arm by the player.
- the adversary may be oblivious or non-oblivious (adaptive).
- deterministic strategies suffer regret at least $T/2$ for some (bad) sequences; thus, $I_t$ must be randomized.

26 Scenarios
Oblivious case: the adversary's rewards are selected independently of the player's actions; thus, the reward vector at time $t$ is only a function of $t$.
Non-oblivious case: the adversary's rewards at time $t$ are a function of the player's past actions $I_1, \ldots, I_{t-1}$.
- the notion of regret is then problematic: the cumulative reward is compared to a quantity that depends on the player's actions! (The single best action in hindsight is a function of the actions $I_1, \ldots, I_T$ played; playing that single best action throughout could have resulted in different rewards.)

27 Objectives
Minimize the regret (with $\ell_{i,t} = 1 - x_{i,t}$), in expectation or with high probability:
$R_T = \max_{i \in [1, K]} \sum_{t=1}^T x_{i,t} - \sum_{t=1}^T x_{I_t,t} = \sum_{t=1}^T \ell_{I_t,t} - \min_{i \in [1, K]} \sum_{t=1}^T \ell_{i,t}.$
Pseudo-regret:
$\bar{R}_T = E\Big[\sum_{t=1}^T \ell_{I_t,t}\Big] - \min_{i \in [1, K]} E\Big[\sum_{t=1}^T \ell_{i,t}\Big].$
By Jensen's inequality, $\bar{R}_T \le E[R_T]$.

28 Importance Weighting
In the bandit setting, the cumulative loss of each arm is not observed, so how should we update the probabilities?
Estimates via a surrogate loss:
$\tilde{\ell}_{i,t} = \frac{\ell_{i,t}}{p_{i,t}} 1_{I_t = i},$
where $p_t = (p_{1,t}, \ldots, p_{K,t})$ is the probability distribution the player uses at time $t$ to draw an arm ($p_{i,t} > 0$).
Unbiased estimate: for any $i$,
$E_{I_t \sim p_t}[\tilde{\ell}_{i,t}] = \sum_{j=1}^K p_{j,t} \, \frac{\ell_{i,t}}{p_{i,t}} 1_{j = i} = \ell_{i,t}.$
A numerical check follows below.
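A quick Monte Carlo sanity check of this unbiasedness; the distribution and losses here are made-up illustrative values.

import numpy as np

rng = np.random.default_rng(0)
K = 4
p = np.array([0.1, 0.2, 0.3, 0.4])       # player's sampling distribution p_t
losses = np.array([0.5, 0.2, 0.9, 0.7])  # true (unobserved) losses l_{i,t}

n = 200_000
draws = rng.choice(K, size=n, p=p)       # I_t ~ p_t, repeated n times
est = np.zeros((n, K))                   # surrogate losses: l/p on the drawn arm, 0 elsewhere
est[np.arange(n), draws] = losses[draws] / p[draws]
print(est.mean(axis=0))                  # approximately [0.5, 0.2, 0.9, 0.7]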

29 EXP3
EXP3 (Exponential weights for Exploration and Exploitation), (Auer et al., 2002b):

EXP3(K)
    p_1 ← (1/K, ..., 1/K)
    (L̃_{1,0}, ..., L̃_{K,0}) ← (0, ..., 0)
    for t ← 1 to T do
        Sample(I_t ∼ p_t)
        Receive(ℓ_{I_t,t})
        for i ← 1 to K do
            ℓ̃_{i,t} ← (ℓ_{i,t} / p_{i,t}) 1_{I_t = i}
            L̃_{i,t} ← L̃_{i,t-1} + ℓ̃_{i,t}
        for i ← 1 to K do
            p_{i,t+1} ← e^{-η L̃_{i,t}} / Σ_{j=1}^K e^{-η L̃_{j,t}}
    return p_{T+1}

A Python version is sketched below.
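A compact runnable sketch of EXP3, assuming an oblivious adversary given as a T × K loss matrix with entries in [0, 1]; the interface is illustrative, not from the slides.

import numpy as np

def exp3(loss_matrix, eta, rng=None):
    rng = rng or np.random.default_rng(0)
    T, K = loss_matrix.shape
    L_tilde = np.zeros(K)                  # cumulative surrogate losses
    total_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * (L_tilde - L_tilde.min()))   # shift for numerical stability
        p = w / w.sum()
        i = int(rng.choice(K, p=p))        # Sample(I_t ~ p_t)
        loss = loss_matrix[t, i]           # Receive(l_{I_t,t})
        total_loss += loss
        L_tilde[i] += loss / p[i]          # surrogate update; other arms unchanged
    return total_loss - loss_matrix.sum(axis=0).min()  # realized regret

With $\eta = \sqrt{2 \log K / (K T)}$, as on the next slide, the pseudo-regret is bounded by $\sqrt{2 K T \log K}$.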

30 EXP3 Guarantee
Theorem: the pseudo-regret of EXP3 can be bounded as follows:
$\bar{R}_T \le \frac{\log K}{\eta} + \frac{\eta K T}{2}.$
Choosing $\eta$ to minimize the bound gives $\bar{R}_T \le \sqrt{2 K T \log K}$ (see the calculation below).
Proof: similar to that of EG (Exponentiated Gradient), but we cannot use Hoeffding's inequality since $\tilde{\ell}_{i,t}$ is unbounded.
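The minimizing choice of $\eta$ is a one-line calculation, filled in here for completeness:
$\frac{d}{d\eta} \Big(\frac{\log K}{\eta} + \frac{\eta K T}{2}\Big) = -\frac{\log K}{\eta^2} + \frac{K T}{2} = 0 \implies \eta^* = \sqrt{\frac{2 \log K}{K T}},$
and substituting $\eta^*$ back gives $\frac{\log K}{\eta^*} + \frac{\eta^* K T}{2} = \sqrt{2 K T \log K}$.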

31 Proof
Potential: $\Phi_t = \log \sum_{i=1}^K e^{-\eta \tilde{L}_{i,t}}$.
Upper bound:
$\Phi_t - \Phi_{t-1} = \log \frac{\sum_{i=1}^K e^{-\eta \tilde{L}_{i,t}}}{\sum_{i=1}^K e^{-\eta \tilde{L}_{i,t-1}}} = \log E_{i \sim p_t}\big[e^{-\eta \tilde{\ell}_{i,t}}\big]$
$\le E_{i \sim p_t}\big[e^{-\eta \tilde{\ell}_{i,t}}\big] - 1 \qquad (\log x \le x - 1)$
$\le E_{i \sim p_t}\Big[-\eta \tilde{\ell}_{i,t} + \frac{\eta^2 \tilde{\ell}_{i,t}^2}{2}\Big] \qquad \Big(e^{-x} \le 1 - x + \frac{x^2}{2} \text{ for } x \ge 0\Big)$
$= -\eta \, E_{i \sim p_t}[\tilde{\ell}_{i,t}] + \frac{\eta^2}{2} E_{i \sim p_t}[\tilde{\ell}_{i,t}^2] = -\eta \, \ell_{I_t,t} + \frac{\eta^2}{2} \frac{\ell_{I_t,t}^2}{p_{I_t,t}} \le -\eta \, \ell_{I_t,t} + \frac{\eta^2}{2 \, p_{I_t,t}},$
using $E_{i \sim p_t}[\tilde{\ell}_{i,t}^2] = \sum_{i=1}^K p_{i,t} \frac{\ell_{i,t}^2}{p_{i,t}^2} 1_{I_t = i} = \frac{\ell_{I_t,t}^2}{p_{I_t,t}}$ and $\ell_{I_t,t} \le 1$.

32 Proof
Upper bound: summing up these inequalities and taking expectations yields
$E[\Phi_T - \Phi_0] \le -\eta \, E\Big[\sum_{t=1}^T \ell_{I_t,t}\Big] + \frac{\eta^2}{2} E\Big[\sum_{t=1}^T \frac{1}{p_{I_t,t}}\Big] = -\eta \, E\Big[\sum_{t=1}^T \ell_{I_t,t}\Big] + \frac{\eta^2 K T}{2},$
since $E_{I_t \sim p_t}\big[\frac{1}{p_{I_t,t}}\big] = \sum_{i=1}^K p_{i,t} \frac{1}{p_{i,t}} = K$.
Lower bound: for all $j \in [1, K]$,
$\Phi_T - \Phi_0 = \log \sum_{i=1}^K e^{-\eta \tilde{L}_{i,T}} - \log K \ge -\eta \tilde{L}_{j,T} - \log K,$
thus $E[\Phi_T - \Phi_0] \ge -\eta \, E[\tilde{L}_{j,T}] - \log K = -\eta \, E[L_{j,T}] - \log K$ by unbiasedness.
Comparison: for all $j \in [1, K]$,
$E\Big[\sum_{t=1}^T \ell_{I_t,t}\Big] - E[L_{j,T}] \le \frac{\log K}{\eta} + \frac{\eta K T}{2} \implies \bar{R}_T \le \frac{\log K}{\eta} + \frac{\eta K T}{2}.$

33 Notes
When $T$ is not known: standard doubling trick; or, use $\eta_t = \sqrt{\frac{\log K}{K t}}$, then $\bar{R}_T \le 2 \sqrt{K T \log K}$.
High-probability bounds:
- importance weighting problem: unbounded second moment, $E_{i \sim p_t}[\tilde{\ell}_{i,t}^2] = \frac{\ell_{I_t,t}^2}{p_{I_t,t}}$ (see (Cortes, Mansour, MM, 2010)).
- (Auer et al., 2002b): mixing $p_t$ with a uniform distribution to ensure a lower bound on $p_{i,t}$; but not sufficient for a high-probability bound.
- solution: biased estimate $\tilde{\ell}_{i,t} = \frac{\ell_{i,t} 1_{I_t = i} + \beta}{p_{i,t}}$, with $\beta > 0$ a parameter to tune.
A sketch of these two modifications appears below.
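A hedged sketch of how the two fixes above might be wired into the EXP3 loop; the mixing weight gamma and bias beta are illustrative parameters, not values prescribed by the slides.

import numpy as np

def exp3_mixed_biased(loss_matrix, eta, gamma=0.01, beta=0.01, rng=None):
    # (i) mix p_t with the uniform distribution so that p_{i,t} >= gamma / K;
    # (ii) use the biased estimate (l_{i,t} 1_{I_t=i} + beta) / p_{i,t}.
    rng = rng or np.random.default_rng(0)
    T, K = loss_matrix.shape
    L_tilde = np.zeros(K)
    total_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * (L_tilde - L_tilde.min()))
        p = (1 - gamma) * w / w.sum() + gamma / K   # uniform mixing
        i = int(rng.choice(K, p=p))
        total_loss += loss_matrix[t, i]
        obs = np.zeros(K)
        obs[i] = loss_matrix[t, i]                  # only arm I_t is observed
        L_tilde += (obs + beta) / p                 # biased estimate, all arms
    return total_loss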

34 Lower Bound
(Bubeck and Cesa-Bianchi, 2012)
It suffices to prove the lower bound in a stochastic setting for the pseudo-regret (and it therefore also holds for the expected regret).
Theorem: for any $T \ge 1$ and any player strategy, there exists a distribution of losses in $\{0, 1\}$ for which
$\bar{R}_T \ge \frac{1}{20} \sqrt{K T}.$

35 Notes
Bound of EXP3 matches the lower bound modulo the $\log$ term.
Log-free bound: $p_{i,t+1} = \phi(C_t - \tilde{L}_{i,t})$, where $C_t$ is a constant ensuring $\sum_{i=1}^K p_{i,t+1} = 1$ and $\phi$ is increasing, convex, and twice differentiable over $\mathbb{R}$ (Audibert and Bubeck, 2010):
- EXP3 coincides with $\phi(x) = e^{\eta x}$.
- formulation as mirror descent.
- only in the oblivious case.
- log-free bound with $\phi(x) = (-\eta x)^{-q}$ and $q = 2$.

36 References
R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, vol. 27, pp. 1054-1078, 1995.
Jean-Yves Audibert and Sébastien Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, vol. 11, pp. 2785-2836, 2010.
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, vol. 47, no. 2-3, pp. 235-256, 2002a.
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, vol. 32, no. 1, pp. 48-77, 2002b.
Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1-122, 2012.

37 References
Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In NIPS, 2010.
T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, vol. 6, pp. 4-22, 1985.
Gilles Stoltz. Incomplete information and internal regret in prediction of individual sequences. Ph.D. thesis, Université Paris-Sud, 2005.
R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.
W. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, vol. 25, pp. 285-294, 1933.
