Introduction to Bandit Algorithms


Stochastic K-Arm Bandit: Problem Formulation

Consider K arms (actions), each corresponding to an unknown distribution $\nu_k$, $k = 1, \dots, K$, with values bounded in $[0, 1]$. At each time $t$, the agent pulls an arm $I_t \in \{1, \dots, K\}$ and observes a reward $x_t \sim \nu_{I_t}$ (an i.i.d. sample from $\nu_{I_t}$). The objective is to maximize the expected sum of rewards.

Notation:
- mean of arm $k$: $\mu_k = \mathbb{E}_{X \sim \nu_k}[X]$
- mean of the best arm: $\mu^* = \max_k \mu_k$
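For concreteness, the following is a minimal simulation sketch (not part of the lecture) of the setting above with Bernoulli arms; the class name, the means, and the seed are hypothetical choices made only for the example.

```python
import numpy as np

class BernoulliBandit:
    """Stochastic K-arm bandit whose arm k pays 1 with probability mu_k (illustrative sketch)."""

    def __init__(self, means, seed=0):
        self.means = np.asarray(means)            # mu_k for each arm
        self.best_mean = float(self.means.max())  # mu* = max_k mu_k
        self.rng = np.random.default_rng(seed)

    def pull(self, k):
        # Reward x_t is an i.i.d. sample from nu_k, here Bernoulli(mu_k), bounded in [0, 1]
        return float(self.rng.random() < self.means[k])

# Hypothetical instance: 3 arms with means 0.3, 0.5, 0.7 (arm 2 is optimal)
bandit = BernoulliBandit([0.3, 0.5, 0.7])
```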

Stochastic K-Arm Bandit: Regret

To evaluate the performance of a strategy, we use the cumulative regret
$$R_n = n\mu^* - \sum_{t=1}^n x_t.$$
Objective: find a strategy with small expected cumulative regret $\mathbb{E}[R_n]$.

Note that
$$\mathbb{E}[R_n] = n\mu^* - \mathbb{E}\Big[\sum_{t=1}^n \mu_{I_t}\Big] = \mathbb{E}\Big[\sum_{k=1}^K T_k(n)\,(\mu^* - \mu_k)\Big] = \mathbb{E}\Big[\sum_{k=1}^K T_k(n)\,\Delta_k\Big],$$
where $T_k(n) = \sum_{t=1}^n \mathbf{1}\{I_t = k\}$ is the total number of times that arm $k$ has been pulled up to time $n$, and $\Delta_k = \mu^* - \mu_k$ is the gap between the optimal arm and arm $k$.

UCB Strategy

The upper confidence bound (UCB) strategy selects at time $t$ the arm
$$I_t = \arg\max_k B_{t, T_k(t-1)}(k), \qquad B_{t,s}(k) = \hat\mu_{k,s} + \sqrt{\frac{2\log t}{s}},$$
where $\hat\mu_{k,s} = \frac{1}{s}\sum_{i=1}^s x_{k,i}$ is the empirical mean of arm $k$ computed from $s$ samples.

UCB is a strategy based on the "optimism in the face of uncertainty" principle: $B_{t, T_k(t-1)}(k)$ represents an upper bound on $\mu_k$ that holds with high probability.
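Here is a minimal sketch of the UCB rule above in Python, assuming the Bernoulli environment sketched earlier; each arm is pulled once at the start so that the empirical means are well defined.

```python
import numpy as np

def ucb(bandit, K, n):
    """Run UCB for n rounds on a bandit exposing pull(k); returns the sequence of rewards."""
    counts = np.zeros(K)   # T_k(t-1): number of pulls of arm k so far
    sums = np.zeros(K)     # cumulative observed reward of arm k
    rewards = np.zeros(n)
    for t in range(1, n + 1):
        if t <= K:
            arm = t - 1    # initialization: pull each arm once
        else:
            means = sums / counts                    # empirical means mu_hat_{k, T_k(t-1)}
            bonus = np.sqrt(2 * np.log(t) / counts)  # exploration bonus sqrt(2 log t / s)
            arm = int(np.argmax(means + bonus))      # optimism in the face of uncertainty
        x = bandit.pull(arm)
        counts[arm] += 1
        sums[arm] += x
        rewards[t - 1] = x
    return rewards

# Usage with the earlier example: regret = n * bandit.best_mean - ucb(bandit, 3, 10_000).sum()
```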

Chernoff-Hoeffding Inequality

Proposition. Let $X_i \in [a_i, b_i]$ be independent random variables with $\mu_i = \mathbb{E}[X_i]$. Then we have
$$\mathbb{P}\Big(\Big|\sum_{i=1}^n (X_i - \mu_i)\Big| \ge \epsilon\Big) \le 2\exp\Big(\frac{-2\epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\Big).$$
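The inequality is easy to check numerically; the sketch below (an illustration under assumed parameters, not part of the lecture) compares the empirical deviation probability of a sum of uniform [0, 1] variables with the bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 50, 5.0, 100_000   # hypothetical parameters for the check

# X_i ~ Uniform[0, 1], so mu_i = 1/2 and (b_i - a_i)^2 = 1 for every i
sums = rng.random((trials, n)).sum(axis=1)
empirical = np.mean(np.abs(sums - 0.5 * n) >= eps)
bound = 2 * np.exp(-2 * eps**2 / n)
print(f"empirical: {empirical:.4f}  vs  Chernoff-Hoeffding bound: {bound:.4f}")
```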

Proof of Chernoff-Hoeffding Inequality

$$\mathbb{P}\Big(\sum_{i=1}^n (X_i - \mu_i) \ge \epsilon\Big) = \mathbb{P}\Big(\exp\Big(s\sum_{i=1}^n (X_i - \mu_i)\Big) \ge \exp(s\epsilon)\Big) \overset{(a)}{\le} \exp(-s\epsilon)\,\mathbb{E}\Big[\exp\Big(s\sum_{i=1}^n (X_i - \mu_i)\Big)\Big]$$
$$\overset{(b)}{=} \exp(-s\epsilon)\prod_{i=1}^n \mathbb{E}\big[\exp\big(s(X_i - \mu_i)\big)\big] \overset{(c)}{\le} \exp(-s\epsilon)\prod_{i=1}^n \exp\big(s^2 (b_i - a_i)^2/8\big) = \exp\Big(-s\epsilon + s^2 \sum_{i=1}^n (b_i - a_i)^2/8\Big),$$
where (a) is by Markov's inequality, (b) by independence of the random variables, and (c) by Hoeffding's lemma.

By selecting $s = 4\epsilon / \sum_{i=1}^n (b_i - a_i)^2$, we obtain
$$\mathbb{P}\Big(\sum_{i=1}^n (X_i - \mu_i) \ge \epsilon\Big) \le \exp\Big(\frac{-2\epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\Big).$$
The result follows by repeating the same procedure for $\mathbb{P}\big(\sum_{i=1}^n (X_i - \mu_i) \le -\epsilon\big)$.

Chernoff-Hoeffding Inequality & UCB Strategy

Using the Chernoff-Hoeffding inequality (applied to the i.i.d. rewards of a single arm), we have
$$\mathbb{P}\Big(\frac{1}{s}\sum_{i=1}^s X_i - \mu \ge \epsilon\Big) \le \exp(-2s\epsilon^2), \qquad \mathbb{P}\Big(\frac{1}{s}\sum_{i=1}^s X_i - \mu \le -\epsilon\Big) \le \exp(-2s\epsilon^2). \tag{1}$$
If we set $\epsilon = \sqrt{\frac{2\log t}{s}}$, Eq. 1 may be written as
$$\mathbb{P}\Big(\hat\mu_{k,s} + \sqrt{\tfrac{2\log t}{s}} \le \mu_k\Big) \le \exp(-4\log t) = t^{-4}, \qquad \mathbb{P}\Big(\hat\mu_{k,s} - \sqrt{\tfrac{2\log t}{s}} \ge \mu_k\Big) \le \exp(-4\log t) = t^{-4}. \tag{2}$$

UCB Regret Bound

Proposition. Each sub-optimal arm $k$ is pulled on average at most
$$\mathbb{E}[T_k(n)] \le \frac{8\log n}{\Delta_k^2} + \frac{\pi^2}{3}$$
times. As a result, the expected cumulative regret of UCB is bounded as
$$\mathbb{E}[R_n] = \sum_k \Delta_k\,\mathbb{E}[T_k(n)] \le 8\sum_{k:\,\Delta_k > 0} \frac{\log n}{\Delta_k} + \frac{K\pi^2}{3}.$$
The expected cumulative regret of UCB is logarithmic in $n$.
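For a sense of scale, here is a hypothetical numerical instance of the per-arm bound (the horizon and the gap are chosen only for illustration).

```python
import numpy as np

n, gap = 10_000, 0.2   # hypothetical horizon n and gap Delta_k
pulls = 8 * np.log(n) / gap**2 + np.pi**2 / 3
print(f"E[T_k(n)] <= {pulls:.0f} pulls of the sub-optimal arm out of n = {n} rounds")
```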

Proof of UCB Regret Bound (I)

Gap of arm $k$: $\Delta_k = \mu^* - \mu_k$.

Idea of the proof:
1. Show that each sub-optimal arm $k$ cannot be pulled more than $\frac{8\log n}{\Delta_k^2}$ times plus a small constant value, or
2. Show that the event that a sub-optimal arm $k$ is pulled at a time $t$ at which $T_k(t-1) \ge \frac{8\log t}{\Delta_k^2}$ has a small probability.

Proof of UCB Regret Bound (II)

Using Eq. 2, for all $s$, $t$, and $k$ we may write
$$\mu_k \le \hat\mu_{k,s} + \sqrt{\tfrac{2\log t}{s}} \quad \text{w.p. } 1 - t^{-4}, \tag{3}$$
$$\hat\mu_{k,s} \le \mu_k + \sqrt{\tfrac{2\log t}{s}} \quad \text{w.p. } 1 - t^{-4}. \tag{4}$$
If a sub-optimal arm $k$ is pulled at time $t$, we know that
$$B_{t, T_k(t-1)}(k) \ge B_{t, T_{k^*}(t-1)}(k^*) \;\Longleftrightarrow\; \hat\mu_{k, T_k(t-1)} + \sqrt{\tfrac{2\log t}{T_k(t-1)}} \ge \hat\mu_{k^*, T_{k^*}(t-1)} + \sqrt{\tfrac{2\log t}{T_{k^*}(t-1)}}. \tag{5}$$
If Eqs. 3 and 4 hold, then by upper-bounding the left-hand side of Eq. 5 and lower-bounding the right-hand side of Eq. 5, we have
$$\mu_k + 2\sqrt{\tfrac{2\log t}{T_k(t-1)}} \ge \mu^* \;\Longrightarrow\; T_k(t-1) \le \frac{8\log t}{\Delta_k^2}.$$

Proof of UCB Regret Bound (III)

For any integer $u$, we may write
$$T_k(n) \le u + \sum_{t=u+1}^n \mathbf{1}\big\{I_t = k;\ T_k(t-1) \ge u\big\} \le u + \sum_{t=u+1}^n \mathbf{1}\big\{\exists s:\ u < s \le t,\ \exists s':\ 1 \le s' \le t,\ B_{t,s}(k) \ge B_{t,s'}(k^*)\big\}$$
$$\le u + \sum_{t=u+1}^n \sum_{s=u+1}^t \sum_{s'=1}^t \mathbf{1}\big\{B_{t,s}(k) \ge B_{t,s'}(k^*)\big\}. \tag{6}$$
We set $u = \big\lceil \frac{8\log n}{\Delta_k^2} \big\rceil$; therefore, for any $s$ with $u < s \le t$,
$$s > u \ge \frac{8\log n}{\Delta_k^2} \ge \frac{8\log t}{\Delta_k^2} \;\Longrightarrow\; \mu_k + 2\sqrt{\tfrac{2\log t}{s}} < \mu^*. \tag{7}$$
It is easy to show that the event $\big\{B_{t,s}(k) \ge B_{t,s'}(k^*)\big\}$ together with Eq. 7 implies that either $\mu^* - \sqrt{\tfrac{2\log t}{s'}} > \hat\mu_{k^*,s'}$ or $\hat\mu_{k,s} > \mu_k + \sqrt{\tfrac{2\log t}{s}}$.

Proof of UCB Regret Bound (IV)

So Eq. 6 may be rewritten as
$$T_k(n) \le \frac{8\log n}{\Delta_k^2} + \sum_{t=u+1}^n \sum_{s=u+1}^t \sum_{s'=1}^t \Big[\mathbf{1}\Big\{\mu^* - \sqrt{\tfrac{2\log t}{s'}} > \hat\mu_{k^*,s'}\Big\} + \mathbf{1}\Big\{\hat\mu_{k,s} > \mu_k + \sqrt{\tfrac{2\log t}{s}}\Big\}\Big]. \tag{8}$$
Taking the expectation of both sides of Eq. 8, we obtain
$$\mathbb{E}[T_k(n)] \le \frac{8\log n}{\Delta_k^2} + \sum_{t=u+1}^n \sum_{s=u+1}^t \sum_{s'=1}^t \Big[\mathbb{P}\Big(\mu^* - \sqrt{\tfrac{2\log t}{s'}} > \hat\mu_{k^*,s'}\Big) + \mathbb{P}\Big(\hat\mu_{k,s} > \mu_k + \sqrt{\tfrac{2\log t}{s}}\Big)\Big]$$
$$\le \frac{8\log n}{\Delta_k^2} + \sum_{t=u+1}^n \sum_{s=u+1}^t \sum_{s'=1}^t \big(t^{-4} + t^{-4}\big) \le \frac{8\log n}{\Delta_k^2} + \sum_{t=u+1}^n 2\,t^{-2} \le \frac{8\log n}{\Delta_k^2} + \frac{\pi^2}{3}.$$

UCB Regret Bound

Proposition. We have the following uniform bound on the expected cumulative regret:
$$\mathbb{E}[R_n] \le \sqrt{8Kn\Big(\log n + \frac{\pi^2}{3}\Big)}.$$
Proof: Using the Cauchy-Schwarz inequality, we may write
$$\mathbb{E}[R_n] = \sum_k \Delta_k\,\mathbb{E}[T_k(n)] = \sum_k \Delta_k\sqrt{\mathbb{E}[T_k(n)]}\,\sqrt{\mathbb{E}[T_k(n)]} \le \sqrt{\sum_k \Delta_k^2\,\mathbb{E}[T_k(n)]}\;\sqrt{\sum_k \mathbb{E}[T_k(n)]} \le \sqrt{8K\Big(\log n + \frac{\pi^2}{3}\Big)}\,\sqrt{n},$$
where the last step uses $\Delta_k^2\,\mathbb{E}[T_k(n)] \le 8\log n + \frac{\pi^2}{3}$ (since $\Delta_k \le 1$) and $\sum_k \mathbb{E}[T_k(n)] = n$.

UCB Regret: Lower Bound

Asymptotic lower bound (for a rich class of distributions) [Lai & Robbins, 1985] of the form
$$\limsup_{n\to\infty} \frac{\mathbb{E}[T_k(n)]}{\log n} \ge \frac{1}{\mathrm{KL}(\nu_k, \nu^*)},$$
where $\mathrm{KL}(\nu, \nu') = \int \log\!\big(\tfrac{d\nu}{d\nu'}\big)\,d\nu$, and thus $\mathbb{E}[R_n] = \Omega(\log n)$.

Non-asymptotic minimax lower bound:
$$\inf_{\text{alg.}}\ \sup_{\text{problem}}\ \mathbb{E}[R_n] = \Omega(\sqrt{nK}).$$

Adversarial Bandit

Setting: the rewards are not necessarily i.i.d.; they are arbitrarily selected by an adversary. In this case, we cannot hope to perform as well as if we had known the rewards in advance, because the adversary does not reveal its choices beforehand. Therefore, we compare the performance of our algorithm with the performance obtained by a class of strategies, and try to be almost as good as the best strategy in that class.

We may consider two types of problems according to the information received at each round: 1) full information, and 2) bandit (partial) information.

At each time $t = 1, \dots, n$:
- the adversary selects rewards $x_t(1), \dots, x_t(K)$ (the reward values are in $[0, 1]$) without revealing them to the player;
- the player selects an arm $I_t$;
- full information: the player observes the rewards of all the arms, $x_t(k)$, $k = 1, \dots, K$;
- bandit information: the player observes only the reward of the selected arm, $x_t(I_t)$.

Adversarial Bandit: Regret

If we compare the performance of our algorithm with the class of constant strategies, the regret with respect to the constant strategy $k$ is defined as
$$R_n(k) = \sum_{t=1}^n x_t(k) - \sum_{t=1}^n x_t(I_t).$$
The player's strategy can be stochastic, in which case we consider the expected regret against the best constant strategy,
$$R_n = \max_{1\le k\le K} \mathbb{E}[R_n(k)],$$
and the goal is to design an algorithm that is good for all reward sequences generated by the adversary, that is, an algorithm for which $\sup_{x_1,\dots,x_n} R_n$ is small.

Adversarial Bandit - Full Information

Exponentially Weighted Forecaster (EWF)

Initialization: $w_1(k) = 1$ for all $k = 1, \dots, K$.

At each time $t = 1, \dots, n$, the player selects an arm $I_t \sim p_t$, where
$$p_t(k) = \frac{w_t(k)}{\sum_{i=1}^K w_t(i)}, \qquad w_t(k) = e^{\eta \sum_{s=1}^{t-1} x_s(k)},$$
and $\eta > 0$ is the parameter of the algorithm.
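Here is a minimal sketch of EWF; the adversary's rewards are assumed to be given as a pre-built matrix x of shape (n, K), an assumption made only so the simulation is self-contained.

```python
import numpy as np

def ewf(x, eta, seed=0):
    """Exponentially Weighted Forecaster on a full-information reward sequence x (n rounds x K arms)."""
    rng = np.random.default_rng(seed)
    n, K = x.shape
    cum = np.zeros(K)   # cumulative rewards sum_{s < t} x_s(k)
    total = 0.0
    for t in range(n):
        w = np.exp(eta * (cum - cum.max()))   # weights w_t(k), shifted for numerical stability
        p = w / w.sum()                       # p_t(k) = w_t(k) / sum_i w_t(i)
        arm = rng.choice(K, p=p)
        total += x[t, arm]
        cum += x[t]                           # full information: every arm's reward is observed
    return total
```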

Exponentially Weighted Forecaster (EWF)

Proposition (EWF Regret Bound). Let $\eta > 0$. Then the regret of EWF is bounded as
$$R_n \le \frac{\log K}{\eta} + \frac{\eta n}{8}.$$
If we select $\eta = \sqrt{\frac{8\log K}{n}}$, the regret of EWF is bounded as
$$R_n \le \sqrt{\frac{n\log K}{2}}.$$
Remarks:
- The lower bound for this problem is of the same order: $R_n = \Omega(\sqrt{n\log K})$.
- The number of arms appears only logarithmically in the regret bound.
- It is possible to have an anytime algorithm (the horizon $n$ is not necessarily known) by setting $\eta_t = O\big(\sqrt{\log K / t}\big)$.

EWF Regret Bound

Proof: Let $W_t = \sum_{k=1}^K w_t(k)$. Then we may write
$$\frac{W_{t+1}}{W_t} = \sum_{k=1}^K \frac{w_t(k)\,e^{\eta x_t(k)}}{W_t} = \sum_{k=1}^K p_t(k)\,e^{\eta x_t(k)} = \mathbb{E}_{I\sim p_t}\big[e^{\eta x_t(I)}\big] = e^{\eta\,\mathbb{E}[x_t(I)]}\;\mathbb{E}_{I\sim p_t}\Big[e^{\eta\left(x_t(I) - \mathbb{E}_{J\sim p_t}[x_t(J)]\right)}\Big] \le e^{\eta\,\mathbb{E}[x_t(I)]}\,e^{\eta^2/8},$$
using Hoeffding's lemma in the last step. Therefore,
$$\log\frac{W_{n+1}}{W_1} \le \eta\,\mathbb{E}\Big[\sum_{t=1}^n x_t(I_t)\Big] + \frac{n\eta^2}{8}.$$
Now, for any $k = 1, \dots, K$, we may also write
$$\log\frac{W_{n+1}}{W_1} = \log\sum_{k'=1}^K e^{\eta\sum_{t=1}^n x_t(k')} - \log K \ge \eta\sum_{t=1}^n x_t(k) - \log K.$$
So for any $k = 1, \dots, K$, we have
$$\mathbb{E}[R_n(k)] = \sum_{t=1}^n x_t(k) - \mathbb{E}\Big[\sum_{t=1}^n x_t(I_t)\Big] \le \frac{\log K}{\eta} + \frac{n\eta}{8}.$$

Adversarial Bandit - Bandit Information

Exploration-Exploitation using Exponential Weights (EXP3)

Initialization: $w_1(k) = 1$ for all $k = 1, \dots, K$.

At each time $t = 1, \dots, n$, the player selects an arm $I_t \sim p_t$, where
$$p_t(k) = (1-\beta)\,\frac{w_t(k)}{\sum_{i=1}^K w_t(i)} + \frac{\beta}{K}, \qquad w_t(k) = e^{\eta \sum_{s=1}^{t-1} \tilde x_s(k)}, \qquad \tilde x_s(k) = \frac{x_s(k)}{p_s(k)}\,\mathbf{1}\{I_s = k\}.$$
Here $\eta > 0$ and $\beta > 0$ are the parameters of the algorithm.
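Here is a minimal sketch of EXP3 with the importance-weighted estimates defined above, under the same assumption as the EWF sketch that the adversary's rewards are stored in a matrix x of shape (n, K).

```python
import numpy as np

def exp3(x, eta, beta, seed=0):
    """EXP3 on a bandit-information reward sequence x (n rounds x K arms)."""
    rng = np.random.default_rng(seed)
    n, K = x.shape
    cum_est = np.zeros(K)   # cumulative estimated rewards sum_{s < t} x_tilde_s(k)
    total = 0.0
    for t in range(n):
        w = np.exp(eta * (cum_est - cum_est.max()))   # weights w_t(k), shifted for stability
        p = (1 - beta) * w / w.sum() + beta / K       # mixture with uniform exploration
        arm = rng.choice(K, p=p)
        reward = x[t, arm]                            # only the pulled arm's reward is observed
        total += reward
        cum_est[arm] += reward / p[arm]               # x_tilde_t(arm) = x_t(arm) / p_t(arm)
    return total
```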

EXP3

Proposition (EXP3 Regret Bound). Let $\beta = \eta K \le 1$. Then the regret of EXP3 is bounded as
$$R_n \le \frac{\log K}{\eta} + (e-1)\,\eta n K.$$
If we select $\eta = \sqrt{\frac{\log K}{(e-1)nK}}$, the regret of EXP3 is bounded as
$$R_n \le 2.63\,\sqrt{nK\log K}.$$
Remark: the lower bound for this problem is $R_n \ge \frac{1}{20}\sqrt{nK}$.
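As a quick usage note, the tuned parameters and the resulting bound can be evaluated numerically (the horizon and number of arms below are hypothetical).

```python
import numpy as np

n, K = 10_000, 10                                   # hypothetical horizon and number of arms
eta = np.sqrt(np.log(K) / ((np.e - 1) * n * K))     # tuned learning rate
beta = eta * K                                      # exploration parameter beta = eta * K
bound = np.log(K) / eta + (np.e - 1) * eta * n * K  # about 2.63 * sqrt(n * K * log K)
print(f"eta = {eta:.5f}, beta = {beta:.4f}, regret bound ~ {bound:.0f}")
```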

EXP3 Regret Bound I

Proof: Let $W_t = \sum_{k=1}^K w_t(k)$. Note that
$$\mathbb{E}_{I_s\sim p_s}[\tilde x_s(k)] = \sum_{i=1}^K p_s(i)\,\frac{x_s(k)}{p_s(k)}\,\mathbf{1}\{i = k\} = x_s(k) \qquad\text{and}\qquad \mathbb{E}_{I_s\sim p_s}[\tilde x_s(I_s)] = \sum_{i=1}^K p_s(i)\,\frac{x_s(i)}{p_s(i)} \le K.$$
Therefore, we have
$$\frac{W_{t+1}}{W_t} = \sum_{k=1}^K \frac{w_t(k)\,e^{\eta\tilde x_t(k)}}{W_t} = \sum_{k=1}^K \frac{p_t(k) - \beta/K}{1-\beta}\,e^{\eta\tilde x_t(k)} \overset{(a)}{\le} \sum_{k=1}^K \frac{p_t(k) - \beta/K}{1-\beta}\,\big(1 + \eta\tilde x_t(k) + (e-2)\,\eta^2\tilde x_t(k)^2\big)$$
$$\le 1 + \frac{1}{1-\beta}\sum_{k=1}^K p_t(k)\,\big(\eta\tilde x_t(k) + (e-2)\,\eta^2\tilde x_t(k)^2\big).$$
(a) This is because $\eta\tilde x_t(k) \le \eta K/\beta \le 1$ and $e^x \le 1 + x + (e-2)x^2$ for $x \le 1$.

EXP3 Regret Bound II

Proof (continued): Since $\log(1+x) \le x$,
$$\log\frac{W_{t+1}}{W_t} \le \frac{1}{1-\beta}\sum_{k=1}^K p_t(k)\,\big(\eta\tilde x_t(k) + (e-2)\,\eta^2\tilde x_t(k)^2\big),$$
so, summing over $t$, we have
$$\log\frac{W_{n+1}}{W_1} \le \frac{1}{1-\beta}\sum_{t=1}^n\sum_{k=1}^K p_t(k)\,\big(\eta\tilde x_t(k) + (e-2)\,\eta^2\tilde x_t(k)^2\big).$$
For any $k = 1, \dots, K$, we also have
$$\log\frac{W_{n+1}}{W_1} = \log\sum_{k'=1}^K e^{\eta\sum_{t=1}^n \tilde x_t(k')} - \log K \ge \eta\sum_{t=1}^n \tilde x_t(k) - \log K.$$

EXP3 Regret Bound III

Proof (continued): Combining the two bounds and taking the expectation, for any $k = 1, \dots, K$ we may write
$$(1-\beta)\sum_{t=1}^n x_t(k) - \mathbb{E}\Big[\sum_{t=1}^n\sum_{i=1}^K p_t(i)\,\tilde x_t(i)\Big] \le (1-\beta)\,\frac{\log K}{\eta} + (e-2)\,\eta\,\mathbb{E}\Big[\sum_{t=1}^n\sum_{k'=1}^K p_t(k')\,\tilde x_t(k')^2\Big].$$
Since $\sum_{k'} p_t(k')\,\tilde x_t(k') = x_t(I_t)$ and $\mathbb{E}\big[\sum_{k'} p_t(k')\,\tilde x_t(k')^2\big] = \sum_{k'} x_t(k')^2 \le K$, this gives
$$\sum_{t=1}^n x_t(k) - \mathbb{E}\Big[\sum_{t=1}^n x_t(I_t)\Big] \le \frac{\log K}{\eta} + \beta n + (e-2)\,\eta n K,$$
and therefore, using $\beta = \eta K$,
$$\mathbb{E}[R_n(k)] \le \frac{\log K}{\eta} + (e-1)\,\eta n K.$$
