Two optimization problems in a stochastic bandit model

Size: px

Start display at page:

Download "Two optimization problems in a stochastic bandit model"

Kerrie Leonard
5 years ago
Views:

1 Two optimization problems in a stochastic bandit model Emilie Kaufmann joint work with Olivier Cappé, Aurélien Garivier and Shivaram Kalyanakrishnan Journées MAS 204, Toulouse

2 Outline From stochastic optimization to bandit problems 2/ Emilie Kaufmann Two optimization problems in a stochastic bandit model

3 Stochastic optimization Bandit models: classical framework Classical framework in stochastic optimization f : X R max f (a)? a X Sequential observations: at time t, choose a t X, observe After T observations, x t = f (a t ) + ɛ t Minimize the optimization error If a T is a guess of the argmax minimize E [f ( a T ) f (a )] Minimize the regret [ T ] minimize E (f (a ) f (a t )) 3/ Emilie Kaufmann Two optimization problems in a stochastic bandit model t=

4 Stochastic optimization Bandit models: classical framework A particular case: the bandit model f : {,..., K} R max f (a)? a=,...,k Sequential observations: at time t, choose A t {,..., K}, observe X t ν At where ν a has mean f (a) After T observations, Minimize the probability of error If A T is a guess of the argmax minimize P ( A T A ) Minimize the regret [ T ] minimize E (f (A ) f (A t )) t= 4/ Emilie Kaufmann Two optimization problems in a stochastic bandit model

5 Two bandit problems Stochastic optimization Bandit models: classical framework A binary bandit model is a set of K arms, where arm a is a Bernoulli distribution with mean µ a drawing arm a is observing a realization of B(µ a ) arms are assumed to be independent In a bandit game, at round t, an agent chooses arm A t based on past observations, according to his sampling strategy, or bandit algorithm observes a sample X t B(µ At ) Two possible objectives can be considered best arm identification regret minimization 5/ Emilie Kaufmann Two optimization problems in a stochastic bandit model

6 Zoom on an application Stochastic optimization Bandit models: classical framework A doctor can choose between K different treatments treatment number a: (unknown) probability of sucess µ a (unknown) best treatment: a = argmax a µ a If treatment a is given to patient t, he is cured with probability µ a The doctor: chooses treatment A t to give to patient t observes whether the patient is healed : X t B(µ At ) His goal: ajust (A t ) so that to maximize the number of patients healed during a study involving T patients identify the best treatment with high probability (and always give this one later) 6/ Emilie Kaufmann Two optimization problems in a stochastic bandit model

7 Outline Performance criterion Bandit algorithms for regret minimization From stochastic optimization to bandit problems 7/ Emilie Kaufmann Two optimization problems in a stochastic bandit model

8 Performance criterion Bandit algorithms for regret minimization Asymptotically optimal algorithms N a (t) be the number of draws of arm a up to time t [ T ] K R T = E (µ µ At ) = (µ µ a )E[N a (T )] t= a= [Lai and Robbins,985]: every consistent algorithm satisfies µ a < µ lim inf T E[N a (T )] log T d(µ a, µ a ) A bandit algorithm is asymptotically optimal if where µ a < µ lim sup n E[N a (T )] log T d(x, y) = KL(B(x), B(y)). d(µ a, µ a ) 8/ Emilie Kaufmann Two optimization problems in a stochastic bandit model

9 Performance criterion Bandit algorithms for regret minimization A family of optimistic index policies For each arm a, compute a confidence interval on µ a : µ a UCB a (t) w.h.p Act as if the best possible model was the true model (optimism-in-face-of-uncertainty): A t = argmax a UCB a (t) Example UCB [Auer et al. 02] uses Hoeffding bounds: UCB a (t) = S a(t) N a (t) + 2 log(t) N a (t). S a (t): sum of the rewards collected from arm a up to time t. E[N a (T )] 8 (µ log T + C. µ a ) 2 9/ Emilie Kaufmann Two optimization problems in a stochastic bandit model

10 Performance criterion Bandit algorithms for regret minimization KL-UCB: an asymptotically optimal algorithm KL-UCB [Cappé et al. 203] uses the index: { ( ) } Sa (t) u a (t) = argmax d x> Sa(t) N a (t), x log(t) + c log log(t) N a (t) Na(t) with d(p, q) = KL (B(p), B(q)) d(s a (t)/n a (t),q) 2(q S a (t)/n a (t)) log(t)/n a (t) E[N a (T )] S a (t)/n a (t) [ [ ]] q d(µ a, µ log T + o(log log(t )). ) 0/ Emilie Kaufmann Two optimization problems in a stochastic bandit model

11 Outline From stochastic optimization to bandit problems / Emilie Kaufmann Two optimization problems in a stochastic bandit model

12 m best arms identification Assume µ µ m > µ m+... µ K (Bernoulli bandit model) Parameters and notations m the number of arms to find δ ]0, [ a risk parameter S m = {,..., m} the set of m optimal arms The forecaster chooses at time t one (or several) arms to draw decides to stop after a (possibly random) total number of samples from the arms τ recommends a set Ŝ of m arms His goal (in the fixed-confidence setting) P(Ŝ = S m) δ (the algorithm is δ-pac) The sample complexity E[τ] is small 2/ Emilie Kaufmann Two optimization problems in a stochastic bandit model

13 Challenges for The regret minimization problem is solved in some sense: A lower bound on the regret of any good algorithm R K T lim inf T log(t ) µ µ a d(µ a, µ ) a=2 Algorithms matching this bound, notably KL-UCB 3/ Emilie Kaufmann Two optimization problems in a stochastic bandit model

14 Challenges for The regret minimization problem is solved in some sense: A lower bound on the regret of any good algorithm R K T lim inf T log(t ) µ µ a d(µ a, µ ) a=2 Algorithms matching this bound, notably KL-UCB For, we would want to give: A lower bound on the sample complexity E[τ] of any δ-pac algorithm, featuring informational quantities δ-pac algorithms matching this bound 3/ Emilie Kaufmann Two optimization problems in a stochastic bandit model

15 A general lower bound Theorem [K.,Cappé, Garivier 4] Any algorithm that is δ-pac on every binary bandit model such that µ m > µ m+ satisfies, for δ 0.5, E[τ] ( m t= d(µ a, µ m+ ) + K t=m+ ) log d(µ a, µ m ) 2δ This result follows from changes of distributions: Lemma ν = (ν, ν 2,..., ν K ), ν = (ν, ν 2,..., ν K ) two bandit models, A F τ, K E ν [N a ]KL(ν a, ν a) d(p ν (A), P ν (A)). a= 4/ Emilie Kaufmann Two optimization problems in a stochastic bandit model

16 An algorithm: KL-LUCB Generic notation: confidence interval (C.I.) on the mean of arm a at round t: I a (t) = [L a (t), U a (t)] J(t) the set of m arms with highest empirical means Our contribution: Introduce KL-based confidence intervals U a (t) = max {q ˆµ a (t) : N a (t)d(ˆµ a (t), q) β(t, δ)} L a (t) = min {q ˆµ a (t) : N a (t)d(ˆµ a (t), q) β(t, δ)} for β(t, δ) some exploration rate. 5/ Emilie Kaufmann Two optimization problems in a stochastic bandit model

17 An algorithm: KL-LUCB At round t, the algorithm: draws two well-chosen arms: u t and l t (in bold) stops when C.I. for arms in J(t) and J(t) c are separated m = 3, K = 6 Set J(t), arm l t in bold Set J(t) c, arm u t in bold 6/ Emilie Kaufmann Two optimization problems in a stochastic bandit model

18 Theoretical guarantees Theorem [K.,Kalyanakrishnan 203] KL-LUCB using the exploration rate ( k Kt α ) β(t, δ) = log, δ with α > and k > + α satisfies P(Ŝ = S m) δ. For α > 2, ( E[τ] 4αH [log k K(H ) α δ ) + log log ( k K(H ) α δ )] + C α, with H = min K c [µ m+ ;µ m] a= d (µ a, c). 7/ Emilie Kaufmann Two optimization problems in a stochastic bandit model

19 Theoretical guarantees Another informational quantity: Chernoff information d (x, y) := d(z, x) = d(z, y), where z is defined by the equality d(z, x) = d(z, y) x z * y / Emilie Kaufmann Two optimization problems in a stochastic bandit model

20 Summary Lower bound lim sup δ 0 E ν [τ] log δ m t= d(µ a, µ m+ ) + K t=m+ d(µ a, µ m ) Upper bound (for KL-UCB) lim sup δ 0 E ν [τ] log δ 8 min K c [µ m+ ;µ m] a= d (µ a, c) 9/ Emilie Kaufmann Two optimization problems in a stochastic bandit model

21 Refined results for two-armed bandits A tighter lower bound [K.,Cappé, Garivier 4] Any algorithm that is δ-pac on every two-armed bandit model such that µ > µ 2 satisfies, for δ 0.5, E[τ] d (µ, µ 2 ) log 2δ where d (µ, µ 2 ) := d(µ, z ) = d(µ 2, z ), with z defined by Matching algorithms? d(µ, z ) = d(µ 2, z ). Uniform sampling is (almost) optimal A stopping rule τ based on the difference of empirical means is not optimal (and we propose a new one) 20/ Emilie Kaufmann Two optimization problems in a stochastic bandit model

22 Conclusion KL-based confidence intervals are useful in both settings, though KL-UCB and KL-LUCB draw the arms differently (m=) Do the complexity of these two problems feature the same information-theoretic quantities? R T inf lim sup consistent T log T = K µ µ a d(µ algorithms a=2 a, µ ) E[τ] inf lim sup δ PAC δ log(/δ) algorithms K a= d(µ a, µ m+ ) + K a=m+ d(µ a, µ m ) 2/ Emilie Kaufmann Two optimization problems in a stochastic bandit model

23 References KL-UCB: Cappé, Garivier, Maillard, Munos, Stoltz, Kulbback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation, Annals of Statistics, 203 KL-LUCB: Kaufmann and Kalyanakrishanan, Information Complexity in Bandit Subset Selection, COLT 203 The complexity of best arm identification: Kaufmann, Cappé, Garivier, On the Complexity of Best Arm Identification in Multi-Armed Bandit Models, arxiv: , / Emilie Kaufmann Two optimization problems in a stochastic bandit model

The information complexity of sequential resource allocation

The information complexity of sequential resource allocation Emilie Kaufmann, joint work with Olivier Cappé, Aurélien Garivier and Shivaram Kalyanakrishan SMILE Seminar, ENS, June 8th, 205 Sequential allocation