Revisiting the Exploration-Exploitation Tradeoff in Bandit Models

Size: px

Start display at page:

Download "Revisiting the Exploration-Exploitation Tradeoff in Bandit Models"

Alexandra Briana Warren
5 years ago
Views:

1 Revisiting the Exploration-Exploitation Tradeoff in Bandit Models joint work with Aurélien Garivier (IMT, Toulouse) and Tor Lattimore (University of Alberta) Workshop on Optimization and Decision-Making in Uncertainty, Simons Institute, Berkeley, September 21st, 2016

2 The multi-armed bandit model K arms = K probability distributions (ν a has mean µ a ) ν 1 ν 2 ν 3 ν 4 ν 5 At round t, an agent: chooses an arm A t observes a sample X t ν At using a sequential sampling strategy (A t ): A t+1 = F t (A 1, X 1,..., A t, X t ). Generic goal: learn the best arm, a = argmax a µ a of mean µ = max a µ a

3 Regret minimization in a bandit model Samples = rewards, (A t ) is adjusted to maximize the (expected) sum of rewards, [ T ] E X t t=1 or equivalently minimize the regret: [ T ] K R T = T µ E X t = (µ µ a )E[N a (T )] t=1 a=1 N a (T ) : number of draws of arm a up to time T Exploration/Exploitation tradeoff

4 Algorithms: naive ideas Idea 1 : Choose each arm T /K times EXPLORATION Idea 2 : Always choose the best arm so far EXPLOITATION A t+1 = argmax a ˆµ a (t)...linear regret

5 Algorithms: naive ideas Idea 1 : Choose each arm T /K times EXPLORATION Idea 2 : Always choose the best arm so far EXPLOITATION A t+1 = argmax a ˆµ a (t)...linear regret A better idea: First explore the arms uniformly, then commit to the empirical best until the end EXPLORATION followed by EXPLOITATION...Still sub-optimal

6 A motivation: should we minimize regret? B(µ 1 ) B(µ 2 ) B(µ 3 ) B(µ 4 ) B(µ 5 ) For the t-th patient in a clinical study, chooses a treatment A t observes a response X t {0, 1}: P(X t = 1) = µ At Goal: maximize the number of patient healed during the study

a response X t {0, 1}: P(X t = 1) = µ At

identify as quickly as possible the best

7 A motivation: should we minimize regret? B(µ 1 ) B(µ 2 ) B(µ 3 ) B(µ 4 ) B(µ 5 ) For the t-th patient in a clinical study, chooses a treatment A t observes a response X t {0, 1}: P(X t = 1) = µ At Goal: maximize the number of patient healed during the study Alternative goal: allocate the treatments so as to identify as quickly as possible the best treatment (no focus on curing patients during the study)

8 Two different objectives Regret minimization Best arm identification sampling rule (A t ) Bandit sampling rule (A t ) stopping rule τ algorithm recommendation rule â τ Input horizon T risk parameter δ minimize [ ensure P(â τ = a ) 1 δ T ] Objective R T = µ T E t=1 X t and minimize E[τ] Exploration/Exploitation pure Exploration This talk: (distribution-dependent) optimal algorithm for both objectives best performance of an Explore-Then-Comit strategy? We focus on distributions parameterized by their means µ = (µ 1,..., µ K ) (Bernoulli, Gaussian)

9 Outline 1 Optimal algorithms for Regret Minimization 2 Optimal algorithms for Best Arm Identification 3 Explore-Then-Commit strategies

10 Optimal algorithms for regret minimization µ = (µ 1,..., µ K ). N a (t) : number of draws of arm a up to time t K R µ (A, T ) = (µ µ a )E µ [N a (T )] a=1 Notation: Kullback-Leibler divergence d(µ, µ ) := KL ( ) ν µ, ν µ (Gaussian): d(µ, µ ) = (µ µ ) 2 2σ 2 [Lai and Robbins, 1985]: for uniformly efficient algorithms, µ a < µ E µ [N a (T )] 1 lim inf T log T d(µ a, µ ) A bandit algorithm is asymptotically optimal if, for every µ, µ a < µ E µ [N a (T )] 1 lim sup T log T d(µ a, µ )

11 Optimal algorithms for regret minimization µ = (µ 1,..., µ K ). N a (t) : number of draws of arm a up to time t K R µ (A, T ) = (µ µ a )E µ [N a (T )] a=1 Notation: Kullback-Leibler divergence d(µ, µ ) := KL ( ) ν µ, ν µ (Bernoulli): d(µ, µ ) = µ log µ 1 µ + (1 µ) log µ 1 µ [Lai and Robbins, 1985]: for uniformly efficient algorithms, µ a < µ E µ [N a (T )] 1 lim inf T log T d(µ a, µ ) A bandit algorithm is asymptotically optimal if, for every µ, µ a < µ E µ [N a (T )] 1 lim sup T log T d(µ a, µ )

12 Mixing Exploration and Exploitation: the UCB approach A UCB-type (or optimistic) algorithm chooses at round t A t+1 = argmax UCB a (t). a=1...k where UCB a (t) is an Upper Confidence Bound on µ a d(µ a (t),q) log(t)/n a (t) µ a (t) [ ] u a (t) q The KL-UCB index UCB a (t) := max satisfies P(µ a UCB a (t)) 1 t 1. { q : d (ˆµ a (t), q) log(t) }, N a (t)

13 Mixing Exploration and Exploitation: KL-UCB A UCB-type (or optimistic) algorithm chooses at round t A t+1 = argmax a=1...k UCB a (t). where UCB a (t) is an Upper Confidence Bound on µ a d(µ a (t),q) log(t)/n a (t) µ a (t) [ ] u a (t) q The KL-UCB index [Cappé et al. 13]: E µ [N a (T )] KL-UCB satisfies 1 d(µ a, µ ) log T + O( log(t )).

14 Outline 1 Optimal algorithms for Regret Minimization 2 Optimal algorithms for Best Arm Identification 3 Explore-Then-Commit strategies

15 A sample complexity lower bound A Best Arm Identification algorithm (A t, τ, â τ ) is δ-pac if Theorem [Garivier and K. 2016] For any δ-pac algorithm, where µ, P µ (â τ = a (µ)) 1 δ. E µ [τ] T (µ) log (1/(2.4δ)), T (µ) 1 = sup w Σ K inf λ Alt(µ) a=1 K w a d(µ a, λ a ) Σ K = {w [0, 1] K : K i=1 w i = 1}, Alt(µ) = {λ : a (λ) a (µ)} ( ) Eµ[N Moreover, the vector of optimal proportions, a(τ)] E µ[τ] wa (µ) K w (µ) = argmax w Σ K inf λ Alt(µ) w a d(µ a, λ a ) a=1 is well-defined, and we propose an efficient way to compute it.

16 Sampling rule: Tracking the optimal proportions ˆµ(t) = (ˆµ 1 (t),..., ˆµ K (t)): vector of empirical means Introducing U t = {a : N a (t) < t}, the arm sampled at round t + 1 is argmin N a (t) if U t (forced exploration) a U A t+1 t argmax [t wa ( ˆµ(t)) N a (t)] (tracking) 1 a K Lemma Under the Tracking sampling rule, ( ) N a (t) P µ lim = w t a (µ) t = 1.

17 An asymptotically optimal algorithm Theorem [K. and Garivier, 2016] The Track-and-Stop strategy, that uses the Tracking sampling rule a stopping rule based on GLRT tests: τ δ = inf { t N : Z(t) > log 2Kt } δ, with [ K ] N a (t) Z(t) := t sup d(ˆµ a (t), λ a ) t λ Alt(ˆµ(t)) a=1 and recommends â τ = argmax ˆµ a (τ) a=1...k is δ-pac for every δ ]0, 1[ and satisfies lim sup δ 0 E µ [τ δ ] log(1/δ) = T (µ).

18 Regret minimization versus Best Arm Identification Algorithms for regret minimization and BAI are very different! playing mostly the best arm vs. optimal proportions different complexity terms (featuring KL-divergence) ( R T (µ) µ µ ) a d(µ a, µ log(t ) E µ [τ] T (µ) log (1/δ) ) a a

19 Outline 1 Optimal algorithms for Regret Minimization 2 Optimal algorithms for Best Arm Identification 3 Explore-Then-Commit strategies

20 Gaussian two-armed bandits ν 1 = N (µ 1, 1) and ν 2 = N (µ 2, 1). µ = (µ 1, µ 2 ). Let = µ 1 µ 2 Regret minimization For any uniformly efficient algorithm A = (A t ), Best Arm Identification For any δ-pac algorithm A = (A t, τ, â τ ), lim inf T R µ (T, A) log(t ) 2 lim inf δ 0 E µ [τ δ ] log(1/δ) 8 2 u.e.: µ, α ]0, 1[, R µ (T, A) = o(t α ) (optimal algorithms use uniform sampling)

21 Explore-Then-Commit (ETC) strategies ETC stragies: given a stopping rule τ and a commit rule â, 1 if t τ and t is odd, A t = 2 if t τ and t is even, â otherwise. Assume µ 1 > µ 2. R µ ( T, A ETC ) = E µ [N 2 (T )] = E µ [ τ T 2 + (T τ) + 1 (â=2) ] 2 E µ[τ] + T P µ (â = 2).

22 Explore-Then-Commit (ETC) strategies ETC stragies: given a stopping rule τ and a commit rule â, 1 if t τ and t is odd, A t = 2 if t τ and t is even, â otherwise. Assume µ 1 > µ 2. For A = (τ, â) as in an optimal BAI algorithm with δ = 1 T R µ ( T, A ETC ) = E µ [N 2 (T )] = E µ [ τ T 2 + (T τ) + 1 (â=2) ] Hence 2 E µ [τ] }{{} +T P µ (â = 2). }{{} (8/ 2 ) log(t ) 1/T lim sup R µ (T, A) log T 4.

23 Is this the best we can do? Lower bounds. Lemma Let µ, λ : a (µ) a (λ) Let σ s.t. N 2 (T ) is F σ -measurable. For any u.e. algorithm, lim inf T E µ [N 1 (σ)] (λ 1 µ 1 ) E µ [N 2 (σ)] (λ 2 µ 2 ) 2 2 log(t ) Proof. Introducing the log-likelihood ratio L t (µ, λ) = log p µ(x 1,..., X t ) p λ (X 1,..., X t ), one needs to prove that lim inf T E µ[l σ(µ,λ)] log(t ) 1. E µ [L σ (µ, λ)] = KL (L µ (X 1,..., X σ ), L λ (X 1,..., X σ )) 1. kl(e µ [Z], E λ [Z]) for any Z [0, 1], F σ -mesurable [Garivier et al. 16] E µ [L σ (µ, λ)] kl (E µ [N 2 (T )/T ], E λ [N 2 (T )/T ]) log(t ) (u.e.)

24 Is this the best we can do? Lower bounds. Lemma Let µ, λ : a (µ) a (λ). Let σ s.t. N 2 (T ) is F σ -measurable. For any u.e. algorithm, lim inf T E µ [N 1 (σ)] (λ 1 µ 1 ) E µ [N 2 (σ)] (λ 2 µ 2 ) 2 2 log(t ) 1. Assume µ 1 > µ 2 : Lai and Robbins bound: λ 1 = µ 1, λ 2 = µ 1 + ɛ σ = T lim inf T E µ [N 2 (T )] ( +ɛ)2 2 log(t ) 1. For ETC strategies: λ 1 = µ 1+µ 2 ɛ 2, λ 2 = µ 1+µ 2 +ɛ 2 σ = τ T E µ[τ T ] lim inf ( +ɛ) 2 8 T log(t ) 1 lim inf T R µ(t,a) log(t ) 4.

25 An interesting matching algorithm Theorem Any uniformly efficient ETC strategy satisfies lim inf T R µ (T, A) log(t ) 4. The ETC strategy based on the stopping rule τ = inf t = 2n : ˆµ 4 log ( T /(2n) ) 1,n ˆµ 2,n > n. satisfies, for T 2 > 4e 2, ( ) 4 log T log 4 R µ (T, A) + R µ (T, A) 32 T +. ( ) T ,

26 Conclusion In Gaussian two-armed bandits, ETC strategies are sub-optimal by a factor two compared to UCB strategies rather than A/B Test + always showing the best product, dynamically present products to customers all day long! On-going work: how does Optimal BAI + Commit behave in general? ( K ) T (µ) wa (µ)(µ 1 µ a ) a=2 v.s K a=2 µ 1 µ a d(µ a, µ 1 ).

27 References A. Garivier, E. Kaufmann, Optimal Best Arm Identification with Fixed Confidence, COLT 2016 A. Garivier, E. Kaufmann, T. Lattimore, On Explore-Then-Commit strategies, NIPS 2016

Stratégies bayésiennes et fréquentistes dans un modèle de bandit

Stratégies bayésiennes et fréquentistes dans un modèle de bandit thèse effectuée à Telecom ParisTech, co-dirigée par Olivier Cappé, Aurélien Garivier et Rémi Munos Journées MAS, Grenoble, 30 août 2016