The information complexity of best-arm identification
Emilie Kaufmann, joint work with Olivier Cappé and Aurélien Garivier
MAB workshop, Lancaster, January 2016
Context: the multi-armed bandit model (MAB)
K arms = K probability distributions (ν_a has mean µ_a): ν_1, ν_2, ν_3, ν_4, ν_5
At round t, an agent:
- chooses an arm A_t
- observes a sample X_t ∼ ν_{A_t}
using a sequential sampling strategy (A_t): A_{t+1} = F_t(A_1, X_1, ..., A_t, X_t), aimed at a prescribed objective, e.g. related to learning a* = argmax_a µ_a and µ* = max_a µ_a.
A possible objective: Regret minimization
Samples = rewards; (A_t) is adjusted to maximize the (expected) sum of rewards, E[sum_{t=1}^T X_t], or equivalently to minimize the regret
R_T = E[ T µ* − sum_{t=1}^T X_t ]
→ an exploration/exploitation tradeoff.
Motivation: clinical trials [1933], with arms B(µ_1), B(µ_2), B(µ_3), B(µ_4), B(µ_5).
Goal: Maximize the number of patients healed during the trial.
Our objective: Best-arm identification
Goal: identify the best arm, a*, as fast/accurately as possible. No incentive to draw arms with high means! → optimal exploration
The agent's strategy is made of:
- a sequential sampling strategy (A_t)
- a stopping rule τ (a stopping time)
- a recommendation rule â_τ
Possible goals:
- Fixed-budget setting: τ = T; minimize P(â_τ ≠ a*)
- Fixed-confidence setting: minimize E[τ] subject to P(â_τ ≠ a*) ≤ δ
Motivation: market research, A/B Testing, clinical trials...
Wanted: Optimal algorithms in the PAC formulation
M: a class of bandit models ν = (ν_1, ..., ν_K). A strategy is δ-PAC on M if ∀ν ∈ M, P_ν(â_τ = a*) ≥ 1 − δ.
Goal: for some classes M and ν ∈ M, find:
- a lower bound on E_ν[τ] valid for any δ-PAC strategy
- a δ-PAC strategy such that E_ν[τ] matches this bound
(distribution-dependent bounds)
Outline
1 Regret minimization
2 Lower bound on the sample complexity
3 The complexity of A/B Testing
4 Algorithms for the general case
Exponential family bandit models
ν_1, ..., ν_K belong to a one-dimensional exponential family
P_{λ,Θ,b} = { ν_θ, θ ∈ Θ : ν_θ has density f_θ(x) = exp(θx − b(θ)) w.r.t. λ }.
Examples: Gaussian, Bernoulli, Poisson distributions...
ν_θ can be parametrized by its mean µ = ḃ(θ): ν^µ := ν_{ḃ⁻¹(µ)}.
Notation: Kullback-Leibler divergence. For a given exponential family P,
d(µ, µ') := KL(ν^µ, ν^{µ'}) = E_{X ∼ ν^µ}[ log (dν^µ/dν^{µ'})(X) ]
is the KL divergence between the distributions of means µ and µ'.
Example: Bernoulli distributions:
d(µ, µ') = KL(B(µ), B(µ')) = µ log(µ/µ') + (1 − µ) log((1 − µ)/(1 − µ')).
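As a quick numerical illustration (mine, not on the slide), the Bernoulli divergence above is straightforward to compute; the helper name `bern_kl` is an assumption for this sketch:

```python
import math

def bern_kl(mu, mu_prime, eps=1e-12):
    """d(mu, mu') = KL(B(mu), B(mu')) for Bernoulli distributions.

    Means are clipped away from {0, 1} to avoid log(0)."""
    mu = min(max(mu, eps), 1 - eps)
    mu_prime = min(max(mu_prime, eps), 1 - eps)
    return (mu * math.log(mu / mu_prime)
            + (1 - mu) * math.log((1 - mu) / (1 - mu_prime)))
```

Note the asymmetry: d(µ, µ') ≠ d(µ', µ) in general, which is why the lower bounds below carefully track the order of the arguments.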
Optimal algorithms for regret minimization
ν = (ν^{µ_1}, ..., ν^{µ_K}) ∈ M = P^K; N_a(t): number of draws of arm a up to time t
R_T(ν) = sum_{a=1}^K (µ* − µ_a) E_ν[N_a(T)]
Consistent algorithm: ∀ν ∈ M, ∀α ∈ ]0, 1[, R_T(ν) = o(T^α).
[Lai and Robbins 1985]: every consistent algorithm satisfies
    µ_a < µ*  ⇒  liminf_{T→∞} E_ν[N_a(T)] / log T ≥ 1/d(µ_a, µ*)
Definition. A bandit algorithm is asymptotically optimal if, for every ν ∈ M,
    µ_a < µ*  ⇒  limsup_{T→∞} E_ν[N_a(T)] / log T ≤ 1/d(µ_a, µ*)
KL-UCB: an asymptotically optimal algorithm
KL-UCB [Cappé et al. 2013]: A_{t+1} = argmax_a u_a(t), with
    u_a(t) = max{ x : d(µ̂_a(t), x) ≤ log(t)/N_a(t) },
where d(µ, µ') = KL(ν^µ, ν^{µ'}).
[Figure: the KL confidence region d(S_a(t)/N_a(t), q) ≤ log(t)/N_a(t), compared with its quadratic relaxation 2(q − S_a(t)/N_a(t))².]
Guarantee: E[N_a(T)] ≤ log T / d(µ_a, µ*) + O(sqrt(log T)).
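A sketch (mine, not from the slides) of how the index u_a(t) can be computed in the Bernoulli case: since x ↦ d(µ̂, x) is increasing on [µ̂, 1], a simple bisection suffices.

```python
import math

def bern_kl(p, q, eps=1e-12):
    """Bernoulli KL divergence d(p, q), clipped away from {0, 1}."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, n_a, t, iters=60):
    """u_a(t) = max{x : d(mu_hat, x) <= log(t)/N_a(t)}, via bisection on [mu_hat, 1]."""
    target = math.log(t) / n_a
    lo, hi = mu_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bern_kl(mu_hat, mid) <= target:
            lo = mid          # mid is still inside the confidence region
        else:
            hi = mid
    return lo
```

The index shrinks toward µ̂_a(t) as N_a(t) grows, and inflates only logarithmically with t, which is what drives the log T regret bound above.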
The information complexity of regret minimization
Letting κ_R(ν) := inf_{A consistent} liminf_{T→∞} R_T(ν)/log(T), we showed that
κ_R(ν) = sum_{a : µ_a < µ*} (µ* − µ_a) / d(µ_a, µ*).
A general lower bound
M: a class of exponential family bandit models; A = ((A_t), τ, â_τ): a strategy; A is δ-PAC on M if ∀ν ∈ M, P_ν(â_τ = a*) ≥ 1 − δ.
Theorem [K., Cappé, Garivier '15]. Let ν = (ν^{µ_1}, ..., ν^{µ_K}) be such that µ_1 > µ_2 ≥ ... ≥ µ_K, and let δ ∈ ]0, 1[. Any algorithm that is δ-PAC on M satisfies
E_ν[τ] ≥ ( 1/d(µ_1, µ_2) + sum_{a=2}^K 1/d(µ_a, µ_1) ) log(1/(2.4δ)),
where d(µ, µ') = KL(ν^µ, ν^{µ'}).
Behind the lower bound: changes of distribution
Lemma [K., Cappé, Garivier 2015]. Let ν = (ν_1, ν_2, ..., ν_K) and ν' = (ν'_1, ν'_2, ..., ν'_K) be two bandit models. Then
sum_{a=1}^K E_ν[N_a(τ)] KL(ν_a, ν'_a) ≥ sup_{E ∈ F_τ} kl(P_ν(E), P_{ν'}(E)),
with kl(x, y) = x log(x/y) + (1 − x) log((1 − x)/(1 − y)).
Proof ingredients, with L_t the log-likelihood ratio of the past observations under ν and ν':
- Wald's equality: E_ν[L_τ] = sum_{a=1}^K E_ν[N_a(τ)] KL(ν_a, ν'_a)
- change of distribution: ∀E ∈ F_τ, P_{ν'}(E) = E_ν[exp(−L_τ) 1_E]
Behind the lower bound: changes of distribution
Exponential family bandits: ν = (µ_1, µ_2, ..., µ_K), ν' = (µ'_1, µ'_2, ..., µ'_K):
∀E ∈ F_τ, sum_{a=1}^K E_ν[N_a(τ)] d(µ_a, µ'_a) ≥ kl(P_ν(E), P_{ν'}(E)), and E_ν[τ] = sum_{a=1}^K E_ν[N_a(τ)].
Then, for each a ≠ 1, choose ν' such that arm 1 is no longer the best:
    µ'_a = µ_1 + ɛ, µ'_i = µ_i for i ≠ a.
With E = (â_τ = 1), the δ-PAC property gives P_ν(E) ≥ 1 − δ and P_{ν'}(E) ≤ δ, hence
E_ν[N_a(τ)] d(µ_a, µ_1 + ɛ) ≥ kl(1 − δ, δ), and letting ɛ → 0 (using kl(1 − δ, δ) ≥ log(1/(2.4δ))):
E_ν[N_a(τ)] ≥ (1/d(µ_a, µ_1)) log(1/(2.4δ)).
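The last step uses the standard inequality kl(1 − δ, δ) ≥ log(1/(2.4δ)); a quick numerical sanity check (mine, not on the slide):

```python
import math

def kl_bern(x, y):
    """Binary relative entropy kl(x, y) = x log(x/y) + (1-x) log((1-x)/(1-y))."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

# kl(1-delta, delta) dominates log(1/(2.4*delta)) at the usual confidence levels
for delta in (0.1, 0.05, 0.01, 0.001, 1e-6):
    assert kl_bern(1 - delta, delta) >= math.log(1 / (2.4 * delta))
```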
The complexity of best arm identification
M: a class of exponential family bandit models. The theorem above lower-bounds E_ν[τ] for every δ-PAC algorithm; this motivates defining, for any class M, the complexity term of ν ∈ M as
κ_C(ν) := inf_{A PAC} limsup_{δ→0} E_ν[τ] / log(1/δ),
where A = (A(δ)) is PAC if, for every δ ∈ ]0, 1[, A(δ) is δ-PAC on M.
Computing the complexity term
M: a class of two-armed bandit models. For ν = (ν_1, ν_2), recall that
κ_C(ν) := inf_{A PAC} limsup_{δ→0} E_ν[τ] / log(1/δ).
We now compute κ_C(ν) for two types of classes M:
- exponential family bandit models: M = { ν = (ν^{µ_1}, ν^{µ_2}) : ν^{µ_i} ∈ P, µ_1 ≠ µ_2 }
- Gaussian with known variances σ_1² and σ_2²: M = { ν = (N(µ_1, σ_1²), N(µ_2, σ_2²)) : (µ_1, µ_2) ∈ R², µ_1 ≠ µ_2 }
Lower bounds on the complexity
From our previous lower bound (or a similar method):
- exponential family bandit models: κ_C(ν) ≥ 1/d(µ_1, µ_2) + 1/d(µ_2, µ_1)
- Gaussian with known variances σ_1², σ_2²: κ_C(ν) ≥ 2σ_1²/(µ_1 − µ_2)² + 2σ_2²/(µ_2 − µ_1)²
Towards tighter lower bounds
Exponential family bandits: for ν = (µ_1, µ_2) and any alternative ν' = (µ'_1, µ'_2) with µ'_1 < µ'_2,
E_ν[N_1(τ)] d(µ_1, µ'_1) + E_ν[N_2(τ)] d(µ_2, µ'_2) ≥ log(1/(2.4δ)).
Previously we moved arm 2 only: µ'_1 = µ_1, µ'_2 = µ_1 + ɛ. The new change of distribution moves both arms to a common point µ*: µ'_1 = µ*, µ'_2 = µ* + ɛ.
Choosing µ* such that d(µ_1, µ*) = d(µ_2, µ*) =: d*(µ_1, µ_2) gives (letting ɛ → 0)
d*(µ_1, µ_2) E_ν[τ] ≥ log(1/(2.4δ)), i.e. E_ν[τ] ≥ (1/d*(µ_1, µ_2)) log(1/(2.4δ)).
Tighter lower bounds in the two-armed case
New lower bounds (tighter!):
- exponential families: κ_C(ν) ≥ 1/d*(µ_1, µ_2)
- Gaussian with known σ_1², σ_2²: κ_C(ν) ≥ 2(σ_1 + σ_2)²/(µ_1 − µ_2)²
where d*(µ_1, µ_2) := d(µ_1, µ*) = d(µ_2, µ*) is a Chernoff information.
Previous lower bounds:
- exponential families: κ_C(ν) ≥ 1/d(µ_1, µ_2) + 1/d(µ_2, µ_1)
- Gaussian with known σ_1², σ_2²: κ_C(ν) ≥ 2(σ_1² + σ_2²)/(µ_1 − µ_2)²
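For concreteness, a sketch (assuming Bernoulli arms; `chernoff_dstar` and `I_star` are hypothetical helper names) of how d*(µ_1, µ_2) can be found by bisection: on the interval between the two means, d(µ_1, z) decreases while d(µ_2, z) increases, so the equalizing point is unique.

```python
import math

def bern_kl(p, q, eps=1e-12):
    """Bernoulli KL divergence d(p, q), clipped away from {0, 1}."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def chernoff_dstar(mu1, mu2, iters=80):
    """d*(mu1, mu2) = d(mu1, z*) where z* solves d(mu1, z) = d(mu2, z)."""
    lo, hi = min(mu1, mu2), max(mu1, mu2)
    a, b = hi, lo                       # a = larger mean, b = smaller mean
    for _ in range(iters):
        z = (lo + hi) / 2
        if bern_kl(a, z) > bern_kl(b, z):
            lo = z                      # equalizing point z* lies above z
        else:
            hi = z
    return bern_kl(a, (lo + hi) / 2)

def I_star(mu1, mu2):
    """I*(mu1, mu2): average divergence to the midpoint instead of z*."""
    m = (mu1 + mu2) / 2
    return (bern_kl(mu1, m) + bern_kl(mu2, m)) / 2
```

For a symmetric Bernoulli pair (µ_2 = 1 − µ_1) the equalizing point is the midpoint, so d* and I* coincide; in general they stay close, which is the point of the "uniform sampling is near-optimal" remark below.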
Upper bounds on the complexity: algorithms
M = { ν = (N(µ_1, σ_1²), N(µ_2, σ_2²)) : (µ_1, µ_2) ∈ R², µ_1 ≠ µ_2 }
The α-elimination algorithm with exploration rate β(t, δ):
- chooses A_t so as to keep the proportion N_1(t)/t close to α, i.e. A_t = 2 if and only if ⌈αt⌉ = ⌈α(t+1)⌉;
- with µ̂_a(t) the empirical mean of the rewards obtained from arm a up to time t, and σ_t²(α) = σ_1²/⌈αt⌉ + σ_2²/(t − ⌈αt⌉), stops at
τ = inf{ t ∈ N : µ̂_1(t) − µ̂_2(t) > sqrt(2 σ_t²(α) β(t, δ)) }.
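A runnable sketch of α-elimination (my own simulation harness, not the paper's pseudocode; for simplicity it uses a symmetric two-sided stopping rule and recommends the arm with the larger empirical mean):

```python
import math
import random

def alpha_elimination(mu1, mu2, s1, s2, delta, horizon=200000, seed=0):
    """Sketch of alpha-elimination for two Gaussian arms with known std devs s1, s2.

    Keeps N_1(t) close to alpha*t with alpha = s1/(s1+s2); stops when the gap in
    empirical means exceeds sqrt(2 * sigma_t^2(alpha) * beta(t, delta))."""
    rng = random.Random(seed)
    alpha = s1 / (s1 + s2)
    sums, counts = [0.0, 0.0], [0, 0]
    for t in range(1, horizon + 1):
        # draw arm 2 iff ceil(alpha*t) == ceil(alpha*(t+1)), i.e. N_1 need not grow
        arm = 1 if math.ceil(alpha * t) == math.ceil(alpha * (t + 1)) else 0
        mean, sd = (mu1, s1) if arm == 0 else (mu2, s2)
        sums[arm] += rng.gauss(mean, sd)
        counts[arm] += 1
        if counts[0] == 0 or counts[1] == 0:
            continue
        var_t = s1 ** 2 / counts[0] + s2 ** 2 / counts[1]
        beta = math.log(t / delta) + 2 * math.log(math.log(6 * t))
        gap = sums[0] / counts[0] - sums[1] / counts[1]
        if abs(gap) > math.sqrt(2 * var_t * beta):
            return (1 if gap > 0 else 2), t
    return None, horizon
```

With well-separated means the stopping time concentrates around 2(σ_1 + σ_2)²/(µ_1 − µ_2)² · log(1/δ), matching the theorem on the next slide.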
Gaussian case: matching algorithm
Theorem [K., Cappé, Garivier '14]. With α = σ_1/(σ_1 + σ_2) and β(t, δ) = log(t/δ) + 2 log log(6t), α-elimination is δ-PAC and, for every ɛ > 0,
E_ν[τ] ≤ (1 + ɛ) · 2(σ_1 + σ_2)²/(µ_1 − µ_2)² · log(1/δ) + o_{ɛ, δ→0}(log(1/δ)).
In the Gaussian case this shows κ_C(ν) ≤ 2(σ_1 + σ_2)²/(µ_1 − µ_2)², and finally
κ_C(ν) = 2(σ_1 + σ_2)²/(µ_1 − µ_2)².
Exponential families: uniform sampling
Another lower bound... M = { ν = (ν^{µ_1}, ν^{µ_2}) : ν^{µ_i} ∈ P, µ_1 ≠ µ_2 }
Any δ-PAC algorithm using uniform sampling (A_t ≡ t [2]) satisfies
E_ν[τ] ≥ (1/I*(µ_1, µ_2)) log(1/(2.4δ)),
with I*(µ_1, µ_2) = [ d(µ_1, (µ_1 + µ_2)/2) + d(µ_2, (µ_1 + µ_2)/2) ] / 2.
Remark: I*(µ_1, µ_2) is very close to d*(µ_1, µ_2)... → find a good stopping rule to pair with the uniform sampling strategy!
Exponential families: uniform sampling
For Bernoulli bandit models, uniform sampling with the stopping rule
τ = inf{ t ∈ N : |µ̂_1(t) − µ̂_2(t)| > sqrt((2/t) log(t/δ)) }
is δ-PAC but not optimal: E_ν[τ]/log(1/δ) ≃ 2/(µ_1 − µ_2)² > 1/I*(µ_1, µ_2).
SGLRT algorithm (Sequential Generalized Likelihood Ratio Test). Let α > 0. There exists C = C_α such that the algorithm using a uniform sampling strategy and the stopping rule
τ = inf{ t ∈ N : t I*(µ̂_1(t), µ̂_2(t)) > β(t, δ) }, with β(t, δ) = log(C t^{1+α} / δ),
is δ-PAC and satisfies limsup_{δ→0} E_ν[τ]/log(1/δ) ≤ (1 + α)/I*(µ_1, µ_2), which is close to κ_C(ν).
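A sketch of the SGLRT stopping rule with uniform sampling on Bernoulli arms (mine; the constant `C` and exponent `alpha` below are illustrative placeholders, not the C_α from the theorem):

```python
import math
import random

def bern_kl(p, q, eps=1e-12):
    """Bernoulli KL divergence d(p, q), clipped away from {0, 1}."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def I_star(x, y):
    """I*(x, y) = (d(x, m) + d(y, m)) / 2, with m the midpoint of x and y."""
    m = (x + y) / 2
    return (bern_kl(x, m) + bern_kl(y, m)) / 2

def sglrt_ab_test(mu1, mu2, delta, C=10.0, alpha=0.5, horizon=100000, seed=1):
    """Uniform sampling; stop when t * I*(muhat_1, muhat_2) > log(C t^(1+alpha)/delta)."""
    rng = random.Random(seed)
    sums, counts = [0.0, 0.0], [0, 0]
    for t in range(1, horizon + 1):
        arm = t % 2                     # alternate arms: index 0 <-> mu1, 1 <-> mu2
        sums[arm] += 1.0 if rng.random() < (mu1, mu2)[arm] else 0.0
        counts[arm] += 1
        if min(counts) == 0:
            continue
        h1, h2 = sums[0] / counts[0], sums[1] / counts[1]
        if t * I_star(h1, h2) > math.log(C * t ** (1 + alpha) / delta):
            return (1 if h1 > h2 else 2), t
    return None, horizon
```

The statistic t·I*(µ̂_1(t), µ̂_2(t)) is (up to the even/odd rounding of t) the generalized likelihood ratio statistic for testing µ_1 = µ_2 under uniform sampling.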
The complexity of A/B Testing
For Gaussian bandit models with known variances σ_1² and σ_2², if ν = (N(µ_1, σ_1²), N(µ_2, σ_2²)),
κ_C(ν) = 2(σ_1 + σ_2)²/(µ_1 − µ_2)²,
and the optimal strategy draws the arms proportionally to their standard deviations.
For exponential family bandit models, if ν = (ν^{µ_1}, ν^{µ_2}),
1/d*(µ_1, µ_2) ≤ κ_C(ν) ≤ 1/I*(µ_1, µ_2),
and uniform sampling is close to optimal.
Racing (or elimination) algorithms
S = {1, ..., K}: set of remaining arms; r = 0: current round
while |S| > 1:
    r = r + 1
    draw each a ∈ S; compute µ̂_{a,r}, the empirical mean of the r samples observed so far
    compute the empirical best and empirical worst arms:
        b_r = argmax_{a ∈ S} µ̂_{a,r}
        w_r = argmin_{a ∈ S} µ̂_{a,r}
    if EliminationRule(r, b_r, w_r), eliminate w_r: S = S \ {w_r}
Output: â, the single element left in S.
Elimination rules
In the literature:
Successive Elimination for Bernoulli bandits [Even-Dar et al. 2006]:
    Elimination(r, a, b) = ( µ̂_{a,r} − µ̂_{b,r} > 2 sqrt( log(cKr²/δ) / r ) )
KL-Racing for exponential family bandits [K. and Kalyanakrishnan 2013]:
    Elimination(r, a, b) = ( l_{a,r} > u_{b,r} ), with
    l_{a,r} = min{ x : r d(µ̂_{a,r}, x) ≤ β(r, δ) }
    u_{b,r} = max{ x : r d(µ̂_{b,r}, x) ≤ β(r, δ) }
The Racing-SGLRT algorithm
EliminationRule(r, a, b) = ( r d(µ̂_{a,r}, (µ̂_{a,r} + µ̂_{b,r})/2) + r d(µ̂_{b,r}, (µ̂_{a,r} + µ̂_{b,r})/2) > β(r, δ) )
                         = ( 2r I*(µ̂_{a,r}, µ̂_{b,r}) > β(r, δ) )
Analysis of Racing-SGLRT. Let α > 0. For an exploration rate of the form β(r, δ) = log(C r^{1+α} / δ), Racing-SGLRT is δ-PAC and satisfies
limsup_{δ→0} E_ν[τ]/log(1/δ) ≤ (1 + α) sum_{a=2}^K 1/I*(µ_a, µ_1).
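Putting the racing skeleton and this elimination rule together, a sketch assuming Bernoulli arms (`racing_sglrt`, `C` and `alpha` are my illustrative choices, not the paper's constants):

```python
import math
import random

def bern_kl(p, q, eps=1e-12):
    """Bernoulli KL divergence d(p, q), clipped away from {0, 1}."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def I_star(x, y):
    """I*(x, y) = (d(x, m) + d(y, m)) / 2, with m the midpoint of x and y."""
    m = (x + y) / 2
    return (bern_kl(x, m) + bern_kl(y, m)) / 2

def racing_sglrt(mus, delta, C=10.0, alpha=0.5, max_rounds=100000, seed=2):
    """Racing with the SGLRT rule: drop the empirical worst arm w_r once
    2r * I*(muhat_{b_r}, muhat_{w_r}) exceeds beta(r, delta) = log(C r^(1+alpha)/delta)."""
    rng = random.Random(seed)
    S = set(range(len(mus)))
    sums = [0.0] * len(mus)
    for r in range(1, max_rounds + 1):
        for a in sorted(S):             # one fresh sample per surviving arm
            sums[a] += 1.0 if rng.random() < mus[a] else 0.0
        hats = {a: sums[a] / r for a in S}   # every arm in S has exactly r samples
        b, w = max(S, key=hats.get), min(S, key=hats.get)
        if 2 * r * I_star(hats[b], hats[w]) > math.log(C * r ** (1 + alpha) / delta):
            S.remove(w)
        if len(S) == 1:
            return next(iter(S))
    return None
```

Because every surviving arm is drawn once per round, all empirical means in S are built from exactly r samples, which is what makes the per-pair statistic 2r·I* well defined.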
Conclusion
We presented:
- a simple methodology to derive lower bounds on the sample complexity of a δ-PAC strategy
- a characterization of the complexity of best-arm identification with two arms, involving alternative information-theoretic quantities (e.g. Chernoff information)
To be continued...
- A/B Testing: for which classes of distributions is uniform sampling a good idea?
- The complexity of best-arm identification is still to be understood in the general case:
  1/d*(µ_1, µ_2) + sum_{a=2}^K 1/d(µ_a, µ_1) ≤ κ_C(ν) ≤ sum_{a=2}^K 1/I*(µ_a, µ_1)
References O. Cappé, A. Garivier, O-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 203. E. Even-Dar, S. Mannor, Y. Mansour, Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. JMLR, 2006. E. Kaufmann, S. Kalyanakrishnan, Information Complexity in Bandit Subset Selection. In Proceedings of the 26th Conference On Learning Theory (COLT), 203. E. Kaufmann, O. Cappé, A. Garivier. On the Complexity of A/B Testing. In Proceedings of the 27th Conference On Learning Theory (COLT), 204. E. Kaufmann, O. Cappé, A. Garivier. On the Complexity of Best Arm Identification in Multi-Armed Bandit Models. JMLR, 205 T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 985.