The information complexity of sequential resource allocation

Size: px

Start display at page:

Download "The information complexity of sequential resource allocation"

Bartholomew Mosley
5 years ago
Views:

1 The information complexity of sequential resource allocation Emilie Kaufmann, joint work with Olivier Cappé, Aurélien Garivier and Shivaram Kalyanakrishan SMILE Seminar, ENS, June 8th, 205

2 Sequential allocation : some examples Clinical trial K possible treatments (with unknown effect) Which treatment should be allocated to each patient based on their effect on previous patients? Movie recommendation K different movies Which movie should be recommended to each user, based on the ratings given by previous (similar) users?

3 The bandit framework One-armed bandit = slot machine (or arm) Multi-armed bandit : several arms. Drawing arm a observing a sample from a distribution νa, with mean µa Best arm a = argmaxa µa Which arm should be drawn based on the previous observed outcomes?

4 Bandit model (more formal) A multi-armed bandit model is a set of K arms where Each arm a is a probability distribution ν a of mean µ a Drawing arm a is observing a realization of ν a Arms are assumed to be independent At round t, an agent chooses arm A t, and observes X t ν At (A t ) is his strategy or bandit algorithm, such that A t+ = F t (A, X,..., A t, X t ) Global objective : Learn which arm(s) have highest mean(s) µ = max a µ a a = argmax a µ a

5 Objective : Regret minimization Samples are seen as rewards. The agent ajusts (A t ) to maximize the (expected) sum of rewards accumulated, [ T ] E X t t= or equivalently minimize his regret : [ ] T R T = E T µ X t exploration/exploitation tradeoff t=

6 Objective 2 : Best arm identification The agent has to identify the best arm a. (no loss when drawing bad arms) To do so, he His goal : uses a sampling strategy (A t ) stops sampling the arms at some (random) time τ recommends an arm â τ Fixed-budget setting τ = T minimize P(â τ a ) Fixed-confidence setting minimize E[τ] P(â τ a ) δ optimal exploration

7 Comparison on the medical trials example The doctor : chooses treatment A t to give to patient t observes whether the patient is cured : X t B(µ At ) He can ajust his strategy (A t ) so as to Regret minimization Maximize the number of patients cured among T patients Best arm identification Identify the best treatment with probability at least δ (to always give this one later)

8 Outline Regret minimization 2 m best arms identification Lower bound on the sample complexity An optimal algorithm? 3 The complexity of A/B Testing The Gaussian case The Bernoulli case

9 A parametric assumption on the arms ν,..., ν K belong to a one-dimensional exponential family : P λ,θ,b = {ν θ, θ Θ : ν θ has density f θ (x) = exp(θx b(θ)) w.r.t. λ} Example : Gaussian, Bernoulli, Poisson distributions... ν k = ν θk can also be parametrized by its mean µ k = ḃ(θ k ). Notation : Kullback-Leibler divergence [ KL(p, q) = E X p log dp ] dq (X ) For a given exponential family P, we denote by d P (µ, µ ) := KL(νḃ (µ), ν ḃ (µ ) ) the KL divergence between the distributions of mean µ and µ. Example : Bernoulli distributions d(µ, µ ) = KL(B(µ), B(µ )) = µ log µ µ + ( µ) log µ µ.

10 Outline Regret minimization 2 m best arms identification Lower bound on the sample complexity An optimal algorithm? 3 The complexity of A/B Testing The Gaussian case The Bernoulli case

11 Optimal algorithms for regret minimization ν = ν θ = (ν θ,..., ν θk ) M = (P) K. N a (t) : number of draws of arm a up to time t K R T (ν) = (µ µ a )E ν [N a (T )] a= consistent algorithm : ν M, α ]0, [, R T (ν) = o(t α ) [Lai and Robbins 985] : every consistent algorithm satisfies µ a < µ lim inf T E ν [N a (T )] log T d(µ a, µ ) Definition A bandit algorithm is asymptotically optimal if, for every ν M, µ a < µ lim sup T E ν [N a (T )] log T d(µ a, µ )

12 Towards asymptotically optimal algorithms A UCB-type algorithm chooses at time t + A t+ = arg max a UCB a (t) where UCB a (t) is some upper confidence bound on µ a. Examples for binary bandits (Bernoulli distributions) UCB [Auer et al. 02] uses Hoeffding bounds : UCB a (t) = S a(t) N a (t) + 2 log(t) N a (t). S a (t) : sum of rewards from arm a up to time t E ν [N a (T )] K 2(µ µ a ) 2 log T + K 2, with K >.

13 KL-UCB : an asymptotically optimal algorithm KL-UCB [Cappé et al. 203] uses the index : { ( ) Sa (t) u a (t) = argmax d x> Sa(t) N a (t), x log(t) + c log log(t) N a (t) Na(t) where d(p, q) = KL (B(p), B(q)) = p log ( p q ) + ( p) log }, ( ) p q d(s a (t)/n a (t),q) 2(q S a (t)/n a (t)) log(t)/n a (t) E[N a (T )] S a (t)/n a (t) [ [ ]] q d(µ a, µ ) log T + O( log(t ))

14 KL-UCB in action

15 KL-UCB in action

16 The information complexity of regret minimization Letting we showed that Remarks : κ R (ν) := inf lim inf A consistent T κ R (ν) = K a= an asymptotic notion of optimality (µ µ a ) d(µ a, µ ). R T (ν) log(t ), still worth fighting for more efficient algorithms (e.g. Bayesian algorithms)

17 Outline Regret minimization 2 m best arms identification Lower bound on the sample complexity An optimal algorithm? 3 The complexity of A/B Testing The Gaussian case The Bernoulli case

18 m best arms identification (fixed-confidence setting) ν = (ν θ,..., ν θk ) M = (P) K such that µ [m] > µ [m+]. Parameters and notation : m a fixed number of arms δ ]0, [ a risk parameter S m the set of m arms with highest means The agent s strategy : A = (A t, τ, Ŝ) sampling rule : A t arm chosen at time t stopping rule : at time τ he stops sampling the arms recommendation rule : a guess Ŝ for the m best arms His goal : ν M : µ [m] > µ [m+], P ν (Ŝ = S m) δ (the algorithm is δ-pac on M) The sample complexity E ν [τ] is small

19 Towards the complexity of best-arm identification The literature presents δ-pac algorithm such that E ν [τ] CH(ν) log(/δ) [Even-Dar et al. 06], [Kalyanakrishnan et al.2] but no lower bound on E ν [τ]. No notion of optimal algorithm We propose a lower bound on E ν [τ] new algorithms (close to) reaching the lower bound

20 Outline Regret minimization 2 m best arms identification Lower bound on the sample complexity An optimal algorithm? 3 The complexity of A/B Testing The Gaussian case The Bernoulli case

21 A general lower bound ν M such that µ µ m > µ m+ µ K. Theorem [K.,Cappé, Garivier 4] Any algorithm that is δ-pac on M satisfies, for all δ ]0, [, E ν [τ] ( m a= d(µ a, µ m+ ) + K a=m+ ) ( ) log. d(µ a, µ m ) 2.4δ First lower bound for m > Involves information-theoretic quantities

22 A general lower bound ν M such that µ µ m > µ m+ µ K. Theorem [K.,Cappé, Garivier 4] Any algorithm that is δ-pac on M satisfies, for all δ ]0, [, E ν [τ] ( m a= d(µ a, µ m+ ) + K a=m+ ) ( ) log. d(µ a, µ m ) 2.4δ First lower bound for m > Involves information-theoretic quantities K E[τ] = E[N a (τ)] a=

23 Behind the lower bound : changes of distribution Lemma [K., Cappé, Garivier 204] ν = (ν, ν 2,..., ν K ), ν = (ν, ν 2,..., ν K ) two bandit models. K E ν [N a (τ)]kl(ν a, ν a) sup kl(p ν (E), P ν (E)). E F τ a= with kl(x, y) = x log(x/y) + ( x) log(( x)/( y)).

24 Behind the lower bound : changes of distribution Lemma [K., Cappé, Garivier 204] ν = (ν, ν 2,..., ν K ), ν = (ν, ν 2,..., ν K ) two bandit models. K E ν [N a (τ)]kl(ν a, ν a) sup kl(p ν (E), P ν (E)). E F τ a= with kl(x, y) = x log(x/y) + ( x) log(( x)/( y)). choose ν such that Sm(ν ) {,..., m} : µ K... µ a... µ m+ µ m.... µ Sm(ν) = {,, m, m} a a µ K µ m+ µ m µ m +Ɛ µ S m(ν ) = {,, m, a} 2 E = (Ŝ = Sm(ν)) : P ν (E) δ and P ν (E) δ. E ν [N a (τ)]d(µ a, µ m + ɛ) kl(δ, δ) log(/2.4δ).

25 Outline Regret minimization 2 m best arms identification Lower bound on the sample complexity An optimal algorithm? 3 The complexity of A/B Testing The Gaussian case The Bernoulli case

26 The KL-LUCB algorithm Generic notation : confidence interval (C.I.) on the mean of arm a at round t : I a (t) = [L a (t), U a (t)] J(t) the set of m arms with highest empirical means Our contribution : Introduce KL-based confidence intervals U a (t) = max {q ˆµ a (t) : N a (t)d(ˆµ a (t), q) β(t, δ)} L a (t) = min {q ˆµ a (t) : N a (t)d(ˆµ a (t), q) β(t, δ)} for β(t, δ) some exploration rate.

27 The KL-LUCB algorithm At round t, the algorithm : draws two well-chosen arms : u t and l t (in bold) stops when C.I. for arms in J(t) and J(t) c are separated m = 3, K = 6 Set J(t), arm l t in bold Set J(t) c, arm u t in bold

28 The KL-LUCB algorithm At round t, the algorithm : draws two well-chosen arms : u t and l t (in bold) stops when C.I. for arms in J(t) and J(t) c are separated m = 3, K = 6 Set J(t), arm l t in bold Set J(t) c, arm u t in bold

29 Remark : KL-UCB versus KL-LUCB Similar tools for a different behavior : KL-UCB KL-LUCB (m = )

30 Theoretical guarantees Theorem [K.,Kalyanakrishnan 203] KL-LUCB using the exploration rate ( k Kt α ) β(t, δ) = log, δ with α > and k > + α satisfies P ν(ŝ = Sm) δ. For α > 2, ( ) ( E ν [τ] 4αH log + o δ 0 log ), δ δ with H = min K c [µ m+ ;µ m] a= d (µ a, c).

31 Theoretical guarantees Another informational quantity : Chernoff information d (x, y) := d(z, x) = d(z, y), where z is defined by the equality d(z, x) = d(z, y) x z * y

32 Complexity of m best arm identification (FC) Define the following complexity term : κ C (ν) = inf PAC algorithms E ν [τ] lim sup log(/δ) δ 0 Lower bound κ C (ν) m t= Upper bound (for KL-LUCB) κ C (ν) 8 d(µ a, µ m+ ) + min c [µ m+ ;µ m] a= K t=m+ K d(µ a, µ m ) d (µ a, c)

33 Outline Regret minimization 2 m best arms identification Lower bound on the sample complexity An optimal algorithm? 3 The complexity of A/B Testing The Gaussian case The Bernoulli case

34 Motivation : A/B Testing

35 Two possible goals The agent s goal is to design a strategy A = ((A t ), τ, â τ ) satisfying Fixed-confidence setting P ν (â τ a ) δ Fixed-budget setting τ = t E ν [τ] as small p t (ν) := P ν (â t a ) as possible as small as possible An algorithm using uniform sampling is Fixed-confidence setting Fixed-budget setting a sequential test of a test of (µ > µ 2 ) against (µ < µ 2 ) (µ > µ 2 ) against (µ < µ 2 ) with probability of error based on (t/2) samples uniformly bounded by δ [Siegmund 85] : sequential tests can save samples!

36 Two complexity terms M a class of bandit models. A = ((A t ), τ, â τ ) is... Fixed-confidence setting Fixed-budget setting δ-pac on M if ν M, consistent on M if ν M, P ν (â τ a ) δ p t (ν) := P ν (â τ am) 0 t Two complexities : κ C (ν) =inf lim sup A PAC δ 0 E ν[τ] log(/δ) κ B (ν) = inf A cons. ( lim sup t log p t(ν) t ) for a probability of error δ for a probability of error δ, E ν [τ] κ C (ν) log δ budget t κ B (ν) log δ In all our examples, â τ = argmax a ˆµ a (τ) (empirical best arm)

37 Lower bounds in the two-armed case From the previous Lemma... A is δ-pac. ν = (ν, ν 2 ),ν = (ν, ν 2 ) : µ > µ 2 and µ < µ 2. ( ) E ν [N (τ)]kl(ν, ν ) + E ν [N 2 (τ)]kl(ν 2, ν 2) log. 2.4δ previously, a new change of distribution : μ 2 μ μ +ε µ = µ µ 2 = µ + ɛ μ 2 μ * μ * +ε μ µ = µ µ 2 = µ + ɛ choosing µ : d(µ, µ ) = d(µ 2, µ ) := d (µ, µ 2 ) : ( ) d (µ, µ 2 )E ν [τ] log. 2.4δ

38 Lower bounds in the two-armed case Exponential families bandit models : M = { ν (P) 2 : µ µ 2 } Fixed-confidence setting any δ-pac algorithm satisfies E ν [τ] d (µ,µ 2 ) log ( ) 2δ Fixed-budget setting any consistent algorithm satisfies lim sup t log p t(ν) d (µ, µ 2 ) t Gaussian bandit models, with σ, σ 2 known : M = { ν = ( N ( µ, σ 2 ), N ( µ2, σ 2 2)) : (µ, µ 2 ) R 2, µ µ 2 }, E ν [τ] 2(σ +σ 2 ) 2 (µ µ 2 log ( ) ) 2 2δ lim sup t log p t(ν) (µ µ 2 ) 2 2(σ t +σ 2 ) 2

39 Outline Regret minimization 2 m best arms identification Lower bound on the sample complexity An optimal algorithm? 3 The complexity of A/B Testing The Gaussian case The Bernoulli case

40 Fixed-budget setting M = { ν = ( N ( µ, σ) 2 (, N µ2, σ2 2 )) : (µ, µ 2 ) R 2 }, µ µ 2 From the lower bound : κ B (ν) 2(σ + σ 2 ) 2 (µ µ 2 ) 2 A strategy allocating t = σ σ +σ 2 t samples to arm and t 2 = t t samples to arm satisfies lim inf t t log p t(ν) (µ µ 2 ) 2 2(σ + σ 2 ) 2 κ B (ν) = 2(σ + σ 2 ) 2 (µ µ 2 ) 2

41 Fixed-confidence setting : algorithm The α-elimination algorithm with exploration rate β(t, δ) chooses A t in order to keep a proportion N (t)/t α i.e. A t = 2 if and only if αt = α(t + ) if ˆµ a (t) is the empirical mean of rewards obtained from a up to time t, σt 2 (α) = σ 2/ αt + σ2 2 /(t αt ), { } τ = inf t N : ˆµ (t) ˆµ 2 (t) > 2σt 2 (α)β(t, δ)

42 Fixed-confidence setting : results From the lower bound : E ν [τ] 2(σ + σ 2 ) 2 ( ) (µ µ 2 ) 2 log 2δ Theorem With α = σ σ + σ 2 and β(t, δ) = log t δ + 2 log log(6t), α-elimination is δ-pac and ɛ > 0, E ν [τ] ( + ɛ) 2(σ + σ 2 ) 2 ( ) ( (µ µ 2 ) 2 log + o ɛ log ) 2δ δ 0 δ κ C (ν) = 2(σ + σ 2 ) 2 (µ µ 2 ) 2

43 Outline Regret minimization 2 m best arms identification Lower bound on the sample complexity An optimal algorithm? 3 The complexity of A/B Testing The Gaussian case The Bernoulli case

44 Lower bounds for Bernoulli bandit models M = {ν = (B(µ ), B(µ 2 )) : (µ, µ 2 ) ]0; [ 2, µ µ 2 }, From the lower bounds, κ C (ν) d (µ, µ 2 ) and κ B (ν) d (µ, µ 2 ). d (x, y) = d(x, z ) = d(y, z ) d (x, y) = d(z, x) = d(z, y) with z defined by with z defined by d(x, z ) = d(y, z ) d(z, x) = d(z, y) (Chernoff information) For Bernoulli distributions, d (µ, µ 2 ) > d (µ, µ 2 )

45 Fixed-budget setting There exists α(ν) such that a strategy allocating t = α(ν)t samples to arm and t2 = t t samples to arm 2 satisfies p t (ν) exp( td (µ, µ 2 )). κ B (ν) = d (µ, µ 2 ) Remarks : the optimal strategy not implementable in practice using uniform sampling is very close to optimal Consequence : κ C (ν) > κ B (ν)

46 Fixed-confidence setting Another lower bound A δ-pac algorithm using uniform sampling satisfy ( ) E ν [τ] I (µ, µ 2 ) log 2.4δ with I (µ, µ 2 ) = d ( µ, µ +µ 2 2 ) ( + d µ2, µ +µ 2 ) 2. 2 Remark : I (µ, µ 2 ) is very close to d (µ, µ 2 )! in practice, use uniform sampling? The algorithm using uniform sampling and { ( t )} τ = inf t N : ˆµ (t) ˆµ 2 (t) > log δ is δ-pac but not optimal : E[τ] log(/δ) 2 (µ µ 2 ) 2 > I (µ,µ 2 ).

47 The SGLRT algorithm The stopping rule τ = inf { ( t )} t N : ti (ˆµ (t), ˆµ 2 (t)) > log δ corresponds to a Sequential Generalized Likelihood Ratio Test. lim sup δ 0 E ν [τ] log(/δ) I (µ, µ 2 ) SGLRT : optimal among strategies using uniform sampling (hence, close to optimal)

48 Conclusion the complexity of regret minimization is well-understood complexity term involving Kullback-Leibler divergence Chernoff information appears as a relevant complexity measure for best arm identification among two arms complexity terms for the fixed-budget and fixed-confidence settings can be different! Remaining questions A/B Testing : for which classes of distributions is uniform sampling a good idea? the complexity of m best arm identification, m >

49 References O. Cappé, A. Garivier, O-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 203. E. Kaufmann, S. Kalyanakrishnan, Information Complexity in Bandit Subset Selection. In Proceedings of the 26th Conference On Learning Theory (COLT), 203. E. Kaufmann, O. Cappé, A. Garivier. On the Complexity of A/B Testing. In Proceedings of the 27th Conference On Learning Theory (COLT), 204. E. Kaufmann, O. Cappé, A. Garivier. On the Complexity of Best Arm Identification in Multi-Armed Bandit Models. to appear in the Journal of Machine Learning Research, 205 T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 985.

The information complexity of best-arm identification

The information complexity of best-arm identification Emilie Kaufmann, joint work with Olivier Cappé and Aurélien Garivier MAB workshop, Lancaster, January th, 206 Context: the multi-armed bandit model