On the Complexity of Best Arm Identification with Fixed Confidence

Size: px

Start display at page:

Download "On the Complexity of Best Arm Identification with Fixed Confidence"

Josephine Stokes
5 years ago
Views:

On the Complexity of Best Arm Identification with Fixed Confidence Discrete Optimization with Noise Aurélien Garivier, joint work with Emilie

1 On the Complexity of Best Arm Identification with Fixed Confidence Discrete Optimization with Noise Aurélien Garivier, joint work with Emilie Kaufmann CNRS, CRIStAL) to be presented at COLT 16, New York June 6 th, 2016 Institut de Mathématiques de Toulouse LabeX CIMI Université Paul Sabatier

2 Table of contents Preliminaries: Basics of Large Deviation Bounds Identifying the Best Arm with Fixed Confidence Lower Bound on the Sample Complexity The Track-and-Stop Strategy 2

3 Preliminaries: Basics of Large Deviation Bounds

4 Chernoff Bound for Bernoulli variables Let µ 0, 1). Let X 1, X 2,... X n Bµ), and let X n = X X n )/n. Theorem For all x > µ, where klx, y) = x log x y P µ Xn x ) n klx,µ) e +1 x) log 1 x 1 y is the binary relative entropy Corollary For every δ > 0, P µ n kl Xn, µ ) log 1 ) 2δ δ 4

5 Proof: Fenchel-Legendre transform of log-laplace For every λ > 0, Thus, P µ Xn x ) = P µ e λx1+ +Xn) e λnx) E [ ] µ e λx 1+ +X n) e λnx ) = e n λx log E µ[exp λx 1]. 1 n log P µ Xn x ) { sup λx log Eµ [exp λx 1 ] )} λ>0 { = sup λx log 1 µ + µe λ )} λ>0 = klx, µ). kl = binary Kullback-Leibler divergence: more generally [ KLP, Q) = E X P log dp ] dq X ) 5

6 A Divergence on the Set of Possible Means P µ n kl X n, µ ) log 1 ) 2δ δ klp s, ) kl,θ) logα)/s 0 p s u s 6

7 Lower Bound: Change of Measure For all ɛ > 0 and all α > 0, P µ X n x ) = E µ [ 1{ Xn x} ] = E x+ɛ [ 1{ X n x} dp µ ] X 1,..., X n ) dp x+ɛ [ = E x+ɛ 1{ X n x} e ] n i=1 log dp x+ɛ dpµ X i ) E x+ɛ [1{ X { 1 n x} 1 n n i=1 log dpx+ɛ dp µ e n i=1 log dp x+ɛ dpµ X i ) ] µ x x+ ε X i ) E x+ɛ { ] } e n E x+ɛ [log [ dp x+ɛ dpµ X i ) +α 1 P x+ɛ Xn < x ) 1 n P x+ɛ log dp x+ɛ n dp µ i=1 { } = e n klx+ɛ,µ)+α 1 on 1) ). X i ) > E x+ɛ [ ] } log dpx+ɛ X 1) + α dp µ [ log dp ] ) ] x+ɛ X 1 ) + α dp µ 7

8 Lower Bound: Change of Measure For all ɛ > 0 and all α > 0, P µ X n x ) = E µ [ 1{ Xn x} ] { 1 E x+ɛ [1{ X n x} 1 n n i=1 log dpx+ɛ dp µ e n i=1 log dp x+ɛ dpµ X i ) { ] } e n E x+ɛ [log [ dp x+ɛ dpµ X i ) +α 1 P x+ɛ Xn < x ) 1 n P x+ɛ log dp x+ɛ n dp µ i=1 { } = e n klx+ɛ,µ)+α 1 on 1) ). ] X i ) > E x+ɛ µ [ x x+ ε ] X i ) E x+ɛ log dpx+ɛ X 1) + α dp µ } [ log dp ] ) ] x+ɛ X 1 ) + α dp µ Asymptotic Optimality Large Deviation Lower Bound) lim inf n 1 n log P µ Xn x ) klx, µ) 7

9 Lower Bound: Change of Measure For all ɛ > 0 and all α > 0, P µ X n x ) = E µ [ 1{ Xn x} ] { 1 E x+ɛ [1{ X n x} 1 n n i=1 log dpx+ɛ dp µ e n i=1 log dp x+ɛ dpµ X i ) { ] } e n E x+ɛ [log [ dp x+ɛ dpµ X i ) +α 1 P x+ɛ Xn < x ) 1 n P x+ɛ log dp x+ɛ n dp µ i=1 { } = e n klx+ɛ,µ)+α 1 on 1) ). ] X i ) > E x+ɛ µ [ x x+ ε ] X i ) E x+ɛ log dpx+ɛ X 1) + α dp µ } [ log dp ] ) ] x+ɛ X 1 ) + α dp µ Asymptotic Optimality Large Deviation Lower Bound) 1 n log P µ Xn x ) klx, µ) n 7

10 Lower Bound: the Entropy Way Notation: KLY, Z) = KL LY ), LZ) ). For all ɛ > 0, if X 1,..., X n iid Bµ) and X 1,..., X n iid Bx + ɛ): µ x x+ ε n klx + ɛ, µ) = KL Bx + ɛ) n, Bµ) n) KLP P, Q Q ) = KLP, Q) + KLP, Q ) = KL X 1,..., X n), X 1,..., X n ) ) KL 1 { X n x }, 1 { X n x } ) contraction of entropy = data-processing inequality = kl P x+ɛ X n x ), P µ Xn x ) ) P x+ɛ X n x ) log 1 P µ Xn x ) log2) 1 klp, q) p log log 2 q A non-asymptotic lower bound P µ Xn x ) e n klx+ɛ,µ)+log2) 1 e 2nɛ2 8

11 Identifying the Best Arm with Fixed Confidence

12 The Stochastic Multi-Armed Bandit Model MAB) K arms = K probability distributions ν a has mean µ a, here: ν a = Bµ a )) At round t, an agent: ν 1 ν 2 ν 3 ν 4 ν 5 chooses an arm A t A := {1,..., K} observes a sample X t Bµ At ) using a sequential sampling strategy A t ): A t+1 = φ t A 1, X 1,..., A t, X t ), aimed for a prescribed objective, e.g. related to learning a = argmax a µ a and µ = max µ a. a 10

13 Usual Objective: Regret Minimization Samples = rewards, A t ) is adjusted to [ T ] maximize the expected) sum of rewards, E t=1 X t or equivalently minimize regret: [ ] T R T = E T µ X t exploration/exploitation tradeoff t=1 Motivation: clinical trials [1933] Bµ 1 ) Bµ 2 ) Bµ 3 ) Bµ 4 ) Bµ 5 ) Goal: maximize the number of patients healed during the trial 11

Motivation: clinical trials [1933] Bµ 1 )

the number of patients healed during the

14 Usual Objective: Regret Minimization Samples = rewards, A t ) is adjusted to [ T ] maximize the expected) sum of rewards, E t=1 X t or equivalently minimize regret: [ ] T R T = E T µ X t exploration/exploitation tradeoff t=1 Motivation: clinical trials [1933] Bµ 1 ) Bµ 2 ) Bµ 3 ) Bµ 4 ) Bµ 5 ) Goal: maximize the number of patients healed during the trial Alternative goal: identify as quickly as possible the best treatment 11

15 Our Objective: Best-arm Identification Goal : identify the best arm, a, as fast and accurately as possible. No incentive to draw arms with high means! optimal exploration The agent s strategy is made of: a sequential sampling strategy A t ) a stopping rule τ stopping time) a recommendation rule â τ Possible goals: Fixed-budget setting given τ = T minimize Pâ τ a ) Fixed-confidence setting minimize E[τ] under constraint Pâ τ a ) δ Motivation: clinical trials, market research, A/B testing... 12

16 Wanted: Optimal Algorithms for PAC-BAI S a class of bandit models ν = ν 1,..., ν K ). A strategy is δ-pac on S is ν S, P ν â τ = a ) 1 δ. Goal: for some classes S, find a lower bound on E ν [τ] for any δ-pac strategy and any ν S, a δ-pac strategy such that E ν [τ] matches this bound for all ν S distribution-dependent bounds) best achievable E ν [τ] = sample complexity of model ν 13

17 Racing Strategy see [Kaufmann & Kalyanakrishnan 13] R := {1,..., K} set of remaining arms. r := 0 current round br while R > 1 r := r + 1 wr draw each a R, compute ˆµ a,r, the empirical mean of the r samples observed sofar compute the empirical best and empirical worst arms: b r = argmax a R ˆµ a,r w r = argmin a R ˆµ a,r Elimination step: if l br r) > u wr r), then eliminate w r : R := R\{w r } end Outpout: â the single element in R. 14

18 Lower Bound on the Sample Complexity

19 Key Inequality for Lower Bounds in Bandit Models Let µ = µ 1,..., µ K ) and λ = λ 1,..., λ K ) be two bandit models with KL-divergence d = kl for Bernoulli models). Change of distribution lemma [G., Ménard, Stoltz 16] For every stopping time τ and every F τ -measurable variable Z almost surely bounded in [0, 1], K [ E µ Na τ) ] dµ a, λ a ) kl E µ [Z], E λ [Z] ) cf lower bound n dx + ɛ, µ) kl P x+ɛ X n x ), P µ Xn x ) ) Useful if the behaviour of the algorithm and of Z) is supposed to be very different under µ and under λ. Permits to prove the famous Lai&Robbins lower bound on regret clearly in a few lines. 16

20 Key Inequality for PAC-BAI Let µ = µ 1,..., µ K ) and λ = λ 1,..., λ K ) be two bandit models. Change of distribution lemma [Kaufmann, Cappé, G. 15] If a µ) a λ), any δ-pac algorithm satisfies K [ E µ Na τ) ] dµ a, λ a ) klδ, 1 δ) Using it for each arm separately, one obtains: µ 4 µ 3 µ 2 m 2 µ 1 Theorem For any δ-pac algorithm, E µ [τ] 1 K dµ 1, µ 2 ) + a=2 ) 1 klδ, 1 δ) dµ a, µ 1 ) Remark: klδ, 1 δ) log ) 1 δ 0 δ and klδ, 1 δ) log 1 ) 2.4δ 17

21 Combining the Inequalities µ = µ 1,..., µ K ) and λ = λ 1,..., λ K ) be two bandit models. Uniform δ-pac Constraint [Kaufmann, Cappé, G. 15] If a µ) a λ), any δ-pac algorithm satisfies K [ E µ Na τ) ] dµ a, λ a ) klδ, 1 δ). Let Altµ) = {λ : a λ) a µ)}. µ 4 µ 3 µ 2 m 2 µ 1 E µ [τ] E µ [τ] inf λ Altµ) inf λ Altµ) sup w Σ K K E µ [N a τ)]dµ a, λ a ) klδ, 1 δ) K inf E µ [N a τ)] dµ a, λ a ) E µ [τ] klδ, 1 δ) ) λ Altµ) K w a dµ a, λ a ) klδ, 1 δ) 18

22 Combining the Inequalities µ = µ 1,..., µ K ) and λ = λ 1,..., λ K ) be two bandit models. Uniform δ-pac Constraint [Kaufmann, Cappé, G. 15] If a µ) a λ), any δ-pac algorithm satisfies K [ E µ Na τ) ] dµ a, λ a ) klδ, 1 δ). Let Altµ) = {λ : a λ) a µ)}. µ 4 µ 3 µ 2 m 3 µ 1 E µ [τ] E µ [τ] inf λ Altµ) inf λ Altµ) sup w Σ K K E µ [N a τ)]dµ a, λ a ) klδ, 1 δ) K inf E µ [N a τ)] dµ a, λ a ) E µ [τ] klδ, 1 δ) ) λ Altµ) K w a dµ a, λ a ) klδ, 1 δ) 18

23 Combining the Inequalities µ = µ 1,..., µ K ) and λ = λ 1,..., λ K ) be two bandit models. Uniform δ-pac Constraint [Kaufmann, Cappé, G. 15] If a µ) a λ), any δ-pac algorithm satisfies K [ E µ Na τ) ] dµ a, λ a ) klδ, 1 δ). Let Altµ) = {λ : a λ) a µ)}. µ 4 µ 3 µ 2 m 4 µ 1 E µ [τ] E µ [τ] inf λ Altµ) inf λ Altµ) sup w Σ K K E µ [N a τ)]dµ a, λ a ) klδ, 1 δ) K inf E µ [N a τ)] dµ a, λ a ) E µ [τ] klδ, 1 δ) ) λ Altµ) K w a dµ a, λ a ) klδ, 1 δ) 18

24 Combining the Inequalities µ = µ 1,..., µ K ) and λ = λ 1,..., λ K ) be two bandit models. Uniform δ-pac Constraint [Kaufmann, Cappé, G. 15] If a µ) a λ), any δ-pac algorithm satisfies K [ E µ Na τ) ] dµ a, λ a ) klδ, 1 δ). Let Altµ) = {λ : a λ) a µ)}. µ 4 µ 3 µ 2 m 4 m 3 m 2 µ 1 E µ [τ] E µ [τ] inf λ Altµ) inf λ Altµ) sup w Σ K K E µ [N a τ)]dµ a, λ a ) klδ, 1 δ) K inf E µ [N a τ)] dµ a, λ a ) E µ [τ] klδ, 1 δ) ) λ Altµ) K w a dµ a, λ a ) klδ, 1 δ) 18

25 Lower Bound: the Complexity of BAI Theorem For any δ-pac algorithm, E µ [τ] T µ) klδ, 1 δ), where T µ) 1 = sup w Σ K inf λ Altµ) K ) w a dµ a, λ a ). Cf. [Graves and Lai 1997, Vaidhyan and Sundaresan, 2015] A kind of game : you choose the proportions of draws w a ) a, the opponent chooses the alternative the optimal proportions of arm draws are K ) w µ) = argmax w Σ K inf λ Altµ) w a dµ a, λ a ) 19

26 PAC-BAI as a Game Given a parameter µ = µ 1,..., µ K ) such that µ 1 > µ 2 µ K : the statistician chooses proportions of arm draws w = w a ) a the opponent chooses an alternative model λ the payoff is the minimal number T = T w, λ) of draws necessary to ensure that he does not violate the δ-pac constraint K Tw a dµ a, λ a ) klδ, 1 δ) T µ) = value of the game w = optimal action for the statistician 20

27 PAC-BAI as a Game Given a parameter µ = µ 1,..., µ K ) such that µ 1 > µ 2 µ K : the statistician chooses proportions of arm draws w = w a ) a the opponent chooses an arm a {2,..., K} and λ a = arg min λ w 1 dµ 1, λ) + w a dµ a, λ) the payoff is the minimal number T = T w, a) of draws necessary to ensure that ) ) P µ ˆµ1,Tw1 ˆµ a,twa Pµ ˆµ1,Tw1 < λ a and ˆµ a,twa λ a exp T w 1 klµ 1, λ a )) + w a klµ a, λ a ) )) δ log1/δ) that is T w, a) = w 1 klµ 1, λ a ɛ)) + w a klµ a, λ) T µ) = value of the game w = optimal action for the statistician 20

28 Computing the optimal proportions Computing w w argmax w Σ K inf λ Altµ) K ) w a dµ a, λ a ) } {{ } ) An explicit calculation yields [ ) = min w 1 d µ 1, w 1µ 1 + w a µ a a 1 w 1 + w a ) = w 1 min g wa a w 1 0) a 1 w 1 where g a x) = d µ 1, µ1+xµa 1+x ) + xd ) + w a d µ a, w )] 1µ 1 + w a µ a w 1 + w a µ a, µ1+xµa 1+x ) Jensen-Shannon divergence) g a is a one-to-one mapping from [0, + [ onto [0, dµ 1, µ a )[. 21

29 Computing the optimal proportions Computing w w argmax w Σ K inf λ Altµ) K ) w a dµ a, λ a ) } {{ } ) An explicit calculation yields [ ) = min w 1 d µ 1, w 1µ 1 + w a µ a a 1 w 1 + w a ) = w 1 min g wa a w 1 0) a 1 w 1 where g a x) = d µ 1, µ1+xµa 1+x ) + xd ) + w a d µ a, w )] 1µ 1 + w a µ a w 1 + w a µ a, µ1+xµa 1+x ) Jensen-Shannon divergence) g a is a one-to-one mapping from [0, + [ onto [0, dµ 1, µ a )[. x 1 = 1 x 2 = w 2 /w 1... x K = w K /w 1 21

30 Computing the optimal proportions Letting x a = w a /w 1 for all a 2, x 2,..., x K argmax x 2,...,x K 0 min a 1 g a x a ) 1 + x 2 + x K. It is easy to check that there exists y [0, dµ 1, µ 2 )[ such that a {2,..., K}, g a x a ) = y. Letting x a y) = ga 1 y), one has xa = x a y ) where y argmax y [0,dµ 1,µ 2)[ y 1 + x 2 y) + x K y). 22

31 Computing the optimal proportions Theorem For every a {1,..., K}, w a µ) = x a y ) K x ay ), where y is the unique solution of the equation F µ y) = 1, where ) K d µ 1, µ1+xay)µa 1+x ay) F µ : y ) a=2 d µ a, µ1+xay)µa 1+x ay) is a continuous, increasing function on [0, dµ 1, µ 2 )[ such that F µ 0) = 0 and F µ y) when y dµ 1, µ 2 ). an efficient way to compute the vector of proportions w µ) 23

32 Properties of T µ) and w µ) 1. For all µ S, for all a, w a µ) > 0 2. w is continuous in every µ S 3. If µ 1 > µ 2 µ K, one has w 2 µ) w K µ) one may have w 1 µ) < w 2 µ)) 4. Case of two arms [Kaufmann, Cappé, G. 14]: E µ [τ δ ] klδ, 1 δ) d µ 1, µ 2 ). where d is the reversed Chernoff information d µ 1, µ 2 ) := dµ 1, µ ) = dµ 2, µ ). 5. Gaussian arms : algebraic equation but no simple formula when K 3, only: K 2σ 2 K 2 T 2σ 2 µ) 2 a 2. a 24

33 The Track-and-Stop Strategy

34 Sampling rule: Tracking the optimal proportions ˆµt) = ˆµ 1 t),..., ˆµ K t)): vector of empirical means Introducing U t = { a : N a t) < t }, the arm sampled at round t + 1 is argmin N a t) if U t forced exploration) a U A t+1 t argmax [t wa ˆµt)) N a t)] tracking) 1 a K Lemma Under the Tracking sampling rule, P µ lim t N a t) t ) = wa µ) = 1. 26

35 Sequential Generalized Likelihood Test High values of the Generalized Likelihood Ratio Z a,b t) := log max {λ:λ a λ b } dp λ X 1,..., X t ) max {λ:λa λ b } dp λ X 1,..., X t ) reject the hypothesis that µ a < µ b ). We stop when one arm is accessed to be significantly larger than all other arms, according to a GLR Test: τ δ = inf {t N : a {1,..., K}, b a, Z a,b t) > βt, δ)} { } = inf t N : max min Z a,bt) > βt, δ) a {1,...,K} b a Chernoff stopping rule [Chernoff 59] 27

36 Stopping Rule: Alternative Formulations One has Z a,b t) = Z b,a t) and, if ˆµ a t) ˆµ b t), Z a,b t) = N a t) d ˆµ a t), ˆµ a,b t) ) + N b t) d ˆµ b t), ˆµ a,b t) ), where ˆµ a,b t) := A link with the lower bound N at) N at)+n b t) ˆµ at) + N bt) N at)+n b t) ˆµ bt). max min Z a,bt) = t inf a b a K λ Altˆµt)) t T µ) under a good sampling strategy for t large) N a t) dˆµ a t), λ a ) t 28

37 Stopping Rule: Alternative Formulations One has Z a,b t) = Z b,a t) and, if ˆµ a t) ˆµ b t), Z a,b t) = N a t) d ˆµ a t), ˆµ a,b t) ) + N b t) d ˆµ b t), ˆµ a,b t) ), where ˆµ a,b t) := N at) N at)+n b t) ˆµ at) + N bt) N at)+n b t) ˆµ bt). A Minimum Description Length interpretation If Hµ) = E X ν µ[ log p µ X )] is the Shannon entropy, Z a,b t) = N a t) + N b t))hˆµ a,b t)) }{{} average #bits to encode the samples of a and b together [N a t)hˆµ a t)) + N b t)hˆµ b t))], }{{} average #bits to encode the sample of a and b separately 29

38 Calibration The Chernoff rule is δ-pac for βt, δ) = log Lemma If µ a < µ b, whatever the sampling rule, P µ t N : Z a,b t) > log ) 2K 1)t δ )) 2t δ δ i.e., PT a,b < ) δ, for T a,b = inf{t N : Z a,b t) > log2t/δ)} { Ta,b = t } maxµ a p µ µ b X a a t )p µ X b b t ) ) max µ a µ p µ b X a a t )p µ X b b t ) 2t δ [ P µ T a,b < ) = E µ 1{Ta,b = t} ] t=1 t=1 [ δ 2t E µ 1{T a,b = t} max µ p a µ µ b X ] a a t )p µ b X b t ) max µ a µ p µ b X a a t )p µ b X b t ) 30

39 Stopping rule: δ-pac property δ max µ a µ P µt a,b < ) 2t Eµ 1{T p µ b X a a t )p µ X b t ) b a,b = t} p t=1 µa X a t )pµ b X b t ) δ = 1{T a,b = t} x 2t t ) max p t=1 x t {0,1} t µ µ x a a µ a t )p µ x b t ) p µi x i t ) b b i {1,...,K}\{a,b} }{{} not a probability density... Lemma [Willems et al. 95] The Krichevsky-Trofimov distribution ktx) = π u1 u) p ux)du is a probability law on {0, 1} n that satisfies sup u [0,1] p u x) sup 2 n x {0,1} ktx) n 31

40 Stopping rule: δ-pac property P µt a,b < ) = = δ t=1 t=1 δ 2t δ 2t δ t=1 t=1 [ δ 2t Eµ 1{T a,b = t} max µ p a µ µ b X a a t )p µ X b b t ) ] p µa X a t )pµ X b b t ) x t {0,1} t 1{T a,b = t} x t ) max µ a µ b p µ a x a t )p µ b x b t ) i {1,...,K}\{a,b} 1{T a,b = t} x t ) 4 n at n bt ktx at )ktx bt ) x t {0,1} t 1{T a,b = t} x t )I x t ) x t {0,1} t Ẽ[1{T a,b = t}] = δ PT a,b < ) δ. t=1 i {1,...,K}\{a,b} p µi x i t ) p µi x i t ) } {{ } I x t ) 32

41 Asymptotic Optimality of the T&S strategy Theorem The Track-and-Stop strategy, that uses the Tracking sampling rule the Chernoff stopping rule with βt, δ) = log and recommends â τ = argmax...k ˆµ a τ) is δ-pac for every δ 0, 1) and satisfies lim sup δ 0 E µ [τ δ ] log1/δ) = T µ). ) 2K 1)t δ 33

42 Sketch of proof almost-sure convergence only) forced exploration = N a t) a.s. for all a {1,..., K} µt) µ a.s. w ˆµt) ) w a.s. tracking rule: N a t) t t w a but the mapping F : µ, w) continuous at µ, w µ)): as max min a exists t 0 such that b a Z a,bt) = t F t t 0 { = Thus τ t 0 inf and max min a a.s. inf λ Altµ ) K w a dµ a, λ a ) is ) ˆµt), N a t)/t) K, for every ɛ > 0 there Z a,bt) 1 + ɛ) 1 T µ) 1 t b a } t N : 1 + ɛ) 1 T µ) 1 t log2k 1)t/δ) τ lim sup δ 0 log1/δ) 1 + ɛ)t µ). 34

43 Numerical Experiments µ 1 = [ ] w µ 1 ) = [ ] µ 2 = [ ] w µ 2 ) = [ ] In practice, set the threshold to βt, δ) = log ) logt)+1 δ δ-pac OK) Track-and-Stop Chernoff-Racing KL-LUCB KL-Racing µ µ Table 1: Expected number of draws E µ[τ δ ] for δ = 0.1, averaged over N = 3000 experiments. Empirically good even for large values of the risk δ Racing is sub-optimal in general, because it plays w 1 = w 2 35

44 Perspectives For best arm identification, we showed that inf lim sup PAC algorithm δ 0 E µ [τ δ ] log1/δ) = sup w Σ K inf λ Altµ) K ) w a dµ a, λ a ) and provided an efficient strategy matching this bound. Future work: easy) find an ɛ-optimal arm give a simple algorithm with a finite-time analysis extend to structured and continuous settings 36

45 References O. Cappé, A. Garivier, O-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, H. Chernoff. Sequential design of Experiments. The Annals of Mathematical Statistics, E. Even-Dar, S. Mannor, Y. Mansour, Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. JMLR, T.L. Graves and T.L. Lai. Asymptotically Efficient adaptive choice of control laws in controlled markov chains. SIAM Journal on Control and Optimization, 353):715743, S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi- armed bandits. ICML, E. Kaufmann, O. Cappé, A. Garivier. On the Complexity of Best Arm Identification in Multi-Armed Bandit Models. JMLR, 2015 A. Garivier, E. Kaufmann. Optimal Best Arm Identification with Fixed Confidence, COLT 16, New York, arxiv: A. Garivier, P. Ménard, G. Stoltz. Explore First, Exploit Next: The True Shape of Regret in Bandit Problems. E. Kaufmann, S. Kalyanakrishnan. The information complexity of best arm identification, COLT 2013 T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, N.K. Vaidhyan and R. Sundaresan. Learning to detect an oddball target. arxiv: ,

On the Complexity of Best Arm Identification with Fixed Confidence

On the Complexity of Best Arm Identification with Fixed Confidence Discrete Optimization with Noise Aurélien Garivier, Emilie Kaufmann COLT, June 23 th 2016, New York Institut de Mathématiques de Toulouse