On the Complexity of Best Arm Identification with Fixed Confidence: Discrete Optimization with Noise
Aurélien Garivier, Emilie Kaufmann
COLT, June 23rd 2016, New York
Institut de Mathématiques de Toulouse, LabeX CIMI, Université Paul Sabatier, France
Université Lille, CNRS UMR 9189, Laboratoire CRIStAL, F-59000 Lille, France
The Problem
Best-Arm Identification with Fixed Confidence

K options = probability distributions ν = (ν_a)_{1 ≤ a ≤ K}, where ν_a belongs to an exponential family F parameterized by its expectation μ_a.

At round t, you may:
- choose an option A_t = φ_t(A_1, X_1, ..., A_{t−1}, X_{t−1}) ∈ {1, ..., K}
- observe a new independent sample X_t ∼ ν_{A_t}

so as to identify the best arm a* = argmax_a μ_a (with μ* = max_a μ_a) as fast as possible: stopping time τ_δ.

Fixed-budget setting: given τ = T, minimize P(â_τ ≠ a*).
Fixed-confidence setting: minimize E[τ_δ] under the constraint P(â_{τ_δ} ≠ a*) ≤ δ.
Intuition

Simplest setting: for all a ∈ {1, ..., K}, ν_a = N(μ_a, 1). For example: μ = [2, 1.75, 1.75, 1.6, 1.5].

At time t, you have sampled option a n_a times, and your empirical average is X̄_{a,n_a}.

[figure: the five Gaussian densities, arms 1 to 5]

If you stop at time t, your probability of preferring arm a ≥ 2 to arm a* = 1 is

P( X̄_{a,n_a} > X̄_{1,n_1} ) = P( [ (X̄_{a,n_a} − μ_a) − (X̄_{1,n_1} − μ_1) ] / √(1/n_1 + 1/n_a) > (μ_1 − μ_a) / √(1/n_1 + 1/n_a) )
 = Φ( (μ_1 − μ_a) / √(1/n_1 + 1/n_a) ),

where Φ(u) = ∫_u^∞ e^{−v²/2} / √(2π) dv is the standard normal upper tail.
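This Gaussian confusion probability is easy to evaluate numerically. A minimal sketch (the function name `confusion_prob` is ours, not from the talk), using the slide's example means and a uniform allocation of 100 samples per arm:

```python
import math

def confusion_prob(mu_best, mu_a, n_best, n_a):
    # P(empirical mean of arm a exceeds that of the best arm) for
    # unit-variance Gaussian arms: the upper tail of N(0, 1) at the
    # standardized gap (mu_best - mu_a) / sqrt(1/n_best + 1/n_a).
    gap = (mu_best - mu_a) / math.sqrt(1.0 / n_best + 1.0 / n_a)
    return 0.5 * math.erfc(gap / math.sqrt(2.0))

mu = [2.0, 1.75, 1.75, 1.6, 1.5]   # example from the slide
n = 100                            # uniform allocation: 100 samples per arm
probs = [confusion_prob(mu[0], mu[a], n, n) for a in range(1, 5)]
print(probs)  # arms closest to arm 1 are the hardest to separate
```

With a uniform allocation the error is dominated by arms 2 and 3, which already suggests spending the budget unevenly.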
Uniform Sampling

[animation: simulation of uniform sampling over the five arms]
Intuition: Equalizing the Probabilities of Confusion

Simplest setting: for all a ∈ {1, ..., K}, ν_a = N(μ_a, 1). For example: μ = [2, 1.75, 1.75, 1.6, 1.5].

Active learning: you allocate a relative budget w_a to option a, with w_1 + ... + w_K = 1. At time t, you have sampled option a n_a ≈ w_a t times, and your empirical average is X̄_{a,n_a}.

If you stop at time t, your probability of preferring arm a ≥ 2 to arm a* = 1 is

P( X̄_{a,n_a} > X̄_{1,n_1} ) = Φ( (μ_1 − μ_a) / √(1/n_1 + 1/n_a) ).
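As a quick illustration of why equalizing matters, the sketch below compares the worst-case confusion probability of a uniform allocation with a hand-picked tilted allocation on the slide's example; the weights in `tilted` are illustrative, not the optimal w*:

```python
import math

def confusion_prob(mu_best, mu_a, n_best, n_a):
    # Upper tail of N(0, 1) at the standardized gap (unit-variance arms).
    gap = (mu_best - mu_a) / math.sqrt(1.0 / n_best + 1.0 / n_a)
    return 0.5 * math.erfc(gap / math.sqrt(2.0))

mu = [2.0, 1.75, 1.75, 1.6, 1.5]
t = 500  # total sampling budget

def max_confusion(w):
    # Worst pairwise probability of confusion when n_a = w_a * t.
    n = [w_a * t for w_a in w]
    return max(confusion_prob(mu[0], mu[a], n[0], n[a]) for a in range(1, 5))

uniform = [0.2] * 5
tilted = [0.40, 0.25, 0.25, 0.06, 0.04]  # more budget where confusion is likely
print(max_confusion(uniform), max_confusion(tilted))
```

Tilting the budget toward the near-optimal arms lowers the worst confusion probability for the same total number of samples.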
Improving: trial 1

[animation: simulation of a first non-uniform allocation]
Improving: trial 2

[animation: simulation of a second non-uniform allocation]
Improving: trial 3

[animation: simulation of a third non-uniform allocation]
Optimal Proportions

[animation: simulation under the optimal proportions of arm draws]
How to Turn this Intuition into a Theorem?

- The arms are not Gaussian (no closed formula for the probability of confusion) → large deviations (Sanov, KL)
- You do not allocate a relative budget in advance, but use sequential sampling (no fixed-size samples: a sequential experiment) → tracking lemma
- How to compute the optimal proportions? → lower bound, game
- The parameters of the distributions are unknown → (sequential) estimation
- When should you stop? → Chernoff's stopping rule
Exponential Families

ν_1, ..., ν_K belong to a one-dimensional exponential family

P_{λ,Θ,b} = { ν_θ, θ ∈ Θ : ν_θ has density f_θ(x) = exp(θx − b(θ)) w.r.t. λ }.

Examples: Gaussian, Bernoulli, Poisson distributions...
ν_θ can be parameterized by its mean μ = ḃ(θ): ν^μ := ν_{ḃ⁻¹(μ)}.

Notation: Kullback-Leibler divergence. For a given exponential family,

d(μ, μ′) := KL(ν^μ, ν^{μ′}) = E_{X∼ν^μ}[ log (dν^μ / dν^{μ′})(X) ]

is the KL divergence between the distributions of means μ and μ′.

We identify ν = (ν^{μ_1}, ..., ν^{μ_K}) with μ = (μ_1, ..., μ_K) and consider

S = { μ ∈ (ḃ(Θ))^K : ∃ a ∈ {1, ..., K} : μ_a > max_{i≠a} μ_i }.
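For the three families mentioned above, d(μ, μ′) has a closed form; a small sketch (standard formulas, with unit variance assumed in the Gaussian case):

```python
import math

# d(mu, mu') for three one-parameter exponential families, each
# parameterized by its mean (standard closed forms; the Gaussian case
# assumes a known variance sigma^2).

def d_gaussian(mu, mup, sigma2=1.0):
    return (mu - mup) ** 2 / (2.0 * sigma2)

def d_bernoulli(mu, mup):
    return mu * math.log(mu / mup) + (1 - mu) * math.log((1 - mu) / (1 - mup))

def d_poisson(mu, mup):
    return mup - mu + mu * math.log(mu / mup)

print(d_gaussian(0.5, 0.3), d_bernoulli(0.5, 0.3), d_poisson(0.5, 0.3))
```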
Back to: Equalizing the Probabilities of Confusion

Simplest setting: for all a ∈ {1, ..., K}, ν_a = N(μ_a, 1). For example: μ = [2, 1.75, 1.75, 1.6, 1.5]. You allocate a relative budget w_a to option a, with w_1 + ... + w_K = 1; at time t you have sampled option a n_a ≈ w_a t times, and the empirical average is X̄_{a,n_a}.

If you stop at time t, your probability of preferring arm a ≥ 2 to arm a* = 1 is

P( X̄_{a,n_a} > X̄_{1,n_1} ) = Φ( (μ_1 − μ_a) / √(1/n_1 + 1/n_a) ) ≤ exp( −(μ_1 − μ_a)² / (2(1/n_1 + 1/n_a)) )   (Chernoff's bound)

Large Deviation Principle: this probability is of order

e^{−n_1 (μ_1 − m)² / 2} · e^{−n_a (μ_a − m)² / 2}   with m = (n_1 μ_1 + n_a μ_a) / (n_1 + n_a),

i.e. P( X̄_{1,n_1} < m ) · P( X̄_{a,n_a} ≥ m ) = max_{μ_a < m < μ_1} P( X̄_{1,n_1} < m ) · P( X̄_{a,n_a} ≥ m )   (cf. Sanov)
Entropic Method for Large Deviations Lower Bounds

Let d(μ, μ′) = KL( N(μ, 1), N(μ′, 1) ) = (μ − μ′)² / 2, write KL(Y, Z) for KL( L(Y), L(Z) ), and fix ε > 0 and μ_a ≤ m ≤ μ_1. Compare:

under P: X_{1,1}, ..., X_{1,n_1} iid N(μ_1, 1) and X_{a,1}, ..., X_{a,n_a} iid N(μ_a, 1);
under Q: X_{1,1}, ..., X_{1,n_1} iid N(m − ε, 1) and X_{a,1}, ..., X_{a,n_a} iid N(m + ε, 1).

Then

n_1 d(m − ε, μ_1) + n_a d(m + ε, μ_a)
 = KL( law of the sample under Q, law of the sample under P )
 ≥ KL( 1{X̄_{a,n_a} > X̄_{1,n_1}} under Q, 1{X̄_{a,n_a} > X̄_{1,n_1}} under P )   (contraction of entropy = data-processing inequality)
 = kl( Q( X̄_{a,n_a} > X̄_{1,n_1} ), P( X̄_{a,n_a} > X̄_{1,n_1} ) ),   where kl(p, q) = p log(p/q) + (1−p) log((1−p)/(1−q))
 ≥ Q( X̄_{a,n_a} > X̄_{1,n_1} ) · log( 1 / P( X̄_{a,n_a} > X̄_{1,n_1} ) ) − log(2),   since kl(p, q) ≥ p log(1/q) − log(2).

Since Q( X̄_{a,n_a} > X̄_{1,n_1} ) ≥ 1 − e^{−(n_1+n_a)ε²/2}, this yields

P( X̄_{a,n_a} > X̄_{1,n_1} ) ≥ max_{μ_a ≤ m ≤ μ_1} exp( − [ n_1 d(m − ε, μ_1) + n_a d(m + ε, μ_a) + log(2) ] / (1 − e^{−(n_1+n_a)ε²/2}) ),

and the optimal choice m = (n_1 μ_1 + n_a μ_a) / (n_1 + n_a) makes the exponent of order (μ_1 − μ_a + ε)² / (2(1/n_1 + 1/n_a)).

With n_a = w_a T this reads

(1/T) log( 1 / P( X̄_{a,n_a} > X̄_{1,n_1} ) ) ≲ w_1 d(m − ε, μ_1) + w_a d(m + ε, μ_a):

if you want to have P( X̄_{a,n_a} > X̄_{1,n_1} ) ≤ δ, then you need

T ≥ log(1/δ) / ( w_1 d(m, μ_1) + w_a d(m, μ_a) ).
Lower Bound
Lower-Bounding the Sample Complexity

Let μ = (μ_1, ..., μ_K) and λ = (λ_1, ..., λ_K) be two elements of S.

Uniform δ-PAC Constraint [Kaufmann, Cappé, G. '15]. If a*(μ) ≠ a*(λ), any δ-PAC algorithm satisfies

Σ_{a=1}^K E_μ[ N_a(τ_δ) ] d(μ_a, λ_a) ≥ kl(δ, 1−δ).

Let Alt(μ) = { λ : a*(λ) ≠ a*(μ) }. Taking λ_1 = m_a − ε and λ_a = m_a + ε for each a ∈ {2, 3, 4} yields:

E_μ[N_1(τ_δ)] d(μ_1, m_2 − ε) + E_μ[N_2(τ_δ)] d(μ_2, m_2 + ε) ≥ kl(δ, 1−δ)
E_μ[N_1(τ_δ)] d(μ_1, m_3 − ε) + E_μ[N_3(τ_δ)] d(μ_3, m_3 + ε) ≥ kl(δ, 1−δ)
E_μ[N_1(τ_δ)] d(μ_1, m_4 − ε) + E_μ[N_4(τ_δ)] d(μ_4, m_4 + ε) ≥ kl(δ, 1−δ)

More generally,

inf_{λ∈Alt(μ)} Σ_{a=1}^K E_μ[N_a(τ_δ)] d(μ_a, λ_a) ≥ kl(δ, 1−δ)
E_μ[τ_δ] · inf_{λ∈Alt(μ)} Σ_{a=1}^K ( E_μ[N_a(τ_δ)] / E_μ[τ_δ] ) d(μ_a, λ_a) ≥ kl(δ, 1−δ)
E_μ[τ_δ] · sup_{w∈Σ_K} inf_{λ∈Alt(μ)} ( Σ_{a=1}^K w_a d(μ_a, λ_a) ) ≥ kl(δ, 1−δ)
Lower Bound: the Complexity of BAI

Theorem. For any δ-PAC algorithm, E_μ[τ_δ] ≥ T*(μ) kl(δ, 1−δ), where

T*(μ)⁻¹ = sup_{w∈Σ_K} inf_{λ∈Alt(μ)} ( Σ_{a=1}^K w_a d(μ_a, λ_a) ).

- kl(δ, 1−δ) ∼ log(1/δ) when δ → 0, and kl(δ, 1−δ) ≥ log( 1/(2.4δ) ); cf. [Graves and Lai 1997, Vaidhyan and Sundaresan 2015]
- the optimal proportions of arm draws are w*(μ) = argmax_{w∈Σ_K} inf_{λ∈Alt(μ)} ( Σ_{a=1}^K w_a d(μ_a, λ_a) )
- they do not depend on δ
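The characteristic time T*(μ) can be approximated numerically. A hedged sketch for unit-variance Gaussian arms: for a fixed w, the inner infimum reduces to the best pairwise "swap" of the top arm with each challenger at their weighted mean, and we maximize over the simplex by crude random search rather than the scalar-equation solver of the paper:

```python
import random

mu = [0.5, 0.45, 0.43, 0.4]            # best arm is arm 0
d = lambda x, y: (x - y) ** 2 / 2.0    # Gaussian KL divergence, unit variance

def F(w):
    # inf over lambda in Alt(mu) of sum_a w_a d(mu_a, lambda_a): for each
    # challenger a, the infimum merges arms 0 and a at their weighted mean.
    best = float('inf')
    for a in range(1, len(mu)):
        m = (w[0] * mu[0] + w[a] * mu[a]) / (w[0] + w[a])
        best = min(best, w[0] * d(mu[0], m) + w[a] * d(mu[a], m))
    return best

random.seed(0)
uniform = [1.0 / len(mu)] * len(mu)
w_star, f_star = uniform, F(uniform)
for _ in range(20000):
    x = [random.expovariate(1.0) for _ in mu]   # random point on the simplex
    s = sum(x)
    w = [xi / s for xi in x]
    if F(w) > f_star:
        w_star, f_star = w, F(w)

T_star = 1.0 / f_star
print(w_star, T_star)
```

This is only a sketch of the optimization problem; the paper solves it exactly through one-dimensional equations.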
PAC-BAI as a Game

Given a parameter μ = (μ_1, ..., μ_K) with μ_1 > μ_2 ≥ ... ≥ μ_K:
- the statistician chooses proportions of arm draws w = (w_a)_a
- the opponent chooses an alternative model λ; his best responses pick an arm a ∈ {2, ..., K} and λ_a = argmin_λ [ w_1 d(μ_1, λ) + w_a d(μ_a, λ) ]
- the payoff is the minimal number T = T(w, λ) of draws necessary to ensure that the statistician does not violate the δ-PAC constraint

Σ_{a=1}^K T w_a d(μ_a, λ_a) ≥ kl(δ, 1−δ);

against the best response to arm a, this gives

T(w, a, δ) = kl(δ, 1−δ) / ( w_1 d(μ_1, λ_a − ε) + w_a d(μ_a, λ_a + ε) ).

T*(μ) kl(δ, 1−δ) = value of the game; w* = optimal action for the statistician.
Properties of T*(μ) and w*(μ)

1. Unique solution, obtained by solving scalar equations only
2. For all μ ∈ S and all a, w*_a(μ) > 0
3. w* is continuous at every μ ∈ S
4. If μ_1 > μ_2 ≥ ... ≥ μ_K, one has w*_2(μ) ≥ ... ≥ w*_K(μ) (but one may have w*_1(μ) < w*_2(μ))
5. Case of two arms [Kaufmann, Cappé, G. '14]:
E_μ[τ_δ] ≥ kl(δ, 1−δ) / d*(μ_1, μ_2),
where d* is the reversed Chernoff information: d*(μ_1, μ_2) := d(μ_1, μ*) = d(μ_2, μ*)
6. Gaussian arms: algebraic equation, but no simple formula for K ≥ 3; with gaps Δ_a,
Σ_{a=1}^K 2σ²/Δ_a² ≤ T*(μ) ≤ 2 Σ_{a=1}^K 2σ²/Δ_a²
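Property 5's quantity d*(μ_1, μ_2), defined by d(μ_1, μ*) = d(μ_2, μ*), can be found by bisection since d(μ_1, m) − d(μ_2, m) is strictly decreasing in m. A sketch for Bernoulli arms (function names are ours):

```python
import math

def d_bern(mu, mup):
    # Bernoulli KL divergence between means mu and mup.
    return mu * math.log(mu / mup) + (1 - mu) * math.log((1 - mu) / (1 - mup))

def chernoff_info(mu1, mu2):
    # Solve d(mu1, m) = d(mu2, m) for m in (mu2, mu1) by bisection:
    # d(mu1, m) decreases and d(mu2, m) increases as m grows.
    lo, hi = mu2, mu1
    for _ in range(100):
        m = (lo + hi) / 2.0
        if d_bern(mu1, m) > d_bern(mu2, m):
            lo = m
        else:
            hi = m
    m = (lo + hi) / 2.0
    return d_bern(mu1, m), m

dstar, mstar = chernoff_info(0.6, 0.4)
print(dstar, mstar)  # by symmetry of 0.6 and 0.4 around 0.5, mstar is 0.5
```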
The Track-and-Stop Strategy
Sampling rule: Tracking the optimal proportions

Let μ̂(t) = (μ̂_1(t), ..., μ̂_K(t)) be the vector of empirical means. Introducing U_t = { a : N_a(t) < √t }, the arm sampled at round t+1 is

A_{t+1} = argmin_{a∈U_t} N_a(t)   if U_t ≠ ∅   (forced exploration)
A_{t+1} = argmax_{1≤a≤K} [ t w*_a(μ̂(t)) − N_a(t) ]   otherwise   (tracking)

Lemma. Under the Tracking sampling rule,

P_μ( lim_{t→∞} N_a(t)/t = w*_a(μ) ) = 1.
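A minimal sketch of the Tracking rule, simplified by assuming the optimal proportions w* are known and frozen (the actual algorithm recomputes w*(μ̂(t)) from the empirical means at every round); the w* values are those reported for μ1 on the experiments slide:

```python
import math

mu = [0.5, 0.45, 0.43, 0.4]
w_star = [0.42, 0.39, 0.14, 0.06]   # frozen here for illustration only

K = len(mu)
N = [0] * K                         # number of draws of each arm
horizon = 5000
for t in range(1, horizon + 1):
    U = [a for a in range(K) if N[a] < math.sqrt(t)]
    if U:                                          # forced exploration
        a = min(U, key=lambda i: N[i])
    else:                                          # tracking step
        a = max(range(K), key=lambda i: t * w_star[i] - N[i])
    N[a] += 1
    # (the sample X_t ~ nu_{A_t} would be drawn here to update mu-hat)

props = [n / horizon for n in N]
print(props)  # the empirical proportions approach w_star
```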
Sequential Generalized Likelihood Test

High values of the Generalized Likelihood Ratio

Z_{a,b}(t) := log [ max_{λ: λ_a ≥ λ_b} dP_λ(X_1, ..., X_t) / max_{λ: λ_a ≤ λ_b} dP_λ(X_1, ..., X_t) ]
 = N_a(t) d( μ̂_a(t), μ̂_{a,b}(t) ) + N_b(t) d( μ̂_b(t), μ̂_{a,b}(t) )   if μ̂_a(t) ≥ μ̂_b(t),
and Z_{a,b}(t) = −Z_{b,a}(t) otherwise, where μ̂_{a,b}(t) = ( N_a(t) μ̂_a(t) + N_b(t) μ̂_b(t) ) / ( N_a(t) + N_b(t) ),

reject the hypothesis that (μ_a ≤ μ_b). We stop when one arm is assessed to be significantly larger than all other arms, according to a GLR test:

τ_δ = inf{ t ∈ N : ∃ a ∈ {1, ..., K}, ∀ b ≠ a, Z_{a,b}(t) > β(t, δ) }
  = inf{ t ∈ N : Z(t) := max_{a∈{1,...,K}} min_{b≠a} Z_{a,b}(t) > β(t, δ) }

Chernoff stopping rule [Chernoff '59]. Two other possible interpretations of the stopping rule:

- MDL: Z_{a,b}(t) = ( N_a(t) + N_b(t) ) H( μ̂_{a,b}(t) ) − [ N_a(t) H( μ̂_a(t) ) + N_b(t) H( μ̂_b(t) ) ]
- plug-in complexity estimate: with F(w, μ) := inf_{λ∈Alt(μ)} Σ_{a=1}^K w_a d(μ_a, λ_a), stop when Z(t) = t F( (N_a(t)/t)_a, μ̂(t) ) ≥ β(t, δ), a plug-in counterpart of the lower bound t F(w*, μ) = t / T*(μ).
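For unit-variance Gaussian arms the GLR statistic has the closed form above, with μ̂_{a,b}(t) the weighted average of the two empirical means. A sketch with illustrative (made-up) empirical means and counts, compared against the threshold β(t, δ) = log(2(K−1)t/δ):

```python
import math

d = lambda x, y: (x - y) ** 2 / 2.0   # unit-variance Gaussian divergence

def Z(mu_hat, N, a, b):
    # GLR statistic Z_{a,b}(t): the constrained MLE merges the two
    # empirical means into their weighted average mu_hat_{a,b}(t).
    if mu_hat[a] < mu_hat[b]:
        return -Z(mu_hat, N, b, a)
    m = (N[a] * mu_hat[a] + N[b] * mu_hat[b]) / (N[a] + N[b])
    return N[a] * d(mu_hat[a], m) + N[b] * d(mu_hat[b], m)

mu_hat = [0.7, 0.5, 0.4]     # illustrative empirical means...
N = [2000, 1500, 800]        # ...and draw counts at time t = sum(N)

Zt = min(Z(mu_hat, N, 0, b) for b in (1, 2))   # Z(t) for the leading arm

delta, K, t = 0.1, 3, sum(N)
beta = math.log(2 * (K - 1) * t / delta)       # Chernoff threshold
print(Zt, beta, Zt > beta)                     # here the test stops
```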
Calibration

Theorem. The Chernoff stopping rule is δ-PAC for β(t, δ) = log( 2(K−1)t / δ ).

Lemma. If μ_a < μ_b, then whatever the sampling rule,

P_μ( ∃ t ∈ N : Z_{a,b}(t) > log(2t/δ) ) ≤ δ.

The proof uses Barron's lemma (change of distribution) and Krichevsky-Trofimov's universal distribution (very information-theoretic ideas).
Asymptotic Optimality of the T&S strategy

Theorem. The Track-and-Stop strategy, which uses
- the Tracking sampling rule,
- the Chernoff stopping rule with β(t, δ) = log( 2(K−1)t / δ ),
- and recommends â_{τ_δ} = argmax_{a=1,...,K} μ̂_a(τ_δ),

is δ-PAC for every δ ∈ (0, 1) and satisfies

lim sup_{δ→0} E_μ[τ_δ] / log(1/δ) = T*(μ).
Why is the T&S Strategy Asymptotically Optimal?

[animation: growth of Z(t) against the threshold β(t, δ)]
Sketch of proof (almost-sure convergence only)

- forced exploration ⇒ N_a(t) → ∞ a.s. for all a ∈ {1, ..., K} ⇒ μ̂(t) → μ a.s. ⇒ w*(μ̂(t)) → w* a.s.
- tracking rule: N_a(t)/t → w*_a a.s.
- the mapping F : (μ′, w) ↦ inf_{λ∈Alt(μ′)} Σ_{a=1}^K w_a d(μ′_a, λ_a) is continuous at (μ, w*(μ)), so

Z(t) = t F( μ̂(t), (N_a(t)/t)_{a=1}^K ) ≈ t F(μ, w*) = t T*(μ)⁻¹ a.s.,

and for every ε > 0 there exists t_0 such that t ≥ t_0 ⇒ Z(t) ≥ t (1+ε)⁻¹ T*(μ)⁻¹.
- Thus τ_δ ≤ t_0 ∨ inf{ t ∈ N : (1+ε)⁻¹ T*(μ)⁻¹ t ≥ log( 2(K−1)t/δ ) }, and

lim sup_{δ→0} τ_δ / log(1/δ) ≤ (1+ε) T*(μ) a.s.
Numerical Experiments

μ1 = [0.5, 0.45, 0.43, 0.4], w*(μ1) = [0.42, 0.39, 0.14, 0.06]
μ2 = [0.3, 0.21, 0.2, 0.19, 0.18], w*(μ2) = [0.34, 0.25, 0.18, 0.13, 0.10]

In practice, set the threshold to β(t, δ) = log( (log(t)+1)/δ ) (still δ-PAC).

       Track-and-Stop   Chernoff-Racing   KL-LUCB   KL-Racing
  μ1        4052             4516           8437       9590
  μ2        1406             3078           2716       3334

Table 1: Expected number of draws E_μ[τ_δ] for δ = 0.1, averaged over N = 3000 experiments.

- Empirically good even for large values of the risk δ
- Racing is sub-optimal in general, because it plays w_1 = w_2
- LUCB is sub-optimal in general, because it plays w_1 = 1/2
Perspectives

For best arm identification, we showed that

inf_{PAC algorithms} lim sup_{δ→0} E_μ[τ_δ] / log(1/δ) = [ sup_{w∈Σ_K} inf_{λ∈Alt(μ)} ( Σ_{a=1}^K w_a d(μ_a, λ_a) ) ]⁻¹

and provided an efficient strategy asymptotically matching this bound.

Future work:
- (easy) anytime stopping gives a confidence level
- (easy) find an ε-optimal arm
- (easy) find the m best arms
- design and analyze a more stable algorithm (hint: optimism)
- give a simple algorithm with a finite-time analysis; candidate: play the action maximizing the expected increase of Z(t)
- extend to structured and continuous settings
References

O. Cappé, A. Garivier, O-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 2013.
H. Chernoff. Sequential design of experiments. The Annals of Mathematical Statistics, 1959.
E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. JMLR, 2006.
T.L. Graves and T.L. Lai. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3):715–743, 1997.
S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. ICML, 2012.
E. Kaufmann, O. Cappé, and A. Garivier. On the complexity of best arm identification in multi-armed bandit models. JMLR, 2015.
A. Garivier and E. Kaufmann. Optimal best arm identification with fixed confidence. COLT 2016, New York. arXiv:1602.04589.
A. Garivier, P. Ménard, and G. Stoltz. Explore first, exploit next: the true shape of regret in bandit problems.
E. Kaufmann and S. Kalyanakrishnan. The information complexity of best arm identification. COLT 2013.
T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 1985.
D. Russo. Simple Bayesian algorithms for best arm identification. COLT 2016.
N.K. Vaidhyan and R. Sundaresan. Learning to detect an oddball target. arXiv:1508.05572, 2015.