On the Complexity of Best Arm Identification with Fixed Confidence

1 On the Complexity of Best Arm Identification with Fixed Confidence. Discrete Optimization with Noise. Aurélien Garivier, Emilie Kaufmann. COLT, June 23rd 2016, New York. Institut de Mathématiques de Toulouse, LabeX CIMI, Université Paul Sabatier, France; Université Lille, CNRS UMR 9189, Laboratoire CRIStAL, Lille, France.

2 The Problem

3-5 Best-Arm Identification with Fixed Confidence
K options = probability distributions ν = (ν_a)_{1 ≤ a ≤ K}, where each ν_a belongs to an exponential family F parameterized by its expectation μ_a.
At round t, you may:
- choose an option A_t = φ_t(A_1, X_1, ..., A_{t-1}, X_{t-1}) ∈ {1, ..., K};
- observe a new independent sample X_t ~ ν_{A_t};
so as to identify the best arm a* = argmax_a μ_a (with μ* = max_a μ_a) as fast as possible: stopping time τ_δ.
Fixed-budget setting: given τ = T, minimize P(â_τ ≠ a*).
Fixed-confidence setting: minimize E[τ_δ] under the constraint P(â_{τ_δ} ≠ a*) ≤ δ.
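To fix ideas, here is a tiny sketch (our own code, not part of the talk) of the interaction protocol above for Gaussian arms with unit variance; the class name GaussianBandit and the seed are illustrative assumptions.

```python
# A toy environment matching the setup above (our own sketch): K Gaussian arms
# with unit variance, parameterized by their means, from which an algorithm
# draws one independent sample per round.
import numpy as np

class GaussianBandit:
    def __init__(self, means, seed=0):
        self.means = np.asarray(means, dtype=float)
        self.rng = np.random.default_rng(seed)

    def sample(self, arm):
        """Observe X_t ~ nu_{A_t} = N(mu_arm, 1)."""
        return self.rng.normal(self.means[arm], 1.0)

env = GaussianBandit([2.0, 1.75, 1.75, 1.6, 1.5])
print(env.sample(0))
```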

6 Intuition
Simplest setting: for all a ∈ {1, ..., K}, ν_a = N(μ_a, 1). For example: μ = [2, 1.75, 1.75, 1.6, 1.5].
At time t: you have sampled option a exactly n_a times, and your empirical average is X̄_{a,n_a}. (figure: arms 1 to 5)
If you stop at time t, your probability of preferring arm a ≥ 2 over arm a* = 1 is
\[ P(\bar X_{a,n_a} > \bar X_{1,n_1}) = P\left( \frac{(\bar X_{a,n_a} - \mu_a) - (\bar X_{1,n_1} - \mu_1)}{\sqrt{1/n_1 + 1/n_a}} > \frac{\mu_1 - \mu_a}{\sqrt{1/n_1 + 1/n_a}} \right) = \Phi\left( \frac{\mu_1 - \mu_a}{\sqrt{1/n_1 + 1/n_a}} \right), \]
where \( \Phi(u) = \int_u^{\infty} \frac{e^{-v^2/2}}{\sqrt{2\pi}} \, dv \).
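The Gaussian confusion probability above is easy to evaluate numerically. The following sketch (our own code; the function name confusion_probability and the choice of 100 samples per arm are assumptions, not from the slides) computes it with the Gaussian upper-tail function Φ.

```python
# Minimal numerical illustration of the Gaussian confusion probability above.
from scipy.stats import norm

def confusion_probability(mu1, mua, n1, na):
    """P(empirical mean of arm a exceeds that of arm 1) when the arms are
    N(mu1, 1) and N(mua, 1) and have been sampled n1 and na times."""
    gap = mu1 - mua
    return norm.sf(gap / (1.0 / n1 + 1.0 / na) ** 0.5)  # Phi = Gaussian upper tail

# Example from the slides: mu = [2, 1.75, 1.75, 1.6, 1.5], uniform sampling
mu = [2.0, 1.75, 1.75, 1.6, 1.5]
n = 100  # samples per arm (an arbitrary illustrative value)
for a, mua in enumerate(mu[1:], start=2):
    print(f"arm {a}: P(confusion with arm 1) ~ {confusion_probability(mu[0], mua, n, n):.4f}")
```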

7-26 Uniform Sampling (animation frames; figure only)

27 Intuition: Equalizing the Probabilities of Confusion
Simplest setting: for all a ∈ {1, ..., K}, ν_a = N(μ_a, 1). For example: μ = [2, 1.75, 1.75, 1.6, 1.5].
Active learning: you allocate a relative budget w_a to option a, with w_1 + ... + w_K = 1. At time t, you have sampled option a about n_a ≈ w_a t times, and your empirical average is X̄_{a,n_a}.
If you stop at time t, your probability of preferring arm a ≥ 2 over arm a* = 1 is
\[ P(\bar X_{a,n_a} > \bar X_{1,n_1}) = P\left( \frac{(\bar X_{a,n_a} - \mu_a) - (\bar X_{1,n_1} - \mu_1)}{\sqrt{1/n_1 + 1/n_a}} > \frac{\mu_1 - \mu_a}{\sqrt{1/n_1 + 1/n_a}} \right) = \Phi\left( \frac{\mu_1 - \mu_a}{\sqrt{1/n_1 + 1/n_a}} \right). \]

28-75 Improving: trials 1, 2 and 3 (animation frames; figure only)

76-89 Optimal Proportions (animation frames; figure only)

90 How to Turn this Intuition into a Theorem?
- The arms are not Gaussian (no closed formula for the probability of confusion): large deviations (Sanov, KL).
- You do not allocate a relative budget once and for all, but sample sequentially (no fixed-size samples, a sequential experiment): tracking lemma.
- How to compute the optimal proportions? Lower bound, game.
- The parameters of the distributions are unknown: (sequential) estimation.
- When should you stop? Chernoff's stopping rule.

91 Exponential Families
ν_1, ..., ν_K belong to a one-dimensional exponential family
\[ \mathcal{P}_{\lambda, \Theta, b} = \{ \nu_\theta, \; \theta \in \Theta : \nu_\theta \text{ has density } f_\theta(x) = \exp(\theta x - b(\theta)) \text{ w.r.t. } \lambda \}. \]
Examples: Gaussian, Bernoulli, Poisson distributions.
ν_θ can be parameterized by its mean μ = ḃ(θ): ν^μ := ν_{ḃ^{-1}(μ)}.
Notation: Kullback-Leibler divergence. For a given exponential family,
\[ d(\mu, \mu') := \mathrm{KL}(\nu^\mu, \nu^{\mu'}) = \mathbb{E}_{X \sim \nu^\mu}\left[ \log \frac{d\nu^\mu}{d\nu^{\mu'}}(X) \right] \]
is the KL divergence between the distributions of mean μ and μ'. We identify ν = (ν^{μ_1}, ..., ν^{μ_K}) with μ = (μ_1, ..., μ_K) and consider
\[ \mathcal{S} = \left\{ \mu \in (\dot b(\Theta))^K : \exists a \in \{1, ..., K\} : \mu_a > \max_{i \neq a} \mu_i \right\}. \]
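For concreteness, here is a minimal sketch (our own code) of the divergence d(μ, μ') in two of the exponential families mentioned above, Gaussian with known variance and Bernoulli; the clipping constant in the Bernoulli case is an implementation detail we add to avoid log(0).

```python
# Divergence d(mu, mu') for two common one-parameter exponential families.
import math

def d_gaussian(mu, mu_prime, sigma2=1.0):
    """KL( N(mu, sigma2), N(mu_prime, sigma2) ) = (mu - mu_prime)^2 / (2 sigma2)."""
    return (mu - mu_prime) ** 2 / (2.0 * sigma2)

def d_bernoulli(mu, mu_prime, eps=1e-12):
    """KL( Ber(mu), Ber(mu_prime) ), with clipping to avoid log(0)."""
    mu = min(max(mu, eps), 1.0 - eps)
    mu_prime = min(max(mu_prime, eps), 1.0 - eps)
    return mu * math.log(mu / mu_prime) + (1.0 - mu) * math.log((1.0 - mu) / (1.0 - mu_prime))

print(d_gaussian(0.5, 0.3))   # 0.02
print(d_bernoulli(0.5, 0.3))  # ~0.0871
```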

92 Back to: Equalizing the Probabilities of Confusion
Simplest setting: for all a ∈ {1, ..., K}, ν_a = N(μ_a, 1). For example: μ = [2, 1.75, 1.75, 1.6, 1.5].
You allocate a relative budget w_a to option a, with w_1 + ... + w_K = 1. At time t, you have sampled option a about n_a ≈ w_a t times and the empirical average is X̄_{a,n_a}.
If you stop at time t, your probability of preferring arm a ≥ 2 over arm a* = 1 is
\[ P(\bar X_{a,n_a} > \bar X_{1,n_1}) = \Phi\left( \frac{\mu_1 - \mu_a}{\sqrt{1/n_1 + 1/n_a}} \right) \le \exp\left( - \frac{(\mu_1 - \mu_a)^2}{2 (1/n_1 + 1/n_a)} \right) \quad \text{(Chernoff's bound)}. \]

93 Back to: Equalizing the Probabilities of Confusion
Simplest setting: for all a ∈ {1, ..., K}, ν_a = N(μ_a, 1). If you stop at time t, your probability of preferring arm a ≥ 2 over arm a* = 1 is
\[ P(\bar X_{a,n_a} > \bar X_{1,n_1}) = \Phi\left( \frac{\mu_1 - \mu_a}{\sqrt{1/n_1 + 1/n_a}} \right) \le \exp\left( - \frac{(\mu_1 - \mu_a)^2}{2 (1/n_1 + 1/n_a)} \right). \]
Large Deviation Principle: to first order in the exponent,
\[ P(\bar X_{a,n_a} > \bar X_{1,n_1}) \doteq e^{- n_1 (\mu_1 - m)^2 / 2} \, e^{- n_a (\mu_a - m)^2 / 2}, \qquad m = \frac{n_1 \mu_1 + n_a \mu_a}{n_1 + n_a}, \]
since
\[ P(\bar X_{1,n_1} < m) \, P(\bar X_{a,n_a} \ge m) = \max_{\mu_a \le m' \le \mu_1} P(\bar X_{1,n_1} < m') \, P(\bar X_{a,n_a} \ge m') \quad \text{(cf. Sanov)}. \]

94 Entropic Method for Large Deviations Lower Bounds
Let d(μ, μ') = KL(N(μ, 1), N(μ', 1)) = (μ - μ')²/2 and, for random variables, KL(Y, Z) = KL(L(Y), L(Z)). Fix ε > 0 and μ_a ≤ m ≤ μ_1, and compare the two models
P: X_{1,1}, ..., X_{1,n_1} iid N(μ_1, 1) and X_{a,1}, ..., X_{a,n_a} iid N(μ_a, 1),
P': X'_{1,1}, ..., X'_{1,n_1} iid N(m - ε, 1) and X'_{a,1}, ..., X'_{a,n_a} iid N(m + ε, 1).
Then
\[ n_1 \, d(m - \varepsilon, \mu_1) + n_a \, d(m + \varepsilon, \mu_a) = \mathrm{KL}(P', P) \ge \mathrm{kl}\big( P'(\bar X_{a,n_a} > \bar X_{1,n_1}), \; P(\bar X_{a,n_a} > \bar X_{1,n_1}) \big), \]
by contraction of entropy (the data-processing inequality) applied to the indicator of the event {X̄_{a,n_a} > X̄_{1,n_1}}, where
\[ \mathrm{kl}(p, q) = p \log\frac{p}{q} + (1-p) \log\frac{1-p}{1-q}. \]
Since kl(p, q) ≥ p log(1/q) - log 2,
\[ n_1 \, d(m - \varepsilon, \mu_1) + n_a \, d(m + \varepsilon, \mu_a) \ge P'(\bar X_{a,n_a} > \bar X_{1,n_1}) \, \log \frac{1}{P(\bar X_{a,n_a} > \bar X_{1,n_1})} - \log 2. \]

95 Entropic Method for Large Deviations Lower Bounds
Rearranging the previous inequality and using P'(X̄_{a,n_a} > X̄_{1,n_1}) ≥ 1 - e^{-(n_1+n_a)ε²/2}:
\[ P(\bar X_{a,n_a} > \bar X_{1,n_1}) \ge \max_{\mu_a \le m \le \mu_1} \exp\left( - \frac{n_1 \, d(m - \varepsilon, \mu_1) + n_a \, d(m + \varepsilon, \mu_a) + \log 2}{1 - e^{-(n_1+n_a)\varepsilon^2/2}} \right) \]
\[ = \exp\left( - \frac{\frac{(\mu_1 - \mu_a + \varepsilon)^2}{2 (1/n_1 + 1/n_a)} + \log 2}{1 - e^{-(n_1+n_a)\varepsilon^2/2}} \right) \quad \text{for } m = \frac{n_1 \mu_1 + n_a \mu_a}{n_1 + n_a}. \]

96 Entropic Method for Large Deviations Lower Bounds
With n_1 = T w_1 and n_a = T w_a, the same inequality reads
\[ T \big( w_1 \, d(m - \varepsilon, \mu_1) + w_a \, d(m + \varepsilon, \mu_a) \big) \gtrsim \log \frac{1}{P(\bar X_{a,n_a} > \bar X_{1,n_1})}. \]
Hence, if you want to have P(X̄_{a,n_a} > X̄_{1,n_1}) ≤ δ, then you need
\[ T \ge \frac{\log(1/\delta)}{w_1 \, d(m, \mu_1) + w_a \, d(m, \mu_a)}. \]

97 Lower Bound

98-100 Lower-Bounding the Sample Complexity
Let μ = (μ_1, ..., μ_K) and λ = (λ_1, ..., λ_K) be two elements of S.
Uniform δ-PAC Constraint [Kaufmann, Cappé, G. '15]: if a*(μ) ≠ a*(λ), any δ-PAC algorithm satisfies
\[ \sum_{a=1}^K \mathbb{E}_\mu\big[ N_a(\tau_\delta) \big] \, d(\mu_a, \lambda_a) \ge \mathrm{kl}(\delta, 1 - \delta). \]
Let Alt(μ) = {λ : a*(λ) ≠ a*(μ)}. Taking λ equal to μ except that λ_1 = m_a - ε and λ_a = m_a + ε, for a = 2, 3, 4 in turn:
\[ \mathbb{E}_\mu[N_1(\tau_\delta)] \, d(\mu_1, m_2 - \varepsilon) + \mathbb{E}_\mu[N_2(\tau_\delta)] \, d(\mu_2, m_2 + \varepsilon) \ge \mathrm{kl}(\delta, 1 - \delta), \]
\[ \mathbb{E}_\mu[N_1(\tau_\delta)] \, d(\mu_1, m_3 - \varepsilon) + \mathbb{E}_\mu[N_3(\tau_\delta)] \, d(\mu_3, m_3 + \varepsilon) \ge \mathrm{kl}(\delta, 1 - \delta), \]
\[ \mathbb{E}_\mu[N_1(\tau_\delta)] \, d(\mu_1, m_4 - \varepsilon) + \mathbb{E}_\mu[N_4(\tau_\delta)] \, d(\mu_4, m_4 + \varepsilon) \ge \mathrm{kl}(\delta, 1 - \delta). \]

101 Lower-Bounding the Sample Complexity
Combining all these constraints:
\[ \inf_{\lambda \in \mathrm{Alt}(\mu)} \sum_{a=1}^K \mathbb{E}_\mu[N_a(\tau_\delta)] \, d(\mu_a, \lambda_a) \ge \mathrm{kl}(\delta, 1 - \delta), \]
\[ \mathbb{E}_\mu[\tau_\delta] \; \inf_{\lambda \in \mathrm{Alt}(\mu)} \sum_{a=1}^K \frac{\mathbb{E}_\mu[N_a(\tau_\delta)]}{\mathbb{E}_\mu[\tau_\delta]} \, d(\mu_a, \lambda_a) \ge \mathrm{kl}(\delta, 1 - \delta), \]
\[ \mathbb{E}_\mu[\tau_\delta] \left( \sup_{w \in \Sigma_K} \inf_{\lambda \in \mathrm{Alt}(\mu)} \sum_{a=1}^K w_a \, d(\mu_a, \lambda_a) \right) \ge \mathrm{kl}(\delta, 1 - \delta). \]

102 Lower Bound: the Complexity of BAI
Theorem. For any δ-PAC algorithm,
\[ \mathbb{E}_\mu[\tau_\delta] \ge T^*(\mu) \, \mathrm{kl}(\delta, 1 - \delta), \qquad \text{where} \qquad T^*(\mu)^{-1} = \sup_{w \in \Sigma_K} \inf_{\lambda \in \mathrm{Alt}(\mu)} \left( \sum_{a=1}^K w_a \, d(\mu_a, \lambda_a) \right). \]
- kl(δ, 1-δ) ~ log(1/δ) when δ → 0, and kl(δ, 1-δ) ≥ log(1/(2.4 δ)).
- cf. [Graves and Lai 1997], [Vaidhyan and Sundaresan 2015].
- The optimal proportions of arm draws are
\[ w^*(\mu) = \underset{w \in \Sigma_K}{\arg\max} \; \inf_{\lambda \in \mathrm{Alt}(\mu)} \left( \sum_{a=1}^K w_a \, d(\mu_a, \lambda_a) \right); \]
  they do not depend on δ.
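The characterization above can be evaluated numerically. The sketch below (our own code, not the authors' implementation) computes w*(μ) and T*(μ) for Gaussian arms with unit variance by maximizing the inner infimum over the simplex with a generic optimizer; the talk instead reduces the problem to scalar equations, so this brute-force approach only illustrates the definition. It uses the fact that, for a fixed w, the worst-case alternative moves the best arm and one suboptimal arm to their w-weighted mean.

```python
# Numerical sketch of T*(mu) and w*(mu) for Gaussian arms with unit variance.
import numpy as np
from scipy.optimize import minimize

def d(x, y):
    return (x - y) ** 2 / 2.0  # KL(N(x,1), N(y,1))

def inner_inf(w, mu):
    """min over a != a* of  w_best d(mu_best, m_a) + w_a d(mu_a, m_a),
    with m_a the (w_best, w_a)-weighted mean of mu_best and mu_a."""
    best = int(np.argmax(mu))
    vals = []
    for a in range(len(mu)):
        if a == best:
            continue
        m = (w[best] * mu[best] + w[a] * mu[a]) / (w[best] + w[a])
        vals.append(w[best] * d(mu[best], m) + w[a] * d(mu[a], m))
    return min(vals)

def optimal_proportions(mu):
    K = len(mu)
    cons = {"type": "eq", "fun": lambda w: np.sum(w) - 1.0}
    res = minimize(lambda w: -inner_inf(w, mu), np.ones(K) / K,
                   bounds=[(1e-6, 1.0)] * K, constraints=[cons], method="SLSQP")
    w_star = res.x / res.x.sum()
    return w_star, 1.0 / inner_inf(w_star, mu)

mu = np.array([2.0, 1.75, 1.75, 1.6, 1.5])
w_star, T_star = optimal_proportions(mu)
print("w* ~", np.round(w_star, 3), " T*(mu) ~", round(T_star, 1))
```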

103 PAC-BAI as a Game
Given a parameter μ = (μ_1, ..., μ_K):
- the statistician chooses proportions of arm draws w = (w_a)_a;
- the opponent chooses an alternative model λ;
- the payoff is the minimal number T = T(w, λ) of draws necessary to ensure that the statistician does not violate the δ-PAC constraint
\[ \sum_{a=1}^K T w_a \, d(\mu_a, \lambda_a) \ge \mathrm{kl}(\delta, 1 - \delta). \]
T*(μ) kl(δ, 1-δ) = value of the game; w* = optimal action for the statistician.

104 PAC-BAI as a Game
Given a parameter μ = (μ_1, ..., μ_K) such that μ_1 > μ_2 ≥ ... ≥ μ_K:
- the statistician chooses proportions of arm draws w = (w_a)_a;
- the opponent chooses an arm a ∈ {2, ..., K} and λ_a = argmin_λ [ w_1 d(μ_1, λ) + w_a d(μ_a, λ) ];
- the payoff is the minimal number T = T(w, a, δ) of draws necessary to ensure that T w_1 d(μ_1, λ_a - ε) + T w_a d(μ_a, λ_a + ε) ≥ kl(δ, 1-δ), that is
\[ T(w, a, \delta) = \frac{\mathrm{kl}(\delta, 1 - \delta)}{w_1 \, d(\mu_1, \lambda_a - \varepsilon) + w_a \, d(\mu_a, \lambda_a + \varepsilon)}. \]
T*(μ) kl(δ, 1-δ) = value of the game; w* = optimal action for the statistician.

105 Properties of T*(μ) and w*(μ)
1. Unique solution, obtained by solving scalar equations only.
2. For all μ ∈ S and all a, w*_a(μ) > 0.
3. w* is continuous at every μ ∈ S.
4. If μ_1 > μ_2 ≥ ... ≥ μ_K, then w*_2(μ) ≥ ... ≥ w*_K(μ) (but one may have w*_1(μ) < w*_2(μ)).
5. Case of two arms [Kaufmann, Cappé, G. '14]:
\[ \mathbb{E}_\mu[\tau_\delta] \ge \frac{\mathrm{kl}(\delta, 1 - \delta)}{d_*(\mu_1, \mu_2)}, \]
   where d_* is the reversed Chernoff information, d_*(μ_1, μ_2) := d(μ_1, μ_*) = d(μ_2, μ_*).
6. Gaussian arms: an algebraic equation, but no simple formula for K ≥ 3; one has
\[ \sum_{a=1}^K \frac{2\sigma^2}{\Delta_a^2} \le T^*(\mu) \le 2 \sum_{a=1}^K \frac{2\sigma^2}{\Delta_a^2}. \]
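Point 5 involves only a scalar equation, which also illustrates point 1. Here is a minimal sketch (our own code, assuming Bernoulli arms with μ_2 < μ_1) that finds μ_* such that d(μ_1, μ_*) = d(μ_2, μ_*) by one-dimensional root finding.

```python
# Reversed Chernoff information d_*(mu1, mu2) for Bernoulli arms:
# solve d(mu1, x) = d(mu2, x) for x between mu2 and mu1.
import math
from scipy.optimize import brentq

def d_bernoulli(mu, lam):
    """KL(Ber(mu), Ber(lam))."""
    return mu * math.log(mu / lam) + (1 - mu) * math.log((1 - mu) / (1 - lam))

def reversed_chernoff(mu1, mu2):
    # assumes mu2 < mu1; the root is bracketed in (mu2, mu1)
    x_star = brentq(lambda x: d_bernoulli(mu1, x) - d_bernoulli(mu2, x),
                    mu2 + 1e-9, mu1 - 1e-9)
    return d_bernoulli(mu1, x_star), x_star

d_star, x_star = reversed_chernoff(0.6, 0.4)
print(f"mu* ~ {x_star:.4f}, d_*(0.6, 0.4) ~ {d_star:.4f}")
```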

106 The Track-and-Stop Strategy

107 Sampling rule: Tracking the optimal proportions
μ̂(t) = (μ̂_1(t), ..., μ̂_K(t)): vector of empirical means. Introducing U_t = { a : N_a(t) < √t }, the arm sampled at round t+1 is
\[ A_{t+1} \in \begin{cases} \underset{a \in U_t}{\arg\min} \; N_a(t) & \text{if } U_t \neq \emptyset \quad \text{(forced exploration)}, \\ \underset{1 \le a \le K}{\arg\max} \; \big( t \, w_a^*(\hat\mu(t)) - N_a(t) \big) & \text{otherwise} \quad \text{(tracking)}. \end{cases} \]
Lemma. Under the Tracking sampling rule,
\[ \mathbb{P}_\mu\left( \lim_{t \to \infty} \frac{N_a(t)}{t} = w_a^*(\mu) \right) = 1. \]
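A minimal sketch of the Tracking sampling rule above (our own code). The argument w_star_hat stands for w*(μ̂(t)) computed from the current empirical means, for instance with the optimization sketch given after the lower-bound slide.

```python
import numpy as np

def tracking_rule(counts, w_star_hat):
    """Return the index of the arm to sample at round t+1.

    counts[a]  = N_a(t), number of draws of arm a so far
    w_star_hat = estimated optimal proportions w*(mu_hat(t))
    """
    t = counts.sum()
    undersampled = np.where(counts < np.sqrt(t))[0]
    if undersampled.size > 0:
        # forced exploration: pick the least-sampled undersampled arm
        return int(undersampled[np.argmin(counts[undersampled])])
    # tracking: pick the arm lagging the most behind its target proportion
    return int(np.argmax(t * w_star_hat - counts))

counts = np.array([4, 3, 3, 2, 2])
print(tracking_rule(counts, np.array([0.45, 0.2, 0.2, 0.1, 0.05])))
```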

108 Sequential Generalized Likelihood Ratio Test
High values of the generalized likelihood ratio
\[ Z_{a,b}(t) := \log \frac{\max_{\{\lambda : \lambda_a \ge \lambda_b\}} dP_\lambda(X_1, \ldots, X_t)}{\max_{\{\lambda : \lambda_a \le \lambda_b\}} dP_\lambda(X_1, \ldots, X_t)} = \begin{cases} N_a(t) \, d\big(\hat\mu_a(t), \hat\mu_{a,b}(t)\big) + N_b(t) \, d\big(\hat\mu_b(t), \hat\mu_{a,b}(t)\big) & \text{if } \hat\mu_a(t) \ge \hat\mu_b(t), \\ -Z_{b,a}(t) & \text{otherwise}, \end{cases} \]
where μ̂_{a,b}(t) denotes the empirical mean of the samples of arms a and b pooled together, reject the hypothesis that μ_a ≤ μ_b.
We stop when one arm is assessed to be significantly larger than all other arms, according to a GLR test:
\[ \tau_\delta = \inf\big\{ t \in \mathbb{N} : \exists a \in \{1, \ldots, K\}, \forall b \neq a, \; Z_{a,b}(t) > \beta(t, \delta) \big\} = \inf\Big\{ t \in \mathbb{N} : Z(t) := \max_{a \in \{1,\ldots,K\}} \min_{b \neq a} Z_{a,b}(t) > \beta(t, \delta) \Big\} \]
(Chernoff's stopping rule [Chernoff '59]). Two other possible interpretations of the stopping rule. MDL:
\[ Z_{a,b}(t) = \big( N_a(t) + N_b(t) \big) H\big(\hat\mu_{a,b}(t)\big) - \big[ N_a(t) H\big(\hat\mu_a(t)\big) + N_b(t) H\big(\hat\mu_b(t)\big) \big]. \]

109 Sequential Generalized Likelihood Ratio Test
High values of the generalized likelihood ratio
\[ Z_{a,b}(t) := \log \frac{\max_{\{\lambda : \lambda_a \ge \lambda_b\}} dP_\lambda(X_1, \ldots, X_t)}{\max_{\{\lambda : \lambda_a \le \lambda_b\}} dP_\lambda(X_1, \ldots, X_t)} \]
reject the hypothesis that μ_a ≤ μ_b. We stop when one arm is assessed to be significantly larger than all other arms, according to a GLR test:
\[ \tau_\delta = \inf\Big\{ t \in \mathbb{N} : Z(t) := \max_{a \in \{1,\ldots,K\}} \min_{b \neq a} Z_{a,b}(t) > \beta(t, \delta) \Big\} \]
(Chernoff's stopping rule [Chernoff '59]). Second interpretation, plug-in complexity estimate: with
\[ F(w, \mu) := \inf_{\lambda \in \mathrm{Alt}(\mu)} \sum_{a=1}^K w_a \, d(\mu_a, \lambda_a), \]
one has Z(t) = t F( (N_a(t)/t)_a, μ̂(t) ): stop when this plug-in estimate exceeds β(t, δ), instead of waiting for the oracle lower bound t ≥ T*(μ) kl(δ, 1-δ), i.e. t F(w*, μ) ≥ kl(δ, 1-δ).
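Here is a small sketch (our own code, Gaussian arms with unit variance) of the statistic Z(t) = max_a min_{b≠a} Z_{a,b}(t) and of the stopping test; it uses the practical threshold β(t, δ) = log((log t + 1)/δ) reported later in the experiments slide, and the counts and means in the example are made up.

```python
import math
import numpy as np

def d(x, y):
    return (x - y) ** 2 / 2.0  # KL(N(x,1), N(y,1))

def glr(counts, means, a, b):
    """Z_{a,b}(t) for Gaussian arms with unit variance."""
    na, nb = counts[a], counts[b]
    m = (na * means[a] + nb * means[b]) / (na + nb)  # pooled empirical mean
    z = na * d(means[a], m) + nb * d(means[b], m)
    return z if means[a] >= means[b] else -z

def chernoff_statistic(counts, means):
    K = len(counts)
    return max(min(glr(counts, means, a, b) for b in range(K) if b != a)
               for a in range(K))

def should_stop(counts, means, delta):
    t = counts.sum()
    beta = math.log((math.log(t) + 1) / delta)  # practical threshold from the slides
    return chernoff_statistic(counts, means) > beta

counts = np.array([120, 80, 80, 40, 30])
means = np.array([2.02, 1.70, 1.78, 1.55, 1.49])
print(should_stop(counts, means, delta=0.1))
```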

110 Calibration
Theorem. The Chernoff stopping rule is δ-PAC for
\[ \beta(t, \delta) = \log\left( \frac{2 (K-1) t}{\delta} \right). \]
Lemma. If μ_a < μ_b, then whatever the sampling rule,
\[ \mathbb{P}_\mu\left( \exists t \in \mathbb{N} : Z_{a,b}(t) > \log\frac{2t}{\delta} \right) \le \delta. \]
The proof uses Barron's lemma (change of distribution) and Krichevsky-Trofimov's universal distribution (very information-theoretic ideas).

111 Asymptotic Optimality of the T&S Strategy
Theorem. The Track-and-Stop strategy, which uses
- the Tracking sampling rule,
- the Chernoff stopping rule with β(t, δ) = log( 2(K-1)t / δ ),
and recommends â_{τ_δ} = argmax_{a=1,...,K} μ̂_a(τ_δ), is δ-PAC for every δ ∈ (0,1) and satisfies
\[ \limsup_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau_\delta]}{\log(1/\delta)} = T^*(\mu). \]

112-131 Why is the T&S Strategy asymptotically Optimal? (animation frames; figure only)

132 Sketch of proof (almost-sure convergence only)
- Forced exploration ⇒ N_a(t) → ∞ a.s. for all a ∈ {1, ..., K} ⇒ μ̂(t) → μ a.s. ⇒ w*(μ̂(t)) → w* a.s.
- Tracking rule: N_a(t)/t → w*_a a.s.
- The mapping F : (μ', w) ↦ inf_{λ ∈ Alt(μ')} Σ_{a=1}^K w_a d(μ'_a, λ_a) is continuous at (μ, w*(μ)), hence
\[ Z(t) = t \, F\big( \hat\mu(t), (N_a(t)/t)_{a=1}^K \big) \approx t \, F(\mu, w^*) = t \, T^*(\mu)^{-1} \quad \text{a.s.}, \]
and for every ε > 0 there exists t_0 such that t ≥ t_0 implies Z(t) ≥ t (1+ε)^{-1} T*(μ)^{-1}.
- Thus τ_δ ≤ t_0 ∨ inf{ t ∈ ℕ : (1+ε)^{-1} T*(μ)^{-1} t ≥ log(2(K-1)t/δ) }, and
\[ \limsup_{\delta \to 0} \frac{\tau_\delta}{\log(1/\delta)} \le (1 + \varepsilon) \, T^*(\mu) \quad \text{a.s.} \]

133 Numerical Experiments
μ_1 = [ ], w*(μ_1) = [ ]; μ_2 = [ ], w*(μ_2) = [ ].
In practice, set the threshold to β(t, δ) = log( (log(t)+1) / δ ) (still δ-PAC).
Table 1: Expected number of draws E_μ[τ_δ] for δ = 0.1, averaged over N = 3000 experiments (columns: Track-and-Stop, Chernoff-Racing, KL-LUCB, KL-Racing; rows: μ_1, μ_2; values not recovered).
- Empirically good even for large values of the risk δ.
- Racing is sub-optimal in general, because it plays w_1 = w_2.
- LUCB is sub-optimal in general, because it plays w_1 = 1/2.

134 Perspectives
For best arm identification, we showed that
\[ \inf_{\text{PAC algorithms}} \; \limsup_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau_\delta]}{\log(1/\delta)} = \left( \sup_{w \in \Sigma_K} \inf_{\lambda \in \mathrm{Alt}(\mu)} \sum_{a=1}^K w_a \, d(\mu_a, \lambda_a) \right)^{-1} \]
and provided an efficient strategy asymptotically matching this bound. Future work:
- (easy) anytime stopping gives a confidence level;
- (easy) find an ε-optimal arm;
- (easy) find the m best arms;
- design and analyze more stable algorithms (hint: optimism);
- give a simple algorithm with a finite-time analysis (candidate: play the action maximizing the expected increase of Z(t));
- extend to structured and continuous settings.

135 References
O. Cappé, A. Garivier, O-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 2013.
H. Chernoff. Sequential design of experiments. The Annals of Mathematical Statistics, 1959.
E. Even-Dar, S. Mannor, Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. JMLR, 2006.
T.L. Graves and T.L. Lai. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3):715-743, 1997.
S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. ICML, 2012.
E. Kaufmann, O. Cappé, A. Garivier. On the complexity of best arm identification in multi-armed bandit models. JMLR, 2015.
A. Garivier, E. Kaufmann. Optimal best arm identification with fixed confidence. COLT 2016, New York.
A. Garivier, P. Ménard, G. Stoltz. Explore first, exploit next: the true shape of regret in bandit problems.
E. Kaufmann, S. Kalyanakrishnan. The information complexity of best arm identification. COLT 2013.
T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 1985.
D. Russo. Simple Bayesian algorithms for best arm identification. COLT 2016.
N.K. Vaidhyan and R. Sundaresan. Learning to detect an oddball target. arXiv, 2015.
