Dynamic resource allocation: Bandit problems and extensions


1 Dynamic resource allocation: Bandit problems and extensions Aurélien Garivier Institut de Mathématiques de Toulouse MAD Seminar, Université Toulouse 1 October 3rd, 2014

2 The Bandit Model Roadmap 1 The Bandit Model 2 Lower Bound for the Regret 3 Optimistic Algorithms 4 An Optimistic Algorithm based on Kullback-Leibler Divergence 5 Parametric setting: the kl-ucb Algorithm 6 Non-parametric setting and Empirical Likelihood 7 Extensions

3 The Bandit Model Dynamic resource allocation Imagine you are a doctor: patients visit you one after another for a given disease; you prescribe one of the (say) 5 available treatments; the treatments are not equally efficient; you do not know which one is the best, but you observe the effect of the prescribed treatment on each patient. What do you do? You must choose each prescription using only the previous observations. Your goal is not to estimate each treatment's efficiency precisely, but to heal as many patients as possible.

4 The Bandit Model The (stochastic) Multi-Armed Bandit Model Environment: $K$ arms with parameters $\theta = (\theta_1, \dots, \theta_K)$ such that for any possible choice of arm $a_t \in \{1, \dots, K\}$ at time $t$, one receives the reward $X_t = X_{a_t,t}$ where, for any $1 \le a \le K$ and $s \ge 1$, $X_{a,s} \sim \nu_a$, and the $(X_{a,s})_{a,s}$ are independent. Reward distributions: $\nu_a \in \mathcal{F}_a$, a parametric family or not. Examples: canonical exponential family, general bounded rewards. Example: Bernoulli rewards, $\theta \in [0,1]^K$, $\nu_a = \mathcal{B}(\theta_a)$. Strategy: the agent's actions follow a dynamical strategy $\pi = (\pi_1, \pi_2, \dots)$ such that $A_t = \pi_t(X_1, \dots, X_{t-1})$.
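To make the setting concrete, here is a minimal simulation sketch of the Bernoulli bandit environment described above; the function names (`run_bandit`, `policy`) and the history-based interface are illustrative choices, not part of the model itself (arms are indexed 0..K-1).

```python
import numpy as np

def run_bandit(theta, policy, T, rng=None):
    """Play T rounds of a Bernoulli bandit with means `theta`.

    `policy(history)` receives the list of (arm, reward) pairs observed
    so far and returns the next arm to pull (an index in 0..K-1).
    Returns the sequence of (arm, reward) pairs.
    """
    rng = np.random.default_rng() if rng is None else rng
    history = []
    for t in range(T):
        arm = policy(history)
        reward = float(rng.random() < theta[arm])  # X_{a,t} ~ B(theta_a)
        history.append((arm, reward))
    return history

# Example: a purely random strategy on a 5-armed Bernoulli bandit.
if __name__ == "__main__":
    theta = [0.1, 0.3, 0.5, 0.7, 0.9]
    random_policy = lambda history: np.random.randint(len(theta))
    print(sum(r for _, r in run_bandit(theta, random_policy, 1000)))
```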

5 The Bandit Model Real challenges Randomized clinical trials: the original motivation since the 1930s; dynamic strategies can save resources. Recommender systems: advertisement, website optimization, news, blog posts,... Computer experiments: large systems can be simulated in order to optimize some criterion over a set of parameters, but the simulation cost may be high, so that only a few parameter choices are possible. Games and planning (tree-structured options).

6 The Bandit Model Performance Evaluation, Regret Cumulated reward: $S_T = \sum_{t=1}^T X_t$. Our goal: choose $\pi$ so as to maximize $\mathbb{E}[S_T] = \sum_{t=1}^T \sum_{a=1}^K \mathbb{E}\big[\mathbb{E}[X_t \mathbf{1}\{A_t = a\} \mid X_1, \dots, X_{t-1}]\big] = \sum_{a=1}^K \mu_a \,\mathbb{E}[N_a^\pi(T)]$, where $N_a^\pi(T) = \sum_{t \le T} \mathbf{1}\{A_t = a\}$ is the number of draws of arm $a$ up to time $T$, and $\mu_a = E(\nu_a)$. Regret minimization: equivalent to minimizing $R_T = T\mu^* - \mathbb{E}[S_T] = \sum_{a : \mu_a < \mu^*} (\mu^* - \mu_a)\,\mathbb{E}[N_a^\pi(T)]$, where $\mu^* = \max\{\mu_a : 1 \le a \le K\}$.
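Given a simulated trajectory, the (pseudo-)regret follows directly from the draw counts $N_a^\pi(T)$ and the gaps $\mu^* - \mu_a$; a small helper sketch, assuming the `(arm, reward)` history convention of the previous sketch:

```python
import numpy as np

def empirical_regret(history, theta):
    """Pseudo-regret R_T = sum over arms of (mu* - mu_a) * N_a(T),
    computed from one trajectory of (arm, reward) pairs."""
    theta = np.asarray(theta, dtype=float)
    arms = np.array([arm for arm, _ in history], dtype=int)
    counts = np.bincount(arms, minlength=len(theta))  # N_a(T) for each arm
    return float(np.sum((theta.max() - theta) * counts))
```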

7 Lower Bound for the Regret Roadmap 1 The Bandit Model 2 Lower Bound for the Regret 3 Optimistic Algorithms 4 An Optimistic Algorithm based on Kullback-Leibler Divergence 5 Parametric setting: the kl-ucb Algorithm 6 Non-parametric setting and Empirical Likelihood 7 Extensions

8 Lower Bound for the Regret Asymptotically Optimal Strategies A strategy $\pi$ is said to be consistent if, for any $(\nu_a)_a \in \mathcal{F}^K$, $\frac{1}{T}\mathbb{E}[S_T] \to \mu^*$. The strategy is efficient if for all $\theta \in [0,1]^K$ and all $\alpha > 0$, $R_T = o(T^\alpha)$. There are efficient strategies, and we consider the best achievable asymptotic performance among efficient strategies.

9 Lower Bound for the Regret The Bound of Lai and Robbins One-parameter reward distributions: $\nu_a = \nu_{\theta_a}$, $\theta_a \in \Theta \subset \mathbb{R}$. Theorem [Lai and Robbins, 85]: if $\pi$ is an efficient strategy, then, for any $\theta \in \Theta^K$, $\liminf_{T\to\infty} \frac{R_T}{\log(T)} \ge \sum_{a : \mu_a < \mu^*} \frac{\mu^* - \mu_a}{\mathrm{KL}(\nu_a, \nu^*)}$, where $\mathrm{KL}(\nu, \nu')$ denotes the Kullback-Leibler divergence. For example, in the Bernoulli case: $\mathrm{KL}(\mathcal{B}(p), \mathcal{B}(q)) = d_{\mathrm{ber}}(p, q) = p \log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$.
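For illustration, the Bernoulli divergence $d_{\mathrm{ber}}$ and the resulting Lai-Robbins constant $\sum_{a:\mu_a<\mu^*} \frac{\mu^*-\mu_a}{d_{\mathrm{ber}}(\mu_a,\mu^*)}$ can be evaluated numerically; this is an illustrative sketch (the clipping constant `eps` is an implementation choice to avoid logarithms of zero):

```python
import math

def d_ber(p, q, eps=1e-12):
    """Bernoulli Kullback-Leibler divergence d_ber(p, q) = KL(B(p), B(q))."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_constant(mu):
    """Constant C(mu) such that R_T >= (C(mu) + o(1)) * log(T) for any
    efficient strategy, in the Bernoulli case."""
    mu_star = max(mu)
    return sum((mu_star - m) / d_ber(m, mu_star) for m in mu if m < mu_star)

# Example: two arms with means 0.9 and 0.8.
print(lai_robbins_constant([0.9, 0.8]))  # ~ 2.25
```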

10 Lower Bound for the Regret The Bound of Burnetas and Katehakis More general reward distributions: $\nu_a \in \mathcal{F}_a$. Theorem [Burnetas and Katehakis, 96]: if $\pi$ is an efficient strategy, then, for any $\theta \in [0,1]^K$, $\liminf_{T\to\infty} \frac{R_T}{\log(T)} \ge \sum_{a : \mu_a < \mu^*} \frac{\mu^* - \mu_a}{K_{\inf}(\nu_a, \mu^*)}$, where $K_{\inf}(\nu_a, \mu^*) = \inf\big\{ \mathrm{KL}(\nu_a, \nu') : \nu' \in \mathcal{F}_a,\ E(\nu') \ge \mu^* \big\}$. [figure: the distribution $\nu'$ with mean at least $\mu^*$ closest to $\nu_a$, realizing $K_{\inf}(\nu_a, \mu^*)$]

11 Lower Bound for the Regret Intuition First assume that $\mu^*$ is known and that $T$ is fixed. How many draws $n_a$ of $\nu_a$ are necessary to know that $\mu_a < \mu^*$ with probability at least $1 - 1/T$? Test $H_0: \mu_a = \mu^*$ against $H_1: \nu = \nu_a$. Stein's Lemma: if the type I error satisfies $\alpha_{n_a} \le 1/T$, then the type II error satisfies $\beta_{n_a} \gtrsim \exp\big(-n_a K_{\inf}(\nu_a, \mu^*)\big)$, so it can be smaller than $1/T$ only if $n_a \ge \frac{\log(T)}{K_{\inf}(\nu_a, \mu^*)}$. How to do as well without knowing $\mu^*$ and $T$ in advance? And not only asymptotically?

12 Optimistic Algorithms Roadmap 1 The Bandit Model 2 Lower Bound for the Regret 3 Optimistic Algorithms 4 An Optimistic Algorithm based on Kullback-Leibler Divergence 5 Parametric setting: the kl-ucb Algorithm 6 Non-parametric setting and Empirical Likelihood 7 Extensions

13 Optimistic Algorithms Optimism in the Face of Uncertainty Optimism is a heuristic principle popularized by [Lai&Robbins 85; Agrawal 95] which consists in letting the agent play as if the environment were the most favorable among all environments that are sufficiently likely given the observations accumulated so far. Surprisingly, this simple heuristic principle can be instantiated into algorithms that are robust, efficient and easy to implement in many scenarios pertaining to reinforcement learning.

14 Optimistic Algorithms Upper Confidence Bound Strategies UCB [Lai&Robbins 85; Agrawal 95; Auer&al 02] Construct an upper confidence bound for the expected reward of each arm: $\underbrace{\frac{S_a(t)}{N_a(t)}}_{\text{estimated reward}} + \underbrace{\sqrt{\frac{\log(t)}{2 N_a(t)}}}_{\text{exploration bonus}}$. Choose the arm with the highest UCB. It is an index strategy [Gittins 79]. Its behavior is easily interpretable and intuitively appealing.
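A possible implementation sketch of this index policy, plugging into the history-based interface used in the earlier simulation sketch; the handling of never-pulled arms through an infinite index is a common convention, not dictated by the slide:

```python
import math
import numpy as np

def ucb_policy(history, K, t=None):
    """UCB index: empirical mean + sqrt(log(t) / (2 * N_a(t))).

    `history` is the list of (arm, reward) pairs seen so far; arms not
    yet pulled get an infinite index so that each arm is tried once.
    """
    t = len(history) + 1 if t is None else t
    sums = np.zeros(K)
    counts = np.zeros(K)
    for arm, reward in history:
        sums[arm] += reward
        counts[arm] += 1
    indices = np.full(K, np.inf)
    pulled = counts > 0
    indices[pulled] = sums[pulled] / counts[pulled] + np.sqrt(
        math.log(t) / (2 * counts[pulled]))
    return int(np.argmax(indices))
```

It can be used, for instance, as `policy = lambda h: ucb_policy(h, K=5)` with the `run_bandit` sketch above.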

15 UCB in Action Optimistic Algorithms

16 UCB in Action Optimistic Algorithms

17 Optimistic Algorithms Performance of UCB For rewards in $[0, 1]$, the regret of UCB is upper-bounded as $\mathbb{E}[R_T] = O(\log(T))$ (finite-time regret bound) and $\limsup_{T\to\infty} \frac{\mathbb{E}[R_T]}{\log(T)} \le \sum_{a : \mu_a < \mu^*} \frac{1}{2(\mu^* - \mu_a)}$. Yet, in the case of Bernoulli variables, the r.h.s. is greater than suggested by the bound of Lai & Robbins. Many variants have been suggested to incorporate an estimate of the variance in the exploration bonus (e.g., [Audibert&al 07]).

18 An Optimistic Algorithm based on Kullback-Leibler Divergence Roadmap 1 The Bandit Model 2 Lower Bound for the Regret 3 Optimistic Algorithms 4 An Optimistic Algorithm based on Kullback-Leibler Divergence 5 Parametric setting: the kl-ucb Algorithm 6 Non-parametric setting and Empirical Likelihood 7 Extensions

19 An Optimistic Algorithm based on Kullback-Leibler Divergence The KL-UCB algorithm [Cappé,G.&al 13] Parameters: an operator $\Pi_{\mathcal{F}} : \mathcal{M}_1(\mathcal{S}) \to \mathcal{F}$; a non-decreasing function $f : \mathbb{N} \to \mathbb{R}$. Initialization: pull each arm of $\{1, \dots, K\}$ once. For $t = K$ to $T-1$ do: compute for each arm $a$ the quantity $U_a(t) = \sup\big\{ E(\nu) : \nu \in \mathcal{F} \text{ and } \mathrm{KL}\big(\Pi_{\mathcal{F}}(\hat\nu_a(t)), \nu\big) \le \frac{f(t)}{N_a(t)} \big\}$; pick an arm $A_{t+1} \in \arg\max_{a \in \{1,\dots,K\}} U_a(t)$; end for.

20 Parametric setting: the kl-ucb Algorithm Roadmap 1 The Bandit Model 2 Lower Bound for the Regret 3 Optimistic Algorithms 4 An Optimistic Algorithm based on Kullback-Leibler Divergence 5 Parametric setting: the kl-ucb Algorithm 6 Non-parametric setting and Empirical Likelihood 7 Extensions

21 Parametric setting: the kl-ucb Algorithm The analysis is generic and yields a non-asymptotic regret bound, optimal in the sense of Lai and Robbins. Exponential Family Rewards Assume that $\mathcal{F}_a = \mathcal{F}$ is a canonical exponential family, i.e. such that the pdf of the rewards is given by $p_{\theta_a}(x) = \exp\big(x\theta_a - b(\theta_a) + c(x)\big)$, $1 \le a \le K$, for a parameter $\theta \in \mathbb{R}^K$, with expectation $\mu_a = \dot b(\theta_a)$. The KL-UCB index is simply $U_a(t) = \sup\big\{ \mu \in I : d\big(\hat\mu_a(t), \mu\big) \le \frac{f(t)}{N_a(t)} \big\}$. For instance, for Bernoulli rewards: $d_{\mathrm{ber}}(p, q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$; for exponential rewards $p_{\theta_a}(x) = \theta_a e^{-\theta_a x}$: $d_{\exp}(u, v) = \frac{u}{v} - 1 - \log\frac{u}{v}$.

22 Parametric setting: the kl-ucb Algorithm Parametric version: the kl-ucb algorithm Parameters: $\mathcal{F}$ parameterized by the expectation $\mu \in I \subset \mathbb{R}$, with divergence $d$; a non-decreasing function $f : \mathbb{N} \to \mathbb{R}$. Initialization: pull each arm of $\{1, \dots, K\}$ once. For $t = K$ to $T-1$ do: compute for each arm $a$ the quantity $U_a(t) = \sup\big\{ \mu \in I : d\big(\hat\mu_a(t), \mu\big) \le \frac{f(t)}{N_a(t)} \big\}$; pick an arm $A_{t+1} \in \arg\max_{a \in \{1,\dots,K\}} U_a(t)$; end for.
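Since $\mu \mapsto d(\hat\mu_a(t), \mu)$ is increasing for $\mu \ge \hat\mu_a(t)$, the supremum defining $U_a(t)$ can be computed by bisection. A sketch for the Bernoulli case, using the simple choice $f(t) = \log(t)$ rather than the $\log(t) + 3\log\log(t)$ appearing in the theory below; the tolerance and clipping constants are arbitrary implementation choices:

```python
import math

def d_ber(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, n, t, f=lambda t: math.log(t), tol=1e-6):
    """U_a(t) = sup{ mu in [mu_hat, 1] : d_ber(mu_hat, mu) <= f(t)/n }.

    d_ber(mu_hat, .) is increasing on [mu_hat, 1], so the supremum
    is found by bisection.
    """
    level = f(t) / n
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if d_ber(mu_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

# Example: 10 observations with empirical mean 0.3, at time t = 100.
print(kl_ucb_index(0.3, 10, 100))
```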

23 Parametric setting: the kl-ucb Algorithm The kl Upper Confidence Bound in Picture If $Z_1, \dots, Z_s$ are iid $\mathcal{B}(\theta_0)$, $x < \theta_0$ and $\hat p_s = (Z_1 + \dots + Z_s)/s$, then by Chernoff's inequality $P_{\theta_0}(\hat p_s \le x) \le \exp\big(-s\, d_{\mathrm{ber}}(x, \theta_0)\big)$. In other words, if $\alpha = \exp\big(-s\, d_{\mathrm{ber}}(x, \theta_0)\big)$: $P_{\theta_0}(\hat p_s \le x) = P_{\theta_0}\Big( d_{\mathrm{ber}}(\hat p_s, \theta_0) \ge \frac{\log(1/\alpha)}{s},\ \hat p_s < \theta_0 \Big) \le \alpha$, whence an Upper Confidence Bound for $p$ at risk $\alpha$: $u_s = \sup\Big\{ \theta > \hat p_s : d_{\mathrm{ber}}(\hat p_s, \theta) \le \frac{\log(1/\alpha)}{s} \Big\}$. [figure: the divergence $\theta \mapsto d_{\mathrm{ber}}(\cdot, \theta)$ and the level $\log(1/\alpha)/s$ around $\theta_0$]

24 Parametric setting: the kl-ucb Algorithm The kl Upper Confidence Bound in Picture Same statement as on the previous slide; the figure now shows $\theta \mapsto d_{\mathrm{ber}}(\hat p_s, \theta)$, the empirical mean $\hat p_s$ and the resulting upper confidence bound $u_s$.

25 Parametric setting: the kl-ucb Algorithm Key Tool: Deviation Inequality for Self-Normalized Sums Problem: random number of summands. Solution: a peeling trick (as in the proof of the LIL). Theorem: for all $\epsilon > 1$, $P\big( \mu_a > \hat\mu_a(t) \text{ and } N_a(t)\, d\big(\hat\mu_a(t), \mu_a\big) \ge \epsilon \big) \le e\, \lceil \epsilon \log(t) \rceil\, e^{-\epsilon}$. Thus, $P\big( U_a(t) < \mu_a \big) \le e\, \lceil f(t) \log(t) \rceil\, e^{-f(t)}$.

26 Parametric setting: the kl-ucb Algorithm Regret bound Theorem: assume that all arms belong to a canonical, regular exponential family $\mathcal{F} = \{\nu_\theta : \theta \in \Theta\}$ of probability distributions indexed by its natural parameter space $\Theta \subseteq \mathbb{R}$. Then, with the choice $f(t) = \log(t) + 3\log\log(t)$ for $t \ge 3$, the number of draws of any suboptimal arm $a$ is upper bounded for any horizon $T \ge 3$ as $\mathbb{E}[N_a(T)] \le \frac{\log(T)}{d(\mu_a, \mu^*)} + 2\sqrt{\frac{2\pi \sigma_{a,*}^2 \big(d'(\mu_a, \mu^*)\big)^2}{\big(d(\mu_a, \mu^*)\big)^3}}\, \sqrt{\log(T) + 3\log(\log(T))} + \Big(4e + \frac{3}{d(\mu_a, \mu^*)}\Big)\log(\log(T)) + \frac{8\sigma_{a,*}^2 \big(d'(\mu_a, \mu^*)\big)^2}{\big(d(\mu_a, \mu^*)\big)^2} + 6$, where $\sigma_{a,*}^2 = \max\{\mathrm{Var}(\nu_\theta) : \mu_a \le E(\nu_\theta) \le \mu^*\}$ and where $d'(\cdot, \mu^*)$ denotes the derivative of $d(\cdot, \mu^*)$ with respect to its first argument.

27 Parametric setting: the kl-ucb Algorithm Results: Two-Arm Scenario [figure comparing UCB, MOSS, UCB-Tuned, UCB-V, DMED, KL-UCB and the lower bound] Figure: Performance of various algorithms when $\theta = (0.9, 0.8)$. Left: average number of draws $N_2(n)$ of the sub-optimal arm as a function of time (log scale). Right: box-and-whiskers plot for the number of draws of the sub-optimal arm at time $T = 5{,}000$. Results based on $50{,}000$ independent replications.

28 Parametric setting: the kl-ucb Algorithm Results: Ten-Arm Scenario with Low Rewards [figure: average regret $R_n$ as a function of time (log scale) for UCB, UCB-Tuned, CP-UCB, MOSS, DMED, UCB-V and KL-UCB] Figure: Average regret as a function of time when $\theta = (0.1, 0.05, 0.05, 0.05, 0.02, 0.02, 0.02, 0.01, 0.01, 0.01)$. Red line: Lai & Robbins lower bound; thick line: average regret; shaded regions: central 99% region and upper 99.95% quantile.

29 Non-parametric setting and Empirical Likelihood Roadmap 1 The Bandit Model 2 Lower Bound for the Regret 3 Optimistic Algorithms 4 An Optimistic Algorithm based on Kullback-Leibler Divergence 5 Parametric setting: the kl-ucb Algorithm 6 Non-parametric setting and Empirical Likelihood 7 Extensions

30 Non-parametric setting and Empirical Likelihood Non-parametric setting Rewards are only assumed to be bounded (say in $[0, 1]$). Need for an estimation procedure with non-asymptotic guarantees, efficient in the sense of Stein / Bahadur. $\Rightarrow$ Idea 1: use $d_{\mathrm{ber}}$ (Hoeffding). $\Rightarrow$ Idea 2: Empirical Likelihood [Owen 01]. Bad idea: use Bernstein / Bennett.

31 Non-parametric setting and Empirical Likelihood First idea: use $d_{\mathrm{ber}}$ Idea: rescale to $[0, 1]$, and take the divergence $d_{\mathrm{ber}}$, because Bernoulli distributions maximize deviations among bounded variables with given expectation: Lemma (Hoeffding 63): let $X$ denote a random variable such that $0 \le X \le 1$ and denote by $\mu = \mathbb{E}[X]$ its mean. Then, for any $\lambda \in \mathbb{R}$, $\mathbb{E}[\exp(\lambda X)] \le 1 - \mu + \mu\exp(\lambda)$. This fact is well-known for the variance, but it is also true for all exponential moments and thus for Cramér-type deviation bounds.

32 Non-parametric setting and Empirical Likelihood Regret Bound for kl-ucb Theorem: with the divergence $d_{\mathrm{ber}}$, for all $T > 3$, $\mathbb{E}[N_a(T)] \le \frac{\log(T)}{d_{\mathrm{ber}}(\mu_a, \mu^*)} + \sqrt{2\pi}\, \frac{\log\frac{\mu^*(1-\mu_a)}{\mu_a(1-\mu^*)}}{\big(d_{\mathrm{ber}}(\mu_a, \mu^*)\big)^{3/2}}\, \sqrt{\log(T) + 3\log(\log(T))} + \Big(4e + \frac{3}{d_{\mathrm{ber}}(\mu_a, \mu^*)}\Big)\log(\log(T)) + \frac{2\big(\log\frac{\mu^*(1-\mu_a)}{\mu_a(1-\mu^*)}\big)^2}{\big(d_{\mathrm{ber}}(\mu_a, \mu^*)\big)^2}$. kl-ucb satisfies an improved logarithmic finite-time regret bound. Besides, it is asymptotically optimal in the Bernoulli case.

33 Non-parametric setting and Empirical Likelihood Comparison to UCB KL-UCB addresses exactly the same problem as UCB, with the same generality, but it always has a smaller regret, as can be seen from Pinsker's inequality: $d_{\mathrm{ber}}(\mu_1, \mu_2) \ge 2(\mu_1 - \mu_2)^2$. [figure: $d_{\mathrm{ber}}(0.7, q)$ versus $2(0.7 - q)^2$]
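A quick numerical check of this comparison; the test values of $q$ are arbitrary illustration points:

```python
import math

def d_ber(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Pinsker: d_ber(p, q) >= 2 (p - q)^2, so the kl confidence interval is
# never wider than the Hoeffding-based one used by UCB.
for q in [0.1, 0.3, 0.5, 0.69]:
    assert d_ber(0.7, q) >= 2 * (0.7 - q) ** 2
```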

34 Non-parametric setting and Empirical Likelihood Idea 2: Empirical Likelihood $U(\hat\nu_n, \epsilon) = \sup\big\{ E(\nu') : \nu' \in \mathcal{M}_1\big(\mathrm{Supp}(\hat\nu_n)\big) \text{ and } \mathrm{KL}(\hat\nu_n, \nu') \le \epsilon \big\}$, or, rather, modified Empirical Likelihood: $U(\hat\nu_n, \epsilon) = \sup\big\{ E(\nu') : \nu' \in \mathcal{M}_1\big(\mathrm{Supp}(\hat\nu_n) \cup \{1\}\big) \text{ and } \mathrm{KL}(\hat\nu_n, \nu') \le \epsilon \big\}$. [figure: the confidence bound $U_n$ lies above the empirical mean $\hat\mu_n$]

35 Non-parametric setting and Empirical Likelihood Coverage properties of the modified EL confidence bound Proposition: let $\nu_0 \in \mathcal{M}_1([0,1])$ with $E(\nu_0) \in (0, 1)$ and let $X_1, \dots, X_n$ be independent random variables with common distribution $\nu_0 \in \mathcal{M}_1([0,1])$, not necessarily with finite support. Then, for all $\epsilon > 0$, $P\big\{ U(\hat\nu_n, \epsilon) \le E(\nu_0) \big\} \le P\big\{ K_{\inf}\big(\hat\nu_n, E(\nu_0)\big) \ge \epsilon \big\} \le e(n+2)\exp(-n\epsilon)$. Remark: for $\{0, 1\}$-valued observations, it is readily seen that $U(\hat\nu_n, \epsilon)$ boils down to the upper-confidence bound above. $\Rightarrow$ This proposition is at least not always optimal: the presence of the factor $n$ in front of the exponential term $\exp(-n\epsilon)$ is questionable.

36 Non-parametric setting and Empirical Likelihood Regret bound Theorem: assume that $\mathcal{F}$ is the set of finitely supported probability distributions over $\mathcal{S} = [0, 1]$, that $\mu_a > 0$ for all arms $a$ and that $\mu^* < 1$. There exists a constant $M(\nu_a, \mu^*) > 0$, depending only on $\nu_a$ and $\mu^*$, such that, with the choice $f(t) = \log(t) + \log(\log(t))$ for $t \ge 2$, for all $T \ge 3$: $\mathbb{E}[N_a(T)] \le \frac{\log(T)}{K_{\inf}(\nu_a, \mu^*)} + \frac{36}{(\mu^*)^4}\big(\log(T)\big)^{4/5}\log\big(\log(T)\big) + \Big(\frac{72}{(\mu^*)^4} + \frac{2\mu^*}{(1-\mu^*)\,K_{\inf}(\nu_a, \mu^*)^2}\Big)\big(\log(T)\big)^{4/5} + \frac{(1-\mu^*)^2 M(\nu_a, \mu^*)}{2(\mu^*)^2}\big(\log(T)\big)^{2/5} + \frac{\log\big(\log(T)\big)}{K_{\inf}(\nu_a, \mu^*)} + \frac{2\mu^*}{(1-\mu^*)\,K_{\inf}(\nu_a, \mu^*)}$.

37 Non-parametric setting and Empirical Likelihood Example: truncated Poisson rewards Each arm $1 \le a \le 6$ is associated with $\nu_a$, a Poisson distribution with expectation $(2+a)/4$, truncated at 10. $N = 10{,}000$ Monte-Carlo replications on a horizon of $T = 20{,}000$ steps. [figure: regret as a function of time (log scale) for kl-Poisson-UCB, KL-UCB, UCB-V, kl-UCB and UCB]

38 Non-parametric setting and Empirical Likelihood Example: truncated Exponential rewards Exponential rewards with respective parameters 1/5, 1/4, 1/3, 1/2 and 1, truncated at $x_{\max} = 10$; kl-ucb uses the divergence $d(x, y) = x/y - 1 - \log(x/y)$ prescribed for genuine exponential distributions, but it ignores the fact that the rewards are truncated. [figure: regret as a function of time (log scale) for kl-exp-UCB, KL-UCB, UCB-V, kl-UCB and UCB]

39 Non-parametric setting and Empirical Likelihood Take-home message on bandit algorithms 1 Use kl-ucb rather than UCB-1 or UCB-2 2 Use KL-UCB if speed is not a problem 3 todo: improve on the deviation bounds, address general non-parametric families of distributions 4 Alternative: Bayesian-flavored methods: Bayes-UCB [Kaufmann, Cappé, G.] Thompson sampling [Kaufmann & al.]

40 Extensions Roadmap 1 The Bandit Model 2 Lower Bound for the Regret 3 Optimistic Algorithms 4 An Optimistic Algorithm based on Kullback-Leibler Divergence 5 Parametric setting: the kl-ucb Algorithm 6 Non-parametric setting and Empirical Likelihood 7 Extensions

41 Extensions Non-stationary Bandits [G. & Moulines 11] Changepoints: reward distributions change abruptly. Goal: follow the best arm. Application: scanning tunnelling microscope. Variants D-UCB and SW-UCB include a progressive discount of the past. Bounds of order $O(\sqrt{n \log n})$ are proved, which is (almost) optimal.

42 Extensions (Generalized) Linear Bandits [Filippi, Cappé, G. & Szepesvári 10] Bandit with contextual information: $\mathbb{E}[X_t \mid A_t] = \mu(m_{A_t}^\top \theta^*)$, where $\theta^* \in \mathbb{R}^d$ is an unknown parameter and $\mu : \mathbb{R} \to \mathbb{R}$ is a link function. Example: binary rewards, $\mu(x) = \frac{\exp(x)}{1 + \exp(x)}$. Application: targeted web ads. GLM-UCB: regret bound depending on the dimension $d$ and not on the number of arms.

43 Extensions Stochastic Optimization Goal: find the maximum of a function $f : \mathcal{C} \subset \mathbb{R}^d \to \mathbb{R}$ (possibly) observed in noise. Application: DAS. Model: $f$ is the realization of a Gaussian Process (or has a small norm in some RKHS). GP-UCB: evaluate $f$ at the point $x \in \mathcal{C}$ where the confidence interval for $f(x)$ has the highest upper bound.

44 Extensions Recent Advances in Continuous Bandits Unimodal bandits without smoothness: trisection algorithms, and better [Combes, Proutière 14]. Application to internet network traffic optimization. Reserve Price Optimization in Second-price Auctions [Cesa-Bianchi, Gentile, Mansour 13]. Application to advertisement systems.

45 Extensions Markov Decision Processes (MDP) [Filippi, Cappé & G. 10] The system is in state $S_t$, which evolves as a Markov Chain: $S_{t+1} \sim P(\cdot\,; S_t, A_t)$ and $R_t = r(S_t, A_t) + \epsilon_t$. Optimistic algorithm: search for the best transition matrix in a neighborhood of the ML estimate. The use of Kullback-Leibler neighborhoods leads to better performance and has desirable properties.

46 Extensions Optimal Exploration with Probabilistic Expert Advice Search space: $B \subset \Omega$, a discrete set. Probabilistic experts: $P_a \in \mathcal{M}_1(\Omega)$ for $a \in \mathcal{A}$. Requests: at time $t$, calling expert $A_t$ yields a realization of $X_t = X_{A_t,t}$, independent with law $P_{A_t}$. Goal: find as many distinct elements of $B$ as possible with few requests: $F_n = \mathrm{Card}\big(B \cap \{X_1, \dots, X_n\}\big)$. $\neq$ bandit: finding the same element twice is no use! Oracle: selects the expert with the highest missing mass, $A_{t+1}^* = \arg\max_{a \in \mathcal{A}} P_a\big(B \setminus \{X_1, \dots, X_t\}\big)$.

47 Extensions The model Subset $A \subset \mathcal{X}$ of important items. Access to $\mathcal{X}$ only by probabilistic experts $(P_i)_{1 \le i \le K}$: sequential independent draws. Goal: discover rapidly the elements of $A$.

51 Extensions Goal At each time step $t = 1, 2, \dots$: pick an index $I_t = \pi_t(I_1, Y_1, \dots, I_{t-1}, Y_{t-1}) \in \{1, \dots, K\}$ according to past observations; observe $Y_t = X_{I_t, n_{I_t,t}} \sim P_{I_t}$, where $n_{i,t} = \sum_{s \le t} \mathbf{1}\{I_s = i\}$. Goal: design the strategy $\pi = (\pi_t)_t$ so as to maximize the number of important items found after $t$ requests, $F^\pi(t) = \big| A \cap \{Y_1, \dots, Y_t\} \big|$. Assumption: non-intersecting supports, $A \cap \mathrm{supp}(P_i) \cap \mathrm{supp}(P_j) = \emptyset$ for $i \ne j$.

52 Extensions Is it a Bandit Problem? It looks like a bandit problem: sequential choices among K options; the aim is to maximize cumulative rewards; exploration vs exploitation dilemma... but it is not a bandit problem! Rewards are not i.i.d.; rewards are destructive: there is no interest in observing the same important item twice; all strategies are eventually equivalent.

53 Extensions The oracle strategy Proposition: under the non-intersecting support hypothesis, the greedy oracle strategy $I_{t+1}^* \in \arg\max_{1 \le i \le K} P_i\big(A \setminus \{Y_1, \dots, Y_t\}\big)$ is optimal: for every possible strategy $\pi$, $\mathbb{E}[F^\pi(t)] \le \mathbb{E}[F^*(t)]$. Remark: the proposition is false if the supports may intersect. $\Rightarrow$ Estimate the missing mass of important items!

54 Extensions Estimating the missing mass Notation: $P \in \mathcal{M}_1(\Omega)$, $X_t \overset{iid}{\sim} P$, $O_n(\omega) = \sum_{t=1}^n \mathbf{1}\{X_t = \omega\}$, $Z_n(\omega) = \mathbf{1}\{O_n(\omega) = 0\}$, $H_n(\omega) = \mathbf{1}\{O_n(\omega) = 1\}$, $H_n = \sum_{\omega \in B} H_n(\omega)$. Problem: estimate the missing mass $R_n = \sum_{\omega \in B} P(\omega) Z_n(\omega)$. Good-Turing: the estimator $\hat R_n = H_n / n$ satisfies $\mathbb{E}[\hat R_n - R_n] \in [0, 1/n]$. Concentration: by McDiarmid's inequality, with probability $1 - \delta$, $\big| \hat R_n - \mathbb{E}[\hat R_n] \big| \le (2/n + p_{\max}) \sqrt{\frac{n \log(2/\delta)}{2}}$.
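A direct transcription of the Good-Turing estimator $\hat R_n = H_n/n$ restricted to important items; the function name and the representation of the sample as a Python list are illustrative choices:

```python
from collections import Counter

def good_turing_missing_mass(samples, important):
    """Good-Turing estimate of the missing mass of important items:
    hat{R}_n = H_n / n, where H_n counts important items seen exactly once."""
    counts = Counter(samples)
    hapaxes = sum(1 for item, c in counts.items() if c == 1 and item in important)
    return hapaxes / len(samples)

# Example: only 'a' is an important item seen exactly once among n = 6 draws.
print(good_turing_missing_mass(["a", "b", "b", "c", "c", "c"], {"a", "b", "c", "d"}))  # 1/6
```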

55 Extensions The Good-UCB algorithm [Bubeck, Ernst & G.] Optimistic algorithm based on Good-Turing's estimator: $A_{t+1} = \arg\max_{a \in \mathcal{A}} \Big\{ \frac{H_a(t)}{N_a(t)} + c\sqrt{\frac{\log(t)}{N_a(t)}} \Big\}$, where $N_a(t)$ is the number of draws of $P_a$ up to time $t$, $H_a(t)$ is the number of elements of $B$ seen exactly once thanks to $P_a$, and $c$ is a tuning parameter.
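A sketch of the resulting index computation; the data layout (a dict mapping each expert to the list of items it has returned so far) and the default value of the tuning constant `c` are illustrative assumptions:

```python
import math
from collections import Counter

def good_ucb_choice(draws_per_expert, important, t, c=3.0):
    """Good-UCB index: hat{R}_a + c * sqrt(log(t) / N_a(t)), where hat{R}_a
    is the Good-Turing estimate of the missing mass of expert a.

    `draws_per_expert[a]` is the list of items obtained so far from expert a.
    """
    best_a, best_index = None, -math.inf
    for a, draws in draws_per_expert.items():
        n_a = max(len(draws), 1)
        counts = Counter(draws)
        # H_a(t): important items returned exactly once by expert a.
        h_a = sum(1 for item, k in counts.items() if k == 1 and item in important)
        index = h_a / n_a + c * math.sqrt(math.log(max(t, 2)) / n_a)
        if index > best_index:
            best_a, best_index = a, index
    return best_a
```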

56 Extensions Classical analysis Theorem: for any $t \ge 1$, under the non-intersecting support assumption, Good-UCB (with constant $C = (1+\sqrt{2})\sqrt{3}$) satisfies $\mathbb{E}\big[ F^*(t) - F^{UCB}(t) \big] \le 17\sqrt{Kt\log(t)} + 20\sqrt{Kt} + K + K\log(t/K)$. Remark: this is the usual kind of result for a bandit problem, but the analysis is not so simple.

57 Extensions Good-UCB in action

58 Extensions The macroscopic limit Restricted framework: $P_i = \mathcal{U}\{1, \dots, N\}$, $\big| A \cap \mathrm{supp}(P_i) \big| / N \to q_i \in (0, 1)$, $q = \sum_i q_i$.

61 Extensions The Oracle behaviour The limiting discovery process of the Oracle strategy is deterministic. Proposition: for every $\lambda \in (0, q_1)$, for every sequence $(\lambda_N)_N$ converging to $\lambda$ as $N$ goes to infinity, almost surely $\lim_{N\to\infty} \frac{T^{*N}(\lambda_N)}{N} = \sum_i \Big( \log\frac{q_i}{\lambda} \Big)_+$.

62 Extensions Oracle vs. uniform sampling Oracle: the proportion of important items not found after $Nt$ draws tends to $q - F^*(t) = I(t)\,\tilde q_{I(t)}\, \exp(-t/I(t))$, which equals $K\,\tilde q_K \exp(-t/K)$ for $t$ large enough, with $\tilde q_K = \big(\prod_{i=1}^K q_i\big)^{1/K}$ the geometric mean of the $(q_i)_i$. Uniform: the proportion of important items not found after $Nt$ draws tends to $K\,\bar q_K \exp(-t/K)$, with $\bar q_K = \frac{1}{K}\sum_{i=1}^K q_i$. $\Rightarrow$ Asymptotic ratio of efficiency $\rho(q) = \frac{\bar q_K}{\tilde q_K} = \frac{\frac{1}{K}\sum_{i=1}^K q_i}{\big(\prod_{i=1}^K q_i\big)^{1/K}} \ge 1$, larger if the $(q_i)_i$ are unbalanced.
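The asymptotic efficiency ratio $\rho(q)$ is simply the ratio of the arithmetic to the geometric mean of the $(q_i)_i$, which is easy to evaluate; the example values of $q$ below are arbitrary:

```python
import math

def efficiency_ratio(q):
    """rho(q) = arithmetic mean / geometric mean of the proportions q_i (>= 1)."""
    arith = sum(q) / len(q)
    geom = math.exp(sum(math.log(qi) for qi in q) / len(q))
    return arith / geom

print(efficiency_ratio([0.4, 0.1, 0.05]))  # unbalanced: noticeably > 1
print(efficiency_ratio([0.2, 0.2, 0.2]))   # balanced: exactly 1
```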

63 Extensions Macroscopic optimality Theorem: take $C = (1+\sqrt{2})\sqrt{c} + 2$ with $c > 3/2$ in the Good-UCB algorithm. For every sequence $(\lambda_N)_N$ converging to $\lambda$ as $N$ goes to infinity, almost surely $\limsup_{N\to+\infty} \frac{T_{UCB}^N(\lambda_N)}{N} \le \sum_i \Big( \log\frac{q_i}{\lambda} \Big)_+$. The proportion of items found after $Nt$ steps, $F^{GUCB}$, converges uniformly to $F^*$ as $N$ goes to infinity.

64 Extensions Simulation Number of items found by Good-UCB (line), the oracle (bold dashed), and by uniform sampling (light dotted) as a function of time, for sample sizes $N = 128$, $N = 500$, $N = 1000$ and $N = 10000$, in an environment with 7 experts.

65 Extensions Bibliography
1. [Abbasi-Yadkori&al 11] Y. Abbasi-Yadkori, D. Pál, Cs. Szepesvári. Online least squares estimation with self-normalized processes: an application to bandit problems. CoRR, 2011.
2. [Agrawal 95] R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4), 1995.
3. [Audibert&al 09] J-Y. Audibert, R. Munos, and Cs. Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19), 2009.
4. [Auer&al 02] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2), 2002.
5. [Bubeck, Ernst&G. 11] S. Bubeck, D. Ernst, A. Garivier. Optimal discovery with probabilistic expert advice: finite time analysis and macroscopic optimality. Journal of Machine Learning Research, vol. 14.
6. [De La Pena&al 04] V.H. De La Pena, M.J. Klass, and T.L. Lai. Self-normalized processes: exponential inequalities, moment bounds and iterated logarithm laws. Annals of Probability, 32(3), 2004.
7. [Filippi, Cappé&Garivier 10] S. Filippi, O. Cappé, and A. Garivier. Optimism in reinforcement learning and Kullback-Leibler divergence. In Allerton Conf. on Communication, Control, and Computing, Monticello, US, 2010.
8. [Filippi, Cappé, G.&Szepesvari 10] S. Filippi, O. Cappé, A. Garivier, and C. Szepesvari. Parametric bandits: the generalized linear case. In Neural Information Processing Systems (NIPS), 2010.
9. [G.&Cappé 11] A. Garivier and O. Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Conf. Learning Theory (COLT), Budapest, Hungary, 2011.
10. [Cappé,G.&al 13] O. Cappé, A. Garivier, O-A. Maillard, R. Munos, G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, accepted.
11. [G.&Leonardi 11] A. Garivier and F. Leonardi. Context tree selection: a unifying view. Stochastic Processes and their Applications, 121(11), Nov. 2011.
12. [G.&Moulines 11] A. Garivier and E. Moulines. On upper-confidence bound policies for non-stationary bandit problems. In Algorithmic Learning Theory (ALT), volume 6925 of Lecture Notes in Computer Science, 2011.
13. [Lai&Robbins 85] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.
