The multi-armed bandit problem

1 The multi-armed bandit problem (with covariates if we have time) Vianney Perchet & Philippe Rigollet LPMA Université Paris Diderot ORFE Princeton University Algorithms and Dynamics for Games and Optimization October 14-18th, 2013

2 Introduction Introduction Boring and useless definitions: Bandits: optimization of a noisy function. Observations: $f(x) + \varepsilon_x$ where $\varepsilon_x$ is a random variable. Statistics: lack of information (exploration). Optimization: maximize $f(\cdot)$ (exploitation). Games: cumulative loss/payoff/reward. Covariates: some additional side observations are gathered. Start "easy": $f$ is maximized over a finite set. Concrete, simple and understandable examples follow.

3 Introduction Some real world examples

4 Introduction Some real world examples

5 Introduction Some real world examples

6 Introduction Simplified decision problem of Google Different firms go to Google and offer: "if you put my ad after the keywords "Flat Rental Paris", every time a customer clicks on it, I'll give you $b_i$ euros". A given ad $i$ has some exogenous but unknown probability of being clicked, $p_i$. Displaying ad $i$ gives in expectation $p_i \cdot b_i$ to Google. Objective of Google: maximize cumulated payoff as fast as possible.

7 Introduction Simplified decision problem of Google Different firms go to Google and offer: "if you put my ad after the keywords "Flat Rental Paris", every time a customer clicks on it, I'll give you $b_i$ euros". A given ad $i$ has some exogenous but unknown probability of being clicked, $p_i$. Displaying ad $i$ gives in expectation $p_i \cdot b_i$ to Google. Objective of Google: maximize cumulated payoff as fast as possible. Difficulties: the expected revenue of an ad $i$ is unknown; $p_i$ cannot be estimated if ad $i$ is not displayed. Take a risk and display new ads (to estimate new and maybe high $p_i$), or play it safe and display the best estimated ad?

8 Static case Framework Static Case Successive Elimination (SE) Static bandit No queries Structure of a specific instance Decision set: $\{1, \ldots, K\}$ (the set of "arms"... ads). Expected payoff of arm $k$: $f_k \in [0, 1]$. Best arm: $f^* = \max_k f_k$. Problem difficulty: $\Delta_k = f^* - f_k$, $\Delta_{\min} = \min_{k : \Delta_k > 0} \Delta_k$. Repeated decision problem. At stage $t \in \mathbb{N}$: choose $k_t \in \{1, \ldots, K\}$, receive $Y_t \in [0, 1]$ i.i.d. with expectation $f_{k_t}$. Observe only the payoff $Y_t$ (and not $f_{k_t}$) and move to stage $t + 1$. Objectives: maximize cumulative expected payoff, or minimize regret: $R_T = T f^* - \sum_{t=1}^T f_{k_t} = \sum_{t=1}^T \Delta_{k_t}$. Choose the best decision as quickly as possible despite the noise.
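
Before moving on, here is a small simulation sketch of this static setting (my own illustration, not from the talk): a K-armed bandit with Bernoulli payoffs, and the regret computed from the sequence of pulled arms. The `StaticBandit` name is an assumption made for the examples that follow.

```python
import numpy as np

class StaticBandit:
    """Static K-armed bandit: arm k pays a Bernoulli reward with mean f_k in [0, 1]."""
    def __init__(self, means, rng=None):
        self.f = np.asarray(means, dtype=float)      # expected payoffs f_1, ..., f_K
        self.rng = rng or np.random.default_rng()

    def pull(self, k):
        # Noisy payoff Y_t with expectation f_k; only this value is observed.
        return float(self.rng.random() < self.f[k])

    def regret(self, pulls):
        # R_T = T f* - sum_t f_{k_t} = sum_t Delta_{k_t}
        gaps = self.f.max() - self.f
        return float(sum(gaps[k] for k in pulls))
```

The UCB and SE sketches further down use this environment.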

9 Static case Framework Static Case Successive Elimination (SE) Static Case: UCB Lower bound for K = 2: $R_T \gtrsim \frac{\log(T \Delta_{\min}^2)}{\Delta_{\min}}$ with $\Delta_{\min} = \min_{k : \Delta_k > 0} f^* - f_k$ Famous algo: Upper Confidence Bound (and its variants) Using UCB, $\mathbb{E}[R_T] \lesssim 8 \sum_k \frac{\log T}{\Delta_k} \le \frac{8 K \log T}{\Delta_{\min}}$

10 Static case Framework Static Case Successive Elimination (SE) Static Case: UCB Lower bound for K = 2: $R_T \gtrsim \frac{\log(T \Delta_{\min}^2)}{\Delta_{\min}}$ with $\Delta_{\min} = \min_{k : \Delta_k > 0} f^* - f_k$ Famous algo: Upper Confidence Bound (and its variants) Draw each arm $1, \ldots, K$ once and observe $Y^1_1, \ldots, Y^K_K$ (Round 1). After stage $t$, compute the following: $t_k = \#\{\tau \le t : k_\tau = k\}$, the number of times arm $k$ was drawn; $\bar Y^k_t = \frac{1}{t_k} \sum_{\tau \le t : k_\tau = k} Y^k_\tau$, an estimate of $f_k$. Draw the arm $k_{t+1} = \arg\max_k \bar Y^k_t + \sqrt{\frac{2 \log t}{t_k}}$ Using UCB, $\mathbb{E}[R_T] \lesssim 8 \sum_k \frac{\log T}{\Delta_k} \le \frac{8 K \log T}{\Delta_{\min}}$
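
A minimal sketch of the UCB rule from this slide, run against the `StaticBandit` sketch above; the bonus is the slide's $\sqrt{2 \log t / t_k}$ and rewards are assumed Bernoulli.

```python
import numpy as np

def ucb(bandit, K, T):
    """UCB: play each arm once, then the arm maximizing mean + sqrt(2 log t / t_k)."""
    counts = np.zeros(K)              # t_k: number of draws of arm k
    sums = np.zeros(K)                # running sum of the observed payoffs of arm k
    pulls = []
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1                 # round 1: draw each arm once
        else:
            means = sums / counts
            bonus = np.sqrt(2.0 * np.log(t) / counts)
            k = int(np.argmax(means + bonus))
        sums[k] += bandit.pull(k)
        counts[k] += 1
        pulls.append(k)
    return pulls
```

The regret of the returned sequence of pulls can then be compared with the $8 \sum_k \log T / \Delta_k$ bound above.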

11 Static case Framework Static Case Successive Elimination (SE) Remarks on UCB Lower bound for K = 2: $R_T \gtrsim \frac{\log(T \Delta_{\min}^2)}{\Delta_{\min}}$, $\Delta_{\min} = \min_{k : \Delta_k > 0} \Delta_k$ UCB algo: Draw each arm $1, \ldots, K$ once and observe $Y^1_1, \ldots, Y^K_K$ (Round 1). Draw the arm $k_{t+1} = \arg\max_k \bar Y^k_t + \sqrt{\frac{2 \log t}{t_k}}$ UCB upper bound: $\mathbb{E}[R_T] \lesssim 8 \sum_k \frac{\log T}{\Delta_k} \le \frac{8 K \log T}{\Delta_{\min}}$ Remarks: Proof based on Hoeffding's inequality; Not intuitive: clearly suboptimal arms keep being drawn. MOSS, a variant of UCB, achieves $\mathbb{E}[R_T] \lesssim \frac{K \log(T \Delta_{\min}^2 / K)}{\Delta_{\min}}$ Neither $\log(T)$ nor $K \log(T \Delta_{\min}^2 / K)$ is sufficient with covariates.

12 Static case Framework Static Case Successive Elimination (SE) Successive Elimination (SE) Simple policy based on the intuition: determine the suboptimal arms, and do not play them. Time is divided into rounds $n \in \mathbb{N}$: after round $n$: eliminate the arms that are (with great probability) suboptimal, i.e., any arm $k$ such that $\bar Y^k_n + 2\sqrt{\frac{\log(T/n)}{n}} \le \bar Y^{k'}_n - 2\sqrt{\frac{\log(T/n)}{n}}$ for some remaining arm $k'$; at round $n + 1$: draw each remaining arm once. Easy to describe, to understand (but not to analyse for K > 2...), intuitive. Simple confidence term (but requires knowledge of T). (SE) is a variant of Even-Dar et al. ('06), Auer and Ortner ('10)
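
A matching sketch of (SE) as described on this slide, again hypothetical and written against the `StaticBandit` environment; the confidence term is the slide's $2\sqrt{\log(T/n)/n}$.

```python
import numpy as np

def successive_elimination(bandit, K, T):
    """SE: in each round draw every surviving arm once, then eliminate any arm whose
    upper confidence bound falls below the lower confidence bound of another arm."""
    active = list(range(K))
    sums = np.zeros(K)
    pulls = []
    n = 0                                              # number of completed rounds
    while True:
        n += 1
        for k in active:                               # one draw of each remaining arm
            sums[k] += bandit.pull(k)
            pulls.append(k)
            if len(pulls) == T:
                return pulls
        means = sums[active] / n
        radius = 2.0 * np.sqrt(np.log(T / n) / n)      # confidence term from the slide
        best_lcb = means.max() - radius
        active = [k for k, m in zip(active, means) if m + radius > best_lcb]
```

Both sketches can be run on the same `StaticBandit` instance to compare their regret.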

13 Static case Framework Static Case Successive Elimination (SE) Regret of successive elimination Theorem [P. and Rigollet ('13)] Played on $K$ arms, the (SE) policy satisfies $\mathbb{E}[R_T] \lesssim \min\Big\{ \sum_k \frac{\log(T \Delta_k^2)}{\Delta_k},\ \sqrt{T K \log K} \Big\}$ UCB: $\sum_k \frac{\log T}{\Delta_k}$, MOSS: $\frac{K \log(T \Delta_{\min}^2 / K)}{\Delta_{\min}}$ $\mathbb{E}[R_T] = \sum_k \Delta_k\, \mathbb{E}[n_k]$ with $n_k$ the number of draws of arm $k$ Exact bound:

14 Static case Framework Static Case Successive Elimination (SE) Regret of successive elimination Theorem [P. and Rigollet ('13)] Played on $K$ arms, the (SE) policy satisfies $\mathbb{E}[R_T] \lesssim \min\Big\{ \sum_k \frac{\log(T \Delta_k^2)}{\Delta_k},\ \sqrt{T K \log K} \Big\}$ UCB: $\sum_k \frac{\log T}{\Delta_k}$, MOSS: $\frac{K \log(T \Delta_{\min}^2 / K)}{\Delta_{\min}}$ $\mathbb{E}[R_T] = \sum_k \Delta_k\, \mathbb{E}[n_k]$ with $n_k$ the number of draws of arm $k$ Exact bound: $\mathbb{E}[R_T] \le \min\Big\{ 646 \sum_k \frac{1}{\Delta_k} \log\Big( \max\Big[ \frac{T \Delta_k^2}{18}, e \Big] \Big),\ 166 \sqrt{T K \log K} \Big\}$

15 Static case Framework Static Case Successive Elimination (SE) Successive Elimination: Example Two arms. A round: a draw of both arms. [Figure: the true means $f_1 > f_2$.]

16 Static case Framework Static Case Successive Elimination (SE) Successive Elimination: Example Two arms. A round: a draw of both arms. Round 1: no elimination. [Figure: estimates $\bar Y^1_1, \bar Y^2_1$ with confidence intervals of half-width $2\sqrt{\log(T/1)/1}$ around $f_1$ and $f_2$; the intervals overlap.]

17 Static case Framework Static Case Successive Elimination (SE) Successive Elimination: Example Two arms. A round: a draw of both arms. Round 1: no elimination. Round 2: no elimination. [Figure: estimates $\bar Y^1_2, \bar Y^2_2$ with half-width $2\sqrt{\log(T/2)/2}$; the intervals still overlap.]

18 Static case Framework Static Case Successive Elimination (SE) Successive Elimination: Example Two arms. A round: a draw of both arms. Round 1: no elimination. Round 2: no elimination. Round 3: elimination. [Figure: estimates $\bar Y^1_3, \bar Y^2_3$ with half-width $2\sqrt{\log(T/3)/3}$; the intervals separate and arm 2 is eliminated.]

19 Static case Framework Static Case Successive Elimination (SE) Sketch of proof with K = 2 Basic idea: arm 2 (subopt.) is eliminated at the first round $n$ such that: $\bar Y^2_n + 2\sqrt{\frac{\log(T/n)}{n}} \le \bar Y^1_n - 2\sqrt{\frac{\log(T/n)}{n}}$

20 Static case Framework Static Case Successive Elimination (SE) Sketch of proof with K = 2 Basic idea: arm 2 (subopt.) is eliminated at the first round $n$ such that (replacing the estimates by the true means): $f_2 + 2\sqrt{\frac{\log(T/n)}{n}} \approx f_1 - 2\sqrt{\frac{\log(T/n)}{n}}$

21 Static case Framework Static Case Successive Elimination (SE) Sketch of proof with K = 2 Basic idea: arm 2 (subopt.) is eliminated at the first round $n$ such that: $f_1 - f_2 \approx \sqrt{\frac{\log(T/n)}{n}}$

22 Static case Framework Static Case Successive Elimination (SE) Sketch of proof with K = 2 Basic idea: arm 2 (subopt.) is eliminated at the first round $n$ such that: $f_1 - f_2 \approx \sqrt{\frac{\log(T/n)}{n}}$, i.e., $n_2 \approx \frac{\log(T \Delta_2^2)}{\Delta_2^2}$
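
The value of $n_2$ above can be recovered (up to constants) by solving the elimination condition for $n$; a short check:
\[
\Delta_2 = f_1 - f_2 \approx \sqrt{\frac{\log(T/n)}{n}}
\quad\Longleftrightarrow\quad
n\,\Delta_2^2 \approx \log(T/n).
\]
Plugging in the candidate $n_2 = \log(T\Delta_2^2)/\Delta_2^2$ gives $n_2\Delta_2^2 = \log(T\Delta_2^2)$ on the left and $\log(T/n_2) = \log(T\Delta_2^2) - \log\log(T\Delta_2^2)$ on the right, which match up to lower-order terms.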

23 Static case Framework Static Case Successive Elimination (SE) Sketch of proof with K = 2 Basic idea: arm 2 (subopt.) is eliminated at the first round $n$ such that: $f_1 - f_2 = \Delta_2 \approx \sqrt{\frac{\log(T/n)}{n}}$, i.e., $n_2 \approx \frac{\log(T \Delta_2^2)}{\Delta_2^2}$ What could go wrong: Arm 1 eliminated before round $n_2$: $P\Big( \exists\, n \le n_2,\ \bar Y^1_n \le \bar Y^2_n - 4\sqrt{\tfrac{\log(T/n)}{n}} \Big) \le \frac{n_2}{T}$ Arm 2 not eliminated at round $n_2$.

24 Static case Framework Static Case Successive Elimination (SE) Sketch of proof with K = 2 Basic idea: arm 2 (subopt.) is eliminated at the first round $n$ such that: $f_1 - f_2 = \Delta_2 \approx \sqrt{\frac{\log(T/n)}{n}}$, i.e., $n_2 \approx \frac{\log(T \Delta_2^2)}{\Delta_2^2}$ What could go wrong: Arm 1 eliminated before round $n_2$ (with proba. $\lesssim \frac{n_2}{T}$): $P\Big( \exists\, n \le n_2,\ \bar Y^1_n \le \bar Y^2_n - 4\sqrt{\tfrac{\log(T/n)}{n}} \Big) \le \frac{n_2}{T}$ Arm 2 not eliminated at round $n_2$.

25 Static case Framework Static Case Successive Elimination (SE) Sketch of proof with K = 2 Basic idea: arm 2 (subopt.) is eliminated at the first round $n$ such that: $f_1 - f_2 = \Delta_2 \approx \sqrt{\frac{\log(T/n)}{n}}$, i.e., $n_2 \approx \frac{\log(T \Delta_2^2)}{\Delta_2^2}$ What could go wrong: Arm 1 eliminated before round $n_2$ (with proba. $\lesssim \frac{n_2}{T}$) Arm 2 not eliminated at round $n_2$: $P\Big( \forall\, n \le n_2,\ \bar Y^2_n \ge \bar Y^1_n - 4\sqrt{\tfrac{\log(T/n)}{n}} \Big)$

26 Static case Framework Static Case Successive Elimination (SE) Sketch of proof with K = 2 Basic idea: arm 2 (subopt.) is eliminated at the first round $n$ such that: $f_1 - f_2 = \Delta_2 \approx \sqrt{\frac{\log(T/n)}{n}}$, i.e., $n_2 \approx \frac{\log(T \Delta_2^2)}{\Delta_2^2}$ What could go wrong: Arm 1 eliminated before round $n_2$ (with proba. $\lesssim \frac{n_2}{T}$) Arm 2 not eliminated at round $n_2$: $P\Big( \bar Y^2_{n_2} \ge \bar Y^1_{n_2} - 4\sqrt{\tfrac{\log(T/n_2)}{n_2}} \Big)$

27 Static case Framework Static Case Successive Elimination (SE) Sketch of proof with K = 2 Basic idea: arm 2 (subopt.) is eliminated at the first round $n$ such that: $f_1 - f_2 = \Delta_2 \approx \sqrt{\frac{\log(T/n)}{n}}$, i.e., $n_2 \approx \frac{\log(T \Delta_2^2)}{\Delta_2^2}$ What could go wrong: Arm 1 eliminated before round $n_2$ (with proba. $\lesssim \frac{n_2}{T}$) Arm 2 not eliminated at round $n_2$: $P\Big( \bar Y^2_{n_2} \ge \bar Y^1_{n_2} - \tfrac{\Delta_2}{2} \Big)$ (for the right choice of constants in $n_2$, the confidence term at round $n_2$ is at most $\Delta_2/2$)

28 Static case Framework Static Case Successive Elimination (SE) Sketch of proof with K = 2 Basic idea: arm 2 (subopt.) is eliminated at the first round $n$ such that: $f_1 - f_2 = \Delta_2 \approx \sqrt{\frac{\log(T/n)}{n}}$, i.e., $n_2 \approx \frac{\log(T \Delta_2^2)}{\Delta_2^2}$ What could go wrong: Arm 1 eliminated before round $n_2$ (with proba. $\lesssim \frac{n_2}{T}$) Arm 2 not eliminated at round $n_2$ (with proba. $\lesssim \frac{n_2}{T}$): $P\Big( \big[\bar Y^1_{n_2} - \bar Y^2_{n_2}\big] - \Delta_2 \le -\tfrac{\Delta_2}{2} \Big) \lesssim \exp\big( -c\, n_2 \Delta_2^2 \big) \lesssim \frac{n_2}{T}$

29 Static case Framework Static Case Successive Elimination (SE) Sketch of proof with K = 2 Basic idea: arm 2 (subopt.) is eliminated at the first round $n$ such that: $f_1 - f_2 = \Delta_2 \approx \sqrt{\frac{\log(T/n)}{n}}$, i.e., $n_2 \approx \frac{\log(T \Delta_2^2)}{\Delta_2^2}$ What could go wrong: Arm 1 eliminated before round $n_2$ (with proba. $\lesssim \frac{n_2}{T}$) Arm 2 not eliminated at round $n_2$ (with proba. $\lesssim \frac{n_2}{T}$) Number of draws of arm 2 (each incurs a regret of $\Delta_2$): $T$ if something goes wrong (w.p. $\lesssim \frac{n_2}{T}$), $n_2$ otherwise (w.p. $\le 1$): $\mathbb{E}[R_T] \lesssim \Big[ n_2 + \frac{n_2}{T}\, T \Big] \Delta_2 \lesssim n_2 \Delta_2 \lesssim \frac{\log(T \Delta_2^2)}{\Delta_2}$

30 General Model Covariates: $X_t \in \mathcal{X} = [0, 1]^d$, i.i.d., law $\mu$ (equivalent to) $\lambda$ Examples: request received by Amazon or Google. $X_t$ is observed before taking a decision at time $t \in \mathbb{N}$. Equivalence: two unknown constants, $\underline{c}\,\lambda(A) \le \mu(A) \le \bar{c}\,\lambda(A)$ Decisions: $k_t \in \mathcal{K} = \{1, \ldots, K\}$; construction of a policy $\pi$ Payoff: $Y^k_t \in [0, 1] \sim \nu^k(X_t)$, $\mathbb{E}[Y^k \mid X] = f^k(X)$ Objective: regret $R_T := \sum_{t=1}^T f^{\pi^*(X_t)}(X_t) - f^{k_t}(X_t) = o(T)$

31 General Model Covariates: $X_t \in \mathcal{X} = [0, 1]^d$, i.i.d., law $\mu$ (equivalent to) $\lambda$ Decisions: $k_t \in \mathcal{K} = \{1, \ldots, K\}$; construction of a policy $\pi$ Examples: choice of the ad to be displayed. Decision $k_t$ is taken after the observation of $X_t$ at time $t \in \mathbb{N}$. Objective: find the best decision given the request Payoff: $Y^k_t \in [0, 1] \sim \nu^k(X_t)$, $\mathbb{E}[Y^k \mid X] = f^k(X)$ Objective: regret $R_T := \sum_{t=1}^T f^{\pi^*(X_t)}(X_t) - f^{k_t}(X_t) = o(T)$

32 General Model Covariates: $X_t \in \mathcal{X} = [0, 1]^d$, i.i.d., law $\mu$ (equivalent to) $\lambda$ Decisions: $k_t \in \mathcal{K} = \{1, \ldots, K\}$; construction of a policy $\pi$ Payoff: $Y^k_t \in [0, 1] \sim \nu^k(X_t)$, $\mathbb{E}[Y^k \mid X] = f^k(X)$ Examples: proba./reward of a click on ad $k$, as a function of the request. Only $Y^{k_t}_t$ is observed before moving to stage $t + 1$. Optimization: find the decision $k_t$ that maximizes $f^k(X_t)$ Objective: regret $R_T := \sum_{t=1}^T f^{\pi^*(X_t)}(X_t) - f^{k_t}(X_t) = o(T)$

33 General Model Covariates: $X_t \in \mathcal{X} = [0, 1]^d$, i.i.d., law $\mu$ (equivalent to) $\lambda$ Decisions: $k_t \in \mathcal{K} = \{1, \ldots, K\}$; construction of a policy $\pi$ Payoff: $Y^k_t \in [0, 1] \sim \nu^k(X_t)$, $\mathbb{E}[Y^k \mid X] = f^k(X)$ Objective: regret $R_T := \sum_{t=1}^T f^{\pi^*(X_t)}(X_t) - f^{k_t}(X_t) = o(T)$ Optimal policy: $\pi^*(X) = \arg\max_k f^k(X)$, and $f^{\pi^*(X)}(X) = f^*(X)$ Maximize cumulated payoffs $\sum_{t=1}^T f^{k_t}(X_t)$ or minimize regret. Find a policy $\pi$ that asymptotically does at least as well as $\pi^*$ (on average).
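
To make this protocol concrete, here is a minimal simulation sketch of it (my own illustration, not part of the talk): the covariate is drawn and shown first, the policy picks an arm, and only the payoff of that arm is observed. The `CovariateBandit` class and the `choose`/`update` policy interface are assumptions made for the example; rewards are taken Bernoulli with mean $f^k(X_t)$.

```python
import numpy as np

class CovariateBandit:
    """Bandit with covariates: X_t ~ Uniform([0,1]^d) is revealed, the policy
    chooses an arm k_t, and only the Bernoulli payoff with mean f^{k_t}(X_t) is seen."""
    def __init__(self, mean_fns, d=1, rng=None):
        self.f = mean_fns                  # functions f^k : [0,1]^d -> [0,1]
        self.d = d
        self.rng = rng or np.random.default_rng()

    def play(self, policy, T):
        regret = 0.0
        for t in range(T):
            x = self.rng.random(self.d)                  # covariate X_t, observed first
            means = np.array([f(x) for f in self.f])
            k = policy.choose(t, x)                      # decision k_t based on X_t only
            y = float(self.rng.random() < means[k])      # observed payoff Y_t^{k_t}
            policy.update(t, x, k, y)
            regret += means.max() - means[k]             # f*(X_t) - f^{k_t}(X_t)
        return regret
```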

34 Regularity assumptions 1 Smoothness of the pb: every $f^k$ is $\beta$-Hölder, with $\beta \in (0, 1]$: $\exists L > 0$, $\forall x, y \in \mathcal{X}$, $|f^k(x) - f^k(y)| \le L \|x - y\|^\beta$ 2 Complexity of the pb: ($\alpha$-margin condition) $\exists \delta_0 > 0$ and $C_0 > 0$: $P_X\big[ 0 < f^1(x) - f^2(x) < \delta \big] \le C_0\, \delta^\alpha, \quad \forall\, \delta \in (0, \delta_0)$

35 Regularity assumptions 1 Smoothness of the pb: every $f^k$ is $\beta$-Hölder, with $\beta \in (0, 1]$: $\exists L > 0$, $\forall x, y \in \mathcal{X}$, $|f^k(x) - f^k(y)| \le L \|x - y\|^\beta$ 2 Complexity of the pb: ($\alpha$-margin condition) $\exists \delta_0 > 0$ and $C_0 > 0$: $P_X\big[ 0 < f^*(x) - f^\circ(x) < \delta \big] \le C_0\, \delta^\alpha, \quad \forall\, \delta \in (0, \delta_0)$ where $f^*(x) = \max_k f^k(x)$ is the maximal $f^k$ and $f^\circ(x) = \max\{ f^k(x) \text{ s.t. } f^k(x) < f^*(x) \}$ is the second max. With $K > 2$: $f^*$ is $\beta$-Hölder but $f^\circ$ is not continuous.

36-41 Regularity: an easy example (α big) [Figures, built up over six slides: the three regression functions $f^1(x)$, $f^2(x)$, $f^3(x)$ on $[0, 1]$, then the pointwise maximum $f^*(x)$ and the second maximum $f^\circ(x)$.]

42-47 Regularity: a hard example (α small) [Figures, built up over six slides: three regression functions $f^1(x)$, $f^2(x)$, $f^3(x)$, then the pointwise maximum $f^*(x)$ and the second maximum $f^\circ(x)$.]

48 Conflict between α and β $\exists\, \delta_0, C_0$: $P_X\big[ 0 < f^*(x) - f^\circ(x) < \delta \big] \le C_0\, \delta^\alpha$, $\forall\, \delta \in (0, \delta_0)$ First used by Goldenshluger and Zeevi ('08) in the case $f^1 \equiv 0$; there it was an assumption on the distribution of $X$ only. Here: fixed marginal (uniform), it measures the closeness of the functions. Proposition (conflict α vs. β): $\alpha \beta > d$ $\Rightarrow$ every arm is either always or never optimal. Smoothness β is known, but complexity α is not known.

49-51 Binned policy [Figures, over three slides: the functions $f^1(x)$, $f^2(x)$, $f^3(x)$ (with $f^*$ and $f^\circ$), then the same functions over a uniform partition of $[0, 1]$ into bins.]

52 Binned policy Consider the uniform partition of $[0, 1]^d$ into $1/M^d$ bins. Bins: hypercubes $B$ with side length $|B|$ equal to $M$. Each bin is an independent problem; the exact value of $X_t$ is discarded. Average reward of bin $B$: $\bar f^k_B = \frac{\int_B f^k(x)\, dP(x)}{P(B)}$ (with $P(B) \approx M^d$). Follow on each bin your favorite static policy. Reduction to $1/M^d$ static bandit problems with expected rewards $(\bar f^1_B, \ldots, \bar f^K_B)$. See Rigollet and Zeevi ('10)
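
A minimal sketch of this reduction (again my own illustration): the covariate is mapped to its bin and an independent elimination policy, in the spirit of (SE), is run inside each bin. The `BinnedPolicy` class is hypothetical and matches the `choose`/`update` interface assumed in the `CovariateBandit` sketch above; the confidence term mimics the one of (SE) with the effective horizon $T M^d$ of a single bin.

```python
import numpy as np

class BinnedPolicy:
    """Reduction from the slide above: partition [0,1]^d into bins of side M and
    run an independent SE-style elimination policy inside each bin."""
    def __init__(self, K, T, M, d=1):
        self.K, self.T, self.M, self.d = K, T, M, d
        self.nbins = int(round(1.0 / M))       # bins per dimension
        self.bins = {}                         # bin index -> elimination state

    def _bin(self, x):
        idx = np.minimum((x // self.M).astype(int), self.nbins - 1)
        return tuple(idx)

    def choose(self, t, x):
        b = self._bin(x)
        if b not in self.bins:
            self.bins[b] = {"active": list(range(self.K)),
                            "sums": np.zeros(self.K), "n": 1, "cursor": 0}
        s = self.bins[b]
        return s["active"][s["cursor"]]        # next surviving arm of this bin's round

    def update(self, t, x, k, y):
        s = self.bins[self._bin(x)]
        s["sums"][k] += y
        s["cursor"] += 1
        if s["cursor"] < len(s["active"]):
            return
        # a full round of this bin is finished: eliminate clearly suboptimal arms
        means = s["sums"][s["active"]] / s["n"]
        horizon = self.T * self.M ** self.d    # expected number of visits of one bin
        radius = 2.0 * np.sqrt(np.log(max(horizon / s["n"], np.e)) / s["n"])
        s["active"] = [a for a, m in zip(s["active"], means)
                       if m + radius > means.max() - radius]
        s["n"] += 1
        s["cursor"] = 0
```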

53 Binned Successive Elimination (BSE) [Figure: $f^1(x)$, $f^2(x)$, $f^3(x)$ drawn over the uniform partition of $[0, 1]$ into bins.]

54 Binned Successive Elimination (BSE) Theorem [P. and Rigollet ('11)] If $0 < \alpha < 1$, $\mathbb{E}[R_T(\text{BSE})] \lesssim T \Big( \frac{K \log K}{T} \Big)^{\frac{\beta(1+\alpha)}{2\beta + d}}$ with the choice of parameter $M \approx \Big( \frac{K \log K}{T} \Big)^{\frac{1}{2\beta + d}}$ For K = 2, this matches the lower bound: minimax optimal w.r.t. T. The same bound can be obtained in the full info. setting (Audibert and Tsybakov, '07). No $\log(T)$: the difficulty of nonparametric estimation washes away the effects of exploration/exploitation. α < 1: cannot attain fast rates.
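
For illustration, a hypothetical end-to-end run of the sketches above, with the bin width chosen as in the theorem; the two regression functions and all names are assumptions made for this example.

```python
import numpy as np

K, d, T, beta = 2, 1, 20000, 1.0
M = (K * np.log(K) / T) ** (1.0 / (2 * beta + d))      # bin side length from the theorem

f1 = lambda x: 0.5 + 0.3 * np.sin(2 * np.pi * x[0])    # beta-Holder (here Lipschitz) mean
f2 = lambda x: 0.5                                      # constant competitor

bandit = CovariateBandit([f1, f2], d=d)
policy = BinnedPolicy(K=K, T=T, M=M, d=d)
print("cumulative regret:", bandit.play(policy, T))
```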

55 Sketch for K = 2 Decomposition of regret: $\mathbb{E}[R_T(\text{BSE})] = R_H + R_E$ Hard bins ($\Delta_B < M^\beta$): $R_H \lesssim M^\beta \cdot P(\text{Hard}) \cdot T \lesssim M^\beta \cdot P\big( 0 < f^* - f^\circ \lesssim M^\beta \big) \cdot T \lesssim T M^{\beta(1+\alpha)}$ Easy bins ($\Delta_B \ge M^\beta$): with $\Delta_B = \sup_{x \in B} \big( f^*(x) - f^\circ(x) \big) \approx \frac{\int_B (f^* - f^\circ)\, dP}{P(B)}$

56 Sketch for K = 2 Decomposition of regret: $\mathbb{E}[R_T(\text{BSE})] = R_H + R_E$ Hard bins ($\Delta_B < M^\beta$): $R_H \lesssim T M^{\beta(1+\alpha)} \lesssim T \Big( \frac{K \log K}{T} \Big)^{\frac{\beta(1+\alpha)}{2\beta+d}}$ since $R_H \lesssim M^\beta \cdot P(\text{Hard}) \cdot T \lesssim M^\beta \cdot P\big( 0 < f^* - f^\circ \lesssim M^\beta \big) \cdot T \lesssim T M^{\beta(1+\alpha)}$ Easy bins ($\Delta_B \ge M^\beta$): with $\Delta_B = \sup_{x \in B} \big( f^*(x) - f^\circ(x) \big) \approx \frac{\int_B (f^* - f^\circ)\, dP}{P(B)}$

57 Sketch for K = 2 Decomposition of regret: $\mathbb{E}[R_T(\text{BSE})] = R_H + R_E$ Hard bins ($\Delta_B < M^\beta$): $R_H \lesssim T M^{\beta(1+\alpha)} \lesssim T \Big( \frac{K \log K}{T} \Big)^{\frac{\beta(1+\alpha)}{2\beta+d}}$ Easy bins ($\Delta_B \ge M^\beta$): $R_E \lesssim \sum_{\text{easy } B} \frac{\log\big( (T M^d)\, \Delta_B^2 \big)}{\Delta_B}$ with $\Delta_B = \sup_{x \in B} \big( f^*(x) - f^\circ(x) \big) \approx \frac{\int_B (f^* - f^\circ)\, dP}{P(B)}$

58 Sketch for K = 2 Decomposition of regret: $\mathbb{E}[R_T(\text{BSE})] = R_H + R_E$ Hard bins ($\Delta_B < M^\beta$): $R_H \lesssim T M^{\beta(1+\alpha)} \lesssim T \Big( \frac{K \log K}{T} \Big)^{\frac{\beta(1+\alpha)}{2\beta+d}}$ Easy bins ($\Delta_B \ge M^\beta$): $R_E \lesssim \sum_{\text{easy } B} \frac{\log\big( (T M^d)\, \Delta_B^2 \big)}{\Delta_B}$ Order the $\Delta_B$ as $\Delta_{(1)} \le \ldots \le \Delta_{(M^{-d})}$; then $\forall\, l \in \{1, \ldots, M^{-d}\}$, $l M^d \le P\big( 0 < f^* - f^\circ \lesssim \Delta_{(l)} \big) \lesssim \Delta_{(l)}^\alpha$, so $\Delta_{(l)} \gtrsim (l M^d)^{1/\alpha}$

59 Sketch for K = 2 Decomposition of regret: $\mathbb{E}[R_T(\text{BSE})] = R_H + R_E$ Hard bins ($\Delta_B < M^\beta$): $R_H \lesssim T M^{\beta(1+\alpha)} \lesssim T \Big( \frac{K \log K}{T} \Big)^{\frac{\beta(1+\alpha)}{2\beta+d}}$ Easy bins ($\Delta_B \ge M^\beta$): $R_E \lesssim \sum_{l = M^{\alpha\beta - d}}^{M^{-d}} \frac{\log\big( (T M^d)(l M^d)^{2/\alpha} \big)}{(l M^d)^{1/\alpha}}$ Order the $\Delta_B$ as $\Delta_{(1)} \le \ldots \le \Delta_{(M^{-d})}$; then $\forall\, l \in \{1, \ldots, M^{-d}\}$, $l M^d \le P\big( 0 < f^* - f^\circ \lesssim \Delta_{(l)} \big) \lesssim \Delta_{(l)}^\alpha$, so $\Delta_{(l)} \gtrsim (l M^d)^{1/\alpha}$

60 Sketch for K = 2 Decomposition of regret: $\mathbb{E}[R_T(\text{BSE})] = R_H + R_E$ Hard bins ($\Delta_B < M^\beta$): $R_H \lesssim T M^{\beta(1+\alpha)} \lesssim T \Big( \frac{K \log K}{T} \Big)^{\frac{\beta(1+\alpha)}{2\beta+d}}$ Easy bins ($\Delta_B \ge M^\beta$): $R_E \lesssim T M^{\beta(1+\alpha)} \lesssim T \Big( \frac{K \log K}{T} \Big)^{\frac{\beta(1+\alpha)}{2\beta+d}}$ because (for α < 1): $\sum_{l = M^{\alpha\beta - d}}^{M^{-d}} \frac{\log\big( (T M^d)(l M^d)^{2/\alpha} \big)}{(l M^d)^{1/\alpha}} \lesssim \log\big( T M^{2\beta+d} \big)\, M^{-(d + \beta(1-\alpha))} \lesssim T M^{\beta(1+\alpha)}$

61 Sketch for K = 2 Decomposition of regret: $\mathbb{E}[R_T(\text{BSE})] = R_H + R_E$ Hard bins ($\Delta_B < M^\beta$): $R_H \lesssim T M^{\beta(1+\alpha)} \lesssim T \Big( \frac{K \log K}{T} \Big)^{\frac{\beta(1+\alpha)}{2\beta+d}}$ Easy bins ($\Delta_B \ge M^\beta$): $R_E \lesssim T M^{\beta(1+\alpha)} \lesssim T \Big( \frac{K \log K}{T} \Big)^{\frac{\beta(1+\alpha)}{2\beta+d}}$ For α ≥ 1, additional terms appear: $\mathbb{E}[R_T]$ gets multiplied by $\log(T)$. We always pay the number of bins (which must be large enough for non-smooth functions). The problem is: too many bins. Solution: online/adaptive construction of the bins.

62-63 Suboptimality of (BSE) for α ≥ 1 [Figures, over two slides: $f^1(x)$, $f^2(x)$, $f^3(x)$ over the uniform partition into bins.]

64 Adaptive BSE (ABSE) Basic idea: given a bin of size $|B|$ (for K = 2): if $|\bar f^1_B - \bar f^2_B| \gtrsim |B|^\beta$ then $f^1 - f^2$ has constant sign on B. Adaptively Binned Successive Elimination: Start with $B = [0, 1]^d$ and $|B_0| \approx \Big( \frac{K \log K}{T} \Big)^{\frac{1}{2\beta+d}}$. Draw samples (in rounds) of the arms when covariates fall in $B$. If $\bar Y^k_n \le \bar Y^{k'}_n - \sqrt{\frac{\log(T |B|^d / n)}{n}} - |B|^\beta$ for some arm $k'$, then eliminate arm $k$. Stop after $n_B$ rounds and split $B$ in two halves (of size $|B|/2$), where $n_B$ satisfies $\sqrt{\frac{\log(T |B|^d / n_B)}{n_B}} = |B|^\beta$, i.e., $n_B \approx \frac{\log(T |B|^{2\beta+d})}{|B|^{2\beta}}$. Repeat the procedure on the two halves (until $|B| \le |B_0|$).
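
Below is a simplified, hypothetical sketch of the procedure just described, for d = 1 and Bernoulli rewards; it only shows the bookkeeping (per-bin rounds, elimination with the $\sqrt{\log(T|B|^d/n)/n} + |B|^\beta$ slack, splitting after $n_B$ rounds), not the exact analyzed policy, and its constants are illustrative. The `ABSE` class name and the `choose`/`update` interface (compatible with the `CovariateBandit` sketch above) are assumptions.

```python
import numpy as np

class ABSE:
    """Sketch of Adaptively Binned Successive Elimination for d = 1."""
    def __init__(self, K, T, beta, B0):
        self.K, self.T, self.beta, self.B0 = K, T, beta, B0
        # live bins: (left endpoint, width) -> per-bin elimination state
        self.live = {(0.0, 1.0): self._new_state(range(K))}

    def _new_state(self, arms):
        return {"active": list(arms), "sums": np.zeros(self.K), "n": 1, "cursor": 0}

    def _n_max(self, width):
        # n_B such that sqrt(log(T * width / n_B) / n_B) ~ width^beta   (d = 1)
        return max(1, int(np.log(max(self.T * width, np.e)) / width ** (2 * self.beta)))

    def _find(self, x):
        for (left, width) in self.live:
            if left <= x < left + width or (x >= 1.0 and left + width >= 1.0):
                return (left, width)

    def choose(self, t, x):
        s = self.live[self._find(float(np.ravel(x)[0]))]
        return s["active"][s["cursor"]]

    def update(self, t, x, k, y):
        b = self._find(float(np.ravel(x)[0]))
        left, width = b
        s = self.live[b]
        s["sums"][k] += y
        s["cursor"] += 1
        if s["cursor"] < len(s["active"]):
            return                                   # current round of this bin not finished
        s["cursor"] = 0
        means = s["sums"][s["active"]] / s["n"]
        slack = np.sqrt(np.log(max(self.T * width / s["n"], np.e)) / s["n"]) + width ** self.beta
        s["active"] = [a for a, m in zip(s["active"], means) if m > means.max() - slack]
        if s["n"] >= self._n_max(width) and width / 2 >= self.B0:
            del self.live[b]                         # split into two halves inheriting survivors
            for child_left in (left, left + width / 2):
                self.live[(child_left, width / 2)] = self._new_state(s["active"])
        else:
            s["n"] += 1
```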

65 Regret of (ABSE) Theorem [P. and Rigollet ('11)] Fix α > 0 and 0 < β ≤ 1; then (ABSE) has a regret bounded as $\mathbb{E}[R_T(\text{ABSE})] \lesssim T \Big( \frac{K \log K}{T} \Big)^{\frac{\beta(1+\alpha)}{2\beta+d}}$ Minimax optimal (Rigollet and Zeevi, 2010; see also Audibert and Tsybakov, 2007). Slivkins (2011, COLT): zooming (abstract setup, complicated algorithm); no real purpose nor measure of adaptivity for the policy.

66 (ABSE) illustrated [Figure: a binary tree of bins over [0, 1] with split points 1/4 and 1/2, for two arms. On [0, 1]: keep arms 1 and 2; on [0, 1/2]: keep 1 and 2; on [1/2, 1]: eliminate arm 2; on the quarter-size bins: eliminate arm 2 where $f^1 > f^2$ and eliminate arm 1 where $f^2 > f^1$.]

67 (ABSE) illustrated [Figure: the same tree annotated with what could go wrong at each node: wrongly eliminating arm 1 or arm 2 at a bin where both should be kept, or failing to eliminate the suboptimal arm at a bin where it should be eliminated.]

68 (ABSE) Sketch of proof If everything goes right: when a bin B is reached, one has $\Delta_B \lesssim |B|^\beta$ (so the regret on B is $\lesssim n_B |B|^\beta$). What could go wrong: Terminal node: eliminate arm 1, or fail to eliminate arm 2: same analysis as for (SE). Happens with proba. less than $\frac{n_B}{T |B|^d}$; number of times the covariates fall in B is less than $T |B|^d$; regret each time is less than $\Delta_B \lesssim |B|^\beta$. Non-terminal node:

69 (ABSE) Sketch of proof If everything goes right: when a bin B is reached, one has $\Delta_B \lesssim |B|^\beta$ (so the regret on B is $\lesssim n_B |B|^\beta$). What could go wrong: Terminal node: $R_B \lesssim n_B |B|^\beta \lesssim \frac{\log(T |B|^{2\beta+d})}{|B|^\beta}$ Eliminate arm 1, or fail to eliminate arm 2: same analysis as for (SE). Happens with proba. less than $\frac{n_B}{T |B|^d}$; number of times the covariates fall in B is less than $T |B|^d$; regret each time is less than $\Delta_B \lesssim |B|^\beta$. Non-terminal node:

70 (ABSE) Sketch of proof If everything goes right: when a bin B is reached, one has $\Delta_B \lesssim |B|^\beta$ (so the regret on B is $\lesssim n_B |B|^\beta$). What could go wrong: Terminal node: $R_B \lesssim n_B |B|^\beta \lesssim \frac{\log(T |B|^{2\beta+d})}{|B|^\beta}$ Non-terminal node: eliminate arm 1 or eliminate arm 2 (where $\bar f^2_B \le \bar f^1_B \le \bar f^2_B + |B|^\beta$). For arm 1, same analysis. For arm 2: $\exists\, n \le n_B$, $\bar Y^1_n \ge \bar Y^2_n + \sqrt{\frac{\log(T |B|^d / n)}{n}} + |B|^\beta$

71 (ABSE) Sketch of proof If everything goes right: when a bin B is reached, one has $\Delta_B \lesssim |B|^\beta$ (so the regret on B is $\lesssim n_B |B|^\beta$). What could go wrong: Terminal node: $R_B \lesssim n_B |B|^\beta \lesssim \frac{\log(T |B|^{2\beta+d})}{|B|^\beta}$ Non-terminal node: eliminate arm 1 or eliminate arm 2 (where $\bar f^2_B \le \bar f^1_B \le \bar f^2_B + |B|^\beta$). For arm 1, same analysis. For arm 2: $\exists\, n \le n_B$, $\big( \bar Y^1_n - \bar Y^2_n \big) - \big( \bar f^1_B - \bar f^2_B \big) \ge \sqrt{\frac{\log(T |B|^d / n)}{n}}$ (using $\bar f^1_B - \bar f^2_B \le |B|^\beta$)

72 (ABSE) Sketch of proof If everything goes right: when a bin B is reached, one has $\Delta_B \lesssim |B|^\beta$ (so the regret on B is $\lesssim n_B |B|^\beta$). What could go wrong: Terminal node: $R_B \lesssim n_B |B|^\beta \lesssim \frac{\log(T |B|^{2\beta+d})}{|B|^\beta}$ Non-terminal node: eliminate arm 1 or eliminate arm 2 (where $\bar f^2_B \le \bar f^1_B \le \bar f^2_B + |B|^\beta$). For arm 1, same analysis. For arm 2: $P\Big( \exists\, n \le n_B,\ \big( \bar Y^1_n - \bar Y^2_n \big) - \big( \bar f^1_B - \bar f^2_B \big) \ge \sqrt{\tfrac{\log(T |B|^d / n)}{n}} \Big) \lesssim \frac{n_B}{T |B|^d}$

73 (ABSE) Sketch of proof If everything goes right: when a bin B is reached, one has $\Delta_B \lesssim |B|^\beta$ (so the regret on B is $\lesssim n_B |B|^\beta$). What could go wrong: Terminal node: $R_B \lesssim n_B |B|^\beta \lesssim \frac{\log(T |B|^{2\beta+d})}{|B|^\beta}$ Non-terminal node: $R_B \lesssim n_B |B|^\beta \lesssim \frac{\log(T |B|^{2\beta+d})}{|B|^\beta}$ Eliminate arm 1 or eliminate arm 2 (where $\bar f^2_B \le \bar f^1_B \le \bar f^2_B + |B|^\beta$). For arm 1, same analysis. For arm 2: $P\Big( \exists\, n \le n_B,\ \big( \bar Y^1_n - \bar Y^2_n \big) - \big( \bar f^1_B - \bar f^2_B \big) \ge \sqrt{\tfrac{\log(T |B|^d / n)}{n}} \Big) \lesssim \frac{n_B}{T |B|^d}$

74 (ABSE) Sketch of proof If everything goes right: when a bin B is reached, one has $\Delta_B \lesssim |B|^\beta$ (so the regret on B is $\lesssim n_B |B|^\beta$). What could go wrong: Terminal node: $R_B \lesssim n_B |B|^\beta \lesssim \frac{\log(T |B|^{2\beta+d})}{|B|^\beta}$ Non-terminal node: $R_B \lesssim n_B |B|^\beta \lesssim \frac{\log(T |B|^{2\beta+d})}{|B|^\beta}$ $N_l$ = number of bins of size $|B| = 2^{-l}$ (and $2^{-l_0} = |B_0|$): $N_l \cdot 2^{-l d} \le P\big( 0 < f^* - f^\circ \lesssim 2^{-l \beta} \big) \lesssim 2^{-l \alpha \beta}$, and $\mathbb{E}[R_T] \lesssim \sum_B n_B |B|^\beta \lesssim \sum_{l=0}^{l_0} 2^{l(d - \alpha\beta)}\, \frac{\log\big( T\, 2^{-l(2\beta+d)} \big)}{2^{-l \beta}}$

75 (ABSE) Sketch of proof If everything goes right: when a bin B is reached, one has $\Delta_B \lesssim |B|^\beta$ (so the regret on B is $\lesssim n_B |B|^\beta$). What could go wrong: Terminal node: $R_B \lesssim n_B |B|^\beta \lesssim \frac{\log(T |B|^{2\beta+d})}{|B|^\beta}$ Non-terminal node: $R_B \lesssim n_B |B|^\beta \lesssim \frac{\log(T |B|^{2\beta+d})}{|B|^\beta}$ $N_l$ = number of bins of size $|B| = 2^{-l}$ (and $2^{-l_0} = |B_0|$): $N_l \cdot 2^{-l d} \le P\big( 0 < f^* - f^\circ \lesssim 2^{-l \beta} \big) \lesssim 2^{-l \alpha \beta}$, and $\mathbb{E}[R_T] \lesssim T \Big( \frac{K \log K}{T} \Big)^{\frac{\beta(1+\alpha)}{2\beta+d}}$

76 Conclusion and Remarks Conclusion We introduced and analyzed new policies: Successive Elimination: an intuitive policy with great potential for the static case; Binned SE: its generalization for hard dynamic problems; Adaptively Binned SE: again generalized, for both easy and hard problems. They are all minimax optimal in T; conjecture: also in K, up to the $\log(K)$ term. They require the knowledge of T (OK) and of β (more arguable). The analysis is more intricate when K > 2: the optimal arm can be eliminated more easily, and $f^\circ$ is not continuous. Future work: a policy adaptive w.r.t. β.
