Multi-armed bandit models: a tutorial
1 Multi-armed bandit models: a tutorial. CERMICS seminar, March 30th, 2016
2 Multi-Armed Bandit model: general setting

$K$ arms: for $a \in \{1,\dots,K\}$, $(X_{a,t})_{t \in \mathbb{N}}$ is a stochastic process (unknown distributions).

Bandit game: at each round $t$, an agent
- chooses an arm $A_t \in \{1,\dots,K\}$
- receives a reward $X_t = X_{A_t,t}$

Goal: build a sequential strategy $A_t = F_t(A_1, X_1, \dots, A_{t-1}, X_{t-1})$ maximizing
$$\mathbb{E}\left[\sum_{t=1}^{\infty} \alpha_t X_t\right],$$
where $(\alpha_t)_{t \in \mathbb{N}}$ is a discount sequence. [Berry and Fristedt, 1985]
3 Multi-Armed Bandit model: the i.i.d. case

$K$ independent arms: for $a \in \{1,\dots,K\}$, $(X_{a,t})_{t \in \mathbb{N}}$ is i.i.d. $\sim \nu_a$ (unknown distributions).

Bandit game: at each round $t$, an agent
- chooses an arm $A_t \in \{1,\dots,K\}$
- receives a reward $X_t \sim \nu_{A_t}$

Goal: build a sequential strategy $A_t = F_t(A_1, X_1, \dots, A_{t-1}, X_{t-1})$ maximizing
- Finite-horizon MAB: $\mathbb{E}\left[\sum_{t=1}^{T} X_t\right]$
- Discounted MAB: $\mathbb{E}\left[\sum_{t=1}^{\infty} \alpha^{t-1} X_t\right]$
4 Why MABs? $\nu_1, \nu_2, \nu_3, \nu_4, \nu_5$. Goal: maximize one's gains in a casino? (HOPELESS)
5 Why MABs? Real motivations

Clinical trials: $B(\mu_1), B(\mu_2), B(\mu_3), B(\mu_4), B(\mu_5)$
- choose a treatment $A_t$ for patient $t$
- observe a response $X_t \in \{0,1\}$: $\mathbb{P}(X_t = 1) = \mu_{A_t}$
Goal: maximize the number of patients healed.

Recommendation tasks: $\nu_1, \nu_2, \nu_3, \nu_4, \nu_5$
- recommend a movie $A_t$ to visitor $t$
- observe a rating $X_t \sim \nu_{A_t}$ (e.g. $X_t \in \{1,\dots,5\}$)
Goal: maximize the sum of ratings.
6 Bernoulli bandit models

$K$ independent arms: for $a \in \{1,\dots,K\}$, $(X_{a,t})_{t \in \mathbb{N}}$ is i.i.d. $\sim B(\mu_a)$.

Bandit game: at each round $t$, an agent
- chooses an arm $A_t \in \{1,\dots,K\}$
- receives a reward $X_t \sim B(\mu_{A_t})$, $X_t \in \{0,1\}$

Goal: maximize
- Finite-horizon MAB: $\mathbb{E}\left[\sum_{t=1}^{T} X_t\right]$
- Discounted MAB: $\mathbb{E}\left[\sum_{t=1}^{\infty} \alpha^{t-1} X_t\right]$

Two points of view:
- Frequentist model: $\mu_1,\dots,\mu_K$ are unknown parameters; for each arm $a$, $(X_{a,t})_t$ is i.i.d. $\sim B(\mu_a)$.
- Bayesian model: $\mu_1,\dots,\mu_K$ are drawn from a prior distribution, $\mu_a \sim \pi_a$; for each arm $a$, $(X_{a,t})_t \mid \mu$ is i.i.d. $\sim B(\mu_a)$.
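To make the setting concrete, here is a minimal Python sketch of a Bernoulli bandit environment (not from the original slides; the class and method names are illustrative):

```python
import numpy as np

class BernoulliBandit:
    """A K-armed Bernoulli bandit with fixed, hidden means."""
    def __init__(self, means, seed=None):
        self.means = np.asarray(means, dtype=float)  # mu_1, ..., mu_K, unknown to the agent
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        """Draw a reward X_t ~ B(mu_a): 1 with probability mu_a, else 0."""
        return int(self.rng.random() < self.means[a])

# Example: bandit = BernoulliBandit([0.2, 0.5, 0.45]); x = bandit.pull(1)
```

The agent interacts only through pull, so the means stay hidden, matching the frequentist model above.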
8 Outline
1. Bayesian bandits: a planning problem
2. Frequentist bandits: asymptotically optimal algorithms
3. Non-stochastic bandits: minimax algorithms
9 Part 1: Bayesian bandits, a planning problem
10 A Markov Decision Process

Bandit model $(B(\mu_1),\dots,B(\mu_K))$:
- i.i.d. prior distribution: $\mu_a \sim \mathcal{U}([0,1])$
- posterior distribution: $\pi_a^t := \mathcal{L}(\mu_a \mid X_1,\dots,X_t)$, namely
$$\pi_a^t = \mathrm{Beta}\big(\underbrace{S_a(t)}_{\#\text{ones}} + 1,\ \underbrace{N_a(t) - S_a(t)}_{\#\text{zeros}} + 1\big),$$
where $S_a(t)$ is the sum of the rewards gathered from $a$ up to time $t$ and $N_a(t)$ is the number of draws of arm $a$ up to time $t$.

The state $\Pi^t = (\pi_a^t)_{a=1}^{K}$ evolves in an MDP.
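A sketch of the conjugate update behind this MDP view, assuming rewards in {0, 1} and a uniform (= Beta(1, 1)) prior; the variable names are illustrative:

```python
import numpy as np

K = 5                 # number of arms (example value)
S = np.zeros(K)       # S_a(t): sum of rewards gathered from arm a
N = np.zeros(K)       # N_a(t): number of draws of arm a

def update(a, x):
    """Record a reward x in {0, 1} obtained from arm a."""
    S[a] += x
    N[a] += 1

def posterior(a):
    """Parameters of pi_a^t = Beta(S_a(t) + 1, N_a(t) - S_a(t) + 1)."""
    return S[a] + 1, N[a] - S[a] + 1
```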
11 A Markov Decision Process

[Figure: an example of a transition of the posterior state when $A_t = 2$ and $X_t = 1$]

Solving a planning problem: there exists an exact solution to
- the finite-horizon MAB: $\arg\max_{(A_t)} \mathbb{E}_{\mu \sim \pi}\left[\sum_{t=1}^{T} X_t\right]$
- the discounted MAB: $\arg\max_{(A_t)} \mathbb{E}_{\mu \sim \pi}\left[\sum_{t=1}^{\infty} \alpha^{t-1} X_t\right]$

Optimal policy = solution to dynamic programming equations.

Problem: the state space is too large!
13 A reduction of the dimension

[Gittins 79]: the solution of the discounted MAB reduces to an index policy:
$$A_{t+1} = \operatorname*{argmax}_{a=1,\dots,K} G_\alpha(\pi_a^t).$$

The Gittins indices:
$$G_\alpha(p) = \sup_{\text{stopping times } \tau > 0} \frac{\mathbb{E}_{Y_t \overset{\text{i.i.d.}}{\sim} B(\mu),\, \mu \sim p}\left[\sum_{t=1}^{\tau} \alpha^{t-1} Y_t\right]}{\mathbb{E}_{Y_t \overset{\text{i.i.d.}}{\sim} B(\mu),\, \mu \sim p}\left[\sum_{t=1}^{\tau} \alpha^{t-1}\right]}$$
= instantaneous reward rate obtained when committing to an arm $\mu \sim p$, when rewards are discounted by $\alpha$.
14 The Gittins indices

An alternative formulation:
$$G_\alpha(p) = \inf\{\lambda \in \mathbb{R} : V_\alpha(p, \lambda) = 0\},$$
with
$$V_\alpha(p, \lambda) = \sup_{\text{stopping times } \tau > 0} \mathbb{E}_{Y_t \overset{\text{i.i.d.}}{\sim} B(\mu),\, \mu \sim p}\left[\sum_{t=1}^{\tau} \alpha^{t-1} (Y_t - \lambda)\right]$$
= the largest price $\lambda$ worth paying for committing to an arm $\mu \sim p$ when rewards are discounted by $\alpha$.
15 Gittins indices for finite horizon

The Finite-Horizon Gittins indices depend on the remaining time to play $r$:
$$G(p, r) = \inf\{\lambda \in \mathbb{R} : V_r(p, \lambda) = 0\},$$
with
$$V_r(p, \lambda) = \sup_{\text{stopping times } 0 < \tau \leq r} \mathbb{E}_{Y_t \overset{\text{i.i.d.}}{\sim} B(\mu),\, \mu \sim p}\left[\sum_{t=1}^{\tau} (Y_t - \lambda)\right]$$
= the largest price $\lambda$ worth paying for playing an arm $\mu \sim p$ for at most $r$ rounds.

The Finite-Horizon Gittins algorithm,
$$A_{t+1} = \operatorname*{argmax}_{a=1,\dots,K} G(\pi_a^t, T - t),$$
does NOT coincide with the optimal solution [Berry and Fristedt 85]... but is conjectured to be a good approximation!
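For a Beta posterior, $V_r(p, \lambda)$ can be computed by dynamic programming over the posterior states, and the index recovered by bisection on $\lambda$. The sketch below is my own construction from the definitions above (not code from the slides), suitable for small horizons:

```python
from functools import lru_cache

def fh_gittins(a, b, r, tol=1e-6):
    """Finite-horizon Gittins index G(Beta(a, b), r) via bisection on lambda.

    V_r(p, lambda) = sup_{0 < tau <= r} E[sum_{t<=tau} (Y_t - lambda)], computed
    by DP: play once, then either stop (value 0) or continue from the updated
    Beta posterior.
    """
    def value(lam):
        @lru_cache(maxsize=None)
        def W(a_, b_, r_):
            if r_ == 0:
                return 0.0
            p = a_ / (a_ + b_)                       # posterior mean of mu
            up = max(0.0, W(a_ + 1, b_, r_ - 1))     # after observing a 1
            down = max(0.0, W(a_, b_ + 1, r_ - 1))   # after observing a 0
            return p - lam + p * up + (1 - p) * down
        return W(a, b, r)

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if value(mid) > 0:    # price too low: playing is still profitable
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Since `value` is decreasing in $\lambda$, the bisection converges to the root of $V_r(p, \lambda) = 0$.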
16 Part 2: Frequentist bandits, asymptotically optimal algorithms
17 Regret minimization

$\mu = (\mu_1,\dots,\mu_K)$: unknown parameters, $\mu^* = \max_a \mu_a$.

The regret of a strategy $\mathcal{A} = (A_t)$ is defined as
$$R_\mu(\mathcal{A}, T) = \mathbb{E}_\mu\left[\mu^* T - \sum_{t=1}^{T} X_t\right]$$
and can be rewritten
$$R_\mu(\mathcal{A}, T) = \sum_{a=1}^{K} (\mu^* - \mu_a)\, \mathbb{E}_\mu[N_a(T)],$$
where $N_a(t)$ is the number of draws of arm $a$ up to time $t$.

Maximizing rewards $\Leftrightarrow$ minimizing regret.

Goal: design strategies that have small regret for all $\mu$.
18 Optimal algorithms for regret minimization

All the arms should be drawn infinitely often!

[Lai and Robbins, 1985]: a uniformly efficient strategy (i.e. such that $\forall \mu$, $\forall \alpha \in (0,1)$, $R_\mu(\mathcal{A}, T) = o(T^\alpha)$) satisfies
$$\mu_a < \mu^* \Rightarrow \liminf_{T \to \infty} \frac{\mathbb{E}_\mu[N_a(T)]}{\log T} \geq \frac{1}{d(\mu_a, \mu^*)},$$
where
$$d(\mu, \mu') = \mathrm{KL}(B(\mu), B(\mu')) = \mu \log\frac{\mu}{\mu'} + (1-\mu)\log\frac{1-\mu}{1-\mu'}.$$

Definition. A bandit algorithm is asymptotically optimal if, for every $\mu$,
$$\mu_a < \mu^* \Rightarrow \limsup_{T \to \infty} \frac{\mathbb{E}_\mu[N_a(T)]}{\log T} \leq \frac{1}{d(\mu_a, \mu^*)}.$$
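The Bernoulli divergence $d(\mu, \mu')$ appearing in this bound is straightforward to compute; here is a small helper reused in the later sketches (the clipping near 0 and 1 is an implementation choice):

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """d(p, q) = KL(B(p), B(q)) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
```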
19 First algorithms

- Idea 1: draw each arm $T/K$ times → EXPLORATION
- Idea 2: always choose the empirical best arm, $A_{t+1} = \operatorname{argmax}_a \hat\mu_a(t)$ → EXPLOITATION
- Idea 3: draw the arms uniformly during the first $T/2$ rounds, then draw the empirical best arm until the end → EXPLORATION followed by EXPLOITATION (a sketch of this strategy is given below)

All three ideas lead to linear regret...
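A sketch of Idea 3 for concreteness, reusing the hypothetical BernoulliBandit helper from earlier; committing to a wrong arm with constant probability is what produces the linear regret:

```python
import numpy as np

def explore_then_exploit(bandit, K, T):
    """'Idea 3': round-robin exploration for T/2 rounds, then commit to the
    empirical best arm for the remaining rounds. Illustrative sketch only."""
    S, N = np.zeros(K), np.zeros(K)
    total = 0
    for t in range(T // 2):                       # exploration phase
        a = t % K
        x = bandit.pull(a)
        S[a] += x; N[a] += 1; total += x
    best = int(np.argmax(S / np.maximum(N, 1)))   # empirical best arm
    for t in range(T // 2, T):                    # exploitation phase
        total += bandit.pull(best)
    return total
```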
23 Optimistic algorithms

For each arm $a$, build a confidence interval on $\mu_a$:
$$\mu_a \leq \mathrm{UCB}_a(t) \quad \text{w.h.p.}$$

[Figure: confidence intervals on the arms at round $t$]

Optimism principle: act as if the best possible model were the true model:
$$A_{t+1} = \operatorname*{argmax}_{a} \mathrm{UCB}_a(t).$$
24 A UCB algorithm in action! [animation not reproduced]
26 The UCB1 algorithm

UCB1 [Auer et al. 02] is based on the index
$$\mathrm{UCB}_a(t) = \hat\mu_a(t) + \sqrt{\frac{\alpha \log(t)}{2 N_a(t)}}.$$

Hoeffding's inequality:
$$\mathbb{P}\left(\hat\mu_{a,s} + \sqrt{\frac{\alpha \log(t)}{2s}} \leq \mu_a\right) \leq \exp\left(-2s \left(\sqrt{\frac{\alpha \log(t)}{2s}}\right)^2\right) = \frac{1}{t^\alpha}.$$

Union bound:
$$\mathbb{P}\left(\mathrm{UCB}_a(t) \leq \mu_a\right) \leq \mathbb{P}\left(\exists s \leq t : \hat\mu_{a,s} + \sqrt{\frac{\alpha \log(t)}{2s}} \leq \mu_a\right) \leq \sum_{s=1}^{t} \frac{1}{t^\alpha} = \frac{1}{t^{\alpha-1}}.$$
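A direct implementation sketch of the UCB1 index above, again with the hypothetical BernoulliBandit helper (the theorem below requires $\alpha > 2$, hence the default value):

```python
import numpy as np

def ucb1(bandit, K, T, alpha=2.1):
    """UCB1: pull the arm maximizing hat_mu_a(t) + sqrt(alpha log(t) / (2 N_a(t)))."""
    S, N = np.zeros(K), np.zeros(K)
    history = []
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                # initialization: pull each arm once
        else:
            index = S / N + np.sqrt(alpha * np.log(t) / (2 * N))
            a = int(np.argmax(index))
        x = bandit.pull(a)
        S[a] += x; N[a] += 1
        history.append((a, x))
    return history
```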
27 The UCB1 algorithm

Theorem. For every $\alpha > 2$ and every sub-optimal arm $a$, there exists a constant $C_\alpha > 0$ such that
$$\mathbb{E}_\mu[N_a(T)] \leq \frac{2\alpha}{(\mu^* - \mu_a)^2}\log(T) + C_\alpha.$$

It follows that
$$R_\mu(\mathcal{A}, T) \leq 2\alpha \left(\sum_{a : \mu_a < \mu^*} \frac{1}{\mu^* - \mu_a}\right) \log(T) + K C_\alpha.$$
28 Proof: 1/3

Assume $\mu^* = \mu_1$ and $\mu_2 < \mu_1$. Then
$$N_2(T) = \sum_{t=0}^{T-1} \mathbb{1}_{(A_{t+1}=2)} = \sum_{t=0}^{T-1} \mathbb{1}_{(A_{t+1}=2,\, \mathrm{UCB}_1(t) \leq \mu_1)} + \sum_{t=0}^{T-1} \mathbb{1}_{(A_{t+1}=2,\, \mathrm{UCB}_1(t) > \mu_1)}$$
$$\leq \sum_{t=0}^{T-1} \mathbb{1}_{(\mathrm{UCB}_1(t) \leq \mu_1)} + \sum_{t=0}^{T-1} \mathbb{1}_{(A_{t+1}=2,\, \mathrm{UCB}_2(t) > \mu_1)}$$
(on the event $(A_{t+1}=2,\ \mathrm{UCB}_1(t) > \mu_1)$ one has $\mathrm{UCB}_2(t) \geq \mathrm{UCB}_1(t) > \mu_1$). Hence
$$\mathbb{E}[N_2(T)] \leq \underbrace{\sum_{t=0}^{T-1} \mathbb{P}(\mathrm{UCB}_1(t) \leq \mu_1)}_{A} + \underbrace{\sum_{t=0}^{T-1} \mathbb{P}(A_{t+1}=2,\ \mathrm{UCB}_2(t) > \mu_1)}_{B}.$$
30 Proof: 2/3

Term A: if $\alpha > 2$, using the union bound of the previous slide,
$$\sum_{t=0}^{T-1} \mathbb{P}(\mathrm{UCB}_1(t) \leq \mu_1) \leq 1 + \sum_{t=1}^{\infty} \frac{1}{t^{\alpha-1}} = 1 + \zeta(\alpha - 1) := C_\alpha / 2.$$
31 Proof: 3/3

Term B:
$$(B) = \sum_{t=0}^{T-1} \mathbb{P}(A_{t+1}=2,\ \mathrm{UCB}_2(t) > \mu_1) \leq \sum_{t=0}^{T-1} \mathbb{P}(A_{t+1}=2,\ \mathrm{UCB}_2(t) > \mu_1,\ \mathrm{LCB}_2(t) \leq \mu_2) + C_\alpha/2,$$
with
$$\mathrm{LCB}_2(t) = \hat\mu_2(t) - \sqrt{\frac{\alpha \log t}{2 N_2(t)}}.$$
Now
$$\left(\mathrm{LCB}_2(t) < \mu_2 < \mu_1 \leq \mathrm{UCB}_2(t)\right) \Rightarrow (\mu_1 - \mu_2) \leq 2\sqrt{\frac{\alpha \log(T)}{2 N_2(t)}} \Rightarrow N_2(t) \leq \frac{2\alpha}{(\mu_1 - \mu_2)^2}\log(T).$$
32 Proof: 3/3 (continued)

Term B:
$$(B) \leq \sum_{t=0}^{T-1} \mathbb{P}\left(A_{t+1}=2,\ N_2(t) \leq \frac{2\alpha}{(\mu_1 - \mu_2)^2}\log(T)\right) + C_\alpha/2 \leq \frac{2\alpha}{(\mu_1 - \mu_2)^2}\log(T) + C_\alpha/2.$$

Conclusion:
$$\mathbb{E}[N_2(T)] \leq \frac{2\alpha}{(\mu_1 - \mu_2)^2}\log(T) + C_\alpha.$$
33 The UCB1 algorithm

Theorem. For every $\alpha > 2$ and every sub-optimal arm $a$, there exists a constant $C_\alpha > 0$ such that
$$\mathbb{E}_\mu[N_a(T)] \leq \frac{2\alpha}{(\mu^* - \mu_a)^2}\log(T) + C_\alpha.$$

Remark: by Pinsker's inequality, $d(\mu_a, \mu^*) \geq 2(\mu^* - \mu_a)^2$, so
$$\frac{2\alpha}{(\mu^* - \mu_a)^2} \geq 4\alpha \cdot \frac{1}{d(\mu_a, \mu^*)} > \frac{1}{d(\mu_a, \mu^*)}$$
(UCB1 is not asymptotically optimal).
35 The KL-UCB algorithm

A UCB-type algorithm, $A_{t+1} = \operatorname{argmax}_a u_a(t)$, associated to the right upper confidence bounds:
$$u_a(t) = \max\left\{q \geq \hat\mu_a(t) : d(\hat\mu_a(t), q) \leq \frac{\log(t)}{N_a(t)}\right\},$$
where $\hat\mu_a(t)$ is the empirical mean of the rewards from arm $a$ up to time $t$.

[Figure: the function $q \mapsto d(\hat\mu_a(t), q)$ crossing the level $\log(t)/N_a(t)$ at $q = u_a(t)$]

[Cappé et al. 13]: KL-UCB satisfies
$$\mathbb{E}_\mu[N_a(T)] \leq \frac{1}{d(\mu_a, \mu^*)}\log T + O(\sqrt{\log(T)}).$$
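Since $q \mapsto d(\hat\mu_a(t), q)$ is increasing on $[\hat\mu_a(t), 1]$, the index $u_a(t)$ can be computed by bisection. A sketch, reusing the kl_bernoulli helper from earlier:

```python
import math

def kl_ucb_index(mu_hat, n, t, tol=1e-6):
    """u_a(t) = max{ q >= mu_hat : d(mu_hat, q) <= log(t) / n }."""
    level = math.log(t) / n
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        q = (lo + hi) / 2
        if kl_bernoulli(mu_hat, q) <= level:
            lo = q          # q still inside the confidence region: go higher
        else:
            hi = q
    return lo
```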
36 Bayesian algorithms for regret minimization?

Algorithms based on Bayesian tools can be good at solving the (frequentist) regret minimization problem. Ideas:
- use the Finite-Horizon Gittins indices
- use posterior quantiles
- use posterior samples
37 Thompson Sampling

$(\pi_1^t, \dots, \pi_K^t)$: posterior distribution on $(\mu_1, \dots, \mu_K)$ at round $t$.

Algorithm: Thompson Sampling, a randomized Bayesian algorithm:
$$\forall a \in \{1,\dots,K\},\quad \theta_a(t) \sim \pi_a^t, \qquad A_{t+1} = \operatorname*{argmax}_{a} \theta_a(t).$$
Draw each arm according to its posterior probability of being optimal. [Thompson 1933]

Thompson Sampling is asymptotically optimal. [K., Korda, Munos 2012]
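For Bernoulli rewards with Beta(1, 1) priors the algorithm is a few lines; a sketch with the same hypothetical environment as before:

```python
import numpy as np

def thompson_sampling(bandit, K, T, seed=None):
    """Thompson Sampling for Bernoulli bandits with Beta(1, 1) priors:
    sample theta_a(t) ~ pi_a^t and pull argmax_a theta_a(t)."""
    rng = np.random.default_rng(seed)
    S, F = np.zeros(K), np.zeros(K)        # successes / failures per arm
    for t in range(T):
        theta = rng.beta(S + 1, F + 1)     # one posterior sample per arm
        a = int(np.argmax(theta))
        x = bandit.pull(a)
        S[a] += x
        F[a] += 1 - x
    return S, F
```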
38 Bayesian algorithms in contextual linear bandit models

At time $t$, a set of contexts $D_t \subset \mathbb{R}^d$ is revealed (= characteristics of the items to recommend).

The model: if the context $x_t \in D_t$ is selected, a reward $r_t = x_t^T \theta + \epsilon_t$ is received, where $\theta \in \mathbb{R}^d$ is the underlying preference vector.

A Bayesian model (with Gaussian prior):
$$r_t = x_t^T \theta + \epsilon_t, \qquad \theta \sim \mathcal{N}(0, \kappa^2 I_d), \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2).$$
Explicit posterior: $p(\theta \mid x_1, r_1, \dots, x_t, r_t) = \mathcal{N}(\hat\theta(t), \Sigma_t)$.

Thompson Sampling:
$$\tilde\theta(t) \sim \mathcal{N}(\hat\theta(t), \Sigma_t), \qquad x_{t+1} = \operatorname*{argmax}_{x \in D_{t+1}} x^T \tilde\theta(t).$$
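A sketch of the conjugate Gaussian update and the Thompson step in this linear model (the dimension and variances are example values, and the function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng()
d, kappa2, sigma2 = 3, 1.0, 0.25        # example dimension and variances
B = np.eye(d) / kappa2                  # posterior precision: Sigma_t^{-1}
f = np.zeros(d)                         # accumulates x_s * r_s / sigma^2

def ts_choose(contexts):
    """One Thompson step: sample tilde_theta ~ N(hat_theta(t), Sigma_t)
    and pick the context maximizing x^T tilde_theta."""
    Sigma = np.linalg.inv(B)
    theta_hat = Sigma @ f
    theta_tilde = rng.multivariate_normal(theta_hat, Sigma)
    return int(np.argmax(contexts @ theta_tilde))

def ts_update(x, r):
    """Conjugate Gaussian update after observing reward r for context x."""
    global B, f
    B = B + np.outer(x, x) / sigma2
    f = f + r * x / sigma2
```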
39 Part 3: Non-stochastic bandits, minimax algorithms
40 Minimax regret

In stochastic bandits, we exhibited algorithms such that
$$\forall \mu, \quad \mathbb{E}_\mu[N_a(T)] \leq \frac{\log T}{d(\mu_a, \mu^*)} + o(\log T).$$
Their regret satisfies
$$R_\mu(\mathcal{A}, T) = \sum_{a=2}^{K} \min\Big[\underbrace{\frac{\mu^* - \mu_a}{d(\mu_a, \mu^*)}\log(T)}_{\text{large when } \mu_a \simeq \mu^*},\ \underbrace{(\mu^* - \mu_a)\, T}_{\text{small when } \mu_a \simeq \mu^*}\Big] + o(\log(T)).$$

There exists a constant $C$ such that
$$\forall \mu, \quad R_\mu(\mathcal{A}, T) \leq C\sqrt{KT\log(T)} \quad \text{(problem-independent bound)}.$$

Minimax rate of the regret:
$$\inf_{\mathcal{A}} \sup_{\mu} R_\mu(\mathcal{A}, T) = O\left(\sqrt{KT}\right).$$
41 Removing the stochastic assumption...

A new bandit game: at round $t$,
- the player chooses arm $A_t$
- simultaneously, an adversary chooses the vector of rewards $(x_{1,t}, \dots, x_{K,t})$
- the player receives the reward $x_t = x_{A_t,t}$

Goal: maximize rewards, or minimize regret:
$$R(\mathcal{A}, T) = \max_a \mathbb{E}\left[\sum_{t=1}^{T} x_{a,t}\right] - \mathbb{E}\left[\sum_{t=1}^{T} x_t\right].$$
42 Full information: Exponentially Weighted Forecaster

The full-information game: at round $t$,
- the player chooses arm $A_t$
- simultaneously, an adversary chooses the vector of rewards $(x_{1,t}, \dots, x_{K,t})$
- the player receives the reward $x_t = x_{A_t,t}$ and observes the whole reward vector $(x_{1,t}, \dots, x_{K,t})$

The EWF algorithm [Littlestone, Warmuth 1994]: with $\hat{p}_t$ the probability distribution used at round $t$, choose
$$\hat{p}_{a,t} \propto e^{\eta \sum_{s=1}^{t-1} x_{a,s}}, \qquad A_t \sim \hat{p}_t.$$
43 Back to the bandit case: the EXP3 strategy

We don't have access to $(x_{a,t})$ for all $a$... but
$$\hat{x}_{a,t} = \frac{x_{a,t}}{\hat{p}_{a,t}} \mathbb{1}_{(A_t = a)}$$
satisfies $\mathbb{E}[\hat{x}_{a,t}] = x_{a,t}$.

The EXP3 strategy [Auer et al. 2003]: with $\hat{p}_t$ the probability distribution used at round $t$, choose
$$\hat{p}_{a,t} \propto e^{\eta \sum_{s=1}^{t-1} \hat{x}_{a,s}}, \qquad A_t \sim \hat{p}_t.$$
44 Theoretical results

[Bubeck and Cesa-Bianchi 12]: EXP3 with
$$\eta = \sqrt{\frac{\log(K)}{KT}}$$
satisfies
$$R(\mathrm{EXP3}, T) \leq \sqrt{2\, KT \log K}.$$

Remarks:
- almost the same guarantees hold for the anytime choice $\eta_t = \sqrt{\frac{\log(K)}{Kt}}$
- extra exploration is needed to obtain high-probability results
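A sketch of EXP3 with the importance-weighted estimates above, here fed a fixed T x K reward table standing in for an oblivious adversary (an illustrative simplification):

```python
import numpy as np

def exp3(reward_matrix, eta, seed=None):
    """EXP3 on a T x K table of adversarial rewards in [0, 1]: exponential
    weights on the estimates hat_x_{a,t} = (x_{a,t} / p_{a,t}) 1{A_t = a}."""
    rng = np.random.default_rng(seed)
    T, K = reward_matrix.shape
    cum_est = np.zeros(K)                           # cumulative hat_x per arm
    total = 0.0
    for t in range(T):
        w = np.exp(eta * (cum_est - cum_est.max()))  # stabilized weights
        p = w / w.sum()
        a = rng.choice(K, p=p)
        x = reward_matrix[t, a]
        cum_est[a] += x / p[a]                       # unbiased estimate of x_{a,t}
        total += x
    return total
```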
45 Conclusions

Under different assumptions, different types of strategies achieve an exploration-exploitation tradeoff in bandit models:
- Index policies: Gittins indices, UCB-type algorithms
- Randomized algorithms: Thompson Sampling, exponential weights

More complex bandit models not covered today: restless bandits, contextual bandits, combinatorial bandits...
46 Beyond regret minimization
47 A pure-exploration objective

Regret minimization: maximize the number of patients healed during the trial.
Alternative goal: identify as quickly as possible the best treatment (no focus on curing patients during the study).

In addition to the sampling strategy $(A_t)$, one needs
- a stopping rule $\tau$ (a stopping time)
- a recommendation rule $\hat{a}_\tau$
such that, for some risk parameter $\delta \in (0,1)$, $\mathbb{P}(\hat{a}_\tau \neq a^*) \leq \delta$ and $\mathbb{E}[\tau]$ is as small as possible.
49 An algorithm: KL-LUCB [K., Kalyanakrishnan 13]

An algorithm based on upper and lower confidence bounds:
$$u_a(t) = \max\{q : N_a(t)\, d(\hat\mu_a(t), q) \leq \log(kt/\delta)\}$$
$$l_a(t) = \min\{q : N_a(t)\, d(\hat\mu_a(t), q) \leq \log(kt/\delta)\}$$

- sampling rule: draw $A_{t+1} = \operatorname{argmax}_a \hat\mu_a(t)$ and $B_{t+1} = \operatorname{argmax}_{b \neq A_{t+1}} u_b(t)$
- stopping rule: $\tau = \inf\{t \in \mathbb{N} : l_{A_t}(t) > u_{B_t}(t)\}$
- recommendation rule: $\hat{a}_\tau = \operatorname{argmax}_{a=1,\dots,K} \hat\mu_a(\tau)$
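A sketch of these sampling and stopping rules, reusing kl_bernoulli from earlier; the exploration rate log(Kt/delta) is an illustrative simplification of the calibrated rate used in the paper:

```python
import math
import numpy as np

def kl_bound(mu_hat, n, level, upper=True, tol=1e-6):
    """KL confidence bound: the q on the chosen side of mu_hat with
    n * d(mu_hat, q) = level, found by bisection."""
    lo, hi = (mu_hat, 1.0) if upper else (0.0, mu_hat)
    while hi - lo > tol:
        q = (lo + hi) / 2
        inside = kl_bernoulli(mu_hat, q) <= level / n
        if upper:
            lo, hi = (q, hi) if inside else (lo, q)
        else:
            lo, hi = (lo, q) if inside else (q, hi)
    return (lo + hi) / 2

def kl_lucb(bandit, K, delta):
    """LUCB loop: sample the empirical leader and the best challenger until
    the leader's lower bound dominates every challenger's upper bound."""
    S, N = np.zeros(K), np.zeros(K)
    for a in range(K):                        # one initial pull per arm
        S[a] += bandit.pull(a); N[a] += 1
    t = K
    while True:
        mu_hat = S / N
        level = math.log(K * t / delta)       # illustrative exploration rate
        best = int(np.argmax(mu_hat))
        u = [kl_bound(mu_hat[a], N[a], level) for a in range(K)]
        challenger = max((a for a in range(K) if a != best), key=lambda a: u[a])
        if kl_bound(mu_hat[best], N[best], level, upper=False) > u[challenger]:
            return best                       # stopping rule satisfied
        for a in (best, challenger):          # LUCB samples both leaders
            S[a] += bandit.pull(a); N[a] += 1; t += 1
```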
50 KL-LUCB for finding the m best arms
51 The complexity of best-arm identification

Theorem [K. and Garivier, 16]. For any δ-PAC algorithm,
$$\mathbb{E}_\mu[\tau] \geq T^*(\mu) \log\left(\frac{1}{2.4\,\delta}\right), \quad \text{where} \quad T^*(\mu)^{-1} = \sup_{w \in \Sigma_K} \inf_{\lambda \in \mathrm{Alt}(\mu)} \left(\sum_{a=1}^{K} w_a\, d(\mu_a, \lambda_a)\right).$$

An optimal strategy satisfies $\frac{\mathbb{E}_\mu[N_a(\tau)]}{\mathbb{E}_\mu[\tau]} \simeq w_a^*(\mu)$ with
$$w^*(\mu) = \operatorname*{argmax}_{w \in \Sigma_K} \inf_{\lambda \in \mathrm{Alt}(\mu)} \left(\sum_{a=1}^{K} w_a\, d(\mu_a, \lambda_a)\right);$$
tracking these optimal proportions yields a δ-PAC algorithm such that
$$\limsup_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau_\delta]}{\log(1/\delta)} = T^*(\mu).$$
52 References

Bayesian bandits:
- Bandit Problems: Sequential Allocation of Experiments. Berry and Fristedt. Chapman and Hall, 1985.
- Multi-armed Bandit Allocation Indices. Gittins, Glazebrook, Weber. Wiley, 2011.

Stochastic and non-stochastic bandits:
- Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Bubeck and Cesa-Bianchi. Foundations and Trends in Machine Learning, 2012.
- Prediction, Learning and Games. Cesa-Bianchi and Lugosi. Cambridge University Press, 2006.

Best arm identification:
- Optimal Best Arm Identification with Fixed Confidence. A. Garivier and E. Kaufmann. Preprint, 2016.