Gdt COS, December 3rd, 2015
Multi-Armed Bandit model: general setting

$K$ arms: for $a \in \{1,\dots,K\}$, $(X_{a,t})_{t \in \mathbb{N}}$ is a stochastic process (unknown distributions).

Bandit game: at each round $t$, an agent
- chooses an arm $A_t \in \{1,\dots,K\}$
- receives a reward $X_t = X_{A_t,t}$

Goal: build a sequential strategy $A_t = F_t(A_1, X_1, \dots, A_{t-1}, X_{t-1})$ maximizing
$$\mathbb{E}\left[\sum_{t=1}^{\infty} \alpha_t X_t\right],$$
where $(\alpha_t)_{t \in \mathbb{N}}$ is a discount sequence. [Berry and Fristedt, 1985]
Multi-Armed Bandit model: the i.i.d. case

$K$ independent arms: for $a \in \{1,\dots,K\}$, $(X_{a,t})_{t \in \mathbb{N}}$ is i.i.d. $\sim \nu_a$ (unknown distributions).

Bandit game: at each round $t$, an agent
- chooses an arm $A_t \in \{1,\dots,K\}$
- receives a reward $X_t \sim \nu_{A_t}$

Goal: build a sequential strategy $A_t = F_t(A_1, X_1, \dots, A_{t-1}, X_{t-1})$ maximizing
- Finite-horizon MAB: $\mathbb{E}\left[\sum_{t=1}^{T} X_t\right]$
- Discounted MAB: $\mathbb{E}\left[\sum_{t=1}^{\infty} \alpha^{t-1} X_t\right]$
Why MABs?

[Figure: five slot machines with unknown reward distributions $\nu_1, \dots, \nu_5$]

Goal: maximize one's gains in a casino? (HOPELESS)
Why MABs? Real motivations

Clinical trials: treatments with unknown efficacies $\mathcal{B}(\mu_1), \dots, \mathcal{B}(\mu_5)$
- choose a treatment $A_t$ for patient $t$
- observe a response $X_t \in \{0,1\}$ with $\mathbb{P}(X_t = 1) = \mu_{A_t}$
Goal: maximize the number of patients healed.

Recommendation tasks: items with unknown rating distributions $\nu_1, \dots, \nu_5$
- recommend a movie $A_t$ to visitor $t$
- observe a rating $X_t \sim \nu_{A_t}$ (e.g. $X_t \in \{1,\dots,5\}$)
Goal: maximize the sum of ratings.
Bernoulli bandit models

$K$ independent arms: for $a \in \{1,\dots,K\}$, $(X_{a,t})_{t \in \mathbb{N}}$ is i.i.d. $\sim \mathcal{B}(\mu_a)$.

Bandit game: at each round $t$, an agent
- chooses an arm $A_t \in \{1,\dots,K\}$
- receives a reward $X_t \sim \mathcal{B}(\mu_{A_t})$, $X_t \in \{0,1\}$

Goal: maximize
- Finite-horizon MAB: $\mathbb{E}\left[\sum_{t=1}^{T} X_t\right]$
- Discounted MAB: $\mathbb{E}\left[\sum_{t=1}^{\infty} \alpha^{t-1} X_t\right]$

Two views of the parameters:
- Frequentist model: $\mu_1, \dots, \mu_K$ are unknown parameters; for each arm $a$, $(X_{a,t})_t$ is i.i.d. $\sim \mathcal{B}(\mu_a)$.
- Bayesian model: $\mu_1, \dots, \mu_K$ are drawn from a prior distribution, $\mu_a \sim \pi_a$; for each arm $a$, $(X_{a,t})_t \mid \boldsymbol{\mu}$ is i.i.d. $\sim \mathcal{B}(\mu_a)$.
Outline
1. Bayesian bandits: a planning problem
2. Frequentist bandits: asymptotically optimal algorithms
3. Non-stochastic bandits: minimax algorithms
Part 1. Bayesian bandits: a planning problem
A Markov Decision Process

Bandit model $(\mathcal{B}(\mu_1), \dots, \mathcal{B}(\mu_K))$ with
- i.i.d. prior distribution: $\mu_a \sim \mathcal{U}([0,1])$
- posterior distribution: $\pi_a^t := \mathcal{L}(\mu_a \mid X_1, \dots, X_t)$

$$\pi_a^t = \mathrm{Beta}\big(\underbrace{S_a(t)}_{\#\text{ones}} + 1,\ \underbrace{N_a(t) - S_a(t)}_{\#\text{zeros}} + 1\big)$$

- $S_a(t)$: sum of the rewards gathered from arm $a$ up to time $t$
- $N_a(t)$: number of draws of arm $a$ up to time $t$

The state $\Pi^t = (\pi_a^t)_{a=1}^{K}$ evolves as a Markov Decision Process.
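As a small illustration (not from the talk), the Bayesian state above can be tracked with one Beta posterior per arm; the class name below is mine, chosen for readability.

```python
import numpy as np

class BetaPosteriors:
    """One Beta posterior per arm, starting from the uniform prior."""
    def __init__(self, n_arms):
        # U([0,1]) = Beta(1,1): alpha counts #ones + 1, beta counts #zeros + 1.
        self.alpha = np.ones(n_arms)   # S_a(t) + 1
        self.beta = np.ones(n_arms)    # N_a(t) - S_a(t) + 1

    def update(self, arm, reward):
        # Conjugacy: a Bernoulli observation increments a single parameter.
        if reward == 1:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1
```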
A Markov Decision Process

[Figure: an example of a transition, where choosing $A_t = 2$ and observing $X_t = 1$ increments the first Beta parameter of arm 2's posterior, the other posteriors being unchanged]

Solving a planning problem: there exists an exact solution to
- the finite-horizon MAB: $\arg\max_{(A_t)} \mathbb{E}_{\mu \sim \pi}\left[\sum_{t=1}^{T} X_t\right]$
- the discounted MAB: $\arg\max_{(A_t)} \mathbb{E}_{\mu \sim \pi}\left[\sum_{t=1}^{\infty} \alpha^{t-1} X_t\right]$

Optimal policy = solution to dynamic programming equations.

Problem: the state space is too large!
A reduction of the dimension

[Gittins 79]: the solution of the discounted MAB reduces to an index policy:
$$A_{t+1} = \arg\max_{a=1,\dots,K} G_\alpha(\pi_a^t).$$

The Gittins indices:
$$G_\alpha(p) = \sup_{\text{stopping times } \tau > 0} \frac{\mathbb{E}_{Y_t \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\mu),\, \mu \sim p}\left[\sum_{t=1}^{\tau} \alpha^{t-1} Y_t\right]}{\mathbb{E}_{Y_t \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\mu),\, \mu \sim p}\left[\sum_{t=1}^{\tau} \alpha^{t-1}\right]}:$$
the maximal equivalent instantaneous reward when committing to an arm with $\mu \sim p$, when rewards are discounted by $\alpha$.
The Gittins indices

An alternative formulation:
$$G_\alpha(p) = \inf\{\lambda \in \mathbb{R} : V_\alpha(p, \lambda) = 0\},$$
with
$$V_\alpha(p, \lambda) = \sup_{\text{stopping times } \tau > 0} \mathbb{E}_{Y_t \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\mu),\, \mu \sim p}\left[\sum_{t=1}^{\tau} \alpha^{t-1} (Y_t - \lambda)\right]:$$
the index is the largest price worth paying for committing to an arm with $\mu \sim p$, when rewards are discounted by $\alpha$.
Gittins indices for finite horizon

The finite-horizon Gittins indices depend on the remaining time to play $r$:
$$G(p, r) = \inf\{\lambda \in \mathbb{R} : V_r(p, \lambda) = 0\},$$
with
$$V_r(p, \lambda) = \sup_{\text{stopping times } 0 < \tau \le r} \mathbb{E}_{Y_t \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\mu),\, \mu \sim p}\left[\sum_{t=1}^{\tau} (Y_t - \lambda)\right]:$$
the largest price worth paying for playing an arm with $\mu \sim p$ for at most $r$ rounds.

The Finite-Horizon Gittins algorithm,
$$A_{t+1} = \arg\max_{a=1,\dots,K} G(\pi_a^t, T - t),$$
does NOT coincide with the optimal solution [Berry and Fristedt 85]... but is conjectured to be a good approximation!
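For a $\mathrm{Beta}(a, b)$ posterior, $V_r(p, \lambda)$ satisfies a simple recursion over posterior updates (pull once, then stop or continue), and since $\lambda \mapsto V_r(p, \lambda)$ is decreasing, $G(p, r)$ can be found by bisection. A minimal sketch under these assumptions (my own construction, not the talk's code; memoization keeps the cost at $O(r^2)$ states per value of $\lambda$, so this is only practical for small $r$):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def V(a, b, r, lam):
    """Optimal-stopping value with at most r pulls (at least one) of a
    Bernoulli arm with Beta(a, b) posterior, at price lam per pull."""
    if r == 0:
        return 0.0
    p = a / (a + b)                              # posterior mean
    go_on_1 = max(0.0, V(a + 1, b, r - 1, lam))  # stop or continue after a 1
    go_on_0 = max(0.0, V(a, b + 1, r - 1, lam))  # stop or continue after a 0
    return (p - lam) + p * go_on_1 + (1 - p) * go_on_0

def fh_gittins(a, b, r, tol=1e-6):
    """G(Beta(a, b), r) = inf{lam : V_r = 0}; V_r decreases in lam, so bisect."""
    lo, hi = 0.0, 1.0                            # the index lies in [0, 1]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if V(a, b, r, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

As a sanity check, `fh_gittins(1, 1, 1)` returns 0.5, the posterior mean, which is what the definition gives when a single round remains.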
Part 2. Frequentist bandits: asymptotically optimal algorithms
Regret minimization

$\mu = (\mu_1, \dots, \mu_K)$: unknown parameters, $\mu^* = \max_a \mu_a$.

The regret of a strategy $\mathcal{A} = (A_t)$ is defined as
$$R_\mu(\mathcal{A}, T) = \mathbb{E}_\mu\left[\mu^* T - \sum_{t=1}^{T} X_t\right]$$
and can be rewritten
$$R_\mu(\mathcal{A}, T) = \sum_{a=1}^{K} (\mu^* - \mu_a)\, \mathbb{E}_\mu[N_a(T)],$$
where $N_a(t)$ is the number of draws of arm $a$ up to time $t$.

Maximizing rewards $\Leftrightarrow$ minimizing regret.

Goal: design strategies that have small regret for all $\mu$.
Optimal algorithms for regret minimization

All the arms should be drawn infinitely often!

[Lai and Robbins, 1985]: a uniformly efficient strategy (i.e. such that $\forall \mu$, $\forall \alpha \in ]0,1[$, $R_\mu(\mathcal{A}, T) = o(T^\alpha)$) satisfies
$$\mu_a < \mu^* \implies \liminf_{T \to \infty} \frac{\mathbb{E}_\mu[N_a(T)]}{\log T} \ge \frac{1}{d(\mu_a, \mu^*)},$$
where
$$d(\mu, \mu') = \mathrm{KL}(\mathcal{B}(\mu), \mathcal{B}(\mu')) = \mu \log\frac{\mu}{\mu'} + (1-\mu) \log\frac{1-\mu}{1-\mu'}.$$

Definition. A bandit algorithm is asymptotically optimal if, for every $\mu$,
$$\mu_a < \mu^* \implies \limsup_{T \to \infty} \frac{\mathbb{E}_\mu[N_a(T)]}{\log T} \le \frac{1}{d(\mu_a, \mu^*)}.$$
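For concreteness, a small helper (my sketch) computing the divergence $d$ that calibrates the lower bound; the boundary clipping is a standard numerical convention, not part of the definition.

```python
import math

def kl_bernoulli(mu, mu_prime, eps=1e-12):
    """d(mu, mu') = KL(B(mu), B(mu')), clipped away from {0, 1}."""
    mu = min(max(mu, eps), 1 - eps)
    mu_prime = min(max(mu_prime, eps), 1 - eps)
    return (mu * math.log(mu / mu_prime)
            + (1 - mu) * math.log((1 - mu) / (1 - mu_prime)))

# e.g. kl_bernoulli(0.4, 0.5) ~ 0.0204: a uniformly efficient strategy
# must draw an arm with mean 0.4 roughly 49 log(T) times when mu* = 0.5.
```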
First algorithms

Idea 1: draw each arm $T/K$ times. → EXPLORATION

Idea 2: always choose the empirical best arm, $A_{t+1} = \arg\max_a \hat\mu_a(t)$. → EXPLOITATION

Idea 3: draw the arms uniformly during $T/2$ rounds, then draw the empirical best until the end. → EXPLORATION followed by EXPLOITATION

Linear regret in all three cases... (see the sketch of Idea 3 below).
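A quick simulation sketch of Idea 3 (the Bernoulli arms, seed, and function name are my illustrative choices) makes the problem concrete: the $T/2$ uniform exploration rounds alone already cost a constant fraction of the optimal reward, hence regret linear in $T$.

```python
import numpy as np

def explore_then_commit(mus, T, seed=0):
    """Idea 3: uniform exploration for T/2 rounds, then commit.
    Assumes T // 2 >= len(mus), so every arm gets sampled."""
    rng = np.random.default_rng(seed)
    K = len(mus)
    counts, sums, total = np.zeros(K), np.zeros(K), 0.0
    for t in range(T // 2):               # exploration: round-robin over arms
        a = t % K
        x = rng.binomial(1, mus[a])
        counts[a] += 1; sums[a] += x; total += x
    best = int(np.argmax(sums / counts))  # exploitation: empirical best arm
    for _ in range(T - T // 2):
        total += rng.binomial(1, mus[best])
    return total  # compare to T * max(mus): the gap grows linearly in T
```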
Optimistic algorithms

For each arm $a$, build a confidence interval on $\mu_a$:
$$\mu_a \le \mathrm{UCB}_a(t) \quad \text{w.h.p.}$$

[Figure: confidence intervals on the arms at round $t$]

Optimism principle: act as if the best possible model were the true model:
$$A_{t+1} = \arg\max_a \mathrm{UCB}_a(t).$$
A UCB algorithm in action!

[Figure: a UCB algorithm running on several arms, showing the confidence intervals and the number of draws of each arm]
The UCB1 algorithm

UCB1 [Auer et al. 02] is based on the index
$$\mathrm{UCB}_a(t) = \hat\mu_a(t) + \sqrt{\frac{\alpha \log(t)}{2 N_a(t)}}.$$

Hoeffding's inequality:
$$\mathbb{P}\left(\hat\mu_{a,s} + \sqrt{\frac{\alpha \log(t)}{2s}} \le \mu_a\right) \le \exp\left(-2s \cdot \frac{\alpha \log(t)}{2s}\right) = \frac{1}{t^\alpha}.$$

Union bound:
$$\mathbb{P}(\mathrm{UCB}_a(t) \le \mu_a) \le \mathbb{P}\left(\exists s \le t : \hat\mu_{a,s} + \sqrt{\frac{\alpha \log(t)}{2s}} \le \mu_a\right) \le \sum_{s=1}^{t} \frac{1}{t^\alpha} = \frac{1}{t^{\alpha-1}}.$$
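Putting the index together, a minimal sketch on simulated Bernoulli arms (simulation setup and naming are mine; $\alpha > 2$ matches the theorem that follows):

```python
import math
import numpy as np

def ucb1(mus, T, alpha=2.1, seed=0):
    """Run UCB1 on simulated Bernoulli arms; returns the counts N_a(T)."""
    rng = np.random.default_rng(seed)
    K = len(mus)
    counts, sums = np.zeros(K), np.zeros(K)
    for a in range(K):                    # initialization: draw each arm once
        sums[a] += rng.binomial(1, mus[a]); counts[a] += 1
    for t in range(K, T):
        index = sums / counts + np.sqrt(alpha * math.log(t) / (2 * counts))
        a = int(np.argmax(index))         # optimism: play the highest index
        sums[a] += rng.binomial(1, mus[a]); counts[a] += 1
    return counts
```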
The UCB1 algorithm

Theorem. For every $\alpha > 2$ and every sub-optimal arm $a$, there exists a constant $C_\alpha > 0$ such that
$$\mathbb{E}_\mu[N_a(T)] \le \frac{2\alpha}{(\mu^* - \mu_a)^2} \log(T) + C_\alpha.$$

Remark: by Pinsker's inequality,
$$\frac{2\alpha}{(\mu^* - \mu_a)^2} \ge \frac{4\alpha}{d(\mu_a, \mu^*)} > \frac{1}{d(\mu_a, \mu^*)},$$
so UCB1 is not asymptotically optimal.
Proof: 1/3

Assume $\mu^* = \mu_1$ and $\mu_2 < \mu_1$. Then
$$N_2(T) = \sum_{t=0}^{T-1} \mathbb{1}_{(A_{t+1}=2)} = \sum_{t=0}^{T-1} \mathbb{1}_{(A_{t+1}=2,\, \mathrm{UCB}_1(t) \le \mu_1)} + \sum_{t=0}^{T-1} \mathbb{1}_{(A_{t+1}=2,\, \mathrm{UCB}_1(t) > \mu_1)}$$
$$\le \sum_{t=0}^{T-1} \mathbb{1}_{(\mathrm{UCB}_1(t) \le \mu_1)} + \sum_{t=0}^{T-1} \mathbb{1}_{(A_{t+1}=2,\, \mathrm{UCB}_2(t) > \mu_1)},$$
since $A_{t+1} = 2$ implies $\mathrm{UCB}_2(t) \ge \mathrm{UCB}_1(t)$. Hence
$$\mathbb{E}[N_2(T)] \le \underbrace{\sum_{t=0}^{T-1} \mathbb{P}(\mathrm{UCB}_1(t) \le \mu_1)}_{A} + \underbrace{\sum_{t=0}^{T-1} \mathbb{P}(A_{t+1}=2,\, \mathrm{UCB}_2(t) > \mu_1)}_{B}.$$
Proof: 2/3

Recall $\mathbb{E}[N_2(T)] \le A + B$. Term $A$: if $\alpha > 2$,
$$\sum_{t=0}^{T-1} \mathbb{P}(\mathrm{UCB}_1(t) \le \mu_1) \le 1 + \sum_{t=1}^{T-1} \frac{1}{t^{\alpha-1}} \le 1 + \zeta(\alpha - 1) := C_\alpha / 2.$$
Proof: 3/3

Term $B$:
$$B = \sum_{t=0}^{T-1} \mathbb{P}(A_{t+1}=2,\, \mathrm{UCB}_2(t) > \mu_1) \le \sum_{t=0}^{T-1} \mathbb{P}(A_{t+1}=2,\, \mathrm{UCB}_2(t) > \mu_1,\, \mathrm{LCB}_2(t) \le \mu_2) + C_\alpha/2,$$
with
$$\mathrm{LCB}_2(t) = \hat\mu_2(t) - \sqrt{\frac{\alpha \log t}{2 N_2(t)}}$$
(the events $\mathrm{LCB}_2(t) > \mu_2$ have total probability at most $C_\alpha/2$, by the same argument as for term $A$). Moreover,
$$(\mathrm{LCB}_2(t) < \mu_2 < \mu_1 \le \mathrm{UCB}_2(t)) \implies (\mu_1 - \mu_2) \le 2\sqrt{\frac{\alpha \log(T)}{2 N_2(t)}} \implies N_2(t) \le \frac{2\alpha}{(\mu_1 - \mu_2)^2} \log(T).$$
Proof: 3/3 (continued)

Term $B$:
$$B \le \sum_{t=0}^{T-1} \mathbb{P}(A_{t+1}=2,\, \mathrm{UCB}_2(t) > \mu_1,\, \mathrm{LCB}_2(t) \le \mu_2) + C_\alpha/2$$
$$\le \sum_{t=0}^{T-1} \mathbb{P}\left(A_{t+1}=2,\, N_2(t) \le \frac{2\alpha}{(\mu_1-\mu_2)^2}\log(T)\right) + C_\alpha/2 \le \frac{2\alpha}{(\mu_1-\mu_2)^2}\log(T) + C_\alpha/2.$$

Conclusion:
$$\mathbb{E}[N_2(T)] \le \frac{2\alpha}{(\mu_1-\mu_2)^2}\log(T) + C_\alpha.$$
The KL-UCB algorithm

A UCB-type algorithm, $A_{t+1} = \arg\max_a u_a(t)$, associated with the right upper confidence bounds:
$$u_a(t) = \max\left\{q \ge \hat\mu_a(t) : d(\hat\mu_a(t), q) \le \frac{\log(t)}{N_a(t)}\right\},$$
where $\hat\mu_a(t)$ is the empirical mean of the rewards from arm $a$ up to time $t$.

[Figure: the curve $q \mapsto d(\hat\mu_a(t), q)$, with $u_a(t)$ the largest $q$ at which it stays below the threshold $\log(t)/N_a(t)$]

[Cappé et al. 13]: KL-UCB satisfies
$$\mathbb{E}_\mu[N_a(T)] \le \frac{1}{d(\mu_a, \mu^*)} \log T + O(\sqrt{\log(T)}).$$
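Since $q \mapsto d(\hat\mu_a(t), q)$ is increasing on $[\hat\mu_a(t), 1]$, the index $u_a(t)$ can be computed by bisection. A minimal sketch (mine; `d` is the same Bernoulli KL as earlier, with numerical clipping at the boundary):

```python
import math

def d(p, q, eps=1e-12):
    """Bernoulli KL divergence, clipped away from {0, 1}."""
    p = min(max(p, eps), 1 - eps); q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, n_draws, t, tol=1e-6):
    """Largest q >= mu_hat with d(mu_hat, q) <= log(t)/N_a(t), by bisection."""
    level = math.log(t) / n_draws
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        q = (lo + hi) / 2
        if d(mu_hat, q) <= level:   # q still inside the confidence region
            lo = q
        else:
            hi = q
    return lo
```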
Bayesian algorithms for regret minimization?

Algorithms based on Bayesian tools can be good at solving the (frequentist) regret-minimization problem.

[Figure: Bayesian algorithms running on five Bernoulli arms, showing the posterior distributions and the number of draws of each arm]

Ideas:
- use the Finite-Horizon Gittins indices
- use posterior quantiles
- use posterior samples
Thompson Sampling

$(\pi_1^t, \dots, \pi_K^t)$: posterior distribution on $(\mu_1, \dots, \mu_K)$ at round $t$.

Algorithm: Thompson Sampling is a randomized Bayesian algorithm:
$$\forall a \in \{1,\dots,K\}, \quad \theta_a(t) \sim \pi_a^t, \qquad A_{t+1} = \arg\max_a \theta_a(t):$$
draw each arm according to its posterior probability of being optimal [Thompson 1933].

[Figure: the posterior densities $\pi_1^t$ and $\pi_2^t$, with one sample $\theta_a(t)$ drawn from each, near the true means $\mu_1$ and $\mu_2$]

Thompson Sampling is asymptotically optimal. [K., Korda, Munos 2012]
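A sketch of Thompson Sampling for Bernoulli arms with uniform priors (the simulation setup is my illustrative assumption):

```python
import numpy as np

def thompson_sampling(mus, T, seed=0):
    """Thompson Sampling with Beta(1,1) priors on simulated Bernoulli arms."""
    rng = np.random.default_rng(seed)
    K = len(mus)
    alpha, beta = np.ones(K), np.ones(K)
    for t in range(T):
        theta = rng.beta(alpha, beta)   # one sample per posterior
        a = int(np.argmax(theta))       # play the arm with the best sample
        x = rng.binomial(1, mus[a])
        alpha[a] += x                   # conjugate update of arm a
        beta[a] += 1 - x
    return alpha - 1, beta - 1          # (#ones, #zeros) per arm
```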
Part 3. Non-stochastic bandits: minimax algorithms
Minimax regret

In stochastic (Bernoulli) bandits, we exhibited algorithms satisfying, for all $\mu$,
$$R_\mu(\mathcal{A}, T) = \underbrace{\left(\sum_{a=1}^{K} \frac{\mu^* - \mu_a}{d(\mu_a, \mu^*)}\right)}_{\text{problem-dependent term, can be large}} \log(T) + o(\log(T)).$$

For those algorithms, one can also prove that, for some constant $C$ and all $\mu$,
$$R_\mu(\mathcal{A}, T) \le C \underbrace{\sqrt{KT \log(T)}}_{\text{problem-independent bound}}.$$

Minimax rate of the regret:
$$\inf_{\mathcal{A}} \sup_{\mu} R_\mu(\mathcal{A}, T) = O\big(\sqrt{KT}\big).$$
Removing the stochastic assumption...

A new bandit game: at round $t$,
- the player chooses arm $A_t$
- simultaneously, an adversary chooses the vector of rewards $(x_{1,t}, \dots, x_{K,t})$
- the player receives the reward $x_t = x_{A_t,t}$

Goal: maximize rewards, or minimize the regret
$$R(\mathcal{A}, T) = \max_a \mathbb{E}\left[\sum_{t=1}^{T} x_{a,t}\right] - \mathbb{E}\left[\sum_{t=1}^{T} x_t\right].$$
Full information: the Exponentially Weighted Forecaster

The full-information game: at round $t$,
- the player chooses arm $A_t$
- simultaneously, an adversary chooses the vector of rewards $(x_{1,t}, \dots, x_{K,t})$
- the player receives the reward $x_t = x_{A_t,t}$ and observes the full reward vector $(x_{1,t}, \dots, x_{K,t})$

The EWF algorithm [Littlestone and Warmuth 1994]: with $\hat{p}_t$ the probability distribution used at round $t$,
$$\hat{p}_{a,t} \propto e^{\eta \sum_{s=1}^{t-1} x_{a,s}}, \qquad A_t \sim \hat{p}_t.$$
Back to the bandit case: the EXP3 strategy

In the bandit game we do not have access to $x_{a,t}$ for all $a$, so EXP3 relies on the importance-weighted estimates
$$\hat{x}_{a,t} = \frac{x_{a,t}}{\hat{p}_{a,t}} \mathbb{1}_{(A_t = a)},$$
which satisfy $\mathbb{E}[\hat{x}_{a,t}] = x_{a,t}$.

The EXP3 strategy [Auer et al. 2003]: with $\hat{p}_t$ the probability distribution used at round $t$,
$$\hat{p}_{a,t} \propto e^{\eta \sum_{s=1}^{t-1} \hat{x}_{a,s}}, \qquad A_t \sim \hat{p}_t.$$
Theoretical results

The EXP3 strategy [Auer et al. 2003]: with $\hat{p}_t$ the probability distribution used at round $t$,
$$\hat{p}_{a,t} \propto e^{\eta \sum_{s=1}^{t-1} \hat{x}_{a,s}}, \qquad A_t \sim \hat{p}_t.$$

[Bubeck and Cesa-Bianchi 12]: EXP3 with
$$\eta = \sqrt{\frac{\log(K)}{KT}}$$
satisfies
$$R(\mathrm{EXP3}, T) \le \sqrt{2\, KT \log(K)}.$$

Remarks:
- almost the same guarantees hold for the anytime choice $\eta_t = \sqrt{\log(K)/(Kt)}$
- extra exploration is needed to obtain high-probability results
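A sketch of EXP3 in this reward-based formulation (my assumptions: the adversary is replaced by a fixed reward matrix for illustration, and the max-shift in the weights is a numerical-stability choice, not part of the algorithm):

```python
import numpy as np

def exp3(x, seed=0):
    """EXP3 on a reward matrix x of shape (T, K); returns the total reward."""
    rng = np.random.default_rng(seed)
    T, K = x.shape
    eta = np.sqrt(np.log(K) / (K * T))
    s_hat = np.zeros(K)                 # cumulative estimated rewards
    total = 0.0
    for t in range(T):
        w = np.exp(eta * (s_hat - s_hat.max()))  # shifted for stability
        p = w / w.sum()
        a = rng.choice(K, p=p)
        total += x[t, a]
        s_hat[a] += x[t, a] / p[a]      # importance-weighted estimate of x_{a,t}
    return total
```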
Conclusions

Under different assumptions, we saw different types of strategies to achieve an exploration-exploitation tradeoff in bandit models:
- Index policies: Gittins indices, UCB-type algorithms
- Randomized algorithms: Thompson Sampling, exponential weights

More complex bandit models, not covered today: restless bandits, contextual bandits, combinatorial bandits...
Books and surveys

Bayesian bandits:
- Bandit Problems: Sequential Allocation of Experiments. Berry and Fristedt. Chapman and Hall, 1985.
- Multi-Armed Bandit Allocation Indices. Gittins, Glazebrook, Weber. Wiley, 2011.

Stochastic and non-stochastic bandits:
- Regret Analysis of Stochastic and Non-stochastic Bandit Problems. Bubeck and Cesa-Bianchi. Foundations and Trends in Machine Learning, 2012.
- Prediction, Learning, and Games. Cesa-Bianchi and Lugosi. Cambridge University Press, 2006.