"Stratégies bayésiennes et fréquentistes dans un modèle de bandit" (Bayesian and frequentist strategies in a bandit model)
PhD thesis at Telecom ParisTech, co-advised by Olivier Cappé, Aurélien Garivier and Rémi Munos.
Journées MAS, Grenoble, August 30, 2016.
The multi-armed bandit model
K arms = K probability distributions $\nu_1, \ldots, \nu_K$ ($\nu_a$ has mean $\mu_a$).
At round $t$, an agent:
- chooses an arm $A_t$
- observes a sample $X_t \sim \nu_{A_t}$
using a sequential sampling strategy $(A_t)$: $A_{t+1} = F_t(A_1, X_1, \ldots, A_t, X_t)$.
Generic goal: learn the best arm, $a^* = \operatorname{argmax}_a \mu_a$.
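As a concrete illustration, here is a minimal Python sketch of this interaction loop, assuming Bernoulli arms; the class name `BernoulliBandit` and the uniform placeholder strategy are ours, not from the talk.

```python
import numpy as np

class BernoulliBandit:
    """K Bernoulli arms: arm a yields reward 1 with probability mu[a]."""
    def __init__(self, mu, seed=None):
        self.mu = np.asarray(mu, dtype=float)
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        # one sample X_t ~ nu_{A_t} = B(mu[a])
        return float(self.rng.random() < self.mu[a])

# Generic loop: a strategy F_t maps the past to the next arm A_{t+1};
# here a uniform placeholder, replaced by real algorithms below.
bandit = BernoulliBandit([0.1, 0.3, 0.5, 0.4, 0.2], seed=0)
rng = np.random.default_rng(1)
history = []
for t in range(100):
    a = int(rng.integers(len(bandit.mu)))  # placeholder choice of A_t
    x = bandit.pull(a)
    history.append((a, x))
```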
Regret minimization in a bandit model
Samples = rewards; $(A_t)$ is adjusted to maximize the (expected) sum of rewards,
$\mathbb{E}\left[\sum_{t=1}^T X_t\right],$
or equivalently to minimize the regret
$R_T = \mathbb{E}\left[T\mu^* - \sum_{t=1}^T X_t\right], \qquad \mu^* = \mu_{a^*} = \max_a \mu_a.$
→ Exploration/Exploitation tradeoff.
Modern motivation: recommendation tasks
For the $t$-th visitor of a website:
- recommend a movie $A_t$
- observe a rating $X_t \sim \nu_{A_t}$ (e.g. $X_t \in \{1, \ldots, 5\}$)
Goal: maximize the sum of ratings.
Back to the initial motivation: clinical trials
$\mathcal{B}(\mu_1), \mathcal{B}(\mu_2), \mathcal{B}(\mu_3), \mathcal{B}(\mu_4), \mathcal{B}(\mu_5)$
For the $t$-th patient in a clinical study:
- choose a treatment $A_t$
- observe a response $X_t \in \{0, 1\}$: $P(X_t = 1) = \mu_{A_t}$
Goal: maximize the number of patients healed during the study.
Alternative goal: allocate the treatments so as to identify the best treatment as quickly as possible (no focus on curing patients during the study).
Two bandit problems
Regret minimization:
- Bandit algorithm: sampling rule $(A_t)$
- Input: horizon $T$
- Objective: minimize $R_T = \mathbb{E}_\mu\left[T\mu^* - \sum_{t=1}^T X_t\right]$
- Exploration/Exploitation
Best arm identification:
- Bandit algorithm: sampling rule $(A_t)$, stopping rule $\tau$, recommendation rule $\hat a_\tau$
- Input: risk parameter $\delta$
- Objective: ensure $P(\hat a_\tau = a^*) \geq 1 - \delta$ and minimize $\mathbb{E}[\tau]$
- pure Exploration
Goal: find efficient, optimal algorithms for both objectives.
In this presentation, we focus on Bernoulli bandit models: $\nu_a = \mathcal{B}(\mu_a)$, $\mu = (\mu_1, \ldots, \mu_K)$.
Outline
1. Asymptotically optimal algorithms for regret minimization
2. A Bayesian approach for regret minimization
   - Bayes-UCB
   - Thompson Sampling
3. Optimal algorithms for Best Arm Identification
Optimal algorithms for regret minimization
$\mu = (\mu_1, \ldots, \mu_K)$. $N_a(t)$: number of draws of arm $a$ up to time $t$.
$R_\mu(\mathcal{A}, T) = \mu^* T - \mathbb{E}_\mu\left[\sum_{t=1}^T X_t\right] = \sum_{a=1}^K (\mu^* - \mu_a)\,\mathbb{E}_\mu[N_a(T)]$
Notation: Kullback-Leibler divergence
$d(\mu, \mu') := \mathrm{KL}\left(\mathcal{B}(\mu), \mathcal{B}(\mu')\right) = \mu \log(\mu/\mu') + (1-\mu)\log((1-\mu)/(1-\mu'))$
[Lai and Robbins, 1985]: for uniformly efficient algorithms,
$\mu_a < \mu^* \;\Rightarrow\; \liminf_{T \to \infty} \frac{\mathbb{E}_\mu[N_a(T)]}{\log T} \geq \frac{1}{d(\mu_a, \mu^*)}$
A bandit algorithm is asymptotically optimal if, for every $\mu$,
$\mu_a < \mu^* \;\Rightarrow\; \limsup_{T \to \infty} \frac{\mathbb{E}_\mu[N_a(T)]}{\log T} \leq \frac{1}{d(\mu_a, \mu^*)}$
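The Bernoulli KL divergence $d$ is easy to compute; a small sketch (our helper name `kl_bernoulli`, with clipping added for numerical safety) also evaluates the Lai-Robbins bound on a concrete instance.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """d(p, q) = KL(B(p), B(q)) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Lai & Robbins: with mu* = 0.5, an arm of mean 0.4 must be drawn at least
# about log(T) / d(0.4, 0.5) times by any uniformly efficient algorithm.
print(np.log(1e4) / kl_bernoulli(0.4, 0.5))  # ~ 457 draws for T = 10^4
```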
Algorithms: naive ideas
Idea 1: Choose each arm $T/K$ times → EXPLORATION
Idea 2: Always choose the best arm so far, $A_{t+1} = \operatorname{argmax}_a \hat\mu_a(t)$ → EXPLOITATION
... linear regret.
Idea 3: First explore the arms uniformly, then commit to the empirical best until the end → EXPLORATION followed by EXPLOITATION
... still sub-optimal.
Mixing Exploration and Exploitation: the UCB approach
$\mathcal{I}_a(t) = [\mathrm{LCB}_a(t), \mathrm{UCB}_a(t)]$: a confidence interval on $\mu_a$, based on the observations up to round $t$.
A UCB-type (or optimistic) algorithm chooses at round $t$:
$A_{t+1} = \operatorname{argmax}_{a=1,\ldots,K} \mathrm{UCB}_a(t).$
How to choose the Upper Confidence Bounds? Use appropriate deviation inequalities... in order to guarantee that $P(\mu_a \leq \mathrm{UCB}_a(t)) \gtrsim 1 - t^{-1}$.
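As an illustration of the optimistic principle, here is a sketch of a UCB loop with a Hoeffding-based index (a common textbook choice, not the KL-based index of the next slide); it reuses the `BernoulliBandit` sketch above.

```python
import numpy as np

def ucb(bandit, T):
    """Optimistic strategy: play the arm with the largest upper confidence bound."""
    K = len(bandit.mu)
    N = np.zeros(K)  # N_a(t): number of draws of arm a
    S = np.zeros(K)  # S_a(t): sum of rewards from arm a
    for t in range(T):
        if t < K:
            a = t  # initialization: draw each arm once
        else:
            # UCB_a(t) = mu_hat_a(t) + sqrt(log(t) / (2 N_a(t)))  (Hoeffding)
            a = int(np.argmax(S / N + np.sqrt(np.log(t) / (2 * N))))
        x = bandit.pull(a)
        N[a] += 1
        S[a] += x
    return N  # suboptimal arms end up drawn O(log T) times
```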
The KL-UCB algorithm
$\hat\mu_a(t)$: empirical mean of the rewards from arm $a$ up to time $t$.
A deviation inequality involving the KL divergence:
$P\left(N_a(t)\, d(\hat\mu_a(t), \mu_a) \geq \gamma\right) \leq 2e(\log(t)+1)\,\gamma\, e^{-\gamma}$
The KL-UCB algorithm: $A_{t+1} = \operatorname{argmax}_a u_a(t)$, with
$u_a(t) := \max\left\{ q : d(\hat\mu_a(t), q) \leq \frac{\log(t) + c \log\log(t)}{N_a(t)} \right\}$
[Figure: $u_a(t)$ is the largest $q \geq \hat\mu_a(t)$ at which the curve $q \mapsto d(\hat\mu_a(t), q)$ stays below the threshold $\log(t)/N_a(t)$.]
[Cappé et al. 13]: KL-UCB satisfies, for $c \geq 5$,
$\mathbb{E}_\mu[N_a(T)] \leq \frac{1}{d(\mu_a, \mu^*)} \log T + O(\sqrt{\log T}).$
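The index $u_a(t)$ has no closed form, but since $q \mapsto d(\hat\mu_a(t), q)$ is increasing on $[\hat\mu_a(t), 1]$ it can be computed by bisection; a sketch reusing `kl_bernoulli` from above (the guard for small $t$ is our addition):

```python
import numpy as np

def klucb_index(mu_hat, n, t, c=5.0, tol=1e-6):
    """u_a(t) = max{ q : n * d(mu_hat, q) <= log(t) + c log log(t) }."""
    log_t = np.log(max(t, 3))                  # guard: log log(t) needs t > e
    level = (log_t + c * np.log(log_t)) / n
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:                       # bisection on [mu_hat, 1]
        mid = (lo + hi) / 2
        if kl_bernoulli(mu_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo
```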
A frequentist or a Bayesian model?
$\mu = (\mu_1, \ldots, \mu_K)$. Two probabilistic modelings:
Frequentist model:
- $\mu_1, \ldots, \mu_K$ are unknown parameters
- arm $a$: $(Y_{a,s})_s \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\mu_a)$
Bayesian model:
- $\mu_1, \ldots, \mu_K$ are drawn from a prior distribution: $\mu_a \sim \pi_a$
- arm $a$: $(Y_{a,s})_s \mid \mu \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\mu_a)$
The regret can be computed in each case:
- Frequentist regret (regret): $R_T(\mathcal{A}, \mu) = \mathbb{E}_\mu\left[\sum_{t=1}^T (\mu^* - \mu_{A_t})\right]$
- Bayesian regret (Bayes risk): $R_T(\mathcal{A}, \pi) = \mathbb{E}_{\mu \sim \pi}\left[\sum_{t=1}^T (\mu^* - \mu_{A_t})\right] = \int R_T(\mathcal{A}, \mu)\, d\pi(\mu)$
Frequentist and Bayesian algorithms
Two types of tools to build bandit algorithms:
Frequentist tools:
- MLE estimators of the means
- Confidence intervals
Bayesian tools:
- Posterior distributions $\pi_a^t = \mathcal{L}(\mu_a \mid X_{a,1}, \ldots, X_{a,N_a(t)})$
[Figure: confidence intervals (frequentist) vs. posterior distributions (Bayesian) for the arms of a bandit instance.]
One can separate tools and objective: we present efficient Bayesian algorithms for regret minimization.
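For Bernoulli rewards the posterior is explicit by conjugacy; a quick sketch with illustrative counts (the numbers are ours):

```python
from scipy.stats import beta

# Uniform Beta(1,1) prior on mu_a; after N_a draws of arm a with S_a ones,
# the posterior is Beta(S_a + 1, N_a - S_a + 1).
S_a, N_a = 18, 21
posterior = beta(S_a + 1, N_a - S_a + 1)
print(posterior.mean())           # posterior mean (S_a + 1) / (N_a + 2)
print(posterior.interval(0.95))   # central 95% credible interval for mu_a
```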
The Bayes-UCB algorithm
$\pi_a^t$: the posterior distribution over $\mu_a$ at the end of round $t$.
Algorithm: Bayes-UCB [K., Cappé, Garivier 2012]
$A_{t+1} = \operatorname{argmax}_a\, Q\left(1 - \frac{1}{t(\log t)^c},\, \pi_a^t\right)$
where $Q(\alpha, p)$ is the quantile of order $\alpha$ of the distribution $p$.
Bernoulli rewards with uniform prior:
$\pi_a^0 \overset{\text{i.i.d.}}{\sim} \mathcal{U}([0,1]) = \mathrm{Beta}(1,1)$
$\pi_a^t = \mathrm{Beta}\left(S_a(t) + 1,\, N_a(t) - S_a(t) + 1\right)$
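The Bayes-UCB index is then just a Beta quantile, directly available in scipy; a minimal sketch (the guard for the first rounds is our addition):

```python
import numpy as np
from scipy.stats import beta

def bayes_ucb_index(S_a, N_a, t, c=5.0):
    """Quantile of order 1 - 1/(t (log t)^c) of Beta(S_a + 1, N_a - S_a + 1),
    where S_a is the sum of rewards and N_a the number of draws of arm a."""
    log_t = max(np.log(t), 1.0)          # guard for the first rounds
    level = 1.0 - 1.0 / (t * log_t ** c)
    return beta.ppf(level, S_a + 1, N_a - S_a + 1)
```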
Bayes-UCB in practice
[Figure: posterior distributions of the arms while Bayes-UCB runs, with the quantile indices and the number of draws of each arm.]
Theoretical results
Bayes-UCB is asymptotically optimal.
Theorem [K., Cappé, Garivier 2012]
Let $\epsilon > 0$. The Bayes-UCB algorithm using a uniform prior over the arms and parameter $c \geq 5$ satisfies
$\mathbb{E}_\mu[N_a(T)] \leq \frac{1+\epsilon}{d(\mu_a, \mu^*)} \log(T) + o_{\epsilon,c}(\log(T)).$
Links to a frequentist algorithm
The Bayes-UCB index $q_a(t)$ is close to the KL-UCB index:
Lemma
$\tilde u_a(t) \leq q_a(t) \leq u_a(t)$, with
$u_a(t) = \max\left\{ q : d(\hat\mu_a(t), q) \leq \frac{\log(t) + c\log\log(t)}{N_a(t)} \right\}$
$\tilde u_a(t) = \max\left\{ q : d\left(\frac{N_a(t)\hat\mu_a(t)}{N_a(t)+1}, q\right) \leq \frac{\log\frac{t}{N_a(t)+2} + c\log\log(t)}{N_a(t)+1} \right\}$
→ Bayes-UCB automatically builds confidence intervals based on the Kullback-Leibler divergence!
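A quick numerical illustration of the upper half of this sandwich, on hypothetical counts (reusing `klucb_index` and `bayes_ucb_index` from above):

```python
# q_a(t) <= u_a(t): the posterior quantile sits below the KL-UCB index.
S_a, N_a, t = 18, 40, 1000
print(bayes_ucb_index(S_a, N_a, t))    # Bayes-UCB index q_a(t)
print(klucb_index(S_a / N_a, N_a, t))  # KL-UCB index u_a(t), slightly larger
```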
Where does it come from?
We have a tight bound on the tails of the posterior distributions (Beta distributions).
First element: the link between the Beta and Binomial distributions: if $X_{a,b} \sim \mathrm{Beta}(a,b)$ and $S_{n,x} \sim \mathrm{Bin}(n,x)$,
$P(X_{a,b} \geq x) = P(S_{a+b-1,\,1-x} \geq b)$
Second element: Sanov inequalities:
Lemma
For $k > nx$, $\quad \frac{e^{-n\,d(k/n,\,x)}}{n+1} \leq P(S_{n,x} \geq k) \leq e^{-n\,d(k/n,\,x)}$.
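The Beta-Binomial identity is easy to check numerically (the values of $a$, $b$, $x$ are illustrative):

```python
from scipy.stats import beta, binom

a, b, x = 5, 8, 0.3
print(beta.sf(x, a, b))                   # P(X_{a,b} >= x)
print(binom.sf(b - 1, a + b - 1, 1 - x))  # P(S_{a+b-1, 1-x} >= b): same value
```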
Thompson Sampling
$(\pi_1^t, \ldots, \pi_K^t)$: posterior distributions on $(\mu_1, \ldots, \mu_K)$ at round $t$.
Algorithm: Thompson Sampling, a randomized Bayesian algorithm:
$\forall a \in \{1, \ldots, K\},\ \theta_a(t) \sim \pi_a^t$
$A_{t+1} = \operatorname{argmax}_a\, \theta_a(t)$
→ draw each arm according to its posterior probability of being optimal.
The first bandit algorithm, introduced by [Thompson 1933].
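A minimal sketch for Bernoulli arms with uniform priors, reusing the `BernoulliBandit` class from above:

```python
import numpy as np

def thompson_sampling(bandit, T, seed=None):
    """Thompson Sampling with independent Beta(1,1) priors on the means."""
    rng = np.random.default_rng(seed)
    K = len(bandit.mu)
    S = np.zeros(K)  # sums of rewards
    N = np.zeros(K)  # numbers of draws
    for t in range(T):
        # one sample theta_a(t) from each posterior Beta(S_a + 1, N_a - S_a + 1)
        theta = rng.beta(S + 1, N - S + 1)
        a = int(np.argmax(theta))  # play the arm whose sample is largest
        x = bandit.pull(a)
        S[a] += x
        N[a] += 1
    return N
```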
Illustration of the algorithm
[Figure: posterior distributions of the mean of arm 1 (top) and arm 2 (bottom) in a two-armed bandit model, with the samples $\theta_1(t)$ and $\theta_2(t)$.]
Thompson Sampling is asymptotically optimal
- good empirical performance in complex models
- first logarithmic regret bound by [Agrawal and Goyal 2012]
Theorem [K., Korda, Munos 2012]
For all $\epsilon > 0$,
$\mathbb{E}_\mu[N_a(T)] \leq \frac{1+\epsilon}{d(\mu_a, \mu^*)} \log(T) + o_{\mu,\epsilon}(\log(T)).$
A sample complexity lower bound
A Best Arm Identification algorithm $(A_t, \tau, \hat a_\tau)$ is $\delta$-PAC if, for all $\mu$, $P_\mu(\hat a_\tau = a^*(\mu)) \geq 1 - \delta$.
Theorem [Garivier and K. 2016]
For any $\delta$-PAC algorithm,
$\mathbb{E}_\mu[\tau] \geq T^*(\mu) \log\left(\frac{1}{2.4\,\delta}\right),$
where
$T^*(\mu)^{-1} = \sup_{w \in \Sigma_K}\ \inf_{\{\lambda :\, a^*(\lambda) \neq a^*(\mu)\}}\ \sum_{a=1}^K w_a\, d(\mu_a, \lambda_a)$
with $\Sigma_K = \{w \in [0,1]^K : \sum_{i=1}^K w_i = 1\}$.
Optimal proportions of draws
The vector
$w^*(\mu) = \operatorname{argmax}_{w \in \Sigma_K}\ \inf_{\{\lambda :\, a^*(\lambda) \neq a^*(\mu)\}}\ \sum_{a=1}^K w_a\, d(\mu_a, \lambda_a)$
contains the optimal proportions of draws of the arms, i.e. an algorithm matching the lower bound should satisfy
$\forall a \in \{1, \ldots, K\},\quad \frac{\mathbb{E}_\mu[N_a(\tau)]}{\mathbb{E}_\mu[\tau]} \simeq w_a^*(\mu).$
Building on this notion of optimal proportions, one can exhibit an asymptotically optimal algorithm:
$\lim_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau_\delta]}{\log(1/\delta)} = T^*(\mu).$
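For Bernoulli arms, $w^*(\mu)$ and $T^*(\mu)$ can be computed numerically: against each challenger arm, the inner infimum is attained when the best and challenger means are both moved to their $w$-weighted average, and the outer supremum can be handed to a generic solver. A sketch under these assumptions (the function name is ours), reusing `kl_bernoulli`:

```python
import numpy as np
from scipy.optimize import minimize

def neg_value(w, mu):
    """-(inf over alternatives) of sum_a w_a d(mu_a, lambda_a)."""
    best = int(np.argmax(mu))
    vals = []
    for a in range(len(mu)):
        if a == best:
            continue
        # inner inf against challenger a: move both means to their weighted mean
        m = (w[best] * mu[best] + w[a] * mu[a]) / (w[best] + w[a])
        vals.append(w[best] * kl_bernoulli(mu[best], m)
                    + w[a] * kl_bernoulli(mu[a], m))
    return -min(vals)

mu = np.array([0.5, 0.45, 0.3])
K = len(mu)
res = minimize(neg_value, np.ones(K) / K, args=(mu,),
               bounds=[(1e-6, 1.0)] * K,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}])
w_star, T_star = res.x, 1.0 / -res.fun  # optimal proportions and T*(mu)
```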
Conclusion
We saw two Bayesian algorithms that are good alternatives to KL-UCB for regret minimization because:
- they are also asymptotically optimal in simple models
- they display better empirical performance
- they can be easily generalized to more complex models
Algorithms for regret minimization and BAI are very different!
- playing mostly the best arm vs. drawing the arms according to the optimal proportions [Figure: number of draws of each arm under the two objectives.]
- different complexity terms (featuring the KL-divergence):
$R_T \simeq \left(\sum_{a \neq a^*} \frac{\mu^* - \mu_a}{d(\mu_a, \mu^*)}\right) \log(T) \qquad \text{vs.} \qquad \mathbb{E}_\mu[\tau] \simeq T^*(\mu)\,\log(1/\delta)$
Thank you for your attention!
Bayesian algorithms in contextual linear bandit models
At time $t$, a set of contexts $D_t \subset \mathbb{R}^d$ is revealed (= characteristics of the items to recommend).
The model: if the context $x_t \in D_t$ is selected, a reward $r_t = x_t^T \theta + \epsilon_t$ is received, where $\theta \in \mathbb{R}^d$ is the underlying preference vector.
A Bayesian model (with a Gaussian prior):
$r_t = x_t^T \theta + \epsilon_t, \qquad \theta \sim \mathcal{N}(0, \kappa^2 I_d), \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2).$
Explicit posterior: $p(\theta \mid x_1, r_1, \ldots, x_t, r_t) = \mathcal{N}(\hat\theta(t), \Sigma_t)$.
Thompson Sampling: $\tilde\theta(t) \sim \mathcal{N}(\hat\theta(t), \Sigma_t)$, and $x_{t+1} = \operatorname{argmax}_{x \in D_{t+1}} x^T \tilde\theta(t)$.
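A sketch of one round of this linear Thompson Sampling under the Gaussian model above; the posterior formulas are standard Bayesian linear regression, and the function name and array layout are ours:

```python
import numpy as np

def lin_ts_choice(contexts, X_past, r_past, kappa2=1.0, sigma2=0.25, seed=None):
    """contexts: candidate set D_{t+1} as an (m, d) array;
    X_past: (n, d) past selected contexts; r_past: (n,) past rewards."""
    rng = np.random.default_rng(seed)
    d = contexts.shape[1]
    # Conjugate posterior: Sigma_t = (X^T X / sigma^2 + I_d / kappa^2)^{-1},
    # theta_hat(t) = Sigma_t X^T r / sigma^2
    Sigma = np.linalg.inv(X_past.T @ X_past / sigma2 + np.eye(d) / kappa2)
    theta_hat = Sigma @ X_past.T @ r_past / sigma2
    theta = rng.multivariate_normal(theta_hat, Sigma)   # posterior sample
    return contexts[int(np.argmax(contexts @ theta))]   # greedy w.r.t. the sample
```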