Stratégies bayésiennes et fréquentistes dans un modèle de bandit

Size: px

Start display at page:

Download "Stratégies bayésiennes et fréquentistes dans un modèle de bandit"

Julius Adams
5 years ago
Views:

1 Stratégies bayésiennes et fréquentistes dans un modèle de bandit thèse effectuée à Telecom ParisTech, co-dirigée par Olivier Cappé, Aurélien Garivier et Rémi Munos Journées MAS, Grenoble, 30 août 2016

2 The multi-armed bandit model K arms = K probability distributions (ν a has mean µ a ) ν 1 ν 2 ν 3 ν 4 ν 5 At round t, an agent: chooses an arm A t observes a sample X t ν At using a sequential sampling strategy (A t ): A t+1 = F t (A 1, X 1,..., A t, X t ). Generic goal: learn the best arm, a = argmax a µ a

3 Regret minimization in a bandit model Samples = rewards, (A t ) is adjusted to maximize the (expected) sum of rewards, [ T ] E X t t=1 or equivalently minimize the regret: [ ] T R T = E T µ X t t=1 µ = µ a = max µ a a Exploration/Exploitation tradeoff

4 Modern motivation: recommendation tasks ν 1 ν 2 ν 3 ν 4 ν 5 For the t-th visitor of a website, recommend a movie A t observe a rating X t ν At (e.g. X t {1,..., 5}) Goal: maximize the sum of ratings

5 Back to the initial motivation: clinical trials B(µ 1 ) B(µ 2 ) B(µ 3 ) B(µ 4 ) B(µ 5 ) For the t-th patient in a clinical study, chooses a treatment A t observes a response X t {0, 1}: P(X t = 1) = µ At Goal: maximize the number of patient healed during the study

6 Back to the initial motivation: clinical trials B(µ 1 ) B(µ 2 ) B(µ 3 ) B(µ 4 ) B(µ 5 ) For the t-th patient in a clinical study, chooses a treatment A t observes a response X t {0, 1}: P(X t = 1) = µ At Goal: maximize the number of patient healed during the study Alternative goal: allocate the treatments so as to identify as quickly as possible the best treatment (no focus on curing patients during the study)

7 Two bandit problems Regret minimization Best arm identification sampling rule (A t ) Bandit sampling rule (A t ) stopping rule τ algorithm recommendation rule â τ Input horizon T risk parameter δ [ minimize ensure P(â τ = a ) 1 δ Objective R T = E µ T ] T t=1 X t and minimize E[τ] Exploration/Exploitation pure Exploration Goal: find efficient, optimal algorithms for both objectives In the presentation, we focus on Bernoulli bandit models µ = (µ 1,..., µ K )

8 Outline 1 Asymptotically optimal algorithms for regret minimization 2 A Bayesian approach for regret minimization Bayes-UCB Thompson Sampling 3 Optimal algorithms for Best Arm Identification

9 Outline 1 Asymptotically optimal algorithms for regret minimization 2 A Bayesian approach for regret minimization Bayes-UCB Thompson Sampling 3 Optimal algorithms for Best Arm Identification

10 Optimal algorithms for regret minimization µ = (µ 1,..., µ K ). N a (t) : number of draws of arm a up to time t [ T ] K R µ (A, T ) = µ T E µ X t = (µ µ a )E µ [N a (T )] t=1 Notation: Kullback-Leibler divergence d(µ, µ ) := KL ( B(µ), B(µ ) ) a=1 = µ log(µ/µ ) + (1 µ) log((1 µ)/(1 µ )) [Lai and Robbins, 1985]: for uniformly efficient algorithms, µ a < µ lim inf T E µ [N a (T )] log T 1 d(µ a, µ ) A bandit algorithm is asymptotically optimal if, for every µ, µ a < µ E µ [N a (T )] 1 lim sup T log T d(µ a, µ )

11 Algorithms: naive ideas Idea 1 : Choose each arm T /K times EXPLORATION Idea 2 : Always choose the best arm so far EXPLOITATION A t+1 = argmax a ˆµ a (t)...linear regret

12 Algorithms: naive ideas Idea 1 : Choose each arm T /K times EXPLORATION Idea 2 : Always choose the best arm so far EXPLOITATION A t+1 = argmax a ˆµ a (t)...linear regret Idea 3 : First explore the arms uniformly, then commit to the empirical best until the end EXPLORATION followed by EXPLOITATION...Still sub-optimal

13 Mixing Exploration and Exploitation: the UCB approach I a (t) = [LCB a (t), UCB a (t)] a confidence interval on µ a, based on observation up to round t. A UCB-type (or optimistic) algorithm chooses at round t A t+1 = argmax a=1...k UCB a (t).

14 Mixing Exploration and Exploitation: the UCB approach I a (t) = [LCB a (t), UCB a (t)] a confidence interval on µ a, based on observation up to round t. A UCB-type (or optimistic) algorithm chooses at round t A t+1 = argmax a=1...k UCB a (t). How to choose the Upper Confidence Bounds? use appropriate deviation inequalities in order to guarantee that P(µ a UCB a (t)) 1 t 1

15 The KL-UCB algorithm ˆµ a (t): empirical mean of rewards from arm a up to time t. A deviation inequality involving the KL-divergence: P (N a (t)d(ˆµ a (t), µ a ) γ) 2e(log(t) + 1)γe γ The KL-UCB algorithm: A t+1 = arg max a { u a (t) := max q : d (ˆµ a (t), q) u a (t) with log(t) + c log log(t) N a (t) d(µ a (t),q) }, log(t)/n a (t) µ a (t) [ ] u a (t) q [Cappé et al. 13]: KL-UCB satisfies, for c 5, E µ [N a (T )] 1 d(µ a, µ ) log T + O( log(t )).

16 Outline 1 Asymptotically optimal algorithms for regret minimization 2 A Bayesian approach for regret minimization Bayes-UCB Thompson Sampling 3 Optimal algorithms for Best Arm Identification

17 A frequentist or a Bayesian model? µ = (µ 1,..., µ K ). Two probabilistic modelings Frequentist model Bayesian model µ 1,..., µ K µ 1,..., µ K drawn from a unknown parameters prior distribution : µ a π a arm a: (Y a,s ) s i.i.d. B(µ a ) The regret can be computed in each case arm a: (Y a,s ) s µ i.i.d. B(µ a ) Frequentist regret (regret) [ T ] R T (A, µ)= E µ t=1 (µ µ At ) Bayesian regret (Bayes risk) [ T ] R T (A, π)= E µ π t=1 (µ µ At ) = R T (A, µ)dπ(µ)

18 Frequentist and Bayesian algorithms Two types of tools to build bandit algorithms: Frequentist tools MLE estimators of the means Confidence Intervals Bayesian tools Posterior distributions πa t = L(µ a X a,1,..., X a,na(t)) One can separate tools and objective: We present efficient Bayesian algorithms for regret minimization

19 Outline 1 Asymptotically optimal algorithms for regret minimization 2 A Bayesian approach for regret minimization Bayes-UCB Thompson Sampling 3 Optimal algorithms for Best Arm Identification

20 The Bayes-UCB algorithm π t a the posterior distribution over µ a at the end of round t. Algorithm: Bayes-UCB [K., Cappé, Garivier 2012] ( Q 1 A t+1 = argmax a 1 t(log t) c, πt a where Q(α, p) is the quantile of order α of the distribution p. ) Bernoulli reward with uniform prior: π 0 a i.i.d U([0, 1]) = Beta(1, 1) π t a = Beta(S a (t) + 1, N a (t) S a (t) + 1)

21 Bayes-UCB in practice

22 Theoretical results Bayes-UCB is asymptotically optimal Theorem [K.,Cappé,Garivier 2012] Let ɛ > 0. The Bayes-UCB algorithm using a uniform prior over the arms and parameter c 5 satisfies E µ [N a (T )] 1 + ɛ d(µ a, µ ) log(t ) + o ɛ,c (log(t )).

23 Links to a frequentist algorithm Bayes-UCB index is close to KL-UCB indices: Lemma with: u a (t) = max ũ a (t) = max q : d { q : d (ˆµ a (t), q) ũ a (t) q a (t) u a (t) ( ) Na (t)ˆµ a (t) log N a (t) + 1, q } log(t) + c log log(t) N a (t) ( ) + c log log(t) t N a(t)+2 (N a (t) + 1) Bayes-UCB automatically builds confidence intervals based on the Kullback-Leibler divergence!

24 Where does it come from? We have a tight bound on the tail of posterior distributions (Beta distributions) First element: link between Beta and Binomial distribution: P(X a,b x) = P(S a+b 1,1 x b) Second element: Sanov inequalities Lemma For k > nx, e nd( k n,x) n + 1 P(S n,x k) e nd( k n,x)

25 Outline 1 Asymptotically optimal algorithms for regret minimization 2 A Bayesian approach for regret minimization Bayes-UCB Thompson Sampling 3 Optimal algorithms for Best Arm Identification

26 Thompson Sampling (π1 t,.., πt K ) posterior distribution on (µ 1,.., µ K ) at round t. Algorithm: Thompson Sampling Thompson Sampling is a randomized Bayesian algorithm: a {1..K}, θ a (t) πa t A t+1 = argmax a θ a (t) Draw each arm according to its posterior probability of being optimal the first bandit algorithm, introduced by [Thompson 1933]

27 Illustration of the algorithm θ (t) μ μ 2 θ 2 (t) Posterior distributions of the mean of arm 1 (top) and arm 2 (bottom) in a two-armed bandit model

28 Thompson Sampling is asymptotically optimal good empirical performance in complex models first logarithmic regret bound by [Agrawal and Goyal 2012] Theorem [K.,Korda,Munos 2012] For all ɛ > 0, E µ [N a (T )] 1 + ɛ d(µ a, µ ) log(t ) + o µ,ɛ(log(t )).

29 Outline 1 Asymptotically optimal algorithms for regret minimization 2 A Bayesian approach for regret minimization Bayes-UCB Thompson Sampling 3 Optimal algorithms for Best Arm Identification

30 A sample complexity lower bound A Best Arm Identification algorithm (A t, τ, â τ ) is δ-pac if µ, P µ (â τ = a (µ)) 1 δ. Theorem [Garivier and K. 2016] For any δ-pac algorithm, where T (µ) 1 = sup w Σ K E µ [τ] T (µ) log inf ( ) 1, 2.4δ {λ:a (λ) a (µ)} a=1 with Σ K = {w [0, 1] K : K i=1 w i = 1}. K w a d(µ a, λ a )

31 Optimal proportion of draws The vector w (µ) = argmax w Σ K inf {λ:a (λ) a (µ)} a=1 K w a d(µ a, λ a ) contains the optimal proportions of draws of the arms, i.e. an algorithm matching the lower bound should satisfy a {1,..., K}, E µ [N a (τ)] E µ [τ] w a (µ). Building on this notion of optimal proportions, one can exhibit an asymptotically optimal algorithm: lim δ E µ [τ δ ] log(1/δ) = T (µ).

32 Conclusion We saw two Bayesian algorithms that are good alternative to KL-UCB for regret minimization because: they are also asymptotically optimal in simple models they display better empirical performance... they can be easily generalized to more complex models Algorithms for regret minimization and BAI are very different! playing mostly the best arm vs. optimal proportions different complexity terms (featuring KL-divergence) ( R T µ µ ) a d(µ a, µ log(t ) E µ [τ] T (µ) log (1/δ) ) a a

33 Merci de votre attention. Merci!

34 Bayesian algorithms in contextual linear bandit models At time t, a set of contexts D t R d is revealed. = characteristics of the items to recommend The model: if the context x t D t is selected a reward r t = x T t θ + ɛ t is received θ R d = underlying preference vector A Bayesian model: (with Gaussian prior) r t = xt T θ + ɛ t, θ N ( 0, κ 2 ) I d, ɛt N ( 0, σ 2). ) Explicit posterior: p(θ x 1, r 1,..., x t, r t ) = N (ˆθ(t), Σt. Thompson Sampling: ) θ(t) N (ˆθ(t), Σ t, and x t+1 = argmax x T θ(t). x D t+1

Revisiting the Exploration-Exploitation Tradeoff in Bandit Models

Revisiting the Exploration-Exploitation Tradeoff in Bandit Models joint work with Aurélien Garivier (IMT, Toulouse) and Tor Lattimore (University of Alberta) Workshop on Optimization and Decision-Making