On Bayesian bandit algorithms


Slide 1: On Bayesian bandit algorithms
Emilie Kaufmann, joint work with Olivier Cappé, Aurélien Garivier, Nathaniel Korda and Rémi Munos.
Telecom ParisTech, July 1st, 2012.

Slides 2 to 5: Outline
1. Bayesian bandits versus frequentist bandits
2. Gittins' Bayesian solution
3. The Bayes-UCB algorithm
4. Thompson Sampling

Slide 6: Part 1. Bayesian bandits versus frequentist bandits: two probabilistic modellings

Slide 7: Two probabilistic modellings

K independent arms.
Frequentist: θ_1, ..., θ_K are unknown parameters, and (Y_{j,t})_t is i.i.d. with distribution ν_{θ_j} and mean µ_j.
Bayesian: the θ_j are drawn i.i.d. from priors π_j, and (Y_{j,t})_t is i.i.d. conditionally on θ_j, with distribution ν_{θ_j}.
µ* = max_j µ_j denotes the highest expected reward.
At time t, arm I_t is chosen and the reward X_t = Y_{I_t,t} is observed.

Two measures of performance:
- minimize the (classic) regret R_n(θ) = E_θ[ Σ_{t=1}^n (µ* − µ_{I_t}) ];
- minimize the Bayesian regret R_n = ∫ R_n(θ) dπ(θ).
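For concreteness, here is a minimal Python sketch of this setting with Bernoulli rewards; the `BernoulliBandit` class and `pseudo_regret` helper are illustrative names introduced here, not objects from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

class BernoulliBandit:
    """Frequentist model: K arms with fixed, unknown means mu_1, ..., mu_K."""
    def __init__(self, mu):
        self.mu = np.asarray(mu, dtype=float)

    def pull(self, j):
        """Draw a Bernoulli(mu_j) reward for arm j."""
        return float(rng.random() < self.mu[j])

def pseudo_regret(bandit, chosen_arms):
    """R_n(theta) = sum_t (mu_star - mu_{I_t}) for a sequence of chosen arms."""
    gaps = bandit.mu.max() - bandit.mu[np.asarray(chosen_arms)]
    return float(gaps.sum())
```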

Slide 8: Our goal (Bayesian versus frequentist bandit algorithms)

Design Bayesian bandit algorithms which are optimal in terms of frequentist regret.
Lai and Robbins' asymptotic rate for the regret: for every suboptimal arm j,

liminf_{n→∞} E_θ[N_n(j)] / log(n) ≥ 1 / KL(ν_{θ_j}, ν_{θ*}),

and consequently

liminf_{n→∞} E_θ[R_n] / log(n) ≥ Σ_{j suboptimal} (µ* − µ_j) / KL(ν_{θ_j}, ν_{θ*}).
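In the Bernoulli case the KL term has a closed form, so the Lai and Robbins constant is easy to evaluate; a small sketch (the `kl_bernoulli` helper is reused by later snippets):

```python
import numpy as np

def kl_bernoulli(x, y):
    """d(x, y) = KL(B(x), B(y)) = x log(x/y) + (1-x) log((1-x)/(1-y))."""
    eps = 1e-12  # clip away from {0, 1} to avoid log(0)
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

# Lai-Robbins rate (coefficient of log n) for a two-armed Bernoulli problem:
mu_star, mu_j = 0.55, 0.45
print((mu_star - mu_j) / kl_bernoulli(mu_j, mu_star))  # ~ 4.98
```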

Slide 9: Some Bayesian and frequentist algorithms

Π_t = (π_1^t, ..., π_K^t): the current posterior over (θ_1, ..., θ_K).
Λ_t = (λ_1^t, ..., λ_K^t): the current posterior over the means (µ_1, ..., µ_K).
A Bayesian algorithm uses Π_{t−1} to determine action I_t.

Frequentist algorithms:
- upper confidence bound on the empirical mean (UCB) [Auer et al. 2002]
- UCB based on the KL-divergence (KL-UCB) [Garivier, Cappé 2011]

Bayesian algorithms:
- Gittins indices [Gittins, 1979]
- quantiles of the posterior
- samples from the posterior [Thompson, 1933]

Slide 10: Part 2. Gittins' Bayesian solution

Slide 11: The Finite-Horizon Gittins algorithm

Often heard: "Gittins solved the Bayesian MAB." This is ONLY PARTIALLY TRUE:
- it gives an optimal policy for the Bayesian discounted regret only,
- and only for simple parametric cases.

The Finite-Horizon Gittins algorithm:
- is Bayesian optimal for the finite-horizon problem,
- involves indices that are hard to compute,
- is heavily horizon-dependent,
- comes with no theoretical proof of its frequentist optimality.

Slide 12: Part 3. The Bayes-UCB algorithm

Slide 13: The general algorithm

Recall: Λ_t = (λ_1^t, ..., λ_K^t) is the current posterior over the means (µ_1, ..., µ_K).
The Bayes-UCB algorithm is the index policy associated with

q_j(t) = Q( 1 − 1/(t (log t)^c), λ_j^{t−1} ),

where Q(α, λ) denotes the quantile of order α of the distribution λ; i.e., at time t, choose I_t = argmax_{j=1..K} q_j(t).
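A minimal sketch of Bayes-UCB for Bernoulli arms with independent Beta(1,1) priors, assuming the `BernoulliBandit` class from the first snippet. Taking c = 0 and initializing with one draw per arm are implementation choices made here so the quantile level stays in (0, 1), not prescriptions from the talk (the theorem on slide 15 requires c ≥ 5):

```python
import numpy as np
from scipy.stats import beta

def bayes_ucb(bandit, n, c=0):
    """Bayes-UCB index policy for Bernoulli arms with Beta(1,1) priors:
    play the arm maximizing the quantile of order 1 - 1/(t (log t)^c)
    of the posterior Beta(S_j + 1, N_j - S_j + 1)."""
    K = len(bandit.mu)
    S, N = np.zeros(K), np.zeros(K)  # sums of rewards, numbers of draws
    choices = []
    for j in range(K):               # initialization: draw each arm once
        S[j] += bandit.pull(j)
        N[j] += 1
        choices.append(j)
    for t in range(K + 1, n + 1):
        level = 1.0 - 1.0 / (t * max(np.log(t), 1e-12) ** c)
        q = beta.ppf(level, S + 1, N - S + 1)  # posterior quantiles, one per arm
        j = int(np.argmax(q))
        S[j] += bandit.pull(j)
        N[j] += 1
        choices.append(j)
    return choices
```

For example, `bayes_ucb(BernoulliBandit([0.45, 0.55]), 500)` runs the policy on the second problem of slide 18.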

Slide 14: An illustration for Bernoulli bandits

[Figure: an illustration for Bernoulli bandits.]

Slide 15: Theoretical results for the Bernoulli case

ν_{θ_j} is the Bernoulli distribution B(µ_j), and π_j^0 is the (conjugate) prior Beta(1,1).
Bayes-UCB is frequentist optimal in this case.

Theorem (Kaufmann, Cappé, Garivier 2012). Let ε > 0. For the Bayes-UCB algorithm with parameter c ≥ 5, the number of draws of a suboptimal arm j satisfies

E_θ[N_n(j)] ≤ ( (1 + ε) / KL(B(µ_j), B(µ*)) ) log(n) + o_{ε,c}(log(n)).

Slide 16: Link to a frequentist algorithm

The Bayes-UCB index is close to the KL-UCB index: ũ_j(t) ≤ q_j(t) ≤ u_j(t), with

u_j(t) = max{ x > S_t(j)/N_t(j) : d( S_t(j)/N_t(j), x ) ≤ ( log(t) + c log(log(t)) ) / N_t(j) },

ũ_j(t) = max{ x > S_t(j)/(N_t(j)+1) : d( S_t(j)/(N_t(j)+1), x ) ≤ ( log(t/(N_t(j)+2)) + c log(log(t)) ) / (N_t(j)+1) },

where d(x, y) = KL(B(x), B(y)) = x log(x/y) + (1 − x) log((1 − x)/(1 − y)).

Bayes-UCB thus appears to automatically build confidence intervals based on the Kullback-Leibler divergence, adapted to the geometry of the problem in this specific case.
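Since d(p, ·) is zero at p and increasing on [p, 1], the KL-UCB index has no closed form but can be computed by bisection; a sketch reusing `kl_bernoulli` from the slide 8 snippet:

```python
import numpy as np

def kl_ucb_index(S_j, N_j, t, c=3, tol=1e-8):
    """u_j(t): largest x in [S_j/N_j, 1] such that
    d(S_j/N_j, x) <= (log(t) + c log(log(t))) / N_j, found by bisection."""
    p = S_j / N_j
    rhs = (np.log(t) + c * np.log(max(np.log(t), 1e-12))) / N_j
    if rhs <= 0:
        return p  # degenerate threshold at very small t
    lo, hi = p, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(p, mid) <= rhs:
            lo = mid   # mid still satisfies the constraint: move up
        else:
            hi = mid   # constraint violated: move down
    return lo
```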

Slide 17: Where does it come from? (Bernoulli case)

First element: the link between the Beta and Binomial distributions. If X_{a,b} ~ Beta(a, b) and S_{n,x} ~ Binomial(n, x), then

P(X_{a,b} ≥ x) = P(S_{a+b−1,x} ≤ a − 1).

Second element: a Sanov-type inequality gives, for k ≥ nx,

e^{−n d(k/n, x)} / (n + 1) ≤ P(S_{n,x} ≥ k) ≤ e^{−n d(k/n, x)}.
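Both ingredients are easy to check numerically with scipy; the particular values of a, b, x, n, k below are arbitrary illustrations:

```python
import numpy as np
from scipy.stats import beta, binom

# Beta-Binomial link: P(X_{a,b} >= x) = P(S_{a+b-1,x} <= a-1).
a, b, x = 3, 5, 0.4
print(beta.sf(x, a, b))                # P(X_{a,b} >= x)
print(binom.cdf(a - 1, a + b - 1, x))  # P(S_{a+b-1,x} <= a-1), same value

# Sanov-type bounds on the Binomial tail, for k >= n*x:
n, k = 50, 30
d = lambda p, q: p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
tail = binom.sf(k - 1, n, x)           # P(S_{n,x} >= k)
print(np.exp(-n * d(k / n, x)) / (n + 1) <= tail <= np.exp(-n * d(k / n, x)))  # True
```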

Slide 18: Experimental results

[Figure: cumulative regret curves for UCB, KL-UCB, Bayes-UCB and FH-Gittins, estimated with N = 5000 repetitions of the bandit game with horizon n = 500, on two different problems: θ_1 = 0.1, θ_2 = 0.2 and θ_1 = 0.45, θ_2 = 0.55.]
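A small-scale version of this experiment can be run with the snippets above (repetitions reduced here for speed; `bayes_ucb`, `BernoulliBandit` and `pseudo_regret` are the helpers sketched earlier):

```python
import numpy as np

# Small-scale rerun of the Bayes-UCB experiment on theta_1 = 0.45, theta_2 = 0.55
# (200 repetitions instead of N = 5000, horizon n = 500).
reps, horizon = 200, 500
total = 0.0
for _ in range(reps):
    bandit = BernoulliBandit([0.45, 0.55])
    total += pseudo_regret(bandit, bayes_ucb(bandit, horizon))
print("estimated cumulative regret at n = 500:", total / reps)
```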

Slide 19: Beyond the Bernoulli case

In more general cases, the Bayes-UCB algorithm is very close to existing frequentist algorithms:
- bandits with rewards in a one-parameter exponential family,
- Gaussian bandits with unknown mean and variance,
- the linear bandit setting with a prior over the parameter.

Slide 20: Part 4. Thompson Sampling

Slide 21: The algorithm

A very simple algorithm: for every j ∈ {1, ..., K}, sample s_{j,t} ~ λ_j^t, then choose I_t = argmax_j s_{j,t}.

(Recent) interest in this algorithm:
- a very old algorithm: it dates back to 1933,
- partial analyses proposed [Granmo 2010] [May, Korda, Lee, Leslie 2011],
- extensive numerical study beyond the Bernoulli case [Chapelle, Li 2011],
- first logarithmic upper bound on the regret [Agrawal, Goyal COLT 2012].
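A minimal sketch for Bernoulli arms with Beta(1,1) priors, again assuming the `BernoulliBandit` class from the first snippet:

```python
import numpy as np

rng = np.random.default_rng(1)

def thompson(bandit, n):
    """Thompson Sampling: draw one sample from each posterior
    Beta(S_j + 1, N_j - S_j + 1) and play the arm with the largest sample."""
    K = len(bandit.mu)
    S, N = np.zeros(K), np.zeros(K)
    choices = []
    for _ in range(n):
        samples = rng.beta(S + 1, N - S + 1)  # one posterior sample per arm
        j = int(np.argmax(samples))
        S[j] += bandit.pull(j)
        N[j] += 1
        choices.append(j)
    return choices
```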

Slide 22: An optimal regret bound for the Bernoulli case

Assume the first arm is the unique optimal arm and let Δ_a = µ_1 − µ_a.

First upper bound:
Theorem (Agrawal, Goyal 2012). E[R_n] ≤ C₁ Σ_{j=2}^K (1/Δ_j²) log(n) + o_θ(log(n)).

First optimal upper bound:
Theorem (Kaufmann, Korda, Munos 2012). For every ε > 0,
E[R_n] ≤ (1 + ε) Σ_{j=2}^K ( Δ_j / KL(B(µ_j), B(µ_1)) ) log(n) + o_{θ,ε}(log(n)).

Slide 23: Sketch of the analysis

Bound the expected number of draws of a suboptimal arm j (arm 1 is optimal).
The usual decomposition in an index-policy analysis is

E[N_T(j)] ≤ Σ_{t=1}^T P( ind_{1,t} < µ_1 ) + Σ_{t=1}^T P( ind_{j,t} > µ_1, I_t = j ).

The decomposition used for Thompson Sampling is

E[N_T(j)] ≤ Σ_{t=1}^T P( s_{1,t} ≤ µ_1 − √(6 ln(t)/N_t(1)) ) + Σ_{t=1}^T P( s_{j,t} > µ_1 − √(6 ln(t)/N_t(1)), I_t = j ).

Slide 24: Sketch of the analysis (continued)

An extra deviation inequality is needed.

Proposition. There exist constants b = b(µ_1, µ_j) ∈ (0, 1) and C_b = C_b(µ_1, µ_j) < ∞ such that

Σ_{t=1}^∞ P( N_t(1) ≤ t^b ) ≤ C_b.

Slide 25: Where does it come from?

(N_t(1) ≤ t^b) = (there exists a time range of length at least t^{1−b} − 1 with no draw of arm 1)

[Figure: illustration of such a range, with the levels µ and µ + ε marked.]

Slide 26: Numerical summary

[Figure: regret as a function of time (on a log scale) for UCB, UCB-V, DMED, KL-UCB, Bayes-UCB and Thompson Sampling, in a ten-arm problem with low rewards, horizon n = 20000, averaged over N = … trials.]

Slide 27: Conclusion and perspectives

You are now aware that:
- Bayesian algorithms are efficient for the frequentist MAB problem,
- Bayes-UCB shows striking similarities with frequentist algorithms,
- Bayes-UCB and Thompson Sampling are optimal for Bernoulli bandits.

Some perspectives:
- a better understanding of the Finite-Horizon Gittins indices,
- using Thompson Sampling with more involved priors,
- a more general analysis of Bayes-UCB and Thompson Sampling.

Slide 28: Summary of the contributions

Gittins and the Bayes-UCB algorithm:
Emilie Kaufmann, Olivier Cappé and Aurélien Garivier. On Bayesian upper confidence bounds for bandit problems. AISTATS 2012.

Analysis of Thompson Sampling:
Emilie Kaufmann, Nathaniel Korda and Rémi Munos. Thompson Sampling: an optimal finite time analysis. Submitted.
