Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett


1. Thompson sampling
   - Bernoulli strategy
   - Regret bounds
   - Extensions (the flexibility of Bayesian strategies)

Bayesian bandit strategies

Thompson sampling:
- Simple strategy: it only requires the ability to sample a parameter from the posterior and to maximize the expected reward under the sampled parameter.
- Applicable to more complex problems (dependent priors, complex actions, dependent rewards).
- Performs well in practice.
- Strong (frequentist) regret bounds.

The Gittins index has shortcomings: it is only applicable to infinite-horizon problems with independent arm priors, and it is complex to compute.

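Bayesian Bernoulli bandits

At time $t$, arm $j$ gives reward $X_{j,t} \sim \mathrm{Bernoulli}(\mu_j)$. Suppose we have fixed, unknown parameters $\mu_j$, and we wish to choose a sequence of arms $I_1, I_2, \ldots$ so as to minimize the regret

$$R_n = \sum_{t=1}^{n} \left(\mu_{j^*} - \mu_{I_t}\right),$$

where $j^*$ denotes an optimal arm, so that $\mu^* = \mu_{j^*} = \max_j \mu_j$.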

Thompson's idea

Frequentist setting, Bayesian strategy:
- Uniform prior on $p_j$.
- Pick action $j$ with a probability that increases with the probability (under the posterior distributions) that $j$ is optimal.

Typically, Thompson sampling refers to probability matching. That is, the probability of choosing $j$ is set to the probability that $j$ is optimal:

$$\Pr(I_t = j) = \Pr(p_j \text{ is the max}).$$
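The probability-matching rule can be illustrated numerically. The following is a minimal sketch, assuming independent Beta posteriors that arise from a uniform prior; the function name `prob_each_arm_is_max`, the success/failure counts, and the number of Monte Carlo draws are all hypothetical choices made for illustration. It estimates $\Pr(p_j \text{ is the max})$ by drawing from the joint posterior and counting how often each arm wins.

```python
import random

def prob_each_arm_is_max(successes, failures, num_draws=20000, seed=0):
    """Monte Carlo estimate of Pr(p_j is the max) under independent Beta posteriors."""
    rng = random.Random(seed)
    k = len(successes)
    wins = [0] * k
    for _ in range(num_draws):
        # One joint posterior draw: p_j ~ Beta(S_j + 1, F_j + 1) (uniform prior).
        draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
        # The arm with the largest sampled parameter wins this draw.
        wins[max(range(k), key=draws.__getitem__)] += 1
    return [w / num_draws for w in wins]

# Hypothetical counts: arm 0 has 8 successes and 2 failures, arm 1 has 3 and 7.
probs = prob_each_arm_is_max(successes=[8, 3], failures=[2, 7])
print(probs)
```

Playing the arm that wins a single joint draw selects arm $j$ with exactly this probability, which is why sampling once and maximizing implements probability matching without computing the probabilities explicitly.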

Thompson sampling

Given prior $\pi_{j,0} = \pi_0$: for $t = 1, 2, \ldots$,
1. Draw $p_{1,t}, \ldots, p_{k,t}$ from the posteriors $\pi_{j,t-1}$.
2. Play the arm $j$ with maximum $p_{j,t}$.
3. Observe reward $X_{j,t}$.
4. Update the posterior:

$$\pi_{j,t}(p) \propto \underbrace{p^{X_{j,t}} (1-p)^{1-X_{j,t}}}_{\text{likelihood}} \, \underbrace{\pi_{j,t-1}(p)}_{\text{prior}}.$$

Note that $\Pr(I_t = j) = \Pr(p_{j,t} \text{ is max})$.

Thompson sampling

For a Beta(1,1) prior: for $t = 1, 2, \ldots$,
1. Draw $p_{j,t} \sim \mathrm{Beta}(S_{j,t}+1, F_{j,t}+1)$ for $j = 1, \ldots, k$.
2. Play $I_t = j$ for the $j$ with maximum $p_{j,t}$.
3. Observe reward $X_{I_t,t}$.
4. Update the posterior: set $S_{I_t,t+1} = S_{I_t,t} + X_{I_t,t}$ and $F_{I_t,t+1} = F_{I_t,t} + 1 - X_{I_t,t}$.
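The four steps above can be sketched in a few lines of Python. This is an illustrative simulation, not code from the lecture: the function name `thompson_bernoulli`, the arm means, the horizon, and the seed are assumptions, and the environment is simulated with the standard library's `random` module.

```python
import random

def thompson_bernoulli(mu, n, seed=0):
    """Simulate Beta(1,1)-prior Thompson sampling on Bernoulli arms with means mu."""
    rng = random.Random(seed)
    k = len(mu)
    S = [0] * k      # successes observed per arm
    F = [0] * k      # failures observed per arm
    pulls = [0] * k  # T_j(n): number of times each arm is played
    for _ in range(n):
        # 1. Draw p_j ~ Beta(S_j + 1, F_j + 1) for every arm.
        p = [rng.betavariate(S[j] + 1, F[j] + 1) for j in range(k)]
        # 2. Play the arm with the maximum sampled parameter.
        i = max(range(k), key=p.__getitem__)
        # 3. Observe a Bernoulli(mu_i) reward (simulated environment).
        x = 1 if rng.random() < mu[i] else 0
        # 4. Update the posterior counts of the played arm only.
        S[i] += x
        F[i] += 1 - x
        pulls[i] += 1
    return pulls, S, F

mu = [0.2, 0.5, 0.8]  # hypothetical arm means, unknown to the strategy
pulls, S, F = thompson_bernoulli(mu, n=2000)
regret = sum(t * (max(mu) - m) for t, m in zip(pulls, mu))
print(pulls, regret)
```

Only the counts of the played arm change in each round, so the posteriors of the other arms carry over unchanged.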

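Regret of Thompson sampling

Theorem [Agrawal and Goyal]: For every $\mu_1, \ldots, \mu_k$, there is a constant $C$ such that for all $\epsilon > 0$,

$$R_n \le (1+\epsilon) \sum_{j : \Delta_j > 0} \frac{\Delta_j \log n}{d(\mu_j, \mu^*)} + \frac{Ck}{\epsilon^2},$$

where $\Delta_j = \mu^* - \mu_j$ is the gap of arm $j$ and $d(\mu_j, \mu^*)$ is the KL divergence between Bernoulli distributions with means $\mu_j$ and $\mu^*$.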
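To get a feel for the size of the bound, the leading term can be evaluated numerically. The sketch below is an illustration under stated assumptions: the helper names `kl_bernoulli` and `leading_term` and the example means are hypothetical, and the additive term $Ck/\epsilon^2$ is omitted because the theorem does not specify the constant $C$.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q), for 0 < q < 1."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total

def leading_term(mu, n, eps):
    """(1 + eps) * sum over suboptimal arms of Delta_j * log(n) / d(mu_j, mu*)."""
    mu_star = max(mu)
    return (1 + eps) * sum(
        (mu_star - m) * math.log(n) / kl_bernoulli(m, mu_star)
        for m in mu
        if m < mu_star
    )

mu = [0.2, 0.5, 0.8]  # hypothetical arm means
print(kl_bernoulli(0.5, 0.8), leading_term(mu, n=10000, eps=0.1))
```

The leading term grows logarithmically in $n$ and is dominated by arms whose means are close to $\mu^*$, since their divergence $d(\mu_j, \mu^*)$ is small.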

As always, $R_n = \sum_{j=1}^{k} \mathbb{E}[T_j(n)] \, \Delta_j$, where $T_j(n) = \sum_{t=1}^{n} 1[I_t = j]$.

For UCB strategies, we were able to bound regret by showing that the upper bounds are valid with high probability, and that, after enough mistakes, they are sufficiently tight that subsequent mistakes are unlikely.

Here, we don't have bounds; we're just sampling from distributions that get more concentrated as we sample more.
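The decomposition rests on a per-sequence identity: summing the gap of the played arm over rounds gives the same number as summing, over arms, the pull count times the gap. Taking expectations then yields the formula above. A minimal check, with hypothetical arm means and a hypothetical sequence of plays:

```python
import math

mu = [0.2, 0.5, 0.8]          # hypothetical arm means
arms = [0, 2, 1, 2, 2, 0, 2]  # hypothetical sequence I_1, ..., I_n
mu_star = max(mu)

# Sum over rounds of mu* - mu_{I_t}.
per_round = sum(mu_star - mu[i] for i in arms)

# Sum over arms of T_j(n) * Delta_j, where T_j(n) counts the pulls of arm j.
T = [arms.count(j) for j in range(len(mu))]
per_arm = sum(T[j] * (mu_star - mu[j]) for j in range(len(mu)))

print(T, per_round, per_arm)
```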

Fix a suboptimal arm $j \ne j^*$. Split according to $\hat\mu_j(t-1)$ and $p_{j,t}$:

$$\mathbb{E} T_j(n) = \sum_{t=1}^{n} \Pr(I_t = j) = \sum_{t=1}^{n} \Big[ \Pr(I_t = j,\ \hat\mu_j(t-1) \le x_j,\ p_{j,t} \le y_j) + \Pr(I_t = j,\ \hat\mu_j(t-1) \le x_j,\ p_{j,t} > y_j) + \Pr(I_t = j,\ \hat\mu_j(t-1) > x_j) \Big],$$

where we choose $\mu_j < x_j < y_j < \mu^*$ as follows:

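$$\mu_j < x_j < \mu^* \quad \text{s.t.} \quad d(x_j, \mu^*) = \frac{d(\mu_j, \mu^*)}{1+\epsilon},$$

$$x_j < y_j < \mu^* \quad \text{s.t.} \quad d(x_j, y_j) = \frac{d(x_j, \mu^*)}{1+\epsilon} = \frac{d(\mu_j, \mu^*)}{(1+\epsilon)^2}.$$

This ensures that the relevant divergences differ from $d(\mu_j, \mu^*)$ by only a constant factor.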
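The thresholds $x_j$ and $y_j$ are defined implicitly, but they are easy to compute because the Bernoulli divergence is monotone on the relevant intervals: $d(x, \mu^*)$ decreases in $x$ on $(\mu_j, \mu^*)$, and $d(x_j, y)$ increases in $y$ on $(x_j, \mu^*)$. The following sketch finds both by bisection; the helper names `kl_bernoulli`, `bisect_root`, and `thresholds` and the example values are assumptions for illustration.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q), for 0 < q < 1."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total

def bisect_root(f, lo, hi, iters=100):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid  # root lies in the upper half
        else:
            hi = mid               # root lies in the lower half
    return (lo + hi) / 2

def thresholds(mu_j, mu_star, eps):
    """Return (x_j, y_j) with d(x_j, mu*) = d(mu_j, mu*)/(1+eps) and
    d(x_j, y_j) = d(x_j, mu*)/(1+eps)."""
    target_x = kl_bernoulli(mu_j, mu_star) / (1 + eps)
    x = bisect_root(lambda v: kl_bernoulli(v, mu_star) - target_x, mu_j, mu_star)
    target_y = kl_bernoulli(x, mu_star) / (1 + eps)
    y = bisect_root(lambda v: kl_bernoulli(x, v) - target_y, x, mu_star)
    return x, y

x_j, y_j = thresholds(mu_j=0.5, mu_star=0.8, eps=0.1)  # hypothetical values
print(x_j, y_j)
```

As $\epsilon \to 0$, both thresholds move so that $d(x_j, y_j)$ approaches $d(\mu_j, \mu^*)$, which is how the leading constant in the regret bound approaches 1.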

Consider those three probabilities:
1. $\Pr(I_t = j,\ \hat\mu_j(t-1) \le x_j,\ p_{j,t} \le y_j)$ counts the times when both the empirical mean $\hat\mu_j(t-1)$ and the sampled $p_{j,t}$ are not too far above their expectations. For this to be small, we need to be sure that $j^*$ is pulled a lot.
2. $\Pr(I_t = j,\ \hat\mu_j(t-1) \le x_j,\ p_{j,t} > y_j)$ counts the times when $p_{j,t}$ is far above its expectation. As $T_j$ grows, the distribution of $p_{j,t}$ gets more concentrated, so this sum is $O(\log n / d(x_j, y_j))$.
3. $\Pr(I_t = j,\ \hat\mu_j(t-1) > x_j)$ counts the times when the empirical mean $\hat\mu_j(t-1)$ is far above its expectation. Its sum is $O(1)$.

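For the 3rd probability: if $\tau_{j,k}$ is the $k$th time when $I_t = j$, then

$$\sum_{t=1}^{n} \Pr(I_t = j,\ \hat\mu_j(t-1) > x_j) \le 1 + \mathbb{E} \sum_{k=1}^{n-1} 1[\hat\mu_j(\tau_{j,k}) > x_j] \le 1 + \sum_{k=1}^{n-1} \exp(-k \, d(x_j, \mu_j)) \le 1 + \frac{1}{d(x_j, \mu_j)} = O(1).$$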
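The last two inequalities in this chain combine a Chernoff bound with a geometric sum. As a worked step, writing $d = d(x_j, \mu_j)$ for brevity (a shorthand introduced here): after $k$ pulls the empirical mean of arm $j$ exceeds $x_j > \mu_j$ with probability at most $e^{-kd}$, and

```latex
\sum_{k=1}^{n-1} e^{-kd}
  \;\le\; \sum_{k=1}^{\infty} e^{-kd}
  \;=\; \frac{e^{-d}}{1 - e^{-d}}
  \;=\; \frac{1}{e^{d} - 1}
  \;\le\; \frac{1}{d},
\qquad \text{since } e^{d} \ge 1 + d .
```

The bound does not depend on $n$, which is why this contribution is $O(1)$.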

For the 2nd probability:

$$\sum_{t=1}^{n} \Pr(I_t = j,\ \hat\mu_j(t-1) \le x_j,\ p_{j,t} > y_j) \le \sum_{t=1}^{n} \Pr(I_t = j,\ p_{j,t} > y_j \mid \hat\mu_j(t-1) \le x_j) \le m + \mathbb{E} \sum_{t=\tau_{j,m}+1}^{n} \Pr(I_t = j,\ p_{j,t} > y_j \mid \hat\mu_j(t-1) \le x_j,\ F_{t-1}),$$

for any choice of $m$. (Recall that $\tau_{j,k}$ is the $k$th time when $I_t = j$ and $F_{t-1}$ is everything up to $t-1$.)

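$$\sum_{t=1}^{n} \Pr(I_t = j,\ \hat\mu_j(t-1) \le x_j,\ p_{j,t} > y_j) \le m + \mathbb{E} \sum_{t=\tau_{j,m}+1}^{n} \exp(-T_j(t-1) \, d(x_j, y_j)) \le \frac{\log n}{d(x_j, y_j)} + \mathbb{E} \sum_{t=\tau_{j,m}+1}^{n} \frac{1}{n} \le \frac{\log n}{d(x_j, y_j)} + 1,$$

where we choose $m = \log n / d(x_j, y_j)$ so that $\exp(-T_j(t-1) \, d(x_j, y_j)) \le 1/n$.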
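The role of $m$ can be spelled out in one line. For every $t > \tau_{j,m}$, arm $j$ has already been played at least $m$ times before round $t$, so $T_j(t-1) \ge m$, and with $m = \log n / d(x_j, y_j)$:

```latex
\exp\bigl(-T_j(t-1)\, d(x_j, y_j)\bigr)
  \;\le\; \exp\bigl(-m\, d(x_j, y_j)\bigr)
  \;=\; \exp(-\log n)
  \;=\; \frac{1}{n}.
```

Summing at most $n$ such terms contributes at most 1, which gives the final bound $\log n / d(x_j, y_j) + 1$.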

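For the 1st probability, define $\alpha_{j,t} = \Pr(p_{j^*,t} > y_j \mid F_{t-1})$. Then

$$\sum_{t=1}^{n} \Pr(I_t = j,\ \hat\mu_j(t-1) \le x_j,\ p_{j,t} \le y_j) \overset{(*)}{\le} \mathbb{E} \sum_{t=1}^{n} \frac{1 - \alpha_{j,t}}{\alpha_{j,t}} \, 1[I_t = j^*,\ \hat\mu_j(t-1) \le x_j,\ p_{j,t} \le y_j]$$

$$\le \sum_{k=0}^{n-1} \mathbb{E} \left[ \frac{1 - \alpha_{j,\tau_{j^*,k}+1}}{\alpha_{j,\tau_{j^*,k}+1}} \sum_{t=\tau_{j^*,k}+1}^{\tau_{j^*,k+1}} 1[I_t = j^*] \right] = \sum_{k=0}^{n-1} \mathbb{E} \left[ \frac{1}{\alpha_{j,\tau_{j^*,k}+1}} - 1 \right] = O(1).$$

(Recall $\tau_{j^*,k}$ is the $k$th time when $I_t = j^*$.)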

(*) Crucial fact: For a bad arm, when its sample average and sampled parameter aren't too large, the probability that it is chosen is bounded linearly by the probability that the best arm is chosen.

Lemma: For suboptimal $j$,

$$\Pr(I_t = j,\ \hat\mu_j(t-1) \le x_j,\ p_{j,t} \le y_j \mid F_{t-1}) \le \frac{1 - \alpha_{j,t}}{\alpha_{j,t}} \, \Pr(I_t = j^*,\ \hat\mu_j(t-1) \le x_j,\ p_{j,t} \le y_j \mid F_{t-1}),$$

where $\alpha_{j,t} = \Pr(p_{j^*,t} > y_j \mid F_{t-1})$.

First, notice that, conditioned on $F_{t-1}$, $\hat\mu_j(t-1)$ is determined. So assume $\hat\mu_j(t-1) \le x_j$. Define the event

$$W_j(t) = \{ \forall j' \ne j^*,\ p_{j',t} \le p_{j,t} \}$$

($j$ is the "suboptimal winner"). Then

$$\Pr(I_t = j^* \mid p_{j,t} \le y_j,\ F_{t-1}) \ge \Pr(I_t = j^*,\ W_j(t) \mid p_{j,t} \le y_j,\ F_{t-1})$$

$$= \Pr(W_j(t) \mid p_{j,t} \le y_j,\ F_{t-1}) \, \Pr(I_t = j^* \mid W_j(t),\ p_{j,t} \le y_j,\ F_{t-1})$$

$$\ge \Pr(W_j(t) \mid p_{j,t} \le y_j,\ F_{t-1}) \, \Pr(p_{j^*,t} > y_j \mid W_j(t),\ p_{j,t} \le y_j,\ F_{t-1})$$

$$= \Pr(W_j(t) \mid p_{j,t} \le y_j,\ F_{t-1}) \, \Pr(p_{j^*,t} > y_j \mid F_{t-1})$$

$$= \Pr(W_j(t) \mid p_{j,t} \le y_j,\ F_{t-1}) \, \alpha_{j,t}.$$

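On the other side:

$$\Pr(I_t = j \mid p_{j,t} \le y_j,\ F_{t-1}) \le \Pr(W_j(t),\ p_{j^*,t} \le y_j \mid p_{j,t} \le y_j,\ F_{t-1})$$

$$= \Pr(W_j(t) \mid p_{j,t} \le y_j,\ F_{t-1}) \, \Pr(p_{j^*,t} \le y_j \mid F_{t-1})$$

$$= \Pr(W_j(t) \mid p_{j,t} \le y_j,\ F_{t-1}) \, (1 - \alpha_{j,t}).$$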
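Putting the two sides together gives the lemma. As a worked step, with $j^*$ the optimal arm and $\alpha_{j,t} = \Pr(p_{j^*,t} > y_j \mid F_{t-1})$: the lower bound on the probability of playing $j^*$ and the upper bound on the probability of playing $j$ share the factor $\Pr(W_j(t) \mid p_{j,t} \le y_j, F_{t-1})$, so dividing one by the other eliminates it:

```latex
\Pr(I_t = j \mid p_{j,t} \le y_j,\ F_{t-1})
  \;\le\; (1 - \alpha_{j,t})\, \Pr(W_j(t) \mid p_{j,t} \le y_j,\ F_{t-1})
  \;\le\; \frac{1 - \alpha_{j,t}}{\alpha_{j,t}}\,
          \Pr(I_t = j^* \mid p_{j,t} \le y_j,\ F_{t-1}).
```

Multiplying both sides by $\Pr(p_{j,t} \le y_j \mid F_{t-1})$ turns the conditional probabilities into joint ones, and the event $\hat\mu_j(t-1) \le x_j$ can be added on both sides because it is determined by $F_{t-1}$.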

Thompson sampling for bounded rewards

For $X_{j,t} \in [0,1]$, we can use an identical approach. But the only available analysis applies to the variant that uses Bernoulli-sampled rewards (which introduce a little extra variance): for $t = 1, 2, \ldots$,
1. Draw $p_{j,t} \sim \mathrm{Beta}(S_{j,t}+1, F_{j,t}+1)$ for $j = 1, \ldots, k$.
2. Play $I_t = j$ for the $j$ with maximum $p_{j,t}$.
3. Observe reward $X_{I_t,t}$.
4. Sample $R_t \sim \mathrm{Bernoulli}(X_{I_t,t})$.
5. Update the posterior: set $S_{I_t,t+1} = S_{I_t,t} + R_t$ and $F_{I_t,t+1} = F_{I_t,t} + 1 - R_t$.
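The only change from the Bernoulli case is the extra randomization in step 4, which converts a reward in $[0,1]$ into a binary outcome with the same mean. A minimal sketch follows; the function name `thompson_bounded`, the uniform reward distributions, the horizon, and the seed are hypothetical choices for illustration.

```python
import random

def thompson_bounded(reward_fns, n, seed=0):
    """Thompson sampling for rewards in [0, 1] via Bernoulli-sampled rewards.

    reward_fns[j](rng) must return a reward in [0, 1] for arm j.
    """
    rng = random.Random(seed)
    k = len(reward_fns)
    S = [0] * k
    F = [0] * k
    pulls = [0] * k
    for _ in range(n):
        # Steps 1-2: sample from each Beta posterior and play the maximizer.
        p = [rng.betavariate(S[j] + 1, F[j] + 1) for j in range(k)]
        i = max(range(k), key=p.__getitem__)
        # Step 3: observe a reward in [0, 1].
        x = reward_fns[i](rng)
        # Step 4: R_t ~ Bernoulli(x), so E[R_t] equals the mean reward of the arm.
        r = 1 if rng.random() < x else 0
        # Step 5: update the posterior with the binary outcome.
        S[i] += r
        F[i] += 1 - r
        pulls[i] += 1
    return pulls, S, F

# Hypothetical arms: uniform rewards with means 0.3 and 0.7.
arms = [lambda rng: rng.uniform(0.0, 0.6), lambda rng: rng.uniform(0.4, 1.0)]
pulls, S, F = thompson_bounded(arms, n=2000)
print(pulls)
```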

Thompson sampling

One of the advantages of a Bayesian strategy (even in a frequentist setting, where the parameters are fixed but unknown) is that it is easy to generalize to settings with rewards and (other) outcomes, where the distributions depend on parameters $\theta$. This allows:
- Complex actions (e.g., choose four advertisements on a web page; multi-commodity flow with random capacities).
- Limited observations (e.g., only see rewards for complex actions, not for simple ones).
- Dependent rewards.

Thompson sampling

Given parameter space $\Theta$, action space $A$, outcome space $Y$, prior $\pi_0$ on $\Theta$, likelihood $l(y; a, \theta)$, and reward distribution: for $t = 1, 2, \ldots$,
1. Draw $\theta_t$ from the posterior $\pi_{t-1}$.
2. Play the action $a_t$ that maximizes $\mathbb{E}_{\theta_t} R_a$.
3. Observe outcome $Y_t$.
4. Update the posterior: $\pi_t(\theta) \propto l(Y_t; a_t, \theta) \, \pi_{t-1}(\theta)$.
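The general recipe can be sketched concretely when $\Theta$ is a finite grid, so that the posterior is just a list of weights. Everything in this example is an assumption made for illustration: the function name `generic_thompson`, the grid, and a toy problem with dependent rewards in which action 0 pays $\mathrm{Bernoulli}(\theta)$ and action 1 pays $\mathrm{Bernoulli}(1-\theta)$, so every observation is informative about both actions.

```python
import random

def generic_thompson(thetas, prior, actions, likelihood, expected_reward,
                     draw_outcome, n, seed=0):
    """Thompson sampling over a finite parameter grid with a discrete posterior."""
    rng = random.Random(seed)
    post = list(prior)
    history = []
    for _ in range(n):
        # 1. Draw theta_t from the current posterior.
        theta_t = rng.choices(thetas, weights=post, k=1)[0]
        # 2. Play the action maximizing expected reward under theta_t.
        a_t = max(actions, key=lambda a: expected_reward(a, theta_t))
        # 3. Observe an outcome from the (simulated) environment.
        y_t = draw_outcome(a_t, rng)
        # 4. Bayes update: posterior is proportional to likelihood times prior.
        post = [w * likelihood(y_t, a_t, th) for w, th in zip(post, thetas)]
        total = sum(post)
        post = [w / total for w in post]
        history.append(a_t)
    return post, history

TRUE_THETA = 0.7  # hypothetical true parameter, unknown to the strategy

def success_prob(a, theta):
    # Action 0 succeeds with probability theta, action 1 with 1 - theta.
    return theta if a == 0 else 1.0 - theta

def likelihood(y, a, theta):
    q = success_prob(a, theta)
    return q if y == 1 else 1.0 - q

def draw_outcome(a, rng):
    return 1 if rng.random() < success_prob(a, TRUE_THETA) else 0

thetas = [i / 20 for i in range(1, 20)]  # grid 0.05, ..., 0.95
prior = [1.0 / len(thetas)] * len(thetas)
post, history = generic_thompson(thetas, prior, [0, 1], likelihood,
                                 success_prob, draw_outcome, n=500)
theta_mode = thetas[max(range(len(thetas)), key=post.__getitem__)]
print(theta_mode, history.count(0), history.count(1))
```

Because the two actions share the parameter $\theta$, the posterior sharpens regardless of which action is played, something independent per-arm posteriors cannot capture.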
