Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett

Size: px

Start display at page:

Download "Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett"

Evangeline Moody
6 years ago
Views:

1 Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett 1. Multi-armed bandit algorithms. Concentration inequalities. P(X ǫ) exp( ψ (ǫ))). Cumulant generating function bounds. Hoeffding s inequality for sub-gaussian random variables. Upper confidence bound (UCB) algorithms. Compare upper bounds on means (based on sample averages and concentration inequalities) Analysis: bound ET j (n). Compare gap j to confidence interval width. 1

2 Recall: Concentration inequalities. Definition: function is For a random variable X, the moment-generating M X (λ) = Eexp(λX), the cumulant-generating function is Γ X (λ) = logm X (λ). 2

3 Recall: Concentration inequalities. Definition: For a random variable X, ψ : R R is a cumulant generating function upper bound if, for λ > 0, ψ(λ) max{γ X (λ),γ X (λ)}, ψ( λ) = ψ(λ). The Legendre transform (convex conjugate) ofψ is ψ (ǫ) = sup(λǫ ψ(λ)). λ R 3

4 Concentration Inequalities. Theorem: Γ X+c (λ) = λc+γ X (λ), Γ X+c(ǫ) = Γ X(ǫ c). (Easy to check.) 4

5 Recall: Concentration Inequalities. Theorem: For ǫ 0,P(X EX ǫ) exp ( ψ X EX (ǫ)). 5

6 Recall: Concentration Inequalities. Theorem: IfX 1,X 2,...,X n are mean zero, i.i.d. with cgf upper bound ψ, then X n = 1 n n i=1 X i has cgf bound ( ) λ ψ X n (λ) = nψ, n and hence, ψ X n (ǫ) = nψ (ǫ), P ( Xn ǫ ) exp( nψ (ǫ)), 6

7 Recall: Concentration Inequalities. And the exponent can t be improved. Theorem: [Cramér-Chernoff] If X 1,X 2,...,X n are iid and mean zero, and have cgf Γ, then for ǫ > 0 and X n = 1 n n i=1 X i, lim n 1 n logp( Xn ǫ ) = Γ (ǫ). (Lower bound is a change-of-measure argument plus central limit theorem.) 7

8 Example: Gaussian For X N(µ,σ 2 ), Γ X µ (λ) = λ2 σ 2 2, Γ X µ(ǫ) = ǫ2 2σ 2. For X 1,...,X n N(µ,σ 2 ), it s easy to check that the bound is tight: lim n 1 n lnp( X n µ t) = t2 2σ 2. 8

9 Example: Bounded Support Theorem: [Hoeffding s Inequality] For a random variable X [a,b] withex = µ and λ R, lnm X µ (λ) λ2 (b a) 2. 8 Note the resemblance to a Gaussian: λ 2 σ 2 2 vs λ2 (b a) 2 8 (And since P has support in[a,b], VarX (b a) 2 /4.). 9

10 Example: Hoeffding s Inequality Proof Define A(λ) = log ( Ee λx) ( = log ) e λx dp(x), where X P. Then A is the log normalization of the exponential family random variable X λ with reference measure P and sufficient statisticx. Since P has bounded support,a(λ) < for all λ, and we know that A (λ) = E(X λ ), A (λ) = Var(X λ ). Since P has support in[a,b], Var(X λ ) (b a) 2 /4. Then a Taylor expansion about λ = 0 (at this value of λ,x λ has the same distribution as X, hence the same expectation) gives A(λ) λex + λ2 2 (b a)

11 Sub-Gaussian Random Variables Definition: λ R, X is sub-gaussian with parameter σ 2 if, for all lnm X µ (λ) λ2 σ 2 2. Note: Gaussian is sub-gaussian. X sub-gaussian iff X sub-gaussian. X sub-gaussian impliesp(x µ t) exp( t 2 /(2σ 2 )). 11

12 Hoeffding Bound Theorem: For X 1,...,X n independent, EX i = µ, X i sub- Gaussian with parameter σ 2, then for all t > 0, ( ) 1 n ) P X i µ t exp ( nt2 n 2σ 2. i=1 12

13 Back to the stochastic bandit problem. k arms. Arm j has unknown reward distributionp θj, for θ j Θ. Reward: X j,t P θj. Mean reward: µ j = EX j,1. Best: µ = max j =1,...,kµ j. Gap: j = µ µ j. Number of plays: T j (s) = s t=1 1[I t = j]. Pseudo-regret: R n = nmax j =1,...,kµ j E n t=1 X I t,t = k j=1 ET j(n) j. 13

14 UCB strategy Define the sample averages ˆµ j (t) = 1 T j (t) t X Is,s1[I s = j]. s=1 If X j,s µ j has c.g.f. upper boundψ, ( ) 1 n Pr X j,s µ j ǫ n s=1 e nψ (ǫ), that is, Pr ( µ j < 1 n n s=1 X j,s +(ψ ) 1 ( log1/δ n ) ) 1 δ. 14

15 UCB strategy. Suppose X j,t µ j has c.g.f. upper boundψ. ψ-ucb Strategy: I t = arg max 1 j k ( ( )) 3logt ˆµ j,t 1 +(ψ ) 1. T j (t 1) e.g., X j,t sub-gaussian (with parameter σ 2 ), ( ) (ψ ) 1 3logt T j (t 1) ψ (ǫ) = ǫ2 2σ 2, = 6σ 2 logt T j (t 1). 15

16 UCB strategy. Theorem: If the reward distributions have cgf bound ψ, then the ψ-ucb Strategy satisfies R n ( ) 3 j logn ψ ( j /2) +o(1). j: j >0 Example: Rewards in[0,1] are sub-gaussian with σ 2 = 1/4, so R n ( ) 6logn +o(1). j: j >0 This is within a constant factor of optimal (µ (1 µ ) versus 6). j 16

17 UCB strategy: Proof ( ) (Drop t indices.) Define ǫ j = (ψ ) 1 3logt T j (t 1). So UCB chooses argmax j (ˆµ j +ǫ j ). If then so UCB will not choose I t = j. ˆµ j > µ j ǫ j, ˆµ j < µ j +ǫ j, j > 2ǫ j, ˆµ j +ǫ j > µ j = µ j + j > ˆµ j ǫ j + j > ˆµ j +ǫ j, 17

18 UCB strategy: Proof So {I t = j and j > 2ǫ j } {ˆµ j µ j ǫ j or ˆµ j µ j +ǫ j }. Note that j > 2ǫ j for T j (t 1) m := 3logn ψ ( j /2) 3logt ψ ( j /2). ET j (n) = E n 1[I t = j] t=1 m+e m+ n t=m+1 n t=m+1 1[I t = j and T j (t 1) m] (P(ˆµ j µ j ǫ j )+P(ˆµ j µ j +ǫ j )). 18

19 UCB strategy: Proof P(ˆµ j µ j ǫ j )l P ( s {1,...,t}ˆµ j,s +(ψ ) 1 ) (3lnt/s) µ j t s=1 P (ˆµ ) j,s +(ψ ) 1 (3lnt/s) µ j t t 3 = t 2. s=1 Similarly for P(ˆµ j µ j +ǫ j ). So ET j (n) 3logn ψ ( j /2) +O ( ) 1. logn 19

20 UCB strategy. Theorem: If the reward distributions have cgf bound ψ, then the ψ-ucb Strategy satisfies R n ( ) 3 j logn ψ ( j /2) +o(1). j: j >0 Hence, R n for rewards in[0,1]. ( 6logn j j: j >0 ) +o(1), 20

21 UCB strategy. This bound is on expected reward. Looking at the proof, the probability of bad behavior decays only polynomially withn(c.f. Hoeffding s inequality). This slow decay is not just an artefact of the analysis. For rewards in[0,1], the upper bound is R n ( 6logn j: j >0 j ) +o(1). However, the lower bound is smaller: the leading term is of the form j logn D KL (P θj,p θ ). This is achievable, by considering a better c.g.f. bound ψ: the Bernoulli c.g.f. 21

22 UCB strategy: Some history. Index strategies (which involve a comparison of the value of an index for each arm, which depends only on observations at that arm) were introduced by Gittins and collaborators (in a Bayesian context). The idea of UCB strategies, where the index has the interpretation of an upper confidence bound, goes back to Lai and Robbins. They gave strategies and analysis where the leading term is asymptotically optimal (including the constant). Agrawal (1995) considered simpler strategies based only on the sample mean, and used c.g.f. bounds and large deviations theory to give asymptotic results. Auer, Cesa-Bianchi and Fischer (2002) used Hoeffding s inequality to give a finite time analysis. 22

Stat 260/CS Learning in Sequential Decision Problems.

Stat 260/CS Learning in Sequential Decision Problems. Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Multi-armed bandit algorithms. Exponential families. Cumulant generating function. KL-divergence. KL-UCB for an exponential