Stat 260/CS Learning in Sequential Decision Problems.


Peter Bartlett

1. Multi-armed bandit algorithms. Exponential families. Cumulant generating function. KL-divergence. KL-UCB for an exponential family. KL vs c.g.f. bounds. Bounded rewards: Bernoulli and Hoeffding. Empirical KL-UCB. See (Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos and Gilles Stoltz, 2013).

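Recall: Concentration inequalities. Definition: Cumulant-generating function:
$$\Gamma_X(\lambda) = \log \mathbb{E}\exp(\lambda X).$$
We consider upper bounds $\psi : \mathbb{R} \to \mathbb{R}$ satisfying $\psi(\lambda) \ge \Gamma_X(\lambda)$. The Legendre transform (convex conjugate) of $\psi$ is
$$\psi^*(\epsilon) = \sup_{\lambda \in \mathbb{R}} \left(\lambda\epsilon - \psi(\lambda)\right).$$
Theorem: For $\epsilon \ge 0$, $P(X - \mathbb{E}X \ge \epsilon) \le \exp\left(-\psi^*_{X-\mathbb{E}X}(\epsilon)\right)$.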
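The Legendre transform can be checked numerically. The sketch below is only an illustration: the helper name `legendre_transform`, the grid over $\lambda$, and the choice of a Gaussian-type bound $\psi(\lambda) = \sigma^2\lambda^2/2$ with $\sigma^2 = 1/4$ are assumptions made for this example, not part of the slides. For this $\psi$ the conjugate is known in closed form, $\psi^*(\epsilon) = \epsilon^2/(2\sigma^2)$.

```python
import math

def legendre_transform(psi, eps, lam_max=50.0, steps=100_001):
    """Approximate psi*(eps) = sup over lambda of (lambda*eps - psi(lambda)) on a grid.

    The grid [-lam_max, lam_max] is an assumption of this sketch; the true
    supremum is over all real lambda.
    """
    best = -math.inf
    for i in range(steps):
        lam = -lam_max + 2.0 * lam_max * i / (steps - 1)
        best = max(best, lam * eps - psi(lam))
    return best

sigma2 = 0.25  # illustrative variance proxy

def gaussian_cgf(lam):
    # c.g.f. of a mean-zero Gaussian with variance sigma2
    return 0.5 * sigma2 * lam * lam

eps = 0.5
approx = legendre_transform(gaussian_cgf, eps)
exact = eps ** 2 / (2 * sigma2)
tail_bound = math.exp(-approx)  # bound on P(X - EX >= eps) from the theorem
```

A grid search is crude but makes the definition concrete; for smooth convex $\psi$ one would normally solve $\psi'(\lambda) = \epsilon$ instead.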

Recall: Concentration Inequalities. Theorem: If $X_1, X_2, \ldots, X_n$ are mean zero, i.i.d. with cgf upper bound $\psi$, then $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ satisfies
$$P\left(\bar{X}_n \ge \epsilon\right) \le \exp\left(-n\psi^*(\epsilon)\right).$$
And the exponent can't be improved. Theorem: [Cramér-Chernoff] If $X_1, X_2, \ldots, X_n$ are i.i.d. and mean zero, and have cgf $\Gamma$, then for $\epsilon > 0$ and $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$,
$$\lim_{n\to\infty} \frac{1}{n}\log P\left(\bar{X}_n \ge \epsilon\right) = -\Gamma^*(\epsilon).$$
($\Gamma^*$ is sometimes called the Cramér function. The lower bound is a change-of-measure argument plus the central limit theorem.)

Outline. For an exponential family, we can compute the c.g.f. exactly. Its convex conjugate corresponds to a KL-divergence. For reward distributions from the exponential family, concentration inequalities involving the KL-divergence define an upper confidence bound strategy: KL-UCB. If the reward distributions are bounded, the c.g.f. of a particular exponential family (a scaled, shifted Bernoulli) gives a bound on the c.g.f. And we can bound this, in turn, with a quadratic (like Hoeffding's inequality), which corresponds to another exponential family (a Gaussian). KL-UCB for Bernoulli improves on KL-UCB for Gaussian. (KL-UCB for Gaussian corresponds to the original UCB strategy.) There's also a non-parametric version of KL-UCB (called empirical KL-UCB) for bounded rewards. It works with the set of distributions with finite support.

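Exponential families. Definition: Canonical exponential family defined wrt measure $P$:
$$\frac{dP_\theta}{dP}(x) = \exp\left(\theta x - A(\theta)\right), \qquad A(\theta) = \log\left(\int \exp(\theta x)\,dP(x)\right), \qquad \theta \in \Theta = \{\theta : A(\theta) < \infty\}.$$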

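Exponential families. $\mu(\theta) := \mathbb{E}_\theta X = A'(\theta)$. $\theta(\mu)$ is defined on $\mu(\Theta)$ (one-to-one because $\mathrm{Var}_\theta X = A''(\theta) > 0$).
$$\Gamma_\theta(\lambda) = A(\theta + \lambda) - A(\theta),$$
$$\Gamma^*_{\theta_1}(\mu(\theta_2)) = A(\theta_1) - A(\theta_2) + \mu(\theta_2)(\theta_2 - \theta_1),$$
$$D_{KL}(P_{\theta_1}, P_{\theta_2}) = \Gamma^*_{\theta_2}(\mu(\theta_1)).$$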

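Exponential families.
$$A'(\theta) = \frac{\int x\exp(\theta x)\,dP(x)}{\exp(A(\theta))} = \mathbb{E}_\theta X.$$
$$\Gamma_\theta(\lambda) = \log\left(\int \exp\left(\lambda x + \theta x - A(\theta)\right)dP(x)\right) = \log\left(\int \exp\left((\lambda + \theta)x\right)dP(x)\right) - A(\theta) = A(\theta + \lambda) - A(\theta).$$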

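Exponential families.
$$\Gamma^*_{\theta_1}(\mu(\theta_2)) = \sup_{\lambda}\left(\lambda\mu(\theta_2) - \left(A(\theta_1 + \lambda) - A(\theta_1)\right)\right).$$
The maximum has $\mu(\theta_2) = \mu(\theta_1 + \lambda)$, that is, $\lambda = \theta_2 - \theta_1$, so
$$\Gamma^*_{\theta_1}(\mu(\theta_2)) = (\theta_2 - \theta_1)\mu(\theta_2) + A(\theta_1) - A(\theta_2).$$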

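Exponential families.
$$D_{KL}(P_{\theta_1}, P_{\theta_2}) = \int \log\frac{dP_{\theta_1}}{dP_{\theta_2}}\,dP_{\theta_1} = \int \left((\theta_1 - \theta_2)x\right)\exp\left(\theta_1 x - A(\theta_1)\right)dP(x) + A(\theta_2) - A(\theta_1)$$
$$= \mu(\theta_1)(\theta_1 - \theta_2) + A(\theta_2) - A(\theta_1) = \Gamma^*_{\theta_2}(\mu(\theta_1)).$$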

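Exponential families. Example: Bernoulli:
$$P_\theta(x) = \exp\left(\theta x - A(\theta)\right), \qquad A(\theta) = \log\left(1 + e^\theta\right), \qquad \mu(\theta) = P_\theta(1) = \frac{e^\theta}{1 + e^\theta},$$
$$\theta = \log\frac{\mu}{1 - \mu}, \qquad \Gamma_\theta(\lambda) = \log\left(1 - \mu(\theta) + \mu(\theta)e^\lambda\right), \qquad \Theta = \mathbb{R}.$$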

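Exponential families. Example: Bernoulli:
$$\Gamma^*_{\theta_1}(\mu_2) = \sup_{\lambda}\left(\lambda\mu_2 - \log\left(1 - \mu_1 + \mu_1 e^\lambda\right)\right).$$
The maximum has
$$\mu_2 = \frac{\mu_1 e^\lambda}{1 - \mu_1 + \mu_1 e^\lambda}, \quad\text{that is,}\quad \lambda = \log\frac{\mu_2(1 - \mu_1)}{\mu_1(1 - \mu_2)},$$
so
$$\Gamma^*_{\theta_1}(\mu_2) = \mu_2\log\frac{\mu_2}{\mu_1} + (1 - \mu_2)\log\frac{1 - \mu_2}{1 - \mu_1} = D_{KL}(P_{\theta_2}, P_{\theta_1}).$$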
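The Bernoulli identity between the convex conjugate of the c.g.f. and the KL-divergence can be verified numerically by plugging in the maximizing $\lambda$. This is a minimal sketch; the function names and the test values $\mu_1 = 0.3$, $\mu_2 = 0.6$ are chosen only for illustration.

```python
import math

def bernoulli_cgf(mu1, lam):
    """c.g.f. of a Bernoulli(mu1) variable: log(1 - mu1 + mu1 * e^lam)."""
    return math.log(1.0 - mu1 + mu1 * math.exp(lam))

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def conjugate_at(mu1, mu2):
    """Evaluate lam*mu2 - cgf(lam) at the maximizing lam from the derivation."""
    lam = math.log(mu2 * (1.0 - mu1) / (mu1 * (1.0 - mu2)))
    return lam * mu2 - bernoulli_cgf(mu1, lam)

mu1, mu2 = 0.3, 0.6  # illustrative means
gap = abs(conjugate_at(mu1, mu2) - kl_bernoulli(mu2, mu1))
```

Note the order of arguments: the conjugate of the c.g.f. of the distribution with mean $\mu_1$, evaluated at $\mu_2$, is the divergence from the distribution with mean $\mu_2$ to the one with mean $\mu_1$.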

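KL-UCB for exponential families: Use $\psi = \Gamma$. Define the sample averages
$$\hat\mu_j(t) = \frac{1}{T_j(t)}\sum_{s=1}^{t} X_{I_s,s}\,1[I_s = j], \qquad \hat\mu_{j,t} = \frac{1}{t}\sum_{s=1}^{t} X_{j,s}.$$
If $X_{j,s}$ has mean $\mu$ and c.g.f. $\Gamma_\mu$, and $a < \mu$,
$$\Pr\left(\hat\mu_{j,n} \le a\right) \le e^{-n\Gamma^*_\mu(a)},$$
that is,
$$\Pr\left(\hat\mu_{j,n} < \mu \text{ and } \Gamma^*_\mu(\hat\mu_{j,n}) \ge \frac{f(n)}{n}\right) \le e^{-f(n)},$$
or
$$\Pr\left(\hat\mu_{j,n} < \mu \text{ and } D_{KL}\left(P_{\hat\mu_{j,n}}, P_\mu\right) \ge \frac{f(n)}{n}\right) \le e^{-f(n)}.$$
(Note that $P_\mu$ denotes $P_{\theta(\mu)}$.)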

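KL-UCB for exponential families. KL-UCB Strategy for an exponential family ($P_\mu$ denotes $P_{\theta(\mu)}$):
$$I_t = t \quad\text{for } t = 1, \ldots, k,$$
$$I_t = \arg\max_{1 \le j \le k}\ \sup\left\{\mu(\theta) : \theta \in \Theta \text{ and } D_{KL}\left(P_{\hat\mu_j(t-1)}, P_{\mu(\theta)}\right) \le \frac{f(t)}{T_j(t-1)}\right\},$$
where $f(t) = \log t + 3\log\log(t)$. Equivalent to UCB with $\psi = \Gamma_\mu$.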

KL-UCB for exponential families. We can think of $D_{KL}\left(P_{\hat\mu_{j,t-1}}, P_\mu\right)$ as a divergence defined in terms of means: for any $\hat\mu, \mu \in \mu(\Theta)$,
$$d(\hat\mu, \mu) = D_{KL}(P_{\hat\mu}, P_\mu) = \left(\theta(\hat\mu) - \theta(\mu)\right)\hat\mu - A(\theta(\hat\mu)) + A(\theta(\mu)).$$
Then $d(\hat\mu, \mu) = 0$ iff $\hat\mu = \mu$, and $d$ is strictly convex and differentiable. We can extend it to the closure of $\mu(\Theta)$, by taking limits, allowing infinite values, and setting $d(\mu, \mu) = 0$ at boundaries. (Consider, for example, $\hat\mu = 0$ for a Bernoulli.)
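As a worked instance of the mean-parameterized divergence, consider the Poisson family. This example is an illustration added here, not taken from the slides; it assumes the reference measure puts mass $1/x!$ on $x = 0, 1, 2, \ldots$, so that the general formula for $d$ can be evaluated term by term.

```latex
% Poisson family (illustrative assumption: reference measure with mass 1/x! at x = 0, 1, 2, ...)
\begin{align*}
A(\theta) &= \log \sum_{x \ge 0} \frac{e^{\theta x}}{x!} = e^{\theta},
  & \mu(\theta) &= A'(\theta) = e^{\theta},
  & \theta(\mu) &= \log \mu, \\
d(\hat\mu, \mu)
  &= \bigl(\theta(\hat\mu) - \theta(\mu)\bigr)\hat\mu - A(\theta(\hat\mu)) + A(\theta(\mu))
  \\
  &= \hat\mu \log\frac{\hat\mu}{\mu} - \hat\mu + \mu .
\end{align*}
```

Here $d(\hat\mu, \mu) = 0$ exactly when $\hat\mu = \mu$, and the boundary case $\hat\mu = 0$ gives the finite limit $d(0, \mu) = \mu$.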

KL-UCB for exponential families. Theorem: KL-UCB for an exponential family satisfies:
$$\mathbb{E}T_j(n) \le \frac{\log n}{D_{KL}\left(P_{\mu_j}, P_{\mu^*}\right)} + O\left(\sqrt{\log n}\right).$$
And the leading term is optimal (including the constant).

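KL-UCB for bounded rewards. Theorem: For $X \in [0,1]$ with $\mathbb{E}X = \mu$, define $Y \sim \mathrm{Bernoulli}(\mu)$. Then
$$\Gamma_X(\lambda) \le \Gamma_Y(\lambda).$$
Notice that this gives a c.g.f. bound $\psi_{X-\mu}$ for $X$ satisfying:
$$\psi^*_{X-\mu}(\mu' - \mu) = \mu'\log\frac{\mu'}{\mu} + (1 - \mu')\log\frac{1 - \mu'}{1 - \mu}.$$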

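KL-UCB for bounded rewards. Proof: For $x \in [0,1]$, $\exp(\lambda x)$ lies below the line from $(0, e^0)$ to $(1, e^\lambda)$:
$$\exp(\lambda x) \le x\left(e^\lambda - e^0\right) + e^0,$$
so
$$\mathbb{E}\exp(\lambda X) \le \mu\left(e^\lambda - 1\right) + 1 = \mathbb{E}\exp(\lambda Y).$$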
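The c.g.f. comparison can be sanity-checked on one concrete bounded distribution. The sketch below assumes $X$ uniform on $[0,1]$ (so $\mu = 1/2$), a choice made purely for illustration because its c.g.f. has the closed form $\log\left((e^\lambda - 1)/\lambda\right)$; the function names are hypothetical.

```python
import math

def cgf_uniform01(lam):
    """c.g.f. of Uniform[0,1]: log((e^lam - 1) / lam), with the lam -> 0 limit."""
    if abs(lam) < 1e-8:
        return lam / 2.0  # first-order expansion near zero
    return math.log(math.expm1(lam) / lam)

def cgf_bernoulli(mu, lam):
    """c.g.f. of Bernoulli(mu): log(1 - mu + mu * e^lam)."""
    return math.log(1.0 - mu + mu * math.exp(lam))

# Check Gamma_X(lam) <= Gamma_Y(lam) on a grid of lam values, with mu = 1/2.
lams = [x / 10.0 for x in range(-50, 51)]
holds = all(cgf_uniform01(l) <= cgf_bernoulli(0.5, l) + 1e-12 for l in lams)
```

The two c.g.f.s agree to first order at $\lambda = 0$ (both have derivative $\mu$), which is why the comparison uses a small tolerance there.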

KL-UCB-Bernoulli for bounded rewards. KL-UCB-Bernoulli Strategy. For the Bernoulli family $P_\mu$:
$$I_t = t \quad\text{for } t = 1, \ldots, k,$$
$$I_t = \arg\max_{1 \le j \le k}\ \sup\left\{\mu \in (0,1) : d\left(\hat\mu_j(t-1), \mu\right) \le \frac{f(t)}{T_j(t-1)}\right\},$$
where
$$d(\mu_1, \mu_2) = \mu_1\log\frac{\mu_1}{\mu_2} + (1 - \mu_1)\log\frac{1 - \mu_1}{1 - \mu_2}$$
and $f(t) = \log t + 3\log\log(t)$.
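The supremum in the KL-UCB-Bernoulli index has no closed form, but $d(\hat\mu, \mu)$ is increasing in $\mu$ on $[\hat\mu, 1)$, so it can be found by bisection. The following is a minimal sketch under stated assumptions: the names `klucb_index` and `exploration`, the clamping constant `eps` (used to avoid `log(0)`), the tolerance, and the clipping of $f(t)$ at zero for very small $t$ are choices of this example, not part of the slides.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """d(p, q) for Bernoulli means; inputs are clamped to (eps, 1-eps) for safety."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def exploration(t):
    """f(t) = log t + 3 log log t, clipped at 0 (it is negative for t = 2)."""
    return max(math.log(t) + 3.0 * math.log(math.log(t)), 0.0)

def klucb_index(mu_hat, pulls, t, tol=1e-9):
    """Largest mu in [mu_hat, 1] with d(mu_hat, mu) <= f(t) / pulls, by bisection."""
    level = exploration(t) / pulls
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(mu_hat, mid) <= level:
            lo = mid  # mid is still inside the confidence set
        else:
            hi = mid
    return lo

# Example: empirical mean 0.5 after 10 pulls, at round t = 100.
idx = klucb_index(0.5, 10, 100)
```

The index shrinks toward the empirical mean as the number of pulls grows, for example `klucb_index(0.5, 100, 100)` is smaller than `klucb_index(0.5, 10, 100)`.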

KL-UCB-Bernoulli for bounded rewards. Theorem: KL-UCB-Bernoulli satisfies:
$$\mathbb{E}T_j(n) \le \frac{\log n}{d(\mu_j, \mu^*)} + O\left(\sqrt{\log n}\right),$$
where
$$d(\mu_1, \mu_2) = \mu_1\log\frac{\mu_1}{\mu_2} + (1 - \mu_1)\log\frac{1 - \mu_1}{1 - \mu_2}.$$
The leading term is optimal for Bernoulli rewards, but might not be optimal, for example, if the variance is lower than $\mu(1 - \mu)$.
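To see the strategy in action, here is a small simulation sketch of KL-UCB-Bernoulli on a two-armed Bernoulli bandit. Everything specific in it is an assumption for illustration: the arm means `[0.2, 0.8]`, the horizon, the seed, the bisection tolerance, and the helper names. It only counts pulls; it does not attempt to verify the theorem's constant.

```python
import math
import random

def kl_bernoulli(p, q, eps=1e-12):
    """Bernoulli divergence d(p, q), with inputs clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def klucb_index(mu_hat, pulls, t, tol=1e-6):
    """Bisection for sup{mu : d(mu_hat, mu) <= f(t)/pulls}, f clipped at 0."""
    level = max(math.log(t) + 3.0 * math.log(math.log(t)), 0.0) / pulls
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(mu_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

def run_klucb(means, n, seed=0):
    """Play n rounds; return the number of pulls of each arm."""
    rng = random.Random(seed)
    k = len(means)
    pulls, sums = [0] * k, [0.0] * k
    for t in range(1, n + 1):
        if t <= k:
            j = t - 1  # initial rounds: pull each arm once
        else:
            j = max(range(k),
                    key=lambda a: klucb_index(sums[a] / pulls[a], pulls[a], t))
        reward = 1.0 if rng.random() < means[j] else 0.0
        pulls[j] += 1
        sums[j] += reward
    return pulls

n = 2000
pulls = run_klucb([0.2, 0.8], n)
leading_term = math.log(n) / kl_bernoulli(0.2, 0.8)  # log n / d(mu_j, mu*)
```

Comparing `pulls[0]` with `leading_term` over several seeds and horizons gives a rough feel for how the number of suboptimal pulls scales with $\log n$.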

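KL-UCB: More concentration inequalities. Now, Pinsker's inequality gives
$$\psi^*_{X-\mu}(\mu' - \mu) = D_{KL}(\mu', \mu) = \mu'\log\frac{\mu'}{\mu} + (1 - \mu')\log\frac{1 - \mu'}{1 - \mu} \ge 2(\mu' - \mu)^2,$$
which shows this is at least as good as Hoeffding's inequality:
$$P\left(\bar{X}_n \ge \mu'\right) \le \exp\left(-2n(\mu' - \mu)^2\right), \quad\text{that is,}\quad P\left(\bar{X}_n \ge \mu + \epsilon\right) \le \exp\left(-2n\epsilon^2\right).$$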
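For Bernoulli samples the three quantities can be compared directly: the exact binomial tail, the KL (Chernoff) bound, and the Hoeffding bound. The sketch below uses illustrative values $n = 100$, $\mu = 0.5$, and threshold count $k = 70$ (so $\mu' = 0.7$); the function names are hypothetical.

```python
import math

def binomial_upper_tail(n, mu, k):
    """Exact P(S_n >= k) for S_n ~ Binomial(n, mu)."""
    return sum(math.comb(n, s) * mu ** s * (1.0 - mu) ** (n - s)
               for s in range(k, n + 1))

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

n, mu, k = 100, 0.5, 70
mu_prime = k / n

exact = binomial_upper_tail(n, mu, k)                    # P(mean >= mu')
chernoff = math.exp(-n * kl_bernoulli(mu_prime, mu))     # exp(-n * D_KL(mu', mu))
hoeffding = math.exp(-2.0 * n * (mu_prime - mu) ** 2)    # exp(-2n (mu' - mu)^2)
```

By Pinsker's inequality the ordering `exact <= chernoff <= hoeffding` holds for any valid choice of `n`, `mu`, and `k` with `k / n > mu`; the gap between the last two widens as $\mu$ moves toward 0 or 1.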

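Exponential families. Example: Gaussian:
$$p_\theta(x) = \frac{\exp\left(-x^2/(2\sigma^2)\right)}{\sqrt{2\pi\sigma^2}}\exp\left(\frac{\mu}{\sigma^2}x - \frac{\mu^2}{2\sigma^2}\right),$$
$$\theta = \frac{\mu}{\sigma^2}, \qquad \mu(\theta) = \sigma^2\theta, \qquad A(\theta) = \frac{\sigma^2\theta^2}{2},$$
$$\Gamma_\theta(\lambda) = \frac{\sigma^2}{2}(\lambda + \theta)^2 - \frac{\theta^2\sigma^2}{2}, \qquad \Gamma^*_{\theta_1}(\mu_2) = \frac{1}{2\sigma^2}(\mu_2 - \mu_1)^2.$$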
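The Gaussian conjugate follows from the general recipe by a one-line maximization. The steps below are a worked sketch using only the quantities defined for the Gaussian family, with $\mu_1 = \sigma^2\theta_1$.

```latex
% Worked derivation of the Gaussian convex conjugate (mu_1 = sigma^2 theta_1)
\begin{align*}
\Gamma_{\theta_1}(\lambda)
  &= \frac{\sigma^2}{2}(\lambda + \theta_1)^2 - \frac{\sigma^2\theta_1^2}{2}
   = \frac{\sigma^2\lambda^2}{2} + \mu_1\lambda, \\
\Gamma^*_{\theta_1}(\mu_2)
  &= \sup_{\lambda}\Bigl(\lambda(\mu_2 - \mu_1) - \frac{\sigma^2\lambda^2}{2}\Bigr),
  \qquad \text{maximized at } \lambda = \frac{\mu_2 - \mu_1}{\sigma^2}, \\
  &= \frac{(\mu_2 - \mu_1)^2}{\sigma^2} - \frac{(\mu_2 - \mu_1)^2}{2\sigma^2}
   = \frac{(\mu_2 - \mu_1)^2}{2\sigma^2}.
\end{align*}
```

With $\sigma^2 = 1/4$ this is $2(\mu_2 - \mu_1)^2$, the quadratic that appears in Hoeffding's inequality.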

Exponential families. With $\sigma^2 = 1/4$, Pinsker's inequality corresponds to Hoeffding's inequality. So we can view the UCB strategy (based on Hoeffding's inequality) as a special case of KL-UCB, modeling the reward distributions from $[0,1]$ as $N(\mu, 1/4)$.

KL-UCB-Gaussian for bounded rewards. KL-UCB-Gaussian Strategy. For the Gaussian family $P_\mu$:
$$I_t = t \quad\text{for } t = 1, \ldots, k,$$
$$I_t = \arg\max_{1 \le j \le k}\ \sup\left\{\mu \in (0,1) : d\left(\hat\mu_j(t-1), \mu\right) \le \frac{f(t)}{T_j(t-1)}\right\},$$
where $f(t) = \log t + 3\log\log(t)$ and $d(\mu_1, \mu_2) = 2(\mu_1 - \mu_2)^2$. This is equivalent to the UCB strategy (based on Hoeffding) that we saw last time.
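With the quadratic divergence the supremum can be solved in closed form, which makes the connection to a Hoeffding-style index explicit. The following worked step ignores the restriction $\mu \in (0,1)$ (an assumption of this sketch); with the restriction, the value is capped at 1.

```latex
% Closed-form index for d(mu_1, mu_2) = 2 (mu_1 - mu_2)^2, ignoring the cap at 1
\begin{align*}
\sup\Bigl\{\mu : 2\bigl(\hat\mu_j(t-1) - \mu\bigr)^2 \le \frac{f(t)}{T_j(t-1)}\Bigr\}
  &= \hat\mu_j(t-1) + \sqrt{\frac{f(t)}{2\,T_j(t-1)}} .
\end{align*}
```

So the index is the empirical mean plus a confidence width that shrinks like $1/\sqrt{T_j(t-1)}$, regardless of where the empirical mean sits in $[0,1]$.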

UCB for bounded rewards. Theorem: UCB satisfies:
$$\mathbb{E}T_j(n) \le \frac{\log n}{d(\mu_j, \mu^*)} + O\left(\sqrt{\log n}\right),$$
where $d(\mu_1, \mu_2) = 2(\mu_1 - \mu_2)^2$. This result is weaker (because of Pinsker's inequality) than the result for KL-UCB-Bernoulli.
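The size of the gap between the two leading terms depends on where the means lie. The sketch below compares $\log n / d(\mu_j, \mu^*)$ under the Bernoulli divergence and under the quadratic divergence; the means $0.05$ and $0.1$ and the horizon are illustrative assumptions, as are the function names.

```python
import math

def d_bernoulli(m1, m2):
    """Bernoulli divergence d(m1, m2) for means in (0, 1)."""
    return m1 * math.log(m1 / m2) + (1.0 - m1) * math.log((1.0 - m1) / (1.0 - m2))

def d_gaussian(m1, m2):
    """Quadratic divergence 2 (m1 - m2)^2 (Gaussian with variance 1/4)."""
    return 2.0 * (m1 - m2) ** 2

def leading_terms(mu_j, mu_star, n):
    """Return (log n / d_B, log n / d_G), the leading terms of the two bounds."""
    return (math.log(n) / d_bernoulli(mu_j, mu_star),
            math.log(n) / d_gaussian(mu_j, mu_star))

kl_term, hoeffding_term = leading_terms(0.05, 0.1, 10_000)
ratio = hoeffding_term / kl_term  # how much larger the UCB leading term is
```

For means near 0 or 1 the Bernoulli divergence is much larger than the quadratic one, so the KL-UCB-Bernoulli leading term is correspondingly smaller; for means near $1/2$ the two are close.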

KL-UCB regret bounds: upper versus lower. Denote the canonical exponential family defined wrt a measure $m$ by $E_m$:
$$E_m = \left\{P : \frac{dP}{dm}(x) = \exp\left(\theta x - A(\theta)\right), \text{ and } A(\theta) < \infty\right\},$$
where $A(\theta) = \log\left(\int \exp(\theta x)\,dm(x)\right)$. Write $P_{m,\theta}$ for the element of $E_m$ with parameter $\theta$, and $P_{m,\mu}$ for the element of $E_m$ with mean $\mu$ (there's a one-to-one map between $\theta$ and $\mu$, so it's well-defined). And define for $E_m$ the relevant divergence as a function of expectations:
$$d_m(\mu, \mu') := D_{KL}\left(P_{m,\mu}, P_{m,\mu'}\right).$$

KL-UCB regret bounds: upper versus lower. We have derived bounds on $\Gamma_{P_j}$ in terms of $\Gamma_{P_{m,\mu_j}}$, for some exponential families $E_m$. For instance, if we let $\mathcal{P}$ denote the set of distributions on $[0,1]$, and consider two exponential families, the Bernoulli (call it $E_B$) and the Gaussian with variance $1/4$ (call it $E_G$), then we have: For all $P \in \mathcal{P}$ with $PX = \mu$, and all $\lambda$,
$$\Gamma_P(\lambda) \le \Gamma_{P_{B,\mu}}(\lambda) \le \Gamma_{P_{G,\mu}}(\lambda).$$
And this is equivalent to: for all $\mu'$,
$$\Gamma^*_P(\mu') \ge \Gamma^*_{P_{B,\mu}}(\mu') \ge \Gamma^*_{P_{G,\mu}}(\mu'),$$
that is,
$$\Gamma^*_P(\mu') \ge d_B(\mu', \mu) \ge d_G(\mu', \mu).$$

KL-UCB regret bounds: upper versus lower. We have seen upper bounds on regret based on these inequalities of the form
$$R_n \le \sum_{j : \Delta_j > 0} \Delta_j\left(\frac{\log n}{d_m(\mu_j, \mu^*)} + O\left(\sqrt{\log n}\right)\right).$$
And we've seen lower bounds that are (roughly) of the form
$$R_n \ge \sum_{j : \Delta_j > 0} \Delta_j\left(\frac{\log n}{D_{KL}(P_j, P_{j^*})} + o(1)\right).$$
To understand the gap between the upper bounds and the lower bounds, we can consider the I-projection of $P_{j^*} \in E_{P_{j^*}}$ onto $\{P : PX = \mu_j\}$.

KL-UCB regret bounds: upper versus lower. Theorem: Fix a measure $m$ and an exponential family $E_m$. For all $Q \in E_m$ and $P$ with $PX = \mu$,
$$D_{KL}(P, Q) = D_{KL}(P, P_{m,\mu}) + D_{KL}(P_{m,\mu}, Q).$$
In particular,
$$\inf\{D_{KL}(P, Q) : PX = \mu\} = D_{KL}(P_{m,\mu}, Q).$$
We say that $P_{m,\mu}$ is the I-projection of $Q \in E_m$ onto $\{P : PX = \mu\}$.
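The decomposition can be checked numerically on a small finite support. In the sketch below, the support points, the base weights `m`, the parameter `0.8` for $Q$, and the distribution `P` are all illustrative assumptions; the member of the family with a given mean is found by bisection on $\theta$, relying on the mean being increasing in $\theta$.

```python
import math

def tilt(weights, xs, theta):
    """Element of the exponential family generated by `weights`, with parameter theta."""
    unnorm = [w * math.exp(theta * x) for w, x in zip(weights, xs)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def mean(p, xs):
    return sum(pi * x for pi, x in zip(p, xs))

def kl(p, q):
    """KL divergence between two distributions on the same finite support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def tilt_to_mean(weights, xs, target, lo=-50.0, hi=50.0):
    """Find the family member with mean `target` by bisection on theta."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(tilt(weights, xs, mid), xs) < target:
            lo = mid
        else:
            hi = mid
    return tilt(weights, xs, 0.5 * (lo + hi))

xs = [0.0, 0.5, 1.0]           # illustrative support
m = [1.0, 1.0, 1.0]            # illustrative base measure
Q = tilt(m, xs, 0.8)           # some Q in E_m
P = [0.5, 0.1, 0.4]            # arbitrary P, not in E_m, mean 0.45
proj = tilt_to_mean(m, xs, mean(P, xs))  # P_{m,mu}, same mean as P

lhs = kl(P, Q)
rhs = kl(P, proj) + kl(proj, Q)
```

The identity holds because $\log(dP_{m,\mu}/dQ)$ is affine in $x$, so its expectation is the same under any two distributions that share the mean $\mu$.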

KL-UCB regret bounds: upper versus lower. The negative KL-divergence
$$-D_{KL}(P, Q) = -\int \log\frac{dP}{dQ}\,dP = -\int \frac{dP}{dQ}\log\frac{dP}{dQ}\,dQ$$
is also called the entropy of $P$ (defined with respect to $Q$), $H_Q(P)$. So the result says that among all distributions satisfying the mean constraint $PX = \mu$, the one with maximum entropy (wrt any $Q$ in $E_m$) is $P_{m,\mu}$ in the exponential family $E_m$.

KL-UCB regret bounds: upper versus lower. Using this fact, we can see that
$$D_{KL}(P_j, P_{j^*}) \ge \inf\{D_{KL}(P, P_{j^*}) : PX = \mu_j\} = D_{KL}\left(P_{P_{j^*},\mu_j}, P_{j^*}\right)$$
$$= \Gamma^*_{P_{j^*}}(\mu_j) \quad\text{(both distributions are in } E_{P_{j^*}}\text{)}$$
$$\ge \Gamma^*_{P_{m,\mu^*}}(\mu_j) = d_m(\mu_j, \mu^*),$$
where $E_m$ is one of the exponential families that give the upper bounds (Bernoulli or Gaussian).

KL-UCB regret bounds: upper versus lower. So the upper bound might be loose because $P_j$ is further from $P_{j^*}$ than the I-projection of $P_{j^*}$ onto $\{PX = \mu_j\}$ (i.e., because $P_j$ is not in $E_{P_{j^*}}$), or because $\Gamma_{P_{m,\mu^*}}$ is a loose upper bound on $\Gamma_{P_{j^*}}$ (i.e., because $P_{j^*}$ is not in $E_m$).

KL-UCB: Regret bounds. The KL-UCB strategies choose $I_1 = 1, \ldots, I_k = k$, and then
$$I_{t+1} = \arg\max_{1 \le j \le k} U_j(t),$$
where
$$U_j(t) = \sup\left\{\mu \in \mu(\Theta) \text{ s.t. } d\left(\hat\mu_j(t), \mu\right) \le \frac{f(t)}{T_j(t)}\right\}.$$
For a suboptimal arm $j$, we want to bound
$$\mathbb{E}T_j(n) = 1 + \sum_{t=k}^{n-1} P\{I_{t+1} = j\}.$$
We might have $I_{t+1} = j$ if either $U_{j^*}(t)$ is not an upper bound on $\mu^*$ (for a suitable choice for $f(t)$, this has negligible probability), or it is an upper bound, but $U_j(t)$ is bigger (and so exceeds $\mu^*$; this can't happen too often).

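KL-UCB: Regret bounds.
$$\{I_{t+1} = j\} \subseteq \{\mu^* \ge U_{j^*}(t)\} \cup \{I_{t+1} = j \text{ and } \mu^* < U_{j^*}(t) \le U_j(t)\}$$
$$\subseteq \{\mu^* \ge U_{j^*}(t)\} \cup \{I_{t+1} = j \text{ and } \mu^* < U_j(t)\}.$$
Also,
$$\{\mu^* < U_j(t)\} = \left\{\mu^* < \sup\left\{\mu \in \mu(\Theta) \text{ s.t. } d\left(\hat\mu_j(t), \mu\right) \le \frac{f(t)}{T_j(t)}\right\}\right\}$$
$$\subseteq \left\{\hat\mu_j(t) \ge \mu^*_{f(t)/T_j(t)}\right\} \subseteq \left\{\hat\mu_j(t) \ge \mu^*_{f(n)/T_j(t)}\right\},$$
where
$$\mu^*_{f(n)/T_j(t)} := \min\left\{\mu : d(\mu, \mu^*) \le \frac{f(n)}{T_j(t)}\right\}.$$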

KL-UCB: Regret bounds. And
$$\mathbb{E}T_j(n) = 1 + \sum_{t=k}^{n-1} P\{I_{t+1} = j\}.$$
The expected number of times the upper bound is violated satisfies
$$\sum_{t=k}^{n-1} P\{\mu^* \ge U_{j^*}(t)\} \le 3 + 4e\log\log n.$$

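KL-UCB: Regret bounds.
$$\sum_{t=k}^{n-1} P\left\{I_{t+1} = j \text{ and } \hat\mu_j(t) \ge \mu^*_{f(n)/T_j(t)}\right\}$$
$$= \sum_{t=k}^{n-1}\sum_{m=2}^{n-k+1} P\left\{\hat\mu_{j,m-1} \ge \mu^*_{f(n)/(m-1)} \text{ and the } m\text{th pull of arm } j \text{ is at time } t+1\right\}$$
$$\le \sum_{m=1}^{n-k} P\left\{\hat\mu_{j,m} \ge \mu^*_{f(n)/m}\right\}$$
$$\le M + \sum_{m=M+1}^{n-k} P\left\{\hat\mu_{j,m} \ge \mu^*_{f(n)/m}\right\},$$
for $M = f(n)/d(\mu_j, \mu^*)$.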

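KL-UCB: Regret bounds.
$$\sum_{m=M+1}^{n-k} P\left\{\hat\mu_{j,m} \ge \mu^*_{f(n)/m}\right\} \le \sum_{m=M+1}^{n-k} \exp\left(-m\,d\left(\mu^*_{f(n)/m}, \mu_j\right)\right) = O\left(\sqrt{f(n)}\right).$$
(Relate $d\left(\mu^*_{f(n)/m}, \mu_j\right)$ to $d(\mu_j, \mu^*)$, bound by an integral, and use Laplace's method.)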

Empirical KL-UCB for rewards in $[0,1]$. Empirical KL-UCB Strategy:
$$I_t = t \quad\text{for } t = 1, \ldots, k,$$
$$I_t = \arg\max_{1 \le j \le k}\ \sup\left\{\mathbb{E}_P X : |\mathrm{supp}(P)| < \infty,\ D_{KL}\left(\hat{P}_j(t-1), P\right) \le \frac{f(t)}{T_j(t-1)}\right\},$$
where $\hat{P}_j(t-1)$ is the empirical distribution of the $T_j(t-1)$ pulls of arm $j$ up to time $t-1$, and $f(t) = \log t + 3\log\log(t)$.

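Empirical KL-UCB for rewards in $[0,1]$. It turns out that it's always a finite convex optimization:
$$\sup\left\{\mathbb{E}_P X : |\mathrm{supp}(P)| < \infty,\ D_{KL}\left(\hat{P}_j(t-1), P\right) \le \gamma\right\}$$
$$= \sup\left\{\mathbb{E}_P X : \mathrm{supp}(P) \subseteq \mathrm{supp}\left(\hat{P}_j(t-1)\right) \cup \{1\},\ D_{KL}\left(\hat{P}_j(t-1), P\right) \le \gamma\right\}.$$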

Empirical KL-UCB for rewards in $[0,1]$. Empirical KL-UCB Strategy:
$$I_t = t \quad\text{for } t = 1, \ldots, k,$$
$$I_t = \arg\max_{1 \le j \le k}\ \sup\left\{\mathbb{E}_P X : \mathrm{supp}(P) \subseteq \mathrm{supp}\left(\hat{P}_j(t-1)\right) \cup \{1\},\ D_{KL}\left(\hat{P}_j(t-1), P\right) \le \frac{f(t)}{T_j(t-1)}\right\},$$
where $\hat{P}_j(t-1)$ is the empirical distribution of the $T_j(t-1)$ pulls of arm $j$ up to time $t-1$, and $f(t) = \log t + 3\log\log(t)$.

Empirical KL-UCB for rewards in $[0,1]$. Theorem: Empirical KL-UCB for rewards in $[0,1]$ satisfies:
$$\mathbb{E}T_j(n) \le \frac{\log n}{\inf\{D_{KL}(P_j, P) : PX > \mu^*\}} + O\left(\log^{4/5} n\,\log\log n\right),$$
provided $\mu_j > 0$ and $\mu^* < 1$. The leading term is optimal (including the constant). But the remainder term is worse than in the parametric case.
