Linear Bandits. Peter Bartlett

1 Linear Bandits. Peter Bartlett. Outline: Linear bandits. Exponential weights with unbiased loss estimates. Controlling loss estimates and their variance. Barycentric spanner. Uniform distribution. John's distribution. Lower bounds. Stochastic mirror descent. Full information. Bandit information.

2 Linear bandits. At round $t$, Strategy chooses $a_t \in A \subseteq \mathbb{R}^d$. Adversary chooses a linear loss $\ell_t : A \to [-1,1]$. Strategy sees the loss $\ell_t(a_t)$. The loss is linear in the action. Aim to minimize the regret
$$R_n = \mathbb{E}\sum_{t=1}^n \ell_t(a_t) - \inf_{a \in A} \mathbb{E}\sum_{t=1}^n \ell_t(a).$$

3 Example: Packet routing. Consider the problem of packet routing in a network $(V,E)$. At round $t$, Strategy chooses a path $a_t \in A \subseteq \{0,1\}^E$ from an origin node to a destination node. Adversary chooses delays $\ell_t \in L = [0,1]^E$. Strategy sees the loss $\ell_t^\top a_t$ (the total delay). The loss is linear in the action. Aim to minimize the regret
$$R_n = \mathbb{E}\sum_{t=1}^n \ell_t^\top a_t - \inf_{a \in A} \mathbb{E}\sum_{t=1}^n \ell_t^\top a.$$

4 Linear bandits vs k-armed bandits. This problem is closely related to the classical $k$-armed bandit problem. At round $t$: Strategy chooses $a_t \in A = \{1,\dots,k\}$. Adversary chooses $\ell_t \in L = [0,1]^A$. Strategy sees the loss $\ell_t(a_t)$. Aim to minimize the regret $R_n = \mathbb{E}\sum_{t=1}^n \ell_t(a_t) - \inf_{a\in A}\mathbb{E}\sum_{t=1}^n \ell_t(a)$.

5 Linear bandits vs k-armed bandits. This is unchanged (up to a constant factor) if we instead define $A = \{e_1,\dots,e_k\} \subseteq \mathbb{R}^k$ and $L = \{\ell : A \to [-1,1] \text{ linear}\}$. And allowing the strategy to choose $a$ in the convex hull of $A$ does not change the regret $R_n = \mathbb{E}\sum_{t=1}^n \ell_t(a_t) - \inf_{a\in A}\mathbb{E}\sum_{t=1}^n \ell_t(a)$. (But it might make the game easier for the strategy, since it changes the information that the strategy sees.)

6 Finite covers. For a compact $A \subseteq \mathbb{R}^d$, we can construct an $\epsilon$-cover of size $O(1/\epsilon^d)$, for example in the uniform metric $\rho(\hat a, a) = \|\hat a - a\|_\infty := \max_i |\hat a_i - a_i|$. Since we are aiming for $O(\sqrt n)$ regret, we can think of $A$ as having cardinality $|A| = O(n^{d/2})$, so $\log|A| = O(d\log n)$.

7 Exponential weights for linear bandits. Given $A$, a distribution $\mu$ on $A$, mixing coefficient $\gamma > 0$, and learning rate $\eta > 0$, set $q_1$ uniform on $A$. For $t = 1, 2, \dots, n$:
1. $p_t = (1-\gamma) q_t + \gamma\mu$.
2. Choose $a_t \sim p_t$.
3. Observe $\ell_t^\top a_t$.
4. Update $q_{t+1}(a) \propto q_t(a)\exp(-\eta\,\tilde\ell_t^\top a)$, where $\tilde\ell_t = \Sigma_t^{-1} a_t a_t^\top \ell_t$ and $\Sigma_t = \mathbb{E}_{a \sim p_t}\, a a^\top$.
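
A minimal numerical sketch of this strategy (an illustration, not from the lecture), assuming a finite action set stored as rows of a matrix; the function name, the interface, and the fixed loss sequence are all illustrative.

```python
import numpy as np

def exp_weights_linear_bandit(actions, losses, mu, gamma, eta, seed=0):
    """Exponential weights with unbiased linear-loss estimates.

    actions: (m, d) array, one action per row (assumed to span R^d).
    losses:  (n, d) array; losses[t] is the adversary's loss vector l_t.
    mu:      (m,) exploration distribution on the rows of `actions`.
    """
    rng = np.random.default_rng(seed)
    m, d = actions.shape
    q = np.full(m, 1.0 / m)                        # q_1 uniform on A
    total = 0.0
    for l_t in losses:
        p = (1 - gamma) * q + gamma * mu           # 1. mix in exploration
        a_t = actions[rng.choice(m, p=p)]          # 2. a_t ~ p_t
        obs = l_t @ a_t                            # 3. only l_t^T a_t is seen
        total += obs
        Sigma = actions.T @ (p[:, None] * actions) # Sigma_t = E_{a~p_t} a a^T
        l_hat = np.linalg.solve(Sigma, a_t) * obs  # tilde l_t = Sigma_t^{-1} a_t (a_t^T l_t)
        q *= np.exp(-eta * (actions @ l_hat))      # 4. exponential-weights update
        q /= q.sum()
    return total
```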

8 Unbiased loss estimates. Assume $\mathrm{span}(A) = \mathbb{R}^d$ (otherwise, we can project to a lower dimension) and that $\mu$ has support on a $d$-dimensional set, so $\mathbb{E}_{a\sim p_t}\, a a^\top$ has rank $d$. The strategy observes $a_t^\top \ell_t$ and $a_t$, so it can compute $\tilde\ell_t = \Sigma_t^{-1} a_t (a_t^\top \ell_t)$. This estimate is unbiased:
$$\mathbb{E}\big[\tilde\ell_t \mid F_{t-1}\big] = \big(\mathbb{E}_{a\sim p_t}\, a a^\top\big)^{-1}\big(\mathbb{E}_{a_t\sim p_t}\, a_t a_t^\top\big)\ell_t = \ell_t.$$
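
A quick Monte Carlo check of this unbiasedness claim (illustrative only), with an arbitrary made-up spanning action set and loss vector:

```python
import numpy as np

rng = np.random.default_rng(0)
actions = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # spans R^2
p = np.array([0.3, 0.3, 0.4])                             # p_t
l = np.array([0.5, -0.25])                                # fixed loss vector
Sigma = actions.T @ (p[:, None] * actions)                # E_{a~p} a a^T

est = np.zeros(2)
T = 200_000
for i in rng.choice(len(actions), size=T, p=p):
    a = actions[i]
    est += np.linalg.solve(Sigma, a) * (a @ l)            # tilde l for this draw
print(est / T)  # approaches l = [0.5, -0.25]
```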

9 Regret bound. Theorem: For $\eta\sup_{a\in A}|\tilde\ell_t^\top a| \le 1$,
$$R_n \le \gamma n + \frac{\log|A|}{\eta} + (e-2)\,\eta\sum_{t=1}^n\mathbb{E}\,\mathbb{E}_{a\sim p_t}\big(\tilde\ell_t^\top a\big)^2.$$
So we need to control $\eta$ times the magnitude of the loss estimates, $\eta\sup_{a\in A}|\tilde\ell_t^\top a|$, and the variance term, $\mathbb{E}\,\mathbb{E}_{a\sim p_t}(\tilde\ell_t^\top a)^2$.

10 Proof. The regret is $\mathbb{E}\big[\sum_{t=1}^n\big(\ell_t^\top a_t - \ell_t^\top a\big)\big]$. We've seen that, given the history $F_{t-1}$,
$$\mathbb{E}\big[\tilde\ell_t \mid F_{t-1}\big] = \mathbb{E}\big[\Sigma_t^{-1} a_t a_t^\top \ell_t \mid F_{t-1}\big] = \mathbb{E}[\ell_t \mid F_{t-1}].$$
Lemma (some unbiased estimates involving $\tilde\ell_t$):
$$\mathbb{E}\big[\ell_t^\top a\big] = \mathbb{E}\big[\tilde\ell_t^\top a\big], \qquad \mathbb{E}\big[\ell_t^\top a_t\big] = \mathbb{E}\Big[\sum_{a\in A} p_t(a)\,\mathbb{E}\big[\tilde\ell_t \mid F_{t-1}\big]^\top a\Big] = \mathbb{E}\Big[\sum_{a\in A} p_t(a)\,\tilde\ell_t^\top a\Big].$$

11 Proof. So we can write the strategy's expected cumulative loss as
$$\mathbb{E}\sum_{t=1}^n \ell_t^\top a_t = \mathbb{E}\sum_{t=1}^n\sum_{a\in A} p_t(a)\,\tilde\ell_t^\top a.$$
We'll give up on the loss incurred in the exploration trials:
$$\sum_{a\in A} p_t(a)\,\tilde\ell_t^\top a = \sum_{a\in A}\big((1-\gamma)q_t(a) + \gamma\mu(a)\big)\tilde\ell_t^\top a = (1-\gamma)\sum_{a\in A} q_t(a)\,\tilde\ell_t^\top a + \underbrace{\gamma\sum_{a\in A}\mu(a)\,\tilde\ell_t^\top a}_{\text{exploration}}.$$

12 Proof. For $q_t$, we follow the standard analysis (see Adversarial Bandits), but instead of using non-negativity of the losses, we use a lower bound:
$$\log\mathbb{E}\exp\big(-\eta(X - \mathbb{E}X)\big) \le \mathbb{E}\big(\exp(-\eta X) - 1 + \eta X\big) \le (e-2)\,\eta^2\,\mathbb{E}X^2,$$
where the last inequality uses $\exp(-x) \le 1 - x + (e-2)x^2$ for $x \ge -1$. So if $\eta|\tilde\ell_t^\top a| \le 1$ for all $a \in A$, the previous analysis shows that, for any $a \in A$, the first term above satisfies
$$\sum_{t=1}^n\sum_{a'\in A} q_t(a')\,\tilde\ell_t^\top a' \le \sum_{t=1}^n \tilde\ell_t^\top a + \frac{\log|A|}{\eta} + (e-2)\,\eta\sum_{t=1}^n\sum_{a'\in A} q_t(a')\big(\tilde\ell_t^\top a'\big)^2.$$

13 Proof. Combining, and using the fact that $(1-\gamma)q_t(a) \le p_t(a)$,
$$\sum_{t=1}^n\sum_{a'\in A} p_t(a')\,\tilde\ell_t^\top a' \le \sum_{t=1}^n \tilde\ell_t^\top a + (\text{exploration}) + \frac{\log|A|}{\eta} + (e-2)\,\eta\sum_{t=1}^n\sum_{a'\in A} p_t(a')\big(\tilde\ell_t^\top a'\big)^2.$$
The unbiasedness lemma gives
$$R_n \le \gamma n + \frac{\log|A|}{\eta} + (e-2)\,\eta\sum_{t=1}^n\mathbb{E}\,\mathbb{E}_{a\sim p_t}\big(\tilde\ell_t^\top a\big)^2.$$

14 Controlling variance. Lemma: For $L \subseteq [-1,1]^A$, the variance term is bounded: $\mathbb{E}\,\mathbb{E}_{a\sim p_t}(\tilde\ell_t^\top a)^2 \le d$.
$$\mathbb{E}\big(\tilde\ell_t^\top a\big)^2 = a^\top\mathbb{E}\big(\tilde\ell_t\tilde\ell_t^\top\big)a = a^\top\mathbb{E}\big((\ell_t^\top a_t)^2\,\Sigma_t^{-1} a_t a_t^\top \Sigma_t^{-1}\big)a \le a^\top\Sigma_t^{-1}\mathbb{E}\big(a_t a_t^\top\big)\Sigma_t^{-1} a = a^\top\Sigma_t^{-1}a,$$
using $(\ell_t^\top a_t)^2 \le 1$. Hence
$$\mathbb{E}_{a\sim p_t}\mathbb{E}\big(\tilde\ell_t^\top a\big)^2 \le \mathbb{E}\,\mathrm{tr}\big(\Sigma_t^{-1} a a^\top\big) = \mathrm{tr}\big(\Sigma_t^{-1}\mathbb{E}\big(a a^\top\big)\big) = \mathrm{tr}(I) = d.$$

15 Controlling the magnitude of the estimator. Lemma: For $L \subseteq [-1,1]^A$, $|\tilde\ell_t^\top a| \le \sup_{a,b\in A}|a^\top\Sigma_t^{-1}b|$.
$$|\tilde\ell_t^\top a| = \big|a_t^\top\ell_t\,\big(\Sigma_t^{-1}a_t\big)^\top a\big| \le |a_t^\top\ell_t|\,|a_t^\top\Sigma_t^{-1}a| \le \sup_{a,b\in A}|a^\top\Sigma_t^{-1}b|.$$
We'll see that typically $\sup_{a,b\in A} a^\top\Sigma_t^{-1}b \le c_d/\gamma$.

16 Regret bound. Theorem: For $L \subseteq [-1,1]^A$, if $\sup_{a,b\in A} a^\top\Sigma_t^{-1}b \le c_d/\gamma$, setting
$$\eta = \sqrt{\frac{\log|A|}{n\big((e-2)d + c_d\big)}}, \qquad \gamma = c_d\,\eta$$
gives $R_n \le 2\sqrt{n(d + c_d)\log|A|}$.

17 Exploration distributions. (Dani, Hayes, Kakade, 2008): For $\mu$ uniform over a barycentric spanner, $R_n = O\big(d\sqrt{n\log|A|}\big) = \tilde O\big(d^{3/2}\sqrt n\big)$. (Cesa-Bianchi and Lugosi, 2009): For several combinatorial problems, $A \subseteq \{0,1\}^d$, $\mu$ uniform over $A$ gives $\sup_{a\in A}\|a\|_2^2\,/\,\lambda_{\min}\big(\mathbb{E}_{a\sim\mu}[a a^\top]\big) = O(d)$, so $R_n = O\big(\sqrt{dn\log|A|}\big) = \tilde O\big(d\sqrt n\big)$. (Bubeck, Cesa-Bianchi and Kakade, 2009): John's Theorem: $\tilde O(d\sqrt n)$.

18 Barycentric spanner. (Suppose that $A \subseteq \mathbb{R}^d$ spans $\mathbb{R}^d$.) A barycentric spanner of $A$ is a set $\{b_1,\dots,b_d\} \subseteq A$ that spans $\mathbb{R}^d$ and satisfies: for all $a \in A$ there is an $\alpha \in [-1,1]^d$ such that $a = B\alpha$, where $B = (b_1 \cdots b_d)$. Every compact $A$ has a barycentric spanner. If linear functions can be efficiently optimized over $A$, then there is an efficient algorithm for finding an approximate barycentric spanner (that is, $|\alpha_i| \le 1+\delta$; $O(d^2\log(d)/\delta)$ linear optimizations).

19 Barycentric spanner. Lemma: If $\{b_1,\dots,b_d\} \subseteq A$ maximizes $|\det(B)|$, then it is a barycentric spanner. Proof: For $a = B\alpha$,
$$\det\big(a\ b_2\ \cdots\ b_d\big) = \sum_i \alpha_i\det\big(b_i\ b_2\ \cdots\ b_d\big) = \alpha_1\det(B),$$
and by maximality $|\alpha_1\det(B)| \le |\det(B)|$, so $|\alpha_1| \le 1$; the same argument applies to each coordinate.
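
This determinant characterization suggests the swap procedure (due to Awerbuch and Kleinberg) behind the approximate spanner mentioned on the previous slide. A sketch for a finite $A$ given explicitly as rows of an array, assuming its first $d$ rows are linearly independent; with a general linear-optimization oracle the inner scan would be an oracle call instead.

```python
import numpy as np

def approx_barycentric_spanner(A, delta=0.1):
    """Swap a column for any point of A that grows |det B| by a (1 + delta) factor.

    On exit, every a in A satisfies a = B alpha with |alpha_i| <= 1 + delta.
    A: (m, d) array whose rows span R^d (first d rows assumed independent).
    """
    A = np.asarray(A, dtype=float)
    d = A.shape[1]
    B = A[:d].T.copy()                    # initial candidate basis, as columns
    improved = True
    while improved:
        improved = False
        for i in range(d):
            base = abs(np.linalg.det(B))
            for a in A:
                C = B.copy()
                C[:, i] = a               # try swapping a into column i
                if abs(np.linalg.det(C)) > (1 + delta) * base:
                    B, base = C, abs(np.linalg.det(C))
                    improved = True
    return B.T                            # rows b_1, ..., b_d
```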

20 Barycentric spanner. Theorem: For $A \subseteq [-1,1]^d$ and $\mu$ uniform on a barycentric spanner of $A$,
$$\sup_{a,b\in A} a^\top\Sigma_t^{-1}b \le \frac{d^2}{\gamma}$$
(that is, $c_d \le d^2$). Hence $R_n \le 2d\sqrt{2n\log|A|}$. Here
$$\Sigma_t = \frac{\gamma}{d}BB^\top + \underbrace{(1-\gamma)\sum_{a\in A} q_t(a)\,a a^\top}_{M}.$$

21 Barycentric spanner: Proof.
$$\sup_{a,b\in A} a^\top\Sigma_t^{-1}b \le \sup_{\alpha,\beta\in[-1,1]^d}\alpha^\top B^\top\Sigma_t^{-1}B\beta \le \sup_{\|\alpha\|=\|\beta\|=\sqrt d}\alpha^\top B^\top\Sigma_t^{-1}B\beta = d\,\lambda_{\max}\big(B^\top\Sigma_t^{-1}B\big) = d\,\lambda_{\max}\Big(\big(B^{-1}\Sigma_t B^{-\top}\big)^{-1}\Big)$$
$$= \frac{d}{\lambda_{\min}\big(B^{-1}\big(\frac{\gamma}{d}BB^\top + M\big)B^{-\top}\big)} \le \frac{d^2}{\gamma\,\lambda_{\min}\big(B^{-1}BB^\top B^{-\top}\big)} = \frac{d^2}{\gamma},$$
where $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ denote the largest and smallest eigenvalues.

22 Other exploration distributions. Lemma:
$$\sup_{a,b\in A} a^\top\Sigma_t^{-1}b \le \frac{\sup_{a\in A}\|a\|_2^2}{\gamma\,\lambda_{\min}\big(\mathbb{E}_{a\sim\mu}[a a^\top]\big)}.$$
Proof:
$$\sup_{a,b\in A} a^\top\Sigma_t^{-1}b \le \sup_{a\in A}\|a\|_2^2\,\lambda_{\max}\big(\Sigma_t^{-1}\big) = \frac{\sup_{a\in A}\|a\|_2^2}{\lambda_{\min}(\Sigma_t)},$$
and
$$\lambda_{\min}(\Sigma_t) = \min_{\|v\|=1}\sum_{a\in A} p_t(a)\,v^\top a a^\top v \ge \gamma\min_{\|v\|=1}\sum_{a\in A}\mu(a)\,v^\top a a^\top v = \gamma\,\lambda_{\min}\big(\mathbb{E}_{a\sim\mu}[a a^\top]\big).$$

23 John's distribution. Theorem [John's Theorem]: For any convex set $A \subseteq \mathbb{R}^d$, denote the ellipsoid of minimal volume containing it as $E = \{x \in \mathbb{R}^d : (x-c)^\top M(x-c) \le 1\}$. Then there is a set $\{u_1,\dots,u_m\} \subseteq E \cap A$ of $m \le d(d+1)/2 + 1$ contact points and a distribution $p$ on this set such that any $x \in \mathbb{R}^d$ can be written
$$x = c + d\sum_{i=1}^m p_i\,\langle x - c,\, u_i - c\rangle\,(u_i - c),$$
where $\langle\cdot,\cdot\rangle$ is the inner product for which the minimal ellipsoid is the unit ball about its center $c$: $\langle x, y\rangle = x^\top M y$.

24 John's distribution. This shows that
$$x - c = d\sum_i p_i\,(u_i - c)(u_i - c)^\top M(x - c), \qquad \tilde x = d\sum_i p_i\,\tilde u_i\tilde u_i^\top\tilde x, \qquad \frac{1}{d}I = \sum_i p_i\,\tilde u_i\tilde u_i^\top,$$
where $\tilde u_i = M^{1/2}(u_i - c)$, and similarly $\tilde x = M^{1/2}(x - c)$. Setting the exploration distribution $\mu$ to be the distribution $p$ over the set of transformed contact points $\tilde u_i$, we see that, for $a, b \in A$,
$$\tilde a^\top\,\mathbb{E}_{u\sim\mu}\big[u u^\top\big]\,\tilde b = \frac{1}{d}\,\tilde a^\top\tilde b.$$

25 John's distribution. So if we shift the origin of the set $A$ and of the $u_i$ (with the corresponding introduction of a constant component in the losses), we have
$$\sup_{a,b\in A} a^\top\Sigma_t^{-1}b \le \frac{d}{\gamma},$$
that is, $c_d \le d$. Hence $R_n \le 2\sqrt{2nd\log|A|}$.

26 Exploration distributions. (Dani, Hayes, Kakade, 2008): For $\mu$ uniform over a barycentric spanner, $R_n = O\big(d\sqrt{n\log|A|}\big) = \tilde O\big(d^{3/2}\sqrt n\big)$. (Cesa-Bianchi and Lugosi, 2009): For several combinatorial problems, $A \subseteq \{0,1\}^d$, $\mu$ uniform over $A$ gives $\sup_{a\in A}\|a\|_2^2\,/\,\lambda_{\min}\big(\mathbb{E}_{a\sim\mu}[a a^\top]\big) = O(d)$, so $R_n = O\big(\sqrt{dn\log|A|}\big) = \tilde O\big(d\sqrt n\big)$. (Bubeck, Cesa-Bianchi and Kakade, 2009): John's Theorem: $\tilde O(d\sqrt n)$.

27 Outline. Linear bandits. Exponential weights with unbiased loss estimates. Controlling loss estimates and their variance. Barycentric spanner. Uniform distribution. John's distribution. Lower bounds. Stochastic mirror descent. Full information. Bandit information.

28 Lower bounds. Lower bounds from the stochastic setting suffice. Theorem: Consider $A = \{\pm 1\}^d$, $L = \{\pm e_i : 1 \le i \le d\}$. There is a constant $c$ such that, for any strategy and any $n$, there is an i.i.d. adversary for which $R_n \ge c\,d\sqrt n$. (Here, $\sqrt{nd\log|A|} = O(d\sqrt n)$.)

29 Lower bounds: proof. Probabilistic method: Fix $\epsilon \in (0,1/2)$ and, for each $b \in \{\pm 1\}^d$, define $P_b$ on $L$ as
$$P_b(e_i) = \frac{1 - b_i\epsilon}{2d}, \qquad P_b(-e_i) = \frac{1 + b_i\epsilon}{2d}$$
(so that the optimal action is $a^* = b$). We'll choose $b$ uniformly at random, and show that the expected regret under this choice is large.
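
A tiny simulation of this adversary (illustrative only), checking that the mean loss vector is $-(\epsilon/d)\,b$, so that $a = b$ is indeed optimal:

```python
import numpy as np

def sample_loss(b, eps, rng):
    """Draw l ~ P_b: coordinate i uniform; +e_i w.p. (1 - b_i eps)/2, else -e_i."""
    d = len(b)
    i = rng.integers(d)
    sign = 1.0 if rng.random() < (1 - b[i] * eps) / 2 else -1.0
    return sign * np.eye(d)[i]

rng = np.random.default_rng(1)
b, eps = np.array([1, -1, 1, -1]), 0.2
mean = sum(sample_loss(b, eps, rng) for _ in range(100_000)) / 100_000
print(mean)  # approaches -(eps/d) * b, so E[l^T a] is minimized at a = b
```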

30 Lower bounds: proof.
$$R_n(P_b) = \sum_{t=1}^n\sum_{i=1}^d \mathbb{E}\big[\ell_{t,i}(a_{t,i} - b_i)\big] = \sum_{t=1}^n\sum_{i=1}^d\Big(\frac{1 - b_i\epsilon}{2d} - \frac{1 + b_i\epsilon}{2d}\Big)\,\mathbb{E}[a_{t,i} - b_i] = \sum_{t=1}^n\sum_{i=1}^d\frac{\epsilon}{d}\,\mathbb{E}\big[b_i(b_i - a_{t,i})\big] = \sum_{i=1}^d\underbrace{\frac{2\epsilon}{d}\sum_{t=1}^n\mathbb{E}\,1[a_{t,i} \ne b_i]}_{R_n^i(b_i)}.$$

31 Lower bounds: proof. The regret of sub-game $i$, $R_n^i(b_i)$, is at least the regret that would be incurred if the strategy knew that the adversary was using one of the $P_b$ distributions, and also knew $\{b_j : j \ne i\}$. In that case, it would know $\theta := \mathbb{E}\sum_{j\ne i}\ell_{t,j}a_{t,j}$, and so at each round it would see a $(\pm 1)$-valued Bernoulli random variable $\ell_t^\top a_t$ with mean $\theta - b_i a_{t,i}\epsilon/d$. Notice that the $1/d$ here is crucial: because information about the $i$th component only arrives once every $d$ rounds on average, the range of values of the unknown Bernoulli mean has shrunk. If the strategy saw the components of $\ell_t$ (even in the semi-bandit setting, with $A = \{0,1\}^d$ and feedback $(\ell_{t,1}a_{t,1},\dots,\ell_{t,d}a_{t,d})$), it would not suffer this disadvantage.

32 Lower bounds: proof. Using the same argument as we saw for the stochastic multi-armed bandit case (with a little extra work to show that $\theta$ is unlikely to be too close to $0$ or $1$, so that the variance of the Bernoulli is not too small), we see that
$$\mathbb{E}R_n^i(b_i) \ge \frac{2\epsilon n}{d}\Big(\frac{1}{2} - c\epsilon\frac{\sqrt n}{d}\Big).$$
Choosing $\epsilon = d/(4c\sqrt n)$ gives $\mathbb{E}R_n^i(b_i) = \Omega(\sqrt n)$, and so $\mathbb{E}R_n(P_b) = \Omega(d\sqrt n)$. [NB: $A = [-1,1]^d$, $L = \{\pm e_i\}$ has lower regret, because the strategy can use $a_t$ to identify which direction $\pm e_i$ was played.] [Open problem: when is $\Theta(d\sqrt n)$ possible with an efficient strategy?]

33 Outline. Linear bandits. Exponential weights with unbiased loss estimates. Controlling loss estimates and their variance. Barycentric spanner. Uniform distribution. John's distribution. Lower bounds. Stochastic mirror descent. Full information. Bandit information.

34 Full information online prediction games. Repeated game: Strategy plays $a_t \in A$; Adversary reveals $\ell_t \in L$. Aim to minimize the regret
$$R_n = \sum_{t=1}^n \ell_t(a_t) - \min_{a\in A}\sum_{t=1}^n \ell_t(a).$$

35 Online Convex Optimization. Choosing $a_t$ to minimize past losses can fail: the strategy must avoid overfitting. First approach: gradient steps. Stay close to previous decisions, but move in a direction of improvement.

36 Online Convex Optimization. 1. Gradient algorithm. 2. Regularized minimization: Bregman divergence; regularized minimization (minimizing the latest loss plus divergence from the previous decision); constrained minimization equivalent to unconstrained plus Bregman projection; linearization; mirror descent. 3. Regret bound.

37 Online Convex Optimization: Gradient Method. $a_1 \in A$,
$$a_{t+1} = \Pi_A\big(a_t - \eta\nabla\ell_t(a_t)\big),$$
where $\Pi_A$ is the Euclidean projection on $A$: $\Pi_A(x) = \arg\min_{a\in A}\|x - a\|$. Theorem: For $G = \max_t\|\nabla\ell_t(a_t)\|$ and $D = \mathrm{diam}(A)$, the gradient strategy with $\eta = D/(G\sqrt n)$ has regret satisfying $R_n \le GD\sqrt n$.
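
A minimal sketch of this strategy for the unit Euclidean ball, where the projection is just radial rescaling; the function name and the gradient-callback interface are illustrative, not from the lecture.

```python
import numpy as np

def online_gradient_descent(grad_fns, eta, d):
    """Projected online gradient descent on {a : ||a|| <= 1}.

    grad_fns: list of callables; grad_fns[t](a) returns the gradient of l_t at a.
    Returns the sequence of plays a_1, ..., a_n.
    """
    a = np.zeros(d)
    plays = []
    for g in grad_fns:
        plays.append(a.copy())
        a = a - eta * g(a)                # gradient step
        norm = np.linalg.norm(a)
        if norm > 1.0:                    # Euclidean projection onto the ball
            a /= norm
    return plays

# Example (2-ball, 2-ball): linear losses l_t(a) = v_t . a with ||v_t|| <= 1.
rng = np.random.default_rng(0)
vs = [v / np.linalg.norm(v) for v in rng.normal(size=(100, 3))]
plays = online_gradient_descent([lambda a, v=v: v for v in vs], eta=0.1, d=3)
```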

38 Online Convex Optimization: Gradient Method. Example (2-ball, 2-ball): $A = \{a \in \mathbb{R}^d : \|a\| \le 1\}$, $L = \{a \mapsto v^\top a : \|v\| \le 1\}$. $D = 2$, $G \le 1$. Regret is no more than $2\sqrt n$. (And $O(\sqrt n)$ is optimal.) Example (1-ball, $\infty$-ball): $A = \Delta(k)$, $L = \{a \mapsto v^\top a : \|v\|_\infty \le 1\}$. $D \le 2$, $G \le \sqrt k$. Regret is no more than $2\sqrt{kn}$. Since competing with the whole simplex is equivalent to competing with the vertices (experts) for linear losses, this is worse than exponential weights ($\sqrt k$ versus $\sqrt{\log k}$).

39 Gradient Method: Proof. Define $\tilde a_{t+1} = a_t - \eta\nabla\ell_t(a_t)$, $a_{t+1} = \Pi_A(\tilde a_{t+1})$. Fix $a \in A$ and consider the measure of progress $\|a_t - a\|$:
$$\|a_{t+1} - a\|^2 \le \|\tilde a_{t+1} - a\|^2 = \|a_t - a\|^2 + \eta^2\|\nabla\ell_t(a_t)\|^2 - 2\eta\nabla\ell_t(a_t)^\top(a_t - a).$$
By convexity, $\ell_t(a_t) - \ell_t(a) \le \nabla\ell_t(a_t)^\top(a_t - a)$; summing over $t$ and rearranging,
$$\sum_{t=1}^n\big(\ell_t(a_t) - \ell_t(a)\big) \le \frac{\|a_1 - a\|^2 - \|a_{n+1} - a\|^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^n\|\nabla\ell_t(a_t)\|^2.$$

40 Online Convex Optimization. 1. Gradient algorithm. 2. Regularized minimization: Bregman divergence; regularized minimization (minimizing the latest loss plus divergence from the previous decision); constrained minimization equivalent to unconstrained plus Bregman projection; linearization; mirror descent. 3. Regret bound.

41 Online Convex Optimization: A Regularization Viewpoint. Suppose $\ell_t$ is linear: $\ell_t(a) = g_t^\top a$. Suppose $A = \mathbb{R}^d$. Then minimizing the regularized criterion
$$a_{t+1} = \arg\min_{a\in A}\Big(\eta\sum_{s=1}^t \ell_s(a) + \frac{1}{2}\|a\|^2\Big)$$
corresponds to the gradient step $a_{t+1} = a_t - \eta\nabla\ell_t(a_t)$.

42 Online Convex Optimization: Regularization. Regularized minimization: consider the family of strategies of the form
$$a_{t+1} = \arg\min_{a\in A}\Big(\eta\sum_{s=1}^t \ell_s(a) + R(a)\Big).$$
The regularizer $R : \mathbb{R}^d \to \mathbb{R}$ is strictly convex and differentiable. $R$ keeps the sequence of $a_t$'s stable: it diminishes $\ell_t$'s influence. We can view the choice of $a_{t+1}$ as trading off two competing forces: making $\ell_t(a_{t+1})$ small, and keeping $a_{t+1}$ close to $a_t$. This is a perspective that motivated many algorithms in the literature.

43 Properties of Regularization Methods. In the unconstrained case ($A = \mathbb{R}^d$), regularized minimization is equivalent to minimizing the latest loss and the distance to the previous decision. The appropriate notion of distance is the Bregman divergence $D_{\Phi_{t-1}}$. Define $\Phi_0 = R$ and $\Phi_t = \Phi_{t-1} + \eta\ell_t$, so that
$$a_{t+1} = \arg\min_{a\in A}\Big(\eta\sum_{s=1}^t \ell_s(a) + R(a)\Big) = \arg\min_{a\in A}\Phi_t(a).$$

44 Bregman Divergence. Definition: For a strictly convex, differentiable $\Phi : \mathbb{R}^d \to \mathbb{R}$, the Bregman divergence w.r.t. $\Phi$ is defined, for $a, b \in \mathbb{R}^d$, as
$$D_\Phi(a,b) = \Phi(a) - \big(\Phi(b) + \nabla\Phi(b)^\top(a - b)\big).$$
$D_\Phi(a,b)$ is the difference between $\Phi(a)$ and the value at $a$ of the linear approximation of $\Phi$ about $b$.

45 Bregman Divergence. Example: For $a \in \mathbb{R}^d$, the squared Euclidean norm $\Phi(a) = \frac{1}{2}\|a\|^2$ has
$$D_\Phi(a,b) = \frac{1}{2}\|a\|^2 - \Big(\frac{1}{2}\|b\|^2 + b^\top(a - b)\Big) = \frac{1}{2}\|a - b\|^2,$$
the squared Euclidean distance.

46 Bregman Divergence. Example: For $a \in [0,\infty)^d$, the unnormalized negative entropy $\Phi(a) = \sum_{i=1}^d a_i(\ln a_i - 1)$ has
$$D_\Phi(a,b) = \sum_i\big(a_i(\ln a_i - 1) - b_i(\ln b_i - 1) - \ln b_i\,(a_i - b_i)\big) = \sum_i\Big(a_i\ln\frac{a_i}{b_i} + b_i - a_i\Big),$$
the unnormalized KL divergence. Thus, for $a \in \Delta$, $\Phi(a) = \sum_i a_i\ln a_i$ has $D_\Phi(a,b) = \sum_i a_i\ln\frac{a_i}{b_i}$.
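
Both examples can be checked numerically from the definition alone. A short sketch (illustrative, not from the lecture): a generic Bregman divergence built from a potential and its gradient, compared against the two closed forms.

```python
import numpy as np

def bregman(phi, grad_phi, a, b):
    """D_phi(a, b) = phi(a) - phi(b) - <grad phi(b), a - b>."""
    return phi(a) - phi(b) - grad_phi(b) @ (a - b)

a, b = np.array([0.2, 0.5, 0.3]), np.array([0.4, 0.4, 0.2])

# Squared Euclidean norm: D(a, b) = ||a - b||^2 / 2.
sq = bregman(lambda x: 0.5 * x @ x, lambda x: x, a, b)
print(np.isclose(sq, 0.5 * np.sum((a - b) ** 2)))

# Unnormalized negative entropy: D(a, b) = sum_i a_i ln(a_i/b_i) + b_i - a_i,
# which equals the KL divergence here because a and b both lie on the simplex.
kl = bregman(lambda x: np.sum(x * (np.log(x) - 1)), np.log, a, b)
print(np.isclose(kl, np.sum(a * np.log(a / b))))
```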

47 Bregman Divergence. When the domain of $\Phi$ is $S \subseteq \mathbb{R}^d$, in addition to differentiability and strict convexity, we make some more assumptions: $S$ is closed, and its interior is convex; for a sequence approaching the boundary of $S$, $\|\nabla\Phi(a_n)\| \to \infty$. We say that such a $\Phi$ is a Legendre function.

48 Bregman Divergence Properties.
1. $D_\Phi \ge 0$, $D_\Phi(a,a) = 0$.
2. $D_{A+B} = D_A + D_B$.
3. For $\ell$ linear, $D_{\Phi+\ell} = D_\Phi$.
4. The Bregman projection $\Pi_A^\Phi(b) = \arg\min_{a\in A} D_\Phi(a,b)$ is uniquely defined for closed, convex $A \subseteq S$ (that intersects the interior of $S$).
5. Generalized Pythagorus: for closed, convex $A$, $a^* = \Pi_A^\Phi(b)$, $a \in A$: $D_\Phi(a,b) \ge D_\Phi(a,a^*) + D_\Phi(a^*,b)$.
6. $\nabla_a D_\Phi(a,b) = \nabla\Phi(a) - \nabla\Phi(b)$.
7. For $\Phi^*$ the Legendre dual of $\Phi$: $\nabla\Phi^* = (\nabla\Phi)^{-1}$ and $D_\Phi(a,b) = D_{\Phi^*}\big(\nabla\Phi(b), \nabla\Phi(a)\big)$.

49 Legendre Dual. Here, for a Legendre function $\Phi : S \to \mathbb{R}$, we define the Legendre dual as
$$\Phi^*(u) = \sup_{v\in S}\big(u^\top v - \Phi(v)\big).$$

50 Legendre Dual. Properties: $\Phi^*$ is Legendre; $\mathrm{dom}(\Phi^*) = \nabla\Phi(\mathrm{int\ dom}\,\Phi)$; $\nabla\Phi^* = (\nabla\Phi)^{-1}$; $D_\Phi(a,b) = D_{\Phi^*}\big(\nabla\Phi(b), \nabla\Phi(a)\big)$; $\Phi^{**} = \Phi$. Examples: $\Phi = \frac{1}{2}\|\cdot\|_p^2$: $\Phi^* = \frac{1}{2}\|\cdot\|_q^2$, where $1/p + 1/q = 1$. $\Phi(a) = \sum_{i=1}^d e^{a_i}$: $\Phi^*(u) = \sum_i u_i(\ln u_i - 1)$.
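
The second example is easy to sanity-check numerically (again illustrative): with $\Phi(a) = \sum_i e^{a_i}$ we have $\nabla\Phi(a) = e^a$ and $\nabla\Phi^*(u) = \ln u$, which are inverses of each other.

```python
import numpy as np

a = np.array([0.3, -1.2, 0.7])
u = np.exp(a)                     # grad Phi(a)
print(np.allclose(np.log(u), a))  # grad Phi*(grad Phi(a)) = a
```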

51 Online Convex Optimization. 1. Problem formulation. 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization: Bregman divergence; regularized minimization (minimizing the latest loss plus divergence from the previous decision); constrained minimization equivalent to unconstrained plus Bregman projection; linearization; mirror descent. 5. Regret bounds.

52 Properties of Regularization Methods. In the unconstrained case ($A = \mathbb{R}^d$), regularized minimization is equivalent to minimizing the latest loss and the distance (Bregman divergence) to the previous decision. Theorem: Define $\tilde a_1$ via $\nabla R(\tilde a_1) = 0$, and set
$$\tilde a_{t+1} = \arg\min_{a\in\mathbb{R}^d}\big(\eta\ell_t(a) + D_{\Phi_{t-1}}(a,\tilde a_t)\big).$$
Then
$$\tilde a_{t+1} = \arg\min_{a\in\mathbb{R}^d}\Big(\eta\sum_{s=1}^t\ell_s(a) + R(a)\Big).$$

53 Properties of Regularization Methods. Proof. By the definition of $\Phi_t$,
$$\eta\ell_t(a) + D_{\Phi_{t-1}}(a,\tilde a_t) = \Phi_t(a) - \Phi_{t-1}(a) + D_{\Phi_{t-1}}(a,\tilde a_t).$$
The derivative w.r.t. $a$ is
$$\nabla\Phi_t(a) - \nabla\Phi_{t-1}(a) + \nabla_a D_{\Phi_{t-1}}(a,\tilde a_t) = \nabla\Phi_t(a) - \nabla\Phi_{t-1}(a) + \nabla\Phi_{t-1}(a) - \nabla\Phi_{t-1}(\tilde a_t).$$
Setting this to zero shows that
$$\nabla\Phi_t(\tilde a_{t+1}) = \nabla\Phi_{t-1}(\tilde a_t) = \cdots = \nabla\Phi_0(\tilde a_1) = \nabla R(\tilde a_1) = 0,$$
so $\tilde a_{t+1}$ minimizes $\Phi_t$.

54 Properties of Regularization Methods. Constrained minimization is equivalent to unconstrained minimization, followed by Bregman projection. Theorem: For
$$a_{t+1} = \arg\min_{a\in A}\Phi_t(a), \qquad \tilde a_{t+1} = \arg\min_{a\in\mathbb{R}^d}\Phi_t(a),$$
we have $a_{t+1} = \Pi_A^{\Phi_t}(\tilde a_{t+1})$.

55 Properties of Regularization Methods. Proof. Let $a'_{t+1}$ denote $\Pi_A^{\Phi_t}(\tilde a_{t+1})$. First, by definition of $a_{t+1}$, $\Phi_t(a_{t+1}) \le \Phi_t(a'_{t+1})$. Conversely, $D_{\Phi_t}(a'_{t+1},\tilde a_{t+1}) \le D_{\Phi_t}(a_{t+1},\tilde a_{t+1})$. But $\nabla\Phi_t(\tilde a_{t+1}) = 0$, so $D_{\Phi_t}(a,\tilde a_{t+1}) = \Phi_t(a) - \Phi_t(\tilde a_{t+1})$. Thus $\Phi_t(a'_{t+1}) \le \Phi_t(a_{t+1})$.

56 Properties of Regularization Methods. Example: For linear $\ell_t$, regularized minimization is equivalent to minimizing the last loss plus the Bregman divergence w.r.t. $R$ to the previous decision:
$$\arg\min_{a\in A}\Big(\eta\sum_{s=1}^t\ell_s(a) + R(a)\Big) = \Pi_A^R\Big(\arg\min_{a\in\mathbb{R}^d}\big(\eta\ell_t(a) + D_R(a,\tilde a_t)\big)\Big),$$
because adding a linear function to $\Phi$ does not change $D_\Phi$.

57 Linear Loss. We can replace $\ell_t$ by its linear approximation $a \mapsto \nabla\ell_t(a_t)^\top a$, and this leads to an upper bound on regret, since by convexity $\ell_t(a_t) - \ell_t(a) \le \nabla\ell_t(a_t)^\top(a_t - a)$. Thus, for convex losses, we can work with linear $\ell_t$.

58 Regularization Methods: Mirror Descent. Regularized minimization for linear losses can be viewed as mirror descent: taking a gradient step in a dual space. Theorem: The decisions
$$\tilde a_{t+1} = \arg\min_{a\in\mathbb{R}^d}\Big(\eta\sum_{s=1}^t g_s^\top a + R(a)\Big)$$
can be written
$$\tilde a_{t+1} = (\nabla R)^{-1}\big(\nabla R(\tilde a_t) - \eta g_t\big).$$
This corresponds to first mapping from $\tilde a_t$ through $\nabla R$, then taking a step in the direction $-g_t$, then mapping back through $(\nabla R)^{-1} = \nabla R^*$ to $\tilde a_{t+1}$.

59 Regularization Methods: Mirror Descent. Proof. For the unconstrained minimization, we have
$$\nabla R(\tilde a_{t+1}) = -\eta\sum_{s=1}^t g_s, \qquad \nabla R(\tilde a_t) = -\eta\sum_{s=1}^{t-1} g_s,$$
so $\nabla R(\tilde a_{t+1}) = \nabla R(\tilde a_t) - \eta g_t$, which can be written $\tilde a_{t+1} = (\nabla R)^{-1}\big(\nabla R(\tilde a_t) - \eta g_t\big)$.

60 Mirror Descent. Given: compact, convex $A \subseteq \mathbb{R}^d$, closed, convex $S \supseteq A$, $\eta > 0$, Legendre $R : S \to \mathbb{R}$. Set $a_1 \in \arg\min_{a\in A} R(a)$. For round $t$:
1. Play $a_t$; observe $\ell_t$.
2. $w_{t+1} = \nabla R^*\big(\nabla R(a_t) - \eta\nabla\ell_t(a_t)\big)$.
3. $a_{t+1} = \arg\min_{a\in A} D_R(a, w_{t+1})$. [Always convex optimization.]

61 Exponential weights as mirror descent. For $A = \Delta(k)$ and $R(a) = \sum_{i=1}^k(a_i\log a_i - a_i)$, this reduces to exponential weights:
$$\nabla R(a)_i = \log a_i, \qquad R^*(u) = \sum_i e^{u_i}, \qquad \nabla R^*(u)_i = \exp(u_i),$$
$$\nabla R(w_{t+1})_i = \log(w_{t+1,i}) = \log a_{t,i} - \eta\nabla\ell_t(a_t)_i, \qquad w_{t+1,i} = a_{t,i}\exp\big(-\eta\nabla\ell_t(a_t)_i\big),$$
$$D_R(a,b) = \sum_i\Big(a_i\log\frac{a_i}{b_i} + b_i - a_i\Big), \qquad a_{t+1,i} \propto w_{t+1,i}.$$
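
A compact sketch of this reduction (illustrative; the function name is made up): the generic mirror-descent loop with the negative-entropy regularizer becomes a multiplicative update followed by normalization, which is exactly the Bregman projection onto the simplex for unnormalized KL.

```python
import numpy as np

def mirror_descent_negentropy(grad_fns, eta, k):
    """Mirror descent on the simplex with R(a) = sum_i a_i log a_i - a_i.

    grad R(a) = log a and grad R*(u) = e^u, so the dual-space step
    w = exp(log a - eta g) = a * exp(-eta g) is a multiplicative update,
    and the Bregman projection onto the simplex is plain normalization.
    """
    a = np.full(k, 1.0 / k)
    plays = []
    for g in grad_fns:
        plays.append(a.copy())
        w = a * np.exp(-eta * g(a))   # w_{t+1} = grad R*(grad R(a_t) - eta grad l_t(a_t))
        a = w / w.sum()               # a_{t+1} = argmin_{a in simplex} D_R(a, w_{t+1})
    return plays
```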

62 Mirror descent regret. Theorem: Suppose that, for all $a \in A \cap \mathrm{int}(S)$ and $\ell \in L$, $\nabla R(a) - \eta\nabla\ell(a) \in \nabla R(\mathrm{int}(S))$. For any $a \in A$,
$$\sum_{t=1}^n\big(\ell_t(a_t) - \ell_t(a)\big) \le \frac{1}{\eta}\Big(R(a) - R(a_1) + \sum_{t=1}^n D_{R^*}\big(\nabla R(a_t) - \eta\nabla\ell_t(a_t),\, \nabla R(a_t)\big)\Big).$$
Proof: Fix $a \in A$. Since the $\ell_t$ are convex,
$$\sum_{t=1}^n\big(\ell_t(a_t) - \ell_t(a)\big) \le \sum_{t=1}^n\nabla\ell_t(a_t)^\top(a_t - a).$$

63 Mirror descent regret: proof. The choice of $w_{t+1}$ and the fact that $(\nabla R)^{-1} = \nabla R^*$ show that $\nabla R(w_{t+1}) = \nabla R(a_t) - \eta\nabla\ell_t(a_t)$. Hence,
$$\eta\nabla\ell_t(a_t)^\top(a_t - a) = (a - a_t)^\top\big(\nabla R(w_{t+1}) - \nabla R(a_t)\big) = D_R(a,a_t) + D_R(a_t,w_{t+1}) - D_R(a,w_{t+1}).$$
The generalized Pythagorus inequality shows that the projection $a_{t+1}$ satisfies
$$D_R(a,w_{t+1}) \ge D_R(a,a_{t+1}) + D_R(a_{t+1},w_{t+1}).$$

64 Mirror descent regret: proof.
$$\eta\sum_{t=1}^n\nabla\ell_t(a_t)^\top(a_t - a) \le \sum_{t=1}^n\big(D_R(a,a_t) + D_R(a_t,w_{t+1}) - D_R(a,w_{t+1})\big) \le \sum_{t=1}^n\big(D_R(a,a_t) - D_R(a,a_{t+1}) + D_R(a_t,w_{t+1}) - D_R(a_{t+1},w_{t+1})\big)$$
$$= D_R(a,a_1) - D_R(a,a_{n+1}) + \sum_{t=1}^n\big(D_R(a_t,w_{t+1}) - D_R(a_{t+1},w_{t+1})\big) \le D_R(a,a_1) + \sum_{t=1}^n D_R(a_t,w_{t+1}).$$

65 Mirror descent regret: proof.
$$= D_R(a,a_1) + \sum_{t=1}^n D_{R^*}\big(\nabla R(w_{t+1}),\, \nabla R(a_t)\big) = D_R(a,a_1) + \sum_{t=1}^n D_{R^*}\big(\nabla R(a_t) - \eta\nabla\ell_t(a_t),\, \nabla R(a_t)\big)$$
$$\le R(a) - R(a_1) + \sum_{t=1}^n D_{R^*}\big(\nabla R(a_t) - \eta\nabla\ell_t(a_t),\, \nabla R(a_t)\big).$$

66 Linear bandit setting. The strategy sees only $\ell_t(a_t)$; the gradient $\nabla\ell_t(a_t)$ is unseen. Instead of $a_t$, the strategy plays a noisy version $x_t$, and uses $\ell_t(x_t)$ to form an unbiased estimate of $\nabla\ell_t(a_t)$.

67 Stochastic mirror descent. Given: compact, convex $A \subseteq \mathbb{R}^d$, $\eta > 0$, $S \supseteq A$, Legendre $R : S \to \mathbb{R}$. Set $a_1 \in \arg\min_{a\in A} R(a)$. For round $t$:
1. Play a noisy version $x_t$ of $a_t$; observe $\ell_t(x_t)$.
2. Compute an estimate $g_t$ of $\nabla\ell_t(a_t)$.
3. $w_{t+1} = \nabla R^*\big(\nabla R(a_t) - \eta g_t\big)$.
4. $a_{t+1} = \arg\min_{a\in A} D_R(a, w_{t+1})$.

68 Regret of stochastic mirror descent. Theorem: Suppose that, for all $a \in A \cap \mathrm{int}(S)$ and linear $\ell \in L$, $\mathbb{E}[g_t \mid a_t] = \nabla\ell_t(a_t)$ and $\nabla R(a) - \eta g_t \in \nabla R(\mathrm{int}(S))$. For any $a \in A$,
$$\mathbb{E}\sum_{t=1}^n\big(\ell_t(x_t) - \ell_t(a)\big) \le \frac{1}{\eta}\Big(R(a) - R(a_1) + \sum_{t=1}^n\mathbb{E}\,D_{R^*}\big(\nabla R(a_t) - \eta g_t,\, \nabla R(a_t)\big)\Big) + \mathbb{E}\sum_{t=1}^n\big\|a_t - \mathbb{E}[x_t \mid a_t]\big\|\,\|g_t\|.$$

69 Regret: proof.
$$\mathbb{E}\sum_{t=1}^n\big(\ell_t(x_t) - \ell_t(a)\big) = \mathbb{E}\sum_{t=1}^n\big(\ell_t(x_t) - \ell_t(a_t) + \ell_t(a_t) - \ell_t(a)\big) = \mathbb{E}\sum_{t=1}^n\Big(\mathbb{E}\big[\ell_t^\top(x_t - a_t)\,\big|\,a_t\big] + \ell_t(a_t) - \ell_t(a)\Big)$$
$$\le \mathbb{E}\sum_{t=1}^n\big\|a_t - \mathbb{E}[x_t \mid a_t]\big\|\,\|g_t\| + \mathbb{E}\sum_{t=1}^n\nabla\ell_t(a_t)^\top(a_t - a) = \mathbb{E}\sum_{t=1}^n\big\|a_t - \mathbb{E}[x_t \mid a_t]\big\|\,\|g_t\| + \mathbb{E}\sum_{t=1}^n g_t^\top(a_t - a).$$

70 Regret: proof. Applying the regret bound for the (random) linear losses $a \mapsto g_t^\top a$ gives
$$\mathbb{E}\sum_{t=1}^n g_t^\top(a_t - a) \le \frac{1}{\eta}\Big(R(a) - R(a_1) + \sum_{t=1}^n\mathbb{E}\,D_{R^*}\big(\nabla R(a_t) - \eta g_t,\, \nabla R(a_t)\big)\Big).$$

71 Regret: Euclidean ball. Consider $B = \{a \in \mathbb{R}^d : \|a\| \le 1\}$ (with the Euclidean norm). Ingredients:
1. Distribution of $x_t$, given $a_t$:
$$x_t = \xi_t\frac{a_t}{\|a_t\|} + (1 - \xi_t)\,\epsilon_t e_{I_t},$$
where $\xi_t$ is Bernoulli($\|a_t\|$), $\epsilon_t$ is uniform $\pm 1$, and $I_t$ is uniform on $\{1,\dots,d\}$, so $\mathbb{E}[x_t \mid a_t] = a_t$.
2. Estimate $\tilde\ell_t$ of the loss $\ell_t$:
$$\tilde\ell_t = d\,\frac{1 - \xi_t}{1 - \|a_t\|}\,\big(x_t^\top\ell_t\big)\,x_t,$$
so $\mathbb{E}[\tilde\ell_t \mid a_t] = \ell_t$.
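
A Monte Carlo check of both ingredients (illustrative; `play_and_estimate` is a made-up name):

```python
import numpy as np

def play_and_estimate(a, l, rng):
    """Noisy play x_t and loss estimate for the Euclidean-ball construction."""
    d = len(a)
    r = np.linalg.norm(a)
    if rng.random() < r:                        # xi_t ~ Bernoulli(||a_t||)
        x, xi = a / r, 1.0
    else:                                       # uniform +/- coordinate vector
        sign = 1.0 if rng.random() < 0.5 else -1.0
        x, xi = sign * np.eye(d)[rng.integers(d)], 0.0
    g = d * (1 - xi) / (1 - r) * (x @ l) * x    # tilde l_t
    return x, g

rng = np.random.default_rng(2)
a, l = np.array([0.3, -0.2, 0.1]), np.array([0.5, 0.1, -0.4])
xs, gs = zip(*(play_and_estimate(a, l, rng) for _ in range(200_000)))
print(np.mean(xs, axis=0))  # approaches a   (E[x_t | a_t] = a_t)
print(np.mean(gs, axis=0))  # approaches l   (E[tilde l_t | a_t] = l_t)
```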

72 Regret: Euclidean ball. Theorem: Consider stochastic mirror descent on $A = (1-\gamma)B$ with these choices and $R(a) = -\log(1 - \|a\|) - \|a\|$. Then for $\eta\sqrt d \le 1/2$,
$$R_n \le \gamma n + \frac{\log(1/\gamma)}{\eta} + \eta\,\mathbb{E}\sum_{t=1}^n\Big[(1 - \|a_t\|)\,\|\tilde\ell_t\|^2\Big].$$
For $\gamma = 1/\sqrt n$ and $\eta = \sqrt{\log n/(2nd)}$, $R_n \le 3\sqrt{dn\log n}$. Proof: $\nabla R(a) = a/(1 - \|a\|)$.
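
Putting the pieces together: $\nabla R(a) = a/(1-\|a\|)$ inverts in closed form to $u \mapsto u/(1+\|u\|)$, so the whole bandit strategy fits in a few lines. A sketch under the same assumptions (illustrative, reusing `play_and_estimate` from the previous block); the radial projection onto $(1-\gamma)B$ is the Bregman projection here by symmetry.

```python
import numpy as np

def bandit_ball_mirror_descent(loss_vectors, gamma, eta, seed=0):
    """Stochastic mirror descent on A = (1 - gamma) * (unit Euclidean ball)."""
    rng = np.random.default_rng(seed)
    a = np.zeros(len(loss_vectors[0]))
    total = 0.0
    for l in loss_vectors:
        x, g = play_and_estimate(a, l, rng)        # only the scalar x^T l is observed
        total += x @ l
        u = a / (1 - np.linalg.norm(a)) - eta * g  # dual step: grad R(a_t) - eta g_t
        w = u / (1 + np.linalg.norm(u))            # w_{t+1} = (grad R)^{-1}(u)
        nw = np.linalg.norm(w)
        a = w if nw <= 1 - gamma else (1 - gamma) * w / nw  # project onto (1-gamma)B
    return total
```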

73 Linear bandits. Open question: What geometric properties of $A$ and $L$ determine the regret?

A | L | $R_n$
convex | $\ell : A \to [-1,1]$ | $\tilde O(d\sqrt n)$
$\{a : \|a\|_2 \le 1\}$ | $\{\ell : \|\ell\|_2 \le 1\}$ | $\tilde O(\sqrt{dn})$
$\{a : \|a\|_\infty \le 1\}$ | $\{\pm e_i : 1 \le i \le d\}$ | $\Omega(d\sqrt n)$
