1 Bandit Algorithms Tor Lattimore & Csaba Szepesvári

2 Outline 1 From Contextual to Linear Bandits 2 Stochastic Linear Bandits 3 Confidence Bounds for Least-Squares Estimators 4 Improved Regret for Fixed, Finite Action Sets

3 Outline 5 Sparse Stochastic Linear Bandits 6 Minimax Regret 7 Asymptopia 8 Summary

4 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

5 Stochastic Contextual Bandits
Set of contexts $\mathcal{C}$, set of actions $[K]$; distributions $(P_{c,a})$.
Interaction. For rounds $t = 1, 2, 3, \dots$:
1. Context $C_t \in \mathcal{C}$ is revealed to the learner.
2. Based on its past observations (including $C_t$), the learner chooses an action $A_t \in [K]$, which is sent to the environment.
3. The environment sends the reward $X_t \sim P_{C_t, A_t}$ to the learner.

6 Regret Definition
Definition: expected reward for action $a$ under context $c$: $r(c, a) = \int x \, P_{c,a}(dx)$.
Regret: $R_n = \mathbb{E}\left[ \sum_{t=1}^n \max_{a \in [K]} r(C_t, a) - \sum_{t=1}^n X_t \right]$.

7 Poor Man's Contextual Bandit Algorithm
Assumption: $\mathcal{C}$ is finite. Idea: assign a separate bandit algorithm to each context.
Worst-case regret: $R_n = \Theta(\sqrt{n M K})$, where $M = |\mathcal{C}|$.
Problem: $M$ (and $K$) can be very large. How to save this? Assume structure.
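The reduction above is easy to code. Below is a minimal sketch (not from the slides; all class names and the UCB1 exploration constant are illustrative assumptions) that lazily creates one independent UCB1 instance per observed context, which is exactly what drives the $\sqrt{nMK}$ scaling.

```python
import math
from collections import defaultdict

class UCB1:
    """Standard UCB1 for a single context (illustrative implementation)."""
    def __init__(self, K):
        self.counts = [0] * K
        self.sums = [0.0] * K

    def select(self, t):
        for a, c in enumerate(self.counts):
            if c == 0:
                return a  # play every arm once first
        return max(range(len(self.counts)),
                   key=lambda a: self.sums[a] / self.counts[a]
                   + math.sqrt(2 * math.log(t) / self.counts[a]))

    def update(self, a, x):
        self.counts[a] += 1
        self.sums[a] += x

class PerContextBandit:
    """'Poor man's' contextual bandit: one UCB1 instance per context."""
    def __init__(self, K):
        self.bandits = defaultdict(lambda: UCB1(K))
        self.t = 0

    def act(self, context):
        self.t += 1
        return self.bandits[context].select(self.t)

    def update(self, context, a, x):
        self.bandits[context].update(a, x)
```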

8 Linear Models
Assumption: $r(c, a) = \langle \psi(c, a), \theta_* \rangle$ for all $(c, a) \in \mathcal{C} \times [K]$, where $\psi : \mathcal{C} \times [K] \to \mathbb{R}^d$ and $\theta_* \in \mathbb{R}^d$.
$\psi$: feature map.
$H_\psi = \mathrm{span}(\psi(c, k) : c \in \mathcal{C}, k \in [K]) \subseteq \mathbb{R}^d$: feature space.
$S_\psi = \{ r_\theta : \mathcal{C} \times [K] \to \mathbb{R} \mid \theta \in \mathbb{R}^d \}$, where $r_\theta(c, k) = \langle \psi(c, k), \theta \rangle$: space of linear reward functions.
Note: RKHSs let us deal with $d = \infty$ (or $d$ very large).

9 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

10 Choosing ψ: Betting on Smoothness of r
Let $\|\cdot\|$ be a norm on $\mathbb{R}^d$ and $\|\cdot\|_*$ its dual (e.g., both 2-norms).
Hölder's inequality: $|r(c, a) - r(c', a')| \le \|\theta_*\|_* \, \|\psi(c, a) - \psi(c', a')\|$.
$r = \langle \psi, \theta_* \rangle$ $\Rightarrow$ $r$ is $(\rho, \|\theta_*\|_*)$-smooth: $|r(z) - r(z')| \le \|\theta_*\|_* \, \rho(z, z')$, where $z = (c, a)$, $z' = (c', a')$ and $\rho(z, z') = \|\psi(z) - \psi(z')\|$.
Choice of $\psi$ + preference for small $\|\theta_*\|_*$ $\Rightarrow$ preference for $\rho$-smooth $r$.
The influence of $\psi$ is largest when $d = +\infty$.

11 Sparsity
Concern: Is $r \in S_\psi$ really true?
Thinking of free lunches: can't we just add a lot of features to make sure $r \in S_\psi$?
Feature $i$ unused: $\theta_{*,i} = 0$. Small $\|\theta_*\|_0 = \sum_{i=1}^d \mathbb{I}\{\theta_{*,i} \ne 0\}$ ("0-norm"): sparsity.

12 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

13 Features Are All You Need
Given $C_1, A_1, \dots, C_t, A_t$, the reward $X_t$ in round $t$ satisfies $\mathbb{E}[X_t \mid C_1, A_1, \dots, C_t, A_t] = \langle \psi(C_t, A_t), \theta_* \rangle$ for some known $\psi$ and unknown $\theta_*$.
Reformulation: at the beginning of any round $t$, observe an action set $\mathcal{A}_t \subset \mathbb{R}^d$. If $A_t \in \mathcal{A}_t$, then $\mathbb{E}[X_t \mid \mathcal{A}_1, A_1, \dots, \mathcal{A}_t, A_t] = \langle A_t, \theta_* \rangle$ with some unknown $\theta_*$.
Why? Let $\mathcal{A}_t = \{ \psi(C_t, a) : a \in [K] \}$.

14 Stochastic Linear Bandits
1. In round $t$, observe the action set $\mathcal{A}_t \subset \mathbb{R}^d$.
2. The learner chooses $A_t \in \mathcal{A}_t$ and receives $X_t$, satisfying $\mathbb{E}[X_t \mid \mathcal{A}_1, A_1, \dots, \mathcal{A}_t, A_t] = \langle A_t, \theta_* \rangle$ with some unknown $\theta_*$.
Goal: keep the regret $R_n = \mathbb{E}\left[ \sum_{t=1}^n \max_{a \in \mathcal{A}_t} \langle a, \theta_* \rangle - X_t \right]$ small.
Additional assumptions: (i) $\mathcal{A}_1, \dots, \mathcal{A}_n$ is any fixed sequence; (ii) $X_t - \langle A_t, \theta_* \rangle$ is light-tailed given $\mathcal{A}_1, A_1, \dots, \mathcal{A}_t, A_t$.
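For concreteness, here is a minimal simulation sketch of this protocol (not from the slides): fixed, known action sets, standard Gaussian (hence 1-subgaussian) noise, and a learner exposing illustrative act/update methods.

```python
import numpy as np

def run(learner, action_sets, theta_star, seed=0):
    """action_sets: list of (K_t, d) arrays, one per round. Returns the pseudo-regret."""
    rng = np.random.default_rng(seed)
    regret = 0.0
    for A in action_sets:
        i = learner.act(A)                    # learner picks a row index of A
        x = A[i] @ theta_star + rng.normal()  # reward = <A_t, theta*> + noise
        learner.update(A[i], x)
        regret += np.max(A @ theta_star) - A[i] @ theta_star
    return regret
```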

15 Finite-armed bandits
Case (a): $\mathcal{A}_t$ always contains the same number of vectors: finite-armed stochastic contextual bandit.
Case (b): In addition, $\mathcal{A}_t$ does not change, i.e. $\mathcal{A}_t = \{a_1, \dots, a_K\}$: finite-armed stochastic linear bandit.
Case (c): If the vectors in $\mathcal{A}_t$ are also orthogonal to each other: finite-armed stochastic bandit.
Difference between cases (c) and (b):
Case (c): learn about the mean of arm $i$ $\Leftrightarrow$ choose action $i$;
Case (b): learn about the mean of arm $i$ $\Leftrightarrow$ choose any action $j$ with $\langle x_j, x_i \rangle \ne 0$.

16 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

17 Once an Optimist, Always an Optimist
Optimism in the Face of Uncertainty Principle: choose the best action in the best environment amongst the plausible ones.
Environment: $\theta_* \in \mathbb{R}^d$.
Plausible environments: $\mathcal{C}_t \subset \mathbb{R}^d$ s.t. $P(\theta_* \in \mathcal{C}_t) \ge 1 - 1/t$.
Best environment: $\tilde\theta_t = \mathrm{argmax}_{\theta \in \mathcal{C}_t} \max_{a \in \mathcal{A}} \langle a, \theta \rangle$.
Best action: $A_t = \mathrm{argmax}_{a \in \mathcal{A}} \langle a, \tilde\theta_t \rangle$.

18 Choosing the Confidence Set
Say the reward in round $t$ is $X_t$ and the action in round $t$ is $A_t \in \mathbb{R}^d$:
$X_t = \langle A_t, \theta_* \rangle + \eta_t$, where $\eta_t$ is noise.
Regularized least-squares estimator:
$\hat\theta_t = V_t^{-1} \sum_{s=1}^t A_s X_s$, with $V_0 = \lambda I$ and $V_t = V_0 + \sum_{s=1}^t A_s A_s^\top$.
Choice of $\mathcal{C}_t$: $\mathcal{C}_t \subseteq \mathcal{E}_t := \{ \theta \in \mathbb{R}^d : \|\theta - \hat\theta_{t-1}\|^2_{V_{t-1}} \le \beta_t \}$.
Here $(\beta_t)_t$ is nondecreasing with $\beta_t \ge 1$, and for $A$ positive definite, $\|x\|^2_A = x^\top A x$.
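A sketch of these quantities in code (the class and method names are assumptions of this illustration): it maintains $V_t = \lambda I + \sum_s A_s A_s^\top$ and $\sum_s A_s X_s$, and exposes the ellipsoid membership test.

```python
import numpy as np

class RidgeEstimator:
    """Regularized least squares for linear bandits (illustrative sketch)."""
    def __init__(self, d, lam=1.0):
        self.V = lam * np.eye(d)      # V_0 = lambda * I
        self.b = np.zeros(d)          # running sum of A_s * X_s

    def update(self, a, x):
        self.V += np.outer(a, a)
        self.b += x * a

    @property
    def theta_hat(self):
        return np.linalg.solve(self.V, self.b)

    def in_ellipsoid(self, theta, beta):
        # membership in E = { theta : ||theta - theta_hat||_V^2 <= beta }
        diff = theta - self.theta_hat
        return diff @ self.V @ diff <= beta
```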

19 LinUCB
Choose $\mathcal{C}_t = \mathcal{E}_t$ with suitable $(\beta_t)_t$ and let
$A_t = \mathrm{argmax}_{a \in \mathcal{A}} \max_{\theta \in \mathcal{C}_t} \langle a, \theta \rangle$.
Then,
$A_t = \mathrm{argmax}_{a \in \mathcal{A}} \ \langle a, \hat\theta_{t-1} \rangle + \sqrt{\beta_t} \, \|a\|_{V_{t-1}^{-1}}$.
LinUCB (a.k.a. LinRel, OFUL, ConfEllips, ...)
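Combining the two previous slides, here is a sketch of the closed-form LinUCB action rule; it reuses the RidgeEstimator sketch above, which is an assumption of this illustration.

```python
import numpy as np

def linucb_action(est, actions, beta):
    """actions: (K, d) array; returns the index maximizing the UCB index."""
    theta_hat = est.theta_hat
    Vinv = np.linalg.inv(est.V)
    # <a, theta_hat> + sqrt(beta) * ||a||_{V^{-1}} for every candidate action
    bonus = np.sqrt(np.einsum('ki,ij,kj->k', actions, Vinv, actions))
    ucb = actions @ theta_hat + np.sqrt(beta) * bonus
    return int(np.argmax(ucb))
```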

20 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

21 Regret Analysis
Assumptions:
1. Bounded scalar mean reward: $|\langle a, \theta_* \rangle| \le 1$ for any $a \in \cup_t \mathcal{A}_t$.
2. Bounded actions: $\|a\|_2 \le L$ for any $a \in \cup_t \mathcal{A}_t$.
3. Honest confidence intervals: there exists a $\delta \in (0, 1)$ such that with probability $1 - \delta$, for all $t \in [n]$, $\theta_* \in \mathcal{C}_t$, where $\mathcal{C}_t$ satisfies $\mathcal{C}_t \subseteq \mathcal{E}_t$ with $(\beta_t)_t \ge 1$ nondecreasing, as on the previous slide.
Theorem (LinUCB Regret). Let the conditions listed above hold. Then, with probability $1 - \delta$, the regret of LinUCB satisfies
$\hat R^{\mathrm{pseudo}}_n \le \sqrt{8 n \beta_n \log \frac{\det V_n}{\det V_0}} \le \sqrt{8 d n \beta_n \log \frac{\mathrm{trace}(V_0) + n L^2}{d \det^{1/d}(V_0)}}$.

22 Proof
Assume $\theta_* \in \mathcal{C}_t$ for all $t \in [n]$. Let $A_t^* = \mathrm{argmax}_{a \in \mathcal{A}_t} \langle a, \theta_* \rangle$, $r_t = \langle A_t^* - A_t, \theta_* \rangle$, and let $\tilde\theta_t \in \mathcal{C}_t$ be such that $\langle A_t, \tilde\theta_t \rangle = \mathrm{UCB}_t(A_t)$.
From $\theta_* \in \mathcal{C}_t$ and the definition of LinUCB,
$\langle A_t^*, \theta_* \rangle \le \mathrm{UCB}_t(A_t^*) \le \mathrm{UCB}_t(A_t) = \langle A_t, \tilde\theta_t \rangle$.
Then,
$r_t \le \langle A_t, \tilde\theta_t - \theta_* \rangle \le \|A_t\|_{V_{t-1}^{-1}} \|\tilde\theta_t - \theta_*\|_{V_{t-1}} \le 2 \|A_t\|_{V_{t-1}^{-1}} \sqrt{\beta_t}$.
From $|\langle a, \theta_* \rangle| \le 1$, $r_t \le 2$. Combined with $\beta_n \ge \max\{1, \beta_t\}$, this gives
$r_t \le 2 \wedge 2 \sqrt{\beta_t} \, \|A_t\|_{V_{t-1}^{-1}} \le 2 \sqrt{\beta_n} \left( 1 \wedge \|A_t\|_{V_{t-1}^{-1}} \right)$.
Jensen's inequality shows that
$\hat R^{\mathrm{pseudo}}_n = \sum_{t=1}^n r_t \le \sqrt{n \sum_{t=1}^n r_t^2} \le 2 \sqrt{n \beta_n \sum_{t=1}^n \left( 1 \wedge \|A_t\|^2_{V_{t-1}^{-1}} \right)}$.

23 Lemma
Lemma. Let $x_1, \dots, x_n \in \mathbb{R}^d$, $V_t = V_0 + \sum_{s=1}^t x_s x_s^\top$ for $t \in [n]$, $v_0 = \mathrm{trace}(V_0)$ and $L \ge \max_t \|x_t\|_2$. Then,
$\sum_{t=1}^n \left( 1 \wedge \|x_t\|^2_{V_{t-1}^{-1}} \right) \le 2 \log \frac{\det V_n}{\det V_0} \le 2 d \log \frac{v_0 + n L^2}{d \det^{1/d}(V_0)}$.
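A quick numerical sanity check of the lemma (illustrative only, with $V_0 = I$): the sum of truncated squared norms never exceeds $2\log(\det V_n / \det V_0)$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 1000
V = np.eye(d)                                  # V_0 = I, so v_0 = d, det V_0 = 1
logdet_V0 = np.linalg.slogdet(V)[1]
total = 0.0
for _ in range(n):
    x = rng.normal(size=d)
    total += min(1.0, x @ np.linalg.solve(V, x))  # 1 ∧ ||x_t||^2_{V_{t-1}^{-1}}
    V += np.outer(x, x)
lhs, rhs = total, 2 * (np.linalg.slogdet(V)[1] - logdet_V0)
assert lhs <= rhs
```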

24 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

25 LinUCB and Finite-Armed Bandits
Recall that if $\mathcal{A}_t = \{e_1, \dots, e_d\}$, we get back finite-armed bandits.
LinUCB: $A_t = \mathrm{argmax}_a \ \langle a, \hat\theta_{t-1} \rangle + \sqrt{\beta_t} \, \|a\|_{V_{t-1}^{-1}}$.
If we set $\lambda = 0$, then $\langle e_i, \hat\theta_{t-1} \rangle = \hat\mu_{i,t-1}$ (the empirical mean) and $V_{t-1} = \mathrm{diag}(T_1(t-1), \dots, T_K(t-1))$.
Hence $\sqrt{\beta_t} \, \|e_i\|_{V_{t-1}^{-1}} = \sqrt{\beta_t / T_i(t-1)}$, and
$A_t = \mathrm{argmax}_i \ \hat\mu_{i,t-1} + \sqrt{\beta_t / T_i(t-1)}$.
We recover UCB when $\beta_t = 2 \log(\cdot)$.
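As a small illustration of this reduction (assumed names), with the standard basis and $\lambda = 0$ the Gram matrix is diagonal and the LinUCB index collapses to the familiar per-arm UCB index.

```python
import numpy as np

def ucb_index(counts, sums, beta_t):
    """counts, sums: per-arm pull counts T_i(t-1) > 0 and reward sums."""
    mu_hat = sums / counts                      # <e_i, theta_hat> = empirical mean
    return mu_hat + np.sqrt(beta_t / counts)    # sqrt(beta_t) * ||e_i||_{V^{-1}}
```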

26 History
Abe and Long (1999) introduced stochastic linear bandits into the machine learning literature.
Auer (2002) was the first to consider optimism for linear bandits (LinRel, SupLinRel). Main restriction: $|\mathcal{A}_t| < +\infty$.
Confidence ellipsoids: Dani et al. (2008) (ConfidenceBall$_2$), Rusmevichientong and Tsitsiklis (2010) (Uncertainty Ellipsoid Policy), Abbasi-Yadkori et al. (2011) (OFUL). The name LinUCB comes from Chu et al. (2011).
Alternative routes:
Explore-then-commit for action sets with smooth boundary: Abbasi-Yadkori et al. (2009); Abbasi-Yadkori (2009); Rusmevichientong and Tsitsiklis (2010).
Phased elimination
Thompson sampling

27 Extensions of Linear Bandits
Generalized linear model (Filippi et al., 2009): $X_t = g^{-1}(\langle A_t, \theta_* \rangle + \eta)$, where $g : \mathbb{R} \to \mathbb{R}$ is called the link function. Common choice: $g(p) = \log(p/(1-p))$, i.e. $g^{-1}(x) = 1/(1 + \exp(-x))$ (sigmoid).
Spectral bandits: Spectral Eliminator, Valko et al. (2014).
Kernelised UCB: Valko et al. (2013).
(Nonlinear) structured bandits: $r : \mathcal{C} \times [K] \to [0, 1]$ belongs to some known set (Anantharam et al., 1987; Russo and Roy, 2013; Lattimore and Munos, 2014).

28 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

29 Setting
1. Subgaussian rewards: the reward is $X_t = \langle A_t, \theta_* \rangle + \eta_t$, where $\eta_t$ is conditionally 1-subgaussian ($\eta_t \mid \mathcal{F}_{t-1} \sim \mathrm{subG}(1)$):
$\mathbb{E}[\exp(\lambda \eta_t) \mid \mathcal{F}_{t-1}] \le \exp(\lambda^2 / 2)$ almost surely for all $\lambda \in \mathbb{R}$, where $\mathcal{F}_t = \sigma(A_1, \eta_1, \dots, A_{t-1}, \eta_{t-1}, A_t)$.
2. Bounded parameter vector: $\|\theta_*\|_2 \le S$ with $S > 0$ known.

30 Least Squares: Recap
Linear model: $X_t = \langle A_t, \theta_* \rangle + \eta_t$.
Regularized squared loss: $L_t(\theta) = \sum_{s=1}^t (X_s - \langle A_s, \theta \rangle)^2 + \lambda \|\theta\|_2^2$.
Least-squares estimate $\hat\theta_t = \mathrm{argmin}_\theta L_t(\theta)$:
$\hat\theta_t = V_t(\lambda)^{-1} \sum_{s=1}^t X_s A_s$ with $V_t(\lambda) = \lambda I + \sum_{s=1}^t A_s A_s^\top$.
Abbreviation: $V_t = V_t(0)$.
Main difficulty: the actions $(A_s)_{s \le t}$ are neither fixed nor independent, but are intricately correlated via the rewards $(X_s)_{s < t}$.

31 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

32 Fixed Design
1. Nonsingular Gram matrix: $\lambda = 0$ and $V_t$ is invertible.
2. Independent subgaussian noise: $(\eta_s)_s$ are independent and 1-subgaussian.
3. Fixed design: $A_1, \dots, A_t$ are deterministically chosen without the knowledge of $X_1, \dots, X_t$.
Notation: $A_1, \dots, A_t$ replaced by $a_1, \dots, a_t$, so
$V_t = \sum_{s=1}^t a_s a_s^\top$ and $\hat\theta_t = V_t^{-1} \sum_{s=1}^t a_s X_s$.
Note: $(a_s)_{s=1}^t$ must span $\mathbb{R}^d$ for $V_t^{-1}$ to exist, hence $t \ge d$.

33 First Steps
Fix $x \in \mathbb{R}^d$. From $X_s = a_s^\top \theta_* + \eta_s$,
$\hat\theta_t = V_t^{-1} \sum_{s=1}^t a_s X_s = V_t^{-1} V_t \theta_* + V_t^{-1} \sum_{s=1}^t a_s \eta_s$,
so
$\langle x, \hat\theta_t - \theta_* \rangle = \sum_{s=1}^t \langle x, V_t^{-1} a_s \rangle \eta_s$.
Since $(\eta_s)_s$ are independent and 1-subgaussian: with probability $1 - \delta$,
$\langle x, \hat\theta_t - \theta_* \rangle < \sqrt{2 \sum_{s=1}^t \langle x, V_t^{-1} a_s \rangle^2 \log(1/\delta)} = \sqrt{2 \|x\|^2_{V_t^{-1}} \log(1/\delta)}$.

34 Bounding $\|\hat\theta_t - \theta_*\|_{V_t}$
For $x \in \mathbb{R}^d$ fixed, with probability $1 - \delta$,
$\langle x, \hat\theta_t - \theta_* \rangle < \sqrt{2 \|x\|^2_{V_t^{-1}} \log(1/\delta)}$.  (*)
We have $\|u\|^2_{V_t} = (V_t u)^\top u$.
Idea 1: apply (*) to $X = V_t(\hat\theta_t - \theta_*)$. Problem: (*) holds for non-random $x$ only!
Idea 2: take a finite $C_\varepsilon$ s.t. $\|X - x\| \le \varepsilon$ for some $x \in C_\varepsilon$; make (*) hold for each $x \in C_\varepsilon$ and combine.
Refinement: let $X = V_t^{1/2}(\hat\theta_t - \theta_*) / \|\hat\theta_t - \theta_*\|_{V_t}$. Then $X \in S^{d-1} = \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$ and we can choose a finite $C_\varepsilon \subset S^{d-1}$ s.t. for any $u \in S^{d-1}$ there is an $x \in C_\varepsilon$ with $\|u - x\| \le \varepsilon$.

35 Bounding $\|\hat\theta_t - \theta_*\|_{V_t}$: Part II
Let $X = \frac{V_t^{1/2}(\hat\theta_t - \theta_*)}{\|\hat\theta_t - \theta_*\|_{V_t}}$ and $X' = \mathrm{argmin}_{x \in C_\varepsilon} \|x - X\|$. Then, with probability $1 - \delta$,
$\|\hat\theta_t - \theta_*\|_{V_t} = \langle X, V_t^{1/2}(\hat\theta_t - \theta_*) \rangle = \langle X - X', V_t^{1/2}(\hat\theta_t - \theta_*) \rangle + \langle V_t^{1/2} X', \hat\theta_t - \theta_* \rangle \le \varepsilon \|\hat\theta_t - \theta_*\|_{V_t} + \sqrt{2 \|V_t^{1/2} X'\|^2_{V_t^{-1}} \log\frac{|C_\varepsilon|}{\delta}}$,
hence
$\|\hat\theta_t - \theta_*\|_{V_t} \le \frac{1}{1 - \varepsilon} \sqrt{2 \log\frac{|C_\varepsilon|}{\delta}}$.
We can choose $C_\varepsilon$ so that $|C_\varepsilon| \le (5/\varepsilon)^d$; with $\varepsilon = 1/2$,
$\|\hat\theta_t - \theta_*\|_{V_t} < 2 \sqrt{2 \left( d \log(10) + \log\frac{1}{\delta} \right)}$.

36 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

37 Bounding $\|\hat\theta_t - \theta_*\|_{V_t}$: Sequential Design
$X_s = \langle A_s, \theta_* \rangle + \eta_s$, with $\eta_s \mid \mathcal{F}_{s-1} \sim \mathrm{subG}(1)$ for $\mathcal{F}_s = \sigma(A_1, \eta_1, \dots, A_s, \eta_s)$.
The previous bound exploited that $A_1, \dots, A_t$ are fixed and non-random: fixed design.
When $A_1, \dots, A_t$ are i.i.d., we have a random design.
Bandits: $A_s$ is chosen based on $A_1, X_1, \dots, A_{s-1}, X_{s-1}$! Sequential design.
How to bound $\|\hat\theta_t - \theta_*\|_{V_t}$ in this case?

38 Bounding $\|\hat\theta_t - \theta_*\|_{V_t}$: Sequential Design
Tools: linearization trick, vector Chernoff, Laplace method.

39 A Start: Linearization & Vector Chernoff
Let $S_t = \sum_{s=1}^t \eta_s A_s$. Linearization of the quadratic:
$\frac12 \|\hat\theta_t - \theta_*\|^2_{V_t} = \frac12 \|S_t\|^2_{V_t^{-1}} = \max_{x \in \mathbb{R}^d} \langle x, S_t \rangle - \frac12 \|x\|^2_{V_t}$.
Let $M_t(x) = \exp\left( \langle x, S_t \rangle - \frac12 \|x\|^2_{V_t} \right)$. One can show that $\mathbb{E}[M_t(x)] \le 1$ for any $x \in \mathbb{R}^d$.
Chernoff's method:
$P\left( \frac12 \|\hat\theta_t - \theta_*\|^2_{V_t} \ge u \right) = P\left( \exp(\max_x \log M_t(x)) \ge \exp(u) \right) \le \mathbb{E}\left[ \exp(\max_x \log M_t(x)) \right] \exp(-u) = \mathbb{E}\left[ \max_x M_t(x) \right] \exp(-u)$.
Can we control $\mathbb{E}[\max_x M_t(x)]$?

40 Controlling $\mathbb{E}[\max_x M_t(x)]$: Covering Argument
Recall: $M_t(x) = \exp\left( \langle x, S_t \rangle - \frac12 \|x\|^2_{V_t} \right)$.
Let $C_\varepsilon \subset \mathbb{R}^d$ be finite, to be chosen later, and let $X = \mathrm{argmax}_{x \in \mathbb{R}^d} M_t(x)$, $Y = \mathrm{argmin}_{y \in C_\varepsilon} \|X - y\|$. Then,
$\max_{x \in \mathbb{R}^d} M_t(x) = M_t(X) = M_t(X) - M_t(Y) + M_t(Y) \le \varepsilon + \sum_{y \in C_\varepsilon} M_t(y)$.
Challenge: ensure $M_t(X) - M_t(Y) \le \varepsilon$!

41 Laplace: One Step Back, Two Forward
The need to control $\mathbb{E}[\max_x M_t(x)]$ comes from the identity $\exp\left( \frac12 \|\hat\theta_t - \theta_*\|^2_{V_t} \right) = \max_x M_t(x)$.
Laplace: the integral of $\exp(s f(x))$ is dominated by $\exp(s \max_x f(x))$:
$\int_a^b e^{s f(x)} dx \approx e^{s f(x_0)} \sqrt{\frac{2\pi}{s |f''(x_0)|}}$, where $x_0 = \mathrm{argmax}_{x \in [a,b]} f(x) \in (a, b)$.
Idea: replace $\max_x M_t(x)$ with $\int M_t(x) h(x) dx$ with $h$ appropriate:
1. $\int M_t(x) h(x) dx \approx \max_x M_t(x)$ (in a way);
2. $\mathbb{E}\left[ \int M_t(x) h(x) dx \right] = \int \mathbb{E}[M_t(x)] h(x) dx \le 1$.
Choose $h(x)$ as the density of $\mathcal{N}(0, H^{-1})$ for some $H \succ 0$.

42 Step 2: Finishing
$\int M_t(x) h(x) dx = \left( \frac{\det(H)}{\det(H + V_t)} \right)^{1/2} \exp\left( \frac12 \|S_t\|^2_{(H + V_t)^{-1}} \right)$.
Choose $H = \lambda I$. Then, with probability $1 - e^{-u}$,
$\frac12 \|S_t\|^2_{V_t(\lambda)^{-1}} < u + \frac12 \log \frac{\det(V_t(\lambda))}{\lambda^d}$  (**)
and from $\hat\theta_t - \theta_* = V_t^{-1}(\lambda) S_t - \lambda V_t^{-1}(\lambda) \theta_*$,
$\|\hat\theta_t - \theta_*\|_{V_t(\lambda)} \le \|V_t^{-1}(\lambda) S_t\|_{V_t(\lambda)} + \lambda \|V_t^{-1}(\lambda) \theta_*\|_{V_t(\lambda)} \le \|S_t\|_{V_t^{-1}(\lambda)} + \lambda^{1/2} \|\theta_*\|$.

43 Confidence Ellipsoid for Sequential Design
Assumptions: $\|\theta_*\|_2 \le S$, and let $(A_s)_s$, $(\eta_s)_s$ be such that for any $1 \le s \le t$, $\eta_s \mid \mathcal{F}_{s-1} \sim \mathrm{subG}(1)$, where $\mathcal{F}_s = \sigma(A_1, \eta_1, \dots, A_{s-1}, \eta_{s-1}, A_s)$.
Fix $\delta \in (0, 1)$. Let
$\sqrt{\beta_{t+1}} = \sqrt{\lambda} S + \sqrt{2 \log\frac{1}{\delta} + \log \frac{\det V_t(\lambda)}{\lambda^d}}$,
and $\mathcal{C}_{t+1} = \{ \theta \in \mathbb{R}^d : \|\hat\theta_t - \theta\|_{V_t(\lambda)} \le \sqrt{\beta_{t+1}} \}$.
Theorem. $\mathcal{C}_{t+1}$ is a confidence set for $\theta_*$ at level $1 - \delta$: $P(\theta_* \in \mathcal{C}_{t+1}) \ge 1 - \delta$.
Note: $\beta_{t+1}$ is a function of $(A_s)_{s \le t}$.
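A sketch of the resulting confidence radius in code (assuming V is the regularized Gram matrix $V_t(\lambda)$ as defined above; the function name is illustrative):

```python
import numpy as np

def confidence_radius(V, lam, S, delta):
    """sqrt(beta_{t+1}) = sqrt(lam)*S + sqrt(2*log(1/delta) + log(det V / lam^d))."""
    d = V.shape[0]
    logdet_ratio = np.linalg.slogdet(V)[1] - d * np.log(lam)
    return np.sqrt(lam) * S + np.sqrt(2 * np.log(1 / delta) + logdet_ratio)
```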

44 Freedman's Stopping Trick
We want $\theta_* \in \mathcal{C}_t$ to hold with probability $1 - \delta$ simultaneously for all $1 \le t \le n$. Can we avoid a union bound over time?

45 Freedman's Stopping Trick: II
Let $E_t = \left\{ \|\hat\theta_{t-1} - \theta_*\|_{V_{t-1}(\lambda)} \ge \sqrt{\lambda} S + \sqrt{2u + \log \frac{\det V_{t-1}(\lambda)}{\lambda^d}} \right\}$, $t \in [n]$.
Define $\tau \in [n]$ as the smallest round index $t \in [n]$ such that $E_t$ holds, or $n$ when none of $E_1, \dots, E_n$ hold.
Note: because $\hat\theta_{t-1}$ and $V_{t-1}(\lambda)$ are functions of $H_{t-1} = (A_1, \eta_1, \dots, A_{t-1}, \eta_{t-1})$, whether $E_t$ holds can be decided based on $H_{t-1}$.
$\Rightarrow$ $\tau$ is an $(H_t)_t$ stopping time $\Rightarrow$ $\mathbb{E}[M_\tau(x)] \le 1$ and also $\mathbb{E}\left[ \int M_\tau(x) h(x) dx \right] \le 1$, and thus
$P\left( \cup_{t \in [n]} E_t \right) \le P(E_\tau) \le e^{-u}$.
Corollary. $P(\exists t \ge 0 \text{ such that } \theta_* \notin \mathcal{C}_{t+1}) \le \delta$.

46 Historical Remarks
The presentation mostly follows Abbasi-Yadkori et al. (2011).
Auer (2002) and Chu et al. (2011) avoided the need to construct ellipsoidal confidence sets.
Previous ellipsoidal constructions by Dani et al. (2008) and Rusmevichientong and Tsitsiklis (2010) used covering arguments. Laplace's method yields a substantially tighter ellipsoid than these covering-argument constructions.
Laplace's method is also called the method of mixtures (Peña et al., 2008); its use goes back to the work of Robbins and Siegmund in the 1970s (Robbins and Siegmund, 1970, 1971).
Freedman's stopping trick is from Freedman (1975).

47 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

48 Regret for LinUCB: Final Steps
Previously we have seen that, for $\beta_t \ge 1$ nondecreasing, LinUCB with $V_0 = \lambda I$ satisfies, w.p. $1 - \delta$,
$\hat R^{\mathrm{pseudo}}_n \le \sqrt{8 n \beta_n \log \frac{\det V_n}{\det V_0}} \le \sqrt{8 d n \beta_n \log \frac{\lambda d + n L^2}{\lambda d}}$.
Now,
$\sqrt{\beta_n} = \sqrt{\lambda} S + \sqrt{2 \log\frac{1}{\delta} + \log \frac{\det V_{n-1}(\lambda)}{\lambda^d}} \le \sqrt{\lambda} S + \sqrt{2 \log\frac{1}{\delta} + d \log \frac{\lambda d + n L^2}{\lambda d}}$,
from $\|A_s\| \le L$ and $\log \det V \le d \log(\mathrm{trace}(V)/d)$. Hence
$\hat R^{\mathrm{pseudo}}_n \le C_1 d \sqrt{n} \log(n) + C_2 \sqrt{n d \log(1/\delta)} + C_3$.
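To get a feel for the constants, here is a small sketch (illustrative inputs, names assumed) that plugs $\sqrt{\beta_n}$ into the generic bound from slide 21:

```python
import numpy as np

def linucb_regret_bound(n, d, L=1.0, S=1.0, lam=1.0, delta=0.01):
    # d * log((lam*d + n*L^2) / (lam*d)) bounds log(det V_n / det V_0)
    log_vol = d * np.log((lam * d + n * L**2) / (lam * d))
    sqrt_beta_n = np.sqrt(lam) * S + np.sqrt(2 * np.log(1 / delta) + log_vol)
    return np.sqrt(8 * n * sqrt_beta_n**2 * log_vol)

print(linucb_regret_bound(n=10_000, d=5))   # grows roughly like d * sqrt(n) * log(n)
```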

49 Summary
$\hat R^{\mathrm{pseudo}}_n \le C_1 d \sqrt{n} \log(n) + C_2 \sqrt{n d \log(1/\delta)} + C_3$.
Optimism, confidence ellipsoid.
Getting the ellipsoid is tricky because the bandit algorithm makes $(A_s)_s$ and $(\eta_s)_s$ interdependent: sequential design.
Hsu et al. (2012): random design vs. fixed design.
Kernelization: one can directly kernelize the proof presented here; see Abbasi-Yadkori (2012). Gaussian process bandits effectively do the same (Srinivas et al., 2010).
This presentation followed mostly Abbasi-Yadkori et al. (2011); Abbasi-Yadkori (2012).

50 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

51 Challenge!
The previous bound is $\tilde O(d \sqrt{n})$ even for $\mathcal{A} = \{e_1, \dots, e_d\}$, i.e. finite-armed stochastic bandits. Can we do better?

52 Setting
1. Fixed finite action set: the set of actions available in round $t$ is $\mathcal{A} \subset \mathbb{R}^d$ with $|\mathcal{A}| = K$ for some natural number $K$.
2. Subgaussian rewards: the reward is $X_t = \langle \theta_*, A_t \rangle + \eta_t$ where $\eta_t$ is conditionally 1-subgaussian: $\eta_t \mid \mathcal{F}_{t-1} \sim \mathrm{subG}(1)$, where $\mathcal{F}_t = \sigma(A_1, \eta_1, \dots, A_{t-1}, \eta_{t-1}, A_t)$.
3. Bounded mean rewards: $\Delta_a = \max_{b \in \mathcal{A}} \langle \theta_*, b - a \rangle \le 1$ for all $a \in \mathcal{A}$.
Key difference to the previous setting: finite, fixed action set.

53 Avoiding Sequential Designs
Recall the result for fixed design: for $x \in \mathbb{R}^d$ fixed, with probability $1 - \delta$,
$\langle x, \hat\theta_t - \theta_* \rangle < \sqrt{2 \|x\|^2_{V_t^{-1}} \log(1/\delta)}$.  (*)
Goal: use this result! How?
Idea: use a phased elimination algorithm! $\mathcal{A} = \mathcal{A}_1 \supseteq \mathcal{A}_2 \supseteq \cdots$
In phase $\ell$, use actions in $\mathcal{A}_\ell$ to collect enough data to ensure that, by the end of the phase, the data collected in the phase is sufficient to rule out all $\varepsilon_\ell = 2^{-\ell}$-suboptimal actions.
Which actions to use, and how many times, in phase $\ell$?

54 How to Collect Data?
Recall: if $V_t = \sum_{s=1}^t a_s a_s^\top$, then for any fixed $x \in \mathbb{R}^d$,
$\langle x, \hat\theta_t - \theta_* \rangle < \sqrt{2 \|x\|^2_{V_t^{-1}} \log(1/\delta)}$.  (*)
If we need to know whether $\langle x, \hat\theta_t - \theta_* \rangle \le 2^{-\ell}$ for all $x \in \mathcal{A}$, we had better choose $a_1, a_2, \dots, a_t$ so that $\max_{x \in \mathcal{A}} \|x\|^2_{V_t^{-1}}$ is minimized (and make $t$ big enough).
Experimental design: minimizing this quantity is known as the G-optimal design problem.

55 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

56 G-optimal Design
Let $\pi : \mathcal{A} \to [0, 1]$ be a distribution on $\mathcal{A}$: $\sum_{a \in \mathcal{A}} \pi(a) = 1$. Define
$V(\pi) = \sum_{a \in \mathcal{A}} \pi(a) \, a a^\top$ and $g(\pi) = \max_{a \in \mathcal{A}} \|a\|^2_{V(\pi)^{-1}}$.
G-optimal design $\pi^*$: $g(\pi^*) = \min_\pi g(\pi)$.
How to use this?
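A (near-)G-optimal design can be computed numerically, for instance with the classical Frank-Wolfe / Fedorov-Wynn iteration for D-optimal design, which by the Kiefer-Wolfowitz theorem (slide 58) solves the same problem. A sketch (not from the slides; all names are illustrative):

```python
import numpy as np

def g_optimal_design(A, iters=1000):
    """A: (K, d) array of actions spanning R^d. Returns (pi, g(pi))."""
    K, d = A.shape
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        V = A.T @ (A * pi[:, None])                  # V(pi) = sum_a pi(a) a a^T
        norms = np.einsum('ki,ij,kj->k', A, np.linalg.inv(V), A)  # ||a||^2_{V^{-1}}
        k = int(np.argmax(norms))                    # most "uncertain" action
        gamma = (norms[k] / d - 1) / (norms[k] - 1)  # optimal line-search step
        pi = (1 - gamma) * pi
        pi[k] += gamma
    V = A.T @ (A * pi[:, None])
    return pi, float(np.max(np.einsum('ki,ij,kj->k', A, np.linalg.inv(V), A)))
```

At the optimum, $g(\pi)$ returned by this routine should be close to $d$, in line with the Kiefer-Wolfowitz theorem.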

57 Using a Design π
Given a design $\pi$, for $a \in \mathrm{Supp}(\pi)$ set
$n_a = \left\lceil \frac{\pi(a) \, g(\pi)}{\varepsilon^2} \log\frac{1}{\delta} \right\rceil$.
Choose each action $a \in \mathrm{Supp}(\pi)$ exactly $n_a$ times. Then
$V = \sum_{a \in \mathrm{Supp}(\pi)} n_a \, a a^\top \succeq \frac{g(\pi)}{\varepsilon^2} \log\left(\frac{1}{\delta}\right) V(\pi)$,
and so, for any $a \in \mathcal{A}$, w.p. $1 - \delta$,
$\langle \hat\theta - \theta_*, a \rangle \le \sqrt{\|a\|^2_{V^{-1}} \log\frac{1}{\delta}} \le \varepsilon$.
How big is $n$?
$n = \sum_{a \in \mathrm{Supp}(\pi)} n_a = \sum_{a \in \mathrm{Supp}(\pi)} \left\lceil \frac{\pi(a) \, g(\pi)}{\varepsilon^2} \log\frac{1}{\delta} \right\rceil \le |\mathrm{Supp}(\pi)| + \frac{g(\pi)}{\varepsilon^2} \log\frac{1}{\delta}$.

58 Bounding g(π) and |Supp(π)|
Theorem (Kiefer-Wolfowitz). The following are equivalent:
1. $\pi^*$ is a minimizer of $g$.
2. $\pi^*$ is a maximizer of $f(\pi) = \log \det V(\pi)$.
3. $g(\pi^*) = d$.
Note: designs maximizing $f$ are known as D-optimal designs. Kiefer-Wolfowitz says that G-optimality is the same as D-optimality.
Combining this with John's theorem for minimum-volume enclosing ellipsoids (John, 1948), we get $|\mathrm{Supp}(\pi^*)| \le d(d+3)/2$.

59 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

60 PEGOE¹ Algorithm
Input: $\mathcal{A} \subset \mathbb{R}^d$ and $\delta$. Set $\mathcal{A}_1 = \mathcal{A}$, $\ell = 1$, $t = 1$.
1. Let $t_\ell = t$ (current round). Find a G-optimal design $\pi_\ell : \mathcal{A}_\ell \to [0, 1]$ that maximizes $\log \det V(\pi_\ell)$ subject to $\sum_{a \in \mathcal{A}_\ell} \pi_\ell(a) = 1$.
2. Let $\varepsilon_\ell = 2^{-\ell}$ and
$N_\ell(a) = \left\lceil \frac{2 \pi_\ell(a)}{\varepsilon_\ell^2} \log \frac{K \ell (\ell+1)}{\delta} \right\rceil$ and $N_\ell = \sum_{a \in \mathcal{A}_\ell} N_\ell(a)$.
3. Choose each action $a \in \mathcal{A}_\ell$ exactly $N_\ell(a)$ times.
4. Calculate the estimate: $\hat\theta_\ell = V_\ell^{-1} \sum_{t = t_\ell}^{t_\ell + N_\ell} A_t X_t$ (with $V_\ell = \sum_{a \in \mathcal{A}_\ell} N_\ell(a) \, a a^\top$).
5. Eliminate poor arms:
$\mathcal{A}_{\ell+1} = \left\{ a \in \mathcal{A}_\ell : \max_{b \in \mathcal{A}_\ell} \langle \hat\theta_\ell, b - a \rangle \le 2 \varepsilon_\ell \right\}$.
6. Increase $\ell$ by one and go to step 1.
¹ Phased Elimination with G-Optimal Exploration.
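A compact sketch of one PEGOE phase (illustrative only; it reuses the g_optimal_design sketch from earlier and the counts formula displayed above):

```python
import numpy as np

def pegoe_phase(A, ell, K, delta, theta_star, rng):
    """A: (m, d) array of surviving actions; K = |A_1|. Returns the next active set."""
    eps = 2.0 ** (-ell)
    pi, _ = g_optimal_design(A)
    counts = np.ceil(2 * pi / eps**2 * np.log(K * ell * (ell + 1) / delta)).astype(int)
    d = A.shape[1]
    V, b = np.zeros((d, d)), np.zeros(d)
    for a, n_a in zip(A, counts):
        for _ in range(n_a):                      # play action a exactly N_l(a) times
            x = a @ theta_star + rng.normal()     # observe reward
            V += np.outer(a, a)
            b += x * a
    theta_hat = np.linalg.lstsq(V, b, rcond=None)[0]
    gaps = (A @ theta_hat).max() - A @ theta_hat
    return A[gaps <= 2 * eps]                     # keep only near-optimal arms
```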

61 The Regret of PEGOE
Theorem. With probability at least $1 - \delta$, the pseudo-regret of PEGOE is at most
$\hat R^{\mathrm{pseudo}}_n \le C \sqrt{n d \log\left( \frac{K \log(n)}{\delta} \right)}$,
where $C > 0$ is a universal constant. If $\delta = O(1/n)$, then $\mathbb{E}[R_n] \le C \sqrt{n d \log(K n)}$ for an appropriately chosen universal constant $C > 0$.

62 Summary and Historical Remarks
Phased exploration allows one to use methods developed for the fixed-design setting.
PEGOE: exploration tuned to maximize information gain. Finding a $(1+\varepsilon)$-optimal design is sufficient; this is a convex problem.
This algorithm and analysis, in this form, are new. Phased elimination is well known: Even-Dar et al. (2006) (pure exploration), Auer and Ortner (2010) (finite-armed bandits), Valko et al. (2014) (linear bandits, spanners instead of G-optimality).
Finite, but changing action set: PEGOE cannot be applied! SupLinRel and SupLinUCB get the same bound (Auer, 2002; Chu et al., 2011). Sadly, these algorithms are very conservative.

63 Outline 5 Sparse Stochastic Linear Bandits Warmup (Hypercube) LinUCB with Sparsity Confidence Sets & Online Linear Regression Summary 6 Minimax Regret A Minimax Lower Bound 7 Asymptopia Lower Bound What About Optimism? 8 Summary

64 General Setting
1. Sparse parameter: there exist known constants $M_0$ and $M_2$ such that $\|\theta_*\|_0 \le M_0$ and $\|\theta_*\|_2 \le M_2$.
2. Bounded mean rewards: $|\langle a, \theta_* \rangle| \le 1$ for all $a \in \mathcal{A}_t$ and all rounds $t$.
3. Subgaussian noise: the reward is $X_t = \langle A_t, \theta_* \rangle + \eta_t$ where $\eta_t \mid \mathcal{F}_{t-1} \sim \mathrm{subG}(1)$ for $\mathcal{F}_t = \sigma(A_1, \eta_1, \dots, A_t, \eta_t)$.

65 The Case of the Hypercube
$\mathcal{A} = [-1, 1]^d$, $\theta := \theta_*$, $X_t = \langle A_t, \theta \rangle + \eta_t$.
Assumptions:
1. Bounded mean rewards: $\|\theta\|_1 \le 1$, which ensures that $|\langle a, \theta \rangle| \le 1$ for all $a \in \mathcal{A}$.
2. Subgaussian noise: the reward is $X_t = \langle A_t, \theta \rangle + \eta_t$ where $\eta_t \mid \mathcal{F}_{t-1} \sim \mathrm{subG}(1)$ for $\mathcal{F}_t = \sigma(A_1, \eta_1, \dots, A_t, \eta_t)$.

66 Selective Explore-Then-Commit (SETC)
Recall: $\theta = \theta_*$. For any $i \in [d]$ such that $A_{ti}$ is randomized:
$A_{ti} \left( A_t^\top \theta + \eta_t \right) = \theta_i + \underbrace{A_{ti} \sum_{j \ne i} A_{tj} \theta_j + A_{ti} \eta_t}_{\text{noise}}$.

67 Regret of SETC
Theorem. There exists a universal constant $C > 0$ such that the regret of SETC satisfies
$R_n \le 2 \|\theta\|_1 + C \sum_{i : \theta_i \ne 0} \frac{\log(n)}{|\theta_i|}$.
Furthermore, $R_n \le C \sqrt{\|\theta\|_0 \, n \log(n)}$.
SETC adapts to $\|\theta\|_0$!

68 Outline 5 Sparse Stochastic Linear Bandits Warmup (Hypercube) LinUCB with Sparsity Confidence Sets & Online Linear Regression Summary 6 Minimax Regret A Minimax Lower Bound 7 Asymptopia Lower Bound What About Optimism? 8 Summary

69 (General) LinUCB: Recap
GLinUCB: choose $\mathcal{C}_t \subset \mathbb{R}^d$ and let
$A_t = \mathrm{argmax}_{a \in \mathcal{A}} \max_{\theta \in \mathcal{C}_t} \langle a, \theta \rangle$.
The previous choice leads to regret $\tilde O(d \sqrt{n})$.
How to choose $\mathcal{C}_t$, knowing that $\|\theta_*\|_0 \le p$, so that the regret gets smaller?

70 Outline 5 Sparse Stochastic Linear Bandits Warmup (Hypercube) LinUCB with Sparsity Confidence Sets & Online Linear Regression Summary 6 Minimax Regret A Minimax Lower Bound 7 Asymptopia Lower Bound What About Optimism? 8 Summary

71 Online Linear Regression (OLR)
Learner-environment interaction:
1. The environment chooses $X_t \in \mathbb{R}$ and $A_t \in \mathbb{R}^d$ in an arbitrary fashion.
2. The value of $A_t$ is revealed to the learner (but not $X_t$).
3. The learner produces a real-valued prediction $\hat X_t$ in some way.
4. The environment reveals $X_t$ to the learner and the loss is $(X_t - \hat X_t)^2$.
Goal: compete with the total loss of the best linear predictor in some set $\Theta \subseteq \mathbb{R}^d$. Regret against $\theta \in \Theta$:
$\rho_n(\theta) = \sum_{t=1}^n (X_t - \hat X_t)^2 - \sum_{t=1}^n (X_t - \langle A_t, \theta \rangle)^2$.

72 From OLR to Confidence Sets
Let $L$ be a learner that enjoys a regret guarantee $B_n = B_n(A_1, X_1, \dots, A_n, X_n)$ relative to $\Theta$: for any strategy of the environment,
$\sup_{\theta \in \Theta} \rho_n(\theta) \le B_n$.
Combine
$\rho_n(\theta) = \sum_{t=1}^n (X_t - \hat X_t)^2 - \sum_{t=1}^n (X_t - \langle A_t, \theta \rangle)^2$
and $X_t = \langle A_t, \theta_* \rangle + \eta_t$ to get
$Q_t := \sum_{s=1}^t (\hat X_s - \langle A_s, \theta_* \rangle)^2 = \rho_t(\theta_*) + 2 \sum_{s=1}^t \eta_s (\hat X_s - \langle A_s, \theta_* \rangle) \le B_t + 2 \sum_{s=1}^t \eta_s (\hat X_s - \langle A_s, \theta_* \rangle)$.

73 From OLR to Confidence Sets: II
$Q_t \le B_t + 2 Z_t$, where $Z_t = \sum_{s=1}^t \eta_s (\hat X_s - \langle A_s, \theta_* \rangle)$.  (*)
Goal: bound $Z_t$ for all $t \ge 0$.
$\hat X_t$, chosen by the OLR learner $L$, is $\mathcal{F}_{t-1}$-measurable, so $(Z_t - Z_{t-1}) \mid \mathcal{F}_{t-1} \sim \mathrm{subG}(\sigma_t)$, where $\sigma_t^2 = (\hat X_t - \langle A_t, \theta_* \rangle)^2$.
Previous self-normalized bound (**): with probability $1 - \delta$, for all $t = 0, 1, \dots$,
$Z_t < \sqrt{(1 + Q_t) \log \frac{1 + Q_t}{\delta}}$.
Combining with (*) and solving for $Q_t$: $Q_t \le \beta_t(\delta)$, where $\beta_t(\delta) = 1 + 2 B_t + 32 \log \frac{\sqrt{8} + \sqrt{1 + B_t}}{\delta}$.

74 OLR to Confidence Sets: III
Theorem. Let $\delta \in (0, 1)$ and assume that $\theta_* \in \Theta$ and $\sup_{\theta \in \Theta} \rho_t(\theta) \le B_t$. If
$\mathcal{C}_{t+1} = \left\{ \theta \in \mathbb{R}^d : \|\theta\|_2 \le M_2, \ \sum_{s=1}^t (\hat X_s - \langle A_s, \theta \rangle)^2 \le \beta_t(\delta) \right\}$,
then $P(\text{there exists } t \in \mathbb{N} \text{ such that } \theta_* \notin \mathcal{C}_{t+1}) \le \delta$.

75 Sparse LinUCB
1: Input: OLR learner $L$, regret bound $B_t$, confidence parameter $\delta \in (0, 1)$
2: for $t = 1, \dots, n$:
3:   Receive action set $\mathcal{A}_t$
4:   Compute confidence set: $\mathcal{C}_t = \left\{ \theta \in \mathbb{R}^d : \|\theta\|_2 \le M_2, \ \sum_{s=1}^{t-1} (\hat X_s - \langle A_s, \theta \rangle)^2 \le \beta_{t-1}(\delta) \right\}$
5:   Calculate the optimistic action $A_t = \mathrm{argmax}_{a \in \mathcal{A}_t} \max_{\theta \in \mathcal{C}_t} \langle a, \theta \rangle$
6:   Feed $A_t$ to $L$ and obtain prediction $\hat X_t$
7:   Play $A_t$ and receive reward $X_t$
8:   Feed $X_t$ to $L$ as feedback

76 Regret of OLR-UCB
Theorem. With probability at least $1 - \delta$, the pseudo-regret of OLR-UCB satisfies
$\hat R^{\mathrm{pseudo}}_n \le \sqrt{8 d n \left( M_2^2 + \beta_{n-1}(\delta) \right) \log\left( 1 + \frac{n}{d} \right)}$.

77 The Regret of OLR-UCB(π)
Theorem (Sparse OLR Algorithm). There is a strategy $\pi$ for the learner such that, for any $\theta \in \mathbb{R}^d$, the regret $\rho_n(\theta)$ of $\pi$ against any strategic environment with $\max_{t \in [n]} \|A_t\|_2 \le L$ and $\max_{t \in [n]} |X_t| \le X$ satisfies
$\rho_n(\theta) \le c X^2 \|\theta\|_0 \left\{ \log(e + n^{1/2} L) + C_n \log(1 + \|\theta\|_1) \right\} + (1 + X^2) C_n$,
where $c > 0$ is a universal constant and $C_n = 2 + \log_2 \log(e + n^{1/2} L)$.
Corollary. The expected regret of OLR-UCB when using the strategy $\pi$ from above satisfies $R_n = \tilde O(\sqrt{d p n})$.

78 Outline 5 Sparse Stochastic Linear Bandits Warmup (Hypercube) LinUCB with Sparsity Confidence Sets & Online Linear Regression Summary 6 Minimax Regret A Minimax Lower Bound 7 Asymptopia Lower Bound What About Optimism? 8 Summary

79 Summary
The OLR algorithm is used inside OLR-UCB to construct the center of the confidence set; the regret guarantee of the OLR learner controls the width of the confidence ellipsoid.
Regret: $\tilde O(\sqrt{d p n})$, with $p$ known.
Hypercube: $\tilde O(\sqrt{p n})$, with $p$ unknown!
In general, the regret can be as high as $\Omega(\sqrt{p d n})$ ($p = 1$: think of $\mathcal{A} = \{e_1, \dots, e_d\}$).
Under parameter noise ($X_t = \langle A_t, \theta_* + \eta_t \rangle$), for rounded action sets, $\tilde O(p \sqrt{n})$ is possible!
Very much unlike the passive case: a major conflict between exploration and exploitation!

80 Historical Notes
The Selective Explore-Then-Commit algorithm is due to Lattimore et al. (2015).
OLR-UCB is from Abbasi-Yadkori et al. (2012). The Sparse OLR algorithm is due to Gerchinovitz (2013).
Rakhlin and Sridharan (2015) also discuss the relationship between online learning regret bounds and self-normalized tail bounds of the type given here.

81 Outline 5 Sparse Stochastic Linear Bandits Warmup (Hypercube) LinUCB with Sparsity Confidence Sets & Online Linear Regression Summary 6 Minimax Regret A Minimax Lower Bound 7 Asymptopia Lower Bound What About Optimism? 8 Summary

82 Minimax Lower Bound
Theorem. Let the action set be $\mathcal{A} = \{-1, 1\}^d$ and $\Theta = \{-n^{-1/2}, n^{-1/2}\}^d$. Then, for any policy $\pi$, there exists a $\theta \in \Theta$ such that
$R^\pi_n(\mathcal{A}, \theta) \ge C d \sqrt{n}$
for some universal constant $C > 0$.

83 Some Thoughts
LinUCB with our confidence set construction is nearly worst-case optimal.
The theorem is new, but the proof is standard; see Shamir (2015).
Similar results hold for some other action sets: Rusmevichientong and Tsitsiklis (2010) ($\ell_2$-ball), Dani et al. (2008) (products of 2D balls).
Some action sets have smaller minimax regret! Can you think of one?

84 Outline 5 Sparse Stochastic Linear Bandits Warmup (Hypercube) LinUCB with Sparsity Confidence Sets & Online Linear Regression Summary 6 Minimax Regret A Minimax Lower Bound 7 Asymptopia Lower Bound What About Optimism? 8 Summary

85 Lower Bound
Setting:
1. Actions: $\mathcal{A} \subset \mathbb{R}^d$ finite, $K = |\mathcal{A}|$.
2. Reward: $X_t = \langle A_t, \theta \rangle + \eta_t$, where $\theta \in \mathbb{R}^d$ and $(\eta_t)_t$ is a sequence of independent standard Gaussian variables.
Regret of policy $\pi$: $R^\pi_n(\mathcal{A}, \theta) = \mathbb{E}_{\theta,\pi}\left[ \sum_{t=1}^n \Delta_{A_t} \right]$, where $\Delta_a = \max_{a' \in \mathcal{A}} \langle a' - a, \theta \rangle$.
Recall: a policy $\pi$ is consistent in some class of bandits $\mathcal{E}$ if the regret is subpolynomial for any bandit in that class: $R^\pi_n(\mathcal{A}, \theta) = o(n^p)$ for all $p > 0$ and $\theta \in \mathbb{R}^d$.

86 Lower Bound: II
Theorem. Assume that $\mathcal{A} \subset \mathbb{R}^d$ is finite and spans $\mathbb{R}^d$, and suppose $\pi$ is consistent. Let $\theta \in \mathbb{R}^d$ be any parameter such that there is a unique optimal action, and let $\bar G_n = \mathbb{E}_{\theta,\pi}\left[ \sum_{t=1}^n A_t A_t^\top \right]$ be the expected Gram matrix. Then
$\liminf_{n \to \infty} \lambda_{\min}(\bar G_n) / \log(n) > 0$.
Furthermore, for any $a \in \mathcal{A}$ it holds that
$\limsup_{n \to \infty} \ \log(n) \, \|a\|^2_{\bar G_n^{-1}} \le \frac{\Delta_a^2}{2}$.

87 Lower Bound: III
Corollary. Let $\mathcal{A} \subset \mathbb{R}^d$ be a finite set that spans $\mathbb{R}^d$ and $\theta \in \mathbb{R}^d$ be such that there is a unique optimal action. Then, for any consistent policy $\pi$,
$\liminf_{n \to \infty} \frac{R^\pi_n(\mathcal{A}, \theta)}{\log(n)} \ge c(\mathcal{A}, \theta)$,
where $c(\mathcal{A}, \theta)$ is defined as
$c(\mathcal{A}, \theta) = \inf_{\alpha \in [0, \infty)^{\mathcal{A}}} \ \sum_{a \in \mathcal{A}} \alpha(a) \Delta_a$
subject to $\|a\|^2_{H^{-1}} \le \frac{\Delta_a^2}{2}$ for all $a \in \mathcal{A}$ with $\Delta_a > 0$, where $H = \sum_{a \in \mathcal{A}} \alpha(a) \, a a^\top$.

88 Outline 5 Sparse Stochastic Linear Bandits Warmup (Hypercube) LinUCB with Sparsity Confidence Sets & Online Linear Regression Summary 6 Minimax Regret A Minimax Lower Bound 7 Asymptopia Lower Bound What About Optimism? 8 Summary

89 Poor Outlook for Optimism

90 Poor Outlook for Optimism
Actions: $\mathcal{A} = \{a_1, a_2, a_3\}$, $a_1 = e_1$, $a_2 = e_2$, $a_3 = (1 - \varepsilon, \gamma \varepsilon)$, with $\varepsilon > 0$ small and $\gamma \ge 1$.
Let $\theta = (1, 0)$, so $a^* = a_1$.
Solving for the lower bound, $\alpha(a_2) = 2\gamma^2$ and $\alpha(a_3) = 0$, so $c(\mathcal{A}, \theta) = 2\gamma^2$ and
$\liminf_{n \to \infty} \frac{R^\pi_n(\mathcal{A}, \theta)}{\log(n)} = 2\gamma^2$.
Moreover, for $\gamma$ large and $\varepsilon$ sufficiently small, for any optimistic policy $\pi$,
$\limsup_{n \to \infty} \frac{R^\pi_n(\mathcal{A}, \theta)}{\log(n)} = \Omega(1/\varepsilon)$.

91 Instance-Optimal Asymptotic Regret
Theorem. There exists a policy $\pi$ that is consistent and satisfies
$\limsup_{n \to \infty} \frac{R^\pi_n(\mathcal{A}, \theta)}{\log(n)} = c(\mathcal{A}, \theta)$,
where $c(\mathcal{A}, \theta)$ was defined in the lower bound.

92 Illustration: LinUCB

93 Summary
The instance-optimal regret of consistent algorithms is asymptotically $c(\mathcal{A}, \theta) \log(n)$.
Optimistic algorithms fail to achieve this: their regret can be worse by an arbitrarily large constant factor.
Remember: finite-armed bandits.
Case (a): $\mathcal{A}_t$ always contains the same number of vectors: finite-armed stochastic contextual bandit.
Case (b): In addition, $\mathcal{A}_t$ does not change, i.e. $\mathcal{A}_t = \{a_1, \dots, a_K\}$: finite-armed stochastic linear bandit.
Case (c): If the vectors in $\mathcal{A}_t$ are also orthogonal to each other: finite-armed stochastic bandit.
Difference between cases (c) and (b):
Case (c): learn about the mean of arm $i$ $\Leftrightarrow$ choose action $i$;
Case (b): learn about the mean of arm $i$ $\Leftrightarrow$ choose any action $j$ with $\langle x_j, x_i \rangle \ne 0$.

94 Departing Thoughts These results are from Lattimore and Szepesvári (2016) The asymptotically optimal algorithm is given there (the algorithm solves for the optimal allocation, while monitoring whether things went wrong) Combes et al. (2017) refine the algorithm and generalize it to other settings. Soare et al. (2014), in best arm identification with linear payoff functions, gave essentially the same example that we use to argue for the large regret of optimistic algorithms. Open questions: Simultaneously finite-time near-optimal and asymptotically optimal algorithm Changing, or infinite action sets?

95 Summary

96 Summary of This Talk Contextual vs. linear bandits: Changing action sets can model contextual bandits Optimistic algorithms: Optimism can achieve minimax optimality Optimism can be expensive Optimistic algorithms require a careful design of the underlying confidence sets Sparsity: Exploiting sparsity is sometimes at odds with the requirement to collect rewards

97 What s Next for Bandits? Today: Finite-armed and linear stochastic bandits. But bandits come in all forms and shapes! Adversarial (finite, linear,... ) Combinatorial action sets: From shortest path to ranking Continuous action sets, continuous time, delays Resourceful, nonstationary, various structures (low-rank),... Nearby problems: Reinforcement learning/markov decision processes Partial monitoring

98 Learning Material
Bandit Visualizer / online bandit simulator:
Most of this tutorial (and more):
Book to be published by early next year: looking for reviewers!
Tor's lightweight C++ bandit library
Sébastien Bubeck's tutorial; blog post 1; blog post 2
Bubeck and Cesa-Bianchi's book (Bubeck and Cesa-Bianchi, 2012)
banditalgs.com

99 References I Abbasi-Yadkori, Y. (2009). Forced-exploration based algorithms for playing in bandits with large action sets. PhD thesis, University of Alberta. Abbasi-Yadkori, Y. (2012). Online Learning for Linearly Parametrized Control Problems. PhD thesis, University of Alberta. Abbasi-Yadkori, Y., Antos, A., and Szepesvári, C. (2009). Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback. Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. (2012). Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pages 1 9. Abbasi-Yadkori, Y., Szepesvári, C., and Tax, D. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NIPS), pages Abe, N. and Long, P. M. (1999). Associative reinforcement learning using linear probabilistic concepts. In ICML, pages Anantharam, V., Varaiya, P., and Walrand, J. (1987). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-part i: Iid rewards. IEEE Transactions on Automatic Control, 32(11): Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov): Auer, P. and Ortner, R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55 65.

100 References II Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers Incorporated. Chu, W., Li, L., Reyzin, L., and Schapire, R. E. (2011). Contextual bandits with linear payoff functions. In AISTATS, volume 15, pages Combes, R., Magureanu, S., and Proutiere, A. (2017). Minimal exploration in structured stochastic bandits. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages Curran Associates, Inc. Dani, V., Hayes, T. P., and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. In Proceedings of Conference on Learning Theory (COLT), pages Even-Dar, E., Mannor, S., and Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(Jun): Filippi, S., Cappé, O., Garivier, A., and Szepesvári, Cs. (2009). Parametric bandits: The generalized linear case. In NIPS-22, pages Freedman, D. (1975). On tail probabilities for martingales. The Annals of Probability, 3(1): Gerchinovitz, S. (2013). Sparsity regret bounds for individual sequences in online linear regression. Journal of Machine Learning Research, 14(Mar): Hsu, D., Kakade, S. M., and Zhang, T. (2012). Random design analysis of ridge regression. In Conference on Learning Theory, pages 9 1.

101 References III John, F. (1948). Extremum problems with inequalities as subsidiary conditions. Courant Anniversary Volume, Interscience. Lattimore, T., Crammer, K., and Szepesvári, C. (2015). Linear multi-resource allocation with semi-bandit feedback. In Advances in Neural Information Processing Systems, pages Lattimore, T. and Munos, R. (2014). Bounded regret for finite-armed structured bandits. In Advances in Neural Information Processing Systems, pages Lattimore, T. and Szepesvári, C. (2016). The end of optimism? an asymptotic analysis of finite-armed linear bandits. arxiv preprint arxiv: Peña, V. H., Lai, T. L., and Shao, Q.-M. (2008). Self-normalized processes: Limit theory and Statistical Applications. Springer Science & Business Media. Rakhlin, A. and Sridharan, K. (2015). On equivalence of martingale tail bounds and deterministic regret inequalities. arxiv preprint arxiv: Robbins, H. and Siegmund, D. (1970). Boundary crossing probabilities for the Wiener process and sample sums. Annals of Math. Statistics, 41: Robbins, H. and Siegmund, D. (1971). A convergence theorem for non-negative almost supermartingales and some applications. In Rustagi, J., editor, Optimizing Methods in Statistics, pages Academic Press, New York. Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35(2): Russo, D. and Roy, B. V. (2013). Eluder dimension and the sample complexity of optimistic exploration. In NIPS, pages

102 References IV Shamir, O. (2015). On the complexity of bandit linear optimization. In Conference on Learning Theory, pages Soare, M., Lazaric, A., and Munos, R. (2014). Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pages Srinivas, N., Krause, A., Kakade, S., and Seeger, M. W. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, pages Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. (2013). Finite-time analysis of kernelised contextual bandits. arxiv preprint arxiv: Valko, M., Munos, R., Kveton, B., and Kocák, T. (2014). Spectral bandits for smooth graph functions. In International Conference on Machine Learning, pages


More information

Dynamic resource allocation: Bandit problems and extensions

Dynamic resource allocation: Bandit problems and extensions Dynamic resource allocation: Bandit problems and extensions Aurélien Garivier Institut de Mathématiques de Toulouse MAD Seminar, Université Toulouse 1 October 3rd, 2014 The Bandit Model Roadmap 1 The Bandit

More information

Learning to play K-armed bandit problems

Learning to play K-armed bandit problems Learning to play K-armed bandit problems Francis Maes 1, Louis Wehenkel 1 and Damien Ernst 1 1 University of Liège Dept. of Electrical Engineering and Computer Science Institut Montefiore, B28, B-4000,

More information

Evaluation of multi armed bandit algorithms and empirical algorithm

Evaluation of multi armed bandit algorithms and empirical algorithm Acta Technica 62, No. 2B/2017, 639 656 c 2017 Institute of Thermomechanics CAS, v.v.i. Evaluation of multi armed bandit algorithms and empirical algorithm Zhang Hong 2,3, Cao Xiushan 1, Pu Qiumei 1,4 Abstract.

More information

Minimax Concave Penalized Multi-Armed Bandit Model with High-Dimensional Convariates

Minimax Concave Penalized Multi-Armed Bandit Model with High-Dimensional Convariates Minimax Concave Penalized Multi-Armed Bandit Model with High-Dimensional Convariates Xue Wang * 1 Mike Mingcheng Wei * 2 Tao Yao * 1 Abstract In this paper, we propose a Minimax Concave Penalized Multi-Armed

More information

Analysis of Thompson Sampling for the multi-armed bandit problem

Analysis of Thompson Sampling for the multi-armed bandit problem Analysis of Thompson Sampling for the multi-armed bandit problem Shipra Agrawal Microsoft Research India shipra@microsoft.com avin Goyal Microsoft Research India navingo@microsoft.com Abstract We show

More information

The No-Regret Framework for Online Learning

The No-Regret Framework for Online Learning The No-Regret Framework for Online Learning A Tutorial Introduction Nahum Shimkin Technion Israel Institute of Technology Haifa, Israel Stochastic Processes in Engineering IIT Mumbai, March 2013 N. Shimkin,

More information

Quantifying mismatch in Bayesian optimization

Quantifying mismatch in Bayesian optimization Quantifying mismatch in Bayesian optimization Eric Schulz University College London e.schulz@cs.ucl.ac.uk Maarten Speekenbrink University College London m.speekenbrink@ucl.ac.uk José Miguel Hernández-Lobato

More information

The multi armed-bandit problem

The multi armed-bandit problem The multi armed-bandit problem (with covariates if we have time) Vianney Perchet & Philippe Rigollet LPMA Université Paris Diderot ORFE Princeton University Algorithms and Dynamics for Games and Optimization

More information

Reward Maximization Under Uncertainty: Leveraging Side-Observations on Networks

Reward Maximization Under Uncertainty: Leveraging Side-Observations on Networks Reward Maximization Under Uncertainty: Leveraging Side-Observations Reward Maximization Under Uncertainty: Leveraging Side-Observations on Networks Swapna Buccapatnam AT&T Labs Research, Middletown, NJ

More information

Complex Bandit Problems and Thompson Sampling

Complex Bandit Problems and Thompson Sampling Complex Bandit Problems and Aditya Gopalan Department of Electrical Engineering Technion, Israel aditya@ee.technion.ac.il Shie Mannor Department of Electrical Engineering Technion, Israel shie@ee.technion.ac.il

More information

Exploiting Correlation in Finite-Armed Structured Bandits

Exploiting Correlation in Finite-Armed Structured Bandits Exploiting Correlation in Finite-Armed Structured Bandits Samarth Gupta Carnegie Mellon University Pittsburgh, PA 1513 Gauri Joshi Carnegie Mellon University Pittsburgh, PA 1513 Osman Yağan Carnegie Mellon

More information

COS 402 Machine Learning and Artificial Intelligence Fall Lecture 22. Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3

COS 402 Machine Learning and Artificial Intelligence Fall Lecture 22. Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3 COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 22 Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3 How to balance exploration and exploitation in reinforcement

More information

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon. Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,

More information

Introduction to Bandit Algorithms. Introduction to Bandit Algorithms

Introduction to Bandit Algorithms. Introduction to Bandit Algorithms Stochastic K-Arm Bandit Problem Formulation Consider K arms (actions) each correspond to an unknown distribution {ν k } K k=1 with values bounded in [0, 1]. At each time t, the agent pulls an arm I t {1,...,

More information

Bandit Convex Optimization: T Regret in One Dimension

Bandit Convex Optimization: T Regret in One Dimension Bandit Convex Optimization: T Regret in One Dimension arxiv:1502.06398v1 [cs.lg 23 Feb 2015 Sébastien Bubeck Microsoft Research sebubeck@microsoft.com Tomer Koren Technion tomerk@technion.ac.il February

More information

arxiv: v4 [cs.lg] 22 Jul 2014

arxiv: v4 [cs.lg] 22 Jul 2014 Learning to Optimize Via Information-Directed Sampling Daniel Russo and Benjamin Van Roy July 23, 2014 arxiv:1403.5556v4 cs.lg] 22 Jul 2014 Abstract We propose information-directed sampling a new algorithm

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Regional Multi-Armed Bandits

Regional Multi-Armed Bandits School of Information Science and Technology University of Science and Technology of China {wzy43, zrd7}@mail.ustc.edu.cn, congshen@ustc.edu.cn Abstract We consider a variant of the classic multiarmed

More information

Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning Peter Auer Ronald Ortner University of Leoben, Franz-Josef-Strasse 18, 8700 Leoben, Austria auer,rortner}@unileoben.ac.at Abstract

More information

arxiv: v1 [cs.lg] 15 Oct 2014

arxiv: v1 [cs.lg] 15 Oct 2014 THOMPSON SAMPLING WITH THE ONLINE BOOTSTRAP By Dean Eckles and Maurits Kaptein Facebook, Inc., and Radboud University, Nijmegen arxiv:141.49v1 [cs.lg] 15 Oct 214 Thompson sampling provides a solution to

More information

Mostly Exploration-Free Algorithms for Contextual Bandits

Mostly Exploration-Free Algorithms for Contextual Bandits Mostly Exploration-Free Algorithms for Contextual Bandits Hamsa Bastani IBM Thomas J. Watson Research and Wharton School, hamsab@wharton.upenn.edu Mohsen Bayati Stanford Graduate School of Business, bayati@stanford.edu

More information

Analysis of Greedy Algorithms

Analysis of Greedy Algorithms Analysis of Greedy Algorithms Jiahui Shen Florida State University Oct.26th Outline Introduction Regularity condition Analysis on orthogonal matching pursuit Analysis on forward-backward greedy algorithm

More information

Multi-Armed Bandit Formulations for Identification and Control

Multi-Armed Bandit Formulations for Identification and Control Multi-Armed Bandit Formulations for Identification and Control Cristian R. Rojas Joint work with Matías I. Müller and Alexandre Proutiere KTH Royal Institute of Technology, Sweden ERNSI, September 24-27,

More information

Lecture 19: UCB Algorithm and Adversarial Bandit Problem. Announcements Review on stochastic multi-armed bandit problem

Lecture 19: UCB Algorithm and Adversarial Bandit Problem. Announcements Review on stochastic multi-armed bandit problem Lecture 9: UCB Algorithm and Adversarial Bandit Problem EECS598: Prediction and Learning: It s Only a Game Fall 03 Lecture 9: UCB Algorithm and Adversarial Bandit Problem Prof. Jacob Abernethy Scribe:

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

arxiv: v1 [cs.lg] 6 Jun 2018

arxiv: v1 [cs.lg] 6 Jun 2018 Mitigating Bias in Adaptive Data Gathering via Differential Privacy Seth Neel Aaron Roth June 7, 2018 arxiv:1806.02329v1 [cs.lg] 6 Jun 2018 Abstract Data that is gathered adaptively via bandit algorithms,

More information

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for Big data 1 / 22

More information

Online (and Distributed) Learning with Information Constraints. Ohad Shamir

Online (and Distributed) Learning with Information Constraints. Ohad Shamir Online (and Distributed) Learning with Information Constraints Ohad Shamir Weizmann Institute of Science Online Algorithms and Learning Workshop Leiden, November 2014 Ohad Shamir Learning with Information

More information

Ordinal optimization - Empirical large deviations rate estimators, and multi-armed bandit methods

Ordinal optimization - Empirical large deviations rate estimators, and multi-armed bandit methods Ordinal optimization - Empirical large deviations rate estimators, and multi-armed bandit methods Sandeep Juneja Tata Institute of Fundamental Research Mumbai, India joint work with Peter Glynn Applied

More information

An adaptive algorithm for finite stochastic partial monitoring

An adaptive algorithm for finite stochastic partial monitoring An adaptive algorithm for finite stochastic partial monitoring Gábor Bartók Navid Zolghadr Csaba Szepesvári Department of Computing Science, University of Alberta, AB, Canada, T6G E8 bartok@ualberta.ca

More information

arxiv: v1 [cs.lg] 7 Sep 2018

arxiv: v1 [cs.lg] 7 Sep 2018 Analysis of Thompson Sampling for Combinatorial Multi-armed Bandit with Probabilistically Triggered Arms Alihan Hüyük Bilkent University Cem Tekin Bilkent University arxiv:809.02707v [cs.lg] 7 Sep 208

More information

Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12

Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Lecture 4: Multiarmed Bandit in the Adversarial Model Lecturer: Yishay Mansour Scribe: Shai Vardi 4.1 Lecture Overview

More information

Supremum of simple stochastic processes

Supremum of simple stochastic processes Subspace embeddings Daniel Hsu COMS 4772 1 Supremum of simple stochastic processes 2 Recap: JL lemma JL lemma. For any ε (0, 1/2), point set S R d of cardinality 16 ln n S = n, and k N such that k, there

More information

Efficient and Principled Online Classification Algorithms for Lifelon

Efficient and Principled Online Classification Algorithms for Lifelon Efficient and Principled Online Classification Algorithms for Lifelong Learning Toyota Technological Institute at Chicago Chicago, IL USA Talk @ Lifelong Learning for Mobile Robotics Applications Workshop,

More information

Online Learning and Online Convex Optimization

Online Learning and Online Convex Optimization Online Learning and Online Convex Optimization Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Learning 1 / 49 Summary 1 My beautiful regret 2 A supposedly fun game

More information

Lecture 5: Regret Bounds for Thompson Sampling

Lecture 5: Regret Bounds for Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/2/6 Lecture 5: Regret Bounds for Thompson Sampling Instructor: Alex Slivkins Scribed by: Yancy Liao Regret Bounds for Thompson Sampling For each round t, we defined

More information