1 Bandit Algorithms Tor Lattimore & Csaba Szepesvári

2 Outline 1 From Contextual to Linear Bandits 2 Stochastic Linear Bandits 3 Confidence Bounds for Least-Squares Estimators 4 Improved Regret for Fixed, Finite Action Sets

3 Outline 5 Sparse Stochastic Linear Bandits 6 Minimax Regret 7 Asymptopia 8 Summary

4 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

5 Stochastic Contextual Bandits
Set of contexts $\mathcal{C}$, set of actions $[K]$; distributions $(P_{c,a})$.
Interaction. For rounds $t = 1, 2, 3, \dots$:
1. Context $C_t \in \mathcal{C}$ is revealed to the learner.
2. Based on its past observations (including $C_t$), the learner chooses an action $A_t \in [K]$, which is sent to the environment.
3. The environment sends the reward $X_t \sim P_{C_t, A_t}$ to the learner.

6 Regret Definition
Definition: expected reward for action $a$ under context $c$: $r(c, a) = \int x \, P_{c,a}(dx)$.
Regret: $R_n = \mathbb{E}\left[ \sum_{t=1}^n \max_{a \in [K]} r(C_t, a) - \sum_{t=1}^n X_t \right]$.

7 Poor Man's Contextual Bandit Algorithm
Assumption: $\mathcal{C}$ is finite. Idea: assign a separate bandit algorithm to each context.
Worst-case regret: $R_n = \Theta(\sqrt{n M K})$, where $M = |\mathcal{C}|$.
Problem: $M$ (and $K$) can be very large. How to save this? Assume structure.
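The reduction above is easy to code. Below is a minimal sketch (not from the slides; all class names and the UCB1 exploration constant are illustrative assumptions) that lazily creates one independent UCB1 instance per observed context, which is exactly what drives the $\sqrt{nMK}$ scaling.

```python
import math
from collections import defaultdict

class UCB1:
    """Standard UCB1 for a single context (illustrative implementation)."""
    def __init__(self, K):
        self.counts = [0] * K
        self.sums = [0.0] * K

    def select(self, t):
        for a, c in enumerate(self.counts):
            if c == 0:
                return a  # play every arm once first
        return max(range(len(self.counts)),
                   key=lambda a: self.sums[a] / self.counts[a]
                   + math.sqrt(2 * math.log(t) / self.counts[a]))

    def update(self, a, x):
        self.counts[a] += 1
        self.sums[a] += x

class PerContextBandit:
    """'Poor man's' contextual bandit: one UCB1 instance per context."""
    def __init__(self, K):
        self.bandits = defaultdict(lambda: UCB1(K))
        self.t = 0

    def act(self, context):
        self.t += 1
        return self.bandits[context].select(self.t)

    def update(self, context, a, x):
        self.bandits[context].update(a, x)
```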

8 Linear Models
Assumption: $r(c, a) = \langle \psi(c, a), \theta_* \rangle$ for all $(c, a) \in \mathcal{C} \times [K]$, where $\psi : \mathcal{C} \times [K] \to \mathbb{R}^d$ and $\theta_* \in \mathbb{R}^d$.
$\psi$: feature map.
$H_\psi = \mathrm{span}(\psi(c, k) : c \in \mathcal{C}, k \in [K]) \subseteq \mathbb{R}^d$: feature space.
$S_\psi = \{ r_\theta : \mathcal{C} \times [K] \to \mathbb{R} \mid \theta \in \mathbb{R}^d \}$, where $r_\theta(c, k) = \langle \psi(c, k), \theta \rangle$: space of linear reward functions.
Note: RKHSs let us deal with $d = \infty$ (or $d$ very large).

9 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

10 Choosing ψ: Betting on Smoothness of r
Let $\|\cdot\|$ be a norm on $\mathbb{R}^d$ and $\|\cdot\|_*$ its dual (e.g., both 2-norms).
Hölder's inequality: $|r(c, a) - r(c', a')| \le \|\theta_*\|_* \, \|\psi(c, a) - \psi(c', a')\|$.
$r = \langle \psi, \theta_* \rangle$ $\Rightarrow$ $r$ is $(\rho, \|\theta_*\|_*)$-smooth: $|r(z) - r(z')| \le \|\theta_*\|_* \, \rho(z, z')$, where $z = (c, a)$, $z' = (c', a')$ and $\rho(z, z') = \|\psi(z) - \psi(z')\|$.
Choice of $\psi$ + preference for small $\|\theta_*\|_*$ $\Rightarrow$ preference for $\rho$-smooth $r$.
The influence of $\psi$ is largest when $d = +\infty$.

11 Sparsity
Concern: Is $r \in S_\psi$ really true?
Thinking of free lunches: can't we just add a lot of features to make sure $r \in S_\psi$?
Feature $i$ unused: $\theta_{*,i} = 0$. Small $\|\theta_*\|_0 = \sum_{i=1}^d \mathbb{I}\{\theta_{*,i} \ne 0\}$ ("0-norm"): sparsity.

12 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

13 Features Are All You Need
Given $C_1, A_1, \dots, C_t, A_t$, the reward $X_t$ in round $t$ satisfies $\mathbb{E}[X_t \mid C_1, A_1, \dots, C_t, A_t] = \langle \psi(C_t, A_t), \theta_* \rangle$ for some known $\psi$ and unknown $\theta_*$.
Reformulation: at the beginning of any round $t$, observe an action set $\mathcal{A}_t \subset \mathbb{R}^d$. If $A_t \in \mathcal{A}_t$, then $\mathbb{E}[X_t \mid \mathcal{A}_1, A_1, \dots, \mathcal{A}_t, A_t] = \langle A_t, \theta_* \rangle$ with some unknown $\theta_*$.
Why? Let $\mathcal{A}_t = \{ \psi(C_t, a) : a \in [K] \}$.

14 Stochastic Linear Bandits
1. In round $t$, observe the action set $\mathcal{A}_t \subset \mathbb{R}^d$.
2. The learner chooses $A_t \in \mathcal{A}_t$ and receives $X_t$, satisfying $\mathbb{E}[X_t \mid \mathcal{A}_1, A_1, \dots, \mathcal{A}_t, A_t] = \langle A_t, \theta_* \rangle$ with some unknown $\theta_*$.
Goal: keep the regret $R_n = \mathbb{E}\left[ \sum_{t=1}^n \max_{a \in \mathcal{A}_t} \langle a, \theta_* \rangle - X_t \right]$ small.
Additional assumptions: (i) $\mathcal{A}_1, \dots, \mathcal{A}_n$ is any fixed sequence; (ii) $X_t - \langle A_t, \theta_* \rangle$ is light-tailed given $\mathcal{A}_1, A_1, \dots, \mathcal{A}_t, A_t$.
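For concreteness, here is a minimal simulation sketch of this protocol (not from the slides): fixed, known action sets, standard Gaussian (hence 1-subgaussian) noise, and a learner exposing illustrative act/update methods.

```python
import numpy as np

def run(learner, action_sets, theta_star, seed=0):
    """action_sets: list of (K_t, d) arrays, one per round. Returns the pseudo-regret."""
    rng = np.random.default_rng(seed)
    regret = 0.0
    for A in action_sets:
        i = learner.act(A)                    # learner picks a row index of A
        x = A[i] @ theta_star + rng.normal()  # reward = <A_t, theta*> + noise
        learner.update(A[i], x)
        regret += np.max(A @ theta_star) - A[i] @ theta_star
    return regret
```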

15 Finite-armed bandits
Case (a): $\mathcal{A}_t$ always contains the same number of vectors: finite-armed stochastic contextual bandit.
Case (b): In addition, $\mathcal{A}_t$ does not change, i.e. $\mathcal{A}_t = \{a_1, \dots, a_K\}$: finite-armed stochastic linear bandit.
Case (c): If the vectors in $\mathcal{A}_t$ are also orthogonal to each other: finite-armed stochastic bandit.
Difference between cases (c) and (b):
Case (c): learn about the mean of arm $i$ $\Leftrightarrow$ choose action $i$;
Case (b): learn about the mean of arm $i$ $\Leftrightarrow$ choose any action $j$ with $\langle x_j, x_i \rangle \ne 0$.

16 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

17 Once an Optimist, Always an Optimist
Optimism in the Face of Uncertainty Principle: choose the best action in the best environment amongst the plausible ones.
Environment: $\theta_* \in \mathbb{R}^d$.
Plausible environments: $\mathcal{C}_t \subset \mathbb{R}^d$ s.t. $P(\theta_* \in \mathcal{C}_t) \ge 1 - 1/t$.
Best environment: $\tilde\theta_t = \mathrm{argmax}_{\theta \in \mathcal{C}_t} \max_{a \in \mathcal{A}} \langle a, \theta \rangle$.
Best action: $A_t = \mathrm{argmax}_{a \in \mathcal{A}} \langle a, \tilde\theta_t \rangle$.

18 Choosing the Confidence Set
Say the reward in round $t$ is $X_t$ and the action in round $t$ is $A_t \in \mathbb{R}^d$:
$X_t = \langle A_t, \theta_* \rangle + \eta_t$, where $\eta_t$ is noise.
Regularized least-squares estimator:
$\hat\theta_t = V_t^{-1} \sum_{s=1}^t A_s X_s$, with $V_0 = \lambda I$ and $V_t = V_0 + \sum_{s=1}^t A_s A_s^\top$.
Choice of $\mathcal{C}_t$: $\mathcal{C}_t \subseteq \mathcal{E}_t := \{ \theta \in \mathbb{R}^d : \|\theta - \hat\theta_{t-1}\|^2_{V_{t-1}} \le \beta_t \}$.
Here $(\beta_t)_t$ is nondecreasing with $\beta_t \ge 1$, and for $A$ positive definite, $\|x\|^2_A = x^\top A x$.
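A sketch of these quantities in code (the class and method names are assumptions of this illustration): it maintains $V_t = \lambda I + \sum_s A_s A_s^\top$ and $\sum_s A_s X_s$, and exposes the ellipsoid membership test.

```python
import numpy as np

class RidgeEstimator:
    """Regularized least squares for linear bandits (illustrative sketch)."""
    def __init__(self, d, lam=1.0):
        self.V = lam * np.eye(d)      # V_0 = lambda * I
        self.b = np.zeros(d)          # running sum of A_s * X_s

    def update(self, a, x):
        self.V += np.outer(a, a)
        self.b += x * a

    @property
    def theta_hat(self):
        return np.linalg.solve(self.V, self.b)

    def in_ellipsoid(self, theta, beta):
        # membership in E = { theta : ||theta - theta_hat||_V^2 <= beta }
        diff = theta - self.theta_hat
        return diff @ self.V @ diff <= beta
```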

19 LinUCB
Choose $\mathcal{C}_t = \mathcal{E}_t$ with suitable $(\beta_t)_t$ and let
$A_t = \mathrm{argmax}_{a \in \mathcal{A}} \max_{\theta \in \mathcal{C}_t} \langle a, \theta \rangle$.
Then,
$A_t = \mathrm{argmax}_{a \in \mathcal{A}} \ \langle a, \hat\theta_{t-1} \rangle + \sqrt{\beta_t} \, \|a\|_{V_{t-1}^{-1}}$.
LinUCB (a.k.a. LinRel, OFUL, ConfEllips, ...)
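Combining the two previous slides, here is a sketch of the closed-form LinUCB action rule; it reuses the RidgeEstimator sketch above, which is an assumption of this illustration.

```python
import numpy as np

def linucb_action(est, actions, beta):
    """actions: (K, d) array; returns the index maximizing the UCB index."""
    theta_hat = est.theta_hat
    Vinv = np.linalg.inv(est.V)
    # <a, theta_hat> + sqrt(beta) * ||a||_{V^{-1}} for every candidate action
    bonus = np.sqrt(np.einsum('ki,ij,kj->k', actions, Vinv, actions))
    ucb = actions @ theta_hat + np.sqrt(beta) * bonus
    return int(np.argmax(ucb))
```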

20 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

21 Regret Analysis
Assumptions:
1. Bounded scalar mean reward: $|\langle a, \theta_* \rangle| \le 1$ for any $a \in \cup_t \mathcal{A}_t$.
2. Bounded actions: $\|a\|_2 \le L$ for any $a \in \cup_t \mathcal{A}_t$.
3. Honest confidence intervals: there exists a $\delta \in (0, 1)$ such that with probability $1 - \delta$, for all $t \in [n]$, $\theta_* \in \mathcal{C}_t$, where $\mathcal{C}_t$ satisfies $\mathcal{C}_t \subseteq \mathcal{E}_t$ with $(\beta_t)_t \ge 1$ nondecreasing, as on the previous slide.
Theorem (LinUCB Regret). Let the conditions listed above hold. Then, with probability $1 - \delta$, the regret of LinUCB satisfies
$\hat R^{\mathrm{pseudo}}_n \le \sqrt{8 n \beta_n \log \frac{\det V_n}{\det V_0}} \le \sqrt{8 d n \beta_n \log \frac{\mathrm{trace}(V_0) + n L^2}{d \det^{1/d}(V_0)}}$.

22 Proof
Assume $\theta_* \in \mathcal{C}_t$ for all $t \in [n]$. Let $A_t^* = \mathrm{argmax}_{a \in \mathcal{A}_t} \langle a, \theta_* \rangle$, $r_t = \langle A_t^* - A_t, \theta_* \rangle$, and let $\tilde\theta_t \in \mathcal{C}_t$ be such that $\langle A_t, \tilde\theta_t \rangle = \mathrm{UCB}_t(A_t)$.
From $\theta_* \in \mathcal{C}_t$ and the definition of LinUCB,
$\langle A_t^*, \theta_* \rangle \le \mathrm{UCB}_t(A_t^*) \le \mathrm{UCB}_t(A_t) = \langle A_t, \tilde\theta_t \rangle$.
Then,
$r_t \le \langle A_t, \tilde\theta_t - \theta_* \rangle \le \|A_t\|_{V_{t-1}^{-1}} \|\tilde\theta_t - \theta_*\|_{V_{t-1}} \le 2 \|A_t\|_{V_{t-1}^{-1}} \sqrt{\beta_t}$.
From $|\langle a, \theta_* \rangle| \le 1$, $r_t \le 2$. Combined with $\beta_n \ge \max\{1, \beta_t\}$, this gives
$r_t \le 2 \wedge 2 \sqrt{\beta_t} \, \|A_t\|_{V_{t-1}^{-1}} \le 2 \sqrt{\beta_n} \left( 1 \wedge \|A_t\|_{V_{t-1}^{-1}} \right)$.
Jensen's inequality shows that
$\hat R^{\mathrm{pseudo}}_n = \sum_{t=1}^n r_t \le \sqrt{n \sum_{t=1}^n r_t^2} \le 2 \sqrt{n \beta_n \sum_{t=1}^n \left( 1 \wedge \|A_t\|^2_{V_{t-1}^{-1}} \right)}$.

23 Lemma
Lemma. Let $x_1, \dots, x_n \in \mathbb{R}^d$, $V_t = V_0 + \sum_{s=1}^t x_s x_s^\top$ for $t \in [n]$, $v_0 = \mathrm{trace}(V_0)$ and $L \ge \max_t \|x_t\|_2$. Then,
$\sum_{t=1}^n \left( 1 \wedge \|x_t\|^2_{V_{t-1}^{-1}} \right) \le 2 \log \frac{\det V_n}{\det V_0} \le 2 d \log \frac{v_0 + n L^2}{d \det^{1/d}(V_0)}$.
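A quick numerical sanity check of the lemma (illustrative only, with $V_0 = I$): the sum of truncated squared norms never exceeds $2\log(\det V_n / \det V_0)$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 1000
V = np.eye(d)                                  # V_0 = I, so v_0 = d, det V_0 = 1
logdet_V0 = np.linalg.slogdet(V)[1]
total = 0.0
for _ in range(n):
    x = rng.normal(size=d)
    total += min(1.0, x @ np.linalg.solve(V, x))  # 1 ∧ ||x_t||^2_{V_{t-1}^{-1}}
    V += np.outer(x, x)
lhs, rhs = total, 2 * (np.linalg.slogdet(V)[1] - logdet_V0)
assert lhs <= rhs
```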

24 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

25 LinUCB and Finite-Armed Bandits
Recall that if $\mathcal{A}_t = \{e_1, \dots, e_d\}$, we get back finite-armed bandits.
LinUCB: $A_t = \mathrm{argmax}_a \ \langle a, \hat\theta_{t-1} \rangle + \sqrt{\beta_t} \, \|a\|_{V_{t-1}^{-1}}$.
If we set $\lambda = 0$, then $\langle e_i, \hat\theta_{t-1} \rangle = \hat\mu_{i,t-1}$ (the empirical mean) and $V_{t-1} = \mathrm{diag}(T_1(t-1), \dots, T_K(t-1))$.
Hence $\sqrt{\beta_t} \, \|e_i\|_{V_{t-1}^{-1}} = \sqrt{\beta_t / T_i(t-1)}$, and
$A_t = \mathrm{argmax}_i \ \hat\mu_{i,t-1} + \sqrt{\beta_t / T_i(t-1)}$.
We recover UCB when $\beta_t = 2 \log(\cdot)$.
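As a small illustration of this reduction (assumed names), with the standard basis and $\lambda = 0$ the Gram matrix is diagonal and the LinUCB index collapses to the familiar per-arm UCB index.

```python
import numpy as np

def ucb_index(counts, sums, beta_t):
    """counts, sums: per-arm pull counts T_i(t-1) > 0 and reward sums."""
    mu_hat = sums / counts                      # <e_i, theta_hat> = empirical mean
    return mu_hat + np.sqrt(beta_t / counts)    # sqrt(beta_t) * ||e_i||_{V^{-1}}
```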

26 History
Abe and Long (1999) introduced stochastic linear bandits into the machine learning literature.
Auer (2002) was the first to consider optimism for linear bandits (LinRel, SupLinRel). Main restriction: $|\mathcal{A}_t| < +\infty$.
Confidence ellipsoids: Dani et al. (2008) (ConfidenceBall$_2$), Rusmevichientong and Tsitsiklis (2010) (Uncertainty Ellipsoid Policy), Abbasi-Yadkori et al. (2011) (OFUL). The name LinUCB comes from Chu et al. (2011).
Alternative routes:
Explore-then-commit for action sets with smooth boundary: Abbasi-Yadkori et al. (2009); Abbasi-Yadkori (2009); Rusmevichientong and Tsitsiklis (2010).
Phased elimination
Thompson sampling

27 Extensions of Linear Bandits
Generalized linear model (Filippi et al., 2009): $X_t = g^{-1}(\langle A_t, \theta_* \rangle + \eta)$, where $g : \mathbb{R} \to \mathbb{R}$ is called the link function. Common choice: $g(p) = \log(p/(1-p))$, i.e. $g^{-1}(x) = 1/(1 + \exp(-x))$ (sigmoid).
Spectral bandits: Spectral Eliminator, Valko et al. (2014).
Kernelised UCB: Valko et al. (2013).
(Nonlinear) structured bandits: $r : \mathcal{C} \times [K] \to [0, 1]$ belongs to some known set (Anantharam et al., 1987; Russo and Roy, 2013; Lattimore and Munos, 2014).

28 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

29 Setting
1. Subgaussian rewards: the reward is $X_t = \langle A_t, \theta_* \rangle + \eta_t$, where $\eta_t$ is conditionally 1-subgaussian ($\eta_t \mid \mathcal{F}_{t-1} \sim \mathrm{subG}(1)$):
$\mathbb{E}[\exp(\lambda \eta_t) \mid \mathcal{F}_{t-1}] \le \exp(\lambda^2 / 2)$ almost surely for all $\lambda \in \mathbb{R}$, where $\mathcal{F}_t = \sigma(A_1, \eta_1, \dots, A_{t-1}, \eta_{t-1}, A_t)$.
2. Bounded parameter vector: $\|\theta_*\|_2 \le S$ with $S > 0$ known.

30 Least Squares: Recap
Linear model: $X_t = \langle A_t, \theta_* \rangle + \eta_t$.
Regularized squared loss: $L_t(\theta) = \sum_{s=1}^t (X_s - \langle A_s, \theta \rangle)^2 + \lambda \|\theta\|_2^2$.
Least-squares estimate $\hat\theta_t = \mathrm{argmin}_\theta L_t(\theta)$:
$\hat\theta_t = V_t(\lambda)^{-1} \sum_{s=1}^t X_s A_s$ with $V_t(\lambda) = \lambda I + \sum_{s=1}^t A_s A_s^\top$.
Abbreviation: $V_t = V_t(0)$.
Main difficulty: the actions $(A_s)_{s \le t}$ are neither fixed nor independent, but are intricately correlated via the rewards $(X_s)_{s < t}$.

31 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

32 Fixed Design
1. Nonsingular Gram matrix: $\lambda = 0$ and $V_t$ is invertible.
2. Independent subgaussian noise: $(\eta_s)_s$ are independent and 1-subgaussian.
3. Fixed design: $A_1, \dots, A_t$ are deterministically chosen without the knowledge of $X_1, \dots, X_t$.
Notation: $A_1, \dots, A_t$ replaced by $a_1, \dots, a_t$, so
$V_t = \sum_{s=1}^t a_s a_s^\top$ and $\hat\theta_t = V_t^{-1} \sum_{s=1}^t a_s X_s$.
Note: $(a_s)_{s=1}^t$ must span $\mathbb{R}^d$ for $V_t^{-1}$ to exist, hence $t \ge d$.

33 First Steps
Fix $x \in \mathbb{R}^d$. From $X_s = a_s^\top \theta_* + \eta_s$,
$\hat\theta_t = V_t^{-1} \sum_{s=1}^t a_s X_s = V_t^{-1} V_t \theta_* + V_t^{-1} \sum_{s=1}^t a_s \eta_s$,
so
$\langle x, \hat\theta_t - \theta_* \rangle = \sum_{s=1}^t \langle x, V_t^{-1} a_s \rangle \eta_s$.
Since $(\eta_s)_s$ are independent and 1-subgaussian: with probability $1 - \delta$,
$\langle x, \hat\theta_t - \theta_* \rangle < \sqrt{2 \sum_{s=1}^t \langle x, V_t^{-1} a_s \rangle^2 \log(1/\delta)} = \sqrt{2 \|x\|^2_{V_t^{-1}} \log(1/\delta)}$.

34 Bounding $\|\hat\theta_t - \theta_*\|_{V_t}$
For $x \in \mathbb{R}^d$ fixed, with probability $1 - \delta$,
$\langle x, \hat\theta_t - \theta_* \rangle < \sqrt{2 \|x\|^2_{V_t^{-1}} \log(1/\delta)}$.  (*)
We have $\|u\|^2_{V_t} = (V_t u)^\top u$.
Idea 1: apply (*) to $X = V_t(\hat\theta_t - \theta_*)$. Problem: (*) holds for non-random $x$ only!
Idea 2: take a finite $C_\varepsilon$ s.t. $\|X - x\| \le \varepsilon$ for some $x \in C_\varepsilon$; make (*) hold for each $x \in C_\varepsilon$ and combine.
Refinement: let $X = V_t^{1/2}(\hat\theta_t - \theta_*) / \|\hat\theta_t - \theta_*\|_{V_t}$. Then $X \in S^{d-1} = \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$ and we can choose a finite $C_\varepsilon \subset S^{d-1}$ s.t. for any $u \in S^{d-1}$ there is an $x \in C_\varepsilon$ with $\|u - x\| \le \varepsilon$.

35 Bounding $\|\hat\theta_t - \theta_*\|_{V_t}$: Part II
Let $X = \frac{V_t^{1/2}(\hat\theta_t - \theta_*)}{\|\hat\theta_t - \theta_*\|_{V_t}}$ and $X' = \mathrm{argmin}_{x \in C_\varepsilon} \|x - X\|$. Then, with probability $1 - \delta$,
$\|\hat\theta_t - \theta_*\|_{V_t} = \langle X, V_t^{1/2}(\hat\theta_t - \theta_*) \rangle = \langle X - X', V_t^{1/2}(\hat\theta_t - \theta_*) \rangle + \langle V_t^{1/2} X', \hat\theta_t - \theta_* \rangle \le \varepsilon \|\hat\theta_t - \theta_*\|_{V_t} + \sqrt{2 \|V_t^{1/2} X'\|^2_{V_t^{-1}} \log\frac{|C_\varepsilon|}{\delta}}$,
hence
$\|\hat\theta_t - \theta_*\|_{V_t} \le \frac{1}{1 - \varepsilon} \sqrt{2 \log\frac{|C_\varepsilon|}{\delta}}$.
We can choose $C_\varepsilon$ so that $|C_\varepsilon| \le (5/\varepsilon)^d$; with $\varepsilon = 1/2$,
$\|\hat\theta_t - \theta_*\|_{V_t} < 2 \sqrt{2 \left( d \log(10) + \log\frac{1}{\delta} \right)}$.

36 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

37 Bounding $\|\hat\theta_t - \theta_*\|_{V_t}$: Sequential Design
$X_s = \langle A_s, \theta_* \rangle + \eta_s$, with $\eta_s \mid \mathcal{F}_{s-1} \sim \mathrm{subG}(1)$ for $\mathcal{F}_s = \sigma(A_1, \eta_1, \dots, A_s, \eta_s)$.
The previous bound exploited that $A_1, \dots, A_t$ are fixed and non-random: fixed design.
When $A_1, \dots, A_t$ are i.i.d., we have a random design.
Bandits: $A_s$ is chosen based on $A_1, X_1, \dots, A_{s-1}, X_{s-1}$! Sequential design.
How to bound $\|\hat\theta_t - \theta_*\|_{V_t}$ in this case?

38 Bounding $\|\hat\theta_t - \theta_*\|_{V_t}$: Sequential Design
Tools: linearization trick, vector Chernoff, Laplace method.

39 A Start: Linearization & Vector Chernoff
Let $S_t = \sum_{s=1}^t \eta_s A_s$. Linearization of the quadratic:
$\frac12 \|\hat\theta_t - \theta_*\|^2_{V_t} = \frac12 \|S_t\|^2_{V_t^{-1}} = \max_{x \in \mathbb{R}^d} \langle x, S_t \rangle - \frac12 \|x\|^2_{V_t}$.
Let $M_t(x) = \exp\left( \langle x, S_t \rangle - \frac12 \|x\|^2_{V_t} \right)$. One can show that $\mathbb{E}[M_t(x)] \le 1$ for any $x \in \mathbb{R}^d$.
Chernoff's method:
$P\left( \frac12 \|\hat\theta_t - \theta_*\|^2_{V_t} \ge u \right) = P\left( \exp(\max_x \log M_t(x)) \ge \exp(u) \right) \le \mathbb{E}\left[ \exp(\max_x \log M_t(x)) \right] \exp(-u) = \mathbb{E}\left[ \max_x M_t(x) \right] \exp(-u)$.
Can we control $\mathbb{E}[\max_x M_t(x)]$?

40 Controlling $\mathbb{E}[\max_x M_t(x)]$: Covering Argument
Recall: $M_t(x) = \exp\left( \langle x, S_t \rangle - \frac12 \|x\|^2_{V_t} \right)$.
Let $C_\varepsilon \subset \mathbb{R}^d$ be finite, to be chosen later, and let $X = \mathrm{argmax}_{x \in \mathbb{R}^d} M_t(x)$, $Y = \mathrm{argmin}_{y \in C_\varepsilon} \|X - y\|$. Then,
$\max_{x \in \mathbb{R}^d} M_t(x) = M_t(X) = M_t(X) - M_t(Y) + M_t(Y) \le \varepsilon + \sum_{y \in C_\varepsilon} M_t(y)$.
Challenge: ensure $M_t(X) - M_t(Y) \le \varepsilon$!

41 Laplace: One Step Back, Two Forward
The need to control $\mathbb{E}[\max_x M_t(x)]$ comes from the identity $\exp\left( \frac12 \|\hat\theta_t - \theta_*\|^2_{V_t} \right) = \max_x M_t(x)$.
Laplace: the integral of $\exp(s f(x))$ is dominated by $\exp(s \max_x f(x))$:
$\int_a^b e^{s f(x)} dx \approx e^{s f(x_0)} \sqrt{\frac{2\pi}{s |f''(x_0)|}}$, where $x_0 = \mathrm{argmax}_{x \in [a,b]} f(x) \in (a, b)$.
Idea: replace $\max_x M_t(x)$ with $\int M_t(x) h(x) dx$ with $h$ appropriate:
1. $\int M_t(x) h(x) dx \approx \max_x M_t(x)$ (in a way);
2. $\mathbb{E}\left[ \int M_t(x) h(x) dx \right] = \int \mathbb{E}[M_t(x)] h(x) dx \le 1$.
Choose $h(x)$ as the density of $\mathcal{N}(0, H^{-1})$ for some $H \succ 0$.

42 Step 2: Finishing
$\int M_t(x) h(x) dx = \left( \frac{\det(H)}{\det(H + V_t)} \right)^{1/2} \exp\left( \frac12 \|S_t\|^2_{(H + V_t)^{-1}} \right)$.
Choose $H = \lambda I$. Then, with probability $1 - e^{-u}$,
$\frac12 \|S_t\|^2_{V_t(\lambda)^{-1}} < u + \frac12 \log \frac{\det(V_t(\lambda))}{\lambda^d}$  (**)
and from $\hat\theta_t - \theta_* = V_t^{-1}(\lambda) S_t - \lambda V_t^{-1}(\lambda) \theta_*$,
$\|\hat\theta_t - \theta_*\|_{V_t(\lambda)} \le \|V_t^{-1}(\lambda) S_t\|_{V_t(\lambda)} + \lambda \|V_t^{-1}(\lambda) \theta_*\|_{V_t(\lambda)} \le \|S_t\|_{V_t^{-1}(\lambda)} + \lambda^{1/2} \|\theta_*\|$.

43 Confidence Ellipsoid for Sequential Design
Assumptions: $\|\theta_*\|_2 \le S$, and let $(A_s)_s$, $(\eta_s)_s$ be such that for any $1 \le s \le t$, $\eta_s \mid \mathcal{F}_{s-1} \sim \mathrm{subG}(1)$, where $\mathcal{F}_s = \sigma(A_1, \eta_1, \dots, A_{s-1}, \eta_{s-1}, A_s)$.
Fix $\delta \in (0, 1)$. Let
$\sqrt{\beta_{t+1}} = \sqrt{\lambda} S + \sqrt{2 \log\frac{1}{\delta} + \log \frac{\det V_t(\lambda)}{\lambda^d}}$,
and $\mathcal{C}_{t+1} = \{ \theta \in \mathbb{R}^d : \|\hat\theta_t - \theta\|_{V_t(\lambda)} \le \sqrt{\beta_{t+1}} \}$.
Theorem. $\mathcal{C}_{t+1}$ is a confidence set for $\theta_*$ at level $1 - \delta$: $P(\theta_* \in \mathcal{C}_{t+1}) \ge 1 - \delta$.
Note: $\beta_{t+1}$ is a function of $(A_s)_{s \le t}$.
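A sketch of the resulting confidence radius in code (assuming V is the regularized Gram matrix $V_t(\lambda)$ as defined above; the function name is illustrative):

```python
import numpy as np

def confidence_radius(V, lam, S, delta):
    """sqrt(beta_{t+1}) = sqrt(lam)*S + sqrt(2*log(1/delta) + log(det V / lam^d))."""
    d = V.shape[0]
    logdet_ratio = np.linalg.slogdet(V)[1] - d * np.log(lam)
    return np.sqrt(lam) * S + np.sqrt(2 * np.log(1 / delta) + logdet_ratio)
```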

44 Freedman's Stopping Trick
We want $\theta_* \in \mathcal{C}_t$ to hold with probability $1 - \delta$ simultaneously for all $1 \le t \le n$. Can we avoid a union bound over time?

45 Freedman's Stopping Trick: II
Let $E_t = \left\{ \|\hat\theta_{t-1} - \theta_*\|_{V_{t-1}(\lambda)} \ge \sqrt{\lambda} S + \sqrt{2u + \log \frac{\det V_{t-1}(\lambda)}{\lambda^d}} \right\}$, $t \in [n]$.
Define $\tau \in [n]$ as the smallest round index $t \in [n]$ such that $E_t$ holds, or $n$ when none of $E_1, \dots, E_n$ hold.
Note: because $\hat\theta_{t-1}$ and $V_{t-1}(\lambda)$ are functions of $H_{t-1} = (A_1, \eta_1, \dots, A_{t-1}, \eta_{t-1})$, whether $E_t$ holds can be decided based on $H_{t-1}$.
$\Rightarrow$ $\tau$ is an $(H_t)_t$ stopping time $\Rightarrow$ $\mathbb{E}[M_\tau(x)] \le 1$ and also $\mathbb{E}\left[ \int M_\tau(x) h(x) dx \right] \le 1$, and thus
$P\left( \cup_{t \in [n]} E_t \right) \le P(E_\tau) \le e^{-u}$.
Corollary. $P(\exists t \ge 0 \text{ such that } \theta_* \notin \mathcal{C}_{t+1}) \le \delta$.

46 Historical Remarks
The presentation mostly follows Abbasi-Yadkori et al. (2011).
Auer (2002) and Chu et al. (2011) avoided the need to construct ellipsoidal confidence sets.
Previous ellipsoidal constructions by Dani et al. (2008) and Rusmevichientong and Tsitsiklis (2010) used covering arguments. Laplace's method yields a substantially tighter ellipsoid than these covering-argument constructions.
Laplace's method is also called the method of mixtures (Peña et al., 2008); its use goes back to the work of Robbins and Siegmund in the 1970s (Robbins and Siegmund, 1970, 1971).
Freedman's stopping trick is from Freedman (1975).

47 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

48 Regret for LinUCB: Final Steps
Previously we have seen that, for $\beta_t \ge 1$ nondecreasing, LinUCB with $V_0 = \lambda I$ satisfies, w.p. $1 - \delta$,
$\hat R^{\mathrm{pseudo}}_n \le \sqrt{8 n \beta_n \log \frac{\det V_n}{\det V_0}} \le \sqrt{8 d n \beta_n \log \frac{\lambda d + n L^2}{\lambda d}}$.
Now,
$\sqrt{\beta_n} = \sqrt{\lambda} S + \sqrt{2 \log\frac{1}{\delta} + \log \frac{\det V_{n-1}(\lambda)}{\lambda^d}} \le \sqrt{\lambda} S + \sqrt{2 \log\frac{1}{\delta} + d \log \frac{\lambda d + n L^2}{\lambda d}}$,
from $\|A_s\| \le L$ and $\log \det V \le d \log(\mathrm{trace}(V)/d)$. Hence
$\hat R^{\mathrm{pseudo}}_n \le C_1 d \sqrt{n} \log(n) + C_2 \sqrt{n d \log(1/\delta)} + C_3$.
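To get a feel for the constants, here is a small sketch (illustrative inputs, names assumed) that plugs $\sqrt{\beta_n}$ into the generic bound from slide 21:

```python
import numpy as np

def linucb_regret_bound(n, d, L=1.0, S=1.0, lam=1.0, delta=0.01):
    # d * log((lam*d + n*L^2) / (lam*d)) bounds log(det V_n / det V_0)
    log_vol = d * np.log((lam * d + n * L**2) / (lam * d))
    sqrt_beta_n = np.sqrt(lam) * S + np.sqrt(2 * np.log(1 / delta) + log_vol)
    return np.sqrt(8 * n * sqrt_beta_n**2 * log_vol)

print(linucb_regret_bound(n=10_000, d=5))   # grows roughly like d * sqrt(n) * log(n)
```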

49 Summary
$\hat R^{\mathrm{pseudo}}_n \le C_1 d \sqrt{n} \log(n) + C_2 \sqrt{n d \log(1/\delta)} + C_3$.
Optimism, confidence ellipsoid.
Getting the ellipsoid is tricky because the bandit algorithm makes $(A_s)_s$ and $(\eta_s)_s$ interdependent: sequential design.
Hsu et al. (2012): random design vs. fixed design.
Kernelization: one can directly kernelize the proof presented here; see Abbasi-Yadkori (2012). Gaussian process bandits effectively do the same (Srinivas et al., 2010).
This presentation followed mostly Abbasi-Yadkori et al. (2011); Abbasi-Yadkori (2012).

50 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

51 Challenge!
The previous bound is $\tilde O(d \sqrt{n})$ even for $\mathcal{A} = \{e_1, \dots, e_d\}$, i.e. finite-armed stochastic bandits. Can we do better?

52 Setting
1. Fixed finite action set: the set of actions available in round $t$ is $\mathcal{A} \subset \mathbb{R}^d$ with $|\mathcal{A}| = K$ for some natural number $K$.
2. Subgaussian rewards: the reward is $X_t = \langle \theta_*, A_t \rangle + \eta_t$ where $\eta_t$ is conditionally 1-subgaussian: $\eta_t \mid \mathcal{F}_{t-1} \sim \mathrm{subG}(1)$, where $\mathcal{F}_t = \sigma(A_1, \eta_1, \dots, A_{t-1}, \eta_{t-1}, A_t)$.
3. Bounded mean rewards: $\Delta_a = \max_{b \in \mathcal{A}} \langle \theta_*, b - a \rangle \le 1$ for all $a \in \mathcal{A}$.
Key difference to the previous setting: finite, fixed action set.

53 Avoiding Sequential Designs
Recall the result for fixed design: for $x \in \mathbb{R}^d$ fixed, with probability $1 - \delta$,
$\langle x, \hat\theta_t - \theta_* \rangle < \sqrt{2 \|x\|^2_{V_t^{-1}} \log(1/\delta)}$.  (*)
Goal: use this result! How?
Idea: use a phased elimination algorithm! $\mathcal{A} = \mathcal{A}_1 \supseteq \mathcal{A}_2 \supseteq \cdots$
In phase $\ell$, use actions in $\mathcal{A}_\ell$ to collect enough data to ensure that, by the end of the phase, the data collected in the phase is sufficient to rule out all $\varepsilon_\ell = 2^{-\ell}$-suboptimal actions.
Which actions to use, and how many times, in phase $\ell$?

54 How to Collect Data?
Recall: if $V_t = \sum_{s=1}^t a_s a_s^\top$, then for any fixed $x \in \mathbb{R}^d$,
$\langle x, \hat\theta_t - \theta_* \rangle < \sqrt{2 \|x\|^2_{V_t^{-1}} \log(1/\delta)}$.  (*)
If we need to know whether $\langle x, \hat\theta_t - \theta_* \rangle \le 2^{-\ell}$ for all $x \in \mathcal{A}$, we had better choose $a_1, a_2, \dots, a_t$ so that $\max_{x \in \mathcal{A}} \|x\|^2_{V_t^{-1}}$ is minimized (and make $t$ big enough).
Experimental design: minimizing this quantity is known as the G-optimal design problem.

55 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

56 G-optimal Design
Let $\pi : \mathcal{A} \to [0, 1]$ be a distribution on $\mathcal{A}$: $\sum_{a \in \mathcal{A}} \pi(a) = 1$. Define
$V(\pi) = \sum_{a \in \mathcal{A}} \pi(a) \, a a^\top$ and $g(\pi) = \max_{a \in \mathcal{A}} \|a\|^2_{V(\pi)^{-1}}$.
G-optimal design $\pi^*$: $g(\pi^*) = \min_\pi g(\pi)$.
How to use this?
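A (near-)G-optimal design can be computed numerically, for instance with the classical Frank-Wolfe / Fedorov-Wynn iteration for D-optimal design, which by the Kiefer-Wolfowitz theorem (slide 58) solves the same problem. A sketch (not from the slides; all names are illustrative):

```python
import numpy as np

def g_optimal_design(A, iters=1000):
    """A: (K, d) array of actions spanning R^d. Returns (pi, g(pi))."""
    K, d = A.shape
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        V = A.T @ (A * pi[:, None])                  # V(pi) = sum_a pi(a) a a^T
        norms = np.einsum('ki,ij,kj->k', A, np.linalg.inv(V), A)  # ||a||^2_{V^{-1}}
        k = int(np.argmax(norms))                    # most "uncertain" action
        gamma = (norms[k] / d - 1) / (norms[k] - 1)  # optimal line-search step
        pi = (1 - gamma) * pi
        pi[k] += gamma
    V = A.T @ (A * pi[:, None])
    return pi, float(np.max(np.einsum('ki,ij,kj->k', A, np.linalg.inv(V), A)))
```

At the optimum, $g(\pi)$ returned by this routine should be close to $d$, in line with the Kiefer-Wolfowitz theorem.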

57 Using a Design π
Given a design $\pi$, for $a \in \mathrm{Supp}(\pi)$ set
$n_a = \left\lceil \frac{\pi(a) \, g(\pi)}{\varepsilon^2} \log\frac{1}{\delta} \right\rceil$.
Choose each action $a \in \mathrm{Supp}(\pi)$ exactly $n_a$ times. Then
$V = \sum_{a \in \mathrm{Supp}(\pi)} n_a \, a a^\top \succeq \frac{g(\pi)}{\varepsilon^2} \log\left(\frac{1}{\delta}\right) V(\pi)$,
and so, for any $a \in \mathcal{A}$, w.p. $1 - \delta$,
$\langle \hat\theta - \theta_*, a \rangle \le \sqrt{\|a\|^2_{V^{-1}} \log\frac{1}{\delta}} \le \varepsilon$.
How big is $n$?
$n = \sum_{a \in \mathrm{Supp}(\pi)} n_a = \sum_{a \in \mathrm{Supp}(\pi)} \left\lceil \frac{\pi(a) \, g(\pi)}{\varepsilon^2} \log\frac{1}{\delta} \right\rceil \le |\mathrm{Supp}(\pi)| + \frac{g(\pi)}{\varepsilon^2} \log\frac{1}{\delta}$.

58 Bounding g(π) and |Supp(π)|
Theorem (Kiefer-Wolfowitz). The following are equivalent:
1. $\pi^*$ is a minimizer of $g$.
2. $\pi^*$ is a maximizer of $f(\pi) = \log \det V(\pi)$.
3. $g(\pi^*) = d$.
Note: designs maximizing $f$ are known as D-optimal designs. Kiefer-Wolfowitz says that G-optimality is the same as D-optimality.
Combining this with John's theorem for minimum-volume enclosing ellipsoids (John, 1948), we get $|\mathrm{Supp}(\pi^*)| \le d(d+3)/2$.

59 Outline 1 From Contextual to Linear Bandits On the Choice of Features 2 Stochastic Linear Bandits Setting Optimism and LinUCB Generic Regret Analysis Miscellaneous Remarks 3 Confidence Bounds for Least-Squares Estimators Least Squares: Recap Fixed Design Sequential Design Completing the Regret Bound 4 Improved Regret for Fixed, Finite Action Sets Challenge! G-Optimal Designs The PEGOE Algorithm

60 PEGOE¹ Algorithm
Input: $\mathcal{A} \subset \mathbb{R}^d$ and $\delta$. Set $\mathcal{A}_1 = \mathcal{A}$, $\ell = 1$, $t = 1$.
1. Let $t_\ell = t$ (current round). Find a G-optimal design $\pi_\ell : \mathcal{A}_\ell \to [0, 1]$ that maximizes $\log \det V(\pi_\ell)$ subject to $\sum_{a \in \mathcal{A}_\ell} \pi_\ell(a) = 1$.
2. Let $\varepsilon_\ell = 2^{-\ell}$ and
$N_\ell(a) = \left\lceil \frac{2 \pi_\ell(a)}{\varepsilon_\ell^2} \log \frac{K \ell (\ell+1)}{\delta} \right\rceil$ and $N_\ell = \sum_{a \in \mathcal{A}_\ell} N_\ell(a)$.
3. Choose each action $a \in \mathcal{A}_\ell$ exactly $N_\ell(a)$ times.
4. Calculate the estimate: $\hat\theta_\ell = V_\ell^{-1} \sum_{t = t_\ell}^{t_\ell + N_\ell} A_t X_t$ (with $V_\ell = \sum_{a \in \mathcal{A}_\ell} N_\ell(a) \, a a^\top$).
5. Eliminate poor arms:
$\mathcal{A}_{\ell+1} = \left\{ a \in \mathcal{A}_\ell : \max_{b \in \mathcal{A}_\ell} \langle \hat\theta_\ell, b - a \rangle \le 2 \varepsilon_\ell \right\}$.
6. Increase $\ell$ by one and go to step 1.
¹ Phased Elimination with G-Optimal Exploration.
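A compact sketch of one PEGOE phase (illustrative only; it reuses the g_optimal_design sketch from earlier and the counts formula displayed above):

```python
import numpy as np

def pegoe_phase(A, ell, K, delta, theta_star, rng):
    """A: (m, d) array of surviving actions; K = |A_1|. Returns the next active set."""
    eps = 2.0 ** (-ell)
    pi, _ = g_optimal_design(A)
    counts = np.ceil(2 * pi / eps**2 * np.log(K * ell * (ell + 1) / delta)).astype(int)
    d = A.shape[1]
    V, b = np.zeros((d, d)), np.zeros(d)
    for a, n_a in zip(A, counts):
        for _ in range(n_a):                      # play action a exactly N_l(a) times
            x = a @ theta_star + rng.normal()     # observe reward
            V += np.outer(a, a)
            b += x * a
    theta_hat = np.linalg.lstsq(V, b, rcond=None)[0]
    gaps = (A @ theta_hat).max() - A @ theta_hat
    return A[gaps <= 2 * eps]                     # keep only near-optimal arms
```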

61 The Regret of PEGOE
Theorem. With probability at least $1 - \delta$, the pseudo-regret of PEGOE is at most
$\hat R^{\mathrm{pseudo}}_n \le C \sqrt{n d \log\left( \frac{K \log(n)}{\delta} \right)}$,
where $C > 0$ is a universal constant. If $\delta = O(1/n)$, then $\mathbb{E}[R_n] \le C \sqrt{n d \log(K n)}$ for an appropriately chosen universal constant $C > 0$.

62 Summary and Historical Remarks
Phased exploration allows one to use methods developed for the fixed-design setting.
PEGOE: exploration tuned to maximize information gain. Finding a $(1+\varepsilon)$-optimal design is sufficient; this is a convex problem.
This algorithm and analysis, in this form, are new. Phased elimination is well known: Even-Dar et al. (2006) (pure exploration), Auer and Ortner (2010) (finite-armed bandits), Valko et al. (2014) (linear bandits, spanners instead of G-optimality).
Finite, but changing action set: PEGOE cannot be applied! SupLinRel and SupLinUCB get the same bound (Auer, 2002; Chu et al., 2011). Sadly, these algorithms are very conservative.

63 Outline 5 Sparse Stochastic Linear Bandits Warmup (Hypercube) LinUCB with Sparsity Confidence Sets & Online Linear Regression Summary 6 Minimax Regret A Minimax Lower Bound 7 Asymptopia Lower Bound What About Optimism? 8 Summary

64 General Setting
1. Sparse parameter: there exist known constants $M_0$ and $M_2$ such that $\|\theta_*\|_0 \le M_0$ and $\|\theta_*\|_2 \le M_2$.
2. Bounded mean rewards: $|\langle a, \theta_* \rangle| \le 1$ for all $a \in \mathcal{A}_t$ and all rounds $t$.
3. Subgaussian noise: the reward is $X_t = \langle A_t, \theta_* \rangle + \eta_t$ where $\eta_t \mid \mathcal{F}_{t-1} \sim \mathrm{subG}(1)$ for $\mathcal{F}_t = \sigma(A_1, \eta_1, \dots, A_t, \eta_t)$.

65 The Case of the Hypercube
$\mathcal{A} = [-1, 1]^d$, $\theta := \theta_*$, $X_t = \langle A_t, \theta \rangle + \eta_t$.
Assumptions:
1. Bounded mean rewards: $\|\theta\|_1 \le 1$, which ensures that $|\langle a, \theta \rangle| \le 1$ for all $a \in \mathcal{A}$.
2. Subgaussian noise: the reward is $X_t = \langle A_t, \theta \rangle + \eta_t$ where $\eta_t \mid \mathcal{F}_{t-1} \sim \mathrm{subG}(1)$ for $\mathcal{F}_t = \sigma(A_1, \eta_1, \dots, A_t, \eta_t)$.

66 Selective Explore-Then-Commit (SETC)
Recall: $\theta = \theta_*$. For any $i \in [d]$ such that $A_{ti}$ is randomized:
$A_{ti} \left( A_t^\top \theta + \eta_t \right) = \theta_i + \underbrace{A_{ti} \sum_{j \ne i} A_{tj} \theta_j + A_{ti} \eta_t}_{\text{noise}}$.

67 Regret of SETC
Theorem. There exists a universal constant $C > 0$ such that the regret of SETC satisfies
$R_n \le 2 \|\theta\|_1 + C \sum_{i : \theta_i \ne 0} \frac{\log(n)}{|\theta_i|}$.
Furthermore, $R_n \le C \sqrt{\|\theta\|_0 \, n \log(n)}$.
SETC adapts to $\|\theta\|_0$!

68 Outline 5 Sparse Stochastic Linear Bandits Warmup (Hypercube) LinUCB with Sparsity Confidence Sets & Online Linear Regression Summary 6 Minimax Regret A Minimax Lower Bound 7 Asymptopia Lower Bound What About Optimism? 8 Summary

69 (General) LinUCB: Recap
GLinUCB: choose $\mathcal{C}_t \subset \mathbb{R}^d$ and let
$A_t = \mathrm{argmax}_{a \in \mathcal{A}} \max_{\theta \in \mathcal{C}_t} \langle a, \theta \rangle$.
The previous choice leads to regret $\tilde O(d \sqrt{n})$.
How to choose $\mathcal{C}_t$, knowing that $\|\theta_*\|_0 \le p$, so that the regret gets smaller?

70 Outline 5 Sparse Stochastic Linear Bandits Warmup (Hypercube) LinUCB with Sparsity Confidence Sets & Online Linear Regression Summary 6 Minimax Regret A Minimax Lower Bound 7 Asymptopia Lower Bound What About Optimism? 8 Summary

71 Online Linear Regression (OLR)
Learner-environment interaction:
1. The environment chooses $X_t \in \mathbb{R}$ and $A_t \in \mathbb{R}^d$ in an arbitrary fashion.
2. The value of $A_t$ is revealed to the learner (but not $X_t$).
3. The learner produces a real-valued prediction $\hat X_t$ in some way.
4. The environment reveals $X_t$ to the learner and the loss is $(X_t - \hat X_t)^2$.
Goal: compete with the total loss of the best linear predictor in some set $\Theta \subseteq \mathbb{R}^d$. Regret against $\theta \in \Theta$:
$\rho_n(\theta) = \sum_{t=1}^n (X_t - \hat X_t)^2 - \sum_{t=1}^n (X_t - \langle A_t, \theta \rangle)^2$.

72 From OLR to Confidence Sets
Let $L$ be a learner that enjoys a regret guarantee $B_n = B_n(A_1, X_1, \dots, A_n, X_n)$ relative to $\Theta$: for any strategy of the environment,
$\sup_{\theta \in \Theta} \rho_n(\theta) \le B_n$.
Combine
$\rho_n(\theta) = \sum_{t=1}^n (X_t - \hat X_t)^2 - \sum_{t=1}^n (X_t - \langle A_t, \theta \rangle)^2$
and $X_t = \langle A_t, \theta_* \rangle + \eta_t$ to get
$Q_t := \sum_{s=1}^t (\hat X_s - \langle A_s, \theta_* \rangle)^2 = \rho_t(\theta_*) + 2 \sum_{s=1}^t \eta_s (\hat X_s - \langle A_s, \theta_* \rangle) \le B_t + 2 \sum_{s=1}^t \eta_s (\hat X_s - \langle A_s, \theta_* \rangle)$.

73 From OLR to Confidence Sets: II
$Q_t \le B_t + 2 Z_t$, where $Z_t = \sum_{s=1}^t \eta_s (\hat X_s - \langle A_s, \theta_* \rangle)$.  (*)
Goal: bound $Z_t$ for all $t \ge 0$.
$\hat X_t$, chosen by the OLR learner $L$, is $\mathcal{F}_{t-1}$-measurable, so $(Z_t - Z_{t-1}) \mid \mathcal{F}_{t-1} \sim \mathrm{subG}(\sigma_t)$, where $\sigma_t^2 = (\hat X_t - \langle A_t, \theta_* \rangle)^2$.
Previous self-normalized bound (**): with probability $1 - \delta$, for all $t = 0, 1, \dots$,
$Z_t < \sqrt{(1 + Q_t) \log \frac{1 + Q_t}{\delta}}$.
Combining with (*) and solving for $Q_t$: $Q_t \le \beta_t(\delta)$, where $\beta_t(\delta) = 1 + 2 B_t + 32 \log \frac{\sqrt{8} + \sqrt{1 + B_t}}{\delta}$.

74 OLR to Confidence Sets: III
Theorem. Let $\delta \in (0, 1)$ and assume that $\theta_* \in \Theta$ and $\sup_{\theta \in \Theta} \rho_t(\theta) \le B_t$. If
$\mathcal{C}_{t+1} = \left\{ \theta \in \mathbb{R}^d : \|\theta\|_2 \le M_2, \ \sum_{s=1}^t (\hat X_s - \langle A_s, \theta \rangle)^2 \le \beta_t(\delta) \right\}$,
then $P(\text{there exists } t \in \mathbb{N} \text{ such that } \theta_* \notin \mathcal{C}_{t+1}) \le \delta$.

75 Sparse LinUCB
1: Input: OLR learner $L$, regret bound $B_t$, confidence parameter $\delta \in (0, 1)$
2: for $t = 1, \dots, n$:
3:   Receive action set $\mathcal{A}_t$
4:   Compute confidence set: $\mathcal{C}_t = \left\{ \theta \in \mathbb{R}^d : \|\theta\|_2 \le M_2, \ \sum_{s=1}^{t-1} (\hat X_s - \langle A_s, \theta \rangle)^2 \le \beta_{t-1}(\delta) \right\}$
5:   Calculate the optimistic action $A_t = \mathrm{argmax}_{a \in \mathcal{A}_t} \max_{\theta \in \mathcal{C}_t} \langle a, \theta \rangle$
6:   Feed $A_t$ to $L$ and obtain prediction $\hat X_t$
7:   Play $A_t$ and receive reward $X_t$
8:   Feed $X_t$ to $L$ as feedback

76 Regret of OLR-UCB
Theorem. With probability at least $1 - \delta$, the pseudo-regret of OLR-UCB satisfies
$\hat R^{\mathrm{pseudo}}_n \le \sqrt{8 d n \left( M_2^2 + \beta_{n-1}(\delta) \right) \log\left( 1 + \frac{n}{d} \right)}$.

77 The Regret of OLR-UCB(π)
Theorem (Sparse OLR Algorithm). There is a strategy $\pi$ for the learner such that, for any $\theta \in \mathbb{R}^d$, the regret $\rho_n(\theta)$ of $\pi$ against any strategic environment with $\max_{t \in [n]} \|A_t\|_2 \le L$ and $\max_{t \in [n]} |X_t| \le X$ satisfies
$\rho_n(\theta) \le c X^2 \|\theta\|_0 \left\{ \log(e + n^{1/2} L) + C_n \log(1 + \|\theta\|_1) \right\} + (1 + X^2) C_n$,
where $c > 0$ is a universal constant and $C_n = 2 + \log_2 \log(e + n^{1/2} L)$.
Corollary. The expected regret of OLR-UCB when using the strategy $\pi$ from above satisfies $R_n = \tilde O(\sqrt{d p n})$.

78 Outline 5 Sparse Stochastic Linear Bandits Warmup (Hypercube) LinUCB with Sparsity Confidence Sets & Online Linear Regression Summary 6 Minimax Regret A Minimax Lower Bound 7 Asymptopia Lower Bound What About Optimism? 8 Summary

79 Summary
The OLR algorithm is used inside OLR-UCB to construct the center of the confidence set; the regret guarantee of the OLR learner controls the width of the confidence ellipsoid.
Regret: $\tilde O(\sqrt{d p n})$, with $p$ known.
Hypercube: $\tilde O(\sqrt{p n})$, with $p$ unknown!
In general, the regret can be as high as $\Omega(\sqrt{p d n})$ ($p = 1$: think of $\mathcal{A} = \{e_1, \dots, e_d\}$).
Under parameter noise ($X_t = \langle A_t, \theta_* + \eta_t \rangle$), for rounded action sets, $\tilde O(p \sqrt{n})$ is possible!
Very much unlike the passive case: a major conflict between exploration and exploitation!

80 Historical Notes
The Selective Explore-Then-Commit algorithm is due to Lattimore et al. (2015).
OLR-UCB is from Abbasi-Yadkori et al. (2012). The Sparse OLR algorithm is due to Gerchinovitz (2013).
Rakhlin and Sridharan (2015) also discuss the relationship between online learning regret bounds and self-normalized tail bounds of the type given here.

81 Outline 5 Sparse Stochastic Linear Bandits Warmup (Hypercube) LinUCB with Sparsity Confidence Sets & Online Linear Regression Summary 6 Minimax Regret A Minimax Lower Bound 7 Asymptopia Lower Bound What About Optimism? 8 Summary

82 Minimax Lower Bound
Theorem. Let the action set be $\mathcal{A} = \{-1, 1\}^d$ and $\Theta = \{-n^{-1/2}, n^{-1/2}\}^d$. Then, for any policy $\pi$, there exists a $\theta \in \Theta$ such that
$R^\pi_n(\mathcal{A}, \theta) \ge C d \sqrt{n}$
for some universal constant $C > 0$.

83 Some Thoughts
LinUCB with our confidence set construction is nearly worst-case optimal.
The theorem is new, but the proof is standard; see Shamir (2015).
Similar results hold for some other action sets: Rusmevichientong and Tsitsiklis (2010) ($\ell_2$-ball), Dani et al. (2008) (products of 2D balls).
Some action sets have smaller minimax regret! Can you think of one?

84 Outline 5 Sparse Stochastic Linear Bandits Warmup (Hypercube) LinUCB with Sparsity Confidence Sets & Online Linear Regression Summary 6 Minimax Regret A Minimax Lower Bound 7 Asymptopia Lower Bound What About Optimism? 8 Summary

85 Lower Bound
Setting:
1. Actions: $\mathcal{A} \subset \mathbb{R}^d$ finite, $K = |\mathcal{A}|$.
2. Reward: $X_t = \langle A_t, \theta \rangle + \eta_t$, where $\theta \in \mathbb{R}^d$ and $(\eta_t)_t$ is a sequence of independent standard Gaussian variables.
Regret of policy $\pi$: $R^\pi_n(\mathcal{A}, \theta) = \mathbb{E}_{\theta,\pi}\left[ \sum_{t=1}^n \Delta_{A_t} \right]$, where $\Delta_a = \max_{a' \in \mathcal{A}} \langle a' - a, \theta \rangle$.
Recall: a policy $\pi$ is consistent in some class of bandits $\mathcal{E}$ if the regret is subpolynomial for any bandit in that class: $R^\pi_n(\mathcal{A}, \theta) = o(n^p)$ for all $p > 0$ and $\theta \in \mathbb{R}^d$.

86 Lower Bound: II
Theorem. Assume that $\mathcal{A} \subset \mathbb{R}^d$ is finite and spans $\mathbb{R}^d$, and suppose $\pi$ is consistent. Let $\theta \in \mathbb{R}^d$ be any parameter such that there is a unique optimal action, and let $\bar G_n = \mathbb{E}_{\theta,\pi}\left[ \sum_{t=1}^n A_t A_t^\top \right]$ be the expected Gram matrix. Then
$\liminf_{n \to \infty} \lambda_{\min}(\bar G_n) / \log(n) > 0$.
Furthermore, for any $a \in \mathcal{A}$ it holds that
$\limsup_{n \to \infty} \ \log(n) \, \|a\|^2_{\bar G_n^{-1}} \le \frac{\Delta_a^2}{2}$.

87 Lower Bound: III
Corollary. Let $\mathcal{A} \subset \mathbb{R}^d$ be a finite set that spans $\mathbb{R}^d$ and $\theta \in \mathbb{R}^d$ be such that there is a unique optimal action. Then, for any consistent policy $\pi$,
$\liminf_{n \to \infty} \frac{R^\pi_n(\mathcal{A}, \theta)}{\log(n)} \ge c(\mathcal{A}, \theta)$,
where $c(\mathcal{A}, \theta)$ is defined as
$c(\mathcal{A}, \theta) = \inf_{\alpha \in [0, \infty)^{\mathcal{A}}} \ \sum_{a \in \mathcal{A}} \alpha(a) \Delta_a$
subject to $\|a\|^2_{H^{-1}} \le \frac{\Delta_a^2}{2}$ for all $a \in \mathcal{A}$ with $\Delta_a > 0$, where $H = \sum_{a \in \mathcal{A}} \alpha(a) \, a a^\top$.

88 Outline 5 Sparse Stochastic Linear Bandits Warmup (Hypercube) LinUCB with Sparsity Confidence Sets & Online Linear Regression Summary 6 Minimax Regret A Minimax Lower Bound 7 Asymptopia Lower Bound What About Optimism? 8 Summary

89 Poor Outlook for Optimism

90 Poor Outlook for Optimism
Actions: $\mathcal{A} = \{a_1, a_2, a_3\}$, $a_1 = e_1$, $a_2 = e_2$, $a_3 = (1 - \varepsilon, \gamma \varepsilon)$, with $\varepsilon > 0$ small and $\gamma \ge 1$.
Let $\theta = (1, 0)$, so $a^* = a_1$.
Solving for the lower bound, $\alpha(a_2) = 2\gamma^2$ and $\alpha(a_3) = 0$, so $c(\mathcal{A}, \theta) = 2\gamma^2$ and
$\liminf_{n \to \infty} \frac{R^\pi_n(\mathcal{A}, \theta)}{\log(n)} = 2\gamma^2$.
Moreover, for $\gamma$ large and $\varepsilon$ sufficiently small, for any optimistic policy $\pi$,
$\limsup_{n \to \infty} \frac{R^\pi_n(\mathcal{A}, \theta)}{\log(n)} = \Omega(1/\varepsilon)$.

91 Instance-Optimal Asymptotic Regret
Theorem. There exists a policy $\pi$ that is consistent and satisfies
$\limsup_{n \to \infty} \frac{R^\pi_n(\mathcal{A}, \theta)}{\log(n)} = c(\mathcal{A}, \theta)$,
where $c(\mathcal{A}, \theta)$ was defined in the lower bound.

92 Illustration: LinUCB

93 Summary
The instance-optimal regret of consistent algorithms is asymptotically $c(\mathcal{A}, \theta) \log(n)$.
Optimistic algorithms fail to achieve this: their regret can be worse by an arbitrarily large constant factor.
Remember: finite-armed bandits.
Case (a): $\mathcal{A}_t$ always contains the same number of vectors: finite-armed stochastic contextual bandit.
Case (b): In addition, $\mathcal{A}_t$ does not change, i.e. $\mathcal{A}_t = \{a_1, \dots, a_K\}$: finite-armed stochastic linear bandit.
Case (c): If the vectors in $\mathcal{A}_t$ are also orthogonal to each other: finite-armed stochastic bandit.
Difference between cases (c) and (b):
Case (c): learn about the mean of arm $i$ $\Leftrightarrow$ choose action $i$;
Case (b): learn about the mean of arm $i$ $\Leftrightarrow$ choose any action $j$ with $\langle x_j, x_i \rangle \ne 0$.

94 Departing Thoughts These results are from Lattimore and Szepesvári (2016) The asymptotically optimal algorithm is given there (the algorithm solves for the optimal allocation, while monitoring whether things went wrong) Combes et al. (2017) refine the algorithm and generalize it to other settings. Soare et al. (2014), in best arm identification with linear payoff functions, gave essentially the same example that we use to argue for the large regret of optimistic algorithms. Open questions: Simultaneously finite-time near-optimal and asymptotically optimal algorithm Changing, or infinite action sets?

95 Summary

96 Summary of This Talk Contextual vs. linear bandits: Changing action sets can model contextual bandits Optimistic algorithms: Optimism can achieve minimax optimality Optimism can be expensive Optimistic algorithms require a careful design of the underlying confidence sets Sparsity: Exploiting sparsity is sometimes at odds with the requirement to collect rewards

97 What s Next for Bandits? Today: Finite-armed and linear stochastic bandits. But bandits come in all forms and shapes! Adversarial (finite, linear,... ) Combinatorial action sets: From shortest path to ranking Continuous action sets, continuous time, delays Resourceful, nonstationary, various structures (low-rank),... Nearby problems: Reinforcement learning/markov decision processes Partial monitoring

98 Learning Material
Bandit Visualizer / online bandit simulator:
Most of this tutorial (and more):
Book to be published by early next year: looking for reviewers!
Tor's lightweight C++ bandit library
Sébastien Bubeck's tutorial; blog post 1; blog post 2
Bubeck and Cesa-Bianchi's book (Bubeck and Cesa-Bianchi, 2012)
banditalgs.com

99 References I Abbasi-Yadkori, Y. (2009). Forced-exploration based algorithms for playing in bandits with large action sets. PhD thesis, University of Alberta. Abbasi-Yadkori, Y. (2012). Online Learning for Linearly Parametrized Control Problems. PhD thesis, University of Alberta. Abbasi-Yadkori, Y., Antos, A., and Szepesvári, C. (2009). Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback. Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. (2012). Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pages 1 9. Abbasi-Yadkori, Y., Szepesvári, C., and Tax, D. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NIPS), pages Abe, N. and Long, P. M. (1999). Associative reinforcement learning using linear probabilistic concepts. In ICML, pages Anantharam, V., Varaiya, P., and Walrand, J. (1987). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-part i: Iid rewards. IEEE Transactions on Automatic Control, 32(11): Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov): Auer, P. and Ortner, R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55 65.

100 References II Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers Incorporated. Chu, W., Li, L., Reyzin, L., and Schapire, R. E. (2011). Contextual bandits with linear payoff functions. In AISTATS, volume 15, pages Combes, R., Magureanu, S., and Proutiere, A. (2017). Minimal exploration in structured stochastic bandits. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages Curran Associates, Inc. Dani, V., Hayes, T. P., and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. In Proceedings of Conference on Learning Theory (COLT), pages Even-Dar, E., Mannor, S., and Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(Jun): Filippi, S., Cappé, O., Garivier, A., and Szepesvári, Cs. (2009). Parametric bandits: The generalized linear case. In NIPS-22, pages Freedman, D. (1975). On tail probabilities for martingales. The Annals of Probability, 3(1): Gerchinovitz, S. (2013). Sparsity regret bounds for individual sequences in online linear regression. Journal of Machine Learning Research, 14(Mar): Hsu, D., Kakade, S. M., and Zhang, T. (2012). Random design analysis of ridge regression. In Conference on Learning Theory, pages 9 1.

101 References III John, F. (1948). Extremum problems with inequalities as subsidiary conditions. Courant Anniversary Volume, Interscience. Lattimore, T., Crammer, K., and Szepesvári, C. (2015). Linear multi-resource allocation with semi-bandit feedback. In Advances in Neural Information Processing Systems, pages Lattimore, T. and Munos, R. (2014). Bounded regret for finite-armed structured bandits. In Advances in Neural Information Processing Systems, pages Lattimore, T. and Szepesvári, C. (2016). The end of optimism? an asymptotic analysis of finite-armed linear bandits. arxiv preprint arxiv: Peña, V. H., Lai, T. L., and Shao, Q.-M. (2008). Self-normalized processes: Limit theory and Statistical Applications. Springer Science & Business Media. Rakhlin, A. and Sridharan, K. (2015). On equivalence of martingale tail bounds and deterministic regret inequalities. arxiv preprint arxiv: Robbins, H. and Siegmund, D. (1970). Boundary crossing probabilities for the Wiener process and sample sums. Annals of Math. Statistics, 41: Robbins, H. and Siegmund, D. (1971). A convergence theorem for non-negative almost supermartingales and some applications. In Rustagi, J., editor, Optimizing Methods in Statistics, pages Academic Press, New York. Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35(2): Russo, D. and Roy, B. V. (2013). Eluder dimension and the sample complexity of optimistic exploration. In NIPS, pages

102 References IV Shamir, O. (2015). On the complexity of bandit linear optimization. In Conference on Learning Theory, pages Soare, M., Lazaric, A., and Munos, R. (2014). Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pages Srinivas, N., Krause, A., Kakade, S., and Seeger, M. W. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, pages Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. (2013). Finite-time analysis of kernelised contextual bandits. arxiv preprint arxiv: Valko, M., Munos, R., Kveton, B., and Kocák, T. (2014). Spectral bandits for smooth graph functions. In International Conference on Machine Learning, pages


More information

Dynamic resource allocation: Bandit problems and extensions

Dynamic resource allocation: Bandit problems and extensions Dynamic resource allocation: Bandit problems and extensions Aurélien Garivier Institut de Mathématiques de Toulouse MAD Seminar, Université Toulouse 1 October 3rd, 2014 The Bandit Model Roadmap 1 The Bandit

More information

Learning to play K-armed bandit problems

Learning to play K-armed bandit problems Learning to play K-armed bandit problems Francis Maes 1, Louis Wehenkel 1 and Damien Ernst 1 1 University of Liège Dept. of Electrical Engineering and Computer Science Institut Montefiore, B28, B-4000,

More information

Evaluation of multi armed bandit algorithms and empirical algorithm

Evaluation of multi armed bandit algorithms and empirical algorithm Acta Technica 62, No. 2B/2017, 639 656 c 2017 Institute of Thermomechanics CAS, v.v.i. Evaluation of multi armed bandit algorithms and empirical algorithm Zhang Hong 2,3, Cao Xiushan 1, Pu Qiumei 1,4 Abstract.

More information

Minimax Concave Penalized Multi-Armed Bandit Model with High-Dimensional Convariates

Minimax Concave Penalized Multi-Armed Bandit Model with High-Dimensional Convariates Minimax Concave Penalized Multi-Armed Bandit Model with High-Dimensional Convariates Xue Wang * 1 Mike Mingcheng Wei * 2 Tao Yao * 1 Abstract In this paper, we propose a Minimax Concave Penalized Multi-Armed

More information

Analysis of Thompson Sampling for the multi-armed bandit problem

Analysis of Thompson Sampling for the multi-armed bandit problem Analysis of Thompson Sampling for the multi-armed bandit problem Shipra Agrawal Microsoft Research India shipra@microsoft.com avin Goyal Microsoft Research India navingo@microsoft.com Abstract We show

More information

The No-Regret Framework for Online Learning

The No-Regret Framework for Online Learning The No-Regret Framework for Online Learning A Tutorial Introduction Nahum Shimkin Technion Israel Institute of Technology Haifa, Israel Stochastic Processes in Engineering IIT Mumbai, March 2013 N. Shimkin,

More information

Quantifying mismatch in Bayesian optimization

Quantifying mismatch in Bayesian optimization Quantifying mismatch in Bayesian optimization Eric Schulz University College London e.schulz@cs.ucl.ac.uk Maarten Speekenbrink University College London m.speekenbrink@ucl.ac.uk José Miguel Hernández-Lobato

More information

The multi armed-bandit problem

The multi armed-bandit problem The multi armed-bandit problem (with covariates if we have time) Vianney Perchet & Philippe Rigollet LPMA Université Paris Diderot ORFE Princeton University Algorithms and Dynamics for Games and Optimization

More information

Reward Maximization Under Uncertainty: Leveraging Side-Observations on Networks

Reward Maximization Under Uncertainty: Leveraging Side-Observations on Networks Reward Maximization Under Uncertainty: Leveraging Side-Observations Reward Maximization Under Uncertainty: Leveraging Side-Observations on Networks Swapna Buccapatnam AT&T Labs Research, Middletown, NJ

More information

Complex Bandit Problems and Thompson Sampling

Complex Bandit Problems and Thompson Sampling Complex Bandit Problems and Aditya Gopalan Department of Electrical Engineering Technion, Israel aditya@ee.technion.ac.il Shie Mannor Department of Electrical Engineering Technion, Israel shie@ee.technion.ac.il

More information

Exploiting Correlation in Finite-Armed Structured Bandits

Exploiting Correlation in Finite-Armed Structured Bandits Exploiting Correlation in Finite-Armed Structured Bandits Samarth Gupta Carnegie Mellon University Pittsburgh, PA 1513 Gauri Joshi Carnegie Mellon University Pittsburgh, PA 1513 Osman Yağan Carnegie Mellon

More information

COS 402 Machine Learning and Artificial Intelligence Fall Lecture 22. Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3

COS 402 Machine Learning and Artificial Intelligence Fall Lecture 22. Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3 COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 22 Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3 How to balance exploration and exploitation in reinforcement

More information

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon. Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,

More information

Introduction to Bandit Algorithms. Introduction to Bandit Algorithms

Introduction to Bandit Algorithms. Introduction to Bandit Algorithms Stochastic K-Arm Bandit Problem Formulation Consider K arms (actions) each correspond to an unknown distribution {ν k } K k=1 with values bounded in [0, 1]. At each time t, the agent pulls an arm I t {1,...,

More information

Bandit Convex Optimization: T Regret in One Dimension

Bandit Convex Optimization: T Regret in One Dimension Bandit Convex Optimization: T Regret in One Dimension arxiv:1502.06398v1 [cs.lg 23 Feb 2015 Sébastien Bubeck Microsoft Research sebubeck@microsoft.com Tomer Koren Technion tomerk@technion.ac.il February

More information

arxiv: v4 [cs.lg] 22 Jul 2014

arxiv: v4 [cs.lg] 22 Jul 2014 Learning to Optimize Via Information-Directed Sampling Daniel Russo and Benjamin Van Roy July 23, 2014 arxiv:1403.5556v4 cs.lg] 22 Jul 2014 Abstract We propose information-directed sampling a new algorithm

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Regional Multi-Armed Bandits

Regional Multi-Armed Bandits School of Information Science and Technology University of Science and Technology of China {wzy43, zrd7}@mail.ustc.edu.cn, congshen@ustc.edu.cn Abstract We consider a variant of the classic multiarmed

More information

Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning Peter Auer Ronald Ortner University of Leoben, Franz-Josef-Strasse 18, 8700 Leoben, Austria auer,rortner}@unileoben.ac.at Abstract

More information

arxiv: v1 [cs.lg] 15 Oct 2014

arxiv: v1 [cs.lg] 15 Oct 2014 THOMPSON SAMPLING WITH THE ONLINE BOOTSTRAP By Dean Eckles and Maurits Kaptein Facebook, Inc., and Radboud University, Nijmegen arxiv:141.49v1 [cs.lg] 15 Oct 214 Thompson sampling provides a solution to

More information

Mostly Exploration-Free Algorithms for Contextual Bandits

Mostly Exploration-Free Algorithms for Contextual Bandits Mostly Exploration-Free Algorithms for Contextual Bandits Hamsa Bastani IBM Thomas J. Watson Research and Wharton School, hamsab@wharton.upenn.edu Mohsen Bayati Stanford Graduate School of Business, bayati@stanford.edu

More information

Analysis of Greedy Algorithms

Analysis of Greedy Algorithms Analysis of Greedy Algorithms Jiahui Shen Florida State University Oct.26th Outline Introduction Regularity condition Analysis on orthogonal matching pursuit Analysis on forward-backward greedy algorithm

More information

Multi-Armed Bandit Formulations for Identification and Control

Multi-Armed Bandit Formulations for Identification and Control Multi-Armed Bandit Formulations for Identification and Control Cristian R. Rojas Joint work with Matías I. Müller and Alexandre Proutiere KTH Royal Institute of Technology, Sweden ERNSI, September 24-27,

More information

Lecture 19: UCB Algorithm and Adversarial Bandit Problem. Announcements Review on stochastic multi-armed bandit problem

Lecture 19: UCB Algorithm and Adversarial Bandit Problem. Announcements Review on stochastic multi-armed bandit problem Lecture 9: UCB Algorithm and Adversarial Bandit Problem EECS598: Prediction and Learning: It s Only a Game Fall 03 Lecture 9: UCB Algorithm and Adversarial Bandit Problem Prof. Jacob Abernethy Scribe:

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

arxiv: v1 [cs.lg] 6 Jun 2018

arxiv: v1 [cs.lg] 6 Jun 2018 Mitigating Bias in Adaptive Data Gathering via Differential Privacy Seth Neel Aaron Roth June 7, 2018 arxiv:1806.02329v1 [cs.lg] 6 Jun 2018 Abstract Data that is gathered adaptively via bandit algorithms,

More information

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for Big data 1 / 22

More information

Online (and Distributed) Learning with Information Constraints. Ohad Shamir

Online (and Distributed) Learning with Information Constraints. Ohad Shamir Online (and Distributed) Learning with Information Constraints Ohad Shamir Weizmann Institute of Science Online Algorithms and Learning Workshop Leiden, November 2014 Ohad Shamir Learning with Information

More information

Ordinal optimization - Empirical large deviations rate estimators, and multi-armed bandit methods

Ordinal optimization - Empirical large deviations rate estimators, and multi-armed bandit methods Ordinal optimization - Empirical large deviations rate estimators, and multi-armed bandit methods Sandeep Juneja Tata Institute of Fundamental Research Mumbai, India joint work with Peter Glynn Applied

More information

An adaptive algorithm for finite stochastic partial monitoring

An adaptive algorithm for finite stochastic partial monitoring An adaptive algorithm for finite stochastic partial monitoring Gábor Bartók Navid Zolghadr Csaba Szepesvári Department of Computing Science, University of Alberta, AB, Canada, T6G E8 bartok@ualberta.ca

More information

arxiv: v1 [cs.lg] 7 Sep 2018

arxiv: v1 [cs.lg] 7 Sep 2018 Analysis of Thompson Sampling for Combinatorial Multi-armed Bandit with Probabilistically Triggered Arms Alihan Hüyük Bilkent University Cem Tekin Bilkent University arxiv:809.02707v [cs.lg] 7 Sep 208

More information

Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12

Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Lecture 4: Multiarmed Bandit in the Adversarial Model Lecturer: Yishay Mansour Scribe: Shai Vardi 4.1 Lecture Overview

More information

Supremum of simple stochastic processes

Supremum of simple stochastic processes Subspace embeddings Daniel Hsu COMS 4772 1 Supremum of simple stochastic processes 2 Recap: JL lemma JL lemma. For any ε (0, 1/2), point set S R d of cardinality 16 ln n S = n, and k N such that k, there

More information

Efficient and Principled Online Classification Algorithms for Lifelon

Efficient and Principled Online Classification Algorithms for Lifelon Efficient and Principled Online Classification Algorithms for Lifelong Learning Toyota Technological Institute at Chicago Chicago, IL USA Talk @ Lifelong Learning for Mobile Robotics Applications Workshop,

More information

Online Learning and Online Convex Optimization

Online Learning and Online Convex Optimization Online Learning and Online Convex Optimization Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Learning 1 / 49 Summary 1 My beautiful regret 2 A supposedly fun game

More information

Lecture 5: Regret Bounds for Thompson Sampling

Lecture 5: Regret Bounds for Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/2/6 Lecture 5: Regret Bounds for Thompson Sampling Instructor: Alex Slivkins Scribed by: Yancy Liao Regret Bounds for Thompson Sampling For each round t, we defined

More information