Online Learning with Gaussian Payoffs and Side Observations


Online Learning with Gaussian Payoffs and Side Observations
Yifan Wu (1), András György (2), Csaba Szepesvári (1)
(1) Department of Computing Science, University of Alberta
(2) Department of Electrical and Electronic Engineering, Imperial College London
January 14

Outline
1. Introduction
2. Has This Been Done Before?
3. Results: Lower Bounds; Algorithms/Upper Bounds
4. Summary

A Fishy Problem
Each day, you get to choose a fishing spot. Which one should you choose?
Every fish you catch: +1 cookie. No fish: -10 cookies.
The fish distribution is i.i.d. With some probability, you also get to see the neighboring sites' yield for the day.

The Fishing Game
Choosing a fishing spot: K actions.
θ_1, ..., θ_K: (unknown) mean rewards for the K spots.
For rounds t = 1, ..., T:
- Choose a fishing spot I_t ∈ [K] := {1, ..., K};
- Incur reward Y_t ∈ R with mean θ_{I_t};
- Observe X_t ∈ R^K: noisy reward observations for all the sites (Y_t = X_{t,I_t}).

Assumptions: E[X_{t,k}] = θ_k and V(X_{t,k} | I_t) = σ²_{I_t,k}, with Σ = (σ²_{i,k}) known a priori.

Goal: minimize the expected regret R_T = T max_{i∈[K]} θ_i − ∑_{t=1}^T E[Y_t].
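
A minimal sketch (not from the paper) of this interaction protocol and of the regret bookkeeping; the `env` and `policy` interfaces are illustrative assumptions, with `env.step(i)` returning the full noisy observation vector X_t.

```python
import numpy as np

def play(env, policy, T):
    """Run one T-round game and return the (pseudo-)regret of the policy."""
    best_mean = max(env.theta)                    # max_i theta_i (known to env only)
    mean_reward_collected = 0.0
    for t in range(T):
        i_t = policy.choose()                     # pick a fishing spot I_t in [K]
        x_t = env.step(i_t)                       # noisy observations for all K spots
        policy.update(i_t, x_t)                   # the learner sees the whole vector X_t
        mean_reward_collected += env.theta[i_t]   # Y_t = X_{t,I_t} has mean theta_{I_t}
    return T * best_mean - mean_reward_collected  # R_T = T max_i theta_i - sum_t E[Y_t]
```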

Outline
1. Introduction
2. Has This Been Done Before?
3. Results: Lower Bounds; Algorithms/Upper Bounds
4. Summary

(Stochastic) Partial Monitoring and Bandits
For rounds t = 1, ..., T:
- Choose action I_t ∈ [K];
- Observe X_t ~ p(θ, I_t);
- Incur reward R_t = r(θ, I_t).

Information structure:
- Known: p : Θ × [K] → M_1(X);
- Known: r : Θ × [K] → R;
- Unknown: θ ∈ Θ.

(Some) prior work:
- Bandits (Robbins, 1952): X_t = R_t.
- Finite Θ and Y_{t,1} = R_t: Agrawal et al. (1989).
- X_t = h(I_t, J_t), R_t = r(I_t, J_t), J_t ∈ [M] i.i.d.: Bartók et al. (2011).
- Learning with feedback graphs: Alon et al. (2015).

Fishing as Partial Monitoring
Fishing, round t = 1, ..., T:
- Choose a fishing spot I_t ∈ [K];
- Incur (mean) reward θ_{I_t};
- Observe X_t ∈ R^K.

Partial monitoring, round t = 1, ..., T:
- Choose I_t ∈ [K];
- Incur (mean) reward r(θ, I_t);
- Observe X_t ~ p(θ, I_t).

Basic assumptions: E[X_{t,k}] = θ_k and V(X_{t,k} | I_t) = σ²_{I_t,k}, with Σ = (σ²_{i,k}) known a priori.
Distributional assumptions: X_{t,j} ~ N(θ_j, σ²_{I_t,j}), independent.
Choose: r(θ, i) = θ_i; p(θ, i) = N(θ, diag(σ²_{i,1}, ..., σ²_{i,K})); Θ = [0, D]^K.
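
A minimal sketch (not from the paper) of this Gaussian observation model p(θ, i) = N(θ, diag(σ²_{i,1}, ..., σ²_{i,K})); the class name and the convention of using np.inf / NaN for components that are not observed are illustrative assumptions.

```python
import numpy as np

class GaussianSideObsEnv:
    def __init__(self, theta, sigmas, rng=None):
        self.theta = np.asarray(theta, dtype=float)    # unknown to the learner
        self.sigmas = np.asarray(sigmas, dtype=float)  # K x K standard deviations, known
        self.rng = rng or np.random.default_rng()

    def step(self, i):
        """Return X_t ~ N(theta, diag(sigmas[i]^2)); infinite-noise entries become NaN."""
        noise = self.rng.normal(size=len(self.theta)) * self.sigmas[i]
        x = self.theta + noise
        x[np.isinf(self.sigmas[i])] = np.nan           # unobserved components
        return x
```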

Some Interesting Special Cases
- Full information problems: σ_{ij} = σ for all i, j ∈ [K].
- Bandits: σ_{ii} = σ for all i ∈ [K], σ_{ij} = +∞ for all i ≠ j.
- Graph feedback (Alon et al., 2015): each i ∈ [K] has an observation set S_i ⊆ [K], and σ_{i,j} = σ if j ∈ S_i, and +∞ otherwise.
  Self-observability, i ∈ S_i for any i ∈ [K], is often assumed (Mannor & Shamir, 2011; Caron et al., 2012; Alon et al., 2013; Buccapatnam et al., 2014; Kocák et al., 2014).

[Figure from Alon et al. (2015): examples of feedback graphs, including full feedback (clique), apple tasting, and a revealing action.]

Strength: our single model encompasses all these settings and allows continuous interpolation between them.
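
A minimal sketch (not from the paper) of how these three special cases fix the matrix Σ; the helper names are assumptions, and np.inf stands in for +∞ (no observation).

```python
import numpy as np

def full_information(K, sigma=1.0):
    return np.full((K, K), sigma)            # sigma_{ij} = sigma for all i, j

def bandit(K, sigma=1.0):
    S = np.full((K, K), np.inf)
    np.fill_diagonal(S, sigma)               # only the played action is observed
    return S

def graph_feedback(S_sets, sigma=1.0):
    """S_sets[i]: the set of actions observed when playing i."""
    K = len(S_sets)
    S = np.full((K, K), np.inf)
    for i, obs in enumerate(S_sets):
        S[i, list(obs)] = sigma              # sigma_{ij} = sigma iff j in S_i
    return S
```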

How to Compare Algorithms?
Performance metric: expected regret R_T = T max_{i∈[K]} θ_i − ∑_{t=1}^T E[Y_t].
Minimax regret: R*_T = inf_A sup_θ R_T(A, θ).
Typically, R*_T = O(T^α) with 0 < α < 1 (polynomial minimax regret), where the constant is a function of (p, r) and Θ, but not of the individual θ.
Regret asymptotics: A_s = set of algorithms with subpolynomial regret growth, i.e., for any A ∈ A_s and any α > 0, R_T(A, θ) = O(T^α).
Problem-dependent sharp asymptotic regret lower bound: for any θ ∈ Θ,
inf_{A∈A_s} lim inf_{T→∞} R_T(A, θ)/log(T) = c(θ).

Outline
1. Introduction
2. Has This Been Done Before?
3. Results: Lower Bounds; Algorithms/Upper Bounds
4. Summary

A Unified Lower Bound
Under our setting with a general variance matrix Σ, we have a unified, finite-time, problem-dependent lower bound that recovers all of the existing results.

Idea of the Lower Bound
Let A be an algorithm and θ ∈ Θ an environment parameter.
Regret: R^A_T(θ) = ⟨c_{q_θ}, Δ(θ)⟩, where
- Δ_i(θ) = max_j μ_j(θ) − μ_i(θ): the loss due to playing i instead of an optimal action; μ_i(θ) is the mean reward of action i ∈ [K] under θ;
- c_q = ∫ c dq(c) ∈ C^{R_+}_T: the mean number of plays under q ∈ M_1(C^N_T);
- C^S_T = {c ∈ S^K : c_i ≥ 0, ∑_{i∈[K]} c_i = T}: the set of S-valued, T-round allocations;
- q_θ ∈ M_1(C^N_T): the distribution of N_T ∈ C^N_T, the numbers of pulls of the K actions under A and θ. Depends on A (dependence hidden).

We want to lower bound this by a quantity that depends on θ, Θ, T, but not on A.
- Only 0 works if A is allowed to be arbitrary (why?).
- Which algorithms to allow? Idea: use the regret itself! Allow algorithms with some predetermined worst-case regret over Θ: sup_{θ'∈Θ} R^A_T(θ') ≤ B for some fixed B > 0.
General strategy (for any lower bound): create perturbations of θ such that any algorithm performs badly on at least one of them.

Asymptotic Lower Bound for Graph Feedback
Derived from the work of Graves & Lai (1997). Let Δ_i = max_j θ_j − θ_i; σ_{i,j} ∈ {σ, +∞}.
Assumption: the optimal action is unique; let i_1 and i_2 be the indices of the best and the second best action.

Theorem (Asymptotic lower bound)
For any algorithm A ∈ A_s and any θ ∈ Θ,
lim inf_{T→∞} R_T(A, θ)/log T ≥ inf_{c∈C_θ} ∑_{i≠i_1} c_i Δ_i,
where C_θ = { c ∈ [0,∞)^K : ∑_{i: j∈S_i} c_i ≥ 2σ²/Δ_j² for all j ≠ i_1, and ∑_{i: i_1∈S_i} c_i ≥ 2σ²/Δ_{i_2}² }.
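
A minimal sketch (not from the paper) of computing the lower-bound constant inf_{c∈C_θ} ∑_{i≠i_1} c_i Δ_i by linear programming; the function and argument names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def asymptotic_constant(theta, S, sigma):
    """theta: length-K mean vector; S[i]: set of actions observed when playing i."""
    theta = np.asarray(theta, dtype=float)
    K = len(theta)
    i1 = int(np.argmax(theta))
    delta = theta.max() - theta                      # gaps Delta_i
    delta_i2 = np.min(delta[np.arange(K) != i1])     # gap of the second best action

    cost = delta                                     # minimize sum_i c_i * Delta_i
    A_ub, b_ub = [], []
    for j in range(K):                               # coverage constraints of C_theta
        row = [-1.0 if j in S[i] else 0.0 for i in range(K)]
        rhs = 2 * sigma**2 / (delta_i2**2 if j == i1 else delta[j]**2)
        A_ub.append(row)                             # -sum_{i: j in S_i} c_i <= -rhs
        b_ub.append(-rhs)

    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * K)
    return res.fun, res.x

# Example: three spots, each revealing itself and one neighbour.
value, c = asymptotic_constant([1.0, 0.7, 0.5], S=[{0, 1}, {1, 2}, {2, 0}], sigma=1.0)
```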

Lower Bound for the Gaussian Case
Given some B > 0, for i ≠ i_1 let
ε_i = (1/8) √(eB/T) · e^{W(T Δ_i² / (16 eB))} + Δ_i  and  m_i(θ, B) = (1/ε_i²) log( T(ε_i − Δ_i) / (8B) )
(for i = i_1, replace Δ_i by Δ_{i_2}). Let
C_{θ,B} = { c ∈ C^{R_+}_T : ∑_{j=1}^K c_j / σ²_{j,i} ≥ m_i(θ, B) for all i ∈ [K] },
where W(·) is the Lambert W function, satisfying W(x) e^{W(x)} = x.

Theorem (Finite-time problem-dependent lower bound)
For any algorithm such that sup_{λ∈Θ} R_T(λ) ≤ B, for any T large enough and any θ in the interior of Θ,
R_T(θ) ≥ b(θ, B) = min_{c∈C_{θ,B}} ∑_{i≠i_1} c_i Δ_i.

Recovering the Asymptotic Lower Bound
Theorem (Finite-time problem-dependent lower bound)
For any algorithm such that sup_{λ∈Θ} R_T(λ) ≤ B, we have, for any θ ∈ Θ,
R_T(θ) ≥ b(θ, B) = min_{c∈C_{θ,B}} ∑_{i≠i_1} c_i Δ_i.   (*)

Recall the asymptotic lower bound:
lim inf_{T→∞} R_T(θ)/log T ≥ inf_{c∈C_θ} ∑_{i≠i_1} c_i Δ_i.   (**)

For any B = αT^β with α > 0 and β ∈ (0, 1), we have C_{θ,B} → ((1−β) log T / 2) · C_θ as T → ∞. Hence, (**) is recovered from (*).

Minimax Lower Bounds (Alon et al., 2015)
Each i ∈ [K] is associated with an observation set S_i ⊆ [K]: for j ∈ S_i, σ_{ij} = σ; for j ∉ S_i, σ_{ij} = +∞.
Assume Σ is observable: every action can be observed, i.e., for all i there exists j such that i ∈ S_j.
An action i is strongly observable if it is either self-observable or observable under every other action; otherwise, the action is said to be weakly observable.
Σ is strongly observable if all actions are strongly observable.
Σ is weakly observable if it is observable but not strongly observable.

Minimax Lower Bounds for Graph Feedback - Strong Observability
σ_{i,j} ∈ {σ, +∞} with σ = 1, Θ = [0, 1]^K; S_i = {j : σ_{i,j} = σ}.
A set A ⊆ [K] is independent in Σ if for any i ∈ A, S_i ∩ A ⊆ {i}: choosing i ∈ A gives no information about any j ≠ i, j ∈ A.
Independence number of Σ: κ(Σ) = max{ |A| : A ⊆ [K] is independent in Σ }.

Theorem (Mannor & Shamir (2011), Alon et al. (2015))
Let Σ be strongly observable. Then sup_{θ∈Θ} R_T(θ) ≥ c √(κ(Σ) T).
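
A minimal sketch (an assumed helper, not from the paper) computing the independence number κ(Σ) by brute force from the observation sets S_i.

```python
from itertools import combinations

def independence_number(S):
    K = len(S)
    for size in range(K, 0, -1):                 # try the largest sets first
        for A in combinations(range(K), size):
            # A is independent if playing i in A reveals nothing about other j in A.
            if all(S[i] & set(A) <= {i} for i in A):
                return size
    return 0

# Example: bandit feedback (each action observes only itself) has kappa = K.
print(independence_number([{0}, {1}, {2}]))      # -> 3
```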

Minimax Lower Bounds for Graph Feedback - Weak Observability
σ_{i,j} ∈ {σ, +∞} with σ = 1, Θ = [0, 1]^K; S_i = {j : σ_{i,j} = σ}.
For A, A' ⊆ [K], A dominates A' if for any j ∈ A' there exists i ∈ A such that j ∈ S_i: any j ∈ A' can be observed through some i ∈ A.
W(Σ): the set of all weakly observable actions.
Weak domination number: ρ(Σ) = min{ |A| : A dominates W(Σ) }.

Theorem (Mannor & Shamir (2011), Alon et al. (2015))
Let Σ be weakly observable. Then sup_{θ∈Θ} R_T(θ) ≥ c (log K)^{-2/3} ρ(Σ)^{1/3} T^{2/3}.
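
A minimal sketch (an assumed helper, not from the paper) computing the set of weakly observable actions and the weak domination number ρ(Σ) by brute force.

```python
from itertools import combinations

def weakly_observable(S):
    K = len(S)
    strong = {i for i in range(K)
              if i in S[i] or all(i in S[j] for j in range(K) if j != i)}
    # not strongly observable (weakly observable whenever Sigma is observable)
    return set(range(K)) - strong

def weak_domination_number(S):
    W = weakly_observable(S)
    if not W:
        return 0
    K = len(S)
    for size in range(1, K + 1):                 # smallest dominating set first
        for A in combinations(range(K), size):
            if all(any(j in S[i] for i in A) for j in W):
                return size
    return None                                  # W is not dominated: Sigma not observable

# Example: a "revealing" action 0 observes everything, the others observe nothing.
S = [{0, 1, 2}, set(), set()]
print(weakly_observable(S), weak_domination_number(S))   # -> {1, 2} 1
```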

Recovering Minimax Lower Bounds
Theorem (Finite-time problem-dependent lower bound)
For any algorithm such that sup_{λ∈Θ} R_T(λ) ≤ B, we have, for any θ ∈ Θ,
R_T(θ) ≥ b(θ, B) = min_{c∈C_{θ,B}} ∑_{i≠i_1} c_i Δ_i.   (*)

If Σ is strongly observable, choosing B = σ √(κ(Σ)T) / (8√e) gives sup_{θ∈Θ} b(θ, B) ≥ B for T large enough.
If Σ is weakly observable, choosing B = (ρ(Σ)D)^{1/3} (σT)^{2/3} / (73 (log K)^{2/3}) gives sup_{θ∈Θ} b(θ, B) ≥ B.

Outline
1. Introduction
2. Has This Been Done Before?
3. Results: Lower Bounds; Algorithms/Upper Bounds
4. Summary

Upcoming Attractions
- Just for feedback graphs;
- Near asymptotically optimal algorithm (new);
- A single near-minimax optimal algorithm with logarithmic asymptotic regret (new).

Asymptotically (Almost) Optimal Algorithm
Recall C_θ = { c ∈ [0,∞)^K : ∑_{i: j∈S_i} c_i ≥ 2σ²/Δ_j² for all j ≠ i_1, and ∑_{i: i_1∈S_i} c_i ≥ 2σ²/Δ_{i_2}² }.
Let c(θ) = argmin_{c∈C_θ} ∑_{i≠i_1} c_i Δ_i.
Goal: find an algorithm that achieves O( (∑_{i≠i_1} c_i(θ) Δ_i) log T ) regret.
(Simple) idea, borrowed from Magureanu et al. (2014): use forced exploration to ensure that c(θ) is well approximated by c(θ̂_t) uniformly in time, while paying only a constant price in total.
The exploration schedule β(·): N → R is chosen to be sublinear. Magureanu et al. (2014)'s linear schedule β(n) = βn requires choosing a parameter of their algorithm based on the unknown Δ_min; the sublinear schedule avoids this.

Asymptotically (Almost) Optimal Algorithm (flow of one round)
At time t, with estimate θ̂_t, exploration counter n_e(t), play counts plays(t) and observation counts obs(t):
- If plays(t) ∈ 4α log(t) · C_{θ̂_t} (the play counts already satisfy the estimated constraints): Exploitation - play I_t := i_1(θ̂_t); set n_e(t+1) := n_e(t).
- Otherwise:
  - If min_i obs_i(t) < β(n_e(t))/K: forced exploration - play an I_t such that argmin_i obs_i(t) ∈ S_{I_t};
  - else: play I_t = i for some i with plays_i(t) < c_i(θ̂_t) · 4α log t.
  - Set n_e(t+1) := n_e(t) + 1.
- Update θ̂_t to θ̂_{t+1} and set t := t + 1.
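
A minimal sketch (not the authors' code) of this forced-exploration scheme. The `env.step` interface, the `allocation(theta_hat)` helper returning c(θ̂) (e.g. built like the linprog sketch earlier), the default α, and the default schedule β(n) = n^{1/2}/2 are all illustrative assumptions.

```python
import numpy as np

def run(env, allocation, S, T, alpha=2.5, beta=lambda n: 0.5 * n**0.5):
    K = len(S)
    plays = np.zeros(K)                       # times each action was played
    obs = np.zeros(K)                         # observations gathered for each action
    sums = np.zeros(K)                        # running sums of those observations
    n_e = 0                                   # number of exploration rounds so far
    for t in range(1, T + 1):
        theta_hat = np.where(obs > 0, sums / np.maximum(obs, 1), 0.0)
        c = allocation(theta_hat)             # target allocation c(theta_hat)
        if np.all(plays >= 4 * alpha * np.log(t) * c):
            i_t = int(np.argmax(theta_hat))                   # exploit: i_1(theta_hat)
        else:
            n_e += 1
            if obs.min() < beta(n_e) / K:                     # forced exploration
                j = int(np.argmin(obs))
                i_t = next(i for i in range(K) if j in S[i])
            else:                                             # track the plan: an under-sampled action
                i_t = int(np.argmax(4 * alpha * np.log(t) * c - plays))
        x = env.step(i_t)                     # length-K noisy observation vector
        plays[i_t] += 1
        observed = [j for j in range(K) if j in S[i_t]]
        obs[observed] += 1
        sums[observed] += x[observed]
```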

Asymptotically Almost Optimal Algorithm - Upper Bound
Upper bound: for any α > 2, β(n) = a n^b with a ∈ (0, 1/2] and b ∈ (0, 1), and for any θ ∈ Θ such that c(θ) is unique,
lim sup_{T→∞} R_T(θ)/log T ≤ 4α ∑_{i≠i_1} c_i(θ) Δ_i.

Near Minimax Optimal Algorithm
Successive elimination: maintain a set of possibly optimal ("good") actions until only one action remains.
In each phase r:
- Explore all good actions by playing only good actions (exploitation).
- Due to weak observability, some actions can sometimes only be explored by playing bad actions (exploration-exploitation trade-off). A sublinear function γ controls the amount of exploration done with bad actions.
The idea is similar to the CBP algorithm of Bartók et al. (2014). Here a better exploration method exploits the feedback structure, which leads to the optimal dependence on quantities such as ρ(Σ) and κ(Σ).

Near Minimax Optimal Algorithm - Preliminaries
For E, G ⊆ [K], let c(E, G) = argmax_{c ∈ Simplex(E)} min_{i∈G} ∑_{j: i∈S_j} c_j: the optimal way of using actions in E to uniformly explore actions in G.
Let c*(E, G) = min_{i∈G} ∑_{j: i∈S_j} c_j(E, G): the least coverage.
For any A ⊆ [K] with |A| ≥ 2, let A^S = {i ∈ A : ∃ j ∈ A, i ∈ S_j} denote the set of actions that can be observed while using actions of A only, and A^W = A \ A^S (note: actions in A^W must be weakly observable).
Exploration schedule for A^W: γ(r) = (σ α_r t_r / D)^{2/3}, where α_r = min_{1≤s≤r, A^W_s ≠ ∅} c*([K], A^W_s).
At round r, define the confidence width g_{r,i}(δ) = σ √( 2 log(8K² r³/δ) / obs_i(r) ), where obs_i(r) is the number of observations gained for action i so far.
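
A minimal sketch (an assumed helper, not from the paper) of computing c(E, G) and the least coverage c*(E, G) as a max-min linear program; the function name and the scipy-based formulation are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def least_coverage(E, G, S, K):
    """Distribution over E maximizing min_{i in G} sum_{j: i in S_j} c_j."""
    E, G = list(E), list(G)
    n = len(E)
    cost = np.zeros(n + 1)
    cost[-1] = -1.0                                  # variables (c_j for j in E, m); maximize m
    A_ub, b_ub = [], []
    for i in G:                                      # m - sum_{j in E: i in S_j} c_j <= 0
        row = np.zeros(n + 1)
        for k, j in enumerate(E):
            if i in S[j]:
                row[k] = -1.0
        row[-1] = 1.0
        A_ub.append(row)
        b_ub.append(0.0)
    A_eq = [np.append(np.ones(n), 0.0)]              # the c_j over E sum to 1
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (n + 1))
    c = np.zeros(K)
    c[E] = res.x[:n]
    return c, res.x[-1]                              # the distribution and the least coverage
```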

Near Minimax Optimal Algorithm (flow of one phase)
At phase r, with active set A_r, estimates θ̂_r, and observation counts obs(r):
- If A^W_r ≠ ∅ and the weakly observable active actions have fewer observations than the strongly observable ones and fewer than γ(r): use c_r = c([K], A^W_r) (explore A^W_r with all actions);
- otherwise: use c_r = c(A_r, A^S_r) (explore the active set with active actions only).
- Play the actions prescribed by c_r; set t_{r+1} := t_r + the number of plays made.
- Update θ̂_r to θ̂_{r+1} and eliminate: A_{r+1} := {i ∈ A_r : UCB^δ_{r+1,i} ≥ max_{j∈A_r} LCB^δ_{r+1,j}}.
- If |A_{r+1}| > 1, continue with phase r + 1; otherwise keep playing the remaining action.
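
A minimal sketch (not the authors' code) of the elimination step above, using the confidence width g_{r,i}(δ) from the preliminaries; the function name and the handling of zero observation counts are assumptions.

```python
import numpy as np

def eliminate(active, mean, obs, r, K, sigma, delta):
    """Keep the actions in `active` whose UCB is not below the best LCB."""
    width = sigma * np.sqrt(2 * np.log(8 * K**2 * r**3 / delta) / np.maximum(obs, 1))
    ucb, lcb = mean + width, mean - width
    best_lcb = max(lcb[i] for i in active)
    return {i for i in active if ucb[i] >= best_lcb}
```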

Near Minimax Optimal Algorithm - Upper Bound
Theorem. With δ = 1/T, for any θ ∈ Θ:
- If Σ is strongly observable, R_T(θ) = O( σ log(K) √(κ(Σ) T log T) ).
- If Σ is weakly observable, R_T(θ) = O( (ρ(Σ)D)^{1/3} (σT)^{2/3} √(log(KT)) ).
- If we view Δ_min as a constant and only consider the dependence on T, R_T(θ) = O(log^{3/2} T).

Outline
1. Introduction
2. Has This Been Done Before?
3. Results: Lower Bounds; Algorithms/Upper Bounds
4. Summary

Conclusions
- Online learning with Gaussian payoffs and side observations;
- Smooth interpolation between full-information and bandit settings;
- First non-asymptotic, problem-dependent lower bounds in regret minimization;
- Algorithms for σ_{i,j} ∈ {σ, +∞}:
  - An asymptotically near-optimal algorithm; the first for learning with feedback graphs to achieve this;
  - A single near-minimax algorithm regardless of observability, with poly-logarithmic asymptotic regret; also a first for learning with feedback graphs:
    - Mannor & Shamir (2011), Alon et al. (2013) and Alon et al. (2015): no logarithmic asymptotic regret, minimax algorithms only;
    - Caron et al. (2012) and Buccapatnam et al. (2014): logarithmic asymptotics, but no near-minimax finite-time regret.

Open Problems
- Remove the assumption that c(θ) is unique for the optimality of the first algorithm;
- Remove the log^{1/2} T overhead of the second algorithm;
- A single algorithm that achieves both the asymptotic and the minimax optimal bounds up to constant factors; for bandits this was achieved (very) recently (Lattimore, 2015);
- An algorithm for general Σ;
- An algorithm for unknown Σ;
- General tightness of the new lower bound;
- Algorithms for the (general) stochastic partial monitoring setting.

References
Agrawal, R., Teneketzis, D., and Anantharam, V. Asymptotically efficient adaptive allocation schemes for controlled i.i.d. processes: Finite parameter space. IEEE Transactions on Automatic Control, 34, 1989.
Alon, N., Cesa-Bianchi, N., Gentile, C., and Mansour, Y. From bandits to experts: A tale of domination and independence. In NIPS, 2013.
Alon, N., Cesa-Bianchi, N., Gentile, C., and Mansour, Y. Online learning with feedback graphs: beyond bandits. In COLT, 2015.
Bartók, G., Pál, D., and Szepesvári, Cs. Minimax regret of finite partial-monitoring games in stochastic environments. In COLT, July 2011.
Bartók, G., Foster, D. P., Pál, D., Rakhlin, A., and Szepesvári, Cs. Partial monitoring - classification, regret bounds, and algorithms. Mathematics of Operations Research, 39, 2014.
Buccapatnam, S., Eryilmaz, A., and Shroff, N. B. Stochastic bandits with side observations on networks. SIGMETRICS Performance Evaluation Review, 42(1), June 2014.
Caron, S., Kveton, B., Lelarge, M., and Bhagat, S. Leveraging side observations in stochastic bandits. In UAI, 2012.
Graves, T. L. and Lai, T. L. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3), 1997.
Kocák, T., Neu, G., Valko, M., and Munos, R. Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems 27 (NIPS), 2014.
Lattimore, T. Optimally confident UCB: Improved regret for finite-armed bandits. arXiv preprint, 2015.
Magureanu, S., Combes, R., and Proutiere, A. Lipschitz bandits: Regret lower bounds and optimal algorithms. In COLT, 2014.
Mannor, S. and Shamir, O. From bandits to experts: on the value of side-observations. In NIPS, 2011.
Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58, 1952.


More information

THE first formalization of the multi-armed bandit problem

THE first formalization of the multi-armed bandit problem EDIC RESEARCH PROPOSAL 1 Multi-armed Bandits in a Network Farnood Salehi I&C, EPFL Abstract The multi-armed bandit problem is a sequential decision problem in which we have several options (arms). We can

More information

On Regret-Optimal Learning in Decentralized Multi-player Multi-armed Bandits

On Regret-Optimal Learning in Decentralized Multi-player Multi-armed Bandits 1 On Regret-Optimal Learning in Decentralized Multi-player Multi-armed Bandits Naumaan Nayyar, Dileep Kalathil and Rahul Jain Abstract We consider the problem of learning in single-player and multiplayer

More information

arxiv: v1 [cs.lg] 7 Sep 2018

arxiv: v1 [cs.lg] 7 Sep 2018 Analysis of Thompson Sampling for Combinatorial Multi-armed Bandit with Probabilistically Triggered Arms Alihan Hüyük Bilkent University Cem Tekin Bilkent University arxiv:809.02707v [cs.lg] 7 Sep 208

More information

The No-Regret Framework for Online Learning

The No-Regret Framework for Online Learning The No-Regret Framework for Online Learning A Tutorial Introduction Nahum Shimkin Technion Israel Institute of Technology Haifa, Israel Stochastic Processes in Engineering IIT Mumbai, March 2013 N. Shimkin,

More information

Performance and Convergence of Multi-user Online Learning

Performance and Convergence of Multi-user Online Learning Performance and Convergence of Multi-user Online Learning Cem Tekin, Mingyan Liu Department of Electrical Engineering and Computer Science University of Michigan, Ann Arbor, Michigan, 4809-222 Email: {cmtkn,

More information

Regional Multi-Armed Bandits

Regional Multi-Armed Bandits School of Information Science and Technology University of Science and Technology of China {wzy43, zrd7}@mail.ustc.edu.cn, congshen@ustc.edu.cn Abstract We consider a variant of the classic multiarmed

More information

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Sébastien Bubeck Theory Group

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Sébastien Bubeck Theory Group Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems Sébastien Bubeck Theory Group Part 1: i.i.d., adversarial, and Bayesian bandit models i.i.d. multi-armed bandit, Robbins [1952]

More information

The multi armed-bandit problem

The multi armed-bandit problem The multi armed-bandit problem (with covariates if we have time) Vianney Perchet & Philippe Rigollet LPMA Université Paris Diderot ORFE Princeton University Algorithms and Dynamics for Games and Optimization

More information

Lecture 4 January 23

Lecture 4 January 23 STAT 263/363: Experimental Design Winter 2016/17 Lecture 4 January 23 Lecturer: Art B. Owen Scribe: Zachary del Rosario 4.1 Bandits Bandits are a form of online (adaptive) experiments; i.e. samples are

More information

A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem

A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem Fang Liu and Joohyun Lee and Ness Shroff The Ohio State University Columbus, Ohio 43210 {liu.3977, lee.7119, shroff.11}@osu.edu

More information

arxiv: v1 [cs.lg] 23 Jan 2019

arxiv: v1 [cs.lg] 23 Jan 2019 Cooperative Online Learning: Keeping your Neighbors Updated Nicolò Cesa-Bianchi, Tommaso R. Cesari, and Claire Monteleoni 2 Dipartimento di Informatica, Università degli Studi di Milano, Italy 2 Department

More information

New bounds on the price of bandit feedback for mistake-bounded online multiclass learning

New bounds on the price of bandit feedback for mistake-bounded online multiclass learning Journal of Machine Learning Research 1 8, 2017 Algorithmic Learning Theory 2017 New bounds on the price of bandit feedback for mistake-bounded online multiclass learning Philip M. Long Google, 1600 Amphitheatre

More information

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization.

More information

Monte-Carlo Tree Search by. MCTS by Best Arm Identification

Monte-Carlo Tree Search by. MCTS by Best Arm Identification Monte-Carlo Tree Search by Best Arm Identification and Wouter M. Koolen Inria Lille SequeL team CWI Machine Learning Group Inria-CWI workshop Amsterdam, September 20th, 2017 Part of...... a new Associate

More information

On Sequential Elimination Algorithms for Best-Arm Identification in Multi-Armed Bandits

On Sequential Elimination Algorithms for Best-Arm Identification in Multi-Armed Bandits 1 On Sequential Elimination Algorithms for Best-Arm Identification in Multi-Armed Bandits Shahin Shahrampour, Mohammad Noshad, Vahid Tarokh arxiv:169266v2 [statml] 13 Apr 217 Abstract We consider the best-arm

More information

Bandits : optimality in exponential families

Bandits : optimality in exponential families Bandits : optimality in exponential families Odalric-Ambrym Maillard IHES, January 2016 Odalric-Ambrym Maillard Bandits 1 / 40 Introduction 1 Stochastic multi-armed bandits 2 Boundary crossing probabilities

More information

Learning Algorithms for Minimizing Queue Length Regret

Learning Algorithms for Minimizing Queue Length Regret Learning Algorithms for Minimizing Queue Length Regret Thomas Stahlbuhk Massachusetts Institute of Technology Cambridge, MA Brooke Shrader MIT Lincoln Laboratory Lexington, MA Eytan Modiano Massachusetts

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

Online Sparse Linear Regression

Online Sparse Linear Regression JMLR: Workshop and Conference Proceedings vol 49:1 11, 2016 Online Sparse Linear Regression Dean Foster Amazon DEAN@FOSTER.NET Satyen Kale Yahoo Research SATYEN@YAHOO-INC.COM Howard Karloff HOWARD@CC.GATECH.EDU

More information

Unsupervised Sequential Sensor Acquisition

Unsupervised Sequential Sensor Acquisition Unsupervised Sequential Sensor Acquisition Manjesh K. Hanawal Csaba Szepesvári Venkatesh Saligrama Dept. of IEOR IIT Bombay, India mhanawal@iitb.ac.in Dept. of Computing Sciences University of Alberta,

More information

Stochastic bandits: Explore-First and UCB

Stochastic bandits: Explore-First and UCB CSE599s, Spring 2014, Online Learning Lecture 15-2/19/2014 Stochastic bandits: Explore-First and UCB Lecturer: Brendan McMahan or Ofer Dekel Scribe: Javad Hosseini In this lecture, we like to answer this

More information

Online Learning and Online Convex Optimization

Online Learning and Online Convex Optimization Online Learning and Online Convex Optimization Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Learning 1 / 49 Summary 1 My beautiful regret 2 A supposedly fun game

More information

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for Big data 1 / 22

More information

Dynamic resource allocation: Bandit problems and extensions

Dynamic resource allocation: Bandit problems and extensions Dynamic resource allocation: Bandit problems and extensions Aurélien Garivier Institut de Mathématiques de Toulouse MAD Seminar, Université Toulouse 1 October 3rd, 2014 The Bandit Model Roadmap 1 The Bandit

More information

Corrupt Bandits. Abstract

Corrupt Bandits. Abstract Corrupt Bandits Pratik Gajane Orange labs/inria SequeL Tanguy Urvoy Orange labs Emilie Kaufmann INRIA SequeL pratik.gajane@inria.fr tanguy.urvoy@orange.com emilie.kaufmann@inria.fr Editor: Abstract We

More information

Complex Bandit Problems and Thompson Sampling

Complex Bandit Problems and Thompson Sampling Complex Bandit Problems and Aditya Gopalan Department of Electrical Engineering Technion, Israel aditya@ee.technion.ac.il Shie Mannor Department of Electrical Engineering Technion, Israel shie@ee.technion.ac.il

More information