Online Learning with Gaussian Payoffs and Side Observations


Online Learning with Gaussian Payoffs and Side Observations
Yifan Wu (1), András György (2), Csaba Szepesvári (1)
(1) Department of Computing Science, University of Alberta
(2) Department of Electrical and Electronic Engineering, Imperial College London
January 14

Outline
1. Introduction
2. Has This Been Done Before?
3. Results: Lower Bounds; Algorithms/Upper Bounds
4. Summary

A Fishy Problem
Each day, you get to choose a fishing spot. Which one should you choose?
Every fish you catch: +1 cookie. No fish: -10 cookies.
The fish distribution is i.i.d. With some probability, you also get to see the neighboring sites' yield for the day.

The Fishing Game
Choosing a fishing spot: K actions.
θ_1, ..., θ_K: (unknown) mean rewards for the K spots.
For rounds t = 1, ..., T:
- Choose a fishing spot I_t ∈ [K] := {1, ..., K};
- Incur reward Y_t ∈ R with mean θ_{I_t};
- Observe X_t ∈ R^K: noisy reward observations for all the sites (Y_t = X_{t,I_t}).

Assumptions: E[X_{t,k}] = θ_k and V(X_{t,k} | I_t) = σ²_{I_t,k}, with Σ = (σ²_{i,k}) known a priori.

Goal: minimize the expected regret R_T = T max_{i∈[K]} θ_i − ∑_{t=1}^T E[Y_t].
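
A minimal sketch (not from the paper) of this interaction protocol and of the regret bookkeeping; the `env` and `policy` interfaces are illustrative assumptions, with `env.step(i)` returning the full noisy observation vector X_t.

```python
import numpy as np

def play(env, policy, T):
    """Run one T-round game and return the (pseudo-)regret of the policy."""
    best_mean = max(env.theta)                    # max_i theta_i (known to env only)
    mean_reward_collected = 0.0
    for t in range(T):
        i_t = policy.choose()                     # pick a fishing spot I_t in [K]
        x_t = env.step(i_t)                       # noisy observations for all K spots
        policy.update(i_t, x_t)                   # the learner sees the whole vector X_t
        mean_reward_collected += env.theta[i_t]   # Y_t = X_{t,I_t} has mean theta_{I_t}
    return T * best_mean - mean_reward_collected  # R_T = T max_i theta_i - sum_t E[Y_t]
```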

Outline
1. Introduction
2. Has This Been Done Before?
3. Results: Lower Bounds; Algorithms/Upper Bounds
4. Summary

(Stochastic) Partial Monitoring and Bandits
For rounds t = 1, ..., T:
- Choose action I_t ∈ [K];
- Observe X_t ~ p(θ, I_t);
- Incur reward R_t = r(θ, I_t).

Information structure:
- Known: p : Θ × [K] → M_1(X);
- Known: r : Θ × [K] → R;
- Unknown: θ ∈ Θ.

(Some) prior work:
- Bandits (Robbins, 1952): X_t = R_t.
- Finite Θ and Y_{t,1} = R_t: Agrawal et al. (1989).
- X_t = h(I_t, J_t), R_t = r(I_t, J_t), J_t ∈ [M] i.i.d.: Bartók et al. (2011).
- Learning with feedback graphs: Alon et al. (2015).

Fishing as Partial Monitoring
Fishing, round t = 1, ..., T:
- Choose a fishing spot I_t ∈ [K];
- Incur (mean) reward θ_{I_t};
- Observe X_t ∈ R^K.

Partial monitoring, round t = 1, ..., T:
- Choose I_t ∈ [K];
- Incur (mean) reward r(θ, I_t);
- Observe X_t ~ p(θ, I_t).

Basic assumptions: E[X_{t,k}] = θ_k and V(X_{t,k} | I_t) = σ²_{I_t,k}, with Σ = (σ²_{i,k}) known a priori.
Distributional assumptions: X_{t,j} ~ N(θ_j, σ²_{I_t,j}), independent.
Choose: r(θ, i) = θ_i; p(θ, i) = N(θ, diag(σ²_{i,1}, ..., σ²_{i,K})); Θ = [0, D]^K.
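
A minimal sketch (not from the paper) of this Gaussian observation model p(θ, i) = N(θ, diag(σ²_{i,1}, ..., σ²_{i,K})); the class name and the convention of using np.inf / NaN for components that are not observed are illustrative assumptions.

```python
import numpy as np

class GaussianSideObsEnv:
    def __init__(self, theta, sigmas, rng=None):
        self.theta = np.asarray(theta, dtype=float)    # unknown to the learner
        self.sigmas = np.asarray(sigmas, dtype=float)  # K x K standard deviations, known
        self.rng = rng or np.random.default_rng()

    def step(self, i):
        """Return X_t ~ N(theta, diag(sigmas[i]^2)); infinite-noise entries become NaN."""
        noise = self.rng.normal(size=len(self.theta)) * self.sigmas[i]
        x = self.theta + noise
        x[np.isinf(self.sigmas[i])] = np.nan           # unobserved components
        return x
```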

Some Interesting Special Cases
- Full information problems: σ_{ij} = σ for all i, j ∈ [K].
- Bandits: σ_{ii} = σ for all i ∈ [K], σ_{ij} = +∞ for all i ≠ j.
- Graph feedback (Alon et al., 2015): each i ∈ [K] has an observation set S_i ⊆ [K], and σ_{i,j} = σ if j ∈ S_i, and +∞ otherwise.
  Self-observability, i ∈ S_i for any i ∈ [K], is often assumed (Mannor & Shamir, 2011; Caron et al., 2012; Alon et al., 2013; Buccapatnam et al., 2014; Kocák et al., 2014).

[Figure from Alon et al. (2015): examples of feedback graphs, including full feedback (clique), apple tasting, and a revealing action.]

Strength: our single model encompasses all these settings and allows continuous interpolation between them.
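
A minimal sketch (not from the paper) of how these three special cases fix the matrix Σ; the helper names are assumptions, and np.inf stands in for +∞ (no observation).

```python
import numpy as np

def full_information(K, sigma=1.0):
    return np.full((K, K), sigma)            # sigma_{ij} = sigma for all i, j

def bandit(K, sigma=1.0):
    S = np.full((K, K), np.inf)
    np.fill_diagonal(S, sigma)               # only the played action is observed
    return S

def graph_feedback(S_sets, sigma=1.0):
    """S_sets[i]: the set of actions observed when playing i."""
    K = len(S_sets)
    S = np.full((K, K), np.inf)
    for i, obs in enumerate(S_sets):
        S[i, list(obs)] = sigma              # sigma_{ij} = sigma iff j in S_i
    return S
```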

How to Compare Algorithms?
Performance metric: expected regret R_T = T max_{i∈[K]} θ_i − ∑_{t=1}^T E[Y_t].
Minimax regret: R*_T = inf_A sup_θ R_T(A, θ).
Typically, R*_T = O(T^α) with 0 < α < 1 (polynomial minimax regret), where the constant is a function of (p, r) and Θ, but not of the individual θ.
Regret asymptotics: A_s = set of algorithms with subpolynomial regret growth, i.e., for any A ∈ A_s and any α > 0, R_T(A, θ) = O(T^α).
Problem-dependent sharp asymptotic regret lower bound: for any θ ∈ Θ,
inf_{A∈A_s} lim inf_{T→∞} R_T(A, θ)/log(T) = c(θ).

Outline
1. Introduction
2. Has This Been Done Before?
3. Results: Lower Bounds; Algorithms/Upper Bounds
4. Summary

A Unified Lower Bound
Under our setting with a general variance matrix Σ, we have a unified, finite-time, problem-dependent lower bound that recovers all of the existing results.

Idea of the Lower Bound
Let A be an algorithm and θ ∈ Θ an environment parameter.
Regret: R^A_T(θ) = ⟨c_{q_θ}, Δ(θ)⟩, where
- Δ_i(θ) = max_j μ_j(θ) − μ_i(θ): the loss due to playing i instead of an optimal action; μ_i(θ) is the mean reward of action i ∈ [K] under θ;
- c_q = ∫ c dq(c) ∈ C^{R_+}_T: the mean number of plays under q ∈ M_1(C^N_T);
- C^S_T = {c ∈ S^K : c_i ≥ 0, ∑_{i∈[K]} c_i = T}: the set of S-valued, T-round allocations;
- q_θ ∈ M_1(C^N_T): the distribution of N_T ∈ C^N_T, the numbers of pulls of the K actions under A and θ. Depends on A (dependence hidden).

We want to lower bound this by a quantity that depends on θ, Θ, T, but not on A.
- Only 0 works if A is allowed to be arbitrary (why?).
- Which algorithms to allow? Idea: use the regret itself! Allow algorithms with some predetermined worst-case regret over Θ: sup_{θ'∈Θ} R^A_T(θ') ≤ B for some fixed B > 0.
General strategy (for any lower bound): create perturbations of θ such that any algorithm performs badly on at least one of them.

Asymptotic Lower Bound for Graph Feedback
Derived from the work of Graves & Lai (1997). Let Δ_i = max_j θ_j − θ_i; σ_{i,j} ∈ {σ, +∞}.
Assumption: the optimal action is unique; let i_1 and i_2 be the indices of the best and the second best action.

Theorem (Asymptotic lower bound)
For any algorithm A ∈ A_s and any θ ∈ Θ,
lim inf_{T→∞} R_T(A, θ)/log T ≥ inf_{c∈C_θ} ∑_{i≠i_1} c_i Δ_i,
where C_θ = { c ∈ [0,∞)^K : ∑_{i: j∈S_i} c_i ≥ 2σ²/Δ_j² for all j ≠ i_1, and ∑_{i: i_1∈S_i} c_i ≥ 2σ²/Δ_{i_2}² }.
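
A minimal sketch (not from the paper) of computing the lower-bound constant inf_{c∈C_θ} ∑_{i≠i_1} c_i Δ_i by linear programming; the function and argument names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def asymptotic_constant(theta, S, sigma):
    """theta: length-K mean vector; S[i]: set of actions observed when playing i."""
    theta = np.asarray(theta, dtype=float)
    K = len(theta)
    i1 = int(np.argmax(theta))
    delta = theta.max() - theta                      # gaps Delta_i
    delta_i2 = np.min(delta[np.arange(K) != i1])     # gap of the second best action

    cost = delta                                     # minimize sum_i c_i * Delta_i
    A_ub, b_ub = [], []
    for j in range(K):                               # coverage constraints of C_theta
        row = [-1.0 if j in S[i] else 0.0 for i in range(K)]
        rhs = 2 * sigma**2 / (delta_i2**2 if j == i1 else delta[j]**2)
        A_ub.append(row)                             # -sum_{i: j in S_i} c_i <= -rhs
        b_ub.append(-rhs)

    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * K)
    return res.fun, res.x

# Example: three spots, each revealing itself and one neighbour.
value, c = asymptotic_constant([1.0, 0.7, 0.5], S=[{0, 1}, {1, 2}, {2, 0}], sigma=1.0)
```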

Lower Bound for the Gaussian Case
Given some B > 0, for i ≠ i_1 let
ε_i = (1/8) √(eB/T) · e^{W(T Δ_i² / (16 eB))} + Δ_i  and  m_i(θ, B) = (1/ε_i²) log( T(ε_i − Δ_i) / (8B) )
(for i = i_1, replace Δ_i by Δ_{i_2}). Let
C_{θ,B} = { c ∈ C^{R_+}_T : ∑_{j=1}^K c_j / σ²_{j,i} ≥ m_i(θ, B) for all i ∈ [K] },
where W(·) is the Lambert W function, satisfying W(x) e^{W(x)} = x.

Theorem (Finite-time problem-dependent lower bound)
For any algorithm such that sup_{λ∈Θ} R_T(λ) ≤ B, for any T large enough and any θ in the interior of Θ,
R_T(θ) ≥ b(θ, B) = min_{c∈C_{θ,B}} ∑_{i≠i_1} c_i Δ_i.

Recovering the Asymptotic Lower Bound
Theorem (Finite-time problem-dependent lower bound)
For any algorithm such that sup_{λ∈Θ} R_T(λ) ≤ B, we have, for any θ ∈ Θ,
R_T(θ) ≥ b(θ, B) = min_{c∈C_{θ,B}} ∑_{i≠i_1} c_i Δ_i.   (*)

Recall the asymptotic lower bound:
lim inf_{T→∞} R_T(θ)/log T ≥ inf_{c∈C_θ} ∑_{i≠i_1} c_i Δ_i.   (**)

For any B = αT^β with α > 0 and β ∈ (0, 1), we have C_{θ,B} → ((1−β) log T / 2) · C_θ as T → ∞. Hence, (**) is recovered from (*).

Minimax Lower Bounds (Alon et al., 2015)
Each i ∈ [K] is associated with an observation set S_i ⊆ [K]: for j ∈ S_i, σ_{ij} = σ; for j ∉ S_i, σ_{ij} = +∞.
Assume Σ is observable: every action can be observed, i.e., for all i there exists j such that i ∈ S_j.
An action i is strongly observable if it is either self-observable or observable under every other action; otherwise, the action is said to be weakly observable.
Σ is strongly observable if all actions are strongly observable.
Σ is weakly observable if it is observable but not strongly observable.

Minimax Lower Bounds for Graph Feedback - Strong Observability
σ_{i,j} ∈ {σ, +∞} with σ = 1, Θ = [0, 1]^K; S_i = {j : σ_{i,j} = σ}.
A set A ⊆ [K] is independent in Σ if for any i ∈ A, S_i ∩ A ⊆ {i}: choosing i ∈ A gives no information about any j ≠ i, j ∈ A.
Independence number of Σ: κ(Σ) = max{ |A| : A ⊆ [K] is independent in Σ }.

Theorem (Mannor & Shamir (2011), Alon et al. (2015))
Let Σ be strongly observable. Then sup_{θ∈Θ} R_T(θ) ≥ c √(κ(Σ) T).
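
A minimal sketch (an assumed helper, not from the paper) computing the independence number κ(Σ) by brute force from the observation sets S_i.

```python
from itertools import combinations

def independence_number(S):
    K = len(S)
    for size in range(K, 0, -1):                 # try the largest sets first
        for A in combinations(range(K), size):
            # A is independent if playing i in A reveals nothing about other j in A.
            if all(S[i] & set(A) <= {i} for i in A):
                return size
    return 0

# Example: bandit feedback (each action observes only itself) has kappa = K.
print(independence_number([{0}, {1}, {2}]))      # -> 3
```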

Minimax Lower Bounds for Graph Feedback - Weak Observability
σ_{i,j} ∈ {σ, +∞} with σ = 1, Θ = [0, 1]^K; S_i = {j : σ_{i,j} = σ}.
For A, A' ⊆ [K], A dominates A' if for any j ∈ A' there exists i ∈ A such that j ∈ S_i: any j ∈ A' can be observed through some i ∈ A.
W(Σ): the set of all weakly observable actions.
Weak domination number: ρ(Σ) = min{ |A| : A dominates W(Σ) }.

Theorem (Mannor & Shamir (2011), Alon et al. (2015))
Let Σ be weakly observable. Then sup_{θ∈Θ} R_T(θ) ≥ c (log K)^{-2/3} ρ(Σ)^{1/3} T^{2/3}.
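
A minimal sketch (an assumed helper, not from the paper) computing the set of weakly observable actions and the weak domination number ρ(Σ) by brute force.

```python
from itertools import combinations

def weakly_observable(S):
    K = len(S)
    strong = {i for i in range(K)
              if i in S[i] or all(i in S[j] for j in range(K) if j != i)}
    # not strongly observable (weakly observable whenever Sigma is observable)
    return set(range(K)) - strong

def weak_domination_number(S):
    W = weakly_observable(S)
    if not W:
        return 0
    K = len(S)
    for size in range(1, K + 1):                 # smallest dominating set first
        for A in combinations(range(K), size):
            if all(any(j in S[i] for i in A) for j in W):
                return size
    return None                                  # W is not dominated: Sigma not observable

# Example: a "revealing" action 0 observes everything, the others observe nothing.
S = [{0, 1, 2}, set(), set()]
print(weakly_observable(S), weak_domination_number(S))   # -> {1, 2} 1
```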

Recovering Minimax Lower Bounds
Theorem (Finite-time problem-dependent lower bound)
For any algorithm such that sup_{λ∈Θ} R_T(λ) ≤ B, we have, for any θ ∈ Θ,
R_T(θ) ≥ b(θ, B) = min_{c∈C_{θ,B}} ∑_{i≠i_1} c_i Δ_i.   (*)

If Σ is strongly observable, choosing B = σ √(κ(Σ)T) / (8√e) gives sup_{θ∈Θ} b(θ, B) ≥ B for T large enough.
If Σ is weakly observable, choosing B = (ρ(Σ)D)^{1/3} (σT)^{2/3} / (73 (log K)^{2/3}) gives sup_{θ∈Θ} b(θ, B) ≥ B.

Outline
1. Introduction
2. Has This Been Done Before?
3. Results: Lower Bounds; Algorithms/Upper Bounds
4. Summary

Upcoming Attractions
- Just for feedback graphs;
- Near asymptotically optimal algorithm (new);
- A single near-minimax optimal algorithm with logarithmic asymptotic regret (new).

Asymptotically (Almost) Optimal Algorithm
Recall C_θ = { c ∈ [0,∞)^K : ∑_{i: j∈S_i} c_i ≥ 2σ²/Δ_j² for all j ≠ i_1, and ∑_{i: i_1∈S_i} c_i ≥ 2σ²/Δ_{i_2}² }.
Let c(θ) = argmin_{c∈C_θ} ∑_{i≠i_1} c_i Δ_i.
Goal: find an algorithm that achieves O( (∑_{i≠i_1} c_i(θ) Δ_i) log T ) regret.
(Simple) idea, borrowed from Magureanu et al. (2014): use forced exploration to ensure that c(θ) is well approximated by c(θ̂_t) uniformly in time, while paying only a constant price in total.
The exploration schedule β(·): N → R is chosen to be sublinear. Magureanu et al. (2014)'s linear schedule β(n) = βn requires choosing a parameter of their algorithm based on the unknown Δ_min; the sublinear schedule avoids this.

Asymptotically (Almost) Optimal Algorithm (flow of one round)
At time t, with estimate θ̂_t, exploration counter n_e(t), play counts plays(t) and observation counts obs(t):
- If plays(t) ∈ 4α log(t) · C_{θ̂_t} (the play counts already satisfy the estimated constraints): Exploitation - play I_t := i_1(θ̂_t); set n_e(t+1) := n_e(t).
- Otherwise:
  - If min_i obs_i(t) < β(n_e(t))/K: forced exploration - play an I_t such that argmin_i obs_i(t) ∈ S_{I_t};
  - else: play I_t = i for some i with plays_i(t) < c_i(θ̂_t) · 4α log t.
  - Set n_e(t+1) := n_e(t) + 1.
- Update θ̂_t to θ̂_{t+1} and set t := t + 1.
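
A minimal sketch (not the authors' code) of this forced-exploration scheme. The `env.step` interface, the `allocation(theta_hat)` helper returning c(θ̂) (e.g. built like the linprog sketch earlier), the default α, and the default schedule β(n) = n^{1/2}/2 are all illustrative assumptions.

```python
import numpy as np

def run(env, allocation, S, T, alpha=2.5, beta=lambda n: 0.5 * n**0.5):
    K = len(S)
    plays = np.zeros(K)                       # times each action was played
    obs = np.zeros(K)                         # observations gathered for each action
    sums = np.zeros(K)                        # running sums of those observations
    n_e = 0                                   # number of exploration rounds so far
    for t in range(1, T + 1):
        theta_hat = np.where(obs > 0, sums / np.maximum(obs, 1), 0.0)
        c = allocation(theta_hat)             # target allocation c(theta_hat)
        if np.all(plays >= 4 * alpha * np.log(t) * c):
            i_t = int(np.argmax(theta_hat))                   # exploit: i_1(theta_hat)
        else:
            n_e += 1
            if obs.min() < beta(n_e) / K:                     # forced exploration
                j = int(np.argmin(obs))
                i_t = next(i for i in range(K) if j in S[i])
            else:                                             # track the plan: an under-sampled action
                i_t = int(np.argmax(4 * alpha * np.log(t) * c - plays))
        x = env.step(i_t)                     # length-K noisy observation vector
        plays[i_t] += 1
        observed = [j for j in range(K) if j in S[i_t]]
        obs[observed] += 1
        sums[observed] += x[observed]
```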

Asymptotically Almost Optimal Algorithm - Upper Bound
Upper bound: for any α > 2, β(n) = a n^b with a ∈ (0, 1/2] and b ∈ (0, 1), and for any θ ∈ Θ such that c(θ) is unique,
lim sup_{T→∞} R_T(θ)/log T ≤ 4α ∑_{i≠i_1} c_i(θ) Δ_i.

Near Minimax Optimal Algorithm
Successive elimination: maintain a set of possibly optimal ("good") actions until only one action remains.
In each phase r:
- Explore all good actions by playing only good actions (exploitation).
- Due to weak observability, some actions can sometimes only be explored by playing bad actions (exploration-exploitation trade-off). A sublinear function γ controls the amount of exploration done with bad actions.
The idea is similar to the CBP algorithm of Bartók et al. (2014). Here a better exploration method exploits the feedback structure, which leads to the optimal dependence on quantities such as ρ(Σ) and κ(Σ).

Near Minimax Optimal Algorithm - Preliminaries
For E, G ⊆ [K], let c(E, G) = argmax_{c ∈ Simplex(E)} min_{i∈G} ∑_{j: i∈S_j} c_j: the optimal way of using actions in E to uniformly explore actions in G.
Let c*(E, G) = min_{i∈G} ∑_{j: i∈S_j} c_j(E, G): the least coverage.
For any A ⊆ [K] with |A| ≥ 2, let A^S = {i ∈ A : ∃ j ∈ A, i ∈ S_j} denote the set of actions that can be observed while using actions of A only, and A^W = A \ A^S (note: actions in A^W must be weakly observable).
Exploration schedule for A^W: γ(r) = (σ α_r t_r / D)^{2/3}, where α_r = min_{1≤s≤r, A^W_s ≠ ∅} c*([K], A^W_s).
At round r, define the confidence width g_{r,i}(δ) = σ √( 2 log(8K² r³/δ) / obs_i(r) ), where obs_i(r) is the number of observations gained for action i so far.
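
A minimal sketch (an assumed helper, not from the paper) of computing c(E, G) and the least coverage c*(E, G) as a max-min linear program; the function name and the scipy-based formulation are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def least_coverage(E, G, S, K):
    """Distribution over E maximizing min_{i in G} sum_{j: i in S_j} c_j."""
    E, G = list(E), list(G)
    n = len(E)
    cost = np.zeros(n + 1)
    cost[-1] = -1.0                                  # variables (c_j for j in E, m); maximize m
    A_ub, b_ub = [], []
    for i in G:                                      # m - sum_{j in E: i in S_j} c_j <= 0
        row = np.zeros(n + 1)
        for k, j in enumerate(E):
            if i in S[j]:
                row[k] = -1.0
        row[-1] = 1.0
        A_ub.append(row)
        b_ub.append(0.0)
    A_eq = [np.append(np.ones(n), 0.0)]              # the c_j over E sum to 1
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (n + 1))
    c = np.zeros(K)
    c[E] = res.x[:n]
    return c, res.x[-1]                              # the distribution and the least coverage
```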

Near Minimax Optimal Algorithm (flow of one phase)
At phase r, with active set A_r, estimates θ̂_r, and observation counts obs(r):
- If A^W_r ≠ ∅ and the weakly observable active actions have fewer observations than the strongly observable ones and fewer than γ(r): use c_r = c([K], A^W_r) (explore A^W_r with all actions);
- otherwise: use c_r = c(A_r, A^S_r) (explore the active set with active actions only).
- Play the actions prescribed by c_r; set t_{r+1} := t_r + the number of plays made.
- Update θ̂_r to θ̂_{r+1} and eliminate: A_{r+1} := {i ∈ A_r : UCB^δ_{r+1,i} ≥ max_{j∈A_r} LCB^δ_{r+1,j}}.
- If |A_{r+1}| > 1, continue with phase r + 1; otherwise keep playing the remaining action.
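
A minimal sketch (not the authors' code) of the elimination step above, using the confidence width g_{r,i}(δ) from the preliminaries; the function name and the handling of zero observation counts are assumptions.

```python
import numpy as np

def eliminate(active, mean, obs, r, K, sigma, delta):
    """Keep the actions in `active` whose UCB is not below the best LCB."""
    width = sigma * np.sqrt(2 * np.log(8 * K**2 * r**3 / delta) / np.maximum(obs, 1))
    ucb, lcb = mean + width, mean - width
    best_lcb = max(lcb[i] for i in active)
    return {i for i in active if ucb[i] >= best_lcb}
```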

Near Minimax Optimal Algorithm - Upper Bound
Theorem. With δ = 1/T, for any θ ∈ Θ:
- If Σ is strongly observable, R_T(θ) = O( σ log(K) √(κ(Σ) T log T) ).
- If Σ is weakly observable, R_T(θ) = O( (ρ(Σ)D)^{1/3} (σT)^{2/3} √(log(KT)) ).
- If we view Δ_min as a constant and only consider the dependence on T, R_T(θ) = O(log^{3/2} T).

Outline
1. Introduction
2. Has This Been Done Before?
3. Results: Lower Bounds; Algorithms/Upper Bounds
4. Summary

Conclusions
- Online learning with Gaussian payoffs and side observations;
- Smooth interpolation between full-information and bandit settings;
- First non-asymptotic, problem-dependent lower bounds in regret minimization;
- Algorithms for σ_{i,j} ∈ {σ, +∞}:
  - An asymptotically near-optimal algorithm; the first for learning with feedback graphs to achieve this;
  - A single near-minimax algorithm regardless of observability, with poly-logarithmic asymptotic regret; also a first for learning with feedback graphs:
    - Mannor & Shamir (2011), Alon et al. (2013) and Alon et al. (2015): no logarithmic asymptotic regret, minimax algorithms only;
    - Caron et al. (2012) and Buccapatnam et al. (2014): logarithmic asymptotics, but no near-minimax finite-time regret.

Open Problems
- Remove the assumption that c(θ) is unique for the optimality of the first algorithm;
- Remove the log^{1/2} T overhead of the second algorithm;
- A single algorithm that achieves both the asymptotic and the minimax optimal bounds up to constant factors; for bandits this was achieved (very) recently (Lattimore, 2015);
- An algorithm for general Σ;
- An algorithm for unknown Σ;
- General tightness of the new lower bound;
- Algorithms for the (general) stochastic partial monitoring setting.

References
Agrawal, R., Teneketzis, D., and Anantharam, V. Asymptotically efficient adaptive allocation schemes for controlled i.i.d. processes: Finite parameter space. IEEE Transactions on Automatic Control, 34, 1989.
Alon, N., Cesa-Bianchi, N., Gentile, C., and Mansour, Y. From bandits to experts: A tale of domination and independence. In NIPS, 2013.
Alon, N., Cesa-Bianchi, N., Gentile, C., and Mansour, Y. Online learning with feedback graphs: beyond bandits. In COLT, 2015.
Bartók, G., Pál, D., and Szepesvári, Cs. Minimax regret of finite partial-monitoring games in stochastic environments. In COLT, July 2011.
Bartók, G., Foster, D. P., Pál, D., Rakhlin, A., and Szepesvári, Cs. Partial monitoring - classification, regret bounds, and algorithms. Mathematics of Operations Research, 39, 2014.
Buccapatnam, S., Eryilmaz, A., and Shroff, N. B. Stochastic bandits with side observations on networks. SIGMETRICS Performance Evaluation Review, 42(1), June 2014.
Caron, S., Kveton, B., Lelarge, M., and Bhagat, S. Leveraging side observations in stochastic bandits. In UAI, 2012.
Graves, T. L. and Lai, T. L. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3), 1997.
Kocák, T., Neu, G., Valko, M., and Munos, R. Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems 27 (NIPS), 2014.
Lattimore, T. Optimally confident UCB: Improved regret for finite-armed bandits. arXiv preprint, 2015.
Magureanu, S., Combes, R., and Proutiere, A. Lipschitz bandits: Regret lower bounds and optimal algorithms. In COLT, 2014.
Mannor, S. and Shamir, O. From bandits to experts: on the value of side-observations. In NIPS, 2011.
Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58, 1952.


More information

THE first formalization of the multi-armed bandit problem

THE first formalization of the multi-armed bandit problem EDIC RESEARCH PROPOSAL 1 Multi-armed Bandits in a Network Farnood Salehi I&C, EPFL Abstract The multi-armed bandit problem is a sequential decision problem in which we have several options (arms). We can

More information

On Regret-Optimal Learning in Decentralized Multi-player Multi-armed Bandits

On Regret-Optimal Learning in Decentralized Multi-player Multi-armed Bandits 1 On Regret-Optimal Learning in Decentralized Multi-player Multi-armed Bandits Naumaan Nayyar, Dileep Kalathil and Rahul Jain Abstract We consider the problem of learning in single-player and multiplayer

More information

arxiv: v1 [cs.lg] 7 Sep 2018

arxiv: v1 [cs.lg] 7 Sep 2018 Analysis of Thompson Sampling for Combinatorial Multi-armed Bandit with Probabilistically Triggered Arms Alihan Hüyük Bilkent University Cem Tekin Bilkent University arxiv:809.02707v [cs.lg] 7 Sep 208

More information

The No-Regret Framework for Online Learning

The No-Regret Framework for Online Learning The No-Regret Framework for Online Learning A Tutorial Introduction Nahum Shimkin Technion Israel Institute of Technology Haifa, Israel Stochastic Processes in Engineering IIT Mumbai, March 2013 N. Shimkin,

More information

Performance and Convergence of Multi-user Online Learning

Performance and Convergence of Multi-user Online Learning Performance and Convergence of Multi-user Online Learning Cem Tekin, Mingyan Liu Department of Electrical Engineering and Computer Science University of Michigan, Ann Arbor, Michigan, 4809-222 Email: {cmtkn,

More information

Regional Multi-Armed Bandits

Regional Multi-Armed Bandits School of Information Science and Technology University of Science and Technology of China {wzy43, zrd7}@mail.ustc.edu.cn, congshen@ustc.edu.cn Abstract We consider a variant of the classic multiarmed

More information

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Sébastien Bubeck Theory Group

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Sébastien Bubeck Theory Group Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems Sébastien Bubeck Theory Group Part 1: i.i.d., adversarial, and Bayesian bandit models i.i.d. multi-armed bandit, Robbins [1952]

More information

The multi armed-bandit problem

The multi armed-bandit problem The multi armed-bandit problem (with covariates if we have time) Vianney Perchet & Philippe Rigollet LPMA Université Paris Diderot ORFE Princeton University Algorithms and Dynamics for Games and Optimization

More information

Lecture 4 January 23

Lecture 4 January 23 STAT 263/363: Experimental Design Winter 2016/17 Lecture 4 January 23 Lecturer: Art B. Owen Scribe: Zachary del Rosario 4.1 Bandits Bandits are a form of online (adaptive) experiments; i.e. samples are

More information

A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem

A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem Fang Liu and Joohyun Lee and Ness Shroff The Ohio State University Columbus, Ohio 43210 {liu.3977, lee.7119, shroff.11}@osu.edu

More information

arxiv: v1 [cs.lg] 23 Jan 2019

arxiv: v1 [cs.lg] 23 Jan 2019 Cooperative Online Learning: Keeping your Neighbors Updated Nicolò Cesa-Bianchi, Tommaso R. Cesari, and Claire Monteleoni 2 Dipartimento di Informatica, Università degli Studi di Milano, Italy 2 Department

More information

New bounds on the price of bandit feedback for mistake-bounded online multiclass learning

New bounds on the price of bandit feedback for mistake-bounded online multiclass learning Journal of Machine Learning Research 1 8, 2017 Algorithmic Learning Theory 2017 New bounds on the price of bandit feedback for mistake-bounded online multiclass learning Philip M. Long Google, 1600 Amphitheatre

More information

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization.

More information

Monte-Carlo Tree Search by. MCTS by Best Arm Identification

Monte-Carlo Tree Search by. MCTS by Best Arm Identification Monte-Carlo Tree Search by Best Arm Identification and Wouter M. Koolen Inria Lille SequeL team CWI Machine Learning Group Inria-CWI workshop Amsterdam, September 20th, 2017 Part of...... a new Associate

More information

On Sequential Elimination Algorithms for Best-Arm Identification in Multi-Armed Bandits

On Sequential Elimination Algorithms for Best-Arm Identification in Multi-Armed Bandits 1 On Sequential Elimination Algorithms for Best-Arm Identification in Multi-Armed Bandits Shahin Shahrampour, Mohammad Noshad, Vahid Tarokh arxiv:169266v2 [statml] 13 Apr 217 Abstract We consider the best-arm

More information

Bandits : optimality in exponential families

Bandits : optimality in exponential families Bandits : optimality in exponential families Odalric-Ambrym Maillard IHES, January 2016 Odalric-Ambrym Maillard Bandits 1 / 40 Introduction 1 Stochastic multi-armed bandits 2 Boundary crossing probabilities

More information

Learning Algorithms for Minimizing Queue Length Regret

Learning Algorithms for Minimizing Queue Length Regret Learning Algorithms for Minimizing Queue Length Regret Thomas Stahlbuhk Massachusetts Institute of Technology Cambridge, MA Brooke Shrader MIT Lincoln Laboratory Lexington, MA Eytan Modiano Massachusetts

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

Online Sparse Linear Regression

Online Sparse Linear Regression JMLR: Workshop and Conference Proceedings vol 49:1 11, 2016 Online Sparse Linear Regression Dean Foster Amazon DEAN@FOSTER.NET Satyen Kale Yahoo Research SATYEN@YAHOO-INC.COM Howard Karloff HOWARD@CC.GATECH.EDU

More information

Unsupervised Sequential Sensor Acquisition

Unsupervised Sequential Sensor Acquisition Unsupervised Sequential Sensor Acquisition Manjesh K. Hanawal Csaba Szepesvári Venkatesh Saligrama Dept. of IEOR IIT Bombay, India mhanawal@iitb.ac.in Dept. of Computing Sciences University of Alberta,

More information

Stochastic bandits: Explore-First and UCB

Stochastic bandits: Explore-First and UCB CSE599s, Spring 2014, Online Learning Lecture 15-2/19/2014 Stochastic bandits: Explore-First and UCB Lecturer: Brendan McMahan or Ofer Dekel Scribe: Javad Hosseini In this lecture, we like to answer this

More information

Online Learning and Online Convex Optimization

Online Learning and Online Convex Optimization Online Learning and Online Convex Optimization Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Learning 1 / 49 Summary 1 My beautiful regret 2 A supposedly fun game

More information

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for Big data 1 / 22

More information

Dynamic resource allocation: Bandit problems and extensions

Dynamic resource allocation: Bandit problems and extensions Dynamic resource allocation: Bandit problems and extensions Aurélien Garivier Institut de Mathématiques de Toulouse MAD Seminar, Université Toulouse 1 October 3rd, 2014 The Bandit Model Roadmap 1 The Bandit

More information

Corrupt Bandits. Abstract

Corrupt Bandits. Abstract Corrupt Bandits Pratik Gajane Orange labs/inria SequeL Tanguy Urvoy Orange labs Emilie Kaufmann INRIA SequeL pratik.gajane@inria.fr tanguy.urvoy@orange.com emilie.kaufmann@inria.fr Editor: Abstract We

More information

Complex Bandit Problems and Thompson Sampling

Complex Bandit Problems and Thompson Sampling Complex Bandit Problems and Aditya Gopalan Department of Electrical Engineering Technion, Israel aditya@ee.technion.ac.il Shie Mannor Department of Electrical Engineering Technion, Israel shie@ee.technion.ac.il

More information