Online Learning with Feedback Graphs


Online Learning with Feedback Graphs
Nicolò Cesa-Bianchi, Università degli Studi di Milano
Joint work with: Noga Alon (Tel-Aviv University), Ofer Dekel (Microsoft Research), Tomer Koren (Technion and Microsoft Research)
Also: Claudio Gentile, Shie Mannor, Yishay Mansour, Ohad Shamir
N. Cesa-Bianchi (UNIMI) Online Learning with Feedback Graphs 1 / 20

Theory of repeated games
James Hannan (1922–2010), David Blackwell
Learning to play a game (1956)
Play a game repeatedly against a possibly suboptimal opponent

Prediction with expert advice
N actions
[Figure: a row of N actions whose losses are hidden from the player]
For t = 1, 2, ...
1. A loss $\ell_t(i) \in [0, 1]$ is assigned to every action $i = 1, \dots, N$ (hidden from the player)
2. The player picks an action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
3. The player gets feedback information: $\ell_t = (\ell_t(1), \dots, \ell_t(N))$

Regret
The loss process $\langle \ell_t \rangle_{t \ge 1}$ is deterministic and unknown to the (randomized) player $I_1, I_2, \dots$
Regret of player $I_1, I_2, \dots$:
$$R_T \stackrel{\text{def}}{=} \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(I_t)\right] - \min_{i=1,\dots,N} \sum_{t=1}^{T} \ell_t(i) \qquad \text{want: } R_T = o(T)$$
Asymptotic lower bound for the experts game:
$$R_T = \bigl(1 - o(1)\bigr)\sqrt{\frac{T \ln N}{2}}$$
The proof uses an i.i.d. stochastic loss process.
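As a concrete reading of the regret definition, the following minimal sketch computes the realized regret of one sequence of plays against the best fixed action in hindsight. It is only an illustration: the function name `regret` and the toy loss table are assumptions, and $R_T$ on the slide is the expectation of this quantity over the player's randomization.

```python
def regret(losses, plays):
    """Realized regret of `plays` against the best fixed action in hindsight.

    losses[t][i] is the loss of action i at round t (values in [0, 1]);
    plays[t] is the index of the action chosen at round t.
    """
    T = len(losses)
    n_actions = len(losses[0])
    # Total loss actually incurred by the player.
    player_loss = sum(losses[t][plays[t]] for t in range(T))
    # Total loss of the single best action over the same rounds.
    best_fixed = min(sum(losses[t][i] for t in range(T)) for i in range(n_actions))
    return player_loss - best_fixed


# Hypothetical toy instance: 3 rounds, 2 actions.
losses = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(regret(losses, [1, 0, 1]))  # player loses 3.0, best fixed action loses 1.0
```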

Exponentially weighted forecaster
At time t pick action $I_t = i$ with probability proportional to
$$\exp\left(-\eta \sum_{s=1}^{t-1} \ell_s(i)\right)$$
The sum in the exponent is the total loss of action i up to now.
Regret bound: if $\eta = \sqrt{(8 \ln N)/T}$ then $R_T \le \sqrt{\dfrac{T \ln N}{2}}$
This matches the asymptotic lower bound, including constants.
The dynamic choice $\eta_t = \sqrt{(8 \ln N)/t}$ only loses small constants.
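The forecaster can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper names, the uniformly random loss table, and the seed are assumptions, and the learning rate uses the standard tuning $\eta = \sqrt{8 \ln N / T}$ for losses in [0, 1].

```python
import math
import random


def hedge_probabilities(cumulative_losses, eta):
    """Distribution proportional to exp(-eta * cumulative loss)."""
    m = min(cumulative_losses)  # shift by the minimum for numerical stability
    weights = [math.exp(-eta * (c - m)) for c in cumulative_losses]
    total = sum(weights)
    return [w / total for w in weights]


def run_hedge(losses, eta, rng):
    """Play the experts game on a fixed loss table and return the realized regret."""
    n_actions = len(losses[0])
    cumulative = [0.0] * n_actions
    player_loss = 0.0
    for loss_t in losses:
        probs = hedge_probabilities(cumulative, eta)
        action = rng.choices(range(n_actions), weights=probs)[0]
        player_loss += loss_t[action]
        # Full information: the loss of every action is revealed.
        for i in range(n_actions):
            cumulative[i] += loss_t[i]
    return player_loss - min(cumulative)


T, N = 1000, 5
rng = random.Random(0)  # hypothetical seed, for reproducibility only
losses = [[rng.random() for _ in range(N)] for _ in range(T)]
eta = math.sqrt(8 * math.log(N) / T)
print(run_hedge(losses, eta, rng), math.sqrt(T * math.log(N) / 2))
```

The second printed number is the bound on the expected regret; a single run can land on either side of it, since the bound holds in expectation.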

The bandit problem: playing an unknown game
N actions
[Figure: a row of N actions with hidden losses; only the loss of the played action (0.3) is revealed]
For t = 1, 2, ...
1. A loss $\ell_t(i) \in [0, 1]$ is assigned to every action $i = 1, \dots, N$ (hidden from the player)
2. The player picks an action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
3. The player gets feedback information: only $\ell_t(I_t)$ is revealed
Many applications: ad placement, recommender systems, online auctions, ...

Bandits as an instance of a general feedback model
Besides observing the loss of the played action, the player also observes the loss of some other actions.
For example, a recommender system can infer how the user would have reacted had similar products been recommended.
However, we do not assume that observability between actions implies similarity between losses.
How does the observability structure influence regret?

Feedback graph
[Figure: feedback graph over the actions; the drawing was not recovered]

Recovering expert and bandit settings
Experts: clique
Bandits: empty graph
[Figure: the clique and the empty graph over the actions; in the bandit case a single loss value (0.3) is shown as revealed]
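To make the two special cases concrete, here is a small sketch of how an undirected feedback graph determines what the player sees. The adjacency-dictionary encoding and the name `observed_actions` are illustrative assumptions; the only modeling choice taken from the slides is that the played action's own loss is always observed.

```python
def observed_actions(played, neighbors):
    """Actions whose loss is revealed when `played` is chosen.

    `neighbors` maps each action to its neighbors in an undirected feedback
    graph; the played action itself is always observed.
    """
    return {played} | set(neighbors[played])


N = 4
# Experts: every action is adjacent to every other action (clique).
clique = {i: [j for j in range(N) if j != i] for i in range(N)}
# Bandits: no edges at all (empty graph).
empty = {i: [] for i in range(N)}

print(observed_actions(2, clique))  # all N losses are revealed
print(observed_actions(2, empty))   # only the played action's loss is revealed
```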

Exponentially weighted forecaster: reprise
Player's strategy:
$$P_t(I_t = i) \propto \exp\left(-\eta \sum_{s=1}^{t-1} \hat\ell_s(i)\right), \qquad i = 1, \dots, N$$
$$\hat\ell_t(i) = \begin{cases} \dfrac{\ell_t(i)}{P_t(\ell_t(i) \text{ observed})} & \text{if } \ell_t(i) \text{ is observed} \\ 0 & \text{otherwise} \end{cases}$$
Importance sampling estimator:
$\mathbb{E}_t\bigl[\hat\ell_t(i)\bigr] = \ell_t(i)$ (unbiasedness)
$\mathbb{E}_t\bigl[\hat\ell_t(i)^2\bigr] = \dfrac{\ell_t(i)^2}{P_t(\ell_t(i) \text{ observed})}$ (variance control)
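The unbiasedness claim can be checked exactly on a small graph by averaging the estimator over every possible played action. The sketch below assumes a hypothetical encoding in which `in_neighbors[i]` is the set of actions whose play reveals the loss of i (for an undirected graph, i itself plus its neighbors); the path graph, probabilities, and losses are made-up test values.

```python
def observation_probs(probs, in_neighbors):
    """q[i] = P_t(loss of i observed) = sum of p[j] over actions j that reveal i."""
    return [sum(probs[j] for j in in_neighbors[i]) for i in range(len(probs))]


def loss_estimates(loss_t, played, probs, in_neighbors):
    """Importance-weighted estimates: l_t(i) / q[i] if observed, 0 otherwise."""
    q = observation_probs(probs, in_neighbors)
    return [loss_t[i] / q[i] if played in in_neighbors[i] else 0.0
            for i in range(len(probs))]


def expected_estimate(loss_t, probs, in_neighbors):
    """Exact expectation of the estimates over the random played action."""
    n = len(probs)
    expectation = [0.0] * n
    for played in range(n):
        est = loss_estimates(loss_t, played, probs, in_neighbors)
        for i in range(n):
            expectation[i] += probs[played] * est[i]
    return expectation


# Path graph 0 - 1 - 2, with every action also observing itself.
in_neighbors = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2}}
probs = [0.2, 0.3, 0.5]
loss_t = [0.4, 0.9, 0.1]
print(expected_estimate(loss_t, probs, in_neighbors))  # matches loss_t up to rounding
```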

Independence number $\alpha(G)$
The size of the largest independent set of G
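For small graphs the independence number can be computed by exhaustive search. The sketch below is only meant to make the definition concrete: computing $\alpha(G)$ is NP-hard in general, this brute force takes exponential time, and the function name and edge-list encoding are assumptions.

```python
from itertools import combinations


def independence_number(n, edges):
    """Size of the largest independent set of an undirected graph on vertices 0..n-1."""
    edge_set = {frozenset(e) for e in edges}
    # Try candidate sizes from largest to smallest; the first hit is the answer.
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if all(frozenset(pair) not in edge_set
                   for pair in combinations(subset, 2)):
                return size
    return 0


clique_edges = list(combinations(range(4), 2))
cycle_edges = [(i, (i + 1) % 5) for i in range(5)]
print(independence_number(4, clique_edges))  # experts: clique
print(independence_number(4, []))            # bandits: empty graph
print(independence_number(5, cycle_edges))   # 5-cycle
```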

Regret bounds
Analysis (undirected graphs):
$$R_T \le \frac{\ln N}{\eta} + \frac{\eta}{2}\, \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{N} \frac{P_t(i \text{ is played})}{P_t(\ell_t(i) \text{ is observed})}\right]$$
Lemma: for any undirected graph $G = (V, E)$ and for any probability assignment $p_1, \dots, p_N$ over its vertices,
$$\sum_{i=1}^{N} \frac{p_i}{\underbrace{p_i + \sum_{j \in N_G(i)} p_j}_{P_t(\text{loss of } i \text{ observed})}} \le \alpha(G)$$
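The lemma can be sanity-checked numerically. The sketch below evaluates its left-hand side on the 5-cycle, whose independence number is 2, for many random probability assignments. The choice of graph, the sampling scheme, and the seed are assumptions made for illustration; a random check is of course not a proof.

```python
import random


def lemma_lhs(p, neighbors):
    """sum_i p_i / (p_i + sum_{j in N_G(i)} p_j) for an undirected graph."""
    return sum(p[i] / (p[i] + sum(p[j] for j in neighbors[i]))
               for i in range(len(p)))


# 5-cycle: vertex i is adjacent to i-1 and i+1 (mod 5); its independence number is 2.
cycle = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
alpha_cycle = 2

rng = random.Random(0)  # hypothetical seed
worst = 0.0
for _ in range(10_000):
    raw = [rng.uniform(0.01, 1.0) for _ in range(5)]
    total = sum(raw)
    p = [x / total for x in raw]  # strictly positive probability assignment
    worst = max(worst, lemma_lhs(p, cycle))

print(worst <= alpha_cycle)
```

On the empty graph every term equals 1, so the left-hand side is exactly N, which shows the lemma cannot be improved in the bandit case.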

Regret bounds
Analysis (undirected graphs):
$$R_T \le \frac{\ln N}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \alpha(G) = O\left(\sqrt{T \alpha(G) \ln N}\right) \quad \text{by choosing } \eta \text{ appropriately}$$
Special cases:
Experts (clique): $\alpha(G) = 1$, so $R_T = O\left(\sqrt{T \ln N}\right)$
Bandits (empty graph): $\alpha(G) = N$, so $R_T = O\left(\sqrt{T N \ln N}\right)$
Minimax rate: the general bound is tight, $R_T = \Theta\left(\sqrt{T \alpha(G) \ln N}\right)$
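The choice of $\eta$ is left implicit on the slide. As a worked step, minimizing the right-hand side over $\eta$ gives the following; the symbols $f$ and $\eta^{\star}$ are introduced here only for the derivation.

```latex
\begin{align*}
f(\eta) &= \frac{\ln N}{\eta} + \frac{\eta}{2}\,T\,\alpha(G), \\
f'(\eta) &= -\frac{\ln N}{\eta^{2}} + \frac{T\,\alpha(G)}{2} = 0
  \;\Longrightarrow\; \eta^{\star} = \sqrt{\frac{2\ln N}{T\,\alpha(G)}}, \\
f(\eta^{\star}) &= \sqrt{\frac{T\,\alpha(G)\ln N}{2}} + \sqrt{\frac{T\,\alpha(G)\ln N}{2}}
  = \sqrt{2\,T\,\alpha(G)\ln N}.
\end{align*}
```

With $\alpha(G) = 1$ this gives $\sqrt{2T\ln N}$ for experts, and with $\alpha(G) = N$ it gives $\sqrt{2TN\ln N}$ for bandits, so both special cases follow from the same tuning up to the constant $\sqrt{2}$.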

More general feedback models
[Figure: two example graphs, labeled "Directed" and "Interventions"]

Old and new examples
[Figure: four example feedback graphs, labeled Experts, Bandits, Cops & Robbers, and Revealing Action]

Exponentially weighted forecaster with exploration
Player's strategy:
$$P_t(I_t = i) = \frac{1-\gamma}{Z_t} \exp\left(-\eta \sum_{s=1}^{t-1} \hat\ell_s(i)\right) + \gamma\, U_G(i), \qquad i = 1, \dots, N$$
$$\hat\ell_t(i) = \begin{cases} \dfrac{\ell_t(i)}{P_t(\ell_t(i) \text{ observed})} & \text{if } \ell_t(i) \text{ is observed} \\ 0 & \text{otherwise} \end{cases}$$
$U_G$ is a uniform distribution supported on a subset of V.
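The mixing step can be sketched as follows. This is an illustrative reading of the strategy, not the authors' code: the function name, the argument `explore_set` (the support of $U_G$), and the example values are assumptions, and $Z_t$ is taken to be the normalizer of the exponential weights.

```python
import math


def exploration_distribution(cumulative_estimates, eta, gamma, explore_set):
    """Mix exponential weights with a uniform distribution on `explore_set`."""
    n = len(cumulative_estimates)
    m = min(cumulative_estimates)
    # Shifting by the minimum rescales every weight and the normalizer by the
    # same factor, so the ratio weights[i] / z is unchanged.
    weights = [math.exp(-eta * (c - m)) for c in cumulative_estimates]
    z = sum(weights)
    uniform = [1.0 / len(explore_set) if i in explore_set else 0.0
               for i in range(n)]
    return [(1 - gamma) * weights[i] / z + gamma * uniform[i] for i in range(n)]


# Hypothetical example: 4 actions, no losses seen yet, exploration only on action 0.
probs = exploration_distribution([0.0, 0.0, 0.0, 0.0], eta=0.1, gamma=0.2,
                                 explore_set={0})
print(probs)
```

Exploring only on a small set keeps the observation probability of every weakly observable action bounded below by $\gamma / |\text{support}|$, which is what the importance-weighted estimator needs for variance control.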

A characterization of feedback graphs
A vertex of G is:
observable if it has at least one incoming edge (possibly a self-loop);
strongly observable if it has either a self-loop or incoming edges from all other vertices;
weakly observable if it is observable but not strongly observable.
[Figure: example graph on numbered vertices. Some vertex labels were lost in extraction: one vertex is not observable, vertices 2 and 5 are weakly observable, and vertex 4 (together with at least one other vertex) is strongly observable.]
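The three definitions translate directly into a short classification routine. In this sketch a directed edge (u, v) is assumed to mean that playing u reveals the loss of v; the function name and the example graphs are illustrative, not taken from the slides' figure.

```python
def classify_vertices(n, edges):
    """Label each vertex 0..n-1 of a directed feedback graph.

    edges are directed pairs (u, v): playing u reveals the loss of v.
    """
    in_neighbors = {v: set() for v in range(n)}
    for u, v in edges:
        in_neighbors[v].add(u)
    labels = {}
    for v in range(n):
        others = set(range(n)) - {v}
        if not in_neighbors[v]:
            labels[v] = "not observable"
        elif v in in_neighbors[v] or others <= in_neighbors[v]:
            # Self-loop, or incoming edges from all other vertices.
            labels[v] = "strongly observable"
        else:
            labels[v] = "weakly observable"
    return labels


# Revealing action: vertex 0 reveals everyone else but has no self-loop.
print(classify_vertices(3, [(0, 1), (0, 2)]))
# Loopless clique: every vertex is observed by all the others.
print(classify_vertices(3, [(u, v) for u in range(3) for v in range(3) if u != v]))
```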

Characterization of minimax rates
G is strongly observable: $R_T = \Theta\left(\sqrt{\alpha(G)\,T}\right)$; $U_G$ is uniform on V
G is weakly observable: $R_T = \Theta\left(T^{2/3}\,\delta(G)^{1/3}\right)$ for $T = \Omega(N^3)$; $U_G$ is uniform on a weakly dominating set
G is not observable: $R_T = \Theta(T)$
Weakly dominating set: $\delta(G)$ is the size of the smallest set that dominates all weakly observable nodes of G
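For small graphs, both the regime of G and the weak domination number $\delta(G)$ can be found by brute force. The sketch below assumes that a set dominates a vertex when it contains at least one of its in-neighbors, and that an edge (u, v) means playing u reveals the loss of v; all function names are hypothetical, and the search is exponential in N.

```python
from itertools import combinations


def in_neighbor_sets(n, edges):
    """edges are directed pairs (u, v): playing u reveals the loss of v."""
    in_nb = {v: set() for v in range(n)}
    for u, v in edges:
        in_nb[v].add(u)
    return in_nb


def is_strongly_observable(v, n, in_nb):
    others = set(range(n)) - {v}
    return v in in_nb[v] or (bool(in_nb[v]) and others <= in_nb[v])


def weak_domination_number(n, edges):
    """Size of the smallest set containing an in-neighbor of every weakly observable vertex."""
    in_nb = in_neighbor_sets(n, edges)
    weak = [v for v in range(n)
            if in_nb[v] and not is_strongly_observable(v, n, in_nb)]
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            chosen = set(subset)
            if all(in_nb[v] & chosen for v in weak):
                return size


def regime(n, edges):
    """Which of the three cases of the characterization G falls into."""
    in_nb = in_neighbor_sets(n, edges)
    if any(not in_nb[v] for v in range(n)):
        return "not observable"
    if all(is_strongly_observable(v, n, in_nb) for v in range(n)):
        return "strongly observable"
    return "weakly observable"


# Vertex 0 has a self-loop and reveals vertices 1, 2, 3, which have no loops.
hub = [(0, 0), (0, 1), (0, 2), (0, 3)]
print(regime(4, hub), weak_domination_number(4, hub))
```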

Some curious cases
Experts vs. Cops & Robbers: the presence of red loops does not affect the minimax regret, $R_T = \Theta\left(\sqrt{T \ln N}\right)$
Sharp transitions:
With red loop: strongly observable with $\alpha(G) = N - 1$, so $R_T = \Theta\left(\sqrt{NT}\right)$
Without red loop: weakly observable with $\delta(G) = 1$, so $R_T = \Theta\left(T^{2/3}\right)$ for $T = \Omega(N^3)$
[Figure: example graphs with self-loops drawn in red; the drawings were not recovered]

Final remarks
Theory extends to time-varying feedback graphs
In the strongly observable case, the algorithm can predict without knowing the graph
The entire framework is a special case of partial monitoring, but our rates exhibit sharp problem-dependent constants
Graph over actions: more interpretations
Relatedness (rather than observability) structure on loss assignment
Delay model for loss observations
