From Bandits to Experts: A Tale of Domination and Independence

Size: px

Start display at page:

Download "From Bandits to Experts: A Tale of Domination and Independence"

Hannah Snow
5 years ago
Views:

1 From Bandits to Experts: A Tale of Domination and Independence Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Domination and Independence 1 / 1

2 From Bandits to Experts: A Tale of Domination and Independence Nicolò Cesa-Bianchi Università degli Studi di Milano Joint work with: Noga Alon Ofer Dekel Tomer Koren N. Cesa-Bianchi (UNIMI) Domination and Independence 1 / 1

Theory of repeated games James Hannan (1922 2010) David

3 Theory of repeated games James Hannan ( ) David Blackwell ( ) Learning to play a game (1956) Play a game repeatedly against a possibly suboptimal opponent N. Cesa-Bianchi (UNIMI) Domination and Independence 2 / 1

4 Zero-sum 2-person games played more than once M 1 l(1, 1) l(1, 2)... 2 l(2, 1) l(2, 2) N N M known loss matrix over R Row player (player) has N actions Column player (opponent) has M actions For each game round t = 1, 2,... Player chooses action i t and opponent chooses action y t The player suffers loss l(i t, y t ) (= gain of opponent) Player can learn from opponent s history of past choices y 1,..., y t 1 N. Cesa-Bianchi (UNIMI) Domination and Independence 3 / 1

Prediction with expert advice t = 1 t = 2... 1 l 1 (1) l 2 (1)... 2 l 1 (2) l 2 (2).

5 Prediction with expert advice t = 1 t = l 1 (1) l 2 (1)... 2 l 1 (2) l 2 (2) N l 1 (N) l 2 (N) Volodya Vovk Manfred Warmuth Play an unknown loss matrix Opponent s moves y 1, y 2,... define a sequential prediction problem with a time-varying loss function l(i t, y t ) = l t (i t ) N. Cesa-Bianchi (UNIMI) Domination and Independence 4 / 1

6 Playing the experts game N actions????????? For t = 1, 2,... 1 Loss l t (i) [0, 1] is assigned to every action i = 1,..., N (hidden from the player) N. Cesa-Bianchi (UNIMI) Domination and Independence 5 / 1

7 Playing the experts game N actions????????? For t = 1, 2,... 1 Loss l t (i) [0, 1] is assigned to every action i = 1,..., N (hidden from the player) 2 Player picks an action I t (possibly using randomization) and incurs loss l t (I t ) N. Cesa-Bianchi (UNIMI) Domination and Independence 5 / 1

8 Playing the experts game N actions For t = 1, 2,... 1 Loss l t (i) [0, 1] is assigned to every action i = 1,..., N (hidden from the player) 2 Player picks an action I t (possibly using randomization) and incurs loss l t (I t ) 3 Player gets feedback information: l t = ( l t (1),..., l t (N) ) N. Cesa-Bianchi (UNIMI) Domination and Independence 5 / 1

9 Oblivious opponents The loss process l t t 1 is deterministic and unknown to the (randomized) player I 1, I 2,... Oblivious regret minimization [ T ] def R T = E l t (I t ) t=1 min T i=1,...,n t=1 l t (i) want = o(t) N. Cesa-Bianchi (UNIMI) Domination and Independence 6 / 1

10 Bounds on regret [How to use expert advice, 1997] Lower bound using random losses Losses l t (i) are independent random coin flips L t (i) {0, 1} [ T ] For any player strategy E L t (I t ) = T 2 t=1 Then the expected regret is [ T ( ) ] 1 E max i=1,...,n 2 L t(i) = ( 1 o(1) ) T ln N 2 t=1 N. Cesa-Bianchi (UNIMI) Domination and Independence 7 / 1

11 Exponentially weighted forecaster At time t pick action I t = i with probability proportional to ( ) t 1 exp η l s (i) s=1 the sum at the exponent is the total loss of action i up to now Regret bound [How to use expert advice, 1997] If η = T ln N (ln N)/(8T) then R T 2 Matching lower bound including constants Dynamic choice η t = (ln N)/(8t) only loses small constants N. Cesa-Bianchi (UNIMI) Domination and Independence 8 / 1

12 The bandit problem: playing an unknown game N actions????????? For t = 1, 2,... 1 Loss l t (i) [0, 1] is assigned to every action i = 1,..., N (hidden from the player) N. Cesa-Bianchi (UNIMI) Domination and Independence 9 / 1

13 The bandit problem: playing an unknown game N actions????????? For t = 1, 2,... 1 Loss l t (i) [0, 1] is assigned to every action i = 1,..., N (hidden from the player) 2 Player picks an action I t (possibly using randomization) and incurs loss l t (I t ) N. Cesa-Bianchi (UNIMI) Domination and Independence 9 / 1

14 The bandit problem: playing an unknown game N actions? 3??????? For t = 1, 2,... 1 Loss l t (i) [0, 1] is assigned to every action i = 1,..., N (hidden from the player) 2 Player picks an action I t (possibly using randomization) and incurs loss l t (I t ) 3 Player gets feedback information: Only l t (I t ) is revealed N. Cesa-Bianchi (UNIMI) Domination and Independence 9 / 1

15 The bandit problem: playing an unknown game N actions? 3??????? For t = 1, 2,... 1 Loss l t (i) [0, 1] is assigned to every action i = 1,..., N (hidden from the player) 2 Player picks an action I t (possibly using randomization) and incurs loss l t (I t ) 3 Player gets feedback information: Only l t (I t ) is revealed Many applications Ad placement, dynamic content adaptation, routing, online auctions N. Cesa-Bianchi (UNIMI) Domination and Independence 9 / 1

16 Relationships between actions [Mannor and Shamir, 2011] N. Cesa-Bianchi (UNIMI) Domination and Independence 10 / 1

17 A graph of relationships over actions?????????? N. Cesa-Bianchi (UNIMI) Domination and Independence 11 / 1

18 A graph of relationships over actions?????????? N. Cesa-Bianchi (UNIMI) Domination and Independence 11 / 1

19 A graph of relationships over actions ????? N. Cesa-Bianchi (UNIMI) Domination and Independence 11 / 1

20 Recovering expert and bandit settings Experts: clique Bandits: empty graph? 3???????? N. Cesa-Bianchi (UNIMI) Domination and Independence 12 / 1

21 Exponentially weighted forecaster Reprise Player s strategy [Alon, C-B, Gentile, Mannor, Mansour and Shamir, 2013] ( ) t 1 P t (I t = i) exp η ls (i) i = 1,..., N lt (i) = s=1 l t (i) P t ( lt (i) observed ) if l t(i) is observed 0 otherwise Importance sampling estimator ] E t [ lt (i) = l t (i) unbiasedness E t [ lt (i) 2] 1 P t ( lt (i) observed ) variance control N. Cesa-Bianchi (UNIMI) Domination and Independence 13 / 1

22 Regret bounds Analysis (undirected graphs) Lemma R T ln N η + η 2 T t=1 i=1 N P t (I t = i) P t (I t = i) + P t (I t = j) j N G (i) For any undirected graph G = (V, E) and for any probability assignment p 1,..., p N over its vertices N i=1 p i + p i j N G (i) p j α(g) α(g) is the independence number of G (largest subset of V such that no two distinct vertices in it are adjacent in G) N. Cesa-Bianchi (UNIMI) Domination and Independence 14 / 1

23 Regret bounds Analysis (undirected graphs) R T ln N η + η 2 T α(g) = Tα(G) ln N t=1 by choosing η Special cases Experts (clique): α(g) = 1 R T T ln N Bandits (empty graph): α(g) = N R T TN ln N Minimax rate The general bound is tight: R T = Θ ( Tα(G) ln N ) N. Cesa-Bianchi (UNIMI) Domination and Independence 15 / 1

24 More general feedback models Directed Interventions N. Cesa-Bianchi (UNIMI) Domination and Independence 16 / 1

25 Old and new examples Experts Bandits Cops & Robbers Revealing Action N. Cesa-Bianchi (UNIMI) Domination and Independence 17 / 1

26 Exponentially weighted forecaster with exploration Player s strategy [Alon, C-B, Dekel and Koren, 2015] ( ) P t (I t = i) 1 γ t 1 exp η ls (i) + γ U G i = 1,..., N Z t s=1 l t (i) ( lt (i) = P t lt (i) observed ) if l t(i) is observed 0 otherwise U G is uniform distribution supported on a subset of V N. Cesa-Bianchi (UNIMI) Domination and Independence 18 / 1

27 A characterization of feedback graphs A vertex of G is: observable if it has at least one incoming edge (possibly a self-loop) strongly observable if it has either a self-loop or incoming edges from all other vertices weakly observable if it is observable but not strongly observable is not observable 2 and 5 are weakly observable and 4 are strongly observable N. Cesa-Bianchi (UNIMI) Domination and Independence 19 / 1

28 Minimax rates G is strongly observable G is weakly observable G is not observable R T = Θ ( ) α(g)t U G is uniform on V R T = Θ ( ) T 2/3 δ(g) U G is uniform on a weakly dominating set R T = Θ(T) Weakly dominating set δ(g) is the size of the smallest set that dominates all weakly observable nodes of G N. Cesa-Bianchi (UNIMI) Domination and Independence 20 / 1

29 Minimax regret Presence of red loops does not affect minimax regret R T = Θ ( T ln N ) With red loop: strongly observable with α(g) = N 1 R T = Θ ( ) NT Without red loop: weakly observable with δ(g) = 1 R T = Θ (T 2/3) N. Cesa-Bianchi (UNIMI) Domination and Independence 21 / 1

30 Reactive opponents [Dekel, Koren and Peres, 2014] The loss of action i at time t depends on the player s past m actions l t (i) L t (I t m,..., I t 1, i) Adaptive regret T = E L t (I t m,..., I t 1, I t ) R ada T t=1 min i=1,...,n t=1 T L t (i,..., i, i) } {{ } m times lt(1) lt(2) Minimax rate (m > 0) R ada T = Θ ( T 2/3) t N. Cesa-Bianchi (UNIMI) Domination and Independence 22 / 1

31 Conclusions An abstract, game-theoretic framework for studying a variety of sequential decisions problems Applicable to machine learning (e.g., binary classification) and online convex optimization settings Exponential weights can be replaced by polynomial weights (cfr. Mirror Descent for convex optimization) Connections to gambling, portfolio management, competitive analysis of algorithms N. Cesa-Bianchi (UNIMI) Domination and Independence 23 / 1

Online Learning with Feedback Graphs

Online Learning with Feedback Graphs Nicolò Cesa-Bianchi Università degli Studi di Milano Joint work with: Noga Alon (Tel-Aviv University) Ofer Dekel (Microsoft Research) Tomer Koren (Technion and Microsoft