Online Learning with Feedback Graphs
Claudio Gentile
INRIA and Google NY
clagentile@gmail.com
NYC, March 6th, 2018
Content of this lecture
Regret analysis of sequential prediction problems lying between the full-information and bandit-information regimes:
- Motivation
- Nonstochastic setting: brief review of background; feedback graphs
- Stochastic setting: brief review of background; feedback graphs
- Examples (nonstochastic)
Motivation
Sequential prediction problems with partial information, where items in the action space have semantic connections that turn into observability dependencies among the associated losses/gains.
[Figure: action space with semantic connections among actions]
Background/1: Nonstochastic experts
K actions for Learner.
For t = 1, 2, …:
1. Losses $\ell_t(i) \in [0,1]$ are assigned deterministically by Nature (an oblivious opponent) to every action $i = 1, \dots, K$ (hidden to Learner)
2. Learner picks action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
3. Learner gets feedback information: $\ell_t(1), \dots, \ell_t(K)$
No (External, Pseudo) Regret
Goal: given T rounds, Learner's total loss $\sum_{t=1}^T \ell_t(I_t)$ must be close to that of the single best action in hindsight.
Regret of Learner for T rounds:
$$R_T = \mathbb{E}\left[\sum_{t=1}^T \ell_t(I_t)\right] - \min_{i=1,\dots,K} \sum_{t=1}^T \ell_t(i)$$
Want: $R_T = o(T)$ as T grows large ("no regret").
Notice: no stochastic assumptions on losses, but assume for simplicity that Nature is deterministic and oblivious.
Lower bound: $R_T \geq (1 - o(1))\sqrt{\frac{T \ln K}{2}}$ as $T, K \to \infty$ ($\ell_t(i)$ = random coin flips + a simple probabilistic argument) [CB+97]
Exponentially-weighted Algorithm [CB+97]
At round t pick action $I_t = i$ with probability proportional to
$$\exp\Big(-\eta \underbrace{\textstyle\sum_{s=1}^{t-1} \ell_s(i)}_{\text{total loss of action } i \text{ so far}}\Big)$$
If $\eta = \sqrt{\frac{8\ln K}{T}}$ then $R_T \leq \sqrt{\frac{T \ln K}{2}}$.
Dynamic $\eta_t = \sqrt{\frac{\ln K}{t}}$: $R_T$ loses constant factors.
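A minimal NumPy sketch of the exponentially-weighted forecaster above; the loss matrix, seed, and horizon are illustrative, not from the slides:

```python
import numpy as np

def exp_weights(losses, eta, seed=0):
    """Exponentially-weighted forecaster under full information.

    losses: (T, K) array with losses[t, i] = loss of action i at round t.
    Returns the forecaster's realized total loss.
    """
    T, K = losses.shape
    cum = np.zeros(K)                        # cumulative loss of each action
    rng = np.random.default_rng(seed)
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum - cum.min()))  # shift for numerical stability
        p = w / w.sum()
        I = rng.choice(K, p=p)               # randomized play
        total += losses[t, I]
        cum += losses[t]                     # full feedback: all K losses observed
    return total

T, K = 10_000, 16
losses = np.random.default_rng(1).random((T, K))
eta = np.sqrt(8 * np.log(K) / T)             # tuning from the slide
print(exp_weights(losses, eta))
```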
Nonstochastic bandit problem/1
K actions for Learner.
For t = 1, 2, …:
1. Losses $\ell_t(i) \in [0,1]$ are assigned deterministically by Nature to every action $i = 1, \dots, K$ (hidden to Learner)
2. Learner picks action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
3. Learner gets feedback information: $\ell_t(I_t)$ only
Nonstochastic bandit problem/2
Goal: same as before.
Regret of Learner for T rounds:
$$R_T = \mathbb{E}\left[\sum_{t=1}^T \ell_t(I_t)\right] - \min_{i=1,\dots,K} \sum_{t=1}^T \ell_t(i)$$
Want: $R_T = o(T)$ as T grows large ("no regret").
Trade-off: exploration vs. exploitation.
Nonstochastic bandit problem/3: Exp3 Alg/1 [Auer+02]
At round t pick action $I_t = i$ with probability proportional to $\exp\left(-\eta \sum_{s=1}^{t-1} \hat\ell_s(i)\right)$, $i = 1, \dots, K$, where
$$\hat\ell_s(i) = \begin{cases} \frac{\ell_s(i)}{\Pr_s(\ell_s(i) \text{ is observed in round } s)} & \text{if } \ell_s(i) \text{ is observed} \\ 0 & \text{otherwise} \end{cases}$$
Only one nonzero component in $\hat\ell_t$.
Exponentially-weighted alg with (importance sampling) loss estimates $\hat\ell_t(i)$ replacing $\ell_t(i)$.
Nonstochastic bandit problem/3: Exp3 Alg/2 [Auer+02]
Properties of loss estimates:
- $\mathbb{E}_t[\hat\ell_t(i)] = \ell_t(i)$ (unbiasedness)
- $\mathbb{E}_t[\hat\ell_t(i)^2] \leq \frac{1}{\Pr_t(\ell_t(i) \text{ is observed in round } t)}$ (variance control)
Regret analysis: set $p_t(i) = \Pr_t(I_t = i)$. Approximate $\exp(x)$ up to 2nd order, sum over rounds t, and over-approximate:
$$\sum_{t=1}^T \sum_{i=1}^K p_t(i)\hat\ell_t(i) - \min_{i=1,\dots,K} \sum_{t=1}^T \hat\ell_t(i) \leq \frac{\ln K}{\eta} + \frac{\eta}{2}\sum_{t=1}^T \sum_{i=1}^K p_t(i)\hat\ell_t(i)^2$$
Take expectations (tower rule) and optimize over η:
$$R_T \leq \frac{\ln K}{\eta} + \frac{\eta}{2}\,TK = \sqrt{2TK\ln K}$$
Lower bound $\Omega(\sqrt{TK})$ (improved upper bound by the INF alg [AB09]).
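A minimal sketch of Exp3 along these lines: the only change from the full-information forecaster is that cumulative losses are replaced by the importance-sampling estimates built from the single observed loss (loss matrix and tuning are illustrative):

```python
import numpy as np

def exp3(losses, eta, seed=0):
    """Exp3: exponential weights over importance-sampling loss estimates.

    Only losses[t, I_t] is revealed each round; its estimate is inflated
    by 1/p_t(I_t), which keeps the estimates unbiased.
    """
    T, K = losses.shape
    est_cum = np.zeros(K)                     # cumulative loss *estimates*
    rng = np.random.default_rng(seed)
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (est_cum - est_cum.min()))
        p = w / w.sum()
        I = rng.choice(K, p=p)
        total += losses[t, I]
        est_cum[I] += losses[t, I] / p[I]     # one nonzero component per round
    return total

T, K = 10_000, 16
losses = np.random.default_rng(1).random((T, K))
eta = np.sqrt(2 * np.log(K) / (T * K))        # optimizes ln K / eta + eta T K / 2
print(exp3(losses, eta))
```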
Contrasting the expert to the nonstochastic bandit problem
Experts: Learner observes all losses $\ell_t(1), \dots, \ell_t(K)$, i.e., $\Pr_t(\ell_t(i) \text{ is observed in round } t) = 1$; regret $R_T = O(\sqrt{T \ln K})$. Note: here Exp3 collapses to the Exponentially-weighted alg.
Nonstochastic bandits: Learner only observes the loss $\ell_t(I_t)$ of the chosen action, i.e., $\Pr_t(\ell_t(i) \text{ is observed in round } t) = \Pr_t(I_t = i)$; regret $R_T = O(\sqrt{TK})$.
Exponential gap, $\ln K$ vs. $K$: relevant when actions are many.
Nonstochastic bandits with Feedback Graphs/1 [MS11, A+13, K+14]
K actions for Learner.
For t = 1, 2, …:
1. Losses $\ell_t(i) \in [0,1]$ are assigned deterministically by Nature to every action $i = 1, \dots, K$ (hidden to Learner)
2. Feedback graph $G_t = (V, E_t)$, $V = \{1, \dots, K\}$, is generated by an exogenous process (hidden to Learner); all self-loops included
3. Learner picks action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
4. Learner gets feedback information: $\{\ell_t(j) : (I_t, j) \in E_t\}$, together with $G_t$ itself
Nonstochastic bandits with Feedback Graphs/2: Alg Exp3-IX [Ne15]
At round t pick action $I_t = i$ with probability proportional to $\exp\left(-\eta \sum_{s=1}^{t-1} \hat\ell_s(i)\right)$, $i = 1, \dots, K$, where
$$\hat\ell_s(i) = \begin{cases} \frac{\ell_s(i)}{\gamma_s + \Pr_s(\ell_s(i) \text{ is observed in round } s)} & \text{if } \ell_s(i) \text{ is observed} \\ 0 & \text{otherwise} \end{cases}$$
Note: prob of observing the loss of an action ≥ prob of playing that action.
Exponentially-weighted alg with $\gamma_t$-biased (importance sampling) loss estimates $\hat\ell_t(i)$ replacing $\ell_t(i)$; the bias is controlled by $\gamma_t = 1/\sqrt{t}$.
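A minimal sketch of Exp3-IX for undirected feedback graphs with self-loops; the adjacency-matrix representation of $G_t$ is an assumption of this sketch:

```python
import numpy as np

def exp3_ix(losses, graphs, eta, seed=0):
    """Exp3-IX: exponential weights with gamma_t-biased loss estimates.

    graphs[t]: (K, K) 0/1 adjacency matrix of the (undirected, self-looped)
    feedback graph G_t; playing I_t reveals the loss of each neighbor of I_t.
    """
    T, K = losses.shape
    est_cum = np.zeros(K)
    rng = np.random.default_rng(seed)
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (est_cum - est_cum.min()))
        p = w / w.sum()
        I = rng.choice(K, p=p)
        total += losses[t, I]
        gamma = 1.0 / np.sqrt(t + 1)           # bias parameter from the slide
        Q = graphs[t] @ p                      # Q_t(i) = Pr_t(loss of i observed)
        seen = graphs[t][I].astype(bool)       # neighbors of the played action
        est_cum[seen] += losses[t, seen] / (gamma + Q[seen])
    return total
```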
Nonstochastic bandits with Feedback Graphs/3 [A+13, K+14]
Independence number $\alpha(G_t)$: disregard edge orientation.
$$\underbrace{1}_{\text{clique: expert problem}} \;\leq\; \alpha(G_t) \;\leq\; \underbrace{K}_{\text{edgeless: bandit problem}}$$
Regret analysis:
- If $G_t = G$ for all t: $R_T = \tilde O\left(\sqrt{T\alpha(G)}\right)$ (also a lower bound, up to logs)
- In general: $R_T = O\left(\ln(TK)\sqrt{\sum_{t=1}^T \alpha(G_t)}\right)$
Nonstochastic bandits with Feedback Graphs/4
Properties of loss estimates: let $p_t(i) = \Pr_t(I_t = i)$ (prob of playing) and $Q_t(i) = \Pr_t(\ell_t(i) \text{ is observed in round } t)$ (prob of observing), so that
$$\hat\ell_t(i) = \frac{\ell_t(i)\,\mathbb{1}\{\ell_t(i) \text{ is observed in round } t\}}{\gamma_t + Q_t(i)}$$
- $\mathbb{E}_t[\hat\ell_t(i)] \approx \ell_t(i)$ (unbiasedness, up to the $\gamma_t$ bias)
- $\mathbb{E}_t[\hat\ell_t(i)^2] \leq \frac{1}{Q_t(i)}$ (variance control)
Some details of the regret analysis: from
$$\sum_{t=1}^T \sum_{i=1}^K p_t(i)\hat\ell_t(i) - \min_{i=1,\dots,K} \sum_{t=1}^T \hat\ell_t(i) \leq \frac{\ln K}{\eta} + \frac{\eta}{2}\sum_{t=1}^T \sum_{i=1}^K p_t(i)\hat\ell_t(i)^2$$
take expectations:
$$R_T \lesssim \frac{\ln K}{\eta} + \frac{\eta}{2}\sum_{t=1}^T \mathbb{E}\Bigg[\underbrace{\sum_{i=1}^K \frac{p_t(i)}{Q_t(i)}}_{\text{variance}}\Bigg]$$
Nonstochastic bandits with Feedback Graphs/5
Relating the variance to $\alpha(G)$: suppose G is undirected (with self-loops). Claim:
$$\Sigma = \sum_{i=1}^K \frac{p(i)}{Q_G(i)} = \sum_{i=1}^K \frac{p(i)}{\sum_{j :\, j \sim_G i} p(j)} \leq |S|$$
where $S \subseteq V$ is an independent set for $G = (V, E)$ built greedily as follows.
Init: $S = \emptyset$; $G_1 = G$, $V_1 = V$. At each step $k = 1, 2, \dots$:
- Pick $i_k = \arg\min_{i \in V_k} Q_{G_k}(i)$
- Augment $S \leftarrow S \cup \{i_k\}$
- Remove $i_k$, all its neighbors, and the incident edges from $G_k$; in doing so
$$\Sigma \;\longrightarrow\; \Sigma - \sum_{j :\, j \sim_{G_k} i_k} \frac{p(j)}{Q_{G_k}(j)} \;\geq\; \Sigma - \sum_{j :\, j \sim_{G_k} i_k} \frac{p(j)}{Q_{G_k}(i_k)} = \Sigma - \frac{Q_{G_k}(i_k)}{Q_{G_k}(i_k)} = \Sigma - 1$$
- Get the smaller graph $G_{k+1} = (V_{k+1}, E_{k+1})$ and iterate
Nonstochastic bandits with Feedback Graphs/6
Hence, at each iteration: Σ decreases by at most 1, while $|S|$ increases by 1.
The potential $|S| + \Sigma$ is nondecreasing over iterations:
- it has its minimal value at the beginning ($S = \emptyset$)
- it reaches its maximal value when G becomes empty ($\Sigma = 0$)
S is an independent set by construction, so $\Sigma \leq |S| \leq \alpha(G)$.
When G is directed the analysis gets more complicated (it needs a lower bound on $p_t(i)$) and adds a $\log T$ factor to the bound.
We have obtained:
$$R_T \lesssim \frac{\ln K}{\eta} + \frac{\eta}{2}\sum_{t=1}^T \alpha(G_t) = O\left(\sqrt{(\ln K)\sum_{t=1}^T \alpha(G_t)}\right)$$
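The greedy argument can be checked numerically; here is a small sketch (the graph and distribution are illustrative) that builds the independent set exactly as above and verifies $\Sigma \leq |S|$:

```python
import numpy as np

def sigma_vs_greedy(A, p):
    """Check Sigma = sum_i p(i)/Q_G(i) <= |S| for the greedy set S above.

    A: (K, K) 0/1 adjacency of an undirected graph with self-loops.
    p: probability vector over the K nodes (all entries positive).
    """
    Q = A @ p                                  # Q_G(i) = sum of p(j) over j ~ i
    sigma = float(np.sum(p / Q))
    alive = np.ones(len(p), dtype=bool)
    S = []
    while alive.any():
        Q_sub = A @ np.where(alive, p, 0.0)    # Q in the current subgraph
        cand = np.flatnonzero(alive)
        i = cand[np.argmin(Q_sub[cand])]       # node minimizing Q
        S.append(int(i))
        alive &= A[i] == 0                     # drop i and its neighbors (self-loop drops i)
    return sigma, len(S)

# example: 5-cycle with self-loops, uniform p; alpha(G) = 2
A = np.eye(5, dtype=int)
for k in range(5):
    A[k, (k + 1) % 5] = A[(k + 1) % 5, k] = 1
print(sigma_vs_greedy(A, np.full(5, 0.2)))     # Sigma ~ 1.67 <= |S| = 2
```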
Stochastic bandit problem/1
K actions for Learner. When picking action i at time t, Learner receives as reward an independent realization of a random variable $X_i$: $\mathbb{E}[X_i] = \mu_i$, $X_i \in [0,1]$. The $\mu_i$'s are hidden to Learner.
For t = 1, 2, …:
1. Learner picks action $I_t$ (possibly using randomization) and gathers reward $X_{I_t,t}$
2. Learner gets feedback information: $X_{I_t,t}$
Goal: optimize the (pseudo)regret
$$R_T = \max_{i=1,\dots,K} \mathbb{E}\left[\sum_{t=1}^T X_{i,t}\right] - \mathbb{E}\left[\sum_{t=1}^T X_{I_t,t}\right] = \mu^* T - \mathbb{E}\left[\sum_{t=1}^T X_{I_t,t}\right] = \sum_{i=1}^K \Delta_i\,\mathbb{E}[T_i(T)]$$
where $\mu^* = \max_i \mu_i$, $\Delta_i = \mu^* - \mu_i$, and $T_i(T)$ is the number of times action i is played over T rounds.
Stochastic bandit problem/2: UCB alg [ACF02]
At round t pick action
$$I_t = \arg\max_{i=1,\dots,K}\left(\bar X_{i,t-1} + \sqrt{\frac{\ln t}{T_{i,t-1}}}\right)$$
- $T_{i,t-1}$ = number of times the reward of action i has been observed so far
- $\bar X_{i,t-1} = \frac{1}{T_{i,t-1}} \sum_{s \leq t-1 :\, I_s = i} X_{i,s}$ = average reward of action i observed so far
(Pseudo)regret: $R_T = O\left(\left(\sum_{i :\, \Delta_i > 0} \frac{1}{\Delta_i}\right)\ln T + K\right)$
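A minimal sketch of the UCB index rule above; the Bernoulli reward model and the means are illustrative:

```python
import numpy as np

def ucb(pull, K, T):
    """UCB: play each action once, then maximize mean + sqrt(ln t / count)."""
    counts = np.zeros(K)        # T_{i,t-1}: times the reward of i was observed
    sums = np.zeros(K)          # summed rewards of action i
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:
            i = t - 1           # initialization round for action i
        else:
            i = int(np.argmax(sums / counts + np.sqrt(np.log(t) / counts)))
        r = pull(i)             # independent reward draw for action i
        counts[i] += 1
        sums[i] += r
        total += r
    return total

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])  # hidden means (illustrative)
print(ucb(lambda i: float(rng.random() < mu[i]), K=3, T=10_000))
```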
Stochastic bandits with feedback graphs/1
K actions for Learner, arranged into a fixed graph $G = (V, E)$. When picking action i at time t, Learner receives as reward an independent realization of a random variable $X_i$, $\mathbb{E}[X_i] = \mu_i$, but also observes the rewards of nearby actions in G. The $\mu_i$'s are hidden to Learner.
For t = 1, 2, …:
1. Learner picks action $I_t$ (possibly using randomization) and gathers reward $X_{I_t,t}$
2. Learner gets feedback information: $\{X_{j,t} : (I_t, j) \in E\}$
Goal: optimize the (pseudo)regret
$$R_T = \max_{i=1,\dots,K} \mathbb{E}\left[\sum_{t=1}^T X_{i,t}\right] - \mathbb{E}\left[\sum_{t=1}^T X_{I_t,t}\right] = \mu^* T - \mathbb{E}\left[\sum_{t=1}^T X_{I_t,t}\right] = \sum_{i=1}^K \Delta_i\,\mathbb{E}[T_i(T)]$$
Stochastic bandits with feedback graphs/2: UCB-N [Ca+12]
At round t pick action
$$I_t = \arg\max_{i=1,\dots,K}\left(\bar X_{i,t-1} + \sqrt{\frac{\ln t}{O_{i,t-1}}}\right)$$
- $O_{i,t-1}$ = number of times the reward of action i has been observed so far
- $\bar X_{i,t-1} = \frac{1}{O_{i,t-1}} \sum_{s \leq t-1 :\, I_s \sim_G i} X_{i,s}$ = average reward of action i observed so far
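A minimal sketch of UCB-N under the same assumptions: the index rule is identical to UCB, but every neighbor of the played action gets its reward observed, so counts and sums are updated over the whole neighborhood:

```python
import numpy as np

def ucb_n(sample_all, A, T):
    """UCB-N on a fixed undirected graph with self-loops.

    A: (K, K) 0/1 adjacency matrix; sample_all(t) draws the full reward
    vector, of which only the neighborhood of the played action is seen.
    """
    K = A.shape[0]
    counts = np.zeros(K)        # O_{i,t-1}: times the reward of i was observed
    sums = np.zeros(K)
    total = 0.0
    for t in range(1, T + 1):
        if (counts == 0).any():
            i = int(np.argmax(counts == 0))   # some action not yet observed
        else:
            i = int(np.argmax(sums / counts + np.sqrt(np.log(t) / counts)))
        x = sample_all(t)
        seen = A[i].astype(bool)              # neighborhood of the played action
        counts[seen] += 1
        sums[seen] += x[seen]
        total += x[i]                         # only the played reward is earned
    return total
```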
Stochastic bandits with feedback graphs/3
Clique covering number $\chi(G)$: assume G is undirected.
$$\underbrace{1}_{\text{clique: expert problem}} \;\leq\; \alpha(G) \;\leq\; \chi(G) \;\leq\; \underbrace{K}_{\text{edgeless: bandit problem}}$$
[Figure: example graph with clique cover number 4]
Regret analysis: given any partition $\mathcal{C}$ of V into cliques, $\mathcal{C} = \{C_1, C_2, \dots, C_{|\mathcal{C}|}\}$:
$$R_T = O\left(\sum_{C \in \mathcal{C}} \frac{\max_{i \in C} \Delta_i}{\min_{i \in C} \Delta_i^2}\,\ln T + K\right)$$
- Sum over $\chi(G)$ regret terms (but this can be improved to $\alpha(G)$)
- The term K can be replaced by $\chi(G)$ by a modified alg
- No tight lower bounds available
Simple examples/1: Auctions (nonstochastic)
Second-price auction with reserve price (seller side), where the highest bid is revealed to the seller (e.g., AppNexus); the auctioneer is a third party.
- Revenue (= 1 − loss) as a function of the reserve price: each price reveals the revenue of itself + the revenue of any higher price
- After the seller plays reserve price $I_t$, both the seller's revenue and the highest bid are revealed to him/her
- The Seller/Player is then in a position to observe all revenues for prices $j \geq I_t$
[Figure: feedback graph $G_t$ over prices; playing $I_t$ reveals the revenue of every higher price]
$\alpha(G_t) = 1$: $R_T = O\left(\ln(TK)\sqrt{T}\right)$ (expert problem up to logs) [CB+17]
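A tiny sketch of this price feedback graph (the discretization into K candidate prices is illustrative): each price observes every higher price, so ignoring orientations the graph is a clique and $\alpha(G_t) = 1$:

```python
# Playing reserve price i reveals the revenue of every price j >= i,
# so the feedback graph has edge (i, j) whenever j >= i.
K = 10                                        # number of candidate prices
E = {(i, j) for i in range(K) for j in range(K) if j >= i}

# Disregarding orientation, every pair of prices is connected:
assert all((i, j) in E or (j, i) in E for i in range(K) for j in range(K))
# hence the undirected graph is a clique and alpha(G) = 1 (expert-like feedback)
```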
Simple examples/2: Contextual bandits (nonstochastic) [Auer+02]
K predictors $f_i : \{1, \dots, T\} \to \{1, \dots, N\}$, $i = 1, \dots, K$, each one having the same $N \ll K$ actions. Learner's action space is the set of K predictors.
For t = 1, 2, …:
1. Losses $\ell_t(j) \in [0,1]$ are assigned deterministically by Nature to every action $j = 1, \dots, N$ (hidden to Learner)
2. Learner observes $f_1(t), f_2(t), \dots, f_K(t)$
3. Learner picks predictor $f_{I_t}$ (possibly using randomization) and incurs loss $\ell_t(f_{I_t}(t))$
4. Learner gets feedback information: $\ell_t(f_{I_t}(t))$
Feedback graph $G_t$ on the K predictors made up of the N cliques $\{i : f_i(t) = 1\}, \{i : f_i(t) = 2\}, \dots, \{i : f_i(t) = N\}$ (figure example: K = 8, N = 3).
Independence number: $\alpha(G_t) \leq N$ for all t.
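A small sketch of the round-t feedback graph over predictors (the predictor outputs below are illustrative): predictors proposing the same action form a clique, so any independent set picks at most one predictor per action, giving $\alpha(G_t) \leq N$:

```python
import numpy as np

def predictor_feedback_graph(f_t):
    """Adjacency of G_t: predictors i, j are connected iff f_i(t) == f_j(t),
    giving one clique per action."""
    f_t = np.asarray(f_t)
    return (f_t[:, None] == f_t[None, :]).astype(int)

# K = 8 predictors over N = 3 actions, as in the slide's figure
A = predictor_feedback_graph([1, 2, 3, 1, 2, 3, 1, 2])
print(A.sum(axis=1))   # size of each predictor's clique
```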
References
[CB+97] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. Helmbold, R. Schapire, M. Warmuth. How to use expert advice. Journal of the ACM, 44/3, pp. 427-485, 1997.
[AB09] J. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11, pp. 2635-2686, 2010.
[Auer+02] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, 32/1, pp. 48-77, 2002.
[ACF02] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning Journal, 47/2-3, pp. 235-256, 2002.
[MS11] S. Mannor, O. Shamir. From Bandits to Experts: On the Value of Side-Observations. NIPS 2011.
[Ca+12] S. Caron, B. Kveton, M. Lelarge, and S. Bhagat. Leveraging Side Observations in Stochastic Bandits. UAI 2012.
[A+13] N. Alon, N. Cesa-Bianchi, C. Gentile, Y. Mansour. From bandits to experts: A tale of domination and independence. NIPS 2013.
[K+14] T. Kocak, G. Neu, M. Valko, R. Munos. Efficient learning by implicit exploration in bandit problems with side observations. NIPS 2014.
[Ne15] G. Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. NIPS 2015.
[CB+17] N. Cesa-Bianchi, P. Gaillard, C. Gentile, S. Gerchinovitz. Algorithmic chaining and the role of partial feedback in online nonparametric learning. COLT 2017.