Online Learning with Feedback Graphs


Online Learning with Feedback Graphs. Claudio Gentile, INRIA and Google NY, clagentile@gmail.com. NYC, March 6th, 2018.

Content of this lecture: regret analysis of sequential prediction problems lying between the full-information and the bandit-information regimes.
- Motivation
- Nonstochastic setting: brief review of background; feedback graphs
- Stochastic setting: brief review of background; feedback graphs
- Examples (nonstochastic)

Motivation: sequential prediction problems with partial information, where items in the action space have semantic connections that turn into observability dependencies among the associated losses/gains.

Background/1: Nonstochastic experts. K actions for the Learner. For t = 1, 2, ...:
1. Losses $\ell_t(i) \in [0,1]$ are assigned deterministically by Nature (the opponent) to every action $i = 1, \dots, K$ (hidden to the Learner).
2. The Learner picks action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$.
3. The Learner gets feedback information: $\ell_t(1), \dots, \ell_t(I_t), \dots, \ell_t(K)$, i.e., all losses.
(Figure: example loss values 0.1, 0.3, 0.8, 0.3, 0.4, 0.5, 0.7, 0.2 on the K actions.)

No (External, Pseudo) Regret. Goal: given T rounds, the Learner's total loss $\sum_{t=1}^T \ell_t(I_t)$ must be close to that of the single best action in hindsight. Regret of the Learner over T rounds:
$R_T = \mathbb{E}\Big[\sum_{t=1}^T \ell_t(I_t)\Big] - \min_{i=1,\dots,K} \sum_{t=1}^T \ell_t(i)$.
Want: $R_T = o(T)$ as T grows large ("no regret"). Notice: no stochastic assumptions on the losses, but assume for simplicity that Nature is deterministic and oblivious. Lower bound:
$R_T \geq (1 - o(1))\sqrt{\tfrac{T \ln K}{2}}$ as $T, K \to \infty$
($\ell_t(i)$ = random coin flips + a simple probabilistic argument) [CB+97].

Exponentially-weighted Algorithm [CB+97]. At round t pick action $I_t = i$ with probability proportional to
$\exp\Big(-\eta \sum_{s=1}^{t-1} \ell_s(i)\Big)$,
i.e., exponential in (minus) the total loss of action i so far. If $\eta = \sqrt{(8 \ln K)/T}$ then $R_T \leq \sqrt{(T \ln K)/2}$. A dynamic learning rate $\eta_t = \sqrt{(\ln K)/t}$ loses constant factors.
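Below is a minimal Python sketch of the exponentially-weighted forecaster just described, with the fixed tuning $\eta = \sqrt{8 \ln K / T}$ from the slide. The random loss matrix, the stability shift, and all variable names are illustrative choices, not part of the original slides.

```python
import numpy as np

def exp_weights(losses, eta, seed=0):
    """Exponentially-weighted forecaster under full information.

    losses: (T, K) array of losses in [0, 1]; all of them are revealed each round.
    eta:    learning rate, e.g. sqrt(8 * ln K / T) as on the slide.
    """
    T, K = losses.shape
    rng = np.random.default_rng(seed)
    cum_loss = np.zeros(K)                 # total loss of each action so far
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_loss - cum_loss.min()))   # shifted for numerical stability
        p = w / w.sum()
        i = rng.choice(K, p=p)             # randomized pick I_t
        total += losses[t, i]
        cum_loss += losses[t]              # full information: every loss is observed
    return total - cum_loss.min()          # realized regret vs best action in hindsight

T, K = 5000, 10
losses = np.random.default_rng(1).random((T, K))
print(exp_weights(losses, eta=np.sqrt(8 * np.log(K) / T)),
      "vs bound", np.sqrt(T * np.log(K) / 2))
```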

Nonstochastic bandit problem/1. K actions for the Learner. For t = 1, 2, ...:
1. Losses $\ell_t(i) \in [0,1]$ are assigned deterministically by Nature to every action $i = 1, \dots, K$ (hidden to the Learner).
2. The Learner picks action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$.
3. The Learner gets feedback information: $\ell_t(I_t)$ only.
(Figure: only the loss of the played action, e.g. 0.3, is revealed.)

Nonstochastic bandit problem/2. Goal: same as before. Regret of the Learner over T rounds:
$R_T = \mathbb{E}\Big[\sum_{t=1}^T \ell_t(I_t)\Big] - \min_{i=1,\dots,K} \sum_{t=1}^T \ell_t(i)$.
Want: $R_T = o(T)$ as T grows large ("no regret"). Tradeoff: exploration vs. exploitation.

Nonstochastic bandit problem/3: Exp3 Alg/1 [Auer+02]. At round t pick action $I_t = i$ with probability proportional to
$\exp\Big(-\eta \sum_{s=1}^{t-1} \hat\ell_s(i)\Big)$, $i = 1, \dots, K$, where
$\hat\ell_s(i) = \dfrac{\ell_s(i)}{\Pr_s(\ell_s(i) \text{ is observed in round } s)}$ if $\ell_s(i)$ is observed, and $0$ otherwise.
Only one nonzero component in $\hat\ell_t$. This is the Exponentially-weighted algorithm run with (importance sampling) loss estimates $\hat\ell_t(i)$ in place of $\ell_t(i)$.

Nonstochastic bandit problem/3: Exp3 Alg/2 [Auer+02]. Properties of the loss estimates:
$\mathbb{E}_t[\hat\ell_t(i)] = \ell_t(i)$ (unbiasedness), $\mathbb{E}_t[\hat\ell_t(i)^2] \leq \dfrac{1}{\Pr_t(\ell_t(i) \text{ is observed in round } t)}$ (variance control).
Regret analysis: set $p_t(i) = \Pr_t(I_t = i)$. Approximating $\exp(x)$ up to 2nd order, summing over rounds t and over-approximating:
$\sum_{t=1}^T \sum_{i=1}^K p_t(i)\hat\ell_t(i) - \min_{i=1,\dots,K} \sum_{t=1}^T \hat\ell_t(i) \leq \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \sum_{i=1}^K p_t(i)\hat\ell_t(i)^2$.
Take expectations (tower rule), and optimize over $\eta$: $R_T \leq \frac{\ln K}{\eta} + \frac{\eta}{2} T K = \sqrt{2 T K \ln K}$.
Lower bound $\Omega(\sqrt{TK})$ (improved upper bound achieved by the INF algorithm [AB09]).
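A minimal Exp3 sketch along the lines of the two slides above: the exponentially-weighted update is fed importance-sampling estimates that are nonzero only for the played action. The loss data, the seed, and the tuning line are illustrative assumptions.

```python
import numpy as np

def exp3(losses, eta, seed=0):
    """Exp3: exponentially-weighted forecaster run on importance-sampling
    loss estimates; only the loss of the played action is observed."""
    T, K = losses.shape
    rng = np.random.default_rng(seed)
    cum_est = np.zeros(K)                  # cumulative estimated losses
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()                    # p_t(i) = Pr_t(I_t = i)
        i = rng.choice(K, p=p)
        total += losses[t, i]
        # In the plain bandit setting Pr_t(loss of i is observed) = p_t(i),
        # so the estimate has a single nonzero component:
        cum_est[i] += losses[t, i] / p[i]
    return total - losses.sum(axis=0).min()   # realized regret

T, K = 5000, 10
losses = np.random.default_rng(1).random((T, K))
eta = np.sqrt(2 * np.log(K) / (T * K))        # optimizes ln K / eta + eta * T * K / 2
print(exp3(losses, eta), "vs bound", np.sqrt(2 * T * K * np.log(K)))
```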

Contrasting the expert problem to the nonstochastic bandit problem.
Experts: the Learner observes all losses $\ell_t(1), \dots, \ell_t(K)$, so $\Pr_t(\ell_t(i) \text{ is observed in round } t) = 1$; regret $R_T = O(\sqrt{T \ln K})$.
Nonstochastic bandits: the Learner only observes the loss $\ell_t(I_t)$ of the chosen action, so $\Pr_t(\ell_t(i) \text{ is observed in round } t) = \Pr_t(I_t = i)$; note: Exp3 collapses to the Exponentially-weighted algorithm; regret $R_T = O(\sqrt{TK})$.
Exponential gap $\ln K$ vs $K$: relevant when actions are many.

Nonstochastic bandits with Feedback Graphs/1 [MS11, A+13, K+14]. K actions for the Learner. For t = 1, 2, ...:
1. Losses $\ell_t(i) \in [0,1]$ are assigned deterministically by Nature to every action $i = 1, \dots, K$ (hidden to the Learner).
2. A feedback graph $G_t = (V, E_t)$, $V = \{1, \dots, K\}$, is generated by an exogenous process (hidden to the Learner), with all self-loops included.
3. The Learner picks action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$.
4. The Learner gets feedback information: $\{\ell_t(j) : (I_t, j) \in E_t\}$, plus the graph $G_t$ itself.
(Figure: example losses 0.1, 0.4, 0.3, 0.3, 0.7 on a feedback graph; playing an action reveals the losses of its out-neighbors.)

Nonstochastic bandits with Feedback Graphs/2: Alg Exp3-IX [Ne15]. At round t pick action $I_t = i$ with probability proportional to
$\exp\Big(-\eta \sum_{s=1}^{t-1} \hat\ell_s(i)\Big)$, $i = 1, \dots, K$, where
$\hat\ell_s(i) = \dfrac{\ell_s(i)}{\gamma_s + \Pr_s(\ell_s(i) \text{ is observed in round } s)}$ if $\ell_s(i)$ is observed, and $0$ otherwise.
Note: the probability of observing the loss of an action $\geq$ the probability of playing that action.
This is the Exponentially-weighted algorithm with $\gamma_t$-biased (importance sampling) loss estimates $\hat\ell_t(i)$ in place of $\ell_t(i)$. The bias is controlled by $\gamma_t = 1/\sqrt{t}$.
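The following sketch illustrates the Exp3-IX recipe summarized above: the observation probability $Q_t(i)$ is computed from the round's feedback graph, and the estimate is biased by $\gamma_t = 1/\sqrt{t}$. The graph encoding (boolean adjacency matrices with self-loops), the example graph made of two cliques, and the learning-rate choice are illustrative assumptions.

```python
import numpy as np

def exp3_ix(losses, graphs, eta, seed=0):
    """Exp3-IX sketch: observation probabilities Q_t come from the round's
    feedback graph, and estimates are biased by gamma_t = 1 / sqrt(t)."""
    T, K = losses.shape
    rng = np.random.default_rng(seed)
    cum_est = np.zeros(K)
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        i = rng.choice(K, p=p)
        total += losses[t, i]
        gamma = 1.0 / np.sqrt(t + 1)
        G = graphs[t].astype(float)              # G[a, j] = 1 if playing a reveals j
        Q = G.T @ p                              # Q_t(j) = Pr_t(loss of j is observed)
        observed = graphs[t, i]                  # losses revealed by playing i
        cum_est[observed] += losses[t, observed] / (gamma + Q[observed])
    return total - losses.sum(axis=0).min()

# Example: fixed undirected graph made of two 5-cliques (alpha(G) = 2).
K, T = 10, 5000
G = np.zeros((K, K), dtype=bool)
G[:5, :5] = True
G[5:, 5:] = True
graphs = np.broadcast_to(G, (T, K, K))
losses = np.random.default_rng(1).random((T, K))
print(exp3_ix(losses, graphs, eta=np.sqrt(np.log(K) / T)))
```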

Nonstochastic bandits with Feedback Graphs/3 [A+13, K+14]. Independence number $\alpha(G_t)$: disregard edge orientation.
$1 \leq \alpha(G_t) \leq K$, with $\alpha(G_t) = 1$ for a clique (expert problem) and $\alpha(G_t) = K$ for an edgeless graph (bandit problem).
Regret analysis: if $G_t = G$ for all t, then $R_T = \tilde{O}\big(\sqrt{T\,\alpha(G)}\big)$ (also a lower bound, up to log factors). In general:
$R_T = O\Big(\ln(TK)\sqrt{\sum_{t=1}^T \alpha(G_t)}\Big)$.
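To make the two extremes on this slide concrete, here is a tiny brute-force computation of $\alpha(G)$, assuming the boolean adjacency-matrix encoding (with self-loops) used in the other sketches; it is an illustrative helper only, feasible for small K.

```python
import numpy as np
from itertools import combinations

def independence_number(G):
    """Brute-force alpha(G) for a small undirected graph given as a boolean
    adjacency matrix with self-loops (edge orientation already disregarded)."""
    K = G.shape[0]
    for size in range(K, 0, -1):
        for S in combinations(range(K), size):
            idx = np.array(S)
            # S is independent iff the only edges inside it are the self-loops.
            if G[np.ix_(idx, idx)].sum() == size:
                return size
    return 0

K = 5
clique = np.ones((K, K), dtype=bool)       # expert-like feedback
edgeless = np.eye(K, dtype=bool)           # bandit-like feedback (self-loops only)
print(independence_number(clique), independence_number(edgeless))   # prints 1 and K
```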

Nonstochastic bandits with Feedback Graphs/4. Properties of the loss estimates. Let $p_t(i) = \Pr_t(I_t = i)$ (probability of playing) and $Q_t(i) = \Pr_t(\ell_t(i) \text{ is observed in round } t)$ (probability of observing), so that
$\hat\ell_t(i) = \dfrac{\ell_t(i)\,\mathbb{1}\{\ell_t(i) \text{ is observed in round } t\}}{\gamma_t + Q_t(i)}$.
Then $\mathbb{E}_t[\hat\ell_t(i)] \approx \ell_t(i)$ (unbiasedness, up to the $\gamma_t$ bias) and $\mathbb{E}_t[\hat\ell_t(i)^2] \leq 1/Q_t(i)$ (variance control).
Some details of the regret analysis: from
$\sum_{t=1}^T \sum_{i=1}^K p_t(i)\hat\ell_t(i) - \min_{i=1,\dots,K} \sum_{t=1}^T \hat\ell_t(i) \leq \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \sum_{i=1}^K p_t(i)\hat\ell_t(i)^2$,
taking expectations gives $R_T \lesssim \frac{\ln K}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^T \sum_{i=1}^K \frac{p_t(i)}{Q_t(i)}\Big]$ (the variance term).

Nonstochastic bandits with Feedback Graphs/5. Relating the variance term to $\alpha(G)$. Suppose $G$ is undirected (with all self-loops), and consider
$\Sigma = \sum_{i=1}^K \frac{p(i)}{Q^G(i)}$, where $Q^G(i) = \sum_{j : j \sim i \text{ in } G} p(j)$,
together with an independent set $S \subseteq V$ for $G = (V, E)$ built greedily as follows.
Init: $S = \emptyset$; $G_1 = G$, $V_1 = V$. At step $k$: pick $i_k = \arg\min_{i \in V_k} Q^{G_k}(i)$, augment $S \leftarrow S \cup \{i_k\}$, and remove $i_k$ together with all its neighbors (and incident edges) from $G_k$, obtaining the smaller graph $G_{k+1} = (V_{k+1}, E_{k+1})$; then iterate. The vertices removed at step $k$ contribute to $\Sigma$ at most
$\sum_{j : j \sim_{G_k} i_k} \frac{p(j)}{Q^{G_1}(j)} \leq \sum_{j : j \sim_{G_k} i_k} \frac{p(j)}{Q^{G_k}(j)} \leq \sum_{j : j \sim_{G_k} i_k} \frac{p(j)}{Q^{G_k}(i_k)} = \frac{Q^{G_k}(i_k)}{Q^{G_k}(i_k)} = 1$,
so $\Sigma$ decreases by at most 1 per iteration.

Nonstochastic bandits with Feedback Graphs/6. Hence: $\Sigma$ decreases by at most 1 while $|S|$ increases by 1 at every iteration, so the potential $|S| + \Sigma$ never decreases over iterations: it has its minimal value at the beginning ($S = \emptyset$) and reaches its maximal value when $G$ becomes empty ($\Sigma = 0$). Since $S$ is an independent set by construction, $\Sigma \leq |S| \leq \alpha(G)$.
When $G$ is directed the analysis gets more complicated (it needs a lower bound on $p_t(i)$) and adds a $\log T$ factor to the bound. We have obtained:
$R_T \leq \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \alpha(G_t) = O\Big(\sqrt{(\ln K) \sum_{t=1}^T \alpha(G_t)}\Big)$.
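A small numeric illustration of the greedy argument from the last two slides, assuming an undirected graph with self-loops encoded as a boolean adjacency matrix: it computes $\Sigma = \sum_i p(i)/Q^G(i)$, runs the greedy vertex-removal procedure, and checks $\Sigma \leq |S|$ (and $|S| \leq \alpha(G)$, since $S$ is independent by construction). The random test graph and all names are illustrative.

```python
import numpy as np

def variance_term(p, G):
    """Sigma = sum_i p(i) / Q^G(i), with Q^G(i) = sum of p over i's neighbourhood."""
    Q = G.astype(float) @ p
    return float(np.sum(p / Q))

def greedy_independent_set(p, G):
    """Greedy construction from the slides: repeatedly pick the remaining vertex
    with the smallest Q in the remaining graph, then delete it and its neighbours."""
    alive = np.ones(len(p), dtype=bool)
    S = []
    while alive.any():
        Q = G[:, alive].astype(float) @ p[alive]   # Q restricted to the remaining graph
        Q[~alive] = np.inf
        i = int(np.argmin(Q))
        S.append(i)
        alive &= ~G[i]                             # remove i and all of its neighbours
    return S

rng = np.random.default_rng(0)
K = 8
A = rng.random((K, K)) < 0.3
G = (A | A.T) | np.eye(K, dtype=bool)              # undirected, self-loops included
p = rng.dirichlet(np.ones(K))
S = greedy_independent_set(p, G)
print(f"Sigma = {variance_term(p, G):.3f} <= |S| = {len(S)}")   # and |S| <= alpha(G)
```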

Stochastic bandit problem/1. K actions for the Learner. When picking action i at time t, the Learner receives as reward an independent realization of a random variable $X_i$ with $\mathbb{E}[X_i] = \mu_i$, $X_i \in [0,1]$. The $\mu_i$'s are hidden to the Learner. For t = 1, 2, ...:
1. The Learner picks action $I_t$ (possibly using randomization) and gathers reward $X_{I_t,t}$.
2. The Learner gets feedback information: $X_{I_t,t}$ only.
Goal: optimize the (pseudo) regret
$R_T = \max_{i=1,\dots,K} \mathbb{E}\Big[\sum_{t=1}^T X_{i,t}\Big] - \mathbb{E}\Big[\sum_{t=1}^T X_{I_t,t}\Big] = \mu^* T - \mathbb{E}\Big[\sum_{t=1}^T X_{I_t,t}\Big] = \sum_{i=1}^K \Delta_i\,\mathbb{E}[T_i(T)]$,
where $\Delta_i = \mu^* - \mu_i$ and $T_i(T)$ is the number of times action i is played up to time T.

Stochastic bandit problem/2: the UCB algorithm [ACF02]. At round t pick action
$I_t = \arg\max_{i=1,\dots,K} \Big( \bar{X}_{i,t-1} + \sqrt{\tfrac{\ln t}{T_{i,t-1}}} \Big)$,
where $T_{i,t-1}$ = number of times the reward of action i has been observed so far, and $\bar{X}_{i,t-1} = \frac{1}{T_{i,t-1}} \sum_{s \leq t-1 : I_s = i} X_{i,s}$ = average reward of action i observed so far.
(Pseudo) regret: $R_T = O\Big(\Big(\sum_{i : \Delta_i > 0} \frac{1}{\Delta_i}\Big) \ln T + K\Big)$.
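A short UCB sketch matching the index on this slide (empirical mean plus $\sqrt{\ln t / T_{i,t-1}}$), run on Bernoulli arms; the arm means, the horizon, and the one-pull-per-arm initialization are illustrative choices, not prescribed by the slide.

```python
import numpy as np

def ucb(means, T, seed=0):
    """UCB sketch on Bernoulli arms: after trying every arm once, play the arm
    maximizing (empirical mean) + sqrt(ln t / number of pulls)."""
    rng = np.random.default_rng(seed)
    K = len(means)
    pulls = np.zeros(K)
    sums = np.zeros(K)
    reward = 0.0
    for t in range(1, T + 1):
        if t <= K:
            i = t - 1                                      # initialization: one pull per arm
        else:
            i = int(np.argmax(sums / pulls + np.sqrt(np.log(t) / pulls)))
        x = float(rng.random() < means[i])                 # Bernoulli(mu_i) reward
        pulls[i] += 1
        sums[i] += x
        reward += x
    return max(means) * T - reward                         # empirical regret vs best arm

print(ucb(means=[0.3, 0.5, 0.45, 0.6], T=20000, seed=1))
```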

Stochastic bandits with feedback graphs/1. K actions for the Learner, arranged into a fixed graph $G = (V, E)$. When picking action i at time t, the Learner receives as reward an independent realization of a random variable $X_i$ with $\mathbb{E}[X_i] = \mu_i$, but also observes the rewards of the nearby actions in $G$. The $\mu_i$'s are hidden to the Learner. For t = 1, 2, ...:
1. The Learner picks action $I_t$ (possibly using randomization) and gathers reward $X_{I_t,t}$.
2. The Learner gets feedback information: $\{X_{j,t} : (I_t, j) \in E\}$.
Goal: optimize the (pseudo) regret
$R_T = \max_{i=1,\dots,K} \mathbb{E}\Big[\sum_{t=1}^T X_{i,t}\Big] - \mathbb{E}\Big[\sum_{t=1}^T X_{I_t,t}\Big] = \mu^* T - \mathbb{E}\Big[\sum_{t=1}^T X_{I_t,t}\Big] = \sum_{i=1}^K \Delta_i\,\mathbb{E}[T_i(T)]$.

Stochastic bandits with feedback graphs/2: UCB-N [Ca+12]. At round t pick action
$I_t = \arg\max_{i=1,\dots,K} \Big( \bar{X}_{i,t-1} + \sqrt{\tfrac{\ln t}{O_{i,t-1}}} \Big)$,
where $O_{i,t-1}$ = number of times the reward of action i has been observed so far, and $\bar{X}_{i,t-1} = \frac{1}{O_{i,t-1}} \sum_{s \leq t-1 : I_s \sim_G i} X_{i,s}$ = average reward of action i observed so far.
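A sketch of the UCB-N idea on a fixed undirected feedback graph: every observed sample of an arm, whether the arm was played or is merely a neighbor of the played arm, enters that arm's estimate and its observation count $O_{i,t-1}$. The two-clique graph and the arm means are illustrative assumptions.

```python
import numpy as np

def ucb_n(means, G, T, seed=0):
    """UCB-N sketch: playing arm i also reveals the rewards of its neighbours in
    the fixed feedback graph G, and every observation of an arm (own pull or side
    observation) enters that arm's estimate and its observation count O_i."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means)
    K = len(means)
    obs = np.zeros(K)              # O_{i,t}: times the reward of arm i was observed
    sums = np.zeros(K)
    reward = 0.0
    for t in range(1, T + 1):
        if (obs == 0).any():
            i = int(np.argmin(obs))                          # observe every arm at least once
        else:
            i = int(np.argmax(sums / obs + np.sqrt(np.log(t) / obs)))
        x = (rng.random(K) < means).astype(float)            # this round's rewards, all arms
        reward += x[i]
        seen = G[i]                                          # neighbourhood of the played arm
        obs[seen] += 1
        sums[seen] += x[seen]
    return float(means.max() * T - reward)

# Two 3-cliques: side observations flow within each clique.
G = np.zeros((6, 6), dtype=bool)
G[:3, :3] = True
G[3:, 3:] = True
print(ucb_n([0.2, 0.3, 0.4, 0.5, 0.55, 0.6], G, T=20000, seed=1))
```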

Stochastic bandits with feedback graphs/3. Clique covering number $\chi(G)$ (assume $G$ is undirected):
$1 \leq \alpha(G) \leq \chi(G) \leq K$, with a clique corresponding to the expert problem and an edgeless graph to the bandit problem. (Figure: an example graph with clique cover number $c(G) = 4$.)
Regret analysis: given any partition $\mathcal{C}$ of $V$ into cliques, $\mathcal{C} = \{C_1, C_2, \dots, C_{|\mathcal{C}|}\}$:
$R_T = O\Big(\sum_{C \in \mathcal{C}} \frac{\max_{i \in C} \Delta_i}{\min_{i \in C} \Delta_i^2} \ln T + K\Big)$.
This is a sum over $\chi(G)$ regret terms (but it can be improved to $\alpha(G)$). The term $K$ can be replaced by $\chi(G)$ by a modified algorithm. No tight lower bounds are available.

Simple examples/1: Auctions (nonstochastic). Second-price auction with reserve price, on the seller side: the highest bid is revealed to the seller after the auction (e.g., AppNexus), and the auctioneer is a third party. Revenue (= 1 − loss). Actions are reserve prices: each price reveals the revenue of itself plus the revenue of any higher price. After the seller plays reserve price $I_t$, both the seller's revenue and the highest bid are revealed to him/her, so the seller/player is in a position to observe all revenues for prices $j \geq I_t$. (Figure: revenue as a function of the reserve price, with the bids and the played price $I_t$ marked, and the induced feedback graph $G_t$.)
$\alpha(G) = 1$: $R_T = O\big(\ln(TK)\sqrt{T}\big)$ (an expert problem, up to log factors) [CB+17].
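A small sketch of the observability structure behind this example, under the usual second-price-with-reserve revenue rule: after playing one reserve price, the revealed revenue together with the revealed highest bid determines the revenue of every higher price on the grid. The price grid, the bid values, and the function names are illustrative.

```python
def revenue(reserve, bids):
    """Seller's revenue in a second-price auction with reserve: the highest bidder
    wins if she clears the reserve and pays max(second-highest bid, reserve)."""
    b1, b2 = sorted(bids, reverse=True)[:2]
    return 0.0 if b1 < reserve else max(b2, reserve)

def revealed_revenues(played, prices, bids):
    """What the seller can reconstruct after one round: the revenue of the played
    reserve price plus, using the revealed highest bid, the revenue of every
    higher reserve price on the grid."""
    b1 = max(bids)
    r_played = revenue(played, bids)
    out = {}
    for p in prices:
        if p < played:
            continue                     # lower reserve prices are not observable
        elif p > b1:
            out[p] = 0.0                 # no sale at this reserve price
        elif p <= r_played:
            out[p] = r_played            # the second-highest bid still binds
        else:
            out[p] = p                   # the reserve price itself binds
    return out

prices = [i / 10 for i in range(11)]     # grid of candidate reserve prices
bids = [0.62, 0.35, 0.18]
observed = revealed_revenues(played=0.2, prices=prices, bids=bids)
# Sanity check: the reconstruction matches the true revenue curve for all j >= I_t.
assert all(abs(v - revenue(p, bids)) < 1e-12 for p, v in observed.items())
print(observed)
```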

Simple examples/2: Contextual bandits (nonstochastic) [Auer+02]. K predictors $f_i : \{1, \dots, T\} \to \{1, \dots, N\}$, $i = 1, \dots, K$, each one using the same $N \ll K$ base actions. The Learner's action space is the set of K predictors. For t = 1, 2, ...:
1. Losses $\ell_t(j) \in [0,1]$ are assigned deterministically by Nature to every base action $j = 1, \dots, N$ (hidden to the Learner).
2. The Learner observes $f_1(t), \dots, f_K(t)$.
3. The Learner picks predictor $f_{I_t}$ (possibly using randomization) and incurs loss $\ell_t(f_{I_t}(t))$.
4. The Learner gets feedback information: $\ell_t(f_{I_t}(t))$.
The feedback graph $G_t$ on the K predictors is made up of the N cliques $\{i : f_i(t) = 1\}, \{i : f_i(t) = 2\}, \dots, \{i : f_i(t) = N\}$. Independence number: $\alpha(G_t) \leq N$ for all t. (Figure: example with K = 8 predictors and N = 3 actions.)
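A tiny sketch of the feedback graph induced at one round by the K predictors: two predictors are connected exactly when they recommend the same base action, so $G_t$ is a disjoint union of at most N cliques and $\alpha(G_t) \leq N$. The prediction vector is an illustrative assumption.

```python
import numpy as np

def contextual_feedback_graph(predictions):
    """Feedback graph over the K predictors at one round: predictors recommending
    the same base action form a clique, since observing the loss of the played
    predictor's action reveals the loss of every predictor that picked it too."""
    preds = np.asarray(predictions)
    return preds[:, None] == preds[None, :]     # K x K boolean, self-loops included

# K = 8 predictors over N = 3 base actions (the values are illustrative).
preds_t = [0, 1, 1, 2, 0, 2, 1, 0]
G_t = contextual_feedback_graph(preds_t)
# G_t is a disjoint union of at most N cliques, hence alpha(G_t) <= N:
print("cliques this round:", len(set(preds_t)))
print(G_t.astype(int))
```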

References
CB+97: N. Cesa-Bianchi, Y. Freund, D. Haussler, D. Helmbold, R. Schapire, M. Warmuth. How to use expert advice. Journal of the ACM, 44(3), pp. 427-485, 1997.
AB09: J. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11, pp. 2635-2686, 2010.
Auer+02: P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1), pp. 48-77, 2002.
ACF02: P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning Journal, 47(2-3), pp. 235-256, 2002.
MS11: S. Mannor, O. Shamir. From Bandits to Experts: On the Value of Side-Observations. NIPS 2011.
Ca+12: S. Caron, B. Kveton, M. Lelarge, and S. Bhagat. Leveraging Side Observations in Stochastic Bandits. UAI 2012.
A+13: N. Alon, N. Cesa-Bianchi, C. Gentile, Y. Mansour. From bandits to experts: A tale of domination and independence. NIPS 2013.
K+14: T. Kocák, G. Neu, M. Valko, R. Munos. Efficient learning by implicit exploration in bandit problems with side observations. NIPS 2014.
Ne15: G. Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. NIPS 2015.
CB+17: N. Cesa-Bianchi, P. Gaillard, C. Gentile, S. Gerchinovitz. Algorithmic chaining and the role of partial feedback in online nonparametric learning. COLT 2017.