Online Learning with Feedback Graphs
Online Learning with Feedback Graphs
Nicolò Cesa-Bianchi, Università degli Studi di Milano
Joint work with: Noga Alon (Tel-Aviv University), Ofer Dekel (Microsoft Research), Tomer Koren (Technion and Microsoft Research)
Also: Claudio Gentile, Shie Mannor, Yishay Mansour, Ohad Shamir
N. Cesa-Bianchi (UNIMI), Online Learning with Feedback Graphs, 1 / 20

Theory of repeated games
James Hannan and David Blackwell: learning to play a game (1956). Play a game repeatedly against a possibly suboptimal opponent.

Prediction with expert advice
N actions. For t = 1, 2, ...
1. Loss l_t(i) in [0, 1] is assigned to every action i = 1, ..., N (hidden from the player)
2. Player picks an action I_t (possibly using randomization) and incurs loss l_t(I_t)
3. Player gets feedback information: l_t = ( l_t(1), ..., l_t(N) )

Regret
The loss process (l_t)_{t >= 1} is deterministic and unknown to the (randomized) player I_1, I_2, ...
Regret of the player I_1, I_2, ...:
    R_T = E[ sum_{t=1}^T l_t(I_t) ] - min_{i=1,...,N} sum_{t=1}^T l_t(i)
and we want R_T = o(T).
Asymptotic lower bound for the experts game:
    R_T >= (1 - o(1)) sqrt( (T ln N) / 2 )
The proof uses an i.i.d. stochastic loss process.

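As a toy illustration (not from the talk), one realization of the regret of a uniformly random player on a synthetic loss matrix can be computed as follows; the loss distribution and all names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 1000, 5
losses = rng.uniform(0.0, 1.0, size=(T, N))      # l_t(i) in [0, 1], fixed in advance

plays = rng.integers(0, N, size=T)               # a player choosing uniformly at random
player_loss = losses[np.arange(T), plays].sum()  # sum_t l_t(I_t)
best_fixed = losses.sum(axis=0).min()            # min_i sum_t l_t(i)
regret = player_loss - best_fixed                # one realization of R_T
```

Note this is a single realization, not the expectation in the definition; averaging over many runs of `plays` would approximate R_T.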
Exponentially weighted forecaster
At time t, pick action I_t = i with probability proportional to
    exp( -eta * sum_{s=1}^{t-1} l_s(i) )
where the sum in the exponent is the total loss of action i so far.
Regret bound: if eta = sqrt( (8 ln N) / T ), then
    R_T <= sqrt( (T ln N) / 2 )
matching the asymptotic lower bound, including constants. The dynamic choice eta_t = sqrt( (8 ln N) / t ) loses only small constants.

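A minimal sketch of the exponentially weighted forecaster on a synthetic full-information loss sequence, using the tuning eta = sqrt(8 ln N / T) quoted above; the loss matrix is illustrative:

```python
import numpy as np

def exp_weights(losses, eta):
    """Run exponential weights on a (T, N) loss matrix; return expected regret."""
    T, N = losses.shape
    cum = np.zeros(N)                          # cumulative loss of each action
    exp_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum - cum.min()))   # shift by min for numerical stability
        p = w / w.sum()                        # P(I_t = i)
        exp_loss += p @ losses[t]              # expected loss this round
        cum += losses[t]
    return exp_loss - cum.min()                # expected loss minus best fixed action

rng = np.random.default_rng(1)
T, N = 2000, 10
L = rng.uniform(size=(T, N))
eta = np.sqrt(8 * np.log(N) / T)
R = exp_weights(L, eta)
bound = np.sqrt(T * np.log(N) / 2)   # the regret bound quoted on the slide
```

The bound holds for every loss sequence in [0, 1]^{T x N}, so `R <= bound` regardless of how `L` is generated.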
The bandit problem: playing an unknown game
N actions. For t = 1, 2, ...
1. Loss l_t(i) in [0, 1] is assigned to every action i = 1, ..., N (hidden from the player)
2. Player picks an action I_t (possibly using randomization) and incurs loss l_t(I_t)
3. Player gets feedback information: only l_t(I_t) is revealed
Many applications: ad placement, recommender systems, online auctions, ...

Bandits as an instance of a general feedback model
Besides observing the loss of the played action, the player also observes the losses of some other actions. For example, a recommender system can infer how the user would have reacted had similar products been recommended. However, we do not insist on assuming that observability between actions implies similarity between losses.
How does the observability structure influence regret?

Feedback graph
[Figure: a graph over the N actions indicating which losses are observed when each action is played.]

Recovering the expert and bandit settings
Experts: clique. Bandits: empty graph.

Exponentially weighted forecaster, reprise
Player's strategy:
    P_t(I_t = i) proportional to exp( -eta * sum_{s=1}^{t-1} hat-l_s(i) ),   i = 1, ..., N
where
    hat-l_t(i) = l_t(i) / P_t( l_t(i) is observed )   if l_t(i) is observed, and 0 otherwise.
Importance sampling estimator:
    E_t[ hat-l_t(i) ] = l_t(i)   (unbiasedness)
    E_t[ hat-l_t(i)^2 ] = l_t(i)^2 / P_t( l_t(i) is observed )   (variance control)

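The unbiasedness and second-moment identities above can be checked by Monte Carlo; the losses and observation probabilities below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 4
loss = rng.uniform(size=N)          # l_t(i)
p_obs = rng.uniform(0.2, 1.0, N)    # P_t(l_t(i) observed), bounded away from 0

# hat-l_t(i) = l_t(i) / P_t(observed) if l_t(i) is observed, else 0
M = 200_000
observed = rng.uniform(size=(M, N)) < p_obs        # M independent feedback draws
hat_l = np.where(observed, loss / p_obs, 0.0)

mean_est = hat_l.mean(axis=0)               # approaches l_t(i)        (unbiasedness)
second_moment = (hat_l ** 2).mean(axis=0)   # approaches l^2 / p_obs   (variance control)
```

The second moment blows up as the observation probability shrinks, which is why the analysis needs to control it via the feedback graph.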
Independence number alpha(G)
The size of the largest independent set.

Regret bounds
Analysis (undirected graphs):
    R_T <= ln N / eta + (eta/2) E[ sum_{t=1}^T sum_{i=1}^N P_t(i is played) / P_t(l_t(i) is observed) ]
Lemma. For any undirected graph G = (V, E) and for any probability assignment p_1, ..., p_N over its vertices,
    sum_{i=1}^N p_i / ( p_i + sum_{j in N_G(i)} p_j ) <= alpha(G)
where the denominator is P_t(loss of i observed).

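The lemma can be checked numerically on a small random undirected graph (illustrative only; alpha(G) is computed by exhaustive search, which is feasible only for tiny graphs):

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
N = 8
adj = np.triu(rng.uniform(size=(N, N)) < 0.3, k=1)
adj = adj | adj.T                                   # undirected, no self-loops

def independence_number(A):
    """Brute-force alpha(G): largest subset with no internal edges."""
    n = len(A)
    for r in range(n, 0, -1):
        for S in itertools.combinations(range(n), r):
            if not any(A[i][j] for i, j in itertools.combinations(S, 2)):
                return r
    return 0

alpha = independence_number(adj)
for _ in range(100):                                # random distributions p
    p = rng.dirichlet(np.ones(N))
    val = sum(p[i] / (p[i] + p[adj[i]].sum()) for i in range(N))
    assert val <= alpha + 1e-9                      # the lemma's inequality
```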
Regret bounds
Analysis (undirected graphs), choosing eta optimally:
    R_T <= ln N / eta + (eta/2) sum_{t=1}^T alpha(G) = sqrt( T alpha(G) ln N )
Special cases:
  Experts (clique): alpha(G) = 1, so R_T is about sqrt(T ln N)
  Bandits (empty graph): alpha(G) = N, so R_T is about sqrt(T N ln N)
Minimax rate: the general bound is tight: R_T = Theta( sqrt( T alpha(G) ln N ) )

More general feedback models
Directed graphs. Interventions.

Old and new examples
Experts. Bandits. Cops & Robbers. Revealing action.

Exponentially weighted forecaster with exploration
Player's strategy:
    P_t(I_t = i) = (1 - gamma) * exp( -eta * sum_{s=1}^{t-1} hat-l_s(i) ) / Z_t + gamma * U_G(i),   i = 1, ..., N
with
    hat-l_t(i) = l_t(i) / P_t( l_t(i) is observed )   if l_t(i) is observed, and 0 otherwise
where U_G is the uniform distribution supported on a subset of V.

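A sketch of the mixed strategy above: exponential weights on estimated cumulative losses, mixed with an exploration distribution U_G taken uniform over a chosen subset of vertices (the subset and the numbers here are arbitrary):

```python
import numpy as np

def mixed_distribution(est_cum_losses, eta, gamma, explore_set):
    """(1 - gamma) * exp-weights distribution + gamma * uniform on explore_set."""
    N = len(est_cum_losses)
    w = np.exp(-eta * (est_cum_losses - est_cum_losses.min()))
    q = w / w.sum()                                  # exp-weights part (w.sum() is Z_t)
    u = np.zeros(N)
    u[list(explore_set)] = 1.0 / len(explore_set)    # U_G, uniform on a subset of V
    return (1.0 - gamma) * q + gamma * u

p = mixed_distribution(np.array([3.0, 1.0, 2.0, 5.0]), eta=0.5, gamma=0.1,
                       explore_set={0, 2})
```

The mixing guarantees every vertex in the exploration set is played with probability at least gamma / |explore_set|, which keeps the importance-weighted estimates bounded.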
A characterization of feedback graphs
A vertex of G is:
  observable if it has at least one incoming edge (possibly a self-loop)
  strongly observable if it has either a self-loop or incoming edges from all other vertices
  weakly observable if it is observable but not strongly observable
[Figure: an example graph whose vertices are classified as not observable, weakly observable, and strongly observable.]

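The three observability classes can be computed directly from the definitions; the example graph below is made up, not the one on the slide:

```python
def classify(n, edges):
    """Classify each vertex of a directed graph given as a set of (u, v) edges;
    a self-loop is the pair (v, v)."""
    in_nbrs = {v: {u for (u, w) in edges if w == v} for v in range(n)}
    labels = {}
    for v in range(n):
        observable = len(in_nbrs[v]) > 0
        strongly = (v in in_nbrs[v]) or all(u in in_nbrs[v]
                                            for u in range(n) if u != v)
        if not observable:
            labels[v] = "not observable"
        elif strongly:
            labels[v] = "strongly observable"
        else:
            labels[v] = "weakly observable"
    return labels

# Example: 0 has a self-loop; 1 is seen only from 0; 2 has no incoming edges.
labels = classify(3, {(0, 0), (0, 1), (1, 0), (2, 0)})
```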
Characterization of minimax rates
  G strongly observable:  R_T = Theta( sqrt( alpha(G) T ) );  U_G is uniform on V
  G weakly observable:    R_T = Theta( delta(G)^{1/3} T^{2/3} ) for T = Omega(N^3);  U_G is uniform on a weakly dominating set
  G not observable:       R_T = Theta(T)
Weakly dominating set: delta(G) is the size of the smallest set that dominates all weakly observable vertices of G.

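delta(G) can likewise be computed by brute force for small graphs; this sketch reuses the weak-observability definition above on an illustrative example graph:

```python
import itertools

def delta(n, edges):
    """Brute-force delta(G): size of the smallest set with an edge into every
    weakly observable vertex. edges is a set of directed pairs (u, v)."""
    in_nbrs = {v: {u for (u, w) in edges if w == v} for v in range(n)}
    weak = [v for v in range(n)
            if in_nbrs[v]                                   # observable
            and v not in in_nbrs[v]                         # no self-loop
            and not all(u in in_nbrs[v] for u in range(n) if u != v)]
    if not weak:
        return 0
    for r in range(1, n + 1):
        for D in itertools.combinations(range(n), r):
            if all(in_nbrs[v] & set(D) for v in weak):      # D dominates every weak vertex
                return r

# Same example graph as before: only vertex 1 is weakly observable,
# and it is dominated by {0}.
d = delta(3, {(0, 0), (0, 1), (1, 0), (2, 0)})
```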
Some curious cases
Experts vs. Cops & Robbers: the presence of the red self-loops (in the slide's figure) does not affect the minimax regret: R_T = Theta( sqrt(T ln N) ).
Sharp transitions:
  With the red loop: strongly observable with alpha(G) = N - 1, so R_T = Theta( sqrt(N T) )
  Without the red loop: weakly observable with delta(G) = 1, so R_T = Theta( T^{2/3} ) for T = Omega(N^3)

Final remarks
The theory extends to time-varying feedback graphs.
In the strongly observable case, the algorithm can predict without knowing the graph.
The entire framework is a special case of partial monitoring, but our rates exhibit sharp problem-dependent constants.
Graph over actions, more interpretations: a relatedness (rather than observability) structure on the loss assignment, and a delay model for loss observations.

More informationLecture 6: Non-stochastic best arm identification
CSE599i: Online and Adaptive Machine Learning Winter 08 Lecturer: Kevin Jamieson Lecture 6: Non-stochastic best arm identification Scribes: Anran Wang, eibin Li, rian Chan, Shiqing Yu, Zhijin Zhou Disclaimer:
More informationAlgorithms, Games, and Networks January 17, Lecture 2
Algorithms, Games, and Networks January 17, 2013 Lecturer: Avrim Blum Lecture 2 Scribe: Aleksandr Kazachkov 1 Readings for today s lecture Today s topic is online learning, regret minimization, and minimax
More informationMachine Learning Theory (CS 6783)
Machine Learning Theory (CS 6783) Tu-Th 1:25 to 2:40 PM Hollister, 306 Instructor : Karthik Sridharan ABOUT THE COURSE No exams! 5 assignments that count towards your grades (55%) One term project (40%)
More informationTsinghua Machine Learning Guest Lecture, June 9,
Tsinghua Machine Learning Guest Lecture, June 9, 2015 1 Lecture Outline Introduction: motivations and definitions for online learning Multi-armed bandit: canonical example of online learning Combinatorial
More informationThe Multi-Armed Bandit Problem
The Multi-Armed Bandit Problem Electrical and Computer Engineering December 7, 2013 Outline 1 2 Mathematical 3 Algorithm Upper Confidence Bound Algorithm A/B Testing Exploration vs. Exploitation Scientist
More informationOnline learning with sample path constraints
Online learning with sample path constraints The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher Mannor,
More informationTheory and Applications of A Repeated Game Playing Algorithm. Rob Schapire Princeton University [currently visiting Yahoo!
Theory and Applications of A Repeated Game Playing Algorithm Rob Schapire Princeton University [currently visiting Yahoo! Research] Learning Is (Often) Just a Game some learning problems: learn from training
More informationMulti-Armed Bandits with Metric Movement Costs
Multi-Armed Bandits with Metric Movement Costs Tomer Koren 1, Roi Livni 2, and Yishay Mansour 3 1 Google; tkoren@google.com 2 Princeton University; rlivni@cs.princeton.edu 3 Tel Aviv University; mansour@cs.tau.ac.il
More informationOLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research
OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and
More informationGame-Theoretic Learning:
Game-Theoretic Learning: Regret Minimization vs. Utility Maximization Amy Greenwald with David Gondek, Amir Jafari, and Casey Marks Brown University University of Pennsylvania November 17, 2004 Background
More informationToward a Classification of Finite Partial-Monitoring Games
Toward a Classification of Finite Partial-Monitoring Games Gábor Bartók ( *student* ), Dávid Pál, and Csaba Szepesvári Department of Computing Science, University of Alberta, Canada {bartok,dpal,szepesva}@cs.ualberta.ca
More informationSmooth Calibration, Leaky Forecasts, Finite Recall, and Nash Dynamics
Smooth Calibration, Leaky Forecasts, Finite Recall, and Nash Dynamics Sergiu Hart August 2016 Smooth Calibration, Leaky Forecasts, Finite Recall, and Nash Dynamics Sergiu Hart Center for the Study of Rationality
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Uncertainty & Probabilities & Bandits Daniel Hennes 16.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Uncertainty Probability
More informationBandit Algorithms. Zhifeng Wang ... Department of Statistics Florida State University
Bandit Algorithms Zhifeng Wang Department of Statistics Florida State University Outline Multi-Armed Bandits (MAB) Exploration-First Epsilon-Greedy Softmax UCB Thompson Sampling Adversarial Bandits Exp3
More informationOn the Generalization Ability of Online Strongly Convex Programming Algorithms
On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract
More information