Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem
1 Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem. Masrour Zoghi (1), Shimon Whiteson (1), Rémi Munos (2) and Maarten de Rijke (1). University of Amsterdam (1); INRIA Lille / MSR-NE (2). June 24, 2014.
2 Plan of Talk: Motivation, Definitions, Algorithm, Experiments, Regret Bounds, Open problems
3 Motivation for Dueling Bandits: People are better at expressing relative preferences than absolute quality judgments. Preference feedback is abundant, e.g. in Information Retrieval and Recommender Systems. Standard K-armed bandit algorithms can't be applied to such feedback.
4 Plan of Talk: Motivation, Definitions, Algorithm, Experiments, Regret Bounds, Open problems
5 K-armed Dueling Bandits: K arms {a_1, ..., a_K}. Preference probabilities p_ij := Pr(a_i beats a_j) for i, j = 1, ..., K. (Slide shows an example 4×4 preference matrix p_ij over arms a_1, ..., a_4.)
6 K-armed Dueling Bandits: K arms {a_1, ..., a_K}. Preference probabilities p_ij := Pr(a_i beats a_j) for i, j = 1, ..., K. Goal 1: find the best arm, i.e. the arm a_b that beats all others: p_bj > 0.5 for all j ≠ b. Goal 2: keep the number of suboptimal comparisons low. (Slide shows an example 4×4 preference matrix p_ij over arms a_1, ..., a_4.)
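Goal 1 can be sketched as a simple check on a known preference matrix (the matrix values below are illustrative assumptions, not from the talk; in the bandit setting P is of course unknown and must be estimated):

```python
def condorcet_winner(P):
    """Return the index of the arm a_b beating all others (p_bj > 0.5), or None."""
    K = len(P)
    for b in range(K):
        if all(P[b][j] > 0.5 for j in range(K) if j != b):
            return b
    return None

# Hypothetical 4-arm preference matrix; arm 0 beats every other arm.
P = [
    [0.5, 0.6, 0.7, 0.8],
    [0.4, 0.5, 0.6, 0.7],
    [0.3, 0.4, 0.5, 0.6],
    [0.2, 0.3, 0.4, 0.5],
]
print(condorcet_winner(P))  # 0
```

Such a best arm (a Condorcet winner) need not exist in general; the talk's later open problems mention the case with no best arm.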
7 Evaluation Measure: Given a comparison between a_i and a_j, define the regret as r = (Δ_i + Δ_j) / 2, with Δ_k = p_bk − 1/2, where a_b is the best arm (the best ranker). Cumulative regret is the sum of this regret over time. The regret is zero only when a_1 (the best arm in the example) is compared against itself. (Slide shows an example 4×4 preference matrix p_ij over arms a_1, ..., a_4.)
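The regret measure above can be sketched directly (the preference matrix values are illustrative assumptions with best arm b = 0):

```python
def duel_regret(P, b, i, j):
    """Regret of comparing a_i with a_j: r = (Delta_i + Delta_j) / 2,
    where Delta_k = p_bk - 1/2 and a_b is the best arm."""
    delta = lambda k: P[b][k] - 0.5
    return (delta(i) + delta(j)) / 2

# Hypothetical preference matrix with best arm b = 0.
P = [
    [0.5, 0.6, 0.7, 0.8],
    [0.4, 0.5, 0.6, 0.7],
    [0.3, 0.4, 0.5, 0.6],
    [0.2, 0.3, 0.4, 0.5],
]
print(duel_regret(P, b=0, i=0, j=0))            # 0.0: best arm vs. itself, no regret
print(round(duel_regret(P, b=0, i=1, j=3), 3))  # 0.2: average suboptimality of the pair
```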
8 Plan of Talk: Motivation, Definitions, Algorithm, Experiments, Regret Bounds, Open problems
9 Relative Upper Confidence Bound. Three phases in each iteration:
1. Run an optimistic tournament to pick a contender:
   - Compute μ_ij(t), the frequentist estimates of p_ij.
   - Add optimism bonuses to get upper bounds u_ij(t) = μ_ij(t) + sqrt(α log t / N_ij(t)), where N_ij(t) is the number of comparisons between a_i and a_j so far.
   - Choose as contender an arm that optimistically beats everyone, i.e. u_ij(t) ≥ 0.5 for all j. (In the running example, a_1 and a_2 are potential contenders; a_2 is chosen.)
2. Pick a challenger using UCB relative to the contender: optimism for the challenger is pessimism for the contender, so choose the challenger with the highest upper confidence bound against the contender, i.e. the arm most likely to show that the contender is not the best arm.
3. Compare the two arms and update the score sheet. (In the example, a_4 is played against a_2.)
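The three phases can be sketched in simplified form (this compresses the actual algorithm's bookkeeping and tie-breaking; `duel` is an assumed comparison oracle returning True when the first arm wins):

```python
import math
import random

def rucb(duel, K, T, alpha=0.51, seed=0):
    """Simplified sketch of the three RUCB phases over T rounds."""
    rng = random.Random(seed)
    wins = [[0] * K for _ in range(K)]  # wins[i][j]: times a_i has beaten a_j
    for t in range(1, T + 1):
        # Phase 1: optimistic estimates u_ij(t) = mu_ij(t) + sqrt(alpha * log t / N_ij(t)).
        def u(i, j):
            if i == j:
                return 0.5
            n = wins[i][j] + wins[j][i]
            if n == 0:
                return 1.0  # unexplored pairs get a fully optimistic bound
            return wins[i][j] / n + math.sqrt(alpha * math.log(t) / n)
        # Contender: an arm that optimistically beats every other arm.
        champs = [i for i in range(K) if all(u(i, j) >= 0.5 for j in range(K))]
        c = rng.choice(champs) if champs else rng.randrange(K)
        # Phase 2: challenger with the highest UCB for beating the contender.
        d = max((j for j in range(K) if j != c), key=lambda j: u(j, c))
        # Phase 3: compare the two arms and update the score sheet.
        if duel(c, d):
            wins[c][d] += 1
        else:
            wins[d][c] += 1
    return wins

# Toy run: the arm with the lower index deterministically wins every duel.
wins = rucb(lambda i, j: i < j, K=4, T=500)
print(sum(sum(row) for row in wins))  # 500: exactly one comparison per round
```

With a stochastic oracle, `duel` would instead return True with probability p_ij drawn from the preference matrix.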
18 Plan of Talk: Motivation, Definitions, Algorithm, Experiments, Regret Bounds, Open problems
19 Experiments: 64-armed problem obtained from the LETOR NP2004 learning-to-rank dataset. Compared against Beat the Mean (BTM) from (Yue & Joachims, 2011) and Condorcet SAVAGE from (Urvoy et al, 2013). (Figure: cumulative regret over time on the LETOR NP2004 dataset with 64 rankers, for BTM, Condorcet SAVAGE and RUCB.)
20 Plan of Talk: Motivation, Definitions, Algorithm, Experiments, Regret Bounds, Open problems
21 Existing Regret Bounds (in order of increasingly restrictive assumptions):
- (Urvoy et al, ICML 2013) SAVAGE: very general assumptions, but an O(K^2 log T) upper bound.
- (Yue & Joachims, ICML 2011) Beat the Mean (BTM): O(K log T) upper bound, but assumes a total ordering of the arms, requires prior knowledge of the problem, and scales poorly as the problem degrades.
- (Yue et al, COLT 2009) Interleaved Filter: O(K log T) lower bound, but only on particular dueling bandit problems; O(K log T) upper bound, but with strong transitivity assumptions.
- (Ailon et al, ICML 2014) Doubler & MultiSBM: O(K log T) upper and lower bounds, but only for dueling bandits arising from regular bandits.
Remark: Our bounds are in O(K log T) and do not require any transitivity assumptions or prior knowledge.
33 Regret Bounds. Setup: P := [p_ij], i, j = 1, ..., K, is the preference matrix; we only assume p_1j > 0.5 for all j > 1 (no transitivity assumptions). R_T is the cumulative regret up to time T, and α is the exploration parameter in the UCB term.

High-probability regret bound: given α > 0.5 and δ > 0, applying RUCB with parameter α to the above problem yields, with probability 1 − δ,
R_T ≤ C(K, δ, α) + Σ_{j=2}^{K} D_j(α) ln T,
where the first term is O(K^2) and the second is O(K ln T).

Expected regret bound: assuming α > 1,
E[R_T] ≤ C'(K, α) + Σ_{j=2}^{K} D_j(α) ln T,
with the same O(K^2) and O(K ln T) asymptotics.
36 Plan of Talk: Motivation, Definitions, Algorithm, Experiments, Regret Bounds, Open problems
37 Open Theoretical Questions 1:
- Comprehensive lower bounds: e.g., is the O(K^2) additive constant in E[R_T] ≤ C'(K, α) + Σ_{j=2}^{K} D_j(α) ln T necessary for problems with cycles?
- Expected regret bounds for α ≤ 1.
- Extensions to the case when there is no best arm.
- Results for a GP or X-armed extension.
- Theoretical results for a Thompson Sampling version.
- Contextual, adversarial, etc.
38 Open Theoretical Questions 2:
- The preference matrix P_YJ used in (Yue & Joachims, ICML 2011) and (Ailon et al, ICML 2014).
- Relative Confidence Sampling (RCS) from (Zoghi et al, WSDM 2014) and Sparring from (Ailon et al, ICML 2014).
(Figure: cumulative regret over time on the YJ preference matrix with 6 arms, for RUCB (α = 0.51), Sparring and RCS.)
39 Contributions:
- A new K-armed dueling bandit algorithm.
- Outperforms existing algorithms that come with theoretical results.
- Expected and high-probability regret bounds (the latter not requiring δ to be passed to the algorithm).
- Main contribution: asymptotically optimal regret bounds for a broad class of problems.
40 Thank you. Masrour Zoghi.
A parametric approach to Bayesian optimization with pairwise comparisons Marco Co Eindhoven University of Technology m.g.h.co@tue.nl Bert de Vries Eindhoven University of Technology and GN Hearing bdevries@ieee.org
More informationOnline Learning and Sequential Decision Making
Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Sequential Decision
More informationSubsampling, Concentration and Multi-armed bandits
Subsampling, Concentration and Multi-armed bandits Odalric-Ambrym Maillard, R. Bardenet, S. Mannor, A. Baransi, N. Galichet, J. Pineau, A. Durand Toulouse, November 09, 2015 O-A. Maillard Subsampling and
More informationBandit Algorithms. Tor Lattimore & Csaba Szepesvári
Bandit Algorithms Tor Lattimore & Csaba Szepesvári Bandits Time 1 2 3 4 5 6 7 8 9 10 11 12 Left arm $1 $0 $1 $1 $0 Right arm $1 $0 Five rounds to go. Which arm would you play next? Overview What are bandits,
More informationPreference Based Adaptation for Learning Objectives
Preference Based Adaptation for Learning Objectives Yao-Xiang Ding Zhi-Hua Zhou National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 223, China {dingyx, zhouzh}@lamda.nju.edu.cn
More informationGraphs in Machine Learning
Graphs in Machine Learning Michal Valko Inria Lille - Nord Europe, France Partially based on material by: Toma s Koca k November 23, 2015 MVA 2015/2016 Last Lecture Examples of applications of online SSL
More informationLearning from Rational * Behavior
Learning from Rational * Behavior Josef Broder, Olivier Chapelle, Geri Gay, Arpita Ghosh, Laura Granka, Thorsten Joachims, Bobby Kleinberg, Madhu Kurup, Filip Radlinski, Karthik Raman, Tobias Schnabel,
More informationExploration and exploitation of scratch games
Mach Learn (2013) 92:377 401 DOI 10.1007/s10994-013-5359-2 Exploration and exploitation of scratch games Raphaël Féraud Tanguy Urvoy Received: 10 January 2013 / Accepted: 12 April 2013 / Published online:
More informationEvaluation of multi armed bandit algorithms and empirical algorithm
Acta Technica 62, No. 2B/2017, 639 656 c 2017 Institute of Thermomechanics CAS, v.v.i. Evaluation of multi armed bandit algorithms and empirical algorithm Zhang Hong 2,3, Cao Xiushan 1, Pu Qiumei 1,4 Abstract.
More informationCorrupt Bandits. Abstract
Corrupt Bandits Pratik Gajane Orange labs/inria SequeL Tanguy Urvoy Orange labs Emilie Kaufmann INRIA SequeL pratik.gajane@inria.fr tanguy.urvoy@orange.com emilie.kaufmann@inria.fr Editor: Abstract We
More informationHybrid Machine Learning Algorithms
Hybrid Machine Learning Algorithms Umar Syed Princeton University Includes joint work with: Rob Schapire (Princeton) Nina Mishra, Alex Slivkins (Microsoft) Common Approaches to Machine Learning!! Supervised
More informationStable Coactive Learning via Perturbation
Karthik Raman horsten Joachims Department of Computer Science, Cornell University, Ithaca, NY, USA Pannaga Shivaswamy A& Research, San Francisco, CA, USA obias Schnabel Fachbereich Informatik, Universitaet
More informationThompson Sampling for the MNL-Bandit
JMLR: Workshop and Conference Proceedings vol 65: 3, 207 30th Annual Conference on Learning Theory Thompson Sampling for the MNL-Bandit author names withheld Editor: Under Review for COLT 207 Abstract
More informationParallel Gaussian Process Optimization with Upper Confidence Bound and Pure Exploration
Parallel Gaussian Process Optimization with Upper Confidence Bound and Pure Exploration Emile Contal David Buffoni Alexandre Robicquet Nicolas Vayatis CMLA, ENS Cachan, France September 25, 2013 Motivating
More informationAdvanced Machine Learning
Advanced Machine Learning Bandit Problems MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Multi-Armed Bandit Problem Problem: which arm of a K-slot machine should a gambler pull to maximize his
More informationarxiv: v1 [cs.lg] 15 Oct 2014
THOMPSON SAMPLING WITH THE ONLINE BOOTSTRAP By Dean Eckles and Maurits Kaptein Facebook, Inc., and Radboud University, Nijmegen arxiv:141.49v1 [cs.lg] 15 Oct 214 Thompson sampling provides a solution to
More informationActive Learning and Optimized Information Gathering
Active Learning and Optimized Information Gathering Lecture 7 Learning Theory CS 101.2 Andreas Krause Announcements Project proposal: Due tomorrow 1/27 Homework 1: Due Thursday 1/29 Any time is ok. Office
More informationThompson sampling for web optimisation. 29 Jan 2016 David S. Leslie
Thompson sampling for web optimisation 29 Jan 2016 David S. Leslie Plan Contextual bandits on the web Thompson sampling in bandits Selecting multiple adverts Plan Contextual bandits on the web Thompson
More informationCrowd-Learning: Improving the Quality of Crowdsourcing Using Sequential Learning
Crowd-Learning: Improving the Quality of Crowdsourcing Using Sequential Learning Mingyan Liu (Joint work with Yang Liu) Department of Electrical Engineering and Computer Science University of Michigan,
More informationOnline Optimization in X -Armed Bandits
Online Optimization in X -Armed Bandits Sébastien Bubeck INRIA Lille, SequeL project, France sebastien.bubeck@inria.fr Rémi Munos INRIA Lille, SequeL project, France remi.munos@inria.fr Gilles Stoltz Ecole
More informationAnytime optimal algorithms in stochastic multi-armed bandits
Rémy Degenne LPMA, Université Paris Diderot Vianney Perchet CREST, ENSAE REMYDEGENNE@MATHUNIV-PARIS-DIDEROTFR VIANNEYPERCHET@NORMALESUPORG Abstract We introduce an anytime algorithm for stochastic multi-armed
More information