Monte-Carlo Tree Search by Best Arm Identification
1 Monte-Carlo Tree Search by Best Arm Identification. Emilie Kaufmann (Inria Lille, SequeL team) and Wouter M. Koolen (CWI, Machine Learning Group). Inria-CWI workshop, Amsterdam, September 20th, 2017
2 Part of a new Associate Team proposal, 6PAC, involving Peter Grünwald (CWI, Machine Learning Group), Wouter M. Koolen (CWI, Machine Learning Group), Benjamin Guedj (Inria Lille, MODAL project-team) and Emilie Kaufmann (Inria Lille, SequeL project-team). Broader goal: Probably Approximately Correct Learning$^6$ (Safe, Efficient, Sequential, Active, Structured, Ideal)
3 Monte-Carlo Tree Search for games
4 Monte-Carlo Tree Search for games. We introduce an idealized model (a fixed maximin tree, with i.i.d. playouts starting from each leaf) and propose new algorithms with sample complexity guarantees.
5 Outline 1 Problem formulation 2 The BAI-MCTS architecture 3 UGapE-MCTS and LUCB-MCTS 4 Towards optimal algorithms
7 A simple model for MCTS. A fixed MAXMIN game tree $\mathcal{T}$, with leaves $\mathcal{L}$. MAX node (= your move), MIN node (= adversary move). Leaf $\ell$: a stochastic oracle $O_\ell$ that evaluates the position.
8 A simple model for MCTS. At round $t$, an MCTS algorithm picks a path down to a leaf $L_t$ and gets an evaluation of this leaf, $X_t \sim O_{L_t}$. Assumption: i.i.d. successive evaluations, with $\mathbb{E}_{X \sim O_\ell}[X] = \mu_\ell$. (Figure: example tree with leaf means $\mu_1, \dots, \mu_8$.)
10 Goal. An MCTS algorithm should find the best move at the root. The value of a node is $V_s = \mu_s$ if $s \in \mathcal{L}$, $V_s = \max_{c \in C(s)} V_c$ if $s$ is a MAX node, and $V_s = \min_{c \in C(s)} V_c$ if $s$ is a MIN node. The best move is $s^* = \arg\max_{s \in C(s_0)} V_s$.
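When the leaf means are known, the recursive definition of $V_s$ can be evaluated directly. The following is a minimal sketch, assuming a hypothetical nested-tuple encoding of the tree (`("max", children)`, `("min", children)`, or a plain number for a leaf mean) that does not come from the slides.

```python
def value(node):
    """Return V_s for a node of a fixed MAX/MIN tree with known leaf means."""
    if isinstance(node, (int, float)):      # leaf: V_s = mu_s
        return float(node)
    kind, children = node
    child_values = [value(c) for c in children]
    # MAX nodes take the largest child value, MIN nodes the smallest.
    return max(child_values) if kind == "max" else min(child_values)


def best_root_move(root):
    """Index of s* = argmax over the root's children of V_s."""
    _, children = root
    values = [value(c) for c in children]
    return max(range(len(children)), key=lambda i: values[i])


# Depth-two example: a MAX root with two MIN children, two leaves each.
tree = ("max", [("min", [0.45, 0.50]), ("min", [0.35, 0.60])])
```

In the MCTS problem the leaf means are unknown, so this computation is only the target that a sampling algorithm tries to identify.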
11 A PAC learning framework. An MCTS algorithm is a triple $(L_t, \tau, \hat{s}_\tau)$, where $L_t$ is the sampling rule, $\tau$ is the stopping rule, and $\hat{s}_\tau \in C(s_0)$ is the recommendation rule. It is $(\epsilon, \delta)$-PAC if $\mathbb{P}(V_{\hat{s}_\tau} \geq V_{s^*} - \epsilon) \geq 1 - \delta$. Goal: an $(\epsilon, \delta)$-PAC algorithm with a small sample complexity $\tau$.
12 A simpler problem: best arm identification. Reminiscent of a bandit model with arm means $\mu_1, \dots, \mu_8$. A Best Arm Identification algorithm is a triple $(A_t, \tau, \hat{s}_\tau)$, where $A_t$ is the sampling rule, $\tau$ is the stopping rule, and $\hat{s}_\tau \in C(s_0)$ is the recommendation rule. It is $(\epsilon, \delta)$-PAC if $\mathbb{P}(\mu_{\hat{s}_\tau} \geq \mu^* - \epsilon) \geq 1 - \delta$. The BAI problem: how to adaptively sample the arms so as to identify the arm with the highest mean as quickly as possible?
14 MCTS: a structured BAI problem. As in best arm identification, an algorithm is a triple $(L_t, \tau, \hat{s}_\tau)$, where $L_t$ is the sampling rule, $\tau$ is the stopping rule, and $\hat{s}_\tau \in C(s_0)$ is the recommendation rule. It is $(\epsilon, \delta)$-PAC if $\mathbb{P}(V_{\hat{s}_\tau} \geq V_{s^*} - \epsilon) \geq 1 - \delta$. The MCTS problem: how to adaptively sample the leaves of a maxmin tree so as to identify the best action at the root as quickly as possible?
15 Outline 1 Problem formulation 2 The BAI-MCTS architecture 3 UGapE-MCTS and LUCB-MCTS 4 Towards optimal algorithms
16 A key building block: confidence intervals. Using the samples collected for the leaves, one can build, for each $\ell \in \mathcal{L}$, a confidence interval $[\mathrm{LCB}_\ell(t), \mathrm{UCB}_\ell(t)]$ on $\mu_\ell$. Idea: propagate these confidence intervals up in the tree.
18 A key building block: confidence intervals. MAX node: $\mathrm{UCB}_s(t) = \max_{c \in C(s)} \mathrm{UCB}_c(t)$ and $\mathrm{LCB}_s(t) = \max_{c \in C(s)} \mathrm{LCB}_c(t)$.
20 A key building block: confidence intervals. MIN node: $\mathrm{UCB}_s(t) = \min_{c \in C(s)} \mathrm{UCB}_c(t)$ and $\mathrm{LCB}_s(t) = \min_{c \in C(s)} \mathrm{LCB}_c(t)$.
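The MAX and MIN propagation rules can be sketched as a bottom-up pass over the tree. The `Node` class and its field names below are illustrative assumptions, not part of the slides; only the max/min aggregation of bounds comes from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str                                   # "max", "min", or "leaf"
    children: List["Node"] = field(default_factory=list)
    lcb: float = 0.0                            # lower confidence bound
    ucb: float = 1.0                            # upper confidence bound


def propagate(node: Node) -> None:
    """Fill in [lcb, ucb] of internal nodes from the leaf intervals, bottom-up."""
    if node.kind == "leaf":
        return                                  # leaf intervals come from samples
    for child in node.children:
        propagate(child)
    # MAX nodes take the max of both bounds, MIN nodes the min of both bounds.
    agg = max if node.kind == "max" else min
    node.lcb = agg(c.lcb for c in node.children)
    node.ucb = agg(c.ucb for c in node.children)


leaves = [Node("leaf", lcb=lo, ucb=hi)
          for lo, hi in [(0.3, 0.6), (0.4, 0.7), (0.1, 0.5), (0.2, 0.9)]]
root = Node("max", [Node("min", leaves[:2]), Node("min", leaves[2:])])
propagate(root)
```

In this example the two MIN nodes get the intervals [0.3, 0.6] and [0.1, 0.5], and the MAX root gets [0.3, 0.6].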
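21 Property of this construction. Writing $I_s(t) = [\mathrm{LCB}_s(t), \mathrm{UCB}_s(t)]$, if every leaf interval contains its mean then every node interval contains its value: $\bigcap_{\ell \in \mathcal{L}} \left(\mu_\ell \in I_\ell(t)\right) \Rightarrow \bigcap_{s \in \mathcal{T}} \left(V_s \in I_s(t)\right)$.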
22 Representative leaves. $\ell_s(t)$: the representative leaf of internal node $s \in \mathcal{T}$. Idea: alternate optimistic/pessimistic moves starting from $s$.
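One way to read "alternate optimistic/pessimistic moves" is: at a MAX node follow the child with the largest upper bound, at a MIN node follow the child with the smallest lower bound, and stop at a leaf. The sketch below assumes that reading, and a hypothetical dict-based node encoding in which internal intervals have already been propagated from the leaves.

```python
def representative_leaf(node):
    """Descend from `node`: optimistic at MAX nodes, pessimistic at MIN nodes."""
    while node["kind"] != "leaf":
        if node["kind"] == "max":
            # Optimistic move: child with the largest upper confidence bound.
            node = max(node["children"], key=lambda c: c["ucb"])
        else:
            # Pessimistic move: child with the smallest lower confidence bound.
            node = min(node["children"], key=lambda c: c["lcb"])
    return node


def leaf(name, lcb, ucb):
    return {"kind": "leaf", "name": name, "lcb": lcb, "ucb": ucb}


min_a = {"kind": "min", "lcb": 0.3, "ucb": 0.6,
         "children": [leaf("l1", 0.3, 0.6), leaf("l2", 0.4, 0.7)]}
min_b = {"kind": "min", "lcb": 0.1, "ucb": 0.5,
         "children": [leaf("l3", 0.1, 0.5), leaf("l4", 0.2, 0.9)]}
max_root = {"kind": "max", "lcb": 0.3, "ucb": 0.6, "children": [min_a, min_b]}
```

Here the representative leaf of `min_b` is `l3`, and the representative leaf of `max_root` is `l1`, reached through `min_a`.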
23 Generic BAI-MCTS algorithm
Input: a BAI algorithm
Initialization: $t = 0$
while not BAIStop($\{s \in C(s_0)\}$) do
    $R_{t+1}$ = BAIStep($\{s \in C(s_0)\}$)
    Sample the representative leaf $L_{t+1} = \ell_{R_{t+1}}(t)$
    Update the information about the arms (this sometimes reduces to updating confidence intervals)
    $t = t + 1$
end
Output: BAIReco($\{s \in C(s_0)\}$)
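The generic loop can be sketched as a function that takes the three BAI rules as callables. All parameter names below are assumptions made for illustration, and `max_rounds` is an added safety cap that is not part of the algorithm on the slide.

```python
def bai_mcts(depth_one_nodes, bai_stop, bai_step, bai_reco,
             representative_leaf, sample, update, max_rounds=100_000):
    """Generic BAI-MCTS loop; every rule is supplied by the caller."""
    t = 0
    while not bai_stop(depth_one_nodes) and t < max_rounds:
        chosen = bai_step(depth_one_nodes)       # R_{t+1}: depth-one node to explore
        leaf = representative_leaf(chosen)       # L_{t+1}: its representative leaf
        update(leaf, sample(leaf))               # e.g. refresh confidence intervals
        t += 1
    return bai_reco(depth_one_nodes), t


# Toy deterministic stand-ins, only to show how the pieces plug together.
nodes = [{"n": 0}, {"n": 0}]
reco, rounds = bai_mcts(
    nodes,
    bai_stop=lambda ns: sum(x["n"] for x in ns) >= 5,
    bai_step=lambda ns: min(ns, key=lambda x: x["n"]),
    bai_reco=lambda ns: max(range(len(ns)), key=lambda i: ns[i]["n"]),
    representative_leaf=lambda node: node,
    sample=lambda leaf: 1.0,
    update=lambda leaf, x: leaf.__setitem__("n", leaf["n"] + 1),
)
```

Keeping the BAI rules as parameters mirrors the point of the architecture: any BAI algorithm operating on the depth-one nodes can be plugged in without changing the loop.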
25 Outline 1 Problem formulation 2 The BAI-MCTS architecture 3 UGapE-MCTS and LUCB-MCTS 4 Towards optimal algorithms
26 An example of a BAI algorithm: LUCB. The (KL)-LUCB algorithm [Kalyanakrishnan et al. 12, Kaufmann and Kalyanakrishnan 13]
27 UGapE-MCTS, based on the UGapE algorithm [Gabillon et al. 12]. Sampling rule: $R_{t+1}$ is the least sampled among two promising depth-one nodes, $a_t = \arg\min_{a \in C(s_0)} B_a(t)$ and $b_t = \arg\max_{b \in C(s_0) \setminus \{a_t\}} \mathrm{UCB}_b(t)$, where $B_s(t) = \max_{s' \in C(s_0) \setminus \{s\}} \mathrm{UCB}_{s'}(t) - \mathrm{LCB}_s(t)$. Stopping rule: $\tau = \inf\{t \in \mathbb{N} : \mathrm{UCB}_{b_t}(t) - \mathrm{LCB}_{a_t}(t) < \epsilon\}$. Recommendation rule: $\hat{s}_\tau = a_\tau$.
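The three rules only need the confidence bounds and sample counts of the depth-one nodes. The sketch below assumes they are given as plain lists indexed by node; the function names and the tie-breaking in favour of $a_t$ are illustrative choices, not taken from the slides.

```python
def ugape_indices(lcb, ucb):
    """Return (a_t, b_t) from the bounds of the depth-one nodes (at least two)."""
    k = len(lcb)

    def b_index(s):
        # B_s(t): largest UCB among the other nodes, minus the LCB of s.
        return max(ucb[j] for j in range(k) if j != s) - lcb[s]

    a = min(range(k), key=b_index)
    b = max((j for j in range(k) if j != a), key=lambda j: ucb[j])
    return a, b


def ugape_round(lcb, ucb, counts, eps):
    """Return (stop, node): the recommendation if stop, else the node to sample."""
    a, b = ugape_indices(lcb, ucb)
    if ucb[b] - lcb[a] < eps:                    # stopping rule
        return True, a                           # recommendation rule
    # Sampling rule: least sampled of the two candidates (ties go to a).
    return False, (a if counts[a] <= counts[b] else b)


lcb = [0.5, 0.2, 0.1]
ucb = [0.8, 0.6, 0.4]
counts = [5, 3, 2]
```

With these numbers $B_0 = 0.1$, $B_1 = 0.6$ and $B_2 = 0.7$, so $a_t$ is node 0 and $b_t$ is node 1. The gap $\mathrm{UCB}_{b_t} - \mathrm{LCB}_{a_t}$ is about 0.1, so the procedure keeps sampling for $\epsilon = 0.05$ and stops for $\epsilon = 0.2$.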
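28 Theoretical guarantees. We choose confidence intervals of the form $\mathrm{LCB}_\ell(t) = \hat{\mu}_\ell(t) - \sqrt{\frac{\beta(N_\ell(t), \delta)}{2 N_\ell(t)}}$ and $\mathrm{UCB}_\ell(t) = \hat{\mu}_\ell(t) + \sqrt{\frac{\beta(N_\ell(t), \delta)}{2 N_\ell(t)}}$, where $\beta(s, \delta)$ is some exploration function. Correctness: if $\delta \leq \max(0.1 |\mathcal{L}|, 1)$, then for the choice $\beta(s, \delta) = \log(|\mathcal{L}|/\delta) + 3 \log\log(|\mathcal{L}|/\delta) + (3/2) \log(\log s + 1)$, UGapE-MCTS is $(\epsilon, \delta)$-PAC.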
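The exploration function and the resulting leaf interval are straightforward to compute. This sketch assumes natural logarithms and a Hoeffding-style radius $\sqrt{\beta / (2N)}$; the function names are hypothetical. It requires at least one sample per leaf and $|\mathcal{L}|/\delta > 1$ so that the iterated logarithm is defined.

```python
import math


def beta(s, delta, n_leaves):
    """Exploration function from the slide; s >= 1 is the leaf's sample count."""
    x = math.log(n_leaves / delta)
    return x + 3 * math.log(x) + 1.5 * math.log(math.log(s) + 1)


def leaf_interval(mean_hat, n, delta, n_leaves):
    """Return (LCB, UCB) for a leaf with empirical mean mean_hat after n samples."""
    radius = math.sqrt(beta(n, delta, n_leaves) / (2 * n))
    return mean_hat - radius, mean_hat + radius


lo_100, hi_100 = leaf_interval(0.5, 100, 0.1, 8)
lo_1000, hi_1000 = leaf_interval(0.5, 1000, 0.1, 8)
```

Because $\beta$ grows only like $\log\log s$ in the sample count, the interval width still shrinks at roughly the $1/\sqrt{N}$ rate, so the interval after 1000 samples is strictly inside the one after 100 samples.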
29 Theoretical guarantees. Sample complexity: with probability larger than $1 - \delta$, the total number of leaf explorations performed by UGapE-MCTS is upper bounded as $\tau = O\left(H^*_\epsilon(\mu) \log\frac{1}{\delta}\right)$, where $H^*_\epsilon(\mu) := \sum_{\ell \in \mathcal{L}} \frac{1}{\Delta_\ell^2 \vee \Delta_*^2 \vee \epsilon^2}$, with $\Delta_* := V(s^*) - V(s^*_2)$ and $\Delta_\ell := \max_{s \in \mathrm{Ancestors}(\ell) \setminus \{s_0\}} \left| V_{\mathrm{Parent}(s)} - V_s \right|$.
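For a tree with known leaf means, the complexity term can be computed with one pass over the tree. The sketch below uses a hypothetical nested-tuple encoding (`("max", children)`, `("min", children)`, or a number for a leaf) and assumes that the maximum defining $\Delta_\ell$ runs over every node on the path from a depth-one node down to the leaf $\ell$ itself, and that the root has at least two children.

```python
def value(node):
    """V_s for a node of a MAX/MIN tree with known leaf means."""
    if isinstance(node, (int, float)):
        return float(node)
    kind, children = node
    vals = [value(c) for c in children]
    return max(vals) if kind == "max" else min(vals)


def complexity(root, eps):
    """Sum over leaves of 1 / max(Delta_l, Delta_*, eps)^2."""
    _, children = root
    top = sorted((value(c) for c in children), reverse=True)
    delta_star = top[0] - top[1]          # gap between best and second-best move
    total = 0.0

    def walk(node, parent_value, gap_so_far):
        nonlocal total
        v = value(node)
        # Largest |V_parent - V_s| seen on the path below the root so far.
        gap = max(gap_so_far, abs(parent_value - v))
        if isinstance(node, (int, float)):
            total += 1.0 / max(gap, delta_star, eps) ** 2
        else:
            for child in node[1]:
                walk(child, v, gap)

    root_value = value(root)
    for child in children:
        walk(child, root_value, 0.0)
    return total


bandit = ("max", [0.5, 0.25, 0.0])        # depth-one tree: a plain bandit
```

For the depth-one example with $\epsilon = 0$, the three leaves contribute $1/0.25^2$, $1/0.25^2$ and $1/0.5^2$, giving 36. The sum is undefined when all three gaps are zero, which this sketch does not guard against.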
31 Numerical results. $\epsilon = 0$, $\delta =$ ($N = 10^6$ simulations). LUCB-MCTS (0.72% errors, 1551 samples), UGapE-MCTS (0.75% errors, 1584 samples), FindTopWinner (0% errors, samples) [Teraoka et al. 14]
32 Outline 1 Problem formulation 2 The BAI-MCTS architecture 3 UGapE-MCTS and LUCB-MCTS 4 Towards optimal algorithms
33 A sample complexity lower bound. Theorem: let $\epsilon = 0$. Any $\delta$-correct algorithm satisfies $\mathbb{E}_\mu[\tau] \geq T^*(\mu) \log(1/(3\delta))$, where $T^*(\mu)^{-1} := \sup_{w \in \Sigma_{|\mathcal{L}|}} \inf_{\lambda \in \mathrm{Alt}(\mu)} \sum_{\ell \in \mathcal{L}} w_\ell \, \mathrm{KL}(B(\mu_\ell), B(\lambda_\ell))$. Depth-two tree: the optimal proportions satisfy $w^*_{i,j}(\mu) = 0$ if $i \geq 2$ and $j \geq 2$. A more general sparsity pattern?
35 Conclusion. Our contributions: a generic way to use a BAI algorithm for MCTS; PAC and sample complexity guarantees for UGapE-MCTS and LUCB-MCTS, which also display good empirical performance. Future work: identify the optimal sample complexity of the MCTS problem (i.e. matching upper and lower bounds), and that of other structured Best Arm Identification problems [Ajallooeian et al., ALT 17]. Reference: E. Kaufmann & W.M. Koolen, Monte-Carlo Tree Search by Best Arm Identification, to appear in NIPS 2017.
Online Learning with Feedback Graphs Claudio Gentile INRIA and Google NY clagentile@gmailcom NYC March 6th, 2018 1 Content of this lecture Regret analysis of sequential prediction problems lying between
More informationStat 260/CS Learning in Sequential Decision Problems. Peter Bartlett
Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Thompson sampling Bernoulli strategy Regret bounds Extensions the flexibility of Bayesian strategies 1 Bayesian bandit strategies
More informationOn Minimaxity of Follow the Leader Strategy in the Stochastic Setting
On Minimaxity of Follow the Leader Strategy in the Stochastic Setting Wojciech Kot lowsi Poznań University of Technology, Poland wotlowsi@cs.put.poznan.pl Abstract. We consider the setting of prediction
More informationOrdinal Optimization and Multi Armed Bandit Techniques
Ordinal Optimization and Multi Armed Bandit Techniques Sandeep Juneja. with Peter Glynn September 10, 2014 The ordinal optimization problem Determining the best of d alternative designs for a system, on
More informationParallel Gaussian Process Optimization with Upper Confidence Bound and Pure Exploration
Parallel Gaussian Process Optimization with Upper Confidence Bound and Pure Exploration Emile Contal David Buffoni Alexandre Robicquet Nicolas Vayatis CMLA, ENS Cachan, France September 25, 2013 Motivating
More informationOptimal and Adaptive Online Learning
Optimal and Adaptive Online Learning Haipeng Luo Advisor: Robert Schapire Computer Science Department Princeton University Examples of Online Learning (a) Spam detection 2 / 34 Examples of Online Learning
More informationGraphs in Machine Learning
Graphs in Machine Learning Michal Valko INRIA Lille - Nord Europe, France Partially based on material by: Rob Fergus, Tomáš Kocák March 17, 2015 MVA 2014/2015 Last Lecture Analysis of online SSL Analysis
More informationBasics of reinforcement learning
Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system
More informationBandit Algorithms. Tor Lattimore & Csaba Szepesvári
Bandit Algorithms Tor Lattimore & Csaba Szepesvári Bandits Time 1 2 3 4 5 6 7 8 9 10 11 12 Left arm $1 $0 $1 $1 $0 Right arm $1 $0 Five rounds to go. Which arm would you play next? Overview What are bandits,
More information