Preference-based Reinforcement Learning
1 Preference-based Reinforcement Learning
Róbert Busa-Fekete 1,2, Weiwei Cheng 1, Eyke Hüllermeier 1, Balázs Szörényi 2,3, Paul Weng 4
1 Computational Intelligence Group, Philipps-Universität Marburg
2 Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged (RGAI)
3 INRIA Lille - Nord Europe, SequeL project, 40 avenue Halley, Villeneuve d'Ascq, France
4 Laboratory of Computer Science, Université Pierre et Marie Curie
EWRL 2013 / Dagstuhl, August 6, 2013
2 Preference-based Reinforcement Learning
- Motivating example: medical treatment design
- Direct policy search (DPS)
- Evolutionary direct policy search approach
- Preference-based evolutionary direct policy search (PB-EDPS)
- Preference-based racing (PBR)
3 There are many problems where it is hard to define a reasonable reward function:
- the task of driving [Abbeel and Ng, 2004]
- medical treatment design [Zhao et al., 2009]
Aggregation of rewards: one may not always be willing/able to combine rewards (cf. multi-objective reinforcement learning).
Episodic setup: h generated by following policy π, h′ by following policy π′. Given h and h′, it might be easier to decide which one is preferred (at least in some problems).
The piece of information we want to learn from is preferences over simulations!
4 Motivating example: medical treatment design [Zhao et al., 2009]
Virtual patient with cancer. The state captures some essential factors in cancer treatment:
- tumor size
- toxicity
Episodic setup: an episode corresponds to the treatment of a patient over six months. The action is the dosage level itself.
Transitions:
- The tumor is constantly growing (without treatment, or if the dosage is too low).
- The higher the dose selected, the higher the toxicity evolves, and the more the tumor growth is inhibited.
- The higher the toxicity and the tumor size, the higher the probability of the patient's death.
5 Motivating example [Zhao et al., 2009]
Terminal state: end of the sixth month, or the patient dies.
The reward is defined based on the wellness of the patient:
- tumor size: -5, 5, 15
- toxicity level: -5, 0, 5
- the reward assigned to death is -60
Based on the wellness of patients, it is straightforward to define a preference relation over treatments. Given two trajectories h_1 and h_2 generated by following two different treatments, Pareto dominance yields h_1 ≻ h_2 if the tumor size AND the toxicity level are both smaller under h_1; otherwise the two trajectories are incomparable.
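To make the relation concrete, here is a minimal Python sketch of this preference over treatment outcomes; the Outcome fields and the rule that survival always beats death are illustrative assumptions, not taken from the slide.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    tumor_size: float
    toxicity: float
    died: bool

def prefers(h1: Outcome, h2: Outcome) -> bool:
    """True iff h1 is strictly preferred to h2; False also covers incomparability."""
    if h1.died != h2.died:
        return h2.died  # assumption: survival is always preferred to death
    if h1.died and h2.died:
        return False    # two deaths: incomparable
    # Pareto dominance: both criteria strictly smaller under h1
    return h1.tumor_size < h2.tumor_size and h1.toxicity < h2.toxicity

print(prefers(Outcome(3.0, 1.0, False), Outcome(4.0, 2.0, False)))  # True
```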
6 Point of departure: preferences
There is no reward function (it is hard to define a reasonable one), and the goal is not to find a reward function!!!
The piece of information we want to learn from is preferences over trajectories!
- A partial order ≻ over trajectories h ∈ H^(T)
- From a tutor or an expert
- Extracted from trajectories
7 Decision model
Decision model: lifting the preference relation on H^(T) to a preference relation on the space of policies.
Intermezzo: each policy π generates a probability distribution over the set of trajectories (for a fixed MDP), denoted P_π; a policy can thus be viewed as a random variable whose realizations are trajectories.
s(π, π′) = E_{h∼P_π, h′∼P_{π′}}[I{h ≻ h′}] is the probability that π beats π′.
Ordinal decision model: π ≻ π′ if and only if s(π′, π) < s(π, π′).
Alternative decision models?
8 Preference-based Evolutionary direct policy search
9 Direct policy search (DPS)
1. Parametric policy space: Π = {π_Θ : Θ ∈ R^d}, for example the space of linear policies π_w(s) = w^T s if S ⊆ R^d.
2. The policy search can be viewed as an optimization task: Π is the search space, and some policy evaluation measure is the target function.
Taxonomy of DPS methods:
- Gradient-based: REINFORCE [Williams, 1992]
- Gradient-free: evolutionary DPS [Heidrich-Meisner and Igel, 2009], built on evolution-strategy-based optimizers or the cross-entropy method; preference-based evolutionary DPS extends this branch
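For illustration, a linear policy can be written in a few lines of Python; clipping the output to a dosage range [0, 1] is an assumption for the running example, not part of the slide.

```python
import numpy as np

def make_linear_policy(w: np.ndarray):
    """Linear policy pi_w(s) = w^T s over a d-dimensional state space."""
    def policy(state: np.ndarray) -> float:
        # Clipping to [0, 1] (e.g. a dosage level) is an illustrative choice.
        return float(np.clip(w @ state, 0.0, 1.0))
    return policy

pi = make_linear_policy(np.array([0.2, -0.1]))
print(pi(np.array([1.0, 3.0])))  # 0.0 after clipping (raw value -0.1)
```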
10 Evolutionary direct policy search approach [Heidrich-Meisner and Igel, 2009]
Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [Hansen and Kern, 2004]: it maintains a distribution over the solution space (in this case, over the space of policies).
The expected total reward is optimized, which can be estimated based on a finite set of trajectories {h_1, ..., h_n} ∼ P_π as
ρ̂_π^(n) = (1/n) Σ_{i=1}^n V(h_i),
where V(·) is the cumulative reward.
11 Evolutionary direct policy search [Heidrich-Meisner and Igel, 2009]
Repeat these three steps until convergence:
1. Generate a population of candidate solutions (in this case, a set of policies with different parameters): π_{Θ_1}, ..., π_{Θ_λ}, where Θ_1, ..., Θ_λ ∼ N(m, Σ).
2. Evaluate the candidate solutions (estimate the performance of the policies based on simulations {h_1, ..., h_n} ∼ P_{π_{Θ_i}}): ρ̂^(n)_{π_{Θ_i}} = (1/n) Σ_{i=1}^n V(h_i), and select the best μ individuals.
3. Update m and Σ using the parameters of the best μ individuals/policies.
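A minimal sketch of this loop, using a stripped-down (μ, λ) evolution strategy in place of full CMA-ES; rollout_return(Θ, rng) is a hypothetical hook that simulates one trajectory under π_Θ and returns its cumulative reward V(h).

```python
import numpy as np

def evolutionary_dps(rollout_return, d, lam=20, mu=5, n=30, iters=50, seed=0):
    """(mu, lam)-ES: sample parameters from N(m, Sigma), keep the best mu."""
    rng = np.random.default_rng(seed)
    m, sigma = np.zeros(d), np.eye(d)
    for _ in range(iters):
        thetas = rng.multivariate_normal(m, sigma, size=lam)
        # rho_hat: mean cumulative reward over n simulated trajectories
        scores = np.array([np.mean([rollout_return(th, rng) for _ in range(n)])
                           for th in thetas])
        elite = thetas[np.argsort(scores)[-mu:]]   # best mu individuals
        m = elite.mean(axis=0)                                  # update mean...
        sigma = np.cov(elite, rowvar=False) + 1e-6 * np.eye(d)  # ...and covariance
    return m
```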
12 Racing algorithm [Heidrich-Meisner and Igel, 2009]
[Figure: illustration of the racing principle.]
In the bandit literature, these algorithms are called PAC bandits.
13 Basic idea
1. Direct motivation: evolution strategy optimizers need only a ranking of the candidates; they do not need the function values themselves.
2. GOAL: devise a racing algorithm that utilizes only pairwise comparisons of random samples (in this case, trajectories) and is able to select the best policies with respect to the decision model (≻).
3. This naturally gives rise to a preference-based policy search method.
14 Recall the decision model
s(π, π′) = E_{h∼P_π, h′∼P_{π′}}[I{h ≻ h′}]
Ordinal decision model: π ≻ π′ if and only if s(π′, π) < s(π, π′).
There can be preferential cycles: π_1 ≻ π_2 AND π_2 ≻ π_3 AND π_3 ≻ π_1, so "select the best option" is not a well-defined task.
Practical solution: a surrogate ranking model. Given π_1, ..., π_K, define
π_i ≻_C π_j ⟺ d_i < d_j, where d_i = |{k : π_k ≻ π_i, k ≠ i}|.
It is a complete preorder, since it has a numeric representation (d_i). Unfortunately, the preference relation ≻_C depends on the set of policies considered.
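For a finite set of policies the surrogate ranking is easy to compute from a matrix of pairwise outcomes; a small sketch, where the Boolean matrix beats is assumed to encode the relation ≻ (beats[k, i] is True when π_k ≻ π_i).

```python
import numpy as np

def surrogate_ranking(beats: np.ndarray):
    """d_i = |{k : pi_k beats pi_i}|; pi_i >_C pi_j iff d_i < d_j (smaller is better)."""
    d = beats.sum(axis=0)        # column i counts how many policies beat pi_i
    order = np.argsort(d)        # a complete preorder; ties are possible
    return d, order

beats = np.zeros((3, 3), dtype=bool)
beats[0, 1] = beats[1, 2] = beats[2, 0] = True   # a preferential cycle
print(surrogate_ranking(beats))  # d = [1, 1, 1]: the cycle yields a three-way tie
```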
15 An example of the surrogate ranking model
[Figure: a preference graph over five policies π_1, ..., π_5, where an edge π_i → π_j means π_i ≻ π_j; the induced scores are d_2 = 4, d_1 = 3, d_3 = d_4 = d_5 = 1, which determine ≻_C.]
16 Concentration property of s(·, ·)
Recall: s(π, π′) = E_{h∼P_π, h′∼P_{π′}}[I{h ≻ h′}]; π ≻ π′ if and only if s(π′, π) < s(π, π′); and π_i ≻_C π_j ⟺ d_i < d_j, where d_i = |{k : π_k ≻ π_i, k ≠ i}|.
s(π, π′) can be estimated based on finite sets of trajectories {h_1, ..., h_n} ∼ P_π and {h′_1, ..., h′_{n′}} ∼ P_{π′} as
ŝ(π, π′) = (1/(n n′)) Σ_{i=1}^n Σ_{j=1}^{n′} I{h_i ≻ h′_j}.
Hoeffding bound for U-statistics, two-sample case [Hoeffding, 1963, 5b]: for any ε > 0,
P(|ŝ(π, π′) − s(π, π′)| ≥ ε) ≤ 2 exp(−2 min(n, n′) ε²).
Empirical Bernstein bound?
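A sketch of the estimator and the Hoeffding confidence radius; prefer(h, hp) is a hypothetical hook returning I{h ≻ h′} as 0 or 1 (tie handling is added on slide 23).

```python
import numpy as np

def estimate_s(H, Hp, prefer):
    """Two-sample U-statistic s_hat(pi, pi') over trajectory sets H and Hp."""
    n, n2 = len(H), len(Hp)
    return sum(prefer(h, hp) for h in H for hp in Hp) / (n * n2)

def hoeffding_radius(n, n2, delta):
    """eps such that P(|s_hat - s| >= eps) <= 2 exp(-2 min(n, n') eps^2) = delta."""
    return float(np.sqrt(np.log(2.0 / delta) / (2.0 * min(n, n2))))
```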
17 Preference-based racing
We have an efficient estimator for s(π_i, π_j), and we can calculate confidence intervals for s(π_i, π_j).
[Figure: racing among K = 7 policies to select the top k = 3; an edge π_i → π_j indicates that ŝ(π_i, π_j) is significantly bigger than ŝ(π_j, π_i).]
Expected sample complexity: as in Even-Dar et al. [2002], with gaps Δ_{i,j} = |1/2 − s(π_i, π_j)|.
18 Preference-based evolutionary direct policy search
Repeat these three steps until convergence:
1. Generate a population of candidate solutions (in this case, a set of policies with different parameters): π_{Θ_1}, ..., π_{Θ_λ}, where Θ_1, ..., Θ_λ ∼ N(m, Σ).
2. Select the best μ individuals using the preference-based racing algorithm.
3. Update m and Σ using the parameters of the best μ individuals/policies.
19 The relation of ≻ and ≻_C (only locally valid)
π_i is a Condorcet winner among a set of policies π_1, ..., π_K if π_i ≻ π_l for all l ≠ i. If the Condorcet winner exists, it is the largest element of ≻_C.
The Smith set is the smallest non-empty set D ⊆ {π_1, ..., π_K} satisfying π_i ≻ π_j for all π_i ∈ D and π_j ∈ {π_1, ..., π_K} \ D.
Proposition. Let Π = {π_1, ..., π_K} be a set of random variables for which there exists a Smith set D of size K_D. Then for any π_i ∈ D and π_j ∈ Π \ D, π_i ≻_C π_j.
[Figure: a five-policy example with d_5 = 1, d_4 = 1, d_3 = 1, d_1 = 3, d_2 = 4.]
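Both notions are straightforward to compute for a finite strict tournament; a sketch assuming beats[i, j] is True iff π_i ≻ π_j, with exactly one direction set per pair. The closure step relies on the fact that, in a tournament, anything that beats a Smith-set member must itself belong to the Smith set.

```python
import numpy as np

def condorcet_winner(beats: np.ndarray):
    """Index of the policy beating all others, or None if no such policy exists."""
    K = beats.shape[0]
    for i in range(K):
        if all(beats[i, j] for j in range(K) if j != i):
            return i
    return None

def smith_set(beats: np.ndarray):
    """Smallest non-empty D whose members beat every policy outside D."""
    copeland = beats.sum(axis=1)    # number of wins of each policy
    D = {int(np.argmax(copeland))}  # a Copeland winner lies in the Smith set
    changed = True
    while changed:                  # close D under "beats some member of D"
        changed = False
        for j in range(beats.shape[0]):
            if j not in D and any(beats[j, i] for i in D):
                D.add(j)
                changed = True
    return D
```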
20 Issues to be discussed
- The existence of global optima.
- If there exists a global Condorcet winner, under what assumptions can we find it (w.h.p.) using an evolution strategy along with the preference-based racing algorithm?
- The Hoeffding bound is loose: use a Clopper-Pearson-type confidence bound for trinomial random variables.
21 Bibliography I
P. Abbeel and A.Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning, 2004.
E. Even-Dar, S. Mannor, and Y. Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In Proceedings of the 15th Annual Conference on Computational Learning Theory, 2002.
N. Hansen and S. Kern. Evaluating the CMA evolution strategy on multimodal test functions. In Parallel Problem Solving from Nature - PPSN VIII, 2004.
V. Heidrich-Meisner and C. Igel. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search. In Proceedings of the 26th International Conference on Machine Learning, 2009.
J. Hemelrijk. Note on Wilcoxon's two-sample test when ties are present. The Annals of Mathematical Statistics, 23(1), 1952.
22 Bibliography II
R.J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 1992.
Y. Zhao, M.R. Kosorok, and D. Zeng. Reinforcement learning design for cancer clinical trials. Statistics in Medicine, 28(26), 2009.
23 Preference-based racing for ≻_C
One can estimate s(π, π′) based on finite sets of trajectories {h_1, ..., h_n} ∼ P_π and {h′_1, ..., h′_{n′}} ∼ P_{π′}:
ŝ(π, π′) = (1/(n n′)) Σ_{i=1}^n Σ_{j=1}^{n′} I{h_i ≻ h′_j}.
Incomparable trajectories: solution by [Hemelrijk, 1952]:
I{x, x′} = 1 if x ≻ x′, 0 if x′ ≻ x, and 1/2 otherwise.
Probabilistic interpretation: if two samples are incomparable, then we select one of them as preferred with probability 1/2. With this convention, s(π, π′) = 1 − s(π′, π).
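In code, the tie rule maps each comparison to 1, 0, or 1/2 and plugs directly into the estimator of slide 16; strictly_prefers is a hypothetical strict-preference hook.

```python
def prefer_with_ties(h, hp, strictly_prefers) -> float:
    """Hemelrijk-style comparison: 1 if h > h', 0 if h' > h, 1/2 if incomparable."""
    if strictly_prefers(h, hp):
        return 1.0
    if strictly_prefers(hp, h):
        return 0.0
    return 0.5  # incomparable pairs count as a coin flip
```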
24 Preference-based racing: optimization view
Preference-based case: PBR(π_1, ..., π_K, k, n_max, δ) returns, with probability at least 1 − δ,
argmax_{I ⊆ {1,...,K}: |I| = k} Σ_{i∈I} Σ_{j≠i} I{π_i ≻_C π_j}.
Since s(π_i, π_j) = 1 − s(π_j, π_i), this can be written as
argmax_{I ⊆ {1,...,K}: |I| = k} Σ_{i∈I} Σ_{j≠i} I{s(π_i, π_j) > 1/2}.   (1)
We have an efficient estimator of s(π_i, π_j).
25 Algorithm 1 PBR(π_1, ..., π_K, k, n_max, δ)
1: A = {(i, j) : 1 ≤ i, j ≤ K}, n = 0
2: while (n ≤ n_max) ∧ (|A| > 0) do
3:   for all i appearing in A do
4:     h_i^(n) ∼ (M, π_i)   ▷ generate trajectories
5:   end for
6:   for all (i, j) ∈ A do
7:     update ŝ_{i,j} = (1/n²) Σ_{l=1}^n Σ_{l′=1}^n I{h_i^(l) ≻ h_j^(l′)}
8:     c_{i,j} = sqrt((1/(2n)) log(2K² n_max / δ)),  u_{i,j} = ŝ_{i,j} + c_{i,j},  l_{i,j} = ŝ_{i,j} − c_{i,j}
9:   end for
10:  for i = 1, ..., K do
11:    z_i = |{j : u_{i,j} < 1/2, j ≠ i}|,  o_i = |{j : l_{i,j} > 1/2, j ≠ i}|
12:  end for
13:  C = {i : K − k < |{j : K − z_j < o_i}|}   ▷ select
14:  D = {i : k < |{j : K − o_j < z_i}|}   ▷ discard
15:  for (i, j) ∈ A do
16:    if (i, j ∈ C ∪ D) ∨ (1/2 ∉ [l_{i,j}, u_{i,j}]) then
17:      A = A \ {(i, j)}   ▷ do not update ŝ_{i,j} any more
18:    end if
19:  end for
20:  n = n + 1
21: end while
22: return the top-k options, i.e. those with the most estimates ŝ_{i,j} above 1/2
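A compact Python rendering of this pseudocode; sample_traj(i, rng) and prefer(h, hp) are hypothetical hooks, ŝ is recomputed from scratch for clarity rather than updated incrementally, and the select/discard tests of lines 13-14 follow the reconstruction above, so treat this as a sketch rather than the authors' implementation.

```python
import numpy as np
from itertools import product

def pbr(K, k, n_max, delta, sample_traj, prefer, seed=0):
    rng = np.random.default_rng(seed)
    A = {(i, j) for i in range(K) for j in range(K) if i != j}
    trajs = [[] for _ in range(K)]
    s_hat = np.full((K, K), 0.5)   # frozen once a pair leaves A
    n = 0
    while n < n_max and A:
        n += 1
        for i in {i for pair in A for i in pair}:
            trajs[i].append(sample_traj(i, rng))  # one new trajectory per round
        c = np.sqrt(np.log(2 * K * K * n_max / delta) / (2 * n))
        for (i, j) in A:
            s_hat[i, j] = np.mean([prefer(h, hp)
                                   for h, hp in product(trajs[i], trajs[j])])
        u, l = s_hat + c, s_hat - c
        z = [sum(u[i, j] < 0.5 for j in range(K) if j != i) for i in range(K)]
        o = [sum(l[i, j] > 0.5 for j in range(K) if j != i) for i in range(K)]
        C = {i for i in range(K)
             if sum(K - z[j] < o[i] for j in range(K) if j != i) > K - k}  # select
        D = {i for i in range(K)
             if sum(K - o[j] < z[i] for j in range(K) if j != i) > k}      # discard
        decided = C | D
        A = {(i, j) for (i, j) in A
             if not (i in decided and j in decided) and l[i, j] <= 0.5 <= u[i, j]}
    wins = (s_hat > 0.5).sum(axis=1)  # pairwise estimates above 1/2
    return list(np.argsort(wins)[-k:])
```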
26 Cancer treatment
1. The state space consists of the toxicity level and the tumor size (X, Y).
2. Linear policy space.
3. Each policy search method was trained 100 times, and each policy was evaluated on 300 virtual patients.
4. 6-month treatment.
5. Transitions: X_{t+1} = X_t + ΔX_t and Y_{t+1} = Y_t + ΔY_t, where
   ΔY_t = [a_1 max(X_t, X_0) − b_1 (D_t − d_1)] I{Y_t > 0}
   ΔX_t = a_2 max(Y_t, Y_0) + b_2 (D_t − d_2)
6. Probability of death: 1 − exp(−exp(c_0 + c_1 X_t + c_2 Y_t)).
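A minimal simulator of these dynamics; the coefficient values are placeholders (the slide gives no values for a_i, b_i, d_i, c_i), the signs follow slide 4's description (dose raises toxicity and inhibits tumor growth), and clamping the state at zero is an extra simplification.

```python
import numpy as np

def simulate_patient(policy, T=6, a1=0.15, b1=1.2, d1=0.5,
                     a2=0.1, b2=1.2, d2=0.5,
                     c0=-4.0, c1=1.0, c2=1.0,
                     x0=0.0, y0=1.0, seed=0):
    """One episode: returns the trajectory [(X_t, Y_t, died), ...]."""
    rng = np.random.default_rng(seed)
    X, Y = x0, y0                       # toxicity, tumor size
    traj = [(X, Y, False)]
    for t in range(T):                  # one step per month
        D = policy(np.array([X, Y]))    # dosage level in [0, 1]
        dY = (a1 * max(X, x0) - b1 * (D - d1)) * (Y > 0)  # dose inhibits tumor
        dX = a2 * max(Y, y0) + b2 * (D - d2)              # dose raises toxicity
        X, Y = max(X + dX, 0.0), max(Y + dY, 0.0)         # clamp at zero
        died = rng.random() < 1 - np.exp(-np.exp(c0 + c1 * X + c2 * Y))
        traj.append((X, Y, died))
        if died:                        # terminal state: patient dies
            break
    return traj

print(simulate_patient(lambda s: 0.4)[-1])   # constant-dose policy pi(s) = 0.4
```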
27 Cancer treatment
[Figure: results of 100 repetitions of the training process, plotted in the toxicity / tumor-size plane together with the Pareto front. Methods, with the scores given on the slide: Random (0.355), Const. π(s) = 0.1 (0.405), Const. π(s) = 0.4 (0.318), Const. π(s) = 0.7 (0.383), Const. π(s) = 1.0 (0.448), SARSA(λ) (0.337), PBPI (0.259), EDPS (0.283), PB-EDPS (0.209).]