Preference-based Reinforcement Learning


1 Preference-based Reinforcement Learning
Róbert Busa-Fekete 1,2, Weiwei Cheng 1, Eyke Hüllermeier 1, Balázs Szörényi 2,3, Paul Weng 4
1 Computational Intelligence Group, Philipps-Universität Marburg
2 Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged (RGAI)
3 INRIA Lille - Nord Europe, SequeL project, 40 avenue Halley, Villeneuve d'Ascq, France
4 Laboratory of Computer Science, Université Pierre et Marie Curie
EWRL 2013 / Dagstuhl, August 6, 2013

2 Preference-based Reinforcement Learning
Motivating example: medical treatment design
Direct policy search (DPS)
Evolutionary direct policy search approach
Preference-based Evolutionary Direct Policy Search (PB-EDPS)
Preference-based Racing (PBR)

3 Many problems where it is hard to define a reasonable reward function:
- the task of driving [Abbeel and Ng, 2004]
- medical treatment design [Zhao et al., 2009]
Aggregation of rewards: one may not always be willing or able to combine several rewards into a single one (multi-objective reinforcement learning).
Episodic setup: trajectory h obtained by following policy π, trajectory h' by following policy π'.
Given h and h', it might be easier to decide which one is preferred (at least in some problems).
The piece of information we want to learn from is preferences over simulations!

4 Motivating example: medical treatment design [Zhao et al., 2009]
Virtual patient with cancer.
The state captures some essential factors in cancer treatment: tumor size and toxicity.
Episodic setup: an episode corresponds to the treatment of a patient over six months.
The action is the dosage level itself.
Transitions:
- The tumor is constantly growing (without treatment, or if the dosage is too low).
- The higher the dose selected, the higher the toxicity, and the more tumor growth is inhibited.
- The higher the toxicity and the tumor size, the higher the probability of the patient's death.

5 Motivating example [Zhao et al., 2009]
Terminal state: end of the sixth month, or the patient dies.
The reward is defined based on the wellness of the patient:
- tumor size: -5, 5, 15
- toxicity level: -5, 0, 5
- the reward assigned to death is -60
Based on the wellness of patients, it is straightforward to define a preference relation over treatments. Given two trajectories h1 and h2 generated by following two different treatments: h1 ≻ h2 if the patient survives under h1 but not under h2; otherwise, Pareto dominance: h1 ≻ h2 if the tumor size AND the toxicity level are both smaller under h1.
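
A minimal sketch of this preference relation, assuming a trajectory is summarized by a hypothetical triple (died, tumor_size, toxicity) at the end of the treatment; this encoding is illustrative only, not the exact representation used in the paper.

```python
def prefer(h1, h2):
    """Return +1 if h1 is preferred, -1 if h2 is preferred, 0 if incomparable."""
    died1, tumor1, tox1 = h1
    died2, tumor2, tox2 = h2
    if died1 != died2:                       # survival dominates everything else
        return -1 if died1 else +1
    if tumor1 < tumor2 and tox1 < tox2:      # Pareto dominance: both criteria strictly smaller
        return +1
    if tumor2 < tumor1 and tox2 < tox1:
        return -1
    return 0                                 # incomparable

print(prefer((False, 1.2, 0.4), (False, 2.0, 0.9)))  # +1: smaller tumor and lower toxicity
print(prefer((False, 3.0, 0.2), (False, 2.0, 0.9)))  # 0: incomparable
```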

6 Point of departure: preferences
There is no reward function (it is hard to define a reasonable one), and the goal is not to find a reward function!
The piece of information we want to learn from is preferences over trajectories: a partial order over trajectories h ∈ H^(T),
- given by a tutor or an expert, or
- extracted from the trajectories themselves.

7 Decision model
Decision model: lifting the preference relation on H^(T) to a preference relation on the space of policies.
Intermezzo: for a fixed MDP, each policy π generates a probability distribution over the set of trajectories, denoted by P^π; a policy can thus be viewed as a random variable whose realizations are trajectories.
s(π, π') = E_{h ~ P^π, h' ~ P^{π'}} [ I{h ≻ h'} ]   (the probability that π beats π')
Ordinal decision model: π ≻ π' if and only if s(π', π) < s(π, π')
Alternative decision model?

8 Preference-based Evolutionary Direct Policy Search

9 Direct policy search (DPS)
1. Parametric policy space: Π = {π_Θ : Θ ∈ R^d}, for example the space of linear policies π_w(s) = w^T s if S ⊆ R^d.
2. The policy search can be viewed as an optimization task: Π is the search space, and some policy evaluation measure is the target function.
[Diagram: taxonomy of DPS methods. Gradient-based approaches, e.g. REINFORCE [Williams, 1992]; gradient-free approaches, e.g. evolutionary DPS [Heidrich-Meisner and Igel, 2009] with evolution-strategy-based optimizers or the cross-entropy method. Preference-based evolutionary DPS belongs to the gradient-free branch.]
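
A tiny sketch of such a parametric (linear) policy; the state is assumed to be given as a feature vector, and the concrete weights and features are illustrative only.

```python
import numpy as np

def make_linear_policy(w):
    """pi_w(s) = w^T s: maps a state feature vector s to a real-valued action (e.g. a dosage)."""
    w = np.asarray(w, dtype=float)
    return lambda s: float(np.dot(w, np.asarray(s, dtype=float)))

pi = make_linear_policy([0.5, 0.25])
print(pi([1.0, 2.0]))   # -> 1.0
```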

10 Evolutionary direct policy search approach [Heidrich-Meisner and Igel, 2009]
Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [Hansen and Kern, 2004]: it maintains a distribution over the solution space (in this case, over the space of policies).
The expected total reward is optimized; it can be estimated based on a finite set of trajectories {h_1, ..., h_n} ~ P^π as
ρ_π^(n) = (1/n) Σ_{i=1}^{n} V(h_i),
where V(·) is the cumulative reward.

11 Evolutionary direct policy search [Heidrich-Meisner and Igel, 2009]
Repeat these three steps until convergence:
1. Generate a population of candidate solutions (in this case, a set of policies with different parameters): π_{Θ_1}, ..., π_{Θ_λ} where Θ_1, ..., Θ_λ ~ N(m, Σ).
2. Evaluate the candidate solutions (estimate the performance of each policy based on simulations {h_1, ..., h_n} ~ P^{π_{Θ_i}}): ρ_{π_{Θ_i}}^(n) = (1/n) Σ_{j=1}^{n} V(h_j), and select the best μ individuals.
3. Update m and Σ using the parameters of the best μ individuals/policies.
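
A hedged sketch of this generate/evaluate/update loop, using a simplified diagonal-Gaussian (μ, λ) evolution strategy in place of full CMA-ES; rollout_return(θ) is an assumed helper that runs one episode with π_θ and returns its cumulative reward V(h).

```python
import numpy as np

def edps(rollout_return, dim, lam=10, mu=3, n=30, iters=50, seed=0):
    """Simplified evolutionary DPS loop (not full CMA-ES)."""
    rng = np.random.default_rng(seed)
    m, sigma = np.zeros(dim), np.ones(dim)                         # search distribution N(m, diag(sigma^2))
    for _ in range(iters):
        thetas = m + sigma * rng.standard_normal((lam, dim))       # 1. generate lambda candidates
        scores = [np.mean([rollout_return(t) for _ in range(n)])   # 2. Monte Carlo estimate rho^(n)
                  for t in thetas]
        best = thetas[np.argsort(scores)[::-1][:mu]]               #    keep the best mu policies
        m = best.mean(axis=0)                                      # 3. update the mean ...
        sigma = best.std(axis=0) + 1e-8                            #    ... and the (diagonal) spread
    return m
```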

12 Racing algorithm [Heidrich-Meisner and Igel, 2009]
In the bandit literature, these algorithms are called PAC bandits.

13 Basic idea
1. Direct motivation: evolution strategy optimizers only need a ranking of the candidates; they do not need the function values themselves.
2. GOAL: devise a racing algorithm that uses only pairwise comparisons of random samples (in this case, trajectories) and is able to select the best policies with respect to the decision model (≻).
3. This naturally gives rise to a preference-based policy search method.

14 Recall the decision model
s(π, π') = E_{h ~ P^π, h' ~ P^{π'}} [ I{h ≻ h'} ]
π ≻ π' if and only if s(π', π) < s(π, π')
There can be preferential cycles: π ≻ π' AND π' ≻ π'' AND π'' ≻ π, so "select the best options" is not a well-defined task.
Practical solution: a surrogate ranking model. Given π_1, ..., π_K,
π_i ≻_C π_j if and only if d_i < d_j, where d_i = |{k : π_k ≻ π_i, k ≠ i}|.
It is a complete preorder, since it has a numeric representation (d_i).
Unfortunately, the preference relation ≻_C depends on the set of policies considered.
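
A small sketch of this surrogate ranking: given an (estimated) matrix S with S[i, j] ≈ s(π_i, π_j), it computes the degrees d_i and ranks the policies by increasing d_i. The example matrix is made up and deliberately contains a preferential cycle.

```python
import numpy as np

def copeland_rank(S):
    """Rank policies by d_i = |{k : pi_k beats pi_i}| (smaller is better)."""
    S = np.asarray(S, dtype=float)
    K = S.shape[0]
    d = np.array([sum(1 for k in range(K) if k != i and S[k, i] > S[i, k])
                  for i in range(K)])
    return np.argsort(d), d          # indices from best to worst, and the degrees

# Cyclic example: pi_0 beats pi_1, pi_1 beats pi_2, pi_2 beats pi_0.
order, d = copeland_rank([[0.5, 0.7, 0.2],
                          [0.3, 0.5, 0.6],
                          [0.8, 0.4, 0.5]])
print(order, d)                      # all degrees equal 1, so any order is consistent
```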

15 An example of the surrogate ranking model
[Figure: preference graph over π_1, ..., π_5; an edge from π_i to π_j means π_i ≻ π_j. The resulting degrees are d_2 = 4, d_1 = 3, d_3 = d_4 = d_5 = 1.]

16 Concentration property of s(·, ·)
s(π, π') = E_{h ~ P^π, h' ~ P^{π'}} [ I{h ≻ h'} ]
π ≻ π' if and only if s(π', π) < s(π, π')
π_i ≻_C π_j if and only if d_i < d_j, where d_i = |{k : π_k ≻ π_i, k ≠ i}|
s(π, π') can be estimated based on finite sets of trajectories {h_1, ..., h_n} ~ P^π and {h'_1, ..., h'_{n'}} ~ P^{π'} as
ŝ(π, π') = (1/(n n')) Σ_{i=1}^{n} Σ_{j=1}^{n'} I{h_i ≻ h'_j}.
Hoeffding bound for U-statistics, two-sample case [Hoeffding, 1963, Section 5b]: for any ε > 0,
P( |ŝ(π, π') − s(π, π')| ≥ ε ) ≤ 2 exp( −2 min(n, n') ε² ).
Empirical Bernstein bound?
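
A sketch of the two-sample estimator and the Hoeffding-style confidence radius above; prefer(h, h') is an assumed comparison oracle returning +1/0/−1, with incomparable pairs counted as 1/2 (anticipating the Hemelrijk correction on slide 23).

```python
import math

def s_hat(H, Hp, prefer):
    """U-statistic estimate of s(pi, pi') from trajectory sets H ~ pi and Hp ~ pi'."""
    total = 0.0
    for h in H:
        for hp in Hp:
            r = prefer(h, hp)                        # +1: h preferred, -1: hp preferred, 0: tie
            total += 1.0 if r > 0 else (0.5 if r == 0 else 0.0)
    return total / (len(H) * len(Hp))

def hoeffding_radius(n, n_prime, delta=0.05):
    """Width eps with P(|s_hat - s| >= eps) <= delta, from 2*exp(-2*min(n, n')*eps^2) <= delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * min(n, n_prime)))
```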

17 Preference-based racing
We have an efficient estimator for s(π_i, π_j), and we can calculate a confidence interval for s(π_i, π_j).
[Figure: example with K = 7 policies and κ = 3 to be selected; an edge from π_i to π_j indicates that ŝ(π_i, π_j) is significantly bigger than ŝ(π_j, π_i).]
Expected sample complexity: analysis in the style of Even-Dar et al. [2002], in terms of the gaps Δ_{i,j} = 1/2 − s(π_i, π_j).

18 Preference-based evolutionary direct policy search
Repeat these three steps until convergence:
1. Generate a population of candidate solutions (in this case, a set of policies with different parameters): π_{Θ_1}, ..., π_{Θ_λ} where Θ_1, ..., Θ_λ ~ N(m, Σ).
2. Select the best μ individuals by using the preference-based racing algorithm.
3. Update m and Σ using the parameters of the best μ individuals/policies.
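
The loop above mirrors the EDPS sketch from slide 11, with value-based selection replaced by preference-based racing. Below is a minimal Python sketch under that assumption; sample_trajectory(θ) (runs one episode with π_θ) and pbr_select (a preference-based racing routine, e.g. the PBR sketch given after Algorithm 1) are assumed helpers, and plain per-coordinate Gaussian updates stand in for full CMA-ES.

```python
import numpy as np

def pb_edps(sample_trajectory, prefer, pbr_select, dim,
            lam=10, mu=3, n_max=50, delta=0.05, iters=50, seed=0):
    """Simplified PB-EDPS loop: ES generation + preference-based racing selection."""
    rng = np.random.default_rng(seed)
    m, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        thetas = m + sigma * rng.standard_normal((lam, dim))          # 1. generate lambda candidates
        best_idx = pbr_select(list(thetas), sample_trajectory, prefer,
                              kappa=mu, n_max=n_max, delta=delta)     # 2. racing picks mu policies
        best = thetas[np.asarray(best_idx)]
        m = best.mean(axis=0)                                         # 3. update the search distribution
        sigma = best.std(axis=0) + 1e-8
    return m
```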

19 The relation of ≻ and ≻_C (only locally valid)
π_i is a Condorcet winner among a set of policies π_1, ..., π_K if π_l ≺ π_i for all l ≠ i. If the Condorcet winner exists, it is the largest element of ≻_C.
The Smith set is the smallest non-empty set D ⊆ {π_1, ..., π_K} satisfying π_j ≺ π_i for all π_i ∈ D and π_j ∈ {π_1, ..., π_K} \ D.
Proposition: Let Π = {π_1, ..., π_K} be a set of random variables for which there exists a Smith set D of size K_D. Then for any π_i ∈ D and π_j ∈ Π \ D, π_j ≺_C π_i.
[Figure: preference graphs over π_1, ..., π_5 illustrating the proposition, with degrees d_5 = d_4 = d_3 = 1, d_1 = 3, d_2 = 4.]
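
As an illustration of the Smith set, here is a small sketch that computes it from a pairwise win matrix, under the assumption of strict preferences (a tournament); in that case the Smith set is the top cycle, i.e. the policies that reach every other policy along "beats" edges. This is a standard construction given here only for illustration.

```python
import numpy as np

def smith_set(beats):
    """Smith set of a tournament given as a boolean matrix: beats[i, j] iff pi_i beats pi_j."""
    beats = np.asarray(beats, dtype=bool)
    K = beats.shape[0]
    reach = beats.copy()
    for k in range(K):                                  # Floyd-Warshall style transitive closure
        reach |= reach[:, [k]] & reach[[k], :]
    return [i for i in range(K)
            if all(reach[i, j] for j in range(K) if j != i)]

# Three policies forming a cycle: no Condorcet winner, and the Smith set is all of them.
print(smith_set([[False, True, False],
                 [False, False, True],
                 [True, False, False]]))                # -> [0, 1, 2]
```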

20 Issues to be discussed
The existence of global optima.
If there exists a global Condorcet winner, under what assumptions can we find it (w.h.p.) by using an evolution strategy along with the preference-based racing algorithm?
The Hoeffding bound is loose: use Clopper-Pearson-type confidence bounds for trinomial random variables.

21 Bibliography I
P. Abbeel and A.Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning, 2004.
E. Even-Dar, S. Mannor, and Y. Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In Proceedings of the 15th Annual Conference on Computational Learning Theory, 2002.
N. Hansen and S. Kern. Evaluating the CMA evolution strategy on multimodal test functions. In Parallel Problem Solving from Nature - PPSN VIII, 2004.
V. Heidrich-Meisner and C. Igel. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search. In Proceedings of the 26th International Conference on Machine Learning, 2009.
J. Hemelrijk. Note on Wilcoxon's two-sample test when ties are present. The Annals of Mathematical Statistics, 23(1), 1952.

22 Bibliography II
R.J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 1992.
Y. Zhao, M.R. Kosorok, and D. Zeng. Reinforcement learning design for cancer clinical trials. Statistics in Medicine, 28(26), 2009.

23 Preference-based racing for ≻_C
One can estimate s(π, π') based on finite sets of trajectories {h_1, ..., h_n} ~ P^π and {h'_1, ..., h'_{n'}} ~ P^{π'}:
ŝ(π, π') = (1/(n n')) Σ_{i=1}^{n} Σ_{j=1}^{n'} I{h_i ≻ h'_j}.
Incomparable trajectories: solution by [Hemelrijk, 1952]:
I(x, x') = 1 if x ≻ x', 0 if x ≺ x', and 1/2 otherwise.
Probabilistic interpretation: if two samples are incomparable, then we select one of them as the preferred one with probability 1/2.
As a consequence, s(π, π') = 1 − s(π', π).

24 Preference-based racing: optimization view
Preference-based case: PBR(π_1, ..., π_K, κ, n_max, δ) returns, with probability at least 1 − δ,
argmax_{I ⊆ {1,...,K}, |I| = κ} Σ_{i ∈ I} Σ_{j ≠ i} I{π_j ≺_C π_i}.
Since s(π_i, π_j) = 1 − s(π_j, π_i), this amounts to
argmax_{I ⊆ {1,...,K}, |I| = κ} Σ_{i ∈ I} Σ_{j ≠ i} I{s(π_i, π_j) > 1/2}.   (1)
We have an efficient estimator of s(π_i, π_j).

25 Algorithm 1 PBR(π_1, ..., π_K, κ, n_max, δ)
1: A = {(i, j) : 1 ≤ i, j ≤ K}, n = 0
2: while (n ≤ n_max) AND (|A| > 0) do
3:   for all i appearing in A do
4:     generate trajectory h_i^(n) by running π_i on the MDP M
5:   end for
6:   for all (i, j) ∈ A do
7:     update ŝ_{i,j} = (1/n²) Σ_{l=1}^{n} Σ_{l'=1}^{n} I{h_i^(l) ≻ h_j^(l')}
8:     c_{i,j} = sqrt( (1/(2n)) log(2K²n_max/δ) ),  u_{i,j} = ŝ_{i,j} + c_{i,j},  l_{i,j} = ŝ_{i,j} − c_{i,j}
9:   end for
10:  for i = 1, ..., K do
11:    z_i = |{j : u_{i,j} < 1/2, j ≠ i}|,  o_i = |{j : l_{i,j} > 1/2, j ≠ i}|
12:  end for
13:  C = { i : K − κ < |{j : K − z_j < o_i}| }   (select)
14:  D = { i : κ < |{j : K − o_j < z_i}| }   (discard)
15:  for (i, j) ∈ A do
16:    if (i, j ∈ C ∪ D) OR (1/2 ∉ [l_{i,j}, u_{i,j}]) then
17:      A = A \ {(i, j)}   (do not update ŝ_{i,j} any more)
18:    end if
19:  end for
20:  n = n + 1
21: end while
22: return the κ options i with the most ŝ_{i,j} above 1/2
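
A simplified Python sketch of Algorithm 1. It maintains the pairwise estimates ŝ_{i,j} with Hoeffding intervals and stops sampling pairs once their comparison is decided, but it omits the select/discard bookkeeping of lines 13-14, so treat it as an illustration of the idea rather than a faithful reimplementation. sample_trajectory(π) and prefer(h, h') (returning +1/0/−1) are assumed helpers.

```python
import math
import numpy as np

def pbr(policies, sample_trajectory, prefer, kappa, n_max, delta):
    """Simplified preference-based racing: return indices of kappa selected policies."""
    K = len(policies)
    histories = [[] for _ in range(K)]
    s_hat = np.full((K, K), 0.5)
    active = {(i, j) for i in range(K) for j in range(K) if i != j}
    n = 0
    while n < n_max and active:
        for i in {i for pair in active for i in pair}:
            histories[i].append(sample_trajectory(policies[i]))       # line 4: one more rollout of pi_i
        n += 1
        c = math.sqrt(math.log(2 * K * K * n_max / delta) / (2 * n))  # line 8: confidence radius
        for (i, j) in active:                                         # lines 6-7: update s_hat_{i,j}
            total = 0.0
            for h in histories[i]:
                for hp in histories[j]:
                    r = prefer(h, hp)
                    total += 1.0 if r > 0 else (0.5 if r == 0 else 0.0)
            s_hat[i, j] = total / (n * n)
        u, l = s_hat + c, s_hat - c
        # lines 15-19 (simplified): keep sampling only the pairs whose comparison is undecided
        active = {(i, j) for (i, j) in active if l[i, j] <= 0.5 <= u[i, j]}
    # line 22: return the kappa options with the most estimates above 1/2
    score = [sum(s_hat[i, j] > 0.5 for j in range(K) if j != i) for i in range(K)]
    return sorted(range(K), key=lambda i: -score[i])[:kappa]
```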

26 Cancer treatment
1. The state space consists of the toxicity level and the tumor size (X, Y).
2. Linear policy space.
3. Each policy search method was trained 100 times, and each policy was evaluated on 300 virtual patients.
4. 6-month treatment.
5. Transitions: X_{t+1} = X_t + ΔX_t and Y_{t+1} = Y_t + ΔY_t, where
ΔY_t = [a_1 max(X_t, X_0) − b_1 (D_t − d_1)] · I{Y_t > 0}
ΔX_t = a_2 max(Y_t, Y_0) + b_2 (D_t − d_2)
6. Probability of death: 1 − exp(−exp(c_0 + c_1 X_t + c_2 Y_t))
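
A sketch of one monthly transition of the virtual-patient dynamics above. The coefficients a1, b1, a2, b2, d1, d2, c0, c1, c2 are not given on the slide, so the values below are placeholders; the sign of the dose term in the toxicity update follows the verbal description on slide 4 (higher dose, higher toxicity), and evaluating the death hazard on the updated state is a modelling choice for this sketch.

```python
import math
import random

def step(X, Y, D, X0, Y0,
         a1=0.1, b1=1.2, d1=0.5,      # tumor-change coefficients (placeholders)
         a2=0.15, b2=1.2, d2=0.5,     # toxicity-change coefficients (placeholders)
         c0=-4.0, c1=1.0, c2=1.0,     # death-hazard coefficients (placeholders)
         rng=random):
    """One transition: X = toxicity, Y = tumor size, D = dosage level in [0, 1]."""
    dY = (a1 * max(X, X0) - b1 * (D - d1)) * (1.0 if Y > 0 else 0.0)   # tumor growth, inhibited by dose
    dX = a2 * max(Y, Y0) + b2 * (D - d2)                               # toxicity builds up with dose
    X_next, Y_next = X + dX, max(Y + dY, 0.0)                          # clamp at 0 (safety assumption)
    p_death = 1.0 - math.exp(-math.exp(c0 + c1 * X_next + c2 * Y_next))
    died = rng.random() < p_death
    return X_next, Y_next, died
```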

27 Cancer treatment
[Figure: toxicity vs. tumor size over 100 repetitions of the training process, with the Pareto front. Legend (value in parentheses for each method): Random (0.355), Const. π(s) = 0.1 (0.405), Const. π(s) = 0.4 (0.318), Const. π(s) = 0.7 (0.383), Const. π(s) = 1.0 (0.448), SARSA(λ) (0.337), PBPI (0.259), EDPS (0.283), PB-EDPS (0.209).]
