Reinforcement Learning

1 Reinforcement Learning: Inverse Reinforcement Learning
Inverse RL, behaviour cloning, apprenticeship learning, imitation learning.
Vien Ngo, Marc Toussaint. University of Stuttgart

2 Outline
- Introduction to Inverse RL
- Inverse RL vs. behavioral cloning
- IRL algorithms
(Inspired by a lecture by Pieter Abbeel.)

3 Inverse RL: Informal Definition
Given:
- Measurements of an agent's behaviour $\pi$ over time, $(s_t, a_t, s'_t)$, in different circumstances.
- If possible, the transition model (but not the reward function).
Goal: Find the reward function $R^\pi(s, a, s')$.

4 Inverse Reinforcement Learning
[Diagram. Labels: RL Agent, Reward, Dynamics Model, Policy, Imitation/Apprenticeship Learning, IRL, Expert's Demonstration. Inspired by a poster by Boularias, Kober, and Peters.]

6 Motivation: Two Sources
- The potential use of RL and related methods as computational models of animal and human learning: bee foraging (Montague et al., 1995), song-bird vocalization (Doya & Sejnowski, 1995), ...
- Construction of an intelligent agent in a particular domain: car driving, helicopter flight (Ng et al.), ... (imitation learning, apprenticeship learning)

7 Examples
- Car driving simulation: Abbeel et al., 2004, etc.
- Autonomous helicopter flight: Andrew Ng et al.
- Urban navigation: Ziebart, Maas, Bagnell, and Dey, AAAI 2008 (route recommendation and destination prediction), etc.

8 Problem Formulation
Given:
- State space $S$, action space $\mathcal{A}$.
- Transition model $T(s, a, s') = P(s' \mid s, a)$.
- The reward function $R(s, a, s')$ is not given.
- Teacher's demonstration (from the teacher's policy $\pi^*$): $s_0, a_0, s_1, a_1, \ldots$
IRL: recover $R$.
Apprenticeship learning via IRL: use $R$ to compute a good policy.
Behaviour cloning: use supervised learning to learn the teacher's policy directly.

9 IRL vs. behavioral cloning

12 IRL vs. Behavioral cloning
Behavioral cloning:
- Formulated as a supervised-learning problem (using SVMs, neural networks, deep learning, ...).
- Given $(s_0, a_0), (s_1, a_1), \ldots$ generated from the teacher's policy $\pi^*$, estimate a policy mapping $s$ to $a$.
- Can only mimic the teacher's trajectories, so it cannot cope with a change of goal/destination or with a non-Markovian environment (e.g. car driving).
IRL vs. behavioral cloning is $\hat{R}$ vs. $\hat{\pi}$: IRL estimates a reward function, behavioral cloning estimates a policy.
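As a rough illustration of behavioral cloning as supervised learning, the sketch below fits a tabular policy by majority vote over demonstrated (state, action) pairs. The function name `clone_policy` and the demonstration data are hypothetical and not from the slides; a real system would more likely use a classifier such as an SVM or a neural network.

```python
from collections import Counter, defaultdict

def clone_policy(demonstrations):
    """Estimate a deterministic policy s -> a by majority vote per state."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    # For every state seen in the data, keep the most frequently demonstrated action.
    return {state: actions.most_common(1)[0][0] for state, actions in counts.items()}

# Hypothetical demonstration data: states and actions are plain labels.
demos = [("s0", "right"), ("s1", "right"), ("s0", "right"), ("s1", "up"), ("s1", "up")]
policy = clone_policy(demos)
print(policy)  # {'s0': 'right', 's1': 'up'}
```

The cloned policy has no entry for any state that never appears in the demonstrations, which is one simple way to see why pure mimicry struggles when the goal or the environment changes.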

13 Inverse Reinforcement Learning

15 IRL: Mathematical Formulation
Given:
- State space $S$, action space $\mathcal{A}$.
- Transition model $T(s, a, s') = P(s' \mid s, a)$.
- The reward function $R(s, a, s')$ is not given.
- Teacher's demonstration (from the teacher's policy $\pi^*$): $s_0, a_0, s_1, a_1, \ldots$
Find $R^*$ such that
$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\middle|\, \pi^*\right] \ge \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\middle|\, \pi\right], \quad \forall \pi$$
Challenges:
- $R = 0$ is a solution (reward function ambiguity), and multiple $R^*$ satisfy the above condition.
- $\pi^*$ is only given partially through trajectories, so how should the expectation terms be evaluated?
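The ambiguity can be made concrete with two short observations, written out here as a worked sketch (the constant $c$ is introduced only for illustration). The zero reward makes both sides of the condition vanish, and any positive rescaling of a valid reward is also valid.

```latex
% R \equiv 0: both expectations are zero, so the condition holds with equality for every policy.
\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \cdot 0 \,\middle|\, \pi^*\right]
  = 0
  = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \cdot 0 \,\middle|\, \pi\right],
  \quad \forall \pi

% Scaling: if R^* satisfies the condition, then so does c R^* for any c > 0,
% because the expectation is linear and multiplying by c > 0 preserves the inequality.
\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \, c R^*(s_t) \,\middle|\, \pi^*\right]
  = c \, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\middle|\, \pi^*\right]
  \ge c \, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\middle|\, \pi\right]
```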

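17 IRL: Finite state spaces
Bellman equations: $V^\pi = (I - \gamma P^\pi)^{-1} R$
Then IRL finds $R$ such that
$$(P_{a^*} - P_a)(I - \gamma P_{a^*})^{-1} R \ge 0, \quad \forall a$$
(if only deterministic policies are considered).
IRL as linear programming with an $\ell_1$ penalty:
$$\max_R \; \sum_{i=1}^{|S|} \min_{b \in \mathcal{A} \setminus a^*} \left\{ \left(P_{a^*}(i) - P_b(i)\right)(I - \gamma P_{a^*})^{-1} R \right\} - \lambda \|R\|_1$$
$$\text{s.t.} \quad (P_{a^*} - P_b)(I - \gamma P_{a^*})^{-1} R \ge 0 \;\; \forall b, \qquad |R(i)| \le R_{\max}$$
That is, maximize the sum of differences between the values of the optimal action and the next-best action, with an $\ell_1$ penalty.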
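The feasibility condition above can be checked numerically for a candidate reward. The following is a minimal sketch, assuming a hypothetical 3-state chain with two actions in which the expert always takes the "right" action; it only tests the constraint and does not solve the linear program. The names `is_consistent`, `P_right`, and `P_left` are illustrative.

```python
import numpy as np

def is_consistent(P_expert, P_others, R, gamma, tol=1e-9):
    """Check (P_a* - P_b)(I - gamma P_a*)^{-1} R >= 0 for every other action b."""
    n = len(R)
    # V = (I - gamma P_a*)^{-1} R is the value of always taking the expert action.
    V = np.linalg.solve(np.eye(n) - gamma * P_expert, R)
    return all(bool(np.all((P_expert - P_b) @ V >= -tol)) for P_b in P_others)

# Hypothetical chain: the expert action moves right, the other action moves left.
P_right = np.array([[0., 1., 0.], [0., 0., 1.], [0., 0., 1.]])
P_left = np.array([[1., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
gamma = 0.9

R_goal = np.array([0., 0., 1.])   # reward at the right end of the chain
R_zero = np.zeros(3)              # the degenerate all-zero reward
R_start = np.array([1., 0., 0.])  # reward at the left end of the chain

print(is_consistent(P_right, [P_left], R_goal, gamma))   # True
print(is_consistent(P_right, [P_left], R_zero, gamma))   # True
print(is_consistent(P_right, [P_left], R_start, gamma))  # False
```

The all-zero reward passes the check, which is the ambiguity that the objective with the $\ell_1$ penalty is meant to resolve by preferring rewards that separate the expert action from the next-best one.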

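19 IRL: With FA in large state spaces
Using function approximation (FA): $R(s) = w \cdot \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi : S \to \mathbb{R}^n$. Thus,
$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, \pi\right] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t w^\top \phi(s_t) \,\middle|\, \pi\right] = w^\top \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \,\middle|\, \pi\right] = w \cdot \eta(\pi)$$
The optimization problem: find $w$ such that $w \cdot \eta(\pi^*) \ge w \cdot \eta(\pi)$ for all policies $\pi$.
$\eta(\pi)$ can be evaluated with $N$ sampled trajectories from $\pi$, where trajectory $i$ has length $T_i$:
$$\eta(\pi) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i} \gamma^t \phi\left(s_t^{(i)}\right)$$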
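The sample-based estimate of the feature expectations can be sketched as follows. This is an illustrative implementation under simple assumptions: each trajectory is a list of integer states, and the feature map `one_hot` is a hypothetical choice not taken from the slides.

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma):
    """Monte Carlo estimate eta = (1/N) * sum_i sum_t gamma^t * phi(s_t^(i))."""
    total = None
    for states in trajectories:
        # Discounted sum of feature vectors along one trajectory.
        discounted = sum((gamma ** t) * phi(s) for t, s in enumerate(states))
        total = discounted if total is None else total + discounted
    return total / len(trajectories)

def one_hot(s, n=3):
    """Hypothetical feature map: indicator vector of the state index."""
    v = np.zeros(n)
    v[s] = 1.0
    return v

# Two hypothetical sampled trajectories over states {0, 1, 2}.
trajs = [[0, 1, 2], [0, 2]]
eta = feature_expectations(trajs, one_hot, gamma=0.5)
print(eta)  # [1.    0.25  0.375]

# With a linear reward R(s) = w . phi(s), the estimated return is w . eta.
w = np.array([0.0, 0.0, 1.0])
print(w @ eta)  # 0.375
```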

20 Apprenticeship learning (Abbeel & Ng, 2004)

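22 Apprenticeship learning
Find a policy $\pi$ whose performance is as close to the expert policy's performance as possible:
$$\left| w \cdot \eta(\pi^*) - w \cdot \eta(\pi) \right| \le \epsilon$$
1: Assume $R(s) = w \cdot \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi : S \to \mathbb{R}^n$.
2: Initialize $\pi_0$
3: for $i = 1, 2, \ldots$ do
4:   Find a reward function such that the teacher maximally outperforms all previously found controllers (here $\gamma$ denotes the margin, not the discount factor):
     $$\max_{\gamma,\; \|w\| \le 1} \gamma \quad \text{s.t.} \quad w \cdot \eta(\pi^*) \ge w \cdot \eta(\pi) + \gamma, \quad \forall \pi \in \{\pi_0, \pi_1, \ldots, \pi_{i-1}\}$$
5:   Find the optimal policy $\pi_i$ for the reward function $R_w$ w.r.t. the current $w$.
6: end for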
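The loop above can be sketched on a tiny tabular MDP. Note the swap: instead of the max-margin step 4, which needs a QP solver, this sketch uses the simpler projection variant described in Abbeel & Ng (2004). The MDP, the one-hot features, the choice of initial policy, and all function names are assumptions made for illustration, and feature expectations are computed exactly from the known model rather than from samples.

```python
import numpy as np

def greedy_policy(P, R, gamma, iters=500):
    """Value iteration for a state reward R; returns the greedy action per state."""
    V = np.zeros(P.shape[1])
    for _ in range(iters):
        Q = R[None, :] + gamma * (P @ V)  # shape (actions, states)
        V = Q.max(axis=0)
    return Q.argmax(axis=0)

def feature_expectations(P, policy, Phi, d0, gamma):
    """Exact eta(pi) = Phi^T rho, with rho the discounted state visitation from d0."""
    S = P.shape[1]
    P_pi = P[policy, np.arange(S)]  # row s is P(. | s, policy[s])
    rho = np.linalg.solve(np.eye(S) - gamma * P_pi.T, d0)
    return Phi.T @ rho

def apprenticeship_projection(P, Phi, d0, eta_expert, gamma, eps=1e-6, max_iter=50):
    policy = np.ones(P.shape[1], dtype=int)  # arbitrary pi_0: always action 1
    eta_bar = feature_expectations(P, policy, Phi, d0, gamma)
    for _ in range(max_iter):
        w = eta_expert - eta_bar           # current reward weights
        if np.linalg.norm(w) <= eps:       # expert matched up to eps
            break
        policy = greedy_policy(P, Phi @ w, gamma)
        d = feature_expectations(P, policy, Phi, d0, gamma) - eta_bar
        if d @ d == 0.0:                   # no progress possible
            break
        eta_bar = eta_bar + (d @ w) / (d @ d) * d  # orthogonal projection step
    return policy, eta_bar

# Hypothetical 3-state chain: action 0 moves right, action 1 moves left.
P = np.array([[[0., 1., 0.], [0., 0., 1.], [0., 0., 1.]],
              [[1., 0., 0.], [1., 0., 0.], [0., 1., 0.]]])
Phi = np.eye(3)                   # one-hot state features
d0 = np.array([1., 0., 0.])       # always start in state 0
gamma = 0.9
expert_policy = np.zeros(3, dtype=int)  # the expert always moves right
eta_expert = feature_expectations(P, expert_policy, Phi, d0, gamma)

learned_policy, eta_bar = apprenticeship_projection(P, Phi, d0, eta_expert, gamma)
print(learned_policy)
```

In this sketch only the last policy found is returned; the original method keeps the whole set of policies and selects or mixes among them, so treat the return value as a simplification.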

23 Examples

24 Simulated Highway Driving
- Given: dynamics model $T(s, a, s')$.
- Each teacher demonstrates for 1 minute.
(Abbeel et al.)

25 Simulated Highway Driving
[Figure: expert demonstration (left), learned control (right).]

26 Urban Navigation
[Figure: picture from a tutorial by Pieter Abbeel.]

27 References
- Andrew Y. Ng, Stuart J. Russell: Algorithms for Inverse Reinforcement Learning. ICML 2000.
- Pieter Abbeel, Andrew Y. Ng: Apprenticeship Learning via Inverse Reinforcement Learning. ICML 2004.
- Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Y. Ng: An Application of Reinforcement Learning to Aerobatic Helicopter Flight. NIPS 2006: 1-8.
- Adam Coates, Pieter Abbeel, Andrew Y. Ng: Apprenticeship Learning for Helicopter Control. Commun. ACM 52(7) (2009).

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Inverse Reinforcement Learning LfD, imitation learning/behavior cloning, apprenticeship learning, IRL. Hung Ngo MLR Lab, University of Stuttgart Outline Learning from Demonstrations

More information

Maximum Margin Planning

Maximum Margin Planning Maximum Margin Planning Nathan Ratliff, Drew Bagnell and Martin Zinkevich Presenters: Ashesh Jain, Michael Hu CS6784 Class Presentation Theme 1. Supervised learning 2. Unsupervised learning 3. Reinforcement

More information

Relative Entropy Inverse Reinforcement Learning

Relative Entropy Inverse Reinforcement Learning Relative Entropy Inverse Reinforcement Learning Abdeslam Boularias Jens Kober Jan Peters Max-Planck Institute for Intelligent Systems 72076 Tübingen, Germany {abdeslam.boularias,jens.kober,jan.peters}@tuebingen.mpg.de

More information

MAP Inference for Bayesian Inverse Reinforcement Learning

MAP Inference for Bayesian Inverse Reinforcement Learning MAP Inference for Bayesian Inverse Reinforcement Learning Jaedeug Choi and Kee-Eung Kim bdepartment of Computer Science Korea Advanced Institute of Science and Technology Daejeon 305-701, Korea jdchoi@ai.kaist.ac.kr,

More information

An Introduction to Reinforcement Learning

An Introduction to Reinforcement Learning An Introduction to Reinforcement Learning Shivaram Kalyanakrishnan shivaram@cse.iitb.ac.in Department of Computer Science and Engineering Indian Institute of Technology Bombay April 2018 What is Reinforcement

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

An Introduction to Reinforcement Learning

An Introduction to Reinforcement Learning An Introduction to Reinforcement Learning Shivaram Kalyanakrishnan shivaram@csa.iisc.ernet.in Department of Computer Science and Automation Indian Institute of Science August 2014 What is Reinforcement

More information

Nonparametric Bayesian Inverse Reinforcement Learning

Nonparametric Bayesian Inverse Reinforcement Learning PRML Summer School 2013 Nonparametric Bayesian Inverse Reinforcement Learning Jaedeug Choi JDCHOI@AI.KAIST.AC.KR Sequential Decision Making (1) Multiple decisions over time are made to achieve goals Reinforcement

More information

Inverse Reinforcement Learning with Simultaneous Estimation of Rewards and Dynamics

Inverse Reinforcement Learning with Simultaneous Estimation of Rewards and Dynamics Inverse Reinforcement Learning with Simultaneous Estimation of Rewards and Dynamics Michael Herman Tobias Gindele Jörg Wagner Felix Schmitt Wolfram Burgard Robert Bosch GmbH D-70442 Stuttgart, Germany

More information

Parking lot navigation. Experimental setup. Problem setup. Nice driving style. Page 1. CS 287: Advanced Robotics Fall 2009

Parking lot navigation. Experimental setup. Problem setup. Nice driving style. Page 1. CS 287: Advanced Robotics Fall 2009 Consider the following scenario: There are two envelopes, each of which has an unknown amount of money in it. You get to choose one of the envelopes. Given this is all you get to know, how should you choose?

More information

Batch, Off-policy and Model-free Apprenticeship Learning

Batch, Off-policy and Model-free Apprenticeship Learning Batch, Off-policy and Model-free Apprenticeship Learning Edouard Klein 13, Matthieu Geist 1, and Olivier Pietquin 12 1. Supélec-Metz Campus, IMS Research group, France, prenom.nom@supelec.fr 2. UMI 2958

More information

A Game-Theoretic Approach to Apprenticeship Learning

A Game-Theoretic Approach to Apprenticeship Learning Advances in Neural Information Processing Systems 20, 2008. A Game-Theoretic Approach to Apprenticeship Learning Umar Syed Computer Science Department Princeton University 35 Olden St Princeton, NJ 08540-5233

More information

Inverse Reinforcement Learning in Partially Observable Environments

Inverse Reinforcement Learning in Partially Observable Environments Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09) Inverse Reinforcement Learning in Partially Observable Environments Jaedeug Choi and Kee-Eung Kim Department

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Model-Based Reinforcement Learning Model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning Vien Ngo MLR, University

More information

A Cascaded Supervised Learning Approach to Inverse Reinforcement Learning

A Cascaded Supervised Learning Approach to Inverse Reinforcement Learning A Cascaded Supervised Learning Approach to Inverse Reinforcement Learning Edouard Klein 1,2, Bilal Piot 2,3, Matthieu Geist 2, Olivier Pietquin 2,3 1 ABC Team LORIA-CNRS, France. 2 Supélec, IMS-MaLIS Research

More information

arxiv: v1 [cs.ro] 12 Aug 2016

arxiv: v1 [cs.ro] 12 Aug 2016 Density Matching Reward Learning Sungjoon Choi 1, Kyungjae Lee 1, H. Andy Park 2, and Songhwai Oh 1 arxiv:1608.03694v1 [cs.ro] 12 Aug 2016 1 Seoul National University, Seoul, Korea {sungjoon.choi, kyungjae.lee,

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

Bayesian Nonparametric Feature Construction for Inverse Reinforcement Learning

Bayesian Nonparametric Feature Construction for Inverse Reinforcement Learning Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence Bayesian Nonparametric Feature Construction for Inverse Reinforcement Learning Jaedeug Choi and Kee-Eung Kim Department

More information

Inverse Optimal Control

Inverse Optimal Control Inverse Optimal Control Oleg Arenz Technische Universität Darmstadt o.arenz@gmx.de Abstract In Reinforcement Learning, an agent learns a policy that maximizes a given reward function. However, providing

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.

More information

Imitation Learning. Richard Zhu, Andrew Kang April 26, 2016

Imitation Learning. Richard Zhu, Andrew Kang April 26, 2016 Imitation Learning Richard Zhu, Andrew Kang April 26, 2016 Table of Contents 1. Introduction 2. Preliminaries 3. DAgger 4. Guarantees 5. Generalization 6. Performance 2 Introduction Where we ve been The

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Stuart Russell, UC Berkeley Stuart Russell, UC Berkeley 1 Outline Sequential decision making Dynamic programming algorithms Reinforcement learning algorithms temporal difference

More information

Introduction to Reinforcement Learning. CMPT 882 Mar. 18

Introduction to Reinforcement Learning. CMPT 882 Mar. 18 Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and

More information

Reinforcement Learning and Control

Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

PolicyBoost: Functional Policy Gradient with Ranking-Based Reward Objective

PolicyBoost: Functional Policy Gradient with Ranking-Based Reward Objective AI and Robotics: Papers from the AAAI-14 Worshop PolicyBoost: Functional Policy Gradient with Raning-Based Reward Objective Yang Yu and Qing Da National Laboratory for Novel Software Technology Nanjing

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan

Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan Some slides borrowed from Peter Bodik and David Silver Course progress Learning

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important

More information

Improving the Efficiency of Bayesian Inverse Reinforcement Learning

Improving the Efficiency of Bayesian Inverse Reinforcement Learning Improving the Efficiency of Bayesian Inverse Reinforcement Learning Bernard Michini* and Jonathan P. How** Aerospace Controls Laboratory Massachusetts Institute of Technology, Cambridge, MA 02139 USA Abstract

More information

arxiv: v2 [cs.lg] 13 Aug 2018 ABSTRACT

arxiv: v2 [cs.lg] 13 Aug 2018 ABSTRACT LEARNING ROBUST REWARDS WITH ADVERSARIAL INVERSE REINFORCEMENT LEARNING Justin Fu, Katie Luo, Sergey Levine Department of Electrical Engineering and Computer Science University of California, Berkeley

More information

Variance Reduction for Policy Gradient Methods. March 13, 2017

Variance Reduction for Policy Gradient Methods. March 13, 2017 Variance Reduction for Policy Gradient Methods March 13, 2017 Reward Shaping Reward Shaping Reward Shaping Reward shaping: r(s, a, s ) = r(s, a, s ) + γφ(s ) Φ(s) for arbitrary potential Φ Theorem: r admits

More information

Adversarial Inverse Optimal Control for General Imitation Learning Losses and Embodiment Transfer

Adversarial Inverse Optimal Control for General Imitation Learning Losses and Embodiment Transfer ersarial Inverse Optimal Control for General Imitation Learning Losses and Embodiment Transfer Xiangli Chen Mathew Monfort Brian D. Ziebart University of Illinois at Chicago Chicago, IL 60607 {xchen0,mmonfo,bziebart}@uic.edu

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Temporal Difference Learning Temporal difference learning, TD prediction, Q-learning, elibigility traces. (many slides from Marc Toussaint) Vien Ngo MLR, University of Stuttgart

More information

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016 Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the

More information

Autonomous Helicopter Flight via Reinforcement Learning

Autonomous Helicopter Flight via Reinforcement Learning Autonomous Helicopter Flight via Reinforcement Learning Authors: Andrew Y. Ng, H. Jin Kim, Michael I. Jordan, Shankar Sastry Presenters: Shiv Ballianda, Jerrolyn Hebert, Shuiwang Ji, Kenley Malveaux, Huy

More information

Algorithms for Learning Markov Field Policies

Algorithms for Learning Markov Field Policies Algorithms for Learning Marov Field Policies Abdeslam Boularias Max Planc Institute for Intelligent Systems boularias@tuebingen.mpg.de Oliver Krömer, Jan Peters Technische Universität Darmstadt {oli,jan}@robot-learning.de

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function Approximation Continuous state/action space, mean-square error, gradient temporal difference learning, least-square temporal difference, least squares policy iteration Vien

More information

Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016)

Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016) Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016) Yoonho Lee Department of Computer Science and Engineering Pohang University of Science and Technology October 11, 2016 Outline

More information

Modeling Decision Making with Maximum Entropy Inverse Optimal Control Thesis Proposal

Modeling Decision Making with Maximum Entropy Inverse Optimal Control Thesis Proposal Modeling Decision Making with Maximum Entropy Inverse Optimal Control Thesis Proposal Brian D. Ziebart Machine Learning Department Carnegie Mellon University September 30, 2008 Thesis committee: J. Andrew

More information

Generative Adversarial Imitation Learning

Generative Adversarial Imitation Learning Generative Adversarial Imitation Learning Jonathan Ho OpenAI hoj@openai.com Stefano Ermon Stanford University ermon@cs.stanford.edu Abstract Consider learning a policy from example expert behavior, without

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Temporal Difference Learning Temporal difference learning, TD prediction, Q-learning, elibigility traces. (many slides from Marc Toussaint) Vien Ngo Marc Toussaint University of

More information

Model-based Imitation Learning by Probabilistic Trajectory Matching

Model-based Imitation Learning by Probabilistic Trajectory Matching Model-based Imitation Learning by Probabilistic Trajectory Matching Peter Englert1, Alexandros Paraschos1, Jan Peters1,2, Marc Peter Deisenroth1 Abstract One of the most elegant ways of teaching new skills

More information

Lecture 9: Policy Gradient II 1

Lecture 9: Policy Gradient II 1 Lecture 9: Policy Gradient II 1 Emma Brunskill CS234 Reinforcement Learning. Winter 2019 Additional reading: Sutton and Barto 2018 Chp. 13 1 With many slides from or derived from David Silver and John

More information

Deep Reinforcement Learning via Policy Optimization

Deep Reinforcement Learning via Policy Optimization Deep Reinforcement Learning via Policy Optimization John Schulman July 3, 2017 Introduction Deep Reinforcement Learning: What to Learn? Policies (select next action) Deep Reinforcement Learning: What to

More information

Preference Elicitation for Sequential Decision Problems

Preference Elicitation for Sequential Decision Problems Preference Elicitation for Sequential Decision Problems Kevin Regan University of Toronto Introduction 2 Motivation Focus: Computational approaches to sequential decision making under uncertainty These

More information

EM-based Reinforcement Learning

EM-based Reinforcement Learning EM-based Reinforcement Learning Gerhard Neumann 1 1 TU Darmstadt, Intelligent Autonomous Systems December 21, 2011 Outline Expectation Maximization (EM)-based Reinforcement Learning Recap : Modelling data

More information

Maximum Entropy Inverse Reinforcement Learning

Maximum Entropy Inverse Reinforcement Learning Maximum Entropy Inverse Reinforcement Learning Brian D. Ziebart, Andrew Maas, J.Andrew Bagnell, and Anind K. Dey School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 bziebart@cs.cmu.edu,

More information

Optimal Control with Learned Forward Models

Optimal Control with Learned Forward Models Optimal Control with Learned Forward Models Pieter Abbeel UC Berkeley Jan Peters TU Darmstadt 1 Where we are? Reinforcement Learning Data = {(x i, u i, x i+1, r i )}} x u xx r u xx V (x) π (u x) Now V

More information

Deep Reinforcement Learning: Policy Gradients and Q-Learning

Deep Reinforcement Learning: Policy Gradients and Q-Learning Deep Reinforcement Learning: Policy Gradients and Q-Learning John Schulman Bay Area Deep Learning School September 24, 2016 Introduction and Overview Aim of This Talk What is deep RL, and should I use

More information

Short Course: Multiagent Systems. Multiagent Systems. Lecture 1: Basics Agents Environments. Reinforcement Learning. This course is about:

Short Course: Multiagent Systems. Multiagent Systems. Lecture 1: Basics Agents Environments. Reinforcement Learning. This course is about: Short Course: Multiagent Systems Lecture 1: Basics Agents Environments Reinforcement Learning Multiagent Systems This course is about: Agents: Sensing, reasoning, acting Multiagent Systems: Distributed

More information

Deep Reinforcement Learning. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017

Deep Reinforcement Learning. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017 Deep Reinforcement Learning STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017 Outline Introduction to Reinforcement Learning AlphaGo (Deep RL for Computer Go)

More information

Generalized Inverse Reinforcement Learning with Linearly Solvable MDP

Generalized Inverse Reinforcement Learning with Linearly Solvable MDP Generalized Inverse Reinforcement Learning with Linearly Solvable MDP Masahiro Kohjima ( ), Tatsushi Matsubayashi, and Hiroshi Sawada NTT Service Evolution Laboratories, NTT Corporation, Japan {kohjima.masahiro,matsubayashi.tatsushi,sawada.hiroshi}@lab.ntt.co.jp

More information

Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning

Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning Daniel S. Brown and Scott Niekum Department of Computer Science University of Texas at Austin {dsbrown,sniekum}@cs.utexas.edu

More information

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides

More information

Stochastic Primal-Dual Methods for Reinforcement Learning

Stochastic Primal-Dual Methods for Reinforcement Learning Stochastic Primal-Dual Methods for Reinforcement Learning Alireza Askarian 1 Amber Srivastava 1 1 Department of Mechanical Engineering University of Illinois at Urbana Champaign Big Data Optimization,

More information

The convergence limit of the temporal difference learning

The convergence limit of the temporal difference learning The convergence limit of the temporal difference learning Ryosuke Nomura the University of Tokyo September 3, 2013 1 Outline Reinforcement Learning Convergence limit Construction of the feature vector

More information

Policy Gradient Methods. February 13, 2017

Policy Gradient Methods. February 13, 2017 Policy Gradient Methods February 13, 2017 Policy Optimization Problems maximize E π [expression] π Fixed-horizon episodic: T 1 Average-cost: lim T 1 T r t T 1 r t Infinite-horizon discounted: γt r t Variable-length

More information

ADVERSARIAL IMITATION VIA VARIATIONAL INVERSE REINFORCEMENT LEARNING

ADVERSARIAL IMITATION VIA VARIATIONAL INVERSE REINFORCEMENT LEARNING ADVERSARIAL IMITATION VIA VARIATIONAL INVERSE REINFORCEMENT LEARNING Ahmed H. Qureshi Department of Electrical and Computer Engineering University of California San Diego, La Jolla, CA 92093, USA a1qureshi@ucsd.edu

More information

Approximation Methods in Reinforcement Learning

Approximation Methods in Reinforcement Learning 2018 CS420, Machine Learning, Lecture 12 Approximation Methods in Reinforcement Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/cs420/index.html Reinforcement

More information

Probabilistic inverse reinforcement learning in unknown environments

Probabilistic inverse reinforcement learning in unknown environments Probabilistic inverse reinforcement learning in unknown environments Aristide C. Y. Tossou EPAC, Abomey-Calavi, Bénin yedtoss@gmail.com Christos Dimitrakakis EPFL, Lausanne, Switzerland christos.dimitrakakis@gmail.com

More information

Advanced Policy Gradient Methods: Natural Gradient, TRPO, and More. March 8, 2017

Advanced Policy Gradient Methods: Natural Gradient, TRPO, and More. March 8, 2017 Advanced Policy Gradient Methods: Natural Gradient, TRPO, and More March 8, 2017 Defining a Loss Function for RL Let η(π) denote the expected return of π [ ] η(π) = E s0 ρ 0,a t π( s t) γ t r t We collect

More information

arxiv: v3 [cs.lg] 17 Jun 2018

arxiv: v3 [cs.lg] 17 Jun 2018 Machine Learning manuscript No. (will be inserted by the editor) Inverse Reinforcement Learning from Summary Data Antti Kangasrääsiö Samuel Kaski Received: date / Accepted: date arxiv:1703.09700v3 [cs.lg]

More information

Lecture 9: Policy Gradient II (Post lecture) 2

Lecture 9: Policy Gradient II (Post lecture) 2 Lecture 9: Policy Gradient II (Post lecture) 2 Emma Brunskill CS234 Reinforcement Learning. Winter 2018 Additional reading: Sutton and Barto 2018 Chp. 13 2 With many slides from or derived from David Silver

More information

Verifying Robustness of Human-Aware Autonomous Cars

Verifying Robustness of Human-Aware Autonomous Cars Verifying Robustness of Human-Aware Autonomous Cars Dorsa Sadigh S. Shankar Sastry Sanjit A. Seshia Stanford University, (e-mail: dorsa@cs.stanford.edu). UC Berkeley, (e-mail: {sseshia, sastry}@eecs.berkeley.edu)

More information

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari

More information

REINFORCEMENT LEARNING

REINFORCEMENT LEARNING REINFORCEMENT LEARNING Larry Page: Where s Google going next? DeepMind's DQN playing Breakout Contents Introduction to Reinforcement Learning Deep Q-Learning INTRODUCTION TO REINFORCEMENT LEARNING Contents

More information

Reinforcement Learning for NLP

Reinforcement Learning for NLP Reinforcement Learning for NLP Advanced Machine Learning for NLP Jordan Boyd-Graber REINFORCEMENT OVERVIEW, POLICY GRADIENT Adapted from slides by David Silver, Pieter Abbeel, and John Schulman Advanced

More information

Analysis of Inverse Reinforcement Learning with Perturbed Demonstrations

Analysis of Inverse Reinforcement Learning with Perturbed Demonstrations Analysis of Inverse Reinforcement Learning with Perturbed Demonstrations Francisco S. Melo 1 and Manuel Lopes 2 and Ricardo Ferreira 3 Abstract. Inverse reinforcement learning (IRL addresses the problem

More information

Competitive Multi-agent Inverse Reinforcement Learning with Sub-optimal Demonstrations
