Reinforcement Learning
1 Reinforcement Learning: Inverse Reinforcement Learning
Inverse RL, behaviour cloning, apprenticeship learning, imitation learning.
Vien Ngo, Marc Toussaint, University of Stuttgart
2 Outline
Introduction to Inverse RL
Inverse RL vs. behavioral cloning
IRL algorithms
(Inspired by a lecture by Pieter Abbeel.)
3 Inverse RL: Informal Definition
Given: measurements of an agent's behaviour π over time, (s_t, a_t, s'_t), in different circumstances; if possible, also the transition model (but not the reward function).
Goal: find the reward function R^π(s, a, s').
4 Inverse Reinforcement Learning
(Diagram: in RL, a reward function and a dynamics model are given to the agent, which outputs a policy; in imitation/apprenticeship learning and IRL, the starting point is instead an expert's demonstration. Inspired by a poster of Boularias, Kober, Peters.)
6 Motivation: Two Sources
The potential use of RL and related methods as a computational model of animal and human learning: bee foraging (Montague et al. 1995), song-bird vocalization (Doya & Sejnowski 1995), ...
The construction of an intelligent agent in a particular domain: car driving, helicopter flight (Ng et al.), ... (imitation learning, apprenticeship learning)
7 Examples
Car driving simulation (Abbeel et al. 2004, etc.)
Autonomous helicopter flight (Andrew Ng et al.)
Urban navigation: route recommendation and destination prediction (Ziebart, Maas, Bagnell and Dey, AAAI 2008), etc.
8 Problem Formulation
Given: state space S, action space A; transition model T(s, a, s') = P(s' | s, a). Not given: the reward function R(s, a, s').
Teacher's demonstration (from the teacher's policy π*): s_0, a_0, s_1, a_1, ...
IRL: recover R.
Apprenticeship learning via IRL: use the recovered R to compute a good policy.
Behaviour cloning: use supervised learning to learn the teacher's policy directly.
9 IRL vs. Behavioral Cloning
12 IRL vs. Behavioral Cloning
Behavioral cloning: formulated as a supervised-learning problem (using SVMs, neural networks, deep learning, ...). Given (s_0, a_0), (s_1, a_1), ..., generated from a policy π, estimate a policy mapping s to a.
Behavioral cloning can only mimic the teacher's trajectories; it therefore fails when the goal/destination changes and in non-Markovian environments (e.g. car driving).
In short, IRL vs. behavioral cloning is estimating R̂ vs. estimating π̂.
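A behavioral-cloning step is then nothing more than fitting a classifier to the demonstrated (s, a) pairs. A minimal sketch, assuming a toy 2-D state space with discrete actions; the 1-nearest-neighbour choice, the states and the action labels are all illustrative, not from the lecture:

```python
import numpy as np

def clone_policy(states, actions):
    """Behavioral cloning as 1-nearest-neighbour classification:
    the cloned policy copies the action of the closest demonstrated state."""
    states = np.asarray(states, dtype=float)
    actions = list(actions)
    def policy(s):
        d = np.linalg.norm(states - np.asarray(s, dtype=float), axis=1)
        return actions[int(np.argmin(d))]
    return policy

# Toy demonstration: the teacher goes right when x < 0, left otherwise.
demo_s = [(-2.0, 0.0), (-1.0, 1.0), (1.0, 0.0), (2.0, -1.0)]
demo_a = ["right", "right", "left", "left"]
pi_hat = clone_policy(demo_s, demo_a)
print(pi_hat((-1.2, 0.8)))  # prints "right"
```

Any supervised learner (SVM, neural network, ...) could replace the nearest-neighbour rule; the point is only that the output is a policy π̂, with no reward function involved.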
13 Inverse Reinforcement Learning
15 IRL: Mathematical Formulation
Given: state space S, action space A; transition model T(s, a, s') = P(s' | s, a). Not given: the reward function R(s, a, s').
Teacher's demonstration (from the teacher's policy π*): s_0, a_0, s_1, a_1, ...
Find R*, such that

  E[ Σ_{t=0}^∞ γ^t R*(s_t) | π* ]  ≥  E[ Σ_{t=0}^∞ γ^t R*(s_t) | π ],   ∀π

Challenges? R = 0 is a solution (reward-function ambiguity), and multiple R satisfy the above condition. Moreover, π* is only given partially, through trajectories; how, then, to evaluate the expectation terms?
17 IRL: Finite State Spaces
Bellman equations: V^π = (I − γP_π)^{-1} R.
IRL then finds R such that

  (P_{a*} − P_a)(I − γP_{a*})^{-1} R ≥ 0,   ∀a

(considering only deterministic policies; a* denotes the expert's action).
IRL as a linear program with an l1 penalty:

  max_R   Σ_{i=1}^{|S|}  min_{b ∈ A\{a*}}  (P_{a*}(i) − P_b(i)) (I − γP_{a*})^{-1} R   −   λ ‖R‖_1
  s.t.    (P_{a*} − P_b)(I − γP_{a*})^{-1} R ≥ 0,   ∀b
          |R(i)| ≤ R_max

Maximize the sum of differences between the value of the optimal action and that of the next-best action, with an l1 penalty.
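The LP above can be handed directly to an off-the-shelf solver. A sketch with scipy.optimize.linprog, assuming (after relabelling, as usual for this LP) that the expert's action has index 0 in every state; the auxiliary variables t and u linearize the inner min and the l1 term, and the toy MDP at the bottom is illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def lp_irl(P, gamma=0.9, l1=1.0, r_max=1.0):
    """LP formulation of finite-state IRL (Ng & Russell 2000).

    P: array (n_actions, n, n) of transition matrices; the expert's
    action is assumed to be index 0 in every state. Returns R (length n)."""
    n_actions, n, _ = P.shape
    M = np.linalg.inv(np.eye(n) - gamma * P[0])        # (I - gamma P_a*)^-1
    # variables x = [R (n), t (n), u (n)]; maximize sum(t) - l1 * sum(u)
    c = np.concatenate([np.zeros(n), -np.ones(n), l1 * np.ones(n)])
    A, b = [], []
    for a in range(1, n_actions):
        D = (P[0] - P[a]) @ M                          # rows (P_a*(i)-P_a(i)) M
        A.append(np.hstack([-D, np.eye(n), np.zeros((n, n))]))   # t_i <= D_i R
        b.append(np.zeros(n))
        A.append(np.hstack([-D, np.zeros((n, n)), np.zeros((n, n))]))  # D R >= 0
        b.append(np.zeros(n))
    A.append(np.hstack([np.eye(n), np.zeros((n, n)), -np.eye(n)]))     # R <= u
    b.append(np.zeros(n))
    A.append(np.hstack([-np.eye(n), np.zeros((n, n)), -np.eye(n)]))    # -R <= u
    b.append(np.zeros(n))
    bounds = [(-r_max, r_max)] * n + [(None, None)] * n + [(0, None)] * n
    res = linprog(c, A_ub=np.vstack(A), b_ub=np.concatenate(b), bounds=bounds)
    return res.x[:n]

# Toy 2-state MDP: the expert's action (index 0) always moves to state 0,
# action 1 always moves to state 1; the LP ranks state 0 above state 1.
P = np.array([[[1., 0.], [1., 0.]],
              [[0., 1.], [0., 1.]]])
print(lp_irl(P))
```

On this toy problem the recovered reward sits at the vertex R ≈ (R_max, −R_max), illustrating why the l1 penalty and the R_max bound are needed to pin down one of the many admissible rewards.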
19 IRL with Function Approximation in Large State Spaces
Using function approximation, R(s) = w·φ(s), where w ∈ R^n and φ : S → R^n. Thus

  E[ Σ_t γ^t R(s_t) | π ] = E[ Σ_t γ^t w·φ(s_t) | π ] = w · E[ Σ_t γ^t φ(s_t) | π ] = w · η(π)

The optimization problem: find w such that w·η(π*) ≥ w·η(π) for all π.
η(π) can be estimated from N sampled trajectories of π:

  η(π) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{T_i} γ^t φ(s_t^{(i)})
20 Apprenticeship Learning (Abbeel & Ng, 2004)
22 Apprenticeship Learning
Find a policy π whose performance is as close to the expert policy's performance as possible: |w·η(π*) − w·η(π)| ≤ ε.

1: Assume R(s) = w·φ(s), where w ∈ R^n and φ : S → R^n.
2: Initialize π_0.
3: for i = 1, 2, ... do
4:   Find a reward function such that the teacher maximally outperforms all previously found controllers:
       max_{γ, ‖w‖_2 ≤ 1} γ    s.t.   w·η(π*) ≥ w·η(π) + γ,   ∀π ∈ {π_0, π_1, ..., π_{i−1}}
5:   Find the optimal policy π_i for the reward function R_w w.r.t. the current w.
6: end for
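Step 4 can be carried out without a QP solver using Abbeel & Ng's (2004) "projection" variant: keep a running projection η̄ of the expert's feature expectations onto the set spanned by the policies found so far, and set w = η(π*) − η̄. A sketch of one such update; all numbers are illustrative:

```python
import numpy as np

def projection_step(eta_star, eta_bar, eta_new):
    """Orthogonally project eta_star onto the line from eta_bar (previous
    projection) to eta_new (newest policy's feature expectations)."""
    d = eta_new - eta_bar
    lam = np.dot(d, eta_star - eta_bar) / np.dot(d, d)
    return eta_bar + lam * d

eta_star = np.array([1.0, 1.0])   # expert's feature expectations eta(pi*)
eta_bar = np.array([0.0, 0.0])    # from the initial policy pi_0
eta_new = np.array([1.0, 0.0])    # from the newest learned policy pi_1

eta_bar = projection_step(eta_star, eta_bar, eta_new)
w = eta_star - eta_bar            # reward weights for the next RL step
```

Each outer iteration would then run an ordinary RL solver on R_w(s) = w·φ(s) to obtain π_i, and the loop stops once the margin ‖η(π*) − η̄‖ drops below ε.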
23 Examples
24 Simulated Highway Driving
Given: the dynamics model T(s, a, s'). Each teacher demonstrates for 1 minute. (Abbeel et al. 2004)
25 Simulated Highway Driving
Expert demonstration (left), learned control (right).
26 Urban Navigation
(Picture from a tutorial of Pieter Abbeel.)
27 References
Andrew Y. Ng, Stuart J. Russell: Algorithms for Inverse Reinforcement Learning. ICML 2000.
Pieter Abbeel, Andrew Y. Ng: Apprenticeship Learning via Inverse Reinforcement Learning. ICML 2004.
Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Y. Ng: An Application of Reinforcement Learning to Aerobatic Helicopter Flight. NIPS 2006: 1-8.
Adam Coates, Pieter Abbeel, Andrew Y. Ng: Apprenticeship Learning for Helicopter Control. Commun. ACM 52(7), 2009.