Trust Region Policy Optimization


Trust Region Policy Optimization
Yixin Lin, Duke University (yixin.lin@duke.edu)
March 28, 2017

Overview

1. Preliminaries: Markov Decision Processes, policy iteration, policy gradients
2. TRPO: Kakade & Langford, natural policy gradients, overview of TRPO, practice, experiments
3. References

Introduction

Reinforcement learning is the problem of sequential decision making in a dynamic environment. The goal is to capture the most important aspects of an agent making decisions:
- Input: sensing the state of the environment
- Action: choosing how to act on the environment
- Goal: preferring some states of the environment over others

This is incredibly general. Examples: robots (and their components), games, better A/B testing.

The Markov Decision Process (MDP)

- $S$: set of possible states of the environment
- $p(s_{\text{init}})$, $s_{\text{init}} \in S$: a distribution over the initial state
- Markov property: we assume that the current state summarizes everything we need to remember
- $A$: set of possible actions
- $P(s_{\text{new}} \mid s_{\text{old}}, a)$: state transition distribution, for each state $s_{\text{old}}$ and action $a$
- $R : S \to \mathbb{R}$: reward function
- $\gamma \in [0, 1]$: discount factor
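As a concrete illustration (not from the slides), the same ingredients can be written down as a small tabular MDP. All names and the toy numbers below are made up; later sketches reuse these variables.

```python
import numpy as np

# A tiny tabular MDP with 2 states and 2 actions (numbers are illustrative).
n_states, n_actions = 2, 2

p_init = np.array([1.0, 0.0])                  # distribution over the initial state
P = np.zeros((n_states, n_actions, n_states))  # P[s, a, s'] = P(s' | s, a)
P[0, 0] = [0.9, 0.1]
P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.5, 0.5]
P[1, 1] = [0.0, 1.0]
R = np.array([0.0, 1.0])                       # R[s]: reward for being in state s
gamma = 0.95                                   # discount factor
```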

Policies and value functions

- $\pi$: a policy (what action to take, given a state)
- Return: the (possibly discounted) sum of future rewards, $r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$
- Performance of a policy: $\eta(\pi) = \mathbb{E}[\text{return}]$
- $V^\pi(s) = \mathbb{E}_\pi[\text{return} \mid s]$: how good is a state, given a policy?
- $Q^\pi(s, a) = \mathbb{E}_\pi[\text{return} \mid s, a]$: how good is an action at a state, given a policy?
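These quantities are related in the usual way (standard identities, not stated on the slide):

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q^\pi(s, a)\big], \qquad \eta(\pi) = \mathbb{E}_{s_0 \sim p(s_{\text{init}})}\big[V^\pi(s_0)\big].$$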

Policy iteration

- Assume a perfect model of the MDP
- Alternate between the following until convergence:
  - Evaluate the policy (compute $V^\pi$): for each state $s$, $V(s) \leftarrow \mathbb{E}_{s', r}\big[r + \gamma V(s')\big]$; repeat until convergence (or just once, for value iteration)
  - Set the policy to be greedy: $\pi(s) = \arg\max_a \mathbb{E}\big[r + \gamma V^\pi(s')\big]$
- Convergence is guaranteed (for both policy and value iteration)
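A minimal sketch of policy iteration on the toy MDP defined earlier (it assumes those `P`, `R`, `gamma` arrays and numpy; the exact-solve evaluation step is one of several valid choices):

```python
def policy_iteration(P, R, gamma):
    """Exact policy iteration on a tabular MDP (illustrative sketch)."""
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)          # start from an arbitrary deterministic policy
    while True:
        # Policy evaluation: solve V = R + gamma * P_pi V exactly for the current policy.
        P_pi = P[np.arange(n_states), pi]       # (n_states, n_states) transitions under pi
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # Policy improvement: act greedily with respect to V.
        Q = R[:, None] + gamma * (P @ V)        # Q[s, a] = R(s) + gamma * sum_s' P(s'|s,a) V(s')
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return pi, V
        pi = new_pi
```

On the toy MDP above, `policy_iteration(P, R, gamma)` returns the greedy optimal policy together with its value function.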

Policy gradients

- Policy iteration scales very badly: we have to repeatedly evaluate the policy on all states
- Instead, parameterize the policy as $\pi_\theta$
- Then we can sample instead of enumerating states

Policy gradients

- Sample a lot of trajectories (simulate your environment under the policy) under the current policy
- Make good actions more probable
- Specifically, estimate the gradient using the score function gradient estimator: for each trajectory $\tau_i$, $\hat{g}_i = R(\tau_i)\, \nabla_\theta \log p(\tau_i \mid \theta)$
- Intuitively: take the gradient of the log probability of the trajectory, then weight it by the trajectory's reward
- Reduce variance using temporal structure and other tricks (e.g. a baseline)
- Replace the reward by the advantage function $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$: intuitively, how much better is the action we picked than the average action?
- Repeat
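A step the slide skips: the trajectory probability factors into dynamics terms and policy terms, and the dynamics do not depend on $\theta$, so only the policy terms survive the gradient:

$$\nabla_\theta \log p(\tau \mid \theta) = \nabla_\theta \left[ \log p(s_0) + \sum_t \log P(s_{t+1} \mid s_t, a_t) + \sum_t \log \pi_\theta(a_t \mid s_t) \right] = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$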

Vanilla policy gradient algorithm / REINFORCE

Initialize policy $\pi_\theta$
while gradient estimate has not converged do
    Sample trajectories using $\pi_\theta$
    for each timestep do
        Compute return and advantage estimate
    end for
    Refit optimal baseline
    Update the policy using gradient estimate $\hat{g}$
end while
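A self-contained numpy sketch of this loop for a tabular softmax policy on the toy MDP defined earlier (it reuses `n_states`, `n_actions`, `p_init`, `P`, `R` from that block); the horizon, learning rate, and the simple mean-return baseline are illustrative choices, not from the slides.

```python
def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def sample_episode(theta, rng, horizon=20):
    """Roll out one episode under the tabular softmax policy pi_theta(a|s) = softmax(theta[s])."""
    s = rng.choice(n_states, p=p_init)
    states, actions, rewards = [], [], []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=softmax(theta[s]))
        states.append(s); actions.append(a); rewards.append(R[s])
        s = rng.choice(n_states, p=P[s, a])
    return states, actions, rewards

theta = np.zeros((n_states, n_actions))       # one logit per (state, action)
lr, n_iters, n_episodes = 0.05, 200, 16       # illustrative hyperparameters
rng = np.random.default_rng(0)

for _ in range(n_iters):
    episodes = [sample_episode(theta, rng) for _ in range(n_episodes)]
    baseline = np.mean([sum(r) for _, _, r in episodes])   # crude constant baseline
    grad = np.zeros_like(theta)
    for states, actions, rewards in episodes:
        weight = sum(rewards) - baseline      # total reward minus baseline (advantage stand-in)
        for s, a in zip(states, actions):
            g_logp = -softmax(theta[s])       # grad_theta[s] of log pi(a|s) = one_hot(a) - probs
            g_logp[a] += 1.0
            grad[s] += weight * g_logp
    theta += lr * grad / n_episodes           # gradient ascent: make good actions more probable
```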

Connection to supervised learning

- Minimize $L(\theta) = \sum_t \log \pi(a_t \mid s_t; \theta)\, \hat{A}_t$ (in the paper, they use cost functions instead of reward functions)
- Intuitively, we have a parameterized policy (the "model") giving us a distribution over actions
- We don't have the correct action (the "label"), so we just use the reward at the end as our label
- We can do better. How do we do credit assignment?
  - Baseline (roughly encourage half of the actions, not just all of them)
  - Discounted future reward (actions mostly affect the near-term future), etc.

Kakade & Langford: conservative policy iteration

A useful identity for $\eta(\tilde\pi)$, the expected discounted return of a new policy $\tilde\pi$:

$$\eta(\tilde\pi) = \eta(\pi) + \mathbb{E}_{\tilde\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t A^\pi(s_t, a_t)\right] = \eta(\pi) + \sum_s \rho_{\tilde\pi}(s) \sum_a \tilde\pi(a \mid s)\, A^\pi(s, a)$$

Intuitively: the expected return of a new policy is the expected return of the old policy, plus how much better the new policy is at each state.

Local approximation: swap $\rho_{\tilde\pi}$ out for $\rho_\pi$, since we only have the state visitation frequency of the old policy, not the new policy:

$$L_\pi(\tilde\pi) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde\pi(a \mid s)\, A^\pi(s, a)$$

Kakade & Langford proved that optimizing this local approximation is safe for small step sizes, but only for mixture policies.
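What makes $L_\pi$ a useful surrogate (a fact from the TRPO paper, not spelled out on the slide) is that it matches $\eta$ to first order at the current policy: for a parameterized policy $\pi_\theta$,

$$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta = \theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta = \theta_0},$$

so a sufficiently small step that improves $L_\pi$ also improves $\eta$.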

Natural policy gradients

- In this paper, they prove that $\eta(\tilde\pi) \le L_\pi(\tilde\pi) + C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde\pi)$, where $C$ is a constant depending on $\gamma$
- Intuitively: we optimize the approximation, but regularize with the KL divergence between the old and new policy
- The resulting algorithm is called the natural policy gradient
- Problem: choosing the hyperparameter $C$ is difficult
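For reference, the penalty coefficient in the paper's bound (stated here from the TRPO paper as background; treat the exact constant as such, with $A^\pi$ the advantage of the current policy) is

$$C = \frac{4 \epsilon \gamma}{(1-\gamma)^2}, \qquad \epsilon = \max_{s,a} \big|A^\pi(s,a)\big|,$$

which blows up as $\gamma \to 1$; using $C$ directly as a penalty coefficient therefore forces very small steps, which is one reason choosing it in practice is hard.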

Overview of TRPO

- Instead of adding the KL divergence as a cost, simply use it as an optimization constraint
- TRPO algorithm: minimize $L_\pi(\tilde\pi)$ subject to the constraint $D_{\mathrm{KL}}^{\max}(\pi, \tilde\pi) \le \delta$, for some easily-picked hyperparameter $\delta$
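In the paper's practical version (background, not on the slide), the max KL is replaced by the mean KL under the old policy's state distribution, and the objective is written with an importance-sampling ratio over actions; with the slides' cost convention this reads

$$\min_\theta\; \mathbb{E}_{s \sim \rho_{\pi_{\text{old}}},\, a \sim \pi_{\text{old}}}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)}\, A^{\pi_{\text{old}}}(s, a)\right] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho_{\pi_{\text{old}}}}\!\Big[D_{\mathrm{KL}}\big(\pi_{\text{old}}(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\big)\Big] \le \delta.$$

(With rewards rather than costs, the min becomes a max.)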

Practical considerations

- How do we sample trajectories?
  - Single-path: simply run each sampled trajectory to completion
  - Vine: for each sampled trajectory, pick random states along the trajectory and perform small rollouts from them
- How do we compute the gradient step?
  - Use the conjugate gradient algorithm followed by a line search
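A minimal sketch of these two numerical pieces, assuming we are handed a function `hvp(v)` that returns the Fisher/KL-Hessian-vector product $Hv$ and a gradient vector `g` (those are assumptions; the names, iteration counts, and acceptance threshold are illustrative):

```python
import numpy as np

def conjugate_gradient(hvp, g, n_iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - H x (x starts at 0)
    p = g.copy()          # search direction
    rs_old = r @ r
    for _ in range(n_iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def backtracking_line_search(f, x, full_step, expected_improve, max_backtracks=10):
    """Shrink the step until the surrogate objective f actually improves (f is to be increased)."""
    f0 = f(x)
    for i in range(max_backtracks):
        frac = 0.5 ** i
        x_new = x + frac * full_step
        if f(x_new) - f0 > 0.1 * frac * expected_improve:
            return x_new
    return x   # no improvement found; keep the old parameters
```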

Algorithm

while gradient not converged do
    Collect trajectories (either single-path or vine)
    Estimate the advantage function
    Compute the policy gradient estimator
    Solve the quadratic approximation to $L(\pi_\theta)$ using CG
    Rescale using a line search
    Apply the update
end while
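The rescaling step works out as follows (a standard TRPO implementation detail, not spelled out on the slide): approximate the KL constraint by its quadratic form $\tfrac{1}{2}\,\Delta\theta^\top H \Delta\theta \le \delta$ with $H$ the Fisher matrix, let $s \approx H^{-1} g$ be the CG solution, and take the largest step along $s$ that satisfies the constraint,

$$\Delta\theta = \beta\, s, \qquad \beta = \sqrt{\frac{2\delta}{s^\top H s}},$$

after which the line search backs off from $\beta$ until the surrogate objective actually improves.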

Experiments: MuJoCo robotic locomotion

- Link to demonstration
- Same $\delta$ hyperparameter across experiments

Experiments: MuJoCo learning curves

- Link to demonstration
- Same $\delta$ hyperparameter across experiments

Experiments: Atari

- Not always better than previous techniques, but consistently decent
- Very little problem-specific engineering

Takeaway

- TRPO is a good default policy gradient technique which scales well and requires minimal hyperparameter tuning
- The key idea: just put a KL constraint on the local approximation to the objective

References

- R. S. Sutton. Introduction to Reinforcement Learning
- Kakade and Langford. Approximately Optimal Approximate Reinforcement Learning
- Schulman et al. Trust Region Policy Optimization
- Schulman, Levine, Finn. Deep Reinforcement Learning course: link
- Andrej Karpathy. Deep Reinforcement Learning: Pong from Pixels: link
- Trust Region Policy Optimization summary: link

Thanks!

Link to presentation: yixinlin.net/trpo