Reinforcement Learning via Policy Optimization

Hanxiao Liu
November 22, 2017

Reinforcement Learning

Policy: $a \sim \pi(s)$

Example - Mario

Example - ChatBot

Applications - Video Games

Applications - Robotics

Applications - Combinatorial Problems

Applications - Bioinformatics

Supervised Learning vs RL

Supervised setup: $s_t \sim P(\cdot)$, $a_t = \pi(s_t)$, immediate reward.
Reinforcement setup: $s_t \sim P(\cdot \mid s_{t-1}, a_{t-1})$, $a_t \sim \pi(s_t)$, delayed reward.

Both aim to maximize the total reward.

Markov Decision Process

What is the underlying assumption?

Landscape

Monte Carlo

Return: $R(\tau) = \sum_{t=1}^{T} R(s_t, a_t)$ over a trajectory $\tau = s_0, a_0, \ldots, s_T, a_T$.

Goal: find a good $\pi$ such that $R$ is maximized.

Policy evaluation via MC:
1. Sample a trajectory $\tau$ given $\pi$.
2. Record $R(\tau)$.
Repeat the above many times and take the average.
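
A minimal sketch of this MC evaluation loop, assuming a hypothetical environment interface (env.reset(), env.step()) and a policy callable; none of these names come from the slides:

import numpy as np

def rollout_return(env, policy, max_steps=1000):
    """Sample one trajectory under `policy` and return its total reward R(tau)."""
    s = env.reset()
    total = 0.0
    for _ in range(max_steps):
        a = policy(s)                 # a ~ pi(s)
        s, r, done = env.step(a)      # assumed to return (next_state, reward, done)
        total += r
        if done:
            break
    return total

def mc_policy_evaluation(env, policy, num_episodes=100):
    """Estimate E[R(tau)] by averaging the returns of sampled trajectories."""
    return float(np.mean([rollout_return(env, policy) for _ in range(num_episodes)]))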

Policy Gradient

Let's parametrize $\pi$ using a neural network: $a \sim \pi_\theta(s)$.

Expected return:
$$\max_\theta U(\theta) = \max_\theta \sum_\tau P(\tau; \pi_\theta)\, R(\tau) \quad (1)$$

Heuristic: raise the probability of good trajectories.
1. Sample a trajectory $\tau$ under $\pi_\theta$.
2. Update $\theta$ using the gradient $\nabla_\theta \log P(\tau; \pi_\theta)\, R(\tau)$.

Policy Gradient

The log-derivative trick:
$$\nabla_\theta U(\theta) = \sum_\tau \nabla_\theta P(\tau; \pi_\theta)\, R(\tau) \quad (2)$$
$$= \sum_\tau \frac{P(\tau; \pi_\theta)}{P(\tau; \pi_\theta)}\, \nabla_\theta P(\tau; \pi_\theta)\, R(\tau) \quad (3)$$
$$= \sum_\tau P(\tau; \pi_\theta)\, \nabla_\theta \log P(\tau; \pi_\theta)\, R(\tau) \quad (4)$$
$$= \mathbb{E}_{\tau \sim P(\cdot;\, \pi_\theta)}\left[ \nabla_\theta \log P(\tau; \pi_\theta)\, R(\tau) \right] \quad (5)$$
$$\approx \nabla_\theta \log P(\tau; \pi_\theta)\, R(\tau), \quad \tau \sim P(\cdot;\, \pi_\theta) \quad (6)$$
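
A tiny numerical illustration of the score-function estimator in (5)-(6), written for a single-state softmax policy so that $\nabla_\theta \log \pi_\theta(a)$ has a closed form; the reward function stands in for $R(\tau)$, and every name here is an illustrative assumption:

import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def reinforce_gradient(theta, reward_fn, num_samples=1000):
    """Monte Carlo estimate of grad_theta E[R] via grad log pi(a) * R(a)."""
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        probs = softmax(theta)
        a = rng.choice(len(theta), p=probs)
        grad_log_pi = -probs          # for softmax: grad log pi(a) = onehot(a) - probs
        grad_log_pi[a] += 1.0
        grad += grad_log_pi * reward_fn(a)
    return grad / num_samples

# Action 2 has the highest reward, so its logit receives the largest push.
print(reinforce_gradient(np.zeros(3), reward_fn=lambda a: [0.0, 1.0, 2.0][a]))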

Policy Gradient

$$\nabla_\theta U(\theta) \approx \nabla_\theta \log P(\tau; \pi_\theta)\, R(\tau), \quad \tau \sim P(\cdot;\, \pi_\theta) \quad (7)$$

This is analogous to SGD (so variance reduction is important), but the data distribution here is a moving target (so we may want a trust region).

Policy Gradient

We can subtract any constant from the reward:
$$\sum_\tau \nabla_\theta P(\tau; \pi_\theta)\,(R(\tau) - b) \quad (8)$$
$$= \sum_\tau \nabla_\theta P(\tau; \pi_\theta)\, R(\tau) - \sum_\tau \nabla_\theta P(\tau; \pi_\theta)\, b \quad (9)$$
$$= \sum_\tau \nabla_\theta P(\tau; \pi_\theta)\, R(\tau) - b\, \nabla_\theta \sum_\tau P(\tau; \pi_\theta) \quad (10)$$
$$= \sum_\tau \nabla_\theta P(\tau; \pi_\theta)\, R(\tau) \quad (11)$$

The second term vanishes because $\sum_\tau P(\tau; \pi_\theta) = 1$ for every $\theta$.

Policy Gradient

The variance is magnified by $R(\tau)$. More fine-grained baseline?
$$\nabla_\theta U(\theta) \approx \nabla_\theta \log P(\tau; \pi_\theta)\, R(\tau) \quad (12)$$
$$= \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=0}^{T} R(s_{t'}, a_{t'}) \quad (13)$$
$$\to \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \underbrace{\sum_{t'=t}^{T} R(s_{t'}, a_{t'})}_{R_t} \quad (14)$$
$$\to \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,(R_t - b_t) \quad (15)$$

Policy Gradient

$$\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,(R_t - b_t) \quad (16)$$

Good choice of $b_t$?
$$b_t = \arg\min_b\ \mathbb{E}\,(R_t - b)^2 \quad (17)$$
$$\approx V_\phi(s_t) \quad (18)$$
$$\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \underbrace{(R_t - V_\phi(s_t))}_{A_t} \quad (19)$$

Promote actions that lead to positive advantage.

Policy Gradient

$$\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t \quad (20)$$

Sample a trajectory $\tau$.
For each step in $\tau$:
  $R_t = \sum_{t'=t}^{T} r_{t'}$
  $A_t = R_t - V_\phi(s_t)$
Take a gradient step on $\nabla_\phi \sum_{t=0}^{T} \| V_\phi(s_t) - R_t \|^2$.
Take a gradient step on $\nabla_\theta \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t)\, A_t$.
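
A minimal sketch of the per-trajectory bookkeeping above (reward-to-go and baselined advantages), assuming rewards and value predictions for one sampled trajectory are given as arrays; the two gradient steps themselves would be taken with whatever autodiff framework hosts $\pi_\theta$ and $V_\phi$:

import numpy as np

def reward_to_go(rewards):
    """R_t = sum_{t' >= t} r_{t'} for every step of one trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        returns[t] = running
    return returns

def advantages(rewards, values):
    """A_t = R_t - V_phi(s_t), using the learned value function as the baseline."""
    return reward_to_go(rewards) - np.asarray(values)

# Example: one 4-step trajectory with (imperfect) value predictions.
r = [1.0, 0.0, 0.5, 1.0]
v = [2.0, 1.2, 1.0, 0.4]
print(reward_to_go(r))    # [2.5 1.5 1.5 1. ]
print(advantages(r, v))   # [0.5 0.3 0.5 0.6] (up to float rounding)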

TRPO

In PG, our optimization objective is a moving target:
- it is defined from trajectory samples drawn under the current policy
- each sample is only used once

Can we build an objective that reuses all historical data?

TRPO

Policy gradient:
$$\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t \quad (21)$$

Objective (on-policy):
$$\mathbb{E}_\pi\left[ A(s, a) \right] \quad (22)$$

Objective (off-policy), via importance sampling:
$$\mathbb{E}_{\pi_{\text{old}}}\left[ \frac{\pi(a \mid s)}{\pi_{\text{old}}(a \mid s)}\, A_{\text{old}}(s, a) \right] \quad (23)$$
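
The importance-sampling step behind (23) is just reweighting an expectation by a likelihood ratio; a one-line check for a discrete action space (assuming $\pi_{\text{old}}(a \mid s) > 0$ wherever $\pi(a \mid s) > 0$):

$$\mathbb{E}_{a \sim \pi}\left[ A_{\text{old}}(s, a) \right] = \sum_a \pi(a \mid s)\, A_{\text{old}}(s, a) = \sum_a \pi_{\text{old}}(a \mid s)\, \frac{\pi(a \mid s)}{\pi_{\text{old}}(a \mid s)}\, A_{\text{old}}(s, a) = \mathbb{E}_{a \sim \pi_{\text{old}}}\left[ \frac{\pi(a \mid s)}{\pi_{\text{old}}(a \mid s)}\, A_{\text{old}}(s, a) \right]$$

so expectations under the new policy $\pi$ can be estimated from samples collected under $\pi_{\text{old}}$.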

TRPO

$$\max_\pi\ \mathbb{E}_{\pi_{\text{old}}}\left[ \frac{\pi(a \mid s)}{\pi_{\text{old}}(a \mid s)}\, A_{\text{old}}(s, a) \right] \approx \sum_{n=1}^{N} \frac{\pi(a_n \mid s_n)}{\pi_{\text{old}}(a_n \mid s_n)}\, A_n \quad (24)$$
$$\text{s.t.}\quad \mathrm{KL}(\pi_{\text{old}}, \pi) \le \delta \quad (25)$$

A quadratic approximation is used for the KL constraint. Use conjugate gradient to obtain the natural gradient direction $F^{-1} g$.
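
A minimal sketch of the conjugate gradient routine used to approximate $F^{-1} g$ from Fisher-vector products alone (so $F$ is never formed explicitly); `fisher_vector_product` is an assumed callable returning $Fv$:

import numpy as np

def conjugate_gradient(fisher_vector_product, g, iters=10, tol=1e-10):
    """Approximately solve F x = g given only the map v -> F v."""
    x = np.zeros_like(g)
    r = g.copy()                      # residual g - F x (x starts at 0)
    p = g.copy()                      # search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fisher_vector_product(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x                          # natural gradient direction ~ F^{-1} g

# Sanity check on a tiny explicit Fisher matrix.
F = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, 2.0])
print(conjugate_gradient(lambda v: F @ v, g))   # close to np.linalg.solve(F, g)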

PPO

$$\max_\pi\ \sum_{n=1}^{N} \frac{\pi(a_n \mid s_n)}{\pi_{\text{old}}(a_n \mid s_n)}\, A_n - \beta\, \mathrm{KL}(\pi_{\text{old}}, \pi) \quad (26)$$

A fixed $\beta$ does not work well:
- shrink $\beta$ when $\mathrm{KL}(\pi_{\text{old}}, \pi)$ is small
- increase $\beta$ otherwise
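
A minimal sketch of that adaptive penalty, in the spirit of the rule above; the target KL and the factor of 2 are illustrative assumptions, not values from the slides:

def update_beta(beta, kl, kl_target=0.01):
    """Shrink the KL penalty when the policy barely moved, grow it when it moved too much."""
    if kl < kl_target / 1.5:
        return beta / 2.0
    if kl > kl_target * 1.5:
        return beta * 2.0
    return beta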

PPO

Alternative ways to penalize large changes in $\pi$? Modify the $A$'s:
$$\frac{\pi(a_n \mid s_n)}{\pi_{\text{old}}(a_n \mid s_n)}\, A_n = r_n(\theta)\, A_n \quad (27)$$
$$\min\left( r_n(\theta)\, A_n,\ \mathrm{clip}(r_n(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\, A_n \right) \quad (28)$$

Be pessimistic whenever a large change in $\pi$ would lead to a better objective; otherwise this is the same as the original objective.
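
A minimal numpy sketch of the clipped objective (28), evaluated per sample; the function name is an assumption:

import numpy as np

def ppo_clipped_objective(ratio, adv, eps=0.2):
    """Per-sample objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped)

# A ratio of 1.5 with positive advantage is capped at 1.2 * A (no incentive to move further),
# while the same ratio with negative advantage is not capped (pessimism).
print(ppo_clipped_objective(np.array([1.5, 1.5]), np.array([1.0, -1.0])))   # [ 1.2 -1.5]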

Advantage Estimation

Currently:
$$\hat{A}_t = r_t + r_{t+1} + r_{t+2} + \cdots - V(s_t) \quad (29)$$

More bias, less variance (discounting the long-term effect):
$$\hat{A}_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots - V(s_t) \quad (30)$$

Bootstrapping:
$$\hat{A}_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots - V(s_t) \quad (31)$$
$$= r_t + \gamma (r_{t+1} + \gamma r_{t+2} + \cdots) - V(s_t) \quad (32)$$
$$\approx r_t + \gamma V(s_{t+1}) - V(s_t) \quad (33)$$

We may unroll for more than one step.
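
A minimal sketch of the one-step bootstrapped estimate (33) together with an n-step variant that unrolls several rewards before bootstrapping; it assumes value predictions are available for every visited state plus the state after the last step:

import numpy as np

def td_advantage(rewards, values, gamma=0.99):
    """One-step bootstrap: A_t ~ r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` must have length len(rewards) + 1 so that V(s_{T+1}) is available."""
    rewards = np.asarray(rewards)
    values = np.asarray(values)
    return rewards + gamma * values[1:] - values[:-1]

def n_step_advantage(rewards, values, gamma=0.99, n=3):
    """Unroll n rewards, then bootstrap from the value function."""
    T = len(rewards)
    adv = np.empty(T)
    for t in range(T):
        horizon = min(t + n, T)
        ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
        ret += gamma ** (horizon - t) * values[horizon]
        adv[t] = ret - values[t]
    return adv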

A2C

Advantage Actor-Critic:
Sample a trajectory $\tau$.
For each step in $\tau$:
  $\hat{R}_t = r_t + \gamma V(s_{t+1})$
  $\hat{A}_t = \hat{R}_t - V(s_t)$
Take a gradient step using
$$\nabla_{\theta, \phi} \sum_{t=0}^{T} \left[ \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t + \| V_\phi(s_t) - \hat{R}_t \|^2 \right]$$
(ascending in $\theta$ on the policy term, descending in $\phi$ on the value term).

We may use TRPO or PPO for the policy optimization step.
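
A minimal sketch of the per-trajectory target computation and a combined scalar loss (with the policy term negated so that minimizing the loss ascends the expected return); all names are assumptions, and the gradients would come from an autodiff framework:

import numpy as np

def a2c_targets(rewards, values, gamma=0.99):
    """R_hat_t = r_t + gamma * V(s_{t+1}); A_hat_t = R_hat_t - V(s_t).

    `values` has length len(rewards) + 1."""
    rewards = np.asarray(rewards)
    values = np.asarray(values)
    r_hat = rewards + gamma * values[1:]
    a_hat = r_hat - values[:-1]
    return r_hat, a_hat

def a2c_loss(log_probs, values_t, r_hat, a_hat):
    """Scalar loss for one trajectory; `values_t` are the V(s_t) predictions (same length as r_hat)."""
    policy_loss = -np.sum(np.asarray(log_probs) * a_hat)       # ascend log pi * A_hat
    value_loss = np.sum((np.asarray(values_t) - r_hat) ** 2)   # regress V toward R_hat
    return policy_loss + value_loss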

A3C

Variance reduction via multiple actors.

Figure: Asynchronous Advantage Actor-Critic.