Off-Policy Actor-Critic


Off-Policy Actor-Critic. Ludovic Trottier, DAMAS Laboratory, Laval University. July 25, 2012.

Table of Contents: 1. Reinforcement Learning Theory; 2. Off-PAC Algorithm; 3. Experiments; 4. Discussion; 5. References.

Section 1: Reinforcement Learning Theory

On- vs Off-Policy. On-policy learning learns the value function of the policy that is used to make decisions (SARSA): Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)]. Off-policy learning learns the value function of a policy different from the one used to make decisions (Q-learning): Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)].
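
As a minimal tabular sketch of the two update rules (the dictionary-based Q-table and the example calls are illustrative assumptions, not part of the slides):

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action a_next the behaviour policy will actually take.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy action, whatever is executed next.
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)                              # tabular action-value estimates
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[-1, 0, 1])
```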

Actor-Critic vs Off-Policy Actor-Critic (Off-PAC). In the standard Actor-Critic, the critic evaluates the behavior policy, the actor improves the behavior policy, and the data are generated with the behavior policy. In Off-PAC, the critic evaluates the target policy and the actor improves the target policy, while the data are still generated with the behavior policy. Figure: Actor-Critic Method (Ref. [3]).

Value Function Approximation. Because of the large state space, the value function needs to be approximated: Q(s, a) = Σ_{s'∈S} Pr(s' | s, a) [R(s, a, s') + γ V(s')]. Off-policy learning combined with value function approximation can lead to fluctuation (instability). Based on the analysis of [2], it is better to use a linear approximation, V̂(s) = v^T x_s with v_i, x_{s,i} ∈ ℝ, e.g. binary features x_s obtained by tile coding.

Mountain Car Tile Coding Simulation. S = [−1.5, 0.5] × [−0.07, 0.07] (position × velocity), A = {−1, 0, 1}.
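
A rough sketch of how binary tile-coding features over the two-dimensional Mountain Car state could be constructed (the number of tilings, grid resolution, and offset scheme are illustrative assumptions):

```python
import numpy as np

def tile_features(pos, vel, n_tilings=8, n_bins=10,
                  pos_range=(-1.5, 0.5), vel_range=(-0.07, 0.07)):
    """Sparse binary feature vector x_s for Mountain Car: one active tile per tiling."""
    x = np.zeros(n_tilings * n_bins * n_bins)
    for t in range(n_tilings):
        offset = t / n_tilings                     # shift each tiling by a fraction of a tile
        p = (pos - pos_range[0]) / (pos_range[1] - pos_range[0]) * n_bins + offset
        v = (vel - vel_range[0]) / (vel_range[1] - vel_range[0]) * n_bins + offset
        pi_ = min(max(int(p), 0), n_bins - 1)
        vi_ = min(max(int(v), 0), n_bins - 1)
        x[t * n_bins * n_bins + pi_ * n_bins + vi_] = 1.0
    return x

v = np.zeros(8 * 10 * 10)            # weights of the linear critic
x_s = tile_features(-0.5, 0.0)       # features of a state near the valley bottom
value_estimate = v @ x_s             # V_hat(s) = v^T x_s
```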

Stochastic Gradient Descent. Given an example (x_t, y_t) and an objective function Q(w, x, y): normal (batch) gradient descent updates w_{t+1} = w_t − α (1/n) Σ_{i=1}^n ∂Q(w, x_i, y_i)/∂w, while stochastic gradient descent updates on a single example, w_{t+1} = w_t − α ∂Q(w, x_t, y_t)/∂w.
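
A small sketch contrasting the two updates on the least-squares objective Q(w, x, y) = ½(wᵀx − y)² (the synthetic data, step size, and iteration counts are made up for illustration):

```python
import numpy as np

def grad(w, x, y):
    # Gradient of Q(w, x, y) = 0.5 * (w @ x - y)**2 with respect to w.
    return (w @ x - y) * x

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

alpha = 0.01
w_batch = np.zeros(3)
w_sgd = np.zeros(3)

for _ in range(50):
    # Normal (batch) gradient descent: average the gradient over all examples.
    w_batch -= alpha * np.mean([grad(w_batch, x_i, y_i) for x_i, y_i in zip(X, y)], axis=0)

for x_i, y_i in zip(X, y):
    # Stochastic gradient descent: one noisy step per example.
    w_sgd -= alpha * grad(w_sgd, x_i, y_i)
```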

Section 2: Off-PAC Algorithm

Motivations. Three problems of off-policy action-value methods: (1) the target policy is deterministic; (2) policy improvement is problematic for large action spaces; (3) small changes in the policy evaluation can cause large changes in the policy improvement.

Problem. The policy is a function parametrized by a weight vector u: π_u : A × S → [0, 1]. Objective function to maximize [4]: J(u) = Σ_{s∈S} [lim_{t→∞} Pr(s_t = s | s_0, b)] V(s), with V(s) = Σ_{a∈A} π(a|s) Q(s, a). Figure: Chain.

Description of the Critic. The critic incrementally evaluates an approximation V̂(s) = v^T x_s (v_i, x_{s,i} ∈ ℝ) of the true value function V(s) using gradient descent. Objective function to minimize: MSPBE(v) = ‖V̂ − Π T V̂‖²_D.
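
For intuition, here is a rough sketch of a gradient-TD style update (in the spirit of TDC/GTD2) that drives a linear critic toward minimizing the MSPBE; it omits eligibility traces and importance weighting, and the auxiliary weights w and step sizes are the usual gradient-TD ingredients rather than specifics from these slides:

```python
import numpy as np

def gradient_td_step(v, w, x, r, x_next, alpha_v=0.05, alpha_w=0.1, gamma=0.99):
    """One TDC-style update of the critic weights v toward minimizing the MSPBE.

    v       -- primary weights of the linear value function V_hat(s) = v @ x_s
    w       -- auxiliary weights estimating the expected TD error given the features
    x, x_next -- feature vectors of the current and next state
    """
    delta = r + gamma * (v @ x_next) - (v @ x)                 # TD error
    v = v + alpha_v * (delta * x - gamma * (w @ x) * x_next)   # TD update with gradient correction
    w = w + alpha_w * (delta - (w @ x)) * x                    # track the expected TD error
    return v, w
```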

MSPBE Objective Function. Given a vector function f, define the weighted norm ‖f‖²_D = Σ_{s∈S} [lim_{t→∞} Pr(s_t = s | s_0, b)] f²(s). Define V^π(s) = E[Σ_{t=0}^∞ γ^t R_{t+1} | S_0 = s, π]; then V^π(s) = Σ_a π(a|s) Σ_{s'} Pr(s' | s, a)[R(s, a) + γ V^π(s')], which in vector form gives V^π = E[R] + γ P V^π = T V^π, where T is the Bellman operator.
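
As a tiny numerical illustration of the Bellman operator T V = E[R] + γ P V, here is a made-up two-state chain (the transition matrix and rewards are invented purely for illustration):

```python
import numpy as np

gamma = 0.99
P = np.array([[0.9, 0.1],      # state-to-state transition probabilities under the policy
              [0.2, 0.8]])
R = np.array([1.0, 0.0])       # expected one-step reward in each state

def bellman_T(V):
    # Apply the Bellman operator: (T V)(s) = E[R | s] + gamma * sum_s' P(s'|s) V(s')
    return R + gamma * P @ V

# Repeated application of T converges to the fixed point V_pi = T V_pi.
V = np.zeros(2)
for _ in range(2000):
    V = bellman_T(V)

V_exact = np.linalg.solve(np.eye(2) - gamma * P, R)   # closed-form fixed point for comparison
```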

MSPBE Objective Function. For a vector function f, we define the linear projection as Π f := f_θ = θ^T φ, where θ = arg min_θ ‖f_θ − f‖²_D = arg min_θ Σ_{s∈S} [lim_{t→∞} Pr(s_t = s | s_0, b)] (f_θ − f)²(s).
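
In the linear case this projection has the closed form Π = Φ (Φᵀ D Φ)⁻¹ Φᵀ D, where Φ stacks the state feature vectors and D is the diagonal matrix of state weights; a small numerical sketch with invented numbers:

```python
import numpy as np

Phi = np.array([[1.0, 0.0],        # feature vectors of 3 states (rows), 2 features
                [1.0, 1.0],
                [0.0, 1.0]])
d = np.array([0.5, 0.3, 0.2])      # limiting state distribution under the behaviour policy
D = np.diag(d)

# Weighted least-squares projection onto the span of the features.
Pi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D

f = np.array([2.0, 1.0, -1.0])     # an arbitrary vector function over states
f_proj = Pi @ f                    # closest representable function in the D-weighted norm
```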

Geometrical Interpretation. Figure from Ref. [2].

Example: Iteration 0, Iteration 1, Iteration 2 (sequence of figures).

Eligibility Traces. The n-step returns: R_t^{(1)} = r_{t+1} + γ V_t(s_{t+1}); R_t^{(2)} = r_{t+1} + γ r_{t+2} + γ² V_t(s_{t+2}); ...; R_t^{(n)} = r_{t+1} + γ r_{t+2} + ... + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n}). Ref. [3]

Eligibility Traces. The return R_t can be rewritten as any average of n-step returns, e.g. R_t^{avg} = ½ R_t^{(2)} + ½ R_t^{(4)}. Ref. [3]

Eligibility Traces. The λ-return used with TD methods is another way to define such an average: R_t^λ = (1 − λ) Σ_{n=1}^∞ λ^{n−1} R_t^{(n)}, with 0 ≤ λ ≤ 1. Ref. [3]
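
A short sketch computing a truncated λ-return from a recorded episode (the truncation at the episode end and the toy numbers are illustrative assumptions):

```python
import numpy as np

def n_step_return(rewards, values, t, n, gamma=0.99):
    """R_t^(n): n discounted rewards, then bootstrap from the value estimate."""
    T = len(rewards)
    horizon = min(n, T - t)
    ret = sum(gamma**k * rewards[t + k] for k in range(horizon))
    if t + horizon < len(values):
        ret += gamma**horizon * values[t + horizon]
    return ret

def lambda_return(rewards, values, t, lam=0.9, gamma=0.99):
    """R_t^lambda = (1 - lam) * sum_n lam^(n-1) R_t^(n), truncated at the episode end."""
    T = len(rewards)
    total, weight_left = 0.0, 1.0
    for n in range(1, T - t):
        w = (1 - lam) * lam**(n - 1)
        total += w * n_step_return(rewards, values, t, n, gamma)
        weight_left -= w
    # Give all remaining weight to the full return up to the end of the episode.
    total += weight_left * n_step_return(rewards, values, t, T - t, gamma)
    return total

rewards = [0.0, 0.0, 1.0]          # toy trajectory
values = [0.1, 0.2, 0.5, 0.0]      # V_t(s_t) estimates, one per visited state
print(lambda_return(rewards, values, t=0))
```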

Off-Policy Policy-Gradient Theorem. Objective function to maximize: J(u) = Σ_{s∈S} [lim_{t→∞} Pr(s_t = s | s_0, b)] V(s). The gradient ∇_u J is too difficult to estimate, so we approximate it with g(u): ∇_u J(u) ≈ g(u) = Σ_{s∈S} [lim_{t→∞} Pr(s_t = s | s_0, b)] Σ_{a∈A} ∇_u π(a|s) Q(s, a) (product rule, then dropping the term involving the gradient of Q).

Description of the Actor. The actor applies stochastic gradient ascent to g(u): g(u) = E[ Σ_a ∇_u π(a|s) Q(s, a) ], with s distributed as lim_{t→∞} Pr(s_t = · | s_0, b). An arbitrary function of the state can be added to the equation because Σ_a ∇_u π(a|s) = 0 [4]; intuitively, Σ_a ∇_u π(a|s) is orthogonal to any quantity that depends on the state only.

Description of the Actor. Rewriting the terms: g(u) = E[ ρ(s, a) ψ(s, a) Q(s, a) ], with s ~ lim_{t→∞} Pr(s_t = · | s_0, b) and a ~ b(· | s), where ρ(s, a) = π(a|s) / b(a|s) and ψ(s, a) = ∇_u π(a|s) / π(a|s). We further approximate Q(s, a) with the off-policy λ-return: R_t^λ = r_{t+1} + (1 − λ) γ V̂(s_{t+1}) + λ γ ρ(s_{t+1}, a_{t+1}) R_{t+1}^λ. Forward view / backward view (eligibility traces).
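
A rough sketch of what an actor step built from ρ and ψ could look like, using a linear softmax policy and a one-step TD error from the critic in place of the full off-policy λ-return; this simplification, the feature layout, and the step size are assumptions for illustration, not the exact Off-PAC update:

```python
import numpy as np

def softmax_policy(u, phi_sa):
    """pi(a|s) for a linear Gibbs policy; phi_sa is an (n_actions, n_features) matrix."""
    prefs = phi_sa @ u
    prefs -= prefs.max()                         # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def actor_step(u, v, phi_sa, a, b_prob, x, r, x_next, alpha_u=0.01, gamma=0.99):
    """One off-policy actor update from a transition generated by the behaviour policy b.

    rho = pi(a|s) / b(a|s) corrects for acting with b instead of pi.
    psi = grad_u log pi(a|s), the compatible direction for a softmax policy.
    """
    pi = softmax_policy(u, phi_sa)
    rho = pi[a] / b_prob                         # importance-sampling ratio
    psi = phi_sa[a] - pi @ phi_sa                # gradient of log pi(a|s) for a softmax policy
    delta = r + gamma * (v @ x_next) - (v @ x)   # one-step TD error from the critic
    return u + alpha_u * rho * delta * psi
```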

Section 3: Experiments

Parameters. Four methods compared: Q(λ), Greedy-GQ, Softmax-GQ, Off-PAC. Behavior policy b(· | s): uniform distribution; γ = 0.99; number of steps = 5000, with performance testing every 20 episodes. Gibbs distribution for the target policy: π(a|s) = exp{u^T φ_{s,a}} / Σ_{a'} exp{u^T φ_{s,a'}}. 10 × 10 tile-coding feature extraction. Parameter sweep.

Results. Off-PAC is successful (results figure from Ref. [1]).

Section 4: Discussion

Discussion. Why use approximation (of both the value function and the policy) on problems with small state/action spaces? Is the linear approximation of the value function too naive? The uniform behavior policy may cause problems for Q(λ). Is a stochastic policy really necessary? What is the computational complexity? Is the Bellman operator T defined with a deterministic policy but applied to a stochastic one?

Conclusion. 1. Off-policy vs on-policy methods; 2. Actor-Critic for the on- and off-policy settings; 3. Description of the critic (value-function approximation, eligibility traces, stochastic gradient descent, GTD(λ)); 4. Description of the actor (gradient approximation, forward/backward view, objective function); 5. Results; 6. (Possible) problems.

Section 5: References

References
[1] Thomas Degris, Martha White, and Richard S. Sutton. Off-policy actor-critic. CoRR, abs/1205.4839, 2012.
[2] Hamid Reza Maei. Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta, Fall 2011.
[3] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
[4] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pages 1057-1063, 1999.