CS294-40 Learning for Robotics and Control    Lecture 16 - 10/20/2008
Lecturer: Pieter Abbeel    Policy Gradient    Scribe: Jan Biermeyer

1 Recap

Recall:

    U(θ) = E[ Σ_{t=0}^{H} R(s_t, a_t); π_θ ] = E[ R(τ); π_θ ]    (1)

Here τ is a sample path of states and actions, s_0, a_0, ..., s_H, a_H.

Example policy π_θ: a ∈ {0, 1},

    π_θ(a = 0 | s_t) = e^{θ^T φ(s_t)} / ( 1 + e^{θ^T φ(s_t)} )    (2)

2 Policy Search

    max_θ U(θ) = max_θ E[ R(τ); π_θ ]    (3)
               = max_θ Σ_τ P(τ; θ) R(τ)    (4)

Taking the gradient w.r.t. θ gives:

    ∇_θ U(θ) = ∇_θ Σ_τ P(τ; θ) R(τ)    (5)
             = Σ_τ ∇_θ P(τ; θ) R(τ)    (6)
             = Σ_τ ( P(τ; θ) / P(τ; θ) ) ∇_θ P(τ; θ) R(τ)    (7)
             = Σ_τ P(τ; θ) ( ∇_θ P(τ; θ) / P(τ; θ) ) R(τ)    (8)
             = Σ_τ P(τ; θ) ∇_θ log P(τ; θ) R(τ)    (9)

Approximate with the empirical estimate for m sample paths under policy π_θ:

    ĝ = (1/m) Σ_{i=1}^{m} ∇_θ log P(τ^(i); θ) R(τ^(i))    (10)
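Eq. (10) can be computed purely from sampled paths. As a minimal illustrative sketch (not part of the original notes), the following NumPy function evaluates the empirical estimate, assuming we are already given each path's return R(τ^(i)) and its score vector ∇_θ log P(τ^(i); θ); how those score vectors are obtained without a dynamics model is derived next.

```python
import numpy as np

def policy_gradient_estimate(grad_log_probs, returns):
    """Likelihood-ratio estimate of grad_theta U(theta), eq. (10):
    g_hat = (1/m) * sum_i grad_theta log P(tau^(i); theta) * R(tau^(i)).

    grad_log_probs: (m, d) array, one score vector per sampled path
    returns:        (m,)  array, the return R(tau^(i)) of each path
    """
    grad_log_probs = np.asarray(grad_log_probs, dtype=float)
    returns = np.asarray(returns, dtype=float)
    # Weight each path's score vector by its return, then average over the m paths.
    return (grad_log_probs * returns[:, None]).mean(axis=0)

# Tiny usage example with made-up numbers (m = 3 paths, d = 2 parameters):
if __name__ == "__main__":
    grads = np.array([[0.5, -1.0], [0.1, 0.3], [-0.2, 0.8]])
    Rs = np.array([1.0, 0.0, 2.0])
    print(policy_gradient_estimate(grads, Rs))
```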

    ∇_θ log P(τ^(i); θ) = ∇_θ log [ Π_{t=0}^{H} P(s^(i)_{t+1} | s^(i)_t, a^(i)_t) · π_θ(a^(i)_t | s^(i)_t) ]    (11)
                          (the first factor is the dynamics model, the second the policy)
        = ∇_θ [ Σ_{t=0}^{H} log P(s^(i)_{t+1} | s^(i)_t, a^(i)_t) + Σ_{t=0}^{H} log π_θ(a^(i)_t | s^(i)_t) ]    (12)
        = ∇_θ Σ_{t=0}^{H} log π_θ(a^(i)_t | s^(i)_t)    (13)
        = Σ_{t=0}^{H} ∇_θ log π_θ(a^(i)_t | s^(i)_t)    -- no dynamics model required!!    (14)

Note that:

    Σ_τ P(τ; θ) = 1    (15)
    ∇_θ Σ_τ P(τ; θ) = 0    (16)
    Σ_τ ∇_θ P(τ; θ) = 0    (17)
    Σ_τ P(τ; θ) ( ∇_θ P(τ; θ) / P(τ; θ) ) = 0    (18)
    E[ ∇_θ P(τ; θ) / P(τ; θ) ] = 0    (19)
    E[ ∇_θ log P(τ; θ) ] = 0    (20)

Unbiased gradient estimate:

    ĝ ≈ ∇_θ U(θ)    (21)
    ĝ = (1/m) Σ_{i=1}^{m} ∇_θ log P(τ^(i); θ) R(τ^(i))    (22)
      = (1/m) Σ_{i=1}^{m} [ ∇_θ log P(τ^(i); θ) R(τ^(i))    (23)
                            − ∇_θ log P(τ^(i); θ) b^(i) ]    (24)
    ĝ = (1/m) Σ_{i=1}^{m} ∇_θ log P(τ^(i); θ) ( R(τ^(i)) − b^(i) )    (25)

ĝ is an unbiased estimate, with free parameters b^(i): by (20), subtracting the baseline term does not change the expectation. As b^(i) has to be a constant, in practice there is typically no reason to let it depend on i, as all trajectories τ^(i) are (presumably) sampled i.i.d. So in practice, we typically use a single baseline b.
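Eqs. (13)-(14) and (25) translate directly into code. Below is a hedged sketch (my own helper names and data layout, not from the notes) for the logistic policy of eq. (2): each trajectory is assumed to be given as (list of feature vectors φ(s_t), list of actions a_t, return R(τ)), and the score of a trajectory is just the sum of per-step policy scores, with no dynamics model anywhere.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_log_pi(theta, phi_s, a):
    """Score of the logistic policy of eq. (2), pi_theta(a=0|s) = sigmoid(theta^T phi(s)):
    grad_theta log pi_theta(a|s) = (1 - p0) * phi(s) if a == 0, else -p0 * phi(s)."""
    p0 = sigmoid(theta @ phi_s)
    return (1.0 - p0) * phi_s if a == 0 else -p0 * phi_s

def grad_log_traj(theta, features, actions):
    """Eqs. (13)-(14): grad_theta log P(tau; theta) = sum_t grad_theta log pi_theta(a_t|s_t).
    The dynamics terms of eq. (12) do not depend on theta, so they drop out."""
    return sum(grad_log_pi(theta, phi_s, a) for phi_s, a in zip(features, actions))

def baselined_estimate(theta, trajectories, b=0.0):
    """Eq. (25): g_hat = (1/m) sum_i grad log P(tau^(i); theta) * (R(tau^(i)) - b)."""
    grads = np.stack([grad_log_traj(theta, feats, acts) for feats, acts, _ in trajectories])
    returns = np.array([R for _, _, R in trajectories])
    return (grads * (returns - b)[:, None]).mean(axis=0)

# Usage with two short made-up trajectories in a 2-dimensional feature space:
if __name__ == "__main__":
    theta = np.array([0.2, -0.1])
    trajs = [
        ([np.array([1.0, 0.0]), np.array([0.0, 1.0])], [0, 1], 1.5),
        ([np.array([1.0, 1.0])], [1], -0.5),
    ]
    print(baselined_estimate(theta, trajs, b=0.5))
```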

While our gradient estimates are unbiased, in that

    ĝ = (1/m) Σ_i ∇_θ log P(τ^(i); θ) ( R(τ^(i)) − b ),    (26)
    E[ĝ] = ∇_θ U(θ),    (27)

they are stochastic estimates and have a variance given by:

    E[ ( ĝ − E[ĝ] )^2 ]    (28)

We will now describe how to choose b such that we are minimizing the variance of our gradient estimates; we work with one component ∂/∂θ_i of the gradient at a time, so ĝ and b below are scalars:

    min_b E[ ( ĝ − E[ĝ] )^2 ] = min_b  E[ĝ^2] + E[ ( E[ĝ] )^2 ] − 2 E[ ĝ E[ĝ] ]    (29)
        = min_b  E[ĝ^2] + ( E[ĝ] )^2 − 2 E[ĝ] E[ĝ]    (30)
        = min_b  E[ĝ^2] − ( E[ĝ] )^2,   where E[ĝ] (the corresponding component of ∇_θ U(θ)) is independent of b    (31)

    min_b E[ĝ^2] = min_b E[ ( ∂/∂θ_i log P(τ; θ) )^2 ( R(τ) − b )^2 ]    (32)
        = min_b E[ ( ∂/∂θ_i log P(τ; θ) )^2 ( R(τ)^2 + b^2 − 2 b R(τ) ) ]    (33)
        = min_b  E[ ( ∂/∂θ_i log P(τ; θ) )^2 R(τ)^2 ]   (independent of b)    (34)
                 + E[ ( ∂/∂θ_i log P(τ; θ) )^2 ] b^2    (35)
                 − 2 b E[ ( ∂/∂θ_i log P(τ; θ) )^2 R(τ) ]    (36)

Setting the derivative w.r.t. b to zero:

    2 b E[ ( ∂/∂θ_i log P(τ; θ) )^2 ] − 2 E[ ( ∂/∂θ_i log P(τ; θ) )^2 R(τ) ] = 0    (37)

    b = E[ ( ∂/∂θ_i log P(τ; θ) )^2 R(τ) ] / E[ ( ∂/∂θ_i log P(τ; θ) )^2 ]    (38)

To minimize the variance in practice, we can compute b as above from the sampled trajectories and obtain a good estimate.
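As a small sketch (my own function names, not from the notes), the baseline of eq. (38) can be estimated component-wise from the same m sampled score vectors and returns used for ĝ; strictly speaking, reusing the samples introduces a slight bias, which the notes do not dwell on.

```python
import numpy as np

def optimal_baseline(grad_log_probs, returns, eps=1e-12):
    """Eq. (38), estimated per gradient component i from m sampled paths:
    b_i = E[(d/dtheta_i log P(tau;theta))^2 R(tau)] / E[(d/dtheta_i log P(tau;theta))^2].

    grad_log_probs: (m, d) array of per-path score vectors
    returns:        (m,)  array of returns
    """
    sq = np.asarray(grad_log_probs, dtype=float) ** 2                  # (m, d)
    num = (sq * np.asarray(returns, dtype=float)[:, None]).mean(axis=0)
    den = sq.mean(axis=0) + eps                                        # eps guards against division by zero
    return num / den                                                   # (d,): one baseline per component

def variance_reduced_estimate(grad_log_probs, returns):
    """Eq. (26) with the component-wise optimal baseline of eq. (38) plugged in."""
    g = np.asarray(grad_log_probs, dtype=float)
    b = optimal_baseline(g, returns)
    return (g * (np.asarray(returns, dtype=float)[:, None] - b[None, :])).mean(axis=0)
```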

3 Gradient descent

[Figure 1: contours of f(x) = x_1^2 + x_2^2 over (x_1, x_2). Figure 2: contours of f(x) = x_1^2 + 100 x_2^2 over (x_1, x_2).]

For the function f(x) = x_1^2 + x_2^2 in Figure 1, the negative gradient −∇f = −[2 x_1, 2 x_2]^T points directly at the minimum, so gradient descent heads straight there. Note that f(x) = x_1^2 + 100 x_2^2 in Figure 2 is essentially the same function! We just scaled the variables (e.g., this could correspond to different measurement units), yet now the gradient no longer points toward the minimum and plain gradient descent zig-zags and makes slow progress. Solution: look at second-order methods / approximations. Indeed, if we use Newton's method, which finds a step direction by considering the second-order Taylor approximation, the resulting step direction always leads us directly to the minimum.

Gradient descent operates directly in the parameter space. However, the same policy class π_θ can often be represented by various different parameterizations, and different parameterizations will lead to different gradients. The natural gradient method, described below, intends to get around this issue by more directly optimizing in terms of the policy class (and distances within the policy class, i.e., distances between probability distributions) rather than the original parameters θ.

4 Natural gradients

We can approximate the derivative by a finite difference,

    ∂f/∂θ_i ≈ [ f(θ_i^0 + Δ) − f(θ_i^0 − Δ) ] / (2Δ)    (39)

where Δ is the delta-width. Note that the gradient computation normalizes by the distance traveled in θ space. However, θ is often an arbitrary way to index into the policy class, so it might be more natural to divide by a distance that is defined directly on the policy class, rather than on θ. For example, rather than dividing by 2Δ, we could consider dividing by

    KL( P_{θ_i^0 + Δ} || P_{θ_i^0 − Δ} )    (40)
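As a hedged sketch (my own setup, not in the notes): eq. (39) is an ordinary central difference, and eq. (40) proposes measuring the denominator on the policy class instead. The snippet below computes both ingredients for a one-parameter Bernoulli policy π_θ(a = 0) = sigmoid(θ): the plain central difference of a function f, and the KL divergence between the two perturbed policies that eq. (40) would divide by instead of 2Δ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kl_bernoulli(p, q):
    """KL( Bernoulli(p) || Bernoulli(q) )."""
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

def central_difference(f, theta0, delta):
    """Eq. (39): (f(theta0 + delta) - f(theta0 - delta)) / (2 * delta)."""
    return (f(theta0 + delta) - f(theta0 - delta)) / (2.0 * delta)

def kl_of_perturbation(theta0, delta):
    """Eq. (40): the distance on the policy class between the two perturbed policies,
    KL( P_{theta0 + delta} || P_{theta0 - delta} ), for pi_theta(a=0) = sigmoid(theta)."""
    return kl_bernoulli(sigmoid(theta0 + delta), sigmoid(theta0 - delta))

if __name__ == "__main__":
    # f: expected reward of the Bernoulli policy when action 0 pays 1 and action 1 pays 0.
    f = lambda th: sigmoid(th)
    theta0, delta = 2.0, 0.05
    print("central difference (39):", central_difference(f, theta0, delta))
    print("KL denominator     (40):", kl_of_perturbation(theta0, delta))
```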

[Figure 3: f plotted as a function of θ_i, with the points θ_i^0 − Δ and θ_i^0 + Δ marked.]

The natural gradient essentially generalizes this idea, and finds the steepest ascent direction when normalizing by a distance metric operating in the policy class space directly, rather than in the arbitrary θ space:

    g_N = argmax_{δθ : ||δθ|| = ε}  [ f(θ^0 + δθ) − f(θ^0 − δθ) ] / distance( θ^0 + δθ, θ^0 − δθ )    (41)

When ε is really small, we can approximate the function f with a first-order Taylor expansion, and the distance metric by a 2nd-order Taylor expansion:

    f(θ + δθ) ≈ f(θ) + g^T δθ    (42)
    distance(θ + δθ, θ − δθ) ≈ (1/2) ( (θ + δθ) − (θ − δθ) )^T G_θ ( (θ + δθ) − (θ − δθ) ) = 2 δθ^T G_θ δθ    (43)

    g_N = argmax_{δθ : ||δθ|| = ε}  [ f(θ) + g^T δθ − ( f(θ) − g^T δθ ) ] / ( 2 δθ^T G_θ δθ )    (44)
        = argmax_{δθ : ||δθ|| = ε}  ( g^T δθ ) / ( δθ^T G_θ δθ )    (45)

The key is to pick a clever distance metric G_θ. The maximizing direction is

    δθ = α G_θ^{-1} g,   α > 0,    (46)

and for the trivial choice

    G_θ = I  we simply recover the ordinary gradient direction, g_N = α g.    (47)

A typical choice for a distance metric between probability distributions would be a symmetric version of the KL divergence:

    (1/2) ( KL(P || Q) + KL(Q || P) ) = (1/2) Σ_x P(x) log( P(x) / Q(x) ) + (1/2) Σ_x Q(x) log( Q(x) / P(x) )    (48)
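Computationally, eq. (46) says the natural-gradient direction comes from solving a linear system in the metric G_θ. A minimal sketch (my own names; the small ridge term is an implementation convenience, not part of the notes):

```python
import numpy as np

def natural_gradient(g, G, reg=1e-8):
    """Eq. (46): delta_theta = alpha * G^{-1} g for some step size alpha > 0 (here alpha = 1).
    A tiny ridge term keeps the solve well-posed when G is estimated from few samples."""
    G = np.asarray(G, dtype=float)
    g = np.asarray(g, dtype=float)
    return np.linalg.solve(G + reg * np.eye(G.shape[0]), g)

if __name__ == "__main__":
    g = np.array([1.0, 1.0])          # ordinary gradient
    G = np.diag([1.0, 100.0])         # a strongly skewed metric, cf. Figure 2 of Section 3
    print("ordinary gradient:", g)
    print("natural gradient: ", natural_gradient(g, G))          # ~[1, 0.01]: rescaled per direction
    print("with G = I:       ", natural_gradient(g, np.eye(2)))  # reduces to g, as in eq. (47)
```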

For this choice, we have, for θ_1 and θ_2 close enough together:

    KL( P_{θ_1} || P_{θ_2} ) ≈ ( θ_1 − θ_2 )^T G_{θ_1} ( θ_1 − θ_2 )    (49)

(up to an overall constant, which does not affect the natural gradient direction), for:

    G_θ = E_x[ ( ∇_θ log P_θ(x) ) ( ∇_θ log P_θ(x) )^T ]    (the Fisher information matrix)    (50)
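Eq. (50) can be estimated by averaging outer products of score vectors over samples from P_θ. As a hedged sketch for the logistic policy of eq. (2) at one fixed state (the sampling setup and closed-form check are my own illustration, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(theta, phi_s, a):
    """grad_theta log pi_theta(a|s) for the logistic policy of eq. (2)."""
    p0 = sigmoid(theta @ phi_s)
    return (1.0 - p0) * phi_s if a == 0 else -p0 * phi_s

def empirical_fisher(theta, phi_s, n_samples=20000, seed=0):
    """Eq. (50): G_theta = E_x[ score(x) score(x)^T ], here estimated by Monte Carlo
    over actions drawn from the policy at one fixed state s."""
    rng = np.random.default_rng(seed)
    p0 = sigmoid(theta @ phi_s)
    G = np.zeros((theta.size, theta.size))
    for _ in range(n_samples):
        a = 0 if rng.random() < p0 else 1
        s = score(theta, phi_s, a)
        G += np.outer(s, s)
    return G / n_samples

if __name__ == "__main__":
    theta = np.array([0.5, -0.3])
    phi_s = np.array([1.0, 2.0])
    print("Monte Carlo estimate:\n", empirical_fisher(theta, phi_s))
    # For this Bernoulli policy the Fisher matrix is p0*(1-p0) * phi(s) phi(s)^T,
    # which is a handy sanity check on the estimate above.
    p0 = sigmoid(theta @ phi_s)
    print("closed form:\n", p0 * (1 - p0) * np.outer(phi_s, phi_s))
```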