CS294-40 Learning for Robotics and Control
Lecture 16 - 10/20/2008
Lecturer: Pieter Abbeel        Scribe: Jan Biermeyer

Policy Gradient

1 Recap

Recall:

    U(\theta) = E\left[ \sum_{t=0}^{H} R(s_t, a_t) ; \pi_\theta \right] = E[ R(\tau) ; \pi_\theta ]

Here \tau is a sample path of states and actions, s_0, a_0, \ldots, s_H, a_H.

Example policy \pi_\theta for a binary action a \in \{0, 1\}:

    \pi_\theta(a_t = 0 \mid s_t) = \frac{ e^{\theta^\top \phi(s_t)} }{ 1 + e^{\theta^\top \phi(s_t)} }

2 Policy Search

Our goal is to solve

    \max_\theta U(\theta) = \max_\theta E[ R(\tau) ; \pi_\theta ] = \max_\theta \sum_\tau P(\tau ; \theta) R(\tau).

Taking the gradient w.r.t. \theta gives

    \nabla_\theta U(\theta) = \nabla_\theta \sum_\tau P(\tau ; \theta) R(\tau)
                            = \sum_\tau \nabla_\theta P(\tau ; \theta) R(\tau)
                            = \sum_\tau \frac{ P(\tau ; \theta) }{ P(\tau ; \theta) } \nabla_\theta P(\tau ; \theta) R(\tau)
                            = \sum_\tau P(\tau ; \theta) \frac{ \nabla_\theta P(\tau ; \theta) }{ P(\tau ; \theta) } R(\tau)
                            = \sum_\tau P(\tau ; \theta) \nabla_\theta \log P(\tau ; \theta) R(\tau).

We approximate this with the empirical estimate over m sample paths obtained under policy \pi_\theta:

    \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)} ; \theta) R(\tau^{(i)}).
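To make the recap concrete, here is a minimal Python sketch of the logistic example policy above together with a Monte Carlo estimate of U(\theta) = E[R(\tau); \pi_\theta]. The one-dimensional chain environment, reward, feature map phi, and horizon are hypothetical stand-ins invented for illustration; only the form of the policy comes from the notes.

    import numpy as np

    rng = np.random.default_rng(0)

    def phi(s):
        # hypothetical feature map: the raw state plus a bias feature
        return np.array([s, 1.0])

    def pi_action_probs(theta, s):
        # example policy from the recap: pi_theta(a=0|s) = e^{theta^T phi(s)} / (1 + e^{theta^T phi(s)})
        z = theta @ phi(s)
        p0 = np.exp(z) / (1.0 + np.exp(z))
        return np.array([p0, 1.0 - p0])

    def rollout(theta, H=20):
        # sample a path tau = (s_0, a_0, ..., s_H, a_H) under pi_theta in a toy chain environment
        s, states, actions = 0.0, [], []
        for t in range(H + 1):
            a = rng.choice(2, p=pi_action_probs(theta, s))
            states.append(s)
            actions.append(a)
            s = s + (1.0 if a == 0 else -1.0) + 0.1 * rng.standard_normal()  # hypothetical dynamics
        return states, actions

    def R(states, actions):
        # hypothetical trajectory reward: sum over t of R(s_t, a_t) = -|s_t|
        return sum(-abs(s) for s in states)

    def U_estimate(theta, m=500):
        # Monte Carlo estimate of U(theta) = E[R(tau); pi_theta]
        return np.mean([R(*rollout(theta)) for _ in range(m)])

    print(U_estimate(np.array([0.5, 0.0])))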
For a sample path \tau^{(i)} = (s_0^{(i)}, a_0^{(i)}, \ldots, s_H^{(i)}, a_H^{(i)}) we have

    \nabla_\theta \log P(\tau^{(i)} ; \theta)
      = \nabla_\theta \log \left[ \prod_{t=0}^{H} \underbrace{ P(s_{t+1}^{(i)} \mid s_t^{(i)}, a_t^{(i)}) }_{\text{dynamics model}} \cdot \underbrace{ \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) }_{\text{policy}} \right]
      = \nabla_\theta \left[ \sum_{t=0}^{H} \log P(s_{t+1}^{(i)} \mid s_t^{(i)}, a_t^{(i)}) + \sum_{t=0}^{H} \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \right]
      = \nabla_\theta \sum_{t=0}^{H} \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})
      = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})  \qquad \text{no dynamics model required!!}

Note that:

    \sum_\tau P(\tau ; \theta) = 1
    \nabla_\theta \sum_\tau P(\tau ; \theta) = 0
    \sum_\tau \nabla_\theta P(\tau ; \theta) = 0
    \sum_\tau P(\tau ; \theta) \frac{ \nabla_\theta P(\tau ; \theta) }{ P(\tau ; \theta) } = 0
    E\left[ \frac{ \nabla_\theta P(\tau ; \theta) }{ P(\tau ; \theta) } \right] = E[ \nabla_\theta \log P(\tau ; \theta) ] = 0.

Unbiased gradient estimate: since E[ \nabla_\theta \log P(\tau ; \theta) \, b ] = 0 for any constant b, we can subtract a baseline from the return without changing the expectation of the estimate. Both

    \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)} ; \theta) R(\tau^{(i)})

and

    \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \left[ \nabla_\theta \log P(\tau^{(i)} ; \theta) R(\tau^{(i)}) - \nabla_\theta \log P(\tau^{(i)} ; \theta) \, b^{(i)} \right]
            = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)} ; \theta) \left( R(\tau^{(i)}) - b^{(i)} \right)

satisfy E[\hat{g}] = \nabla_\theta U(\theta).

\hat{g} is an unbiased estimate, with free parameters b^{(i)}. As b^{(i)} has to be a constant and all trajectories \tau^{(i)} are (presumably) sampled i.i.d., there is typically no reason to let it depend on i, so in practice we typically use a single baseline b.
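Continuing the sketch above (same hypothetical environment and logistic policy), the estimator \hat{g} only needs the policy score \nabla_\theta \log \pi_\theta(a_t \mid s_t), which has a closed form here, and never the dynamics model:

    def grad_log_pi(theta, s, a):
        # score of the logistic policy: with z = theta^T phi(s) and sigma = e^z / (1 + e^z),
        # grad log pi(a=0|s) = (1 - sigma) phi(s)  and  grad log pi(a=1|s) = -sigma phi(s)
        z = theta @ phi(s)
        sigma = np.exp(z) / (1.0 + np.exp(z))
        return (1.0 - sigma) * phi(s) if a == 0 else -sigma * phi(s)

    def policy_gradient_estimate(theta, m=500, b=0.0):
        # hat{g} = 1/m sum_i [ sum_t grad log pi_theta(a_t|s_t) ] (R(tau^(i)) - b)
        g_hat = np.zeros_like(theta)
        for _ in range(m):
            states, actions = rollout(theta)
            score = sum(grad_log_pi(theta, s, a) for s, a in zip(states, actions))
            g_hat += score * (R(states, actions) - b)
        return g_hat / m

    print(policy_gradient_estimate(np.array([0.5, 0.0])))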
While our gradient estimates

    \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)} ; \theta) \left( R(\tau^{(i)}) - b \right)

are unbiased, in that

    E[\hat{g}] = \nabla_\theta U(\theta),

they are stochastic estimates and have a variance given by

    E\left[ \left( \hat{g} - E[\hat{g}] \right)^2 \right].

We will now describe how to choose b so as to minimize the variance of our gradient estimates. First,

    \min_b E\left[ \left( \hat{g} - E[\hat{g}] \right)^2 \right]
      = \min_b E[\hat{g}^2] + \left( E[\hat{g}] \right)^2 - 2\, E[\hat{g}]\, E[\hat{g}]
      = \min_b E[\hat{g}^2] - \left( E[\hat{g}] \right)^2,

and E[\hat{g}] = \nabla_\theta U(\theta) is independent of b, so it suffices to minimize E[\hat{g}^2]. Considering one component of the gradient at a time (and, for brevity, a single sampled trajectory \tau):

    \min_b E\left[ \left( \frac{\partial}{\partial \theta_i} \log P(\tau ; \theta) \right)^2 \left( R(\tau) - b \right)^2 \right]
      = \min_b E\left[ \left( \frac{\partial}{\partial \theta_i} \log P(\tau ; \theta) \right)^2 \left( R(\tau)^2 + b^2 - 2\, b\, R(\tau) \right) \right]
      = \min_b \underbrace{ E\left[ \left( \frac{\partial}{\partial \theta_i} \log P(\tau ; \theta) \right)^2 R(\tau)^2 \right] }_{\text{independent of } b}
               + b^2\, E\left[ \left( \frac{\partial}{\partial \theta_i} \log P(\tau ; \theta) \right)^2 \right]
               - 2\, b\, E\left[ \left( \frac{\partial}{\partial \theta_i} \log P(\tau ; \theta) \right)^2 R(\tau) \right].

Setting the derivative with respect to b to zero,

    2\, b\, E\left[ \left( \frac{\partial}{\partial \theta_i} \log P(\tau ; \theta) \right)^2 \right] - 2\, E\left[ \left( \frac{\partial}{\partial \theta_i} \log P(\tau ; \theta) \right)^2 R(\tau) \right] = 0,

gives the variance-minimizing baseline

    b = \frac{ E\left[ \left( \frac{\partial}{\partial \theta_i} \log P(\tau ; \theta) \right)^2 R(\tau) \right] }{ E\left[ \left( \frac{\partial}{\partial \theta_i} \log P(\tau ; \theta) \right)^2 \right] }.

To minimize the variance in practice, we can estimate these expectations from samples and compute b as above (one value per gradient component).
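A minimal sketch, reusing the hypothetical rollout and score functions from the sketches above, of estimating the variance-minimizing baseline per gradient component from samples and plugging it back into the estimator:

    def optimal_baseline_estimate(theta, m=500):
        # b_i = E[(d/dtheta_i log P(tau;theta))^2 R(tau)] / E[(d/dtheta_i log P(tau;theta))^2],
        # estimated per gradient component from m sampled trajectories
        scores, returns = [], []
        for _ in range(m):
            states, actions = rollout(theta)
            scores.append(sum(grad_log_pi(theta, s, a) for s, a in zip(states, actions)))
            returns.append(R(states, actions))
        scores, returns = np.array(scores), np.array(returns)        # shapes (m, d) and (m,)
        sq = scores ** 2
        b = (sq * returns[:, None]).mean(axis=0) / sq.mean(axis=0)   # variance-minimizing baseline, shape (d,)
        g_hat = (scores * (returns[:, None] - b)).mean(axis=0)       # baselined gradient estimate
        return b, g_hat

    b, g_hat = optimal_baseline_estimate(np.array([0.5, 0.0]))
    print(b, g_hat)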
3 Gradient descent

[Figure 1: contour plot of f(x) = x_1^2 + x_2^2.   Figure 2: contour plot of f(x) = x_1^2 + 100 x_2^2.]

For the function f(x) = x_1^2 + x_2^2 in Figure 1, the negative gradient -\nabla f(x) = -(2 x_1, 2 x_2)^\top points straight at the minimum, so a gradient step leads directly towards it. Note that f(x) = x_1^2 + 100 x_2^2 in Figure 2 is essentially the same function! We just scaled the variables (which could, e.g., correspond to different measurement units), yet the gradient direction is now heavily skewed towards the x_2 axis and no longer points at the minimum, so plain gradient steps make slow progress.

Solution: look at second-order methods / approximations. Indeed, if we use Newton's method, which finds a step direction by considering the second-order Taylor approximation, the resulting step direction always leads us directly to the minimum of these quadratics (see the short numerical sketch further below).

Gradient descent operates directly in the parameter space \theta. However, the same policy class \pi_\theta can often be represented by various different parameterizations, and different parameterizations will lead to different gradients. The natural gradient method, described below, gets around this issue by optimizing more directly in terms of the policy class (and distances within the policy class, i.e., distances between probability distributions) rather than in terms of the original parameters \theta.

4 Natural gradients

We can approximate the derivative by a finite difference,

    \frac{\partial f}{\partial \theta_i} \approx \frac{ f(\theta_i^0 + \Delta) - f(\theta_i^0 - \Delta) }{ 2 \Delta },

where \Delta is the step width. Note that this gradient computation normalizes by the distance 2\Delta traveled in \theta space. However, \theta is often an arbitrary way to index into the policy class, so it might be more natural to divide by a distance that is defined directly on the policy class, rather than on \theta. For example, rather than dividing by 2\Delta, we could consider dividing by

    KL\left( P_{\theta_i^0 + \Delta} \,\|\, P_{\theta_i^0 - \Delta} \right).
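Before developing the natural gradient, here is a small numerical illustration of the scaling issue from Figures 1 and 2: fixed-step gradient descent handles the well-scaled quadratic easily but makes little progress on the rescaled one, while a Newton step reaches the minimum immediately. The step sizes and iteration count are arbitrary choices for illustration.

    import numpy as np

    def run_gradient_descent(Q, x0, step, iters):
        # minimize f(x) = x^T Q x by gradient descent; grad f(x) = 2 Q x
        x = x0.copy()
        for _ in range(iters):
            x -= step * (2.0 * Q @ x)
        return x

    def newton_step(Q, x0):
        # Newton direction: -(Hessian)^{-1} grad = -(2Q)^{-1}(2Q x0) = -x0, so one step reaches the minimum
        return x0 - np.linalg.solve(2.0 * Q, 2.0 * Q @ x0)

    x0 = np.array([1.0, 1.0])
    Q1 = np.diag([1.0, 1.0])     # f(x) = x_1^2 + x_2^2
    Q2 = np.diag([1.0, 100.0])   # f(x) = x_1^2 + 100 x_2^2 (same function with a rescaled variable)

    print(run_gradient_descent(Q1, x0, step=0.4, iters=20))    # ~ [0, 0]
    print(run_gradient_descent(Q2, x0, step=0.009, iters=20))  # x_1 has hardly converged: poor scaling
    print(newton_step(Q2, x0))                                 # exactly [0, 0]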
[Figure 3: f plotted as a function of \theta_i, evaluated at \theta_i^0 - \Delta and \theta_i^0 + \Delta.]

The natural gradient essentially generalizes this idea of normalizing by a distance defined on the policy class: it finds the steepest ascent direction when distances are measured directly in policy space rather than in the arbitrary \theta space:

    g_N = \arg\max_{\delta\theta : \|\delta\theta\| = \epsilon} \frac{ f(\theta^0 + \delta\theta) - f(\theta^0 - \delta\theta) }{ \mathrm{distance}(\theta^0 + \delta\theta,\, \theta^0 - \delta\theta) }.

When \epsilon is really small, we can approximate the function f by a first-order Taylor expansion, and the distance metric by a second-order Taylor expansion:

    f(\theta + \delta\theta) \approx f(\theta) + g^\top \delta\theta,
    \mathrm{distance}(\theta + \delta\theta,\, \theta - \delta\theta) \approx \left( \theta + \delta\theta - (\theta - \delta\theta) \right)^\top G_\theta \left( \theta + \delta\theta - (\theta - \delta\theta) \right) = 4\, \delta\theta^\top G_\theta\, \delta\theta.

This gives

    g_N = \arg\max_{\delta\theta : \|\delta\theta\| = \epsilon} \frac{ f(\theta) + g^\top \delta\theta - \left( f(\theta) - g^\top \delta\theta \right) }{ 4\, \delta\theta^\top G_\theta\, \delta\theta }
        = \arg\max_{\delta\theta : \|\delta\theta\| = \epsilon} \frac{ g^\top \delta\theta }{ \delta\theta^\top G_\theta\, \delta\theta }

(constant factors do not affect the maximizer). The key is to pick a clever distance metric G_\theta. Solving the above gives

    \delta\theta = \alpha\, G_\theta^{-1} g, \qquad \alpha > 0,

so the natural gradient direction is G_\theta^{-1} g. For G_\theta = I we recover the ordinary gradient: g_N = \alpha\, g.

A typical choice for the distance metric between probability distributions is a symmetric version of the KL-divergence:

    \frac{1}{2}\left( KL(P \,\|\, Q) + KL(Q \,\|\, P) \right) = \frac{1}{2} \sum_x P(x) \log \frac{P(x)}{Q(x)} + \frac{1}{2} \sum_x Q(x) \log \frac{Q(x)}{P(x)}.
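A minimal sketch of a natural gradient step, reusing the hypothetical rollout, score, and reward code from the earlier sketches: G is estimated as the empirical outer product of the per-trajectory score vectors, and the mean return serves as a simple (not variance-optimal) baseline. The ridge term and step size alpha are arbitrary illustrative choices.

    def natural_gradient_step(theta, m=500, ridge=1e-3, alpha=0.1):
        # estimate the plain gradient hat{g} and the metric G = E[grad log P(tau;theta) grad log P(tau;theta)^T]
        # from the same m sampled trajectories, then step along G^{-1} hat{g}
        scores, returns = [], []
        for _ in range(m):
            states, actions = rollout(theta)
            scores.append(sum(grad_log_pi(theta, s, a) for s, a in zip(states, actions)))
            returns.append(R(states, actions))
        scores, returns = np.array(scores), np.array(returns)
        g_hat = (scores * (returns[:, None] - returns.mean())).mean(axis=0)  # baselined gradient estimate
        G = scores.T @ scores / len(scores) + ridge * np.eye(len(theta))     # empirical metric (+ ridge for stability)
        g_nat = np.linalg.solve(G, g_hat)                                    # natural gradient direction G^{-1} hat{g}
        return theta + alpha * g_nat                                         # ascent step on U(theta)

    print(natural_gradient_step(np.array([0.5, 0.0])))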
For this choice we have, for \theta_1 and \theta_2 close enough together,

    KL\left( P_{\theta_1} \,\|\, P_{\theta_2} \right) \approx \left( \theta_1 - \theta_2 \right)^\top G_{\theta_1} \left( \theta_1 - \theta_2 \right)

for

    G_\theta = E_x\left[ \left( \nabla_\theta \log P_\theta(x) \right) \left( \nabla_\theta \log P_\theta(x) \right)^\top \right] \qquad \text{(Fisher information matrix)}.
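As a small numerical check of this definition, the following standalone sketch estimates the Fisher information of a one-parameter Bernoulli distribution P_\theta(x = 1) = \sigma(\theta) via the outer-product (here scalar) formula above and compares it to the closed form \sigma(\theta)(1 - \sigma(\theta)).

    import numpy as np

    rng = np.random.default_rng(0)

    def fisher_bernoulli(theta, n=200000):
        # P_theta(x=1) = sigma(theta); the score d/dtheta log P_theta(x) equals
        # (1 - sigma) for x = 1 and -sigma for x = 0
        sigma = 1.0 / (1.0 + np.exp(-theta))
        x = rng.random(n) < sigma                   # samples x ~ P_theta
        score = np.where(x, 1.0 - sigma, -sigma)    # d/dtheta log P_theta(x)
        return np.mean(score ** 2)                  # empirical Fisher information E[score^2]

    theta = 0.7
    sigma = 1.0 / (1.0 + np.exp(-theta))
    print(fisher_bernoulli(theta), sigma * (1.0 - sigma))   # the two values should be close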