Temporal-Difference Learning

2.997 Decision-Making in Large-Scale Systems, March 17
MIT, Spring 2004, Handout #17
Lecture Note 13

1 Temporal-Difference Learning

We now consider the problem of computing an appropriate parameter $r$, so that, given an approximation architecture $\tilde{J}(x, r)$, we have $\tilde{J}(\cdot, r) \approx J^*(\cdot)$.

A class of iterative methods for this problem are the so-called temporal-difference learning algorithms, which generate a series of approximations $\tilde{J}_k = \tilde{J}(\cdot, r_k)$ as follows. Consider generating a trajectory $(x_1, u_1, \dots, x_k, u_k)$, where $u_k$ is the greedy action with respect to $\tilde{J}_k$. We then have the errors, or temporal differences,

$$d_k = g_{u_k}(x_k) + \alpha \tilde{J}(x_{k+1}, r_k) - \tilde{J}(x_k, r_k),$$

which represent an approximation to the Bellman error $(T\tilde{J}_k)(x_k) - \tilde{J}_k(x_k)$ at state $x_k$. Based on the temporal differences, an intuitive way of updating the parameters $r_k$ is to make updates proportional to the observed Bellman error/temporal difference:

$$r_{k+1} = r_k + \gamma_k d_k z_k,$$

where $\gamma_k$ is the step size and $z_k$ is called an eligibility vector; it measures how much updates to each component of the vector $r_k$ would affect the Bellman error.

To gather more intuition about how to choose the eligibility vector, we will consider the case of autonomous systems, i.e., systems that do not involve control. In this case, we can estimate the cost-to-go function via sampling as follows. Suppose that we have a trajectory $x_1, \dots, x_n$. Then we have

$$J^*(x_1) \approx \sum_{t=1}^{n} \alpha^{t-1} g(x_t), \qquad J^*(x_2) \approx \sum_{t=2}^{n} \alpha^{t-2} g(x_t).$$

In other words, from a trajectory $x_1, \dots, x_n$, we can derive pairs $(x_i, \hat{J}(x_i))$, where $\hat{J}(x_i)$ is a noisy and biased estimate of $J^*(x_i)$. Therefore we may consider fitting the approximation $\tilde{J}(x, r)$ by minimizing the empirical squared error:

$$\min_r \sum_{t=1}^{n} \left( \hat{J}_n(x_t) - \tilde{J}(x_t, r) \right)^2. \tag{1}$$

We derive an incremental, approximate version of (1). First note that $\hat{J}_n(x_t)$ could be updated incrementally as follows:

$$\hat{J}_{n+1}(x_t) = \hat{J}_n(x_t) + \alpha^{n+1-t} g(x_{n+1}). \tag{2}$$

Alternatively, we may use a small-step update of the form

$$\hat{J}_{n+1}(x_t) = \hat{J}_n(x_t) + \gamma \left( \sum_{j=t}^{n+1} \alpha^{j-t} g(x_j) - \hat{J}_n(x_t) \right), \tag{3}$$

which makes $\hat{J}_{n+1}(x_t)$ an average of the old estimate $\hat{J}_n(x_t)$ and the new estimate (2). Finally, we may approximate (3) so that the update of $\hat{J}_n(x_t)$ is a function of the temporal differences $d_1, d_2, \dots$, where here $d_j = g(x_j) + \alpha \hat{J}_n(x_{j+1}) - \hat{J}_n(x_j)$:

$$\begin{aligned}
\sum_{j=t}^{n+1} \alpha^{j-t} g(x_j) - \hat{J}_n(x_t)
&= \big( g(x_t) + \alpha \hat{J}_n(x_{t+1}) - \hat{J}_n(x_t) \big) + \alpha \big( g(x_{t+1}) + \alpha \hat{J}_n(x_{t+2}) - \hat{J}_n(x_{t+1}) \big) + \dots \\
&\quad + \alpha^{n-t} \big( g(x_n) + \alpha \hat{J}_n(x_{n+1}) - \hat{J}_n(x_n) \big) + \alpha^{n+1-t} \big( g(x_{n+1}) - \hat{J}_n(x_{n+1}) \big) \\
&\approx \sum_{j=t}^{n+1} \alpha^{j-t} d_j.
\end{aligned}$$

Hence

$$\hat{J}_{n+1}(x_t) = \hat{J}_n(x_t) + \gamma \sum_{j=t}^{n+1} \alpha^{j-t} d_j. \tag{4}$$

Finally, we may consider having the sum in (4) implemented incrementally, so that the previous temporal differences do not have to be stored:

$$\hat{J}_{n+1}(x_t) = \hat{J}_n(x_t) + \gamma \alpha^{n+1-t} d_{n+1}.$$

Hence, in each time stage, we would like to find $r_n$ minimizing

$$\min_r \sum_{t=1}^{n} \left( \hat{J}_n(x_t) + \gamma \alpha^{n-t} d_n - \tilde{J}(x_t, r) \right)^2. \tag{5}$$

Starting from the solution $r_n$ to the problem at stage $n$, we can approximate the solution of the problem at stage $n+1$ by updating $r_{n+1}$ along the gradient of (5). This leads to

$$r_{n+1} = r_n + \gamma \left( \sum_{t=1}^{n} \alpha^{n-t} \nabla_r \tilde{J}(x_t, r_n) \right) d_n.$$

We can also have an incremental version, given by

$$r_{k+1} = r_k + \gamma_k z_k d_k, \qquad z_k = \alpha z_{k-1} + \nabla_r \tilde{J}(x_k, r_k).$$

The algorithm above is known as TD(1). We have the generalization TD($\lambda$), $\lambda \in [0, 1]$:

$$r_{k+1} = r_k + \gamma_k z_k d_k, \qquad z_k = \alpha \lambda z_{k-1} + \nabla_r \tilde{J}(x_k, r_k).$$

Before analyzing the behavior of TD($\lambda$), we are going to study a related, deterministic algorithm: approximate value iteration. The analysis of TD($\lambda$) will be based on interpreting it as a stochastic approximation version of approximate value iteration.

2 Approximate Value Iteration

Define the operator $T_\lambda$ by

$$T_\lambda J = (1 - \lambda) \sum_{m=0}^{\infty} \lambda^m T^{m+1} J, \quad \text{for } \lambda \in [0, 1),$$

$$T_\lambda J = J^*, \quad \text{for } \lambda = 1.$$

We can show that $T_\lambda$ satisfies the same properties as $T$:

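Lemma 1

$$\| T_\lambda J - T_\lambda \bar{J} \|_\infty \le \frac{\alpha (1 - \lambda)}{1 - \alpha \lambda} \| J - \bar{J} \|_\infty, \quad \forall J, \bar{J},$$

$$J^* = T_\lambda J^*.$$

The motivation for $T_\lambda$ is as follows. Recall that, in value iteration, we have $J_{k+1} = T J_k$. However, we could also implement value iteration with $J_{k+1} = T^L J_k$, which amounts to an $L$-step lookahead. Finally, we can have an update that is a weighted average over all possible values of $L$; $J_{k+1} = T_\lambda J_k$ gives one such update.

In what follows, we are going to restrict attention to linear approximation architectures. Let

$$\tilde{J}(x, r) = \sum_{i=1}^{p} \phi_i(x) r_i, \qquad
\Phi = \begin{bmatrix}
\phi_1(1) & \phi_2(1) & \cdots & \phi_p(1) \\
\phi_1(2) & \phi_2(2) & \cdots & \phi_p(2) \\
\vdots & \vdots & & \vdots \\
\phi_1(n) & \phi_2(n) & \cdots & \phi_p(n)
\end{bmatrix}, \qquad \tilde{J} = \Phi r.$$

Moreover, we are going to consider only autonomous systems. We denote by $P$ the transition matrix associated with the system.

Let us introduce some notation. First, we have

$$D = \begin{bmatrix}
d(1) & & & 0 \\
& d(2) & & \\
& & \ddots & \\
0 & & & d(n)
\end{bmatrix},$$

where $d : S \to (0, 1)$ is a probability distribution over states. Define the weighted Euclidean norm and inner product

$$\| J \|_{2,D}^2 = J^T D J = \sum_{x \in S} d(x) J^2(x),$$

$$\langle J, \bar{J} \rangle_D = J^T D \bar{J} = \sum_{x \in S} d(x) J(x) \bar{J}(x).$$

For simplicity, we assume that $\phi_i$, $i = 1, \dots, p$, is an orthonormal basis of the subspace $\{ \Phi r \}$, i.e.,

$$\| \phi_i \|_{2,D} = 1 \quad \text{and} \quad \langle \phi_i, \phi_j \rangle_D = 0, \; i \ne j.$$

In matrix notation, we have $\Phi^T D \Phi = I$. We are going to use the following projection operator $\Pi$:

$$\Pi J = \Phi r_J, \quad \text{where} \quad r_J = \arg\min_r \| \Phi r - J \|_{2,D}.$$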
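For a linear architecture the gradient $\nabla_r \tilde{J}(x, r)$ is simply the feature vector $\phi(x)$, so the TD($\lambda$) recursion for an autonomous system can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the notes' own implementation: the function name `td_lambda`, the constant step size, the two-state chain, its costs, and the indicator features are all hypothetical choices made for the example.

```python
import random

def td_lambda(P, g, phi, alpha, lam, step_size, steps, seed=0):
    """TD(lambda) for an autonomous chain with J(x, r) = sum_i phi[x][i] * r[i].

    P: row-stochastic transition matrix, g: cost per state,
    phi: feature vector per state. All names are illustrative.
    """
    rng = random.Random(seed)
    states = range(len(P))
    r = [0.0] * len(phi[0])      # parameter vector r_k
    z = [0.0] * len(phi[0])      # eligibility vector z_k
    x = 0                        # arbitrary initial state (assumption)
    for _ in range(steps):
        x_next = rng.choices(states, weights=P[x])[0]
        value = sum(f * ri for f, ri in zip(phi[x], r))
        value_next = sum(f * ri for f, ri in zip(phi[x_next], r))
        # d_k = g(x_k) + alpha * J(x_{k+1}, r_k) - J(x_k, r_k)
        d = g[x] + alpha * value_next - value
        # z_k = alpha * lam * z_{k-1} + grad_r J(x_k, r_k); the gradient is phi[x]
        z = [alpha * lam * zi + f for zi, f in zip(z, phi[x])]
        # r_{k+1} = r_k + gamma_k * z_k * d_k (constant gamma_k is an assumption)
        r = [ri + step_size * zi * d for ri, zi in zip(r, z)]
        x = x_next
    return r

# Hypothetical example: deterministic cycle 0 -> 1 -> 0 with cost 1 in state 0.
P = [[0.0, 1.0], [1.0, 0.0]]
g = [1.0, 0.0]
phi = [[1.0, 0.0], [0.0, 1.0]]   # one indicator feature per state
r = td_lambda(P, g, phi, alpha=0.5, lam=0.5, step_size=0.1, steps=5000)
print(r)
```

In this example the exact costs-to-go solve $J(0) = 1 + 0.5\,J(1)$ and $J(1) = 0.5\,J(0)$, i.e. $J(0) = 4/3$ and $J(1) = 2/3$, and with indicator features the returned `r` should be close to these values. For a stochastic chain, a diminishing step size would normally replace the constant one used here.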

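Figure 1: Approximate Value Iteration. (Figure not reproduced. Axes: State 1, State 2, State 3. It shows the subspace $\tilde{J} = \Phi r$ containing $\Phi r_k$, $\Phi r_{k+1} = \Pi T_\lambda \Phi r_k$ and $\Pi J^*$, together with $T_\lambda \Phi r_k$ and $J^*$.)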

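We can characterize $\Pi$ explicitly by solving the associated minimization problem. We have

$$\begin{aligned}
r_J &= \arg\min_r \| \Phi r - J \|_{2,D}^2 \\
&= \arg\min_r (\Phi r - J)^T D (\Phi r - J) \\
&= \left( \Phi^T D \Phi \right)^{-1} \Phi^T D J \\
&= \langle \Phi, J \rangle_D,
\end{aligned}$$

where the last step uses $\Phi^T D \Phi = I$ and $\langle \Phi, J \rangle_D = \Phi^T D J$ denotes the vector with components $\langle \phi_i, J \rangle_D$. Hence, we have $\Pi J = \Phi \langle \Phi, J \rangle_D$.

Lemma 2 For all $J$,

$$\Pi J = \Phi \langle \Phi, J \rangle_D \tag{6}$$

$$\langle \Pi J, J - \Pi J \rangle_D = 0 \tag{7}$$

$$\| J \|_{2,D}^2 = \| \Pi J \|_{2,D}^2 + \| J - \Pi J \|_{2,D}^2 \tag{8}$$

Note that $\Phi r_{k+1} = \Pi T_\lambda \Phi r_k$. We know that the projection $\Pi$ is a nonexpansion, since by (8)

$$\| \Pi J - \Pi \bar{J} \|_{2,D} = \| \Pi (J - \bar{J}) \|_{2,D} \le \| J - \bar{J} \|_{2,D}.$$

Moreover, $T_\lambda$ is a contraction:

$$\| T_\lambda J - T_\lambda \bar{J} \|_\infty \le K \| J - \bar{J} \|_\infty.$$

However, the fact that $\Pi$ and $T_\lambda$ are a nonexpansion and a contraction with respect to different norms implies that convergence of approximate value iteration cannot be guaranteed by a contraction argument, as was the case with exact value iteration. Indeed, as illustrated in Figure 2, $\Pi T_\lambda$ is not necessarily a contraction with respect to any norm, and one can find counterexamples where TD($\lambda$) fails to converge.

As it turns out, there is a special choice of $D$ that ensures convergence of TD($\lambda$) for all $\lambda \in [0, 1]$. Before proving that, we need the following auxiliary result. First, we present two definitions involving Markov chains.

Definition 1 A Markov chain is called irreducible if, for every pair of states $x$ and $y$, there is $k$ such that $P^k(x, y) > 0$.

Definition 2 A state $x$ is called periodic if there is $m > 1$ such that $P^k(x, x) > 0$ iff $k = mn$, for some $n \in \{0, 1, 2, \dots\}$. A Markov chain is called aperiodic if none of its states is periodic.

Lemma 3 Let $P$ be a transition matrix, and assume that $P$ is irreducible and aperiodic. Then there exists a unique $\pi$ such that

$$\pi^T P = \pi^T \quad \text{and} \quad P^n \to \begin{bmatrix} \pi^T \\ \pi^T \\ \vdots \\ \pi^T \end{bmatrix} \text{ as } n \to \infty.$$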
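Lemma 3 suggests a simple way to compute the stationary distribution numerically: repeatedly apply $\pi^T \leftarrow \pi^T P$ until the iterates stop changing. The sketch below assumes an irreducible, aperiodic chain given as a list of rows; the function name, tolerance, and the two-state matrix are hypothetical and chosen only for illustration.

```python
def stationary_distribution(P, tol=1e-12, max_iter=100000):
    """Power iteration pi^T <- pi^T P, started from the uniform distribution.

    Assumes P is row-stochastic, irreducible and aperiodic (as in Lemma 3).
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        # (pi^T P)(y) = sum_x pi(x) P(x, y)
        new_pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
        if max(abs(a - b) for a, b in zip(new_pi, pi)) < tol:
            return new_pi
        pi = new_pi
    return pi

# Hypothetical two-state chain.
P = [[0.9, 0.1], [0.5, 0.5]]
pi = stationary_distribution(P)
print(pi)
```

For this chain the balance condition $0.1\,\pi(1) = 0.5\,\pi(2)$ gives $\pi = (5/6, 1/6)$, which the iteration should reproduce up to the tolerance.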

Figure 2: $T_\lambda \Phi r_k$ must be inside the smaller square and $\Pi T_\lambda \Phi r_k$ must be inside the circle, but $\Pi T_\lambda \Phi r_k$ may be outside the larger square and further away from $J^*$ than $\Phi r_k$. (Figure not reproduced.)

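Lemma 3 was proved in a problem set, for the special case where $P(x, x) > 0$ for some $x$. We are now poised to prove the following central result used to derive a convergent version of TD($\lambda$):

Lemma 4 Suppose that the transition matrix $P$ is irreducible and aperiodic. Let

$$D = \begin{bmatrix}
\pi_1 & & & 0 \\
& \pi_2 & & \\
& & \ddots & \\
0 & & & \pi_{|S|}
\end{bmatrix},$$

where $\pi$ is the stationary distribution associated with $P$. Then

$$\| P J \|_{2,D} \le \| J \|_{2,D}.$$

Proof:

$$\begin{aligned}
\| P J \|_{2,D}^2 &= \sum_{x \in S} \pi(x) \Big( \sum_{y} P(x, y) J(y) \Big)^2 \\
&\le \sum_{x \in S} \pi(x) \sum_{y} P(x, y) J^2(y) \\
&= \sum_{y} \sum_{x} \pi(x) P(x, y) J^2(y) \\
&= \sum_{y} \pi(y) J^2(y) \\
&= \| J \|_{2,D}^2.
\end{aligned}$$

The first inequality follows from Jensen's inequality, and the third equality holds because $\pi$ is a stationary distribution.

Based on the previous lemma, we can show that $T_\lambda$ is a contraction with respect to $\| \cdot \|_{2,D_\pi}$, where

$$D_\pi = \begin{bmatrix}
\pi_1 & & & 0 \\
& \pi_2 & & \\
& & \ddots & \\
0 & & & \pi_{|S|}
\end{bmatrix}$$

and $\pi$ is the stationary distribution of the transition matrix $P$. It follows that, if the projection $\Pi$ is performed with respect to $\| \cdot \|_{2,D_\pi}$, $\Pi T_\lambda$ becomes a contraction with respect to the same norm, and convergence of TD($\lambda$) is guaranteed.

Lemma 5

(i) $\| T J - T \bar{J} \|_{2,D_\pi} \le \alpha \| J - \bar{J} \|_{2,D_\pi}$

(ii) $\| T_\lambda J - T_\lambda \bar{J} \|_{2,D_\pi} \le \dfrac{\alpha (1 - \lambda)}{1 - \alpha \lambda} \| J - \bar{J} \|_{2,D_\pi}$

(iii) $\| \Pi T_\lambda J - \Pi T_\lambda \bar{J} \|_{2,D_\pi} \le \dfrac{\alpha (1 - \lambda)}{1 - \alpha \lambda} \| J - \bar{J} \|_{2,D_\pi}$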
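Lemma 4 and the contraction factor in Lemma 5 (ii) can be sanity-checked numerically. The sketch below is only an illustration: the two-state chain, the test vector, and all function names are hypothetical. It compares $\| P J \|_{2,D}$ with $\| J \|_{2,D}$ under the stationary weights and under uniform weights, and compares the closed-form factor $\alpha (1 - \lambda) / (1 - \alpha \lambda)$ with a partial sum of $(1 - \lambda) \sum_m \lambda^m \alpha^{m+1}$, which is what the definition of $T_\lambda$ yields when each $T^{m+1}$ contracts by $\alpha^{m+1}$.

```python
import math

def weighted_norm(J, d):
    """||J||_{2,D} = sqrt(sum_x d(x) J(x)^2)."""
    return math.sqrt(sum(dx * jx * jx for dx, jx in zip(d, J)))

def apply_P(P, J):
    """(P J)(x) = sum_y P(x, y) J(y)."""
    return [sum(pxy * jy for pxy, jy in zip(row, J)) for row in P]

def t_lambda_modulus(alpha, lam):
    """Closed-form contraction factor alpha (1 - lam) / (1 - alpha lam)."""
    return alpha * (1.0 - lam) / (1.0 - alpha * lam)

# Hypothetical chain with stationary distribution (5/6, 1/6).
P = [[0.9, 0.1], [0.5, 0.5]]
pi = [5.0 / 6.0, 1.0 / 6.0]
uniform = [0.5, 0.5]
J = [1.0, 0.0]

PJ = apply_P(P, J)
print(weighted_norm(PJ, pi), weighted_norm(J, pi))            # stationary weights
print(weighted_norm(PJ, uniform), weighted_norm(J, uniform))  # uniform weights

# Partial sum of (1 - lam) * sum_m lam^m * alpha^(m+1) versus the closed form.
alpha, lam = 0.9, 0.5
partial = sum((1.0 - lam) * lam ** m * alpha ** (m + 1) for m in range(500))
print(partial, t_lambda_modulus(alpha, lam))
```

With these particular numbers, $P$ is a nonexpansion for this $J$ under the stationary weights but not under the uniform weights, which illustrates why the choice $D = D_\pi$ matters.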

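Proof of (i): Since $T J = g + \alpha P J$ for an autonomous system,

$$\| T J - T \bar{J} \|_{2,D_\pi} = \| g + \alpha P J - (g + \alpha P \bar{J}) \|_{2,D_\pi} = \alpha \| P J - P \bar{J} \|_{2,D_\pi} \le \alpha \| J - \bar{J} \|_{2,D_\pi},$$

where the inequality follows from Lemma 4.

Theorem 1 Let $\Phi r_{k+1} = \Pi T_\lambda \Phi r_k$, with the projection $\Pi$ taken with respect to

$$D_\pi = \begin{bmatrix}
\pi_1 & & & 0 \\
& \pi_2 & & \\
& & \ddots & \\
0 & & & \pi_{|S|}
\end{bmatrix}.$$

Then $r_k \to r^*$ with

$$\| \Phi r^* - J^* \|_{2,D_\pi} \le K_{\alpha,\lambda} \| \Pi J^* - J^* \|_{2,D_\pi}.$$

Proof: Convergence follows from (iii). We have $\Phi r^* = \Pi T_\lambda \Phi r^*$ and $J^* = T_\lambda J^*$. Then

$$\begin{aligned}
\| \Phi r^* - J^* \|_{2,D_\pi}^2 &= \| \Phi r^* - \Pi J^* + \Pi J^* - J^* \|_{2,D_\pi}^2 \\
&= \| \Phi r^* - \Pi J^* \|_{2,D_\pi}^2 + \| \Pi J^* - J^* \|_{2,D_\pi}^2 \quad \text{(orthogonal)} \\
&= \| \Pi T_\lambda \Phi r^* - \Pi T_\lambda J^* \|_{2,D_\pi}^2 + \| \Pi J^* - J^* \|_{2,D_\pi}^2 \\
&\le \underbrace{\frac{\alpha^2 (1 - \lambda)^2}{(1 - \alpha \lambda)^2}}_{\gamma} \| \Phi r^* - J^* \|_{2,D_\pi}^2 + \| \Pi J^* - J^* \|_{2,D_\pi}^2.
\end{aligned}$$

Therefore

$$\| \Phi r^* - J^* \|_{2,D_\pi}^2 \le \frac{1}{1 - \gamma} \| \Pi J^* - J^* \|_{2,D_\pi}^2.$$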
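The iteration $\Phi r_{k+1} = \Pi T_\lambda \Phi r_k$ and the bound of Theorem 1 can be tried out on a small example. The following sketch makes several simplifying assumptions that are not in the notes: $T_\lambda$ is approximated by truncating its defining series, the architecture has a single basis function (so the projection coefficient is a scalar ratio rather than a matrix solve), and the chain, costs, and feature values are hypothetical.

```python
import math

def apply_T(P, g, alpha, J):
    """(T J)(x) = g(x) + alpha * sum_y P(x, y) J(y) for an autonomous system."""
    return [gx + alpha * sum(p * j for p, j in zip(row, J)) for gx, row in zip(g, P)]

def apply_T_lambda(P, g, alpha, lam, J, terms=200):
    """Truncated (1 - lam) * sum_m lam^m T^(m+1) J, for lam in [0, 1)."""
    out = [0.0] * len(J)
    TmJ = J
    for m in range(terms):
        TmJ = apply_T(P, g, alpha, TmJ)          # now equals T^(m+1) J
        weight = (1.0 - lam) * lam ** m
        out = [o + weight * t for o, t in zip(out, TmJ)]
    return out

def project(phi, d, J):
    """Scalar r minimizing the d-weighted squared error between phi * r and J."""
    num = sum(dx * f * j for dx, f, j in zip(d, phi, J))
    den = sum(dx * f * f for dx, f in zip(d, phi))
    return num / den

def weighted_norm(J, d):
    return math.sqrt(sum(dx * j * j for dx, j in zip(d, J)))

def approximate_value_iteration(P, g, alpha, lam, phi, d, iterations=200):
    """Iterate Phi r_{k+1} = Pi T_lambda Phi r_k for a single basis function."""
    r = 0.0
    for _ in range(iterations):
        r = project(phi, d, apply_T_lambda(P, g, alpha, lam, [f * r for f in phi]))
    return r

# Hypothetical example; pi is the stationary distribution of P.
P = [[0.9, 0.1], [0.5, 0.5]]
pi = [5.0 / 6.0, 1.0 / 6.0]
g, phi, alpha, lam = [1.0, 0.0], [1.0, 2.0], 0.9, 0.5

r_star = approximate_value_iteration(P, g, alpha, lam, phi, pi)
J_star = [0.0, 0.0]
for _ in range(2000):                            # exact value iteration for J*
    J_star = apply_T(P, g, alpha, J_star)

error = weighted_norm([f * r_star - j for f, j in zip(phi, J_star)], pi)
r_best = project(phi, pi, J_star)
best = weighted_norm([f * r_best - j for f, j in zip(phi, J_star)], pi)
kappa = alpha * (1.0 - lam) / (1.0 - alpha * lam)
print(error, best, best / math.sqrt(1.0 - kappa ** 2))
```

The printed error of the fixed point should lie between the best achievable error $\| \Pi J^* - J^* \|_{2,D_\pi}$ and that error scaled by $1 / \sqrt{1 - \gamma}$, with $\gamma = \alpha^2 (1 - \lambda)^2 / (1 - \alpha \lambda)^2$.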
