1 Explicit Explore or Exploit ($E^3$) Algorithm



2.997 Decision-Making in Large-Scale Systems, March 3
MIT, Spring 2004, Handout #2
Lecture Note 9

Last lecture, we studied the Q-learning algorithm:

$$Q_{t+1}(x_t, a_t) = Q_t(x_t, a_t) + \gamma_t \left[ g_{a_t}(x_t) + \alpha \min_{a'} Q_t(x_{t+1}, a') - Q_t(x_t, a_t) \right],$$

where $\gamma_t$ is the step size and $\alpha$ is the discount factor.

An important characteristic of Q-learning is that it is a model-free approach to learning an optimal policy in an MDP with unknown parameters. In other words, there is no explicit attempt to model or estimate costs and/or transition probabilities; the value of each action is estimated directly through the Q-factor.

Another approach to the same problem is to estimate the MDP parameters from the data and find a policy based on the estimated parameters. In this lecture, we will study one such algorithm: the Explicit Explore or Exploit ($E^3$) algorithm, proposed by Kearns and Singh [1].

The main ideas for $E^3$ are as follows:

- We divide the states into two sets: $N$ (known states) and $N^C$ (unknown states).
- Known states have been visited sufficiently many times to ensure that the estimates $\hat{P}_a(x, y)$ and $\hat{g}_a(x)$ are accurate with high probability.
- An unknown state is moved to $N$ when it has been visited at least $m$ times, for some number $m$.
- We introduce two MDPs, $\hat{M}_N$ and $M_N$. The MDP $\hat{M}_N$ is presented in Fig. 1. Its main characteristic is that the unknown states from the original MDP are merged into a recurrent state $x_0$ with cost $g_a(x_0) = g_{\max}$, $\forall a$. The other MDP, $M_N$, has the same structure as $\hat{M}_N$, but the estimated transition probabilities and costs are replaced with their true values.

We now introduce the algorithm.

1.1 Algorithm

We will first consider a version of $E^3$ which assumes knowledge of $J^*$; the assumption will be lifted later. The $E^3$ algorithm proceeds as follows.

1. Let $N = \emptyset$. Pick an arbitrary state $x_0$. Let $k = 0$.
2. If $x_k \notin N$, perform balanced wandering: $a_k$ = action chosen fewest times at state $x_k$.
3. If $x_k \in N$, then

   - attempt exploitation: if the optimal policy $\hat{\pi}_{\hat{M}_N}$ for $\hat{M}_N$ has $\hat{J}_{\hat{M}_N}(x_k) \le J^*(x_k) + \frac{\epsilon}{2}$, stop. Return $x_k$ and $\hat{\pi}_{\hat{M}_N}$.
   - attempt exploration: follow policy $\hat{\pi}_{S_0}$ for $T$ steps, where $T = \frac{1}{1-\alpha}$.

Figure 1: Markov Decision Process $\hat{M}_N$

Theorem 1 With probability no less than $1 - \delta$, $E^3$ will stop after a number of actions and computation time

$$\mathrm{poly}\left(\frac{1}{\epsilon}, \frac{1}{\delta}, |S|, \frac{1}{1-\alpha}, g_{\max}\right)$$

and return a state $x$ and policy $u$ such that $J_u(x) \le J^*(x) + \epsilon$.

1.2 Main Points

The main points used for proving Theorem 1 are as follows:

(i) There exists $m$ that is polynomially bounded such that, if all states in $N$ have been visited at least $m$ times, then $\hat{M}_N$ is sufficiently close to $M_N$.

(ii) Balanced wandering can only happen finitely many times.

(iii) (a) $J_{u,M_N}(x) \ge J_u(x)$;
      (b) $\|J_{u,M_N} - J_{u,\hat{M}_N}\|_\infty \le \frac{\epsilon}{2}$ with high probability.

(iv) If exploitation is not possible, then there is an exploration policy that reaches an unknown state after $T$ transitions with high probability.

To show the first main point, we consider the following lemma.

Lemma 1 Suppose a state $x$ has been visited at least $m$ times, with each action $a \in A_x$ having been executed at least $\lfloor m / |A_x| \rfloor$ times. Then, if

$$m = \mathrm{poly}\left(|S|, \frac{1}{1-\alpha}, T, g_{\max}, \frac{1}{\epsilon}, \log\frac{1}{\delta}, \mathrm{var}(g)\right),$$

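we have, with probability $1 - \delta$,

$$|\hat{P}_a(x, y) - P_a(x, y)| = O\left(\left(\frac{\epsilon (1-\alpha)}{|S|\, g_{\max}}\right)^2\right), \qquad |\hat{g}_a(x) - g_a(x)| = O\left(\left(\frac{\epsilon (1-\alpha)}{|S|\, g_{\max}}\right)^2\right).$$

The proof of this lemma is a direct application of the Chernoff bound, which states that, if $z_1, z_2, \ldots$ are i.i.d. Bernoulli random variables, then

$$\frac{1}{n} \sum_{i=1}^{n} z_i \to E z_1 \quad \text{(SLLN)}$$

and, for any $\delta > 0$,

$$P\left(\left|\frac{1}{n} \sum_{i=1}^{n} z_i - E z_1\right| > \delta\right) \le 2 \exp\left(-\frac{n \delta^2}{2}\right).$$

Main point (ii) follows from the pigeonhole principle: once more than $(m-1)|S|$ balanced wandering steps have been taken, at least one state must have been visited $m$ times and therefore becomes known.

Main point (iii)(a) follows from the next lemma.

Lemma 2 For every policy $u$, $J_{u,M_N}(x) \ge J_u(x)$, $\forall x$.

Proof: The claim is trivial for $x \notin N$, since $J_{u,M_N}(x) = \frac{g_{\max}}{1-\alpha} \ge J_u(x)$. If $x \in N$, take $T = \inf\{t : x_t \notin N\}$. Then

$$J_u(x) = E\left[\sum_{t=0}^{T-1} \alpha^t g_u(x_t) + \sum_{t=T}^{\infty} \alpha^t g_u(x_t)\right] \le E\left[\sum_{t=0}^{T-1} \alpha^t g_u(x_t) + \alpha^T \frac{g_{\max}}{1-\alpha}\right] = J_{u,M_N}(x).$$

To prove main point (iii)(b), we first introduce the following definition.

Definition 1 Let $M$ and $\hat{M}$ be two MDPs. Then $\hat{M}$ is a $\beta$-approximation to $M$ if, for all states $x, y$ and actions $a$,

$$|\hat{P}_a(x, y) - P_a(x, y)| \le \beta \quad \text{and} \quad |g_a(x) - \hat{g}_a(x)| \le \beta.$$

Lemma 3 If

$$T \ge \frac{1}{1-\alpha} \log \frac{2 g_{\max}}{\epsilon (1-\alpha)}$$

and $\hat{M}$ is an $O\left(\left(\frac{\epsilon (1-\alpha)}{|S|\, g_{\max}}\right)^2\right)$-approximation of $M$, then, $\forall u$,

$$\|J_{u,M} - J_{u,\hat{M}}\|_\infty \le \epsilon.$$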
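To get a feel for the quantities in the Chernoff bound and in the horizon condition of Lemma 3, the following Python sketch computes the smallest $n$ for which $2\exp(-n\delta^2/2)$ falls below a target failure probability, and the smallest integer $T$ satisfying the condition on $T$. The function names and the numeric inputs are illustrative assumptions and do not come from the lecture.

```python
import math


def chernoff_samples(dev, fail_prob):
    """Smallest n such that 2 * exp(-n * dev**2 / 2) <= fail_prob."""
    return math.ceil(2.0 * math.log(2.0 / fail_prob) / dev ** 2)


def horizon(alpha, g_max, eps):
    """Smallest integer T with T >= log(2*g_max / (eps*(1-alpha))) / (1-alpha)."""
    return math.ceil(math.log(2.0 * g_max / (eps * (1.0 - alpha))) / (1.0 - alpha))


# Illustrative values (assumptions): deviation 0.1, failure probability 0.05,
# discount factor 0.9, maximum cost 1, accuracy 0.1.
n = chernoff_samples(dev=0.1, fail_prob=0.05)
T = horizon(alpha=0.9, g_max=1.0, eps=0.1)

# Cost that can still be incurred after T steps: alpha^T * g_max / (1 - alpha).
tail = 0.9 ** T * 1.0 / (1.0 - 0.9)
print(n, T, tail)
```

With $T$ chosen this way, the discounted cost accumulated after step $T$ is at most $\epsilon/2$, which is why paths of length $T$ suffice when comparing the two MDPs.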

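Sketch of proof: Take a policy $u$ and a start state $x$. We consider paths of length $T$ starting from $x$:

$$p = (x_0, x_1, x_2, \ldots, x_T),$$

where $p$ denotes the path. Note that

$$J_{u,M}(x) = \sum_{p} P_{u,M}(p)\, g_u(p) + E\left[\sum_{t=T+1}^{\infty} \alpha^t g_u(x_t)\right],$$

where

$$P_{u,M}(p) = P_{u,M}(x_0, x_1)\, P_{u,M}(x_1, x_2) \cdots P_{u,M}(x_{T-1}, x_T)$$

is the probability of observing path $p$ and

$$g_u(p) = \sum_{t=0}^{T} \alpha^t g_u(x_t)$$

is the discounted cost associated with path $p$. By selecting $T$ properly, we can have

$$E\left[\sum_{t=T+1}^{\infty} \alpha^t g_u(x_t)\right] \le \frac{\alpha^T g_{\max}}{1-\alpha} \le \epsilon.$$

Recall that $|P_a(x, y) - \hat{P}_a(x, y)| \le \beta$. Fix a threshold $\gamma > 0$. We consider two kinds of paths:

(a) Paths containing at least one transition $(x_t, x_{t+1})$ such that $P_u(x_t, x_{t+1}) \le \gamma$; let $R$ denote the set of such paths. The total probability associated with such paths is less than or equal to $\gamma |S| T$, since the probability of any given small-probability transition is at most $\gamma$, from each state there are at most $|S|$ possible small-probability transitions, and there are $T$ transitions where this can occur. Therefore

$$\sum_{p \in R} P_u(p)\, g_u(p) \le \frac{g_{\max}}{1-\alpha} \sum_{p \in R} P_u(p) \le \gamma |S| T \frac{g_{\max}}{1-\alpha}.$$

We can follow the same principle with the MDP $\hat{M}$, where these transitions have probability at most $\gamma + \beta$, to conclude that

$$\sum_{p \in R} \hat{P}_u(p)\, \hat{g}_u(p) \le (\gamma + \beta) |S| T \frac{g_{\max}}{1-\alpha}.$$

Therefore, we have

$$\left|\sum_{p \in R} P_u(p)\, g_u(p) - \sum_{p \in R} \hat{P}_u(p)\, \hat{g}_u(p)\right| \le (\beta + 2\gamma) |S| T \frac{g_{\max}}{1-\alpha}.$$

(b) For all other paths, we have

$$(1 - \Delta) P_a(x_t, x_{t+1}) \le \hat{P}_a(x_t, x_{t+1}) \le (1 + \Delta) P_a(x_t, x_{t+1}),$$

where $\Delta = \frac{\beta}{\gamma}$. Therefore,

$$(1 - \Delta)^T P_u(p) \le \hat{P}_u(p) \le (1 + \Delta)^T P_u(p).$$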

Moreover, $|g_u(p) - \hat{g}_u(p)| \le T\beta$. Then, writing $J_{u,T}$ and $\hat{J}_{u,T}$ for the $T$-step costs under $M$ and $\hat{M}$,

$$(1 - \Delta)^T \left[J_{u,T} - \beta T\right] - \frac{\epsilon}{4} \le \hat{J}_{u,T} \le (1 + \Delta)^T \left[J_{u,T} + \beta T\right] + \frac{\epsilon}{4}.$$

The lemma follows by considering an appropriate choice of $\beta$.

Main point (iv) says that if exploitation is not possible, then exploration is. We show it by the following lemma.

Lemma 4 For any $x \in N$, one of the following must hold:

(a) there exists $u$ in $M_N$ such that $J^N_{u,T}(x) \le J^*_T(x) + \epsilon$, or

(b) there exists $u$ such that the probability that a walk of $T$ steps will terminate in $N^C$ exceeds $\frac{\epsilon (1-\alpha)}{g_{\max}}$.

Proof: Let $u$ be the policy that attains $J^*_T$. If $J^N_{u,T}(x) \le J^*_T(x) + \epsilon$, then we are done. Suppose that $J^N_{u,T}(x) > J^*_T(x) + \epsilon$. Writing $q$ for paths that stay in $N$ and $p$ for paths that leave $N$, we have

$$J^N_{u,T}(x) = \sum_{q} P^N_u(q)\, g^N_u(q) + \sum_{p} P^N_u(p)\, g^N_u(p)$$

and

$$J^*_T(x) = J_{u,T}(x) = \sum_{q} P_u(q)\, g_u(q) + \sum_{p} P_u(p)\, g_u(p).$$

Paths that stay in $N$ have the same probability and cost in both MDPs, so the sums over $q$ cancel. Therefore

$$J^N_{u,T}(x) - J_{u,T}(x) = \underbrace{\sum_{p} P^N_u(p)\, g^N_u(p)}_{\le \frac{g_{\max}}{1-\alpha} \sum_{p} P^N_u(p)} - \underbrace{\sum_{p} P_u(p)\, g_u(p)}_{\ge 0} > \epsilon,$$

which implies

$$\frac{g_{\max}}{1-\alpha} \sum_{p} P^N_u(p) > \epsilon \quad \Longrightarrow \quad \sum_{p} P^N_u(p) > \frac{\epsilon (1-\alpha)}{g_{\max}}.$$

In order to complete the proof of Theorem 1 from the four lemmas above, we have to consider the probabilities of two forms of failure:

- failure to stop the algorithm with a near-optimal policy
- failure to perform enough exploration in a timely fashion

The first point is addressed by Lemmas 1, 2 and 3, which establish that, if the algorithm stops, then with high probability the policy produced is near-optimal. The second point follows from Lemma 4, which shows that each attempt to explore is successful with some non-negligible probability. By applying the Chernoff bound, it can be shown that, after a number of attempts that is polynomial in the quantities of interest, exploration will occur with high probability.
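As a rough illustration of the constructions used in this lecture, the sketch below implements the balanced-wandering rule, the merging of unknown states into a single absorbing state with cost $g_{\max}$, and value iteration on the resulting MDP. The data layout (`P[a][x][y]`, `g[a][x]`), the helper names, and the random test MDP are all assumptions made for illustration. Fed with the true parameters the construction yields $M_N$; fed with estimates it would yield $\hat{M}_N$. It is not a full $E^3$ implementation.

```python
import random


def balanced_wandering_action(counts_at_x):
    """Index of the action tried fewest times at the current state (ties: lowest index)."""
    return min(range(len(counts_at_x)), key=lambda a: counts_at_x[a])


def known_state_mdp(P, g, known, g_max):
    """Merge all states outside `known` into one absorbing state with cost g_max.

    P[a][x][y] are transition probabilities, g[a][x] are costs, and `known`
    lists the indices of the known states. The returned MDP has
    len(known) + 1 states; the last one is the absorbing state x0.
    """
    known_set = set(known)
    unknown = [y for y in range(len(P[0])) if y not in known_set]
    P_N, g_N = [], []
    for a in range(len(P)):
        rows = []
        for x in known:
            row = [P[a][x][y] for y in known]
            row.append(sum(P[a][x][y] for y in unknown))  # mass leaving N
            rows.append(row)
        rows.append([0.0] * len(known) + [1.0])  # x0 is absorbing
        P_N.append(rows)
        g_N.append([g[a][x] for x in known] + [g_max])
    return P_N, g_N


def value_iteration(P, g, alpha, iters=1000):
    """Approximate the optimal discounted cost-to-go by repeated Bellman updates."""
    n = len(P[0])
    J = [0.0] * n
    for _ in range(iters):
        J = [min(g[a][x] + alpha * sum(P[a][x][y] * J[y] for y in range(n))
                 for a in range(len(P)))
             for x in range(n)]
    return J


# Hypothetical random MDP: 2 actions, 4 states, costs in [0, 1), so g_max = 1.
random.seed(0)
A, S, alpha, g_max = 2, 4, 0.9, 1.0
P_true = []
for a in range(A):
    rows = []
    for x in range(S):
        w = [random.random() + 0.01 for _ in range(S)]
        rows.append([v / sum(w) for v in w])
    P_true.append(rows)
g_true = [[random.random() for _ in range(S)] for _ in range(A)]

known = [0, 1, 3]  # state 2 is treated as unknown
P_N, g_N = known_state_mdp(P_true, g_true, known, g_max)
J_N = value_iteration(P_N, g_N, alpha)
J = value_iteration(P_true, g_true, alpha)
```

Because the merged state charges the maximum cost forever, the costs-to-go computed on the known states of this MDP should be no smaller than those of the original MDP, in line with Lemma 2.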

References

[1] M. Kearns and S. Singh, Near-Optimal Reinforcement Learning in Polynomial Time, Machine Learning, Volume 49, Issue 2, pp , Nov
