Batch RL Via Least Squares Policy Iteration

1 Batch RL Via Least Squares Policy Iteration
Alan Fern
* Based in part on slides by Ronald Parr

2 Overview
- Motivation
- LSPI
  - Derivation from LSTD
  - Experimental results

3 Online versus Batch RL
- Online RL: integrate data collection and optimization
  - Select actions in the environment and at the same time update parameters based on each observed (s, a, s', r)
- Batch RL: decouple data collection and optimization
  - First generate/collect experience in the environment, giving a data set of state-action-reward-state tuples {(s_i, a_i, r_i, s_i')}
  - Use the fixed set of experience to optimize/learn a policy
- Online vs. Batch:
  - Batch algorithms are often more data efficient and stable
  - Batch algorithms ignore the exploration-exploitation problem, and do their best with the data they have

4 Batch RL Motivation
- There are many applications that naturally fit the batch RL model
- Medical Treatment Optimization:
  - Input: collection of treatment episodes for an ailment, giving sequences of observations and actions, including outcomes
  - Output: a treatment policy, ideally better than current practice
- Emergency Response Optimization:
  - Input: collection of emergency response episodes giving movement of emergency resources before, during, and after 911 calls
  - Output: emergency response policy

5 LSPI
- LSPI is a model-free batch RL algorithm
  - Learns a linear approximation of the Q-function
  - Stable and efficient
  - Never diverges or gives meaningless answers
- LSPI can be applied to a dataset regardless of how it was collected
  - But garbage in, garbage out.
Least-Squares Policy Iteration, Michail Lagoudakis and Ronald Parr, Journal of Machine Learning Research (JMLR), Vol. 4, 2003, pp

6 Notation
- S: state space, s: individual states
- R(s,a): reward for taking action a in state s
- $\gamma$: discount factor
- V(s): state value function
- P(s' | s,a) = T(s,a,s'): transition function
- Q(s,a): state-action value function
- Policy: $\pi(s) \rightarrow a$
- Objective: Maximize expected, total, discounted reward $E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$

7 Projection Approach to Approximation
- Recall the standard Bellman equation:
  $V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^*(s') \right]$
  or equivalently $V^* = T[V^*]$, where T[.] is the Bellman operator
- Recall from value iteration, the sub-optimality of a value function can be bounded in terms of the Bellman error: $\lVert V - T[V] \rVert_\infty$
- This motivates trying to find an approximate value function with small Bellman error
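The Bellman backup and Bellman error can be sketched in a few lines. The two-state, two-action MDP below is hypothetical (its rewards and transitions are made up for illustration and are not from the slides); the sketch only shows that repeated exact backups shrink the max-norm Bellman error.

```python
# Hypothetical two-state, two-action MDP; all numbers are illustrative assumptions.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
R = [[0.0, 1.0], [2.0, 0.0]]            # R[s][a]: reward
P = [[[1.0, 0.0], [0.0, 1.0]],          # P[s][a][s2]: transition probability
     [[0.0, 1.0], [1.0, 0.0]]]


def bellman_backup(V):
    """Return T[V]: one exact Bellman backup of the value vector V."""
    return [
        max(R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in STATES)
            for a in ACTIONS)
        for s in STATES
    ]


def bellman_error(V):
    """Max-norm Bellman error ||V - T[V]||_inf."""
    return max(abs(v - tv) for v, tv in zip(V, bellman_backup(V)))


# Exact value iteration: record the Bellman error before each backup.
V = [0.0, 0.0]
errors = []
for _ in range(200):
    errors.append(bellman_error(V))
    V = bellman_backup(V)
```

Because T is a contraction in max-norm, each entry of `errors` is at most 0.9 times the previous one.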

8 Projection Approach to Approximation
- Suppose that we have a space of representable value functions
  - E.g. the space of linear functions over given features
- Let $\Pi$ be a projection operator for that space
  - Projects any value function (in or outside of the space) to the "closest" value function in the space
- "Fixed Point" Bellman Equation with approximation: $\hat V^* = \Pi(T[\hat V^*])$
  - Depending on the space, this will have a small Bellman error
- LSPI will attempt to arrive at such a value function
  - Assumes linear approximation and least-squares projection

10 Projected Value Iteration
- Naïve Idea: try computing the projected fixed point using VI
- Exact VI (iterate Bellman backups): $V^{i+1} = T[V^i]$
- Projected VI (iterate projected Bellman backups): $\hat V^{i+1} = \Pi(T[\hat V^i])$
  - $T[\hat V^i]$: exact Bellman backup (produced value function)
  - $\Pi$: projects the exact Bellman backup to the closest function in our restricted function space

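11 Example: Projected Bellman Backup
- Restrict space to linear functions over a single feature $\phi$: $\hat V(s) = w\,\phi(s)$
- Suppose just two states s1 and s2 with $\phi(s_1) = 1$, $\phi(s_2) = 2$
- Suppose the exact backup of $\hat V^i$ gives: $T[\hat V^i](s_1) = 2$, $T[\hat V^i](s_2) = 2$
- Can we represent this exact backup in our linear space? No
[Figure: the exact backup plotted over s1 and s2, with $\phi(s_1) = 1$, $\phi(s_2) = 2$]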

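12 Example: Projected Bellman Backup
- Restrict space to linear functions over a single feature $\phi$: $\hat V(s) = w\,\phi(s)$
- Suppose just two states s1 and s2 with $\phi(s_1) = 1$, $\phi(s_2) = 2$
- Suppose the exact backup of $\hat V^i$ gives: $T[\hat V^i](s_1) = 2$, $T[\hat V^i](s_2) = 2$
- The backup can't be represented via our linear function, so we project it: $\hat V^{i+1} = \Pi(T[\hat V^i])$
- The projected backup is just the least-squares fit to the exact backup: $w = \frac{1 \cdot 2 + 2 \cdot 2}{1^2 + 2^2} = 1.2$, so $\hat V^{i+1}(s) = 1.2\,\phi(s)$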
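A quick check of this projection, assuming an unweighted least-squares fit over the two states: the fit comes out at w = 6/5 = 1.2. (The value 4/3 ≈ 1.333 would instead be the weight that minimizes the max-norm error.) The helper name `least_squares_weight` is hypothetical.

```python
def least_squares_weight(features, targets):
    """Weight w minimizing sum_s (w * features[s] - targets[s])**2 for one feature."""
    numerator = sum(f * t for f, t in zip(features, targets))
    denominator = sum(f * f for f in features)
    return numerator / denominator


phi = [1.0, 2.0]       # phi(s1), phi(s2) from the example
backup = [2.0, 2.0]    # exact backup values at s1 and s2

w = least_squares_weight(phi, backup)
projected = [w * f for f in phi]   # projected backup evaluated at s1 and s2
```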

13 Problem: Stability
- Exact value iteration stability is ensured by the contraction property of Bellman backups: $V^{i+1} = T[V^i]$
- Is the "projected" Bellman backup a contraction: $\hat V^{i+1} = \Pi(T[\hat V^i])$?

14 Example: Stability Problem [Bertsekas & Tsitsiklis 1996]
- Problem: Most projections lead to backups that are not contractions and are unstable
[Figure: two states, s1 and s2]
- Rewards all zero, single action, $\gamma = 0.9$: V* = 0
- Consider linear approx. w/ single feature $\phi$ with weight w: $\hat V(s) = w\,\phi(s)$
- Optimal w = 0 since V* = 0

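15 Example: Stability Problem
[Figure: states s1 and s2 with $\phi(s_1) = 1$, $\phi(s_2) = 2$]
- $\hat V^i(s_1) = w^i$, $\hat V^i(s_2) = 2w^i$, where $w^i$ is the weight value at iteration i
- From $\hat V^i$ perform the projected backup for each state:
  $T[\hat V^i](s_1) = \gamma \hat V^i(s_2) = 1.8\,w^i$
  $T[\hat V^i](s_2) = \gamma \hat V^i(s_2) = 1.8\,w^i$
- Can't be represented in our space, so find the $w^{i+1}$ that gives the least-squares approx. to the exact backup
- After some math we can get: $w^{i+1} = \frac{1 \cdot 1.8 + 2 \cdot 1.8}{1^2 + 2^2}\, w^i = 1.2\gamma\, w^i = 1.08\, w^i$
- What does this mean?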

16 Example: Stability Problem
[Figure: $\hat V^0$, $\hat V^1$, $\hat V^2$, $\hat V^3$ plotted over states s1 and s2, growing with iteration #]
- Each iteration of Bellman backup makes the approximation worse!
- Even for this trivial problem "projected" VI diverges.
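The divergence can be reproduced numerically. This sketch assumes the chain implied by the backups in the example (s1 moves to s2, s2 moves to itself, all rewards zero) and an unweighted least-squares projection. Under those assumptions each projected backup multiplies the weight by 6γ/5, which is 1.08 for γ = 0.9, so any nonzero starting weight grows without bound.

```python
GAMMA = 0.9
PHI = [1.0, 2.0]  # phi(s1), phi(s2)


def projected_backup(w):
    """One projected Bellman backup for the assumed chain s1 -> s2 -> s2, zero reward."""
    # Exact backup: both states move to s2, so T[V](s) = gamma * V(s2) = gamma * 2w.
    target = [GAMMA * w * PHI[1], GAMMA * w * PHI[1]]
    # Unweighted least-squares fit of w_new * phi to the exact backup.
    numerator = sum(f * t for f, t in zip(PHI, target))
    denominator = sum(f * f for f in PHI)
    return numerator / denominator


weights = [1.0]  # arbitrary nonzero starting weight
for _ in range(10):
    weights.append(projected_backup(weights[-1]))

# Growth factor between successive iterations.
ratios = [later / earlier for earlier, later in zip(weights, weights[1:])]
```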

17 Understanding the Problem
- What went wrong?
  - Exact Bellman backups reduce error in max-norm
  - Least squares (= projection) is non-expansive in L2 norm
  - May increase max-norm distance
- Conclusion: Alternating value iteration and function approximation is risky business
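A small numeric illustration of the norm mismatch, reusing the single feature φ = (1, 2) from the earlier example. Projecting the vector (2, 2) shrinks its L2 distance to the zero vector (which projects to itself) but increases its max-norm distance. The helper names are hypothetical.

```python
import math


def project(phi, v):
    """Least-squares projection of v onto the span of the single feature vector phi."""
    w = sum(f * x for f, x in zip(phi, v)) / sum(f * f for f in phi)
    return [w * f for f in phi]


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def max_norm(v):
    return max(abs(x) for x in v)


phi = [1.0, 2.0]
v = [2.0, 2.0]
pv = project(phi, v)  # approximately [1.2, 2.4]

l2_before, l2_after = l2_norm(v), l2_norm(pv)       # L2 norm does not grow
max_before, max_after = max_norm(v), max_norm(pv)   # max-norm grows
```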

18 Overview
- Motivation
- LSPI
  - Derivation from Least-Squares Temporal Difference Learning
  - Experimental results

19 How does LSPI fix these?
- LSPI performs approximate policy iteration
  - PI involves policy evaluation and policy improvement
  - Uses a variant of least-squares temporal difference learning (LSTD) for approx. policy evaluation [Bradtke & Barto 96]
- Stability:
  - LSTD directly solves for the fixed point of the approximate Bellman equation for policy values
  - With singular-value decomposition (SVD), this is always well defined
- Data efficiency:
  - LSTD finds the best approximation for any finite data set
  - Makes a single pass over the data for each policy
  - Can be implemented incrementally

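20 OK, What's LSTD?
- Least Squares Temporal Difference Learning
- Assumes linear value function approximation of K features: $\hat V(s) = \sum_{k=1}^{K} w_k\, \phi_k(s)$
- The $\phi_k$ are arbitrary feature functions of states
- Some vector notation:
  $\hat V = [\hat V(s_1), \ldots, \hat V(s_n)]^\top$
  $w = [w_1, \ldots, w_K]^\top$
  $\phi_k = [\phi_k(s_1), \ldots, \phi_k(s_n)]^\top$
  $\Phi = [\phi_1 \; \cdots \; \phi_K]$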

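21 Deriving LSTD
- $\hat V = \Phi w$ assigns a value to every state
  $\Phi = \begin{bmatrix} \phi_1(s_1) & \phi_2(s_1) & \cdots \\ \phi_1(s_2) & \phi_2(s_2) & \cdots \\ \vdots & \vdots & \end{bmatrix}$ (rows: # states, columns: K basis functions)
- $\hat V$ is a linear function in the column space of $\phi_1, \ldots, \phi_K$, that is, $\hat V = w_1 \phi_1 + \cdots + w_K \phi_K$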

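22 Suppose we know the value of the policy, $V^\pi$
- Want: $\Phi w \approx V^\pi$
- Least-squares weights minimize the squared error: $w = (\Phi^\top \Phi)^{-1} \Phi^\top V^\pi$
  - $(\Phi^\top \Phi)^{-1} \Phi^\top$ is sometimes called the pseudoinverse
- The least-squares projection is then $\hat V = \Phi w = \Phi (\Phi^\top \Phi)^{-1} \Phi^\top V^\pi$
  - $\Phi (\Phi^\top \Phi)^{-1} \Phi^\top$ is the textbook least-squares projection operator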
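For reference, the least-squares weights follow from setting the gradient of the squared error to zero. This is the standard normal-equations derivation, written here under the assumption that $\Phi$ has full column rank so that $\Phi^\top \Phi$ is invertible.

```latex
\begin{aligned}
J(w) &= \lVert \Phi w - V^{\pi} \rVert_2^2 \\
\nabla_w J(w) &= 2\,\Phi^{\top}\left(\Phi w - V^{\pi}\right) = 0 \\
\Phi^{\top}\Phi\, w &= \Phi^{\top} V^{\pi} \\
w &= \left(\Phi^{\top}\Phi\right)^{-1}\Phi^{\top} V^{\pi}
\end{aligned}
```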

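23 But we don't know $V^\pi$
- Recall the fixed-point equation for policies:
  $V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^\pi(s')$, or in matrix form $V^\pi = R + \gamma P V^\pi$, where
  $R = [R(s_1, \pi(s_1)), \ldots, R(s_n, \pi(s_n))]^\top$ and
  $P = \begin{bmatrix} P(s_1 \mid s_1, \pi(s_1)) & \cdots & P(s_n \mid s_1, \pi(s_1)) \\ \vdots & & \vdots \\ P(s_1 \mid s_n, \pi(s_n)) & \cdots & P(s_n \mid s_n, \pi(s_n)) \end{bmatrix}$
- Will solve a projected fixed-point equation: $\hat V^\pi = \Pi(R + \gamma P \hat V^\pi)$
- Substituting the least-squares projection into this gives: $\Phi w = \Phi (\Phi^\top \Phi)^{-1} \Phi^\top (R + \gamma P \Phi w)$
- Solving for w: $w = (\Phi^\top \Phi - \gamma \Phi^\top P \Phi)^{-1} \Phi^\top R$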
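The "solving for w" step can be spelled out as follows. This is a sketch of the algebra, assuming $\Phi^\top \Phi$ and $\Phi^\top \Phi - \gamma \Phi^\top P \Phi$ are both invertible; the first line is the projected fixed-point equation and the second left-multiplies it by $\Phi^\top$.

```latex
\begin{aligned}
\Phi w &= \Phi\left(\Phi^{\top}\Phi\right)^{-1}\Phi^{\top}\left(R + \gamma P \Phi w\right) \\
\Phi^{\top}\Phi\, w &= \Phi^{\top}\left(R + \gamma P \Phi w\right) \\
\left(\Phi^{\top}\Phi - \gamma\, \Phi^{\top} P \Phi\right) w &= \Phi^{\top} R \\
w &= \left(\Phi^{\top}\Phi - \gamma\, \Phi^{\top} P \Phi\right)^{-1} \Phi^{\top} R
\end{aligned}
```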

24 Almost there
- $w = (\Phi^\top \Phi - \gamma \Phi^\top P \Phi)^{-1} \Phi^\top R$
- Matrix to invert is only K x K
- But:
  - Expensive to construct the matrix (e.g. P is |S| x |S|)
  - We don't know P
  - We don't know R
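As a sanity check of the closed form when the model is known, the sketch below applies it to the two-state stability example (assumed chain s1 to s2, s2 to itself, zero rewards, γ = 0.9, single feature φ = (1, 2)). With K = 1 every matrix product reduces to a scalar. The solve returns w = 0, which matches V* = 0, on the same problem where projected value iteration diverges.

```python
GAMMA = 0.9
PHI = [1.0, 2.0]                  # single feature: phi(s1), phi(s2)
P = [[0.0, 1.0], [0.0, 1.0]]      # assumed chain: s1 -> s2, s2 -> s2
R = [0.0, 0.0]                    # all rewards are zero
N = len(PHI)

# With K = 1 the K x K quantities are scalars.
phi_t_phi = sum(f * f for f in PHI)                                   # Phi^T Phi
p_phi = [sum(P[i][j] * PHI[j] for j in range(N)) for i in range(N)]   # P Phi
phi_t_p_phi = sum(PHI[i] * p_phi[i] for i in range(N))                # Phi^T P Phi
phi_t_r = sum(PHI[i] * R[i] for i in range(N))                        # Phi^T R

B = phi_t_phi - GAMMA * phi_t_p_phi   # 5 - 0.9 * 6
w = phi_t_r / B                       # fixed-point weight
```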

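25 Using Samples for $\Phi$
- Suppose we have state transition samples of the policy running in the MDP: {(s_i, a_i, r_i, s_i')}
- Idea: Replace enumeration of states with sampled states
  $\hat\Phi = \begin{bmatrix} \phi_1(s_1) & \phi_2(s_1) & \cdots \\ \phi_1(s_2) & \phi_2(s_2) & \cdots \\ \vdots & \vdots & \end{bmatrix}$ (rows: sampled states, columns: K basis functions)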

26 Using Samples for R
- Suppose we have state transition samples of the policy running in the MDP: {(s_i, a_i, r_i, s_i')}
- Idea: Replace enumeration of rewards with sampled rewards
  $\hat R = [r_1, r_2, \ldots]^\top$ (one entry per sample)

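27 Using Samples for $P\Phi$
- Idea: Replace the expectation over next states with sampled next states
  $\widehat{P\Phi} = \begin{bmatrix} \phi_1(s_1') & \phi_2(s_1') & \cdots \\ \phi_1(s_2') & \phi_2(s_2') & \cdots \\ \vdots & \vdots & \end{bmatrix}$ (rows: sampled next states s' from (s,a,r,s'), columns: K basis functions)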

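28 Putting it Together
- LSTD needs to compute: $w = (\Phi^\top \Phi - \gamma \Phi^\top P \Phi)^{-1} \Phi^\top R = B^{-1} b$
- The hard part of which is B, the K x K matrix: $B = \Phi^\top \Phi - \gamma \Phi^\top (P\Phi)$, with $b = \Phi^\top R$ (from the previous slide)
- Both B and b can be computed incrementally for each (s,a,r,s') sample (initialize to zero):
  $B_{ij} \leftarrow B_{ij} + \phi_i(s)\,\phi_j(s) - \gamma\, \phi_i(s)\,\phi_j(s')$
  $b_i \leftarrow b_i + r\,\phi_i(s)$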

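29 LSTD Algorithm
- Collect data by executing trajectories of the current policy
- For each (s,a,r,s') sample:
  $B_{ij} \leftarrow B_{ij} + \phi_i(s)\,\phi_j(s) - \gamma\, \phi_i(s)\,\phi_j(s')$
  $b_i \leftarrow b_i + r\,\phi_i(s)$
- $w \leftarrow B^{-1} b$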
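The incremental updates can be sketched in plain Python. This is a minimal illustration, not the reference implementation: the names `solve`, `lstd`, and `one_hot` are hypothetical, the linear system is solved by Gaussian elimination rather than the SVD mentioned earlier (so a singular B is not handled), and the two-state chain used as data is made up.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (A nonsingular)."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def lstd(samples, phi, k, gamma):
    """Accumulate B and b over (s, a, r, s_next) samples, then return w = B^{-1} b."""
    B = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    for s, _a, r, s_next in samples:
        f, f_next = phi(s), phi(s_next)
        for i in range(k):
            for j in range(k):
                B[i][j] += f[i] * (f[j] - gamma * f_next[j])
            b[i] += r * f[i]
    return solve(B, b)


def one_hot(s):
    """Tabular features for a two-state problem (an assumption for this example)."""
    return [1.0 if s == i else 0.0 for i in range(2)]


# Hypothetical chain: state 0 -> 1 with reward 0, state 1 -> 1 with reward 1.
samples = [(0, 0, 0.0, 1), (1, 0, 1.0, 1)]
w = lstd(samples, one_hot, k=2, gamma=0.9)
```

With tabular features the fixed point is the exact policy value, so w should be close to V(0) = 9 and V(1) = 10 here.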

30 LSTD Summary
- Does $O(K^2)$ work per datum
  - Linear in the amount of data
- Approaches the model-based answer in the limit
- Finding the fixed point requires inverting a matrix
- Fixed point almost always exists
- Stable; efficient

31 Approximate Policy Iteration with LSTD
- Policy Iteration: iterates between policy improvement and policy evaluation
- Idea: use LSTD for approximate policy evaluation in PI
Start with random weights w (i.e. value function)
Repeat Until Convergence
  $\pi(s)$ = greedy($\hat V(s, w)$)  // policy improvement
  Evaluate $\pi$ using LSTD
    - Generate sample trajectories of $\pi$
    - Use LSTD to produce new weights w (w gives an approx. value function of $\pi$)

32 What Breaks?
- No way to execute the greedy policy without a model
- Approximation is biased by the current policy
  - We only approximate values of states we see when executing the current policy
  - LSTD is a weighted approximation toward those states
- Can result in a learn-forget cycle of policy iteration
  - Drive off the road; learn that it's bad
  - New policy never does this; forgets that it's bad
- Not truly a batch method
  - Data must be collected from the current policy for LSTD

33 LSPI
- LSPI is similar to the previous loop but replaces LSTD with a new algorithm, LSTDQ
- LSTD: produces a value function
  - Requires samples from the policy under consideration
- LSTDQ: produces a Q-function for the current policy
  - Can learn a Q-function for a policy from any (reasonable) set of samples (sometimes called an off-policy method)
  - No need to collect samples from the current policy
- Disconnects policy evaluation from data collection
  - Permits reuse of data across iterations!
  - Truly a batch method.

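34 Implementing LSTDQ
- Both LSTD and LSTDQ compute: $B = \Phi^\top \Phi - \gamma \Phi^\top (P\Phi)$
- But LSTDQ basis functions are indexed by actions: $\hat Q_w(s,a) = \sum_k w_k\, \phi_k(s,a)$
  - This defines the greedy policy: $\pi_w(s) = \arg\max_a \hat Q_w(s,a)$
- For each (s,a,r,s') sample:
  $B_{ij} \leftarrow B_{ij} + \phi_i(s,a)\,\phi_j(s,a) - \gamma\, \phi_i(s,a)\,\phi_j(s', \pi_w(s'))$
  $b_i \leftarrow b_i + r\,\phi_i(s,a)$
  where $\pi_w(s') = \arg\max_a \hat Q_w(s', a)$
- $w \leftarrow B^{-1} b$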
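A minimal sketch of the LSTDQ update. It is not the authors' Matlab code: the function names, the plain Gaussian-elimination solver (no SVD, so B is assumed nonsingular), and the tiny two-state "stay or switch" MDP with one-hot state-action features are all assumptions made for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (A nonsingular)."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def greedy(w, phi, s, actions):
    """Greedy action under Q_w(s, a) = w . phi(s, a); ties go to the first action."""
    return max(actions, key=lambda a: sum(wi * fi for wi, fi in zip(w, phi(s, a))))


def lstdq(samples, phi, k, gamma, w_policy, actions):
    """Evaluate the greedy policy of w_policy from arbitrary (s, a, r, s_next) samples."""
    B = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, greedy(w_policy, phi, s_next, actions))
        for i in range(k):
            for j in range(k):
                B[i][j] += f[i] * (f[j] - gamma * f_next[j])
            b[i] += r * f[i]
    return solve(B, b)


ACTIONS = [0, 1]  # hypothetical MDP: 0 = stay, 1 = switch state


def phi(s, a):
    """One-hot features over the four (state, action) pairs."""
    f = [0.0] * 4
    f[2 * s + a] = 1.0
    return f


# One sample per (s, a); reward 1 for any action taken in state 1, else 0.
SAMPLES = [(0, 0, 0.0, 0), (0, 1, 0.0, 1), (1, 0, 1.0, 1), (1, 1, 1.0, 0)]
w_stay = [1.0, 0.0, 1.0, 0.0]  # weights whose greedy policy is "always stay"
w = lstdq(SAMPLES, phi, 4, 0.9, w_stay, ACTIONS)
```

Note that the samples cover actions the evaluated policy never takes; only the next-state features depend on the policy, which is what lets the same data be reused for any policy.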

35 Running LSPI
- There is a Matlab implementation available!
1. Collect a database of (s,a,r,s') experiences (this is the magic step)
2. Start with random weights (= random policy)
3. Repeat
   - Evaluate the current policy against the database
     - Run LSTDQ to generate a new set of weights
     - New weights imply a new Q-function and hence a new policy
   - Replace the current weights with the new weights
   Until convergence
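The loop above can be sketched end to end in plain Python. Everything here is an illustrative assumption rather than the Matlab implementation: the function names, the Gaussian-elimination solver (no SVD), the zero initial weights (chosen instead of random weights so the run is reproducible), and the tiny two-state "stay or switch" MDP with one-hot state-action features.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (A nonsingular)."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def greedy(w, phi, s, actions):
    """Greedy action under Q_w(s, a) = w . phi(s, a); ties go to the first action."""
    return max(actions, key=lambda a: sum(wi * fi for wi, fi in zip(w, phi(s, a))))


def lstdq(samples, phi, k, gamma, w_policy, actions):
    """Q-function weights for the greedy policy of w_policy, from a fixed sample set."""
    B = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, greedy(w_policy, phi, s_next, actions))
        for i in range(k):
            for j in range(k):
                B[i][j] += f[i] * (f[j] - gamma * f_next[j])
            b[i] += r * f[i]
    return solve(B, b)


def lspi(samples, phi, k, gamma, actions, max_iters=20, tol=1e-9):
    """Repeat LSTDQ on the same database until the weights stop changing."""
    w = [0.0] * k  # zero start for reproducibility; the slides use random weights
    for _ in range(max_iters):
        w_new = lstdq(samples, phi, k, gamma, w, actions)
        if max(abs(x - y) for x, y in zip(w, w_new)) < tol:
            return w_new
        w = w_new
    return w


ACTIONS = [0, 1]  # hypothetical MDP: 0 = stay, 1 = switch state


def phi(s, a):
    """One-hot features over the four (state, action) pairs."""
    f = [0.0] * 4
    f[2 * s + a] = 1.0
    return f


# Fixed database: one sample per (s, a); reward 1 for acting in state 1, else 0.
SAMPLES = [(0, 0, 0.0, 0), (0, 1, 0.0, 1), (1, 0, 1.0, 1), (1, 1, 1.0, 0)]
w = lspi(SAMPLES, phi, 4, 0.9, ACTIONS)
policy = [greedy(w, phi, s, ACTIONS) for s in (0, 1)]
```

On this toy problem the learned policy switches out of state 0 and stays in state 1, and the same four samples are reused in every iteration.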

36 Results: Bicycle Riding
- Watch a random controller operate the bike
- Collect ~40,000 (s,a,r,s') samples
- Pick 20 simple feature functions (×5 actions)
- Make 5-10 passes over the data (PI steps)
- Reward was based on distance to goal + goal achievement
- Result: Controller that balances and rides to the goal

37 Bicycle Trajectories

38 What about Q-learning?
- Ran Q-learning with the same features
- Used experience replay for data efficiency

39 Q-learning Results

40 LSPI Robustness

41 Some key points
- LSPI is a batch RL algorithm
  - Can generate trajectory data any way you want
  - Induces a policy based on global optimization over the full dataset
- Very stable, with no parameters that need tweaking

42 So, what's the bad news?
- LSPI does not address the exploration problem
  - It decouples data collection from policy optimization
  - This is often not a major issue, but can be in some cases
- $K^2$ can sometimes be big
  - Lots of storage
  - Matrix inversion can be expensive
- Bicycle needed "shaping" rewards
- Still haven't solved feature selection (an issue for all machine learning, but RL seems even more sensitive)
