Machine Learning Reinforcement Learning


1 Machine Learning: Reinforcement Learning, Lesson 2

2 Machine Learning

3 Machine Learning. Supervised Learning: a teacher tells the learner what to remember. Reinforcement Learning: the environment provides hints to the learner. Unsupervised Learning: the learner discovers on its own.

4 Reinforcement Learning. What makes RL different from other ML paradigms? There is no supervisor, only a reward signal. Feedback is delayed, not instantaneous. Sequential data (not i.i.d. data; time dependency). The agent's actions affect the subsequent data it receives. Machine Learning 2017, Computer Science & Engineering, University of Ioannina, ML2 (4)

5 A multi-disciplinary field: Reinforcement Learning (RL) draws on Artificial Intelligence, Psychology, Automatic Control and Operations Research, Neuroscience, and Statistics.

6 Learning (Psychology). Reinforcement for training animals. Negative reinforcement: pain and hunger. Positive reinforcement: pleasure and food. Operant conditioning (Ivan Pavlov, 1927): the process by which humans and animals learn to behave in such a way as to obtain rewards and avoid punishments. Computational neuroscience, Hebbian learning (1949): synaptic weights between neurons are reinforced by simultaneous activation.

7 Reinforcement Learning. Learning about, from, and while interacting with an external environment. Learning what to do so as to maximize a numerical reward signal. The learner is not told what actions to take, but must discover them by trying them out and seeing what the reward is. A stochastic optimization over time. An important difference from classification: there are no examples of correct answers; the learner must try things.

8 Examples of Reinforcement Learning applications. How should a robot behave so as to optimize its performance? (Robotics) How to automate the motion of a drone? (Control Theory) How to make a good chess-playing program? (Artificial Intelligence)

9 Reinforcement Learning. Richard S. Sutton, Professor and iCORE chair, Department of Computing Science, University of Alberta, Canada (Psychology & Computer Science). Reinforcement Learning: An Introduction, MIT Press; 1st edition, with a 2nd edition in progress (2017): http://incompleteideas.net/sutton/book/bookdraft2016sep.pdf. Citation counts (Google Scholar today) exceed even those of Bishop's book.

10 Reinforcement Learning: total citation counts (Google Scholar, 2015).

11 Agents with Intelligent Behavior. Game playing: a sequence of moves to win a game (e.g. Chess, Backgammon). Robot in a maze: a sequence of actions to find a goal (e.g. a position or an object). Autonomous vehicle control. Learning to choose actions to optimize factory output (procedures). Recommendation systems (lists). Routing problems: medical trials / packets / ad placement (and many more).

12 Intelligent Behavior. An agent receives sensory inputs and takes actions in an environment. Assume the agent receives rewards (or penalties/losses). The goal is to maximize the rewards it receives (or to minimize the losses). Choosing actions that minimize losses is equivalent to behaving optimally.

14 Decision Theory. Deals with the problem of making optimal decisions: a decision or action that minimizes an expected loss. Assume $k$ possible actions $a_1, \ldots, a_k$, and assume the world can be in one of $m$ different states $s_1, \ldots, s_m$. If we take action $a_j$ while the world is in state $s_i$, a loss $\lambda_{ij}$ is incurred. Given all observed data $D$ and prior knowledge $B$, our beliefs about the state of the world are summarized by $p(s_i \mid D, B)$. The optimal action is the one expected to minimize the loss: $a^* = \arg\min_{a_j} \sum_{i=1}^{m} \lambda_{ij}\, p(s_i \mid D, B)$.
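As an illustration, this decision rule can be sketched in a few lines of Python (the loss matrix and belief vector below are made-up toy numbers):

```python
def optimal_action(loss, belief):
    """Pick the action j minimizing sum_i loss[i][j] * p(s_i | D, B)."""
    n_actions = len(loss[0])
    expected = [sum(belief[i] * loss[i][j] for i in range(len(belief)))
                for j in range(n_actions)]
    return min(range(n_actions), key=expected.__getitem__)

# 2 states x 2 actions; the agent believes state 0 is far more likely.
loss = [[0.0, 10.0],
        [5.0,  1.0]]
belief = [0.9, 0.1]
print(optimal_action(loss, belief))  # action 0 (expected losses: 0.5 vs 9.1)
```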

15 Decision Theory (II). The same problem appears as Bayesian sequential decision theory (statistics), optimal control theory (engineering), and Reinforcement Learning (computer science). The optimal action is the one expected to minimize the loss: $a^* = \arg\min_{a_j} \sum_{i=1}^{m} \lambda_{ij}\, p(s_i \mid D, B)$. This is how to make a single decision; how do we make a sequence of decisions in order to achieve a long-term goal? We assumed we know the losses for each action-state pair; we need a model for how the observed data $D$ relate to the states.

16 The Agent-Environment Interface. Agent and environment interact at discrete time steps $t = 0, 1, 2, \ldots$ At step $t$ the agent observes the state $s_t \in S$, produces an action $a_t \in A(s_t)$, gets the resulting reward $r_{t+1}$, and moves to the resulting next state $s_{t+1}$: $\ldots\, s_t \xrightarrow{a_t} r_{t+1}, s_{t+1} \xrightarrow{a_{t+1}} r_{t+2}, s_{t+2}\, \ldots$

17 Markov Decision Processes (MDPs). A model of the agent-environment system satisfying the Markov property. An MDP is a tuple $\{S, A, P, r, \gamma\}$: $S$ is a finite set of states; $A$ is a finite set of actions; $P$ is the state transition probability function, $P_{ss'}^{a} = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$; $r$ is the reward function, $r_s^a = E[r_{t+1} \mid s_t = s, a_t = a]$; $\gamma \in [0, 1]$ is the discount factor.

18 States capture whatever information is available to the agent at step $t$ about its environment. These are structures built up over time from sequences of sensations, memories, etc. Markov property: we can throw away the history once the state is known: $\Pr(s_{t+1} = s', r_{t+1} = r \mid s_t, a_t) = \Pr(s_{t+1} = s', r_{t+1} = r \mid s_0, a_0, r_1, \ldots, s_t, a_t)$.

19 Rewards specify what the agent needs to achieve, not HOW to achieve it. The goal in an MDP is not to maximize the immediate reward but to maximize the long-term accumulated reward. Average future return: $R_t = \lim_{k \to \infty} \frac{1}{k} \sum_{i=1}^{k} r_{t+i}$; this assumes a reward in the future is as valuable as a reward now (all rewards have the same weight). Discounted future return ($\gamma$: discount factor): $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$.
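The discounted return is conveniently computed backwards over a finite reward sequence; a minimal sketch:

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = sum_{k>=0} gamma^k * r_{t+k+1}, accumulated from the episode end."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```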

20 Policy ($\pi$). It is the agent's decision mechanism: a map from states (or situations) to the actions that could be taken, given as a probability distribution $\pi(s, a) = P(a_t = a \mid s_t = s)$, i.e. a conditional probability distribution of actions over states. If in state $s$, then with probability defined by $\pi$ take action $a$.

21 Dynamics. The dynamics specify how the state changes given the actions of the agent. Model-based: the dynamics are known or are estimated. Model-free: we do not know the dynamics of the MDP. In practice the dynamics are unknown, and so the state representation should be such that it is easily predictable from neighboring states.

22 Value Functions. These are defined with respect to a specific policy $\pi$. State value function, used to determine how good it is for the agent to be in a given state: $V^\pi(s) = E_\pi[R_t \mid s_t = s] = E_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right]$. State-action value function, how good it is to perform an action from a given state and then follow policy $\pi$: $Q^\pi(s, a) = E_\pi[R_t \mid s_t = s, a_t = a] = E_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\right]$.

23 Multiple sources of (probabilistic) uncertainty: in a state one is allowed to select different actions, and the system may transition to different states from $s$. The return, defined in terms of rewards, is therefore a random variable, which we seek to maximize in expectation: $V^\pi(s) = E_\pi[R_t \mid s_t = s]$ and $Q^\pi(s, a) = E_\pi[R_t \mid s_t = s, a_t = a]$.

24 Bellman Equations. Richard Bellman, 1957 (600 papers, 35 books, 7 monographs). A fundamental property of value functions is that they satisfy a set of recursive consistency equations: write the value of a decision problem in terms of the payoff from some initial choice and the value of the remaining decision problem, breaking the optimization problem into simpler subproblems.

25 The value function is decomposed into 2 parts: the immediate reward $r_{t+1}$, and the discounted value of the successor state $\gamma V(s_{t+1})$. Bellman Expectation Equation (V): $V^\pi(s) = E_\pi[R_t \mid s_t = s] = E_\pi[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s] = E_\pi[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s]$.

26 The state-action value function can be similarly decomposed into 2 parts. Bellman Expectation Equation (Q): $Q^\pi(s, a) = E_\pi[R_t \mid s_t = s, a_t = a] = E_\pi[r_{t+1} + \gamma Q^\pi(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a]$.

27 Looking inside the Bellman Expectation Equation (one-step backup over actions and transitions): $V^\pi(s) = \sum_{a} \pi(s, a) \left[ r_s^a + \gamma \sum_{s'} P_{ss'}^{a} V^\pi(s') \right]$.

28 Looking inside the Bellman Expectation Equation (Q version): $Q^\pi(s, a) = r_s^a + \gamma \sum_{s'} P_{ss'}^{a} \sum_{a'} \pi(s', a')\, Q^\pi(s', a')$.

29 Relation between state and state-action value functions: $V^\pi(s) = \sum_{a} \pi(s, a)\, Q^\pi(s, a)$.

30 Relation between state and state-action value functions: $Q^\pi(s, a) = r_s^a + \gamma \sum_{s'} P_{ss'}^{a} V^\pi(s')$.

31 Optimal Policies and Values. Optimal policy $\pi^*$: there is possibly more than one optimal policy, but there is always at least one. Optimal state value function: $V^*(s) = \max_\pi V^\pi(s)$. If a policy $\pi$ is such that in each state it selects an action that maximizes value, then $\pi$ is an optimal policy: $V^*(s) = \max_a \left[ r_s^a + \gamma \sum_{s'} P_{ss'}^{a} V^*(s') \right]$.

32 Optimal Policies and Values. Optimal state-action value function: $Q^*(s, a) = \max_\pi Q^\pi(s, a)$, with $V^*(s) = \max_a Q^*(s, a)$ and $Q^*(s, a) = r_s^a + \gamma \sum_{s'} P_{ss'}^{a} \max_{a'} Q^*(s', a')$.

33 Relation between Optimal State & State-Action value functions. The optimal value functions are recursively related by the Bellman optimality equations: $V^*(s) = \max_a Q^*(s, a)$ and $Q^*(s, a) = r_s^a + \gamma \sum_{s'} P_{ss'}^{a} V^*(s')$.

34 From Optimal Value functions to Optimal Policies. An optimal policy can be found from $V^*(s)$ and the model dynamics by using $V^*(s)$ greedily: $\pi^*(s) = \arg\max_a \left[ r_s^a + \gamma \sum_{s'} P_{ss'}^{a} V^*(s') \right]$. An optimal policy can also be found by maximizing over $Q^*(s, a)$, i.e. $\pi^*(s) = \arg\max_{a \in A} Q^*(s, a)$. There is always a deterministic optimal policy for any MDP; if we know $Q^*(s, a)$, we have the optimal policy.

35 Solving the MDP. Given a known model of the environment as an MDP (transition dynamics, reward probabilities), we apply Dynamic Programming algorithms for computing optimal policies. This is accomplished by obtaining the optimal value function: $\pi(s) = \arg\max_{a \in A(s)} Q(s, a)$.

36 Solving a finite-state MDP. Assume an MDP with finite state and action spaces ($|S| < \infty$, $|A| < \infty$). Repeatedly update the estimated value function using the Bellman equation. Algorithm 1: Value Iteration. 1. For each state $s$, initialize $V(s) = 0$. 2. Repeat until convergence: for every state $s$, update $V(s) \leftarrow \max_a \left[ r_s^a + \gamma \sum_{s'} P_{ss'}^{a} V(s') \right]$.

37 Algorithm 1: Value Iteration. Two possible ways of performing the update. Synchronously: first compute the new value $V(s)$ for every state, and then overwrite all the old values with the new values (Bellman backup operator). Asynchronously: loop over the states (in some order) and update the values one at a time. At the end $V$ will converge to $V^*$, and the optimal policy is found by: $\pi^*(s) = \arg\max_{a \in A} \left[ r_s^a + \gamma \sum_{s' \in S} P_{ss'}^{a} V^*(s') \right]$.
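A minimal sketch of synchronous value iteration on a made-up two-state MDP (`P[s][a][s2]` are transition probabilities and `R[s][a]` expected immediate rewards; both are purely illustrative):

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    while True:
        # synchronous Bellman backup: compute all new values before overwriting
        V_new = [max(R[s][a] + gamma * sum(P[s][a][s2] * V[s2]
                                           for s2 in range(n_states))
                     for a in range(n_actions))
                 for s in range(n_states)]
        done = max(abs(V_new[s] - V[s]) for s in range(n_states)) < tol
        V = V_new
        if done:
            break
    # greedy policy extraction from the converged values
    pi = [max(range(n_actions),
              key=lambda a: R[s][a] + gamma * sum(P[s][a][s2] * V[s2]
                                                  for s2 in range(n_states)))
          for s in range(n_states)]
    return V, pi

# Toy MDP: in state 0, action 1 moves to the rewarding absorbing state 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
R = [[0.0, 0.0], [1.0, 1.0]]
V, pi = value_iteration(P, R)
print(pi)  # [1, 0]
```

With γ = 0.9 the absorbing state is worth 1/(1 − γ) = 10, and state 0 is worth 0.9 · 10 = 9.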

49 Algorithm 2: Policy Iteration. Combines policy evaluation and policy improvement to obtain a sequence of monotonically improving policies and value functions. Compute the value function of the current policy, and then update the policy using the current value function. At the end, $V$ and $\pi$ will converge to the optimum $V^*, \pi^*$.

50 Algorithm 2: Policy Iteration. 1. Initialize the policy $\pi$ randomly. 2. Policy Evaluation: for each state, until convergence, with $a = \pi(s)$: $V^\pi(s) \leftarrow r_s^{\pi(s)} + \gamma \sum_{s'} P_{ss'}^{\pi(s)} V^\pi(s')$. 3. Policy Improvement: $\pi_{\text{new}}(s) = \arg\max_{a \in A} \left[ r_s^a + \gamma \sum_{s'} P_{ss'}^{a} V^\pi(s') \right]$.
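The evaluation/improvement loop can be sketched as follows on a made-up two-state MDP (all names, the fixed number of evaluation sweeps, and the toy model are illustrative):

```python
def policy_iteration(P, R, gamma=0.9, eval_iters=200):
    """Alternate iterative policy evaluation and greedy improvement."""
    n_states, n_actions = len(P), len(P[0])
    pi = [0] * n_states                      # step 1: (arbitrary) initial policy
    while True:
        # Step 2: policy evaluation with a = pi(s)
        V = [0.0] * n_states
        for _ in range(eval_iters):
            V = [R[s][pi[s]] + gamma * sum(P[s][pi[s]][s2] * V[s2]
                                           for s2 in range(n_states))
                 for s in range(n_states)]
        # Step 3: greedy policy improvement
        pi_new = [max(range(n_actions),
                      key=lambda a: R[s][a] + gamma * sum(P[s][a][s2] * V[s2]
                                                          for s2 in range(n_states)))
                  for s in range(n_states)]
        if pi_new == pi:                     # policy is stable: optimal
            return V, pi
        pi = pi_new

# Toy MDP: in state 0, action 1 moves to the rewarding absorbing state 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
R = [[0.0, 0.0], [1.0, 1.0]]
V, pi = policy_iteration(P, R)
print(pi)  # [1, 0]
```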

51 Algorithm 2: Policy Iteration (Q-version). 1. Initialize the policy $\pi$ randomly. 2. Repeat until convergence: (a) evaluate the current policy, $Q^\pi(s, a) = r_s^a + \gamma \sum_{s'} P_{ss'}^{a}\, Q^\pi(s', \pi(s'))$; (b) for each state $s$: $\pi_{\text{new}}(s) = \arg\max_a Q^\pi(s, a)$.

61 Value Iteration vs. Policy Iteration. Both are standard algorithms for solving MDPs, and there is no general agreement on which algorithm is better. Policy iteration is often very fast for small MDPs and converges within very few iterations. For MDPs with large state spaces, value iteration may be preferred, since solving for $V^\pi$ explicitly would involve a large system of linear equations that could be difficult.

62 Reinforcement Learning Methods: Monte Carlo methods; Temporal Difference (TD) methods; Value Function Approximation methods.

63 Monte Carlo Methods. Learn value functions and discover optimal policies. They do not assume knowledge of a model ($P$, $R$): Monte Carlo methods can solve the RL problem by averaging sample returns. Learn from experience: sample sequences of states, actions, and rewards $(s, a, r)$.

64 What does Dynamic Programming perform?

65 Monte Carlo update rule: $V(s_t) \leftarrow V(s_t) + \frac{1}{n(s_t)} \left[ R_t - V(s_t) \right]$, or with a constant step size, $V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$, where $n(s_t)$ is the number of first visits to state $s_t$ and $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$ is the actual return following $s_t$, so that $V^\pi(s) = E[R_t \mid s_t = s]$.
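A sketch of first-visit Monte Carlo evaluation, assuming each episode is given as a list of (state, reward received on leaving that state) pairs:

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=0.9):
    """First-visit MC: V(s) <- V(s) + (1/n(s)) [R_t - V(s)]."""
    V, n = defaultdict(float), defaultdict(int)
    for episode in episodes:
        G, returns = 0.0, []
        for s, r in reversed(episode):       # accumulate returns backwards
            G = r + gamma * G
            returns.append((s, G))
        returns.reverse()
        seen = set()
        for s, G in returns:
            if s not in seen:                # count each state once per episode
                seen.add(s)
                n[s] += 1
                V[s] += (G - V[s]) / n[s]    # running average of sample returns
    return dict(V)

print(mc_evaluate([[('a', 1.0), ('b', 1.0)]], gamma=0.5))  # {'a': 1.5, 'b': 1.0}
```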

66 Monte Carlo Policy Evaluation: improve the policy greedily with respect to the estimated values, $\pi(s) = \arg\max_a \left[ r_s^a + \gamma \sum_{s'} P_{ss'}^{a} V(s') \right]$.

67 Monte Carlo Policy Evaluation (Q): $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_t - Q(s_t, a_t) \right]$, with $R_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T$ for an episode ending at time $T$.

68 Temporal Difference (TD) Learning. TD generic update rule: $V(s_t) \leftarrow V(s_t) + \alpha \left[ v_t - V(s_t) \right]$, where $v_t$ is a target estimate of the return from $s_t$.

69 Temporal Difference (TD) Learning. Monte Carlo update: $V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$, where $R_t$ is the actual return from $s_t$ to the end of the episode. TD(0) update: $V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$, where $r_{t+1} + \gamma V(s_{t+1})$ is an estimate of the return according to the current policy, and the bracketed term $\Delta V$ is the TD error.
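A single TD(0) backup is one line of arithmetic; a minimal sketch (the table and the numbers are illustrative):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0): V(s) <- V(s) + alpha [r + gamma V(s') - V(s)]."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = {'a': 0.0, 'b': 1.0}
td0_update(V, 'a', 0.5, 'b')   # target = 0.5 + 0.9*1.0 = 1.4, TD error = 1.4
print(V['a'])                  # ~0.14, i.e. alpha * TD error
```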

70 Learning rate. The step size $\alpha$ controls how strongly each new sample moves the estimate: $V(s) \leftarrow (1 - \alpha)\, V(s) + \alpha \left[ r + \gamma \max_{a'} V(s') \right]$.

71 Advantages of TD Learning. TD methods do not require a model of the environment, only experience. TD methods can be fully incremental: you can learn before knowing the final outcome (less memory, less peak computation), and you can learn without the final outcome (from incomplete sequences).

72 Q-Learning (Watkins, Ph.D. Thesis, Cambridge Univ., 1989). An off-policy greedy method: evaluate or improve one policy while acting using another. It learns the state-action value function $Q(s, a)$: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$.
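The Q-learning backup can be sketched as follows (the states, action names, and table layout are illustrative):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Watkins: Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q[(s_next, b)] for b in actions)   # greedy target, off-policy
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)
q_learning_update(Q, 'left', 'go', 1.0, 'goal', actions=['go', 'stay'])
print(Q[('left', 'go')])  # 0.1 * (1.0 + 0.9*0 - 0) = 0.1
```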

75 Exploration vs. Exploitation. One could always choose the action with the highest Q-value, but the Q-function is initially unreliable: the agent needs to explore until it is optimal. A trade-off is needed between exploration and exploitation. Exploitation: make the best decision given current information. Exploration: gather more information in order to make better decisions.

76 Exploration Strategies. $\varepsilon$-greedy: with probability $\varepsilon$ choose one action at random (uniformly), and choose the best action with probability $1 - \varepsilon$ ($\varepsilon$ is gradually reduced). Probabilistic (softmax): use the probabilities $P(a \mid s) = \frac{e^{Q(s,a)}}{\sum_{b \in A} e^{Q(s,b)}}$, smoothed by a temperature $T$: $P(a \mid s) = \frac{e^{Q(s,a)/T}}{\sum_{b \in A} e^{Q(s,b)/T}}$.
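Both strategies can be sketched as follows, assuming `Q` is a dict keyed by (state, action) pairs:

```python
import math
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps explore uniformly, otherwise exploit."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def softmax_policy(Q, s, actions, T=1.0):
    """P(a|s) proportional to exp(Q(s,a)/T); lower T means greedier."""
    prefs = [math.exp(Q.get((s, a), 0.0) / T) for a in actions]
    Z = sum(prefs)
    return [p / Z for p in prefs]

Q = {('s', 'a'): 1.0, ('s', 'b'): 0.0}
print(epsilon_greedy(Q, 's', ['a', 'b'], eps=0.0))  # 'a' (pure exploitation)
```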

77 Extension: SARSA. An on-policy TD method: evaluate or improve the current policy used for control. SARSA takes exploration into account in the update, using the action actually chosen (e.g. $\varepsilon$-greedy): $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$.
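The SARSA backup differs from the Q-learning one only in the target term; a sketch (states and actions are illustrative):

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: the target uses the action a' actually chosen in s'."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

Q = defaultdict(float)
sarsa_update(Q, 'left', 'go', 1.0, 'goal', 'stay')
print(Q[('left', 'go')])  # 0.1 * (1.0 + 0.9*0 - 0) = 0.1
```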

79 SARSA vs. Q-learning

80 Extension: Eligibility Traces. Keep a record of previously visited states (and actions). Eligibility traces give lookahead with less memory: multiple states are updated at once, and states get credit according to their traces: $e_t(s) = \gamma \lambda\, e_{t-1}(s) + 1$ if $s = s_t$, and $e_t(s) = \gamma \lambda\, e_{t-1}(s)$ otherwise; then $Q(s, a) \leftarrow Q(s, a) + \alpha\, \delta_t\, e_t(s)$ with TD error $\delta_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)$.
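A simplified sketch of one traced update (the trace-cutting that Watkins' Q(λ) performs after exploratory actions is omitted for brevity; all names are illustrative):

```python
from collections import defaultdict

def q_lambda_step(Q, e, s, a, r, s_next, actions,
                  alpha=0.1, gamma=0.9, lam=0.8):
    """All traced pairs share the TD error; traces decay by gamma*lambda."""
    delta = r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)]
    e[(s, a)] += 1.0                      # accumulating trace for the visited pair
    for key in list(e):
        Q[key] += alpha * delta * e[key]  # credit according to the trace
        e[key] *= gamma * lam

Q, e = defaultdict(float), defaultdict(float)
q_lambda_step(Q, e, 's0', 'a', 1.0, 's1', actions=['a'])
print(Q[('s0', 'a')])  # 0.1 * delta * 1 = 0.1, since delta = 1
```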

81 Actor-Critic methods

82 Value Function Approximation. Function approximation allows complex environments: the Q-function table could be too big (or infinitely big!). Describe a state by a feature vector $\phi(s) = (\phi_1(s), \ldots, \phi_n(s))^T$. Then the value function can be any regression model, e.g. a linear regression model: $V(s) = w^T \phi(s) = \sum_{i=1}^{n} w_i \phi_i(s)$, or $Q(s, a) = w^T \phi(s, a)$.

83 Value Function Approximation. There are many function approximators, e.g.: linear models (linear combinations of features), neural networks, decision trees, kernel machines, Gaussian processes, …

84 Value function approximation with Temporal Difference Learning. Assume a linear model $Q(s, a; w) = w^T \phi(s, a) = \sum_{i=1}^{n} w_i \phi_i(s, a)$, with $n$ basis functions and linear weights $w_i$. TD aims to achieve an approximation of $Q^\pi$ measured by the mean squared error (MSE): $E(w) = \frac{1}{2N} \sum_{t=1}^{N} \left( Q^\pi(s_t, a_t) - Q(s_t, a_t; w) \right)^2$.

85 TD Learning with function approximation. For the MSE above, use stochastic gradient descent for on-line learning, replacing the unknown $Q^\pi$ with the TD target: $w_{\text{new}} = w_{\text{old}} + \eta \left( r_{t+1} + \gamma\, w_{\text{old}}^T \phi(s_{t+1}, a_{t+1}) - w_{\text{old}}^T \phi(s_t, a_t) \right) \phi(s_t, a_t)$.
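The stochastic-gradient update above, written for a linear value model over feature vectors, can be sketched as (the feature vectors below are illustrative one-hot features):

```python
def td_linear_update(w, phi_s, r, phi_s_next, alpha=0.1, gamma=0.9):
    """Semi-gradient TD(0): w <- w + alpha [r + gamma w.phi(s') - w.phi(s)] phi(s)."""
    v = sum(wi * f for wi, f in zip(w, phi_s))
    v_next = sum(wi * f for wi, f in zip(w, phi_s_next))
    delta = r + gamma * v_next - v            # TD error under the current weights
    return [wi + alpha * delta * f for wi, f in zip(w, phi_s)]

w = [0.0, 0.0]
w = td_linear_update(w, [1.0, 0.0], 1.0, [0.0, 1.0])
print(w)  # [0.1, 0.0]: delta = 1.0, so only the active feature's weight moves
```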

86 Least-Squares Temporal Difference Learning (LSTD) (Bradtke & Barto 1996; Boyan 2002). Idea: construct a regression problem. Assume $N$ samples obtained by the current policy, $D = \{(s_t, r_{t+1}, s_{t+1})\}_{t=1}^{N}$. Temporal error: $e_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$. Total error: $E(w) = \frac{1}{N} \sum_{t=1}^{N} e_t^2$.

87 Least-Squares Temporal Difference Learning (LSTD) (Bradtke & Barto 1996; Boyan 2002). Regression problem: $\min_w E(w) = \min_w \frac{1}{N} \left\| R + \gamma \Phi' w - \Phi w \right\|^2$, where $\Phi = (\phi(s_1), \ldots, \phi(s_N))^T$, $\Phi' = (\phi(s'_1), \ldots, \phi(s'_N))^T$, and $R = (r_1, \ldots, r_N)^T$.

88 Regression problem (continued). Obtaining the estimation (2 stages): (1) first find a fixed-point approximation $u^*$ of $\Phi w$: $\min_u \frac{1}{N} \left\| u - (R + \gamma \Phi' w) \right\|^2 \Rightarrow u^* = R + \gamma \Phi' w$.

89 Regression problem (continued). (2) This is approximately equal to a fixed point of the true equation $\Phi w = R + \gamma \Phi' w$: $w = \left( \Phi^T (\Phi - \gamma \Phi') \right)^{-1} \Phi^T R$, or equivalently $w = A^{-1} b$ with $A = \Phi^T (\Phi - \gamma \Phi')$ and $b = \Phi^T R$.

90 Disadvantages of LSTD. What is an appropriate number of basis functions? LSTD requires a large number of samples in order to obtain good estimates of $w$. Storing and inverting the $n \times n$ matrix $A$ is not feasible if $n$ is large. Extension (Kolter & Ng 2009): temporal difference with a regularized fixed point, $\min_u \frac{1}{N} \left\| u - (R + \gamma \Phi' w) \right\|^2 + \beta \|u\|_1$ ($l_1$ and $l_2$ regularization); in the $l_2$ case this gives $w = (A + \beta I)^{-1} b$.
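A self-contained sketch of batch LSTD: accumulate $A$ and $b$ from samples, then solve $(A + \beta I)\, w = b$. The tiny ridge term plays the role of the regularization mentioned above, and the Gaussian elimination is only there to avoid external dependencies (the feature map and samples are illustrative):

```python
def lstd(samples, phi, n, gamma=0.9, ridge=1e-6):
    """samples: iterable of (s, r, s_next); phi(s) returns a length-n feature list."""
    # A = sum phi(s) (phi(s) - gamma*phi(s'))^T + ridge*I,  b = sum r * phi(s)
    A = [[ridge if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [0.0] * n
    for s, r, s2 in samples:
        f, f2 = phi(s), phi(s2)
        for i in range(n):
            b[i] += r * f[i]
            for j in range(n):
                A[i][j] += f[i] * (f[j] - gamma * f2[j])
    # solve A w = b by Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, n):
            m = A[row][col] / A[col][col]
            for j in range(col, n):
                A[row][j] -= m * A[col][j]
            b[row] -= m * b[col]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w

# One-feature chain: state 0 yields reward 1 and ends in a zero-feature terminal.
phi = lambda s: [1.0] if s == 0 else [0.0]
w = lstd([(0, 1.0, 1)], phi, n=1)
print(w)  # close to [1.0], the expected return of state 0
```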

91 LSTDQ: application to control problems (Lagoudakis & Parr 2003). Off-line LSTD: $w = A^{-1} b$ with $A = \Phi^T (\Phi - \gamma \Phi')$ and $b = \Phi^T R$. On-line scheme: $A \leftarrow A + \phi(s_i, a_i) \left( \phi(s_i, a_i) - \gamma\, \phi(s'_i, \pi(s'_i)) \right)^T$; iteratively visit states and update the policy.

92 LSTDQ: application to control problems (Lagoudakis & Parr 2003)

93 Challenges in Reinforcement Learning. Feature/reward design can be very involved: online learning, continuous features, delayed rewards. Parameters can have a large effect on learning speed. Realistic environments can have partial observability. Non-stationary environments (more realistic). Working with continuous state and action spaces. Multiple agents.

94 Applications with Reinforcement Learning Agents

95 (position, velocity)

97 Sutton & Barto, 1998

98 Learning a stability policy for the Nao humanoid robot. State: 3 joints (Hip Pitch, Knee Pitch, Foot Pitch); 7 possible actions.

101 Path Planning with Reinforcement Learning for marine platforms

103 The stochastic environment

105 Ms. PacMan and Reinforcement Learning (I). Ms. PacMan constitutes a challenging domain for building and testing RL agents: the environment is difficult to predict, as the ghosts' behavior is stochastic; the reward function can be easily defined; the action space has a small size (4 possible actions). Many RL schemes have been proposed up to now: rule-based methodology [I. Szita et al., MLDG 08]; value function approximation using Neural Networks [S. Lucas, CIG 05; L. Bom et al., ADPRL 13]; genetic programming [A. Alhejali et al., CIG 10]; Monte Carlo tree search [S. Samothrakis et al., CIG; K. Nguyen et al., CIG 13].

106 State space representation (I). State space: a 10-dimensional feature vector $s = (s_1, \ldots, s_{10})$. The first 4 features ($s_1, \ldots, s_4$) are responsible for the Ms. PacMan view. Binary: they represent the existence (1) or not (0) of a wall in Ms. PacMan's four wind directions (north, west, south, east).

107 State space representation (II). The fifth feature ($s_5$) is responsible for the direction of the nearest agent target; it takes four (4) values that correspond to the four wind directions. The target is selected as follows: IF a ghost is at a distance of less than 8 steps (a user parameter) and is moving against Ms. PacMan, THEN the closest safe exit is selected; ELSE IF an edible (scared) ghost exists within a maximum distance of 15 steps, THEN the ghost's direction is selected; ELSE the direction to the nearest dot is selected. Example: the target is an edible ghost within 15 steps, in the north direction.

108 State space representation (III). The next 4 features ($s_6, \ldots, s_9$) are responsible for the existence of any possible ghost threat per direction. Binary: they give the information about the directions in which there is a ghost threat. Threat meaning: a ghost is within a distance of less than 8 steps from Ms. PacMan and is moving against it (north, west, south, east).

109 State space representation (IV). The 10th feature ($s_{10}$) is responsible for trapped situations. Binary: it specifies whether the PacMan agent is trapped (1) or not (0). Trap meaning: there does not exist any possible escape direction. Example case of a trap: the ghosts have surrounded the PacMan and there is no possible escape direction.

110 Experimental Results (I). All experiments were conducted using the MASON simulator(1) and took place on a conventional PC(2). We used 3 mazes of the original Ms. PacMan game: Maze 1 (training maze), Mazes 2 and 3 (testing mazes). (1) http://cs.gmu.edu/~eclab/projects/mason/ (2) Intel Core 2 Quad (2.66 GHz) CPU with 2 GB RAM.

111 Experimental Results (II). Performance metrics: average percentage of successful level completion; average number of wins; average number of steps per episode; average score attained per episode. Reward function (R), by event: Step: Ms. PacMan performed a move in the empty space; Wall: Ms. PacMan hit the wall; Ghost: Ms. PacMan ate a scared ghost; Pill: Ms. PacMan ate a pill; Lose: Ms. PacMan was seen by a non-scared ghost. Parameter setting: discount factor ($\gamma$) equal to 0.99; learning rate ($\eta$) equal to 0.01.

112 Experimental Results (V). Statistics (mean value & std) of the evaluation metrics after running 100 episodes. Maze 1 (training): level completion 80% (±24), wins 40%, steps (±53), score (±977). Maze 2 (testing): level completion 70% (±24), wins 33%, steps 39.4 (±43), score (±1045). Maze 3 (testing): level completion 80% (±20), wins 25%, steps (±55), score (±10). Statistics by playing 50 games (each game starts with 3 lives, adding a life every 10000 points): average score and max score per maze. It is interesting to note that the agent showed remarkable behavioral stability in both unknown mazes, providing clearly significant generalization abilities.

113 Reinforcement Learning in Board Games: Backgammon

114 Chess. State: 34 features

115 Deep Reinforcement Learning (DeepMind Technologies, Google - 2015)


MATH 124 AND 125 FINAL EXAM REVIEW PACKET (Revised spring 2008)

MATH 124 AND 125 FINAL EXAM REVIEW PACKET (Revised spring 2008) MATH 14 AND 15 FINAL EXAM REVIEW PACKET (Revised spring 8) The following quesions cn be used s review for Mh 14/ 15 These quesions re no cul smples of quesions h will pper on he finl em, bu hey will provide

More information

RL Lecture 7: Eligibility Traces. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1

RL Lecture 7: Eligibility Traces. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 RL Lecure 7: Eligibiliy Traces R. S. Suon and A. G. Baro: Reinforcemen Learning: An Inroducion 1 N-sep TD Predicion Idea: Look farher ino he fuure when you do TD backup (1, 2, 3,, n seps) R. S. Suon and

More information

Recent Enhancements to the MULTIFAN-CL Software

Recent Enhancements to the MULTIFAN-CL Software SCTB15 Working Pper MWG-2 Recen Enhncemen o he MULTIFAN-CL Sofwre John Hmpon 1 nd Dvid Fournier 2 1 Ocenic Fiherie Progrmme Secreri of he Pcific Communiy Noume, New Cledoni 2 Oer Reerch Ld. PO Box 2040

More information

To become more mathematically correct, Circuit equations are Algebraic Differential equations. from KVL, KCL from the constitutive relationship

To become more mathematically correct, Circuit equations are Algebraic Differential equations. from KVL, KCL from the constitutive relationship Laplace Tranform (Lin & DeCarlo: Ch 3) ENSC30 Elecric Circui II The Laplace ranform i an inegral ranformaion. I ranform: f ( ) F( ) ime variable complex variable From Euler > Lagrange > Laplace. Hence,

More information

INTEGRALS. Exercise 1. Let f : [a, b] R be bounded, and let P and Q be partitions of [a, b]. Prove that if P Q then U(P ) U(Q) and L(P ) L(Q).

INTEGRALS. Exercise 1. Let f : [a, b] R be bounded, and let P and Q be partitions of [a, b]. Prove that if P Q then U(P ) U(Q) and L(P ) L(Q). INTEGRALS JOHN QUIGG Eercise. Le f : [, b] R be bounded, nd le P nd Q be priions of [, b]. Prove h if P Q hen U(P ) U(Q) nd L(P ) L(Q). Soluion: Le P = {,..., n }. Since Q is obined from P by dding finiely

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Lerning Tom Mitchell, Mchine Lerning, chpter 13 Outline Introduction Comprison with inductive lerning Mrkov Decision Processes: the model Optiml policy: The tsk Q Lerning: Q function Algorithm

More information

Maximum Flow. Flow Graph

Maximum Flow. Flow Graph Mximum Flow Chper 26 Flow Grph A ommon enrio i o ue grph o repreen flow nework nd ue i o nwer queion ou meril flow Flow i he re h meril move hrough he nework Eh direed edge i ondui for he meril wih ome

More information

( ) ( ) ( ) ( ) ( ) ( y )

( ) ( ) ( ) ( ) ( ) ( y ) 8. Lengh of Plne Curve The mos fmous heorem in ll of mhemics is he Pyhgoren Theorem. I s formulion s he disnce formul is used o find he lenghs of line segmens in he coordine plne. In his secion you ll

More information

Network Flows: Introduction & Maximum Flow

Network Flows: Introduction & Maximum Flow CSC 373 - lgorihm Deign, nalyi, and Complexiy Summer 2016 Lalla Mouaadid Nework Flow: Inroducion & Maximum Flow We now urn our aenion o anoher powerful algorihmic echnique: Local Search. In a local earch

More information

Algorithmic Discrete Mathematics 6. Exercise Sheet

Algorithmic Discrete Mathematics 6. Exercise Sheet Algorihmic Dicree Mahemaic. Exercie Shee Deparmen of Mahemaic SS 0 PD Dr. Ulf Lorenz 7. and 8. Juni 0 Dipl.-Mah. David Meffer Verion of June, 0 Groupwork Exercie G (Heap-Sor) Ue Heap-Sor wih a min-heap

More information

Bellman Optimality Equation for V*

Bellman Optimality Equation for V* Bellmn Optimlity Eqution for V* The vlue of stte under n optiml policy must equl the expected return for the best ction from tht stte: V (s) mx Q (s,) A(s) mx A(s) mx A(s) Er t 1 V (s t 1 ) s t s, t s

More information

A new model for limit order book dynamics

A new model for limit order book dynamics Anewmodelforlimiorderbookdynmics JeffreyR.Russell UniversiyofChicgo,GrdueSchoolofBusiness TejinKim UniversiyofChicgo,DeprmenofSisics Absrc:Thispperproposesnewmodelforlimiorderbookdynmics.Thelimiorderbookconsiss

More information

ENGR 1990 Engineering Mathematics The Integral of a Function as a Function

ENGR 1990 Engineering Mathematics The Integral of a Function as a Function ENGR 1990 Engineering Mhemics The Inegrl of Funcion s Funcion Previously, we lerned how o esime he inegrl of funcion f( ) over some inervl y dding he res of finie se of rpezoids h represen he re under

More information

Average & instantaneous velocity and acceleration Motion with constant acceleration

Average & instantaneous velocity and acceleration Motion with constant acceleration Physics 7: Lecure Reminders Discussion nd Lb secions sr meeing ne week Fill ou Pink dd/drop form if you need o swich o differen secion h is FULL. Do i TODAY. Homework Ch. : 5, 7,, 3,, nd 6 Ch.: 6,, 3 Submission

More information

Admin MAX FLOW APPLICATIONS. Flow graph/networks. Flow constraints 4/30/13. CS lunch today Grading. in-flow = out-flow for every vertex (except s, t)

Admin MAX FLOW APPLICATIONS. Flow graph/networks. Flow constraints 4/30/13. CS lunch today Grading. in-flow = out-flow for every vertex (except s, t) /0/ dmin lunch oday rading MX LOW PPLIION 0, pring avid Kauchak low graph/nework low nework direced, weighed graph (V, ) poiive edge weigh indicaing he capaciy (generally, aume ineger) conain a ingle ource

More information

can be viewed as a generalized product, and one for which the product of f and g. That is, does

can be viewed as a generalized product, and one for which the product of f and g. That is, does Boyce/DiPrim 9 h e, Ch 6.6: The Convoluion Inegrl Elemenry Differenil Equion n Bounry Vlue Problem, 9 h eiion, by Willim E. Boyce n Richr C. DiPrim, 9 by John Wiley & Son, Inc. Someime i i poible o wrie

More information

2D Motion WS. A horizontally launched projectile s initial vertical velocity is zero. Solve the following problems with this information.

2D Motion WS. A horizontally launched projectile s initial vertical velocity is zero. Solve the following problems with this information. Nme D Moion WS The equions of moion h rele o projeciles were discussed in he Projecile Moion Anlsis Acii. ou found h projecile moes wih consn eloci in he horizonl direcion nd consn ccelerion in he ericl

More information

2. VECTORS. R Vectors are denoted by bold-face characters such as R, V, etc. The magnitude of a vector, such as R, is denoted as R, R, V

2. VECTORS. R Vectors are denoted by bold-face characters such as R, V, etc. The magnitude of a vector, such as R, is denoted as R, R, V ME 352 VETS 2. VETS Vecor algebra form he mahemaical foundaion for kinemaic and dnamic. Geomer of moion i a he hear of boh he kinemaic and dnamic of mechanical em. Vecor anali i he imehonored ool for decribing

More information

Physics 2A HW #3 Solutions

Physics 2A HW #3 Solutions Chper 3 Focus on Conceps: 3, 4, 6, 9 Problems: 9, 9, 3, 41, 66, 7, 75, 77 Phsics A HW #3 Soluions Focus On Conceps 3-3 (c) The ccelerion due o grvi is he sme for boh blls, despie he fc h he hve differen

More information

Sample Final Exam (finals03) Covering Chapters 1-9 of Fundamentals of Signals & Systems

Sample Final Exam (finals03) Covering Chapters 1-9 of Fundamentals of Signals & Systems Sample Final Exam Covering Chaper 9 (final04) Sample Final Exam (final03) Covering Chaper 9 of Fundamenal of Signal & Syem Problem (0 mar) Conider he caual opamp circui iniially a re depiced below. I LI

More information

Some basic notation and terminology. Deterministic Finite Automata. COMP218: Decision, Computation and Language Note 1

Some basic notation and terminology. Deterministic Finite Automata. COMP218: Decision, Computation and Language Note 1 COMP28: Decision, Compuion nd Lnguge Noe These noes re inended minly s supplemen o he lecures nd exooks; hey will e useful for reminders ou noion nd erminology. Some sic noion nd erminology An lphe is

More information

Chapter 7: Inverse-Response Systems

Chapter 7: Inverse-Response Systems Chaper 7: Invere-Repone Syem Normal Syem Invere-Repone Syem Baic Sar ou in he wrong direcion End up in he original eady-ae gain value Two or more yem wih differen magniude and cale in parallel Main yem

More information

Transformations. Ordered set of numbers: (1,2,3,4) Example: (x,y,z) coordinates of pt in space. Vectors

Transformations. Ordered set of numbers: (1,2,3,4) Example: (x,y,z) coordinates of pt in space. Vectors Trnformion Ordered e of number:,,,4 Emple:,,z coordine of p in pce. Vecor If, n i i, K, n, i uni ecor Vecor ddiion +w, +, +, + V+w w Sclr roduc,, Inner do roduc α w. w +,.,. The inner produc i SCLR!. w,.,

More information

5.2 GRAPHICAL VELOCITY ANALYSIS Polygon Method

5.2 GRAPHICAL VELOCITY ANALYSIS Polygon Method ME 352 GRHICL VELCITY NLYSIS 52 GRHICL VELCITY NLYSIS olygon Mehod Velociy analyi form he hear of kinemaic and dynamic of mechanical yem Velociy analyi i uually performed following a poiion analyi; ie,

More information

Dipartimento di Elettronica Informazione e Bioingegneria Robotics

Dipartimento di Elettronica Informazione e Bioingegneria Robotics Diprimeno di Eleronic Inormzione e Bioingegneri Roboics From moion plnning o rjecories @ 015 robo clssiicions Robos cn be described by Applicion(seelesson1) Geomery (see lesson mechnics) Precision (see

More information

3. Renewal Limit Theorems

3. Renewal Limit Theorems Virul Lborories > 14. Renewl Processes > 1 2 3 3. Renewl Limi Theorems In he inroducion o renewl processes, we noed h he rrivl ime process nd he couning process re inverses, in sens The rrivl ime process

More information

1 jordan.mcd Eigenvalue-eigenvector approach to solving first order ODEs. -- Jordan normal (canonical) form. Instructor: Nam Sun Wang

1 jordan.mcd Eigenvalue-eigenvector approach to solving first order ODEs. -- Jordan normal (canonical) form. Instructor: Nam Sun Wang jordnmcd Eigenvlue-eigenvecor pproch o solving firs order ODEs -- ordn norml (cnonicl) form Insrucor: Nm Sun Wng Consider he following se of coupled firs order ODEs d d x x 5 x x d d x d d x x x 5 x x

More information

CSE/NB 528 Lecture 14: Reinforcement Learning (Chapter 9)

CSE/NB 528 Lecture 14: Reinforcement Learning (Chapter 9) CSE/NB 528 Lecure 14: Reinforcemen Learning Chaper 9 Image from hp://clasdean.la.asu.edu/news/images/ubep2001/neuron3.jpg Lecure figures are from Dayan & Abbo s book hp://people.brandeis.edu/~abbo/book/index.hml

More information

t s (half of the total time in the air) d?

t s (half of the total time in the air) d? .. In Cl or Homework Eercie. An Olmpic long jumper i cpble of jumping 8.0 m. Auming hi horizonl peed i 9.0 m/ he lee he ground, how long w he in he ir nd how high did he go? horizonl? 8.0m 9.0 m / 8.0

More information

INVESTIGATION OF REINFORCEMENT LEARNING FOR BUILDING THERMAL MASS CONTROL

INVESTIGATION OF REINFORCEMENT LEARNING FOR BUILDING THERMAL MASS CONTROL INVESTIGATION OF REINFORCEMENT LEARNING FOR BUILDING THERMAL MASS CONTROL Simeng Liu nd Gregor P. Henze, Ph.D., P.E. Universiy of Nebrsk Lincoln, Archiecurl Engineering 1110 Souh 67 h Sree, Peer Kiewi

More information

CSE/NB 528 Lecture 14: From Supervised to Reinforcement Learning (Chapter 9) R. Rao, 528: Lecture 14

CSE/NB 528 Lecture 14: From Supervised to Reinforcement Learning (Chapter 9) R. Rao, 528: Lecture 14 CSE/NB 58 Lecure 14: From Supervised o Reinforcemen Learning Chaper 9 1 Recall from las ime: Sigmoid Neworks Oupu v T g w u g wiui w Inpu nodes u = u 1 u u 3 T i Sigmoid oupu funcion: 1 g a 1 a e 1 ga

More information

Math Week 12 continue ; also cover parts of , EP 7.6 Mon Nov 14

Math Week 12 continue ; also cover parts of , EP 7.6 Mon Nov 14 Mh 225-4 Week 2 coninue.-.3; lo cover pr of.4-.5, EP 7.6 Mon Nov 4.-.3 Lplce rnform, nd pplicion o DE IVP, epecilly hoe in Chper 5. Tody we'll coninue (from l Wednedy) o fill in he Lplce rnform ble (on

More information

Introduction to Congestion Games

Introduction to Congestion Games Algorihmic Game Theory, Summer 2017 Inroducion o Congeion Game Lecure 1 (5 page) Inrucor: Thoma Keelheim In hi lecure, we ge o know congeion game, which will be our running example for many concep in game

More information

Laplace Transform. Inverse Laplace Transform. e st f(t)dt. (2)

Laplace Transform. Inverse Laplace Transform. e st f(t)dt. (2) Laplace Tranform Maoud Malek The Laplace ranform i an inegral ranform named in honor of mahemaician and aronomer Pierre-Simon Laplace, who ued he ranform in hi work on probabiliy heory. I i a powerful

More information

Notes on cointegration of real interest rates and real exchange rates. ρ (2)

Notes on cointegration of real interest rates and real exchange rates. ρ (2) Noe on coinegraion of real inere rae and real exchange rae Charle ngel, Univeriy of Wiconin Le me ar wih he obervaion ha while he lieraure (mo prominenly Meee and Rogoff (988) and dion and Paul (993))

More information

Discussion Session 2 Constant Acceleration/Relative Motion Week 03

Discussion Session 2 Constant Acceleration/Relative Motion Week 03 PHYS 100 Dicuion Seion Conan Acceleraion/Relaive Moion Week 03 The Plan Today you will work wih your group explore he idea of reference frame (i.e. relaive moion) and moion wih conan acceleraion. You ll

More information

Mathematics 805 Final Examination Answers

Mathematics 805 Final Examination Answers . 5 poins Se he Weiersrss M-es. Mhemics 85 Finl Eminion Answers Answer: Suppose h A R, nd f n : A R. Suppose furher h f n M n for ll A, nd h Mn converges. Then f n converges uniformly on A.. 5 poins Se

More information

An Approach to Incorporating Uncertainty in Network Security Analysis

An Approach to Incorporating Uncertainty in Network Security Analysis An Approch o Incorporing Unceriny in Nework Securiy Anlyi Hong Hi Nguyen hnguye11@illinoi.edu Krik Plni plni2@illinoi.edu Deprmen of Elecricl nd Compuer Engineering Univeriy of Illinoi Urbn-Chmpign Urbn,

More information

Convergence of Singular Integral Operators in Weighted Lebesgue Spaces

Convergence of Singular Integral Operators in Weighted Lebesgue Spaces EUROPEAN JOURNAL OF PURE AND APPLIED MATHEMATICS Vol. 10, No. 2, 2017, 335-347 ISSN 1307-5543 www.ejpm.com Published by New York Business Globl Convergence of Singulr Inegrl Operors in Weighed Lebesgue

More information

20.2. The Transform and its Inverse. Introduction. Prerequisites. Learning Outcomes

20.2. The Transform and its Inverse. Introduction. Prerequisites. Learning Outcomes The Trnform nd it Invere 2.2 Introduction In thi Section we formlly introduce the Lplce trnform. The trnform i only pplied to cul function which were introduced in Section 2.1. We find the Lplce trnform

More information

An Approach to Incorporating Uncertainty in Network Security Analysis

An Approach to Incorporating Uncertainty in Network Security Analysis Auhor copy. Acceped for publicion. Do no rediribue. An Approch o Incorporing Unceriny in Nework Securiy Anlyi Hong Hi Nguyen hnguye11@illinoi.edu Krik Plni plni2@illinoi.edu Deprmen of Elecricl nd Compuer

More information

A 1.3 m 2.5 m 2.8 m. x = m m = 8400 m. y = 4900 m 3200 m = 1700 m

A 1.3 m 2.5 m 2.8 m. x = m m = 8400 m. y = 4900 m 3200 m = 1700 m PHYS : Soluions o Chper 3 Home Work. SSM REASONING The displcemen is ecor drwn from he iniil posiion o he finl posiion. The mgniude of he displcemen is he shores disnce beween he posiions. Noe h i is onl

More information

Artificial Intelligence Markov Decision Problems

Artificial Intelligence Markov Decision Problems rtificil Intelligence Mrkov eciion Problem ilon - briefly mentioned in hpter Ruell nd orvig - hpter 7 Mrkov eciion Problem; pge of Mrkov eciion Problem; pge of exmple: probbilitic blockworld ction outcome

More information

Suggested Solutions to Midterm Exam Econ 511b (Part I), Spring 2004

Suggested Solutions to Midterm Exam Econ 511b (Part I), Spring 2004 Suggeed Soluion o Miderm Exam Econ 511b (Par I), Spring 2004 1. Conider a compeiive equilibrium neoclaical growh model populaed by idenical conumer whoe preference over conumpion ream are given by P β

More information

18 Extensions of Maximum Flow

18 Extensions of Maximum Flow Who are you?" aid Lunkwill, riing angrily from hi ea. Wha do you wan?" I am Majikhie!" announced he older one. And I demand ha I am Vroomfondel!" houed he younger one. Majikhie urned on Vroomfondel. I

More information

Version 001 test-1 swinney (57010) 1. is constant at m/s.

Version 001 test-1 swinney (57010) 1. is constant at m/s. Version 001 es-1 swinne (57010) 1 This prin-ou should hve 20 quesions. Muliple-choice quesions m coninue on he nex column or pge find ll choices before nswering. CubeUniVec1x76 001 10.0 poins Acubeis1.4fee

More information

Design of Controller for Robot Position Control

Design of Controller for Robot Position Control eign of Conroller for Robo oiion Conrol Two imporan goal of conrol: 1. Reference inpu racking: The oupu mu follow he reference inpu rajecory a quickly a poible. Se-poin racking: Tracking when he reference

More information

CS3510 Design & Analysis of Algorithms Fall 2017 Section A. Test 3 Solutions. Instructor: Richard Peng In class, Wednesday, Nov 15, 2017

CS3510 Design & Analysis of Algorithms Fall 2017 Section A. Test 3 Solutions. Instructor: Richard Peng In class, Wednesday, Nov 15, 2017 Uer ID (NOT he 9 igi numer): gurell4 CS351 Deign & Anlyi of Algorihm Fll 17 Seion A Te 3 Soluion Inruor: Rihr Peng In l, Weney, Nov 15, 17 Do no open hi quiz ookle unil you re iree o o o. Re ll he inruion

More information

4/12/12. Applications of the Maxflow Problem 7.5 Bipartite Matching. Bipartite Matching. Bipartite Matching. Bipartite matching: the flow network

4/12/12. Applications of the Maxflow Problem 7.5 Bipartite Matching. Bipartite Matching. Bipartite Matching. Bipartite matching: the flow network // Applicaion of he Maxflow Problem. Biparie Maching Biparie Maching Biparie maching. Inpu: undireced, biparie graph = (, E). M E i a maching if each node appear in a mo one edge in M. Max maching: find

More information

22.615, MHD Theory of Fusion Systems Prof. Freidberg Lecture 9: The High Beta Tokamak

22.615, MHD Theory of Fusion Systems Prof. Freidberg Lecture 9: The High Beta Tokamak .65, MHD Theory of Fusion Sysems Prof. Freidberg Lecure 9: The High e Tokmk Summry of he Properies of n Ohmic Tokmk. Advnges:. good euilibrium (smll shif) b. good sbiliy ( ) c. good confinemen ( τ nr )

More information

Neural assembly binding in linguistic representation

Neural assembly binding in linguistic representation Neurl ssembly binding in linguisic represenion Frnk vn der Velde & Mrc de Kmps Cogniive Psychology Uni, Universiy of Leiden, Wssenrseweg 52, 2333 AK Leiden, The Neherlnds, vdvelde@fsw.leidenuniv.nl Absrc.

More information