Chapter 2: Evaluative Feedback

Evaluating actions vs. instructing by giving correct actions
Pure evaluative feedback depends totally on the action taken. Pure instructive feedback depends not at all on the action taken.
Supervised learning is instructive; optimization is evaluative
Associative vs. Nonassociative:
  Associative: inputs mapped to outputs; learn the best output for each input
  Nonassociative: learn (find) one best output
The n-armed bandit (at least how we treat it) is:
  Nonassociative
  Evaluative feedback

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
The n-Armed Bandit Problem

Choose repeatedly from one of n actions; each choice is called a play
After each play a_t, you get a reward r_t, where E[r_t] = Q*(a_t)
These are unknown action values
The distribution of r_t depends only on a_t
The objective is to maximize the reward in the long term, e.g., over 1000 plays
To solve the n-armed bandit problem, you must explore a variety of actions and then exploit the best of them
The Exploration/Exploitation Dilemma

Suppose you form action value estimates Q_t(a) ≈ Q*(a)
The greedy action at time t is a_t* = argmax_a Q_t(a)
  a_t = a_t*   =>  exploitation
  a_t ≠ a_t*  =>  exploration
You can't exploit all the time; you can't explore all the time
You can never stop exploring; but you should always reduce exploring
Action-Value Methods

Methods that adapt action-value estimates and nothing else, e.g.: suppose that by the t-th play, action a had been chosen k_a times, producing rewards r_1, r_2, ..., r_{k_a}. Then the sample average is

  Q_t(a) = (r_1 + r_2 + ... + r_{k_a}) / k_a

and Q_t(a) → Q*(a) as k_a → ∞
ε-Greedy Action Selection

Greedy action selection: a_t = a_t* = argmax_a Q_t(a)
ε-greedy:
  a_t = a_t* with probability 1 − ε
  a_t = a random action with probability ε
... the simplest way to try to balance exploration and exploitation
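The ε-greedy rule above can be sketched in a few lines (a minimal illustration, not from the slides; the list-based value estimates and first-occurrence tie-breaking are my own choices):

```python
import random

def epsilon_greedy(q, epsilon):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(q))
    return q.index(max(q))  # greedy; ties go to the first occurrence

random.seed(0)
q = [0.2, 1.0, 0.5]
picks = [epsilon_greedy(q, 0.1) for _ in range(1000)]
# with epsilon = 0.1, the greedy action 1 is chosen ~93% of the time
print(picks.count(1) / len(picks))
```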
10-Armed Testbed

n = 10 possible actions
Each Q*(a) is chosen randomly from a normal distribution: N(0, 1)
Each reward r_t is also normal: N(Q*(a_t), 1)
1000 plays per run
Repeat the whole thing 2000 times and average the results
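The testbed's sampling scheme can be sketched as follows (a minimal illustration of the distributions described above; the function names are mine):

```python
import random

def make_bandit(n=10):
    """True action values Q*(a), each drawn from N(0, 1)."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def pull(q_star, a):
    """A reward for playing action a, drawn from N(Q*(a), 1)."""
    return random.gauss(q_star[a], 1.0)

random.seed(42)
q_star = make_bandit()
rewards = [pull(q_star, 3) for _ in range(10000)]
# the sample mean of repeated pulls of one arm approaches its Q*
print(sum(rewards) / len(rewards), q_star[3])
```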
ε-Greedy Methods on the 10-Armed Testbed

[Figure: average reward and % optimal action over 1000 plays, averaged over 2000 runs, for ε = 0 (greedy), ε = 0.01, and ε = 0.1]
Softmax Action Selection

Softmax action selection methods grade action probabilities by estimated values. The most common softmax uses a Gibbs, or Boltzmann, distribution: choose action a on play t with probability

  e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ}

where τ is the computational temperature
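A sketch of the Gibbs/Boltzmann probabilities (the max-subtraction is a numerical-stability detail I have added; it leaves the probabilities unchanged):

```python
import math

def softmax_probs(q, tau):
    """Gibbs/Boltzmann distribution over actions given estimates q
    and temperature tau."""
    m = max(q)  # subtract the max before exponentiating (stability)
    exps = [math.exp((v - m) / tau) for v in q]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax_probs([1.0, 2.0, 3.0], tau=1.0)
print([round(p, 3) for p in probs])
```

As τ → 0 selection approaches greedy; as τ grows the distribution approaches uniform.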
Binary Bandit Tasks

Suppose you have just two actions: a_t = 1 or a_t = 2
and just two rewards: r_t = success or r_t = failure
Then you might infer a target or desired action:
  d_t = a_t if success
  d_t = the other action if failure
and then always play the action that was most often the target
Call this the supervised algorithm
It works fine on deterministic tasks...
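The supervised algorithm can be sketched like this (the unbiased initial counts and the tie-breaking on the first play are my assumptions):

```python
import random
from collections import Counter

def supervised_bandit(success_probs, plays=500, seed=0):
    """'Supervised' algorithm for a two-action binary bandit:
    after each play, record the inferred target (the played action
    on success, the other action on failure) and always play the
    action that has most often been the target.
    Returns the fraction of plays on the truly better action."""
    rng = random.Random(seed)
    targets = Counter({0: 1, 1: 1})  # start with no preference
    optimal = max(range(2), key=lambda a: success_probs[a])
    hits = 0
    for _ in range(plays):
        a = max(targets, key=targets.get)
        success = rng.random() < success_probs[a]
        targets[a if success else 1 - a] += 1
        hits += (a == optimal)
    return hits / plays

# a deterministic task: the supervised algorithm works fine
print(supervised_bandit([0.0, 1.0]))
```

On stochastic tasks (success probabilities strictly between 0 and 1) this scheme can latch onto the wrong action, which is what the contingency-space slide that follows is about.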
Contingency Space

The space of all possible binary bandit tasks:

[Figure: success probability for action 2 (vertical axis) vs. success probability for action 1 (horizontal axis); the 0.5 lines divide the square into EASY and DIFFICULT quadrants, with example tasks A and B marked]
Linear Learning Automata

Let π_t(a) = Pr{a_t = a} be the only adapted parameter

L_{R-I} (Linear, reward-inaction):
  On success: π_{t+1}(a_t) = π_t(a_t) + α(1 − π_t(a_t)),  0 < α < 1
  (the other action probs. are adjusted to still sum to 1)
  On failure: no change

L_{R-P} (Linear, reward-penalty):
  On success: π_{t+1}(a_t) = π_t(a_t) + α(1 − π_t(a_t)),  0 < α < 1
  (the other action probs. are adjusted to still sum to 1)
  On failure: π_{t+1}(a_t) = π_t(a_t) + α(0 − π_t(a_t)),  0 < α < 1

For two actions, a stochastic, incremental version of the supervised algorithm
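The L_{R-I} update can be sketched for the two-action case (a minimal illustration; the function name and α value are mine):

```python
def lri_update(pi, a, success, alpha=0.1):
    """Linear reward-inaction update for a two-action automaton.
    pi is [Pr{a=0}, Pr{a=1}]; on success, probability is pushed
    toward the played action a; on failure, nothing changes."""
    if not success:
        return pi  # "inaction": no change on failure
    new = pi[:]
    new[a] = pi[a] + alpha * (1.0 - pi[a])
    new[1 - a] = 1.0 - new[a]  # keep the probabilities summing to 1
    return new

pi = [0.5, 0.5]
for _ in range(50):
    pi = lri_update(pi, a=1, success=True)
print(pi)  # repeated success on action 1 drives Pr{a=1} toward 1
```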
Performance on Binary Bandit Tasks A and B

[Figure: % optimal action over 500 plays on bandit A and bandit B, comparing action values (ε-greedy), L_{R-I}, L_{R-P}, and the supervised algorithm]
Incremental Implementation

Recall the sample-average estimation method. The average of the first k rewards is (dropping the dependence on a):

  Q_k = (r_1 + r_2 + ... + r_k) / k

Can we do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently:

  Q_{k+1} = Q_k + (1/(k+1)) [r_{k+1} − Q_k]

This is a common form for update rules:

  NewEstimate = OldEstimate + StepSize [Target − OldEstimate]
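The incremental form computes exactly the sample average while storing only the current estimate and count; a minimal check (variable names are mine):

```python
def incremental_mean(rewards):
    """Sample average via Q_{k} = Q_{k-1} + (1/k)[r_k - Q_{k-1}],
    without storing the rewards."""
    q = 0.0
    for k, r in enumerate(rewards, start=1):
        q += (r - q) / k  # StepSize = 1/k, Target = r
    return q

rs = [1.0, 0.0, 2.0, 5.0]
print(incremental_mean(rs), sum(rs) / len(rs))  # identical results
```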
Tracking a Nonstationary Problem

Choosing Q_k to be a sample average is appropriate in a stationary problem, i.e., when none of the Q*(a) change over time.
But not in a nonstationary problem. Better in the nonstationary case is:

  Q_{k+1} = Q_k + α [r_{k+1} − Q_k]    for constant α, 0 < α ≤ 1

which unrolls to

  Q_k = (1 − α)^k Q_0 + Σ_{i=1}^{k} α(1 − α)^{k−i} r_i

an exponential, recency-weighted average
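The recursive update and the unrolled recency-weighted sum are the same quantity; a small numerical check (a sketch, with names and the α value chosen by me):

```python
def constant_alpha(rewards, q0=0.0, alpha=0.1):
    """Recursive form: Q_{k+1} = Q_k + alpha (r_{k+1} - Q_k)."""
    q = q0
    for r in rewards:
        q += alpha * (r - q)
    return q

def weighted_form(rewards, q0=0.0, alpha=0.1):
    """Unrolled form: (1-a)^k Q_0 + sum_i a (1-a)^{k-i} r_i."""
    k = len(rewards)
    q = (1 - alpha) ** k * q0
    for i, r in enumerate(rewards, start=1):
        q += alpha * (1 - alpha) ** (k - i) * r
    return q

rs = [1.0, 0.0, 2.0, 5.0]
print(constant_alpha(rs), weighted_form(rs))  # the two forms agree
```

Note that recent rewards carry weight α(1 − α)^{k−i}, which shrinks geometrically with age, so the estimate tracks a drifting Q*.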
Optimistic Initial Values

All methods so far depend on Q_0(a), i.e., they are biased.
Suppose instead we initialize the action values optimistically, i.e., on the 10-armed testbed, use Q_0(a) = 5 for all a

[Figure: % optimal action over 1000 plays; optimistic greedy (Q_0 = 5, ε = 0) vs. realistic ε-greedy (Q_0 = 0, ε = 0.1)]
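A sketch of why optimism forces exploration under a purely greedy rule (the particular true values, seed, and function name are arbitrary choices of mine): every action starts out looking better than it can possibly be, so each pull "disappoints" the estimate and greed moves on until all arms have been tried.

```python
import random

def optimistic_greedy(q_star, q0=5.0, plays=1000, seed=0):
    """Pure greedy selection with optimistic initial estimates
    Q_0(a) = q0 and sample-average updates."""
    rng = random.Random(seed)
    n = len(q_star)
    q = [q0] * n
    counts = [0] * n
    tried = set()
    for _ in range(plays):
        a = q.index(max(q))       # greedy, no epsilon at all
        tried.add(a)
        r = rng.gauss(q_star[a], 1.0)
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]  # sample-average update
    return tried, counts

tried, counts = optimistic_greedy([0.1, -0.8, 1.5, 0.2])
print(sorted(tried))  # every action gets explored despite pure greed
```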
Reinforcement Comparison

Compare rewards to a reference reward r̄_t, e.g., an average of observed rewards
Strengthen or weaken the action taken depending on r_t − r̄_t
Let p_t(a) denote the preference for action a
Preferences determine action probabilities, e.g., by a Gibbs distribution:

  π_t(a) = Pr{a_t = a} = e^{p_t(a)} / Σ_{b=1}^{n} e^{p_t(b)}

Then:

  p_{t+1}(a_t) = p_t(a_t) + β [r_t − r̄_t]    and    r̄_{t+1} = r̄_t + α [r_t − r̄_t]
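A minimal sketch of the two coupled updates (the two-action task, step sizes, seed, and reward distributions here are illustrative assumptions, not from the slides):

```python
import math
import random

def gibbs(p):
    """Action probabilities from preferences via the Gibbs distribution."""
    m = max(p)
    e = [math.exp(x - m) for x in p]
    z = sum(e)
    return [x / z for x in e]

def rc_step(p, rbar, a, r, alpha=0.1, beta=0.1):
    """One reinforcement-comparison step: strengthen action a if its
    reward beats the reference reward, then update the reference."""
    p = p[:]
    p[a] += beta * (r - rbar)
    rbar += alpha * (r - rbar)
    return p, rbar

random.seed(1)
p, rbar = [0.0, 0.0], 0.0
for _ in range(200):
    probs = gibbs(p)
    a = 0 if random.random() < probs[0] else 1
    r = random.gauss(1.0 if a == 1 else 0.0, 0.1)  # action 1 is better
    p, rbar = rc_step(p, rbar, a, r)
print(gibbs(p))  # preference for the better action should dominate
```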
Performance of Reinforcement Comparison Method

[Figure: % optimal action over 1000 plays; reinforcement comparison vs. ε-greedy (ε = 0.1) with α = 1/k and with α = 0.1]
Pursuit Methods

Maintain both action-value estimates and action preferences
Always pursue the greedy action, i.e., make the greedy action more likely to be selected
After the t-th play, update the action values to get Q_{t+1}
The new greedy action is a*_{t+1} = argmax_a Q_{t+1}(a)
Then:

  π_{t+1}(a*_{t+1}) = π_t(a*_{t+1}) + β [1 − π_t(a*_{t+1})]

and the probs. of the other actions are decremented to maintain the sum of 1
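With the greedy action held fixed, the pursuit update alone drives its selection probability toward 1; a sketch of just that probability update (function name and β value are mine):

```python
def pursuit_update(pi, greedy, beta=0.01):
    """Move probability mass toward the greedy action: the greedy
    entry gets beta*(1 - p), every other entry gets beta*(0 - p),
    so the probabilities still sum to 1."""
    return [p + beta * ((1.0 if a == greedy else 0.0) - p)
            for a, p in enumerate(pi)]

pi = [0.25, 0.25, 0.25, 0.25]
for _ in range(300):
    pi = pursuit_update(pi, greedy=2)
print(pi)  # Pr{a=2} approaches 1; the others shrink toward 0
```

In the full method the greedy action can change from play to play as Q_{t+1} changes, so the pursued target moves too.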
Performance of Pursuit Method

[Figure: % optimal action over 1000 plays; pursuit vs. reinforcement comparison vs. ε-greedy (ε = 0.1, α = 1/k)]
Associative Search

Imagine switching bandits at each play

[Figure: a bandit with 3 actions, with the bandit faced at each play determined by context]
Conclusions

These are all very simple methods
but they are complicated enough; we will build on them
Ideas for improvements:
  estimating uncertainties ... interval estimation
  approximating Bayes optimal solutions
  Gittins indices
The full RL problem offers some ideas for solution ...