Chapter 2: Evaluative Feedback

1 Chapter 2: Evaluative Feedback
- Evaluating actions vs. instructing by giving correct actions
- Pure evaluative feedback depends totally on the action taken. Pure instructive feedback depends not at all on the action taken.
- Supervised learning is instructive; optimization is evaluative
- Associative vs. nonassociative:
  - Associative: inputs mapped to outputs; learn the best output for each input
  - Nonassociative: "learn" (find) one best output
- The n-armed bandit (at least how we treat it) is nonassociative, with evaluative feedback
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

2 The n-Armed Bandit Problem
- Choose repeatedly from one of n actions; each choice is called a play
- After each play $a_t$, you get a reward $r_t$, where $E\{r_t \mid a_t\} = Q^*(a_t)$
- These are the unknown action values; the distribution of $r_t$ depends only on $a_t$
- Objective is to maximize the reward in the long term, e.g., over 1000 plays
- To solve the n-armed bandit problem, you must explore a variety of actions and then exploit the best of them

3 The Exploration/Exploitation Dilemma
- Suppose you form action-value estimates $Q_t(a) \approx Q^*(a)$
- The greedy action at play $t$ is $a_t^* = \arg\max_a Q_t(a)$
- $a_t = a_t^*$ is exploitation; $a_t \neq a_t^*$ is exploration
- You can't exploit all the time; you can't explore all the time
- You can never stop exploring, but you should always reduce exploring

4 Action-Value Methods
- Methods that adapt action-value estimates and nothing else
- E.g., suppose that by the $t$-th play, action $a$ had been chosen $k_a$ times, producing rewards $r_1, r_2, \ldots, r_{k_a}$; then the sample average is
  $Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}$
- $\lim_{k_a \to \infty} Q_t(a) = Q^*(a)$

5 ε-Greedy Action Selection
- Greedy action selection: $a_t = a_t^* = \arg\max_a Q_t(a)$
- ε-greedy: $a_t = a_t^*$ with probability $1 - \varepsilon$, and a random action with probability $\varepsilon$
- ... the simplest way to try to balance exploration and exploitation
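
The ε-greedy rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `epsilon_greedy`, the list-of-floats representation of the estimates, and the tie-breaking rule (lowest index wins) are assumptions, not part of the slides.

```python
import random

def epsilon_greedy(q_estimates, epsilon, rng=random):
    """Pick an action index: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        # Explore: uniform over all actions (the greedy one included).
        return rng.randrange(len(q_estimates))
    # Exploit: index of the largest estimate (the first one wins on ties).
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])
```

With `epsilon = 0` this reduces to pure greedy selection, and with `epsilon = 1` every play is a uniformly random action.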

6 10-Armed Testbed
- n = 10 possible actions
- Each $Q^*(a)$ is chosen randomly from a normal distribution: $\eta(0, 1)$
- Each $r_t$ is also normal: $\eta(Q^*(a_t), 1)$
- 1000 plays
- Repeat the whole thing 2000 times and average the results
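
A minimal simulation of this testbed might look like the following. It assumes sample-average estimates starting at zero, ε-greedy selection, and lowest-index tie-breaking; the function name `run_testbed` and its parameters are hypothetical, chosen only for this sketch.

```python
import random

def run_testbed(epsilon, n_arms=10, plays=1000, runs=2000, seed=0):
    """Return (average reward, fraction of optimal picks) per play, averaged over runs."""
    rng = random.Random(seed)
    reward_sum = [0.0] * plays
    optimal_sum = [0] * plays
    for _ in range(runs):
        # Each true action value Q*(a) is drawn from a unit normal.
        true_q = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]
        best = max(range(n_arms), key=lambda a: true_q[a])
        q = [0.0] * n_arms     # sample-average estimates (assumed to start at 0)
        counts = [0] * n_arms  # number of times each action was chosen
        for t in range(plays):
            if rng.random() < epsilon:
                a = rng.randrange(n_arms)                   # explore
            else:
                a = max(range(n_arms), key=lambda i: q[i])  # exploit
            r = rng.gauss(true_q[a], 1.0)  # reward is normal around Q*(a)
            counts[a] += 1
            q[a] += (r - q[a]) / counts[a]  # incremental sample average
            reward_sum[t] += r
            optimal_sum[t] += (a == best)
    return [s / runs for s in reward_sum], [s / runs for s in optimal_sum]
```

Calling `run_testbed(0.1)` and `run_testbed(0.0)` and plotting the two returned lists against the play index gives curves of the kind the testbed is meant to produce; a smaller `runs` value keeps a quick experiment fast at the cost of noisier averages.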

7 ε-Greedy Methods on the 10-Armed Testbed
[Figure: average reward and % optimal action versus plays, with curves for several ε values including ε = 0 (greedy) and ε = 0.1]

8 Softmax Action Selection
- Softmax action selection methods grade action probabilities by estimated values
- The most common softmax uses a Gibbs, or Boltzmann, distribution: choose action $a$ on play $t$ with probability
  $\frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$,
  where $\tau$ is the "computational temperature"
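
The Gibbs distribution above is easy to compute directly. The sketch below subtracts the largest estimate before exponentiating, a common numerical-stability trick that leaves the probabilities unchanged; the function names are hypothetical.

```python
import math
import random

def softmax_probs(q_estimates, tau):
    """Gibbs/Boltzmann probabilities for the given estimates at temperature tau > 0."""
    m = max(q_estimates)  # shifting by the max avoids overflow in exp
    weights = [math.exp((q - m) / tau) for q in q_estimates]
    total = sum(weights)
    return [w / total for w in weights]

def softmax_action(q_estimates, tau, rng=random):
    """Sample one action index according to the softmax probabilities."""
    probs = softmax_probs(q_estimates, tau)
    return rng.choices(range(len(probs)), weights=probs)[0]
```

A high temperature makes the actions nearly equiprobable, while a temperature close to zero makes selection nearly greedy.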

9 Binary Bandit Tasks
- Suppose you have just two actions, $a_t = 1$ or $a_t = 2$, and just two rewards, $r_t$ = success or $r_t$ = failure
- Then you might infer a target or desired action: $d_t = a_t$ if success, the other action if failure
- and then always play the action that was most often the target
- Call this the supervised algorithm
- It works fine on deterministic tasks
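
As a rough sketch, the supervised algorithm can be written as a tally of inferred targets. The actions are numbered 0 and 1 here rather than 1 and 2, ties go to action 0, and the function name and `success_probs` parameter are assumptions made for illustration.

```python
import random

def supervised_bandit(success_probs, plays=1000, seed=0):
    """Run the supervised algorithm on a two-action bandit; return the actions played."""
    rng = random.Random(seed)
    target_counts = [0, 0]  # how often each action was inferred to be the target
    actions = []
    for _ in range(plays):
        # Play the action that was most often the target (ties go to action 0).
        a = 0 if target_counts[0] >= target_counts[1] else 1
        success = rng.random() < success_probs[a]
        # Inferred target: the played action on success, the other one on failure.
        d = a if success else 1 - a
        target_counts[d] += 1
        actions.append(a)
    return actions
```

On a deterministic task such as `success_probs = [0.0, 1.0]`, the first failure switches the algorithm to the successful action and it stays there.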

10 Contingency Space
- The space of all possible binary bandit tasks
[Figure: unit square whose axes are the success probability for action 1 and the success probability for action 2; two opposite regions are labelled EASY PROBLEMS and the other two DIFFICULT PROBLEMS; points A and B mark two example tasks]

11 Linear Learning Automata
- Let $\pi_t(a) = \Pr\{a_t = a\}$ be the only adapted parameter
- $L_{R-I}$ (linear, reward-inaction):
  - On success: $\pi_{t+1}(a_t) = \pi_t(a_t) + \alpha(1 - \pi_t(a_t))$, $0 < \alpha < 1$ (the other action probabilities are adjusted to still sum to 1)
  - On failure: no change
- $L_{R-P}$ (linear, reward-penalty):
  - On success: $\pi_{t+1}(a_t) = \pi_t(a_t) + \alpha(1 - \pi_t(a_t))$, $0 < \alpha < 1$ (the other action probabilities are adjusted to still sum to 1)
  - On failure: $\pi_{t+1}(a_t) = \pi_t(a_t) + \alpha(0 - \pi_t(a_t))$, $0 < \alpha < 1$
- For two actions, a stochastic, incremental version of the supervised algorithm
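
For the two-action case, both automata fit in one small update function. This is a sketch under stated assumptions: actions are indexed 0 and 1, the other action's probability is set to one minus the updated value, and the name `automaton_step` and the `penalty` flag are hypothetical.

```python
def automaton_step(pi, action, success, alpha, penalty=False):
    """Return updated [pi(0), pi(1)] for a two-action linear learning automaton.

    penalty=False gives reward-inaction (L_R-I); penalty=True gives reward-penalty (L_R-P).
    """
    new = list(pi)
    if success:
        # Move the chosen action's probability toward 1.
        new[action] = pi[action] + alpha * (1.0 - pi[action])
    elif penalty:
        # L_R-P only: move the chosen action's probability toward 0.
        new[action] = pi[action] + alpha * (0.0 - pi[action])
    else:
        # L_R-I on failure: no change.
        return new
    # Adjust the other action so the probabilities still sum to 1.
    new[1 - action] = 1.0 - new[action]
    return new
```

For example, starting from `[0.5, 0.5]` with `alpha = 0.1`, a success on action 0 gives approximately `[0.55, 0.45]`.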

12 Performance on Binary Bandit Tasks A and B
[Figure: two panels, BANDIT A and BANDIT B, each plotting % optimal action (50% to 100%) versus plays for four methods: action values, $L_{R-I}$, $L_{R-P}$, and supervised]

13 Incremental Implementation
- Recall the sample-average estimation method. The average of the first $k$ rewards is (dropping the dependence on $a$):
  $Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}$
- Can we do this incrementally (without storing all the rewards)?
- We could keep a running sum and count, or, equivalently:
  $Q_{k+1} = Q_k + \frac{1}{k+1}\left[r_{k+1} - Q_k\right]$
- This is a common form for update rules:
  NewEstimate = OldEstimate + StepSize [Target − OldEstimate]
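
The incremental rule above needs only the current estimate and a count. A minimal sketch, with the hypothetical class name `RunningAverage`:

```python
class RunningAverage:
    """Sample average maintained incrementally, without storing the rewards."""

    def __init__(self):
        self.k = 0    # number of rewards seen so far
        self.q = 0.0  # current estimate Q_k

    def update(self, reward):
        # Q_{k+1} = Q_k + (1 / (k + 1)) * (r_{k+1} - Q_k)
        self.k += 1
        self.q += (reward - self.q) / self.k
        return self.q
```

Feeding the rewards 1, 2, 6 in turn yields the estimates 1, 1.5, 3, the same values as averaging the stored rewards at each step.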

14 Tracking a Nonstationary Problem
- Choosing $Q_k$ to be a sample average is appropriate in a stationary problem, i.e., when none of the $Q^*(a)$ change over time, but not in a nonstationary problem
- Better in the nonstationary case is:
  $Q_{k+1} = Q_k + \alpha\left[r_{k+1} - Q_k\right]$ for constant $\alpha$, $0 < \alpha \le 1$
- Unrolling the recursion gives an exponential, recency-weighted average:
  $Q_k = (1 - \alpha)^k Q_0 + \sum_{i=1}^{k} \alpha (1 - \alpha)^{k-i} r_i$
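
A quick numerical check can confirm that the constant-step-size recursion and the recency-weighted sum agree. The function names `track` and `closed_form` are hypothetical, and the reward sequence in the usage note is made up for illustration.

```python
def track(q0, rewards, alpha):
    """Apply Q <- Q + alpha * (r - Q) once per reward, starting from q0."""
    q = q0
    for r in rewards:
        q += alpha * (r - q)
    return q

def closed_form(q0, rewards, alpha):
    """Exponential, recency-weighted average written out as an explicit sum."""
    k = len(rewards)
    total = (1 - alpha) ** k * q0
    for i, r in enumerate(rewards, start=1):
        total += alpha * (1 - alpha) ** (k - i) * r
    return total
```

For instance, `track(5.0, [1.0, 0.0, 2.0, 3.0], 0.1)` and `closed_form(5.0, [1.0, 0.0, 2.0, 3.0], 0.1)` agree up to floating-point rounding, and with `alpha = 1` both return the most recent reward.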

15 Optimistic Initial Values
- All methods so far depend on $Q_0(a)$, i.e., they are biased
- Suppose instead we initialize the action values optimistically, i.e., on the 10-armed testbed, use $Q_0(a) = 5$ for all $a$
[Figure: % optimal action versus plays for an optimistic, greedy method ($Q_0 = 5$, ε = 0) and a realistic, ε-greedy method ($Q_0 = 0$)]
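
The effect of optimistic initialization shows up even with purely greedy selection. The sketch below assumes a constant step size and takes the reward source as a caller-supplied function; `greedy_run` and `reward_fn` are hypothetical names, and the noise-free rewards in the usage note are a simplification chosen so the behavior is easy to trace by hand.

```python
def greedy_run(initial_q, reward_fn, plays, alpha=0.1):
    """Greedy selection with a constant step size; returns the list of chosen actions."""
    q = list(initial_q)
    actions = []
    for _ in range(plays):
        a = max(range(len(q)), key=lambda i: q[i])  # greedy, first index wins ties
        q[a] += alpha * (reward_fn(a) - q[a])       # Q <- Q + alpha * (r - Q)
        actions.append(a)
    return actions
```

With deterministic rewards `means = [0.2, 1.0, -0.5]` and `reward_fn = lambda a: means[a]`, starting from `[5.0, 5.0, 5.0]` every early reward is disappointing, so the greedy choice cycles through all three actions before settling on action 1. Starting from `[0.0, 0.0, 0.0]` the same procedure locks onto action 0 and never tries the better one.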

16 Reinforcement Comparison
- Compare rewards to a reference reward, $\bar{r}_t$, e.g., an average of observed rewards
- Strengthen or weaken the action taken depending on $r_t - \bar{r}_t$
- Let $p_t(a)$ denote the preference for action $a$
- Preferences determine action probabilities, e.g., by Gibbs distribution:
  $\pi_t(a) = \Pr\{a_t = a\} = \frac{e^{p_t(a)}}{\sum_{b=1}^{n} e^{p_t(b)}}$
- Then:
  $p_{t+1}(a_t) = p_t(a_t) + \left[r_t - \bar{r}_t\right]$ and $\bar{r}_{t+1} = \bar{r}_t + \alpha\left[r_t - \bar{r}_t\right]$
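
One possible way to put the reinforcement comparison updates into code is sketched below. The class name, the zero initial preferences and reference reward, and the extra preference step size `beta` are assumptions; `beta = 1.0` gives a preference update with no extra scaling.

```python
import math
import random

class ReinforcementComparison:
    """Action preferences updated by comparing each reward with a reference reward."""

    def __init__(self, n_actions, alpha=0.1, beta=1.0, initial_reference=0.0):
        self.p = [0.0] * n_actions    # preferences p_t(a), assumed to start at 0
        self.ref = initial_reference  # reference reward (assumed initial value)
        self.alpha = alpha            # step size for the reference reward
        self.beta = beta              # assumed step size for the preferences

    def probabilities(self):
        # Gibbs distribution over preferences (shifted by the max for stability).
        m = max(self.p)
        w = [math.exp(x - m) for x in self.p]
        s = sum(w)
        return [x / s for x in w]

    def select(self, rng=random):
        return rng.choices(range(len(self.p)), weights=self.probabilities())[0]

    def update(self, action, reward):
        diff = reward - self.ref
        self.p[action] += self.beta * diff  # strengthen or weaken the action taken
        self.ref += self.alpha * diff       # move the reference toward the reward
```

A reward above the reference raises the preference for the action taken, and a reward below it lowers that preference, so the same reward can be encouraging early on and discouraging once the reference has caught up.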

17 Performance of a Reinforcement Comparison Method
[Figure: % optimal action versus plays for reinforcement comparison, ε-greedy with ε = 0.1 and α = 1/k, and ε-greedy with ε = 0.1 and α = 0.1]

18 Pursuit Methods
- Maintain both action-value estimates and action preferences
- Always "pursue" the greedy action, i.e., make the greedy action more likely to be selected
- After the $t$-th play, update the action values to get $Q_{t+1}$
- The new greedy action is $a_{t+1}^* = \arg\max_a Q_{t+1}(a)$
- Then: $\pi_{t+1}(a_{t+1}^*) = \pi_t(a_{t+1}^*) + \beta\left[1 - \pi_t(a_{t+1}^*)\right]$, and the probabilities of the other actions are decremented to maintain the sum of 1
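
The pursuit step can be sketched as a single function over the probability vector. The slide does not say how the other probabilities are decremented; this sketch assumes each one moves toward 0 by the same fraction β, which keeps the total at 1. The function name `pursuit_update` is hypothetical.

```python
def pursuit_update(pi, q, beta):
    """Return new action probabilities after pursuing the greedy action of q."""
    greedy = max(range(len(q)), key=lambda a: q[a])  # first index wins ties
    new_pi = []
    for a, p in enumerate(pi):
        target = 1.0 if a == greedy else 0.0
        # The greedy action moves toward 1; every other action moves toward 0 (assumed form).
        new_pi.append(p + beta * (target - p))
    return new_pi
```

Starting from four equiprobable actions with `beta = 0.1`, the greedy action's probability rises from 0.25 to about 0.325 and each of the others falls to about 0.225.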

19 Performance of a Pursuit Method
[Figure: % optimal action versus plays for pursuit, reinforcement comparison, and ε-greedy with ε = 0.1 and α = 1/k]

20 Associative Search
- Imagine switching bandits at each play
[Figure: several bandits; the surviving labels are "Bandit 3" and "actions"]

21 Conclusions
- These are all very simple methods, but they are complicated enough; we will build on them
- Ideas for improvements:
  - estimating uncertainties... interval estimation
  - approximating Bayes optimal solutions
  - Gittins indices
- The full RL problem offers some ideas for solution...
