Reinforcement learning
1 Reinforcement learning
Nathaniel Daw, Gatsby Computational Neuroscience Unit, gatsby.ucl.ac.uk
Mostly adapted from Andrew Moore's tutorials, copyright 2002, 2004 by Andrew Moore. His originals, and many more tutorials, are available at:
The problem
Decision-making in a situation that may be:
1. Sequential (like chess, a maze)
2. Stochastic (like backgammon, the stock market)
2 The plan
We discuss:
1. How to evaluate long-term payoffs: Markov systems
2. How to find optimal decisions: Markov decision processes, dynamic programming
3. How to learn these on the fly: reinforcement learning, semi-supervised learning
4. Extensions
Discounted rewards
An assistant professor gets paid, say, 20K per year. How much, in total, will the A.P. earn in their life? 20 + 20 + 20 + ... = Infinity. What is wrong with this argument?
3 Discounted rewards
A reward (payment) in the future is not worth quite as much as a reward now:
Because of chance of obliteration
Because of inflation
Example: Being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now. Assuming payment n years in the future is worth only (0.9)^n of payment now, what is the A.P.'s Future Discounted Sum of Rewards?
Discount factors
People in economics and probabilistic decision-making do this all the time. The discounted sum of future rewards using discount factor γ is
(reward now) + γ (reward in 1 time step) + γ^2 (reward in 2 time steps) + γ^3 (reward in 3 time steps) + ... (infinite sum) = E[ Σ_{t=0..∞} γ^t r(s_t) ]
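A quick numerical check of the idea above, using the slide's numbers (20K per year, discount factor 0.9): with discounting, the infinite salary stream sums to a finite value, 20 / (1 - 0.9) = 200.

```python
# Discounted sum of a constant reward stream: 20 per year, gamma = 0.9.
gamma = 0.9
reward = 20.0

# A long partial sum approaches the geometric-series closed form.
partial = sum(reward * gamma ** t for t in range(1000))
closed_form = reward / (1 - gamma)  # finite, unlike the undiscounted sum
```

So the A.P.'s Future Discounted Sum of Rewards is 200K, not infinity.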
4 The academic life
[State diagram, partially recoverable from transcription: states A. Assistant Prof (reward 20), B. Assoc. Prof, T. Tenured Prof (reward 400), S. On the Street, D. Dead (reward 0), with transition probabilities such as 0.6 and 0.2 between them.]
Assume discount factor γ = 0.9. Define:
V_A = expected discounted future rewards starting in state A
V_B = expected discounted future rewards starting in state B
V_T, V_S, V_D defined likewise for states T, S, D
How do we compute V_A, V_B, V_T, V_S, V_D?
A Markov system with rewards
Has a set of states {s_1, s_2, ..., s_N}
Has a transition probability matrix T, where T_ij = Prob(next state s_{t+1} = s_j | this state s_t = s_i)
Each state has a reward: {r_1, r_2, ..., r_N}
There is a discount factor γ, 0 < γ < 1
On each time step:
0. Assume your state is s_t = s_i
1. You get given reward r(s_t)
2. You randomly move to another state: P(s_{t+1} = s_j | s_t = s_i) = T_ij
3. All future rewards are discounted by γ
5 Solving a Markov system
Write V(s_t) = expected discounted sum of future rewards starting in state s_t = s_i.
V(s_t) = r(s_t) + E[ γ r(s_{t+1}) + γ^2 r(s_{t+2}) + ... ]
       = r(s_t) + γ (expected future rewards starting from the next state)
       = r(s_t) + γ (T_i1 V(s_1) + T_i2 V(s_2) + ... + T_iN V(s_N))
Using vector notation, write V = (V(s_1), ..., V(s_N))^T, R = (r_1, ..., r_N)^T, and T = the N×N transition matrix. Then V = R + γ T V.
Question: can you invent a closed-form expression for V in terms of R, T, and γ?
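One answer to the slide's question, sketched in NumPy: rearranging V = R + γTV gives (I - γT)V = R, so V = (I - γT)^{-1} R. The 3-state transition matrix and reward vector below are invented example numbers, not from the lecture.

```python
import numpy as np

# Closed form for a Markov system: V = R + gamma*T*V  =>  (I - gamma*T) V = R.
# The 3-state T and R below are invented for illustration.
gamma = 0.9
T = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.5, 0.5]])
R = np.array([4.0, 0.0, -8.0])

# Solving this linear system is the "matrix inversion" method from the slides.
V = np.linalg.solve(np.eye(3) - gamma * T, R)
```

`np.linalg.solve` is preferred over explicitly inverting the matrix for numerical stability, but the math is the same.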
6 Solving a Markov system with matrix inversion
Upside: You get an exact answer.
Downside: If you have 100,000 states, you're solving a 100,000 by 100,000 system of equations.
Value iteration: another way to solve a Markov system
Let s_t = s_i. Define:
V^1(s_i) = expected discounted sum of rewards over the next 1 time step
V^2(s_i) = expected discounted sum of rewards during the next 2 steps
V^3(s_i) = expected discounted sum of rewards during the next 3 steps
:
V^k(s_i) = expected discounted sum of rewards during the next k steps
V^1(s_i) = (what?)
V^2(s_i) = (what?)
V^{k+1}(s_i) = (what?)
7 Value iteration: another way to solve a Markov system
Let s_t = s_i. Define V^k(s_i) = expected discounted sum of rewards during the next k steps. Then:
V^1(s_i) = r(s_i)
V^2(s_i) = r(s_i) + E[ γ r(s_{t+1}) ]
V^{k+1}(s_i) = r(s_i) + E[ γ r(s_{t+1}) + ... + γ^k r(s_{t+k}) ] = r(s_i) + γ Σ_{j=1..N} T_ij V^k(s_j)
where N = number of states.
Let's do value iteration
[Figure: three weather states SUN (reward +4), WIND (reward 0), HAIL (reward -8), with transition arrows; γ = 0.5. Table to fill in: k versus V^k(SUN), V^k(WIND), V^k(HAIL).]
8 Let's do value iteration
[Same SUN/WIND/HAIL example as the previous slide, with the table of V^k values filled in; the numbers did not survive transcription.]
Value iteration for solving Markov systems
Compute V^1(s_j) for each j
Compute V^2(s_j) for each j
:
Compute V^k(s_j) for each j
As k → ∞, V^k(s_j) → V(s_j). Why? When to stop? When max_j |V^{k+1}(s_j) - V^k(s_j)| < ξ.
(This is faster than simple matrix inversion if the transition matrix is sparse)
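A sketch of this loop in Python. The rewards +4, 0, -8 and γ = 0.5 come from the slide, but the SUN/WIND/HAIL transition probabilities are assumed, since the original figure did not survive transcription.

```python
import numpy as np

# Value iteration for a Markov system, stopping when the largest
# per-state change drops below xi. Rewards and gamma are from the
# slide; the transition matrix is an assumption.
gamma, xi = 0.5, 1e-10
T = np.array([[0.5, 0.5, 0.0],   # SUN
              [0.5, 0.0, 0.5],   # WIND
              [0.0, 0.5, 0.5]])  # HAIL
R = np.array([4.0, 0.0, -8.0])

V = np.zeros(3)
while True:
    V_next = R + gamma * T @ V          # one backup for every state
    if np.max(np.abs(V_next - V)) < xi:
        break                           # converged to within xi
    V = V_next
```

Each pass touches only the nonzero entries of T, which is why this beats matrix inversion on sparse systems.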
9 A Markov decision process
You run a startup company. In every state you must choose between Saving money or Advertising.
[State diagram: Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10), with Save/Advertise transition arrows; γ = 0.9.]
Markov decision processes
An MDP has:
A set of states {s_1, ..., s_N}
A set of actions {a_1, ..., a_M}
A set of rewards {r_1, ..., r_N} (one for each state)
A transition probability function T_ij^k = Prob(s_{t+1} = s_j | s_t = s_i and a_t = a_k)
On each step:
0. Call the current state s_i
1. Receive reward r_i
2. Choose an action from {a_1, ..., a_M}
3. If you choose action a_k, you'll move to state s_j with probability T_ij^k
4. All future rewards are discounted by γ
10 A policy
A policy is a mapping from states to actions. E.g.:
Policy Number 1: PU → S, PF → A, RU → S, RF → A
Policy Number 2: PU → A, PF → A, RU → A, RF → A
[Diagrams of the transitions each policy induces did not survive transcription.]
Which of the above two policies is best?
A policy
Following a policy reduces an MDP to a Markov system. With what transition probabilities? How many possible policies are there in our example? (In general, we might also consider stochastic policies.) How do you compute the optimal policy?
11 Interesting fact
For every MDP there exists at least one deterministic optimal policy. It is a policy such that for every possible start state there is no better option than to follow the policy. (Not proved in this lecture.)
More formally
V^π(s_i): expected discounted future reward following policy π from state s_i
V*(s_i): expected discounted future reward following the optimal policy π* from state s_i
V*(s_i) ≥ V^π(s_i) for all states and policies
12 Computing the optimal policy
Idea One: Run through all possible policies. Select the best. What is the problem??
Optimal value function
[Figure: a small three-state MDP with states s_1, s_2 (reward +3), s_3, actions A and B, and transition probabilities of 1/3; details did not survive transcription.]
Question: What (by inspection) is an optimal policy for that MDP? (Assume γ = 0.9.) What is V*(s_1)? What is V*(s_2)? What is V*(s_3)?
13 Computing the optimal value function with value iteration
Define V^k(s_i) = maximum possible expected sum of discounted rewards I can get if I start at state s_i and I live for k time steps.
Note that V^1(s_i) = r(s_i).
Let's compute V^k(s_i) for our example
[Table to fill in: k versus V^k(PU), V^k(PF), V^k(RU), V^k(RF).]
14 Let's compute V^k(s_i) for our example
[Same table as the previous slide, filled in; the numbers did not survive transcription.]
Bellman equation
V^{n+1}(s_i) = max_k [ r(s_i) + γ Σ_{j=1..N} T_ij^k V^n(s_j) ]
Value iteration for solving MDPs
Compute V^1(s_i) for all i
Compute V^2(s_i) for all i
... until converged: converged when max_i |V^{n+1}(s_i) - V^n(s_i)| < ξ
Also known as dynamic programming. Can also update values asynchronously (i.e., in any order).
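The Bellman backup above can be sketched directly in NumPy: compute a Q-value for every (action, state) pair, then take the max over actions. The 2-action, 3-state MDP below is invented for illustration; the lecture's startup example's transitions were in a figure that did not survive.

```python
import numpy as np

# MDP value iteration with the Bellman max:
#   V_{n+1}(s_i) = max_k [ r(s_i) + gamma * sum_j T^k_ij V_n(s_j) ]
# T[k] is the transition matrix under action k. T and R are assumed.
gamma, xi = 0.9, 1e-10
T = np.array([[[0.9, 0.1, 0.0],
               [0.1, 0.8, 0.1],
               [0.0, 0.1, 0.9]],
              [[0.5, 0.5, 0.0],
               [0.0, 0.5, 0.5],
               [0.5, 0.0, 0.5]]])
R = np.array([0.0, 1.0, 10.0])

V = np.zeros(3)
while True:
    Q = R + gamma * T @ V               # Q[k, i]: value of action k in state i
    V_next = Q.max(axis=0)              # Bellman max over actions
    if np.max(np.abs(V_next - V)) < xi:
        break
    V = V_next
```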
15 Finding the optimal policy
Given V*, it is easy to find π*. How?
Finding the optimal policy
Compute V*(s_i) for all i using value iteration. Define the best action in state s_i as
argmax_k [ r_i + γ Σ_j T_ij^k V*(s_j) ]
(for all i; the greedy policy under V*)
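Reading the greedy policy off V* is one argmax per state. The tiny 2-state, 2-action T, R, and V_star below are invented placeholders, just to show the shape of the computation.

```python
import numpy as np

# Greedy policy extraction: pick argmax_k [ r_i + gamma * sum_j T^k_ij V*(s_j) ]
# in every state. All numbers here are invented for illustration.
gamma = 0.9
T = np.array([[[1.0, 0.0], [0.5, 0.5]],    # action 0
              [[0.0, 1.0], [0.0, 1.0]]])   # action 1
R = np.array([0.0, 5.0])
V_star = np.array([10.0, 50.0])

Q = R + gamma * T @ V_star     # Q[k, i]
policy = Q.argmax(axis=0)      # best action index for each state
```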
16 Policy iteration
Another way to compute optimal policies. Algorithm:
Initialize π_0 = any randomly chosen policy
Alternate:
Policy evaluation: compute V^{π_k} (expected rewards using policy π_k)
Policy improvement: π_{k+1}(s_i) = argmax_a [ r(s_i) + γ Σ_j T_ij^a V^{π_k}(s_j) ]
until π_k = π_{k+1}. You now have an optimal policy. This will converge in a finite number of iterations (why?)
Where we are
Formalisms: Markov system and Markov decision process; Markov system = MDP + policy
Value iteration for finding expected (optionally optimal) future rewards
Optimal values provide the optimal policy
Policy iteration for finding optimal policies
Next: online versions of these algorithms
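The evaluate/improve alternation can be sketched as follows. Since a fixed policy reduces the MDP to a Markov system, the evaluation step is an exact matrix solve; the small MDP (T, R) is invented for illustration.

```python
import numpy as np

# Policy iteration: exact evaluation of the current policy, then
# greedy improvement; stop when the policy is stable. T and R are
# assumed example numbers.
gamma = 0.9
T = np.array([[[0.9, 0.1, 0.0],
               [0.1, 0.8, 0.1],
               [0.0, 0.1, 0.9]],
              [[0.5, 0.5, 0.0],
               [0.0, 0.5, 0.5],
               [0.5, 0.0, 0.5]]])
R = np.array([0.0, 1.0, 10.0])
n_states = len(R)

policy = np.zeros(n_states, dtype=int)        # arbitrary initial policy
while True:
    # Policy evaluation: V = (I - gamma * T_pi)^{-1} R
    T_pi = T[policy, np.arange(n_states)]     # row i follows policy[i]
    V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
    # Policy improvement: greedy with respect to V
    new_policy = (R + gamma * T @ V).argmax(axis=0)
    if np.array_equal(new_policy, policy):
        break                                 # pi_{k+1} = pi_k: done
    policy = new_policy
```

Each improvement step either strictly improves the policy or leaves it unchanged, and there are finitely many deterministic policies, which is why the loop must terminate.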
17 Online reinforcement learning
Imagine you are a robot in a world controlled by an MDP. You are not given the functions R, T, etc. You must learn values and policies from experience with individual states, rewards and actions. As before, let's start with the Markov system (no action) case.
Learning delayed rewards
[Diagram: states s_a through s_f, each with unknown reward R = ?]
All you can see is a series of states and rewards:
s_a (r=0) → s_b (r=0) → s_c (r=4) → s_b (r=0) → s_d (r=0) → s_e (r=0)
Task: Based on this sequence, estimate V(s_a), V(s_b), ..., V(s_f)
18 Idea 1: Certainty-Equivalent learning
Idea: Use your data to estimate the underlying Markov system, then use the previous offline methods to solve it.
s_a (r=0) → s_b (r=0) → s_c (r=4) → s_b (r=0) → s_d (r=0) → s_e (r=0)
You draw in the estimated Markov system: transitions + probabilities, with r_est = 0 for s_a, s_b, s_d, s_e and r_est = 4 for s_c. What are the estimated V values?
C.E. for Markov systems
Estimate T_ij, r_i by counting transitions and averaging rewards. At each step, solve the new estimated system with, for instance, value iteration. (Why do we want new estimates at each step?)
Slow, memory intensive. Variations (e.g. prioritized sweeping) minimize computation by taking shortcuts on the value iteration step. Can be data-inefficient.
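The estimation half of certainty equivalence is just counting. A minimal sketch on the slide's sequence (note the two observed transitions out of s_b split its estimated probability 50/50):

```python
from collections import Counter, defaultdict

# Certainty-equivalent estimation: count observed transitions to
# estimate T, and average observed rewards to estimate r, using the
# sequence from the slide.
trajectory = [('a', 0), ('b', 0), ('c', 4), ('b', 0), ('d', 0), ('e', 0)]

counts = defaultdict(Counter)
rewards = defaultdict(list)
for (s, r), (s_next, _) in zip(trajectory, trajectory[1:]):
    counts[s][s_next] += 1
    rewards[s].append(r)
rewards[trajectory[-1][0]].append(trajectory[-1][1])  # last state's reward

T_est = {s: {s2: n / sum(c.values()) for s2, n in c.items()}
         for s, c in counts.items()}
r_est = {s: sum(v) / len(v) for s, v in rewards.items()}
```

The estimated system (T_est, r_est) can then be handed to any of the offline solvers above.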
19 Idea 2: Value sampling
Idea: Sample long-term values directly from the observed sequence, without estimating T or r.
s_a (r=0) → s_b (r=0) → s_c (r=4) → s_b (r=0) → s_d (r=0) → s_e (r=0)
At t=1 we were in state s_a and eventually got a long-term discounted reward (ltdr) of 0 + γ·0 + γ^2·4 + γ^3·0 + γ^4·0 + γ^5·0. Assume γ = 1. Then:
At t=1, in state s_a: ltdr = 4. At t=2, in state s_b: ltdr = 4. At t=3, in state s_c: ltdr = 4. At t=4, in state s_b: ltdr = 0. At t=5, in state s_d: ltdr = 0. At t=6, in state s_e: ltdr = 0.
Mean LTDR per state: s_a observed {4}, so V_est(s_a) = 4; s_b observed {4, 0}, so V_est(s_b) = 2; s_c observed {4}, so V_est(s_c) = 4; s_d observed {0}, so V_est(s_d) = 0; s_e observed {0}, so V_est(s_e) = 0.
(This algorithm is also called Monte Carlo sampling, or TD(1).)
Idea 3: Temporal Difference learning (Sutton/Barto)
Idea: a sampling version of value iteration. Only maintain a V_est array, nothing else. So you've got V_est(s_1), V_est(s_2), ..., V_est(s_N), and you observe a transition from i that receives an immediate reward of r and jumps to j. What should you do? Can you guess?
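The value-sampling table above can be reproduced in a few lines: from each visit, sum the discounted rewards to the end of the episode, then average the returns per state (γ = 1 as on the slide).

```python
# Monte Carlo value sampling on the slide's observed sequence.
gamma = 1.0
trajectory = [('a', 0), ('b', 0), ('c', 4), ('b', 0), ('d', 0), ('e', 0)]

returns = {}
for t, (s, _) in enumerate(trajectory):
    # Discounted sum of rewards from time t to the end of the episode.
    ltdr = sum(r * gamma ** (k - t)
               for k, (_, r) in enumerate(trajectory[t:], start=t))
    returns.setdefault(s, []).append(ltdr)

V_est = {s: sum(g) / len(g) for s, g in returns.items()}
```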
20 TD learning
Value iteration update: V^{k+1}(s_i) = E[ r(s_i) + γ V^k(s_{t+1}) ]
TD learning uses the observed r as a sample of the reward, the observed s_j as a sample of the next state, and V_est(s_j) to estimate its value ("bootstrapping"). Learning rule: nudge V_est(s_i) toward the sampled value with learning rate α:
V_est(s_i) ← (1-α) V_est(s_i) + α (sampled future reward) = (1-α) V_est(s_i) + α (r + γ V_est(s_j))
(This is actually an algorithm called TD(0).)
TD convergence
Dayan (1992) showed that for a more general family of TD rules, as the number of observations goes to infinity, V_est(s_i) → V(s_i), PROVIDED:
All states are visited infinitely often
Decaying learning rates: Σ_{t=1..∞} α_t = ∞ (this means: for every k there is a T such that Σ_{t=1..T} α_t > k) and Σ_{t=1..∞} α_t^2 < ∞ (this means: there is a k such that Σ_{t=1..T} α_t^2 < k for all T)
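One pass of the TD(0) rule over the same observed sequence can be sketched as below. The constant learning rate α = 0.5 is an assumption for illustration (the convergence conditions above call for a decaying α).

```python
# TD(0) on the slide's sequence: after seeing s_i --r--> s_j, nudge
# V_est(s_i) toward r + gamma * V_est(s_j). gamma = 1 as on the slide;
# alpha = 0.5 is an assumed constant learning rate.
gamma, alpha = 1.0, 0.5
trajectory = [('a', 0), ('b', 0), ('c', 4), ('b', 0), ('d', 0), ('e', 0)]

V_est = {s: 0.0 for s, _ in trajectory}
for (s, r), (s_next, _) in zip(trajectory, trajectory[1:]):
    target = r + gamma * V_est[s_next]              # sampled future reward
    V_est[s] = (1 - alpha) * V_est[s] + alpha * target
```

After this single episode only s_c has moved (to 2.0): unlike Monte Carlo sampling, the zero bootstrap estimates have not yet propagated the reward backward, which takes repeated passes.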
21 Online policy learning
The task:
World: You are in state 34. Your immediate reward is 3. You have 3 actions.
Robot: I'll take action 2.
World: You are in state 77. Your immediate reward is -7. You have 2 actions.
Robot: I'll take action 1.
World: You're in state 34 (again). Your immediate reward is 3. You have 3 actions.
The Credit Assignment Problem
I'm in state 43, 39, 22, 21, 21, 13, 54, 26; reward = 0, 0, 0, 0, 0, 0, 0, 100; action = 2, 4, 1, 1, 1, 2, 2.
Yippee! I got to a state with a big reward! But which of my actions along the way actually helped me get there?? This is the Credit Assignment problem. The MDP machinery we have developed helps address this problem.
22 Idea 1: Certainty-Equivalent learning
Idea: Use your data to estimate the underlying MDP, then use the previous offline methods to solve it. Same as before, except now solve the estimated MDP using the MDP version of value iteration:
V_est(s_i) = max_a [ r_i^est + γ Σ_j T_ij^{est,a} V_est(s_j) ]
or policy iteration.
The explore/exploit problem
We're in state s_i. We can estimate r_est, T_est, V_est. So what action should we choose?
IDEA 1: a = argmax_{a'} [ r_est(s_i) + γ Σ_j T_ij^{est,a'} V_est(s_j) ]
IDEA 2: a = random
Any problems with these ideas? Any other suggestions? Could we be optimal?
23 Idea 2: Actor/Critic (Sutton)
Idea: an approximate, sampling version of policy iteration, using TD for policy evaluation and stochastic gradient ascent for policy improvement.
Assume a stochastic policy, parameterized by w. Prob(choose action a in state s_i): π(s_i, a) ∝ exp(β w(s_i, a)).
Observe a transition s_i --(r, a)--> s_j. Update V_est(s_i) with TD. How do we improve the policy w? What policy does V_est evaluate?
Actor/critic
Policy improvement: π_{k+1}(s_i) = argmax_a E[ r(s_i) + γ V^{π_k}(s_j) ]
Observed: s_i --(r, a)--> s_j. Update rule: for all k,
w(s_i, a_k) ← w(s_i, a_k) + ν (r + γ V_est(s_j) - V_est(s_i)) (δ_{a_k,a} - π(s_i, a_k))
where ν is a learning rate, r + γ V_est(s_j) is a sample estimate, and δ is the Kronecker delta: 1 if a_k = a, 0 if a_k ≠ a.
Performs stochastic gradient ascent on V_est: increase the probability of action a if the result was better than expected, i.e. if r + γ V_est(s_j) > V_est(s_i); decrease it otherwise.
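A minimal sketch of this update loop, assuming an invented 2-state, 2-action world in which only action 1 taken in state 0 is rewarded; all constants (γ, α, ν, β) and the dynamics are assumptions, not from the lecture. The critic learns V_est by TD(0) while the actor's softmax weights w move by the rule above.

```python
import math, random

# Actor/critic sketch on an invented 2-state, 2-action world.
random.seed(0)
gamma, alpha, nu, beta = 0.9, 0.1, 0.1, 1.0
n_states, n_actions = 2, 2
V = [0.0] * n_states                              # critic
w = [[0.0] * n_actions for _ in range(n_states)]  # actor weights

def pi(s):
    """Boltzmann policy: P(a) proportional to exp(beta * w[s][a])."""
    e = [math.exp(beta * x) for x in w[s]]
    z = sum(e)
    return [p / z for p in e]

def step(s, a):
    """Invented dynamics: only action 1 in state 0 reaches the rewarding state."""
    if s == 0 and a == 1:
        return 1, 1.0
    return 0, 0.0

s = 0
for _ in range(2000):
    probs = pi(s)
    a = random.choices(range(n_actions), weights=probs)[0]
    s_next, r = step(s, a)
    delta = r + gamma * V[s_next] - V[s]      # TD error (critic)
    V[s] += alpha * delta
    for k in range(n_actions):                # actor: gradient-style nudge
        w[s][k] += nu * delta * ((1.0 if k == a else 0.0) - probs[k])
    s = s_next
```

After training, the weight for the rewarded action in state 0 dominates, so the softmax policy prefers it, which is the "increase the probability of better-than-expected actions" behavior described above.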
24 Actor/critic
A sampling version of policy iteration.
Disadvantages: No convergence guarantees; somewhat flaky in practice. Problems with V_est not tracking the current policy.
Advantages: May be better with function approximation (we will return to this).
It is not obvious how to make a sampling version of MDP value iteration for optimal values V*. (Why?)
Idea 3: Q-learning (Watkins)
Idea: TD with a redefined value function, which can be learned independent of the exploration policy.
Define Q*(s_i, a) = expected sum of discounted future rewards if I start in state s_i, I then take action a, and I am subsequently optimal.
Questions: Define Q*(s_i, a) in terms of V*. Define V*(s_i) in terms of Q*.
25 Q-learning
Q version of the Bellman equation:
Q*(s_i, a) = r_i + γ Σ_j T_ij^a max_{a'} Q*(s_j, a')
We maintain Q_est instead of V_est values. The TD update, on seeing s_i --(r, a)--> s_j, is:
Q_est(s_i, a) ← α (r + γ max_{a'} Q_est(s_j, a')) + (1-α) Q_est(s_i, a)
This is even cleverer than it looks: the Q_est values are not biased by any particular exploration policy. Q-learning is proved to converge.
Q-learning: choosing actions
Same issues as for C.E. choosing actions. The optimal action is argmax_a Q*(s_i, a). Don't always choose optimally according to Q_est. Don't always be random (otherwise it will take a long time to reach somewhere exciting). Boltzmann exploration, as in actor/critic: Prob(choose action a) ∝ exp(β Q_est(s_i, a))
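The update and the Boltzmann action choice fit in one loop. The 2-state world and all constants below are invented for illustration; the only piece taken from the slides is the update rule itself and the exploration scheme.

```python
import math, random

# Tabular Q-learning with Boltzmann exploration:
#   Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma * max_a' Q(s',a'))
#   P(choose a) proportional to exp(beta * Q(s,a))
random.seed(0)
gamma, alpha, beta = 0.9, 0.1, 2.0
n_states, n_actions = 2, 2
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    """Invented dynamics: only action 1 in state 0 reaches the rewarding state."""
    if s == 0 and a == 1:
        return 1, 1.0
    return 0, 0.0

s = 0
for _ in range(5000):
    weights = [math.exp(beta * q) for q in Q[s]]          # Boltzmann exploration
    a = random.choices(range(n_actions), weights=weights)[0]
    s_next, r = step(s, a)
    target = r + gamma * max(Q[s_next])                   # sampled Bellman backup
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
    s = s_next

# Greedy policy read off the learned Q values.
greedy = [max(range(n_actions), key=lambda a: Q[si][a]) for si in range(n_states)]
```

Because the backup uses max over next actions rather than the action actually taken, the learned Q values are unbiased by the exploration policy, which is the point the slide makes.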
26 Where we are
Formalism
Offline algorithms
Online algorithms for value estimation: C.E. (model-based), sampling, TD (model-free)
Online algorithms for policy learning: C.E., actor/critic, Q-learning
If we had time
Avoid lookup tables for the value function: use function approximation/regression. Convergence guarantees go out the window. This may favor policy gradient methods: actor/critic is one example, but policy gradients can also be estimated directly, i.e. without using values.
Optimal exploration/exploitation tradeoffs: Gittins indices, E^3 (Kearns)
27 If we had time
TD(λ): TD generalized with multi-step backups; Monte Carlo trajectory sampling is a special case.
Applications: backgammon, elevator scheduling, neuroscientific modeling.
If we had time
Partially Observable MDPs: RL when the state is not observable (like a hidden Markov model). Extremely intractable; some cases are proved uncomputable. Approximate offline value iteration methods exist; there is no policy iteration, and minimal online methods.
28 For more
Review paper: Kaelbling et al., Reinforcement Learning: A Survey, Journal of AI Research, 1996.
Two excellent books:
Sutton & Barto, Reinforcement Learning: informal and readable.
Tsitsiklis & Bertsekas, Neuro-Dynamic Programming: formal and less readable, full of delightful proofs.
More informationMLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012
MLE and Bayesan Estmaton Je Tang Department of Computer Scence & Technology Tsnghua Unversty 01 1 Lnear Regresson? As the frst step, we need to decde how we re gong to represent the functon f. One example:
More informationCredit Card Pricing and Impact of Adverse Selection
Credt Card Prcng and Impact of Adverse Selecton Bo Huang and Lyn C. Thomas Unversty of Southampton Contents Background Aucton model of credt card solctaton - Errors n probablty of beng Good - Errors n
More informationOther NN Models. Reinforcement learning (RL) Probabilistic neural networks
Other NN Models Renforcement learnng (RL) Probablstc neural networks Support vector machne (SVM) Renforcement learnng g( (RL) Basc deas: Supervsed dlearnng: (delta rule, BP) Samples (x, f(x)) to learn
More informationCS : Algorithms and Uncertainty Lecture 17 Date: October 26, 2016
CS 29-128: Algorthms and Uncertanty Lecture 17 Date: October 26, 2016 Instructor: Nkhl Bansal Scrbe: Mchael Denns 1 Introducton In ths lecture we wll be lookng nto the secretary problem, and an nterestng
More informationMultilayer Perceptrons and Backpropagation. Perceptrons. Recap: Perceptrons. Informatics 1 CG: Lecture 6. Mirella Lapata
Multlayer Perceptrons and Informatcs CG: Lecture 6 Mrella Lapata School of Informatcs Unversty of Ednburgh mlap@nf.ed.ac.uk Readng: Kevn Gurney s Introducton to Neural Networks, Chapters 5 6.5 January,
More informationNeuro-Adaptive Design II:
Lecture 37 Neuro-Adaptve Desgn II: A Robustfyng Tool for Any Desgn Dr. Radhakant Padh Asst. Professor Dept. of Aerospace Engneerng Indan Insttute of Scence - Bangalore Motvaton Perfect system modelng s
More informationOn an Extension of Stochastic Approximation EM Algorithm for Incomplete Data Problems. Vahid Tadayon 1
On an Extenson of Stochastc Approxmaton EM Algorthm for Incomplete Data Problems Vahd Tadayon Abstract: The Stochastc Approxmaton EM (SAEM algorthm, a varant stochastc approxmaton of EM, s a versatle tool
More informationLOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin
Proceedngs of the 007 Wnter Smulaton Conference S G Henderson, B Bller, M-H Hseh, J Shortle, J D Tew, and R R Barton, eds LOW BIAS INTEGRATED PATH ESTIMATORS James M Calvn Department of Computer Scence
More informationParametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010
Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton
More informationNeural networks. Nuno Vasconcelos ECE Department, UCSD
Neural networs Nuno Vasconcelos ECE Department, UCSD Classfcaton a classfcaton problem has two types of varables e.g. X - vector of observatons (features) n the world Y - state (class) of the world x X
More informationMATH 829: Introduction to Data Mining and Analysis The EM algorithm (part 2)
1/16 MATH 829: Introducton to Data Mnng and Analyss The EM algorthm (part 2) Domnque Gullot Departments of Mathematcal Scences Unversty of Delaware Aprl 20, 2016 Recall 2/16 We are gven ndependent observatons
More informationA PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS
HCMC Unversty of Pedagogy Thong Nguyen Huu et al. A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS Thong Nguyen Huu and Hao Tran Van Department of mathematcs-nformaton,
More informationClustering with Gaussian Mixtures
Note to other teachers and users of these sldes. Andrew would be delghted f you found ths source materal useful n gvng your own lectures. Feel free to use these sldes verbatm, or to modfy them to ft your
More informationSDMML HT MSc Problem Sheet 4
SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be
More informationMultilayer neural networks
Lecture Multlayer neural networks Mlos Hauskrecht mlos@cs.ptt.edu 5329 Sennott Square Mdterm exam Mdterm Monday, March 2, 205 In-class (75 mnutes) closed book materal covered by February 25, 205 Multlayer
More informationComputational issues surrounding the management of an ecological food web
Computatonal ssues surroundng the management of an ecologcal food web Wllam J M Probert, Eve McDonald-Madden, Nathale Peyrard, Régs Sabbadn AIGM 12, ECAI2012 Montpeller, France Ratonale Ecology has many
More informationLecture 10: May 6, 2013
TTIC/CMSC 31150 Mathematcal Toolkt Sprng 013 Madhur Tulsan Lecture 10: May 6, 013 Scrbe: Wenje Luo In today s lecture, we manly talked about random walk on graphs and ntroduce the concept of graph expander,
More informationStatistics for Economics & Business
Statstcs for Economcs & Busness Smple Lnear Regresson Learnng Objectves In ths chapter, you learn: How to use regresson analyss to predct the value of a dependent varable based on an ndependent varable
More informationContinuous Time Markov Chains
Contnuous Tme Markov Chans Brth and Death Processes,Transton Probablty Functon, Kolmogorov Equatons, Lmtng Probabltes, Unformzaton Chapter 6 1 Markovan Processes State Space Parameter Space (Tme) Dscrete
More informationModule 9. Lecture 6. Duality in Assignment Problems
Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept
More informationSTAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 16
STAT 39: MATHEMATICAL COMPUTATIONS I FALL 218 LECTURE 16 1 why teratve methods f we have a lnear system Ax = b where A s very, very large but s ether sparse or structured (eg, banded, Toepltz, banded plus
More informationMarkov decision processes
IMT Atlantque Technopôle de Brest-Irose - CS 83818 29238 Brest Cedex 3 Téléphone: +33 0)2 29 00 13 04 Télécope: +33 0)2 29 00 10 12 URL: www.mt-atlantque.fr Markov decson processes Lecture notes therry.chonavel@mt-atlantque.fr
More informationOutline and Reading. Dynamic Programming. Dynamic Programming revealed. Computing Fibonacci. The General Dynamic Programming Technique
Outlne and Readng Dynamc Programmng The General Technque ( 5.3.2) -1 Knapsac Problem ( 5.3.3) Matrx Chan-Product ( 5.3.1) Dynamc Programmng verson 1.4 1 Dynamc Programmng verson 1.4 2 Dynamc Programmng
More informationMaximal Margin Classifier
CS81B/Stat41B: Advanced Topcs n Learnng & Decson Makng Mamal Margn Classfer Lecturer: Mchael Jordan Scrbes: Jana van Greunen Corrected verson - /1/004 1 References/Recommended Readng 1.1 Webstes www.kernel-machnes.org
More informationGeorgia Tech PHYS 6124 Mathematical Methods of Physics I
Georga Tech PHYS 624 Mathematcal Methods of Physcs I Instructor: Predrag Cvtanovć Fall semester 202 Homework Set #7 due October 30 202 == show all your work for maxmum credt == put labels ttle legends
More informationCS4495/6495 Introduction to Computer Vision. 3C-L3 Calibrating cameras
CS4495/6495 Introducton to Computer Vson 3C-L3 Calbratng cameras Fnally (last tme): Camera parameters Projecton equaton the cumulatve effect of all parameters: M (3x4) f s x ' 1 0 0 0 c R 0 I T 3 3 3 x1
More informationCS : Algorithms and Uncertainty Lecture 14 Date: October 17, 2016
CS 294-128: Algorthms and Uncertanty Lecture 14 Date: October 17, 2016 Instructor: Nkhl Bansal Scrbe: Antares Chen 1 Introducton In ths lecture, we revew results regardng follow the regularzed leader (FTRL.
More informationVapnik-Chervonenkis theory
Vapnk-Chervonenks theory Rs Kondor June 13, 2008 For the purposes of ths lecture, we restrct ourselves to the bnary supervsed batch learnng settng. We assume that we have an nput space X, and an unknown
More informationReview of Taylor Series. Read Section 1.2
Revew of Taylor Seres Read Secton 1.2 1 Power Seres A power seres about c s an nfnte seres of the form k = 0 k a ( x c) = a + a ( x c) + a ( x c) + a ( x c) k 2 3 0 1 2 3 + In many cases, c = 0, and the
More informationBoostrapaggregating (Bagging)
Boostrapaggregatng (Baggng) An ensemble meta-algorthm desgned to mprove the stablty and accuracy of machne learnng algorthms Can be used n both regresson and classfcaton Reduces varance and helps to avod
More information10.34 Fall 2015 Metropolis Monte Carlo Algorithm
10.34 Fall 2015 Metropols Monte Carlo Algorthm The Metropols Monte Carlo method s very useful for calculatng manydmensonal ntegraton. For e.g. n statstcal mechancs n order to calculate the prospertes of
More information