Reinforcement Learning
Tom Mitchell, Machine Learning, chapter 13

Outline
- Introduction
- Comparison with inductive learning
- Markov Decision Processes: the model
- Optimal policy: the task
- Q learning: the Q function, algorithm, convergence proof
- Exploration vs. exploitation
- Non-deterministic rewards and actions

Copyright Facundo Bromberg

Introduction
How can an autonomous agent that senses and acts in its environment learn to choose optimal actions to achieve its goal?

[Figure: agent-environment interaction loop. The agent observes state s_t, executes action a_t, and a critic in the environment returns reward r_t, generating the sequence s0, a0, r0, s1, a1, r1, s2, a2, r2, ...]

Goal: Learn to choose actions that maximize r0 + γ·r1 + γ²·r2 + ..., where 0 ≤ γ < 1.
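The discounted sum in the goal above can be made concrete with a short sketch; the discount factor γ = 0.9 and the reward sequence are illustrative choices, not from the slides:

```python
# Discounted cumulative reward r0 + gamma*r1 + gamma^2*r2 + ...
# The reward sequence and gamma below are illustrative, not from the slides.

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**i * r_i over a (finite) reward sequence."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

# A sequence that pays 100 only on its third step:
print(discounted_return([0, 0, 100]))  # 0 + 0.9*0 + 0.9**2 * 100, i.e. about 81
```

Because γ < 1, rewards received sooner count for more; this is the quantity every policy in the chapter is judged by.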
Introduction (cont.)
This generic problem is one of learning to control sequential processes, such as:
- learning to control a mobile robot,
- learning to optimize operations in factories, and
- learning to play board games (world-class backgammon player, Tesauro 1995).

It is suitable for scenarios with unpredictable (e.g. highly dynamic) or complex environments:
- lack of a domain theory;
- impossible to build in optimal behavior (as in planning, optimization algorithms, etc.).

Introduction (cont.)
Assumption: goals can be defined by a reward function that assigns numerical values to action-state pairs. This reward function is known by the critic, which can be external or built-in.
The task of the agent is to perform sequences of actions, observe their consequences, and learn a control policy π : S → A that chooses actions that maximize the accumulated reward.

Differences with inductive learning
Differences with function approximation (inductive learning) of π : S → A:
- Delayed reward. Instead of pairs <s, π(s)> (the optimal action for the current state s), the agent receives sequences of rewards and faces the problem of temporal credit assignment.
- Exploration. The agent has influence on the distribution of training examples through the action sequences it chooses. This raises the problem of exploration vs. exploitation.
Differences with inductive learning (cont.)
- Partially observable states. In many practical situations sensors provide only partial information about the environment's state. The optimal policy may therefore include actions chosen specifically to improve the observability of the environment.
- Life-long learning. Unlike isolated inductive learning tasks, agent learning often requires learning several related tasks within the same environment. Prior knowledge or experience becomes relevant.

The model (or the task of making compromises)
- Deterministic or non-deterministic actions?
- Prior knowledge or not about the effects of its actions on the environment (domain theory)?
- Does a trainer give examples of optimal action sequences (inductive learning), or must the agent train itself?
The choice: Markov Decision Process (MDP).

Markov Decision Processes
An MDP is a tuple (S, A, s0, δ, r), where:
- S is the set of states,
- A is the set of actions available to the agent,
- s0 is the initial state,
- δ : S × A → S is the transition function, and
- r : S × A → ℝ⁺ is the reward function.
r and δ depend only on the current state and action (the Markov property), and they are deterministic.
The task
The task of the agent is to learn a policy π : S → A that selects the next action a_t based on the current observed state s_t, i.e. π(s_t) = a_t.
How? A policy that maximizes the cumulative reward over time. That is, a policy that maximizes:

V^π(s_t) ≡ r_t + γ·r_{t+1} + γ²·r_{t+2} + ... = Σ_{i=0}^∞ γ^i·r_{t+i},   0 ≤ γ < 1

where the sequence of rewards r_t, r_{t+1}, ... was generated by following π: a_i = π(s_i), r_i = r(s_i, a_i), s_{i+1} = δ(s_i, a_i).

The task (2)
Alternative definitions of total reward:
- Finite horizon reward: Σ_{i=0}^h r_{t+i}
- Average reward: lim_{h→∞} (1/h) Σ_{i=0}^h r_{t+i}

The optimal policy
The task of the agent is thus to learn the optimal policy, given by:

π* ≡ argmax_π V^π(s),  ∀s

and we denote by V*(s) = V^{π*}(s) the maximum discounted cumulative reward the agent can obtain starting at s.
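The three total-reward criteria can be contrasted on a fixed reward sequence. A minimal sketch, with an illustrative sequence and horizon; the average-reward limit is approximated at a finite h rather than taken to infinity:

```python
# The three total-reward criteria, evaluated on a finite reward sequence.
# Sequence and horizon are illustrative; the average-reward limit is
# approximated by a finite h (dividing by the number of terms, h + 1).

def discounted(rewards, gamma=0.9):
    """Discounted cumulative reward: sum of gamma**i * r_{t+i}."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

def finite_horizon(rewards, h):
    """Undiscounted sum of r_{t+i} for i = 0..h."""
    return sum(rewards[:h + 1])

def average(rewards, h):
    """Mean reward over the first h + 1 steps."""
    return finite_horizon(rewards, h) / (h + 1)
```

The discounted criterion is the one used in the rest of the chapter: unlike the finite-horizon sum, it needs no cutoff, and unlike the average, it still distinguishes early from late rewards.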
Example
S: cells. A: arrows. r: numbers by the arrows (γ = 0.9).
- V*(s_bottom-right) = 100
- V*(s_bottom-center) = 0 + 0.9·100 = 90
- V*(s_bottom-left) = 0 + 0.9·0 + 0.9²·100 = 81
Since G is absorbing, the infinite sum becomes finite.

Q learning
The agent wants to maximize cumulative reward; thus it should prefer state s1 over s2 whenever V*(s1) > V*(s2). However, the agent's policy must choose among actions, not states. No problem: the optimal action in state s is the action a that maximizes the sum of the immediate reward r(s,a) plus the value V* of the immediate successor state, discounted by γ:

π*(s) = argmax_a [ r(s,a) + γ·V*(δ(s,a)) ]
        (immediate reward plus value of successor)

Q learning (cont.)
π*(s) = argmax_a [ r(s,a) + γ·V*(δ(s,a)) ]
Thus, if the agent knows the functions r and δ, it can learn the optimal policy by learning the value V* offline (value iteration algorithm, skipped).
For all but a few cases, δ is unknown: knowing it requires precise knowledge of the domain. Sometimes the domain is even non-deterministic!
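The three V* values in the example can be reproduced by rolling out the optimal policy along the bottom row of the grid. The corridor encoding below (state names, a single "right" action, reward 100 on entering G) is an illustrative reconstruction, not the slides' exact grid:

```python
# Rollout of a policy in a deterministic MDP, reproducing the example's
# values V*(bottom-left)=81, V*(bottom-center)=90, V*(bottom-right)=100.
# The corridor encoding below is an illustrative reconstruction.

GAMMA = 0.9
delta = {('left', 'right'): 'center',
         ('center', 'right'): 'bottom_right',
         ('bottom_right', 'right'): 'G'}          # G is absorbing
reward = {('left', 'right'): 0,
          ('center', 'right'): 0,
          ('bottom_right', 'right'): 100}
pi = {s: 'right' for s in ('left', 'center', 'bottom_right')}

def V(s, gamma=GAMMA):
    """Discounted return of following pi from s until the absorbing G."""
    total, discount = 0.0, 1.0
    while s != 'G':                    # loop is finite because G absorbs
        a = pi[s]
        total += discount * reward[(s, a)]
        discount *= gamma
        s = delta[(s, a)]
    return total
```

Each step farther from G multiplies the eventual reward of 100 by another factor of 0.9, which is exactly why the values fall 100, 90, 81.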
The Q function
If r or δ are unknown, what evaluation function should the agent use? The evaluation function Q:

π*(s) = argmax_a [ r(s,a) + γ·V*(δ(s,a)) ]
Q(s,a) ≡ r(s,a) + γ·V*(δ(s,a))
π*(s) = argmax_a Q(s,a)

Thus, if the agent is capable of learning Q, it will be able to select optimal actions even when it has no knowledge of r and δ.

The Q function (cont.)
π*(s) = argmax_a Q(s,a),  Q(s,a) = r(s,a) + γ·V*(δ(s,a))
Surprisingly, the agent can choose optimal actions without ever conducting a lookahead search to explicitly consider what states result from each action. Yet the Q function has exactly that property: Q(s,a) summarizes in a single value all the information needed to determine the discounted cumulative reward that will be gained in the future if action a is chosen in state s.

Example
Q(s,a) = r(s,a) + γ·V*(δ(s,a))
S: cells. A: arrows. r: numbers by the arrows.
- Q(s_G, a) = 0 + 0.9·0 = 0
- Q(s_bottom-right, a_right) = 100 + 0.9·0 = 100
- Q(s_bottom-center, a_right) = 0 + 0.9·100 = 90
- Q(s_bottom-left, a_right) = 0 + 0.9·0 + 0.9²·100 = 81
Since G is absorbing, the infinite sum becomes finite.
Algorithm for learning Q (Watkins 1989)
π*(s) = argmax_a Q(s,a),  Q(s,a) = r(s,a) + γ·V*(δ(s,a))
Learning Q is equivalent to learning the optimal policy. Note that:

V*(s) = max_{a'} [ r(s,a') + γ·V*(δ(s,a')) ] = max_{a'} Q(s,a')

So we obtain a recursive definition of Q:

Q(s,a) = r(s,a) + γ·max_{a'} Q(δ(s,a), a')

The algorithm learns an approximation Q̂ of Q, represented as a table with a separate entry for each state-action pair.

Algorithm for learning Q (cont.)
- For each (s,a), initialize the table entry Q̂(s,a) to zero.
- Observe the current state s.
- Do forever:
  - Choose some action a and execute it.
  - Observe the resulting reward r = r(s,a) and the new state s' = δ(s,a).
  - Update the table entry for Q̂(s,a) according to the rule: Q̂(s,a) ← r + γ·max_{a'} Q̂(s',a')
  - s ← s'
Note that Q-learning propagates Q̂ estimates backwards from the new state s' to the old state s.

Example 2. Q learning
[Figure: the agent executes a_right, moving from s1 to s2. Before the move, Q̂(s1, a_right) = 72; the Q̂ entries for the actions available at s2 are 63, 81, and 100.]

Q̂(s1, a_right) ← r + γ·max_{a'} Q̂(s2, a') = 0 + 0.9·max{63, 81, 100} = 90
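The table-update loop above can be sketched on the same corridor world. The encoding (states s0 → s1 → s2 → G with reward 100 on the final step) and the fixed three-episode schedule are illustrative assumptions; with only one action per state, no action-selection strategy is needed in this toy:

```python
# Tabular Q-learning on a deterministic corridor s0 -> s1 -> s2 -> G
# (reward 100 on the final step).  Encoding and the three-episode
# schedule are illustrative assumptions.

GAMMA = 0.9
delta = {('s0', 'right'): 's1', ('s1', 'right'): 's2', ('s2', 'right'): 'G'}
reward = {('s0', 'right'): 0, ('s1', 'right'): 0, ('s2', 'right'): 100}

Qhat = {sa: 0.0 for sa in delta}       # one table entry per (s, a) pair

def max_Qhat(s):
    """max over a' of Qhat(s, a'); 0 at the absorbing goal G."""
    vals = [q for (s2, _), q in Qhat.items() if s2 == s]
    return max(vals, default=0.0)

for episode in range(3):               # each episode runs s0, s1, s2, G
    s = 's0'
    while s != 'G':
        a = 'right'                    # the only available action here
        r, s_next = reward[(s, a)], delta[(s, a)]
        Qhat[(s, a)] = r + GAMMA * max_Qhat(s_next)   # the update rule
        s = s_next                     # estimates propagate backwards

# After episode 1: Qhat(s2)=100; after 2: Qhat(s1)=90; after 3: Qhat(s0)=81.
```

One nonzero entry propagates one step closer to the start per episode, which is the backwards propagation the slide describes and the behavior traced in Example 3 below.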
Example 3. Q learning
Proceeding in episodes from s0 to G, always through s1 and s2. Initially Q̂ = 0 for every state-action pair.
- Episode 1: only the final step earns a reward, so Q̂(s2, a) ← 100; the entries for s0 and s1 remain 0.
- Episode 2: Q̂(s1, a) ← 0 + 0.9·100 = 90; Q̂(s0, a) remains 0.
- Episode 3: Q̂(s0, a) ← 0 + 0.9·90 = 81.

Convergence of Q learning
Will the algorithm converge toward a Q̂ equal to the true Q function?

[Slide with the convergence proof; in the deterministic setting, Q̂ converges to Q provided every state-action pair is visited infinitely often — see Mitchell, chapter 13.]
Experimentation strategies in Q learning
The algorithm does not specify how actions are to be chosen!
Exploitation: One possibility is for the agent at state s to choose the action that maximizes Q̂(s,a), thereby exploiting its current approximation. With this strategy the agent risks failing to explore actions in other states that have even higher values but haven't been visited yet. Moreover, the convergence theorem requires that all action-state pairs be visited infinitely often.

Experimentation strategies in Q learning (2)
Exploration: A probabilistic approach that gives higher probabilities to higher Q̂ values:

P(a_i | s) = k^{Q̂(s,a_i)} / Σ_j k^{Q̂(s,a_j)}

where k > 0 determines how strongly the selection favors actions with high Q̂ values. High k → exploit; low k → explore.

Nondeterministic rewards and actions
Noisy effectors, games with dice, etc. Here δ(s,a) first produces a probability distribution P(s' | s,a) over successor states and then draws an outcome at random from it; similarly for r. We assume these probabilities follow the Markov property. We retrace the line of argument that led to the deterministic algorithm, revising it where needed.
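The probabilistic action-selection rule P(a_i | s) above can be sketched directly; the Q̂ values and the constant k passed in are illustrative:

```python
# The probabilistic selection rule P(a_i|s) = k**Qhat(s,a_i) /
# sum_j k**Qhat(s,a_j).  The Qhat values and k are illustrative.

def action_probabilities(q_values, k):
    """Selection probabilities for actions with the given Qhat values."""
    weights = [k ** q for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]
```

With k = 1 the Q̂ values are ignored entirely (pure exploration, uniform choice); as k grows, probability concentrates on the argmax action (exploitation), matching the "high k exploit, low k explore" note on the slide.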
Nondeterministic value function
We define the nondeterministic value function V^π(s_t) for policy π as the expected value of the discounted cumulative reward:

V^π(s_t) ≡ E[ Σ_{i=0}^∞ γ^i·r_{t+i} ]

As before, we define the optimal policy π* to be the policy that maximizes V^π(s) for all states s:

π* ≡ argmax_π V^π(s),  ∀s

Nondeterministic value function (cont.)
And we generalize the earlier definition of Q by taking its expected value:

Q(s,a) ≡ E[ r(s,a) + γ·V*(δ(s,a)) ]
       = E[ r(s,a) ] + γ·E[ V*(δ(s,a)) ]
       = E[ r(s,a) ] + γ·Σ_{s'} P(s' | s,a)·V*(s')

As before, we can express Q recursively:

Q(s,a) = E[ r(s,a) ] + γ·Σ_{s'} P(s' | s,a)·max_{a'} Q(s',a')

Convergence and training rule
The convergence proof holds for the deterministic case, but the previous learning rule does not converge in the nondeterministic case. The following training rule is sufficient to assure convergence of Q̂ to Q:

Q̂_n(s,a) ← (1 − α_n)·Q̂_{n−1}(s,a) + α_n·[ r + γ·max_{a'} Q̂_{n−1}(s',a') ]

where the bracketed term is the deterministic training rule, and

α_n = 1 / (1 + visits_n(s,a))
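A minimal sketch of the nondeterministic training rule with the decaying rate α_n = 1/(1 + visits_n(s,a)); the tables, action set, and sample transitions are illustrative assumptions:

```python
from collections import defaultdict

# Nondeterministic training rule with alpha_n = 1/(1 + visits_n(s, a)).
# Tables, action set, and the sampled transitions are illustrative.

Qhat = defaultdict(float)      # Qhat(s, a), zero-initialized
visits = defaultdict(int)      # visits_n(s, a): times (s, a) was updated

def update(s, a, r, s_next, actions, gamma=0.9):
    """One training-rule step: a weighted blend of the old estimate and
    the deterministic target r + gamma * max_a' Qhat(s', a')."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    target = r + gamma * max(Qhat[(s_next, a2)] for a2 in actions)
    Qhat[(s, a)] = (1 - alpha) * Qhat[(s, a)] + alpha * target

# Noisy rewards for the same pair are averaged in ever smaller steps:
update('s', 'a', 100, 'G', ['a'])   # alpha = 1/2 -> Qhat('s','a') = 50.0
update('s', 'a', 0, 'G', ['a'])     # alpha = 1/3 -> blends back toward 0
```

Each revisit shrinks α, so a single noisy sample perturbs the estimate less and less; this is exactly the gradual-revision idea the next slide highlights.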
Convergence and training rule (cont.)
Key idea: revisions to Q̂ are made more gradually than in the deterministic case. With α_n = 1 we recover the deterministic learning rule.
The choice of α_n given above is one of many that satisfy the conditions for convergence, according to a theorem by Watkins and Dayan (1992) (see Mitchell; not included here).