CS 1675 Introduction to Machine Learning
Lecture 26: Reinforcement learning II
Milos Hauskrecht, milos@cs.pitt.edu, 5329 Sennott Square

Reinforcement learning
Basics: the Learner receives input x, produces an output (action), and receives a reinforcement r from the Critic.
- The learner interacts with the environment
- It receives input with information about the environment (e.g. from sensors)
- It makes actions that (may) affect the environment
- It receives a reinforcement signal that provides feedback on how well it performed
Reinforcement learning
Objective: learn how to act in the environment in order to maximize the reinforcement signal
- The selection of actions should depend on the input x
- A policy π: X → A maps inputs to actions
- Goal: find the optimal policy π*: X → A that gives the best expected reinforcements
Example: learn how to play games (AlphaGo)

Gambling example
Game: 3 biased coins (1, 2, 3)
- The coin to be tossed is selected randomly from the three coin options. The agent always sees which coin is going to be played next.
- The agent makes a bet on either head or tail with a wager of $1. If after the coin toss the outcome agrees with the bet, the agent wins $1; otherwise it loses $1.
RL model:
- Input: X, the coin chosen for the next toss
- Action: A, the choice of head or tail the agent bets on
- Reinforcements: {1, -1}
A policy π: X → A. Example:
π: Coin1 → head, Coin2 → tail, Coin3 → head
Gambling example
RL model:
- Input: X, the coin chosen for the next toss
- Action: A, the choice of head or tail the agent bets on
- Reinforcements: {1, -1}
A policy π: Coin1 → head, Coin2 → tail, Coin3 → head
State, action, reward trajectories:
Step 0: state Coin2, action Tail, reward -1
Step 1: state Coin1, action Head, reward 1
Step 2: state Coin2, action Tail, reward 1
...
Step k: state Coin1, action Head, reward 1

RL learning: objective functions
Objective: find a policy π: X → A (Coin1 → ?, Coin2 → ?, Coin3 → ?) that maximizes some combination of future reinforcements (rewards) received over time
Valuation models (quantify how good the policy is):
- Finite horizon model: E(Σ_{t=0}^{T} r_t), with time horizon T ≥ 0
- Infinite horizon discounted model: E(Σ_{t=0}^{∞} γ^t r_t), with discount factor 0 ≤ γ < 1
- Average reward model: lim_{T→∞} (1/T) E(Σ_{t=0}^{T} r_t)
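The three valuation models above can be made concrete with a small sketch. The reward sequence and discount factor below are hypothetical, not from the lecture:

```python
# Evaluating one observed reward sequence under the three valuation models.
rewards = [-1, 1, 1, 1]   # hypothetical rewards r_0 .. r_T from one trajectory
gamma = 0.9               # discount factor, 0 <= gamma < 1

# Finite horizon model: sum_{t=0}^{T} r_t
finite_horizon = sum(rewards)

# Infinite horizon discounted model (truncated to the observed steps):
# sum_t gamma^t * r_t
discounted = sum(gamma**t * r for t, r in enumerate(rewards))

# Average reward model: (1/T) * sum_t r_t
average = sum(rewards) / len(rewards)
```

Note how the discount factor γ makes early rewards count more than later ones, while the average-reward model weighs all steps equally.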
RL with immediate rewards
Immediate reward case:
- The reward depends only on the input x and the action choice a
- The action does not affect the environment and hence future inputs (states) and future rewards
- R(x, a): the expected one-step reward for input x (the coin to play next) and the choice a
Expected reward (infinite horizon discounted model, 0 ≤ γ < 1):
E(Σ_{t=0}^{∞} γ^t r_t) = E(r_0) + γ E(r_1) + γ^2 E(r_2) + ...
Optimal strategy π*: X → A:
π*(x) = arg max_{a ∈ A} R(x, a)
RL with immediate rewards
The optimal choice assumes we know the expected reward R(x, a). Then:
π*(x) = arg max_{a ∈ A} R(x, a)
Caveats:
- We do not know the expected reward R(x, a); we need to estimate it, R̃(x, a), from interaction
- We cannot determine the optimal policy if the estimate of the expected reward is not good
- We need to try also actions that look suboptimal w.r.t. the current estimates R̃(x, a)

Estimating R̃(x, a)
Solution 1: for each input x try different actions a; estimate R̃(x, a) using the average of the observed rewards:
R̃(x, a) = (1 / N_{x,a}) Σ_{i=1}^{N_{x,a}} r_i
Solution 2: online approximation; update the estimate after performing action a in x and observing the reward r_i:
R̃^{(i)}(x, a) ← (1 - α(i)) R̃^{(i-1)}(x, a) + α(i) r_i
where α(i) is the learning rate
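The two estimators can be sketched directly. The function names are mine, not the lecture's:

```python
def running_average(rewards):
    """Solution 1: estimate R~(x, a) as the average of all observed rewards
    for a fixed (input, action) pair."""
    return sum(rewards) / len(rewards)

def online_update(estimate, reward, alpha):
    """Solution 2: online approximation
    R~ <- (1 - alpha) * R~ + alpha * r."""
    return (1 - alpha) * estimate + alpha * reward
```

With α(i) = 1 / N_{x,a}, the online update reproduces the running average exactly; a constant α instead weighs recent rewards more, which helps in non-stationary environments.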
RL with immediate rewards
At any step i in time during the experiment we have estimates of the expected rewards for each (coin, action) pair:
R̃(coin1, head), R̃(coin1, tail), R̃(coin2, head), R̃(coin2, tail), R̃(coin3, head), R̃(coin3, tail)
Assume the next coin to play in step (i+1) is coin 2 and we pick head as our bet. Then we update R̃^{(i+1)}(coin2, head) using the observed reward and one of the update strategies above, and keep the reward estimates for the remaining (coin, action) pairs unchanged, e.g.
R̃^{(i+1)}(coin2, tail) = R̃^{(i)}(coin2, tail)

Exploration vs. Exploitation
Uniform exploration: uses an exploration parameter 0 ≤ ε ≤ 1
- Choose the current best choice π̂(x) = arg max_{a ∈ A} R̃(x, a) with probability 1 - ε
- All other choices are selected with uniform probability ε / (|A| - 1)
Advantages: simple, easy to implement
Disadvantages: exploration is more appropriate at the beginning, when we do not have good estimates of R̃; exploitation is more appropriate later, when we have good estimates
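A minimal sketch of the uniform exploration rule above; the estimates dictionary and the two-action setup are hypothetical:

```python
import random

def uniform_explore(estimates, x, actions, eps, rng=random):
    """With probability 1 - eps pick the current best action
    argmax_a R~(x, a); otherwise pick uniformly among the other actions."""
    best = max(actions, key=lambda a: estimates[(x, a)])
    if rng.random() < 1 - eps:
        return best
    others = [a for a in actions if a != best]
    return rng.choice(others)
```

Setting ε high early in learning and decaying it over time addresses the disadvantage noted above: explore first, exploit once the estimates are good.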
Exploration vs. Exploitation
Boltzmann exploration:
- The action is chosen randomly, but proportionally to its current expected reward estimate
- It can be tuned with a temperature parameter T to promote exploration or exploitation
Probability of choosing action a:
p(a | x) = exp(R̃(x, a) / T) / Σ_{a' ∈ A} exp(R̃(x, a') / T)
Effect of T:
- For high values of T, p(a | x) is uniformly distributed over all actions
- For low values of T, p(a | x) of the action with the highest value of R̃ approaches 1

Agent navigation example
Agent navigation in a maze with a goal state G:
- 4 moves in compass directions
- Effects of moves are stochastic: we may wind up in a location other than the intended one with non-zero probability
- Objective: learn how to reach the goal state in the shortest expected time
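The Boltzmann (softmax) rule can be sketched as follows; the reward estimates passed in are hypothetical:

```python
import math

def boltzmann_probs(values, temperature):
    """Selection probabilities p(a|x) proportional to exp(R~(x, a) / T),
    for a list of current reward estimates R~(x, a)."""
    exps = [math.exp(v / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
```

The two limiting cases match the slide: a high temperature flattens the distribution toward uniform, a low temperature concentrates almost all probability on the best estimate.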
Agent navigation example
The RL model:
- Input: X, the position of the agent
- Output: A, the next move
- Reinforcements: R = -1 for each move, +100 for reaching the goal
- A policy π: X → A, e.g. π: Position 1 → right, Position 2 → right, ..., Position 25 → left
Goal: find the policy maximizing the future expected rewards E(Σ_{t=0}^{∞} γ^t r_t), 0 ≤ γ < 1

Agent navigation example
State, action, reward trajectories for the policy π: Position 1 → right, Position 2 → right, ..., Position 25 → left
(figure: a 5×5 maze grid of positions 1-25 with the goal state G)
Step 0: state Pos1, action Right, reward -1
Step 1: state Pos2, action Right, reward -1
Step 2: state Pos3, action Up, reward -1
...
Step k: state Pos15, action Up, reward -1
Learning with delayed rewards
- Actions, in addition to immediate rewards, affect the next state of the environment and thus indirectly also future rewards
- We need a model to represent the environment changes and the effect of actions on states and the rewards associated with them
Markov decision process (MDP): frequently used in AI, OR, and control theory
(diagram: action_{t-1} applied in state_{t-1} yields state_t and reward_{t-1})

Markov decision process
Formal definition: a 4-tuple (S, A, T, R)
- A set of states S (= X): locations of the robot
- A set of actions A: move actions
- A transition model T: S × A × S → [0, 1]: where can I get with different moves
- A reward model R: S × A × S → ℝ: the reward/cost for a transition
MDP problem
We want to find the best policy π*: S → A
A value function V^π for a policy π quantifies the goodness of the policy through, e.g., the infinite horizon discounted model E(Σ_{t=0}^{∞} γ^t r_t). It:
1. combines future rewards over a trajectory
2. combines rewards for multiple trajectories (through expectation-based measures)

Value of a policy for an MDP
Assume a fixed policy π: S → A. How to compute the value of the policy under the infinite horizon discounted model?
A fixed point equation:
V^π(s) = R(s, π(s)) + γ Σ_{s' ∈ S} P(s' | s, π(s)) V^π(s')
- the first term: the expected one-step reward for the first action
- the second term: the expected discounted reward for following the policy π for the rest of the steps
In vector form v = r + γ U v, hence v = (I - γ U)^{-1} r
For a finite state space we get a set of linear equations
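Instead of solving the linear system directly, the fixed-point equation can also be iterated until convergence, which is a useful sanity check. The tiny 2-state MDP below is hypothetical, not the maze from the lecture:

```python
# Iterating V(s) <- R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) * V(s')
# on a hypothetical 2-state MDP: state 0 moves to state 1 at cost -1,
# state 1 is absorbing with zero reward.
P = {  # P[(s, a)] = list of (next_state, probability)
    (0, "go"): [(1, 1.0)],
    (1, "go"): [(1, 1.0)],
}
R = {(0, "go"): -1.0, (1, "go"): 0.0}   # expected one-step rewards
pi = {0: "go", 1: "go"}                 # the fixed policy to evaluate
gamma = 0.5

V = {s: 0.0 for s in pi}
for _ in range(100):  # repeated application converges for gamma < 1
    V = {s: R[(s, pi[s])]
            + gamma * sum(p * V[s2] for s2, p in P[(s, pi[s])])
         for s in pi}
```

For this MDP the exact solution of v = (I - γU)^{-1} r is V(1) = 0 and V(0) = -1, which the iteration reproduces.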
Optimal policy
The value of the optimal policy:
V*(s) = max_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V*(s') ]
- the first term: the expected one-step reward for the first action
- the second term: the expected discounted reward for following the optimal policy for the rest of the steps
The optimal policy π*: S → A:
π*(s) = arg max_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V*(s') ]

Computing the optimal policy
Dynamic programming:
Value iteration: computes the optimal value function first, then the policy; the iterative approximation converges to the optimal value function
Value iteration (ε):
  initialize V  ;; V is a vector of values for all states
  repeat
    set V' ← V
    set V(s) ← max_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V'(s') ] for all s
  until ||V' - V|| ≤ ε
  output π(s) = arg max_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V(s') ]
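The value-iteration pseudocode above translates almost line by line into code. The 2-state, 2-action MDP below is a hypothetical example of mine:

```python
# Value iteration on a hypothetical MDP: from state 0, action "a" pays 10
# and leads to the absorbing state 1; action "b" pays -1 and stays in 0.
P = {(0, "a"): [(1, 1.0)], (0, "b"): [(0, 1.0)],
     (1, "a"): [(1, 1.0)], (1, "b"): [(1, 1.0)]}
R = {(0, "a"): 10.0, (0, "b"): -1.0, (1, "a"): 0.0, (1, "b"): 0.0}
gamma, eps = 0.9, 1e-8
states, actions = [0, 1], ["a", "b"]

def backup(V, s, a):
    """One-step backup: R(s,a) + gamma * sum_s' P(s'|s,a) * V(s')."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)])

V = {s: 0.0 for s in states}
while True:
    V_new = {s: max(backup(V, s, a) for a in actions) for s in states}
    done = max(abs(V_new[s] - V[s]) for s in states) <= eps
    V = V_new
    if done:
        break

# Extract the greedy policy from the converged value function.
policy = {s: max(actions, key=lambda a: backup(V, s, a)) for s in states}
```

Here the iteration converges to V*(0) = 10, V*(1) = 0, and the extracted policy picks "a" in state 0, as the Bellman optimality equation predicts.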
Reinforcement learning of optimal policies
In the RL framework we do not know the MDP model!!!
Goal: learn the optimal policy π*: S → A
Two basic approaches:
- Model-based learning: learn the MDP model (probabilities, rewards) first; solve the MDP afterwards
- Model-free learning: learn how to act directly; no need to learn the parameters of the MDP
A number of clones of the two exist in the literature

Model-based learning
We need to learn the transition probabilities and rewards
Learning of probabilities: ML parameter estimates, using counts:
P̃(s' | s, a) = N_{s,a,s'} / N_{s,a}
Learning rewards: similar to learning with immediate rewards:
R̃(s, a) = (1 / N_{s,a}) Σ_{i=1}^{N_{s,a}} r_i, or the online solution
Problem: changes in the probability and reward estimates would require us to solve an MDP from scratch (after every action and reward seen)!
Model-free learning
Motivation: the value function update (value iteration):
V(s) ← max_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V(s') ]
Let Q(s, a) = R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V(s')
Then V(s) = max_{a ∈ A} Q(s, a)
Note that the update can be defined purely in terms of Q-functions:
Q(s, a) = R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) max_{a'} Q(s', a')

Q-learning
Q-learning uses the Q-value update idea, but relies on a stochastic (on-line, sample-by-sample) update: the quantity R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) max_{a'} Q(s', a') is replaced with
Q̂(s, a) ← (1 - α) Q̂(s, a) + α [ r(s, a) + γ max_{a'} Q̂(s', a') ]
- r(s, a): the reward received from the environment after performing action a in state s
- s': the new state reached after action a
- α: the learning rate, a function of N_{s,a}, the number of times a has been executed at s
Q-function updates in Q-learning
At any step i in time during the experiment we have estimates of the Q-functions for each (state, action) pair:
Q(position 1, up), Q(position 1, left), Q(position 1, right), Q(position 1, down), Q(position 2, up), ...
Assume the current state is position 1 and we pick the up action to be performed next. After we observe the reward, we update Q̃(position 1, up) and keep the Q-function estimates for the remaining (state, action) pairs unchanged.

Q-learning
The on-line update rule is applied repeatedly during the direct interaction with an environment:
Q-learning:
  initialize Q(s, a) = 0 for all (s, a) pairs
  observe the current state s
  repeat
    select action a  ;; use some exploration/exploitation schedule
    receive reward r
    observe the next state s'
    update Q(s, a) ← (1 - α) Q(s, a) + α [ r + γ max_{a'} Q(s', a') ]
    set s to s'
  end repeat
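The Q-learning loop above can be run end to end on a toy environment. The 3-state chain, the reward constants, and the fixed learning rate below are my assumptions for illustration, not the lecture's maze:

```python
import random

# Hypothetical chain environment: states 0 -> 1 -> 2, where 2 is the goal.
# Each move costs -1; reaching the goal pays +100 (as in the maze example).
def step(s, a):
    """Deterministic transition: 'right' advances, 'left' goes back."""
    s2 = min(s + 1, 2) if a == "right" else max(s - 1, 0)
    r = 100.0 if s2 == 2 else -1.0
    return s2, r

Q = {(s, a): 0.0 for s in range(3) for a in ("left", "right")}
alpha, gamma, eps = 0.5, 0.9, 0.2   # fixed learning rate for simplicity
rng = random.Random(0)

for _ in range(500):                # episodes from state 0 to the goal
    s = 0
    while s != 2:
        if rng.random() < eps:      # simple uniform exploration
            a = rng.choice(("left", "right"))
        else:                       # greedy w.r.t. current estimates
            a = max(("left", "right"), key=lambda a2: Q[(s, a2)])
        s2, r = step(s, a)
        # Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]
        target = r + gamma * max(Q[(s2, "left")], Q[(s2, "right")])
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        s = s2

greedy = {s: max(("left", "right"), key=lambda a: Q[(s, a)]) for s in range(2)}
```

After training, the greedy policy moves right in both non-goal states, and Q(1, right) approaches its fixed point of 100.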
Q-learning convergence
Q-learning is guaranteed to converge to the optimal Q-values under the following conditions:
- Every state is visited and every action in that state is tried an infinite number of times; this is assured via the exploration/exploitation schedule
- The sequence of learning rates for each Q(s, a) satisfies:
  1. Σ_{n=1}^{∞} α_n(s, a) = ∞
  2. Σ_{n=1}^{∞} α_n(s, a)^2 < ∞
  where α_n(s, a) is the learning rate for the n-th trial of (s, a)

RL with delayed rewards
The optimal choice is π*(s) = arg max_a Q(s, a), much like what we had for the immediate rewards: π*(x) = arg max_a R(x, a)
RL learning: instead of the exact values of Q(s, a) we use estimates Q̂(s, a):
Q̂(s, a) ← (1 - α) Q̂(s, a) + α [ r(s, a) + γ max_{a'} Q̂(s', a') ]
Since we have only estimates Q̂(s, a), we need to try also actions that look suboptimal w.r.t. the current estimates
Exploration/exploitation strategies: uniform exploration, Boltzmann exploration
Q-learning speed-ups
The basic Q-learning rule may propagate distant (delayed) rewards very slowly
Example: a high-reward goal state G. To make the correct decision we need all Q-values for the current position to be good.
Problem: in each run we back-propagate values only one step back. It takes multiple trials to back-propagate values over multiple steps.

Q-learning speed-ups
Remedy: back up values for a larger number of steps
Rewards from applying the policy:
q_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... = Σ_{i=0}^{∞} γ^i r_{t+i}
We can substitute immediate rewards with n-step rewards:
q_t^{(n)} = Σ_{i=0}^{n-1} γ^i r_{t+i} + γ^n max_{a'} Q_{t+n}(s', a')
Postpone the update for n steps and update with the longer-trajectory rewards:
Q_{t+n+1}(s, a) ← Q_{t+n}(s, a) + α [ q_t^{(n)} - Q_{t+n}(s, a) ]
Problems:
- larger variance
- exploration/exploitation switching
- we must wait n steps to update
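The n-step reward q_t^{(n)} is a simple computation; a sketch with a hypothetical reward list and bootstrap value:

```python
def n_step_return(rewards, gamma, bootstrap):
    """q_t^(n) = sum_{i=0}^{n-1} gamma^i r_{t+i} + gamma^n * bootstrap,
    where rewards = [r_t, ..., r_{t+n-1}] and
    bootstrap = max_a' Q_{t+n}(s', a') at the state reached after n steps."""
    n = len(rewards)
    return (sum(gamma**i * r for i, r in enumerate(rewards))
            + gamma**n * bootstrap)
```

With n = 1 this reduces to the usual one-step Q-learning target r + γ max_{a'} Q(s', a'); larger n propagates a distant reward back in a single update, at the cost of the higher variance noted above.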
Q-learning speed-ups
One-step vs. n-step backup
(figure: one-step and n-step backup diagrams toward the goal state G)
Problems with n-step backups:
- larger variance
- exploration/exploitation switching
- we must wait n steps to update

Q-learning speed-ups
Temporal difference (TD) method:
- A remedy for the wait-n-steps problem: a partial back-up after every simulation step
- A similar idea: weather forecast adjustment
- Different versions of this idea have been implemented
RL successes
Reinforcement learning is relatively simple
On-line techniques can track non-stationary environments and adapt to their changes
Successful applications:
- DeepMind's AlphaGo (AlphaZero)
- TD-Gammon: learned to play backgammon at the championship level
- Elevator control
- Dynamic channel allocation in mobile telephony
- Robot navigation in the environment