Reinforcement learning

Regular MDP:
- Given: transition model P(s' | s, a), reward function R(s)
- Find: policy π(s)

Reinforcement learning:
- Transition model and reward function initially unknown
- Still need to find the right policy
- Learn by doing
Reinforcement learning: basic scheme

In each time step:
- Take some action
- Observe the outcome of the action: successor state and reward
- Update some internal representation of the environment and policy
- If you reach a terminal state, just start over (each pass through the environment is called a trial)

Why is this called reinforcement learning?
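In code, the basic scheme looks roughly like the following minimal sketch, assuming a hypothetical environment with reset()/step() methods and an agent with act()/update() methods (all names here are illustrative, not from the slides):

```python
def run_trials(env, agent, num_trials):
    """Run repeated trials of the basic RL scheme."""
    for _ in range(num_trials):  # each pass through the environment is one trial
        state = env.reset()      # start (over) from an initial state
        done = False
        while not done:
            action = agent.act(state)                        # take some action
            next_state, reward, done = env.step(action)      # observe successor state and reward
            agent.update(state, action, reward, next_state)  # update internal model/policy
            state = next_state
```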
Applications of reinforcement learning: backgammon (TD-Gammon)
http://www.research.ibm.com/massive/tdl.html
http://en.wikipedia.org/wiki/TD-Gammon
Applications of reinforcement learning: learning a fast gait for Aibo
[Videos: initial gait vs. learned gait]
Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion. Nate Kohl and Peter Stone. IEEE International Conference on Robotics and Automation, 2004.
Applications of reinforcement learning: Stanford autonomous helicopter
Reinforcement learning strategies

Model-based strategies:
- Learn the model of the MDP (transition probabilities and rewards) and try to solve the MDP concurrently

Model-free strategies:
- Learn how to act without explicitly learning the transition probabilities P(s' | s, a)
- Q-learning: learn an action-utility function Q(s, a) that tells us the value of doing action a in state s
Model-based reinforcement learning

Basic idea: try to learn the model of the MDP (transition probabilities and rewards) and learn how to act (solve the MDP) simultaneously.

Learning the model:
- Keep track of how many times state s' follows state s when you take action a, and update the transition probability P(s' | s, a) according to the relative frequencies
- Keep track of the rewards R(s)

Learning how to act:
- Estimate the utilities U(s) using Bellman's equations
- Choose the action that maximizes expected future utility:
    π*(s) = arg max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')
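As a rough illustration (not the slides' own code), a model-based agent might maintain transition counts and reward estimates and repeatedly re-solve the estimated MDP with Bellman updates; the class below is a simplified sketch with illustrative names:

```python
from collections import defaultdict

class ModelBasedAgent:
    """Sketch: learn the MDP model from experience and act greedily on it."""
    def __init__(self, states, actions, gamma=0.9):
        self.states, self.actions, self.gamma = states, actions, gamma
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.R = defaultdict(float)   # observed rewards R(s)
        self.U = defaultdict(float)   # utility estimates U(s)

    def update(self, s, a, reward, s2):
        self.counts[(s, a)][s2] += 1  # how many times s' followed (s, a)
        self.R[s2] = reward           # keep track of the rewards
        self.solve()                  # re-estimate utilities on the learned model

    def P(self, s2, s, a):
        total = sum(self.counts[(s, a)].values())
        return self.counts[(s, a)][s2] / total if total else 0.0  # relative frequency

    def expected_utility(self, s, a):
        # sum over s' of P(s'|s,a) U(s') under the current model estimate
        return sum(self.P(s2, s, a) * self.U[s2] for s2 in self.counts[(s, a)])

    def solve(self, sweeps=50):
        # Approximate U(s) with repeated Bellman updates on the learned model
        for _ in range(sweeps):
            for s in self.states:
                self.U[s] = self.R[s] + self.gamma * max(
                    self.expected_utility(s, a) for a in self.actions)

    def act(self, s):
        # pi*(s) = arg max_a sum_s' P(s'|s,a) U(s')  (purely greedy; see below)
        return max(self.actions, key=lambda a: self.expected_utility(s, a))
```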
Model-based reinforcement learning

Learning how to act:
- Estimate the utilities U(s) using Bellman's equations
- Choose the action that maximizes expected future utility, given the model of the environment we've experienced through our actions so far:
    π*(s) = arg max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')

Is there any problem with this greedy approach?
Exploration vs. exploitation

Exploration: take a new action with unknown consequences
- Pros: get a more accurate model of the environment; discover higher-reward states than the ones found so far
- Cons: while you are exploring, you are not maximizing your utility; something bad might happen

Exploitation: go with the best strategy found so far
- Pros: maximize reward as reflected in the current utility estimates; avoid bad stuff
- Cons: might also prevent you from discovering the true optimal strategy
Incorporating exploration

Idea: explore more in the beginning, become more and more greedy over time.

Standard (greedy) selection of the optimal action:
    a = arg max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')

Modified strategy:
    a = arg max_{a ∈ A(s)} f( Σ_{s'} P(s' | s, a) U(s'), N(s, a) )

where f is the exploration function, N(s, a) is the number of times we've taken action a in state s, and

    f(u, n) = R+ if n < N_e, and u otherwise

with R+ an optimistic estimate of the best possible reward.
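A minimal sketch of this exploration function, with R_PLUS and N_E as illustrative constants (the slides do not fix their values):

```python
R_PLUS = 10.0  # optimistic estimate of the best possible reward
N_E = 5        # try each (state, action) pair at least this many times

def f(u, n):
    """Exploration function: act optimistically until (s, a) is well explored."""
    return R_PLUS if n < N_E else u

def exploratory_action(s, actions, expected_utility, N):
    # a = arg max_a f( sum_s' P(s'|s,a) U(s'), N(s, a) )
    return max(actions, key=lambda a: f(expected_utility(s, a), N[(s, a)]))
```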
Model-free reinforcement learning

Idea: learn how to act without explicitly learning the transition probabilities P(s' | s, a).

Q-learning: learn an action-utility function Q(s, a) that tells us the value of doing action a in state s.

Relationship between Q-values and utilities:
    U(s) = max_a Q(s, a)
Model-free reinforcement learning

Q-learning: learn an action-utility function Q(s, a) that tells us the value of doing action a in state s:
    U(s) = max_a Q(s, a)

Equilibrium constraint on Q-values:
    Q(s, a) = R(s) + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s', a')

Problem: we don't know (and don't want to learn) P(s' | s, a).
Temporal difference (TD) learning

Equilibrium constraint on Q-values:
    Q(s, a) = R(s) + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s', a')

Temporal difference (TD) update: pretend that the currently observed transition (s, a, s') is the only possible outcome, and adjust the Q-values towards the "local equilibrium":

    Q_local(s, a) = R(s) + γ max_{a'} Q(s', a')
    Q_new(s, a) = (1 − α) Q(s, a) + α Q_local(s, a)
                = Q(s, a) + α (Q_local(s, a) − Q(s, a))
                = Q(s, a) + α (R(s) + γ max_{a'} Q(s', a') − Q(s, a))
Temporal difference (TD) learning

At each time step t:
- From current state s, select an action a:
    a = arg max_{a'} f(Q(s, a'), N(s, a'))
  where f is the exploration function and N(s, a') is the number of times we've taken action a' from state s
- Get the successor state s'
- Perform the TD update:
    Q(s, a) ← Q(s, a) + α (R(s) + γ max_{a'} Q(s', a') − Q(s, a))
  where α is the learning rate; it should start at 1 and decay as O(1/t), e.g. α(t) = 60 / (59 + t)
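Putting the pieces together, here is a sketch of a TD Q-learning agent; it reuses the exploration function f from the earlier sketch, and the class layout and global time step are illustrative choices, not prescribed by the slides:

```python
from collections import defaultdict

class QLearningAgent:
    """Sketch: model-free TD Q-learning with an exploration function."""
    def __init__(self, actions, gamma=0.9):
        self.actions, self.gamma = actions, gamma
        self.Q = defaultdict(float)  # Q(s, a), default 0
        self.N = defaultdict(int)    # N(s, a), visit counts
        self.t = 0                   # global time step for the learning-rate decay

    def act(self, s):
        # a = arg max_a' f(Q(s, a'), N(s, a'))
        # assumes the exploration function f(u, n) defined in the earlier sketch
        return max(self.actions, key=lambda a: f(self.Q[(s, a)], self.N[(s, a)]))

    def update(self, s, a, reward, s2):
        self.t += 1
        self.N[(s, a)] += 1
        alpha = 60.0 / (59.0 + self.t)  # decays as O(1/t), starting near 1
        best_next = max(self.Q[(s2, a2)] for a2 in self.actions)
        # TD update: Q(s,a) <- Q(s,a) + alpha * (R + gamma * max_a' Q(s',a') - Q(s,a))
        self.Q[(s, a)] += alpha * (reward + self.gamma * best_next - self.Q[(s, a)])
```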
Function approximation

So far, we've assumed a lookup-table representation for the utility function U(s) or the action-utility function Q(s, a). But what if the state space is really large, or continuous?

Alternative idea: approximate the utility function as a weighted linear combination of features:
    U(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s)

RL algorithms can be modified to estimate these weights. (Recall: features for designing evaluation functions in games.)

Benefits:
- Can handle very large state spaces (games) and continuous state spaces (robot control)
- Can generalize to previously unseen states
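As one hedged example of how the weights might be estimated (the slides only say that RL algorithms "can be modified" to do so), a TD-style gradient step for a linear approximator could look like the following; all names and constants are illustrative:

```python
def U_hat(weights, features, s):
    # U(s) ~ w1*f1(s) + w2*f2(s) + ... + wn*fn(s)
    return sum(w * f(s) for w, f in zip(weights, features))

def td_weight_update(weights, features, s, reward, s2, alpha=0.1, gamma=0.9):
    """Nudge each weight to reduce the TD error; for a linear approximator,
    the gradient of U_hat with respect to w_i is just f_i(s)."""
    error = reward + gamma * U_hat(weights, features, s2) - U_hat(weights, features, s)
    return [w + alpha * error * f(s) for w, f in zip(weights, features)]
```

Because the same weights score every state, an update made in one state changes the estimated utility of previously unseen states as well, which is exactly the generalization benefit noted above.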