Presentation Overview



Action Refinement in Reinforcement Learning by Probability Smoothing
By Thomas G. Dietterich & Didac Busquets
Speaker: Kai Xu

Presentation Overview
Background
The Probability Smoothing Method
Experimental Study of Action Refinement
Conclusion

Background -- Model Based Reinforcement Learning (MBRL)
Experience gained during exploration is used to learn models of the state-action transition function and the reward function.
From the learned model, the optimal policy can be computed by many good algorithms.
MBRL is appropriate when the state and action spaces are relatively small and finite, and each exploratory action is expensive.

Background -- Methods To Reduce the Need For Training Data
Incorporate some kind of prior knowledge.
Previous studies: abstraction knowledge across states, so that the RL algorithm can generalize across states.
This paper: abstraction knowledge across actions. The RL algorithm assumes that similar actions will have similar transition effects and rewards.

Background -- Action Refinement
Recall how humans learn.
Bad kung fu masters teach the students all the tricks at the beginning. The students have to spend a long time to grasp all of them.

Action Refinement
Good kung fu masters teach the students only the basic actions at the beginning. After the students grasp the basic skills, they teach them the subtleties among different similar actions. The students grasp all the tricks in a much shorter time.
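The model-based loop described under Background (learn transition and reward models from experience, then compute a policy from the learned model) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the uniform guess for unvisited state-action pairs, and the choice of value iteration as the planning algorithm are all assumptions of this sketch.

```python
import numpy as np

def learn_model(transitions, n_states, n_actions):
    """Maximum-likelihood model from (s, a, r, s_next) experience tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions, n_states))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a, s_next] += r
    visits = counts.sum(axis=2, keepdims=True)
    # Assumption: unvisited (s, a) pairs get a uniform transition guess.
    P = np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / n_states)
    # Mean reward per observed (s, a, s_next); zero where never observed.
    R = reward_sum / np.maximum(counts, 1)
    return P, R

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Compute state values and a greedy policy from the learned model."""
    V = np.zeros(P.shape[0])
    while True:
        # Q[s, a] = sum over s_next of P * (R + gamma * V[s_next])
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```

Every exploratory step feeds `learn_model`, which is why this approach suits problems where each real action is expensive: planning reuses the same data as often as needed.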

Action Refinement
An RL algorithm initially treats a set of similar actions as a single abstract action.
Later, it refines that abstract action into individual actions.

The Probability Smoothing Method

The Probability Smoothing Model
Context: The agent is interacting with an unknown but observable Markovian environment.
The environment contains a finite state set S and a finite action set A.
The programmer groups the set A into L disjoint action sets A_1, A_2, ..., A_L. Actions in the same subset are similar.
Let N(s, a) denote the number of times action a has been executed in state s.
Let N(s, a, s') denote the number of times this results in a transition to state s'.
Let W(s, a, s') denote the total reward received when a caused a transition from s to s'.
Define the probability smoothing model M such that, for an action a in the set A_l,

$$\hat{P}(s' \mid s, a) = \frac{\sum_{a' \in A_l} \lambda_{a'}\, N(s, a', s')}{\sum_{a' \in A_l} \lambda_{a'}\, N(s, a')}, \qquad \hat{R}(s, a, s') = \frac{\sum_{a' \in A_l} \lambda_{a'}\, W(s, a', s')}{\sum_{a' \in A_l} \lambda_{a'}\, N(s, a', s')}$$

where $\lambda_{a'}$ is the smoothing weight placed on the data gathered for action a'.

Determine Smoothing Parameter
Suppose the true transition probability from s to s' after executing action a is $P(s' \mid s, a)$, and the estimate of this probability is $\hat{P}(s' \mid s, a)$.
We want to find a proper λ such that $\hat{P}(s' \mid s, a)$ is a consistent estimator of the true probability.
To determine which estimator is more appropriate, we define the error measure

$$J(\lambda) = \sum_{s'} \left[ P(s' \mid s, a) - \hat{P}(s' \mid s, a) \right]^2$$

So the problem is to find a λ that minimizes J(λ).

Derivation of Optimal Smoothing Parameters in the simplest case
Suppose there are only two similar actions, a_1 and a_2.
The current state is s.
There are only two possible resulting states, s' and s''.
Action a_1 has been applied in state s N_1 times. H_1 of those times it led to state s'.
Action a_2 has been applied in state s N_2 times. H_2 of those times it led to state s'.
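The smoothed transition estimate defined above can be sketched directly from the counts. This is an illustrative sketch only: the dictionary layout of `counts`, the function name, and the weighting convention (weight 1 for the action being estimated, a single scalar `lam` for every other action in its set, as in the two-action case) are assumptions rather than details confirmed by the slides.

```python
def smoothed_transition_prob(counts, s, a, s_next, action_set, lam):
    """Estimate P(s_next | s, a) by pooling counts over similar actions.

    counts[(s, action)] is a dict mapping next state -> N(s, action, next state).
    Assumption: the action itself gets weight 1, every other action in
    action_set gets weight lam.
    """
    numerator = 0.0
    denominator = 0.0
    for other in action_set:
        weight = 1.0 if other == a else lam
        outcomes = counts.get((s, other), {})
        numerator += weight * outcomes.get(s_next, 0)
        denominator += weight * sum(outcomes.values())  # N(s, other)
    # No usable data at all: return None rather than guessing.
    return numerator / denominator if denominator > 0 else None
```

With `lam = 0` the estimate reduces to the plain per-action frequency (no smoothing), and with `lam = 1` all actions in the set are treated as one abstract action.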

Derivation of Optimal Smoothing Parameters in the simplest case
Suppose the true transition probability from s to s' after executing a_1 is p_1 (and p_2 after executing a_2).
Although H_1/N_1 is an estimator for p_1, it requires a large number of trials. So we should use the smoothing model:

$$\hat{p}_1 = \frac{H_1 + \lambda H_2}{N_1 + \lambda N_2}$$

After calculation, we find the most appropriate smoothing parameter

$$\lambda = \frac{V_1}{N_2 \varepsilon^2 + V_2}, \quad \text{where } \varepsilon = p_1 - p_2,\; V_1 = p_1 (1 - p_1),\; V_2 = p_2 (1 - p_2)$$

Properties of using this λ:

$$\lim_{N_1 \to \infty} \hat{p}_1 = p_1 \quad \text{and} \quad \lim_{N_1 \to \infty} \lim_{N_2 \to \infty} \hat{p}_1 = p_1$$

Therefore, probability smoothing will converge to the optimal policy.
This model can be extended to cases where there are more than 2 similar actions, or more than 2 possible resulting states.
We can use the resulting λ to build a good estimator for the reward.

Determine the Level of Smoothing in Practice
Big problem: in most practical cases we will never know the true values of p_1, p_2, or ε.
A naive approach for choosing λ would be to estimate p_1 by H_1/N_1 and p_2 by H_2/N_2. But when the number of trials is small, the variance of these estimates is very high, and the result is poor.
So the paper proposes default smoothing, in which we assume default values of p_1, p_2, and ε, and plug in the value of N_2 from the real data.

The authors propose the default values p_1 = 0.1, p_2 = 0.15, ε = p_1 − p_2 = −0.05 for the simplest case.
They work well when ε < 0.5 for all values of p.
For cases with more than 2 possible resulting states, the authors propose the default values V_1 = 0.09, V_2 = 0.1275, ε² = 0.0025.
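The closed-form smoothing parameter can be checked numerically. The sketch below assumes the two counts are independent binomials, H_1 ~ Binomial(N_1, p_1) and H_2 ~ Binomial(N_2, p_2); under that assumption the expected squared error of the smoothed estimate has the closed form used in `expected_sq_error`. Both function names are hypothetical, and the default values p_1 = 0.1 and p_2 = 0.15 in the usage note are this sketch's reading of the slide's defaults.

```python
def optimal_lambda(p1, p2, n2):
    """lambda = V1 / (N2 * eps^2 + V2), with eps = p1 - p2."""
    v1 = p1 * (1.0 - p1)
    v2 = p2 * (1.0 - p2)
    eps = p1 - p2
    return v1 / (n2 * eps ** 2 + v2)

def expected_sq_error(lam, p1, p2, n1, n2):
    """E[(p_hat - p1)^2] for p_hat = (H1 + lam*H2) / (N1 + lam*N2).

    Assumes independent binomial counts H1 and H2 (an assumption of this sketch).
    """
    v1 = p1 * (1.0 - p1)
    v2 = p2 * (1.0 - p2)
    eps = p1 - p2
    variance_part = n1 * v1 + lam ** 2 * n2 * v2
    bias_part = (lam * n2 * eps) ** 2
    return (variance_part + bias_part) / (n1 + lam * n2) ** 2
```

Default smoothing then amounts to calling `optimal_lambda(0.1, 0.15, n2)` with the observed trial count `n2`: the weight on the similar action shrinks as more of its data arrives, because the bias term N_2 ε² grows while the variance term stays fixed.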

Experimental Study of Action Refinement

Experimental Study of Action Refinement -- Context
A toy maze with 8 non-terminal states.
To measure the performance of a policy, we compute the value function V^π and sum the values of all 8 non-terminal states.
The optimal policy has a total value of 43.
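The performance measure described here (evaluate the policy, then sum its values over the non-terminal states) can be sketched with a direct linear solve. The function name, the matrix representation, and the discount factor `gamma` are assumptions of this sketch; the slides do not state how V^π was computed or whether rewards were discounted.

```python
import numpy as np

def policy_total_value(P_pi, r_pi, nonterminal, gamma=1.0):
    """Sum of V^pi over the non-terminal states.

    P_pi[s, s2] is the transition probability under the policy, and
    r_pi[s] is the expected one-step reward. Terminal states are modelled
    as all-zero rows with zero reward (an assumption), so that
    V = r + gamma * P V has a unique solution when every state
    eventually reaches a terminal state.
    """
    n = P_pi.shape[0]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    return float(V[list(nonterminal)].sum())
```

For example, a three-state chain 0 -> 1 -> 2 with state 2 terminal and a reward of -1 per step has values -2 and -1 for the two non-terminal states, giving a total of -3.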

The maze has 8 non-terminal states plus terminal states.
There are 16 actions, formed from the cross product of the 4 compass directions with 4 modifiers.
The actions are grouped into 4 sets.

Experimental Study of Action Refinement -- Compare With No Smoothing Method
Comparison of probability smoothing and no smoothing (λ = 0).
The probability smoothing model is much better.

Experimental Study of Action Refinement -- Compare with fixed smoothing & four-action
Comparison of fixed smoothing (a constant λ), the four-action method, and probability smoothing.
After 9.3 exploration steps, the four-action method and the probability smoothing method beat fixed smoothing.
After 3 steps, the probability smoothing method wins.

Experimental Study of Action Refinement -- Conclusion
Conclusions for the previous experiment:
The probability smoothing method is vastly superior to the no-smoothing method.
If a large training set is available, the probability smoothing method is better than the fixed-smoothing and four-action methods.

Experimental Study of Action Refinement -- Sensitivity To The Size of Action Sets
Vary the number of action sets over 1, 2, 4, 8, and 16, with similar actions grouped together to the extent possible.
16 separate action sets give high variance.
One single action set gives high bias.
4 sets and 2 sets gave the best performance during the early part of the curve.
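The action space and its groupings can be sketched as follows. The modifier names are hypothetical placeholders (the slides do not name the four modifiers), and grouping by compass direction is an assumption about what "similar" means here.

```python
from itertools import product

DIRECTIONS = ["north", "east", "south", "west"]
# Hypothetical placeholder names: the slides do not list the modifiers.
MODIFIERS = ["mod_a", "mod_b", "mod_c", "mod_d"]

# 16 actions: every (direction, modifier) pair, ordered direction-first.
ACTIONS = list(product(DIRECTIONS, MODIFIERS))

def partition_actions(actions, n_sets):
    """Split the ordered action list into n_sets equal contiguous groups.

    Because the list is ordered direction-first, contiguous groups keep
    actions that share a direction together as far as possible.
    """
    if n_sets <= 0 or len(actions) % n_sets != 0:
        raise ValueError("n_sets must evenly divide the number of actions")
    size = len(actions) // n_sets
    return [actions[i:i + size] for i in range(0, len(actions), size)]
```

Calling `partition_actions(ACTIONS, n)` for n in 1, 2, 4, 8, 16 gives one grouping per setting of the sensitivity experiment: n = 16 corresponds to no sharing between actions, and n = 1 to a single abstract action.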

Experimental Study of Action Refinement -- Sensitivity To The Action Set Correctness
At intermediate sample size (4), even random groupings give better performance than no smoothing.
At large sample size, the bias in the random and bad groupings leads to worse performance than either no smoothing or well-chosen action sets.

Conclusions
The probability smoothing method is introduced for action refinement to speed up RL applications by partitioning actions into sets of similar actions.
It significantly eases the design of a set of good actions in RL.
The probability smoothing parameter is determined by default smoothing and the corresponding number of trials.
A good prior partition of the action set is critical to performance.

Action Refinement applied to the Robot Navigation Problem
Comments by: Sameer Apte
Paper: Action Refinement in Reinforcement Learning by Probability Smoothing, by Thomas G. Dietterich, Didac Busquets, Ramon Lopez de Mantaras, and Carles Sierra
The robot navigates by finding visual landmarks.
The robot's camera has a viewing angle of 60 degrees.
The space around the robot was partitioned into six 60-degree sectors.

Action Refinement applied to the Robot Navigation Problem
An action called Move while Looking for Landmarks (MLL) was defined.
The robot moves forward while aiming its camera at one of the six sectors to search for new visual landmarks.
We can define six MLL actions, one for each sector, and let the robot decide which sector to examine with the camera.

Action Refinement applied to the Robot Navigation Problem
Designers do not know which of these actions would be most useful.
Include all these actions in the MDP and let the RL system determine which actions are useful.
Problem: a large amount of exploration is required to learn a good policy.
Train the robot several times, each time with a different set of actions.
Problem: even more training experiences are required.

Action Refinement applied to the Robot Navigation Problem
Solution: Action Refinement
We know that different variants of the MLL action have similar behavior.
Initially treat these similar actions as a single abstract action.
Later allow the learning algorithm to refine the abstract action into individual actions.
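The sector partition and the grouping of the MLL variants can be written down in a few lines. The action names, the convention that sector 0 starts at a bearing of 0 degrees, and the list-of-lists representation of action sets are all assumptions of this sketch.

```python
SECTOR_WIDTH_DEG = 60
N_SECTORS = 360 // SECTOR_WIDTH_DEG  # six sectors around the robot

def sector_of(bearing_deg):
    """Index (0..5) of the 60-degree sector containing a bearing.

    Assumption: sector 0 covers bearings in [0, 60) degrees.
    """
    return int((bearing_deg % 360) // SECTOR_WIDTH_DEG)

# One MLL variant per sector (hypothetical names).
MLL_ACTIONS = ["MLL_sector_%d" % i for i in range(N_SECTORS)]

# Action refinement starts from one abstract action: a single set
# holding all six variants, whose statistics are shared by smoothing.
ACTION_SETS = [MLL_ACTIONS]
```

Keeping the six variants in one set lets early experience with any sector inform the estimates for all of them, and the learner can still separate them once enough data per variant has accumulated.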

Direct vs. Model-Based Reinforcement Learning
-- Commentary on Kai Xu's presentation
-- Commented by Ruinan Lu
-- Reference: paper by C. Atkeson, et al.

Criteria
Data efficiency
Computing efficiency

Problem for comparison of the two approaches: single pendulum swing-up
Make it swing!

Model-based RL
Known reward function:

$$r(\theta, \tau) = \left( (\theta - \theta_d)^2 + \tau^2 \right) \Delta t$$

θ: angle of the pendulum
θ_d: desired angle for the inverted vertical state
τ: motor torque
Δt: time step
Equation of motion:

$$\ddot{\theta} = \tau - 9.8 \cos(\theta)$$

Q-learning

$$Q(x_t, u_t) = Q(x_t, u_t) + \alpha \left[ r(x_t, u_t) + \gamma\, Q(x_{t+1}, u_{t+1}) - Q(x_t, u_t) \right] e(x, u)$$

α: learning rate
γ: discount factor
x: state vector
u: control vector
Optimal action: $\arg\min_u Q(x, u)$

Results
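The pendulum model and the tabular update can be sketched together. Several details here are assumptions of the sketch rather than facts from the slides: the Euler integration scheme and step size, the discretised `(state, action)` keys, the accumulating eligibility trace with decay `gamma * trace_decay`, and all function names. Because r is a cost, the greedy action minimises Q.

```python
import math
from collections import defaultdict

GRAVITY = 9.8

def pendulum_step(theta, theta_dot, torque, dt=0.01):
    """One Euler step of theta_ddot = torque - 9.8 * cos(theta)."""
    theta_ddot = torque - GRAVITY * math.cos(theta)
    theta_dot = theta_dot + theta_ddot * dt
    theta = theta + theta_dot * dt
    return theta, theta_dot

def step_cost(theta, torque, theta_desired, dt=0.01):
    """Per-step cost: squared angle error plus squared torque, times dt."""
    return ((theta - theta_desired) ** 2 + torque ** 2) * dt

def td_update(Q, e, x, u, r, x_next, u_next,
              alpha=0.1, gamma=0.95, trace_decay=0.9):
    """Apply the update Q += alpha * delta * e to every traced pair.

    Q and e are defaultdict(float) tables keyed by (state, action).
    Assumption: accumulating traces that decay by gamma * trace_decay.
    """
    delta = r + gamma * Q[(x_next, u_next)] - Q[(x, u)]
    e[(x, u)] += 1.0
    for key in list(e):
        Q[key] += alpha * delta * e[key]
        e[key] *= gamma * trace_decay
    return delta

def greedy_action(Q, x, actions):
    """Optimal action under a cost formulation: argmin over u of Q(x, u)."""
    return min(actions, key=lambda u: Q[(x, u)])
```

A direct learner would call `td_update` on every real or simulated step, whereas a model-based learner would fit the dynamics in `pendulum_step` from data and plan with them; that contrast is what the data-efficiency and computing-efficiency criteria compare.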

Conclusions
Cases that favor MBRL:
Simple dynamics
Exploratory actions are expensive
Exploration is performed on a physical system
Cases that favor direct RL:
More training experiences
Learner interacts with an inexpensive simulator
