Reinforcement Learning for Robotic Locomotion

Bo Liu, Stanford University, 121 Campus Drive, Stanford, CA 94305, USA, bliuxix@stanford.edu
Huanzhong Xu, Stanford University, 121 Campus Drive, Stanford, CA 94305, USA, xuhunvc@stanford.edu
Songze Li, Stanford University, 121 Campus Drive, Stanford, CA 94305, USA, songzeli@stanford.edu

Abstract

In Reinforcement Learning, it is usually more convenient to optimize in the policy space $\pi(a \mid s)$ directly than to form a policy indirectly by accurately evaluating the state-action function $Q(s, a)$ or the value function $V(s)$: the value function does not prescribe actions, and the state-action function can be very hard to estimate in continuous or large action spaces. Therefore, in an environment that suffers from a high cost of evaluation, such as a robotic locomotion system, a policy optimization method is a better option due to the significantly smaller policy space. In this project, we investigate a particular policy optimization method, the Trust Region Policy Optimization algorithm, explore possible variants of it, and propose a new method based on our understanding of this algorithm.

1 Introduction

The original motivation of this project comes from the desire to understand human motor control units in the field of bioengineering. Controlling a simulated human body to achieve complex motions using an artificial brain would provide not only a better understanding of how the human motor system works but also a theoretical foundation for surgeries for people with physical disabilities. We therefore focus on finding efficient Reinforcement Learning algorithms for environments with a high evaluation cost, and we eventually decided to explore policy gradient methods because searching in policy space can be much more efficient.

In particular, we investigate the recent work on Trust Region Policy Optimization (Schulman et al., 2015), in which the authors formulate the reinforcement learning objective as an optimization problem subject to a trust region constraint. This algorithm works well in practice and has a theoretical guarantee of improvement in each episode if the objective and constraint are evaluated exactly. In the original paper, the trust region corresponds to the region of policy space that is close to the old policy. The authors define closeness in terms of the Kullback-Leibler (KL) divergence between the old and new policy distributions, i.e. $D_{KL}(\pi_{old}(\cdot \mid s) \,\|\, \pi_{new}(\cdot \mid s))$. In fact, during the optimization step they use a second-order approximation of this KL constraint, which involves the Hessian of the KL divergence. Although using the conjugate gradient method, as the authors suggest, allows us to avoid computing the Hessian exactly, this is in general still computationally expensive. We therefore ask to what extent the KL constraint can outperform other easy-to-compute constraints, or even no constraint at all, to compensate for its cost. Based on our understanding of this model, we also view the problem from another perspective and propose a new method that estimates advantage values with a neural net and updates the policy directly.
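As mentioned above, the conjugate gradient method lets TRPO work with the KL Hessian through Hessian-vector products alone, without ever forming the matrix. The following is a minimal sketch of that solver, assuming Python/NumPy and a user-supplied Hessian-vector product routine `hvp` (e.g. obtained via automatic differentiation); it is an illustration, not our training code.

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products hvp(v) = H v."""
    x = np.zeros_like(g)
    r = g.copy()              # residual g - H x (x starts at 0)
    p = r.copy()              # current search direction
    r_dot = r.dot(r)
    for _ in range(iters):
        Hp = hvp(p)
        alpha = r_dot / (p.dot(Hp) + 1e-12)
        x += alpha * p
        r -= alpha * Hp
        new_r_dot = r.dot(r)
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

# Toy check with an explicit positive-definite matrix standing in for the KL Hessian.
H = np.array([[2.0, 0.5], [0.5, 1.0]])
g = np.array([1.0, -1.0])
x = conjugate_gradient(lambda v: H @ v, g)
print(np.allclose(H @ x, g, atol=1e-6))  # True
```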

2 Related Work

There are mainly two approaches to the proposed problem in reinforcement learning: policy gradient methods and Q-function methods. A policy gradient method directly optimizes a parametrized control policy by gradient descent. It belongs to the class of policy search techniques that maximize the expected return of a policy within a fixed policy class, while traditional value function approximation approaches derive policies from a value function. Policy gradient methods allow the straightforward incorporation of domain knowledge in the policy parametrization and require significantly fewer parameters to represent the optimal policy than the corresponding value function. They are guaranteed to converge to a policy that is at least locally optimal. Furthermore, they can handle continuous states and actions, even including imperfect state information. Besides the vanilla policy gradient method, there exist variants such as the natural policy gradient (Kakade, 2002) and algorithms that use a trust region (Schulman et al., 2015).

Instead of parameterizing the policy, Q-function methods focus on the state-action function $Q^\pi(s_t, a_t)$ and update it with the Bellman equation. The optimal Q is obtained via value iteration, in which we repeatedly apply the Bellman operator until it converges. It has been shown that, under some mild assumptions, this value iteration algorithm is guaranteed to converge to $Q^{\pi^*}$ for any initial Q, where $\pi^*$ is the optimal policy (Baird and others, 1995). Variants of this basic value iteration algorithm include neural fitted Q-iteration, which parameterizes the Q-function with a neural network and replaces the Bellman operator with minimizing the MSE between two Q-functions. Q-function methods do not work as generally as policy gradient methods, although they are more sample-efficient when they do work.

Algorithm               | Simple & Scalable | Data Efficient
Vanilla Policy Gradient | Good              | Bad
Natural Policy Gradient | Bad               | OK
Q-learning              | Good              | OK

Table 1: Comparison of different algorithms.

3 Notation

To make the derivations and descriptions that follow clear, we provide our notation for the classic Markov Decision Process (MDP) and review the conventional reinforcement learning objective. An MDP is a 6-tuple $(S, A, T, r, \rho_0, \gamma)$, where $S$ is the set of possible states, $A$ is the set of possible actions, $T: S \times A \times S \to \mathbb{R}$ is the transition probability, $r: S \to \mathbb{R}$ is the reward function, $\rho_0: S \to \mathbb{R}$ is the distribution of the initial state $s_0$, and $\gamma$ is the discount factor. Let $\pi$ denote a stochastic policy $\pi: S \times A \to [0, 1]$. The objective of classic RL is to maximize the expected future reward:

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t)\Big]$$

where $s_0 \sim \rho_0(s_0)$, $a_t \sim \pi(a_t \mid s_t)$, and $s_{t+1} \sim T(s_{t+1} \mid s_t, a_t)$. Let $Q^\pi(s, a)$ and $V^\pi(s)$ denote the standard state-action function and value function, where

$$Q^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\Big[\sum_{l=t}^{\infty} \gamma^{l-t} r(s_l)\Big], \qquad V^\pi(s_t) = \mathbb{E}_{a_t}\big[Q^\pi(s_t, a_t)\big]$$

Define the advantage as

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

In addition, we define the discounted visitation frequency function $\rho_\pi: S \to \mathbb{R}$:

$$\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots$$

4 Method

4.1 Policy Gradient Methods

Policy gradient methods, as the name suggests, try to estimate the derivative of the objective with respect to the policy parameters directly. The most commonly used gradient estimator has the form

$$\hat{g} = \mathbb{E}_t\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\big]$$

The TRPO algorithm is a specific instance of this class of algorithms.
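In code, this estimator is usually obtained by differentiating the surrogate $\mathbb{E}_t[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t]$ with automatic differentiation. The sketch below is illustrative only, assuming PyTorch, a diagonal Gaussian policy, and random stand-in roll-out data (the shapes and the linear mean network are placeholders, not our actual model).

```python
import torch
from torch.distributions import Normal

# Stand-in roll-out data; shapes and the linear mean network are illustrative only.
T, obs_dim, act_dim = 128, 10, 3
mean_net = torch.nn.Linear(obs_dim, act_dim)        # outputs the Gaussian mean
log_std = torch.nn.Parameter(torch.zeros(act_dim))  # state-independent log std

obs = torch.randn(T, obs_dim)
actions = torch.randn(T, act_dim)
advantages = torch.randn(T)                         # stand-in for \hat{A}_t, e.g. from GAE

# Build the surrogate E_t[log pi_theta(a_t|s_t) * A_hat_t]; its gradient is the estimator above.
dist = Normal(mean_net(obs), log_std.exp())
log_prob = dist.log_prob(actions).sum(dim=-1)       # joint log-prob of the action vector
surrogate = (log_prob * advantages).mean()
surrogate.backward()                                # gradients now sit in the .grad fields

print(mean_net.weight.grad.shape)   # torch.Size([3, 10])
print(log_std.grad.shape)           # torch.Size([3])
```

A vanilla policy gradient step would simply move the parameters along this gradient with some step size; TRPO instead constrains how far each step may go, as described next.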

4.2 Trust Region Policy Optimization

The Trust Region Policy Optimization (TRPO) method combines the idea of Minorize-Maximization with policy gradient methods. It defines a surrogate function which is easier to optimize and provides a strict lower bound for the original objective function; but in order to use this surrogate, the optimization has to be done in a region near the previous policy. Let $\pi_0$ denote the baseline policy and $\pi$ denote any policy. The following identity expresses the expected return of $\pi$ (Kakade and Langford, 2002):

$$\eta(\pi) = \eta(\pi_0) + \mathbb{E}_{s_0, a_0, \ldots \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t A_{\pi_0}(s_t, a_t)\Big]$$
$$= \eta(\pi_0) + \sum_{t=0}^{\infty} \sum_s P(s_t = s \mid \pi) \sum_a \pi(a \mid s)\, \gamma^t A_{\pi_0}(s, a)$$
$$= \eta(\pi_0) + \sum_s \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi) \sum_a \pi(a \mid s)\, A_{\pi_0}(s, a)$$
$$= \eta(\pi_0) + \sum_s \rho_\pi(s) \sum_a \pi(a \mid s)\, A_{\pi_0}(s, a)$$

By substituting $\rho_\pi$ with $\rho_{\pi_0}$, we get an approximation of the discounted future reward under the new policy $\pi$. The new objective is

$$L(\pi) = \eta(\pi_0) + \sum_s \rho_{\pi_0}(s) \sum_a \pi(a \mid s)\, A_{\pi_0}(s, a) \qquad (1)$$

It has been pointed out that this approximation matches the original objective $\eta(\pi)$ to first order (Schulman et al., 2015). Therefore, a small enough step $\pi_{\theta_{old}} \to \pi$ that improves $L(\pi_{\theta_{old}})$ also improves $\eta(\pi)$. The problem is what "small" means here. The authors suggest using the KL divergence as the measure and provide a rigorous theoretical proof. For simplicity, we do not include the full derivation here but only the eventual surrogate objective:

$$\max_\theta \; L_{\theta_{old}}(\theta) - C\, D_{KL}(\theta_{old}, \theta)$$

where $C$ is some properly chosen constant. However, in practice, if we used the penalty coefficient $C$ recommended by the theory, the step size would be too small. To obtain a faster and more robust optimization, a trust region is introduced and the objective becomes maximizing $L$ subject to the KL divergence lying inside the trust region:

$$\max_\theta \; L_{\theta_{old}}(\theta) \quad \text{subject to} \quad D_{KL}(\theta_{old}, \theta) \le \delta$$

We can further estimate the objective using importance sampling based on the $\pi_{old}$ distribution. The constraint can likewise be estimated with Monte-Carlo sampling and a second-order approximation.

4.3 Model for TRPO

For TRPO and its variants, we use neural networks for both the policy and value models. For both networks, the input is the state observation. The output of the policy model is a multivariate normal distribution over actions, parameterized by its mean and standard deviation. The output of the value model is a single number, the estimate of the value at this particular state observation.

4.4 Substitution for the KL Divergence

The basic idea of TRPO is to optimize the surrogate loss function under the restriction that the new policy is close enough to the old one. However, one may wonder why we have to use the KL divergence to measure the closeness between the two policies. A natural substitute for the KL divergence is the mean-squared error (MSE) $\|\theta_{old} - \theta\|_2^2$. Replacing the KL constraint with the MSE significantly speeds up the training procedure, since the MSE is easier to estimate. Moreover, we recognize that using the MSE corresponds to the original policy gradient method, because when the MSE is sufficiently small, the direction that maximizes the objective corresponds to its gradient. We experiment with TRPO under both the KL and MSE constraints. As a baseline, we also implement a method that purely optimizes the objective without any constraint. We analyze the comparison further in the following sections with figures.
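To make the two closeness measures concrete: for the diagonal Gaussian policies of Section 4.3 the per-state KL divergence has a closed form, while the MSE constraint only looks at the parameter vector. The NumPy sketch below is purely illustrative (the numbers are not from our experiments); it shows how two parameter updates with the same MSE can have very different KL divergences, which is exactly the effect analyzed in the results section.

```python
import numpy as np

def kl_diag_gaussians(mu_old, std_old, mu_new, std_new):
    """KL(pi_old || pi_new) for diagonal Gaussian action distributions at one state.
    In TRPO this quantity is averaged over states sampled from the old policy."""
    var_old, var_new = std_old ** 2, std_new ** 2
    return np.sum(
        np.log(std_new / std_old)
        + (var_old + (mu_old - mu_new) ** 2) / (2.0 * var_new)
        - 0.5
    )

def mse_distance(theta_old, theta_new):
    """The cheap substitute we experiment with: squared distance in parameter space."""
    return np.sum((theta_old - theta_new) ** 2)

mu, std = np.zeros(3), np.ones(3)
# Two updates that look identical to the MSE constraint...
print(mse_distance(np.concatenate([mu, std]), np.concatenate([mu + 0.5, std])))   # 0.75
print(mse_distance(np.concatenate([mu, std]), np.concatenate([mu, std - 0.5])))   # 0.75
# ...but move the action distribution by very different amounts.
print(kl_diag_gaussians(mu, std, mu + 0.5, std))   # ~0.375
print(kl_diag_gaussians(mu, std, mu, std - 0.5))   # ~2.42: same MSE, much larger KL
```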

4.5 Neural Network Advantage Estimation

Going back to equation (1),

$$L(\pi) = \eta(\pi_0) + \sum_s \rho_{\pi_0}(s) \sum_a \pi(a \mid s)\, A_{\pi_0}(s, a)$$

Since $\eta(\pi_0)$ is a constant, maximizing $L(\pi)$ is equivalent to making $L(\pi) - \eta(\pi_0)$ as positive as possible, i.e. pushing

$$\sum_s \rho_{\pi_0}(s) \sum_a \pi(a \mid s)\, A_{\pi_0}(s, a) > 0$$

The intuition is that the advantage value $A_{\pi_0}(s, a)$ indicates how good a certain action is, and we would like to increase the probabilities of better actions, those with large positive advantage values. In other words, if advantage values can be estimated accurately, we can optimize the problem by directly maximizing the probabilities of good sampled actions. In practice, our actions are sampled from a multivariate Gaussian in a continuous space. We estimate the advantages of sampled actions directly from the Monte-Carlo sampling trajectory using generalized advantage estimation (GAE) (Schulman et al., 2015b), but we also want an estimate of the advantage value that the mean action would have had at each time step. Then, by calculating the difference, whenever the mean action's advantage is smaller (in practice, we require the difference to be larger than a threshold to remove some noise), we select (mask) the probabilities of those sampled actions and maximize them. Hence, the only remaining problem is how to estimate advantage values for the mean actions, the actions that we did not take during the roll-out. We solve this problem with a 3-layer feed-forward neural network. The input of the network is the concatenation of the state observation $s$ and the action $a$; the output is the estimated advantage value. In our implementation, we use TRPO for the first few episodes until the MSE between the calculated and estimated advantages falls below a threshold; the method above starts after this.
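What follows is a minimal sketch of such an advantage estimator, assuming PyTorch; the hidden width, activation, dropout rate, and weight decay are illustrative placeholders rather than our exact training configuration (the results section discusses why the L2 and dropout settings matter).

```python
import torch
import torch.nn as nn

class AdvantageNet(nn.Module):
    """3-layer feed-forward net mapping concat(state, action) to a scalar advantage."""
    def __init__(self, obs_dim, act_dim, hidden=64, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.Tanh(), nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # Input is the concatenation of state observation and action; output is A(s, a).
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

adv_net = AdvantageNet(obs_dim=10, act_dim=3)
# L2 regularization via weight_decay, one of the knobs tuned in the results section.
opt = torch.optim.Adam(adv_net.parameters(), lr=1e-3, weight_decay=1e-4)

obs, act = torch.randn(32, 10), torch.randn(32, 3)   # stand-in roll-out batch
gae_targets = torch.randn(32)                        # stand-in GAE advantages of sampled actions
loss = ((adv_net(obs, act) - gae_targets) ** 2).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

At update time, one would compare `adv_net(obs, mean_action)` against the GAE estimate for the sampled action and only push up the log-probabilities of sampled actions that win by more than the chosen threshold, as described above.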

5 Experiments

We test the different methods in the MuJoCo simulator, a physics engine available through OpenAI Gym. The three environments we use have increasing complexity:

Swimmer: 10-dimensional state space, linear reward.
Hopper: 12-dimensional state space, same reward as Swimmer, with a positive bonus for being in a non-terminal state.
Walker: 18-dimensional state space, same reward as Hopper, with an added penalty for strong impact of the feet against the ground.

6 Results and Analysis

Figure 1: Swimmer: Average reward vs. episode.
Figure 2: Hopper: Average reward vs. episode.
Figure 3: Walker: Average reward vs. episode.

As shown above, the KL divergence does outperform its variants. In spite of the high cost, the KL constraint is better for measuring the distance between policies: in the MSE case, although intuitively closer parameters correspond to similar policies, in a larger policy space even a small update in the parameters can result in very different policies. In contrast, the KL constraint is estimated directly on the different $\pi(\theta)$, so roll-outs from different policies are guaranteed to be similar. However, the MSE can significantly shorten the training time. Our suggestion is therefore a combination: apply TRPO with the MSE constraint to let the model quickly climb to a certain point, and then use the original TRPO afterwards for faster convergence.

Our advantage estimation method did not work at first, and we spent a lot of time debugging. One thing we noticed is that the algorithm tended to show a large MSE loss for the advantage estimation after some consecutive epochs of improvement; in other words, this provides evidence that our model works when the estimation is accurate. In addition, we conjecture that the unexpected drops might result from over-fitting of our advantage neural net: whenever it sees a new observation and action that differ from previous experience, it gives a wrong estimate of the advantage value, and the policy then updates along the wrong gradient direction. Therefore, by carefully tuning the L2-regularization constant and dropout for the neural network, the model indeed improves, as shown in the figure below.

Figure 4: TRPO and advantage estimation.

In addition to the analysis above, we also notice an interesting fact about TRPO: despite the fact that the average reward increases in each episode, the algorithm is highly sensitive to initialization. Even when only the random seed differs, the model's performance varies widely.

Figure 5: TRPO with different random seeds.

7 Conclusion and Future Work

In this project, we explore the cutting-edge TRPO algorithm within the class of policy gradient methods and try out some possible variants of it. We also experiment with our own modification using an advantage estimation neural network. We may continue our investigation into other possible substitutes for the KL divergence. In addition, we would also like to do more error analysis on why even the TRPO model is not robust enough and find methods to lower the variance.

8 Contributions

Bo Liu: responsible for the coding of TRPO, its variants, and the advantage estimation method; participated in the write-up of the project proposal, milestone, poster, and final report; presented the poster.

Huanzhong Xu: responsible for analyzing the cost of the conjugate gradient in the KL constraint and estimating the feasibility of its variants; helped with debugging; collected data and plotted all figures; participated in the milestone, poster, and final report write-up.

Songze Li: responsible for the project idea and running experiments; participated in the milestone and final report write-up.

References

Leemon Baird et al. 1995. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pages 30-37.

Sham Kakade and John Langford. 2002. Approximately optimal approximate reinforcement learning.

Sham M. Kakade. 2002. A natural policy gradient. In Advances in Neural Information Processing Systems, pages 1531-1538.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1889-1897.

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. 2015b. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.