arxiv: v1 [cs.sy] 3 Sep 2015

Size: px

Start display at page:

Download "arxiv: v1 [cs.sy] 3 Sep 2015"

Edgar Neal
5 years ago
Views:

1 Model Based Reinforcement Learning with Final Time Horizon Optimization arxiv:9.86v [cs.sy] 3 Sep Wei Sun School of Aerospace Engineering Georgia Institute of Technology Atlanta, GA 333-, USA. wsun4@gatech.edu Evangelos Theodorou School of Aerospace Engineering Georgia Institute of Technology Atlanta, GA 333-, USA. evangelos.theodorou@ae.gatech.edu Panagiotis Tsiotras School of Aerospace Engineering Georgia Institute of Technology Atlanta, GA 333-, USA. tsiotras@gatech.edu Abstract We present one of the first algorithms on model based reinforcement learning and trajectory optimization with free final time horizon. Grounded on the optimal control theory and Dynamic Programming, we derive a set of backward differential equations that propagate the value function and provide the optimal control policy and the optimal time horizon. The resulting policy generalizes previous results in model based trajectory optimization. Our analysis shows that the proposed algorithm recovers the theoretical optimal solution on linear low dimensional problem. Finally we provide application results on nonlinear systems. Introduction Trajectory optimization is one of the most active areas of research in machine learning and control theory with a plethora of applications in robotics, autonomous systems and computational neuroscience. Among the different methodologies, Differential Dynamic Programming (DDP) is a model based reinforcement learning algorithm that relies on linear approximation of dynamics and quadratic approximations of cost functions along nominal trajectories. Even though there has been almost 4 years since the fundamental work by Jacobson and Mayne on Differential Dynamic Programming [], it is a fact that research on trajectory optimization and model based reinforcement learning is performed nowadays by having a main ingredient DDP. In the NIPS community, recently published state of art methods on trajectory optimization use DDP to perform guided policy search [] and data-efficient probabilistic trajectory optimization [3]. Earlier work on DDP includes minmax [4], control limited [], receding horizon [6, 7], and stochastic optimal control formulations [8, 9]. Despite all of this research on trajectory optimization using model based reinforcement learning methods such as DDP, there has not been any effort towards the development of model based trajectory optimization algorithms in which the time horizon is not a-priori specified. The time horizon is one of the important free tuning parameters in trajectory optimization algorithms and in most case is manually tuned based on the experience of the engineer. In this paper we present a new algorithm on model based reinforcement learning in which optimization is performed with respect to control and the time horizon. While free time horizon DDP has been initially derived by Jacobson and Mayne in [], the resulting algorithm is not implementable

2 and it relies on the assumption that the initialization of the algorithm starts close to the optimal control solution. This will become more clear in the next section as we present our analysis on free final time model based trajectory optimization. Problem Formulation and Analysis We consider model based reinforcement learning problems in which optimization occurs with respect to control and time horizon. In mathematical terms these problems are formulated as follows: [ tf ] V(x(t ),t ;ν,t f ) = minj(x( ),u( )) = min Φ(x(t f ),ν,t f )+ L(x(t),u(t),t), () u( ) u( ) t where the term Φ(x(t f ),ν,t f ) is defined as Φ(x(t f ),ν,t f ) = φ(x(t f ),t f ) + ν T ψ(x(t f ),t f ), φ(x(t f ),t f ) is the terminal cost, ψ(x(t f ),t f ) is the terminal constraint and ν is the corresponding Lagrange mulitplier. L(x(t),u(t),t) is the running cost accumulated along the time horizont f, which is not specified a-priori. The cost function J(x( ), u( )) in () is minimized subject to the dynamics: dx(t) = F(x(t),u(t),t), x = x(t ). () where x R n is the state and the u R m is the control of the dynamics. Note that the value functionv(x(t ),t ;ν,t f ) is now a function of the Lagrange multiplierν and the terminal timet f. This is important for the derivation of the free time horizon algorithm since expansions of the value function are computed not only with respect to nominal controls and state trajectories but also with respect to nominal ν and t f.. Derivation of Differential Dynamic Programming (DDP) with Free Final Time Our analysis and derivation of the free time horizon model based reinforcement learning is in continuous time. As it is shown, a set of backward ordinary differential equations is derived that back propagates the value function along the nominal trajectory. In particular, given a nominal trajectory ( x( ),ū( )) with nominal Lagrange multiplier ν and terminal time t f, we start our analysis with the linearization of the dynamics as follows: dx(t) = F( x(t)+δx(t),ū(t)+δu(t),t) dδx(t) = F x ( x(t),ū(t),t)δx(t)+f u ( x(t),ū(t),t)δu(t). (3) All the quantities in the derivation later are evaluated at( x(t),ū(t), ν, t f ) unless otherwise specified. Since our derivation is in continuous time, we consider the corresponding Hamilton-Jacobi-Bellman equation: V(x(t),t;ν,t f) t [ ] = min H(x(t),t;ν,t f ), (4) u(t) under the terminal conditionv(x(t f ),t f ;ν,t f ) = Φ(x(t f ),ν,t f ), and with the Hamiltonian functionh(x(t),t;ν,t f ) defined as follows: H(x(t),t;ν,t f ) = L(x(t),u(t),t)+V x (x(t),t;ν,t f ) T F(x(t),u(t),t). () We take expansions of the terms on both sides of 4 around ( x,ū, ν, t f ). Notice that this is in contrast with the derivation of free final time DDP in [] in which the expansion takes place around (x,u ). Hence, the key assumption in [] is thatū is close to the optimal controlu, which makes the algorithm hard to implement, especially when the optimal t f is not known a-priori. Moreover, expansion of the Hamiltonian H around u yields H u u=u =, which results in dropping terms from the derivation.

3 The left-hand side of (4) can be expanded as t V( x(t)+δx(t),t; ν +δν, t f + ) ( ) V( x(t),t; ν, t f )+V T x t δx(t)+vt ν δν +V t f + ( ] [δx(t) T δν T xx V xν V xtf δx(t) ) V νx V νν V νtf δν. t V tf x V tf ν V tf t f (6) Next we make use of the fact that d ( ) = t ( )+ x ( )T F( x(t),ū(t),t) t ( ) = d ( )+ x ( )T F( x(t),ū(t),t). (7) Based on the equation above we have that t V( x(t)+δx(t),t; ν +δν, t f + ) = d ( ) V( x(t),t; ν, t f )+V T xδx(t)+v T ν δν +V tf d ( ] [δx(t) T δν T xx V xν V xtf δx(t) ) V νx V νν V νtf δν V tf x V tf ν V tf t f (8) +V T x F V +δx(t)t xx F +δν T V νx F + V tf xf + ] [δx(t) T δν T xxxf V xνx F V xtf xf δx(t) V νxx F V ννx F V νtf xf δν. V tf xxf V tf νxf V tf t f xf The next step is to work with the expansion of the right-hand side of the HJB equation in (4). In particular, we have that V x ( x(t)+δx(t),t; ν +δν, t f + ) V x ( x(t),t; ν, t f )+V xx δx(t)+v xν δν +V xtf + ] [δx(t) T δν T xxx V xxν V xxtf δx(t) V xνx V xνν V xνtf δν. (9) V xtf x V xtf ν V xtf t f In addition, the running cost and the dynamics are expanded as follows: L(x(t),u(t),t) = L( x(t)+δx(t),ū(t)+δu(t),t) L( x(t),ū(t),t)+l T x δx(t)+lt u δu(t) + [ ][ ][ ] δx(t) T δu(t) T Lxx L xu δx(t), () L ux L uu δu(t) F(x(t),u(t),t) = F( x(t)+δx(t),ū(t)+δu(t),t) F( x(t),ū(t),t)+f x δx(t)+f u δu(t). () Therefore, the right hand side of (4) can be expressed as { min L( x(t),ū(t),t)+l T xδx(t)+l T uδu(t)+ [ ][ ][ ] δx(t) T δu(t) T Lxx L xu δx(t) δu(t) L ux L uu δu(t) + V T x F +VT x F xδx(t)+v T x F uδu(t)+ δx(t) T V xx F +δx(t) T V xx F x δx(t)+δx(t) T V xx F u δu(t) + δν T V νx F +δν T V νx F x δx(t)+δν T V νx F u δu(t)+ V tf xf + V tf xf x δx(t)+ V tf xf u δu(t) + ] [δx(t) T δν T xxxf V xνx F V xtf xf δx(t) } V νxx F V ννx F V νtf xf δν +H.O.T.. () V tf xxf V tf νxf V tf t f xf 3

4 Equating (8) with () and cancel like terms, we get d ( V +V T xδx(t)+v T ν δν +V tf + ] [δx(t) T δν T xx V xν V xtf δx(t) ) V νx V νν V νtf δν = V tf x V tf ν V tf t f { min L+L T xδx(t)+l T uδu(t)+ [ ][ ][ ] δx(t) T δu(t) T Lxx L xu δx(t) +V T δu(t) L ux L uu δu(t) xf x δx(t)+vxf T u δu(t) +δx(t) T V xx F x δx(t)+δx(t) T V xx F u δu(t)+δν T V νx F x δx(t)+δν T V νx F u δu(t) } + V tf xf x δx(t)+ V tf xf u δu(t). (3) To find the δu(t) that minimize the equation, we take derivative of the right hand side of (3) and set it to, = L u +L uu δu(t)+( L ux + LT xu +F T uv xx )δx(t)+f T uv x +F T uv xν δν +F T uv xtf. The update law for the control is thus given by (4) δu(t) = l(t)+k x (t)δx(t)+k ν (t)δν +K tf (t). () where the termsl(t),k x (t),k ν (t) andk tf (t) are defined as follows l(t) = L uu (L u +F T u V x), K x (t) = L uu ( L ux + LT xu +F T u V xx), K ν (t) = L uu F T u V xν, K tf (t) = L uu F T u V xt f. (6) Note that L uu is guaranteed to be invertible if the running cost L = g(x) + u T Ru, where R >. This type of cost is normal for a mechanical system where we would like to minimize the energy cost of the control. Substitution of the optimal policy variationδu back to the HJB equation results in a set of backward ordinary differential equations that propagate the expansion of the value function which consists of the terms V,V ν,v xx,v νν,v xν,v xtf and V νtf. These backward differential equations are given as follows d V = L lt L uu l, d V x = L x K T xl uu l+f T xv x, d V ν = K T νl u, d V t f = K T t f L u, d V xx = L xx K T x L uuk x +F T x V xx, d V νν = K T ν L uuk ν, (7) d V t f t f = K T t f L uu K tf, d V xν = L xu K ν +F T x V xν +V xx F u K ν, d V xt f = L xu K tf +F T x V xt f +V xx F u K tf, d V νt f = K T νf T uv xtf, where all the quantities are evaluated at ( x(t),ū(t), ν, t f ). To numerically solve the equations in (7) one has to compute the terminal conditions. In the next section we present the derivation for the terminal condition and provide an overview of the algorithm. 4

5 . Terminal Conditions The terminal conditions can be determined by the following procedure. From V(x(t ),t ;ν,t f ) = minj(x( ),u( )) = min u( ) u( ) we have that for anyt [t,t f ], V(x(t),t;ν,t f ) = minj(x( ),u( )) = min u( ) u( ) Therefore, =min u( ) { tf { tf t } L(x(t),u(t))+Φ(x(t f ),ν,t f ), (8) t } L(x(s),u(s))ds+Φ(x(t f ),ν,t f ). (9) V( x( t f )+δx( t f ), t f ; ν +δν, t f + ) { t f + } L(x(t),u(t),t)+Φ(x( t f + ), ν +δν, t f + ) t f L( x( t f ),ū( t f ),t) +Φ( x( t f )+δx( t f )+ x( t f ), ν +δν, t f + ) L( x( t f ),ū( t f ),t) +Φ( x( t f ), ν, t f )+Φ T x (δx( t f )+F( x( t f ),ū( t f ), t f ) ) () +Φ ν ( x( t f ),ū( t f ), t f ) T δν +Φ tf ( x( t f ),ū( t f ), t f ) [ ] T + δx+fδtf δν Φ [ ] xx Φ xν Φ xtf δx+fδtf Φ νx Φ νν Φ νtf δν Φ tf x Φ tf ν Φ tf t f () V( x( t f )+δx( t f ), t f ; ν +δν, t f + ) = Φ+Φ T xδx+φ T νδν +(L+Φ T xf +Φ tf ) + ] Φ xx Φ xν Φ xtf +Φ xx F [ ] δx(t) [δx(t) T δν T Φ νx Φ νν Φ νtf +Φ νx F δν, Φ tf x +F T Φ xx Φ tf ν +F T Φ xν Φ tf t f +Φ tf xf +F T Φ xx F () wherex( t f + ) is evaluated by x( t f )+δx( t f )+ x( t f ) and x( t f ) = F( x( t f ),ū( t f ), t f ). The arguments of the functions in the last line of equations are the same as those in the previous equations and are thus omitted. The minimization with respect to u is dropped on the third line of equations because t f + t f L(x(t),u(t)) is evaluated byl( x( t f ),ū( t f )) and the latter is only a function of the nominal control. Note that this approximation is relatively rough, since we approximate L(x(t),u(t)) byl( x( t f ),ū( t f )) instead ofl(x( t f ),u( t f )) and follow up with an t f + t f expansion on x( t f ) and u( t f ). But the simulation results suggest that such level of approximation is good enough. Hence, att = t f, the terminal conditions are V( x( t f ), t f ; ν, t f ) = Φ( x( t f ), ν, t f ), V x ( x( t f ), t f ; ν, t f ) = Φ x ( x( t f ), ν, t f ), V ν ( x( t f ), t f ; ν, t f ) = Φ ν ( x( t f ), ν, t f ), V tf ( x( t f ), t f ; ν, t f ) = L( x( t f ),ū( t f ))+Φ x ( x( t f ), ν, t f ) T F( x( t f ),ū( t f ))+Φ tf ( x( t f ), ν, t f ), V xx ( x( t f ), t f ; ν, t f ) = Φ xx ( x( t f ), ν, t f ), V νν ( x( t f ), t f ; ν, t f ) = Φ νν ( x( t f ), ν, t f ), V tf t f ( x( t f ), t f ; ν, t f ) = Φ tf t f ( x( t f ), ν, t f )+Φ tf x( x( t f ), ν, t f )F( x( t f ),ū( t f )) +F( x( t f ),ū( t f )) T Φ xx ( x( t f ), ν, t f )F( x( t f ),ū( t f )), V xν ( x( t f ), t f ; ν, t f ) = Φ xν ( x( t f ), ν, t f ), V xtf ( x( t f ), t f ; ν, t f ) = Φ xtf ( x( t f ), ν, t f )+Φ xx ( x( t f ), ν, t f )F( x( t f ),ū( t f )), V νtf ( x( t f ), t f ; ν, t f ) = Φ νtf ( x( t f ), ν, t f )+Φ νx ( x( t f ), ν, t f )F( x( t f ),ū( t f )). (3)

6 Given the boundary conditions of the value function and its derivatives, we can back-propagate the differential equations we derived earlier to find their values. Our next step is to update the control through (), and in order to do so, we need to find the update law ofδν and. We follow the derivation in [] and set [ ] [ ] [ ] δν Vνν ( x(t = ζ ),t ; ν, t f ) V νtf ( x(t ),t ; ν, t f ) Vν ( x( t f ), t f ; ν, t f ), (4) V tf ν( x(t ),t ; ν, t f ) V tf t f ( x(t ),t ; ν, t f ) V tf ( x( t f ), t f ; ν, t f ) whereζ [,] is introduced to ensure that the update ofδν and are not too large. 3 Simulation Results 3. Double Integrator We first apply the algorithm on a simple system, namely, the double integrator. We compare our numerical result with the analytical solution to verify the algorithm. The dynamics is given by ẋ = x, and ẋ = u. Initial condition is x() = [x ();x ()] = [;]. The cost J = t f ( + Ru ). Terminal constraint is x =. Introducing the Lagrange multiplier ν, the cost can be reformulated as J = ν(x ) + t f ( + Ru ). Given different values of R, we can find different optimal cost and terminal time. In particular, let R =.,,, the corresponding terminal times are.89,.46,.9, respectively. Optimal control an f per iteration whenr =.,, are shown in Figure. Now we solve this problem analytically to verify the simulation results. Denote the co-states by λ = [λ ;λ ], the Hamiltonian is given by The co-states satisfy the adjoint equations H = + Ru +λ x +λ u. () λ = H =, (6) x λ = H = λ. (7) x Utilizing the Pontryagin s minimum principle, the optimal control u can be calculated from = H u = Ru + λ. Hence, u = λ /R. Transversality conditions are such that λ (t f ) = ν, λ (t f ) =, H(t f ) =. Given the previous information, we are ready to solve the problem. From λ = andλ (t f ) = ν, we get λ (t) ν,t [,t f ]. Then from λ = λ andλ (t f ) =, we haveλ (t) = ν(t f t). Therefore, u = λ /R = ν R (t t f). (8) Note that the optimal control is a linear function of t. Furthermore, boundary conditions yields t f = (9 R) 4 and ν = 3 t f = 3 (9 R) 4. When R =.,,, t f =.89,.46,.9, respectively, which is consistent with the numerical simulation results presented in the control plots in Figure. 3. Cart Pole In this subsection, we apply our algorithm on the inverted pendulum on a cart, as known as the cart pole problem, with M = the mass of the cart, m = and l =. are the mass and length of the pendulum, g = 9.8 the gravitational acceleration and u the force applied to the cart. The state x = [x,ẋ,θ, θ]. The goal is to bring the state from x() = [,,π,] to p = [,,,], which represents the case where the pendulum is pointing strait up. The cost function is given by J = tf [c t +(x p) T Q(x p)+u T R u u]+λ T ([x 3 (t f );x 4 (t f )] [p 3 ;p 4 ]), 6

7 Control t f per iteration. Control (a) R=. t per iteration f (b) R=. Control (c) R= t f per iteration (d) R=... 3 (e) R= (f) R= Figure : Figure a, c, e show the numerical optimal control for the cases when R =.,,, respectively. Figure b, d, f present the t f per iteration for the cases when R =.,,, respectively. Dashed red lines represents the according analytical optimalt f. where Q = diag{,,,} and R u =.. Initial values are given as t f =, λ = [,] T. The multipliers γ =. and ǫ =.. We run the algorithm for 3 iterations and the convergence is achieved at around th iteration. Figure 3a presents the optimal control u. The corresponding optimal trajectories of the states are depicted in Figure. Cost and t f per iteration are shown in Figure 3b and 3c, respectively. x ẋ 4 θ θ Figure : Optimal trajectories of the states in blue. Red lines represent the desired terminal states. Control.. (a) Optimal control of the cart pole system. Cost per iteration 3 3 (b) Cost per iteration tf per iteration. 3 (c)t f per iteration. Figure 3: Cost an f per iteration for the cart pole system. 7

8 3.3 Quadrotor The dynamic model of the quadrotor includes 6 states: 3 for the position (r = (x,y,z) T ), 3 for the Euler angles (Φ = (φ,θ,ψ) T ), 3 for the velocity (ṙ = (ẋ,ẏ,ż) T ), 3 for the body angular rates ( Φ = (p,q,r) T ) and 4 for the motor speeds (Ω = (ω,ω,ω 3,ω 4 ) T ). The corresponding dynamics of the quadrotor is given as follows: dx = f(x)+gu, (9) where x = [r,φ,ṙ, Φ,Ω] T R 6, and u = (u,u,u 3,u 4 ) T R 4 is the control vector, where u represents the thrust force, and u,u 3,u 4 represent the pitching, rolling, yawing moments, respectively. The corresponding cost function is defined as J = tf [c t + (x p) T Q(x p) + u T R u u] + (x(t f) p) T Q f (x(t f ) p) + λ T ([x (t f );...;x 6 (t f )] [p ;...;p 6 ]), where p = [p ;...;p 6 ] R 6 denotes the desired terminal states. In the simulation, we set { 7,i =,,3;,i = 3; 6,i = 4,...,9; p(i) = and Q f (i,i) =, otherwise, (3),i =,,;, otherwise, and all the off-diagonal terms are assigned to. Q =.Q f. R u =.I. γ =. and ǫ =.. The desired terminal state p is chosen for the quadrotor to execute the take-off maneuver. iterations are included to ensure the convergence and the cost per iteration is presented in Figure b. The corresponding optimal state trajectories are shown in Figure 4. Optimal controlu is illustrated in Figure a. t f per iteration is presented in Figure c. x x 4 x x 4 x3 x4 x 4. x x 4 x6 x 8 x7 x 4 x8 x 4 x9 x x 4 x x 4 x x 9 x3 x4 x x Figure 4: Optimal trajectories of the states of the quadrotor in blue. Dashed red lines represent the desired terminal states. Acknowledgments References [] D.H. Jacobson and D.Q Mayne. Differential dynamic programming. Elsevier Sci. Publ., 97. [] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. 8

9 Control x 4 4 Control 7. 8 x 4 7 Cost per iteration. t f per iteration Control3 3 x Control4 3 (a) Optimal controls (b) Cost per iteration.. (c)t f per iteration. Figure : Cost an f per iteration for the quadrotor. Weinberger, editors, Advances in Neural Information Processing Systems 7, pages Curran Associates, Inc., 4. [3] Y. Pan and E. Theodorou. Probabilistic differential dynamic programming. In Advances in Neural Information Processing Systems (NIPS), pages 97 9, 4. [4] J. Morimoto, G. Zeglin, and C. G Atkeson. Minimax differential dynamic programming: Application to a biped walking robot. In Proceedings of 3 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 3)., volume, pages IEEE, 3. [] Y. Tassa, N. Mansard, and E. Todorov. Control-limited differential dynamic programming. In 4 IEEE International Conference on Robotics and Automation (ICRA),, pages IEEE, 4. [6] Y. Tassa, T. Erez, and W. D. Smart. Receding horizon differential dynamic programming. In NIPS, 7. [7] P. Abbeel, A. Coates, M. Quigley, and A. Y Ng. An application of reinforcement learning to aerobatic helicopter flight. Advances in Neural Information Processing Systems (NIPS), 9:, 7. [8] E. Todorov and W. Li. A generalized iterative lqg method for locally-optimal feedback control of constrained nonlinear stochastic systems. In American Control Conference,, pages IEEE,. [9] E. Theodorou, Y. Tassa, and E. Todorov. Stochastic differential dynamic programming. In American Control Conference (ACC),, pages 3. IEEE,. 9

Game Theoretic Continuous Time Differential Dynamic Programming

Game Theoretic Continuous Time Differential Dynamic Programming Game heoretic Continuous ime Differential Dynamic Programming Wei Sun, Evangelos A. heodorou and Panagiotis siotras 3 Abstract In this work, we derive a Game heoretic Differential Dynamic Programming G-DDP