Stochastic Differential Dynamic Programming
Evangelos Theodorou, Yuval Tassa & Emo Todorov

Abstract: Although there has been a significant amount of work in the area of stochastic optimal control theory towards the development of new algorithms, the problem of how to control a stochastic nonlinear system remains an open research topic. Recent iterative linear quadratic optimal control methods (iLQG) [1], [2] handle control- and state-multiplicative noise, but they are derived based on a first-order approximation of the dynamics. On the other hand, methods such as Differential Dynamic Programming expand the dynamics up to second order, but so far they can handle only nonlinear systems with additive noise. In this work we present a generalization of the classic Differential Dynamic Programming algorithm. We assume the existence of state- and control-multiplicative process noise, and proceed to derive the second-order expansion of the cost-to-go. We find the correction terms that arise from the stochastic assumption. Despite having quartic and cubic terms in the initial expression, we show that these vanish, leaving us with the same quadratic structure as standard DDP.

I. INTRODUCTION

Optimal control describes the choice of actions that minimize future costs under the constraint of state space dynamics. In the continuous nonlinear case, local methods are one of the very few classes of algorithms which successfully solve general, high-dimensional optimal control problems. These methods are based on the observation that optimal solutions form extremal trajectories, i.e. are solutions to a calculus-of-variations problem. Differential Dynamic Programming, or DDP, is a powerful local dynamic programming algorithm which generates both open- and closed-loop control policies along a trajectory. The DDP algorithm, introduced in [3], computes a quadratic approximation of the cost-to-go and, correspondingly, a local linear-feedback controller. The state space dynamics are also quadratically approximated around a trajectory. As in the Linear-Quadratic-Gaussian case, fixed additive noise has no effect on the controllers generated by DDP, and when described in the literature, the dynamics are usually assumed deterministic.

Past work on nonlinear optimal control allows the use of DDP in problems with state and control constraints [4], [5]. In that work, state and control constraints are expanded up to first order and the KKT conditions are formulated, resulting in an unconstrained quadratic optimization problem.

(E. Theodorou is with the Computational Learning and Motor Control Lab, Departments of Computer Science and Neuroscience, University of Southern California, etheodor@usc.edu. Y. Tassa is with the Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem, Israel, tassa@alice.nc.huji.ac.il. E. Todorov is with the Department of Computer Science and Engineering and the Department of Applied Mathematics, University of Washington, Seattle, WA, todorov@cs.washington.edu.)

The application of DDP in real robotic high-dimensional control tasks created the need for extensions that relax the requirement for accurate models. In this vein, [6] extends DDP to a min-max formulation. The goal of this formulation is to make DDP robust against model uncertainties and hybrid dynamics in a biped robotic locomotion task. In [7], implementation improvements regarding the evaluation of the value function allow the use of DDP in a receding-horizon mode.
In all past work related to DDP, the optimal control problem is considered to be deterministic. While the indifference to noise can be considered a feature if the noise is indeed fixed, in many cases a varying noise covariance is an important feature of the problem, as with control-multiplicative noise, which is common in biological systems [1]. This latter case was addressed within the iterative-LQG framework [1], in which the optimal controller is derived based on a first-order approximation of the dynamics and a second-order approximation of the cost-to-go. However, for iterative nonlinear optimal control algorithms such as DDP, in which a second-order expansion of the dynamics is considered, the more general case of state- and/or control-multiplicative noise appears to have never been tackled.

In this paper, we derive the DDP algorithm for state- and control-multiplicative process noise. We find that despite the potential of cubic and quartic terms, these cancel out, allowing us to maintain the quadratic form of the approximation. Moreover, we show how the new generalized formulation of Stochastic Differential Dynamic Programming (SDDP) recovers the standard deterministic DDP solution, as well as the special cases in which only state-multiplicative or only control-multiplicative noise is considered.

The remainder of this work is organized as follows: in the next section we provide the definition of SDDP. In section III the second-order expansion of the cost-to-go is presented, and in section IV the optimal controls are derived and the overall SDDP algorithm is presented. In addition, in section IV we show how SDDP recovers the deterministic solution as well as the cases of only control-multiplicative, only state-multiplicative and only additive noise. Finally, in sections V and VI simulation results and conclusions are discussed.

II. STOCHASTIC DIFFERENTIAL DYNAMIC PROGRAMMING

We consider the class of nonlinear stochastic optimal control problems with cost

$$v^{\pi}(x,t) = E\left[ h(x_T) + \int_{t_0}^{T} \ell\big(\tau, x(\tau), \pi(\tau, x(\tau))\big)\, d\tau \right] \quad (1)$$
subject to stochastic dynamics of the form

$$dx = f(x,u)\,dt + F(x,u)\,d\omega \quad (2)$$

where $x \in \mathbb{R}^n$ is the state, $u \in \mathbb{R}^p$ is the control and $d\omega \in \mathbb{R}^m$ is Brownian noise. The term $h(x_T)$ in the cost function is the terminal cost, while $\ell(\tau, x(\tau), \pi(\tau, x(\tau)))$ is the instantaneous cost rate, which is a function of the state $x$ and the control policy $\pi(\tau, x(\tau))$. The cost-to-go $v^{\pi}(x,t)$ is defined as the expected cost accumulated over the time horizon $(t_0, T)$, starting from the initial state $x_t$ to the final state $x_T$.

To enhance the readability of our derivations we write the dynamics as a function $\Phi \in \mathbb{R}^n$ of the state, control and instantiation of the noise:

$$\Phi(x,u,d\omega) \equiv f(x,u)\,dt + F(x,u)\,d\omega \quad (3)$$

It will sometimes be convenient to write the matrix $F(x,u) \in \mathbb{R}^{n \times m}$ in terms of its rows or columns:

$$F(x,u) = \begin{bmatrix} F_r^1(x,u) \\ \vdots \\ F_r^n(x,u) \end{bmatrix} = \big[ F_c^1(x,u), \dots, F_c^m(x,u) \big]$$

Every element of the vector $\Phi(x,u,d\omega) \in \mathbb{R}^n$ can now be expressed as

$$\Phi^j(x,u,d\omega) = f^j(x,u)\,dt + F_r^j(x,u)\,d\omega.$$

Given a nominal trajectory of states and controls $(\bar{x}, \bar{u})$ we expand the dynamics around this trajectory to second order:

$$\Phi(\bar{x}+\delta x,\, \bar{u}+\delta u,\, d\omega) = \Phi(\bar{x}, \bar{u}, d\omega) + \nabla_x \Phi\, \delta x + \nabla_u \Phi\, \delta u + \mathcal{O}(\delta x, \delta u, d\omega)$$

where $\mathcal{O}(\delta x, \delta u, d\omega) \in \mathbb{R}^n$ contains all the second-order terms in the deviations of states, controls and noise (not to be confused with big-O). Writing this term element-wise, $\mathcal{O} = [\mathcal{O}^1, \dots, \mathcal{O}^n]^T$, we can express each element $\mathcal{O}^j(\delta x, \delta u, d\omega) \in \mathbb{R}$ as

$$\mathcal{O}^j = \frac{1}{2}\begin{bmatrix} \delta x \\ \delta u \end{bmatrix}^T \begin{bmatrix} \nabla_{xx}\Phi^j & \nabla_{xu}\Phi^j \\ \nabla_{ux}\Phi^j & \nabla_{uu}\Phi^j \end{bmatrix} \begin{bmatrix} \delta x \\ \delta u \end{bmatrix}.$$

We would now like to express the derivatives of $\Phi$ in terms of the given quantities. Beginning with the first-order terms, we find that

$$\nabla_x \Phi = \nabla_x f(x,u)\,dt + \sum_{i=1}^{m} \nabla_x F_c^i\, d\omega^i(t), \qquad \nabla_u \Phi = \nabla_u f(x,u)\,dt + \sum_{i=1}^{m} \nabla_u F_c^i\, d\omega^i(t).$$

Next we find the second-order derivatives:

$$\nabla_{xx}\Phi^j = \nabla_{xx} f^j(x,u)\,dt + \nabla_{xx}\big(F_r^j(x,u)\,d\omega_t\big), \qquad \nabla_{uu}\Phi^j = \nabla_{uu} f^j(x,u)\,dt + \nabla_{uu}\big(F_r^j(x,u)\,d\omega_t\big),$$
$$\nabla_{ux}\Phi^j = \nabla_{ux} f^j(x,u)\,dt + \nabla_{ux}\big(F_r^j(x,u)\,d\omega_t\big), \qquad \nabla_{xu}\Phi^j = \big(\nabla_{ux}\Phi^j\big)^T.$$

After expanding the dynamics up to second order we can transition from continuous to discrete time. More precisely, the discrete-time dynamics are formulated as

$$\delta x_{t+\delta t} = \Big(I_{n\times n} + \nabla_x f(x,u)\,\delta t + \sum_{i=1}^{m} \nabla_x F_c^i\, \xi_t^i \sqrt{\delta t}\Big)\,\delta x_t + \Big(\nabla_u f(x,u)\,\delta t + \sum_{i=1}^{m} \nabla_u F_c^i\, \xi_t^i \sqrt{\delta t}\Big)\,\delta u_t + F(x,u)\sqrt{\delta t}\,\xi_t + \mathcal{O}_d(\delta x, \delta u, \xi, \delta t)$$

with $\delta t = t_{k+1} - t_k$ corresponding to a small discretization interval. Note that the term $\mathcal{O}_d$ is the discrete-time equivalent of $\mathcal{O}$ and is therefore a function of $\delta t$. In fact, since $\mathcal{O}_d$ contains all the second-order expansion terms of the dynamics, it contains second-order derivatives with respect to state and control, expressed as follows:

$$\nabla_{xx}\Phi^j = \nabla_{xx} f^j\,\delta t + \nabla_{xx} F_r^j\, \xi_t \sqrt{\delta t}, \qquad \nabla_{uu}\Phi^j = \nabla_{uu} f^j\,\delta t + \nabla_{uu} F_r^j\, \xi_t \sqrt{\delta t},$$
$$\nabla_{ux}\Phi^j = \nabla_{ux} f^j\,\delta t + \nabla_{ux} F_r^j\, \xi_t \sqrt{\delta t}, \qquad \nabla_{xu}\Phi^j = \big(\nabla_{ux}\Phi^j\big)^T.$$

The random variable $\xi \in \mathbb{R}^m$ is zero mean and Gaussian distributed with covariance $\Sigma_\omega = \sigma_{d\omega}^2\, I_{m\times m}$.

The discretized dynamics can be written in more compact form by grouping the state-, control- and noise-dependent terms, and leaving the second-order term separate:

$$\delta x_{t+\delta t} = A_t\, \delta x_t + B_t\, \delta u_t + \Gamma_t\, \xi_t \sqrt{\delta t} + \mathcal{O}_d(\delta x, \delta u, \xi, \delta t) \quad (4)$$

where the matrices $A_t \in \mathbb{R}^{n\times n}$, $B_t \in \mathbb{R}^{n\times p}$ and $\Gamma_t \in \mathbb{R}^{n\times m}$ are defined as

$$A_t = I_{n\times n} + \nabla_x f(x,u)\,\delta t, \qquad B_t = \nabla_u f(x,u)\,\delta t, \qquad \Gamma_t = \big[\Gamma^1\; \Gamma^2\; \dots\; \Gamma^m\big]$$

with each column $\Gamma^i \in \mathbb{R}^n$ defined as $\Gamma^i = \nabla_u F_c^i\, \delta u_t + \nabla_x F_c^i\, \delta x_t + F_c^i$.
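To make the discretization concrete, the following is a minimal NumPy sketch (our own illustration, not from the paper) of how the matrices $A_t$, $B_t$ and $\Gamma_t$ of equation (4) could be assembled around a nominal point. The function names and the forward-difference scheme are assumptions of ours, and the second-order tensors that populate $\mathcal{O}_d$ are omitted for brevity.

```python
import numpy as np

def jacobian(fun, z, eps=1e-6):
    """Forward-difference Jacobian of fun: R^k -> R^n at the point z."""
    f0 = np.atleast_1d(fun(z))
    J = np.zeros((f0.size, z.size))
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = eps
        J[:, i] = (np.atleast_1d(fun(z + dz)) - f0) / eps
    return J

def discretize(f, F, x, u, dx, du, dt):
    """Assemble A_t, B_t and Gamma_t of eq. (4) around the nominal (x, u).
    f(x, u): drift, returns (n,); F(x, u): diffusion, returns (n, m)."""
    fx = jacobian(lambda z: f(z, u), x)      # \nabla_x f
    fu = jacobian(lambda z: f(x, z), u)      # \nabla_u f
    A = np.eye(x.size) + fx * dt             # A_t = I + \nabla_x f * dt
    B = fu * dt                              # B_t = \nabla_u f * dt
    m = F(x, u).shape[1]
    Gamma = np.zeros((x.size, m))
    for i in range(m):
        Fci = lambda zx, zu: F(zx, zu)[:, i]            # i-th column F_c^i
        Fx_i = jacobian(lambda z: Fci(z, u), x)         # \nabla_x F_c^i
        Fu_i = jacobian(lambda z: Fci(x, z), u)         # \nabla_u F_c^i
        # Gamma^i = \nabla_u F_c^i du + \nabla_x F_c^i dx + F_c^i
        Gamma[:, i] = Fu_i @ du + Fx_i @ dx + Fci(x, u)
    return A, B, Gamma
```

For the scalar examples of section V, $f(x,u) = \cos(x) + u$ and $F(x,u) = x^2$, and the loop above reduces to the closed-form expressions given there.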
For the derivation of the optimal control it is useful to express $\Gamma_t$ as the sum of terms that depend on the variations in states and controls and terms that are independent of such variations. More precisely we will have

$$\Gamma_t = \Delta_t(\delta x, \delta u) + F(\bar{x}, \bar{u}) \quad (5)$$

where each column vector of $\Delta_t$ is defined as $\Delta_t^i(\delta x, \delta u) = \nabla_u F_c^i\, \delta u_t + \nabla_x F_c^i\, \delta x_t$.

III. VALUE FUNCTION SECOND ORDER APPROXIMATION

As in classical DDP, the derivation of stochastic DDP requires the second-order expansion of the cost-to-go function around a nominal trajectory $\bar{x}$:

$$V(\bar{x} + \delta x) = V(\bar{x}) + V_x^T\, \delta x + \frac{1}{2}\,\delta x^T V_{xx}\, \delta x \quad (6)$$

Substitution of the discretized dynamics (4) into the second-order value function expansion (6) results in

$$V(\bar{x}_{t+\delta t} + \delta x_{t+\delta t}) = V(\bar{x}_{t+\delta t}) + V_x^T\big(A_t \delta x_t + B_t \delta u_t + \Gamma_t \xi \sqrt{\delta t} + \mathcal{O}_d\big) + \frac{1}{2}\big(A_t \delta x_t + B_t \delta u_t + \Gamma_t \xi \sqrt{\delta t} + \mathcal{O}_d\big)^T V_{xx} \big(A_t \delta x_t + B_t \delta u_t + \Gamma_t \xi \sqrt{\delta t} + \mathcal{O}_d\big) \quad (7)$$

Next we will compute $E\big[V(\bar{x}_{t+\delta t} + \delta x_{t+\delta t})\big]$, which requires the calculation of the expectation of all the terms that appear in the equation above. This is what the rest of the analysis is dedicated to. More precisely, in the next two sections we will calculate the expectations of the terms

$$E\big[V_x^T\, \delta x_{t+\delta t}\big] \quad (8) \qquad \text{and} \qquad E\big[\delta x_{t+\delta t}^T\, V_{xx}\, \delta x_{t+\delta t}\big] \quad (9)$$

where the state deviation $\delta x_{t+\delta t}$ at time instant $t+\delta t$ is given by the linearized dynamics

$$\delta x_{t+\delta t} = A_t\, \delta x_t + B_t\, \delta u_t + \Gamma_t\, \xi \sqrt{\delta t} + \mathcal{O}_d \quad (10)$$

The analysis in section III-A consists of the computation of the expectation of the four terms which result from the substitution of the linearized dynamics (10) into (8). In section III-B we compute the expectation of the 16 terms that result from the substitution of (10) into (9).

A. Expectation of the first-order term of the value function expansion $V_x^T\, \delta x_{t+\delta t}$

The expectation of the first-order term results in

$$E\big[V_x^T\big(A_t \delta x_t + B_t \delta u_t + \Gamma_t \xi_t \sqrt{\delta t} + \mathcal{O}_d\big)\big] = V_x^T\big(A_t \delta x_t + B_t \delta u_t + E[\mathcal{O}_d]\big) \quad (11)$$

In order to find the expectation of $\mathcal{O}_d \in \mathbb{R}^n$ we need to find the expectation of each of the elements of this column vector. Thus we will have

$$E\big[\mathcal{O}^j(\delta x, \delta u, \xi_t, \delta t)\big] = E\left[\frac{1}{2}\begin{bmatrix} \delta x \\ \delta u \end{bmatrix}^T \begin{bmatrix} \nabla_{xx}\Phi^j & \nabla_{xu}\Phi^j \\ \nabla_{ux}\Phi^j & \nabla_{uu}\Phi^j \end{bmatrix} \begin{bmatrix} \delta x \\ \delta u \end{bmatrix}\right] = \frac{\delta t}{2}\begin{bmatrix} \delta x \\ \delta u \end{bmatrix}^T \begin{bmatrix} \nabla_{xx}f^j & \nabla_{xu}f^j \\ \nabla_{ux}f^j & \nabla_{uu}f^j \end{bmatrix} \begin{bmatrix} \delta x \\ \delta u \end{bmatrix} \quad (12)$$

Therefore we will have

$$E\big[V_x^T\, \delta x_{t+\delta t}\big] = V_x^T\big(A_t \delta x_t + B_t \delta u_t + \tilde{\mathcal{O}}_d\big) \quad (13)$$

where the term $\tilde{\mathcal{O}}_d$ is defined as $\tilde{\mathcal{O}}_d(\delta x, \delta u, \delta t) = \big[\tilde{\mathcal{O}}^1, \dots, \tilde{\mathcal{O}}^n\big]^T$ with $\tilde{\mathcal{O}}^j = E[\mathcal{O}^j]$. (14)

The term $V_x^T\, \tilde{\mathcal{O}}_d$ is quadratic in the variations in states and controls $(\delta x, \delta u)$, and thus there exist symmetric matrices $F \in \mathbb{R}^{n\times n}$, $Z \in \mathbb{R}^{p\times p}$ and a matrix $L \in \mathbb{R}^{p\times n}$ such that

$$V_x^T\, \tilde{\mathcal{O}}_d = \frac{1}{2}\,\delta x^T F\, \delta x + \frac{1}{2}\,\delta u^T Z\, \delta u + \delta u^T L\, \delta x \quad (15)$$

with

$$F = \delta t \sum_{j=1}^{n} \nabla_{xx} f^j\, V_{x_j} \quad (16) \qquad Z = \delta t \sum_{j=1}^{n} \nabla_{uu} f^j\, V_{x_j} \quad (17) \qquad L = \delta t \sum_{j=1}^{n} \nabla_{ux} f^j\, V_{x_j} \quad (18)$$

From the analysis above we can see that the expectation $E[V_x^T\, \delta x_{t+\delta t}]$ is a quadratic function of the variations in states and controls $(\delta x, \delta u)$. As we will prove in the next section, the expectation of $\delta x_{t+\delta t}^T\, V_{xx}\, \delta x_{t+\delta t}$ is also a quadratic function of $(\delta x, \delta u)$.
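In code, the contractions in (16)-(18) are weighted sums of the dynamics Hessians, with the components of $V_x$ as weights. A minimal sketch, assuming the Hessians are supplied as lists of per-coordinate NumPy arrays (our own conventions, not the paper's):

```python
import numpy as np

def deterministic_hessian_terms(fxx, fuu, fux, V_x, dt):
    """Terms F (16), Z (17), L (18): dynamics Hessians contracted with the
    value gradient. fxx[j] is \nabla_xx f^j (n x n), fuu[j] is \nabla_uu f^j
    (p x p), fux[j] is \nabla_ux f^j (p x n); V_x is the gradient (n,)."""
    F = dt * sum(Vj * Hj for Vj, Hj in zip(V_x, fxx))
    Z = dt * sum(Vj * Hj for Vj, Hj in zip(V_x, fuu))
    L = dt * sum(Vj * Hj for Vj, Hj in zip(V_x, fux))
    return F, Z, L
```

For linear dynamics all three Hessians are zero and the terms drop out, recovering the LQG case.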
4 E δx t+δtv xxδx t+δt = E + E + E 3 + E 4 + E 5 9 where the terms E, E, E 3, E 4 and E 5 are defined as follows: E = E δx t A t V xx A t δx t + E t Bt V xx B t t +E δx t A t V xx B t t + E t Bt V xx A t δx t 0 E = E ξ t Γ t V xx A t δx + E ξ t Γ t V xx B t + E δx A t V xx Γ t ξ t + E Bt V xx Γ t ξ t + E ξ t Γ t V xx Γ t ξ t E 3 = E O d V xx Γ t ξ t + E ξ t Γ t V xx O d E 4 = E δx t A t V xx O d + E t B t V xx O d + E O d V xx B t t + E O d V xx A t δx t E 5 = E O d V xx O d 3 4 In the first category we have all these terms that depend neither on ξ t and nor on O d δx,, ξ t, δt. hese are the terms that define E. he second category E includes terms that depend on ξ t but not on O d δx,, ξ t, δt. In the third class E 3, there are terms that depends both on O d δx,, ξ t, δt and ξ t. In the fourth class E 4, we have terms that depend on O d δx,, ξ t, δt. Finally in the fifth class E 5, we have all these terms that depend on O d δx,, ξ t, δt quadratically. he expectation operator will cancel all the terms that include noise up the first order. Moreover, the mean operator for terms that depend on the noise quadratically will result in covariance. We compute the expectations of all the terms in the E class. More precisely we will have that: E δx t A t V xx A t δx t = δx t A t V xx A t δx t E t B t V xx B t t = t B t V xx B t t E δx t A t V xx B t t = δx t A t V xx B t t E t B t V xx A t δx t = t B t V xx A t δx t 5 We continue our analysis by calculating all the terms in the class E. More presicely we will have: E ξ t Γ t V xx A t δx = 0 E ξ t Γ t V xx B t = 0 E ξ t Γ t V xx A t δx = 0 E ξ t Γ t V xx B t = 0 6 he terms above are equal to zero since the brownian noise is zero mean. he expectation of the term that does not depend on O d δx,, ξ t, δt and it is quadratic with respect to the noise is given as follows: E ξ t Γ t V xx Γ t ξ t = trace Γ t V xx Γ t Σ ω 7 Since matrix Γ depends on variations in states and controls δx, we can further massage the expressions above so that it can be expressed as quadratic functions in δx,. trace Γ t V xx Γ t Σ ω = σ dω δt 8 Γ trace V xx Γ Γ m Γ m 9 = σdωδt Γ i V xx Γ i 30 he last equation is written in the form: trace Γ t V xx Γ t Σ ω = δx Fδx + δx L + Z + Ũ + δx S + γ 3 Where the terms F R n m, L R n p, Z R p p, Ũ R p, S R n and γ R are defined as follows: F = σ δt x F c i V xx x F c i 3 L = σ δt Z = σ δt Ũ = σ δt S = σ δt γ = σ δt x F c i V xx u F c i 33 u F c i V xx u F c i 34 u F c i V xx F c i 35 x F c i V xx F c i 36 F c i V xx F c i 37 For those terms that depend both on O d δx,, ξ t, δt and on the noise class E 3 we will have: E O d V xx Γ t ξ t = E trace Vxx Γ t ξ t O d = trace V xx Γ t E ξ t O d 38 By writing the term O d δx,, ξ t, δt in a matrix form and putting the noise vector insight the this matrix we have: E O d V xx Γ t ξ t = 39 trace V xx Γ t E [ ξ t O ξ t O n ]
For the terms that depend both on $\mathcal{O}_d(\delta x, \delta u, \xi_t, \delta t)$ and on the noise (class $E_3$) we will have:

$$E\big[\mathcal{O}_d^T V_{xx} \Gamma_t \xi_t \sqrt{\delta t}\big] = E\big[\mathrm{trace}\big(V_{xx}\, \Gamma_t\, \xi_t \sqrt{\delta t}\; \mathcal{O}_d^T\big)\big] = \mathrm{trace}\big(V_{xx}\, \Gamma_t\, E\big[\sqrt{\delta t}\,\xi_t\, \mathcal{O}_d^T\big]\big) \quad (38)$$

By writing the term $\mathcal{O}_d$ in matrix form and putting the noise vector inside this matrix we have:

$$E\big[\mathcal{O}_d^T V_{xx} \Gamma_t \xi_t \sqrt{\delta t}\big] = \mathrm{trace}\Big(V_{xx}\, \Gamma_t\, E\big[\sqrt{\delta t}\,\xi_t\, [\mathcal{O}^1, \dots, \mathcal{O}^n]\big]\Big) \quad (39)$$

Calculation of the expectation above requires finding the terms $E[\sqrt{\delta t}\,\xi_t\, \mathcal{O}^j]$. More precisely we will have:

$$E\big[\sqrt{\delta t}\,\xi_t\, \mathcal{O}^j\big] = \frac{1}{2} E\big[\sqrt{\delta t}\,\xi_t\, \delta x^T \nabla_{xx}\Phi^j\, \delta x\big] + \frac{1}{2} E\big[\sqrt{\delta t}\,\xi_t\, \delta u^T \nabla_{uu}\Phi^j\, \delta u\big] + E\big[\sqrt{\delta t}\,\xi_t\, \delta u^T \nabla_{ux}\Phi^j\, \delta x\big] \quad (40)$$

We first calculate the term:

$$E\big[\sqrt{\delta t}\,\xi_t\, \delta x^T \nabla_{xx}\Phi^j\, \delta x\big] = E\big[\sqrt{\delta t}\,\xi_t\, \delta x^T \big(\nabla_{xx} f^j\,\delta t + \nabla_{xx} F_r^j\, \xi_t \sqrt{\delta t}\big)\, \delta x\big] = E\big[\sqrt{\delta t}\,\xi_t\, \delta x^T \nabla_{xx} f^j\,\delta t\, \delta x\big] + E\big[\sqrt{\delta t}\,\xi_t\, \delta x^T \nabla_{xx} F_r^j\, \xi_t \sqrt{\delta t}\, \delta x\big] \quad (41)$$

The term $E[\sqrt{\delta t}\,\xi_t\, \delta x^T \nabla_{xx} f^j\,\delta t\, \delta x] = 0$ since it depends linearly on the noise and $E[\xi_t] = 0$. The second term depends quadratically on the noise, and thus the expectation operator results in the variance of the noise. We follow the analysis:

$$E\big[\sqrt{\delta t}\,\xi_t\, \delta x^T \nabla_{xx}\Phi^j\, \delta x\big] = E\big[\sqrt{\delta t}\,\xi_t\, \delta x^T \nabla_{xx} F_r^j\, \xi_t \sqrt{\delta t}\, \delta x\big] \quad (42)$$

Since $\xi_t = (\xi^1, \dots, \xi^m)^T$ and $F_r^j = (F^{j1}, \dots, F^{jm})$ we will have:

$$E\big[\sqrt{\delta t}\,\xi_t\, \delta x^T \nabla_{xx}\Phi^j\, \delta x\big] = E\Big[\delta t\,\xi_t\; \delta x^T \sum_{l=1}^{m} \xi^l\, \nabla_{xx} F^{jl}\, \delta x\Big] \quad (43)$$

By writing $\xi_t$ in vector form we have:

$$E\big[\sqrt{\delta t}\,\xi_t\, \delta x^T \nabla_{xx}\Phi^j\, \delta x\big] = \delta t\, E\left[\begin{pmatrix} \xi^1 \\ \vdots \\ \xi^m \end{pmatrix} \delta x^T \sum_{l=1}^{m} \xi^l\, \nabla_{xx} F^{jl}\, \delta x\right] \quad (44)$$

The term $\delta x^T \sum_{l=1}^{m} \xi^l\, \nabla_{xx} F^{jl}\, \delta x$ is a scalar, and it can multiply each of the elements of the noise vector:

$$\begin{pmatrix} \delta t\, E\big[\xi^1\, \delta x^T \sum_{l=1}^{m} \xi^l \nabla_{xx} F^{jl}\, \delta x\big] \\ \vdots \\ \delta t\, E\big[\xi^m\, \delta x^T \sum_{l=1}^{m} \xi^l \nabla_{xx} F^{jl}\, \delta x\big] \end{pmatrix} \quad (45)$$

Since $E[\xi^i \xi^i] = \sigma_{d\omega}^2$ and $E[\xi^i \xi^l] = 0$ for $i \neq l$, we can show that:

$$E\big[\sqrt{\delta t}\,\xi_t\, \delta x^T \nabla_{xx}\Phi^j\, \delta x\big] = \sigma_{d\omega}^2\,\delta t \begin{pmatrix} \delta x^T \nabla_{xx} F^{j1}\, \delta x \\ \vdots \\ \delta x^T \nabla_{xx} F^{jm}\, \delta x \end{pmatrix} \quad (46)$$

In a similar way we can show that

$$E\big[\sqrt{\delta t}\,\xi_t\, \delta u^T \nabla_{uu}\Phi^j\, \delta u\big] = \sigma_{d\omega}^2\,\delta t \begin{pmatrix} \delta u^T \nabla_{uu} F^{j1}\, \delta u \\ \vdots \\ \delta u^T \nabla_{uu} F^{jm}\, \delta u \end{pmatrix} \quad (47)$$

and

$$E\big[\sqrt{\delta t}\,\xi_t\, \delta u^T \nabla_{ux}\Phi^j\, \delta x\big] = \sigma_{d\omega}^2\,\delta t \begin{pmatrix} \delta u^T \nabla_{ux} F^{j1}\, \delta x \\ \vdots \\ \delta u^T \nabla_{ux} F^{jm}\, \delta x \end{pmatrix} \quad (48)$$

Since we have calculated all the terms of expression (40), we can proceed with the computation of (38). According to the analysis above, the term $E[\mathcal{O}_d^T V_{xx} \Gamma_t \xi_t \sqrt{\delta t}]$ can be written as follows:

$$E\big[\mathcal{O}_d^T V_{xx} \Gamma_t \xi_t \sqrt{\delta t}\big] = \mathrm{trace}\big(V_{xx}\, \Gamma_t\, (M + N + G)\big) \quad (49)$$

where the matrices $M \in \mathbb{R}^{m\times n}$, $N \in \mathbb{R}^{m\times n}$ and $G \in \mathbb{R}^{m\times n}$ are defined element-wise, for noise index $l = 1, \dots, m$ and state index $j = 1, \dots, n$, as follows:

$$M_{lj} = \frac{\sigma_{d\omega}^2\,\delta t}{2}\; \delta x^T \nabla_{xx} F^{jl}\, \delta x \quad (50)$$

Similarly,

$$N_{lj} = \sigma_{d\omega}^2\,\delta t\; \delta x^T \nabla_{xu} F^{jl}\, \delta u \quad (51)$$
and

$$G_{lj} = \frac{\sigma_{d\omega}^2\,\delta t}{2}\; \delta u^T \nabla_{uu} F^{jl}\, \delta u \quad (52)$$

Based on (5), the term $\Gamma_t$ depends on $\Delta_t$, which is a function of the variations in states and controls up to first order. In addition, the matrices $M$, $N$ and $G$ are functions of the deviations in states and controls up to second order. The product of $\Delta_t$ with each of the matrices $M$, $N$ and $G$ therefore results in third-order terms that can be neglected. By neglecting these terms we can show that:

$$E\big[\mathcal{O}_d^T V_{xx} \Gamma_t \xi_t \sqrt{\delta t}\big] = \mathrm{trace}\big(V_{xx}\,(\Delta_t + F)\,(M + N + G)\big) = \mathrm{trace}\big(V_{xx}\, F\, (M + N + G)\big) \quad (53)$$

Each element $(i,j)$ of the product $C = V_{xx} F$ can be expressed as $C_{ij} = \sum_{r=1}^{n} (V_{xx})_{ir} F_{rj}$, where $C \in \mathbb{R}^{n\times m}$. Furthermore, the element $(\mu,\nu)$ of the product $H = CM$ is formulated as $H_{\mu\nu} = \sum_{k=1}^{m} C_{\mu k} M_{k\nu}$, with $H \in \mathbb{R}^{n\times n}$. Thus the term $\mathrm{trace}(V_{xx} F M)$ can now be expressed as:

$$\mathrm{trace}(V_{xx} F M) = \sum_{l=1}^{n} H_{ll} = \sum_{l=1}^{n}\sum_{k=1}^{m} C_{lk}\, M_{kl} = \sum_{l=1}^{n}\sum_{k=1}^{m}\sum_{r=1}^{n} (V_{xx})_{lr}\, F_{rk}\, M_{kl} \quad (54)$$

Since $M_{kl} = \frac{\sigma_{d\omega}^2 \delta t}{2}\, \delta x^T \nabla_{xx} F^{lk}\, \delta x$, the vectors $\delta x^T$ and $\delta x$ do not depend on $k$, $l$, $r$ and can be taken outside the sums. Thus we can show that:

$$\mathrm{trace}(V_{xx} F M) = \delta x^T \left( \frac{\sigma_{d\omega}^2\,\delta t}{2} \sum_{l=1}^{n}\sum_{k=1}^{m}\sum_{r=1}^{n} (V_{xx})_{lr}\, F_{rk}\, \nabla_{xx} F^{lk} \right) \delta x = \delta x^T \tilde{M}\, \delta x \quad (55)$$

where $\tilde{M}$ is a matrix of dimensionality $\tilde{M} \in \mathbb{R}^{n\times n}$, defined as

$$\tilde{M} = \frac{\sigma_{d\omega}^2\,\delta t}{2} \sum_{l=1}^{n}\sum_{k=1}^{m}\sum_{r=1}^{n} (V_{xx})_{lr}\, F_{rk}\, \nabla_{xx} F^{lk} \quad (56)$$

By following the same algebraic steps it can be shown that

$$\mathrm{trace}(V_{xx} F N) = \delta x^T \tilde{N}\, \delta u \quad (57)$$

with the matrix $\tilde{N} \in \mathbb{R}^{n\times p}$ defined as

$$\tilde{N} = \sigma_{d\omega}^2\,\delta t \sum_{l=1}^{n}\sum_{k=1}^{m}\sum_{r=1}^{n} (V_{xx})_{lr}\, F_{rk}\, \nabla_{xu} F^{lk} \quad (58)$$

and

$$\mathrm{trace}(V_{xx} F G) = \delta u^T \tilde{G}\, \delta u \quad (59)$$

with the matrix $\tilde{G} \in \mathbb{R}^{p\times p}$ defined as

$$\tilde{G} = \frac{\sigma_{d\omega}^2\,\delta t}{2} \sum_{l=1}^{n}\sum_{k=1}^{m}\sum_{r=1}^{n} (V_{xx})_{lr}\, F_{rk}\, \nabla_{uu} F^{lk} \quad (60)$$

Thus the term $E[\mathcal{O}_d^T V_{xx} \Gamma_t \xi_t \sqrt{\delta t}]$ is formulated as

$$E\big[\mathcal{O}_d^T V_{xx} \Gamma_t \xi_t \sqrt{\delta t}\big] = \delta x^T \tilde{M}\, \delta x + \delta u^T \tilde{G}\, \delta u + \delta x^T \tilde{N}\, \delta u \quad (61)$$

Similarly we can show that

$$E\big[\sqrt{\delta t}\,\xi_t^T \Gamma_t^T V_{xx}\, \mathcal{O}_d\big] = \delta x^T \tilde{M}\, \delta x + \delta u^T \tilde{G}\, \delta u + \delta x^T \tilde{N}\, \delta u \quad (62)$$
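The triple sums in (56), (58) and (60) collapse once one notices that $\sum_{r} (V_{xx})_{lr} F_{rk}$ is just the element $(l,k)$ of $C = V_{xx} F$. A minimal sketch built on that observation (our own conventions for supplying the element-wise second derivatives of $F$; the $1/2$ factors follow our reconstruction of (50) and (52)):

```python
import numpy as np

def second_order_noise_terms(F_nom, Fxx, Fxu, Fuu, V_xx, sigma2, dt):
    """Terms \tilde M (56), \tilde N (58), \tilde G (60).
    F_nom: F on the nominal trajectory (n x m); Fxx[l][k] = \nabla_xx F^{lk}
    (n x n), Fxu[l][k] = \nabla_xu F^{lk} (n x p), Fuu[l][k] = \nabla_uu F^{lk}
    (p x p), with l the state index and k the noise index."""
    n, m = F_nom.shape
    C = V_xx @ F_nom   # C[l, k] = sum_r (V_xx)_{lr} F_{rk}, as in (54)
    Mt = sum(C[l, k] * Fxx[l][k] for l in range(n) for k in range(m))
    Nt = sum(C[l, k] * Fxu[l][k] for l in range(n) for k in range(m))
    Gt = sum(C[l, k] * Fuu[l][k] for l in range(n) for k in range(m))
    return 0.5 * sigma2 * dt * Mt, sigma2 * dt * Nt, 0.5 * sigma2 * dt * Gt
```

For noise that is linear in the state or control the element Hessians vanish, so these terms contribute only when the diffusion matrix itself is curved.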
Next we find the expectation of all the terms that depend on $\mathcal{O}_d(\delta x, \delta u, \xi, \delta t)$ but not on the noise (class $E_4$). Consequently, we will have:

$$E\big[\delta x_t^T A_t^T V_{xx}\, \mathcal{O}_d\big] = \delta x_t^T A_t^T V_{xx}\, \tilde{\mathcal{O}}_d \approx 0, \qquad E\big[\delta u_t^T B_t^T V_{xx}\, \mathcal{O}_d\big] = \delta u_t^T B_t^T V_{xx}\, \tilde{\mathcal{O}}_d \approx 0,$$
$$E\big[\mathcal{O}_d^T V_{xx} A_t \delta x_t\big] = \tilde{\mathcal{O}}_d^T V_{xx} A_t \delta x_t \approx 0, \qquad E\big[\mathcal{O}_d^T V_{xx} B_t \delta u_t\big] = \tilde{\mathcal{O}}_d^T V_{xx} B_t \delta u_t \approx 0 \quad (63)$$

where the quantity $\tilde{\mathcal{O}}_d$ has been defined in (14). All four terms above vanish since they contain variations in state and control of order higher than two, and can therefore be neglected. Finally we compute the term of the fifth class:

$$E_5 = E\big[\mathcal{O}_d^T V_{xx}\, \mathcal{O}_d\big] = E\big[\mathrm{trace}\big(V_{xx}\, \mathcal{O}_d\, \mathcal{O}_d^T\big)\big] = \mathrm{trace}\big(V_{xx}\, E\big[\mathcal{O}_d\, \mathcal{O}_d^T\big]\big) \quad (64)$$

The product $\mathcal{O}^i \mathcal{O}^j$ is a function of variations in state and control of order 4, since each term $\mathcal{O}^i$ is a function of variations in state and control of order 2. Consequently, the term $E_5 = E[\mathcal{O}_d^T V_{xx}\, \mathcal{O}_d]$ is equal to zero. With the computation of the expectation of the term that is quadratic with respect to $\mathcal{O}_d$, we have calculated all the terms of the second-order expansion of the cost-to-go function. In the next section we derive the optimal controls and present the SDDP algorithm. Furthermore, we show how SDDP recovers the deterministic solution as well as the cases of only control-multiplicative, only state-multiplicative and only additive noise.

IV. OPTIMAL CONTROLS

In this section we provide the form of the optimal controls, and we show how previous results are special cases of our generalized stochastic DDP formulation. Furthermore, having computed all the terms of the expansion of the cost-to-go function $V(x_t)$ at state $x_t$, we show that its form remains quadratic with respect to the variations in state $\delta x_t$ under the constraint of the nonlinear stochastic dynamics (2). More precisely we have:

$$E\big[V(\bar{x}_{t+\delta t} + \delta x_{t+\delta t})\big] = V(\bar{x}_{t+\delta t}) + V_x^T A_t \delta x_t + V_x^T B_t \delta u_t + \frac{1}{2}\delta x^T F\, \delta x + \frac{1}{2}\delta u^T Z\, \delta u + \delta u^T L\, \delta x$$
$$+ \frac{1}{2}\Big( \delta x_t^T A_t^T V_{xx} A_t \delta x_t + \delta u_t^T B_t^T V_{xx} B_t \delta u_t + \delta x_t^T A_t^T V_{xx} B_t \delta u_t + \delta u_t^T B_t^T V_{xx} A_t \delta x_t \Big)$$
$$+ \frac{1}{2}\Big( \delta x^T \tilde{F}\, \delta x + 2\,\delta x^T \tilde{L}\, \delta u + \delta u^T \tilde{Z}\, \delta u + 2\,\delta u^T \tilde{U} + 2\,\delta x^T \tilde{S} + \tilde{\gamma} \Big) + \delta x^T \tilde{M}\, \delta x + \delta u^T \tilde{G}\, \delta u + \delta x^T \tilde{N}\, \delta u \quad (65)$$

The unmaximized state-action value function is defined as

$$Q(x_k, u_k) = l(x_k, u_k) + V(x_{k+1}) \quad (66)$$

Given a trajectory in states and controls $(\bar{x}, \bar{u})$ we can approximate the state-action value function as

$$Q(\bar{x} + \delta x, \bar{u} + \delta u) = Q_0 + Q_u^T \delta u + Q_x^T \delta x + \frac{1}{2}\begin{bmatrix} \delta x \\ \delta u \end{bmatrix}^T \begin{bmatrix} Q_{xx} & Q_{xu} \\ Q_{ux} & Q_{uu} \end{bmatrix}\begin{bmatrix} \delta x \\ \delta u \end{bmatrix} \quad (67)$$

By equating the coefficients of similar powers between the state-action value function $Q(x_k, u_k)$ on one side, and the immediate cost $l(x_k, u_k)$ and cost-to-go $V(x_{k+1})$ on the other, we can show that:

$$Q_x = l_x + A_t^T V_x + \tilde{S}$$
$$Q_u = l_u + B_t^T V_x + \tilde{U}$$
$$Q_{xx} = l_{xx} + A_t^T V_{xx} A_t + F + \tilde{F} + \tilde{M}$$
$$Q_{xu} = l_{xu} + A_t^T V_{xx} B_t + L^T + \tilde{L} + \tilde{N}$$
$$Q_{uu} = l_{uu} + B_t^T V_{xx} B_t + Z + \tilde{Z} + \tilde{G} \quad (68)$$

where we have assumed a local quadratic approximation of the immediate cost $l(x_k, u_k)$ according to the equation

$$l(\bar{x} + \delta x, \bar{u} + \delta u) = l_0 + l_u^T \delta u + l_x^T \delta x + \frac{1}{2}\begin{bmatrix} \delta x \\ \delta u \end{bmatrix}^T \begin{bmatrix} l_{xx} & l_{xu} \\ l_{ux} & l_{uu} \end{bmatrix}\begin{bmatrix} \delta x \\ \delta u \end{bmatrix} \quad (69)$$

with $l_x = \nabla_x l$, $l_u = \nabla_u l$, $l_{xx} = \nabla_{xx} l$, $l_{uu} = \nabla_{uu} l$ and $l_{ux} = \nabla_{ux} l$. The local variations in control that maximize the state-action value function are expressed by the equation that follows:

$$\delta u^* = \arg\max_{\delta u}\, Q(\bar{x} + \delta x, \bar{u} + \delta u) = -Q_{uu}^{-1}\big(Q_u + Q_{ux}\,\delta x\big) \quad (70)$$

The optimal control variations have the form $\delta u = \mathbf{l} + \mathbf{L}\,\delta x$, where $\mathbf{l} = -Q_{uu}^{-1} Q_u$ is the open-loop gain and $\mathbf{L} = -Q_{uu}^{-1} Q_{ux}$ is the closed-loop feedback gain. The SDDP algorithm is provided in pseudocode form in Table I.

TABLE I: PSEUDOCODE OF THE SDDP ALGORITHM

- Given:
  - An immediate cost function $l(x,u)$
  - A terminal cost term $\phi_{t_N}$
  - The stochastic dynamics $dx = f(x,u)\,dt + F(x,u)\,d\omega$
- Repeat until convergence:
  - Given a trajectory in states and controls $(\bar{x}, \bar{u})$, find the quadratic approximations of the stochastic dynamics $A_t, B_t, \Gamma_t, \mathcal{O}_d$ and the quadratic approximation of the immediate cost function $l_0, l_x, l_{xx}, l_{uu}, l_{ux}$ around these trajectories.
  - Compute all the terms $Q_x$, $Q_u$, $Q_{xu}$ and $Q_{uu}$ according to equation (68).
  - Back-propagate the quadratic approximation of the value function based on the equations:
    $V_0^{(k)} = V_0^{(k+1)} - \frac{1}{2} Q_u^T Q_{uu}^{-1} Q_u$
    $V_x^{(k)} = Q_x - Q_{xu} Q_{uu}^{-1} Q_u$
    $V_{xx}^{(k)} = Q_{xx} - Q_{xu} Q_{uu}^{-1} Q_{ux}$
  - Compute $\delta u = -Q_{uu}^{-1}\big(Q_u + Q_{ux}\,\delta x\big)$
  - Update the controls: $u^* = \bar{u} + \gamma\,\delta u$
  - Get the new optimal trajectory $x^*$ by propagating the nonlinear dynamics $dx = f(x, u^*)\,dt + F(x, u^*)\,d\omega$.
  - Set $\bar{x} = x^*$ and $\bar{u} = u^*$ and repeat.
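The backward sweep of Table I can be written compactly. Below is a minimal sketch of one backward pass (our own illustration: the helper `q_terms` is hypothetical and stands in for the assembly of equation (68), and the regularization of $Q_{uu}$ is a standard DDP safeguard rather than something specified in Table I):

```python
import numpy as np

def backward_pass(q_terms, N, n, p, Vx_T, Vxx_T, reg=1e-6):
    """One SDDP backward sweep over N time steps. q_terms(t, Vx, Vxx) must
    return (Qx, Qu, Qxx, Qxu, Quu) assembled per eq. (68), including the
    stochastic corrections \tilde S, \tilde U, \tilde F, ..., \tilde G."""
    Vx, Vxx = Vx_T, Vxx_T              # boundary condition from terminal cost
    k_open = np.zeros((N, p))          # open-loop gains  l = -Quu^{-1} Qu
    K_fb = np.zeros((N, p, n))         # feedback gains   L = -Quu^{-1} Qux
    for t in reversed(range(N)):
        Qx, Qu, Qxx, Qxu, Quu = q_terms(t, Vx, Vxx)
        Quu_inv = np.linalg.inv(Quu + reg * np.eye(p))  # keep Quu invertible
        k_open[t] = -Quu_inv @ Qu
        K_fb[t] = -Quu_inv @ Qxu.T                      # Qux = Qxu^T
        # value recursion from Table I
        Vx = Qx - Qxu @ Quu_inv @ Qu
        Vxx = Qxx - Qxu @ Quu_inv @ Qxu.T
    return k_open, K_fb
```

The forward rollout then applies $u^* = \bar{u} + \gamma(\mathbf{l} + \mathbf{L}\,\delta x)$ at each step, with the step size $\gamma$ chosen by line search.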
For the special case where the stochastic dynamics have only additive noise, $F(x,u) = F$ constant, the terms $\tilde{M}, \tilde{N}, \tilde{G}, \tilde{F}, \tilde{L}, \tilde{Z}, \tilde{U}, \tilde{S}$ will all be zero, since they are functions of $\nabla_{xx}F$, $\nabla_{xu}F$, $\nabla_{uu}F$, $\nabla_x F$ and $\nabla_u F$, all of which vanish for constant $F$. For this type of system the control does not depend on the statistical characteristics of the noise. In the case of deterministic systems, $\tilde{M}, \tilde{N}, \tilde{G}, \tilde{F}, \tilde{L}, \tilde{Z}, \tilde{U}, \tilde{S}$ will again be zero, because these terms are scaled by the variance of the noise and $\sigma_{d\omega_i}^2 = 0$ for $i = 1, \dots, m$. Finally, if the noise is only control dependent, then $\tilde{M}, \tilde{N}, \tilde{L}, \tilde{F}, \tilde{S}$ will be zero since $\nabla_{xx}F(u) = 0$, $\nabla_{xu}F(u) = 0$ and $\nabla_x F_c^i(u) = 0$, while if it is only state dependent, then $\tilde{N}, \tilde{G}, \tilde{Z}, \tilde{L}, \tilde{U}$ will be zero since $\nabla_{xu}F(x) = 0$, $\nabla_{uu}F(x) = 0$ and $\nabla_u F_c^i(x) = 0$.
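The additive-noise claim is easy to check numerically; the following is a minimal self-contained sketch (our own illustration, with our own variable names) computing $\tilde{F}$, $\tilde{S}$ and $\tilde{\gamma}$ of (32), (36) and (37) for a constant diffusion matrix:

```python
import numpy as np

# Special-case check: with additive noise F(x, u) = const, the column
# Jacobians \nabla_x F_c^i and \nabla_u F_c^i are zero, so the correction
# terms (32)-(36) vanish and only \tilde gamma (37) survives.
n, m = 3, 2
sigma2, dt = 0.1, 0.01
V_xx = np.eye(n)
Fc = [np.ones(n) for _ in range(m)]          # constant diffusion columns
Fx = [np.zeros((n, n)) for _ in range(m)]    # \nabla_x F_c^i = 0

Ft = sigma2 * dt * sum(X.T @ V_xx @ X for X in Fx)              # -> 0
St = sigma2 * dt * sum(X.T @ V_xx @ f for X, f in zip(Fx, Fc))  # -> 0
gt = sigma2 * dt * sum(f @ V_xx @ f for f in Fc)                # > 0

assert not Ft.any() and not St.any() and gt > 0
# \tilde gamma is constant in (dx, du): it shifts the cost-to-go but does
# not enter the controls of eq. (70), matching the claim above.
```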
[Fig. 1: The left plot illustrates the state trajectory for the system $dx = \cos(x)\,dt + u\,dt + x^2\,d\omega$; the right plot corresponds to the optimal control.]

[Fig. 2: The left plot illustrates the state space trajectories of the system $dx = \alpha x^2\,dt + u\,dt + x^2\,d\omega$ for different values of noise; the right plot shows the stereotypical convergence behavior of SDDP.]

V. SIMULATION RESULTS

In this section we test SDDP on two different one-dimensional systems. More precisely, we first consider the one-dimensional stochastic nonlinear system of the form

$$dx = \cos(x)\,dt + u\,dt + x^2\,d\omega \quad (71)$$

The quadratic approximation is based on the terms $A_t = 1 - \sin(x)\,\delta t$, $B_t = \delta t$ and the second-order term $\nabla_{xx} f\,\delta t = -\cos(x)\,\delta t$. The system has only state-dependent noise, and therefore $\tilde{M} = \sigma^2 \delta t\, V_{xx}\, x^2$, $\tilde{F} = 4\sigma^2 \delta t\, V_{xx}\, x^2$ and $\tilde{S} = 2 x^3 V_{xx}\, \sigma^2 \delta t$, while the terms $\tilde{N} = \tilde{G} = \tilde{L} = \tilde{Z} = \tilde{U} = 0$.

We apply the SDDP algorithm to the task of bringing the state from the initial $x_{t_0} = 0$ to the terminal $x_{t_N} = p = 4$. The cost function is expressed as $v^{\pi}(x,t) = E\big[h(x_{t_N}) + \int_{t_0}^{t_N} r u^2\, d\tau\big]$ with $h(x) = w_p (x - p)^2$, where $w_p = 1$ and $r = 10^{-6}$. In this example there is only a terminal cost on the state; the state-dependent cost over the time horizon is zero, and thus only the control cost accumulates. In Figure 1, the left plot illustrates the state trajectory as it approaches the target state $p = 4$, while the right plot illustrates the optimal control. Since there is only a terminal cost, the control over the time horizon is almost zero, while at the very end of the time horizon the control is activated and the state reaches the target.

The second system is given by the equation

$$dx = \alpha x^2\,dt + u\,dt + x^2\,d\omega \quad (72)$$

The quadratic approximation is based on the terms $A_t = 1 + 2\alpha x(t)\,\delta t$, $B_t = \delta t$ and the second-order term $\nabla_{xx} f\,\delta t = 2\alpha\,\delta t$. The parameter $\alpha$ controls the degree of instability of the system; for our simulations we used $\alpha = 1$. The task is to bring the state $x(t)$ from the initial state $x_{t_0} = 0$ to the target state $x_{t_N} = p = 1$. The cost function has the same form as in the first example, but with tuning parameters $w_p = 1$ and $r = 10^{-3}$. Furthermore, this system also has only state-dependent noise, so the correction terms take the same form as in the first example.

In Figure 2, the left plot illustrates state space trajectories for different values of the variance of the noise, while the right plot illustrates the convergence behavior of SDDP. (In both examples, the noise is used only during the iterative optimization process of SDDP. When testing the optimal control, instead of running the controlled system many times for different realizations of the noise and then calculating the mean, we simply zero the noise and run the system once.) An important observation is that the curved trajectories correspond to cases with high-variance noise; as the variance of the noise decreases, the optimal trajectories become more and more straight.

VI. DISCUSSION

In this paper we explicitly derived the equations describing the second-order expansion of the cost-to-go, given state- and control-dependent noise. Our main result is that the expressions remain quadratic with respect to $\delta x$ and $\delta u$, so the basic structure of the algorithm, a quadratic cost-to-go approximation with a linear policy, remains unchanged. In addition, we have shown how the cases of deterministic DDP and stochastic DDP with additive noise are sub-cases of our generalized formulation of Stochastic DDP.
Current and future research includes further testing and evaluation of our generalized Stochastic DDP algorithm on multidimensional stochastic systems which are highly nonlinear and have noise that is control and state dependent. Biomechanical models belong to this class, due to the highly nonlinear and noisy nature of muscle dynamics. Moreover, we aim to incorporate a second-order Extended Kalman Filter that can handle observation as well as process noise. The resulting optimal controller-estimator scheme will be a version of the iterative Quadratic Gaussian regulator that can handle stochastic dynamics expanded up to second order with state- and control-dependent noise.

REFERENCES

[1] E. Todorov and W. Li. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the American Control Conference, 2005.
[2] W. Li and E. Todorov. Iterative optimal control and estimation design for stochastic nonlinear systems. In Proceedings of the Conference on Decision and Control, 2006.
[3] D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. Elsevier Publishing Company, New York, 1970.
[4] Gregory Lantoine and Ryan P. Russell. A hybrid differential dynamic programming algorithm for robust low-thrust optimization. In AAS/AIAA Astrodynamics Specialist Conference and Exhibit, 2008.
[5] S. Yakowitz. The stagewise Kuhn-Tucker condition and differential dynamic programming. IEEE Transactions on Automatic Control, 31(1):25-30, 1986.
[6] Jun Morimoto and Chris Atkeson. Minimax differential dynamic programming: An application to robust biped walking. In Advances in Neural Information Processing Systems 15. MIT Press, Cambridge, MA, 2002.
[7] Yuval Tassa, Tom Erez, and William Smart. Receding horizon differential dynamic programming. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.