Path Integral Stochastic Optimal Control for Reinforcement Learning


Preprint. The 1st Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2013).

Farbod Farshidian, Institute of Robotics and Intelligent Systems, ETH Zürich, Zürich, 8092, Switzerland, farshidian@mavt.ethz.ch
Jonas Buchli, Institute of Robotics and Intelligent Systems, ETH Zürich, Zürich, 8092, Switzerland, buchlij@ethz.ch

Abstract

Path integral stochastic optimal control based learning methods are among the most efficient and scalable reinforcement learning algorithms. In this work, we present a variation of this idea in which the optimal control policy is approximated through linear regression. This connection allows the use of well-developed linear regression algorithms for learning the optimal policy, e.g. learning the structural parameters as well as the linear parameters. In path integral reinforcement learning, the Policy Improvement with Path Integrals (PI²) algorithm is one of the most efficient algorithms and the one most similar to the algorithm we propose here. However, in contrast to PI², which relies on Dynamic Movement Primitives (DMPs) to become a model-free learning algorithm, our proposed method is formulated for an arbitrary parameterized policy represented by a linear combination of nonlinear basis functions. Additionally, as the duration and the goal of the task are part of the optimization in some tasks (e.g. shortest-time path optimization problems), our proposed method can directly optimize these quantities instead of assuming them to be given, fixed parameters. Furthermore, PI² needs a batch of rollouts for each parameter update iteration, whereas our method can update after just one rollout. The simulation results in this work show that a simple implementation of our proposed method can perform at least as well as PI², despite using only out-of-the-box regression and a naive sampling strategy. In this light, the work presented here should be considered a preliminary step in the development of our new approach, which addresses some issues in the derivation of previous algorithms. Building on these improvements, we believe that this work will ultimately lead to more efficient learning algorithms.

Keywords: reinforcement learning, stochastic optimal control, path integral, linear regression approximation.

Acknowledgements

This research is supported by the Swiss National Science Foundation through a SNSF Professorship Award to Jonas Buchli and the SNSF National Centre of Competence in Research Robotics.

1 Introduction

Learning motor control is one of the essential components of versatile autonomous robots. In order to guide the learning, we need to define a cost or reward function that mirrors the requirements of a given task. Therefore, learning motor control can be considered a constrained optimization problem, where the constraints are in general a set of stochastic differential equations plus limits on the joint variables and torques. This naturally leads to formulating learning motor control in a stochastic optimal control framework, where the main difference is the availability of a model (optimal control) vs. no model (learning). Taking a model-based optimal control perspective and then developing a model-free reinforcement learning algorithm from an optimal control framework has proven very successful. Under the assumption of stochastic dynamics, the key for the step from model-based control to model-free learning is a path integral formulation of optimal control. Recently, algorithms based on these developments [10, 4] have shown superior performance in practical applications [9, 1].

Using path integrals to solve the optimal control problem was introduced by Kappen in [3]. He demonstrates how the path integral allows replacing the backward computation by a forward computation of a diffusion process. However, his formulation is not model-free and is restricted to a special class of systems, i.e. systems with a state-independent control transition matrix. In Theodorou et al. [10], the generalized path integral formulation (where the control transition matrix can be a function of the state) is used to derive a model-free, iterative learning algorithm named Policy Improvement with Path Integrals (PI²). The original formulation of PI² is based on Dynamic Movement Primitives (DMPs) [2]. The DMPs add an artificial limitation to the optimization problem by restricting the space of achievable trajectories. DMPs also introduce two open algorithmic parameters, namely the goal and the duration of the movement. In [9] an approach to learn the goal parameter is presented, but the duration of the movement remains an open parameter. In [7] a variation of PI² is presented in which the control input is directly parameterized. However, this variation is not model-free (the control transition matrix is needed) and suffers from over-parameterization. To solve this problem, as proposed in [1], a fast intermediate system between the real system input and the parameterized input is needed (the same version of PI² is used here for the sake of comparison). In addition to the policy representation issue, PI² needs a batch of rollouts for each iteration of the parameter update, which is not ideal for learning on real systems such as robots, where every additional rollout is time consuming and leads to unnecessary wear and tear. Also, PI² relies on a task-dependent heuristic formula to sum up the time-indexed policy parameter vectors into one time-independent parameter vector. In contrast to PI², our proposed algorithm addresses all these issues in a more systematic way.

In this work, we follow up on the idea of path integral stochastic optimal control as a basis for formulating a learning algorithm. However, instead of confining the policy parametrization to DMPs, we utilize an arbitrary linear parametrization of the optimal policy using linear combinations of nonlinear basis functions. We also try to avoid any kind of task-dependent heuristic in order to develop a more general learning algorithm.
As will be detailed, the estimation of the optimal control policy can be transformed into a linear regression problem. This connection to linear regression gives the opportunity to use well-developed algorithms for learning both the model structure and the parameters. It also makes it possible to update the parameter vector after each rollout (although in the current implementation this ability is not used).

2 Path Integral Stochastic Optimal Control Learning Algorithm

In this section, we demonstrate how the problem of estimating the optimal policy boils down to a linear regression problem. To this end, we briefly introduce path integral stochastic optimal control. Then, we derive our main formula, which is a linear regression for the optimal control policy.

Path Integral Stochastic Optimal Control

First, we briefly introduce the basics of path integral stochastic optimal control (for more details refer to [3, 4, 10]). We assume a finite-horizon cost function

J(x_i, t_i) = \min_{u(t_i : t_f)} C(x_i, t_i, u(\cdot)), \qquad C(x_i, t_i, u(\cdot)) = E_{\tau(x_i)}\Big[ \phi(x(t_f)) + \int_{t_i}^{t_f} \big( q(x, t) + \tfrac{1}{2} u^T R u \big)\, dt \Big]    (1)

where x is the state vector and u is the control input vector. The trajectory \tau(x_i) is generated by the stochastic dynamics

\dot{x} = f(x, t) + g(x)\,(u + \varepsilon), \qquad \varepsilon \sim N(0, \Sigma)    (2)

with initial condition x(t_i) = x_i, control trajectory u(\cdot), and zero-mean Gaussian input noise \varepsilon with covariance \Sigma. The solution to this optimization problem can be derived by solving the stochastic Hamilton-Jacobi-Bellman (HJB) equation, which is a nonlinear second-order Partial Differential Equation (PDE). This PDE can be transformed into a linear second-order PDE through the transformation J = -\lambda \log \Psi and the assumption \Sigma = \lambda R^{-1}. Using the Feynman-Kac formula, the solution to this linear PDE can be found by simulating random paths of the stochastic process associated with the uncontrolled, noise-driven system.
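As an aside, this forward computation can be illustrated with a small Monte Carlo sketch (not from the paper): it estimates the desirability \Psi(x_i, t_i) = E[\exp(-Q(\tau)/\lambda)] by Euler-Maruyama rollouts of the uncontrolled, noise-driven dynamics, where Q(\tau) is the accumulated state cost defined in equation (4) below. The callables f, g, q, phi and the values lam, Sigma are placeholders.

```python
import numpy as np

def estimate_desirability(x_i, t_i, t_f, f, g, q, phi, lam, Sigma, K=500, dt=0.01, seed=0):
    """Monte Carlo estimate of Psi(x_i, t_i) = E[exp(-Q(tau)/lambda)] via
    Euler-Maruyama roll-outs of the uncontrolled, noise-driven system (cf. eq. 4).
    f, g, q, phi, lam, Sigma are user-supplied placeholders, not values from the paper."""
    rng = np.random.default_rng(seed)
    n_steps = int(round((t_f - t_i) / dt))
    L = np.linalg.cholesky(Sigma)                 # factor of the noise covariance
    Q = np.zeros(K)                               # accumulated state cost per roll-out
    for k in range(K):
        x, t = np.array(x_i, dtype=float), t_i
        for _ in range(n_steps):
            dW = np.sqrt(dt) * (L @ rng.standard_normal(L.shape[0]))  # Brownian increment
            x = x + f(x, t) * dt + g(x) @ dW      # dx = f dt + g dW (no control applied)
            t += dt
            Q[k] += q(x, t) * dt                  # running state cost
        Q[k] += phi(x)                            # terminal cost
    return np.exp(-Q / lam).mean()                # Feynman-Kac / path integral estimate
```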

After several steps of calculation, the control input can also be derived via a path integral formulation. Equation (3) shows this relation. In this equation, g_c is the nonzero vertical partition of the matrix g, and Q(\tau(x)) is the accumulated cost without the control cost along the trajectory \tau(x), which starts from state x and evolves according to equation (4) (the uncontrolled, noise-driven system).

u^*(x, t) = T(x)\, \frac{E_{\tau(x)}\big[ \exp(-\tfrac{1}{\lambda} Q(\tau(x)))\, \varepsilon_t \big]}{E_{\tau(x)}\big[ \exp(-\tfrac{1}{\lambda} Q(\tau(x))) \big]}, \qquad T(x) = R^{-1} g_c^T \big( g_c R^{-1} g_c^T \big)^{-1} g_c    (3)

Q(\tau(x)) = \phi(x(t_f)) + \int_{t}^{t_f} q(x, t)\, dt, \qquad \dot{x} = f(x, t) + g(x)\, \varepsilon    (4)

Equation (3) can also be written in the form of equation (5), which will be used in the next section.

E_{\tau(x)}\big[ \exp(-\tfrac{1}{\lambda} Q(\tau(x)))\, \big( u^*(x, t) - T(x)\, \varepsilon_t \big) \big] = 0    (5)

Proposed Learning Algorithm

Using a parameterized policy makes the learning problem tractable in high dimensions. Among the different possible choices for function approximation, a linear combination of basis functions, \theta^T \Phi(x, t), offers theoretical and practical advantages. In general, the basis functions \Phi(x, t) can be nonlinear functions of any set of features of time and state. In addition, any approximation problem needs a suitable metric to measure the difference between the approximated and the true optimal solution. Equation (6) shows the sum-of-squares error criterion, L(\theta), which is calculated over an arbitrary distribution (ideally the optimal trajectory distribution) for approximating the optimal control input. Here \|\cdot\|_2 is the two-norm of a vector.

L(\theta) = \tfrac{1}{2}\, E_x\big[ \| u^*(x, t) - \theta^T \Phi(x, t) \|_2^2 \big]    (6)

The gradient of L(\theta) is equal to zero at the optimal solution, which implies equation (7).

\frac{\partial L(\theta)}{\partial \theta} = E_x\big[ \big( u^*(x, t) - \theta^T \Phi(x, t) \big)\, \Phi(x, t)^T \big] = 0    (7)

By multiplying and dividing by E_{\tau(x)}[\exp(-\tfrac{1}{\lambda} Q(\tau(x)))] (and further steps we omit for brevity) we get

E_x\Big[ E_{\tau(x)}\Big[ \frac{\exp(-\tfrac{1}{\lambda} Q(\tau(x)))}{E_{\tau(x)}[\exp(-\tfrac{1}{\lambda} Q(\tau(x)))]}\, \big( u^*(x, t) - \theta^T \Phi(x, t) \big) \Big]\, \Phi(x, t)^T \Big] = 0    (8)

By adding and subtracting the term T(x)\varepsilon_t in the previous expression we derive

E_x\Big[ E_{\tau(x)}\Big[ \frac{\exp(-\tfrac{1}{\lambda} Q(\tau(x)))}{E_{\tau(x)}[\exp(-\tfrac{1}{\lambda} Q(\tau(x)))]}\, \big( (u^*(x, t) - T(x)\varepsilon_t) + T(x)\varepsilon_t - \theta^T \Phi(x, t) \big) \Big]\, \Phi(x, t)^T \Big] = 0    (9)

After several further steps and using equation (5), it can be shown that the term (u^*(x, t) - T(x)\varepsilon_t) vanishes. Therefore, we get

E_x\Big[ E_{\tau(x)}\Big[ \frac{\exp(-\tfrac{1}{\lambda} Q(\tau(x)))}{E_{\tau(x)}[\exp(-\tfrac{1}{\lambda} Q(\tau(x)))]}\, \big( \theta^T \Phi(x, t) - T(x)\varepsilon_t \big) \Big]\, \Phi(x, t)^T \Big] = 0    (10)

which is equivalent to the following minimization problem

\theta^* = \arg\min_\theta\; E_x\Big[ E_{\tau(x)}\Big[ \frac{\exp(-\tfrac{1}{\lambda} Q(\tau(x)))}{E_{\tau(x)}[\exp(-\tfrac{1}{\lambda} Q(\tau(x)))]}\, \big\| \theta^T \Phi(x, t) - T(x)\varepsilon_t \big\|_2^2 \Big] \Big]    (11)

Equation (11) shows that the problem of finding the optimal parameterized policy reduces to a Weighted Linear Regression (WLR) problem. To turn this method into a model-free algorithm, the matrix T(x) should be eliminated from the regression. This can be achieved by assuming that the real system dynamics are augmented with a square MIMO (Multiple-Input Multiple-Output) linear system whose states are the real system's control inputs. This system should have a unit output-to-input gain and faster dynamics than the real, underlying system, so that it does not interfere with the real system dynamics. In this case the real system's control inputs become part of the states of the augmented system, and the matrix T(x) depends only on the input matrix of the linear system, which is an invertible square matrix. This means T(x) = I. This simplification borrows the idea of the adiabatic elimination technique from physics [1].
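To make the regression in equation (11) concrete, here is a minimal weighted least-squares sketch (assuming T(x) = I as discussed above). The arrays Phi, Eps and Q_togo are placeholders for quantities collected from rollouts, and normalising the weights over rollouts at each time step is a simplification of the expectations in (11), not the paper's implementation.

```python
import numpy as np

def fit_policy_weights(Phi, Eps, Q_togo, lam, ridge=1e-8):
    """Weighted linear regression for theta (cf. eq. 11, with T(x) = I).

    Phi    : (K, T, n_feat)  basis features Phi(x, t) along K roll-outs of length T
    Eps    : (K, T, n_u)     injected noise eps_t at each step
    Q_togo : (K, T)          accumulated state cost-to-go Q(tau) from each (x, t)
    Returns theta of shape (n_feat, n_u) minimising the weighted squared error."""
    K, T, n_feat = Phi.shape
    # exponentiated-cost weights, normalised over roll-outs at each time step
    W = np.exp(-Q_togo / lam)
    W = W / (W.sum(axis=0, keepdims=True) + 1e-300)
    # flatten (roll-out, time) pairs into one weighted least-squares problem
    w = W.reshape(-1)                          # (K*T,)
    A = Phi.reshape(-1, n_feat)                # (K*T, n_feat)
    B = Eps.reshape(K * T, -1)                 # (K*T, n_u), regression target
    AtWA = A.T @ (A * w[:, None]) + ridge * np.eye(n_feat)
    AtWB = A.T @ (B * w[:, None])
    return np.linalg.solve(AtWA, AtWB)         # theta: (n_feat, n_u)
```

The returned theta has shape (n_feat, n_u), so the approximated policy is u(x, t) ≈ theta.T @ Phi(x, t).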

Note that the sampling distribution in equation (11) comes from the noise-driven system. This causes two major problems. First, we expect the performance of the learning agent to improve over time, which clearly contradicts sampling from the noise-driven system. Second, the probability of finding good trajectories decreases exponentially as the dimension of the problem increases, which makes the learning problem intractable in high-dimensional, real-world problems. One possible remedy is importance sampling: instead of sampling the whole solution space, samples are taken from a suitable sampling distribution. A natural choice for the current learning problem is to sample from the active system, where the control inputs are computed as the latest estimate of the optimal control inputs with respect to the samples collected so far. The only remaining issue is to add the importance sampling correction weight to the regression problem. Under the Gaussian noise assumption, the final regression problem is given in equation (12). Here \tau_a indicates that the sampling system is the active system. In equation (12), \bar{S} is the normalized version of S. This is accomplished by choosing \lambda = (\max(S) - \min(S)) / 10 and multiplying the numerator and denominator by \exp(10 \min(S) / (\max(S) - \min(S))). Table 1 shows the pseudo-code for a simple learning algorithm based on equation (12).

\theta^* = \arg\min_\theta\; E_x\Big[ E_{\tau_a(x)}\Big[ \frac{\exp(-\bar{S}(\tau_a(x)))}{E_{\tau_a(x)}[\exp(-\bar{S}(\tau_a(x)))]}\, \big\| \theta^T \Phi(x, t) - \varepsilon_t \big\|_2^2 \Big] \Big]
\bar{S}(\tau_a(x)) = 10\, \frac{S(\tau_a(x)) - \min_{\tau_a(x)} S(\tau_a(x))}{\max_{\tau_a(x)} S(\tau_a(x)) - \min_{\tau_a(x)} S(\tau_a(x))}
S(\tau_a(x)) = \phi(x(t_f)) + \int_{t}^{t_f} \big( q(x, t) + \tfrac{1}{2} u^T R u + u^T R \varepsilon \big)\, dt    (12)

Table 1: Path Integral Stochastic Optimal Control (PISOC) Learning Algorithm
Given: initial parameter \theta, exploration noise covariance, linear model structure u = \theta^T \Phi(x, t), cost function.
while the trajectory cost has not converged do
- Create K rollouts of the system in equation (2).
- Compute S in equation (12) for each rollout.
- Compute min(S) and max(S) over the K rollouts.
- Compute \bar{S} in equation (12) for each rollout.
- Update \theta by solving the minimization in equation (12) over the K rollouts.
- Anneal the exploration noise.
end while

Figure 1: Evaluation results, cost versus iteration, for (a) a point mass and (b) a 50-link arm.

3 Simulation

In this section, the algorithm in Table 1 is used to learn two different tasks. Both evaluations combine a via-point and a reaching task in a 2D environment. The first system is a single point mass and the second is a 50-link planar arm. For performance comparison, we evaluate our solution against a modified version of the PI² algorithm. The PI² variant used here does not use DMPs for the policy parameterization; instead, it directly parameterizes the control input, as our proposed algorithm does. In both algorithms, the exploration noise, the number of rollouts, and the other open parameters are the same. The control input is represented by a linear combination of time-based Gaussian kernels (a minimal sketch of such a basis is given after the task description below).

Point Mass

The first evaluation considers learning the optimal policy for controlling a point mass moving on a horizontal plane. The task is to start from a particular point, then to pass through another particular point at a specific time, and finally to reach a goal point at a predefined time with zero velocity. We also minimize the quadratic control input cost.
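As a small illustration of the policy representation, the time-based Gaussian kernels mentioned above could be implemented as follows; the number of kernels and their width are arbitrary choices for this sketch, not the values used in the experiments.

```python
import numpy as np

def gaussian_time_features(t, t_f, n_kernels=20, width=None):
    """Normalised Gaussian kernels in time, giving the basis vector Phi(t)
    for the policy u = theta^T Phi(t). Kernel count and width are illustrative."""
    centers = np.linspace(0.0, t_f, n_kernels)
    width = width if width is not None else (t_f / n_kernels) ** 2
    phi = np.exp(-0.5 * (t - centers) ** 2 / width)
    return phi / phi.sum()          # shape (n_kernels,)
```

The control at time t is then u(t) = theta.T @ gaussian_time_features(t, t_f), with theta of shape (n_kernels, n_u).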
In both algorithms, 15 trials per update are used, consisting of 10 new trials and the best 5 trials from previous updates. The covariance of the exploration noise is set to 20 and is decreased linearly over the parameter update iterations. The initial value of the parameter vector is set to zero. Figure 1(a) shows the cost of the task versus iteration for both algorithms. The figure shows that both algorithms have comparable performance.

Robotic Arm

In the second evaluation, a 50-link planar robot is considered. The goal of the task is the same as in the previous example, but this time the end-effector position and velocity are considered. The learning setup is also the same. Figure 1(b) shows the resulting learning curves. The results show that the proposed algorithm performs slightly better.
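For completeness, a minimal end-to-end sketch of the loop in Table 1, as one might implement it for evaluations of this kind, is given below. The rollout simulator and the annealing schedule are hypothetical placeholders, not the authors' code, and the parameter update here interprets the weighted fit of equation (12) as an increment on top of the current policy, since the rollouts are generated by the active system.

```python
import numpy as np

def pisoc_learn(simulate_rollout, theta0, sigma0, n_iters=50, K=15, anneal=0.95):
    """Sketch of the PISOC loop in Table 1 (illustrative only, placeholder helpers).

    simulate_rollout(theta, sigma) -> (Phi, Eps, S): features Phi (T, n_feat),
    injected exploration noise Eps (T, n_u), and scalar path cost S (cf. eq. 12)."""
    theta, sigma = theta0.copy(), sigma0
    for _ in range(n_iters):
        rollouts = [simulate_rollout(theta, sigma) for _ in range(K)]   # K roll-outs
        S = np.array([r[2] for r in rollouts])                          # path costs
        S_bar = 10.0 * (S - S.min()) / (S.max() - S.min() + 1e-12)      # normalised cost
        w = np.exp(-S_bar)
        w /= w.sum()                                                    # importance weights
        A = np.vstack([r[0] for r in rollouts])                         # stacked features
        B = np.vstack([r[1] for r in rollouts])                         # stacked noise targets
        w_t = np.concatenate([np.full(r[0].shape[0], w[k]) for k, r in enumerate(rollouts)])
        AtWA = A.T @ (A * w_t[:, None]) + 1e-8 * np.eye(A.shape[1])
        d_theta = np.linalg.solve(AtWA, A.T @ (B * w_t[:, None]))
        theta = theta + d_theta        # apply the weighted fit as a policy increment
        sigma = sigma * anneal         # anneal the exploration noise
    return theta
```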

4 Discussion

As the simulation results demonstrate, the performance of the current implementation of equation (12) is not significantly better than that of PI². The results merely confirm that a simple implementation of this linear-regression-based learning algorithm can work at least as well as PI². We consider this work a basis for future development, as it addresses some issues of previous approaches and will allow for more efficient learning algorithms. A few interesting aspects of the proposed algorithm are discussed in the following.

One advantage of the formulation in equation (12) is that it transforms the problem of learning optimal control into a linear regression problem, which allows us to use the well-developed framework and tools of regression. While we have not demonstrated it here, the structural parameters can be learned automatically: there exist a number of algorithms in the linear-regression framework that can automatically adapt the distribution and number of basis functions. Although the same methods could in principle be used for the PI² algorithm, some practical considerations in the implementation of PI² slow down structural parameter learning. To explain why, note that one of the differences between the two algorithms is the way they add exploration noise to the system. In our proposed algorithm, as in gradient-based policy search, the noise is added directly to the action, whereas in PI² the exploration noise is added to the parameter vector, as in stochastic gradient descent methods and evolutionary algorithms. As shown in [8], it is better not to add time-dependent noise to the parameter vector but rather to apply a constant perturbation during each trial. As a consequence, a single update of the structural parameters requires information from several different rollouts. In the new formulation, by contrast, the noise varies during the execution of a rollout, which makes it possible to update the structural parameters after each rollout. In general, we believe that adding noise directly to the control action increases the amount of information we can retrieve from the system. One may argue that non-smooth inputs are not suitable for real systems, but they are acceptable in simulation. For experiments on a real system, we can assume a low-pass filter before the system input which smooths out the control signal; even under this assumption we can still add time-varying noise in each rollout (it would be colored noise rather than white noise).

According to the current formulation, it is possible to perform an update after each rollout. Consequently, the accumulated reward over the agent's lifetime can be expected to increase more quickly, because not only is the acting policy constantly improved, but the sampling distribution is also updated.

In inference-based methods for policy search, it is common to introduce a probability of reward in order to transform the optimization problem into an inference problem [5, 6]. This reward probability is proportional to the exponential of the negative trajectory cost. Our proposed method lends support to this assumption: as shown, the problem of finding the optimal parameter vector for an arbitrary linear approximation of the optimal control input is a WLR problem in which the weights are the reward probabilities.
Under the Gaussian noise assumption, the proposed WLR problem is equivalent to a maximum-likelihood estimation problem for a parameterized policy in which the posterior distribution is proportional to the reward probability.

As future work, we will test this algorithm on real robotic systems. We also aim to implement a better algorithm for equation (12). This implementation should be able to learn the structural parameters (e.g. the number of basis functions and their distribution) as well as the linear parameters. It should also be able to update the parameterized policy after each rollout.

References

[1] J. Buchli, F. Stulp, E.A. Theodorou, and S. Schaal. Learning variable impedance control. The International Journal of Robotics Research, 2011.
[2] A.J. Ijspeert, J. Nakanishi, and S. Schaal. Learning attractor landscapes for learning motor primitives. Advances in Neural Information Processing Systems, 15:1523-1530, 2002.
[3] H.J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 2005(11):P11011, 2005.
[4] H.J. Kappen, V. Gómez, and M. Opper. Optimal control as a graphical model inference problem. Machine Learning, 87(2), 2012.
[5] G. Neumann. Variational inference for policy search in changing situations. In Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.
[6] K. Rawlik, M. Toussaint, and S. Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. In Proceedings of Robotics: Science and Systems (RSS), 2012.
[7] E. Rombokas, E.A. Theodorou, M. Malhotra, E. Todorov, and Y. Matsuoka. Tendon-driven control of biomechanical and robotic systems: A path integral reinforcement learning approach. In IEEE International Conference on Robotics and Automation (ICRA), 2012.
[8] F. Stulp and O. Sigaud. Policy improvement methods: Between black-box optimization and episodic reinforcement learning. hal-00738463, 2012.
[9] F. Stulp, E.A. Theodorou, and S. Schaal. Reinforcement learning with sequences of motion primitives for robust manipulation. IEEE Transactions on Robotics, 28(6):1360-1370, 2012.
[10] E.A. Theodorou, J. Buchli, and S. Schaal. A generalized path integral control approach to reinforcement learning. The Journal of Machine Learning Research, 11:3137-3181, 2010.