Identifying and Solving an Unknown Acrobot System


Clement Gehring and Xinkun Nie

Abstract

Identifying dynamical parameters is a challenging and important part of solving dynamical systems. For this project, we set out to solve an acrobot system with unknown parameters. In the process, we explored three different methods for system identification: the energy model, the power model, and the dynamic model. To improve policies on the identified system, we explore the Policy Gradients with Parameter-based Exploration algorithm. We demonstrate that all four methods can be applied to their corresponding subproblems with varying degrees of success.

I. INTRODUCTION

Solving a real-world dynamical system poses several significant challenges. In this report, we define a solution to a dynamical system as a precomputed controller (or policy) mapping states to actions for all reachable states. Even with perfect knowledge, underactuated dynamical systems pose difficult optimization problems that are only fully solvable in low-dimensional cases. To facilitate the search for good policies, we can leverage partial knowledge of the domain by obtaining a good model from data. For many problems, we can derive the dynamics analytically with standard kinematics, leaving only a discrete set of unknown parameters of the system (e.g., masses, inertias, geometry). We limit our work to this kind of problem, which already constitutes a large and interesting class.

To better understand the process of solving a real-world dynamical system, we consider the full pipeline from system identification to policy optimization. We focus on conventional system identification for unknown dynamical parameters and on policy gradient methods. We apply the methods studied to the well-known acrobot system, which we present in Section II.

In Section III, we introduce the system identification framework and the three methods considered: the energy model, the power model, and the dynamic model. We derive their exact formulations when applied to the acrobot. In Section IV, we present the policy gradients with parameter-based exploration (PGPE) algorithm [7], a policy gradient approach capable of optimizing non-differentiable policies. We introduce the energy shaping policy with LQR stabilization and partial feedback linearization, and its corresponding parametrization. In Section V, we offer a series of empirical results comparing the different system identification methods and illustrating the performance of PGPE on our chosen policy. Finally, we conclude in Section VI with a discussion of our results and future work.

II. BACKGROUND

A. The Acrobot

The acrobot is an underactuated two-link robot with a single actuator at the elbow, as seen in Figure 1. The most common task for the acrobot is to swing up and balance [][8].

Fig. 1. The Acrobot [8]

Figure 1 shows the parameters we use for our analysis. We define $q_1$ and $q_2$ to be the shoulder joint angle and the elbow joint angle, respectively. We define $q = [q_1, q_2]^T$ and $s = [q, \dot{q}]^T$. The zero state corresponds to both links hanging down. The moments of inertia, $I_1$ and $I_2$, are taken about the pivots. Our goal is to balance the acrobot at the unstable fixed point $s = [\pi, 0, 0, 0]^T$. To induce this behaviour, we define a reward that is incurred for every step spent farther than a fixed fraction of $\pi$ from the balance point and with angular speed greater than a fixed fraction of $\pi$.

Throughout the paper, we use the following notation:

$$c_1 := \cos(q_1), \quad c_2 := \cos(q_2), \quad s_1 := \sin(q_1), \quad s_2 := \sin(q_2), \quad c_{1+2} := \cos(q_1 + q_2), \quad s_{1+2} := \sin(q_1 + q_2).$$

The energy of the acrobot system is defined by:

$$T = T_1 + T_2$$
$$T_1 = \tfrac{1}{2} I_1 \dot{q}_1^2$$
$$T_2 = \tfrac{1}{2}\left(m_2 l_1^2 + I_2 + 2 m_2 l_1 l_{c2} c_2\right)\dot{q}_1^2 + \tfrac{1}{2} I_2 \dot{q}_2^2 + \left(I_2 + m_2 l_1 l_{c2} c_2\right)\dot{q}_1 \dot{q}_2$$
$$U = -m_1 g l_{c1} c_1 - m_2 g \left(l_1 c_1 + l_{c2} c_{1+2}\right),$$

where $T_1$ and $T_2$ are the kinetic energies at the shoulder and elbow joints, respectively, and $U$ is the potential energy of the acrobot. The total energy is given by $H = T + U$. The equations of motion are

$$0 = \left(I_1 + I_2 + m_2 l_1^2 + 2 m_2 l_1 l_{c2} c_2\right)\ddot{q}_1 + \left(I_2 + m_2 l_1 l_{c2} c_2\right)\ddot{q}_2 - 2 m_2 l_1 l_{c2} s_2 \dot{q}_1 \dot{q}_2 - m_2 l_1 l_{c2} s_2 \dot{q}_2^2 + m_1 g l_{c1} s_1 + m_2 g \left(l_1 s_1 + l_{c2} s_{1+2}\right)$$
$$\tau = \left(I_2 + m_2 l_1 l_{c2} c_2\right)\ddot{q}_1 + I_2 \ddot{q}_2 + m_2 l_1 l_{c2} s_2 \dot{q}_1^2 + m_2 g l_{c2} s_{1+2}.$$

III. SYSTEM IDENTIFICATION

The system identification problem can be formulated as an optimization problem. For the class of problems considered in this report, it is possible to formulate a linear least-squares minimization. Given non-linear changes of variables of both the dynamics parameters and the sampled states, the search for the best parameters is formulated as

$$x^* = \arg\min_x \lVert A x - b \rVert^2,$$

where $A$ is a matrix whose rows are non-linear transformations of the sampled states and their derivatives, and $b$ contains the torques applied in those samples.

Below we present three models: the energy model, the power model, and the dynamic model. The dynamic model relies on the equations of motion and requires joint accelerations, which in practice can be very noisy. The energy model uses the energy of the system and does not require knowing joint accelerations. The power model is the differential form of the energy model. Compared to the energy model, the power model does not need to approximate the integral associated with the damping coefficients, but it requires knowing the accelerations of the joints.

A. Energy model

In the energy model, defined in [], we look at the change of total energy as a function of the input torque. It is derived as follows:

$$\int_{t_a}^{t_b} \dot{q}^T \tau \, dt = H(q, \dot{q})(t_b) - H(q, \dot{q})(t_a) + \int_{t_a}^{t_b} \dot{q}^T \tau_f \, dt = h\,x,$$

where $\tau = [0, \tau]^T$ is the input torque to the system and $\tau_f$ is the frictional torque term, defined as

$$\tau_{fj} = F_{sj}\,\mathrm{sign}(\dot{q}_j) + F_{vj}\,\dot{q}_j,$$

where $F_{sj}$ and $F_{vj}$ are the Coulomb and viscous friction coefficients, respectively, at joint $j$.
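As an illustration of the least-squares formulation above, the following sketch fits lumped acrobot parameters directly from the two equations of motion (the dynamic model). It is only a sketch under stated assumptions: joint accelerations are taken to be exact rather than estimated, viscous damping is added at both joints, the parameter values are illustrative, and the helper names `regressor_rows` and `accelerations` are hypothetical rather than taken from the report.

```python
import numpy as np


def regressor_rows(q, qd, qdd):
    """Two rows of A for one sample: shoulder equation, then elbow equation.

    Assumed lumped parameters: x1 = I1 + m2*l1^2, x2 = I2, x3 = m2*l1*lc2,
    x4 = m1*g*lc1 + m2*g*l1, x5 = m2*g*lc2, x6 = b1, x7 = b2.
    """
    q1, q2 = q
    qd1, qd2 = qd
    qdd1, qdd2 = qdd
    s1, s2, c2, s12 = np.sin(q1), np.sin(q2), np.cos(q2), np.sin(q1 + q2)
    shoulder = [qdd1, qdd1 + qdd2,
                2 * c2 * qdd1 + c2 * qdd2 - 2 * s2 * qd1 * qd2 - s2 * qd2 ** 2,
                s1, s12, qd1, 0.0]
    elbow = [0.0, qdd1 + qdd2, c2 * qdd1 + s2 * qd1 ** 2, 0.0, s12, 0.0, qd2]
    return np.array([shoulder, elbow])


def accelerations(x, q, qd, tau):
    """Solve the equations of motion for the joint accelerations."""
    x1, x2, x3, x4, x5, x6, x7 = x
    q1, q2 = q
    qd1, qd2 = qd
    s1, s2, c2, s12 = np.sin(q1), np.sin(q2), np.cos(q2), np.sin(q1 + q2)
    mass = np.array([[x1 + x2 + 2 * x3 * c2, x2 + x3 * c2],
                     [x2 + x3 * c2, x2]])
    rhs = np.array([
        2 * x3 * s2 * qd1 * qd2 + x3 * s2 * qd2 ** 2 - x4 * s1 - x5 * s12 - x6 * qd1,
        tau - x3 * s2 * qd1 ** 2 - x5 * s12 - x7 * qd2,
    ])
    return np.linalg.solve(mass, rhs)


rng = np.random.default_rng(0)
x_true = np.array([1.33, 0.33, 0.5, 14.7, 4.9, 0.1, 0.1])  # illustrative values only

A_blocks, b_blocks = [], []
for _ in range(50):
    q = rng.uniform(-np.pi, np.pi, size=2)
    qd = rng.uniform(-2.0, 2.0, size=2)
    tau = rng.uniform(-5.0, 5.0)
    qdd = accelerations(x_true, q, qd, tau)
    A_blocks.append(regressor_rows(q, qd, qdd))
    b_blocks.append([0.0, tau])  # the shoulder is unactuated, the elbow sees tau

A = np.vstack(A_blocks)
b = np.concatenate(b_blocks)
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.round(x_hat, 3))
```

Keeping the shoulder and elbow equations as two separate rows per sample, rather than summing them into one, gives the solver twice as many constraints for the same data. With noise-free samples the fit recovers the parameters to numerical precision; noisy or finite-differenced accelerations would degrade it.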
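We see that in the energy-based model, we do not need to calculate the acceleration of the system. For the acrobot, we define $x = [x_1, \dots, x_7]^T$, where

$$x_1 = I_1 + m_2 l_1^2, \quad x_2 = I_2, \quad x_3 = m_2 l_1 l_{c2}, \quad x_4 = m_1 g l_{c1} + m_2 g l_1, \quad x_5 = m_2 g l_{c2}, \quad x_6 = b_1, \quad x_7 = b_2.$$

We define $h = [h_1, \dots, h_7]$, where

$$h_1 = \tfrac{1}{2}\dot{q}_1^2(t_b) - \tfrac{1}{2}\dot{q}_1^2(t_a)$$
$$h_2 = \tfrac{1}{2}\dot{q}_1^2(t_b) - \tfrac{1}{2}\dot{q}_1^2(t_a) + \tfrac{1}{2}\dot{q}_2^2(t_b) - \tfrac{1}{2}\dot{q}_2^2(t_a) + \dot{q}_1(t_b)\dot{q}_2(t_b) - \dot{q}_1(t_a)\dot{q}_2(t_a)$$
$$h_3 = \dot{q}_1^2(t_b)\cos(q_2(t_b)) - \dot{q}_1^2(t_a)\cos(q_2(t_a)) + \dot{q}_1(t_b)\dot{q}_2(t_b)\cos(q_2(t_b)) - \dot{q}_1(t_a)\dot{q}_2(t_a)\cos(q_2(t_a))$$
$$h_4 = \cos(q_1(t_a)) - \cos(q_1(t_b))$$
$$h_5 = \cos(q_1(t_a) + q_2(t_a)) - \cos(q_1(t_b) + q_2(t_b))$$
$$h_6 = \int_{t_a}^{t_b} \dot{q}_1^2(t)\,dt$$
$$h_7 = \int_{t_a}^{t_b} \dot{q}_2^2(t)\,dt.$$

We approximate $h_6$ and $h_7$ by using piecewise linear estimates for $\dot{q}$ between $t_a$ and $t_b$:

$$h_6 \approx \int_{t_a}^{t_b} \left(\left(\dot{q}_1(t_b) - \dot{q}_1(t_a)\right)\frac{t - t_a}{t_b - t_a} + \dot{q}_1(t_a)\right)^2 dt$$
$$h_7 \approx \int_{t_a}^{t_b} \left(\left(\dot{q}_2(t_b) - \dot{q}_2(t_a)\right)\frac{t - t_a}{t_b - t_a} + \dot{q}_2(t_a)\right)^2 dt,$$

and we define $q(t_a)$, $\dot{q}(t_a)$ to be the state and velocity at some time $t_a$. With the torque held constant over the interval, the left-hand side is given by

$$\int_{t_a}^{t_b} \dot{q}^T \tau \, dt = \tau\left(q_2(t_b) - q_2(t_a)\right).$$

B. Power model

We then explore the power model [1], in which we look at the change of power as a function of the input torque. It is formulated as

$$\dot{q}^T \tau = \frac{d}{dt}\left(H(q, \dot{q})\right) + \dot{q}^T \tau_f = dh\,x,$$

where, for the acrobot, $\dot{q}^T \tau = \tau \dot{q}_2$ and $x$ is defined the same as in the energy model. We then define $dh = [dh_1, \dots, dh_7]$, where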

$$dh_1 = \dot{q}_1 \ddot{q}_1$$
$$dh_2 = \dot{q}_1 \ddot{q}_1 + \dot{q}_2 \ddot{q}_2 + \ddot{q}_1 \dot{q}_2 + \dot{q}_1 \ddot{q}_2$$
$$dh_3 = c_2\left(2\dot{q}_1 \ddot{q}_1 + \ddot{q}_1 \dot{q}_2 + \dot{q}_1 \ddot{q}_2\right) - s_2 \dot{q}_2\left(\dot{q}_1^2 + \dot{q}_1 \dot{q}_2\right)$$
$$dh_4 = s_1 \dot{q}_1$$
$$dh_5 = s_{1+2}\left(\dot{q}_1 + \dot{q}_2\right)$$
$$dh_6 = \dot{q}_1^2$$
$$dh_7 = \dot{q}_2^2.$$

C. Dynamic model

A rather different approach is to use the equations of motion directly, instead of the energy of the acrobot system, for system identification [8]. Using the same parameter vector $x$ and following a derivation similar to the one presented in [], we have

$$0 = (x_1 + x_2)\ddot{q}_1 + x_3\left(2 c_2 \ddot{q}_1 + c_2 \ddot{q}_2 - s_2 \dot{q}_2^2 - 2 s_2 \dot{q}_1 \dot{q}_2\right) + x_2 \ddot{q}_2 + x_4 s_1 + x_5 s_{1+2} + x_6 \dot{q}_1$$
$$\tau = x_3\left(c_2 \ddot{q}_1 + s_2 \dot{q}_1^2\right) + x_2\left(\ddot{q}_1 + \ddot{q}_2\right) + x_5 s_{1+2} + x_7 \dot{q}_2.$$

In the class presentation of this work, we reported poor results using the dynamic model. Those results were obtained after (wrongly) collapsing the two equalities into one, which made the problem less well defined. Significant improvements were seen after this correction.

IV. POLICY GRADIENT METHODS

Policy gradient methods are reinforcement learning techniques for optimizing parametrized policies with respect to the expected return. Formally, we assume the policy is parametrized by $\theta \in \mathbb{R}^n$. Given a state $s_k$, the agent needs to choose an action $a_k$. The policy is defined as $a_k \sim \pi_\theta(a_k \mid s_k)$, a distribution over actions conditioned on the state. The goal is to find the parameter $\theta$ that maximizes some long-term return. For an episode of length $L$, we define the sequence of states and actions in this episode to be $\tau = [s_{1:L}, a_{1:L}]$. Our objective function is defined as

$$J(\theta) = E\left[\sum_{k=1}^{L} r_k\right],$$

where $r_k$ is the instantaneous reward the agent receives at time step $k$. Our update rule for the parameters is

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J \big|_{\theta = \theta_k},$$

where $\alpha$ is the fixed learning rate.

A. Policy gradients with parameter-based exploration (PGPE)

PGPE is a policy gradient method capable of optimizing non-differentiable parametrized policies. PGPE samples the policy parameters before each training episode, runs the episode with the policy given by the sampled parameters, records the reward, and updates the parameter distribution. The objective is to find the policy parameters that maximize the total reward across all episode histories.
Formally, we have

$$J(\theta) = \int_H p(h \mid \theta)\, R(h)\, dh,$$

where $h$ is any episode history and $R(h)$ is the total reward over the episode history $h$. PGPE relies on sampling over histories and averaging the results. However, determining $p(h \mid \theta)$ by sampling from the policy at each timestep would increase the variance of the samples over histories. To reduce variance, PGPE redefines the policy as

$$\pi_\rho(a_k \mid s_k) = \int_\Theta p(\theta \mid \rho)\, \delta_{F_\theta(s_k),\, a_k}\, d\theta,$$

where $\rho$ parametrizes the distribution of $\theta$, $F_\theta(s_k)$ is the action determined by the policy with parameters $\theta$ in state $s_k$, and $\delta$ is the Dirac delta function.

To make the gradient more robust, we consider PGPE with symmetric sampling. More specifically, assume $\rho$ consists of a set of means $\{\mu_i\}$ and a set of standard deviations $\{\sigma_i\}$, which determine an independent normal distribution for each parameter in $\theta$. We sample perturbations $\epsilon$ drawn from $\mathcal{N}(0, \sigma^2)$ and define the sampled parameters $\theta^+ = \mu + \epsilon$ and $\theta^- = \mu - \epsilon$. We run several episodes with each of the two sampled parameter vectors and obtain cumulative rewards $r^+$ and $r^-$ across the episodes. Our update rule is as follows:

$$\Delta\mu_i = \frac{\alpha\,\epsilon_i\,(r^+ - r^-)}{2m - r^+ - r^-}$$
$$\Delta\sigma_i = \frac{\alpha}{m - b}\left(\frac{r^+ + r^-}{2} - b\right)\left(\frac{\epsilon_i^2 - \sigma_i^2}{\sigma_i}\right),$$

where $m$ is the best reward so far and $b$ is a moving baseline, updated as

$$b_k = \beta\, b_{k-1} + (1 - \beta)\,\frac{r^+ + r^-}{2},$$

with $\beta$ being a step-size parameter. Often we keep $\sigma_i$ constant. The resulting algorithm is very similar to finite-difference optimization, with the difference that PGPE explicitly updates a distribution over the directions to evaluate and normalizes the step sizes.

B. Using PGPE for Acrobot Swing-up and Balance

For the acrobot swing-up and balance problem, we first pump energy into the system, using partial feedback linearization on the second joint's acceleration, and then switch to the LQR controller for balancing.
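Returning briefly to the symmetric-sampling update of Section IV-A, the sketch below applies the mean update to a toy problem. It is a minimal illustration under several assumptions: the exploration widths $\sigma_i$ are held fixed (so the $\sigma$ update and the baseline $b$ are omitted), each "episode" is a single call to a hypothetical deterministic function `toy_return`, and the learning rate, seed, and iteration count are arbitrary choices.

```python
import numpy as np


def pgpe_symmetric(evaluate, mu, sigma, n_iters=1000, alpha=0.2, seed=0):
    """Symmetric-sampling PGPE acting on the mean only; sigma stays fixed."""
    rng = np.random.default_rng(seed)
    mu = np.array(mu, dtype=float)
    sigma = np.array(sigma, dtype=float)
    best = -np.inf  # m in the text: best reward observed so far
    for _ in range(n_iters):
        eps = rng.normal(0.0, sigma)  # one perturbation per parameter
        r_plus = evaluate(mu + eps)
        r_minus = evaluate(mu - eps)
        best = max(best, r_plus, r_minus)
        denom = 2.0 * best - r_plus - r_minus  # non-negative by construction
        if denom > 0.0:
            mu += alpha * eps * (r_plus - r_minus) / denom
    return mu


# Hypothetical stand-in for an episode return: higher is better, peak at target.
target = np.array([1.0, -2.0])


def toy_return(theta):
    return -float(np.sum((theta - target) ** 2))


mu_final = pgpe_symmetric(toy_return, mu=[0.0, 0.0], sigma=[0.5, 0.5])
print(mu_final)
```

Because the numerator never exceeds the denominator in magnitude, each step is at most `alpha` times the sampled perturbation, which is the step-size normalization mentioned above. For the acrobot, `evaluate` would instead run the swing-up policy with the sampled gains and return the episode reward.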

Formally, in the energy shaping controller, we find the desired energy $E_d$ by calculating the potential energy at the balancing point. We then choose $u = u_p + u_e$, where $u_e = k_3 (E_d - E)\,\dot{q}_2$, with $E$ the current total energy of the system, and $u_p$ is the control input obtained from partial feedback linearization to ensure $\ddot{q}_2 = -k_1 q_2 - k_2 \dot{q}_2$. We use a naive switching scheme in which the LQR controller is activated when its cost-to-go is below a threshold $\phi$. Ideally, this threshold $\phi$ defines a set that lies inside the region of attraction. We parametrize the swing-up and balancing policy with $\theta = [k_1, k_2, k_3, \phi]^T$ in order to apply PGPE to improve the policy.

V. RESULTS

A. System Identification Experiments

To explore the performance of the different methods, we first consider the rate of convergence of the three proposed system identification methods when given perfect samples (from simulation), with the derivatives estimated by symmetric finite differences. The results, seen in Figure 2 for the power model and Figure 3 for the dynamic model, plot the absolute relative deviation, $|\hat{x}_i - x_i| / |x_i|$, for an estimated parameter $\hat{x}_i$ with respect to the number of samples used. Each method was given the same samples, generated by picking a random action from a uniform distribution and applying it for a fixed number of seconds. This way of generating data is far from optimal but was sufficient for our experiments. The simulation of the system was done using ode through the scipy python library [3]. For this experiment, we do not show the results for the energy model, as we were unable to get comparable results. For all experiments, we used a single fixed parameter vector $x$, corresponding to one choice of $m_1$, $m_2$, $l_1$, $l_2$, $b_1$, $b_2$ with $g = 9.8$.

These results indicate that, in our setting, the dynamic model significantly outperforms the power model. This was not expected, as the power model was designed to be better behaved than the dynamic model. The scale for each parameter was fixed across the methods to facilitate comparisons.
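The two measurement ingredients used in these experiments, symmetric finite-difference derivative estimates and the absolute relative deviation of a fitted parameter, can be sketched as follows. The function names, the sampled signal, and the numeric values are hypothetical and serve only to show the computations.

```python
import numpy as np


def central_difference(y, dt):
    """Symmetric finite-difference derivative at the interior samples of y."""
    y = np.asarray(y, dtype=float)
    return (y[2:] - y[:-2]) / (2.0 * dt)


def absolute_relative_deviation(x_hat, x_true):
    """|x_hat_i - x_i| / |x_i| for each parameter."""
    x_hat = np.asarray(x_hat, dtype=float)
    x_true = np.asarray(x_true, dtype=float)
    return np.abs(x_hat - x_true) / np.abs(x_true)


dt = 0.01
t = np.arange(0.0, 1.0, dt)
q = np.sin(t)                              # stand-in for a sampled joint angle
qd_est = central_difference(q, dt)         # aligned with t[1:-1]
qdd_est = central_difference(qd_est, dt)   # aligned with t[2:-2]

max_vel_error = np.max(np.abs(qd_est - np.cos(t[1:-1])))
deviation = absolute_relative_deviation([1.1, 9.8], [1.0, 9.8])
print(max_vel_error, deviation)
```

Each differencing pass drops one sample at both ends, so accelerations obtained by differencing twice are only available on `t[2:-2]`. Differencing also amplifies measurement noise, which matters most for methods that need accelerations.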
The parameters representing the damping coefficients, $x_6$ and $x_7$, appear to be quite hard to fit. The power model was unable to converge to a reasonable value with the given data, while the dynamic model was unable to fit the damping term on the second joint. There seems to be an inherent difficulty in fitting these parameters, whose effect, given the poor quality of the data we are using, might be explained away through other parameters. (In the case where the true derivatives are known, both the dynamic model and the power model converge in very few samples.)

It is important to note that our simulation setup lacks many real-world difficulties, such as noisy measurements. To explore this issue further, we set up a slightly altered version of the experiment in which a small amount of Gaussian noise was injected into the samples. All methods were given the same perturbed data. Even with this small amount of noise, we can see in Figure 6 that the dynamic model's performance is significantly lower than in the noiseless case (note the scale change from the previous figures). This noise has a much smaller effect on the performance of both the power model and the energy model, whose results are plotted in Figure 4 and Figure 5, respectively. This result confirms claims that the dynamic method is vulnerable to high-frequency noise while the power and energy methods are much more tolerant []. On real-world data, we could have improved the performance of all three methods by applying a low-pass filter to reduce the effect of noise on the finite-difference approximations, but this was not considered in our setting.

B. PGPE Results

To study the whole pipeline, we briefly experiment with PGPE and the energy shaping policy with LQR stabilization seen in class. We want to optimize the gains and the LQR cutoff in order to obtain the best swing-up times from a variety of start positions around $q_1, q_2 = 0$ at rest.
We have two sets of experiments. The first, seen in Figure 7, was reported during the class presentation. It converged to a solution capable of swinging up quickly, but we later realized that every evaluation step for a given trial always started from the same position. This caused the algorithm to overfit and converge to parameters that give exactly the right behaviour for one fixed start position. We believe this to be a consequence of the naive LQR switch condition. When the LQR is active outside its region of attraction, it hinders the system, after which the energy shaping controller is required to correct it. Our activation condition is too coarse to reliably activate the LQR only within its region of attraction. With a fixed start point, the optimization can ensure that the system activates the LQR only once it is in the region of attraction. However, this might only hold for the trajectory induced by that specific start point.

Figure 8 shows the PGPE optimization with true random restarts, which also converges. To achieve better performance in this setting, more sophisticated methods must be used to approximate the region of attraction of the LQR, such as the sums-of-squares optimization formulation seen in class.

VI. DISCUSSION

Even with our simple experimental set-up, we can notice a sharp decrease in the performance of the dynamic model once noise is added. We are confident that the power model is a better candidate for system identification in our setting, though different noise models (e.g., approximated torques, uniform distributions, finite differences for $q$) and their corresponding effects on each method would need to be studied to understand whether this is always the case. Real-world domains carry many difficulties which are not captured by our simulation and which potentially require extra processing in order to obtain good

performance. This pre-processing could benefit the different methods unevenly and would need to be applied in future experiments to ensure that the comparisons between methods are fair. Similarly, the data we generated for the system identification experiments led to poorly conditioned matrices. Drawing on the existing work on exciting trajectories would certainly improve the performance of all three methods [6].

The policy gradient methods seem quite promising. They allow non-trivial parametrizations of policies, letting the designer leverage previous work and specific domain knowledge. The PGPE algorithm gives us good performance in the fixed start point setting but takes considerably more computational power in the random restart case (by requiring many more evaluations or iterations). As mentioned earlier, this could have been improved by using a proper region of attraction algorithm. Alternatively, a parametrized LQR activation condition, with parameters initialized from the current cost-to-go condition and then optimized, might be able to find a larger region of attraction (depending on the parametrization).

We have found the episodic nature of PGPE to be detrimental to what we had set out to achieve. In the future, we will consider incremental policy gradient methods in order to formulate an incremental system identification and policy improvement method.

Fig. 2. The deviation of the fitted parameters when using the power model with no input noise. The error bars represent the confidence interval, and results are averaged over independent runs.

Fig. 3. The deviation of the fitted parameters when using the dynamic model with no input noise. The error bars represent the confidence interval, and results are averaged over independent runs.

Fig. 4. The deviation of the fitted parameters when using the power model with Gaussian input noise. The error bars represent the confidence interval, and results are averaged over independent runs.

Fig. 5. The deviation of the fitted parameters when using the energy model with Gaussian input noise. The error bars represent the confidence interval, and results are averaged over independent runs.

Fig. 6. The deviation of the fitted parameters when using the dynamic model with Gaussian input noise. The error bars represent the confidence interval, and results are averaged over independent runs.

Fig. 7. The average performance of the energy shaping policy plotted against the number of improvement steps with PGPE (axes: average reward versus number of steps). Every run was given a fixed start point. The error bars represent the confidence interval, and results are averaged over independent runs. The LQR costs were defined by a diagonal matrix Q and R = I.

Fig. 8. The average performance of the energy shaping policy plotted against the number of improvement steps with PGPE (axes: average reward versus number of steps). Every evaluation is given a start point with q within a fixed fraction of π of the rest position. The evaluation of a policy is averaged over random restarts. The error bars represent the confidence interval, and results are averaged over independent runs. The LQR costs were defined by a diagonal matrix Q and R = I.

REFERENCES

[1] Maxime Gautier. Dynamic identification of robots with power model. In Proceedings of the 1997 IEEE International Conference on Robotics and Automation. IEEE, 1997.
[2] Luke Johnson. Adaptive Swing-up and Balancing Control of Acrobot Systems. Bachelor's thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2009.
[3] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001-. [Online].
[4] W. Khalil and E. Dombre. Modeling, Identification & Control of Robots. London: HPS, 2002.
[5] Richard M. Murray and John Edmond Hauser. A case study in approximate linearization: The acrobat example. Electronics Research Laboratory, College of Engineering, University of California, 1991.
[6] C. Presse and Maxime Gautier. New criteria of exciting trajectories for robot identification. In Proceedings of the 1993 IEEE International Conference on Robotics and Automation. IEEE, 1993.
[7] Frank Sehnke, Christian Osendorfer, Thomas Rückstieß, Alex Graves, Jan Peters, and Jürgen Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551-559, 2010.
[8] Russ Tedrake. Underactuated Robotics: Algorithms for Walking, Running, Swimming, Flying, and Manipulation (Course Notes for MIT 6.832).
