Identifying and Solving an Unknown Acrobot System
Clement Gehring and Xinkun Nie

Abstract

Identifying dynamical parameters is a challenging and important part of solving dynamical systems. For this project, we set out to solve an acrobot system with unknown parameters. In the process, we explored three different methods for system identification: the energy model, the power model and the dynamic model. In order to improve policies on the identified system, we explore the Policy Gradients with Parameter-based Exploration algorithm. We demonstrate that all four methods can be applied to their corresponding subproblems with varying degrees of success.

I. INTRODUCTION

Solving a real-world dynamical system poses several significant challenges. Even with perfect knowledge, underactuated dynamical systems represent difficult optimization problems which are only fully solvable in low-dimensional cases. In order to facilitate the search for good policies, we can consider leveraging partial knowledge of the domain. This can be done by obtaining a good model from data. For many problems, we can derive the dynamics analytically with standard kinematics, leaving only a discrete set of unknown parameters of the system (e.g., masses, inertia, geometry). We limit our work to this kind of problem, which already constitutes a large and interesting class. (In this report, we define a solution to a dynamical system as a precomputed controller, or policy, mapping states to actions for all reachable states.)

In order to better understand the process of solving a real-world dynamical system, we consider the full pipeline from system identification to policy optimization. We focus on conventional system identification for unknown dynamical parameters and on policy gradient methods. To better understand the methods studied, we apply them to the well-known acrobot system, which we present in Section II.

In Section III, we introduce the system identification framework and the three methods considered: the energy model, the power model, and the dynamic model. We derive their exact formulations when applied to the acrobot. In Section IV, we present the policy gradients with parameter-based exploration (PGPE) algorithm [7], a policy gradient approach capable of optimizing non-differentiable policies. We introduce the energy shaping policy with LQR stabilization and partial feedback linearization, and its corresponding parametrization. In Section V, we offer a series of empirical results comparing the different system identification methods and illustrating the performance of PGPE on our chosen policy. Finally, we conclude in Section VI with a discussion of our results and future work.

II. BACKGROUND

A. The Acrobot

The acrobot is an underactuated two-link robot with a single actuator at the elbow, as seen in Figure 1. The most common task for the acrobot is to swing up and balance [][8].

Fig. 1. The Acrobot [8].

Figure 1 shows the parameters we use for our analysis. We define $q_1$ and $q_2$ to be the shoulder joint angle and the elbow joint angle, respectively. We define $q = [q_1, q_2]^T$ and $s = [q, \dot q]^T$. The zero state corresponds to both links hanging down. The moments of inertia, $I_1$ and $I_2$, are taken about the pivots. Our goal is to balance the acrobot at the unstable fixed point $s = [\pi, 0, 0, 0]^T$. To induce this behaviour, we define a reward giving for every step spent further than π/ away from the balance point and with angular speed greater than π/.

Throughout the paper, we use the following notation:

$$c_1 := \cos(q_1), \quad c_2 := \cos(q_2), \quad s_1 := \sin(q_1), \quad s_2 := \sin(q_2), \quad c_{1+2} := \cos(q_1 + q_2), \quad s_{1+2} := \sin(q_1 + q_2).$$

The energy of the acrobot system is defined by:

$$T = T_1 + T_2$$
$$T_1 = \tfrac{1}{2} I_1 \dot q_1^2$$
$$T_2 = \tfrac{1}{2} (m_2 l_1^2 + I_2 + 2 m_2 l_1 l_{c2} c_2)\, \dot q_1^2 + \tfrac{1}{2} I_2 \dot q_2^2 + (I_2 + m_2 l_1 l_{c2} c_2)\, \dot q_1 \dot q_2$$
$$U = -m_1 g l_{c1} c_1 - m_2 g (l_1 c_1 + l_{c2} c_{1+2}),$$
where $T_1$ and $T_2$ are the kinetic energies at the shoulder and elbow joints, respectively, and $U$ is the potential energy of the acrobot. The total energy is given by $H = T + U$. The equations of motion are defined by

$$0 = (I_1 + I_2 + m_2 l_1^2 + 2 m_2 l_1 l_{c2} c_2)\, \ddot q_1 + (I_2 + m_2 l_1 l_{c2} c_2)\, \ddot q_2 - 2 m_2 l_1 l_{c2} s_2 \dot q_1 \dot q_2 - m_2 l_1 l_{c2} s_2 \dot q_2^2 + m_1 g l_{c1} s_1 + m_2 g (l_1 s_1 + l_{c2} s_{1+2})$$
$$\tau = (I_2 + m_2 l_1 l_{c2} c_2)\, \ddot q_1 + I_2 \ddot q_2 + m_2 l_1 l_{c2} s_2 \dot q_1^2 + m_2 g l_{c2} s_{1+2}.$$

III. SYSTEM IDENTIFICATION

The system identification problem can be formulated as an optimization problem. For the class of problems considered in this report, it is possible to formulate a linear least-squares minimization. Given non-linear changes of variables of both the dynamics parameters and the sampled states, the search for the best parameters is formulated as

$$\hat x = \arg\min_x \| A x - b \|_2^2,$$

where $A$ represents a matrix whose rows are non-linear transformations of the sampled states and their derivatives, and $b$ represents the torques applied in those samples.

Below we present three models: the energy model, the power model, and the dynamic model. The dynamic model relies on the equations of motion and requires joint accelerations, which in practice can be very noisy. The energy model uses the energy of the system and does not require knowing joint accelerations. The power model is the differential form of the energy model. Compared to the energy model, the power model does not need to approximate the integral of the damping terms, but it requires knowing the joint accelerations.

A. Energy model

In the energy model, defined in [], we look at the change of total energy as a function of the input torque. It is derived as follows:

$$\int_{t_a}^{t_b} \dot q^T \tau \, dt = H(q, \dot q)(t_b) - H(q, \dot q)(t_a) + \int_{t_a}^{t_b} \dot q^T \tau_f \, dt = h\, x,$$

where $\tau = [0, \tau]^T$ is the input torque to the system and $\tau_f$ is the frictional torque term, defined to be

$$\tau_{f_j} = F_{s_j}\, \mathrm{sign}(\dot q_j) + F_{v_j}\, \dot q_j,$$

where $F_{s_j}$ and $F_{v_j}$ are the Coulomb and viscous friction coefficients, respectively, at joint $j$.
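As a small illustration of the friction term above, the following Python sketch evaluates $\tau_f$ for both joints. The function name `friction_torque` and the numerical values are assumptions made for illustration only; they are not taken from the report.

```python
import numpy as np

def friction_torque(qd, F_s, F_v):
    """Coulomb plus viscous friction: tau_f_j = F_s_j * sign(qd_j) + F_v_j * qd_j."""
    qd = np.asarray(qd, dtype=float)
    return np.asarray(F_s, dtype=float) * np.sign(qd) + np.asarray(F_v, dtype=float) * qd

# Hypothetical joint velocities and friction coefficients (illustration only).
qd = np.array([0.5, -2.0])
F_s = np.array([0.0, 0.0])   # no Coulomb friction in this example
F_v = np.array([0.1, 0.1])   # viscous friction only
tau_f = friction_torque(qd, F_s, F_v)

# Instantaneous power dissipated by friction, qd^T tau_f (non-negative).
dissipated_power = float(qd @ tau_f)
```

With purely viscous friction the dissipated power reduces to a weighted sum of squared joint velocities, which is why the energy model only needs integrals of $\dot q_j^2$ for the damping terms.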
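We see that in the energy-based model, we do not need to calculate the acceleration of the system. For the acrobot, we define $x = [x_1, \ldots, x_7]^T$, where

$$x_1 = I_1 + m_2 l_1^2, \quad x_2 = I_2, \quad x_3 = m_2 l_1 l_{c2}, \quad x_4 = m_1 g l_{c1} + m_2 g l_1, \quad x_5 = m_2 g l_{c2}, \quad x_6 = b_1, \quad x_7 = b_2.$$

We define $h = [h_1, \ldots, h_7]$, where

$$h_1 = \tfrac{1}{2} \dot q_1^2(t_b) - \tfrac{1}{2} \dot q_1^2(t_a)$$
$$h_2 = \tfrac{1}{2} \dot q_1^2(t_b) - \tfrac{1}{2} \dot q_1^2(t_a) + \tfrac{1}{2} \dot q_2^2(t_b) - \tfrac{1}{2} \dot q_2^2(t_a) + \dot q_1(t_b) \dot q_2(t_b) - \dot q_1(t_a) \dot q_2(t_a)$$
$$h_3 = \dot q_1^2(t_b) \cos(q_2(t_b)) - \dot q_1^2(t_a) \cos(q_2(t_a)) + \dot q_1(t_b) \dot q_2(t_b) \cos(q_2(t_b)) - \dot q_1(t_a) \dot q_2(t_a) \cos(q_2(t_a))$$
$$h_4 = \cos(q_1(t_a)) - \cos(q_1(t_b))$$
$$h_5 = \cos(q_1(t_a) + q_2(t_a)) - \cos(q_1(t_b) + q_2(t_b))$$
$$h_6 = \int_{t_a}^{t_b} \dot q_1^2(t)\, dt$$
$$h_7 = \int_{t_a}^{t_b} \dot q_2^2(t)\, dt.$$

We approximate $h_6$ and $h_7$ by using piecewise linear estimates for $\dot q$ over each interval:

$$h_6 \approx (t_b - t_a) \int_0^1 \big( (\dot q_1(t_b) - \dot q_1(t_a))\, \lambda + \dot q_1(t_a) \big)^2 d\lambda$$
$$h_7 \approx (t_b - t_a) \int_0^1 \big( (\dot q_2(t_b) - \dot q_2(t_a))\, \lambda + \dot q_2(t_a) \big)^2 d\lambda,$$

and we define $q(t)$, $\dot q(t)$ to be the position and velocity at time $t$. The left-hand side is defined by

$$\int_{t_a}^{t_b} \dot q^T \tau \, dt = \tau\, (q_2(t_b) - q_2(t_a)).$$

B. Power model

We then explore the power model [1], in which we look at the change of power as a function of the input torque. It is formulated as

$$\dot q^T \tau = \frac{d}{dt} H(q, \dot q) + \dot q^T \tau_f = dh\, x,$$

where, for the acrobot, $\dot q^T \tau = \tau \dot q_2$ and $x$ is defined the same as in the energy model. We then define $dh = [dh_1, \ldots, dh_7]$, where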
$$dh_1 = \dot q_1 \ddot q_1$$
$$dh_2 = \dot q_1 \ddot q_1 + \dot q_2 \ddot q_2 + \ddot q_1 \dot q_2 + \dot q_1 \ddot q_2$$
$$dh_3 = c_2 (2 \dot q_1 \ddot q_1 + \ddot q_1 \dot q_2 + \dot q_1 \ddot q_2) - s_2 \dot q_2 (\dot q_1^2 + \dot q_1 \dot q_2)$$
$$dh_4 = s_1 \dot q_1$$
$$dh_5 = s_{1+2} (\dot q_1 + \dot q_2)$$
$$dh_6 = \dot q_1^2$$
$$dh_7 = \dot q_2^2.$$

C. Dynamic model

A rather different approach is to use the equations of motion directly, instead of the energy of the acrobot system, for system identification [8]. Using the same parameters $x$ and following a derivation similar to the one presented in [], we have

$$0 = (x_1 + x_2)\, \ddot q_1 + x_3 (2 c_2 \ddot q_1 + c_2 \ddot q_2 - s_2 \dot q_2^2 - 2 s_2 \dot q_1 \dot q_2) + x_2 \ddot q_2 + x_4 s_1 + x_5 s_{1+2} + x_6 \dot q_1$$
$$\tau = x_3 (c_2 \ddot q_1 + s_2 \dot q_1^2) + x_2 (\ddot q_1 + \ddot q_2) + x_5 s_{1+2} + x_7 \dot q_2.$$

(In the class presentation of this work, we reported some bad results using the dynamic model. These results were obtained after wrongly collapsing the two equalities into one, making the problem less well defined. Significant improvements were seen after this correction.)

IV. POLICY GRADIENT METHODS

Policy gradient methods are reinforcement learning techniques for optimizing parametrized policies with respect to the expected return. Formally, we assume the policy is parametrized by $\theta \in \mathbb{R}^n$. Given a state $s_k$, the agent needs to choose an action $a_k$. The policy is defined as $a_k \sim \pi_\theta(a_k \mid s_k)$, a distribution over actions conditioned on the state. The goal is to find the parameters $\theta$ that maximize some long-term return. For an episode of length $L$, we define the sequence of states and actions in this episode to be $\tau = [s_{1:L}, a_{1:L}]$. Our objective function is defined as

$$J(\theta) = E\left[ \sum_{k=1}^{L} r_k \right],$$

where $r_k$ is the instantaneous reward the agent receives at time step $k$. Our update rule for the parameters is as follows:

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J \big|_{\theta = \theta_k},$$

where $\alpha$ is the fixed learning rate.

A. Policy gradients with parameter-based exploration (PGPE)

PGPE is a policy gradient method that is capable of optimizing non-differentiable parametrized policies. PGPE samples the parameters before starting each training episode, runs the policy given by the sampled parameters, records the reward, and updates the parameters. The objective is to find the policy parameters that maximize the total reward across all episode histories.
Formally, we have

$$J(\theta) = \int_H p(h \mid \theta)\, R(h)\, dh,$$

where $h$ is any episode history and $R(h)$ is the total reward over the episode history $h$. Estimating this integral relies on sampling histories and averaging the results. However, sampling an action from a stochastic policy at every time step increases the variance of the sampled histories. To reduce this variance, PGPE redefines the policy as

$$\pi_\rho(a_k \mid s_k) = \int_\Theta p(\theta \mid \rho)\, \delta_{F_\theta(s_k), a_k}\, d\theta,$$

where $\rho$ parametrizes the distribution of $\theta$, $F_\theta(s_k)$ is the action determined by the policy with parameters $\theta$ in state $s_k$, and $\delta$ is the Dirac delta function.

To make the gradient more robust, we consider PGPE with symmetric sampling. More specifically, assume $\rho$ consists of a set of means $\{\mu_i\}$ and a set of standard deviations $\{\sigma_i\}$, which determine an independent normal distribution for each parameter in $\theta$. We sample perturbations $\epsilon$ drawn from $N(0, \sigma^2)$ and define the sampled parameters $\theta^+ = \mu + \epsilon$ and $\theta^- = \mu - \epsilon$. We run several episodes with each of the two sampled parameter vectors and obtain cumulative rewards $r^+$ and $r^-$ across the episodes. Our update rule is as follows:

$$\Delta \mu_i = \frac{\alpha\, \epsilon_i (r^+ - r^-)}{2m - r^+ - r^-},$$
$$\Delta \sigma_i = \frac{\alpha}{m - b} \left( \frac{r^+ + r^-}{2} - b \right) \left( \frac{\epsilon_i^2 - \sigma_i^2}{\sigma_i} \right),$$

where $m$ is the best reward so far, and $b$ is a moving baseline initialized to and defined as

$$b_k = \beta\, b_{k-1} + (1 - \beta)\, \frac{r^+ + r^-}{2},$$

with $\beta$ being some step size parameter. Often we make $\sigma_i$ a constant. The resulting algorithm is very similar to finite-difference optimization, with the difference that PGPE explicitly updates a distribution over the directions to evaluate and makes use of a reward-based normalization of the step sizes.

B. Using PGPE for Acrobot Swing-up and Balance

For the acrobot swing-up and balance problem, we first pump energy into the system, using partial feedback linearization on the second joint's acceleration, and then switch to the LQR controller for balancing.
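The symmetric-sampling mean update of Section IV-A can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used for the reported experiments: $\sigma$ is held constant (so the $\sigma$ update and the baseline $b$ are omitted), each evaluation is a single deterministic call rather than an average over episodes, and the names `pgpe_symmetric_mean` and `toy_reward`, the quadratic toy objective and all numerical values are hypothetical.

```python
import numpy as np

def pgpe_symmetric_mean(reward_fn, mu0, sigma, alpha=0.2, n_iters=1000, seed=0):
    """Symmetric-sampling PGPE update of the mean, with sigma kept constant."""
    rng = np.random.default_rng(seed)
    mu = np.array(mu0, dtype=float)
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), mu.shape)
    m = -np.inf  # best reward observed so far
    for _ in range(n_iters):
        eps = rng.normal(0.0, sigma)       # perturbation, one entry per parameter
        r_plus = reward_fn(mu + eps)       # reward for theta+ = mu + eps
        r_minus = reward_fn(mu - eps)      # reward for theta- = mu - eps
        m = max(m, r_plus, r_minus)
        denom = 2.0 * m - r_plus - r_minus # reward-based normalization, always >= 0
        if denom > 1e-12:                  # skip the step if both samples tie with the best
            mu = mu + alpha * eps * (r_plus - r_minus) / denom
    return mu

# Hypothetical toy objective: reward is highest (zero) at `target`.
target = np.array([1.0, -2.0])

def toy_reward(theta):
    return -float(np.sum((theta - target) ** 2))

mu_final = pgpe_symmetric_mean(toy_reward, mu0=[0.0, 0.0], sigma=0.5)
```

Because the best reward $m$ bounds both samples from above, the normalized difference always lies in $[-1, 1]$, so each step is at most $\alpha |\epsilon_i|$ per parameter. For the acrobot, `reward_fn` would instead run the switching policy from one or more start states and return the accumulated reward.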
Formally, in the energy shaping controller, we find the desired energy $E_d$ by calculating the potential energy at the balancing point. We then choose $u = u_p + u_e$, where $u_e = k_3 (E_d - E)\, \dot q_1$, assuming the current total energy in the system is $E$, and $u_p$ is the control input obtained by using partial feedback linearization to ensure $\ddot q_2 = -k_1 q_2 - k_2 \dot q_2$. We use a naive switching scheme in which the LQR controller is activated when its cost-to-go is below a threshold $\phi$. Ideally, the set of states selected by this threshold lies inside the LQR's region of attraction. We parametrize the swing-up and balancing policy with $\theta = [k_1, k_2, k_3, \phi]^T$ in order to apply PGPE to improve the policy.

V. RESULTS

A. System Identification Experiments

To explore the performance of the different methods, we first consider the rate of convergence of the three proposed system identification methods when given perfect samples (from simulation), with the derivatives estimated by symmetric finite differences. The results, seen in Figure 2 for the power model and Figure 3 for the dynamic model, plot the absolute relative deviation, $|\hat x_i - x_i| / |x_i|$, for an estimated parameter $\hat x_i$ with respect to the number of samples used. Each method was given the same samples, generated by picking a random action from U(, ) and applying it for seconds. This way of generating data is far from optimal but was sufficient for our experiments. The simulation of the system was done using ode through the SciPy Python library [3]. For this experiment, we do not show the results for the energy model, as we were unable to get comparable results. For all experiments, we used x = [., , ., .7, 9.8, ., .]^T as parameters, which is equivalent to an acrobot system with m_1 = , m_2 = , l_1 = , l_2 = , b_1 = ., b_2 = ., g = 9.8.

These results indicate that, in our setting, the dynamic model significantly outperforms the power model. This was not expected, as the power model was designed to be better behaved than the dynamic model. The scale for each parameter was fixed across the methods to facilitate comparisons.
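To make the least-squares fit concrete, the sketch below builds the dynamic-model regressor of Section III-C and recovers the parameters from noiseless samples with exact accelerations. It is a simplified illustration, not the experimental setup described above: states and torques are drawn independently at random instead of being taken from simulated trajectories with finite differences, and the function names and all numerical values (including `x_true`) are hypothetical.

```python
import numpy as np

def regressor(q, qd, qdd):
    """Two rows of A for one sample, so that A @ x = [0, tau] (Section III-C)."""
    q1, q2 = q
    qd1, qd2 = qd
    qdd1, qdd2 = qdd
    c2, s2 = np.cos(q2), np.sin(q2)
    s1, s12 = np.sin(q1), np.sin(q1 + q2)
    row1 = [qdd1, qdd1 + qdd2,
            2 * c2 * qdd1 + c2 * qdd2 - s2 * qd2 ** 2 - 2 * s2 * qd1 * qd2,
            s1, s12, qd1, 0.0]
    row2 = [0.0, qdd1 + qdd2, c2 * qdd1 + s2 * qd1 ** 2, 0.0, s12, 0.0, qd2]
    return np.array([row1, row2])

def forward_dynamics(q, qd, tau, x):
    """Solve for qdd, using the fact that regressor(...) @ x is affine in qdd."""
    bias = regressor(q, qd, (0.0, 0.0)) @ x
    M = np.column_stack([regressor(q, qd, (1.0, 0.0)) @ x - bias,
                         regressor(q, qd, (0.0, 1.0)) @ x - bias])
    return np.linalg.solve(M, np.array([0.0, tau]) - bias)

# Hypothetical "true" parameters x_1..x_7 (illustration only).
x_true = np.array([1.33, 1.33, 1.0, 14.7, 9.8, 0.1, 0.1])

rng = np.random.default_rng(0)
A_rows, b_rows = [], []
for _ in range(50):
    q = rng.uniform(-np.pi, np.pi, size=2)
    qd = rng.uniform(-2.0, 2.0, size=2)
    tau = rng.uniform(-5.0, 5.0)
    qdd = forward_dynamics(q, qd, tau, x_true)  # exact accelerations
    A_rows.append(regressor(q, qd, qdd))
    b_rows.append([0.0, tau])

A = np.vstack(A_rows)
b = np.concatenate(b_rows)
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
rel_dev = np.abs(x_hat - x_true) / np.abs(x_true)  # absolute relative deviation
```

In this idealized setting the fit is essentially exact; replacing `qdd` by finite-difference estimates, or adding measurement noise to the states, is what separates the methods in the experiments.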
The parameters representing the damping coefficients, $x_6$ and $x_7$, appear to be quite hard to fit. The power model was unable to converge to a reasonable value with the given data, while the dynamic model was unable to fit the damping term on the second joint. There seems to be an inherent difficulty in fitting these parameters, whose effect, given the poor quality of the data we are using, might be explained away through other parameters. (In the case where the true derivatives are known, both the dynamic model and the power model converge in very few samples.)

It is important to note that our simulation setup lacks many real-world difficulties, such as noisy measurements. In order to further explore this issue, we set up a slightly altered version of the experiment in which a small amount of noise, sampled from N(, ), is injected into the samples. All methods were given the same perturbed data. Even with this small amount of noise, we can see in Figure 6 that the dynamic model's performance is significantly lower than in the noiseless case (note the scale change from the previous figures). This noise has a much smaller effect on the performance of both the power model and the energy model, whose results are plotted in Figure 4 and Figure 5, respectively. This result supports claims that the dynamic method is vulnerable to high-frequency noise while the power and energy methods are much more tolerant []. On real-world data, we could have improved the performance of all three methods by applying a low-pass filter to reduce the effect of noise on the finite-difference approximations, but this was not considered in our setting.

B. PGPE Results

In order to study the whole pipeline, we briefly experiment with PGPE and the energy shaping policy with LQR stabilization seen in class. We want to optimize the gains and the LQR cutoff in order to obtain the best swing-up times from a variety of start positions around $q_1, q_2 = 0$ at rest.
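The switching rule being tuned here (Section IV-B) can be sketched as follows. This is only an illustration of the naive cost-to-go threshold: the LQR gain `K`, the cost-to-go matrix `S`, the threshold value and the `swing_up` callable are placeholders supplied by the caller, and the values in the example are hypothetical rather than the result of solving an LQR problem for the acrobot.

```python
import numpy as np

def switching_policy(s, s_goal, K, S, phi, swing_up):
    """Use LQR when its quadratic cost-to-go is below phi, else the swing-up law."""
    err = np.asarray(s, dtype=float) - np.asarray(s_goal, dtype=float)
    # Wrap the two joint-angle errors to [-pi, pi).
    err[:2] = (err[:2] + np.pi) % (2.0 * np.pi) - np.pi
    cost_to_go = float(err @ S @ err)
    if cost_to_go < phi:
        return float(-np.ravel(K) @ err)   # LQR feedback u = -K (s - s_goal)
    return float(swing_up(s))              # energy shaping + PFL, supplied by the caller

# Hypothetical placeholders (not an actual LQR solution).
s_goal = np.array([np.pi, 0.0, 0.0, 0.0])
K = np.array([10.0, 0.0, 0.0, 0.0])
S = np.eye(4)
phi = 1.0

def constant_swing_up(s):
    return 2.5  # stand-in for the energy shaping controller

u_near = switching_policy([np.pi + 0.1, 0.0, 0.0, 0.0], s_goal, K, S, phi, constant_swing_up)
u_far = switching_policy([0.0, 0.0, 0.0, 0.0], s_goal, K, S, phi, constant_swing_up)
```

In this parametrization, PGPE would adjust `phi` together with the three swing-up gains, so the threshold itself becomes one of the optimized parameters.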
We have two sets of experiments. The first, seen in Figure 7, was reported during the class presentation. It converged to a solution capable of swinging up in under seconds, but we later realized that every evaluation step for a given trial would always start from the same position. This caused the algorithm to overfit and converge to parameters that give exactly the right behaviour for a fixed start position. We believe this to be a consequence of the naive LQR switch condition. When the LQR is active outside its region of attraction, it hinders the system, after which the energy shaping controller is required to correct the system. Our activation condition is too coarse to reliably activate the LQR only in its region of attraction. With a fixed start point, the optimization can ensure that the system will only activate the LQR once it is in the region of attraction. However, this might only hold true for the trajectory induced by that specific start point.

Figure 8 shows the PGPE optimization on true random restarts, which converges to a little under seconds. In order to achieve better performance in this setting, more sophisticated methods must be used to approximate the region of attraction of the LQR, such as the sums-of-squares optimization formulation seen in class.

VI. DISCUSSION

Even with our simple experimental set-up, we can notice a sharp decrease in the performance of the dynamic model once noise is injected. We are confident that the power model is a better candidate for system identification in our setting, though different noise models (e.g., approximated torques, uniform distribution, finite difference for $\dot q$) and their corresponding effect on each method would need to be studied to understand whether this is always the case. Real-world domains carry many difficulties which are not captured by our simulation and which potentially require extra processing in order to obtain good performance. This pre-processing could benefit the different methods unevenly and would need to be used in future experiments to ensure that the comparisons between methods are fair. Similarly, the data we generated for the system identification experiments led to poorly conditioned matrices. Exploring the existing work on exciting trajectories would certainly improve the performance of all three methods [6].

The policy gradient methods seem quite promising. They allow non-trivial parametrizations of policies, letting the designer leverage previous work and specific domain knowledge. The PGPE algorithm gives us good performance in the fixed start point setting but takes considerably more computational power in the random restart case (by requiring many more evaluations or iterations). As mentioned earlier, this could have been improved by using a proper region of attraction algorithm. Alternatively, a parametrized function for the LQR activation condition, whose parameters are initialized from the current cost-to-go condition and then optimized, might be able to find a larger region of attraction (depending on the parametrization).

We have found the episodic nature of PGPE to be detrimental to what we had set out to achieve. In the future, we will consider incremental policy gradient methods in order to be able to formulate an incremental system identification and policy improvement method.

Fig. 2. The deviation of the fitted parameter when using the power model with no input noise. The error bars represent the .9 confidence interval and are averaged over independent runs.

Fig. 3. The deviation of the fitted parameter when using the dynamic model with no input noise. The error bars represent the .9 confidence interval and are averaged over independent runs.

Fig. 4. The deviation of the fitted parameter when using the power model with input noise drawn from N(, ). The error bars represent the .9 confidence interval and are averaged over independent runs.

Fig. 5. The deviation of the fitted parameter when using the energy model with input noise drawn from N(, ). The error bars represent the .9 confidence interval and are averaged over independent runs.

Fig. 6. The deviation of the fitted parameter when using the dynamic model with input noise drawn from N(, ). The error bars represent the .9 confidence interval and are averaged over independent runs.

Fig. 7. The average performance of the energy shaping policy plotted against the number of improvement steps with PGPE (axes: average reward against number of steps). Every run was given a fixed start point. The error bars represent the .9 confidence interval and are averaged over independent runs. The LQR costs were defined as Q = diag([,, 7, 7]) and R = I.

Fig. 8. The average performance of the energy shaping policy plotted against the number of improvement steps with PGPE (axes: average reward against number of steps). Every evaluation is given a start point with q within π/ of the rest position. The evaluation of a policy is averaged over random restarts. The error bars represent the .9 confidence interval and are averaged over independent runs. Converged parameters hovered around k =, k = 9, k =, φ = 8. The LQR costs were defined as Q = diag([,, 7, 7]) and R = I.

REFERENCES

[1] Maxime Gautier. Dynamic identification of robots with power model. In Robotics and Automation, 1997. Proceedings., 1997 IEEE International Conference on. IEEE, 1997.
[2] Luke Johnson. Adaptive Swing-up and Balancing Control of Acrobot Systems. Bachelor's thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2009.
[3] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001-. [Online; accessed --7].
[4] W. Khalil and E. Dombre. Modeling, identification & control of robots. London: HPS, 2002.
[5] Richard M. Murray and John Edmond Hauser. A case study in approximate linearization: The acrobot example. Electronics Research Laboratory, College of Engineering, University of California, 1991.
[6] C. Presse and Maxime Gautier. New criteria of exciting trajectories for robot identification. In Robotics and Automation, 1993. Proceedings., 1993 IEEE International Conference on. IEEE, 1993.
[7] Frank Sehnke, Christian Osendorfer, Thomas Rückstieß, Alex Graves, Jan Peters, and Jürgen Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551-559, 2010.
[8] Russ Tedrake. Underactuated Robotics: Algorithms for Walking, Running, Swimming, Flying, and Manipulation (Course Notes for MIT 6.832). Downloaded in Fall, from
More informationON MODEL SELECTION FOR STATE ESTIMATION FOR NONLINEAR SYSTEMS. Robert Bos,1 Xavier Bombois Paul M. J. Van den Hof
ON MODEL SELECTION FOR STATE ESTIMATION FOR NONLINEAR SYSTEMS Robert Bos,1 Xavier Bombois Paul M. J. Van den Hof Delft Center for Systems and Control, Delft University of Technology, Mekelweg 2, 2628 CD
More informationq HYBRID CONTROL FOR BALANCE 0.5 Position: q (radian) q Time: t (seconds) q1 err (radian)
Hybrid Control for the Pendubot Mingjun Zhang and Tzyh-Jong Tarn Department of Systems Science and Mathematics Washington University in St. Louis, MO, USA mjz@zach.wustl.edu and tarn@wurobot.wustl.edu
More informationavailable online at CONTROL OF THE DOUBLE INVERTED PENDULUM ON A CART USING THE NATURAL MOTION
Acta Polytechnica 3(6):883 889 3 Czech Technical University in Prague 3 doi:.43/ap.3.3.883 available online at http://ojs.cvut.cz/ojs/index.php/ap CONTROL OF THE DOUBLE INVERTED PENDULUM ON A CART USING
More informationSwinging-Up and Stabilization Control Based on Natural Frequency for Pendulum Systems
9 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June -, 9 FrC. Swinging-Up and Stabilization Control Based on Natural Frequency for Pendulum Systems Noriko Matsuda, Masaki Izutsu,
More informationTutorial on Policy Gradient Methods. Jan Peters
Tutorial on Policy Gradient Methods Jan Peters Outline 1. Reinforcement Learning 2. Finite Difference vs Likelihood-Ratio Policy Gradients 3. Likelihood-Ratio Policy Gradients 4. Conclusion General Setup
More informationAn experimental robot load identification method for industrial application
An experimental robot load identification method for industrial application Jan Swevers 1, Birgit Naumer 2, Stefan Pieters 2, Erika Biber 2, Walter Verdonck 1, and Joris De Schutter 1 1 Katholieke Universiteit
More information6.867 Machine learning
6.867 Machine learning Mid-term eam October 8, 6 ( points) Your name and MIT ID: .5.5 y.5 y.5 a).5.5 b).5.5.5.5 y.5 y.5 c).5.5 d).5.5 Figure : Plots of linear regression results with different types of
More informationReplacing eligibility trace for action-value learning with function approximation
Replacing eligibility trace for action-value learning with function approximation Kary FRÄMLING Helsinki University of Technology PL 5500, FI-02015 TKK - Finland Abstract. The eligibility trace is one
More informationModelling and Control of Nonlinear Systems using Gaussian Processes with Partial Model Information
5st IEEE Conference on Decision and Control December 0-3, 202 Maui, Hawaii, USA Modelling and Control of Nonlinear Systems using Gaussian Processes with Partial Model Information Joseph Hall, Carl Rasmussen
More informationOn the robustness of a one-period look-ahead policy in multi-armed bandit problems
Procedia Computer Science Procedia Computer Science 00 (2010) 1 10 On the robustness of a one-period look-ahead policy in multi-armed bandit problems Ilya O. Ryzhov a, Peter Frazier b, Warren B. Powell
More informationApplication of Neural Networks for Control of Inverted Pendulum
Application of Neural Networks for Control of Inverted Pendulum VALERI MLADENOV Department of Theoretical Electrical Engineering Technical University of Sofia Sofia, Kliment Ohridski blvd. 8; BULARIA valerim@tu-sofia.bg
More informationCentral Limit Theorem and the Law of Large Numbers Class 6, Jeremy Orloff and Jonathan Bloom
Central Limit Theorem and the Law of Large Numbers Class 6, 8.5 Jeremy Orloff and Jonathan Bloom Learning Goals. Understand the statement of the law of large numbers. 2. Understand the statement of the
More informationTrust Region Policy Optimization
Trust Region Policy Optimization Yixin Lin Duke University yixin.lin@duke.edu March 28, 2017 Yixin Lin (Duke) TRPO March 28, 2017 1 / 21 Overview 1 Preliminaries Markov Decision Processes Policy iteration
More informationebay/google short course: Problem set 2
18 Jan 013 ebay/google short course: Problem set 1. (the Echange Parado) You are playing the following game against an opponent, with a referee also taking part. The referee has two envelopes (numbered
More informationIn: Proc. BENELEARN-98, 8th Belgian-Dutch Conference on Machine Learning, pp 9-46, 998 Linear Quadratic Regulation using Reinforcement Learning Stephan ten Hagen? and Ben Krose Department of Mathematics,
More informationActor-critic methods. Dialogue Systems Group, Cambridge University Engineering Department. February 21, 2017
Actor-critic methods Milica Gašić Dialogue Systems Group, Cambridge University Engineering Department February 21, 2017 1 / 21 In this lecture... The actor-critic architecture Least-Squares Policy Iteration
More informationMEM04: Rotary Inverted Pendulum
MEM4: Rotary Inverted Pendulum Interdisciplinary Automatic Controls Laboratory - ME/ECE/CHE 389 April 8, 7 Contents Overview. Configure ELVIS and DC Motor................................ Goals..............................................3
More informationMatlab-Based Tools for Analysis and Control of Inverted Pendula Systems
Matlab-Based Tools for Analysis and Control of Inverted Pendula Systems Slávka Jadlovská, Ján Sarnovský Dept. of Cybernetics and Artificial Intelligence, FEI TU of Košice, Slovak Republic sjadlovska@gmail.com,
More informationMachine Learning I Reinforcement Learning
Machine Learning I Reinforcement Learning Thomas Rückstieß Technische Universität München December 17/18, 2009 Literature Book: Reinforcement Learning: An Introduction Sutton & Barto (free online version:
More informationDYNAMICS OF SERIAL ROBOTIC MANIPULATORS
DYNAMICS OF SERIAL ROBOTIC MANIPULATORS NOMENCLATURE AND BASIC DEFINITION We consider here a mechanical system composed of r rigid bodies and denote: M i 6x6 inertia dyads of the ith body. Wi 6 x 6 angular-velocity
More informationA parametric approach to Bayesian optimization with pairwise comparisons
A parametric approach to Bayesian optimization with pairwise comparisons Marco Co Eindhoven University of Technology m.g.h.co@tue.nl Bert de Vries Eindhoven University of Technology and GN Hearing bdevries@ieee.org
More informationMomentum-centric whole-body control and kino-dynamic motion generation for floating-base robots
Momentum-centric whole-body control and kino-dynamic motion generation for floating-base robots Alexander Herzog The Movement Generation and Control Group (Ludovic Righetti) Conflicting tasks & constraints
More informationPartially Observable Markov Decision Processes (POMDPs)
Partially Observable Markov Decision Processes (POMDPs) Sachin Patil Guest Lecture: CS287 Advanced Robotics Slides adapted from Pieter Abbeel, Alex Lee Outline Introduction to POMDPs Locally Optimal Solutions
More informationCONTROL OF ROBOT CAMERA SYSTEM WITH ACTUATOR S DYNAMICS TO TRACK MOVING OBJECT
Journal of Computer Science and Cybernetics, V.31, N.3 (2015), 255 265 DOI: 10.15625/1813-9663/31/3/6127 CONTROL OF ROBOT CAMERA SYSTEM WITH ACTUATOR S DYNAMICS TO TRACK MOVING OBJECT NGUYEN TIEN KIEM
More informationExponential Controller for Robot Manipulators
Exponential Controller for Robot Manipulators Fernando Reyes Benemérita Universidad Autónoma de Puebla Grupo de Robótica de la Facultad de Ciencias de la Electrónica Apartado Postal 542, Puebla 7200, México
More informationMulti-task Learning with Gaussian Processes, with Applications to Robot Inverse Dynamics
1 / 38 Multi-task Learning with Gaussian Processes, with Applications to Robot Inverse Dynamics Chris Williams with Kian Ming A. Chai, Stefan Klanke, Sethu Vijayakumar December 2009 Motivation 2 / 38 Examples
More informationDerivation and Application of a Conserved Orbital Energy for the Inverted Pendulum Bipedal Walking Model
Derivation and Application of a Conserved Orbital Energy for the Inverted Pendulum Bipedal Walking Model Jerry E. Pratt and Sergey V. Drakunov Abstract We present an analysis of a point mass, point foot,
More informationGaussian Process for Internal Model Control
Gaussian Process for Internal Model Control Gregor Gregorčič and Gordon Lightbody Department of Electrical Engineering University College Cork IRELAND E mail: gregorg@rennesuccie Abstract To improve transparency
More informationReinforcement Learning and NLP
1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value
More informationBalancing of an Inverted Pendulum with a SCARA Robot
Balancing of an Inverted Pendulum with a SCARA Robot Bernhard Sprenger, Ladislav Kucera, and Safer Mourad Swiss Federal Institute of Technology Zurich (ETHZ Institute of Robotics 89 Zurich, Switzerland
More informationLecture 6: CS395T Numerical Optimization for Graphics and AI Line Search Applications
Lecture 6: CS395T Numerical Optimization for Graphics and AI Line Search Applications Qixing Huang The University of Texas at Austin huangqx@cs.utexas.edu 1 Disclaimer This note is adapted from Section
More informationDevelopment of a Deep Recurrent Neural Network Controller for Flight Applications
Development of a Deep Recurrent Neural Network Controller for Flight Applications American Control Conference (ACC) May 26, 2017 Scott A. Nivison Pramod P. Khargonekar Department of Electrical and Computer
More informationReinforcement Learning in Non-Stationary Continuous Time and Space Scenarios
Reinforcement Learning in Non-Stationary Continuous Time and Space Scenarios Eduardo W. Basso 1, Paulo M. Engel 1 1 Instituto de Informática Universidade Federal do Rio Grande do Sul (UFRGS) Caixa Postal
More informationReverse Order Swing-up Control of Serial Double Inverted Pendulums
Reverse Order Swing-up Control of Serial Double Inverted Pendulums T.Henmi, M.Deng, A.Inoue, N.Ueki and Y.Hirashima Okayama University, 3-1-1, Tsushima-Naka, Okayama, Japan inoue@suri.sys.okayama-u.ac.jp
More informationReinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina
Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it
More informationStatistical Techniques in Robotics (16-831, F12) Lecture#21 (Monday November 12) Gaussian Processes
Statistical Techniques in Robotics (16-831, F12) Lecture#21 (Monday November 12) Gaussian Processes Lecturer: Drew Bagnell Scribe: Venkatraman Narayanan 1, M. Koval and P. Parashar 1 Applications of Gaussian
More informationIntroduction of Reinforcement Learning
Introduction of Reinforcement Learning Deep Reinforcement Learning Reference Textbook: Reinforcement Learning: An Introduction http://incompleteideas.net/sutton/book/the-book.html Lectures of David Silver
More informationMultibody simulation
Multibody simulation Dynamics of a multibody system (Euler-Lagrange formulation) Dimitar Dimitrov Örebro University June 16, 2012 Main points covered Euler-Lagrange formulation manipulator inertia matrix
More informationOnline Learning in High Dimensions. LWPR and it s application
Lecture 9 LWPR Online Learning in High Dimensions Contents: LWPR and it s application Sethu Vijayakumar, Aaron D'Souza and Stefan Schaal, Incremental Online Learning in High Dimensions, Neural Computation,
More informationELEC4631 s Lecture 2: Dynamic Control Systems 7 March Overview of dynamic control systems
ELEC4631 s Lecture 2: Dynamic Control Systems 7 March 2011 Overview of dynamic control systems Goals of Controller design Autonomous dynamic systems Linear Multi-input multi-output (MIMO) systems Bat flight
More informationADAPTIVE NEURAL NETWORK CONTROL OF MECHATRONICS OBJECTS
acta mechanica et automatica, vol.2 no.4 (28) ADAPIE NEURAL NEWORK CONROL OF MECHARONICS OBJECS Egor NEMSE *, Yuri ZHUKO * * Baltic State echnical University oenmeh, 985, St. Petersburg, Krasnoarmeyskaya,
More informationI D I A P. Online Policy Adaptation for Ensemble Classifiers R E S E A R C H R E P O R T. Samy Bengio b. Christos Dimitrakakis a IDIAP RR 03-69
R E S E A R C H R E P O R T Online Policy Adaptation for Ensemble Classifiers Christos Dimitrakakis a IDIAP RR 03-69 Samy Bengio b I D I A P December 2003 D a l l e M o l l e I n s t i t u t e for Perceptual
More informationReinforcement Learning with Reference Tracking Control in Continuous State Spaces
Reinforcement Learning with Reference Tracking Control in Continuous State Spaces Joseph Hall, Carl Edward Rasmussen and Jan Maciejowski Abstract The contribution described in this paper is an algorithm
More informationReinforcement Learning In Continuous Time and Space
Reinforcement Learning In Continuous Time and Space presentation of paper by Kenji Doya Leszek Rybicki lrybicki@mat.umk.pl 18.07.2008 Leszek Rybicki lrybicki@mat.umk.pl Reinforcement Learning In Continuous
More informationChaotic motion. Phys 750 Lecture 9
Chaotic motion Phys 750 Lecture 9 Finite-difference equations Finite difference equation approximates a differential equation as an iterative map (x n+1,v n+1 )=M[(x n,v n )] Evolution from time t =0to
More informationLearning Control Under Uncertainty: A Probabilistic Value-Iteration Approach
Learning Control Under Uncertainty: A Probabilistic Value-Iteration Approach B. Bischoff 1, D. Nguyen-Tuong 1,H.Markert 1 anda.knoll 2 1- Robert Bosch GmbH - Corporate Research Robert-Bosch-Str. 2, 71701
More informationNoise-Blind Image Deblurring Supplementary Material
Noise-Blind Image Deblurring Supplementary Material Meiguang Jin University of Bern Switzerland Stefan Roth TU Darmstadt Germany Paolo Favaro University of Bern Switzerland A. Upper and Lower Bounds Our
More informationControl Design along Trajectories with Sums of Squares Programming
Control Design along Trajectories with Sums of Squares Programming Anirudha Majumdar 1, Amir Ali Ahmadi 2, and Russ Tedrake 1 Abstract Motivated by the need for formal guarantees on the stability and safety
More informationChaotic motion. Phys 420/580 Lecture 10
Chaotic motion Phys 420/580 Lecture 10 Finite-difference equations Finite difference equation approximates a differential equation as an iterative map (x n+1,v n+1 )=M[(x n,v n )] Evolution from time t
More informationRobust Controller Design for Speed Control of an Indirect Field Oriented Induction Machine Drive
Leonardo Electronic Journal of Practices and Technologies ISSN 1583-1078 Issue 6, January-June 2005 p. 1-16 Robust Controller Design for Speed Control of an Indirect Field Oriented Induction Machine Drive
More informationLearning Gaussian Process Models from Uncertain Data
Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada
More informationLaboratory Exercise 1 DC servo
Laboratory Exercise DC servo Per-Olof Källén ø 0,8 POWER SAT. OVL.RESET POS.RESET Moment Reference ø 0,5 ø 0,5 ø 0,5 ø 0,65 ø 0,65 Int ø 0,8 ø 0,8 Σ k Js + d ø 0,8 s ø 0 8 Off Off ø 0,8 Ext. Int. + x0,
More informationQ-Learning in Continuous State-Action Space with Noisy and Redundant Inputs by Using a Selective Desensitization Neural Network
Q-Learning Using SDNN for Noisy and Redundant Inputs Paper: Q-Learning in Continuous State-Action Space with Noisy and Redundant Inputs by Using a Selective Desensitization Neural Network Takaaki Kobayashi,
More informationOn-line Learning of Robot Arm Impedance Using Neural Networks
On-line Learning of Robot Arm Impedance Using Neural Networks Yoshiyuki Tanaka Graduate School of Engineering, Hiroshima University, Higashi-hiroshima, 739-857, JAPAN Email: ytanaka@bsys.hiroshima-u.ac.jp
More informationCS Deep Reinforcement Learning HW2: Policy Gradients due September 19th 2018, 11:59 pm
CS294-112 Deep Reinforcement Learning HW2: Policy Gradients due September 19th 2018, 11:59 pm 1 Introduction The goal of this assignment is to experiment with policy gradient and its variants, including
More information