Identifying and Solving an Unknown Acrobot System


Clement Gehring and Xinkun Nie

Abstract — Identifying dynamical parameters is a challenging and important part of solving dynamical systems. For this project, we set out to solve an acrobot system with unknown parameters. In the process, we explored three different methods for system identification: the energy model, the power model, and the dynamic model. In order to improve policies on the identified system, we explore the Policy Gradients with Parameter-based Exploration algorithm. We demonstrate that all four methods can be applied to their corresponding subproblem with varying degrees of success.

I. INTRODUCTION

Solving a real-world dynamical system poses several significant challenges. Even with perfect knowledge, underactuated dynamical systems represent difficult optimization problems which are only fully solvable in low-dimensional cases. In order to facilitate the search for good policies, we can consider leveraging partial knowledge of the domain. This can be done by obtaining a good model from data. For many problems, we can derive the dynamics analytically with standard kinematics, leaving only a discrete set of unknown parameters of the system (e.g., masses, inertias, geometry). We limit our work to this kind of problem, as it constitutes an already large and interesting class of problems.

In order to better understand the process of solving a real-world dynamical system, we consider the full pipeline from system identification to policy optimization. In this report, we define a solution to a dynamical system as a precomputed controller (or policy) mapping states to actions for all reachable states. We focus on conventional system identification for unknown dynamical parameters and on policy gradient methods. To better understand the methods studied, we apply them to the well-known acrobot system, which we present in Section II. In Section III, we introduce the system identification framework and the three methods considered: the energy model, the power model, and the dynamic model. We derive their exact formulations when applied to the acrobot. In Section IV, we present the Policy Gradients with Parameter-based Exploration (PGPE) algorithm [7], a policy gradient approach capable of optimizing non-differentiable policies. We introduce the energy-shaping policy with LQR stabilization and partial feedback linearization, and its corresponding parametrization. In Section V, we offer a series of empirical results comparing the different system identification methods and illustrating the performance of PGPE on our chosen policy. Finally, we conclude in Section VI with a discussion of our results and future work.

II. BACKGROUND

A. The Acrobot

The acrobot is an underactuated two-link robot with a single actuator at the elbow, as seen in Figure 1. The most common task for the acrobot is to swing up and balance [2], [8].

Fig. 1. The Acrobot [8]

Figure 1 shows the parameters we use for our analysis. We define q_1 and q_2 to be the shoulder joint angle and the elbow joint angle, respectively. We define q = [q_1, q_2]^T and s = [q, q̇]^T. The zero state corresponds to both links hanging down. The moments of inertia, I_1 and I_2, are taken about the pivots. Our goal is to balance the acrobot at the unstable fixed point s = [π, 0, 0, 0]^T. To induce this behaviour, we define a reward that penalizes every step spent further than a fixed fraction of π away from the balance point and with angular speed above a fixed threshold.

In our analysis throughout the paper, we use the following notation:

c_1 := cos(q_1), c_2 := cos(q_2), s_1 := sin(q_1), s_2 := sin(q_2), c_12 := cos(q_1 + q_2), s_12 := sin(q_1 + q_2).

The energy of the acrobot system is defined by

T = T_1 + T_2
T_1 = ½ I_1 q̇_1^2
T_2 = ½ (m_2 l_1^2 + I_2 + 2 m_2 l_1 l_c2 c_2) q̇_1^2 + ½ I_2 q̇_2^2 + (I_2 + m_2 l_1 l_c2 c_2) q̇_1 q̇_2
U = −m_1 g l_c1 c_1 − m_2 g (l_1 c_1 + l_c2 c_12),

where T_1 and T_2 are the kinetic energies at the shoulder and elbow joints, respectively, and U is the potential energy of the acrobot. The total energy is given by H = T + U.
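
For concreteness, the total energy H = T + U can be evaluated directly from a state sample; the snippet below is a minimal sketch of such a computation, with parameter names mirroring the notation above.

```python
import numpy as np

def acrobot_energy(q1, q2, q1d, q2d, m1, m2, l1, lc1, lc2, I1, I2, g=9.8):
    """Total energy H = T + U of the acrobot, using the expressions above.

    Moments of inertia are taken about the pivots, and the zero state
    corresponds to both links hanging down.
    """
    c1, c2, c12 = np.cos(q1), np.cos(q2), np.cos(q1 + q2)
    T1 = 0.5 * I1 * q1d**2
    T2 = (0.5 * (m2 * l1**2 + I2 + 2 * m2 * l1 * lc2 * c2) * q1d**2
          + 0.5 * I2 * q2d**2
          + (I2 + m2 * l1 * lc2 * c2) * q1d * q2d)
    U = -m1 * g * lc1 * c1 - m2 * g * (l1 * c1 + lc2 * c12)
    return T1 + T2 + U
```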

The equations of motion are

0 = (I_1 + I_2 + m_2 l_1^2 + 2 m_2 l_1 l_c2 c_2) q̈_1 + (I_2 + m_2 l_1 l_c2 c_2) q̈_2 − 2 m_2 l_1 l_c2 s_2 q̇_1 q̇_2 − m_2 l_1 l_c2 s_2 q̇_2^2 + m_1 g l_c1 s_1 + m_2 g (l_1 s_1 + l_c2 s_12)
τ = (I_2 + m_2 l_1 l_c2 c_2) q̈_1 + I_2 q̈_2 + m_2 l_1 l_c2 s_2 q̇_1^2 + m_2 g l_c2 s_12.

III. SYSTEM IDENTIFICATION

The system identification problem can be formulated as an optimization problem. For the class of problems considered in this report, it is possible to formulate it as a linear least-squares minimization. Given non-linear changes of variables of both the dynamics parameters and the sampled states, the search for the best parameters is formulated as

θ* = arg min_θ ‖Aθ − b‖,

where A represents a matrix whose rows are non-linear transformations of the sampled states and their derivatives, and b represents the sampled torques applied in those samples.

Below we present three models: the energy model, the power model, and the dynamic model. The dynamic model relies on the equations of motion and requires joint accelerations, which in practice can be very noisy. The energy model uses the energy of the system and does not require knowing the joint accelerations. The power model is the differential form of the energy model. Compared to the energy model, the power model does not need to approximate the integrals associated with the damping coefficients, but it does require knowing the joint accelerations.

A. Energy model

In the energy model, defined in [4], we look at the change of total energy as a function of the input torque. It is derived as follows:

∫_{t_a}^{t_b} q̇^T τ dt = H(q, q̇)(t_b) − H(q, q̇)(t_a) + ∫_{t_a}^{t_b} q̇^T τ_f dt = θ^T h,

where τ = [0, τ]^T is the input torque to the system and τ_f is the frictional torque, defined by τ_fj = F_sj sign(q̇_j) + F_vj q̇_j, with F_sj and F_vj the Coulomb and viscous friction coefficients at joint j. We see that in the energy-based model we do not need to compute the accelerations of the system. For the acrobot, we define θ = [θ_1, ..., θ_7]^T, where

θ_1 = I_1 + m_2 l_1^2
θ_2 = I_2
θ_3 = m_2 l_1 l_c2
θ_4 = m_1 g l_c1 + m_2 g l_1
θ_5 = m_2 g l_c2
θ_6 = b_1
θ_7 = b_2.

We define h = [h_1, ..., h_7]^T, where

h_1 = ½ q̇_1^2(t_b) − ½ q̇_1^2(t_a)
h_2 = ½ q̇_1^2(t_b) − ½ q̇_1^2(t_a) + ½ q̇_2^2(t_b) − ½ q̇_2^2(t_a) + q̇_1(t_b) q̇_2(t_b) − q̇_1(t_a) q̇_2(t_a)
h_3 = q̇_1^2(t_b) cos(q_2(t_b)) − q̇_1^2(t_a) cos(q_2(t_a)) + q̇_1(t_b) q̇_2(t_b) cos(q_2(t_b)) − q̇_1(t_a) q̇_2(t_a) cos(q_2(t_a))
h_4 = cos(q_1(t_a)) − cos(q_1(t_b))
h_5 = cos(q_1(t_a) + q_2(t_a)) − cos(q_1(t_b) + q_2(t_b))
h_6 = ∫_{t_a}^{t_b} q̇_1^2(t) dt
h_7 = ∫_{t_a}^{t_b} q̇_2^2(t) dt,

where q(t_a), q̇(t_a) and q(t_b), q̇(t_b) denote the positions and velocities at the start and end of a sampling interval. We approximate h_6 and h_7 by using piecewise-linear estimates of the velocities over the interval, e.g.,

h_6 ≈ ∫_{t_a}^{t_b} ( q̇_1(t_a) + (t − t_a)/(t_b − t_a) (q̇_1(t_b) − q̇_1(t_a)) )^2 dt,

and similarly for h_7, which can be evaluated in closed form. The left-hand side is given by

∫_{t_a}^{t_b} q̇^T τ dt = τ (q_2(t_b) − q_2(t_a)),

since only the elbow joint is actuated and the applied torque is held constant over the sampling interval.
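
To make the fitting procedure concrete, the following sketch assembles the energy-model regressor rows and solves the resulting linear least-squares problem, assuming a constant torque over each sampling interval and the regressor entries as written above; the sample format and helper names are illustrative only.

```python
import numpy as np

def energy_model_row(qa, qda, qb, qdb, dt):
    """Build one regressor row h (length 7) for the energy model.

    (qa, qda) and (qb, qdb) are joint positions and velocities at the start
    and end of a sampling interval of length dt. The damping integrals h6, h7
    use the piecewise-linear velocity approximation described above.
    """
    h = np.empty(7)
    h[0] = 0.5 * (qdb[0]**2 - qda[0]**2)
    h[1] = (0.5 * (qdb[0]**2 - qda[0]**2) + 0.5 * (qdb[1]**2 - qda[1]**2)
            + qdb[0] * qdb[1] - qda[0] * qda[1])
    h[2] = (qdb[0]**2 * np.cos(qb[1]) - qda[0]**2 * np.cos(qa[1])
            + qdb[0] * qdb[1] * np.cos(qb[1]) - qda[0] * qda[1] * np.cos(qa[1]))
    h[3] = np.cos(qa[0]) - np.cos(qb[0])
    h[4] = np.cos(qa[0] + qa[1]) - np.cos(qb[0] + qb[1])
    # closed-form integral of the squared, linearly interpolated velocities
    for j in (0, 1):
        delta = qdb[j] - qda[j]
        h[5 + j] = dt * (qda[j]**2 + qda[j] * delta + delta**2 / 3.0)
    return h

def identify_energy_model(samples):
    """samples: iterable of (qa, qda, qb, qdb, dt, tau) tuples, with tau held
    constant over each interval. Returns the least-squares estimate of theta."""
    A = np.array([energy_model_row(qa, qda, qb, qdb, dt)
                  for (qa, qda, qb, qdb, dt, tau) in samples])
    b = np.array([tau * (qb[1] - qa[1])
                  for (qa, qda, qb, qdb, dt, tau) in samples])
    theta_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta_hat
```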

B. Power model

We then explore the power model [1], in which we look at the instantaneous power as a function of the input torque. It is formulated as

q̇^T τ = d/dt H(q, q̇) + q̇^T τ_f = θ^T dh,

where, for the acrobot, q̇^T τ = τ q̇_2, and θ is defined the same as in the energy model. We then define dh = [dh_1, ..., dh_7]^T, where

dh_1 = q̇_1 q̈_1
dh_2 = q̇_1 q̈_1 + q̇_2 q̈_2 + q̈_1 q̇_2 + q̇_1 q̈_2
dh_3 = c_2 (2 q̇_1 q̈_1 + q̈_1 q̇_2 + q̇_1 q̈_2) − s_2 q̇_2 (q̇_1^2 + q̇_1 q̇_2)
dh_4 = s_1 q̇_1
dh_5 = s_12 (q̇_1 + q̇_2)
dh_6 = q̇_1^2
dh_7 = q̇_2^2.

C. Dynamic model

A rather different approach is to use the equations of motion directly, instead of the energy of the acrobot system, for system identification [8]. Using the same parameters θ and following a similar derivation to the one presented in [2], we have

0 = (θ_1 + θ_2) q̈_1 + θ_3 (2 c_2 q̈_1 + c_2 q̈_2 − s_2 q̇_2^2 − 2 s_2 q̇_1 q̇_2) + θ_2 q̈_2 + θ_4 s_1 + θ_5 s_12 + θ_6 q̇_1
τ = θ_3 (c_2 q̈_1 + s_2 q̇_1^2) + θ_2 (q̈_1 + q̈_2) + θ_5 s_12 + θ_7 q̇_2.

In the class presentation of this work, we reported some poor results using the dynamic model. These results were obtained after (wrongfully) collapsing the two equalities into one, making the problem less well defined. Significant improvements were seen after this correction.

IV. POLICY GRADIENT METHODS

Policy gradient methods are techniques in reinforcement learning for optimizing parametrized policies with respect to the expected return. Formally, we assume the policy is parametrized by θ ∈ R^n. Given a state s_k, the agent needs to choose an action a_k. The policy is defined as a_k ~ π_θ(a_k | s_k), a distribution over actions conditioned on the state. The goal is to find the parameters θ that maximize some long-term return. For an episode of length L, we define the sequence of states and actions in this episode to be τ = [s_1:L, a_1:L]. Our objective function is defined as

J(θ) = E[ Σ_{k=1}^{L} r_k ],

where r_k is the instantaneous reward the agent receives at time step k. Our update rule for the parameters is

θ_{k+1} = θ_k + α ∇_θ J |_{θ=θ_k},

where α is a fixed learning rate.

A. Policy gradients with parameter-based exploration (PGPE)

PGPE is a policy gradient method that is capable of optimizing non-differentiable parametrized policies. PGPE samples the policy parameters before starting each training episode, runs the episode with the policy given by the sampled parameters, records the reward, and updates the parameter distribution. The objective is to find the parameters of the policy that maximize the total reward across all episode histories. Formally, we have

J(θ) = ∫_H p(h | θ) R(h) dh,

where h is an episode history and R(h) is the total reward over the history h. PGPE relies on sampling histories and averaging the results. However, to determine p(h | θ), sampling from the policy at each timestep would increase the variance of the samples over histories. To reduce variance, PGPE redefines the policy as

π_ρ(a_k | s_k) = ∫_θ p(θ | ρ) δ_{F_θ(s_k), a_k} dθ,

where ρ parametrizes the distribution over θ, F_θ(s_k) is the action determined by the deterministic policy with parameters θ in state s_k, and δ is the Dirac delta function.

To make the gradient estimate more robust, we consider PGPE with symmetric sampling. More specifically, assume ρ consists of a set of means {µ_i} and a set of standard deviations {σ_i}, which determine an independent normal distribution for each parameter in θ. We sample a perturbation ε drawn from N(0, σ^2) and define the sampled parameters θ+ = µ + ε and θ− = µ − ε. We run several episodes with each of the two sampled parameter vectors and obtain the cumulative rewards r+ and r− across those episodes. The update rule is

Δµ_i = α ε_i (r+ − r−) / (2m − r+ − r−)
Δσ_i = α ((r+ + r−)/2 − b) / (m − b) · (ε_i^2 − σ_i^2) / σ_i,

where m is the best reward observed so far, and b is a moving baseline initialized to 0 and updated as

b_k = β b_{k−1} + (1 − β)(r+ + r−)/2,

with β a step-size parameter. In practice, the σ_i are often held constant. The resulting algorithm is very similar to finite-difference optimization, with the difference that PGPE explicitly updates a distribution over directions to evaluate and makes use of a careful normalization of the step sizes.
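
To make the update concrete, the following is a minimal sketch of a single PGPE improvement step with symmetric sampling, following the update rule above; the `evaluate` routine, assumed to return the cumulative reward of the episodes run with a sampled parameter vector, is a placeholder.

```python
import numpy as np

def pgpe_step(mu, sigma, state, evaluate, alpha=0.1, beta=0.9):
    """One PGPE improvement step with symmetric sampling (see the rule above).

    mu, sigma : mean and standard deviation of the parameter distribution (arrays)
    state     : dict holding the best reward "m" so far and the moving baseline "b"
    evaluate  : placeholder mapping a sampled parameter vector to a cumulative reward
    """
    eps = np.random.normal(0.0, sigma)              # symmetric perturbation
    r_plus = evaluate(mu + eps)                     # rollout(s) with theta+ = mu + eps
    r_minus = evaluate(mu - eps)                    # rollout(s) with theta- = mu - eps

    state["m"] = max(state["m"], r_plus, r_minus)   # best reward seen so far
    m, b = state["m"], state["b"]

    # normalized symmetric-sampling updates for the mean and the exploration widths
    mu = mu + alpha * eps * (r_plus - r_minus) / (2.0 * m - r_plus - r_minus + 1e-8)
    sigma = sigma + alpha * ((r_plus + r_minus) / 2.0 - b) / (m - b + 1e-8) \
                  * (eps**2 - sigma**2) / sigma

    # moving baseline, as in the text
    state["b"] = beta * b + (1.0 - beta) * (r_plus + r_minus) / 2.0
    return mu, sigma

# usage sketch: state = {"m": -np.inf, "b": 0.0} before the first call
```
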
B. Using PGPE for Acrobot Swing-up and Balance

For the acrobot swing-up and balance problem, we first pump energy into the system using partial feedback linearization on the second joint's acceleration, and then switch to an LQR controller for balancing.

Formally, in the energy-shaping controller, we find the desired energy E_d by calculating the potential energy at the balancing point. We then choose u = u_p + u_e, where u_e = k_3 (E_d − E) q̇_1, E is the current total energy of the system, and u_p is the controller input obtained by using partial feedback linearization to ensure q̈_2 = −k_1 q_2 − k_2 q̇_2. We use a naive switching scheme in which the LQR controller is activated whenever its cost-to-go is below a threshold φ; this threshold should hopefully lie within the region of attraction. We parametrize the swing-up and balancing policy with θ = [k_1, k_2, k_3, φ]^T in order to apply PGPE to improve the policy.
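
The switching structure of this policy can be sketched as follows, under the parametrization θ = [k_1, k_2, k_3, φ]; the LQR object, the partial-feedback-linearization routine, and the total-energy function are placeholder helpers.

```python
import numpy as np

def swingup_balance_policy(s, theta, E_d, lqr, pfl_torque, total_energy):
    """Energy-shaping swing-up with a naive LQR switch, as described above.

    s            : state [q1, q2, q1_dot, q2_dot]
    theta        : policy parameters [k1, k2, k3, phi]
    E_d          : desired energy (potential energy at the balance point)
    lqr          : placeholder with gain matrix K and a cost_to_go(s) method
    pfl_torque   : placeholder returning the torque that realizes a desired q2_ddot
    total_energy : placeholder returning the current total energy H(s)
    """
    k1, k2, k3, phi = theta
    q1, q2, q1d, q2d = s

    # switch to the balancing controller once its cost-to-go drops below phi
    if lqr.cost_to_go(s) < phi:
        s_err = np.array([q1 - np.pi, q2, q1d, q2d])
        return float(-lqr.K @ s_err)

    # otherwise: energy shaping plus PFL-stabilized elbow motion, u = u_p + u_e
    u_e = k3 * (E_d - total_energy(s)) * q1d
    q2dd_des = -k1 * q2 - k2 * q2d
    return pfl_torque(s, q2dd_des) + u_e
```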

V. RESULTS

A. System Identification Experiments

To explore the performance of the different methods, we first consider the rate of convergence of the three proposed system identification methods when given perfect samples (from simulation), with the derivatives estimated by symmetric finite differences. (In the case where the true derivatives are known, both the dynamic model and the power model converge in very few samples.) The results, seen in Figure 2 for the power model and Figure 3 for the dynamic model, plot the absolute relative deviation of each estimated parameter θ̂_i with respect to the number of samples used. Each method was given the same samples, generated by picking a random action from a uniform distribution and applying it for a fixed number of seconds. This way of generating data is far from optimal, but it was sufficient for our experiments. The simulation of the system was done using ode through the scipy Python library [3]. For this experiment, we do not show the results for the energy model, as we were unable to obtain comparable results. For all experiments, we used a fixed parameter vector θ corresponding to an acrobot with g = 9.8 and fixed masses, link lengths, and damping coefficients.

These results indicate that, in our setting, the dynamic model significantly outperforms the power model. This was not expected, as the power model was designed to be better behaved than the dynamic model. The scale for each parameter was fixed across the methods to facilitate comparisons. The parameters representing the damping coefficients, θ_6 and θ_7, appear to be quite hard to fit. The power model was unable to converge to reasonable values with the given data, while the dynamic model was unable to fit the damping term on the second joint. There seems to be an inherent difficulty in fitting these parameters, whose effect, given the poor-quality data we are using, might be explained away through other parameters.

It is important to note that our simulation setup lacks many real-world difficulties, such as noisy measurements. In order to further explore this issue, we set up a slightly altered version of the experiment in which a small amount of zero-mean Gaussian noise was injected into the samples. All methods were given the same perturbed data. Even with this small amount of noise, we can see in Figure 6 that the dynamic model's performance is significantly worse than in the noiseless case (note the scale change from the previous figures). The noise has a much smaller effect on the performance of both the power model and the energy model, whose results are plotted in Figure 4 and Figure 5, respectively. This result confirms claims that the dynamic method is vulnerable to high-frequency noise, while the power and energy methods are much more tolerant [1].
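
For reference, the symmetric finite-difference estimates of the joint velocities and accelerations can be computed as in the short sketch below, assuming uniformly spaced samples; this is an illustrative reconstruction rather than the exact code used for the experiments.

```python
import numpy as np

def symmetric_differences(q, dt):
    """Estimate velocities and accelerations from sampled joint positions.

    q  : array of shape (N, 2) with joint angles sampled at a fixed time step
    dt : sampling period
    Returns (q_dot, q_ddot) at the interior samples, using central differences.
    """
    q_dot = (q[2:] - q[:-2]) / (2.0 * dt)               # symmetric first difference
    q_ddot = (q[2:] - 2.0 * q[1:-1] + q[:-2]) / dt**2   # symmetric second difference
    return q_dot, q_ddot
```
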
On real-world data, we could have improved the performance of all three methods by applying a low-pass filter to reduce the effect of noise on the finite-difference approximations, but this was not considered in our setting.

B. PGPE Results

In order to study the whole pipeline, we briefly experiment with PGPE and the energy-shaping policy with LQR stabilization seen in class. We want to optimize the gains and the LQR cutoff in order to obtain the best swing-up times from a variety of start positions around the rest position q = 0, q̇ = 0. We ran two sets of experiments. The first, seen in Figure 7, was reported during the class presentation. It converged to a solution with a short swing-up time, but we later realized that every evaluation step for a given trial would always start from the same position. This caused the algorithm to overfit and converge to parameters that give exactly the right behaviour for a fixed start position. We believe this to be a consequence of the naive LQR switch condition. When the LQR is active outside its region of attraction, it will hinder the system, after which the energy shaping is required to correct it. Our activation condition is too coarse to reliably activate the LQR inside its region of attraction. With a fixed start point, the optimization can ensure that the system will only activate the LQR once it is in the region of attraction; however, this might only hold true for the trajectory induced by that specific start point. Figure 8 shows the PGPE optimization with true random restarts, which also converges to a short average swing-up time. In order to achieve better performance in this setting, more sophisticated methods must be used to approximate the region of attraction of the LQR, such as the sums-of-squares optimization formulation seen in class.

VI. DISCUSSION

Even with our simple experimental set-up, we can notice a sharp decrease in performance for the dynamic model. We are confident that the power model is a better candidate for system identification in our setting, though different noise models (e.g., approximated torques, uniform noise, finite differences for the accelerations) and their corresponding effects on each method would need to be studied to understand whether this is always the case. Real-world domains carry many difficulties which are not captured by our simulation and which potentially require extra processing in order to obtain good performance. This pre-processing could help the different methods unevenly and would need to be used in future experiments to ensure that the comparisons between methods are fair.

Fig. 2. The deviation of the fitted parameters when using the power model with no input noise. The error bars represent the 0.9 confidence interval and are averaged over independent runs.

Fig. 3. The deviation of the fitted parameters when using the dynamic model with no input noise. The error bars represent the 0.9 confidence interval and are averaged over independent runs.

Fig. 4. The deviation of the fitted parameters when using the power model with input noise drawn from a zero-mean Gaussian. The error bars represent the 0.9 confidence interval and are averaged over independent runs.

Fig. 5. The deviation of the fitted parameters when using the energy model with input noise drawn from a zero-mean Gaussian. The error bars represent the 0.9 confidence interval and are averaged over independent runs.

Fig. 6. The deviation of the fitted parameters when using the dynamic model with input noise drawn from a zero-mean Gaussian. The error bars represent the 0.9 confidence interval and are averaged over independent runs.

Similarly, the data we generated for the system identification experiments led to poorly conditioned matrices. Exploring the various work on exciting trajectories would certainly improve the performance of all three methods [6].

The policy gradient methods seem quite promising. They allow non-trivial parametrizations of policies, allowing the designer to leverage previous work and specific domain knowledge. The PGPE algorithm gives us good performance in the fixed start-point setting but takes considerably more computational power in the random-restart case (by requiring many more evaluations or iterations). As mentioned earlier, this could have been improved by using a proper region-of-attraction algorithm. Alternatively, a parametrized function for the LQR activation condition, whose parameters could have been optimized and initialized based on the current cost-to-go condition, might be able to find a larger region of attraction (depending on the parametrization). We have found the episodic nature of PGPE to be detrimental to what we had set out to achieve. In the future, we will consider incremental policy gradient methods in order to formulate an incremental system identification and policy improvement method.

Fig. 7. The average performance of the energy-shaping policy plotted against the number of PGPE improvement steps. Every run was given a fixed start point. The error bars represent the 0.9 confidence interval and are averaged over independent runs. The LQR cost was defined by Q = diag([ , , 7, 7]) and R = I.

Fig. 8. The average performance of the energy-shaping policy plotted against the number of PGPE improvement steps. Every evaluation is given a start point with q within a fixed fraction of π of the rest position, and the evaluation of a policy is averaged over random restarts. The error bars represent the 0.9 confidence interval and are averaged over independent runs. The converged parameters hovered around k_2 = 9 and φ = 8. The LQR cost was defined by Q = diag([ , , 7, 7]) and R = I.

REFERENCES

[1] Maxime Gautier. Dynamic identification of robots with power model. In Proceedings of the 1997 IEEE International Conference on Robotics and Automation. IEEE, 1997.
[2] Luke Johnson. Adaptive Swing-up and Balancing Control of Acrobot Systems. Bachelor's thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2009.
[3] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python. [Online].
[4] W. Khalil and E. Dombre. Modeling, Identification & Control of Robots. London: HPS.
[5] Richard M. Murray and John E. Hauser. A case study in approximate linearization: The Acrobot example. Electronics Research Laboratory, College of Engineering, University of California, 1991.
[6] C. Presse and Maxime Gautier. New criteria of exciting trajectories for robot identification. In Proceedings of the 1993 IEEE International Conference on Robotics and Automation. IEEE, 1993.
[7] Frank Sehnke, Christian Osendorfer, Thomas Rückstieß, Alex Graves, Jan Peters, and Jürgen Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551–559, 2010.
[8] Russ Tedrake. Underactuated Robotics: Algorithms for Walking, Running, Swimming, Flying, and Manipulation. Course Notes for MIT 6.832.
