In: Proc. BENELEARN-98, 8th Belgian-Dutch Conference on Machine Learning, pp. 39-46, 1998

Linear Quadratic Regulation using Reinforcement Learning

Stephan ten Hagen* and Ben Kröse
Department of Mathematics, Computer Science, Physics and Astronomy, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam
* Supported by the Dutch technology foundation STW

Abstract. In this paper we describe a possible way to make reinforcement learning more applicable in the context of industrial manufacturing processes. We achieve this by formulating the optimization task in the linear quadratic regulation framework, for which a conventional control theoretic solution exists. By rewriting the Q-learning approach into a linear least squares approximation problem, we can make a fair comparison between the resulting approximation and that of the conventional system identification approach. Our experiment shows that the conventional approach performs slightly better. We also show that the amount of exploration noise added during the generation of the data plays a crucial role in the outcome of both approaches.

1 Introduction

Reinforcement Learning (RL) is a trial based method for optimizing the interaction with an environment or the control of a system [2][7]. The optimization is performed by approximating the future sum of evaluations and determining the feedback that minimizes or maximizes this approximation. We are particularly interested in industrial applications, like the optimization of the control of manufacturing processes. In [6] this is demonstrated for the manufacturing of thermoplastic composite structures. In spite of the successful demonstration, the use of reinforcement learning to control manufacturing processes is still not commonplace, because there are still some barriers to overcome.

The first barrier is that manufacturers are very careful with respect to novel control techniques. Only those techniques are used that are fully understood and are guaranteed to work. The problems encountered by the manufacturers are too specific to be related to existing successful demonstrations of certain techniques. Impressive demonstrations are not sufficient to convince manufacturers of the usefulness of these techniques; for this a solid theoretical underpinning is required. The second barrier lies in the reinforcement learning field itself. Most of the RL applications described are maze-like navigation tasks, scheduling tasks or obstacle avoidance tasks for mobile robots. This has resulted in a theoretical understanding for discrete state space systems, in particular those that fit in a Markov decision framework. A consequence of this is that the main theoretical results and convergence guarantees apply only to Markov decision processes. But in a realistic manufacturing environment most information about the "state" of the process is obtained from measurements. These measurements form a continuous state space for the manufacturing process, for which the theoretic reinforcement learning results are not valid. So in order to make reinforcement learning techniques applicable to manufacturing processes, the theoretic results should be extended to continuous state space systems.

In order to come to an RL approach that is more generally applicable to manufacturing processes, we adopt the Linear Quadratic Regulation (LQR) framework from control theory. LQR is an optimization task in which the system is assumed to be linear and the evaluation is a quadratic cost function. For this the optimal feedback can be computed when the system and cost are completely known [1].
The combination of RL and LQR was first described in [8]. In [3] this framework has been described as a possible extension of RL algorithms to problems with continuous state and action spaces. Convergence of RL applied to LQR can also be proven [4][5]. These convergence proofs rely on a contraction to the optimal feedback, which takes place if the amount of exploration and the number of time steps are large enough. The difficulty is that in practice it is never known whether the amount of exploration is sufficient.

In this paper we use the LQR framework to compare Q-Learning (QL), a particular kind of RL, with System Identification (SI). In QL the optimization is based on the approximation of the sum of future evaluations as a function of the state and action. According to [3][4] the optimal feedback can be derived using QL, without having to know the system. However, the same data used by QL can also be used for SI. This results in an approximation of the system, which can be used to compute the optimal feedback. We formulate the learning methods of the QL and the SI approach in such a way that a linear least squares approximation method can be used. In this way a fair comparison is possible, and this enables us to see which method performs best. We investigate experimentally whether both methods are able to approximate the optimal feedback. In particular we look at how the exploration noise and the system's noise influence the outcome of the approximation. The experiment shows that the amount of exploration noise has a strong impact on the resulting approximations.

2 Linear Quadratic Regulation

The Linear Quadratic Regulation (LQR) framework consists of a linear system with a linear feedback. Let x ∈ R^{n_x} be the state vector and u ∈ R^{n_u} the vector of control actions; then the system and controller are given by:

x_{k+1} = A x_k + B u_k + v_k   and   u_k = L x_k + e_k,    (1)

where k indicates the time step and the matrices A, B and L have the proper dimensions. All elements of the vectors v_k and e_k are normally distributed white noise with variances σ_v² and σ_e². The vector v represents the system's noise and e represents the exploration (or excitation) noise. In the LQR framework the direct cost is a quadratic function of x and u:

r(x_k, u_k) = x_k^T S x_k + u_k^T R u_k.    (2)

(The superscript T indicates the transpose.) The matrix S is symmetric positive semidefinite and R is symmetric positive definite. The total cost is defined as:

J = Σ_{i=0}^{∞} r_i.    (3)

The objective of LQR is to find the optimal feedback L* that minimizes J. In order to find L*, the sum of future cost can be expressed as a function of x:

J(x_k) = Σ_{i=k}^{∞} r_i = x_k^T K x_k.    (4)

The symmetric positive definite matrix K is the unique solution of the Discrete Algebraic Riccati Equation (DARE):

K = A^T (K - K B (B^T K B + R)^{-1} B^T K) A + S.    (5)

The solution of this equation can be found iteratively if the pair {A, B} is controllable, S is symmetric positive semidefinite, R is symmetric positive definite and the pair {A, √S} (√S is the matrix for which √S^T √S = S) is observable [1]. The optimal feedback L* is given by:

L* = -(B^T K B + R)^{-1} B^T K A.    (6)
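The conventional computation of (5) and (6) can be sketched as follows. This is a minimal numpy sketch, not code from the paper: solve_dare and optimal_feedback are illustrative names, the fixed-point iteration is one of several ways to solve the DARE, and the controllability and observability conditions stated above are assumed to hold.

```python
import numpy as np

def solve_dare(A, B, S, R, tol=1e-10, max_iter=10000):
    """Fixed-point iteration of the DARE (5):
    K = A^T (K - K B (B^T K B + R)^{-1} B^T K) A + S."""
    K = S.copy()
    K_new = K
    for _ in range(max_iter):
        gain = np.linalg.solve(B.T @ K @ B + R, B.T @ K @ A)  # (B^T K B + R)^{-1} B^T K A
        K_new = A.T @ (K @ A - K @ B @ gain) + S
        if np.max(np.abs(K_new - K)) < tol:
            break
        K = K_new
    return K_new

def optimal_feedback(A, B, S, R):
    """Optimal feedback (6): L* = -(B^T K B + R)^{-1} B^T K A."""
    K = solve_dare(A, B, S, R)
    return -np.linalg.solve(B.T @ K @ B + R, B.T @ K @ A)
```

With the true A, B, S and R this yields the optimal feedback L*; in the SI approach described below the same routine is applied to the estimated matrices Â and B̂ instead.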

The problem of LQR is that it requires exact knowledge of the system's model (1), which in practical situations is not available. Using the feedback in (1) it is possible to control the system and generate data. In the next section two methods will be described to get an approximation L̂ of the optimal feedback, based on the generated data. The first method approximates the system's model and then uses (5) and (6). The second method approximates the future cost as a function of x and u and derives L̂ from this.

3 Two Approximation Methods

The system starts at the initial state x_0 and is controlled for N time steps using the feedback L. All state vectors from x_0 to x_N and all control vectors from u_0 to u_{N-1} form the data set. Based on this data set the approximation L̂ of the optimal feedback should be derived. Two different methods to do this will be presented.

The first method will be referred to as the System Identification (SI) approach, because it identifies the parameters of the matrices A and B, resulting in Â and B̂. Rewrite (1) to:

x_{k+1}^T = [x_k^T  u_k^T] [A^T ; B^T] + v_k^T.    (7)

This makes it possible to stack the vectors of all time steps into matrices and to write (7) for the complete data set:

Y = [x_1^T ; x_2^T ; ... ; x_N^T] = [x_0^T u_0^T ; x_1^T u_1^T ; ... ; x_{N-1}^T u_{N-1}^T] Θ̂ + [v_0^T ; v_1^T ; ... ; v_{N-1}^T] = X Θ̂ + V.    (8)

In this expression Θ̂^T = [Â  B̂]. The least squares solution for Θ̂ is given by:

Θ̂ = (X^T X)^{-1} X^T Y.    (9)

So Â and B̂ can be derived from Θ̂, and applying the DARE (5) and (6) will result in L̂.

Figure 1. L* and L̂. At the bottom the computation of L* is shown. At the top the two methods, SI and QL, to derive L̂ are shown. The blocks indicate results and the arrows indicate the required "information" to derive the next result.
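A sketch of the SI approach of (7)-(9), assuming numpy; identify_system is an illustrative name, and np.linalg.lstsq is used as a numerically preferable equivalent of the explicit solution (9).

```python
import numpy as np

def identify_system(x, u):
    """System identification by least squares, eqs. (7)-(9).
    x: array of shape (N+1, n_x) with x_0 ... x_N,
    u: array of shape (N, n_u) with u_0 ... u_{N-1}."""
    N, n_x = u.shape[0], x.shape[1]
    X = np.hstack([x[:N], u])                      # rows [x_k^T  u_k^T]
    Y = x[1:N + 1]                                 # rows x_{k+1}^T
    Theta, *_ = np.linalg.lstsq(X, Y, rcond=None)  # least squares solution of (9)
    A_hat = Theta[:n_x].T                          # Theta stacks [A^T ; B^T]
    B_hat = Theta[n_x:].T
    return A_hat, B_hat
```

The resulting Â and B̂ are then substituted for A and B in (5) and (6), for example with the solve_dare sketch given earlier, which yields L̂.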

Figure 2. Two approximation methods. The data set on the left consists of two sequences of state values x_1 and x_2, a sequence of control actions u and a sequence of direct costs r. Both methods use this data to form the matrices X and Y, which are used in (9). At the top it is shown how the SI approach derives L̂ from the least squares solution, at the bottom this is shown for the QL approach.

The second method will be referred to as the Q-Learning (QL) approach, because it approximates the Q-function. The Q-function is the future cost as a function of x and u:

Q(x_k, u_k) = Σ_{i=k}^{∞} r_i = [x_k^T  u_k^T] [ S + A^T K A , A^T K B ; B^T K A , R + B^T K B ] [x_k ; u_k]    (10)
            = [x_k^T  u_k^T] [ H_11 , H_12 ; H_21 , H_22 ] [x_k ; u_k] = ψ_k^T H ψ_k,    (11)

where K is the solution of the DARE. L* can be found by setting the derivative of the Q-function with respect to u_k to zero. This results in u_k = -H_22^{-1} H_21 x_k, so L* = -H_22^{-1} H_21. It is clear that this result is identical to (6). The value of ψ_k^T = [x_k^T  u_k^T] in (11) is formed by the data set, so the symmetric positive definite matrix H can be approximated. The approximation L̂ of the optimal feedback then follows directly from the approximation Ĥ.

The approximation can be made because the Q-function in (11) can also be defined recursively as Q(x_k, u_k) = r_k + Q(x_{k+1}, L x_{k+1}). Write ψ_k^T H ψ_k in (11) as φ_k^T θ, where the vector φ_k consists of all quadratic combinations of the elements of ψ_k and the vector θ of the corresponding values of H. From the recursive definition of the Q-function it then follows that r_k = φ_k^T θ - φ_{k+1}^T θ = (φ_k - φ_{k+1})^T θ. This resembles (7), so θ can be approximated in a similar way using (9). For this, X should be formed by all vectors φ_k^T - φ_{k+1}^T and Y by all values r_k. The least squares solution (9) then approximates θ (and not [A  B]), so Ĥ and L̂ can be derived from it. (Note that L x_{k+1} is used instead of u_{k+1}, so X, Y and the least squares solution are different for the QL approach.)

By using (9) for the approximation, the complete data set is used at once. The advantage of this is that the results do not depend on additional choices, like an initial parameter estimate or an iteration step size. This means that differences in the resulting approximations are only due to the processing of the data into X and Y and to the different use of additional "information" by the two methods (the SI method uses S and R, while the QL method uses L). In this way a fair comparison can be made between the performances of both methods. In a simulation the true A and B are known, so the performances can also be compared with the true optimal feedback L*. Figure 1 shows how the two approximation methods are related to each other and to the optimal solution. Figure 2 shows how both methods derive the approximation from the data.
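The QL approach can be sketched in the same least squares form. This is a sketch under the assumption that the quadratic combinations in φ_k are ordered as the upper triangle of ψ_k ψ_k^T; quad_features and ql_feedback are illustrative names, not from the paper.

```python
import numpy as np

def quad_features(psi):
    """All quadratic combinations psi_i * psi_j (i <= j) of psi = [x; u]."""
    return np.outer(psi, psi)[np.triu_indices(psi.shape[0])]

def ql_feedback(x, u, r, L):
    """Least squares Q-learning, eqs. (10)-(11): fit theta from
    r_k = (phi_k - phi_{k+1})^T theta, with phi_{k+1} built from L x_{k+1}.
    x: (N+1, n_x) states, u: (N, n_u) actions, r: (N,) direct costs."""
    n_x, n_u = x.shape[1], u.shape[1]
    n = n_x + n_u
    X_rows, Y = [], []
    for k in range(u.shape[0]):
        psi_k = np.concatenate([x[k], u[k]])
        psi_next = np.concatenate([x[k + 1], L @ x[k + 1]])   # L x_{k+1}, not u_{k+1}
        X_rows.append(quad_features(psi_k) - quad_features(psi_next))
        Y.append(r[k])
    theta, *_ = np.linalg.lstsq(np.array(X_rows), np.array(Y), rcond=None)
    T = np.zeros((n, n))
    T[np.triu_indices(n)] = theta
    H = (T + T.T) / 2.0                   # symmetric H recovered from the fitted parameters
    H_uu, H_ux = H[n_x:, n_x:], H[n_x:, :n_x]
    return -np.linalg.solve(H_uu, H_ux)   # L_hat = -H_22^{-1} H_21
```

Note that, exactly as in the derivation above, the successor feature vector is built from L x_{k+1} rather than from the recorded action u_{k+1}.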

Figure 3. Two data sets. Both sets, generated according to (1) and (2), consist of two sequences of state values x_1 and x_2 (top), a sequence of control actions u (middle) and a sequence of direct costs r (bottom). Data set I, on the left, is generated with an exploration noise σ_e = 10^{-4} that is small relative to the system's noise σ_v. Data set II, on the right, is generated with a system's noise σ_v = 10^{-4} that is small relative to σ_e. Because these noise terms are relatively small, both data sets are very much alike.

4 Experiment

In this experiment we take a system as described in (1), with a two dimensional state and a one dimensional control action; the values of A, B and the feedback L used to generate the data are fixed (12). The matrix A has all its eigenvalues within the unit disc and the pair {A, B} is controllable. The matrices S and R are identity matrices and {A, √S} is observable, so (5) and (6) can be used to compute the optimal feedback L* (13).

The system starts in its initial state x_0 and is controlled using the feedback L for N time steps. Both noise terms v and e are normally distributed and white. Two data sets are generated: data set I with σ_e = 10^{-4} much smaller than σ_v, and data set II with σ_v = 10^{-4} much smaller than σ_e. This results in two almost identical data sets, as shown in Figure 3. The optimal feedback is approximated using both methods on both data sets.

Table 1 shows the approximation results for both methods. For data set I they have almost the same outcome, which corresponds to the value of L: this is the feedback that was used to generate the data (12). For data set II both methods approximate the value of L* from (13), which is the value that should be approximated. The results for data set II in Table 1 indicate that the SI method gives a better approximation of L* than the QL method.

The main goal of both approximation methods is to approximate the optimal feedback L*. The results in Table 1 show that this value is not always being approximated. If the amount of exploration noise is much lower than the noise in the system, then the feedback used to generate the data is approximated instead. This is a problem for practical application, because Figure 3 indicates that visual inspection of the data sets does not reveal whether the exploration is sufficient. So additional investigations are required to see how both noise sources v and e influence the approximation of the optimal feedback.
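The data sets used here can be generated along the following lines. This is a sketch assuming numpy; generate_data is an illustrative name, and the actual A, B, L, x_0, N and noise scales of the experiment would be substituted.

```python
import numpy as np

def generate_data(A, B, L, S, R, x0, N, sigma_v, sigma_e, seed=0):
    """Simulate the closed loop (1) for N steps and record the direct costs (2)."""
    rng = np.random.default_rng(seed)
    n_x, n_u = B.shape
    x = np.zeros((N + 1, n_x))
    x[0] = x0
    u = np.zeros((N, n_u))
    r = np.zeros(N)
    for k in range(N):
        u[k] = L @ x[k] + sigma_e * rng.standard_normal(n_u)                 # exploration noise e_k
        r[k] = x[k] @ S @ x[k] + u[k] @ R @ u[k]                             # direct cost (2)
        x[k + 1] = A @ x[k] + B @ u[k] + sigma_v * rng.standard_normal(n_x)  # system noise v_k
    return x, u, r
```

Running this twice, once with σ_e small relative to σ_v and once with the roles reversed, produces data sets of the type I and II described above.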

Table 1. Experimental results. For the data sets I and II the values of σ_v and σ_e are given, together with the approximated optimal feedbacks L̂ according to the SI and the QL method. The bottom row indicates which value is actually being approximated by both methods: L for data set I and L* for data set II.

5 The Noise

The true optimal solution L* computed with (5) and (6) does not take the presence of noise into account. The noise sources v and e are only present when the system is used to generate the data set. This means that a mismatch between L̂ and L* can be a consequence of the noise. Therefore the different influences of the noise sources v and e will be investigated.

The exploration noise e is essential to make the approximation. This can easily be seen by looking at the matrix X^T X in (9). This matrix should have full rank to be invertible (note that this also means that the number of time steps N should be large enough). Without exploration noise the control vector u_k = L x_k is linearly dependent on the state vector. For both approximation methods this has the consequence that not all rows (and also not all columns) of X^T X are linearly independent. So the role of the exploration noise is to prevent the matrix X^T X from becoming singular. The influence of e on the entries of the matrix X^T X is smaller for the QL method, so this method will require more exploration than the SI method for the matrix X^T X to be nonsingular.

The system's noise v is not required to get the approximations. In fact (9) minimizes the mismatch between Y and XΘ̂ + V caused by V. So if V is zero, the result is a perfect approximation of the system or of the Q-function. The matrix V consists of all terms v_k when the SI method is applied, but for the QL method V is different. This can be seen by looking at (10), where the noise v is not included. If it is included, the value of the Q-function becomes:

Q(x_k, u_k) = [x_k^T  u_k^T] [ S + A^T K A , A^T K B ; B^T K A , R + B^T K B ] [x_k ; u_k] + v_k^T K v_k,    (14)

so r_k = (φ_k - φ_{k+1})^T θ + v_k^T K v_k - v_{k+1}^T K v_{k+1}. This means that V consists of all terms v_k^T K v_k - v_{k+1}^T K v_{k+1}, making the minimization performed by (9) different.

There is another way the noise influences the approximations. To see this, use (1) to get the value of the state at time k:

x_k = D^k x_0 + Σ_{i=0}^{k-1} D^{k-i-1} (B e_i + v_i),    (15)

where D = A + BL represents the closed loop. So the values of the state vector x_k and the control vector u_k depend on the initial state x_0 and all previous noise values of e and v. This is a consequence of generating the data with the closed-loop system (1). The noise that enters the system re-enters it through the feedback, causing an additional disturbance that is no longer white like e and v, but "colored" by D. This can cause a bias in the approximation. This also makes it impossible to derive the variance σ_v of the system's noise without knowledge of the closed loop D.
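The rank argument above can be checked numerically along the following lines, reusing the generate_data and identify_system sketches from the previous sections (illustrative helpers, not code from the paper).

```python
import numpy as np

def si_condition_number(A, B, L, S, R, x0, N, sigma_v, sigma_e):
    """Condition number of X^T X for the SI regressor.  It blows up as sigma_e -> 0,
    because u_k = L x_k then makes the last columns of X linear combinations of the
    first ones."""
    x, u, _ = generate_data(A, B, L, S, R, x0, N, sigma_v, sigma_e)
    X = np.hstack([x[:N], u])
    return np.linalg.cond(X.T @ X)
```

As σ_e is decreased toward zero the condition number grows without bound, which is the singularity of X^T X described above; for the QL regressor the same effect sets in at larger values of σ_e.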

6 Experiment

In this experiment we continue investigating the consequences of e and v on the approximation L̂, for the SI and the QL approach. Two different setups are chosen such that (5) and (6) can be used to compute the optimal feedback L*. The number of samples N and the amount of system's noise σ_v are different for these setups. The amount of exploration noise σ_e is varied over several orders of magnitude, while both noise sequences are kept the same (except for the scale σ_e of the exploration noise). This is repeated 5 times for both systems and both approaches. The total cost obtained when L̂ is used is divided by the total cost obtained when L* is used; this is the relative performance.

Figure 4. The relative performance of the SI approach (dashed lines) and the QL approach (solid lines) as a function of σ_e. The vertical dotted line indicates σ_e = σ_v. The two setups (left and right) differ in the state dimension n_x, the system's noise (σ_v = 10^{-5} for the left figure) and the number of samples N.

Figure 4 shows the relative performance as a function of σ_e for both setups. (The experiment was repeated for many different configurations and the results were consistent with the two shown in Figure 4.) With the increase of σ_e from a very small value, the relative performance and the approximation L̂ go through four different types of results. (When σ_v is very small or zero there are no type II and III results; this corresponds to the configurations of the convergence proofs in [4] and [5], where the system's noise v is not taken into account.)

I  The matrix X^T X is singular, so no feedback can be computed. For the SI approach this happens only for the very smallest values of σ_e (not shown in Figure 4), while for the QL approach it happens for σ_e < 10^{-7}. So QL requires more exploration for X^T X to be nonsingular.

II  Figure 4 shows that for low values of σ_e both methods give the same constant relative performance. In this case the feedback that was used to generate the data is approximated as the optimal feedback, so L̂ ≈ L, as in the experiment of Section 4. For SI this result can be explained by using B̂ instead of B and D̂ - B̂L instead of A in (6). Clearly B̂ is much too large close to the singularity, making (6) result in L. The QL approach does not use (6), so a similar explanation cannot be given. (Although conceptually it makes sense that for a low amount of exploration the presence of L is very dominant in the data set.)

III  Between the approximations L̂ ≈ L and L̂ ≈ L* there is a transition area, where the relative performance can be quite good or very bad. The approximated feedback can even result in an unstable closed loop of the system. Figure 4 shows that for SI this happens just before σ_e = σ_v and for QL just after σ_e = σ_v.

IV  This is the only type of result that is useful! The relative performance is (very close to) one, so L̂ ≈ L*. Although it is not clear in Figure 4, the SI result is slightly closer to one than the QL result. Also, the QL approach requires a higher σ_e to get to this type of result.

The results in Figure 4 are obtained for different numbers of data samples N. But even if N is increased much more, the relative performance still shows the four types of results for the same values of σ_e. From this it can be concluded that the amount of exploration noise is much more important than N.
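The relative performance used in Figure 4 can be computed along these lines. A minimal sketch, assuming numpy; total_cost and relative_performance are illustrative names, and using the same seed for both feedbacks mirrors the use of identical noise sequences described above.

```python
import numpy as np

def total_cost(A, B, L, S, R, x0, N, sigma_v, seed):
    """Total cost (3), truncated to N steps, of the closed loop x_{k+1} = (A + B L) x_k + v_k."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    J = 0.0
    for _ in range(N):
        u = L @ x
        J += x @ S @ x + u @ R @ u
        x = A @ x + B @ u + sigma_v * rng.standard_normal(x.shape[0])
    return J

def relative_performance(A, B, L_hat, L_star, S, R, x0, N, sigma_v, seed=0):
    """Cost obtained with L_hat divided by the cost obtained with L*, on the same noise."""
    return (total_cost(A, B, L_hat, S, R, x0, N, sigma_v, seed)
            / total_cost(A, B, L_star, S, R, x0, N, sigma_v, seed))
```

A value close to one corresponds to the type IV result, while values (much) larger than one correspond to the type II and III results.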

From a manufacturing point of view, the type I (no solution) and type II (no improvement) results are not very interesting; only the type III and IV results are important. The type III result has to be avoided. Although it can result in an improvement, the only certainty about this improvement can be obtained by testing it on the real manufacturing process. In case of an unstable closed loop this can result in damaging the process, which might be very expensive. So the amount of exploration should be high enough to avoid the type III result and guarantee a type IV result. The type IV result is always an improvement (except when L = L*). The problem in controlling a manufacturing process is that adding exploration noise results in variations in the products. This makes exploring expensive, so the amount of exploration noise should be minimized. Figure 4 shows that the minimal value of σ_e for which a type IV result is obtained is lower for the SI approach. So the SI approach requires less exploration to give an improvement of the feedback. This makes the use of the SI approach more appropriate for manufacturing than the QL approach. Further research will focus on minimizing the exploration requirements of the QL approach. Also ways to derive the minimal amount of exploration from the data will be investigated.

7 Conclusion

In this paper we showed that the linear quadratic regulation framework provides the opportunity to extend the applicability of reinforcement learning to industrial manufacturing processes. We did this by rewriting the Q-learning approach into a format that made a fair comparison possible with the more conventional system identification approach from control theory. The experiment showed the importance of sufficient exploration in the data generation: if the disturbance in the system exceeds the amount of exploration noise, it is not the optimal feedback that is approximated but the feedback used to generate the data. More importantly, the experiment showed that the conventional approach requires less exploration than the reinforcement learning approach, making it more appropriate for manufacturing processes.

References

[1] D.P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models, pages 55-64. Prentice-Hall, 1987.
[2] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1997.
[3] S.J. Bradtke. Reinforcement learning applied to linear quadratic regulation. In Stephen Jose Hanson, Jack D. Cowan, and C. Lee Giles, editors, Advances in Neural Information Processing Systems 5, pages 295-302. Morgan Kaufmann, San Mateo, CA, 1993.
[4] S.J. Bradtke, B.E. Ydstie, and A.G. Barto. Adaptive linear quadratic control using policy iteration. Technical Report CMPSCI 94-49, University of Massachusetts, June 1994.
[5] T. Landelius. Reinforcement Learning and Distributed Local Model Synthesis. PhD thesis, Linköping University, 1997.
[6] D.A. Sofge and D.A. White. Neural network based process optimization and control. In Proceedings of the 29th Conference on Decision and Control, pages 7-76, Honolulu, Hawaii, 1990. IEEE.
[7] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[8] P.J. Werbos. Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks, 3:179-189, 1990.
