In: Proc. BENELEARN-98, 8th Belgian-Dutch Conference on Machine Learning, pp 9-46, 1998

Linear Quadratic Regulation using Reinforcement Learning

Stephan ten Hagen* and Ben Kröse
Department of Mathematics, Computer Science, Physics and Astronomy
University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam

* Supported by the Dutch technology foundation STW

Abstract. In this paper we describe a possible way to make reinforcement learning more applicable in the context of industrial manufacturing processes. We achieve this by formulating the optimization task in the linear quadratic regulation framework, for which a conventional control-theoretic solution exists. By rewriting the Q-learning approach into a linear least squares approximation problem, we can make a fair comparison between the resulting approximation and that of the conventional system identification approach. Our experiment shows that the conventional approach performs slightly better. We also show that the amount of exploration noise, added during the generation of data, plays a crucial role in the outcome of both approaches.

1 Introduction

Reinforcement Learning (RL) is a trial-based method for optimizing the interaction with an environment or the control of a system [2][7]. The optimization is performed by approximating the future sum of evaluations and determining the feedback that minimizes or maximizes this approximation. We are particularly interested in industrial applications, like the optimization of the control of manufacturing processes. In [6] this is demonstrated for the manufacturing of thermoplastic composite structures. In spite of the successful demonstration, the use of reinforcement learning to control manufacturing processes is still not commonplace. The reason for this is that there are still some barriers to overcome. The first barrier is that manufacturers are very careful with respect to novel control techniques: only those techniques are used that are fully understood and are guaranteed to work.
The problems encountered by the manufacturers are too specific to be related to existing successful demonstrations of certain techniques. Impressive demonstrations are not sufficient to convince manufacturers of the usefulness of these techniques; for this a solid theoretical underpinning is required. The second barrier lies in the reinforcement learning field itself. Most of the RL applications described are maze-like navigation tasks, scheduling tasks or obstacle avoidance tasks for mobile robots. This has resulted in a theoretical understanding for discrete state space systems, in particular those that fit in a Markov decision framework. A consequence of this is that the main theoretical results and convergence guarantees apply only to Markov decision processes. But in a realistic manufacturing environment most information about the "state" of the process comes from measurements. These measurements form a continuous state space for the manufacturing process, for which the theoretical reinforcement learning results are not valid. So in order to make reinforcement learning techniques applicable to manufacturing processes, the theoretical results should be extended to continuous state space systems.

In order to come to an RL approach that is more generally applicable to manufacturing processes, we adopt the Linear Quadratic Regulation (LQR) framework from control theory. LQR is an optimization task in which the system is assumed to be linear and the evaluation is a quadratic cost function. For this the optimal feedback can be computed when the system and cost are completely known [1]. The combination of RL and LQR was first described in [8].
In [3] this framework has been described as a possible extension of RL algorithms to problems with continuous state and action spaces. Convergence of RL applied to LQR can also be proven [4][5]. These convergence proofs rely on a contraction to the optimal feedback, which takes place if the amount of exploration and the number of time steps are large enough. The difficulty is that in practice it is never known whether the amount of exploration is sufficient.

In this paper we use the LQR framework to compare Q-Learning (QL), a particular kind of RL, with System Identification (SI). In QL the optimization is based on approximating the sum of future evaluations as a function of the state and action. According to [3][4] the optimal feedback can be derived using QL, without having to know the system. However, the same data used by QL can also be used for SI. This results in an approximation of the system that can be used to compute the optimal feedback. We formulate the learning methods of the QL and the SI approach in such a way that a linear least squares approximation method can be used. In this way a fair comparison is possible, so this enables us to see which method performs best. We investigate experimentally whether both methods are able to approximate the optimal feedback. In particular we look at how the exploration noise and the system's noise influence the outcome of the approximation. The experiment shows that the amount of exploration noise has a strong impact on the resulting approximations.

2 Linear Quadratic Regulation

The Linear Quadratic Regulation (LQR) framework consists of a linear system with a linear feedback. Let x ∈ IR^{n_x} be the state vector and u ∈ IR^{n_u} the vector of control actions; the system and controller are then given by:

x_{k+1} = A x_k + B u_k + v_k   and   u_k = L x_k + e_k,   (1)

where k indicates the time step and the matrices A, B and L have the proper dimensions. All elements of the vectors v_k and e_k are normally distributed white noise with standard deviations σ_v and σ_e.
The vector v represents the system's noise and e represents the exploration (or excitation) noise. In the LQR framework the direct cost is a quadratic function of x and u:

r(x_k, u_k) = x_k^T S x_k + u_k^T R u_k,   (2)

where the superscript T indicates the transpose. The matrix S is symmetric positive semidefinite and R is symmetric positive definite. The total cost is defined as:

J = sum_{i=0}^{∞} r_i.   (3)

The objective of LQR is to find the optimal feedback L* that minimizes J. In order to find L*, the sum of future cost can be expressed as a function of x:

J(x_k) = sum_{i=k}^{∞} r_i = x_k^T K x_k.   (4)

The symmetric positive definite matrix K is the unique solution to the Discrete Algebraic Riccati Equation (DARE):

K = A^T (K - K B (B^T K B + R)^{-1} B^T K) A + S.   (5)

The solution of this equation can be found iteratively if the pair {A, B} is controllable, S is symmetric positive semidefinite, R is symmetric positive definite and the pair {A, sqrt(S)} is observable, where sqrt(S) is the matrix for which sqrt(S)^T sqrt(S) = S [1]. The optimal feedback L* is given by:

L* = -(B^T K B + R)^{-1} B^T K A.   (6)
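The fixed-point iteration on the DARE (5) and the feedback (6) can be sketched as follows. This is a minimal illustration; the matrices A and B below are made-up stable values, not the ones used in the paper's experiments.

```python
import numpy as np

# Hypothetical stable 2x2 system (illustrative values, not from the paper).
A = np.array([[0.6, 0.2], [-0.1, 0.5]])
B = np.array([[0.0], [1.0]])
S = np.eye(2)  # symmetric positive semidefinite state cost
R = np.eye(1)  # symmetric positive definite control cost

# Iterate the DARE (5): K <- A^T (K - K B (B^T K B + R)^{-1} B^T K) A + S
K = S.copy()
for _ in range(1000):
    G = np.linalg.inv(B.T @ K @ B + R)
    K_new = A.T @ (K - K @ B @ G @ B.T @ K) @ A + S
    if np.max(np.abs(K_new - K)) < 1e-12:
        K = K_new
        break
    K = K_new

# Optimal feedback (6): L* = -(B^T K B + R)^{-1} B^T K A
L_star = -np.linalg.inv(B.T @ K @ B + R) @ B.T @ K @ A
```

Because A is stable here, the iteration converges and the resulting closed loop A + B L* has all eigenvalues inside the unit disc.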
The problem of LQR is that it requires exact knowledge of the system model (1), which in practical situations is not available. Using the feedback in (1) it is possible to control the system and generate data. In the next section two methods are described to obtain an approximation ^L of the optimal feedback, based on the generated data. The first method approximates the system model and then uses (5) and (6). The second method approximates the future cost as a function of x and u and derives ^L from this.

3 Two Approximation Methods

The system starts at the initial state x_0 and is controlled for N time steps using the feedback L. All state vectors from x_0 to x_N and all control vectors from u_0 to u_{N-1} form the data set. Based on this data set the approximation ^L of the optimal feedback should be derived. Two different methods to do this will be presented.

The first method will be referred to as the System Identification (SI) approach, because it identifies the parameters of the matrices A and B, resulting in ^A and ^B. Rewrite (1) as:

x_{k+1}^T = [x_k^T u_k^T] [A^T; B^T] + v_k^T.   (7)

This makes it possible to stack the vectors of all time steps into matrices and write (7) for the complete data set:

Y = [x_1^T; x_2^T; ...; x_N^T] = [x_0^T u_0^T; x_1^T u_1^T; ...; x_{N-1}^T u_{N-1}^T] ^Θ + [v_0^T; v_1^T; ...; v_{N-1}^T] = X ^Θ + V.   (8)

In this expression ^Θ^T = [^A ^B]. The least squares solution for ^Θ is given by:

^Θ = (X^T X)^{-1} X^T Y.   (9)

So ^A and ^B can be derived from ^Θ, and applying the DARE (5) and (6) will result in ^L.

Figure 1. L* and ^L. At the bottom the computation of L* is shown; at the top the two methods, SI and QL, to derive ^L. The blocks indicate results and the arrows indicate the "information" required to derive the next result.
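The SI steps (7)-(9) can be sketched as below: generate a closed-loop data set, stack it into X and Y, and solve the least-squares problem. All numerical values (A, B, L, noise levels, N) are illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, nu, N = 2, 1, 500

# Hypothetical stable system and stabilizing feedback (illustrative values).
A = np.array([[0.6, 0.2], [-0.1, 0.5]])
B = np.array([[0.0], [1.0]])
L = np.array([[0.1, -0.4]])
sigma_v, sigma_e = 1e-3, 1e-1

# Generate the closed-loop data set (1): x_{k+1} = A x_k + B u_k + v_k, u_k = L x_k + e_k
xs = [np.array([1.0, 1.0])]
us = []
for k in range(N):
    u = L @ xs[k] + sigma_e * rng.standard_normal(nu)
    x_next = A @ xs[k] + B @ u + sigma_v * rng.standard_normal(nx)
    us.append(u)
    xs.append(x_next)

# Stack rows [x_k^T u_k^T] into X and x_{k+1}^T into Y as in (8),
# then solve the least squares problem (9).
X = np.hstack([np.array(xs[:-1]), np.array(us)])
Y = np.array(xs[1:])
Theta, *_ = np.linalg.lstsq(X, Y, rcond=None)
A_hat, B_hat = Theta[:nx].T, Theta[nx:].T
```

With enough exploration noise relative to the system's noise, `A_hat` and `B_hat` come out close to the true A and B; the identified model can then be plugged into (5) and (6) to obtain ^L.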
Figure 2. Two approximation methods. The data set on the left consists of two sequences of state values x_1 and x_2, a sequence of control actions u and a sequence of direct costs r. Both methods use this data to form the matrices X and Y, which are used in (9) to get ^Θ. At the top it is shown how the SI approach derives ^L from ^Θ; at the bottom this is shown for the QL approach.

The second method will be referred to as the Q-Learning (QL) approach, because it approximates the Q-function. The Q-function is the future cost as a function of x and u:

Q(x_k, u_k) = sum_{i=k}^{∞} r_i = [x_k^T u_k^T] [ S + A^T K A   A^T K B ; B^T K A   R + B^T K B ] [x_k; u_k]   (10)
            = φ_k^T H φ_k,   (11)

where K is the solution of the DARE and φ_k^T = [x_k^T u_k^T]. The optimal feedback can be found by setting the derivative of the Q-function with respect to u_k to zero. This results in u_k = -H_22^{-1} H_21 x_k, so L* = -H_22^{-1} H_21, where H_21 and H_22 denote the lower-left and lower-right blocks of H. It is clear that this result is identical to (6). The vector φ_k in (11) is formed from the data set, so the symmetric positive definite matrix H can be approximated. The approximation ^L of the optimal feedback then follows directly from the approximation ^H.

The approximation can be made because the Q-function in (11) can also be defined recursively: Q(x_k, u_k) = r_k + Q(x_{k+1}, L x_{k+1}). Write φ_k^T H φ_k in (11) as φ̄_k^T θ, where the vector φ̄_k consists of all quadratic combinations of the elements of φ_k and the vector θ of the corresponding entries of H. From the recursive definition of the Q-function it then follows that: r_k = φ̄_k^T θ - φ̄_{k+1}^T θ = (φ̄_k - φ̄_{k+1})^T θ. This resembles (7), so θ can be approximated in a similar way using (9). For this, X should be formed by all vectors φ̄_k^T - φ̄_{k+1}^T and Y by all r_k. Then ^Θ in (9) approximates θ (and not [A B]), so ^H and ^L can be derived from it.

By using (9) for the approximation, the complete data set is used at once. The advantage of this is that the results do not depend on additional choices, like an initial ^Θ or an iteration step size.
This means that differences in the resulting approximations are only due to how the data is processed into X and Y, and to the different use of additional "information" by both methods (the SI method uses S and R, while the QL method uses L). In this way a fair comparison can be made between the performances of both methods. In a simulation the true A and B are known, so the performances can also be compared with the true optimal feedback L*. Figure 1 shows how the two approximation methods are related to each other and to the optimal solution. Figure 2 shows how both methods derive the approximation from the data. Note that L x_{k+1} is used instead of u_{k+1}, so X, Y and ^Θ are different for the QL approach.
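The QL least-squares construction described above can be sketched as follows: form the quadratic feature vectors φ̄_k, regress r_k on φ̄_k - φ̄_{k+1}, rebuild the symmetric matrix ^H from θ, and read off ^L = -^H_22^{-1} ^H_21. Again, the system, feedback and noise levels are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical system, costs and stabilizing feedback (illustrative values).
A = np.array([[0.6, 0.2], [-0.1, 0.5]])
B = np.array([[0.0], [1.0]])
S, R = np.eye(2), np.eye(1)
L = np.array([[0.1, -0.4]])
nx, nu, N = 2, 1, 4000
sigma_v, sigma_e = 1e-4, 1e-1

def quad_features(z):
    """All quadratic combinations z_i z_j (i <= j) of phi = [x; u]."""
    return np.array([z[i] * z[j] for i in range(len(z)) for j in range(i, len(z))])

# Generate the closed-loop data set (1) with direct costs (2).
x = np.array([1.0, 1.0])
rows, costs = [], []
for k in range(N):
    u = L @ x + sigma_e * rng.standard_normal(nu)
    r = x @ S @ x + u @ R @ u
    x_next = A @ x + B @ u + sigma_v * rng.standard_normal(nx)
    u_next = L @ x_next                      # note: L x_{k+1} is used, not u_{k+1}
    phi_k = np.concatenate([x, u])
    phi_k1 = np.concatenate([x_next, u_next])
    rows.append(quad_features(phi_k) - quad_features(phi_k1))
    costs.append(r)
    x = x_next

# Solve r_k = (phibar_k - phibar_{k+1})^T theta via least squares (9).
theta, *_ = np.linalg.lstsq(np.array(rows), np.array(costs), rcond=None)

# Rebuild the symmetric H: diagonal entries H_ii = theta_ii, off-diagonal
# entries appear doubled in z^T H z, so H_ij = theta_ij / 2.
n = nx + nu
H = np.zeros((n, n))
idx = 0
for i in range(n):
    for j in range(i, n):
        H[i, j] = theta[idx] if i == j else theta[idx] / 2.0
        H[j, i] = H[i, j]
        idx += 1

# Feedback from the approximated Q-function: L_hat = -H_22^{-1} H_21
L_hat = -np.linalg.inv(H[nx:, nx:]) @ H[nx:, :nx]
```

With sufficient exploration the regressor matrix has full rank and `L_hat` is a stabilizing feedback derived without any explicit model of A and B.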
Figure 3. Two data sets. Both sets, generated according to (1) and (2), consist of two sequences of state values x_1 and x_2 (top), a sequence of control actions u (middle) and a sequence of direct costs r (bottom). Data set I, on the left, is generated with σ_e = 10^{-4} and a much larger σ_v. Data set II, on the right, is generated with σ_v = 10^{-4} and a much larger σ_e. Because these noise terms are relatively small, both data sets are very much alike.

4 Experiment

In this experiment we take a system as described in (1), with a two-dimensional state and a single control input; the numerical values of A, B and the initial feedback L are given in (12). The matrix A has all its eigenvalues within the unit disc and the pair {A, B} is controllable. The matrices S and R are identity matrices and {A, sqrt(S)} is observable, so (5) and (6) can be used to compute the optimal feedback L*, given in (13).

The system starts in its initial state x_0 and is controlled using the feedback L for N time steps. Both noise terms v and e are normally distributed and white. Two data sets are generated: data set I with σ_e = 10^{-4} and a much larger σ_v, and data set II with σ_v = 10^{-4} and a much larger σ_e. This results in two almost identical data sets, as shown in Figure 3.

The optimal feedback is approximated using both methods on both data sets. Table 1 shows the approximation results. For data set I both methods have almost the same outcome, which corresponds to the value of L: the feedback that was used to generate the data (12). For data set II both methods seem to approximate the value of L* from (13), which is the value that should be approximated. The results for data set II in Table 1 indicate that the SI method gives a better approximation of L* than the QL method.

The main goal of both approximation methods is to approximate the optimal feedback L*. The results in Table 1 show that this value is not always approximated: if the amount of exploration noise is much lower than the noise in the system, then it is the feedback used to generate the data that is approximated.
This is a problem for practical applications, because Figure 3 indicates that visual inspection of the data sets does not reveal whether the exploration is sufficient. So additional investigation is required into how both noise sources v and e influence the approximation of the optimal feedback.
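Although the two noise regimes are hard to tell apart by eye, the conditioning of X^T X exposes the difference. The sketch below mirrors the two-data-set setup with hypothetical values (σ_v and σ_e simply swapped between the regimes; system and feedback are illustrative):

```python
import numpy as np

# Hypothetical system and feedback (illustrative values, not the paper's).
A = np.array([[0.6, 0.2], [-0.1, 0.5]])
B = np.array([[0.0], [1.0]])
L = np.array([[0.1, -0.4]])

def regressor_conditioning(sigma_v, sigma_e, N=1000, seed=0):
    """Generate a closed-loop data set (1) and return cond(X^T X) for the
    SI regressor X with rows [x_k^T u_k^T]."""
    rng = np.random.default_rng(seed)
    x = np.array([1.0, 1.0])
    rows = []
    for _ in range(N):
        u = L @ x + sigma_e * rng.standard_normal(1)
        rows.append(np.concatenate([x, u]))
        x = A @ x + B @ u + sigma_v * rng.standard_normal(2)
    X = np.array(rows)
    return np.linalg.cond(X.T @ X)

# Regime I: system noise dominates; regime II: exploration dominates.
cond_I = regressor_conditioning(sigma_v=1e-2, sigma_e=1e-4)
cond_II = regressor_conditioning(sigma_v=1e-4, sigma_e=1e-2)
```

When the exploration noise is small relative to the system's noise, u_k stays nearly linear in x_k, the columns of X are nearly collinear and cond(X^T X) is much larger: the regression is close to singular even though the raw trajectories look almost identical.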
Table 1. Experimental results. For data sets I and II the values of σ_v and σ_e are given, together with the approximated optimal feedbacks ^L according to the SI and QL methods. The bottom row indicates which value is actually being approximated by both methods: L for data set I and L* for data set II.

5 The Noise

The true optimal solution L* computed with (5) and (6) does not take the presence of noise into account. The noise sources v and e are only present when the system is used to generate the data set. This means that a mismatch between ^L and L* can be a consequence of the noise. Therefore the different influences of the noise sources v and e will be investigated.

The exploration noise e is essential for making the approximation. This can easily be seen by looking at the matrix X^T X in (9). This matrix should have full rank to be invertible (note that this also means that the number of time steps N should be large enough). Without exploration noise the control vector u_k = L x_k is linearly dependent on the state vector. For both approximation methods this has the consequence that not all rows (and also columns) of X^T X are linearly independent. So the role of the exploration noise is to prevent the matrix X^T X from becoming singular. The influence of e on the entries of the matrix X^T X is smaller for the QL method, so this method requires more exploration than the SI method for X^T X to be nonsingular.

The system's noise v is not required to obtain the approximations. In fact (9) minimizes the mismatch between Y and X ^Θ that is caused by V. So if V is zero, ^Θ will be a perfect approximation of the system or the Q-function. The matrix V consists of all terms v_k^T when the SI method is applied, but for the QL method V is different. This can be seen by looking at (10), where the noise v is not included. If it is included, the value of the Q-function becomes:

Q(x_k, u_k) = [x_k^T u_k^T] [ S + A^T K A   A^T K B ; B^T K A   R + B^T K B ] [x_k; u_k] + v_k^T K v_k,   (14)

so r_k = (φ̄_k - φ̄_{k+1})^T θ + v_k^T K v_k - v_{k+1}^T K v_{k+1}.
This means that for the QL method V consists of all terms v_k^T K v_k - v_{k+1}^T K v_{k+1}, making the minimization performed by (9) different.

There is another way in which the noise influences the approximations. To see this, use (1) to express the value of the state at time k:

x_k = D^k x_0 + sum_{i=0}^{k-1} D^{k-i-1} (B e_i + v_i),   (15)

where D = A + BL represents the closed loop. So the values of the state vector x_k and the control vector u_k depend on the initial state x_0 and all previous noise values of e and v. This is a consequence of generating the data with the closed-loop system (1). The noise that enters the system re-enters it through the feedback, causing an additional disturbance that is no longer white like e and v, but "colored" by D. This can cause a bias in the approximation. It also makes it impossible to derive the variance σ_v of the system's noise without knowledge of the closed loop D.

6 Experiment

In this experiment we continue investigating the consequences of e and v on the approximation ^L, for the SI and QL approaches. Two different setups are chosen such that (5) and (6) can
be used to compute the optimal feedback L*. The number of samples N and the amount of system's noise σ_v differ between these setups. The amount of exploration noise σ_e is varied over many orders of magnitude; both noise sequences, however, are kept the same (except for the scale σ_e of the exploration noise). This is repeated 5 times for both systems and both approaches. The total cost obtained when ^L is used is divided by the total cost obtained when L* is used; this is the relative performance.

Figure 4. The relative performance of the SI approach (dashed lines) and the QL approach (solid lines) as a function of σ_e. The vertical dotted line indicates σ_e = σ_v. The left and right figures correspond to the two setups, which differ in the state dimension n_x, the system's noise σ_v and the number of samples N.

Figure 4 shows the relative performance as a function of σ_e for both setups (footnote 5). With the increase of σ_e from a very small value, the relative performance and the approximation ^L go through four different types of results (footnote 6):

I. The matrix X^T X is singular, so no feedback can be computed. For the SI approach this happens only for extremely small σ_e (not shown in Figure 4); for the QL approach it happens for σ_e < 10^{-7}. So QL requires more exploration for X^T X to be nonsingular.

II. Figure 4 shows that for low values of σ_e, both methods give the same constant relative performance. In this case the feedback that was used to generate the data is approximated as the optimal feedback, so ^L ≈ L, as in the first experiment. For SI this result can be explained by using ^B instead of B and ^D - ^B L instead of A in (6). Clearly ^B is much too large close to the singularity, making (6) result in L (the terms with ^B dominate, and -(^B^T ^K ^B)^{-1} ^B^T ^K ^B L = L). The QL approach does not use (6), so a similar explanation cannot be given. (Although conceptually it makes sense that for a low amount of exploration, the presence of L is very dominant in the data set.)

III. Between the approximations ^L ≈ L and ^L ≈ L* there is a transition area, where the relative performance can be quite good or very bad.
The approximated feedback can even result in an unstable closed loop. Figure 4 shows that for SI this happens just before σ_e = σ_v and for QL just after σ_e = σ_v.

IV. This is the only type of result that is useful! The relative performance is (very close to) one, so ^L ≈ L*. Although it is not clear in Figure 4, the SI result is slightly closer to one than the QL result. Also, the QL approach requires a higher σ_e to get to this type of result.

The results in Figure 4 are obtained for different numbers of data samples N. But even if N is increased much further, the relative performance still shows the four types of results for the same values of σ_e. From this it can be concluded that the amount of exploration noise is much more important than N.

(5) The experiment was repeated for many different configurations, and the results were consistent with the two shown in Figure 4.
(6) When σ_v is very small or zero, there are no type II and III results. This corresponds to the configurations of the convergence proofs in [4] and [5], where the system's noise v is not taken into account.
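The relative performance used above, i.e. the total cost under an approximated feedback divided by the total cost under L*, can be sketched as follows. The system is hypothetical, and the "approximation" here is simply a perturbed feedback standing in for ^L:

```python
import numpy as np

# Hypothetical stable system and quadratic costs (illustrative values).
A = np.array([[0.6, 0.2], [-0.1, 0.5]])
B = np.array([[0.0], [1.0]])
S, R = np.eye(2), np.eye(1)

# Optimal feedback from the DARE (5) and (6).
K = S.copy()
for _ in range(2000):
    G = np.linalg.inv(B.T @ K @ B + R)
    K = A.T @ (K - K @ B @ G @ B.T @ K) @ A + S
L_star = -np.linalg.inv(B.T @ K @ B + R) @ B.T @ K @ A

def total_cost(L, x0=np.array([1.0, 1.0]), N=2000):
    """Noise-free total cost (3) of feedback L, starting from x0."""
    x, J = x0, 0.0
    for _ in range(N):
        u = L @ x
        J += x @ S @ x + u @ R @ u
        x = A @ x + B @ u
    return J

L_hat = L_star + np.array([[0.05, -0.05]])   # stand-in for an approximated feedback
relative_performance = total_cost(L_hat) / total_cost(L_star)
```

For any stabilizing feedback other than L* the ratio exceeds one; a value close to one corresponds to the useful type IV result, while an unstable ^L makes the cost blow up.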
From a manufacturing point of view, the type I (no solution) and II (no improvement) results are not very interesting; only the type III and IV results are important. The type III result has to be avoided. Although it can result in an improvement, the only certainty about this improvement can be obtained by testing it on the real manufacturing process, and in case of an unstable closed loop this can damage the process, which might be very expensive. So the amount of exploration should be high enough to avoid the type III result and guarantee a type IV result. The type IV result is always an improvement (except when L = L*).

The problem in controlling a manufacturing process is that adding exploration noise results in variations in the products. This makes exploring expensive, which means that the amount of exploration noise should be minimized. Figure 4 shows that the minimal value of σ_e for which a type IV result is obtained is lower for the SI approach. So the SI approach requires less exploration to give an improvement of the feedback, which makes it more appropriate for manufacturing than the QL approach. Further research will focus on minimizing the exploration requirements of the QL approach. Also, ways to derive the minimal amount of exploration from the data will be investigated.

7 Conclusion

In this paper we showed that the linear quadratic regulation framework provides the opportunity to extend the applicability of reinforcement learning to industrial manufacturing processes. We did this by rewriting the Q-learning approach into a format that made a fair comparison possible with the more conventional system identification approach from control theory. The experiment showed the importance of sufficient exploration in data generation: if the disturbance in the system exceeds the amount of exploration noise, it is not the optimal feedback that is approximated, but the feedback used to generate the data.
But more importantly, the experiment showed that the conventional approach requires less exploration than the reinforcement learning approach, making it more appropriate for manufacturing processes.

References

[1] D.P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, 1987.
[2] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1997.
[3] S.J. Bradtke. Reinforcement learning applied to linear quadratic regulation. In Stephen Jose Hanson, Jack D. Cowan, and C. Lee Giles, editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann, San Mateo, CA, 1993.
[4] S.J. Bradtke, B.E. Ydstie, and A.G. Barto. Adaptive linear quadratic control using policy iteration. Technical Report CMPSCI 94-49, University of Massachusetts, June 1994.
[5] T. Landelius. Reinforcement Learning and Distributed Local Model Synthesis. PhD thesis, Linköping University, 1997.
[6] D.A. Sofge and D.A. White. Neural network based process optimization and control. In Proceedings of the 29th Conf. on Decision and Control, Honolulu, Hawaii, 1990. IEEE.
[7] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[8] P.J. Werbos. Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks, 3:179-189, 1990.
Advances in Neural Information Processing Systems 2 Linearly-solvable Markov decision problems Emanuel Todorov Department of Cognitive Science University of California San Diego todorov@cogsci.ucsd.edu
More information1/sqrt(B) convergence 1/B convergence B
The Error Coding Method and PICTs Gareth James and Trevor Hastie Department of Statistics, Stanford University March 29, 1998 Abstract A new family of plug-in classication techniques has recently been
More informationA Tour of Reinforcement Learning The View from Continuous Control. Benjamin Recht University of California, Berkeley
A Tour of Reinforcement Learning The View from Continuous Control Benjamin Recht University of California, Berkeley trustable, scalable, predictable Control Theory! Reinforcement Learning is the study
More informationQ-Learning for Markov Decision Processes*
McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of
More informationChapter 9 Observers, Model-based Controllers 9. Introduction In here we deal with the general case where only a subset of the states, or linear combin
Lectures on Dynamic Systems and Control Mohammed Dahleh Munther A. Dahleh George Verghese Department of Electrical Engineering and Computer Science Massachuasetts Institute of Technology c Chapter 9 Observers,
More information1 Introduction Independent component analysis (ICA) [10] is a statistical technique whose main applications are blind source separation, blind deconvo
The Fixed-Point Algorithm and Maximum Likelihood Estimation for Independent Component Analysis Aapo Hyvarinen Helsinki University of Technology Laboratory of Computer and Information Science P.O.Box 5400,
More informationLinear Riccati Dynamics, Constant Feedback, and Controllability in Linear Quadratic Control Problems
Linear Riccati Dynamics, Constant Feedback, and Controllability in Linear Quadratic Control Problems July 2001 Ronald J. Balvers Douglas W. Mitchell Department of Economics Department of Economics P.O.
More informationReinforcement Learning In Continuous Time and Space
Reinforcement Learning In Continuous Time and Space presentation of paper by Kenji Doya Leszek Rybicki lrybicki@mat.umk.pl 18.07.2008 Leszek Rybicki lrybicki@mat.umk.pl Reinforcement Learning In Continuous
More informationMarkov Decision Processes and Dynamic Programming
Markov Decision Processes and Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course In This Lecture A. LAZARIC Markov Decision Processes
More informationDIFFERENTIAL TRAINING OF 1 ROLLOUT POLICIES
Appears in Proc. of the 35th Allerton Conference on Communication, Control, and Computing, Allerton Park, Ill., October 1997 DIFFERENTIAL TRAINING OF 1 ROLLOUT POLICIES by Dimitri P. Bertsekas 2 Abstract
More informationOn GMW designs and a conjecture of Assmus and Key Thomas E. Norwood and Qing Xiang Dept. of Mathematics, California Institute of Technology, Pasadena,
On GMW designs and a conjecture of Assmus and Key Thomas E. Norwood and Qing iang Dept. of Mathematics, California Institute of Technology, Pasadena, CA 91125 June 24, 1998 Abstract We show that a family
More informationLinear Regression and Its Applications
Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start
More informationIMPROVED MPC DESIGN BASED ON SATURATING CONTROL LAWS
IMPROVED MPC DESIGN BASED ON SATURATING CONTROL LAWS D. Limon, J.M. Gomes da Silva Jr., T. Alamo and E.F. Camacho Dpto. de Ingenieria de Sistemas y Automática. Universidad de Sevilla Camino de los Descubrimientos
More informationError Empirical error. Generalization error. Time (number of iteration)
Submitted to Neural Networks. Dynamics of Batch Learning in Multilayer Networks { Overrealizability and Overtraining { Kenji Fukumizu The Institute of Physical and Chemical Research (RIKEN) E-mail: fuku@brain.riken.go.jp
More informationStable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems
Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems Thore Graepel and Nicol N. Schraudolph Institute of Computational Science ETH Zürich, Switzerland {graepel,schraudo}@inf.ethz.ch
More informationSuppose that we have a specific single stage dynamic system governed by the following equation:
Dynamic Optimisation Discrete Dynamic Systems A single stage example Suppose that we have a specific single stage dynamic system governed by the following equation: x 1 = ax 0 + bu 0, x 0 = x i (1) where
More informationReinforcement Learning, Neural Networks and PI Control Applied to a Heating Coil
Reinforcement Learning, Neural Networks and PI Control Applied to a Heating Coil Charles W. Anderson 1, Douglas C. Hittle 2, Alon D. Katz 2, and R. Matt Kretchmar 1 1 Department of Computer Science Colorado
More informationEconomics 472. Lecture 10. where we will refer to y t as a m-vector of endogenous variables, x t as a q-vector of exogenous variables,
University of Illinois Fall 998 Department of Economics Roger Koenker Economics 472 Lecture Introduction to Dynamic Simultaneous Equation Models In this lecture we will introduce some simple dynamic simultaneous
More informationReinforcement Learning II. George Konidaris
Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2017 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes
More informationCoarticulation in Markov Decision Processes
Coarticulation in Markov Decision Processes Khashayar Rohanimanesh Department of Computer Science University of Massachusetts Amherst, MA 01003 khash@cs.umass.edu Sridhar Mahadevan Department of Computer
More informationOptimal Polynomial Control for Discrete-Time Systems
1 Optimal Polynomial Control for Discrete-Time Systems Prof Guy Beale Electrical and Computer Engineering Department George Mason University Fairfax, Virginia Correspondence concerning this paper should
More informationAdaptive State Feedback Nash Strategies for Linear Quadratic Discrete-Time Games
Adaptive State Feedbac Nash Strategies for Linear Quadratic Discrete-Time Games Dan Shen and Jose B. Cruz, Jr. Intelligent Automation Inc., Rocville, MD 2858 USA (email: dshen@i-a-i.com). The Ohio State
More informationReinforcement Learning II. George Konidaris
Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2018 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes
More informationLinear State Feedback Controller Design
Assignment For EE5101 - Linear Systems Sem I AY2010/2011 Linear State Feedback Controller Design Phang Swee King A0033585A Email: king@nus.edu.sg NGS/ECE Dept. Faculty of Engineering National University
More informationI D I A P. Online Policy Adaptation for Ensemble Classifiers R E S E A R C H R E P O R T. Samy Bengio b. Christos Dimitrakakis a IDIAP RR 03-69
R E S E A R C H R E P O R T Online Policy Adaptation for Ensemble Classifiers Christos Dimitrakakis a IDIAP RR 03-69 Samy Bengio b I D I A P December 2003 D a l l e M o l l e I n s t i t u t e for Perceptual
More informationPartially observable Markov decision processes. Department of Computer Science, Czech Technical University in Prague
Partially observable Markov decision processes Jiří Kléma Department of Computer Science, Czech Technical University in Prague https://cw.fel.cvut.cz/wiki/courses/b4b36zui/prednasky pagenda Previous lecture:
More informationAlternative Characterization of Ergodicity for Doubly Stochastic Chains
Alternative Characterization of Ergodicity for Doubly Stochastic Chains Behrouz Touri and Angelia Nedić Abstract In this paper we discuss the ergodicity of stochastic and doubly stochastic chains. We define
More informationonly nite eigenvalues. This is an extension of earlier results from [2]. Then we concentrate on the Riccati equation appearing in H 2 and linear quadr
The discrete algebraic Riccati equation and linear matrix inequality nton. Stoorvogel y Department of Mathematics and Computing Science Eindhoven Univ. of Technology P.O. ox 53, 56 M Eindhoven The Netherlands
More information1. Introduction Let the least value of an objective function F (x), x2r n, be required, where F (x) can be calculated for any vector of variables x2r
DAMTP 2002/NA08 Least Frobenius norm updating of quadratic models that satisfy interpolation conditions 1 M.J.D. Powell Abstract: Quadratic models of objective functions are highly useful in many optimization
More informationCS599 Lecture 1 Introduction To RL
CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming
More informationChristopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015
Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)
More informationReinforcement Learning
CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act
More informationFunction Approximation for Continuous Constrained MDPs
Function Approximation for Continuous Constrained MDPs Aditya Undurti, Alborz Geramifard, Jonathan P. How Abstract In this work we apply function approximation techniques to solve continuous, constrained
More informationTemporal Difference Learning & Policy Iteration
Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.
More informationThe Bias-Variance dilemma of the Monte Carlo. method. Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel
The Bias-Variance dilemma of the Monte Carlo method Zlochin Mark 1 and Yoram Baram 1 Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel fzmark,baramg@cs.technion.ac.il Abstract.
More informationR. Balan. Splaiul Independentei 313, Bucharest, ROMANIA D. Aur
An On-line Robust Stabilizer R. Balan University "Politehnica" of Bucharest, Department of Automatic Control and Computers, Splaiul Independentei 313, 77206 Bucharest, ROMANIA radu@karla.indinf.pub.ro
More informationOutline. 1 Linear Quadratic Problem. 2 Constraints. 3 Dynamic Programming Solution. 4 The Infinite Horizon LQ Problem.
Model Predictive Control Short Course Regulation James B. Rawlings Michael J. Risbeck Nishith R. Patel Department of Chemical and Biological Engineering Copyright c 217 by James B. Rawlings Outline 1 Linear
More informationApproximating Q-values with Basis Function Representations. Philip Sabes. Department of Brain and Cognitive Sciences
Approximating Q-values with Basis Function Representations Philip Sabes Department of Brain and Cognitive Sciences Massachusetts Institute of Technology Cambridge, MA 39 sabes@psyche.mit.edu The consequences
More informationChapter 30 Minimality and Stability of Interconnected Systems 30.1 Introduction: Relating I/O and State-Space Properties We have already seen in Chapt
Lectures on Dynamic Systems and Control Mohammed Dahleh Munther A. Dahleh George Verghese Department of Electrical Engineering and Computer Science Massachuasetts Institute of Technology 1 1 c Chapter
More informationGeorey J. Gordon. Carnegie Mellon University. Pittsburgh PA Bellman-Ford single-destination shortest paths algorithm
Stable Function Approximation in Dynamic Programming Georey J. Gordon Computer Science Department Carnegie Mellon University Pittsburgh PA 53 ggordon@cs.cmu.edu Abstract The success of reinforcement learning
More informationand 3) a value function update using exponential eligibility traces is more efficient and stable than that based on Euler approximation. The algorithm
Reinforcement Learning In Continuous Time and Space Kenji Doya Λ ATR Human Information Processing Research Laboratories 2-2 Hikaridai, Seika, Soraku, Kyoto 619-288, Japan Neural Computation, 12(1), 219-245
More informationAn average case analysis of a dierential attack. on a class of SP-networks. Distributed Systems Technology Centre, and
An average case analysis of a dierential attack on a class of SP-networks Luke O'Connor Distributed Systems Technology Centre, and Information Security Research Center, QUT Brisbane, Australia Abstract
More informationMarkov Decision Processes With Delays and Asynchronous Cost Collection
568 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL 48, NO 4, APRIL 2003 Markov Decision Processes With Delays and Asynchronous Cost Collection Konstantinos V Katsikopoulos, Member, IEEE, and Sascha E Engelbrecht
More informationLecture 5 Linear Quadratic Stochastic Control
EE363 Winter 2008-09 Lecture 5 Linear Quadratic Stochastic Control linear-quadratic stochastic control problem solution via dynamic programming 5 1 Linear stochastic system linear dynamical system, over
More informationRobust Control 5 Nominal Controller Design Continued
Robust Control 5 Nominal Controller Design Continued Harry G. Kwatny Department of Mechanical Engineering & Mechanics Drexel University 4/14/2003 Outline he LQR Problem A Generalization to LQR Min-Max
More informationOn the Convergence of Optimistic Policy Iteration
Journal of Machine Learning Research 3 (2002) 59 72 Submitted 10/01; Published 7/02 On the Convergence of Optimistic Policy Iteration John N. Tsitsiklis LIDS, Room 35-209 Massachusetts Institute of Technology
More informationReinforcement. Function Approximation. Learning with KATJA HOFMANN. Researcher, MSR Cambridge
Reinforcement Learning with Function Approximation KATJA HOFMANN Researcher, MSR Cambridge Representation and Generalization in RL Focus on training stability Learning generalizable value functions Navigating
More informationStructured State Space Realizations for SLS Distributed Controllers
Structured State Space Realizations for SLS Distributed Controllers James Anderson and Nikolai Matni Abstract In recent work the system level synthesis (SLS) paradigm has been shown to provide a truly
More informationMarkov Decision Processes and Dynamic Programming
Markov Decision Processes and Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course How to model an RL problem The Markov Decision Process
More informationGaussian Processes for Regression. Carl Edward Rasmussen. Department of Computer Science. Toronto, ONT, M5S 1A4, Canada.
In Advances in Neural Information Processing Systems 8 eds. D. S. Touretzky, M. C. Mozer, M. E. Hasselmo, MIT Press, 1996. Gaussian Processes for Regression Christopher K. I. Williams Neural Computing
More informationComputer Vision Group Prof. Daniel Cremers. 14. Sampling Methods
Prof. Daniel Cremers 14. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric
More informationThe Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision
The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that
More informationOptimal Convergence in Multi-Agent MDPs
Optimal Convergence in Multi-Agent MDPs Peter Vrancx 1, Katja Verbeeck 2, and Ann Nowé 1 1 {pvrancx, ann.nowe}@vub.ac.be, Computational Modeling Lab, Vrije Universiteit Brussel 2 k.verbeeck@micc.unimaas.nl,
More informationApproximate active fault detection and control
Approximate active fault detection and control Jan Škach Ivo Punčochář Miroslav Šimandl Department of Cybernetics Faculty of Applied Sciences University of West Bohemia Pilsen, Czech Republic 11th European
More informationThe convergence limit of the temporal difference learning
The convergence limit of the temporal difference learning Ryosuke Nomura the University of Tokyo September 3, 2013 1 Outline Reinforcement Learning Convergence limit Construction of the feature vector
More informationRICE UNIVERSITY. System Identication for Robust Control. Huipin Zhang. A Thesis Submitted. in Partial Fulfillment of the. Requirements for the Degree
RICE UNIVERSITY System Identication for Robust Control by Huipin Zhang A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree Master of Science Approved, Thesis Committee: Athanasios
More informationREGLERTEKNIK AUTOMATIC CONTROL LINKÖPING
Generating state space equations from a bond graph with dependent storage elements using singular perturbation theory. Krister Edstrom Department of Electrical Engineering Linkoping University, S-58 83
More information6 Reinforcement Learning
6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,
More informationReinforcement Learning with Function Approximation. Joseph Christian G. Noel
Reinforcement Learning with Function Approximation Joseph Christian G. Noel November 2011 Abstract Reinforcement learning (RL) is a key problem in the field of Artificial Intelligence. The main goal is
More informationTraining Guidelines for Neural Networks to Estimate Stability Regions
Training Guidelines for Neural Networks to Estimate Stability Regions Enrique D. Ferreira Bruce H.Krogh Department of Electrical and Computer Engineering Carnegie Mellon University 5 Forbes Av., Pittsburgh,
More informationOnline solution of the average cost Kullback-Leibler optimization problem
Online solution of the average cost Kullback-Leibler optimization problem Joris Bierkens Radboud University Nijmegen j.bierkens@science.ru.nl Bert Kappen Radboud University Nijmegen b.kappen@science.ru.nl
More informationDirect Gradient-Based Reinforcement Learning: I. Gradient Estimation Algorithms
Direct Gradient-Based Reinforcement Learning: I. Gradient Estimation Algorithms Jonathan Baxter and Peter L. Bartlett Research School of Information Sciences and Engineering Australian National University
More informationRobotics. Control Theory. Marc Toussaint U Stuttgart
Robotics Control Theory Topics in control theory, optimal control, HJB equation, infinite horizon case, Linear-Quadratic optimal control, Riccati equations (differential, algebraic, discrete-time), controllability,
More information