In: Proc. BENELEARN-98, 8th Belgian-Dutch Conference on Machine Learning, pp. 9-46, 1998

Linear Quadratic Regulation using Reinforcement Learning

Stephan ten Hagen* and Ben Kröse
Department of Mathematics, Computer Science, Physics and Astronomy, University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam
email: stephanh@wins.uva.nl

* Supported by the Dutch technology foundation STW.

Abstract

In this paper we describe a possible way to make reinforcement learning more applicable in the context of industrial manufacturing processes. We achieve this by formulating the optimization task in the linear quadratic regulation framework, for which a conventional control-theoretic solution exists. By rewriting the Q-learning approach as a linear least squares approximation problem, we can make a fair comparison between the resulting approximation and that of the conventional system identification approach. Our experiment shows that the conventional approach performs slightly better. We also show that the amount of exploration noise, added during the generation of the data, plays a crucial role in the outcome of both approaches.

1 Introduction

Reinforcement Learning (RL) is a trial-based method for optimizing the interaction with an environment or the control of a system [][7]. The optimization is performed by approximating the future sum of evaluations and determining the feedback that minimizes or maximizes this approximation. We are particularly interested in industrial applications, such as the optimization of the control of manufacturing processes. In [6] this is demonstrated for the manufacturing of thermoplastic composite structures. In spite of this successful demonstration, the use of reinforcement learning to control manufacturing processes is still not commonplace, because some barriers remain.

The first barrier is that manufacturers are very careful with respect to novel control techniques. Only techniques that are fully understood and guaranteed to work are used. The problems encountered by the manufacturers are too specific to be related to existing successful demonstrations of certain techniques. Impressive demonstrations are not sufficient to convince manufacturers of the usefulness of these techniques; for this a solid theoretical underpinning is required.

The second barrier lies in the reinforcement learning field itself. Most of the RL applications described are maze-like navigation tasks, scheduling tasks or obstacle avoidance tasks for mobile robots. This has resulted in a theoretical understanding of discrete state space systems, in particular those that fit in a Markov decision framework. A consequence of this is that the main theoretical results and convergence guarantees apply only to Markov decision processes. But in a realistic manufacturing environment most information about the "state" of the process comes from measurements. These measurements form a continuous state space for the manufacturing process, for which the theoretical reinforcement learning results are not valid. So in order to make reinforcement learning techniques applicable to manufacturing processes, the theoretical results should be extended to continuous state space systems.

In order to arrive at an RL approach that is more generally applicable to manufacturing processes, we adopt the Linear Quadratic Regulation (LQR) framework from control theory. LQR is an optimization task in which the system is assumed to be linear and the evaluation is a quadratic cost function. For this the optimal feedback can be computed when the system and the cost are completely known []. The combination of RL and LQR was first described in [8].

In [] this framework has been described as a possible extension of RL algorithms to problems with continuous state and action spaces. Convergence of RL applied to LQR can also be proven [4][5]. These convergence proofs rely on a contraction towards the optimal feedback, which takes place if the amount of exploration and the number of time steps are large enough. The difficulty is that in practice it is never known whether the amount of exploration is sufficient.

In this paper we use the LQR framework to compare Q-Learning (QL), a particular kind of RL, with System Identification (SI). In QL the optimization is based on an approximation of the sum of future evaluations as a function of the state and action. According to [][4] the optimal feedback can be derived using QL, without having to know the system. However, the same data used by QL can also be used for SI. This results in an approximation of the system, which can be used to compute the optimal feedback. We formulate the learning methods of the QL and the SI approach in such a way that a linear least squares approximation method can be used. In this way a fair comparison is possible, which enables us to see which method performs best. We investigate experimentally whether both methods are able to approximate the optimal feedback. In particular we look at how the exploration noise and the system's noise influence the outcome of the approximation. The experiment shows that the amount of exploration noise has a strong impact on the resulting approximations.

2 Linear Quadratic Regulation

The Linear Quadratic Regulation (LQR) framework consists of a linear system with a linear feedback. Let x ∈ IR^{n_x} be the state vector and u ∈ IR^{n_u} the vector of control actions; then the system and the controller are given by

$$x_{k+1} = A x_k + B u_k + v_k \quad \text{and} \quad u_k = L x_k + e_k, \qquad (1)$$

where k indicates the time step and the matrices A, B and L have the proper dimensions. All elements of the vectors v_k and e_k are normally distributed white noise with variances σ_v and σ_e. The vector v represents the system's noise and e represents the exploration (or excitation) noise.

In the LQR framework the direct cost is a quadratic function of x and u (the superscript T indicates the transpose):

$$r(x_k, u_k) = x_k^T S x_k + u_k^T R u_k. \qquad (2)$$

The matrix S is symmetric positive semidefinite and R is symmetric positive definite. The total cost is defined as

$$J = \sum_{i=0}^{\infty} r_i. \qquad (3)$$

The objective of LQR is to find the optimal feedback L* that minimizes J. In order to find L*, the sum of future costs can be expressed as a function of x:

$$J(x_k) = \sum_{i=k}^{\infty} r_i = x_k^T K x_k. \qquad (4)$$

The symmetric positive definite matrix K is the unique solution of the Discrete Algebraic Riccati Equation (DARE):

$$K = A^T \left( K - K B (B^T K B + R)^{-1} B^T K \right) A + S. \qquad (5)$$

The solution of this equation can be found iteratively if the pair {A, B} is controllable, S is symmetric positive semidefinite, R is symmetric positive definite, and the pair {A, √S} (where √S is the matrix for which √S^T √S = S) is observable []. The optimal feedback L* is given by

$$L^* = -(B^T K B + R)^{-1} B^T K A. \qquad (6)$$
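As a concrete illustration of (5) and (6), the following sketch computes K and L* numerically. It is not code from the paper: the matrices A, B, S and R are made-up placeholders, and SciPy's DARE solver stands in for the iterative solution mentioned above.

```python
# A minimal sketch of computing the LQR solution of equations (5) and (6).
# The matrices below are illustrative placeholders, not the paper's values.
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.9, 0.1],
              [0.0, 0.8]])      # assumed stable system matrix (example only)
B = np.array([[0.0],
              [1.0]])           # single control input
S = np.eye(2)                   # state cost, symmetric positive semidefinite
R = np.eye(1)                   # control cost, symmetric positive definite

# K is the unique symmetric positive definite solution of the DARE (5).
K = solve_discrete_are(A, B, S, R)

# Optimal feedback from (6): L* = -(B^T K B + R)^{-1} B^T K A.
L_star = -np.linalg.solve(B.T @ K @ B + R, B.T @ K @ A)
print("L* =", L_star)
```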

The problem with LQR is that it requires exact knowledge of the system's model (1), which in practical situations is not available. Using the feedback in (1) it is possible to control the system and generate data. In the next section two methods will be described that obtain an approximation L̂ of the optimal feedback, based on the generated data. The first method approximates the system's model and then uses (5) and (6). The second method approximates the future cost as a function of x and u and derives L̂ from this.

3 Two Approximation Methods

The system starts at the initial state x_0 and is controlled for N time steps using the feedback L. All state vectors from x_0 to x_N and all control vectors from u_0 to u_{N-1} form the data set. Based on this data set the approximation L̂ of the optimal feedback should be derived. Two different methods to do this will be presented.

The first method will be referred to as the System Identification (SI) approach, because it identifies the parameters of the matrices A and B, resulting in Â and B̂. Rewrite (1) as

$$x_{k+1}^T = \begin{pmatrix} x_k^T & u_k^T \end{pmatrix} \begin{pmatrix} A^T \\ B^T \end{pmatrix} + v_k^T. \qquad (7)$$

This makes it possible to stack the vectors of all time steps into matrices and write (7) for the complete data set:

$$Y = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} = \begin{pmatrix} x_0^T & u_0^T \\ x_1^T & u_1^T \\ \vdots & \vdots \\ x_{N-1}^T & u_{N-1}^T \end{pmatrix} \hat{\Theta} + \begin{pmatrix} v_0^T \\ v_1^T \\ \vdots \\ v_{N-1}^T \end{pmatrix} = X \hat{\Theta} + V. \qquad (8)$$

In this expression Y = X Θ̂ + V, with Θ̂^T = (Â B̂). The least squares solution for Θ̂ is given by

$$\hat{\Theta} = (X^T X)^{-1} X^T Y. \qquad (9)$$

So Â and B̂ can be derived from Θ̂, and applying the DARE (5) and (6) will result in L̂.

Figure 1. L* and L̂. At the bottom the computation of L* is shown. At the top the two methods, SI and QL, used to derive L̂ are shown. The blocks indicate results and the arrows indicate the "information" required to derive the next result.
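The SI approach of (7)-(9) amounts to a single linear least squares fit followed by the DARE. Below is a minimal sketch, assuming the trajectory is available as arrays xs (states x_0, ..., x_N) and us (actions u_0, ..., u_{N-1}); the function and variable names are not from the paper.

```python
# A sketch of the SI approach: estimate A and B by linear least squares from
# one closed-loop trajectory (equations (7)-(9)), then apply the DARE and (6).
import numpy as np
from scipy.linalg import solve_discrete_are

def si_feedback(xs, us, S, R):
    """xs: (N+1, n_x) states, us: (N, n_u) actions; returns the estimate L_hat."""
    n_x, n_u = xs.shape[1], us.shape[1]
    X = np.hstack([xs[:-1], us])                    # rows [x_k^T  u_k^T], k = 0..N-1
    Y = xs[1:]                                      # rows x_{k+1}^T
    Theta, *_ = np.linalg.lstsq(X, Y, rcond=None)   # (9): Theta = (X^T X)^-1 X^T Y
    A_hat, B_hat = Theta[:n_x].T, Theta[n_x:].T     # Theta^T = (A_hat  B_hat)
    K_hat = solve_discrete_are(A_hat, B_hat, S, R)  # DARE (5) with the estimates
    return -np.linalg.solve(B_hat.T @ K_hat @ B_hat + R,
                            B_hat.T @ K_hat @ A_hat)
```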

Figure 2. Two approximation methods. The data set on the left consists of two sequences of state values x_1 and x_2, a sequence of control actions u and a sequence of direct costs r. Both methods use this data to form the matrices X and Y, which are used in (9) to obtain the least squares estimate. At the top it is shown how the SI approach derives L̂ from this estimate; at the bottom this is shown for the QL approach.

The second method will be referred to as the Q-Learning (QL) approach, because it approximates the Q-function. The Q-function is the future cost as a function of x and u, so

$$Q(x_k, u_k) = \sum_{i=k}^{\infty} r_i = \begin{pmatrix} x_k^T & u_k^T \end{pmatrix} \begin{pmatrix} S + A^T K A & A^T K B \\ B^T K A & R + B^T K B \end{pmatrix} \begin{pmatrix} x_k \\ u_k \end{pmatrix} \qquad (10)$$

$$= \begin{pmatrix} x_k^T & u_k^T \end{pmatrix} \begin{pmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{pmatrix} \begin{pmatrix} x_k \\ u_k \end{pmatrix} = \psi_k^T H \psi_k, \qquad (11)$$

where K is the solution of the DARE. L* can be found by setting the derivative of the Q-function with respect to u_k to zero. This results in u_k = -H_{22}^{-1} H_{21} x_k, so L* = -H_{22}^{-1} H_{21}. It is clear that this result is identical to (6). The value of ψ_k^T = (x_k^T u_k^T) in (11) is formed by the data set, so the symmetric positive definite matrix H can be approximated. The approximation L̂ of the optimal feedback then follows directly from the approximation Ĥ.

The approximation can be made because the Q-function in (10) can also be defined recursively as Q(x_k, u_k) = r_k + Q(x_{k+1}, L x_{k+1}). Write ψ_k^T H ψ_k in (11) as φ_k^T θ, where the vector φ_k consists of all quadratic combinations of the elements of ψ_k and the vector θ of the corresponding entries of H. From the recursive definition of the Q-function it then follows that r_k = φ_k^T θ - φ_{k+1}^T θ = (φ_k - φ_{k+1})^T θ. This resembles (7), so θ can be approximated in a similar way using (9). For this, X should be formed by all vectors φ_k^T - φ_{k+1}^T and Y should be formed by all r_k. (Note that L x_{k+1} is used instead of u_{k+1}, so X, Y and the least squares estimate are different for the QL approach.) The least squares solution (9) then approximates θ (and not (Â B̂)), so Ĥ and L̂ can be derived from it.

By using (9) for the approximation, the complete data set is used at once. The advantage of this is that the results do not depend on additional choices, like an initial estimate or an iteration step size. This means that differences in the resulting approximations are only due to how the data is processed into X and Y, and to the different use of additional "information" by the two methods (the SI method uses S and R, while the QL method uses L). In this way a fair comparison can be made between the performances of both methods. In a simulation the true A and B are known, so the performances can also be compared with the true optimal feedback L*. Figure 1 shows how the two approximation methods are related to each other and to the optimal solution. Figure 2 shows how both methods derive the approximation from the data.
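To make the QL construction of X, Y and θ concrete, the sketch below builds the quadratic features φ_k from ψ_k = (x_k, u_k), fits θ by least squares as in (9), rebuilds Ĥ and reads off L̂ = -Ĥ_{22}^{-1} Ĥ_{21}. The helper names and data layout are assumptions, not the paper's code.

```python
# A sketch of the QL approach: fit the Q-function parameters theta from
# r_k = (phi_k - phi_{k+1})^T theta, rebuild H, and read off L_hat.
import numpy as np

def quad_features(psi):
    """phi: all quadratic combinations psi_i * psi_j (i <= j) of psi = (x, u)."""
    m = len(psi)
    return np.array([psi[i] * psi[j] for i in range(m) for j in range(i, m)])

def ql_feedback(xs, us, rs, L):
    """xs: (N+1, n_x), us: (N, n_u), rs: (N,) direct costs, L: feedback used."""
    n_x, n_u = xs.shape[1], us.shape[1]
    m = n_x + n_u
    psis      = [np.concatenate([xs[k], us[k]]) for k in range(len(us))]
    psis_next = [np.concatenate([xs[k + 1], L @ xs[k + 1]]) for k in range(len(us))]
    X = np.array([quad_features(p) - quad_features(q)     # rows phi_k - phi_{k+1}
                  for p, q in zip(psis, psis_next)])
    theta, *_ = np.linalg.lstsq(X, rs, rcond=None)         # least squares as in (9)
    # Rebuild the symmetric matrix H from theta (off-diagonal terms were doubled).
    H = np.zeros((m, m))
    idx = 0
    for i in range(m):
        for j in range(i, m):
            H[i, j] = theta[idx] if i == j else theta[idx] / 2.0
            H[j, i] = H[i, j]
            idx += 1
    H22 = H[n_x:, n_x:]
    H21 = H[n_x:, :n_x]
    return -np.linalg.solve(H22, H21)                      # L_hat = -H22^{-1} H21
```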

Figure 3. Two data sets. Both sets, generated according to (1) and (2), consist of two sequences of state values x_1 and x_2 (top), a sequence of control actions u (middle) and a sequence of direct costs r (bottom). Data set I, on the left, is generated with σ_e = 10^{-4} and a larger σ_v. Data set II, on the right, is generated with σ_v = 10^{-4} and a larger σ_e. Because these noise terms are relatively small, both data sets are very much alike.

4 Experiment

In this experiment we take a system as described in (1), with

$$A = \begin{pmatrix} -.6 & -.4 \\ . & . \end{pmatrix}, \qquad B = \begin{pmatrix} . \\ . \end{pmatrix}, \qquad L = \begin{pmatrix} . & -. \end{pmatrix}. \qquad (12)$$

The matrix A has all its eigenvalues within the unit disc and the pair {A, B} is controllable. The matrices S and R are identity matrices and {A, √S} is observable. So (5) and (6) can be used to compute the optimal feedback L*. This results in

$$L^* = \begin{pmatrix} -.46 & .78 \end{pmatrix}. \qquad (13)$$

The system starts in its initial state x_0 and is controlled using the feedback L for N time steps. Both noise terms v and e are normally distributed and white. Two data sets are generated: data set I with σ_e = 10^{-4} and a larger σ_v, and data set II with σ_v = 10^{-4} and a larger σ_e. This results in two almost identical data sets, as shown in Figure 3.

The optimal feedback is approximated using both methods on both data sets. Table 1 shows the approximation results. For data set I both methods have almost the same outcome, which corresponds to the value of L, the feedback that was used to generate the data (12). For data set II both methods seem to approximate the value of L* from (13), which is the value that should be approximated. The results for data set II in Table 1 indicate that the SI method gives a better approximation of L* than the QL method.

The main goal of both approximation methods is to approximate the optimal feedback L*. The results in Table 1 show that this value is not always the one being approximated. If the amount of exploration noise is much lower than the noise in the system, then the feedback used to generate the data is approximated instead. This is a problem for practical applications, because Figure 3 indicates that visual inspection of the data sets does not reveal whether the exploration is sufficient. So additional investigation is required to see how the two noise sources v and e influence the approximation of the optimal feedback.
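A sketch of the data generation used in this experiment is given below, with placeholder matrices, initial state, horizon and noise levels (the paper's exact numerical values are not reproduced here).

```python
# A sketch of the data generation of Section 4: simulate the closed loop (1)
# with exploration noise and record states, actions and direct costs (2).
import numpy as np

def generate_data(A, B, L, S, R, x0, N, sigma_v, sigma_e, rng):
    n_x, n_u = B.shape
    xs = np.zeros((N + 1, n_x)); xs[0] = x0
    us = np.zeros((N, n_u))
    rs = np.zeros(N)
    for k in range(N):
        us[k] = L @ xs[k] + sigma_e * rng.standard_normal(n_u)   # u_k = L x_k + e_k
        rs[k] = xs[k] @ S @ xs[k] + us[k] @ R @ us[k]            # direct cost (2)
        xs[k + 1] = A @ xs[k] + B @ us[k] + sigma_v * rng.standard_normal(n_x)
    return xs, us, rs

# Example: two data sets that differ only in which noise source dominates
# (all numbers below are assumptions, not the paper's values).
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]]); B = np.array([[0.0], [1.0]])
S, R = np.eye(2), np.eye(1)
L = np.array([[0.1, -0.2]])                  # some stabilizing feedback (assumed)
data_I  = generate_data(A, B, L, S, R, np.ones(2), 1000, 1e-2, 1e-4, rng)
data_II = generate_data(A, B, L, S, R, np.ones(2), 1000, 1e-4, 1e-2, rng)
```

The resulting arrays can then be passed to the si_feedback and ql_feedback sketches given earlier.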

           Data set I        Data set II
σ_v        > 10^{-4}         10^{-4}
σ_e        10^{-4}           > 10^{-4}
SI: L̂      ( .49   -.75 )    ( -.467   .7 )
QL: L̂      ( .96   -.759 )   ( -.57   .78 )
L̂ ≈        L                 L*

Table 1. Experimental results. For the data sets I and II the values of σ_v and σ_e are given, together with the approximated optimal feedbacks according to the SI and the QL method. The bottom row indicates which value is actually being approximated by both methods.

5 The Noise

The true optimal solution L* computed with (5) and (6) does not take the presence of noise into account. The noise sources v and e are only present when the system is used to generate the data set. This means that a mismatch between L̂ and L* can be a consequence of the noise. Therefore the different influences of the noise sources v and e will be investigated.

The exploration noise e is essential for making the approximation. This can easily be seen by looking at the matrix X^T X in (9). This matrix should have full rank to be invertible (note that this also means that the number of time steps N should be large enough). Without exploration noise the control vector u_k = L x_k is linearly dependent on the state vector. For both approximation methods this has the consequence that not all the rows (and also the columns) are linearly independent. So the role of the exploration noise is to prevent the matrix X^T X from becoming singular. The influence of σ_e on the entries of the matrix X^T X is smaller for the QL method, so this method will require more exploration than the SI method for the matrix X^T X to be nonsingular.

The system's noise v is not required for the approximations. In fact (9) minimizes the mismatch between Y and X Θ̂, which is caused by V. So if V is zero, the result is a perfect approximation of the system or of the Q-function. When the SI method is applied, V consists of all terms v_k, but for the QL method V is different. This can be seen by looking at (10), where the noise v is not included. If it is included, the value of the Q-function becomes

$$Q(x_k, u_k) = \begin{pmatrix} x_k^T & u_k^T \end{pmatrix} \begin{pmatrix} S + A^T K A & A^T K B \\ B^T K A & R + B^T K B \end{pmatrix} \begin{pmatrix} x_k \\ u_k \end{pmatrix} + v_k^T K v_k, \qquad (14)$$

so r_k = (φ_k - φ_{k+1})^T θ + v_k^T K v_k - v_{k+1}^T K v_{k+1}. This means that V consists of all terms v_k^T K v_k - v_{k+1}^T K v_{k+1}, making the minimization performed by (9) different.

There is another way in which the noise influences the approximations. To see this, use (1) to get the value of the state at time k:

$$x_k = D^k x_0 + \sum_{i=0}^{k-1} D^{k-i-1} (B e_i + v_i), \qquad (15)$$

where D = A + BL represents the closed loop. So the values of the state vector x_k and the control vector u_k depend on the initial state x_0 and on all previous noise values of e and v. This is a consequence of generating the data with the closed-loop system (1). The noise that enters the system re-enters it through the feedback, causing an additional disturbance that is no longer white like e and v, but "colored" by D. This can cause a bias in the approximation. It also makes it impossible to derive the variance σ_v of the system's noise without knowledge of the closed loop D.
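The rank argument at the start of this section is easy to check numerically. In the sketch below (an illustration with made-up matrices, not the paper's setup), X^T X is rank deficient when σ_e = 0 because the u column of X is an exact linear combination of the state columns, and it becomes full rank as soon as exploration noise is added.

```python
# A small check of why exploration noise is needed: without it, u_k = L x_k
# exactly, so the columns of X in (8) are linearly dependent and X^T X is singular.
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0.9, 0.1], [0.0, 0.8]]); B = np.array([[0.0], [1.0]])
L = np.array([[0.1, -0.2]])

def build_X(sigma_e, N=200):
    x = np.ones(2); rows = []
    for _ in range(N):
        u = L @ x + sigma_e * rng.standard_normal(1)
        rows.append(np.concatenate([x, u]))
        x = A @ x + B @ u + 1e-2 * rng.standard_normal(2)   # system noise
    return np.array(rows)

for sigma_e in (0.0, 1e-2):
    X = build_X(sigma_e)
    print(sigma_e, np.linalg.matrix_rank(X.T @ X))   # rank 2 without exploration, 3 with
```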

6 Experiment

In this experiment we continue investigating the consequences of σ_e and σ_v for the approximation L̂, for the SI and the QL approach. Two different setups are chosen such that (5) and (6) can be used to compute the optimal feedback L*. The number of samples N and the amount of system's noise σ_v are different for these setups. The amount of exploration noise σ_e is varied over several orders of magnitude, from a very small value to well above σ_v. However, both noise sequences are kept the same (except for the scale σ_e of the exploration noise). This is repeated 5 times for both systems and both approaches. The total cost obtained when L̂ is used is divided by the total cost obtained when L* is used; this is the relative performance. Figure 4 shows the relative performance as a function of σ_e for both setups. (The experiment was repeated for many different configurations, and the results were consistent with the two shown in Figure 4.)

Figure 4. The relative performance of the SI approach (dashed lines) and the QL approach (solid lines) as a function of σ_e. The vertical dotted line indicates σ_e = σ_v. Left figure: σ_v = 10^{-5}. Right figure: a larger σ_v. The state dimension n_x and the number of samples N also differ between the two figures.

As σ_e is increased from a very small value, the relative performance and the approximation L̂ go through four different types of results. (When σ_v is very small or zero, there are no type II and III results. This corresponds to the configurations of the convergence proofs in [4] and [5], where the system's noise v is not taken into account.)

I. The matrix X^T X is singular, so no feedback can be computed. For the SI approach this happens only for extremely small σ_e (not shown in Figure 4), and for the QL approach for σ_e < 10^{-7}. So QL requires more exploration for X^T X to be nonsingular.

II. Figure 4 shows that for low values of σ_e both methods give the same constant relative performance. In this case the feedback that was used to generate the data is approximated as if it were the optimal feedback, so L̂ ≈ L, as in the first experiment (Section 4). For SI this result can be explained by using B̂ instead of B and D̂ - B̂L instead of A in (6). Close to the singularity B̂ is much too large, which makes (6) result in L. The QL approach does not use (6), so a similar explanation cannot be given (although conceptually it makes sense that for a low amount of exploration the presence of L is very dominant in the data set).

III. Between the approximations L̂ ≈ L and L̂ ≈ L* there is a transition area, where the relative performance can be quite good or very bad. The approximated feedback can even result in an unstable closed loop. Figure 4 shows that for SI this happens just before σ_e = σ_v and for QL just after σ_e = σ_v.

IV. This is the only type of result that is useful. The relative performance is (very close to) one, so L̂ ≈ L*. Although it is hard to see in Figure 4, the SI result is slightly closer to one than the QL result. Also, the QL approach requires a higher σ_e to reach this type of result.

The results in Figure 4 are obtained for different numbers of data samples N. But even if N is increased much further, the relative performance still shows the four types of results at the same values of σ_e. From this it can be concluded that the amount of exploration noise is much more important than N.
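The relative performance used in Figure 4 can be estimated by simulation: the total cost accumulated under L̂ is divided by the total cost accumulated under L*, with identical noise sequences for the two runs (hence the shared seed below). The horizon and noise level are assumptions, not the paper's values.

```python
# A sketch of the "relative performance" measure of Section 6.
import numpy as np

def total_cost(A, B, S, R, L, x0, N, sigma_v, rng):
    x, J = x0.copy(), 0.0
    for _ in range(N):
        u = L @ x                                   # no exploration during evaluation
        J += x @ S @ x + u @ R @ u                  # accumulate direct costs (2)
        x = A @ x + B @ u + sigma_v * rng.standard_normal(len(x))
    return J

def relative_performance(A, B, S, R, L_hat, L_star, x0, N=1000, sigma_v=1e-2, seed=0):
    rng1, rng2 = np.random.default_rng(seed), np.random.default_rng(seed)
    return (total_cost(A, B, S, R, L_hat, x0, N, sigma_v, rng1) /
            total_cost(A, B, S, R, L_star, x0, N, sigma_v, rng2))
```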

From a manufacturing point of view, the type I (no solution) and type II (no improvement) results are not very interesting. Only the type III and IV results are important. The type III result has to be avoided: although it can result in an improvement, the only way to be certain about this improvement is to test it on the real manufacturing process. In case of an unstable closed loop this can damage the process, which might be very expensive. So the amount of exploration should be high enough to avoid the type III result and guarantee a type IV result. The type IV result is always an improvement (except when L = L*).

The problem in controlling a manufacturing process is that adding exploration noise results in variations in the products. This makes exploring expensive, which means that the amount of exploration noise should be minimized. Figure 4 shows that the minimal value of σ_e for which a type IV result is obtained is lower for the SI approach. So the SI approach requires less exploration to give an improvement of the feedback. This makes the SI approach more appropriate for manufacturing than the QL approach. Further research will focus on minimizing the exploration requirements of the QL approach. Also, ways to derive the minimal amount of exploration from the data will be investigated.

7 Conclusion

In this paper we showed that the linear quadratic regulation framework provides an opportunity to extend the applicability of reinforcement learning to industrial manufacturing processes. We did this by rewriting the Q-learning approach into a format that allows a fair comparison with the more conventional system identification approach from control theory. The experiment showed the importance of sufficient exploration in the data generation: if the disturbance in the system exceeds the amount of exploration noise, it is not the optimal feedback that is approximated, but the feedback used to generate the data. More importantly, the experiment showed that the conventional approach requires less exploration than the reinforcement learning approach, making it more appropriate for manufacturing processes.

References

[1] D.P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models, pages 55-64. Prentice-Hall, 1987.
[2] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1996.
[3] S.J. Bradtke. Reinforcement learning applied to linear quadratic regulation. In Stephen Jose Hanson, Jack D. Cowan, and C. Lee Giles, editors, Advances in Neural Information Processing Systems, pages 295-302. Morgan Kaufmann, San Mateo, CA, 1993.
[4] S.J. Bradtke, B.E. Ydstie, and A.G. Barto. Adaptive linear quadratic control using policy iteration. Technical Report CMPSCI 94-49, University of Massachusetts, June 1994.
[5] T. Landelius. Reinforcement Learning and Distributed Local Model Synthesis. PhD thesis, Linköping University, 1997.
[6] D.A. Sofge and D.A. White. Neural network based process optimization and control. In Proceedings of the 29th Conference on Decision and Control, pages 7-76, Honolulu, Hawaii, 1990. IEEE.
[7] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[8] P.J. Werbos. Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks, 3:179-189, 1990.