System identification with information theoretic criteria

Centrum voor Wiskunde en Informatica

REPORT RAPPORT

System identification with information theoretic criteria

A.A. Stoorvogel and J.H. van Schuppen

Department of Operations Research, Statistics, and System Theory

BS-R953

Report BS-R953
CWI, P.O. Box 94079, 1090 GB Amsterdam, The Netherlands

CWI is the National Research Institute for Mathematics and Computer Science. CWI is part of the Stichting Mathematisch Centrum (SMC), the Dutch foundation for promotion of mathematics and computer science and their applications. SMC is sponsored by the Netherlands Organization for Scientific Research (NWO). CWI is a member of ERCIM, the European Research Consortium for Informatics and Mathematics.

Copyright © Stichting Mathematisch Centrum
P.O. Box 94079, 1090 GB Amsterdam (NL)
Kruislaan 413, 1098 SJ Amsterdam (NL)

System Identification with Information Theoretic Criteria

A.A. Stoorvogel
Department of Mathematics and Computing Science, Eindhoven University of Technology, P.O. Box 513, 5600 MB Eindhoven, The Netherlands

J.H. van Schuppen
CWI, P.O. Box 94079, 1090 GB Amsterdam, The Netherlands

Abstract

Attention is focused in this paper on the approximation problem of system identification with information theoretic criteria. For a class of problems it is shown that the criterion of mutual information rate is identical to the criterion of exponential-of-quadratic cost and to H∞ entropy. In addition the relation between the likelihood function and divergence is explored. As a consequence of these relations a parameter estimator is derived by four methods for the approximation of a stationary Gaussian process by the output of a Gaussian system. The unified framework for approximation problems of system identification formulated in this paper seems extremely useful.

AMS Subject Classification (1991): 93B30, 93B36, 93E12, 93E20, 94A17.
Keywords and Phrases: System Identification, Information Theory, Parameter Estimation, Maximum Likelihood Method, Linear-Exponential-Quadratic-Gaussian Control, H∞ Optimal Control.
Note: The text of this report will appear in the proceedings of the NATO Advanced Study Institute `From Identification to Learning' that was held in Como, Italy, in August 1994.

The research of dr. A.A. Stoorvogel has been made possible by a fellowship of the Royal Netherlands Academy of Sciences and Arts. His cooperation with the second author is supported in part by CWI.

1 Introduction

The purpose of this paper is to clarify and to explore the relationship between several approximation criteria in system identification. The original motivation for the research reported on in this paper was to clarify the relation between LEQG optimal stochastic control and H∞ optimal control with an entropy criterion, and to explore the use of this relation in system identification. The investigation

led to a study of information theoretic criteria. The unified framework that appeared seems quite useful for the theory of system identification.

A motivating problem of this paper is the choice of an approximation criterion for system identification. Until recently the approximation criteria used mainly in system identification of stochastic processes were: (i) the likelihood function (maximum likelihood method); (ii) a quadratic cost function (least squares method). For stationary Gaussian processes these criteria are identical. The resulting algorithms for parameter estimation with these criteria are well known.

The approximation criteria considered in this paper are: (i) mutual information rate; (ii) a weighted likelihood function and divergence rate between the measure associated with a simple model and that associated with an element in the model class. In the following sections these criteria are discussed in detail. It will be shown that the criterion of mutual information rate is related to that of the LEQG optimal stochastic control problem and to that of an H∞ optimal control problem with an entropy criterion. Mutual information is related to the information theoretic concept of divergence. The divergence concept is the same as the Kullback-Leibler pseudo-distance, which in turn is related to the likelihood function.

The discussion of the approximation criteria is rather general. Thus the discussion is formulated in terms of stationary Gaussian processes but can easily be extended to other classes of processes. In summary, in this paper several approximation criteria for system identification of stationary Gaussian processes will be introduced and their relationship will be discussed. Subsequently several parameter estimation problems are posed and solved.

The character of this paper is expository. The paper is primarily addressed to doctoral students in engineering, mathematics, and econometrics. The reader is assumed to be familiar with system and control theory, and with probability at a first year graduate level. The paper contains many definitions and elementary results that are known although not available in a concise and compact publication. The body of the paper contains the main results while the appendices contain details on concepts and theorems. A brief conference version of this paper appeared as [68].

A description of the contents by section follows. Section 2 contains a brief problem formulation. Parameter estimation by the approximation criterion of mutual information is the subject of Section 3 and by the approximation criterion of the likelihood function and divergence is the subject of Section 4. Section 5 presents concluding remarks. Appendix A introduces concepts from probability theory and stochastic processes and Appendix B concepts from system theory. Appendix C contains definitions from information theory, Appendix D formulas of information measures for Gaussian random variables, and Appendix E formulas of information measures for stationary Gaussian processes. Appendix F summarizes results for the LEQG optimal stochastic control problem and Appendix G the H∞ optimal control problem with an entropy criterion.

2 Problem formulation

System identification is a topic of the research area of systems and control. The problem of system identification is to obtain from data a mathematical model in the form of a dynamic system for a phenomenon. Examples of applications are system identification of a wind turbine, of a gas boiler, and of the flow of benzo-a-pyrene through the human body.

System identification of a phenomenon often proceeds by the following procedure:
(i) Physical modeling and selection of a model class of dynamic systems.
(ii) Experiment design, data collection, and preprocessing of the data.
(iii) Check on the identifiability of the parametrization of the model class.
(iv) Selection of an element in the model class that is an approximation of the data.
(v) Evaluation of the model determined and, possibly, redoing one or more of the previous steps.
In this paper attention is restricted to step (iv) of the above procedure, the selection of an element in the model class that is an approximation of the data.

In this paper attention is restricted to phenomena that, after preliminary physical or domain modeling, can be modeled by a stationary Gaussian process. See the appendices for concepts and terminology used in the body of the paper. It is well known from stochastic realization theory for stationary Gaussian processes that such a process can under certain conditions be represented as the output of a Gaussian system in state space form

  $x(t+1) = Ax(t) + Kv(t)$,  $y(t) = Cx(t) + v(t)$,

where $v : \Omega \times T \to R^p$ is a Gaussian white noise process. Such a process can also be represented by an ARMA (Auto-Regressive-Moving-Average) representation of the form

  $\sum_{i=0}^{n} a_i y(t-i) = \sum_{j=0}^{m} b_j v(t-j)$.

Below attention is sometimes restricted to an AR (Auto-Regressive) representation of the form

  $y(t) + \sum_{i=1}^{n} a_i y(t-i) = v(t)$.   (2.1)

With the notation

  $\theta = (-a_1, \ldots, -a_n)^T$,  $\varphi(t) = (y(t-1), \ldots, y(t-n))^T$,

this representation may be rewritten as

  $y(t) = \varphi(t)^T \theta + v(t)$.   (2.2)
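The regression form (2.2) is the representation used in the parameter estimation problems below. As a minimal illustration (the model coefficients, noise, and sample size are arbitrary assumptions, not taken from the paper), the following sketch simulates data from an AR(2) model, stacks the regressors $\varphi(t)$, and recovers $\theta$ by ordinary least squares.

```python
import numpy as np

# Minimal illustration of the regression form (2.2): simulate an assumed AR(2)
# model y(t) = phi(t)^T theta + v(t), phi(t) = (y(t-1), y(t-2))^T, and recover
# theta by ordinary least squares.  All numerical values are arbitrary.
rng = np.random.default_rng(0)
theta_true = np.array([-0.5, 0.3])            # theta = (-a1, -a2)^T
T = 500
y = np.zeros(T)
for t in range(2, T):
    phi_t = np.array([y[t - 1], y[t - 2]])
    y[t] = phi_t @ theta_true + rng.standard_normal()

# Stack the regressions y(t) = phi(t)^T theta + v(t) for t = 2, ..., T-1.
Phi = np.column_stack([y[1:T - 1], y[0:T - 2]])       # rows are phi(t)^T
theta_ls = np.linalg.lstsq(Phi, y[2:], rcond=None)[0]
print(theta_true, theta_ls)                           # estimates should be close
```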

Problem 2.1 Parameter estimation problem for time-invariant Gaussian systems. Consider observations of a stationary Gaussian process with values in $R^p$. Consider the model class of time-invariant Gaussian systems

  $x(t+1) = Ax(t) + Kv(t)$,  $y(t) = Cx(t) + v(t)$,   (2.3)

or, properly restricted, in AR representation,

  $y(t) = \varphi(t)^T \theta + v(t)$.   (2.4)

Select an element in the model class that approximates the observations by one or more of the approximation criteria mentioned in the following sections.

3 Approximation with mutual information

Successively this section presents the topics of mutual information, a parameter estimation problem, the relation between mutual information, LEQG, and H∞ entropy, parameter estimation by LEQG optimal stochastic control, and parameter estimation by the H∞ method.

3.1 Mutual information

There follows an introduction to the concept of mutual information. For a detailed treatment and references see the Appendices C, D, and E. The mutual information of two discrete-valued random variables $x : \Omega \to X = \{a_1, \ldots, a_n\}$ and $y : \Omega \to Y = \{b_1, \ldots, b_m\}$ is defined by the formula

  $J(x, y) = \sum_{i=1}^{n} \sum_{j=1}^{m} p_{xy}(a_i, b_j) \ln \frac{p_{xy}(a_i, b_j)}{p_x(a_i) p_y(b_j)}$,   (3.1)

where

  $p_{xy}(a_i, b_j) = P(\{x = a_i, y = b_j\})$,  $p_x(a_i) = P(\{x = a_i\})$,  $p_y(b_j) = P(\{y = b_j\})$.

The mutual information of two random variables describes the dependence between the associated probability measures. In general $0 \le J(x, y) \le +\infty$ and $J(x, y) = 0$ if and only if $x$ and $y$ are independent. For random variables with continuous values the mutual information is defined as the limit of the mutual information of discrete-valued random variables, in which the discrete random variables are required to approximate the continuous-valued random variables. In case the random variables $x, y$ are jointly Gaussian random variables with a strictly positive definite variance, or $(x, y) \in G(0, Q)$ with $Q = Q^T > 0$ where

  $Q = \begin{pmatrix} Q_x & Q_{xy} \\ Q_{xy}^T & Q_y \end{pmatrix}$,

then

  $J(x, y) = -\frac{1}{2} \ln \frac{\det Q}{\det Q_x \det Q_y}$.   (3.2)

For stationary processes the corresponding concept is the mutual information rate. Let $x : \Omega \times T \to R^m$, $y : \Omega \times T \to R^p$ be stationary stochastic processes on $T = Z$. The mutual information rate of $x$ and $y$ is defined by the formula

  $J(x, y) = \limsup_{n \to \infty} \frac{1}{2n+1} J\left( x|_{[-n,n]}, y|_{[-n,n]} \right)$,   (3.3)

where $x|_{[-n,n]}$ denotes the random variable defined by the restriction of the process $x$ to the interval $\{-n, \ldots, -1, 0, 1, \ldots, n\}$.

Let, for $T = Z$, $x : \Omega \times T \to R^m$ and $y : \Omega \times T \to R^p$ be two jointly stationary Gaussian processes that admit a spectral density. Denote their joint spectral density function by

  $\hat W : C \to C^{(m+p) \times (m+p)}$,  $\hat W = \begin{pmatrix} \hat W_x & \hat W_{xy} \\ \hat W_{xy}^T & \hat W_y \end{pmatrix}$.

It follows from [24] that the mutual information of the processes $x, y$ is invariant under scaling by nonsingular linear systems. Let $S_x : C \to C^{m \times m}$ and $S_y : C \to C^{p \times p}$ be minimal and square spectral factors of respectively $\hat W_x$ and $\hat W_y$, hence

  $\hat W_x(z) = S_x(z) S_x(z^{-1})^T$,  $\hat W_y(z) = S_y(z) S_y(z^{-1})^T$.

Define

  $S(z) = S_x(z)^{-1} \hat W_{xy}(z) S_y(z^{-1})^{-T}$.

The mutual information rate of these processes is given by the formula

  $J(x, y) = -\frac{1}{4\pi i} \oint_{D} \frac{1}{z} \ln \det\left[ I - S(z^{-1})^T S(z) \right] dz$.   (3.4)

Assume that the function $S$ admits a realization as a finite-dimensional linear system of the form

  $S(z) = C(zI - A)^{-1} B + D$,  $\mathrm{sp}(A) \subset C^-$,   (3.5)

and that the following set of relations admits a unique solution $Q \in R^{n \times n}$:

  $Q = Q^T \ge 0$,   (3.6)
  $Q = A^T Q A + C^T C + (A^T Q B + C^T D) N^{-1} (B^T Q A + D^T C)$,   (3.7)
  $N = I - D^T D - B^T Q B > 0$,   (3.8)
  $\mathrm{sp}\left(A + B N^{-1} (B^T Q A + D^T C)\right) \subset C^-$.   (3.9)

Then the mutual information rate of the processes $x, y$ is given by the formula

  $J(x, y) = -\frac{1}{2} \ln \det(I - B^T Q B - D^T D)$.   (3.10)
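Formula (3.2) is straightforward to evaluate numerically. The sketch below computes the mutual information of two jointly Gaussian random vectors from the blocks of their joint covariance matrix; the particular blocks are arbitrary illustrative values, not taken from the paper.

```python
import numpy as np

# Sketch of formula (3.2): mutual information of jointly Gaussian vectors (x, y)
# with joint covariance Q = [[Qx, Qxy], [Qxy^T, Qy]] > 0.
def gaussian_mutual_information(Qx, Qy, Qxy):
    Q = np.block([[Qx, Qxy], [Qxy.T, Qy]])
    sign, logdet_Q = np.linalg.slogdet(Q)
    assert sign > 0, "joint covariance must be strictly positive definite"
    return -0.5 * (logdet_Q - np.linalg.slogdet(Qx)[1] - np.linalg.slogdet(Qy)[1])

Qx = np.array([[2.0, 0.3], [0.3, 1.0]])
Qy = np.array([[1.5]])
Qxy = np.array([[0.4], [0.2]])
print(gaussian_mutual_information(Qx, Qy, Qxy))   # equals 0 iff Qxy = 0 (independence)
```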

3.2 A parameter estimation problem

Consider the model class of Gaussian systems in AR representation as stated in Section 2,

  $\theta(t+1) = \theta(t) + v(t)$,  $\theta(0) = \theta_0$,
  $y(t) = \varphi(t)^T \theta(t) + N w(t)$,   (3.11)

with $\theta_0 \in G(m_0, Q_0)$, $v(t) \in G(0, I)$, $w(t) \in G(0, I)$. The problem is to determine a linear recursive parameter estimator, say,

  $\hat\theta(t+1) = \hat\theta(t) + K(t)[y(t) - \varphi(t)^T \hat\theta(t)]$,  $\hat\theta(0) = m_0$,   (3.12)

such that the estimation error

  $e : \Omega \times T \to R^n$,  $e(t) = \theta(t) - \hat\theta(t)$,

is small according to a criterion specified below. The recursion for the estimation error is then

  $e(t+1) = [I - K(t)\varphi(t)^T] e(t) + v(t) - K(t) N w(t)$,  $e(0) = \theta_0 - \hat\theta_0$.

Assume that $e$ is such that there exists a Gaussian random process $r$, which is independent of $e$, such that $z := e + r$ is a standard white noise process, i.e. $z(t) \in G(0, I)$. Our objective is to determine a parameter estimator such that the mutual information rate $J(e, z)$ between $e$ and $z$ is minimized. This implies that the error signal $e$ should be as small as possible. Clearly this measure is not very intuitive but it turns out to be closely related to LEQG and H∞ parameter estimators. In particular, assume $e$ is asymptotically stationary with asymptotic spectral density function $R$. The assumption regarding the existence of $r$ and $z$ implies

  $0 \le R(z) \le I$.

It is then easy to show that the mutual information rate of the processes $z$ and $e$ is given by

  $J(z, e) = -\frac{1}{4\pi i} \oint_{D} \frac{1}{z} \ln \det[I - R(z)]\, dz = -\frac{1}{4\pi i} \oint_{D} \frac{1}{z} \ln \det[I - S(z^{-1})^T S(z)]\, dz$,   (3.13)

where $S$ is a spectral factor of $R$. Expressing the mutual information in terms of the spectral factor $S$ will be useful later on.

Problem 3.1 Consider the parameter estimation setting defined above, with the observation equation and the parameter model as given in (3.11). Determine a parameter estimator of the form (3.12) that minimizes the mutual information rate (3.13) between the processes $z$ and $e$.

Problem 3.1 will not be solved directly but via LEQG optimal stochastic control and via H∞ optimal control theory.

3.3 Relation of mutual information, H∞ entropy, and LEQG cost

An investigation of the authors has established that the criterion of mutual information of two stationary Gaussian processes is identical to the H∞ entropy criterion and to the limit of the cost of a stochastic control problem with an exponential-of-quadratic cost function. This result is first stated as a theorem and then explained.

Theorem 3.2 Consider a finite-dimensional linear system

  $x(t+1) = Ax(t) + Bu(t)$,  $y(t) = Cx(t) + Du(t)$,   (3.14)

that is a minimal realization of its impulse response function and such that $\mathrm{sp}(A) \subset C^-$. Denote the transfer function of this system by $S$,

  $S(z) = C(zI - A)^{-1} B + D$.   (3.15)

Assume that the set of relations (3.6)-(3.9) admits a unique solution $Q \in R^{n \times n}$. Consider the expression

  $-\frac{1}{2} \ln \det(I - B^T Q B - D^T D)$.   (3.16)

a. If $e : \Omega \times T \to R^m$ and $z : \Omega \times T \to R^p$ are stationary Gaussian processes with spectral density

  $\begin{pmatrix} I & S(z^{-1})^T \\ S(z) & I \end{pmatrix}$,   (3.17)

then the mutual information rate of $(e, z)$ is given by the expression (3.16).

b. The H∞ entropy of the system (3.14), defined by

  $J = -\frac{1}{4\pi i} \oint_{D} \frac{1}{z} \ln \det\left[ I - S(z^{-1})^T S(z) \right] dz$,   (3.18)

is equal to the expression (3.16).

c. Let $u : \Omega \times T \to R^m$ be a stationary Gaussian process in (3.14) with $u(t) \in G(0, I)$. Then

  $\lim_{t_1 \to \infty} \frac{1}{t_1} \ln E\left[ \exp\left( \frac{1}{2} \sum_{s=0}^{t_1} y(s)^T y(s) \right) \right]$

is equal to the expression (3.16).

Proof a. See Theorem E.3. b. See Theorem G.1. c. See Proposition F.3. □
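The equality of the entropy integral (3.18) and the closed-form expression (3.16) can be checked numerically on a small example. The sketch below uses an arbitrary stable, strictly contractive single-input single-output system and solves the relation (3.7)-(3.8) by naive fixed-point iteration; both the example system and the use of fixed-point iteration instead of a dedicated Riccati solver are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Example system (an arbitrary stable, strictly contractive SISO choice).
A = np.array([[0.5]]); B = np.array([[0.5]])
C = np.array([[0.5]]); D = np.array([[0.1]])

def S(z):
    # Transfer function (3.15): S(z) = C (zI - A)^{-1} B + D.
    return C @ np.linalg.inv(z * np.eye(A.shape[0]) - A) @ B + D

# Closed form (3.16): solve (3.7)-(3.8) by naive fixed-point iteration from Q = 0.
Q = np.zeros_like(A)
for _ in range(500):
    N = np.eye(B.shape[1]) - D.T @ D - B.T @ Q @ B
    Q = A.T @ Q @ A + C.T @ C + (A.T @ Q @ B + C.T @ D) @ np.linalg.inv(N) @ (B.T @ Q @ A + D.T @ C)
closed_form = -0.5 * np.linalg.slogdet(np.eye(B.shape[1]) - B.T @ Q @ B - D.T @ D)[1]

# Entropy integral (3.18) on the unit circle: with z = e^{i t}, dz/z = i dt, so the
# integral becomes -(1/4pi) * integral over [0, 2pi] of ln det[I - S(e^{-it})^T S(e^{it})] dt.
ts = np.linspace(0.0, 2.0 * np.pi, 4000, endpoint=False)
vals = [np.linalg.slogdet(np.eye(D.shape[0]) - S(np.exp(-1j * t)).T @ S(np.exp(1j * t)))[1]
        for t in ts]
entropy_integral = -0.5 * float(np.mean(vals))

print(closed_form, entropy_integral)      # the two numbers should agree closely
```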

The discovery of the relation between the criterion of LEQG and the H∞ entropy has been published by K. Glover and J.C. Doyle [26]. That the H∞ entropy criterion is identical to mutual information is new as far as the authors know. The H∞ entropy criterion for continuous-time systems was introduced by K. Glover and D. Mustafa in [27] and has been used extensively since then in H∞ optimal control theory. In that paper the concept of H∞ entropy is referred to the paper [8] by D.Z. Arov and M.G. Krein. In the latter paper the formula (3.18) is presented with the explanation that this formula `has a definite entropy sense' under reference to a publication by Kolmogorov, Gelfand, and Yaglom. In fact, the formula (3.18) is the mutual information of two stationary Gaussian processes that was apparently first analyzed in [24]. The expression of mutual information provided in [24] is in a geometric formulation from which (3.18) may be derived. The formula for the mutual information in the case of scalar processes is stated in [57]. The authors therefore conclude that H∞ entropy is actually mutual information.

3.4 Parameter estimation with an exponential-of-quadratic cost

A short introduction to the LEQG optimal stochastic control problem follows. A detailed treatment may be found in Appendix F. Consider the Gaussian stochastic control system

  $x(t+1) = Ax(t) + Bu(t) + Mv(t)$,
  $y(t) = C_1 x(t) + D_1 u(t) + N v(t)$,
  $z(t) = C_2 x(t) + D_2 u(t)$,   (3.19)

where $v$ is Gaussian white noise with $v(t) \in G(0, V)$. A control law is a collection of maps $g = \{g_t, t \in T\}$, $g_t : Y^t \times U^t \to U$, such that the input $u(t)$ is given by

  $u(t) = g_t(y(0), \ldots, y(t-1), u(0), \ldots, u(t-1))$.

Let $G$ be the set of control laws. Consider the cost function on a finite horizon $J_{t_1} : G \to R$,

  $J_{t_1}(g) = E\left[ c \exp\left( \frac{1}{2} c \sum_{t=0}^{t_1} z(t)^T z(t) \right) \right]$,   (3.20)

for $c \in R$, $c \neq 0$. The LEQG optimal stochastic control problem is then to determine a control law $g^* \in G$ such that

  $J_{t_1}(g^*) = \inf_{g \in G} J_{t_1}(g)$.   (3.21)

Assume that we apply a controller $g$ to the stochastic system (3.19) such that the closed loop system is linear and time-invariant. We can then determine the limit of the cost function

  $E\left[ c \exp\left( \frac{1}{2} c \sum_{s=0}^{t_1} z(s)^T z(s) \right) \right]$   (3.22)

as $t_1 \to \infty$. In Appendix F it is proven that

  $\lim_{t_1 \to \infty} \frac{1}{t_1} \ln E\left[ c \exp\left( \frac{1}{2} c \sum_{s=0}^{t_1} z(s)^T z(s) \right) \right] = -\frac{1}{2} \ln \det\left( V^{-1} - c M^T Q M - c N^T N \right) + \frac{1}{2} \ln \det V^{-1}$.   (3.23)

In (3.23) $Q$ is the solution of a set of relations similar to (3.6)-(3.9). Next the parameter estimation problem is posed in terms of the LEQG problem.

Problem 3.3 Consider the Gaussian system

  $\theta(t+1) = \theta(t) + v(t)$,  $\theta(0) = \theta_0$,
  $y(t) = \varphi(t)^T \theta(t) + w(t)$,   (3.24)

where $T = \{0, 1, \ldots, t_1\}$, $\theta_0 : \Omega \to R^n$, $\theta_0 \in G(m_0, Q_0)$, $v : \Omega \times T \to R^n$ and $w : \Omega \times T \to R^p$ are Gaussian white noise processes with $v(t) \in G(0, Q_v)$, $w(t) \in G(0, Q_w)$ with $Q_w > 0$, $\theta_0, v, w$ are independent, and $\varphi : \Omega \times T \to R^{n \times p}$ is as specified in (2.2). Determine the process $\hat\theta : \Omega \times T \to R^n$ and a recursion for it such that $\hat\theta(t)$ is $F^y_{t-1}$ measurable for all $t \in T$, and such that the following expression is minimized:

  $E\left[ c \exp\left( \frac{1}{2} c \sum_{s=0}^{t_1} z(s)^T z(s) \right) \right]$,   (3.25)

where $z(t) = C[\theta(t) - \hat\theta(t)]$, $c \in R$, $c \neq 0$.

Theorem 3.4 Consider Problem 3.3. Define the functions $K : T \to R^{n \times p}$, $Q : T \to R^{n \times n}$ by the recursions

  $K(t) = Q(t) \begin{pmatrix} \varphi(t) & C^T \end{pmatrix} S(t)^{-1} \begin{pmatrix} I \\ 0 \end{pmatrix}$,   (3.26)

  $Q(t+1) = Q(t) + Q_v - Q(t) \begin{pmatrix} \varphi(t) & C^T \end{pmatrix} S(t)^{-1} \begin{pmatrix} \varphi(t)^T \\ C \end{pmatrix} Q(t)$,   (3.27)

  $Q(0) = Q_0$,   (3.28)

  $S(t) = \begin{pmatrix} \varphi(t)^T \\ C \end{pmatrix} Q(t) \begin{pmatrix} \varphi(t) & C^T \end{pmatrix} + \begin{pmatrix} Q_w & 0 \\ 0 & -c^{-1} I \end{pmatrix}$.   (3.29)

Assume that for all $t \in T$ we have $Q(t) > 0$. The optimal LEQG parameter estimator is then specified by the recursion

  $\hat\theta(t+1) = \hat\theta(t) + K(t)[y(t) - \varphi(t)^T \hat\theta(t)]$.   (3.30)

If in addition $Q_v = 0$ then

  $Q(t+1)^{-1} = Q(t)^{-1} + \varphi(t) Q_w^{-1} \varphi(t)^T - c\, C^T C$,   (3.31)
  $K(t) = Q(t) \varphi(t) \left( \varphi(t)^T Q(t) \varphi(t) + Q_w \right)^{-1}$.   (3.32)

The details of the proof of Theorem 3.4 will not be provided in this paper because of space limitations. The problem is related to Problem F.1. It differs from that problem in that Problem 3.3 has an output matrix, $\varphi(t)$, that depends on the past observations. Therefore a conditional version of Problem F.1 is obtained. The result of Problem F.1 may be found in [32]. The second author of this paper is preparing a publication that contains the solution to the conditional version of Problem F.1. As pointed out in Appendix F, P. Whittle first solved the discrete time case of Problem F.1 but for a system representation slightly different from that used in this paper.
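For the simplified case $Q_v = 0$ the recursion (3.30)-(3.32) resembles recursive least squares with an additional $-cC^TC$ term in the information update. The following sketch implements it literally on simulated data; the AR(2) data generator, the dimensions, and the small negative value of $c$ (chosen so that $Q(t)$ remains positive definite) are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 2, 1
theta = np.array([0.5, -0.3])                 # constant parameter, i.e. Q_v = 0
CC = np.eye(n)                                # C in z(t) = C[theta(t) - theta_hat(t)]
c = -0.05                                     # risk parameter, c != 0 (illustrative)
Q_w = np.eye(p)

theta_hat = np.zeros(n)
Q = np.eye(n)                                 # Q(0) = Q_0
y_hist = np.zeros(n)                          # past outputs used to build phi(t)
for t in range(500):
    phi = y_hist.reshape(n, p)                                    # phi(t), cf. (2.2)
    y = phi.T @ theta + rng.standard_normal(p)                    # y(t) = phi(t)^T theta(t) + w(t)
    K = Q @ phi @ np.linalg.inv(phi.T @ Q @ phi + Q_w)            # gain (3.32)
    theta_hat = theta_hat + K @ (y - phi.T @ theta_hat)           # update (3.30)
    Q = np.linalg.inv(np.linalg.inv(Q) + phi @ np.linalg.inv(Q_w) @ phi.T - c * CC.T @ CC)  # (3.31)
    y_hist = np.concatenate(([y[0]], y_hist[:-1]))                # shift the regressor
print(theta, theta_hat)                       # the estimate should approach theta
```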

3.5 Parameter estimation with H∞ entropy

A short introduction to the H∞ optimal control problem follows. A detailed treatment may be found in Appendix G. Consider the linear control system

  $x(t+1) = Ax(t) + Bu(t) + Mv(t)$,
  $y(t) = C_1 x(t) + D_1 u(t) + N v(t)$,
  $z(t) = C_2 x(t) + D_2 u(t)$,   (3.33)

where $v$ is this time an unknown deterministic signal. As before, a control law is a collection of maps $g = \{g_t, t \in T\}$, $g_t : Y^t \times U^t \to U$, such that the input $u(t)$ is given by

  $u(t) = g_t(y(0), \ldots, y(t-1), u(0), \ldots, u(t-1))$.

Let $G$ be the set of control laws. Consider the cost function $J : G \to R$ where

  $J(g) = \sup_{v, x_0} \frac{\|z\|^2_{2,t_1}}{\|v\|^2_{2,t_1} + x_0^T R^{-1} x_0}$,   (3.34)

  $\|v\|^2_{2,t_1} := \sum_{t=0}^{t_1} v(t)^T v(t)$.   (3.35)

Note that $R$ plays a role similar to the covariance of the initial condition for the LEQG problem. The larger $R$, the more uncertainty we have for the initial condition. If $t_1 = \infty$ then we must constrain $v$ to $\ell_2$, the class of signals for which the infinite sum (3.35) converges. The H∞ optimal control problem is then to determine a control law $g^* \in G$ such that

  $J(g^*) = \inf_{g \in G} J(g)$.   (3.36)

This problem turns out to be very hard. Hence instead we will look for a control law $g$ such that

  $J(g) < c^{-1}$.

This is equivalent to minimizing the following criterion:

  $J_c(g) = \sup_{v, x_0} \left( c \|z\|^2_{2,t_1} - \|v\|^2_{2,t_1} - x_0^T R^{-1} x_0 \right)$.

As a matter of fact $J(g) < c^{-1}$ if $J_c(g) < \infty$. We will actually solve the problem to find $g^*$ such that

  $J_c(g^*) = \inf_{g \in G} J_c(g)$.   (3.37)

If $t_1 = \infty$, $x(0) = 0$ ($R = 0$), and both the system and the controller are time-invariant, then this problem has another interpretation. Let us define

  $J_e(g) = -\frac{1}{4\pi i} \oint_{D} \frac{1}{z} \ln \det\left[ I - c\, S^T(z^{-1}) S(z) \right] dz$,

where $S$ is the closed loop transfer matrix from $v$ to $z$. Then $g^*$ satisfies (3.37) if and only if $J_e(g^*) = \inf_{g \in G} J_e(g)$.

Next the parameter estimation problem is posed in terms of an H∞ control problem.

Problem 3.5 Consider the Gaussian system

  $\theta(t+1) = \theta(t) + v(t)$,  $\theta(0) = \theta_0$,
  $y(t) = \varphi(t)^T \theta(t) + w(t)$,   (3.38)

where $T = \{0, 1, \ldots, t_1\}$ and $\varphi$ is as specified in (2.2). Determine the process $\hat\theta : \Omega \times T \to R^n$ and a recursion for it such that for all $t \in T$ we have that $\hat\theta(t)$ is a function of the past measurements $y(0), \ldots, y(t-1)$, and such that the following expression is minimized:

  $\sup_{w, v, x_0} \left( -x_0^T Q_0^{-1} x_0 + \sum_{t=0}^{t_1} \left[ c\, z(t)^T z(t) - v(t)^T Q_v^{-1} v(t) - w(t)^T Q_w^{-1} w(t) \right] \right)$,   (3.39)

where $z(t) = C[\theta(t) - \hat\theta(t)]$, $c \in R$, $c \neq 0$, $Q_v = Q_v^T > 0$, $Q_w = Q_w^T > 0$.

Proposition 3.6 Consider Problem 3.5. There exists a $\hat\theta$ such that the supremum (3.39) is finite if and only if there exists a matrix $Q$ satisfying (3.27) and (3.28). Moreover, in that case the optimal estimator is given by (3.30).

4 Approximation with likelihood and divergence

In this section the reader is presented successively with text on the approximation with likelihood, divergence, the relation of likelihood and divergence, approximation with divergence, and parameter estimation by divergence minimization.

4.1 Approximation with the likelihood function

In this subsection a particular recursive weighted maximum likelihood problem will be formulated and solved. The maximum likelihood method for parameter estimation has been proposed and analyzed by Sir Ronald Fisher, see [2, 22]. In most lectures and most textbooks the likelihood function is defined by example only. In general the likelihood function may be defined as the Radon-Nikodym derivative of the measure of the observations with respect to the measure with respect to which the observations are independent in discrete time, or with respect to the measure with respect to which the observation process is an independent increment process in continuous time. The likelihood function for a Gaussian or ARMA representation is presented in [33, 63].

Problem 4.1 Consider the parameter estimation model of the first paragraph of Section 3, with the representation

  $\theta(t+1) = \theta(t)$,  $\theta(0) = \theta$,
  $y(t) = \varphi(t)^T \theta(t) + w(t)$.   (4.1)

Let $c \in R$, $c \neq 0$, $\theta \in G(0, Q_0)$, $Q_0 = Q_0^T > 0$. Suppose that at $t \in T$ the parameter estimates $\{\hat\theta(s), s = 0, \ldots, t\}$ have been determined. Choose the next parameter estimate $\hat\theta(t+1) \in R^n$ as the solution of the optimization problem

  $\sup_{\theta \in R^n} \exp\left( \frac{1}{2} c \sum_{s=0}^{t} \|\theta - \hat\theta(s)\|^2 \right) f(y(0), \ldots, y(t); \theta)$,   (4.2)

where $f$ is the joint density function of the observation process $\{y(s), s = 0, \ldots, t\}$ and $y(0), \ldots, y(t)$ are the numerical values of the observations. The density function $f$ depends on the parameter vector $\theta$.

Theorem 4.2 Consider Problem 4.1 with $Q_w = I$. Assume that the model is such that $Q(t)$ is well-defined by

  $Q(t+1)^{-1} = Q(t)^{-1} + \varphi(t) Q_w^{-1} \varphi(t)^T - cI$,   (4.3)

with $Q(0) = Q_0$ and $Q(t) > 0$ for all $t \in T$. Then the optimal estimate at $t \in T$ is given by the recursion

  $\hat\theta(t+1) = \hat\theta(t) + K(t)[y(t) - \varphi(t)^T \hat\theta(t)]$,  $\hat\theta(0) = 0$,   (4.4)

where

  $K(t) = Q(t) \varphi(t) \left[ \varphi(t)^T Q(t) \varphi(t) + I \right]^{-1}$.   (4.5)

Note that the parameter estimator (4.4) is identical to that of Theorem 3.4.

Proof: The logarithm of the criterion is, up to a term that does not depend on $\theta$, equal to

  $2 h(\theta) = -\theta^T Q_0^{-1} \theta - \sum_{s=0}^{t} \|y(s) - \varphi(s)^T \theta\|^2 + c \sum_{s=0}^{t} \|\theta - \hat\theta(s)\|^2$.   (4.6)

By assumption

  $\frac{d^2 h(\theta)}{d\theta^2} = c(t+1) I - Q_0^{-1} - \sum_{s=0}^{t} \varphi(s)\varphi(s)^T = -Q(t+1)^{-1} < 0$.

The first order conditions then yield

  $Q(t+1)^{-1} \theta = \sum_{s=0}^{t} \varphi(s) y(s) - c \sum_{s=0}^{t} \hat\theta(s)$,

and after simple calculations one gets (4.4). □
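For $c \to 0$ the weighting term in (4.2) disappears and the recursion (4.3)-(4.5) reduces to ordinary recursive least squares, which reproduces the batch regularized least-squares estimate exactly. The sketch below checks this on simulated data; the AR(2) data generator and the prior covariance $Q_0$ are illustrative assumptions, and the check itself is not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.array([0.5, -0.3])                 # assumed AR(2) parameter
T = 200
y = np.zeros(T + 2)
for t in range(2, T + 2):
    y[t] = theta @ np.array([y[t - 1], y[t - 2]]) + rng.standard_normal()

Q0 = 10.0 * np.eye(2)                         # prior covariance Q_0 (illustrative)
Q = Q0.copy()
theta_hat = np.zeros(2)
P, b = np.linalg.inv(Q0), np.zeros(2)         # information form of the batch problem
for t in range(2, T + 2):
    phi = np.array([y[t - 1], y[t - 2]])
    e = y[t] - phi @ theta_hat
    K = Q @ phi / (1.0 + phi @ Q @ phi)       # (4.5) with c = 0 and Q_w = I
    theta_hat = theta_hat + K * e             # (4.4)
    Q = np.linalg.inv(np.linalg.inv(Q) + np.outer(phi, phi))   # (4.3) with c = 0
    P += np.outer(phi, phi)
    b += phi * y[t]

print(theta_hat, np.linalg.solve(P, b))       # recursive and batch estimates coincide
```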

4.2 Divergence

The Kullback-Leibler pseudo-distance or divergence is a function that measures the relation between probability measures. For system identification it is an important approximation criterion. There follows a brief introduction to divergence; the full details may be found in the Appendices C, D, and E.

Consider the measurable space $(\Omega, F)$ and two probability measures $P_1, P_2$ defined on it. Let $Q$ be a $\sigma$-finite measure on $(\Omega, F)$ such that $P_1$ and $P_2$ are absolutely continuous with respect to $Q$, say with $r_1 = dP_1/dQ$ and $r_2 = dP_2/dQ$. Such a measure always exists. The divergence of $P_1, P_2$ is defined by the formula

  $D(P_1 \| P_2) = E_Q\left[ r_1 \ln\left( \frac{r_1}{r_2} \right) I_{(r_2 > 0)} \right]$.   (4.7)

It may be shown that the definition of $D(P_1 \| P_2)$ does not depend on the measure $Q$ chosen, that for all $P_1, P_2$ we have $D(P_1 \| P_2) \ge 0$, and that $D(P_1 \| P_2) = 0$ iff $P_1 = P_2$. The triangle inequality is not satisfied by $D$, hence it is not a distance but a pseudo-distance. S. Kullback and R.A. Leibler in [46] introduced this concept, which is therefore also known as the Kullback-Leibler pseudo-distance. The term divergence is used in information theory. Divergence and mutual information for measures induced by two real valued random variables are related by

  $J(x, y) = D(P_{xy} \| P_x \times P_y)$.

Let $G(m_1, Q_1)$ and $G(m_2, Q_2)$ be two Gaussian measures on $R^n$. Assume that $Q_1, Q_2 > 0$. The divergence between these measures is given by the formula (see D.5)

  $2 D\left( G(m_1, Q_1) \| G(m_2, Q_2) \right) = -\ln \frac{\det Q_1}{\det Q_2} + \mathrm{tr}\left( [Q_2^{-1} - Q_1^{-1}] Q_1 \right) + (m_2 - m_1)^T Q_2^{-1} (m_2 - m_1)$.   (4.8)

Consider next two jointly stationary Gaussian processes $x_1, x_2 : \Omega \times T \to R^n$ on $T = Z$. Assume that they have a mean value function that is zero and that the joint processes admit a spectral density. Under additional conditions, see Theorem E.5, the divergence between the measures associated with these processes is given by the formula

  $D(P_1 \| P_2) = -\frac{1}{2} \ln \det(B^T Q B + D^T D) + \frac{1}{2} \mathrm{tr}(B^T R B + D^T D - I)$,   (4.9)

where $Q \in R^{n \times n}$ is the solution of an algebraic Riccati equation and $R \in R^{n \times n}$ is the solution of a Lyapunov equation.
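A small numerical sketch of formula (4.8) follows; the particular means and covariances are arbitrary illustrations, not taken from the paper.

```python
import numpy as np

# Minimal sketch of formula (4.8) for two nondegenerate Gaussian measures on R^n.
def gaussian_divergence(m1, Q1, m2, Q2):
    """D( G(m1, Q1) || G(m2, Q2) ), with Q1 and Q2 strictly positive definite."""
    Q2inv = np.linalg.inv(Q2)
    term_logdet = -np.linalg.slogdet(Q1)[1] + np.linalg.slogdet(Q2)[1]
    term_trace = np.trace((Q2inv - np.linalg.inv(Q1)) @ Q1)
    term_mean = (m2 - m1) @ Q2inv @ (m2 - m1)
    return 0.5 * (term_logdet + term_trace + term_mean)

m1, Q1 = np.array([0.0, 1.0]), np.array([[1.0, 0.2], [0.2, 0.5]])
m2, Q2 = np.array([0.5, 0.5]), np.eye(2)
print(gaussian_divergence(m1, Q1, m2, Q2))     # nonnegative
print(gaussian_divergence(m1, Q1, m1, Q1))     # 0.0, since the measures coincide
```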

4.3 Relation of likelihood function and divergence

The likelihood function as an approximation criterion is related to the approximation criterion of divergence discussed in the previous subsection. The relation between these criteria will be described for Gaussian random variables.

Consider $n$ real valued observations. The model class of the variable considered is the class of Gaussian random variables with values in $R$, say $G(m, q)$, $m \in R$, $q \in R$, $q > 0$. The observations are assumed to be independent. The measure of these observations and their representation are given by $y : \Omega \to R^n$, $y \in G(me, Q)$, where

  $e = (1, \ldots, 1)^T \in R^n$,  $Q = qI$.

The density function of $y$ with respect to Lebesgue measure is

  $p(v; me, Q) = \frac{1}{\sqrt{(2\pi)^n \det Q}} \exp\left( -\frac{1}{2} (v - me)^T Q^{-1} (v - me) \right)$.

The likelihood function $L : R \times R_+ \times R^n \to R_+$ is therefore equal to

  $L(m, q; y) = \frac{1}{\sqrt{(2\pi)^n \det Q}} \exp\left( -\frac{1}{2} (y - me)^T Q^{-1} (y - me) \right)$,

where $y \in R^n$ is the vector with the numerical values of the observations. As is well known, the maximum likelihood estimates of the parameters $m, q$ are given by the formulas

  $\hat m = \frac{1}{n} \sum_{k=1}^{n} y_k$,  $\hat q = \frac{1}{n} \sum_{k=1}^{n} (y_k - \hat m)^2$.

The density function of a Gaussian measure with $\hat m, \hat q$ as parameters is denoted by $p(\cdot; \hat m, \hat q)$. Formula (4.8) for the divergence of two Gaussian measures on $R$ reduces in this case to (assuming $q_1 > 0$ and $q_2 > 0$)

  $2 D\left( p(\cdot; m_1, q_1) \| p(\cdot; m_2, q_2) \right) = -\ln \frac{q_1}{q_2} - 1 + \frac{q_1}{q_2} + \frac{(m_1 - m_2)^2}{q_2}$.

Proposition 4.3 Consider the likelihood function and the divergence for the Gaussian random variables defined above. Then

  $L(m, q; y) = \frac{1}{\sqrt{(2\pi)^n}} \exp\left( -\frac{n}{2} \left[ 2 D\left( G(\hat m, \hat q) \| G(m, q) \right) + 1 + \ln \hat q \right] \right)$.

Proof: From the discussion above it follows that

  $\ln L(m, q; y) = -\frac{n}{2} \ln(2\pi) - \frac{n}{2} \left[ \ln q + \frac{\bar q}{q} \right]$,

  $2 D\left( G(\hat m, \hat q) \| G(m, q) \right) = -\ln\frac{\hat q}{q} - 1 + \frac{\hat q}{q} + \frac{(\hat m - m)^2}{q}$,

where

  $\bar q = \frac{1}{n} \sum_{k=1}^{n} (y_k - m)^2$.

Note that

  $\bar q = \frac{1}{n} \sum_{k=1}^{n} (y_k - m)^2 = \frac{1}{n} \sum_{k=1}^{n} (y_k - \hat m + \hat m - m)^2 = \hat q + (\hat m - m)^2$,

hence the result. □
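The identity of Proposition 4.3 can be checked numerically. The sketch below does so for simulated scalar observations and one arbitrary candidate model $G(m, q)$; it is an illustrative sanity check, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(1.0, 2.0, size=50)
n = len(y)
m_hat, q_hat = y.mean(), y.var()               # maximum likelihood estimates

def log_likelihood(m, q):
    return -0.5 * n * np.log(2 * np.pi * q) - 0.5 * np.sum((y - m) ** 2) / q

def divergence(m1, q1, m2, q2):                # scalar case of formula (4.8)
    return 0.5 * (-np.log(q1 / q2) - 1.0 + q1 / q2 + (m1 - m2) ** 2 / q2)

m, q = 0.7, 3.0                                # an arbitrary candidate model G(m, q)
lhs = log_likelihood(m, q)
rhs = -0.5 * n * np.log(2 * np.pi) \
      - 0.5 * n * (2 * divergence(m_hat, q_hat, m, q) + 1 + np.log(q_hat))
print(lhs, rhs)                                # the two numbers agree
```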

The conclusion from Proposition 4.3 is that maximizing the likelihood function is equivalent to minimizing the divergence between a measure in the model class and the measure associated with the maximum likelihood estimates. However, note that this fact cannot be used to find the maximum likelihood estimates, since the maximum likelihood estimates $\hat m$ and $\hat q$ appear in the expression. Rather, it shows that the likelihood function is equivalent to the divergence criterion. In the example above both measures appearing in the divergence, $G(m, q)$ associated with the model class and $G(\hat m, \hat q)$ associated with the maximum, are members of the model class. Another way to consider the relation between the likelihood function and divergence has been proposed by H. Akaike in [1].

Does the relation between the likelihood function and divergence extend to other classes of stochastic systems? In the maximum likelihood approach one maximizes the likelihood function with respect to the parameters of the model class. The divergence between the measure associated with the observations and the measure associated with the model class is a pseudo-distance. Therefore the maximum likelihood approach and the minimum divergence approach are related. The exponential transformation between both criteria as in Proposition 4.3 for the Gaussian case may not hold in general, although this relation does hold for a larger class of distributions than the Gaussian one. S.-I. Amari has investigated the geometric structure of exponential families of probability distributions and explored the use of the likelihood function and of divergence for this family, see [4, 6, 5]. The minimum divergence method differs from the maximum likelihood method in that for the first method a measure must be chosen that represents the observations. In this point the maximum likelihood approach requires less.

4.4 Approximation with divergence

For system identification an approximation problem with the divergence criterion may be considered. The relationship between the likelihood function and divergence, see Proposition 4.3, suggests such an approach. In system identification one is given observations. According to the procedure outlined in Section 2 one then selects a model class. In the case considered in this paper this is the class of stationary Gaussian processes which can be represented as a Gaussian system. With the observations one can associate a measure of a stationary Gaussian process. For example, one can take the measure associated with the mean value function of this process to be the zero function and the covariance function to be an estimate, say

  $\hat W(t) = \begin{cases} \frac{1}{t_1 - t} \sum_{s=0}^{t_1 - t - 1} y(s + t) y(s)^T, & t = 0, \ldots, t_2, \\ 0, & \text{else.} \end{cases}$   (4.10)

Another choice is a measure associated with an AR representation of high order.
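The estimate (4.10) is easy to compute from data. The sketch below evaluates it for a simulated scalar process; the AR(1) data generator and the lag horizon are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
t1 = 1000
y = np.zeros(t1)
for t in range(1, t1):
    y[t] = 0.8 * y[t - 1] + rng.standard_normal()

def w_hat(tau, t2=50):
    """Estimate (4.10) of W(tau) = E[y(s + tau) y(s)^T] for 0 <= tau <= t2, zero beyond."""
    if 0 <= tau <= t2:
        return np.dot(y[tau:], y[:t1 - tau]) / (t1 - tau)
    return 0.0

print([round(w_hat(k), 3) for k in range(5)])  # decays roughly like 0.8**k / (1 - 0.8**2)
```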

Problem 4.4 Approximation of a stationary Gaussian process by a time-invariant Gaussian system according to the divergence criterion. Consider given a probability measure $P$ on the space of Gaussian processes with values in $R^p$ and zero mean value function. Consider the model class $GS(n)$ of Gaussian systems up to order $n$ and the associated set of probability measures $\{P(\sigma), \sigma \in GS(n)\}$ on the output processes of such systems. Solve the approximation problem

  $\inf_{\sigma \in GS(n)} D\left( P \| P(\sigma) \right)$.   (4.11)

A solution to Problem 4.4, if it exists, is a Gaussian system of which the probability measure on the output process minimizes the divergence to the probability measure associated with the observations. Note that the probability measure associated with the observations may not be in the model class, for example because the covariance estimate is not a rational function. Problem 4.4 is therefore equivalent to the determination of an element in the model class that minimizes the divergence to the measure associated with the observations.

The principle of approximation by the divergence criterion is used in the literature. It has apparently first been proposed by H. Akaike in [1], who explored its use in [2, 3] and who derived the Akaike Information Criterion (AIC) from it. Other publications on this approach are those of I. Csiszar, [4, 5, 7] and, in the case of the ARMA model class, [20, 53].

The following model reduction problem for Gaussian random variables is of interest to system identification. In system identification a technique is used called subspace methods that is closely related to the following problem. It is expected that the solution of this problem will be useful to subspace methods. The problem was first formulated in [69, Subsec. 3.8].

Problem 4.5 Model reduction for a pair of Gaussian random variables. Let $y_1 : \Omega \to R^{p_1}$, $y_2 : \Omega \to R^{p_2}$ with $(y_1, y_2) \in G(0, S)$,

  $S = \begin{pmatrix} S_{11} & S_{12} \\ S_{12}^T & S_{22} \end{pmatrix}$,  $\mathrm{rank}(S_{12}) = n_1$.

Consider the model class, for $n_2 \in Z_+$, $n_2 < n_1$,

  $G(n_2) = \left\{ G(0, Q) \text{ on } R^{p_1 + p_2} \mid \mathrm{rank}(Q_{12}) = n_2 \right\}$,   (4.12)

  $Q = \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{12}^T & Q_{22} \end{pmatrix}$,   (4.13)

in which the components of $Q$ are compatible with the decomposition $(y_1, y_2)$. Solve, for fixed $n_2 \in Z_+$, $n_2 < n_1$,

  $\inf_{G(0, Q) \in G(n_2)} D\left( G(0, S) \| G(0, Q) \right)$.   (4.14)

An interpretation of this problem follows. In stochastic realization theory of stochastic processes one uses the framework of past, future, and present or state. If one restricts attention to the case of Gaussian random variables, thus not of processes, then the past and the future are represented by Gaussian random variables, say $y_1, y_2$ as above. From observations one can estimate the measure of the past and future observations, say $G(0, S)$. The state space associated with the past and the future then has dimension $n_1 = \mathrm{rank}(S_{12})$. In the model reduction problem one seeks a model or measure of the observations in which the dimension of the state space is less than that initially fixed, thus $\mathrm{rank}(Q_{12}) = n_2 < n_1$.

The approximation criterion of divergence seems quite appropriate, also because of its relation with the likelihood function. For initial results on this problem see [69, Subsec. 3.8]. The problem has a rich convex structure that remains to be explored.

4.5 Parameter estimation by divergence minimization

Consider Problem 2.1 of parameter estimation. Consider the time moment $t \in T$. Suppose that the observations $\{y(s), s = 0, \ldots, t\}$ are available and that the estimates $\{\hat\theta(s), s = 0, \ldots, t-1\}$ have been chosen. The question is then how to choose $\hat\theta(t)$. With the observations $\{y(s), s = 0, \ldots, t\}$ we associate the probability measure $G(y, I)$ and with the model class the probability measure $G(\mu, I)$, where

  $\mu = \begin{pmatrix} \varphi(0)^T \\ \varphi(1)^T \\ \vdots \\ \varphi(t)^T \end{pmatrix} \theta$,  $y = \begin{pmatrix} y(0) \\ y(1) \\ \vdots \\ y(t) \end{pmatrix}$.

The unknown parameter $\theta$ has to be estimated. In this setting $\varphi : T \to R^n$ is a deterministic function. Associate with the past parameter estimates the probability measure $G(\bar\theta, W)$ and with the model class the probability measure $G(\theta, c^{-1} I)$, where $c \in R$, $c \neq 0$,

  $\bar\theta = \frac{1}{t} \sum_{s=0}^{t-1} \hat\theta(s)$,  $W = \frac{1}{t} \sum_{s=0}^{t-1} (\hat\theta(s) - \bar\theta)(\hat\theta(s) - \bar\theta)^T$.

Problem 4.6 Consider Problem 2.1 and the notation introduced above. For $t \in T$ choose the parameter estimate $\hat\theta(t)$ as the solution of the minimization problem

  $\inf_{\theta \in R^n} \left[ \frac{1}{t} D\left( G(y, I) \| G(\mu, I) \right) - D\left( G(\bar\theta, W) \| G(\theta, c^{-1} I) \right) \right]$.   (4.15)

The problem formulated above is related to but different from that of the papers [34, 35].

Theorem 4.7 Consider Problem 4.6.

a. The divergence criterion of that problem is equivalent to the criterion

  $\inf_{\theta \in R^n} \left[ \frac{1}{t} \|y - \mu\|^2 - c \|\theta - \bar\theta\|^2 \right]$.   (4.16)

b. The parameter estimator that solves this problem is given by the formula

  $\hat\theta(t+1) = \hat\theta(t) + K(t)[y(t) - \varphi(t)^T \hat\theta(t)]$,  $\hat\theta(0) = 0$,   (4.17)

where the formula for the gain and the Riccati recursion are as stated in Theorem 4.2.

Note that the parameter estimator (4.17) is identical to the estimators of Theorem 3.4 and Theorem 4.2.

Proof a. A simple calculation yields that

  $2 D\left( G(y, I) \| G(\mu, I) \right) = -\ln\frac{\det I}{\det I} + \mathrm{tr}([I - I] I) + \|y - \mu\|^2 = \|y - \mu\|^2$,

  $2 D\left( G(\bar\theta, W) \| G(\theta, c^{-1} I) \right) = -\ln\frac{\det W}{\det(c^{-1} I)} + c\,\mathrm{tr}(W) - n + c\|\theta - \bar\theta\|^2$,

  $\frac{2}{t} D\left( G(y, I) \| G(\mu, I) \right) - 2 D\left( G(\bar\theta, W) \| G(\theta, c^{-1} I) \right) = \frac{1}{t}\|y - \mu\|^2 - c\|\theta - \bar\theta\|^2 + \ln\det W + n - c\,\mathrm{tr}(W) + n \ln c$.

Note that the last four terms depend on the observations and not on the parameter $\theta$. Hence they can be neglected in the criterion.

b. It will be shown that the criterion of part a of the theorem is equivalent to the criterion

  $\|y - \mu\|^2 - c\|\theta - \hat\theta\|^2$.

This will be established by showing that infimization of the functions

  $f(\theta) = \|\theta - \bar\theta\|^2$,  $g(\theta) = \frac{1}{t}\|\theta - \hat\theta\|^2 = \frac{1}{t} \sum_{s=0}^{t-1} (\theta - \hat\theta(s))^T (\theta - \hat\theta(s))$,

is equivalent. Both functions are quadratic forms in $\theta$. The derivatives are

  $\frac{df(\theta)}{d\theta} = 2(\theta - \bar\theta)^T$,  $\frac{dg(\theta)}{d\theta} = \frac{2}{t} \sum_{s=0}^{t-1} (\theta - \hat\theta(s))^T = 2\theta^T - \frac{2}{t} \sum_{s=0}^{t-1} \hat\theta(s)^T = 2(\theta - \bar\theta)^T$.

Hence the first order conditions are identical and therefore the optimal $\theta$ is the same for both optimization problems. This optimization problem is then given by

  $\inf_{\theta \in R^n} \frac{1}{t} \left[ \|y - \mu\|^2 - c\|\theta - \hat\theta\|^2 \right]$.

The result then follows from Theorem 4.2. □

5 Concluding remarks

What is the contribution of this paper to the theory of system identification? The main contribution of the paper is the relation between several information theoretic criteria and their use in system identification. The concept of mutual information rate for stationary Gaussian processes is identical to H∞ entropy and to the exponential-of-quadratic cost in optimal stochastic control. The likelihood function is identical to divergence for Gaussian random variables. As a consequence of this relation a parameter estimator is presented that may be derived by four different approximation criteria. The problem concerns the approximation of observations from a stationary Gaussian process

by the output of a Gaussian system of a specified order. Another product of the paper is a set of formulas for information theoretic criteria of stationary Gaussian processes in case these processes are outputs of time-invariant Gaussian systems.

The relationship between the H∞ entropy for a finite-dimensional linear system and the exponential-of-quadratic cost for a Gaussian stochastic control system is understood at the level of the formulas. It is a question as to whether a deeper understanding is possible. By rephrasing the relation as a relation between mutual information and the exponential-of-quadratic cost, the interpretation becomes different but it is still not enlightening. The parameter estimator derived in this paper should be tested on data. Its robustness properties require further study.

Which problems of system identification theory require further attention? The framework for system identification with information theoretic criteria should be explored further. For stationary Gaussian processes also the case of ARMA representations may be considered. Besides Gaussian processes also finite valued processes, with the hidden Markov model as stochastic system, and counting processes should be considered. Theoretical problems of system identification and of realization may be considered using the framework of this paper. The relation between the likelihood function and divergence rate may be explored for model reduction of Gaussian systems. Possibly minimization of the divergence can be solved partly analytically. The approximation problem as such also requires further study.

System identification theory in general may benefit from studying other classes of dynamic systems in relation with practical problems. Examples of such classes are positive linear systems, particular classes of nonlinear systems, and the class of errors-in-variables systems. Approximation techniques for nonlinear systems, such as artificial neural nets, wavelets, and fuzzy modeling, may be explored.

Acknowledgements

The authors thank S. Bittanti and G. Picci for the organization of the NATO Advanced Study Institute From Identification to Learning and for the invitation to the second named author to lecture on the topic of this paper. The meeting and its proceedings are very useful ways to transfer knowledge in the area of system identification. The second named author of this paper also thanks L. Finesso of LADSEB, Padova, Italy, for a discussion on the subject of this paper and for useful references. The research of this paper has been sponsored in part by the Commission of the European Communities through the SCIENCE Program by the project SC*-CT.

A Concepts from probability and the theory of stochastic processes

In this appendix concepts and notation of probability and of the theory of stochastic processes are introduced.

General mathematics notation follows. The set of integers is denoted by $Z$ and the set of positive integers by $Z_+$. The set of real numbers is denoted by $R$ and the set of the positive real numbers by $R_+ = [0, \infty)$. The $n$-dimensional vector space over $R$ is denoted by $R^n$ and the set of $n \times m$ matrices over this vector space by $R^{n \times m}$. The transpose of a matrix $A \in R^{n \times n}$ is denoted by $A^T$. The matrix $A \in R^{n \times n}$ is said to be positive definite if $x^T A x \ge 0$ for all $x \in R^n$ and strictly positive definite if $x^T A x > 0$ for all $x \in R^n$, $x \neq 0$. The set of complex numbers is denoted by $C$ and

  $C^- := \{ c \in C \mid |c| < 1 \}$,  $D := \{ c \in C \mid |c| = 1 \}$.

For $A \in R^{n \times n}$, $\mathrm{sp}(A) \subset C$ denotes the spectrum of the matrix, in other words the set of eigenvalues.

A.1 Probability concepts

A measurable space, denoted by $(\Omega, F)$, is defined as a set $\Omega$ and a $\sigma$-algebra $F$. A probability space is defined as a triple $(\Omega, F, P)$ where $(\Omega, F)$ is a measurable space and $P : F \to R$ is a probability measure. Let $I_A : \Omega \to R$ be the indicator function of the event $A \in F$, defined by $I_A(\omega) = 1$ if $\omega \in A$ and $I_A(\omega) = 0$ otherwise. For any random variable $x$ let $F^x$ denote the smallest $\sigma$-algebra on which the random variable $x$ is measurable.

Let $P, Q$ be two probability measures on a measurable space $(\Omega, F)$. Then $P$ is said to be absolutely continuous with respect to $Q$, denoted by $P \ll Q$, if $P(A) = 0$ is implied by $Q(A) = 0$. If $P \ll Q$ then it follows from the Radon-Nikodym theorem that there exists a random variable $r : \Omega \to R_+$ such that

  $P(A) = E_Q[r I_A] = \int_{\Omega} r(\omega) I_A(\omega)\, dQ$.

A.2 Gaussian random variables

The Gaussian probability distribution function with parameters $m \in R^n$ and $Q \in R^{n \times n}$, with $Q$ strictly positive definite, is defined by the probability density function

  $p(v) = \frac{1}{\sqrt{(2\pi)^n \det Q}} \exp\left( -\frac{1}{2} (v - m)^T Q^{-1} (v - m) \right)$.   (A.1)

A random variable $x : \Omega \to R^n$ is said to be a Gaussian random variable with parameters $m \in R^n$ and $Q \in R^{n \times n}$, $Q$ positive definite, if for all $u \in R^n$

  $E[\exp(i u^T x)] = \exp\left( i u^T m - \frac{1}{2} u^T Q u \right)$.   (A.2)

The notation $x \in G(m, Q)$ will be used in this case. Moreover, $(x_1, \ldots, x_n) \in G(m, Q)$ denotes that, with $x = (x_1, \ldots, x_n)^T$, $x \in G(m, Q)$. In this case $x_1, \ldots, x_n$ are said to be jointly Gaussian random variables. A random variable with Gaussian probability distribution function is a Gaussian random variable. A Gaussian random variable does not necessarily have a Gaussian probability density function, only so in the case its variance is strictly positive definite.

In the following the geometric approach to Gaussian random variables is introduced. In this approach one considers the space that the random variables generate rather than the random variables themselves.

Proposition A.1 Let $x : \Omega \to R^{n_x}$, $x \in G(m, Q)$, $S \in R^{n_1 \times n_x}$.
a. Then $F^{Sx} \subset F^x$.
b. $F^{Sx} = F^x$ iff $\ker Q = \ker SQ$.

The above result motivates the following definition.

Definition A.2 Let $x : \Omega \to R^{n_x}$, $x \in G$, and consider the $\sigma$-algebra $F^x$.
a. A basis for $F^x$ is a triple $(n_1, m_1, Q_1) \in N \times R^{n_1} \times R^{n_1 \times n_1}$ such that there exists an $x_1 : \Omega \to R^{n_1}$ satisfying $x_1 \in G(m_1, Q_1)$, $Q_1 = Q_1^T \ge 0$, and $F^{x_1} = F^x$.
b. A minimal basis for $F^x$ is a basis $(n_1, m_1, Q_1)$ such that $\mathrm{rank}(Q_1) = n_1$.
c. A basis transformation of $F^x$ is a map $x_1 \mapsto S x_1$, with $S \in R^{n_1 \times n_1}$, such that $F^{S x_1} = F^{x_1}$.

By use of linear algebra one can, given a basis $(n_1, m_1, Q_1)$, always construct a minimal basis for $F^x$.

Proposition A.3 Let $x_1 : \Omega \to R^{n_1}$, $x_1 \in G(0, Q_1)$, and $Q_1 = Q_1^T \ge 0$. Then there exist an $n_2 \in N$ and a basis transformation $S \in R^{n_2 \times n_1}$ such that, if $x_2 : \Omega \to R^{n_2}$, $x_2 = S x_1$, then $x_2 \in G(0, Q_2)$, $Q_2 = Q_2^T > 0$, and $F^{x_2} = F^{x_1}$.

Problem A.4 Let $y_1 : \Omega \to R^{k_1}$ and $y_2 : \Omega \to R^{k_2}$ be jointly Gaussian random variables with $(y_1, y_2) \in G(0, Q)$. Determine a canonical form for the spaces $F^{y_1}, F^{y_2}$. Note that a basis transformation of the form $S = \mathrm{block\text{-}diag}(S_1, S_2)$ with $S_1, S_2$ nonsingular leaves the spaces $F^{y_1}, F^{y_2}$ invariant. Therefore this operation introduces an equivalence relation on the spaces $F^{y_1}, F^{y_2}$, hence one can speak of a canonical form.

Definition A.5 Let $y_1 : \Omega \to R^{k_1}$ and $y_2 : \Omega \to R^{k_2}$ be jointly Gaussian random variables with $(y_1, y_2) \in G(0, Q)$. Then $(y_1, y_2)$ are said to be in canonical variable form if

  $Q = \begin{pmatrix} I & \Lambda \\ \Lambda^T & I \end{pmatrix} \in R^{(k_1 + k_2) \times (k_1 + k_2)}$,

where $\Lambda \in R^{k_1 \times k_2}$, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_{k_2})$, $\lambda_1 \ge \cdots \ge \lambda_{k_2} > 0$. One then says that $(y_{11}, \ldots, y_{1 k_1})$, $(y_{21}, \ldots, y_{2 k_2})$ are the canonical variables and $(\lambda_1, \ldots, \lambda_{k_2})$ the canonical correlation coefficients.
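Definition A.5 suggests how the canonical correlation coefficients can be computed in practice: after whitening $y_1$ and $y_2$ they are the singular values of the whitened cross-covariance. The sketch below does this for an arbitrary illustrative joint covariance; the existence of the required basis transformation is the content of the theorem that follows.

```python
import numpy as np

# Canonical correlation coefficients of (y1, y2) with joint covariance
# Q = [[Q11, Q12], [Q12^T, Q22]]: the singular values of L1^{-1} Q12 L2^{-T},
# where Q11 = L1 L1^T and Q22 = L2 L2^T.  The numbers below are arbitrary.
Q11 = np.array([[1.0, 0.3], [0.3, 1.0]])
Q22 = np.array([[1.0, 0.1], [0.1, 0.5]])
Q12 = np.array([[0.5, 0.0], [0.2, 0.1]])

L1 = np.linalg.cholesky(Q11)
L2 = np.linalg.cholesky(Q22)
M = np.linalg.solve(L1, Q12) @ np.linalg.inv(L2).T
U, sigma, Vt = np.linalg.svd(M)
print(sigma)     # canonical correlation coefficients, 1 >= lambda_1 >= ... >= 0
# S1 = U.T @ inv(L1) and S2 = Vt @ inv(L2) give the canonical variable basis
# of Definition A.5: cov(S1 y1) = I, cov(S2 y2) = I, cov(S1 y1, S2 y2) = diag(sigma).
```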

Theorem A.6 Let $y_1 : \Omega \to R^{k_1}$ and $y_2 : \Omega \to R^{k_2}$ be jointly Gaussian random variables with $(y_1, y_2) \in G(0, Q)$. Then there exists a basis transformation $S = \mathrm{block\text{-}diag}(S_1, S_2)$ such that with respect to the new basis $(S_1 y_1, S_2 y_2) \in G(0, Q_1)$ has the canonical variable form.

The transformation to canonical variable form is not unique in general. The remaining invariance of the canonical variable form will not be stated here because of space limitation. For additional results on canonical variables the reader is referred to [7, 25].

A.3 Concepts from the theory of stochastic processes

A real valued stochastic process $x : \Omega \times T \to R$ on a probability space $(\Omega, F, P)$ and a time index set $T$ is defined to be a function such that for all $t \in T$ the map $x(\cdot, t) : \Omega \to R$ is a random variable. A stochastic process $x : \Omega \times T \to R^n$ is said to be a Gaussian process if every finite dimensional distribution is Gaussian; thus, if for all $m \in Z_+$ and $(t_1, \ldots, t_m) \in T^m$ the collection of random variables $x(t_1), \ldots, x(t_m)$ is jointly Gaussian. Such a process is completely specified by its mean value function $m : T \to R^n$, $m(t) = E[x(t)]$, and its covariance function $W : T \times T \to R^{n \times n}$, $W(t, s) = E[(x(t) - m(t))(x(s) - m(s))^T]$.

A (discrete time) Gaussian white noise process with values in $R^k$ and intensity $V : T \to R^{k \times k}$, $V(t) = V(t)^T \ge 0$, is a stochastic process such that (1) $\{v(t), t \in T\}$ is a collection of independent Gaussian random variables; (2) $v$ is a Gaussian process with $v(t) \in G(0, V(t))$.

A stochastic process $x : \Omega \times T \to R^n$ on $T = Z$ or $T = R$ is said to be stationary if for all $m \in Z_+$, $s \in T$, and $t_1, \ldots, t_m \in T$ the joint distribution of $(x(t_1), \ldots, x(t_m))$ equals the joint distribution of $(x(t_1 + s), \ldots, x(t_m + s))$. If for all $m \in Z_+$ and $t_1, \ldots, t_m \in T$ the distribution function of $(x(t_1 + s), \ldots, x(t_m + s))$ converges to a limit as $s \to \infty$, then we call the process asymptotically stationary. A Gaussian process with mean value function $m : T \to R^n$ and covariance function $W$ is stationary iff $m(t) = m(0)$ for all $t \in T$ and $W(t, s) = W(t + u, s + u)$ for all $t, s, u \in T$. In this case the function $W : T \to R^{n \times n}$, $W(t) = W(t, 0)$, is also called the covariance function.

The concept of canonical correlation process has been defined for stationary Gaussian processes in analogy with the canonical variable form and canonical correlation coefficients for Gaussian random variables. That this may be done follows from the fact that many properties of a stationary Gaussian process depend only on the spaces generated by such a process. For references on this topic see [28, 42, 43, 54].

B Concepts from system theory

This appendix contains several definitions of concepts from system theory that are needed at several places in the paper.

Definition B.1 A discrete-time finite-dimensional linear dynamic system or, by way of abbreviation, a linear system, is a dynamic system

  $\sigma = \{ T, R^p, Y, R^n, R^m, U, \phi, r \}$

in which the state transition map $\phi$ and the read-out map $r$ are specified by

  $\phi : x(t+1) = A(t) x(t) + B(t) u(t)$,  $x(0) = x_0$,
  $r : y(t) = C(t) x(t) + D(t) u(t)$.   (B.1)

Here $T \subseteq Z$; $m, n, p \in Z_+$; $u : T \to R^m$, $u \in U$; $x_0 \in R^n$; $A : T \to R^{n \times n}$, $B : T \to R^{n \times m}$, $C : T \to R^{p \times n}$, $D : T \to R^{p \times m}$; and $x : T \to R^n$, $y : T \to R^p$ are determined by the above recursions and are called respectively the state function and the output function. Such a system is called time-invariant if for all $t \in T$, $A(t) = A(0)$, $B(t) = B(0)$, $C(t) = C(0)$, $D(t) = D(0)$. The parameters of such a system are denoted by $\{n, m, p, A, B, C, D\}$.

Definition B.2 A Gaussian stochastic control system is defined by the representation

  $x(t+1) = A(t) x(t) + B(t) u(t) + M(t) v(t)$,  $x(0) = x_0$,
  $y(t) = C(t) x(t) + D(t) u(t) + N(t) v(t)$,   (B.2)

where $T = \{0, 1, \ldots, t_1\}$, $t_1 \in Z_+$, $X = R^n$, $U = R^m$, $Y = R^p$; $x_0 : \Omega \to X$, $x_0 \in G(m_0, Q_0)$; $v : \Omega \times T \to R^r$ is a Gaussian white noise process with, for all $t \in T$, $v(t) \in G(0, V(t))$, $V : T \to R^{r \times r}$, $V(t) = V(t)^T \ge 0$; the $\sigma$-algebras $F^{x_0}$ and $F^v_{t_1}$ are independent; $A : T \to R^{n \times n}$, $B : T \to R^{n \times m}$, $C : T \to R^{p \times n}$, $D : T \to R^{p \times m}$, $M : T \to R^{n \times r}$, $N : T \to R^{p \times r}$, $x : \Omega \times T \to X$, and $y : \Omega \times T \to Y$.

Stochastic realization

The problem of obtaining a representation of a stochastic process as the output of a stochastic system is called the stochastic realization problem. Below the weak stochastic realization problem for stationary Gaussian processes is mentioned. For references on this problem see [9] and the paper by G. Picci elsewhere in this volume.

Consider a stationary Gaussian process with mean value function equal to zero and covariance function $W : T \to R^{p \times p}$. The problem is whether there exists a time-invariant Gaussian system, say of the form

  $x(t+1) = A x(t) + M v(t)$,  $y(t) = C x(t) + N v(t)$,   (B.3)

with $\mathrm{sp}(A) \subset C^-$, such that the covariance function of the output process $y$ equals the given covariance function. If so, then this system is said to be a stochastic realization of the given process. There is a necessary and sufficient condition for the existence of such a realization. When a realization exists there may be many realizations. Attention is then restricted to minimal realizations, for which the dimension of the state space is as small as possible. There is a classification of all minimal stochastic realizations. In addition, there is an algorithm to construct a minimal realization.

C Concepts from information theory

The purpose of this appendix is to describe for the reader the main concepts of information theory. In Appendix D formulas are presented for information measures of Gaussian

random variables and in Appendix E formulas for information measures of stationary Gaussian processes.

Information theory originated with the work of C.E. Shannon [61, 62]. The Russian mathematicians A.N. Kolmogorov, I.M. Gelfand, and A.M. Yaglom partly developed the measure theoretic formulation of information theory, see [24]. The book [57] by M.S. Pinsker partly summarizes information theory as developed by the Russian school. The publications of A. Perez, [55, 56], offer another measure theoretic formulation of information theory in which martingale theory is used as a technique to establish convergence results. Of more recent contributions to information theory we mention the work of I. Csiszar [6]. A recent textbook on information theory is the one by T.M. Cover and J.A. Thomas [3]. Information theory of continuous stochastic processes is treated in the book by S. Ihara [37]. Other books are [30, 45].

Definition C.1 Let $x : \Omega \to X$ with $X = \{a_1, a_2, \ldots, a_n\}$ be a finite valued random variable and let $p_x : X \to R$ be its frequency function, $p_x(a_i) = P(\{x = a_i\})$. Define the entropy of the finite valued random variable $x$ as

  $H_C(x) = \sum_{i=1}^{n} p_x(a_i) \ln \frac{1}{p_x(a_i)}$.   (C.1)

The concept of entropy of a finite valued random variable should be regarded as entropy with respect to a counting measure. Entropy is a property of a probability measure, not of a random variable. Nevertheless, the terminology of information theory is followed in referring to entropy of a random variable.

Definition C.2 Let $x : \Omega \to R^n$ be a random variable whose probability distribution function is absolutely continuous with respect to Lebesgue measure, with density $p_x : R^n \to R_+$. Define the entropy with respect to Lebesgue measure of such a random variable by

  $H_L(x) = \int_{R^n \cap S} p_x(v) \ln \frac{1}{p_x(v)}\, dv$,   (C.2)

where $S$ is the support of $p_x$. The entropy defined above depends on Lebesgue measure, or in general on the measure with respect to which it is defined.

Definition C.3 [57, p. 9] Let $(\Omega, F)$ be a measurable space consisting of a set $\Omega$ and a $\sigma$-algebra $F$. Let $P_1, P_2$ be two probability measures defined on $(\Omega, F)$. Define the entropy of $P_1$ with respect to $P_2$ by the formula

  $H(P_1; P_2) = \sup_{\{E_1, \ldots, E_n\}} \sum_{i=1}^{n} P_1(E_i) \ln \frac{P_1(E_i)}{P_2(E_i)}$,   (C.3)

where the supremum is taken over all finite measurable partitions of $\Omega$.

Besides entropy there is the concept of mutual information.
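As a small illustration of Definitions C.1 and C.2, the sketch below evaluates both entropies numerically. The closed form used for the Gaussian case, $H_L(x) = \frac{1}{2}\ln\left((2\pi e)^n \det Q\right)$ for $x \in G(m, Q)$ with $Q > 0$, is a standard fact quoted for the illustration rather than taken from the text above, and the numerical inputs are arbitrary.

```python
import numpy as np

def entropy_finite(p):
    """H_C(x) of Definition C.1 for a finite-valued random variable with frequency function p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log(1.0 / p)))

def entropy_gaussian(Q):
    """H_L(x) of Definition C.2 for x in G(m, Q), Q > 0: 0.5 * ln((2*pi*e)^n det Q)."""
    n = Q.shape[0]
    return 0.5 * (n * np.log(2 * np.pi * np.e) + np.linalg.slogdet(Q)[1])

print(entropy_finite([0.5, 0.25, 0.25]))                    # = 1.5 * ln 2
print(entropy_gaussian(np.array([[2.0, 0.0], [0.0, 0.5]])))
```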


More information

Robust control and applications in economic theory

Robust control and applications in economic theory Robust control and applications in economic theory In honour of Professor Emeritus Grigoris Kalogeropoulos on the occasion of his retirement A. N. Yannacopoulos Department of Statistics AUEB 24 May 2013

More information

Special Classes of Fuzzy Integer Programming Models with All-Dierent Constraints

Special Classes of Fuzzy Integer Programming Models with All-Dierent Constraints Transaction E: Industrial Engineering Vol. 16, No. 1, pp. 1{10 c Sharif University of Technology, June 2009 Special Classes of Fuzzy Integer Programming Models with All-Dierent Constraints Abstract. K.

More information

A Control Methodology for Constrained Linear Systems Based on Positive Invariance of Polyhedra

A Control Methodology for Constrained Linear Systems Based on Positive Invariance of Polyhedra A Control Methodology for Constrained Linear Systems Based on Positive Invariance of Polyhedra Jean-Claude HENNET LAAS-CNRS Toulouse, France Co-workers: Marina VASSILAKI University of Patras, GREECE Jean-Paul

More information

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector On Minimax Filtering over Ellipsoids Eduard N. Belitser and Boris Y. Levit Mathematical Institute, University of Utrecht Budapestlaan 6, 3584 CD Utrecht, The Netherlands The problem of estimating the mean

More information

Nonparametric regression with martingale increment errors

Nonparametric regression with martingale increment errors S. Gaïffas (LSTA - Paris 6) joint work with S. Delattre (LPMA - Paris 7) work in progress Motivations Some facts: Theoretical study of statistical algorithms requires stationary and ergodicity. Concentration

More information

1 Linear Algebra Problems

1 Linear Algebra Problems Linear Algebra Problems. Let A be the conjugate transpose of the complex matrix A; i.e., A = A t : A is said to be Hermitian if A = A; real symmetric if A is real and A t = A; skew-hermitian if A = A and

More information

1 Review of di erential calculus

1 Review of di erential calculus Review of di erential calculus This chapter presents the main elements of di erential calculus needed in probability theory. Often, students taking a course on probability theory have problems with concepts

More information

Introduction to the Mathematical and Statistical Foundations of Econometrics Herman J. Bierens Pennsylvania State University

Introduction to the Mathematical and Statistical Foundations of Econometrics Herman J. Bierens Pennsylvania State University Introduction to the Mathematical and Statistical Foundations of Econometrics 1 Herman J. Bierens Pennsylvania State University November 13, 2003 Revised: March 15, 2004 2 Contents Preface Chapter 1: Probability

More information

ELEG 3143 Probability & Stochastic Process Ch. 6 Stochastic Process

ELEG 3143 Probability & Stochastic Process Ch. 6 Stochastic Process Department of Electrical Engineering University of Arkansas ELEG 3143 Probability & Stochastic Process Ch. 6 Stochastic Process Dr. Jingxian Wu wuj@uark.edu OUTLINE 2 Definition of stochastic process (random

More information

460 HOLGER DETTE AND WILLIAM J STUDDEN order to examine how a given design behaves in the model g` with respect to the D-optimality criterion one uses

460 HOLGER DETTE AND WILLIAM J STUDDEN order to examine how a given design behaves in the model g` with respect to the D-optimality criterion one uses Statistica Sinica 5(1995), 459-473 OPTIMAL DESIGNS FOR POLYNOMIAL REGRESSION WHEN THE DEGREE IS NOT KNOWN Holger Dette and William J Studden Technische Universitat Dresden and Purdue University Abstract:

More information

Model reduction for linear systems by balancing

Model reduction for linear systems by balancing Model reduction for linear systems by balancing Bart Besselink Jan C. Willems Center for Systems and Control Johann Bernoulli Institute for Mathematics and Computer Science University of Groningen, Groningen,

More information

Positive systems in the behavioral approach: main issues and recent results

Positive systems in the behavioral approach: main issues and recent results Positive systems in the behavioral approach: main issues and recent results Maria Elena Valcher Dipartimento di Elettronica ed Informatica Università dipadova via Gradenigo 6A 35131 Padova Italy Abstract

More information

2 Interval-valued Probability Measures

2 Interval-valued Probability Measures Interval-Valued Probability Measures K. David Jamison 1, Weldon A. Lodwick 2 1. Watson Wyatt & Company, 950 17th Street,Suite 1400, Denver, CO 80202, U.S.A 2. Department of Mathematics, Campus Box 170,

More information

Problem Description The problem we consider is stabilization of a single-input multiple-state system with simultaneous magnitude and rate saturations,

Problem Description The problem we consider is stabilization of a single-input multiple-state system with simultaneous magnitude and rate saturations, SEMI-GLOBAL RESULTS ON STABILIZATION OF LINEAR SYSTEMS WITH INPUT RATE AND MAGNITUDE SATURATIONS Trygve Lauvdal and Thor I. Fossen y Norwegian University of Science and Technology, N-7 Trondheim, NORWAY.

More information

Vector Space Basics. 1 Abstract Vector Spaces. 1. (commutativity of vector addition) u + v = v + u. 2. (associativity of vector addition)

Vector Space Basics. 1 Abstract Vector Spaces. 1. (commutativity of vector addition) u + v = v + u. 2. (associativity of vector addition) Vector Space Basics (Remark: these notes are highly formal and may be a useful reference to some students however I am also posting Ray Heitmann's notes to Canvas for students interested in a direct computational

More information

14 - Gaussian Stochastic Processes

14 - Gaussian Stochastic Processes 14-1 Gaussian Stochastic Processes S. Lall, Stanford 211.2.24.1 14 - Gaussian Stochastic Processes Linear systems driven by IID noise Evolution of mean and covariance Example: mass-spring system Steady-state

More information

Kalman Filter and Parameter Identification. Florian Herzog

Kalman Filter and Parameter Identification. Florian Herzog Kalman Filter and Parameter Identification Florian Herzog 2013 Continuous-time Kalman Filter In this chapter, we shall use stochastic processes with independent increments w 1 (.) and w 2 (.) at the input

More information

6.241 Dynamic Systems and Control

6.241 Dynamic Systems and Control 6.241 Dynamic Systems and Control Lecture 24: H2 Synthesis Emilio Frazzoli Aeronautics and Astronautics Massachusetts Institute of Technology May 4, 2011 E. Frazzoli (MIT) Lecture 24: H 2 Synthesis May

More information

1 Outline Part I: Linear Programming (LP) Interior-Point Approach 1. Simplex Approach Comparison Part II: Semidenite Programming (SDP) Concludin

1 Outline Part I: Linear Programming (LP) Interior-Point Approach 1. Simplex Approach Comparison Part II: Semidenite Programming (SDP) Concludin Sensitivity Analysis in LP and SDP Using Interior-Point Methods E. Alper Yldrm School of Operations Research and Industrial Engineering Cornell University Ithaca, NY joint with Michael J. Todd INFORMS

More information

Computation Of Asymptotic Distribution. For Semiparametric GMM Estimators. Hidehiko Ichimura. Graduate School of Public Policy

Computation Of Asymptotic Distribution. For Semiparametric GMM Estimators. Hidehiko Ichimura. Graduate School of Public Policy Computation Of Asymptotic Distribution For Semiparametric GMM Estimators Hidehiko Ichimura Graduate School of Public Policy and Graduate School of Economics University of Tokyo A Conference in honor of

More information

ECE504: Lecture 8. D. Richard Brown III. Worcester Polytechnic Institute. 28-Oct-2008

ECE504: Lecture 8. D. Richard Brown III. Worcester Polytechnic Institute. 28-Oct-2008 ECE504: Lecture 8 D. Richard Brown III Worcester Polytechnic Institute 28-Oct-2008 Worcester Polytechnic Institute D. Richard Brown III 28-Oct-2008 1 / 30 Lecture 8 Major Topics ECE504: Lecture 8 We are

More information

Markov processes Course note 2. Martingale problems, recurrence properties of discrete time chains.

Markov processes Course note 2. Martingale problems, recurrence properties of discrete time chains. Institute for Applied Mathematics WS17/18 Massimiliano Gubinelli Markov processes Course note 2. Martingale problems, recurrence properties of discrete time chains. [version 1, 2017.11.1] We introduce

More information

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3 Brownian Motion Contents 1 Definition 2 1.1 Brownian Motion................................. 2 1.2 Wiener measure.................................. 3 2 Construction 4 2.1 Gaussian process.................................

More information

d(x n, x) d(x n, x nk ) + d(x nk, x) where we chose any fixed k > N

d(x n, x) d(x n, x nk ) + d(x nk, x) where we chose any fixed k > N Problem 1. Let f : A R R have the property that for every x A, there exists ɛ > 0 such that f(t) > ɛ if t (x ɛ, x + ɛ) A. If the set A is compact, prove there exists c > 0 such that f(x) > c for all x

More information

MARKOV CHAINS: STATIONARY DISTRIBUTIONS AND FUNCTIONS ON STATE SPACES. Contents

MARKOV CHAINS: STATIONARY DISTRIBUTIONS AND FUNCTIONS ON STATE SPACES. Contents MARKOV CHAINS: STATIONARY DISTRIBUTIONS AND FUNCTIONS ON STATE SPACES JAMES READY Abstract. In this paper, we rst introduce the concepts of Markov Chains and their stationary distributions. We then discuss

More information

7.1. Calculus of inverse functions. Text Section 7.1 Exercise:

7.1. Calculus of inverse functions. Text Section 7.1 Exercise: Contents 7. Inverse functions 1 7.1. Calculus of inverse functions 2 7.2. Derivatives of exponential function 4 7.3. Logarithmic function 6 7.4. Derivatives of logarithmic functions 7 7.5. Exponential

More information

Lecture 2: Review of Prerequisites. Table of contents

Lecture 2: Review of Prerequisites. Table of contents Math 348 Fall 217 Lecture 2: Review of Prerequisites Disclaimer. As we have a textbook, this lecture note is for guidance and supplement only. It should not be relied on when preparing for exams. In this

More information

satisfying ( i ; j ) = ij Here ij = if i = j and 0 otherwise The idea to use lattices is the following Suppose we are given a lattice L and a point ~x

satisfying ( i ; j ) = ij Here ij = if i = j and 0 otherwise The idea to use lattices is the following Suppose we are given a lattice L and a point ~x Dual Vectors and Lower Bounds for the Nearest Lattice Point Problem Johan Hastad* MIT Abstract: We prove that given a point ~z outside a given lattice L then there is a dual vector which gives a fairly

More information

A Concise Course on Stochastic Partial Differential Equations

A Concise Course on Stochastic Partial Differential Equations A Concise Course on Stochastic Partial Differential Equations Michael Röckner Reference: C. Prevot, M. Röckner: Springer LN in Math. 1905, Berlin (2007) And see the references therein for the original

More information

On a class of stochastic differential equations in a financial network model

On a class of stochastic differential equations in a financial network model 1 On a class of stochastic differential equations in a financial network model Tomoyuki Ichiba Department of Statistics & Applied Probability, Center for Financial Mathematics and Actuarial Research, University

More information

g(.) 1/ N 1/ N Decision Decision Device u u u u CP

g(.) 1/ N 1/ N Decision Decision Device u u u u CP Distributed Weak Signal Detection and Asymptotic Relative Eciency in Dependent Noise Hakan Delic Signal and Image Processing Laboratory (BUSI) Department of Electrical and Electronics Engineering Bogazici

More information

Statistical Signal Processing Detection, Estimation, and Time Series Analysis

Statistical Signal Processing Detection, Estimation, and Time Series Analysis Statistical Signal Processing Detection, Estimation, and Time Series Analysis Louis L. Scharf University of Colorado at Boulder with Cedric Demeure collaborating on Chapters 10 and 11 A TT ADDISON-WESLEY

More information

ECON 582: Dynamic Programming (Chapter 6, Acemoglu) Instructor: Dmytro Hryshko

ECON 582: Dynamic Programming (Chapter 6, Acemoglu) Instructor: Dmytro Hryshko ECON 582: Dynamic Programming (Chapter 6, Acemoglu) Instructor: Dmytro Hryshko Indirect Utility Recall: static consumer theory; J goods, p j is the price of good j (j = 1; : : : ; J), c j is consumption

More information

Lecture 6: Gaussian Channels. Copyright G. Caire (Sample Lectures) 157

Lecture 6: Gaussian Channels. Copyright G. Caire (Sample Lectures) 157 Lecture 6: Gaussian Channels Copyright G. Caire (Sample Lectures) 157 Differential entropy (1) Definition 18. The (joint) differential entropy of a continuous random vector X n p X n(x) over R is: Z h(x

More information

1 More concise proof of part (a) of the monotone convergence theorem.

1 More concise proof of part (a) of the monotone convergence theorem. Math 0450 Honors intro to analysis Spring, 009 More concise proof of part (a) of the monotone convergence theorem. Theorem If (x n ) is a monotone and bounded sequence, then lim (x n ) exists. Proof. (a)

More information

Dynamic Data Factorization

Dynamic Data Factorization Dynamic Data Factorization Stefano Soatto Alessro Chiuso Department of Computer Science, UCLA, Los Angeles - CA 90095 Department of Electrical Engineering, Washington University, StLouis - MO 63130 Dipartimento

More information

A quasi-separation theorem for LQG optimal control with IQ constraints Andrew E.B. Lim y & John B. Moore z February 1997 Abstract We consider the dete

A quasi-separation theorem for LQG optimal control with IQ constraints Andrew E.B. Lim y & John B. Moore z February 1997 Abstract We consider the dete A quasi-separation theorem for LQG optimal control with IQ constraints Andrew E.B. Lim y & John B. Moore z February 1997 Abstract We consider the deterministic, the full observation and the partial observation

More information

Notes on Measure Theory and Markov Processes

Notes on Measure Theory and Markov Processes Notes on Measure Theory and Markov Processes Diego Daruich March 28, 2014 1 Preliminaries 1.1 Motivation The objective of these notes will be to develop tools from measure theory and probability to allow

More information

Optimal Data and Training Symbol Ratio for Communication over Uncertain Channels

Optimal Data and Training Symbol Ratio for Communication over Uncertain Channels Optimal Data and Training Symbol Ratio for Communication over Uncertain Channels Ather Gattami Ericsson Research Stockholm, Sweden Email: athergattami@ericssoncom arxiv:50502997v [csit] 2 May 205 Abstract

More information

Stochastic Realization of Binary Exchangeable Processes

Stochastic Realization of Binary Exchangeable Processes Stochastic Realization of Binary Exchangeable Processes Lorenzo Finesso and Cecilia Prosdocimi Abstract A discrete time stochastic process is called exchangeable if its n-dimensional distributions are,

More information

1 Vectors. Notes for Bindel, Spring 2017 Numerical Analysis (CS 4220)

1 Vectors. Notes for Bindel, Spring 2017 Numerical Analysis (CS 4220) Notes for 2017-01-30 Most of mathematics is best learned by doing. Linear algebra is no exception. You have had a previous class in which you learned the basics of linear algebra, and you will have plenty

More information

Law of Cosines and Shannon-Pythagorean Theorem for Quantum Information

Law of Cosines and Shannon-Pythagorean Theorem for Quantum Information In Geometric Science of Information, 2013, Paris. Law of Cosines and Shannon-Pythagorean Theorem for Quantum Information Roman V. Belavkin 1 School of Engineering and Information Sciences Middlesex University,

More information

Chapter 9 Observers, Model-based Controllers 9. Introduction In here we deal with the general case where only a subset of the states, or linear combin

Chapter 9 Observers, Model-based Controllers 9. Introduction In here we deal with the general case where only a subset of the states, or linear combin Lectures on Dynamic Systems and Control Mohammed Dahleh Munther A. Dahleh George Verghese Department of Electrical Engineering and Computer Science Massachuasetts Institute of Technology c Chapter 9 Observers,

More information

Modeling and Analysis of Dynamic Systems

Modeling and Analysis of Dynamic Systems Modeling and Analysis of Dynamic Systems Dr. Guillaume Ducard Fall 2017 Institute for Dynamic Systems and Control ETH Zurich, Switzerland G. Ducard c 1 / 57 Outline 1 Lecture 13: Linear System - Stability

More information

Iterative procedure for multidimesional Euler equations Abstracts A numerical iterative scheme is suggested to solve the Euler equations in two and th

Iterative procedure for multidimesional Euler equations Abstracts A numerical iterative scheme is suggested to solve the Euler equations in two and th Iterative procedure for multidimensional Euler equations W. Dreyer, M. Kunik, K. Sabelfeld, N. Simonov, and K. Wilmanski Weierstra Institute for Applied Analysis and Stochastics Mohrenstra e 39, 07 Berlin,

More information

Gaussian processes. Basic Properties VAG002-

Gaussian processes. Basic Properties VAG002- Gaussian processes The class of Gaussian processes is one of the most widely used families of stochastic processes for modeling dependent data observed over time, or space, or time and space. The popularity

More information

Chapter 7 Interconnected Systems and Feedback: Well-Posedness, Stability, and Performance 7. Introduction Feedback control is a powerful approach to o

Chapter 7 Interconnected Systems and Feedback: Well-Posedness, Stability, and Performance 7. Introduction Feedback control is a powerful approach to o Lectures on Dynamic Systems and Control Mohammed Dahleh Munther A. Dahleh George Verghese Department of Electrical Engineering and Computer Science Massachuasetts Institute of Technology c Chapter 7 Interconnected

More information

PROOF OF TWO MATRIX THEOREMS VIA TRIANGULAR FACTORIZATIONS ROY MATHIAS

PROOF OF TWO MATRIX THEOREMS VIA TRIANGULAR FACTORIZATIONS ROY MATHIAS PROOF OF TWO MATRIX THEOREMS VIA TRIANGULAR FACTORIZATIONS ROY MATHIAS Abstract. We present elementary proofs of the Cauchy-Binet Theorem on determinants and of the fact that the eigenvalues of a matrix

More information

MAS. Modelling, Analysis and Simulation. Modelling, Analysis and Simulation. Realization theory for linear switched systems. M.

MAS. Modelling, Analysis and Simulation. Modelling, Analysis and Simulation. Realization theory for linear switched systems. M. C e n t r u m v o o r W i s k u n d e e n I n f o r m a t i c a MAS Modelling, Analysis and Simulation Modelling, Analysis and Simulation Realization theory for linear switched systems M. Petreczky REPORT

More information

11 a 12 a 21 a 11 a 22 a 12 a 21. (C.11) A = The determinant of a product of two matrices is given by AB = A B 1 1 = (C.13) and similarly.

11 a 12 a 21 a 11 a 22 a 12 a 21. (C.11) A = The determinant of a product of two matrices is given by AB = A B 1 1 = (C.13) and similarly. C PROPERTIES OF MATRICES 697 to whether the permutation i 1 i 2 i N is even or odd, respectively Note that I =1 Thus, for a 2 2 matrix, the determinant takes the form A = a 11 a 12 = a a 21 a 11 a 22 a

More information

Theory of Probability Fall 2008

Theory of Probability Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 8.75 Theory of Probability Fall 008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. Section Prekopa-Leindler inequality,

More information

Quick Review on Linear Multiple Regression

Quick Review on Linear Multiple Regression Quick Review on Linear Multiple Regression Mei-Yuan Chen Department of Finance National Chung Hsing University March 6, 2007 Introduction for Conditional Mean Modeling Suppose random variables Y, X 1,

More information

1 Introduction This work follows a paper by P. Shields [1] concerned with a problem of a relation between the entropy rate of a nite-valued stationary

1 Introduction This work follows a paper by P. Shields [1] concerned with a problem of a relation between the entropy rate of a nite-valued stationary Prexes and the Entropy Rate for Long-Range Sources Ioannis Kontoyiannis Information Systems Laboratory, Electrical Engineering, Stanford University. Yurii M. Suhov Statistical Laboratory, Pure Math. &

More information

N.G.Bean, D.A.Green and P.G.Taylor. University of Adelaide. Adelaide. Abstract. process of an MMPP/M/1 queue is not a MAP unless the queue is a

N.G.Bean, D.A.Green and P.G.Taylor. University of Adelaide. Adelaide. Abstract. process of an MMPP/M/1 queue is not a MAP unless the queue is a WHEN IS A MAP POISSON N.G.Bean, D.A.Green and P.G.Taylor Department of Applied Mathematics University of Adelaide Adelaide 55 Abstract In a recent paper, Olivier and Walrand (994) claimed that the departure

More information

SOME MEASURABILITY AND CONTINUITY PROPERTIES OF ARBITRARY REAL FUNCTIONS

SOME MEASURABILITY AND CONTINUITY PROPERTIES OF ARBITRARY REAL FUNCTIONS LE MATEMATICHE Vol. LVII (2002) Fasc. I, pp. 6382 SOME MEASURABILITY AND CONTINUITY PROPERTIES OF ARBITRARY REAL FUNCTIONS VITTORINO PATA - ALFONSO VILLANI Given an arbitrary real function f, the set D

More information

BUMPLESS SWITCHING CONTROLLERS. William A. Wolovich and Alan B. Arehart 1. December 27, Abstract

BUMPLESS SWITCHING CONTROLLERS. William A. Wolovich and Alan B. Arehart 1. December 27, Abstract BUMPLESS SWITCHING CONTROLLERS William A. Wolovich and Alan B. Arehart 1 December 7, 1995 Abstract This paper outlines the design of bumpless switching controllers that can be used to stabilize MIMO plants

More information

Lecture 5 Channel Coding over Continuous Channels

Lecture 5 Channel Coding over Continuous Channels Lecture 5 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 14, 2014 1 / 34 I-Hsiang Wang NIT Lecture 5 From

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

1. Introduction Over the last three decades a number of model selection criteria have been proposed, including AIC (Akaike, 1973), AICC (Hurvich & Tsa

1. Introduction Over the last three decades a number of model selection criteria have been proposed, including AIC (Akaike, 1973), AICC (Hurvich & Tsa On the Use of Marginal Likelihood in Model Selection Peide Shi Department of Probability and Statistics Peking University, Beijing 100871 P. R. China Chih-Ling Tsai Graduate School of Management University

More information

Gaussian, Markov and stationary processes

Gaussian, Markov and stationary processes Gaussian, Markov and stationary processes Gonzalo Mateos Dept. of ECE and Goergen Institute for Data Science University of Rochester gmateosb@ece.rochester.edu http://www.ece.rochester.edu/~gmateosb/ November

More information

Module 9: Stationary Processes

Module 9: Stationary Processes Module 9: Stationary Processes Lecture 1 Stationary Processes 1 Introduction A stationary process is a stochastic process whose joint probability distribution does not change when shifted in time or space.

More information

If we want to analyze experimental or simulated data we might encounter the following tasks:

If we want to analyze experimental or simulated data we might encounter the following tasks: Chapter 1 Introduction If we want to analyze experimental or simulated data we might encounter the following tasks: Characterization of the source of the signal and diagnosis Studying dependencies Prediction

More information

No. of dimensions 1. No. of centers

No. of dimensions 1. No. of centers Contents 8.6 Course of dimensionality............................ 15 8.7 Computational aspects of linear estimators.................. 15 8.7.1 Diagonalization of circulant andblock-circulant matrices......

More information

ECON 616: Lecture 1: Time Series Basics

ECON 616: Lecture 1: Time Series Basics ECON 616: Lecture 1: Time Series Basics ED HERBST August 30, 2017 References Overview: Chapters 1-3 from Hamilton (1994). Technical Details: Chapters 2-3 from Brockwell and Davis (1987). Intuition: Chapters

More information

4. Conditional risk measures and their robust representation

4. Conditional risk measures and their robust representation 4. Conditional risk measures and their robust representation We consider a discrete-time information structure given by a filtration (F t ) t=0,...,t on our probability space (Ω, F, P ). The time horizon

More information

Gaussian distributions and processes have long been accepted as useful tools for stochastic

Gaussian distributions and processes have long been accepted as useful tools for stochastic Chapter 3 Alpha-Stable Random Variables and Processes Gaussian distributions and processes have long been accepted as useful tools for stochastic modeling. In this section, we introduce a statistical model

More information

Linear-Quadratic Optimal Control: Full-State Feedback

Linear-Quadratic Optimal Control: Full-State Feedback Chapter 4 Linear-Quadratic Optimal Control: Full-State Feedback 1 Linear quadratic optimization is a basic method for designing controllers for linear (and often nonlinear) dynamical systems and is actually

More information

Chapter Robust Performance and Introduction to the Structured Singular Value Function Introduction As discussed in Lecture 0, a process is better desc

Chapter Robust Performance and Introduction to the Structured Singular Value Function Introduction As discussed in Lecture 0, a process is better desc Lectures on Dynamic Systems and Control Mohammed Dahleh Munther A Dahleh George Verghese Department of Electrical Engineering and Computer Science Massachuasetts Institute of Technology c Chapter Robust

More information

REPORTRAPPORT. Centrum voor Wiskunde en Informatica. Asymptotics of zeros of incomplete gamma functions. N.M. Temme

REPORTRAPPORT. Centrum voor Wiskunde en Informatica. Asymptotics of zeros of incomplete gamma functions. N.M. Temme Centrum voor Wiskunde en Informatica REPORTRAPPORT Asymptotics of zeros of incomplete gamma functions N.M. Temme Department of Analysis, Algebra and Geometry AM-R9402 994 Asymptotics of Zeros of Incomplete

More information

Notes on uniform convergence

Notes on uniform convergence Notes on uniform convergence Erik Wahlén erik.wahlen@math.lu.se January 17, 2012 1 Numerical sequences We begin by recalling some properties of numerical sequences. By a numerical sequence we simply mean

More information

Chapter Stability Robustness Introduction Last chapter showed how the Nyquist stability criterion provides conditions for the stability robustness of

Chapter Stability Robustness Introduction Last chapter showed how the Nyquist stability criterion provides conditions for the stability robustness of Lectures on Dynamic Systems and Control Mohammed Dahleh Munther A Dahleh George Verghese Department of Electrical Engineering and Computer Science Massachuasetts Institute of Technology c Chapter Stability

More information

Parameter Selection Techniques and Surrogate Models

Parameter Selection Techniques and Surrogate Models Parameter Selection Techniques and Surrogate Models Model Reduction: Will discuss two forms Parameter space reduction Surrogate models to reduce model complexity Input Representation Local Sensitivity

More information

Exploring Granger Causality for Time series via Wald Test on Estimated Models with Guaranteed Stability

Exploring Granger Causality for Time series via Wald Test on Estimated Models with Guaranteed Stability Exploring Granger Causality for Time series via Wald Test on Estimated Models with Guaranteed Stability Nuntanut Raksasri Jitkomut Songsiri Department of Electrical Engineering, Faculty of Engineering,

More information

THE INVERSE FUNCTION THEOREM

THE INVERSE FUNCTION THEOREM THE INVERSE FUNCTION THEOREM W. PATRICK HOOPER The implicit function theorem is the following result: Theorem 1. Let f be a C 1 function from a neighborhood of a point a R n into R n. Suppose A = Df(a)

More information

The Uniformity Principle: A New Tool for. Probabilistic Robustness Analysis. B. R. Barmish and C. M. Lagoa. further discussion.

The Uniformity Principle: A New Tool for. Probabilistic Robustness Analysis. B. R. Barmish and C. M. Lagoa. further discussion. The Uniformity Principle A New Tool for Probabilistic Robustness Analysis B. R. Barmish and C. M. Lagoa Department of Electrical and Computer Engineering University of Wisconsin-Madison, Madison, WI 53706

More information

Balanced Truncation 1

Balanced Truncation 1 Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.242, Fall 2004: MODEL REDUCTION Balanced Truncation This lecture introduces balanced truncation for LTI

More information

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS PROBABILITY: LIMIT THEOREMS II, SPRING 15. HOMEWORK PROBLEMS PROF. YURI BAKHTIN Instructions. You are allowed to work on solutions in groups, but you are required to write up solutions on your own. Please

More information

Algebraic Information Geometry for Learning Machines with Singularities

Algebraic Information Geometry for Learning Machines with Singularities Algebraic Information Geometry for Learning Machines with Singularities Sumio Watanabe Precision and Intelligence Laboratory Tokyo Institute of Technology 4259 Nagatsuta, Midori-ku, Yokohama, 226-8503

More information