Just-in-Time Models with Applications to Dynamical Systems


Linköping Studies in Science and Technology
Thesis No. 601

Just-in-Time Models with Applications to Dynamical Systems

Anders Stenman

REGLERTEKNIK AUTOMATIC CONTROL LINKÖPING

Division of Automatic Control
Department of Electrical Engineering
Linköping University, Linköping, Sweden
March 1997

Just-in-Time Models with Applications to Dynamical Systems
© 1997 Anders Stenman, stenman@isy.liu.se
Department of Electrical Engineering, Linköping University, Linköping, Sweden
LIU-TEK-LIC-1997:02
ISBN ISSN

To Maria


Abstract

System identification deals with the problem of estimating models of dynamical systems given observations from the systems. In this thesis we focus on the nonlinear modeling problem and, in particular, on the situation that occurs when a very large amount of data is available. Traditional treatments of the estimation problem in statistics and system identification have mainly focused on global modeling approaches, i.e., the model has been optimized using the entire data set. When the number of samples becomes large, however, this approach becomes less attractive, mainly because of the computational complexity. We instead assume that all observations are stored in a database, and that models are built dynamically as the actual need arises. When a model is needed in a neighborhood around an operating point, a subset of the data closest to the operating point is retrieved from the database, and a local modeling operation is performed on that subset. For this concept, the name Just-in-Time models has been adopted.

It is proposed that the Just-in-Time estimator is formed as a weighted average of the data in the neighborhood, where the weights are optimized such that the pointwise mean square error (MSE) measure is minimized. The number of data retrieved from the database is determined using a local bias/variance error tradeoff. This is closely related to the nonparametric kernel estimation concept which is commonly used in statistics. A review of kernel methods is therefore presented in one of the introductory chapters. The asymptotic properties of the method are investigated. It is shown that the Just-in-Time estimator produces consistent estimates, and that the convergence rate as a function of the sample size is of the same order as for the kernel methods.

Two important applications of the concept are presented. The first one considers nonlinear time domain identification, which is the problem of predicting the outputs of nonlinear dynamical systems given data sets of past inputs and outputs of the systems. The second one occurs within frequency domain identification, when one is faced with the problem of estimating the frequency response function of a linear system.

Compared to global methods, the advantage with Just-in-Time models is that they are optimized locally, which might increase the performance. A possible drawback is the computational complexity, both because we have to search for neighborhoods in a multidimensional regressor space, and because the derived estimator is quite demanding in terms of computational effort.


Acknowledgments

I am very grateful to all the people that have supported me during the work with the thesis. First of all, I would like to thank my supervisors, Prof. Lennart Ljung and Dr. Fredrik Gustafsson, for their excellent guidance through the work. Especially Fredrik deserves my deepest gratitude for putting up with all my stupid questions. I am also indebted to our former visitors, Daniel Rivera and Alexander Nazin, for insightful discussions about my work and for giving interesting new ideas and proposals for future research. Dr. Peter Lindskog and Dr. Jonas Sjöberg have read the thesis thoroughly and have given valuable comments and suggestions for improvements. For this I am very grateful. Peter has also provided the pictures that are used in the examples at the end of Chapter 6. I also want to thank Mattias Olofsson for keeping the computers running, and all the hackers and volunteers around the world that provide such excellent and free software tools as LaTeX and XEmacs(1). Finally, I would like to thank Maria for all her support during the writing of this thesis. I guess that it is my turn to do all the cleaning and washing the following months. ;-)

This work was supported by the Swedish Research Council for Engineering Sciences (TFR), which is gratefully acknowledged.

Linköping, March 1997
Anders Stenman

(1) C-u 50 M-x all-hail-xemacs


Contents

Notation

1 Introduction
    The Regression Problem
    Parametric Methods
    Nonparametric Methods
    The System Identification Problem
    Just-in-Time Models
    Applications
    Thesis Outline
    Contributions

2 Parametric Methods
    Parametric Regression Models
    Parametric Models in System Identification
        Linear Black-box Models
        Nonlinear Black-box Models
    Parameter Estimation
        Linear Least Squares
        Nonlinear Least Squares
    Asymptotic Properties of the Model

3 Nonparametric Methods
    The Basic Smoothing Problem
    Local Polynomial Kernel Estimators
    K-Nearest Neighbor Estimators
    Statistical Properties of Kernel Estimators
        The MSE and MISE Criteria
        Asymptotic MSE Approximation
    Bandwidth Selection
    Extensions to the Multivariable Case
    Appendix: Proof of Theorem

4 Just-in-Time Models
    The Just-in-Time Idea
    The Just-in-Time Estimator
    Optimal Weights
        The MSE Formula
        Optimizing the Weights
        Properties of the Weight Sequence
        Comparison with Kernel Weights
    Estimation of Hessian and Noise Variance
        Using Linear Least Squares
        Using a Weighted Mean
    Data selection
        Neighborhood size
        Neighborhood shape
    The Just-in-Time Algorithm
    Properties of the Just-in-Time Estimator
        Computational Aspects
        Convergence

5 Asymptotic Properties of the Just-in-Time Estimator
    Asymptotic Properties of the Scalar Estimator
    Appendix: Some Power Series Formulas
    Appendix: Moments of a Uniformly Distributed Random Variable

6 Applications to Dynamical Systems
    Nonparametric System Identification
        A Linear System
        A Nonlinear System
    Tank Level Modeling
    Water Heating Process

7 Applications to Frequency Response Estimation
    Traditional Methods
        Properties of the ETFE
        Smoothing the ETFE
        Asymptotic Properties of the Estimate
        An Example
    Using the Just-in-Time Approach
    Aircraft Flight Flutter Data
    Summary

8 Summary & Conclusions

Bibliography

Subject Index


Notation

Abbreviations

AIC     Akaike's Information Theoretic Criterion
AMISE   Asymptotic Mean Integrated Squared Error
AMSE    Asymptotic Mean Squared Error
ETFE    Empirical Transfer Function Estimate
FPE     Akaike's Final Prediction Error
JIT     Just-in-Time
MISE    Mean Integrated Squared Error
MSE     Mean Squared Error
RMSE    Root Mean Squared Error

Symbols

C              The set of complex numbers
R              The set of real numbers
R^d            Euclidean d-dimensional space
I_d            The d × d identity matrix
a_N ∼ b_N      If and only if lim_{N→∞} (a_N / b_N) = 1
a_N = o(b_N)   If and only if lim sup_{N→∞} |a_N / b_N| = 0
a_N = O(b_N)   If and only if lim sup_{N→∞} |a_N / b_N| < ∞
Ω_M            A neighborhood around the current operating point that contains M data
λ              Noise variance

Operators and Functions

arg min_x f(x)   The minimizing argument of the function f(·) with respect to x
E X              Mathematical expectation of the random variable X
Var X            Variance of the random vector X
vec A            The vector of the matrix A, obtained by stacking the columns of A underneath each other in order from left to right
vech A           The vector-half of the matrix A, obtained from vec A by eliminating the above-diagonal entries of A
D_f(x)           The d × 1 derivative vector whose ith entry is equal to (∂/∂x_i) f(x)
H_f(x)           The d × d Hessian matrix whose (i, j)th entry is equal to (∂²/∂x_i ∂x_j) f(x)
μ_k(f)           ∫ x^k f(x) dx
R(f)             ∫ f²(x) dx
A^T              The transpose of the matrix A
tr A             The trace of the matrix A

1 Introduction

The problem considered in this thesis is how to derive relationships between inputs and outputs of a dynamical system when very little a priori knowledge is available. In traditional system identification literature, this is usually known as black-box modeling. A very rich and well established theory for black-box modeling of linear systems exists, see for example [30] and [43]. In recent years the interest in nonlinear system identification has been growing, and the attention has been focused on a number of nonlinear black-box structures, such as neural networks and wavelets, to mention some of them [42]. However, nonlinear identification has been studied for a long time within the statistical community, where it is known under the name nonparametric regression.

1.1 The Regression Problem

Within many areas of science one often wishes to study the relationship between a number of variables. The purpose could for example be to predict the outcome of one of the variables on the basis of information provided by the others. In statistical theory this is usually referred to as the regression problem. The objective is to determine a functional relationship between a predictor variable (or regression variable) X ∈ R^n and a response variable Y ∈ R, given a set of observations {(X_1, Y_1), ..., (X_N, Y_N)}. In a mathematical sense, the problem is to find a function of X, f(X), such that the difference

    Y − f(X)                                                    (1.1)

becomes small in some sense. It is a well-known fact that the function f that minimizes

    E(Y − f(X))²                                                (1.2)

is the conditional expectation of Y given X,

    f(x) = E(Y | X).                                            (1.3)

This function, the best mean square predictor of Y given X, is often called the regression function or the regression of Y on X. If N data points have been collected, the regression relation can be modeled as

    Y_i = f(X_i) + e_i,   i = 1, ..., N,                        (1.4)

where {e_i} are identically distributed random variables with zero means, which are independent of the predictor data {X_i}.

The task of estimating the regression function f from observations can be done in essentially two different ways. The quite commonly used parametric approach is to assume that the function f has a pre-specified form, for instance a hyperplane with unknown slope and offset. As an alternative one could try to estimate f nonparametrically, without reference to a specific form.

1.2 Parametric Methods

Parametric estimation methods rely on the assumption that the true regression function f has a pre-specified functional form, which can be fully described by a finite-dimensional parameter vector θ:

    f(x, θ).                                                    (1.5)

The structure of the model is chosen from families that are known to be flexible and which have been successful in previous applications. This means that the parameters are not necessarily required to have any physical meaning; they are just tuned to fit the observed data as well as possible. It is natural to consider the parameterization (1.5) as an expansion of basis functions g_i(·), i.e.,

    f(x, θ) = Σ_{i=1}^r α_i g_i(x, β_i, γ_i).

This formulation allows the dependence of f on some of the components of θ to be linear while it is nonlinear in others. This is for instance the situation in some commonly used special cases of basis functions, as shown in Table 1.1, which will be more thoroughly described in Chapter 2.

    Modeling approach               g_i(x, β_i, γ_i)
    Fourier series                  sin(x β_i + γ_i)
    Feedforward neural networks     σ(x^T β_i + γ_i),   σ(x) = 1/(1 + e^{−x})
    Radial basis functions          κ(||x − γ_i||_{β_i}),   κ(x) = e^{−x²}
    Linear regression               x_i

    Table 1.1 Basis functions used in some common modeling approaches.

Once a particular model structure is chosen, the parameters can be obtained from the observations by an optimization procedure that minimizes the prediction errors in a global least squares fashion,

    θ̂ = arg min_θ Σ_{i=1}^N (Y_i − f(X_i, θ))².                 (1.6)

This optimization problem usually has to be solved using a numeric search routine, except in the linear case where an explicit solution exists. In the nonlinear case it will typically have numerous local minima, which in general makes the search for the desired global minimum hard [9].

The greatest advantage with parametric models is that they give a very compact description of the data set once the parameter vector estimate θ̂ is computed. A drawback, however, is the required assumption of the imposed parameterization. Sometimes the assumed function family (or model structure) might be too restrictive or too low-dimensional (i.e., too few parameters) to fit unexpected features in the data.
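As a concrete illustration of the global fit (1.6), the following Python sketch fits a small radial basis expansion of the type listed in Table 1.1 by numerical least squares. The data set, the number of basis functions, and the use of scipy's least_squares routine are assumptions made for this example only; they are not part of the thesis.

```python
import numpy as np
from scipy.optimize import least_squares

# Sketch of the global fit (1.6) for a small radial basis expansion,
# f(x, theta) = sum_k alpha_k * exp(-(beta_k * (x - gamma_k))^2).
# The data and model size below are made up for illustration.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, 200)
Y = np.tanh(3 * X) + 0.05 * rng.normal(size=X.size)    # hypothetical observations

r = 4                                                   # number of basis functions

def model(theta, x):
    alpha, beta, gamma = np.split(theta, 3)             # alpha_k, inverse widths, centers
    return sum(a * np.exp(-(b * (x - g)) ** 2) for a, b, g in zip(alpha, beta, gamma))

def residuals(theta):                                   # prediction errors Y_i - f(X_i, theta)
    return Y - model(theta, X)

theta0 = np.concatenate([np.ones(r), np.ones(r), np.linspace(-1, 1, r)])
fit = least_squares(residuals, theta0)                  # numeric search; local minima are possible
print(fit.cost)
```

Because the parameters enter nonlinearly, the search starts from an initial guess theta0 and is only guaranteed to find a local minimum, which is exactly the difficulty pointed out above.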

1.3 Nonparametric Methods

The problems with parametric regression methods can be overcome by removing the restriction that the regression function belongs to a parametric function family. This leads to an approach which is usually referred to as nonparametric regression. The basic idea behind nonparametric methods is that one should let the data decide which function fits them best, without the restrictions imposed by a parametric model. There exist several methods for obtaining nonparametric estimates of the functional relationship, ranging from the simple nearest neighbor method to more advanced smoothing techniques. A fundamental assumption is that observations located close to each other are related, so that an estimate at a certain operating point x can be constructed from observations in a small neighborhood around x.

The simplest nonparametric method is perhaps the nearest neighbor approach. The estimate f̂(x) is taken as the response variable Y_k that corresponds to the regression vector X_k that is the nearest neighbor of x, i.e.,

    f̂(x) = Y_k,   k = arg min_k ||X_k − x||.

Hence the estimation problem is essentially reduced to a data set searching problem, rather than a modeling problem. Despite its simplicity, the nearest neighbor method suffers from a major drawback: the observations are almost always corrupted by measurement noise, so the nearest neighbor estimate is in general a very poor and noisy estimate of the true function value. Significant improvements can therefore be achieved using an interpolation or smoothing operation,

    f̂(x) = Σ_i w_i Y_i,                                         (1.7)

where {w_i} denotes a sequence of weights which may depend on x and the predictor variable data {X_i}. This weight sequence can of course be selected in many ways. An often used approach in statistics is to select the weights according to a kernel function [21],

    w_i = K_h(X_i − x),

which explicitly specifies the shape of the weight sequence. A similar approach is considered in signal processing applications, where the so-called Hamming window is frequently used for smoothing [2].

The weights in (1.7) are typically tuned by a smoothing parameter which controls the degree of local averaging, i.e., the size of the neighborhood around x. A too large neighborhood will include observations located far away from x, whose expected values may differ significantly from f(x), and as a result the estimator will produce an over-smoothed or biased estimate. When using a too small neighborhood, on the other hand, only a small number of observations will contribute to the estimate at x, hence making it under-smoothed or noisy. The basic problem in nonparametric methods is thus to find the optimal choice of the smoothing parameter that will balance the bias error against the variance error.

The advantage with nonparametric models is their flexibility, since they allow predictions to be computed without reference to a fixed parametric model. The price that has to be paid for that is the computational complexity. In general nonparametric methods require more computations than parametric ones. The convergence rate with respect to sample size N is also slower than for parametric methods.
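To make the smoothing operation (1.7) concrete, here is a minimal Python sketch of a kernel-weighted average, shown next to the plain nearest neighbor estimate. The Gaussian kernel shape and the bandwidth value are arbitrary choices for illustration, not prescriptions from the thesis.

```python
import numpy as np

def kernel_smooth(x, X, Y, h=0.1):
    """Weighted average (1.7) with kernel weights w_i = K_h(X_i - x).

    A Gaussian kernel is used here purely as an example; the bandwidth h
    controls the size of the neighborhood around x.
    """
    w = np.exp(-0.5 * ((X - x) / h) ** 2)   # unnormalized kernel weights
    w = w / w.sum()                         # normalize so the weights sum to one
    return np.dot(w, Y)

def nearest_neighbor(x, X, Y):
    """Nearest neighbor estimate: return the Y that belongs to the closest X."""
    return Y[np.argmin(np.abs(X - x))]
```

A small bandwidth h makes the estimate noisy (under-smoothed), a large one makes it biased (over-smoothed), which is the bias/variance tradeoff discussed above.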

1.4 The System Identification Problem

System identification is a special case of the regression problem presented in Section 1.1. It deals with the problem of determining mathematical models of dynamical systems on the basis of observed data from the systems. Having collected a data set of paired inputs and outputs

    S = {(u(t), y(t))}_{t=1}^N

from a system, the goal in time domain system identification is typically to try to model future outputs of the system as a function of past inputs and outputs,

    y(t) = f(ϕ(t)) + e(t),                                      (1.8)

where ϕ(t) is a so-called regression vector which consists of past data,

    ϕ(t) = (y(t−1), y(t−2), ..., u(t−1), u(t−2), ...)^T,

and e(t) is an error term which accounts for the fact that in general it is not possible to model y(t) as an exact function of past observations. Nevertheless, a requirement must be that the error term is small or white, so that we can treat f(ϕ(t)) as a good prediction of y(t),

    ŷ(t | t−1) = f(ϕ(t)).

The system identification problem is thus to find a good function f(·) such that the discrepancy between the true and the predicted outputs, y(t) − ŷ(t | t−1), is minimized.

The problem of estimating ŷ(t | t−1) = f(ϕ(t)) from experimental data with poor or no a priori knowledge of the system is usually referred to as black-box modeling [30]. It has traditionally been solved using parametric linear models of different sophistication, but problems usually occur when encountering highly nonlinear systems that do not lend themselves to approximation by linear models. As a consequence of this, the interest in nonlinear modeling alternatives like neural networks and radial basis functions has been growing in recent years [42, 6]. As an alternative one could apply nonparametric methods of the type described in Section 1.3. The predictor will then be of the form

    ŷ(t | t−1) = Σ_{k ≤ t−1} w_k y(k),                          (1.9)

where the weights are constructed such that they give measurements located close to ϕ(t) more influence than those located far away from it.

Example 1.1 (Lindskog [29])
Consider the laboratory-scale tank system shown in Figure 1.1 (a). Suppose the modeling aim is to describe how the water level h(t) changes with the voltage u(t) that controls the pump, given a data set that consists of 1000 observations of u(t)

and h(t). The data set is plotted in Figure 1.1 (b). A reasonable assumption is that the water level at the current time instant t can be expressed in terms of the water level and the pump voltage at the previous time instant t−1, i.e.,

    ĥ(t | t−1) = f(h(t−1), u(t−1)).

Assuming that the function f(·) can be described by a linear regression,

    ĥ(t | t−1) = θ_1 h(t−1) + θ_2 u(t−1) + θ_0,

the parameters θ_i can easily be estimated using linear least squares, resulting in θ_1 = 0.9063 and θ_2 = 1.2064, together with an estimated offset θ_0. The result from a simulation is shown in Figure 1.1 (c). The solid line represents the measured water level, and the dashed line corresponds to a simulation using the estimated parameters. As shown, the simulated water level follows the true level quite well except at levels close to zero, where the linear model produces negative levels. This indicates that the true system is nonlinear, and that better results could be achieved using a nonlinear or a nonparametric model. Figure 1.1 (d) shows a simulation using a nonparametric model of the type (1.9). The performance of the model is clearly much better at low water levels in this case.

A traditional application of nonparametric methods in system identification is in the frequency domain, when estimating the transfer function of a system. If the system considered is linear, i.e., if it can be modeled by the input-output relation

    y(t) = G_0(q) u(t) + e(t),   t = 1, ..., N,                 (1.10)

an estimate of the transfer function G_0(q) can be formed as the ratio between the Fourier transforms of the output and input signals,

    Ĝ_N(e^{iω}) = Y_N(ω) / U_N(ω).                              (1.11)

This estimate is often called the empirical transfer function estimate (ETFE), since it is formed with no other assumptions than linearity of the system [30]. It is well known that the ETFE is a very crude estimate of the true transfer function. This is due to the fact that the observations {(y(t), u(t))} are corrupted by measurement noise e(t), which propagates to the ETFE through the Fourier transform. In particular, for sufficiently large N, the ETFE can be written

    Ĝ_N(e^{iω}) = G_0(e^{iω}) + ρ_N(ω),                         (1.12)

where ρ_N(ω) is a complex disturbance with zero mean and variance proportional to the noise-to-input signal ratio. Hence the transfer function can be estimated in a nonparametric fashion,

    Ĝ(e^{iω_0}) = Σ_k w_k Ĝ_N(e^{iω_k}),

where the weights again are selected so that a good trade-off between the bias and the variance is achieved.

Figure 1.1 (a) A simple tank system. (b) Experimental data. (c) The result of a simulation with a linear model. Solid: true water level. Dashed: simulated water level. (d) The result of a simulation with a nonparametric model. Solid: true water level. Dashed: simulated water level.
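The two steps just described, forming the ETFE (1.11) from the discrete Fourier transforms of input and output and then smoothing it with frequency-local weights, can be sketched as below. The simulated first-order test system, the noise level, and the Gaussian frequency window are invented purely for illustration.

```python
import numpy as np

# Sketch of the ETFE (1.11) and a locally weighted smoothing of it.
rng = np.random.default_rng(2)
N = 1024
u = rng.normal(size=N)
# Hypothetical linear system: y(t) = 0.5 y(t-1) + u(t-1) + e(t)
y = np.zeros(N)
for t in range(1, N):
    y[t] = 0.5 * y[t - 1] + u[t - 1] + 0.1 * rng.normal()

U, Y = np.fft.rfft(u), np.fft.rfft(y)
G_etfe = Y / U                               # empirical transfer function estimate (1.11)

omega = np.linspace(0, np.pi, G_etfe.size)   # frequency grid of the ETFE

def smooth_etfe(w0, width=0.05):
    """Weighted average of ETFE values at frequencies near w0 (bias/variance tradeoff)."""
    w = np.exp(-0.5 * ((omega - w0) / width) ** 2)
    return np.dot(w, G_etfe) / w.sum()

print(abs(smooth_etfe(0.5)))                 # smoothed gain estimate at omega = 0.5 rad
```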

1.5 Just-in-Time Models

The main contribution of this thesis is the Just-in-Time estimator, which is another approach to obtaining nonparametric estimates of nonlinear regression functions on the basis of observed data. Traditionally, in the system identification literature and in statistics, regression problems have been solved by global modeling methods, like kernel methods, neural networks or other nonlinear parametric models [42], but when dealing with very large data sets this approach becomes less attractive. For real industrial applications, for example in the chemical process industry, the volume of data may occupy several gigabytes. The global modeling process is in general associated with an optimization step as in (1.6). This optimization problem is typically non-convex and will have a number of local minima, which makes its solution difficult. Although the global model has the appealing feature of giving a high degree of data compression, it seems both inefficient and unnecessary to spend a large amount of computation on optimizing a model which is valid over the whole regressor space, while in most cases it is likely that we will only visit a very restricted subset of it.

Inspired by ideas and concepts from the database research area, we will take a conceptually different point of view. We assume that all observations are stored in a database, and that the models are built dynamically as the actual need arises. When a model is needed in a neighborhood of an operating point x, a subset of the data closest to the operating point is retrieved from the database, and a local modeling operation is performed on that subset. For this concept, we have adopted the name Just-in-Time models, suggested by [9]. As in (1.7), it is assumed that the Just-in-Time predictor is formed as a weighted average of the response variables in a neighborhood around x,

    f̂_JIT(x) = Σ_i w_i Y_i,

where the weights {w_i} are optimized in such a way that the pointwise mean square error (MSE) measure is minimized.

Compared to global methods, the advantage with Just-in-Time models is that the modeling is optimized locally, which might increase the performance. A possible drawback is the computational complexity, both because we have to search for a neighborhood of x in a multidimensional space, and because the derived estimator is quite computationally intensive. In this thesis, however, we will only investigate the properties of the modeling part of the problem. The searching problem will be left as a topic for future research.
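The following Python sketch illustrates the Just-in-Time procedure described above: retrieve the data closest to the operating point, then form a locally weighted average. The neighborhood size, the distance measure and the plain kernel weights are placeholders; the thesis instead determines the neighborhood through a bias/variance tradeoff and optimizes the weights for minimum local MSE.

```python
import numpy as np

def jit_predict(x, X, Y, M=50, h=0.2):
    """Conceptual Just-in-Time prediction: retrieve data near x, then model locally.

    X is an (N, d) array of stored regressors and Y the corresponding responses.
    The weights below are simple kernel weights used only as a placeholder; they
    are not the MSE-optimal weights derived in the thesis.
    """
    dist = np.linalg.norm(X - x, axis=1)      # search step: distances to the operating point
    idx = np.argsort(dist)[:M]                # the M closest observations form the neighborhood
    w = np.exp(-0.5 * (dist[idx] / h) ** 2)   # placeholder local weights
    return np.dot(w, Y[idx]) / w.sum()
```

Note that nothing is precomputed: the whole data set is kept, and all modeling work happens at prediction time, which is the computational price discussed above.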

1.6 Applications

The reasons and needs for estimating a model of a system can of course be many. When dealing with dynamical systems, some of the reasons can be as follows.

One obvious reason, which already has been mentioned briefly in Section 1.4, is prediction or forecasting. Based on the observations we have collected so far, we will be able to predict the future behavior of the system. Conceptually speaking, this can be described as in Figure 1.2. The predictor/estimator takes a data set and an operating point x as inputs, and uses some suitable modeling approach, parametric or nonparametric, to produce an estimate f̂(x).

Modern control theory usually requires a model of the process to be controlled. One example is predictive control, where the control signal from the regulator is optimized on the basis of predictions of future outputs of the system.

System analysis and fault detection in general require investigation or monitoring of certain parameters which may not be directly available through measurements. Therefore we will have to derive their values using a model of the system.

Figure 1.2 A conceptual view of modeling: an estimator takes the data set {(Y_k, X_k)}_{k=1}^N and an operating point x as inputs and produces the estimate f̂(x).

1.7 Thesis Outline

The thesis is divided into six chapters, excluding the introductory and the concluding chapters. The first two chapters give an overview of existing parametric and nonparametric methods that relate to the Just-in-Time modeling concept, and the last four chapters derive, analyze and exemplify the proposed method.

The purpose of Chapter 2 is to give the reader a brief background on parametric estimation methods, especially in system identification applications. Examples of some commonly used linear and nonlinear black-box models are given, along with the two basic parameter estimation methods.

Chapter 3 serves as an introduction to nonparametric smoothing methods. The chapter is mainly focused on a special class of so-called kernel estimation methods which is widely used in statistics. The fundamental ideas and terminology are presented, as well as the statistical and asymptotic properties that are associated with these methods.

Chapter 4 is the core chapter of the thesis. It presents the basic ideas behind the Just-in-Time concept and proposes a possible implementation of a Just-in-Time

estimator. The chapter is concluded with a discussion regarding different aspects and properties of the method.

Chapter 5 presents an analysis of the asymptotic properties of the Just-in-Time estimator. The aim is to investigate the consistency, i.e., whether the estimator tends to the true regression function when the sample size tends to infinity, and the convergence rate, i.e., how fast it tends to this function.

In Chapter 6 the Just-in-Time method is applied to the time domain system identification problem. First two simulated examples are considered, and then two real data applications, a tank system and a water heating system, are successfully modeled by the method.

Chapter 7 gives an example of an important application of the Just-in-Time method in the field of frequency response estimation. The chapter starts by giving a review of the traditional treatments of the topic, whereafter the Just-in-Time modeling concept is modified to fit into this framework.

Finally, Chapter 8 gives a summary and directions for future work.

1.8 Contributions

The contributions of this thesis are mainly the material contained in Chapter 4 to Chapter 7. They can be summarized as follows:

The concept of Just-in-Time models is advocated as a method for obtaining predictions of a system given large data volumes.

A particular implementation of a Just-in-Time estimator is proposed, which forms the estimate in a nonparametric fashion as a weighted average of the response variables. The weights are optimized so that the local mean square error is minimized.

An analysis of the asymptotic properties of the Just-in-Time smoother is presented. It is shown that the method produces consistent estimates and that the convergence rate is of the same order as for nonparametric methods.

A comparison to kernel estimators is made, and it is shown that the Just-in-Time estimator is easier to generalize to higher regressor dimensions.

Examples with dynamical systems show that the Just-in-Time method for some systems gives smaller prediction errors than other proposed methods.

It is shown that the method is quite efficient in terms of performance for smoothing frequency response estimates.

The thesis is based on two papers that have been, or will be, presented at different conferences. The papers are:

[44] A. Stenman, F. Gustafsson, and L. Ljung. Just in time models for dynamical systems. In Proceedings of the 35th IEEE Conference on Decision and Control, Kobe, Japan, 1996.

[45] A. Stenman, A. V. Nazin, and F. Gustafsson. Asymptotic properties of Just-in-Time models. To be presented at SYSID '97, Fukuoka, Japan.


2 Parametric Methods

This chapter gives a brief review of parametric estimation methods, which quite often are considered when solving the regression problem described in Chapter 1. The basic concept of parametric regression methods is given in Section 2.1. Section 2.2 gives some examples of common black-box models, both linear and nonlinear, that are frequently used in system identification. Section 2.3 describes the two basic parameter estimation methods used when a certain model class is chosen. Section 2.4, finally, briefly states the basic asymptotic properties concerned with parametric models.

2.1 Parametric Regression Models

A very commonly used way of estimating the regression function f in a regression relationship,

    Y_i = f(X_i) + e_i,                                         (2.1)

on the basis of observed data {(X_i, Y_i)}_{i=1}^N, is the parametric approach. The basic assumption is that f belongs to a family of functions with a pre-specified functional form, and that this family can be parameterized by a finite-dimensional parameter vector θ,

    f(X_i, θ).                                                  (2.2)

The simplest example, which is very often used, is the linear regression,

    f(X_i, θ, φ) = X_i^T θ + φ,                                 (2.3)

where it is assumed that the relation between the variables can be described by a hyperplane, whose slope and offset are controlled by the parameters θ and φ. In the general case, though, a wide range of different nonlinear model structures is possible. The choice of parameterization depends very much on the situation. Sometimes there are physical reasons for modeling Y as a particular function of X, while at other times the choice is based on previous experience with similar data sets.

Once a particular model structure is chosen, the parameter vector θ can naturally be assessed by means of the fit between the model and the data set,

    Y_i − f(X_i, θ).                                            (2.4)

As will be described in Section 2.3, this fit can be performed in two major ways, depending on which norm is used and how the parameter vector appears in the parameterization. When the parameters enter linearly as in (2.3), they can easily be computed using simple and powerful methods. In the general case, though, the optimization problem is non-convex and may have a number of local minima, which makes its solution difficult.

An advantage with parametric models is that they give a very compact description of the data set once the parameter vector θ is estimated. In some applications, the data set may occupy several megabytes while the model is represented by only a handful of parameters. A major drawback, however, is the particular parameterization that must be imposed. Sometimes the assumed function family might be too restrictive or too low-dimensional to fit unexpected features in the data.

2.2 Parametric Models in System Identification

System identification is a special case of the regression relationship (2.1) where the response variable Y_t represents the output of a dynamical system at time t, Y_t = y(t), and the predictor variable X_t (usually denoted by ϕ(t) rather than X_t) consists of inputs and outputs to the system at previous time instants,

    X_t = (y(t−1), y(t−2), ..., u(t−1), u(t−2), ...)^T.

Over the years, different names and concepts have been associated with different parameterizations. We will in the following two subsections briefly describe some of the most commonly used ones.

2.2.1 Linear Black-box Models

Linear black-box models have been thoroughly discussed and analyzed in the system identification literature during the last decades, see for example [30] and [43].

The simplest linear model is the finite impulse response (FIR) model,

    y(t) = b_1 u(t−1) + ... + b_n u(t−n) + e(t) = B(q) u(t) + e(t),

where B(q) is a polynomial in the time shift operator q. Allowing the model order n to tend to infinity and using noise models of varying sophistication, all linear models can, as in [30], be described by the general model structure family

    A(q) y(t) = q^{−n_k} (B(q)/F(q)) u(t) + (C(q)/D(q)) e(t),   (2.5)

where n_k is the delay from u(t) to y(t) and

    A(q) = 1 + a_1 q^{−1} + ... + a_{n_a} q^{−n_a}
    B(q) = b_1 + b_2 q^{−1} + ... + b_{n_b} q^{−n_b+1}
    C(q) = 1 + c_1 q^{−1} + ... + c_{n_c} q^{−n_c}
    D(q) = 1 + d_1 q^{−1} + ... + d_{n_d} q^{−n_d}
    F(q) = 1 + f_1 q^{−1} + ... + f_{n_f} q^{−n_f}.

An often used special case of (2.5) is the ARX (Auto Regressive with eXogenous input) model,

    y(t) + a_1 y(t−1) + ... + a_{n_a} y(t−n_a) = b_1 u(t−n_k) + ... + b_{n_b} u(t−n_b−n_k+1) + e(t),   (2.6)

which corresponds to F(q) = C(q) = D(q) = 1. It has the nice property of being expressible in terms of a linear regression,

    y(t) = ϕ^T(t) θ + e(t),

and hence the parameter vector θ can be determined using simple and powerful estimation methods, see Section 2.3.1. Note that the parametric model used in Example 1.1 in Chapter 1 is of ARX type, with n_a = n_b = n_k = 1.
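Since the ARX model (2.6) is linear in the parameters, it can be estimated by ordinary least squares once the regression vectors ϕ(t) are stacked into a matrix. The sketch below, plain NumPy rather than code from the thesis, builds these regressors for given orders n_a, n_b and delay n_k and solves for θ = (a_1, ..., a_{n_a}, b_1, ..., b_{n_b}).

```python
import numpy as np

def fit_arx(y, u, na=1, nb=1, nk=1):
    """Least-squares fit of the ARX model (2.6) written as y(t) = phi(t)^T theta + e(t)."""
    y, u = np.asarray(y, float), np.asarray(u, float)
    start = max(na, nb + nk - 1)               # first t for which all past data exist
    rows = []
    for t in range(start, len(y)):
        phi = np.concatenate([
            -y[t - np.arange(1, na + 1)],      # -y(t-1), ..., -y(t-na)
            u[t - nk - np.arange(nb)],         # u(t-nk), ..., u(t-nb-nk+1)
        ])
        rows.append(phi)
    Phi = np.array(rows)
    theta, *_ = np.linalg.lstsq(Phi, y[start:], rcond=None)
    return theta                               # [a_1, ..., a_na, b_1, ..., b_nb]
```

With na = nb = nk = 1 this is exactly the structure used for the tank example in Chapter 1, where θ also includes an offset term that could be handled by appending a constant column to Phi.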

2.2.2 Nonlinear Black-box Models

When turning to nonlinear modeling, things in general become much more complicated. The reason is that almost nothing is excluded, and a very rich spectrum of possible model structures is available. It is natural to think of the parameterization (2.2) as a function expansion [42],

    f(ϕ(t), θ) = Σ_{k=1}^r α_k g_k(ϕ(t), β_k, γ_k).             (2.7)

The functions g_k(·) are usually referred to as basis functions, because the role they play in (2.7) is very similar to that of a functional space basis. Typically, the basis functions are constructed from a simple scalar mother basis function, κ(·), which is scaled and translated according to the parameters β_k and γ_k. Using scalar basis functions, there are three basic methods of expanding them into higher regressor dimensions:

Ridge construction. A ridge basis function has the form

    g_k(ϕ(t), β_k, γ_k) = κ(β_k^T ϕ(t) + γ_k),                  (2.8)

where κ(·) is a scalar basis function, β_k ∈ R^n and γ_k ∈ R. The ridge function is constant for all ϕ(t) in the direction where β_k^T ϕ(t) is constant. Hence the basis functions will have unbounded support in this subspace, although the mother basis function κ(·) has local support. See Figure 2.1 (a).

Radial construction. In contrast to the ridge construction, the radial basis functions have true local support, as is illustrated in Figure 2.1 (b). The radial support can be obtained using basis functions of the form

    g_k(ϕ(t), β_k, γ_k) = κ(||ϕ(t) − γ_k||_{β_k}),              (2.9)

where γ_k ∈ R^n is a center point and ||·||_{β_k} denotes an arbitrary norm on the regressor space. The norm is often taken as a scaled identity matrix.

Composition. A composition is obtained when the ridge and radial constructions are combined when forming the basis functions. A typical example is illustrated in Figure 2.1 (c). In general the composition can be written as a tensor product,

    g_k(ϕ(t), β_k, γ_k) = g_{k,1}(ϕ_1(t), β_{k,1}, γ_{k,1}) ··· g_{k,r}(ϕ_r(t), β_{k,r}, γ_{k,r}),   (2.10)

where each g_{k,i}(·) is either a ridge or a radial function.

Figure 2.1 Three different methods of expanding into higher regressor dimensions: (a) ridge, (b) radial, (c) composition.

Using the function expansion (2.7) and the different basis function constructions (2.8)-(2.10), a number of well-known nonlinear model structures can be formed, for example neural networks, radial basis function networks and wavelets.
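The three constructions (2.8)-(2.10) can be written down compactly as below. The sigmoid and Gaussian mother functions match the examples given in the text, but the scalar scaling used inside the radial and composition functions is just one possible choice of norm, made for illustration.

```python
import numpy as np

# Illustrative implementations of the three constructions (2.8)-(2.10).
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))     # ridge mother basis function, cf. (2.11)
gauss   = lambda x: np.exp(-x ** 2)              # radial mother basis function

def ridge(phi, beta, gamma):
    """(2.8): constant along directions where beta^T phi is constant."""
    return sigmoid(np.dot(beta, phi) + gamma)

def radial(phi, beta, gamma):
    """(2.9): depends only on a scaled distance from phi to the center gamma."""
    return gauss(np.linalg.norm(beta * (phi - gamma)))

def composition(phi, betas, gammas):
    """(2.10): tensor product of scalar basis functions, one per regressor component."""
    return np.prod([gauss(b * (p - g)) for p, b, g in zip(phi, betas, gammas)])
```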

Neural Networks

The combination of (2.7), the ridge construction (2.8), and the so-called sigmoid mother basis function,

    κ(x) = σ(x) = 1/(1 + e^{−x}),                               (2.11)

results in the celebrated one hidden layer feedforward neural net, see Figure 2.2. Many different generalizations of this basic structure are possible. If the outputs of the κ(·) blocks are weighted, summed and fed through a new layer of κ(·) blocks, one usually talks about multi-layer feedforward neural nets. So-called recurrent neural networks are obtained if instead some of the internal signals in the network are fed back to the input layer. See [23] for further structural issues. Neural network models are highly nonlinear in the parameters, and thus have to be estimated through numerical optimization schemes, as will be described in Section 2.3.2.

Figure 2.2 A one hidden layer feedforward neural net: the regressors ϕ_1(t), ..., ϕ_n(t) enter the input layer, are weighted by β and shifted by γ, passed through the κ(·) blocks of the hidden layer, and combined with the weights α into the output ŷ.

Radial Basis Networks

A closely related concept is the radial basis function (RBF) network [6]. It is constructed using the expansion (2.7) and the radial construction (2.9). The radial mother basis function κ(·) is often taken as a Gaussian function,

    κ(x) = e^{−x²}.

Compared to neural networks, the RBF network has the advantage of being linear in the parameters (provided that the location parameters are fixed). This makes the estimation process easier.
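The linearity-in-the-parameters property noted above means that, once the centers γ_k and widths β_k are fixed, fitting the RBF network reduces to linear least squares over the output weights α_k. The data set, grid of centers and width below are made-up values used only to show the mechanics.

```python
import numpy as np

# With fixed centers and widths, the RBF network (2.7) + (2.9) is linear in alpha.
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, 300)
Y = np.sin(3 * X) + 0.05 * rng.normal(size=X.size)      # hypothetical data

centers = np.linspace(-1, 1, 10)                        # gamma_k, fixed beforehand
width = 0.3                                             # scaling, fixed beforehand
Phi = np.exp(-((X[:, None] - centers) / width) ** 2)    # matrix of Gaussian basis outputs
alpha, *_ = np.linalg.lstsq(Phi, Y, rcond=None)         # linear-in-parameters estimation

y_hat = Phi @ alpha
print(np.mean((Y - y_hat) ** 2))                        # residual mean square error
```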

Wavelets

Wavelet decomposition of a function is another example of the parameterization (2.7) [10]. A mother basis function (usually referred to as the mother wavelet and denoted by ψ(·) rather than κ(·)) is scaled and translated to form a wavelet basis. The mother wavelet is usually a small wave (a pulse) with bounded support. It is common to let the expansion (2.7) be double indexed according to scale and location. For the scalar case and the specific choices β_j = 2^j and γ_k = k, the basis functions can therefore be written as

    g_{j,k} = 2^{j/2} κ(2^j ϕ(t) − k).                          (2.12)

Multivariable wavelet functions can be constructed from scalar ones using the composition method (2.10). Wavelets have multiresolution capabilities: several different scale parameters are used simultaneously and overlappingly. With a suitably chosen mother wavelet along with scaling and translation parameters, the wavelet basis can be made orthonormal, which makes it easy to compute the coordinates α_{j,k} in (2.7). See for example [42] for details.

2.3 Parameter Estimation

When a particular linear or nonlinear model structure is chosen, the next step is to estimate the parameters on the basis of the observations S_N = {(X_i, Y_i)}_{i=1}^N. This is usually done by minimizing the mean square error loss function

    V_N(θ, S_N) = (1/N) Σ_{i=1}^N (Y_i − f(X_i, θ))².           (2.13)

The parameter estimate is then given by

    θ̂_N = arg min_θ V_N(θ, S_N).                                (2.14)

Depending on how the parameters appear in the parameterization, this minimization can be performed either using a linear least squares approach or a nonlinear least squares approach.

2.3.1 Linear Least Squares

When the parameters enter linearly in the predictor, an explicit solution that minimizes (2.13) exists. The optimal parameter estimate is then simply given by

    θ̂_N = (Σ_{i=1}^N X_i X_i^T)^{−1} Σ_{i=1}^N X_i Y_i,         (2.15)

provided that the inverse in (2.15) exists. For numerical reasons this inverse is rarely formed. Instead the estimate is computed using QR or singular value decomposition [29].

2.3.2 Nonlinear Least Squares

When the predictor is nonlinear in the parameters, the minimum of the loss function (2.13) cannot be computed analytically. Instead one has to search for the minimum numerically. An often used numeric optimization method is Newton's algorithm [12],

    θ̂_N^{(k+1)} = θ̂_N^{(k)} − [V_N''(θ̂_N^{(k)}, S_N)]^{−1} V_N'(θ̂_N^{(k)}, S_N),   (2.16)

where V_N'(·) and V_N''(·) denote the gradient and the Hessian of the loss function, respectively. The parameter vector estimate θ̂_N^{(k)} is in each iteration updated in the negative gradient direction, with a step size according to the inverse Hessian. For model structures like neural networks, which are highly nonlinear in the parameters, this introduces a problem since several local minima exist. There are no guarantees that the parameter estimate converges to the global minimum of the loss function (2.13).
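A common practical variant of the Newton update (2.16) for the quadratic loss (2.13) is Gauss-Newton, where the Hessian is approximated by J^T J built from the Jacobian of the predictions. The sketch below shows that variant, not the exact Newton iteration in the text, and omits the step-size safeguards used in practice.

```python
import numpy as np

def gauss_newton(predict, jacobian, theta0, X, Y, iters=20):
    """Gauss-Newton variant of the iteration (2.16) for the loss (2.13).

    `predict(X, theta)` returns the model outputs f(X_i, theta) and
    `jacobian(X, theta)` their N x dim(theta) derivative matrix; the Hessian of
    the loss is approximated by J^T J (a common simplification).
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        r = Y - predict(X, theta)                  # prediction errors
        J = jacobian(X, theta)                     # Jacobian of predictions w.r.t. theta
        step = np.linalg.solve(J.T @ J, J.T @ r)   # approximate Newton step
        theta = theta + step
    return theta
```

As stressed above, such a search only converges to a local minimum, and the result can depend strongly on the initial value theta0.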

The first basic result is the following one:

    θ̂_N → θ*   as   N → ∞,                                      (2.17)

where

    θ* = arg min_θ E(Y_i − f(X_i, θ))².                          (2.18)

That is, as more and more data become available, the estimate converges to the value θ* that would minimize the expected value of the squared prediction errors. This is in a sense the best possible approximation of the true regression function that is available within the model structure. The expectation E in (2.18) is taken with respect to all random disturbances that affect the data, and it also includes averaging over the predictor variables.

The second basic result is the following one: if the prediction error ε_i(θ*) = Y_i − f(X_i, θ*) is approximately white noise, then the covariance matrix of θ̂_N is approximately given by

    E(θ̂_N − θ*)(θ̂_N − θ*)^T ≈ (λ/N) [E ψ_i ψ_i^T]^{−1},        (2.19)

where

    λ = E ε_i²(θ*)                                               (2.20)

and

    ψ_i = (d/dθ) f(X_i, θ) |_{θ=θ*}.                             (2.21)

The results (2.17) through (2.21) are general and hold for all model structures, both linear and nonlinear ones, subject only to some regularity and smoothness conditions. See [30] for more details around this.
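For the linear regression case, where ψ_i = X_i, the covariance expression (2.19) can be approximated directly from data by plugging in sample averages for the expectations in (2.20)-(2.21). The helper below is such a plug-in approximation, written as an assumption-laden sketch rather than a statement from the thesis.

```python
import numpy as np

def parameter_covariance(X, residuals):
    """Plug-in estimate of the covariance (2.19) for a linear model (psi_i = X_i).

    lambda in (2.20) is replaced by the sample variance of the residuals and
    E[psi_i psi_i^T] by its sample mean; X is an (N, d) regressor matrix.
    """
    N = len(residuals)
    lam = np.mean(np.asarray(residuals) ** 2)      # estimate of (2.20)
    psi_outer = (X.T @ X) / N                      # estimate of E[psi_i psi_i^T]
    return (lam / N) * np.linalg.inv(psi_outer)
```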

3 Nonparametric Methods

In Chapter 2, a brief review of parametric estimation methods was given. It was concluded that parametric methods are good modeling alternatives, since they give a high degree of data compression, i.e., the major features of the data are condensed into a small number of parameters. A problem with the approach is the requirement of a certain parameterization. This must be selected such that it matches the properties of the underlying regression function; otherwise quite poor results are often obtained.

The problem with parametric regression models mentioned above can be solved by removing the restriction that the regression function belongs to a parametric function family. This leads to an approach which is usually referred to as nonparametric regression. The basic idea behind nonparametric methods is that one should let the data decide which function fits them best, without the restrictions imposed by a parametric model.

Local nonparametric regression models have been discussed and analyzed in the statistical literature for a long time. In the context of so-called kernel regression methods, traditional approaches have involved the Nadaraya-Watson estimator [35, 51] and some alternative kernel estimators, for example the Priestley-Chao estimator [37] and the Gasser-Müller estimator [18]. In this chapter we give a brief introduction to a special class of such models, local polynomial kernel estimators [46, 7, 34, 15]. These estimate the regression function at a certain point by locally fitting a polynomial of degree p to the data using weighted least squares. The Nadaraya-Watson estimator can in this framework be seen as a special case, since it corresponds to fitting a zero degree polynomial, i.e., a local constant, to the data. The presentation here is neither formal nor complete; the purpose is just to

introduce concepts and notation used in the area. More comprehensive treatments of the topic are given in the books [50] and [21], upon which this survey is based. The outline is as follows: Section 3.1 describes the basic nonparametric smoothing problem, and Section 3.2 gives an introduction to local polynomial kernel estimators, which are one possible solution to the smoothing problem.

3.1 The Basic Smoothing Problem

Smoothing of a noisy data set {(X_i, Y_i)}_{i=1}^N concerns the problem of estimating the function f in the regression relationship

    Y_i = f(X_i) + e_i,   i = 1, ..., N,                         (3.1)

without the imposition that f belongs to a parametric family of functions. Depending on how the data have been collected, several alternatives exist. If there are multiple observations at a certain point x, an estimate of f(x) can be obtained by just taking the average of the corresponding Y-values. In most cases, however, repeated observations at a given x are not available, and one has to resort to other solutions that deduce the value of f(x) using observations at positions other than x.

In the trivial case where the regression function f is constant, estimation of f(x) reduces to taking the average over the response variables Y. In general situations, though, it is unlikely that the true regression curve is constant. Rather, the function is modeled as a smooth continuous function which is nearly constant in a small neighborhood around x. A natural approach is therefore to take the mean of the response variables near the point x. This local average should then be constructed so that it is defined only from observations in a small neighborhood around x. This local averaging can be seen as the basic idea of smoothing. Almost all smoothing methods can, at least asymptotically, be described as a weighted average of the Y_i's near x,

    f̂(x) = Σ_{i=1}^N w_i Y_i,                                   (3.2)

where {w_i} is a sequence of weights that may depend on x and the predictor data {X_i}. The estimator f̂(x) is usually called a smoother, and the result of a smoothing operation is called the smooth [49]. A simple smooth can be obtained by defining the weights as constant over adjacent intervals; this is quite similar to the histogram concept, and it is therefore sometimes referred to as the regressogram [48]. In more sophisticated methods like the kernel estimator approach, the weights are chosen to follow a kernel function K_h(·) of fixed form, w_i = K_h(X_i − x). Kernel estimators will be described in more detail in the following section.
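The regressogram mentioned above is simple enough to sketch in a few lines: partition the X axis into adjacent intervals and average the Y values within each one, which corresponds to constant weights per bin in (3.2). The number of bins is an arbitrary choice for the example.

```python
import numpy as np

def regressogram(X, Y, n_bins=10):
    """Regressogram smooth: average Y over adjacent X-intervals (constant weights per bin)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    edges = np.linspace(X.min(), X.max(), n_bins + 1)
    which = np.clip(np.digitize(X, edges) - 1, 0, n_bins - 1)   # bin index of each X_i
    means = np.array([Y[which == b].mean() if np.any(which == b) else np.nan
                      for b in range(n_bins)])
    return edges, means    # the smooth is piecewise constant over the bins
```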

The fact that smoothers, by definition, average over observations with considerably different expected values has been paid special attention in the statistical literature. The weights {w_i} are typically tuned by a smoothing parameter which controls the degree of local averaging, i.e., the size of the neighborhood around x. A too large neighborhood will include observations located far away from x, whose expected values may differ considerably from f(x), and as a result the estimator will produce an over-smoothed or biased estimate. When using a too small neighborhood, on the other hand, only a small number of observations will contribute to the estimate at x, hence making it under-smoothed or noisy. The basic problem in nonparametric methods is thus to find the optimal choice of smoothing parameter that will balance the bias error against the variance error.

Before going into details about kernel regression models, we will give some basic terminology and notation. Nonparametric regression is studied in both fixed design and random design contexts. In the fixed design case, the predictor variables consist of ordered non-random numbers. A special case is the equally spaced fixed design, where the difference X_{i+1} − X_i is constant for all i, for example X_i = i/N, i = 1, ..., N. The random design occurs when the predictor variables instead are independent, identically distributed random variables. The regression relationship is in both cases assumed to be modeled as in (3.1), where the e_i are independent random variables with zero means and variances λ, which are independent of {X_i}. The overview concentrates on the scalar case, because of its simpler notation; the results are generalized to the multivariable case later in the chapter.

3.2 Local Polynomial Kernel Estimators

Local polynomial kernel estimators form a special class of nonparametric regression models and were first discussed by Stone [46] and Cleveland [7]. The basic idea is to estimate the regression function f(·) at a particular point x by locally fitting a pth degree polynomial,

    θ_0 + θ_1(X_i − x) + ... + θ_p(X_i − x)^p,                   (3.3)

to the data {(Y_i, X_i)} via weighted least squares, where the weights are chosen according to a kernel function, K(·), centered about x and scaled according to a parameter h. A kernel estimate, f̂(x, h), of the true regression function at the point x is thus obtained as

    f̂(x, h) = θ̂_0,                                              (3.4)

where θ̂ = (θ̂_0, ..., θ̂_p)^T is the solution to the weighted least squares problem

    θ̂ = arg min_θ Σ_{i=1}^N {Y_i − θ_0 − θ_1(X_i − x) − ... − θ_p(X_i − x)^p}² K_h(X_i − x),   (3.5)

and

    K_h(X_i − x) = h^{−1} K((X_i − x)/h).                        (3.6)
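The weighted least squares problem (3.5)-(3.6) has a direct computational counterpart: build the local polynomial regressors in (X_i − x), weight them by the kernel, solve, and return θ̂_0 as in (3.4). The Gaussian kernel and the default bandwidth in the sketch below are arbitrary illustrative choices.

```python
import numpy as np

def local_poly_fit(x, X, Y, p=1, h=0.2):
    """Local polynomial kernel estimate (3.4)-(3.6) at the point x.

    A pth degree polynomial in (X_i - x) is fitted by weighted least squares
    with kernel weights K_h(X_i - x); the estimate of f(x) is theta_hat[0].
    """
    d = X - x
    W = np.exp(-0.5 * (d / h) ** 2) / h            # kernel weights, Gaussian K as an example
    Z = np.vander(d, p + 1, increasing=True)       # columns 1, (X_i - x), ..., (X_i - x)^p
    sqw = np.sqrt(W)
    theta, *_ = np.linalg.lstsq(Z * sqw[:, None], Y * sqw, rcond=None)
    return theta[0]                                # f_hat(x, h) = theta_hat_0, cf. (3.4)
```

Setting p = 0 recovers the Nadaraya-Watson estimator mentioned in the chapter introduction, i.e., a locally constant fit.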


More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Maximum Likelihood Estimation. only training data is available to design a classifier

Maximum Likelihood Estimation. only training data is available to design a classifier Introduction to Pattern Recognition [ Part 5 ] Mahdi Vasighi Introduction Bayesian Decision Theory shows that we could design an optimal classifier if we knew: P( i ) : priors p(x i ) : class-conditional

More information

Least Squares Approximation

Least Squares Approximation Chapter 6 Least Squares Approximation As we saw in Chapter 5 we can interpret radial basis function interpolation as a constrained optimization problem. We now take this point of view again, but start

More information

Econ 582 Nonparametric Regression

Econ 582 Nonparametric Regression Econ 582 Nonparametric Regression Eric Zivot May 28, 2013 Nonparametric Regression Sofarwehaveonlyconsideredlinearregressionmodels = x 0 β + [ x ]=0 [ x = x] =x 0 β = [ x = x] [ x = x] x = β The assume

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.2 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

12 - Nonparametric Density Estimation

12 - Nonparametric Density Estimation ST 697 Fall 2017 1/49 12 - Nonparametric Density Estimation ST 697 Fall 2017 University of Alabama Density Review ST 697 Fall 2017 2/49 Continuous Random Variables ST 697 Fall 2017 3/49 1.0 0.8 F(x) 0.6

More information

Gradient-Based Learning. Sargur N. Srihari

Gradient-Based Learning. Sargur N. Srihari Gradient-Based Learning Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

Identification, Model Validation and Control. Lennart Ljung, Linköping

Identification, Model Validation and Control. Lennart Ljung, Linköping Identification, Model Validation and Control Lennart Ljung, Linköping Acknowledgment: Useful discussions with U Forssell and H Hjalmarsson 1 Outline 1. Introduction 2. System Identification (in closed

More information

Function Approximation

Function Approximation 1 Function Approximation This is page i Printer: Opaque this 1.1 Introduction In this chapter we discuss approximating functional forms. Both in econometric and in numerical problems, the need for an approximating

More information

EECE Adaptive Control

EECE Adaptive Control EECE 574 - Adaptive Control Recursive Identification in Closed-Loop and Adaptive Control Guy Dumont Department of Electrical and Computer Engineering University of British Columbia January 2010 Guy Dumont

More information

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017 The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the

More information

ECO Class 6 Nonparametric Econometrics

ECO Class 6 Nonparametric Econometrics ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Model-free Predictive Control

Model-free Predictive Control Model-free Predictive Control Anders Stenman Department of Electrical Engineering Linköping University, S-581 83 Linköping, Sweden WWW: http://wwwcontrolisyliuse Email: stenman@isyliuse February 25, 1999

More information

ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation. Petra E. Todd

ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation. Petra E. Todd ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation Petra E. Todd Fall, 2014 2 Contents 1 Review of Stochastic Order Symbols 1 2 Nonparametric Density Estimation 3 2.1 Histogram

More information

In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment.

In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment. 1 Introduction and Problem Formulation In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment. 1.1 Machine Learning under Covariate

More information

EECE Adaptive Control

EECE Adaptive Control EECE 574 - Adaptive Control Basics of System Identification Guy Dumont Department of Electrical and Computer Engineering University of British Columbia January 2010 Guy Dumont (UBC) EECE574 - Basics of

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

Linear Models for Regression. Sargur Srihari

Linear Models for Regression. Sargur Srihari Linear Models for Regression Sargur srihari@cedar.buffalo.edu 1 Topics in Linear Regression What is regression? Polynomial Curve Fitting with Scalar input Linear Basis Function Models Maximum Likelihood

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Regression I: Mean Squared Error and Measuring Quality of Fit

Regression I: Mean Squared Error and Measuring Quality of Fit Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Additive Isotonic Regression

Additive Isotonic Regression Additive Isotonic Regression Enno Mammen and Kyusang Yu 11. July 2006 INTRODUCTION: We have i.i.d. random vectors (Y 1, X 1 ),..., (Y n, X n ) with X i = (X1 i,..., X d i ) and we consider the additive

More information

EE731 Lecture Notes: Matrix Computations for Signal Processing

EE731 Lecture Notes: Matrix Computations for Signal Processing EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University September 22, 2005 0 Preface This collection of ten

More information

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that Linear Regression For (X, Y ) a pair of random variables with values in R p R we assume that E(Y X) = β 0 + with β R p+1. p X j β j = (1, X T )β j=1 This model of the conditional expectation is linear

More information

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING

FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING Vishwanath Mantha Department for Electrical and Computer Engineering Mississippi State University, Mississippi State, MS 39762 mantha@isip.msstate.edu ABSTRACT

More information

Outline 2(42) Sysid Course VT An Overview. Data from Gripen 4(42) An Introductory Example 2,530 3(42)

Outline 2(42) Sysid Course VT An Overview. Data from Gripen 4(42) An Introductory Example 2,530 3(42) Outline 2(42) Sysid Course T1 2016 An Overview. Automatic Control, SY, Linköpings Universitet An Umbrella Contribution for the aterial in the Course The classic, conventional System dentification Setup

More information

Chapter 6: Nonparametric Time- and Frequency-Domain Methods. Problems presented by Uwe

Chapter 6: Nonparametric Time- and Frequency-Domain Methods. Problems presented by Uwe System Identification written by L. Ljung, Prentice Hall PTR, 1999 Chapter 6: Nonparametric Time- and Frequency-Domain Methods Problems presented by Uwe System Identification Problems Chapter 6 p. 1/33

More information

Some Background Material

Some Background Material Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Forecast comparison of principal component regression and principal covariate regression

Forecast comparison of principal component regression and principal covariate regression Forecast comparison of principal component regression and principal covariate regression Christiaan Heij, Patrick J.F. Groenen, Dick J. van Dijk Econometric Institute, Erasmus University Rotterdam Econometric

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information

Clustering by Mixture Models. General background on clustering Example method: k-means Mixture model based clustering Model estimation

Clustering by Mixture Models. General background on clustering Example method: k-means Mixture model based clustering Model estimation Clustering by Mixture Models General bacground on clustering Example method: -means Mixture model based clustering Model estimation 1 Clustering A basic tool in data mining/pattern recognition: Divide

More information

Stochastic Analogues to Deterministic Optimizers

Stochastic Analogues to Deterministic Optimizers Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured

More information

y k = ( ) x k + v k. w q wk i 0 0 wk

y k = ( ) x k + v k. w q wk i 0 0 wk Four telling examples of Kalman Filters Example : Signal plus noise Measurement of a bandpass signal, center frequency.2 rad/sec buried in highpass noise. Dig out the quadrature part of the signal while

More information

Regularized Least Squares

Regularized Least Squares Regularized Least Squares Ryan M. Rifkin Google, Inc. 2008 Basics: Data Data points S = {(X 1, Y 1 ),...,(X n, Y n )}. We let X simultaneously refer to the set {X 1,...,X n } and to the n by d matrix whose

More information

Topic 3: Neural Networks

Topic 3: Neural Networks CS 4850/6850: Introduction to Machine Learning Fall 2018 Topic 3: Neural Networks Instructor: Daniel L. Pimentel-Alarcón c Copyright 2018 3.1 Introduction Neural networks are arguably the main reason why

More information

Confidence intervals for kernel density estimation

Confidence intervals for kernel density estimation Stata User Group - 9th UK meeting - 19/20 May 2003 Confidence intervals for kernel density estimation Carlo Fiorio c.fiorio@lse.ac.uk London School of Economics and STICERD Stata User Group - 9th UK meeting

More information

Optimization Problems

Optimization Problems Optimization Problems The goal in an optimization problem is to find the point at which the minimum (or maximum) of a real, scalar function f occurs and, usually, to find the value of the function at that

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 6: Bias and variance (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 49 Our plan today We saw in last lecture that model scoring methods seem to be trading off two different

More information

NONLINEAR PLANT IDENTIFICATION BY WAVELETS

NONLINEAR PLANT IDENTIFICATION BY WAVELETS NONLINEAR PLANT IDENTIFICATION BY WAVELETS Edison Righeto UNESP Ilha Solteira, Department of Mathematics, Av. Brasil 56, 5385000, Ilha Solteira, SP, Brazil righeto@fqm.feis.unesp.br Luiz Henrique M. Grassi

More information

Nonparametric Regression. Badr Missaoui

Nonparametric Regression. Badr Missaoui Badr Missaoui Outline Kernel and local polynomial regression. Penalized regression. We are given n pairs of observations (X 1, Y 1 ),...,(X n, Y n ) where Y i = r(x i ) + ε i, i = 1,..., n and r(x) = E(Y

More information

LTI Systems, Additive Noise, and Order Estimation

LTI Systems, Additive Noise, and Order Estimation LTI Systems, Additive oise, and Order Estimation Soosan Beheshti, Munther A. Dahleh Laboratory for Information and Decision Systems Department of Electrical Engineering and Computer Science Massachusetts

More information

Problem Set 2 Solution Sketches Time Series Analysis Spring 2010

Problem Set 2 Solution Sketches Time Series Analysis Spring 2010 Problem Set 2 Solution Sketches Time Series Analysis Spring 2010 Forecasting 1. Let X and Y be two random variables such that E(X 2 ) < and E(Y 2 )

More information

On Ridge Functions. Allan Pinkus. September 23, Technion. Allan Pinkus (Technion) Ridge Function September 23, / 27

On Ridge Functions. Allan Pinkus. September 23, Technion. Allan Pinkus (Technion) Ridge Function September 23, / 27 On Ridge Functions Allan Pinkus Technion September 23, 2013 Allan Pinkus (Technion) Ridge Function September 23, 2013 1 / 27 Foreword In this lecture we will survey a few problems and properties associated

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Linear Regression 1 / 25. Karl Stratos. June 18, 2018

Linear Regression 1 / 25. Karl Stratos. June 18, 2018 Linear Regression Karl Stratos June 18, 2018 1 / 25 The Regression Problem Problem. Find a desired input-output mapping f : X R where the output is a real value. x = = y = 0.1 How much should I turn my

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Kernel Method: Data Analysis with Positive Definite Kernels

Kernel Method: Data Analysis with Positive Definite Kernels Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University

More information

Discussion of Regularization of Wavelets Approximations by A. Antoniadis and J. Fan

Discussion of Regularization of Wavelets Approximations by A. Antoniadis and J. Fan Discussion of Regularization of Wavelets Approximations by A. Antoniadis and J. Fan T. Tony Cai Department of Statistics The Wharton School University of Pennsylvania Professors Antoniadis and Fan are

More information

Support Vector Regression (SVR) Descriptions of SVR in this discussion follow that in Refs. (2, 6, 7, 8, 9). The literature

Support Vector Regression (SVR) Descriptions of SVR in this discussion follow that in Refs. (2, 6, 7, 8, 9). The literature Support Vector Regression (SVR) Descriptions of SVR in this discussion follow that in Refs. (2, 6, 7, 8, 9). The literature suggests the design variables should be normalized to a range of [-1,1] or [0,1].

More information

Linear regression COMS 4771

Linear regression COMS 4771 Linear regression COMS 4771 1. Old Faithful and prediction functions Prediction problem: Old Faithful geyser (Yellowstone) Task: Predict time of next eruption. 1 / 40 Statistical model for time between

More information

Adaptive Dual Control

Adaptive Dual Control Adaptive Dual Control Björn Wittenmark Department of Automatic Control, Lund Institute of Technology Box 118, S-221 00 Lund, Sweden email: bjorn@control.lth.se Keywords: Dual control, stochastic control,

More information

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 BIOLOGICAL INSPIRATIONS Some numbers The human brain contains about 10 billion nerve cells (neurons) Each neuron is connected to the others through 10000

More information

Machine Learning Practice Page 2 of 2 10/28/13

Machine Learning Practice Page 2 of 2 10/28/13 Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes

More information

4 Nonparametric Regression

4 Nonparametric Regression 4 Nonparametric Regression 4.1 Univariate Kernel Regression An important question in many fields of science is the relation between two variables, say X and Y. Regression analysis is concerned with the

More information

6 Pattern Mixture Models

6 Pattern Mixture Models 6 Pattern Mixture Models A common theme underlying the methods we have discussed so far is that interest focuses on making inference on parameters in a parametric or semiparametric model for the full data

More information