Just-in-Time Models with Applications to Dynamical Systems


Linköping Studies in Science and Technology
Thesis No. 601

Just-in-Time Models with Applications to Dynamical Systems

Anders Stenman

REGLERTEKNIK AUTOMATIC CONTROL LINKÖPING

Division of Automatic Control
Department of Electrical Engineering
Linköping University, Linköping, Sweden
March 1997

Just-in-Time Models with Applications to Dynamical Systems
© 1997 Anders Stenman, stenman@isy.liu.se
Department of Electrical Engineering, Linköping University, Linköping, Sweden
LIU-TEK-LIC-1997:02
ISBN ISSN

To Maria


Abstract

System identification deals with the problem of estimating models of dynamical systems given observations from the systems. In this thesis we focus on the nonlinear modeling problem and, in particular, on the situation that occurs when a very large amount of data is available. Traditional treatments of the estimation problem in statistics and system identification have mainly focused on global modeling approaches, i.e., the model has been optimized using the entire data set. When the number of samples becomes large, however, this approach becomes less attractive, mainly because of the computational complexity. We instead assume that all observations are stored in a database, and that models are built dynamically as the actual need arises. When a model is needed in a neighborhood around an operating point, a subset of the data closest to the operating point is retrieved from the database, and a local modeling operation is performed on that subset. For this concept, the name Just-in-Time models has been adopted.

It is proposed that the Just-in-Time estimator is formed as a weighted average of the data in the neighborhood, where the weights are optimized such that the pointwise mean square error (MSE) measure is minimized. The number of data retrieved from the database is determined using a local bias/variance error tradeoff. This is closely related to the nonparametric kernel estimation concept which is commonly used in statistics. A review of kernel methods is therefore presented in one of the introductory chapters. The asymptotic properties of the method are investigated. It is shown that the Just-in-Time estimator produces consistent estimates, and that the convergence rate as a function of the sample size is of the same order as for the kernel methods.

Two important applications of the concept are presented. The first one considers nonlinear time domain identification, which is the problem of predicting the outputs of nonlinear dynamical systems given data sets of past inputs and outputs of the systems. The second one occurs within frequency domain identification, when one is faced with the problem of estimating the frequency response function of a linear system.

Compared to global methods, the advantage with Just-in-Time models is that they are optimized locally, which might increase the performance. A possible drawback is the computational complexity, both because we have to search for neighborhoods in a multidimensional regressor space, and because the derived estimator is quite demanding in terms of computational effort.


Acknowledgments

I am very grateful to all the people that have supported me during the work with the thesis. First of all, I would like to thank my supervisors, Prof. Lennart Ljung and Dr. Fredrik Gustafsson, for their excellent guidance through the work. Especially Fredrik deserves my deepest gratitude for putting up with all my stupid questions. I am also indebted to our former visitors, Daniel Rivera and Alexander Nazin, for insightful discussions about my work and for giving interesting new ideas and proposals for future research. Dr. Peter Lindskog and Dr. Jonas Sjöberg have read the thesis thoroughly and have given valuable comments and suggestions for improvements. For this I am very grateful. Peter has also provided the pictures that are used in the examples at the end of Chapter 6. I also want to thank Mattias Olofsson for keeping the computers running, and all the hackers and volunteers around the world that provide such excellent and free software tools as LaTeX and XEmacs(1). Finally, I would like to thank Maria for all her support during the writing of this thesis. I guess that it is my turn to do all the cleaning and washing the following months. ;-)

This work was supported by the Swedish Research Council for Engineering Sciences (TFR), which is gratefully acknowledged.

Linköping, March 1997
Anders Stenman

(1) C-u 50 M-x all-hail-xemacs


Contents

Notation

1 Introduction
    The Regression Problem
    Parametric Methods
    Nonparametric Methods
    The System Identification Problem
    Just-in-Time Models
    Applications
    Thesis Outline
    Contributions

2 Parametric Methods
    Parametric Regression Models
    Parametric Models in System Identification
        Linear Black-box Models
        Nonlinear Black-box Models
    Parameter Estimation
        Linear Least Squares
        Nonlinear Least Squares
    Asymptotic Properties of the Model

3 Nonparametric Methods
    The Basic Smoothing Problem
    Local Polynomial Kernel Estimators
    K-Nearest Neighbor Estimators
    Statistical Properties of Kernel Estimators
        The MSE and MISE Criteria
        Asymptotic MSE Approximation
    Bandwidth Selection
    Extensions to the Multivariable Case
    Appendix: Proof of Theorem

4 Just-in-Time Models
    The Just-in-Time Idea
    The Just-in-Time Estimator
    Optimal Weights
        The MSE Formula
        Optimizing the Weights
        Properties of the Weight Sequence
        Comparison with Kernel Weights
    Estimation of Hessian and Noise Variance
        Using Linear Least Squares
        Using a Weighted Mean
    Data selection
        Neighborhood size
        Neighborhood shape
    The Just-in-Time Algorithm
    Properties of the Just-in-Time Estimator
        Computational Aspects
        Convergence

5 Asymptotic Properties of the Just-in-Time Estimator
    Asymptotic Properties of the Scalar Estimator
    Appendix: Some Power Series Formulas
    Appendix: Moments of a Uniformly Distributed Random Variable

6 Applications to Dynamical Systems
    Nonparametric System Identification
        A Linear System
        A Nonlinear System
    Tank Level Modeling
    Water Heating Process

7 Applications to Frequency Response Estimation
    Traditional Methods
        Properties of the ETFE
        Smoothing the ETFE
        Asymptotic Properties of the Estimate
        An Example
    Using the Just-in-Time Approach
    Aircraft Flight Flutter Data
    Summary

8 Summary & Conclusions

Bibliography

Subject Index


Notation

Abbreviations

AIC     Akaike's Information Theoretic Criterion
AMISE   Asymptotic Mean Integrated Squared Error
AMSE    Asymptotic Mean Squared Error
ETFE    Empirical Transfer Function Estimate
FPE     Akaike's Final Prediction Error
JIT     Just-in-Time
MISE    Mean Integrated Squared Error
MSE     Mean Squared Error
RMSE    Root Mean Squared Error

Symbols

C              The set of complex numbers
R              The set of real numbers
R^d            Euclidean d-dimensional space
I_d            The d × d identity matrix
a_N ∼ b_N      If and only if lim_{N→∞} (a_N / b_N) = 1
a_N = o(b_N)   If and only if lim sup_{N→∞} |a_N / b_N| = 0
a_N = O(b_N)   If and only if lim sup_{N→∞} |a_N / b_N| < ∞
Ω_M            A neighborhood around the current operating point that contains M data
λ              Noise variance

Operators and Functions

arg min_x f(x)   The minimizing argument of the function f(·) with respect to x
E X              Mathematical expectation of the random variable X
Var X            Variance of the random vector X
vec A            The vector of the matrix A, obtained by stacking the columns of A underneath each other in order from left to right
vech A           The vector-half of the matrix A, obtained from vec A by eliminating the above-diagonal entries of A
D_f(x)           The d × 1 derivative vector whose ith entry is equal to (∂/∂x_i) f(x)
H_f(x)           The d × d Hessian matrix whose (i, j)th entry is equal to (∂²/∂x_i ∂x_j) f(x)
μ_k(f)           ∫ x^k f(x) dx
R(f)             ∫ f²(x) dx
A^T              The transpose of the matrix A
tr A             The trace of the matrix A

1 Introduction

The problem considered in this thesis is how to derive relationships between inputs and outputs of a dynamical system when very little a priori knowledge is available. In traditional system identification literature, this is usually known as black-box modeling. A very rich and well established theory for black-box modeling of linear systems exists, see for example [30] and [43]. In recent years the interest in nonlinear system identification has been growing, and the attention has been focused on a number of nonlinear black-box structures, such as neural networks and wavelets, to mention some of them [42]. However, nonlinear identification has been studied for a long time within the statistical community, where it is known under the name nonparametric regression.

1.1 The Regression Problem

Within many areas of science one often wishes to study the relationship between a number of variables. The purpose could for example be to predict the outcome of one of the variables on the basis of information provided by the others. In statistical theory this is usually referred to as the regression problem. The objective is to determine a functional relationship between a predictor variable (or regression variable) X ∈ R^n and a response variable Y ∈ R, given a set of observations {(X_1, Y_1), ..., (X_N, Y_N)}. In a mathematical sense, the problem is to find a function of X, f(X), such that the difference

    Y − f(X)                                                    (1.1)

becomes small in some sense. It is a well-known fact that the function f that minimizes

    E(Y − f(X))²                                                (1.2)

is the conditional expectation of Y given X,

    f(x) = E(Y | X).                                            (1.3)

This function, the best mean square predictor of Y given X, is often called the regression function or the regression of Y on X. If N data points have been collected, the regression relation can be modeled as

    Y_i = f(X_i) + e_i,   i = 1, ..., N,                        (1.4)

where {e_i} are identically distributed random variables with zero means, which are independent of the predictor data {X_i}.

The task of estimating the regression function f from observations can be done in essentially two different ways. The quite commonly used parametric approach is to assume that the function f has a pre-specified form, for instance a hyperplane with unknown slope and offset. As an alternative one could try to estimate f nonparametrically, without reference to a specific form.

1.2 Parametric Methods

Parametric estimation methods rely on the assumption that the true regression function f has a pre-specified functional form, which can be fully described by a finite-dimensional parameter vector θ:

    f(x, θ).                                                    (1.5)

The structure of the model is chosen from families that are known to be flexible and which have been successful in previous applications. This means that the parameters are not necessarily required to have any physical meaning; they are just tuned to fit the observed data as well as possible. It is natural to consider the parameterization (1.5) as an expansion of basis functions g_i(·), i.e.,

    f(x, θ) = Σ_{i=1}^r α_i g_i(x, β_i, γ_i).

This formulation allows the dependence of f on some of the components of θ to be linear while it is nonlinear in others. This is for instance the situation in some commonly used special cases of basis functions, as shown in Table 1.1, which will be more thoroughly described in Chapter 2.

    Modeling approach               g_i(x, β_i, γ_i)
    Fourier series                  sin(x β_i + γ_i)
    Feedforward neural networks     σ(x^T β_i + γ_i),   σ(x) = 1/(1 + e^{−x})
    Radial basis functions          κ(||x − γ_i||_{β_i}),   κ(x) = e^{−x²}
    Linear regression               x_i

    Table 1.1 Basis functions used in some common modeling approaches.

Once a particular model structure is chosen, the parameters can be obtained from the observations by an optimization procedure that minimizes the prediction errors in a global least squares fashion,

    θ̂ = arg min_θ Σ_{i=1}^N (Y_i − f(X_i, θ))².                 (1.6)

This optimization problem usually has to be solved using a numeric search routine, except in the linear case where an explicit solution exists. In the nonlinear case it will typically have numerous local minima, which in general makes the search for the desired global minimum hard [9].

The greatest advantage with parametric models is that they give a very compact description of the data set once the parameter vector estimate θ̂ is computed. A drawback, however, is the required assumption of the imposed parameterization. Sometimes the assumed function family (or model structure) might be too restrictive or too low-dimensional (i.e., too few parameters) to fit unexpected features in the data.
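As a concrete illustration of the global fit (1.6), the following Python sketch fits a small radial basis expansion of the type listed in Table 1.1 by numerical least squares. The data set, the number of basis functions, and the use of scipy's least_squares routine are assumptions made for this example only; they are not part of the thesis.

```python
import numpy as np
from scipy.optimize import least_squares

# Sketch of the global fit (1.6) for a small radial basis expansion,
# f(x, theta) = sum_k alpha_k * exp(-(beta_k * (x - gamma_k))^2).
# The data and model size below are made up for illustration.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, 200)
Y = np.tanh(3 * X) + 0.05 * rng.normal(size=X.size)    # hypothetical observations

r = 4                                                   # number of basis functions

def model(theta, x):
    alpha, beta, gamma = np.split(theta, 3)             # alpha_k, inverse widths, centers
    return sum(a * np.exp(-(b * (x - g)) ** 2) for a, b, g in zip(alpha, beta, gamma))

def residuals(theta):                                   # prediction errors Y_i - f(X_i, theta)
    return Y - model(theta, X)

theta0 = np.concatenate([np.ones(r), np.ones(r), np.linspace(-1, 1, r)])
fit = least_squares(residuals, theta0)                  # numeric search; local minima are possible
print(fit.cost)
```

Because the parameters enter nonlinearly, the search starts from an initial guess theta0 and is only guaranteed to find a local minimum, which is exactly the difficulty pointed out above.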

1.3 Nonparametric Methods

The problems with parametric regression methods can be overcome by removing the restriction that the regression function belongs to a parametric function family. This leads to an approach which is usually referred to as nonparametric regression. The basic idea behind nonparametric methods is that one should let the data decide which function fits them best, without the restrictions imposed by a parametric model. There exist several methods for obtaining nonparametric estimates of the functional relationship, ranging from the simple nearest neighbor method to more advanced smoothing techniques. A fundamental assumption is that observations located close to each other are related, so that an estimate at a certain operating point x can be constructed from observations in a small neighborhood around x.

The simplest nonparametric method is perhaps the nearest neighbor approach. The estimate f̂(x) is taken as the response variable Y_k that corresponds to the regression vector X_k that is the nearest neighbor of x, i.e.,

    f̂(x) = Y_k,   k = arg min_k ||X_k − x||.

Hence the estimation problem is essentially reduced to a data set searching problem, rather than a modeling problem. Despite its simplicity, the nearest neighbor method suffers from a major drawback: the observations are almost always corrupted by measurement noise, so the nearest neighbor estimate is in general a very poor and noisy estimate of the true function value. Significant improvements can therefore be achieved using an interpolation or smoothing operation,

    f̂(x) = Σ_i w_i Y_i,                                         (1.7)

where {w_i} denotes a sequence of weights which may depend on x and the predictor variable data {X_i}. This weight sequence can of course be selected in many ways. An often used approach in statistics is to select the weights according to a kernel function [21],

    w_i = K_h(X_i − x),

which explicitly specifies the shape of the weight sequence. A similar approach is considered in signal processing applications, where the so-called Hamming window is frequently used for smoothing [2].

The weights in (1.7) are typically tuned by a smoothing parameter which controls the degree of local averaging, i.e., the size of the neighborhood around x. A too large neighborhood will include observations located far away from x, whose expected values may differ significantly from f(x), and as a result the estimator will produce an over-smoothed or biased estimate. When using a too small neighborhood, on the other hand, only a small number of observations will contribute to the estimate at x, hence making it under-smoothed or noisy. The basic problem in nonparametric methods is thus to find the optimal choice of the smoothing parameter that will balance the bias error against the variance error.

The advantage with nonparametric models is their flexibility, since they allow predictions to be computed without reference to a fixed parametric model. The price that has to be paid for that is the computational complexity. In general nonparametric methods require more computations than parametric ones. The convergence rate with respect to sample size N is also slower than for parametric methods.
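To make the smoothing operation (1.7) concrete, here is a minimal Python sketch of a kernel-weighted average, shown next to the plain nearest neighbor estimate. The Gaussian kernel shape and the bandwidth value are arbitrary choices for illustration, not prescriptions from the thesis.

```python
import numpy as np

def kernel_smooth(x, X, Y, h=0.1):
    """Weighted average (1.7) with kernel weights w_i = K_h(X_i - x).

    A Gaussian kernel is used here purely as an example; the bandwidth h
    controls the size of the neighborhood around x.
    """
    w = np.exp(-0.5 * ((X - x) / h) ** 2)   # unnormalized kernel weights
    w = w / w.sum()                         # normalize so the weights sum to one
    return np.dot(w, Y)

def nearest_neighbor(x, X, Y):
    """Nearest neighbor estimate: return the Y that belongs to the closest X."""
    return Y[np.argmin(np.abs(X - x))]
```

A small bandwidth h makes the estimate noisy (under-smoothed), a large one makes it biased (over-smoothed), which is the bias/variance tradeoff discussed above.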

1.4 The System Identification Problem

System identification is a special case of the regression problem presented in Section 1.1. It deals with the problem of determining mathematical models of dynamical systems on the basis of observed data from the systems. Having collected a data set of paired inputs and outputs

    S = {(u(t), y(t))}_{t=1}^N

from a system, the goal in time domain system identification is typically to try to model future outputs of the system as a function of past inputs and outputs,

    y(t) = f(ϕ(t)) + e(t),                                      (1.8)

where ϕ(t) is a so-called regression vector which consists of past data,

    ϕ(t) = (y(t−1), y(t−2), ..., u(t−1), u(t−2), ...)^T,

and e(t) is an error term which accounts for the fact that in general it is not possible to model y(t) as an exact function of past observations. Nevertheless, a requirement must be that the error term is small or white, so that we can treat f(ϕ(t)) as a good prediction of y(t),

    ŷ(t | t−1) = f(ϕ(t)).

The system identification problem is thus to find a good function f(·) such that the discrepancy between the true and the predicted outputs, y(t) − ŷ(t | t−1), is minimized.

The problem of estimating ŷ(t | t−1) = f(ϕ(t)) from experimental data with poor or no a priori knowledge of the system is usually referred to as black-box modeling [30]. It has traditionally been solved using parametric linear models of different sophistication, but problems usually occur when encountering highly nonlinear systems that do not lend themselves to approximation by linear models. As a consequence of this, the interest in nonlinear modeling alternatives like neural networks and radial basis functions has been growing in recent years [42, 6]. As an alternative one could apply nonparametric methods of the type described in Section 1.3. The predictor will then be of the form

    ŷ(t | t−1) = Σ_{k ≤ t−1} w_k y(k),                          (1.9)

where the weights are constructed such that they give measurements located close to ϕ(t) more influence than those located far away from it.

Example 1.1 (Lindskog [29])
Consider the laboratory-scale tank system shown in Figure 1.1 (a). Suppose the modeling aim is to describe how the water level h(t) changes with the voltage u(t) that controls the pump, given a data set that consists of 1000 observations of u(t)

and h(t). The data set is plotted in Figure 1.1 (b). A reasonable assumption is that the water level at the current time instant t can be expressed in terms of the water level and the pump voltage at the previous time instant t−1, i.e.,

    ĥ(t | t−1) = f(h(t−1), u(t−1)).

Assuming that the function f(·) can be described by a linear regression,

    ĥ(t | t−1) = θ_1 h(t−1) + θ_2 u(t−1) + θ_0,

the parameters θ_i can easily be estimated using linear least squares, resulting in θ_1 = 0.9063 and θ_2 = 1.2064, together with an estimated offset θ_0. The result from a simulation is shown in Figure 1.1 (c). The solid line represents the measured water level, and the dashed line corresponds to a simulation using the estimated parameters. As shown, the simulated water level follows the true level quite well except at levels close to zero, where the linear model produces negative levels. This indicates that the true system is nonlinear, and that better results could be achieved using a nonlinear or a nonparametric model. Figure 1.1 (d) shows a simulation using a nonparametric model of the type (1.9). The performance of the model is clearly much better at low water levels in this case.

A traditional application of nonparametric methods in system identification is in the frequency domain, when estimating the transfer function of a system. If the system considered is linear, i.e., if it can be modeled by the input-output relation

    y(t) = G_0(q) u(t) + e(t),   t = 1, ..., N,                 (1.10)

an estimate of the transfer function G_0(q) can be formed as the ratio between the Fourier transforms of the output and input signals,

    Ĝ_N(e^{iω}) = Y_N(ω) / U_N(ω).                              (1.11)

This estimate is often called the empirical transfer function estimate (ETFE), since it is formed with no other assumptions than linearity of the system [30]. It is well known that the ETFE is a very crude estimate of the true transfer function. This is due to the fact that the observations {(y(t), u(t))} are corrupted by measurement noise e(t), which propagates to the ETFE through the Fourier transform. In particular, for sufficiently large N, the ETFE can be written

    Ĝ_N(e^{iω}) = G_0(e^{iω}) + ρ_N(ω),                         (1.12)

where ρ_N(ω) is a complex disturbance with zero mean and variance proportional to the noise-to-input signal ratio. Hence the transfer function can be estimated in a nonparametric fashion,

    Ĝ(e^{iω_0}) = Σ_k w_k Ĝ_N(e^{iω_k}),

where the weights again are selected so that a good trade-off between the bias and the variance is achieved.

Figure 1.1 (a) A simple tank system. (b) Experimental data. (c) The result of a simulation with a linear model. Solid: true water level. Dashed: simulated water level. (d) The result of a simulation with a nonparametric model. Solid: true water level. Dashed: simulated water level.
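The two steps just described, forming the ETFE (1.11) from the discrete Fourier transforms of input and output and then smoothing it with frequency-local weights, can be sketched as below. The simulated first-order test system, the noise level, and the Gaussian frequency window are invented purely for illustration.

```python
import numpy as np

# Sketch of the ETFE (1.11) and a locally weighted smoothing of it.
rng = np.random.default_rng(2)
N = 1024
u = rng.normal(size=N)
# Hypothetical linear system: y(t) = 0.5 y(t-1) + u(t-1) + e(t)
y = np.zeros(N)
for t in range(1, N):
    y[t] = 0.5 * y[t - 1] + u[t - 1] + 0.1 * rng.normal()

U, Y = np.fft.rfft(u), np.fft.rfft(y)
G_etfe = Y / U                               # empirical transfer function estimate (1.11)

omega = np.linspace(0, np.pi, G_etfe.size)   # frequency grid of the ETFE

def smooth_etfe(w0, width=0.05):
    """Weighted average of ETFE values at frequencies near w0 (bias/variance tradeoff)."""
    w = np.exp(-0.5 * ((omega - w0) / width) ** 2)
    return np.dot(w, G_etfe) / w.sum()

print(abs(smooth_etfe(0.5)))                 # smoothed gain estimate at omega = 0.5 rad
```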

1.5 Just-in-Time Models

The main contribution of this thesis is the Just-in-Time estimator, which is another approach to obtaining nonparametric estimates of nonlinear regression functions on the basis of observed data. Traditionally, in the system identification literature and in statistics, regression problems have been solved by global modeling methods, like kernel methods, neural networks or other nonlinear parametric models [42], but when dealing with very large data sets this approach becomes less attractive. For real industrial applications, for example in the chemical process industry, the volume of data may occupy several gigabytes. The global modeling process is in general associated with an optimization step as in (1.6). This optimization problem is typically non-convex and will have a number of local minima, which makes its solution difficult. Although the global model has the appealing feature of giving a high degree of data compression, it seems both inefficient and unnecessary to spend a large amount of computation on optimizing a model which is valid over the whole regressor space, while in most cases it is likely that we will only visit a very restricted subset of it.

Inspired by ideas and concepts from the database research area, we will take a conceptually different point of view. We assume that all observations are stored in a database, and that the models are built dynamically as the actual need arises. When a model is needed in a neighborhood of an operating point x, a subset of the data closest to the operating point is retrieved from the database, and a local modeling operation is performed on that subset. For this concept, we have adopted the name Just-in-Time models, suggested by [9]. As in (1.7), it is assumed that the Just-in-Time predictor is formed as a weighted average of the response variables in a neighborhood around x,

    f̂_JIT(x) = Σ_i w_i Y_i,

where the weights {w_i} are optimized in such a way that the pointwise mean square error (MSE) measure is minimized.

Compared to global methods, the advantage with Just-in-Time models is that the modeling is optimized locally, which might increase the performance. A possible drawback is the computational complexity, both because we have to search for a neighborhood of x in a multidimensional space, and because the derived estimator is quite computationally intensive. In this thesis, however, we will only investigate the properties of the modeling part of the problem. The searching problem will be left as a topic for future research.
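The following Python sketch illustrates the Just-in-Time procedure described above: retrieve the data closest to the operating point, then form a locally weighted average. The neighborhood size, the distance measure and the plain kernel weights are placeholders; the thesis instead determines the neighborhood through a bias/variance tradeoff and optimizes the weights for minimum local MSE.

```python
import numpy as np

def jit_predict(x, X, Y, M=50, h=0.2):
    """Conceptual Just-in-Time prediction: retrieve data near x, then model locally.

    X is an (N, d) array of stored regressors and Y the corresponding responses.
    The weights below are simple kernel weights used only as a placeholder; they
    are not the MSE-optimal weights derived in the thesis.
    """
    dist = np.linalg.norm(X - x, axis=1)      # search step: distances to the operating point
    idx = np.argsort(dist)[:M]                # the M closest observations form the neighborhood
    w = np.exp(-0.5 * (dist[idx] / h) ** 2)   # placeholder local weights
    return np.dot(w, Y[idx]) / w.sum()
```

Note that nothing is precomputed: the whole data set is kept, and all modeling work happens at prediction time, which is the computational price discussed above.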

1.6 Applications

The reasons and needs for estimating a model of a system can of course be many. When dealing with dynamical systems, some of the reasons can be as follows.

One obvious reason, which already has been mentioned briefly in Section 1.4, is prediction or forecasting. Based on the observations we have collected so far, we will be able to predict the future behavior of the system. Conceptually speaking, this can be described as in Figure 1.2. The predictor/estimator takes a data set and an operating point x as inputs, and uses some suitable modeling approach, parametric or nonparametric, to produce an estimate f̂(x).

Modern control theory usually requires a model of the process to be controlled. One example is predictive control, where the control signal from the regulator is optimized on the basis of predictions of future outputs of the system.

System analysis and fault detection in general require investigation or monitoring of certain parameters which may not be directly available through measurements. Therefore we will have to derive their values using a model of the system.

Figure 1.2 A conceptual view of modeling: an estimator takes the data set {(Y_k, X_k)}_{k=1}^N and an operating point x as inputs and produces the estimate f̂(x).

1.7 Thesis Outline

The thesis is divided into six chapters, excluding the introductory and the concluding chapters. The first two chapters give an overview of existing parametric and nonparametric methods that relate to the Just-in-Time modeling concept, and the last four chapters derive, analyze and exemplify the proposed method.

The purpose of Chapter 2 is to give the reader a brief background on parametric estimation methods, especially in system identification applications. Examples of some commonly used linear and nonlinear black-box models are given, along with the two basic parameter estimation methods.

Chapter 3 serves as an introduction to nonparametric smoothing methods. The chapter is mainly focused on a special class of so-called kernel estimation methods which is widely used in statistics. The fundamental ideas and terminology are presented, as well as the statistical and asymptotic properties that are associated with these methods.

Chapter 4 is the core chapter of the thesis. It presents the basic ideas behind the Just-in-Time concept and proposes a possible implementation of a Just-in-Time

estimator. The chapter is concluded with a discussion regarding different aspects and properties of the method.

Chapter 5 presents an analysis of the asymptotic properties of the Just-in-Time estimator. The aim is to investigate the consistency, i.e., whether the estimator tends to the true regression function when the sample size tends to infinity, and the convergence rate, i.e., how fast it tends to this function.

In Chapter 6 the Just-in-Time method is applied to the time domain system identification problem. First two simulated examples are considered, and then two real data applications, a tank system and a water heating system, are successfully modeled by the method.

Chapter 7 gives an example of an important application of the Just-in-Time method in the field of frequency response estimation. The chapter starts by giving a review of the traditional treatments of the topic, whereafter the Just-in-Time modeling concept is modified to fit into this framework.

Finally, Chapter 8 gives a summary and directions for future work.

1.8 Contributions

The contributions of this thesis are mainly the material contained in Chapter 4 to Chapter 7. They can be summarized as follows:

The concept of Just-in-Time models is advocated as a method for obtaining predictions of a system given large data volumes.

A particular implementation of a Just-in-Time estimator is proposed, which forms the estimate in a nonparametric fashion as a weighted average of the response variables. The weights are optimized so that the local mean square error is minimized.

An analysis of the asymptotic properties of the Just-in-Time smoother is presented. It is shown that the method produces consistent estimates and that the convergence rate is of the same order as for nonparametric methods.

A comparison to kernel estimators is made, and it is shown that the Just-in-Time estimator is easier to generalize to higher regressor dimensions.

Examples with dynamical systems show that the Just-in-Time method for some systems gives smaller prediction errors than other proposed methods.

It is shown that the method is quite efficient in terms of performance for smoothing frequency response estimates.

The thesis is based on two papers that have been, or will be, presented at different conferences. The papers are:

[44] A. Stenman, F. Gustafsson, and L. Ljung. Just in time models for dynamical systems. In Proceedings of the 35th IEEE Conference on Decision and Control, Kobe, Japan, 1996.

[45] A. Stenman, A. V. Nazin, and F. Gustafsson. Asymptotic properties of Just-in-Time models. To be presented at SYSID '97, Fukuoka, Japan.


2 Parametric Methods

This chapter gives a brief review of parametric estimation methods, which quite often are considered when solving the regression problem described in Chapter 1. The basic concept of parametric regression methods is given in Section 2.1. Section 2.2 gives some examples of common black-box models, both linear and nonlinear, that are frequently used in system identification. Section 2.3 describes the two basic parameter estimation methods used when a certain model class is chosen. Section 2.4, finally, briefly states the basic asymptotic properties concerned with parametric models.

2.1 Parametric Regression Models

A very commonly used way of estimating the regression function f in a regression relationship,

    Y_i = f(X_i) + e_i,                                         (2.1)

on the basis of observed data {(X_i, Y_i)}_{i=1}^N, is the parametric approach. The basic assumption is that f belongs to a family of functions with a pre-specified functional form, and that this family can be parameterized by a finite-dimensional parameter vector θ,

    f(X_i, θ).                                                  (2.2)

The simplest example, which is very often used, is the linear regression,

    f(X_i, θ, φ) = X_i^T θ + φ,                                 (2.3)

where it is assumed that the relation between the variables can be described by a hyperplane, whose slope and offset are controlled by the parameters θ and φ. In the general case, though, a wide range of different nonlinear model structures is possible. The choice of parameterization depends very much on the situation. Sometimes there are physical reasons for modeling Y as a particular function of X, while at other times the choice is based on previous experience with similar data sets.

Once a particular model structure is chosen, the parameter vector θ can naturally be assessed by means of the fit between the model and the data set,

    Y_i − f(X_i, θ).                                            (2.4)

As will be described in Section 2.3, this fit can be performed in two major ways, depending on which norm is used and how the parameter vector appears in the parameterization. When the parameters enter linearly as in (2.3), they can easily be computed using simple and powerful methods. In the general case, though, the optimization problem is non-convex and may have a number of local minima, which makes its solution difficult.

An advantage with parametric models is that they give a very compact description of the data set once the parameter vector θ is estimated. In some applications, the data set may occupy several megabytes while the model is represented by only a handful of parameters. A major drawback, however, is the particular parameterization that must be imposed. Sometimes the assumed function family might be too restrictive or too low-dimensional to fit unexpected features in the data.

2.2 Parametric Models in System Identification

System identification is a special case of the regression relationship (2.1) where the response variable Y_t represents the output of a dynamical system at time t, Y_t = y(t), and the predictor variable X_t (usually denoted by ϕ(t) rather than X_t) consists of inputs and outputs to the system at previous time instants,

    X_t = (y(t−1), y(t−2), ..., u(t−1), u(t−2), ...)^T.

Over the years, different names and concepts have been associated with different parameterizations. We will in the following two subsections briefly describe some of the most commonly used ones.

2.2.1 Linear Black-box Models

Linear black-box models have been thoroughly discussed and analyzed in the system identification literature during the last decades, see for example [30] and [43].

The simplest linear model is the finite impulse response (FIR) model,

    y(t) = b_1 u(t−1) + ... + b_n u(t−n) + e(t) = B(q) u(t) + e(t),

where B(q) is a polynomial in the time shift operator q. Allowing the model order n to tend to infinity and using noise models of varying sophistication, all linear models can, as in [30], be described by the general model structure family

    A(q) y(t) = q^{−n_k} (B(q)/F(q)) u(t) + (C(q)/D(q)) e(t),   (2.5)

where n_k is the delay from u(t) to y(t) and

    A(q) = 1 + a_1 q^{−1} + ... + a_{n_a} q^{−n_a}
    B(q) = b_1 + b_2 q^{−1} + ... + b_{n_b} q^{−n_b+1}
    C(q) = 1 + c_1 q^{−1} + ... + c_{n_c} q^{−n_c}
    D(q) = 1 + d_1 q^{−1} + ... + d_{n_d} q^{−n_d}
    F(q) = 1 + f_1 q^{−1} + ... + f_{n_f} q^{−n_f}.

An often used special case of (2.5) is the ARX (Auto Regressive with eXogenous input) model,

    y(t) + a_1 y(t−1) + ... + a_{n_a} y(t−n_a) = b_1 u(t−n_k) + ... + b_{n_b} u(t−n_b−n_k+1) + e(t),   (2.6)

which corresponds to F(q) = C(q) = D(q) = 1. It has the nice property of being expressible in terms of a linear regression,

    y(t) = ϕ^T(t) θ + e(t),

and hence the parameter vector θ can be determined using simple and powerful estimation methods, see Section 2.3.1. Note that the parametric model used in Example 1.1 in Chapter 1 is of ARX type, with n_a = n_b = n_k = 1.
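Since the ARX model (2.6) is linear in the parameters, it can be estimated by ordinary least squares once the regression vectors ϕ(t) are stacked into a matrix. The sketch below, plain NumPy rather than code from the thesis, builds these regressors for given orders n_a, n_b and delay n_k and solves for θ = (a_1, ..., a_{n_a}, b_1, ..., b_{n_b}).

```python
import numpy as np

def fit_arx(y, u, na=1, nb=1, nk=1):
    """Least-squares fit of the ARX model (2.6) written as y(t) = phi(t)^T theta + e(t)."""
    y, u = np.asarray(y, float), np.asarray(u, float)
    start = max(na, nb + nk - 1)               # first t for which all past data exist
    rows = []
    for t in range(start, len(y)):
        phi = np.concatenate([
            -y[t - np.arange(1, na + 1)],      # -y(t-1), ..., -y(t-na)
            u[t - nk - np.arange(nb)],         # u(t-nk), ..., u(t-nb-nk+1)
        ])
        rows.append(phi)
    Phi = np.array(rows)
    theta, *_ = np.linalg.lstsq(Phi, y[start:], rcond=None)
    return theta                               # [a_1, ..., a_na, b_1, ..., b_nb]
```

With na = nb = nk = 1 this is exactly the structure used for the tank example in Chapter 1, where θ also includes an offset term that could be handled by appending a constant column to Phi.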

2.2.2 Nonlinear Black-box Models

When turning to nonlinear modeling, things in general become much more complicated. The reason is that almost nothing is excluded, and a very rich spectrum of possible model structures is available. It is natural to think of the parameterization (2.2) as a function expansion [42],

    f(ϕ(t), θ) = Σ_{k=1}^r α_k g_k(ϕ(t), β_k, γ_k).             (2.7)

The functions g_k(·) are usually referred to as basis functions, because the role they play in (2.7) is very similar to that of a functional space basis. Typically, the basis functions are constructed from a simple scalar mother basis function, κ(·), which is scaled and translated according to the parameters β_k and γ_k. Using scalar basis functions, there are three basic methods of expanding them into higher regressor dimensions:

Ridge construction. A ridge basis function has the form

    g_k(ϕ(t), β_k, γ_k) = κ(β_k^T ϕ(t) + γ_k),                  (2.8)

where κ(·) is a scalar basis function, β_k ∈ R^n and γ_k ∈ R. The ridge function is constant for all ϕ(t) in the direction where β_k^T ϕ(t) is constant. Hence the basis functions will have unbounded support in this subspace, although the mother basis function κ(·) has local support. See Figure 2.1 (a).

Radial construction. In contrast to the ridge construction, the radial basis functions have true local support, as is illustrated in Figure 2.1 (b). The radial support can be obtained using basis functions of the form

    g_k(ϕ(t), β_k, γ_k) = κ(||ϕ(t) − γ_k||_{β_k}),              (2.9)

where γ_k ∈ R^n is a center point and ||·||_{β_k} denotes an arbitrary norm on the regressor space. The norm is often taken as a scaled identity matrix.

Composition. A composition is obtained when the ridge and radial constructions are combined when forming the basis functions. A typical example is illustrated in Figure 2.1 (c). In general the composition can be written as a tensor product,

    g_k(ϕ(t), β_k, γ_k) = g_{k,1}(ϕ_1(t), β_{k,1}, γ_{k,1}) ··· g_{k,r}(ϕ_r(t), β_{k,r}, γ_{k,r}),   (2.10)

where each g_{k,i}(·) is either a ridge or a radial function.

Figure 2.1 Three different methods of expanding into higher regressor dimensions: (a) ridge, (b) radial, (c) composition.

Using the function expansion (2.7) and the different basis function constructions (2.8)-(2.10), a number of well-known nonlinear model structures can be formed, for example neural networks, radial basis function networks and wavelets.
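The three constructions (2.8)-(2.10) can be written down compactly as below. The sigmoid and Gaussian mother functions match the examples given in the text, but the scalar scaling used inside the radial and composition functions is just one possible choice of norm, made for illustration.

```python
import numpy as np

# Illustrative implementations of the three constructions (2.8)-(2.10).
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))     # ridge mother basis function, cf. (2.11)
gauss   = lambda x: np.exp(-x ** 2)              # radial mother basis function

def ridge(phi, beta, gamma):
    """(2.8): constant along directions where beta^T phi is constant."""
    return sigmoid(np.dot(beta, phi) + gamma)

def radial(phi, beta, gamma):
    """(2.9): depends only on a scaled distance from phi to the center gamma."""
    return gauss(np.linalg.norm(beta * (phi - gamma)))

def composition(phi, betas, gammas):
    """(2.10): tensor product of scalar basis functions, one per regressor component."""
    return np.prod([gauss(b * (p - g)) for p, b, g in zip(phi, betas, gammas)])
```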

Neural Networks

The combination of (2.7), the ridge construction (2.8), and the so-called sigmoid mother basis function,

    κ(x) = σ(x) = 1/(1 + e^{−x}),                               (2.11)

results in the celebrated one hidden layer feedforward neural net, see Figure 2.2. Many different generalizations of this basic structure are possible. If the outputs of the κ(·) blocks are weighted, summed and fed through a new layer of κ(·) blocks, one usually talks about multi-layer feedforward neural nets. So-called recurrent neural networks are obtained if instead some of the internal signals in the network are fed back to the input layer. See [23] for further structural issues. Neural network models are highly nonlinear in the parameters, and thus have to be estimated through numerical optimization schemes, as will be described in Section 2.3.2.

Figure 2.2 A one hidden layer feedforward neural net: the regressors ϕ_1(t), ..., ϕ_n(t) enter the input layer, are weighted by β and shifted by γ, passed through the κ(·) blocks of the hidden layer, and combined with the weights α into the output ŷ.

Radial Basis Networks

A closely related concept is the radial basis function (RBF) network [6]. It is constructed using the expansion (2.7) and the radial construction (2.9). The radial mother basis function κ(·) is often taken as a Gaussian function,

    κ(x) = e^{−x²}.

Compared to neural networks, the RBF network has the advantage of being linear in the parameters (provided that the location parameters are fixed). This makes the estimation process easier.
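The linearity-in-the-parameters property noted above means that, once the centers γ_k and widths β_k are fixed, fitting the RBF network reduces to linear least squares over the output weights α_k. The data set, grid of centers and width below are made-up values used only to show the mechanics.

```python
import numpy as np

# With fixed centers and widths, the RBF network (2.7) + (2.9) is linear in alpha.
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, 300)
Y = np.sin(3 * X) + 0.05 * rng.normal(size=X.size)      # hypothetical data

centers = np.linspace(-1, 1, 10)                        # gamma_k, fixed beforehand
width = 0.3                                             # scaling, fixed beforehand
Phi = np.exp(-((X[:, None] - centers) / width) ** 2)    # matrix of Gaussian basis outputs
alpha, *_ = np.linalg.lstsq(Phi, Y, rcond=None)         # linear-in-parameters estimation

y_hat = Phi @ alpha
print(np.mean((Y - y_hat) ** 2))                        # residual mean square error
```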

Wavelets

Wavelet decomposition of a function is another example of the parameterization (2.7) [10]. A mother basis function (usually referred to as the mother wavelet and denoted by ψ(·) rather than κ(·)) is scaled and translated to form a wavelet basis. The mother wavelet is usually a small wave (a pulse) with bounded support. It is common to let the expansion (2.7) be double indexed according to scale and location. For the scalar case and the specific choices β_j = 2^j and γ_k = k, the basis functions can therefore be written as

    g_{j,k} = 2^{j/2} κ(2^j ϕ(t) − k).                          (2.12)

Multivariable wavelet functions can be constructed from scalar ones using the composition method (2.10). Wavelets have multiresolution capabilities: several different scale parameters are used simultaneously and overlappingly. With a suitably chosen mother wavelet along with scaling and translation parameters, the wavelet basis can be made orthonormal, which makes it easy to compute the coordinates α_{j,k} in (2.7). See for example [42] for details.

2.3 Parameter Estimation

When a particular linear or nonlinear model structure is chosen, the next step is to estimate the parameters on the basis of the observations S_N = {(X_i, Y_i)}_{i=1}^N. This is usually done by minimizing the mean square error loss function

    V_N(θ, S_N) = (1/N) Σ_{i=1}^N (Y_i − f(X_i, θ))².           (2.13)

The parameter estimate is then given by

    θ̂_N = arg min_θ V_N(θ, S_N).                                (2.14)

Depending on how the parameters appear in the parameterization, this minimization can be performed either using a linear least squares approach or a nonlinear least squares approach.

2.3.1 Linear Least Squares

When the parameters enter linearly in the predictor, an explicit solution that minimizes (2.13) exists. The optimal parameter estimate is then simply given by

    θ̂_N = (Σ_{i=1}^N X_i X_i^T)^{−1} Σ_{i=1}^N X_i Y_i,         (2.15)

provided that the inverse in (2.15) exists. For numerical reasons this inverse is rarely formed. Instead the estimate is computed using QR or singular value decomposition [29].

2.3.2 Nonlinear Least Squares

When the predictor is nonlinear in the parameters, the minimum of the loss function (2.13) cannot be computed analytically. Instead one has to search for the minimum numerically. An often used numeric optimization method is Newton's algorithm [12],

    θ̂_N^{(k+1)} = θ̂_N^{(k)} − [V_N''(θ̂_N^{(k)}, S_N)]^{−1} V_N'(θ̂_N^{(k)}, S_N),   (2.16)

where V_N'(·) and V_N''(·) denote the gradient and the Hessian of the loss function, respectively. The parameter vector estimate θ̂_N^{(k)} is in each iteration updated in the negative gradient direction, with a step size according to the inverse Hessian. For model structures like neural networks, which are highly nonlinear in the parameters, this introduces a problem since several local minima exist. There are no guarantees that the parameter estimate converges to the global minimum of the loss function (2.13).
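A common practical variant of the Newton update (2.16) for the quadratic loss (2.13) is Gauss-Newton, where the Hessian is approximated by J^T J built from the Jacobian of the predictions. The sketch below shows that variant, not the exact Newton iteration in the text, and omits the step-size safeguards used in practice.

```python
import numpy as np

def gauss_newton(predict, jacobian, theta0, X, Y, iters=20):
    """Gauss-Newton variant of the iteration (2.16) for the loss (2.13).

    `predict(X, theta)` returns the model outputs f(X_i, theta) and
    `jacobian(X, theta)` their N x dim(theta) derivative matrix; the Hessian of
    the loss is approximated by J^T J (a common simplification).
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        r = Y - predict(X, theta)                  # prediction errors
        J = jacobian(X, theta)                     # Jacobian of predictions w.r.t. theta
        step = np.linalg.solve(J.T @ J, J.T @ r)   # approximate Newton step
        theta = theta + step
    return theta
```

As stressed above, such a search only converges to a local minimum, and the result can depend strongly on the initial value theta0.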

The first basic result is the following one:

    θ̂_N → θ*   as   N → ∞,                                      (2.17)

where

    θ* = arg min_θ E(Y_i − f(X_i, θ))².                          (2.18)

That is, as more and more data become available, the estimate converges to the value θ* that would minimize the expected value of the squared prediction errors. This is in a sense the best possible approximation of the true regression function that is available within the model structure. The expectation E in (2.18) is taken with respect to all random disturbances that affect the data, and it also includes averaging over the predictor variables.

The second basic result is the following one: if the prediction error ε_i(θ*) = Y_i − f(X_i, θ*) is approximately white noise, then the covariance matrix of θ̂_N is approximately given by

    E(θ̂_N − θ*)(θ̂_N − θ*)^T ≈ (λ/N) [E ψ_i ψ_i^T]^{−1},        (2.19)

where

    λ = E ε_i²(θ*)                                               (2.20)

and

    ψ_i = (d/dθ) f(X_i, θ) |_{θ=θ*}.                             (2.21)

The results (2.17) through (2.21) are general and hold for all model structures, both linear and nonlinear ones, subject only to some regularity and smoothness conditions. See [30] for more details around this.
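For the linear regression case, where ψ_i = X_i, the covariance expression (2.19) can be approximated directly from data by plugging in sample averages for the expectations in (2.20)-(2.21). The helper below is such a plug-in approximation, written as an assumption-laden sketch rather than a statement from the thesis.

```python
import numpy as np

def parameter_covariance(X, residuals):
    """Plug-in estimate of the covariance (2.19) for a linear model (psi_i = X_i).

    lambda in (2.20) is replaced by the sample variance of the residuals and
    E[psi_i psi_i^T] by its sample mean; X is an (N, d) regressor matrix.
    """
    N = len(residuals)
    lam = np.mean(np.asarray(residuals) ** 2)      # estimate of (2.20)
    psi_outer = (X.T @ X) / N                      # estimate of E[psi_i psi_i^T]
    return (lam / N) * np.linalg.inv(psi_outer)
```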

3 Nonparametric Methods

In Chapter 2, a brief review of parametric estimation methods was given. It was concluded that parametric methods are good modeling alternatives, since they give a high degree of data compression, i.e., the major features of the data are condensed into a small number of parameters. A problem with the approach is the requirement of a certain parameterization. This must be selected such that it matches the properties of the underlying regression function; otherwise quite poor results are often obtained.

The problem with parametric regression models mentioned above can be solved by removing the restriction that the regression function belongs to a parametric function family. This leads to an approach which is usually referred to as nonparametric regression. The basic idea behind nonparametric methods is that one should let the data decide which function fits them best, without the restrictions imposed by a parametric model.

Local nonparametric regression models have been discussed and analyzed in the statistical literature for a long time. In the context of so-called kernel regression methods, traditional approaches have involved the Nadaraya-Watson estimator [35, 51] and some alternative kernel estimators, for example the Priestley-Chao estimator [37] and the Gasser-Müller estimator [18]. In this chapter we give a brief introduction to a special class of such models, local polynomial kernel estimators [46, 7, 34, 15]. These estimate the regression function at a certain point by locally fitting a polynomial of degree p to the data using weighted least squares. The Nadaraya-Watson estimator can in this framework be seen as a special case, since it corresponds to fitting a zero degree polynomial, i.e., a local constant, to the data. The presentation here is neither formal nor complete; the purpose is just to

introduce concepts and notation used in the area. More comprehensive treatments of the topic are given in the books [50] and [21], upon which this survey is based. The outline is as follows: Section 3.1 describes the basic nonparametric smoothing problem, and Section 3.2 gives an introduction to local polynomial kernel estimators, which are one possible solution to the smoothing problem.

3.1 The Basic Smoothing Problem

Smoothing of a noisy data set {(X_i, Y_i)}_{i=1}^N concerns the problem of estimating the function f in the regression relationship

    Y_i = f(X_i) + e_i,   i = 1, ..., N,                         (3.1)

without the imposition that f belongs to a parametric family of functions. Depending on how the data have been collected, several alternatives exist. If there are multiple observations at a certain point x, an estimate of f(x) can be obtained by just taking the average of the corresponding Y-values. In most cases, however, repeated observations at a given x are not available, and one has to resort to other solutions that deduce the value of f(x) using observations at positions other than x.

In the trivial case where the regression function f is constant, estimation of f(x) reduces to taking the average over the response variables Y. In general situations, though, it is unlikely that the true regression curve is constant. Rather, the function is modeled as a smooth continuous function which is nearly constant in a small neighborhood around x. A natural approach is therefore to take the mean of the response variables near the point x. This local average should then be constructed so that it is defined only from observations in a small neighborhood around x. This local averaging can be seen as the basic idea of smoothing. Almost all smoothing methods can, at least asymptotically, be described as a weighted average of the Y_i's near x,

    f̂(x) = Σ_{i=1}^N w_i Y_i,                                   (3.2)

where {w_i} is a sequence of weights that may depend on x and the predictor data {X_i}. The estimator f̂(x) is usually called a smoother, and the result of a smoothing operation is called the smooth [49]. A simple smooth can be obtained by defining the weights as constant over adjacent intervals; this is quite similar to the histogram concept, and it is therefore sometimes referred to as the regressogram [48]. In more sophisticated methods like the kernel estimator approach, the weights are chosen to follow a kernel function K_h(·) of fixed form, w_i = K_h(X_i − x). Kernel estimators will be described in more detail in the following section.
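The regressogram mentioned above is simple enough to sketch in a few lines: partition the X axis into adjacent intervals and average the Y values within each one, which corresponds to constant weights per bin in (3.2). The number of bins is an arbitrary choice for the example.

```python
import numpy as np

def regressogram(X, Y, n_bins=10):
    """Regressogram smooth: average Y over adjacent X-intervals (constant weights per bin)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    edges = np.linspace(X.min(), X.max(), n_bins + 1)
    which = np.clip(np.digitize(X, edges) - 1, 0, n_bins - 1)   # bin index of each X_i
    means = np.array([Y[which == b].mean() if np.any(which == b) else np.nan
                      for b in range(n_bins)])
    return edges, means    # the smooth is piecewise constant over the bins
```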

The fact that smoothers, by definition, average over observations with considerably different expected values has been paid special attention in the statistical literature. The weights {w_i} are typically tuned by a smoothing parameter which controls the degree of local averaging, i.e., the size of the neighborhood around x. A too large neighborhood will include observations located far away from x, whose expected values may differ considerably from f(x), and as a result the estimator will produce an over-smoothed or biased estimate. When using a too small neighborhood, on the other hand, only a small number of observations will contribute to the estimate at x, hence making it under-smoothed or noisy. The basic problem in nonparametric methods is thus to find the optimal choice of smoothing parameter that will balance the bias error against the variance error.

Before going into details about kernel regression models, we will give some basic terminology and notation. Nonparametric regression is studied in both fixed design and random design contexts. In the fixed design case, the predictor variables consist of ordered non-random numbers. A special case is the equally spaced fixed design, where the difference X_{i+1} − X_i is constant for all i, for example X_i = i/N, i = 1, ..., N. The random design occurs when the predictor variables instead are independent, identically distributed random variables. The regression relationship is in both cases assumed to be modeled as in (3.1), where the e_i are independent random variables with zero means and variances λ, which are independent of {X_i}. The overview concentrates on the scalar case, because of its simpler notation; the results are generalized to the multivariable case later in the chapter.

3.2 Local Polynomial Kernel Estimators

Local polynomial kernel estimators form a special class of nonparametric regression models and were first discussed by Stone [46] and Cleveland [7]. The basic idea is to estimate the regression function f(·) at a particular point x by locally fitting a pth degree polynomial,

    θ_0 + θ_1(X_i − x) + ... + θ_p(X_i − x)^p,                   (3.3)

to the data {(Y_i, X_i)} via weighted least squares, where the weights are chosen according to a kernel function, K(·), centered about x and scaled according to a parameter h. A kernel estimate, f̂(x, h), of the true regression function at the point x is thus obtained as

    f̂(x, h) = θ̂_0,                                              (3.4)

where θ̂ = (θ̂_0, ..., θ̂_p)^T is the solution to the weighted least squares problem

    θ̂ = arg min_θ Σ_{i=1}^N {Y_i − θ_0 − θ_1(X_i − x) − ... − θ_p(X_i − x)^p}² K_h(X_i − x),   (3.5)

and

    K_h(X_i − x) = h^{−1} K((X_i − x)/h).                        (3.6)
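The weighted least squares problem (3.5)-(3.6) has a direct computational counterpart: build the local polynomial regressors in (X_i − x), weight them by the kernel, solve, and return θ̂_0 as in (3.4). The Gaussian kernel and the default bandwidth in the sketch below are arbitrary illustrative choices.

```python
import numpy as np

def local_poly_fit(x, X, Y, p=1, h=0.2):
    """Local polynomial kernel estimate (3.4)-(3.6) at the point x.

    A pth degree polynomial in (X_i - x) is fitted by weighted least squares
    with kernel weights K_h(X_i - x); the estimate of f(x) is theta_hat[0].
    """
    d = X - x
    W = np.exp(-0.5 * (d / h) ** 2) / h            # kernel weights, Gaussian K as an example
    Z = np.vander(d, p + 1, increasing=True)       # columns 1, (X_i - x), ..., (X_i - x)^p
    sqw = np.sqrt(W)
    theta, *_ = np.linalg.lstsq(Z * sqw[:, None], Y * sqw, rcond=None)
    return theta[0]                                # f_hat(x, h) = theta_hat_0, cf. (3.4)
```

Setting p = 0 recovers the Nadaraya-Watson estimator mentioned in the chapter introduction, i.e., a locally constant fit.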


More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Maximum Likelihood Estimation. only training data is available to design a classifier

Maximum Likelihood Estimation. only training data is available to design a classifier Introduction to Pattern Recognition [ Part 5 ] Mahdi Vasighi Introduction Bayesian Decision Theory shows that we could design an optimal classifier if we knew: P( i ) : priors p(x i ) : class-conditional

More information

Least Squares Approximation

Least Squares Approximation Chapter 6 Least Squares Approximation As we saw in Chapter 5 we can interpret radial basis function interpolation as a constrained optimization problem. We now take this point of view again, but start

More information

Econ 582 Nonparametric Regression

Econ 582 Nonparametric Regression Econ 582 Nonparametric Regression Eric Zivot May 28, 2013 Nonparametric Regression Sofarwehaveonlyconsideredlinearregressionmodels = x 0 β + [ x ]=0 [ x = x] =x 0 β = [ x = x] [ x = x] x = β The assume

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.2 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

12 - Nonparametric Density Estimation

12 - Nonparametric Density Estimation ST 697 Fall 2017 1/49 12 - Nonparametric Density Estimation ST 697 Fall 2017 University of Alabama Density Review ST 697 Fall 2017 2/49 Continuous Random Variables ST 697 Fall 2017 3/49 1.0 0.8 F(x) 0.6

More information

Gradient-Based Learning. Sargur N. Srihari

Gradient-Based Learning. Sargur N. Srihari Gradient-Based Learning Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

Identification, Model Validation and Control. Lennart Ljung, Linköping

Identification, Model Validation and Control. Lennart Ljung, Linköping Identification, Model Validation and Control Lennart Ljung, Linköping Acknowledgment: Useful discussions with U Forssell and H Hjalmarsson 1 Outline 1. Introduction 2. System Identification (in closed

More information

Function Approximation

Function Approximation 1 Function Approximation This is page i Printer: Opaque this 1.1 Introduction In this chapter we discuss approximating functional forms. Both in econometric and in numerical problems, the need for an approximating

More information

EECE Adaptive Control

EECE Adaptive Control EECE 574 - Adaptive Control Recursive Identification in Closed-Loop and Adaptive Control Guy Dumont Department of Electrical and Computer Engineering University of British Columbia January 2010 Guy Dumont

More information

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017 The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the

More information

ECO Class 6 Nonparametric Econometrics

ECO Class 6 Nonparametric Econometrics ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Model-free Predictive Control

Model-free Predictive Control Model-free Predictive Control Anders Stenman Department of Electrical Engineering Linköping University, S-581 83 Linköping, Sweden WWW: http://wwwcontrolisyliuse Email: stenman@isyliuse February 25, 1999

More information

ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation. Petra E. Todd

ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation. Petra E. Todd ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation Petra E. Todd Fall, 2014 2 Contents 1 Review of Stochastic Order Symbols 1 2 Nonparametric Density Estimation 3 2.1 Histogram

More information

In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment.

In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment. 1 Introduction and Problem Formulation In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment. 1.1 Machine Learning under Covariate

More information

EECE Adaptive Control

EECE Adaptive Control EECE 574 - Adaptive Control Basics of System Identification Guy Dumont Department of Electrical and Computer Engineering University of British Columbia January 2010 Guy Dumont (UBC) EECE574 - Basics of

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

Linear Models for Regression. Sargur Srihari

Linear Models for Regression. Sargur Srihari Linear Models for Regression Sargur srihari@cedar.buffalo.edu 1 Topics in Linear Regression What is regression? Polynomial Curve Fitting with Scalar input Linear Basis Function Models Maximum Likelihood

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Regression I: Mean Squared Error and Measuring Quality of Fit

Regression I: Mean Squared Error and Measuring Quality of Fit Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Additive Isotonic Regression

Additive Isotonic Regression Additive Isotonic Regression Enno Mammen and Kyusang Yu 11. July 2006 INTRODUCTION: We have i.i.d. random vectors (Y 1, X 1 ),..., (Y n, X n ) with X i = (X1 i,..., X d i ) and we consider the additive

More information

EE731 Lecture Notes: Matrix Computations for Signal Processing

EE731 Lecture Notes: Matrix Computations for Signal Processing EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University September 22, 2005 0 Preface This collection of ten

More information

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that Linear Regression For (X, Y ) a pair of random variables with values in R p R we assume that E(Y X) = β 0 + with β R p+1. p X j β j = (1, X T )β j=1 This model of the conditional expectation is linear

More information

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING

FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING Vishwanath Mantha Department for Electrical and Computer Engineering Mississippi State University, Mississippi State, MS 39762 mantha@isip.msstate.edu ABSTRACT

More information

Outline 2(42) Sysid Course VT An Overview. Data from Gripen 4(42) An Introductory Example 2,530 3(42)

Outline 2(42) Sysid Course VT An Overview. Data from Gripen 4(42) An Introductory Example 2,530 3(42) Outline 2(42) Sysid Course T1 2016 An Overview. Automatic Control, SY, Linköpings Universitet An Umbrella Contribution for the aterial in the Course The classic, conventional System dentification Setup

More information

Chapter 6: Nonparametric Time- and Frequency-Domain Methods. Problems presented by Uwe

Chapter 6: Nonparametric Time- and Frequency-Domain Methods. Problems presented by Uwe System Identification written by L. Ljung, Prentice Hall PTR, 1999 Chapter 6: Nonparametric Time- and Frequency-Domain Methods Problems presented by Uwe System Identification Problems Chapter 6 p. 1/33

More information

Some Background Material

Some Background Material Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Forecast comparison of principal component regression and principal covariate regression

Forecast comparison of principal component regression and principal covariate regression Forecast comparison of principal component regression and principal covariate regression Christiaan Heij, Patrick J.F. Groenen, Dick J. van Dijk Econometric Institute, Erasmus University Rotterdam Econometric

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information

Clustering by Mixture Models. General background on clustering Example method: k-means Mixture model based clustering Model estimation

Clustering by Mixture Models. General background on clustering Example method: k-means Mixture model based clustering Model estimation Clustering by Mixture Models General bacground on clustering Example method: -means Mixture model based clustering Model estimation 1 Clustering A basic tool in data mining/pattern recognition: Divide

More information

Stochastic Analogues to Deterministic Optimizers

Stochastic Analogues to Deterministic Optimizers Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured

More information

y k = ( ) x k + v k. w q wk i 0 0 wk

y k = ( ) x k + v k. w q wk i 0 0 wk Four telling examples of Kalman Filters Example : Signal plus noise Measurement of a bandpass signal, center frequency.2 rad/sec buried in highpass noise. Dig out the quadrature part of the signal while

More information

Regularized Least Squares

Regularized Least Squares Regularized Least Squares Ryan M. Rifkin Google, Inc. 2008 Basics: Data Data points S = {(X 1, Y 1 ),...,(X n, Y n )}. We let X simultaneously refer to the set {X 1,...,X n } and to the n by d matrix whose

More information

Topic 3: Neural Networks

Topic 3: Neural Networks CS 4850/6850: Introduction to Machine Learning Fall 2018 Topic 3: Neural Networks Instructor: Daniel L. Pimentel-Alarcón c Copyright 2018 3.1 Introduction Neural networks are arguably the main reason why

More information

Confidence intervals for kernel density estimation

Confidence intervals for kernel density estimation Stata User Group - 9th UK meeting - 19/20 May 2003 Confidence intervals for kernel density estimation Carlo Fiorio c.fiorio@lse.ac.uk London School of Economics and STICERD Stata User Group - 9th UK meeting

More information

Optimization Problems

Optimization Problems Optimization Problems The goal in an optimization problem is to find the point at which the minimum (or maximum) of a real, scalar function f occurs and, usually, to find the value of the function at that

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 6: Bias and variance (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 49 Our plan today We saw in last lecture that model scoring methods seem to be trading off two different

More information

NONLINEAR PLANT IDENTIFICATION BY WAVELETS

NONLINEAR PLANT IDENTIFICATION BY WAVELETS NONLINEAR PLANT IDENTIFICATION BY WAVELETS Edison Righeto UNESP Ilha Solteira, Department of Mathematics, Av. Brasil 56, 5385000, Ilha Solteira, SP, Brazil righeto@fqm.feis.unesp.br Luiz Henrique M. Grassi

More information

Nonparametric Regression. Badr Missaoui

Nonparametric Regression. Badr Missaoui Badr Missaoui Outline Kernel and local polynomial regression. Penalized regression. We are given n pairs of observations (X 1, Y 1 ),...,(X n, Y n ) where Y i = r(x i ) + ε i, i = 1,..., n and r(x) = E(Y

More information

LTI Systems, Additive Noise, and Order Estimation

LTI Systems, Additive Noise, and Order Estimation LTI Systems, Additive oise, and Order Estimation Soosan Beheshti, Munther A. Dahleh Laboratory for Information and Decision Systems Department of Electrical Engineering and Computer Science Massachusetts

More information

Problem Set 2 Solution Sketches Time Series Analysis Spring 2010

Problem Set 2 Solution Sketches Time Series Analysis Spring 2010 Problem Set 2 Solution Sketches Time Series Analysis Spring 2010 Forecasting 1. Let X and Y be two random variables such that E(X 2 ) < and E(Y 2 )

More information

On Ridge Functions. Allan Pinkus. September 23, Technion. Allan Pinkus (Technion) Ridge Function September 23, / 27

On Ridge Functions. Allan Pinkus. September 23, Technion. Allan Pinkus (Technion) Ridge Function September 23, / 27 On Ridge Functions Allan Pinkus Technion September 23, 2013 Allan Pinkus (Technion) Ridge Function September 23, 2013 1 / 27 Foreword In this lecture we will survey a few problems and properties associated

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Linear Regression 1 / 25. Karl Stratos. June 18, 2018

Linear Regression 1 / 25. Karl Stratos. June 18, 2018 Linear Regression Karl Stratos June 18, 2018 1 / 25 The Regression Problem Problem. Find a desired input-output mapping f : X R where the output is a real value. x = = y = 0.1 How much should I turn my

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Kernel Method: Data Analysis with Positive Definite Kernels

Kernel Method: Data Analysis with Positive Definite Kernels Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University

More information

Discussion of Regularization of Wavelets Approximations by A. Antoniadis and J. Fan

Discussion of Regularization of Wavelets Approximations by A. Antoniadis and J. Fan Discussion of Regularization of Wavelets Approximations by A. Antoniadis and J. Fan T. Tony Cai Department of Statistics The Wharton School University of Pennsylvania Professors Antoniadis and Fan are

More information

Support Vector Regression (SVR) Descriptions of SVR in this discussion follow that in Refs. (2, 6, 7, 8, 9). The literature

Support Vector Regression (SVR) Descriptions of SVR in this discussion follow that in Refs. (2, 6, 7, 8, 9). The literature Support Vector Regression (SVR) Descriptions of SVR in this discussion follow that in Refs. (2, 6, 7, 8, 9). The literature suggests the design variables should be normalized to a range of [-1,1] or [0,1].

More information

Linear regression COMS 4771

Linear regression COMS 4771 Linear regression COMS 4771 1. Old Faithful and prediction functions Prediction problem: Old Faithful geyser (Yellowstone) Task: Predict time of next eruption. 1 / 40 Statistical model for time between

More information

Adaptive Dual Control

Adaptive Dual Control Adaptive Dual Control Björn Wittenmark Department of Automatic Control, Lund Institute of Technology Box 118, S-221 00 Lund, Sweden email: bjorn@control.lth.se Keywords: Dual control, stochastic control,

More information

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 BIOLOGICAL INSPIRATIONS Some numbers The human brain contains about 10 billion nerve cells (neurons) Each neuron is connected to the others through 10000

More information

Machine Learning Practice Page 2 of 2 10/28/13

Machine Learning Practice Page 2 of 2 10/28/13 Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes

More information

4 Nonparametric Regression

4 Nonparametric Regression 4 Nonparametric Regression 4.1 Univariate Kernel Regression An important question in many fields of science is the relation between two variables, say X and Y. Regression analysis is concerned with the

More information

6 Pattern Mixture Models

6 Pattern Mixture Models 6 Pattern Mixture Models A common theme underlying the methods we have discussed so far is that interest focuses on making inference on parameters in a parametric or semiparametric model for the full data

More information