ECE 636: Systems identification
- Jesse Tucker
- 5 years ago
1 ECE 636: Systems identification, Lectures 9-10: Linear regression
2 Coherence

$$\gamma_{xy}^2(\omega) = \frac{|\Phi_{xy}(\omega)|^2}{\Phi_{xx}(\omega)\Phi_{yy}(\omega)}, \qquad 0 \le \gamma_{xy}^2(\omega) \le 1$$

No noise in the input, uncorrelated output noise ($x = u$, $y = z + e$): then $\Phi_{xy}(\omega) = \Phi_{uz}(\omega)$ and $\Phi_{yy}(\omega) = \Phi_{zz}(\omega) + \Phi_{ee}(\omega)$, so

$$\gamma_{xy}^2(\omega) = \frac{|\Phi_{uz}(\omega)|^2}{\Phi_{xx}(\omega)\left(\Phi_{zz}(\omega)+\Phi_{ee}(\omega)\right)} = \frac{1}{1+\Phi_{ee}(\omega)/\Phi_{zz}(\omega)}$$

Uncorrelated input and output noise ($x = u + m$, $y = z + e$):

$$\gamma_{xy}^2(\omega) = \frac{|\Phi_{uz}(\omega)|^2}{\left(\Phi_{uu}(\omega)+\Phi_{mm}(\omega)\right)\left(\Phi_{zz}(\omega)+\Phi_{ee}(\omega)\right)} = \frac{1}{\left(1+c_m(\omega)\right)\left(1+c_e(\omega)\right)} < 1$$

where $c_m(\omega)=\Phi_{mm}(\omega)/\Phi_{uu}(\omega)$ and $c_e(\omega)=\Phi_{ee}(\omega)/\Phi_{zz}(\omega)$.

Block diagram: $u(t)+m(t) \to x(t)$; $u(t) \to H(\omega) \to z(t)$; $z(t)+e(t) \to y(t)$.
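The effect of output noise on the coherence can be checked numerically. The sketch below uses Python with NumPy/SciPy (rather than the Matlab referenced later in the course); the filter coefficients, signal length and noise level are illustrative assumptions:

```python
import numpy as np
from scipy.signal import lfilter, coherence

rng = np.random.default_rng(0)
N = 8192
u = rng.standard_normal(N)                  # white input, no input noise
z = lfilter([1.0, 0.5], [1.0, -0.8], u)     # noise-free output of a toy LTI system
e = rng.standard_normal(N)                  # additive, uncorrelated output noise

# Welch estimates of the magnitude-squared coherence gamma^2_xy(w)
f, g2_noisy = coherence(u, z + e, fs=1.0, nperseg=256)
_, g2_clean = coherence(u, z, fs=1.0, nperseg=256)

mean_noisy = g2_noisy.mean()
mean_clean = g2_clean.mean()
```

As the slide predicts, the estimated coherence stays in $[0,1]$, is close to 1 for the noise-free linear system, and drops at frequencies where $\Phi_{ee}(\omega)/\Phi_{zz}(\omega)$ is large.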
3 Time domain

Impulse response analysis, step response analysis, correlation analysis (white noise input). Generally: nonparametric identification.

$$\hat\varphi_{uy}(\tau) = \hat g(\tau) * \hat\varphi_{uu}(\tau)$$

For a white noise input, $\hat\varphi_{uu}(\tau) = \sigma_u^2\,\delta(\tau)$, so

$$\hat g(\tau) = \frac{\hat\varphi_{uy}(\tau)}{\sigma_u^2} = \frac{1}{N\sigma_u^2}\sum_{n=\tau}^{N} y(n)\,u(n-\tau)$$

For a general input, solve the Toeplitz system:

$$\begin{bmatrix}\hat\varphi_{uy}(0)\\ \hat\varphi_{uy}(1)\\ \vdots\\ \hat\varphi_{uy}(M-1)\end{bmatrix} = \begin{bmatrix}\hat\varphi_{uu}(0) & \hat\varphi_{uu}(1) & \cdots & \hat\varphi_{uu}(M-1)\\ \hat\varphi_{uu}(1) & \hat\varphi_{uu}(0) & \cdots & \hat\varphi_{uu}(M-2)\\ \vdots & & & \vdots\\ \hat\varphi_{uu}(M-1) & \hat\varphi_{uu}(M-2) & \cdots & \hat\varphi_{uu}(0)\end{bmatrix} \begin{bmatrix}\hat g(0)\\ \hat g(1)\\ \vdots\\ \hat g(M-1)\end{bmatrix}$$

Least squares: stack the data as $y = U\hat g$, with rows of $U$ given by $[u(t)\ u(t-1)\ \dots\ u(t-M+1)]$:

$$\begin{bmatrix}y(1)\\ y(2)\\ \vdots\\ y(N)\end{bmatrix} = \begin{bmatrix}u(1) & 0 & \cdots & 0\\ u(2) & u(1) & \cdots & 0\\ \vdots & & & \vdots\\ u(N) & u(N-1) & \cdots & u(N-M+1)\end{bmatrix}\begin{bmatrix}\hat g(0)\\ \hat g(1)\\ \vdots\\ \hat g(M-1)\end{bmatrix}$$

$$\hat g = (U^TU)^{-1}U^Ty \quad \left(\text{equivalently } \hat g = \hat\Phi_{uu}^{-1}\hat\Phi_{uy}\right)$$

System: $u(t) \to g_0(\tau)$, plus additive noise $\upsilon(t)$, gives $y(t)$.
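The least squares impulse response estimate $\hat g = (U^TU)^{-1}U^Ty$ above can be sketched in a few lines of Python/NumPy (the FIR system and record length are toy assumptions; with a white input and no noise the estimate recovers the impulse response exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 2000, 5
g_true = np.array([1.0, 0.5, 0.25, 0.0, 0.0])   # toy FIR system with memory M

u = rng.standard_normal(N)                       # white input
y = np.convolve(u, g_true)[:N]                   # noise-free output

# Lagged-input matrix U with rows [u(t) u(t-1) ... u(t-M+1)], u(t)=0 for t<0
U = np.zeros((N, M))
for j in range(M):
    U[j:, j] = u[:N - j]

# g_hat = (U^T U)^{-1} U^T y via a numerically stable solver
g_hat = np.linalg.lstsq(U, y, rcond=None)[0]
```

Adding output noise v(t) to y would leave the estimate unbiased but give it a nonzero variance, as discussed in the stochastic-context slides below.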
4 Frequency domain

Sine wave testing (nonparametric identification):

$$u(t) = \alpha\cos(\omega_0 t), \qquad y(t) = \alpha\,|G_0(\omega_0)|\cos\left(\omega_0 t + \angle G_0(\omega_0)\right) + \upsilon(t) + \text{transient}$$

Frequency response analysis: empirical transfer function estimate (ETFE)

$$\hat G(\omega) = \frac{Y(\omega)}{U(\omega)}$$

(large variance, which does not decrease as $N \to \infty$).

Smoothing/windowing over a band $[\omega_0-\Delta\omega,\ \omega_0+\Delta\omega]$:

$$\hat G(\omega_0) = \frac{\displaystyle\int_{\omega_0-\Delta\omega}^{\omega_0+\Delta\omega} W(\xi-\omega_0)\,|U(\xi)|^2\,\hat G(\xi)\,d\xi}{\displaystyle\int_{\omega_0-\Delta\omega}^{\omega_0+\Delta\omega} W(\xi-\omega_0)\,|U(\xi)|^2\,d\xi}$$

Windowed (Blackman-Tukey) spectral estimates:

$$\hat\Phi_{uu}(\omega) = \sum_{\tau} w(\tau)\,\hat\varphi_{uu}(\tau)\,e^{-i\tau\omega}, \qquad \hat\Phi_{yu}(\omega) = \sum_{\tau} w(\tau)\,\hat\varphi_{yu}(\tau)\,e^{-i\tau\omega}, \qquad \hat G(\omega) = \frac{\hat\Phi_{yu}(\omega)}{\hat\Phi_{uu}(\omega)}$$

System: $u(t) \to g_0(\tau)$, plus additive noise $\upsilon(t)$, gives $y(t)$.
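The smoothed frequency-domain estimate $\hat G(\omega) = \hat\Phi_{yu}(\omega)/\hat\Phi_{uu}(\omega)$ can be sketched with Welch-averaged auto- and cross-spectra in Python/SciPy (a toy FIR system; the segment length and signal length are illustrative choices):

```python
import numpy as np
from scipy.signal import welch, csd, freqz

rng = np.random.default_rng(2)
N = 16384
g0 = np.array([1.0, 0.5, 0.25])                  # toy FIR system
u = rng.standard_normal(N)
y = np.convolve(u, g0)[:N]

# Smoothed spectral estimates; G_hat(w) = Phi_yu(w) / Phi_uu(w)
f, Puu = welch(u, fs=1.0, nperseg=512, detrend=False)
_, Puy = csd(u, y, fs=1.0, nperseg=512, detrend=False)
G_hat = Puy / Puu

# Compare with the true frequency response of g0 at the same frequencies
_, G_true = freqz(g0, 1, worN=f, fs=1.0)
err = np.max(np.abs(np.abs(G_hat) - np.abs(G_true)))
```

With no output noise the only error is the small leakage bias of the segment averaging; with noise, the averaging is what keeps the variance bounded, unlike the raw ETFE.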
5 Linear regression

One of the most common problems in the quantitative sciences is predicting the values of a dependent variable $y$ based on the information given by a set of independent variables $\varphi_1,\dots,\varphi_d$. Linear regression models assume that the dependence of $y$ on $\varphi_1,\dots,\varphi_d$ is linear. They have been studied extensively in statistics and used in many other scientific fields (econometrics, human sciences, psychology, engineering, etc.). These models can be theoretically analyzed in detail and often yield satisfactory descriptions of reality. Additionally, in cases where the amount of experimental data is relatively low and/or we have considerable noise, linear regression models may yield better results than more complex nonlinear models.

Least squares method: Gauss (1809).

Generally, linear regression aims at calculating a function of the independent variables, $g(\varphi)$, based on observations of $\varphi$ and $y$, such that the difference $y - g(\varphi) = y - \hat y$ is small. We can treat linear regression both in a deterministic and in a stochastic context.
6 The general form of a linear regression model is

$$g(\varphi) = \theta_1\varphi_1 + \theta_2\varphi_2 + \dots + \theta_d\varphi_d = \varphi^T\theta$$

where $\varphi = [\varphi_1\ \varphi_2\ \dots\ \varphi_d]^T$ and $\theta = [\theta_1\ \theta_2\ \dots\ \theta_d]^T$.

Example: Curve fitting (Lecture ). In this case $\varphi = [1\ x\ \dots\ x^M]^T$, $\theta = [w_0\ w_1\ \dots\ w_M]^T$.

Example: Impulse response model for a finite memory LTI system (Lectures 7-8). In this case $\varphi = [u(t)\ u(t-1)\ \dots\ u(t-M+1)]^T$, $\theta = [h(0)\ h(1)\ \dots\ h(M-1)]^T$.

In general, note that we can have nonlinear transformation terms of some independent variable(s), such as logarithmic, polynomial, etc., e.g. $\varphi_2 = \varphi_1^2$, $\varphi_3 = \log\varphi_1$ (as in curve fitting), or even interaction terms such as $\varphi_4 = \varphi_2\varphi_3$. In the case of systems identification (but not in general), the vector $\varphi$ will be a function of time, i.e. $\varphi(t)$. Both linear and nonlinear systems identification may be formulated as linear regression problems!
7 Linear regression

Examples of models that are linear in the parameters:

$$g(x) = \theta_0 + \theta_1 x, \qquad g(x) = \theta_0 + \theta_1 x + \theta_2 x^2$$
8 Linear regression

The objective is to find an estimate of the parameter vector $\theta$. To this end, we obtain $N$ measurements of $\varphi$ and $y$, i.e. the set $\{(\varphi_1, y_1), (\varphi_2, y_2), \dots, (\varphi_N, y_N)\}$. We can write the following set of linear equations:

$$y_1 = \varphi_1^T\theta, \quad y_2 = \varphi_2^T\theta, \quad \dots, \quad y_N = \varphi_N^T\theta$$

or, in matrix form, $y = \Phi\theta$, where $y$ is an $N\times 1$ vector and $\Phi$ an $N\times d$ matrix:

$$y = \begin{bmatrix} y_1\\ \vdots\\ y_N\end{bmatrix}, \qquad \Phi = \begin{bmatrix}\varphi_1^T\\ \vdots\\ \varphi_N^T\end{bmatrix}$$

If $N=d$ we can invert $\Phi$ to obtain $\theta$; however, we typically have data contaminated by noise, and in this case we need $N \gg d$ in order to obtain good results (Lecture ). This yields an overdetermined system: in general we don't have an exact solution. How do we solve the above matrix equation? Define the model errors/residuals as $\varepsilon_i = y_i - \varphi_i^T\theta$ and their vector $\varepsilon = [\varepsilon_1\ \varepsilon_2\ \dots\ \varepsilon_N]^T$.
9 Linear regression

We may then define the least squares estimate of $\theta$ as the vector that minimizes the following cost function:

$$V_N(\theta) = \frac{1}{N}\sum_{k=1}^{N}\varepsilon_k^2 = \frac{1}{N}\sum_{k=1}^{N}\left[y_k - g(\varphi_k)\right]^2 = \frac{1}{N}\sum_{k=1}^{N}\left(y_k - \varphi_k^T\theta\right)^2 = \frac{1}{N}\varepsilon^T\varepsilon$$

i.e. we are looking for the value of $\theta$ such that $\hat\theta = \arg\min_\theta V_N(\theta)$. Set the derivative to zero:

$$V_N(\theta) = \frac{1}{N}(y-\Phi\theta)^T(y-\Phi\theta) = \frac{1}{N}\left[y^Ty + \theta^T\Phi^T\Phi\theta - 2\theta^T\Phi^Ty\right]$$

$$\frac{\partial V_N(\theta)}{\partial\theta} = \frac{2}{N}\left(-\Phi^Ty + \Phi^T\Phi\theta\right) = 0 \;\Rightarrow\; \hat\theta_N = (\Phi^T\Phi)^{-1}\Phi^Ty$$

*Note from matrix algebra: $\dfrac{\partial}{\partial\theta}\left(\theta^TA\theta\right) = (A + A^T)\theta$, $\dfrac{\partial}{\partial\theta}\left(a^T\theta\right) = \dfrac{\partial}{\partial\theta}\left(\theta^Ta\right) = a$.

If $\Phi$ is full rank, the matrix $\Phi^T\Phi$ is nonsingular and positive definite, and we have the above unique solution, which corresponds to a minimum since

$$\frac{\partial^2 V_N(\theta)}{\partial\theta\,\partial\theta^T} = \frac{2}{N}\Phi^T\Phi$$

Equivalently:

$$\hat\theta_N = \left[\sum_{k=1}^N \varphi_k\varphi_k^T\right]^{-1}\sum_{k=1}^N \varphi_k y_k$$

If $\Phi$ is not full rank, we have infinitely many solutions. The matrix $(\Phi^T\Phi)^{-1}\Phi^T$ is termed the pseudoinverse of $\Phi$.
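The normal-equation solution above, and the orthogonality of the residual to the regressors, can be verified directly in Python/NumPy (the regressor matrix and parameter values are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 100, 3
Phi = rng.standard_normal((N, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = Phi @ theta_true + 0.1 * rng.standard_normal(N)

# Normal equations: theta_hat = (Phi^T Phi)^{-1} Phi^T y
theta_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Reference solution via a least squares solver
theta_lstsq = np.linalg.lstsq(Phi, y, rcond=None)[0]

# Residuals are orthogonal to the columns of Phi: Phi^T (y - Phi theta_hat) = 0
resid = y - Phi @ theta_hat
ortho = Phi.T @ resid
```

The zero gradient condition $\Phi^Ty = \Phi^T\Phi\theta$ is exactly the statement that $\Phi^T\hat\varepsilon = 0$, which is the geometric picture of the next slide.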
10 Linear regression: geometric interpretation

$\{y, \varphi^{(1)}, \varphi^{(2)}, \dots, \varphi^{(d)}\}$: vectors in $R^N$ (the $\varphi^{(j)}$ are the columns of $\Phi$). Equivalent problem: find the linear combination of $\{\varphi^{(1)}, \dots, \varphi^{(d)}\}$ which approximates the vector $y$ as well as possible. The vectors $\{\varphi^{(1)}, \dots, \varphi^{(d)}\}$ define a $d$-dimensional subspace of $R^N$ if $d<N$. If $y$ belongs to this subspace, we can express it exactly as a linear combination of $\{\varphi^{(1)}, \dots, \varphi^{(d)}\}$. If not? The best approximation that belongs to the subspace is the one with the smallest distance from the vector $y$, which is the orthogonal projection of $y$ onto the subspace:

$$(y - \hat y) \perp \varphi^{(i)}, \quad i = 1,2,\dots,d$$

Therefore $(y - \hat y)^T\varphi^{(i)} = 0$, and since $\hat y = \sum_{j=1}^d \hat\theta_j\varphi^{(j)}$:

$$y^T\varphi^{(i)} = \sum_{j=1}^d \hat\theta_j\,\varphi^{(j)T}\varphi^{(i)}, \quad i = 1,2,\dots,d$$

which in matrix form becomes $\Phi^T\Phi\,\hat\theta_N = \Phi^Ty \;\Rightarrow\; \hat\theta_N = (\Phi^T\Phi)^{-1}\Phi^Ty$. (Figure: example with $N = 3$, $d = 2$.)
11 Weighted least squares

Often our observations may not be equally reliable: we can weight them differently:

$$V_N(\theta) = \frac{1}{N}\sum_{k=1}^N \alpha_k\left(y_k - \varphi_k^T\theta\right)^2$$

In matrix form:

$$V_N(\theta) = (y-\Phi\theta)^TQ(y-\Phi\theta), \qquad Q = \mathrm{diag}(\alpha_1, \alpha_2, \dots, \alpha_N)$$

and:

$$\hat\theta_N = (\Phi^TQ\Phi)^{-1}\Phi^TQy = \left[\sum_{k=1}^N \alpha_k\varphi_k\varphi_k^T\right]^{-1}\sum_{k=1}^N \alpha_k\varphi_k y_k$$

The model prediction errors/residuals are given by $\hat\varepsilon = y - \hat y = y - \Phi\hat\theta_N$. The percentage of the observations $y$ that is explained by the linear regression model may be quantified by the correlation coefficient and the normalized mean square error:

$$R_y^2 = \frac{\sum_{k=1}^N \hat y_k^2}{\sum_{k=1}^N y_k^2}, \qquad \mathrm{NMSE} = \frac{\sum_{k=1}^N \hat\varepsilon_k^2}{\sum_{k=1}^N y_k^2}$$
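A sketch of the weighted estimate $(\Phi^TQ\Phi)^{-1}\Phi^TQy$ in Python/NumPy, using an assumed heteroscedastic noise pattern (second half of the record ten times noisier) and weights $\alpha_k = 1/\sigma_k^2$; with $Q = I$ it reduces to ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 200, 2
Phi = rng.standard_normal((N, d))
theta_true = np.array([2.0, -1.0])

# Assumed noise profile: second half of the observations is 10x noisier
sigma = np.r_[0.1 * np.ones(N // 2), np.ones(N // 2)]
y = Phi @ theta_true + sigma * rng.standard_normal(N)

# Weighted LS with Q = diag(alpha_1, ..., alpha_N), alpha_k = 1/sigma_k^2
Q = np.diag(1.0 / sigma**2)
theta_wls = np.linalg.solve(Phi.T @ Q @ Phi, Phi.T @ Q @ y)

# With Q = I the formula reduces to the ordinary least squares estimate
theta_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
theta_q_eye = np.linalg.solve(Phi.T @ np.eye(N) @ Phi, Phi.T @ np.eye(N) @ y)
```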
12 Linear regression in a stochastic context

Assume that our observations are a function of time and that they are given by:

$$y(t) = \varphi^T(t)\,\theta_0 + e(t), \qquad E\{e(t)\} = 0, \quad E\{e(t)e(s)\} = r_{ts}$$

Assume also that $\varphi(t)$ is deterministic. In the simplest case, $e(t)$ is white noise with covariance matrix $E\{ee^T\} = \lambda_0 I$.

Properties of the least squares estimate: the quantity

$$\hat\theta_{LS} = (\Phi^T\Phi)^{-1}\Phi^Ty$$

is an unbiased estimate of $\theta_0$, i.e. $E\{\hat\theta_{LS}\} = \theta_0$. The covariance matrix of the least squares estimate is:

$$E\left\{\left(\hat\theta_{LS} - E\{\hat\theta_{LS}\}\right)\left(\hat\theta_{LS} - E\{\hat\theta_{LS}\}\right)^T\right\} = \lambda_0\,(\Phi^T\Phi)^{-1}$$

Therefore, the covariance matrix depends on the noise variance and the input characteristics. Specifically, it is desirable that the input is such that the elements of the inverse matrix above are small. The value of $\lambda_0$ is not known: how can we estimate it? The following noise variance estimate is unbiased:

$$\hat\lambda = \frac{1}{N-d}\sum_{t=1}^N\left(y(t) - \varphi^T(t)\hat\theta_N\right)^2$$
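The unbiasedness of $\hat\theta_{LS}$ and of $\hat\lambda$, and the covariance formula $\lambda_0(\Phi^T\Phi)^{-1}$, can be checked by Monte Carlo simulation (fixed toy regressors, assumed $\theta_0$ and $\lambda_0$; the tolerances reflect the finite number of trials):

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, lam0 = 50, 2, 0.25
Phi = rng.standard_normal((N, d))               # fixed, deterministic regressors
theta0 = np.array([1.0, 2.0])
P = np.linalg.inv(Phi.T @ Phi)

trials = 4000
ests = np.empty((trials, d))
lams = np.empty(trials)
for i in range(trials):
    e = np.sqrt(lam0) * rng.standard_normal(N)  # white noise, variance lam0
    y = Phi @ theta0 + e
    th = P @ Phi.T @ y                          # least squares estimate
    ests[i] = th
    lams[i] = np.sum((y - Phi @ th) ** 2) / (N - d)

mean_est = ests.mean(axis=0)                    # should be close to theta0
emp_cov = np.cov(ests.T)                        # should be close to lam0 * P
theo_cov = lam0 * P
mean_lam = lams.mean()                          # should be close to lam0
```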
13 Best linear unbiased estimate (BLUE)

We have seen that the weighted least squares estimate is:

$$\hat\theta_N = (\Phi^TQ\Phi)^{-1}\Phi^TQy$$

Question: for which matrix $Q$ is the variance of the estimates minimized? In the general case where $E\{ee^T\} = R$:

$$\mathrm{Cov}\{\hat\theta_{WLS}\} = (\Phi^TQ\Phi)^{-1}\Phi^TQRQ\Phi\,(\Phi^TQ\Phi)^{-1}$$

The matrix $Q$ which minimizes the above is $Q = R^{-1}$. The resulting estimate for this choice of $Q$,

$$\hat\theta_{BLUE} = (\Phi^TR^{-1}\Phi)^{-1}\Phi^TR^{-1}y$$

is called the best linear unbiased estimate (BLUE). What happens when $e$ is white noise? If $E\{ee^T\} = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_N)$, the weights are $\alpha_k = 1/\lambda_k$; for constant noise variance,

$$\hat\theta_{BLUE} = (\Phi^T\Phi)^{-1}\Phi^Ty = \hat\theta_{LS}$$

i.e. the standard least squares estimate is the best linear unbiased estimate. If the noise is not white, there may be another linear unbiased estimate with lower variance.
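A sketch of the BLUE with a known diagonal noise covariance $R$, again in Python/NumPy with toy data. It checks two consequences of the slide: for $R = cI$ the BLUE coincides with ordinary LS, and the theoretical BLUE covariance $(\Phi^TR^{-1}\Phi)^{-1}$ is no larger (in trace) than the OLS covariance under the same noise:

```python
import numpy as np

rng = np.random.default_rng(6)
N, d = 100, 2
Phi = rng.standard_normal((N, d))
theta0 = np.array([1.0, -1.0])

# Assumed known heteroscedastic noise covariance R = diag(lam)
lam = np.linspace(0.1, 2.0, N)
R = np.diag(lam)
y = Phi @ theta0 + np.sqrt(lam) * rng.standard_normal(N)

Rinv = np.diag(1.0 / lam)
theta_blue = np.linalg.solve(Phi.T @ Rinv @ Phi, Phi.T @ Rinv @ y)

# Sanity check: with R = c*I the BLUE reduces to ordinary least squares
c = 2.0
Rinv_iso = np.eye(N) / c
theta_iso = np.linalg.solve(Phi.T @ Rinv_iso @ Phi, Phi.T @ Rinv_iso @ y)
theta_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Theoretical covariances under noise covariance R
cov_blue = np.linalg.inv(Phi.T @ Rinv @ Phi)
P = np.linalg.inv(Phi.T @ Phi)
cov_ols = P @ Phi.T @ R @ Phi @ P               # OLS covariance when noise is not white
```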
14 Distribution of the estimates

We have seen that the estimated parameters $\hat\theta$ are random. What is their distribution? Assume that $e(t)$ is Gaussian white noise, i.e. its distribution is $N(0,\lambda)$. The output observations then follow a multivariate Gaussian distribution:

$$y \sim N(\Phi\theta_0,\ \lambda I)$$

The coefficient estimates also follow a multivariate Gaussian distribution:

$$\hat\theta_{LS} \sim N\left(\theta_0,\ \lambda(\Phi^T\Phi)^{-1}\right)$$

In the general case, where the noise is not white (samples are not independent), the observations and the estimated coefficients still follow multivariate normal distributions:

$$y \sim N(\Phi\theta_0,\ R), \qquad \hat\theta_{LS} \sim N\left(\theta_0,\ (\Phi^T\Phi)^{-1}\Phi^TR\Phi(\Phi^T\Phi)^{-1}\right)$$

Even if the observations are not normally distributed, the distribution of $\hat\theta$ may approach a normal distribution for a large number of observations $N$ (central limit theorem). The estimate of the noise variance,

$$\hat\lambda = \frac{1}{N-d}\sum_{t=1}^N\left(y(t) - \varphi^T(t)\hat\theta\right)^2$$

follows a (scaled) $\chi^2$ distribution with $N-d$ degrees of freedom; specifically,

$$\frac{(N-d)\,\hat\lambda}{\lambda} \sim \chi^2_{N-d}$$
15 Statistical testing

We have seen (Lectures 5-6) that we can use the sampling distribution of an estimator in order to perform statistical hypothesis testing. This procedure can be applied to linear regression in order to examine whether the value of an estimated regression coefficient is significantly different from zero, in other words whether this coefficient corresponds to a regressor that should be included in the model -> model order selection!

We have seen that for white noise:

$$\hat\theta_{LS} \sim N\left(\theta_0,\ \lambda(\Phi^T\Phi)^{-1}\right)$$

In order to examine whether the estimate $\hat\theta_j$ of a coefficient is significantly different from zero, we consider the null hypothesis that the real value of $\theta_j$ is equal to zero, and therefore that $\hat\theta_j$ follows the distribution $N(0,\ \lambda r_j)$, where $r_j$ are the diagonal elements of $(\Phi^T\Phi)^{-1}$. We create the random variable:

$$z_j = \frac{\hat\theta_j}{\sqrt{\lambda r_j}}$$

Considering $\lambda$ known, $z_j$ follows a standard normal distribution $N(0,1)$ (Lectures 5-6). Considering $\lambda$ unknown and random (more realistic), $z_j$ is a ratio of a normal r.v. over the square root of a $\chi^2$ r.v.; therefore $z_j$ follows a t distribution with $N-d$ degrees of freedom (Lectures 5-6). Therefore, in order to decide whether the estimated value of $z_j$ is significantly different from zero, we can compare this value to the tail value $t_{N-d,\,\alpha/2}$, where $\alpha$ is the level of significance.
16 Statistical testing

Matlab: tcdf(x,v), tpdf(x,v), tinv(p,v)

For large values of $N$, the t distribution approximates the standard normal distribution $N(0,1)$, and we can compare the statistics $z_j$ to the tail values of the $N(0,1)$ distribution.

Matlab: P = normcdf(X,MU,SIGMA), Y = normpdf(X,MU,SIGMA), X = norminv(P,MU,SIGMA) (MU=0, SIGMA=1)
17 Statistical testing

Similarly, we can examine whether the estimated value of a coefficient $\hat\theta_j$ is significantly different from a given value $\theta_{0,j}$ by creating the r.v.:

$$z_j = \frac{\hat\theta_j - \theta_{0,j}}{\sqrt{\lambda r_j}}$$

which follows a t or $N(0,1)$ distribution as before. The corresponding confidence interval, which quantifies the uncertainty for each estimate, may be obtained from:

$$\left(\hat\theta_j - t_{N-d,\,\alpha/2}\sqrt{\hat\lambda r_j},\ \ \hat\theta_j + t_{N-d,\,\alpha/2}\sqrt{\hat\lambda r_j}\right)$$

We can also examine the significance of a group of coefficients simultaneously (e.g. a group that may correspond to a specific independent variable) by computing the value of the following quantity:

$$F = \frac{(\mathrm{MSE}_1 - \mathrm{MSE}_2)/(d_2 - d_1)}{\mathrm{MSE}_2/(N - d_2)}$$

For Gaussian white noise, this quantity follows the $F_{d_2-d_1,\,N-d_2}$ distribution (Lectures 5-6), where $d_2$ and $d_1$ are the number of coefficients in the more complex and the simpler regression model, respectively. For large $N$, this distribution (scaled by $d_2-d_1$) approximates the $\chi^2_{d_2-d_1}$ distribution.
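The t test and confidence interval above can be sketched with SciPy (toy data; the second regressor is deliberately irrelevant, so its statistic should be small while the first coefficient is clearly significant):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
N, d = 100, 2
Phi = rng.standard_normal((N, d))
theta_true = np.array([2.0, 0.0])               # second regressor is irrelevant
y = Phi @ theta_true + 0.1 * rng.standard_normal(N)

P = np.linalg.inv(Phi.T @ Phi)
theta_hat = P @ Phi.T @ y
lam_hat = np.sum((y - Phi @ theta_hat) ** 2) / (N - d)   # unbiased noise variance
r = np.diag(P)                                  # r_j: diagonal of (Phi^T Phi)^{-1}

z = theta_hat / np.sqrt(lam_hat * r)            # t statistics for H0: theta_j = 0
p = 2 * stats.t.sf(np.abs(z), df=N - d)         # two-sided p-values

# 95% confidence intervals for each coefficient
tcrit = stats.t.ppf(1 - 0.025, df=N - d)
ci_lo = theta_hat - tcrit * np.sqrt(lam_hat * r)
ci_hi = theta_hat + tcrit * np.sqrt(lam_hat * r)
```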
18 Statistical testing: model order selection

How can we use this result to select the regression model order? In the ideal case (no noise, true system of the same type), the error falls to zero once we increase $d$ adequately. Realistically: gradual decrease of $V(\theta)$ as we increase $d$. When should we stop increasing $d$? For two models $M_1$ and $M_2$, where $M_2$ is more complex (i.e. it includes more regressors), we should decide whether the reduction in the cost function $\Delta V = V_1 - V_2$ between $M_1$ and $M_2$ is significant. We can examine the normalized quantity $(V_1 - V_2)/V_2$. Moreover, when $N$ tends to infinity, and if the true system can be perfectly described by the model $M_1$, then $\Delta V$ should tend to zero. An appropriate test quantity is $N(V_1 - V_2)/V_2$.
19 Statistical testing: model order selection

Therefore, if we have

$$y(t) = \varphi^T(t)\,\theta_0 + e(t), \qquad \{e(t)\}\ \text{i.i.d.},\ e(t) \sim N(0,\lambda)$$

we have seen that the quantity

$$F = \frac{(V_1 - V_2)/(d_2 - d_1)}{V_2/(N - d_2)}$$

follows an $F_{d_2-d_1,\,N-d_2}$ distribution, which (scaled by $d_2-d_1$) approximates the $\chi^2_{d_2-d_1}$ distribution for large $N$. Therefore, in order to compare the performance of two models $M_1$ and $M_2$:

- We compute the mean square errors and the corresponding F quantity
- We determine the level of significance $\alpha$
- We compare this quantity to the tail value $F_{d_2-d_1,\,N-d_2,\,\alpha}$ or $\chi^2_{d_2-d_1,\,\alpha}$
- If $F < \chi^2_{d_2-d_1,\,\alpha}$: accept model $M_1$
- If $F > \chi^2_{d_2-d_1,\,\alpha}$: accept model $M_2$

Matlab: Y = chi2pdf(X,V), P = chi2cdf(X,V), X = chi2inv(P,V)
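The F test for nested models can be sketched as follows (Python/SciPy; the regressors, coefficients and noise level are toy assumptions; an irrelevant extra regressor should give a small F, a truly present one a very large F):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
N = 200
x1 = rng.standard_normal(N)
x2 = rng.standard_normal(N)                     # candidate extra regressor
y = 1.0 + 2.0 * x1 + 0.2 * rng.standard_normal(N)

def sse(Phi, y):
    """Residual sum of squares of the least squares fit."""
    th = np.linalg.lstsq(Phi, y, rcond=None)[0]
    r = y - Phi @ th
    return r @ r

Phi1 = np.c_[np.ones(N), x1]                    # model M1, d1 = 2
Phi2 = np.c_[np.ones(N), x1, x2]                # model M2, d2 = 3

d1, d2 = Phi1.shape[1], Phi2.shape[1]
V1, V2 = sse(Phi1, y), sse(Phi2, y)
F = ((V1 - V2) / (d2 - d1)) / (V2 / (N - d2))   # x2 irrelevant here
p = stats.f.sf(F, d2 - d1, N - d2)

# Same test when the extra regressor really IS in the system
y_b = y + 3.0 * x2
V1b, V2b = sse(Phi1, y_b), sse(Phi2, y_b)
F_b = ((V1b - V2b) / (d2 - d1)) / (V2b / (N - d2))
Fcrit = stats.f.ppf(0.99, d2 - d1, N - d2)
```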
20 Computational issues

Eigenvalue decomposition of the matrix $\Phi^T\Phi$ (Hessian): $\Phi^T\Phi = U\Lambda U^T$. We can get an idea about possible numerical errors by calculating the ratio of the (absolutely) largest to the smallest eigenvalue, termed the condition number of the matrix (Matlab: cond, rcond). The larger this number is, the closer the determinant of the matrix $\Phi^T\Phi$ is to zero: greater sensitivity to small changes in the data, which may produce large changes in the estimated coefficients. This in turn depends on the input characteristics (next lecture). We may do the same by computing the singular value decomposition of the non-square matrix $\Phi$ and examining its singular values.
21 Computational issues

Estimation of the coefficients

$$\hat\theta_N = (\Phi^T\Phi)^{-1}\Phi^Ty$$

requires inverting the matrix $\Phi^T\Phi$. As discussed in the previous slide, this inversion may lead to problems, particularly for large matrices that are close to being singular (determinant close to zero) or sparse. Small changes in the observations -> large changes in the estimated coefficients!

Possible solutions: QR decomposition. There exists an orthogonal matrix $Q$ ($Q^TQ = I$) such that for any non-square matrix $\Phi$ ($N > d$) we can write $\Phi = QR$, where $R$ is an upper triangular matrix. Multiplying the relation $\Phi\theta = y$ by $Q^T$ we get:

$$Q^Ty = Q^T\Phi\theta = R\theta$$

We can therefore solve the equivalent problem $R\theta = Q^Ty$, which is easier to solve ($R$ is triangular) and less sensitive to errors (the condition number of $R$ is equal to the square root of the condition number of the initial matrix $\Phi^T\Phi$). Matlab: qr
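The QR route above, and the claimed condition-number relation, can be checked in Python/NumPy with a toy regressor matrix:

```python
import numpy as np

rng = np.random.default_rng(9)
N, d = 50, 3
Phi = rng.standard_normal((N, d))
y = rng.standard_normal(N)

# Thin QR: Phi = Q R with Q^T Q = I (Q: Nxd, R: dxd upper triangular)
Q, R = np.linalg.qr(Phi)
theta_qr = np.linalg.solve(R, Q.T @ y)          # solve R theta = Q^T y

# Normal-equation solution for comparison
theta_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# cond(R) equals the square root of cond(Phi^T Phi)
cond_R = np.linalg.cond(R)
cond_ne = np.linalg.cond(Phi.T @ Phi)
```

This is why QR (or SVD) is preferred over forming $\Phi^T\Phi$ explicitly: squaring the matrix squares its condition number.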
22 Computational issues: singular value decomposition

$$\Phi = U\Sigma V^T$$

$\Phi$: $N\times d$; $U$, $V$ orthogonal ($U$: $N\times N$, $V$: $d\times d$); $\Sigma$ diagonal. We can keep only the (absolutely) largest singular values of $\Phi$, i.e.

$$\Phi = U\Sigma V^T = \begin{bmatrix}U_1 & U_2\end{bmatrix}\begin{bmatrix}\Sigma_1 & 0\\ 0 & \Sigma_2\end{bmatrix}\begin{bmatrix}V_1^T\\ V_2^T\end{bmatrix} \approx U_1\Sigma_1V_1^T$$

and calculate the reduced-rank pseudoinverse, i.e.:

$$\Phi^+ = V_1\Sigma_1^{-1}U_1^T$$

In other words, we reject the coefficients corresponding to small singular values and solve a problem of reduced order. Matlab: [U,S,V] = svd(X)
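The reduced-rank pseudoinverse $\Phi^+ = V_1\Sigma_1^{-1}U_1^T$ can be sketched in Python/NumPy; here $\Phi$ is deliberately made nearly rank-deficient (the last column almost duplicates the first), and the truncation threshold is an assumed choice:

```python
import numpy as np

rng = np.random.default_rng(10)
N, d = 60, 4
Phi = rng.standard_normal((N, d))
# Make Phi nearly rank-deficient: last column almost a copy of the first
Phi[:, -1] = Phi[:, 0] + 1e-8 * rng.standard_normal(N)

U, s, Vt = np.linalg.svd(Phi, full_matrices=False)

# Keep only singular values above a relative threshold (assumed: 1e-6 * s_max)
r = int(np.sum(s > 1e-6 * s[0]))
Phi_pinv = Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T

y = rng.standard_normal(N)
theta_trunc = Phi_pinv @ y                      # reduced-order solution
```

The same truncation is what `np.linalg.pinv` performs when given the matching cutoff, so the two agree.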
23 Regularization

Regularization: similarly, when our problem is ill-conditioned we can modify the cost function (see also Lecture ) as follows:

$$W(\theta) = V(\theta) + \lambda V_R(\theta)$$

For example, we can use a quadratic regularization term, i.e.:

$$W(\theta) = \sum_{k=1}^N\left[y_k - g(\varphi_k)\right]^2 + \frac{\lambda}{2}\,\theta^T\theta$$

In this case we can obtain an analytic solution:

$$\hat\theta_{reg} = (\Phi^T\Phi + \lambda I)^{-1}\Phi^Ty$$

Addition of the term $\lambda I$ to the matrix $\Phi^T\Phi$: improvement of the condition number. Increasing $\lambda$: we bring the estimates closer to zero, i.e. we induce bias in the estimates, but we improve the variance, reduce computational problems and avoid overfitting (bias/variance tradeoff). In general, we can use cost functions of the form:

$$W(\theta) = \sum_{k=1}^N\left[y_k - g(\varphi_k)\right]^2 + \lambda\sum_{j=1}^d|\theta_j|^q$$
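A sketch of the ridge solution $(\Phi^T\Phi + \lambda I)^{-1}\Phi^Ty$ in Python/NumPy with toy data, illustrating two points from the slide: $\lambda = 0$ recovers ordinary least squares, and increasing $\lambda$ shrinks the estimates and improves the condition number:

```python
import numpy as np

rng = np.random.default_rng(11)
N, d = 40, 5
Phi = rng.standard_normal((N, d))
y = Phi @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)

def ridge(Phi, y, lam):
    """theta_reg = (Phi^T Phi + lam*I)^{-1} Phi^T y"""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

theta_0 = ridge(Phi, y, 0.0)                    # lam = 0: ordinary least squares
theta_r = ridge(Phi, y, 10.0)                   # shrunk estimates

theta_ols = np.linalg.lstsq(Phi, y, rcond=None)[0]
cond_raw = np.linalg.cond(Phi.T @ Phi)
cond_reg = np.linalg.cond(Phi.T @ Phi + 10.0 * np.eye(d))
```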
24 Regularization

$q = 1$: Lasso regularization.
25 Regularization

Minimization of

$$W(\theta) = \sum_{k=1}^N\left[y_k - g(\varphi_k)\right]^2 + \lambda\sum_{j=1}^d|\theta_j|^q$$

is equivalent to least squares minimization under a constraint on the coefficients, i.e.:

$$\sum_{j=1}^d|\theta_j|^q \le \eta$$

Regularization with small $q$ (e.g. Lasso, $q=1$) leads to sparser solutions: some weights are driven exactly to zero, and therefore only the significant terms are selected.
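The sparsity-inducing behavior of the $q=1$ (Lasso) penalty can be illustrated with a small proximal-gradient (ISTA) sketch, which is one standard way to minimize this non-differentiable cost; the design (orthonormal columns, noise-free data, choice of $\lambda$) is a toy assumption chosen so the solution is easy to reason about:

```python
import numpy as np

rng = np.random.default_rng(12)
N, d = 50, 2
Phi, _ = np.linalg.qr(rng.standard_normal((N, d)))   # orthonormal columns
theta_true = np.array([2.0, 0.0])                     # second weight truly zero
y = Phi @ theta_true                                  # noise-free for clarity

lam = 0.5
# Step size 1/L for the smooth part ||y - Phi theta||^2 (L = 2*sigma_max(Phi)^2)
step = 0.5 / np.linalg.norm(Phi, 2) ** 2

def soft(v, t):
    """Soft-thresholding: proximal operator of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

theta = np.zeros(d)
for _ in range(500):
    grad = -2.0 * Phi.T @ (y - Phi @ theta)           # gradient of the LS term
    theta = soft(theta - step * grad, step * lam)
```

With orthonormal regressors, the Lasso solution is the soft-thresholded least squares estimate: the irrelevant weight is driven exactly to zero, while the significant one survives (shrunk by $\lambda/2$).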
More informationSystem Identification & Parameter Estimation
System Identification & Parameter Estimation Wb3: SIPE lecture Correlation functions in time & frequency domain Alfred C. Schouten, Dept. of Biomechanical Engineering (BMechE), Fac. 3mE // Delft University
More information2. Review of Linear Algebra
2. Review of Linear Algebra ECE 83, Spring 217 In this course we will represent signals as vectors and operators (e.g., filters, transforms, etc) as matrices. This lecture reviews basic concepts from linear
More informationNumerical Methods. Elena loli Piccolomini. Civil Engeneering. piccolom. Metodi Numerici M p. 1/??
Metodi Numerici M p. 1/?? Numerical Methods Elena loli Piccolomini Civil Engeneering http://www.dm.unibo.it/ piccolom elena.loli@unibo.it Metodi Numerici M p. 2/?? Least Squares Data Fitting Measurement
More informationEstimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators
Estimation theory Parametric estimation Properties of estimators Minimum variance estimator Cramer-Rao bound Maximum likelihood estimators Confidence intervals Bayesian estimation 1 Random Variables Let
More informationSIGNAL AND IMAGE RESTORATION: SOLVING
1 / 55 SIGNAL AND IMAGE RESTORATION: SOLVING ILL-POSED INVERSE PROBLEMS - ESTIMATING PARAMETERS Rosemary Renaut http://math.asu.edu/ rosie CORNELL MAY 10, 2013 2 / 55 Outline Background Parameter Estimation
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationRegression Analysis. y t = β 1 x t1 + β 2 x t2 + β k x tk + ϵ t, t = 1,..., T,
Regression Analysis The multiple linear regression model with k explanatory variables assumes that the tth observation of the dependent or endogenous variable y t is described by the linear relationship
More informationCointegrated VAR s. Eduardo Rossi University of Pavia. November Rossi Cointegrated VAR s Financial Econometrics / 56
Cointegrated VAR s Eduardo Rossi University of Pavia November 2013 Rossi Cointegrated VAR s Financial Econometrics - 2013 1 / 56 VAR y t = (y 1t,..., y nt ) is (n 1) vector. y t VAR(p): Φ(L)y t = ɛ t The
More informationStatistical inference
Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall
More informationLecture 10 Multiple Linear Regression
Lecture 10 Multiple Linear Regression STAT 512 Spring 2011 Background Reading KNNL: 6.1-6.5 10-1 Topic Overview Multiple Linear Regression Model 10-2 Data for Multiple Regression Y i is the response variable
More informationAdvanced Econometrics
Based on the textbook by Verbeek: A Guide to Modern Econometrics Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna May 16, 2013 Outline Univariate
More informationSupplemental Material for KERNEL-BASED INFERENCE IN TIME-VARYING COEFFICIENT COINTEGRATING REGRESSION. September 2017
Supplemental Material for KERNEL-BASED INFERENCE IN TIME-VARYING COEFFICIENT COINTEGRATING REGRESSION By Degui Li, Peter C. B. Phillips, and Jiti Gao September 017 COWLES FOUNDATION DISCUSSION PAPER NO.
More information[y i α βx i ] 2 (2) Q = i=1
Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation
More informationInference in Regression Analysis
Inference in Regression Analysis Dr. Frank Wood Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 4, Slide 1 Today: Normal Error Regression Model Y i = β 0 + β 1 X i + ǫ i Y i value
More informationLinear Least-Squares Data Fitting
CHAPTER 6 Linear Least-Squares Data Fitting 61 Introduction Recall that in chapter 3 we were discussing linear systems of equations, written in shorthand in the form Ax = b In chapter 3, we just considered
More informationProf. Dr.-Ing. Armin Dekorsy Department of Communications Engineering. Stochastic Processes and Linear Algebra Recap Slides
Prof. Dr.-Ing. Armin Dekorsy Department of Communications Engineering Stochastic Processes and Linear Algebra Recap Slides Stochastic processes and variables XX tt 0 = XX xx nn (tt) xx 2 (tt) XX tt XX
More informationNumerical Methods I Eigenvalue Problems
Numerical Methods I Eigenvalue Problems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 October 2nd, 2014 A. Donev (Courant Institute) Lecture
More informationLecture 5 Least-squares
EE263 Autumn 2008-09 Stephen Boyd Lecture 5 Least-squares least-squares (approximate) solution of overdetermined equations projection and orthogonality principle least-squares estimation BLUE property
More informationClassification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).
Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes
More informationLecture 15. Hypothesis testing in the linear model
14. Lecture 15. Hypothesis testing in the linear model Lecture 15. Hypothesis testing in the linear model 1 (1 1) Preliminary lemma 15. Hypothesis testing in the linear model 15.1. Preliminary lemma Lemma
More informationStat 206: Linear algebra
Stat 206: Linear algebra James Johndrow (adapted from Iain Johnstone s notes) 2016-11-02 Vectors We have already been working with vectors, but let s review a few more concepts. The inner product of two
More informationMATH 350: Introduction to Computational Mathematics
MATH 350: Introduction to Computational Mathematics Chapter V: Least Squares Problems Greg Fasshauer Department of Applied Mathematics Illinois Institute of Technology Spring 2011 fasshauer@iit.edu MATH
More informationApplied Numerical Linear Algebra. Lecture 8
Applied Numerical Linear Algebra. Lecture 8 1/ 45 Perturbation Theory for the Least Squares Problem When A is not square, we define its condition number with respect to the 2-norm to be k 2 (A) σ max (A)/σ
More informationModelling Non-linear and Non-stationary Time Series
Modelling Non-linear and Non-stationary Time Series Chapter 2: Non-parametric methods Henrik Madsen Advanced Time Series Analysis September 206 Henrik Madsen (02427 Adv. TS Analysis) Lecture Notes September
More informationIf we want to analyze experimental or simulated data we might encounter the following tasks:
Chapter 1 Introduction If we want to analyze experimental or simulated data we might encounter the following tasks: Characterization of the source of the signal and diagnosis Studying dependencies Prediction
More informationSingular Value Decomposition Compared to cross Product Matrix in an ill Conditioned Regression Model
International Journal of Statistics and Applications 04, 4(): 4-33 DOI: 0.593/j.statistics.04040.07 Singular Value Decomposition Compared to cross Product Matrix in an ill Conditioned Regression Model
More informationESTIMATION THEORY. Chapter Estimation of Random Variables
Chapter ESTIMATION THEORY. Estimation of Random Variables Suppose X,Y,Y 2,...,Y n are random variables defined on the same probability space (Ω, S,P). We consider Y,...,Y n to be the observed random variables
More informationLinear Models 1. Isfahan University of Technology Fall Semester, 2014
Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and
More informationIn the bivariate regression model, the original parameterization is. Y i = β 1 + β 2 X2 + β 2 X2. + β 2 (X 2i X 2 ) + ε i (2)
RNy, econ460 autumn 04 Lecture note Orthogonalization and re-parameterization 5..3 and 7.. in HN Orthogonalization of variables, for example X i and X means that variables that are correlated are made
More informationOutline lecture 2 2(30)
Outline lecture 2 2(3), Lecture 2 Linear Regression it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic Control
More informationAppendix A: The time series behavior of employment growth
Unpublished appendices from The Relationship between Firm Size and Firm Growth in the U.S. Manufacturing Sector Bronwyn H. Hall Journal of Industrial Economics 35 (June 987): 583-606. Appendix A: The time
More informationIntroduction to Estimation Methods for Time Series models. Lecture 1
Introduction to Estimation Methods for Time Series models Lecture 1 Fulvio Corsi SNS Pisa Fulvio Corsi Introduction to Estimation () Methods for Time Series models Lecture 1 SNS Pisa 1 / 19 Estimation
More informationstatistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors:
Wooldridge, Introductory Econometrics, d ed. Chapter 3: Multiple regression analysis: Estimation In multiple regression analysis, we extend the simple (two-variable) regression model to consider the possibility
More informationSGN Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection
SG 21006 Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection Ioan Tabus Department of Signal Processing Tampere University of Technology Finland 1 / 28
More informationMultivariate Statistics
Multivariate Statistics Chapter 2: Multivariate distributions and inference Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid pedro.galeano@uc3m.es Course 2016/2017 Master in Mathematical
More informationISyE 691 Data mining and analytics
ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)
More information