EL1820 Modeling of Dynamical Systems

Lecture 9 - Parameter estimation in linear models

- Model structures
- Parameter estimation via prediction error minimization
- Properties of the estimate: bias and variance
Today's goal

You should be able to
- distinguish between common model structures used in identification
- estimate model parameters using the prediction-error method
- calculate the optimal parameters for ARX models using least-squares
- estimate bias and variance of estimates from model and input signal properties
System identification

Basic idea: estimate the system from measurements of u(t) and y(t)

[Block diagram: input u(t) and disturbance w(t) enter the System; the output, corrupted by measurement noise e(t), gives y(t); u(kh) and y(kh) are the sampled signals]

Many issues
- choice of sampling frequency and input signal (experimental conditions)
- what class of models? how to model disturbances?
- estimating model parameters from sampled, finite and noisy data
System identification via parameter estimation

[Block diagram: input u[k] and disturbance w[k] enter a Linear system with output y[k]]

Need to fix the model structure before trying to estimate parameters
- system model, disturbance model
- model order (degrees of the transfer function polynomials)
Model structures

Model structures commonly used (BJ includes all as special cases):

ARMAX (autoregressive moving average, exogenous input):
    y[k] = (B(q)/A(q)) u[k] + (C(q)/A(q)) e[k]

ARX (autoregressive with exogenous input):
    y[k] = (B(q)/A(q)) u[k] + (1/A(q)) e[k]

BJ (Box-Jenkins):
    y[k] = (B(q)/F(q)) u[k] + (C(q)/D(q)) e[k]

OE (output error):
    y[k] = (B(q)/A(q)) u[k] + e[k]
Transfer function parameterizations

The transfer functions G(q) and H(q) in the linear model

    y[k] = G(q; θ) u[k] + H(q; θ) e[k]

will be parameterized as

    G(q; θ) = q^{-n_k} (b_0 + b_1 q^{-1} + ... + b_{n_b} q^{-n_b}) / (1 + f_1 q^{-1} + ... + f_{n_f} q^{-n_f})

    H(q; θ) = (1 + c_1 q^{-1} + ... + c_{n_c} q^{-n_c}) / (1 + d_1 q^{-1} + ... + d_{n_d} q^{-n_d})

where the parameter vector θ contains {b_k}, {f_k}, {c_k}, {d_k}

Note: n_k determines the dead-time; n_b, n_f, n_c, n_d give the orders of the polynomials
Model order selection from physical insight

Physical insight can often help us determine the right model order.
If the system is sampled using a zero-order hold (input piecewise constant), then
- n_f equals the number of poles of the continuous-time system
- if the system has no delay and no direct term, then n_b = n_f - 1, n_k = 1
- if the system has no delay but a direct term, then n_b = n_f, n_k = 0
- if the continuous-time system has a time delay τ, then n_k = τ/h + 1

Note: n_b does not depend on the number of continuous-time zeros! (See the illustration below.)
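As an illustration (this example is not on the original slide): G(s) = 1/((s+1)(s+2)) has two poles, no zeros, no time delay and no direct term. Zero-order-hold sampling gives a discrete-time model with n_f = 2, n_b = n_f - 1 = 1 and n_k = 1, so a "sampling zero" appears in B(q) even though G(s) itself has no zeros.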
EL1820 Modeling of Dynamical Systems

Lecture 9 - Parameter estimation in linear models

- Model structures
- Parameter estimation via prediction error minimization
- Properties of the estimate: bias and variance
Basic principle of parameter estimation

[Block diagram: u[k] and disturbance w[k] drive the System, giving y[k]; u[k] also drives the Model, giving the prediction ŷ[k]]

For given parameters θ, the model predicts that the system output will be ŷ[k; θ]

Determine θ so that ŷ[k; θ] matches the observed output y[k] as closely as possible

To solve the parameter estimation problem, we note that
1. The value of ŷ[k; θ] depends on the disturbance model
2. The concept "as closely as possible" must be given a mathematical formulation
Prediction error minimization (PEM)

1. Compute ŷ[k|k-1; θ], the model's prediction of the system output, given information at time k-1
2. Form the prediction error ε[k] = y[k] - ŷ[k|k-1; θ]
3. Construct the loss function

    V_N(θ) = (1/N) Σ_{k=1}^{N} ε²[k]

4. The optimal θ is the one minimizing the loss function:

    θ̂_N = arg min_θ V_N(θ)
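A minimal sketch of steps 1-4 in Python (not part of the original slides): the loss V_N(θ) for a generic one-step predictor, which can then be handed to a numerical optimizer. The `predictor` callable and the data arrays are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def pem_loss(theta, y, u, predictor):
    """V_N(theta) = (1/N) * sum of squared one-step prediction errors."""
    y_hat = predictor(theta, y, u)   # yhat[k|k-1; theta] for k = 0..N-1
    eps = y - y_hat                  # prediction errors eps[k]
    return np.mean(eps**2)

# Usage sketch (theta0, y, u, predictor supplied by the user):
# theta_hat = minimize(pem_loss, theta0, args=(y, u, predictor)).x
```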
Prediction using linear models

Consider the linear model

    y[k] = G(q) u[k] + H(q) e[k]

Multiply by H^{-1}(q) (to make the noise term white) and re-write as

    y[k] = (1 - H^{-1}(q)) y[k] + H^{-1}(q) G(q) u[k] + e[k]

Since {e[k]} is a white noise sequence, our best prediction is

    ŷ[k] = (1 - H^{-1}(q)) y[k] + H^{-1}(q) G(q) u[k]

If n_c ≥ n_d, the prediction uses only old outputs (measured up to k-1)
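As a worked illustration (not on the original slide): take G(q) = b q^{-1} and H(q) = 1 + c q^{-1}. Then 1 - H^{-1}(q) = c q^{-1}/(1 + c q^{-1}) and H^{-1}(q) G(q) = b q^{-1}/(1 + c q^{-1}), so multiplying the predictor by (1 + c q^{-1}) gives the recursion

    ŷ[k] = b u[k-1] + c (y[k-1] - ŷ[k-1]) = b u[k-1] + c ε[k-1]

i.e., the noise model feeds the past prediction error back into the prediction.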
Prediction using ARX models

For ARX models, H(q) = 1/A(q) and G(q) = q^{-n_k} B(q)/A(q), so

    (1 - H^{-1}(q)) y[k] = (1 - A(q)) y[k] = -(a_1 q^{-1} + ... + a_{n_a} q^{-n_a}) y[k]

    H^{-1}(q) G(q) u[k] = q^{-n_k} B(q) u[k] = (b_0 + b_1 q^{-1} + ... + b_{n_b} q^{-n_b}) q^{-n_k} u[k]

Thus, the predictor is linear in the parameters:

    ŷ[k; θ | k-1] = φ^T[k] θ

where

    θ = (a_1, ..., a_{n_a}, b_0, ..., b_{n_b})^T
    φ[k] = (-y[k-1], ..., -y[k-n_a], u[k-n_k], ..., u[k-n_k-n_b])^T
Linear regression

Linear model and linear predictor ({e[k]}: white noise):

    y[k] = φ^T[k] θ_0 + e[k]        ŷ[k] = φ^T[k] θ

Convenient to express the residuals ε[k] = y[k] - ŷ[k] in vector form:

    ε_N = (ε[1], ..., ε[N])^T = y_N - Φ_N θ

where y_N = (y[1], ..., y[N])^T and Φ_N has rows φ^T[1], ..., φ^T[N].

Then the loss function can be written as

    V(θ) = (1/N) Σ_{k=1}^{N} ε²[k] = (1/N) ε_N^T ε_N = (1/N) (y_N - Φ_N θ)^T (y_N - Φ_N θ)

and the optimal estimate is found by solving ∂V/∂θ = 0:

    θ̂ = (Φ_N^T Φ_N)^{-1} Φ_N^T y_N

(provided the inverse exists; see end of slides for proof)
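A minimal numerical sketch (Python/NumPy, not part of the original slides); `Phi` and `y` correspond to Φ_N and y_N above:

```python
import numpy as np

def least_squares_estimate(Phi, y):
    """theta_hat = (Phi^T Phi)^{-1} Phi^T y.

    np.linalg.lstsq solves the same normal equations but avoids
    forming the inverse explicitly, which is numerically safer;
    the result agrees whenever Phi^T Phi is invertible.
    """
    theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta_hat
```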
Example: Estimation in ARX models

Example: Estimate the model parameters a and b in the ARX model

    y[k] = a y[k-1] + b u[k-1] + e[k]

from input and output sequences {y[k]}, {u[k]} for k = 0, ..., N.

Using θ = (a, b)^T and φ[k] = (y[k-1], u[k-1])^T, we find

    Φ_N^T Φ_N = [ Σ_{k=1}^N y²[k-1]         Σ_{k=1}^N y[k-1]u[k-1] ]
                [ Σ_{k=1}^N u[k-1]y[k-1]    Σ_{k=1}^N u²[k-1]      ]

so the optimal estimate is given by

    θ̂ = (Φ_N^T Φ_N)^{-1} [ Σ_{k=1}^N y[k-1]y[k] ]
                          [ Σ_{k=1}^N u[k-1]y[k] ]

Note: the estimate is computed using (sample) covariances of u[k] and y[k].
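A hedged code version of this example (the function name and data layout are my own, not from the slides):

```python
import numpy as np

def estimate_first_order_arx(y, u):
    """Estimate (a, b) in y[k] = a*y[k-1] + b*u[k-1] + e[k],
    using phi[k] = (y[k-1], u[k-1])^T as on the slide."""
    Phi = np.column_stack([y[:-1], u[:-1]])          # rows phi^T[k], k = 1..N
    theta_hat, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
    return theta_hat                                 # (a_hat, b_hat)
```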
Estimation in general model structures

Estimation is more difficult when the predictor is not linear in the parameters.
In general, we need to minimize V_N(θ) using iterative numerical methods, e.g.,

    θ^{(i+1)} = θ^{(i)} - μ^{(i)} M^{(i)} V_N'(θ^{(i)})

Example: Newton's method uses M^{(i)} = (V_N''(θ^{(i)}))^{-1}, while Gauss-Newton approximates M^{(i)} using first-order derivatives.

Problem: the result is locally optimal, but not necessarily globally optimal.
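A minimal Gauss-Newton sketch in Python (not from the slides): the `residuals` and `jacobian` callables are assumed user-supplied, the step length μ and iteration count are fixed for simplicity, and no convergence test or line search is included.

```python
import numpy as np

def gauss_newton(residuals, jacobian, theta0, n_iter=20, mu=1.0):
    """Gauss-Newton minimization of V_N(theta) = mean(eps**2).

    residuals(theta) -> eps, shape (N,)     prediction errors
    jacobian(theta)  -> J,   shape (N, d)   J[k, i] = d eps[k] / d theta_i
    Each step solves (J^T J) delta = J^T eps, i.e. M^(i) is built
    from first-order derivatives only (no second derivatives).
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        eps = residuals(theta)
        J = jacobian(theta)
        delta, *_ = np.linalg.lstsq(J, eps, rcond=None)  # GN search direction
        theta = theta - mu * delta
    return theta
```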
Example

G(s) = 10/(s² + 2s + 10), sampled with h = 0.05, var{v} = 0.1²

[Bode plots (magnitude and phase vs. frequency in rad/s) comparing the true system with the estimated ARX and OE models]

Model structure matters!
EL1820 Modeling of Dynamical Systems

Lecture 9 - Parameter estimation in linear models

- Model structures
- Parameter estimation via prediction error minimization
- Properties of the estimate: bias and variance
Properties of PEM estimates

What can we say about models estimated using prediction-error minimization?

Model errors have two components:
1. Bias errors: arise if the model is unable to capture the true system
2. Variance errors: due to the influence of stochastic disturbances

We will study two properties of general prediction error methods:
1. Convergence: what happens with θ̂_N as N grows?
2. Accuracy: what can we say about the size of θ̂_N - θ_0 as N increases?
Convergence

If the disturbances acting on the system are stochastic, then so is the prediction error ε[k].

Under quite general conditions (even if the ε[k] are not independent),

    lim_{N→∞} (1/N) Σ_{k=1}^{N} ε²[k; θ] = E{ε²[k; θ]}

and

    θ̂_N → θ* = arg min_θ E{ε²[k; θ]}   as N → ∞

Even if the model cannot reflect reality, the estimate will minimize the prediction mean squared error!
Example

Example: Assume you try to estimate the parameter b in the model

    ŷ[k] = b u[k-1]

while the true system is

    y[k] = u[k-1] + u[k-2] + e[k]

where {u[k]} and {e[k]} are white noise signals, independent of each other.
What will the PEM estimate converge to?

PEM will find the parameter that minimizes the mean squared error:

    E{ε²[k]} = E{(y[k] - ŷ[k])²} = E{(u[k-1] + u[k-2] + e[k] - b u[k-1])²}
             = E{((1-b) u[k-1] + u[k-2])²} + σ_e²
             = (1-b)² σ_u² + σ_u² + σ_e²

This expression is minimized by b = 1 (the asymptotic estimate).
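The asymptotic result can be checked numerically. An illustrative simulation (not from the slides), assuming σ_u = σ_e = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
u = rng.standard_normal(N)            # white input, sigma_u = 1
e = rng.standard_normal(N)            # white noise,  sigma_e = 1
y = np.zeros(N)
y[2:] = u[1:-1] + u[:-2] + e[2:]      # y[k] = u[k-1] + u[k-2] + e[k]

# PEM = least squares for yhat[k] = b*u[k-1]:
# b_hat = sum(u[k-1]*y[k]) / sum(u[k-1]**2)
b_hat = (u[1:-1] @ y[2:]) / (u[1:-1] @ u[1:-1])
print(b_hat)                          # close to 1 for large N
```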
Consistency

Assume that there is some θ_0 such that {ε[k; θ_0]} is white noise. Then E{ε²[k; θ]} is minimized by this value (see end of slides for proof).

If, moreover,

    ŷ[k; θ_0] = ŷ[k; θ]  ⟹  θ = θ_0

then one can conclude that

    θ̂_N → θ_0   as N → ∞
θ*: frequency-domain characterization

Assume that the true system is described by

    y[k] = G_0(q) u[k] + w[k]

and that we try to estimate a model of the form (H*(q) independent of θ)

    y[k] = G(q; θ) u[k] + H*(q) e[k]

If {u[k]} and {w[k]} are independent,

    θ* = lim_{N→∞} θ̂_N = arg min_θ ∫_{-π}^{π} |G_0(e^{iω}) - G(e^{iω}; θ)|² (Φ_u(ω) / |H*(e^{iω})|²) dω

θ* minimizes a least-squares criterion, weighted by Φ_u(ω)/|H*(e^{iω})|²
- good fit where Φ_u(ω) has much energy, or H*(e^{iω}) has little energy

Can focus model accuracy on the important frequency range by choosing {u[k]}
Example

Output error method using low- and high-frequency input signals

[Two Bode magnitude plots (magnitude vs. frequency in rad/s), each comparing the true system with the estimated OE model for one of the two input signals]
Estimation error variance

If {e[k]} is white noise with variance λ, then

    E{(θ̂_N - θ_0)(θ̂_N - θ_0)^T} ≈ (1/N) λ R^{-1}

where

    R = E{ψ[k; θ_0] ψ^T[k; θ_0]},    ψ[k; θ_0] = (d/dθ) ŷ[k; θ] |_{θ=θ_0}

The error variance decreases with
- the sensitivity of the prediction error (w.r.t. the parameters)
- the number of measurements
Estimation error variance, cont'd

We can estimate the estimation error variance via

    P̂_N = (1/N) λ̂ R̂_N^{-1}

where

    λ̂ = (1/N) Σ_{k=1}^{N} ε²[k; θ̂_N],    R̂_N = (1/N) Σ_{k=1}^{N} ψ[k; θ̂_N] ψ^T[k; θ̂_N]

Moreover, one can show that

    √N (θ̂_N - θ_0) →d N(0, λ R^{-1})

This can be used to compute confidence regions for the parameter estimates.
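A minimal sketch (not from the slides) for the linear-regression case, where ψ[k] = φ[k] since ŷ[k] = φ^T[k]θ; the 1.96 factor gives approximate 95% Gaussian confidence intervals:

```python
import numpy as np

def estimate_covariance(Phi, y, theta_hat):
    """P_N = (1/N) * lambda_hat * R_N^{-1} for a linear predictor."""
    N = len(y)
    eps = y - Phi @ theta_hat               # residuals eps[k; theta_hat]
    lam_hat = np.mean(eps**2)               # noise variance estimate
    R_hat = (Phi.T @ Phi) / N               # R_N = (1/N) sum phi phi^T
    P_hat = lam_hat * np.linalg.inv(R_hat) / N
    std = np.sqrt(np.diag(P_hat))           # per-parameter standard errors
    return P_hat, theta_hat - 1.96 * std, theta_hat + 1.96 * std
```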
Error variance in the frequency domain

For the variance of the frequency response of the estimate, we have

    var{G(e^{iω}; θ̂_N)} ≈ (n/N) Φ_w(ω)/Φ_u(ω),    n, N ≫ 1

The variance
- increases with the number of model parameters n
- decreases with the number of observations, and with the signal-to-noise ratio
- again, the frequency content of the input influences the accuracy of the model

Similar to spectral analysis error bounds.
|G(e^{iω}; θ̂_N)| typically decreases as ω → π/h, while the variance is constant (or increases!) ⟹ high relative error at high frequencies
Example

Confidence intervals for frequency responses for two different input spectra

[Four plots (magnitude vs. frequency in rad/s): the two input spectra (top) and the two corresponding estimates with confidence intervals (bottom)]
Next lecture

Experimental conditions and model validation
Bonus: calculation of ∂V/∂θ

    V(θ) = (1/N) (y_N - Φ_N θ)^T (y_N - Φ_N θ)
         = (1/N) ( y_N^T y_N - 2 θ^T Φ_N^T y_N + θ^T Φ_N^T Φ_N θ )

where 2 θ^T Φ_N^T y_N = 2 Σ_i θ_i (Φ_N^T y_N)_i and θ^T Φ_N^T Φ_N θ = Σ_{i,j} θ_i θ_j (Φ_N^T Φ_N)_{ij}.

Therefore,

    ∂V/∂θ_k = -(2/N) (Φ_N^T y_N)_k + (2/N) Σ_i (Φ_N^T Φ_N)_{ki} θ_i

or

    ∂V/∂θ = -(2/N) Φ_N^T y_N + (2/N) (Φ_N^T Φ_N) θ  =  0   for θ = θ̂

Hence:

    (Φ_N^T Φ_N) θ̂ = Φ_N^T y_N   ⟺   θ̂ = (Φ_N^T Φ_N)^{-1} Φ_N^T y_N
Bonus: proof that θ_0, for which ε[k; θ_0] is white noise, minimizes E{ε²[k; θ]}

    E{ε²[k; θ]} = E{( y[k] - ŷ[k; θ_0] + ŷ[k; θ_0] - ŷ[k; θ] )²}        (note: y[k] - ŷ[k; θ_0] = ε[k; θ_0])
                = E{ε²[k; θ_0]} + E{(ŷ[k; θ_0] - ŷ[k; θ])²} + 2 E{ε[k; θ_0](ŷ[k; θ_0] - ŷ[k; θ])}
                ≥ E{ε²[k; θ_0]}   if   E{ε[k; θ_0](ŷ[k; θ_0] - ŷ[k; θ])} = 0

Now, y[k] = ε[k; θ_0] + ŷ[k; θ_0] is a function of ε[k; θ_0], ε[k-1; θ_0], ..., u[k], u[k-1], ...,
because ŷ[k; θ_0] is a function of y[k-1], y[k-2], ... and u[k], u[k-1], ...,
where y[k-1], ... depend on previous values ε[k-1; θ_0], ..., and so on.

Then, since {ε[k; θ_0]} is white noise, ε[k; θ_0] is uncorrelated with y[k-1], ... and u[k], ...,
hence it is uncorrelated with both ŷ[k; θ_0] and ŷ[k; θ], i.e.,

    E{ε[k; θ_0](ŷ[k; θ_0] - ŷ[k; θ])} = 0

This shows that E{ε²[k; θ]} ≥ E{ε²[k; θ_0]} for all θ.