SGN Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection
1 SGN Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection
Ioan Tabus, Department of Signal Processing, Tampere University of Technology, Finland
2 Overview
- Basic notions of estimation theory
- Statistically efficient estimates
- ARMA, AR and MA models
- Estimation of AR and ARMA parameters
- Selection of the model order for AR models
3 Basic notions of estimation theory
Assume that a set of data x_1, x_2, ..., x_N is given, and we postulate a model in the form of the pdf p(x|a) as a function of a parameter a. An estimate of a is denoted â and is a function of the data x_1, x_2, ..., x_N.
Example 1: Consider that the data x_1, x_2, ..., x_N are independent and identically distributed, i.e., were generated as realizations of the random variable X = a + ε. We assume that ε_i and ε_j are independent for all pairs i, j with i ≠ j. Additionally we assume that ε has a Gaussian distribution N(0, σ²). The intuitive significance of a is that it is the mean of the variables X_i. Having available a realization x_1, x_2, ..., x_N, an estimate of the mean is
â = (1/N) Σ_{i=1}^N x_i
The estimate is a function of the data. The particular function used to compute the estimate is called an estimator. Many estimators can be proposed. If the realization changes, the value of â also changes. Hence â changes with the realizations of X_i and is itself a random variable.
By the independence assumption the joint pdf is
p(x_1, x_2, ..., x_N | a) = (1 / ((2π)^{N/2} σ^N)) exp(−(1/(2σ²)) Σ_{i=1}^N (x_i − a)²)
4 The likelihood function
The likelihood function is denoted L(a | x_1, x_2, ..., x_N) and is seen as a function of the variable a, given the data x_1, x_2, ..., x_N. It is defined formally as
L(a | x_1, x_2, ..., x_N) = p(x_1, x_2, ..., x_N | a)
although the pdf is defined for a fixed a and has the variables x_1, x_2, ..., x_N.
In many cases, when the pdfs contain an exponential function, it is more convenient to operate with the logarithm of the likelihood, named the log-likelihood, log L(a | x_1, x_2, ..., x_N).
For Example 1, the likelihood is
L(a | x_1, x_2, ..., x_N) = (1 / ((2π)^{N/2} σ^N)) exp(−(1/(2σ²)) Σ_{i=1}^N (x_i − a)²)
and the log-likelihood is
log L(a | x_1, x_2, ..., x_N) = −(1/(2σ²)) Σ_{i=1}^N (x_i − a)² − (N/2) log(2π) − N log σ
5 The maximum likelihood (ML) estimate
The maximum likelihood (ML) estimate is the estimate that maximizes the likelihood function. One may want to estimate the unknown parameter by the value having the highest likelihood. This is just one possible estimate of the many existing ones (moment fitting estimators are also quite popular). However, in general, when ML is easy to compute, it is likely to have very useful properties. The formal definition is
â_ML = arg max_a L(a | x_1, x_2, ..., x_N)
â_ML = arg max_a log L(a | x_1, x_2, ..., x_N)
The two definitions are equivalent: since the log function is monotone increasing, the maxima of the function and of its logarithm are achieved at the same value of a.
For Example 1, to find the maximizing a take the derivative of the log-likelihood with respect to a:
∂/∂a log L(a | x_1, x_2, ..., x_N) = ∂/∂a [−(1/(2σ²)) Σ_{i=1}^N (x_i − a)² − (N/2) log(2π) − N log σ] = (1/σ²) Σ_{i=1}^N (x_i − a)
To find the extremum point, equate the derivative to 0:
(1/σ²) Σ_{i=1}^N (x_i − â_ML) = 0  ⇒  â_ML = (1/N) Σ_{i=1}^N x_i
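As a quick numerical illustration (a sketch in Python with NumPy; the data sizes and seed are arbitrary choices), the closed-form ML estimate, the sample mean, can be compared against a brute-force maximization of the log-likelihood over a grid of candidate values of a:

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, sigma, N = 2.0, 1.0, 1000
x = a_true + sigma * rng.standard_normal(N)   # x_i = a + eps_i

# Closed-form ML estimate: the sample mean
a_ml = x.mean()

# Brute-force check: evaluate the log-likelihood on a grid of candidate a
grid = np.linspace(0.0, 4.0, 4001)
loglik = np.array([-np.sum((x - a) ** 2) / (2 * sigma**2) for a in grid])
a_grid = grid[np.argmax(loglik)]

print(a_ml, a_grid)   # the two values agree up to the grid resolution
```

Dropping the constant terms of the log-likelihood does not change the maximizer, which is why only the sum of squares is evaluated on the grid.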
6 Properties of a good estimator
The most interesting properties of an estimate are the following:
Small bias: Bias(â) = E[â] − a. When Bias(â) = 0 the estimate is said to be unbiased.
Small variance: Var(â) = E[(â − E[â])²], which for an unbiased estimate equals E[(â − a)²]. To quantify how small a variance can be, several bounds are available, the most important being the Cramér-Rao bound.
Consistency: lim_{N→∞} â = a. When the number of available data points is very large one wishes to have a very precise estimate, very close to the true a; the more data is provided, the closer the estimate becomes to the true value.
7 ML estimate of a in Example 1
In the case of Example 1, the bias in the ML estimate of a can be evaluated easily:
E[â_ML] = E[(1/N) Σ X_i] = (1/N) Σ E[X_i] = (1/N) Σ E[a + ε_i] = a + (1/N) Σ E[ε_i] = a
so the bias B(â_ML) = E[â_ML] − a = 0 and the estimator is unbiased.
The variance of the estimate â_ML results from
Var[â_ML] = E[(â_ML − a)²] = E[((1/N) Σ X_i − a)²] = (1/N²) Σ Var(ε_i) = (1/N²) N σ² = σ²/N
where we have used the property that the variance of the sum of the independent random variables ε_1, ..., ε_N is the sum of the variances of the variables, each variance being σ².
When the number of measurements grows to infinity, the variance of the ML estimate tends to 0:
lim_{N→∞} Var[â_ML] = lim_{N→∞} σ²/N = 0
This ensures that the fluctuations of â_ML about its mean become extremely small, and is one form of convergence for random variables (â_ML is said to converge in mean square to a).
8 ML estimate of σ in Example 1
When computing the ML estimate of a we assumed σ is known. In fact the estimate of a does not depend on σ. If we want to find an estimate of σ in addition to the estimate of a, we have to solve the system
∂/∂a log L(a, σ | x_1, x_2, ..., x_N) = 0
∂/∂σ log L(a, σ | x_1, x_2, ..., x_N) = 0
and denote the solution (â_ML, σ̂_ML). The maximizing â_ML does not depend on σ and is as before â_ML = (1/N) Σ x_i.
To find the maximizing σ take the derivative of the log-likelihood with respect to σ:
∂/∂σ log L(a, σ | x_1, x_2, ..., x_N) = ∂/∂σ [−(1/(2σ²)) Σ (x_i − a)² − (N/2) log(2π) − N log σ] = (1/σ³) Σ (x_i − a)² − N/σ
Equating this to 0, and taking now also a to be the ML estimate,
(1/σ̂_ML³) Σ (x_i − â_ML)² − N/σ̂_ML = 0  ⇒  σ̂²_ML = (1/N) Σ (x_i − â_ML)²
9 ML estimate of σ in Example 1
The bias in the ML estimate of σ² can be evaluated as
E[σ̂²_ML] = (1/N) Σ E[(x_i − â_ML)²] = (1/N) Σ E[(a + ε_i − â_ML)²]
and using â_ML = a + (1/N) Σ_j ε_j:
E[σ̂²_ML] = (1/N) Σ_i E[(ε_i − (1/N) Σ_j ε_j)²] = (1/N) Σ_i E[(((N−1)/N) ε_i − (1/N) Σ_{j≠i} ε_j)²] = (1/N) Σ_i ((N−1)²/N² + (N−1)/N²) E[ε_i²] = ((N−1)/N) σ²
so the bias is B(σ̂²_ML) = E[σ̂²_ML] − σ² = −σ²/N and the estimator is biased.
An unbiased estimator of σ² can be found simply by scaling the ML estimator:
σ̂²_unbiased = (N/(N−1)) σ̂²_ML = (1/(N−1)) Σ (x_i − â_ML)²
The ML estimate, although biased, is preferred in some applications (e.g., in spectrum estimation).
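The bias factor (N−1)/N can be checked by simulation; a Python sketch (sample sizes, σ² and seed are arbitrary choices), averaging the ML variance estimate over many independent realizations:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, N, trials = 4.0, 10, 200_000
x = np.sqrt(sigma2) * rng.standard_normal((trials, N))

m = x.mean(axis=1, keepdims=True)        # ML estimate of a, per trial
s2_ml = np.mean((x - m) ** 2, axis=1)    # ML estimate of sigma^2, per trial

print(s2_ml.mean())                  # close to (N-1)/N * sigma2 = 3.6
print(s2_ml.mean() * N / (N - 1))    # close to sigma2 = 4.0
```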
10 Cramér-Rao bound
Consider a set of data x = [x_1 ... x_N] generated by a distribution p(x|a) and take any estimate â(x). Denote the bias B(a) = E[â(x) − a | a]. The MSE can be decomposed as
MSE = E[(â(x) − a)² | a] = E[â(x)² | a] − 2a E[â(x) | a] + a²
Var = E[(â(x) − E[â(x)|a])² | a] = E[â(x)² | a] − (E[â(x)|a])²
⇒ MSE = Var + (E[â(x)|a])² − 2a E[â(x)|a] + a² = Var + (Bias)²
The MSE of the estimate can then be bounded using the Cramér-Rao bound (CRB) inequality:
MSE = E[(â(x) − a)² | a] ≥ [1 + ∂B(a)/∂a]² / E[−∂²/∂a² log p(x|a) | a] = CRB
The denominator in the CRB is known as the Fisher information, I(a) = E[−∂²/∂a² log p(x|a) | a].
If the estimator is unbiased, Bias = B(a) = 0, then
Var(â(x)) = E[(â(x) − a)² | a] ≥ 1 / E[−∂²/∂a² log p(x|a) | a] = 1/I(a) = CRB
11 Efficient estimate
An estimator is efficient if:
- it is unbiased: Bias = B(a) = E[â(x) − a | a] = 0
- its variance achieves the CRB: Var(â(x)) = E[(â(x) − a)² | a] = 1/I(a) = CRB
When an estimator is efficient, there is no other unbiased estimator that can have smaller variance.
12 Asymptotic properties of the ML estimator
- the estimator is consistent: â(x) → a as N → ∞
- the estimator is asymptotically Gaussian, with mean a and variance the inverse of the Fisher information I(a)
- the estimator is asymptotically efficient
13 ML estimates of uniform distribution parameters
Consider a uniform distribution U(0, a). Here the parameter is a, the upper bound of the interval on which the distribution is defined. The pdf is p(x|a) = 1/a if 0 ≤ x ≤ a and p(x|a) = 0 if x > a.
Assume that a single realization x_1 is available, so N = 1. What is the ML estimate of a? We need to maximize with respect to a the likelihood function L(a | x_1). If a < x_1, L(a | x_1) = 0. If a ≥ x_1, then L(a | x_1) = 1/a. The maximum L(a | x_1) = 1/a is obtained for the smallest admissible a, i.e., â_ML = x_1.
When N = 1 the ML estimate is very biased. We have E[x_1] = a/2 and since â_ML = x_1, we have E[â_ML] = a/2, and the bias is Bias(â_ML) = −a/2.
Assume two realizations x_1, x_2 are available. Repeating the reasoning above for the largest of them, it results that â_ML = max{x_1, x_2}.
Assume that the number of realizations goes to infinity. Then the value â_ML = max{x_1, ..., x_N} tends to a and the ML estimate is consistent.
The ML estimate for the uniform distribution shows that partial derivatives are not always needed to find ML estimates.
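A short simulation (Python sketch; a, N and the seed are arbitrary choices) makes the bias of the max-based ML estimate visible: for N i.i.d. samples from U(0, a), the expected value of the maximum is aN/(N+1), so the estimate is biased low.

```python
import numpy as np

rng = np.random.default_rng(2)
a_true, N, trials = 5.0, 10, 100_000
x = rng.uniform(0.0, a_true, size=(trials, N))

a_ml = x.max(axis=1)       # ML estimate: the largest observation

print(a_ml.mean())         # close to N/(N+1) * a = 4.545..., below a = 5
```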
14 ML estimates for Gaussian distributed model errors
Most often we assume an independent and identically distributed Gaussian model for the errors e_t in models involving a vector of parameters θ (see the AR models later in the lecture):
y_t = u_t^T θ + e_t
Then maximizing the likelihood
max_θ log L(θ | x_1, x_2, ..., x_N) = max_θ [−(1/(2σ²)) Σ_{i=1}^N (y_i − u_i^T θ)² − (N/2) log(2π) − N log σ]
becomes equivalent to minimizing the sum of squares
min_θ Σ_{i=1}^N (y_i − u_i^T θ)²
which explains why the LS estimator has very good statistical properties.
15 Simplest example: the AR(1) model
Consider a first order autoregressive model
y_n = a y_{n−1} + e_n
where e_n is a white noise sequence of zero mean and variance σ_e², initialized at n = 0 with y_0 = 0.
Using the time delay operator q^{−1} we can obtain the dependence of y_n on the inputs e_1, ..., e_n:
y_n = a q^{−1} y_n + e_n  ⇒  y_n = e_n / (1 − a q^{−1}) = (1 + a q^{−1} + a² q^{−2} + a³ q^{−3} + ...) e_n
y_n = e_n + a e_{n−1} + a² e_{n−2} + a³ e_{n−3} + ... + a^{n−1} e_1
The cross-correlation E[y_n e_{n+k}] is zero for all k > 0:
E[y_n e_{n+k}] = E[e_{n+k} e_n] + a E[e_{n+k} e_{n−1}] + a² E[e_{n+k} e_{n−2}] + ... + a^{n−1} E[e_{n+k} e_1] = 0
since e_{n+k} is not correlated with any e_j with j ≤ n.
Given y_0 = 0 and e_1, ..., e_N, the values of y_1, ..., y_N can be computed. Given the same y_0 = 0 and another sequence e_1, ..., e_N, another sequence y_1, ..., y_N results. Hence the sequence y_1, ..., y_N is random and depends on the realization of e_1, ..., e_N.
16 AR(1) model
[Figure: input white noise e_t versus time t; output AR(1) signal y_t versus time t; autocorrelation function r(τ) versus lag τ, estimated r and true r.]
17 AR(1) model
[Figure: input white noise e_t versus time t; output AR(1) signal y_t versus time t, for a = 0.98; autocorrelation function r(τ) versus lag τ, estimated r and true r.]
18 Stability and stationarity of the AR(1) model
Property: the mean of y_n is zero, for all n. We have y_1 = e_1 and the mean E[y_1] = 0. Then E[y_2] = a E[y_1] + E[e_2] = 0 and by induction we see that E[y_n] = 0.
We want to evaluate the variance of y_n, i.e., the autocorrelation function r_yy(n, n) = E[y_n²]. We have
r_yy(n, n) = E[y_n²] = E[(a y_{n−1} + e_n)²] = a² E[y_{n−1}²] + 2a E[y_{n−1} e_n] + E[e_n²] = a² r_yy(n−1, n−1) + σ_e²
Iterating the equation r_yy(n, n) = a² r_yy(n−1, n−1) + σ_e²:
r_yy(n, n) = a⁴ r_yy(n−2, n−2) + (1 + a²) σ_e² = ... = a^{2n} r_yy(0, 0) + (1 + a² + a⁴ + ... + a^{2n−2}) σ_e² = ((1 − a^{2n})/(1 − a²)) σ_e²
using r_yy(0, 0) = 0.
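The variance recursion can be verified by Monte Carlo simulation; a Python sketch (a, σ_e, the horizon and the seed are arbitrary choices), propagating many independent realizations from y_0 = 0 and comparing the empirical variance with ((1 − a^{2n})/(1 − a²)) σ_e²:

```python
import numpy as np

rng = np.random.default_rng(3)
a, sigma_e, n_steps, trials = 0.9, 1.0, 50, 200_000

y = np.zeros(trials)                 # y_0 = 0 in every realization
for n in range(1, n_steps + 1):
    y = a * y + sigma_e * rng.standard_normal(trials)

var_sim = y.var()
var_theory = (1 - a ** (2 * n_steps)) / (1 - a ** 2) * sigma_e ** 2
print(var_sim, var_theory)     # both close to 1/(1 - a^2) = 5.26...
```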
19 Stability and stationarity of the AR(1) model
In general the autocorrelations depend on n,
r_yy(n, n) = ((1 − a^{2n})/(1 − a²)) σ_e²
and hence the AR(1) process is not stationary.
If the parameter a is smaller than 1 in modulus, after an initial transient regime the autocorrelation values converge to the value
r_yy(n, n) → σ_e²/(1 − a²)
and the process is asymptotically stationary (in practice the transient regime can be short).
If the parameter a is larger than 1 in modulus, the autocorrelations diverge and the AR(1) process is non-stationary.
In general, for an AR(k) process y_n = a_1 y_{n−1} + ... + a_k y_{n−k} + e_n, the condition of stability is that all the roots of the polynomial A(z) = 1 − a_1 z^{−1} − a_2 z^{−2} − ... − a_k z^{−k} are inside the unit circle. For the AR(1) this requires that the root of A(z) = 1 − a_1 z^{−1}, which is z = a_1 = a, be smaller than 1 in modulus.
Exercise: if one initializes the AR(1) randomly, with y_0 of zero mean and variance σ_e²/(1 − a²), then the AR(1) model is wide sense stationary, i.e., all the autocorrelation values r_yy(i, j) = E[y_i y_j] depend only on the difference i − j.
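The stability check on the roots of A(z) is a one-liner; a Python sketch for a hypothetical AR(2) model (the coefficient values are illustrative assumptions). With A(z) = 1 − a_1 z^{−1} − a_2 z^{−2}, the roots of A(z) = 0 in the variable z are the roots of z² − a_1 z − a_2.

```python
import numpy as np

a1, a2 = 1.2, -0.5      # hypothetical AR(2) coefficients
# roots of z^2 - a1 z - a2 = 0, i.e. the roots of A(z) in the variable z
roots = np.roots([1.0, -a1, -a2])

print(np.abs(roots))    # both moduli below 1: the AR(2) model is stable
```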
20 AR, MA and ARMA models
The general ARMA(n,m) model is described by the difference equation
y_t = a_1 y_{t−1} + a_2 y_{t−2} + ... + a_n y_{t−n} + e_t + b_1 e_{t−1} + ... + b_m e_{t−m}
Applying the Z transform to both sides and assuming zero initial conditions:
Y(z) = a_1 z^{−1} Y(z) + a_2 z^{−2} Y(z) + ... + a_n z^{−n} Y(z) + E(z) + b_1 z^{−1} E(z) + ... + b_m z^{−m} E(z)
(1 − a_1 z^{−1} − a_2 z^{−2} − ... − a_n z^{−n}) Y(z) = (1 + b_1 z^{−1} + ... + b_m z^{−m}) E(z)
Y(z) = ((1 + b_1 z^{−1} + ... + b_m z^{−m}) / (1 − a_1 z^{−1} − a_2 z^{−2} − ... − a_n z^{−n})) E(z) = (B(z)/A(z)) E(z)
The model written in the time domain using the time shift operator is
(1 − a_1 q^{−1} − a_2 q^{−2} − ... − a_n q^{−n}) y_t = (1 + b_1 q^{−1} + ... + b_m q^{−m}) e_t
y_t = ((1 + b_1 q^{−1} + ... + b_m q^{−m}) / (1 − a_1 q^{−1} − a_2 q^{−2} − ... − a_n q^{−n})) e_t
Block diagram: E(z) → H(z) = B(z)/A(z) → Y(z)
21 AR, MA models
The autoregressive AR(n) model is described by the difference equation
y_t = a_1 y_{t−1} + a_2 y_{t−2} + ... + a_n y_{t−n} + e_t
or the transfer function H(z) = 1/A(z).
The moving average MA(m) model is described by the difference equation
y_t = e_t + b_1 e_{t−1} + ... + b_m e_{t−m}
or the transfer function H(z) = B(z).
Equivalence between models:
- An MA(m) model is equivalent to an AR(∞) model
- An AR(n) model is equivalent to an MA(∞) model
- An ARMA(n,m) model has an equivalent AR(∞) model and an equivalent MA(∞) model
22 Equivalence of an ARMA model to an AR(∞) model
Example: consider an ARMA(1,1) model with B(z) = 1 − 0.8 z^{−1},
H(z) = B(z)/A(z) = (1 − 0.8 z^{−1})/A(z)
1/H(z) = A(z) (1 − 0.8 z^{−1})^{−1} = A(z)(1 + 0.8 z^{−1} + 0.8² z^{−2} + 0.8³ z^{−3} + ...) = C(z)
so C(q^{−1}) y_t = e_t is an AR(∞) model.
The infinitely long AR(∞) model can be truncated to a finite number of coefficients, and then we have an approximate equivalence between an ARMA(n,m) model and a very long AR(L) model. The long model has L coefficients, many more than the n + m coefficients of the ARMA model. If two models are equivalent, in principle we prefer the model having fewer parameters. However, the parameters of the AR model are easier to estimate (we already discussed the Levinson-Durbin and LS methods), while for the ARMA model parameter estimation is a nonlinear optimization problem, difficult to solve.
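The AR(∞) coefficients can be computed by polynomial long division of A(z) by B(z); below is a Python sketch. The coefficient values are illustrative assumptions (the AR part A(z) = 1 − 0.5 z^{−1} is a made-up choice; the MA part B(z) = 1 − 0.8 z^{−1} matches the slide's example).

```python
import numpy as np

def arma_to_ar(a_coefs, b_coefs, L):
    """First L+1 coefficients of C(z) = A(z)/B(z) (truncated AR model).

    Conventions: A(z) = 1 - sum_i a_i z^-i, B(z) = 1 + sum_i b_i z^-i.
    """
    A = np.r_[1.0, -np.asarray(a_coefs, dtype=float)]
    B = np.r_[1.0, np.asarray(b_coefs, dtype=float)]
    c = np.zeros(L + 1)
    for k in range(L + 1):
        s = A[k] if k < len(A) else 0.0
        # long division: c_k = A_k - sum_{j=1..k} B_j c_{k-j}
        for j in range(1, min(k, len(B) - 1) + 1):
            s -= B[j] * c[k - j]
        c[k] = s
    return c

# hypothetical ARMA(1,1): A(z) = 1 - 0.5 z^-1, B(z) = 1 - 0.8 z^-1
c = arma_to_ar([0.5], [-0.8], L=30)

# sanity check: B(z) C(z) must reproduce A(z) up to the truncation order
recon = np.convolve(np.r_[1.0, -0.8], c)[:31]
err = np.max(np.abs(recon - np.r_[1.0, -0.5, np.zeros(29)]))
print(c[:4], err)
```

The truncation error is controlled by the decay of the c_k, here geometric with ratio 0.8, which is why L must be chosen large when the MA zeros are close to the unit circle.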
23 Estimation of AR models
The autoregressive AR(n) model is described by the difference equation
y_t = a_1 y_{t−1} + a_2 y_{t−2} + ... + a_n y_{t−n} + e_t
which is a prediction model of order n. The Levinson-Durbin algorithm can be used to estimate its parameters, for all possible orders n = 1, ..., n_max.
The LS method can also be applied to estimate the AR parameters. Denote the unknowns θ = [a_1 a_2 ... a_n]^T. Denoting the data vector u_t = [y_{t−1} y_{t−2} ... y_{t−n}]^T, and since the desired values are d_t = y_t, the following system of equations results:
y_1 = u_1^T θ + e_1
...
y_N = u_N^T θ + e_N
The system can be solved in the LS sense, and has the solution
θ̂ = (Σ_{i=1}^N u_i u_i^T)^{−1} Σ_{i=1}^N u_i d_i   (1)
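The LS solution (1) amounts to one call to a linear solver once the regressor matrix is assembled; a Python sketch (the function name, the simulated AR(2) coefficients and the sizes are illustrative choices):

```python
import numpy as np

def fit_ar_ls(y, n):
    """Least-squares estimate of the AR(n) parameters from data y."""
    N = len(y)
    # regressor rows u_t = [y_{t-1}, ..., y_{t-n}] for t = n, ..., N-1
    U = np.column_stack([y[n - k - 1 : N - k - 1] for k in range(n)])
    d = y[n:]                              # desired values d_t = y_t
    theta, *_ = np.linalg.lstsq(U, d, rcond=None)
    return theta

# simulate an AR(2) process and recover its parameters
rng = np.random.default_rng(4)
a = np.array([1.2, -0.5])
N = 20_000
y = np.zeros(N)
for t in range(2, N):
    y[t] = a[0] * y[t - 1] + a[1] * y[t - 2] + rng.standard_normal()

theta_hat = fit_ar_ls(y, 2)
print(theta_hat)      # close to [1.2, -0.5]
```

`np.linalg.lstsq` solves the overdetermined system directly; forming (Σ u_i u_i^T)^{−1} Σ u_i d_i explicitly gives the same result but is numerically less robust.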
24 Estimation of MA models
The moving average MA(m) model is described by the difference equation
y_t = e_t + b_1 e_{t−1} + ... + b_m e_{t−m},  H(z) = B(z)
Its autocorrelation function is
r_y(k) = E[y_t y_{t−k}] = E[(e_t + b_1 e_{t−1} + ... + b_m e_{t−m})(e_{t−k} + b_1 e_{t−k−1} + ... + b_m e_{t−k−m})]
If |k| > m then r_y(k) = 0. If 0 ≤ |k| ≤ m then, with b_0 = 1,
r_y(k) = σ_e² Σ_{i=0}^{m−|k|} b_i b_{i+|k|}
Collecting all equations together:
r_y(0) = σ_e² (1 + b_1² + b_2² + ... + b_m²)
r_y(1) = σ_e² (b_1 + b_1 b_2 + b_2 b_3 + ... + b_{m−1} b_m)
...
r_y(m−1) = σ_e² (b_{m−1} + b_1 b_m)
r_y(m) = σ_e² b_m
which is a nonlinear system of m + 1 equations with the m + 1 unknowns σ_e², b_1, ..., b_m. Nonlinear solvers may have trouble due to local minima in the criterion. More often used is the two-stage least squares method, applied also to the more general ARMA model, described next.
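The autocorrelation formulas can be checked on a simulated MA(2) process; a Python sketch with illustrative coefficients b_1 = 0.6, b_2 = −0.3 and σ_e² = 1 (all assumed values):

```python
import numpy as np

rng = np.random.default_rng(7)
b = np.array([0.6, -0.3])       # hypothetical MA(2) coefficients
N = 500_000
e = rng.standard_normal(N)      # white noise, sigma_e^2 = 1

y = e.copy()
y[1:] += b[0] * e[:-1]
y[2:] += b[1] * e[:-2]

# theoretical r_y(k) = sigma_e^2 * sum_i b_i b_{i+k}, with b_0 = 1
bb = np.r_[1.0, b]
r_theory = [float(bb[: 3 - k] @ bb[k:]) for k in range(3)]
r_hat = [float(np.mean(y[k:] * y[: N - k])) for k in range(3)]
r3_hat = float(np.mean(y[3:] * y[:-3]))   # vanishes for lags k > m = 2

print(r_theory, r_hat, r3_hat)
```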
25 Estimation of ARMA models: Two-stage LS method
The general ARMA(n,m) model is described by the difference equation
y_t = a_1 y_{t−1} + a_2 y_{t−2} + ... + a_n y_{t−n} + e_t + b_1 e_{t−1} + ... + b_m e_{t−m}
An ARMA(n,m) model has an equivalent AR(∞) model, B(z)/A(z) = 1/C(z), where C(z) = 1 + c_1 z^{−1} + c_2 z^{−2} + ... has an infinite number of coefficients.
We first estimate a truncated form of C(z), where only the coefficients c_1, c_2, c_3, ..., c_L are retained, with L very large but L < N, e.g., L = N/4. We use the LS method for estimation (this is the first LS stage) in the model
y_t = c_1 y_{t−1} + c_2 y_{t−2} + ... + c_L y_{t−L} + ε_t
using the values y_1, ..., y_N, and we obtain the estimated coefficients ĉ_1, ĉ_2, ĉ_3, ..., ĉ_L.
From the estimated coefficients ĉ_1, ĉ_2, ĉ_3, ..., ĉ_L we compute the estimated residuals of the model, ε̂_1, ε̂_2, ..., ε̂_N:
ε̂_t = y_t − ĉ_1 y_{t−1} − ĉ_2 y_{t−2} − ... − ĉ_L y_{t−L}
26 Estimation of ARMA models: Two-stage LS method
We return to estimating the unknown coefficients of the ARMA(n,m) model, using now the equations, for all t = 1, ..., N,
y_t = a_1 y_{t−1} + a_2 y_{t−2} + ... + a_n y_{t−n} + ε̂_t + b_1 ε̂_{t−1} + ... + b_m ε̂_{t−m}
from which we obtain the system of equations, for all t = 1, ..., N,
y_t − ε̂_t = a_1 y_{t−1} + a_2 y_{t−2} + ... + a_n y_{t−n} + b_1 ε̂_{t−1} + ... + b_m ε̂_{t−m}
with the unknowns θ = [a_1 a_2 ... a_n b_1 b_2 ... b_m]^T. Denoting the data vector u_t = [y_{t−1} y_{t−2} ... y_{t−n} ε̂_{t−1} ε̂_{t−2} ... ε̂_{t−m}]^T and the desired values d_t = y_t − ε̂_t, the overdetermined system of equations becomes
d_1 = u_1^T θ
...
d_N = u_N^T θ
having the LS solution
θ̂ = (Σ_{i=1}^N u_i u_i^T)^{−1} Σ_{i=1}^N u_i d_i   (2)
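The two stages can be sketched in a few lines of Python; the function name, the simulated ARMA(1,1) coefficients (0.7 and 0.4) and the truncation length L are illustrative assumptions:

```python
import numpy as np

def arma_two_stage_ls(y, n, m, L):
    """Two-stage LS estimate of ARMA(n,m) parameters (sketch)."""
    N = len(y)
    # stage 1: long AR(L) model y_t = c_1 y_{t-1} + ... + c_L y_{t-L} + eps_t
    U1 = np.column_stack([y[L - k - 1 : N - k - 1] for k in range(L)])
    c, *_ = np.linalg.lstsq(U1, y[L:], rcond=None)
    eps = np.zeros(N)
    eps[L:] = y[L:] - U1 @ c            # estimated residuals eps_t
    # stage 2: regress d_t = y_t - eps_t on past outputs and past residuals
    start = L + max(n, m)
    Uy = np.column_stack([y[start - k - 1 : N - k - 1] for k in range(n)])
    Ue = np.column_stack([eps[start - k - 1 : N - k - 1] for k in range(m)])
    theta, *_ = np.linalg.lstsq(np.hstack([Uy, Ue]),
                                y[start:] - eps[start:], rcond=None)
    return theta[:n], theta[n:]

# simulate a hypothetical ARMA(1,1): y_t = 0.7 y_{t-1} + e_t + 0.4 e_{t-1}
rng = np.random.default_rng(5)
N = 50_000
e = rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = 0.7 * y[t - 1] + e[t] + 0.4 * e[t - 1]

a_hat, b_hat = arma_two_stage_ls(y, 1, 1, L=50)
print(a_hat, b_hat)      # close to [0.7] and [0.4]
```

Both stages are ordinary linear LS problems, which is exactly what makes this approach preferable to a general nonlinear solver.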
27 Selecting the order of the model
A set of candidate orders is given, most usually k ∈ {1, 2, ..., K}. An estimation method is used to obtain, for each order k, the estimates of the parameters θ̂_k and the power σ̂_k² of the residuals computed with the estimated vector θ̂_k.
For each candidate order k a certain criterion is computed, and the selected order is the order k minimizing the criterion.
There are many criteria in use, derived from various considerations, most often information-theoretic ones. None of them is good in all situations. A good policy is to check the optimal value given by each criterion and take a voting decision, taking into account the dangers of overfitting or underfitting in the current application.
All criteria penalize a too large number of parameters and contain a term which rewards very good fits. The selection is a trade-off between overfitting and underfitting. Overfitting occurs when we have a very good fit to the data, obtained for very large k, but in which case the noise in the data is modeled as well; such a model will not perform well on data different from that to which it was fitted. Underfitting occurs when we have too few parameters, and the fit to the data is poor.
28 Criteria for order selection
Final prediction error (FPE) criterion:
FPE(k) = ((N + k)/(N − k)) σ̂_k²
FPE tends to underestimate the order.
Akaike Information Criterion (AIC):
AIC(k) = N ln(σ̂_k²) + 2k
For large N, the AIC tends to overestimate the order (it does not penalize enough). The order selected by FPE never exceeds that selected by AIC, hence FPE is better for large N.
Minimum Description Length (MDL) criterion, also known as the Bayesian Information Criterion (BIC):
MDL(k) = N ln(σ̂_k²) + k ln N
For large N, MDL estimates the order correctly (it penalizes the number of parameters correctly). However, for small N the performance of AIC and MDL is similar.
There is no definitive ranking of the three methods.
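A Python sketch computing the three criteria for a range of candidate AR orders, using the LS fit for each order (the simulated AR(2) process, the sizes and the seed are illustrative choices):

```python
import numpy as np

def order_criteria(y, k_max):
    """FPE, AIC and MDL for AR model orders k = 1, ..., k_max."""
    N = len(y)
    crit = {}
    for k in range(1, k_max + 1):
        U = np.column_stack([y[k - j - 1 : N - j - 1] for j in range(k)])
        d = y[k:]
        theta, *_ = np.linalg.lstsq(U, d, rcond=None)
        s2 = float(np.mean((d - U @ theta) ** 2))   # residual power
        crit[k] = ((N + k) / (N - k) * s2,          # FPE(k)
                   N * np.log(s2) + 2 * k,          # AIC(k)
                   N * np.log(s2) + k * np.log(N))  # MDL(k)
    return crit

# simulate an AR(2) process; the criteria typically select k = 2
rng = np.random.default_rng(6)
N = 5_000
y = np.zeros(N)
for t in range(2, N):
    y[t] = 1.2 * y[t - 1] - 0.5 * y[t - 2] + rng.standard_normal()

crit = order_criteria(y, 6)
for idx, name in enumerate(["FPE", "AIC", "MDL"]):
    print(name, "selects order", min(crit, key=lambda k: crit[k][idx]))
```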
More informationDiscrete time processes
Discrete time processes Predictions are difficult. Especially about the future Mark Twain. Florian Herzog 2013 Modeling observed data When we model observed (realized) data, we encounter usually the following
More informationMATH4427 Notebook 2 Fall Semester 2017/2018
MATH4427 Notebook 2 Fall Semester 2017/2018 prepared by Professor Jenny Baglivo c Copyright 2009-2018 by Jenny A. Baglivo. All Rights Reserved. 2 MATH4427 Notebook 2 3 2.1 Definitions and Examples...................................
More informationHT Introduction. P(X i = x i ) = e λ λ x i
MODS STATISTICS Introduction. HT 2012 Simon Myers, Department of Statistics (and The Wellcome Trust Centre for Human Genetics) myers@stats.ox.ac.uk We will be concerned with the mathematical framework
More information1 Class Organization. 2 Introduction
Time Series Analysis, Lecture 1, 2018 1 1 Class Organization Course Description Prerequisite Homework and Grading Readings and Lecture Notes Course Website: http://www.nanlifinance.org/teaching.html wechat
More informationVector Auto-Regressive Models
Vector Auto-Regressive Models Laurent Ferrara 1 1 University of Paris Nanterre M2 Oct. 2018 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions
More informationOn Moving Average Parameter Estimation
On Moving Average Parameter Estimation Niclas Sandgren and Petre Stoica Contact information: niclas.sandgren@it.uu.se, tel: +46 8 473392 Abstract Estimation of the autoregressive moving average (ARMA)
More informationVAR Models and Applications
VAR Models and Applications Laurent Ferrara 1 1 University of Paris West M2 EIPMC Oct. 2016 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions
More informationRegression Estimation Least Squares and Maximum Likelihood
Regression Estimation Least Squares and Maximum Likelihood Dr. Frank Wood Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 3, Slide 1 Least Squares Max(min)imization Function to minimize
More informationAutoregressive and Moving-Average Models
Chapter 3 Autoregressive and Moving-Average Models 3.1 Introduction Let y be a random variable. We consider the elements of an observed time series {y 0,y 1,y2,...,y t } as being realizations of this randoms
More informationLecture 7: Model Building Bus 41910, Time Series Analysis, Mr. R. Tsay
Lecture 7: Model Building Bus 41910, Time Series Analysis, Mr R Tsay An effective procedure for building empirical time series models is the Box-Jenkins approach, which consists of three stages: model
More information6.1 Variational representation of f-divergences
ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 6: Variational representation, HCR and CR lower bounds Lecturer: Yihong Wu Scribe: Georgios Rovatsos, Feb 11, 2016
More informationRegression I: Mean Squared Error and Measuring Quality of Fit
Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving
More informationReview. December 4 th, Review
December 4 th, 2017 Att. Final exam: Course evaluation Friday, 12/14/2018, 10:30am 12:30pm Gore Hall 115 Overview Week 2 Week 4 Week 7 Week 10 Week 12 Chapter 6: Statistics and Sampling Distributions Chapter
More informationSelection Criteria Based on Monte Carlo Simulation and Cross Validation in Mixed Models
Selection Criteria Based on Monte Carlo Simulation and Cross Validation in Mixed Models Junfeng Shang Bowling Green State University, USA Abstract In the mixed modeling framework, Monte Carlo simulation
More informationCOS513 LECTURE 8 STATISTICAL CONCEPTS
COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions
More information6.3 Forecasting ARMA processes
6.3. FORECASTING ARMA PROCESSES 123 6.3 Forecasting ARMA processes The purpose of forecasting is to predict future values of a TS based on the data collected to the present. In this section we will discuss
More informationRegression, Ridge Regression, Lasso
Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.
More informationMax. Likelihood Estimation. Outline. Econometrics II. Ricardo Mora. Notes. Notes
Maximum Likelihood Estimation Econometrics II Department of Economics Universidad Carlos III de Madrid Máster Universitario en Desarrollo y Crecimiento Económico Outline 1 3 4 General Approaches to Parameter
More informationAutomatic Autocorrelation and Spectral Analysis
Piet M.T. Broersen Automatic Autocorrelation and Spectral Analysis With 104 Figures Sprin ger 1 Introduction 1 1.1 Time Series Problems 1 2 Basic Concepts 11 2.1 Random Variables 11 2.2 Normal Distribution
More informationCovariance function estimation in Gaussian process regression
Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian
More informationReview Session: Econometrics - CLEFIN (20192)
Review Session: Econometrics - CLEFIN (20192) Part II: Univariate time series analysis Daniele Bianchi March 20, 2013 Fundamentals Stationarity A time series is a sequence of random variables x t, t =
More informationA SARIMAX coupled modelling applied to individual load curves intraday forecasting
A SARIMAX coupled modelling applied to individual load curves intraday forecasting Frédéric Proïa Workshop EDF Institut Henri Poincaré - Paris 05 avril 2012 INRIA Bordeaux Sud-Ouest Institut de Mathématiques
More informationMultivariate Regression
Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the
More information7. MULTIVARATE STATIONARY PROCESSES
7. MULTIVARATE STATIONARY PROCESSES 1 1 Some Preliminary Definitions and Concepts Random Vector: A vector X = (X 1,..., X n ) whose components are scalar-valued random variables on the same probability
More informationTutorial lecture 2: System identification
Tutorial lecture 2: System identification Data driven modeling: Find a good model from noisy data. Model class: Set of all a priori feasible candidate systems Identification procedure: Attach a system
More informationMA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7
MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 1 Random Vectors Let a 0 and y be n 1 vectors, and let A be an n n matrix. Here, a 0 and A are non-random, whereas y is
More informationCircle a single answer for each multiple choice question. Your choice should be made clearly.
TEST #1 STA 4853 March 4, 215 Name: Please read the following directions. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directions This exam is closed book and closed notes. There are 31 questions. Circle
More informationTesting Restrictions and Comparing Models
Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by
More informationStatistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation
Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence
More informationMachine Learning Basics: Maximum Likelihood Estimation
Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning
More informationVariations. ECE 6540, Lecture 10 Maximum Likelihood Estimation
Variations ECE 6540, Lecture 10 Last Time BLUE (Best Linear Unbiased Estimator) Formulation Advantages Disadvantages 2 The BLUE A simplification Assume the estimator is a linear system For a single parameter
More informationMaximum Likelihood, Logistic Regression, and Stochastic Gradient Training
Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions
More information10. Time series regression and forecasting
10. Time series regression and forecasting Key feature of this section: Analysis of data on a single entity observed at multiple points in time (time series data) Typical research questions: What is the
More informationARIMA Modelling and Forecasting
ARIMA Modelling and Forecasting Economic time series often appear nonstationary, because of trends, seasonal patterns, cycles, etc. However, the differences may appear stationary. Δx t x t x t 1 (first
More informationTopic 4 Unit Roots. Gerald P. Dwyer. February Clemson University
Topic 4 Unit Roots Gerald P. Dwyer Clemson University February 2016 Outline 1 Unit Roots Introduction Trend and Difference Stationary Autocorrelations of Series That Have Deterministic or Stochastic Trends
More informationLinear models. Chapter Overview. Linear process: A process {X n } is a linear process if it has the representation.
Chapter 2 Linear models 2.1 Overview Linear process: A process {X n } is a linear process if it has the representation X n = b j ɛ n j j=0 for all n, where ɛ n N(0, σ 2 ) (Gaussian distributed with zero
More informationProblem Set 2: Box-Jenkins methodology
Problem Set : Box-Jenkins methodology 1) For an AR1) process we have: γ0) = σ ε 1 φ σ ε γ0) = 1 φ Hence, For a MA1) process, p lim R = φ γ0) = 1 + θ )σ ε σ ε 1 = γ0) 1 + θ Therefore, p lim R = 1 1 1 +
More informationModel selection using penalty function criteria
Model selection using penalty function criteria Laimonis Kavalieris University of Otago Dunedin, New Zealand Econometrics, Time Series Analysis, and Systems Theory Wien, June 18 20 Outline Classes of models.
More informationAlfredo A. Romero * College of William and Mary
A Note on the Use of in Model Selection Alfredo A. Romero * College of William and Mary College of William and Mary Department of Economics Working Paper Number 6 October 007 * Alfredo A. Romero is a Visiting
More informationUnit Root and Cointegration
Unit Root and Cointegration Carlos Hurtado Department of Economics University of Illinois at Urbana-Champaign hrtdmrt@illinois.edu Oct 7th, 016 C. Hurtado (UIUC - Economics) Applied Econometrics On the
More informationEstimation and Model Selection in Mixed Effects Models Part I. Adeline Samson 1
Estimation and Model Selection in Mixed Effects Models Part I Adeline Samson 1 1 University Paris Descartes Summer school 2009 - Lipari, Italy These slides are based on Marc Lavielle s slides Outline 1
More information5: MULTIVARATE STATIONARY PROCESSES
5: MULTIVARATE STATIONARY PROCESSES 1 1 Some Preliminary Definitions and Concepts Random Vector: A vector X = (X 1,..., X n ) whose components are scalarvalued random variables on the same probability
More informationDS-GA 1002 Lecture notes 11 Fall Bayesian statistics
DS-GA 100 Lecture notes 11 Fall 016 Bayesian statistics In the frequentist paradigm we model the data as realizations from a distribution that depends on deterministic parameters. In contrast, in Bayesian
More informationStatistics and Econometrics I
Statistics and Econometrics I Point Estimation Shiu-Sheng Chen Department of Economics National Taiwan University September 13, 2016 Shiu-Sheng Chen (NTU Econ) Statistics and Econometrics I September 13,
More informationCh. 14 Stationary ARMA Process
Ch. 14 Stationary ARMA Process A general linear stochastic model is described that suppose a time series to be generated by a linear aggregation of random shock. For practical representation it is desirable
More informationSystem Identification, Lecture 4
System Identification, Lecture 4 Kristiaan Pelckmans (IT/UU, 2338) Course code: 1RT880, Report code: 61800 - Spring 2012 F, FRI Uppsala University, Information Technology 30 Januari 2012 SI-2012 K. Pelckmans
More information