Introduction to Time Series Analysis. Lecture 12.
Peter Bartlett

Last lecture:
1. Parameter estimation
2. Maximum likelihood estimator
3. Yule-Walker estimation
Introduction to Time Series Analysis. Lecture 12.
1. Review: Yule-Walker estimators
2. Yule-Walker example
3. Efficiency
4. Maximum likelihood estimation
5. Large-sample distribution of MLE
Review: Yule-Walker estimation

Method of moments: we choose the parameters for which the moments of the model are equal to the empirical moments. In this case, we choose $\phi$ so that $\gamma(h) = \hat\gamma(h)$ for $h = 0, \ldots, p$.

Yule-Walker equations for $\hat\phi$:
\[
\hat\Gamma_p \hat\phi = \hat\gamma_p, \qquad \hat\sigma^2 = \hat\gamma(0) - \hat\phi' \hat\gamma_p.
\]
These are the forecasting equations, so we can solve them with the Durbin-Levinson algorithm.
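As a concrete illustration (not part of the original slides), here is a minimal numpy sketch of Yule-Walker estimation via the Durbin-Levinson recursion. The function names are ours, and the series is demeaned internally:

```python
import numpy as np

def sample_acvf(x, max_lag):
    """Sample autocovariances hat-gamma(0), ..., hat-gamma(max_lag)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    return np.array([x[: n - h] @ x[h:] / n for h in range(max_lag + 1)])

def yule_walker_dl(x, p):
    """Solve hat-Gamma_p hat-phi = hat-gamma_p by the Durbin-Levinson recursion."""
    g = sample_acvf(x, p)
    phi = np.zeros(p)
    v = g[0]                                  # one-step prediction error variance
    for k in range(1, p + 1):
        # reflection coefficient = sample PACF at lag k
        kappa = (g[k] - phi[: k - 1] @ g[1:k][::-1]) / v
        phi[: k - 1] = phi[: k - 1] - kappa * phi[: k - 1][::-1]
        phi[k - 1] = kappa
        v = v * (1 - kappa ** 2)
    return phi, v                             # (hat-phi, hat-sigma^2)
```

For example, `phi, sigma2 = yule_walker_dl(x, 2)` fits an AR(2); the intermediate `kappa` values are the sample PACF, which is useful for the order-selection test below.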
Review: Confidence intervals for Yule-Walker estimation

If $\{X_t\}$ is an AR(p) process,
\[
\hat\phi \sim AN\left(\phi, \frac{\sigma^2}{n} \Gamma_p^{-1}\right), \qquad
\hat\sigma^2 \stackrel{P}{\to} \sigma^2,
\]
and $\hat\phi_{hh} \sim AN\left(0, \frac{1}{n}\right)$ for $h > p$.

Thus, we can use the sample PACF to test for AR order, and we can calculate approximate confidence intervals for the parameters $\phi$.
Review: Confidence intervals for Yule-Walker estimation

If $\{X_t\}$ is an AR(p) process and $n$ is large, $\sqrt{n}(\hat\phi_p - \phi_p)$ is approximately $N(0, \hat\sigma^2 \hat\Gamma_p^{-1})$, so with probability $1 - \alpha$, $\phi_{pj}$ is in the interval
\[
\hat\phi_{pj} \pm \Phi_{1-\alpha/2} \, \frac{\hat\sigma}{\sqrt{n}} \left(\hat\Gamma_p^{-1}\right)_{jj}^{1/2},
\]
where $\Phi_{1-\alpha/2}$ is the $1 - \alpha/2$ quantile of the standard normal.
Review: Confidence intervals for Yule-Walker estimation

With probability $1 - \alpha$, $\phi_p$ is in the ellipsoid
\[
\left\{ \phi \in \mathbb{R}^p : (\hat\phi_p - \phi)' \hat\Gamma_p (\hat\phi_p - \phi) \le \frac{\hat\sigma^2}{n} \chi^2_{1-\alpha}(p) \right\},
\]
where $\chi^2_{1-\alpha}(p)$ is the $(1-\alpha)$ quantile of the chi-squared distribution with $p$ degrees of freedom.
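A sketch of how these intervals and the ellipsoid test could be computed, reusing the `sample_acvf` and `yule_walker_dl` helpers from the sketch above; scipy is assumed available for the normal and chi-squared quantiles:

```python
import numpy as np
from scipy import stats

def yw_confidence_intervals(x, p, alpha=0.05):
    """Approximate marginal (1 - alpha) intervals for the Yule-Walker estimates."""
    n = len(x)
    g = sample_acvf(x, p)
    phi, sigma2 = yule_walker_dl(x, p)
    Gamma_p = np.array([[g[abs(i - j)] for j in range(p)] for i in range(p)])
    z = stats.norm.ppf(1 - alpha / 2)                    # Phi_{1 - alpha/2}
    half = z * np.sqrt(sigma2 * np.diag(np.linalg.inv(Gamma_p)) / n)
    return phi - half, phi + half

def in_confidence_ellipsoid(phi0, phi, Gamma_p, sigma2, n, alpha=0.05):
    """Is phi0 inside the joint (1 - alpha) confidence ellipsoid?"""
    d = phi - phi0
    return d @ Gamma_p @ d <= sigma2 * stats.chi2.ppf(1 - alpha, len(phi)) / n
```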
Yule-Walker estimation: Example

[Figure: Sample ACF of the depth of Lake Huron, 1875-1972.]
Yule-Walker estimation: Example

[Figure: Sample PACF of the depth of Lake Huron, 1875-1972.]
Yule-Walker estimation: Example

\[
\hat\Gamma_2 = \begin{pmatrix} 1.7379 & 1.4458 \\ 1.4458 & 1.7379 \end{pmatrix}, \qquad
\hat\gamma_2 = \begin{pmatrix} 1.4458 \\ 1.0600 \end{pmatrix},
\]
\[
\hat\phi_2 = \hat\Gamma_2^{-1} \hat\gamma_2 = \begin{pmatrix} 1.0538 \\ -0.2668 \end{pmatrix}, \qquad
\hat\sigma^2_w = \hat\gamma(0) - \hat\phi_2' \hat\gamma_2 = 0.4971.
\]
Yule-Walker estimation: Example

Confidence intervals (95%, with $n = 98$ observations):
\[
\hat\phi_1 \pm \Phi_{1-\alpha/2} \left(\hat\sigma^2_w \left(\hat\Gamma_2^{-1}\right)_{11} / n\right)^{1/2} = 1.0538 \pm 0.1908,
\]
\[
\hat\phi_2 \pm \Phi_{1-\alpha/2} \left(\hat\sigma^2_w \left(\hat\Gamma_2^{-1}\right)_{22} / n\right)^{1/2} = -0.2668 \pm 0.1908.
\]
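These numbers can be checked directly; a short numpy sketch (ours), with the sample autocovariances taken from the slide and $n = 98$ for the 1875-1972 series:

```python
import numpy as np
from scipy import stats

# Sample autocovariances from the slides (Lake Huron, n = 98).
g0, g1, g2 = 1.7379, 1.4458, 1.0600
n = 98

Gamma2 = np.array([[g0, g1], [g1, g0]])
gamma2 = np.array([g1, g2])

phi = np.linalg.solve(Gamma2, gamma2)                   # [ 1.0538, -0.2668]
sigma2 = g0 - phi @ gamma2                              # 0.4971
z = stats.norm.ppf(0.975)
half = z * np.sqrt(sigma2 * np.diag(np.linalg.inv(Gamma2)) / n)   # [0.1908, 0.1908]
print(phi, sigma2, half)
```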
Yule-Walker estimation

It is also possible to define analogous estimators for ARMA(p,q) models with $q > 0$:
\[
\hat\gamma(j) - \phi_1 \hat\gamma(j-1) - \cdots - \phi_p \hat\gamma(j-p) = \sigma^2 \sum_{i=j}^{q} \theta_i \psi_{i-j},
\]
where $\psi(B) = \theta(B)/\phi(B)$. Because of the dependence on the $\psi_i$, these equations are nonlinear in the $\phi_i, \theta_i$: there might be no solution, or the solution might not be unique. Also, the asymptotic efficiency of this estimator is poor: it has unnecessarily high variance.
Efficiency of estimators

Let $\hat\phi^{(1)}$ and $\hat\phi^{(2)}$ be two estimators. Suppose that
$\hat\phi^{(1)} \sim AN(\phi, \sigma_1^2)$ and $\hat\phi^{(2)} \sim AN(\phi, \sigma_2^2)$.
The asymptotic efficiency of $\hat\phi^{(1)}$ relative to $\hat\phi^{(2)}$ is
\[
e\left(\phi, \hat\phi^{(1)}, \hat\phi^{(2)}\right) = \frac{\sigma_2^2}{\sigma_1^2}.
\]
If $e(\phi, \hat\phi^{(1)}, \hat\phi^{(2)}) \le 1$ for all $\phi$, we say that $\hat\phi^{(2)}$ is a more efficient estimator of $\phi$ than $\hat\phi^{(1)}$.

For example, for an AR(p) process, the moment estimator and the maximum likelihood estimator are equally efficient. For an MA(q) process, the moment estimator is less efficient than the innovations estimator, which is in turn less efficient than the MLE.
Yule-Walker estimation: Example

AR(1): since $\gamma(0) = \dfrac{\sigma^2}{1 - \phi_1^2}$,
\[
\hat\phi_1 \sim AN\left(\phi_1, \frac{\sigma^2}{n} \Gamma_1^{-1}\right) = AN\left(\phi_1, \frac{1 - \phi_1^2}{n}\right).
\]

AR(2):
\[
\begin{pmatrix} \hat\phi_1 \\ \hat\phi_2 \end{pmatrix} \sim AN\left(\phi, \frac{\sigma^2}{n} \Gamma_2^{-1}\right),
\quad \text{and} \quad
\frac{\sigma^2}{n} \Gamma_2^{-1} = \frac{1}{n} \begin{pmatrix} 1 - \phi_2^2 & -\phi_1(1 + \phi_2) \\ -\phi_1(1 + \phi_2) & 1 - \phi_2^2 \end{pmatrix}.
\]
Yule-Walker estimation: Example

Suppose $\{X_t\}$ is an AR(1) process and the sample size $n$ is large. If we fit an AR(1) model, we have
\[
\mathrm{Var}(\hat\phi_1) \approx \frac{1 - \phi_1^2}{n}.
\]
If we fit a larger model, say an AR(2), to this AR(1) process, then $\phi_2 = 0$ and
\[
\mathrm{Var}(\hat\phi_1) \approx \frac{1 - \phi_2^2}{n} = \frac{1}{n} > \frac{1 - \phi_1^2}{n}.
\]
We have lost efficiency.
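A small simulation sketch (ours, not from the slides) illustrating this efficiency loss: fit AR(1) and AR(2) models to simulated AR(1) data and compare the scaled variance of $\hat\phi_1$ with the asymptotic values $1 - \phi_1^2$ and $1$:

```python
import numpy as np

rng = np.random.default_rng(0)
phi1, n, trials, burn = 0.5, 500, 2000, 100

def yw_fit(x, p):
    """Yule-Walker estimate: solve hat-Gamma_p phi = hat-gamma_p."""
    x = x - x.mean()
    m = len(x)
    g = np.array([x[: m - h] @ x[h:] / m for h in range(p + 1)])
    Gamma = np.array([[g[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(Gamma, g[1 : p + 1])

est1, est2 = [], []
for _ in range(trials):
    w = rng.standard_normal(n + burn)
    x = np.zeros(n + burn)
    for t in range(1, n + burn):              # simulate the AR(1), then drop burn-in
        x[t] = phi1 * x[t - 1] + w[t]
    x = x[burn:]
    est1.append(yw_fit(x, 1)[0])
    est2.append(yw_fit(x, 2)[0])

# n * Var(phi_hat_1): about 1 - phi1^2 = 0.75 for the AR(1) fit, about 1 for the AR(2) fit
print(n * np.var(est1), n * np.var(est2))
```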
Introduction to Time Series Analysis. Lecture 12.
1. Review: Yule-Walker estimators
2. Yule-Walker example
3. Efficiency
4. Maximum likelihood estimation
5. Large-sample distribution of MLE
Parameter estimation: Maximum likelihood estimator

One approach: assume that $\{X_t\}$ is Gaussian, that is, $\phi(B) X_t = \theta(B) W_t$, where $W_t$ is i.i.d. Gaussian. Choose $\phi_i, \theta_j$ to maximize the likelihood:
\[
L(\phi, \theta, \sigma^2) = f(X_1, \ldots, X_n),
\]
where $f$ is the joint (Gaussian) density for the given ARMA model. (cf. choosing the parameters that maximize the probability of the data.)
Maximum likelihood estimation

Suppose that $X_1, X_2, \ldots, X_n$ is drawn from a zero mean Gaussian ARMA(p,q) process. The likelihood of parameters $\phi \in \mathbb{R}^p$, $\theta \in \mathbb{R}^q$, $\sigma_w^2 \in \mathbb{R}_+$ is defined as the density of $X = (X_1, X_2, \ldots, X_n)'$ under the Gaussian model with those parameters:
\[
L(\phi, \theta, \sigma_w^2) = \frac{1}{(2\pi)^{n/2} |\Gamma_n|^{1/2}} \exp\left(-\frac{1}{2} X' \Gamma_n^{-1} X\right),
\]
where $|A|$ denotes the determinant of a matrix $A$, and $\Gamma_n$ is the variance/covariance matrix of $X$ with the given parameter values. The maximum likelihood estimator (MLE) of $\phi, \theta, \sigma_w^2$ maximizes this quantity.
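For intuition, a direct (if inefficient, $O(n^3)$) evaluation of this likelihood is straightforward whenever $\Gamma_n$ has a closed form. A sketch (ours) for the AR(1) case, where $\gamma(h) = \sigma_w^2 \phi^{|h|}/(1 - \phi^2)$:

```python
import numpy as np

def ar1_gaussian_loglik(x, phi, sigma2):
    """Exact Gaussian log-likelihood of an AR(1):
    Gamma_n[i, j] = sigma2 * phi^|i-j| / (1 - phi^2)."""
    n = len(x)
    idx = np.arange(n)
    Gamma = sigma2 * phi ** np.abs(idx[:, None] - idx[None, :]) / (1 - phi ** 2)
    sign, logdet = np.linalg.slogdet(Gamma)
    quad = x @ np.linalg.solve(Gamma, x)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)
```

The innovations form derived later in this lecture avoids forming and inverting $\Gamma_n$ entirely.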
Parameter estimation: Maximum likelihood estimator

Advantages of MLE:
- Efficient (low variance estimates).
- The Gaussian assumption is often reasonable. Even if $\{X_t\}$ is not Gaussian, the asymptotic distribution of the estimates $(\hat\phi, \hat\theta, \hat\sigma^2)$ is the same as in the Gaussian case.

Disadvantages of MLE:
- Difficult optimization problem.
- Need to choose a good starting point (other estimators are often used for this).
Preliminary parameter estimates

Yule-Walker for AR(p): regress $X_t$ onto $X_{t-1}, \ldots, X_{t-p}$; Durbin-Levinson algorithm with $\gamma$ replaced by $\hat\gamma$.

Yule-Walker for ARMA(p,q): method of moments. Not efficient.

Innovations algorithm for MA(q): with $\gamma$ replaced by $\hat\gamma$.

Hannan-Rissanen algorithm for ARMA(p,q) (sketched below):
1. Estimate a high-order AR.
2. Use it to estimate the (unobserved) noise $W_t$.
3. Regress $X_t$ onto $X_{t-1}, \ldots, X_{t-p}, \hat W_{t-1}, \ldots, \hat W_{t-q}$.
4. Regress again with improved estimates of $W_t$.
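A minimal numpy sketch (ours) of Hannan-Rissanen steps 1-3; the AR order `m` is a tuning choice, assumed at least $p$:

```python
import numpy as np

def hannan_rissanen(x, p, q, m=20):
    """Steps 1-3 of Hannan-Rissanen: long AR fit, residuals, then OLS."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Step 1: fit a long AR(m) by least squares.
    Xm = np.column_stack([x[m - j : n - j] for j in range(1, m + 1)])
    a, *_ = np.linalg.lstsq(Xm, x[m:], rcond=None)
    # Step 2: estimated noise W_hat_t = X_t - sum_j a_j X_{t-j}.
    w = np.full(n, np.nan)
    w[m:] = x[m:] - Xm @ a
    # Step 3: regress X_t on p lags of X and q lags of W_hat.
    start = m + q
    Z = np.column_stack(
        [x[start - j : n - j] for j in range(1, p + 1)]
        + [w[start - j : n - j] for j in range(1, q + 1)]
    )
    beta, *_ = np.linalg.lstsq(Z, x[start:], rcond=None)
    return beta[:p], beta[p:]                 # (phi_hat, theta_hat)
```

Step 4 would repeat the regression with noise estimates recomputed from `(phi_hat, theta_hat)`; it is omitted here for brevity.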
Recall: Maximum likelihood estimation

Suppose that $X_1, X_2, \ldots, X_n$ is drawn from a zero mean Gaussian ARMA(p,q) process. The likelihood of parameters $\phi \in \mathbb{R}^p$, $\theta \in \mathbb{R}^q$, $\sigma_w^2 \in \mathbb{R}_+$ is defined as the density of $X = (X_1, X_2, \ldots, X_n)'$ under the Gaussian model with those parameters:
\[
L(\phi, \theta, \sigma_w^2) = \frac{1}{(2\pi)^{n/2} |\Gamma_n|^{1/2}} \exp\left(-\frac{1}{2} X' \Gamma_n^{-1} X\right),
\]
where $|A|$ denotes the determinant of a matrix $A$, and $\Gamma_n$ is the variance/covariance matrix of $X$ with the given parameter values. The maximum likelihood estimator (MLE) of $\phi, \theta, \sigma_w^2$ maximizes this quantity.
Maximum likelihood estimation

We can simplify the likelihood by expressing it in terms of the innovations. Since the innovations are linear in previous and current values, we can write
\[
\underbrace{\begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix}}_{X}
= C \,
\underbrace{\begin{pmatrix} X_1 - X_1^0 \\ \vdots \\ X_n - X_n^{n-1} \end{pmatrix}}_{U},
\]
where $C$ is a lower triangular matrix with ones on the diagonal. Taking the variance/covariance of both sides shows that
\[
\Gamma_n = C D C', \qquad \text{where } D = \mathrm{diag}\left(P_1^0, \ldots, P_n^{n-1}\right).
\]
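This factorization is exactly what a Cholesky decomposition delivers: if $\Gamma_n = LL'$ with $L$ lower triangular, then $C = L\,\mathrm{diag}(L)^{-1}$ and $D = \mathrm{diag}(L)^2$. A small numerical check (ours) for an AR(1) covariance matrix:

```python
import numpy as np

# Gamma_n for an AR(1) with phi = 0.6, sigma_w^2 = 1: gamma(h) = phi^|h| / (1 - phi^2).
phi, n = 0.6, 5
idx = np.arange(n)
Gamma = phi ** np.abs(idx[:, None] - idx[None, :]) / (1 - phi ** 2)

# Unit lower-triangular factorization via Cholesky.
L = np.linalg.cholesky(Gamma)
s = np.diag(L)
C = L / s                                 # ones on the diagonal
D = np.diag(s ** 2)                       # D = diag(P_1^0, ..., P_n^{n-1})

print(np.allclose(C @ D @ C.T, Gamma))    # True
print(np.diag(D))   # P_1^0 = 1/(1 - phi^2); P_t^{t-1} = sigma_w^2 = 1 for t >= 2
```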
Maximum likelihood estimation

Thus, $|\Gamma_n| = |C|^2 \, P_1^0 \cdots P_n^{n-1} = P_1^0 \cdots P_n^{n-1}$, and
\[
X' \Gamma_n^{-1} X = U' C' \left(C'\right)^{-1} D^{-1} C^{-1} C U = U' D^{-1} U.
\]
So we can rewrite the likelihood as
\[
L(\phi, \theta, \sigma_w^2)
= \frac{1}{\left((2\pi)^n P_1^0 \cdots P_n^{n-1}\right)^{1/2}} \exp\left(-\frac{1}{2} \sum_{i=1}^{n} \left(X_i - X_i^{i-1}\right)^2 / P_i^{i-1}\right)
= \frac{1}{\left((2\pi \sigma_w^2)^n \, r_1^0 \cdots r_n^{n-1}\right)^{1/2}} \exp\left(-\frac{S(\phi, \theta)}{2 \sigma_w^2}\right),
\]
where $r_i^{i-1} = P_i^{i-1} / \sigma_w^2$ and
\[
S(\phi, \theta) = \sum_{i=1}^{n} \frac{\left(X_i - X_i^{i-1}\right)^2}{r_i^{i-1}}.
\]
Maximum likelihood estimation

The log likelihood of $\phi, \theta, \sigma_w^2$ is
\[
l(\phi, \theta, \sigma_w^2) = \log L(\phi, \theta, \sigma_w^2)
= -\frac{n}{2} \log(2\pi \sigma_w^2) - \frac{1}{2} \sum_{i=1}^{n} \log r_i^{i-1} - \frac{S(\phi, \theta)}{2 \sigma_w^2}.
\]
Differentiating with respect to $\sigma_w^2$ shows that the MLE $(\hat\phi, \hat\theta, \hat\sigma_w^2)$ satisfies
\[
\frac{n}{2 \hat\sigma_w^2} = \frac{S(\hat\phi, \hat\theta)}{2 \hat\sigma_w^4},
\qquad \text{that is,} \qquad
\hat\sigma_w^2 = \frac{S(\hat\phi, \hat\theta)}{n},
\]
and $\hat\phi, \hat\theta$ minimize
\[
\log\left(\frac{S(\hat\phi, \hat\theta)}{n}\right) + \frac{1}{n} \sum_{i=1}^{n} \log r_i^{i-1}.
\]
Summary: Maximum likelihood estimation

The MLE $(\hat\phi, \hat\theta, \hat\sigma_w^2)$ satisfies
\[
\hat\sigma_w^2 = \frac{S(\hat\phi, \hat\theta)}{n},
\]
and $\hat\phi, \hat\theta$ minimize
\[
\log\left(\frac{S(\hat\phi, \hat\theta)}{n}\right) + \frac{1}{n} \sum_{i=1}^{n} \log r_i^{i-1},
\]
where $r_i^{i-1} = P_i^{i-1} / \sigma_w^2$ and
\[
S(\phi, \theta) = \sum_{i=1}^{n} \frac{\left(X_i - X_i^{i-1}\right)^2}{r_i^{i-1}}.
\]
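For the AR(1) case this objective has a simple closed form: $X_1^0 = 0$ with $r_1^0 = 1/(1 - \phi^2)$, while $X_i^{i-1} = \phi X_{i-1}$ with $r_i^{i-1} = 1$ for $i \ge 2$. A sketch (ours) of the resulting one-dimensional minimization:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ar1_profile_objective(phi, x):
    """log(S/n) + (1/n) sum_i log r_i^{i-1} for an AR(1):
    r_1^0 = 1/(1 - phi^2) and r_i^{i-1} = 1 for i >= 2."""
    n = len(x)
    S = x[0] ** 2 * (1 - phi ** 2) + np.sum((x[1:] - phi * x[:-1]) ** 2)
    return np.log(S / n) - np.log(1 - phi ** 2) / n

def ar1_mle(x):
    """Minimize the profile objective over phi, then set sigma_hat^2 = S/n."""
    res = minimize_scalar(ar1_profile_objective, args=(x,),
                          bounds=(-0.999, 0.999), method='bounded')
    phi = res.x
    S = x[0] ** 2 * (1 - phi ** 2) + np.sum((x[1:] - phi * x[:-1]) ** 2)
    return phi, S / len(x)                    # (phi_hat, sigma_hat_w^2)
```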
Maximum likelihood estimation

Minimization is done numerically (e.g., Newton-Raphson).

Computational simplifications:
- Unconditional least squares: drop the $\log r_i^{i-1}$ terms.
- Conditional least squares: also approximate the computation of $X_i^{i-1}$ by dropping initial terms in $S$. For example, for an AR(2), all but the first two terms in $S$ depend linearly on $\phi_1, \phi_2$, so we have a least squares problem (sketched below).

The differences diminish as the sample size increases. For example, $P_t^{t-1} \to \sigma_w^2$, so $r_t^{t-1} \to 1$, and thus $n^{-1} \sum_i \log r_i^{i-1} \to 0$.
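A minimal sketch (ours) of conditional least squares for the AR(2) example: dropping the first two terms of $S$ leaves an ordinary regression of $X_t$ on $(X_{t-1}, X_{t-2})$.

```python
import numpy as np

def ar2_conditional_ls(x):
    """Conditional least squares for an AR(2): drop the first two terms of S
    and regress X_t on (X_{t-1}, X_{t-2}) for t = 3, ..., n."""
    Z = np.column_stack([x[1:-1], x[:-2]])    # (X_{t-1}, X_{t-2})
    y = x[2:]
    phi, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ phi
    return phi, resid @ resid / len(y)        # (phi_hat, sigma_hat_w^2)
```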
Maximum likelihood estimation: Confidence intervals

For an ARMA(p,q) process, the MLE and the unconditional/conditional least squares estimators satisfy
\[
\begin{pmatrix} \hat\phi - \phi \\ \hat\theta - \theta \end{pmatrix}
\sim AN\left(0, \; \frac{\sigma_w^2}{n}
\begin{pmatrix} \Gamma_{\phi\phi} & \Gamma_{\phi\theta} \\ \Gamma_{\theta\phi} & \Gamma_{\theta\theta} \end{pmatrix}^{-1}\right),
\qquad \text{where} \qquad
\begin{pmatrix} \Gamma_{\phi\phi} & \Gamma_{\phi\theta} \\ \Gamma_{\theta\phi} & \Gamma_{\theta\theta} \end{pmatrix}
= \mathrm{Cov}\big((X, Y), (X, Y)\big),
\]
with $X = (X_1, \ldots, X_p)'$, $\phi(B) X_t = W_t$, and $Y = (Y_1, \ldots, Y_q)'$, $\theta(B) Y_t = W_t$.
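As a worked illustration (ours, not from the slides), for an ARMA(1,1) these covariances have closed forms: writing $X_t = \sum_j \phi^j W_{t-j}$ and $Y_t = \sum_j (-\theta)^j W_{t-j}$ gives $\Gamma_{\phi\phi} = \sigma_w^2/(1 - \phi^2)$, $\Gamma_{\theta\theta} = \sigma_w^2/(1 - \theta^2)$, and $\Gamma_{\phi\theta} = \sigma_w^2/(1 + \phi\theta)$, so $\sigma_w^2$ cancels in the asymptotic covariance:

```python
import numpy as np

# Approximate covariance of the ARMA(1,1) estimators (phi_hat, theta_hat).
# The sigma_w^2 factors in Gamma cancel against the sigma_w^2 / n out front.
phi, theta, n = 0.7, 0.4, 200
G = np.array([[1 / (1 - phi ** 2),      1 / (1 + phi * theta)],
              [1 / (1 + phi * theta),   1 / (1 - theta ** 2)]])
V = np.linalg.inv(G) / n                  # approximate covariance matrix
se = np.sqrt(np.diag(V))                  # approximate standard errors
print(se)
```

For instance, the $(1,1)$ entry works out to $n^{-1}(1 - \phi^2)\left(\frac{1 + \phi\theta}{\phi + \theta}\right)^2$, so the approximation degrades as $\phi + \theta \to 0$ (near parameter redundancy).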
Introduction to Time Series Analysis. Lecture 12.
1. Review: Yule-Walker estimators
2. Yule-Walker example
3. Efficiency
4. Maximum likelihood estimation
5. Large-sample distribution of MLE