STOR 356: Summary Course Notes Part III


STOR 356: Summary Course Notes Part III

Richard L. Smith
Department of Statistics and Operations Research
University of North Carolina, Chapel Hill, NC

April 23

1 ESTIMATION OF TIME SERIES MODELS

The aim of this chapter is to discuss different estimation methods for time series models. The starting point is that we have a specific model in mind, either AR(p) for given p or ARMA(p, q) for given p and q (or, more rarely, MA(q) for given q; usually we don't set out with the objective of fitting an MA model, but if model selection criteria for ARMA lead us to p = 0, that is what we are left with), and we would like to fit the model. Therefore, the main focus is on actual estimation of the model parameters. In Section 1.7, we list several commonly used criteria for choosing the model.

1.1 Yule-Walker

Used for AR(p) processes. We assume (as in most places in this course) that the series has mean 0; in most cases this is achieved by subtracting the sample mean $\bar{X}$ from every observation before beginning the analysis. The Yule-Walker method starts with the equation
$$X_t - \sum_{j=1}^{p} \phi_j X_{t-j} = Z_t, \qquad (1)$$
where $Z_t \sim WN[0, \sigma^2]$. Taking the covariance with $X_{t-k}$ for $k \geq 1$, we deduce
$$\gamma_X(k) - \sum_{j=1}^{p} \phi_j \gamma_X(k-j) = 0. \qquad (2)$$
If (2) is evaluated for $k = 1, 2, \ldots, p$, we get p linear equations in p unknowns, which we write in vector-matrix notation as
$$\Gamma_p \phi = \gamma_p, \qquad (3)$$
where

$\Gamma_p$ is the $p \times p$ matrix whose $(i, j)$ entry is $\gamma_X(i - j)$,

$\phi$ is the vector $(\phi_1 \ \phi_2 \ \ldots \ \phi_p)^T$,

$\gamma_p$ is the vector $(\gamma_X(1) \ \gamma_X(2) \ \ldots \ \gamma_X(p))^T$.

Also, by applying (2) to the case k = 0, we get
$$\gamma_X(0) - \sum_{j=1}^{p} \phi_j \gamma_X(j) = \sigma^2. \qquad (4)$$
(The right hand side in this case is $\mathrm{Cov}\{X_t, Z_t\}$. But by writing $X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}$ and noting that $\psi_0 = 1$, this reduces to the variance of $Z_t$, which is $\sigma^2$.)

In practice we substitute sample estimates $\hat{\Gamma}_p$ for $\Gamma_p$ and $\hat{\gamma}_p$ for $\gamma_p$, to deduce
$$\hat{\phi} = \hat{\Gamma}_p^{-1} \hat{\gamma}_p, \qquad (5)$$
$$\hat{\sigma}^2 = \hat{\gamma}_X(0) - \hat{\phi}^T \hat{\gamma}_p. \qquad (6)$$
We also have the approximate result
$$\hat{\phi} \sim N\left[\phi, \ \frac{\sigma^2}{n} \Gamma_p^{-1}\right], \qquad (7)$$
in other words, the vector of estimates $\hat{\phi}$ has an approximately normal distribution with mean $\phi$ and covariance matrix $\frac{\sigma^2}{n} \Gamma_p^{-1}$. This is used in determining standard errors for the parameters. See the parallel discussion of the information matrix in the maximum likelihood section (Section 1.5) below.
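To make the Yule-Walker recipe concrete, here is a small numerical sketch in Python with numpy (not the ITSM software used in the course; the simulated AR(2) parameters, the seed, and the helper name sample_acvf are purely illustrative). It forms the sample autocovariances, solves (5), evaluates (6), and reads off standard errors suggested by (7).

import numpy as np

# Simulate an AR(2) series with illustrative parameter values.
rng = np.random.default_rng(0)
n, phi_true, sigma2_true = 500, np.array([0.6, 0.2]), 1.0
x = np.zeros(n)
for t in range(2, n):
    x[t] = phi_true[0]*x[t-1] + phi_true[1]*x[t-2] + rng.normal(scale=np.sqrt(sigma2_true))

x = x - x.mean()   # centre the series, as in the notes

def sample_acvf(x, lag):
    # Sample autocovariance gamma_hat(lag), dividing by n.
    n = len(x)
    return np.sum(x[lag:] * x[:n-lag]) / n

p = 2
gam = np.array([sample_acvf(x, k) for k in range(p + 1)])
Gamma_p = np.array([[gam[abs(i - j)] for j in range(p)] for i in range(p)])
gamma_p = gam[1:]

phi_hat = np.linalg.solve(Gamma_p, gamma_p)        # equation (5)
sigma2_hat = gam[0] - phi_hat @ gamma_p            # equation (6)
std_err = np.sqrt(sigma2_hat / n * np.diag(np.linalg.inv(Gamma_p)))   # from (7), with estimates plugged in
print(phi_hat, sigma2_hat, std_err)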

1.2 Burg's Method

Burg's method is a method for AR processes that is similar to Yule-Walker, but based on the PACF. We don't try to describe the details here, but the basic idea is that it is an alternative to the Yule-Walker method that usually leads to a model with lower AICC (see Section 1.7), and closer to the maximum likelihood estimates. Therefore, although it is less intuitive than the Yule-Walker method and harder to calculate by hand, it is probably a better method overall.

1.3 Innovations Algorithm

This algorithm is primarily used for the estimation of MA models. It is offered as one of the choices for ARMA models by the ITSM package.

1.4 Hannan-Rissanen Method

This is a procedure that tries to extend the Yule-Walker equations to general ARMA processes. The estimators are not as good as general MLEs, but they are useful as a starting point for the MLE algorithm (see Section 1.5.3). The idea is to extend (1) to the equation
$$X_t = \sum_{i=1}^{p} \phi_i X_{t-i} + \sum_{j=1}^{q} \theta_j Z_{t-j} + Z_t. \qquad (8)$$
In other words, we regress $X_t$ on $X_{t-1}, \ldots, X_{t-p}, Z_{t-1}, \ldots, Z_{t-q}$, treating $Z_t$ as a residual error term. The disadvantage of this is that we need initial estimates of $Z_{t-1}, \ldots, Z_{t-q}$ to get the procedure started. Usually this is done by first fitting an AR(p*) model for some $p^* \geq p + q$, using that model to estimate the residuals, and treating these as starting values for the Z's in (8). Then, once an initial model is fitted, the Z's are recalculated and the analysis repeated. This can be done several times until the procedure converges. There is still the issue that the analysis does not deal well with the earlier observations in the series (we can only start (8) at t = max(p, q) + 1), and in that respect, it is less efficient than methods that use all the data.

1.5 Method of Maximum Likelihood

The Method of Maximum Likelihood is the most commonly used method for estimating parameters in all of statistics. The method is covered in one of the STOR courses (555), but since that isn't a prerequisite for this course, I won't assume students have previously heard about it. In view of that, I first describe how maximum likelihood is applied to two common distributions, the binomial and the normal, and then go on to consider the time series case.

1.5.1 Example 1: Binomial Distribution

Suppose X has a binomial distribution with parameters n (known) and p (unknown). This means that
$$\Pr\{X = k\} = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}, \qquad k = 0, 1, 2, \ldots, n. \qquad (9)$$
The method of maximum likelihood states that to estimate p given the observed value X = k, we choose the value $\hat{p}$ that maximizes (9). It's a little simpler if we work with the log likelihood:
$$\ell(p) = \log \Pr\{X = k; p\} = \log \frac{n!}{k!(n-k)!} + k \log p + (n-k) \log(1-p). \qquad (10)$$
Differentiate (10) twice:
$$\ell'(p) = \frac{k}{p} - \frac{n-k}{1-p}, \qquad (11)$$
$$\ell''(p) = -\frac{k}{p^2} - \frac{n-k}{(1-p)^2}. \qquad (12)$$
From (11), we see that $\ell'(p) = 0$ when $\frac{k}{p} = \frac{n-k}{1-p}$, which reduces to
$$\hat{p} = \frac{k}{n}, \qquad (13)$$
which is the answer we expected; e.g., a course like STOR 155 teaches that we estimate a population proportion by the sample proportion, which is exactly what (13) says. Moreover, from (12), we see at once that $\ell''(p) < 0$, so it really is a local maximum. [Strictly speaking, this argument fails

when k = 0 or n, but then we prove it is a local maximum directly. For example, if k = 0 then $\ell(p) = n \log(1-p)$, which is decreasing over $0 \leq p \leq 1$, and has a maximum at p = 0.]

However, we can do more with the second derivative. Suppose we evaluate $\ell''(p)$ at $p = \hat{p}$:
$$\ell''(\hat{p}) = -\frac{k n^2}{k^2} - \frac{(n-k) n^2}{(n-k)^2} = -\frac{n^3}{k(n-k)} = -\frac{n}{\hat{p}(1-\hat{p})}. \qquad (14)$$
However, let's also note that from the standard (e.g. STOR 155) formula for the variance of a binomial proportion, $\mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n}$, which in practice (p being unknown) we approximate by
$$\mathrm{Var}(\hat{p}) \approx \frac{\hat{p}(1-\hat{p})}{n}. \qquad (15)$$
Comparing (14) with (15), we observe that:

The variance of $\hat{p}$ is approximately $\{-\ell''(\hat{p})\}^{-1}$.

This turns out to be a general property of maximum likelihood estimators.

1.5.2 Example 2: Normal Distribution

Suppose $X_1, \ldots, X_n$ are independent random variables, each normal with unknown mean $\mu$ and variance $\tau$. Note that normally we would write $\sigma^2$ instead of $\tau$, but in this case, to avoid possible confusion over whether we are differentiating with respect to $\sigma$ or $\sigma^2$, I write $\tau$. The density of a normal $(\mu, \tau)$ random variable is
$$\frac{1}{\sqrt{2\pi\tau}} e^{-(x-\mu)^2/(2\tau)}.$$
The joint density of $X_1, \ldots, X_n$ is the product of this for $X_1, \ldots, X_n$:
$$f(X_1, \ldots, X_n; \mu, \tau) = (2\pi\tau)^{-n/2} \exp\left\{-\frac{1}{2\tau} \sum (X_i - \mu)^2\right\}. \qquad (16)$$
[This is a consequence of independence: any statement about the values of $X_1, \ldots, X_n$ can be expressed as a product of probabilities for $X_1, \ldots, X_n$ individually. When the answer is written as a probability density, we get (16).]

Again we take logarithms of (16), so
$$\ell_n(\mu, \tau) = \log f(X_1, \ldots, X_n; \mu, \tau) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log \tau - \frac{1}{2\tau} \sum (X_i - \mu)^2. \qquad (17)$$

We then calculate
$$\frac{\partial \ell_n(\mu, \tau)}{\partial \mu} = \frac{1}{\tau} \sum (X_i - \mu), \qquad (18)$$
$$\frac{\partial \ell_n(\mu, \tau)}{\partial \tau} = -\frac{n}{2\tau} + \frac{1}{2\tau^2} \sum (X_i - \mu)^2. \qquad (19)$$
(18) is 0 when $\hat{\mu} = \frac{1}{n} \sum X_i = \bar{X}$, which again agrees with the STOR 155 solution. If we substitute $\mu = \bar{X}$ and set (19) equal to 0, we get
$$\hat{\tau} = \frac{1}{n} \sum (X_i - \bar{X})^2, \qquad (20)$$
which however is different from the STOR 155 solution: in that course you were taught to divide by n - 1 instead of n. In fact, what they told you in STOR 155 was right: it's better to divide by n - 1 because that leads to an unbiased estimator (the definition of an unbiased estimator is that if the experiment is repeated many times over, the long-run average of the estimator is equal to its true value). However the two estimators are almost the same when n is large, and that is what really counts with maximum likelihood estimators (or MLEs for short). MLEs lead to approximately the most efficient estimators in large samples, but they are not necessarily best in small samples. Nevertheless, in most practical estimation problems (including nearly all of the estimation problems that arise in time series analysis), we simply have no way to calculate an estimator that is exactly unbiased for every n, whereas maximum likelihood is a very general technique that works in very many cases.

Let's now extend (18) and (19) to the calculation of second-order derivatives. Again, the case of practical interest is when we set $\mu$ and $\tau$ to the MLEs, so we do that in the following calculation:
$$-\frac{\partial^2 \ell_n(\mu, \tau)}{\partial \mu^2} = \frac{n}{\tau}, \qquad (21)$$
$$-\frac{\partial^2 \ell_n(\mu, \tau)}{\partial \mu \, \partial \tau} = \frac{1}{\tau^2} \sum (X_i - \mu) = 0, \qquad (22)$$
$$-\frac{\partial^2 \ell_n(\mu, \tau)}{\partial \tau^2} = -\frac{n}{2\tau^2} + \frac{1}{\tau^3} \sum (X_i - \mu)^2 = \frac{n}{2\tau^2}. \qquad (23)$$
Let's assemble this as a matrix, which we'll call H:
$$H = \begin{pmatrix} -\frac{\partial^2 \ell_n}{\partial \mu^2} & -\frac{\partial^2 \ell_n}{\partial \mu \, \partial \tau} \\ -\frac{\partial^2 \ell_n}{\partial \mu \, \partial \tau} & -\frac{\partial^2 \ell_n}{\partial \tau^2} \end{pmatrix} = \begin{pmatrix} \frac{n}{\tau} & 0 \\ 0 & \frac{n}{2\tau^2} \end{pmatrix}. \qquad (24)$$
Since H is diagonal, it's easy to calculate its inverse:
$$H^{-1} = \begin{pmatrix} \frac{\tau}{n} & 0 \\ 0 & \frac{2\tau^2}{n} \end{pmatrix}. \qquad (25)$$
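As a quick numerical check of (20) through (25) (a sketch in Python with numpy, not part of the original notes; the values of mu, tau, n and the number of replications are arbitrary), we can draw many normal samples, compute the MLEs for each, and compare the observed variances of $\hat{\mu}$ and $\hat{\tau}$ with the diagonal entries $\tau/n$ and $2\tau^2/n$ of $H^{-1}$.

import numpy as np

rng = np.random.default_rng(1)
mu, tau, n, reps = 5.0, 4.0, 200, 20000   # arbitrary illustrative values

samples = rng.normal(mu, np.sqrt(tau), size=(reps, n))
mu_hat = samples.mean(axis=1)     # MLE of mu for each sample
tau_hat = samples.var(axis=1)     # MLE of tau (divides by n, not n-1)

print("var of mu_hat :", mu_hat.var(), " vs tau/n     =", tau/n)
print("var of tau_hat:", tau_hat.var(), " vs 2*tau^2/n =", 2*tau**2/n)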

At this point you can recognize that $\frac{\tau}{n}$ is exactly the variance of $\bar{X}$ according to standard (STOR 155) statistical theory. Moreover, although you may not instantly recognize it, it's also true that $\frac{2\tau^2}{n}$ is very nearly the variance of $\hat{\tau}$.

[If you feel confident about the chi-squared distribution, here's the argument. Suppose $\mu$ is known. Then $\sum (X_i - \mu)^2 / \tau$ has a chi-squared distribution with n degrees of freedom, denoted $\chi^2_n$. This distribution has mean n and variance 2n. Therefore, $\frac{1}{n} \sum (X_i - \mu)^2$ has a mean of
$$\frac{\tau}{n} \, E\left\{\frac{\sum (X_i - \mu)^2}{\tau}\right\} = \frac{\tau}{n} \cdot n = \tau$$
and a variance of
$$\frac{\tau^2}{n^2} \, \mathrm{Var}\left\{\frac{\sum (X_i - \mu)^2}{\tau}\right\} = \frac{\tau^2}{n^2} (2n) = \frac{2\tau^2}{n}.$$
This is more complicated when $\mu$ is unknown, when the correct statement is that $\sum (X_i - \bar{X})^2 / \tau$ has a $\chi^2_{n-1}$ distribution. I'll leave you to figure out exactly what this implies for the variance of $\hat{\tau}$. It's still correct that for large n, the variance is approximately $\frac{2\tau^2}{n}$.]

The general principle here is this: the matrix H is called the information matrix, and is calculated as the matrix of negative second-order partial derivatives of $\ell$, evaluated at the MLE. Its inverse is an approximation to the covariance matrix of the estimators. In particular, the diagonal entries of $H^{-1}$ are approximately the variances of the individual parameter estimates, in this case, $\hat{\mu}$ and $\hat{\tau}$. Our purpose in going through the calculations was to illustrate how the general principle works in some well-known examples.

1.5.3 Time Series Models

We state without proof the following fact (see also equation (5.2.1), page 158 of the course text). Suppose $X_1, \ldots, X_n$ are random variables that have a normal distribution with mean 0, but instead of being independent, they have a covariance matrix $\Gamma_n$. Then the joint density of $X_1, \ldots, X_n$ is
$$L_n = (2\pi)^{-n/2} |\Gamma_n|^{-1/2} \exp\left\{-\frac{1}{2} X^T \Gamma_n^{-1} X\right\}. \qquad (26)$$
[Here X denotes the vector $(X_1 \ \ldots \ X_n)^T$, and $|\cdot|$ indicates the determinant of a matrix.]

In this case, the principle of maximum likelihood says we choose the unknown parameters of the time series model to maximize $L_n$.
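To see concretely what "choose the parameters to maximize L_n" involves, here is a sketch in Python with numpy (an illustration only, not how ITSM works; the AR(1) setting, the series length and the grid of candidate values are all arbitrary choices). For a mean-zero AR(1) with parameter $\phi$ and $\sigma^2 = 1$, the autocovariances are $\gamma_X(h) = \phi^{|h|}/(1 - \phi^2)$, so $\Gamma_n$ can be written down directly and the logarithm of (26) evaluated for any candidate $\phi$.

import numpy as np

def ar1_loglik(x, phi, sigma2=1.0):
    # Gaussian log-likelihood, i.e. the log of (26), for a mean-zero AR(1) series x.
    n = len(x)
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    Gamma_n = sigma2 * phi**lags / (1.0 - phi**2)
    sign, logdet = np.linalg.slogdet(Gamma_n)
    quad = x @ np.linalg.solve(Gamma_n, x)
    return -0.5*n*np.log(2*np.pi) - 0.5*logdet - 0.5*quad

# Simulate a short AR(1) series and maximize the log-likelihood over a grid of phi values.
rng = np.random.default_rng(2)
phi_true, n = 0.7, 200
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true*x[t-1] + rng.normal()

grid = np.linspace(-0.95, 0.95, 381)
loglik = np.array([ar1_loglik(x, phi) for phi in grid])
print("approximate MLE of phi:", grid[np.argmax(loglik)])

A grid search like this is crude; real software uses more efficient numerical optimizers and, as described in Section 1.5.4 below, evaluates the likelihood through the prediction error decomposition rather than by handling $\Gamma_n$ directly.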

In a little more detail, this is what that means. Suppose we have an ARMA(p, q) model of mean 0, i.e.
$$X_t - \phi_1 X_{t-1} - \ldots - \phi_p X_{t-p} = Z_t + \theta_1 Z_{t-1} + \ldots + \theta_q Z_{t-q},$$
where $Z_t \sim WN[0, \sigma^2]$. There are m = p + q + 1 unknown parameters, $\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q, \sigma^2$. For convenience we write this as a single vector $\beta = (\phi_1 \ \ldots \ \phi_p \ \theta_1 \ \ldots \ \theta_q \ \sigma^2)^T$. Theory developed earlier in the course has shown how to calculate all the autocovariances as functions of $\beta$. Therefore, we can calculate the matrix $\Gamma_n$ as a function of $\beta$. We substitute this in (26). The result is written $L_n(\beta)$. We choose $\beta$ to maximize this expression (or more often, its logarithm, $\ell_n(\beta)$). In practice, this cannot be done analytically and we have to employ a numerical search, and for that purpose, it's useful if we have at least reasonably good estimates as starting values for the numerical search. This is the reason why other estimation methods, such as the Yule-Walker method or Burg's algorithm (in the case of AR models), are often used as preliminary estimates prior to running MLE. However, MLE itself is not dependent on the use of any particular algorithm: there are many possible methods to calculate it numerically.

Properties of MLE. Let $H_n$ denote the matrix of negative second-order derivatives of $\ell_n(\beta) = \log L_n(\beta)$, evaluated at $\beta = \hat{\beta}$. This is known as the information matrix associated with this particular likelihood function. Also let $V_n = H_n^{-1}$ (i.e. the usual matrix inverse). Then $V_n$ is approximately the variance-covariance matrix of $\hat{\beta}$. In particular, if we write
$$V_n = \begin{pmatrix} v_{11} & v_{12} & \ldots & v_{1m} \\ v_{21} & v_{22} & \ldots & v_{2m} \\ \vdots & \vdots & & \vdots \\ v_{m1} & v_{m2} & \ldots & v_{mm} \end{pmatrix},$$
then $\sqrt{v_{11}}, \sqrt{v_{22}}, \ldots, \sqrt{v_{mm}}$ are approximately the standard deviations of $\hat{\beta}_1, \ldots, \hat{\beta}_m$. In popular terminology they are called the standard errors. In particular, these are the numbers shown as standard errors when ITSM fits an ARMA model by maximum likelihood.

There is an analogy here with concepts that are well familiar from STOR 355. In that course, a linear regression model of the form $Y = X\beta + \epsilon$, where the vector $\epsilon$ consists of independent errors with mean 0 and variance $\sigma^2$, was solved with an estimator $\hat{\beta} = (X^T X)^{-1} X^T Y$, and you were taught that the covariance matrix of $\hat{\beta}$ is $(X^T X)^{-1} \sigma^2$. Moreover, the square roots of the diagonal entries of that covariance matrix were used to give standard errors of the individual $\hat{\beta}_j$'s. The use of $V_n$ as a covariance matrix in the present setting is analogous, except that the answers are only approximate, not exact.

1.5.4 The Prediction Error Decomposition

There is one other trick to learn about the use of MLE for time series models: usually the formula (26) is not calculated exactly, but through an alternative approach known as the prediction error decomposition. The idea is as follows. Suppose $X_1$ has a normal distribution with mean $\hat{X}_1 = 0$ and variance $v_0$. Typically we would just take $v_0$ to be $\gamma_X(0)$, the variance of a typical $X_t$ under the stationary distribution. Now suppose:

The optimal prediction of $X_2$ given $X_1$ has mean predictor $\hat{X}_2$ and prediction variance (i.e. mean squared prediction error) $v_1$,

The optimal prediction of $X_3$ given $X_1, X_2$ has mean predictor $\hat{X}_3$ and prediction variance $v_2$,

and so on up to

The optimal prediction of $X_n$ given $X_1, \ldots, X_{n-1}$ has mean predictor $\hat{X}_n$ and prediction variance $v_{n-1}$.

Then an alternative way to write (26) is
$$L_n = \frac{1}{\sqrt{2\pi v_0}} e^{-X_1^2/(2v_0)} \cdot \frac{1}{\sqrt{2\pi v_1}} e^{-(X_2 - \hat{X}_2)^2/(2v_1)} \cdots \frac{1}{\sqrt{2\pi v_{n-1}}} e^{-(X_n - \hat{X}_n)^2/(2v_{n-1})}$$
$$= (2\pi)^{-n/2} (v_0 v_1 \cdots v_{n-1})^{-1/2} \exp\left\{-\frac{1}{2} \sum_{i=1}^{n} \frac{(X_i - \hat{X}_i)^2}{v_{i-1}}\right\}. \qquad (27)$$
In practice the calculation of the $\hat{X}_t$'s and the $v_{t-1}$'s follows the method for prediction from the finite past that was covered in Chapter 2 of the course text. The advantage of (27) over (26) is that when n is large, say of the order of thousands, calculating $\Gamma_n$ and $\Gamma_n^{-1}$ takes a lot of computer time and memory, whereas the recursive formulas used to calculate $\hat{X}_t$ and $v_{t-1}$ are extremely fast. (If you read the whole of Chapter 2 rather than just the part of it that we summarized for this course, you will find detailed discussion of specific algorithms for this, such as the Durbin-Levinson method.) After taking account of such recursive formulas, (27) is much faster to calculate than (26).
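To illustrate Section 1.5.4 numerically, the following sketch in Python with numpy (an illustration only; the AR(1) example, the parameter values and the simulated data are arbitrary, and this is not the ITSM implementation) checks that the prediction error decomposition (27) gives the same value as the direct formula (26). For a mean-zero AR(1), the one-step predictors from the finite past are $\hat{X}_1 = 0$ with $v_0 = \gamma_X(0) = \sigma^2/(1 - \phi^2)$, and $\hat{X}_t = \phi X_{t-1}$ with $v_{t-1} = \sigma^2$ for $t \geq 2$.

import numpy as np

def loglik_direct(x, phi, sigma2):
    # log of (26): multivariate normal log-density with the AR(1) covariance matrix Gamma_n
    n = len(x)
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    Gamma_n = sigma2 * phi**lags / (1.0 - phi**2)
    sign, logdet = np.linalg.slogdet(Gamma_n)
    return -0.5*n*np.log(2*np.pi) - 0.5*logdet - 0.5*(x @ np.linalg.solve(Gamma_n, x))

def loglik_ped(x, phi, sigma2):
    # log of (27): prediction error decomposition for a mean-zero AR(1)
    n = len(x)
    xhat = np.concatenate(([0.0], phi * x[:-1]))                                # one-step predictors
    v = np.concatenate(([sigma2 / (1.0 - phi**2)], np.full(n - 1, sigma2)))     # v_0, ..., v_{n-1}
    return -0.5*n*np.log(2*np.pi) - 0.5*np.sum(np.log(v)) - 0.5*np.sum((x - xhat)**2 / v)

rng = np.random.default_rng(3)
phi, sigma2, n = 0.6, 2.0, 150
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi*x[t-1] + rng.normal(scale=np.sqrt(sigma2))

print(loglik_direct(x, phi, sigma2))
print(loglik_ped(x, phi, sigma2))   # the two values agree up to rounding error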

1.5.5 Homework Problem

For an AR(1) process, $X_t = \phi X_{t-1} + Z_t$, $Z_t \sim N[0, \sigma^2]$, find an expression for the exact MLE of $\phi$ and compare it with the Yule-Walker estimator. (For simplicity, assume that the mean of the process is known, $\mu = 0$, and that the variance $\sigma^2$ is known, though in practice we would also estimate these along with $\phi$.)

1.6 Diagnostics

Given estimates of the model parameters, one-step-ahead forecasts $\hat{X}_t$, $t = 1, \ldots, n$, and forecasting mean squared errors $\hat{v}_{t-1}$, $t = 1, \ldots, n$ (given a hat solely because they are based on estimated parameters), we can form residuals
$$\hat{W}_t = \frac{X_t - \hat{X}_t}{\sqrt{\hat{v}_{t-1}}}, \qquad t = 1, \ldots, n. \qquad (28)$$
If the model is correct, these should be approximately white noise. You can look at these residuals in much the same way you look at residuals from a regression model: try any way you can think of to look for deviations from the underlying white noise assumption, because any such deviation is indicative of the wrong model, either identifying the wrong p and q in the ARMA order selection, or failing in some other aspect of the initial analysis of the data (e.g. failing to transform the data in the case of an obviously non-constant variance, or failure to perform differencing or trend removal in the case of long-term trends). Specific techniques include

Plotting the residuals, looking for trends, outliers, evidence of heteroscedasticity, etc.

QQ plots to test normality (note that ITSM also gives you an option of plotting a QQ plot to test for a t distribution, useful when you suspect the distribution may be long-tailed, in which case a t distribution may fit better than a normal distribution).

Plotting the sample ACF and PACF: if the model is correct, most of these should be within the confidence bands.

Formal testing of randomness, e.g. the Ljung-Box test for autocorrelation, the turning points test for a trend, the Jarque-Bera test for normality (recall Section 1.6 of the text for a full discussion of these tests).

1.7 Automated Criteria for Model Selection

This section addresses the issue of how to select the order of the model: either the value of p for an AR(p) model, or p and q for ARMA(p, q).

1.7.1 FPE

In the case of AR models, a standard criterion is the final prediction error (usually abbreviated FPE). The idea is to come up with an approximation to the mean squared error of $\hat{X}_t$, the one-step forecast of $X_t$ given the entire past up to time t - 1. The general theory of Chapter 2 shows that if the model is correctly identified, the mean squared error should be $\sigma^2$, the variance of the white noise process. But that won't be achieved in reality, because we don't know the exact parameters $\phi_1, \ldots, \phi_p$. Instead, we only have estimates $\hat{\phi}_1, \ldots, \hat{\phi}_p$. Asymptotic arguments (see the text for further detail) lead to the approximation
$$E\{(X_t - \hat{X}_t)^2\} \approx \hat{\sigma}^2 \, \frac{n+p}{n-p}. \qquad (29)$$
The right hand side of (29) is called FPE, and the FPE criterion essentially says we choose p to minimize this. Note that (29) balances two factors: as we increase p, we should expect $\hat{\sigma}^2$ to decrease, for the same reason as in regression models (the more covariates we add to the model, the smaller we should expect the residual mean squared error to be). However this is compensated by the $\frac{n+p}{n-p}$ term, which increases as p increases. The two terms compensate and in most cases lead to a sensible choice of p.

1.7.2 AIC, AICC and BIC

For general ARMA(p, q) models, the simple correction (29) no longer works, but there are a number of alternatives that aim to do the same thing, i.e. balance the decrease in estimated mean squared error that usually happens when the model order increases, with some penalty term that stops the model order getting too large. Note that an ARMA(p, q) model actually has m = p + q + 1 unknown parameters, including $\sigma^2$ (but not including the mean $\mu$; as in most other places in this course, we assume the series is centered to mean 0 before beginning the formal analysis). All these criteria start by calculating $L_n(\hat{\beta})$, the value of the maximized likelihood when the MLE $\hat{\beta}$ is substituted for the true unknown $\beta$. For simplicity we abbreviate $L_n(\hat{\beta})$ to $\hat{L}_n$.

The criteria are that we choose p and q to minimize one of the following:

AIC: $-2 \log \hat{L}_n + 2m$,

AICC: $-2 \log \hat{L}_n + \frac{2mn}{n - m - 1}$,

BIC: $-2 \log \hat{L}_n + m \log n$.

Note that the formula for BIC differs from what is presented in the text. Roughly, the distinction is this:

AIC stands for Akaike Information Criterion, and was originally proposed by the Japanese statistician Akaike as an extension of the FPE criterion to ARMA models. However the objective is the same as FPE: to minimize the mean squared error of a one-step prediction after taking into account the effect of estimating model parameters.

AICC is the corrected AIC; the correction from $2m$ to $\frac{2mn}{n-m-1}$ is supposed to be a more accurate approximation, but it's still trying to do the same thing.

BIC stands for Bayesian Information Criterion and was originally proposed as the basis for an approximation to the probability that a particular model is correct from a Bayesian statistics viewpoint. However the main thing that BIC is recognized for these days is that in large samples, it leads to consistent model selection: if one of the models is actually correct, then as the sample size n tends to infinity, the probability that BIC chooses the correct model tends to 1 (under a number of qualifying conditions that we don't try to list here). However this isn't a perfect criterion either; we may not actually believe that any of our models is correct in an absolute sense, and all of these criteria are more guidelines than absolute rules. In practice, the important thing to be aware of is that BIC typically leads to a lower-order model being selected than either AIC or AICC.

All of the model selection criteria should be viewed primarily as giving a range of models that is reasonable. If things are working well, the actual quantities of interest (e.g. future predictions) should be not too sensitive to the exact determination of the model, but by using AICC and BIC to guide the choice of models, we have a means of checking up on that. ITSM makes it especially convenient to use AICC because the Autofit option uses it, but we should really consider AICC alongside other checks, including BIC and more informal checking procedures based on residuals.
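The three criteria are simple functions of $-2 \log \hat{L}_n$, n and m, as the following sketch in Python shows (the $-2 \log \hat{L}_n$ values in the dictionary are invented purely to illustrate the typical pattern of a small improvement in fit as the order grows; they are not output from any real fitted model).

import numpy as np

def aic(neg2loglik, m):
    return neg2loglik + 2*m

def aicc(neg2loglik, m, n):
    return neg2loglik + 2*m*n/(n - m - 1)

def bic(neg2loglik, m, n):
    return neg2loglik + m*np.log(n)

n = 100   # hypothetical series length
# hypothetical -2 log L_n values for a few ARMA(p, q) fits (illustrative numbers only)
fits = {(1, 0): 292.0, (2, 0): 288.5, (1, 1): 287.9, (2, 2): 286.8}

for (p, q), neg2ll in fits.items():
    m = p + q + 1
    print((p, q), round(aic(neg2ll, m), 1), round(aicc(neg2ll, m, n), 1), round(bic(neg2ll, m, n), 1))

With these made-up numbers, AIC and AICC prefer the ARMA(1,1) fit while BIC prefers the AR(1), illustrating the remark above that BIC typically selects a lower-order model.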

2 NONSTATIONARY AND SEASONAL TIME SERIES

2.1 ARIMA models

Definition: Suppose $X_t$ is a time series. If $Y_t = (1 - B)^d X_t$ is ARMA(p, q), then $X_t$ is said to be ARIMA(p, d, q). Another way of writing this is
$$\phi^*(B) X_t = \theta(B) Z_t, \qquad \phi^*(B) = (1 - B)^d \phi(B), \qquad Z_t \sim WN[0, \sigma^2], \qquad (30)$$
the point being that $\phi^*(B)$ would not be a causal operator if taken on its own (because of the factor $1 - B$), but in the form (30), it is perfectly reasonable to require that $\phi(B)$ be causal and $\theta(B)$ be invertible, which are the same conditions as we have seen in earlier chapters.

The point about ARIMA processes is that we can often model a process with a trend in ARIMA form, when a direct ARMA model would fail. This point is reinforced by the following:

Comment. If m(t) is a polynomial of degree d - 1, then $(1 - B)^d m(t) = 0$.

This is true because of the following. First, consider the case m(t) = c (a constant). This is a polynomial of degree 0, while $(1 - B)m(t) = c - c = 0$, so the comment is true when d = 1. Now consider a component of the form $t^p$ for some $p \geq 1$. We have
$$(1 - B)t^p = t^p - (t-1)^p = t^p - \left(t^p - p t^{p-1} + \frac{p(p-1)}{2} t^{p-2} - \ldots \pm 1\right) = p t^{p-1} - \frac{p(p-1)}{2} t^{p-2} + \ldots,$$
which is a polynomial of degree p - 1. Thus, if m(t) is a polynomial of degree p then $(1 - B)m(t)$ is a polynomial of degree p - 1. Iterating, we deduce that if m(t) has degree d - 1 then $(1 - B)^{d-1} m(t)$ is a polynomial of degree 0 (i.e. a constant), and then one further application of $1 - B$ yields the result.

The point about this example is that we could have a time series of the form $X_t = m(t) + Y_t$, with m(t) a polynomial and $Y_t$ stationary, so that $(1 - B)^d X_t = (1 - B)^d Y_t$ and can be modeled as a stationary time series.

2.1.1 An Example

This is a simulated example based on one in the course text, but it has been worked out separately for this discussion.

[Figure 5. Left: plot of an AR(1) series $Y_t$ with $\phi = 0.9$. Right: the integrated series $X_t = Y_1 + Y_2 + \ldots + Y_t$.]

Figure 5 (left) shows a simulated AR(1) series $Y_t$ with $\phi = 0.9$, and (right) the integrated form $X_t = Y_1 + \ldots + Y_t$ (so $(1 - B)X_t = Y_t$).
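A pair of series with the qualitative behaviour in Figure 5 can be simulated in a few lines. Here is a sketch in Python with numpy (not the software used to produce the figure; the seed and the series length are arbitrary).

import numpy as np

rng = np.random.default_rng(4)
n, phi = 200, 0.9

y = np.zeros(n)
for t in range(1, n):
    y[t] = phi*y[t-1] + rng.normal()   # AR(1) series Y_t with phi = 0.9

x = np.cumsum(y)   # integrated series X_t = Y_1 + ... + Y_t, so (1 - B)X_t = Y_t

# Plotting y and x side by side reproduces the contrast seen in Figure 5:
# y fluctuates around 0, while x wanders with long, smooth excursions.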

The two series are available on the course webpage.

The first comment is that the two series do not look at all alike: even though $Y_t$ is itself a highly autocorrelated series, $X_t$ has much smoother sample paths over long time periods (in other words, it would be reasonable to conclude just from visual inspection that $Y_t$ is an autocorrelated but still stationary time series, whereas $X_t$ is not stationary at all). This is further reinforced by the ACFs of the two series (Figure 6). Nevertheless, it's difficult in practice to decide whether a series is stationary or not (see the later section on Unit Root Processes for some formal tests), so it's worth doing some more comparisons. Indeed, the PACF of $X_t$ could well lead one to the conclusion that the process is either AR(2) or AR(3).

[Figure 6. ACF and PACF of the $Y_t$ and $X_t$ series.]

Here are some direct comparisons, run in ITSM. The Autofit option applied to $X_t$ indeed leads to the AR(2) model, fitted as
$$X_t = 1.888 X_{t-1} + \hat{\phi}_2 X_{t-2} + Z_t \qquad (31)$$

with an estimate of $\sigma^2$ (to be compared with the true value $\sigma^2 = 1$) and an accompanying AICC value. However if we check for stationarity, in other words solve the equation $\phi(z) = 1 - \hat{\phi}_1 z - \hat{\phi}_2 z^2 = 0$, we find a pair of complex conjugate roots with $|z| = 1.056$, quite close to the region of noncausality (recall that $|z| > 1$ is a necessary condition for the process to be causal). We also find that the parameter estimates are sensitive to the method of estimation. The Burg estimates in this case are
$$X_t = 1.895 X_{t-1} + \hat{\phi}_2 X_{t-2} + Z_t \qquad (32)$$
with $\hat{\sigma}^2 = 1.012$, AICC = 299.4, not very different from the MLEs, but the Yule-Walker estimates are
$$X_t = 1.259 X_{t-1} + \hat{\phi}_2 X_{t-2} + Z_t \qquad (33)$$
with $\hat{\sigma}^2 = 3.41$, AICC = 416.4, which looks very different.

On the other hand an ARIMA(1,1,0) model fit (in ITSM, first do Transform, then difference at lag 1, then Autofit) leads to
$$Y_t = \hat{\phi}_1 Y_{t-1} + Z_t, \qquad (34)$$
which is also equivalent to
$$X_t = (1 + \hat{\phi}_1) X_{t-1} - \hat{\phi}_1 X_{t-2} + Z_t, \qquad (35)$$
quite similar to either (31) or (32), but without the problems of instability, and incidentally with lower AICC (292.3). Thus the conclusion in this case is that although the AR(2) model applied to $X_t$ and the AR(1) model applied to $Y_t$ both seem to be plausible options, the latter model leads to a more stable representation of the time series and is identified by AICC as the better model (as does BIC, but not, surprisingly, FPE).

2.2 Identification Techniques

2.2.1 Transformations

A transformation of the data may be appropriate when the data are clearly not normally distributed (e.g. because the distribution is highly left or right skewed), or in cases of heteroscedasticity (think back to the Australian wine data at the beginning of the course: the variance was clearly increasing with time, but when we took logarithms, the variance was approximately constant, as it should be for a stationary time series model). Common transformations include

Logarithmic: self-explanatory.

Box-Cox: named after a famous paper by George Box and David Cox in 1964, this refers to the transformation
$$f_\lambda(X_t) = \begin{cases} \dfrac{X_t^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, \\ \log X_t & \text{if } \lambda = 0. \end{cases} \qquad (36)$$
ITSM allows the range $0 \leq \lambda$. The point of the specific representation (36) is that the limit of $\frac{x^\lambda - 1}{\lambda}$ as $\lambda \to 0$ is precisely $\log x$, so (36) is a continuous family of transformations (in other words, continuous in $\lambda$). However when smoothness near $\lambda = 0$ is not an issue, (36) may be replaced by a simple $X_t^\lambda$ for $\lambda \neq 0$.
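A direct transcription of (36) into code looks as follows (a sketch in Python with numpy; the sample values and the choices of lambda are arbitrary, and the evaluation at a very small lambda simply illustrates the continuity at $\lambda = 0$ discussed above).

import numpy as np

def box_cox(x, lam):
    # Box-Cox transformation (36); requires x > 0.
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x**lam - 1.0) / lam

x = np.array([1.0, 2.0, 5.0, 10.0])
print(box_cox(x, 0.5))
print(box_cox(x, 1e-8))   # very close to log(x)
print(np.log(x))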

Other possible transformations are

Decomposition of $X_t$ into the sum of trend, seasonal and stationary terms,

Differencing,

Fitting a polynomial and/or harmonic regression as part of an initial model fit, then fitting a stationary time series model to the residuals.

2.2.2 Identification and Estimation

The broad strategy is that after making an initial transformation to make the series stationary, an ARMA(p, q) model is identified using one of several model identification techniques, such as FPE, AICC or BIC. In the technical language of time series analysis, identification means specifically how to choose p and q. The ITSM Autofit command makes this especially easy by automatically searching over a range of models to minimize AICC. Note, however, that you have to specify the maximum p and q, and this can be problematic in some cases.

An alternative strategy is the subset model approach, which starts with a high-order ARMA model and selectively sets certain coefficients to 0 (rather like backward variable selection in standard regression). The Constrain optimization command in ITSM allows such models to be fitted as part of the maximum likelihood procedure. The text contains a detailed example of these techniques applied to the Australian wine data.

2.3 Unit Roots

We have already seen in Section 2.1 that it can be problematic to fit an AR model when the $\phi(B)$ operator contains a term $1 - B$. This is called the unit root problem for the following reason: if we search for the roots of the autoregressive polynomial $\phi(z)$, we will find one at z = 1 (because $\phi$ contains a factor $1 - z$). But we have already seen in Chapter 3 that one of the conditions for a causal process is that all the roots satisfy $|z| > 1$. So this case is on the boundary between causal and non-causal. Unit root tests are a class of statistical tests designed to detect this situation. In fact there are two types of unit root problem: unit roots in the AR component, and unit roots in the MA component.

2.3.1 The Dickey-Fuller Test

One of the earliest successful solutions of this problem was the Dickey-Fuller test, published in 1979. Suppose we have an AR(1) process with unknown mean,
$$X_t - \mu = \phi_1 (X_{t-1} - \mu) + Z_t, \qquad Z_t \sim WN[0, \sigma^2]. \qquad (37)$$

For large sample size n, the MLE $\hat{\phi}_1$ has an approximate normal distribution with mean $\phi_1$ and variance $\frac{1 - \phi_1^2}{n}$. However, this is not applicable when $\phi_1 = 1$. Dickey and Fuller instead set it up as a hypothesis testing problem: test the null hypothesis $H_0: \phi_1 = 1$ against the alternative $H_1: \phi_1 < 1$. We may write
$$\nabla X_t = X_t - X_{t-1} = \phi_0^* + \phi_1^* X_{t-1} + Z_t, \qquad t \geq 2, \qquad (38)$$
where $\phi_0^* = \mu(1 - \phi_1)$ and $\phi_1^* = \phi_1 - 1$.

Suppose $\hat{\phi}_1^*$ is the ordinary least squares (OLS) estimator of $\phi_1^*$; in other words, we just treat (38) as a standard regression equation, using SAS or any other regression program to regress $\nabla X_t$ on $X_{t-1}$. The estimated standard error is
$$SE(\hat{\phi}_1^*) = \frac{S}{\sqrt{\sum_{t=2}^{n} (X_{t-1} - \bar{X})^2}},$$
where S is the residual standard deviation,
$$S^2 = \frac{1}{n-3} \sum_{t=2}^{n} \left(\nabla X_t - \hat{\phi}_0^* - \hat{\phi}_1^* X_{t-1}\right)^2.$$
Consider
$$\hat{\tau}_\mu = \frac{\hat{\phi}_1^*}{SE(\hat{\phi}_1^*)}. \qquad (39)$$
In normal regression terminology, $\hat{\tau}_\mu$ would be called the t statistic for $\hat{\phi}_1^*$ and would have a $t_{n-3}$ distribution (which, in turn, is approximately N[0,1] if n is large). However, the rather peculiar structure of the problem (38), where $\nabla X_t$ and $X_{t-1}$ are not independent of one another but have rather a complicated interdependence, means that $\hat{\tau}_\mu$ does not have a normal distribution, even asymptotically as $n \to \infty$. Instead, Dickey and Fuller figured out the correct asymptotic distribution, which is henceforth destined forever to be known as the Dickey-Fuller distribution. The most important thing to know about this distribution is that its .01, .05 and .10 quantiles are -3.43, -2.86 and -2.57. For example, the Dickey-Fuller test would reject $H_0$ at the 5% level of significance if $\hat{\tau}_\mu < -2.86$ (instead of $\hat{\tau}_\mu < -1.645$, which would be correct in the case of an asymptotic N[0,1] distribution). Note that it's a one-sided test: the case $\phi_1 > 1$ doesn't really come into consideration, because in that case the process would grow exponentially, and we don't need a formal test for that. The real case of interest is between $\phi_1 = 1$ and $\phi_1 < 1$, and the latter case corresponds to $\phi_1^* = \phi_1 - 1 < 0$.
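The statistic (39) is nothing more than the usual t statistic from an ordinary regression of $\nabla X_t$ on $X_{t-1}$ with an intercept. The following sketch in Python with numpy (an illustration only: the simulated random walk stands in for real data, and this is not the SAS route given in the next subsection) computes $\hat{\phi}_1^*$ and $\hat{\tau}_\mu$ and compares the latter with the Dickey-Fuller 5% point -2.86 rather than with a normal or t quantile.

import numpy as np

rng = np.random.default_rng(5)
n = 200
x = np.cumsum(rng.normal(size=n))       # a simulated random walk, so the unit root null is true

dx = np.diff(x)                         # X_t - X_{t-1} for t = 2, ..., n
xlag = x[:-1]                           # X_{t-1}
X = np.column_stack([np.ones(n - 1), xlag])   # regression (38): intercept + X_{t-1}

beta, *_ = np.linalg.lstsq(X, dx, rcond=None)
resid = dx - X @ beta
S2 = np.sum(resid**2) / (n - 3)                          # residual variance, with n - 3 as in the notes
se_phi1 = np.sqrt(S2 / np.sum((xlag - xlag.mean())**2))  # SE(phi_1*) as defined above
tau_mu = beta[1] / se_phi1                               # the statistic (39)

print(tau_mu, "reject unit root" if tau_mu < -2.86 else "do not reject unit root (5% level)")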

2.3.2 Extension to the general AR(p) case

If we have an AR(p) model
$$X_t - \mu = \phi_1 (X_{t-1} - \mu) + \ldots + \phi_p (X_{t-p} - \mu) + Z_t,$$
we rewrite this as
$$\nabla X_t = \phi_0^* + \phi_1^* X_{t-1} + \phi_2^* \nabla X_{t-1} + \ldots + \phi_p^* \nabla X_{t-p+1} + Z_t, \qquad (40)$$
where $\phi_0^* = \mu(1 - \phi_1 - \ldots - \phi_p)$, $\phi_1^* = \sum_{i=1}^{p} \phi_i - 1$, and $\phi_j^* = -\sum_{i=j}^{p} \phi_i$ for $j = 2, \ldots, p$. If there exists a unit root, then $0 = \phi(1) = -\phi_1^*$, so this becomes the natural null hypothesis. As in the AR(1) case, we form the t statistic (39) using a standard regression package, but the asymptotic distribution is again the Dickey-Fuller distribution, not the normal distribution.

2.3.3 Example SAS Code

It should be possible to adapt the following straightforwardly to your own examples.

*** apply unit root tests to "sales" data;
*** this version for AR(1) model;
options ls=64 ps=45 nonumber label;
*** insert your own data name, file name and path directory;
data sales;
infile 'd:/my Documents/itsm2000/sales.tsm';
input y;
array vara(0:1) y0 y1;
vara(1)=vara(0);
retain y0 y1;
;
vara(0)=y;
ydif=y0-y1;
run;
;
proc reg data=sales;
model ydif=y1;
run;
;

*** this version for AR(2) model;
options ls=64 ps=45 nonumber label;
data sales;
infile 'd:/my Documents/itsm2000/sales.tsm';
input y;
array vara(0:2) y0 y1 y2;
do i=2 to 1 by -1;
VarA(i)=VarA(i-1);
end;
retain y0 y1 y2;
;
vara(0)=y;
ydif=y0-y1;
ydif1=y1-y2;
run;
;
proc reg data=sales;
model ydif=y1 ydif1;
run;
;

*** this version for AR(3) model;
options ls=64 ps=45 nonumber label;
data sales;
infile 'd:/my Documents/itsm2000/sales.tsm';
input y;
array vara(0:3) y0 y1 y2 y3;
do i=3 to 1 by -1;
VarA(i)=VarA(i-1);
end;
retain y0 y1 y2 y3;
;
vara(0)=y;
ydif=y0-y1;
ydif1=y1-y2;
ydif2=y2-y3;
run;
;
proc reg data=sales;
model ydif=y1 ydif1 ydif2;
run;
;

2.3.4 Analysis of Sales Dataset

An initial analysis of the SALES.TSM dataset in ITSM led to the following conclusions. First, the Burg algorithm was used in connection with AICC to select the best-fitting AR model. The resulting model was AR(5). However, it is easily checked that the sum of the fitted AR coefficients, $\sum_{i=1}^{5} \hat{\phi}_i$, is very nearly 1. If it were exactly 1, we would have a unit root process. So there is a strong suspicion that the process is unit root, though we have not yet conducted a formal test for this.

It's also possible to look for the best-fitting ARMA model using the Autofit command (without differencing). This led to an ARMA(4,4) model.

An incidental comment here is that this shows that Autofit does not always produce the best-fitting model (because the AR(5) model is still better according to AICC). This reflects the fact that the search algorithms used to calculate the maximum likelihood estimates do not work perfectly, a caution with any use of Autofit or similar procedures.

After differencing, the Burg algorithm yields an optimal AR model of AR(4) (AICC = 515.8), while Autofit determines that the optimal model is ARMA(1,1). Thus, both from AICC and from the ease of the fitting procedure (also from other indicators, not shown here, such as ACF/PACF plots and residual analysis), we conclude that the analysis works better with differencing than without. But this still leaves open the question of a formal unit root test: if such a test led to acceptance of the null hypothesis (the null hypothesis in this case being that there is a unit root), we would be fully justified in differencing.

We therefore now give the results of the SAS analyses for unit root tests with this dataset. The first program (AR(1)) produces a table of parameter estimates (columns: Variable, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|) with rows for the Intercept and y1. The second program (AR(2)) produces the corresponding table with rows for the Intercept, y1 and ydif1. The third program (AR(3)) produces the corresponding table with rows for the Intercept, y1, ydif1 and ydif2.

In all three cases, $\phi_1^*$ is the coefficient of y1 (i.e. the coefficient of $X_{t-1}$ in our time series notation) and the corresponding t Value is the value of $\hat{\tau}_\mu$ (respectively 0.17, 0.48, 0.66). You should ignore the Pr > |t| column for this variable, because it is based on the standard t distribution, which we have already seen is not valid in the unit root context; but none of the three values of $\hat{\tau}_\mu$ is significant according to the Dickey-Fuller test, so we accept the null hypothesis of a unit root. As for the other parameters, we note that in the AR(3) model, the coefficients of ydif1 and ydif2 are both significant (for these parameters, the standard t distribution is approximately valid), so presumably we should retain at least those terms.

Finally, we give the corresponding AR(5) analysis, which exactly corresponds to the optimal AR analysis for the undifferenced data, according to the earlier ITSM analyses using AICC as a model-selection criterion. The key part of the SAS code is now

array vara(0:5) y0 y1 y2 y3 y4 y5;
do i=5 to 1 by -1;
VarA(i)=VarA(i-1);
end;
retain y0 y1 y2 y3 y4 y5;
;
vara(0)=y;
ydif=y0-y1;
ydif1=y1-y2;
ydif2=y2-y3;
ydif3=y3-y4;
ydif4=y4-y5;
run;
;
proc reg data=sales;
model ydif=y1 ydif1 ydif2 ydif3 ydif4;
run;
;

and this leads to a table of parameter estimates with rows for the Intercept, y1, ydif1, ydif2, ydif3 and ydif4.

The value of $\hat{\tau}_\mu$ (in the table, the t Value associated with the y1 variable) is now 1.01, still clearly not significant according to the Dickey-Fuller test (in other words, we accept the null hypothesis that there is a unit root). The t Values associated with ydif1 through ydif4 range from 0.99 upwards. Unlike y1, the conventional t distributions for these variables are approximately valid, so the conclusion is that not all of these variables are statistically significant. On the other hand, if we drop ydif3 from the above analysis, the t Values associated with ydif1 and ydif2, for example, become 2.71 and 1.97 respectively. Thus, one interpretation of the result is that after differencing, the AR(4) model probably is the correct model, though the coefficient $\phi_3$ is not significant.

2.4 Forecasting ARIMA Models

As an example, consider the DOW JONES dataset within ITSM. We fitted that as an ARIMA(1,1,0) model (in other words, difference the data first, then use Autofit, which selects an AR(1) model for the differenced data). Then select Forecasting, ARMA, and the option Forecast the undifferenced data. The plot of the data and forecasts (with 95% prediction bounds) is in Figure 7.

[Figure 7. Plot of the Dow Jones index, with forecasts and 95% forecast intervals for the next 10 observations.]

The table of forecasts lists, for each of the 10 steps ahead, the prediction, sqrt(MSE), and the lower and upper approximate 95 percent prediction bounds.

In this section, we describe how these forecasts are computed, and especially the mean squared errors.

Let's explain exactly how these forecasts and their MSEs were calculated. Take the last six observations of the Dow Jones series (the last one being the most recent). Differencing once gives the last 5 observations of the differenced series, and we then subtract the overall mean of the differenced series (obtained from ITSM) from every observation. Moreover, ITSM gives the maximum likelihood estimates $\hat{\phi}_1$ and $\hat{\sigma}^2$ of the fitted AR(1) model.

Let's do the forecasts h steps ahead up to h = 4. Standard theory from earlier in the course says the forecasts of the mean-corrected differenced series $Y_{t+h}$ for h = 1, 2, 3, 4 are $(\phi_1 Y_t, \ \phi_1^2 Y_t, \ \phi_1^3 Y_t, \ \phi_1^4 Y_t)$; substituting the estimate $\hat{\phi}_1$ and the last mean-corrected value $Y_t$ gives numerical values.

Note: If the model was something more complicated than AR(1), then at this step of the calculation you'd have to do the optimal forecast for whatever model was being used; this is the main place in this calculation where the method would be different for a general ARMA(p, q) process.

Now add back in the sample mean to get forecast values of $Y_{t+h}$ with the mean back in. The forecast values of $X_{t+h}$ are then obtained by adding these successively to the last observed value: $\hat{X}_{t+1} = X_t + \hat{Y}_{t+1}$, $\hat{X}_{t+2} = \hat{X}_{t+1} + \hat{Y}_{t+2}$, and so on. These agree to two decimal places with the ITSM forecasts reproduced earlier.

Next, we state the formula for the mean squared prediction errors. If we write the ARIMA(p, d, q) model in the form $\phi^*(B) X_t = \theta(B) Z_t$ where $\phi^*(B) = (1 - B)^d \phi(B)$, with $\phi(B)$ and $\theta(B)$ being the usual AR and MA polynomials associated with the process $(1 - B)^d X_t$, then we formally define
$$\psi^*(B) = \frac{\theta(B)}{\phi^*(B)} = \psi_0^* + \psi_1^* B + \psi_2^* B^2 + \psi_3^* B^3 + \ldots \qquad (\psi_0^* = 1). \qquad (41)$$

The mean squared prediction error of the optimal predictor $P_t X_{t+h}$ (the predictor of $X_{t+h}$ given the infinite past $X_s$, $s \leq t$) is given by
$$MSPE = E\{(X_{t+h} - P_t X_{t+h})^2\} = \sigma^2 \sum_{j=0}^{h-1} (\psi_j^*)^2. \qquad (42)$$
Example. In the ARIMA(1,1,0) model used for the Dow Jones dataset, we have
$$\psi^*(B) = \frac{1}{(1 - B)(1 - \phi_1 B)} = (1 + B + B^2 + B^3 + \ldots)(1 + \phi_1 B + \phi_1^2 B^2 + \phi_1^3 B^3 + \ldots)$$
$$= 1 + (1 + \phi_1) B + (1 + \phi_1 + \phi_1^2) B^2 + (1 + \phi_1 + \phi_1^2 + \phi_1^3) B^3 + \ldots$$
Thus we have
$$\psi_0^* = 1, \quad \psi_1^* = 1 + \phi_1, \quad \psi_2^* = 1 + \phi_1 + \phi_1^2, \quad \psi_3^* = 1 + \phi_1 + \phi_1^2 + \phi_1^3, \qquad (43)$$
etc. When we substitute our earlier estimates of $\phi_1$ and $\sigma^2$ into (43) and then (42), the resulting values of MSPE for h = 1, 2, 3, 4 agree to the first five decimal places with the ITSM results given earlier.
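The coefficients (43) and the resulting values of (42) are easy to reproduce numerically, as in the following sketch in Python with numpy (the values of phi1 and sigma2 below are placeholders standing in for the ITSM estimates quoted above).

import numpy as np

def psi_star(phi1, hmax):
    # psi*_j = 1 + phi1 + ... + phi1^j for the ARIMA(1,1,0) model, j = 0, ..., hmax-1 (equation (43))
    return np.cumsum(phi1 ** np.arange(hmax))

def mspe(phi1, sigma2, h):
    # mean squared prediction error (42) for an h-step forecast
    psi = psi_star(phi1, h)
    return sigma2 * np.sum(psi**2)

phi1_hat, sigma2_hat = 0.45, 0.12    # placeholder values; in practice use the ITSM estimates
for h in range(1, 5):
    print(h, mspe(phi1_hat, sigma2_hat, h))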

Why does any of this work? Because the $\phi^*$ operator is not causal, the expansion (41) should not be valid: in most cases it won't even be true that $\psi_j^* \to 0$ as $j \to \infty$. (So this isn't one of those cases where the infinite series really is convergent but we just didn't take the trouble to check it. It's not convergent, period.) However it works because the expansion (41) is valid if we just look at it term by term and don't worry about convergence. We'll just do the d = 1 case here; higher values of d are similar.

Assume $Y_t = (1 - B) X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}$ where Z is white noise. Then
$$X_{t+h} = X_t + Y_{t+1} + Y_{t+2} + \ldots + Y_{t+h} = X_t + \sum_{j \geq h-1} \psi_{j-h+1} Z_{t+h-j} + \sum_{j \geq h-2} \psi_{j-h+2} Z_{t+h-j} + \ldots + \sum_{j \geq 0} \psi_j Z_{t+h-j}.$$
For $0 \leq j \leq h-1$, the coefficient of $Z_{t+h-j}$ is
$$\psi_j^* = \sum_{k=h-j}^{h} \psi_{j-h+k} = \sum_{k=0}^{j} \psi_k. \qquad (44)$$
However, in the standard ARMA notation, we define $\psi(B) = \sum_{j=0}^{\infty} \psi_j B^j = \frac{\theta(B)}{\phi(B)}$, where $\theta(\cdot)$ and $\phi(\cdot)$ are the MA and AR operators of Y. The relationship between $\psi^*(B)$ and $\psi(B)$ is
$$\psi^*(B) = \frac{\psi(B)}{1 - B} = (1 + \psi_1 B + \psi_2 B^2 + \psi_3 B^3 + \ldots)(1 + B + B^2 + B^3 + \ldots) = 1 + (1 + \psi_1) B + (1 + \psi_1 + \psi_2) B^2 + (1 + \psi_1 + \psi_2 + \psi_3) B^3 + \ldots \qquad (45)$$
Comparing (44) and (45), we see that $\psi_j^*$ is precisely the coefficient of $B^j$ in (45), agreeing with (41).

The final step is to explain why (44) implies (42). By the definition of $\psi_j^*$ in (44), we have
$$X_{t+h} = X_t + \sum_{j=0}^{h-1} \psi_j^* Z_{t+h-j} + \text{terms that depend on } Z_s, \ s \leq t. \qquad (46)$$
Since, as noted many times through the course, the predictor of $X_{t+h}$ from time t is essentially defined by setting all the values of $Z_{t+h-j}$, $j = 0, \ldots, h-1$, equal to 0, we will have
$$X_{t+h} - P_t X_{t+h} = \sum_{j=0}^{h-1} \psi_j^* Z_{t+h-j},$$
and the result (42) follows.

2.5 Seasonal ARIMA Models

General motivation: If a time series is seasonal with period s, then we may have correlations that operate at multiples of s as well as, or instead of, the usual autocorrelations at small lags. This suggests time series models where the autoregressive and moving average operators include terms in $B^s$ as well as B.

Comment 1. $B^s X_t = X_{t-s}$. Thus, terms of this form reflect effects that take place at time intervals corresponding to complete cycles of a cyclic/seasonal process, but without assuming the process is exactly cyclic (a deficiency of many simple trend + seasonal factor + noise models). Example: Seasonal economic variables such as house prices are often compared with the corresponding month in the previous year, rather than with the most recent month. Thus, statistically we are interested in $X_t - X_{t-12}$ rather than $X_t - X_{t-1}$. That, in turn, suggests trying to model the series in terms of $(1 - B^{12}) X_t$ rather than $(1 - B) X_t$.

Comment 2. Throughout our discussion of seasonal time series, we shall assume that the period s is known. Thus for monthly data with an annual cycle, s = 12. If s is unknown then we really need a different theory (although we didn't cover Chapter 4 in this course, that chapter, on spectral analysis, covers the techniques that are required for a systematic treatment).

Now let's describe the models we are using. The general model is called SARIMA, for seasonal autoregressive integrated moving average.

The order of the model is written
$$(p, d, q) \times (P, D, Q)_s$$
to indicate that the regular ARIMA components are of orders p, d, q, and the seasonal ARIMA components (those that depend on $B^s$) are of orders P, D, Q. The form of the model is
$$\phi(B) \Phi(B^s) (1 - B)^d (1 - B^s)^D X_t = \theta(B) \Theta(B^s) Z_t, \qquad (47)$$
where
$$\phi(B) = 1 - \phi_1 B - \ldots - \phi_p B^p, \qquad \Phi(B^s) = 1 - \Phi_1 B^s - \ldots - \Phi_P B^{Ps},$$
$$\theta(B) = 1 + \theta_1 B + \ldots + \theta_q B^q, \qquad \Theta(B^s) = 1 + \Theta_1 B^s + \ldots + \Theta_Q B^{Qs}. \qquad (48)$$

2.5.1 Fitting SARIMA Models

The algorithms in ITSM rely on the fact that any SARIMA model can be rewritten as a regular ARIMA model but with constraints on the parameters. As an example, consider the ARIMA$(0, 0, 1) \times (0, 0, 1)_{12}$ model
$$Y_t = (1 + \theta_1 B)(1 + \Theta_1 B^{12}) Z_t = (1 + \theta_1 B + \Theta_1 B^{12} + \theta_1 \Theta_1 B^{13}) Z_t. \qquad (49)$$
This is an MA(13) model where we fix $\theta_2 = \theta_3 = \ldots = \theta_{11} = 0$, identify $\theta_{12}$ with $\Theta_1$, and fix $\theta_{13} = \theta_1 \theta_{12}$. We can fit a model of this form in ITSM by maximum likelihood, using the Constrain optimization option.
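The rewriting of (49) as a constrained MA(13) can be checked by multiplying out the two polynomials, as in this sketch in Python with numpy (the numerical values of theta1 and Theta1 are arbitrary illustrations, not estimates from any dataset).

import numpy as np

theta1, Theta1 = 0.3, -0.5   # arbitrary illustrative values

regular = np.array([1.0, theta1])      # coefficients of 1 + theta_1 B
seasonal = np.zeros(13)
seasonal[0] = 1.0
seasonal[12] = Theta1                  # coefficients of 1 + Theta_1 B^12

# Convolving the coefficient sequences multiplies the polynomials.
product = np.convolve(regular, seasonal)
# product[j] is the coefficient of B^j: it is nonzero only at j = 0, 1, 12, 13,
# and product[13] equals theta1*Theta1, i.e. the constraint theta_13 = theta_1 * theta_12.
print(product)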

Example. Consider the DEATHS dataset from Problem 3.9. We first difference at lag 1 and at lag 12, forming the series $Y_t = (1 - B)(1 - B^{12}) X_t$. In Problem 3.9, we considered the model
$$Y_t = (1 + \theta_1 B + \theta_{12} B^{12}) Z_t. \qquad (50)$$
The question naturally arises: which is better, (49) or (50)?

As a start, let us fit the MA(12) model without any constraints. In ITSM, under the Innovations algorithm we get a table of the 12 estimated MA coefficients, together with the ratio of each MA coefficient to 1.96 times its standard error. This shows clearly that of all the MA coefficients, only $\theta_1$ and $\theta_{12}$ are statistically significant. We therefore refit the model using only those coefficients, i.e. model (50).

In ITSM, first select Specify to select the initial MA(12) model, with all coefficients 0 (you may need to reset the model to p = q = 0 and then use Specify again to achieve this). Next, select Estimation followed by Max likelihood followed by Constrain optimization. Highlight the coefficients Theta(2) through Theta(11) (in other words, these will be in blue, while Theta(1) and Theta(12) are white). Return to the main box and click OK. The model is fitted and leads to estimates $\hat{\theta}_1$ and $\hat{\theta}_{12}$ with their standard errors, together with the values of -2Log(Likelihood) and AICC. (Your answers may differ very slightly from these but should not differ by very much.)

Now let's fit the model (49). To do this, first return to Specify and set the model as MA(13), with all coefficients 0. Follow Estimation, Max likelihood and Constrain optimization, and highlight the coefficients Theta(2) through Theta(11) in blue, as before. Then, under Specify multiplicative relations, enter 1 in the box Number of relations. In the next line, enter 1 12 = 13. Click OK to complete the specification and then OK a second time to fit the model. You will now find a fitted model with estimates $\hat{\theta}_1$, $\hat{\theta}_{12}$ and $\hat{\theta}_{13}$ (note that these satisfy the constraint $\theta_{13} = \theta_1 \theta_{12}$), together with standard errors for $\theta_1$ and $\theta_{12}$. In this case -2Log(Likelihood) is 849.1, and AICC is 855.5, both slightly smaller (i.e. better) than the corresponding values from model (50). Our final conclusion is therefore that model (49) is best, with the estimates $\hat{\theta}_1$ and $\hat{\Theta}_1$ (and their standard errors) as reported by ITSM.

2.6 Regression with ARMA Errors

Model:
$$Y_t = \sum_{j=1}^{k} x_{tj} \beta_j + W_t, \qquad \phi(B) W_t = \theta(B) Z_t, \qquad Z_t \sim WN[0, \sigma^2]. \qquad (51)$$
In matrix notation,
$$Y = X\beta + W, \qquad (52)$$
where (in contrast to the usual situation with linear regression) the error vector W is not uncorrelated noise but has some non-trivial covariance matrix $\Gamma_n$. Since $W_t$ is ARMA, $\Gamma_n$ itself may be computed by methods seen earlier in this course. The question for the present section is: how does this affect the estimation of the regression component, i.e. the vector $\beta$?

Definition of ordinary least squares (OLS):
$$\hat{\beta}_{OLS} = (X^T X)^{-1} X^T Y, \qquad \mathrm{Cov}\left(\hat{\beta}_{OLS}\right) = (X^T X)^{-1} X^T \Gamma_n X (X^T X)^{-1}. \qquad (53)$$
In the special case $\Gamma_n = \sigma^2 I_n$, the covariance in (53) reduces to $(X^T X)^{-1} \sigma^2$, the usual formula in regression. The alternative is generalized least squares (GLS):
$$\hat{\beta}_{GLS} = (X^T \Gamma_n^{-1} X)^{-1} X^T \Gamma_n^{-1} Y, \qquad \mathrm{Cov}\left(\hat{\beta}_{GLS}\right) = (X^T \Gamma_n^{-1} X)^{-1}. \qquad (54)$$
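To connect the formulas (53) and (54) to computation, here is a sketch in Python with numpy (everything here is invented for illustration: the linear-trend design, the AR(1) error covariance, the parameter values, and the assumption that $\Gamma_n$ is known exactly).

import numpy as np

rng = np.random.default_rng(6)
n = 120
t = np.arange(1, n + 1)
X = np.column_stack([np.ones(n), t])    # design matrix: intercept + linear trend
beta_true = np.array([10.0, 0.05])

# AR(1) errors W_t, so Gamma_n has entries sigma2 * rho^|i-j| / (1 - rho^2)
rho, sigma2 = 0.7, 1.0
lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Gamma_n = sigma2 * rho**lags / (1.0 - rho**2)
W = np.linalg.cholesky(Gamma_n) @ rng.normal(size=n)
Y = X @ beta_true + W

# OLS estimator, equation (53)
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# GLS estimator and its covariance, equation (54), treating Gamma_n as known
GiX = np.linalg.solve(Gamma_n, X)
GiY = np.linalg.solve(Gamma_n, Y)
beta_gls = np.linalg.solve(X.T @ GiX, X.T @ GiY)
cov_gls = np.linalg.inv(X.T @ GiX)

print(beta_ols, beta_gls, np.sqrt(np.diag(cov_gls)))

In practice $\Gamma_n$ is not known and has to be estimated from the data, which is exactly what the maximum likelihood and two-stage procedures described below address.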

The GLS estimator (54) is BLUE (Best Linear Unbiased Estimator) in the sense that if $\tilde{\beta}$ is some other unbiased linear estimator then, for all vectors c of length k,
$$\mathrm{Var}(c^T \tilde{\beta}) \geq \mathrm{Var}(c^T \hat{\beta}_{GLS}). \qquad (55)$$
In particular, (55) applies when $\tilde{\beta} = \hat{\beta}_{OLS}$.

2.6.1 Maximum Likelihood Estimation

The general principle is a direct extension of the method of maximum likelihood given in Chapter 5. We can write the covariance matrix $\Gamma_n$ as $\Gamma_n(\phi, \theta, \sigma^2)$ to indicate explicitly the dependence on the ARMA parameters $\phi, \theta, \sigma^2$. Of course, the vector of regression coefficients, $\beta$, is also an unknown parameter. The joint likelihood is
$$L(\beta, \phi, \theta, \sigma^2) = (2\pi)^{-n/2} (\det \Gamma_n)^{-1/2} \exp\left\{-\frac{1}{2} (Y - X\beta)^T \Gamma_n^{-1} (Y - X\beta)\right\}. \qquad (56)$$
This is a direct extension of formula (5.2.1) of the course text, where instead of $X_n$ (which, in the context of formula (5.2.1), represented a time series of length n) we now write $Y - X\beta$ to denote the residuals of the observed series Y on the regression function $X\beta$. The maximum likelihood estimates are those which maximize (56). These are found numerically, through search algorithms similar to those used for ARMA processes without the regression component.

In practice, we often do a two-stage fit: first estimate $\beta$ by OLS, then maximize (56) with respect to $\phi, \theta, \sigma^2$ holding $\beta$ fixed (this part is operationally identical with the regular maximum likelihood procedure found in Chapter 5), then repeat the estimation of $\beta$ using GLS. This is more or less the way you have to do it in ITSM. If desired, the process can be repeated to improve the estimation of $\phi, \theta, \sigma^2$.

Example 1: Lake Huron Data

This is the LAKE dataset that we saw previously; refer to pages 6 and 7 of the first set of course notes, where we pointed out (a) that there is an apparently significant linear trend; (b) that the residuals are autocorrelated in a way apparently consistent with an AR(2) model; (c) that the regression analysis could lead to different results if the autocorrelation was taken into account. However, at that point of the course, we did not have a means of estimating a regression function with autocorrelated errors.

The first and obvious analysis is a linear trend of the form
$$Y_t = \beta_1 + \beta_2 t + W_t. \qquad (57)$$
In the context of (51), this is equivalent to taking k = 2, $x_{t1} = 1$, $x_{t2} = t$. Model (57) could be fitted as a linear regression using SAS, or directly in ITSM, by the following procedure: After loading the data from LAKE.TSM into ITSM, click on Regression, then Specify. A window comes up: after Polynomial Regression, in the box labelled Enter Order, change 0 to 1.


A SARIMAX coupled modelling applied to individual load curves intraday forecasting A SARIMAX coupled modelling applied to individual load curves intraday forecasting Frédéric Proïa Workshop EDF Institut Henri Poincaré - Paris 05 avril 2012 INRIA Bordeaux Sud-Ouest Institut de Mathématiques

More information

{ } Stochastic processes. Models for time series. Specification of a process. Specification of a process. , X t3. ,...X tn }

{ } Stochastic processes. Models for time series. Specification of a process. Specification of a process. , X t3. ,...X tn } Stochastic processes Time series are an example of a stochastic or random process Models for time series A stochastic process is 'a statistical phenomenon that evolves in time according to probabilistic

More information

Problem Set 2 Solution Sketches Time Series Analysis Spring 2010

Problem Set 2 Solution Sketches Time Series Analysis Spring 2010 Problem Set 2 Solution Sketches Time Series Analysis Spring 2010 Forecasting 1. Let X and Y be two random variables such that E(X 2 ) < and E(Y 2 )

More information

5 Transfer function modelling

5 Transfer function modelling MSc Further Time Series Analysis 5 Transfer function modelling 5.1 The model Consider the construction of a model for a time series (Y t ) whose values are influenced by the earlier values of a series

More information

Statistics 910, #5 1. Regression Methods

Statistics 910, #5 1. Regression Methods Statistics 910, #5 1 Overview Regression Methods 1. Idea: effects of dependence 2. Examples of estimation (in R) 3. Review of regression 4. Comparisons and relative efficiencies Idea Decomposition Well-known

More information

Stat 5100 Handout #12.e Notes: ARIMA Models (Unit 7) Key here: after stationary, identify dependence structure (and use for forecasting)

Stat 5100 Handout #12.e Notes: ARIMA Models (Unit 7) Key here: after stationary, identify dependence structure (and use for forecasting) Stat 5100 Handout #12.e Notes: ARIMA Models (Unit 7) Key here: after stationary, identify dependence structure (and use for forecasting) (overshort example) White noise H 0 : Let Z t be the stationary

More information

Prof. Dr. Roland Füss Lecture Series in Applied Econometrics Summer Term Introduction to Time Series Analysis

Prof. Dr. Roland Füss Lecture Series in Applied Econometrics Summer Term Introduction to Time Series Analysis Introduction to Time Series Analysis 1 Contents: I. Basics of Time Series Analysis... 4 I.1 Stationarity... 5 I.2 Autocorrelation Function... 9 I.3 Partial Autocorrelation Function (PACF)... 14 I.4 Transformation

More information

ARIMA Modelling and Forecasting

ARIMA Modelling and Forecasting ARIMA Modelling and Forecasting Economic time series often appear nonstationary, because of trends, seasonal patterns, cycles, etc. However, the differences may appear stationary. Δx t x t x t 1 (first

More information

Parameter estimation: ACVF of AR processes

Parameter estimation: ACVF of AR processes Parameter estimation: ACVF of AR processes Yule-Walker s for AR processes: a method of moments, i.e. µ = x and choose parameters so that γ(h) = ˆγ(h) (for h small ). 12 novembre 2013 1 / 8 Parameter estimation:

More information

3 Theory of stationary random processes

3 Theory of stationary random processes 3 Theory of stationary random processes 3.1 Linear filters and the General linear process A filter is a transformation of one random sequence {U t } into another, {Y t }. A linear filter is a transformation

More information

SOME BASICS OF TIME-SERIES ANALYSIS

SOME BASICS OF TIME-SERIES ANALYSIS SOME BASICS OF TIME-SERIES ANALYSIS John E. Floyd University of Toronto December 8, 26 An excellent place to learn about time series analysis is from Walter Enders textbook. For a basic understanding of

More information

Applied time-series analysis

Applied time-series analysis Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna October 18, 2011 Outline Introduction and overview Econometric Time-Series Analysis In principle,

More information

Classical Decomposition Model Revisited: I

Classical Decomposition Model Revisited: I Classical Decomposition Model Revisited: I recall classical decomposition model for time series Y t, namely, Y t = m t + s t + W t, where m t is trend; s t is periodic with known period s (i.e., s t s

More information

Time Series 2. Robert Almgren. Sept. 21, 2009

Time Series 2. Robert Almgren. Sept. 21, 2009 Time Series 2 Robert Almgren Sept. 21, 2009 This week we will talk about linear time series models: AR, MA, ARMA, ARIMA, etc. First we will talk about theory and after we will talk about fitting the models

More information

Lesson 13: Box-Jenkins Modeling Strategy for building ARMA models

Lesson 13: Box-Jenkins Modeling Strategy for building ARMA models Lesson 13: Box-Jenkins Modeling Strategy for building ARMA models Facoltà di Economia Università dell Aquila umberto.triacca@gmail.com Introduction In this lesson we present a method to construct an ARMA(p,

More information

Estimation and application of best ARIMA model for forecasting the uranium price.

Estimation and application of best ARIMA model for forecasting the uranium price. Estimation and application of best ARIMA model for forecasting the uranium price. Medeu Amangeldi May 13, 2018 Capstone Project Superviser: Dongming Wei Second reader: Zhenisbek Assylbekov Abstract This

More information

The ARIMA Procedure: The ARIMA Procedure

The ARIMA Procedure: The ARIMA Procedure Page 1 of 120 Overview: ARIMA Procedure Getting Started: ARIMA Procedure The Three Stages of ARIMA Modeling Identification Stage Estimation and Diagnostic Checking Stage Forecasting Stage Using ARIMA Procedure

More information

STAT 443 (Winter ) Forecasting

STAT 443 (Winter ) Forecasting Winter 2014 TABLE OF CONTENTS STAT 443 (Winter 2014-1141) Forecasting Prof R Ramezan University of Waterloo L A TEXer: W KONG http://wwkonggithubio Last Revision: September 3, 2014 Table of Contents 1

More information

Chapter 9: Forecasting

Chapter 9: Forecasting Chapter 9: Forecasting One of the critical goals of time series analysis is to forecast (predict) the values of the time series at times in the future. When forecasting, we ideally should evaluate the

More information

Decision 411: Class 9. HW#3 issues

Decision 411: Class 9. HW#3 issues Decision 411: Class 9 Presentation/discussion of HW#3 Introduction to ARIMA models Rules for fitting nonseasonal models Differencing and stationarity Reading the tea leaves : : ACF and PACF plots Unit

More information

Multivariate Time Series Analysis and Its Applications [Tsay (2005), chapter 8]

Multivariate Time Series Analysis and Its Applications [Tsay (2005), chapter 8] 1 Multivariate Time Series Analysis and Its Applications [Tsay (2005), chapter 8] Insights: Price movements in one market can spread easily and instantly to another market [economic globalization and internet

More information

Time Series 4. Robert Almgren. Oct. 5, 2009

Time Series 4. Robert Almgren. Oct. 5, 2009 Time Series 4 Robert Almgren Oct. 5, 2009 1 Nonstationarity How should you model a process that has drift? ARMA models are intrinsically stationary, that is, they are mean-reverting: when the value of

More information

1 Introduction to Generalized Least Squares

1 Introduction to Generalized Least Squares ECONOMICS 7344, Spring 2017 Bent E. Sørensen April 12, 2017 1 Introduction to Generalized Least Squares Consider the model Y = Xβ + ɛ, where the N K matrix of regressors X is fixed, independent of the

More information

A time series is called strictly stationary if the joint distribution of every collection (Y t

A time series is called strictly stationary if the joint distribution of every collection (Y t 5 Time series A time series is a set of observations recorded over time. You can think for example at the GDP of a country over the years (or quarters) or the hourly measurements of temperature over a

More information

Chapter 3: Regression Methods for Trends

Chapter 3: Regression Methods for Trends Chapter 3: Regression Methods for Trends Time series exhibiting trends over time have a mean function that is some simple function (not necessarily constant) of time. The example random walk graph from

More information

7. Integrated Processes

7. Integrated Processes 7. Integrated Processes Up to now: Analysis of stationary processes (stationary ARMA(p, q) processes) Problem: Many economic time series exhibit non-stationary patterns over time 226 Example: We consider

More information

ARIMA Models. Jamie Monogan. January 25, University of Georgia. Jamie Monogan (UGA) ARIMA Models January 25, / 38

ARIMA Models. Jamie Monogan. January 25, University of Georgia. Jamie Monogan (UGA) ARIMA Models January 25, / 38 ARIMA Models Jamie Monogan University of Georgia January 25, 2012 Jamie Monogan (UGA) ARIMA Models January 25, 2012 1 / 38 Objectives By the end of this meeting, participants should be able to: Describe

More information

7. Forecasting with ARIMA models

7. Forecasting with ARIMA models 7. Forecasting with ARIMA models 309 Outline: Introduction The prediction equation of an ARIMA model Interpreting the predictions Variance of the predictions Forecast updating Measuring predictability

More information

Lecture 3: Autoregressive Moving Average (ARMA) Models and their Practical Applications

Lecture 3: Autoregressive Moving Average (ARMA) Models and their Practical Applications Lecture 3: Autoregressive Moving Average (ARMA) Models and their Practical Applications Prof. Massimo Guidolin 20192 Financial Econometrics Winter/Spring 2018 Overview Moving average processes Autoregressive

More information

Some Time-Series Models

Some Time-Series Models Some Time-Series Models Outline 1. Stochastic processes and their properties 2. Stationary processes 3. Some properties of the autocorrelation function 4. Some useful models Purely random processes, random

More information

11. Further Issues in Using OLS with TS Data

11. Further Issues in Using OLS with TS Data 11. Further Issues in Using OLS with TS Data With TS, including lags of the dependent variable often allow us to fit much better the variation in y Exact distribution theory is rarely available in TS applications,

More information

Automatic seasonal auto regressive moving average models and unit root test detection

Automatic seasonal auto regressive moving average models and unit root test detection ISSN 1750-9653, England, UK International Journal of Management Science and Engineering Management Vol. 3 (2008) No. 4, pp. 266-274 Automatic seasonal auto regressive moving average models and unit root

More information

ARIMA Models. Jamie Monogan. January 16, University of Georgia. Jamie Monogan (UGA) ARIMA Models January 16, / 27

ARIMA Models. Jamie Monogan. January 16, University of Georgia. Jamie Monogan (UGA) ARIMA Models January 16, / 27 ARIMA Models Jamie Monogan University of Georgia January 16, 2018 Jamie Monogan (UGA) ARIMA Models January 16, 2018 1 / 27 Objectives By the end of this meeting, participants should be able to: Argue why

More information

AR, MA and ARMA models

AR, MA and ARMA models AR, MA and AR by Hedibert Lopes P Based on Tsay s Analysis of Financial Time Series (3rd edition) P 1 Stationarity 2 3 4 5 6 7 P 8 9 10 11 Outline P Linear Time Series Analysis and Its Applications For

More information

Dynamic Time Series Regression: A Panacea for Spurious Correlations

Dynamic Time Series Regression: A Panacea for Spurious Correlations International Journal of Scientific and Research Publications, Volume 6, Issue 10, October 2016 337 Dynamic Time Series Regression: A Panacea for Spurious Correlations Emmanuel Alphonsus Akpan *, Imoh

More information

Lecture 4a: ARMA Model

Lecture 4a: ARMA Model Lecture 4a: ARMA Model 1 2 Big Picture Most often our goal is to find a statistical model to describe real time series (estimation), and then predict the future (forecasting) One particularly popular model

More information

7. Integrated Processes

7. Integrated Processes 7. Integrated Processes Up to now: Analysis of stationary processes (stationary ARMA(p, q) processes) Problem: Many economic time series exhibit non-stationary patterns over time 226 Example: We consider

More information

Econometría 2: Análisis de series de Tiempo

Econometría 2: Análisis de series de Tiempo Econometría 2: Análisis de series de Tiempo Karoll GOMEZ kgomezp@unal.edu.co http://karollgomez.wordpress.com Segundo semestre 2016 IX. Vector Time Series Models VARMA Models A. 1. Motivation: The vector

More information

Seasonal Autoregressive Integrated Moving Average Model for Precipitation Time Series

Seasonal Autoregressive Integrated Moving Average Model for Precipitation Time Series Journal of Mathematics and Statistics 8 (4): 500-505, 2012 ISSN 1549-3644 2012 doi:10.3844/jmssp.2012.500.505 Published Online 8 (4) 2012 (http://www.thescipub.com/jmss.toc) Seasonal Autoregressive Integrated

More information

5 Autoregressive-Moving-Average Modeling

5 Autoregressive-Moving-Average Modeling 5 Autoregressive-Moving-Average Modeling 5. Purpose. Autoregressive-moving-average (ARMA models are mathematical models of the persistence, or autocorrelation, in a time series. ARMA models are widely

More information

System Identification

System Identification System Identification Arun K. Tangirala Department of Chemical Engineering IIT Madras July 26, 2013 Module 6 Lecture 1 Arun K. Tangirala System Identification July 26, 2013 1 Objectives of this Module

More information

Time Series: Theory and Methods

Time Series: Theory and Methods Peter J. Brockwell Richard A. Davis Time Series: Theory and Methods Second Edition With 124 Illustrations Springer Contents Preface to the Second Edition Preface to the First Edition vn ix CHAPTER 1 Stationary

More information

4. MA(2) +drift: y t = µ + ɛ t + θ 1 ɛ t 1 + θ 2 ɛ t 2. Mean: where θ(l) = 1 + θ 1 L + θ 2 L 2. Therefore,

4. MA(2) +drift: y t = µ + ɛ t + θ 1 ɛ t 1 + θ 2 ɛ t 2. Mean: where θ(l) = 1 + θ 1 L + θ 2 L 2. Therefore, 61 4. MA(2) +drift: y t = µ + ɛ t + θ 1 ɛ t 1 + θ 2 ɛ t 2 Mean: y t = µ + θ(l)ɛ t, where θ(l) = 1 + θ 1 L + θ 2 L 2. Therefore, E(y t ) = µ + θ(l)e(ɛ t ) = µ 62 Example: MA(q) Model: y t = ɛ t + θ 1 ɛ

More information

Chapter 8: Model Diagnostics

Chapter 8: Model Diagnostics Chapter 8: Model Diagnostics Model diagnostics involve checking how well the model fits. If the model fits poorly, we consider changing the specification of the model. A major tool of model diagnostics

More information

Part 1. Multiple Choice (50 questions, 1 point each) Part 2. Problems/Short Answer (10 questions, 5 points each)

Part 1. Multiple Choice (50 questions, 1 point each) Part 2. Problems/Short Answer (10 questions, 5 points each) GROUND RULES: This exam contains two parts: Part 1. Multiple Choice (50 questions, 1 point each) Part 2. Problems/Short Answer (10 questions, 5 points each) The maximum number of points on this exam is

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information

10. Time series regression and forecasting

10. Time series regression and forecasting 10. Time series regression and forecasting Key feature of this section: Analysis of data on a single entity observed at multiple points in time (time series data) Typical research questions: What is the

More information

Moving Average (MA) representations

Moving Average (MA) representations Moving Average (MA) representations The moving average representation of order M has the following form v[k] = MX c n e[k n]+e[k] (16) n=1 whose transfer function operator form is MX v[k] =H(q 1 )e[k],

More information

Econometrics II Heij et al. Chapter 7.1

Econometrics II Heij et al. Chapter 7.1 Chapter 7.1 p. 1/2 Econometrics II Heij et al. Chapter 7.1 Linear Time Series Models for Stationary data Marius Ooms Tinbergen Institute Amsterdam Chapter 7.1 p. 2/2 Program Introduction Modelling philosophy

More information

Forecasting using R. Rob J Hyndman. 3.2 Dynamic regression. Forecasting using R 1

Forecasting using R. Rob J Hyndman. 3.2 Dynamic regression. Forecasting using R 1 Forecasting using R Rob J Hyndman 3.2 Dynamic regression Forecasting using R 1 Outline 1 Regression with ARIMA errors 2 Stochastic and deterministic trends 3 Periodic seasonality 4 Lab session 14 5 Dynamic

More information

Module 3. Descriptive Time Series Statistics and Introduction to Time Series Models

Module 3. Descriptive Time Series Statistics and Introduction to Time Series Models Module 3 Descriptive Time Series Statistics and Introduction to Time Series Models Class notes for Statistics 451: Applied Time Series Iowa State University Copyright 2015 W Q Meeker November 11, 2015

More information

Marcel Dettling. Applied Time Series Analysis SS 2013 Week 05. ETH Zürich, March 18, Institute for Data Analysis and Process Design

Marcel Dettling. Applied Time Series Analysis SS 2013 Week 05. ETH Zürich, March 18, Institute for Data Analysis and Process Design Marcel Dettling Institute for Data Analysis and Process Design Zurich University of Applied Sciences marcel.dettling@zhaw.ch http://stat.ethz.ch/~dettling ETH Zürich, March 18, 2013 1 Basics of Modeling

More information

Lecture 6a: Unit Root and ARIMA Models

Lecture 6a: Unit Root and ARIMA Models Lecture 6a: Unit Root and ARIMA Models 1 2 Big Picture A time series is non-stationary if it contains a unit root unit root nonstationary The reverse is not true. For example, y t = cos(t) + u t has no

More information

Ross Bettinger, Analytical Consultant, Seattle, WA

Ross Bettinger, Analytical Consultant, Seattle, WA ABSTRACT DYNAMIC REGRESSION IN ARIMA MODELING Ross Bettinger, Analytical Consultant, Seattle, WA Box-Jenkins time series models that contain exogenous predictor variables are called dynamic regression

More information

Econometric Forecasting

Econometric Forecasting Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna October 1, 2014 Outline Introduction Model-free extrapolation Univariate time-series models Trend

More information

Chapter 6: Model Specification for Time Series

Chapter 6: Model Specification for Time Series Chapter 6: Model Specification for Time Series The ARIMA(p, d, q) class of models as a broad class can describe many real time series. Model specification for ARIMA(p, d, q) models involves 1. Choosing

More information

Univariate Time Series Analysis; ARIMA Models

Univariate Time Series Analysis; ARIMA Models Econometrics 2 Fall 24 Univariate Time Series Analysis; ARIMA Models Heino Bohn Nielsen of4 Outline of the Lecture () Introduction to univariate time series analysis. (2) Stationarity. (3) Characterizing

More information

Modeling and forecasting global mean temperature time series

Modeling and forecasting global mean temperature time series Modeling and forecasting global mean temperature time series April 22, 2018 Abstract: An ARIMA time series model was developed to analyze the yearly records of the change in global annual mean surface

More information

Covariances of ARMA Processes

Covariances of ARMA Processes Statistics 910, #10 1 Overview Covariances of ARMA Processes 1. Review ARMA models: causality and invertibility 2. AR covariance functions 3. MA and ARMA covariance functions 4. Partial autocorrelation

More information

MEI Exam Review. June 7, 2002

MEI Exam Review. June 7, 2002 MEI Exam Review June 7, 2002 1 Final Exam Revision Notes 1.1 Random Rules and Formulas Linear transformations of random variables. f y (Y ) = f x (X) dx. dg Inverse Proof. (AB)(AB) 1 = I. (B 1 A 1 )(AB)(AB)

More information

Econ 623 Econometrics II Topic 2: Stationary Time Series

Econ 623 Econometrics II Topic 2: Stationary Time Series 1 Introduction Econ 623 Econometrics II Topic 2: Stationary Time Series In the regression model we can model the error term as an autoregression AR(1) process. That is, we can use the past value of the

More information

Vector Auto-Regressive Models

Vector Auto-Regressive Models Vector Auto-Regressive Models Laurent Ferrara 1 1 University of Paris Nanterre M2 Oct. 2018 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions

More information

VAR Models and Applications

VAR Models and Applications VAR Models and Applications Laurent Ferrara 1 1 University of Paris West M2 EIPMC Oct. 2016 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions

More information

Homework 4. 1 Data analysis problems

Homework 4. 1 Data analysis problems Homework 4 1 Data analysis problems This week we will be analyzing a number of data sets. We are going to build ARIMA models using the steps outlined in class. It is also a good idea to read section 3.8

More information

Lab: Box-Jenkins Methodology - US Wholesale Price Indicator

Lab: Box-Jenkins Methodology - US Wholesale Price Indicator Lab: Box-Jenkins Methodology - US Wholesale Price Indicator In this lab we explore the Box-Jenkins methodology by applying it to a time-series data set comprising quarterly observations of the US Wholesale

More information

Model Estimation Example

Model Estimation Example Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions

More information

Multivariate Time Series

Multivariate Time Series Multivariate Time Series Notation: I do not use boldface (or anything else) to distinguish vectors from scalars. Tsay (and many other writers) do. I denote a multivariate stochastic process in the form

More information

Part 1. Multiple Choice (40 questions, 1 point each) Part 2. Problems/Short Answer (10 questions, 6 points each)

Part 1. Multiple Choice (40 questions, 1 point each) Part 2. Problems/Short Answer (10 questions, 6 points each) GROUND RULES: This exam contains two parts: Part 1. Multiple Choice (40 questions, 1 point each) Part 2. Problems/Short Answer (10 questions, 6 points each) The maximum number of points on this exam is

More information

Elements of Multivariate Time Series Analysis

Elements of Multivariate Time Series Analysis Gregory C. Reinsel Elements of Multivariate Time Series Analysis Second Edition With 14 Figures Springer Contents Preface to the Second Edition Preface to the First Edition vii ix 1. Vector Time Series

More information