UNIVERSITY OF UTAH GUIDED READING: TIME SERIES
Problems from Chapter 3 of Shumway and Stoffer's Book
Author: Curtis MILLER
Supervisor: Prof. Lajos HORVATH
November 10, 2015
UNIVERSITY OF UTAH, DEPARTMENT OF MATHEMATICS

ARIMA MODELS

1 ESTIMATION

1.1 AR(2) MODEL FOR cmort

To estimate the AR(2) process, I first use ordinary least squares (OLS) and then the Yule-Walker estimator. This is shown in the R code below:

# OLS estimate
# demean = T results in looking at cmort - mean(cmort)
cmort.ar2.ols <- ar.ols(cmort, order = 2, demean = T)

# Yule-Walker estimate
cmort.ar2.yw <- ar.yw(cmort, order = 2, demean = T)

1.1.1 PARAMETER ESTIMATE COMPARISON

# OLS estimate
cmort.ar2.ols

Call:
ar.ols(x = cmort, order.max = 2, demean = T)
Coefficients:
      1       2
 0.4286  0.4418

Intercept: -0.04672 (0.2527)

Order selected 2  sigma^2 estimated as  32.32

# Yule-Walker estimate
cmort.ar2.yw

Call:
ar.yw.default(x = cmort, order.max = 2, demean = T)

Coefficients:
      1       2
 0.4339  0.4376

Order selected 2  sigma^2 estimated as  32.84

Looking at the coefficients of the AR(2) model estimated by the two methods, I see very little difference: OLS and Yule-Walker estimation produce nearly identical results.

1.1.2 STANDARD ERROR COMPARISON

# The standard errors of the OLS estimates
cmort.ar2.ols$asy.se.coef$ar
[1] 0.03979433 0.03976163

# The asymptotic covariance matrix of the Yule-Walker estimates
cmort.ar2.yw$asy.var.coef
             [,1]         [,2]
[1,]  0.001601043 -0.001235314
[2,] -0.001235314  0.001601043

# Corresponding standard error of both parameters
sqrt(cmort.ar2.yw$asy.var.coef[1,1])
[1] 0.04001303

Looking at the R output above, the two methods give essentially the same standard errors (about 0.040 for each coefficient). Note that the Yule-Walker covariance matrix has equal diagonal entries, so both parameters share the same standard error.
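The Yule-Walker estimates above solve the sample Yule-Walker equations ρ(1) = φ1 + φ2 ρ(1) and ρ(2) = φ1 ρ(1) + φ2. As a quick arithmetic check (done in Python, since it is only a 2x2 solve, not part of the R analysis), the fitted coefficients imply autocorrelations at lags 1 and 2, and inverting the system recovers the coefficients:

```python
# Yule-Walker for AR(2): rho(1) = phi1 + phi2*rho(1), rho(2) = phi1*rho(1) + phi2
phi1, phi2 = 0.4339, 0.4376          # Yule-Walker estimates reported above

# Autocorrelations implied by the fitted model
rho1 = phi1 / (1 - phi2)
rho2 = phi1 * rho1 + phi2

# Recover (phi1, phi2) from (rho1, rho2) by solving the 2x2 system
# [ 1    rho1 ] [phi1]   [rho1]
# [ rho1  1   ] [phi2] = [rho2]
det = 1 - rho1**2
phi1_hat = (rho1 - rho1 * rho2) / det
phi2_hat = (rho2 - rho1**2) / det

print(round(rho1, 4), round(rho2, 4))          # implied ACF at lags 1 and 2
print(round(phi1_hat, 4), round(phi2_hat, 4))  # round trip recovers the estimates
```

The implied ρ(1) ≈ 0.77 and ρ(2) ≈ 0.77 are the sample autocorrelations the Yule-Walker fit is matching.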
1.2 AR(1) SIMULATION AND ESTIMATION

First I generate the data:

ar1.sim <- arima.sim(n = 50, list(ar = c(.99)), sd = 1)

I then estimate the parameter from the simulation using the Yule-Walker estimator.

ar1.sim.yw <- ar.yw(ar1.sim, order = 1)

# Model estimates
ar1.sim.yw

Call:
ar.yw.default(x = ar1.sim, order.max = 1)

Coefficients:
     1
0.8946

Order selected 1  sigma^2 estimated as  1.144

# Model covariance matrix
ar1.sim.yw$asy.var.coef
            [,1]
[1,] 0.004158676

Here, I would perform inference by treating the estimator as approximately Normally distributed, using the covariance matrix listed above to estimate the standard error. A bootstrap in R could be attempted as follows:

tsboot(ar1.sim, function(d) {
  return(ar.yw(d, order = 1)$ar)
}, R = 2000)

MODEL BASED BOOTSTRAP FOR TIME SERIES

Call:
tsboot(tseries = ar1.sim, statistic = function(d) {
  return(ar.yw(d, order = 1)$ar)
}, R = 2000)
Bootstrap Statistics :
      original  bias  std. error
t1*  0.8946416     0           0

The bootstrap standard error comes back as zero, while the theoretical standard error is non-zero. This indicates the bootstrap never actually resampled: with the default sim = "model", tsboot leaves the simulation to a user-supplied ran.gen function, and the default ran.gen simply returns the original series, so all 2000 replicates are identical. A proper model-based bootstrap would supply a ran.gen that simulates from the fitted AR(1).

2 INTEGRATED MODELS FOR NONSTATIONARY DATA

2.1 EWMA MODEL FOR GLACIAL VARVE DATA

Here I am interested in the varve dataset. In fact, I analyze log(varve), since I believe this may actually be a stationary process. I will be estimating an EWMA model for these data.

logvarve <- log(varve)

# EWMA for logvarve with lambda = .25, .5, .75
logvarve.ima.25 <- HoltWinters(logvarve[1:100], alpha = 1 - .25, beta = FALSE, gamma = FALSE)
logvarve.ima.5  <- HoltWinters(logvarve[1:100], alpha = 1 - .5,  beta = FALSE, gamma = FALSE)
logvarve.ima.75 <- HoltWinters(logvarve[1:100], alpha = 1 - .75, beta = FALSE, gamma = FALSE)

# Plotting results
par(mfrow = c(3, 1))
plot(logvarve.ima.25, main = "EWMA Fit with Lambda = .25")
plot(logvarve.ima.5,  main = "EWMA Fit with Lambda = .5")
plot(logvarve.ima.75, main = "EWMA Fit with Lambda = .75")

The results are shown in Figure 2.1. With a small smoothing parameter (λ), the predictions are very sensitive to the immediate past, while a high smoothing parameter leads to more stable predictions.

3 BUILDING ARIMA MODELS
Figure 2.1: EWMA fit for different smoothing parameters (three panels: observed and fitted values over time for λ = .25, .5, and .75)
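The EWMA fits in Figure 2.1 follow the one-step recursion x̃_t = (1 − λ) y_t + λ x̃_{t−1}, which is why HoltWinters is called with alpha = 1 − λ. A minimal sketch of the recursion (in Python with toy data, not the varve series):

```python
def ewma(y, lam, init=None):
    """Exponentially weighted moving average: xhat_t = (1-lam)*y_t + lam*xhat_{t-1}."""
    xhat = y[0] if init is None else init   # start the recursion at the first observation
    out = []
    for obs in y:
        xhat = (1 - lam) * obs + lam * xhat
        out.append(xhat)
    return out

series = [2.0, 3.0, 2.5, 4.0]   # toy data for illustration
print(ewma(series, lam=0.75))   # heavy smoothing: slow to react to new observations
print(ewma(series, lam=0.25))   # light smoothing: tracks the data closely
```

With λ = 0.75 the smoother barely moves toward the final jump to 4.0, while with λ = 0.25 it follows it closely, matching the behavior seen across the three panels.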
3.1 AR(1) MODEL FOR GNP DATA

Here I investigate how well an AR(1) (or, more exactly, an ARIMA(1,1,0)) model fits the natural log of U.S. GNP data. I estimate this ARIMA model:

gnpgr = diff(log(gnp))  # growth rate of GNP
gnp.model <- sarima(gnpgr, 1, 0, 0, details = F)  # AR(1) model fit

I see disturbing trends in the diagnostic plots shown in Figure 3.1. The residual plot should look like white noise, but the variance appears to decrease as the year increases, and the Q-Q plot suggests non-Normality. Fortunately, the ACF of the residuals and the p-values for the Ljung-Box statistic look as they should. Still, other models (probably ones that do not assume Gaussian white noise) may be better.

3.2 FITTING CRUDE OIL PRICES WITH AN ARIMA(p, d, q) MODEL

My objective is to fit an ARIMA(p, d, q) model for the oil dataset. I start by examining the data:

# Prepare layout
old.par <- par(mar = c(0, 0, 0, 0), oma = c(4, 4, 1, 1), mfrow = c(4, 1), cex.axis = .75)
plot(oil, xaxt = 'n'); mtext(text = "Oil Price", side = 2, line = 2, cex = .75)
plot(log(oil), xaxt = 'n'); mtext(text = "Natural Logarithm of Oil Price", side = 2, line = 2, cex = .75)
plot(diff(oil), xaxt = 'n'); mtext(text = "First Difference in Oil Price", side = 2, line = 2, cex = .75)
plot(diff(log(oil))); mtext(text = "Percent Change in Oil Price", side = 2, line = 2, cex = .75)

The first plot in Figure 3.2 shows that oil prices are clearly not a stationary process, and the variance of the process appears to increase with time. Taking the natural log of oil prices helps control the increasing variability, but not the nonstationary behavior of the series. The change in oil price from one period to the next looks more stationary, but the nonconstant variance remains. The final attempt is to look at the differences in the natural log of oil prices, which can be interpreted as the percentage change in oil prices.
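The "percentage change" reading rests on log(p_t) − log(p_{t−1}) = log(p_t / p_{t−1}) ≈ p_t / p_{t−1} − 1 when period-to-period moves are small. A quick numerical check of the approximation (in Python, with illustrative prices rather than the oil data):

```python
import math

prices = [50.0, 51.0, 49.5, 52.0]   # illustrative prices, not the oil series

# Log-differences vs. ordinary percent changes
log_diffs = [math.log(b / a) for a, b in zip(prices, prices[1:])]
pct_changes = [b / a - 1 for a, b in zip(prices, prices[1:])]

for ld, pc in zip(log_diffs, pct_changes):
    print(round(ld, 4), round(pc, 4))   # nearly identical for small moves
```

Since most period-to-period moves in the oil series are small, the bottom panel of Figure 3.2 can fairly be read as a percent-change series.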
The log-differenced series appears stationary with a mostly constant variance. However, there are large deviations around 2009, and even prior, that would lead one to conclude that the white noise is not Gaussian, which threatens estimation and inference. I now look at the ACF and PACF of diff(log(oil)):
Figure 3.1: Diagnostic plots for the AR(1) model (standardized residuals over time, ACF of residuals, normal Q-Q plot of standardized residuals, and p-values for the Ljung-Box statistic)
Figure 3.2: Basic plots of the oil series (oil price, its natural logarithm, its first difference, and its percent change, 2000-2010)
par(mar = c(0, 0, 0, 0), oma = c(4, 4, 1, 1), mfrow = c(2, 1), cex.axis = .75)
acf(diff(log(oil)), xaxt = 'n'); mtext(text = "Sample ACF", side = 2, line = 2)
pacf(diff(log(oil))); mtext(text = "Sample PACF", side = 2, line = 2)

Looking at the sample PACF in Figure 3.3, I see that the PACF is nonzero as far out as eight lags, which suggests considering AR terms out to as many as nine lags.

# capture.output() used only to make presentation easier
write(capture.output(sarima(log(oil), p = 9, d = 1, q = 0))[32:38], "")

Coefficients:
         ar1      ar2     ar3      ar4     ar5      ar6
      0.1678  -0.1189  0.1844  -0.0713  0.0486  -0.0715
s.e.  0.0429   0.0432  0.0436   0.0442  0.0444   0.0443
         ar7     ar8     ar9  constant
     -0.0158  0.1135  0.0525    0.0017

The ninth lag does not appear statistically significant, so I drop the number of lags down to eight. I now use the following ARIMA model (with diagnostic plots shown):

oil.model <- sarima(log(oil), p = 8, d = 1, q = 0, details = F)
write(capture.output(oil.model)[8:14], "")

Coefficients:
         ar1      ar2     ar3      ar4     ar5      ar6
      0.1742  -0.1200  0.1814  -0.0689  0.0448  -0.0621
s.e.  0.0426   0.0433  0.0436   0.0442  0.0443   0.0437
         ar7     ar8  constant
     -0.0218  0.1224    0.0017
s.e.  0.0435  0.0428    0.0026

The residuals clearly do not appear Gaussian; there are large price movements that make this assumption doubtful, and the Q-Q plot does not support the Normality assumption. The ACF of the residuals gets large at some distant lags but otherwise stays within the band of reasonable values. The p-values of the Ljung-Box statistics suggest that we do not have dependence in our residuals at large lags. This may be the best fit an ARIMA model can provide.
Figure 3.3: Sample ACF and PACF for the percentage change in oil price
Figure 3.4: Diagnostic plots for the ARIMA(8,1,0) model for the log(oil) series (standardized residuals, ACF of residuals, normal Q-Q plot, and Ljung-Box p-values)
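The Ljung-Box statistic behind the bottom panel of these diagnostic plots is Q = n(n + 2) Σ_{k=1}^{H} ρ̂²(k)/(n − k), referred to a chi-squared distribution with H minus the number of estimated ARMA parameters degrees of freedom. A small sketch of the computation (in Python; the residual autocorrelations here are made up for illustration):

```python
def ljung_box(resid_acf, n):
    """Ljung-Box Q statistic from residual autocorrelations rho_hat(1..H)."""
    return n * (n + 2) * sum(r**2 / (n - k) for k, r in enumerate(resid_acf, start=1))

# Hypothetical residual ACF values at lags 1 and 2 for a series of length 100
q = ljung_box([0.10, 0.05], n=100)
print(round(q, 2))   # small Q -> large p-value -> no evidence of residual dependence
```

Large residual autocorrelations inflate Q and drive the plotted p-values toward zero, which is why a row of high p-values is a good sign.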
4 REGRESSION WITH AUTOCORRELATED ERRORS

4.1 MONTHLY SALES DATA

4.1.1 ARIMA MODEL FITTING

The problem first asks for an ARIMA model for the sales data series. I first plot the series.

par(mar = c(0, 0, 0, 0), oma = c(4, 4, 1, 1), mfrow = c(3, 1), cex.axis = .75)
plot(sales, xaxt = 'n'); mtext(text = "Sales", side = 2, line = 2, cex = .75)
plot(diff(sales), xaxt = 'n')
mtext(text = "First Order Difference in Sales", side = 2, line = 2, cex = .75)
plot(diff(diff(sales))); mtext(text = "Second Order Difference in Sales", side = 2, line = 2, cex = .75)

Figure 4.1 shows the plots of the sales series. Clearly, the sales series is not stationary. Surprisingly, neither is its first difference, which still shows periodic swings. It takes second-order differencing to find a stationary series. I next examine the ACF and PACF to try to identify the order of the AR and MA terms.

par(mar = c(0, 0, 0, 0), oma = c(4, 4, 1, 1), mfrow = c(2, 1), cex.axis = .75)
acf(diff(diff(sales)), xaxt = 'n'); mtext(text = "Sample ACF", side = 2, line = 2)
pacf(diff(diff(sales))); mtext(text = "Sample PACF", side = 2, line = 2)

As shown in Figure 4.2, the sample ACF cuts off after one lag and the sample PACF appears to be tailing off, so I believe that an ARIMA(0,2,1) should provide a good fit for the data.

par(old.par)
sales.model <- sarima(sales, p = 0, d = 2, q = 1, details = F)
write(capture.output(sales.model)[8:14], "")

Coefficients:
          ma1
      -0.7480
s.e.   0.0662

sigma^2 estimated as 1.866:  log likelihood = -256.57,  aic = 517.14
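As a sanity check on the MA(1) component: an MA(1) with coefficient θ has lag-1 autocorrelation ρ(1) = θ/(1 + θ²) and zero autocorrelation beyond lag 1. Plugging in the estimate above (plain arithmetic, in Python):

```python
theta = -0.748                   # ma1 estimate from the sarima fit above
rho1 = theta / (1 + theta**2)    # theoretical lag-1 autocorrelation of an MA(1)
print(round(rho1, 3))            # roughly -0.48, consistent with the single large
                                 # negative spike at lag 1 in the sample ACF
```

This matches the pattern in Figure 4.2: one large negative ACF spike at lag 1, then values near zero.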
Figure 4.1: Basic plots of the sales series (sales, first-order difference, and second-order difference)
Figure 4.2: Sample ACF and PACF for the second-order difference in sales
Figure 4.3: Diagnostic plots for the ARIMA(0,2,1) model for the sales series (standardized residuals, ACF of residuals, normal Q-Q plot, and Ljung-Box p-values)
Looking at the diagnostic plots in Figure 4.3, the ARIMA(0,2,1) seems to fit well. The error terms appear Gaussian, there are no strong autocorrelations in the residuals, and the error terms do not appear to be dependent.

4.1.2 RELATIONSHIP BETWEEN sales AND lead

I examine the CCF of sales and lead and a lag plot of sales_t against lead_{t-3} to determine if a regression involving these variables is reasonable.

ccf(diff(sales), diff(lead), main = "CCF of sales and lead")

As seen in Figure 4.4, while sales and lead are often uncorrelated, around lag 3 they become highly correlated. This fact is emphasized by a lag plot.

lag2.plot(lead, sales, max.lag = 3)

Figure 4.5 shows a linear relationship between the third lag of lead and contemporary sales. This would justify regressing sales_t on lead_{t-3}.

4.1.3 REGRESSION WITH ARMA ERRORS

Given that the variable lead seems to provide useful information about sales, I regress sales on lead. More specifically, I regress the differenced series, diff(sales)_t on diff(lead)_{t-3}, while viewing the error term as some unknown ARMA process.

saleslead <- ts.intersect(diff(sales), lag(diff(lead), k = -3))
salesnew <- saleslead[, 1]
leadnew <- saleslead[, 2]
fit <- lm(salesnew ~ leadnew)
acf2(resid(fit))

       ACF   PACF
 [1,] 0.59   0.59
 [2,] 0.40   0.09
 [3,] 0.34   0.11
 [4,] 0.31   0.10
 [5,] 0.23  -0.02
 [6,] 0.15  -0.04
 [7,] 0.13   0.03
 [8,] 0.13   0.03
 [9,] 0.01  -0.15
[10,] 0.02   0.07
[11,] 0.09   0.10
Figure 4.4: CCF of sales and lead
Figure 4.5: Lag plot of sales against lags 0-3 of lead (sample correlations approximately 0.95, 0.95, 0.94, and 0.94)
Figure 4.6: Sample ACF and PACF for residuals from the linear fit
[12,]  0.01  -0.13
[13,] -0.01   0.03
[14,] -0.07  -0.09
[15,] -0.07  -0.04
[16,] -0.02   0.09
[17,] -0.05  -0.03
[18,] -0.03   0.02
[19,]  0.04   0.11
[20,]  0.05   0.03
[21,]  0.02  -0.07
[22,]  0.00  -0.01
[23,] -0.01  -0.04

Figure 4.6 shows the ACF and the PACF of the residuals of the "naïve" fit. The PACF cuts off after lag 1 and the ACF tails off, so the errors appear to follow an AR(1) process.

arima.fit <- sarima(salesnew, 1, 0, 0, xreg = cbind(leadnew), details = F)

As shown in Figure 4.7, the diagnostic plots for the model, with the error terms treated as an ARMA(1,0) process, look very good. Normality of the white noise residuals, the ACF of the white noise residuals, and the tests of dependence all show desirable properties.

stargazer(arima.fit$fit,
          covariate.labels = c("$\\phi$", "Intercept", "$\\Delta \\text{lead}_{t-3}$"),
          dep.var.labels = c("$\\Delta \\text{sales}_t$"),
          label = "tab:prob35h",
          title = "Coefficients of the model for $\\Delta \\text{sales}_t$",
          table.placement = "ht")

Table 4.1 shows the estimates of the coefficients of the model. The AR(1) term (φ) is statistically significant, as are the intercept and the coefficient of the lead_{t-3} term.

5 MULTIPLICATIVE SEASONAL ARIMA MODELS

5.1 ACF OF AN ARIMA(p, d, q) × (P, D, Q)_s MODEL

The problem asks for a plot of the theoretical ACF of an ARIMA(0,0,1) × (1,0,0)_12 model with Φ = 0.8 and θ = 0.5. This model is

x_t = .8 x_{t-12} + w_t + .5 w_{t-1}.    (5.1)

The ACF is computed and plotted below:
Figure 4.7: Diagnostic plots for the model for the sales series with ARMA(1,0) error terms (standardized residuals, ACF of residuals, normal Q-Q plot, and Ljung-Box p-values)
Table 4.1: Coefficients of the model for ∆sales_t (dependent variable: ∆sales_t)

  φ                  0.645 (0.063)
  Intercept          0.362 (0.177)
  ∆lead_{t-3}        2.788 (0.143)
  Observations       146
  Log Likelihood     -168.717
  σ²                 0.588
  Akaike Inf. Crit.  345.433

  Note: *p<0.1; **p<0.05; ***p<0.01

ACF <- ARMAacf(ar = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, .8), ma = c(.5))
plot(ACF, type = "h", xlab = "lag", xlim = c(1, 15), ylim = c(-.5, 1)); abline(h = 0)

Figure 5.1 shows the theoretical ACF of the process.
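ARMAacf computes this ACF numerically; the closed form for model (5.1) is ρ(12k) = Φ^k, ρ(12k ± 1) = θ Φ^k / (1 + θ²), and zero otherwise, which follows from the ψ-weights ψ_{12k} = Φ^k, ψ_{12k+1} = θ Φ^k of the causal representation. A cross-check of that claim via a truncated ψ-weight expansion (in Python):

```python
# Theoretical ACF of x_t = PHI * x_{t-12} + w_t + theta * w_{t-1}
PHI, theta = 0.8, 0.5

# psi-weights of the causal representation, truncated far enough to converge
N = 2000
psi = [0.0] * N
for k in range(N // 12):
    psi[12 * k] = PHI**k
    psi[12 * k + 1] = theta * PHI**k

def gamma(h):
    """Autocovariance via gamma(h) = sum_j psi_j * psi_{j+h} (taking sigma_w^2 = 1)."""
    return sum(psi[j] * psi[j + h] for j in range(N - h))

rho = [gamma(h) / gamma(0) for h in range(16)]
print([round(r, 3) for r in rho])
# Nonzero only at h = 0, 1, 11, 12, 13, ...: rho(1) = 0.4, rho(11) = rho(13) = 0.32, rho(12) = 0.8
```

These are exactly the spike heights visible in Figure 5.1: a spike of 0.4 at lag 1, and a cluster of 0.32, 0.8, 0.32 at lags 11-13.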
Figure 5.1: ACF of a seasonal ARIMA process