STAT 501 - Regression Methods
Unit 9 Examples

Example 1: Quake Data

Let y_t = the annual number of worldwide earthquakes with magnitude greater than 7 on the Richter scale for n = 99 years. Figure 1 gives a time series plot showing a slowly cycling pattern (gradual increases and decreases) for this dataset.

Figure 1: Minitab output for a time series plot over the 99-year time period for the quake dataset.

Identifying the Order of an Autoregression Model

Figure 2 gives a plot of the number of quakes versus the number of quakes in the previous year. (In Minitab, we used Stat > Time Series > Lag to create the column called lag1quakes.) There looks to be a moderate linear pattern, suggesting that the first-order autoregression model

y_t = β_0 + β_1 y_{t-1} + ε_t

could be useful. Figure 3 gives a plot of the PACF (partial autocorrelation function), which can be interpreted to mean that a first-order autoregression may be sufficient. The vertical scale gives the value of the partial correlation and the horizontal scale gives the lag (time span) between values. The only notable (in size) correlation is for lag 1.
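The lag-1 relationship examined in Figures 2 and 3 can also be checked numerically. Below is a minimal Python sketch of the idea, assuming only numpy; the quake counts here are hypothetical stand-in values, not the actual 99-year dataset:

```python
import numpy as np

# Hypothetical stand-in for the annual quake counts (NOT the real data).
y = np.array([13, 14, 8, 10, 16, 26, 32, 27, 18, 32, 36, 24, 22, 23, 22], dtype=float)

# Lag-1 pairs: y_t versus y_{t-1} (what Minitab's Lag command constructs).
y_t, y_lag1 = y[1:], y[:-1]

# Sample correlation between y_t and y_{t-1}; a notably nonzero value
# suggests the first-order autoregression y_t = b0 + b1*y_{t-1} + e_t.
r1 = np.corrcoef(y_t, y_lag1)[0, 1]
print(round(r1, 3))
```

A full PACF, as in Figure 3, would extend this idea to higher lags after partialling out the lower ones; for deciding between "AR(1) or nothing," the lag-1 correlation is the key quantity.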
Figure 2: Scatterplot of the number of earthquakes versus the number of earthquakes in the previous year (lag 1).

The next step is to do a simple regression with the number of quakes as the response variable and the number of quakes in the previous year as the predictor variable. The results are given in Figure 4, and we see that the predictor in this case is highly significant.

Finally, we obtain the autocorrelations within the series of residuals from the model estimated in Figure 4. These autocorrelations are given in Figure 5. The vertical scale (negative or positive) gives the value of the correlation. The horizontal scale gives the lag. There are only weak correlations, so the residuals from the first-order autoregression model could reasonably be assumed to be independent (of each other).

Simple Regression with Autoregressive (Autocorrelated) Errors

Chapter 12 of the textbook considers the problem of autocorrelation within the errors (so they are not independent over time) when the y-variable and x-variable(s) are measured over time. It is assumed that the errors may follow a first-order autoregression model. Thus, the overall model is as follows:

y_t = β_0 + β_1 x_t + ε_t
ε_t = ρ ε_{t-1} + u_t,

where we assume the u_t are iid with mean 0 and variance σ², and that each u_t is independent of the earlier errors ε_{t-1}, ε_{t-2}, .... It is important to note that the parameter ρ can be shown to equal the correlation between ε_t and ε_{t-1}.

The big issue is that if the residuals are dependent in time, the standard errors of the coefficients, calculated assuming that the residuals are independent, will incorrectly estimate the true size of the standard errors.

Examining Whether this Model May Be Necessary

The steps in Minitab are:

1. Start by doing an ordinary regression. Store the residuals.
Figure 3: PACF plot for the earthquake data.

Figure 4: Minitab output pertaining to the first-order autoregression model.

2. Plot the residuals in time order (either using Minitab's Graphs button in Regression, or with a time series plot of the stored residuals (Graph > Time Series)). A slowly undulating time series plot (long sequences of residuals on the same side of zero) indicates a correlation between e_t and e_{t-1}.

3. Use Stat > Time Series > Lag to create a column of lagged residuals e_{t-1}. Plot e_t versus e_{t-1}. A linear pattern indicates autocorrelation in the errors.

4. Calculate the correlation between e_t and e_{t-1} (Stat > Basic Stats > Correlation). Examine its statistical significance.

Example 2: Oil Data

The data are U.S. oil and gas price index values for 82 months. There is a strong linear pattern in the relationship between the two variables, as can be seen in Figure 6. We start the analysis by doing a simple linear regression. Minitab results for this analysis are given in Figure 7.

The residuals in time order show a dependent pattern (see the plot in Figure 8). The slow cyclical pattern that we see happens because there is a tendency for residuals to keep the
same algebraic sign for several consecutive months. We also used Stat > Time Series > Lag to create a column of the lag 1 residuals. The correlation coefficient between the residuals and the lagged residuals is calculated to be 0.829 (using Stat > Basic Stats > Correlation; the result can be seen at the bottom of Figure 8).

Figure 5: Autocorrelations for the model estimated in Figure 4.

So, the overall analysis strategy in the presence of autocorrelated errors is as follows:

1. Do an ordinary regression.
2. Identify the difficulty in the model (autocorrelated errors).
3. Using the stored residuals from the linear regression, use regression to estimate the model for the errors, ε_t = ρ ε_{t-1} + u_t, where the u_t are iid with mean 0 and variance σ².
4. Adjust the parameter estimates and their standard errors from the original regression.

A Method for Adjusting the Original Parameter Estimates (Cochrane-Orcutt Method)

Let ρ̂ = the estimated lag 1 autocorrelation in the residuals from the ordinary regression (in the U.S. oil example, ρ̂ = 0.829).

Let y*_t = y_t − ρ̂ y_{t-1}. This will be used as the response variable.

Let x*_t = x_t − ρ̂ x_{t-1}. This will be used as the predictor variable.

Do an ordinary regression between y*_t and x*_t. This model should have time-independent residuals. The sample slope from this regression directly estimates β_1, the slope of the relationship between the original y and x.
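The strategy above, through the transformed regression, can be sketched in Python. This is a minimal illustration using numpy only; the oil (x) and gas (y) series here are hypothetical stand-ins, not the actual 82-month dataset:

```python
import numpy as np

# Hypothetical stand-ins for the monthly oil (x) and gas (y) indexes (NOT the real data).
x = np.array([2.0, 2.1, 2.3, 2.2, 2.6, 2.9, 3.1, 3.0, 3.4, 3.6])
y = np.array([1.1, 1.3, 1.4, 1.3, 1.8, 2.0, 2.3, 2.2, 2.5, 2.8])

# Steps 1-3: ordinary least squares fit of y on x; store the residuals
# and estimate rho_hat as the lag-1 correlation of e_t with e_{t-1}.
b1, b0 = np.polyfit(x, y, 1)          # returns (slope, intercept)
resid = y - (b0 + b1 * x)
rho_hat = np.corrcoef(resid[1:], resid[:-1])[0, 1]

# Cochrane-Orcutt transform: y*_t = y_t - rho_hat*y_{t-1}, same for x.
y_star = y[1:] - rho_hat * y[:-1]
x_star = x[1:] - rho_hat * x[:-1]

# Regression on the modified variables; this slope directly estimates beta_1.
b1_star, b0_star = np.polyfit(x_star, y_star, 1)
```

In practice one would also examine the residuals of the transformed regression to confirm that the time dependence has been removed, mirroring step 2 of the Minitab workflow.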
Figure 6: Scatterplot of gas prices versus oil prices.

Figure 7: Regression analysis for the U.S. oil data.

The correct estimate of the intercept for the original y versus x relationship is calculated as β̂_0 = β̂*_0 / (1 − ρ̂), where β̂*_0 is the sample intercept obtained from the regression done with the modified variables.

Returning to the U.S. oil data, the value of ρ̂ = 0.829 and the modified variables are y*_t = y_t − 0.829 y_{t-1} and x*_t = x_t − 0.829 x_{t-1}. The regression results are given in Figure 9.

Parameter Estimates for the Original Model

Our real goal is to estimate the original model y_t = β_0 + β_1 x_t + ε_t. The estimates come from the results just given:

β̂_1 = 1.08073
β̂_0 = 1.412 / (1 − 0.829) = 8.257

These estimates give the sample regression model y_t = 8.257 + 1.08073 x_t + ε_t, with ε_t = 0.829 ε_{t-1} + u_t, where the u_t are iid with mean 0 and variance σ².
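The intercept back-transform is simple arithmetic, and the numbers quoted above can be verified directly. A short Python check, using the values reported in the notes:

```python
rho_hat = 0.829     # lag-1 residual autocorrelation from the ordinary regression
b0_star = 1.412     # intercept from the regression on the modified variables
b1_hat = 1.08073    # slope, which carries over directly

# Cochrane-Orcutt back-transform for the intercept: beta0_hat = beta0*_hat / (1 - rho_hat)
b0_hat = b0_star / (1 - rho_hat)
print(round(b0_hat, 3))  # → 8.257
```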
Figure 8: A plot of the residuals in time order along with the correlation results for the U.S. oil data.

Figure 9: Regression results for the U.S. oil data using the modified variables.

Correct Standard Errors for the Coefficients

The correct standard error for the slope is taken directly from the regression with the modified variables. The correct standard error for the intercept is

s.e.(β̂_0) = s.e.(β̂*_0) / (1 − ρ̂).

Coefficient | Correct Estimate           | Correct Standard Error     | Wrong Estimate | Wrong Standard Error
Intercept   | 1.412/(1 − 0.829) = 8.257  | 2.529/(1 − 0.829) = 14.79  | -31.349        | 5.29
Slope       | 1.08073                    | 0.05960                    | 1.17677        | 0.02305

Table 1: Correct and wrong estimates for the coefficients.

Table 1 compares the correct standard errors to the incorrect estimates based on the ordinary regression. The correct estimates come from the work done in this section of the notes. The wrong estimates are from the regression estimates reported in Figure 7.
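The intercept's standard-error correction in Table 1 is the same division applied to the estimate itself. A short Python check, using the values from the notes:

```python
rho_hat = 0.829
se_b0_star = 2.529   # intercept s.e. from the modified-variable regression
se_b0_wrong = 5.29   # intercept s.e. reported by the ordinary regression

# Correct s.e. for the original intercept: divide by (1 - rho_hat),
# matching the back-transform applied to the intercept estimate.
se_b0 = se_b0_star / (1 - rho_hat)
print(round(se_b0, 2))  # → 14.79
```

Note that the corrected standard error is several times larger than the value the ordinary regression reported, which is the point of Table 1.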
Notice that the correct standard errors are larger than the incorrect values. If ordinary least squares estimation is used when the errors are autocorrelated, the standard errors often are underestimated. It is also important to note that this does not always happen: underestimation of the standard errors is an on-average tendency over all problems.

Prediction Issues

When calculating predicted values, it is important to utilize ε_t = ρ ε_{t-1} + u_t as part of the process. In the U.S. oil example,

ŷ_t = 8.257 + 1.08073 x_t + 0.829 e_{t-1}.

Values of ŷ_t are computed iteratively:

1. Assume e_0 = 0 (the error before t = 1 is 0), compute ŷ_1, and set e_1 = y_1 − ŷ_1.
2. Use the value of e_1 = y_1 − ŷ_1 when computing ŷ_2 = 8.257 + 1.08073 x_2 + 0.829 e_1.
3. Determine e_2 = y_2 − ŷ_2, and use that value when computing ŷ_3 = 8.257 + 1.08073 x_3 + 0.829 e_2.
4. Iterate.
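The iterative prediction scheme above can be sketched as a short loop. This Python sketch uses the fitted coefficients from the notes, but the x and y series are hypothetical stand-ins, not the actual oil data:

```python
import numpy as np

b0, b1, rho = 8.257, 1.08073, 0.829   # fitted model from the notes

# Hypothetical stand-ins for observed x_t and y_t (NOT the real data).
x = np.array([20.0, 21.5, 22.0, 24.0])
y = np.array([30.1, 31.9, 32.4, 34.8])

y_hat = np.empty_like(y)
e_prev = 0.0                          # e_0 = 0: the error before t = 1 is 0
for t in range(len(y)):
    # y_hat_t = b0 + b1*x_t + rho*e_{t-1}, then update the residual
    # for use in the next step's prediction.
    y_hat[t] = b0 + b1 * x[t] + rho * e_prev
    e_prev = y[t] - y_hat[t]
```

Because e_0 = 0, the first prediction is just the regression line b0 + b1*x_1; every later prediction is pulled toward the data by the previous residual.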