Homework 2

1 Data analysis problems

For the homework, be sure to give full explanations where required and to turn in any relevant plots.

1. The file berkeley.dat contains average yearly temperatures for the cities of Berkeley and Santa Barbara. Import the data into R using the following commands:

berk=scan("berkeley.dat", what=list(double(0),double(0),double(0)))
time=berk[[1]]
berkeley=berk[[2]]
stbarb=berk[[3]]

(a) Plot the variables berkeley and stbarb versus time. Also, plot berkeley versus stbarb.

Figure 1: 1-(a) berkeley vs time, stbarb vs time, and berkeley vs stbarb

(b) Perform a regression of berkeley on time. What do you think about this fit? Be sure to make diagnostic plots (including an acf plot)
of the residuals. If there are any violations of the assumptions for a linear regression model, make sure to comment on them.

Figure 2: 1-(b) residual diagnostics for the regression of berkeley on time

Call:
lm(formula = berkeley ~ time)

Residuals:
     Min       1Q   Median       3Q      Max
-0.99195 -0.33156 -0.03834  0.32076  0.93689

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.252876   2.884447  -3.555 0.000575 ***
time          0.012263   0.001482   8.272 5.23e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4539 on 102 degrees of freedom
Multiple R-squared: 0.4015, Adjusted R-squared: 0.3956
F-statistic: 68.43 on 1 and 102 DF, p-value: 5.228e-13

This seems to be a fairly reasonable fit for the data. The F test indicates a strong relationship, and the residual plots do not indicate anything out of the ordinary. The only troubling feature is the rather low adjusted R-squared.

(c) Perform a regression of berkeley on stbarb. Comment on the fit and the residuals.

Figure 3: 1-(c) residual diagnostics for the regression of berkeley on stbarb

Call:
lm(formula = berkeley ~ stbarb)

Residuals:
     Min       1Q   Median       3Q      Max
-0.97037 -0.33928 -0.06555  0.38689  1.39676

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.9503     1.0943   7.265 7.67e-11 ***
stbarb        0.3653     0.0706   5.173 1.15e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5221 on 102 degrees of freedom
Multiple R-squared: 0.2079, Adjusted R-squared: 0.2001
F-statistic: 26.77 on 1 and 102 DF, p-value: 1.153e-06

This also seems like a reasonable fit, with residuals that do not strongly violate the regression assumptions. Again, the adjusted R-squared is rather low. The acf plot of the residuals shows a correlation at lag 11 that is larger than what would be expected if the residuals were independent. However, it is not extremely large and is most likely due to ordinary sampling variation.

(d) Make a plot of the variable berkeley and an acf plot of the data. Does the time series appear to be stationary? Explain. Interpret the acf plot in this situation.

The time series has an increasing trend, which means that it cannot be stationary. The acf plot is difficult to interpret since the data are not stationary; it cannot be interpreted as an approximation to the autocorrelation function.

(e) Difference the data. Plot this differenced data, and make an acf plot. What is your opinion of whether the series is stationary after differencing?

The data seem fairly stationary after differencing, with a roughly constant variance and no discernible trend.

(f) Now, we have detrended this series both by using linear regression and with differencing. The result of detrending via regression was a model that fit rather well and residuals that had no apparent dependency. Let us assume then that the true model for this data
Figure 4: 1-(d) acf of the variable berkeley

Figure 5: 1-(e) differenced berkeley vs time and acf of differenced berkeley
is

x_t = β_0 + β_1 t + w_t,

where w_t, t = 1, ..., T, is normal white noise with variance σ². (This is the same as assuming that this data follows the standard regression assumptions.) Assuming this model, write out a formula for the differenced time series, ∇x_t. Use this to explain the apparent dependency in the differenced data from 1e above.

The model after differencing would be

∇x_t = x_t − x_{t−1} = β_1 + w_t − w_{t−1}.

The differenced data is an MA(1) series (with a constant mean) with a negative coefficient, θ_1 = −1. This corresponds very well to the acf plot in the previous question: the acf at lag one is negative and significantly outside the confidence intervals, while the other lags show no or weak dependency.
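This can be checked with a short simulation; the sketch below uses illustrative parameter values close to the fitted coefficients (an assumption, not values read from berkeley.dat).

```r
# Simulate x_t = beta0 + beta1*t + w_t and difference it; the parameter
# values are illustrative stand-ins for the fitted regression coefficients.
set.seed(1)
n <- 104
beta0 <- -10.25; beta1 <- 0.0123; sigma <- 0.45
t <- 1:n
x <- beta0 + beta1 * t + rnorm(n, sd = sigma)
dx <- diff(x)                    # dx_t = beta1 + w_t - w_{t-1}, an MA(1)
mean(dx)                         # roughly beta1
acf(dx, plot = FALSE)$acf[2]     # sample lag-1 autocorrelation, near -0.5
```

The theoretical lag-1 autocorrelation of w_t − w_{t−1} is −0.5, matching the large negative spike at lag one seen in the acf of the differenced Berkeley series.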
2. Load the data in dailyibm.dat using the command ibm=scan("dailyibm.dat", skip=1). This series is the daily closing price of IBM stock from Jan 1, 1980 to Oct 8, 1992.

(a) Make a plot of the data and an acf plot of the data. Does the time series appear to be stationary? Explain. Interpret the acf plot in this situation.

Figure 6: 2-(a) a time series plot for ibm and its acf

Judging from the time series plot and its acf, the series does not appear stationary. The series plot wanders in a fashion similar to a random walk, and the acf no longer approximates an autocorrelation function.

(b) Difference the data. Plot this differenced data, and make an acf plot. What is your opinion of whether the series is stationary after differencing?

The plot no longer wanders; however, the variance seems to be increasing, which contradicts stationarity. Again, the acf plot does not have a clear interpretation.
Figure 7: 2-(b) a time series plot for diffibm and its acf

(c) Another option for attempting to obtain stationary data when there is something similar to an exponential trend is to take the logarithm. Use the R command log() to take the logarithm of the data. Plot this transformed data. Does the transformed data appear stationary? Explain.

Figure 8: 2-(c) a time series plot for logibm and its acf

The series still seems to wander after the transformation. The acf no longer approximates an autocorrelation function.

(d) Perhaps some combination of differencing and the logarithmic
transform will give us stationary data. Why would log(diff(ibm)) not be a very good idea? Try the opposite: difference the log-transformed data, difflogibm=diff(log(ibm)). Except for a few extreme outliers, does this transformation succeed in creating stationary data?

Figure 9: 2-(d) a time series plot for difflogibm and its acf

Taking the log of the difference would attempt to take the logarithm of many negative values, which is undefined. Taking the difference of the log yields data that are reasonably stationary.

(e) Delete the extreme outliers using the following command:

difflogibm=difflogibm[difflogibm > -0.1]

Plot this data and the acf for this data. Sometimes with very long time series like this one, portions of the series exhibit different behavior than other portions. Break the series into two parts using the following commands:

difflogibm1=difflogibm[1:500]
difflogibm2=difflogibm[501:length(difflogibm)]

Plot both of these and create acf plots of each. Do you notice a difference between these two sections of the larger time series?

The acf plots seem to indicate that difflogibm1 is slightly dependent whereas difflogibm2 is essentially white noise.
Figure 10: 2-(e) a time series plot for difflogibm (without the extreme outliers) and its acf

Figure 11: 2-(e) a time series plot for difflogibm1 and difflogibm2, and their acfs
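The transformation pipeline from parts (d) and (e) can be sketched as below; since dailyibm.dat may not be at hand, a simulated geometric random walk stands in for the ibm series (an assumption for illustration only).

```r
set.seed(2)
# stand-in for ibm: a geometric random walk of roughly the right length
ibm <- 60 * exp(cumsum(rnorm(3333, mean = 3e-4, sd = 0.013)))
difflogibm <- diff(log(ibm))                  # daily log returns
difflogibm <- difflogibm[difflogibm > -0.1]   # drop extreme negative outliers
difflogibm1 <- difflogibm[1:500]
difflogibm2 <- difflogibm[501:length(difflogibm)]
```

Note that log(diff(ibm)) fails because diff(ibm) contains negative values, for which log() is undefined (R returns NaN with a warning).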
(f) Assume the model for the data that we have called difflogibm2 is of the following form:

d_t = δ + w_t,

where w_t, t = 1, ..., T, is normal white noise with variance σ². Is this reasonable from what you now know of this time series? How would you estimate δ and σ? Give the estimates.

As mentioned above, the second series appears to have no dependency and could, therefore, be a shifted white noise. We can estimate δ and σ by the sample mean and the sample standard deviation, which gives the estimates 0.0002646076 and 0.01326503, respectively.
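Under this model the natural estimators are just the sample mean and sample standard deviation; a minimal sketch, using a simulated stand-in for difflogibm2 (the quoted estimates above came from the real data):

```r
set.seed(3)
# stand-in for difflogibm2: iid normal with roughly the estimated moments
d <- rnorm(2500, mean = 2.6e-4, sd = 0.0133)
deltahat <- mean(d)   # estimate of delta
sigmahat <- sd(d)     # estimate of sigma
```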
3. Load the monthly temperature data for England from 1723 to 1970 using the following command: engtemp=scan("tpmon.dat",skip=1)

(a) Plot the data and create an acf plot. Try doing this only on the first 300 observations.

Figure 12: 3-(a) a time series plot for engtemp and its acf

There is a clear seasonal pattern over the course of the year. The acf plot does not fall off over time and cannot be correctly interpreted because of the periodic trend.

(b) Fit the following model using lm():

x_t = β_0 + β_1 cos(2πt/12) + β_2 sin(2πt/12) + w_t

(You will need to create the variables for the covariates. It may be useful to know that there are R functions sin() and cos().)

Call:
lm(formula = engtemp ~ tcosine + tsine)

Residuals:
    Min      1Q  Median      3Q     Max
-5.9102 -0.9244 -0.0024  0.9455  4.7046
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.21731    0.02689   342.8   <2e-16 ***
tcosine     -5.15517    0.03802  -135.6   <2e-16 ***
tsine       -3.88514    0.03802  -102.2   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.467 on 2973 degrees of freedom
Multiple R-squared: 0.9065, Adjusted R-squared: 0.9064
F-statistic: 1.441e+04 on 2 and 2973 DF, p-value: < 2.2e-16

(c) Plot the residuals of the above fit. Comment on these residuals. You may want to look at only a few hundred at a time. Are the residuals dependent? Are they stationary?

Figure 13: 3-(c) Diagnostics for the residuals
The periodic trend is largely removed. However, the acf plot does not fall off, indicating there may be some trend left in the series. The residuals appear fairly stationary, but there are some indications that this may not be the case. They do appear more stationary than the original time series, however.

(d) Compare the periodograms of the original data and the residuals from the fit model.

Figure 14: 3-(d) Periodograms of the original data (top) and the residuals from the fit model (bottom)

In the original time series, the periodogram is dominated by the yearly periodic trend (frequency 1/12 ≈ 0.0833, i.e., a 12-month period). Only one clear spike is visible. When this is removed,
other patterns emerge. Specifically, there is still a spike at around frequency 1/6 (a 6-month period), and beyond that the low frequencies are somewhat strong.
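The part-(b) covariates and a raw periodogram can be built as follows; a simulated monthly series with the fitted coefficients stands in for engtemp (an assumption, since tpmon.dat is not read here).

```r
set.seed(4)
n <- 2976                             # 248 years of monthly observations
t <- 1:n
tcosine <- cos(2 * pi * t / 12)       # covariates for the harmonic regression
tsine <- sin(2 * pi * t / 12)
engtemp <- 9.22 - 5.16 * tcosine - 3.89 * tsine + rnorm(n, sd = 1.47)
tfit <- lm(engtemp ~ tcosine + tsine)
# periodogram via the FFT; entry j+1 corresponds to frequency j/n
P <- abs(fft(engtemp - mean(engtemp)))^2 / n
freq <- (0:(n - 1)) / n
freq[which.max(P[2:(n / 2)]) + 1]     # dominant frequency: 1/12, the yearly cycle
```

The same construction applied to residuals(tfit) would reveal the secondary spike near frequency 1/6 discussed above.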
4. Use the smoothing techniques introduced in class and above to estimate the trend in the global temperature data. Find a proper window size or bandwidth and describe why you chose it. (Example 2.1 from the textbook. The data can be found in globtemp.dat.)

In this problem, the students need to fit the global temperature data with moving average and kernel smoothing. Ideally, the students will show a few plots and discuss why a certain plot is the best. Below I have some example plots of MA smoothing (undersmoothed, about right, and oversmoothed). For the good level of smoothing, I have also shown the residuals and their acf. I then repeated the process for kernel smoothing.
Figure 15: 4-(a) MA smoothing (1st row: undersmoothed; 2nd row: about right; 3rd row: oversmoothed)
Figure 16: 4-(b) kernel smoothing (1st row: undersmoothed; 2nd row: about right; 3rd row: oversmoothed)
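Both smoothers are available in base R; a minimal sketch, with a simulated slow-trend series standing in for globtemp, and a 15-point window and bandwidth 5 as illustrative (not prescribed) choices:

```r
set.seed(5)
n <- 142
# stand-in for globtemp: a mild trend plus a slow random component
globtemp <- 0.003 * (1:n) - 0.2 + cumsum(rnorm(n, sd = 0.03))
# moving-average smoother: centered 15-point window
ma15 <- stats::filter(globtemp, rep(1 / 15, 15), sides = 2)
resma15 <- globtemp - ma15    # residuals (NA at the 7 points on each end)
# normal kernel smoother with bandwidth 5
ksm <- ksmooth(1:n, globtemp, kernel = "normal", bandwidth = 5)
resksm <- globtemp - ksm$y
```

Plotting globtemp with ma15 or ksm$y overlaid, then resma15/resksm and their acfs, reproduces the comparison in Figures 15 and 16; widening the window or bandwidth oversmooths, narrowing it undersmooths.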
2 Theoretical problems

1. (No R required.) Show that the MA(3) model is (weakly) stationary. You need to show that the mean is zero and the covariance function depends only on the distance between times. This will be very similar to what was done in class for the MA(2) model.

E[x_t] = E[w_t + θ_1 w_{t−1} + θ_2 w_{t−2} + θ_3 w_{t−3}] = 0.

E[x_t x_t] = E[(w_t + θ_1 w_{t−1} + θ_2 w_{t−2} + θ_3 w_{t−3})(w_t + θ_1 w_{t−1} + θ_2 w_{t−2} + θ_3 w_{t−3})]
           = E[w_t²] + E[θ_1² w_{t−1}²] + E[θ_2² w_{t−2}²] + E[θ_3² w_{t−3}²]
           = σ² + θ_1² σ² + θ_2² σ² + θ_3² σ²

Now, let's do the calculation where s and t are only one time unit away from each other. Remember that all MA models have mean zero, which will simplify the calculations.

E[x_t x_{t−1}] = E[(w_t + θ_1 w_{t−1} + θ_2 w_{t−2} + θ_3 w_{t−3})(w_{t−1} + θ_1 w_{t−2} + θ_2 w_{t−3} + θ_3 w_{t−4})]
             = E[θ_1 w_{t−1}²] + E[θ_1 θ_2 w_{t−2}²] + E[θ_2 θ_3 w_{t−3}²]
             = θ_1 σ² + θ_1 θ_2 σ² + θ_2 θ_3 σ²

Now, let's try a lag of two.

E[x_t x_{t−2}] = E[(w_t + θ_1 w_{t−1} + θ_2 w_{t−2} + θ_3 w_{t−3})(w_{t−2} + θ_1 w_{t−3} + θ_2 w_{t−4} + θ_3 w_{t−5})]
             = E[θ_2 w_{t−2}²] + E[θ_1 θ_3 w_{t−3}²]
             = θ_2 σ² + θ_1 θ_3 σ²

And for three,

E[x_t x_{t−3}] = E[(w_t + θ_1 w_{t−1} + θ_2 w_{t−2} + θ_3 w_{t−3})(w_{t−3} + θ_1 w_{t−4} + θ_2 w_{t−5} + θ_3 w_{t−6})]
             = E[θ_3 w_{t−3}²]
             = θ_3 σ²

Now we see the pattern: if we move to a lag of four or greater, then the windows do not overlap. Therefore, the covariance between the time series at lags larger than 3 will be zero. We illustrate with a lag of 4.

E[x_t x_{t−4}] = E[(w_t + θ_1 w_{t−1} + θ_2 w_{t−2} + θ_3 w_{t−3})(w_{t−4} + θ_1 w_{t−5} + θ_2 w_{t−6} + θ_3 w_{t−7})] = 0

None of these expressions depends on t, only on the lag, so the MA(3) model is weakly stationary.
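Although the problem requires no R, the derived autocovariances can be checked numerically; a sketch with arbitrary illustrative θ values and σ² = 1:

```r
set.seed(6)
theta <- c(0.8, 0.5, 0.3)                          # illustrative MA(3) coefficients
x <- arima.sim(model = list(ma = theta), n = 1e5)  # innovations have sd 1
gamma0 <- 1 + sum(theta^2)                         # derived E[x_t^2]
gamma1 <- theta[1] + theta[1] * theta[2] + theta[2] * theta[3]  # derived lag-1 cov
r <- acf(x, lag.max = 4, plot = FALSE)$acf
r[2]   # sample lag-1 acf, close to gamma1 / gamma0
r[5]   # sample lag-4 acf, close to 0, as derived
```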
2. (No R required.) Verify that the following model is non-stationary:

x_t = β_0 + β_1 t + β_2 t² + w_t,

where w_t is white noise. Now, verify that ∇²x_t is stationary.

To see that the model is not stationary, one needs only to note that the mean, E[x_t] = β_0 + β_1 t + β_2 t², is not constant in time.

To show that ∇²x_t is stationary:

∇x_t = x_t − x_{t−1} = β_1 + β_2 t² − β_2 (t−1)² + w_t − w_{t−1} = β_1 − β_2 + 2β_2 t + w_t − w_{t−1}

∇²x_t = ∇x_t − ∇x_{t−1} = 2β_2 + w_t − 2w_{t−1} + w_{t−2}

Since this is an MA(2) with a constant mean, it is stationary.
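A quick numerical check of the second-difference result; the β values below are illustrative choices, not given in the problem:

```r
set.seed(7)
n <- 1e5
t <- 1:n
beta2 <- 0.02
x <- 1 + 0.5 * t + beta2 * t^2 + rnorm(n)  # quadratic trend plus white noise
d2 <- diff(x, differences = 2)             # = 2*beta2 + w_t - 2*w_{t-1} + w_{t-2}
mean(d2)   # close to 2 * beta2 = 0.04
var(d2)    # close to (1 + 4 + 1) * sigma^2 = 6
```

The sample mean and variance of d2 match the constant mean 2β_2 and the MA(2) variance (1 + 4 + 1)σ², confirming the derivation.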