
Homework 2

1 Data analysis problems

For the homework, be sure to give full explanations where required and to turn in any relevant plots.

1. The file berkeley.dat contains average yearly temperatures for the cities of Berkeley and Santa Barbara. Import the data into R using the following commands:

berk=scan("berkeley.dat", what=list(double(0),double(0),double(0)))
time=berk[[1]]
berkeley=berk[[2]]
stbarb=berk[[3]]

(a) Plot the variables berkeley and stbarb versus time. Also, plot berkeley versus stbarb.

Figure 1: 1-(a) berkeley vs time, stbarb vs time, and berkeley vs stbarb.

(b) Perform a regression of berkeley on time. What do you think about this fit? Be sure to make diagnostic plots (including an ACF plot) of the residuals. If there are any violations of the assumptions for a linear regression model, make sure to comment on them.
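One way to produce the fit and the diagnostic plots is sketched below; the object name bfit1 matches the output that follows, but the particular plotting calls are our own choices rather than part of the assignment.

bfit1 = lm(berkeley ~ time)                      # regression of berkeley on time
summary(bfit1)
par(mfrow=c(2,2))
plot(time, berkeley); abline(bfit1)              # data with the fitted line
plot(bfit1$fitted, bfit1$residuals)              # residuals vs fitted values
acf(bfit1$residuals)                             # sample ACF of the residuals
qqnorm(bfit1$residuals); qqline(bfit1$residuals) # normal Q-Q plot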

Figure 2: 1-(b) residual diagnostics for the regression of berkeley on time.

Call:
lm(formula = berkeley ~ time)

Residuals:
     Min       1Q   Median       3Q      Max
-0.99195 -0.33156 -0.03834  0.32076  0.93689

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.252876   2.884447  -3.555 0.000575 ***
time          0.012263   0.001482   8.272 5.23e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4539 on 102 degrees of freedom
Multiple R-squared: 0.4015, Adjusted R-squared: 0.3956
F-statistic: 68.43 on 1 and 102 DF, p-value: 5.228e-13

This seems to be a fairly reasonable fit for the data. The F test indicates a strong relationship, and the residual plots do not indicate anything out of the ordinary. The only troubling feature is the rather low adjusted R-squared.

(c) Perform a regression of berkeley on stbarb. Comment on the fit and the residuals.

Figure 3: 1-(c) residual diagnostics for the regression of berkeley on stbarb.

Call:
lm(formula = berkeley ~ stbarb)

Residuals:

     Min       1Q   Median       3Q      Max
-0.97037 -0.33928 -0.06555  0.38689  1.39676

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.9503     1.0943   7.265 7.67e-11 ***
stbarb        0.3653     0.0706   5.173 1.15e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5221 on 102 degrees of freedom
Multiple R-squared: 0.2079, Adjusted R-squared: 0.2001
F-statistic: 26.77 on 1 and 102 DF, p-value: 1.153e-06

This also seems like a reasonable fit, with residuals that do not strongly violate the regression assumptions. Again the adjusted R-squared is rather low. The ACF plot of the residuals shows a correlation at lag 11 that is larger than what would be expected if the residuals were independent. However, it is not extremely large and is most likely due to ordinary sampling variation.

(d) Make a plot of the variable berkeley and an ACF plot of the data. Does the time series appear to be stationary? Explain. Interpret the ACF plot in this situation.

The time series has an increasing trend, which means that it could not possibly be stationary. The ACF plot is difficult to interpret since the data are not stationary; it cannot be read as an approximation to the autocorrelation function.

(e) Difference the data. Plot this differenced data, and make an ACF plot. What is your opinion of whether the series is stationary after differencing?

The data seem fairly stationary after differencing, with a fairly constant variance and no discernible trend (see the short R sketch after Figure 5).

Figure 4: 1-(d) ACF of the variable berkeley.

Figure 5: 1-(e) differenced berkeley vs time and ACF of the differenced series diffber.
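As referenced above, a minimal sketch of the differencing step; the object name diffber matches the panel title in Figure 5, while the plotting details are assumptions.

diffber = diff(berkeley)   # first difference: x_t - x_{t-1}
plot(diffber, type="l")    # differenced series over time
acf(diffber)               # sample ACF of the differenced series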

(f) Now we have detrended this series both by using linear regression and by differencing. The result of detrending via regression was a model that fit rather well and residuals that had no apparent dependency. Let us assume then that the true model for this data is

$$x_t = \beta_0 + \beta_1 t + w_t,$$

where $w_t$, $t = 1, \ldots, T$, is normal white noise with variance $\sigma^2$. (This is the same as assuming that this data follows the standard regression assumptions.) Assuming this model, write out a formula for the differenced time series, $\nabla x_t$. Use this to explain the apparent dependency in the differenced data from 1(e) above.

The model after differencing would be

$$\nabla x_t = x_t - x_{t-1} = \beta_1 + w_t - w_{t-1}.$$

The differenced data is an MA(1) series (with a constant mean) with a negative $\theta_1$. This corresponds very well to the ACF plot in the previous question: the ACF at lag one is negative and significantly outside the confidence intervals, and the other lags show no or weak dependency.
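Though not required, ARMAacf() provides a quick check of this claim (the call below is our own illustration): for $w_t - w_{t-1}$, an MA(1) with $\theta_1 = -1$, the theoretical lag-one autocorrelation is $\theta_1/(1+\theta_1^2) = -0.5$, with zeros thereafter.

ARMAacf(ma = -1, lag.max = 5)   # theoretical ACF of an MA(1) with theta_1 = -1
# returns 1.0 at lag 0, -0.5 at lag 1, and 0 at lags 2 through 5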

2. Load the data in dailyibm.dat using the command ibm=scan("dailyibm.dat", skip=1). This series is the daily closing price of IBM stock from Jan 1, 1980 to Oct 8, 1992.

(a) Make a plot of the data and an ACF plot of the data. Does the time series appear to be stationary? Explain. Interpret the ACF plot in this situation.

Figure 6: 2-(a) a time series plot for ibm and its ACF.

Judging from the time series plot and its ACF, the series does not appear stationary. The series plot wanders in a fashion similar to a random walk, and the ACF no longer approximates an autocorrelation function.

(b) Difference the data. Plot this differenced data, and make an ACF plot. What is your opinion of whether the series is stationary after differencing?

The plot no longer wanders; however, the variance seems to be increasing over time, which contradicts stationarity. Again, the ACF plot does not have a clear interpretation.

Figure 7: 2-(b) a time series plot for diffibm and its ACF.

(c) Another option for attempting to obtain stationary data when there is something similar to an exponential trend is to take the logarithm. Use the R command log() to take the logarithm of the data. Plot this transformed data. Does the transformed data appear stationary? Explain.

Figure 8: 2-(c) a time series plot for logibm and its ACF.

The series still seems to wander after the transformation, and again the ACF does not approximate an autocorrelation function.
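The commands behind parts (a)-(c) might look as follows; the names diffibm and logibm match the panel titles in Figures 7 and 8, while the plotting calls themselves are our own sketch.

ibm = scan("dailyibm.dat", skip=1)
plot(ibm, type="l"); acf(ibm)           # (a) raw series and its sample ACF
diffibm = diff(ibm)
plot(diffibm, type="l"); acf(diffibm)   # (b) first difference
logibm = log(ibm)
plot(logibm, type="l"); acf(logibm)     # (c) log transform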

(d) Perhaps some combination of differencing and the logarithmic transform will give us stationary data. Why would log(diff(ibm)) not be a very good idea? Try the opposite: difference the log-transformed data, difflogibm=diff(log(ibm)). Except for a few extreme outliers, does this transformation succeed in creating stationary data?

Figure 9: 2-(d) a time series plot for difflogibm and its ACF.

Taking the log of the difference would attempt to take the logarithm of many negative values, which is undefined. Taking the difference of the log yields data that are reasonably stationary.

(e) Delete the extreme outliers using the following command:

difflogibm=difflogibm[difflogibm> -0.1]

Plot this data and the ACF for this data. Sometimes with very long time series like this one, portions of the series exhibit different behavior than other portions. Break the series into two parts using the following commands:

difflogibm1= difflogibm[1:500]
difflogibm2= difflogibm[501:length(difflogibm)]

Plot both of these and create ACF plots of each. Do you notice a difference between these two sections of the larger time series?

The ACF plots seem to indicate that difflogibm1 is slightly dependent, whereas difflogibm2 is essentially white noise.

Figure 10: 2-(e) a time series plot for difflogibm (without the extreme outliers) and its ACF.

Figure 11: 2-(e) time series plots for difflogibm1 and difflogibm2, and their ACFs.

(f) Assume the model for the data that we have called difflogibm2 is of the following form:

$$d_t = \delta + w_t,$$

where $w_t$, $t = 1, \ldots, T$, is normal white noise with variance $\sigma^2$. Is this reasonable from what you now know of this time series? How would you estimate $\delta$ and $\sigma$? Give the estimates.

As mentioned above, the second series appears to have no dependency and could therefore be a shifted white noise. We can estimate $\delta$ and $\sigma$ by the sample mean and the sample standard deviation, which gives the estimates $\hat{\delta} = 0.0002646076$ and $\hat{\sigma} = 0.01326503$, respectively.
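In R these two estimates are one line each (a sketch using the object defined above):

deltahat = mean(difflogibm2)   # sample mean estimates delta
sigmahat = sd(difflogibm2)     # sample standard deviation estimates sigma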

3. Load the monthly temperature data for England from 1723 to 1970 using the following command: engtemp=scan("tpmon.dat",skip=1)

(a) Plot the data and create an ACF plot. Try doing this only on the first 300 observations.

Figure 12: 3-(a) a time series plot for engtemp[1:300] and its ACF.

There is a clear seasonal pattern over the course of the year. The ACF plot does not fall off over time and cannot be correctly interpreted because of the periodic trend.

(b) Fit the following model using lm():

$$x_t = \beta_0 + \beta_1 \cos\left(2\pi \tfrac{1}{12} t\right) + \beta_2 \sin\left(2\pi \tfrac{1}{12} t\right) + w_t$$

(You will need to create the variables for the covariates. It may be useful to know that there are R functions sin() and cos().)
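One way to build the covariates and fit the model (a sketch; the names tcosine and tsine match the lm() output below):

t = 1:length(engtemp)
tcosine = cos(2*pi*t/12)   # annual cosine term
tsine = sin(2*pi*t/12)     # annual sine term
tfit = lm(engtemp ~ tcosine + tsine)
summary(tfit)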

Call:
lm(formula = engtemp ~ tcosine + tsine)

Residuals:
    Min      1Q  Median      3Q     Max
-5.9102 -0.9244 -0.0024  0.9455  4.7046

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.21731    0.02689   342.8   <2e-16 ***
tcosine     -5.15517    0.03802  -135.6   <2e-16 ***
tsine       -3.88514    0.03802  -102.2   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.467 on 2973 degrees of freedom
Multiple R-squared: 0.9065, Adjusted R-squared: 0.9064
F-statistic: 1.441e+04 on 2 and 2973 DF, p-value: < 2.2e-16

(c) Plot the residuals of the above fit. Comment on these residuals. You may want to look at only a few hundred at a time. Are the residuals dependent? Are they stationary?

Figure 13: 3-(c) diagnostics for the residuals (first 300 observations): the series, residuals vs fitted values, residuals vs index, their ACF, and a normal Q-Q plot.

The periodic trend is largely removed. However, the ACF plot is not falling off, indicating there may be some trend left in the series. The residuals appear fairly stationary, though there are some indications that this may not be the case. They do appear more stationary than the original time series, however.

(d) Compare the periodograms of the original data and the residuals from the fitted model.

Figure 14: 3-(d) periodograms of the original data (top) and the residuals from the fitted model (bottom).

In the original time series, the periodogram is dominated by the yearly periodic trend (frequency $1/12 \approx 0.0833$, i.e., a period of 12 months); only one clear spike is visible. When this is removed, other patterns emerge. Specifically, there is still a spike at around frequency 1/6 (a period of 6 months), and beyond that the low frequencies are somewhat strong.
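The periodograms above can be produced with a short helper; the scaling below is one common convention, and the function name is our own.

scaled.periodogram = function(x) {
  n = length(x)
  I = abs(fft(x)/sqrt(n))^2   # periodogram via the FFT
  P = (4/n)*I                 # one common scaling convention
  f = (0:(n-1))/n             # Fourier frequencies
  list(freq = f[1:(n%/%2)], spec = P[1:(n%/%2)])
}
p = scaled.periodogram(engtemp)
plot(p$freq, p$spec, type="h", xlab="Frequency", ylab="Scaled Periodogram")
r = scaled.periodogram(tfit$residuals)
plot(r$freq, r$spec, type="h", xlab="Frequency", ylab="Scaled Periodogram")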

4. Use the smoothing techniques introduced in class and above to estimate the trend in the global temperature data. Find a proper window size or bandwidth and describe why you chose it. (Example 2.1 from the textbook; the data can be found in globtemp.dat.)

In this problem, the students need to fit the global temperature data with a moving average and with kernel smoothing. Ideally, the students will show a few plots and discuss why a certain choice is best. Below are some example plots of MA smoothing (undersmoothed, about right, and oversmoothed). For the good level of smoothing, the residuals and their ACF are also shown. The process is then repeated for kernel smoothing.
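A sketch of how such smooths can be computed; the window widths 5, 15, and 40 echo the object names in Figure 15, while the kernel bandwidth of 10 is purely illustrative.

globtemp = scan("globtemp.dat")
# Moving-average smoother: symmetric window of width k
ma.smooth = function(x, k) stats::filter(x, rep(1/k, k), sides=2)
goodma15 = ma.smooth(globtemp, 15)      # the "about right" window
resgoodma15 = globtemp - goodma15       # residuals after detrending
plot(globtemp, type="l"); lines(goodma15, col="red")
acf(resgoodma15, na.action=na.pass)     # window ends produce NAs
# Kernel smoother: normal kernel, bandwidth chosen by eye
ksm = ksmooth(seq_along(globtemp), globtemp, kernel="normal", bandwidth=10)
lines(ksm$x, ksm$y, col="blue")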

Figure 15: 4-(a) MA smoothing (1st row: undersmoothed, resunderma5; 2nd row: about right, resgoodma15; 3rd row: oversmoothed, resoverma40). Each row shows the smoothed series, the residuals, and their ACF.

Figure 16: 4-(b) kernel smoothing (1st row: undersmoothed, resunderksm; 2nd row: about right, resgoodksm; 3rd row: oversmoothed, resoverksm). Each row shows the smoothed series, the residuals, and their ACF.

2 Theoretical problems

1. (No R required.) Show that the MA(3) model is (weakly) stationary. You need to show that the mean is zero and the covariance function depends only on the distance between time points. This will be very similar to what was done in class for the MA(2) model.

The mean is

$$E[x_t] = E[w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} + \theta_3 w_{t-3}] = 0.$$

For the variance, the cross terms vanish because the white noise terms are uncorrelated:

$$E[x_t x_t] = E[(w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} + \theta_3 w_{t-3})^2] = E[w_t^2] + \theta_1^2 E[w_{t-1}^2] + \theta_2^2 E[w_{t-2}^2] + \theta_3^2 E[w_{t-3}^2] = (1 + \theta_1^2 + \theta_2^2 + \theta_3^2)\sigma^2.$$

Now let us do the calculation where the two times are one unit apart. Remember that all MA models have mean zero, which simplifies the calculations.

$$E[x_t x_{t-1}] = E[(w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} + \theta_3 w_{t-3})(w_{t-1} + \theta_1 w_{t-2} + \theta_2 w_{t-3} + \theta_3 w_{t-4})] = (\theta_1 + \theta_1\theta_2 + \theta_2\theta_3)\sigma^2.$$

Now let us try a lag of two:

$$E[x_t x_{t-2}] = E[(w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} + \theta_3 w_{t-3})(w_{t-2} + \theta_1 w_{t-3} + \theta_2 w_{t-4} + \theta_3 w_{t-5})] = (\theta_2 + \theta_1\theta_3)\sigma^2.$$

And for three:

$$E[x_t x_{t-3}] = E[(w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} + \theta_3 w_{t-3})(w_{t-3} + \theta_1 w_{t-4} + \theta_2 w_{t-5} + \theta_3 w_{t-6})] = \theta_3\sigma^2.$$

Now we see the pattern: at a lag of four or greater, the windows of white noise terms do not overlap, so the covariance at lags larger than 3 is zero. We illustrate with a lag of 4:

$$E[x_t x_{t-4}] = E[(w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} + \theta_3 w_{t-3})(w_{t-4} + \theta_1 w_{t-5} + \theta_2 w_{t-6} + \theta_3 w_{t-7})] = 0.$$

Since the mean is constant and none of these covariances depends on $t$, only on the lag, the MA(3) model is weakly stationary.
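Although the problem requires no R, ARMAacf() gives a quick numerical check of the derivation; the coefficients below are arbitrary illustrative values.

theta = c(0.5, -0.3, 0.2)          # arbitrary MA(3) coefficients
ARMAacf(ma = theta, lag.max = 5)   # theoretical ACF; zero past lag 3
# Same ratios by hand (sigma^2 cancels):
g0 = 1 + sum(theta^2)
g1 = theta[1] + theta[1]*theta[2] + theta[2]*theta[3]
g2 = theta[2] + theta[1]*theta[3]
g3 = theta[3]
c(g1, g2, g3)/g0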

2. (No R required.) Verify that the following model is non-stationary:

$$x_t = \beta_0 + \beta_1 t + \beta_2 t^2 + w_t,$$

where $w_t$ is white noise. Now, verify that $\nabla^2 x_t$ is stationary.

To see that the model is not stationary, one needs only to note that the mean, $E[x_t] = \beta_0 + \beta_1 t + \beta_2 t^2$, is not constant in time. To show that $\nabla^2 x_t$ is stationary, difference once:

$$\nabla x_t = x_t - x_{t-1} = \beta_1 + \beta_2 t^2 - \beta_2 (t-1)^2 + w_t - w_{t-1} = \beta_1 - \beta_2 + 2\beta_2 t + w_t - w_{t-1}.$$

Differencing a second time removes the remaining linear term:

$$\nabla^2 x_t = \nabla x_t - \nabla x_{t-1} = 2\beta_2 + w_t - 2w_{t-1} + w_{t-2}.$$

Since this is an MA(2) with a constant mean, it is stationary.
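A small simulation sketch (with made-up coefficients) confirms that the second difference behaves like the MA(2) derived above:

set.seed(1)
t = 1:500
x = 1 + 0.2*t + 0.05*t^2 + rnorm(500)   # beta0=1, beta1=0.2, beta2=0.05
d2x = diff(x, differences=2)            # second difference
mean(d2x)                               # close to 2*beta2 = 0.1
acf(d2x)                                # matches ARMAacf(ma=c(-2, 1))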