Decision 411: Class 7


Decision 411: Class 7. Topics: confidence limits for sums of coefficients; use of the time index as a regressor; the difficulty of predicting the future.

Confidence intervals for sums of coefficients. Sometimes the SUM of two or more regression coefficients is of economic interest. Example from last class: the total impact of advertising on sales, taking into account lagged effects. It's OK to add the coefficients of ADV, LAG(ADV,1), LAG(ADV,2), etc., to measure total impact, but what is the appropriate standard error and confidence interval for the sum? Caution: it is never correct to add lower or upper confidence limits!

Correlations of coefficient estimates. The standard error of a sum of coefficients depends not only on their separate standard errors but also on the correlations between the coefficient estimates. The correlation matrix of the coefficient estimates is a tabular option in Statgraphics (and other standard regression software). Note: the correlation matrix of the coefficient estimates is NOT the same as the correlation matrix of the independent variables. See the formulas worksheet in the SIMPREG.XLS file for all the gory details.

Special case: sum of 2 coefficients. In the case of a sum of two coefficients, there is a simple formula for the standard error of the sum:

SE(a+b) = sqrt( SE_a² + SE_b² + 2·r_ab·SE_a·SE_b )

where SE_a and SE_b are the standard errors of the coefficient estimates of variables a and b, and r_ab is the correlation between their coefficient estimates. This formula can be generalized to more coefficients, but it gets messy: you need to include all pairwise correlations.
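
The general form is compact in matrix notation: for a weight vector w with a 1 on each summed coefficient, SE = sqrt( wᵀ·V·w ), where V is the covariance matrix of the coefficient estimates. Here is a minimal Python sketch of that calculation using statsmodels; the data is synthetic, and the column names Sales, Adv, LagAdv1 simply mimic the ones used in class:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Synthetic stand-in for the class data: Sales driven by Adv and its lag
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"Adv": rng.normal(50, 10, 40)})
    df["LagAdv1"] = df["Adv"].shift(1)            # LAG(Adv,1) in its own column
    df["Sales"] = 10 + 0.20*df["Adv"] + 0.15*df["LagAdv1"] + rng.normal(0, 3, 40)
    df = df.dropna()

    X = sm.add_constant(df[["Adv", "LagAdv1"]])
    fit = sm.OLS(df["Sales"], X).fit()

    w = np.array([0.0, 1.0, 1.0])                 # select Adv + LagAdv1, skip the constant
    coef_sum = w @ fit.params.values              # sum of the two coefficients
    se_sum = np.sqrt(w @ fit.cov_params().values @ w)   # sqrt(w' V w)
    print(coef_sum, se_sum)

For two coefficients, wᵀ·V·w expands to exactly SE_a² + SE_b² + 2·r_ab·SE_a·SE_b.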

Example from last class: Sum of advertising coefficients = 0.3089. Std. error = 0.04104*. Approx. 95% CI is [0.227, 0.391]. The correlation matrix of the coefficient estimates is one of the standard tabular options in regression. *sqrt( 0.03518² + 0.03606² + 2·(−0.3364)·0.03518·0.03606 ) = 0.04104

Alternative method. You can use the regression forecasting capability of Statgraphics (or other stat software) to calculate the standard errors and confidence intervals for the sum of any number of coefficients. The basic idea is to enter a row of artificial data so that the sum of coefficients is computed as a forecast (along with a standard error and CI). The key trick is to scale up the sum by a large factor in order to drown out the effects of other variables in the model.

Alternative method, continued. 1. Use the Generate data feature to create separate columns for lagged variables on the spreadsheet, so that each lag is in its own column with its own name. Here a new variable named LagAdv1 is created via the expression LAG(Adv,1). The Generate data option is used to get hard-coded values in the cells (rather than live calculations, as we previously set up via formulas in the Modify column option).

Alternative method, continued. 2. At the bottom of the spreadsheet, just beyond where the real data ends, enter an artificial row of data in which the value of each variable whose coefficient is to be summed is a very large number (e.g., 1 million) and the other independent variables (if any) are set equal to zero. The purpose of the very large number is to completely swamp the effects of the constant term and any other variables in the model, so that they don't influence the forecast calculation. Note: do NOT enter a value for the dependent variable (Sales). This is the signal for Statgraphics to calculate a forecast.

Alternative method, continued. 3. Statgraphics will automatically generate a forecast, standard error, and confidence interval, which can then be divided by the large number to get the final answers. (See the Reports report.) After dividing out the factor of 1,000,000, the sum of coefficients is 0.308951, its standard error is 0.041042, and the 95% CI is [0.225, 0.393], essentially the same as what was obtained by direct calculation.
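
The big-number trick can be reproduced with any regression package that reports the standard error of a mean forecast. A sketch continuing the synthetic Python setup from the earlier block (same df and fit objects): append a prediction row in which each summed variable equals M = 1,000,000, then divide the results by M:

    M = 1_000_000.0
    # Artificial row: the huge values swamp the constant term, as on the slide
    x_new = pd.DataFrame({"const": [1.0], "Adv": [M], "LagAdv1": [M]})
    pred = fit.get_prediction(x_new)

    print(pred.predicted_mean[0] / M)   # ~ sum of the two coefficients
    print(pred.se_mean[0] / M)          # ~ standard error of the sum

Dividing by M leaves the sum's standard error plus terms of order 1/M, which is why the answers agree with the direct calculation to several decimal places.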

Alternative method, continued. 4. This method works for any number of variables. Here is the result obtained when LagAdv2 is added to the model and all three coefficients are summed. The sum of advertising coefficients has now increased to 0.3513, and the standard error has increased to 0.0486. Note that the coefficient of LagAdv2 has a t-stat of only 1.63. Meanwhile, RMSE has decreased from 3.65 to 3.61.

Alternative method, continued. 5. If we add LagAdv3, its t-stat is only 0.72, and the model RMSE actually increases to 3.70. We seem to have reached the point of diminishing returns. The sum of advertising coefficients has now increased to 0.3745, and the standard error has increased to 0.05846. The lower 95% conf. limit is about the same as before, while the upper limit is slightly larger. Notice that as higher lags are included, the coefficients of the lower-order lags remain very stable, indicating that there is no problem with multicollinearity.

Confidence limits for the bottom line. Taking into account the effects of advertising lagged by 2 or 3 periods, it appears that the total impact is between 0.35 and 0.37 units per $, and the lower 95% confidence limit is 0.25. Recall that we also estimated a total impact of 0.36 from the model that included LAG(Sales,1) in lieu of the lag-2 and lag-3 values of Advertising.

Example of regression with time trends: forecasting college enrollment. Variables (annual data): YEAR: 1961 = 1, ..., 1989 = 29 (the time index); ROLL: Fall undergraduate enrollment at UNM; HSGRAD: Spring high school graduates in NM; UNEMP: January unemployment rate (%) in NM. Objective: predict ROLL from the other variables, which are observed earlier in the same calendar year.

Time series plot of the original data (Multiple X-Y Plot of roll, hsgrad, unemp versus year, values ×1000). ROLL trends upward, with a flatter trend in recent years; HSGRAD peaks in mid-sample; UNEMP has been highly variable. Linear relationships are not obvious.

Scatterplot matrix, original data. A scatterplot matrix shows relationships more clearly. The most obvious patterns are the strong time trend in ROLL and a strong relationship with HSGRAD, but is this just a coincidence of upward trends? The relationship of ROLL with UNEMP is not clear.

First cut: regression of ROLL on just HSGRAD and UNEMP. In this model, nearly all of the trend in ROLL is loaded onto HSGRAD, but it yields a poor fit, especially near the end of the series (where the action is), and there is a noticeable upward trend in the residuals (bad).

Let's add the time index as a regressor, to deal with the trend in the residuals. The fit is much improved at the end of the series, and the standard error is reduced by more than 50% (although there is still significant autocorrelation). The coefficients of HSGRAD and UNEMP are significantly reduced, though.

The time index variable is highly significant, but what does that mean? The coefficient of YEAR is 191.9, with a t-stat of 10. Does this mean a long-term trend of 191.9 students per year? Not necessarily! When the time index is included as a regressor in a model with other (possibly trended) independent variables, its role is merely to de-trend all the variables, to correct for possible mismatches in trend. This also helps to reveal whether other trended variables really explain the patterns in the dependent variable, or whether the alignment of trends is merely coincidental.

De-trending the variables using Time Series/Descriptive Methods. To illustrate this phenomenon, let's apply a linear trend adjustment, and then save the adjusted (de-trended) variable(s) back to the spreadsheet.

De-trended variables: time series plot (Multiple X-Y Plot of rolldetrend, hsgraddetrend, unempdetrend versus year). After de-trending, both ROLL and HSGRAD peak in the middle. All of the variables now have zero trend.

De-trended variables: scatterplot matrix (rolldetrend, hsgraddetrend, unempdetrend). Strong correlation is still apparent between de-trended ROLL and HSGRAD, and the scatter plots look a little more normal (i.e., scattered).

Regression of de-trended variables (with no time index). The coefficients of the two independent variables are exactly the same as before! The standard error is different only because a different number of coefficients has been estimated; the errors are exactly the same, as indicated by MAE, DW, and r₁. (R-squared is less because some of the original variance has been explained merely by de-trending.) This is logically the same model as before.

Comparison of forecasting equations. The multiple regression of Y on X and t has this equation:

Y_t = β₀ + β₁·X_t + β₂·t

while the de-trended regression of Y on X has this equation:

Y_t − (a_Y + b_Y·t) = β₀* + β₁*·(X_t − (a_X + b_X·t))

where the a's and b's are the intercepts and slopes of trend lines fitted separately to Y and X. By rearranging terms, the second equation becomes equivalent to the first:

Y_t = (β₀* + a_Y − β₁*·a_X) + β₁*·X_t + (b_Y − β₁*·b_X)·t

so the coefficient of X is the same (β₁ = β₁*), and the trend coefficient β₂ in the first equation is the difference between the trend in Y and β₁ times the trend in X: β₂ = b_Y − β₁·b_X.
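
This equivalence (a special case of the Frisch-Waugh-Lovell theorem) is easy to verify numerically. A minimal Python sketch with synthetic trended data (all names illustrative):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    t = np.arange(30.0)
    x = 5 + 0.8*t + rng.normal(0, 1, 30)               # trended regressor
    y = 2 + 1.5*x + 0.3*t + rng.normal(0, 1, 30)       # trended response

    # Regression of Y on X and t
    b1 = sm.OLS(y, sm.add_constant(np.column_stack([x, t]))).fit().params[1]

    # De-trend each variable separately, then regress the residuals
    detrend = lambda v: sm.OLS(v, sm.add_constant(t)).fit().resid
    b1_star = sm.OLS(detrend(y), sm.add_constant(detrend(x))).fit().params[1]

    print(b1, b1_star)   # identical up to floating-point rounding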

Take-aways on trends. Adding the time index variable as a regressor merely corrects for the fact that the dependent and independent variables may have different trends. Thus, the trend in the independent variables is no longer forced to explain the trend (if any) in the dependent variable. The de-trended regression model assumes that the deviation of the dependent variable from its trend line is a linear function of the deviations of the other variables from their trend lines: a so-called trend-stationary model.

Onward to residual diagnostics: plot of residuals vs. time (studentized residuals versus year). Here we see a wavelike pattern of time-variation in the residuals. This model does not really exploit the fact that the variables are all time series.

Let's try the usual trick of adding lagged variables to the original model. The time index has been dropped, since the lagged variables account for trends. The standard error is much lower, but the patterns of the coefficients suggest that a regression of differenced variables might be a good alternative: the coefficient of the lagged dependent variable is close to 1, and the coefficients of the independent variables and their lags are opposite in sign and comparable in magnitude.

Regression of differenced variables. This model predicts the change in enrollment from the changes in high school grads and unemployment in the same year, although there is still positive autocorrelation, and the standard error has actually increased slightly. (Never mind that R-squared has dropped from 98% to 40%. Since the dependent variable is now differenced, R-squared can't be compared to the previous models.)
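
For reference, the differenced regression is a one-liner in most environments. A sketch with synthetic stand-ins for the enrollment data (the real ROLL/HSGRAD/UNEMP values are not reproduced here):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n = 29
    df = pd.DataFrame({
        "hsgrad": 14000 + 150*np.arange(n) + rng.normal(0, 300, n),
        "unemp": 7 + rng.normal(0, 1, n),
    })
    df["roll"] = 9000 + 0.5*df["hsgrad"] + 150*df["unemp"] + rng.normal(0, 200, n).cumsum()

    d = df.diff().dropna()                      # first differences of every column
    X = sm.add_constant(d[["hsgrad", "unemp"]])
    print(sm.OLS(d["roll"], X).fit().summary())   # R-squared not comparable to levels models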

Trend- vs. difference-stationarity. The earlier model (with the time index) assumed that the variables were trend-stationary, i.e., that each variable tended to revert to its own trend line over time, and deviations from trend lines were correlated. The new model assumes that the variables are difference-stationary, i.e., they are random walks with correlated steps. The trend-stationary model performed poorly due to autocorrelated errors. In practice, trend-stationary models are usually used in conjunction with an autocorrelation correction that assumes an autoregressive error process.

Unit roots? A difference-stationary time series such as a random walk is said to have a unit root (for reasons to be explained when we get to ARIMA models). The question of whether economic time series such as inflation or GDP have unit roots (i.e., whether they are trend-stationary or difference-stationary in the long run) requires very sensitive statistical tests and has long been controversial among econometricians*. *Specialists on this topic are known as unit rooters.
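
The workhorse among those tests is the (augmented) Dickey-Fuller test, available in most statistical packages. A minimal sketch using statsmodels, applied to a simulated random walk (so the test should fail to reject the unit-root null):

    import numpy as np
    from statsmodels.tsa.stattools import adfuller

    rng = np.random.default_rng(3)
    walk = rng.normal(0, 1, 200).cumsum()     # random walk: has a unit root

    stat, pvalue, *_ = adfuller(walk, regression="ct")   # "ct": constant + linear trend
    print(f"ADF stat = {stat:.2f}, p-value = {pvalue:.3f}")
    # A large p-value means we cannot reject the unit root (difference-stationarity)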

More fine-tuning: add a lag of the differenced dependent variable. The standard error has been significantly reduced (best yet), and the autocorrelation is now negligible.

Plot of residuals versus predicted (studentized residuals versus predicted diff(roll)). Looks good: no evidence of a nonlinear relationship or any tendency to make larger errors with larger predictions.

Plot of residuals versus time (studentized residuals versus row number). Hardly any time pattern in the errors, consistent with the good DW stat and lag-1 autocorrelation.

One last residual test: normal probability plot of the residuals. This plot is (alas) not a standard option in the Multiple Regression procedure. It was drawn by saving the RESIDUALS back to the data spreadsheet and then running the Plot/Exploratory Plot/Normal Probability Plot procedure. Looks good!
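
Outside Statgraphics, the same plot takes one call to scipy; a sketch with stand-in residuals (substitute the saved RESIDUALS column for the simulated values):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    resid = np.random.default_rng(4).normal(0, 250, 28)   # stand-in for RESIDUALS

    stats.probplot(resid, dist="norm", plot=plt)          # normal probability plot
    plt.title("Normal Probability Plot of residuals")
    plt.show()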

Economic interpretation of model. This year's enrollment increase is predicted to be equal to 41% of last year's increase, plus 22% of the increase in high school graduates, plus 203 students per percentage point of increase in the unemployment rate, plus an additional 159 students (implying an upward trend, ceteris paribus).
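
Written out as a forecasting equation (reading the slide's percentages and per-unit effects as the fitted coefficients, so the numbers are approximate):

    DIFF(ROLL)_t ≈ 159 + 0.41·DIFF(ROLL)_{t−1} + 0.22·DIFF(HSGRAD)_t + 203·DIFF(UNEMP)_t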

Predicting the future via regression: a dilemma. A regression model requires values of all the independent variables to be available (non-missing) in order to compute a forecast. If the regression model is to be used to predict the future, then its independent variables must be quantities that are either controllable (e.g., advertising) or otherwise known in advance (e.g., deterministic quantities such as trends or seasonal dummies), or else they must be lagged values of other random variables.

Example: enrollment revisited. Suppose it is necessary to make a prediction a whole year in advance. One approach would be to lag the other variables by at least one year, but alas, DIFF(HSGRAD) and DIFF(UNEMP) lose significance when they are lagged!

Regression of change-in-enrollment on last year's change. We would do just as well with only the lagged value of DIFF(ROLL).

Why not just substitute forecasts for the future values of independent variables? In general it is not valid to insert forecasts of future values of independent variables into a regression model that has been fitted to actual past data. The forecasts of the independent variables may not have the same correlations with the dependent variable as the actual values of the independent variables, hence the coefficients may not be correct. The forecast errors associated with the independent variables would also have to somehow be taken into account.

A more correct approach. If forecasts for independent variables must be used, then in principle the regression model should be fitted to past values of the forecasts of the independent variables (generated by the same models), not actual past values. But if those forecasts are merely obtained from extrapolative time series models (based on the prior history of the same independent variables), then this ends up being equivalent to having lagged the independent variables in the original model.

Fancier methods. More complex models, such as vector autoregression (VAR) and state space models, are able to forecast all the variables simultaneously, in parallel. This is nice in theory, but it usually does not produce good long-term forecasts in practice: there are very many parameters to estimate, and the past data is often over-fitted. Hence there is no easy fix: forecasting is hard, especially when it's about the future...

Conclusions. For all these reasons, extrapolative time series models (RW, ES, ARIMA) often perform as well as or better than regression or more complex structural models for predicting one or more periods into the future. The past values of the dependent variable are often the best proxy for the past effects of other causally related random variables. Cross-correlation plots may be helpful for identifying situations where regressors provide genuine predictive power one or more periods in advance.
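
A cross-correlation check is just a loop of lagged correlations. In this synthetic sketch, x leads y by two periods by construction, so the correlation peaks at lag 2 (all names illustrative):

    import numpy as np

    rng = np.random.default_rng(5)
    n = 200
    x = rng.normal(size=n)
    y = np.zeros(n)
    y[2:] = 0.8*x[:-2] + rng.normal(scale=0.5, size=n-2)   # y responds to x two periods later

    for lag in range(5):
        r = np.corrcoef(x[:n-lag], y[lag:])[0, 1]   # corr(x[t], y[t+lag])
        print(lag, round(r, 2))                     # largest at lag = 2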

Recap of class 7: confidence limits for sums of coefficients; use of the time index as a regressor; the difficulty of predicting the future.