Metropolitan State University
ECON 497: Research and Forecasting
Lecture Notes 4
The Classical Model: Assumptions and Violations
Studenmund Chapter 4

Ordinary least squares (OLS) is a very good way of estimating a linear relationship between a dependent variable and some independent or explanatory variables. In some sense, it is the best method of doing this. For it to be "best", however, there are seven assumptions that must be satisfied. These assumptions are somewhat technical, but here is an attempt to explain what each means and how it affects regression results.

I. The regression model is linear in the coefficients, is correctly specified and has an additive error term.

This assumption has three parts. Let's look at each in turn.

First, the model must be linear in the coefficients. This means that the process actually occurring in the real world is described by a relationship of the form

Y_i = β_0 + β_1 X_1i + β_2 X_2i + β_3 X_3i + ε_i

or that the actual relationship can be rewritten in this form by, for example, taking logs. This part is best considered in conjunction with the second.

The second part is that the model is correctly specified. Combining the first two parts, we see that we need to know the actual process through which the dependent variable is determined, and that relationship has to be linear or reducible to a linear form. While this may be possible for simple physical processes, it is virtually impossible that the decision-making process of any rational person operating in a complex environment can be correctly modeled by a linear equation. What must be believed instead is that the model is sufficiently close to the actual process that the difference isn't important. In a practical sense, you can defend a specification by saying that it is a standard specification used in looking at the situation being examined. If you find other studies, papers or reports that have used a particular model, that may be an acceptable reason for using it. On the other hand, if you use a model which no one else has used, your results may argue for rejecting other models in favor of yours.

The third part is that the error term is additive rather than being multiplicative or entering in some other way. This third part of the assumption is probably no more suspect than the first two parts (that you have the correct model and that it is linear), but by looking at the residuals (the errors you get after you estimate your model, also known as the e_i's) it can be determined whether or not this may be true. If the model is not correct, we may have problems such as those we've seen in class where the correct relationship is, say, curved but the estimated model is a straight line.
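To make that residual check concrete, here is a small Python sketch (mine, not from the notes or Studenmund). The data and variable names are made up so that the true relationship is curved; fitting a straight line anyway leaves an obvious pattern in the residuals.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: the true relationship is curved (quadratic), not linear.
x = np.linspace(0, 10, 100)
y = 2 + 0.5 * x**2 + rng.normal(scale=2.0, size=x.size)

# Estimate a straight line by OLS anyway.
b1, b0 = np.polyfit(x, y, deg=1)          # slope, intercept
residuals = y - (b0 + b1 * x)

# If the specification were adequate, the residuals would show no pattern in x.
# Here they are systematically negative in the middle of the range and positive
# at the ends, which a plot of residuals against x (or this crude split) reveals.
middle = (x > 3) & (x < 7)
print("mean residual, middle of the range:", residuals[middle].mean())
print("mean residual, ends of the range:  ", residuals[~middle].mean())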

II. The error term has a zero population mean.

This means that the expected value of ε_i is zero (E[ε_i] = 0). There is nothing earthshaking about this. When you do your regression, your residuals will, basically by construction, have a mean value of zero. This is a practical matter and is only loosely related to the theoretical presentation above. Basically, the estimated model will have an error term with a mean value of zero, so if the theoretical model doesn't have an error term with an expected value of zero, the two versions will be inconsistent.

III. All explanatory variables are uncorrelated with the error term (no endogeneity).

This means that the error term is not likely to be larger or smaller, or positive or negative, when any of the explanatory variables are larger or smaller. If this were not true, then you might know, for example, that the error term is larger when one of the explanatory variables is larger and smaller when the explanatory variable is smaller. If this were true, the model could be improved based on the value of that explanatory variable.

Whether or not this condition is satisfied can be investigated in at least two ways, as sketched below.

A. The residuals (the differences between the actual values of the dependent variable and the predicted values) can be graphed against the various explanatory variables.

B. Correlation coefficients between the residuals and the various explanatory variables can be calculated.

There should be no discernible patterns in the graphs and only very small correlation coefficients.
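Check B is easy to automate. The following Python sketch (mine, with made-up data and variable names) computes the correlation between the OLS residuals and candidate explanatory variables. One caveat: for a regressor that is already included in a regression with an intercept, this in-sample correlation is zero by construction, so the sketch also correlates the residuals with a variable that was left out of the model, to show what the check looks like when something really is wrong.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: y depends on both x1 and x2, and x1 and x2 are correlated.
n = 200
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

# Estimate a model that (wrongly) leaves x2 out.
X = np.column_stack([np.ones(n), x1])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Check B: correlation of the residuals with candidate explanatory variables.
# For the included regressor x1 this is zero by construction; for the omitted
# x2 it is clearly not, which is the kind of problem assumption III rules out.
for name, x in [("x1 (included)", x1), ("x2 (omitted) ", x2)]:
    print(f"corr(residuals, {name}) = {np.corrcoef(residuals, x)[0, 1]:+.3f}")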

An Example of Endogeneity

According to Prof. Lundberg, "Endogeneity is a problem when one of the right-hand side variables is correlated with the error term because it is being determined as part of the whole behavioral system that this regression equation is part of. So, if we're trying to explain hours of TV watching, putting the number of sets in the household on the right-hand side is a no-no. Both hours and sets will be driven by taste for TV watching, and the coefficient on number of sets will be meaningless behaviorally (though probably big and significant). So, endogeneity is a specification problem, and needs to be dealt with by estimating reduced-form models with only exogenous variables on the right, IV, or some simultaneous-equations method."

Let's pursue this a bit. Imagine that there is some variable, Y_i, which you are interested in. There are a number of explanatory variables that you want to include in the regression: X_1i, X_2i, X_3i, X_4i and X_5i. So, the equation you estimate is:

Y_i = β_0 + β_1 X_1i + β_2 X_2i + β_3 X_3i + β_4 X_4i + β_5 X_5i + ε_i

However, if X_1i is endogenously determined, we will get a violation of the assumptions of the classical model, meaning that the happy results associated with OLS may not hold. For example, let's say that:

X_1i = α_0 + α_1 X_2i + α_2 X_3i + φ_i

where φ_i is the error term. If this is how X_1i is actually determined, then substituting this expression into the equation for Y_i gives:

Y_i = β_0 + β_1 (α_0 + α_1 X_2i + α_2 X_3i + φ_i) + β_2 X_2i + β_3 X_3i + β_4 X_4i + β_5 X_5i + ε_i

Rewriting this a bit gives us:

Y_i = (β_0 + β_1 α_0) + (β_1 α_1 + β_2) X_2i + (β_1 α_2 + β_3) X_3i + β_4 X_4i + β_5 X_5i + (ε_i + β_1 φ_i)

To summarize this equation, there is a constant term (β_0 + β_1 α_0), an error term (ε_i + β_1 φ_i), and coefficients attached to each of the remaining explanatory variables. If X_1i is included in the regression, it will be correlated with the error term because X_1i is a linear function of part of the error term, φ_i. Because the error term is correlated with one of the explanatory variables, assumption three (III. All explanatory variables are uncorrelated with the error term.) is violated, so OLS won't work. Now, knowing that this may be a problem, what can or should be done about it? Kennedy (chapter 10) describes different approaches to dealing with this problem and, if you're interested, I would be happy to share them with you.

IV. Observations of the error term are uncorrelated with each other (no serial correlation).

If you're looking at time series data (data collected from the same source in a number of different periods), the error term (e_i = Y_i - Ŷ_i) in one period should not have any relation to the error term from the previous period. A way to investigate this is to graph the error terms over time and see if there are any patterns or long runs of either positive or negative values. You may at some point see reference made to something called a runs test. This is basically a test of whether the residuals form suspiciously long runs of the same sign (equivalently, suspiciously few runs), which is what positive serial correlation produces. In looking at the California gasoline consumption data, there appears to be serial correlation for some models.
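To make the "long runs" idea concrete, here is a small Python sketch (mine, not from the notes); the residual series is simulated to be positively serially correlated, and both the lag-1 autocorrelation and a simple run count pick that up.

import numpy as np

rng = np.random.default_rng(2)

# Hypothetical residuals from a time-series regression, built to be
# positively serially correlated (an AR(1)-style process).
T = 120
e = rng.normal(size=T)
for t in range(1, T):
    e[t] += 0.7 * e[t - 1]

# Lag-1 autocorrelation: near zero if assumption IV holds, well above zero here.
lag1_corr = np.corrcoef(e[:-1], e[1:])[0, 1]

# Number of runs of same-signed residuals: serial correlation produces long
# runs, and therefore fewer runs than independent residuals would show.
signs = np.sign(e)
n_runs = 1 + int(np.sum(signs[1:] != signs[:-1]))

print(f"lag-1 autocorrelation of residuals: {lag1_corr:.3f}")
print(f"runs of same-signed residuals: {n_runs} (out of {T} observations)")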

V. The error term has a constant variance (no heteroskedasticity).

This means that the errors aren't more spread out for some of the observations than for others. It's tough to describe, but there's a good picture of it in Studenmund, p. 99 (in the third edition). The Studenmund picture shows a scatterplot with an explanatory variable on the horizontal axis, the dependent variable on the vertical axis and the regression line drawn in. The points of the scatterplot are further away from the regression line for larger values of the explanatory variable.

Here's another picture:

[Figure: Scatterplot of squared residuals (Residsq) on SQFT]

When the independent variable, SQFT, is smaller, the errors tend to be smaller. As SQFT increases, the errors get larger. This may be easier to see if you plot the squared errors, as is done in this picture. Basically, if you graph the squared residuals against all the explanatory variables, the size of the residuals shouldn't depend on the values of the explanatory variables. If the residuals get larger as the explanatory variable gets larger (or smaller), then you have heteroskedasticity.
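Here is what that check looks like in a small Python sketch (mine; the SQFT-style data are simulated so that the error variance grows with square footage): regress the dependent variable on the explanatory variable, then see whether the squared residuals rise with it.

import numpy as np

rng = np.random.default_rng(3)

# Hypothetical house data: the error variance grows with square footage,
# which is exactly the heteroskedasticity pattern described above.
n = 300
sqft = rng.uniform(800, 4000, size=n)
price = 20000 + 90 * sqft + rng.normal(scale=0.04 * sqft, size=n)

# OLS fit of price on sqft (with an intercept).
X = np.column_stack([np.ones(n), sqft])
beta_hat, *_ = np.linalg.lstsq(X, price, rcond=None)
residuals = price - X @ beta_hat

# Informal check: do the squared residuals rise with sqft? A plot of
# residuals**2 against sqft would show the same thing graphically.
r = np.corrcoef(residuals**2, sqft)[0, 1]
small = sqft < np.median(sqft)
print(f"corr(squared residuals, sqft): {r:.3f}")
print(f"mean squared residual, smaller houses: {np.mean(residuals[small]**2):,.0f}")
print(f"mean squared residual, larger houses:  {np.mean(residuals[~small]**2):,.0f}")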

An example of a case in which heteroskedasticity might be a problem is in modeling house prices as a function of house characteristics. There might be larger variance for the error term for more expensive houses and smaller variance for less expensive houses. A 95% confidence interval for the true value of a house with an estimated value of $40,000 might be [$38,000, $42,000], while the same interval for a house with an estimated value of $2,000,000 might be [$1,900,000, $2,100,000].

Kennedy has a very good discussion of the consequences of heteroskedasticity, methods of testing for it and a somewhat vague description of how to correct for it. Kennedy offers four methods of detecting heteroskedasticity:

1. Visual inspection of the residuals
2. The Goldfeld-Quandt test
3. The Breusch-Pagan test
4. The White test

The first of these is within your power to do in Excel. The others should be available options in any good software package.

To deal with heteroskedasticity, there are really two options:

1. You could run a weighted least squares (rather than an ordinary least squares) regression.
2. You could opt for the zen approach and eliminate heteroskedasticity in a more spiritual way.

VI. No explanatory variable is a perfect linear function of any other explanatory variable(s) (no perfect multicollinearity).

This just means that no explanatory variable is a linear function of another explanatory variable or of a combination of other explanatory variables. You can therefore include X and X² as explanatory variables, since X² is not a linear function of X. You cannot include, for example, temperature in degrees Fahrenheit (F) and in degrees Celsius (C), because F = 32 + 1.8·C. That is, Fahrenheit is a linear function of Celsius (and vice versa). This is also why you must exclude one of the dummy variables if there is an exhaustive list of them. If, for example, you have dummy variables for male (M) and female (F) subjects and there are no other genders, then for each observation M + F = 1, or F = 1 - M, or M = 1 - F. Because these two variables are linear functions of each other, one of them must be excluded.

One way to see if this might be a problem is to generate a matrix of correlation coefficients for all the explanatory variables and the dependent variable. This won't tell you if a large number of variables are linearly related (such as dummy variables for a person's home state, for example), but it will tell you if two variables are linearly related. Alternatively, when you do a linear regression in SPSS, you can ask for Statistics/Collinearity diagnostics. In addition to all the wonderful things you usually get with your regression output, you'll get Variance Inflation Factors (VIF) for each explanatory variable. The larger the VIF, the greater the likelihood that you have multicollinearity. The VIF is calculated from a regression of each explanatory variable on all the other explanatory variables, and is equal to 1/(1 - R²) from that regression.
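To make the VIF formula concrete, here is a small Python sketch (mine, with made-up variables; x3 is built to be nearly a linear combination of x1 and x2): each explanatory variable is regressed on the others and VIF = 1/(1 - R²) is computed.

import numpy as np

rng = np.random.default_rng(4)

# Hypothetical explanatory variables; x3 is almost a linear combination of
# x1 and x2, so the collinear variables should show inflated VIFs.
n = 250
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.8 * x1 + 0.6 * x2 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2, x3])
names = ["x1", "x2", "x3"]

# Correlation matrix of the explanatory variables (the first check above).
print("correlation matrix:\n", np.round(np.corrcoef(X, rowvar=False), 3))

# VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing variable j
# on all the other explanatory variables (plus an intercept).
for j, name in enumerate(names):
    y_j = X[:, j]
    others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
    coefs, *_ = np.linalg.lstsq(others, y_j, rcond=None)
    fitted = others @ coefs
    r2 = 1 - np.sum((y_j - fitted) ** 2) / np.sum((y_j - y_j.mean()) ** 2)
    print(f"VIF({name}) = {1 / (1 - r2):.1f}")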

Kennedy has a very good section on multicollinearity. A fun quote from this section:

"The OLS estimator in the presence of multicollinearity remains unbiased and in fact is still BLUE. The R² statistic is unaffected. In fact, since all the CLR assumptions are (strictly speaking) still met, the OLS estimator retains all its desirable properties, as noted in chapter 3. The major undesirable consequence of multicollinearity is that the variances of the OLS estimates of the parameters of the collinear variables are quite large. These high variances arise because in the presence of multicollinearity the OLS estimating procedure is not given enough independent variation in a variable to calculate with confidence the effect it has on the dependent variable."

Possible remedies are suggested later by Kennedy.

VII. The error term is normally distributed.

This is important when generating confidence intervals and doing hypothesis testing in small samples, but is less important as the sample size increases. To see if your residuals are normally distributed, you can generate a histogram of them and see if they appear to be normal. To do this in SPSS, you can run a regression, save the residuals and then use Graph/Histogram to make a histogram of the residuals and see if they appear to be normally distributed. There are also statistical tests that you can do to see if the residuals are normally distributed. To be totally honest, most of the time if you do a statistical test to determine whether your residuals are normally distributed, the null hypothesis of normality will probably be rejected.
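If you want to do the histogram and a formal test outside SPSS, here is a small Python sketch (mine; the residuals are simulated to be skewed, and the choice of SciPy's Shapiro-Wilk test is my own, not something the notes prescribe).

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical residuals: right-skewed rather than normal.
residuals = rng.exponential(scale=1.0, size=200)
residuals -= residuals.mean()          # residuals have mean zero by construction

# A quick text "histogram" of the residuals.
counts, edges = np.histogram(residuals, bins=10)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"[{left:6.2f}, {right:6.2f}) {'*' * count}")

# A formal normality test (Shapiro-Wilk). As the notes warn, with real data
# the null hypothesis of normality is quite often rejected.
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk statistic = {stat:.3f}, p-value = {p_value:.4f}")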
