Chapter 15: Other Regression Statistics and Pitfalls
Chapter 15 Outline

Two-Tailed Confidence Intervals
  o Confidence Interval Approach: Which Theories Are Consistent with the Data?
  o A Confidence Interval Example: Television Growth Rates
  o Calculating Confidence Intervals with Statistical Software
Coefficient of Determination, R-Squared (R²)
Pitfalls
  o Explanatory Variable Has the Same Value for All Observations
  o One Explanatory Variable Is a Linear Combination of Other Explanatory Variables
  o Dependent Variable Is a Linear Combination of Explanatory Variables
  o Outlier Observations
  o Dummy Variable Trap

Chapter 15 Prep Questions

1. A friend believes that the internet is displacing the television as a source of news and entertainment. The friend theorizes that after accounting for other factors, television usage is falling by 1 percent annually:

−1.0 Percent Growth Rate Theory: After accounting for all other factors, the annual growth rate of television users is negative 1.0 percent.

Recall the model we used previously to explain television use:

LogUsersTV_t = β_Const + β_Year Year_t + β_CapHum CapitalHuman_t + β_CapPhy CapitalPhysical_t + β_GDP GdpPC_t + β_Auth Auth_t + e_t

and the data we used:

Internet and Television Data: Panel data of Internet, television, economic, and political statistics for 208 countries from 1995 to 2002. [Link to MIT-InternetFlat wf1 goes here.]

LogUsersInternet_t   Logarithm of Internet users per 1,000 people for observation t
LogUsersTV_t         Logarithm of television users per 1,000 people for observation t
Year_t               Year for observation t
CapitalHuman_t       Literacy rate for observation t (percent of population 15 and over)
CapitalPhysical_t    Telephone mainlines per 10,000 people for observation t
GdpPC_t              Per capita real GDP in nation t (1,000s of international dollars)
Auth_t               The Freedom House measure of political authoritarianism for observation t, normalized to a 0 to 10 scale: 0 represents the most democratic rating and 10 the most authoritarian. During the period, Canada and the U.S. had a 0 rating; Iraq and the Democratic People's Republic of Korea (North Korea) rated 10.

Now, assess your friend's theory.
a. Use the ordinary least squares (OLS) estimation procedure to estimate the model's parameters. [Link to MIT-InternetFlat wf1 goes here.]
b. Formulate the appropriate null and alternative hypotheses. Is a one-tailed or a two-tailed test appropriate?
c. Use the Econometrics Lab to calculate the Prob[Results IF H0 True]. [Link to MIT-TTest 0.1 goes here.]

2. A regression's coefficient of determination, called the R-squared, is referred to as the goodness of fit. It equals the portion of the dependent variable's squared deviations from its mean that is explained by the parameter estimates:

R² = Explained Squared Deviations from the Mean / Actual Squared Deviations from the Mean = Σ_{t=1}^{T} (Esty_t − ȳ)² / Σ_{t=1}^{T} (y_t − ȳ)²

Calculate the R-squared for Professor Lord's first quiz by filling in the following blanks (Esty_t equals b_Const + b_x x_t):

Student | x_t | y_t | Actual deviation from mean: y_t − ȳ | Actual squared deviation: (y_t − ȳ)² | Esty_t | Explained deviation from mean: Esty_t − ȳ | Explained squared deviation: (Esty_t − ȳ)²
Σ_{t=1}^{T} y_t = ____     Σ_{t=1}^{T} (y_t − ȳ)² = ____     Σ_{t=1}^{T} (Esty_t − ȳ)² = ____

ȳ = ____     R² = ____

3. Students frequently experience difficulties when analyzing data. To illustrate some of these, we first review the goal of multiple regression analysis:

Goal of Multiple Regression Analysis: Multiple regression analysis attempts to sort out the individual effect of each explanatory variable. An explanatory variable's coefficient estimate allows us to estimate the change in the dependent variable resulting from a change in that particular explanatory variable while all other explanatory variables remain constant.

Reconsider our baseball data:

Baseball Data: Panel data of baseball statistics for the 588 American League games played during the summer of 1996.

Attendance_t    Paid attendance for game t
DH_t            Designated hitter for game t (1 if DH permitted; 0 otherwise)
HomeSalary_t    Player salaries of the home team for game t (millions of dollars)
PriceTicket_t   Average price of tickets sold for game t's home team (dollars)
VisitSalary_t   Player salaries of the visiting team for game t (millions of dollars)

Now, consider several pitfalls that students often encounter:

a. Explanatory variable has the same value for all observations. Run the following regression: [Link to MIT-ALSummer-1996.wf1 goes here.]
Dependent variable: Attendance
Explanatory variables: PriceTicket, HomeSalary, and DH
1) What happens?
2) What is the value of DH_t for each of the observations?
3) Why is it impossible to determine the effect of an explanatory variable if the explanatory variable has the same value for each observation? Explain.

b. One explanatory variable is a linear combination of other explanatory variables. Generate a new variable, the ticket price in terms of cents:
PriceCents_t = 100 × PriceTicket_t
Run the following regression: [Link to MIT-ALSummer-1996.wf1 goes here.]
Dependent variable: Attendance
Explanatory variables: PriceTicket, PriceCents, and HomeSalary
1) What happens?
2) Is it possible to sort out the effect of two explanatory variables when they contain redundant information?

c. One explanatory variable is a linear combination of other explanatory variables (another example). Generate a new variable, the total salaries of the two teams playing:
TotalSalary_t = HomeSalary_t + VisitSalary_t
Run the following regression: [Link to MIT-ALSummer-1996.wf1 goes here.]
Dependent variable: Attendance
Explanatory variables: PriceTicket, HomeSalary, VisitSalary, and TotalSalary
1) What happens?
2) Is it possible to sort out the effect of explanatory variables when they are linear combinations of each other and therefore contain redundant information?

d. Dependent variable is a linear combination of explanatory variables. Run the following regression: [Link to MIT-ALSummer-1996.wf1 goes here.]
Dependent variable: TotalSalary
Explanatory variables: HomeSalary and VisitSalary
What happens?

e. Outlier observations. First, run the following regression: [Link to MIT-ALSummer-1996.wf1 goes here.]
Dependent variable: Attendance
Explanatory variables: PriceTicket and HomeSalary
1) What is the coefficient estimate for the ticket price?
2) Look at the first observation. What is the value of HomeSalary for the first observation?
Now, access a second workfile in which a single value was entered incorrectly. [Link to MIT-ALSummerOutlier-1996.wf1 goes here.]
3) Look at the first observation. What is the value of HomeSalary for the first observation? Was the value entered correctly?
Run the following regression:
Dependent variable: Attendance
Explanatory variables: PriceTicket and HomeSalary
4) Compare the coefficient estimates in the two regressions.

4. Return to our faculty salary data.

Faculty Salary Data: Artificially constructed cross section salary data and characteristics for 200 faculty members.

Salary_t       Salary of faculty member t (dollars)
Experience_t   Teaching experience for faculty member t (years)
Articles_t     Number of articles published by faculty member t
SexM1_t        1 if faculty member t is male; 0 if female

As we did in Chapter 13, generate the dummy variable SexF1, which equals 1 for a woman and 0 for a man. Run the following three regressions specifying Salary as the dependent variable: [Link to MIT-FacultySalaries.wf1 goes here.]
a. Explanatory variables: SexF1 and Experience
b. Explanatory variables: SexM1 and Experience
c. Explanatory variables: SexF1, SexM1, and Experience, but without a constant

Getting Started in EViews: To estimate the third model (part c) using EViews, you must fool EViews into running the appropriate regression:
In the Workfile window: highlight Salary and then, while depressing <Ctrl>, highlight SexF1, SexM1, and Experience.
In the Workfile window: double click on a highlighted variable.
Click Open Equation.
In the Equation Specification window, delete c so that the window looks like this: salary sexf1 sexm1 experience.
Click OK.

For each regression, what is the equation that estimates the salary for
1) men?
2) women?

Last, run one more regression specifying Salary as the dependent variable:
d. Explanatory variables: SexF1, SexM1, and Experience, but with a constant. What happens?

5. Consider a system of 2 linear equations and 3 unknowns. Can you solve for all three unknowns?

Two-Tailed Confidence Intervals: Which Theories Are Consistent with the Data?

Our approach thus far has been to present a theory first and then use data to assess the theory:
First, we presented a theory.
Second, we analyzed the data to determine whether or not the data were consistent with the theory.
In other words, we started with a theory and then decided whether or not the data were consistent with it.

The confidence interval approach reverses this process. Confidence intervals indicate the range of theories that are consistent with the data:
First, we analyze the data.
Second, we consider various theories and determine which theories are consistent with the data and which are not.
In other words, the confidence interval approach starts with the data and then decides which theories are compatible.

Hypothesis testing plays a key role in both approaches. Consequently, we must choose a significance level; a confidence interval's size determines the significance level. We use significance levels to distinguish between a small probability and a large probability. The significance level associated with a confidence interval equals 100 percent less the size of the two-tailed confidence interval. Three commonly used confidence intervals are 90, 95, and 99 percent:
For a 90 percent confidence interval, the significance level is 10 percent.
For a 95 percent confidence interval, the significance level is 5 percent.
For a 99 percent confidence interval, the significance level is 1 percent.

A theory is consistent with the data if we cannot reject the null hypothesis at the confidence interval's significance level.
A Confidence Interval Example: Television Growth Rates

No doubt this sounds confusing, so let us work through an example using our international television data:

Project: Which growth theories are consistent with the international television data?

Internet and Television Data: Panel data of Internet, television, economic, and political statistics for 208 countries from 1995 to 2002. [Link to MIT-InternetFlat wf1 goes here.]

LogUsersInternet_t   Logarithm of Internet users per 1,000 people for observation t
LogUsersTV_t         Logarithm of television users per 1,000 people for observation t
Year_t               Year for observation t
CapitalHuman_t       Literacy rate for observation t (percent of population 15 and over)
CapitalPhysical_t    Telephone mainlines per 10,000 people for observation t
GdpPC_t              Per capita real GDP in nation t (1,000s of international dollars)
Auth_t               The Freedom House measure of political authoritarianism for observation t, normalized to a 0 to 10 scale: 0 represents the most democratic rating and 10 the most authoritarian. During the period, Canada and the U.S. had a 0 rating; Iraq and the Democratic People's Republic of Korea (North Korea) rated 10.
We begin by specifying the size of the confidence interval. It is most common to specify a 95 percent confidence interval, which means that we are choosing a significance level of 5 percent. The following two steps formalize the procedure to decide whether a theory lies within the two-tailed 95 percent confidence interval:

Step 1: Analyze the data. Use the ordinary least squares (OLS) estimation procedure to estimate the model's parameters.
Step 2: Consider a specific theory. Is the theory consistent with the data? Does the theory lie within the confidence interval?
  o Step 2a: Based on the theory, construct the null and alternative hypotheses. The null hypothesis reflects the theory.
  o Step 2b: Compute Prob[Results IF H0 True].
  o Step 2c: Do we reject the null hypothesis?
    Yes: Reject the theory. The data are not consistent with the theory. The theory does not lie within the confidence interval.
    No: The data are consistent with the theory. The theory does lie within the confidence interval.

Since our example uses a 95 percent confidence interval and hence a 5 percent significance level:

Prob[Results IF H0 True] < .05: Reject H0. The theory is not consistent with the data; it does not lie within the 95 percent confidence interval.
Prob[Results IF H0 True] > .05: Do not reject H0. The theory is consistent with the data; it does lie within the 95 percent confidence interval.

We shall illustrate the steps by focusing on four growth rate theories postulating what the growth rate of television use equals after accounting for other relevant factors:
0.0 Percent Growth Rate Theory
−1.0 Percent Growth Rate Theory
4.0 Percent Growth Rate Theory
6.0 Percent Growth Rate Theory
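The Step 2c decision rule can be sketched in a few lines. This is a rough illustration, not the Econometrics Lab's own code: it takes as given the Year coefficient estimate (.023) and standard error (.0159) reported in the regression results below, along with an approximate two-tailed 5 percent critical value of 1.96 for 736 degrees of freedom, and asks whether each theory's null value can be rejected.

```python
# Sketch of the confidence-interval decision rule: a theory (a null value
# for the coefficient) lies inside the 95 percent confidence interval
# exactly when we cannot reject H0 at the 5 percent significance level.
# The estimate and SE come from the television regression in the text;
# the critical value 1.96 is a large-sample approximation.

ESTIMATE = 0.023   # OLS estimate of the Year coefficient
SE = 0.0159        # its standard error
T_CRIT = 1.96      # two-tailed 5% critical value (DF = 736, approximate)

def in_confidence_interval(null_value: float) -> bool:
    """Return True when the theory is consistent with the data."""
    t_stat = abs(ESTIMATE - null_value) / SE
    return t_stat <= T_CRIT   # cannot reject H0 -> inside the interval

for theory in (0.000, -0.010, 0.040, 0.060):
    verdict = "inside" if in_confidence_interval(theory) else "outside"
    print(f"growth rate theory {theory:+.3f}: {verdict} the 95% interval")
```

Running the loop reproduces the chapter's verdicts: .000 and .040 lie inside the interval, while −.010 and .060 lie outside.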
0.0 Percent Growth Rate Theory

Since television is a mature technology, we begin with a theory postulating that time will have no impact on television use after accounting for other factors; that is, after accounting for other factors, the growth rate of television use will equal 0.0 percent. We shall now apply our two steps to determine whether the 0.0 percent growth rate theory lies within the 95 percent confidence interval:

Step 1: Analyze the data. Use the ordinary least squares (OLS) estimation procedure to estimate the model's parameters. We shall apply the same model to explain television use that we used previously:

Model: LogUsersTV_t = β_Const + β_Year Year_t + β_CapHum CapitalHuman_t + β_CapPhy CapitalPhysical_t + β_GDP GdpPC_t + β_Auth Auth_t + e_t

We already estimated the parameters of this model in Chapter 13:

Ordinary Least Squares (OLS)
Dependent Variable: LogUsersTV
Explanatory Variable(s):   Estimate   SE      t-Statistic   Prob
Year                       .023       .0159
CapitalHuman
CapitalPhysical
GdpPC
Auth
Const
Number of Observations     742

Estimated Equation: EstLogUsersTV = b_Const + .023Year + b_CapHum CapitalHuman + b_CapPhy CapitalPhysical + .058GdpPC + .064Auth

Table 15.1: Television Regression Results

Step 2: 0.0 Percent Growth Rate Theory. Focus on the effect of time. Is the 0.0 percent growth theory consistent with the data? Does the theory lie within the confidence interval?

0.0 Percent Growth Rate Theory: After accounting for all other explanatory variables, time has no effect on television use; that is, after accounting for all other explanatory variables, the annual growth rate of television use equals 0.0 percent. Accordingly, the actual coefficient of Year, β_Year, equals .000.
o Step 2a: Based on the theory, construct the null and alternative hypotheses.
  H0: β_Year = .000
  H1: β_Year ≠ .000
o Step 2b: Compute Prob[Results IF H0 True].
  Prob[Results IF H0 True] = Probability that the coefficient estimate would be at least .023 from .000, if H0 were true (that is, if the actual coefficient, β_Year, equals .000).

  OLS estimation procedure unbiased, if H0 true: Mean[b_Year] = β_Year = 0
  Standard error: SE[b_Year] = .0159
  Number of observations less number of parameters: DF = 742 − 6 = 736

We can use the Econometrics Lab to calculate the probability of obtaining the results if the null hypothesis is true. Remember that we are conducting a two-tailed test.

Econometrics Lab 15.1: Calculate Prob[Results IF H0 True]

First, calculate the right hand tail probability. [Link to MIT-Lab 15.1a goes here.]
Question: What is the probability that the estimate lies at or above .023?
Answer: .074.

Figure 15.1: Probability Distribution of Coefficient Estimate, 0.0 Percent Growth Rate Theory (Student t-distribution: Mean = .000, SE = .0159, DF = 736)
Second, calculate the left hand tail probability. [Link to MIT-Lab 15.1b goes here.]
Question: What is the probability that the estimate lies at or below −.023?
Answer: .074.

The Prob[Results IF H0 True] equals the sum of the left and right tail probabilities:
Prob[Results IF H0 True] = .074 + .074 = .148

o Step 2c: Do we reject the null hypothesis? No, we do not reject the null hypothesis at a 5 percent significance level; Prob[Results IF H0 True] equals .148, which is greater than .05. The theory is consistent with the data; hence, .000 does lie within the 95 percent confidence interval.

Let us now apply the procedure to three other theories:

−1.0 Percent Growth Rate Theory: After accounting for all other factors, the annual growth rate of television users is −1.0 percent; that is, β_Year equals −.010.
4.0 Percent Growth Rate Theory: After accounting for all other factors, the annual growth rate of television users is 4.0 percent; that is, β_Year equals .040.
6.0 Percent Growth Rate Theory: After accounting for all other factors, the annual growth rate of television users is 6.0 percent; that is, β_Year equals .060.
We shall not provide justification for any of these theories. The confidence interval approach does not worry about justifying the theory; the approach is pragmatic, simply asking whether or not the data support the theory.

−1.0 Percent Growth Rate Theory

Step 1: Analyze the data. Use the ordinary least squares (OLS) estimation procedure to estimate the model's parameters. We have already done this.
Step 2: −1.0 Percent Growth Rate Theory. Is the theory consistent with the data? Does the theory lie within the confidence interval?
o Step 2a: Based on the theory, construct the null and alternative hypotheses.
  H0: β_Year = −.010
  H1: β_Year ≠ −.010
o Step 2b: Compute Prob[Results IF H0 True]. To compute Prob[Results IF H0 True] we first pose a question:
  Question: How far is the coefficient estimate, .023, from the value of the coefficient specified by the null hypothesis, −.010?
  Answer: .033.
Accordingly,
  Prob[Results IF H0 True] = Probability that the coefficient estimate would be at least .033 from −.010, if H0 were true (that is, if the actual coefficient, β_Year, equals −.010).

  OLS estimation procedure unbiased, if H0 true: Mean[b_Year] = β_Year = −.010
  Standard error: SE[b_Year] = .0159
  Number of observations less number of parameters: DF = 742 − 6 = 736

We can use the Econometrics Lab to calculate the probability of obtaining the results if the null hypothesis is true. Once again, remember that we are conducting a two-tailed test:
Econometrics Lab 15.2: Calculate Prob[Results IF H0 True]

First, calculate the right hand tail probability. [Link to MIT-Lab 15.2a goes here.]

Figure 15.2: Probability Distribution of Coefficient Estimate, −1.0 Percent Growth Rate Theory (Student t-distribution: Mean = −.010, SE = .0159, DF = 736)

Question: What is the probability that the estimate lies .033 or more above −.010, at or above .023?
Answer: .019.

Second, calculate the left hand tail probability. [Link to MIT-Lab 15.2b goes here.]
Question: What is the probability that the estimate lies .033 or more below −.010, at or below −.043?
Answer: .019.

The Prob[Results IF H0 True] equals the sum of the two tail probabilities:
Prob[Results IF H0 True] = .019 + .019 = .038

o Step 2c: Do we reject the null hypothesis? Yes, we do reject the null hypothesis at a 5 percent significance level; Prob[Results IF H0 True] equals .038, which is less than .05.
The theory is not consistent with the data; hence, −.010 does not lie within the 95 percent confidence interval.

4.0 Percent Growth Rate Theory
Following the same procedure for the 4.0 percent growth rate theory:
Prob[Results IF H0 True] ≈ .285
We do not reject the null hypothesis at a 5 percent significance level. The theory is consistent with the data; hence, .040 does lie within the 95 percent confidence interval.

6.0 Percent Growth Rate Theory
Again, following the same procedure for the 6.0 percent growth rate theory:
Prob[Results IF H0 True] ≈ .020
We do reject the null hypothesis at a 5 percent significance level. The theory is not consistent with the data; hence, .060 does not lie within the 95 percent confidence interval.
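The tail probabilities for all four theories can be checked outside the Econometrics Lab. The sketch below is an approximation, not the Lab's code: with 736 degrees of freedom the Student t-distribution is nearly standard normal, so the normal CDF (available through `math.erf`) reproduces the reported two-tailed probabilities to three decimal places.

```python
import math

# Two-tailed Prob[Results IF H0 True] for each growth rate theory,
# using the normal approximation to the t-distribution (DF = 736).
# The estimate and standard error come from the television regression.

ESTIMATE, SE = 0.023, 0.0159

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_tailed_prob(null_value: float) -> float:
    """Probability of an estimate at least this far from the null value."""
    t_stat = abs(ESTIMATE - null_value) / SE
    return 2.0 * (1.0 - normal_cdf(t_stat))  # right tail + left tail

for theory in (0.000, -0.010, 0.040, 0.060):
    print(f"H0: beta_Year = {theory:+.3f}  ->  Prob = {two_tailed_prob(theory):.3f}")
```

The loop recovers .148, .038, .285, and .020, matching the values in the text.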
Now, let us summarize the four theories:

Figure 15.3: Probability Distribution of Coefficient Estimate, Comparison of Growth Rate Theories
Growth Rate Theory | Null and Alternative Hypotheses        | Prob[Results IF H0 True] | Within Confidence Interval
−1%                | H0: β_Year = −.010  H1: β_Year ≠ −.010 | .038                     | No
0%                 | H0: β_Year = .000   H1: β_Year ≠ .000  | .148                     | Yes
4%                 | H0: β_Year = .040   H1: β_Year ≠ .040  | .285                     | Yes
6%                 | H0: β_Year = .060   H1: β_Year ≠ .060  | .020                     | No

Table 15.2: Growth Rate Theories and the 95 Percent Confidence Interval

Now, we shall make two observations and pose two questions:
The 0.0 percent growth rate theory lies within the confidence interval, but the −1.0 percent theory does not. Question: What is the lowest growth rate theory that is consistent with the data; that is, what is the lower bound of the confidence interval, β_LB?
The 4.0 percent growth rate theory lies within the confidence interval, but the 6.0 percent theory does not. Question: What is the highest growth rate theory that is consistent with the data; that is, what is the upper bound of the confidence interval, β_UB?
Figure 15.4: Lower and Upper Confidence Interval Bounds (Prob[Results IF H0 True] plotted against the growth rate theory: do not reject H0 for theories between β_LB and β_UB; reject H0 for theories below β_LB or above β_UB; significance level = 5% = .05)

Figure 15.5 answers these questions visually by illustrating the lower and upper bounds. The Prob[Results IF H0 True] equals .05 for both the lower and upper bound growth theories because our calculations are based on a 95 percent confidence interval:
The lower bound growth theory postulates a growth rate that is less than that estimated. Hence, the coefficient estimate, .023, marks the right tail border of the lower bound.
The upper bound growth theory postulates a growth rate that is greater than that estimated. Hence, the coefficient estimate, .023, marks the left tail border of the upper bound.
Figure 15.5: Probability Distribution of Coefficient Estimate, Lower and Upper Confidence Intervals

Econometrics Lab 15.3: Calculating the 95 Percent Confidence Interval

We can use the Econometrics Lab to calculate the lower and upper bounds:

Calculating the Lower Bound, β_LB: For the lower bound, the right tail probability equals .025. [Link to MIT-Lab 15.3a goes here.] The appropriate information is already entered for us:
Standard Error: .0159
Value: .023
Degrees of Freedom: 736
Area to Right: .025
Click Calculate. The reported Mean is the lower bound.
Mean: −.008
β_LB = −.008
Calculating the Upper Bound, β_UB: For the upper bound, the left tail probability equals .025. Accordingly, the right tail probability will equal .975. [Link to MIT-Lab 15.3b goes here.] The appropriate information is already entered for us:
Standard Error: .0159
Value: .023
Degrees of Freedom: 736
Area to Right: .975
Click Calculate. The reported Mean is the upper bound.
Mean: .054
β_UB = .054

−.008 and .054 mark the bounds of the two-tailed 95 percent confidence interval:
For any growth rate theory between −.8 percent and 5.4 percent: Prob[Results IF H0 True] > .05. Do not reject H0 at the 5 percent significance level.
For any growth rate theory below −.8 percent or above 5.4 percent: Prob[Results IF H0 True] < .05. Reject H0 at the 5 percent significance level.
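The two bounds follow directly from the formula: bound = estimate ± critical value × SE. A minimal sketch, assuming the two-tailed 5 percent critical t value for 736 degrees of freedom is about 1.963:

```python
# Reproducing the lower and upper bounds of the 95 percent confidence
# interval from the estimate and its standard error. T_CRIT = 1.963 is
# the approximate two-tailed 5% t value for 736 degrees of freedom.

ESTIMATE, SE, T_CRIT = 0.023, 0.0159, 1.963

lower = ESTIMATE - T_CRIT * SE   # lowest theory we cannot reject
upper = ESTIMATE + T_CRIT * SE   # highest theory we cannot reject
print(f"95% confidence interval: [{lower:.3f}, {upper:.3f}]")
```

To three decimal places this yields [−.008, .054], the same bounds the Econometrics Lab reports.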
Calculating Confidence Intervals with Statistical Software

Fortunately, statistical software provides us with an easy and convenient way to compute confidence intervals. The software does all the work for us.

Getting Started in EViews: After running the appropriate regression:
In the Equation window: Click View, Coefficient Diagnostics, and Confidence Intervals.
In the Confidence Intervals window: Enter the confidence levels you wish to compute. (By default the values .90, .95, and .99 are entered.)
Click OK.

95 Percent Interval Estimates
Dependent Variable: LogUsersTV
Explanatory Variable(s):   Estimate   Lower   Upper
Year                       .023       −.008   .054
CapitalHuman
CapitalPhysical
GdpPC
Auth
Const
Number of Observations     742

Table 15.3: 95 Percent Confidence Interval Calculations

Table 15.3 reports that the lower and upper bounds of the 95 percent confidence interval for the Year coefficient are −.008 and .054. These are the same values that we calculated using the Econometrics Lab.
Coefficient of Determination (Goodness of Fit), R-Squared (R²)

All statistical packages report the coefficient of determination, the R-squared, in their regression printouts. The R-squared seeks to capture the goodness of fit. It equals the portion of the dependent variable's squared deviations from its mean that is explained by the parameter estimates:

R² = Explained Squared Deviations from the Mean / Actual Squared Deviations from the Mean = Σ_{t=1}^{T} (Esty_t − ȳ)² / Σ_{t=1}^{T} (y_t − ȳ)²

To explain how the coefficient of determination is calculated, we shall revisit Professor Lord's first quiz:

Student   Minutes Studied (x)   Quiz Score (y)
1         5                     66
2         15                    87
3         25                    90

Table 15.4: First Quiz Data

Recall the theory, the model, and our analysis:

Theory: An increase in the number of minutes studied results in an increased quiz score.
Model: y_t = β_Const + β_x x_t + e_t
  x_t = Minutes studied by student t
  y_t = Quiz score earned by student t
Theory: β_x > 0

We used the ordinary least squares (OLS) estimation procedure to estimate the model's parameters: [Link to MIT-Quiz1.wf1 goes here.]
Ordinary Least Squares (OLS)
Dependent Variable: y
Explanatory Variable(s):   Estimate   SE   t-Statistic   Prob
x                          1.2                           .2601
Const                      63
Number of Observations     3
R-squared                  .84

Estimated Equation: Esty = 63 + 1.2x
Interpretation of Estimates:
b_Const = 63: Students receive 63 points for showing up.
b_x = 1.2: Students receive 1.2 additional points for each additional minute studied.
Critical Result: The coefficient estimate equals 1.2. The positive sign of the coefficient estimate suggests that additional studying increases quiz scores. This evidence lends support to our theory.

Table 15.5: First Quiz Regression Results

Next, we formulated the null and alternative hypotheses to determine how much confidence we should have in the theory:
H0: β_x = 0   Studying has no impact on a student's quiz score
H1: β_x > 0   Additional studying increases a student's quiz score

We then calculated Prob[Results IF H0 True], the probability of results like those we obtained (or even stronger) if studying in fact had no impact on quiz scores. The tails probability reported in the regression printout allows us to calculate this easily. Since a one-tailed test is appropriate, we divide the tails probability by 2:

Prob[Results IF H0 True] = .2601 / 2 ≈ .13

We cannot reject the null hypothesis that studying has no impact, even at the 10 percent significance level.
The regression printout reports that the R-squared equals about .84; this means that 84 percent of the dependent variable's squared deviations from its mean are explained by the parameter estimates. Table 15.6 shows the calculations required to compute the R-squared:

Student | x_t | y_t | y_t − ȳ | (y_t − ȳ)² | Esty_t | Esty_t − ȳ | (Esty_t − ȳ)²
1       | 5   | 66  | −15     | 225        | 69     | −12        | 144
2       | 15  | 87  | 6       | 36         | 81     | 0          | 0
3       | 25  | 90  | 9       | 81         | 93     | 12         | 144

Σ_{t=1}^{T} y_t = 243     Σ_{t=1}^{T} (y_t − ȳ)² = 342     Σ_{t=1}^{T} (Esty_t − ȳ)² = 288
ȳ = 243/3 = 81     R² = 288/342 ≈ .84

Table 15.6: R-Squared Calculations for First Quiz

The R-squared equals Σ(Esty_t − ȳ)² divided by Σ(y_t − ȳ)²:

R² = Explained Squared Deviations from the Mean / Actual Squared Deviations from the Mean = 288/342 ≈ .84

84 percent of the y_t's squared deviations are explained by the estimated constant and coefficient. Our calculation of the R-squared agrees with the regression printout.
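The arithmetic in Table 15.6 can be reproduced from first principles. The sketch below assumes the three students studied 5, 15, and 25 minutes and scored 66, 87, and 90 (raw values consistent with the reported sums, the mean score of 81, and the estimated equation Esty = 63 + 1.2x); the computation itself works the same way for any data.

```python
# R-squared from first principles for Professor Lord's first quiz.

x = [5, 15, 25]    # minutes studied (assumed, consistent with the text)
y = [66, 87, 90]   # quiz scores (assumed, consistent with the text)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS slope and intercept
b_x = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
      sum((xi - x_bar) ** 2 for xi in x)
b_const = y_bar - b_x * x_bar

est_y = [b_const + b_x * xi for xi in x]
explained = sum((e - y_bar) ** 2 for e in est_y)   # squared deviations of Esty
actual = sum((yi - y_bar) ** 2 for yi in y)        # squared deviations of y
r_squared = explained / actual

print(b_const, b_x, round(r_squared, 2))   # 63.0 1.2 0.84
```

The explained sum (288) over the actual sum (342) gives the printout's .84.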
While the R-squared is always calculated and reported by statistical software, it is not useful in assessing theories. We shall justify this claim by considering a second quiz that Professor Lord administered. Each student studied the same number of minutes and earned the same score on the second quiz as he/she did on the first:

Student   Minutes Studied (x)   Quiz Score (y)
1         5                     66
2         15                    87
3         25                    90

Table 15.7: Second Quiz Data

Before we run another regression that includes the data from both quizzes, let us apply our intuition:
Begin by focusing on only the first quiz. Taken in isolation, the first quiz suggests that studying improves quiz scores. We cannot be very confident of this, however, since we cannot reject the null hypothesis even at a 10 percent significance level.
Next, consider only the second quiz. Since the data from the second quiz are identical to the data from the first quiz, the regression results would be identical. Hence, taken in isolation, the second quiz also suggests that studying improves quiz scores.
Each quiz in isolation suggests that studying improves quiz scores. Now, consider both quizzes together. The two quizzes taken together reinforce each other; this should make us more confident in concluding that studying improves quiz scores, should it not?

If our intuition is correct, how should the Prob[Results IF H0 True] be affected when we consider both quizzes together? Since we are more confident in concluding that studying improves quiz scores, the probability should be less. Let us run a regression using data from both the first and second quizzes to determine whether or not this is true: [Link to MIT-Quiz1&2.wf1 goes here.]
Ordinary Least Squares (OLS)
Dependent Variable: y
Explanatory Variable(s):   Estimate   SE   t-Statistic   Prob
x                          1.2                           .0099
Const                      63
Number of Observations     6
R-squared                  .84

Table 15.8: First and Second Quiz Regression Results

Using data from both quizzes:

Prob[Results IF H0 True] = .0099 / 2 ≈ .005

As a consequence of the second quiz, the probability has fallen from .13 to .005; clearly, our confidence in the theory rises. We can now reject the null hypothesis that studying has no impact at the traditional significance levels of 1, 5, and 10 percent. Our calculations confirm our intuition.

Next, consider the R-squared for the last regression that includes both quizzes. The regression printout reports that the R-squared has not changed; the R-squared is still .84. Table 15.9 explains why:

Quiz/Student | x_t | y_t | y_t − ȳ | (y_t − ȳ)² | Esty_t | Esty_t − ȳ | (Esty_t − ȳ)²
1/1          | 5   | 66  | −15     | 225        | 69     | −12        | 144
1/2          | 15  | 87  | 6       | 36         | 81     | 0          | 0
1/3          | 25  | 90  | 9       | 81         | 93     | 12         | 144
2/1          | 5   | 66  | −15     | 225        | 69     | −12        | 144
2/2          | 15  | 87  | 6       | 36         | 81     | 0          | 0
2/3          | 25  | 90  | 9       | 81         | 93     | 12         | 144

Σ_{t=1}^{T} y_t = 486     Σ_{t=1}^{T} (y_t − ȳ)² = 684     Σ_{t=1}^{T} (Esty_t − ȳ)² = 576
ȳ = 486/6 = 81     R² = 576/684 ≈ .84

Table 15.9: R-Squared Calculations for First and Second Quizzes
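The pattern in Table 15.9 can be verified directly: duplicating every observation doubles both the explained and the actual sums of squared deviations, so their ratio, the R-squared, cannot change. The quiz scores in the sketch below are assumed values consistent with the sums reported in the text.

```python
# Duplicating a data set leaves the R-squared unchanged: every squared
# deviation appears twice, so both sums double and their ratio is the same.

def r_squared(x, y):
    """R-squared of a simple OLS regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b_x = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
          sum((xi - x_bar) ** 2 for xi in x)
    b_const = y_bar - b_x * x_bar
    explained = sum((b_const + b_x * xi - y_bar) ** 2 for xi in x)
    actual = sum((yi - y_bar) ** 2 for yi in y)
    return explained / actual

x1, y1 = [5, 15, 25], [66, 87, 90]   # first quiz (assumed values)
x2, y2 = x1 + x1, y1 + y1            # both quizzes together

print(round(r_squared(x1, y1), 4), round(r_squared(x2, y2), 4))
```

Both calls return the same value, about .8421, even though the hypothesis test's probability falls sharply with the doubled sample.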
R² = Explained Squared Deviations from the Mean / Actual Squared Deviations from the Mean = 576/684 ≈ .84

The R-squared still equals .84. Both the actual and explained squared deviations have doubled; consequently, their ratio, the R-squared, remains unchanged. Clearly, the R-squared does not help us assess our theory: we are now more confident in the theory, but the value of the R-squared has not changed. The bottom line is that if we are interested in assessing our theories, we should focus on hypothesis testing, not on the R-squared.

Pitfalls

Econometrics students using statistical software frequently encounter frustrating pitfalls. We shall now discuss several of these pitfalls and describe the warning signs that accompany them. We begin by reviewing the goal of multiple regression analysis:

Goal of Multiple Regression Analysis: Multiple regression analysis attempts to sort out the individual effect of each explanatory variable. An explanatory variable's coefficient estimate allows us to estimate the change in the dependent variable resulting from a change in that particular explanatory variable while all other explanatory variables remain constant.

We shall consider five common pitfalls that often befall students:
Explanatory variable has the same value for all observations.
One explanatory variable is a linear combination of other explanatory variables.
Dependent variable is a linear combination of explanatory variables.
Outlier observations.
Dummy variable trap.
We shall illustrate the first four pitfalls by revisiting our baseball attendance data, which report on every American League game played during the summer of the 1996 season.

Project: Assess the determinants of baseball attendance.

Baseball Data: Panel data of baseball statistics for the 588 American League games played during the summer of 1996.

Attendance_t    Paid attendance for game t
DH_t            Designated hitter for game t (1 if DH permitted; 0 otherwise)
HomeSalary_t    Player salaries of the home team for game t (millions of dollars)
PriceTicket_t   Average price of tickets sold for game t's home team (dollars)
VisitSalary_t   Player salaries of the visiting team for game t (millions of dollars)

[Link to MIT-ALSummer-1996.wf1 goes here.]

We begin with a model that we have studied before, in which attendance, Attendance, depends on two explanatory variables, the ticket price, PriceTicket, and the home team salary, HomeSalary:

Attendance_t = β_Const + β_Price PriceTicket_t + β_HomeSalary HomeSalary_t + e_t

Recall the regression results from Chapter 14:

Ordinary Least Squares (OLS)
Dependent Variable: Attendance
Explanatory Variable(s):   Estimate   SE   t-Statistic   Prob
PriceTicket                −591
HomeSalary                 783
Const                      9,246
Number of Observations     585

Estimated Equation: EstAttendance = 9,246 − 591PriceTicket + 783HomeSalary
Interpretation:
b_PriceTicket = −591. We estimate that a $1.00 increase in the price of tickets decreases attendance by 591 per game.
b_HomeSalary = 783. We estimate that a $1 million increase in the home team salary increases attendance by 783 per game.

Table 15.10: Baseball Attendance Regression Results
Explanatory Variable Has the Same Value for All Observations

One common pitfall is to include an explanatory variable in a regression that has the same value for every observation. To illustrate this, consider the variable DH:

DH t  Designated hitter for game t (1 if DH permitted; 0 otherwise)

Our baseball data include only American League games in 1996. Since interleague play did not begin until 1997 and all American League games allowed designated hitters, the variable DH t equals 1 for each observation. Let us try to use the ticket price, PriceTicket, home team salary, HomeSalary, and the designated hitter dummy variable, DH, to explain attendance, Attendance:

[Link to MIT-ALSummer-1996.wf1 goes here.]

The statistical software issues a diagnostic. While the verbiage differs from software package to software package, the message is the same: the software cannot perform the calculations that we requested. That is, the statistical software is telling us that it is being asked to do the impossible.

What is the intuition behind this? To determine how a dependent variable is affected by an explanatory variable, we must observe how the dependent variable changes when the explanatory variable changes. The intuition is straightforward: if the dependent variable tends to rise when the explanatory variable rises, the explanatory variable affects the dependent variable positively, suggesting a positive coefficient. On the other hand, if the dependent variable tends to fall when the explanatory variable rises, the explanatory variable affects the dependent variable negatively, suggesting a negative coefficient. Evidence of how the dependent variable changes when the explanatory variable changes is essential. In our baseball example, however, there is no variation in the designated hitter explanatory variable; DH t equals 1 for each observation. We have no way to assess the effect that the designated hitter has on attendance.
We are asking our statistical software to do the impossible. While we have attendance information when the designated hitter was used, we have no attendance information when the designated hitter was not used. How then can we expect the software to assess the impact of the designated hitter on attendance?
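The impossibility has a mechanical face: with an intercept included, a regressor that never varies duplicates the intercept's column of 1's, so the X'X matrix that OLS must invert is singular. A minimal sketch with hypothetical observations:

```python
# A constant-valued regressor duplicates the intercept column, so the
# X'X matrix OLS must invert is singular. Hypothetical design matrix:
X = [[1.0, 1.0],   # columns: [intercept, DH]; DH = 1 for every game
     [1.0, 1.0],
     [1.0, 1.0]]

# Form X'X (2 x 2) and check its determinant
xtx = [[sum(row[i] * row[j] for row in X) for j in range(2)]
       for i in range(2)]
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
print(det)  # 0.0 -> singular, so the software reports a diagnostic
```

A zero determinant means X'X has no inverse; this is precisely the condition that triggers the "near singular matrix" style of diagnostic.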
One Explanatory Variable Is a Linear Combination of Other Explanatory Variables

We have already seen one example of this when we discussed multicollinearity in the previous chapter. We included both the ticket price in terms of dollars and the ticket price in terms of cents as explanatory variables. The ticket price in terms of cents is a linear combination of the ticket price in terms of dollars:

PriceCents = 100 × PriceTicket

Let us try to use the ticket price, PriceTicket, home team salary, HomeSalary, and the ticket price in terms of cents, PriceCents, to explain attendance, Attendance:

[Link to MIT-ALSummer-1996.wf1 goes here.]

When both measures of the price are included in the regression, our statistical software issues a diagnostic indicating that it is being asked to do the impossible. Statistical software cannot separate out the individual influence of the two explanatory variables, PriceTicket and PriceCents, because they contain precisely the same information; the two explanatory variables are redundant. We are asking the software to do the impossible.

In fact, any linear combination of explanatory variables produces this problem. To illustrate this, we consider two regressions. The first specifies three explanatory variables: ticket price, home team salary, and visiting team salary.

[Link to MIT-ALSummer-1996.wf1 goes here.]

Ordinary Least Squares (OLS)
Dependent Variable: Attendance
Explanatory Variable(s): Estimate SE t-Statistic Prob
PriceTicket
HomeSalary
VisitSalary
Const
Number of Observations 585

Estimated Equation: EstAttendance = 3,59 − 587PriceTicket + 791HomeSalary + 163VisitSalary

Table 15.11: Baseball Attendance

Now, generate a new variable, TotalSalary:

TotalSalary = HomeSalary + VisitSalary

TotalSalary is a linear combination of HomeSalary and VisitSalary. Let us try to use the ticket price, PriceTicket, home team salary, HomeSalary, and visiting
team salary, VisitSalary, and total salary, TotalSalary, to explain attendance, Attendance:

[Link to MIT-ALSummer-1996.wf1 goes here.]

Our statistical software will issue a diagnostic indicating that it is being asked to do the impossible. The information contained in TotalSalary is already included in HomeSalary and VisitSalary. Statistical software cannot separate out the individual influence of the three explanatory variables because they contain redundant information. We are asking the software to do the impossible.

Dependent Variable Is a Linear Combination of Explanatory Variables

Suppose that the dependent variable is a linear combination of the explanatory variables. The following regression illustrates this scenario. TotalSalary is by definition the sum of HomeSalary and VisitSalary. Total salary, TotalSalary, is the dependent variable; home team salary, HomeSalary, and visiting team salary, VisitSalary, are the explanatory variables:

[Link to MIT-ALSummer-1996.wf1 goes here.]

Ordinary Least Squares (OLS)
Dependent Variable: TotalSalary
Explanatory Variable(s): Estimate SE t-Statistic Prob
HomeSalary
VisitSalary
Const
Number of Observations 588

Estimated Equation: EstTotalSalary = 1.000HomeSalary + 1.000VisitSalary

Table 15.12: Total Salary Regression

The estimates of the constant and coefficients reveal the definition of TotalSalary:

TotalSalary = HomeSalary + VisitSalary

Furthermore, the standard errors are very small, approximately 0. In fact, they are precisely equal to 0, but they are not reported as 0's as a consequence of how digital computers process numbers. We can think of these very small standard errors as telling us that we are dealing with an identity here, something that is true by definition.
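The identity is easy to reproduce. Here is a minimal no-intercept OLS sketch with hypothetical salary figures (not the actual data): regressing TotalSalary on HomeSalary and VisitSalary via the normal equations returns coefficients of exactly 1 on each regressor.

```python
# Sketch (hypothetical salaries): when the dependent variable is an exact
# linear combination of the regressors, OLS recovers the identity.
home = [1.0, 2.0, 3.0]
visit = [4.0, 1.0, 2.0]
total = [h + v for h, v in zip(home, visit)]  # TotalSalary identity

# Normal equations for a no-intercept regression of total on home, visit
shh = sum(h * h for h in home)
shv = sum(h * v for h, v in zip(home, visit))
svv = sum(v * v for v in visit)
sht = sum(h * t for h, t in zip(home, total))
svt = sum(v * t for v, t in zip(visit, total))

# Solve the 2 x 2 system by Cramer's rule
det = shh * svv - shv * shv
b_home = (sht * svv - shv * svt) / det
b_visit = (shh * svt - shv * sht) / det
print(b_home, b_visit)  # 1.0 1.0
```

With coefficients of exactly 1, the fitted values equal the actual values, so every residual is 0, which is why the reported standard errors are 0 up to the computer's rounding.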
Outlier Observations

We should be aware of the possibility of outliers because the ordinary least squares (OLS) estimation procedure is very sensitive to them. An outlier can occur for many reasons: one observation could have a unique characteristic, or one observation could include a mundane typo. To illustrate the effect that an outlier may have, once again consider the games played in the summer of the 1996 American League season.

[Link to MIT-ALSummer-1996.wf1 goes here.]

The first observation reports the game played in Milwaukee on June 1, 1996: the Cleveland Indians visited the Milwaukee Brewers. The salary for the Brewers totaled 20.3 million dollars in 1996:

Observation  Month  Day  Home Team   Visiting Team  Home Team Salary
1                        Milwaukee   Cleveland      20.3
2                        Oakland     New York
3                        Seattle     Boston
4                        Toronto     Kansas City
5                        Texas       Minnesota

Review the following regression:

Ordinary Least Squares (OLS)
Dependent Variable: Attendance
Explanatory Variable(s): Estimate SE t-Statistic Prob
PriceTicket
HomeSalary
Const
Number of Observations 585

Estimated Equation: EstAttendance = 9,46 − 591PriceTicket + 783HomeSalary

Table 15.13: Baseball Attendance Regression with Correct Data

Now, suppose that a mistake was made in entering Milwaukee's player salary for the first observation; suppose that the decimal point was misplaced, so that 203 was entered instead of 20.3. All the other values were entered correctly. You can access the data including this outlier:

[Link to MIT-ALSummerOutlier-1996.wf1 goes here.]
Observation  Month  Day  Home Team   Visiting Team  Home Team Salary
1                        Milwaukee   Cleveland      203
2                        Oakland     New York
3                        Seattle     Boston
4                        Toronto     Kansas City
5                        Texas       Minnesota

Ordinary Least Squares (OLS)
Dependent Variable: Attendance
Explanatory Variable(s): Estimate SE t-Statistic Prob
PriceTicket
HomeSalary
Const
Number of Observations 585

Estimated Equation: EstAttendance = … − 1,896PriceTicket + .088HomeSalary

Table 15.14: Baseball Attendance Regression with an Outlier

Even though only a single value has been altered, the estimates of both coefficients change dramatically. The estimate of the ticket price coefficient changes from about −591 to −1,896, and the estimate of the home salary coefficient changes from 783 to .088. This illustrates how sensitive the ordinary least squares (OLS) estimation procedure can be to an outlier. Consequently, we must take care to enter data properly and to check to be certain that we have generated any new variables correctly.
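This sensitivity is easy to reproduce on made-up numbers. The sketch below fits a simple regression slope twice: once with clean data and once after a single misplaced decimal point.

```python
# Sketch (made-up data): a single misplaced decimal point can swing
# the OLS slope dramatically.
def slope(x, y):
    """Simple regression slope: sum of cross deviations over
    sum of squared x deviations."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_correct = [2.0, 4.0, 6.0, 8.0, 10.0]
y_outlier = [2.0, 4.0, 6.0, 8.0, 100.0]   # 10.0 mistyped as 100.0

print(slope(x, y_correct))  # 2.0
print(slope(x, y_outlier))  # 20.0
```

One mistyped value moves the estimated slope from 2 to 20; plotting the data first is a cheap way to catch such entry errors before running a regression.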
Dummy Variable Trap

To illustrate the dummy variable trap, we shall revisit our faculty salary data:

Project: Assess the possibility of discrimination in academia.

Faculty Salary Data: Artificially constructed cross section salary data and characteristics for 200 faculty members.

Salary t      Salary of faculty member t (dollars)
Experience t  Years of teaching experience for faculty member t
Articles t    Number of articles published by faculty member t
SexM1 t       1 if faculty member t is male; 0 if female

We shall investigate models that include only dummy variables and years of teaching experience. More specifically, we shall consider four cases:

Model  Dependent Variable  Explanatory Variables          Constant
1      Salary              SexF1 and Experience           Yes
2      Salary              SexM1 and Experience           Yes
3      Salary              SexF1, SexM1, and Experience   No
4      Salary              SexF1, SexM1, and Experience   Yes

We begin by generating the variable SexF1 as we did in Chapter 13:

SexF1 = 1 − SexM1

Now, we shall estimate the parameters of the four models. First, Model 1.

Model 1: Salary t = β Const + β SexF1 SexF1 t + β E Experience t + e t

[Link to MIT-FacultySalaries.wf1 goes here.]

Ordinary Least Squares (OLS)
Dependent Variable: Salary
Explanatory Variable(s): Estimate SE t-Statistic Prob
SexF1
Experience
Const
Number of Observations 200

Estimated Equation: EstSalary = 42,238 − 2,240SexF1 + 2,447Experience

Table 15.15: Faculty Salary Regression
Now, calculate the estimated salary equation for men and women.

For men, SexF1 = 0:

EstSalary = 42,238 − 2,240SexF1 + 2,447Experience
EstSalary Men = 42,238 − 2,240 × 0 + 2,447Experience = 42,238 + 2,447Experience

The intercept for men equals $42,238; the slope equals 2,447.

For women, SexF1 = 1:

EstSalary = 42,238 − 2,240SexF1 + 2,447Experience
EstSalary Women = 42,238 − 2,240 + 2,447Experience = 39,998 + 2,447Experience

It is easy to plot the estimated salary equations for men and women.

[Figure 15.6: Estimated Salary Equations for Men and Women. Two parallel lines, each with slope 2,447; the men's line has intercept 42,238 and the women's line has intercept 39,998.]

Both plotted lines have the same slope, 2,447. The intercepts differ, however: the intercept for men is 42,238, while the intercept for women is 39,998.
Model 2: Salary t = β Const + β SexM1 SexM1 t + β E Experience t + e t

EstSalary = b Const + b SexM1 SexM1 + b E Experience

Let us attempt to calculate the second model's estimated constant and the estimated male sex dummy coefficient, b Const and b SexM1, using the intercepts from Model 1.

For men: SexM1 = 1
EstSalary Men = b Const + b SexM1 + b E Experience
Intercept Men = b Const + b SexM1
42,238 = b Const + b SexM1

For women: SexM1 = 0
EstSalary Women = b Const + b E Experience
Intercept Women = b Const
39,998 = b Const

We now have two equations:

42,238 = b Const + b SexM1
39,998 = b Const

and two unknowns, b Const and b SexM1. It is easy to solve for the unknowns. The second equation tells us that b Const equals 39,998:

b Const = 39,998

Next, focus on the first equation:

42,238 = b Const + b SexM1

Substituting for b Const:

42,238 = 39,998 + b SexM1

Solving for b SexM1:

b SexM1 = 42,238 − 39,998 = 2,240

Using the estimates from Model 1, we compute that Model 2's estimate for the constant should be 39,998 and its estimate for the male sex dummy coefficient should be 2,240. Let us now run the regression:

Ordinary Least Squares (OLS)
Dependent Variable: Salary
Explanatory Variable(s): Estimate SE t-Statistic Prob
SexM1
Experience
Const
Number of Observations 200

Estimated Equation: EstSalary = 39,998 + 2,240SexM1 + 2,447Experience

Table 15.16: Faculty Salary Regression

The regression confirms our calculations.
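The two-equation solution above is ordinary arithmetic. Here is a minimal sketch with hypothetical intercepts (not the textbook's estimates) showing how Model 2's constant and male dummy coefficient follow from the two group intercepts:

```python
# Sketch with hypothetical group intercepts: recover the constant and
# the male-dummy coefficient from the men's and women's intercepts.
intercept_men = 45000.0    # hypothetical value
intercept_women = 40000.0  # hypothetical value

b_const = intercept_women           # women: SexM1 = 0
b_sexm1 = intercept_men - b_const   # men:  SexM1 = 1
print(b_const, b_sexm1)  # 40000.0 5000.0
```

The dummy coefficient is simply the gap between the two intercepts, and the constant is the intercept of the group whose dummy equals 0.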
Model 3: Salary t = β SexF1 SexF1 t + β SexM1 SexM1 t + β E Experience t + e t

EstSalary = b SexF1 SexF1 + b SexM1 SexM1 + b E Experience

Again, let us attempt to calculate the third model's estimated female sex dummy coefficient and its male sex dummy coefficient, b SexF1 and b SexM1, using the intercepts from Model 1.

For men: SexF1 = 0 and SexM1 = 1
EstSalary Men = b SexM1 + b E Experience
Intercept Men = b SexM1
42,238 = b SexM1

For women: SexF1 = 1 and SexM1 = 0
EstSalary Women = b SexF1 + b E Experience
Intercept Women = b SexF1
39,998 = b SexF1

We now have two equations:

42,238 = b SexM1
39,998 = b SexF1

and two unknowns, b SexF1 and b SexM1. Using the estimates from Model 1, we compute that Model 3's estimate for the male sex dummy coefficient should be 42,238 and its estimate for the female sex dummy coefficient should be 39,998. Let us now run the regression:

Getting Started in EViews

To estimate the third model (part c) using EViews, you must fool EViews into running the appropriate regression:
In the Workfile window: highlight Salary and then, while depressing <Ctrl>, highlight SexF1, SexM1, and Experience.
In the Workfile window: double click on a highlighted variable.
Click Open Equation.
In the Equation Specification window, delete c so that the window looks like this: salary sexf1 sexm1 experience.
Click OK.
Ordinary Least Squares (OLS)
Dependent Variable: Salary
Explanatory Variable(s): Estimate SE t-Statistic Prob
SexF1
SexM1
Experience
Number of Observations 200

Estimated Equation: EstSalary = 39,998SexF1 + 42,238SexM1 + 2,447Experience

Table 15.17: Faculty Salary Regression

Again, the regression results confirm our calculations.

Model 4: Salary t = β Const + β SexF1 SexF1 t + β SexM1 SexM1 t + β E Experience t + e t

EstSalary = b Const + b SexF1 SexF1 + b SexM1 SexM1 + b E Experience

Question: Can we calculate the fourth model's b Const, b SexF1, and b SexM1 using Model 1's intercepts?

For men: SexF1 = 0 and SexM1 = 1
EstSalary Men = b Const + b SexM1 + b E Experience
Intercept Men = b Const + b SexM1
42,238 = b Const + b SexM1

For women: SexF1 = 1 and SexM1 = 0
EstSalary Women = b Const + b SexF1 + b E Experience
Intercept Women = b Const + b SexF1
39,998 = b Const + b SexF1

We now have two equations:

42,238 = b Const + b SexM1
39,998 = b Const + b SexF1

and three unknowns, b Const, b SexF1, and b SexM1. We have more unknowns than equations; we cannot solve for the three unknowns. It is impossible. This is called a dummy variable trap:

Dummy Variable Trap: A model in which there are more parameters representing the intercepts than there are intercepts. Here there are three parameters, b Const, b SexF1, and b SexM1, estimating the two intercepts.
Now, let us try to run the regression:

[Link to MIT-FacultySalaries.wf1 goes here.]

Our statistical software will issue a diagnostic telling us that it is being asked to do the impossible. In some sense, the software is being asked to solve for three unknowns with only two equations.
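The trap also has a mechanical face, just like the earlier pitfalls: for every observation the constant column equals SexF1 plus SexM1, so the regressor columns are linearly dependent. A small sketch with a hypothetical sample:

```python
# Sketch (hypothetical sample): the dummy variable trap. With a constant,
# SexF1, and SexM1 all included, the constant column equals
# SexF1 + SexM1 for every observation.
sexm1 = [1, 0, 1, 0]
sexf1 = [1 - m for m in sexm1]   # SexF1 = 1 - SexM1
const = [1, 1, 1, 1]

dependent = all(c == f + m for c, f, m in zip(const, sexf1, sexm1))
print(dependent)  # True -> the columns are linearly dependent
```

Linearly dependent regressor columns make X'X singular, which is exactly why the software issues its diagnostic; dropping either the constant (Model 3) or one of the dummies (Models 1 and 2) removes the dependence.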
More informationCHAPTER 6: SPECIFICATION VARIABLES
Recall, we had the following six assumptions required for the Gauss-Markov Theorem: 1. The regression model is linear, correctly specified, and has an additive error term. 2. The error term has a zero
More informationRegression Analysis and Forecasting Prof. Shalabh Department of Mathematics and Statistics Indian Institute of Technology-Kanpur
Regression Analysis and Forecasting Prof. Shalabh Department of Mathematics and Statistics Indian Institute of Technology-Kanpur Lecture 10 Software Implementation in Simple Linear Regression Model using
More informationChapter 4. Regression Models. Learning Objectives
Chapter 4 Regression Models To accompany Quantitative Analysis for Management, Eleventh Edition, by Render, Stair, and Hanna Power Point slides created by Brian Peterson Learning Objectives After completing
More informationApplied Quantitative Methods II
Applied Quantitative Methods II Lecture 4: OLS and Statistics revision Klára Kaĺıšková Klára Kaĺıšková AQM II - Lecture 4 VŠE, SS 2016/17 1 / 68 Outline 1 Econometric analysis Properties of an estimator
More informationChapter 14 Student Lecture Notes 14-1
Chapter 14 Student Lecture Notes 14-1 Business Statistics: A Decision-Making Approach 6 th Edition Chapter 14 Multiple Regression Analysis and Model Building Chap 14-1 Chapter Goals After completing this
More informationCh 7: Dummy (binary, indicator) variables
Ch 7: Dummy (binary, indicator) variables :Examples Dummy variable are used to indicate the presence or absence of a characteristic. For example, define female i 1 if obs i is female 0 otherwise or male
More informationMultiple Regression Analysis
Chapter 4 Multiple Regression Analysis The simple linear regression covered in Chapter 2 can be generalized to include more than one variable. Multiple regression analysis is an extension of the simple
More information9. Linear Regression and Correlation
9. Linear Regression and Correlation Data: y a quantitative response variable x a quantitative explanatory variable (Chap. 8: Recall that both variables were categorical) For example, y = annual income,
More informationChapter 9. Dummy (Binary) Variables. 9.1 Introduction The multiple regression model (9.1.1) Assumption MR1 is
Chapter 9 Dummy (Binary) Variables 9.1 Introduction The multiple regression model y = β+β x +β x + +β x + e (9.1.1) t 1 2 t2 3 t3 K tk t Assumption MR1 is 1. yt =β 1+β 2xt2 + L+β KxtK + et, t = 1, K, T
More informationPOL 681 Lecture Notes: Statistical Interactions
POL 681 Lecture Notes: Statistical Interactions 1 Preliminaries To this point, the linear models we have considered have all been interpreted in terms of additive relationships. That is, the relationship
More informationSampling, Frequency Distributions, and Graphs (12.1)
1 Sampling, Frequency Distributions, and Graphs (1.1) Design: Plan how to obtain the data. What are typical Statistical Methods? Collect the data, which is then subjected to statistical analysis, which
More informationRegression Analysis IV... More MLR and Model Building
Regression Analysis IV... More MLR and Model Building This session finishes up presenting the formal methods of inference based on the MLR model and then begins discussion of "model building" (use of regression
More informationChapter 23: Inferences About Means
Chapter 3: Inferences About Means Sample of Means: number of observations in one sample the population mean (theoretical mean) sample mean (observed mean) is the theoretical standard deviation of the population
More informationPractice Questions for Exam 1
Practice Questions for Exam 1 1. A used car lot evaluates their cars on a number of features as they arrive in the lot in order to determine their worth. Among the features looked at are miles per gallon
More informationCh 13 & 14 - Regression Analysis
Ch 3 & 4 - Regression Analysis Simple Regression Model I. Multiple Choice:. A simple regression is a regression model that contains a. only one independent variable b. only one dependent variable c. more
More informationPsych 230. Psychological Measurement and Statistics
Psych 230 Psychological Measurement and Statistics Pedro Wolf December 9, 2009 This Time. Non-Parametric statistics Chi-Square test One-way Two-way Statistical Testing 1. Decide which test to use 2. State
More informationDo not copy, post, or distribute. Independent-Samples t Test and Mann- C h a p t e r 13
C h a p t e r 13 Independent-Samples t Test and Mann- Whitney U Test 13.1 Introduction and Objectives This chapter continues the theme of hypothesis testing as an inferential statistical procedure. In
More informationBlack White Total Observed Expected χ 2 = (f observed f expected ) 2 f expected (83 126) 2 ( )2 126
Psychology 60 Fall 2013 Practice Final Actual Exam: This Wednesday. Good luck! Name: To view the solutions, check the link at the end of the document. This practice final should supplement your studying;
More informationOrdinary Least Squares Regression Explained: Vartanian
Ordinary Least Squares Regression Eplained: Vartanian When to Use Ordinary Least Squares Regression Analysis A. Variable types. When you have an interval/ratio scale dependent variable.. When your independent
More informationECONOMETRIC MODEL WITH QUALITATIVE VARIABLES
ECONOMETRIC MODEL WITH QUALITATIVE VARIABLES How to quantify qualitative variables to quantitative variables? Why do we need to do this? Econometric model needs quantitative variables to estimate its parameters
More informationECO375 Tutorial 4 Wooldridge: Chapter 6 and 7
ECO375 Tutorial 4 Wooldridge: Chapter 6 and 7 Matt Tudball University of Toronto St. George October 6, 2017 Matt Tudball (University of Toronto) ECO375H1 October 6, 2017 1 / 36 ECO375 Tutorial 4 Welcome
More information11.5 Regression Linear Relationships
Contents 11.5 Regression............................. 835 11.5.1 Linear Relationships................... 835 11.5.2 The Least Squares Regression Line........... 837 11.5.3 Using the Regression Line................
More informationCh. 16: Correlation and Regression
Ch. 1: Correlation and Regression With the shift to correlational analyses, we change the very nature of the question we are asking of our data. Heretofore, we were asking if a difference was likely to
More informationInferential statistics
Inferential statistics Inference involves making a Generalization about a larger group of individuals on the basis of a subset or sample. Ahmed-Refat-ZU Null and alternative hypotheses In hypotheses testing,
More informationMidterm 2 - Solutions
Ecn 102 - Analysis of Economic Data University of California - Davis February 24, 2010 Instructor: John Parman Midterm 2 - Solutions You have until 10:20am to complete this exam. Please remember to put
More informationOutline. Lesson 3: Linear Functions. Objectives:
Lesson 3: Linear Functions Objectives: Outline I can determine the dependent and independent variables in a linear function. I can read and interpret characteristics of linear functions including x- and
More informationtheir contents. If the sample mean is 15.2 oz. and the sample standard deviation is 0.50 oz., find the 95% confidence interval of the true mean.
Math 1342 Exam 3-Review Chapters 7-9 HCCS **************************************************************************************** Name Date **********************************************************************************************
More informationAnnouncements. J. Parman (UC-Davis) Analysis of Economic Data, Winter 2011 February 8, / 45
Announcements Solutions to Problem Set 3 are posted Problem Set 4 is posted, It will be graded and is due a week from Friday You already know everything you need to work on Problem Set 4 Professor Miller
More informationModule 7 Practice problem and Homework answers
Module 7 Practice problem and Homework answers Practice problem, page 1 Is the research hypothesis one-tailed or two-tailed? Answer: one tailed In the set up for the problem, we predicted a specific outcome
More informationAnswer all questions from part I. Answer two question from part II.a, and one question from part II.b.
B203: Quantitative Methods Answer all questions from part I. Answer two question from part II.a, and one question from part II.b. Part I: Compulsory Questions. Answer all questions. Each question carries
More informationTrendlines Simple Linear Regression Multiple Linear Regression Systematic Model Building Practical Issues
Trendlines Simple Linear Regression Multiple Linear Regression Systematic Model Building Practical Issues Overfitting Categorical Variables Interaction Terms Non-linear Terms Linear Logarithmic y = a +
More informationBinary Logistic Regression
The coefficients of the multiple regression model are estimated using sample data with k independent variables Estimated (or predicted) value of Y Estimated intercept Estimated slope coefficients Ŷ = b
More information