Econometrics Problem Set 7


WISE, Xiamen University, Spring 2016-17

Conceptual Questions

1. (SW 8.2) Suppose that a researcher collects data on houses that have sold in a particular neighborhood over the past year and obtains the regression results in Table 1.

Dependent variable: ln(Price)

Regressor        (1)          (2)       (3)       (4)       (5)
Size             0.00042
                 (0.000038)
ln(Size)                      0.69      0.68      0.57      0.69
                              (0.054)   (0.087)   (2.03)    (0.055)
[ln(Size)]^2                                      0.0078
                                                  (0.14)
Bedrooms                                0.0036
                                        (0.037)
Pool             0.082        0.071     0.071     0.071     0.071
                 (0.032)      (0.034)   (0.034)   (0.035)   (0.035)
View             0.037        0.027     0.026     0.027     0.027
                 (0.029)      (0.028)   (0.026)   (0.029)   (0.030)
Pool x View                                                 0.0022
                                                            (0.10)
Condition        0.13         0.12      0.12      0.12      0.12
                 (0.045)      (0.035)   (0.035)   (0.036)   (0.035)
Intercept        10.97        6.60      6.63      7.02      6.60
                 (0.069)      (0.39)    (0.53)    (7.50)    (0.40)

Summary Statistics
SER              0.102        0.098     0.099     0.099     0.099
R^2              0.72         0.74      0.73      0.73      0.73

Table 1: Variable definitions: Price = sale price; Size = house size (in square feet); Bedrooms = number of bedrooms; Pool = binary variable (1 if house has a swimming pool, 0 otherwise); View = binary variable (1 if house has a nice view, 0 otherwise); Condition = binary variable (1 if realtor reports house is in excellent condition, 0 otherwise).

(a) Using the results in column (1), what is the expected change in price of building a 500-square-foot addition to a house? Construct a 95% confidence interval for the percentage change in price.

Price is expected to change by 21% ($100\% \times 500 \times 0.00042$). The 95% confidence interval for the percentage change in price is $100\% \times 500 \times [0.00042 - 1.96 \times 0.000038,\ 0.00042 + 1.96 \times 0.000038]$, that is, [17.276%, 24.724%].

(b) Comparing columns (1) and (2), is it better to use Size or ln(Size) to explain house prices?

Because the regressions in columns (1) and (2) have the same dependent variable, $R^2$ can be used to compare the fit of these two regressions. It is better to use ln(Size) to explain house prices since its $R^2$ is higher.

(c) Using column (2), what is the estimated effect of pool on price? (Make sure you get the units right.) Construct a 95% confidence interval for this effect.

The coefficient on Pool is 0.071, so adding a pool to a house is associated with a 7.1% increase in price. The 95% confidence interval is $100\% \times [0.071 - 1.96 \times 0.034,\ 0.071 + 1.96 \times 0.034]$, that is, [0.436%, 13.764%].

(d) The regression in column (3) adds the number of bedrooms to the regression. How large is the estimated effect of an additional bedroom? Is the effect statistically significant? Why do you think the estimated effect is so small? (Hint: What other variables are being held constant?)

The estimated effect of an additional bedroom is 0.36%. The t-statistic for the number of bedrooms is 0.0036/0.037 = 0.097297 < 1.96, so the effect is not statistically significant at the 5% level. Note that this coefficient measures the effect of an additional bedroom holding the size of the house constant, which is why the estimated effect is so small.

(e) Is the quadratic term [ln(Size)]^2 important?

The t-statistic for the quadratic term in column (4) is 0.0078/0.14 = 0.055714 < 1.96, so the quadratic term is not important.

(f) Use the regression in column (5) to compute the expected percentage change in price when a pool is added to a house without a view. Repeat the exercise for a house with a view. Is there a large difference? Is the difference statistically significant?

Without a view: $\%\Delta P = 100\% \times 0.071 \times 1 = 7.1\%$.
With a view: $\%\Delta P = 100\% \times (0.071 \times 1 + 0.0022 \times 1) = 7.32\%$.

The difference in the expected percentage change in price is 0.22%. The difference is not statistically significant at the 5% significance level: the t-statistic for the interaction term Pool x View in column (5) is 0.0022/0.10 = 0.022 < 1.96.

2. Consider the following graph of income as a function of age. This functional form assumes that income is a continuous function of age and that older individuals earn more than younger individuals, but that the slope might change at some distinct milestones, for example, at age 18, when the typical individual graduates from high school, and at age 22, when he or she graduates from college.

[Figure: Income (vertical axis) against Age (horizontal axis), with ages 18 and 22 marked.]

(a) Construct a regression model that could be used to estimate this function.

At a general level, this can be described as three different regression functions: one for individuals aged 18 or younger, one for individuals aged between 18 and 22, and one for individuals older than 22. A basic regression model with these features is

$$\text{Income}_i = \beta_0 + \beta_1 \text{Age}_i + \beta_2 D_i + \beta_3 D_i \text{Age}_i + \beta_4 E_i + \beta_5 E_i \text{Age}_i + u_i,$$

where $D_i = 1$ if $18 < \text{Age}_i \le 22$ and $E_i = 1$ if $\text{Age}_i > 22$ (each is 0 otherwise). However, the function must also be continuous at the thresholds, which implies two requirements:

$$\beta_0 + 18\beta_1 = \beta_0 + 18\beta_1 + \beta_2 + 18\beta_3,$$
$$\beta_0 + 22\beta_1 + \beta_2 + 22\beta_3 = \beta_0 + 22\beta_1 + \beta_4 + 22\beta_5.$$

These imply that

$$\beta_2 = -18\beta_3, \qquad \beta_4 = 4\beta_3 - 22\beta_5.$$

Imposing these constraints on our regression model implies that

$$\begin{aligned}
\text{Income}_i &= \beta_0 + \beta_1 \text{Age}_i - 18\beta_3 D_i + \beta_3 D_i \text{Age}_i + (4\beta_3 - 22\beta_5) E_i + \beta_5 E_i \text{Age}_i + u_i \\
&= \beta_0 + \beta_1 \text{Age}_i + \beta_3 \bigl(D_i (\text{Age}_i - 18) + 4E_i\bigr) + \beta_5 E_i (\text{Age}_i - 22) + u_i \\
&= \beta_0 + \beta_1 \text{Age}_i + \beta_3 W_i + \beta_5 X_i + u_i,
\end{aligned}$$

where $W_i = D_i (\text{Age}_i - 18) + 4E_i$ and $X_i = E_i (\text{Age}_i - 22)$.
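The constrained model in (a) can be sketched in R, taking $E_i = 1$ for ages above 22 as the continuity conditions require. The data are simulated purely for illustration; the helper name `spline_terms`, the variable names, and the chosen slopes are assumptions and not part of the problem set.

```r
# Build the regressors W and X of the continuous piecewise-linear model
# in question 2(a). All names here are illustrative.
spline_terms <- function(age) {
  D <- as.numeric(age > 18 & age <= 22)  # 1 if 18 < Age <= 22
  E <- as.numeric(age > 22)              # 1 if Age > 22
  data.frame(
    age = age,
    W   = D * (age - 18) + 4 * E,
    X   = E * (age - 22)
  )
}

# Simulated data (hypothetical): beta1 = 1, beta3 = 2, beta5 = 1,
# so the slope is 1 up to age 18, 3 between 18 and 22, and 2 above 22.
set.seed(1)
sim <- spline_terms(runif(500, min = 10, max = 40))
sim$income <- 5 + 1 * sim$age + 2 * sim$W + 1 * sim$X + rnorm(500, sd = 0.5)

# Estimate the constrained model by OLS.
fit <- lm(income ~ age + W + X, data = sim)
print(coef(fit))
```

In the fitted model the slope is $\beta_1$ up to age 18, $\beta_1 + \beta_3$ between 18 and 22, and $\beta_1 + \beta_5$ above 22, and the fitted line has no jumps because W and X are continuous in age.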

(b) Based on your regression model from (a), construct a hypothesis test that tests if the slope of the function is constant.

Based on the model $\text{Income}_i = \beta_0 + \beta_1 \text{Age}_i + \beta_3 W_i + \beta_5 X_i + u_i$, where $W_i = D_i (\text{Age}_i - 18) + 4E_i$ and $X_i = E_i (\text{Age}_i - 22)$, we should test the joint hypothesis $H_0: \beta_3 = \beta_5 = 0$.

3. (SW 8.6) After reading the textbook's analysis of test scores (refer to Table 8.3 of the textbook), a researcher considers the following hypotheses:

(a) She suspects that the percentage of students eligible for a subsidized lunch (ESL) has a nonlinear effect on test scores. In particular, she conjectures that increases in this variable from 10% to 20% have little effect on test scores, but that changes from 50% to 60% have a much larger effect.

i. Describe a nonlinear specification that can be used to model this form of nonlinearity.

One possible approach is to construct a dummy variable $LunchMed_i$, which is equal to one in districts where $20 < ESL \le 50$ and zero otherwise, and another dummy variable $LunchHigh_i$, which is equal to one in districts where $ESL > 50$ and zero otherwise. Add these dummy variables to the base model.

ii. How would you test whether the researcher's conjecture was better than a linear specification of the relationship between TestScore and ESL?

Based on the answer to (i), we would test the joint hypothesis that $\beta_{LunchMed} = \beta_{LunchHigh} = 0$ against the alternative that at least one is not equal to zero.

(b) She suspects that the effect of income on test scores is different in districts with small classes than in districts with large classes.

i. Describe a nonlinear specification that can be used to model this form of nonlinearity.

One possible approach is to construct an interaction variable $SI_i = STR_i \times \ln(Income_i)$ and add this variable to the base model.

ii. How would you test whether the researcher's conjecture was better than a linear specification of the relationship between income and test scores?

Based on the answer to (i), we would test the hypothesis that $\beta_{SI} = 0$ against the alternative that it is not equal to zero.

4. (SW 8.8) X is a continuous variable that takes on values between 5 and 100. Sketch the following regression functions (with values of X between 5 and 100 on the horizontal axis and values of $\hat{Y}$ on the vertical axis):

(a) $\hat{Y} = 2 + 3\ln(X)$.

[Figure: plot of Y1 against X.]

(b) $\hat{Y} = 2 - 3\ln(X)$.

[Figure: plot of Y2 against X.]

(c) $\hat{Y} = 1 + 125X - 0.01X^2$.
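The three sketches in question 4 can be reproduced with base R graphics. This is a minimal sketch; the function names `f_a`, `f_b`, and `f_c` are illustrative.

```r
# The three regression functions from question 4, for 5 <= X <= 100.
f_a <- function(x) 2 + 3 * log(x)
f_b <- function(x) 2 - 3 * log(x)
f_c <- function(x) 1 + 125 * x - 0.01 * x^2

# Draw one panel per function, then restore the previous layout.
op <- par(mfrow = c(1, 3))
curve(f_a, from = 5, to = 100, xlab = "X", ylab = "Y1")
curve(f_b, from = 5, to = 100, xlab = "X", ylab = "Y2")
curve(f_c, from = 5, to = 100, xlab = "X", ylab = "Y3")
par(op)
```

The first curve rises at a decreasing rate, the second falls at a decreasing rate, and the third is close to linear on this range because the quadratic term is small relative to 125X.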

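[Figure: plot of Y3 against X.]

5. (SW 8.9) Consider the following regression model, $Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + u_i$. Explain how to calculate the confidence interval of $\Delta\hat{Y}$ from a change in X from 5 to 6.

The predicted change is

$$\Delta\hat{Y} = \hat\beta_1 (6 - 5) + \hat\beta_2 (6^2 - 5^2) = \hat\beta_1 + 11\hat\beta_2.$$

Using Approach #2 of Section 7.3, transform the regression function:

$$Y_i = \beta_0 + (\beta_1 + 11\beta_2) X_i - 11\beta_2 X_i + \beta_2 X_i^2 + u_i = \gamma_0 + \gamma_1 X_i + \gamma_2 (X_i^2 - 11X_i) + u_i,$$

where $\gamma_1 = \beta_1 + 11\beta_2$ and $\gamma_2 = \beta_2$. The confidence interval is $\hat\gamma_1 \pm z_{\alpha/2}\, SE(\hat\gamma_1)$.

Alternatively, you can directly test the hypothesis $H_0: \beta_1 + 11\beta_2 = 0$ against the alternative $H_1: \beta_1 + 11\beta_2 \ne 0$. The F-statistic from this test is equal to $\bigl(\Delta\hat{Y} / SE(\Delta\hat{Y})\bigr)^2$, so $SE(\Delta\hat{Y}) = |\Delta\hat{Y}| / \sqrt{F}$ and the confidence interval is $\Delta\hat{Y} \pm z_{\alpha/2}\, |\Delta\hat{Y}| / \sqrt{F}$.

6. (SW 8.10) Consider the regression model $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 (X_{1i} \times X_{2i}) + u_i$. Use Key Concept 8.1 in the textbook to show:

(a) $\Delta Y / \Delta X_1 = \beta_1 + \beta_3 X_2$ (effect of change in $X_1$ holding $X_2$ constant).

Since $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 (X_{1i} \times X_{2i}) + u_i$, changing $X_1$ by $\Delta X_1$ gives

$$Y + \Delta Y = \beta_0 + \beta_1 (X_1 + \Delta X_1) + \beta_2 X_2 + \beta_3 (X_1 + \Delta X_1) X_2 + u.$$

Subtracting the original equation gives $\Delta Y = \beta_1 \Delta X_1 + \beta_3 X_2 \Delta X_1$. Thus, $\Delta Y / \Delta X_1 = \beta_1 + \beta_3 X_2$.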

(b) $\Delta Y / \Delta X_2 = \beta_2 + \beta_3 X_1$ (effect of change in $X_2$ holding $X_1$ constant).

Since $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 (X_{1i} \times X_{2i}) + u_i$, changing $X_2$ by $\Delta X_2$ gives

$$Y + \Delta Y = \beta_0 + \beta_1 X_1 + \beta_2 (X_2 + \Delta X_2) + \beta_3 X_1 (X_2 + \Delta X_2) + u.$$

Subtracting the original equation gives $\Delta Y = \beta_2 \Delta X_2 + \beta_3 X_1 \Delta X_2$. Thus, $\Delta Y / \Delta X_2 = \beta_2 + \beta_3 X_1$.

(c) If $X_1$ changes by $\Delta X_1$ and $X_2$ changes by $\Delta X_2$, then $\Delta Y = (\beta_1 + \beta_3 X_2) \Delta X_1 + (\beta_2 + \beta_3 X_1) \Delta X_2 + \beta_3 \Delta X_1 \Delta X_2$.

Since $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 (X_{1i} \times X_{2i}) + u_i$, we obtain

$$Y + \Delta Y = \beta_0 + \beta_1 (X_1 + \Delta X_1) + \beta_2 (X_2 + \Delta X_2) + \beta_3 (X_1 + \Delta X_1)(X_2 + \Delta X_2) + u,$$
$$\Delta Y = \beta_1 \Delta X_1 + \beta_2 \Delta X_2 + \beta_3 (X_2 \Delta X_1 + X_1 \Delta X_2 + \Delta X_1 \Delta X_2).$$

Thus, $\Delta Y = (\beta_1 + \beta_3 X_2) \Delta X_1 + (\beta_2 + \beta_3 X_1) \Delta X_2 + \beta_3 \Delta X_1 \Delta X_2$.

Empirical Questions

For these empirical exercises, the required datasets and a detailed description of them can be found at www.wise.xmu.edu.cn/course/gecon/written.html.

7. (SW E8.3) The data set used in this empirical exercise (CollegeDistance) contains data from a random sample of high school seniors interviewed in 1980 and re-interviewed in 1986. In this exercise you will use these data to investigate the relationship between the number of completed years of education for young adults and the distance from each student's high school to the nearest four-year college. (Proximity to college lowers the cost of education, so that students who live closer to a four-year college should, on average, complete more years of higher education.)

The R code required for each question is listed within its respective solution. The code listed here initialises the software.

    # read data and attach data
    CD <- read.csv("D:/R/CollegeDistance.csv")
    # add dist^2 to the dataset
    CD$dist2 <- CD$dist^2
    # attaching allows you to directly access variable names
    attach(CD)
    # load package AER
    library("AER")

Regressor             (a)       (b)       (c)       (f)       (h)
Dependent variable    ED        ln(ED)    ED        ED        ED

Dist                  -0.037    -0.003    -0.081    -0.081    -0.110
                      (0.012)   (0.001)   (0.025)   (0.025)   (0.028)
Dist^2                                    0.005     0.005     0.006
                                          (0.002)   (0.002)   (0.002)
Female                0.143     0.010     0.143     0.141     0.141
                      (0.050)   (0.004)   (0.050)   (0.050)   (0.050)
Bytest                0.093     0.007     0.093     0.093     0.093
                      (0.003)   (0.000)   (0.003)   (0.003)   (0.003)
Tuition               0.191     0.014     0.193     0.194     0.210
                      (0.099)   (0.007)   (0.099)   (0.099)   (0.099)
Black                 0.351     0.026     0.334     0.331     0.333
                      (0.067)   (0.005)   (0.068)   (0.068)   (0.068)
Hispanic              0.362     0.026     0.333     0.330     0.323
                      (0.076)   (0.005)   (0.078)   (0.078)   (0.078)
Incomehi              0.372     0.027     0.369     0.362     0.217
                      (0.062)   (0.004)   (0.062)   (0.062)   (0.090)
Incomehi x Dist                                               0.124
                                                              (0.062)
Incomehi x Dist^2                                             -0.009
                                                              (0.006)
Intercept             8.921     2.266     9.012     9.002     9.042
                      (0.243)   (0.017)   (0.250)   (0.250)   (0.251)
Ownhome               0.139     0.010     0.143     0.141     0.144
                      (0.065)   (0.005)   (0.065)   (0.065)   (0.065)
DadColl               0.571     0.041     0.561     0.654     0.663
                      (0.076)   (0.005)   (0.077)   (0.087)   (0.087)
MomColl               0.378     0.027     0.378     0.569     0.567
                      (0.083)   (0.006)   (0.084)   (0.122)   (0.122)
DadColl x MomColl                                   -0.366    -0.356
                                                    (0.164)   (0.164)
Cue80                 0.029     0.002     0.026     0.026     0.026
                      (0.010)   (0.001)   (0.010)   (0.010)   (0.010)
Stwmfg80              0.043     0.003     0.043     0.042     0.042
                      (0.020)   (0.001)   (0.020)   (0.020)   (0.020)

R^2                   0.281     0.283     0.282     0.283     0.283
SER                   1.538     0.109     1.537     1.536     1.536

Regression results for question E8.3. Robust standard errors in parentheses. N = 3796.

(a) Run a regression of ED on Dist, Female, Bytest, Tuition, Black, Hispanic, Incomehi, Ownhome, DadColl, MomColl, Cue80, and Stwmfg80. If Dist increases from 2 to 3 (that is, from 20 miles to 30 miles), how are years of education expected to change? If Dist increases from 6 to 7 (that is, from 60 to 70 miles), how are years of education expected to change?

    # run the regression
    ma <- lm(yrsed ~ dist + female + bytest + tuition + black + hispanic +
             incomehi + ownhome + dadcoll + momcoll + cue80 + stwmfg80)
    # show the result with robust standard errors
    coeftest(ma, vcov = vcovHC(ma, type = "HC1"))

The regression result for this question is shown in column (a) of the above table. If Dist increases from 2 to 3 (that is, from 20 miles to 30 miles) or from 6 to 7 (that is, from 60 to 70 miles), years of education are expected to decrease by 0.037 years. These values are the same because the regression is a linear function relating ED to Dist.

(b) Run a regression of ln(ED) on Dist, Female, Bytest, Tuition, Black, Hispanic, Incomehi, Ownhome, DadColl, MomColl, Cue80, and Stwmfg80. If Dist increases from 2 to 3, how are years of education expected to change? If Dist increases from 6 to 7, how are years of education expected to change?

    # run the regression
    mb <- lm(log(yrsed) ~ dist + female + bytest + tuition + black + hispanic +
             incomehi + ownhome + dadcoll + momcoll + cue80 + stwmfg80)
    # show the result with robust standard errors
    coeftest(mb, vcov = vcovHC(mb, type = "HC"))

The regression result for this question is shown in column (b) of the above table. If Dist increases from 2 to 3 or from 6 to 7, ED is expected to decrease by 0.26%. These values are the same because the regression is a linear function relating ln(ED) to Dist.

(c) Run a regression of ED on Dist, Dist^2, Female, Bytest, Tuition, Black, Hispanic, Incomehi, Ownhome, DadColl, MomColl, Cue80, and Stwmfg80. If Dist increases from 2 to 3, how are years of education expected to change? If Dist increases from 6 to 7, how are years of education expected to change?

    # run the regression
    mc <- lm(yrsed ~ dist + dist2 + female + bytest + tuition + black + hispanic +
             incomehi + ownhome + dadcoll + momcoll + cue80 + stwmfg80)
    # show the result with robust standard errors
    coeftest(mc, vcov = vcovHC(mc, type = "HC"))
    # dist changes from 2 to 3 (coefficients 2 and 3 are dist and dist2)
    coef(mc)[2] * 3 + coef(mc)[3] * 3^2 - (coef(mc)[2] * 2 + coef(mc)[3] * 2^2)
    # dist changes from 6 to 7
    coef(mc)[2] * 7 + coef(mc)[3] * 7^2 - (coef(mc)[2] * 6 + coef(mc)[3] * 6^2)

The regression result is shown in column (c) of the above table. If Dist increases from 2 to 3, ED is expected to decrease by 0.058 years; if Dist increases from 6 to 7, ED is expected to decrease by 0.021 years.

(d) Do you prefer regression (c) to (a)? Explain.

From a theoretical viewpoint, one might expect the effect of Dist on ED to diminish with distance: the marginal negative effect of an additional mile should be smaller when the distance is very large, because students who live far away might commute by car or school bus, which makes ED less responsive to distance. From an empirical viewpoint, the regression in (c) adds the variable Dist^2 to regression (a). The coefficient on Dist^2 is statistically significant (t = 2.26), which suggests that the addition of Dist^2 is important. Thus, (c) is preferred to (a).

(e) Consider a Hispanic female with Tuition = $950, Bytest = 58, Incomehi = 0, Ownhome = 0, DadColl = 1, MomColl = 1, Cue80 = 7.1, and Stwmfg80 = $10.06.

i. Plot the regression relation between Dist and ED from (a) and (c) for Dist in the range from 0 to 10 (from 0 to 100 miles). Describe the similarities and differences between the estimated regression functions. Would your answer change if you plotted the regression for a white male with the same characteristics?

    # create a sequence of dist from 0 to 10
    dseq <- seq(0, 10, 0.5)
    # create a data frame consistent with CD (tuition is in thousands of dollars)
    nd <- data.frame(dist = dseq, dist2 = dseq^2, female = 1, hispanic = 1,
                     black = 0, bytest = 58, dadcoll = 1, momcoll = 1, ownhome = 0,
                     cue80 = 7.1, stwmfg80 = 10.06, tuition = 0.95, incomehi = 0)
    # plot the estimated regression function
    plot(nd$dist, predict(ma, nd), type = "l", col = "blue",
         xlab = "Distance (Tens of Miles)", ylab = "Years of Education")
    # add additional regression functions
    lines(nd$dist, predict(mc, nd), col = "red")
    # add a legend
    legend("topright", c("Model a", "Model c"),
           col = c("blue", "red"), lwd = 1, inset = .01)

[Figure: estimated regression functions for Model a and Model c; Years of Education against Distance (Tens of Miles).]

The quadratic regression in (c) is steeper for small values of Dist than for larger values. The quadratic function is essentially flat when Dist = 10. The only change in the regression functions for a white male is that the intercept would shift. The functions would have the same slopes.

ii. How does the regression function (c) behave for Dist > 10? How many observations are there with Dist > 10?

    # number of observations with dist > 10
    sum(dist > 10)

The regression function becomes positively sloped for Dist > 10. Only 44 of the 3796 observations have Dist > 10, which is approximately 1% of the sample. Thus, this part of the regression function is very imprecisely estimated.

(f) Add the interaction term DadColl x MomColl to the regression in (c). What does the coefficient on the interaction term measure?

    # run the regression
    mf <- lm(yrsed ~ dist + dist2 + female + bytest + tuition + black + hispanic +
             incomehi + ownhome + dadcoll + momcoll + dadcoll:momcoll + cue80 + stwmfg80)
    # show the result with robust standard errors
    coeftest(mf, vcov = vcovHC(mf, type = "HC"))

The regression result is shown in column (f) of the above table. The estimated coefficient on the interaction term is -0.366. This is the extra effect on education, above and beyond the separate MomColl and DadColl effects, when both mother and father attended college. This effect is significant at the 5% significance level but not at the 1% significance level.

(g) Mary, Jane, Alexis and Bonnie have the same values of Dist, Female, Bytest, Tuition, Black, Hispanic, Incomehi, Ownhome, Cue80, and Stwmfg80. Neither of Mary's parents attended college. Jane's father attended college, but her mother did not. Alexis's mother attended college, but her father did not. Both of Bonnie's parents attended college. Using the regression from (f):

i. What does the regression predict for the difference between Jane's and Mary's years of education?

According to the regression, Jane is predicted to complete 0.654 more years of education than Mary.

ii. What does the regression predict for the difference between Alexis's and Mary's years of education?

According to the regression, Alexis is predicted to complete 0.569 more years of education than Mary.

iii. What does the regression predict for the difference between Bonnie's and Mary's years of education?

According to the regression, Bonnie is predicted to complete 0.654 + 0.569 - 0.366 = 0.857 more years of education than Mary.

(h) Is there any evidence that the effect of Dist on ED depends on the family's income?

    # run the regression
    mh <- lm(yrsed ~ dist + dist2 + female + bytest + tuition + black + hispanic +
             incomehi + dist:incomehi + dist2:incomehi + ownhome + dadcoll + momcoll +
             dadcoll:momcoll + cue80 + stwmfg80)
    # show the result with robust standard errors
    coeftest(mh, vcov = vcovHC(mh, type = "HC"))
    # create two data frames consistent with CD
    ndlow <- data.frame(dist = dseq, dist2 = dseq^2, female = 1, hispanic = 1, black = 0,
                        bytest = 58, dadcoll = 1, momcoll = 1, ownhome = 0, cue80 = 7.1,
                        stwmfg80 = 10.06, tuition = 0.95, incomehi = 0)
    ndhi <- data.frame(dist = dseq, dist2 = dseq^2, female = 1, hispanic = 1, black = 0,
                       bytest = 58, dadcoll = 1, momcoll = 1, ownhome = 0, cue80 = 7.1,
                       stwmfg80 = 10.06, tuition = 0.95, incomehi = 1)
    # plot the estimated regression function
    plot(ndhi$dist, predict(mh, ndhi), type = "l", col = "blue", ylim = c(14.8, 15.7),
         xlab = "Distance (Tens of Miles)", ylab = "Years of Education")
    # add additional regression function
    lines(ndlow$dist, predict(mh, ndlow), type = "l", col = "red")
    # add a legend
    legend("topright", c("Incomehi=1", "Incomehi=0"),
           col = c("blue", "red"), lwd = 1, inset = .01)
    # joint test on the interaction terms
    linearHypothesis(mh, c("dist:incomehi = 0", "dist2:incomehi = 0"),
                     vcov = vcovHC(mh, type = "HC1"))

The regression result is shown in column (h) of the above table. Regression (h) adds the interaction of Incomehi with the distance regressors, Dist and Dist^2. The implied coefficients on Dist and Dist^2 are:

Students who are not high income (Incomehi = 0):
$$\widehat{ED} = -0.110\,Dist + 0.0065\,Dist^2 + \text{other factors}.$$

High-income students (Incomehi = 1):
$$\widehat{ED} = (-0.110 + 0.124)\,Dist + (0.0065 - 0.0087)\,Dist^2 + \text{other factors} = 0.014\,Dist - 0.0022\,Dist^2 + \text{other factors}.$$

The two estimated regression functions are plotted below for someone with the characteristics given in (e), but with Incomehi = 1 and with Incomehi = 0. When Incomehi = 1, the regression function is essentially flat, suggesting very little effect of Dist on ED. The F-statistic testing that the coefficients on the interaction terms Incomehi x Dist and Incomehi x Dist^2 are both equal to zero has a p-value of 0.092. Thus, the interaction effects are significant at the 10% but not the 5% significance level.
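The implied coefficients for each income group in (h) are the base coefficients plus Incomehi times the interaction coefficients. A minimal sketch in R, using the rounded estimates quoted in the text rather than the fitted object `mh`; the names `b` and `implied` are illustrative assumptions.

```r
# Rounded estimates from regression (h), as quoted in the text.
b <- c(dist = -0.110, dist2 = 0.0065,
       inc_dist = 0.124, inc_dist2 = -0.0087)

# Implied coefficients on Dist and Dist^2 for a given Incomehi (0 or 1).
implied <- function(incomehi) {
  c(dist  = b[["dist"]]  + incomehi * b[["inc_dist"]],
    dist2 = b[["dist2"]] + incomehi * b[["inc_dist2"]])
}

print(implied(0))  # students who are not high income
print(implied(1))  # high-income students
```

Because these inputs are rounded, the output can differ in the last digit from values computed with the unrounded estimates stored in `mh`.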

[Figure: estimated regression functions for Incomehi = 1 and Incomehi = 0; Years of Education against Distance (Tens of Miles).]

(i) After running all of these regressions (and any others that you want to run), summarize the effect of Dist on years of education.

The regression functions in (f) and (h) show the nonlinear effect of distance on years of education. The effect is statistically significant. In (f), changing Dist from 20 miles to 30 miles changes years of completed education by $-0.081 \times (3 - 2) + 0.0047 \times (3^2 - 2^2) = -0.0575$ years, on average. The regression in (h) shows a slightly negative effect for students who are not high income, but essentially no effect for high-income students.
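The arithmetic in (i) is the general formula for a predicted change in a quadratic regression. The following R sketch applies it to the rounded estimates for regression (f) quoted in (i); the function name `quad_change` is an illustrative assumption.

```r
# Predicted change in Y when X moves from x0 to x1 in the quadratic
# regression Y = b0 + b1 * X + b2 * X^2 + u.
quad_change <- function(b1, b2, x0, x1) {
  b1 * (x1 - x0) + b2 * (x1^2 - x0^2)
}

# Rounded estimates for regression (f): coefficients on Dist and Dist^2.
change_20_to_30 <- quad_change(b1 = -0.081, b2 = 0.0047, x0 = 2, x1 = 3)
print(change_20_to_30)  # about -0.0575 years
```

The same function covers question 5: moving X from 5 to 6 gives `quad_change(b1, b2, 5, 6)`, which equals b1 + 11 * b2.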