Stat 139 Homework 7 Solutions, Fall 2015

Problem 1. In class we learned that the classical simple linear regression model assumes the following distribution of responses:

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad i = 1, \ldots, n, \tag{1}$$

where $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. Two estimators we'd like to calculate are:

(i) $\hat\mu_{Y|X} = \hat\beta_0 + \hat\beta_1 X$, the predicted mean value of the response, $\mu_Y$, for any individual with a specific value of the predictor, $X$. This can be used to build a confidence interval for where the true $\mu_Y$ will be located given $X$.

(ii) $\hat Y_j | X_j = \hat\beta_0 + \hat\beta_1 X_j + \epsilon_j$, the predicted value for a new individual response, $Y_j$, given that individual's value of the predictor, $X_j$. This can be used to build a prediction interval for where a new $Y_j$ will be located given its $X_j$.

In this problem we will determine the sampling distributions of these two estimators [Note: the second is not technically an estimator since $\epsilon_j$ is not observable, but we can still determine some characteristics of this entity since we can assume the sampling distribution of $\epsilon_j$ provided above].

(a) Calculate the expected value of $\hat\mu_{Y|X}$ and $\hat Y_j | X_j$. The sampling distribution results of $\hat\beta_0$ and $\hat\beta_1$ provided in class can be used directly for this problem.

$$E(\hat\mu_{Y|X}) = E(\hat\beta_0 + \hat\beta_1 X) = E(\hat\beta_0) + X E(\hat\beta_1) = \beta_0 + X\beta_1$$

$$E(\hat Y_j | X_j) = E(\hat\beta_0 + \hat\beta_1 X_j + \epsilon_j) = E(\hat\beta_0) + X_j E(\hat\beta_1) + E(\epsilon_j) = \beta_0 + X_j\beta_1 + 0$$

It turns out that $\mathrm{Cov}(\bar Y, \hat\beta_1) = 0$. In other words, the estimator of the slope of a regression line is not correlated with the average response. Intuitively, this is true because the regression line has to pass through the point $(\bar X, \bar Y)$ regardless of the slope value.

(b) Show that

$$\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = -\frac{\sigma^2 \bar X}{\sum_{i=1}^n (X_i - \bar X)^2}$$

using the fact above that $\mathrm{Cov}(\bar Y, \hat\beta_1) = 0$ and the properties of covariance: $\mathrm{Cov}(aX, Y) = a\,\mathrm{Cov}(X, Y)$ and $\mathrm{Cov}(X + W, Y) = \mathrm{Cov}(X, Y) + \mathrm{Cov}(W, Y)$.

$$\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = \mathrm{Cov}(\bar Y - \hat\beta_1 \bar X, \hat\beta_1) = \mathrm{Cov}(\bar Y, \hat\beta_1) - \mathrm{Cov}(\hat\beta_1 \bar X, \hat\beta_1) = 0 - \bar X\,\mathrm{Var}(\hat\beta_1) = -\frac{\sigma^2 \bar X}{\sum_{i=1}^n (X_i - \bar X)^2}$$

(c) Determine $\mathrm{Var}(\hat\mu_{Y|X})$. Hint: $\sum_{i=1}^n (X_i - \bar X)^2 = \sum_{i=1}^n X_i^2 - n\bar X^2$ may be useful (though you may not need to use this property).
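Note: we decided to use $X_j$ for the point where we are doing the calculation so as not to confuse it with the observed $X$'s in the data set:

$$\mathrm{Var}(\hat\mu_{Y|X_j}) = \mathrm{Var}(\hat\beta_0 + \hat\beta_1 X_j) = \mathrm{Var}(\hat\beta_0) + \mathrm{Var}(\hat\beta_1 X_j) + 2X_j\,\mathrm{Cov}(\hat\beta_0, \hat\beta_1)$$

$$= \left(\frac{\sigma^2}{n} + \frac{\bar X^2 \sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2}\right) + \frac{X_j^2 \sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2} - \frac{2X_j \sigma^2 \bar X}{\sum_{i=1}^n (X_i - \bar X)^2}$$

$$= \sigma^2\left(\frac{1}{n} + \frac{\bar X^2 + X_j^2 - 2X_j \bar X}{\sum_{i=1}^n (X_i - \bar X)^2}\right) = \sigma^2\left(\frac{1}{n} + \frac{(X_j - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}\right)$$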
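The variance formula for the fitted mean in part (c) can be sanity-checked by simulation. The sketch below is only an illustration: the design points, the "true" parameter values, and the number of replications are hypothetical choices, not part of the homework. It refits the line on many simulated data sets and compares the empirical variance of the fitted mean at $X_j$ with the theoretical value.

```r
# Monte Carlo check of Var(mu_hat at xj) = sigma^2 * (1/n + (xj - xbar)^2 / Sxx).
# All numeric settings below are hypothetical choices for illustration.
set.seed(139)
n     <- 30
x     <- seq(20, 80, length.out = n)  # fixed design points
beta0 <- 577                          # assumed true intercept
beta1 <- -3                           # assumed true slope
sigma <- 50                           # assumed error standard deviation
xj    <- 75                           # point where the mean is estimated

# Theoretical variance from part (c)
sxx        <- sum((x - mean(x))^2)
var_theory <- sigma^2 * (1 / n + (xj - mean(x))^2 / sxx)

# Refit the regression on many simulated data sets; keep the fitted mean at xj
mu_hat <- replicate(5000, {
  y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)
  b <- coef(lm(y ~ x))
  unname(b[1] + b[2] * xj)
})

var_sim <- var(mu_hat)
print(c(theory = var_theory, simulated = var_sim))
```

The two numbers should agree up to simulation noise; adding `sigma^2` to the theoretical value gives the corresponding variance for a new individual response.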
It also turns out that $\mathrm{Cov}(\hat\mu_{Y|X}, \epsilon_j) = 0$. In other words, the residuals around the line are not correlated with where the prediction is being made. Intuitively this is true because one of our assumptions in the regression model is that the variance is the same no matter what value of $X_j$ is observed. (Note: $\mathrm{Cov}(Y_j, \epsilon_j) \neq 0$.)

(d) Determine $\mathrm{Var}(\hat Y_j | X_j)$.

$$\mathrm{Var}(\hat Y_j | X_j) = \mathrm{Var}(\hat\mu_{Y|X_j} + \epsilon_j) = \mathrm{Var}(\hat\mu_{Y|X_j}) + \mathrm{Var}(\epsilon_j) + 2\,\mathrm{Cov}(\hat\mu_{Y|X_j}, \epsilon_j)$$

$$= \sigma^2\left(\frac{1}{n} + \frac{(X_j - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}\right) + \sigma^2 + 0 = \sigma^2\left(1 + \frac{1}{n} + \frac{(X_j - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}\right)$$

(e) Make an argument for why $\hat\mu_{Y|X}$ and $\hat Y_j | X_j$ are both Normally distributed.

Both of these estimators should be Normally distributed since they are linear combinations of Normally distributed random variables (linear combinations of $\hat\beta_0$, $\hat\beta_1 X_j$, and $\epsilon_j$).

(f) Where will $\hat\mu_{Y|X}$ lie 95% of the time? Where will $\hat Y_j | X_j$ lie 95% of the time? Note: these intervals can be used to build confidence intervals and prediction intervals at a particular value of $X$ by centering them at the estimates rather than the true values, and by using the usual regression estimate for $\sigma^2$.

These will lie within plus or minus $\Phi^{-1}(0.975) = 1.96$ times their respective standard deviations of the true mean at $X_j$. That is, $\hat\mu_{Y|X}$ will lie within

$$(\beta_0 + \beta_1 X_j) \pm 1.96\sqrt{\sigma^2\left(\frac{1}{n} + \frac{(X_j - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}\right)}$$

95% of the time, and $\hat Y_j | X_j$ will lie within the following bounds 95% of the time:

$$(\beta_0 + \beta_1 X_j) \pm 1.96\sqrt{\sigma^2\left(1 + \frac{1}{n} + \frac{(X_j - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}\right)}$$

Problem 2. A study was conducted to determine the association between the maximum distance at which a highway sign can be read (in feet) and the age of the driver (in years). Thirty drivers of various ages were studied. Sample means and variances for distance and age and the correlation between these variables are given in the accompanying table.

              n     sample mean    sample variance
  Distance    30    423.333        6678.16
  Age         30    51.0           474.207
  Correlation r = -0.8012

(a) Find $\hat\beta_0$, $\hat\beta_1$, the standard error of $\hat\beta_1$, and the least-squares regression equation that would predict the distance at which a highway sign can be read given the age of the driver.

$$\hat\beta_1 = r\,\frac{s_Y}{s_X} = -0.8012\sqrt{\frac{6678.16}{474.207}} = -3.007$$

$$\hat\beta_0 = \bar Y - \bar X \hat\beta_1 = 423.333 - (51)(-3.007) = 576.7$$

$$s_{\hat\beta_1} = \sqrt{\frac{s^2}{\sum_{i=1}^n (X_i - \bar X)^2}} = \sqrt{\frac{(1 - r^2)\,s_Y^2}{(n - 2)\,s_X^2}} = \sqrt{\frac{(1 - 0.8012^2)\,6678.16}{(28)\,474.207}} = 0.4244$$

$$\hat Y = \hat\beta_0 + \hat\beta_1 X = 576.7 - 3.007\,X$$

The standard error uses the fact that the variance estimate of the residuals is $s^2 = MSE = SSE/(n-2) = (1 - r^2)\,SSY/(n-2) = (1 - r^2)\,s_Y^2\,(n-1)/(n-2)$, together with $\sum_{i=1}^n (X_i - \bar X)^2 = (n-1)\,s_X^2$.

(b) Is Age a significant predictor of Distance in this linear model? Conduct a formal hypothesis test at the $\alpha = 0.05$ level (include the usual elements of a test of hypothesis).

$H_0: \beta_1 = 0$ vs. $H_A: \beta_1 \neq 0$

$$t = \frac{\hat\beta_1}{s_{\hat\beta_1}} = \frac{-3.007}{0.4244} = -7.085$$

p-value $= P(|t_{df=28}| > 7.085) \approx 0.0000001$

Since our p-value is less than 0.05, we can reject the null hypothesis. There is evidence that distance is related to age; in fact, younger age is associated with being able to read a sign from a greater distance.

(c) Using only the correlation coefficient ($r$) and the sample size, conduct a test to determine if there is a significant association (i.e. $H_0: \rho = 0$) between these two variables using $\alpha = 0.05$.

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{-0.8012}{\sqrt{\dfrac{1 - 0.8012^2}{28}}} = -7.085$$

We get the exact same t-statistic with the same d.f. as part (b), so we'll have the same p-value and come to the same conclusion.

(d) Comparing the results of parts (b) and (c) above, what do you conclude about these two tests?

These two tests are mathematically equivalent (t-statistic, degrees of freedom, and p-value) and will always come to the same conclusion. This can be shown algebraically.

(e) Using your results from part (a) above, calculate the 95% confidence interval for the slope of this regression line.

$$\hat\beta_1 \pm t^*_{df=28}\,s_{\hat\beta_1} = -3.007 \pm 2.0484(0.4244) = (-3.876, -2.138)$$

> qt(0.975,df=28)
[1] 2.048407

(f) Considering the lower and upper 95% confidence limits of the slope you calculated in part (e), how are these consistent with your results for parts (b) and (c)?
These results are consistent: we rejected the null hypothesis ($H_0: \beta_1 = 0$) that the slope is truly zero, and the confidence interval does not include the value zero, so either way a slope of zero does not appear to be a plausible assumption.
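The hand calculations in Problem 2 can be reproduced from the summary statistics alone. The sketch below uses only the numbers given in the table; the variable names are our own. It also checks numerically that the slope test in part (b) and the correlation test in part (c) give the same t-statistic.

```r
# Summary statistics from the Problem 2 table
n    <- 30
ybar <- 423.333; s2y <- 6678.16   # distance: sample mean and variance
xbar <- 51.0;    s2x <- 474.207   # age: sample mean and variance
r    <- -0.8012                   # sample correlation

# Part (a): slope, intercept, and standard error of the slope
b1    <- r * sqrt(s2y / s2x)
b0    <- ybar - xbar * b1
se_b1 <- sqrt((1 - r^2) * s2y / ((n - 2) * s2x))

# Parts (b) and (c): t-statistic from the slope and from the correlation
t_slope <- b1 / se_b1
t_corr  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_slope), df = n - 2)

# Part (e): 95% confidence interval for the slope
ci_slope <- b1 + c(-1, 1) * qt(0.975, df = n - 2) * se_b1

print(round(c(b0 = b0, b1 = b1, se_b1 = se_b1, t_slope = t_slope, t_corr = t_corr), 4))
print(p_value)
print(round(ci_slope, 3))
```

Small differences in the last digit relative to the hand-worked answers come from rounding $\hat\beta_1$ to $-3.007$ before the later steps.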
Problem 3. The data for the above problem are available in an Excel file on the class website under the filename HighwaySigns.csv.

(a) Make a scatterplot of this data (with fitted regression line) in R (do not include it here... you'll print it out for part (g)), run a regression model, and confirm the results you calculated in problem 2(a) for $\hat\beta_0$, $\hat\beta_1$, and $s_{\hat\beta_1}$.

> fit=lm(distance~age,data=highwaysigns)
> summary(fit)

Call:
lm(formula = distance ~ age, data = highwaysigns)

Residuals:
    Min      1Q  Median      3Q     Max
-78.231 -41.710   7.646  33.552 108.831

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 576.6819    23.4709  24.570  < 2e-16 ***
age          -3.0068     0.4243  -7.086 1.04e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 49.76 on 28 degrees of freedom
Multiple R-squared:  0.642,  Adjusted R-squared:  0.6292
F-statistic: 50.21 on 1 and 28 DF,  p-value: 1.041e-07

> plot(distance~age,data=highwaysigns,pch=16,cex=3)
> abline(fit,lwd=3,col="red")

[Figure: scatterplot of distance versus age with the fitted regression line.]

Based on the above output, we see that the estimates match our hand-calculated ones (ignoring rounding errors): $\hat\beta_1 = -3.0068$, $\hat\beta_0 = 576.68$, and $s_{\hat\beta_1} = 0.4243$.

(b) Using the results of the regression model from R, locate the calculated value of the t-test statistic and the associated p-value to determine if age is a significant predictor of distance and compare these results to the results you obtained by hand in problem 2(b) above.

Based on the R output, we see the calculated t-statistic for the slope is $-7.086$ with a p-value of $1.04 \times 10^{-7}$, which agree with the work done by hand.
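Rather than reading the slope's t-statistic and p-value off the printed summary, they can also be pulled out of the coefficient table programmatically. The sketch below is an illustration on simulated stand-in data, since HighwaySigns.csv is not reproduced here: the data frame `sim`, the fit `fit_sim`, and the parameter values used to generate the data are all hypothetical.

```r
# Simulated stand-in for the highway signs data (NOT the real data set)
set.seed(1)
age      <- round(runif(30, min = 18, max = 82))
distance <- 577 - 3 * age + rnorm(30, sd = 50)
sim      <- data.frame(age = age, distance = distance)

fit_sim <- lm(distance ~ age, data = sim)

# coef(summary(...)) returns the coefficient table as a numeric matrix with
# columns "Estimate", "Std. Error", "t value", and "Pr(>|t|)"
tab       <- coef(summary(fit_sim))
slope_row <- tab["age", ]

# The reported t value is the estimate divided by its standard error
t_by_hand <- unname(slope_row["Estimate"] / slope_row["Std. Error"])

print(slope_row)
print(t_by_hand)
```

With the real data, replacing `fit_sim` by the `fit` object from part (a) would give the row for `age` shown in the output above.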
(c) Using only the regression analysis results table and the sample means and variances given in problem 2, calculate a 95% confidence interval for the average distance at which a highway sign can be read by individuals 75 years of age.

$$\hat Y = \hat\beta_0 + \hat\beta_1 X = 576.7 - 3.007\,X = 576.7 - 3.007 \times 75 = 351.2$$

95% Confidence Interval for $\mu_Y$ at $x = 75$:

$$\hat Y \pm t^*_{df=n-2}\,s\sqrt{\frac{1}{n} + \frac{(x - \bar x)^2}{(n-1)\,s_x^2}} = 351.2 \pm 2.0484 \times 49.76\sqrt{\frac{1}{30} + \frac{(75 - 51)^2}{(29)\,474.207}} = (323.2, 379.2)$$

(d) Use R to confirm the 95% confidence interval in part (c). You'll have to create a new data frame with the predictor variable age set to 75, new=data.frame(age = 75), and then use the command predict with the linear model from part (a).

> new=data.frame(age = 75)
> predict(fit,new,interval="confidence")
       fit      lwr      upr
1 351.1693 323.2135 379.1251

(e) Using only the regression analysis results table and the sample means and variances given in problem 2, calculate a 95% prediction interval for the distance at which a highway sign can be read by an individual 75 years of age.

$$\hat Y \pm t^*_{df=n-2}\,s\sqrt{1 + \frac{1}{n} + \frac{(x - \bar x)^2}{(n-1)\,s_x^2}} = 351.2 \pm 2.0484 \times 49.76\sqrt{1 + \frac{1}{30} + \frac{(75 - 51)^2}{(29)\,474.207}} = (245.51, 456.89)$$

(f) Use R to confirm the 95% prediction interval in part (e).

> predict(fit,new,interval="prediction")
       fit      lwr      upr
1 351.1693 245.4732 456.8653

(g) Print out the scatterplot with least-squares line, enter your two intervals from parts (c) and (e) by hand onto the scatterplot, and explain the difference between the 95% confidence interval and the 95% prediction interval.

[Figure: scatterplot of distance versus age with the fitted regression line and the two intervals drawn at age 75.]

In the above graph the prediction interval is in blue and the confidence interval is in orange. The prediction interval is a reasonable interval estimate for the distance at which a single 75 year old person would be predicted to be able to read a sign at the 95% confidence level, while the confidence interval is a range of plausible values for the average distance at which all 75 year olds in the population are able to read a sign (that is, where the underlying population mean lies in the Y-direction at X = 75) at the 95% confidence level.

Problem 4. The data set malebirths.csv contains data for the proportion of male births in 4 different countries (Denmark, the Netherlands, Canada, and the United States) for a number of years. Use this data set to answer the following questions:

(a) Run four different simple linear regression models in R, one for each country separately. Create a table with four rows (one for each country) and four columns: one column each for $\hat\beta_1$, the standard error of $\hat\beta_1$, the related t-statistic, and the p-value related to this test. For which countries is the association significant?

  Country       slope estimate    standard error    t-stat    p-value
  Denmark       -4.29 x 10^-5     2.07 x 10^-5      -2.07     0.0442
  Netherlands   -8.08 x 10^-5     1.42 x 10^-5      -5.71     < 0.00001
  Canada        -1.11 x 10^-4     2.77 x 10^-5      -4.02     0.00074
  USA           -5.43 x 10^-5     9.39 x 10^-6      -5.78     0.00001

Based on the results of the linear regression models, we see that the proportion of births that are male has significantly decreased over time in all four countries. From most to least significant: Netherlands, USA, Canada, and Denmark.

(b) Present four different scatterplots (one for each country) with the observed points and the related fitted regression line as well. It would be helpful for interpretation if each plot had the same bounds on both axes. Be sure the plots are clearly labeled.

[Figure: four scatterplots with fitted regression lines, one panel each for Denmark, Netherlands, Canada, and USA.]
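A table like the one in part (a) can be assembled by looping over the countries instead of fitting each model by hand. The sketch below is an assumption-heavy illustration: the layout of malebirths.csv is not shown in this document, so the column names (`year`, `denmark`, `netherlands`, `canada`, `usa`) and the simulated values are hypothetical stand-ins, not the real data.

```r
# Simulated stand-in for malebirths.csv; column names and all numbers are assumptions
set.seed(4)
year <- 1950:1994
k    <- length(year)
births <- data.frame(
  year        = year,
  denmark     = 0.515 - 4e-5 * (year - 1950) + rnorm(k, sd = 2e-3),
  netherlands = 0.516 - 8e-5 * (year - 1950) + rnorm(k, sd = 1e-3),
  canada      = 0.515 - 1e-4 * (year - 1950) + rnorm(k, sd = 1e-3),
  usa         = 0.513 - 5e-5 * (year - 1950) + rnorm(k, sd = 3e-4)
)
countries <- c("denmark", "netherlands", "canada", "usa")

# Fit one simple linear regression per country and keep the slope's row
# (estimate, standard error, t value, p-value) from the coefficient table
slope_table <- t(sapply(countries, function(cn) {
  fit <- lm(reformulate("year", response = cn), data = births)
  coef(summary(fit))["year", ]
}))
colnames(slope_table) <- c("slope", "se", "t", "p")
print(signif(slope_table, 3))

# A shared y-axis range, useful when drawing the four panels with the same bounds
shared_ylim <- range(births[countries])
```

Passing `ylim = shared_ylim` to each `plot` call is one way to give all four panels the same vertical bounds.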
(c) Explain why the U.S. can have the largest of the 4 t-statistics (in magnitude) even though its slope is not the largest.

This can be explained by the fact that the standard error of the slope is smaller for the U.S. (because the spread of the points around the fitted regression line is much smaller in the vertical direction).

(d) Explain why the standard error of the estimated slope is smaller for the U.S. than for Canada, even though the sample size is the same.

This can be explained by the fact that the estimated standard deviation of the residuals is much smaller (less spread around the line in the y-direction), so the precision of the slope as an estimate of the unknown true value generating this data is much better.

(e) Provide a reason why the standard deviations around the regression line might be different for the four countries (hint: the proportion of males can be thought of as an average within each country respectively).

This is because the points actually represent averages of zeros and ones: an indicator, for each baby that is born, of whether it is male or not. Since there are so many more observations (births) in the U.S., we'd expect the average of these zeros and ones to vary less (based on the Law of Large Numbers).
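The argument in part (e) can be illustrated with a small simulation. The sketch below treats each birth as an independent 0/1 draw, so the yearly proportion of males has standard deviation $\sqrt{p(1-p)/N}$ for $N$ births. The probability `p` and the two yearly birth counts are hypothetical round numbers chosen for illustration, not figures from the data set.

```r
# Variability of a yearly proportion of male births for a small and a large country.
# p and the birth counts are hypothetical values for illustration only.
set.seed(5)
p <- 0.512                                       # assumed probability a birth is male
births_per_year <- c(small = 6e4, large = 4e6)   # assumed yearly numbers of births

# Simulate 2000 "years" for each country size and record the proportion of males
sim_sd <- sapply(births_per_year, function(N) {
  sd(rbinom(2000, size = N, prob = p) / N)
})

# Standard deviation of a binomial proportion: sqrt(p * (1 - p) / N)
theory_sd <- sqrt(p * (1 - p) / births_per_year)

print(rbind(simulated = sim_sd, theory = theory_sd))
```

The country with more births shows a much smaller year-to-year spread, which matches the tighter scatter around the regression line described above.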