Statistics 203: Introduction to Regression and Analysis of Variance
Assignment #1 Solutions
January 20, 2005

Q. 1) (MP 2.7)

(a) Let x denote the hydrocarbon percentage, and let y denote the oxygen purity. The fitted simple linear regression model is ŷ = 77.863 + 11.801x.

> #MP 2.7, Oxygen
> oxygen.table <- read.table("http://www-stat/~jtaylo/courses/stats203/data/oxygen.table",
+                            header=T, sep=",")
> attach(oxygen.table)
> purity.lm <- lm(purity ~ hydro)
> summary(purity.lm)

Call:
lm(formula = purity ~ hydro)

Residuals:
    Min      1Q  Median      3Q     Max
-4.6724 -3.2113 -0.0626  2.5783  7.3037

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   77.863      4.199  18.544 3.54e-13 ***
hydro         11.801      3.485   3.386  0.00329 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.597 on 18 degrees of freedom
Multiple R-Squared: 0.3891,     Adjusted R-squared: 0.3552
F-statistic: 11.47 on 1 and 18 DF,  p-value: 0.003291
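As a quick sanity check (not part of the original R session), the coefficient estimates in the summary above follow from the closed-form least-squares formulas b1 = Sxy/Sxx and b0 = ȳ − b1·x̄. Here is a sketch in Python on synthetic data, since the oxygen data set itself is not reproduced in this document:

```python
import numpy as np

# Synthetic stand-in data (the actual oxygen-purity data are not included here)
rng = np.random.default_rng(0)
x = rng.uniform(0.9, 1.6, size=20)              # "hydrocarbon percentage"
y = 77.0 + 12.0 * x + rng.normal(0.0, 3.5, 20)  # "purity"

# Closed-form simple linear regression estimates
sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = sxy / sxx                  # slope estimate
b0 = y.mean() - b1 * x.mean()   # intercept estimate

# Agrees with a generic least-squares polynomial fit of degree 1
b1_ref, b0_ref = np.polyfit(x, y, 1)
assert np.allclose([b0, b1], [b0_ref, b1_ref])
```

The same closed forms underlie R's `lm(purity ~ hydro)` fit above.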
> #Use filled-in circles in the plot by typing pch=21
> plot(hydro, purity, pch=21, bg="blue", main="purity vs Hydrocarbon Percentage")
> abline(purity.lm$coef, lwd=2)

Figure 1: Plot of the purity versus hydrocarbon percentage, with the least squares line superimposed.

Figure 1 suggests a positive relationship between oxygen purity and the hydrocarbon percentage.

(b) Consider H0 : β1 = 0 versus H1 : β1 ≠ 0. From the summary in part (a), we have t = β̂1/SE(β̂1) = 11.801/3.485 = 3.386 on 20 − 2 = 18 d.f., corresponding to a p-value
of 0.00329. We therefore reject H0 in favor of H1, and conclude that the true slope β1 is not zero.

(c) From summary(purity.lm) in part (a) above, we have R² = 0.3891.

(d) A 95% confidence interval for β1 in this SLR model is given in R by:

> confint(purity.lm, level=.95)
                 2.5 %   97.5 %
(Intercept) 69.041747 86.68482
hydro        4.479066 19.12299

Alternatively, recall that a 100(1 − α)% CI for β1 is:

$$\hat\beta_1 \pm t_{1-\alpha/2,\,n-2}\,\mathrm{SE}(\hat\beta_1).$$

From above, we have SE(β̂1) = 3.485. We can now compute this in R by typing:

> t.quantiles <- qt(c(.025,.975), 18)
> 11.801 + 3.485*t.quantiles
[1]  4.479287 19.122713

Here, t.quantiles are the .025 and .975 quantiles of the t18 distribution.

(e) A 95% confidence interval for E(Y | X = 1.0) is given by (87.51, 91.82). This is computed in R as follows.

> predict(purity.lm, newdata=list(hydro = 1.0), interval="confidence", level=.95)
       fit      lwr      upr
[1,] 89.66431 87.51017 91.81845

Q. 2) (MP 2.19)

(a) Here the intercept β0 is known. As usual, let SSE = Σᵢ [yᵢ − (β0 + β1 xᵢ)]². Then:

$$\frac{\partial\,\mathrm{SSE}}{\partial\beta_1}
= 2\sum_{i=1}^{n}\bigl(y_i - \beta_0 - \beta_1 x_i\bigr)(-1)(x_i).$$

Setting this derivative to zero at β̂1 and dividing through by −2, we obtain:

$$0 = \sum_{i=1}^{n}\bigl(x_i y_i - \beta_0 x_i - \hat\beta_1 x_i^2\bigr),
\qquad\text{so}\qquad
\hat\beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i - \beta_0 \sum_{i=1}^{n} x_i.
So the least squares estimate β̂1 is given by:

$$\hat\beta_1
= \frac{\sum_i x_i y_i - \beta_0 \sum_i x_i}{\sum_i x_i^2}
= \frac{\sum_i (y_i - \beta_0)\,x_i}{\sum_i x_i^2}.$$

(b) We derive Var(β̂1) as follows:

$$\begin{aligned}
\mathrm{Var}(\hat\beta_1)
&= \mathrm{Var}\!\left(\frac{\sum_i x_i y_i - \beta_0\sum_i x_i}{\sum_i x_i^2}\right)\\
&= \Bigl(\textstyle\sum_i x_i^2\Bigr)^{-2}\,\mathrm{Var}\Bigl(\textstyle\sum_i x_i y_i - \beta_0\sum_i x_i\Bigr)\\
&= \Bigl(\textstyle\sum_i x_i^2\Bigr)^{-2}\,\mathrm{Var}\Bigl(\textstyle\sum_i x_i y_i\Bigr)
&&\text{since } \beta_0\textstyle\sum_i x_i \text{ is constant}\\
&= \Bigl(\textstyle\sum_i x_i^2\Bigr)^{-2}\textstyle\sum_i x_i^2\,\mathrm{Var}(y_i)
&&\text{since the } y_i \text{ are independent}\\
&= \Bigl(\textstyle\sum_i x_i^2\Bigr)^{-2}\textstyle\sum_i x_i^2\,\sigma^2
= \sigma^2\Big/\textstyle\sum_i x_i^2.
\end{aligned}$$

(c) We summarize our results in the following table.

Model       | Var(β̂1)         | SSE                   | SE(β̂1) = √(estimated Var(β̂1))
β0 unknown  | σ²/Σᵢ(xᵢ − x̄)²  | Σᵢ(yᵢ − β̂0 − β̂1xᵢ)² | √[ SSE/(n−2) / Σᵢ(xᵢ − x̄)² ]
β0 known    | σ²/Σᵢ xᵢ²        | Σᵢ(yᵢ − β0 − β̂1xᵢ)²  | √[ SSE/(n−1) / Σᵢ xᵢ² ]

First, notice that the variance when β0 is known is no larger than the variance when β0 is unknown. (Equality holds if and only if x̄ = 0.) To see this, observe:

$$\sum_i (x_i - \bar x)^2 = \sum_i x_i^2 - n\bar x^2 \le \sum_i x_i^2,
\qquad\text{hence}\qquad
\frac{\sigma^2}{\sum_i x_i^2} \le \frac{\sigma^2}{\sum_i (x_i - \bar x)^2}.
Hence, confidence intervals for β1 will be narrower when β0 is known, regardless of sample size (unless x̄ = 0). Furthermore, in deriving a confidence interval for β1 when β0 is known, it is not hard to show that:

$$T = \frac{\hat\beta_1 - \beta_1}{\mathrm{SE}(\hat\beta_1)} \sim t_{n-1}.$$

The estimator β̂1 is a linear combination of the {yᵢ}, and hence is normally distributed. β̂1 is also unbiased:

$$\begin{aligned}
E(\hat\beta_1)
&= E\!\left(\frac{\sum_i x_i y_i - \beta_0 \sum_i x_i}{\sum_i x_i^2}\right)\\
&= \Bigl(\textstyle\sum_i x_i^2\Bigr)^{-1}\Bigl(\textstyle\sum_i E(x_i y_i) - \beta_0 \textstyle\sum_i x_i\Bigr)\\
&= \Bigl(\textstyle\sum_i x_i^2\Bigr)^{-1}\Bigl(\textstyle\sum_i E\bigl[x_i(\beta_0 + \beta_1 x_i + \varepsilon_i)\bigr] - \beta_0 \textstyle\sum_i x_i\Bigr)\\
&= \Bigl(\textstyle\sum_i x_i^2\Bigr)^{-1}\Bigl(\beta_0\textstyle\sum_i x_i + \beta_1 \textstyle\sum_i x_i^2 - \beta_0\textstyle\sum_i x_i\Bigr)\\
&= \beta_1.
\end{aligned}$$

Again, the vector of residuals e is independent of β̂1, and so the estimate of Var(β̂1) is independent of β̂1. Hence,

$$T = \frac{(\hat\beta_1 - \beta_1)\Big/\sqrt{\sigma^2\big/\sum_i x_i^2}}
{\sqrt{\sum_i (y_i - \beta_0 - \hat\beta_1 x_i)^2\big/\bigl(\sigma^2(n-1)\bigr)}}
= \frac{\hat\beta_1 - \beta_1}{\mathrm{SE}(\hat\beta_1)} \sim t_{n-1},$$

where the numerator is N(0, 1) and the quantity inside the square root in the denominator is a χ²_{n−1} variable divided by its n − 1 degrees of freedom. Therefore, a 100(1 − α)% confidence interval for β1 has the form:

$$\hat\beta_1 \pm t_{1-\alpha/2,\,\nu}\,\mathrm{SE}(\hat\beta_1),$$

where the degrees of freedom ν = n − 1 when β0 is known (one less parameter to estimate) and ν = n − 2 when β0 is unknown. This difference of 1 degree of freedom results in a slightly narrower
CI for β1, but for relatively large samples, this difference is almost negligible. The major difference is attributable to the variance of the estimator β̂1.

Q. 3) (MP 3.10)

(a) The following R commands allow us to compute and plot the residuals and standardized residuals.

> softdrink.table <- read.table("http://www-stat/~jtaylo/courses/stats203/data/softdrink.table",
+                               header=T, sep=" ")
> attach(softdrink.table)
> #Compute residuals
> softdrink.lm <- lm(y ~ x1 + x2)
> softdrink.resid <- softdrink.lm$residuals
> #Compute standardized residuals
> softdrink.st.resid <- rstandard(softdrink.lm)
> #Combine residuals & standardized residuals using cbind
> print(cbind(softdrink.resid, softdrink.st.resid))
   softdrink.resid softdrink.st.resid
1       -5.0280843        -1.62767993
2        1.1463854         0.36484267
3       -0.0497937        -0.01609165
4        4.9243539         1.57972040
5       -0.4443983        -0.14176094
6       -0.2895743        -0.09080847
7        0.8446235         0.27042496
8        1.1566049         0.36672118
9        7.4197062         3.21376278
10       2.3764129         0.81325432
11       2.2374930         0.71807970
12      -0.5930409        -0.19325733
13       1.0270093         0.32517935
14       1.0675359         0.34113547
15       0.6712018         0.21029137
16      -0.6629284        -0.22270023
17       0.4363603         0.13803929
18       3.4486213         1.11295196
19       1.7931935         0.57876634
20      -5.7879699        -1.87354643
21      -2.6141789        -0.87784258
22      -3.6865279        -1.44999541
23      -4.6075679        -1.44368977
24      -4.5728535        -1.49605875
25      -0.2125839        -0.06750861

> #Plot the residuals & standardized residuals in one window
> par(mfrow = c(1,2))
> plot(softdrink.lm$residuals, pch=23, bg="blue", cex=2, lwd=2, main="Residuals")
> plot(rstandard(softdrink.lm), pch=23, bg="red", cex=2, lwd=2, main="Standardized Residuals")

Figure 2: Plots of the residuals (blue) and standardized residuals (red) for the soft drink data.

(b) From Table 4.2 of Montgomery and Peck, we notice that the x1 and
x2 values for Observation 9 are much higher than what appears to be typical, suggesting that Observation 9 is an unusual observation. We can use the plot command in R to obtain the residuals-versus-fits plot and a Cook's distance plot. Both plots suggest that case number 9 is an outlying observation. Refer to Figures 3 and 4.

> plot(softdrink.lm)

Figure 3: Plot of the residuals versus fits, suggesting that case number 9 is an outlying observation.
Figure 4: The Cook's distance plot suggests that Observation 9 is an outlier.
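The standardized residuals printed by `rstandard` above are the internally studentized residuals rᵢ = eᵢ / √(MSE·(1 − hᵢᵢ)), where hᵢᵢ is the i-th diagonal of the hat matrix. As a sketch of that computation (in Python on synthetic data, since the soft drink data are not reproduced in this document):

```python
import numpy as np

# Synthetic stand-in for a two-predictor regression like lm(y ~ x1 + x2)
rng = np.random.default_rng(1)
n = 25
X = np.column_stack([np.ones(n),
                     rng.uniform(2, 30, size=n),      # stand-in for x1
                     rng.uniform(50, 1500, size=n)])  # stand-in for x2
beta = np.array([2.3, 1.6, 0.01])
y = X @ beta + rng.normal(0.0, 3.0, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix H = X (X'X)^{-1} X'
e = y - H @ y                                # ordinary residuals
p = X.shape[1]
mse = (e @ e) / (n - p)                      # estimate of sigma^2
r = e / np.sqrt(mse * (1.0 - np.diag(H)))    # standardized residuals
```

Observations with |rᵢ| well above 2, like Observation 9's 3.21 in the table above, stand out on this scale.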
Q. 4) (MP 4.24)

Method 1: First note the following facts.

- For a constant matrix A and a random vector Z, we have Var(AZ) = A Var(Z) Aᵗ.
- The hat matrix H is only a function of X, which we treat as fixed. Hence, H is a constant matrix. Under the multiple regression model, we assume Var(Y) = σ²I.
- For a symmetric matrix U, that is, U = Uᵗ, we have U⁻¹ = (U⁻¹)ᵗ. To see this, transpose both sides of U⁻¹U = I to get Uᵗ(U⁻¹)ᵗ = I; since U = Uᵗ, this says U(U⁻¹)ᵗ = I, and hence (U⁻¹)ᵗ = U⁻¹.

With Ŷ = HY, we therefore have:

$$\begin{aligned}
\mathrm{Var}(\hat Y) &= \mathrm{Var}(HY) = H\,\mathrm{Var}(Y)\,H^t\\
&= [X(X^tX)^{-1}X^t](\sigma^2 I)[X(X^tX)^{-1}X^t]^t\\
&= \sigma^2 [X(X^tX)^{-1}X^t][(X^t)^t((X^tX)^{-1})^t X^t]\\
&= \sigma^2 X(X^tX)^{-1}(X^tX)((X^tX)^{-1})^t X^t\\
&= \sigma^2 X I ((X^tX)^{-1})^t X^t\\
&= \sigma^2 X(X^tX)^{-1}X^t = \sigma^2 H.
\end{aligned}$$

The fact that ((XᵗX)⁻¹)ᵗ = (XᵗX)⁻¹ follows from the third fact above, since XᵗX is a symmetric matrix.

Method 2: Alternatively, notice that H = Hᵗ (provide a proof) and use the result of the next problem (that H = H²) to see:

$$\mathrm{Var}(\hat Y) = \mathrm{Var}(HY) = H\,\mathrm{Var}(Y)\,H^t = \sigma^2 HH^t = \sigma^2 HH = \sigma^2 H.$$
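The identity Var(Ŷ) = σ²H can also be checked numerically by simulation (a sketch in Python, not part of the original solution; the design matrix here is arbitrary): draw many response vectors Y = Xβ + ε, form Ŷ = HY for each, and compare the empirical covariance of Ŷ with σ²H.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 6, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # arbitrary small design
H = X @ np.linalg.solve(X.T @ X, X.T)                  # hat matrix

# Simulate many response vectors Y = X beta + eps, one per row
beta = np.array([1.0, 2.0])
Y = X @ beta + sigma * rng.normal(size=(100_000, n))
Yhat = Y @ H.T                                         # each row is H y

# Empirical covariance of Y_hat should match sigma^2 * H
emp_cov = np.cov(Yhat, rowvar=False)
assert np.allclose(emp_cov, sigma**2 * H, atol=0.05)
```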
Q. 5) (MP 4.25)

$$H^2 = [X(X^tX)^{-1}X^t][X(X^tX)^{-1}X^t]
= X(X^tX)^{-1}(X^tX)(X^tX)^{-1}X^t
= X(X^tX)^{-1}X^t = H.$$

$$(I - H)^2 = I^2 - 2IH + H^2 = I - 2H + H = I - H.$$
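Both idempotency identities are easy to confirm numerically (a sketch in Python, not part of the original solution; the design matrix is an arbitrary random full-rank matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 3))                 # arbitrary full-rank design matrix
H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix H = X (X'X)^{-1} X'
I = np.eye(10)

assert np.allclose(H @ H, H)                  # H is idempotent
assert np.allclose((I - H) @ (I - H), I - H)  # so is I - H
assert np.allclose(H, H.T)                    # H is also symmetric
```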