MA 575, Linear Models : Homework 3

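Question 1

$$
\begin{aligned}
RSS(\hat\beta_0, \hat\beta_1) &= \sum_{i=1}^{n} (\hat y_i - y_i)^2 \\
&= \sum_{i=1}^{n} (\hat\beta_0 + \hat\beta_1 x_i - y_i)^2 \\
&= \sum_{i=1}^{n} \left(\bar y - \frac{SXY}{SXX}\,\bar x + \frac{SXY}{SXX}\,x_i - y_i\right)^2 \\
&= \sum_{i=1}^{n} \left((\bar y - y_i) + \frac{SXY}{SXX}\,(x_i - \bar x)\right)^2 \\
&= \sum_{i=1}^{n} (\bar y - y_i)^2 - 2\,\frac{SXY}{SXX} \sum_{i=1}^{n} (y_i - \bar y)(x_i - \bar x) + \left(\frac{SXY}{SXX}\right)^2 \sum_{i=1}^{n} (x_i - \bar x)^2 \\
&= SYY - 2\,\frac{SXY^2}{SXX} + \left(\frac{SXY}{SXX}\right)^2 SXX \\
&= SYY - \frac{SXY^2}{SXX}
\end{aligned}
$$

(Equations (A.9) in Appendix (A.3))

Problem 2.7

Question 2.7.1

(a) Let's derive the formula of the estimator $\hat\beta_1$.

$$RSS(\beta_1) = \sum_{i=1}^{n} (y_i - \beta_1 x_i)^2$$

$$\frac{\partial RSS(\beta_1)}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_1 x_i)$$

Setting this derivative to zero at $\beta_1 = \hat\beta_1$ gives

$$\sum_{i=1}^{n} x_i (y_i - \hat\beta_1 x_i) = 0$$

Therefore,

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$

(b) Let's show that the estimator $\hat\beta_1$ is unbiased.

$$E[\hat\beta_1] = E\left[\frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}\right] = \frac{\sum_{i=1}^{n} x_i E[y_i]}{\sum_{i=1}^{n} x_i^2} = \frac{\sum_{i=1}^{n} x_i\, \beta_1 x_i}{\sum_{i=1}^{n} x_i^2} = \beta_1$$

(c) Let's find the variance of our estimator $\hat\beta_1$.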

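$$Var(\hat\beta_1) = E[\hat\beta_1^2] - E[\hat\beta_1]^2$$

where

$$E[\hat\beta_1^2] = \frac{E\left[\left(\sum_{i=1}^{n} x_i y_i\right)^2\right]}{\left(\sum_{i=1}^{n} x_i^2\right)^2} = \frac{E\left[\sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j y_i y_j\right]}{\left(\sum_{i=1}^{n} x_i^2\right)^2}$$

By independence of the $\{y_i\}_{i=1..n}$, we obtain:

$$
\begin{aligned}
E[\hat\beta_1^2] &= \frac{\sum_{i=1}^{n} x_i^2 E[y_i^2] + \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} x_i x_j E[y_i] E[y_j]}{\left(\sum_{i=1}^{n} x_i^2\right)^2} \\
&= \frac{\sum_{i=1}^{n} x_i^2 \left(Var(y_i) + E[y_i]^2\right) + \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \beta_1^2 x_i^2 x_j^2}{\left(\sum_{i=1}^{n} x_i^2\right)^2} \\
&= \frac{\sum_{i=1}^{n} x_i^2 \left(\sigma^2 + \beta_1^2 x_i^2\right) + \beta_1^2 \sum_{i=1}^{n} x_i^2 \sum_{j=1, j \neq i}^{n} x_j^2}{\left(\sum_{i=1}^{n} x_i^2\right)^2} \\
&= \frac{\sigma^2 \sum_{i=1}^{n} x_i^2}{\left(\sum_{i=1}^{n} x_i^2\right)^2} + \beta_1^2 \\
&= \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2} + \beta_1^2
\end{aligned}
$$

Therefore,

$$Var(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2}$$

(d) Let's find an unbiased estimator of $\sigma^2$. Let $RSS_0 = RSS(\hat\beta_1)$, then:

$$
\begin{aligned}
RSS_0 &= \sum_{i=1}^{n} (\hat y_i - y_i)^2 = \sum_{i=1}^{n} (\hat\beta_1 x_i - y_i)^2 \\
&= \sum_{i=1}^{n} \left(\hat\beta_1^2 x_i^2 - 2 \hat\beta_1 x_i y_i + y_i^2\right) \\
&= \hat\beta_1 \left(\hat\beta_1 \sum_{i=1}^{n} x_i^2 - 2 \sum_{i=1}^{n} x_i y_i\right) + \sum_{i=1}^{n} y_i^2 \\
&= \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} x_i y_i\right)^2}{\sum_{i=1}^{n} x_i^2}
\end{aligned}
$$

Using $E[y_i^2] = \sigma^2 + \beta_1^2 x_i^2$ and the value of $E[\hat\beta_1^2]$ found in (c),

$$E[RSS_0] = \left(n \sigma^2 + \beta_1^2 \sum_{i=1}^{n} x_i^2\right) - \left(\sigma^2 + \beta_1^2 \sum_{i=1}^{n} x_i^2\right) = (n-1)\sigma^2,$$

therefore we choose:

$$\hat\sigma^2 = \frac{RSS_0}{n-1}$$

In our model, there are n observations and one parameter. Therefore, $\hat\sigma^2$ has n-1 df.

Question 2.7.2

(a) Let's derive the ANOVA table.

We want to test the hypothesis $H_0 : \beta_0 = 0$. In the table below, $RSS$ is the residual sum of squares of the model with an intercept and $\hat\sigma^2 = RSS/(n-2)$ is its residual mean square.

    Source       df     SS            MS                     F                        p-value
    Regression   1      RSS_0 - RSS   (RSS_0 - RSS)/1        (RSS_0 - RSS)/σ̂²
    Residuals    n-2    RSS           σ̂² = RSS/(n-2)
    Total        n-1    RSS_0

    Table 1: ANOVA table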
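The entries of Table 1 can be checked numerically. The sketch below is only an illustration on simulated data: the sample size, the true slope, the noise level and all variable names are assumptions, not part of the homework. It computes the F statistic by hand, takes the p-value as the upper tail of an F distribution with 1 and n-2 df, and compares both with the output of `anova()`.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(1)
n <- 30
x <- runif(n, 1, 10)
y <- 0.5 * x + rnorm(n, sd = 1.5)

# Model without intercept (H0) and model with intercept
m0 <- lm(y ~ x - 1)
m1 <- lm(y ~ x)

# Residual sums of squares of both fits
RSS0 <- sum(residuals(m0)^2)  # n - 1 df
RSS  <- sum(residuals(m1)^2)  # n - 2 df

# Closed form of RSS_0 from Question 2.7.1 (d)
RSS0_formula <- sum(y^2) - sum(x * y)^2 / sum(x^2)

# Entries of Table 1
sigma2_hat <- RSS / (n - 2)
F_stat     <- (RSS0 - RSS) / sigma2_hat
p_value    <- pf(F_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)

# Compare with R's built-in comparison of the two nested models
tab <- anova(m0, m1)
print(c(F_manual = F_stat, F_anova = tab$F[2]))
print(c(p_manual = p_value, p_anova = tab[["Pr(>F)"]][2]))
```

The manual and built-in values should agree up to floating-point error, since `anova()` applied to two nested `lm` fits uses the residual mean square of the larger model as denominator.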

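(b) Let's show that the F statistic is equal to the square of the t test statistic. To do so, we need to prove that $F$ and $t^2$ are equal, where:

$$F = \frac{RSS_0 - RSS}{\hat\sigma^2}$$

$$
\begin{aligned}
t^2 = \frac{\hat\beta_0^2}{\widehat{Var}(\hat\beta_0)}
&= \frac{(\bar y - \hat\beta_1 \bar x)^2}{\hat\sigma^2 \left(\frac{1}{n} + \frac{\bar x^2}{SXX}\right)} \\
&= \frac{\left(\bar y - \frac{SXY}{SXX}\,\bar x\right)^2}{\hat\sigma^2 \left(\frac{1}{n} + \frac{\bar x^2}{SXX}\right)} \\
&= \frac{n\,(SXX\,\bar y - \bar x\,SXY)^2}{\hat\sigma^2\, SXX\,(SXX + n \bar x^2)}
\end{aligned}
$$

Therefore, proving that $F = t^2$ is equivalent to proving that

$$RSS_0 - RSS = \frac{n\,(SXX\,\bar y - \bar x\,SXY)^2}{SXX\,(SXX + n \bar x^2)}$$

$$
\begin{aligned}
RSS_0 - RSS &= \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} x_i y_i\right)^2}{\sum_{i=1}^{n} x_i^2} - SYY + \frac{SXY^2}{SXX} \\
&= \sum_{i=1}^{n} (y_i - \bar y)^2 + n \bar y^2 - \frac{\left(\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y) + n \bar x \bar y\right)^2}{\sum_{i=1}^{n} (x_i - \bar x)^2 + n \bar x^2} - SYY + \frac{SXY^2}{SXX} \\
&= SYY + n \bar y^2 - \frac{(SXY + n \bar x \bar y)^2}{SXX + n \bar x^2} - SYY + \frac{SXY^2}{SXX}
\end{aligned}
$$

By putting all the terms over a single denominator and removing the terms that cancel one another, we find that:

$$
\begin{aligned}
RSS_0 - RSS &= \frac{n \left[(\bar x\,SXY)^2 - 2\,\bar x \bar y\,SXY\,SXX + (\bar y\,SXX)^2\right]}{SXX\,(SXX + n \bar x^2)} \\
&= \frac{n\,(\bar x\,SXY - \bar y\,SXX)^2}{SXX\,(SXX + n \bar x^2)}
\end{aligned}
$$

Question 2.7.3

(a) Let's fit a regression over the "snake" data.

    # Load ALR package
    library(alr3)
    # Attach the snake data file
    attach(snake)
    m0.lm <- lm(Y ~ X - 1, data = snake) # -1 removes the intercept
    summary(m0.lm)

(b) What are the values of $\hat\beta_1$ and $\hat\sigma^2$?

    summary(m0.lm)

    Call:
    lm(formula = Y ~ X - 1, data = snake)

    Residuals:
        Min      1Q  Median      3Q     Max
      -.407  -1.494 -0.1935  1.6515  3.0771

    Coefficients:
      Estimate Std. Error t value Pr(>|t|)
    X  0.52039    0.01318   39.48   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.7 on 16 degrees of freedom
    Multiple R-squared:  0.9898, Adjusted R-squared:  0.9892
    F-statistic:  1559 on 1 and 16 DF,  p-value: < 2.2e-16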

Therefore, $\hat\beta_1 = 0.52039$ and $\hat\sigma^2 = 1.7^2 = 2.89$.

(c) What is the 95% confidence interval of $\hat\beta_1$?

We know that $\frac{\hat\beta_1 - \beta_1}{\widehat{se}(\hat\beta_1)} \sim T(n-1)$, therefore:

$$I_{95\%} = \left[\hat\beta_1 - t_{1-\alpha/2}\, \widehat{se}(\hat\beta_1)\,;\; \hat\beta_1 + t_{1-\alpha/2}\, \widehat{se}(\hat\beta_1)\right]$$

where $t_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of a Student distribution with n-1 df.

    n <- dim(snake)[1]
    alpha <- 0.05
    z <- qt(1 - alpha/2, n - 1)
    beta_hat <- m0.lm$coefficients
    sbeta_hat <- summary(m0.lm)$coefficients[1, 2]
    Inter95 <- c(beta_hat - z*sbeta_hat, beta_hat + z*sbeta_hat)
    Inter95
           X        X
    0.492451 0.548337

(d) Let's test that the intercept is 0.

    # Fit a regression with a non-null intercept
    m1.lm <- lm(Y ~ X, data = snake)
    summary(m1.lm)
    # ANOVA
    anova(m0.lm, m1.lm)

    Analysis of Variance Table

    Model 1: Y ~ X - 1
    Model 2: Y ~ X
      Res.Df    RSS Df Sum of Sq      F Pr(>F)
    1     16 46.226
    2     15 45.560  1    0.6663 0.2193 0.6463

The p-value is equal to 0.6463. The ANOVA does not reject the hypothesis that the intercept is null.

Question 2.7.4

    plot(m0.lm, which = 1)

The model seems adequate since the residuals are approximately centered around 0 (evenly spread on both sides of the x-axis) and appear independent from one another (the scatter plot has a roughly round shape, with no visible pattern).

Problem 2.10

Question 2.10.1

Because the questions in problem 2.10 and problem 4 of the homework are fairly similar (only the maximum rank to consider and, for problem 4, the data change), writing an R function seems appropriate so that we do not waste time. The following code is suggested:

    run_lm <- function(X, Y, b1) {
      n <- length(X)
      n_char <- as.character(n)
      # 2.10.1
      # Run a linear regression using Zipf's formula
      logmod.lm <- lm(log(Y) ~ log(X))
      # Plot and save the predicted values vs the actual values
      filename <- paste("predvsrank", n_char, ".png", sep = "")
      png(filename = filename)
      # The end of the title was cut off in the source; it is completed here
      # by analogy with the Q-Q plot title below
      plot(log(X), log(Y), xlab = "log rank", ylab = "log frequency",
           main = paste("log frequency vs Log rank, n = ", n_char, sep = ""))
      lines(log(X), logmod.lm$fitted.values)
      dev.off()
      # Plot the residuals QQ plot
      filename <- paste("resqqplot", n_char, ".png", sep = "")
      png(filename = filename)
      plot(logmod.lm, which = 2,
           main = paste("normal Q-Q plot, n = ", n_char, sep = ""))
      dev.off()
      # Output the results of the linear regression
      print(summary(logmod.lm))
      # 2.10.2 Compute the t-test for the slope (only when b1 is supplied)
      if (nargs() > 2) {
        b1_hat <- logmod.lm$coefficients[[2]]
        se_b1 <- summary(logmod.lm)$coefficients[2, 2]
        # Tests H0: slope = -b1, hence the "+"
        t_stat <- (b1_hat + b1)/se_b1
        p_val <- 2*(1 - pt(abs(t_stat), n - 2))
        print(paste("p-val = ", as.character(p_val), sep = ""))
      }
    }

where Y is a vector containing the frequencies considered for the linear regression and X is the vector containing the ranks associated with the frequencies in Y. Finally, b1 is the value that one might want to test for the slope of the regression: the function tests the hypothesis that the slope equals -b1. This argument can be left empty, in which case no test is performed.

To answer questions 2.10.1 and 2.10.2, run the following code:

    attach(MWwords)
    # Select the set corresponding to the n-th highest rank
    n <- 50
    set <- MWwords$HamiltonRank <= n
    Y <- Hamilton[set]
    X <- HamiltonRank[set]
    run_lm(X, Y, 1)

We obtain the following graphs for the normal Q-Q plot of the residuals and the estimated mean function. Zipf's law seems to model the frequencies fairly well as a function of the word's rank. Below are the numerical results of the linear regression.

    Call:
    lm(formula = log(Y) ~ log(X))

    Residuals:
         Min        1Q    Median        3Q       Max
    -0.57413  -0.05088 -0.001563  0.043448   0.18868

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  4.7714     0.03948  120.84   <2e-16 ***
    log(X)      -1.00764    0.01275  -79.04   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.07934 on 48 degrees of freedom
    Multiple R-squared:  0.9924, Adjusted R-squared:  0.9922
    F-statistic:  6248 on 1 and 48 DF,  p-value: < 2.2e-16

Question 2.10.2

Let $H_0 : b = -1$. (In the call to the function, b1 = 1 and not -1, to fit the R linear model.)

$$\text{p-value} = P_{H_0}(|T| \geq |t|) = 2\left(1 - P_{H_0}(T \leq |t|)\right) = 2\left[1 - P\left(T \leq \frac{|\hat b + 1|}{\widehat{se}(\hat b)}\right)\right]$$

where $T \sim \tau(n-2)$.

This p-value is computed in the function presented above and we obtain:

    [1] "p-val = 0.551839164089"

The t test does not reject the hypothesis $H_0$.

Question 2.10.3

For this question, use the R code below:

    for (n in c(75, 100)) {
      set <- MWwords$HamiltonRank <= n
      Y <- Hamilton[set]
      X <- HamiltonRank[set]
      run_lm(X, Y)
    }

The model seems to work for n = 75, but the low frequencies are not well predicted in the case n = 100. Note: in the case n = 100, the data contain more than a hundred values because 3 words are ranked in the 100th position.

Question 4

To answer this question, run the R code below.

    load("simpsonswordfreq.rdata")
    attach(simpsons.wordfreq)
    for (n in c(2000, 3000)) {
      Y <- Frequency[1:n]
      X <- Rank[1:n]
      run_lm(X, Y)
    }

We obtain:

It seems that, in this case, Zipf's model is appropriate for the smallest frequencies.