MA 575, Linear Models : Homework 3

Alexina Webster
6 years ago

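Question 1

Substituting the least-squares estimates $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$ and $\hat\beta_1 = SXY/SXX$ (Equations (A.9) in Appendix (A.3)):

$$
\begin{aligned}
RSS(\hat\beta_0, \hat\beta_1) &= \sum_{i=1}^{n} (\hat y_i - y_i)^2 \\
&= \sum_{i=1}^{n} (\hat\beta_0 + \hat\beta_1 x_i - y_i)^2 \\
&= \sum_{i=1}^{n} \left(\bar y - \frac{SXY}{SXX}\,\bar x + \frac{SXY}{SXX}\,x_i - y_i\right)^2 \\
&= \sum_{i=1}^{n} \left((\bar y - y_i) + \frac{SXY}{SXX}(x_i - \bar x)\right)^2 \\
&= \sum_{i=1}^{n} (\bar y - y_i)^2 - 2\,\frac{SXY}{SXX} \sum_{i=1}^{n} (y_i - \bar y)(x_i - \bar x) + \left(\frac{SXY}{SXX}\right)^2 \sum_{i=1}^{n} (x_i - \bar x)^2 \\
&= SYY - 2\,\frac{SXY^2}{SXX} + \left(\frac{SXY}{SXX}\right)^2 SXX \\
&= SYY - \frac{SXY^2}{SXX}
\end{aligned}
$$

Problem 2.7

Question 2.7.1

(a) Let's derive the formula of the estimator $\hat\beta_1$.

$$
RSS(\beta_1) = \sum_{i=1}^{n} (y_i - \beta_1 x_i)^2,
\qquad
\frac{\partial RSS(\beta_1)}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_1 x_i).
$$

Setting the derivative to zero at $\hat\beta_1$ gives

$$
\sum_{i=1}^{n} x_i (y_i - \hat\beta_1 x_i) = 0.
$$

Therefore,

$$
\hat\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}.
$$

(b) Let's show that the estimator $\hat\beta_1$ is unbiased.

$$
E[\hat\beta_1] = E\left[\frac{\sum_{i} x_i y_i}{\sum_{i} x_i^2}\right]
= \frac{\sum_{i} x_i E[y_i]}{\sum_{i} x_i^2}
= \frac{\sum_{i} x_i \, \beta_1 x_i}{\sum_{i} x_i^2}
= \beta_1
$$

(c) Let's find the variance of our estimator $\hat\beta_1$.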

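$$
Var(\hat\beta_1) = E[\hat\beta_1^2] - E[\hat\beta_1]^2,
$$

where

$$
E[\hat\beta_1^2] = E\left[\frac{\left(\sum_{i} x_i y_i\right)^2}{\left(\sum_{i} x_i^2\right)^2}\right]
= \frac{E\left[\sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j y_i y_j\right]}{\left(\sum_{i} x_i^2\right)^2}.
$$

By independence of the $\{y_i\}$, we obtain:

$$
\begin{aligned}
E[\hat\beta_1^2]
&= \frac{\sum_{i} x_i^2 E[y_i^2] + \sum_{i}\sum_{j \neq i} x_i x_j E[y_i] E[y_j]}{\left(\sum_{i} x_i^2\right)^2} \\
&= \frac{\sum_{i} x_i^2 \left(Var(y_i) + E[y_i]^2\right) + \beta_1^2 \sum_{i}\sum_{j \neq i} x_i^2 x_j^2}{\left(\sum_{i} x_i^2\right)^2} \\
&= \frac{\sum_{i} x_i^2 \left(\sigma^2 + \beta_1^2 x_i^2\right) + \beta_1^2 \sum_{i} x_i^2 \sum_{j \neq i} x_j^2}{\left(\sum_{i} x_i^2\right)^2} \\
&= \frac{\sigma^2 \sum_{i} x_i^2 + \beta_1^2 \left(\sum_{i} x_i^2\right)^2}{\left(\sum_{i} x_i^2\right)^2} \\
&= \frac{\sigma^2}{\sum_{i} x_i^2} + \beta_1^2
\end{aligned}
$$

Therefore,

$$
Var(\hat\beta_1) = \frac{\sigma^2}{\sum_{i} x_i^2} + \beta_1^2 - \beta_1^2 = \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2}.
$$

(d) Let's find an unbiased estimator of $\sigma^2$. Let $RSS_0 = RSS(\hat\beta_1)$, then:

$$
\begin{aligned}
RSS_0 &= \sum_{i} (\hat y_i - y_i)^2 = \sum_{i} (\hat\beta_1 x_i - y_i)^2 \\
&= \sum_{i} \left(\hat\beta_1^2 x_i^2 - 2 \hat\beta_1 x_i y_i + y_i^2\right) \\
&= \hat\beta_1 \left(\hat\beta_1 \sum_{i} x_i^2 - 2 \sum_{i} x_i y_i\right) + \sum_{i} y_i^2 \\
&= \sum_{i} y_i^2 - \frac{\left(\sum_{i} x_i y_i\right)^2}{\sum_{i} x_i^2}
\end{aligned}
$$

Taking expectations, $E\left[\sum_i y_i^2\right] = n\sigma^2 + \beta_1^2 \sum_i x_i^2$ and $E\left[\left(\sum_i x_i y_i\right)^2\right] / \sum_i x_i^2 = \left(\sum_i x_i^2\right) E[\hat\beta_1^2] = \sigma^2 + \beta_1^2 \sum_i x_i^2$, so $E[RSS_0] = (n - 1)\sigma^2$. Therefore we choose:

$$
\hat\sigma^2 = \frac{RSS_0}{n - 1}
$$

In our model, there are $n$ observations and one parameter. Therefore, $\hat\sigma^2$ has $n - 1$ df.

Question 2.7.2

(a) Let's derive the ANOVA table.

We want to test $H_0 : \beta_0 = 0$, that is, the model through the origin against the model with an intercept. Here $RSS$ is the residual sum of squares of the model with an intercept, and $\hat\sigma^2$ denotes its residual mean square.

    Source       df      SS             MS                           F                             p-value
    Regression   1       RSS_0 - RSS    (RSS_0 - RSS)/1              (RSS_0 - RSS)/sigma_hat^2     P(F(1, n-2) > F)
    Residuals    n - 2   RSS            sigma_hat^2 = RSS/(n - 2)
    Total        n - 1   RSS_0

    Table 1: ANOVA table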
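The three results of Question 2.7.1 (unbiasedness of $\hat\beta_1$, its variance $\sigma^2 / \sum x_i^2$, and $E[RSS_0] = (n-1)\sigma^2$) can be checked numerically. The following is a minimal simulation sketch and not part of the original solution: the design points `x`, the true slope `beta1`, the error standard deviation `sigma`, the normal errors and the number of replications `n_sim` are all arbitrary assumptions chosen for illustration.

```r
# Monte Carlo check of the through-the-origin results.
# All numerical values below are illustrative assumptions.
set.seed(575)
n     <- 20
x     <- seq(1, 10, length.out = n)  # fixed design points (assumed)
beta1 <- 2                           # true slope (assumed)
sigma <- 1.5                         # true error standard deviation (assumed)
n_sim <- 20000                       # number of simulated data sets

# Simulate one data set and return the slope estimate and RSS_0 / (n - 1)
sim_once <- function() {
  y      <- beta1 * x + rnorm(n, sd = sigma)
  b1_hat <- sum(x * y) / sum(x^2)
  rss0   <- sum((y - b1_hat * x)^2)
  c(b1_hat = b1_hat, sigma2_hat = rss0 / (n - 1))
}

sims <- replicate(n_sim, sim_once())  # 2 x n_sim matrix with named rows

results <- c(
  mean_b1_hat = mean(sims["b1_hat", ]),      # should be close to beta1
  var_b1_hat  = var(sims["b1_hat", ]),       # should be close to theory_var
  theory_var  = sigma^2 / sum(x^2),
  mean_sigma2 = mean(sims["sigma2_hat", ]),  # should be close to true_sigma2
  true_sigma2 = sigma^2
)
print(round(results, 4))
```

The simulated averages should agree with the theoretical values up to Monte Carlo error; increasing `n_sim` tightens the agreement.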

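(b) Let's show that the F statistic is equal to the square of the t test statistic. To do so, we need to prove that $F$ and $t^2$ are equal, where:

$$
F = \frac{RSS_0 - RSS}{\hat\sigma^2}
$$

$$
t^2 = \frac{\hat\beta_0^2}{\widehat{Var}(\hat\beta_0)}
= \frac{(\bar y - \hat\beta_1 \bar x)^2}{\hat\sigma^2 \left(\frac{1}{n} + \frac{\bar x^2}{SXX}\right)}
= \frac{\left(\bar y - \frac{SXY}{SXX}\,\bar x\right)^2}{\hat\sigma^2 \left(\frac{1}{n} + \frac{\bar x^2}{SXX}\right)}
= \frac{n\,(SXX\,\bar y - \bar x\,SXY)^2}{\hat\sigma^2\, SXX\,(SXX + n \bar x^2)}
$$

Therefore, proving that $F = t^2$ is equivalent to proving that

$$
RSS_0 - RSS = \frac{n\,(SXX\,\bar y - \bar x\,SXY)^2}{SXX\,(SXX + n \bar x^2)}.
$$

$$
\begin{aligned}
RSS_0 - RSS
&= \sum_{i} y_i^2 - \frac{\left(\sum_{i} x_i y_i\right)^2}{\sum_{i} x_i^2} - SYY + \frac{SXY^2}{SXX} \\
&= \sum_{i} (y_i - \bar y)^2 + n \bar y^2 - \frac{\left(\sum_{i} (x_i - \bar x)(y_i - \bar y) + n \bar x \bar y\right)^2}{\sum_{i} (x_i - \bar x)^2 + n \bar x^2} - SYY + \frac{SXY^2}{SXX} \\
&= SYY + n \bar y^2 - \frac{(SXY + n \bar x \bar y)^2}{SXX + n \bar x^2} - SYY + \frac{SXY^2}{SXX}
\end{aligned}
$$

By putting all the terms over a single denominator and then removing the terms that cancel one another, we find that:

$$
RSS_0 - RSS
= \frac{n \left[(\bar x\,SXY)^2 - 2\,\bar x \bar y\,SXY\,SXX + (\bar y\,SXX)^2\right]}{SXX\,(SXX + n \bar x^2)}
= \frac{n\,(\bar x\,SXY - \bar y\,SXX)^2}{SXX\,(SXX + n \bar x^2)}
$$

Question 2.7.3

(a) Let's fit a regression over the "snake" data.

    # Load ALR package
    library(alr3)
    # Attach the snake data file
    attach(snake)
    m0.lm <- lm(Y ~ X - 1, data = snake)  # -1 removes the intercept
    summary(m0.lm)

(b) What are the values of $\hat\beta_1$ and $\hat\sigma^2$?

    summary(m0.lm)

    Call:
    lm(formula = Y ~ X - 1, data = snake)

    Residuals:
        Min      1Q  Median      3Q     Max

    Coefficients:
      Estimate Std. Error t value Pr(>|t|)
    X                               <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.7 on 16 degrees of freedom
    Multiple R-squared:  ,	Adjusted R-squared:
    F-statistic: 1559 on 1 and 16 DF,  p-value: < 2.2e-16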

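Therefore, $\hat\beta_1$ is the estimate reported for `X` in the summary above and $\hat\sigma$ is the reported residual standard error (1.7 on 16 degrees of freedom).

(c) What is the 95% confidence interval of $\hat\beta_1$?

We know that

$$
\frac{\hat\beta_1 - \beta_1}{\widehat{se}(\hat\beta_1)} \sim T(n - 1),
$$

therefore:

$$
I_{95\%} = \left[\hat\beta_1 - t_{1 - \alpha/2}\,\widehat{se}(\hat\beta_1)\,;\ \hat\beta_1 + t_{1 - \alpha/2}\,\widehat{se}(\hat\beta_1)\right]
$$

where $t_{1 - \alpha/2}$ is the $1 - \alpha/2$ quantile of a Student distribution with $n - 1$ df.

    n <- dim(snake)[1]
    alpha <- 0.05
    z <- qt(1 - alpha/2, n - 1)                        # Student quantile with n - 1 df
    beta_hat <- m0.lm$coefficients
    sbeta_hat <- summary(m0.lm)$coefficients[1, 2]     # standard error of the slope
    Inter95 <- c(beta_hat - z*sbeta_hat, beta_hat + z*sbeta_hat)
    Inter95
    X X

(d) Let's test that the intercept is 0.

    # Fit a regression with a non-null intercept
    m1.lm <- lm(Y ~ X, data = snake)
    summary(m1.lm)
    # ANOVA
    anova(m0.lm, m1.lm)

    Analysis of Variance Table

    Model 1: Y ~ X - 1
    Model 2: Y ~ X
      Res.Df RSS Df Sum of Sq F Pr(>F)

Based on the p-value in the Pr(>F) column, the ANOVA does not reject the hypothesis that the intercept is zero.

Question 2.7.4

    plot(m0.lm, which = 1)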

The model seems adequate since the residuals are approximately centered around 0 (evenly spread on both sides of the x-axis) and show no pattern suggesting dependence on one another (the scatter plot has a roughly round shape).

Problem 2.10

Question 2.10.1

Because the questions in Problem 2.10 and Problem 4 of the homework are fairly similar (only the maximum rank to consider and, for Problem 4, the data change), writing an R function seems appropriate so that we do not waste time. The following code is suggested:

    run_lm <- function(X, Y, b1) {
      n <- length(X)
      n_char <- as.character(n)

      # 2.10.1
      # Run a linear regression using Zipf's formula
      logmod.lm <- lm(log(Y) ~ log(X))

      # Plot and save the predicted values vs the actual values
      filename <- paste("PredvsRank", n_char, ".png", sep = "")
      png(filename = filename)
      # The end of the title string is cut off in the source; appending n_char
      # is an assumption made by analogy with the Q-Q plot title below.
      plot(log(X), log(Y), xlab = "log rank", ylab = "log frequency",
           main = paste("log frequency vs Log rank, ", n_char, sep = ""))
      lines(log(X), logmod.lm$fitted.values)
      dev.off()

      # Plot the residuals QQ plot
      filename <- paste("ResQQplot", n_char, ".png", sep = "")
      png(filename = filename)
      plot(logmod.lm, which = 2, main = paste("Normal Q-Q plot, ", n_char, sep = ""))
      dev.off()

      # Output the results of the linear regression
      print(summary(logmod.lm))

      # 2.10.2 Compute the t-test for the slope (only when b1 is supplied)
      if (nargs() > 2) {
        b1_hat <- logmod.lm$coefficients[[2]]
        se_b1 <- summary(logmod.lm)$coefficients[2, 2]
        t <- (b1_hat + b1)/se_b1
        p_val <- 2*(1 - pt(abs(t), n - 2))
        # Printed inside the if block so that calls without b1 do not fail
        print(paste("p-val ", as.character(p_val), sep = ""))
      }
    }

where Y is a vector containing the frequencies considered for the linear regression and X is the vector containing the ranks associated with the frequencies in Y. Finally, b1 is the value that one might want to test for the slope of the regression. Note that this argument can be left empty, in which case no test is computed.

To answer questions 2.10.1 and 2.10.2, run the following code:

    attach(MWwords)
    # Select the set corresponding to the n-th highest rank
    n <- 50
    set <- MWwords$HamiltonRank <= n
    Y <- Hamilton[set]

    X <- HamiltonRank[set]
    run_lm(X, Y, 1)

We obtain the following graphs for the Normal Q-Q plot of the residuals and the estimated mean function.

Zipf's law seems to model fairly well the frequencies as a function of the word's rank. Below are the numerical results of the linear regression.

    Call:
    lm(formula = log(Y) ~ log(X))

    Residuals:
        Min      1Q  Median      3Q     Max

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)                               <2e-16 ***
    log(X)                                    <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error:  on 48 degrees of freedom
    Multiple R-squared: 0.994,	Adjusted R-squared: 0.99
    F-statistic: 648 on 1 and 48 DF,  p-value: < 2.2e-16

Question 2.10.2

Let $H_0 : b = 1$ ($b = 1$ and not $-1$, to fit the R linear model: with $\hat b$ denoting the slope fitted by R for log(X), the test statistic is $t = (\hat b + 1)/\widehat{se}(\hat b)$).

$$
P_{H_0}(|T| \ge |t|) = 2\left(1 - P_{H_0}(T \le |t|)\right)
= 2\left[1 - P\left(T \le \frac{|\hat b + 1|}{\widehat{se}(\hat b)}\right)\right]
$$

where $T \sim \tau(n - 2)$, the Student distribution with $n - 2$ df.

This p-value is computed in the function presented above and we obtain:

    [1] "p-val "

The t test does not reject the hypothesis $H_0$.

Question 2.10.3

For this question, use the R code below:

    for (n in c(75, 100)) {
      set <- MWwords$HamiltonRank <= n
      Y <- Hamilton[set]
      X <- HamiltonRank[set]
      run_lm(X, Y)
    }

The model seems to work for n = 75, but the predicted values of the low frequencies are not well modeled in the case n = 100.

Note: in the case n = 100, the data contain more than a hundred values because 3 words are ranked in the 100th position.

Question 4

To answer this question, run the R code below.

    load("simpsonswordfreq.rdata")
    attach(simpsons.wordfreq)
    for (n in c(2000, 3000)) {
      Y <- Frequency[1:n]
      X <- Rank[1:n]
      run_lm(X, Y)
    }

We obtain:

In this case, Zipf's model seems appropriate for the smallest frequencies.
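The slope test of Question 2.10.2 can also be reproduced on its own, outside the plotting function. The sketch below is an illustration under stated assumptions, not the original analysis: it uses synthetic Zipf-like data (the values of `n`, the intercept `log(1000)`, the exponent `b` and the noise level are all made up) instead of the MWwords or Simpsons data, and the names `word_rank` and `freq` are hypothetical. It computes the same t statistic for $H_0$: slope $= -1$ and compares the decision with the 95% confidence interval returned by `confint`.

```r
# Synthetic Zipf-like data; every numerical value here is an assumption.
set.seed(1)
n         <- 50
word_rank <- 1:n
b         <- 1   # Zipf exponent used to simulate the data
freq      <- exp(log(1000) - b * log(word_rank) + rnorm(n, sd = 0.1))

# Fit log frequency against log rank, as in Problem 2.10
fit   <- lm(log(freq) ~ log(word_rank))
coefs <- summary(fit)$coefficients

# Two-sided t test of H0: slope = -1 (equivalently b = 1)
slope_hat <- coefs[2, "Estimate"]
se_slope  <- coefs[2, "Std. Error"]
t_stat    <- (slope_hat + 1) / se_slope
p_val     <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# 95% confidence interval for the slope, for comparison with the test
ci <- confint(fit, "log(word_rank)", level = 0.95)

print(c(slope = slope_hat, t = t_stat, p_value = p_val))
print(ci)
```

The test at the 5% level rejects $H_0$ exactly when $-1$ falls outside the 95% interval, so the two outputs always lead to the same conclusion. Using `lower.tail = FALSE` avoids the loss of precision of `1 - pt(...)` when the p-value is very small.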
