STK 2100 Oblig 1. Zhou Siyu. February 15, PDF Free Download

STK 200 Oblig Zhou Siyu February 5, 207 Question a) Make a scatter box plot for the data set. Answer:Here is the code I used to plot the scatter box in R. library ( MASS ) 2 pairs ( Boston ) Figure : Scatter box of Boston data set.

Here we get a general idea of the interactions between the response and each of the predictor variables as well as interactions between any two predictor variables. b) Divide the data set into training and test data sets. Discuss the advantages and shortcomings in doing so. Answer: The code is R is the as the question suggests: #b) 2 set. seed (345) 3 ind <- sample (: nrow ( Boston ), 250, replace = FALSE ) 4 Boston. train <- Boston [ ind,] 5 Boston. test <- Boston [-ind,] In splitting the data set, we have ended up with fewer observations to train the model, which is obviously a shortcoming in terms of model accuracy and efficiency. On the other hand, we can also gain information concerning the model s prediction efficiency by testing it with the test data (for prediction). It is with the test data we can see how much the model deviates from the real value and so as to evaluate model efficiency and accuracy. c) Explain the important assumptions about the linear model. Use crim as the response variable, adjust the model to the training data and discuss about the result. Answer: For linear regression models, there are some very important conditions that should be satisfied. First and foremost, ε i should be independent from each other. This is arguably the most important assumption for the model, otherwise we would see some correlation in the error terms in the response variable. Another condition is that we expect the expectation of ε i to be 0, namely, E[ε i ] = 0 while its variance to be σ 2, V [ε i ] = σ 2. This can also be a strong condition. Last but not least, we usually require ε i to be normally distributed. To fit the model with the data set, we run the following code in R. #c) 2 > fit. lim = lm( crim ~., data = Boston. train ) 3 > summary ( fit. lim ) 4 5 Call : 6 lm( formula = crim ~., data = Boston. train ) 7 8 Residuals : 9 Min Q Median 3Q Max 0-0.349-2.724-0.548.283 70.984 2 Coefficients : 3 Estimate Std. Error t value Pr ( > t ) 4 ( Intercept ) 9.0777 2.482380.53 0.2763 5 zn 0.053969 0.03030.739 0.08329. 6 indus -0.0726 0.298-0.556 0.579048 7 chas 0.897 2.265042 0.083 0.933530 8 nox -7.033460 9.80255 -.738 0.083576. 2

9 rm.43702.02735.406 0.60908 20 age -0.002842 0.030598-0.093 0.926089 2 dis -.405948 0.4667-3.045 0.002588 ** 22 rad 0.692667 0.45079 4.774 3.6e -06 *** 23 tax -0.003325 0.007855-0.423 0.672424 24 ptratio -0.384094 0.338328 -.35 0.25743 25 black -0.002365 0.005733-0.42 0.680358 26 lstat 0.0003 0.36722 0.739 0.460792 27 medv -0.345347 0.099422-3.474 0.0006 *** 28 --- 29 Signif. codes : 0 *** 0.00 ** 0.0 * 0.05. 0. 30 3 Residual standard error : 7.527 on 236 degrees of freedom 32 Multiple R- squared : 0.45, Adjusted R- squared : 0.4209 33 F- statistic : 4.92 on 3 and 236 DF, p- value : < 2.2e -6 Here we see the programme generated a linear regression model based on the training data set. But the problem with this model is that many coefficients involved have very big P-value, which suggests they might not belong in the model. In the meantime, the R 2 as well as adjusted R 2 are too small to convince us this is a reliable model. d) Remove the predictor variable with the biggest P-value and find a new model. Explain why this is a reasonable procedure. Explain the new p-values in terms of correlation between predictor variables. Answer: We can sort out the data frame with the following code in R. And then we can see which variable has the largest P-value. > newmodel = summary ( fit. lim ) $ coefficients 2 > newmodel = newmodel [ order (- newmodel [,"Pr ( > t )"]),] 3 > newmodel 4 Estimate Std. Error t value Pr ( > t ) 5 chas 0.896686 2.26504577 0.08349369 9.33e -0 6 age -0.00284524 0.030598362-0.09286523 9.26e -0 7 black -0.00236463 0.005732604-0.4248503 6.80e -0 8 tax -0.003325232 0.007854534-0.4233587 6.72e -0 9 indus -0.0726048 0.29873-0.55554577 5.79e -0 0 lstat 0.0003382 0.367260 0.7387524 4.60e -0 ptratio -0.38409447 0.338327662 -.3527267 2.57e -0 2 rm.4370236.02734994.40644328.60e -0 3 ( Intercept ) 9.077702 2.482380240.53077944.27e -0 4 nox -7.033460057 9.802550572 -.73765592 8.35e -02 5 zn 0.053968920 0.03029640.73926996 8.32e -02 6 dis -.405947877 0.4667438-3.04534299 2.58e -03 7 medv -0.345347229 0.09942607-3.4735635 6.0e -04 8 rad 0.692667246 0.45079089 4.774434 3.6e -06 9 > I decide to use the update function to remove the predictor variable with the largest P-value, namely, chas, and then run for a new model with the following code: 3

> fit. lim = update ( fit.lim, ~.- chas ) 2 > summary ( fit. lim ) 3 4 Call : 5 lm( formula = crim ~ zn + indus + nox + rm + age + dis + rad + 6 tax + ptratio + black + lstat + medv, data = Boston. train ) 7 8 Residuals : 9 Min Q Median 3Q Max 0-0.359-2.687-0.539.268 70.998 2 Coefficients : 3 Estimate Std. Error t value Pr ( > t ) 4 ( Intercept ) 9.586 2.44634.540 0.24933 5 zn 0.053855 0.030935.74 0.082993. 6 indus -0.07356 0.29220-0.552 0.58329 7 nox -7.059568 9.77704 -.745 0.082305. 8 rm.428362.04337.408 0.60390 9 age -0.002650 0.030449-0.087 0.9307 20 dis -.406242 0.460690-3.052 0.002528 ** 2 rad 0.692936 0.44739 4.787 2.97e -06 *** 22 tax -0.003350 0.007833-0.428 0.669278 23 ptratio -0.385242 0.337339 -.42 0.254606 24 black -0.002387 0.00575-0.48 0.676587 25 lstat 0.033 0.36385 0.743 0.4583 26 medv -0.343490 0.096698-3.552 0.00046 *** 27 --- 28 Signif. codes : 0 *** 0.00 ** 0.0 * 0.05. 0. 29 Residual standard error : 7.52 on 237 degrees of freedom 30 Multiple R- squared : 0.45, Adjusted R- squared : 0.4233 3 F- statistic : 6.23 on 2 and 237 DF, p- value : < 2.2e -6 32 33 > newmodel = summary ( fit. lim ) $ coefficients 34 > newmodel = newmodel [ order (- newmodel [,"Pr ( > t )"]),] 35 > 36 > 37 > newmodel 38 Estimate Std. Error t value Pr ( > t ) 39 age -0.002650348 0.03044858-0.0870434 9.30e -0 40 black -0.002386629 0.00574527-0.4764237 6.76e -0 4 tax -0.003349774 0.00783257-0.42767226 6.69e -0 42 indus -0.07355794 0.2929860-0.55220455 5.8e -0 43 lstat 0.032742 0.36384769 0.74284499 4.58e -0 44 ptratio -0.38524894 0.337339327 -.42003 2.54e -0 45 rm.42836244.04337456.4087283.60e -0 46 ( Intercept ) 9.5860840 2.44634458.53984277.24e -0 47 zn 0.053855329 0.030934788.74093092 8.29e -02 48 nox -7.059568385 9.77704403 -.7448648 8.23e -02 49 dis -.40624249 0.460689772-3.0524754 2.52e -03 50 medv -0.3434900 0.0966984-3.5528935 4.60e -04 5 rad 0.692936447 0.44739073 4.7874872 2.97e -06 Compare the two results and we will see that in removing chas, we have actually obtained a better model, in that the adjusted R 2 has slightly increased 4

while R 2 remains unchanged. This means that the new model explains the data set better than the previous one. In the meantime, we see that P-values and standard errors of the remaining predictor variables do not change much after we have removed chas, suggesting that these predictor variables are mutually independent. If there is any predictor variable collinear with chas, we are likely to see dramatic increase in the standard error for its least square estimation. e) Keep improving the model in this way until you get a reasonable model. Make different plots to show that this selection is reasonable. Answer: We can, of course, remove each variable by hand, but the following code in R will make it easier. > y. name <-" crim " 2 > alpha <- 0.05 3 > fit. lim = lm( crim ~., data = Boston. train ) 4 > beta <-max ( max ( summary ( fit. lim )$ coefficients [,4]),alpha ) 5 > tokeep <- which ( summary ( fit. lim )$ coefficients [,4] < beta ) 6 > print ( length ( tokeep )) 7 [] 3 8 > while (beta > alpha ) 9 + { 0 + if( length ( tokeep ) ==0) + { 2 + warning (" Nothing is significant ") 3 + break 4 + } 5 + if( names ( tokeep ) []== "( Intercept )") 6 + { 7 + names ( tokeep ) [] <-"" 8 + } else 9 + { 20 + names ( tokeep ) [] <-" -" 2 + } 22 + 23 + form <-as. formula ( paste (y.name,"~",paste ( names ( tokeep ), collapse = "+"))) 24 + fit. lim = lm( formula = form, data = Boston. train ) 25 + beta <-max ( max ( summary ( fit. lim )$ coefficients [,4]),alpha ) 26 + newmodel = summary ( fit. lim ) $ coefficients 27 + tokeep <- which ( summary ( fit. lim )$ coefficients [,4] < beta ) 28 + if( length ( tokeep ) ==) 29 + { 30 + names ( tokeep ) <-row. names ( summary ( fit. lim )$ coefficients ) [] 3 + } 32 + print ( names ( tokeep )) 33 + print ( length ( tokeep )) 34 + 35 + } 36 [] "( Intercept )" "zn" " indus " 37 [4] " nox " "rm" " dis " 38 [7] " rad " " tax " " ptratio " 39 [0] " black " " lstat " " medv " 40 [] 2 4 [] "( Intercept )" "zn" " indus " 5

42 [4] " nox " "rm" " dis " 43 [7] " rad " " tax " " ptratio " 44 [0] " lstat " " medv " 45 [] 46 [] "( Intercept )" "zn" " indus " 47 [4] " nox " "rm" " dis " 48 [7] " rad " " ptratio " " lstat " 49 [0] " medv " 50 [] 0 5 [] "( Intercept )" "zn" " nox " 52 [4] "rm" " dis " " rad " 53 [7] " ptratio " " lstat " " medv " 54 [] 9 55 [] "( Intercept )" "zn" " nox " 56 [4] "rm" " dis " " rad " 57 [7] " ptratio " " medv " 58 [] 8 59 [] "( Intercept )" "zn" " nox " 60 [4] " dis " " rad " " ptratio " 6 [7] " medv " 62 [] 7 63 [] "( Intercept )" "zn" " nox " 64 [4] " dis " " rad " " medv " 65 [] 6 66 [] "( Intercept )" "zn" " dis " 67 [4] " rad " " medv " 68 [] 5 69 [] "( Intercept )" "zn" " dis " 70 [4] " rad " " medv " 7 [] 5 72 > summary ( fit. lim ) 73 74 Call : 75 lm( formula = form, data = Boston. train ) 76 77 Residuals : 78 Min Q Median 3Q Max 79-9.527-2.549-0.352 0.955 72.879 80 8 Coefficients : 82 Estimate Std. Error t value Pr ( > t ) 83 ( Intercept ) 6.84678 2.5584 3.76 0.0069 ** 84 zn 0.06920 0.02863 2.47 0.0638 * 85 dis -0.86867 0.3355-2.592 0.002 * 86 rad 0.54587 0.06466 8.442 2.77e -5 *** 87 medv -0.259 0.05723-4.402.60e -05 *** 88 --- 89 Signif. codes : 90 0 *** 0.00 ** 0.0 * 0.05. 0. 9 92 Residual standard error : 7.533 on 245 degrees of freedom 93 Multiple R- squared : 0.4293, Adjusted R- squared : 0.42 94 F- statistic : 46.08 on 4 and 245 DF, p- value : < 2.2e -6 We can also run the backward model selection with the following code in R. 6

> library ( leaps ) 2 >fit. backward = regsubsets ( crim ~., data = Boston. train, nvmax =3, method =" backward ") 3 > summary. backward = summary ( fit. backward ) 4 > summary. backward 5 6 Selection Algorithm : backward 7 zn indus chas nox rm age dis rad tax ptra black lstat medv 8 " " " " " " " " " " " " " " "*" " " " " " " " " " " 9 " " " " " " " " " " " " " " "*" " " " " " " " " "*" 0 " " " " " " " " " " " " "*" "*" " " " " " " " " "*" "*" " " " " " " " " " " "*" "*" " " " " " " " " "*" 2 "*" " " " " "*" " " " " "*" "*" " " " " " " " " "*" 3 "*" " " " " "*" " " " " "*" "*" " " "*" " " " " "*" 4 "*" " " " " "*" "*" " " "*" "*" " " "*" " " " " "*" 5 "*" " " " " "*" "*" " " "*" "*" " " "*" " " "*" "*" 6 "*" "*" " " "*" "*" " " "*" "*" " " "*" " " "*" "*" 7 "*" "*" " " "*" "*" " " "*" "*" "*" "*" " " "*" "*" 8 "*" "*" " " "*" "*" " " "*" "*" "*" "*" "*" "*" "*" 9 "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" 20 "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" 2 22 > summary. backward $ adjr2 23 [] 0.3703365 0.406400 0.408622 0.4200276 0.424942 0.429705 0.435244 24 [8] 0.4307643 0.4296525 0.4276995 0.4257370 0.4233324 0.4209060 25 > summary. backward $cp 26 [] 23.65664 9.298090 9.22249 5.37589 4.299394 3.53209 3.562622 27 [8] 4.897268 6.37507 8.96209 0.0456 2.00697 4.000000 In the above result we see that the adjust R 2 increased slightly after the model takes up more than four predictor variables. In the meantime, Mallows C p is slightly bigger than m + with four variables. Therefore we decide to take four variables: > coef ( fit. backward,4) 2 ( Intercept ) zn dis rad medv 3 6.8467759 0.06920372-0.86867453 0.54587268-0.259355 And if we run the model with the above selected predictor variables: > fit. lim2 = lm( crim ~zn+dis + rad +medv, data = Boston. train ) 2 > summary ( fit. lim2 ) 3 4 Call : 5 lm( formula = crim ~ zn + dis + rad + medv, data = Boston. train ) 6 7 Residuals : 8 Min Q Median 3Q Max 9-9.527-2.549-0.352 0.955 72.879 7

0 Coefficients : 2 Estimate Std. Error t value Pr ( > t ) 3 ( Intercept ) 6.84678 2.5584 3.76 0.0069 ** 4 zn 0.06920 0.02863 2.47 0.0638 * 5 dis -0.86867 0.3355-2.592 0.002 * 6 rad 0.54587 0.06466 8.442 2.77e -5 *** 7 medv -0.259 0.05723-4.402.60e -05 *** 8 --- 9 Signif. codes : 0 *** 0.00 ** 0.0 * 0.05. 0. 20 2 Residual standard error : 7.533 on 245 degrees of freedom 22 Multiple R- squared : 0.4293, Adjusted R- squared : 0.42 23 F- statistic : 46.08 on 4 and 245 DF, p- value : < 2.2e -6 Figure 2: Scatter box of the model with four predictor variables. I have also plotted standardised residual against each predictor variable in the model. As we can see in the following figures, there seems to exit a pattern that as the predictor variable gets large, standardised residual tends to reduce. Based on the plots, as well as a low adjust R 2, I tend to think we need to consider interactions between predictor variables or models with polynomial terms. 8

Figure 3: Standarised residual plotted against each predictor variable. f) Use the averaged square error to see how good the model is. Answer: To calculate the required error term, we run the following code in R: > fit. lim3 = lm( crim ~zn+dis + rad +medv, data = Boston. train ) 2 > sum (( Boston. test $crim - predict ( fit.lim3, data. frame ( Boston. test ))) ^2) /n 3 [] 29.8876 In the meantime, if we run the following R code, we can see how the error term changes with the number of predictor variables we use in the model: > n= nrow ( Boston. test ) 2 > k= ncol ( Boston. test ) - 3 > mat = model. matrix ( crim ~., data = Boston. test ) 4 > fit.lm= regsubsets ( crim ~., data = Boston. train, nvmax =k, method = " backward ") 5 > cv=rep (0,k) 9

6 > for ( m in : k) 7 + { 8 + coef.m= coef ( fit.lm, m) 9 + for ( i in : n) 0 + { + pred = sum ( mat [i, names ( coef.m)]* coef.m) 2 + diff =( Boston. test $ crim [i]- pred )^2 3 + cv[ m]= cv[ m]+ diff 4 + } 5 + } 6 > cv/ n 7 [] 29.85552 30.3500 30.5097 29.8876 30.22899 8 [6] 30.9453 30.9567 30.57466 30.4063 30.3696 9 [] 30.02463 30.02272 30.06895 20 > cv [4] /n 2 [] 29.8876 We see that model with four predictor variables has an averaged square error term only slightly larger than the model using one variable. This is also consistent with the previous result, attesting to the fact that our model selection is correct. g) Repeat the model selection procedure, using the whole data set. Discuss the difference. Answer: With a slight change of code, we see that the model has also changed with a different data set. > y. name <-" crim " 2 > alpha <- 0.05 3 > fit. lim = lm( crim ~., data = Boston ) 4 > beta <-max ( max ( summary ( fit. lim )$ coefficients [,4]),alpha ) 5 > tokeep <- which ( summary ( fit. lim )$ coefficients [,4] < beta ) 6 > print ( length ( tokeep )) 7 [] 3 8 > while (beta > alpha ) 9 + { 0 + if( length ( tokeep ) ==0) + { 2 + warning (" Nothing is significant ") 3 + break 4 + } 5 + if( names ( tokeep ) []== "( Intercept )") 6 + { 7 + names ( tokeep ) [] <-"" 8 + } else 9 + { 20 + names ( tokeep ) [] <-" -" 2 + } 22 + 23 + form <-as. formula ( paste (y.name,"~",paste ( names ( tokeep ), collapse = "+"))) 24 + fit. lim = lm( formula = form, data = Boston ) 25 + beta <-max ( max ( summary ( fit. lim )$ coefficients [,4]),alpha ) 26 + newmodel = summary ( fit. lim ) $ coefficients 27 + tokeep <- which ( summary ( fit. lim )$ coefficients [,4] < beta ) 0

28 + if( length ( tokeep ) ==) 29 + { 30 + names ( tokeep ) <-row. names ( summary ( fit. lim )$ coefficients ) [] 3 + } 32 + print ( names ( tokeep )) 33 + print ( length ( tokeep )) 34 + 35 + } 36 [] "( Intercept )" "zn" " indus " 37 [4] " nox " "rm" " dis " 38 [7] " rad " " tax " " ptratio " 39 [0] " black " " lstat " " medv " 40 [] 2 4 [] "( Intercept )" "zn" " indus " 42 [4] " nox " "rm" " dis " 43 [7] " rad " " ptratio " " black " 44 [0] " lstat " " medv " 45 [] 46 [] "( Intercept )" "zn" " indus " 47 [4] " nox " " dis " " rad " 48 [7] " ptratio " " black " " lstat " 49 [0] " medv " 50 [] 0 5 [] "( Intercept )" "zn" " nox " 52 [4] " dis " " rad " " ptratio " 53 [7] " black " " lstat " " medv " 54 [] 9 55 [] "( Intercept )" "zn" " nox " 56 [4] " dis " " rad " " ptratio " 57 [7] " black " " medv " 58 [] 8 59 [] "( Intercept )" "zn" " nox " 60 [4] " dis " " rad " " black " 6 [7] " medv " 62 [] 7 63 [] "( Intercept )" "zn" " nox " 64 [4] " dis " " rad " " black " 65 [7] " medv " 66 [] 7 67 > summary ( fit. lim ) 68 69 Call : 70 lm( formula = form, data = Boston ) 7 72 Residuals : 73 Min Q Median 3Q Max 74-0.240 -.95-0.376 0.852 75.438 75 76 Coefficients : 77 Estimate Std. Error t value Pr ( > t ) 78 ( Intercept ) 4.642639 3.709443 3.947 9.04e -05 *** 79 zn 0.053963 0.07305 3.8 0.00923 ** 80 nox -9.238768 4.477580-2.063 0.039597 * 8 dis -0.9928 0.255075-3.892 0.0003 *** 82 rad 0.499838 0.044036.35 < 2e -6 *** 83 black -0.0087 0.00362-2.42 0.06237 *

84 medv -0.95990 0.037685-5.20 2.90e -07 *** 85 --- 86 Signif. codes : 87 0 *** 0.00 ** 0.0 * 0.05. 0. 88 89 Residual standard error : 6.452 on 499 degrees of freedom 90 Multiple R- squared : 0.444, Adjusted R- squared : 0.4373 9 F- statistic : 66.42 on 6 and 499 DF, p- value : < 2.2e -6 Now the model keeps six predictor variables, compared with four from before. The averaged square error: > n= nrow ( Boston ) 2 > fit. lim4 = lm( crim ~zn+dis + rad + medv + black +nox, data = Boston ) 3 > sum (( Boston $crim - predict ( fit.lim4, data. frame ( Boston ))) ^2) /n 4 [] 4.05396 We see that the error seems to get bigger. In the mean time, the biggest problem is that we do not have unused data to test the accuracy of the prediction. > n= nrow ( Boston ) 2 > k= ncol ( Boston ) - 3 > mat = model. matrix ( crim ~., data = Boston ) 4 > fit.lm= regsubsets ( crim ~., data = Boston, nvmax =k, method =" backward ") 5 > cv=rep (0,k) 6 > 7 > for ( m in : k) 8 + { 9 + coef.m= coef ( fit.lm, m) 0 + for ( i in : n) + { 2 + pred = sum ( mat [i, names ( coef.m)]* coef.m) 3 + diff =( Boston $ crim [i]- pred )^2 4 + cv[ m]= cv[ m]+ diff 5 + } 6 + 7 + } 8 > cv/ n 9 [] 44.94983 43.0568 42.66654 4.8336 4.40423 20 [6] 4.05396 40.78503 40.5789 40.4397 40.38645 2 [] 40.34929 40.366 40.3607 22 > cv [6] /n 23 [] 4.05396 Another problem with my solution is that, because we do not have any fresh data to use, error terms will decrease as the number of predictor variable increases. (The error term, obtained in either way, is the same.) 2

Question 2 a) Show two models are equivalent. Answer: To show equivalence, we know that with c i = i, K k= x ik =. Then: Y i = β 0 + β 2 x i2 +... + β K x ik + ε i K = β 0 x ik + β 2 x i2 +... + β K x ik + ε i k= = β 0 x i + (β 0 + β 2 )x i2 +... + (β 0 + β K )x ik + ε i Then we can see the correspondence between α and β. { β 0, j = α j = β 0 + β j, j = 2, 3, 4... K Each α j represents the mean of the population in category j. And it corresponds to certain β i according to the above mentioned rule. In the meantime, in the first model, β 0 serves as the baseline, while the difference between different categories is captured by the difference between β 0 and β j, where j = 2, 3, 4... K. In other words, β j, where j = 2, 3, 4... K represents the difference between the mean of population in category j and category. b) Show that the matrix has certain properties and some other results. Answer: The properties are straightforward: x x 2 x 3... x n x x 2 x 3... x K X T x 2 x 22 x 32... x n2 x 2 x 22 x 23... x 2K X =...................... x K x 2K x 3K... x nk x n x n2 x n3... x nk x 2 i xi x i2 xi x i3... xi x ik xi x i2 x 2 i2....... =. xi x..... i3............... xi x ik......... x 2 ik We see the resulting matrix is diagonal with n i x2 i,j as the j-th element on the main diagonal. And it is very easy to see n i x2 i,j = n j, since the other terms in the sum will go to 0. This represents how many times response variable Y i falls into category j. For X T y: 3

x x 2 x 3... x n y xi y i X T x 2 x 22 x 32... x n2 y 2 xi2 y i y =............ = xi3 y i. x K x 2K x 3K... x nk y n xik y i The resulting vector clearly shows that the result is i:c i=j y i represents the j-th element in the vector, namely, the sum of the response variable that falls into the same category. To find the least square estimator, we use matrix differentiation. RSS = (y αx) T (y αx) = (y T X T α T )(y αx) = y T y y T αx X T α T y + X T α T αx Then we differentiate RSS with respect to α, and equate it to 0. RSS α = yt y y T αx X T α T y + X T α T αx α = X T y + X T Xα = 0 Then we have the estimator: ˆα = (X T X) X T y = xiy i n xi2y i n 2 xi3y i n 3. xik y i The last equality follows from the result we get from before. In this way we see that ˆα j represents, reasonably, the mean of response variables that fall into the same category j. c) Construct a least square estimator ˆβ from ˆα. Show that it is indeed a least square estimator. Answer: According to the one-to-one correspondence we have established from before, we can formulate ˆβ as: n K β 0 α 0 β 2 α 2 α ˆβ = β 3 = α 3 α = ˆα α... β K α K α 4

I intend to prove that ˆβ is least square estimator by contradiction. Assume for the moment that the ˆβ thus obtained is not the least square estimator, and there exits a least square estimator ˆβ L. Then we can construct an ˆα L by using the above rule, and then: RSS = y ˆβ L X 2 = y ˆα L X 2 The last equality holds because there is a one-to-one correspondence between ˆα L and ˆβ L. Therefore we can see ˆα L is the least square estimator, and this is in contradiction with the definition of ˆα. d) Show model 3 is equivalent to model 2 or. Give an interpretation of the coefficients. Answer: First I show equivalence between model 3 and model 2. Y i = γ 0 + γ x i +... + γ K x ik + ε i K = γ 0 x ik + γ x i +... + γ K x ik + ε i k= = (γ 0 + γ )x i + (γ 0 + γ 2 )x i2 +... + (γ 0 + γ K )x ik + ε i Together with the condition that K j= γ j = 0 we can see the correspondence between γ and α. γ j = { αj K, j = 0 αj α j, j =, 2, 3, 4... K K Here we see, γ 0 represents the mean of the whole population, regardless of category. In other words, γ 0 is the real mean. In the meantime, γ j for j =, 2, 3,..., K, represents effects of each category on the response variable, making it deviate from the real mean. e) Try the following code in R, discuss its meaning. Answer: First we see the how the code is executed in R, and if there is something going amiss. > #e) 2 >Fe = read. table (" http :// www. uio.no/ studier / emner / matnat / math / STK200 / v7 /fe.txt ",header =T, sep =",") 3 > fit = lm(fe~ form +0, data =Fe) 4 > summary ( fit ) 5 6 Call : 7 lm( formula = Fe ~ form + 0, data = Fe) 8 9 Residuals : 0 Min Q Median 3Q Max -4.589-3.06 3.655 0.478 2.278 2 5

3 Coefficients : 4 Estimate Std. Error t value Pr ( > t ) 5 form 0.022 0.566 7.7 <2e -6 *** 6 --- 7 Signif. codes : 8 0 *** 0.00 ** 0.0 * 0.05. 0. 9 20 Residual standard error : 9.803 on 39 degrees of freedom 2 Multiple R- squared : 0.8894, Adjusted R- squared : 0.8865 22 F- statistic : 33.6 on and 39 DF, p- value : < 2.2e -6 We see here the problem is that different forms are not being treated as different categories, but rather as a whole predictor variable. In the meantime, as the command corresponds to model 2, there is no intercept coefficient in the linear model as well. But if we put in one extra command as the question suggests, then we can see the model seems to work: > Fe$ form = as. factor (Fe$ form ) 2 > fit = lm(fe~ form +0, data =Fe) 3 > summary ( fit ) 4 5 Call : 6 lm( formula = Fe ~ form + 0, data = Fe) 7 8 Residuals : 9 Min Q Median 3Q Max 0-8.340 -.255-0.250.770 0.360 2 Coefficients : 3 Estimate Std. Error t value Pr ( > t ) 4 form 26.080.25 20.85 <2e -6 *** 5 form2 24.690.25 9.74 <2e -6 *** 6 form3 29.950.25 23.95 <2e -6 *** 7 form4 33.840.25 27.06 <2e -6 *** 8 --- 9 Signif. codes : 20 0 *** 0.00 ** 0.0 * 0.05. 0. 2 22 Residual standard error : 3.955 on 36 degrees of freedom 23 Multiple R- squared : 0.9834, Adjusted R- squared : 0.985 24 F- statistic : 532.5 on 4 and 36 DF, p- value : < 2.2e -6 This model clearly corresponds to model 2, with α 0 = 0 being the restriction. And each form coefficient in the above table corresponds to one categotical coefficient α j in model 2. f) Try the following code in R, determine which model it corresponds to, and list all the coefficients. Answer: As can be seen in the following result, the first code snippet 6

corresponds to model, namely, β = 0, while the second corresponds to model3 where j= γ j = 0 is the restriction. The first model is R s default setting while the restrictions for model 3 are implemented through the command: options(contrasts=c("contr.sum","contr.sum")). > options ( ) $ contrasts 2 [] " contr. treatment " " contr. poly " 3 > fit2 = lm( Fe~form, data =Fe ) 4 > summary ( fit2 ) 5 6 Call : 7 lm( formula = Fe ~ form, data = Fe) 8 9 Residuals : 0 Min Q Median 3Q Max -8.340 -.255-0.250.770 0.360 2 3 Coefficients : 4 Estimate Std. Error t value Pr ( > t ) 5 ( Intercept ) 26.080.25 20.852 < 2e -6 *** 6 form2 -.390.769-0.786 0.437 7 form3 3.870.769 2.88 0.0352 * 8 form4 7.760.769 4.387 9.6e -05 *** 9 --- 20 Signif. codes : 2 0 *** 0.00 ** 0.0 * 0.05. 0. 22 23 Residual standard error : 3.955 on 36 degrees of freedom 24 Multiple R- squared : 0.4748, Adjusted R- squared : 0.43 25 F- statistic : 0.85 on 3 and 36 DF, p- value : 3.99e -05 26 27 > 28 > 29 > options ( contrasts =c(" contr. sum "," contr. sum ")) 30 > options ( ) $ contrasts 3 [] " contr. sum " " contr. sum " 32 > fit3 = lm( Fe~form, data =Fe) 33 > summary ( fit3 ) 34 35 Call : 36 lm( formula = Fe ~ form, data = Fe) 37 38 Residuals : 39 Min Q Median 3Q Max 40-8.340 -.255-0.250.770 0.360 4 42 Coefficients : 43 Estimate Std. Error t value Pr ( > t ) 44 ( Intercept ) 28.6400 0.6254 45.798 < 2e -6 *** 45 form -2.5600.083-2.363 0.023622 * 46 form2-3.9500.083-3.647 0.000833 *** 47 form3.300.083.209 0.234375 48 --- 49 Signif. codes : 50 0 *** 0.00 ** 0.0 * 0.05. 0. 7

5 52 Residual standard error : 3.955 on 36 degrees of freedom 53 Multiple R- squared : 0.4748, Adjusted R- squared : 0.43 54 F- statistic : 0.85 on 3 and 36 DF, p- value : 3.99e -05 The three set of regression coefficients correspond respectively to ˆα, ˆβ and ˆγ. The models are equivalent to each other, but the exact coefficients are different. But we have already established the rules connecting them to each other in earlier question. g) Do a variance analysis on the iron types. Make use of the results from the previous questions. Answer: To tell if there is any difference between the four types of iron, we are asking whether β j, where j = 2, 3, 4, and γ j, for j =, 2, 3, 4, are equal to 0 at the same time. In other words, we have the hypothesis: H 0 : β 2 = β 3 = β 4 = 0 against a hypothesis that at least one of them is not 0: We can run the F -test: H α : at least one of β j 0. F = (T SS RSS)/p RSS/(n p ) Fortunately, R has already run the test and printed out the result in the last line of the table, where we can see the P-value for the F -test is 3.99e 05. Therefore we can safely reject H 0. We can also run an anova analysis for the model and the result is the same for both models as well: > anova ( fit2 ) 2 Analysis of Variance Table 3 4 Response : Fe 5 Df Sum Sq Mean Sq F value Pr( >F) 6 form 3 509.2 69.707 0.849 3.99e -05 *** 7 Residuals 36 563.3 5.643 8 --- 9 Signif. codes : 0 0 *** 0.00 ** 0.0 * 0.05. 0. > anova ( fit3 ) 2 Analysis of Variance Table 3 4 Response : Fe 5 Df Sum Sq Mean Sq F value Pr( >F) 6 form 3 509.2 69.707 0.849 3.99e -05 *** 8

7 Residuals 36 563.3 5.643 8 --- 9 Signif. codes : 20 0 *** 0.00 ** 0.0 * 0.05. 0. h) Suggest a reasonable way to further simplify the model. Answer: We can use the F -test to see if we can leave out q predictor variable. H 0 : β 2 = β 3 =... = β q = 0 Then F = (RSS RSS 0)/q RSS/(n p ) Here RSS 0 is under H 0 while RSS is for the full model. In this way hopefully we can find some predictors that are not quite different from each other, then we can combine the cotegories together. We can also run a Tukey s procedure on this question and see the mean difference among the four types: aov. fit = aov ( Fe~ factor ( form ), data =Fe) 2 summary ( aov. fit ) 3 tukey. fit = TukeyHSD ( aov.fit, ordered =T) 4 plot ( tukey. fit ) Figure 4: Tukey plot for the iron types. We see here that type and 2 are very close to each other, while 3 and 4 is another pair. Therefore this is good chance of further simplifying the model. 9

STK 2100 Oblig 1. Zhou Siyu. February 15, 2017