SCHOOL OF MATHEMATICS AND STATISTICS
Autumn Semester 2008-9

PAS 371 Linear Models
2 hours

RESTRICTED OPEN BOOK EXAMINATION (Not to be removed from the examination hall)

Data provided: "Statistics Tables" by H.R. Neave

Marks will be awarded for your best three answers.

Candidates may bring to the examination lecture notes and associated lecture material (but no textbooks) plus a calculator that conforms to University regulations.

There are 99 marks available on the paper.

Please leave this exam paper on your desk. Do not remove it from the hall.

Registration number from U-Card (9 digits), to be completed by student:


1 Four objects $O_1, O_2, O_3, O_4$ are weighed in a balance. Four weighings are made; the $(i, j)$-th element in the matrix below is $+1$ if object $O_i$ is placed in the left pan of the balance for weighing $j$, and it is $-1$ if it is placed in the right pan. We are required to estimate the weights of the four objects, given the weights $y_j$ required in the right pan to achieve balance ($j = 1, 2, 3, 4$):

    1 1 1 1
    1 1 1 1
    1 1 1 1
    1 1 1 1

(i) Formulate a regression model for this problem, expressing the observed weights $y_i$ in terms of the unknown weights $\beta_j$ of the four objects. (7 marks)

(ii) Showing your working explicitly, obtain expressions for the least-squares estimators of the weights of the four objects. (8 marks)

(iii) Evaluate these estimates for data $y_1 = 20.2$, $y_2 = 8.0$, $y_3 = 9.7$, $y_4 = 1.9$. (3 marks)

(iv) The whole experiment is now replicated $n$ times. Showing your working explicitly, calculate the new least-squares estimates. (15 marks)

2 In an agricultural experiment, measurements are collected on the volume (in mm$^3$), height (in cm) and diameter (in mm) at 4.5 ft above ground level for a sample of 31 black cherry trees in the Allegheny National Forest, Pennsylvania, USA. Denote the volume by $y$, the height by $h$ and the diameter by $d$. The following model was considered:

$$y_i = \alpha + \beta h_i + \gamma d_i + \delta d_i^2 + \epsilon_i.$$

(i) Write down the fitted model and discuss the suitability of this model, based on the S-Plus output given below. (6 marks)

    Coefficients:
                       Value  Std. Error  t value  Pr(>|t|)
      (Intercept)    10.1751      1.5214   6.6880    0.0000
           Height    -0.0644      0.0221  -2.9134    0.0071
         Diameter     0.3287      0.0306  10.7234    0.0000
    I(Diameter^2)    -0.0017      0.0004  -4.5288    0.0001

    Residual standard error: 0.6068 on 27 degrees of freedom
    Multiple R-Squared: 0.9664
    F-statistic: 258.5 on 3 and 27 degrees of freedom, the p-value is 0

(ii) Some further analysis showed that the standardized residuals and the diagonal elements of the hat matrix were as follows.

    Standardized residuals
          1        2        3        4        5        6        7        8
    -1.0397  -1.1000  -0.9346   0.0293   0.2623   0.2508   0.6173   0.3776
          9       10       11       12       13       14       15       16
    -0.8651  -0.0481  -1.3087  -0.0857  -0.2599  -0.4780   1.6633   1.7043
         17       18       19       20       21       22       23       24
    -1.7923   1.6481   1.2791   1.0997  -0.8744   0.7333  -1.1277   0.5589
         25       26       27       28       29       30       31
     0.1761  -1.3299  -0.9426  -1.1155   0.8063   0.9430   2.0158

    Hat values
     [1] 0.160950 0.192763 0.225723 0.065416 0.131057 0.166165
     [7] 0.120374 0.058250 0.081715 0.048201 0.063211 0.048424
    [13] 0.047078 0.075931 0.052287 0.041557 0.136463 0.180705
    [19] 0.070914 0.210029 0.070187 0.071404 0.101612 0.144019
    [25] 0.098835 0.113444 0.113520 0.138115 0.102000 0.100692
    [31] 0.768945

(a) Calculate approximate variances of the non-standardized residuals $e_1$, $e_2$, $e_3$, $e_{30}$ and $e_{31}$. (5 marks)

(b) Using the standardized residuals, use an appropriate test to find out whether there are any outliers, and comment. (7 marks)

(c) Calculate the Cook's distance for observation $y_i$, with $i = 1, 2, 3, 30, 31$, and check whether these are influential observations. (10 marks)

(d) Summarize your conclusions from (b) and (c) and make recommendations for any further analysis. (5 marks)
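The quantities asked for in parts (a) and (c) can be sketched numerically as follows. This is a minimal illustration rather than a model answer. It assumes that the standardized residuals are internally studentized, $r_i = e_i / (s\sqrt{1 - h_{ii}})$, so that $\mathrm{Var}(e_i) \approx s^2 (1 - h_{ii})$ and $D_i = r_i^2 h_{ii} / \{p (1 - h_{ii})\}$, with $s = 0.6068$ taken from part (i) and $p = 4$ regression parameters. The function names are hypothetical.

```python
# Standardized residuals r_i and hat values h_ii for the observations named
# in parts (a) and (c), copied from the output in part (ii).
std_resid = {1: -1.0397, 2: -1.1000, 3: -0.9346, 30: 0.9430, 31: 2.0158}
hat = {1: 0.160950, 2: 0.192763, 3: 0.225723, 30: 0.100692, 31: 0.768945}

S = 0.6068  # residual standard error quoted in part (i)
P = 4       # assumed parameter count: alpha, beta, gamma, delta


def residual_variance(h, s=S):
    """Approximate Var(e_i) = s^2 * (1 - h_ii)."""
    return s ** 2 * (1.0 - h)


def cooks_distance(r, h, p=P):
    """Cook's distance D_i = r_i^2 * h_ii / (p * (1 - h_ii)),
    assuming r_i is an internally studentized residual."""
    return r ** 2 * h / (p * (1.0 - h))


def summarize():
    """Return {i: (Var(e_i), D_i)} for the selected observations."""
    return {
        i: (residual_variance(hat[i]), cooks_distance(std_resid[i], hat[i]))
        for i in sorted(hat)
    }


if __name__ == "__main__":
    for i, (var_e, d) in summarize().items():
        print(f"i={i:2d}  Var(e_i) ~ {var_e:.4f}  D_i = {d:.4f}")
```

Under these assumptions, only observation 31, with its very large hat value, has a Cook's distance above 1; a common rule of thumb treats values near or above 1 as worth investigating.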

3 The following table shows five observations of a response variable $y$ and two explanatory variables $x$ and $z$.

    y_i   x_i   z_i
      2     1    10
      3     4    40
      3     3    30
      1     1    10
      7    10   100

(i) It is initially suggested that the following linear model be considered:

$$y_i = \alpha + \beta x_i + \gamma z_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2).$$

Show that this model is overparameterised. Describe briefly the phenomenon that is behind this particular form of overparameterisation. How can overparameterisation be resolved for this particular data set? (8 marks)

(ii) Now consider the alternative model

$$y_i = \alpha + \beta x_i + \gamma x_i^2 + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2).$$

Find the least squares estimate of $\boldsymbol{\beta} = (\alpha, \beta, \gamma)^T$ and provide 95% confidence intervals for $\gamma$ and for $\sigma^2$.

HINT: you can make use of the following inverse matrix result:

$$\begin{pmatrix} 5 & 19 & 127 \\ 19 & 127 & 1093 \\ 127 & 1093 & 10339 \end{pmatrix}^{-1} = \begin{pmatrix} 1.375 & -0.669 & 0.054 \\ -0.669 & 0.413 & -0.035 \\ 0.054 & -0.035 & 0.003 \end{pmatrix}.$$

(20 marks)

(iii) Without doing any further calculations, with the information given in (i) and (ii), suggest a suitable model and give brief explanations. (5 marks)
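As a sanity check on the data in Question 3, the short sketch below confirms two facts that the question relies on: every $z_i$ is the same multiple of $x_i$, and the matrix in the hint is $X^T X$ for the quadratic model in part (ii). It uses plain Python with no external libraries; the helper names are hypothetical and the sketch is not a worked solution.

```python
# Data from the table in Question 3.
y = [2, 3, 3, 1, 7]
x = [1, 4, 3, 1, 10]
z = [10, 40, 30, 10, 100]


def is_exact_multiple(u, v):
    """True if v_i = c * u_i for one constant c and every i."""
    c = v[0] / u[0]
    return all(abs(vi - c * ui) < 1e-12 for ui, vi in zip(u, v))


def gram_matrix(columns):
    """Return X^T X, where `columns` lists the columns of X."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns]
            for ci in columns]


def moment_vector(columns, response):
    """Return X^T y for the same column layout."""
    return [sum(a * b for a, b in zip(c, response)) for c in columns]


ones = [1] * len(x)
x_squared = [xi ** 2 for xi in x]
quadratic_design = [ones, x, x_squared]

if __name__ == "__main__":
    print("z is an exact multiple of x:", is_exact_multiple(x, z))
    print("X^T X:", gram_matrix(quadratic_design))
    print("X^T y:", moment_vector(quadratic_design, y))
```

Because $z_i = 10 x_i$ for every observation, the columns for $x$ and $z$ in part (i) are linearly dependent, which is the collinearity that the question asks about.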

4 (i) Explain briefly the role of the Akaike information criterion (AIC), and the S-Plus command step, in model reduction. (5 marks)

(ii) In a study of timber, the volume $v$ of usable timber when a tree is felled is studied in terms of the height $h$ and girth $g$ of the tree.

(a) In a polynomial regression, terms in $h$, $g$, $h^2$, $hg$, $g^2$ are introduced. When the term $hg^2$ is then introduced, it is found that this term is rejected as AIC is increased, not reduced. Why might this be? (5 marks)

(b) How would you decide between the model using $hg^2$ and that using $\{h, g, h^2, hg, g^2\}$, neither of which is nested within the other? (5 marks)

(iii) In a study of loss through evaporation in a petrol tank, the loss $y$ is measured in grams. The regressors thought relevant are:

$x_1$: the initial tank temperature (°F),
$x_2$: the temperature of the petrol when dispensed (°F),
$x_3$: the initial vapour pressure in this tank (pounds per square inch),
$x_4$: the vapour pressure of the petrol when dispensed (pounds per square inch).

The data set consists of 32 points $(y, x_1, x_2, x_3, x_4)$. An initial regression study suggests that interaction terms are not needed. The attached S-Plus output relates to three models under consideration.

(a) Discuss briefly the initial model. (5 marks)

(b) Discuss briefly the intermediate model. (5 marks)

(c) Discuss briefly the final model. (5 marks)

(d) Summarise your conclusions for use by the petrol company commissioning the study. (3 marks)

    Call: lm(formula = y ~ x1 + x2 + x3 + x4)

    Residuals:
        Min      1Q   Median   3Q    Max
     -5.799  -1.211  -0.1308  1.3  5.205

    Coefficients:
                   Value  Std. Error  t value  Pr(>|t|)
    (Intercept)   0.9591      1.8812   0.5098    0.6143
             x1  -0.0385      0.0914  -0.4211    0.6770
             x2   0.2233      0.0679   3.2914    0.0028
             x3  -3.6639      2.7406  -1.3369    0.1924
             x4   8.3536      2.6620   3.1381    0.0041

    Residual standard error: 2.754 on 27 degrees of freedom
    Multiple R-Squared: 0.9247
    F-statistic: 82.93 on 4 and 27 degrees of freedom, the p-value is 9.215e-015

**Using stepwise regression**

    > step(lm(y~x1+x2+x3+x4))
    Start: AIC = 280.7007
     y ~ x1 + x2 + x3 + x4

    Single term deletions

    Model:
    y ~ x1 + x2 + x3 + x4

    scale: 7.586504

            Df  Sum of Sq       RSS        Cp
    <none>                 204.8356  280.7007
        x1   1    1.34549  206.1811  266.8731
        x2   1   82.18732  287.0229  347.7150
        x3   1   13.55965  218.3953  279.0873
        x4   1   74.70807  279.5437  340.2357

    Step: AIC = 266.8731
     y ~ x2 + x3 + x4

    Single term deletions

    Model:
    y ~ x2 + x3 + x4

    scale: 7.586504

            Df  Sum of Sq       RSS        Cp
    <none>                 206.1811  266.8731
        x2   1   83.80282  289.9839  335.5030
        x3   1   32.29814  238.4793  283.9983
        x4   1   88.12139  294.3025  339.8215

    Call: lm(formula = y ~ x2 + x3 + x4)

    Residuals:
        Min     1Q    Median     3Q    Max
     -5.864  -1.32  -0.06399  1.555  4.947

    Coefficients:
                   Value  Std. Error  t value  Pr(>|t|)
    (Intercept)   1.0313      1.8457   0.5588    0.5808
             x2   0.2145      0.0636   3.3735    0.0022
             x3  -4.3911      2.0967  -2.0943    0.0454
             x4   8.6799      2.5091   3.4594    0.0018

    Residual standard error: 2.714 on 28 degrees of freedom
    Multiple R-Squared: 0.9242
    F-statistic: 113.9 on 3 and 28 degrees of freedom, the p-value is 8.882e-016
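To help read the stepwise output, the sketch below checks how the Cp column relates to the RSS column. It assumes the criterion has the form $\text{RSS} + 2k\hat{\sigma}^2$, where $k$ is the number of fitted coefficients (including the intercept) and $\hat{\sigma}^2$ is the printed scale, 7.586504. This form is an assumption that the printed figures appear to be consistent with, not something the paper states. The names are hypothetical.

```python
SCALE = 7.586504  # "scale" in the step output; roughly 204.8356 / 27


def cp_criterion(rss, n_coef, scale=SCALE):
    """Assumed form of the Cp column: RSS + 2 * n_coef * scale."""
    return rss + 2 * n_coef * scale


# (label, RSS, coefficient count incl. intercept, Cp printed in the output)
ROWS = [
    ("y ~ x1 + x2 + x3 + x4", 204.8356, 5, 280.7007),
    ("drop x1 from the full model", 206.1811, 4, 266.8731),
    ("drop x2 from the full model", 287.0229, 4, 347.7150),
    ("drop x3 from the full model", 218.3953, 4, 279.0873),
    ("drop x4 from the full model", 279.5437, 4, 340.2357),
    ("drop x2 from y ~ x2 + x3 + x4", 289.9839, 3, 335.5030),
    ("drop x3 from y ~ x2 + x3 + x4", 238.4793, 3, 283.9983),
    ("drop x4 from y ~ x2 + x3 + x4", 294.3025, 3, 339.8215),
]


def max_discrepancy(rows=ROWS):
    """Largest absolute gap between the recomputed and printed Cp values."""
    return max(abs(cp_criterion(rss, k) - printed)
               for _, rss, k, printed in rows)


if __name__ == "__main__":
    for label, rss, k, printed in ROWS:
        print(f"{label:32s} computed={cp_criterion(rss, k):9.4f} "
              f"printed={printed:9.4f}")
    print("max discrepancy:", max_discrepancy())
```

Note that the scale is held fixed at the full-model value in both steps, so each deletion trades a rise in RSS against a fixed saving of $2 \times 7.586504$ per coefficient removed.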

    Call: lm(formula = y ~ x2 + x4)

    Residuals:
        Min      1Q  Median     3Q   Max
     -7.938  -1.453   0.333  1.803  5.28

    Coefficients:
                   Value  Std. Error  t value  Pr(>|t|)
    (Intercept)   1.1859      1.9032   0.0977    0.9228
             x2   0.2750      0.0599   4.5943    0.0001
             x4   8.5991      0.6768   5.3179    0.0000

    Residual standard error: 2.868 on 29 degrees of freedom
    Multiple R-Squared: 0.9124
    F-statistic: 151 on 2 and 29 degrees of freedom, the p-value is 4.441e-016

End of Question Paper