STAT 3900/4950 MIDTERM TWO
Spring, 2015
Name: ____________________ (print: first last)
Covered topics: Two-way ANOVA, ANCOVA, SLR, MLR and correlation analysis

Instructions: You may use your books, notes, and SPSS/SAS. NO internet access is allowed. Write your answers in the space provided.

Question 1 (4950 students must submit their SAS code for this question via Blackboard)

A survey was taken to see whether a person's purchases based on infomercials on television differed by the level of several different factors. One study considered two factors: household income and marital status. Household income was categorized into 4 categories: (1) under $30K, (2) $30K-$50K, (3) $50K-$100K, and (4) over $100K. Marital status was categorized into 3 levels: A, single (never married); B, married; and C, divorced/separated/widowed. For each of the 12 cells, 2 people were surveyed and reported their estimated past purchases per year that were based on infomercials on television. The goal of the study is to see whether a person's purchases depend on his/her marital status. Here are the purchase data (in dollars) for the different combinations:

                    Household income
Marital status      1           2           3           4
A                   350; 270    390; 530    370; 230    430; 530
B                   430; 390    450; 510    330; 370    570; 430
C                   390; 450    510; 450    350; 490    590; 390

(a) Which of the following models is the most appropriate for the data? Circle your answer.
A. Multiple linear regression
B. One-way ANOVA
C. One-way ANCOVA
D. Two-way ANOVA
E. Two-way ANCOVA
Key: D

(b) If the exact household income (rather than the category) is recorded, such as $0K, which of the following models is the most appropriate? Circle your answer.
A. Multiple linear regression
B. One-way ANOVA
C. One-way ANCOVA
D. Two-way ANOVA
E. Two-way ANCOVA
Key: C

(c) [3 points] For the given dataset, perform backward model selection. What is your final model? Summarize your model selection procedure.
Key:
Step 1: Drop the insignificant interaction (p-value is .91 >> .05).
Step 2: Drop the insignificant main effect of marital status (p-value is .18 > .05).
Step 3: Since all remaining terms are significant, we cannot drop any further.
Our final model is a one-way ANOVA with household income as the only factor.
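The backward selection above can be retraced by hand from the sums of squares. The following is a minimal sketch in plain Python, not the SPSS/SAS procedure the exam expects; the variable names are illustrative, and the two 510 entries in income group 2 are an assumption chosen so that the group mean matches the 473.3 reported in the key. Because the design is balanced, dropping a term simply pools its sum of squares and degrees of freedom into the error term.

```python
# Purchases (dollars), indexed as data[marital_status][income_group]; 2 people per cell.
# Assumption: the two 510 entries in income group 2 are reconstructed so that the
# group mean matches the 473.3 reported in the answer key.
data = {
    "A": {1: [350, 270], 2: [390, 530], 3: [370, 230], 4: [430, 530]},
    "B": {1: [430, 390], 2: [450, 510], 3: [330, 370], 4: [570, 430]},
    "C": {1: [390, 450], 2: [510, 450], 3: [350, 490], 4: [590, 390]},
}


def mean(values):
    return sum(values) / len(values)


rows = sorted(data)            # marital status levels
cols = sorted(data["A"])       # income groups
all_obs = [y for r in rows for c in cols for y in data[r][c]]
grand = mean(all_obs)
n_total = len(all_obs)


def ss_between(groups):
    """Between-group sum of squares: sum of n_g * (mean_g - grand mean)^2."""
    return sum(len(g) * (mean(g) - grand) ** 2 for g in groups)


ss_income = ss_between([[y for r in rows for y in data[r][c]] for c in cols])
ss_marital = ss_between([[y for c in cols for y in data[r][c]] for r in rows])
ss_cells = ss_between([data[r][c] for r in rows for c in cols])
ss_inter = ss_cells - ss_income - ss_marital
ss_total = sum((y - grand) ** 2 for y in all_obs)

df_income, df_marital = len(cols) - 1, len(rows) - 1
df_inter = df_income * df_marital

# Step 1: full two-way model; test the interaction against the within-cell error.
ss_err_full = ss_total - ss_cells
df_err_full = n_total - len(rows) * len(cols)
f_inter = (ss_inter / df_inter) / (ss_err_full / df_err_full)

# Step 2: additive model; the interaction is pooled into the error term.
ss_err_add = ss_err_full + ss_inter
df_err_add = df_err_full + df_inter
f_marital = (ss_marital / df_marital) / (ss_err_add / df_err_add)

# Step 3: one-way model with income only; marital status is pooled as well.
ss_err_one = ss_err_add + ss_marital
df_err_one = df_err_add + df_marital
f_income = (ss_income / df_income) / (ss_err_one / df_err_one)

print(f"Interaction:    F({df_inter}, {df_err_full}) = {f_inter:.3f}")
print(f"Marital status: F({df_marital}, {df_err_add}) = {f_marital:.3f}")
print(f"Income:         F({df_income}, {df_err_one}) = {f_income:.3f}")
```

The p-values themselves would still come from SPSS/SAS or an F table; the sketch only reproduces the F statistics that those p-values are based on.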

(d) [2 points] Check the NORMALITY assumption of the final model using graphs and tests. Sketch your plot and report the p-values of your tests.
Key: The plot shows that all points lie more or less on a line, and both tests of normality are insignificant (p-values .20 and .946, both bigger than .05), so the normality assumption is valid.

(e) [2 points] Draw the interaction plot between household income and marital status. Does it show a significant interaction effect to you? Justify your answer.
Key: The interaction plot shows several roughly parallel lines, so it does not indicate a significant interaction effect.

(f) Can we conduct pairwise comparisons for household income only in this case? If we can, explain why and conduct a proper procedure (and report the results). If we cannot, explain why not.
Key: Yes, we can, because the interaction term is not significant. Here we use the Tukey procedure. Only income groups 3 and 4 are significantly different; all other pairs are not significantly different.
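The Tukey comparison in (f) can be sketched from the one-way model as follows. This is an illustration under stated assumptions, not the SPSS/SAS output: `Q_CRIT = 3.96` is the tabled studentized range value for 4 groups and 20 error degrees of freedom at the 5% level, and the two 510 entries in income group 2 are reconstructed to match the reported group mean of 473.3.

```python
from itertools import combinations
from math import sqrt

# Purchases (dollars) by household income group, marital status pooled.
# Assumption: the two 510 entries in group 2 are reconstructed from the reported mean.
income_groups = {
    1: [350, 270, 430, 390, 390, 450],
    2: [390, 530, 450, 510, 510, 450],
    3: [370, 230, 330, 370, 350, 490],
    4: [430, 530, 570, 430, 590, 390],
}
n_per_group = 6

means = {g: sum(v) / len(v) for g, v in income_groups.items()}

# Error mean square of the one-way ANOVA (the final model from part (c)).
sse = sum((y - means[g]) ** 2 for g, v in income_groups.items() for y in v)
df_error = sum(len(v) for v in income_groups.values()) - len(income_groups)
mse = sse / df_error

# Assumption: tabled studentized range value q(0.05; 4 groups, 20 df).
Q_CRIT = 3.96
hsd = Q_CRIT * sqrt(mse / n_per_group)  # honestly significant difference

significant = [
    (a, b)
    for a, b in combinations(sorted(means), 2)
    if abs(means[a] - means[b]) > hsd
]

for a, b in combinations(sorted(means), 2):
    print(f"groups {a} vs {b}: |difference| = {abs(means[a] - means[b]):.1f}")
print(f"HSD threshold = {hsd:.1f}; significant pairs: {significant}")
```

Under these assumptions only groups 3 and 4 exceed the threshold, which agrees with the key. Groups 2 and 3 differ by about 116.7, just under the threshold, so that conclusion is sensitive to rounding of the critical value.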

(g) Which household income group spends the LEAST due to infomercials on television? Justify your answer.
Key: Household income group 3, because it has the smallest sample mean purchase (as listed below).

(h) Is a person's purchase increasing as his/her household income increases? Justify your answer.
Key: NO, because the mean purchases of the 4 income groups are not increasing (group 3 has a lower mean than groups 1 and 2).

Household income    1      2        3        4
Mean purchase       380    473.3    356.7    490

Question 2

Soil and sediment adsorption, the extent to which chemicals collect in a condensed form on the surface, is an important characteristic because it influences the effectiveness of pesticides and various agricultural chemicals. We are interested in how the adsorption (Y) changes as the amount of extractable iron (X1) and the amount of extractable aluminum (X2) change. A Multiple Linear Regression (MLR) model Y = b0 + b1*X1 + b2*X2 + error is fitted.

[Figure: scatterplot matrix of X1, X2, and Y]

Correlations
                              X1        X2        Y
X1    Pearson Correlation     1         .794**    .908**
      Sig. (2-tailed)                   .001      .000
      N                       13        13        13
X2    Pearson Correlation     .794**    1         .935**
      Sig. (2-tailed)         .001                .000
      N                       13        13        13
Y     Pearson Correlation     .908**    .935**    1
      Sig. (2-tailed)         .000      .000
      N                       13        13        13
**. Correlation is significant at the 0.01 level (2-tailed).

(a) [2 points] Based on the above scatter plots and correlation coefficients:
a. Are the two variables X1 and X2 significantly associated with Y? Justify your answer.
b. Do you think the MLR model would fit the data very well? Justify your answer.
Key:
a. Yes, because the p-values of their correlations with Y are both .000.
b. Yes, because the scatterplots all show a strong linear trend, and the correlation coefficients between X1 and Y and between X2 and Y are both very large (bigger than .90).

(b) Using the attached output:
a. Write down the fitted regression equation.
b. Interpret the slopes of the regression equation in practical terms.
c. How well does this model fit the data? Use a statistic to justify your answer.
Key:
a. Y = -7.351 + .11273 X1 + .349 X2.
b. As the amount of extractable iron increases by one unit, with extractable aluminum held fixed, the adsorption increases by .11273 on average; as the amount of extractable aluminum increases by one unit, with extractable iron held fixed, the adsorption increases by .349 on average.
c. The Adj R-Sq of 0.9382, very close to 100%, indicates an excellent goodness of fit.
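As a cross-check on the fit statistics quoted above, the adjusted R-square and the standardized slopes of a two-predictor model can be recovered from the summary numbers alone. The sketch below is a sanity check, not part of the exam solution; it assumes N = 13 observations and uses the rounded correlations from the table, so the results agree with the output only up to rounding.

```python
# Summary values taken from the correlation table and the regression output.
n, p = 13, 2                 # assumption: 13 observations, 2 predictors
r_square = 0.9485            # R-Square from the SAS output

# Adjusted R-square penalizes R-square for the number of predictors.
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - p - 1)

# Pearson correlations (rounded to three decimals in the table).
r_x1y, r_x2y, r_x1x2 = 0.908, 0.935, 0.794

# Standardized slopes of a two-predictor regression, from the correlations.
denom = 1 - r_x1x2 ** 2
beta1 = (r_x1y - r_x1x2 * r_x2y) / denom
beta2 = (r_x2y - r_x1x2 * r_x1y) / denom

# R-square rebuilt from the standardized slopes and the correlations with Y.
r_square_from_r = beta1 * r_x1y + beta2 * r_x2y

print(f"Adjusted R-square: {adj_r_square:.4f}")
print(f"Standardized slopes: {beta1:.3f}, {beta2:.3f}")
print(f"R-square from correlations: {r_square_from_r:.3f}")
```

The standardized slopes come out near .449 and .578, the Beta column of the SPSS coefficients table, which suggests the correlation table and the regression output describe the same 13 observations.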

(c) [2 points] One assumption is questionable based on the residual plot below. What is this assumption?

[Figure: residual plot, Regression Standardized Residual vs. Regression Standardized Predicted Value; Dependent Variable: Y]

Key: The plot shows a horn shape; the variance seems to increase as the predicted value increases, so the assumption of equal variances seems violated.

(d) [2 points] To solve the problem, we use the natural log of Y, LY, as the new response to fit the MLR model. The following plot is the residual plot of this new model. Do you think the problem identified in the previous part is fixed? Is there another problem in this plot? If yes, what is it?

[Figure: residual plot, Regression Standardized Residual vs. Regression Standardized Predicted Value; Dependent Variable: LY]

Key: The horn shape is gone. However, there is a mild outlier with a standardized residual between -2 and -3.

APPENDIX: OUTPUT FOR QUESTION TWO

SPSS Output

Model Summary(b)
Model    R          R Square    Adjusted R Square    Std. Error of the Estimate
1        .974(a)    .948        .938                 4.37937
a. Predictors: (Constant), X2, X1
b. Dependent Variable: Y

Coefficients(a)
                 Unstandardized Coefficients    Standardized Coefficients
Model            B          Std. Error          Beta          t          Sig.
1  (Constant)    -7.351     3.485                             -2.109     .061
   X1            .113       .030                .449          3.797      .004
   X2            .349       .071                .578          4.894      .001
a. Dependent Variable: Y

SAS Output

Root MSE          4.37937     R-Square    0.9485
Dependent Mean    29.84615    Adj R-Sq    0.9382
Coeff Var         14.67316

Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    1     -7.35066              3.48467           -2.11      0.0611
X1           1     0.11273               0.02969           3.80       0.0035
X2           1     0.34900               0.07131           4.89       0.0006
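The entries of the parameter table are tied together by simple arithmetic, which gives a quick way to check a transcription of the output. The sketch below is a consistency check only; the estimate for X1 (0.11273), the standard error for X2 (0.07131), and the dependent mean (29.84615) are read with their digits "1" restored, which is an assumption about the printed output.

```python
# (estimate, standard error) pairs read from the SAS parameter table.
# Assumption: 0.11273, 0.07131 and 29.84615 are reconstructed readings.
estimates = {
    "Intercept": (-7.35066, 3.48467),
    "X1": (0.11273, 0.02969),
    "X2": (0.34900, 0.07131),
}

# Each t value is the estimate divided by its standard error.
t_values = {name: est / se for name, (est, se) in estimates.items()}

# The coefficient of variation is 100 * Root MSE / Dependent Mean.
root_mse = 4.37937
dependent_mean = 29.84615
coeff_var = 100 * root_mse / dependent_mean

for name, t in t_values.items():
    print(f"{name}: t = {t:.3f}")
print(f"Coeff Var = {coeff_var:.3f}")
```

The ratios land on the t values in the SPSS table (-2.109, 3.797, 4.894), which supports the reconstructed readings without proving them.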