Lab 7 Multiple Regression and F Tests Of a Subset Of Predictors


Preliminary Information:

[1] Last week someone wanted to change the y-axis labeling on a plot() of TukeyHSD output. The labels printed vertically, and some of the pair listings were missing. We can include the argument las=1 in the plot command (for the confidence intervals of the group pairings of the Tukey procedure) to change the orientation of the y-axis labeling from vertical to horizontal.

[2] Last week someone needed Markdown to automatically resize the cluster of 4 graphs produced by the simple.lm() plot command of the UsingR package. Markdown does not reproduce the 4-plot cluster automatically, and I have been unsuccessful at finding a way to make it do so. For now, I suggest producing the confidence-interval and prediction-interval plots the long way, as shown in the lab, which Markdown reproduces successfully. Otherwise, use the 4-plot production of the simple.lm() routine and copy/paste an expanded view of the graphs from RStudio into the Word document (after Markdown completes the Word document).

Introduction: Up until now we have been looking at linear models of 2 quantitative variables (or of transformed quantitative variables, chosen so that a linear model fits better). Given a reasonably strong relationship between our response (y) and explanatory variable (x), we could predict y for a given value of x, within our range of x, to some stated accuracy. The generic form of this linear model was presented in 2 ways:

ŷ = β0 + β1x, or the more precise y = β0 + β1x + ϵ

Remember that we first made a scatter plot, with orthogonal x and y axes (usually in the first quadrant), and from the fit we produced a residual plot, which gave visual evidence (or not) that a linear fit was appropriate and that the error was relatively symmetric across the range of x values.
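As a quick refresher, the simple-regression workflow above can be sketched in a few lines of R. This is a minimal sketch with simulated data; the numbers and variable names are invented for illustration, not taken from the lab's data files.

```r
# A minimal simple-regression sketch with simulated data (invented values).
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(length(x), sd = 1)   # true intercept 3, slope 2

model <- lm(y ~ x)
summary(model)          # estimates, standard errors, R-squared

plot(x, y, main = "Scatter Plot with Fitted Line")
abline(model)           # superimpose the least-squares line

plot(fitted(model), resid(model), main = "Residual Plot",
     xlab = "predicted response", ylab = "residuals")
abline(h = 0, lty = 2)
```

The residual plot should show roughly symmetric scatter about the dashed zero line when a linear fit is appropriate.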
We also generated a set of summary statistics on the model, using the summary() command, along with various graphics (qqnorm() plots, histograms, box plots, etc.) in our linear regression investigations. The next logical step is to ask whether any of a number of explanatory variables (x1, ..., xn) influence the response variable (y). The possible formulas for this linear modeling are:

ŷ = β0 + β1x1 + ... + βnxn and y = β0 + β1x1 + ... + βnxn + ϵ,

where all of the various x's are candidates for being influential explanatory variables. We can branch from that to the next level of questions, asking whether any powers of these x's (other than the first power) would influence the model, or whether any interaction terms (i.e., xi * xj) would influence our model. In theory, we could also construct an n-space scatter plot, with more than 2 orthogonal axes (again, usually in the first quadrant), where a straight line would model the points through the n-space. See below for a picture of what this linear model would look like in 3-space.

Example: Let us use our patients.txt data file, containing height (inches), weight (lbs), and catheter length (cath, in mm) of various young people, to see if catheter length (y) can be predicted by height and weight (x1, x2) of the patients. Look at the code used below.

# Lab 7 Multiple Regression Introduction
# ======================================
patients <- read.delim("c:/users/michael/desktop/lab 7/patients.txt")
data1 <- patients
data1
height <- data1[,1] ; weight <- data1[,2] ; cath <- data1[,3]
model1 <- lm(cath ~ height + weight)
model1
summary(model1)
plot(fitted(model1), resid(model1), main="Residual Plot",
     xlab="predicted response", ylab="residuals")
abline(h=0, lty=2)
var1 <- predict(model1)
var2 <- fitted(model1)

The residual plot and output are shown below. It doesn't look very promising that a linear fit is appropriate for our resulting model, since our error doesn't look very uniform across the predicted values. Our model1, as described by the lm() command above, is

ĉath = β0 + β1·height + β2·weight,

with the numeric coefficient estimates shown in the summary() output. Also notice that our var1 vector equals our var2 vector, showing that predict(model1) = fitted(model1).
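Since patients.txt is not included here, the predict()-equals-fitted() point can be checked with a hedged stand-in: simulated height, weight, and catheter data with invented ranges and coefficients.

```r
# Hedged stand-in for patients.txt (simulated; ranges and coefficients
# are invented for illustration).
set.seed(2)
height <- runif(12, 35, 65)     # inches
weight <- runif(12, 30, 130)    # lbs
cath   <- 20 + 0.2 * height + 0.1 * weight + rnorm(12, sd = 2)   # mm

model1 <- lm(cath ~ height + weight)
summary(model1)

# For a plain lm() fit with no newdata, predict() returns exactly fitted():
var1 <- predict(model1)
var2 <- fitted(model1)
isTRUE(all.equal(as.numeric(var1), as.numeric(var2)))   # TRUE
```

predict() only differs from fitted() when you supply a newdata argument or request confidence/prediction intervals.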

The linear model of best fit, according to our lm() output, is

ĉath = β0 + β1·height + β2·weight,

with the estimated coefficients shown in the output above.

Homework [1]: Some archaeologists theorize that ancient Egyptians interbred with several different immigrant populations over thousands of years. To see if there is any indication of changes in body structure that might have resulted, they took several measurements (MB, BH, BL, and NH) of 30 skulls of male Egyptians dated from several eras (4000 BCE, 3300 BCE, 1850 BCE, 200 BCE, and 150 AD) (A. Thomson and R. Randall-MacIver, Ancient Races of the Thebaid, Oxford University Press, 1905). Generate a multiple regression model of the results of this study in pyramids.csv, and produce a summary of the model parameters, a residual plot, and a short description (using proper sentence structure) of your conclusions on the effect of BH, BL, and NH (the explanatory variables) on MB (the response).

Model Predictors, Important and Unimportant Ones: We want to review with you some procedures for manipulating general regression models; specifically, how to check for statistical significance when you remove some possible predictors from your complete linear model, and how to look for predictor interactions in models. As I understand it, these are special cases of other general regression procedures, where both quantitative and categorical variables are involved.

F Test of a Subset of Predictors: The picture below gives the formulas used to determine how important predictors are to a model.

We are testing the significance of the hypothesis (under Ho) that one or more of the β's are 0 and therefore of no importance to the prediction power of our model. In our F test of this hypothesis we use the model SS of the complete model, the model SS of the reduced model, and the residual SS of the complete model. Our degrees of freedom involve n (the number of data points), k (the total number of predictors in the complete model), and g (the number of predictors remaining, i.e., not hypothesized to be 0). If, for example, our computed F is much greater than F.01(df1, df2), which is the 99th percentile of the F distribution, then we have statistical significance at the α = 0.01 level and we reject Ho. In essence, if we have a statistically significant result, we know that at least one of the proposed predictors we threw out as not impacting our model is, indeed, needed for the model.

Example: Let us use the example of bass catch data, which is located in basscatch.csv. A state fisheries commission wants to estimate the number of bass caught in a given lake in a season in order to restock the lake with an appropriate number of young fish. The commission could get a fairly accurate assessment of the seasonal catch by extensive netting sweeps of the lake before and after a season, but this technique is much too expensive to be done routinely. Therefore, the commission samples a number of lakes (the observational units) and records the seasonal catch (thousands of bass per sq. mi. of lake area), the number of lake-area residences (per sq. mi. of lake area), the size of the lake (in sq. mi.), whether the lake has public access (0 if not, 1 if so), and a structure index (structure means weed beds, sunken trees, drop-offs, and other living places for bass). Part of the data set is shown below. I first read in the data file with the commands shown below.
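The subset F statistic described above can be written out and checked against R's built-in model comparison, which performs the same test. This sketch uses simulated data (all names and coefficients invented); anova(reduced, full) reports the identical statistic.

```r
# Partial (subset) F test, computed by hand and via anova(); simulated data.
set.seed(3)
n  <- 20
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
y  <- 1 + 2 * x1 + 3 * x3 + rnorm(n)     # x3 matters, x4 does not

full    <- lm(y ~ x1 + x2 + x3 + x4)     # k = 4 predictors
reduced <- lm(y ~ x1 + x2)               # g = 2 predictors kept
k <- 4; g <- 2

ss_full    <- sum((fitted(full)    - mean(y))^2)   # model SS, complete
ss_reduced <- sum((fitted(reduced) - mean(y))^2)   # model SS, reduced
sse_full   <- sum(resid(full)^2)                   # residual SS, complete

Fvalue <- ((ss_full - ss_reduced) / (k - g)) / (sse_full / (n - (k + 1)))

anova(reduced, full)$F[2]   # same statistic, computed by R directly
```

The numerator df is k − g and the denominator df is n − (k + 1), matching the quantile call qf(.99, k − g, n − (k + 1)) used later in the lab.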

# Lab 7 Sec 12.5 and 12.7
# ========================
basscatch <- read.csv("c:/users/michael/desktop/lab 7/basscatch.csv")
data1 <- basscatch
data1
catch <- data1[,1] ; residence <- data1[,2]
size <- data1[,3] ; access <- data1[,4]
structure <- data1[,5]

We next might want to look at some scatter plots of 2 variables at a time, to see if there are any obvious candidates we might want to drop from our model and test for significance.

pairs(data1)

This command produces the lattice-type plot below. Our complete model is then:

catch = β0 + β1·residence + β2·size + β3·access + β4·structure + ϵ

Now, after a bit of investigating, let us hypothesize that we can drop access and structure from our model, leaving the reduced model:

catch = β0 + β1·residence + β2·size + ϵ

Now, using the letters from our formula, we have n = 20, k = 4, and g = 2. The code below computes the needed F quantities from the output of our lm() and anova() commands on the 2 models. Note in the code that I call the full model modelfull and the reduced model modelpart.

modelfull <- lm(catch ~ residence + size + access + structure)
modelpart <- lm(catch ~ residence + size)
summary(modelfull)
anova(modelfull)
summary(modelpart)
anova(modelpart)

Results of the full model are shown below. The results from the reduced model follow.

From this output we can get the numbers we need to perform the partial F test (the sums of squares are read off the anova() output above):

sscomplete <- ...        # model SS of the complete model, from anova(modelfull)
ssreduced <- ...         # model SS of the reduced model, from anova(modelpart)
ssresid.complete <- ...  # residual SS of the complete model
k <- 4 ; g <- 2 ; n <- 20
Fvalue <- ((sscomplete - ssreduced)/(k-g))/(ssresid.complete/(n-(k+1)))
qf(.99, 2, 15)
Fvalue

Results are shown below. With our computed F value far above the allowable cutoff, we see that at least one of the variables we removed is influential in the model, has high predictive value, and should not have been removed. Below is the residual plot of the full model, along with the code.

plot(predict(modelfull), resid(modelfull), main="Full Model Residual Plot")
abline(h=0, lty=2)
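The comparison against the qf() cutoff can equivalently be stated as a p-value. A small sketch, using a hypothetical F value (invented for illustration) with the same df as above:

```r
# Critical-value decision vs. p-value decision (hypothetical F value).
Fvalue <- 10.2
df1 <- 2 ; df2 <- 15
Fcrit <- qf(.99, df1, df2)                        # 99th percentile of F(2,15)
pval  <- pf(Fvalue, df1, df2, lower.tail = FALSE) # upper-tail area beyond Fvalue
Fvalue > Fcrit   # TRUE: reject Ho at alpha = 0.01
pval < 0.01      # the same decision, stated as a p-value
```

The two decisions always agree, since pf() and qf() are inverses along the same F distribution.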

The plot gives some evidence that a linear model might be appropriate here.

Homework [2]: Using the basscatch.csv data set, pick 1 or 2 variables whose coefficients you hypothesize might be 0, and use this subset F test procedure to gather evidence on your hypothesis.

Interactions: We want to apply our hypothesis approach shown above to a specific application where we want to compare slopes of different regression lines. Statistically, if slopes are different enough, we probably have an influential predictor in an interaction term of our overall regression model. Below are pictures of various interaction possibilities among factors of a regression model, where we have blood pressure in adults affected by 3 levels of dosage (10 mg, 20 mg, 30 mg; factor A) and 2 administration times for the dosage (once/day, twice/day; factor B). The table shows various possibilities of interaction and significance of factors A and B, with the ANOVA result shown in the right column (2-way ANOVA results in Minitab output format) and a statement of the factor effect in the left column. Note that these results were produced by simulation, rather than by an actual study. In the graphs of the table's middle column, the lines represent the levels of the times-per-day factor (B), and the x-axis represents the levels of the dosage factor (A). The factor-effect descriptions are:

- A and B are both significant in the model, with no interaction present.
- Blood pressure changes across dosage levels whether you take the drug once or twice daily, but the lines are so close together that taking the drug once or twice daily makes no difference. So factor A (dosage) is significant, and factor B (times/day) is not.
- Factor B is significant but A is not. The lines are flat across dosage levels, indicating dosage has no effect on blood pressure; however, the 2 lines are spread apart, indicating that times/day does have a significant effect on blood pressure.
- The lines are flat and close together, so there is no interaction and neither factor A nor factor B is significant.
- Factors A and B interact, because the lines cross. Taking the drug twice per day, a low dose gives low blood pressure, and as the dose increases, so does blood pressure. The opposite relationship between dosage level and blood pressure occurs when taking the drug once per day.

Example: We will use the data set ratanxiety.csv. We have 2 different drug products (A and B) administered to 2 groups of rats in this experiment, and within each group different doses of the drug (5 mg, 10 mg, 20 mg) are administered. The anxiety level of each rat is then measured, according to some rat anxiety scale. A partial picture of the data set is shown below. The x2 variable is a categorical predictor which takes on the value 1 if drug B is used and 0 if drug A is used.
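As an aside, line plots like those in the table's middle column can be drawn with base R's interaction.plot(). This is a hedged sketch with simulated blood-pressure data and invented effect sizes (constructed with no interaction, so the lines come out roughly parallel).

```r
# Simulated dose (A) x times-per-day (B) design with no true interaction.
set.seed(4)
dose  <- factor(rep(c("10mg", "20mg", "30mg"), each = 20))    # factor A
times <- factor(rep(c("once", "twice"), times = 30))          # factor B
bp <- 120 + 5 * as.numeric(dose) - 8 * (times == "twice") + rnorm(60, sd = 3)

interaction.plot(dose, times, bp, xlab = "dosage",
                 ylab = "mean blood pressure", trace.label = "times/day")

# A 2-way ANOVA with an interaction term checks what the plot suggests:
summary(aov(bp ~ dose * times))
```

Crossing or strongly non-parallel lines in such a plot are the visual signature of an interaction, which is what the F test below examines formally.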

We want to perform our F test to see if the interaction term can be deleted from the full model without losing significant prediction capability. Specifically, our full model is:

anxiety = β0 + β1·dose + β2·x2 + β3·(dose·x2) + ϵ

Our reduced model, with the interaction term deleted, is:

anxiety = β0 + β1·dose + β2·x2 + ϵ

Our hypothesis test, then, is Ho: β3 = 0 vs. Ha: β3 ≠ 0. We will compute the F test according to the formula used in our previous example. Before we do this test, let us first read in the data and plot anxiety vs. dose for each of the drugs, superimposing the regression line of each drug on the plot. The code used is shown below.

# 2nd problem presentation
# ========================
ratanxiety <- read.csv("c:/users/michael/desktop/lab 7/ratanxiety.csv")
data3 <- ratanxiety
product <- data3[,1] ; dose <- data3[,2]
anxiety <- data3[,3] ; x2 <- data3[,4]
modela <- lm(anxiety[1:30] ~ dose[1:30])
anova(modela)
modelb <- lm(anxiety[31:60] ~ dose[31:60])
anova(modelb)
total.model <- lm(anxiety ~ dose + x2 + I(dose*x2))
summary(total.model)
anova(total.model)
model.reduced <- lm(anxiety ~ dose + x2)
summary(model.reduced)
anova(model.reduced)
plot(dose, anxiety, pch=as.character(product), main="Slopes of Product A and B")
abline(modela)
abline(modelb, lty=2)
text(12, 25, labels="solid line is A, dashed is B")

Graphical output is below.

Model output is below.

I read in the data first, then label each column as product, dose, anxiety, and x2, respectively. I next construct 4 models, which I will need later, using the lm() command: modela is the model of the A-drug effects only, modelb is of the B-drug effects only, total.model is the total model, and model.reduced is the model with the β3 term removed. I also produce the anova() information for these models, since I will need it later. Finally, I produce the plot of anxiety on dose, distinguishing the A and B points and superimposing the lines of best fit using the abline() command. This plot gives us some evidence that our F test will end up significant at the .05 level. I will leave it up to you to find and add up the various sum-of-squares values to verify the computed F value of about 22. We compute our allowable F value using the command qf(.95, 1, 56), which gives 4.01. So, since our computed F of 22 is so much greater than the allowable F of about 4, at the α = 0.05 level we conclude that β3 is not 0, and the interaction term has powerful prediction value in our model (which we suspected already from our previous scatter plot).
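Because only a single coefficient (β3) is being tested here, the subset F test is equivalent to squaring the t value that summary() reports for the interaction term. A sketch with a simulated stand-in for ratanxiety.csv (coefficients invented):

```r
# Simulated rat-anxiety design: F for the single-term subset test = t^2.
set.seed(5)
dose <- rep(c(5, 10, 20), times = 20)
x2   <- rep(c(0, 1), each = 30)     # 0 = drug A, 1 = drug B
anxiety <- 10 + 0.4 * dose + 2 * x2 + 0.3 * dose * x2 + rnorm(60)

total.model   <- lm(anxiety ~ dose + x2 + I(dose * x2))
model.reduced <- lm(anxiety ~ dose + x2)

Fstat <- anova(model.reduced, total.model)$F[2]
tstat <- summary(total.model)$coefficients["I(dose * x2)", "t value"]
c(Fstat, tstat^2)    # the two agree up to rounding
qf(.95, 1, 56)       # the alpha = 0.05 cutoff, about 4.01
```

This is why, with df1 = 1, the F test and the coefficient t test in the summary() output always reach the same conclusion.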

Homework [3]: The data set crops.csv contains mercury-poisoning results from an agricultural experiment involving 3 kinds of crops (corn, wheat, and barley) planted in mercury-tainted soil (called sludge). There are 6 levels of soil contamination, and the mercury content contained in the plants (the response variable) is measured. Compare corn with wheat to see if β3 (the interaction term) is significant. You may want to construct a scatter plot with the 2 abline() fits superimposed, to anticipate how the hypothesis test will come out. Use the α = 0.05 level. You may want to add an x2 column of 0's and 1's, as I did in my demonstration. Remember that you will only be using rows 1 through 60 (i.e., [1:60]) for this test.

Homework [4]: Repeat Homework [3], comparing corn and barley, for β3 significance at the 0.05 level. You may want to construct a new data set for this, with a new x2 column of 0's and 1's, where you have deleted the wheat rows.

Homework [5]: Repeat Homework [3], comparing wheat and barley. Again, you may want to construct a new data set with the corn rows deleted and a new set of x2 values.

You will have 2 weeks to complete this lab assignment.


More information

Lecture 4 Scatterplots, Association, and Correlation

Lecture 4 Scatterplots, Association, and Correlation Lecture 4 Scatterplots, Association, and Correlation Previously, we looked at Single variables on their own One or more categorical variable In this lecture: We shall look at two quantitative variables.

More information

Chapter 8. Linear Regression /71

Chapter 8. Linear Regression /71 Chapter 8 Linear Regression 1 /71 Homework p192 1, 2, 3, 5, 7, 13, 15, 21, 27, 28, 29, 32, 35, 37 2 /71 3 /71 Objectives Determine Least Squares Regression Line (LSRL) describing the association of two

More information

Data files & analysis PrsnLee.out Ch9.xls

Data files & analysis PrsnLee.out Ch9.xls Model Based Statistics in Biology. Part III. The General Linear Model. Chapter 9.2 Regression. Explanatory Variable Fixed into Classes ReCap. Part I (Chapters 1,2,3,4) ReCap Part II (Ch 5, 6, 7) ReCap

More information

28. SIMPLE LINEAR REGRESSION III

28. SIMPLE LINEAR REGRESSION III 28. SIMPLE LINEAR REGRESSION III Fitted Values and Residuals To each observed x i, there corresponds a y-value on the fitted line, y = βˆ + βˆ x. The are called fitted values. ŷ i They are the values of

More information

Introduction to Linear Regression

Introduction to Linear Regression Introduction to Linear Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Introduction to Linear Regression 1 / 46

More information

Ecn Analysis of Economic Data University of California - Davis February 23, 2010 Instructor: John Parman. Midterm 2. Name: ID Number: Section:

Ecn Analysis of Economic Data University of California - Davis February 23, 2010 Instructor: John Parman. Midterm 2. Name: ID Number: Section: Ecn 102 - Analysis of Economic Data University of California - Davis February 23, 2010 Instructor: John Parman Midterm 2 You have until 10:20am to complete this exam. Please remember to put your name,

More information

MODULE 11 BIVARIATE EDA - QUANTITATIVE

MODULE 11 BIVARIATE EDA - QUANTITATIVE MODULE 11 BIVARIATE EDA - QUANTITATIVE Contents 11.1 Response and Explanatory................................... 78 11.2 Summaries............................................ 78 11.3 Items to Describe........................................

More information

x3,..., Multiple Regression β q α, β 1, β 2, β 3,..., β q in the model can all be estimated by least square estimators

x3,..., Multiple Regression β q α, β 1, β 2, β 3,..., β q in the model can all be estimated by least square estimators Multiple Regression Relating a response (dependent, input) y to a set of explanatory (independent, output, predictor) variables x, x 2, x 3,, x q. A technique for modeling the relationship between variables.

More information

Chapter 3 Multiple Regression Complete Example

Chapter 3 Multiple Regression Complete Example Department of Quantitative Methods & Information Systems ECON 504 Chapter 3 Multiple Regression Complete Example Spring 2013 Dr. Mohammad Zainal Review Goals After completing this lecture, you should be

More information

Mean Comparisons PLANNED F TESTS

Mean Comparisons PLANNED F TESTS Mean Comparisons F-tests provide information on significance of treatment effects, but no information on what the treatment effects are. Comparisons of treatment means provide information on what the treatment

More information

Chapter 27 Summary Inferences for Regression

Chapter 27 Summary Inferences for Regression Chapter 7 Summary Inferences for Regression What have we learned? We have now applied inference to regression models. Like in all inference situations, there are conditions that we must check. We can test

More information

ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS

ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS Ravinder Malhotra and Vipul Sharma National Dairy Research Institute, Karnal-132001 The most common use of statistics in dairy science is testing

More information

Topic 4: Orthogonal Contrasts

Topic 4: Orthogonal Contrasts Topic 4: Orthogonal Contrasts ANOVA is a useful and powerful tool to compare several treatment means. In comparing t treatments, the null hypothesis tested is that the t true means are all equal (H 0 :

More information

Announcements: You can turn in homework until 6pm, slot on wall across from 2202 Bren. Make sure you use the correct slot! (Stats 8, closest to wall)

Announcements: You can turn in homework until 6pm, slot on wall across from 2202 Bren. Make sure you use the correct slot! (Stats 8, closest to wall) Announcements: You can turn in homework until 6pm, slot on wall across from 2202 Bren. Make sure you use the correct slot! (Stats 8, closest to wall) We will cover Chs. 5 and 6 first, then 3 and 4. Mon,

More information

Final Exam - Solutions

Final Exam - Solutions Ecn 102 - Analysis of Economic Data University of California - Davis March 19, 2010 Instructor: John Parman Final Exam - Solutions You have until 5:30pm to complete this exam. Please remember to put your

More information

Analysing data: regression and correlation S6 and S7

Analysing data: regression and correlation S6 and S7 Basic medical statistics for clinical and experimental research Analysing data: regression and correlation S6 and S7 K. Jozwiak k.jozwiak@nki.nl 2 / 49 Correlation So far we have looked at the association

More information

y n 1 ( x i x )( y y i n 1 i y 2

y n 1 ( x i x )( y y i n 1 i y 2 STP3 Brief Class Notes Instructor: Ela Jackiewicz Chapter Regression and Correlation In this chapter we will explore the relationship between two quantitative variables, X an Y. We will consider n ordered

More information

LINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises

LINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises LINEAR REGRESSION ANALYSIS MODULE XVI Lecture - 44 Exercises Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur Exercise 1 The following data has been obtained on

More information

Chapter 14 Student Lecture Notes Department of Quantitative Methods & Information Systems. Business Statistics. Chapter 14 Multiple Regression

Chapter 14 Student Lecture Notes Department of Quantitative Methods & Information Systems. Business Statistics. Chapter 14 Multiple Regression Chapter 14 Student Lecture Notes 14-1 Department of Quantitative Methods & Information Systems Business Statistics Chapter 14 Multiple Regression QMIS 0 Dr. Mohammad Zainal Chapter Goals After completing

More information

Factorial designs. Experiments

Factorial designs. Experiments Chapter 5: Factorial designs Petter Mostad mostad@chalmers.se Experiments Actively making changes and observing the result, to find causal relationships. Many types of experimental plans Measuring response

More information

Correlation. We don't consider one variable independent and the other dependent. Does x go up as y goes up? Does x go down as y goes up?

Correlation. We don't consider one variable independent and the other dependent. Does x go up as y goes up? Does x go down as y goes up? Comment: notes are adapted from BIOL 214/312. I. Correlation. Correlation A) Correlation is used when we want to examine the relationship of two continuous variables. We are not interested in prediction.

More information

Basic Business Statistics, 10/e

Basic Business Statistics, 10/e Chapter 4 4- Basic Business Statistics th Edition Chapter 4 Introduction to Multiple Regression Basic Business Statistics, e 9 Prentice-Hall, Inc. Chap 4- Learning Objectives In this chapter, you learn:

More information

Correlation and Simple Linear Regression

Correlation and Simple Linear Regression Correlation and Simple Linear Regression Sasivimol Rattanasiri, Ph.D Section for Clinical Epidemiology and Biostatistics Ramathibodi Hospital, Mahidol University E-mail: sasivimol.rat@mahidol.ac.th 1 Outline

More information

Foundations of Correlation and Regression

Foundations of Correlation and Regression BWH - Biostatistics Intermediate Biostatistics for Medical Researchers Robert Goldman Professor of Statistics Simmons College Foundations of Correlation and Regression Tuesday, March 7, 2017 March 7 Foundations

More information

Regression Analysis. Table Relationship between muscle contractile force (mj) and stimulus intensity (mv).

Regression Analysis. Table Relationship between muscle contractile force (mj) and stimulus intensity (mv). Regression Analysis Two variables may be related in such a way that the magnitude of one, the dependent variable, is assumed to be a function of the magnitude of the second, the independent variable; however,

More information

Assignment #7. Chapter 12: 18, 24 Chapter 13: 28. Due next Friday Nov. 20 th by 2pm in your TA s homework box

Assignment #7. Chapter 12: 18, 24 Chapter 13: 28. Due next Friday Nov. 20 th by 2pm in your TA s homework box Assignment #7 Chapter 12: 18, 24 Chapter 13: 28 Due next Friday Nov. 20 th by 2pm in your TA s homework box Lab Report Posted on web-site Dates Rough draft due to TAs homework box on Monday Nov. 16 th

More information

Lecture 18: Simple Linear Regression

Lecture 18: Simple Linear Regression Lecture 18: Simple Linear Regression BIOS 553 Department of Biostatistics University of Michigan Fall 2004 The Correlation Coefficient: r The correlation coefficient (r) is a number that measures the strength

More information

Chapter 10 Regression Analysis

Chapter 10 Regression Analysis Chapter 10 Regression Analysis Goal: To become familiar with how to use Excel 2007/2010 for Correlation and Regression. Instructions: You will be using CORREL, FORECAST and Regression. CORREL and FORECAST

More information

Assumptions, Diagnostics, and Inferences for the Simple Linear Regression Model with Normal Residuals

Assumptions, Diagnostics, and Inferences for the Simple Linear Regression Model with Normal Residuals Assumptions, Diagnostics, and Inferences for the Simple Linear Regression Model with Normal Residuals 4 December 2018 1 The Simple Linear Regression Model with Normal Residuals In previous class sessions,

More information

INTRODUCTION TO DESIGN AND ANALYSIS OF EXPERIMENTS

INTRODUCTION TO DESIGN AND ANALYSIS OF EXPERIMENTS GEORGE W. COBB Mount Holyoke College INTRODUCTION TO DESIGN AND ANALYSIS OF EXPERIMENTS Springer CONTENTS To the Instructor Sample Exam Questions To the Student Acknowledgments xv xxi xxvii xxix 1. INTRODUCTION

More information

Model Building Chap 5 p251

Model Building Chap 5 p251 Model Building Chap 5 p251 Models with one qualitative variable, 5.7 p277 Example 4 Colours : Blue, Green, Lemon Yellow and white Row Blue Green Lemon Insects trapped 1 0 0 1 45 2 0 0 1 59 3 0 0 1 48 4

More information

Multiple Regression Examples

Multiple Regression Examples Multiple Regression Examples Example: Tree data. we have seen that a simple linear regression of usable volume on diameter at chest height is not suitable, but that a quadratic model y = β 0 + β 1 x +

More information

Lecture 4 Scatterplots, Association, and Correlation

Lecture 4 Scatterplots, Association, and Correlation Lecture 4 Scatterplots, Association, and Correlation Previously, we looked at Single variables on their own One or more categorical variables In this lecture: We shall look at two quantitative variables.

More information

ANOVA (Analysis of Variance) output RLS 11/20/2016

ANOVA (Analysis of Variance) output RLS 11/20/2016 ANOVA (Analysis of Variance) output RLS 11/20/2016 1. Analysis of Variance (ANOVA) The goal of ANOVA is to see if the variation in the data can explain enough to see if there are differences in the means.

More information

AP Statistics Unit 6 Note Packet Linear Regression. Scatterplots and Correlation

AP Statistics Unit 6 Note Packet Linear Regression. Scatterplots and Correlation Scatterplots and Correlation Name Hr A scatterplot shows the relationship between two quantitative variables measured on the same individuals. variable (y) measures an outcome of a study variable (x) may

More information

Clinton Community School District K-8 Mathematics Scope and Sequence

Clinton Community School District K-8 Mathematics Scope and Sequence 6_RP_1 6_RP_2 6_RP_3 Domain: Ratios and Proportional Relationships Grade 6 Understand the concept of a ratio and use ratio language to describe a ratio relationship between two quantities. Understand the

More information

Inference for the Regression Coefficient

Inference for the Regression Coefficient Inference for the Regression Coefficient Recall, b 0 and b 1 are the estimates of the slope β 1 and intercept β 0 of population regression line. We can shows that b 0 and b 1 are the unbiased estimates

More information

ANOVA Situation The F Statistic Multiple Comparisons. 1-Way ANOVA MATH 143. Department of Mathematics and Statistics Calvin College

ANOVA Situation The F Statistic Multiple Comparisons. 1-Way ANOVA MATH 143. Department of Mathematics and Statistics Calvin College 1-Way ANOVA MATH 143 Department of Mathematics and Statistics Calvin College An example ANOVA situation Example (Treating Blisters) Subjects: 25 patients with blisters Treatments: Treatment A, Treatment

More information

Statistical Concepts. Constructing a Trend Plot

Statistical Concepts. Constructing a Trend Plot Module 1: Review of Basic Statistical Concepts 1.2 Plotting Data, Measures of Central Tendency and Dispersion, and Correlation Constructing a Trend Plot A trend plot graphs the data against a variable

More information

Chapter 16. Simple Linear Regression and Correlation

Chapter 16. Simple Linear Regression and Correlation Chapter 16 Simple Linear Regression and Correlation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

Presents. The Common Core State Standards Checklist Grades 6-8

Presents. The Common Core State Standards Checklist Grades 6-8 Presents The Common Core State Standards Checklist Grades 6-8 Sixth Grade Common Core State Standards Sixth Grade: Ratios and Proportional Relationships Understand ratio concepts and use ratio reasoning

More information

1) A residual plot: A)

1) A residual plot: A) 1) A residual plot: A) B) C) D) E) displays residuals of the response variable versus the independent variable. displays residuals of the independent variable versus the response variable. displays residuals

More information

Q1: What is the interpretation of the number 4.1? A: There were 4.1 million visits to ER by people 85 and older, Q2: What percent of people 65-74

Q1: What is the interpretation of the number 4.1? A: There were 4.1 million visits to ER by people 85 and older, Q2: What percent of people 65-74 Lecture 4 This week lab:exam 1! Review lectures, practice labs 1 to 4 and homework 1 to 5!!!!! Need help? See me during my office hrs, or goto open lab or GS 211. Bring your picture ID and simple calculator.(note

More information

Bivariate Data Summary

Bivariate Data Summary Bivariate Data Summary Bivariate data data that examines the relationship between two variables What individuals to the data describe? What are the variables and how are they measured Are the variables

More information

A discussion on multiple regression models

A discussion on multiple regression models A discussion on multiple regression models In our previous discussion of simple linear regression, we focused on a model in which one independent or explanatory variable X was used to predict the value

More information

The Multiple Regression Model

The Multiple Regression Model Multiple Regression The Multiple Regression Model Idea: Examine the linear relationship between 1 dependent (Y) & or more independent variables (X i ) Multiple Regression Model with k Independent Variables:

More information