Sections 7.1, 7.2, 7.4, & 7.6

Adapted from Timothy Hanson
Department of Statistics, University of South Carolina
Stat 704: Data Analysis I

Chapter 7 example: Body fat

n = 20 healthy females, 25-34 years old.
x_1 = triceps skinfold thickness (mm)
x_2 = thigh circumference (cm)
x_3 = midarm circumference (cm)
Y = body fat (%)

Obtaining Y_i, the percent of the body that is purely fat, requires immersing a person in water. We want to develop a model based on simple body measurements that spares people from getting wet.
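
The slides assume a SAS data set named bodyfat already exists. A minimal sketch of how it might be read in, assuming the measurements sit in a plain-text file (the file name and column order are illustrative, not from the slides):

data bodyfat;
   infile 'bodyfat.dat';                 * assumed file name;
   input triceps thigh midarm bodyfat;   * assumed column order;
run;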

SAS code

*******************************
* Body fat data from Chapter 7
*******************************;
proc sgscatter data=bodyfat;
   matrix bodyfat triceps thigh midarm;
run;
proc corr data=bodyfat;
   var triceps thigh midarm;
run;

Correlation coefficients

There is high correlation among the predictors. For example, r = 0.92 for triceps and thigh. These two variables are essentially carrying the same information; maybe only one or the other is really needed.

In general, one predictor may be essentially perfectly predicted by the remaining predictors (a high partial correlation), and so would be unnecessary if the other predictors are in the model.

7.1 Extra sums of squares

An extra sum of squares is the difference in SSE between a model with some predictors and a larger model that adds additional predictors. Fact: as predictors are added, the SSE can only decrease. The extra sum of squares is how much the SSE decreases.

Def'n: Let x_1, x_2, ..., x_k be predictors in a model. Then

SSR(x_{j+1}, ..., x_k | x_1, x_2, ..., x_j) = SSE(x_1, x_2, ..., x_j) - SSE(x_1, x_2, ..., x_j, x_{j+1}, ..., x_k).

I.e., SSR(x_{j+1}, ..., x_k | x_1, x_2, ..., x_j) is the drop in the sum of squared errors from the reduced to the full model. This is how much of the total variation SSTO is further explained by adding the new predictors.
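
One way to see an extra sum of squares directly in SAS (a sketch, not from the slides): fit the reduced and full models in a single proc reg call and subtract the two Error sums of squares by hand. The model labels Reduced and Full are names chosen here for readability.

proc reg data=bodyfat;
   Reduced: model bodyfat = triceps;        * ANOVA table gives SSE(x_1);
   Full:    model bodyfat = triceps thigh;  * ANOVA table gives SSE(x_1, x_2);
run;
* SSR(x_2 | x_1) = SSE(x_1) - SSE(x_1, x_2), read off the two Error rows;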

Example with k = 8 predictors

The predictors under consideration are x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8. There are two models:

Reduced: x_1, x_3, x_5, x_6, x_8
Full: x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8

Extra SS = SSR(x_2, x_4, x_7 | x_1, x_3, x_5, x_6, x_8)
         = SSE(reduced) - SSE(full)
         = SSE(x_1, x_3, x_5, x_6, x_8) - SSE(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8)
         = SSR(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8) - SSR(x_1, x_3, x_5, x_6, x_8)

This is how much additional total variability (SSTO) is explained by adding x_2, x_4, x_7 to a model that already has x_1, x_3, x_5, x_6, x_8.

7.2 Associated tests

We can formally test whether a certain set of predictors is useless in the presence of the other predictors in the model. This is the general linear test we talked about a few lectures ago (in simple linear regression).

In the example above, we can test whether x_2, x_4, x_7 are needed if x_1, x_3, x_5, x_6, x_8 are in the model. If the full model (with x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8) has much lower SSE than the reduced model (without x_2, x_4, x_7), then at least one of x_2, x_4, x_7 is needed.

F-test

Say we want to test whether we can drop q variables from a model that has p = k + 1 parameters (including the intercept), q < p. Let R denote the reduced model and F the full model, and let SSE(R), SSE(F) denote the sums of squared errors from the two models. To test H_0: β_{j1} = β_{j2} = ... = β_{jq} = 0 in the full model, use

F* = { [SSE(R) - SSE(F)] / q } / { SSE(F) / (n - p) } ~ F(q, n - p) if H_0: β_{j1} = β_{j2} = ... = β_{jq} = 0 is true;

a p-value for the test is P(F(q, n - p) > F*). Can carry this out in SAS using test in proc reg.

F-test example with k = 8 predictors

To test H_0: β_2 = β_4 = β_7 = 0,

F* = { [SSE(reduced) - SSE(full)] / (# parameters in test) } / MSE(full)
   = { [SSE(x_1, x_3, x_5, x_6, x_8) - SSE(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8)] / 3 } / { SSE(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8) / (n - 9) }
   = { SSR(x_2, x_4, x_7 | x_1, x_3, x_5, x_6, x_8) / 3 } / { SSE(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8) / (n - 9) }
   ~ F(3, n - 9) if H_0: β_2 = β_4 = β_7 = 0 is true.
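
In SAS this reduces to one test statement (a sketch; the data set name mydata and variables y, x1-x8 are hypothetical):

proc reg data=mydata;
   model y = x1 x2 x3 x4 x5 x6 x7 x8;
   Drop3: test x2=0, x4=0, x7=0;   * F-test of H_0: beta_2 = beta_4 = beta_7 = 0;
run;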

Body fat example

proc reg data=body;
   model bodyfat=triceps thigh midarm;
   test thigh=0, midarm=0;
run;

Test 1 Results for Dependent Variable bodyfat

Source        DF   Mean Square   F Value   Pr > F
Numerator      2        22.360      3.64   0.0490
Denominator   16         6.150

Reject H_0: β_2 = β_3 = 0 in

fat_i = β_0 + β_1 triceps_i + β_2 thigh_i + β_3 midarm_i + ε_i

with p = 0.0490.

Type I (sequential) sums of squares

Note: Say you have k = 4 predictors. Then the SSR for the full model can be written

SSR = SSR(x_1, x_2, x_3, x_4)
    = SSR(x_1) + SSR(x_2 | x_1) + SSR(x_3 | x_1, x_2) + SSR(x_4 | x_1, x_2, x_3).

These are called sequential sums of squares, or Type I sums of squares. They explain how much variability is soaked up by adding predictors sequentially to a model. There are four corresponding hypothesis tests with these sequential sums of squares:

Model                                                      Hypothesis       F-statistic
Y_i = β_0 + β_1 x_i1 + ε_i                                 H_0: β_1 = 0     SSR(x_1)/MSE(x_1)
Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ε_i                      H_0: β_2 = 0     SSR(x_2 | x_1)/MSE(x_1, x_2)
Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + β_3 x_i3 + ε_i           H_0: β_3 = 0     SSR(x_3 | x_1, x_2)/MSE(x_1, x_2, x_3)
Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + β_3 x_i3 + β_4 x_i4 + ε_i  H_0: β_4 = 0   SSR(x_4 | x_1, x_2, x_3)/MSE(x_1, x_2, x_3, x_4)

Body fat example

You can get sequential SS from proc reg by adding ss1 as a model option. proc glm gives them automatically.

proc glm data=body;
   model bodyfat=triceps thigh midarm / solution;
run;

Source     DF   Type I SS   Mean Square   F Value   Pr > F
triceps     1      352.27        352.27     57.28   <.0001
thigh       1       33.17         33.17      5.39   0.0337
midarm      1       11.55         11.55      1.88   0.1896

Reject H_0: β_1 = 0 in fat_i = β_0 + β_1 triceps_i + ε_i with p < .0001.
Reject H_0: β_2 = 0 in fat_i = β_0 + β_1 triceps_i + β_2 thigh_i + ε_i with p = 0.0337.
Accept H_0: β_3 = 0 in fat_i = β_0 + β_1 triceps_i + β_2 thigh_i + β_3 midarm_i + ε_i with p = 0.1896.

The order entered (triceps, thigh, midarm) matters!
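
The proc reg route mentioned above would be (a sketch; ss1 appends the Type I SS column to the parameter estimates table):

proc reg data=body;
   model bodyfat = triceps thigh midarm / ss1;
run;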

ANOVA table & decomposing the SSR(F)

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3          396.98         132.33     21.52   <.0001
Error             16           98.40           6.15
Corrected Total   19          495.39

The sequential extra sums of squares are given on the last slide: SSR(x_1) = 352.3, SSR(x_2 | x_1) = 33.2, and SSR(x_3 | x_1, x_2) = 11.5. Almost all of SSR(x_1, x_2, x_3) = 396.98 is explained by x_1 (triceps) alone. Also note, as required,

SSR(x_1, x_2, x_3) = 396.98 = 352.27 + 33.17 + 11.55 = SSR(x_1) + SSR(x_2 | x_1) + SSR(x_3 | x_1, x_2).

Finally, we strongly reject H_0: β_1 = β_2 = β_3 = 0.

7.4 Coefficients of partial determination

We can standardize extra sums of squares to be between 0 and 1 (like R²). The coefficient of partial determination is the fraction by which the sum of squared errors is reduced by adding predictor(s) to an existing model. Examples:

R²_{Y2|1} = SSR(x_2 | x_1)/SSE(x_1)
R²_{Y3|12} = SSR(x_3 | x_1, x_2)/SSE(x_1, x_2)
R²_{Y23|1} = SSR(x_2, x_3 | x_1)/SSE(x_1)

For example, if R²_{Y3|12} = 0.5 then 50% of the remaining variability is explained by adding x_3 to a model that already had x_1 and x_2.
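
A quick numeric check, using the body fat sums of squares from the earlier ANOVA slide (rounded, so treat the arithmetic as approximate):

R²_{Y2|1} = SSR(x_2 | x_1)/SSE(x_1) = 33.17/(495.39 - 352.27) = 33.17/143.12 ≈ 0.23,

so adding thigh to the triceps-only model removes about 23% of that model's remaining error variation.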

Body fat example

In proc reg you can get R²_{Y1}, R²_{Y2|1}, and R²_{Y3|12} by adding pcorr1 as a model option. You can get R²_{Y1|23}, R²_{Y2|13}, and R²_{Y3|12} by adding pcorr2.

proc reg data=body;
   model bodyfat=triceps thigh midarm / pcorr1;
   model bodyfat=triceps thigh midarm / pcorr2;
run;

7.6 Multicollinearity

Recall: in the body fat example, the F-test for H_0: β_1 = β_2 = β_3 = 0 was highly significant, but the individual t-tests for dropping any one of x_1, x_2, or x_3 were not significant! The set x_1, x_2, x_3 is useful for explaining body fat, but none of the three is useful in the presence of the other two.

Why? The predictors are measuring similar phenomena; their sample values are highly correlated. For example, r = 0.92 between triceps thickness x_1 and thigh circumference x_2. This is known as multicollinearity among the predictors.

Effects of multicollinearity

- The model may still provide a good fit and precise prediction/estimation of the response.
- Several estimated regression coefficients b_1, b_2, ..., b_k will have large standard errors, leading to conclusions that individual predictors are not significant even though the overall F-test may be highly significant.
- The concept of holding all other predictors constant doesn't make sense in practice.
- Signs of regression coefficients may be opposite of intuition (or of what we might think, marginally, based on a scatterplot).

Bodyfat example

proc glm data=body;
   model bodyfat=triceps thigh midarm / solution;
run;

Parameter    Estimate   Standard Error   t Value   Pr > |t|
Intercept     117.08         99.78          1.17     0.2578
triceps         4.334         3.016         1.44     0.1699
thigh          -2.857         2.582        -1.11     0.2849
midarm         -2.186         1.595        -1.37     0.1896

Two of the three regression effects are negative. Holding midarm and triceps constant, increasing thigh circumference by 1 cm decreases estimated body fat. Does this make sense?

Detecting multicollinearity

A formal method for determining the presence of multicollinearity is the variance inflation factor (VIF). VIFs measure how much the variances of estimated regression coefficients are inflated compared to having uncorrelated predictors.

We will start with the standardized regression model of Section 7.5. Let

Y*_i = (1/√(n-1)) (Y_i - Ȳ)/s_Y   and   x*_ij = (1/√(n-1)) (x_ij - x̄_j)/s_j,

where s_Y² = (n-1)⁻¹ Σ_{i=1}^n (Y_i - Ȳ)², s_j² = (n-1)⁻¹ Σ_{i=1}^n (x_ij - x̄_j)², and x̄_j = n⁻¹ Σ_{i=1}^n x_ij.

VIFs

These variables are centered about their means and standardized to have Euclidean norm 1. For example,

||x*_j||² = (x*_1j)² + ... + (x*_nj)² = 1.

Note that in general

(Y*)'(Y*) = (x*_j)'(x*_j) = 1   and   (x*_j)'(x*_s) = corr(x_j, x_s) =: r_js.

VIFs

Consider the standardized regression model

Y*_i = β*_1 x*_i1 + ... + β*_k x*_ik + ε*_i.

Define the k x k sample correlation matrix R for the standardized predictors, and the n x k design matrix X*, to be:

R = | 1     r_12  ...  r_1k |        X* = | x*_11  x*_12  ...  x*_1k |
    | r_12  1     ...  r_2k |             | x*_21  x*_22  ...  x*_2k |
    | ...                   |             | ...                      |
    | r_1k  r_2k  ...  1    |             | x*_n1  x*_n2  ...  x*_nk |

Since (X*)'(X*) = R, the least-squares estimate of β* = (β*_1, ..., β*_k)' is given by b* = R⁻¹(X*)'Y*. Hence Cov(b*) = (σ*)² R⁻¹.

VIFs

Now note that if all predictors are uncorrelated then R = I_k = R⁻¹. Hence the ith diagonal element of R⁻¹ measures how much the variance of b*_i is inflated due to correlation between predictors. We call this the ith variance inflation factor:

VIF_i = (R⁻¹)_ii.

Usually the largest VIF_i is taken to be a measure of the seriousness of the multicollinearity among the predictors.
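
This diagonal-of-R-inverse computation can be checked directly, assuming SAS/IML is licensed (a sketch; only the variable names come from the body fat example):

proc iml;
   use body;
   read all var {triceps thigh midarm} into X;
   close body;
   R   = corr(X);         * k x k sample correlation matrix of the predictors;
   VIF = vecdiag(inv(R)); * VIF_i = (R^{-1})_ii;
   print VIF;
quit;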

Detecting multicollinearity

Predictor x_j has a variance inflation factor of

VIF_j = 1/(1 - R_j²),

where R_j² is the R² from regressing x_j on the remaining predictors x_1, x_2, ..., x_{j-1}, x_{j+1}, ..., x_k.

High R_j² (near 1) ⇒ x_j is linearly associated with the other predictors ⇒ high VIF_j.
VIF_j ≈ 1 ⇒ x_j is not involved in any multicollinearity.
VIF_j > 10 ⇒ x_j is involved in severe multicollinearity.
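
To check a single VIF by hand (a sketch; regressing thigh on the other two predictors is just one illustrative choice):

proc reg data=body;
   model thigh = triceps midarm;   * R-square from this fit is R_j^2 for thigh;
run;
* then VIF_thigh = 1/(1 - R_j^2), which should match the vif output on the next slide;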

VIF_j's in SAS

model bodyfat = triceps thigh midarm / vif;

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept    1        117.08               99.78          1.17    0.2578          0
triceps      1          4.334               3.016         1.44    0.1699        708.84
thigh        1         -2.857               2.582        -1.11    0.2849        564.34
midarm       1         -2.186               1.595        -1.37    0.1896        104.61

What do you conclude?

Remedies for multicollinearity

- Drop one or more predictors from the model. We'll discuss this in Chapter 9.
- More advanced: principal components regression uses indexes (new predictors) that are linear combinations of the original predictors as predictors in a new model. The indexes are selected to be uncorrelated. Disadvantage: the indexes might be hard to interpret.
- More advanced: ridge regression (Section 11.2).
- More advanced: ensemble methods.
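
For the ridge regression remedy, proc reg can produce a ridge trace directly (a sketch; the grid of ridge constants is arbitrary, and outvif writes VIFs at each ridge value into the outest= data set):

proc reg data=body outest=ridge_est outvif ridge=0 to 0.1 by 0.01;
   model bodyfat = triceps thigh midarm;
run;
proc print data=ridge_est; run;   * coefficients and VIFs along the ridge trace;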
