$R^2$ and $F$-Tests and ANOVA


December 6

1 Partition of Sums of Squares

The distance from any point $y_i$ in a collection of data to the mean of the data, $\bar{y}$, is the deviation, written as $y_i - \bar{y}$.

Definition. We define the following sums of squares.

$SS_T$, the Total Sum of Squares, measures the total variation in $y$:
$$SS_T = \sum_{i=1}^n (y_i - \bar{y})^2$$

$SS_{Res}$, the Residual Sum of Squares or Error Sum of Squares, measures the amount of residual variation in $y$:
$$SS_{Res} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$$
$SS_{Res}$ is an aggregate measure of the mis-fit of the line of best fit.

$SS_R$, the Sum of Squares due to Regression or Explained Sum of Squares, measures the amount of systematic variation in $y$ due to the $x$-$y$ relationship:
$$SS_R = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$$

Figure 1: $SS_T$

Figure 2: $SS_{Res}$

Figure 3: $SS_R$

Then we have the following partition of the total sum of squares:
$$
\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 = \sum_{i=1}^n \Big((\hat{y}_i - \bar{y}) + \underbrace{(y_i - \hat{y}_i)}_{e_i}\Big)^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + 2\sum_{i=1}^n e_i(\hat{y}_i - \bar{y}) + \sum_{i=1}^n e_i^2.
$$
For the cross term,
$$
\sum_{i=1}^n e_i(\hat{y}_i - \bar{y}) = \sum_{i=1}^n e_i(\hat{\beta}_0 + \hat{\beta}_1 x_i - \bar{y}) = (\hat{\beta}_0 - \bar{y})\underbrace{\sum_{i=1}^n e_i}_{0} + \hat{\beta}_1 \underbrace{\sum_{i=1}^n e_i x_i}_{0} = 0,
$$
since the least squares residuals satisfy $\sum e_i = 0$ and $\sum e_i x_i = 0$. Therefore
$$SS_T = SS_R + SS_{Res}.$$
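The two zero-sum properties of the least squares residuals used above are easy to verify numerically. Below is a minimal R sketch on simulated data; the data and variable names are illustrative, not from the notes:

set.seed(1)
x <- runif(50, 0, 10)                # simulated predictor
y <- 3 + 2 * x + rnorm(50, sd = 2)   # simulated response from a true line
fit <- lm(y ~ x)
e <- residuals(fit)
sum(e)        # equals 0 up to floating-point error
sum(e * x)    # equals 0 up to floating-point error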

Note: Partitioning the sum of squared deviations into components allows the overall variability in a dataset to be ascribed to different types or sources of variability, with the relative importance of each quantified by the size of each component of the overall sum of squares.

2 $R^2$: Global Goodness-of-Fit

For a global measure of adequacy, we can reconsider the sums-of-squares decomposition
$$SS_T = SS_{Res} + SS_R.$$
If we want a global measure of how well $x$ predicts $y$, we may consider the proportion of variation explained by the regression model, which is called the $R^2$ statistic:
$$R^2 = \frac{SS_R}{SS_T} = \frac{SS_T - SS_{Res}}{SS_T} = 1 - \frac{SS_{Res}}{SS_T}.$$
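Continuing the simulated example above, the decomposition and the resulting $R^2$ can be computed directly and compared with what lm reports (a sketch, not the only way to do this):

yhat  <- fitted(fit)
SST   <- sum((y - mean(y))^2)        # total sum of squares
SSRes <- sum((y - yhat)^2)           # residual sum of squares
SSR   <- sum((yhat - mean(y))^2)     # regression sum of squares
all.equal(SST, SSR + SSRes)          # TRUE: the partition holds
SSR / SST                            # agrees with summary(fit)$r.squared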

2.1 Range

Because we have
$$SS_R = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \sum_{i=1}^n \big((\hat{\beta}_0 + \hat{\beta}_1 x_i) - (\hat{\beta}_0 + \hat{\beta}_1 \bar{x})\big)^2 = \hat{\beta}_1^2 \sum_{i=1}^n (x_i - \bar{x})^2 = \left(\frac{c_{XY}}{s_X^2}\right)^2 n s_X^2,$$
using $\hat{\beta}_1 = c_{XY}/s_X^2$, $R^2$ will be 0 when $\hat{\beta}_1 = 0$. On the other hand, if all the residuals are zero, $SS_{Res} = 0$, then $R^2 = 1$. This marks out the limits of $R^2$: a sample slope of 0 gives an $R^2$ of 0, the lowest possible value, and all the data points falling exactly on a straight line gives an $R^2$ of 1, the largest possible value.

2.2 Interpretation

The $R^2$ value tells you how much variation is explained by your model (whether the predictors are sufficient for explaining the variation in the response). So $R^2 = 0.1$ means that your model explains 10% of the variation within the data. The greater the $R^2$, the better the model. The p-value, by contrast, comes from testing the hypothesis that the fit of the intercept-only model and the fit of your model are equal. So if the p-value is less than the significance level (usually 0.05), your model fits the data significantly better than the intercept-only model (the predictors are necessary for explaining the variation in the response). Thus you have four scenarios:

1. low $R^2$ and low p-value (p-value $\le \alpha$)
2. low $R^2$ and high p-value (p-value $> \alpha$)
3. high $R^2$ and low p-value
4. high $R^2$ and high p-value

Interpretation:

1. means that your model does not explain much of the variation in the data, but it is significant (better than not having a model)
2. means that your model does not explain much of the variation in the data, and it is not significant (the worst scenario)
3. means that your model explains a lot of the variation within the data and is significant (the best scenario)
4. means that your model explains a lot of the variation within the data but is not significant (the model is worthless)

2.3 Adjusted $R^2$

We know that
$$E_{Y|X}[SS_{Res} \mid X] = \sigma^2 (n - p).$$
If we increase the model complexity by increasing the number of parameters in the model, $p \to n$, the value of $SS_{Res}$ will approach 0, and hence the value $R^2 = 1 - SS_{Res}/SS_T$ approaches 1. This fact implies that the $R^2$ statistic does not account for model complexity, so often the adjusted $R^2$ statistic is used:
$$R^2_{Adj} = 1 - \frac{SS_{Res}/(n-p)}{SS_T/(n-1)}.$$
If $R^2$ or $R^2_{Adj}$ is near 1, then the predictor $x$ explains a large proportion of the observed variation in the response $y$, i.e. $x$ is a good predictor of $y$.

Note: When $p = 1$, $R^2_{Adj} = R^2$. Adjusted $R^2$ is particularly useful in the feature selection stage of model building.
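A short sketch of the adjusted $R^2$ computation for the same simulated fit; here $p$ counts all regression coefficients, including the intercept:

n <- length(y)
p <- length(coef(fit))                   # p = 2 in simple linear regression
r2     <- 1 - SSRes / SST
r2.adj <- 1 - (SSRes / (n - p)) / (SST / (n - 1))
c(r2, r2.adj)   # agree with summary(fit)$r.squared and summary(fit)$adj.r.squared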

We can get the relationship between $R^2$ and $R^2_{Adj}$: from
$$R^2_{Adj} = 1 - \frac{SS_{Res}/(n-p)}{SS_T/(n-1)}$$
we have
$$\frac{SS_{Res}}{SS_T} = \frac{n-p}{n-1}\big(1 - R^2_{Adj}\big),$$
and hence
$$R^2 = 1 - \frac{SS_{Res}}{SS_T} = 1 - \frac{n-p}{n-1}\big(1 - R^2_{Adj}\big).$$

Figure 4: Six summary graphs. $R^2$ is an appropriate measure for (a)-(c), but inappropriate for (d)-(f).
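The identity is easy to confirm numerically with the quantities computed in the sketch above:

all.equal(r2, 1 - (n - p) / (n - 1) * (1 - r2.adj))   # TRUE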

2.4 Limitation

What does $R^2$ converge to as $n \to \infty$? The population version is
$$R^2 = \frac{\mathrm{Var}[m(X)]}{\mathrm{Var}[Y]} = \frac{\mathrm{Var}[\beta_0 + \beta_1 X]}{\mathrm{Var}[\beta_0 + \beta_1 X + \epsilon]} = \frac{\mathrm{Var}[\beta_1 X]}{\mathrm{Var}[\beta_1 X] + \sigma^2} = \frac{\beta_1^2 \mathrm{Var}[X]}{\beta_1^2 \mathrm{Var}[X] + \sigma^2}.$$
Since all our parameter estimates are consistent, and this formula is continuous in all the parameters, the $R^2$ we get from our estimates will converge to this limit.

Unfortunately, a lot of myths about $R^2$ have become endemic in the scientific community, and it is vital at this point to immunize you against them.

1. The most fundamental is that $R^2$ does not measure goodness of fit.

(a) $R^2$ can be arbitrarily low when the model is completely correct. By making $\mathrm{Var}[X]$ small, or $\sigma^2$ large, we drive $R^2$ towards 0, even when every assumption of the simple linear regression model is correct in every particular.

(b) $R^2$ can be arbitrarily close to 1 when the model is totally wrong. There is, indeed, no limit to how high $R^2$ can get when the true model is nonlinear. All that's needed is for the slope of the best linear approximation to be non-zero, and for $\mathrm{Var}[X]$ to get big.

2. $R^2$ is also pretty useless as a measure of predictability.

(a) $R^2$ says nothing about prediction error. $R^2$ can be anywhere between 0 and 1 just by changing the range of $X$. Mean squared error is a much better measure of how good predictions are; better yet are estimates of out-of-sample error, which we'll cover later in the course.

(b) $R^2$ says nothing about interval forecasts. In particular, it gives us no idea how big prediction intervals, or confidence intervals for $m(x)$, might be.

3. $R^2$ cannot be compared across data sets.

4. $R^2$ cannot be compared between a model with untransformed $Y$ and one with transformed $Y$, or between different transformations of $Y$.

5. The one situation where $R^2$ can be compared is when different models are fit to the same data set with the same, untransformed response variable. Then increasing $R^2$ is the same as decreasing in-sample MSE. In that case, however, you might as well just compare the MSEs.

6. It is very common to say that $R^2$ is the fraction of variance explained by the regression. But it is also extremely easy to devise situations where $R^2$ is high even though neither variable could possibly explain the other.

At this point, you might be wondering just what $R^2$ is good for: what job it does that isn't better done by other tools. The only honest answer I can give you is that I have never found a situation where it helped at all. If I could design the regression curriculum from scratch, I would never mention it. Unfortunately, it lives on as a historical relic, so you need to know what it is, and what misunderstandings about it people suffer from.
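The claim in 2(a), that $R^2$ can be moved anywhere between 0 and 1 just by changing the range of $X$, is easy to demonstrate by simulation. A minimal sketch: the true line and the noise level are held fixed, and only the spread of $X$ changes:

set.seed(2)
r2.for.spread <- function(halfwidth) {
  x <- runif(1000, -halfwidth, halfwidth)   # only Var[X] changes
  y <- 3 + 2 * x + rnorm(1000, sd = 5)      # same true line, same noise
  summary(lm(y ~ x))$r.squared
}
sapply(c(0.5, 5, 50), r2.for.spread)        # R^2 rises from near 0 toward 1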

2.5 The Correlation Coefficient

As you know, the correlation coefficient between $X$ and $Y$ is
$$\rho_{XY} = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}},$$
which lies between $-1$ and $1$. It takes its extreme values when $Y$ is a linear function of $X$. The sample version is
$$\hat{\rho}_{XY} = \frac{c_{XY}}{s_X s_Y} = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2}}.$$
We know that $SS_R = \left(\frac{c_{XY}}{s_X^2}\right)^2 n s_X^2$. Therefore the squared sample correlation is
$$\hat{\rho}_{XY}^2 = \frac{c_{XY}^2}{s_X^2 s_Y^2} = \frac{\left(\frac{c_{XY}}{s_X^2}\right)^2 n s_X^2}{n s_Y^2} = \frac{SS_R}{\sum_{i=1}^n (y_i - \bar{y})^2} = \frac{SS_R}{SS_T} = R^2.$$
Therefore $R^2$ equals the squared sample correlation coefficient.

3 F-Tests and ANOVA

The idea is to compare two models:
$$H_0: Y = \beta_0 + \epsilon \quad \text{versus} \quad Y = \beta_0 + \beta_1 X + \epsilon.$$
If we fit the first model, the least squares estimator is $\hat{\beta}_0 = \bar{y}$. The idea is to create a statistic that measures how much better the second model is than the first. Consider the F test statistic
$$F_{obs} = \frac{SS_R/(p-1)}{SS_{Res}/(n-p)} = \frac{MS_R}{MS_{Res}}.$$
Under $H_0$, the statistic has a known distribution called the F distribution. This distribution depends on two parameters, called the degrees of freedom of the F distribution. We denote the distribution by
$$F \sim F(p-1, n-p) = F(1, n-2) \quad (p = 2).$$
We reject $H_0$ if $F$ is large enough. To quantify this, we set a significance level $\alpha$ (0.01, 0.05, etc.) and reject $H_0$ if
$$F_{obs} > F_{\alpha, p-1, n-p} \quad (p = 2),$$
where $F_{\alpha, p-1, n-p}$ is the $1-\alpha$ quantile of the F distribution with $p-1$ and $n-p$ degrees of freedom.
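In R, the critical value $F_{\alpha, p-1, n-p}$ and the tail probability come from qf and pf. A small sketch of the rejection rule, with purely illustrative numbers:

alpha <- 0.05
n <- 20; p <- 2                           # e.g. simple linear regression, 20 points
qf(1 - alpha, df1 = p - 1, df2 = n - p)   # critical value, the 1 - alpha quantile
Fobs <- 165.4                             # an illustrative observed F statistic
pf(Fobs, df1 = p - 1, df2 = n - p, lower.tail = FALSE)   # p-value P(F > Fobs)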

Equivalently, we can conduct the test using the p-value. The p-value is $P(F > F_{obs})$, where $F \sim F(1, n-2)$ and $F_{obs}$ is the actual observed value you compute from the data. We reject $H_0$ if the p-value is less than $\alpha$.

In the olden days, people summarized the quantities used by the F-test in an ANOVA table:

Source       SS        df     MS                        F
Regression   SS_R      p-1    MS_R = SS_R/(p-1)         MS_R/MS_Res
Residual     SS_Res    n-p    MS_Res = SS_Res/(n-p)
Total        SS_T      n-1

In deriving the F distribution, it is absolutely vital that all of the assumptions of the Gaussian-noise simple linear regression model hold: the true model must be linear, the noise must be Gaussian, the noise variance must be constant, and the noise must be independent of $X$ and independent across measurements. The only hypothesis being tested is whether, maintaining all these assumptions, we must reject the flat model $\beta_1 = 0$ in favor of a line at an angle. In particular, the test never doubts that the right model is a straight line.

ANOVA is an historical relic. In serious applied work from the modern (say, post-1985) era, I have never seen any study where filling out an ANOVA table for a regression, etc., was at all important.

3.1 ANOVA in R

The ANOVA table arranges the required information in tabular form:

Source       SS        df     MS               F
Regression   SS_R      p-1    SS_R/(p-1)       F
Residual     SS_Res    n-p    SS_Res/(n-p)
Total        SS_T      n-1

R presents the same information with the positions of columns 1 and 2 swapped:

Source       df     SS        MS               F
Regression   p-1    SS_R      SS_R/(p-1)       F
Residual     n-p    SS_Res    SS_Res/(n-p)
Total        n-1    SS_T

In either case we have the summation results
$$SS_R + SS_{Res} = SS_T \quad \text{and} \quad (p-1) + (n-p) = (n-1)$$
from the previous theory. R also adds the p-value from the test of the null hypothesis in a further final column.

The easiest way to do this in R is to use the anova function. This will give you an analysis-of-variance table for the model. The actual object the function returns is an anova object, which is a special type of data frame. The columns record, respectively, degrees of freedom, sums of squares, mean squares, the actual F statistic, and the p-value of the F statistic. What we'll care about is the first row of this table, which gives us the test information for the slope on X.

library(gamair)
data(chicago)
out <- lm(death ~ tmpd, data = chicago)
anova(out)
## Analysis of Variance Table
##
## Response: death
##             Df  Sum Sq  Mean Sq  F value     Pr(>F)
## tmpd                                      < 2.2e-16 ***
## Residuals
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
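The entries of the table can also be rebuilt by hand from the fitted model, which makes the definitions concrete. A sketch, assuming the gamair package and the fit above are available; model.frame is used so that only the rows actually used in the fit enter the sums:

mf    <- model.frame(out)
yy    <- model.response(mf)               # response values used in the fit
SSR   <- sum((fitted(out) - mean(yy))^2)  # regression sum of squares
SSRes <- sum(residuals(out)^2)            # residual sum of squares
df1   <- 1                                # p - 1
df2   <- nobs(out) - 2                    # n - p
Fobs  <- (SSR / df1) / (SSRes / df2)
c(Fobs, pf(Fobs, df1, df2, lower.tail = FALSE))   # F value and Pr(>F), as in anova(out)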

Here is another example regarding the F-test and ANOVA:

 1  > x <- RocketProp$Age
 2  > y <- RocketProp$Strength
 3  > n <- length(x)
 4  > fit.rp <- lm(y ~ x)
 5  > summary(fit.rp)
 6
 7  Call:
 8  lm(formula = y ~ x)
 9
10  Residuals:
11       Min       1Q   Median       3Q      Max
12
13
14  Coefficients:
15              Estimate Std. Error t value Pr(>|t|)
16  (Intercept)                               < 2e-16 ***
17  x                                        1.64e-10 ***
18  ---
19  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
20
21  Residual standard error:        on 18 degrees of freedom
22  Multiple R-squared:        ,  Adjusted R-squared:
23  F-statistic: 165.4 on 1 and 18 DF,  p-value: 1.643e-10

Line 23 contains the computation of the F statistic (165.4), the degrees of freedom values $p - 1 = 1$ and $n - p = 18$, and the p-value in the test of the hypothesis $H_0: \beta_1 = 0$ (1.643e-10); the hypothesis is strongly rejected.

24  > anova(fit.rp)
25  Analysis of Variance Table
26
27  Response: y
28             Df  Sum Sq  Mean Sq  F value     Pr(>F)
29  x           1                     165.4  1.643e-10 ***
30  Residuals  18
31  ---
32  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
33  >
34  > anova(fit.rp)[, 'Df']
35  [1]  1 18
36  > anova(fit.rp)[, 'Sum Sq']
37  [1]
38  >
39  > sum(anova(fit.rp)[, 'Sum Sq']) - (n-1)*var(y)
40  [1] e-10

The ANOVA table is contained between lines 25 and 32. The Regression entry is listed on line 29 as x. Note that the F statistic and p-value, listed in the columns headed F value and Pr(>F), are identical to the ones from the lm summary (line 23). Lines 39-40 check numerically that the sums of squares in the table add up to $SS_T = (n-1)\,\mathrm{var}(y)$: the difference is zero up to rounding error.

3.2 Equivalence of F-test and t-test for simple linear regression

For simple linear regression, the test
$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0$$
can be carried out using either the t-test or the F-test.

This is because in this model
$$F_{obs} = \left(T_{obs}\right)^2 = \left(\frac{\hat{\beta}_1}{\widehat{se}(\hat{\beta}_1)}\right)^2.$$
To see this, note that
$$SS_R = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \sum_{i=1}^n \big((\hat{\beta}_0 + \hat{\beta}_1 x_{i1}) - (\hat{\beta}_0 + \hat{\beta}_1 \bar{x}_1)\big)^2 = \hat{\beta}_1^2 \sum_{i=1}^n (x_{i1} - \bar{x}_1)^2 = \hat{\beta}_1^2 n s_X^2$$
and, writing $\hat{\sigma}^2 = SS_{Res}/n$,
$$\widehat{se}(\hat{\beta}_1)^2 = \frac{\hat{\sigma}^2}{(n-2) s_X^2},$$
so we have
$$F_{obs} = \frac{SS_R/(p-1)}{SS_{Res}/(n-p)} = \frac{SS_R/(2-1)}{SS_{Res}/(n-2)} = \frac{SS_R}{n\hat{\sigma}^2/(n-2)} = \frac{\hat{\beta}_1^2 (n-2) s_X^2}{\hat{\sigma}^2} = \left(\frac{\hat{\beta}_1}{\widehat{se}(\hat{\beta}_1)}\right)^2 = \left(T_{obs}\right)^2.$$
The p-value for the F-test is exactly the same as the p-value for the t-test, since
$$F > F_{obs} \iff T^2 > (T_{obs})^2 \iff |T| > |T_{obs}|,$$
and thus
$$P(F > F_{obs}) = P(|T| > |T_{obs}|).$$
Therefore, the test results based on the F and t statistics are the same. Note: this result holds only for the simple linear regression model.
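The equivalence can be checked on any simple linear regression summary; for instance, on the simulated fit used in the earlier sketches:

tval <- summary(fit)$coefficients["x", "t value"]
Fval <- summary(fit)$fstatistic["value"]
all.equal(tval^2, unname(Fval))   # TRUE: the squared t statistic is the F statistic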

3.3 What the F-Test Really Tests

Suppose first that we retain the null hypothesis, i.e., we do not find any significant share of variance associated with the regression. This could be because (i) the intercept-only model is right, or (ii) $\beta_1 \neq 0$ but the test doesn't have enough power to detect departures from the null. We don't know which it is. There is also the possibility that the real relationship is nonlinear but the best linear approximation to it has slope (nearly) zero, in which case the F-test will have no power to detect the nonlinearity.

Suppose instead that we reject the null, intercept-only hypothesis. This does not mean that the simple linear model is right. It means that the latter model predicts better than the intercept-only model, too much better to be due to chance. The simple linear regression model can be absolute garbage, with every single one of its assumptions flagrantly violated, and yet better than the model which makes all those assumptions and thinks the optimal slope is zero.

Neither the F-test of $\beta_1 = 0$ vs. $\beta_1 \neq 0$ nor the Wald/t-test of the same hypothesis tells us anything about the correctness of the simple linear regression model. All these tests presume that the simple linear regression model with Gaussian noise is true, and check a special case (flat line) against the general one (tilted line). They do not test linearity, constant variance, lack of correlation, or Gaussianity.
