$R^2$ and $F$-Tests and ANOVA

December 6, 2018

1 Partition of Sums of Squares

The distance from any point $y_i$ in a collection of data to the mean of the data $\bar{y}$ is the deviation, written as $y_i - \bar{y}$.

Definition. We define the following sums of squares.

$SS_T$: Total Sum of Squares. It measures the total variation in $y$:
$$SS_T = \sum_i (y_i - \bar{y})^2.$$

$SS_{Res}$: Residual Sum of Squares or Error Sum of Squares. It measures the amount of residual variation in $y$:
$$SS_{Res} = \sum_i e_i^2 = \sum_i (y_i - \hat{y}_i)^2.$$
$SS_{Res}$ is an aggregate measure of the mis-fit of the line of best fit.

$SS_R$: Sum of Squares due to Regression or Explained Sum of Squares. It measures the amount of systematic variation in $y$ due to the $x$-$y$ relationship:
$$SS_R = \sum_i (\hat{y}_i - \bar{y})^2.$$

Figure 1: $SS_T$

Figure 2: $SS_{Res}$

Figure 3: $SS_R$

Then we have the following partition of the total sum of squares:
$$
\begin{aligned}
\sum_i (y_i - \bar{y})^2
&= \sum_i \big( \underbrace{(\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)}_{(\hat{y}_i - \bar{y}) + e_i} \big)^2 \\
&= \sum_i \big( (\hat{y}_i - \bar{y})^2 + 2 e_i (\hat{y}_i - \bar{y}) + e_i^2 \big) \\
&= \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i e_i^2 + 2 \sum_i e_i (\hat{y}_i - \bar{y}) \\
&= \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i e_i^2 + 2 \sum_i e_i (\hat{\beta}_0 + \hat{\beta}_1 x_i - \bar{y}) \\
&= \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i e_i^2 + 2 (\hat{\beta}_0 - \bar{y}) \underbrace{\sum_i e_i}_{=0} + 2 \hat{\beta}_1 \underbrace{\sum_i e_i x_i}_{=0},
\end{aligned}
$$
where the last two sums vanish by the least squares normal equations.

Therefore
$$SS_T = SS_R + SS_{Res}.$$
Note: Partitioning the sum of squared deviations into components allows the overall variability in a dataset to be ascribed to different sources of variability, with the relative importance of each quantified by the size of its component of the overall sum of squares.

2 $R^2$: Global Goodness of Fit

For a global measure of adequacy, we can reconsider the sums-of-squares decomposition
$$SS_T = SS_{Res} + SS_R.$$
If we want a global measure of how well $x$ predicts $y$, we may consider the proportion of variation explained by the regression model, which is called the $R^2$ statistic:
$$R^2 = \frac{SS_R}{SS_T} = \frac{SS_T - SS_{Res}}{SS_T} = 1 - \frac{SS_{Res}}{SS_T}.$$
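For concreteness, the partition and the $R^2$ formula can be checked numerically from any fitted lm object. Here is a minimal sketch in R using simulated data (the data and variable names are arbitrary, for illustration only):

set.seed(1)
x <- runif(50)                             # simulated predictor (illustrative only)
y <- 2 + 3 * x + rnorm(50)                 # simulated response
fit <- lm(y ~ x)
SST   <- sum((y - mean(y))^2)              # total sum of squares
SSRes <- sum(resid(fit)^2)                 # residual sum of squares
SSR   <- sum((fitted(fit) - mean(y))^2)    # regression sum of squares
SST - (SSR + SSRes)                        # ~ 0: the partition SS_T = SS_R + SS_Res
sum(resid(fit) * (fitted(fit) - mean(y)))  # cross term, ~ 0 by the normal equations
1 - SSRes / SST                            # R^2 computed from the decomposition
summary(fit)$r.squared                     # same value, as reported by lm

The small discrepancies you will see are pure floating-point rounding.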

2.1 Range

Because we have (using $\hat{\beta}_1 = c_{XY}/s_X^2$ and $\sum_i (x_i - \bar{x})^2 = n s_X^2$)
$$SS_R = \sum_i (\hat{y}_i - \bar{y})^2 = \sum_i \big( (\hat{\beta}_0 + \hat{\beta}_1 x_i) - (\hat{\beta}_0 + \hat{\beta}_1 \bar{x}) \big)^2 = \hat{\beta}_1^2 \sum_i (x_i - \bar{x})^2 = \left( \frac{c_{XY}}{s_X^2} \right)^2 n s_X^2,$$
$R^2$ will be 0 when $\hat{\beta}_1 = 0$. On the other hand, if all the residuals are zero, then $SS_{Res} = 0$ and $R^2 = 1$. This marks out the limits of $R^2$: a sample slope of 0 gives an $R^2$ of 0, the lowest possible, and all the data points falling exactly on a straight line gives an $R^2$ of 1, the largest possible.

2.2 Interpretation

The $R^2$ value tells you how much of the variation is explained by your model (whether the predictors are sufficient for explaining the variation in the response). So $R^2 = 0.1$ means that your model explains 10% of the variation within the data; the greater $R^2$, the better the model fits in this sense. The p-value, in contrast, comes from testing the hypothesis that the intercept-only model and your model fit equally well (whether the predictors are necessary for explaining the variation in the response). So if the p-value is less than the significance level (usually 0.05), your model fits the data significantly better than the intercept-only model. Thus you have four scenarios:

1. low $R^2$ and low p-value (p-value $\le \alpha$)
2. low $R^2$ and high p-value (p-value $> \alpha$)
3. high $R^2$ and low p-value
4. high $R^2$ and high p-value

Interpretation:

1. means that your model does not explain much of the variation in the data, but it is significant (better than not having a model);
2. means that your model does not explain much of the variation in the data and it is not significant (worst scenario);
3. means that your model explains a lot of the variation in the data and is significant (best scenario);
4. means that your model explains a lot of the variation in the data but is not significant (the model is worthless).

2.3 Adjusted $R^2$

We know that
$$E_{Y|X}[SS_{Res} \mid X] = \sigma^2 (n - p).$$
If we increase the model complexity by increasing the number of parameters in the model towards $p \to n$, the value of $SS_{Res}$ will approach 0, and hence $R^2 = 1 - SS_{Res}/SS_T$ approaches 1. This implies that the $R^2$ statistic does not account for model complexity, so often the adjusted $R^2$ statistic is used:
$$R^2_{Adj} = 1 - \frac{SS_{Res}/(n-p)}{SS_T/(n-1)}.$$
If $R^2$ or $R^2_{Adj}$ is near 1, then the predictor $x$ explains a large proportion of the observed variation in the response $y$, i.e. $x$ is a good predictor of $y$. Note: when $p = 1$, $R^2_{Adj} = R^2$. Adjusted $R^2$ is particularly useful in the feature-selection stage of model building. We can get the relationship between $R^2$ and $R^2_{Adj}$:

from
$$R^2_{Adj} = 1 - \frac{SS_{Res}/(n-p)}{SS_T/(n-1)},$$
therefore
$$\frac{SS_{Res}}{SS_T} = \frac{n-p}{n-1} \left( 1 - R^2_{Adj} \right),$$
and hence
$$R^2 = 1 - \frac{SS_{Res}}{SS_T} = 1 - \frac{n-p}{n-1} \left( 1 - R^2_{Adj} \right).$$

Figure 4: Six summary graphs. $R^2$ is an appropriate measure for (a)-(c), but inappropriate for (d)-(f).
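The adjusted $R^2$ and the identity above are easy to check numerically. A small sketch, reusing the simulated fit, SST and SSRes from the earlier sketch (so $n = 50$ and $p = 2$):

n <- length(y)                        # n = 50 in the simulated example above
p <- 2                                # number of coefficients: intercept and slope
R2     <- 1 - SSRes / SST
R2.adj <- 1 - (SSRes / (n - p)) / (SST / (n - 1))
R2.adj
summary(fit)$adj.r.squared            # matches the hand computation
1 - (n - p) / (n - 1) * (1 - R2.adj)  # recovers R^2 via the identity above
R2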

2.4 Limitations

What does $R^2$ converge to as $n \to \infty$? The population version is
$$R^2 = \frac{\mathrm{Var}[m(X)]}{\mathrm{Var}[Y]} = \frac{\mathrm{Var}[\beta_0 + \beta_1 X]}{\mathrm{Var}[\beta_0 + \beta_1 X + \epsilon]} = \frac{\mathrm{Var}[\beta_1 X]}{\mathrm{Var}[\beta_1 X + \epsilon]} = \frac{\beta_1^2 \mathrm{Var}[X]}{\beta_1^2 \mathrm{Var}[X] + \sigma^2}.$$
Since all our parameter estimates are consistent, and this formula is continuous in all the parameters, the $R^2$ we get from our estimate will converge on this limit.

Unfortunately, a lot of myths about $R^2$ have become endemic in the scientific community, and it is vital at this point to immunize you against them.

1. The most fundamental is that $R^2$ does not measure goodness of fit.

(a) $R^2$ can be arbitrarily low when the model is completely correct. By making $\mathrm{Var}[X]$ small, or $\sigma^2$ large, we drive $R^2$ towards 0, even when every assumption of the simple linear regression model is correct in every particular.

(b) $R^2$ can be arbitrarily close to 1 when the model is totally wrong. There is, indeed, no limit to how high $R^2$ can get when the true model is nonlinear. All that is needed is for the slope of the best linear approximation to be non-zero, and for $\mathrm{Var}[X]$ to get big. (The simulation sketch at the end of this subsection illustrates both (a) and (b).)

2. $R^2$ is also pretty useless as a measure of predictability.

(a) $R^2$ says nothing about prediction error. $R^2$ can be anywhere between 0 and 1 just by changing the range of $X$. Mean squared error is a much better measure of how good predictions are; better yet are estimates of out-of-sample error, which we will cover later in the course.

(b) $R^2$ says nothing about interval forecasts. In particular, it gives us no idea how big prediction intervals, or confidence intervals for $m(x)$, might be.

3. $R^2$ cannot be compared across data sets.

4. $R^2$ cannot be compared between a model with untransformed $Y$ and one with transformed $Y$, or between different transformations of $Y$.

5. The one situation where $R^2$ can be compared is when different models are fit to the same data set with the same, untransformed response variable. Then increasing $R^2$ is the same as decreasing in-sample MSE. In that case, however, you might as well just compare the MSEs.

6. It is very common to say that $R^2$ is the fraction of variance explained by the regression. But it is also extremely easy to devise situations where $R^2$ is high even though neither variable could possibly explain the other.

At this point, you might be wondering just what $R^2$ is good for, what job it does that is not better done by other tools. The only honest answer I can give you is that I have never found a situation where it helped at all. If I could design the regression curriculum from scratch, I would never mention it. Unfortunately, it lives on as a historical relic, so you need to know what it is, and what misunderstandings about it people suffer from.
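A small simulation sketch of points 1(a) and 1(b) above; the particular constants are arbitrary choices for illustration:

set.seed(2)
m <- 1000
## (a) the simple linear model is exactly correct, but sigma^2 is huge
x1 <- runif(m)
y1 <- 1 + 2 * x1 + rnorm(m, sd = 10)
summary(lm(y1 ~ x1))$r.squared        # near 0, although every assumption holds
## (b) the truth is quadratic (model totally wrong), but Var[X] is large
x2 <- runif(m, 0, 50)
y2 <- x2^2 + rnorm(m)
summary(lm(y2 ~ x2))$r.squared        # above 0.9, although the model is wrong

In (a) the population $R^2$ is $\beta_1^2 \mathrm{Var}[X] / (\beta_1^2 \mathrm{Var}[X] + \sigma^2) \approx 0.003$; in (b) the best linear approximation to the quadratic over a wide range soaks up most of the variance.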

2.5 The Correlation Coefficient

As you know, the correlation coefficient between $X$ and $Y$ is
$$\rho_{XY} = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}},$$
which lies between $-1$ and $1$. It takes its extreme values when $Y$ is a linear function of $X$. The sample version is
$$\hat{\rho}_{XY} = \frac{c_{XY}}{s_X s_Y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}.$$

We know that $SS_R = \left( \frac{c_{XY}}{s_X^2} \right)^2 n s_X^2$. Therefore the squared sample correlation is
$$\hat{\rho}_{XY}^2 = \frac{c_{XY}^2}{s_X^2 s_Y^2} = \frac{\left( \frac{c_{XY}}{s_X^2} \right)^2 n s_X^2}{n s_Y^2} = \frac{SS_R}{\sum_i (y_i - \bar{y})^2} = \frac{SS_R}{SS_T} = R^2.$$
Therefore $R^2$ equals the squared sample correlation coefficient.
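This identity is easy to confirm numerically, for instance with the simulated x, y and fit from the first sketch above:

cor(x, y)^2                # squared sample correlation of the simulated data
summary(fit)$r.squared     # identical to R^2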

3 F-Tests and ANOVA

The idea is to compare two models:
$$H_0: Y = \beta_0 + \epsilon \quad \text{versus} \quad Y = \beta_0 + \beta_1 X + \epsilon.$$
If we fit the first model, the least squares estimator is $\hat{\beta}_0 = \bar{y}$. The idea is to create a statistic that measures how much better the second model is than the first model. Consider now the F test statistic
$$F_{obs} = \frac{SS_R/(p-1)}{SS_{Res}/(n-p)} = \frac{MS_R}{MS_{Res}}.$$
Under $H_0$, the statistic has a known distribution called the F distribution. This distribution depends on two parameters, called the degrees of freedom of the F distribution. We denote the distribution by
$$F \sim F(p-1,\, n-p) = F(1,\, n-2) \quad (\text{since } p = 2 \text{ here}).$$
We reject $H_0$ if $F_{obs}$ is large enough. To quantify this, we set a significance level $\alpha$ (0.01, 0.05, etc.) and reject $H_0$ if
$$F_{obs} > F_{\alpha,\, p-1,\, n-p} = F_{\alpha,\, 1,\, n-2} \quad (p = 2),$$
where $F_{\alpha, p-1, n-p}$ is the $1-\alpha$ quantile of the F-distribution with degrees of freedom $p-1$ and $n-p$. Equivalently, we can conduct the test using the p-value. The p-value is $P(F > F_{obs})$, where $F \sim F(1, n-2)$ and $F_{obs}$ is the actual observed value computed from the data. We reject $H_0$ if the p-value is less than $\alpha$.

In the olden days, people summarized the quantities used by the F-test in an ANOVA table:

Source        SS        df     MS                       F
Regression    SS_R      p-1    MS_R = SS_R/(p-1)        [SS_R/(p-1)] / [SS_Res/(n-p)]
Residual      SS_Res    n-p    MS_Res = SS_Res/(n-p)
Total         SS_T      n-1

In deriving the F distribution, it is absolutely vital that all of the assumptions of the Gaussian-noise simple linear regression model hold: the true model must be linear, the noise must be Gaussian, the noise variance must be constant, and the noise must be independent of X and independent across measurements. The only hypothesis being tested is whether, maintaining all these assumptions, we must reject the flat model ($\beta_1 = 0$) in favor of a line at an angle. In particular, the test never doubts that the right model is a straight line.

ANOVA is an historical relic. In serious applied work from the modern (say, post-1985) era, I have never seen any study where filling out an ANOVA table for a regression, etc., was at all important.

3.1 ANOVA in R

The ANOVA table arranges the required information in tabular form:

Source        SS        df     MS               F
Regression    SS_R      p-1    SS_R/(p-1)       F
Residual      SS_Res    n-p    SS_Res/(n-p)
Total         SS_T      n-1

R itself arranges the same quantities with the first two columns interchanged:

Source        df     SS        MS               F
Regression    p-1    SS_R      SS_R/(p-1)       F
Residual      n-p    SS_Res    SS_Res/(n-p)
Total         n-1    SS_T

Note that R swaps the positions of columns 1 and 2. In either case we have the summation results in columns 1 and 2,
$$SS_R + SS_{Res} = SS_T \quad \text{and} \quad (p-1) + (n-p) = (n-1),$$
from the previous theory. R also adds the p-value from the test of the null hypothesis in a further, final column.

The easiest way to do this in R is to use the anova function, which will give you an analysis-of-variance table for the model. The actual object the function returns is an anova object, which is a special type of data frame. The columns record, respectively, the degrees of freedom, sums of squares, mean squares, the actual F statistic, and the p-value of the F statistic. What we will care about is the first row of this table, which gives us the test information for the slope on X.

library(gamair)
data(chicago)
out <- lm(death ~ tmpd, data = chicago)
anova(out)
## Analysis of Variance Table
##
## Response: death
##             Df  Sum Sq Mean Sq F value    Pr(>F)
## tmpd         1  162473  162473  803.07 < 2.2e-16 ***
## Residuals 5112 1034236     202

## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here is another example regarding the F-test and ANOVA:

 1  > x <- RocketProp$Age
 2  > y <- RocketProp$Strength
 3  > n <- length(x)
 4  > fit.rp <- lm(y ~ x)
 5  > summary(fit.rp)
 6
 7  Call:
 8  lm(formula = y ~ x)
 9
10  Residuals:
11      Min      1Q  Median      3Q     Max
12  -215.98  -50.68   28.74   66.61  106.76
13
14  Coefficients:
15              Estimate Std. Error t value Pr(>|t|)
16  (Intercept) 2627.822     44.184   59.48  < 2e-16 ***
17  x            -37.154      2.889  -12.86 1.64e-10 ***
18  ---
19  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
20
21  Residual standard error: 96.11 on 18 degrees of freedom
22  Multiple R-squared:  0.9018, Adjusted R-squared:  0.8964
23  F-statistic: 165.4 on 1 and 18 DF,  p-value: 1.643e-10

Line 23 contains the computation of the F statistic (165.4), the degrees of freedom values (here $p = 2$, so $p - 1 = 1$ and $n - p = 18$), and the p-value in the test of the hypothesis $H_0: \beta_1 = 0$ ($1.643 \times 10^{-10}$); the hypothesis is strongly rejected.

24  > anova(fit.rp)
25  Analysis of Variance Table
26
27  Response: y
28            Df  Sum Sq Mean Sq F value    Pr(>F)
29  x          1 1527483 1527483  165.38 1.643e-10 ***
30  Residuals 18  166255    9236
31  ---
32  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
33  >
34  > anova(fit.rp)[, 'Df']
35  [1]  1 18
36  > anova(fit.rp)[, 'Sum Sq']
37  [1] 1527482.7  166254.9
38  >
39  > sum(anova(fit.rp)[, 'Sum Sq']) - (n - 1) * var(y)
40  [1] -9.313226e-10

The ANOVA table is contained between lines 25 and 32. The Regression entry is listed on line 29 as x. Note that the F statistic and p-value, listed in the columns headed F value and Pr(>F), are identical to the ones from the lm summary (line 23).
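The tail probability $P(F > F_{obs})$ and the critical value can also be computed directly from the F distribution with pf() and qf(); for example, plugging in the values printed on line 23:

pf(165.4, df1 = 1, df2 = 18, lower.tail = FALSE)  # P(F > 165.4), about 1.64e-10
qf(0.95, df1 = 1, df2 = 18)                       # 1 - 0.05 quantile F_{0.05,1,18}, about 4.41

The observed statistic is far beyond the critical value, which is just another way of reading the tiny p-value.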

3.2 Equivalence of the F-test and t-test for simple linear regression

For simple linear regression, the test
$$H_0: \beta_1 = 0 \quad \text{vs} \quad H_a: \beta_1 \neq 0$$
can be carried out using either the t-test or the F-test, since in this model
$$F_{obs} = \left\{ \frac{\hat{\beta}_1}{\widehat{se}(\hat{\beta}_1)} \right\}^2 = \left[ T_1^{obs} \right]^2.$$
Indeed, since
$$SS_R = \sum_i (\hat{y}_i - \bar{y})^2 = \sum_i \big( (\hat{\beta}_0 + \hat{\beta}_1 x_{i1}) - (\hat{\beta}_0 + \hat{\beta}_1 \bar{x}_1) \big)^2 = \hat{\beta}_1^2 \sum_i (x_{i1} - \bar{x}_1)^2 = \hat{\beta}_1^2 n s_X^2$$
and
$$\widehat{se}(\hat{\beta}_1)^2 = \frac{\hat{\sigma}^2}{(n-2) s_X^2},$$
we have (using $\hat{\sigma}^2 = SS_{Res}/n$)
$$F_{obs} = \frac{SS_R/(p-1)}{SS_{Res}/(n-p)} = \frac{SS_R/(2-1)}{SS_{Res}/(n-2)} = \frac{SS_R}{n \hat{\sigma}^2/(n-2)} = \frac{(n-2)\, \hat{\beta}_1^2 s_X^2}{\hat{\sigma}^2} = \left\{ \frac{\hat{\beta}_1}{\widehat{se}(\hat{\beta}_1)} \right\}^2 = \left[ T_1^{obs} \right]^2.$$
The p-value for the F-test is exactly the same as the p-value for the t-test. Indeed,
$$F > F_{obs} \iff T^2 > \left[ T_1^{obs} \right]^2 \iff |T| > |T_1^{obs}|,$$
and thus
$$P(F > F_{obs}) = P(|T| > |T_1^{obs}|).$$
Therefore, the test results based on the F and t statistics are the same. Note: this result holds only for the simple linear regression model.
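Assuming the fit.rp object from the RocketProp listing above is available, the equivalence is easy to verify numerically:

t.obs <- summary(fit.rp)$coefficients["x", "t value"]
t.obs^2                               # = 165.4 up to rounding, the F statistic
anova(fit.rp)["x", "F value"]         # same number from the ANOVA table
summary(fit.rp)$coefficients["x", "Pr(>|t|)"]   # t-test p-value ...
anova(fit.rp)["x", "Pr(>F)"]                    # ... equals the F-test p-value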

3.3 What the F-Test Really Tests

Suppose first that we retain the null hypothesis, i.e., we do not find any significant share of variance associated with the regression. This could be because (i) the intercept-only model is right, or (ii) $\beta_1 \neq 0$ but the test does not have enough power to detect departures from the null. We do not know which it is. There is also the possibility that the real relationship is nonlinear, but the best linear approximation to it has slope (nearly) zero, in which case the F test will have no power to detect the nonlinearity.

Suppose instead that we reject the null, intercept-only hypothesis. This does not mean that the simple linear model is right. It means that the latter model predicts better than the intercept-only model, too much better to be due to chance. The simple linear regression model can be absolute garbage, with every single one of its assumptions flagrantly violated, and yet still better than the model which makes all those assumptions and thinks the optimal slope is zero.

Neither the F test of $\beta_1 = 0$ vs. $\beta_1 \neq 0$ nor the Wald/t test of the same hypothesis tells us anything about the correctness of the simple linear regression model. All these tests presume that the simple linear regression model with Gaussian noise is true, and check a special case (flat line) against the general one (tilted line). They do not test linearity, constant variance, lack of correlation, or Gaussianity.
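As a final illustration, here is a small simulated example (with arbitrary constants) in which the true regression function is nonlinear and the noise is non-Gaussian, yet the F-test rejects $\beta_1 = 0$ overwhelmingly; rejecting the flat model says nothing about whether the straight-line model itself is right:

set.seed(3)
x3 <- runif(200, 0, 3)
y3 <- exp(x3) + rexp(200) - 1          # nonlinear mean, skewed (non-Gaussian) noise
fit.wrong <- lm(y3 ~ x3)
anova(fit.wrong)                       # huge F, tiny p-value: the flat model is rejected
plot(x3, y3); abline(fit.wrong)        # ...but the straight-line fit is clearly wrong

The tilted line predicts far better than the flat one, so the test rejects, even though every assumption behind the F distribution is violated here.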