Lecture 10: F-Tests, ANOVA and R²

1 ANOVA

We saw that we could test the null hypothesis that β₁ = 0 using the statistic (β̂₁ - 0)/ŝe(β̂₁). (Although I also mentioned that confidence intervals are generally more important than testing.) There is another approach called Analysis of Variance (ANOVA). It is outdated, but it is in the book and you should know how it works.

The idea is to compare two models:

Y = β₀ + ε   versus   Y = β₀ + β₁X + ε.

If we fit the first model, the least squares estimator is β̂₀ = Ȳ. The idea is now to create a statistic that measures how much better the second model is than the first model. The residual sum of squares (RSS) of the first model is Σᵢ (Yᵢ - Ȳ)². This is called the total sum of squares and is denoted by

SS_total = Σᵢ (Yᵢ - Ȳ)².

If we fit the second model (the usual linear model) we get a smaller residual sum of squares,

RSS = Σᵢ eᵢ².

The difference SS_total - RSS is called the sum of squares due to regression and is denoted by SS_reg. If β₁ = 0 we expect this to be small.

In the olden days, people summarized this in an ANOVA table like this:

Source        df      SS          MS                        F                    p-value
Regression    1       SS_reg      MS_reg = SS_reg/1         F = MS_reg/MS_res    P(F_{1,n-2} > F)
Residual      n - 2   RSS         MS_res = RSS/(n-2) = σ̂²
Total         n - 1   SS_total

The degrees of freedom (df) are just numbers that are defined, frankly, to make things work right. The mean squares (MS) are the sums of squares divided by the df. The F test statistic is

F = MS_reg / MS_res.

Under H₀, this statistic has a known distribution called the F distribution. This distribution depends on two parameters (just as the χ² distribution depends on one parameter). These are called the degrees of freedom for the F distribution. We denote the distribution by F_{1,n-2}. The p-value is P(F > F_observed), where F ~ F_{1,n-2} and F_observed is the actual value you compute from the data. This is equivalent to using our previous test statistic and squaring it.

A little more formally, an F random variable is defined by

F = (χ²_a / a) / (χ²_b / b)

where χ²_a and χ²_b are independent. Since χ² distributions arise from sums of squared Gaussians, F-distributed random variables tend to arise when we are dealing with ratios of sums of squared Gaussians. The MS terms in our table are (scaled) independent χ² random variables under the usual assumptions.
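To make the table concrete, here is a small sketch in R on simulated data (the sample size and parameter values below are invented purely for illustration). It builds the ANOVA quantities by hand, computes the F statistic and its p-value, and checks that F is just the square of the usual t statistic for the slope.

# Illustrative simulation; all parameter values are arbitrary
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 3)
fit <- lm(y ~ x)

ss.total <- sum((y - mean(y))^2)        # RSS of the intercept-only model
rss      <- sum(residuals(fit)^2)       # RSS of the fitted line
ss.reg   <- ss.total - rss              # sum of squares due to regression
ms.reg   <- ss.reg / 1                  # 1 degree of freedom
ms.res   <- rss / (n - 2)               # this is also the estimate of sigma^2
f.stat   <- ms.reg / ms.res
p.value  <- pf(f.stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)

anova(fit)                              # should match the hand-built quantities
f.stat
coefficients(summary(fit))["x", "t value"]^2   # same number: F = t^2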

2 ANOVA in R

The easiest way to do this in R is to use the anova function. This will give you an analysis-of-variance table for the model. The actual object the function returns is an anova object, which is a special type of data frame. The columns record, respectively, degrees of freedom, sums of squares, mean squares, the actual F statistic, and the p-value of the F statistic. What we'll care about will be the first row of this table, which will give us the test information for the slope on X. Let's do an example:

library(gamair)
data(chicago)
out <- lm(death ~ tmpd, data = chicago)
anova(out)

The output looks like this:

## Analysis of Variance Table
##
## Response: death
##             Df  Sum Sq Mean Sq F value    Pr(>F)
## tmpd         1  162473  162473  803.07 < 2.2e-16 ***
## Residuals 5112 1034236     202
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Assumptions

In deriving the F distribution, it is absolutely vital that all of the assumptions of the Gaussian-noise simple linear regression model hold: the true model must be linear, the noise must be Gaussian, the noise variance must be constant, and the noise must be independent of X and independent across measurements. The only hypothesis being tested is whether, maintaining all these assumptions, we must reject the flat model (β₁ = 0) in favor of a line at an angle. In particular, the test never doubts that the right model is a straight line.

ANOVA is a historical relic. In serious applied work from the modern (say, post-1985) era, I have never seen any study where filling out an ANOVA table for a regression was at all important.

3 What the F Test Really Tests

The textbook (§2.7-2.8) goes into great detail about an F test for whether the simple linear regression model "explains" (really, predicts) a significant amount of the variance in the response. What this really does is compare two versions of the simple linear regression model. The null hypothesis is that all of the assumptions of that model hold, and the slope, β₁, is exactly 0. (This is sometimes called the "intercept-only" model, for obvious reasons.) The alternative is that all of the simple linear regression assumptions hold, with β₁ ∈ R. The alternative, non-zero-slope model will always fit the data better than the null, intercept-only model; the F test asks if the improvement in fit is larger than we'd expect under the null.

There are situations where it is useful to know about this precise quantity, and so to run an F test on the regression. It is hardly ever, however, a good way to check whether the simple linear regression model is correctly specified, because neither retaining nor rejecting the null gives us information about what we really want to know.

Suppose first that we retain the null hypothesis, i.e., we do not find any significant share of variance associated with the regression. This could be because (i) the intercept-only model is right, or (ii) β₁ ≠ 0 but the test doesn't have enough power to detect departures from the null. We don't know which it is. There is also the possibility that the real relationship is nonlinear, but the best linear approximation to it has slope (nearly) zero, in which case the F test will have no power to detect the nonlinearity.

Suppose instead that we reject the null, intercept-only hypothesis. This does not mean that the simple linear model is right. It means that the latter model predicts better than the intercept-only model, too much better to be due to chance. The simple linear regression model can be absolute garbage, with every single one of its assumptions flagrantly violated, and yet still be better than the model which makes all those assumptions and thinks the optimal slope is zero.

Neither the F test of β₁ = 0 vs. β₁ ≠ 0 nor the Wald/t test of the same hypothesis tells us anything about the correctness of the simple linear regression model. All these tests presume the simple linear regression model with Gaussian noise is true, and check a special case (flat line) against the general one (tilted line). They do not test linearity, constant variance, lack of correlation, or Gaussianity.
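As a rough illustration of this point, consider the following sketch (simulated data, with an invented nonlinear truth): the simple linear model is clearly misspecified, yet the F test rejects the intercept-only null overwhelmingly, simply because a tilted line predicts better than a flat one.

# Illustrative simulation: the truth is nonlinear, so the linear model is wrong
set.seed(2)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + 0.3 * x + rnorm(n, sd = 0.2)
fit <- lm(y ~ x)
anova(fit)                 # huge F, tiny p-value: the flat model is rejected
plot(x, y); abline(fit)    # but the straight line is visibly not the right model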

4 R²

Another quantity that gets mentioned a lot in regression (and which is also a historical relic) is R². It is defined by

R² = SS_reg / SS_total.

It is often described as the fraction of variability explained by the regression. It can be shown that R² = r², where

r = Cov(X, Y) / (s_X s_Y);

in other words, R² is the squared correlation coefficient. R² will be 0 when the estimated slope β̂₁ = 0. On the other hand, if all the residuals are zero, then R² = 1. It is not too hard to show that R² can't possibly be bigger than 1, so we have marked out the limits: a sample slope of 0 gives an R² of 0, the lowest possible, and all the data points falling exactly on a straight line gives an R² of 1, the largest possible.

What does R² converge to as n → ∞? The population version is

R² = Var[m(X)] / Var[Y]                          (1)
   = Var[β₀ + β₁X] / Var[β₀ + β₁X + ε]           (2)
   = Var[β₁X] / Var[β₁X + ε]                     (3)
   = β₁² Var[X] / (β₁² Var[X] + σ²).             (4)

Since all our parameter estimates are consistent, and this formula is continuous in all the parameters, the R² we get from our estimates will converge on this limit.
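Here is a quick numerical check of Eq. 4, as a sketch with made-up parameter values: simulate a large sample from a correctly specified linear model and compare the sample R² to the population limit.

# Illustrative check of the limiting R^2; parameter values are arbitrary
set.seed(3)
n <- 1e5
beta0 <- 2; beta1 <- 0.5; sigma <- 3
x <- runif(n, 0, 10)                 # Var[X] = 100/12 for Uniform(0, 10)
y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
summary(lm(y ~ x))$r.squared         # sample R^2
cor(x, y)^2                          # same thing: R^2 is the squared correlation (Section 5)
beta1^2 * (100/12) / (beta1^2 * (100/12) + sigma^2)   # population limit, about 0.19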

Unfortunately, a lot of myths about R² have become endemic in the scientific community, and it is vital at this point to immunize you against them.

1. The most fundamental is that R² does not measure goodness of fit.
(a) R² can be arbitrarily low when the model is completely correct. Look at Eq. 4. By making Var[X] small, or σ² large, we drive R² towards 0, even when every assumption of the simple linear regression model is correct in every particular.
(b) R² can be arbitrarily close to 1 when the model is totally wrong. There is, indeed, no limit to how high R² can get when the true model is nonlinear. All that's needed is for the slope of the best linear approximation to be non-zero, and for Var[X] to get big.

2. R² is also pretty useless as a measure of predictability.
(a) R² says nothing about prediction error. R² can be anywhere between 0 and 1 just by changing the range of X. Mean squared error is a much better measure of how good predictions are; better yet are estimates of out-of-sample error, which we'll cover later in the course.
(b) R² says nothing about interval forecasts. In particular, it gives us no idea how big prediction intervals, or confidence intervals for m(x), might be.

3. R² cannot be compared across data sets.

4. R² cannot be compared between a model with an untransformed Y and one with a transformed Y, or between different transformations of Y.

5. The one situation where R² can be compared is when different models are fit to the same data set with the same, untransformed response variable. Then increasing R² is the same as decreasing in-sample MSE (by the definition of R² above). In that case, however, you might as well just compare the MSEs.

6. It is very common to say that R² is "the fraction of variance explained" by the regression. But it is also extremely easy to devise situations where R² is high even though neither variable could possibly explain the other.

At this point, you might be wondering just what R² is good for: what job does it do that isn't better done by other tools? The only honest answer I can give you is that I have never found a situation where it helped at all. If I could design the regression curriculum from scratch, I would never mention it. Unfortunately, it lives on as a historical relic, so you need to know what it is, and what misunderstandings about it people suffer from.
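To see point 2(a) in action, here is a sketch with invented numbers: the same linear relationship and the same noise level yield wildly different R² values depending only on how spread out X is, while the in-sample MSE stays near σ² in both cases.

# Same model and noise; only the range of X changes (illustrative values)
set.seed(4)
sim.r2 <- function(x.max, n = 1000, beta1 = 0.5, sigma = 3) {
  x <- runif(n, 0, x.max)
  y <- 2 + beta1 * x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  c(r.squared = summary(fit)$r.squared, mse = mean(residuals(fit)^2))
}
sim.r2(x.max = 2)     # narrow range of X: R^2 near 0
sim.r2(x.max = 200)   # wide range of X: R^2 near 1, MSE essentially unchanged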

5 The Correlation Coefficient

As you know, the correlation coefficient between X and Y is

ρ_XY = Cov[X, Y] / √(Var[X] Var[Y]),

which lies between -1 and 1. It takes its extreme values when Y is a linear function of X. Recall, from lecture 1, that the slope of the ideal linear predictor is

β₁ = Cov[X, Y] / Var[X],

so

ρ_XY = β₁ √(Var[X] / Var[Y]).

As we saw, R² is just ρ²_XY.

6 Concluding Comment

The tone I have taken when discussing F tests, R² and correlation has been dismissive. This is deliberate, because they are grossly abused and over-used in current practice, especially by non-statisticians, and I want you to be too proud (or too ashamed) to engage in those abuses. In a better world, we'd just skip over them, but you will have to deal with colleagues, and bosses, who learned their statistics in the bad old days, and so you have to understand what they're doing wrong.

In all fairness, the people who came up with these tools were great scientists, struggling with very hard problems when nothing was clear; they were inventing all the tools and concepts we take for granted in a class like this. Anyone in this class, me included, would be doing very well to come up with one idea over the whole of our careers which is as good as R². But we best respect our ancestors, and the tradition they left us, when we improve that tradition where we can. Sometimes that means throwing out the broken bits.