Start with review, some new definitions, and pictures on the white board.

Assumptions in the Normal Linear Regression Model

A1: There is a linear relationship between X and Y.
A2: The error terms (and thus the Y's at each X) have constant variance.
A3: The error terms are independent.
A4: The error terms (and thus the Y's at each X) are normally distributed.
Note: In practice, we are looking for a fairly symmetric distribution with no major outliers.

Other things to check (questions to ask):
Q5: Are there any major outliers in the data (in X, or in the combination (X, Y))?
Q6: Are there other possible predictors that should be included in the model?
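As a side illustration (not from the lecture), here is a minimal R sketch that simulates data satisfying A1 through A4; all names and parameter values (n, beta0, beta1, sigma) are made up:

  # Simulate from the normal linear regression model:
  # Y = beta0 + beta1*X + epsilon, epsilon ~ N(0, sigma^2), independent
  set.seed(1)                # for reproducibility
  n     <- 50
  x     <- runif(n, 0, 10)   # predictor values
  beta0 <- 2; beta1 <- 1.5; sigma <- 1
  y     <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)
  fit   <- lm(y ~ x)         # fit by ordinary least squares
  summary(fit)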

Useful Plots for Checking Assumptions and Answering These Questions

Reminders: Residual = e_i = Y_i − Ŷ_i = observed Y_i − predicted Y_i, where predicted Y_i = Ŷ_i = β̂_0 + β̂_1 x_i, also called fitted Y_i.

Definitions and formulas; we will use R to compute these:

Standardized residual for unit i is

  e_i* = e_i / standard error(e_i) = e_i / √(MSE(1 − h_ii))

(h_ii is defined later in the course, but it is usually close to 0, so the denominator ≈ √MSE.)

Externally studentized residual r_i for unit i is the same, except use the MSE from the model fit without unit i.

Plot                                         Useful for
Dotplot, stemplot, histogram of X            Q5: Outliers in X; range of X values
Residuals e_i versus X_i or predicted Ŷ_i    A1: Linear; A2: Constant variance; Q5: Outliers
e_i* or r_i versus X_i or predicted Ŷ_i      As above, but a better check for outliers
Dotplot, stemplot, histogram of e_i          A4: Normality assumption
Residuals e_i versus time (if measured)      A3: Dependence across time
Residuals e_i versus other predictors        Q6: Predictors missing from model
Normal probability plot of residuals         A4: Normality assumption
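As a quick illustration (not from the lecture), these quantities and plots might be produced in R along these lines; the data frame MyData and its columns x and y are placeholder names:

  fit    <- lm(y ~ x, data = MyData)  # fitted simple linear regression
  e      <- resid(fit)                # ordinary residuals e_i
  e.std  <- rstandard(fit)            # internally standardized residuals e_i*
  r.stud <- rstudent(fit)             # externally studentized residuals r_i
  hist(MyData$x)                      # Q5: outliers in X; range of X
  plot(fitted(fit), e)                # A1 linearity, A2 constant variance, Q5
  plot(fitted(fit), r.stud)           # same, but a better check for outliers
  hist(e)                             # A4: normality
  qqnorm(e); qqline(e)                # A4: normal probability plot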

Example: Highway sign data. Plots of residuals versus predicted ("fitted") values and residuals versus Age.

NOTE: The plot of residuals versus the predictor variable X should look the same as the plot of residuals versus fitted values except for the scale on the X axis, because fitted values are a linear transform of the X's. However, when the slope is negative, one will be a mirror image of the other.

[Figures: residuals vs fitted values; residuals vs Age]

Comments: These are good residual plots. Points look randomly scattered around 0. No evidence of a nonlinear pattern or unequal variances.
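To see the mirror-image effect concretely, here is a small R sketch using simulated data with a negative slope (the intercept 577 and slope −3 echo the sign-data equation below, but the data here are made up):

  set.seed(2)
  age  <- runif(30, 18, 82)
  dist <- 577 - 3 * age + rnorm(30, sd = 50)  # negative slope, like the sign data
  m <- lm(dist ~ age)
  par(mfrow = c(1, 2))                        # two plots side by side
  plot(fitted(m), resid(m), main = "Residuals vs fitted")
  plot(age, resid(m), main = "Residuals vs age")  # mirror image when slope < 0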

Some other plots of the residuals, used to check the normality assumption

Stemplot of standardized residuals. To generate them in R, for the linear model called HWModel:

> Highway$StResids <- rstandard(HWModel)
> stem(Highway$StResids)

  The decimal point is at the |

  -1 | 6655
  -1 | 0
  -0 | 99866
  -0 | 32
   0 | 1122234
   0 | 56789
   1 | 34
   1 | 6
   2 | 3

Further confirmation that the residuals are relatively symmetric with no major outliers. The standardized residual of 2.3 is for a driver with x = 75 years and y = 460 feet: ŷ = 577 − 3(75) = 352, so the residual = 460 − 352 = 108 feet.

Normal probability plot of residuals and standardized residuals for the highway sign data, also to check normality assumption A4 (see Figure 1.7 in the textbook for examples with normal and quite non-normal residuals).

[Figures: normal probability plot using residuals; normal probability plot using standardized residuals]

An explanation of what these are will be given on the white board. These are pretty good plots. There is one point at the lower end that is slightly off, and might be investigated, but there are no major problems. Note that the plot looks the same using residuals or standardized residuals; it doesn't matter which ones you use.
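In R, a normal probability plot of either kind of residuals can be drawn with the built-in qqnorm and qqline functions (HWModel as above):

> qqnorm(resid(HWModel)); qqline(resid(HWModel))          # ordinary residuals
> qqnorm(rstandard(HWModel)); qqline(rstandard(HWModel))  # standardized residuals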

Example from the textbook of good and bad normal probability plots:

What to do when assumptions aren't met

Assumption 1: Relationship is linear.

How to detect a problem: Plot Y versus X, and also plot residuals versus fitted values or residuals versus X. If you see a pattern, there is a problem with the assumption.

What to do about the problem: Transform the X values, X' = f(X). (Read as "X-prime = a function of X.") Then do the regression using X' instead of X:

  Y = β_0 + β_1 X' + ε, where we still assume the ε are N(0, σ²).

NOTE: Only use this solution if non-linearity is the only problem, not if it also looks like there is non-constant variance or non-normal errors. For those, we will transform Y.

REASON: The errors are in the vertical direction. Stretching or shrinking the X-axis doesn't change those, so if they are normal with constant variance, they will stay that way.

Let's look at what kinds of transformations to use.

Residuals show an inverted-U pattern: use X' = √X or X' = log10(X).

[Figures: scatterplot of Y vs X and residuals versus fitted values (inverted-U pattern); scatterplot of Y vs Sqrt_X and residuals versus fitted values (pattern removed)]

Residuals are U-shaped and the association between X and Y is positive: use X' = X².

[Figures: scatterplot of Yi vs X and residuals versus fitted values (U-shaped pattern); scatterplot of Yi vs X-squared and residuals versus fitted values (pattern removed)]

Residuals are U-shaped and the association between X and Y is negative: use X' = 1/X or X' = exp(−X).

[Figures: scatterplot of Y vs X and residuals versus X; scatterplot of Y vs 1/X and residuals versus 1/X (pattern removed)]
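A minimal sketch of fitting a regression with a transformed X in R; MyData, x, and y are placeholder names, and any of the transformations above can be swapped in:

  fitX <- lm(y ~ sqrt(x), data = MyData)  # X' = sqrt(X), for inverted-U residuals
  # or: lm(y ~ I(x^2), data = MyData)     # X' = X^2   (U-shaped, positive trend)
  # or: lm(y ~ I(1/x), data = MyData)     # X' = 1/X   (U-shaped, negative trend)
  plot(fitted(fitX), resid(fitX))         # recheck the residual plot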

Assumption 2: Constant variance of the errors across X values.

How to detect a problem: Plot residuals versus fitted values. If you see increasing or decreasing spread, there is a problem with the assumption.

Example: Real estate data for n = 522 homes sold in a Midwestern city. Y = sales price (in thousands); X = square feet (in hundreds). (Source: Applied Linear Regression Models.)

[Figures: scatterplot of PriceThsds vs SqFtHdrd (original data); residuals versus SqFtHdrd]

Clearly, the variance is increasing as house size increases!

NOTE: Usually increasing variance and a skewed distribution go together. Here is a histogram of the residuals, with a superimposed normal distribution. Notice the residuals extending to the right.

[Figure: histogram of residuals with superimposed normal curve; long right tail]

What to do about the problem: Transform the Y values, or both the X and Y values.

Example: Real estate sales. Transform the Y values to Y' = ln(Y) = natural log of Y.

[Figures: scatterplot of ln(Price) vs square feet; residuals versus square feet]

It looks like one more transformation might help: use the square root of size. But we will leave it as is for now. See the histogram of residuals below.
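As a rough sketch of this step in R (RealEstate, PriceThsds, and SqFtHdrd are assumed names, following the plot labels above):

  RealEstate$LnPrice <- log(RealEstate$PriceThsds)  # natural log of price
  fitLog <- lm(LnPrice ~ SqFtHdrd, data = RealEstate)
  plot(RealEstate$SqFtHdrd, resid(fitLog))          # check that spread is stabler
  hist(resid(fitLog))                               # check symmetry of residuals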

[Figure: histogram of residuals after the log transformation, with superimposed normal curve]

This looks better: more symmetric and no outliers.

Using models after transformations

Transforming X only: Use the transformed X for future predictions: X' = f(X). Then do the regression using X' instead of X:

  Y = β_0 + β_1 X' + ε, where we still assume the ε are N(0, σ²).

For example, if X' = √X, then the predicted values are:

  Ŷ = β̂_0 + β̂_1 √X

Transforming Y (and possibly X): Everything must be done in transformed values. For confidence intervals and prediction intervals (to be covered next week), get the intervals first and then transform the endpoints back to original units.
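A sketch of that back-transformation in R, using the standard predict() function; fitLog and SqFtHdrd are the assumed names from the real-estate example above:

  new <- data.frame(SqFtHdrd = 20)                         # a 2000 sq ft house
  pi.log <- predict(fitLog, newdata = new,
                    interval = "prediction")               # interval on ln scale
  exp(pi.log)                                              # endpoints back in original units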

Example: Predicting house sales price using square feet. The regression equation is:

  Predicted Ln(Price) = 11.2824 + 0.051(square feet in hundreds)

For a house with 2000 square feet = 20 hundred square feet:

  Ŷ' = 11.2824 + 0.051(20) = 12.3024

So the predicted price = exp(12.3024) = $220,224 (because taking exp reverses a natural log transformation).

Assumption 3: Independent errors

1. The main way to check this is to understand how the data were collected. For example, suppose we wanted to predict blood pressure from the amount of fat consumed in the diet. If we were to sample entire families and treat them as independent, that would be wrong: if one member of the family has high blood pressure, related members are likely to have it as well. Taking a random sample is one way to make sure the observations are independent.

2. If the values were collected over time (or space), it makes sense to plot the residuals versus the order collected, and see if there is a trend or cycle.
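A small sketch of that time-order check in R (assuming a fitted model named fit whose rows are already in collection order):

  plot(resid(fit), type = "b",            # residuals in collection order
       xlab = "Order collected", ylab = "Residual")
  abline(h = 0)                           # look for trends or cycles around 0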

OUTLIERS

Some reasons for outliers:

1. A mistake was made. If it's obvious that a mistake was made in recording the data, or that the person obviously lied, etc., it's okay to throw out an outlier and do the analysis without it. For example, a height of 7 inches is an obvious mistake. If you can't go back and figure out what it should have been (70 inches? 72 inches? 67 inches?), you have no choice but to discard that case.

2. The person (or unit) belongs to a different population, and should not be part of the analysis, so it's okay to remove the point(s). An example: for predicting house prices, if a data set has a few mansions (5000+ square feet) but the other houses are all smaller (1000 to 2500 square feet, say), then it makes sense to predict sales prices for the smaller houses only. In the future when the equation is used, it should be used only for the range of data from which it was generated.

3. Sometimes outliers are simply the result of natural variability. In that case, it is NOT okay to discard them. If you do, you will underestimate the variance.

(Story of Alcohol and Tobacco from DASL)

[Figure: scatterplot of Alcohol vs Tobacco spending]

Notice Northern Ireland in the lower right corner: a definite outlier, based on the combined (X, Y) values. Why is it an outlier? It represents a different religion than the other areas of the UK.

[Figure: residuals versus fitted values (response is Alcohol)]

In the plot of residuals versus fitted values, it's even more obvious that the outlier is a problem.

[Figure: residuals versus Tobacco (response is Alcohol)]

The plot of residuals versus the X variable is very similar to the plot of residuals versus fitted values. Again the problem is obvious.

[Figure: scatterplot of Alcohol vs Tobacco, Northern Ireland removed]

Here is the scatterplot with Northern Ireland removed.

[Figure: residuals versus Tobacco (response is Alcohol), Northern Ireland removed]

Here is a residual plot with Northern Ireland removed.

Notice how much the analysis changes when the outlier is removed. R-Sq/100 is the square of the correlation between Alcohol and Tobacco, so the correlation goes from √0.050 = 0.22 to √0.615 = 0.78. The intercept and slope change too:

With Outlier (Northern Ireland)

The regression equation is Alcohol = 4.35 + 0.302 Tobacco

Predictor   Coef     SE Coef   T      P
Constant    4.351    1.607     2.71   0.024
Tobacco     0.3019   0.4388    0.69   0.509

S = 0.81963   R-Sq = 5.0%   R-Sq(adj) = 0.0%

Without Outlier

The regression equation is Alcohol = 2.04 + 1.01 Tobacco

Predictor   Coef     SE Coef   T      P
Constant    2.041    1.001     2.04   0.076
Tobacco     1.0059   0.2813    3.58   0.007

S = 0.4462   R-Sq = 61.5%   R-Sq(adj) = 56.7%
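A sketch of how such a with/without comparison might be run in R (the data frame Spending and its column Region are assumed names, not from the lecture):

  fit.all  <- lm(Alcohol ~ Tobacco, data = Spending)   # with Northern Ireland
  fit.noNI <- lm(Alcohol ~ Tobacco,
                 data = subset(Spending, Region != "Northern Ireland"))
  summary(fit.all)$r.squared    # about 0.050
  summary(fit.noNI)$r.squared   # about 0.615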

Transformations in R: If you want to transform the response variable Y into some new variable Y', you can add a new column to the data table consisting of the new variable. For the data table named MyData, to square the response variable GPA and add it to the data table, type:

> MyData <- cbind(MyData, MyData$GPA^2)

To take its square root, type:

> MyData <- cbind(MyData, sqrt(MyData$GPA))

To take its natural logarithm, type:

> MyData <- cbind(MyData, log(MyData$GPA))

To take its common logarithm (base 10), type:

> MyData <- cbind(MyData, log10(MyData$GPA))

To take its reciprocal, type:

> MyData <- cbind(MyData, 1/MyData$GPA)

To take its reciprocal square root, type:

> MyData <- cbind(MyData, 1/sqrt(MyData$GPA))

And so on. You will want to give the new column in the data table an appropriate name. You can then run a linear model using the transformed response variable and the original predictor. More in discussion on Friday!
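A final note, as a sketch: you can name the new column directly, or do the transformation inside the model formula; SAT here is a hypothetical predictor name:

> MyData$LnGPA <- log(MyData$GPA)           # adds a named column in one step
> fit <- lm(LnGPA ~ SAT, data = MyData)

or equivalently

> fit <- lm(log(GPA) ~ SAT, data = MyData)  # transform inside the formula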