Chapter 3. Diagnostics and Remedial Measures

So far, we took data $(X_i, Y_i)$ and we assumed

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \qquad i = 1, 2, \ldots, n,$$

where $\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$; $\beta_0$, $\beta_1$, and $\sigma^2$ are unknown parameters; and the $X_i$'s are fixed constants.

Question: What are the possible mistakes or violations of these assumptions?

1. The regression function is not linear: $E(Y) \neq \beta_0 + \beta_1 X$.
2. The error terms do not have constant variance.
3. The error terms are not independent.

We will use residual plots to diagnose these problems.

Residuals: $e_i = Y_i - \hat{Y}_i = Y_i - (b_0 + b_1 X_i)$
Sample mean: $\bar{e} = \frac{1}{n} \sum_i e_i = 0$
Sample variance: $\frac{1}{n-1} \sum_i (e_i - \bar{e})^2 = \frac{1}{n-1} \sum_i e_i^2 \approx MSE$

We will sometimes use standardized (semistudentized) residuals, $e_i^* = e_i / \sqrt{MSE}$.
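As a quick illustration, here is a minimal sketch in Python (with numpy; the data values are made up for illustration) of fitting the least-squares line and computing these residual quantities:

```python
import numpy as np

# Made-up illustrative data (not from the notes)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 9.9, 12.3])
n = len(x)

# Least-squares estimates b1 (slope) and b0 (intercept)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Residuals e_i = Y_i - (b0 + b1 X_i); their sample mean is 0 by construction
e = y - (b0 + b1 * x)
mse = np.sum(e ** 2) / (n - 2)   # MSE = SSE / (n - 2)
e_star = e / np.sqrt(mse)        # semistudentized residuals

print(e.mean(), mse, e_star)
```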

Nonlinearity of Regression Function (1.) Use a residual plot against the predictor variable, $X$, or a residual plot against the fitted values, $\hat{Y}$. Look for systematic tendencies! Example:

[Figure: plant growth vs. water/week, with the residual plot against water/week showing a systematic pattern: $e_i < 0$, then $e_i > 0$, then $e_i < 0$.]

Nonconstancy of Error Variance (2.) We diagnose nonconstant error variance by observing a residual plot against $X$ and looking for structure. Example:

[Figure: entertainment spending vs. salary, with a residual plot against salary whose spread varies with salary.]
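A minimal plotting sketch for such diagnostics (assuming matplotlib; `residual_plot` is a hypothetical helper name, and `e` is the residual vector from the earlier sketch):

```python
import matplotlib.pyplot as plt

def residual_plot(x, e, xlabel="X"):
    """Plot residuals against a predictor (or against fitted values);
    look for curvature (nonlinearity) or a funnel shape (nonconstant
    variance)."""
    plt.scatter(x, e)
    plt.axhline(0.0, linestyle="--", color="gray")
    plt.xlabel(xlabel)
    plt.ylabel("residual")
    plt.show()
```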

Modified Levene Test
1. Divide the residuals into two groups; for this example, low- and high-salary groups, because the variance is suspected to depend on salary.
2. Calculate $d_{i1} = |e_{i1} - \tilde{e}_1|$ and $d_{i2} = |e_{i2} - \tilde{e}_2|$, where $e_{ij}$ is the $i$th residual in group $j$ and $\tilde{e}_j$ is the median of the residuals in group $j$.
3. Conduct a two-sample $t$-test on the $d_{ij}$ (see the sketch below).
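A minimal sketch of this test (assuming numpy and scipy; how to split the residuals into two groups is up to the analyst, e.g., at the median salary):

```python
import numpy as np
from scipy import stats

def modified_levene(e1, e2):
    """Two-sample t-test on the absolute deviations of the residuals
    from their group medians (modified Levene / Brown-Forsythe)."""
    d1 = np.abs(e1 - np.median(e1))
    d2 = np.abs(e2 - np.median(e2))
    return stats.ttest_ind(d1, d2)   # pooled two-sample t-test on the d_ij

# Example split (illustrative): low- vs. high-salary residuals
# result = modified_levene(e[salary <= np.median(salary)],
#                          e[salary > np.median(salary)])
```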

Nonindependence of Error Terms (3.) We diagnose nonindependence of the errors over time, or in some sequence, by observing a residual plot against time (or the sequence) and looking for a trend (see textbook, p. 101, for typical plots). Example:

[Figure: #parts produced vs. #hours worked, with the residual plot against #hours.]

But if the data are collected in sequence,

day 1: $(X_1, Y_1)$
day 2: $(X_2, Y_2)$
$\vdots$
day $n$: $(X_n, Y_n)$

then we can see the effect of learning.

[Figure: the residuals plotted against #hours show no obvious pattern, but plotted against day they show a trend, revealing the learning effect.]

Model fits all but a few observations (4.) Example: LS estimates with 2 outlying points (solid) and without them (dashed). Rule of Thumb: Outliers are detected by observing a plot of the semistudentized residuals $e_i/\sqrt{MSE}$ vs. $X_i$; points far outside the band from $-3$ to $+3$ are suspect.

[Figure: y vs. x with the two fitted lines, and $e_i/\sqrt{MSE}$ vs. x with reference lines at $-3$, $0$, $+3$.]
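A minimal sketch of that rule of thumb (assuming numpy; `flag_outliers` is a hypothetical helper, and the cutoff of 3 is the rough guide above):

```python
import numpy as np

def flag_outliers(e, mse, cutoff=3.0):
    """Indices of observations whose semistudentized residual
    e_i / sqrt(MSE) falls outside +/- cutoff."""
    return np.where(np.abs(e / np.sqrt(mse)) > cutoff)[0]
```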

Errors not normally distributed (5.) We assumed $\epsilon_1, \ldots, \epsilon_n \stackrel{iid}{\sim} N(0, \sigma^2)$, but we can't observe these error terms! We will be convinced that this assumption is reasonable if $e_1, \ldots, e_n$ appear to be iid $N(0, MSE)$. Fact: If $e_1, \ldots, e_n \stackrel{iid}{\sim} N(0, MSE)$, then one can show that the expected value of the $i$th smallest residual is

$$\sqrt{MSE}\; z\!\left(\frac{i - 0.375}{n + 0.25}\right), \qquad i = 1, 2, \ldots, n,$$

where $z(p)$ is the $p$th quantile of the standard normal distribution. Then we have the pairs

residual: $e_{\min}$, expected residual: $\sqrt{MSE}\; z\big(\frac{1 - 0.375}{n + 0.25}\big)$
residual: $e_{\text{2nd smallest}}$, expected residual: $\sqrt{MSE}\; z\big(\frac{2 - 0.375}{n + 0.25}\big)$
$\vdots$
residual: $e_{\max}$, expected residual: $\sqrt{MSE}\; z\big(\frac{n - 0.375}{n + 0.25}\big)$

Notice: If $Y_1, \ldots, Y_4 \stackrel{iid}{\sim} N(0, \sigma^2)$, then $E(Y_1) = \cdots = E(Y_4) = 0$ and $E(\bar{Y}) = 0$, but

$E(Y_{\min}) = \sigma\, z\big(\frac{1 - 0.375}{4 + 0.25}\big) = \sigma\, z(0.147) = -1.05\,\sigma$,
$E(Y_{\text{2nd}}) = \sigma\, z\big(\frac{2 - 0.375}{4 + 0.25}\big) = \sigma\, z(0.382) = -0.30\,\sigma$,
$E(Y_{\text{3rd}}) = \sigma\, z\big(\frac{3 - 0.375}{4 + 0.25}\big) = \sigma\, z(0.618) = +0.30\,\sigma$,
$E(Y_{\max}) = \sigma\, z\big(\frac{4 - 0.375}{4 + 0.25}\big) = \sigma\, z(0.853) = +1.05\,\sigma$.
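A minimal sketch of these expected order statistics (assuming scipy, whose `norm.ppf` plays the role of the quantile function $z(\cdot)$):

```python
import numpy as np
from scipy.stats import norm

def expected_order_stats(n, sd=1.0):
    """Expected values of the n ordered values from iid N(0, sd^2),
    using the approximation sd * z((i - 0.375) / (n + 0.25))."""
    i = np.arange(1, n + 1)
    return sd * norm.ppf((i - 0.375) / (n + 0.25))

print(expected_order_stats(4))  # approx [-1.05, -0.30, +0.30, +1.05]
# Normal probability plot: plot sorted residuals against
# expected_order_stats(n, sd=np.sqrt(mse)); a straight line supports normality.
```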

Points on a straight line: errors are normal (left). Points on a curve: errors are not normal (right).

[Figure: two normal probability plots of semistudentized residuals vs. expected residuals; the left panel is roughly linear, the right panel is curved.]

Omission of important predictors (6.) Example: $X_i$ = #years of education, $Y_i$ = salary.

[Figure: salary vs. #years of education, and semistudentized residuals plotted against #years in the job.]

The structure in the residuals means that a better model would also include #years in the job as a predictor (a Multiple Regression Model).

Lack of Fit Test. Suppose we want to test whether the relationship between $X$ and $Y$ is linear vs. the possibility that it is NOT linear. Test: $H_0\!: E(Y) = \beta_0 + \beta_1 X$ versus $H_a\!:$ not $H_0$. Here, $H_0$ includes the cases where either or both of $\beta_0$ and $\beta_1$ are zero. We can't use this test unless there are multiple $Y$'s observed at at least one value of $X$.

[Figure: observations at levels $X_1, X_2, X_3, X_4$; at $X_2$ the plot marks an observation $Y_{2j}$, the group mean $\bar{Y}_2$, and the fitted value $\hat{Y}_2$ on the line $\hat{E}(Y) = b_0 + b_1 X$.]

Can we use this test when X = day and Y = stock price? Can we use this test when X = weight and Y = height, and those are measured with a super-accurate instrument?

New Notation: $Y$ values are observed at $c$ different levels of $X$, say $X_1, X_2, \ldots, X_c$. There are $n_j$ such $Y$ values, say $Y_{1j}, Y_{2j}, \ldots, Y_{n_j j}$, observed at level $X_j$, $j = 1, 2, \ldots, c$, with $n_j \geq 1$. Let $\bar{Y}_j = \frac{1}{n_j} \sum_i Y_{ij}$ be the average of the $Y$'s at $X_j$ and $\hat{Y}_j = b_0 + b_1 X_j$ the fitted mean under the SLR. The data now look like

at $X_1$: $(X_1, Y_{11}), (X_1, Y_{21}), \ldots, (X_1, Y_{n_1 1})$, with mean $\bar{Y}_1$
at $X_2$: $(X_2, Y_{12}), (X_2, Y_{22}), \ldots, (X_2, Y_{n_2 2})$, with mean $\bar{Y}_2$
$\vdots$
at $X_c$: $(X_c, Y_{1c}), (X_c, Y_{2c}), \ldots, (X_c, Y_{n_c c})$, with mean $\bar{Y}_c$

The less restrictive model puts no structure on the means at each level of $X$ (the full model).

Full model: $Y_{ij} = \mu_j + \epsilon_{ij}$, where $\hat{\mu}_j = \bar{Y}_j$
Reduced model: $Y_{ij} = \beta_0 + \beta_1 X_j + \epsilon_{ij}$

F-test!

Note that $Y_{ij} - \hat{Y}_j = (Y_{ij} - \bar{Y}_j) + (\bar{Y}_j - \hat{Y}_j)$. Let's partition the SSE into 2 pieces, SSE = SSPE + SSLF, where

$$\sum_{j=1}^{c} \sum_{i=1}^{n_j} (Y_{ij} - \hat{Y}_j)^2 = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2 + \sum_{j=1}^{c} \sum_{i=1}^{n_j} (\bar{Y}_j - \hat{Y}_j)^2.$$

If SSPE ≈ SSE, the group means are close to the fitted values. That is, even if we fit the less restrictive model, we can't reduce the amount of unexplained variability. If SSLF ≈ SSE, the group means are far from the fitted values, and the (linear) restriction seems unreasonable. Thus:

Formal Test for: $H_0\!: E(Y) = \beta_0 + \beta_1 X$ versus $H_A\!: E(Y) \neq \beta_0 + \beta_1 X$. Let $MSLF = \frac{SSLF}{c-2}$ and $MSPE = \frac{SSPE}{n-c}$.

Test Statistic:

$$F^* = \frac{\big(SSE(R) - SSE(F)\big)/\big(df_R - df_F\big)}{SSE(F)/df_F} = \frac{(SSE - SSPE)/\big((n-2) - (n-c)\big)}{SSPE/(n-c)} = \frac{SSLF/(c-2)}{SSPE/(n-c)} = \frac{MSLF}{MSPE} \sim F_{c-2,\,n-c} \text{ under } H_0.$$

Rejection Rule: reject $H_0$ if $F^*$ is large, i.e., if $F^* > F(1-\alpha;\, c-2,\, n-c)$; a large $F^*$ means SSLF is large relative to SSPE, so the linear model is bad.
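A minimal end-to-end sketch of this test (assuming numpy and scipy; `lack_of_fit_test` is a hypothetical helper, `x` must contain repeated values so SSPE is well defined, and more than 2 distinct levels are needed for the degrees of freedom):

```python
import numpy as np
from scipy.stats import f as f_dist

def lack_of_fit_test(x, y):
    """F test for lack of fit of the SLR: partition SSE into
    pure error (SSPE) and lack of fit (SSLF), then compare
    MSLF / MSPE to the F(c - 2, n - c) distribution."""
    n = len(x)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    sse = np.sum((y - (b0 + b1 * x)) ** 2)

    levels = np.unique(x)        # the c distinct levels of X
    c = len(levels)
    # SSPE: squared deviations of each Y from its own group mean
    sspe = sum(np.sum((y[x == xj] - y[x == xj].mean()) ** 2) for xj in levels)
    sslf = sse - sspe

    F = (sslf / (c - 2)) / (sspe / (n - c))
    p_value = 1.0 - f_dist.cdf(F, c - 2, n - c)
    return F, p_value
```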

This fits nicely into our ANOVA table:

Source of Variation    SS        df      MS
Regression             SSR       1       MSR
Error                  SSE       n - 2   MSE
  Lack of Fit          SSLF      c - 2   MSLF
  Pure Error           SSPE      n - c   MSPE
Total                  SSTO      n - 1

Example: Suppose that house prices follow a SLR in #bedrooms. The estimated regression function is $\hat{E}(\text{price}/1{,}000) = 37.2 + 43.0\,(\#\text{bedrooms})$.

Variation              SS         df    MS
Regression             62,578     1     62,578
Error                  117,028    91    1,286
  Lack of Fit          4,296      3     1,432
  Pure Error           112,732    88    1,281
Total                  179,606    92

Because $F^* = MSLF/MSPE = 1{,}432/1{,}281 = 1.12 < F(0.95;\, 3,\, 88) = 2.71$, we do not reject $H_0$.

[Figure: price (in $1,000s) vs. #bedrooms (1 to 5) with the fitted line.]
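As a quick check of that critical value (a one-liner, assuming scipy):

```python
from scipy.stats import f
print(f.ppf(0.95, dfn=3, dfd=88))   # approx 2.71
```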

Transformations. Motivation: Consider the function $y = x^2$:

x  : 0  1  2  3  4
y  : 0  1  4  9  16

x² : 0  1  4  9  16
y  : 0  1  4  9  16

[Figure: plotting y against x gives a parabola; plotting y against x² gives a straight line.]

If you have $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ and you know $y = f(x)$, then $(f(x_1), y_1), (f(x_2), y_2), \ldots, (f(x_n), y_n)$ will lie on a straight line. What follows are two situations in which transformations may help:

Situation 1: nonlinear regression function with constant error variance (1.) Note that $E(Y)$ doesn't appear to be a linear function of $X$; that is, the points do not seem to lie on a line. The spread of the $Y$'s at each level of $X$ appears to be constant, however.

Typical remedy: transform $X$. We consider $X' = \sqrt{X}$. Do not transform $Y$, because this will disturb the (constant) spread of the $Y$'s at each level of $X$.

[Figure: Y vs. X is curved; Y vs. sqrt(X) is roughly linear.]

Situation 2: nonlinear regression function with nonconstant error variance (1. with 2.) Note that $E(Y)$ isn't a linear function of $X$, and the variance of the $Y$'s at each level of $X$ is increasing with $X$.

Typical remedy: transform $Y$ (or maybe both $X$ and $Y$). We consider $Y' = \sqrt{Y}$ and hope that both problems are fixed.

[Figure: Y vs. X is curved with increasing spread; sqrt(Y) vs. X is roughly linear with stable spread.]
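A minimal sketch of trying these remedies (assuming numpy; `fit_slr` is a hypothetical helper, and the square-root transforms are just the choices from the two situations above):

```python
import numpy as np

def fit_slr(x, y):
    """Least-squares intercept and slope for a simple linear regression."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Situation 1: transform X, leave Y alone
# b0, b1 = fit_slr(np.sqrt(x), y)
# Situation 2: transform Y (and re-check both diagnostics afterwards)
# b0, b1 = fit_slr(x, np.sqrt(y))
```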

Prototypes for Transforming Y

[Figure: three prototype scatterplot shapes that call for transforming Y.] Try $\sqrt{Y}$, $\log_{10} Y$, or $1/Y$.

Prototypes for Transforming X

[Figure: three prototype scatterplot shapes that call for transforming X.] Use $\sqrt{X}$ or $\log_{10} X$ (left); $X^2$ or $\exp(X)$ (middle); $1/X$ or $\exp(-X)$ (right).