STAT 704: Checking Model Assumptions


Recall we assumed the following in our model:

(1) The regression relationship between the response and the predictor(s) specified in the model is appropriate.
(2) The errors have mean zero.
(3) The errors have constant variance.
(4) The errors are normally distributed.
(5) The errors are independent.

We cannot observe the true errors $\varepsilon_1, \ldots, \varepsilon_n$, but we can observe the residuals $e_1, \ldots, e_n$. The assumptions are typically checked via a combination of plots and formal tests.

Assumption (1): Sometimes checked by a scatterplot of Y against X (or a scatterplot matrix of $Y, X_1, \ldots, X_k$). We look for patterns other than the one specified. More generally, we can examine a residual plot: look for a non-random (especially curved) pattern, which indicates a violation of Assumption (1).

Remedies: Choose a different functional form for the model, or use a transformation of the X variable(s). In multiple regression, separate plots of the residuals against each predictor can be useful for determining which X variable may need transforming.
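
A minimal sketch of these plots in R (the data and the object name fit are hypothetical, invented for illustration):

    # Hypothetical data standing in for a real response and predictor
    set.seed(1)
    x <- runif(50, 0, 10)
    y <- 3 + 2 * x + rnorm(50)
    fit <- lm(y ~ x)

    # Scatterplot of Y against X to check the assumed functional form
    plot(x, y)

    # Residuals vs fitted values: look for curvature or other non-random pattern
    plot(fitted(fit), resid(fit))
    abline(h = 0, lty = 2)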

A formal lack of fit test is available (see Section 3.7), but it requires replicate observations at one or more levels of the X variables (often not applicable when one or more predictors are continuous).

Assumption (2): Not checked separately; the residuals have mean zero by definition.

Assumption (3): Often the most worrisome assumption. A violation is indicated by a megaphone or funnel shape in the residual plot.

Remedy: Transform the Y variable, e.g., use $Y_i^* = \sqrt{Y_i}$ or $Y_i^* = \ln(Y_i)$.

Advanced method: Weighted Least Squares (we will see this in Chapter 11).

Formal tests for nonconstant error variance are available:

Breusch-Pagan Test: Tests whether the error variance increases or decreases linearly with the predictor(s). $H_0$ specifies that the error variance is constant. Requires a large sample and assumes the errors are normally distributed.
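
A sketch of the Breusch-Pagan test in R, reusing the hypothetical fit from above (the lmtest package must be installed):

    library(lmtest)  # provides bptest()

    # Breusch-Pagan test: H0 is that the error variance is constant
    bptest(fit)      # a small p-value suggests nonconstant error variance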

Brown-Forsythe Test: Robust to non-normal errors. Requires the user to break the data into groups and tests for constancy of the error variance across the groups. Not natural for data with continuous predictors.

Graphical methods have the advantage of checking for general violations, not just violations of a specific type.

Assumption (4): Graphical approach: look at a normal Q-Q plot of the residuals. A violation is indicated by a severely curved Q-Q plot.

Remedies: Transformations of Y and/or X; nonparametric methods.

Formal test for error non-normality: The Shapiro-Wilk test (implemented in R and SAS) tests for normality. The test is based on the correlation between the ordered residuals and their expected values when the errors are normal.

Example (Studio data):

Note: With large sample sizes, the normality assumption is not critical.

Note: A formal test will not indicate the type of departure from normality.
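
A minimal R sketch of both checks, continuing with the hypothetical fit from above:

    # Normal Q-Q plot of the residuals; severe curvature indicates non-normality
    qqnorm(resid(fit))
    qqline(resid(fit))

    # Shapiro-Wilk test: H0 is that the errors are normally distributed
    shapiro.test(resid(fit))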

Assumption (5): Typically only a concern when the data are gathered over time. A violation is indicated by a pattern in the residuals plotted against time.

Remedies: Include a time variable as a predictor, or use time series methods.

Transformations of Variables (Section 3.9)

Some violations of our model assumptions may be alleviated by working with transformed data.

If the only problem is a nonlinear relationship between Y and the X's, a transformation of one or more X's is preferred. Possible transformations: see the diagrams in Figure 3.13, p. 130.

If there is evidence of nonnormality or nonconstant error variance, a transformation of Y (and possibly also X) is often useful. Examples:

Note: If the error variance is nonconstant but the linear relationship is fine, then transforming only Y may disturb the linearity. We may need to transform X also.

The Box-Cox procedure provides an automatic way to determine the optimal transformation of the type $Y^* = Y^\lambda$ (with $\lambda = 0$ corresponding to $Y^* = \ln Y$).
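
A sketch of the Box-Cox procedure in R, reusing the hypothetical fit from above (the MASS package must be installed, and the response must be positive):

    library(MASS)  # provides boxcox()

    # Plots the profile log-likelihood of lambda; pick a convenient value
    # near the maximum (lambda = 0 corresponds to the log transformation)
    bc <- boxcox(fit, lambda = seq(-2, 2, 0.1))
    bc$x[which.max(bc$y)]  # lambda maximizing the likelihood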

Note: When working with transformed data, predictions and interpretations of the regression coefficients are all in terms of the transformed variables. To state conclusions in terms of the original variables, we typically need to apply a reverse transformation.

Example (surgical unit data):

Extra Sums of Squares and Related F-tests

An extra sum of squares is the difference in SSE between a model with a few predictors and a model with those predictors plus some others.

Recall: As predictors are added to the model, SSE can only decrease (or stay the same).

Example: The predictors under consideration are $X_1, \ldots, X_8$. Two possible models: a full model containing all of $X_1, \ldots, X_8$, and a reduced model omitting $X_2, X_4, X_7$.

Book's notation for this: $SSR(X_2, X_4, X_7 \mid X_1, X_3, X_5, X_6, X_8) = SSE(X_1, X_3, X_5, X_6, X_8) - SSE(X_1, \ldots, X_8)$.

Why important? We can formally test whether a certain set of predictors is useless in the presence of the other predictors in the model.

Question: Are $X_2, X_4, X_7$ needed if the other predictors are in the model? We want our model to have large SSR and small SSE. (Why?) If the full model has a much lower SSE than the reduced model (without $X_2, X_4, X_7$), then at least one of $X_2, X_4, X_7$ is needed.

To test $H_0\colon \beta_2 = \beta_4 = \beta_7 = 0$, use

$$F^* = \frac{[SSE(R) - SSE(F)]/(df_R - df_F)}{SSE(F)/df_F}.$$

Reject $H_0$ if $F^* > F(1-\alpha;\, df_R - df_F,\, df_F)$.

Example above:

Note: The t-tests for individual coefficients are examples of this type of test.

Example:
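
A minimal R sketch of this nested-model F-test; the data frame dat and the predictor names are hypothetical:

    # Simulated stand-in for a data set with response y and predictors x1..x8
    set.seed(2)
    dat <- as.data.frame(matrix(rnorm(100 * 8), ncol = 8))
    names(dat) <- paste0("x", 1:8)
    dat$y <- with(dat, 1 + x1 + 0.5 * x3 + rnorm(100))

    full    <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, data = dat)
    reduced <- lm(y ~ x1 + x3 + x5 + x6 + x8, data = dat)

    # F-test of H0: beta2 = beta4 = beta7 = 0
    anova(reduced, full)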

To test hypotheses about more than one (but not all) of the coefficients in SAS, use a TEST statement in PROC REG.

Example (Body fat data): Y = amount of body fat, X1 = triceps skinfold thickness, X2 = thigh circumference, X3 = midarm circumference. Is the set of X2, X3 significantly useful if X1 is already in the model?

Multicollinearity

Note: In the body fat example, the overall F-test was highly significant, but the individual t-tests for each coefficient were not. A paradoxical conclusion! Reason? The predictors are strongly correlated: the correlation coefficient between triceps thickness and thigh circumference is very high.
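
A small R simulation (entirely hypothetical data) illustrating how strongly correlated predictors can produce a significant overall F-test alongside non-significant individual t-tests:

    set.seed(3)
    n  <- 30
    x1 <- rnorm(n)
    x2 <- x1 + rnorm(n, sd = 0.05)  # x2 is nearly identical to x1
    y  <- 2 + x1 + x2 + rnorm(n)

    fit2 <- lm(y ~ x1 + x2)
    summary(fit2)  # overall F significant; individual t-tests typically not
    cor(x1, x2)    # correlation between the predictors is near 1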

This condition is known as multicollinearity among the predictors. With uncorrelated predictors, the model can show us the individual effect of each predictor on the response. When predictors are correlated, it is difficult to separate the effects of the individual predictors.

Effects of Multicollinearity

(1) The model may still provide a good fit, with precise prediction of the response and estimation of the mean response.
(2) The estimated regression coefficients (b1, b2, ...) will have large variances. This leads to the conclusion that individual predictors are not significant, even though the overall F-test may be highly significant.
(3) The concept of holding all other X variables constant doesn't make sense in practice.
(4) The signs of the estimated regression coefficients may seem opposite of intuition.

Detecting Multicollinearity

For each predictor $X_j$, its Variance Inflation Factor (VIF) is

$$VIF_j = \frac{1}{1 - R_j^2},$$

where $R_j^2$ is the coefficient of determination from regressing $X_j$ on the other predictors. For any predictor $X_j$, $VIF_j \ge 1$; a VIF much larger than 1 (a common rule of thumb is greater than 10) indicates serious multicollinearity.
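
A sketch of computing VIFs directly from this definition in base R, continuing with the hypothetical x1, x2 from the simulation above (the vif() function in the car package gives the same values):

    # VIF for each predictor: regress it on the others, then 1 / (1 - R^2)
    vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)
    vif_x2 <- 1 / (1 - summary(lm(x2 ~ x1))$r.squared)
    c(vif_x1, vif_x2)  # values far above 10 signal serious multicollinearity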

Remedies for Multicollinearity

(1) Drop one or more predictors from the model.
(2) More advanced method: ridge regression.
(3) More advanced: principal components regression.

Polynomial Regression

Used when the relationship between Y and the predictor(s) is curvilinear.

Example: Quadratic regression (one predictor): $Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$.

Note: Usually in polynomial regression, all predictors are first centered by subtracting the sample mean (of the predictor values) from each X-value, i.e., $x_i = X_i - \bar{X}$. This reduces the multicollinearity between the linear and higher-order terms. Another option: use orthogonal polynomials, which are uncorrelated.
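
A minimal R sketch of a centered quadratic fit (the data are hypothetical); poly() provides the orthogonal-polynomial alternative:

    # Hypothetical curvilinear data
    set.seed(4)
    X <- runif(60, 0, 10)
    Y <- 1 + 0.5 * X + 0.3 * X^2 + rnorm(60)

    xc <- X - mean(X)                 # center the predictor
    quad_fit <- lm(Y ~ xc + I(xc^2))  # quadratic regression on centered x
    summary(quad_fit)

    # Alternative: orthogonal polynomials (uncorrelated terms)
    quad_orth <- lm(Y ~ poly(X, 2))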

Example: Cubic regression (one predictor): $Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \varepsilon_i$.

Polynomials of order higher than cubic should rarely be used. High-order polynomials may be excessively wiggly and erratic for both interpolation and extrapolation.

Example:

Polynomial Regression, More than One Predictor

In the case of multiple predictors, all cross-product terms must be included in the model.

Example: Quadratic regression (two predictors): $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1}^2 + \beta_4 x_{i2}^2 + \beta_5 x_{i1} x_{i2} + \varepsilon_i$.
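
A sketch of the two-predictor quadratic model in R, again with hypothetical data:

    # Hypothetical second-order response surface in two predictors
    set.seed(5)
    x1 <- runif(80, 0, 5)
    x2 <- runif(80, 0, 5)
    y  <- 2 + x1 + x2 + 0.5 * x1^2 - 0.4 * x2^2 + 0.3 * x1 * x2 + rnorm(80)

    # Linear, squared, and cross-product terms all included
    surf_fit <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2)
    summary(surf_fit)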

Notes:

(1) These models are all special cases of the general linear model, so we can fit them with least squares as usual.
(2) A model containing a particular term should also contain all terms of lower order: for example, a model with an $x^2$ term should also include the $x$ term.
(3) Extrapolation is particularly dangerous with polynomial models. Examples:
(4) A common approach is to fit a high-order model and then test (with a t-test or F-test) whether a lower-order model is sufficient, as sketched below. Example:
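
A minimal R sketch of step (4), continuing with the hypothetical X, Y, and xc from the polynomial example above:

    # Fit cubic and quadratic models, then test whether the cubic term is needed
    cubic_fit <- lm(Y ~ xc + I(xc^2) + I(xc^3))
    quad_fit  <- lm(Y ~ xc + I(xc^2))
    anova(quad_fit, cubic_fit)  # F-test of H0: cubic coefficient = 0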

Interaction Models

An interaction model is one that includes one or several cross-product terms.

Example (two predictors): $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1} X_{i2} + \varepsilon_i$.

Question: What is the change in mean response for a one-unit increase in $X_1$ (holding $X_2$ fixed)? Example: In the model above, the marginal effect of $X_1$ on the mean response is $\beta_1 + \beta_3 X_2$, which depends on the value of $X_2$. We may see this phenomenon graphically through interaction plots.
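
A short R sketch of fitting an interaction model and computing the marginal effect of x1 at several values of x2 (all names and data hypothetical):

    # Hypothetical data with a genuine interaction
    set.seed(6)
    x1 <- runif(80, 0, 5)
    x2 <- runif(80, 0, 5)
    y  <- 1 + 2 * x1 + x2 + 1.5 * x1 * x2 + rnorm(80)

    int_fit <- lm(y ~ x1 + x2 + x1:x2)  # equivalently: lm(y ~ x1 * x2)
    b <- coef(int_fit)

    # Marginal effect of x1 is b1 + b3 * x2; evaluate at x2 = 1, 3, 5
    b["x1"] + b["x1:x2"] * c(1, 3, 5)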

Example:

Notes: Including interaction terms may lead to multicollinearity problems. Possible remedy: center the predictors before forming the cross-products. Including all pairwise cross-product terms can complicate a model greatly, so we should test whether the interactions are significant.

Graphical Check: Fit the model with no interaction. Plot the residuals from this model against each potential interaction term separately. If the plot shows random scatter, that interaction term is probably not needed.

Formal F-test: Compare the model with the interaction term(s) to the model without them using the extra sums of squares F-test described above.
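
A sketch of both checks in R, continuing with the hypothetical x1, x2, y, and int_fit from the interaction example above:

    # Graphical check: residuals from the no-interaction model vs the
    # potential interaction term; a trend suggests the term is needed
    no_int <- lm(y ~ x1 + x2)
    plot(x1 * x2, resid(no_int))
    abline(h = 0, lty = 2)

    # Formal F-test via extra sums of squares
    anova(no_int, int_fit)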