Regression Diagnostics Procedures


ASSUMPTIONS UNDERLYING REGRESSION/CORRELATION
NORMALITY OF Y FOR EACH VALUE OF X: For any fixed value of the independent variable X, the distribution of the dependent variable Y is normal.
NORMALITY OF THE ERROR TERM: The error term is normally distributed. (Many authors argue that this is more important than normality in the distribution of Y.)
INDEPENDENCE OF X AND THE ERROR TERM: The independent variable is uncorrelated with the error term.

ASSUMPTIONS UNDERLYING REGRESSION/CORRELATION (Continued)
HOMOSCEDASTICITY: The variance of Y is assumed to be equal for each fixed value of X.
LINEARITY: The relationship between X and Y is linear.
INDEPENDENCE: The Y's are statistically independent of each other.

Graph the Distributions of the Dependent and Independent Variables. They should be roughly normally distributed. If a given variable is not normally distributed, you may wish to consider transforming it (e.g., a log transformation). If there is a serious violation of normality, you might consider dropping the variable from the analysis (e.g., if it is an independent variable and not theoretically crucial), or you might try another technique (e.g., nonlinear regression, logistic regression).

Some Tips for Transforming Data:
To correct positive skew, a stronger transformation: -1/X²; some mild transformations: -1/X, log X, √X.
No shape change required: X.
To correct negative skew, some mild transformations: X², X³; a stronger transformation: antilog X.
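
The ladder above is easiest to see in action. Below is a minimal Python sketch (not part of the original slides) that measures skewness before and after the positive-skew transformations; the lognormal variable x is a hypothetical stand-in for real data.

```python
import numpy as np
from scipy.stats import skew

# Hypothetical positively skewed variable (e.g., income).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

print(f"skew before: {skew(x):.2f}")  # strongly positive

# Ladder-of-powers candidates for positive skew, mild to strong.
# (log and reciprocal transformations require strictly positive values.)
candidates = {
    "sqrt(X)": np.sqrt(x),
    "log(X)":  np.log(x),
    "-1/X":    -1.0 / x,
    "-1/X^2":  -1.0 / x**2,
}
for name, xt in candidates.items():
    print(f"skew after {name}: {skew(xt):.2f}")
```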

Plotting to test the assumptions of Multiple Regression Analysis

Examine the Scatterplots of each X by Y. The assumption is that the bivariate relationship will be roughly linear. If it is not linear, you might consider transforming one (or both) of the variables.
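
As an illustration, here is a hedged Python sketch of such scatterplots; the data frame df and its column names are hypothetical stand-ins for a real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: two predictors and a response named "y".
rng = np.random.default_rng(8)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 2.0 * df["x1"] - df["x2"] + rng.normal(size=200)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, col in zip(axes, ["x1", "x2"]):
    ax.scatter(df[col], df["y"], s=10)  # look for a roughly linear cloud
    ax.set_xlabel(col)
    ax.set_ylabel("y")
fig.tight_layout()
plt.show()
```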

Examine the Plots of the Residuals by Each X. The assumption is that the residuals are equally distributed at each value of X, and that the slope of the regression line = 0. If not, you might consider a transformation, or a different form of the equation.

Examine the plot of the standardized residuals by the predicted value of Y. The slope should be 0, and the residuals should spread out evenly at different levels of the predicted Y.
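
A minimal Python sketch of this plot using statsmodels (the slides themselves describe SPSS output; the toy data here is illustrative only):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Toy data standing in for a real dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=200)

results = sm.OLS(y, sm.add_constant(X)).fit()
std_resid = results.get_influence().resid_studentized_internal

plt.scatter(results.fittedvalues, std_resid, s=10)
plt.axhline(0.0, linestyle="--")  # residuals should hover around 0 with even spread
plt.xlabel("Predicted Y")
plt.ylabel("Standardized residual")
plt.show()
```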

Examine the Normal P-P Plot. This is the normal probability plot of the standardized residuals. It is used to check normality. If the variable is normally distributed, the plotted points form a straight diagonal line.
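
statsmodels can produce this plot directly; the sketch below assumes the same kind of toy data as above and is illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=200)
results = sm.OLS(y, sm.add_constant(X)).fit()

std_resid = results.get_influence().resid_studentized_internal

# P-P plot of the standardized residuals against the normal distribution;
# points lying on the 45-degree line indicate approximate normality.
sm.ProbPlot(std_resid).ppplot(line="45")
plt.show()
```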

Use a histogram to examine the distribution of the residuals. Similar to looking at the Normal P-P Plot. The residuals should be roughly normally distributed. Keep an eye out in particular for severe outliers.

Checking for Constant Variance. It is assumed that variance is the same across all values of the independent variable. Plot the standardized (or studentized) residuals against the predicted values; there should be equal spread, and the slope should be 0. Plot predicted Y by observed Y; there should be a linear association, with equal spread below and above the regression line.

Examine the partial plots of each X by Y. Again, the assumption is that the partial relationship will be roughly linear. If it is not linear, you might consider transforming one of the variables.
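
If you are working in Python rather than SPSS, statsmodels offers partial regression (added-variable) plots; a hedged sketch with toy data:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=200)
results = sm.OLS(y, sm.add_constant(X)).fit()

# Each panel shows the relationship between Y and one X after
# partialling out the other predictors; look for rough linearity.
fig = plt.figure(figsize=(8, 4))
sm.graphics.plot_partregress_grid(results, fig=fig)
plt.show()
```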

Independence of the Y's. It is assumed that the Y's are independent from one another. This may not be the case if data collection occurred over time, or if the dependent variable is somehow related to time. You can check for potential problems of this nature by: 1. Examining the Durbin-Watson statistic. One of the assumptions of regression analysis is that the residuals for consecutive observations are uncorrelated. If this is true, the expected value of the Durbin-Watson statistic is 2. Values less than 2 indicate positive autocorrelation, a common problem in time-series data; values greater than 2 indicate negative autocorrelation. 2. Plotting the residuals by the sequence of the observations.
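
A minimal sketch of computing the Durbin-Watson statistic in Python with statsmodels, assuming toy data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=200)
results = sm.OLS(y, sm.add_constant(X)).fit()

dw = durbin_watson(results.resid)
print(f"Durbin-Watson: {dw:.2f}")  # ~2: no autocorrelation;
                                   # <2: positive, >2: negative autocorrelation
```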

Problems of Multicollinearity? If an independent variable is strongly associated with another independent variable (collinearity, very high r²), or if an independent variable is a strong linear function of the other independent variables in a regression model (multicollinearity, very high R²), then problems may arise in the estimation of regression coefficients. Notably, multicollinearity inflates the standard errors of estimated regression coefficients, and can cause other problems (coefficients with the wrong sign, or dramatic changes in the sign and size of a coefficient when another variable is added to the equation).

Detecting Collinearity/Multicollinearity. Examine the intercorrelation matrix. Very high values of r indicate a potential problem.

Examine the tolerance. For each independent variable, the tolerance is the proportion of variability of that variable that is not explained by its linear relationship with the other independent variables in the model (1 - R²). Tolerance can range from 0 to 1. A value close to 1 indicates that an independent variable has little of its variability explained by the other independent variables. A value close to 0 indicates that a variable is almost a linear combination of the other independent variables. Tolerances of less than 0.1 may be a problem.

You can also look at the Variance Inflation Factor (VIF). This is the reciprocal of the tolerance. As the variance inflation factor increases, so does the variance of the regression coefficient, making it an unstable estimate. Large VIF values are an indicator of multicollinearity; a common rule of thumb flags VIF values above 10, the counterpart of the tolerance cutoff of 0.1 above.
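
A hedged Python sketch of computing VIF and tolerance with statsmodels; the deliberately collinear variables x1 and x2 are illustrative:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)  # deliberately collinear with x1
x3 = rng.normal(size=200)
exog = sm.add_constant(np.column_stack([x1, x2, x3]))

# Skip index 0 (the constant); report VIF and tolerance per predictor.
for i, name in enumerate(["x1", "x2", "x3"], start=1):
    vif = variance_inflation_factor(exog, i)
    print(f"{name}: VIF = {vif:.1f}, tolerance = {1.0 / vif:.3f}")
```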

Solutions to Multicollinearity. One solution is to omit a problem variable from the analysis. Another, if you have several variables that are conceptually related and highly intercorrelated, is to consider combining them into an index.

Examine the Casewise Statistics. Are there outliers? If yes, how large are they (what is their standardized value)? Standardized residuals greater than 3.0 in absolute value are unlikely to occur by chance if the residuals are normally distributed.
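
A short illustrative Python sketch of flagging such cases; the planted outlier is for demonstration only:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=200)
y[10] += 8.0  # plant one gross outlier for illustration
results = sm.OLS(y, sm.add_constant(X)).fit()

std_resid = results.get_influence().resid_studentized_internal
outliers = np.flatnonzero(np.abs(std_resid) > 3.0)
print("cases with |standardized residual| > 3:", outliers)
```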

Looking for Influential Points: Influence Statistics. DfBeta(s): the difference in beta value is the change in the regression coefficient that results from the exclusion of a particular case. A value is computed for each term in the model, including the constant. Standardized DfBeta(s): the standardized change in the regression coefficient that results from the exclusion of a particular case. SPSS suggests that you may want to examine cases with absolute values greater than 2 divided by the square root of N, where N is the number of cases. A value is computed for each term in the model, including the constant.

DfFit: the difference in fit value is the change in the predicted value that results from the exclusion of a particular case. Standardized DfFit: the standardized change in the predicted value. SPSS suggests that you may want to examine standardized values which in absolute value exceed 2 times the square root of p/N, where p is the number of independent variables in the equation and N is the number of cases.

Distances. Mahalanobis distance: a measure of how much a case's values on the independent variables differ from the average of all cases. A large Mahalanobis distance identifies a case as having extreme values on one or more of the independent variables. Cook's distance: a measure of how much the residuals of all cases would change if a particular case were excluded from the calculation of the regression coefficients. A large Cook's D indicates that excluding a case from computation of the regression statistics changes the coefficients substantially.

Leverage Values: measure the influence of a point on the fit of the regression. The centered leverage ranges from 0 (no influence on the fit) to (N-1)/N.
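
A hedged Python sketch pulling these influence statistics from statsmodels. Note that statsmodels reports uncentered leverage, whereas SPSS reports centered leverage; the Cook's D and leverage cutoffs below are common rules of thumb, not from the slides.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=200)
results = sm.OLS(y, sm.add_constant(X)).fit()
influence = results.get_influence()

n, p = X.shape  # N cases, p independent variables

dfbetas = influence.dfbetas            # standardized DfBetas, one column per term
dffits, _ = influence.dffits           # standardized DfFit per case
cooks_d, _ = influence.cooks_distance  # Cook's distance per case
leverage = influence.hat_matrix_diag   # (uncentered) leverage values

# Flag cases using the rules of thumb quoted above.
print("large DfBetas:", np.flatnonzero(np.any(np.abs(dfbetas) > 2 / np.sqrt(n), axis=1)))
print("large DfFit:", np.flatnonzero(np.abs(dffits) > 2 * np.sqrt(p / n)))
print("large Cook's D:", np.flatnonzero(cooks_d > 4 / n))           # assumed cutoff
print("high leverage:", np.flatnonzero(leverage > 2 * (p + 1) / n)) # assumed cutoff
```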