
Multicollinearity

Read Section 7.5 in the textbook.

Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response. Examples of multicollinear predictors are height and weight of a person, years of education and income, and assessed value and square footage of a home.

Consequences of high multicollinearity:
1. Increased standard errors of the estimates of the β's (decreased reliability).
2. Often confusing and misleading results.
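As a rough illustration of the first consequence, here is a minimal simulation sketch (all values and variable names are made up, not from the course data): with two highly correlated predictors, the standard errors of the slope estimates are much larger than with nearly uncorrelated predictors.

    import numpy as np

    # Hypothetical simulation: compare slope standard errors when the two
    # predictors are nearly uncorrelated vs. highly correlated.
    rng = np.random.default_rng(0)
    n = 100

    def slope_std_errors(rho):
        # Draw predictors with correlation rho and response y = 1 + x1 + x2 + noise.
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = 1 + x1 + x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta_hat
        mse = resid @ resid / (n - X.shape[1])
        cov = mse * np.linalg.inv(X.T @ X)   # estimated covariance of beta_hat
        return np.sqrt(np.diag(cov))[1:]     # standard errors of the two slopes

    print("slope SEs, rho = 0.1:", slope_std_errors(0.1))
    print("slope SEs, rho = 0.98:", slope_std_errors(0.98))  # noticeably larger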

Multicollinearity (cont'd)

What do we mean by confusing results? Suppose that we fit a model with x₁ = income and x₂ = years of education as predictors and y = intake of fruits and vegetables as the response. Years of education and income are correlated, and we expect a positive association of both with intake of fruits and vegetables.

The t tests for the individual β₁, β₂ might suggest that neither of the two predictors is significantly associated with y, while the F test indicates that the model is useful for predicting y. Contradiction? No! This has to do with the interpretation of regression coefficients.

Multicollinearity (cont'd)

β₁ is the expected change in y due to x₁ given that x₂ is already in the model. β₂ is the expected change in y due to x₂ given that x₁ is already in the model.

Since x₁ and x₂ contribute redundant information about y, once one of the predictors is in the model the other one does not have much more to contribute. This is why the F test indicates that at least one of the predictors is important, yet the individual tests indicate that the contribution of a predictor, given that the other one has already been included, is not really important.
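A small simulated sketch of this F-versus-t pattern (all numbers are made up for illustration; the statsmodels package is used only for convenience): the overall F test is highly significant even though neither individual slope is.

    import numpy as np
    import statsmodels.api as sm

    # Simulate "income" (x1) and "years of education" (x2) that are highly
    # correlated, and a response that depends on both.
    rng = np.random.default_rng(1)
    n = 60
    x1 = rng.normal(size=n)
    x2 = 0.97 * x1 + 0.25 * rng.normal(size=n)
    y = 2 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)

    fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    print("overall F p-value:", fit.f_pvalue)        # typically very small
    print("individual t p-values:", fit.pvalues[1:]) # often both above 0.05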

Detecting multicollinearity

Easy way: compute the correlations between all pairs of predictors. If some r are close to 1 or -1, remove one of the two correlated predictors from the model.

Another way: for each predictor xⱼ, calculate the variance inflation factor

    VIFⱼ = 1 / (1 - R²ⱼ),

where R²ⱼ is the coefficient of determination of the regression of xⱼ on all of the other predictors. If VIFⱼ ≥ 10, then there is a problem with multicollinearity.

JMP: Right-click on the Parameter Estimates table, then choose Columns and then choose VIF.
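A minimal sketch of the VIF computation, assuming a numeric predictor matrix X whose columns are the predictors (no intercept column); this is not the JMP implementation, just the formula above applied directly.

    import numpy as np

    def vif(X):
        """Variance inflation factor for each column of X (predictors only)."""
        n, k = X.shape
        out = np.empty(k)
        for j in range(k):
            xj = X[:, j]
            others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
            ss_res = np.sum((xj - others @ coef) ** 2)
            ss_tot = np.sum((xj - xj.mean()) ** 2)
            r2_j = 1 - ss_res / ss_tot   # R^2 of x_j regressed on the other predictors
            out[j] = 1.0 / (1.0 - r2_j)
        return out

    # Predictors with VIF >= 10 signal a multicollinearity problem.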

Multicollinearity - Example

See Example 7.3, page 349. The response is the carbon monoxide content of cigarettes and the predictors are tar content (x₁), nicotine content (x₂) and weight (x₃).

Estimated parameters in the first-order model: ŷ = 3.2 + 0.96x₁ - 2.63x₂ - 0.13x₃. F = 78.98 with p-value below 0.0001. Individual t statistics and p-values: 3.37 (0.0007), -0.67 (0.51) and -0.03 (0.97).

Note that the signs on β₂ and β₃ are opposite of what is expected. Also, such a high F would suggest more than just one significant predictor. The VIFs were 21.63, 21.89 and 1.33, so there is a multicollinearity problem. The correlations are r₁₂ = 0.98, r₁₃ = 0.49 and r₂₃ = 0.50.

Multicollinearity - Solutions

If interest is only in estimation and prediction, multicollinearity can be ignored since it does not affect ŷ or its standard error (either σ̂_ŷ or σ̂_(y-ŷ)). This is true only if the values of the predictors at which we want estimation or prediction are within the range of the data.

If the wish is to establish association patterns between y and the predictors, then the analyst can:
- Eliminate some predictors from the model.
- Design an experiment in which the pattern of correlation is broken.

Multicollinearity - Polynomial model

Multicollinearity is a problem in polynomial regression (with terms of second and higher order): x and x² tend to be highly correlated.

A special solution in polynomial models is to use zᵢ = xᵢ - x̄ instead of just xᵢ. That is, first subtract the mean of the predictor from each value and then use the deviations in the model.

Example: suppose that the model is y = β₀ + β₁x + β₂x² + ɛ and we have a sample of size n. Let x̄ = n⁻¹ Σᵢ xᵢ and define z = x - x̄. Then fit the model y = γ₀ + γ₁z + γ₂z² + e.

Multicollinearity - Polynomial model (cont'd)

Example: x = 2, 3, 4, 5, 6 and x² = 4, 9, 16, 25, 36. As x increases, so does x²; r_{x,x²} = 0.98.

Here x̄ = 4, so z = -2, -1, 0, 1, 2 and z² = 4, 1, 0, 1, 4. Thus z and z² are no longer correlated: r_{z,z²} = 0.

We can get the estimates of the β's from the estimates of the γ's. Since

    E(y) = γ₀ + γ₁z + γ₂z²
         = γ₀ + γ₁(x - x̄) + γ₂(x - x̄)²
         = (γ₀ - γ₁x̄ + γ₂x̄²) + (γ₁ - 2γ₂x̄)x + γ₂x²,

we have β₀ = γ₀ - γ₁x̄ + γ₂x̄², β₁ = γ₁ - 2γ₂x̄ and β₂ = γ₂.
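A short sketch of the centering trick with the toy values above; the γ estimates used to recover the β's are hypothetical placeholders, not fitted values.

    import numpy as np

    x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
    z = x - x.mean()                       # z = x - xbar, with xbar = 4

    print(np.corrcoef(x, x**2)[0, 1])      # about 0.98
    print(np.corrcoef(z, z**2)[0, 1])      # 0, since z is symmetric about 0

    # Recover the betas from (hypothetical) gamma estimates of
    # y = g0 + g1*z + g2*z^2 + e:
    g0, g1, g2 = 10.0, 1.5, 0.3
    xbar = x.mean()
    b0 = g0 - g1 * xbar + g2 * xbar**2
    b1 = g1 - 2 * g2 * xbar
    b2 = g2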

Analysis of residuals - Chapter 8

Residual analysis is important to detect departures from the model assumptions (ɛ ~ N(0, σ²) and observations are independent) and to determine lack of fit of the model.

Estimated residuals are obtained as eᵢ = yᵢ - ŷᵢ. A standardized residual is defined as eᵢ* = eᵢ / RMSE. Very roughly, we expect that the standardized residuals will be distributed as standard normal random variables, N(0, 1), as long as the assumptions of the model hold.

Residual plots for lack of fit

Scatter plots of the residuals eᵢ against each of the predictors and against ŷ provide information about lack of fit. Look for trends and changes in variability that indicate that there is some systematic (rather than purely random) effect on the residuals. Look for no more than 5% of the residuals falling outside of 0 ± 2·RMSE.

Figure 8.4 shows residuals against the predictor when a second-order model is fit to the data; no patterns are observable. Figure 8.3: when a first-order model is fitted, there is a curvilinear pattern to the residuals, indicating that the predictor needs to be added to the model in quadratic form.
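A sketch of this kind of diagnostic plot on simulated data (not the textbook's figures): fitting a first-order model to data with a quadratic trend leaves a curved pattern in the residuals-versus-fitted plot.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    x = np.linspace(0, 10, 80)
    y = 2 + 1.5 * x + 0.4 * x**2 + rng.normal(scale=2.0, size=x.size)

    X = np.column_stack([np.ones_like(x), x])     # first-order model only
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted

    plt.scatter(fitted, resid)
    plt.axhline(0, color="gray")
    plt.xlabel("fitted values")
    plt.ylabel("residuals")
    plt.show()      # the curved pattern suggests adding a quadratic term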

Residual plots for unequal variances

When the variance of ɛᵢ depends on the value of ŷᵢ, we say that the variances are heteroscedastic (unequal). Heteroscedasticity may occur for many reasons, but it typically occurs when the responses are not normally distributed or when the errors are not additive in the model.

Examples of non-normal responses:
- Poisson counts: number of sick days taken per person per month, or number of traffic crashes per intersection per year.
- Binomial proportions: proportion of felons who are repeat offenders, or proportion of consumers who prefer low-carb foods.

See Figures 8.9 and 8.10 for an example of each.

Residual plots for unequal variances (cont'd)

Unequal variances also occur when the errors are multiplicative rather than additive: y = E(y)·ɛ. In multiplicative models, the variance of y is proportional to the squared mean of y:

    Var(y) = [E(y)]² σ²,

so the variance of y (or equivalently, the variance of ɛ) increases when the mean response increases. See Figure 8.11 for an example of this.

Solutions for unequal variances

Unequal variances can often be ameliorated by transforming the data:

    Type of response          Appropriate transformation
    Poisson counts            √y
    Binomial proportions      sin⁻¹(√y)
    Multiplicative errors     log(y)

Fit the model to the transformed response. Typically, the transformed data will satisfy the assumption of homoscedasticity.
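A minimal sketch of applying each transformation before refitting; all response values below are hypothetical.

    import numpy as np

    counts = np.array([0, 3, 7, 12, 25])          # Poisson-type counts
    props = np.array([0.05, 0.20, 0.50, 0.80])    # binomial proportions
    sales = np.array([8.2, 15.0, 40.0, 120.0])    # positive response, multiplicative errors

    y_counts = np.sqrt(counts)              # sqrt(y) for counts
    y_props = np.arcsin(np.sqrt(props))     # arcsine square root for proportions
    y_sales = np.log(sales)                 # log(y) for multiplicative errors
    # Fit the regression model to the transformed response in each case.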

Example

Traffic engineers wish to assess the association between the number of crashes and the daily entering vehicles (DEV) at intersections. Data are the number of crashes per year at 50 intersections and the average (over the year) DEV. See the JMP example on the handout with results from fitting the simple model crashes = β₀ + β₁·DEV.

In the plot of the residuals against ŷ, as ŷ increases so does the spread of the residuals around zero: an indication that σ² is not constant for all observations. The same model was fitted, but this time using √crashes as the response. See the new residual plot, which shows no evidence of heteroscedasticity.

Checking the normality assumption

Even though we assume that ɛ ~ N(0, σ²), moderate departures from the assumption of normality do not have a big effect on the reliability of hypothesis tests or the accuracy of confidence intervals.

There are several statistical tests to check normality, but they are reliable only in very large samples. The easiest type of check is graphical:
- Fit the model and estimate the residuals eᵢ = yᵢ - ŷᵢ.
- Get a histogram of the residuals and see if they look normal.

Checking normality - Q-Q plots

We can also obtain a Q-Q (quantile-quantile) plot, also called a normal probability plot, of the residuals. A Q-Q plot is constructed as follows:
- On the y-axis: the expected value of each residual under the assumption that the residuals in fact come from a normal distribution. These are called normal scores.
- On the x-axis: the estimated residuals.

If the residuals are in fact normal, then the plot will show a straight 45° line. Departures from the straight line indicate departures from normality.

Checking normality - Q-Q plots (cont'd)

JMP will construct normal probability plots directly: choose Distribution, then right-click on the variable name and choose Normal Quantile Plot.

A formula to compute normal scores and then construct the plot by hand is given in the box on page 394; not required learning.

See Figure 8.20 (no departures from normality) and Exercise 8.18 on page 396 (again, no departures). See the figures on the next page: simulated normal and non-normal data (the non-normal data were simulated by taking the fourth power of normal data).
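A sketch of constructing a normal quantile plot by hand (outside JMP), using a common normal-score formula; this is illustrative and not necessarily the book's page-394 recipe.

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    # resid would be the estimated residuals from a fitted model;
    # here they are simulated for illustration.
    rng = np.random.default_rng(3)
    resid = rng.normal(size=50)

    resid_sorted = np.sort(resid)
    n = resid_sorted.size
    # Normal scores: expected quantiles under normality (Blom's plotting positions).
    scores = stats.norm.ppf((np.arange(1, n + 1) - 0.375) / (n + 0.25))

    plt.scatter(resid_sorted, scores)
    plt.xlabel("estimated residuals")
    plt.ylabel("normal scores")
    plt.show()      # roughly a straight line if the residuals are normal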

Outliers and influential observations

Observations with large values of the residual eᵢ = yᵢ - ŷᵢ are called outliers. What is a large residual? A residual eᵢ is large if it is outside of 0 ± 3·RMSE. A standardized residual is large if it is outside of 0 ± 3.

A studentized residual is similar to a standardized residual, but it is computed as

    zᵢ* = eᵢ / (RMSE · √(1 - hᵢ)),

where hᵢ is the leverage of the ith observation. The higher hᵢ, the higher the studentized residual. JMP gives the zᵢ*, so we do not need to compute hᵢ ourselves.
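A minimal sketch of the studentized residual formula above, computed directly from the hat matrix (JMP and SAS report these for you; this just shows where the numbers come from).

    import numpy as np

    def studentized_residuals(X, y):
        """z*_i = e_i / (RMSE * sqrt(1 - h_i)); X must include the intercept column."""
        n, p = X.shape
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages h_i
        rmse = np.sqrt(resid @ resid / (n - p))
        return resid / (rmse * np.sqrt(1 - h))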

Outliers and influential obs - Cook's D

Cook's D measures the influence of each observation on the value of the estimated parameters. Definition:

    Dᵢ = (yᵢ - ŷᵢ)² / [(k + 1)·MSE] · hᵢ / (1 - hᵢ)².

Cook's D is large for the ith observation if the residual (yᵢ - ŷᵢ) is large, or the leverage hᵢ is large, or both. Although not obvious from the formula, Dᵢ measures the difference in the estimated parameters between a model fitted to all the observations and a model fitted to the sample that omits the ith observation. If Dᵢ is small, then the ith observation does not have an overwhelming effect on the parameter estimates.
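A sketch of the Cook's D formula above computed directly (again, JMP saves this column for you):

    import numpy as np

    def cooks_distance(X, y):
        """D_i = e_i^2 / ((k+1) * MSE) * h_i / (1 - h_i)^2;
        X includes the intercept column, so k + 1 = X.shape[1]."""
        n, p = X.shape                                   # p = k + 1
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)    # leverages h_i
        mse = resid @ resid / (n - p)
        return resid**2 / (p * mse) * h / (1 - h)**2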

Cook's D (cont'd)

We can compute Dᵢ for each of the n sample observations using JMP (or SAS). After fitting the model:
- Right-click on the Response icon.
- Choose Save Columns.
- Choose Cook's D Influence.

How big is big? Compare Dᵢ to the other Dⱼ in the sample and notice whether any of the Dᵢ are substantially bigger than the rest. If so, those observations are influential.

Outliers and influential obs - Example

Where do outliers come from? Measurement or recording error, or, if there is no error, perhaps an inappropriate model.

Example FastFood, Table 8.5 in the book. The response y is sales (in $1000) for fast-food outlets in four cities. The predictors are traffic flow (x₄, in 1000 cars) and three dummies (x₁, x₂, x₃) to represent the four cities.

Suppose that observation 13 in city 3, for which traffic flow is 75,800 cars, had sales recorded as $82,000 rather than $8,200 (someone put the decimal in the wrong place), and fit the first-order model E(y) = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₄.

Outliers and influential obs - Example (cont'd)

The estimated prediction equation is ŷ = 16.46 + 1.1x₁ + 6.14x₂ + 14.5x₃ + 0.36x₄, with RMSE = 14.86 and CV = 163.8. The F test for the utility of the model is only 1.66 with a p-value of 0.2, meaning that none of the predictors appear useful for predicting sales. R²ₐ = 0.10: only about 10% of the observed variability in sales can be attributed to traffic flow and city.

These results are not what was expected: traffic flow is expected to be associated with fast-food sales.

Outliers and influential obs - Example (cont'd)

Residual plots: observation 13 is clearly an outlier (see next page). Its standardized residual is 4.36, outside of the expected interval (-3, 3), and its ordinary residual is 56.46, outside of 0 ± 3·14.86. Cook's D for observation 13 is 1.196, about 10 times larger than the next largest D in the sample.

If we now correct the mistake and use $8,200, the results change dramatically: the F statistic is 222.17 with a very small p-value, R²ₐ = 97.5%, and the largest standardized residual is only 2.38.

One outlier can have an enormous effect on the results! It is important to do a residual analysis.
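A small simulated illustration of the same idea (not the FastFood data; all values are made up): a single miscoded observation can wreck the fit, and correcting it restores the expected results.

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical data: sales (in $1000) roughly linear in traffic flow.
    rng = np.random.default_rng(4)
    traffic = rng.uniform(10, 100, size=25)
    sales = 2 + 0.1 * traffic + rng.normal(scale=1.0, size=25)

    bad = sales.copy()
    bad[12] = bad[12] * 10        # decimal-place error on one observation

    fit_bad = sm.OLS(bad, sm.add_constant(traffic)).fit()
    fit_good = sm.OLS(sales, sm.add_constant(traffic)).fit()
    print("R^2 with the miscoded point:", fit_bad.rsquared)
    print("R^2 after correcting it:   ", fit_good.rsquared)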