MULTICOLLINEARITY AND VARIANCE INFLATION FACTORS
F. Chiaromonte

We have a pool of available predictors, and terms constructed from them, in the data set. Related to model selection are the questions: What is the relative importance of the different terms? What are the sign and magnitude of their effects on Y? What are their contributions to explaining the variability of Y? Can a term be dropped from (or added to) the model because its contribution is small (large)?

When the terms are not correlated with one another, the answers are straightforward. For example, compare the models

$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \varepsilon_i$
$y_i = \tilde\beta_0 + \tilde\beta_1 x_{1,i} + \tilde\varepsilon_i$

If $\mathrm{cor}(x_1, x_2) = 0$, then:

$b_1 = \tilde b_1$
$SSR(x_1 \mid x_2) = SSE(x_2) - SSE(x_1, x_2) = SSR(x_1)$

that is, the LS coefficient of $x_1$, and the response variability explained by $x_1$, alone and together with $x_2$, are the same.
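A minimal numpy sketch of this fact (my addition; the slides themselves show Minitab output). With exactly uncorrelated predictors, the LS coefficient of x1 and SSR(x1) are unchanged by adding x2:

import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# force exact zero sample correlation between x1 and x2
x1c = x1 - x1.mean()
x2 = x2 - x2.mean()
x2 -= x1c * (x1c @ x2) / (x1c @ x1c)
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(size=n)

def fit(*cols):
    """LS fit with intercept; returns (coefficients, SSE)."""
    X = np.column_stack([np.ones(n), *cols])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return b, r @ r

b12, sse12 = fit(x1, x2)
b1,  sse1  = fit(x1)
_,   sse2  = fit(x2)
sst = ((y - y.mean()) ** 2).sum()

print(b12[1], b1[1])             # same LS coefficient for x1 in both models
print(sse2 - sse12, sst - sse1)  # SSR(x1 | x2) equals SSR(x1)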

However, when terms are correlated with one another, things become complicated. Both $b_j$ and $SSR(x_j \mid x_h,\ h \neq j)$, the LS coefficient and the contribution to explaining the variability of y, can change depending on which other terms are considered with $x_j$ in the model; they are context dependent:

- $b_j$ can change in magnitude, and even in sign, depending on the presence of terms correlated with $x_j$.
- The interpretation of $b_j$ as the average change in y when $x_j$ increases by one unit and all other terms are held constant becomes ambiguous: in practice, can you hold the other terms constant while changing $x_j$?
- The SSR attributable to $x_j$ can decrease but also increase depending on the presence of terms correlated with $x_j$ (e.g., it decreases if both $x_j$ and the other terms are correlated with y, and with one another; it increases if $x_j$ is not correlated with y per se, but is correlated with other terms, which in turn are correlated with y).
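A hypothetical illustration of the sign change (simulated data, my own construction, not from the slides):

import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)   # strongly correlated with x1
y = -1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

def coef_x1(*cols):
    """Coefficient of x1 in an LS fit that also includes `cols`."""
    X = np.column_stack([np.ones(n), x1, *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print(coef_x1())    # ~ +1: alone, x1 absorbs x2's effect through their correlation
print(coef_x1(x2))  # ~ -1: with x2 in the model, x1's own effect is negative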

In addition, when terms are correlated with one another, the sampling variances of the LS regression coefficients, and therefore their standard errors,

$\sigma^2\{b_j\} = [\sigma^2 (X'X)^{-1}]_{jj}$,   $se(b_j) = [s^2 (X'X)^{-1}]_{jj}^{1/2}$,

increase. Our estimates become less accurate. When the correlations among terms are very strong, the inversion of (X'X) becomes numerically unstable (determinant close to 0), so our estimates

$b = (X'X)^{-1} X'Y$

are not just very variable, they are poorly determined! Many different b vectors provide a very similar LS fit to the data.

[Figure omitted: plot with axes x1 and x2.]
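A small numpy sketch (my addition) of the numerical side of this point: as cor(x1, x2) approaches 1, both the condition number of X'X and the variance multiplier $[(X'X)^{-1}]_{jj}$ blow up.

import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
for noise in (1.0, 0.1, 0.01):
    x2 = x1 + noise * rng.normal(size=n)   # smaller noise -> stronger correlation
    X = np.column_stack([np.ones(n), x1, x2])
    XtX = X.T @ X
    var_mult = np.diag(np.linalg.inv(XtX))  # sigma^2 * these = coefficient variances
    print(noise, np.linalg.cond(XtX), var_mult[1])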

Because both the elements of b and their standard errors are affected, when terms are correlated it is hard to use the t-ratios

$t_j = b_j / se(b_j)$

as indicators of importance: many p-values for individual t-tests can be non-significant even though some terms are obviously relevant and the regression is significant as a whole (overall F-test), and p-values can change dramatically when dropping/adding terms.

The regression equation is
Systol = 225 + 1.31 Years - 4.50 Weight + 0.0381 Years^2 + 0.0504 Weight^2 - 0.0503 Years*Weight

Predictor        Coef   SE Coef      T      P
Constant        225.1     115.5   1.95  0.060
Years           1.308     1.919   0.68  0.500
Weight         -4.499      3.90  -1.15  0.258
Years^2       0.03809   0.01388   2.74  0.010
Weight^2      0.05042   0.03325   1.52  0.139
Years*Weight -0.05032   0.03247  -1.55  0.131

S = 9.4320   R-Sq = 54.8%   R-Sq(adj) = 47.9%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       5  3576.21  715.24  7.99  0.000
Residual Error  33  2955.22   89.55
Total           38  6531.44
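A hedged simulated analogue of this pattern (not the Systol data, which are not reproduced here), using statsmodels, assumed available: every individual t-test can look weak while the overall F-test is highly significant.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # near-duplicate of x1
y = x1 + x2 + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.pvalues[1:])  # both individual p-values typically non-significant
print(fit.f_pvalue)     # overall F-test: essentially 0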

Diagnose pair-wise correlations among terms with a scatter-plot matrix and the correlation coefficients matrix:

[Matrix plot of x1, x2, x3 omitted.]

Correlations:
      x1     x2
x2  0.74
x3  0.830  0.853
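The same diagnostics can be produced in Python with pandas (assumed available; the predictors below are placeholders standing in for the slide's x1, x2, x3):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x1 = rng.normal(size=100)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=100)            # correlated with x1
x3 = 0.8 * x1 + 0.5 * x2 + 0.4 * rng.normal(size=100) # correlated with both

df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
print(df.corr().round(3))        # correlation coefficients matrix
pd.plotting.scatter_matrix(df)   # scatter-plot matrix
plt.show()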

However, these diagnostics are incomplete, because the real issue is the presence of linear interdependencies among the terms. These can be strong even when pair-wise correlations are relatively weak. Given the model

$y_i = \beta_0 + \beta_1 x_{1,i} + \dots + \beta_j x_{j,i} + \dots + \beta_{p-1} x_{p-1,i} + \varepsilon_i$

consider each term as a linear function of all the others, fitting regressions of the type

$x_{j,i} = \alpha_0 + \sum_{l \neq j} \alpha_l x_{l,i} + err_i$,   $R_j^2 = R^2(x_j \mid x_l,\ l \neq j)$,   $j = 1, \dots, p-1$

$R_j^2$ is the share of the variability of $x_j$ explained by a linear form in the other terms. The variance of the regression coefficient,

$\sigma^2\{b_j\} = [\sigma^2 (X'X)^{-1}]_{jj} \propto \frac{1}{1 - R_j^2} = VIF_j$,

defines the variance inflation factor. Rules of thumb: serious multicollinearity if

$\max_j VIF_j \geq 10$   and/or   $\overline{VIF} = \frac{1}{p-1} \sum_{j=1}^{p-1} VIF_j$ considerably larger than 1.
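A minimal sketch of the VIF definition itself (my code, computing the auxiliary regressions directly rather than calling a library routine):

import numpy as np

def vifs(X):
    """X: n x (p-1) matrix of model terms (no intercept column)."""
    n, k = X.shape
    out = []
    for j in range(k):
        # regress term j on all the other terms (with intercept)
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ b
        r2 = 1 - resid @ resid / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))   # VIF_j = 1 / (1 - R_j^2)
    return np.array(out)

rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
print(vifs(np.column_stack([x1, x2])))   # both VIFs large, roughly 100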

Simulated example: $y_i = 0 + x_{1,i} + x_{2,i} + x_{3,i} + \varepsilon_i$, $\varepsilon_i$ iid $N(0,1)$, where

$x_{2,i} = x_{1,i} +$ small gaussian noise
$x_{3,i} = x_{1,i} + x_{2,i} +$ small gaussian noise

The regression equation is
y = 0.115 + 1.31 x1 - 0.109 x2 + 1.38 x3

Predictor     Coef  SE Coef      T      P      VIF
Constant    0.1148   0.3523   0.33  0.745
x1           1.312    1.031   1.27  0.204  224.145
x2         -0.1093   0.9971  -0.11  0.913   207.52
x3          1.3798   0.7323   1.88  0.061  454.378

S = 1.0131   R-Sq = 94.4%   R-Sq(adj) = 94.3%

Analysis of Variance
Source           DF      SS      MS       F      P
Regression        3  3411.1  1137.0  1106.5  0.000
Residual Error  196   201.4     1.0
Total           199  3612.4

Correlations:
      x1     x2
x2  0.995
x3  0.998  0.998

[Scatter-plot matrix of x1, x2, x3 omitted.]
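A hedged reconstruction of this simulation (the exact noise scales and the distribution of x1 are my assumptions, not given on the slide):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(loc=4, scale=1.5, size=n)
x2 = x1 + 0.1 * rng.normal(size=n)        # "small gaussian noise"
x3 = x1 + x2 + 0.1 * rng.normal(size=n)
y = x1 + x2 + x3 + rng.normal(size=n)     # true intercept 0, all slopes 1

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, x3]))).fit()
print(fit.params)    # individual slopes can land far from (0, 1, 1, 1)
print(fit.rsquared)  # ~ 0.94: the fit as a whole is excellent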

One remedy: dropping terms as needed. In the simulated example, for instance, we have:

The regression equation is
y = 0.051 + 2.75 x1 + 1.23 x2

Predictor     Coef  SE Coef     T      P     VIF
Constant    0.0507   0.3529  0.14  0.886
x1          2.7457   0.7001  3.92  0.000  102.00
x2          1.2299   0.7037  1.75  0.082  102.00

S = 1.02015   R-Sq = 94.3%   R-Sq(adj) = 94.3%

Analysis of Variance
Source           DF      SS      MS        F      P
Regression        2  3407.4  1703.7  1637.08  0.000
Residual Error  197   205.0     1.0
Total           199  3612.4

The regression equation is
y = 0.107 + 3.93 x1

Predictor     Coef  SE Coef      T      P    VIF
Constant    0.1073   0.3532   0.30  0.762
x1           3.932   0.0691  56.90  0.000  1.000

S = 1.02543   R-Sq = 94.2%   R-Sq(adj) = 94.2%

Analysis of Variance
Source           DF      SS      MS        F      P
Regression        1  3404.2  3404.2  3237.51  0.000
Residual Error  198   208.2     1.1
Total           199  3612.4
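Continuing the reconstruction from the previous slide (same assumed data-generating code), dropping terms barely changes R-Sq, in line with the output above:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(loc=4, scale=1.5, size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = x1 + x2 + 0.1 * rng.normal(size=n)
y = x1 + x2 + x3 + rng.normal(size=n)

for cols in ([x1, x2, x3], [x1, x2], [x1]):
    X = sm.add_constant(np.column_stack(cols))
    r2 = sm.OLS(y, X).fit().rsquared
    print(len(cols), round(r2, 3))   # R-Sq barely moves as terms are dropped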

More sophisticated remedies: orthogonalize the terms at the outset (but the new terms are linear combinations of the original ones, and harder to interpret), or fit a ridge regression.

Important: multicollinearity does not affect prediction! The fitted values, their sampling variability and their stability are not affected; it's just that very similar fitted values can be produced by very different models!
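A sketch of the ridge remedy using scikit-learn (assumed available; alpha = 1.0 is an arbitrary choice for illustration): a small L2 penalty stabilizes the coefficients at essentially no cost in fit, consistent with the point that prediction is unaffected.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print(ols.coef_)    # unstable, can be far from the true (1, 1)
print(ridge.coef_)  # shrunk toward each other, stable
print(ols.score(X, y), ridge.score(X, y))   # R^2 nearly identical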