Dr. Maddah ENMG 617 EM Statistics 11/28/12. Multiple Regression (3) (Chapter 15, Hines)


Problems in multiple regression: Multicollinearity

Multicollinearity arises when the independent variables x_1, x_2, ..., x_k are highly inter-correlated. It leads to poor estimates of the regression coefficients and negatively affects the applicability of the regression model.

To illustrate the effect of multicollinearity, consider a regression with two independent variables, x_1 and x_2, and assume the X'X matrix has been written in correlation form. Then, it can be shown that the diagonal elements of C = (X'X)^(-1) are

    C_11 = C_22 = 1 / (1 - r_12^2),

where r_12 is the correlation between x_1 and x_2. Note that the variances of the estimated regression coefficients, Var(β̂_j) = MS_E C_jj, j = 1, 2, grow without bound as the correlation between x_1 and x_2 increases, r_12 → 1.

In general, with k > 2 independent variables, it can be shown, when X'X has been written in correlation form, that the diagonal elements of C = (X'X)^(-1) are C_jj = 1 / (1 - R_j^2). Here, R_j^2 is the coefficient of multiple determination resulting from regressing x_j on the other k - 1 regressor variables.
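As a quick illustration (not from the text), the following Python sketch inverts the 2x2 correlation-form X'X for several values of r_12 and shows the diagonal entries C_jj = 1/(1 - r_12^2), and hence the coefficient variances MS_E C_jj, blowing up as r_12 approaches 1.

```python
# Illustrative sketch (not from the text): diagonal entries of C = (X'X)^(-1)
# when X'X is in correlation form with two regressors, for increasing r_12.
import numpy as np

for r12 in [0.0, 0.5, 0.9, 0.99, 0.999]:
    XtX = np.array([[1.0, r12],
                    [r12, 1.0]])            # X'X in correlation form
    C = np.linalg.inv(XtX)
    # C_11 = C_22 = 1 / (1 - r12^2); Var(beta_hat_j) = MS_E * C_jj
    print(f"r_12 = {r12:5.3f}   C_jj = {C[0, 0]:10.2f}")
```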

The term 1 / (1 - R_j^2) is called the variance inflation factor, VIF_j, of β̂_j. In brief, multicollinearity leads to high variances of the estimated regression coefficients β̂_j and damages the applicability of the regression model. Specifically, multicollinearity significantly affects the ability of the model to extrapolate and predict responses for x values outside the range of the data.

Multicollinearity can be detected in the following ways, assuming the X'X matrix is written in correlation form (a numerical sketch of these checks follows below).

1. Large VIFs, specifically, VIF_j > 10.
2. Low determinant of X'X, specifically, |X'X| close to 0, when X'X is written in correlation form.
3. Small eigenvalues of X'X: one or more eigenvalues are close to 0, or λ_max / λ_min > 10, where λ_max and λ_min are the largest and smallest eigenvalues.
4. High correlation coefficients, r_ij close to 1.
5. The F-test for significance of regression is significant, but the individual regression coefficients are not significant.

Some remedial measures for multicollinearity include augmenting the data, when possible, and deleting some independent variables.
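The sketch below (hypothetical numbers, not the text's example) runs the first four checks with numpy, assuming X'X has already been written in correlation form; the matrix is chosen so that x_1 and x_2 are strongly correlated.

```python
# Hypothetical sketch of the multicollinearity checks, assuming X'X is
# already in correlation form.
import numpy as np

XtX = np.array([[1.00, 0.97, 0.30],
                [0.97, 1.00, 0.25],
                [0.30, 0.25, 1.00]])            # X'X in correlation form

VIF = np.diag(np.linalg.inv(XtX))               # VIF_j = 1 / (1 - R_j^2)
eig = np.linalg.eigvalsh(XtX)

print("VIFs:           ", np.round(VIF, 2))                   # check 1: any > 10?
print("det(X'X):       ", round(np.linalg.det(XtX), 4))       # check 2: near 0?
print("lambda_max/min: ", round(eig.max() / eig.min(), 1))    # check 3: > 10?
print("pairwise r_ij:\n", XtX)                                # check 4: near 1?
```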

Some more advanced remedies involve using more robust estimates of the regression coefficients, with methods other than least squares. The text presents one such method, ridge regression.

Example on detecting multicollinearity (Ex. 15-14)

Consider the data on the heat generated (in calories per gram) from cement as a function of the quantities of four additives. First, the data is coded in standard (correlation) form by applying the transformation

    x_ij = (z_ij - z̄_j) / S_jj^(1/2),

where z̄_j is the mean of the j-th regressor and S_jj is its corrected sum of squares. The coded X'X matrix has several large correlation coefficients. In addition, three of the four VIFs, the diagonal entries of (X'X)^(-1), are larger than 10.
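As a sketch of the coding step (with made-up data standing in for the four additives), each column is centered and scaled by the square root of its corrected sum of squares, so that X'X becomes the correlation matrix of the regressors:

```python
# Sketch of coding the regressors to correlation form; Z is hypothetical data
# standing in for the four cement additives.
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(13, 4))                    # 13 observations, 4 regressors

zbar = Z.mean(axis=0)
S_jj = ((Z - zbar) ** 2).sum(axis=0)            # corrected sums of squares
X = (Z - zbar) / np.sqrt(S_jj)                  # x_ij = (z_ij - zbar_j) / S_jj^(1/2)

print(np.round(X.T @ X, 3))                     # correlation matrix, ones on the diagonal
```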

Finally, the eigenvalues of X'X are 3.657, 0.2679, 0.07127, and 0.004014, implying that λ_max / λ_min = 3.657 / 0.004014 = 911.06 > 10. Therefore, multicollinearity problems are likely present.

Influential Observations

Sometimes, a small subset of the data significantly affects the parameters and the quality of the regression model. This typically happens with observations that are remote from the rest of the data. It then becomes important to identify these points in order to eliminate them if they were collected by mistake. Even if the influential observations are correct, it is good to detect them to understand what drives the model.

A measure used to detect influential points is Cook's distance. For an observation i, this distance measures the change in the regression coefficients if observation i is removed.

Letting β̂ and β̂_(i) be the estimates of the regression coefficients with the full data and after removing observation i, respectively, Cook's distance is given by

    D_i = (β̂_(i) - β̂)' X'X (β̂_(i) - β̂) / (p MS_E),

where p is the number of estimated coefficients. It can be shown that D_i can also be written as

    D_i = r_i^2 h_ii / (p (1 - h_ii)),

where r_i is the studentized residual of observation i and h_ii is the i-th diagonal element of the hat matrix H = X(X'X)^(-1)X'. It can also be shown that the matrix H relates the response variable, y, to the fits, ŷ, as ŷ = Hy. So, the distance D_i reflects how well the model fits observation i. A value of D_i > 1 indicates an influential point.

Example on influential observations (Ex. 15-15)

For the peach damage example, Cook's distance measures are as shown below. All values are significantly below 1, which implies that the data has no influential observations.
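A minimal numpy sketch of this computation on simulated data (not the peach damage data) is given below; it builds the hat matrix H, the studentized residuals, and D_i, and compares the largest D_i against 1.

```python
# Sketch of Cook's distance from the hat matrix, on hypothetical data.
# p is the number of estimated coefficients (including the intercept).
import numpy as np

rng = np.random.default_rng(2)
n, k = 30, 2
x = rng.normal(size=(n, k))
y = 1.0 + x @ np.array([2.0, -1.0]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])            # design matrix with intercept
p = X.shape[1]
H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix: y_hat = H y
h = np.diag(H)

e = y - H @ y                                   # residuals
MSE = e @ e / (n - p)
r = e / np.sqrt(MSE * (1 - h))                  # studentized residuals
D = r**2 * h / (p * (1 - h))                    # Cook's distance

print("max Cook's D:", round(D.max(), 3))       # influential point if > 1
```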

Autocorrelation

If the error terms ε_t are correlated, then the regression models discussed thus far are not applicable. Correlation may occur in time series observations, where observations at time t are related to those at time t - 1.

One test for correlation is the Durbin-Watson test, which applies to simple regression and assumes a first-order autoregressive model for the errors,

    ε_t = ρ ε_(t-1) + a_t,

where ρ is the autocorrelation coefficient and the a_t are IID normal random variables.
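The sketch below (simulated data, not from the text) generates a simple regression whose errors follow this first-order autoregressive model and then computes the Durbin-Watson statistic D, defined in the next passage, from the least-squares residuals.

```python
# Sketch: simulate AR(1) errors, fit a simple regression by least squares,
# and compute the Durbin-Watson statistic D from the residuals.
import numpy as np

rng = np.random.default_rng(3)
n, rho = 100, 0.8
a = rng.normal(size=n)
eps = np.zeros(n)
for t in range(1, n):                           # eps_t = rho * eps_{t-1} + a_t
    eps[t] = rho * eps[t - 1] + a[t]

x = np.arange(n, dtype=float)
y = 5.0 + 0.3 * x + eps

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta                                # residuals e_t

D = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)    # Durbin-Watson statistic
print("D =", round(D, 3))                       # small D suggests rho > 0
```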

The test checks the significance of ρ based on the statistic

    D = Σ_(t=2..n) (e_t - e_(t-1))^2 / Σ_(t=1..n) e_t^2,

where e_t is the residual at time t. If autocorrelation is present, then e_t and e_(t-1) would be close, and D would be small, which leads to rejecting the hypothesis ρ = 0. The text describes two one-sided tests on ρ, and presents tabulated critical values of D at different significance levels.

Model-building: Selecting variables

Typically, not all candidate independent variables are necessary to adequately model the response variable. One is then interested in screening the candidate independent variables to obtain the best regression model. A good model balances performance (e.g., prediction accuracy) with tractability (ease of estimation and usage).

There is no straightforward technique for selecting the best model. Search techniques are used that require interaction and judgment by the analyst.

All possible regressions

This is a technique that finds a good regression model based on k candidate variables by exploring all possible subsets of the independent variables. E.g., with three variables, x_1, x_2, and x_3, this technique explores regression models with subsets {x_1}, {x_2}, {x_3}, {x_1, x_2}, {x_1, x_3}, {x_2, x_3}, and {x_1, x_2, x_3}, as well as the model with no regressors. In general, with k candidate independent variables, this method explores 2^k possibilities, which tends to be too large.

Several criteria are used to compare the candidates. With p - 1 variables in the model (so that p coefficients, including the intercept, are estimated), the most common criteria are the coefficient of determination R_p^2, the mean square error MS_E(p), and the C_p statistic.

C_p statistic

The C_p statistic is a measure of the total mean square error. It is an estimate of the total standardized mean square error

    Γ_p = (1/σ^2) Σ_(i=1..n) { [E(ŷ_i) - E(y_i)]^2 + Var(ŷ_i) }.

C_p is given by

    C_p = SS_E(p) / σ̂^2 - n + 2p,

where the variance is estimated based on the full model containing all k candidate variables, σ̂^2 = MS_E(k + 1). In a model with no bias, it can be shown that E[C_p] = p. So, models with C_p close to p are desired.

Model selection in all possible regressions

A model with large R_p^2, small MS_E(p), and small C_p close to p is desired. A large adjusted R^2 (R_adj^2) is also desired, but this is equivalent to having a small MS_E(p). One could also use the F-score from the significance-of-regression test; a large F-score is desirable.

Typically, R_p^2 increases while MS_E(p) and C_p decrease as p increases. One chooses a value of p beyond which a further increase in p yields insignificant improvement. A small numerical sketch of this search is given below.
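The following sketch (hypothetical data, not the text's example) enumerates all subsets of four candidate regressors and reports C_p for each, estimating σ^2 by the MS_E of the full model as above.

```python
# Sketch of the all-possible-regressions search using the C_p criterion, on
# hypothetical data. p counts the intercept plus the variables in the subset.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
n, k = 40, 4
Z = rng.normal(size=(n, k))
y = 2.0 + 3.0 * Z[:, 0] - 1.5 * Z[:, 1] + rng.normal(size=n)   # x3, x4 irrelevant

def sse(cols):
    """Residual sum of squares of the model with an intercept and columns `cols`."""
    X = np.column_stack([np.ones(n)] + [Z[:, j] for j in cols])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    return e @ e

sigma2 = sse(range(k)) / (n - (k + 1))          # MS_E of the full model

for r in range(1, k + 1):
    for cols in combinations(range(k), r):
        p = r + 1                               # coefficients incl. intercept
        Cp = sse(cols) / sigma2 - n + 2 * p
        print(cols, " p =", p, " C_p =", round(Cp, 2))
```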

Example of all possible regressions

Consider an augmented peach damage model with five candidate independent variables. Values of R_p^2, MS_E(p), and C_p have been computed for all 2^5 - 1 = 31 possible regression models. These are tabulated on the next page. Also shown are plots of R_p^2, MS_E(p), and C_p versus p, where the values corresponding to the best model with p - 1 variables have been chosen. These plots indicate that the model with four variables, {x_1, x_2, x_3, x_4}, is an appropriate choice.

Example of all possible regressions

[Table of R_p^2, MS_E(p), and C_p for the 31 candidate models, and the corresponding plots versus p.]

Stepwise regression

This is a widely used variable selection technique. It consists of an iterative procedure where variables are added or removed one at a time, and which continues until no variables can be added or removed.

Specifically, critical values of the F-score, F_in and F_out, are chosen such that F_in >= F_out. The procedure starts by selecting the single-variable model with the highest F-score greater than F_in, if any. Then, a second variable is chosen to enter the model, namely the one with the highest F-score greater than F_in, if any. E.g., if x_j has been chosen at the first step, then the second variable chosen is the x_l having the highest F_l > F_in, where

    F_l = SS_R(β_l | β_0, β_j) / MS_E(x_j, x_l).

Then, a third variable, x_m, is added in the same way. After this, the procedure tests whether one of the two variables x_j and x_l added at the first two steps should be deleted from the model, based on the lowest F-score smaller than F_out, if any. And so on, until the F-scores for adding a variable are all < F_in and those for deleting a variable are all > F_out. A simplified sketch of this procedure is given below.
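Below is a simplified, hypothetical sketch of this procedure in Python: at each pass it adds the candidate with the largest partial F above F_in, then removes any included variable whose partial F falls below F_out; the thresholds and data are made up for illustration.

```python
# Simplified stepwise-regression sketch on hypothetical data: add the variable
# with the largest partial F above F_in, then drop any variable whose partial
# F falls below F_out, until no change occurs.
import numpy as np

rng = np.random.default_rng(5)
n, k = 60, 5
Z = rng.normal(size=(n, k))
y = 1.0 + 2.0 * Z[:, 0] + 1.0 * Z[:, 2] + rng.normal(size=n)

F_in, F_out = 4.0, 4.0                          # illustrative thresholds

def sse(cols):
    """Residual sum of squares of the model with an intercept and columns `cols`."""
    X = np.column_stack([np.ones(n)] + [Z[:, j] for j in cols])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return e @ e

def partial_F(cols_with, cols_without):
    """F = SS_R(candidate | others) / MS_E of the larger model."""
    extra_ss = sse(cols_without) - sse(cols_with)
    mse = sse(cols_with) / (n - len(cols_with) - 1)
    return extra_ss / mse

model = []
for _ in range(2 * k):                          # guard against cycling
    changed = False
    # Step 1: try to add the best variable not yet in the model.
    outside = [j for j in range(k) if j not in model]
    if outside:
        F_add = {j: partial_F(model + [j], model) for j in outside}
        best = max(F_add, key=F_add.get)
        if F_add[best] > F_in:
            model.append(best)
            changed = True
    # Step 2: try to remove the weakest variable currently in the model.
    if len(model) > 1:
        F_del = {j: partial_F(model, [m for m in model if m != j]) for j in model}
        worst = min(F_del, key=F_del.get)
        if F_del[worst] < F_out:
            model.remove(worst)
            changed = True
    if not changed:
        break

print("selected variables (0-indexed):", sorted(model))
```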

Variants of stepwise regression

Forward regression is the same as stepwise regression, but without variable deletion. Backward regression begins with all candidate variables in the model and eliminates variables one at a time.