Chapter 2 Multiple Regression (Part 4)


1 The effect of multicollinearity

We know that to find the least squares estimator, (X'X)^{-1} must exist. Therefore n must be greater than or at least equal to p + 1 (WHY?). However, even when n >= p + 1, the inverse may still not exist when there is multicollinearity in the predictors. Multicollinearity means that the correlation coefficients between predictor variables are close to +1 or -1 (positive or negative). In that case the design matrix X is ill-conditioned, i.e. the determinant det(X'X) is close to 0 and the inverse of X'X is not stable. Multicollinearity also causes other problems; some are discussed below.

1.1 An example in which two predictor variables are perfectly uncorrelated

Work crew size example revisited: effects on regression coefficients.

Case i   Crew size X1   Bonus pay X2   Crew productivity Y
1        4              2              42
2        4              2              39
3        4              3              48
4        4              3              51
5        6              2              49
6        6              2              53
7        6              3              61
8        6              3              60
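The coefficient and extra-sum-of-squares results summarized next can be checked directly from this small data set. A minimal sketch using numpy's least-squares routine (the helper function `fit` is introduced here for illustration; it is not from the notes):

```python
import numpy as np

# Work crew data from Section 1.1 (crew size X1, bonus pay X2, productivity Y)
X1 = np.array([4, 4, 4, 4, 6, 6, 6, 6], dtype=float)
X2 = np.array([2, 2, 3, 3, 2, 2, 3, 3], dtype=float)
Y  = np.array([42, 39, 48, 51, 49, 53, 61, 60], dtype=float)

def fit(*cols):
    """Least-squares fit with an intercept; returns (coefficients, SSE)."""
    X = np.column_stack([np.ones_like(Y)] + list(cols))
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return b, np.sum((Y - X @ b) ** 2)

b_1,  sse_1  = fit(X1)        # model with X1 only
b_2,  sse_2  = fit(X2)        # model with X2 only
b_12, sse_12 = fit(X1, X2)    # model with both predictors
ssto = np.sum((Y - Y.mean()) ** 2)

print("corr(X1,X2) =", np.corrcoef(X1, X2)[0, 1])         # exactly 0: uncorrelated design
print("b1 alone =", b_1[1], " b1 with X2 =", b_12[1])      # both 5.375
print("b2 alone =", b_2[1], " b2 with X1 =", b_12[2])      # both 9.250
print("SSR(X1) =", ssto - sse_1, " SSR(X1|X2) =", sse_2 - sse_12)  # both 231.125
print("SSR(X2) =", ssto - sse_2, " SSR(X2|X1) =", sse_1 - sse_12)  # both 171.125
```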

Model                             b1      b2
Y = β0 + β1X1 + ε                 5.375   -
Y = β0 + β2X2 + ε                 -       9.250
Y = β0 + β1X1 + β2X2 + ε          5.375   9.250

Extra sums of squares:
SSR(X1|X2) = SSR(X1) = 231.125
SSR(X2|X1) = SSR(X2) = 171.125

Unrelated predictor variables (not practical!): the correlation coefficient of X1 and X2 is zero, i.e. X1 and X2 are uncorrelated.
- The regression effect of one predictor variable does not depend on whether the other predictor variables are included in the model.
- The extra sums of squares are equal to the simple regression sums of squares.
- In this case, we can consider each predictor separately!

1.2 An example in which two predictor variables are perfectly correlated

Case   X1   X2   Y
1      2    6    23
2      8    9    83
3      6    8    63
4      10   10   103

Here X2 = 5 + 0.5X1, so the predictors are perfectly correlated. Two fitted lines that fit these data equally well are

Ŷ = -87 + X1 + 18X2
Ŷ = -7 + 9X1 + 2X2

- Sometimes the regression model can still give a good fit to the data, but the best fitted line (the least squares estimator) is not unique, which indicates larger variability/instability of the estimator.
- The usual interpretation of a regression coefficient is no longer applicable: we cannot vary one predictor variable while holding the others constant.
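A quick numpy check that X'X is singular here and that both reported response functions reproduce the observed Y exactly on these four points:

```python
import numpy as np

# Perfectly correlated predictors from Section 1.2 (X2 = 5 + 0.5*X1)
X1 = np.array([2, 8, 6, 10], dtype=float)
X2 = np.array([6, 9, 8, 10], dtype=float)
Y  = np.array([23, 83, 63, 103], dtype=float)

X = np.column_stack([np.ones(4), X1, X2])
print(np.linalg.matrix_rank(X.T @ X))   # 2 < 3: X'X is singular, no unique LS solution

# Both fitted lines reported above reproduce Y exactly on these points
print(-87 + 1 * X1 + 18 * X2)           # [ 23.  83.  63. 103.]
print(-7  + 9 * X1 +  2 * X2)           # [ 23.  83.  63. 103.]
```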

1.3 Body fat example revisited

Effects on regression coefficients: 20 healthy females, 25-34 years old.

Subject   X1     X2     X3     Y
1         19.5   43.1   29.1   11.9
2         24.7   49.8   28.2   22.8
...
19        22.7   48.2   27.1   14.8
20        25.2   51.0   27.5   21.1

The correlation matrix is

r     X1      X2      X3
X1    1.0     0.924   0.458
X2    0.924   1.0     0.085
X3    0.458   0.085   1.0

Inflated variability of the estimators:

Model                                   b1       b2
Y = β0 + β1X1 + ε                       0.8572   -
Y = β0 + β2X2 + ε                       -        0.8565
Y = β0 + β1X1 + β2X2 + ε                0.2224   0.6594
Y = β0 + β1X1 + β2X2 + β3X3 + ε         4.334    -2.857

Model                                   s{b1}    s{b2}
Y = β0 + β1X1 + ε                       0.1288   -
Y = β0 + β2X2 + ε                       -        0.1100
Y = β0 + β1X1 + β2X2 + ε                0.3034   0.2912
Y = β0 + β1X1 + β2X2 + β3X3 + ε         3.016    2.582

Extra sums of squares:
SSR(X1|X2) = 3.47     SSR(X1) = 352.27
SSR(X2|X1) = 33.17    SSR(X2) = 381.97
SSE(X1,X2) = 109.95

- There is no unique sum of squares that can be ascribed to any one predictor variable; we must take into account the other correlated predictor variables already included in the model.
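The inflation of s{b1} and s{b2} can be illustrated with a small simulation (NOT the body-fat data, which are only partially listed above): since s{b_k} = sqrt(MSE * diag((X'X)^{-1})), the standard errors grow as the correlation between X1 and X2 increases.

```python
import numpy as np

# Simulated illustration only: coefficient standard errors inflate with corr(X1, X2).
rng = np.random.default_rng(0)
n = 20

for rho in (0.0, 0.9, 0.99):
    X1 = rng.normal(size=n)
    X2 = rho * X1 + np.sqrt(1 - rho**2) * rng.normal(size=n)   # corr(X1, X2) ~ rho
    Y = 1 + 2 * X1 + 3 * X2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), X1, X2])

    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    mse = np.sum((Y - X @ b) ** 2) / (n - 3)                   # SSE / (n - p)
    s_b = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))       # s{b_k}
    print(f"rho = {rho:4.2f}   s(b1) = {s_b[1]:.3f}   s(b2) = {s_b[2]:.3f}")
```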

1.4 Effect of multicollinearity

When the multicollinearity is not strong, i.e. (X'X)^{-1} exists, we can still use the model to make predictions.

However, multicollinearity results in instability of the estimated coefficients, i.e. the standard errors of the estimated coefficients are large. Thus the model is unreliable.

The interpretation of the coefficients is also difficult. For example, β1 for X1 is interpreted as the increase in E(Y) when X1 increases by 1 unit IF the other predictor variables are held constant. The real situation is that the other predictor variables CANNOT be held constant when there is multicollinearity.

If the multicollinearity is too serious, e.g. X_{i1} = X_{i2}, then (X'X)^{-1} does not exist. There are other methods (not discussed here), such as ridge regression and regression with a penalty, for that case.

2 Polynomial regression models

General regression model: Y = f(X) + ε, or Y = f(X1, X2, ..., X_{p-1}) + ε.

Linear regression model: f(X) = β0 + β1X, or f(X1, X2, ..., X_{p-1}) = β0 + β1X1 + ... + β_{p-1}X_{p-1}.

Polynomial regression function: f(X) = β0 + β1X + β2X² + ... + βkX^k.

Reasons for using a polynomial regression model:
a. the true regression function is a polynomial function;
b. it gives a better approximation than a linear function (k = 1).

Second order: Yi = β0 + β1Xi + β2Xi² + εi
Third order:  Yi = β0 + β1Xi + β2Xi² + β3Xi³ + εi
Higher order: Yi = β0 + β1Xi + β2Xi² + ... + βkXi^k + εi

The higher the order, the more parameters (and the fewer degrees of freedom for error).
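A polynomial model is fit exactly like any multiple regression: the design matrix simply gains an X² column (and higher powers as needed). A minimal sketch with simulated data (not from the notes):

```python
import numpy as np

# Second-order polynomial fit as an ordinary multiple regression.
# Simulated data for illustration only.
rng = np.random.default_rng(1)
x = np.linspace(0, 4, 30)
y = 2 + 1.5 * x - 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)

X_quad = np.column_stack([np.ones_like(x), x, x**2])   # columns: 1, X, X^2
b, *_ = np.linalg.lstsq(X_quad, y, rcond=None)
print("b0, b1, b11 =", np.round(b, 3))                 # close to (2, 1.5, -0.5)
```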

Two predictors:
Yi = β0 + β1Xi1 + β2Xi2 + β11Xi1² + β22Xi2² + β12Xi1Xi2 + εi,
where β12 is the interaction effect coefficient.

Three predictors:
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + β11Xi1² + β22Xi2² + β33Xi3² + β12Xi1Xi2 + β13Xi1Xi3 + β23Xi2Xi3 + εi

Interpretation of interaction regression models. For
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi1Xi2 + εi,
the regression effect of X1 per unit, holding X2 constant, is β1 + β3X2;
the regression effect of X2 per unit, holding X1 constant, is β2 + β3X1.

These models are easy to implement as special cases of multiple regression (see the example below).

Using polynomial regression to test linearity of the regression function: first fit a third-order model
Yi = β0 + β1Xi + β11Xi² + β111Xi³ + εi,
then use SSR(X³|X, X²) or SSR(X³, X²|X) to test whether we can drop X³, or both X³ and X².

Example 1. Suppose we have data with two predictors X1, X2 and response Y. If we fit a linear regression model (see Code), the estimated model is

(Reduced model): Y = β0 + β1X1 + β2X2 + ε,

Ŷ      =  -543.594 + 61.211X1 - 101.387X2
(S.E.)    (228.244)  (3.774)    (42.099)

R² = 0.9535, Ra² = 0.948, σ̂ = 170.3, F-value 174.2 with df 2 and 17.
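The full second-order model considered next is again just a multiple regression on constructed columns. A minimal sketch of building its design matrix, with hypothetical predictor values since the raw data of Example 1 are not listed in the notes:

```python
import numpy as np

# Hypothetical predictors; the raw data behind Example 1 are not reproduced here.
rng = np.random.default_rng(2)
x1 = rng.uniform(0, 10, size=20)
x2 = rng.uniform(0, 5, size=20)

# Design matrix of the full second-order model: 1, X1, X2, X1^2, X1*X2, X2^2
X_full = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2])
print(X_full.shape)   # (20, 6), so the error degrees of freedom are 20 - 6 = 14
```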

If we consider the model

(Full model): Y = β0 + β1X1 + β2X2 + β11X1² + β12X1X2 + β22X2² + ε,

the estimated model is

Ŷ      =  -1.56 + 1.05X1 - 0.55X2 + 1.00X1² - 1.01X1X2 - 0.03X2²
(S.E.)    (0.73)  (0.02)   (0.28)   (0.001)   (0.003)    (0.03)

R² = 0.9999, Ra² = 0.9999, σ̂ = 0.0878, F-value 2.751e+08 with df 5 and 14.

It seems that X2 and X2² can be removed from the model. Let us consider the test

H0: β2 = β22 = 0.

We have SSE(F) = 0.0878² × 14 and SSE(R) = 0.1986² × 16 (the full model with the X2 and X2² terms removed), so

F = [(SSE(R) - SSE(F))/2] / [SSE(F)/14] = 33.93 > F(0.95; 2, 14).

Thus we reject H0: both terms cannot be removed together, so we try removing only one variable,

H0: β22 = 0,

under which we consider the model

(Reduced model): Y = β0 + β1X1 + β2X2 + β11X1² + β12X1X2 + ε.

We have SSE(R') = 0.08832² × 15 and

F = [(SSE(R') - SSE(F))/1] / [SSE(F)/14] = 1.178 < F(0.95; 1, 14),

so we conclude H0, i.e. X2² can be dropped. The estimated model is

Ŷ      =  -0.90 + 1.04X1 - 0.84X2 + 1.00X1² - 1.00X1X2
(S.E.)    (0.42)  (0.02)   (0.10)   (0.001)   (0.003)

R² = 0.9999, Ra² = 0.9999, σ̂ = 0.08832, F-value 3.399e+08 with df 4 and 15.
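The two F statistics above can be checked directly from the reported SSE values and degrees of freedom; a small sketch (scipy is used only for the F critical values):

```python
from scipy import stats

# Reconstructing the two general-linear-test F statistics from the reported fits.
sse_full = 0.0878**2 * 14    # SSE(F): full second-order model, 14 error df
sse_red  = 0.1986**2 * 16    # SSE(R): full model without X2 and X2^2, 16 error df

F1 = ((sse_red - sse_full) / 2) / (sse_full / 14)
print(round(F1, 2), ">", round(stats.f.ppf(0.95, 2, 14), 2))   # 33.93 > 3.74: reject H0

sse_red2 = 0.08832**2 * 15   # SSE(R'): model without the X2^2 term, 15 error df
F2 = ((sse_red2 - sse_full) / 1) / (sse_full / 14)
print(round(F2, 2), "<", round(stats.f.ppf(0.95, 1, 14), 2))   # 1.18 < 4.60: do not reject
```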