Remedial Measures for Multiple Linear Regression Models


Yang Feng (Columbia University)
http://www.stat.columbia.edu/~yangfeng

Outline

- Unequal Error Variance: Weighted Least Squares
- Multicollinearity: Ridge Regression
- Influential Cases: Robust Regression
- Nonparametric Regression: Lowess Method and Regression Trees
- Evaluating Precision: Bootstrapping

Unequal Error Variance

Y_i = β_0 + β_1 X_{i1} + … + β_{p-1} X_{i,p-1} + ε_i

Here the ε_i are independent N(0, σ_i²). (Originally: the ε_i are independent N(0, σ²).) In matrix form:

σ²{ε} = diag(σ_1², σ_2², …, σ_n²)

Known Error Variance

Define the weights w_i = 1/σ_i² and let

W = diag(w_1, w_2, …, w_n).

The weighted least squares and maximum likelihood estimator is

b_w = (X′WX)^{-1} X′WY

(Derivation on the board, two methods: direct MLE, and transforming the variables to an ordinary multiple linear regression.)
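As a minimal sketch of the estimator above (the design matrix, coefficients, and variances below are made up for illustration), b_w = (X′WX)^{-1} X′WY can be computed directly with NumPy:

```python
import numpy as np

# Simulated heteroskedastic data: error variance grows with the predictor.
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # design matrix with intercept
sigma2 = 0.5 + 0.3 * X[:, 1] ** 2                         # known error variances sigma_i^2
Y = X @ np.array([2.0, 1.5]) + rng.normal(0, np.sqrt(sigma2))

W = np.diag(1.0 / sigma2)                        # weights w_i = 1 / sigma_i^2
b_w = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)  # b_w = (X'WX)^{-1} X'WY
```

Equivalently, one can multiply each row of X and Y by sqrt(w_i) and run ordinary least squares, which is the transformation mentioned in the derivation.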

Error Variance Known up to a Proportionality Constant

With w_i = k · (1/σ_i²) for any constant k > 0, the same estimator b_w results.

Unknown Error Variances

In practice, one rarely knows the variances σ_i². Two approaches:

- Estimation of the variance function or standard deviation function
- Use of replicates or near replicates

Estimation of Variance Function or Standard Deviation Function

Four steps (which can be iterated several times until convergence):

1 Fit the regression model by unweighted least squares and analyze the residuals.
2 Estimate the variance function or the standard deviation function by regressing either the squared residuals or the absolute residuals on the appropriate predictor(s). (We know that the variance of ε_i is σ_i² = E(ε_i²) − (E(ε_i))² = E(ε_i²), so the squared residual e_i² is an estimator of σ_i².)
3 Use the fitted values from the estimated variance or standard deviation function to obtain the weights w_i.
4 Estimate the regression coefficients using these weights.

Use of Replicates or Near Replicates

When replicates or near replicates are available, use the sample variance of the replicates as the estimate of the variances. In observational studies, replicates are usually not available.

Multicollinearity

Multicollinearity arises, for example, in polynomial regression models, where the higher-order terms are correlated with the lower-order terms. One remedy: one or several predictor variables may be dropped from the model in order to remove the multicollinearity.

Ridge Estimators

OLS normal equations: (X′X)b = X′Y

Transformed by the correlation transformation: r_XX b = r_YX

Ridge estimator, for a constant c ≥ 0:

(r_XX + cI) b^R = r_YX

c = 0 gives OLS; c > 0 gives a biased but much more stable estimator.
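A minimal sketch of the ridge estimator on correlation-transformed variables (the two nearly collinear predictors are simulated to mimic multicollinearity):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.05, n)            # nearly collinear with x1
Y = 3.0 + x1 + x2 + rng.normal(0, 1, n)

def corr_transform(v):
    # correlation transformation: center, then scale by sqrt(n-1) * sd
    return (v - v.mean()) / (np.sqrt(n - 1) * v.std(ddof=1))

Xs = np.column_stack([corr_transform(x1), corr_transform(x2)])
Ys = corr_transform(Y)

r_XX = Xs.T @ Xs                             # correlation matrix of the predictors
r_YX = Xs.T @ Ys                             # correlations of Y with each predictor

# Solve (r_XX + cI) b_R = r_YX for several values of c
ridge = {c: np.linalg.solve(r_XX + c * np.eye(2), r_YX) for c in [0.0, 0.01, 0.1]}
```

At c = 0 this reproduces OLS on the transformed variables; as c grows, the coefficients shrink and stabilize.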

Choice of c

Plot the ridge trace (0 ≤ c ≤ 1) and the variance inflation factors VIF_k(c). Choose the c where the ridge trace starts to become stable and the VIFs have become sufficiently small.

Recall: the VIF measures how large the variance of b_k is relative to what it would be if the predictor variables were uncorrelated. Since

σ²{(r_XX + cI)^{-1} r_YX} ∝ (r_XX + cI)^{-1} r_XX (r_XX + cI)^{-1},

VIF_k(c) is the k-th diagonal element of (r_XX + cI)^{-1} r_XX (r_XX + cI)^{-1}.
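A small sketch of VIF_k(c) along a ridge trace; the 2×2 correlation matrix below is hypothetical, with two highly correlated predictors:

```python
import numpy as np

# Hypothetical correlation matrix of two predictors with r = 0.95
r_XX = np.array([[1.0, 0.95],
                 [0.95, 1.0]])

def vif(c):
    # VIF_k(c) is the k-th diagonal element of (r_XX + cI)^{-1} r_XX (r_XX + cI)^{-1}
    A = np.linalg.inv(r_XX + c * np.eye(2))
    return np.diag(A @ r_XX @ A)

vifs = {c: vif(c) for c in [0.0, 0.01, 0.05, 0.1, 0.5]}
```

At c = 0 the VIFs equal the usual diagonal of r_XX^{-1} (large under multicollinearity); they fall rapidly as c moves away from 0, which is what one looks for when choosing c.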

Robust Regression

Robust to outlying and influential cases.

LAD (Least Absolute Deviation) regression minimizes

L_1 = Σ_{i=1}^n |Y_i − (β_0 + β_1 X_{i1} + … + β_{p-1} X_{i,p-1})|.   (1)

LMS (Least Median of Squares) regression minimizes

median{[Y_i − (β_0 + β_1 X_{i1} + … + β_{p-1} X_{i,p-1})]²}.   (2)

IRLS Robust Regression

1 Choose a weight function for weighting the cases.
2 Obtain starting weights for all cases.
3 Use the starting weights in weighted least squares and obtain the residuals from the fit.
4 Use the residuals from step 3 to obtain revised weights.
5 Continue the iteration until convergence.

IRLS Robust Regression

1 Huber weight function:

w = 1,           |u| ≤ 1.345;
w = 1.345/|u|,   |u| > 1.345.   (3)

2 Bisquare weight function:

w = [1 − (u/4.685)²]²,   |u| ≤ 4.685;
w = 0,                   |u| > 4.685.   (4)

Starting weights

Huber weight: start from OLS.
Bisquare weight: use an initial robust fit obtained with Huber weights, or the residuals from LAD regression.

Scaled Residuals

When there are no outlying observations, normalize the residuals by √MSE. When there are outlying observations, use the resistant and robust median absolute deviation (MAD) estimator

MAD = (1/0.6745) median{|e_i − median{e_i}|}.   (5)

Then the scaled residual is

u_i = e_i / MAD.   (6)
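The IRLS steps, the Huber weight function, and the MAD scaling can be combined into a short sketch (data simulated, with five gross outliers added for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
Y = 1.0 + 2.0 * x + rng.normal(0, 1, n)
Y[:5] += 30.0                                # contaminate five cases

def huber_weights(u, k=1.345):
    # w = 1 for |u| <= k, and k/|u| otherwise
    w = np.ones_like(u)
    big = np.abs(u) > k
    w[big] = k / np.abs(u[big])
    return w

b = np.linalg.lstsq(X, Y, rcond=None)[0]     # starting fit: OLS
for _ in range(50):
    e = Y - X @ b
    mad = np.median(np.abs(e - np.median(e))) / 0.6745   # robust scale estimate
    w = huber_weights(e / mad)                           # weights from scaled residuals
    W = np.diag(w)
    b_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
    if np.max(np.abs(b_new - b)) < 1e-8:                 # converged
        b = b_new
        break
    b = b_new
```

The outliers receive small weights and barely move the fit, whereas they would pull an OLS fit noticeably.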

Lowess Method

Consider two predictor variables and a fitted value at (X_h1, X_h2).

Distance measure:

d_i = [(X_i1 − X_h1)² + (X_i2 − X_h2)²]^{1/2}.   (7)

Proportion q of the data nearest to (X_h1, X_h2): larger q leads to a smoother fit but may increase the bias. Usually between .4 and .6.

Weight function:

w_i = [1 − (d_i/d_q)³]³,   d_i < d_q;
w_i = 0,                   d_i ≥ d_q.   (8)
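A minimal sketch of a lowess fit at a single point, following (7) and (8): compute the distances, keep the nearest proportion q, apply tricube weights, and fit a local weighted linear regression (data simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X1 = rng.uniform(0, 1, n)
X2 = rng.uniform(0, 1, n)
Y = np.sin(3 * X1) + X2 ** 2 + rng.normal(0, 0.1, n)

def lowess_at(xh1, xh2, q=0.5):
    # distance of every case to the target point (X_h1, X_h2)
    d = np.sqrt((X1 - xh1) ** 2 + (X2 - xh2) ** 2)
    dq = np.quantile(d, q)                                # radius containing proportion q
    w = np.where(d < dq, (1 - (d / dq) ** 3) ** 3, 0.0)   # tricube weights
    Xd = np.column_stack([np.ones(n), X1, X2])
    W = np.diag(w)
    b = np.linalg.solve(Xd.T @ W @ Xd, Xd.T @ W @ Y)      # local weighted linear fit
    return b @ np.array([1.0, xh1, xh2])

yhat = lowess_at(0.5, 0.5)
```

Repeating this over a grid of (X_h1, X_h2) values traces out the full lowess surface.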

Regression Trees

A powerful nonparametric regression method: it can handle multiple predictors, is easy to compute, and requires virtually no assumptions. The fit is achieved by partitioning the covariate space.

Key quantities:
- Number of regions r.
- Split points between the regions.

Growing a regression tree

Take a single predictor as an example.

Growing a regression tree

Divide X into two regions R_21 and R_22. The optimal split point X_s minimizes the SSE,

SSE = SSE(R_21) + SSE(R_22),   (9)

where

SSE(R_2j) = Σ_{i ∈ R_2j} (Y_i − Ȳ_{R_2j})².   (10)
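The split search in (9)-(10) can be sketched as an exhaustive scan over candidate split points (data simulated with a jump at x = 5, so the best split should land near there):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 300)
y = np.where(x < 5, 1.0, 4.0) + rng.normal(0, 0.5, 300)

def sse(v):
    # sum of squared deviations of v around its region mean
    return np.sum((v - v.mean()) ** 2) if v.size else 0.0

candidates = np.unique(x)[1:]         # split between observed values, both sides nonempty
best_s, best_sse = None, np.inf
for s in candidates:
    total = sse(y[x < s]) + sse(y[x >= s])   # SSE(R_21) + SSE(R_22)
    if total < best_sse:
        best_s, best_sse = s, total
```

Growing a full tree repeats this search within each resulting region until a stopping rule is met.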

Growing a regression tree

(Figure not reproduced in this transcription.)

A graphical example

(Figure not reproduced in this transcription.)

Bootstrapping

A nonparametric method of evaluating the uncertainty of an estimate. Suppose we want to evaluate the precision of an estimated regression coefficient b_1.

1 Fix B, the number of bootstrap samples to be generated, say B = 500.
2 For each k = 1, …, B, sample n observations with replacement from the original n observations.
3 Fit a linear regression model to the bootstrap sample, which yields a coefficient b_1^{(k)}.
4 The sample standard deviation of {b_1^{(1)}, b_1^{(2)}, …, b_1^{(B)}} is a measure of the precision of b_1.
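The four steps above can be sketched directly (simulated data, B = 500 as on the slide):

```python
import numpy as np

rng = np.random.default_rng(6)
n, B = 100, 500
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
Y = 2.0 + 0.7 * x + rng.normal(0, 1, n)

b1_boot = np.empty(B)
for k in range(B):
    idx = rng.integers(0, n, n)                        # sample n cases with replacement
    b = np.linalg.lstsq(X[idx], Y[idx], rcond=None)[0]
    b1_boot[k] = b[1]                                  # slope estimate b_1^{(k)}

se_b1 = b1_boot.std(ddof=1)    # bootstrap measure of the precision of b_1
```

This is the random-X version of the bootstrap; the fixed-X variant on the next slide resamples residuals instead of cases.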

Bootstrap sampling

Fixed-X sampling: appropriate when the regression function is a good model for the data and the error terms have constant variance. Obtain the residuals e_i from the original fit, draw a bootstrap sample e_1*, …, e_n* of size n from them, and define new responses

Y_i* = Ŷ_i + e_i*.   (11)

Then regress the Y* values on the original X variables.

Random-X sampling: draw a bootstrap sample of (X, Y) pairs.
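One fixed-X resampling step per (11) looks like this (data simulated for illustration; in practice the step is repeated B times):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
Y = 2.0 + 0.7 * x + rng.normal(0, 1, n)

b = np.linalg.lstsq(X, Y, rcond=None)[0]
Yhat = X @ b
e = Y - Yhat                                   # residuals from the original fit

e_star = rng.choice(e, size=n, replace=True)   # bootstrap sample of the residuals
Y_star = Yhat + e_star                         # Y_i* = Yhat_i + e_i*, same X
b_star = np.linalg.lstsq(X, Y_star, rcond=None)[0]   # refit on (X, Y*)
```

Because the X values are held fixed, this variant preserves the original design, which is why it requires constant error variance.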

Bootstrap Confidence Intervals

From the bootstrap distribution of b_1*, obtain the α/2 and 1 − α/2 quantiles, b_1*(α/2) and b_1*(1 − α/2). Suppose the original estimate is b_1.

Percentile Bootstrap:

b_1*(α/2) ≤ β_1 ≤ b_1*(1 − α/2).   (12)

Basic Bootstrap (Reflection method):

2b_1 − b_1*(1 − α/2) ≤ β_1 ≤ 2b_1 − b_1*(α/2).   (13)
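Both intervals, (12) and (13), can be computed from the same set of bootstrap replicates; a sketch with simulated data and random-X resampling:

```python
import numpy as np

rng = np.random.default_rng(7)
n, B, alpha = 100, 2000, 0.05
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
Y = 2.0 + 0.7 * x + rng.normal(0, 1, n)

b1 = np.linalg.lstsq(X, Y, rcond=None)[0][1]    # original estimate b_1
b1_boot = np.empty(B)
for k in range(B):
    idx = rng.integers(0, n, n)                 # resample cases with replacement
    b1_boot[k] = np.linalg.lstsq(X[idx], Y[idx], rcond=None)[0][1]

lo, hi = np.quantile(b1_boot, [alpha / 2, 1 - alpha / 2])
percentile_ci = (lo, hi)                        # b_1*(a/2) <= beta_1 <= b_1*(1-a/2)
reflection_ci = (2 * b1 - hi, 2 * b1 - lo)      # 2b_1 - b_1*(1-a/2) <= beta_1 <= 2b_1 - b_1*(a/2)
```

The two intervals have the same width; the reflection interval simply mirrors the quantiles around the original estimate b_1.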

Reflection Method, Explained

With probability 1 − α,

b_1*(α/2) ≤ b_1* ≤ b_1*(1 − α/2).   (14)

Let

D_1 = b_1 − b_1*(α/2),   (15)
D_2 = b_1*(1 − α/2) − b_1.   (16)

Since the distribution of b_1* − b_1 approximates the distribution of b_1 − β_1, (14) says that with probability 1 − α, −D_1 ≤ b_1 − β_1 ≤ D_2. Substituting (15) and (16) and rearranging, we have

b_1 − D_2 ≤ β_1 ≤ b_1 + D_1.   (17)