Remedial Measures for Multiple Linear Regression Models

Yang Feng (Columbia University)
http://www.stat.columbia.edu/~yangfeng
Outline

- Unequal Error Variance: Weighted Least Squares
- Multicollinearity: Ridge Regression
- Influential Cases: Robust Regression
- Nonparametric Regression: Lowess Method and Regression Trees
- Evaluating Precision: Bootstrapping
Unequal Error Variance

Y_i = β_0 + β_1 X_{i1} + … + β_{p-1} X_{i,p-1} + ε_i

Here the ε_i are independent N(0, σ_i²). (Originally: the ε_i are independent N(0, σ²).)

In matrix form, the error covariance matrix is diagonal:

σ²{ε} = diag(σ_1², σ_2², …, σ_n²)
      = [ σ_1²   0    …   0
           0    σ_2²  …   0
           ⋮     ⋮        ⋮
           0     0    …  σ_n² ]
Known Error Variances

Define the weights w_i = 1/σ_i² and let

W = diag(w_1, w_2, …, w_n).

The weighted least squares (and maximum likelihood) estimator is

b_w = (X'WX)^{-1} X'WY.

(Derivation on the board, two methods: direct MLE, and transforming the variables to reduce to the regular multiple linear regression.)
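As an illustration (not from the slides), here is a minimal sketch of weighted least squares for a single-predictor model with known weights w_i = 1/σ_i². The closed-form expressions below are the scalar specialization of b_w = (X'WX)^{-1} X'WY; the function name `wls_simple` is our own.

```python
def wls_simple(x, y, w):
    """Weighted least squares for y = b0 + b1*x with known weights w_i = 1/sigma_i^2."""
    sw = sum(w)
    # Weighted means of x and y.
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Weighted slope and intercept (scalar form of (X'WX)^{-1} X'WY).
    b1 = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y)) / \
         sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b0 = ybar - b1 * xbar
    return b0, b1
```

With equal weights this reduces to ordinary least squares, and on data lying exactly on a line it recovers that line for any positive weights.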
Error Variance Known up to a Proportionality Constant

If w_i = k · (1/σ_i²) for some constant k > 0, the same estimator b_w results, since k cancels in (X'WX)^{-1} X'WY.
Unknown Error Variances

In reality, one rarely knows the variances σ_i². Two approaches:

- Estimation of the variance function or standard deviation function
- Use of replicates or near replicates
Estimation of the Variance Function or Standard Deviation Function

Four steps (which can be iterated several times to reach convergence):

1. Fit the regression model by unweighted least squares and analyze the residuals.
2. Estimate the variance function or the standard deviation function by regressing either the squared residuals or the absolute residuals on the appropriate predictor(s). (We know that the variance of ε_i is σ_i² = E(ε_i²) − (E(ε_i))² = E(ε_i²), since E(ε_i) = 0. Hence the squared residual e_i² is an estimator of σ_i².)
3. Use the fitted values from the estimated variance or standard deviation function to obtain the weights w_i.
4. Estimate the regression coefficients using these weights.
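The four steps above can be sketched for a single predictor as follows. This is our own minimal illustration (function names `ols_simple`, `wls_simple`, `estimated_weights_wls` are invented); it regresses the absolute residuals on x to estimate the standard deviation function, and it assumes the fitted standard deviations come out positive, which a real implementation would have to check.

```python
def ols_simple(x, y):
    # Ordinary least squares for y = b0 + b1*x.
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    return ybar - b1 * xbar, b1

def wls_simple(x, y, w):
    # Weighted least squares for y = b0 + b1*x.
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b1 = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y)) / \
         sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return ybar - b1 * xbar, b1

def estimated_weights_wls(x, y):
    # Step 1: unweighted fit and residuals.
    b0, b1 = ols_simple(x, y)
    e = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    # Step 2: regress absolute residuals on x to estimate the sd function.
    a0, a1 = ols_simple(x, [abs(ei) for ei in e])
    shat = [a0 + a1 * xi for xi in x]  # assumed positive here
    # Step 3: weights are inverse squared fitted standard deviations.
    w = [1.0 / s ** 2 for s in shat]
    # Step 4: weighted fit.
    return wls_simple(x, y, w), w
```

On data whose error spread grows with x, the estimated weights decrease with x, downweighting the noisier observations.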
Use of Replicates or Near Replicates

When replicates or near replicates are available, use the sample variance of the replicates as the estimate of the error variances. In observational studies, replicates are usually not available.
Multicollinearity

In polynomial regression models, the higher-order terms are often highly correlated with the lower-order terms, a common source of multicollinearity.

One or several predictor variables may be dropped from the model in order to reduce the multicollinearity.
Ridge Estimators

OLS normal equations: (X'X) b = X'Y.

After the correlation transformation: r_XX b = r_YX.

Ridge estimator: for a constant c ≥ 0,

(r_XX + cI) b^R = r_YX.

With c = 0 this is OLS; with c > 0 the estimator is biased, but much more stable.
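For two standardized predictors, the ridge system (r_XX + cI) b^R = r_YX is a 2×2 linear system that can be solved by hand. The sketch below (our own, with the invented name `ridge_2pred`) takes the off-diagonal correlation r_XX[1,2] and the two predictor–response correlations as inputs:

```python
def ridge_2pred(rxx12, ryx1, ryx2, c):
    """Solve (r_XX + cI) bR = r_YX for p = 2 standardized predictors.

    rxx12: correlation between the two predictors.
    ryx1, ryx2: correlations of Y with each predictor.
    c: ridge constant, c >= 0 (c = 0 gives OLS).
    """
    a11, a12 = 1.0 + c, rxx12
    a21, a22 = rxx12, 1.0 + c
    det = a11 * a22 - a12 * a21  # positive for c >= 0 when |rxx12| < 1
    # Cramer's rule for the 2x2 system.
    b1 = (ryx1 * a22 - a12 * ryx2) / det
    b2 = (a11 * ryx2 - a21 * ryx1) / det
    return b1, b2
```

With uncorrelated predictors and c = 0 the coefficients equal the simple correlations; increasing c shrinks them toward zero, illustrating the bias–stability trade-off.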
Choice of c

Examine the ridge trace (0 ≤ c ≤ 1) and the variance inflation factors VIF_k(c). Choose the c where the ridge trace starts to become stable and the VIFs have become sufficiently small.

Recall: the VIF measures how large the variance of b_k is relative to what the variance would be if the predictor variables were uncorrelated. Since

σ²{(r_XX + cI)^{-1} r_YX} ∝ (r_XX + cI)^{-1} r_XX (r_XX + cI)^{-1},

VIF_k(c) is the k-th diagonal element of (r_XX + cI)^{-1} r_XX (r_XX + cI)^{-1}.
Robust Regression

Robust to outlying and influential cases.

LAD (Least Absolute Deviations) regression: minimize

L_1 = Σ_{i=1}^n |Y_i − (β_0 + β_1 X_{i1} + … + β_{p-1} X_{i,p-1})|.   (1)

LMS (Least Median of Squares) regression: minimize

median{ [Y_i − (β_0 + β_1 X_{i1} + … + β_{p-1} X_{i,p-1})]² }.   (2)
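To see why the L_1 criterion (1) resists outliers, here is a crude brute-force sketch (our own, not a practical algorithm) that minimizes the LAD criterion over a parameter grid for one predictor. Real LAD fits use linear programming; the grid search only illustrates the criterion itself.

```python
def lad_grid(x, y, b0_grid, b1_grid):
    """Brute-force LAD: minimize the L1 criterion (sum of absolute residuals)
    over a grid of (b0, b1) values. Illustration only, not a practical solver."""
    best = None
    for b0 in b0_grid:
        for b1 in b1_grid:
            l1 = sum(abs(yi - b0 - b1 * xi) for xi, yi in zip(x, y))
            if best is None or l1 < best[0]:
                best = (l1, b0, b1)
    return best[1], best[2]
```

On data lying on y = x except for one gross outlier, the LAD fit stays on the true line, whereas least squares would be pulled strongly toward the outlier.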
IRLS Robust Regression

1. Choose a weight function for weighting the cases.
2. Obtain starting weights for all cases.
3. Use the starting weights in weighted least squares and obtain the residuals from the fit.
4. Use the residuals in step 3 to obtain revised weights.
5. Continue the iterations until convergence.
IRLS Robust Regression: Weight Functions

1. Huber weight function:

   w = 1             if |u| ≤ 1.345,
   w = 1.345 / |u|   if |u| > 1.345.   (3)

2. Bisquare weight function:

   w = [1 − (u/4.685)²]²   if |u| ≤ 4.685,
   w = 0                   if |u| > 4.685.   (4)
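The two weight functions (3) and (4) translate directly into code; this small sketch is our own, with the standard tuning constants 1.345 and 4.685 as defaults:

```python
def huber_w(u, k=1.345):
    """Huber weight (3): full weight for small scaled residuals,
    weight proportional to 1/|u| beyond the cutoff k."""
    return 1.0 if abs(u) <= k else k / abs(u)

def bisquare_w(u, k=4.685):
    """Bisquare weight (4): smooth downweighting, exactly zero beyond k."""
    return (1.0 - (u / k) ** 2) ** 2 if abs(u) <= k else 0.0
```

Note the qualitative difference: Huber weights never reach zero, so every case retains some influence, while bisquare weights completely discard cases with |u| > 4.685.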
Starting Weights

Huber weights: start from the OLS fit.
Bisquare weights: start from an initial robust fit using Huber weights, or from the residuals of an LAD regression.

Scaled Residuals

When there are no outlying observations, scale the residuals by √MSE. When there are outlying observations, use the resistant and robust median absolute deviation (MAD) estimator

MAD = (1/0.6745) median{ |e_i − median{e_i}| }.   (5)

The scaled residual is then

u_i = e_i / MAD.   (6)
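Putting the pieces together, here is a self-contained sketch (our own; names `wls_simple`, `irls_huber` invented) of IRLS with Huber weights and MAD scaling for one predictor, using a fixed number of iterations instead of a convergence test for simplicity:

```python
def wls_simple(x, y, w):
    # Weighted least squares for y = b0 + b1*x.
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b1 = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y)) / \
         sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return ybar - b1 * xbar, b1

def median(v):
    s = sorted(v)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def irls_huber(x, y, iters=25):
    # Starting weights for the Huber function: the OLS fit (all weights 1).
    w = [1.0] * len(x)
    for _ in range(iters):
        # Weighted least squares fit and residuals.
        b0, b1 = wls_simple(x, y, w)
        e = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
        # MAD-scaled residuals (5)-(6), then revised Huber weights (3).
        mad = median([abs(ei - median(e)) for ei in e]) / 0.6745
        u = [ei / mad for ei in e]
        w = [1.0 if abs(ui) <= 1.345 else 1.345 / abs(ui) for ui in u]
    return b0, b1
```

On data following y ≈ x with one gross outlier, OLS is pulled far from slope 1, while the IRLS fit progressively downweights the outlier and recovers a slope near 1.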
Lowess Method

With two predictor variables, consider the fitted value at (X_h1, X_h2).

Distance measure:

d_i = [ (X_i1 − X_h1)² + (X_i2 − X_h2)² ]^{1/2}.   (7)

Proportion q of the data nearest to (X_h1, X_h2): a larger q leads to a smoother fit, but may increase the bias. Usually between .4 and .6.

Weight function (tricube):

w_i = [1 − (d_i/d_q)³]³   if d_i < d_q,
w_i = 0                   if d_i ≥ d_q.   (8)
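The neighborhood weighting in (8) can be sketched as follows (our own illustration; the name `lowess_weights` is invented). Given the distances d_i to the target point, it finds d_q, the distance to the boundary of the proportion-q neighborhood, and applies the tricube weight:

```python
def lowess_weights(dists, q):
    """Tricube weights (8) for the proportion q of points nearest the target.
    dists: distances d_i to the target point; q: neighborhood proportion."""
    # d_q = distance to the m-th nearest point, m ~ q * n.
    m = max(1, round(q * len(dists)))
    dq = sorted(dists)[m - 1]
    return [(1.0 - (d / dq) ** 3) ** 3 if d < dq else 0.0 for d in dists]
```

The target's nearest points get weight close to 1, weights fall smoothly to 0 at the edge of the neighborhood, and points beyond d_q are excluded entirely; a weighted regression on the neighborhood then produces the fitted value.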
Regression Trees

A powerful nonparametric regression method. It can handle multiple predictors, is easy to compute, and requires virtually no assumptions. The fit is achieved by partitioning the covariate space.

Key quantities: the number of regions r, and the split points between the regions.
Growing a Regression Tree

Take a single predictor as an example.
Growing a Regression Tree

Divide X into two regions, R_21 and R_22. The optimal split point X_s is the one minimizing the SSE:

SSE = SSE(R_21) + SSE(R_22),   (9)

where, for each region,

SSE(R_rj) = Σ_{i ∈ R_rj} (Y_i − Ȳ_{R_rj})².   (10)
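The split search in (9)–(10) can be sketched as a scan over candidate split points (midpoints between consecutive x values); this is our own minimal illustration with invented names `sse` and `best_split`:

```python
def sse(ys):
    """Within-region sum of squared deviations from the region mean, as in (10)."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(x, y):
    """Scan candidate split points (midpoints between consecutive x values)
    and return the split minimizing total SSE, as in (9)."""
    pairs = sorted(zip(x, y))
    best = None
    for j in range(1, len(pairs)):
        s = (pairs[j - 1][0] + pairs[j][0]) / 2
        left = [yi for xi, yi in pairs if xi <= s]
        right = [yi for xi, yi in pairs if xi > s]
        total = sse(left) + sse(right)
        if best is None or total < best[0]:
            best = (total, s)
    return best[1]
```

Growing the full tree applies this search recursively within each resulting region until a stopping rule (e.g. a minimum region size or number of regions r) is met.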
Growing a Regression Tree

(figure omitted)
A Graphical Example

(figure omitted)
Bootstrapping

A nonparametric method of evaluating the uncertainty of an estimate. Suppose we want to evaluate the precision of an estimated regression coefficient b_1.

1. Fix B as the number of bootstrap samples to be generated, say B = 500.
2. For each k = 1, …, B, sample n observations with replacement from the original n observations.
3. Fit a linear regression model using the bootstrap sample, which yields a coefficient b_1^(k).
4. The sample standard deviation of {b_1^(1), b_1^(2), …, b_1^(B)} is a measure of the precision of b_1.
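The four steps above can be sketched for a single predictor using random-X (pairs) resampling. This is our own illustration (names `ols_slope` and `bootstrap_slopes` invented), using Python's stdlib random generator:

```python
import random

def ols_slope(x, y):
    # Ordinary least squares slope of y on x.
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
           sum((xi - xbar) ** 2 for xi in x)

def bootstrap_slopes(x, y, B=500, seed=0):
    """Steps 1-3: B bootstrap resamples of the (x, y) pairs, refitting each time."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        slopes.append(ols_slope([x[i] for i in idx], [y[i] for i in idx]))
    return slopes
```

Step 4 is then just the sample standard deviation of the returned list, which estimates the precision of b_1.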
Bootstrap Sampling

Fixed-X sampling: appropriate if the regression function is a good model for the data and the error terms have constant variance. Obtain the residuals e_i from the original fit, draw a bootstrap sample e_1*, …, e_n* of size n from the e_i, and define new responses Y_1*, …, Y_n* by

Y_i* = Ŷ_i + e_i*.   (11)

Then regress the Y* values on the original X variables.

Random-X sampling: draw a bootstrap sample of the (X, Y) pairs.
Bootstrap Confidence Intervals

From the bootstrap distribution of b_1*, obtain the α/2 and 1 − α/2 quantiles, b_1*(α/2) and b_1*(1 − α/2). Suppose the original estimate is b_1.

Percentile bootstrap:

b_1*(α/2) ≤ β_1 ≤ b_1*(1 − α/2).   (12)

Basic bootstrap (reflection method):

2b_1 − b_1*(1 − α/2) ≤ β_1 ≤ 2b_1 − b_1*(α/2).   (13)
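Both intervals (12) and (13) come from the same two bootstrap quantiles. The sketch below is our own (names `percentile` and `bootstrap_cis` invented) and uses a crude empirical quantile, adequate for illustration but not a careful quantile estimator:

```python
def percentile(vals, p):
    # Crude empirical quantile: the value at index floor(p * n).
    s = sorted(vals)
    return s[min(len(s) - 1, max(0, int(p * len(s))))]

def bootstrap_cis(boot, b1, alpha=0.05):
    """Return (percentile CI (12), basic/reflection CI (13)) for beta_1.
    boot: bootstrap estimates b_1*(k); b1: the original estimate."""
    lo = percentile(boot, alpha / 2)        # b_1*(alpha/2)
    hi = percentile(boot, 1 - alpha / 2)    # b_1*(1 - alpha/2)
    return (lo, hi), (2 * b1 - hi, 2 * b1 - lo)
```

Note the reflection: the basic interval's lower limit uses the upper bootstrap quantile and vice versa, exactly as in (13).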
Reflection Method, Explained

The bootstrap distribution of b_1* − b_1 approximates the sampling distribution of b_1 − β_1. Hence, with probability approximately 1 − α,

b_1*(α/2) − b_1 ≤ b_1 − β_1 ≤ b_1*(1 − α/2) − b_1.   (14)

Let

D_1 = b_1 − b_1*(α/2),   (15)
D_2 = b_1*(1 − α/2) − b_1.   (16)

Substituting (15) and (16) into (14) gives −D_1 ≤ b_1 − β_1 ≤ D_2, that is,

b_1 − D_2 ≤ β_1 ≤ b_1 + D_1.   (17)