Lecture Week: Multiple Linear Regression
- Predict y from (possibly) many predictors x, including extra derived variables
- Model criticism: study the importance of columns; draw on the scientific framework; experiment to find the simplest & best predictor
- Look for important rows: diagnostics, outliers and influence
12/02/2015
Interpreting the coefficients in MLR
- Experiment with models: dropping/adding vars impacts the coeffs of the others
- No issues to discuss unless > 1 predictor: co-variation in at least 3 dimensions
- Do more x-variables mean better models? More coeffs: bigger R-Sq, smaller S & SumSq
- Simplest models: redundant coefficients have value 0, the rest have large T-values!
- Simple models: best science
Multiple Linear Regression
- Predict y from (possibly) many predictors x, including extra derived variables
- Experiment with models: dropping/adding vars, noting changes in R-Sq and in the fitted coeffs
- Check diagnostics: VIF
- Find the simplest & best predictor; understand how y interacts with the predictors x
How important is predictor x_k?
Fitted coeff b_k = average increase/decrease in y when x_k increases by one unit and all other predictors are unchanged.
Big numerical value? Big T-ratio? Small p?
How important is predictor x_k?
What if:
- an important predictor is not available?
- some x vars are highly inter-correlated?
What are the implications for interpreting b_k, and for changes in b_k when other variables are added/dropped?
Trees: A simple case
Linear model regressing Vol on Height, Diam and Ht*Diam^2; simple theory available.
Diameter and Height important

The regression equation is
Volume = -58.0 + 4.71 Diameter + 0.339 Height

Predictor     Coef     SE Coef     T       P
Constant   -57.988     8.638     -6.71    0.000
Diameter     4.7082    0.2643    17.82    0.000
Height       0.3393    0.1302     2.61    0.014

S = 3.88183   R-Sq = 94.8%   R-Sq(adj) = 94.4%
Diameter and Height not important

The regression equation is
Volume = -1.7 - 0.094 Diameter + 0.030 Height + 0.394 Ht*Diam^2

Predictor     Coef      SE Coef     T       P
Constant     -1.6      10.94      -0.15    0.881
Diameter     -0.0942    0.8133    -0.12    0.909
Height        0.0299    0.1004     0.30    0.768
Ht*Diam^2     0.3939    0.0614     6.0     0.000

S = 2.7630   R-Sq = 97.8%   R-Sq(adj) = 97.5%
Strategies with correlated predictors
Regression is a device to think about the relative importance of the X vars.
- Correlation important? Proceed with care; need regression theory! Transformations, derived variables; modify models; use VIF to predict the consequences.
- Correlation relatively unimportant? Semi-automatic options, incl. Best Subsets/Stepwise.
Outline
- Examples, mostly in more than two dims
- Theory for correlated x-vars
- Sums of squares: R-Sq, multiple correlation and simple corr
- Changes in R-Sq: partial R-Sq; changes depend on the ORDER of the x-vars
- Coefficients, and their T or P values, are NOT always a measure of importance
Technical Material
- MLR as a sequence of SLRs
- Partial R-Sq
- Correlated predictors: variance inflation, multi-collinearity
- Use of the intercept term with indicator variables
Extreme case: Tree Vol with x1 = x2 = Ht
Regress Vol on x1: Vol = -87.12 + 1.543 x1 = -87.12 + 1.543 x1 + 0 x2
Regress Vol on x2: Vol = -87.12 + 0 x1 + 1.543 x2
Regress Vol on both: Vol = -87.12 + b x1 + (1.543 - b) x2 for any arbitrary value of b!!
An infinity of identical solutions, all equally good for predicting. MINITAB notes this and takes action. Extra technical material online.
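A quick numerical check of the slide's claim, as a minimal Python sketch (the height values are hypothetical; the coefficients are the slide's): when x1 = x2, every choice of b gives exactly the same predictions, so there is no unique solution.

```python
# Family of equivalent fits from the slide: Vol = -87.12 + b*x1 + (1.543 - b)*x2.
# With x1 = x2 = Ht, the b terms cancel, so all members predict identically.

def pred(b, x1, x2):
    return [-87.12 + b * a + (1.543 - b) * c for a, c in zip(x1, x2)]

ht = [70.0, 65.0, 63.0, 72.0, 81.0, 83.0]   # hypothetical heights; x1 = x2 = ht
p0 = pred(0.0, ht, ht)
for b in (-5.0, 0.7, 1.543, 100.0):
    # identical predictions for every arbitrary b
    assert all(abs(u - v) < 1e-6 for u, v in zip(pred(b, ht, ht), p0))
```

Any solver asked to pick one b from this infinite family faces a singular system, which is exactly why the software must intervene.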
Common case: x1 nearly equal to x2
An infinity of nearly identical solutions: many pairs of coeffs are almost equivalent. More generally, at least one x is nearly perfectly predictable from the other x vars, so many sets of coeffs are almost equivalent. The user notes this and takes action.
PEMax: Too many predictors? (M. Stuart)
Objective: relate respiratory muscle strength (PEmax) to other measures of lung function in patients suffering from cystic fibrosis, adjusting for sex and body size.
Controlling for external variation
Observational data are often unbalanced, e.g. in age and gender. Ideally data collection is designed: equal numbers M/F, similar age distribution in each group. Regression is often used to control for such variation.
PEMax: The variables
PEmax   Maximal static expiratory pressure, a measure of expiratory muscle strength
FEV1    Forced expiratory volume in 1 second
RV      Residual volume
FRC     Functional residual capacity
TLC     Total lung capacity
Sex     0 = Male, 1 = Female
Height  cms.
Weight  kg.
BMP     Body mass (percent of median of normal cases)

Sub  Age  Sex   Ht    Wt   BMP  FEV1   RV  FRC  TLC  PEmax
  1    7    0  109  13.1    68    32  258  183  137     95
  2    7    1  112  12.9    65    19  449  245  134     85
  3    8    0  124  14.1    64    22  441  268  147    100
  4    8    1  125  16.2    67    41  234  146  124     85
  5    8    0  127  21.5    93    52  202  131  104     95
  6    9    0  130  17.5    68    44  308  155  118     80
  7   11    1  139  30.7    89    28  305  179  119     65
  8   12    1  150  28.4    69    18  369  198  103    110
  9   12    0  146  25.1    67    24  312  194  128     70
PEmax: Too many vars? All coeffs small

The regression equation is
PEmax = 102 + 1.36 FEV1 + 0.172 RV - 0.206 FRC + 0.27 TLC - 0.8 Sex
        - 0.373 Height + 2.1 Weight - 1.39 BMP

Predictor     Coef     SE Coef     T       P
Constant   101.7     172.3        0.59    0.563
FEV1         1.3626    0.918      1.48    0.17
RV           0.1723    0.1860     0.93    0.368
FRC         -0.206     0.4410    -0.47    0.647
TLC          0.271     0.4614     0.60    0.559
Sex         -0.76     14.04      -0.05    0.958
Height      -0.3731    0.8721    -0.43    0.67
Weight       2.12      1.19       1.80    0.091
BMP         -1.3868    0.913     -1.52    0.148

S = 24.8872   R-Sq = 63.1%   R-Sq(adj) = 44.6%

No variables important? The issue here: too many correlated variables. The challenge here: poor theoretical guidance. Return later.
Some simple cases
Scientific framework; networks
Direct and Indirect Importance
Theory / scientific framework: a causal model.
Either: Ht, Diam and Ht*Diam^2 each point directly to Tree Vol.
OR: Ht and Diam point to Ht*Diam^2, which alone points to Tree Vol.
Math Marks
Marks of 88 students in maths exams. How to predict the Stat mark from the others? What can we learn from the coeffs in the best predictor?

                          Correlation mx R
Variable   Mean  StdDev   Mech  Vect  Alg   Anal  Stat
Mech      38.95   17.49   1.00  0.55  0.55  0.41  0.39
Vect      50.59   13.15   0.55  1.00  0.61  0.49  0.44
Alg       50.60   10.62   0.55  0.61  1.00  0.71  0.67
Anal      46.68   14.85   0.41  0.49  0.71  1.00  0.61
Stat      42.31   17.26   0.39  0.44  0.67  0.61  1.00
Math Marks: guidance from theory
Causal diagram: Mechanics and Vectors feed Algebra; Algebra feeds Analysis and Statistics.
Predicting Statistics Performance

The regression equation is
Stat = -11.4 + 0.313 Anal + 0.729 Alg + 0.026 Vect + 0.022 Mech

Predictor     Coef      SE Coef     T       P
Constant    -11.378     6.982     -1.63    0.107
Anal          0.3129    0.1315     2.38    0.020
Alg           0.7294    0.2096     3.48    0.001
Vect          0.0257    0.1395     0.18    0.854
Mech          0.02217   0.0989     0.22    0.823

S = 12.7478   R-Sq = 47.9%   R-Sq(adj) = 45.4%
Alternative Predictions

Stat = -11.2 + 0.316 Anal + 0.765 Alg
Predictor     Coef     SE Coef     T       P
Constant    -11.192    6.592     -1.70    0.093
Anal          0.3164   0.1294     2.44    0.017
Alg           0.7653   0.1808     4.23    0.000
S = 12.606   R-Sq = 47.9%

Stat = 1.92 + 0.601 Anal + 0.244 Vect
Predictor     Coef     SE Coef     T       P
Constant      1.922    6.16       0.31    0.756
Anal          0.6011   0.1121     5.36    0.000
Vect          0.2436   0.1266     1.92    0.058
S = 13.787   R-Sq = 39.5%
Theory
Ex a. Uncorrelated x-variables

Artificial data
 x1   x2      e       y
 -1   -1    0.09   -1.91
 -1    0   -0.61   -1.61
 -1    1   -0.33   -0.33
  0   -1   -0.9    -1.9
  0    0   -0.1    -0.1
  0    1    0.06    1.06
  1   -1    0.34    0.34
  1    0    0.1     1.1
  1    1   -0.08    1.92
SS(total) 15.87

Corr        x1     x2
x2       0.000
y        0.739  0.627

Data generating model:
Y = b0 + b1 x1 + b2 x2 + eps, eps ~ N(0, sigma^2), with b0 = 0, b1 = 1, b2 = 1, sigma = 0.5.

The regression equation is
ybal = -0.164 + 1.20 x1 + 1.02 x2bal
Predictor     Coef     SE Coef     T       P
Constant    -0.1644    0.1338    -1.23    0.264
x1           1.2017    0.1639     7.33    0.000
x2bal        1.0200    0.1639     6.22    0.001
S = 0.40140   R-Sq = 93.9%   SS Total = 15.8738
Ex b. Correlated x-variables

Artificial data
 x1     x2      e       y
 -1    -1.2   0.09   -2.11
 -1    -1    -0.6    -2.61
 -1    -0.8  -0.3    -2.13
 -0.8  -1    -0.9    -2.7
  0     0    -0.2    -0.1
  0.8   1     0.06    1.86
  1     1.2   0.34    2.54
  1     1     0.1     2.1
  1     0.8  -0.1     1.72
SS(total) 40.15

Corr        x1     x2
x2       0.986
y        0.986  0.991

Data generating model:
Y = b0 + b1 x1 + b2 x2 + eps, eps ~ N(0, sigma^2), with b0 = 0, b1 = 1, b2 = 1, sigma = 0.5.

The regression equation is
y = -0.164 + 0.786 x1 + 1.46 x2
Predictor     Coef     SE Coef     T       P
Constant    -0.1644    0.1078    -1.53    0.178
x1           0.7863    0.7203     1.09    0.317
x2           1.460     0.6804     2.15    0.075
S = 0.32340   R-Sq = 98.4%

Coeffs smaller, SE(Coeffs) larger.
[Scatterplot of x1unbal vs x2unbal]
MLR as successive SLR
Additional info from x2 not in x1:
1.  Regress y on x1; store resids RESy.x1
1a. Regress x2 on x1; store resids RESx2.x1
Thus RESy.x1 and RESx2.x1 represent those aspects of y and x2 that DO NOT depend on x1.
2.  Regress RESy.x1 on RESx2.x1
24/02/2015
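The three steps above can be sketched in Python (the data are hypothetical; the helper names `slr` and `mlr2` are mine). The coefficient of x2 from the residual-on-residual regression in step 2 equals its coefficient in the full two-predictor MLR:

```python
# MLR as successive SLR: step-2 slope equals the MLR coefficient of x2.

def slr(x, y):
    """Simple linear regression; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

def mlr2(x1, x2, y):
    """Two-predictor OLS via the normal equations; returns (b0, b1, b2)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((a - m2) ** 2 for a in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

x1 = [-1.0, -1.0, -1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]   # hypothetical, correlated
x2 = [-1.2, -1.0, -0.8, -0.1, 0.0, 0.1, 0.8, 1.0, 1.2]
y  = [-2.1, -2.0, -1.9, -0.2, 0.1, 0.0, 1.9, 2.0, 2.2]

_, _, b2_full = mlr2(x1, x2, y)

# Steps 1 and 1a: residuals of y and of x2 after regressing each on x1
a, b = slr(x1, y);  res_y  = [yi - (a + b * xi) for xi, yi in zip(x1, y)]
a, b = slr(x1, x2); res_x2 = [x2i - (a + b * xi) for xi, x2i in zip(x1, x2)]

# Step 2: the SLR of the residuals recovers the MLR coefficient of x2
_, b2_partial = slr(res_x2, res_y)
assert abs(b2_full - b2_partial) < 1e-9
```

This is the Frisch-Waugh result: adjusting both y and x2 for x1 first, then fitting an SLR, reproduces the multiple-regression coefficient exactly.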
MLR stepwise by SLR: residuals the same, models identical

Residuals (unbalanced case)
 x1     x2      y     RESy.x1  RESx2.x1  RES step 2  RESy.x1x2
 -1    -1.2  -2.11     0.370    -0.16      0.99        0.99
 -1    -1    -2.61    -0.130     0.044    -0.194      -0.194
 -1    -0.8  -2.13     0.30      0.244    -0.007      -0.007
 -0.8  -1    -2.7     -0.683    -0.16     -0.442      -0.442
  0     0    -0.1      0.014     0.000     0.014       0.014
  0.8   1     1.86     0.172     0.16     -0.070      -0.070
  1     1.2   2.4      0.389     0.16      0.160       0.160
  1     1     2.1     -0.01     -0.044     0.013       0.013
  1     0.8   1.72    -0.431    -0.244    -0.074      -0.074

[Fitted line plots:]
1:  y = -0.1644 + 2.316 x1 (unbalanced case); S 0.398647, R-Sq 97.2%, R-Sq(adj) 96.8%
1a: x2 = 0.00000 + 1.044 x1 (unbalanced case); S 0.17966, R-Sq 97.2%, R-Sq(adj) 96.8%
2:  RESy.x1 = 0.00000 + 1.46 RESx2.x1 (unbalanced case); S 0.29941, R-Sq 43.6%, R-Sq(adj) 35.5%
Reduction in SSQ: Partial R-Sq

The regression equation is
y = -0.164 + 0.786 x1 + 1.46 x2
Predictor     Coef     SE Coef     T       P
Constant    -0.1644    0.1078    -1.53    0.178
x1           0.7863    0.7203     1.09    0.317
x2           1.460     0.6804     2.15    0.075
S = 0.32340   R-Sq = 98.4%

Analysis of Variance
Source           DF      SS        MS        F       P
Regression        2    39.522    19.761   188.94   0.000
Residual Error    6     0.628     0.105
Total             8    40.150

Source  DF  Seq SS
x1       1  39.037
x2       1   0.485

Total SS 40.150. x1 explains 39.037 (97.2%); x2 explains a further 0.485; together 39.522 (98.4%). Unexplained by x1: 1.113; of this, x2 explains 0.485, i.e. 43.6% (small rounding error).

(1 - R^2_{y.x1x2}) = (1 - R^2_{y.x1})(1 - R^2_{[y.x1][x2.x1]})

Here R^2 is used as a number in (0,1), i.e. percentage R^2/100.
Note that MINITAB 17 uses a different layout for the SS than that shown here.
Is Order Important?
Is order important?

The regression equation is
y = -0.164 + 0.786 x1 + 1.46 x2   (equivalently y = -0.164 + 1.46 x2 + 0.786 x1)
Predictor     Coef     SE Coef     T       P
Constant    -0.1644    0.1078    -1.53    0.178
x1           0.7863    0.7203     1.09    0.317
x2           1.460     0.6804     2.15    0.075
S = 0.32340   R-Sq = 98.4%
The coefficient table and the ANOVA (Regression 39.522, Residual Error 0.628, Total 40.150) are identical for both orders of entry.
No: coefficients are not impacted by ordering.

First use x1 and then x2:      First use x2 and then x1:
Source  DF  Seq SS             Source  DF  Seq SS
x1       1  39.037             x2       1  39.398
x2       1   0.485             x1       1   0.124
Yes: partial R-Sq is impacted by ordering.
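The order effect can be verified directly. A minimal Python sketch (hypothetical correlated predictors; `fitted_slr`, `fitted_mlr2` and `ss_reg` are my own helpers): the sequential SS for each predictor depends on the order of entry, but their sum, the regression SS of the full model, does not.

```python
# Sequential sums of squares under the two orders of entry.

def fitted_slr(x, y):
    n = len(x); mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    a0 = my - b * mx
    return [a0 + b * xi for xi in x]

def fitted_mlr2(x1, x2, y):
    """Two-predictor OLS fitted values via the normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((a - m2) ** 2 for a in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return [b0 + b1 * a + b2 * b for a, b in zip(x1, x2)]

def ss_reg(y, fit):
    """Regression SS: variation of fitted values about the mean of y."""
    my = sum(y) / len(y)
    return sum((f - my) ** 2 for f in fit)

x1 = [-1.0, -1.0, -1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]   # hypothetical data
x2 = [-1.2, -1.0, -0.8, -0.1, 0.0, 0.1, 0.8, 1.0, 1.2]
y  = [-2.1, -2.0, -1.9, -0.2, 0.1, 0.0, 1.9, 2.0, 2.2]

ss_full = ss_reg(y, fitted_mlr2(x1, x2, y))
seq_x1_first = (ss_reg(y, fitted_slr(x1, y)), ss_full - ss_reg(y, fitted_slr(x1, y)))
seq_x2_first = (ss_reg(y, fitted_slr(x2, y)), ss_full - ss_reg(y, fitted_slr(x2, y)))

# Same total regression SS, split differently between the two orders
assert abs(sum(seq_x1_first) - sum(seq_x2_first)) < 1e-9
assert abs(seq_x1_first[1] - seq_x2_first[1]) > 0.01
```

With uncorrelated predictors the second assertion would fail: each predictor's sequential SS would be the same in either order, as the next slides show.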
Is order important? Uncorrelated preds

Analysis of Variance (identical for both orders)
Source           DF      SS        MS
Regression        2   14.9064    7.4532
Residual Error    6    0.9674    0.1612
Total             8   15.8738

Source  DF  Seq SS         Source  DF  Seq SS
x1       1  8.6640         x2       1  6.2424
x2       1  6.2424         x1       1  8.6640

No: coefficients are not impacted by ordering.
Partial R-Sq is not impacted by ordering if the predictors are uncorrelated.
Is order important?
- No: for prediction, if the predictor variables are uncorrelated
- No: for prediction, even if correlated
- Yes: for coeffs and for SE(coeff), T ratios, p values; for teasing out aspects of relative importance
Is correlation in predictors important?
- No: for prediction, even if correlated, if n is large
- Yes: for coeffs and for SE(coeff), T ratios, p values
Seek the simplest model you can get away with, but no simpler. Seek out and drop redundant variables.
Variance Inflation Factors
Variance Inflation Factor

SE(b_j)^2 = [ s^2 / ((n-1) s_{x_j}^2) ] * [ 1 / (1 - R_j^2) ]

where s_{x_j}^2 is the variance of the x_j values and R_j^2 is the proportion of variance of x_j when regressed on all the other preds. The factor 1/(1 - R_j^2) is the VIF.

SE(b_j) is large when s_{x_j} is small or R_j is large.

Implications: if you have control over the study design, spread out the predictors and arrange for the preds to be uncorrelated. Else coeffs can be individually small while SEs are large. If coeff interpretation is important, be careful with too many derived variables.

Here R^2 is used as a number in (0,1), i.e. percentage R^2/100.
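The VIF formula is easy to check numerically. A minimal Python sketch (hypothetical data), using the two-predictor shortcut: with only two predictors, R_j^2 is the squared correlation between them, so both VIFs equal 1/(1 - corr(x1, x2)^2).

```python
# VIF for two correlated predictors: VIF_1 = VIF_2 = 1 / (1 - r^2).

def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x1 = [-1.0, -1.0, -0.8, 0.0, 0.8, 1.0, 1.0]   # hypothetical, nearly collinear
x2 = [-1.2, -1.0, -1.0, 0.0, 1.0, 1.2, 0.8]

r = corr(x1, x2)
vif = 1 / (1 - r ** 2)
# near-collinear predictors inflate SE(b_j)^2 by a large factor
assert vif > 10
```

The same 1/(1 - R_j^2) factor appears in the Trees output on the next slides, where the derived variable Ht*Diam^2 is strongly correlated with Diam.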
Trees
Regression Analysis: Vol versus Ht x Diam^2, Diam, Ht

The regression equation is
Vol = -1.7 + 0.0021 Ht x Diam^2 - 0.094 Diam + 0.030 Ht

Predictor      Coef        SE Coef     T       P      VIF
Constant      -1.6        10.94      -0.15    0.881
Ht x Diam^2    0.0021474   0.00031    6.0     0.000   33.366
Diam          -0.0942      0.8133    -0.12    0.909   29.442
Ht             0.0299      0.1004     0.30    0.768    1.849

S = 2.7630   R-Sq = 97.8%

Source       DF  Seq SS
Ht x Diam^2   1  7925.8
Diam          1     0.4
Ht            1     0.6
Trees

The regression equation is
Vol = -0.298 + 0.00212 Ht x Diam^2

Predictor      Coef         SE Coef       T       P      VIF
Constant      -0.2977       0.9636       -0.31    0.760
Ht x Diam^2    0.00212437   0.00005949   35.71    0.000  1.000

Cf the three-predictor model:
Ht x Diam^2    0.0021474    0.00031       6.0     0.000  33.366

S = 2.49300   R-Sq = 97.8%

VIF = 1          Not correlated
1 < VIF < 5      Moderately correlated
VIF 5 to 10      Highly correlated
VIF values greater than 10 may indicate that multicollinearity is unduly influencing your regression results. In this case, you may want to reduce multicollinearity by removing unimportant predictors from your model.
Example: Trees
Theory: Ht and Diam determine Ht*Diam^2, which determines Tree Vol.

Alternative orderings:
Source     DF  Seq SS     Source     DF  Seq SS     Source     DF  Seq SS
Diameter    1  7581.8     Ht*Diam^2   1  7925.8     Ht*Diam^2   1  7925.8
Height      1   102.4     Height      1     0.9     Diameter    1     0.4
Ht*Diam^2   1   242.7     Diameter    1     0.1     Height      1     0.6
PE Max revisited

The regression equation is
PEmax = 177 - 2.57 Age - 3.8 Sex - 0.45 Height + 3.01 Weight - 1.75 BMP
        + 1.08 FEV1 + 0.198 RV - 0.311 FRC + 0.188 TLC

Predictor     Coef     SE Coef     T       P      VIF
Constant    177.4     226.2       0.78    0.445
Age          -2.569     4.808    -0.53    0.601   21.900
Sex          -3.81     15.47     -0.25    0.809    2.273
Height       -0.4499    0.9038   -0.50    0.626   13.978
Weight        3.00      2.011     1.49    0.16    47.971
BMP          -1.751     1.16     -1.51    0.151    7.133
FEV1          1.077     1.081     1.00    0.33     5.426
RV            0.1980    0.1962    1.01    0.329   10.47
FRC          -0.3114    0.4927   -0.63    0.537   17.176
TLC           0.1877    0.4996    0.38    0.712    2.661

S = 25.4622   R-Sq = 63.8%

Which to drop? What's the objective?
PE Max revisited

The regression equation is
PEmax = 159 - 0.319 FRC

Predictor     Coef      SE Coef     T       P
Constant    158.71     23.36       6.79    0.000
FRC          -0.3191    0.1449    -2.20    0.038

S = 31.0414   R-Sq = 17.4%

[Fitted line plot: PEmax = 158.7 - 0.3191 FRC; S 31.0414, R-Sq 17.4%, R-Sq(adj) 13.8%]
[Scatterplot of PEmax vs FRC, points marked by Sex (0/1)]
PE Max revisited

The regression equation is
PEmax = 160 - 14.5 Sex - 0.288 FRC

Predictor     Coef      SE Coef     T       P
Constant    160.29     23.2        6.90    0.000
Sex         -14.48     12.64      -1.15    0.264
FRC          -0.2883    0.1464    -1.97    0.062

S = 30.8327   R-Sq = 22.1%   R-Sq(adj) = 15.0%

Source  DF  Seq SS
Sex      1  2234.4
FRC      1  3683.8

Analysis of Variance
Source           DF      SS        MS       F       P
Regression        2    5918.2    2959.1    3.11    0.065
Residual Error   22   20914.4     950.7
Total            24   26832.6
PE Max revisited

The regression equation is
PEmax = 62.1 - 12.5 Sex + 3.77 Age - 0.013 FRC

Predictor     Coef      SE Coef     T       P
Constant     62.13     42.84       1.45    0.162
Sex         -12.45     11.26      -1.11    0.278
Age           3.771     1.441      2.62    0.016
FRC          -0.0134    0.1673    -0.08    0.937

S = 27.4069   R-Sq = 41.2%

Source  DF  Seq SS
Sex      1   2234.4
Age      1   8819.5
FRC      1      4.9

Analysis of Variance
Source           DF      SS        MS       F       P
Regression        3   11058.8    3686.3    4.91    0.010
Residual Error   21   15773.9     751.1
Total            24   26832.6
PE Max revisited

Corr(Ht, Wt) = 0.921; %age of variation in Ht explained by Wt = 100(0.921)^2 = 85%.

The regression equation is
PEmax = 54.7 + 0.129 Height + 1.00 Weight - 0.026 FRC

Predictor     Coef     SE Coef     T       P      VIF
Constant     54.73    89.84       0.61    0.549
Height        0.1293   0.6819     0.19    0.851   6.789
Weight        1.0048   0.8131     1.24    0.230   6.690
FRC          -0.025    0.1664    -0.15    0.879   1.671
PE Max revisited

The regression equation is
PEmax = -14.4 + 0.863 Height - 0.054 FRC

Predictor     Coef      SE Coef     T       P      VIF
Constant    -14.39     71.14      -0.20    0.842
Height        0.8633    0.3390     2.55    0.018   1.639
FRC          -0.0541    0.1667    -0.32    0.749   1.639

The regression equation is
PEmax = 54.7 + 0.129 Height + 1.00 Weight - 0.026 FRC

Predictor     Coef     SE Coef     T       P      VIF
Constant     54.73    89.84       0.61    0.549
Height        0.1293   0.6819     0.19    0.851   6.789
Weight        1.0048   0.8131     1.24    0.230   6.690
FRC          -0.025    0.1664    -0.15    0.879   1.671
Interpreting the coefficients
Do more x-variables mean better models? They give bigger R-Sq and smaller S, but fewer coeffs make a better model.
Key issue: correlated and/or missing x-variables.
Theory: coefficients indirectly reflect correlation; high correlation does not imply a big coeff, and a low coeff does not imply low correlation.
Review: R-squared
R as a correlation coefficient.

With one x-var: S^2 = S_y^2 (1 - r^2), where r = Corr(x, y).

If yhat = b1 x1 + b2 x2 + b3 x3 + ..., then S^2 = S_y^2 (1 - R^2), where R = Corr(y, yhat); yhat is that linear combination of x1, x2, x3, ... which best predicts y.
Review: R-squared

S_y^2: var of y about its mean (SSTotal). S^2: var of y about the best linear reg predictor, i.e. var of the residuals about their mean, 0.

S^2 = S_y^2 (1 - R^2_{y.x1x2})

(1 - R^2_{y.x1x2}) = (1 - R^2_{y.x1})(1 - R^2_{[y.x1][x2.x1]})

With one x-var: S^2 = S_y^2 (1 - r^2), r = Corr(x, y).

Here R^2 is used as a number in (0,1), i.e. percentage R^2/100.
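The factorisation of 1 - R^2 can be confirmed numerically. A minimal Python sketch (hypothetical data; `resid_slr`, `resid_mlr2` and `sse` are my own helpers): the full-model R^2 is computed directly from the two-predictor fit, while the partial R^2 comes from the residual-on-residual regression, and the identity holds to rounding.

```python
# Check: 1 - R^2[y.x1x2] = (1 - R^2[y.x1]) * (1 - R^2[(y.x1)(x2.x1)]).

def resid_slr(x, y):
    n = len(x); mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    a0 = my - b * mx
    return [c - (a0 + b * a) for a, c in zip(x, y)]

def resid_mlr2(x1, x2, y):
    """Residuals of the two-predictor OLS fit, via the normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((a - m2) ** 2 for a in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return [c - (b0 + b1 * a + b2 * b) for a, b, c in zip(x1, x2, y)]

def sse(r):
    return sum(v * v for v in r)

x1 = [-1.0, -1.0, -1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]   # hypothetical data
x2 = [-1.2, -1.0, -0.8, -0.1, 0.0, 0.1, 0.8, 1.0, 1.2]
y  = [-2.1, -2.0, -1.9, -0.2, 0.1, 0.0, 1.9, 2.0, 2.2]

sst  = sse([v - sum(y) / len(y) for v in y])
r_y1 = resid_slr(x1, y)                    # y adjusted for x1
r_21 = resid_slr(x1, x2)                   # x2 adjusted for x1

one_minus_R2_full = sse(resid_mlr2(x1, x2, y)) / sst
one_minus_R2_x1   = sse(r_y1) / sst
one_minus_R2_part = sse(resid_slr(r_21, r_y1)) / sse(r_y1)
assert abs(one_minus_R2_full - one_minus_R2_x1 * one_minus_R2_part) < 1e-9
```

Each factor is an "unexplained fraction", so the identity says: what x1 leaves unexplained, x2 shrinks by its own partial factor.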
Review: Coefficients
When there is one predictor x, the fitted line yhat = a + bx has b simply related to r = Corr(y, x), and R^2 = r^2. NB symmetry: r = Corr(x, y) = Corr(y, x).
Review: Coefficients
Coefficients are not impacted by order. When there are multiple predictors, yhat = b1 x1 + b2 x2 + b3 x3 + ...; b_i is not proportional to r_i = Corr(y, x_i). In fact b_i reflects the correlation of y with x_i AND with the best predictor of x_i using the other x vars.
Strategies for correlated x-vars
- Redundancy in an extreme case: if two or more vars contain exactly one piece of information, use only one of them.
- Partial redundancy: if two or more vars contain much the same information for the purposes in hand, use one (possibly composite) variable.
- More generally, can the important info in K variables be reduced to a few (possibly composite) variables?
Other Strategies
- Best Subsets and Stepwise Regression: select the "best" in a predictive sense
- Modern methods: very large data sets (n and/or p); computationally intensive; Data Mining literature/software; penalise models with many variables
- Note that many models are nearly as good
Challenges with coeffs
To be able to interpret coefficients, ideally:
- choose x variables that are complementary and measure quite different aspects of the system
- organise the data so that it does not inadvertently give the impression that these are correlated, despite their selection
In other words, design an experiment.
Technicality
Extreme case: Exact multi-collinearity
x2 is perfectly correlated with x1, or more generally x_k is perfectly predicted by the others, so R_k = 1 and

SE(b_j)^2 = [ s^2 / ((n-1) s_{x_j}^2) ] * [ 1 / (1 - R_j^2) ]

cannot be computed: SE(coeff) = infinity, and there is no single best set of parameters (R_j^2 = % of var of x_j when regressed on all other preds). MINITAB refuses to proceed. Often an error: same var entered twice! Always arises with full sets of indicators.
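The indicator-variable case can be shown concretely. A minimal Python sketch (hypothetical design of 8 quarterly observations; the `det` helper is mine): with an intercept column present, Q1+Q2+Q3+Q4 equals the constant column, so X'X is singular and no unique coefficient vector exists.

```python
# Why a full set of quarterly indicators plus an intercept breaks the fit.

quarters = [1, 2, 3, 4, 1, 2, 3, 4]
# Each row: [intercept, Q1, Q2, Q3, Q4]
X = [[1] + [1 if q == k else 0 for k in (1, 2, 3, 4)] for q in quarters]

# Exact linear dependence, row by row: the indicators sum to the constant
for row in X:
    assert sum(row[1:]) == row[0]

def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n = len(m)
    d = 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        if abs(m[p][i]) < 1e-12:
            return 0.0                       # singular (to rounding)
        if p != i:
            m[i], m[p] = m[p], m[i]
            d = -d
        d *= m[i][i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n):
                m[r][c] -= f * m[i][c]
    return d

xtx = [[sum(X[k][i] * X[k][j] for k in range(len(X))) for j in range(5)]
       for i in range(5)]
assert det(xtx) == 0.0                       # X'X has no inverse
```

Dropping one indicator (or dropping the intercept) removes the dependence, which is exactly the action the software takes on the following slides.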
Exact multi-collinearity
Many pairs of coeffs give the same predicted values; no unique solution.

SLR fit: intercept 1.36, slope 1.86. Using both x1 and x2 (= x1) as predictors, any pair of coeffs summing to 1.86 gives the same fit:
(x1 coeff, x2 coeff) = (1.86, 0), (0, 1.86), (3, -1.14), (0.93, 0.93), ...

   y    x1  x2   Pred (identical for all coeff pairs)
  2.6    1   1     3.2
  4.6    2   2     5.1
  7.3    3   3     6.9
 10.0    4   4     8.8
 12.1    5   5    10.7
 10.6    6   6    12.5
 14.3    7   7    14.4
Indicator Vars: Exact multi-collinearity

Regression Analysis: Comps versus Time since 1978, Q1, Q2, Q3, Q4
* Q4 is highly correlated with other X variables
* Q4 has been removed from the equation.

The regression equation is
Comps = -9452 + 986 Time since 1978 - 1792 Q1 - 1139 Q2 - 758 Q3
Indicator Vars: Exact multi-collinearity
Model has no intercept:

Regression Analysis: Comps versus Time since 1978, Q1, Q2, Q3, Q4

The regression equation is
Comps = 986 Time since 1978 - 11244 Q1 - 10592 Q2 - 10210 Q3 - 9452 Q4
Indicator Vars: Exact multi-collinearity

Models
1  Comps = 986 Time since 1978 - 11244 Q1 - 10592 Q2 - 10210 Q3 - 9452 Q4
2  Comps = -9452 + 986 Time since 1978 - 1792 Q1 - 1139 Q2 - 758 Q3

Time since         Indicator vars       Predictions
1978        Comps  Q1  Q2  Q3  Q4   Model 1   Model 2
15           3684   1   0   0   0    3546      3546
15.25        4487   0   1   0   0    4444.5    4444.5
15.5         5089   0   0   1   0    5073      5073
15.75        6041   0   0   0   1    6077.5    6077.5
16           4291   1   0   0   0    4532      4532