
Lecture: Multiple Linear Regression

Predict y from (possibly) many predictors x, including extra derived variables.
Model criticism: study the importance of columns; draw on the scientific framework; experiment to find the simplest and best predictor.
Look for important rows: diagnostics, outliers and influence.

Interpreting the coefficients

In MLR, experiment with models: dropping or adding variables impacts the coefficients of the others. There are no such issues with a single predictor; with more than one, there is co-variation in at least 3 dimensions.
Do more x-variables mean better models? More coefficients mean a bigger R², a smaller S and a smaller residual sum of squares. But the simplest coefficients have value 0, and we want large T-values! Simple models are the best science.

Multiple Linear Regression

Predict y from (possibly) many predictors x, including extra derived variables.
Experiment with models: dropping/adding vars, noting the change in R² and in the fitted coeffs.
Check diagnostics: VIF.
Find the simplest and best predictor; understand how y interacts with the predictors x.

How important is predictor x_k?

The fitted coeff b_k is the average increase/decrease in y when x_k increases by one unit and all other predictors are unchanged.
Big numerical value? Big T ratio? Small p?
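Where software is to hand, b_k and its T ratio and p value can be read straight off a fitted model. A minimal sketch in Python (statsmodels is assumed to be available; the data and the names x1, x2, y are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up illustrative data: y depends on x1 and x2 plus noise.
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
df["y"] = 2.0 + 1.5 * df["x1"] + 0.5 * df["x2"] + rng.normal(scale=0.5, size=50)

fit = smf.ols("y ~ x1 + x2", data=df).fit()
# b_k, its T ratio and p value, as on the slide:
print(fit.params)    # fitted coefficients b_k
print(fit.tvalues)   # T ratios
print(fit.pvalues)   # p values
```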

How important is predictor x_k?

What if an important predictor is not available? What if some X vars are highly inter-correlated? What are the implications for interpreting b_k, and for the changes in b_k when other variables are added or dropped?

Trees: a simple case

A linear model regressing Vol on Height, Diam and Ht*Diam². Simple theory is available: a trunk is roughly a cylinder or cone, so Vol is roughly proportional to Ht*Diam².

Diameter and Height important

The regression equation is
Volume = -58.0 + 4.71 Diameter + 0.339 Height

Predictor  Coef     SE Coef  T      P
Constant   -57.988  8.638    -6.71  0.000
Diameter   4.7082   0.2643   17.82  0.000
Height     0.3393   0.1302   2.61   0.014

S = 3.88183   R-Sq = 94.8%   R-Sq(adj) = 94.4%

Diameter and Height not important

The regression equation is
Volume = -1.7 - 0.094 Diameter + 0.030 Height + 0.394 Ht*Diam^2

Predictor  Coef     SE Coef  T      P
Constant   -1.6     10.94    -0.15  0.881
Diameter   -0.0942  0.8133   -0.12  0.909
Height     0.0299   0.1004   0.30   0.768
Ht*Diam^2  0.3939   0.0614   6.0    0.000

S = 2.7630   R-Sq = 97.8%   R-Sq(adj) = 97.5%
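Both fits are easy to reproduce on the classic 31-tree dataset (R's built-in trees data). A sketch that fetches it over the network via statsmodels; its Girth column plays the role of Diameter on these slides:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# R's classic 31-tree dataset (fetched over the network).
trees = sm.datasets.get_rdataset("trees").data
trees["HtDiam2"] = trees["Height"] * trees["Girth"] ** 2

m1 = smf.ols("Volume ~ Girth + Height", data=trees).fit()
m2 = smf.ols("Volume ~ Girth + Height + HtDiam2", data=trees).fit()

# m1: both predictors have large T ratios.
# m2: the derived variable absorbs the signal, and Girth and Height
#     now look unimportant, exactly as on the slides.
print(m1.summary())
print(m2.summary())
```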

Strategies with correlated predictors

Regression is a device to think about the relative importance of the X vars.
Correlation important? Proceed with care; you need regression theory! Transformations, derived variables; modify models and use VIF to predict the effect.
Correlation relatively unimportant? Semi-automatic options, including Best Subsets/Stepwise.

Outline

Examples, mostly in more than two dimensions.
Theory for correlated x-vars: sums of squares; R², multiple correlation and simple correlation; changes in R², i.e. partial R². The changes depend on the ORDER of the x-vars in MTB.
Coefficients, and their T or P values, are NOT always a measure of importance.

Technical Material

MLR as a sequence of SLRs; partial R²; correlated predictors; variance inflation; multi-collinearity; use of the intercept term with indicator variables.

Extreme case: Tree Vol with x1 = x2 = Ht

Regress Vol on x1: Vol = -87.12 + 1.54 x1 = -87.12 + 1.54 x1 + 0 x2
Regress Vol on x2: Vol = -87.12 + 0 x1 + 1.54 x2
Regress Vol on both: Vol = -87.12 + (b) x1 + (1.54 - b) x2 for any arbitrary value of b!!
An infinity of identical solutions, all equally good for predicting. MINITAB notes this and takes action. Extra technical material online.
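A minimal numpy sketch of the same phenomenon, with made-up numbers: when x2 duplicates x1 the design matrix is rank-deficient, lstsq returns just one of the infinitely many least-squares solutions, and shifting weight between the two coefficients leaves the fitted values unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=20)
x2 = x1.copy()                      # exact duplicate predictor
y = 3.0 + 1.5 * x1 + rng.normal(size=20)

X = np.column_stack([np.ones_like(x1), x1, x2])
# lstsq returns one of the infinitely many least-squares solutions
# (the minimum-norm one); the design matrix is rank-deficient.
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank)                         # 2, not 3

b = 7.0                             # arbitrary shift between the two coefficients
shifted = coef + np.array([0.0, b, -b])
print(np.allclose(X @ coef, X @ shifted))   # True: identical predictions
```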

Common case: x1 ≈ x2

An infinity of nearly identical solutions: many pairs of coeffs are almost equivalent. More generally, at least one x is nearly perfectly predictable from the other x vars, so many sets of coeffs are almost equivalent. The user notes this and takes action.

PEMax: Too many predictors? (M. Stuart)

Objective: relate respiratory muscle strength (PEmax) to other measures of lung function in patients suffering from cystic fibrosis, adjusting for sex and body size.

Controlling for external variation

Observational data are often unbalanced, e.g. in age and gender. Ideally data collection is designed: equal numbers of M/F, similar age distributions in each group. Regression is often used to control for such variation.

PEMax: The variables

PEmax   Maximal static expiratory pressure, a measure of expiratory muscle strength
FEV1    Forced expiratory volume in 1 second
RV      Residual volume (after 1 second)
FRC     Functional residual capacity
TLC     Total lung capacity
Sex     0 = Male, 1 = Female
Height  cms
Weight  kg
BMP     Body mass (percent of median of normal cases)

First rows of the data:

Sub  Age  Sex  Ht   Wt    BMP  FEV1  RV   FRC  TLC  PEmax
1    7    0    109  13.1  68   32    258  183  137  95
2    7    1    112  12.9  65   19    449  245  134  85
3    8    0    124  14.1  64   22    441  268  147  100
4    8    1    125  16.2  67   41    234  146  124  85
5    8    0    127  21.5  93   52    202  131  104  95
6    9    0    130  17.5  68   44    308  155  118  80
7    11   1    139  30.7  89   28    305  179  119  65
8    12   1    150  28.4  69   18    369  198  103  110
9    12   0    146  25.1  67   24    312  194  128  70

PEmax: Too many vars? All coeffs small

The regression equation is
PEmax = 102 + 1.36 FEV1 + 0.172 RV - 0.206 FRC + 0.27 TLC - 0.8 Sex - 0.373 Height + 2.1 Weight - 1.39 BMP

Predictor  Coef     SE Coef  T      P
Constant   101.7    172.3    0.59   0.563
FEV1       1.3626   0.918    1.48   0.157
RV         0.1723   0.1860   0.93   0.368
FRC        -0.206   0.4410   -0.47  0.647
TLC        0.271    0.4614   0.60   0.559
Sex        -0.76    14.04    -0.05  0.958
Height     -0.3731  0.8721   -0.43  0.67
Weight     2.12     1.19     1.80   0.091
BMP        -1.3868  0.913    -1.52  0.148

S = 24.8872   R-Sq = 63.1%   R-Sq(adj) = 44.6%

No variables important? The issue here: too many correlated variables. The challenge here: poor theoretical guidance. We return to this later.

Some simple cases

Scientific framework: networks.

Direct and Indirect Importance

[Two candidate path diagrams for the trees data: (a) Ht and Diam each point directly to Tree Vol, alongside Ht*Diam²; OR (b) Ht and Diam point to the derived variable Ht*Diam², which points to Tree Vol.]

Theory, a scientific framework: a causal model.

Math Marks

Marks of 88 students in five maths exams. How do we predict the Stats mark from the others?

Variable  Mean   StdDev
Mech      38.95  17.49
Vect      50.59  13.15
Alg       50.60  10.62
Anal      46.68  14.85
Stat      42.31  17.26

Correlation matrix R
      Mech  Vect  Alg   Anal  Stat
Mech  1.00  0.55  0.55  0.41  0.39
Vect  0.55  1.00  0.61  0.49  0.44
Alg   0.55  0.61  1.00  0.71  0.67
Anal  0.41  0.49  0.71  1.00  0.61
Stat  0.39  0.44  0.67  0.61  1.00

What can we learn from the coeffs in the best predictor?

Math Marks: guidance from theory

[Path diagram relating the five subjects: Mechanics, Vectors, Algebra, Analysis, Statistics, with Algebra in the central position.]

Predicting Statistics Performance

The regression equation is
Stat = -11.4 + 0.313 Anal + 0.729 Alg + 0.026 Vect + 0.022 Mech

Predictor  Coef     SE Coef  T      P
Constant   -11.378  6.982    -1.63  0.107
Anal       0.3129   0.1315   2.38   0.020
Alg        0.7294   0.2096   3.48   0.001
Vect       0.0257   0.1395   0.18   0.854
Mech       0.02217  0.0989   0.22   0.823

S = 12.7478   R-Sq = 47.9%   R-Sq(adj) = 45.4%

[Path diagram repeated: Mechanics, Vectors, Algebra, Analysis, Statistics.]

Alternative Predictions

Stat = -11.2 + 0.316 Anal + 0.765 Alg

Predictor  Coef     SE Coef  T      P
Constant   -11.192  6.592    -1.70  0.093
Anal       0.3164   0.1294   2.44   0.017
Alg        0.7653   0.1808   4.23   0.000

S = 12.606   R-Sq = 47.9%

Stat = 1.92 + 0.601 Anal + 0.244 Vect

Predictor  Coef    SE Coef  T     P
Constant   1.922   6.16     0.31  0.756
Anal       0.6011  0.1121   5.36  0.000
Vect       0.2436  0.1266   1.92  0.058

S = 13.787   R-Sq = 39.5%

Theory

Ex a. Uncorrelated x-variables

Artificial data:

 x1  x2   e      y
 -1  -1   0.09   -1.91
 -1   0   -0.61  -1.61
 -1   1   -0.33  -0.33
  0  -1   -0.95  -1.95
  0   0   -0.1   -0.1
  0   1   0.06   1.06
  1  -1   0.34   0.34
  1   0   0.1    1.1
  1   1   -0.08  1.92

SS(total) = 15.87. Correlations: Corr(x1, x2) = 0.000; Corr(y, x1) = 0.739; Corr(y, x2) = 0.627.

Data-generating model: $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$, with $\beta_0 = 0$, $\beta_1 = 1$, $\beta_2 = 1$, $\sigma = 0.5$.

The regression equation is
ybal = -0.164 + 1.20 x1 + 1.02 x2bal

Predictor  Coef     SE Coef  T      P
Constant   -0.1644  0.1338   -1.23  0.265
x1         1.2017   0.1639   7.33   0.000
x2bal      1.0200   0.1639   6.22   0.001

S = 0.40140   R-Sq = 93.9%   SS Total = 15.8738

Ex b. Correlated x-variables

Artificial data:

 x1    x2    e      y
 -1    -1.2  0.09   -2.11
 -1    -1    -0.61  -2.61
 -1    -0.8  -0.33  -2.13
 -0.8  -1    -0.9   -2.7
  0     0    -0.15  -0.15
  0.8   1    0.06   1.86
  1     1.2  0.34   2.54
  1     1    0.1    2.1
  1     0.8  -0.08  1.72

SS(total) = 40.15. Correlations: Corr(x1, x2) = 0.986; Corr(y, x1) = 0.986; Corr(y, x2) = 0.991.

[Scatterplot of x1unbal vs x2unbal: the points lie close to a line.]

Data-generating model as before: $\beta_0 = 0$, $\beta_1 = 1$, $\beta_2 = 1$, $\sigma = 0.5$.

The regression equation is
y = -0.164 + 0.786 x1 + 1.46 x2

Predictor  Coef     SE Coef  T      P
Constant   -0.1644  0.1078   -1.53  0.178
x1         0.7863   0.7203   1.09   0.317
x2         1.4650   0.6804   2.15   0.075

S = 0.32340   R-Sq = 98.4%

Coeffs smaller, SE(Coeffs) larger.
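The effect of the correlation on the coefficient estimates can be checked by simulation. A sketch under the slides' data-generating model (beta0 = 0, beta1 = beta2 = 1, sigma = 0.5 assumed); only the correlation between x1 and x2 differs between the two calls:

```python
import numpy as np

rng = np.random.default_rng(42)

def empirical_sd_of_b1(rho, n=9, reps=2000):
    """Empirical sampling s.d. of b1 when corr(x1, x2) is about rho."""
    b1 = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = x1 + x2 + 0.5 * rng.normal(size=n)      # beta0=0, beta1=beta2=1
        X = np.column_stack([np.ones(n), x1, x2])
        b1.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    return np.std(b1)

print(empirical_sd_of_b1(0.0))    # uncorrelated design: small s.d.
print(empirical_sd_of_b1(0.98))   # highly correlated: much larger s.d.
```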

MLR as successive SLRs

Additional info from x2 not already in x1:
1. Regress y on x1; store the residuals RESy.x1.
1a. Regress x2 on x1; store the residuals RESx2.x1.
Thus RESy.x1 and RESx2.x1 represent those aspects of y and x2 that DO NOT depend on x1.
2. Regress RESy.x1 on RESx2.x1.

MLR stepwise by SLR: the residuals are the same, the models identical

Residuals (y.x1):

 x1    x2    y      y.x1    x2.x1   y.x1 - 1.465(x2.x1)  y.x1x2 (MLR)
 -1    -1.2  -2.11  0.370   -0.156  0.599                0.599
 -1    -1    -2.61  -0.130  0.044   -0.194               -0.194
 -1    -0.8  -2.13  0.350   0.244   -0.007               -0.007
 -0.8  -1    -2.7   -0.683  -0.165  -0.442               -0.442
  0     0    -0.15  0.014   0.000   0.014                0.014
  0.8   1    1.86   0.172   0.165   -0.070               -0.070
  1     1.2  2.54   0.389   0.156   0.160                0.160
  1     1    2.1    -0.052  -0.044  0.013                0.013
  1     0.8  1.72   -0.431  -0.244  -0.074               -0.074

[Fitted line plot 1: y = -0.1644 + 2.316 x1 (unbalanced case); S = 0.398647, R-Sq = 97.2%, R-Sq(adj) = 96.8%.]
[Fitted line plot 1a: x2 = 0.00000 + 1.044 x1 (unbalanced case); S = 0.17966, R-Sq = 97.2%, R-Sq(adj) = 96.8%.]
[Fitted line plot 2: RESy.x1 = 0.00000 + 1.465 RESx2.x1 (unbalanced case); S = 0.29941, R-Sq = 43.6%, R-Sq(adj) = 35.5%.]
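This equivalence (the Frisch-Waugh result) is straightforward to verify numerically; a sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)        # correlated with x1
y = x1 + 1.5 * x2 + 0.5 * rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def resid(y, x):
    X = np.column_stack([np.ones(len(x)), x])
    return y - X @ ols(X, y)

res_y = resid(y, x1)        # step 1: y with x1 removed
res_x2 = resid(x2, x1)      # step 1a: x2 with x1 removed

# Step 2: the slope of res_y on res_x2 equals the MLR coefficient of x2 ...
slope2 = ols(np.column_stack([np.ones(n), res_x2]), res_y)[1]
X_full = np.column_stack([np.ones(n), x1, x2])
b_full = ols(X_full, y)
print(np.isclose(slope2, b_full[2]))                  # True

# ... and the residuals of the two fits are identical.
final_res = resid(res_y, res_x2)
print(np.allclose(final_res, y - X_full @ b_full))    # True
```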

Reduction in SSQ: Partial R²

The regression equation is
y = -0.164 + 0.786 x1 + 1.46 x2

Predictor  Coef     SE Coef  T      P
Constant   -0.1644  0.1078   -1.53  0.178
x1         0.7863   0.7203   1.09   0.317
x2         1.4650   0.6804   2.15   0.075

S = 0.32340   R-Sq = 98.4%

Analysis of Variance

Source          DF  SS      MS      F       P
Regression      2   39.522  19.761  188.94  0.000
Residual Error  6   0.628   0.105
Total           8   40.150

Source  DF  Seq SS
x1      1   39.037
x2      1   0.485

Total SS = 40.150. x1 explains 39.037 (97.2%); x2 explains a further 0.485; together 39.522 (98.4%). Unexplained by x1: 1.113. Of this, x2 explains 0.485, i.e. 43.6%: the partial R² seen in step 2 (a small rounding error aside).

$1 - R^2_{y|x_1,x_2} = \left(1 - R^2_{y|x_1}\right)\left(1 - R^2_{[y|x_1][x_2|x_1]}\right)$

Here R² is used as a proportion in (0,1), i.e. percentage/100. Note that MINITAB 17 uses a different layout for the SS than that shown here.
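The same arithmetic can be scripted directly. A sketch with made-up correlated data; the helper r2 is a hypothetical convenience, not anything from MINITAB:

```python
import numpy as np

def r2(X, y):
    """R-squared of an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    res = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - res @ res / ((y - y.mean()) @ (y - y.mean()))

# Example with made-up correlated data.
rng = np.random.default_rng(7)
x1 = rng.normal(size=30)
x2 = 0.95 * x1 + 0.2 * rng.normal(size=30)
y = x1 + 1.5 * x2 + 0.4 * rng.normal(size=30)

r2_x1 = r2(x1, y)                          # R^2 using x1 alone
r2_both = r2(np.column_stack([x1, x2]), y)
# Partial R^2 of x2 given x1: the share of what x1 left unexplained.
partial_r2 = (r2_both - r2_x1) / (1 - r2_x1)
# Equivalent to the slide's identity:
# (1 - R^2_both) = (1 - R^2_x1) * (1 - partial R^2)
print(partial_r2, 1 - (1 - r2_both) / (1 - r2_x1))
```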

Is Order Important?

Is order important?

The regression equation is the same in both orders:
y = -0.164 + 0.786 x1 + 1.46 x2
y = -0.164 + 1.46 x2 + 0.786 x1

Predictor  Coef     SE Coef  T      P
Constant   -0.1644  0.1078   -1.53  0.178
x1         0.7863   0.7203   1.09   0.317
x2         1.4650   0.6804   2.15   0.075

S = 0.32340   R-Sq = 98.4%

No: the coefficients are not impacted by the ordering.

Analysis of Variance (identical in both orders)

Source          DF  SS      MS
Regression      2   39.522  19.761
Residual Error  6   0.628   0.105
Total           8   40.150

Sequential SS, first using x1 and then x2:  x1  39.037;  x2  0.485
Sequential SS, first using x2 and then x1:  x2  39.398;  x1  0.124

Yes: the partial R² values are impacted by the ordering. Note that MINITAB 17 uses a different layout for the SS than that shown here.

Is order important? Uncorrelated predictors

No: the coefficients are not impacted by the ordering.

Analysis of Variance (identical in both orders)

Source          DF  SS       MS
Regression      2   14.9064  7.4532
Residual Error  6   0.9674   0.1612
Total           8   15.8738

Sequential SS, first x1 then x2:  x1  8.6640;  x2  6.2424
Sequential SS, first x2 then x1:  x2  6.2424;  x1  8.6640

The partial R² values are not impacted by the ordering if the predictors are uncorrelated. Note that MINITAB 17 uses a different layout for the SS than that shown here.

Is order important?

For prediction: No, even if the predictors are correlated.
For the coeffs, SE(coeff), T ratios and p values: No, if the predictor variables are uncorrelated; Yes otherwise, when teasing out aspects of relative importance.

Is correlation in the predictors important?

For prediction: No, if n is large, even if the predictors are correlated.
For the coeffs, SE(coeff), T ratios and p values: Yes.
Seek the simplest model you can get away with, but no simpler: seek out and drop redundant variables.

Variance Inflation Factors

Variance Inflation Factor

$SE(b_j) = \frac{s}{\sqrt{(n-1)\, s_{x_j}^2}} \sqrt{\frac{1}{1 - R_j^2}}$

where $s_{x_j}^2$ is the variance of the $x_j$ values and $R_j^2$ is the proportion of the variance of $x_j$ explained when $x_j$ is regressed on all the other predictors. The second factor, $1/(1 - R_j^2)$, is the Variance Inflation Factor. $SE(b_j)$ is large when $s_{x_j}$ is small or $R_j$ is large.

Implications: if you have control over the study design, spread out the predictors and arrange for them to be uncorrelated. Else the coeffs can be individually small while their SEs are large. If interpretation of the coeffs is important, be careful with too many derived variables.

Here R² is used as a proportion in (0,1), i.e. percentage/100.
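Statsmodels exposes this quantity directly; a sketch with made-up predictors in which x3 is nearly a multiple of x1, so the VIFs of both come out large:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up predictors: x3 is close to a multiple of x1, so its VIF is large.
rng = np.random.default_rng(11)
x1 = rng.normal(size=40)
x2 = rng.normal(size=40)
x3 = 2 * x1 + 0.1 * rng.normal(size=40)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for j in range(1, X.shape[1]):          # skip the intercept column
    print(j, variance_inflation_factor(X, j))
```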

Trees

Regression Analysis: Vol versus Ht x Diam^2, Diam, Ht

The regression equation is
Vol = -1.7 + 0.0021 Ht x Diam^2 - 0.094 Diam + 0.030 Ht

Predictor    Coef       SE Coef  T      P      VIF
Constant     -1.6       10.94    -0.15  0.881
Ht x Diam^2  0.0021474  0.00031  6.0    0.000  33.366
Diam         -0.0942    0.8133   -0.12  0.909  29.442
Ht           0.0299     0.1004   0.30   0.768  1.849

S = 2.7630   R-Sq = 97.8%

Source       DF  Seq SS
Ht x Diam^2  1   7925.8
Diam         1   0.4
Ht           1   0.6

Note that MINITAB 17 uses a different layout for the SS than that shown here.


Trees

The regression equation is
Vol = -0.298 + 0.00212 Ht x Diam^2

Predictor    Coef        SE Coef     T      P      VIF
Constant     -0.2977     0.9636      -0.31  0.760
Ht x Diam^2  0.00212437  0.00005949  35.71  0.000  1.000

Cf, from the three-predictor model:
Ht x Diam^2  0.0021474   0.00031     6.0    0.000  33.366

S = 2.49300   R-Sq = 97.8%

VIF guidance:
VIF = 1        not correlated
1 < VIF < 5    moderately correlated
VIF 5 to 10    highly correlated

VIF values greater than 10 may indicate multicollinearity is unduly influencing your regression results. In this case, you may want to reduce multicollinearity by removing unimportant predictors from your model.

Example: Trees

[Path diagram (theory): Ht and Diam point to Ht*Diam², which points to Tree Vol.]

Alternative orderings of the sequential SS:

Source     DF  Seq SS
Diameter   1   7581.8
Height     1   102.4
Ht*Diam^2  1   242.7

Source     DF  Seq SS
Ht*Diam^2  1   7925.8
Height     1   0.9
Diameter   1   0.1

Source     DF  Seq SS
Ht*Diam^2  1   7925.8
Diameter   1   0.4
Height     1   0.6

Note that MINITAB 17 uses a different layout for the SS than that shown here.

PE Max revisited

The regression equation is
PEmax = 177 - 2.57 Age - 3.8 Sex - 0.450 Height + 3.01 Weight - 1.75 BMP + 1.08 FEV1 + 0.198 RV - 0.311 FRC + 0.188 TLC

Predictor  Coef     SE Coef  T      P      VIF
Constant   177.4    226.2    0.78   0.445
Age        -2.569   4.808    -0.53  0.601  21.900
Sex        -3.81    15.47    -0.25  0.809  2.273
Height     -0.4499  0.9038   -0.50  0.626  13.978
Weight     3.005    2.011    1.49   0.156  47.971
BMP        -1.751   1.16     -1.51  0.151  7.133
FEV1       1.077    1.081    1.00   0.335  5.426
RV         0.1980   0.1962   1.01   0.329  10.47
FRC        -0.3114  0.4927   -0.63  0.537  17.176
TLC        0.1877   0.4996   0.38   0.712  2.661

S = 25.4622   R-Sq = 63.8%

Which to drop? What's the objective?

PE Max revisited

The regression equation is
PEmax = 159 - 0.319 FRC

Predictor  Coef     SE Coef  T      P
Constant   158.71   23.36    6.79   0.000
FRC        -0.3191  0.1449   -2.20  0.038

S = 31.0414   R-Sq = 17.4%

[Fitted line plot: PEmax = 158.7 - 0.3191 FRC; S = 31.0414, R-Sq = 17.4%, R-Sq(adj) = 13.8%.]
[Scatterplot of PEmax vs FRC, points marked by Sex (0/1).]

PE Max revisited

The regression equation is
PEmax = 160 - 14.5 Sex - 0.288 FRC

Predictor  Coef     SE Coef  T      P
Constant   160.29   23.25    6.90   0.000
Sex        -14.48   12.64    -1.15  0.264
FRC        -0.2883  0.1464   -1.97  0.062

Source  DF  Seq SS
Sex     1   2234.4
FRC     1   3683.8

S = 30.8327   R-Sq = 22.1%   R-Sq(adj) = 15.0%

Analysis of Variance

Source          DF  SS       MS      F     P
Regression      2   5918.2   2959.1  3.11  0.065
Residual Error  22  20914.4  950.7
Total           24  26832.6

PE Max revisited

The regression equation is
PEmax = 62.1 - 12.5 Sex + 3.77 Age - 0.013 FRC

Predictor  Coef     SE Coef  T      P
Constant   62.13    42.84    1.45   0.162
Sex        -12.54   11.26    -1.11  0.278
Age        3.771    1.441    2.62   0.016
FRC        -0.0134  0.1673   -0.08  0.937

Source  DF  Seq SS
Sex     1   2234.4
Age     1   8819.5
FRC     1   4.9

S = 27.4069   R-Sq = 41.2%

Analysis of Variance

Source          DF  SS       MS      F     P
Regression      3   11058.8  3686.3  4.91  0.010
Residual Error  21  15773.9  751.1
Total           24  26832.6

PE Max revisited

Corr(Ht, Wt) = 0.921, so the %age of variation in Ht explained by Wt = 100 (0.921)² = 85%.

The regression equation is
PEmax = 54.7 + 0.129 Height + 1.00 Weight - 0.026 FRC

Predictor  Coef     SE Coef  T      P      VIF
Constant   54.73    89.84    0.61   0.549
Height     0.1293   0.6819   0.19   0.851  6.789
Weight     1.0048   0.8131   1.24   0.230  6.690
FRC        -0.0256  0.1664   -0.15  0.879  1.671

PE Max revisited The regression equation is PEmax = - 14.4 + 0.863 Height - 0.04 FRC Predictor Coef SE Coef T P VIF Constant -14.39 71.14-0.20 0.842 Height 0.8633 0.3390 2. 0.018 1.639 FRC -0.041 0.1667-0.32 0.749 1.639 The regression equation is PEmax = 4.7 + 0.129 Height + 1.00 Weight - 0.026 FRC Predictor Coef SE Coef T P VIF Constant 4.73 89.84 0.61 0.49 Height 0.1293 0.6819 0.19 0.81 6.789 Weight 1.0048 0.8131 1.24 0.230 6.690 FRC -0.02 0.1664-0.1 0.879 1.671 12/02/201 46

Interpreting the coefficients

Do more x-variables mean better models? Bigger R² and smaller S, but more coeffs. The key issue: correlated and/or missing x-variables.
Theory: coefficients only indirectly reflect correlation. High correlation does not imply a big coeff, and a low coeff does not imply low correlation.

Review: R-squared

R as a correlation coefficient. With one x-var:

$S^2 = S_y^2 (1 - r^2)$, where $r = \mathrm{Corr}(x, y)$.

If $\hat{y} = b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots$, then

$S^2 = S_y^2 (1 - R^2)$, where $R = \mathrm{Corr}(y, \hat{y})$

and $\hat{y}$ is that linear combination of $x_1, x_2, x_3, \dots$ which best predicts y.
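The identity R² = Corr(y, ŷ)² is quick to confirm numerically; a sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=40)

Xc = np.column_stack([np.ones(40), X])
yhat = Xc @ np.linalg.lstsq(Xc, y, rcond=None)[0]

ss_res = np.sum((y - yhat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
R = np.corrcoef(y, yhat)[0, 1]          # the multiple correlation
print(np.isclose(r2, R ** 2))           # True
```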

Review: R-squared

SSTotal and $S_y^2$ measure the variation of y about its mean; $S^2$ measures the variation of y about the best linear regression predictor, i.e. the variation of the residuals about their mean, 0. With one x-var, $S^2 = S_y^2(1 - r^2)$ with $r = \mathrm{Corr}(x, y)$; in general

$S^2 = S_y^2 \left(1 - R^2_{y|x_1,x_2}\right), \qquad 1 - R^2_{y|x_1,x_2} = \left(1 - R^2_{y|x_1}\right)\left(1 - R^2_{[y|x_1][x_2|x_1]}\right)$

Here R² is used as a proportion in (0,1), i.e. percentage/100.

Review: Coefficients

With one predictor, $y = \alpha + \beta x$: $\beta$ is simply related to $r = \mathrm{Corr}(y, x)$, and $R^2 = r^2$. NB the symmetry: $r = \mathrm{Corr}(x, y) = \mathrm{Corr}(y, x)$. Regressing the other way, $x = a + b y$: $b$ is again simply related to $r = \mathrm{Corr}(y, x)$ and hence to $\beta$.

Review: Coefficients

Coefficients are not impacted by order. With multiple predictors $x_1, x_2, x_3, \dots$ and $y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots$, $\beta_i$ is not proportional to $r_i = \mathrm{Corr}(y, x_i)$. In fact $\beta_i$ reflects the correlation between $x_i$ and the best predictor of $x_i$ using the other x vars AND y.

Strategies for correlated x-vars

Redundancy, in the extreme case: if two or more vars contain exactly one piece of information, use only one of them.
Partial redundancy: if two or more vars contain much the same information for the purposes in hand, use one (possibly composite) variable.
More generally: can the important info in K variables be reduced to a few (possibly composite) variables?

Other Strategies

Best Subsets and Stepwise Regression select the "best" model in a predictive sense; see the sketch below.
Modern methods handle very large data sets (in n and/or p); they are computationally intensive (see the Data Mining literature/software) and penalise models with many variables. Note that many models may be nearly as good.
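For a moderate number of predictors, a best-subsets search can be written in a few lines; a sketch that scores every subset by adjusted R², one simple way of penalising models with many variables (all names here are made up):

```python
import numpy as np
from itertools import combinations

def adj_r2(X, y):
    n, p = X.shape                       # p includes the intercept column
    res = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    ss_res = res @ res
    ss_tot = (y - y.mean()) @ (y - y.mean())
    return 1 - (ss_res / (n - p)) / (ss_tot / (n - 1))

def best_subsets(X, y):
    """Return subsets of column indices ranked by adjusted R^2."""
    n, k = X.shape
    scored = []
    for r in range(1, k + 1):
        for cols in combinations(range(k), r):
            Xs = np.column_stack([np.ones(n), X[:, cols]])
            scored.append((adj_r2(Xs, y), cols))
    return sorted(scored, reverse=True)

# Example with made-up data: columns 0 and 1 matter, column 2 is noise.
rng = np.random.default_rng(9)
X = rng.normal(size=(60, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=60)
print(best_subsets(X, y)[:3])            # the top three subsets
```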

Challenges with coeffs

To be able to interpret the coefficients, ideally:
choose x variables that are complementary and measure quite different aspects of the system;
organise the data so that it does not inadvertently give the impression that these are correlated, despite their selection.
In other words, design an experiment.

Technicality

Extreme case: exact multi-collinearity

x2 is perfectly correlated with x1; or, more generally, some x_k is perfectly predicted by the others, so $R_j = 1$ and

$SE(b_j) = \frac{s}{\sqrt{(n-1)\, s_{x_j}^2}} \sqrt{\frac{1}{1 - R_j^2}} = \infty$

can't be computed: there is no single best set of parameters, and MINITAB refuses to proceed. This is often an error (the same var entered twice!), and it always arises with complete sets of indicators.

Exact multi-collinearity

Many pairs of coeffs give the same predicted values: no unique solution. An SLR of y on x1 gives intercept 1.36 and slope 1.86. With both x1 and x2 (= x1) in the model, any coefficient pair summing to 1.86 predicts identically:

(b1, b2):  (1.86, 0)   (0, 1.86)   (3, -1.14)   (0.93, 0.93)

 y     x1  x2  Prediction (the same for every pair)
 2.6   1   1   3.2
 4.6   2   2   5.1
 7.3   3   3   6.9
 10.0  4   4   8.8
 12.1  5   5   10.7
 10.6  6   6   12.5
 14.3  7   7   14.4


Indicator Vars: Exact multi-collinearity

Regression Analysis: Comps versus Time since 1978, Q1, Q2, Q3, Q4

* Q4 is highly correlated with other X variables
* Q4 has been removed from the equation.

The regression equation is
Comps = -9452 + 986 Time since 1978 - 1792 Q1 - 1139 Q2 - 758 Q3

Indicator Vars: Exact multi-collinearity

Model has no intercept.

Regression Analysis: Comps versus Time since 1978, Q1, Q2, Q3, Q4

The regression equation is
Comps = 986 Time since 1978 - 11244 Q1 - 10592 Q2 - 10210 Q3 - 9452 Q4

Indicator Vars: Exact multi-collinearity

Models:
1: Comps = 986 Time since 1978 - 11244 Q1 - 10592 Q2 - 10210 Q3 - 9452 Q4
2: Comps = -9452 + 986 Time since 1978 - 1792 Q1 - 1139 Q2 - 758 Q3

Time since 1978  Comps  Q1  Q2  Q3  Q4  Model 1  Model 2
15               3684   1   0   0   0   3546     3546
15.25            4487   0   1   0   0   4444.5   4444.5
15.5             5089   0   0   1   0   5073     5073
15.75            6041   0   0   0   1   6077.5   6077.5
16               4291   1   0   0   0   4532     4532

The two parameterisations give different coefficients but identical predictions.
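The two parameterisations correspond exactly to dummy coding with and without an intercept. A sketch with pandas and statsmodels; the quarterly series below extends the slide's five observations with made-up values so that the fit is not saturated:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Quarterly series in the spirit of the slides; the last three
# Comps values are made up so the model is not saturated.
df = pd.DataFrame({
    "t": [15.0, 15.25, 15.5, 15.75, 16.0, 16.25, 16.5, 16.75],
    "q": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "comps": [3684, 4487, 5089, 6041, 4291, 5120, 5710, 6660],
})
dummies = pd.get_dummies(df["q"], dtype=float)

# Model 1: all four indicators and no intercept (one level per quarter).
X1 = pd.concat([df["t"], dummies], axis=1)
fit1 = sm.OLS(df["comps"], X1).fit()

# Model 2: intercept plus three indicators; Q4 becomes the baseline,
# exactly what MINITAB does when it removes Q4.
X2 = sm.add_constant(pd.concat([df["t"], dummies.drop(columns="Q4")], axis=1))
fit2 = sm.OLS(df["comps"], X2).fit()

# Different coefficients, identical predictions.
print(np.allclose(fit1.fittedvalues, fit2.fittedvalues))   # True
```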

Indicator Vars: Exact multi-collinearity Models 1 Comps = 986 Time since 1978-11244 Q1-1092 Q2-10210 Q3-942 Q4 2 Comps = - 942 + 986 Time since 1978-1792 Q1-1139 Q2-78 Q3 Time since Indicator vars Predictions 1978 Comps Q1 Q2 Q3 Q4 Model 1 Model 2 1 3684 1 0 0 0 346 346 1.2 4487 0 1 0 0 4444. 444. 1. 089 0 0 1 0 073 073 1.7 6041 0 0 0 1 6077. 6077. 16 4291 1 0 0 0 432 432 12/02/201 62