Lecture 11: Multiple Linear Regression
STAT 512, Spring 2011
Background Reading: KNNL 6.1-6.5
11-1
Topic Overview
Review: Multiple Linear Regression (MLR)
Computer Science Case Study
11-2
Multiple Regression Model

Y = Xβ + ε, where ε ~ N(0, σ²I)

Dimensions: Y is n×1, X is n×p, β is p×1, ε is n×1. X is the design matrix (each column after the first corresponds to a predictor variable). Y, β, ε are vectors holding the responses, parameters, and errors.
11-3
Parameter Estimates

b = (X'X)⁻¹X'Y

The estimated variance-covariance matrix is s²{b} = MSE · (X'X)⁻¹, a p×p matrix. To get s²{b_k}, read the appropriate (k-th) diagonal element of the matrix (where k = 0, 1, ..., p-1). Use Estimate ± t·SE to produce confidence interval(s).
11-4
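The matrix formulas above can be sketched directly in code. This is an illustrative example on made-up data, not the cs.sas analysis; the variable names are hypothetical.

```python
import numpy as np

# Sketch of OLS via the normal equations: b = (X'X)^{-1} X'Y, with
# standard errors from s^2{b} = MSE * (X'X)^{-1}.  Data are simulated
# purely for illustration.
rng = np.random.default_rng(0)
n, p = 50, 3                                  # n observations, p parameters (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                         # parameter estimates
resid = y - X @ b
mse = resid @ resid / (n - p)                 # SSE / (n - p)
se_b = np.sqrt(mse * np.diag(XtX_inv))        # s{b_k}: sqrt of k-th diagonal element
```

A 95% interval for b_k is then b[k] ± t(0.975; n-p) · se_b[k].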
Multiple Regression Case Study
Problem: Computer science majors at Purdue have a high dropout rate.
Potential Solution: Can we find predictors of success? Predictors must be available at time of entry into the program.
File cs.sas contains SAS code for this example.
11-5
Computer Science Data
224 data points
Response: GPA after three semesters
Predictor Variables:
X1 = High school math grades (HSM)
X2 = High school science grades (HSS)
X3 = High school English grades (HSE)
X4 = SAT Math (SATM)
X5 = SAT Verbal (SATV)
X6 = Gender (1 = male, 0 = female)
11-6
CS Data (2)
For the time being we will ignore gender and focus on the remaining five variables. We have n = 224 observations, so if all five variables are included the design matrix X is 224 × 6. Grades are integer values from 0 to 10; SAT scores are integer values between 200 and 800.
11-7
Exploring the Data (SAS)
Use the FREQ procedure to learn about HSM, HSS, and HSE, since these scores are discrete.
Use the UNIVARIATE and MEANS procedures to learn about SATM and SATV (basically continuous).
Use the GPLOT procedure to produce plots that may be informative.
Use the CORR procedure to determine correlations among the predictor variables.
Use the REG procedure to analyze the relationships of the predictor variables with GPA after 3 semesters.
11-8
HSM, HSS, and HSE Variables
Mostly between 6 and 10; examine with PROC FREQ:
proc freq data=cs;
  tables hsm hss hse;
run;
Can plot each variable vs. GPA (difficult to see trends):
proc gplot;
  plot gpa*(hsm hss hse);
run;
Can run regressions vs. GPA individually:
proc reg data=cs;
  model gpa = hsm;
  model gpa = hss;
  model gpa = hse;
run;
11-9
Output: FREQ Procedure (HSE)

hse  Frequency  Percent  Cumulative Frequency  Cumulative Percent
  3       1       0.45            1                   0.45
  4       4       1.79            5                   2.23
  5       5       2.23           10                   4.46
  6      23      10.27           33                  14.73
  7      43      19.20           76                  33.93
  8      49      21.88          125                  55.80
  9      52      23.21          177                  79.02
 10      47      20.98          224                 100.00
11-10
Output: PROC GPLOT (HSE) 11-11
Output: REG Procedure (GPA vs. HSE)

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1        11.31409      11.31409      20.23    <.0001
Error            222       124.14870       0.55923
Corrected Total  223       135.46279

Root MSE         0.74782    R-Square   0.0835
Dependent Mean   2.63522    Adj R-Sq   0.0794
Coeff Var       28.37770
11-12
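The pieces of this ANOVA table fit together arithmetically, which is a useful sanity check. A short sketch, using the values copied from the table above:

```python
# Sanity checks on the ANOVA output: each mean square is SS/df,
# F = MSR/MSE, R^2 = SSM/SSTO, and Root MSE = sqrt(MSE).
ss_model, ss_error, ss_total = 11.31409, 124.14870, 135.46279
df_model, df_error = 1, 222

ms_model = ss_model / df_model
mse = ss_error / df_error
f_value = ms_model / mse          # reported as 20.23
r_square = ss_model / ss_total    # reported as 0.0835
root_mse = mse ** 0.5             # reported as 0.74782
```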
Output: REG Procedure (GPA vs. HSE)

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1        1.42618             0.27340        5.22     <.0001
hse          1        0.14938             0.03321        4.50     <.0001

Also get model significance for HSM and HSS, with R² of 19% and 11% respectively.
11-13
SATM and SATV Variables
Can look at descriptive statistics and histograms, using PROC MEANS and PROC UNIVARIATE:
proc means data=cs maxdec=2;
  var SATM SATV;
run;
proc univariate data=cs noprint;
  var SATM SATV;
  histogram SATM SATV;
run;
Can plot each variable vs. GPA:
proc gplot;
  plot gpa*(satm satv);
run;
Can run regressions vs. GPA individually:
proc reg data=cs;
  model gpa = satm;
  model gpa = satv;
run;
11-14
Output: MEANS/UNIVARIATE

Variable    N      Mean    Std Dev   Minimum   Maximum
satm       224    595.29    86.40     300.00    800.00
satv       224    504.55    92.61     285.00    760.00
11-15
Output: PROC GPLOT (SATM) 11-16
SATM and SATV Variables
Histograms look reasonable; no obvious issues with these variables. Trends are still not evident from the plots. Only SATM is actually significant (R² of 6%). The p-value for SATV is 0.08; this suggests that SATV may not be important, but it does not mean we should not consider its importance in a multiple regression model.
11-17
Multiple Regression Model
proc reg data=cs;
  model gpa = hsm hss hse satm satv;
run;

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              5        28.64364       5.72873      11.69    <.0001
Error            218       106.81914       0.49000
Corrected Total  223       135.46279

Root MSE         0.70000    R-Square   0.2115
Dependent Mean   2.63522    Adj R-Sq   0.1934

Model is significant; 21% of the variation in GPA is explained by the model.
11-18
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1        0.32672            0.40000         0.82     0.4149
hsm          1        0.14596            0.03926         3.72     0.0003
hss          1        0.03591            0.03780         0.95     0.3432
hse          1        0.05529            0.03957         1.40     0.1637
satm         1        0.00094359         0.00068566      1.38     0.1702
satv         1       -0.00040785         0.00059189     -0.69     0.4915

All except SATV were significant individually; now only HSM is significant??? Why???
11-19
Parameter Estimates (2)
Remember: these tests are now "variable added last" tests. There is SSTO amount of variation in GPA to explain. If the predictor variables are correlated, they will overlap in the part of this variation that they explain. This is why some of the variables are no longer significant: they are competing with the other variables in trying to explain the same thing!
11-20
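The "added last" effect is easy to reproduce on simulated data. A hypothetical sketch (not the CS data): two strongly correlated predictors, where only one truly drives the response.

```python
import numpy as np

# Simulated illustration of "variable added last" tests: x2 is a proxy
# for x1, so it looks significant alone but not after x1 is included.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.2, size=n)   # x2 strongly correlated with x1
y = x1 + rng.normal(scale=1.0, size=n)    # only x1 truly drives y

def t_stats(X, y):
    """OLS t statistics: b_k / s{b_k}, with s^2{b} = MSE * (X'X)^{-1}."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    mse = resid @ resid / (len(y) - X.shape[1])
    return b / np.sqrt(mse * np.diag(XtX_inv))

ones = np.ones(n)
t_x2_alone = t_stats(np.column_stack([ones, x2]), y)[1]
t_x2_joint = t_stats(np.column_stack([ones, x1, x2]), y)[2]
# |t_x2_alone| is large (x2 proxies for x1); once x1 is in the model,
# |t_x2_joint| shrinks dramatically.
```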
Correlated Predictors?
From the parameter estimates, we see that HSM is an important variable in the model. But we do not need (or want) all five variables in the model. To see why, we can look at the correlations between the individual variables.
proc corr data=cs noprob;
  var hsm hss hse satm satv;
run;
proc corr data=cs noprob;
  var hsm hss hse satm satv;
  with gpa;
run;
11-21
Output: PROC CORR

Pearson Correlation Coefficients, N = 224
         hsm       hss       hse       satm      satv
hsm    1.00000   0.57569   0.44689   0.45351   0.22112
hss    0.57569   1.00000   0.57937   0.24048   0.26170
hse    0.44689   0.57937   1.00000   0.10828   0.24371
satm   0.45351   0.24048   0.10828   1.00000   0.46394
satv   0.22112   0.26170   0.24371   0.46394   1.00000

Pearson Correlation Coefficients, N = 224
         hsm       hss       hse       satm      satv
gpa    0.43650   0.32943   0.28900   0.25171   0.11449
11-22
Interactive Data Analysis (Scatterplot Matrix)
Can also view this in a graphical correlation matrix. Read the data into SAS, then select from the menus: Solutions -> Analysis -> Interactive data analysis. Hold the CTRL button while clicking on the columns you want to analyze, then choose Analyze -> Scatterplot(X,Y).
11-23
[Figure: scatterplot matrix of the predictor variables]
11-24
Conclusions
The predictors are correlated to some extent (multicollinearity). It may be worse than is actually seen in the table! Why???
11-25
Back to Analysis: What Now?
HSM is clearly important and should be in the model. We might want another variable in the model, but we wouldn't want all five, particularly if we wish to interpret the parameter estimates. To figure out where to go next, we could analyze the residuals from the model for GPA based on HSM (use PROC CORR again).
11-26
Correlation Matrix
proc corr data=diag noprob;
  var resid;
  with hss hse satm satv;
run;

        resid
hss   0.08685
hse   0.10441
satm  0.05975
satv  0.01998
11-27
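The residual-correlation step can be sketched as follows. This uses made-up data with hypothetical variable names (the real analysis uses the CS data via PROC CORR in SAS): fit GPA on HSM, then ask how strongly each remaining predictor correlates with what is left unexplained.

```python
import numpy as np

# Simulated sketch: regress gpa on hsm, then correlate the residuals
# with a candidate predictor (hse) to see if it explains anything more.
rng = np.random.default_rng(2)
n = 224
hsm = rng.integers(0, 11, size=n).astype(float)
hse = 0.5 * hsm + rng.normal(scale=2.0, size=n)   # correlated with hsm
gpa = 0.2 * hsm + rng.normal(scale=0.7, size=n)

X = np.column_stack([np.ones(n), hsm])
b = np.linalg.lstsq(X, gpa, rcond=None)[0]
resid = gpa - X @ b

# Correlation of residuals with a candidate predictor: how much of GPA
# is left for hse to explain once hsm is already in the model?
r = np.corrcoef(resid, hse)[0, 1]
```

Note that the residuals are exactly orthogonal to the included predictor (hsm), so only the part of hse not shared with hsm can produce a nonzero correlation here.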
Which Model?
None of the remaining variables are significant after HSM; HSE comes the closest. Choose between HSM alone and HSM with HSE.
HSM alone explains 19% of the variation in GPA.
HSM + HSE explains 20% of the variation in GPA.
Adjusted R² favors the 2-variable model: 0.187 vs. 0.194.
11-28
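Why adjusted R² can prefer the larger model despite the tiny gain in R²: it penalizes each extra parameter. A quick check using the rounded R² values from the slides (19% and 20%; the exact SAS output differs slightly, so only the direction of the comparison is meaningful):

```python
# Adjusted R^2 penalizes extra parameters:
#   R^2_adj = 1 - (1 - R^2) * (n - 1) / (n - p)
def adj_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p)

n = 224
adj_1 = adj_r2(0.19, n, p=2)   # HSM alone: intercept + 1 slope
adj_2 = adj_r2(0.20, n, p=3)   # HSM + HSE: intercept + 2 slopes
# adj_2 > adj_1: the gain in R^2 outweighs the extra-parameter penalty.
```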
Which Model? (2) Depends on your interest! If interested in parameter estimates, would probably choose the 1-variable model to avoid collinearity issues. If interested in prediction of GPA as we are, then it is ok to have some collinearity and we may favor the 2-variable model. Slight differences in the estimates depending on which we choose. 11-29
Parameter Estimates

Model #1
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1        0.90768            0.24355         3.73     0.0002
hsm          1        0.20760            0.02872         7.23     <.0001

Model #2
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1        0.62423            0.29172         2.14     0.0335
hsm          1        0.18265            0.03196         5.72     <.0001
hse          1        0.06067            0.03473         1.75     0.0820
11-30
Key Ideas from Case Study (1)
First, look at graphical and numerical summaries for one variable at a time. Then, look at relationships between pairs of variables with graphical and numerical summaries. Use plots and correlations to understand relationships. The relationship between a response variable and an explanatory variable depends on what other variables are in the model.
11-31
Key Ideas from Case Study (2)
A variable can be a significant predictor alone and not significant when other predictors are in the model. Regression coefficients, standard errors, and the results of significance tests depend on what other predictors are in the model. Significance tests do not tell the whole story. Squared multiple correlations (R²) give the proportion of variation in the response explained by the model and can give a different view (adjusted R² can also help).
11-32
Model Assumptions Need to check assumptions (will do this in a later topic) 11-33
Upcoming in Lecture 12... Inference in MLR (Sections 6.6-6.7) A bit more on the case study 11-34