Lecture 11 Multiple Linear Regression


Lecture 11: Multiple Linear Regression
STAT 512, Spring 2011
Background reading: KNNL 6.1-6.5

Topic Overview
- Review: Multiple Linear Regression (MLR)
- Computer Science Case Study

Multiple Regression Model

Y = Xβ + ε, where Y is n×1, X is n×p, β is p×1, ε is n×1, and ε ~ N(0, σ²I)

X is the design matrix (each column after the first corresponds to a predictor variable). Y, β, and ε are vectors holding the responses, parameters, and errors.

Parameter Estimates

b = (X′X)⁻¹X′Y

The variance-covariance matrix of b is s²{b} = MSE (X′X)⁻¹, a p×p matrix. To get s²{b_k}, read the appropriate (k-th) diagonal element of the matrix (where k = 0, 1, ..., p-1). Use the estimates and their standard errors to produce confidence intervals.
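In SAS, these quantities can be requested directly; a minimal sketch using PROC REG's COVB and CLB options (both standard MODEL-statement options; the cs dataset and model refer to the case study below):

  proc reg data=cs;
    * covb prints the variance-covariance matrix of b;
    * clb prints confidence limits for each coefficient;
    model gpa = hsm hss hse satm satv / covb clb alpha=0.05;
  run;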

Multiple Regression Case Study
- Problem: Computer science majors at Purdue have a high dropout rate.
- Potential solution: Can we find predictors of success? The predictors must be available at the time of entry into the program.
- The file cs.sas contains the SAS code for this example.

Computer Science Data
- 224 data points
- Response: GPA after three semesters
- Predictor variables:
  X1 = high school math grades (HSM)
  X2 = high school science grades (HSS)
  X3 = high school English grades (HSE)
  X4 = SAT Math (SATM)
  X5 = SAT Verbal (SATV)
  X6 = gender (1 = male, 0 = female)

CS Data (2)
- For the time being we will ignore gender and focus on the remaining five variables.
- We have n = 224 observations, so if all five variables are included, the design matrix X is 224 × 6 (an intercept column plus five predictor columns).
- Grades are integer values from 0 to 10; SAT scores are integer values between 200 and 800.
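A minimal sketch of reading such a dataset into SAS (the file name csdata.dat and the variable order are assumptions for illustration, not taken from the slides):

  data cs;
    infile 'csdata.dat';                      * hypothetical file name;
    input id gpa hsm hss hse satm satv sex;   * assumed variable order;
  run;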

Exploring the Data (SAS)
- Use the FREQ procedure to learn about HSM, HSS, and HSE, since these scores are discrete.
- Use the UNIVARIATE and MEANS procedures to learn about SATM and SATV (basically continuous).
- Use the GPLOT procedure to produce plots that may be informative.
- Use the CORR procedure to determine correlations among the predictor variables.
- Use the REG procedure to analyze the relationships of the predictor variables with GPA after 3 semesters.

HSM, HSS, and HSE Variables
Mostly between 6 and 10; examine using PROC FREQ:

  proc freq data=cs;
    tables hsm hss hse;
  run;

Can plot each variable vs. GPA (difficult to see trends):

  proc gplot data=cs;
    plot gpa*(hsm hss hse);
  run;

Can run regressions vs. GPA individually:

  proc reg data=cs;
    model gpa = hsm;
    model gpa = hss;
    model gpa = hse;
  run;

Output: FREQ Procedure (HSE)

  hse   Frequency   Percent   Cumulative Frequency   Cumulative Percent
    3           1      0.45                      1                 0.45
    4           4      1.79                      5                 2.23
    5           5      2.23                     10                 4.46
    6          23     10.27                     33                14.73
    7          43     19.20                     76                33.93
    8          49     21.88                    125                55.80
    9          52     23.21                    177                79.02
   10          47     20.98                    224               100.00

Output: PROC GPLOT (HSE)
[figure: scatterplot of GPA vs. HSE]

Output: REG Procedure (GPA vs. HSE)

  Analysis of Variance
  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              1         11.31409      11.31409     20.23   <.0001
  Error            222        124.14870       0.55923
  Corrected Total  223        135.46279

  Root MSE          0.74782   R-Square   0.0835
  Dependent Mean    2.63522   Adj R-Sq   0.0794
  Coeff Var        28.37770

Output: REG Procedure (GPA vs. HSE)

  Parameter Estimates
  Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept   1              1.42618          0.27340      5.22     <.0001
  hse         1              0.14938          0.03321      4.50     <.0001

The individual models for HSM and HSS are also significant, with R² of 19% and 11%, respectively.

SATM and SATV Variables
Can look at descriptive statistics and histograms using PROC MEANS and PROC UNIVARIATE:

  proc means data=cs maxdec=2;
    var SATM SATV;
  run;
  proc univariate data=cs noprint;
    var SATM SATV;
    histogram SATM SATV;
  run;

Can plot each variable vs. GPA:

  proc gplot data=cs;
    plot gpa*(satm satv);
  run;

Can run regressions vs. GPA individually:

  proc reg data=cs;
    model gpa = satm;
    model gpa = satv;
  run;

Output: MEANS/UNIVARIATE

  Variable    N     Mean   Std Dev   Minimum   Maximum
  satm      224   595.29     86.40    300.00    800.00
  satv      224   504.55     92.61    285.00    760.00

Output: PROC GPLOT (SATM)
[figure: scatterplot of GPA vs. SATM]

SATM and SATV Variables
- The histograms look reasonable; no obvious issues with these variables.
- Trends are still not evident from the plots.
- Only SATM is significant on its own (R² of 6%).
- The p-value for SATV is 0.08; this suggests SATV may not be important, but it does not mean we should dismiss its potential importance in a multiple regression model.

Multiple Regression Model

  proc reg data=cs;
    model gpa = hsm hss hse satm satv;
  run;

  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              5         28.64364       5.72873     11.69   <.0001
  Error            218        106.81914       0.49000
  Corrected Total  223        135.46279

  Root MSE          0.70000   R-Square   0.2115
  Dependent Mean    2.63522   Adj R-Sq   0.1934

The model is significant; 21% of the variation in GPA is explained by the model.

Parameter Estimates

  Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept   1              0.32672          0.40000      0.82     0.4149
  hsm         1              0.14596          0.03926      3.72     0.0003
  hss         1              0.03591          0.03780      0.95     0.3432
  hse         1              0.05529          0.03957      1.40     0.1637
  satm        1           0.00094359       0.00068566      1.38     0.1702
  satv        1          -0.00040785       0.00059189     -0.69     0.4915

All except SATV were significant individually; now only HSM is significant. Why?

Parameter Estimates (2)
Remember: these tests are now "variable added last" tests. There is SSTO amount of variation in GPA to explain. If the predictor variables are correlated, they will overlap in the part of this variation that they explain. This is why some of the variables are no longer significant: they are competing with the other variables in trying to explain the same thing!
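One way to see this in SAS is to compare sequential and partial sums of squares; a sketch using PROC REG's SS1 and SS2 options (both standard MODEL-statement options, though not shown in the slides). The Type I (sequential) sums of squares depend on the order in which predictors enter the model, while the Type II sums of squares correspond to the "variable added last" tests above.

  proc reg data=cs;
    * ss1 = Type I (sequential) SS, ss2 = Type II (added-last) SS;
    model gpa = hsm hss hse satm satv / ss1 ss2;
  run;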

Correlated Predictors?
From the parameter estimates, we see that HSM is an important variable in the model. But we do not need (or want) all five variables in the model. To see why, we can look at the correlations among the individual variables.

  proc corr data=cs noprob;
    var hsm hss hse satm satv;
  run;
  proc corr data=cs noprob;
    var hsm hss hse satm satv;
    with gpa;
  run;

Output: PROC CORR

  Pearson Correlation Coefficients, N = 224
           hsm       hss       hse      satm      satv
  hsm   1.00000   0.57569   0.44689   0.45351   0.22112
  hss   0.57569   1.00000   0.57937   0.24048   0.26170
  hse   0.44689   0.57937   1.00000   0.10828   0.24371
  satm  0.45351   0.24048   0.10828   1.00000   0.46394
  satv  0.22112   0.26170   0.24371   0.46394   1.00000

  Pearson Correlation Coefficients, N = 224
           hsm       hss       hse      satm      satv
  gpa   0.43650   0.32943   0.28900   0.25171   0.11449

Interactive Data Analysis (Scatterplot Matrix)
We can also view this as a graphical correlation matrix. Read the data into SAS, then:
- Select from the menus: Solutions > Analysis > Interactive Data Analysis.
- Hold the CTRL key while clicking on the columns you want to analyze.
- Choose Analyze > Scatterplot(X,Y).
The same scatterplot matrix can also be produced in code; see the sketch after the figure below.

[figure: scatterplot matrix of GPA and the five predictors]
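As an alternative to the menu route, a minimal sketch using PROC SGSCATTER (a standard procedure in more recent SAS releases; not the method used in the slides):

  proc sgscatter data=cs;
    matrix gpa hsm hss hse satm satv;   * all pairwise scatterplots;
  run;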

Conclusions
Conclusion: the predictors are correlated to some extent (multicollinearity). The problem may be worse than the table suggests, because pairwise correlations cannot reveal a near-linear relationship between one predictor and a combination of several others.
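A common numerical check for this (a sketch; the VIF option is a standard PROC REG MODEL option, though the slides do not use it) is the variance inflation factor, which measures how well each predictor is explained by all the others:

  proc reg data=cs;
    * VIF_k = 1/(1 - R-square of X_k regressed on the other predictors);
    model gpa = hsm hss hse satm satv / vif;
  run;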

Back to Analysis: What Now?
HSM is clearly important and should be in the model. We might want another variable in the model, but we wouldn't want all five, particularly if we wish to interpret the parameter estimates. To figure out where to go next, we can analyze the residuals from the model for GPA based on HSM (using PROC CORR again); see the sketch below for how the residual dataset is created.
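The next slide correlates a variable named resid in a dataset named diag; a minimal sketch of how that dataset could be created (the OUTPUT statement with OUT= and R= is standard PROC REG syntax; the names come from the next slide):

  proc reg data=cs;
    model gpa = hsm;
    output out=diag r=resid;   * save residuals from the HSM-only model;
  run;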

Correlation Matrix

  proc corr data=diag noprob;
    var resid;
    with hss hse satm satv;
  run;

          resid
  hss   0.08685
  hse   0.10441
  satm  0.05975
  satv  0.01998

Which Model?
- None of the remaining variables are significant after HSM; HSE comes the closest.
- Choose between HSM alone and HSM with HSE:
  HSM alone explains 19% of the variation in GPA.
  HSM + HSE explains 20% of the variation in GPA.
- Adjusted R² favors the 2-variable model: 0.187 (HSM alone) vs. 0.194 (HSM + HSE).

Which Model? (2)
It depends on your interest!
- If interested in the parameter estimates, we would probably choose the 1-variable model to avoid collinearity issues.
- If interested in prediction of GPA, as we are, then some collinearity is acceptable and we may favor the 2-variable model.
- There are slight differences in the estimates depending on which model we choose.
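A sketch of fitting both candidate models in a single PROC REG step, so the R², adjusted R², and parameter estimates on the next slide can be compared side by side:

  proc reg data=cs;
    model gpa = hsm;        * 1-variable model;
    model gpa = hsm hse;    * 2-variable model;
  run;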

Parameter Estimates

Model #1
  Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept   1              0.90768          0.24355      3.73     0.0002
  hsm         1              0.20760          0.02872      7.23     <.0001

Model #2
  Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept   1              0.62423          0.29172      2.14     0.0335
  hsm         1              0.18265          0.03196      5.72     <.0001
  hse         1              0.06067          0.03473      1.75     0.0820

Key Ideas from Case Study (1)
- First, look at graphical and numerical summaries for one variable at a time.
- Then, look at relationships between pairs of variables with graphical and numerical summaries. Use plots and correlations to understand relationships.
- The relationship between a response variable and an explanatory variable depends on what other variables are in the model.

Key Ideas from Case Study (2)
- A variable can be a significant predictor alone and not significant when other predictors are in the model.
- Regression coefficients, standard errors, and the results of significance tests depend on what other predictors are in the model.
- Significance tests do not tell the whole story. Squared multiple correlations (R²) give the proportion of variation in the response explained by the model and can give a different view (adjusted R² can also help).

Model Assumptions
We still need to check the model assumptions (we will do this in a later topic); a quick way to get diagnostic plots is sketched below.
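A sketch of one way to produce residual diagnostic plots in SAS (ODS Graphics with PLOTS=DIAGNOSTICS is standard in later SAS releases; the slides themselves defer diagnostics to a later topic):

  ods graphics on;
  proc reg data=cs plots=diagnostics;
    model gpa = hsm hse;   * 2-variable model from the case study;
  run;
  ods graphics off;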

Upcoming in Lecture 12...
- Inference in MLR (Sections 6.6-6.7)
- A bit more on the case study