Unit 11: Multiple Linear Regression


Unit 11: Multiple Linear Regression
Statistics 571: Statistical Methods
Ramón V. León, 7/13/2004

Main Application of Multiple Regression

Isolating the effect of a variable by adjusting for the effects of other variables:

- Regressing a health index on wealth gives a negative slope for the wealth effect (the higher the wealth, the lower the health). This is what one does when one fits a simple linear regression model.
- Regressing the health index on wealth while adjusting for age gives a positive slope for wealth (the higher the wealth, the higher the health) once one adjusts for the effect of the confounder, age.

Multiple linear regression models allow us to determine the effect of regressors while adjusting for confounding and for effect modification.
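As a quick illustration of this adjustment, here is a minimal simulation in Python (hypothetical numbers, not the course's health data; assumes numpy and statsmodels are available) in which the wealth slope flips sign once age enters the model:

```python
# A minimal sketch: a confounder (age) reverses the apparent sign of an effect.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(20, 80, n)                              # confounder
wealth = 0.5 * age + rng.normal(0, 5, n)                  # wealth rises with age
health = 0.3 * wealth - 0.4 * age + rng.normal(0, 2, n)   # true wealth effect is +0.3

# Simple regression: health on wealth only (age omitted)
simple = sm.OLS(health, sm.add_constant(wealth)).fit()
# Multiple regression: health on wealth and age
multiple = sm.OLS(health, sm.add_constant(np.column_stack([wealth, age]))).fit()

print(simple.params[1])    # negative: wealth picks up the age effect
print(multiple.params[1])  # near +0.3: the wealth effect after adjusting for age
```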

Why Do Regression Coefficients Have the Wrong Sign?

Simple regression: $\hat{y} = 1.835 + 0.463x_1$
Multiple regression: $\hat{y} = 1.036 - 1.222x_1 + 3.649x_2$

Multiple Linear Regression

Model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$

The model is called linear because it is linear in the $\beta$'s, not necessarily in the $x$'s. So

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + \varepsilon$

is considered a linear model, and so is

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \varepsilon$

A Probabilistic Model for Multiple Linear Regression

$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i, \quad i = 1, 2, \ldots, n$

where $\varepsilon_i$ is a random error with $E(\varepsilon_i) = 0$, and $\beta_0, \beta_1, \ldots, \beta_k$ are unknown parameters. Usually we assume that the $\varepsilon_i$ are independent $N(0, \sigma^2)$ r.v.'s. This implies that the $Y_i$'s are independent $N(\mu_i, \sigma^2)$ r.v.'s with

$\mu_i = E(Y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$

Tire Wear Example

Two predictors $x_{i1}, x_{i2}$ ($k = 2$) and response $y_i$, with $i = 1, 2, \ldots, n = 9$ observations.

Least Squares Fit

Minimize

$Q = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_k x_{ik})^2$

by solving the equations

$\frac{\partial Q}{\partial \beta_0} = 0, \ \frac{\partial Q}{\partial \beta_1} = 0, \ \ldots, \ \frac{\partial Q}{\partial \beta_k} = 0$

The solutions are the least squares regression estimates $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$.

Sums of Squares

$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_k x_{ik}, \quad i = 1, 2, \ldots, n$

Error Sum of Squares: $SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Total Sum of Squares: $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$
Regression Sum of Squares: $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

so SST = SSR + SSE, as also obtained for simple linear regression.
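A minimal numpy sketch of the least squares fit and the SST = SSR + SSE identity, on made-up data shaped like the tire wear example (n = 9, k = 2; the coefficients are invented for illustration):

```python
# Least squares fit and sum-of-squares decomposition, computed directly.
import numpy as np

rng = np.random.default_rng(1)
n = 9
X = np.column_stack([np.ones(n), rng.uniform(0, 32, n), rng.normal(size=n)])
y = X @ np.array([360.0, -10.0, 5.0]) + rng.normal(0, 4, n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves the normal equations
y_hat = X @ beta_hat

sse = np.sum((y - y_hat) ** 2)          # error sum of squares
sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
r2 = ssr / sst                          # coefficient of multiple determination

print(f"SST = {sst:.3f}, SSR + SSE = {ssr + sse:.3f}, r^2 = {r2:.3f}")
```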

Coefficient of Multiple Determination

$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

is the percentage of the total variation in Y that is explained by the model. $r = \sqrt{r^2}$ is called the multiple correlation coefficient.

Remark: $R^2$ always increases with the number of variables included in the model, but this does not mean that more variables always lead to a better model.

The Tire Wear Example: Quadratic Fit

Using the Fit Y by X platform:

Tire Wear Example: JMP Output

We used the Fit Y by X platform of JMP. Centering reduces multicollinearity and the differences in order of magnitude among the different powers of x. Notice that the polynomial is centered at the average Mileage of 16.

Tire Wear Example: JMP

In this method of fitting, using the Fit Model platform, JMP does not know that Mileage^2 is the square of Mileage. The Mileage^2 column was created using JMP's calculator.

Tire Wear Example: JMP Output

The polynomial is not centered here. Why?

JMP Fit Model Platform

The square term is added by highlighting Mileage in both Select Columns and Construct Model Effects and then clicking the Cross button. Polynomials are centered by default.

Fit Model Platform Output

Now the polynomial is centered again.

Statistical Inference on Individual $\beta$'s

$\frac{\hat\beta_j - \beta_j}{SE(\hat\beta_j)} \sim T_{n-(k+1)} \quad (j = 0, 1, \ldots, k)$

100(1-$\alpha$)% CI for $\beta_j$: $\hat\beta_j \pm t_{n-(k+1),\,\alpha/2}\, SE(\hat\beta_j)$

Test $H_{0j}: \beta_j = 0$ versus $H_{1j}: \beta_j \neq 0$. Reject if

$|t_j| = \frac{|\hat\beta_j|}{SE(\hat\beta_j)} > t_{n-(k+1),\,\alpha/2}$

Recall that the CIs are obtained by right-clicking on the parameter estimates area and, from the popup menu, selecting Columns and then Upper 95% and Lower 95%.
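A short statsmodels sketch, on simulated data, of these per-coefficient t statistics and confidence intervals:

```python
# t statistics, p-values, and CIs for individual coefficients.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 30
X = sm.add_constant(rng.normal(size=(n, 2)))
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)  # second slope is truly 0

fit = sm.OLS(y, X).fit()
print(fit.tvalues)               # beta_hat_j / SE(beta_hat_j), vs. t_{n-(k+1)}
print(fit.pvalues)               # two-sided p-values for H0: beta_j = 0
print(fit.conf_int(alpha=0.05))  # beta_hat_j +/- t_{n-(k+1), alpha/2} * SE
```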

Statistical Inference on Simultaneous $\beta$'s

$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ vs. $H_1$: at least one $\beta_j \neq 0$

Reject if $F = \frac{MSR}{MSE} > f_{k,\, n-(k+1),\, \alpha}$

ANOVA table for quadratic fit of tire wear data.

Other Tests of Hypotheses

Compares the full model

$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i \quad (i = 1, 2, \ldots, n)$

to the partial model

$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{k-m} x_{i,k-m} + \varepsilon_i \quad (i = 1, 2, \ldots, n)$

that is, it tests $H_0: \beta_{k-m+1} = \cdots = \beta_k = 0$. Does adding the additional predictor variables to the model lead to a significant reduction in the error sum of squares (SSE)?
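A sketch of this full-versus-partial comparison using statsmodels' anova_lm on simulated data (the variable names and effect sizes are invented for illustration):

```python
# Partial F test: does adding x2 and x3 significantly reduce SSE?
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(40, 3)), columns=["x1", "x2", "x3"])
df["y"] = 2 * df.x1 + rng.normal(size=40)   # x2 and x3 contribute nothing

partial = smf.ols("y ~ x1", data=df).fit()
full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

# F = [(SSE_partial - SSE_full) / m] / MSE_full, with m = 2 dropped terms here
print(anova_lm(partial, full))
```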

Exercise 11.3: Type I and III Sums of Squares

Type III Sums of Squares

Does adding a variable to the model last lead to a significant decrease in the error sum of squares? Adding weight to the model, once brain size and height are in the model, does not help predict IQ. This is part of JMP's standard output for the Fit Model platform.

Type I or Sequential Sums of Squares

The decrease in the error sum of squares when variables are added one at a time; here order matters. Conclusion: Weight does not provide additional information about IQ once brain size and height are in the model.

Sequential SS Add Up to the Model SS (SSR)

The sequential sums of squares add up to the Model SS: SSR = 5572.744. The Type III SS do not add up to the Model SS (SSR).
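The same distinction can be reproduced outside JMP; a sketch with statsmodels on simulated stand-ins for brain size, height, and weight:

```python
# Type I (sequential) versus Type III sums of squares.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(4)
n = 38
df = pd.DataFrame({"brain": rng.normal(size=n), "height": rng.normal(size=n)})
df["weight"] = 0.8 * df.height + rng.normal(0, 0.5, n)  # weight tracks height
df["iq"] = 5 * df.brain - 3 * df.height + rng.normal(size=n)

fit = smf.ols("iq ~ brain + height + weight", data=df).fit()
print(anova_lm(fit, typ=1))  # sequential SS: order matters; adds up to SSR
print(anova_lm(fit, typ=3))  # each term added last; need not add up to SSR
```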

Type I or Sequential Sums of Squares

Looks at how much of SST is moved from the SSE into the SSR as you add additional regressors. SST remains constant no matter how many regressors are in the model. Recall that SST = SSR + SSE.

Variable Selection Methods

In practice, a common problem is that there is a large set of candidate predictor variables. One wants to choose a small subset so that the resulting regression model is simple and has good predictive ability:

- Stepwise regression (a sketch follows below)
- Best subsets regression
- Optimality criteria

Take Stat 572: Applied Linear Models to learn more.

Reading assignment: Section 11.8, A Strategy for Building a Multiple Regression Model. This strategy will help you with your project.
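A minimal forward-stepwise sketch in Python. This is a generic greedy search that adds the predictor giving the largest AIC improvement, not JMP's exact stepwise algorithm; variable names and data are invented:

```python
# Forward stepwise selection by AIC (a simplified sketch).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 60
df = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
df["y"] = 3 * df.x1 - 2 * df.x3 + rng.normal(size=n)

remaining, chosen = ["x1", "x2", "x3", "x4"], []
best_aic = smf.ols("y ~ 1", data=df).fit().aic
while remaining:
    trials = [(smf.ols("y ~ " + " + ".join(chosen + [v]), data=df).fit().aic, v)
              for v in remaining]
    aic, var = min(trials)
    if aic >= best_aic:
        break                      # no candidate improves the criterion
    best_aic = aic
    chosen.append(var)
    remaining.remove(var)

print(chosen)  # typically ['x1', 'x3'] for this simulated data
```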

Stepwise Regression in JMP

Stepwise regression drops Weight from the model. (The output shows the sequential sums of squares.)

Reduced Model Compared with Full Model

Full model: brain size, height, and weight in the model. Reduced model: only brain size and height in the model.

Adjusted R² improves when Weight is dropped from the model, since it includes a penalty for the number of variables in the model. R² always increases with the number of variables added to the model, but in this case the increase is negligible. This confirms that weight does not add much additional information to the model once brain size and height are in the model.

PRESS Statistic

PRESS in JMP

Cross-Validation

- Divide the data into training and test sets using a random mechanism.
- Fit the models with the training set data.
- Predict the test set data.
- Choose the model that predicts best.
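Since PRESS is the sum of squared leave-one-out prediction errors, it can be computed from a single fit via the leverages, $PRESS = \sum_i (e_i / (1 - h_{ii}))^2$; a numpy sketch on made-up data:

```python
# PRESS from one fit: no need to refit the model n times.
import numpy as np

rng = np.random.default_rng(6)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
h = np.diag(H)                         # leverages h_ii
e = y - H @ y                          # ordinary residuals
press = np.sum((e / (1 - h)) ** 2)
print(f"PRESS = {press:.3f}  (compare to SSE = {np.sum(e**2):.3f})")
```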

Analysis Methods that Are Similar to Those in Simple Linear Regression

- Confidence intervals for the average response of Y at a particular value of the regressors
- Prediction intervals for a future value of Y at a particular value of the regressors
- Plots of residuals: against individual predictor variables (included in the model or not), against the predicted values, on a normal plot, and in time order if the data are time ordered (Durbin-Watson test)
- Standardized residuals (|value| > 2 indicates an outlier; see the sketch below)
- Influential observations if $h_{ii} > \frac{2(k+1)}{n}$

Transformations

Transformations of the predictor variables, as well as of the response, can make the model linear. A model that can be transformed so that it becomes linear in its unknown parameters is called intrinsically linear; otherwise it is called a nonlinear model. Unless a common transformation of y linearizes its relationship with all the x's, effort should be focused on transforming the x's, leaving y untransformed, at least initially. It is usually assumed that the random error enters additively into the final form of the model.
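Returning to the diagnostics list above, a sketch on simulated data using statsmodels' influence measures, with the cutoffs quoted on the slide:

```python
# Standardized residuals and leverage flags for outlier/influence screening.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, k = 40, 2
X = sm.add_constant(rng.normal(size=(n, k)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

infl = sm.OLS(y, X).fit().get_influence()
std_resid = infl.resid_studentized_internal   # standardized residuals
leverage = infl.hat_matrix_diag               # leverages h_ii

print(np.where(np.abs(std_resid) > 2)[0])        # candidate outliers
print(np.where(leverage > 2 * (k + 1) / n)[0])   # high-leverage points
```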

Illustration of Multicollinearity: Cement Data

Note that x1, x2, x3, and x4 add up to approximately 100% for all observations.

Correlation Matrix: Cement Data

This approximate linear relationship among the x's results in multicollinearity. Multicollinearity means that the predictor variables are (approximately) linearly related.

Correlations
        X1        X2        X3        X4
X1    1.0000    0.2286   -0.8241   -0.2454
X2    0.2286    1.0000   -0.1392   -0.9730
X3   -0.8241   -0.1392    1.0000    0.0295
X4   -0.2454   -0.9730    0.0295    1.0000

Scatterplot Matrix

[Scatter Plot Matrix of Cement Data: pairwise scatterplots of X1, X2, X3, and X4; axis ticks omitted.]

Multicollinearity

Multicollinearity

Multicollinearity can cause serious numerical and statistical difficulties in fitting the regression model unless extra predictor variables are deleted. Example: if income, expenditure, and savings are used as predictor variables, one should be deleted, since savings = income - expenditure.

Multicollinearity leads to these problems:
- The estimates $\hat\beta_j$ are subject to numerical errors and are unreliable. This is reflected in large changes in their magnitudes with small changes in the data.
- Most of the estimated coefficients have large standard errors and as a result are statistically nonsignificant, even if the overall F-statistic is significant.

Regression Coefficients in the Presence of Multicollinearity: Cement Data

Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   62.405369    70.07096     0.89     0.3991
X1           1.5511026    0.74477     2.08     0.0708
X2           0.5101676    0.723788    0.70     0.5009
X3           0.1019094    0.754709    0.14     0.8959
X4          -0.144061     0.709052   -0.20     0.8441

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio
Model       4      2667.8994        666.975    111.4792
Error       8        47.8636          5.983    Prob > F: <.0001
C. Total   12      2715.7631

Notice that all the regression coefficients are nonsignificant, although the overall F = 111.48 is highly significant.

Measures of Multicollinearity: Correlations and Variance Inflation Factors (VIF)

The simplest measure is the correlation matrix between pairs of predictor variables. A more direct measure is the VIF: $VIF_j$ is the jth diagonal element of the inverse of the correlation matrix. We can show that

$VIF_j = \frac{1}{1 - r_j^2}, \quad j = 1, 2, \ldots, k$

where $r_j^2$ is the coefficient of multiple determination when regressing $x_j$ on the remaining k-1 predictor variables. Generally, $VIF_j$ values greater than 10, corresponding to $r_j^2 > .9$, are regarded as unacceptable.

VIFs for Cement Data

Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|   VIF
Intercept   62.405369    70.07096     0.89     0.3991     .
X1           1.5511026    0.74477     2.08     0.0708      38.496211
X2           0.5101676    0.723788    0.70     0.5009     254.42317
X3           0.1019094    0.754709    0.14     0.8959      46.868386
X4          -0.144061     0.709052   -0.20     0.8441     282.51286

Notice that all the VIFs are greater than 10, a strong indication of multicollinearity.
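statsmodels exposes this computation directly; a sketch on simulated near-collinear data (the cement data values themselves are not reproduced here):

```python
# VIF_j = 1 / (1 - r_j^2), computed with statsmodels' helper.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(8)
n = 13
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = -0.9 * x1 + rng.normal(0, 0.2, n)   # x3 nearly collinear with x1
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# Column 0 is the intercept; report VIFs for the predictors only.
for j in range(1, X.shape[1]):
    print(f"VIF_{j} = {variance_inflation_factor(X, j):.2f}")
```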

VIF in JMP: How?

Right-click on the Parameter Estimates area and select VIF from the Columns submenu that pops up.

Polynomial Regression Remarks

Problems:
- Powers of x are highly correlated: collinearity.
- Powers of x can have large differences in order of magnitude.

Solutions:
- Center the x around the mean: $x - \bar{x}$
- Standardizing the x is even better: $\frac{x - \bar{x}}{s_x}$
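A small numeric demonstration of why centering helps, using made-up mileage-like values:

```python
# Correlation between x and x^2 drops sharply after centering.
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(0, 32, 100)   # a mileage-like predictor
xc = x - x.mean()             # centered predictor

print(np.corrcoef(x, x ** 2)[0, 1])     # close to 1: severe collinearity
print(np.corrcoef(xc, xc ** 2)[0, 1])   # much smaller after centering
```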

Dummy Predictor Variables

Many applications involve categorical predictor variables, e.g., gender, season, prognosis (poor, average, good).

For ordinal variables one may use scores, e.g., 1, 2, and 3 for the prognosis categories. For nominal variables with c categories, use c - 1 indicator variables $x_1, x_2, \ldots, x_{c-1}$, called dummy variables. Example:

$x_1 = 1$ if Winter, 0 if another season
$x_2 = 1$ if Spring, 0 if another season
$x_3 = 1$ if Summer, 0 if another season

Model: $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$

Fall is taken as the reference point: its mean response is $\beta_0$.

Why c - 1 Dummy Variables?

If we also defined $x_4 = 1$ if Fall, 0 if another season, then $x_1 + x_2 + x_3 + x_4 = 1$: multicollinearity.

JMP's Dummy Variables

$z_1 = 1$ if Winter, -1 if Fall, 0 otherwise
$z_2 = 1$ if Spring, -1 if Fall, 0 otherwise
$z_3 = 1$ if Summer, -1 if Fall, 0 otherwise

$Y = \beta_0 + \beta_1 z_1 + \beta_2 z_2 + \beta_3 z_3 + \varepsilon$

The four-season average is taken as the reference point, $\beta_0$:

$Y_{Fall} = \beta_0 + \beta_1(-1) + \beta_2(-1) + \beta_3(-1) + \varepsilon$
$Y_{Winter} = \beta_0 + \beta_1(1) + \varepsilon$
$Y_{Spring} = \beta_0 + \beta_2(1) + \varepsilon$
$Y_{Summer} = \beta_0 + \beta_3(1) + \varepsilon$

Summing: $Y_{Fall} + Y_{Winter} + Y_{Spring} + Y_{Summer} = 4\beta_0 + \sum\varepsilon$, so $\bar{Y} = \beta_0 + \bar{\varepsilon}$.

Quarterly Sales of Soda Cans (in Millions of Cans)

The x's are the textbook (SAS) dummy variables. The z's are JMP's dummy variables.
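Both coding schemes can be requested through patsy contrasts in the statsmodels formula interface; a sketch on simulated quarterly sales (variable names invented for illustration). Note that patsy's Sum contrast assigns -1 to the last category level (alphabetically Winter here) rather than to Fall as on the slide; the fitted model is equivalent either way.

```python
# Reference (0/1) coding versus effect (-1/0/1) coding for a seasonal factor.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
seasons = ["Winter", "Spring", "Summer", "Fall"] * 4
df = pd.DataFrame({"season": pd.Categorical(seasons),
                   "quarter": np.arange(1, 17)})
df["sales"] = 2 * df.quarter + df.season.cat.codes * 3 + rng.normal(0, 1, 16)

# 0/1 dummies with Fall as the reference category (textbook/SAS style)
ref = smf.ols("sales ~ quarter + C(season, Treatment('Fall'))", data=df).fit()
# Effect coding: the intercept is the average over the four seasons (JMP style)
eff = smf.ols("sales ~ quarter + C(season, Sum)", data=df).fit()

print(ref.params)
print(eff.params)
```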

Runs Chart of Soda Can Sales (new platform)

With the seasonal dummies $x_1$ (Winter), $x_2$ (Spring), and $x_3$ (Summer) defined as above, try the model

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_Q\,\text{QUARTER} + \varepsilon$

a linear time trend plus a cyclical seasonal trend.

JMP Analysis: Textbook (SAS) Dummy Variables

After adjusting for the linear time trend, the Spring and Summer sales differ significantly from the Fall sales, but those of Winter do not. Notice that these variables are declared continuous in JMP.

JMP Analysis: JMP Dummy Variables

After adjusting for the linear time trend, Winter, Spring, and Summer (and Fall) each differ significantly from the average of the four quarters.

JMP Analysis: The Easy Way

Declare the season variable Nominal. (Shown: how to make Fall the -1 z variable.)

How to Tell the Season Effect Is Important*

Effect Tests
Source    Nparm   DF   Sum of Squares   F Ratio    Prob > F
Quarter     1      1      1602.0500     182.0605   <.0001
Season      3      3       998.9424      37.8407   <.0001

There is statistical evidence of a seasonal effect.

*Part of the output of the command in the previous visual.

Confidence and Prediction Intervals

Notice that one should not use the mean confidence interval here: for forecasting an individual future value, the prediction interval is the appropriate one. The predicted sales for the seventeenth quarter (next Winter) are in the last line.

Time Series Reference

Forecasting Principles and Applications by Stephen A. DeLurgio. Irwin McGraw-Hill.

One Dummy and One Continuous Predictor Variable (No Interaction Term in the Model)

$E(\text{Weight}) = \beta_0 + \beta_1\,\text{SEX} + \beta_2\,\text{HEIGHT}$

where SEX = 0 for females and 1 for males. Equivalent formulation:

$E(\text{Weight}) = \beta_0 + \beta_2\,\text{HEIGHT}$ for females
$E(\text{Weight}) = (\beta_0 + \beta_1) + \beta_2\,\text{HEIGHT}$ for males

Parallel lines: the lines have the same slope. Height is a confounder for the gender effect.

Interaction

$E(\text{Weight}) = \beta_0 + \beta_1\,\text{SEX} + \beta_2\,\text{HEIGHT} + \beta_3\,\text{SEX} \times \text{HEIGHT}$

where SEX = 0 for females and 1 for males, and $\beta_3$ is the interaction term. Equivalent formulation:

$E(\text{Weight}) = \beta_0 + \beta_2\,\text{HEIGHT}$ for females
$E(\text{Weight}) = (\beta_0 + \beta_1) + (\beta_2 + \beta_3)\,\text{HEIGHT}$ for males

The lines have different slopes: height is an effect modifier for the gender effect. Interactions are also important to consider among continuous and among categorical variables. This is a key step in model building.

Considering Interactions
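A sketch of fitting the interaction model with the statsmodels formula interface, on simulated weight/height data (coefficients invented for illustration):

```python
# "sex * height" expands to sex + height + sex:height, so the two groups
# get different slopes whenever beta_3 (the sex:height term) is nonzero.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 100
df = pd.DataFrame({"sex": rng.integers(0, 2, n),   # 0 = female, 1 = male
                   "height": rng.normal(66, 3, n)})
df["weight"] = (-200 + 5 * df.height + 10 * df.sex
                + 0.8 * df.sex * df.height + rng.normal(0, 5, n))

fit = smf.ols("weight ~ sex * height", data=df).fit()
print(fit.params)   # sex:height estimates beta_3, the slope difference
```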


The interaction of Brain Size and Height is not significant.

Fallacy of Doing Simple Linear Regression When Multiple Linear Regression Is Called For

- Regressing the health index versus wealth gives a negative slope for the wealth effect (the higher the wealth, the lower the health). This is what one does when one fits a simple linear regression model.
- Regressing the health index versus wealth while adjusting for age gives a positive slope for wealth (the higher the wealth, the higher the health). This is what one does when one fits a multiple linear regression model of the health index versus wealth and age.

Logistic Regression and Logit Models

Logistic regression is used when the response variable is binary (or more generally categorical), e.g., a patient survives or dies. Model:

$\ln \frac{P(Y = 1 \mid x_1, x_2, \ldots, x_k)}{P(Y = 0 \mid x_1, x_2, \ldots, x_k)} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$

Equivalently,

$P(Y = 1 \mid x_1, x_2, \ldots, x_k) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}}$

Age and Coronary Heart Disease: Status of 100 Subjects

Age CHD | Age CHD | Age CHD | Age CHD | Age CHD
 20  0  |  34  0  |  41  0  |  48  1  |  57  0
 23  0  |  34  0  |  42  0  |  48  1  |  57  1
 24  0  |  34  1  |  42  0  |  49  0  |  57  1
 25  0  |  34  0  |  42  0  |  49  0  |  57  1
 25  1  |  34  0  |  42  1  |  49  1  |  57  1
 26  0  |  35  0  |  43  0  |  50  0  |  58  0
 26  0  |  35  0  |  43  0  |  50  1  |  58  1
 28  0  |  36  0  |  43  1  |  51  0  |  58  1
 28  0  |  36  1  |  44  0  |  52  0  |  59  1
 29  0  |  36  0  |  44  0  |  52  1  |  59  1
 30  0  |  37  0  |  44  1  |  53  1  |  60  0
 30  0  |  37  1  |  44  1  |  53  1  |  60  1
 30  0  |  37  0  |  45  0  |  54  1  |  61  1
 30  0  |  38  0  |  45  1  |  55  0  |  62  1
 30  0  |  38  0  |  46  0  |  55  1  |  62  1
 30  1  |  39  0  |  46  1  |  55  1  |  63  1
 32  0  |  39  1  |  47  0  |  56  1  |  64  0
 32  0  |  40  0  |  47  0  |  56  1  |  64  1
 33  0  |  40  1  |  47  1  |  56  1  |  65  1
 33  0  |  41  0  |  48  0  |  57  0  |  69  1
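A sketch of fitting this model with statsmodels, here on simulated age/CHD-style data rather than the table above (which could be keyed in instead). Note that the JMP output below parameterizes the log odds of 0 versus 1, so its signs are reversed relative to the usual P(Y = 1) parameterization used here:

```python
# Logistic regression of a binary response on age.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(12)
age = rng.uniform(20, 70, 100)
p = 1 / (1 + np.exp(-(-5.3 + 0.11 * age)))   # true model on the logit scale
chd = rng.binomial(1, p)

fit = sm.Logit(chd, sm.add_constant(age)).fit(disp=False)
print(fit.params)      # roughly (-5.3, 0.11) for this simulated sample
print(fit.conf_int())  # Wald confidence intervals for beta_0, beta_1
```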

Coronary Heart Disease by Age

Logistic Regression in JMP

Notice that CHD is declared nominal.

JMP Output

[Plot of CHD (0/1) versus AGE with the fitted logistic curve.] The probability of coronary heart disease increases with age. Ask JMP with the question-mark tool for the complete interpretation of this plot.

JMP Output

Whole Model Test
Model        -LogLikelihood   DF   ChiSquare   Prob>ChiSq
Difference     14.654945       1    29.30989    <.0001
Full           53.676546
Reduced        68.331491

RSquare (U): 0.2145. Observations (or Sum Wgts): 100. Converged by Gradient.

(The Difference, Full, and Reduced -log-likelihoods play roles analogous to SSR, SSE, and SST.) This is the counterpart to the Analysis of Variance table in multiple linear regression, with a similar interpretation.

Parameter Estimates (for log odds of 0/1)
Term        Estimate      Std Error   ChiSquare   Prob>ChiSq
Intercept    5.30945302   1.1336546    21.94       <.0001
AGE         -0.1109211    0.0240598    21.25       <.0001

Logistic Regression Using JMP's Fit Model Platform: How to Do Multiple Logistic Regression

JMP Output

Whole Model Test
Model        -LogLikelihood   DF   ChiSquare   Prob>ChiSq
Difference     14.654945       1    29.30989    <.0001
Full           53.676546
Reduced        68.331491

RSquare (U): 0.2145. Observations (or Sum Wgts): 100. Converged by Gradient.

Lack Of Fit
Source        DF   -LogLikelihood   ChiSquare   Prob>ChiSq
Lack Of Fit   41     11.877163       23.75433    0.9857
Saturated     42     41.799383
Fitted         1     53.676546

This test is also available in multiple linear regression when there are repeat observations at certain values of the regressors. [Illustrative plot of Y versus X with repeat observations at certain X values.]

Parameter Estimates (for log odds of 0/1)
Term        Estimate      Std Error   ChiSquare   Prob>ChiSq   Lower 95%    Upper 95%
Intercept    5.30945302   1.1336546    21.94       <.0001       3.24599619   7.72551624
AGE         -0.1109211    0.0240598    21.25       <.0001      -0.1619996   -0.066929

Effect Wald Tests
Source   Nparm   DF   Wald ChiSquare   Prob>ChiSq
AGE        1      1     21.2541294      0.0000

Effect Likelihood Ratio Tests
Source   Nparm   DF   L-R ChiSquare   Prob>ChiSq
AGE        1      1     29.30989       0.0000

These effect tests are interpreted like the Type III sums of squares in multiple linear regression.

Do You Need to Know More?

572 Applied Regression Analysis (3). Simple linear regression. Matrix approach to multiple linear regression. Partial and sequential sums of squares, interaction and confounding, use of dummy variables, model selection. Leverage, influence and collinearity. Autocorrelated errors. Generalized linear models, maximum likelihood estimation, logistic regression, analysis of deviance. Nonlinear models, inference, ill-conditioning. Robust regression, M-estimators, iteratively reweighted least squares. Nonparametric regression, kernel, splines, testing lack of fit. Prereq: 571 and matrix algebra. F, Sp.

Recommended References:

- Introduction to Linear Regression Analysis, Third Edition, by Douglas C. Montgomery, Elizabeth A. Peck, and G. Geoffrey Vining. A very well-written introduction to regression, including some modern methods.
- Applied Logistic Regression, Second Edition, by David W. Hosmer and Stanley Lemeshow. Wiley-Interscience. A great introduction to logistic regression for those already familiar with the elements of multiple regression analysis.