Chapter 11: Linear Regression and Correlation

Regression analysis is a statistical tool that utilizes the relation between two or more quantitative variables so that one variable can be predicted from the other, or others.

Some examples:
- Height and weight of people
- Income and expenses of people
- Production size and production time
- Soil pH and the rate of growth of plants

Correlation

An easy way to determine if two quantitative variables are linearly related is by looking at their scatterplot. Another way is to calculate the correlation coefficient, denoted usually by r.

The linear correlation measures the strength of the linear relationship between the explanatory variable (x) and the response variable (y). An estimate of this correlation parameter is provided by the Pearson sample correlation coefficient, r.

Note: -1 ≤ r ≤ 1.
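For reference, the standard formula for the Pearson sample correlation coefficient (stated here in LaTeX; it is not printed on the slide itself) is

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}

Values near ±1 indicate a strong linear relationship; values near 0 indicate a weak one.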
Example Scatterplots with Correlations

[Figure: example scatterplots illustrating different values of r.]

If X and Y are independent, then their correlation is 0.

Correlation

If the correlation between X and Y is 0, it doesn't mean they are independent. It only means that they are not linearly related.

One complaint about the correlation is that it can be subjective when interpreting its value. Some people are very happy with r = 0.6, while others are not.

Note: Correlation does not necessarily imply causation.

Some guidelines in interpreting r:

Value of r            Strength of linear relationship
r ≥ 0.95              Very strong
0.85 ≤ r < 0.95       Strong
0.65 ≤ r < 0.85       Moderate to strong
0.45 ≤ r < 0.65       Moderate
0.25 ≤ r < 0.45       Weak
r < 0.25              Very weak/close to none
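A minimal R sketch (not from the original slides) illustrating the point above: Y = X^2 is completely determined by X, yet their linear correlation is near zero because the relationship is not linear.

set.seed(1)                        # for reproducibility
x = runif(1000, min = -1, max = 1)
y = x^2                            # perfectly dependent on x, but not linearly
cor(x, y)                          # close to 0
plot(x, y, pch = 19, main = "Dependent, yet r is about 0")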
Computing Correlation in R

data.health=read.csv("healthexam.csv",header=TRUE)
head(data.health)
  Gender Age Height Weight Waist Pulse SysBP DiasBP Cholesterol BodyMass  Leg Elbow Wrist  Arm
1      F  12   63.3  156.3  81.4    64   104     41          89     27.5 41.0   6.8   5.5 33.0
2      F  16   57.0  100.7  68.7    64   106     64           2     21.9 33.8   5.6   4.6 26.4
3      M  17   63.0  156.3  86.7    96   109     65          78     27.8 44.2   7.1   5.3 31.7

attach(data.health)
plot(Height,Weight,pch=19,main="Scatterplot")
cor(Height,Weight)   # 0.544563
cor(Waist,Weight)    # 0.9083268
plot(Waist,Weight,pch=19,main="Scatterplot")

Simple Linear Regression

Model: Y_i = (β_0 + β_1 x_i) + ε_i, where ε_i is the random error, and
- Y_i is the i-th value of the response variable.
- x_i is the i-th value of the explanatory variable.
- The ε_i's are uncorrelated with a mean of 0 and constant variance σ².

How do we determine the underlying linear relationship? Well, since the points are following this linear trend, why don't we look for a line that best fits the points. But what do we mean by best fit? We need a criterion to help us determine which of 2 competing candidate lines is better.

[Figure: scatterplot with candidate lines L_1 through L_4 and the true line Y = β_0 + β_1 x; the observed point at x_1 deviates from the expected point on the line by the error ε_1.]
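Formally, the "best fit" criterion used below is the sum of squared errors: a candidate line with intercept b_0 and slope b_1 is scored by

SSE(b_0, b_1) = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2

and the least-squares line is the one that minimizes this quantity.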
Method of Least Squares

Model: Y_i = (β_0 + β_1 x_i) + ε_i, where
- Y_i is the i-th value of the response variable.
- x_i is the i-th value of the explanatory variable.
- The ε_i's are uncorrelated with a mean of 0 and constant variance σ².

Residual = (Observed y-value) − (Predicted y-value), i.e., e_i = y_i − ŷ_i.

[Figure: a candidate line, e.g. ŷ = 2 + 0.8x, with observed points P_1(x_1, y_1), P_2(x_2, y_2) and their residuals e_1, e_2 drawn as vertical deviations from the line.]

Method of Least Squares: Choose the line that minimizes the SSE as the best line. This line is known as the Least-Squares Regression Line.

Question: But there are infinitely many possible candidate lines; how can we find the one that minimizes the SSE?
Answer: Since SSE is a continuous function of 2 variables, we can use methods from calculus to minimize the SSE.

Obtaining the Regression Line in R

data.health=read.csv("healthexam.csv",header=TRUE)
head(data.health)
  Gender Age Height Weight Waist Pulse SysBP DiasBP Cholesterol BodyMass  Leg Elbow Wrist  Arm
1      F  12   63.3  156.3  81.4    64   104     41          89     27.5 41.0   6.8   5.5 33.0
2      F  16   57.0  100.7  68.7    64   106     64           2     21.9 33.8   5.6   4.6 26.4
3      M  17   63.0  156.3  86.7    96   109     65          78     27.8 44.2   7.1   5.3 31.7

attach(data.health)
plot(Waist,Weight,pch=19,main="Scatterplot")
result=lm(Weight~Waist)
coef(result)
(Intercept)       Waist
  -51.72790     2.39469

As waist increases by 1 cm, weight goes up by about 2.4 pounds.

abline(a=-51.7279,b=2.39469,lwd=2,col="blue")

So, for the first person, her predicted weight is
Predicted.1=-51.728+2.395*81.4   # 143.225 pounds
Since her actual weight is 156.3 pounds,
Residual.1=156.3-143.2           # 13.1 pounds
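A minimal sketch (not from the original slides) making the least-squares criterion concrete: the SSE of the fitted line is compared with the SSE of a slightly different candidate line. Any other line should give a larger SSE; this assumes the Weight and Waist variables attached above.

sse.ls  = sum((Weight - (-51.728 + 2.395*Waist))^2)   # the least-squares line
sse.alt = sum((Weight - (-51.728 + 2.500*Waist))^2)   # a competing candidate
sse.ls < sse.alt   # TRUE: the least-squares line has the smaller SSE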
What else do we get from the lm function?

data.health=read.csv("healthexam.csv",header=TRUE)
attach(data.health)
result=lm(Weight~Waist)
attributes(result)
$names
"coefficients" "residuals" "effects" "rank" "fitted.values" "assign"
"qr" "df.residual" "xlevels" "call" "terms" "model"

result$fit[1]   # 143.1999
result$res[1]   # 13.10011

summary(result)
lm(formula = Weight ~ Waist)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.7279    11.1288   -4.648 1.34e-05
Waist         2.3947     0.1249   19.180  < 2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.68 on 78 degrees of freedom
Multiple R-squared: 0.8251, Adjusted R-squared: 0.8228
F-statistic: 367.9 on 1 and 78 DF, p-value: < 2.2e-16

Coefficient of Determination (R²): This index measures the amount of variability in the dependent variable (y) that can be explained by the regression line. Hence, about 82.51% of the variability of weight can be explained by the regression line involving the waist size.

Testing H_0: β_1 = 0 vs. H_1: β_1 ≠ 0. Since the p-value is extremely small (<0.05), we can reject the null hypothesis and conclude that waist has a significant effect on weight.

Model Assumptions

Model: Y_i = (β_0 + β_1 x_i) + ε_i, where
- The ε_i's are uncorrelated with a mean of 0 and constant variance σ²_ε.
- The ε_i's are normally distributed. (This is needed in the test for the slope.)

[Figure: the true line Y = β_0 + β_1 x and the fitted line; at x_1, the observed point differs from the expected point by the error ε_1 and from the predicted point by the residual e_1.]

Since the underlying (green) line is unknown to us, we can't calculate the values of the error terms (ε_i). The best that we can do is study the residuals (e_i).
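A minimal sketch (not from the original slides): the quantities quoted above can also be pulled out of the summary object programmatically, assuming result=lm(Weight~Waist) as before.

s = summary(result)
s$r.squared                # 0.8251, the coefficient of determination
s$sigma                    # 14.68, the residual standard error
s$coefficients["Waist", ]  # estimate, SE, t value, and p-value for the slope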
Estimating the Variance of the Error Terms

The unbiased estimator for σ²_ε is the mean squared error, MSE = SSE/(n−2).

sse=sum(result$residuals^2)   # 16811.16
mse=sse/(80-2)                # 215.5277
sigma.hat=sqrt(mse)           # 14.68086

anova(result)
Response: Weight
          Df Sum Sq Mean Sq F value    Pr(>F)
Waist      1  79284   79284  367.86 < 2.2e-16
Residuals 78  16811     216
Total     79  96095

summary(result)
lm(formula = Weight ~ Waist)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.7279    11.1288   -4.648 1.34e-05
Waist         2.3947     0.1249   19.180  < 2e-16

Residual standard error: 14.68 on 78 degrees of freedom
Multiple R-squared: 0.8251, Adjusted R-squared: 0.8228
F-statistic: 367.9 on 1 and 78 DF, p-value: < 2.2e-16

[Figure: decomposition of the deviation of an observed point P_i(x_i, y_i) from ȳ into a regression part and a residual part relative to the line Y = β_0 + β_1 x.]

SSTO = SSE + SSR, and R² = SSR/SSTO.

Since the p-value is less than 0.05, we conclude that the regression model accounts for a significant amount of the variability in weight.

Things that affect the slope estimate

- Watch the regression podcast by Dr. Will posted on our course webpage.

Three things that affect the slope estimate:
1. Sample size (n). As n increases, the standard error of the slope estimate decreases.
2. Variability of the error terms (σ²_ε). The smaller σ_ε is, the smaller the standard error of the slope estimate.
3. Spread of the independent variable.

summary(result)
lm(formula = Weight ~ Waist)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.7279    11.1288   -4.648 1.34e-05
Waist         2.3947     0.1249   19.180  < 2e-16

Computing the slope estimate and its standard error by hand:

SS=function(x,y){sum((x-mean(x))*(y-mean(y)))}
SSxy=SS(Waist,Weight)         # 33108.35
SSxx=SS(Waist,Waist)          # 13825.73
SSyy=SS(Weight,Weight)        # 96095.4 = SSTO
Beta1.hat=SSxy/SSxx           # 2.39469
MSE=anova(result)$Mean[2]     # 215.5277
SE.beta1=sqrt(MSE/SSxx)       # 0.1248554

Testing H_0: β_1 = 0 vs. H_1: β_1 ≠ 0.
t.obs=(Beta1.hat-0)/SE.beta1  # 19.17971
p.value=2*(1-pt(19.18,df=78)) # virtually 0
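A minimal simulation sketch (not from the original slides) illustrating the three factors listed above; the function name slope.se and all parameter values are hypothetical choices for illustration.

set.seed(42)
slope.se = function(n, sigma, x.spread) {
  x = runif(n, 0, x.spread)                # spread of the independent variable
  y = -51.7 + 2.39*x + rnorm(n, 0, sigma)  # true line plus random error
  summary(lm(y ~ x))$coefficients["x", "Std. Error"]
}
slope.se(n=80,  sigma=15, x.spread=50)     # baseline
slope.se(n=800, sigma=15, x.spread=50)     # larger n -> smaller SE
slope.se(n=80,  sigma=5,  x.spread=50)     # smaller sigma -> smaller SE
slope.se(n=80,  sigma=15, x.spread=200)    # wider x spread -> smaller SE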
Effect of Outliers on the Slope Estimate

Three types of outliers:
1. Outlier in the x direction. This type of an outlier is said to be a high leverage point.
2. Outlier in the y direction.
3. Outlier in both x and y directions. This point is said to be a high influence point.

[Figures: the effect of a high influence point, and the effect of a point with an outlying y value, on the fitted line.]

Confidence Intervals

The (1−α)100% C.I. for β_1 is β̂_1 ± t(1−α/2, n−2) · SE(β̂_1).

Hence, the 90% C.I. for β_1 for our example is
Lower=Beta1.hat - qt(0.95,df=78)*SE.beta1   # 2.186853
Upper=Beta1.hat + qt(0.95,df=78)*SE.beta1   # 2.602528

confint(result,level=.90)
                   5 %       95 %
(Intercept) -70.253184 -33.202619
Waist         2.186853   2.602528

Estimating the mean response (µ_y) at a specified value of x:
predict(result,newdata=data.frame(Waist=c(80,90)))
       1        2
139.8473 163.7942

Confidence interval for the mean response (µ_y) at a specified value of x:
predict(result,newdata=data.frame(Waist=c(80,90)),interval="confidence")
       fit      lwr      upr
1 139.8473 136.0014 143.6932
2 163.7942 160.4946 167.0938
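A minimal sketch (not from the original slides): reproducing the confidence interval at Waist = 80 by hand, using the standard error of the estimated mean response, sqrt(MSE*(1/n + (x0 − x̄)²/SSxx)). This assumes the MSE and SSxx objects computed earlier and the attached Waist variable.

x0 = 80
y0.hat = -51.7279 + 2.39469*x0
se.mean = sqrt(MSE*(1/80 + (x0 - mean(Waist))^2/SSxx))
y0.hat - qt(0.975, df=78)*se.mean   # about 136.0, matching lwr above
y0.hat + qt(0.975, df=78)*se.mean   # about 143.7, matching upr above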
Prediction Intervals

Predicting the value of the response variable at a specified value of x:
predict(result,newdata=data.frame(Waist=c(80,90)))
       1        2
139.8473 163.7942

Prediction interval for a new response value (y_{n+1}) at a specified value of x:
predict(result,newdata=data.frame(Waist=c(80,90)),interval="prediction")
       fit      lwr      upr
1 139.8473 110.3680 169.3266
2 163.7942 134.3812 193.2072

predict(result,newdata=data.frame(Waist=c(80,90)),interval="prediction",level=.99)
       fit      lwr      upr
1 139.8473 100.7507 178.9439
2 163.7942 124.7855 202.8029

Note that the only difference between the prediction interval and the confidence interval for the mean response is the addition of 1 inside the square root. This makes the prediction intervals wider than the confidence intervals for the mean response.

Confidence and Prediction Bands

The Working-Hotelling (1−α)100% confidence band is ŷ ± W·SE(ŷ), where W² = 2F(1−α; 2, n−2).

result=lm(Weight~Waist)
CI=predict(result,se.fit=TRUE)   # se.fit = SE of the estimated mean
W=sqrt(2*qf(0.95,2,78))          # 2.495513
band.lower=CI$fit - W*CI$se.fit
band.upper=CI$fit + W*CI$se.fit
plot(Waist,Weight,xlab="Waist",ylab="Weight",main="Confidence Band")
abline(result)
points(sort(Waist),sort(band.lower),type="l",lwd=2,lty=2,col="blue")
points(sort(Waist),sort(band.upper),type="l",lwd=2,lty=2,col="blue")

The (1−α)100% prediction band:
mse=anova(result)$Mean[2]
se.pred=sqrt(CI$se.fit^2+mse)
band.lower.pred=CI$fit - W*se.pred
band.upper.pred=CI$fit + W*se.pred
points(sort(Waist),sort(band.lower.pred),type="l",lwd=2,lty=2,col="red")
points(sort(Waist),sort(band.upper.pred),type="l",lwd=2,lty=2,col="red")
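A minimal sketch (not from the original slides) making the "addition of 1" remark concrete: the 95% prediction interval at Waist = 80 computed by hand, with the extra "1 +" inside the square root compared with the mean-response computation shown earlier. This again assumes the MSE and SSxx objects and the attached Waist variable.

x0 = 80
y0.hat = -51.7279 + 2.39469*x0
se.new = sqrt(MSE*(1 + 1/80 + (x0 - mean(Waist))^2/SSxx))  # note the "1 +"
y0.hat - qt(0.975, df=78)*se.new   # about 110.4, matching lwr above
y0.hat + qt(0.975, df=78)*se.new   # about 169.3, matching upr above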
Tests for Correlations

Testing H_0: ρ = 0 vs. H_1: ρ ≠ 0.

cor(Waist,Weight)   # Computes the Pearson correlation coefficient, r
0.9083268

cor.test(Waist,Weight, conf.level=.99)   # Tests Ho: rho=0 and also constructs a C.I. for rho
Pearson's product-moment correlation
data: Waist and Weight
t = 19.1797, df = 78, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
99 percent confidence interval:
0.8409277 0.9479759

Note that the results are exactly the same as what we got when testing H_0: β_1 = 0 vs. H_1: β_1 ≠ 0.

Testing H_0: ρ = 0 vs. H_1: ρ ≠ 0 using the (nonparametric) Spearman's method.

cor.test(Waist,Weight,method="spearman")   # Test of independence using Spearman's rank correlation rho
Spearman's rank correlation rho
data: Waist and Weight
S = 8532, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
 rho
 0.9

Model Diagnostics

Model: Y_i = (β_0 + β_1 x_i) + ε_i, where
- The ε_i's are uncorrelated with a mean of 0 and constant variance σ²_ε.
- The ε_i's are normally distributed. (This is needed in the test for the slope.)

Assessing uncorrelatedness of the error terms:
plot(result$residuals,type='b')

Assessing normality:
qqnorm(result$residuals); qqline(result$residuals)
shapiro.test(result$residuals)
W = 0.9884, p-value = 0.6937

Assessing constant variance:
plot(result$fitted,result$residuals)
library(lawstat)   # levene.test() comes from the lawstat package
levene.test(result$residuals,Waist)
Test Statistic = 2.1156, p-value = 0.06764
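A minimal sketch (not from the original slides): R's built-in diagnostic display for lm objects produces the residuals-vs-fitted and normal Q-Q plots used above, plus scale-location and leverage plots, in one call.

par(mfrow=c(2,2))   # arrange the four diagnostic plots in a 2x2 grid
plot(result)
par(mfrow=c(1,1))   # restore the default layout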