Chapter 11: Linear Regression and Correlation


Chapter 11: Linear Regression and Correlation

Regression analysis is a statistical tool that utilizes the relation between two or more quantitative variables so that one variable can be predicted from the other, or others.

Some examples:
- Height and weight of people
- Income and expenses of people
- Production size and production time
- Soil pH and the rate of growth of plants

Correlation

An easy way to determine if two quantitative variables are linearly related is by looking at their scatterplot. Another way is to calculate the correlation coefficient, usually denoted by r.

The linear correlation measures the strength of the linear relationship between the explanatory variable (x) and the response variable (y). An estimate of this correlation parameter is provided by the Pearson sample correlation coefficient, r. Note: -1 ≤ r ≤ 1.
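For reference, the Pearson sample correlation coefficient named above (and computed by R's cor() function) has the standard form below, written here in the SS notation used later in these slides; the formula itself is not shown in the transcription:

    r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
             {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}\,\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}
      = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}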

Example Scatterplots with Correlations

If X and Y are independent, then their correlation is 0.

Correlation

If the correlation between X and Y is 0, it does not mean they are independent. It only means that they are not linearly related. One complaint about the correlation is that interpreting its value can be subjective: some people are very happy with an r of 0.6, while others are not. Note: correlation does not necessarily imply causation.

Some guidelines in interpreting r:

    Value of r              Strength of linear relationship
    r ≥ 0.95                Very strong
    0.85 ≤ r < 0.95         Strong
    0.65 ≤ r < 0.85         Moderate to strong
    0.45 ≤ r < 0.65         Moderate
    0.25 ≤ r < 0.45         Weak
    r < 0.25                Very weak / close to none
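A small illustration (not from the original slides) of the point that zero correlation does not imply independence: here y is completely determined by x, yet the linear correlation is essentially 0.

    x <- seq(-3, 3, length.out = 200)
    y <- x^2                  # y depends on x exactly, but not linearly
    cor(x, y)                 # essentially 0 (up to rounding error)
    plot(x, y, pch = 19, main = "Dependent but uncorrelated")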

Computing Correlation in R

    data.health = read.csv("healthexam.csv", header = TRUE)
    head(data.health)
      Gender Age Height Weight Waist Pulse SysBP DiasBP Cholesterol BodyMass  Leg Elbow Wrist  Arm
    1      F  12   63.3  156.3  81.4    64   104     41          89     27.5 41.0   6.8   5.5 33.0
    2      F  16   57.0  100.7  68.7    64   106     64           2     21.9 33.8   5.6   4.6 26.4
    3      M  17   63.0  156.3  86.7    96   109     65          78     27.8 44.2   7.1   5.3 31.7
    attach(data.health)
    plot(Height, Weight, pch = 19, main = "scatterplot")
    cor(Height, Weight)    # 0.544563
    cor(Waist, Weight)     # 0.9083268
    plot(Waist, Weight, pch = 19, main = "scatterplot")

Simple Linear Regression

Model: Y_i = (β0 + β1·x_i) + ε_i, where ε_i is the random error term,
- Y_i is the i-th value of the response variable,
- x_i is the i-th value of the explanatory variable,
- the ε_i's are uncorrelated with a mean of 0 and constant variance σ².

How do we determine the underlying linear relationship? Since the points are following a linear trend, we look for a line that best fits the points. But what do we mean by "best fit"? We need a criterion to help us determine which of two competing candidate lines is better.

[Figure: candidate lines L1-L4 through the scatter, with an observed point, the expected point on the true line Y = β0 + β1·x, and the error ε1 at x1.]
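The criterion introduced on the next slide is the error sum of squares; written out (a standard definition, not shown as a formula in the transcription), for a candidate line with intercept b0 and slope b1:

    SSE = \sum_{i=1}^{n} e_i^{2}
        = \sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^{2}
        = \sum_{i=1}^{n}\bigl(y_i - (b_0 + b_1 x_i)\bigr)^{2}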

Method of Least Squares

Model: Y_i = (β0 + β1·x_i) + ε_i, where
- Y_i is the i-th value of the response variable,
- x_i is the i-th value of the explanatory variable,
- the ε_i's are uncorrelated with a mean of 0 and constant variance σ².

Residual = (Observed y-value) - (Predicted y-value), e.g. e_1 = y_1 - ŷ_1 at the point P_1(x_1, y_1). Example line: ŷ = 2 + 0.8x.

Method of least squares: choose the line that minimizes the SSE as the best line. This line is known as the least-squares regression line.

Question: There are infinitely many candidate lines; how can we find the one that minimizes the SSE?
Answer: Since the SSE is a continuous function of two variables (the intercept and the slope), we can use methods from calculus to minimize it.

[Figure: observed points, the fitted line Y = β0 + β1·x, and the residuals e1 and e2.]

Obtaining the Regression Line in R

    data.health = read.csv("healthexam.csv", header = TRUE)
    attach(data.health)
    plot(Waist, Weight, pch = 19, main = "scatterplot")
    result = lm(Weight ~ Waist)
    coef(result)
    (Intercept)       Waist
      -51.72790     2.39469
    abline(a = -51.7279, b = 2.39469, lwd = 2, col = "blue")

As waist increases by 1 cm, weight goes up by about 2.4 pounds. So, for the first person, her predicted weight is about 143.2 pounds:

    Predicted.1 = -51.728 + 2.395*81.4    # 143.225 pounds

Since her actual weight is 156.3 pounds,

    Residual.1 = 156.3 - 143.2            # 13.1 pounds
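The calculus minimization referred to above has a closed-form solution (standard results; they match the SSxy/SSxx computation that appears later in these slides):

    \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
                         {\sum_{i=1}^{n}(x_i-\bar{x})^{2}}
                  = \frac{SS_{xy}}{SS_{xx}},
    \qquad
    \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\,\bar{x}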

What else do we get from the lm function?

    data.health = read.csv("healthexam.csv", header = TRUE)
    attach(data.health)
    result = lm(Weight ~ Waist)
    attributes(result)
    $names
    "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr"
    "df.residual" "xlevels" "call" "terms" "model"
    result$fit[1]    # 143.1999
    result$res[1]    # 13.10011
    summary(result)
    lm(formula = Weight ~ Waist)
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) -51.7279    11.1288  -4.648 1.34e-05
    Waist         2.3947     0.1249  19.180  < 2e-16
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error: 14.68 on 78 degrees of freedom
    Multiple R-squared: 0.8251, Adjusted R-squared: 0.8228
    F-statistic: 367.9 on 1 and 78 DF, p-value: < 2.2e-16

Coefficient of determination (R²): this index measures the amount of variability in the dependent variable (y) that can be explained by the regression line. Hence, about 82.51% of the variability of weight can be explained by the regression line involving waist size.

Testing H0: β1 = 0 vs. H1: β1 ≠ 0. Since the p-value is extremely small (< 0.05), we can reject the null hypothesis and conclude that waist has a significant effect on weight.

Model Assumptions

Model: Y_i = (β0 + β1·x_i) + ε_i, where
- the ε_i's are uncorrelated with a mean of 0 and constant variance σε²,
- the ε_i's are normally distributed (this is needed for the test on the slope).

Since the underlying true (green) line Y = β0 + β1·x is unknown to us, we cannot calculate the values of the error terms ε_i. The best we can do is study the residuals e_i.

[Figure: observed point, expected point on the true line, predicted point on the fitted line, with error ε1 and residual e1 at x1.]
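A small sketch (not from the slides) showing how these quantities can be pulled out of the summary object programmatically; the component names below are the standard ones for summary.lm objects:

    s <- summary(result)
    s$r.squared                           # 0.8251..., the coefficient of determination
    s$sigma                               # 14.68..., the residual standard error
    s$coefficients                        # matrix of estimates, SEs, t values, and p-values
    s$coefficients["Waist", "Pr(>|t|)"]   # p-value for testing H0: beta1 = 0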

Estimating the Variance of the Error Terms

The unbiased estimator for σε² is MSE = SSE/(n - 2).

    sse = sum(result$residuals^2)    # 16811.16
    mse = sse/(80 - 2)               # 215.5277
    sigma.hat = sqrt(mse)            # 14.68086
    anova(result)
    Response: Weight
              Df Sum Sq Mean Sq F value    Pr(>F)
    Waist      1  79284   79284  367.86 < 2.2e-16
    Residuals 78  16811     216
    Total     79  96095
    summary(result)    # output as on the previous slide (Residual standard error 14.68 on 78 df,
                       # Multiple R-squared 0.8251, F-statistic 367.9 on 1 and 78 DF, p-value < 2.2e-16)

R² = SSR/SSTO, and SSTO = SSE + SSR.

Since the p-value is less than 0.05, we conclude that the regression model accounts for a significant amount of the variability in weight.

[Figure: decomposition of the deviation of y_i from the mean at the point P_i(x_i, y_i) around the line Y = β0 + β1·x.]

Things that affect the slope estimate

Watch the regression podcast by Dr. Will posted on our course webpage. Three things affect the slope estimate:
1. The sample size (n).
2. The variability of the error terms (σε²).
3. The spread of the independent variable.

    summary(result)
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) -51.7279    11.1288  -4.648 1.34e-05
    Waist         2.3947     0.1249  19.180  < 2e-16

Testing H0: β1 = 0 vs. H1: β1 ≠ 0:

    SS = function(x, y) { sum((x - mean(x))*(y - mean(y))) }
    SSxy = SS(Waist, Weight)     # 33108.35
    SSxx = SS(Waist, Waist)      # 13825.73
    SSyy = SS(Weight, Weight)    # 96095.4  = SSTO
    Beta1.hat = SSxy/SSxx        # 2.39469
    MSE = anova(result)$Mean[2]  # 215.5277
    SE.beta1 = sqrt(MSE/SSxx)    # 0.1248554
    t.obs = (Beta1.hat - 0)/SE.beta1        # 19.17971
    p.value = 2*(1 - pt(19.18, df = 78))    # virtually 0

The smaller σε is, the smaller the standard error of the slope estimate. As n increases, the standard error of the slope estimate decreases.
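A quick check (not on the original slide) that the R² reported by summary() is indeed SSR/SSTO from the ANOVA decomposition:

    SSR  = anova(result)$"Sum Sq"[1]   # 79284, explained by Waist
    SSE  = anova(result)$"Sum Sq"[2]   # 16811, residual
    SSTO = SSR + SSE                   # 96095
    SSR/SSTO                           # about 0.8251, matching Multiple R-squared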

Effect of Outliers on the Slope Estimate

Three types of outliers:
1. An outlier in the x direction. This type of outlier is said to be a high-leverage point.
2. An outlier in the y direction.
3. An outlier in both the x and y directions. Such a point is said to be a high-influence point.

[Figures: the effect of a high-influence point; the effect of a point with an outlying y value.]

Confidence Intervals

The (1-α)100% C.I. for β1 is β̂1 ± t(1-α/2; n-2)·SE(β̂1). Hence, the 90% C.I. for β1 in our example is

    Lower = Beta1.hat - qt(0.95, df = 78)*SE.beta1    # 2.186853
    Upper = Beta1.hat + qt(0.95, df = 78)*SE.beta1    # 2.602528
    confint(result, level = .90)
                       5 %        95 %
    (Intercept) -70.253184  -33.202619
    Waist         2.186853    2.602528

Estimating the mean response (μ_y) at a specified value of x:

    predict(result, newdata = data.frame(Waist = c(80, 90)))
           1        2
    139.8473 163.7942

Confidence interval for the mean response (μ_y) at a specified value of x:

    predict(result, newdata = data.frame(Waist = c(80, 90)), interval = "confidence")
           fit      lwr      upr
    1 139.8473 136.0014 143.6932
    2 163.7942 160.4946 167.0938
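For reference (standard formulas, not written out in the transcription), the interval for the mean response at x_h and the prediction interval for a new observation at x_h differ only by the extra 1 inside the square root, which is what the remark on the next slide refers to:

    \hat{y}_h \pm t_{1-\alpha/2,\;n-2}\,
        \sqrt{MSE\left(\tfrac{1}{n} + \tfrac{(x_h-\bar{x})^{2}}{SS_{xx}}\right)}
    \quad \text{(mean response at } x_h\text{)}

    \hat{y}_h \pm t_{1-\alpha/2,\;n-2}\,
        \sqrt{MSE\left(1 + \tfrac{1}{n} + \tfrac{(x_h-\bar{x})^{2}}{SS_{xx}}\right)}
    \quad \text{(new observation at } x_h\text{)}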

Prediction Intervals

Predicting the value of the response variable at a specified value of x:

    predict(result, newdata = data.frame(Waist = c(80, 90)))
           1        2
    139.8473 163.7942

Prediction interval for a new response value (y_{n+1}) at a specified value of x:

    predict(result, newdata = data.frame(Waist = c(80, 90)), interval = "prediction")
           fit      lwr      upr
    1 139.8473 110.3680 169.3266
    2 163.7942 134.3812 193.2072
    predict(result, newdata = data.frame(Waist = c(80, 90)), interval = "prediction", level = .99)
           fit      lwr      upr
    1 139.8473 100.7507 178.9439
    2 163.7942 124.7855 202.8029

Note that the only difference between the prediction interval and the confidence interval for the mean response is the addition of 1 inside the square root. This makes the prediction intervals wider than the confidence intervals for the mean response.

Confidence and Prediction Bands

Working-Hotelling (1-α)100% confidence band: ŷ_h ± W·SE(ŷ_h), where W² = 2·F(1-α; 2, n-2).

    result = lm(Weight ~ Waist)
    CI = predict(result, se.fit = TRUE)    # se.fit = SE of the estimated mean
    W = sqrt(2*qf(0.95, 2, 78))            # 2.495513
    band.lower = CI$fit - W*CI$se.fit
    band.upper = CI$fit + W*CI$se.fit
    plot(Waist, Weight, xlab = "Waist", ylab = "Weight", main = "Confidence Band")
    abline(result)
    points(sort(Waist), sort(band.lower), type = "l", lwd = 2, lty = 2, col = "blue")
    points(sort(Waist), sort(band.upper), type = "l", lwd = 2, lty = 2, col = "blue")

The (1-α)100% prediction band:

    mse = anova(result)$Mean[2]
    se.pred = sqrt(CI$se.fit^2 + mse)
    band.lower.pred = CI$fit - W*se.pred
    band.upper.pred = CI$fit + W*se.pred
    points(sort(Waist), sort(band.lower.pred), type = "l", lwd = 2, lty = 2, col = "red")
    points(sort(Waist), sort(band.upper.pred), type = "l", lwd = 2, lty = 2, col = "red")
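An alternative sketch (not from the slides): the pointwise confidence and prediction intervals can also be drawn by evaluating predict() on a grid of waist values and using matlines(); note these are pointwise intervals rather than the simultaneous Working-Hotelling band above.

    grid = data.frame(Waist = seq(min(Waist), max(Waist), length.out = 100))
    ci.grid = predict(result, newdata = grid, interval = "confidence")
    pi.grid = predict(result, newdata = grid, interval = "prediction")
    plot(Waist, Weight, pch = 19, main = "Pointwise 95% intervals")
    abline(result)
    matlines(grid$Waist, ci.grid[, c("lwr", "upr")], lty = 2, col = "blue")
    matlines(grid$Waist, pi.grid[, c("lwr", "upr")], lty = 2, col = "red")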

Tests for Correlations

Testing H0: ρ = 0 vs. H1: ρ ≠ 0.

    cor(Waist, Weight)    # computes the Pearson correlation coefficient, r
    0.9083268
    cor.test(Waist, Weight, conf.level = .99)    # tests Ho: rho = 0 and also constructs a C.I. for rho
    Pearson's product-moment correlation
    data: Waist and Weight
    t = 19.1797, df = 78, p-value < 2.2e-16
    alternative hypothesis: true correlation is not equal to 0
    99 percent confidence interval:
     0.8409277 0.9479759

Note that the results are exactly the same as what we got when testing H0: β1 = 0 vs. H1: β1 ≠ 0.

Testing H0: ρ = 0 vs. H1: ρ ≠ 0 using the (nonparametric) Spearman's method:

    cor.test(Waist, Weight, method = "spearman")    # test of independence using Spearman's rank correlation rho
    Spearman's rank correlation rho
    data: Waist and Weight
    S = 8532, p-value < 2.2e-16
    alternative hypothesis: true rho is not equal to 0
    sample estimates:
      rho
      0.9

Model Diagnostics

Model: Y_i = (β0 + β1·x_i) + ε_i, where
- the ε_i's are uncorrelated with a mean of 0 and constant variance σε²,
- the ε_i's are normally distributed (this is needed for the test on the slope).

Assessing uncorrelatedness of the error terms:

    plot(result$residuals, type = 'b')

Assessing normality:

    qqnorm(result$residuals); qqline(result$residuals)
    shapiro.test(result$residuals)
    W = 0.9884, p-value = 0.6937

Assessing constant variance:

    plot(result$fitted, result$residuals)
    levene.test(result$residuals, Waist)    # levene.test() comes from an add-on package (e.g., lawstat)
    Test Statistic = 2.1156, p-value = 0.06764
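A compact alternative (not shown on the slides): R's built-in diagnostic plots for lm objects cover the same residual checks.

    par(mfrow = c(1, 2))
    plot(result, which = 1)   # residuals vs. fitted values: look for constant spread and no pattern
    plot(result, which = 2)   # normal Q-Q plot of standardized residuals: look for points near the line
    par(mfrow = c(1, 1))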