Unit 10: Simple Linear Regression and Correlation
Statistics 571: Statistical Methods
Ramón V. León
6/28/2004
Introductory Remarks
- Regression analysis is a method for studying the relationship between two or more numerical variables.
- In regression analysis one of the variables is regarded as a response, outcome, or dependent variable; the other variables are regarded as predictor, explanatory, or independent variables.
- An empirical model approximately captures the main features of the relationship between the response variable and the key predictor variables.
- Sometimes it is not clear which of two variables should be the response; in that case correlation analysis is used.
- In this unit we consider only two variables.
A Probabilistic Model for Simple Linear Regression
- Note that the Yi are independent N(µi = β0 + β1xi, σ²) r.v.'s.
- There is constant variability about the regression line.
- Notice that the predictor variable is regarded as nonrandom because it is assumed to be set by the investigator.
A Probabilistic Model for Simple Linear Regression
Let x1, x2, ..., xn be specific settings of the predictor variable, and let y1, y2, ..., yn be the corresponding values of the response variable. Assume that yi is the observed value of a random variable (r.v.) Yi, which depends on xi according to the model

    Yi = β0 + β1xi + εi   (i = 1, 2, ..., n).

Here εi is the random error, with E(εi) = 0 and Var(εi) = σ². Thus

    E(Yi) = µi = β0 + β1xi   (true regression line).

We assume that the εi are independent, identically distributed (i.i.d.) r.v.'s. We also usually assume that the εi are normally distributed.
Reflections on the Method of Tire Data Collection: Experimental Design

    Mileage   Depth
        0     394.33
        4     329.50
        8     291.00
       12     255.17
       16     229.33
       20     204.83
       24     179.00
       28     163.83
       32     150.33

(Mileage in 1000 miles; depth in mils.)

- Was one tire or nine used in the experiment? (Read the book!) What effect would this answer have on the experimental error? On our ability to generalize to all tires of this brand?
- Was the data collected on one or several cars? Does it matter?
- Was the data collected in random order or in the order given? Does it matter?
- Is there a possible confounding problem?
Example 10.1: Tread Wear vs. Mileage
Using the Fit Y by X platform in the Analyze menu
Scatter Plot: Tread Wear vs. Mileage
Least Squares Line Fit

    ŷ = β̂0 + β̂1x
Illustration: Least Squares Line Fit
The least squares line is the line that minimizes the sum of the squares of the lengths of the vertical segments.
Mathematics of Least Squares
Find the line, i.e., the values of β0 and β1, that minimizes the sum of squared deviations

    Q = Σi=1..n [yi − (β0 + β1xi)]².

How? Solve for the values of β0 and β1 at which ∂Q/∂β0 = 0 and ∂Q/∂β1 = 0.
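As a sketch, the closed-form solution of these two equations can be computed directly on the tire data from the earlier slide (variable names here are my own):

```python
# Least squares fit for the tire data: groove depth (mils) on mileage
# (in 1000s of miles). Solving dQ/d(beta0) = 0 and dQ/d(beta1) = 0
# yields the familiar closed-form estimates below.
mileage = [0, 4, 8, 12, 16, 20, 24, 28, 32]
depth = [394.33, 329.5, 291.0, 255.17, 229.33, 204.83, 179.0, 163.83, 150.33]

n = len(mileage)
xbar = sum(mileage) / n
ybar = sum(depth) / n

Sxx = sum((x - xbar) ** 2 for x in mileage)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(mileage, depth))

b1 = Sxy / Sxx           # slope estimate, beta1-hat
b0 = ybar - b1 * xbar    # intercept estimate, beta0-hat

print(round(b0, 2), round(b1, 3))  # 360.64 -7.281
```

The two printed values match the β̂0 and β̂1 used on the later slides.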
Predict Mean Grove Depth for a Given Mileage
For x = 25 (25,000 miles) the predicted y is

    ŷ = β̂0 + β̂1(25) = 360.64 − 7.281 × 25 = 178.62 mils.

In JMP, add a row to the table and leave the y value blank.
Goodness of Fit of the LS Line
Fitted values of y: ŷi = β̂0 + β̂1xi, i = 1, 2, ..., n.
Residuals: ei = yi − ŷi = yi − (β̂0 + β̂1xi) = segment length.
Sums of Squares
Error Sum of Squares: SSE = Σi ei² = Σi (yi − ŷi)²
Total Sum of Squares: SST = Σi (yi − ȳ)²
Regression Sum of Squares: SSR = Σi (ŷi − ȳ)²

    Σi (yi − ȳ)² = Σi (ŷi − ȳ)² + Σi (yi − ŷi)²,  i.e.,  SST = SSR + SSE.

These definitions apply even if the model is more complex than linear in x, for example, if it is quadratic in x. SST is interpreted as the total variation in y. JMP calls SSR the Model Sum of Squares.
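The decomposition SST = SSR + SSE can be verified numerically on the tire data (a self-contained sketch; variable names are my own):

```python
# Verify SST = SSR + SSE on the tire data.
mileage = [0, 4, 8, 12, 16, 20, 24, 28, 32]
depth = [394.33, 329.5, 291.0, 255.17, 229.33, 204.83, 179.0, 163.83, 150.33]
n = len(mileage)
xbar, ybar = sum(mileage) / n, sum(depth) / n
Sxx = sum((x - xbar) ** 2 for x in mileage)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(mileage, depth))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * x for x in mileage]

SSE = sum((y - yh) ** 2 for y, yh in zip(depth, yhat))
SST = sum((y - ybar) ** 2 for y in depth)
SSR = sum((yh - ybar) ** 2 for yh in yhat)

# The identity holds up to floating-point rounding:
assert abs(SST - (SSR + SSE)) < 1e-6
print(round(SSR, 2), round(SST, 2))  # 50887.2 53418.73
```

These are the SSR and SST values that appear on the coefficient-of-determination slide.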
Illustration of Sums of Squares
[Plot showing, for a point (xi, yi): the total deviation yi − ȳ, the explained deviation ŷi − ȳ, and the residual yi − ŷi.]
Coefficient of Determination
Since SST = SSR + SSE,

    r² = SSR/SST = 1 − SSE/SST = proportion of the variation in y that is accounted for by regression on x.

Tire example: r² = SSR/SST = 50,887.20/53,418.73 = 0.953.
Estimation of σ²

    s² = Σi ei²/(n − 2) = Σi (yi − ŷi)²/(n − 2) = Σi [yi − (β̂0 + β̂1xi)]²/(n − 2)

This estimate has n − 2 degrees of freedom because two unknown parameters are estimated.
Statistical Inference for β0 and β1

    (β̂0 − β0)/SD(β̂0) ~ N(0, 1)  and  (β̂1 − β1)/SD(β̂1) ~ N(0, 1)
    (β̂0 − β0)/SE(β̂0) ~ t(n−2)  and  (β̂1 − β1)/SE(β̂1) ~ t(n−2)

100(1 − α)% confidence intervals:

    β̂1 ± t(n−2, α/2) SE(β̂1) = −7.28 ± t(7, 0.025) × 0.614 = −7.28 ± 2.365 × 0.614 = [−8.733, −5.829]
    β̂0 ± t(n−2, α/2) SE(β̂0) = 360.63 ± t(7, 0.025) × 11.69

In JMP: right-click on the Parameter Estimates area and select the columns to display.
Test of Hypothesis for β0 and β1

    H0: β1 = 0 vs. H1: β1 ≠ 0. Reject H0 at level α if |t| = |β̂1|/SE(β̂1) > t(n−2, α/2).

Tire example:

    t = β̂1/SE(β̂1) = −7.28/0.614 = −11.86,  |t| > t(7, 0.025) = 2.365.

Strong evidence that mileage affects tread depth. (Two-sided p-value.)
Analysis of Variance: H0: β1 = 0 vs. H1: β1 ≠ 0
The Mean Square (MS) is the Sum of Squares divided by its degrees of freedom, e.g. MSE = SSE/df = 2531.529/7 = 361.6.

    F = MSR/MSE = 50887.2/361.6 = 140.71 = (−11.8621)² = t²

The F = t² relationship holds only when the F numerator has one degree of freedom.
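The slope t statistic and the ANOVA F statistic can both be computed from the tire data, confirming F = t² (a sketch; variable names are my own):

```python
import math

# t statistic for H0: beta1 = 0 and the ANOVA F statistic for the tire
# data, confirming that F = t^2 when the F numerator has 1 df.
mileage = [0, 4, 8, 12, 16, 20, 24, 28, 32]
depth = [394.33, 329.5, 291.0, 255.17, 229.33, 204.83, 179.0, 163.83, 150.33]
n = len(mileage)
xbar, ybar = sum(mileage) / n, sum(depth) / n
Sxx = sum((x - xbar) ** 2 for x in mileage)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(mileage, depth))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
SSE = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(mileage, depth))
SSR = Sxy ** 2 / Sxx

MSE = SSE / (n - 2)           # s^2, with n - 2 = 7 degrees of freedom
se_b1 = math.sqrt(MSE / Sxx)  # standard error of the slope
t = b1 / se_b1
F = SSR / MSE                 # MSR = SSR/1 divided by MSE

print(round(t, 2), round(F, 1))  # -11.86 140.7
```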
Example: Prediction of a Future y or Its Mean
For a fixed value of x*, are we trying to predict
- the average value of y?
- the value of a future observation of y?
Do I want to predict the average selling price of all 4,000-square-foot houses in my neighborhood, or do I want to predict the particular future selling price of my 4,000-square-foot house? Which prediction is subject to more error?
Prediction of a Future y or Its Mean: Prediction Interval or Confidence Interval
For a fixed value of x*, are we trying to predict
- the average value of y?
- the value of a future observation of y?
JMP: Prediction of the Mean of y
[Plot: Grove Depth (in mils) vs. Mileage (in 1000 miles), with the confidence band for the mean.]
JMP: Prediction of a Future Value of y
[Plot: Grove Depth (in mils) vs. Mileage (in 1000 miles), with the prediction band for a future observation.]
Formulas for Confidence and Prediction Intervals
With Sxx = Σi (xi − x̄)² and ŷ* = β̂0 + β̂1x*, the standard interval formulas are:

    CI for the mean µ* = E(Y | x*):  ŷ* ± t(n−2, α/2) · s · √(1/n + (x* − x̄)²/Sxx)
    PI for a future observation Y*:  ŷ* ± t(n−2, α/2) · s · √(1 + 1/n + (x* − x̄)²/Sxx)

The prediction interval generalizes the prediction interval of Chapter 7 (the case with no regressors).
Confidence and Prediction Intervals with JMP
Use the Fit Model platform, not the Fit Y by X platform.
CI for the Mean µ*
Prediction Interval for Y*
Prediction for the Mean of Y or a Future Observation of Y
The point prediction is the same in both cases: 178.62 mils. But the error bands differ:
- Narrower for the mean of Y: [158.73, 198.51]
- Wider for a future value of Y: [129.44, 227.80]
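Both intervals can be reproduced from the interval formulas at x* = 25 (a sketch; the critical value t(7, .025) = 2.365 is taken from the earlier slide, and variable names are my own):

```python
import math

# 95% CI for the mean response and 95% PI for a future observation at
# x* = 25 (25,000 miles), using the standard interval formulas.
mileage = [0, 4, 8, 12, 16, 20, 24, 28, 32]
depth = [394.33, 329.5, 291.0, 255.17, 229.33, 204.83, 179.0, 163.83, 150.33]
n = len(mileage)
xbar, ybar = sum(mileage) / n, sum(depth) / n
Sxx = sum((x - xbar) ** 2 for x in mileage)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(mileage, depth))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
s = math.sqrt(sum((y - (b0 + b1 * x)) ** 2
                  for x, y in zip(mileage, depth)) / (n - 2))

xstar = 25
yhat = b0 + b1 * xstar
tcrit = 2.365  # t(7, 0.025), from the slide

ci_half = tcrit * s * math.sqrt(1 / n + (xstar - xbar) ** 2 / Sxx)
pi_half = tcrit * s * math.sqrt(1 + 1 / n + (xstar - xbar) ** 2 / Sxx)

ci = (yhat - ci_half, yhat + ci_half)  # narrower: mean of Y
pi = (yhat - pi_half, yhat + pi_half)  # wider: future value of Y
print(ci, pi)  # close to [158.73, 198.51] and [129.44, 227.80]
```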
Calibration (Inverse Regression)
Inverse Regression in JMP 5: Confidence Interval
In the Fit Model platform (option unchecked)
Inverse Regression: Two Types of Confidence Intervals
Calibration is usually used in measurement: given that you obtained a certain measurement, what is the true value of the measured quantity? Which inverse prediction interval makes more sense?
Measurement Application
Suppose you get a measurement of 6.50 on an object. What is a calibration interval for its true value?
The data were supposedly obtained by measuring standards of known true measurement with a calibrated instrument.
How the data were really generated:
Residuals
The residuals are the differences between the observed and predicted values of y. For example, at x = 32:

    ŷ = β̂0 + β̂1(32) = 360.64 − 7.281 × 32 = 127.65 mils
Regression Diagnostics
The residuals ei = yi − ŷi can be viewed as the "leftover" after the model is fitted. If the model is correct, then the ei can be viewed as "estimates" of the random errors εi. As such, the residuals are vital for checking the model assumptions.

Residual plots: if the model is correct, the residuals should have no structure, that is, they should look random in their plots.
Mathematics of Residuals
If the assumed regression model is correct, then the ei are normally distributed with E(ei) = 0 and

    Var(ei) = σ²[1 − 1/n − (xi − x̄)²/Sxx] ≈ σ²  if n is large,

where Sxx = Σi (xi − x̄)². The ei's are not independent (even though the εi's are independent) because they satisfy the constraints

    Σi ei = 0  and  Σi xi ei = 0.

However, the dependence introduced by these constraints is negligible if n is modestly large, say greater than 20 or so.
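The two constraints are easy to check numerically on the tire fit (a sketch; variable names are my own):

```python
# Check the two linear constraints on the least squares residuals.
mileage = [0, 4, 8, 12, 16, 20, 24, 28, 32]
depth = [394.33, 329.5, 291.0, 255.17, 229.33, 204.83, 179.0, 163.83, 150.33]
n = len(mileage)
xbar, ybar = sum(mileage) / n, sum(depth) / n
Sxx = sum((x - xbar) ** 2 for x in mileage)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(mileage, depth)) / Sxx
b0 = ybar - b1 * xbar

e = [y - (b0 + b1 * x) for x, y in zip(mileage, depth)]

sum_e = sum(e)                                     # should be 0
sum_xe = sum(x * ei for x, ei in zip(mileage, e))  # should be 0
print(sum_e, sum_xe)  # both zero up to floating-point rounding
```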
Basic Principle Underlying Residual Plots
If the assumed model is correct, then the residuals should be randomly scattered around 0 and should show no obvious systematic pattern.
Residuals in JMP
- In the Fit Y by X platform: red triangle menu next to Linear Fit.
- In the Fit Model platform.
Checking for Linearity
Should E(Y) = β0 + β1x + β2x² be fitted rather than E(Y) = β0 + β1x?
[Plots: Grove Depth (in mils) vs. Mileage (in 1000 miles), and residuals vs. Mileage showing a curved pattern.]
Predicted by Residual Plot
[Plots: residuals vs. predicted Grove Depth (in mils), and residuals vs. Mileage (in 1000 miles).]
What are the advantages of this plot?
Tread Wear Quadratic Model
Residuals for Quadratic Model
The residuals have a random pattern. The Fit Y by X platform was used to get these plots; the left plot can also be obtained using the Plot Residuals option.
Tread Wear Quadratic Model
R² was .95261 for the model linear in x
Checking for Constant Variance: Plot of Predicted vs. Residual
If the constant-variance assumption is incorrect, often Var(Y) is some function of E(Y) = µ.
Other Model Diagnostics Based on Residuals
Checking for normality of errors: do a normal plot of the residuals.
Warning: don't plot the response y on a normal plot — this plot has no meaning when one has regressors. Don't transform the data on the basis of this plot. Many students make this mistake in their project; don't be one of them.
Other Model Diagnostics Based on Residuals
Checking for independence of errors (Fit Model platform):
- Plot the residuals in time order. The order of data collection should always be recorded.
- Use the Durbin-Watson statistic to test for autocorrelation.
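The Durbin-Watson statistic is just a ratio of sums over successive residuals, so it can be sketched directly (variable names are my own; here the data-collection order is taken to be the mileage order):

```python
# Durbin-Watson statistic on the tire residuals taken in mileage order.
# DW near 2 suggests no autocorrelation; values well below 2 flag
# positive autocorrelation, consistent with the curvature in the
# residual plot for the linear fit.
mileage = [0, 4, 8, 12, 16, 20, 24, 28, 32]
depth = [394.33, 329.5, 291.0, 255.17, 229.33, 204.83, 179.0, 163.83, 150.33]
n = len(mileage)
xbar, ybar = sum(mileage) / n, sum(depth) / n
Sxx = sum((x - xbar) ** 2 for x in mileage)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(mileage, depth)) / Sxx
b0 = ybar - b1 * xbar
e = [y - (b0 + b1 * x) for x, y in zip(mileage, depth)]

dw = (sum((e[i] - e[i - 1]) ** 2 for i in range(1, n))
      / sum(ei ** 2 for ei in e))
print(dw)  # well below 2 for this fit
```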
Other Model Diagnostics Based on Residuals
Checking for outliers: see if any standardized (Studentized) residual exceeds 2 (standard deviations) in absolute value:

    ei* = ei/SE(ei) = ei / (s √(1 − 1/n − (xi − x̄)²/Sxx)),  i = 1, 2, ..., n.

Also do a box plot of the residuals to check for outliers.
Standardized (Studentized) Residuals in JMP 4
Using the Fit Model platform with the linear model. The model fitted here is the one linear in x, not the quadratic model. The first observation is a possible outlier: should we omit it and refit the model?
Checking for Influential Observations
An observation can be influential because it has an extreme x-value, an extreme y-value, or both. Note that the influential observations above are not outliers, since their residuals are quite small (even zero).
Checking for Influential Observations

    ŷi = Σj=1..n hij yj, where the hij are functions of the x's.

H = (hij) is called the hat matrix. We can think of hij as the leverage exerted by yj on the fitted value ŷi. If ŷi is largely determined by yi, with very small contributions from the other yj's, then we say that the ith observation is influential (also called high leverage).
Checking for Influential Observations
Since hii determines how much ŷi depends on yi, it is used as a measure of the leverage of the ith observation. Now Σi hii = k + 1, so the average of the hii is (k + 1)/n, where k is the number of predictor variables.

Rule of thumb: regard any hii > 2(k + 1)/n as high leverage. In simple regression k = 1, so hii > 4/n is regarded as high leverage.

Since ei* = ei/SE(ei) = ei/(s √(1 − hii)), standardized (Studentized) residuals will be large relative to the raw residual for high-leverage observations.
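For simple regression, hii = 1/n + (xi − x̄)²/Sxx, so the leverages and the 4/n rule of thumb can be checked directly on the tire data (a sketch; variable names are my own):

```python
# Leverages for simple regression: h_ii = 1/n + (x_i - xbar)^2 / Sxx.
mileage = [0, 4, 8, 12, 16, 20, 24, 28, 32]
n = len(mileage)
xbar = sum(mileage) / n
Sxx = sum((x - xbar) ** 2 for x in mileage)

h = [1 / n + (x - xbar) ** 2 / Sxx for x in mileage]

# The leverages sum to k + 1 = 2, and here none exceeds the 4/n rule
# of thumb because the mileages are evenly spaced (a balanced design).
print(round(sum(h), 6), round(max(h), 4), round(4 / n, 4))
```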
Graph of Anscombe Data
Observation in row 8
Checking for Influential Observations: Anscombe Data Illustration (Fit Model Platform)
Here hii > 4/n = 4/11 = 0.36 flags high leverage, so observation 8 is influential. Notice that observation No. 8 is not an outlier, since e8 = y8 − ŷ8 = 0.
How to Deal with Outliers and Influential Observations
They can give a misleading picture of the relationship between y and x. Are they erroneous observations? If yes, they should be discarded; if the observations are valid, then they should be included in the analysis. Least squares is very sensitive to these observations, so do the analysis with and without the outlier or influential observation. Or use a robust method: minimize the sum of absolute deviations

    Σi |yi − (β0 + β1xi)|.

These estimates are analogous to the sample median; LS estimates are analogous to the sample mean.
Data Transformations
If the functional relationship between x and y is known, it may be possible to find a linearizing transformation analytically:
- For y = αx^β, take logs on both sides: log y = log α + β log x.
- For y = αe^(βx), take logs on both sides: log y = log α + βx.
If we are fitting an empirical model, then a linearizing transformation can be found by trial and error, using the scatter plot as a guide.
Scatter Plots and Linearizing Transformations
Tire Tread Wear vs. Mileage: Exponential Model

    y = αe^(βx)  →  log y = log α + βx
Tire Tread Wear vs. Mileage: Exponential Model

    WEAR = e^(5.93) e^(−0.0298 MILES) = 376.15 e^(−0.0298 MILES)
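The exponential fit is just least squares on the log scale, so it can be reproduced from the tire data (a sketch; variable names are my own, and e^5.93 ≈ 376.15 matches the slide up to rounding of the exponent):

```python
import math

# Refit the tire data on the log scale: log y = log(alpha) + beta * x.
mileage = [0, 4, 8, 12, 16, 20, 24, 28, 32]
depth = [394.33, 329.5, 291.0, 255.17, 229.33, 204.83, 179.0, 163.83, 150.33]
logy = [math.log(y) for y in depth]

n = len(mileage)
xbar = sum(mileage) / n
lbar = sum(logy) / n
Sxx = sum((x - xbar) ** 2 for x in mileage)
beta = sum((x - xbar) * (l - lbar) for x, l in zip(mileage, logy)) / Sxx
log_alpha = lbar - beta * xbar

print(round(log_alpha, 2), round(beta, 4))  # 5.93 -0.0298
```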
Tire Tread Wear vs. Mileage: Exponential Model
For the linear model, R² = .953. The residual plot is still curved, mainly because observation 1 is an outlier.
Weights and Gas Mileages of 38 1978-79 Model Year Cars
Weights and Gas Mileages of 38 1978-79 Model Year Cars
Using the Fit Y by X platform. Use the transformation 100/y; the new variable has units of gallons per 100 miles. This transformation has the advantage of being physically interpretable. A spline fits the data locally; our prior curve fits have been global.
Weights and Gas Mileages of 38 1978-79 Model Year Cars
Variance Stabilizing Transformation
Suppose σ ∝ µ^α. We want to find a transformation of Y that yields a constant variance. Suppose the transformation is a power of the original data, say y* = y^λ. For example: take the square root of Poisson count data, since for data so distributed the mean and the variance are the same.
Box-Cox Transformation: Empirical Determination of the Best Transformation (Fit Model Platform)

    y^(λ) = (y^λ − 1)/(λ ỹ^(λ−1))   for λ ≠ 0
    y^(λ) = ỹ ln y                  for λ = 0

where ỹ = (∏i yi)^(1/n) is the geometric mean.
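A sketch of this scaled transform in code (the function name and its application to the tire depths are my own; the λ = 0 branch is the continuous limit of the λ ≠ 0 branch):

```python
import math

# Scaled Box-Cox transform as defined above.
def boxcox(ys, lam):
    # Geometric mean of the data, used as the scaling constant.
    gm = math.exp(sum(math.log(y) for y in ys) / len(ys))
    if lam == 0:
        return [gm * math.log(y) for y in ys]
    return [(y ** lam - 1) / (lam * gm ** (lam - 1)) for y in ys]

depth = [394.33, 329.5, 291.0, 255.17, 229.33, 204.83, 179.0, 163.83, 150.33]

# As lambda -> 0 the two branches agree, so the transform is
# continuous in lambda:
a = boxcox(depth, 1e-8)
b = boxcox(depth, 0)
print(max(abs(u - v) for u, v in zip(a, b)) < 1e-3)  # True
```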
Transformation in Tire Data
A better fit than the polynomial, and with one less parameter (3 versus 4).
[Plot: transformed response vs. Mileage (in 1000 miles).]
Weights and Gas Mileages of 38 1978-79 Model Year Cars: Box-Cox Transformation
The reciprocal transformation is among the suggested transformations.
Correlation Analysis
Used when it is not clear which is the predictor variable and which is the response, and when both variables are random. Try the bivariate normal distribution as a probability model for the joint distribution of two r.v.'s. Parameters: µX, µY, σX, σY, and

    ρ = Corr(X, Y) = Cov(X, Y)/√(Var(X) Var(Y)).

The correlation ρ is a measure of association between the two random variables X and Y. Zero correlation corresponds to no association; correlation of 1 or −1 represents perfect association.
Bivariate Normal Density Function
Statistical Inference on the Correlation Coefficient

    R = Σi (Xi − X̄)(Yi − Ȳ) / √(Σi (Xi − X̄)² · Σi (Yi − Ȳ)²)

To test H0: ρ = 0 vs. H1: ρ ≠ 0, use the test statistic

    T = R √(n − 2) / √(1 − R²) ~ t(n−2),

which is equivalent to testing H0: β1 = 0 vs. H1: β1 ≠ 0. An approximate test statistic is available to test nonzero null hypotheses and to obtain confidence intervals for ρ (see textbook).
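On the tire data, this T statistic reproduces the slope t statistic from the regression, illustrating the equivalence (a sketch; variable names are my own):

```python
import math

# Sample correlation for the tire data and the t statistic for
# H0: rho = 0, which matches the slope t statistic (about -11.86).
mileage = [0, 4, 8, 12, 16, 20, 24, 28, 32]
depth = [394.33, 329.5, 291.0, 255.17, 229.33, 204.83, 179.0, 163.83, 150.33]
n = len(mileage)
xbar, ybar = sum(mileage) / n, sum(depth) / n
Sxx = sum((x - xbar) ** 2 for x in mileage)
Syy = sum((y - ybar) ** 2 for y in depth)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(mileage, depth))

r = Sxy / math.sqrt(Sxx * Syy)
T = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(round(r, 3), round(T, 2))  # -0.976 -11.86
```

Note that r² here equals the coefficient of determination (.95261) from the earlier slide, as it must in simple regression.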
Exercise 10.28
Exercise 10.28
Exercise 10.28
By default, a 95% bivariate normal density ellipse is imposed on each scatterplot. If the variables are bivariate normally distributed, this ellipse encloses approximately 95% of the points. The correlation of the variables is seen in the collapsing of the ellipse along the diagonal axis. If the ellipse is fairly round and is not diagonally oriented, the variables are uncorrelated.