Lecture 11: Simple Linear Regression
Readings: Sections 3.1-3.3, 11.1-11.3
Apr 17, 2009
In linear regression, we examine the association between two quantitative variables. Examples: the number of beers you drink and your blood alcohol level; homework score and test score.
Response variable Y: the dependent variable; measures an outcome of a study.
Explanatory variable X: the independent (predictor) variable; explains or is related to changes in the response variable.
We will have pairs of observations: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n).
General Procedure for Analyzing Two Quantitative Variables
1. Make a scatter plot of the data. Describe the form, direction, and strength. Look for outliers.
2. Look at the correlation to get a numerical value for the direction and strength.
3. If the data are reasonably linear, get an equation of the line using the least squares technique.
4. Look at the residual plot to see whether the assumptions of linear regression hold.
5. Perform formal inference procedures for the correlation, intercept, and slope.
Example 1: We want to examine whether the amount of rainfall per year increases or decreases corn bushel output. A sample of 10 observations was taken; the amount of rainfall (in inches) was measured, as was the subsequent corn yield.

Obs.   x (Rainfall)   y (Corn Yield)
1      3.03           80
2      3.47           84
3      4.21           90
4      4.44           95
5      4.95           97
6      5.11           102
7      5.63           105
8      6.34           112
9      6.56           115
10     6.82           115
What can we see from the scatter plot?
Form: Linear? Non-linear? No obvious pattern?
Direction: Positive or negative association? No association?
Strength: How closely do the points follow a clear form? Strong, moderate, or weak?
Look for OUTLIERS!
Form and direction of an association
[Scatter plots: Linear; Non-Linear; No Relationship]
Strength of an association
[Scatter plots: Strong Positive Linear Association; Weak Positive Linear Association]
Note: Association (correlation) is NOT the same thing as causation. Just because two variables are associated doesn't mean that a change in one variable causes a change in the other. The relationship between two variables might not tell the whole story: other variables may affect the relationship. These other variables are called lurking variables.
Correlation
Pearson's sample correlation r: a numerical quantity that measures the direction and strength of the linear relationship between two quantitative variables.

r = Σ (x_i - x̄)(y_i - ȳ) / sqrt( Σ (x_i - x̄)² · Σ (y_i - ȳ)² ) = SS_xy / sqrt(SS_xx · SS_yy)

where
SS_xy = Σ (x_i - x̄)(y_i - ȳ) = Σ x_i y_i - n x̄ ȳ
SS_xx = Σ (x_i - x̄)² = Σ x_i² - n x̄² = (n - 1) s_x²
SS_yy = Σ (y_i - ȳ)² = Σ y_i² - n ȳ² = (n - 1) s_y²
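These shortcut formulas translate directly into a few lines of Python (a minimal sketch; the helper names are mine, not from the lecture):

```python
import math

def ss_terms(x, y):
    """Shortcut sums of squares: returns (SS_xy, SS_xx, SS_yy)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - n * xbar * ybar
    ss_xx = sum(a * a for a in x) - n * xbar ** 2
    ss_yy = sum(b * b for b in y) - n * ybar ** 2
    return ss_xy, ss_xx, ss_yy

def pearson_r(x, y):
    """Pearson's r = SS_xy / sqrt(SS_xx * SS_yy)."""
    ss_xy, ss_xx, ss_yy = ss_terms(x, y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)
```

On the Example 1 data this gives r = 0.99527, matching the PROC CORR output shown later.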
Example 1 (cont'd): a. What is the correlation between the amount of rainfall and the corn yield?

Obs.   x (Rainfall)   y (Corn Yield)   x²         y²       xy
1      3.03           80               9.1809     6400     242.4
2      3.47           84               12.0409    7056     291.48
3      4.21           90               17.7241    8100     378.9
4      4.44           95               19.7136    9025     421.8
5      4.95           97               24.5025    9409     480.15
6      5.11           102              26.1121    10404    521.22
7      5.63           105              31.6969    11025    591.15
8      6.34           112              40.1956    12544    710.08
9      6.56           115              43.0336    13225    754.4
10     6.82           115              46.5124    13225    784.3
Sum    50.56          995              270.7126   100413   5175.88
SS_xy = Σ x_i y_i - n x̄ ȳ = 5175.88 - 10(5.056)(99.5) = 145.16
SS_xx = Σ x_i² - n x̄² = 270.7126 - 10(5.056)² = 15.08124
SS_yy = Σ y_i² - n ȳ² = 100413 - 10(99.5)² = 1410.5
r = SS_xy / sqrt(SS_xx · SS_yy) = 145.16 / sqrt(15.08124 × 1410.5) = 0.99527
Correlation in SAS

data yield;
  input rainfall yield @@;
  datalines;
3.03 80  3.47 84  4.21 90  4.44 95  4.95 97
5.11 102 5.63 105 6.34 112 6.56 115 6.82 115
;
run;

proc corr data=yield;
  var rainfall yield;
run;
The CORR Procedure

2 Variables: rainfall yield

Simple Statistics
Variable    N    Mean       Std Dev     Sum         Minimum    Maximum
rainfall    10   5.05600    1.29449     50.56000    3.03000    6.82000
yield       10   99.50000   12.51887    995.00000   80.00000   115.00000

Pearson Correlation Coefficients, N = 10
Prob > |r| under H0: Rho=0
            rainfall   yield
rainfall    1.00000    0.99527
                       <.0001
yield       0.99527    1.00000
            <.0001
Properties of Correlation
Correlation measures the strength of only a linear relationship (i.e., correlation is meaningless if the scatter plot shows a curved relationship).
The correlation r does not change if we change the units of measurement of X or Y.
The correlation r is always between -1 and 1: -1 ≤ r ≤ 1.
A positive r corresponds to a positive association between the variables: as X increases, Y increases.
A negative r corresponds to a negative association between the variables: as X increases, Y decreases.
Values near 0 indicate a weak linear relationship. Values close to 1 or -1 indicate a strong linear relationship.
r = 1 only when all points lie exactly on a line with positive slope; r = -1 only when all points lie exactly on a line with negative slope.
[Scatter plots illustrating r = 0, 0.3, 0.5, 0.7, 0.9, 0.99]
If a scatter plot shows that a relationship is linear and we want to use one variable to help explain or predict the other, we can summarize the relationship between the two variables by using a regression line. In linear regression, the regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.
Example 1 (cont'd):
Least Squares Regression
Least squares regression fits a straight line through the data points that minimizes the sum of the squared vertical distances of the data points from the line.

Least Squares Regression Line: ŷ = b_0 + b_1 x

b_1 = SS_xy / SS_xx = (Σ x_i y_i - n x̄ ȳ) / (Σ x_i² - n x̄²) = r (s_y / s_x)
b_0 = ȳ - b_1 x̄
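The slope and intercept formulas can be sketched directly (a hypothetical helper, not from the lecture, using the shortcut SS quantities defined earlier):

```python
def least_squares(x, y):
    """Return (b0, b1) via b1 = SS_xy / SS_xx and b0 = ybar - b1 * xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - n * xbar * ybar
    ss_xx = sum(a * a for a in x) - n * xbar ** 2
    b1 = ss_xy / ss_xx       # slope
    b0 = ybar - b1 * xbar    # intercept
    return b0, b1
```

On the Example 1 data this reproduces the fitted line ŷ = 50.835 + 9.6252x found in part (b).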
Least Squares Regression
Example 1 (cont'd): b. What is the equation of the least squares regression line?

Obs.   x (Rainfall)   y (Corn Yield)   x²         y²       xy
1      3.03           80               9.1809     6400     242.4
2      3.47           84               12.0409    7056     291.48
3      4.21           90               17.7241    8100     378.9
4      4.44           95               19.7136    9025     421.8
5      4.95           97               24.5025    9409     480.15
6      5.11           102              26.1121    10404    521.22
7      5.63           105              31.6969    11025    591.15
8      6.34           112              40.1956    12544    710.08
9      6.56           115              43.0336    13225    754.4
10     6.82           115              46.5124    13225    784.3
Sum    50.56          995              270.7126   100413   5175.88
Least Squares Regression
b_1 = (Σ x_i y_i - n x̄ ȳ) / (Σ x_i² - n x̄²) = 145.16 / 15.08124 = 9.6252
b_0 = ȳ - b_1 x̄ = 99.5 - 9.6252(5.056) = 50.835
The regression equation is: ŷ = 50.835 + 9.6252 x
Prediction and Residual
Prediction: We can use a regression line to predict the value of the response variable y for a specific value of the explanatory variable x. This value is called a predicted value or fitted value.
Be careful about extrapolation. While our data may provide evidence of a linear relationship between y and x, this relationship may not hold outside of the range of x values actually observed. Therefore, predictions of y for values of x far outside the observed range are often not accurate.
Prediction and Residual
Example 1 (cont'd): c. Predict the corn yield for
i. 5 inches of rain
ii. 0 inches of rain
iii. 100 inches of rain
iv. For which amounts of rainfall above do you think the line does a good job of predicting actual corn yield? Why?
Prediction and Residual
A residual is the difference between an observed value of the response variable and the value predicted by the regression line:
residual = e_i = y_i - ŷ_i
[Plot: vertical distance between observed y_i and fitted ŷ_i at x_i]
Prediction and Residual
Example 1 (cont'd): d. Find the predicted value and residual for every observation.

Obs.   x (Rainfall)   y (Corn Yield)   ŷ (Predicted)   e_i (Residual)
1      3.03           80               79.999           0.001
2      3.47           84               84.234          -0.234
3      4.21           90               91.357          -1.357
4      4.44           95               93.571           1.429
5      4.95           97               98.480          -1.480
6      5.11           102              100.020          1.980
7      5.63           105              105.025         -0.025
8      6.34           112              111.859          0.141
9      6.56           115              113.976          1.024
10     6.82           115              116.479         -1.479
Sum    50.56          995              995              0.000
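The ŷ and e_i columns of this table can be reproduced in a few lines (a sketch, using the rounded coefficients b_0 = 50.835, b_1 = 9.6252 from part (b)):

```python
# Recompute fitted values and residuals for Example 1, assuming the
# rounded least squares coefficients found in part (b).
b0, b1 = 50.835, 9.6252
rain = [3.03, 3.47, 4.21, 4.44, 4.95, 5.11, 5.63, 6.34, 6.56, 6.82]
corn = [80, 84, 90, 95, 97, 102, 105, 112, 115, 115]

fitted = [b0 + b1 * x for x in rain]               # yhat_i
residuals = [y - f for y, f in zip(corn, fitted)]  # e_i = y_i - yhat_i
# The residuals sum to (essentially) zero, as the table shows.
```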
Assessing Model Fit
Regression Sum of Squares (SSR): a measure of the variation in y that is explained by the linear regression of y on x.
SSR = Σ (ŷ_i - ȳ)² = b_1² SS_xx
Residual/Error Sum of Squares (SSE): a measure of the variation in y that is not explained by the linear regression of y on x.
SSE = Σ (y_i - ŷ_i)² = Σ e_i²
Assessing Model Fit
Total Sum of Squares (SST): a measure of the total variation in y.
SST = Σ (y_i - ȳ)² = SS_yy
SST = SSR + SSE (Note: (y_i - ȳ) = (y_i - ŷ_i) + (ŷ_i - ȳ).)
[Plot illustrating the decomposition of y_i - ȳ at x_i]
Assessing Model Fit
Coefficient of Determination r²:
r² = SSR / SST = 1 - SSE / SST
r² is the fraction of the variation in y that can be explained by the linear regression of y on x; it measures how successfully the linear regression explains the response. r² is the square of the Pearson correlation r.
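A sketch of this decomposition on the Example 1 data (variable names are mine; compare with the SAS ANOVA table shown later):

```python
import math

rain = [3.03, 3.47, 4.21, 4.44, 4.95, 5.11, 5.63, 6.34, 6.56, 6.82]
corn = [80, 84, 90, 95, 97, 102, 105, 112, 115, 115]
n = len(rain)
xbar, ybar = sum(rain) / n, sum(corn) / n

ss_xy = sum(a * b for a, b in zip(rain, corn)) - n * xbar * ybar
ss_xx = sum(a * a for a in rain) - n * xbar ** 2
sst = sum(b * b for b in corn) - n * ybar ** 2   # SST = SS_yy
b1 = ss_xy / ss_xx
ssr = b1 ** 2 * ss_xx                            # explained variation
sse = sst - ssr                                  # unexplained variation
r2 = ssr / sst                                   # coefficient of determination
```

The results agree with the SAS output below: Model SS 1397.19450, Error SS 13.30550, and R-Square 0.9906.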
Assessing Model Fit R 2 = 0.25 R 2 = 0.7 R 2 = 0.95
Assessing Model Fit
Regression in SAS

data yield;
  input rainfall yield @@;
  datalines;
3.03 80  3.47 84  4.21 90  4.44 95  4.95 97
5.11 102 5.63 105 6.34 112 6.56 115 6.82 115
;
run;

proc reg data=yield;
  model yield = rainfall;
  plot yield * rainfall;
  output out=yield1 p=pred r=resid;
run; quit;

proc print data=yield1;
run;
Assessing Model Fit

The REG Procedure
Model: MODEL1
Dependent Variable: yield

Number of Observations Read  10
Number of Observations Used  10

Analysis of Variance
                          Sum of        Mean
Source            DF      Squares       Square       F Value   Pr > F
Model              1      1397.19450    1397.19450   840.07    <.0001
Error              8      13.30550      1.66319
Corrected Total    9      1410.50000

Root MSE         1.28965    R-Square   0.9906
Dependent Mean   99.50000   Adj R-Sq   0.9894
Coeff Var        1.29613
Assessing Model Fit

Parameter Estimates
                     Parameter    Standard
Variable      DF     Estimate     Error       t Value   Pr > |t|
Intercept      1     50.83497     1.72785     29.42     <.0001
rainfall       1     9.62520      0.33209     28.98     <.0001
Assessing Model Fit
Assessing Model Fit

Obs   rainfall   yield   pred      resid
1     3.03       80      79.999     0.00066
2     3.47       84      84.234    -0.23443
3     4.21       90      91.357    -1.35708
4     4.44       95      93.571     1.42913
5     4.95       97      98.480    -1.47973
6     5.11       102     100.020    1.98024
7     5.63       105     105.025   -0.02487
8     6.34       112     111.859    0.14124
9     6.56       115     113.976    1.02369
10    6.82       115     116.479   -1.47886
Statistical Model and Assumptions
The model: y = β_0 + β_1 x + ε
For any fixed value of x, the error term ε is assumed to follow a normal distribution with mean 0 and standard deviation σ (Normality).
The standard deviation σ does not vary for different values of x (Constant Variability).
The random errors ε_1, ε_2, ..., ε_n associated with different observations are independent of each other (Independence).
Statistical Model and Assumptions
How do we check the regression assumptions?
Normality: Normal quantile plot of the residuals.
Constant variability: Residual plot.
Independence: Examine the way in which subjects/units were selected in the study.
Linearity: Scatter plot or residual plot.
Note: It is always important to check that the assumptions of the regression model have been met to determine whether your results are valid. This is also important to do before you proceed with inference.
Residual Analysis
A residual plot is a scatter plot of the regression residuals against the explanatory variable x.
The mean of the least squares residuals is always zero: ē = 0.
Good plot: total randomness, no pattern, approximately the same number of points above and below the e = 0 line.
Bad plot: an obvious pattern, e.g., a funnel shape, a parabola, or more points above 0 than below (or vice versa).
Residual Analysis
Example 1 (cont'd):
Residual Analysis
Residual Analysis
SAS Code for Residual Analysis

proc reg data=yield;
  model yield = rainfall;
  output out=yield1 p=pred r=resid;
run; quit;

proc gplot data=yield1;
  plot resid * rainfall / vref=0 cvref=red lvref=2;
run; quit;

proc univariate data=yield1;
  qqplot resid / normal(l=1 mu=est sigma=est);
run;
Residual Analysis
Nonlinear relationship: [scatter plot and residual plot]

Residual Analysis
Nonconstant variance: [scatter plot and residual plot]

Residual Analysis
Non-normal error: [scatter plot, residual plot, and normal quantile plot]
Population parameters in linear regression:
ρ: population correlation, estimated by the Pearson correlation r.
β_0: population intercept, estimated by b_0.
β_1: population slope, estimated by b_1.
σ: population standard deviation of the random errors, estimated by s = sqrt(SSE / (n - 2)).
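For Example 1, s can be computed from the SSE in the SAS output (a quick sketch, assuming SSE = 13.3055 from the ANOVA table above):

```python
import math

# s estimates sigma: s = sqrt(SSE / (n - 2)).
# SSE = 13.3055 is the Error sum of squares from the SAS output.
sse, n = 13.3055, 10
s = math.sqrt(sse / (n - 2))  # equals SAS's "Root MSE", about 1.28965
```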
Inference about β_1
Sampling distribution of b_1: b_1 is normally distributed with
mean µ_b1 = β_1
standard deviation σ_b1 = σ / sqrt(SS_xx), estimated by s_b1 = s / sqrt(SS_xx)
The standardized variable t = (b_1 - β_1) / s_b1 has a t distribution with degrees of freedom df = n - 2.
Inference about β_1
Confidence interval for β_1: b_1 ± t_{α/2, n-2} s_b1
Hypothesis test about β_1:
Hypotheses: H_0: β_1 = 0 versus H_a: β_1 > 0, H_a: β_1 < 0, or H_a: β_1 ≠ 0.
The t test statistic is: t = b_1 / s_b1
The test statistic has a t distribution with n - 2 degrees of freedom if H_0 is true. The P-value and rejection region can be computed as with previous t tests.
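A sketch of both procedures on the Example 1 data, assuming the quantities found earlier (b_1 = 9.6252, s = 1.28965, SS_xx = 15.08124) and the tabled critical value t_{0.025, 8} = 2.306:

```python
import math

b1, s, ss_xx, n = 9.6252, 1.28965, 15.08124, 10
s_b1 = s / math.sqrt(ss_xx)                    # standard error of b1
t_crit = 2.306                                 # t_{alpha/2, n-2}, 95% CI, df = 8
ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)  # 95% CI for beta_1
t_stat = b1 / s_b1                             # test statistic for H0: beta_1 = 0
```

Here s_b1 and t_stat reproduce the SAS Parameter Estimates values 0.33209 and 28.98 for the rainfall slope.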
Inference about β_1
Example 1 (cont'd): e. Construct the 95% confidence interval for β_1. (We have previously found that ŷ = 50.835 + 9.6252x, SS_xx = 15.08124, and s = 1.28965.)
Inference about β 1 f. Does amount of rainfall have a linear relationship with the corn yield? Perform a hypothesis test using α = 0.05.
Inference about β 1 g. Does amount of rainfall have a positive linear relationship with the corn yield? Perform a hypothesis test using α = 0.05.
Inference about ρ
Hypothesis test about ρ:
Hypotheses: H_0: ρ = 0 versus H_a: ρ > 0, H_a: ρ < 0, or H_a: ρ ≠ 0.
The t test statistic is: t = r sqrt(n - 2) / sqrt(1 - r²)
The test statistic has a t distribution with n - 2 degrees of freedom if H_0 is true.
Note: The test statistic for the correlation is numerically identical to the test statistic used to test the slope.
The P-value and rejection region can be computed as with previous t tests.
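A sketch with r = 0.99527 from part (a); as the note says, the result agrees (up to the rounding of r) with the slope test statistic 28.98:

```python
import math

# t statistic for H0: rho = 0, with df = n - 2 = 8.
r, n = 0.99527, 10
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
```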
Inference about ρ
Example 1 (cont'd): h. Do amount of rainfall and corn yield have a positive correlation? Perform a hypothesis test using α = 0.05. (Previously we found that r = 0.99527.)
Example 2: Twenty plots, each 10 × 4 meters, were randomly chosen in a large field of corn. For each plot, the plant density (number of plants in the plot) and the mean cob weight (grams of grain per cob) were observed. The results are given in the table.

Plant Density   Cob Weight   Plant Density   Cob Weight
137             212          173             235
107             241          124             241
132             215          157             196
135             225          184             193
115             250          112             224
103             241          80              257
102             237          165             200
65              282          160             190
149             206          157             208
85              246          119             224

Preliminary calculations yield the following results:
x̄ = 128.05, ȳ = 224.1, SS_xx = 20208.95, SS_yy = 11831.8, SS_xy = -14563.1, SSE = 1337.3
a. Calculate the linear regression line of cob weight on plant density.
b. Plot the data and draw the regression line on the graph.
c. What percent of variation in y can be explained by the linear regression line?
d. What is the correlation between plant density and cob weight?
e. If there is an additional plot with plant density 125, how much do you expect the cobs to weigh?
f. Construct a 99% confidence interval for the population regression slope β 1.
g. Is there a linear association between plant density and cob weight? Test this hypothesis using α = 0.01.
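The summary statistics given for Example 2 are enough to sketch parts (a) and (c)-(e) in code (variable names are mine; note that SS_xy is negative here, consistent with the downward trend in the data and with SSE = SS_yy - SS_xy²/SS_xx ≈ 1337.3):

```python
import math

# Summary statistics given in Example 2.
xbar, ybar = 128.05, 224.1
ss_xx, ss_yy, ss_xy, sse = 20208.95, 11831.8, -14563.1, 1337.3

b1 = ss_xy / ss_xx        # (a) slope: higher density, lighter cobs
b0 = ybar - b1 * xbar     # (a) intercept
r2 = 1 - sse / ss_yy      # (c) fraction of variation explained
r = -math.sqrt(r2)        # (d) correlation; sign matches the slope
pred_125 = b0 + b1 * 125  # (e) predicted cob weight at density 125
```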