MATH 2560 C F03 Elementary Statistics I

LECTURE 9: Least-Squares Regression Line and Equation

1 Outline

- least-squares regression line (LSRL);
- equation of the LSRL;
- interpreting the LSRL;
- correlation and regression.
2 Least-Squares Regression Line

- Our first aim: we need a way to draw a regression line that doesn't depend on our guess as to where the line should go. We want one line that is as close to the data points as possible.
- Our second aim: we want a regression line that makes the prediction errors as small as possible:
  error = observed variable − predicted variable → minimize!

Figure 2.13 illustrates the idea.
- The most common way to make these errors as small as possible is the LEAST-SQUARES idea.

Least-Squares Regression Line. The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.

Below we have the least-squares idea expressed as a mathematical problem.

Least-Squares Idea as a Mathematical Problem
1. There are n observations on two variables x and y: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n).
2. The line y = a + bx through the scatterplot of these observations predicts the value of y corresponding to x_i as ŷ_i = a + b·x_i.
3. The predicted response ŷ_i will not be exactly the same as the actually observed response y_i.
4. The prediction error for the point x_i is: error = observed y_i − predicted ŷ_i.
5. The method of least squares chooses the line that makes the sum of the squares of these errors as small as possible.
6. Mathematical problem: find the values of the intercept a and the slope b that minimize
   Σ (error)² = Σ (y_i − ŷ_i)² = Σ (y_i − a − b·x_i)² → minimize.
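The minimization above has a closed-form solution for the slope and intercept. A minimal Python sketch, using made-up illustrative data (not from the text), computes the closed-form line and checks that nearby lines have a larger sum of squared errors:

```python
# Least-squares idea: choose a and b to minimize the sum of squared errors.
# The data below are made-up values for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(xs)

def sse(a, b):
    """Sum of squared prediction errors for the line yhat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Closed-form least-squares solution (what the minimization works out to).
xbar = sum(xs) / n
ybar = sum(ys) / n
b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
     / sum((x - xbar) ** 2 for x in xs))
a = ybar - b * xbar

# Any nearby line has a larger sum of squared errors than the LSRL.
assert sse(a, b) <= sse(a + 0.1, b)
assert sse(a, b) <= sse(a, b + 0.1)
print(a, b)
```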
Equation for the LSRL

Equation of the Least-Squares Regression Line
1. Suppose we have data on an explanatory variable x and a response variable y for n individuals.
2. The means and standard deviations of the sample data are x̄ and s_x for x, and ȳ and s_y for y, and the correlation between x and y is r.
3. The equation of the least-squares regression line of y on x is
   ŷ = a + bx
   with slope b = r·s_y/s_x and intercept a = ȳ − b·x̄.

Example 2.13. Mean height of Kalama children (Table 2.7). We calculate the means and standard deviations for x and y, the correlation r, the slope b, the intercept a, and the equation of the least-squares line in this case:
1. Mean and standard deviation for x: x̄ = 23.5 months, s_x = 3.606 months;
2. Mean and standard deviation for y: ȳ = 79.85 cm, s_y = 2.302 cm;
3. Correlation: r = 0.9944;
4. Slope: b = r·s_y/s_x = 0.9944 × 2.302/3.606 = 0.6348 cm per month;
5. Intercept: a = ȳ − b·x̄ = 79.85 − (0.6348)(23.5) = 64.932 cm;
6. The equation of the least-squares line is: ŷ = 64.932 + 0.6348x.
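The arithmetic in Example 2.13 can be reproduced directly from the summary statistics given in the text; a short Python check:

```python
# Recompute the slope and intercept of Example 2.13 (Kalama data)
# from the summary statistics quoted in the text.
xbar, s_x = 23.5, 3.606   # mean and standard deviation of age (months)
ybar, s_y = 79.85, 2.302  # mean and standard deviation of height (cm)
r = 0.9944                # correlation between age and height

b = r * s_y / s_x         # slope: b = r * s_y / s_x
a = ybar - b * xbar       # intercept: a = ybar - b * xbar

print(round(b, 4))  # 0.6348 (cm per month)
print(round(a, 3))  # 64.932
```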
3 Interpreting the regression line

Interpreting the Least-Squares Regression Line
1. The slope b = r·s_y/s_x says that along the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y. (The change in the predicted response ŷ is the same as the change in x when r = 1 or r = −1. Otherwise, when −1 < r < 1, the change in ŷ is less than the change in x.)
2. The least-squares regression line always passes through the point (x̄, ȳ).

Figure 2.14 displays the basic regression output for the Kalama data from a graphing calculator and two statistical software packages.
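Both interpretation facts can be verified numerically from the Kalama results above; a small Python sketch:

```python
# Checks using the summary statistics and fitted line of Example 2.13.
xbar, ybar = 23.5, 79.85      # means of age (months) and height (cm)
s_x, s_y, r = 3.606, 2.302, 0.9944
a, b = 64.932, 0.6348         # intercept and slope from Example 2.13

# Fact 2: the least-squares line passes through the point of means.
yhat_at_mean = a + b * xbar
print(round(yhat_at_mean, 2))  # 79.85

# Fact 1: a change of one s_x in x changes yhat by b*s_x = r*s_y,
# i.e. r standard deviations of y.
print(round(b * s_x, 3), round(r * s_y, 3))  # both 2.289
```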
4 Correlation and Regression

Least-squares regression looks at the distances of the data points from the line in the y direction only.

Example 2.14. Expanding the Universe (Figure 2.15). Figure 2.15 is a scatterplot of data that played a central role in the discovery that the universe is expanding. Here r = 0.7842; hence, the relationship between the distances from Earth of 24 spiral galaxies and the speeds at which they are moving away from us is positive and linear.

Important Remark: Although there is only one correlation between velocity and distance, the regression of velocity on distance and the regression of distance on velocity give different lines.
- There is a close connection between correlation and regression.

Connection between Correlation and Regression:
- the slope of the least-squares line involves r;
- the square of the correlation, r², is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.

Relationship between r and r². When you report a regression, give r² as a measure of how successfully the regression explains the response. All the software outputs in Figure 2.14 include r². The use of r² to describe the success of regression in explaining the response y is very common: it rests on the fact that there are two sources of variation in the responses y in a regression setting.

Example: Kalama children. One reason the Kalama heights vary is that height changes with age; a second reason is that the heights do not lie exactly on the line, but are scattered above and below it. We use r² to measure variation along the line as a fraction of the total variation in the response variable.

For a pictorial grasp of what r² tells us, look at Figure 2.16. Both scatterplots resemble the Kalama data, but with many more observations. The least-squares regression line is the same as the one we computed from the Kalama data. In Figure 2.16(a), r = 0.994 and r² = 0.989. In Figure 2.16(b), r = 0.921 and r² = 0.849: there is more scatter about the fitted line, and r² is less than in Figure 2.16(a).
5 More Specific Interpretation of r²

The squared correlation gives the variance of the predicted responses as a fraction of the variance of the actual responses:

r² = (variance of predicted values ŷ) / (variance of observed values y).

This fact is always true.

Final Important Remark: The connections with correlation are special properties of least-squares regression. They are not true for other methods of fitting a line to data.
6 Summary
1. A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.
2. The most common method of fitting a line to a scatterplot is least squares. The least-squares regression line is the straight line ŷ = a + bx that minimizes the sum of the squares of the vertical distances of the observed y-values from the line.
3. A regression line is used to predict the value of y for any value of x by substituting this x into the equation of the line. Extrapolation beyond the range of x values spanned by the data is risky.
4. The slope b of a regression line ŷ = a + bx is the rate at which the predicted response ŷ changes along the line as the explanatory variable x changes. Specifically, b is the change in ŷ when x increases by 1.
5. The intercept a of a regression line ŷ = a + bx is the predicted response ŷ when the explanatory variable x = 0. This prediction is of no statistical use unless x can actually take values near 0. The least-squares regression line of y on x is the line with slope b = r·s_y/s_x and intercept a = ȳ − b·x̄. This line always passes through the point (x̄, ȳ).
6. Remarks. Correlation and regression are closely connected. The correlation r is the slope of the least-squares regression line when we measure both x and y in standardized units. The square of the correlation, r², is the fraction of the variance of one variable that is explained by least-squares regression on the other variable.