Intro to Linear Regression
Introduction to Regression

Regression is a statistical procedure for modeling the relationship among variables in order to predict the value of a dependent variable from one or more predictor variables.

Imagine that I ask you to guess the weight of a college-aged male who is hidden from view. What would your best guess be?

[Figure: distribution of weight, M = 158.26, s = 18.64]

What if I also gave you his height? Intuitively, it should be clear that you can do better.
Introduction to Regression

The Pearson correlation, which we covered in the last lecture, measures the degree to which a set of data points form a linear (straight-line) relationship.

Simple regression describes the linear relationship between a dependent variable (Y) and one predictor variable (X). The resulting line is called the regression line.
Regression and Linear Equations

You should remember the following from your high school algebra course: any straight line can be represented by an equation of the form Y = bX + a, where b and a are constants.

The value of b is called the slope and determines the direction and degree to which the line is tilted. The value of a is called the Y-intercept and determines the point where the line crosses the Y-axis.

In the context of linear regression, a and b are called regression coefficients.
Regression and Linear Equations

[Figure: graph of the line Ŷ = bX + a with b = 0.5 and a = 1.0]
Residuals: Errors of Prediction

How well a regression line fits a set of data points can be measured by calculating the distance between the data points and the line. Using the formula Ŷ = bX + a, it is possible to find the predicted value Ŷ for any X.

The residual, or error of prediction, between the predicted value and the actual value can be found by computing the difference Y − Ŷ.

The regression line is selected to be the best fit in the least-squares sense. This means that we want to compute the line that minimizes the sum of squared residuals:

SS_residual = Σ(Y − Ŷ)²
Residuals: Errors of Prediction

[Figure: data points, the regression line Ŷ = bX + a, and the residuals Y − Ŷ]
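To make the least-squares criterion concrete, here is a minimal Python sketch (using made-up points, not the lecture's data) showing that the least-squares line yields a smaller sum of squared residuals than a perturbed line:

```python
# Sum of squared residuals for a candidate line Y-hat = b*X + a.
def ss_residual(xs, ys, b, a):
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

# Least-squares slope and intercept: b = SP / SS_X, a = M_Y - b * M_X.
def least_squares(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    b = sp / ss_x
    a = my - b * mx
    return b, a

xs = [1, 2, 3, 4, 5]             # illustrative data, not from the lecture
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b, a = least_squares(xs, ys)

# Any other line has a larger sum of squared residuals than the fitted one.
best = ss_residual(xs, ys, b, a)
worse = ss_residual(xs, ys, b + 0.3, a)
```

Perturbing the slope (or intercept) in either direction always increases the sum of squared residuals, which is exactly what "best fit in the least-squares sense" means.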
The Standard Error of Estimate

The measure of unpredicted variability, or error, for the regression line is called the standard error of estimate (s_e, or s_{Y−Ŷ}). You can think of it as analogous to the standard deviation that results if we use the mean M as our estimate of the variable:

s = √(SS / df) = √(Σ(Y − M)² / (n − 1))

s_{Y−Ŷ} = √(SS_residual / df) = √(Σ(Y − Ŷ)² / (n − 2))
Computing Regression Coefficients

b = (change in Y) / (change in X) = SP / SS_X, or equivalently b = r · s_Y / s_X

a = M_Y − b · M_X
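The two slope formulas are algebraically equivalent; a short Python check (with illustrative numbers, not the lecture's data):

```python
import math

xs = [2, 4, 6, 8]                      # illustrative data
ys = [1, 2, 2, 4]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

sp   = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # sum of products
ss_x = sum((x - mx) ** 2 for x in xs)
ss_y = sum((y - my) ** 2 for y in ys)

s_x = math.sqrt(ss_x / (n - 1))        # sample standard deviations
s_y = math.sqrt(ss_y / (n - 1))
r   = sp / math.sqrt(ss_x * ss_y)      # Pearson correlation

b1 = sp / ss_x                         # b = SP / SS_X
b2 = r * s_y / s_x                     # b = r * s_Y / s_X  (same value)
a  = my - b1 * mx                      # a = M_Y - b * M_X
```

Both routes give the same slope, because r · s_Y / s_X expands to SP / SS_X once the definitions of r, s_X, and s_Y are substituted in.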
Example

Height (X)   Weight (Y)
70           150
67           140
72           180
75           190
68           145
69           150
71.5         164
71           140
72           142
69           136
67           123
68           155
66           140
72           145
73.5         160
73           190
69           155
73           165
72           150
74           190

M_X = 70.6, M_Y = 155.5, s_X = 2.6, s_Y = 19.2, cov = 36.8
Example: Computing Regression Coefficients

M_X = 70.6, M_Y = 155.5, s_X = 2.6, s_Y = 19.2, cov = 36.8

Compute r:  r = cov / (s_X · s_Y) = 36.8 / (2.6 × 19.2) = 0.737

Compute b:  b = r · s_Y / s_X = 0.737 × 19.2 / 2.6 = 5.44

Compute a:  a = M_Y − b · M_X = 155.5 − 5.44 × 70.6 = −228.56

So, Ŷ = bX + a = 5.44X − 228.56
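These values can be reproduced from the raw data table. The sketch below recomputes r, b, and a directly; small differences from the slide's numbers come from rounding (the slide rounds r to 0.737 before computing b):

```python
import math

# Height (X, inches) and weight (Y, pounds) pairs from the example slide.
heights = [70, 67, 72, 75, 68, 69, 71.5, 71, 72, 69,
           67, 68, 66, 72, 73.5, 73, 69, 73, 72, 74]
weights = [150, 140, 180, 190, 145, 150, 164, 140, 142, 136,
           123, 155, 140, 145, 160, 190, 155, 165, 150, 190]

n = len(heights)
mx = sum(heights) / n                  # M_X = 70.6
my = sum(weights) / n                  # M_Y = 155.5

sp   = sum((x - mx) * (y - my) for x, y in zip(heights, weights))
ss_x = sum((x - mx) ** 2 for x in heights)
ss_y = sum((y - my) ** 2 for y in weights)

r = sp / math.sqrt(ss_x * ss_y)        # ≈ 0.737
b = sp / ss_x                          # ≈ 5.45 (slide: 5.44, via rounded r)
a = my - b * mx                        # ≈ -229.1 (slide: -228.56, via rounded b)
```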
Example: Predicting Y from X

You are told that a college-aged male is 74 inches tall. Given the computed regression coefficients, what is your best estimate of his weight?

X: height, Y: weight

Ŷ = bX + a = 5.44X − 228.56
Ŷ = 5.44 × 74 − 228.56 = 402.56 − 228.56 = 174
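Plugging in, using the slide's (rounded) coefficients:

```python
# Prediction with the slide's rounded coefficients: Y-hat = 5.44*X - 228.56.
b, a = 5.44, -228.56
height = 74
y_hat = b * height + a   # 5.44 * 74 - 228.56 = 174.0 pounds
```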
Example: Computing Accuracy of Prediction

M_X = 70.6, M_Y = 155.5, s_X = 2.6, s_Y = 19.2, cov = 36.8, r = 0.737

Two measures for accuracy of prediction:
- the standard error of estimate (s_e, or s_{Y−Ŷ}), interpreted as the standard deviation of the error around the regression line
- r², interpreted as the percentage of variance accounted for by the regression model

r² = cov² / (s_X² · s_Y²) = (variation explained) / (total variation)

r² = 0.737² = 0.54

s_{Y−Ŷ} = √(Σ(Y − Ŷ)² / (n − 2)) = √(s_Y² (1 − r²) (n − 1) / (n − 2)) = √(19.2² × (1 − 0.54) × 19 / 18) = 13.38
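Both accuracy measures can be recomputed directly from the residuals of the fitted line; the values differ slightly from the slide's because the slide rounds r² to 0.54:

```python
import math

# Heights (X) and weights (Y) from the example slide.
heights = [70, 67, 72, 75, 68, 69, 71.5, 71, 72, 69,
           67, 68, 66, 72, 73.5, 73, 69, 73, 72, 74]
weights = [150, 140, 180, 190, 145, 150, 164, 140, 142, 136,
           123, 155, 140, 145, 160, 190, 155, 165, 150, 190]
n = len(heights)
mx, my = sum(heights) / n, sum(weights) / n

sp   = sum((x - mx) * (y - my) for x, y in zip(heights, weights))
ss_x = sum((x - mx) ** 2 for x in heights)
ss_y = sum((y - my) ** 2 for y in weights)
b = sp / ss_x
a = my - b * mx

# Residuals Y - Y-hat around the fitted line.
ss_residual = sum((y - (b * x + a)) ** 2 for x, y in zip(heights, weights))

r2  = 1 - ss_residual / ss_y            # ≈ 0.544 (slide rounds to 0.54)
s_e = math.sqrt(ss_residual / (n - 2))  # ≈ 13.33 (slide: 13.38, via rounded r²)
```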
Example: Computing Accuracy of Prediction

Just as σ or s can be used to compute confidence intervals for population means, s_{Y−Ŷ} can be used to compute predictive intervals for Ŷ:

Ŷ ± t_crit(df) · s_{Y−Ŷ}

Note that the actual formula for the predictive interval is slightly more complicated and depends on x:

Ŷ ± t_crit(df) · s_{Y−Ŷ} · √(1 + 1/n + (x − M_X)² / SS_X)
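A sketch of the full predictive interval for the 74-inch example, using t_crit ≈ 2.101 (the two-tailed 95% critical value for df = 18 from a t table) and the rounded quantities from the preceding slides:

```python
import math

# 95% predictive interval for the weight of a 74-inch-tall male.
n, df = 20, 18
t_crit = 2.101               # t table, two-tailed .05, df = 18
mx, ss_x = 70.6, 128.3       # mean and sum of squares of the heights
s_e = 13.38                  # slide's standard error of estimate
x = 74
y_hat = 5.44 * x - 228.56    # = 174.0

# Full formula: the (x - M_X)^2 / SS_X term widens the interval
# the farther x is from the mean of the predictor.
margin = t_crit * s_e * math.sqrt(1 + 1/n + (x - mx) ** 2 / ss_x)
lo, hi = y_hat - margin, y_hat + margin   # roughly 174 ± 30 pounds
```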
Standardized Regression

The standardized regression coefficient (β) is computed by first standardizing both the predictor and dependent variables (i.e., by converting both the X values and the Y values to z-scores) and then computing the regression coefficient (b) on the transformed scores.

For standardized regression, the Y-intercept is always zero. For standardized regression with a single predictor variable, β is always equal to r.

Standardized regression coefficients are only really useful in multiple regression, where there are multiple predictor variables. In these cases, standardizing can make it easier to determine the relative contribution of the different predictor variables to the regression model.
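A quick check that β equals r (and the intercept vanishes) for a single predictor, using illustrative data rather than the lecture's:

```python
import math

xs = [1, 2, 3, 4, 5]          # illustrative data
ys = [2, 1, 4, 3, 5]

def zscores(vals):
    """Convert values to z-scores using the sample standard deviation."""
    n = len(vals)
    m = sum(vals) / n
    s = math.sqrt(sum((v - m) ** 2 for v in vals) / (n - 1))
    return [(v - m) / s for v in vals]

def slope_intercept(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sp / sum((x - mx) ** 2 for x in xs)
    return b, my - b * mx

# Regression on the standardized scores: slope is beta, intercept is 0.
beta, intercept = slope_intercept(zscores(xs), zscores(ys))

# Pearson r on the raw scores, for comparison.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
r = sp / math.sqrt(sum((x - mx) ** 2 for x in xs)
                   * sum((y - my) ** 2 for y in ys))
```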
Multiple Regression

Often, researchers measure several variables that are hypothesized to predict a particular dependent variable. For example, we might be interested in how well both SAT scores and high school GPAs predict college GPAs. Multiple regression is an appropriate tool for such situations.

Multiple regression describes the linear relationship between multiple predictor variables (X_1, …, X_n) and one criterion variable (Y). The resulting surface is called the regression surface.
Multiple Regression with Two Predictor Variables

In the same way that linear regression produces an equation that uses values of X to predict values of Y, multiple regression produces an equation that uses two different variables (X_1 and X_2) to predict values of Y. The equation is determined by a least-squared-error solution that minimizes the squared distances between the actual Y values and the predicted Y values.

For two predictor variables, the general form of the multiple regression equation is:

Ŷ = b_1·X_1 + b_2·X_2 + a

The resulting plane is called the regression plane.
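A sketch of fitting the regression plane with two predictors, here via NumPy's general least-squares solver on made-up data generated from a known plane (so the recovered coefficients are checkable):

```python
import numpy as np

# Made-up predictor values (not the lecture's data).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])

# Generate Y exactly on the plane Y = 2*X1 + 0.5*X2 + 1,
# so least squares should recover b1 = 2, b2 = 0.5, a = 1.
y = 2.0 * x1 + 0.5 * x2 + 1.0

# Design matrix [X1, X2, 1]; the column of ones carries the intercept a.
A = np.column_stack([x1, x2, np.ones_like(x1)])
(b1, b2, a), *_ = np.linalg.lstsq(A, y, rcond=None)
```

With noisy real data the recovered plane would not be exact, but it would still be the one minimizing the sum of squared vertical distances Σ(Y − Ŷ)².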