LINEAR REGRESSION ANALYSIS
MODULE II
Lecture 6 - Simple Linear Regression Analysis
Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
Prediction of values of study variable

An important use of linear regression modelling is to predict the average and actual values of the study variable. Predicting the value of the study variable means finding the value of $E(y)$ (in the case of the average value) or the value of $y$ (in the case of the actual value) for a given value of the explanatory variable. We consider both cases.

Case 1: Prediction of average value

Under the linear regression model $y = \beta_0 + \beta_1 x + \varepsilon$, the fitted model is $\hat{y} = b_0 + b_1 x$, where $b_0$ and $b_1$ are the OLS estimators of $\beta_0$ and $\beta_1$, respectively. Suppose we want to predict the value of $E(y)$ for a given value $x = x_0$. Then the predictor is given by
$$\hat{\mu} = \widehat{E(y \mid x_0)} = b_0 + b_1 x_0 .$$

Predictive bias

The prediction error is given as
$$\hat{\mu} - E(y) = b_0 + b_1 x_0 - E(\beta_0 + \beta_1 x_0 + \varepsilon) = (b_0 - \beta_0) + (b_1 - \beta_1) x_0 .$$
Then
$$E\left[\hat{\mu} - E(y)\right] = E(b_0 - \beta_0) + E(b_1 - \beta_1) x_0 = 0 + 0 = 0 .$$
Thus the predictor $\hat{\mu}$ is an unbiased predictor of $E(y)$.
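The fitted model and the average-value predictor above can be sketched numerically with OLS computed from first principles. The data set and the prediction point `x0` below are illustrative assumptions, not values from the lecture.

```python
# Minimal sketch: OLS fit of y = b0 + b1*x and the average-value predictor
# mu_hat = b0 + b1*x0.  Data and x0 are made-up illustrative values.
n = 5
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)                      # sum of squares of x
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # cross-product

b1 = sxy / sxx           # OLS slope estimator
b0 = ybar - b1 * xbar    # OLS intercept estimator

x0 = 3.5                 # point at which E(y) is to be predicted
mu_hat = b0 + b1 * x0    # unbiased predictor of E(y | x = x0)
print(b0, b1, mu_hat)
```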
Predictive variance

The predictive variance of $\hat{\mu}$ is
$$PV(\hat{\mu}) = Var(b_0 + b_1 x_0) = Var\left[\bar{y} + b_1 (x_0 - \bar{x})\right]$$
$$= Var(\bar{y}) + (x_0 - \bar{x})^2 Var(b_1) + 2 (x_0 - \bar{x}) Cov(\bar{y}, b_1)$$
$$= \frac{\sigma^2}{n} + \frac{\sigma^2 (x_0 - \bar{x})^2}{s_{xx}} + 0$$
$$= \sigma^2 \left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right],$$
where $s_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$.

Estimate of predictive variance

The predictive variance can be estimated by substituting $\hat{\sigma}^2 = MSE$ for $\sigma^2$ as
$$\widehat{PV}(\hat{\mu}) = \hat{\sigma}^2 \left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right] = MSE \left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right].$$
Prediction interval estimation

The $100(1-\alpha)\%$ prediction interval for $E(y \mid x_0)$ is obtained as follows. The predictor $\hat{\mu}$ is a linear combination of normally distributed random variables, so it is also normally distributed:
$$\hat{\mu} \sim N\left(\beta_0 + \beta_1 x_0,\; PV(\hat{\mu})\right).$$
So if $\sigma^2$ is known, then the distribution of
$$\frac{\hat{\mu} - E(y \mid x_0)}{\sqrt{PV(\hat{\mu})}}$$
is $N(0,1)$, and the $100(1-\alpha)\%$ prediction interval is obtained from
$$P\left(-z_{\alpha/2} \le \frac{\hat{\mu} - E(y \mid x_0)}{\sqrt{PV(\hat{\mu})}} \le z_{\alpha/2}\right) = 1 - \alpha ,$$
which gives the prediction interval for $E(y \mid x_0)$ as
$$\left[\hat{\mu} - z_{\alpha/2}\, \sigma \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}},\;\; \hat{\mu} + z_{\alpha/2}\, \sigma \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}}\right].$$
When $\sigma^2$ is unknown, it is replaced by $\hat{\sigma}^2 = MSE$, and in this case the sampling distribution of
$$\frac{\hat{\mu} - E(y \mid x_0)}{\sqrt{MSE\left[\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{s_{xx}}\right]}}$$
is the $t$-distribution with $(n-2)$ degrees of freedom, i.e., $t_{n-2}$. The $100(1-\alpha)\%$ prediction interval in this case is obtained from
$$P\left(-t_{\alpha/2,\,n-2} \le \frac{\hat{\mu} - E(y \mid x_0)}{\sqrt{MSE\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right]}} \le t_{\alpha/2,\,n-2}\right) = 1 - \alpha ,$$
which gives the prediction interval as
$$\left[\hat{\mu} - t_{\alpha/2,\,n-2} \sqrt{MSE\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right)},\;\; \hat{\mu} + t_{\alpha/2,\,n-2} \sqrt{MSE\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right)}\right].$$
Note that the width of the prediction interval for $E(y \mid x_0)$ is a function of $x_0$. The interval width is minimum at $x_0 = \bar{x}$ and widens as $|x_0 - \bar{x}|$ increases. This is expected, since the best estimates of $y$ are made at $x$-values near the centre of the data, and the precision of estimation deteriorates as we move toward the boundary of the $x$-space.
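The $t$-based interval for the mean response can be sketched numerically. The data, the point `x0`, and the tabulated critical value $t_{0.025,3} = 3.182$ are illustrative assumptions (a statistics library such as SciPy would normally supply the quantile).

```python
# Sketch: 95% interval for E(y | x0) using MSE and a tabulated t value.
# Data, x0, and t_crit are illustrative assumptions.
n = 5
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# MSE = SS_res / (n - 2), the unbiased estimator of sigma^2
ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = ss_res / (n - 2)

x0 = 3.5
mu_hat = b0 + b1 * x0
t_crit = 3.182          # tabulated t_{0.025, n-2} for n - 2 = 3 degrees of freedom
se_mean = (mse * (1 / n + (x0 - xbar) ** 2 / sxx)) ** 0.5
lo, hi = mu_hat - t_crit * se_mean, mu_hat + t_crit * se_mean
print((lo, hi))
```

The half-width `t_crit * se_mean` shrinks toward its minimum as `x0` approaches `xbar`, matching the note above.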
Case 2: Prediction of actual value

If $x_0$ is the value of the explanatory variable, then the actual value predictor for $y$ is
$$\hat{y} = b_0 + b_1 x_0 .$$
Note that the form of this predictor is the same as that of the average value predictor, but its prediction error and other properties are different. This is the dual nature of the predictor.

Predictive bias

The prediction error of $\hat{y}$ is given as
$$\hat{y} - y = b_0 + b_1 x_0 - (\beta_0 + \beta_1 x_0 + \varepsilon) = (b_0 - \beta_0) + (b_1 - \beta_1) x_0 - \varepsilon .$$
Thus, we find that
$$E(\hat{y} - y) = E(b_0 - \beta_0) + E(b_1 - \beta_1) x_0 - E(\varepsilon) = 0 + 0 - 0 = 0 ,$$
which implies that $\hat{y}$ is an unbiased predictor of $y$.
Predictive variance

Because the future observation $y$ is independent of $\hat{y}$, the predictive variance of $\hat{y}$ is
$$PV(\hat{y}) = E(\hat{y} - y)^2 = E\left[(b_0 - \beta_0) + (x_0 - \bar{x})(b_1 - \beta_1) + \bar{x}(b_1 - \beta_1) - \varepsilon\right]^2$$
$$= Var(b_0) + (x_0 - \bar{x})^2 Var(b_1) + \bar{x}^2 Var(b_1) + Var(\varepsilon) + 2(x_0 - \bar{x}) Cov(b_0, b_1) + 2\bar{x}\, Cov(b_0, b_1) + 2(x_0 - \bar{x})\bar{x}\, Var(b_1)$$
[the rest of the terms are 0, assuming the independence of $\varepsilon$ with $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$]
$$= Var(b_0) + \left[(x_0 - \bar{x})^2 + \bar{x}^2 + 2(x_0 - \bar{x})\bar{x}\right] Var(b_1) + Var(\varepsilon) + 2\left[(x_0 - \bar{x}) + \bar{x}\right] Cov(b_0, b_1)$$
$$= Var(b_0) + x_0^2\, Var(b_1) + Var(\varepsilon) + 2 x_0\, Cov(b_0, b_1)$$
$$= \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{s_{xx}}\right) + \frac{x_0^2 \sigma^2}{s_{xx}} + \sigma^2 - \frac{2 x_0 \bar{x}\, \sigma^2}{s_{xx}}$$
$$= \sigma^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right].$$

Estimate of predictive variance

The estimate of the predictive variance can be obtained by replacing $\sigma^2$ by its estimate $\hat{\sigma}^2 = MSE$ as
$$\widehat{PV}(\hat{y}) = \hat{\sigma}^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right] = MSE\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right].$$
Prediction interval

If $\sigma^2$ is known, then the distribution of
$$\frac{\hat{y} - y}{\sqrt{PV(\hat{y})}}$$
is $N(0,1)$. So the $100(1-\alpha)\%$ prediction interval is obtained from
$$P\left(-z_{\alpha/2} \le \frac{\hat{y} - y}{\sqrt{PV(\hat{y})}} \le z_{\alpha/2}\right) = 1 - \alpha ,$$
which gives the prediction interval for $y$ as
$$\left[\hat{y} - z_{\alpha/2}\, \sigma \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}},\;\; \hat{y} + z_{\alpha/2}\, \sigma \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}}\right].$$
When $\sigma^2$ is unknown, then
$$\frac{\hat{y} - y}{\sqrt{\widehat{PV}(\hat{y})}}$$
follows a $t$-distribution with $(n-2)$ degrees of freedom.
The $100(1-\alpha)\%$ prediction interval in this case is obtained from
$$P\left(-t_{\alpha/2,\,n-2} \le \frac{\hat{y} - y}{\sqrt{\widehat{PV}(\hat{y})}} \le t_{\alpha/2,\,n-2}\right) = 1 - \alpha ,$$
which gives the prediction interval
$$\left[\hat{y} - t_{\alpha/2,\,n-2} \sqrt{MSE\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right)},\;\; \hat{y} + t_{\alpha/2,\,n-2} \sqrt{MSE\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right)}\right].$$
The prediction interval is of minimum width at $x_0 = \bar{x}$ and widens as $|x_0 - \bar{x}|$ increases.

The prediction interval for $\hat{y}$ is wider than the prediction interval for $\hat{\mu}$ because the prediction interval for $\hat{y}$ depends on both the error from the fitted model and the error associated with the future observation.
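The widening effect of the extra "$1+$" term can be checked numerically. The data and the tabulated $t$ value below are illustrative assumptions.

```python
# Sketch: half-width of the prediction interval for a new observation y at x0
# versus the interval for the mean response.  Data and t_crit are illustrative.
n = 5
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

x0 = 3.5
y_hat = b0 + b1 * x0
t_crit = 3.182          # tabulated t_{0.025, 3}

# The "1 +" inside the square root accounts for the future observation's error,
# so the interval for y is always wider than the interval for E(y | x0).
half_mean = t_crit * (mse * (1 / n + (x0 - xbar) ** 2 / sxx)) ** 0.5
half_actual = t_crit * (mse * (1 + 1 / n + (x0 - xbar) ** 2 / sxx)) ** 0.5
print(half_mean, half_actual)
```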
Reverse regression method

The reverse (or inverse) regression approach minimizes the sum of squares of horizontal distances between the observed data points and the line in the corresponding scatter diagram to obtain the estimates of the regression parameters.

Reverse regression has been advocated in the analysis of sex (or race) discrimination in salaries. For example, if $y$ denotes salary and $x$ denotes qualifications, and we are interested in determining whether there is sex discrimination in salaries, we can ask: do men and women with the same qualifications (value of $x$) receive the same salaries (value of $y$)? This question is answered by direct regression. Alternatively, we can ask: do men and women with the same salaries (value of $y$) have the same qualifications (value of $x$)? This question is answered by reverse regression, i.e., the regression of $x$ on $y$.
The regression equation in the case of reverse regression can be written as
$$x_i = \beta_0^* + \beta_1^* y_i + \delta_i \qquad (i = 1, 2, \ldots, n),$$
where the $\delta_i$ are the associated random error components and satisfy the same assumptions as in the usual simple linear regression model.

The reverse regression estimates $\hat{\beta}_{0R}$ of $\beta_0^*$ and $\hat{\beta}_{1R}$ of $\beta_1^*$ for this model are obtained by interchanging $x$ and $y$ in the direct regression estimators. The estimates are
$$\hat{\beta}_{0R} = \bar{x} - \hat{\beta}_{1R}\, \bar{y} \qquad \text{and} \qquad \hat{\beta}_{1R} = \frac{s_{xy}}{s_{yy}}$$
for $\beta_0^*$ and $\beta_1^*$, respectively. The residual sum of squares in this case is
$$SS_{res}^* = s_{xx} - \frac{s_{xy}^2}{s_{yy}} .$$
Note that
$$\hat{\beta}_{1R}\, b_1 = \frac{s_{xy}}{s_{yy}} \cdot \frac{s_{xy}}{s_{xx}} = r_{xy}^2 ,$$
where $b_1$ is the direct regression estimator of the slope parameter and $r_{xy}$ is the correlation coefficient between $x$ and $y$. Hence if $r_{xy}$ is close to 1, the two regression lines will be close to each other. An important application of the reverse regression method is in solving the calibration problem.
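The identity $\hat{\beta}_{1R}\, b_1 = r_{xy}^2$ can be verified directly on a small data set. The data below are an illustrative assumption.

```python
# Sketch: direct slope (y on x), reverse slope (x on y), and the identity
# b1R * b1 = r_xy^2.  The data are illustrative.
n = 5
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx          # direct regression slope (regression of y on x)
b1r = sxy / syy         # reverse regression slope (regression of x on y)
b0r = xbar - b1r * ybar  # reverse regression intercept

r_xy = sxy / (sxx * syy) ** 0.5        # correlation coefficient
ss_res_star = sxx - sxy ** 2 / syy     # residual SS of the reverse fit
print(b1r * b1, r_xy ** 2)             # the two quantities coincide
```

Because the product of the two slopes equals $r_{xy}^2 \le 1$, the reverse regression line coincides with the direct one only when the correlation is perfect.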