Simple Linear Regression

1. Model and Parameter Estimation

(a) Suppose our data consist of a collection of pairs (x_i, y_i), where x_i is an observed value of variable X and y_i is the corresponding observation of random variable Y. The simple linear regression model

    y_i = β_0 + β_1 x_i + ε_i

expresses the relationship between variables X and Y. Here β_0 denotes the intercept and β_1 the slope of the regression line.

(b) Values for β_0 and β_1 are estimated from the data by the method of least squares.

(c) From the many straight lines that could be drawn through our data, we find the line that minimizes the sum of squared residuals, where a residual is the vertical distance between a point (x_i, y_i) and the regression line.

(d) The values β̂_0 and β̂_1 denote the estimates of β_0 and β_1 that minimize the sum of squared residuals, or error sum of squares (SSE). These estimates are called the least squares estimates.

    SSE = Σ_{i=1}^{n} ε_i² = Σ_{i=1}^{n} (y_i − β_0 − β_1 x_i)²

(e) SSE is minimized when the partial derivatives of SSE with respect to the unknowns (β_0 and β_1) are set to zero: ∂SSE/∂β_0 = 0 and ∂SSE/∂β_1 = 0. (You need multivariable calculus [e.g. Math 2001] to understand the theoretical details, so we will just take this as given.) These two conditions result in the two so-called normal equations:

    n β_0 + β_1 Σ_{i=1}^{n} x_i = Σ_{i=1}^{n} y_i
    β_0 Σ_{i=1}^{n} x_i + β_1 Σ_{i=1}^{n} x_i² = Σ_{i=1}^{n} x_i y_i

(f) The two normal equations are solved simultaneously to obtain estimates of β_0 and β_1. These estimates are:

    β̂_1 = Σ_{i=1}^{n} (y_i − ȳ)(x_i − x̄) / Σ_{i=1}^{n} (x_i − x̄)²
        = [Σ x_i y_i − (Σ x_i)(Σ y_i)/n] / [Σ x_i² − (Σ x_i)²/n]

    β̂_0 = ȳ − β̂_1 x̄

Looking at the formula for β̂_1, and recalling the formula for the correlation coefficient r, it is easy to see that β̂_1 = r s_y / s_x.

(g) The error variance, σ², is estimated as

    σ̂² = SSE/(n − 2) = Σ (y_i − ŷ_i)² / (n − 2)
The following example shows the calculations as they would be carried out by hand, in gruesome detail.

e.g.: To study the effect of ozone pollution on soybean yield, data were collected at four ozone dose levels and the resulting soybean seed yield monitored. Ozone dose levels (in ppm) were reported as the average ozone concentration during the growing season. Soybean yield was reported in grams per plant.

    X: Ozone (ppm)    Y: Yield (gm/plant)
    .02               242
    .07               237
    .11               231
    .15               201

Estimated values for β_0 and β_1 are now computed from the data.

    X     Y     X²      Y²      XY
    .02   242   .0004   58564    4.84
    .07   237   .0049   56169   16.59
    .11   231   .0121   53361   25.41
    .15   201   .0225   40401   30.15

Column sums: Σx_i = .35, Σy_i = 911, Σx_i² = .0399, Σy_i² = 208,495, and Σx_i y_i = 76.99

Means: x̄ = .0875 and ȳ = 227.75

Intermediate terms:

    SS_xx = Σ (x_i − x̄)² = Σ x_i² − (Σ x_i)²/n = .0399 − (.35)²/4 = .009275
    SS_xy = Σ (x_i − x̄)(y_i − ȳ) = Σ x_i y_i − (Σ x_i)(Σ y_i)/n = 76.99 − .35(911)/4 = −2.7225

    β̂_1 = SS_xy/SS_xx = −293.531
    β̂_0 = ȳ − β̂_1 x̄ = 227.75 − (−293.531)(.0875) = 253.434

(h) The least squares regression equation which characterizes the linear relationship between soybean yield and ozone dose is ŷ_i = 253.434 − 293.531 x_i.

(i) The error variance, σ², is estimated as MSE.

(j) Residuals: ε̂_i = y_i − ŷ_i = y_i − (β̂_0 + β̂_1 x_i)

    x_i   y_i   ŷ_i       ε̂_i = y_i − ŷ_i
    .02   242   247.563   −5.563
    .07   237   232.887    4.113
    .11   231   221.146    9.854
    .15   201   209.404   −8.404
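The hand calculations above can be checked with a short script. The following is a minimal sketch in Python (the variable names are illustrative, not part of the notes):

```python
# Recompute the least squares estimates for the ozone/soybean data.
x = [0.02, 0.07, 0.11, 0.15]
y = [242, 237, 231, 201]
n = len(x)

xbar = sum(x) / n                                              # .0875
ybar = sum(y) / n                                              # 227.75
sxx = sum((xi - xbar) ** 2 for xi in x)                        # SS_xx = .009275
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))   # SS_xy = -2.7225

b1 = sxy / sxx                    # slope estimate
b0 = ybar - b1 * xbar             # intercept estimate
yhat = [b0 + b1 * xi for xi in x]                      # fitted values
resid = [yi - yh for yi, yh in zip(y, yhat)]           # residuals

print(round(b1, 3), round(b0, 3))           # -293.531 253.434
print([round(e, 3) for e in resid])         # [-5.563, 4.113, 9.854, -8.404]
```

The printed values reproduce β̂_1, β̂_0, and the residual column of the table above.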
(k) Residual Sum of Squares (in regression problems, the error sum of squares is also known as the residual sum of squares):

    SSE = Σ ε̂_i² = (−5.563)² + (4.113)² + (9.854)² + (−8.404)² = 215.59

(l) Mean Squared Error:

    MSE = SSE/(n − 2) = 215.59/2 = 107.80
Calculations by hand in R

x=c(.02,.07,.11,.15)
y=c(242,237,231,201)
SXX=sum((x-mean(x))^2)
SXY=sum((x-mean(x))*(y-mean(y)))
SYY=sum((y-mean(y))^2)
b1=SXY/SXX
b0=mean(y)-b1*mean(x)
yp=b0+b1*x
resids=y-yp
SSE=sum(resids^2)
SST=SYY
SSR=SST-SSE
SS=c(SSR,SSE,SST)
n=length(y)
df=c(1,n-2,n-1)
MS=SS/df
cbind(SS,df,MS)

            SS df        MS
[1,]  799.1381  1  799.1381
[2,]  215.6119  2  107.8059
[3,] 1014.7500  3  338.2500
Check calculations using the built-in lm, summary and ANOVA commands in R

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
      253.4       -293.5

Call:
lm(formula = y ~ x)

Residuals:
     1      2      3      4
-5.563  4.113  9.854 -8.404

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   253.43      10.77  23.537   0.0018 **
x            -293.53     107.81  -2.723   0.1126
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.38 on 2 degrees of freedom
Multiple R-squared: 0.7875, Adjusted R-squared: 0.6813
F-statistic: 7.413 on 1 and 2 DF, p-value: 0.1126

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value Pr(>F)
x          1 799.14  799.14  7.4127 0.1126
Residuals  2 215.61  107.81

       1        2        3        4
247.5633 232.8868 221.1456 209.4043
        1         2         3         4
-5.563342  4.113208  9.854447 -8.404313
[1] 215.6119
[1]  799.1381  215.6119 1014.7500
Statistical inferences - CIs and tests for the β's

2. Standard Errors for Regression Coefficients

(a) Regression coefficient values β̂_0 and β̂_1 are point estimates of the true intercept and slope, β_0 and β_1 respectively.

(b) To develop interval estimates (confidence intervals) for β_0 and β_1, we need to make assumptions about the errors in the regression model. In particular, we assume ε_1, ε_2, ..., ε_n i.i.d. N(0, σ²), in which case:

    β̂_1 ~ N(β_1, σ²/SS_xx)

(c) The standard deviation of β̂_1 is therefore √(σ²/SS_xx).

(d) The value of σ² is unknown, so the estimator MSE is used in its place to produce the standard error of the estimate β̂_1:

    SE(β̂_1) = √(MSE/SS_xx)

(e) The standard error for the estimate β̂_0 is given as:

    SE(β̂_0) = √[MSE(1/n + x̄²/SS_xx)]

(f) Standard errors for the regression coefficients in the above example are estimated below.

    SS_xx = .009275 and MSE = 107.80
    SE(β̂_1) = √(MSE/SS_xx) = √(107.80/.009275) = 107.81
    SE(β̂_0) = √[MSE(1/n + x̄²/SS_xx)] = √[107.80((1/4) + (.0875²/.009275))] = 10.77
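The arithmetic in (f) can be checked numerically. A minimal sketch in Python (names are illustrative):

```python
import math

# Recompute SE(beta1_hat) and SE(beta0_hat) for the ozone/soybean example.
x = [0.02, 0.07, 0.11, 0.15]
y = [242, 237, 231, 201]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)                                   # ~107.81

se_b1 = math.sqrt(mse / sxx)                          # SE of the slope
se_b0 = math.sqrt(mse * (1 / n + xbar ** 2 / sxx))    # SE of the intercept

print(round(se_b1, 2), round(se_b0, 2))               # 107.81 10.77
```

Both values agree with the Std. Error column in the R summary output.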
3. Confidence Intervals for Regression Coefficients

(a) Confidence intervals are constructed using the standard errors as follows:

    β̂_i ± t_{α/2, n−2} SE(β̂_i)

(b) In the example, 95% confidence intervals for β_1 and β_0 are computed as follows.

    t_{α/2, n−2} = t_{.025, 2} = 4.303

    For the slope, β_1:      −293.531 ± 4.303(107.81)  ⇒  (−757.4, 170.3)
    For the intercept, β_0:   253.434 ± 4.303(10.77)   ⇒  (207.1, 299.8)

95% confidence intervals in R

# upper 2.5th percentile of t-dist with n-2 d.f.
MSE=SSE/(n-2)
t=qt(.975,n-2)
t   # upper .025'th percentile of t with n-2 df
[1] 4.302653

# 95% confidence interval for beta_1
SEb1=sqrt(MSE/SXX)   # standard error of beta_1
c(b1-t*SEb1,b1+t*SEb1)
[1] -757.4057  170.3437
Why does the confidence interval have the correct coverage probability? Consider the example of the interval for β_1. We need the following facts:

(a) β̂_1 has a normal distribution with mean β_1 and unknown variance σ²/SXX. A consequence is that

    Z = (β̂_1 − β_1) / (σ/√SXX) ~ N(0, 1).    (Easy results to prove.)

(b) W = (n−2)MSE/σ² ~ χ²_{n−2}, a chi-squared distribution with n−2 degrees of freedom. (A bit harder to prove.)

(c) β̂_1 and SSE are independent, which implies that Z = (β̂_1 − β_1)/(σ/√SXX) and W = (n−2)MSE/σ² are independent. (Hard to prove. Details involve considerable matrix algebra, and are contained in Appendix C3 of Montgomery et al.)

(d) Definition: If Z is standard normal, independent of W which is χ²_ν, then t = Z/√(W/ν) is defined to have a t distribution with ν degrees of freedom.

(e) Then see general notes on constructing confidence intervals.
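Facts (a)-(d) say the pivot (β̂_1 − β_1)/SE(β̂_1) has an exact t distribution with n−2 degrees of freedom, so the interval β̂_1 ± t_{α/2,n−2} SE(β̂_1) covers β_1 with probability 1−α. A small Monte Carlo sketch in Python illustrates the coverage; the "true" β_0, β_1, and σ below are made-up values borrowed from the fitted example, and any values would do:

```python
import math
import random

# Simulated coverage of the 95% CI for the slope, using the n = 4 design points.
random.seed(1)
x = [0.02, 0.07, 0.11, 0.15]
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
beta0, beta1, sigma = 253.434, -293.531, 10.38   # illustrative "true" values
tcrit = 4.302653                                 # t_{.025, n-2} = qt(.975, 2)

reps, covered = 20000, 0
for _ in range(reps):
    # generate data from the model, refit, and check whether the CI traps beta1
    y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    if b1 - tcrit * se_b1 <= beta1 <= b1 + tcrit * se_b1:
        covered += 1

print(covered / reps)   # should be close to the nominal 0.95
```

The empirical coverage is close to 0.95 even though n is only 4, because the t interval is exact under the normal-errors assumption.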
4. The correlation between X and Y is estimated by:

    r = Σ_{i=1}^{n} (y_i − ȳ)(x_i − x̄) / √[Σ (x_i − x̄)² Σ (y_i − ȳ)²]

An alternative expression is given by

    r = β̂_1 √[Σ (x_i − x̄)² / Σ (y_i − ȳ)²]

or

    r = β̂_1 √(SS_xx/SS_yy)

where SS_xx = Σ (x_i − x̄)² and SS_yy = Σ (y_i − ȳ)² are the sums of squares of the X's and Y's, respectively. Note that SS_yy = SST, the total sum of squares. Note also that √(SS_xx/SS_yy) = s_x/s_y, the ratio of the standard deviations of the X's and the Y's.

The correlation coefficient lies in the interval [−1, +1]. If the relationship between Y and X is perfectly linear and increasing, the correlation will be +1. If the relationship is perfectly linear and decreasing, the correlation will be −1. If there is no linear relationship between X and Y, the correlation is 0.

In the example,

    r = β̂_1 √(SS_xx/SS_yy) = −293.531 √(.009275/1014.75) = −.887
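The equivalence of the definitional and slope-based expressions for r can be confirmed numerically; a sketch in Python:

```python
import math

# Compute r from its definition and from the slope, and compare.
x = [0.02, 0.07, 0.11, 0.15]
y = [242, 237, 231, 201]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx

r_def = sxy / math.sqrt(sxx * syy)    # definition of r
r_alt = b1 * math.sqrt(sxx / syy)     # alternative form via the slope

print(round(r_def, 4))                # -0.8874
print(abs(r_def - r_alt) < 1e-12)     # True: the two expressions agree
```

The value matches the R output [1] -0.8874245 shown in the R session above.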
5. Goodness of fit of the regression line is measured by the coefficient of determination, R². For simple linear regression, R² = r².

    R² = SSR/SST

The Regression Sum of Squares (SSR) is similar to the Treatment Sum of Squares in an ANOVA problem. It is given by SSR = SS_xy²/SS_xx. Alternative ways of calculating the residual sum of squares are to use the additivity relationship (SSR + SSE = SST), or to use one of the following formulas:

    R² = SSR/SST
    1 − R² = (SST − SSR)/SST = SSE/SST
    SSE = (1 − R²)SST

R² is the fraction of the total variability in y accounted for by the linear regression line, and ranges between 0 and 1. R² = 1.00 indicates a perfect linear fit, while R² = 0.00 is a complete linear non-fit.

In the example:

    SSR = SS_xy²/SS_xx = (−2.7225)²/.009275 = 799.14
    SST = SSR + SSE = 799.14 + 215.61 = 1014.75
    R² = SSR/SST = 0.788

Note that R² = r², the square of the correlation coefficient. 78.8% of the variability in Y is accounted for by the regression model.

[1] 799.1381
[1] -0.8874245
[1] 0.7875222
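The sum-of-squares decomposition and the identity R² = r² can likewise be verified; a sketch in Python:

```python
import math

# Verify SSR + SSE = SST and R^2 = r^2 for the example data.
x = [0.02, 0.07, 0.11, 0.15]
y = [242, 237, 231, 201]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sst = sum((yi - ybar) ** 2 for yi in y)   # total sum of squares (= SS_yy)
ssr = sxy ** 2 / sxx                      # regression sum of squares
sse = sst - ssr                           # residual sum of squares by additivity

r2 = ssr / sst
r = sxy / math.sqrt(sxx * sst)

print(round(ssr, 4), round(sse, 4), round(sst, 2))  # 799.1381 215.6119 1014.75
print(round(r2, 4), abs(r2 - r ** 2) < 1e-12)       # 0.7875 True
```

These reproduce the Sum Sq column of the ANOVA table and the Multiple R-squared value in the R summary.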
6. Estimating the mean of Y

(a) The estimated mean of Y when x = x* is μ̂_{x*} = β̂_0 + β̂_1 x*.

(b) μ̂_{x*} = β̂_0 + β̂_1 x* ~ N(β_0 + β_1 x*, σ²(1/n + (x* − x̄)²/SS_xx))

(c) The standard error of μ̂_{x*} is

    SE(μ̂_{x*}) = √[MSE(1/n + (x* − x̄)²/SS_xx)]

(d) A confidence interval for the mean μ_{x*} = β_0 + β_1 x* when x = x* is given by

    μ̂_{x*} ± t_{α/2, n−2} SE(μ̂_{x*})

(e) e.g. A 95% confidence interval for the mean at x* = 0.10:

    When x* = 0.10, the estimated mean is μ̂_.1 = 253.434 − 293.531(0.1) = 224.08
    SE(μ̂_.1) = √[107.8(1/4 + (0.1 − .0875)²/.009275)] = 5.36
    t_{α/2, n−2} = t_{.025, 2} = 4.303
    margin of error = 4.303(5.36) = 23.08
    224.08 ± 23.08  ⇒  (201.00, 247.16)

95% confidence interval for mu at x0=.10

x0=.10
muhat=b0+b1*x0   # estimate of mean at x=x0
muhat
SEmu=sqrt(MSE)*sqrt(1/n+(x0-mean(x))^2/SXX)   # SE of muhat
SEmu
c(muhat-t*SEmu, muhat+t*SEmu)
[1] 224.0809
[1] 5.363545
[1] 201.0034 247.1583
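The same interval can be reproduced outside R; a sketch in Python (4.302653 is the t_{.025,2} quantile reported by qt earlier):

```python
import math

# 95% CI for the mean response at x0 = 0.10.
x = [0.02, 0.07, 0.11, 0.15]
y = [242, 237, 231, 201]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
mse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)

x0 = 0.10
tcrit = 4.302653                     # qt(.975, 2)
muhat = b0 + b1 * x0                 # estimated mean at x = x0
se_mu = math.sqrt(mse * (1 / n + (x0 - xbar) ** 2 / sxx))

lo, hi = muhat - tcrit * se_mu, muhat + tcrit * se_mu
print(round(muhat, 3), round(se_mu, 3))   # 224.081 5.364
print(round(lo, 3), round(hi, 3))         # 201.003 247.158
```

The output matches the R results 224.0809, 5.363545, and (201.0034, 247.1583).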
7. Predicting a New Response Value

We are now interested in predicting the value of y at a future value x = x*. In making a prediction interval for a future observation on y when x = x*, we need to incorporate two sources of variation, which account for the fact that we are replacing the unknown mean by the estimate β̂_0 + β̂_1 x*, and we are replacing the unknown standard deviation σ by the estimate √MSE.

    y − (β̂_0 + β̂_1 x*) = (y − (β_0 + β_1 x*)) − (β̂_0 + β̂_1 x* − (β_0 + β_1 x*))

The first term in brackets on the right hand side of this expression has a N(0, σ²) distribution. From (b) above, the distribution of the second term is

    N(0, σ²(1/n + (x* − x̄)²/SS_xx))

As y represents a future observation, the distributions of the two terms are independent, and it follows that the distribution of y − (β̂_0 + β̂_1 x*) is

    N(0, σ²(1 + 1/n + (x* − x̄)²/SS_xx))

(a) The predicted value of y is given by ŷ = β̂_0 + β̂_1 x*.

(b) The variance of the above distribution is estimated by:

    MSE(1 + 1/n + (x* − x̄)²/SS_xx)

(c) and the prediction interval for y is given by

    β̂_0 + β̂_1 x* ± t_{α/2, n−2} √[MSE(1 + 1/n + (x* − x̄)²/SS_xx)]

(d) e.g. A 95% prediction interval for y when x* = 0.10:

    For x* = 0.10, ŷ = 253.434 − 293.531(0.1) = 224.08
    SE(y) = √[107.8(1 + 1/4 + (0.1 − .0875)²/.009275)] = 11.69
    t_{α/2, n−2} = t_{.025, 2} = 4.303
    margin of error = 4.303(11.69) = 50.29
    224.08 ± 50.29  ⇒  (173.79, 274.37)

95% prediction interval for a new observation at x0=.10

SEmu=sqrt(MSE)*sqrt(1+1/n+(x0-mean(x))^2/SXX)
c(muhat-t*SEmu, muhat+t*SEmu)
[1] 173.7980 274.3637
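The prediction interval can also be reproduced outside R; a sketch in Python (4.302653 is the t_{.025,2} quantile from qt above):

```python
import math

# 95% prediction interval for a new observation at x0 = 0.10.
x = [0.02, 0.07, 0.11, 0.15]
y = [242, 237, 231, 201]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
mse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)

x0 = 0.10
tcrit = 4.302653                     # qt(.975, 2)
yhat = b0 + b1 * x0                  # predicted response at x = x0
# the extra "1 +" term is what widens a prediction interval relative to
# a confidence interval for the mean
se_pred = math.sqrt(mse * (1 + 1 / n + (x0 - xbar) ** 2 / sxx))

lo, hi = yhat - tcrit * se_pred, yhat + tcrit * se_pred
print(round(lo, 3), round(hi, 3))    # 173.798 274.364
```

The output matches the R result [1] 173.7980 274.3637, and the interval is wider than the confidence interval for the mean at the same x0, as expected.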