Simple Linear Regression


1. Model and Parameter Estimation

(a) Suppose our data consist of a collection of pairs $(x_i, y_i)$, $i = 1, \dots, n$, where $x_i$ is an observed value of variable X and $y_i$ is the corresponding observation of random variable Y. The simple linear regression model

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

expresses the relationship between variables X and Y. Here $\beta_0$ denotes the intercept and $\beta_1$ the slope of the regression line.

(b) Values for $\beta_0$ and $\beta_1$ are estimated from the data by the method of least squares.

(c) From the many straight lines that could be drawn through our data, we find the line that minimizes the sum of squared residuals, where a residual is the vertical distance between a point $(x_i, y_i)$ and the regression line.

(d) Values $\hat\beta_0$ and $\hat\beta_1$ denote the estimates for $\beta_0$ and $\beta_1$ that minimize the sum of squared residuals, or error sum of squares (SSE). The estimates are called least squares estimates.

$$SSE = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

(e) SSE is minimized when the partial derivatives of the SSE with respect to the unknowns ($\beta_0$ and $\beta_1$) are set to zero: $\partial SSE / \partial \beta_0 = 0$ and $\partial SSE / \partial \beta_1 = 0$. (You need multivariable calculus [e.g. Math 2001] to understand the theoretical details, so we will just take this as a given.) These two conditions result in the two so-called normal equations.

$$n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$

$$\beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$

(f) The two normal equations are solved simultaneously to obtain estimates of $\beta_0$ and $\beta_1$. These estimates are:

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (y_i - \bar y)(x_i - \bar x)}{\sum_{i=1}^{n} (x_i - \bar x)^2} = \frac{\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)/n}{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2/n}$$

$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x$$

Looking at the formula for $\hat\beta_1$, and recalling the formula for the correlation coefficient $r$, it is easy to see that $\hat\beta_1 = r s_y / s_x$.

(g) The error variance, $\sigma^2$, is estimated as

$$\hat\sigma^2 = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n} (y_i - \hat y_i)^2}{n-2}$$
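The calculus step that item (e) takes as given is short enough to write out. The following is a sketch of the standard derivation, using only the symbols already defined above.

```latex
\begin{align*}
\frac{\partial SSE}{\partial \beta_0}
  &= -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0
  &&\Longrightarrow\quad n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \\
\frac{\partial SSE}{\partial \beta_1}
  &= -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0
  &&\Longrightarrow\quad \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i
\end{align*}
% Dividing the first equation by n gives \beta_0 = \bar y - \beta_1 \bar x.
% Substituting this into the second equation:
\begin{align*}
\beta_1 \left( \sum_{i=1}^{n} x_i^2 - n \bar x^2 \right)
  &= \sum_{i=1}^{n} x_i y_i - n \bar x \bar y \\
\hat\beta_1
  &= \frac{\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)/n}
          {\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2/n}
\end{align*}
```

The last line uses $n \bar x \bar y = (\sum x_i)(\sum y_i)/n$ and $n \bar x^2 = (\sum x_i)^2/n$, which is why the two forms of $\hat\beta_1$ in item (f) agree.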

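The following example shows the calculations as they would be carried out by hand, in gruesome detail.

e.g.: To study the effect of ozone pollution on soybean yield, data were collected at four ozone dose levels and the resulting soybean seed yield monitored. Ozone dose levels (in ppm) were reported as the average ozone concentration during the growing season. Soybean yield was reported in grams per plant.

    X: Ozone (ppm)    Y: Yield (gm/plant)
    .02               242
    .07               237
    .11               231
    .15               201

Estimated values for $\beta_0$ and $\beta_1$ are now computed from the data.

    X       Y       X^2      Y^2       XY
    .02     242     .0004     58564     4.84
    .07     237     .0049     56169    16.59
    .11     231     .0121     53361    25.41
    .15     201     .0225     40401    30.15

Column sums: $\sum x_i = .35$, $\sum y_i = 911$, $\sum x_i^2 = .0399$, $\sum y_i^2 = 208{,}495$, and $\sum x_i y_i = 76.99$.

Means: $\bar x = .0875$ and $\bar y = 227.75$.

Intermediate terms:

$$SS_{xx} = \sum_i (x_i - \bar x)^2 = \sum_i x_i^2 - \frac{(\sum x_i)^2}{n} = .0399 - \frac{(.35)^2}{4} = .009275$$

$$SS_{xy} = \sum_i (x_i - \bar x)(y_i - \bar y) = \sum_i x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 76.99 - \frac{(.35)(911)}{4} = -2.7225$$

$$\hat\beta_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{-2.7225}{.009275} = -293.531, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x = 227.75 - (-293.531)(.0875) = 253.434$$

(h) The least squares regression equation which characterizes the linear relationship between soybean yield and ozone dose is

$$\hat y_i = 253.434 - 293.531\, x_i$$

(i) The error variance, $\sigma^2$, is estimated as MSE.

(j) Residuals: $\hat\epsilon_i = y_i - \hat y_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)$

    x_i     y_i     yhat_i     ehat_i = y_i - yhat_i
    .02     242     247.563    -5.563
    .07     237     232.887     4.113
    .11     231     221.146     9.854
    .15     201     209.404    -8.404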

(k) Residual Sum of Squares (in regression problems, the error sum of squares is also known as the residual sum of squares):

$$SSE = \sum_i \hat\epsilon_i^2 = (-5.563)^2 + (4.113)^2 + (9.854)^2 + (-8.404)^2 \approx 215.61$$

(l) Mean Squared Error:

$$MSE = \frac{SSE}{n-2} = \frac{215.61}{2} \approx 107.81$$

Calculations by hand in R

    x=c(.02,.07,.11,.15)
    y=c(242,237,231,201)
    SXX=sum((x-mean(x))^2)
    SXY=sum((x-mean(x))*(y-mean(y)))
    SYY=sum((y-mean(y))^2)
    b1=SXY/SXX               # slope estimate
    b0=mean(y)-b1*mean(x)    # intercept estimate
    yp=b0+b1*x               # fitted values
    resids=y-yp
    SSE=sum(resids^2)
    SST=SYY
    SSR=SST-SSE
    SS=c(SSR,SSE,SST)
    n=length(y)
    df=c(1,n-2,n-1)
    MS=SS/df
    cbind(SS,df,MS)

                SS df       MS
    [1,]  799.1381  1 799.1381
    [2,]  215.6119  2 107.8059
    [3,] 1014.7500  3 338.2500
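As a cross-check on the normal equations in item 1(e), they can also be solved directly as a 2 x 2 linear system. The sketch below uses base R's `solve()`; the names `A`, `rhs` and `beta` are illustrative choices and do not appear in the original notes.

```r
# Ozone/soybean data from the example
x <- c(.02, .07, .11, .15)
y <- c(242, 237, 231, 201)
n <- length(y)

# Normal equations written as A %*% beta = rhs
A   <- matrix(c(n,      sum(x),
                sum(x), sum(x^2)), nrow = 2, byrow = TRUE)
rhs <- c(sum(y), sum(x * y))

beta <- solve(A, rhs)   # beta[1] is the intercept, beta[2] is the slope
beta

# Closed-form least squares estimates, for comparison
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)
```

Both routes should give the same pair of estimates, since the closed-form expressions are just the solution of this system.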

Check calculations using the built-in lm, summary and anova commands in R

    Call:
    lm(formula = y ~ x)

    Coefficients:
    (Intercept)            x
          253.4       -293.5

    Call:
    lm(formula = y ~ x)

    Residuals:
         1      2      3      4
    -5.563  4.113  9.854 -8.404

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   253.43      10.77  23.537   0.0018 **
    x            -293.53     107.81  -2.723   0.1126
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 10.38 on 2 degrees of freedom
    Multiple R-squared:  0.7875, Adjusted R-squared:  0.6813
    F-statistic: 7.413 on 1 and 2 DF,  p-value: 0.1126

    Analysis of Variance Table

    Response: y
              Df Sum Sq Mean Sq F value Pr(>F)
    x          1 799.14  799.14  7.4128 0.1126
    Residuals  2 215.61  107.81

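Statistical inferences - CIs and tests for the β's

2. Standard Errors for Regression Coefficients

(a) Regression coefficient values, $\hat\beta_0$ and $\hat\beta_1$, are point estimates of the true intercept and slope, $\beta_0$ and $\beta_1$ respectively.

(b) To develop interval estimates (confidence intervals) for $\beta_0$ and $\beta_1$, we need to make assumptions about the errors in the regression model. In particular, we assume $\epsilon_1, \epsilon_2, \dots, \epsilon_n$ are i.i.d. $N(0, \sigma^2)$, in which case:

$$\hat\beta_1 \sim N\left(\beta_1, \frac{\sigma^2}{SS_{xx}}\right)$$

(c) The standard deviation of $\hat\beta_1$ is $\sqrt{\sigma^2 / SS_{xx}}$.

(d) The value of $\sigma^2$ is unknown, so the estimator MSE is used in its place to produce the standard error of the estimate $\hat\beta_1$, as

$$SE_{\hat\beta_1} = \sqrt{MSE / SS_{xx}}$$

(e) The standard error for estimate $\hat\beta_0$ is given as:

$$SE_{\hat\beta_0} = \sqrt{MSE\left(\frac{1}{n} + \frac{\bar x^2}{SS_{xx}}\right)}$$

(f) Standard errors for regression coefficients in the above example are estimated below.

$SS_{xx} = .009275$ and $MSE = 107.81$

$$SE_{\hat\beta_1} = \sqrt{MSE / SS_{xx}} = \sqrt{107.81 / .009275} = 107.81$$

$$SE_{\hat\beta_0} = \sqrt{MSE\left(\frac{1}{n} + \frac{\bar x^2}{SS_{xx}}\right)} = \sqrt{107.81\left((1/4) + (.0875^2 / .009275)\right)} = 10.77$$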
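The distribution stated for $\hat\beta_1$ follows from writing the estimator as a linear combination of the $y_i$. A sketch of that standard argument, where the weights $c_i$ are notation introduced here for convenience:

```latex
\begin{align*}
\hat\beta_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{SS_{xx}}
             = \sum_{i=1}^{n} c_i y_i,
             \qquad c_i = \frac{x_i - \bar x}{SS_{xx}} \\
% The weights satisfy \sum c_i = 0, \sum c_i x_i = 1 and \sum c_i^2 = 1 / SS_{xx}.
E(\hat\beta_1) &= \sum_{i=1}^{n} c_i (\beta_0 + \beta_1 x_i)
                = \beta_0 \sum_{i=1}^{n} c_i + \beta_1 \sum_{i=1}^{n} c_i x_i
                = \beta_1 \\
\mathrm{Var}(\hat\beta_1) &= \sum_{i=1}^{n} c_i^2 \, \mathrm{Var}(y_i)
                           = \sigma^2 \sum_{i=1}^{n} c_i^2
                           = \frac{\sigma^2}{SS_{xx}}
\end{align*}
```

The variance step uses the independence of the errors, and normality of $\hat\beta_1$ follows because a linear combination of independent normal variables is normal.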

3. Confidence Intervals for Regression Coefficients

(a) Confidence intervals are constructed using the standard errors as follows:

$$\hat\beta_i \pm t_{\alpha/2,\, n-2}\, SE_{\hat\beta_i}$$

(b) In the example, 95% confidence intervals for $\beta_1$ and $\beta_0$ are computed as follows.

$t_{\alpha/2,\, n-2} = t_{.025,\,2} = 4.303$

For the slope, $\beta_1$: $-293.531 \pm 4.303(107.81)$, giving $(-757.4, 170.3)$

For the intercept, $\beta_0$: $253.434 \pm 4.303(10.77)$, giving $(207.1, 299.8)$

95% confidence intervals in R

    MSE=SSE/(n-2)
    t=qt(.975,n-2)   # upper .025th percentile of t with n-2 df
    t
    [1] 4.302653

    # 95% confidence interval for beta_1
    SEb1=sqrt(MSE/SXX)   # standard error of beta_1
    c(b1-t*SEb1,b1+t*SEb1)
    [1] -757.4057  170.3437
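The same intervals can be obtained from a fitted model object. The sketch below compares base R's `confint()` with the formula in (a); the object names `fit`, `est`, `se`, `tcrit` and `manual` are illustrative and not part of the original notes.

```r
# Ozone/soybean data from the example
x <- c(.02, .07, .11, .15)
y <- c(242, 237, 231, 201)

fit <- lm(y ~ x)

# Built-in 95% confidence intervals for intercept and slope
confint(fit, level = 0.95)

# The same intervals from estimate +/- t * SE
coefs <- coef(summary(fit))
est   <- coefs[, "Estimate"]
se    <- coefs[, "Std. Error"]
tcrit <- qt(0.975, df.residual(fit))   # n - 2 = 2 degrees of freedom
manual <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
manual
```

With only two residual degrees of freedom the multiplier is large (about 4.3), which is why both intervals are so wide.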

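Why does the confidence interval have the correct coverage probability? Consider the example of the interval for $\beta_1$. We need the following facts:

(a) $\hat\beta_1$ has a normal distribution with mean $\beta_1$ and unknown variance $\sigma^2 / SXX$. A consequence is that

$$Z = \frac{\hat\beta_1 - \beta_1}{\sigma / \sqrt{SXX}} \sim N(0, 1)$$

(Easy results to prove.)

(b) $W = \dfrac{(n-2)\,MSE}{\sigma^2} \sim \chi^2_{n-2}$, a chi-squared distribution with $n-2$ degrees of freedom. (A bit harder to prove.)

(c) $\hat\beta_1$ and SSE are independent, which implies that $Z = \dfrac{\hat\beta_1 - \beta_1}{\sigma / \sqrt{SXX}}$ and $W = \dfrac{(n-2)\,MSE}{\sigma^2}$ are independent. (Hard to prove. Details involve considerable matrix algebra, and are contained in appendix C3 of Montgomery et al.)

(d) Definition: If Z is standard normal, independent of W which is $\chi^2_\nu$, then $t = \dfrac{Z}{\sqrt{W/\nu}}$ is defined to have a t distribution with $\nu$ degrees of freedom.

(e) Combining (a) to (d) with $\nu = n-2$, the ratio $\dfrac{Z}{\sqrt{W/(n-2)}} = \dfrac{\hat\beta_1 - \beta_1}{\sqrt{MSE / SXX}}$ has a t distribution with $n-2$ degrees of freedom. Then see general notes on constructing confidence intervals.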
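The coverage claim can also be checked empirically. The following is a minimal Monte Carlo sketch, not part of the original notes: the "true" values `beta0`, `beta1` and `sigma` are hypothetical numbers chosen only for the simulation, and the design points are the four ozone doses from the example.

```r
set.seed(1)

x <- c(.02, .07, .11, .15)   # design points from the example
n <- length(x)

# Hypothetical true parameters, assumed for illustration only
beta0 <- 250
beta1 <- -300
sigma <- 10

nsim  <- 5000
SXX   <- sum((x - mean(x))^2)
tcrit <- qt(0.975, n - 2)
covered <- logical(nsim)

for (s in seq_len(nsim)) {
  # Simulate a data set from the model with i.i.d. normal errors
  y   <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)
  b1  <- sum((x - mean(x)) * (y - mean(y))) / SXX
  b0  <- mean(y) - b1 * mean(x)
  MSE <- sum((y - b0 - b1 * x)^2) / (n - 2)
  half <- tcrit * sqrt(MSE / SXX)          # margin of error for the slope
  covered[s] <- (b1 - half <= beta1) && (beta1 <= b1 + half)
}

mean(covered)   # proportion of intervals containing the true slope
```

If the argument in (a) to (e) holds, the reported proportion should be close to 0.95, up to simulation error.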

4. The correlation between X and Y is estimated by:

$$r = \frac{\sum_{i=1}^{n} (y_i - \bar y)(x_i - \bar x)}{\sqrt{\sum_{i=1}^{n} (x_i - \bar x)^2 \sum_{i=1}^{n} (y_i - \bar y)^2}}$$

An alternative expression is given by

$$r = \hat\beta_1 \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar x)^2}{\sum_{i=1}^{n} (y_i - \bar y)^2}}$$

or

$$r = \hat\beta_1 \sqrt{\frac{SS_{xx}}{SS_{yy}}}$$

where $SS_{xx} = \sum_{i=1}^{n} (x_i - \bar x)^2$ and $SS_{yy} = \sum_{i=1}^{n} (y_i - \bar y)^2$ are the sums of squares of the X's and Y's, respectively. Note that $SS_{yy} = SST$, the total sum of squares. Note that $\sqrt{SS_{xx} / SS_{yy}} = s_x / s_y$, the ratio of the standard deviations of the X's and the Y's.

The correlation coefficient lies in the interval [-1, +1]. If the relationship between Y and X is perfectly linear and increasing, the correlation will be +1. If the relationship is perfectly linear and decreasing, the correlation will be -1. If there is no linear relationship between X and Y, the correlation is 0.

In the example,

$$r = \hat\beta_1 \sqrt{\frac{SS_{xx}}{SS_{yy}}} = -293.531 \sqrt{\frac{.009275}{1014.75}} = -.887$$
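The equivalence of these expressions is easy to confirm numerically. A short sketch using base R's `cor()` and `sd()`; the names `r_direct`, `r_slope` and `r_sd` are illustrative.

```r
# Ozone/soybean data from the example
x <- c(.02, .07, .11, .15)
y <- c(242, 237, 231, 201)

SXX <- sum((x - mean(x))^2)
SYY <- sum((y - mean(y))^2)
SXY <- sum((x - mean(x)) * (y - mean(y)))
b1  <- SXY / SXX

r_direct <- cor(x, y)                # built-in Pearson correlation
r_slope  <- b1 * sqrt(SXX / SYY)     # slope times sqrt(SSxx / SSyy)
r_sd     <- b1 * sd(x) / sd(y)       # slope times s_x / s_y

c(r_direct, r_slope, r_sd)
```

All three values should agree, and the negative sign reflects the decreasing relationship between yield and ozone dose.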

5. Goodness of fit of the regression line is measured by the coefficient of determination, $R^2$. For simple linear regression $R^2 = r^2$.

$$R^2 = \frac{SSR}{SST}$$

The Regression Sum of Squares (SSR) is similar to the Treatment Sum of Squares in an ANOVA problem. It is given by

$$SSR = \frac{SS_{xy}^2}{SS_{xx}}$$

Alternative ways of calculating the residual sum of squares are to use the additivity relationship (SSR + SSE = SST), or to use one of the following formulas.

$$R^2 = SSR/SST$$
$$1 - R^2 = (SST - SSR)/SST = SSE/SST$$
$$SSE = (1 - R^2)\,SST$$

$R^2$ is the fraction of the total variability in y accounted for by the linear regression line, and ranges between 0 and 1. $R^2 = 1.00$ indicates a perfect linear fit, while $R^2 = 0.00$ is a complete linear non-fit.

In the example:

$$SSR = \frac{SS_{xy}^2}{SS_{xx}} = \frac{(-2.7225)^2}{.009275} = 799.14$$
$$SST = SSR + SSE = 799.14 + 215.61 = 1014.75$$
$$R^2 = SSR/SST = 799.14 / 1014.75 = .7875$$

Note that $R^2 = r^2$, the square of the correlation coefficient. 78.8% of the variability in Y is accounted for by the regression model.

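6. Estimating the mean of Y

(a) The estimated mean of Y when $x = x^*$ is $\hat\mu_{x^*} = \hat\beta_0 + \hat\beta_1 x^*$.

(b) Its distribution is

$$\hat\mu_{x^*} = \hat\beta_0 + \hat\beta_1 x^* \sim N\left(\beta_0 + \beta_1 x^*,\; \sigma^2\left(\frac{1}{n} + \frac{(x^* - \bar x)^2}{SS_{xx}}\right)\right)$$

(c) The standard error of $\hat\mu_{x^*}$ is

$$SE_{\hat\mu_{x^*}} = \sqrt{MSE\left(\frac{1}{n} + \frac{(x^* - \bar x)^2}{SS_{xx}}\right)}$$

(d) A confidence interval for the mean $\mu_{x^*} = \beta_0 + \beta_1 x^*$ when $x = x^*$ is given by

$$\hat\mu_{x^*} \pm t_{\alpha/2,\, n-2}\, SE_{\hat\mu_{x^*}}$$

(e) e.g. A 95% confidence interval for the mean at x = 0.10 is:

When x = 0.10, the estimated mean is $\hat\mu_{.1} = 253.434 - 293.531(0.1) = 224.081$

$$SE_{\hat\mu_{.1}} = \sqrt{107.81\left(\frac{1}{4} + \frac{(0.1 - 0.0875)^2}{.009275}\right)} = 5.36$$

$t_{\alpha/2,\, n-2} = t_{.025,\,2} = 4.303$

margin of error = 4.303(5.36) ≈ 23.08

$224.081 \pm 23.08$, giving $(201.00, 247.16)$

95% confidence interval for mu at x0=.10

    x0=.10
    muhat=b0+b1*x0   # estimate of mean at x=x0
    muhat
    [1] 224.0809
    SEmu=sqrt(MSE)*sqrt(1/n+(x0-mean(x))^2/SXX)   # SE of muhat
    SEmu
    [1] 5.363546
    c(muhat-t*SEmu, muhat+t*SEmu)
    [1] 201.0034 247.1583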

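7. Predicting a New Response Value

We are now interested in predicting the value of $y^*$ at a future value $x = x^*$. In making a prediction interval for a future observation on y when $x = x^*$, we need to incorporate two sources of variation, which account for the fact that we are replacing the unknown mean by the estimate $\hat\beta_0 + \hat\beta_1 x^*$, and we are replacing the unknown standard deviation $\sigma$ by the estimate $\sqrt{MSE}$.

$$y^* - (\hat\beta_0 + \hat\beta_1 x^*) = \left(y^* - (\beta_0 + \beta_1 x^*)\right) - \left(\hat\beta_0 + \hat\beta_1 x^* - (\beta_0 + \beta_1 x^*)\right)$$

The first term in brackets on the right hand side of this expression has a $N(0, \sigma^2)$ distribution. From 6(b) above, the distribution of the second term is

$$N\left(0,\; \sigma^2\left(\frac{1}{n} + \frac{(x^* - \bar x)^2}{SS_{xx}}\right)\right)$$

As $y^*$ represents a future observation, the two terms are independent, and it follows that the distribution of $y^* - (\hat\beta_0 + \hat\beta_1 x^*)$ is

$$N\left(0,\; \sigma^2\left(1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{SS_{xx}}\right)\right)$$

(a) The predicted value of $y^*$ is given by $\hat y^* = \hat\beta_0 + \hat\beta_1 x^*$

(b) The variance of the above distribution is estimated by:

$$MSE\left(1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{SS_{xx}}\right)$$

(c) and the prediction interval for $y^*$ is given by

$$\hat\beta_0 + \hat\beta_1 x^* \pm t_{\alpha/2,\, n-2} \sqrt{MSE\left(1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{SS_{xx}}\right)}$$

(d) e.g. A 95% prediction interval for y when x = 0.10 is:

For x = 0.10, $\hat y^* = 253.434 - 293.531(0.1) = 224.081$

$$SE_{\hat y^*} = \sqrt{107.81\left(1 + \frac{1}{4} + \frac{(0.1 - 0.0875)^2}{.009275}\right)} = 11.69$$

$t_{\alpha/2,\, n-2} = t_{.025,\,2} = 4.303$

margin of error = 4.303(11.69) ≈ 50.29

$224.081 \pm 50.29$, giving $(173.79, 274.37)$

95% prediction interval for a new observation at x0=.10

    SEpred=sqrt(MSE)*sqrt(1+1/n+(x0-mean(x))^2/SXX)   # SE for predicting a new y
    c(muhat-t*SEpred, muhat+t*SEpred)
    [1] 173.7980 274.3637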
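Both the confidence interval for the mean and the prediction interval for a new observation are available from base R's `predict()` on an `lm` fit. The sketch below is a cross-check rather than the notes' own method; `fit`, `new`, `mean_int` and `pred_int` are illustrative names.

```r
# Ozone/soybean data from the example
x <- c(.02, .07, .11, .15)
y <- c(242, 237, 231, 201)

fit <- lm(y ~ x)
new <- data.frame(x = 0.10)   # value of x at which to estimate / predict

# 95% confidence interval for the mean response at x = 0.10
mean_int <- predict(fit, newdata = new, interval = "confidence", level = 0.95)

# 95% prediction interval for a single new observation at x = 0.10
pred_int <- predict(fit, newdata = new, interval = "prediction", level = 0.95)

mean_int
pred_int
```

Both calls return the same fitted value in the `fit` column; the prediction interval is wider because of the extra "1 +" term under the square root, which accounts for the variability of the new observation itself.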

