Linear Regression Models
Dr. John Mellor-Crummey
Department of Computer Science
Rice University
johnmc@cs.rice.edu
COMP 528 Lecture 9
15 February 2005

Goals for Today
Understand how to
- Use scatter diagrams to inspect the relationship between two numerical variables
- Fit a line to observations using linear regression
- Calculate and interpret a coefficient of determination
- Compute confidence intervals associated with regressions
- Verify the assumptions underlying regression analysis

Scatter Diagrams
What is a good model?
[Figure: three scatter plots of Y versus X with fitted lines. Panels: Good Model; Bad Model (wrong slope); Bad Model (non-linear behavior)]

Possible Models of a Random Variable
- Mean value observed in several trials
- A distribution that fits the observations, e.g. $y \sim N(\mu, \sigma)$
- An equation in terms of one or more independent variables

Linear Regression Analysis
Fit a linear model that predicts the value of a random variable
Examples:
- predict the time for gzip to compress a file from its size
- predict the size of a file compressed by gzip from its original size
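
Estimating Model Parameters
Given $n$ observation pairs $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
- each $x_i$ is a value of the independent variable
- each $y_i$ is a value of the dependent variable
Determine regression parameters $b_0$ and $b_1$ in
$$\hat{y} = b_0 + b_1 x$$
- $\hat{y}_i = b_0 + b_1 x_i$: predicted value of the $i$-th observation
- $e_i = y_i - \hat{y}_i$: prediction error for the $i$-th observation
[Figure: plot of Y versus X showing the error $e_i$ as the vertical distance between the observed point $(x_i, y_i)$ and the predicted point $(x_i, \hat{y}_i)$ on the fitted line]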

Fitting a Line to Data
Approach 1: Minimize sum of prediction errors
- Choose line so that $\sum_i e_i = 0$
- Many lines satisfy this equation; a better method is needed
[Figure: scatter plot of Y versus X with several candidate lines]
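
To see why a zero total error does not pin down a single line, the sketch below tries several slopes on a small hypothetical data set (the values of `xs` and `ys` are made up for illustration and are not from the lecture). For each slope it picks the intercept that makes the line pass through the point of means, and the total error comes out as zero every time.

```python
# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def total_error(b0, b1):
    """Sum of prediction errors e_i = y_i - (b0 + b1 * x_i)."""
    return sum(y - (b0 + b1 * x) for x, y in zip(xs, ys))


x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Any slope works as long as the line passes through (x_bar, y_bar).
totals = {}
for b1 in (-5.0, 0.0, 0.6, 10.0):
    b0 = y_bar - b1 * x_bar
    totals[b1] = total_error(b0, b1)
    print(f"slope {b1:5.1f}: total error = {abs(totals[b1]):.6f}")
```

Every slope in the loop yields a total error of zero up to floating-point rounding, which is why a second criterion is needed to choose among these lines.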

Fitting a Line to Data
Approach 2: Least-squares Fitting Criterion
- Minimize the sum of squares error (SSE), where
$$SSE = \sum_i e_i^2$$
- Subject to the constraint that the total error is 0
$$\sum_i e_i = 0$$
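
Deriving Coefficient $b_0$ for Least Squares
Error
$$e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)$$
Mean error
$$\bar{e} = \frac{1}{n} \sum_i e_i = \frac{1}{n} \sum_i \bigl(y_i - (b_0 + b_1 x_i)\bigr) = \bar{y} - b_0 - b_1 \bar{x}$$
Setting the mean error to 0, we obtain $b_0$
$$0 = \bar{y} - b_0 - b_1 \bar{x} \;\Rightarrow\; b_0 = \bar{y} - b_1 \bar{x}$$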

Computing Sum of Squared Errors
Error
$$e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)$$
Substituting $b_0 = \bar{y} - b_1 \bar{x}$, we get
$$e_i = y_i - (\bar{y} - b_1 \bar{x} + b_1 x_i)$$
$$SSE = \sum_i e_i^2 = \sum_i \bigl((y_i - \bar{y}) - (b_1 x_i - b_1 \bar{x})\bigr)^2$$
$$\frac{SSE}{n-1} = \frac{1}{n-1} \sum_i \bigl((y_i - \bar{y})^2 - 2 b_1 (y_i - \bar{y})(x_i - \bar{x}) + b_1^2 (x_i - \bar{x})^2\bigr)$$
$$= \frac{1}{n-1} \sum_i (y_i - \bar{y})^2 - 2 b_1 \left(\frac{1}{n-1} \sum_i (y_i - \bar{y})(x_i - \bar{x})\right) + b_1^2 \left(\frac{1}{n-1} \sum_i (x_i - \bar{x})^2\right)$$
$$= s_y^2 - 2 b_1 s_{xy}^2 + b_1^2 s_x^2$$
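
Deriving Coefficient $b_1$ for Least Squares
To find $b_1$, solve $\partial(SSE) / \partial b_1 = 0$
$$\frac{SSE}{n-1} = s_y^2 - 2 b_1 s_{xy}^2 + b_1^2 s_x^2$$
$$\frac{1}{n-1} \frac{\partial(SSE)}{\partial b_1} = -2 s_{xy}^2 + 2 b_1 s_x^2$$
$$0 = -2 s_{xy}^2 + 2 b_1 s_x^2 \;\Rightarrow\; b_1 = \frac{s_{xy}^2}{s_x^2}$$
$$b_1 = \frac{\sum_i x_i y_i - n \bar{x} \bar{y}}{\sum_i x_i^2 - n \bar{x}^2}$$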
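
The closed-form results $b_1 = (\sum x_i y_i - n \bar{x} \bar{y}) / (\sum x_i^2 - n \bar{x}^2)$ and $b_0 = \bar{y} - b_1 \bar{x}$ translate directly into code. The following is a minimal sketch; the function name `fit_line` and the sample data are assumptions made for illustration, not part of the lecture.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Slope: (sum of x*y - n * x_bar * y_bar) / (sum of x^2 - n * x_bar^2)
    b1 = (sum_xy - n * x_bar * y_bar) / (sum_xx - n * x_bar ** 2)
    # Intercept: forces the line through the point of means (x_bar, y_bar)
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = fit_line(xs, ys)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

For this data the sums are $\sum x_i y_i = 66$, $\sum x_i^2 = 55$, $\bar{x} = 3$ and $\bar{y} = 4$, so $b_1 = 6/10 = 0.6$ and $b_0 = 4 - 0.6 \cdot 3 = 2.2$.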

Allocating Variation
Quantifying variation
$$SST = \sum_i (y_i - \bar{y})^2 \quad \text{(total sum of squares)}$$
Key questions
- how much variation is unexplained?
$$SSE = \sum_i (y_i - \hat{y}_i)^2 \quad \text{(sum of squares error)}$$
- how much variation is accounted for by the regression?
$$SSR = SST - SSE \quad \text{(sum of squares regression)}$$

Coefficient of Determination
Measuring the quality of a regression model
$$\text{coefficient of determination} = R^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} = \frac{\sum_i (y_i - \bar{y})^2 - \sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
What does each of the following mean:
- $R^2 = 1$?
- $R^2 = 0$?
- $R^2 = 0.77$?
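
The quantities SST, SSE, SSR and $R^2$ can be computed in a few lines once a line has been fitted. The sketch below is illustrative only: the data are hypothetical, the function name `allocate_variation` is an assumption, and the coefficients `b0 = 2.2`, `b1 = 0.6` are the least-squares values for this particular data set.

```python
def allocate_variation(xs, ys, b0, b1):
    """Return (SST, SSE, SSR, R^2) for the line y = b0 + b1 * x."""
    y_bar = sum(ys) / len(ys)
    sst = sum((y - y_bar) ** 2 for y in ys)  # total variation
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    ssr = sst - sse  # explained by the regression
    return sst, sse, ssr, ssr / sst


# Hypothetical data; b0 and b1 are the least-squares fit for this data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
sst, sse, ssr, r2 = allocate_variation(xs, ys, b0=2.2, b1=0.6)
print(f"SST = {sst:.2f}, SSE = {sse:.2f}, SSR = {ssr:.2f}, R^2 = {r2:.2f}")
```

Here SST = 6, SSE = 2.4 and SSR = 3.6, so the regression accounts for 60% of the variation in this small sample.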

Standard Deviation of Errors
Variance of errors = SSE / (degrees of freedom)
$$s_e^2 = \frac{SSE}{n-2}$$
Why $n-2$ degrees of freedom for SSE?
- SSE is computed after calculating two regression parameters
Degrees of freedom and linear regression
$$SST = SSR + SSE$$
$$(n-1) = 1 + (n-2)$$
Variance of errors is known as the Mean Squared Error (MSE)
Standard deviation of errors for linear regression
$$s_e = \sqrt{MSE} = \sqrt{\frac{SSE}{n-2}}$$
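
As a small worked illustration of $s_e = \sqrt{SSE/(n-2)}$, the sketch below uses hypothetical data together with the least-squares coefficients for that data (`b0 = 2.2`, `b1 = 0.6`). The function name `error_std_dev` is an assumption made for this example.

```python
import math


def error_std_dev(xs, ys, b0, b1):
    """Standard deviation of errors: sqrt(SSE / (n - 2))."""
    n = len(xs)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    mse = sse / (n - 2)  # two degrees of freedom are used up by b0 and b1
    return math.sqrt(mse)


# Hypothetical data; b0 and b1 are the least-squares fit for this data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
s_e = error_std_dev(xs, ys, b0=2.2, b1=0.6)
print(f"s_e = {s_e:.4f}")
```

With $n = 5$ and SSE = 2.4, the MSE is $2.4 / 3 = 0.8$ and $s_e = \sqrt{0.8}$.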

Confidence Intervals for $b_0$ & $b_1$
Assume the population is described by a linear model
$$y = \beta_0 + \beta_1 x$$
- $b_0$ and $b_1$ are estimates of $\beta_0$ and $\beta_1$ from a single sample
- Other samples might yield different estimates
How accurate are $b_0$ and $b_1$?
- compute confidence intervals at the $100(1-\alpha)\%$ confidence level
$$b_0 \pm t_{[1-\alpha/2;\, n-2]}\, s_{b_0} \qquad b_1 \pm t_{[1-\alpha/2;\, n-2]}\, s_{b_1}$$
where
$$s_{b_0} = s_e \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_i x_i^2 - n \bar{x}^2} \right]^{1/2} \qquad s_{b_1} = \frac{s_e}{\left[ \sum_i x_i^2 - n \bar{x}^2 \right]^{1/2}}$$
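
The interval formulas can be sketched as follows. Everything specific here is an assumption for illustration: the data are hypothetical, the function name `fit_with_std_errors` is made up, and the constant `T_975_3 = 3.182` is the $t_{[0.975;\,3]}$ quantile taken from a standard t table (95% confidence with $n - 2 = 3$ degrees of freedom) rather than computed.

```python
import math


def fit_with_std_errors(xs, ys):
    """Return (b0, b1, s_b0, s_b1) for a least-squares line fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    b1 = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))
    s_b0 = s_e * math.sqrt(1 / n + x_bar ** 2 / sxx)
    s_b1 = s_e / math.sqrt(sxx)
    return b0, b1, s_b0, s_b1


# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Assumed table value: t quantile at 0.975 with n - 2 = 3 degrees of freedom.
T_975_3 = 3.182

b0, b1, s_b0, s_b1 = fit_with_std_errors(xs, ys)
ci_b0 = (b0 - T_975_3 * s_b0, b0 + T_975_3 * s_b0)
ci_b1 = (b1 - T_975_3 * s_b1, b1 + T_975_3 * s_b1)
print(f"b0 = {b0:.3f}, 95% CI = ({ci_b0[0]:.3f}, {ci_b0[1]:.3f})")
print(f"b1 = {b1:.3f}, 95% CI = ({ci_b1[0]:.3f}, {ci_b1[1]:.3f})")
```

For this tiny sample the interval for $b_1$ contains 0, so at the 95% level the data do not establish that the slope differs from zero.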

Confidence Intervals for Predictions
Use the regression model for $y$ to predict for new $x$ values
$$\hat{y}_p = b_0 + b_1 x_p$$
- This is the mean value of the predicted response based on the sample
Standard deviation of the mean of a future sample of $m$ observations at $x_p$
$$s_{\hat{y}_{mp}} = s_e \left[ \frac{1}{m} + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_i x_i^2 - n \bar{x}^2} \right]^{1/2}$$
- For 1 observation, $1/m = 1$; for $\infty$ observations, $1/m \to 0$
Confidence interval for $m$ future predictions at $x_p$
$$\hat{y}_p \pm t_{[1-\alpha/2;\, n-2]}\, s_{\hat{y}_{mp}}$$
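
A minimal sketch of the prediction standard deviation follows, showing how it changes with the number of future observations $m$ and with the distance of $x_p$ from $\bar{x}$. The data and the function name `prediction_std_dev` are assumptions made for illustration; `math.inf` stands in for an unbounded number of future observations, since `1 / math.inf` evaluates to `0.0` in Python.

```python
import math


def prediction_std_dev(xs, ys, x_p, m):
    """Std. deviation of the mean of m future observations at x_p."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    b1 = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))
    return s_e * math.sqrt(1 / m + 1 / n + (x_p - x_bar) ** 2 / sxx)


# Hypothetical data, chosen only for illustration (x_bar = 3).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

at_mean_single = prediction_std_dev(xs, ys, x_p=3, m=1)
at_mean_many = prediction_std_dev(xs, ys, x_p=3, m=math.inf)
far_single = prediction_std_dev(xs, ys, x_p=10, m=1)
print(f"x_p = 3,  m = 1:   {at_mean_single:.4f}")
print(f"x_p = 3,  m = inf: {at_mean_many:.4f}")
print(f"x_p = 10, m = 1:   {far_single:.4f}")
```

The value shrinks as $m$ grows and increases as $x_p$ moves away from $\bar{x}$; multiplying it by the appropriate $t$ quantile gives the half-width of the confidence interval.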
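
Facts about Confidence in Predictions
$$s_{\hat{y}_{mp}} = s_e \left[ \frac{1}{m} + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_i x_i^2 - n \bar{x}^2} \right]^{1/2}$$
$$\hat{y}_p \pm t_{[1-\alpha/2;\, n-2]}\, s_{\hat{y}_{mp}}$$
- The confidence interval for predictions is tightest at $x_p = \bar{x}$
- Be cautious in making predictions far from $\bar{x}$
[Figure: plot of Y versus X showing the regression line (mean) between an upper confidence bound and a lower confidence bound; the bounds are closest together at $\bar{x}$ and widen away from it]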

Assumptions of Linear Regression
When deriving regression parameters, we make the following four assumptions
1. The predictor $x$ is non-stochastic and is measured error-free
2. The true relationship between $y$ and predictor $x$ is linear
3. The model errors are statistically independent
4. The errors are normally distributed with a 0 mean and constant std. deviation
If any of the assumptions is violated, the model would be misleading
Apply visual tests to verify that assumptions 2-4 hold

2. Test linear relationship of y and x
Use a scatter plot of $y$ versus $x$
[Figure: four scatter plots of Y versus X. Panels: linear; multilinear; outlier; nonlinear]

3. Errors are independent
- Plot $e_i$ versus $\hat{y}_i$ and verify that there is no trend
[Figure: scatter plot of $e_i$ versus $\hat{y}_i$]
- Plot error as a function of experiment number and verify that there is no trend
  - any trend would indicate that some factor not accounted for affected the observed values
[Figure: scatter plot of $e_i$ versus experiment number $i$]
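
The two plots need only the residuals, the predicted values and the experiment numbers. The sketch below tabulates those series for a hypothetical experiment in which an unmodeled factor drifts over time: the observations were generated as $y = 2x + 0.5\,i$, where $i$ is the experiment number. The data and all names are assumptions for illustration, and the table is a text stand-in for the plots.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    b1 = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical data listed in the order the experiments were run.
# Each y was generated as 2 * x + 0.5 * i, so y drifts with experiment number i.
xs = [1, 2, 3, 1, 2, 3]
ys = [2.5, 5.0, 7.5, 4.0, 6.5, 9.0]

b0, b1 = fit_line(xs, ys)
predicted = [b0 + b1 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

# Series to plot: e_i versus y_hat_i, and e_i versus experiment number i.
print(" i   y_hat     e_i")
for i, (y_hat, e) in enumerate(zip(predicted, residuals), start=1):
    print(f"{i:2d}  {y_hat:6.2f}  {e:6.2f}")
```

In this example the residuals are negative for the first three experiments and positive for the last three, which is the kind of trend against experiment number that signals an unaccounted-for factor.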
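
4. Errors are normally distributed
- Use a quantile-quantile plot of $e_i$ versus $N(0,1)$
[Figure: quantile-quantile plot with residual quantile on the vertical axis and normal quantile on the horizontal axis]
- Check for constant standard deviation of errors by verifying that there is no trend in the spread in a plot of $e_i$ versus $\hat{y}_i$
[Figure: two scatter plots of $e_i$ versus $\hat{y}_i$. Panels: no trend in spread; increasing spread]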
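
The points of a quantile-quantile plot can be computed with the standard library alone. This sketch makes two assumptions that are not from the lecture: the residuals are hypothetical, and the $i$-th smallest residual is paired with the standard normal quantile at probability $(i - 0.5)/n$, which is one common plotting-position convention among several.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Pair each sorted residual with a standard normal quantile."""
    n = len(residuals)
    std_normal = NormalDist()  # N(0, 1)
    ordered = sorted(residuals)
    # Assumed plotting position: (i - 0.5) / n for the i-th smallest residual.
    return [
        (std_normal.inv_cdf((i - 0.5) / n), e)
        for i, e in enumerate(ordered, start=1)
    ]


# Hypothetical residuals from a fitted line.
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]
points = qq_points(residuals)
for normal_q, residual_q in points:
    print(f"normal quantile {normal_q:6.3f}   residual quantile {residual_q:5.2f}")
```

If the errors are approximately normal, these points fall close to a straight line when plotted; with only five residuals the plot is too sparse to support a firm conclusion.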