Chapter 2 Simple Linear Regression



S.J. Sheather, A Modern Approach to Regression with R, Springer Science + Business Media LLC 2009

2.1 Introduction and Least Squares Estimates

Regression analysis is a method for investigating the functional relationship among variables. In this chapter we consider problems involving modeling the relationship between two variables. These problems are commonly referred to as simple linear regression or straight-line regression. In later chapters we shall consider problems involving modeling the relationship between three or more variables. In particular we next consider problems involving modeling the relationship between two variables as a straight line, that is, when Y is modeled as a linear function of X.

Example: A regression model for the timing of production runs

We shall consider the following example taken from Foster, Stine and Waterman (1997, pages 192–199) throughout this chapter. The original data are in the form of the time taken (in minutes) for a production run, Y, and the number of items produced, X, for 20 randomly selected orders as supervised by three managers. At this stage we shall only consider the data for one of the managers (see Table 2.1 and Figure 2.1). We wish to develop an equation to model the relationship between Y, the run time, and X, the run size. A scatter plot of the data like that given in Figure 2.1 should ALWAYS be drawn to obtain an idea of the sort of relationship that exists between two variables (e.g., linear, quadratic, exponential, etc.).

2.1.1 Simple Linear Regression Models

When data are collected in pairs the standard notation used to designate this is

\[ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \]

where \(x_1\) denotes the first value of the so-called X-variable and \(y_1\) denotes the first value of the so-called Y-variable. The X-variable is called the explanatory or predictor variable, while the Y-variable is called the response variable or the dependent variable. The X-variable often has a different status to the Y-variable in that:

- It can be thought of as a potential predictor of the Y-variable
- Its value can sometimes be chosen by the person undertaking the study

Table 2.1 Production data (production.txt). Columns: Case, Run time, Run size.

Figure 2.1 A scatter plot of the production data (Run Time against Run Size)

Simple linear regression is typically used to model the relationship between two variables Y and X so that given a specific value of X, that is, X = x, we can predict the value of Y. Mathematically, the regression of a random variable Y on a random variable X is

\[ E(Y \mid X = x), \]

the expected value of Y when X takes the specific value x. For example, if X = Day of the week and Y = Sales at a given company, then the regression of Y on X represents the mean (or average) sales on a given day. The regression of Y on X is linear if

\[ E(Y \mid X = x) = \beta_0 + \beta_1 x \tag{2.1} \]

where the unknown parameters \(\beta_0\) and \(\beta_1\) determine the intercept and the slope of a specific straight line, respectively. Suppose that \(Y_1, Y_2, \ldots, Y_n\) are independent realizations of the random variable Y that are observed at the values \(x_1, x_2, \ldots, x_n\) of a random variable X. If the regression of Y on X is linear, then for \(i = 1, 2, \ldots, n\)

\[ Y_i = E(Y \mid X = x_i) + e_i = \beta_0 + \beta_1 x_i + e_i \]

where \(e_i\) is the random error in \(Y_i\) and is such that \(E(e \mid X) = 0\).

The random error term is there since there will almost certainly be some variation in Y due strictly to random phenomena that cannot be predicted or explained. In other words, all unexplained variation is called random error. Thus, the random error term does not depend on x, nor does it contain any information about Y (otherwise it would be a systematic error).

We shall begin by assuming that

\[ \mathrm{Var}(Y \mid X = x) = \sigma^2. \tag{2.2} \]

In Chapter 4 we shall see how this last assumption can be relaxed.

Estimating the population slope and intercept

Suppose for example that X = height and Y = weight of a randomly selected individual from some population, then for a straight line regression model the mean weight of individuals of a given height would be a linear function of that height. In practice, we usually have a sample of data instead of the whole population. The slope \(\beta_1\) and intercept \(\beta_0\) are unknown, since these are the values for the whole population. Thus, we wish to use the given data to estimate the slope and the intercept. This can be achieved by finding the equation of the line which best fits our data, that is, choose \(b_0\) and \(b_1\) such that \(\hat{y}_i = b_0 + b_1 x_i\) is as close as possible to \(y_i\). Here the notation \(\hat{y}_i\) is used to denote the value of the line of best fit in order to distinguish it from the observed values of y, that is, \(y_i\). We shall refer to \(\hat{y}_i\) as the ith predicted value or the fitted value of \(y_i\).

Residuals

In practice, we wish to minimize the difference between the actual value of y (\(y_i\)) and the predicted value of y (\(\hat{y}_i\)). This difference is called the residual, \(\hat{e}_i\), that is,

\[ \hat{e}_i = y_i - \hat{y}_i. \]

Figure 2.2 shows a hypothetical situation based on six data points. Marked on this plot is a line of best fit, \(\hat{y}_i\), along with the residuals.

Least squares line of best fit

A very popular method of choosing \(b_0\) and \(b_1\) is called the method of least squares. As the name suggests \(b_0\) and \(b_1\) are chosen to minimize the sum of squared residuals (or residual sum of squares [RSS]),

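\[ \mathrm{RSS} = \sum_{i=1}^{n} \hat{e}_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2. \]

Figure 2.2 A scatter plot of data with a line of best fit and the residuals identified

For RSS to be a minimum with respect to \(b_0\) and \(b_1\) we require

\[ \frac{\partial\, \mathrm{RSS}}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0 \]

and

\[ \frac{\partial\, \mathrm{RSS}}{\partial b_1} = -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0. \]

Rearranging terms in these last two equations gives

\[ \sum_{i=1}^{n} y_i = b_0 n + b_1 \sum_{i=1}^{n} x_i \]

and

\[ \sum_{i=1}^{n} x_i y_i = b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2. \]

These last two equations are called the normal equations. Solving these equations for \(b_0\) and \(b_1\) gives the so-called least squares estimates of the intercept

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{2.3} \]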

and the slope

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\mathrm{SXY}}{\mathrm{SXX}}. \tag{2.4} \]

Regression Output from R

The least squares estimates for the production data were calculated using R, giving the following results:

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 149.74770    8.32815   17.98 6.00e-13 ***
    RunSize       0.25924    0.03714    6.98 1.61e-06 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 16.25 on 18 degrees of freedom
    Multiple R-Squared: 0.7302, Adjusted R-squared: 0.7152
    F-statistic: 48.72 on 1 and 18 DF, p-value: 1.615e-06

The least squares line of best fit for the production data

Figure 2.3 shows a scatter plot of the production data with the least squares line of best fit. The equation of the least squares line of best fit is

\[ y = 149.7 + 0.26x. \]

Let us look at the results that we have obtained from the line of best fit in Figure 2.3. The intercept in Figure 2.3 is 149.7, which is where the line of best fit crosses the run time axis. The slope of the line in Figure 2.3 is 0.26. Thus, we say that each additional unit to be produced is predicted to add 0.26 minutes to the run time. The intercept in the model has the following interpretation: for any production run, the average set up time is 149.7 minutes.

Estimating the variance of the random error term

Consider the linear regression model with constant variance given by (2.1) and (2.2). In this case,

\[ Y_i = \beta_0 + \beta_1 x_i + e_i \quad (i = 1, 2, \ldots, n) \]

where the random error \(e_i\) has mean 0 and variance \(\sigma^2\). We wish to estimate \(\sigma^2 = \mathrm{Var}(e)\). Notice that

\[ e_i = Y_i - (\beta_0 + \beta_1 x_i) = Y_i - \text{unknown regression line at } x_i. \]

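Figure 2.3 A plot of the production data with the least squares line of best fit (Run Time against Run Size)

Since \(\beta_0\) and \(\beta_1\) are unknown all we can do is estimate these errors by replacing \(\beta_0\) and \(\beta_1\) by their respective least squares estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\), giving the residuals

\[ \hat{e}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) = Y_i - \text{estimated regression line at } x_i. \]

These residuals can be used to estimate \(\sigma^2\). In fact it can be shown that

\[ S^2 = \frac{\mathrm{RSS}}{n-2} = \frac{1}{n-2} \sum_{i=1}^{n} \hat{e}_i^2 \]

is an unbiased estimate of \(\sigma^2\). Two points to note are:

1. \(\bar{\hat{e}} = 0\) (since \(\sum \hat{e}_i = 0\) as the least squares estimates minimize \(\mathrm{RSS} = \sum \hat{e}_i^2\))
2. The divisor in \(S^2\) is \(n - 2\) since we have estimated two parameters, namely \(\beta_0\) and \(\beta_1\).

2.2 Inferences About the Slope and the Intercept

In this section, we shall develop methods for finding confidence intervals and for performing hypothesis tests about the slope and the intercept of the regression line.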
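The least squares formulas (2.3) and (2.4) and the estimate S of the error standard deviation from Section 2.1 can be sketched in a few lines of code. This is a minimal illustration in Python rather than R; the function name `least_squares_fit` and the five-point data set are hypothetical and are not the production data.

```python
from math import sqrt


def least_squares_fit(x, y):
    """Return (b0_hat, b1_hat, s) for the straight-line model y = b0 + b1*x + e."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # SXX and SXY as defined in equation (2.4)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1_hat = sxy / sxx                 # slope, equation (2.4)
    b0_hat = y_bar - b1_hat * x_bar    # intercept, equation (2.3)
    # Residuals: observed value minus fitted value
    residuals = [yi - (b0_hat + b1_hat * xi) for xi, yi in zip(x, y)]
    rss = sum(e ** 2 for e in residuals)
    s = sqrt(rss / (n - 2))            # divisor n - 2: two parameters estimated
    return b0_hat, b1_hat, s


# Hypothetical toy data, chosen only so the arithmetic is easy to follow
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b0_hat, b1_hat, s = least_squares_fit(x, y)
print(b0_hat, b1_hat, s ** 2)
```

For these toy values SXX = 10 and SXY = 6, so the slope is 0.6, the intercept is 4 − 0.6 × 3 = 2.2, and RSS = 2.4 gives S² = 2.4/3 = 0.8.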

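2.2.1 Assumptions Necessary in Order to Make Inferences About the Regression Model

Throughout this section we shall make the following assumptions:

1. Y is related to x by the simple linear regression model \(Y_i = \beta_0 + \beta_1 x_i + e_i\) \((i = 1, \ldots, n)\), i.e., \(E(Y \mid X = x_i) = \beta_0 + \beta_1 x_i\)
2. The errors \(e_1, e_2, \ldots, e_n\) are independent of each other
3. The errors \(e_1, e_2, \ldots, e_n\) have a common variance \(\sigma^2\)
4. The errors are normally distributed with a mean of 0 and variance \(\sigma^2\), that is, \(e \mid X \sim N(0, \sigma^2)\)

Methods for checking these four assumptions will be considered in Chapter 3. In addition, since the regression model is conditional on X we can assume that the values of the predictor variable, \(x_1, x_2, \ldots, x_n\), are known fixed constants.

2.2.2 Inferences About the Slope of the Regression Line

Recall from (2.4) that the least squares estimate of \(\beta_1\) is given by

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\mathrm{SXY}}{\mathrm{SXX}}. \]

Since \(\sum_{i=1}^{n} (x_i - \bar{x}) = 0\) we find that

\[ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x}) y_i - \bar{y} \sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} (x_i - \bar{x}) y_i. \]

Thus, we can rewrite \(\hat{\beta}_1\) as

\[ \hat{\beta}_1 = \sum_{i=1}^{n} c_i y_i \quad \text{where} \quad c_i = \frac{x_i - \bar{x}}{\mathrm{SXX}}. \tag{2.5} \]

We shall see that this version of \(\hat{\beta}_1\) will be used whenever we study its theoretical properties. Under the above assumptions, we shall show in Section 2.7 that

\[ E(\hat{\beta}_1 \mid X) = \beta_1 \tag{2.6} \]

\[ \mathrm{Var}(\hat{\beta}_1 \mid X) = \frac{\sigma^2}{\mathrm{SXX}} \tag{2.7} \]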

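\[ \hat{\beta}_1 \mid X \sim N\!\left(\beta_1, \frac{\sigma^2}{\mathrm{SXX}}\right) \tag{2.8} \]

Note that in (2.7) the variance of the least squares slope estimate decreases as SXX increases (i.e., as the variability in the X's increases). This is an important fact to note if the experimenter has control over the choice of the values of the X variable.

Standardizing (2.8) gives

\[ Z = \frac{\hat{\beta}_1 - \beta_1}{\sigma / \sqrt{\mathrm{SXX}}} \sim N(0, 1). \]

If \(\sigma\) were known then we could use Z to test hypotheses and find confidence intervals for \(\beta_1\). When \(\sigma\) is unknown (as is usually the case), replacing \(\sigma\) by S, the standard deviation of the residuals, results in

\[ T = \frac{\hat{\beta}_1 - \beta_1}{S / \sqrt{\mathrm{SXX}}} = \frac{\hat{\beta}_1 - \beta_1}{\mathrm{se}(\hat{\beta}_1)} \]

where \(\mathrm{se}(\hat{\beta}_1) = S / \sqrt{\mathrm{SXX}}\) is the estimated standard error (se) of \(\hat{\beta}_1\), which is given directly by R. In the production example the X-variable is RunSize and so \(\mathrm{se}(\hat{\beta}_1) = 0.03714\).

It can be shown that under the above assumptions T has a t-distribution with \(n - 2\) degrees of freedom, that is

\[ T = \frac{\hat{\beta}_1 - \beta_1}{\mathrm{se}(\hat{\beta}_1)} \sim t_{n-2}. \]

Notice that the degrees of freedom satisfies the following formula:

degrees of freedom = sample size − number of mean parameters estimated.

In this case we are estimating two such parameters, namely, \(\beta_0\) and \(\beta_1\).

For testing the hypothesis \(H_0: \beta_1 = \beta_1^0\) the test statistic is

\[ T = \frac{\hat{\beta}_1 - \beta_1^0}{\mathrm{se}(\hat{\beta}_1)} \sim t_{n-2} \quad \text{when } H_0 \text{ is true.} \]

R provides the value of T and the p-value associated with testing \(H_0: \beta_1 = 0\) against \(H_A: \beta_1 \neq 0\) (i.e., for the choice \(\beta_1^0 = 0\)). In the production example the X-variable is RunSize and T = 6.98, which results in a p-value less than 0.0001.

A \(100(1 - \alpha)\%\) confidence interval for \(\beta_1\), the slope of the regression line, is given by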

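\[ \left( \hat{\beta}_1 - t(\alpha/2, n-2)\, \mathrm{se}(\hat{\beta}_1),\ \hat{\beta}_1 + t(\alpha/2, n-2)\, \mathrm{se}(\hat{\beta}_1) \right) \]

where \(t(\alpha/2, n-2)\) is the \(100(1 - \alpha/2)\)th quantile of the t-distribution with \(n - 2\) degrees of freedom.

In the production example the X-variable is RunSize and \(\hat{\beta}_1 = 0.25924\), \(\mathrm{se}(\hat{\beta}_1) = 0.03714\), \(t(0.025, 20 - 2 = 18) = 2.1009\). Thus a 95% confidence interval for \(\beta_1\) is given by

\[ (0.25924 \pm 2.1009 \times 0.03714) = (0.25924 \pm 0.07803) = (0.181, 0.337). \]

2.2.3 Inferences About the Intercept of the Regression Line

Recall from (2.3) that the least squares estimate of \(\beta_0\) is given by

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. \]

Under the assumptions given previously we shall show in Section 2.7 that

\[ E(\hat{\beta}_0 \mid X) = \beta_0 \tag{2.9} \]

\[ \mathrm{Var}(\hat{\beta}_0 \mid X) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\mathrm{SXX}} \right) \tag{2.10} \]

\[ \hat{\beta}_0 \mid X \sim N\!\left( \beta_0,\ \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\mathrm{SXX}} \right) \right) \tag{2.11} \]

Standardizing (2.11) gives

\[ Z = \frac{\hat{\beta}_0 - \beta_0}{\sigma \sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\mathrm{SXX}}}} \sim N(0, 1). \]

If \(\sigma\) were known then we could use Z to test hypotheses and find confidence intervals for \(\beta_0\). When \(\sigma\) is unknown (as is usually the case), replacing \(\sigma\) by S results in

\[ T = \frac{\hat{\beta}_0 - \beta_0}{S \sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\mathrm{SXX}}}} = \frac{\hat{\beta}_0 - \beta_0}{\mathrm{se}(\hat{\beta}_0)} \sim t_{n-2} \]

where \(\mathrm{se}(\hat{\beta}_0) = S \sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\mathrm{SXX}}}\) is the estimated standard error of \(\hat{\beta}_0\), which is given directly by R. In the production example the intercept is called Intercept and so \(\mathrm{se}(\hat{\beta}_0) = 8.32815\).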
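The arithmetic behind the slope interval and the t-statistic can be checked directly from the summary numbers. The sketch below is Python rather than R; it takes the production-example values \(\hat{\beta}_1 = 0.25924\), \(\mathrm{se}(\hat{\beta}_1) = 0.03714\) and \(t(0.025, 18) = 2.1009\) as given inputs, and the helper names `t_interval` and `t_statistic` are hypothetical.

```python
def t_interval(estimate, std_error, t_crit):
    """Two-sided interval: estimate -/+ t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin


def t_statistic(estimate, std_error, null_value=0.0):
    """T = (estimate - null value) / standard error."""
    return (estimate - null_value) / std_error


# Values quoted for the production example (RunSize coefficient)
b1_hat = 0.25924   # least squares slope
se_b1 = 0.03714    # estimated standard error of the slope
t_crit = 2.1009    # t(0.025, 18), supplied here as a table value

lower, upper = t_interval(b1_hat, se_b1, t_crit)
t_stat = t_statistic(b1_hat, se_b1)  # tests H0: beta1 = 0
print(round(lower, 3), round(upper, 3), round(t_stat, 2))
```

The same two helpers apply unchanged to the intercept once its estimate and standard error are substituted, because both intervals share the form estimate ± t × se.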

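For testing the hypothesis \(H_0: \beta_0 = \beta_0^0\) the test statistic is

\[ T = \frac{\hat{\beta}_0 - \beta_0^0}{\mathrm{se}(\hat{\beta}_0)} \sim t_{n-2} \quad \text{when } H_0 \text{ is true.} \]

R provides the value of T and the p-value associated with testing \(H_0: \beta_0 = 0\) against \(H_A: \beta_0 \neq 0\). In the production example the intercept is called Intercept and T = 17.98, which results in a p-value < 0.0001.

A \(100(1 - \alpha)\%\) confidence interval for \(\beta_0\), the intercept of the regression line, is given by

\[ \left( \hat{\beta}_0 - t(\alpha/2, n-2)\, \mathrm{se}(\hat{\beta}_0),\ \hat{\beta}_0 + t(\alpha/2, n-2)\, \mathrm{se}(\hat{\beta}_0) \right) \]

where \(t(\alpha/2, n-2)\) is the \(100(1 - \alpha/2)\)th quantile of the t-distribution with \(n - 2\) degrees of freedom.

In the production example, \(\hat{\beta}_0 = 149.7477\), \(\mathrm{se}(\hat{\beta}_0) = 8.32815\), \(t(0.025, 20 - 2 = 18) = 2.1009\). Thus a 95% confidence interval for \(\beta_0\) is given by

\[ (149.7477 \pm 2.1009 \times 8.32815) = (149.7477 \pm 17.497) = (132.3, 167.2). \]

Regression Output from R: 95% confidence intervals

                   2.5%   97.5%
    (Intercept) 132.251 167.244
    RunSize       0.181   0.337

2.3 Confidence Intervals for the Population Regression Line

In this section we consider the problem of finding a confidence interval for the unknown population regression line at a given value of X, which we shall denote by \(x^*\). First, recall from (2.1) that the population regression line at \(X = x^*\) is given by

\[ E(Y \mid X = x^*) = \beta_0 + \beta_1 x^*. \]

An estimator of this unknown quantity is the value of the estimated regression equation at \(X = x^*\), namely,

\[ \hat{y}^* = \hat{\beta}_0 + \hat{\beta}_1 x^*. \]

Under the assumptions stated previously, it can be shown that

\[ E(\hat{y}^*) = E(\hat{y} \mid X = x^*) = \beta_0 + \beta_1 x^* \tag{2.12} \]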

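\[ \mathrm{Var}(\hat{y}^*) = \mathrm{Var}(\hat{y} \mid X = x^*) = \sigma^2 \left( \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\mathrm{SXX}} \right) \tag{2.13} \]

\[ \hat{y}^* = \hat{y} \mid X = x^* \sim N\!\left( \beta_0 + \beta_1 x^*,\ \sigma^2 \left( \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\mathrm{SXX}} \right) \right) \tag{2.14} \]

Standardizing (2.14) gives

\[ Z = \frac{\hat{y}^* - (\beta_0 + \beta_1 x^*)}{\sigma \sqrt{\dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{\mathrm{SXX}}}} \sim N(0, 1). \]

Replacing \(\sigma\) by S results in

\[ T = \frac{\hat{y}^* - (\beta_0 + \beta_1 x^*)}{S \sqrt{\dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{\mathrm{SXX}}}} \sim t_{n-2}. \]

A \(100(1 - \alpha)\%\) confidence interval for \(E(Y \mid X = x^*) = \beta_0 + \beta_1 x^*\), the population regression line at \(X = x^*\), is given by

\[ \hat{y}^* \pm t(\alpha/2, n-2)\, S \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\mathrm{SXX}}} = \hat{\beta}_0 + \hat{\beta}_1 x^* \pm t(\alpha/2, n-2)\, S \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\mathrm{SXX}}} \]

where \(t(\alpha/2, n-2)\) is the \(100(1 - \alpha/2)\)th quantile of the t-distribution with \(n - 2\) degrees of freedom.

2.4 Prediction Intervals for the Actual Value of Y

In this section we consider the problem of finding a prediction interval for the actual value of Y at \(x^*\), a given value of X.

Important Notes:

1. \(E(Y \mid X = x^*)\), the expected value or average value of Y for a given value \(x^*\) of X, is what one would expect Y to be in the long run when \(X = x^*\). \(E(Y \mid X = x^*)\) is therefore a fixed but unknown quantity whereas Y can take a number of values when \(X = x^*\).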

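2. \(E(Y \mid X = x^*)\), the value of the regression line at \(X = x^*\), is entirely different from \(Y^*\), a single value of Y when \(X = x^*\). In particular, \(Y^*\) need not lie on the population regression line.
3. A confidence interval is always reported for a parameter (e.g., \(E(Y \mid X = x^*) = \beta_0 + \beta_1 x^*\)) and a prediction interval is reported for the value of a random variable (e.g., \(Y^*\)).

We base our prediction of Y when \(X = x^*\) (that is, of \(Y^*\)) on

\[ \hat{y}^* = \hat{\beta}_0 + \hat{\beta}_1 x^*. \]

The error in our prediction is

\[ Y^* - \hat{y}^* = \beta_0 + \beta_1 x^* + e^* - \hat{y}^* = E(Y \mid X = x^*) - \hat{y}^* + e^*, \]

that is, the deviation between \(E(Y \mid X = x^*)\) and \(\hat{y}^*\) plus the random fluctuation \(e^*\) (which represents the deviation of \(Y^*\) from \(E(Y \mid X = x^*)\)). Thus the variability in the error for predicting a single value of Y will exceed the variability for estimating the expected value of Y (because of the random error \(e^*\)).

It can be shown that under the previously stated assumptions

\[ E(Y^* - \hat{y}^*) = E(Y - \hat{y} \mid X = x^*) = 0 \tag{2.15} \]

\[ \mathrm{Var}(Y^* - \hat{y}^*) = \mathrm{Var}(Y - \hat{y} \mid X = x^*) = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\mathrm{SXX}} \right] \tag{2.16} \]

\[ Y^* - \hat{y}^* \sim N\!\left( 0,\ \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\mathrm{SXX}} \right] \right) \tag{2.17} \]

Standardizing (2.17) and replacing \(\sigma\) by S gives

\[ T = \frac{Y^* - \hat{y}^*}{S \sqrt{1 + \dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{\mathrm{SXX}}}} \sim t_{n-2}. \]

A \(100(1 - \alpha)\%\) prediction interval for \(Y^*\), the value of Y at \(X = x^*\), is given by

\[ \hat{y}^* \pm t(\alpha/2, n-2)\, S \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\mathrm{SXX}}} = \hat{\beta}_0 + \hat{\beta}_1 x^* \pm t(\alpha/2, n-2)\, S \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\mathrm{SXX}}} \]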

where \(t(\alpha/2, n-2)\) is the \(100(1 - \alpha/2)\)th quantile of the t-distribution with \(n - 2\) degrees of freedom.

Regression Output from R

Ninety-five percent confidence intervals for the population regression line (i.e., the average RunTime) at RunSize = 50, 100, 150, 200, 250, 300, 350 are given in a table with columns fit, lwr, upr.

Ninety-five percent prediction intervals for the actual value of Y (i.e., the actual RunTime) at RunSize = 50, 100, 150, 200, 250, 300, 350 are given in a table with columns fit, lwr, upr.

Notice that each prediction interval is considerably wider than the corresponding confidence interval, as is expected.

2.5 Analysis of Variance

There is a linear association between Y and x if

\[ Y = \beta_0 + \beta_1 x + e \]

and \(\beta_1 \neq 0\). If we knew that \(\beta_1 \neq 0\) then we would predict Y by

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x. \]

On the other hand, if we knew that \(\beta_1 = 0\) then we would predict Y by

\[ \hat{y} = \bar{y}. \]

To test whether there is a linear association between Y and X we have to test

\[ H_0: \beta_1 = 0 \quad \text{against} \quad H_A: \beta_1 \neq 0. \]
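The difference between the confidence interval of Section 2.3 and the prediction interval of Section 2.4 is the extra "1 +" under the square root. The following Python sketch makes that explicit; the function name `interval_at`, the toy data and the supplied table value \(t(0.025, 3) = 3.1824\) are assumptions made for illustration and are unrelated to the production data.

```python
from math import sqrt


def interval_at(x, y, x_star, t_crit, prediction=False):
    """Return (lower, fit, upper) at x_star for a straight-line fit.

    prediction=False: interval for E(Y | X = x_star), the regression line.
    prediction=True : interval for Y*, a single new value of Y at x_star.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s = sqrt(rss / (n - 2))
    fit = b0 + b1 * x_star
    # The prediction interval adds 1 for the new observation's own error e*
    extra = 1.0 if prediction else 0.0
    half_width = t_crit * s * sqrt(extra + 1.0 / n + (x_star - x_bar) ** 2 / sxx)
    return fit - half_width, fit, fit + half_width


# Hypothetical toy data; n = 5, so the t quantile has 3 degrees of freedom
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
t_crit = 3.1824  # t(0.025, 3), supplied as a table value

ci = interval_at(x, y, x_star=3.0, t_crit=t_crit)
pi = interval_at(x, y, x_star=3.0, t_crit=t_crit, prediction=True)
print(ci, pi)
```

Both intervals are centered on the same fitted value; only the half-width changes, and both are narrowest at \(x^* = \bar{x}\).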

We can perform this test using the following t-statistic

\[ T = \frac{\hat{\beta}_1 - 0}{\mathrm{se}(\hat{\beta}_1)} \sim t_{n-2} \quad \text{when } H_0 \text{ is true.} \]

We next look at a different test statistic which can be used when there is more than one predictor variable, that is, in multiple regression. First, we introduce some terminology.

Define the total corrected sum of squares of the Y's by

\[ \mathrm{SST} = \mathrm{SYY} = \sum_{i=1}^{n} (y_i - \bar{y})^2. \]

Recall that the residual sum of squares is given by

\[ \mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. \]

Define the regression sum of squares (i.e., sum of squares explained by the regression model) by

\[ \mathrm{SSreg} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2. \]

It is clear that SSreg is close to zero if for each i, \(\hat{y}_i\) is close to \(\bar{y}\), while SSreg is large if \(\hat{y}_i\) differs from \(\bar{y}\) for most values of x.

We next look at the hypothetical situation in Figure 2.4 with just a single data point \((x_i, y_i)\) shown along with the least squares regression line and the mean of y based on all n data points. It is apparent from Figure 2.4 that

\[ y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}). \]

Further, it can be shown that

\[ \mathrm{SST} = \mathrm{SSreg} + \mathrm{RSS} \]

Total sample variability = Variability explained by the model + Unexplained (or error) variability

See exercise 6 in Section 2.8 for details.

If \(Y = \beta_0 + \beta_1 x + e\) and \(\beta_1 \neq 0\) then RSS should be small and SSreg should be close to SST. But how small is small and how close is close?

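Figure 2.4 Graphical depiction that \(y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\)

To test

\[ H_0: \beta_1 = 0 \quad \text{against} \quad H_A: \beta_1 \neq 0 \]

we can use the test statistic

\[ F = \frac{\mathrm{SSreg}/1}{\mathrm{RSS}/(n-2)} \]

since RSS has \((n - 2)\) degrees of freedom and SSreg has 1 degree of freedom.

Under the assumption that \(e_1, e_2, \ldots, e_n\) are independent and normally distributed with mean 0 and variance \(\sigma^2\), it can be shown that F has an F distribution with 1 and \(n - 2\) degrees of freedom when \(H_0\) is true, that is,

\[ F = \frac{\mathrm{SSreg}/1}{\mathrm{RSS}/(n-2)} \sim F_{1, n-2} \quad \text{when } H_0 \text{ is true.} \]

Form of test: reject \(H_0\) at level \(\alpha\) if \(F > F_{\alpha, 1, n-2}\) (which can be obtained from a table of the F distribution). However, all statistical packages report the corresponding p-value.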
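The decomposition SST = SSreg + RSS and the F statistic can be verified numerically on any data set. The sketch below is in Python with hypothetical toy data (not the production data); the function name `anova_simple_regression` is an illustrative choice. It also computes the slope t-statistic so that the F and T values can be compared.

```python
from math import sqrt


def anova_simple_regression(x, y):
    """Return (sst, ssreg, rss, f_stat, t_stat) for a straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                     # total
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))       # residual
    ssreg = sum((fi - y_bar) ** 2 for fi in fitted)              # regression
    f_stat = (ssreg / 1) / (rss / (n - 2))
    # Slope t-statistic for H0: beta1 = 0, with se = S / sqrt(SXX)
    s = sqrt(rss / (n - 2))
    t_stat = b1 / (s / sqrt(sxx))
    return sst, ssreg, rss, f_stat, t_stat


# Hypothetical toy data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
sst, ssreg, rss, f_stat, t_stat = anova_simple_regression(x, y)
r_squared = ssreg / sst
print(sst, ssreg, rss, f_stat, t_stat ** 2, r_squared)
```

For these values SST = 6, SSreg = 3.6 and RSS = 2.4, so F = 3.6/0.8 = 4.5, which equals the square of the slope t-statistic.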

The usual way of setting out this test is to use an Analysis of variance table:

    Source of variation   Degrees of freedom (df)   Sum of squares (SS)   Mean square (MS)   F
    Regression            1                         SSreg                 SSreg/1            F = (SSreg/1) / (RSS/(n - 2))
    Residual              n - 2                     RSS                   RSS/(n - 2)
    Total                 n - 1                     SST

Notes:

1. It can be shown that in the case of simple linear regression

\[ T = \frac{\hat{\beta}_1 - 0}{\mathrm{se}(\hat{\beta}_1)} \sim t_{n-2} \quad \text{and} \quad F = \frac{\mathrm{SSreg}/1}{\mathrm{RSS}/(n-2)} \sim F_{1, n-2} \]

are related via \(F = T^2\).

2. \(R^2\), the coefficient of determination of the regression line, is defined as the proportion of the total sample variability in the Y's explained by the regression model, that is,

\[ R^2 = \frac{\mathrm{SSreg}}{\mathrm{SST}} = 1 - \frac{\mathrm{RSS}}{\mathrm{SST}}. \]

The reason this quantity is called \(R^2\) is that it is equal to the square of the correlation between Y and X. It is arguably one of the most commonly misused statistics.

Regression Output from R

    Analysis of Variance Table
    Response: RunTime
              Df  Sum Sq Mean Sq F value    Pr(>F)
    RunSize    1 12868.4 12868.4  48.717 1.615e-06 ***
    Residuals 18  4754.6   264.1
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that the observed F-value of 48.717 is just the square of the observed t-value 6.98, which can be found between Figures 2.2 and 2.3. We shall see in Chapter 5 that Analysis of Variance overcomes the problems associated with multiple t-tests which occur when there are many predictor variables.

2.6 Dummy Variable Regression

So far we have only considered situations in which the predictor or X-variable is quantitative (i.e., takes numerical values). We next consider so-called dummy variable regression, which is used in its simplest form when a predictor is categorical

with two values (e.g., gender) rather than quantitative. The resulting regression models allow us to test for the difference between the means of two groups. We shall see in a later topic that the concept of a dummy variable can be extended to include problems involving more than two groups.

Using dummy variable regression to compare new and old methods

We shall consider the following example throughout this section. It is taken from Foster, Stine and Waterman (1997, pages 142–148). In this example, we consider a large food processing center that needs to be able to switch from one type of package to another quickly to react to changes in order patterns. Consultants have developed a new method for changing the production line and used it to produce a sample of 48 change-over times (in minutes). Also available is an independent sample of 72 change-over times (in minutes) for the existing method. These two sets of times can be found on the book web site in the file called changeover_times.txt. The first three and the last three rows of the data from this file are reproduced below in Table 2.2. Plots of the data appear in Figure 2.5.

We wish to develop an equation to model the relationship between Y, the change-over time, and X, the dummy variable corresponding to New, and hence test whether the mean change-over time is reduced using the new method.

We consider the simple linear regression model

\[ Y = \beta_0 + \beta_1 x + e \]

where Y = change-over time and x is the dummy variable (i.e., x = 1 if the time corresponds to the new change-over method and 0 if it corresponds to the existing method).

Regression Output from R

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  17.8611     0.8905  20.058   <2e-16 ***
    New          -3.1736     1.4080  -2.254   0.0260 *
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 7.556 on 118 degrees of freedom
    Multiple R-Squared: 0.04128, Adjusted R-squared: 0.03315
    F-statistic: 5.081 on 1 and 118 DF, p-value: 0.02604

We can test whether there is a significant reduction in the change-over time for the new method by testing the significance of the dummy variable, that is, we wish to test whether the coefficient of x is zero or less than zero, that is:

\[ H_0: \beta_1 = 0 \quad \text{against} \quad H_A: \beta_1 < 0. \]

We use the one-sided "<" alternative since we are interested in whether the new method has led to a reduction in mean change-over time. The test statistic is

\[ T = \frac{\hat{\beta}_1 - 0}{\mathrm{se}(\hat{\beta}_1)} \sim t_{n-2} \quad \text{when } H_0 \text{ is true.} \]

Table 2.2 Change-over time data (changeover_times.txt). Columns: Method; Y, Change-over time; X, New. The rows shown are the first three (Existing method, X = 0) and the last three (New method, X = 1).

Figure 2.5 A scatter plot and box plots of the change-over time data (Change Over Time against Dummy Variable, New; Change Over Time by Method: Existing, New)

In this case, T = −2.254. (This result can be found in the output in the column headed "t value".) The associated p-value is given by

\[ p\text{-value} = P(T < -2.254 \text{ when } H_0 \text{ is true}) = \frac{0.026}{2} = 0.013 \]

as the two-sided p-value \(= P(|T| > 2.254 \text{ when } H_0 \text{ is true}) = 0.026\). This means that there is significant evidence of a reduction in the mean change-over time for the new method.
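With a 0/1 dummy predictor, the least squares intercept equals the mean of the group coded 0 and the slope equals the difference between the two group means. The short Python sketch below checks this on made-up numbers; the six change-over times are hypothetical and are not taken from changeover_times.txt, and `fit_line` is an illustrative helper name.

```python
def fit_line(x, y):
    """Least squares intercept and slope for y = b0 + b1*x + e."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical change-over times in minutes (illustration only)
existing = [19.0, 24.0, 17.0]   # dummy variable x = 0
new = [14.0, 18.0, 13.0]        # dummy variable x = 1

x = [0.0] * len(existing) + [1.0] * len(new)
y = existing + new
b0, b1 = fit_line(x, y)

mean_existing = sum(existing) / len(existing)
mean_new = sum(new) / len(new)
print(b0, b1, mean_existing, mean_new - mean_existing)
```

Here the intercept reproduces the existing-method mean (20 minutes) and the slope reproduces the change in mean (−5 minutes), which is why testing the dummy variable's coefficient is a test of the difference between the two group means.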

Next consider the group consisting of those times associated with the new change-over method. For this group, the dummy variable, x, is equal to 1. Thus, we can estimate the mean change-over time for the new method as:

\[ 17.8611 + (-3.1736) \times 1 = 14.6875 = 14.7 \text{ minutes.} \]

Next consider the group consisting of those times associated with the existing change-over method. For this group, the dummy variable, x, is equal to 0. Thus, we can estimate the mean change-over time for the existing method as:

\[ 17.8611 + (-3.1736) \times 0 = 17.8611 = 17.9 \text{ minutes.} \]

The new change-over method produces a reduction in the mean change-over time of 3.2 min, from 17.9 to 14.7 minutes. (Notice that the reduction in the mean change-over time for the new method is just the coefficient of the dummy variable.) This reduction is statistically significant.

A 95% confidence interval for the reduction in mean change-over time due to the new method is given by

\[ \left( \hat{\beta}_1 - t(\alpha/2, n-2)\, \mathrm{se}(\hat{\beta}_1),\ \hat{\beta}_1 + t(\alpha/2, n-2)\, \mathrm{se}(\hat{\beta}_1) \right) \]

where \(t(\alpha/2, n-2)\) is the \(100(1 - \alpha/2)\)th quantile of the t-distribution with \(n - 2\) degrees of freedom. In this example the X-variable is the dummy variable New and \(\hat{\beta}_1 = -3.1736\), \(\mathrm{se}(\hat{\beta}_1) = 1.4080\), \(t(0.025, 120 - 2 = 118) = 1.9803\). Thus a 95% confidence interval for \(\beta_1\) (in minutes) is given by

\[ (-3.1736 \pm 1.9803 \times 1.4080) = (-3.1736 \pm 2.7883) = (-5.96, -0.39). \]

Finally, the company should adopt the new method if a reduction of time of this size is of practical significance.

2.7 Derivations of Results

In this section, we shall derive some results given earlier about the least squares estimates of the slope and the intercept as well as results about confidence intervals and prediction intervals. Throughout this section we shall make the following assumptions:

1. Y is related to x by the simple linear regression model \(Y_i = \beta_0 + \beta_1 x_i + e_i\) \((i = 1, \ldots, n)\), i.e., \(E(Y \mid X = x_i) = \beta_0 + \beta_1 x_i\)
2. The errors \(e_1, e_2, \ldots, e_n\) are independent of each other
3. The errors \(e_1, e_2, \ldots, e_n\) have a common variance \(\sigma^2\)
4. The errors are normally distributed with a mean of 0 and variance \(\sigma^2\) (especially when the sample size is small), that is, \(e \mid X \sim N(0, \sigma^2)\)

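In addition, since the regression model is conditional on X we can assume that the values of the predictor variable, \(x_1, x_2, \ldots, x_n\), are known fixed constants.

2.7.1 Inferences about the Slope of the Regression Line

Recall from (2.5) that the least squares estimate of \(\beta_1\) is given by

\[ \hat{\beta}_1 = \sum_{i=1}^{n} c_i y_i \quad \text{where} \quad c_i = \frac{x_i - \bar{x}}{\mathrm{SXX}}. \]

Under the above assumptions we shall derive (2.6), (2.7) and (2.8). To derive (2.6) let's consider

\[
\begin{aligned}
E(\hat{\beta}_1 \mid X) &= E\!\left[ \sum_{i=1}^{n} c_i y_i \,\Big|\, X = x_i \right] \\
&= \sum_{i=1}^{n} c_i E[y_i \mid X = x_i] \\
&= \sum_{i=1}^{n} c_i (\beta_0 + \beta_1 x_i) \\
&= \beta_0 \sum_{i=1}^{n} c_i + \beta_1 \sum_{i=1}^{n} c_i x_i \\
&= \beta_0 \sum_{i=1}^{n} \frac{x_i - \bar{x}}{\mathrm{SXX}} + \beta_1 \sum_{i=1}^{n} \frac{(x_i - \bar{x}) x_i}{\mathrm{SXX}} \\
&= \beta_1
\end{aligned}
\]

since \(\sum_{i=1}^{n} (x_i - \bar{x}) = 0\) and \(\sum_{i=1}^{n} (x_i - \bar{x}) x_i = \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 = \mathrm{SXX}\).

To derive (2.7) let's consider

\[
\begin{aligned}
\mathrm{Var}(\hat{\beta}_1 \mid X) &= \mathrm{Var}\!\left[ \sum_{i=1}^{n} c_i y_i \,\Big|\, X = x_i \right] \\
&= \sum_{i=1}^{n} c_i^2\, \mathrm{Var}(y_i \mid X = x_i) \\
&= \sigma^2 \sum_{i=1}^{n} c_i^2 \\
&= \sigma^2 \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{\mathrm{SXX}} \right)^2 \\
&= \frac{\sigma^2}{\mathrm{SXX}}.
\end{aligned}
\]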
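The derivations of (2.6) and (2.7) rest on three facts about the weights \(c_i\): they sum to zero, \(\sum c_i x_i = 1\), and \(\sum c_i^2 = 1/\mathrm{SXX}\). These can be confirmed numerically for any set of x values; the Python sketch below uses hypothetical x values and an illustrative helper name `slope_weights`.

```python
def slope_weights(x):
    """Return the weights c_i = (x_i - x_bar) / SXX from (2.5), and SXX."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [(xi - x_bar) / sxx for xi in x], sxx


# Hypothetical predictor values; any non-constant x would do
x = [1.0, 2.0, 3.0, 4.0, 5.0]
c, sxx = slope_weights(x)

sum_c = sum(c)                                      # 0: the beta0 term drops out
sum_cx = sum(ci * xi for ci, xi in zip(c, x))       # 1: so E(b1_hat | X) = beta1
sum_c2 = sum(ci ** 2 for ci in c)                   # 1/SXX: so Var = sigma^2 / SXX
print(sum_c, sum_cx, sum_c2, 1.0 / sxx)
```

The identities depend only on the x values, which is why the bias and variance results hold for any response satisfying the stated assumptions.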

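Finally we derive (2.8). Under assumption (4), the errors \(e_i \mid X\) are normally distributed. Since \(y_i = \beta_0 + \beta_1 x_i + e_i\) \((i = 1, 2, \ldots, n)\), \(Y_i \mid X\) is normally distributed. Since \(\hat{\beta}_1 \mid X\) is a linear combination of the \(y_i\)'s, \(\hat{\beta}_1 \mid X\) is normally distributed.

2.7.2 Inferences about the Intercept of the Regression Line

Recall from (2.3) that the least squares estimate of \(\beta_0\) is given by

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. \]

Under the assumptions given previously we shall derive (2.9), (2.10) and (2.11). To derive (2.9) we shall use the fact that

\[ E(\hat{\beta}_0 \mid X) = E(\bar{y} \mid X) - E(\hat{\beta}_1 \mid X)\, \bar{x}. \]

The first piece of the last equation is

\[ E(\bar{y} \mid X) = \frac{1}{n} \sum_{i=1}^{n} E(y_i \mid X = x_i) = \frac{1}{n} \sum_{i=1}^{n} E(\beta_0 + \beta_1 x_i + e_i) = \beta_0 + \beta_1 \frac{1}{n} \sum_{i=1}^{n} x_i = \beta_0 + \beta_1 \bar{x}. \]

The second piece of that equation is

\[ E(\hat{\beta}_1 \mid X)\, \bar{x} = \beta_1 \bar{x}. \]

Thus,

\[ E(\hat{\beta}_0 \mid X) = E(\bar{y} \mid X) - E(\hat{\beta}_1 \mid X)\, \bar{x} = \beta_0 + \beta_1 \bar{x} - \beta_1 \bar{x} = \beta_0. \]

To derive (2.10) let's consider

\[ \mathrm{Var}(\hat{\beta}_0 \mid X) = \mathrm{Var}(\bar{y} - \hat{\beta}_1 \bar{x} \mid X) = \mathrm{Var}(\bar{y} \mid X) + \bar{x}^2\, \mathrm{Var}(\hat{\beta}_1 \mid X) - 2 \bar{x}\, \mathrm{Cov}(\bar{y}, \hat{\beta}_1 \mid X). \]

The first term is given by

\[ \mathrm{Var}(\bar{y} \mid X) = \mathrm{Var}\!\left( \frac{1}{n} \sum_{i=1}^{n} y_i \,\Big|\, X = x_i \right) = \frac{1}{n^2} \sum_{i=1}^{n} \sigma^2 = \frac{\sigma^2}{n}. \]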

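From (2.7),

\[ \mathrm{Var}(\hat{\beta}_1 \mid X) = \frac{\sigma^2}{\mathrm{SXX}}. \]

Finally,

\[ \mathrm{Cov}(\bar{y}, \hat{\beta}_1 \mid X) = \mathrm{Cov}\!\left( \frac{1}{n} \sum_{i=1}^{n} y_i,\ \sum_{i=1}^{n} c_i y_i \right) = \frac{1}{n} \sum_{i=1}^{n} c_i\, \mathrm{Cov}(y_i, y_i) = \frac{\sigma^2}{n} \sum_{i=1}^{n} c_i = 0. \]

So,

\[ \mathrm{Var}(\hat{\beta}_0 \mid X) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\mathrm{SXX}} \right). \]

Result (2.11) follows from the fact that under assumption (4), \(Y_i \mid X\) (and hence \(\bar{y}\)) are normally distributed, as is \(\hat{\beta}_1 \mid X\).

2.7.3 Confidence Intervals for the Population Regression Line

Recall that the population regression line at \(X = x^*\) is given by

\[ E(Y \mid X = x^*) = \beta_0 + \beta_1 x^*. \]

An estimator of the population regression line at \(X = x^*\) (i.e., \(E(Y \mid X = x^*) = \beta_0 + \beta_1 x^*\)) is the value of the estimated regression equation at \(X = x^*\), namely,

\[ \hat{y}^* = \hat{\beta}_0 + \hat{\beta}_1 x^*. \]

Under the assumptions stated previously, we shall derive (2.12), (2.13) and (2.14). First, notice that (2.12) follows from the following earlier established results: \(E(\hat{\beta}_0 \mid X = x^*) = \beta_0\) and \(E(\hat{\beta}_1 \mid X = x^*) = \beta_1\).

Next, consider (2.13):

\[ \mathrm{Var}(\hat{y} \mid X = x^*) = \mathrm{Var}(\hat{\beta}_0 + \hat{\beta}_1 x^* \mid X = x^*) = \mathrm{Var}(\hat{\beta}_0 \mid X = x^*) + x^{*2}\, \mathrm{Var}(\hat{\beta}_1 \mid X = x^*) + 2 x^*\, \mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1 \mid X = x^*). \]

Now,

\[
\begin{aligned}
\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1 \mid X = x^*) &= \mathrm{Cov}(\bar{y} - \hat{\beta}_1 \bar{x},\ \hat{\beta}_1 \mid X = x^*) \\
&= \mathrm{Cov}(\bar{y}, \hat{\beta}_1 \mid X = x^*) - \bar{x}\, \mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_1) \\
&= 0 - \bar{x}\, \mathrm{Var}(\hat{\beta}_1) \\
&= \frac{-\bar{x} \sigma^2}{\mathrm{SXX}}.
\end{aligned}
\]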

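So that,

\[ \mathrm{Var}(\hat{y} \mid X = x^*) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\mathrm{SXX}} \right) + x^{*2} \frac{\sigma^2}{\mathrm{SXX}} - \frac{2 x^* \bar{x} \sigma^2}{\mathrm{SXX}} = \sigma^2 \left( \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\mathrm{SXX}} \right). \]

Result (2.14) follows from the fact that under assumption (4), \(\hat{\beta}_0 \mid X\) is normally distributed, as is \(\hat{\beta}_1 \mid X\).

2.7.4 Prediction Intervals for the Actual Value of Y

We base our prediction of Y when \(X = x^*\) (that is, of \(Y^*\)) on

\[ \hat{y}^* = \hat{\beta}_0 + \hat{\beta}_1 x^*. \]

The error in our prediction is

\[ Y^* - \hat{y}^* = \beta_0 + \beta_1 x^* + e^* - \hat{y}^* = E(Y \mid X = x^*) - \hat{y}^* + e^*, \]

that is, the deviation between \(E(Y \mid X = x^*)\) and \(\hat{y}^*\) plus the random fluctuation \(e^*\) (which represents the deviation of \(Y^*\) from \(E(Y \mid X = x^*)\)).

Under the assumptions stated previously, we shall derive (2.15), (2.16) and (2.17). First, we consider (2.15):

\[ E(Y^* - \hat{y}^*) = E(Y - \hat{y} \mid X = x^*) = E(Y \mid X = x^*) - E(\hat{\beta}_0 + \hat{\beta}_1 x \mid X = x^*) = (\beta_0 + \beta_1 x^*) - (\beta_0 + \beta_1 x^*) = 0. \]

In considering (2.16), notice that \(\hat{y}\) is independent of \(Y^*\), a future value of Y. Thus,

\[
\begin{aligned}
\mathrm{Var}(Y^* - \hat{y}^*) &= \mathrm{Var}(Y - \hat{y} \mid X = x^*) \\
&= \mathrm{Var}(Y \mid X = x^*) + \mathrm{Var}(\hat{y} \mid X = x^*) - 2\, \mathrm{Cov}(Y, \hat{y} \mid X = x^*) \\
&= \sigma^2 + \sigma^2 \left[ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\mathrm{SXX}} \right] - 0 \\
&= \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\mathrm{SXX}} \right].
\end{aligned}
\]

Finally, (2.17) follows since both \(\hat{y}\) and \(Y^*\) are normally distributed.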

2.8 Exercises

1. The web site provides weekly reports on the box office ticket sales for plays on Broadway in New York. We shall consider the data for the week October 11–17, 2004 (referred to below as the current week). The data are in the form of the gross box office results for the current week and the gross box office results for the previous week (i.e., October 3–10, 2004). The data, plotted in Figure 2.6, are available on the book web site in the file playbill.csv.

Fit the following model to the data: \(Y = \beta_0 + \beta_1 x + e\) where Y is the gross box office results for the current week (in $) and x is the gross box office results for the previous week (in $).

Figure 2.6 Scatter plot of gross box office results from Broadway (Gross Box Office Results Current Week against Gross Box Office Results Previous Week)

Complete the following tasks:

(a) Find a 95% confidence interval for the slope of the regression model, \(\beta_1\). Is 1 a plausible value for \(\beta_1\)? Give a reason to support your answer.
(b) Test the null hypothesis \(H_0: \beta_0 = 10000\) against a two-sided alternative. Interpret your result.
(c) Use the fitted regression model to estimate the gross box office results for the current week (in $) for a production with $400,000 in gross box office the previous week. Find a 95% prediction interval for the gross box office

results for the current week (in $) for a production with $400,000 in gross box office the previous week. Is $450,000 a feasible value for the gross box office results in the current week, for a production with $400,000 in gross box office the previous week? Give a reason to support your answer.
(d) Some promoters of Broadway plays use the prediction rule that next week's gross box office results will be equal to this week's gross box office results. Comment on the appropriateness of this rule.

2. A story by James R. Hagerty entitled With Buyers Sidelined, Home Prices Slide published in the Thursday October 25, 2007 edition of the Wall Street Journal contained data on so-called fundamental housing indicators in major real estate markets across the US. The author argues that prices are generally falling and overdue loan payments are piling up. Thus, we shall consider data presented in the article on

Y = Percentage change in average price from July 2006 to July 2007 (based on the S&P/Case-Shiller national housing index); and
x = Percentage of mortgage loans 30 days or more overdue in latest quarter (based on data from Equifax and Moody's).

The data are available on the book web site in the file indicators.txt. Fit the following model to the data: \(Y = \beta_0 + \beta_1 x + e\). Complete the following tasks:

(a) Find a 95% confidence interval for the slope of the regression model, \(\beta_1\). On the basis of this confidence interval decide whether there is evidence of a significant negative linear association.
(b) Use the fitted regression model to estimate \(E(Y \mid X = 4)\). Find a 95% confidence interval for \(E(Y \mid X = 4)\). Is 0% a feasible value for \(E(Y \mid X = 4)\)? Give a reason to support your answer.

3. The manager of the purchasing department of a large company would like to develop a regression model to predict the average amount of time it takes to process a given number of invoices. Over a 30-day period, data are collected on the number of invoices processed and the total time taken (in hours). The data are available on the book web site in the file invoices.txt. The following model was fit to the data: \(Y = \beta_0 + \beta_1 x + e\) where Y is the processing time and x is the number of invoices. A plot of the data and the fitted model can be found in Figure 2.7. Utilizing the output from the fit of this model provided below, complete the following tasks.

(a) Find a 95% confidence interval for the start-up time, i.e., \(\beta_0\).
(b) Suppose that a best practice benchmark for the average processing time for an additional invoice is 0.01 hours (or 0.6 minutes). Test the null hypothesis \(H_0: \beta_1 = 0.01\) against a two-sided alternative. Interpret your result.
(c) Find a point estimate and a 95% prediction interval for the time taken to process 130 invoices.

Figure 2.7 Scatter plot of the invoice data (Processing Time against Number of Invoices)

Regression output from R for the invoice data

    Call:
    lm(formula = Time ~ Invoices)

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)        …          …       …    …e-05 ***
    Invoices           …          …       …    …e-14 ***
    ---
    Residual standard error: 0.3298 on 28 degrees of freedom
    Multiple R-Squared: 0.8718, Adjusted R-squared: 0.8672
    F-statistic: 190.4 on 1 and 28 DF, p-value: 5.175e-14

    mean(Time) …   median(Time) …   mean(Invoices) …   median(Invoices) …

4. Straight-line regression through the origin: In this question we shall make the following assumptions:

(1) Y is related to x by the simple linear regression model \(Y_i = \beta x_i + e_i\) \((i = 1, 2, \ldots, n)\), i.e., \(E(Y \mid X = x_i) = \beta x_i\)

(2) The errors \(e_1, e_2, \ldots, e_n\) are independent of each other
(3) The errors \(e_1, e_2, \ldots, e_n\) have a common variance \(\sigma^2\)
(4) The errors are normally distributed with a mean of 0 and variance \(\sigma^2\) (especially when the sample size is small), i.e., \(e \mid X \sim N(0, \sigma^2)\)

In addition, since the regression model is conditional on X we can assume that the values of the predictor variable, \(x_1, x_2, \ldots, x_n\), are known fixed constants.

(a) Show that the least squares estimate of \(\beta\) is given by

\[ \hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}. \]

(b) Under the above assumptions show that

(i) \(E(\hat{\beta} \mid X) = \beta\)
(ii) \(\mathrm{Var}(\hat{\beta} \mid X) = \dfrac{\sigma^2}{\sum_{i=1}^{n} x_i^2}\)
(iii) \(\hat{\beta} \mid X \sim N\!\left( \beta,\ \dfrac{\sigma^2}{\sum_{i=1}^{n} x_i^2} \right)\)

5. Two alternative straight line regression models have been proposed for Y. In the first model, Y is a linear function of \(x_1\), while in the second model Y is a linear function of \(x_2\). The plot in the first column of Figure 2.8 is that of Y against \(x_1\), while the plot in the second column below is that of Y against \(x_2\). These plots also show the least squares regression lines. In the following statements RSS stands for residual sum of squares, while SSreg stands for regression sum of squares. Which one of the following statements is true?

(a) RSS for model 1 is greater than RSS for model 2, while SSreg for model 1 is greater than SSreg for model 2.
(b) RSS for model 1 is less than RSS for model 2, while SSreg for model 1 is less than SSreg for model 2.
(c) RSS for model 1 is greater than RSS for model 2, while SSreg for model 1 is less than SSreg for model 2.
(d) RSS for model 1 is less than RSS for model 2, while SSreg for model 1 is greater than SSreg for model 2.

Give a detailed reason to support your choice.

Figure 2.8 Scatter plots and least squares lines (Model 1: y against \(x_1\); Model 2: y against \(x_2\))

6. In this problem we will show that SST = SSreg + RSS. To do this we will show that \(\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = 0\).

(a) Show that \((y_i - \hat{y}_i) = (y_i - \bar{y}) - \hat{\beta}_1 (x_i - \bar{x})\).
(b) Show that \((\hat{y}_i - \bar{y}) = \hat{\beta}_1 (x_i - \bar{x})\).
(c) Utilizing the fact that \(\hat{\beta}_1 = \dfrac{\mathrm{SXY}}{\mathrm{SXX}}\), show that \(\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = 0\).

7. A statistics professor has been involved in a collaborative research project with two entomologists. The statistics part of the project involves fitting regression models to large data sets. Together they have written and submitted a manuscript to an entomology journal. The manuscript contains a number of scatter plots with each showing an estimated regression line (based on a valid model) and

associated individual 95% confidence intervals for the regression function at each x value, as well as the observed data. A referee has asked the following question:

I don't understand how 95% of the observations fall outside the 95% CI as depicted in the figures.

Briefly explain how it is entirely possible that 95% of the observations fall outside the 95% CI as depicted in the figures.
