Lecture 3, Stat 102, Spring 2007

Chapter 3.1-3.2: Introduction to regression analysis
- Linear regression as a descriptive technique
- The least-squares equations
Chapter 3.3: Sampling distribution of b0, b1 (continued in next lecture)

Regression Analysis
Galton's classic data on heights of parents and their child (952 pairs). Describes the relationship between the child's height (y) and the parents' height (x). Predict the child's height given the parents' height.
[Table: first rows of the parent ht / child ht data, followed by a scatterplot of child ht (61-75 in.) against parent ht (63-74 in.)]
Uses of Regression Analysis
Description: describe the relationship between a dependent variable y (child's height) and explanatory variables x (parents' height).
Prediction: predict the dependent variable y based on explanatory variables x.

Model for Simple Regression
Consider a population of units on which the variables (y, x) are recorded. Let µ_{y|x} denote the conditional mean of y given x. The goal of regression analysis is to estimate µ_{y|x}. Simple linear regression model:
  µ_{y|x} = β0 + β1 x
Simple Linear Regression Model
Model (more details later):
  y = β0 + β1 x + e
  y = dependent variable
  x = independent variable
  β0 = y-intercept
  β1 = slope of the line
  e = error (normally distributed)
  µ_{y|x} = β0 + β1 x
β0 and β1 are unknown population parameters, and therefore are estimated from the data. Graphically, β1 = Rise/Run.

Interpreting the Coefficients
The slope β1 is the change in the mean of y that is associated with a one-unit change in x; e.g., for each extra inch of parents' height, the average height of the child increases by 0.61 inch. The intercept is the estimated mean of y for x = 0. However, this interpretation should only be used when the data contain observations with x near 0. Otherwise it is an extrapolation of the model, which can be unreliable (Section 3.7).
[Scatterplot of child ht against parent ht with the fitted line child ht = 26.46 + 0.61 parent ht]
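A small sketch of how the fitted coefficients are read (the intercept ≈ 26.46 and slope ≈ 0.61 are the Galton fit as reported in these notes; the function name is ours):

```python
# Fitted line reported in the notes: child ht = 26.46 + 0.61 * parent ht.
b0, b1 = 26.46, 0.61

def predicted_child_ht(parent_ht):
    """Predicted child height (inches) for a given parent height (inches)."""
    return b0 + b1 * parent_ht

# The slope is the change in predicted child height per extra inch of
# parent height: predictions one inch apart differ by exactly b1.
print(round(predicted_child_ht(69) - predicted_child_ht(68), 2))  # 0.61
```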
Estimating the Coefficients
The estimates are determined from observations (x_1, y_1), ..., (x_n, y_n) by calculating sample statistics. They correspond to a straight line that cuts through the data.
Question: What should be considered a good line?

Least Squares Regression Line
What is a good estimate of the line? A good estimated line should predict y well based on x.
Least absolute value regression line: the line that minimizes the absolute values of the prediction errors in the sample. A good criterion, but hard to compute.
Least squares regression line: the line that minimizes the squared prediction errors in the sample. A good criterion, and easy to compute.
The Least Squares (Regression) Line
Let us compare two lines for the four points (1, 2), (2, 4), (3, 1.5), (4, 3.2). The second line is horizontal at y = 2.5.
  For the line y = x: sum of squared differences = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
  For the horizontal line: sum of squared differences = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99
The smaller the sum of squared differences, the better the fit of the line to the data.

The Estimated Coefficients
To calculate the estimates of the coefficients of the line that minimizes the sum of the squared differences between the data points and the line, use the formulas:
  b1 = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²
  b0 = ȳ − b1 x̄
The regression equation that estimates the equation of the simple linear regression model is ŷ = b0 + b1 x.
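The comparison above and the least-squares formulas can be checked in a few lines of Python (a sketch; the four data points are the ones from the slide):

```python
# Four data points from the slide's example.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 1.5, 3.2]

def sse(b0, b1):
    """Sum of squared prediction errors for the line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

print(round(sse(0.0, 1.0), 2))  # line y = x
print(round(sse(2.5, 0.0), 2))  # horizontal line y = 2.5: 3.99, a better fit

# Least-squares estimates from the formulas above:
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
     sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar
print(b0, b1)   # the least-squares line; its SSE beats both lines above
```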
Example: Heights (cont.)
For simple linear regression analysis in JMP: click Analyze, Fit Y by X; then put child ht in Y and parent ht in X and click OK. Then click the red triangle next to Bivariate Fit and click Fit Line. Some commands we will use later can now be found in the red triangle next to Linear Fit.

Example: Heights (cont.)
Based on our observations, find b1 and b0. The summary statistics for parent heights and child heights:

Child hts:  Mean 68.20, Std Dev 2.60, Std Err Mean 0.0842, upper 95% Mean 68.37, lower 95% Mean 68.04, N 952
Parent hts: Mean 68.27, Std Dev 1.79, Std Err Mean 0.0580, upper 95% Mean 68.38, lower 95% Mean 68.15, N 952

For the regression line, from JMP-IN: b1 = 0.61, and
  b0 = ȳ − b1 x̄ = 68.20 − 0.61 × 68.27 = 26.55
The LS equation is ŷ = 26.55 + 0.61 x.
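As a quick arithmetic check of b0 = ȳ − b1 x̄ with the summary statistics above (a sketch; the inputs are the rounded means and slope from these notes, so the result differs slightly from JMP's unrounded intercept):

```python
# Rounded summary statistics from the notes.
ybar = 68.20   # mean child height (inches)
xbar = 68.27   # mean parent height (inches)
b1 = 0.61      # least-squares slope from JMP

b0 = ybar - b1 * xbar   # intercept via b0 = ybar - b1 * xbar
print(b0)               # close to the 26.55 computed in the notes
```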
JMP Output
Bivariate Fit of child ht By parent ht
[Scatterplot of child ht (61-75) against parent ht (63-74) with the fitted line]
Linear Fit: child ht = 26.456 + 0.61 parent ht

JMP Output (cont.)
Note the values of b0, b1 in the Parameter Estimates table. The other output entries will be explained later.

Summary of Fit
  RSquare                  0.177
  RSquare Adj              0.176
  Root Mean Square Error   2.357
  Mean of Response         68.20
  Observations             952

Analysis of Variance
  Source     DF    Sum of Squares    Mean Square    F Ratio
  Model       1         1136.50        1136.50       204.59
  Error     950         5277.28           5.56      Prob > F
  C. Total  951         6413.78                       <.0001

Parameter Estimates
  Term        Estimate    Std Error    t Ratio    Prob>|t|
  Intercept    26.456       2.920        9.06      <.0001
  parent ht     0.61        0.043       14.30      <.0001
Ordinary Linear Model Assumptions
Properties of the errors under the ideal model:
  µ_{y|x} = β0 + β1 x for all x.
  y_i = β0 + β1 x_i + e_i for all i.
  The distribution of e_i is normal.
  e_1, ..., e_n are independent.
  E(e_i) = 0 and Var(e_i) = σ_e².
Equivalent definition: for each x_i, y_i has a normal distribution with mean β0 + β1 x_i and variance σ_e². Also, y_1, ..., y_n are independent.

Sampling Distribution of b0, b1
The sampling distribution of b0, b1 is the probability distribution of the estimates over repeated samples y_1, ..., y_n from the ideal linear regression model with fixed values of β0, β1 and σ_e and x_1, ..., x_n.
standardregression.jmp contains a simulation of pairs (x_1, y_1), ..., (x_n, y_n) from a simple linear regression model with β0 = 1, β1 = 2, σ_e = 1, AND it contains another simulation labeled (x_1, y*_1), ..., (x_n, y*_n) from the same model. Notice the difference in the estimated coefficients calculated from the y's and from the y*'s.
[Two scatterplots from standardregression.jmp: Bivariate Fit of y By x with Linear Fit y = 1.00 + 2.5 x, and Bivariate Fit of y* By x with Linear Fit y* = 1.85 + 0.5 x]
Two outcomes from standardregression.jmp. Each data set comes from the model with β0 = 1, β1 = 2, σ_e = 1. The values of x_1, ..., x_10 are the same in both data sets.

Sampling Distribution (Details)
b0 and b1 have easily described normal distributions.
The sampling distribution of b0 is normal with
  E(b0) = β0 (hence the estimate is unbiased) and
  Var(b0) = σ_e² (1/n + x̄² / ((n − 1) s_x²)),  where s_x² = Σ_i (x_i − x̄)² / (n − 1).
The sampling distribution of b1 is normal with
  E(b1) = β1 (hence the estimate is unbiased) and
  Var(b1) = σ_e² / ((n − 1) s_x²).
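The unbiasedness and variance formula for b1 can be illustrated by Monte Carlo simulation, in the spirit of standardregression.jmp (a sketch; the parameter values β0 = 1, β1 = 2, σ_e = 1 and the ten fixed x's in [1, 2] are assumptions matching the example as we read it):

```python
import random
import statistics

random.seed(0)  # reproducible simulation

# Assumed model, as in the standardregression example.
beta0, beta1, sigma_e = 1.0, 2.0, 1.0
xs = [1.0 + 0.1 * i for i in range(10)]          # ten fixed design points
xbar = sum(xs) / len(xs)
sxx = sum((x - xbar) ** 2 for x in xs)           # equals (n - 1) * s_x^2

def fit_slope(ys):
    """Least-squares slope b1 for responses ys at the fixed xs."""
    ybar = sum(ys) / len(ys)
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx

# Repeated samples y_1, ..., y_n from the ideal model:
slopes = []
for _ in range(20000):
    ys = [beta0 + beta1 * x + random.gauss(0.0, sigma_e) for x in xs]
    slopes.append(fit_slope(ys))

print(statistics.mean(slopes))       # close to beta1 = 2 (b1 is unbiased)
print(statistics.variance(slopes))   # close to sigma_e**2 / sxx
```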
Typical Regression Analysis
1. Observe pairs of data (x_1, y_1), ..., (x_n, y_n) that are a sample from the population of interest.
2. Plot the data.
3. Assume the simple linear regression model assumptions hold.
4. Estimate the true regression line µ_{y|x} = β0 + β1 x by the least squares line ŷ = b0 + b1 x.
5. Check whether the assumptions of the ideal model are reasonable (Chapter 6, and next lecture).
6. Make inferences concerning the coefficients β0, β1 and make predictions ŷ = b0 + b1 x.

Notes
Formulas for the least squares equations:
1. The equations for b0 and b1 are easy to derive. Here is a derivation that involves a little bit of calculus. It is desired to minimize the sum of squared errors. Symbolically, this is
  SSE(b0, b1) = Σ_i (y_i − (b0 + b1 x_i))².
The minimum occurs when
  0 = (∂/∂b0) SSE(b0, b1) and 0 = (∂/∂b1) SSE(b0, b1).
Hence we need
  0 = (∂/∂b0) SSE(b0, b1) = −2 Σ_i (y_i − (b0 + b1 x_i)) and
  0 = (∂/∂b1) SSE(b0, b1) = −2 Σ_i x_i (y_i − (b0 + b1 x_i)).
These are two linear equations in the two unknowns b0 and b1. Some algebraic manipulation shows that the solution can be written in the desired form
  b1 = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)² and b0 = ȳ − b1 x̄.
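The derivation in note 1 can be verified numerically: setting the two partial derivatives to zero gives a pair of linear equations (the normal equations), whose solution matches the closed form (a sketch using the four example points from these notes):

```python
# Four data points from the earlier example.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 1.5, 3.2]
n = len(xs)

# Setting the partial derivatives of SSE to zero gives the normal equations:
#   n*b0      + sum(x)*b1    = sum(y)
#   sum(x)*b0 + sum(x**2)*b1 = sum(x*y)
sx = sum(xs)
sy = sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

det = n * sxx - sx * sx              # solve the 2x2 system by Cramer's rule
b0 = (sy * sxx - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det

# Closed-form solution from the notes:
xbar, ybar = sx / n, sy / n
b1_closed = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
            sum((x - xbar) ** 2 for x in xs)
b0_closed = ybar - b1_closed * xbar

print(b0, b1)   # matches (b0_closed, b1_closed)
```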
2. A NICE FACT that is sometimes useful:
  a. The least squares line passes through the point (x̄, ȳ). To see this, note that if x = x̄ then the corresponding point on the least squares line is ŷ = b0 + b1 x̄. Substituting the definition of b0 yields ŷ = (ȳ − b1 x̄) + b1 x̄ = ȳ, as claimed.
  b. The equation for the least squares line can be re-written in the form ŷ − ȳ = b1 (x − x̄).
3. There are other useful ways to write the equations for b0 and b1. Recall that the sample covariance is defined as
  Cov({x_i, y_i}) = Σ_i (x_i − x̄)(y_i − ȳ) / (n − 1) = S_xy, say.
Similarly, the sample correlation coefficient is
  R = S_xy / (S_x S_y), say.
[S_x = s_x is defined on overhead 18, and S_y is defined similarly.] Thus,
  b1 = S_xy / S_x² = (S_xy / (S_x S_y)) (S_y / S_x) = R S_y / S_x.

History of Galton's Data:
4. Francis Galton gathered data about heights of parents and their children, and published the analysis in 1886 in a paper entitled "Regression towards mediocrity [sic] in hereditary stature." In the process he coined the term "Regression" to describe the straight line that summarizes the type of relational data that may appear in a scatterplot. He did not use our current least-squares technique for finding this line; instead he used a clever analysis whose final step is to fit the line by eye. He estimated the slope of the regression line as 2/3. Further work in the next decades by Galton and by K. Pearson, Gosset (writing as "Student") and others connected Galton's analysis to the least squares technique earlier invented by Gauss (1809), and also derived the relevant sampling distributions needed to create a statistical regression analysis.
5. The data we use for our analysis is packaged with the JMP program disk. It is not exactly Galton's original data. We believe it is a version of the data set prepared by S. Stigler (1986) as a minor modification of Galton's data. In order for the data to plot nicely, Stigler "jittered" the data. He also included some data that Galton did not. The data listed as "Parent height" in this data set is actually the average of both parents' heights, after adjusting the mothers' heights as discussed in the next note.
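The identity b1 = R S_y / S_x from note 3 above, and the fact that the line passes through (x̄, ȳ), can be checked numerically (a sketch with the four example points from these notes):

```python
import math

# Four data points from the earlier example.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 1.5, 3.2]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

# Sample covariance, standard deviations, and correlation
# (note 3's S_xy, S_x, S_y, and R).
s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
s_x = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
s_y = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
r = s_xy / (s_x * s_y)

b1 = s_xy / s_x ** 2        # least-squares slope, written via the covariance
b0 = ybar - b1 * xbar

print(abs(b1 - r * s_y / s_x))        # ~0: the identity b1 = R * S_y / S_x
print(abs((b0 + b1 * xbar) - ybar))   # ~0: the line passes through (xbar, ybar)
```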
6. Galton did not know how to separately treat men's and women's heights in order to produce the kind of results he wanted to look at. SO (after looking at the structure of the data) he multiplied all female heights by 1.08. This puts all the heights on very nearly the same scale, and allowed him to treat men's and women's heights together, without regard to sex. [Instead of doing this Galton could have divided the men's heights by 1.08; or he could have achieved a similar effect by dividing the male heights by 1.04 and multiplying the female ones by 1.04. Why didn't he use one of these other schemes?]
7. Galton did not use modern random-sampling methods to obtain his data. Instead, he obtained his data through the offer of prizes for "the best extracts from their own family records" obtained from individual family correspondents. He summarized the data in a journal that is now in the Library of University College London. Here is what the first half of p. 4 looks like. (According to Galton's notations one should add 60 inches to every entry in the table.)

[Half of p. 4 of Galton's journal (note the approximate heights for some records, and the entries "tall" and "deformed")]
This photocopy, as well as much of the above discussion, is taken from Hanley, J. A. (2004), "Transmuting women into men: Galton's family data on human stature," The American Statistician, 58, pp. 237-243. Another excellent reference is Stigler, S. (1986), "The English breakthrough: Galton," in The History of Statistics: The Measurement of Uncertainty before 1900, Harvard Univ. Press.