STAT 3008 Applied Regression Analysis
Tutorial 2: Simple Linear Regression
LAI Chun He, Department of Statistics, The Chinese University of Hong Kong (s1155008@sta.cuhk.edu.hk)

1 Model Assumption

To quantify the relationship between two factors, say X and Y, we must at least assume the type of relationship they have: it could be linear, logarithmic, quadratic, etc. In the experiment, the factor Y is likely to be affected by the factor X. In the usual experimental context, X and Y are known as the independent and dependent factors respectively, but in linear regression, X and Y are known as the predictor and the response respectively. Now we state our model assumption.

Simple Linear Regression Model. The model relating the two factors is assumed to be
$$ y_i = Y \mid (X = x_i) = \beta_0 + \beta_1 x_i + e_i $$
where $E(e_i) = 0$, $\mathrm{Var}(e_i) = \sigma^2$ and the $e_i$ are i.i.d. Therefore, we have
$$ E(Y \mid X = x) = \beta_0 + \beta_1 x, \qquad \mathrm{Var}(Y \mid X = x) = \sigma^2. $$

Remark 1.1. The values $x_i$ here are known constants, instead of realised observations from a random variable X. For regression with random predictors, you may refer to [1].

Remark 1.2. It should be noticed that the response $y_i$ is a random variable. The data set $\{(x_i, y_i)\}_{i=1}^n$ consists of realised values of these random variables.

2 Least Square Estimator

Essentially, we want to fit a straight line to a set of points on the Cartesian plane. However, there are many ways to define "good" in a fit. The simplest way is to consider the total vertical distance between the points and the line. Minimising the total distance is equivalent to minimising the total squared distance; hence, we have the Least Square Method. The best-fit line is the line which minimises
$$ \mathrm{RSS}(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_i) \right]^2. $$
By elementary multivariate calculus and statistical concepts, the derivation performed in the lesson yields the following results.
Least Squares Estimator. If we define the notations
$$ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, $$
$$ SXX = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2, $$
$$ SYY = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2, $$
$$ SXY = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}, $$
then the estimators are given by
$$ \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}, \qquad \hat\beta_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{SXY}{SXX}, $$
$$ \hat\sigma^2 = \frac{\sum_i \hat{e}_i^2}{n - 2} = \frac{SYY - SXY^2/SXX}{n - 2}. $$
Under our estimated model, we therefore have the fitted value given $x_i$ and the residual, i.e. the difference between the fitted value and the realised value:
$$ \hat{y}_i = \hat{E}(Y \mid X = x_i) = \hat\beta_0 + \hat\beta_1 x_i, \qquad \hat{e}_i = y_i - \hat{y}_i = y_i - \hat\beta_0 - \hat\beta_1 x_i. $$

Remark 2.1. For the detailed derivation, please refer to the lecture notes. You should be familiar with the derivations as they may be tested in the midterm and final exam.

Remark 2.2. $\hat\beta_0$ and $\hat\beta_1$ can be written as linear combinations of the $y_i$:
$$ \hat\beta_1 = \sum_{i=1}^{n} \frac{x_i - \bar{x}}{SXX}\, y_i \qquad \text{and} \qquad \hat\beta_0 = \sum_{i=1}^{n} \left[ \frac{1}{n} - \frac{\bar{x}(x_i - \bar{x})}{SXX} \right] y_i. $$
This is useful when deriving the distribution and consistency of the estimators in Exercise 4.1.

Remark 2.3. By the derivative condition of RSS with respect to $\beta_0$, we have $\sum_i \hat{e}_i = 0$ and $\bar{y} = \hat\beta_0 + \hat\beta_1 \bar{x}$.

Exercise 2.1 (2012 Fall Midterm #3). Use the simple linear regression model to fit a straight line on two data points: (2, 4), (1, 3). What are the values of $\hat\beta_0$ and $\hat\beta_1$?
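The closed-form formulas above translate directly into a few lines of code. The following is a minimal sketch in Python (NumPy assumed available); the simulated data set and the true parameter values $\beta_0 = 2$, $\beta_1 = 3$ are illustrative choices, not from the tutorial:

```python
import numpy as np

# Simulate data from y_i = beta0 + beta1 * x_i + e_i
# (illustrative parameter values, not from the course).
rng = np.random.default_rng(0)
n = 50
x = np.linspace(0.0, 10.0, n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, n)

# Closed-form least squares estimates from the boxed formulas.
xbar, ybar = x.mean(), y.mean()
SXX = np.sum((x - xbar) ** 2)          # sum of (x_i - xbar)^2
SXY = np.sum((x - xbar) * (y - ybar))  # sum of (x_i - xbar)(y_i - ybar)

beta1_hat = SXY / SXX                  # slope estimate
beta0_hat = ybar - beta1_hat * xbar    # intercept estimate

# Residuals and the variance estimate with n - 2 degrees of freedom.
e_hat = y - beta0_hat - beta1_hat * x
sigma2_hat = np.sum(e_hat ** 2) / (n - 2)
```

The estimates should land close to the true values 2 and 3, and `np.polyfit(x, y, 1)` returns the same line; note also that the residuals sum to zero, as Remark 2.3 states.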
Exercise 2.2. Show that $\sum_i x_i \hat{e}_i = 0$ and, therefore, $\sum_i \hat{y}_i \hat{e}_i = 0$.

Exercise 2.3. Show that $\hat\sigma^2$ is an unbiased estimator of $\sigma^2$, i.e. $E(\hat\sigma^2) = \sigma^2$.
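The identities in Remark 2.3 and Exercise 2.2 can be checked numerically (this is not a proof, just a sanity check). A sketch with simulated data, whose parameter values are illustrative assumptions:

```python
import numpy as np

# Numerically check the orthogonality identities of the OLS fit:
# sum(e_hat_i), sum(x_i * e_hat_i) and sum(yhat_i * e_hat_i) are all zero.
rng = np.random.default_rng(4)
n = 40
x = rng.uniform(0.0, 10.0, n)
y = 3.0 - 1.0 * x + rng.normal(0.0, 2.0, n)

xbar, ybar = x.mean(), y.mean()
beta1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
beta0_hat = ybar - beta1_hat * xbar
yhat = beta0_hat + beta1_hat * x
e_hat = y - yhat

# All three sums vanish up to floating-point error.
print(np.allclose([e_hat.sum(), (x * e_hat).sum(), (yhat * e_hat).sum()],
                  0.0, atol=1e-8))
```

The third identity follows from the first two because $\hat{y}_i$ is a linear combination of $1$ and $x_i$, which is exactly the structure Exercise 2.2 asks you to exploit.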
3 Analysis of Variance (ANOVA)

It is natural that we are interested in whether our model assumption is correct. The basic question will be whether Y is really related to X. Mathematically, this question is equivalent to asking whether $\beta_1$ is zero in our model. The most common approach is the Analysis of Variance, which compares two models with different mean functions. We want to test the following hypotheses:
$$ H_0: E(Y \mid X = x) = \beta_0 \qquad \text{vs} \qquad H_1: E(Y \mid X = x) = \beta_0 + \beta_1 x. $$
Therefore, we need some test statistic related to these two hypotheses and, most importantly, of known distribution. Consider the residual sum of squares for the two models. For $H_0$, we have
$$ \hat{E}(Y \mid X = x) = \hat\beta_0 = \bar{y}, \qquad \mathrm{RSS}_{H_0} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = SYY. $$
On the other hand, for $H_1$, we have
$$ \mathrm{RSS}_{H_1} = \sum_{i=1}^{n} \left[ y_i - (\hat\beta_0 + \hat\beta_1 x_i) \right]^2 = SYY - \frac{SXY^2}{SXX}. $$
Since we use more variables to fit the points in $H_1$, it must be true that $\mathrm{RSS}_{H_0} \ge \mathrm{RSS}_{H_1}$. Therefore, $H_1$ is plausible only when $\mathrm{RSS}_{H_0} \gg \mathrm{RSS}_{H_1}$. For easy comparison, we define the Sum of Squares due to Regression as
$$ SS_{\mathrm{reg}} = \mathrm{RSS}_{H_0} - \mathrm{RSS}_{H_1} = \frac{SXY^2}{SXX}. $$
Equivalently, $H_1$ is plausible only when $SS_{\mathrm{reg}} \gg 0$.

3.1 Distributions of Estimators

To decide how large is large, we need the distributions as well, so that we can define "large" in a probabilistic sense. Under $H_0$, the distributions of the sums of squares are given below. By simple algebra, we rewrite
$$ SS_{\mathrm{reg}} = \frac{SXY^2}{SXX} = \left[ \sum_{i=1}^{n} \frac{x_i - \bar{x}}{\sqrt{SXX}}\, y_i \right]^2. $$
By the Central Limit Theorem, we have
$$ \sum_{i=1}^{n} \frac{x_i - \bar{x}}{\sqrt{SXX}}\, y_i \sim N(0, \sigma^2). $$
Therefore, we know $SS_{\mathrm{reg}}/\sigma^2 \sim \chi^2(1)$. Also, it is known that $(n-2)\hat\sigma^2/\sigma^2 \sim \chi^2(n-2)$. The test statistic is thus given by
$$ F = \frac{SS_{\mathrm{reg}}/\sigma^2}{1} \bigg/ \frac{(n-2)\hat\sigma^2_{H_1}/\sigma^2}{n-2} = \frac{SS_{\mathrm{reg}}}{\hat\sigma^2_{H_1}} \sim F(1, n-2). $$
For significance level $\alpha$, we reject $H_0$ if $F > F_{1-\alpha}(1, n-2)$, or equivalently if the p-value is smaller than $\alpha$.

Remark 3.1. That $(n-2)\hat\sigma^2/\sigma^2 \sim \chi^2(n-2)$ reflects the fact that the degrees of freedom are the number of values in the statistic that are free to vary. The detailed proof of the distribution of $(n-2)\hat\sigma^2/\sigma^2$ involves the use of quadratic forms, which is beyond the scope of this course. Interested students may refer to [2].
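The whole F-test can be carried out from the sums of squares alone. A sketch in Python (SciPy assumed available; the simulated data with a genuinely nonzero slope is an illustrative assumption):

```python
import numpy as np
from scipy import stats

# Simulated data with a clearly nonzero slope (illustrative values).
rng = np.random.default_rng(1)
n = 30
x = np.linspace(0.0, 10.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)

xbar, ybar = x.mean(), y.mean()
SXX = np.sum((x - xbar) ** 2)
SXY = np.sum((x - xbar) * (y - ybar))
SYY = np.sum((y - ybar) ** 2)       # RSS under H0

SSreg = SXY ** 2 / SXX              # = RSS_H0 - RSS_H1
sigma2_hat = (SYY - SSreg) / (n - 2)  # sigma^2 estimate under H1

F = SSreg / sigma2_hat              # ~ F(1, n - 2) under H0
p_value = stats.f.sf(F, 1, n - 2)   # P(F(1, n-2) > F)
R2 = SSreg / SYY                    # coefficient of determination

# Reject H0 at level alpha = 0.05 when F exceeds the 95% quantile,
# equivalently when p_value < 0.05.
reject = F > stats.f.ppf(0.95, 1, n - 2)
```

With a true slope of 2 and error standard deviation 1, the test rejects $H_0$ decisively; setting the slope to 0 in the simulation should instead give rejections only about 5% of the time.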
Remark 3.2. It should be noticed that the $\hat\sigma^2$ here is under the model in $H_1$. The reason for using this instead of $\hat\sigma^2$ under $H_0$ is examined in Exercise 3.1.

Exercise 3.1. What is the estimator of $\sigma^2$ under $H_0$? Explain why using it in the denominator makes no sense.

Exercise 3.2. Show that
$$ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2. $$

3.2 ANOVA Table

Thanks to the result of Exercise 3.2, we have a neat and tidy representation of ANOVA, which is called the ANOVA table.

Source       df      SS         MS                      F              p-value
Regression   1       SSreg      SSreg/1                 SSreg/σ̂²      P(F(1, n−2) > F)
Residual     n − 2   RSS_H1     σ̂² = RSS_H1/(n − 2)
Total        n − 1   SYY

You should be extremely familiar with the above table because it appears in every midterm. We will practice this in Exercises 4.3 and 4.4.

Remark 3.3. Therefore, we can define the Coefficient of Determination to be
$$ R^2 = \frac{SS_{\mathrm{reg}}}{SYY} \in [0, 1]. $$
The realised value summarises the strength of the relationship between the sampled response and predictor.
4 Intervals, Tests and Band

Besides testing the mean functions by ANOVA, we will also want to perform tests on the individual parameters. Therefore, we need the distributions of the estimators. We begin this section with an exercise.

Exercise 4.1. Prove that $\hat\beta_0$ and $\hat\beta_1$ are unbiased, and find their variances and asymptotic distributions.

4.1 Confidence Intervals and Tests for Intercept and Slope

With the distributions of the estimators and some facts in statistics, we can construct the test statistics from the distributions derived in Exercise 4.1.

Confidence Interval and Test for Intercept. If we want to test whether the intercept is a certain value $\beta_0^*$, i.e.
$$ H_0: \beta_0 = \beta_0^* \qquad \text{vs} \qquad H_1: \beta_0 \ne \beta_0^*, $$
then the test statistic is
$$ t = \frac{\hat\beta_0 - \beta_0^*}{\mathrm{se}(\hat\beta_0)} \sim t(n-2), \qquad \text{where } \mathrm{se}(\hat\beta_0) = \hat\sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{SXX}}. $$
Therefore, for significance level $\alpha$, we reject $H_0$ when $|t| > t_{1-\alpha/2}(n-2)$. Also, the $(1-\alpha)100\%$ confidence interval of $\beta_0$ is given by
$$ \hat\beta_0 - t_{1-\alpha/2}(n-2)\,\mathrm{se}(\hat\beta_0) \le \beta_0 \le \hat\beta_0 + t_{1-\alpha/2}(n-2)\,\mathrm{se}(\hat\beta_0). $$

Confidence Interval and Test for Slope. Similarly, for the test of the slope, i.e.
$$ H_0: \beta_1 = \beta_1^* \qquad \text{vs} \qquad H_1: \beta_1 \ne \beta_1^*, $$
the test statistic is
$$ t = \frac{\hat\beta_1 - \beta_1^*}{\mathrm{se}(\hat\beta_1)} \sim t(n-2), \qquad \text{where } \mathrm{se}(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{SXX}}. $$
Therefore, for significance level $\alpha$, we reject $H_0$ when $|t| > t_{1-\alpha/2}(n-2)$. Also, the $(1-\alpha)100\%$ confidence interval of $\beta_1$ is given by
$$ \hat\beta_1 - t_{1-\alpha/2}(n-2)\,\mathrm{se}(\hat\beta_1) \le \beta_1 \le \hat\beta_1 + t_{1-\alpha/2}(n-2)\,\mathrm{se}(\hat\beta_1). $$
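The slope test and its confidence interval can be sketched as follows (SciPy assumed available; the simulated data with true values $\beta_0 = 0.5$, $\beta_1 = 1.5$ is an illustrative assumption):

```python
import numpy as np
from scipy import stats

# Fit on simulated data (illustrative true values beta0 = 0.5, beta1 = 1.5).
rng = np.random.default_rng(2)
n = 25
x = np.linspace(0.0, 5.0, n)
y = 0.5 + 1.5 * x + rng.normal(0.0, 1.0, n)

xbar = x.mean()
SXX = np.sum((x - xbar) ** 2)
beta1_hat = np.sum((x - xbar) * (y - y.mean())) / SXX
beta0_hat = y.mean() - beta1_hat * xbar
resid = y - beta0_hat - beta1_hat * x
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))

# Standard errors from the boxed formulas.
se_b0 = sigma_hat * np.sqrt(1.0 / n + xbar ** 2 / SXX)
se_b1 = sigma_hat / np.sqrt(SXX)

# Two-sided test of H0: beta1 = 0 and a 95% confidence interval.
t_stat = beta1_hat / se_b1
t_crit = stats.t.ppf(0.975, n - 2)   # t_{1 - alpha/2}(n - 2)
ci_b1 = (beta1_hat - t_crit * se_b1, beta1_hat + t_crit * se_b1)
reject = abs(t_stat) > t_crit
```

Here the true slope is well away from zero, so the test rejects; the confidence interval is random, and over repeated simulations it covers the true slope about 95% of the time (the point of Exercise 4.2's second part).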
Remark 4.1. Obviously, a test of zero slope, i.e.
$$ H_0: \beta_1 = 0 \qquad \text{vs} \qquad H_1: \beta_1 \ne 0, $$
is equivalent to testing
$$ H_0: E(Y \mid X = x) = \beta_0 \qquad \text{vs} \qquad H_1: E(Y \mid X = x) = \beta_0 + \beta_1 x, $$
which is our ANOVA F-test in Section 3. Therefore, they should give the same result. Mathematically, if we look at the t-statistic,
$$ t^2 = \left[ \frac{\hat\beta_1 - 0}{\mathrm{se}(\hat\beta_1)} \right]^2 = \left[ \frac{\hat\beta_1}{\hat\sigma/\sqrt{SXX}} \right]^2 = \frac{\hat\beta_1^2\, SXX}{\hat\sigma^2} = \frac{SXY^2/SXX}{\hat\sigma^2} = \frac{SS_{\mathrm{reg}}}{\hat\sigma^2} = F. $$
In general, we have
$$ F(1, m) = \frac{\chi^2(1)/1}{\chi^2(m)/m} = \frac{Z^2}{\chi^2(m)/m} = \left[ \frac{Z}{\sqrt{\chi^2(m)/m}} \right]^2 = t(m)^2. $$

Exercise 4.2. Construct a 95% confidence interval for the slope from the data set {(1, 1), (4, 9), (10, 10)}, given $t_{0.975}(1) = 12.706$. Bosco argues that the confidence interval you construct has a 95% probability of including the true slope. Explain whether he is correct.

Exercise 4.3 (2013 Fall Midterm #1). Fill in the missing values in the following tables of regression output from a data set of size 100.

ANOVA Table
Source       df    SS    MS    F
Regression
Residual
Total

Coefficient Table
Variable    Coefficient    s.e.     t-statistic    p-value
Constant    0.5854         0.188
X           0.497

n =          σ̂ = 4.714          R² = 0.0394
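The distributional identity $F(1, m) = t(m)^2$ at the end of Remark 4.1 can be checked numerically: the $(1-\alpha)$ quantile of $F(1, m)$ equals the square of the $(1-\alpha/2)$ quantile of $t(m)$. A sketch (SciPy assumed available):

```python
from scipy import stats

# Quantile version of F(1, m) = t(m)^2:
# F_{1-alpha}(1, m) = [t_{1-alpha/2}(m)]^2 for every m.
for m in (1, 3, 10, 50):
    t_q = stats.t.ppf(0.975, m)      # t_{0.975}(m)
    f_q = stats.f.ppf(0.95, 1, m)    # F_{0.95}(1, m)
    # Relative agreement up to quantile-solver precision.
    assert abs(t_q ** 2 / f_q - 1.0) < 1e-6
```

With `m = 1` this reproduces the constant given in Exercise 4.2: $t_{0.975}(1) = 12.706$, whose square is $F_{0.95}(1, 1)$.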
Exercise 4.4 (2012 Spring Midterm #1). Fill in the missing values in the following tables of regression output. In R, it is found that qf(1 − 9.5e−09, 1, 6) = 1917.3. Also, $\bar{x} = 5.15$, $\bar{y} = 9.1974$, $SXX = 54.875$.

ANOVA Table
Source       df    SS    MS    F    p-value
Regression                          9.5e-09
Residual
Total

Coefficient Table
Variable    Coefficient    s.e.    t-statistic    p-value
Constant    0.003
X           -.0445

n =          σ̂ =          R² =
4.2 Confidence and Prediction Intervals

Besides the true intercept and slope, we are often interested in the mean of the response given a predictor $x^*$, i.e. $E(Y \mid X = x^*)$. This can be estimated from the fitted value $\hat{y}^*$ since it is an unbiased estimator of the mean. For example, we may consider the mean IQ of students who score 98 in the midterm. On the other hand, instead of the mean, we may also be interested in the response itself, i.e. $y \mid_{X = x^*}$. In this case, we want to make a prediction of what the outcome will be, given $x^*$. Here, for example, we are looking for the IQ of a particular student who scores 98 in the midterm. Therefore, we want to construct intervals for the mean and for the prediction. The results are listed below.

Confidence Interval for the Mean. Given a predictor value $x^*$, the true value and the estimate of the mean are respectively
$$ E(Y \mid X = x^*) = \beta_0 + \beta_1 x^* \qquad \text{and} \qquad \hat{y}^* = \hat{E}(Y \mid X = x^*) = \hat\beta_0 + \hat\beta_1 x^*. $$
The estimation uncertainty of the mean is
$$ \mathrm{Var}\!\left( \hat{y}^* - E(Y \mid X = x^*) \right) = \mathrm{Var}(\hat\beta_0 + \hat\beta_1 x^*) = \sigma^2 \left( \frac{1}{n} + \frac{(x^* - \bar{x})^2}{SXX} \right). $$
Define the standard error of fit as
$$ \mathrm{sefit}(\hat{y}^* \mid x^*) = \hat\sigma \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{SXX}}. $$
Since $(n-2)\hat\sigma^2/\sigma^2 \sim \chi^2(n-2)$, the standardised quantity $[\hat{y}^* - E(Y \mid X = x^*)]/\mathrm{sefit}(\hat{y}^* \mid x^*)$ follows $t(n-2)$. Therefore, the $(1-\alpha)100\%$ confidence interval for the mean is given by
$$ \hat{y}^* - t_{1-\alpha/2}(n-2)\,\mathrm{sefit}(\hat{y}^* \mid x^*) \le E(Y \mid X = x^*) \le \hat{y}^* + t_{1-\alpha/2}(n-2)\,\mathrm{sefit}(\hat{y}^* \mid x^*). $$

Prediction Interval for the Response. Given a predictor value $x^*$, the response and its prediction are respectively
$$ y \mid_{X = x^*} = \beta_0 + \beta_1 x^* + e \qquad \text{and} \qquad \hat{y}^* = \hat\beta_0 + \hat\beta_1 x^*. $$
The estimation uncertainty of the prediction is
$$ \mathrm{Var}\!\left( \hat{y}^* - y \mid_{X = x^*} \right) = \mathrm{Var}(\hat\beta_0 + \hat\beta_1 x^*) + \mathrm{Var}(e) = \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{SXX} \right). $$
Define the standard error of prediction as
$$ \mathrm{sepred}(\hat{y}^* \mid x^*) = \hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{SXX}}. $$
Therefore, the $(1-\alpha)100\%$ prediction interval for the response is given by
$$ \hat{y}^* - t_{1-\alpha/2}(n-2)\,\mathrm{sepred}(\hat{y}^* \mid x^*) \le y \mid_{X = x^*} \le \hat{y}^* + t_{1-\alpha/2}(n-2)\,\mathrm{sepred}(\hat{y}^* \mid x^*). $$

Remark 4.2. The estimation uncertainties of the prediction and of the mean differ only by $\sigma^2$. The extra uncertainty comes from the error term in the new observation that we want to predict. Compare $y \mid_{X = x^*} = \beta_0 + \beta_1 x^* + e$ and $E(Y \mid X = x^*) = \beta_0 + \beta_1 x^*$. Due to the extra uncertainty, the prediction interval includes, and is wider than, the confidence interval for the mean.
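The two standard errors and intervals can be sketched as follows (SciPy assumed available; the simulated data and the evaluation point $x^* = 3$ are illustrative assumptions):

```python
import numpy as np
from scipy import stats

# Simulated data (illustrative values).
rng = np.random.default_rng(3)
n = 20
x = np.linspace(0.0, 8.0, n)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, n)

xbar = x.mean()
SXX = np.sum((x - xbar) ** 2)
beta1_hat = np.sum((x - xbar) * (y - y.mean())) / SXX
beta0_hat = y.mean() - beta1_hat * xbar
resid = y - beta0_hat - beta1_hat * x
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))

xstar = 3.0                          # point of interest (arbitrary choice)
yhat_star = beta0_hat + beta1_hat * xstar

# Standard error of fit (mean) and of prediction (new response).
sefit = sigma_hat * np.sqrt(1.0 / n + (xstar - xbar) ** 2 / SXX)
sepred = sigma_hat * np.sqrt(1.0 + 1.0 / n + (xstar - xbar) ** 2 / SXX)

t_crit = stats.t.ppf(0.975, n - 2)
ci_mean = (yhat_star - t_crit * sefit, yhat_star + t_crit * sefit)
pi_resp = (yhat_star - t_crit * sepred, yhat_star + t_crit * sepred)
```

As Remark 4.2 states, $\mathrm{sepred}^2 - \mathrm{sefit}^2 = \hat\sigma^2$ exactly, so the prediction interval strictly contains the confidence interval for the mean.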
4.3 Confidence Band

In the previous subsection, we constructed a confidence interval of the mean for a certain point $x^*$. It is tempting to connect all the upper limits and lower limits of the confidence intervals, i.e.
$$ \left\{ \hat\beta_0 + \hat\beta_1 x \pm t_{1-\alpha/2}(n-2)\,\mathrm{sefit}(\hat{y} \mid x) : x \in \mathbb{R} \right\}, $$
and say that this random band has a $(1-\alpha)100\%$ probability of including the true mean line $E(Y \mid X = x) = \beta_0 + \beta_1 x$. However, this is wrong (see Remark 4.3 and Exercise 4.5). The correct band is given below.

Confidence Band for the Mean Function. The $(1-\alpha)100\%$ confidence band of the mean function is given by
$$ C(x) = \hat\beta_0 + \hat\beta_1 x \pm \sqrt{2\,F_{1-\alpha}(2, n-2)}\;\mathrm{sefit}(\hat{y} \mid x), \qquad x \in \mathbb{R}. $$
Therefore, it is true that
$$ \Pr(\text{the mean line lies in the confidence band}) = \Pr\!\left( \forall x,\; E(Y \mid X = x) \in C(x) \right) = 1 - \alpha. $$

Remark 4.3. For the confidence interval $C(x) = \hat\beta_0 + \hat\beta_1 x \pm t_{1-\alpha/2}(n-2)\,\mathrm{sefit}(\hat{y} \mid x)$, we have by definition
$$ \forall x, \quad \Pr\!\left( E(Y \mid X = x) \in C(x) \right) = 1 - \alpha. $$
This relationship holds for each point, i.e. pointwise. Whereas for the confidence band, we have
$$ \Pr\!\left( \forall x,\; E(Y \mid X = x) \in C(x) \right) = 1 - \alpha. $$
Here, the inclusion is for the entire line. The two cases are different.

Exercise 4.5. Explain why it is wrong to say that the band
$$ \left\{ \hat\beta_0 + \hat\beta_1 x \pm t_{1-\alpha/2}(n-2)\,\mathrm{sefit}(\hat{y} \mid x) : x \in \mathbb{R} \right\} $$
has a $(1-\alpha)100\%$ probability of including the mean line $E(Y \mid X = x) = \beta_0 + \beta_1 x$.
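One way to see numerically why the naive pointwise band undercovers: the simultaneous multiplier $\sqrt{2\,F_{1-\alpha}(2, n-2)}$ is always larger than the pointwise multiplier $t_{1-\alpha/2}(n-2)$, so the correct band is strictly wider at every $x$. A sketch (SciPy assumed available; the sample sizes are arbitrary):

```python
from scipy import stats

# Compare the simultaneous band multiplier sqrt(2 * F_{1-alpha}(2, n-2))
# with the pointwise t multiplier t_{1-alpha/2}(n-2).
alpha = 0.05
for n in (5, 10, 30, 100):
    t_mult = stats.t.ppf(1 - alpha / 2, n - 2)
    band_mult = (2 * stats.f.ppf(1 - alpha, 2, n - 2)) ** 0.5
    # The simultaneous band is wider than the pointwise interval.
    assert band_mult > t_mult
```

Intuitively, requiring the random band to cover the mean line at every $x$ simultaneously is a stronger event than covering it at one fixed $x$, so a wider band is needed to keep the probability at $1 - \alpha$.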
Exercise 4.6. For the data set {(1, 1), (4, 9), (10, 10)}, construct
1. a 95% confidence interval and a 95% prediction interval for the point $x^* = 3$, and
2. a 95% confidence band.
3. What is the value of the band when $x = 3$?
You are given $t_{0.975}(1) = 12.706$ and $F_{0.95}(2, 1) = 199.5$.

5 Residuals

To check whether our model assumptions are valid, a good way is to look at the residual plot. Recall that the residuals are
$$ \hat{e}_i = y_i - \hat{y}_i = y_i - \hat\beta_0 - \hat\beta_1 x_i, $$
so this gives
$$ E(\hat{e}_i) = 0 \qquad \text{and} \qquad \mathrm{Var}(\hat{e}_i) = \sigma^2 \left[ 1 - \left( \frac{1}{n} + \frac{(x_i - \bar{x})^2}{SXX} \right) \right]. $$
The data set gives a set of realised $\hat{e}_i$. According to the observations above, these realised residuals should have mean close to zero and roughly constant variance across the values of $x$. A plot that satisfies these criteria is a null plot, which indicates that the model assumptions are valid and the regression is a good fit.

6 Appendix

For more references, you may refer to the following textbooks.

References

[1] Douglas C. MONTGOMERY, Elizabeth A. PECK and G. Geoffrey VINING (2006). Introduction to Linear Regression Analysis, Wiley.

[2] Robert V. HOGG and Allen T. CRAIG. Introduction to Mathematical Statistics, Pearson.