Simple Linear Regression


Chapter 2

Simple Linear Regression

2.1 Introduction

The term regression and the methods for investigating the relationships between two variables may date back over a century. The term was first introduced by Francis Galton, the renowned British biologist, in his 1886 study of heredity. One of his observations was that the children of tall parents tend to be taller than average but not as tall as their parents. This "regression toward mediocrity" gave these statistical methods their name. The term regression and its evolution primarily describe statistical relations between variables. In particular, simple regression is the regression method that describes the relationship between one dependent variable (y) and one independent variable (x). The following classical data set contains information on parents' heights and children's heights.

Table 2.1 Parents' Height and Children's Height
Parent | Children

The mean height is 68.5 for the parents. The regression line fitted to the data of parents and children can be written in the form child height = b0 + b1 × parent height. The simple linear regression model is typically stated in the form y = β0 + β1 x + ε, where y is the dependent variable, β0 is the y intercept, β1 is the slope of the simple linear regression line, x is the independent variable, and ε is the

Linear Regression Analysis: Theory and Computing

random error. The dependent variable is also called the response variable, and the independent variable is called the explanatory or predictor variable. An explanatory variable explains causal changes in the response variable. A more general presentation of a regression model may be written as y = E(y) + ε, where E(y) is the mathematical expectation of the response variable. When E(y) is a linear combination of explanatory variables x1, x2, …, xk, the regression is linear regression. If k = 1, the regression is simple linear regression. If E(y) is a nonlinear function of x1, x2, …, xk, the regression is nonlinear. The classical assumptions on the error term are E(ε) = 0 and a constant variance Var(ε) = σ².

The typical experiment for simple linear regression is that we observe n pairs of data (x1, y1), (x2, y2), …, (xn, yn) from a scientific experiment, and the model in terms of these pairs of data can be written as

  yi = β0 + β1 xi + εi,  i = 1, 2, …, n,

with E(εi) = 0, a constant variance Var(εi) = σ², and all εi's independent. Note that the actual value of σ² is usually unknown. The values of the xi's are measured exactly, with no measurement error involved. After the model is specified and data are collected, the next step is to find good estimates of β0 and β1 that best describe the data from the scientific experiment. We will derive these estimates and discuss their statistical properties in the next section.

2.2 Least Squares Estimation

The least squares principle for the simple linear regression model is to find the estimates b0 and b1 such that the sum of the squared distances from the actual responses yi to the predicted responses ŷi = β0 + β1 xi reaches the minimum among all possible choices of the regression coefficients β0 and β1; i.e.,

  (b0, b1) = arg min_{(β0, β1)} Σ_{i=1}^{n} [yi − (β0 + β1 xi)]².

The motivation behind the least squares method is to find parameter estimates by choosing the regression line that is the line closest to

all data points (xi, yi). Mathematically, the least squares estimates of the simple linear regression are given by solving the following system:

  ∂/∂β0 Σ_{i=1}^{n} [yi − (β0 + β1 xi)]² = 0    (2.1)
  ∂/∂β1 Σ_{i=1}^{n} [yi − (β0 + β1 xi)]² = 0    (2.2)

Suppose that b0 and b1 are the solutions of the above system; then we can describe the relationship between x and y by the regression line ŷ = b0 + b1 x, which is called the fitted regression line by convention. It is more convenient to solve for b0 and b1 using the centered linear model

  yi = β0* + β1 (xi − x̄) + εi,  where β0 = β0* − β1 x̄.

We need to solve

  ∂/∂β0* Σ_{i=1}^{n} [yi − (β0* + β1 (xi − x̄))]² = 0,
  ∂/∂β1 Σ_{i=1}^{n} [yi − (β0* + β1 (xi − x̄))]² = 0.

Taking the partial derivatives with respect to β0* and β1, we have

  Σ_{i=1}^{n} [yi − (β0* + β1 (xi − x̄))] = 0,
  Σ_{i=1}^{n} [yi − (β0* + β1 (xi − x̄))](xi − x̄) = 0.

Note that

  Σ_{i=1}^{n} yi = Σ_{i=1}^{n} [β0* + β1 (xi − x̄)] = n β0*.    (2.3)

Therefore, we have β0* = (1/n) Σ_{i=1}^{n} yi = ȳ. Substituting β0* by ȳ, we obtain

  Σ_{i=1}^{n} [yi − (ȳ + β1 (xi − x̄))](xi − x̄) = 0.

Denote by b0 and b1 the solutions of the system (2.1) and (2.2). Now it is easy to see that

  b1 = Σ (yi − ȳ)(xi − x̄) / Σ (xi − x̄)² = Sxy / Sxx    (2.4)

and

  b0 = ȳ − b1 x̄.    (2.5)

The fitted value of the simple linear regression is defined as ŷi = b0 + b1 xi. The difference between yi and the fitted value ŷi, ei = yi − ŷi, is referred to as the regression residual. Regression residuals play an important role in regression diagnosis, on which we will have extensive discussions later. Regression residuals can be computed from the observed responses yi and the fitted values ŷi; therefore, residuals are observable. It should be noted that the error term εi in the regression model is unobservable. Thus, the regression error is unobservable while the regression residual is observable. The regression error is the amount by which an observation differs from its expected value; the latter is based on the whole population from which the statistical unit was chosen randomly. The expected value, the average of the entire population, is typically unobservable.

Example 2.1. If the average height of 21-year-old males is 5 feet 9 inches, and one randomly chosen male is 5 feet 11 inches tall, then the error is 2 inches; if the randomly chosen man is 5 feet 7 inches tall, then the error is −2 inches. It is as if the measurement of the man's height were an attempt to measure the population average, so that any difference between the man's height and the average would be a measurement error.

A residual, on the other hand, is an observable estimate of the unobservable error. The simplest case involves a random sample of n men whose heights are measured. The sample average is used as an estimate of the population average. Then the difference between the height of each man in the sample and the unobservable population average is an error, and the difference between the height of each man in the sample and the observable sample average is a residual. Since residuals are observable, we can use residuals to estimate the unobservable model error.
The detailed discussion will be provided later.
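As a quick numerical illustration of formulas (2.4) and (2.5), the following sketch computes b1 = Sxy/Sxx and b0 = ȳ − b1 x̄ in plain Python. This is an editorial addition, not part of the book (whose computing examples use SAS), and the small data set is invented purely for the demonstration.

```python
# Closed-form least squares estimates for simple linear regression:
# b1 = Sxy / Sxx and b0 = ybar - b1 * xbar, as in (2.4) and (2.5).

def least_squares(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope estimate
    b0 = ybar - b1 * xbar     # intercept estimate
    return b0, b1

# Invented toy data, for illustration only.
x = [1, 2, 3, 4]
y = [2, 3, 5, 6]
b0, b1 = least_squares(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - f for yi, f in zip(y, fitted)]
```

Note that, in agreement with the derivation above, the residuals of the fitted line always sum to zero.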

2.3 Statistical Properties of the Least Squares Estimation

In this section we discuss the statistical properties of the least squares estimates for the simple linear regression. We first discuss statistical properties without the distributional assumption on the error term, but we shall assume that E(εi) = 0, Var(εi) = σ², and the εi's for i = 1, 2, …, n are independent.

Theorem 2.1. The least squares estimator b0 is an unbiased estimate of β0.

Proof.
  E(b0) = E(ȳ − b1 x̄) = E((1/n) Σ yi) − E(b1) x̄
        = (1/n) Σ E(yi) − β1 x̄
        = (1/n) Σ (β0 + β1 xi) − β1 x̄
        = β0 + β1 x̄ − β1 x̄ = β0.

Theorem 2.2. The least squares estimator b1 is an unbiased estimate of β1.

Proof.
  E(b1) = E(Sxy / Sxx) = (1/Sxx) E Σ (yi − ȳ)(xi − x̄)
        = (1/Sxx) Σ (xi − x̄) E(yi)
        = (1/Sxx) Σ (xi − x̄)(β0 + β1 xi)
        = (1/Sxx) Σ (xi − x̄) β1 xi
        = (1/Sxx) Σ (xi − x̄) β1 (xi − x̄)
        = (1/Sxx) β1 Σ (xi − x̄)² = β1.

Theorem 2.3. Var(b1) = σ²/Sxx.

Proof.
  Var(b1) = Var(Sxy / Sxx)
          = (1/Sxx²) Var( Σ (yi − ȳ)(xi − x̄) )
          = (1/Sxx²) Var( Σ yi (xi − x̄) )
          = (1/Sxx²) Σ (xi − x̄)² Var(yi)
          = (1/Sxx²) Σ (xi − x̄)² σ² = σ²/Sxx.

Theorem 2.4. The least squares estimator b1 and ȳ are uncorrelated. Under the normality assumption on the yi, i = 1, 2, …, n, b1 and ȳ are normally distributed and independent.

Proof.
  Cov(b1, ȳ) = Cov(Sxy/Sxx, ȳ) = (1/Sxx) Cov(Sxy, ȳ)
             = (1/Sxx) Cov( Σ (xi − x̄)(yi − ȳ), ȳ )
             = (1/Sxx) Cov( Σ (xi − x̄) yi, ȳ )
             = (1/(n Sxx)) Cov( Σ_i (xi − x̄) yi, Σ_j yj )
             = (1/(n Sxx)) Σ_{i,j} (xi − x̄) Cov(yi, yj).

Noting that E(εi) = 0 and that the εi's are independent, we can write

  Cov(yi, yj) = E[(yi − E yi)(yj − E yj)] = E(εi εj) = σ² if i = j, and 0 if i ≠ j.

Thus, we conclude that

  Cov(b1, ȳ) = (1/(n Sxx)) Σ (xi − x̄) σ² = 0.

Recall that zero correlation is equivalent to independence between two normal variables. Thus, we conclude that b1 and ȳ are independent.

Theorem 2.5. Var(b0) = (1/n + x̄²/Sxx) σ².

Proof. Since b1 and ȳ are uncorrelated (Theorem 2.4),
  Var(b0) = Var(ȳ − b1 x̄) = Var(ȳ) + x̄² Var(b1)
          = σ²/n + x̄² σ²/Sxx = (1/n + x̄²/Sxx) σ².

The properties 1–5, especially the variances of b0 and b1, are important when we would like to draw statistical inference on the intercept and slope of the simple linear regression. The variances of the least squares estimators b0 and b1 involve the variance of the error term in the simple regression model, and this error variance is unknown to us. Therefore, we need to estimate it. We now discuss how to estimate the variance of the error term in the simple linear regression model. Let yi be the observed response and ŷi = b0 + b1 xi the fitted value of the response. Both yi and ŷi are available to us. The true error εi in the model is not observable, and we would like to estimate it. The quantity yi − ŷi is the empirical version of the error εi. This difference is the regression residual, which plays an important role in regression model diagnosis. We propose the following estimate of the error variance based on the ei:

  s² = (1/(n − 2)) Σ_{i=1}^{n} (yi − ŷi)².

Note that the denominator is n − 2. This makes s² an unbiased estimator of the error variance σ². The simple linear model has two parameters; therefore, n − 2 can be viewed as n minus the number of parameters in the simple

linear regression model. We will see in later chapters that this is true for all general linear models. In particular, in a multiple linear regression model with p parameters, the denominator should be n − p in order to construct an unbiased estimator of the error variance σ². Detailed discussion can be found in later chapters. The unbiasedness of the estimator s² for the simple linear regression can be shown in the following derivations.

  yi − ŷi = yi − b0 − b1 xi = yi − (ȳ − b1 x̄) − b1 xi = (yi − ȳ) − b1 (xi − x̄).

It follows that

  Σ (yi − ŷi) = Σ (yi − ȳ) − b1 Σ (xi − x̄) = 0.

Note that Σ (yi − ŷi) xi = Σ [(yi − ȳ) − b1 (xi − x̄)] xi; hence, using the fact that Σ [(yi − ȳ) − b1 (xi − x̄)] = 0, we have

  Σ (yi − ŷi) xi = Σ [(yi − ȳ) − b1 (xi − x̄)](xi − x̄)
                 = Σ (yi − ȳ)(xi − x̄) − b1 Σ (xi − x̄)²
                 = Sxy − b1 Sxx = Sxy − (Sxy/Sxx) Sxx = 0.

To show that s² is an unbiased estimate of the error variance, first note that yi − ŷi = (yi − ȳ) − b1 (xi − x̄); therefore

  Σ (yi − ŷi)² = Σ [(yi − ȳ) − b1 (xi − x̄)]²
               = Σ (yi − ȳ)² − 2 b1 Σ (xi − x̄)(yi − ȳ) + b1² Σ (xi − x̄)²
               = Σ (yi − ȳ)² − 2 b1 Sxy + b1² Sxx
               = Σ (yi − ȳ)² − 2 (Sxy/Sxx) Sxy + (Sxy²/Sxx²) Sxx
               = Σ (yi − ȳ)² − Sxy²/Sxx.

Since yi − ȳ = β1 (xi − x̄) + (εi − ε̄),

  (yi − ȳ)² = β1² (xi − x̄)² + (εi − ε̄)² + 2 β1 (xi − x̄)(εi − ε̄),

and therefore

  E(yi − ȳ)² = β1² (xi − x̄)² + E(εi − ε̄)² = β1² (xi − x̄)² + ((n − 1)/n) σ²,

so that

  E Σ (yi − ȳ)² = β1² Sxx + (n − 1) σ².

Furthermore, we have

  E(Sxy) = E Σ (xi − x̄)(yi − ȳ) = Σ (xi − x̄) E(yi)
         = Σ (xi − x̄)(β0 + β1 xi) = β1 Σ (xi − x̄) xi
         = β1 Σ (xi − x̄)² = β1 Sxx

and

  Var(Sxy) = Var( Σ (xi − x̄) yi ) = Σ (xi − x̄)² Var(yi) = Sxx σ².

Thus, we can write

  E(Sxy²) = Var(Sxy) + [E(Sxy)]² = Sxx σ² + β1² Sxx²

and

  E(Sxy²/Sxx) = σ² + β1² Sxx.

Finally, E[(n − 2) s²] is given by

  E Σ (yi − ŷi)² = E Σ (yi − ȳ)² − E(Sxy²/Sxx)
                 = β1² Sxx + (n − 1) σ² − (β1² Sxx + σ²) = (n − 2) σ².

In other words, we have proved that

  E(s²) = E[ (1/(n − 2)) Σ (yi − ŷi)² ] = σ².

Thus s², the estimate of the error variance, is an unbiased estimator of the error variance σ² in the simple linear regression. Another view of choosing n − 2 is that in the simple linear regression model there are n observations and two restrictions on these observations: (1) Σ (yi − ŷi) = 0 and (2) Σ (yi − ŷi) xi = 0. Hence the error variance estimate has n − 2 degrees of freedom, which is also the total number of observations minus the total number of parameters in the model. We will see a similar feature in multiple linear regression.

2.4 Maximum Likelihood Estimation

The maximum likelihood estimates of the simple linear regression can be developed if we assume that the dependent variable yi has a normal distribution: yi ~ N(β0 + β1 xi, σ²). The likelihood function for the independent observations (y1, y2, …, yn) is given by

  L = Π f(yi) = (2π)^{−n/2} σ^{−n} exp( −(1/(2σ²)) Σ (yi − β0 − β1 xi)² ).

The estimators of β0 and β1 that maximize the likelihood function L are equivalent to the estimators that minimize the sum of squares in the exponent, which yields the same estimators as the least squares estimators of the linear regression. Thus, under the normality assumption on the error term, the MLEs of β0 and β1 and the least squares estimators of β0 and β1 are exactly the same.

After we obtain b1 and b0, the MLEs of the parameters β0 and β1, we can compute the fitted values ŷi and write the likelihood function in terms of the fitted values:

  L = Π f(yi) = (2π)^{−n/2} σ^{−n} exp( −(1/(2σ²)) Σ (yi − ŷi)² ).

We then take the partial derivative of the log-likelihood log(L) with respect to σ² and set it to zero:

  ∂ log(L)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ (yi − ŷi)² = 0.

The MLE of σ² is σ̂² = (1/n) Σ (yi − ŷi)². Note that it is a biased estimate of σ², since we know that s² = (1/(n − 2)) Σ (yi − ŷi)² is an unbiased estimate of the error variance σ²; hence (n/(n − 2)) σ̂² is an unbiased estimate of σ². Note also that σ̂² is an asymptotically unbiased estimate of σ², which coincides with the classical theory of the MLE.

2.5 Confidence Intervals on Regression Mean and Regression Prediction

Regression models are often constructed based on certain conditions that must be verified for the model to fit the data well, and to be able to predict the response for a given regressor as accurately as possible. One of the main objectives of regression analysis is to use the fitted regression model to make predictions. A regression prediction is the calculated response value from the fitted regression model at a data point that was not used in the model fitting. The confidence interval of the regression prediction provides a way of assessing the quality of the prediction. Often the following regression prediction confidence intervals are of interest:

- A confidence interval for a single point on the regression line.
- A confidence interval for a single future value of y corresponding to a chosen value of x.
- A confidence region for the regression line as a whole.
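To make the contrast between the two variance estimates concrete, the following plain-Python sketch (an editorial addition, with invented data) computes both the unbiased estimate s² = SSE/(n − 2) and the MLE σ̂² = SSE/n:

```python
# Two estimates of the error variance sigma^2 from the residual sum of
# squares SSE = sum (y_i - yhat_i)^2:
#   s2   = SSE / (n - 2)   unbiased estimate
#   sig2 = SSE / n         maximum likelihood estimate (biased downward)

def variance_estimates(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return sse / (n - 2), sse / n

# Invented toy data, for illustration only.
s2, sig2 = variance_estimates([1, 2, 3, 4], [2, 3, 5, 6])
# sig2 = ((n-2)/n) * s2, so the MLE is always the smaller of the two.
```

The ratio σ̂²/s² = (n − 2)/n tends to 1 as n grows, which is the asymptotic unbiasedness noted above.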

If a particular value of the predictor variable is of special importance, a confidence interval on the corresponding response y at that regressor value x may be of interest. A confidence interval of the second kind can be used to evaluate the accuracy of a single future value of y at a chosen value of the regressor x: it provides an interval for an estimated value of y at x with a desired confidence level 1 − α. It is of interest to compare these two kinds of confidence intervals. The second kind is wider, which reflects the lower accuracy resulting from estimating a single future value of y rather than the mean value computed for the first kind of confidence interval. When the entire regression line is of interest, a confidence region can provide simultaneous statements about estimates of y for a number of values of the predictor x; i.e., for a set of values of the regressor, 100(1 − α) percent of the corresponding response values will fall in this region.

To discuss the confidence interval for the regression line we consider the fitted value of the regression line at x = x0, which is ŷ(x0) = b0 + b1 x0, and the mean value at x = x0, which is E(ŷ | x0) = β0 + β1 x0. Noting that b1 is independent of ȳ, we have

  Var(ŷ(x0)) = Var(b0 + b1 x0) = Var(ȳ + b1 (x0 − x̄))
             = Var(ȳ) + (x0 − x̄)² Var(b1)
             = σ²/n + (x0 − x̄)² σ²/Sxx
             = σ² [ 1/n + (x0 − x̄)²/Sxx ].

Replacing σ by s, the standard error of the estimated regression mean at x0 is given by

  s_ŷ(x0) = s √( 1/n + (x0 − x̄)²/Sxx ).

If ε ~ N(0, σ²), the (1 − α)100% confidence interval on E(ŷ | x0) = β0 + β1 x0 can be written as

  ŷ(x0) ± t_{α/2, n−2} s √( 1/n + (x0 − x̄)²/Sxx ).

We now discuss the confidence interval on the regression prediction. Denote the regression prediction at x0 by y0 and assume that y0 is independent of ŷ(x0), where ŷ(x0) = b0 + b1 x0 and E(y0 − ŷ(x0)) = 0. We have

  Var(y0 − ŷ(x0)) = σ² + σ² [ 1/n + (x0 − x̄)²/Sxx ] = σ² [ 1 + 1/n + (x0 − x̄)²/Sxx ].

Under the normality assumption on the error term,

  (y0 − ŷ(x0)) / ( σ √( 1 + 1/n + (x0 − x̄)²/Sxx ) ) ~ N(0, 1).

Substituting σ with s, we have

  (y0 − ŷ(x0)) / ( s √( 1 + 1/n + (x0 − x̄)²/Sxx ) ) ~ t_{n−2}.

Thus the (1 − α)100% confidence interval on the regression prediction y0 can be expressed as

  ŷ(x0) ± t_{α/2, n−2} s √( 1 + 1/n + (x0 − x̄)²/Sxx ).

2.6 Statistical Inference on Regression Parameters

We start with a discussion of the total variance of the regression model, which plays an important role in regression analysis. In order to partition the total variance Σ (yi − ȳ)², we consider the fitted regression equation ŷi = b0 + b1 xi, where b0 = ȳ − b1 x̄ and b1 = Sxy/Sxx. We can write

  ŷ̄ = (1/n) Σ ŷi = (1/n) Σ [(ȳ − b1 x̄) + b1 xi] = (1/n) Σ [ȳ + b1 (xi − x̄)] = ȳ.
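The two interval formulas of the preceding section differ only through the extra "1" under the square root. The plain-Python sketch below (an editorial addition with invented data; the t quantile t_{α/2, n−2} is left as a user-supplied input, since computing it requires t-distribution tables or a statistics library) computes the two standard errors at a new point x0:

```python
import math

# Standard errors at a point x0 for (i) the estimated regression mean and
# (ii) a single predicted response. The prediction standard error adds "+1"
# under the square root and is therefore always the wider of the two.

def standard_errors(x, y, x0):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    s2 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    s = math.sqrt(s2)
    se_mean = s * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)
    se_pred = s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)
    return b0 + b1 * x0, se_mean, se_pred

# Invented toy data; intervals are yhat0 ± t_{alpha/2, n-2} * se.
yhat0, se_mean, se_pred = standard_errors([1, 2, 3, 4], [2, 3, 5, 6], 2.5)
```

Both standard errors are smallest at x0 = x̄ and grow as x0 moves away from the center of the data, which is why the confidence bands in the figures later in the chapter flare outward at the extremes.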

For the regression response yi, the total variance is Σ (yi − ȳ)². Note that the cross-product term is zero, so the total variance can be partitioned into two parts:

  Σ (yi − ȳ)² = Σ [(yi − ŷi) + (ŷi − ȳ)]²
              = Σ (ŷi − ȳ)² + Σ (yi − ŷi)²
              = SS_Reg + SS_Res
              = variance explained by regression + variance unexplained.

It can be shown that the cross-product term in the partition of variance is zero (using ŷi = ȳ + b1 (xi − x̄) and the fact that Σ (yi − ŷi) = 0):

  Σ (ŷi − ȳ)(yi − ŷi) = b1 Σ (xi − x̄)(yi − ŷi)
                       = b1 [ Σ (xi − x̄)(yi − ȳ) − b1 Σ (xi − x̄)² ]
                       = b1 [ Sxy − b1 Sxx ]
                       = b1 [ Sxy − (Sxy/Sxx) Sxx ] = 0.

The degrees of freedom for SS_Reg and SS_Res are displayed in Table 2.2.

Table 2.2 Degrees of Freedom in Partition of Total Variance
  SS_Total = SS_Reg + SS_Res
  n − 1    = 1      + (n − 2)

To test the hypothesis H0: β1 = 0 versus H1: β1 ≠ 0, it is necessary to assume that εi ~ N(0, σ²). Table 2.3 lists the distributions of SS_Reg, SS_Res, and SS_Total under the hypothesis H0. The test statistic is given by

  F = SS_Reg / ( SS_Res/(n − 2) ) ~ F_{1, n−2},

which is a one-sided, upper-tailed F test. Table 2.4 is a typical regression Analysis of Variance (ANOVA) table.

Table 2.3 Distributions of the Partition of Total Variance
  SS        df      Distribution
  SS_Reg    1       σ² χ²_1
  SS_Res    n − 2   σ² χ²_{n−2}
  SS_Total  n − 1   σ² χ²_{n−1}

Table 2.4 ANOVA Table 1
  Source      SS        df      MS          F
  Regression  SS_Reg    1       SS_Reg/1    F = MS_Reg/s²
  Residual    SS_Res    n − 2   s²
  Total       SS_Total  n − 1

To test for the regression slope β1, note that b1 follows the normal distribution

  b1 ~ N( β1, σ²/Sxx )

and

  (b1 − β1) √Sxx / s ~ t_{n−2},

which can be used to test H0: β1 = β10 versus H1: β1 ≠ β10. A similar approach can be used to test for the regression intercept. Under the normality assumption on the error term,

  b0 ~ N( β0, σ² (1/n + x̄²/Sxx) ).

Therefore, we can use the following t test statistic to test H0: β0 = β00 versus H1: β0 ≠ β00:

  t = (b0 − β00) / ( s √( 1/n + x̄²/Sxx ) ) ~ t_{n−2}.

It is straightforward to use the distributions of b0 and b1 to obtain the (1 − α)100% confidence intervals on β0 and β1:

  b0 ± t_{α/2, n−2} s √( 1/n + x̄²/Sxx )

and

  b1 ± t_{α/2, n−2} s √( 1/Sxx ).

Suppose that the regression line passes through (0, β0); i.e., the y intercept is a known constant β0. The model is given by yi = β0 + β1 xi + εi with known constant β0. Using the least squares principle we can estimate β1:

  b1 = Σ xi (yi − β0) / Σ xi²,

which reduces to Σ xi yi / Σ xi² when β0 = 0. Correspondingly, the following test statistic can be used to test H0: β1 = β10 versus H1: β1 ≠ β10. Under the normality assumption on εi,

  t = (b1 − β10) √( Σ xi² ) / s ~ t_{n−1}.

Note that we have only one parameter in the fixed-intercept regression model, so the t test statistic has n − 1 degrees of freedom, which is different from the simple linear model with two parameters.

The quantity R², defined below, is a measurement of regression fit:

  R² = SS_Reg / SS_Total = Σ (ŷi − ȳ)² / Σ (yi − ȳ)² = 1 − SS_Res / SS_Total.

Note that 0 ≤ R² ≤ 1, and it represents the proportion of total variation explained by the regression model.
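The partition SS_Total = SS_Reg + SS_Res, the F statistic, and R² can all be checked numerically. The following plain-Python sketch is an editorial addition with invented data:

```python
# ANOVA quantities for a simple linear regression:
#   SS_Total = SS_Reg + SS_Res
#   F   = SS_Reg / (SS_Res / (n - 2))
#   R^2 = SS_Reg / SS_Total

def anova(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    fitted = [b0 + b1 * xi for xi in x]
    ss_total = sum((yi - ybar) ** 2 for yi in y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_reg = sum((fi - ybar) ** 2 for fi in fitted)
    f_stat = ss_reg / (ss_res / (n - 2))
    r2 = ss_reg / ss_total
    return ss_total, ss_reg, ss_res, f_stat, r2

# Invented toy data, for illustration only.
ss_total, ss_reg, ss_res, f_stat, r2 = anova([1, 2, 3, 4], [2, 3, 5, 6])
```

Checking that ss_total equals ss_reg + ss_res (up to rounding) is a direct numerical verification that the cross-product term in the partition vanishes.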

The quantity CV = 100 · (s/ȳ) is called the coefficient of variation, which is also a measurement of quality of fit and represents the spread of the noise around the regression line. The values of R² and CV can be found in Table 2.7, an ANOVA table generated by the SAS procedure REG.

We now discuss simultaneous inference on the simple linear regression. Note that so far we have discussed statistical inference on β0 and β1 individually. An individual test means that when we test H0: β0 = β00 we test only this H0, regardless of the value of β1; likewise, when we test H0: β1 = β10 we test only this H0, regardless of the value of β0. If we would like to test whether or not a regression line falls into a certain region, we need to test the multiple hypothesis H0: β0 = β00, β1 = β10 simultaneously. This falls into the scope of multiple inference. For the multiple inference on β0 and β1 we notice that

  (b0 − β0, b1 − β1) ( n      Σ xi  ) (b0 − β0)
                     ( Σ xi   Σ xi² ) (b1 − β1)  ~ 2 s² F_{2, n−2}.

Thus, the (1 − α)100% confidence region for (β0, β1) is given by

  (b0 − β0, b1 − β1) ( n      Σ xi  ) (b0 − β0)
                     ( Σ xi   Σ xi² ) (b1 − β1)  ≤ 2 s² F_{α, 2, n−2},

where F_{α, 2, n−2} is the upper α percentage point of the F distribution. Note that this confidence region is an ellipse.

2.7 Residual Analysis and Model Diagnosis

One way to check the performance of a regression model is through the regression residuals ei = yi − ŷi. For the simple linear regression, a scatter plot of ei against xi provides a good graphical diagnosis for the regression model. Residuals evenly distributed around mean zero are an indication of a good regression model fit. We now discuss the characteristics of the regression residuals when the regression model is misspecified. Suppose that the correct model should take the quadratic form:

  yi = β0 + β1 (xi − x̄) + β2 xi² + εi, with E(εi) = 0.

Assume that the incorrectly specified linear regression model takes the following form:

  yi = β0 + β1 (xi − x̄) + εi*.

Then εi* = β2 xi² + εi, which is unknown to the analyst. Now the mean of the error for the simple linear regression is not zero at all, and it is a function of xi. From the quadratic model we have

  b0 = ȳ = β0 + β2 (1/n) Σ xj² + ε̄

and

  b1 = Sxy/Sxx = (1/Sxx) Σ (xi − x̄)( β0 + β1 (xi − x̄) + β2 xi² + εi )
     = β1 + β2 Σ (xi − x̄) xi² / Sxx + Σ (xi − x̄) εi / Sxx.

It is easy to see that

  E(b0) = β0 + β2 (1/n) Σ xj²  and  E(b1) = β1 + β2 Σ (xj − x̄) xj² / Sxx.

Therefore, the estimators b0 and b1 are biased estimates of β0 and β1. Suppose that we fit the linear regression model with fitted values ŷi = b0 + b1 (xi − x̄); then the expected regression residual is given by

  E(ei) = E(yi − ŷi) = [ β0 + β1 (xi − x̄) + β2 xi² ] − [ E(b0) + E(b1)(xi − x̄) ]
        = β2 [ ( xi² − (1/n) Σ xj² ) − ( Σ (xj − x̄) xj² / Sxx )(xi − x̄) ].

If β2 = 0, then the fitted model is correct and E(yi − ŷi) = 0. Otherwise, the expected value of the residual is a quadratic function of the xi's. As a result, the plot of the residuals against the xi's will show a quadratic curvature.

Statistical inference on the regression model is based on the normality assumption on the error term. The least squares estimators and the MLEs of the regression parameters are exactly identical only under this normality assumption. Now, the question is how to check the normality of the error term. Consider the residual yi − ŷi: we have E(yi − ŷi) = 0 and

  Var(yi − ŷi) = Var(yi) + Var(ŷi) − 2 Cov(yi, ŷi)
               = σ² + σ² [ 1/n + (xi − x̄)²/Sxx ] − 2 Cov( yi, ȳ + b1 (xi − x̄) ).

We calculate the last term:

  Cov( yi, ȳ + b1 (xi − x̄) ) = Cov(yi, ȳ) + (xi − x̄) Cov(yi, b1)
    = σ²/n + (xi − x̄) (1/Sxx) Cov( yi, Σ (xj − x̄)(yj − ȳ) )
    = σ²/n + (xi − x̄) (1/Sxx) Cov( yi, Σ (xj − x̄) yj )
    = σ²/n + (xi − x̄)² σ²/Sxx.

Thus, the variance of the residual is given by

  Var(ei) = Var(yi − ŷi) = σ² [ 1 − ( 1/n + (xi − x̄)²/Sxx ) ],

which can be estimated by

  s_{ei} = s √( 1 − ( 1/n + (xi − x̄)²/Sxx ) ).

If the error term in the simple linear regression is correctly specified, i.e., the error is normally distributed, the standardized residuals should behave like a standard normal random variable. Therefore, the quantiles of the standardized residuals in the simple linear regression will be similar to the quantiles of the standard normal random variable. Thus, the plot of the

quantiles of the standardized residuals versus the normal quantiles should follow a straight line if the normality assumption on the error term is correct. This is usually called the normal plot, and it has been used as a useful tool for checking the normality of the error term in simple linear regression. Specifically, we can

(1) plot the ordered residuals yi − ŷi against the normal quantiles, or
(2) plot the ordered standardized residuals (yi − ŷi)/s_{ei} against the normal quantiles.

2.8 Example

The SAS procedure REG can be used to perform regression analysis. It is convenient and efficient. The REG procedure provides the most popular parameter estimation, residual analysis, and regression diagnostics. We present an example of regression analysis of the density and stiffness data using SAS.

data example1;
input density stiffness @@;
datalines;
;
proc reg data=example1 outest=out1 tableout;
model stiffness=density/all;
run;
ods rtf file="C:\example1_out1.rtf";
proc print data=out1;
title "Parameter Estimates and CIs";
run;
ods rtf close;

*Trace ODS to find out the names of the output data sets;
ods trace on;
ods show;
ods rtf file="C:\example1_out2.rtf";
proc reg data=example1 alpha=0.05;
model stiffness=density;
ods select Reg.MODEL1.Fit.stiffness.ANOVA;
ods select Reg.MODEL1.Fit.stiffness.FitStatistics;
ods select Reg.MODEL1.Fit.stiffness.ParameterEstimates;
ods rtf close;

proc reg data=example1;
model stiffness=density;
output out=out3 p=yhat r=yresid student=sresid;
run;
ods rtf file="C:\example1_out3.rtf";
proc print data=out3;
title "Predicted Values and Residuals";
run;
ods rtf close;

The above SAS code generates the output in Tables 2.5, 2.6, 2.7, 2.8, and 2.9.

Table 2.5 Confidence Intervals on Parameter Estimates
  Obs  MODEL   TYPE     DEPVAR     RMSE  Intercept  density
  1    Model1  Parms    stiffness
  2    Model1  Stderr   stiffness
  3    Model1  T        stiffness
  4    Model1  P-value  stiffness
  5    Model1  L95B     stiffness
  6    Model1  U95B     stiffness
Data Source: density and stiffness data

The following is an example of a SAS program for computing the confidence band of the regression mean, the confidence band for regression prediction, and probability plots (QQ-plot and PP-plot).

Table 2.6 ANOVA Table 2
  Source           DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model            1                                             <.0001
  Error
  Corrected Total
Data Source: density and stiffness data

Table 2.7 Regression Table
  Root MSE        R-Square
  Dependent Mean  Adj R-Sq
  Coeff Var
Data Source: density and stiffness data

Table 2.8 Parameter Estimates of Simple Linear Regression
  Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept  1
  density    1                                                    <.0001
Data Source: density and stiffness data

data Example2;
input density stiffness @@;
datalines;
;
legend1 across=1 cborder=red offset=(0,0) shape=symbol(3,1) label=none value=(height=1);
symbol1 c=black value=- h=1;
symbol2 c=red;

Table 2.9 Table of Fitted Values and Residuals
  Obs  density  stiffness  yhat  yresid
Data Source: density and stiffness data

symbol3 c=blue;
symbol4 c=blue;
proc reg data=Example2;
model density=stiffness /noprint p r;
output out=out p=pred r=resid LCL=lowpred UCL=uppred LCLM=lowreg UCLM=upreg;
run;
ods rtf file="C:\example2.rtf";
ods graphics on;
title "PP Plot";

plot npp.*r./caxis=red ctext=blue nostat cframe=ligr;
run;
title "QQ Plot";
plot r.*nqq. /noline mse caxis=red ctext=blue cframe=ligr;
run;
*Compute confidence band of regression mean;
plot density*stiffness/conf caxis=red ctext=blue cframe=ligr legend=legend1;
run;
*Compute confidence band of regression prediction;
plot density*stiffness/pred caxis=red ctext=blue cframe=ligr legend=legend1;
run;
ods graphics off;
ods rtf close;
quit;

The regression scatterplot, residual plot, and 95% confidence bands for the regression mean and prediction are presented in Fig. 2.1.

Fig. 2.1 (a) Regression Line and Scatter Plot. (b) Residual Plot. (c) 95% Confidence Band for Regression Mean. (d) 95% Confidence Band for Regression Prediction.

The Q-Q plot for the regression model density = β0 + β1·stiffness is presented in Fig. 2.2.

Fig. 2.2 Q-Q Plot for Regression Model density = β0 + β1·stiffness + ε.
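Outside SAS, the standardized residuals used in the normal plot above can be computed directly from the residual-variance formula of Section 2.7. The following plain-Python sketch is an editorial addition with invented data:

```python
import math

# Standardized residuals e_i / (s * sqrt(1 - h_i)), where
# h_i = 1/n + (x_i - xbar)^2 / Sxx, using the residual-variance formula
# Var(e_i) = sigma^2 * [1 - (1/n + (x_i - xbar)^2 / Sxx)].

def standardized_residuals(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]   # leverage terms h_i
    return [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, lev)]

# Invented toy data; for a normal plot, plot the sorted values of r
# against the corresponding standard normal quantiles.
r = standardized_residuals([1, 2, 3, 4], [2, 3, 5, 6])
```

Under a correctly specified normal-error model these standardized residuals behave approximately like standard normal draws, which is exactly what the Q-Q plot checks.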

Problems

1. Consider a set of data (xi, yi), i = 1, 2, …, n, and the following two regression models:

  yi = β0 + β1 xi + εi, (i = 1, 2, …, n),   Model A
  yi = γ0 + γ1 xi + γ2 xi² + εi, (i = 1, 2, …, n),   Model B

Suppose both models are fitted to the same data. Show that SS_Res,A ≥ SS_Res,B. If more higher-order terms are added to Model B, i.e.,

  yi = γ0 + γ1 xi + γ2 xi² + γ3 xi³ + ⋯ + γk xi^k + εi, (i = 1, 2, …, n),

show that the inequality SS_Res,A ≥ SS_Res,B still holds.

2. Consider the zero-intercept model given by

  yi = β1 xi + εi, (i = 1, 2, …, n),

where the εi's are independent normal variables with constant variance σ². Show that the 100(1 − α)% confidence interval on E(y | x0) is given by

  b1 x0 ± t_{α/2, n−1} s √( x0² / Σ xi² ),

where s = √( Σ (yi − b1 xi)² / (n − 1) ) and b1 = Σ yi xi / Σ xi².

3. Derive and discuss the (1 − α)100% confidence interval on the slope β1 for the simple linear model with zero intercept.

4. Consider the fixed zero-intercept regression model

  yi = β1 xi + εi, (i = 1, 2, …, n).

The appropriate estimator of σ² is given by

  s² = Σ (yi − ŷi)² / (n − 1).

Show that s² is an unbiased estimator of σ².

Table 2.10 Data for Two Parallel Regression Lines
  x            y
  x_1          y_1
  ⋮            ⋮
  x_{n1}       y_{n1}
  x_{n1+1}     y_{n1+1}
  ⋮            ⋮
  x_{n1+n2}    y_{n1+n2}

5. Consider a situation in which the regression data set is divided into two parts as shown in Table 2.10. The regression model is given by

  yi = β0^(1) + β1 xi + εi,  i = 1, 2, …, n1;
  yi = β0^(2) + β1 xi + εi,  i = n1 + 1, …, n1 + n2.

In other words, there are two regression lines with a common slope. Use the centered regression model

  yi = β0^(1*) + β1 (xi − x̄1) + εi,  i = 1, 2, …, n1;
  yi = β0^(2*) + β1 (xi − x̄2) + εi,  i = n1 + 1, …, n1 + n2,

where x̄1 = Σ_{i=1}^{n1} xi / n1 and x̄2 = Σ_{i=n1+1}^{n1+n2} xi / n2. Show that the least squares estimate of β1 is given by

  b1 = [ Σ_{i=1}^{n1} (xi − x̄1) yi + Σ_{i=n1+1}^{n1+n2} (xi − x̄2) yi ] / [ Σ_{i=1}^{n1} (xi − x̄1)² + Σ_{i=n1+1}^{n1+n2} (xi − x̄2)² ].

6. Consider two simple linear models

  Y1j = α1 + β1 x1j + ε1j, j = 1, 2, …, n1,

and

  Y2j = α2 + β2 x2j + ε2j, j = 1, 2, …, n2.

Assume that β1 ≠ β2, so the above two simple linear models intersect. Let x0 be the point on the x-axis at which the two linear models intersect. Also assume that the εij are independent normal variables with a common variance σ². Show that

(a). x0 = (α1 − α2)/(β2 − β1).

(b). Find the maximum likelihood estimate (MLE) of x0 using the least squares estimators α̂1, α̂2, β̂1, and β̂2.

(c). Show that the distribution of Z = (α̂1 − α̂2) + x0 (β̂1 − β̂2) is normal with mean 0 and variance A² σ², where

  A² = ( Σ x1j² − 2 x0 Σ x1j + n1 x0² ) / ( n1 Σ (x1j − x̄1)² )
     + ( Σ x2j² − 2 x0 Σ x2j + n2 x0² ) / ( n2 Σ (x2j − x̄2)² ).

(d). Show that U = N σ̂²/σ² is distributed as χ²(N), where N = n1 + n2 − 4.

(e). Show that U and Z are independent.

(f). Show that W = Z²/(A² σ̂²) has the F distribution with degrees of freedom 1 and N.

(g). Let S1² = Σ (x1j − x̄1)² and S2² = Σ (x2j − x̄2)². Show that the solution of the following quadratic equation in x0, q(x0) = a x0² − 2 b x0 + c = 0, namely

  [ (β̂1 − β̂2)² − ( 1/S1² + 1/S2² ) σ̂² F_{α,1,N} ] x0²
  + 2 [ (α̂1 − α̂2)(β̂1 − β̂2) + ( x̄1/S1² + x̄2/S2² ) σ̂² F_{α,1,N} ] x0
  + [ (α̂1 − α̂2)² − ( Σ x1j²/(n1 S1²) + Σ x2j²/(n2 S2²) ) σ̂² F_{α,1,N} ] = 0,

gives the interval limits. Show that if a > 0 and b² − ac ≥ 0, then the 1 − α confidence interval on x0 is

  ( b − √(b² − ac) )/a ≤ x0 ≤ ( b + √(b² − ac) )/a.

7. Observations on the yield of a chemical reaction taken at various temperatures were recorded in Table 2.11.

(a). Fit a simple linear regression and estimate β0 and β1 using the least squares method.

(b). Compute 95% confidence intervals on E(y | x) at the 4 levels of temperature in the data. Plot the upper and lower confidence limits around the regression line.

Table 2.11 Chemical Reaction Data
  temperature (°C)   yield of chemical reaction (%)
Data Source: Raymond H. Myers, Classical and Modern Regression Analysis with Applications, p. 77.

(c). Plot a 95% confidence band on the regression line. Plot it on the same graph as part (b) and comment on it.

8. The study "Development of LIFETEST, a Dynamic Technique to Assess Individual Capability to Lift Material" was conducted at Virginia Polytechnic Institute and State University in 1982 to determine if certain static arm strength measures have influence on the dynamic lift characteristics of an individual. Twenty-five individuals were subjected to strength tests and then were asked to perform a weight-lifting test in which weight was dynamically lifted overhead. The data are in Table 2.12.

(a). Find the linear regression line using the least squares method.

(b). Define the joint hypothesis H0: β0 = 0, β1 = 2.2. Test this hypothesis using a 95% joint confidence region on β0 and β1 to draw your conclusion.

(c). Calculate the studentized residuals for the regression model. Plot the studentized residuals against x and comment on the plot.

Table 2.12 Weight-lifting Test Data
Individual    Arm Strength (x)    Dynamic Lift (y)
Data Source: Raymond H. Myers, Classical and Modern Regression Analysis With Applications, p. 76.
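Exercises 7(b) and 8(c) both come down to a few closed-form quantities of the simple linear model: the least squares fit, the standard error of the mean response, and the internally studentized residuals eᵢ / (σ̂√(1 − hᵢᵢ)). A minimal sketch on synthetic data follows (the numbers are hypothetical, not the values of Tables 2.11 or 2.12, and the critical t value for the interval is left to a table lookup):

```python
import numpy as np

def fit_simple_ols(x, y):
    # Least-squares estimates b0, b1 for the model y = b0 + b1*x + error.
    xbar, ybar = x.mean(), y.mean()
    sxx = ((x - xbar) ** 2).sum()
    b1 = ((x - xbar) * (y - ybar)).sum() / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

def studentized_residuals(x, y):
    # Internally studentized residuals e_i / (s * sqrt(1 - h_ii)),
    # where h_ii = 1/n + (x_i - xbar)^2 / Sxx is the leverage.
    n = x.size
    b0, b1 = fit_simple_ols(x, y)
    e = y - (b0 + b1 * x)
    s2 = (e ** 2).sum() / (n - 2)          # residual mean square
    xbar = x.mean()
    sxx = ((x - xbar) ** 2).sum()
    h = 1.0 / n + (x - xbar) ** 2 / sxx
    return e / np.sqrt(s2 * (1.0 - h))

def mean_response_se(x, y, x0):
    # Standard error of the fitted mean response at x0; multiply by
    # t_{alpha/2, n-2} to get a confidence interval on E(y | x0).
    n = x.size
    b0, b1 = fit_simple_ols(x, y)
    e = y - (b0 + b1 * x)
    s2 = (e ** 2).sum() / (n - 2)
    xbar = x.mean()
    sxx = ((x - xbar) ** 2).sum()
    return np.sqrt(s2 * (1.0 / n + (x0 - xbar) ** 2 / sxx))

# Synthetic demonstration data: true line y = 20 + 2.5x plus noise.
rng = np.random.default_rng(0)
x = np.linspace(10.0, 30.0, 20)
y = 20.0 + 2.5 * x + rng.normal(0.0, 2.0, x.size)
b0, b1 = fit_simple_ols(x, y)
r = studentized_residuals(x, y)
se = mean_response_se(x, y, x0=20.0)
```

Plotting `r` against `x`, as exercise 8(c) asks, should show a patternless band roughly within ±2 if the simple linear model is adequate; the confidence band of exercise 7(c) is obtained by evaluating `mean_response_se` over a grid of x₀ values.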


More information

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics Explorig Data: Distributios Look for overall patter (shape, ceter, spread) ad deviatios (outliers). Mea (use a calculator): x = x 1 + x 2 + +

More information

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1 EECS564 Estimatio, Filterig, ad Detectio Hwk 2 Sols. Witer 25 4. Let Z be a sigle observatio havig desity fuctio where. p (z) = (2z + ), z (a) Assumig that is a oradom parameter, fid ad plot the maximum

More information

Since X n /n P p, we know that X n (n. Xn (n X n ) Using the asymptotic result above to obtain an approximation for fixed n, we obtain

Since X n /n P p, we know that X n (n. Xn (n X n ) Using the asymptotic result above to obtain an approximation for fixed n, we obtain Assigmet 9 Exercise 5.5 Let X biomial, p, where p 0, 1 is ukow. Obtai cofidece itervals for p i two differet ways: a Sice X / p d N0, p1 p], the variace of the limitig distributio depeds oly o p. Use the

More information

Lecture 7: Properties of Random Samples

Lecture 7: Properties of Random Samples Lecture 7: Properties of Radom Samples 1 Cotiued From Last Class Theorem 1.1. Let X 1, X,...X be a radom sample from a populatio with mea µ ad variace σ

More information

Comparing Two Populations. Topic 15 - Two Sample Inference I. Comparing Two Means. Comparing Two Pop Means. Background Reading

Comparing Two Populations. Topic 15 - Two Sample Inference I. Comparing Two Means. Comparing Two Pop Means. Background Reading Topic 15 - Two Sample Iferece I STAT 511 Professor Bruce Craig Comparig Two Populatios Research ofte ivolves the compariso of two or more samples from differet populatios Graphical summaries provide visual

More information

Linear Regression Models, OLS, Assumptions and Properties

Linear Regression Models, OLS, Assumptions and Properties Chapter 2 Liear Regressio Models, OLS, Assumptios ad Properties 2.1 The Liear Regressio Model The liear regressio model is the sigle most useful tool i the ecoometricia s kit. The multiple regressio model

More information