Formal Statement of Simple Linear Regression Model

Mabel Webster
5 years ago

1 Formal Statement of Simple Linear Regression Model
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, \ldots, n$$
- $Y_i$ is the value of the response variable in the $i$th trial
- $\beta_0$ and $\beta_1$ are parameters
- $X_i$ is a known constant, the value of the predictor variable in the $i$th trial
- $\varepsilon_i$ is a random error term with mean $E(\varepsilon_i) = 0$ and variance $\mathrm{Var}(\varepsilon_i) = \sigma^2$

2 Least Squares Linear Regression
Seek to minimize
$$Q = \sum_{i=1}^{n} \left(Y_i - (\beta_0 + \beta_1 X_i)\right)^2$$
Choose $b_0$ and $b_1$ as estimators for $\beta_0$ and $\beta_1$: $b_0$ and $b_1$ are the values that minimize the criterion $Q$ for the given sample observations $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$.

3 Normal Equations
The result of this minimization step is a pair of equations called the normal equations. $b_0$ and $b_1$ are called point estimators of $\beta_0$ and $\beta_1$ respectively.
$$\sum Y_i = n b_0 + b_1 \sum X_i$$
$$\sum X_i Y_i = b_0 \sum X_i + b_1 \sum X_i^2$$
This is a system of two equations in two unknowns. The solution is given by...

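4 Solution to Normal Equations
After a lot of algebra one arrives at
$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$
$$b_0 = \bar{Y} - b_1 \bar{X}$$
where
$$\bar{X} = \frac{\sum X_i}{n}, \qquad \bar{Y} = \frac{\sum Y_i}{n}$$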
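
As a minimal sketch (not part of the original slides), the closed-form solution can be coded directly. The function name `fit_line` and the five data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least squares estimates (b0, b1) from the normal-equation solution."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of b1
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Illustrative data only (not from the slides)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = fit_line(xs, ys)
print(b0, b1)  # approximately 2.2 and 0.6
```

For this data $\bar{X} = 3$, $\bar{Y} = 4$, $\sum (X_i - \bar{X})(Y_i - \bar{Y}) = 6$ and $\sum (X_i - \bar{X})^2 = 10$, so $b_1 = 0.6$ and $b_0 = 4 - 0.6 \cdot 3 = 2.2$.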

5 Properties of Solution
The $i$th residual is defined to be $e_i = Y_i - \hat{Y}_i$
- $\sum_i e_i = 0$
- $\sum_i \hat{Y}_i = \sum_i Y_i$
- $\sum_i X_i e_i = 0$
- $\sum_i \hat{Y}_i e_i = 0$
- The regression line always goes through the point $(\bar{X}, \bar{Y})$
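
These properties can be checked numerically. The sketch below is an illustration under made-up data, and `residual_checks` is a hypothetical helper name; each returned sum should be zero up to floating-point rounding.

```python
def residual_checks(xs, ys):
    """Return the four sums that the least squares fit forces to zero."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]
    resid = [y - f for y, f in zip(ys, fitted)]
    return {
        "sum_e": sum(resid),
        "sum_fitted_minus_sum_y": sum(fitted) - sum(ys),
        "sum_x_e": sum(x * e for x, e in zip(xs, resid)),
        "sum_fitted_e": sum(f * e for f, e in zip(fitted, resid)),
        # Fitted value at X-bar minus Y-bar: the line passes through the means
        "line_at_means": (b0 + b1 * x_bar) - y_bar,
    }


# Illustrative data only (not from the slides)
checks = residual_checks([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in checks.items():
    print(name, round(value, 12))
```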

6 Alternative format of linear regression model:
$$Y_i = \beta_0^* + \beta_1 (X_i - \bar{X}) + \varepsilon_i$$
The least squares estimator $b_1$ for $\beta_1$ remains the same as before. The least squares estimator for $\beta_0^* = \beta_0 + \beta_1 \bar{X}$ becomes
$$b_0^* = b_0 + b_1 \bar{X} = (\bar{Y} - b_1 \bar{X}) + b_1 \bar{X} = \bar{Y}$$
Hence the estimated regression function is
$$\hat{Y} = \bar{Y} + b_1 (X - \bar{X})$$

7 $s^2$ estimator for $\sigma^2$
$$s^2 = MSE = \frac{SSE}{n-2} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n-2} = \frac{\sum e_i^2}{n-2}$$
- MSE is an unbiased estimator of $\sigma^2$: $E(MSE) = \sigma^2$
- The sum of squares SSE has $n-2$ degrees of freedom associated with it
- Cochran's theorem (later in the course) tells us where degrees of freedom come from and how to calculate them
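
A short sketch of the MSE computation, assuming the same made-up data as a running illustration; `mse` is a hypothetical helper name. The divisor is $n - 2$ because two parameters ($b_0$ and $b_1$) are estimated.

```python
def mse(xs, ys):
    """s^2 = SSE / (n - 2) for a simple linear regression fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return sse / (n - 2)


# Illustrative data only: SSE is about 2.4 and n - 2 = 3
print(mse([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))  # approximately 0.8
```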

8 Normal Error Regression Model
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, \ldots, n$$
- $Y_i$ is the value of the response variable in the $i$th trial
- $\beta_0$ and $\beta_1$ are parameters
- $X_i$ is a known constant, the value of the predictor variable in the $i$th trial
- $\varepsilon_i \sim$ iid $N(0, \sigma^2)$ (note this is different: now we know the distribution)

9 Inference concerning $\beta_1$
Tests concerning $\beta_1$ (the slope) are often of interest, particularly
$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$$
The null hypothesis model $Y_i = \beta_0 + (0) X_i + \varepsilon_i$ implies that there is no relationship between $Y$ and $X$. Note that the means of all the $Y_i$'s are equal at all levels of $X_i$.

10 Sampling Dist. of $b_1$
The point estimator of $\beta_1$ is
$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$
For a normal error regression model the sampling distribution of $b_1$ is normal, with mean and variance given by
$$E(b_1) = \beta_1, \qquad \mathrm{Var}(b_1) = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}$$

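11 Estimated variance of $b_1$
When we don't know $\sigma^2$ we have to replace it with the MSE estimate. Let
$$s^2 = MSE = \frac{SSE}{n-2}$$
where $SSE = \sum e_i^2$ and $e_i = Y_i - \hat{Y}_i$. Plugging in, the variance
$$\mathrm{Var}(b_1) = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}$$
is estimated by
$$\widehat{\mathrm{Var}}(b_1) = \frac{s^2}{\sum (X_i - \bar{X})^2}$$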

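12 Recap
We now have an expression for the sampling distribution of $b_1$ when $\sigma^2$ is known:
$$b_1 \sim N\left(\beta_1, \frac{\sigma^2}{\sum (X_i - \bar{X})^2}\right)$$
When $\sigma^2$ is unknown we have an unbiased point estimator of $\mathrm{Var}(b_1)$:
$$\widehat{\mathrm{Var}}(b_1) = \frac{s^2}{\sum (X_i - \bar{X})^2}$$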

13 Sampling Distribution of $(b_1 - \beta_1)/\hat{S}(b_1)$
- $b_1$ is normally distributed, so $(b_1 - \beta_1)/\sqrt{\mathrm{Var}(b_1)}$ is a standard normal variable
- We don't know $\mathrm{Var}(b_1)$, so it must be estimated from data. We have already written down its estimate, denoted here $\hat{V}(b_1)$
- Using the estimate $\hat{V}(b_1)$ it can be shown that
$$\frac{b_1 - \beta_1}{\hat{S}(b_1)} \sim t(n-2), \qquad \hat{S}(b_1) = \sqrt{\hat{V}(b_1)}$$

14 Confidence Intervals and Hypothesis Tests
Now that we know the sampling distribution of the studentized slope $(b_1 - \beta_1)/\hat{S}(b_1)$ ($t$ with $n-2$ degrees of freedom), we can construct confidence intervals and hypothesis tests easily.

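15 $1-\alpha$ confidence limits for $\beta_1$
The $1-\alpha$ confidence limits for $\beta_1$ are
$$b_1 \pm t(1-\alpha/2;\, n-2)\, s\{b_1\}$$
where $s\{b_1\}$ is the estimated standard deviation of $b_1$.
- This expression gives the confidence interval for any given $n$ and $\alpha$
- Fixing $\alpha$ can guide the choice of sample size if a particular confidence interval width is desired
- Conversely, for a given sample size, the desired width guides the choice of $\alpha$
- Also useful for hypothesis testing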

16 Tests Concerning $\beta_1$
Example 1: two-sided test
$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$$
Test statistic
$$t = \frac{b_1 - 0}{s\{b_1\}}$$

17 Tests Concerning $\beta_1$
We have an estimate of the sampling distribution of $b_1$ from the data. If the null hypothesis holds, then the $b_1$ estimate coming from the data should be within the 95% confidence interval of the sampling distribution centered at 0 (in this case).
$$t = \frac{b_1 - 0}{s\{b_1\}}$$

18 Decision rules
- if $|t| \le t(1-\alpha/2;\, n-2)$, accept $H_0$
- if $|t| > t(1-\alpha/2;\, n-2)$, reject $H_0$
Absolute values make the test two-sided
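
The confidence limits and the two-sided decision rule can be sketched together. This is an illustration only: the data are made up, `slope_inference` is a hypothetical name, and the critical value is passed in by the caller (here 3.182, the tabulated value of $t(0.975;\, 3)$) rather than computed, since the Python standard library has no $t$ quantile function.

```python
from math import sqrt


def slope_inference(xs, ys, t_crit):
    """Return b1, s{b1}, t for H0: beta1 = 0, the CI, and the decision."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    se_b1 = sqrt(mse / sxx)                     # s{b1}
    t_stat = (b1 - 0) / se_b1                   # test statistic
    ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    reject = abs(t_stat) > t_crit               # two-sided decision rule
    return b1, se_b1, t_stat, ci, reject


# Illustrative data; alpha = 0.05 and n - 2 = 3, so t(0.975; 3) = 3.182
b1, se_b1, t_stat, ci, reject = slope_inference(
    [1, 2, 3, 4, 5], [2, 4, 5, 4, 5], t_crit=3.182
)
print(round(t_stat, 4), reject)  # 2.1213 False
```

With only five points the interval is wide (roughly $0.6 \pm 0.9$), it contains 0, and $H_0$ is not rejected, which agrees with the decision rule.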

19 Inferences Concerning $\beta_0$
Largely, inference procedures regarding $\beta_0$ can be performed in the same way as those for $\beta_1$. Remember the point estimator $b_0$ for $\beta_0$:
$$b_0 = \bar{Y} - b_1 \bar{X}$$

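20 Sampling distribution of $b_0$
When the error variance is known
$$E(b_0) = \beta_0, \qquad \sigma^2\{b_0\} = \sigma^2 \left( \frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right)$$
When the error variance is unknown
$$s^2\{b_0\} = MSE \left( \frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right)$$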

21 Confidence interval for $\beta_0$
The $1-\alpha$ confidence limits for $\beta_0$ are obtained in the same manner as those for $\beta_1$:
$$b_0 \pm t(1-\alpha/2;\, n-2)\, s\{b_0\}$$
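
A sketch of the intercept interval under the same assumptions as before: made-up data, a hypothetical function name `intercept_inference`, and a caller-supplied critical value from a $t$ table.

```python
from math import sqrt


def intercept_inference(xs, ys, t_crit):
    """Return b0, s^2{b0}, and the confidence limits for beta0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    s2_b0 = mse * (1 / n + x_bar ** 2 / sxx)    # estimated variance of b0
    half_width = t_crit * sqrt(s2_b0)
    return b0, s2_b0, (b0 - half_width, b0 + half_width)


# Illustrative data; t(0.975; 3) = 3.182 from a t table
b0, s2_b0, ci = intercept_inference([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], 3.182)
print(round(b0, 4), round(s2_b0, 4))  # 2.2 0.88
```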

22 Sampling Distribution of $\hat{Y}_h$
We have
$$\hat{Y}_h = b_0 + b_1 X_h$$
Since this quantity is itself a linear combination of the $Y_i$'s, its sampling distribution is itself normal. The mean of the sampling distribution is
$$E\{\hat{Y}_h\} = E\{b_0\} + E\{b_1\} X_h = \beta_0 + \beta_1 X_h$$
Biased or unbiased? Unbiased: the mean equals $E\{Y_h\}$.

23 Sampling Distribution of $\hat{Y}_h$
So, plugging in, we get
$$\sigma^2\{\hat{Y}_h\} = \sigma^2 \left( \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right)$$
Since we often won't know $\sigma^2$ we can, as usual, plug in $s^2 = SSE/(n-2)$, our estimate for it, to get our estimate of this sampling distribution variance
$$s^2\{\hat{Y}_h\} = s^2 \left( \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right)$$

24 No surprise...
The studentized point estimator for the output follows a $t$-distribution with $n-2$ degrees of freedom:
$$\frac{\hat{Y}_h - E\{Y_h\}}{s\{\hat{Y}_h\}} \sim t(n-2)$$
This means that we can construct confidence intervals in the same manner as before.

25 Confidence Intervals for $E(Y_h)$
The $1-\alpha$ confidence intervals for $E(Y_h)$ are
$$\hat{Y}_h \pm t(1-\alpha/2;\, n-2)\, s\{\hat{Y}_h\}$$
From this, hypothesis tests can be constructed as usual.

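26 Prediction interval for single new observation
If the regression parameters are unknown, the $1-\alpha$ prediction interval for a new observation $Y_h$ is given by
$$\hat{Y}_h \pm t(1-\alpha/2;\, n-2)\, s\{pred\}$$
We have (the new observation $Y_h$ is independent of $\hat{Y}_h$, so the variances add)
$$\sigma^2\{pred\} = \sigma^2\{Y_h - \hat{Y}_h\} = \sigma^2\{Y_h\} + \sigma^2\{\hat{Y}_h\} = \sigma^2 + \sigma^2\{\hat{Y}_h\}$$
An unbiased estimator of $\sigma^2\{pred\}$ is $s^2\{pred\} = MSE + s^2\{\hat{Y}_h\}$, which is given by
$$s^2\{pred\} = MSE \left[ 1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right]$$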
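
To make the difference between the two intervals concrete, the sketch below computes $s^2\{\hat{Y}_h\}$ and $s^2\{pred\}$ at a chosen $X_h$. The data, the value $X_h = 4$ and the name `interval_variances` are assumptions for illustration.

```python
def interval_variances(xs, ys, x_h):
    """Return Y-hat_h, s^2{Y-hat_h} (mean response) and s^2{pred} (new obs)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    y_hat_h = b0 + b1 * x_h
    s2_mean = mse * (1 / n + (x_h - x_bar) ** 2 / sxx)
    s2_pred = mse + s2_mean          # extra MSE term for the new observation
    return y_hat_h, s2_mean, s2_pred


# Illustrative data only
y_hat_h, s2_mean, s2_pred = interval_variances([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], 4)
print(round(y_hat_h, 4), round(s2_mean, 4), round(s2_pred, 4))  # 4.6 0.24 1.04
```

The prediction variance is always larger than the mean-response variance by exactly MSE, so prediction intervals are wider than confidence intervals for $E(Y_h)$ at the same $X_h$.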

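27 ANOVA table for simple lin. regression

| Source of Variation | SS | df | MS | E(MS) |
|---|---|---|---|---|
| Regression | $SSR = \sum (\hat{Y}_i - \bar{Y})^2$ | $1$ | $MSR = SSR/1$ | $\sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2$ |
| Error | $SSE = \sum (Y_i - \hat{Y}_i)^2$ | $n-2$ | $MSE = SSE/(n-2)$ | $\sigma^2$ |
| Total | $SSTO = \sum (Y_i - \bar{Y})^2$ | $n-1$ | | |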

28 F Test of $\beta_1 = 0$ vs. $\beta_1 \neq 0$
ANOVA provides a battery of useful tests. For example, ANOVA provides an easy test for the two-sided test
$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$$
Test statistic from before
$$t = \frac{b_1 - 0}{s\{b_1\}}$$
ANOVA test statistic
$$F = \frac{MSR}{MSE}$$

29 F Distribution
The F distribution is the distribution of the ratio of two independent $\chi^2$ random variables, each normalized by its degrees of freedom. When $H_0$ holds, the test statistic $F$ follows the distribution
$$F \sim F(1, n-2)$$

30 Hypothesis Test Decision Rule
Since $F$ is distributed as $F(1, n-2)$ when $H_0$ holds, the decision rule to follow when the risk of a Type I error is to be controlled at $\alpha$ is:
- If $F \le F(1-\alpha;\, 1, n-2)$, conclude $H_0$
- If $F > F(1-\alpha;\, 1, n-2)$, conclude $H_a$
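
The ANOVA quantities and the $F$ statistic can be sketched as follows. The data and the name `anova_simple` are illustrative assumptions; the critical value $F(0.95;\, 1, 3) \approx 10.13$ quoted in the comment comes from an $F$ table.

```python
def anova_simple(xs, ys):
    """Return SSR, SSE, SSTO and F = MSR / MSE for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]
    ssr = sum((f - y_bar) ** 2 for f in fitted)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ssto = sum((y - y_bar) ** 2 for y in ys)
    f_stat = (ssr / 1) / (sse / (n - 2))
    return ssr, sse, ssto, f_stat


# Illustrative data only
ssr, sse, ssto, f_stat = anova_simple([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(ssr, 4), round(sse, 4), round(ssto, 4), round(f_stat, 4))
# 3.6 2.4 6.0 4.5 ; since 4.5 <= F(0.95; 1, 3) = 10.13, conclude H0
```

For this data $F = 4.5$ equals the square of the slope $t$ statistic ($2.1213^2$), as expected for the simple linear regression case.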

31 General Linear Test
The test of $\beta_1 = 0$ versus $\beta_1 \neq 0$ is but a single example of a general test for linear statistical models. The general linear test has three parts:
- Full Model
- Reduced Model
- Test Statistic

32 Full Model Fit
A full linear model is first fit to the data. Here, for example, the simple linear model with non-zero slope is the full model:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
Using this model the error sum of squares is obtained:
$$SSE(F) = \sum [Y_i - (b_0 + b_1 X_i)]^2 = \sum (Y_i - \hat{Y}_i)^2 = SSE$$

33 Fit Reduced Model
One can test the hypothesis that a simpler model is a better model via a general linear test (which is really a likelihood ratio test in disguise). For instance, consider a reduced model in which the slope is zero (i.e. no relationship between input and output).
$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$$
The model when $H_0$ holds is called the reduced or restricted model:
$$Y_i = \beta_0 + \varepsilon_i$$
The SSE for the reduced model is obtained as
$$SSE(R) = \sum (Y_i - b_0)^2 = \sum (Y_i - \bar{Y})^2 = SSTO$$

34 Test Statistic
The idea is to compare the two error sums of squares $SSE(F)$ and $SSE(R)$. Because the full model $F$ has more parameters than the reduced model $R$,
$$SSE(F) \le SSE(R) \quad \text{always}$$
In the general linear test, the test statistic is
$$F = \frac{\dfrac{SSE(R) - SSE(F)}{df_R - df_F}}{\dfrac{SSE(F)}{df_F}}$$
which follows the F distribution when $H_0$ holds. $df_R$ and $df_F$ are the degrees of freedom associated with the reduced and full model error sums of squares respectively.
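
The test statistic only needs the two error sums of squares and their degrees of freedom, so it fits in one small function. `general_linear_f` is a hypothetical name, and the numbers in the example are from a made-up data set with $n = 5$ (so $SSE(R) = SSTO = 6$ with $df_R = 4$, and $SSE(F) = 2.4$ with $df_F = 3$).

```python
def general_linear_f(sse_reduced, sse_full, df_reduced, df_full):
    """F statistic of the general linear test."""
    if df_reduced <= df_full:
        raise ValueError("reduced model must have more error df than full model")
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator


# Slope test as a general linear test (illustrative numbers)
print(round(general_linear_f(6.0, 2.4, 4, 3), 4))  # 4.5
```

In this special case the numerator is $SSR/1 = MSR$ and the denominator is $MSE$, so the statistic reduces to the ANOVA $F = MSR/MSE$.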

35 $R^2$ (Coefficient of determination)
- SSTO measures the variation in the observations $Y_i$ when $X$ is not considered
- SSE measures the variation in the $Y_i$ after a predictor variable $X$ is employed
- A natural measure of the effect of $X$ in reducing variation in $Y$ is to express the reduction in variation ($SSTO - SSE = SSR$) as a proportion of the total variation
$$R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$$
Note that since $0 \le SSE \le SSTO$, we have $0 \le R^2 \le 1$

36 Coefficient of Correlation
$$r = \pm \sqrt{R^2}$$
The sign is that of the slope $b_1$. Range: $-1 \le r \le 1$
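
A small sketch of both quantities, assuming made-up data and a hypothetical helper `r_squared`. The sign of $r$ is taken from the fitted slope.

```python
from math import copysign, sqrt


def r_squared(xs, ys):
    """Return (R^2, r) for a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ssto = sum((y - y_bar) ** 2 for y in ys)
    r2 = 1 - sse / ssto
    r = copysign(sqrt(r2), b1)   # r carries the sign of the slope
    return r2, r


# Illustrative data only
r2, r = r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r2, 4), round(r, 4))  # 0.6 0.7746
```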

37 Remedial Measures
How do we know that the regression function is a good explainer of the observed data?
- Plotting
- Tests
What if it is not? What can we do about it?
- Transformation of variables

38 Residuals
Remember the definition of residuals:
$$e_i = Y_i - \hat{Y}_i$$
and the difference between that and the unknown true error
$$\varepsilon_i = Y_i - E(Y_i)$$
In a normal regression model the $\varepsilon_i$'s are assumed to be iid $N(0, \sigma^2)$ random variables. The observed residuals $e_i$ should reflect these properties.

39 Departures from Model... To be studied by residuals
- Regression function not linear
- Error terms do not have constant variance
- Error terms are not independent
- Model fits all but one or a few outlier observations
- Error terms are not normally distributed
- One or more predictor variables have been omitted from the model

40 Diagnostics for Residuals
- Plot of residuals against predictor variable
- Plot of absolute or squared residuals against predictor variable
- Plot of residuals against fitted values
- Plot of residuals against time or other sequence
- Plot of residuals against omitted predictor variables
- Box plot of residuals
- Normal probability plot of residuals

41 Tests Involving Residuals
- Tests for constancy of variance (Brown-Forsythe test, Breusch-Pagan test, Section 3.6)
- Tests for normality of error distribution

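42 Brown-Forsythe Test
The test statistic for comparing the means of the absolute deviations of the residuals around the group medians is
$$t_{BF} = \frac{\bar{d}_1 - \bar{d}_2}{s \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$
where the pooled variance is
$$s^2 = \frac{\sum (d_{i1} - \bar{d}_1)^2 + \sum (d_{i2} - \bar{d}_2)^2}{n - 2}$$
Here $d_{i1}$ and $d_{i2}$ are the absolute deviations of the residuals in groups 1 and 2 from their group medians, $\bar{d}_1$ and $\bar{d}_2$ are their means, and $n = n_1 + n_2$.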

43 Brown-Forsythe Test
If $n_1$ and $n_2$ are not extremely small,
$$t_{BF} \sim t(n-2) \quad \text{approximately}$$
From this, confidence intervals and tests can be constructed.
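
The statistic can be sketched in a few lines. This is an illustration under assumptions: the residuals are taken as already split into two groups (for example low and high values of $X$), the two residual lists are made up, and `brown_forsythe` is a hypothetical name.

```python
from math import sqrt
from statistics import mean, median


def brown_forsythe(resid_group1, resid_group2):
    """t_BF computed from two groups of regression residuals."""
    med1, med2 = median(resid_group1), median(resid_group2)
    # Absolute deviations of the residuals around their group medians
    d1 = [abs(e - med1) for e in resid_group1]
    d2 = [abs(e - med2) for e in resid_group2]
    n1, n2 = len(d1), len(d2)
    d1_bar, d2_bar = mean(d1), mean(d2)
    pooled = (
        sum((d - d1_bar) ** 2 for d in d1) + sum((d - d2_bar) ** 2 for d in d2)
    ) / (n1 + n2 - 2)
    return (d1_bar - d2_bar) / sqrt(pooled * (1 / n1 + 1 / n2))


# Made-up residuals: the second group is visibly more spread out
t_bf = brown_forsythe([-1, 0, 1, 0.5], [-3, 2, 4, -2.5])
print(round(t_bf, 4))  # -4.2426
```

A large $|t_{BF}|$ relative to $t(1-\alpha/2;\, n-2)$ suggests that the error variance is not constant across the two groups.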

44 F test for lack of fit
Formal test for determining whether a specific type of regression function adequately fits the data.
Assumptions (usual): the observations $Y \mid X$
1. are independent
2. are normally distributed
3. have the same variance $\sigma^2$
Requires: repeat observations at one or more $X$ levels (called replicates)

45 Full Model vs. Regression Model
The full model is
$$Y_{ij} = \mu_j + \varepsilon_{ij}$$
where
- $\mu_j$ are parameters, $j = 1, \ldots, c$
- $\varepsilon_{ij}$ are iid $N(0, \sigma^2)$
Since the error terms have expectation zero,
$$E(Y_{ij}) = \mu_j$$

46 Full Model
- In the full model there is a different mean (a free parameter) for each level $X_j$
- In the regression model the mean responses are constrained to lie on a line
$$E(Y) = \beta_0 + \beta_1 X$$

47 Fitting the Full Model
The estimators of $\mu_j$ are simply
$$\hat{\mu}_j = \bar{Y}_j$$
The error sum of squares of the full model therefore is
$$SSE(F) = \sum_j \sum_i (Y_{ij} - \bar{Y}_j)^2 = SSPE$$
SSPE: Pure Error Sum of Squares

48 Degrees of Freedom
- An ordinary total sum of squares has $n-1$ degrees of freedom
- Each of the $c$ terms (one per level $j$) is an ordinary total sum of squares, so each has $n_j - 1$ degrees of freedom
- The number of degrees of freedom of SSPE is the sum of the component degrees of freedom
$$df_F = \sum_j (n_j - 1) = \sum_j n_j - c = n - c$$

49 General Linear Test
Remember: the general linear test proposes a reduced model as the null hypothesis. This will be our normal regression model. The full model will be as described (one independent mean for each level of $X$).
$$H_0: E(Y) = \beta_0 + \beta_1 X \qquad H_a: E(Y) \neq \beta_0 + \beta_1 X$$

50 SSE For Reduced Model
The SSE for the reduced model is as before. Remember
$$SSE(R) = \sum_i \sum_j [Y_{ij} - (b_0 + b_1 X_j)]^2 = \sum_i \sum_j (Y_{ij} - \hat{Y}_{ij})^2$$
and it has $n-2$ degrees of freedom:
$$df_R = n - 2$$

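51 F Test Statistic
From the general linear test approach
$$F = \frac{\dfrac{SSE(R) - SSE(F)}{df_R - df_F}}{\dfrac{SSE(F)}{df_F}} = \frac{\dfrac{SSE - SSPE}{(n-2) - (n-c)}}{\dfrac{SSPE}{n-c}}$$
Lack of fit sum of squares:
$$SSLF = SSE - SSPE$$
Then, since $(n-2) - (n-c) = c - 2$,
$$F = \frac{\dfrac{SSLF}{c-2}}{\dfrac{SSPE}{n-c}} = \frac{MSLF}{MSPE}$$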

52 F Test Rule
From the F test we know that large values of $F$ lead us to reject the null hypothesis:
- If $F \le F(1-\alpha;\, c-2, n-c)$, conclude $H_0$
- If $F > F(1-\alpha;\, c-2, n-c)$, conclude $H_a$

53 Variance decomposition
$$SSE = SSPE + SSLF$$
$$\sum_j \sum_i (Y_{ij} - \hat{Y}_{ij})^2 = \sum_j \sum_i (Y_{ij} - \bar{Y}_j)^2 + \sum_j \sum_i (\bar{Y}_j - \hat{Y}_{ij})^2$$
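
The decomposition and the lack-of-fit $F$ statistic can be sketched on a tiny data set with replicates. The data ($c = 3$ levels of $X$, two observations each) and the name `lack_of_fit` are assumptions for illustration.

```python
from collections import defaultdict


def lack_of_fit(xs, ys):
    """Return SSE, SSPE, SSLF and F = MSLF / MSPE (needs replicates, c > 2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    # Reduced model: the regression line
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # Full model: one free mean per X level
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    c = len(groups)
    sspe = sum((y - sum(g) / len(g)) ** 2 for g in groups.values() for y in g)
    sslf = sse - sspe
    f_stat = (sslf / (c - 2)) / (sspe / (n - c))
    return sse, sspe, sslf, f_stat


# Made-up data with replicates at X = 1, 2, 3
sse, sspe, sslf, f_stat = lack_of_fit([1, 1, 2, 2, 3, 3], [1, 3, 5, 5, 4, 6])
print(round(sse, 4), round(sspe, 4), round(sslf, 4), round(f_stat, 4))
# 7.0 4.0 3.0 2.25
```

Here the fitted line is $\hat{Y} = 1 + 1.5X$ and the group means are 2, 5 and 5, so $SSLF = 2(0.5)^2 + 2(1)^2 + 2(0.5)^2 = 3$, which matches $SSE - SSPE = 7 - 4$.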

54 Example decomposition

55 Box Cox Transforms
It can be difficult to graphically determine which transformation of $Y$ is most appropriate for correcting
- skewness of the distributions of error terms
- unequal variances
- nonlinearity of the regression function
The Box-Cox procedure automatically identifies a transformation from the family of power transformations on $Y$

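56 Box Cox Transforms
This family is of the form
$$Y' = Y^{\lambda}$$
Examples include
- $\lambda = 2$: $Y' = Y^2$
- $\lambda = 0.5$: $Y' = \sqrt{Y}$
- $\lambda = 0$: $Y' = \ln Y$ (by definition)
- $\lambda = -0.5$: $Y' = 1/\sqrt{Y}$
- $\lambda = -1$: $Y' = 1/Y$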
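
A minimal sketch of this family as a function, assuming positive responses. `power_transform` is a hypothetical name, and this is the simple form $Y' = Y^\lambda$ shown on the slide, not the rescaled variant $(Y^\lambda - 1)/\lambda$ that some software uses.

```python
from math import log


def power_transform(y, lam):
    """Power transformation Y' = Y**lam, with Y' = ln(Y) when lam == 0."""
    if y <= 0:
        raise ValueError("the power family here assumes Y > 0")
    if lam == 0:
        return log(y)
    return y ** lam


# The examples from the list, applied to Y = 4
for lam in (2, 0.5, 0, -0.5, -1):
    print(lam, power_transform(4, lam))
```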

57 Box Cox Cont.
The normal error regression model with the response variable a member of the family of power transformations becomes
$$Y_i^{\lambda} = \beta_0 + \beta_1 X_i + \varepsilon_i$$
This model has an additional parameter, $\lambda$, that needs to be estimated. Maximum likelihood is a way to estimate this parameter.

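58 Using the Bonferroni inequality cont.
To achieve a $1-\alpha$ family confidence level for $\beta_0$ and $\beta_1$ (for example) using the Bonferroni procedure, the significance level of each individual interval must shrink, so the intervals themselves widen. Returning to our confidence intervals for $\beta_0$ and $\beta_1$ from before
$$b_0 \pm t(1-\alpha/2;\, n-2)\, s\{b_0\}$$
$$b_1 \pm t(1-\alpha/2;\, n-2)\, s\{b_1\}$$
To achieve a $1-\alpha$ family confidence level these intervals must widen to
$$b_0 \pm t(1-\alpha/4;\, n-2)\, s\{b_0\}$$
$$b_1 \pm t(1-\alpha/4;\, n-2)\, s\{b_1\}$$
Each widened interval is a two-sided $1-\alpha/2$ interval, so if $A_1$ and $A_2$ denote the events that the first and second interval fail to cover their parameter, $P(A_1) = P(A_2) = \alpha/2$. Then
$$P(\bar{A}_1 \cap \bar{A}_2) \ge 1 - P(A_1) - P(A_2) = 1 - \alpha/2 - \alpha/2 = 1 - \alpha$$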
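
The bookkeeping for the quantile level can be sketched as follows. The generalization to $g$ simultaneous intervals and both function names are assumptions added for illustration; the slide covers $g = 2$.

```python
def bonferroni_quantile_level(alpha, g):
    """Level p in t(p; n - 2) for each of g two-sided intervals."""
    return 1 - alpha / (2 * g)


def family_confidence_bound(alpha, g):
    """Bonferroni lower bound when each interval misses with prob. alpha / g."""
    per_interval_miss = alpha / g
    return 1 - g * per_interval_miss


# Two intervals (beta0 and beta1) with family level 95%
print(bonferroni_quantile_level(0.05, 2))  # 0.9875, i.e. t(1 - alpha/4; n - 2)
print(family_confidence_bound(0.05, 2))    # lower bound of about 0.95
```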
