Inference in Regression Analysis

1 Inference in Regression Analysis Dr. Frank Wood Frank Wood, Linear Regression Models Lecture 4, Slide 1

2 Today: Normal Error Regression Model
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \qquad i = 1, \ldots, n$$
$Y_i$ is the value of the response variable in the $i$th trial.
$\beta_0$ and $\beta_1$ are parameters.
$X_i$ is a known constant, the value of the predictor variable in the $i$th trial.
$\epsilon_i \sim \text{iid } N(0, \sigma^2)$.
Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 4, Slide 2
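
A minimal sketch of drawing one data set from the normal error regression model, using only the Python standard library. The function name `simulate_responses` is hypothetical, the parameter values $\beta_0 = 9$, $\beta_1 = 2$ echo the "True, y = 2x + 9" line on Slide 7, and $\sigma = 2$ and the predictor values are assumptions chosen for illustration.

```python
import random


def simulate_responses(xs, beta0, beta1, sigma, rng):
    """Draw Y_i = beta0 + beta1 * X_i + eps_i with eps_i ~ iid N(0, sigma^2)."""
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]


rng = random.Random(0)                      # fixed seed so the run is repeatable
xs = [float(i) for i in range(1, 11)]       # assumed known constants X_1..X_10
ys = simulate_responses(xs, beta0=9.0, beta1=2.0, sigma=2.0, rng=rng)

for x, y in zip(xs, ys):
    print(f"X = {x:4.1f}   Y = {y:6.2f}")
```

The predictor values are held fixed and only the errors are random, which is the setting the rest of the lecture assumes.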

3 Inferences concerning $\beta_1$
Tests concerning $\beta_1$ (the slope) are often of interest, particularly
$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$$
The null hypothesis model
$$Y_i = \beta_0 + (0) X_i + \epsilon_i$$
implies that there is no relationship between $Y$ and $X$.

4 Review: Hypothesis Testing
Elements of a statistical test:
Null hypothesis, $H_0$
Alternative hypothesis, $H_a$
Test statistic
Rejection region

5 Review: Hypothesis Testing - Errors
A type I error is made if $H_0$ is rejected when $H_0$ is true. The probability of a type I error is denoted by $\alpha$. The value of $\alpha$ is called the level of the test.
A type II error is made if $H_0$ is accepted (not rejected) when $H_a$ is true. The probability of a type II error is denoted by $\beta$ (not to be confused with the regression parameters $\beta_0$ and $\beta_1$).

6 P-value
The p-value, or attained significance level, is the smallest level of significance $\alpha$ for which the observed data indicate that the null hypothesis should be rejected.

7 Null Hypothesis
If $\beta_1 = 0$, then with 95% confidence $b_1$ would fall in some range around zero.
[Figure: scatter plot with two lines. Legend: "Guess, y = 0x , mse: 37.1" and "True, y = 2x + 9, mse:" (value missing in the source). Axes: Predictor/Input (horizontal), Response/Output (vertical).]

8 Alternative Hypothesis: Least Squares Fit
$b_1$ rescaled is the test statistic.
[Figure: scatter plot with two lines. Legend: "Estimate, y = 2.09x , mse: 4.15" and "True, y = 2x + 9, mse: 4.22". Axes: Predictor/Input (horizontal), Response/Output (vertical).]

9 Testing This Hypothesis
We only have a finite sample.
Different finite samples (from the same population / source) will (almost always) produce different estimates $(b_0, b_1)$ of $\beta_0$ and $\beta_1$, given the same estimation procedure.
$b_0$ and $b_1$ are random variables whose sampling distributions can be statistically characterized.
Hypothesis tests can be constructed using these distributions.

10 Example: Sampling Dist. of $b_1$
The point estimator for $\beta_1$ is
$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$
The sampling distribution of $b_1$ is the distribution of the values of $b_1$ that arise when the predictor variables $X_i$ are held fixed and the observed outputs are repeatedly sampled.

11 Sampling Dist. of $b_1$ in Normal Regr. Model
For a normal error regression model the sampling distribution of $b_1$ is normal, with mean and variance given by
$$E(b_1) = \beta_1, \qquad V(b_1) = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}$$
To show this we need to go through a number of algebraic steps.
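
Before the algebra, the claim can be checked numerically. The sketch below holds the $X_i$ fixed, repeatedly resamples the outputs, and compares the empirical mean and variance of $b_1$ with $\beta_1$ and $\sigma^2 / \sum (X_i - \bar{X})^2$. The helper names `slope` and `simulate_slopes` and all parameter values are assumptions made for illustration, not part of the lecture.

```python
import random
import statistics


def slope(xs, ys):
    """Least squares slope b1 = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


def simulate_slopes(xs, beta0, beta1, sigma, reps, seed=0):
    """Hold xs fixed, resample the outputs reps times, and collect b1 each time."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
        slopes.append(slope(xs, ys))
    return slopes


xs = [float(i) for i in range(1, 11)]       # assumed fixed predictor values
beta0, beta1, sigma = 9.0, 2.0, 2.0         # assumed true parameters
slopes = simulate_slopes(xs, beta0, beta1, sigma, reps=5000)

x_bar = sum(xs) / len(xs)
theory_var = sigma ** 2 / sum((x - x_bar) ** 2 for x in xs)

print("mean of b1:    ", statistics.mean(slopes), "(theory:", beta1, ")")
print("variance of b1:", statistics.variance(slopes), "(theory:", theory_var, ")")
```

A simulation like this only illustrates the result; the derivation on the following slides is what establishes it.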

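12 First step
To show
$$\sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum (X_i - \bar{X}) Y_i$$
we observe
$$\begin{aligned}
\sum (X_i - \bar{X})(Y_i - \bar{Y}) &= \sum (X_i - \bar{X}) Y_i - \sum (X_i - \bar{X}) \bar{Y} \\
&= \sum (X_i - \bar{X}) Y_i - \bar{Y} \sum (X_i - \bar{X}) \\
&= \sum (X_i - \bar{X}) Y_i - \bar{Y} \sum X_i + \bar{Y} n \frac{\sum X_i}{n} \\
&= \sum (X_i - \bar{X}) Y_i
\end{aligned}$$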

13 Slope as linear combination of outputs
$b_1$ can be expressed as a linear combination of the $Y_i$'s:
$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum (X_i - \bar{X}) Y_i}{\sum (X_i - \bar{X})^2} = \sum k_i Y_i$$
where
$$k_i = \frac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2}$$

14 Properties of the $k_i$'s
It can be shown that
$$\sum k_i = 0, \qquad \sum k_i X_i = 1, \qquad \sum k_i^2 = \frac{1}{\sum (X_i - \bar{X})^2}$$
(possible homework). We will use these properties to prove various properties of the sampling distributions of $b_1$ and $b_0$. (Write on board.)
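
The three properties are easy to confirm numerically for any set of predictor values. The sketch below is illustrative only: the helper `k_weights` and the values in `xs` are assumptions, and a numerical check does not replace the algebraic proof.

```python
import math


def k_weights(xs):
    """Return the weights k_i = (X_i - Xbar) / sum((X_i - Xbar)^2) and the sum of squares."""
    x_bar = sum(xs) / len(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [(x - x_bar) / sxx for x in xs], sxx


xs = [1.0, 2.0, 4.0, 7.0, 11.0]             # arbitrary illustrative predictor values
ks, sxx = k_weights(xs)

sum_k = sum(ks)                              # should be 0
sum_kx = sum(k * x for k, x in zip(ks, xs))  # should be 1
sum_k2 = sum(k * k for k in ks)              # should be 1 / sxx

print("sum k_i      =", sum_k)
print("sum k_i X_i  =", sum_kx)
print("sum k_i^2    =", sum_k2, " 1/Sxx =", 1.0 / sxx)
```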

15 Normality of $b_1$'s Sampling Distribution
Useful fact: a linear combination of independent normal random variables is normally distributed.
More formally: when $Y_1, \ldots, Y_n$ are independent normal random variables, the linear combination $a_1 Y_1 + a_2 Y_2 + \cdots + a_n Y_n$ is normally distributed, with mean $\sum a_i E(Y_i)$ and variance $\sum a_i^2 V(Y_i)$.

16 Normality of $b_1$'s Sampling Distribution
Since $b_1$ is a linear combination of the $Y_i$'s and each $Y_i$ is an independent normal random variable, $b_1$ is normally distributed as well.
$$b_1 = \sum k_i Y_i, \qquad k_i = \frac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2}$$
(Write on board.)

17 $b_1$ is an unbiased estimator
This can be seen using two of the properties of the $k_i$'s:
$$\begin{aligned}
E(b_1) &= E\left(\sum k_i Y_i\right) = \sum k_i E(Y_i) = \sum k_i (\beta_0 + \beta_1 X_i) \\
&= \beta_0 \sum k_i + \beta_1 \sum k_i X_i \\
&= \beta_0 (0) + \beta_1 (1) \\
&= \beta_1
\end{aligned}$$

18 Variance of $b_1$
Since the $Y_i$ are independent random variables with variance $\sigma^2$ and the $k_i$'s are constants, we get
$$\begin{aligned}
V(b_1) &= V\left(\sum k_i Y_i\right) = \sum k_i^2 V(Y_i) \\
&= \sum k_i^2 \sigma^2 = \sigma^2 \sum k_i^2 \\
&= \sigma^2 \frac{1}{\sum (X_i - \bar{X})^2}
\end{aligned}$$
Note that this assumes that we know $\sigma^2$. Can we?

19 Estimated variance of $b_1$
If we don't know $\sigma^2$ then we can replace it with the MSE estimate. Remember
$$s^2 = \text{MSE} = \frac{\text{SSE}}{n-2} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n-2} = \frac{\sum e_i^2}{n-2}$$
Plugging in, we get
$$V(b_1) = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}, \qquad \hat{V}(b_1) = \frac{s^2}{\sum (X_i - \bar{X})^2}$$
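
As a worked illustration of the plug-in step, the sketch below fits the line, computes SSE and MSE, and returns $\hat{V}(b_1) = \text{MSE} / \sum (X_i - \bar{X})^2$. The function name `estimated_slope_variance` and the three-point data set are made up for this example.

```python
def estimated_slope_variance(xs, ys):
    """Return (b0, b1, mse, v_hat) where v_hat = MSE / sum((X_i - Xbar)^2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    # SSE is the sum of squared residuals e_i = Y_i - Yhat_i.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    mse = sse / (n - 2)                      # two parameters estimated, so n - 2
    return b0, b1, mse, mse / sxx


# Tiny made-up data set, small enough to verify by hand.
b0, b1, mse, v_hat = estimated_slope_variance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0])
print(b0, b1, mse, v_hat)
```

For this data set the hand calculation gives $b_1 = 1/2$, $b_0 = 2/3$, $\text{SSE} = 1/6$, $\text{MSE} = 1/6$, and $\hat{V}(b_1) = 1/12$.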

20 Digression: Gauss-Markov Theorem
In a regression model where $E(\epsilon_i) = 0$, the variance $V(\epsilon_i) = \sigma^2 < \infty$, and $\epsilon_i$ and $\epsilon_j$ are uncorrelated for all $i \neq j$, the least squares estimators $b_0$ and $b_1$ are unbiased and have minimum variance among all unbiased linear estimators.
Remember
$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}$$

21 Proof
The theorem states that $b_1$ has minimum variance among all unbiased linear estimators of the form
$$\hat{\beta}_1 = \sum c_i Y_i$$
As this estimator must be unbiased we have
$$E(\hat{\beta}_1) = \sum c_i E(Y_i) = \beta_1$$
and, expanding the expectation,
$$\sum c_i (\beta_0 + \beta_1 X_i) = \beta_0 \sum c_i + \beta_1 \sum c_i X_i = \beta_1$$

22 Proof cont.
Given the constraint
$$\beta_0 \sum c_i + \beta_1 \sum c_i X_i = \beta_1$$
which must hold for every value of $\beta_0$ and $\beta_1$, it must be the case that
$$\sum c_i = 0 \quad \text{and} \quad \sum c_i X_i = 1$$
(Write these on board as conditions of unbiasedness.)
The variance of this estimator is
$$V(\hat{\beta}_1) = \sum c_i^2 V(Y_i) = \sigma^2 \sum c_i^2$$

23 Proof cont.
Now define $c_i = k_i + d_i$, where the $k_i$ are the constants we already defined and the $d_i$ are arbitrary constants. Let's look at the variance of the estimator:
$$\begin{aligned}
V(\hat{\beta}_1) &= \sum c_i^2 V(Y_i) = \sigma^2 \sum (k_i + d_i)^2 \\
&= \sigma^2 \left( \sum k_i^2 + \sum d_i^2 + 2 \sum k_i d_i \right)
\end{aligned}$$
Note we just demonstrated that
$$\sigma^2 \sum k_i^2 = V(b_1)$$

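24 Proof cont.
Now by showing that $\sum k_i d_i = 0$ we're almost done:
$$\begin{aligned}
\sum k_i d_i &= \sum k_i (c_i - k_i) \\
&= \sum k_i c_i - \sum k_i^2 \\
&= \sum c_i \left( \frac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2} \right) - \frac{1}{\sum (X_i - \bar{X})^2} \\
&= \frac{\sum c_i X_i - \bar{X} \sum c_i}{\sum (X_i - \bar{X})^2} - \frac{1}{\sum (X_i - \bar{X})^2} \\
&= 0
\end{aligned}$$
The last step follows from the conditions of unbiasedness, $\sum c_i = 0$ and $\sum c_i X_i = 1$.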

25 Proof end
So we are left with
$$V(\hat{\beta}_1) = \sigma^2 \left( \sum k_i^2 + \sum d_i^2 \right) = V(b_1) + \sigma^2 \sum d_i^2$$
which is minimized when all the $d_i = 0$.
This means that the least squares estimator $b_1$ has minimum variance among all unbiased linear estimators.
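
The decomposition $\sum c_i^2 = \sum k_i^2 + \sum d_i^2$ can be illustrated numerically. The sketch below builds one alternative unbiased linear estimator by adding a perturbation $d_i$ to the $k_i$. To keep the estimator unbiased, the perturbation must satisfy $\sum d_i = 0$ and $\sum d_i X_i = 0$; here that is arranged by taking the least squares residuals of an arbitrary vector regressed on $X$, which is one convenient construction and an assumption of this example. The helper names and all numbers are illustrative.

```python
import math


def k_weights(xs):
    """Least squares weights k_i = (X_i - Xbar) / sum((X_i - Xbar)^2)."""
    x_bar = sum(xs) / len(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [(x - x_bar) / sxx for x in xs]


def residualize(xs, vs):
    """Subtract from vs its least squares line in xs; the result d has sum(d) = 0 and sum(d * X) = 0."""
    n = len(xs)
    x_bar = sum(xs) / n
    v_bar = sum(vs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (v - v_bar) for x, v in zip(xs, vs)) / sxx
    intercept = v_bar - slope * x_bar
    return [v - (intercept + slope * x) for x, v in zip(xs, vs)]


xs = [1.0, 2.0, 4.0, 7.0, 11.0]                      # arbitrary predictor values
ks = k_weights(xs)
ds = residualize(xs, [0.3, -0.1, 0.4, 0.0, -0.2])    # arbitrary perturbation
cs = [k + d for k, d in zip(ks, ds)]                 # weights of the alternative estimator

sum_k2 = sum(k * k for k in ks)
sum_d2 = sum(d * d for d in ds)
sum_c2 = sum(c * c for c in cs)

print("unbiasedness: sum c_i =", sum(cs), " sum c_i X_i =", sum(c * x for c, x in zip(cs, xs)))
print("sum c_i^2 =", sum_c2, " sum k_i^2 + sum d_i^2 =", sum_k2 + sum_d2)
```

Multiplying both sides by $\sigma^2$ gives the variances, so the alternative estimator's variance exceeds $V(b_1)$ by $\sigma^2 \sum d_i^2$.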

26 Sampling Distribution of $(b_1 - \beta_1)/\hat{S}(b_1)$
$b_1$ is normally distributed, so $(b_1 - \beta_1)/\sqrt{V(b_1)}$ is a standard normal variable.
We don't know $V(b_1)$, so it must be estimated from data. We have already denoted its estimate $\hat{V}(b_1)$.
Using this estimate it can be shown that
$$\frac{b_1 - \beta_1}{\hat{S}(b_1)} \sim t(n-2), \qquad \hat{S}(b_1) = \sqrt{\hat{V}(b_1)}$$

27 Where does this come from?
We need to rely upon the following theorem. For the normal error regression model,
$$\frac{\text{SSE}}{\sigma^2} = \frac{\sum (Y_i - \hat{Y}_i)^2}{\sigma^2} \sim \chi^2(n-2)$$
and is independent of $b_0$ and $b_1$.
Intuitively this follows the standard result for the sum of squared normal random variables. Here there are two linear constraints imposed by the regression parameter estimation that each reduce the number of degrees of freedom by one.

28 Another useful fact: t distribution
Let $z$ and $\chi^2(\nu)$ be independent random variables (standard normal and $\chi^2$ respectively). We then define a $t$ random variable as follows:
$$t(\nu) = \frac{z}{\sqrt{\chi^2(\nu) / \nu}}$$
This version of the $t$ distribution has one parameter, the degrees of freedom $\nu$.
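
The definition translates directly into a sampler. The sketch below draws $z$ and an independent $\chi^2(\nu)$ (built as a sum of $\nu$ squared standard normals) and forms the ratio. As a sanity check it compares the sample variance with $\nu / (\nu - 2)$, the known variance of a $t(\nu)$ variable for $\nu > 2$. The function name `draw_t`, the choice $\nu = 10$, and the number of draws are assumptions for illustration.

```python
import math
import random
import statistics


def draw_t(nu, rng):
    """Draw one t(nu) variate as z / sqrt(chi2(nu) / nu), with z and chi2 independent."""
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))  # sum of nu squared N(0,1) draws
    return z / math.sqrt(chi2 / nu)


rng = random.Random(0)
nu = 10
draws = [draw_t(nu, rng) for _ in range(20000)]

print("sample mean:    ", statistics.mean(draws), "(theory: 0)")
print("sample variance:", statistics.variance(draws), "(theory:", nu / (nu - 2), ")")
```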

29 Distribution of the studentized statistic
To derive the distribution of this statistic, first we do the following rewrite:
$$\frac{b_1 - \beta_1}{\hat{S}(b_1)} = \frac{\dfrac{b_1 - \beta_1}{S(b_1)}}{\dfrac{\hat{S}(b_1)}{S(b_1)}}$$
The numerator, $(b_1 - \beta_1)/S(b_1)$, is a standard normal variable. The denominator is
$$\frac{\hat{S}(b_1)}{S(b_1)} = \sqrt{\frac{\hat{V}(b_1)}{V(b_1)}}$$

30 Studentized statistic cont.
And note the following:
$$\frac{\hat{V}(b_1)}{V(b_1)} = \frac{\dfrac{\text{MSE}}{\sum (X_i - \bar{X})^2}}{\dfrac{\sigma^2}{\sum (X_i - \bar{X})^2}} = \frac{\text{MSE}}{\sigma^2} = \frac{\text{SSE}}{\sigma^2 (n-2)}$$
where we know (by the given theorem) that the last term is distributed as a $\chi^2$ variable divided by its degrees of freedom and is independent of $b_1$ and $b_0$:
$$\frac{\text{SSE}}{\sigma^2 (n-2)} \sim \frac{\chi^2(n-2)}{n-2}$$

31 Studentized statistic final
Putting everything together we can see that
$$\frac{b_1 - \beta_1}{\hat{S}(b_1)} \sim \frac{z}{\sqrt{\dfrac{\chi^2(n-2)}{n-2}}}$$
so by the given definition of the $t$ distribution we have our result:
$$\frac{b_1 - \beta_1}{\hat{S}(b_1)} \sim t(n-2)$$
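
One way to probe this result empirically is to simulate the studentized statistic many times and count how often it falls inside the central 95% region of $t(n-2)$. The sketch below uses $n = 10$, so $n - 2 = 8$, and the two-sided 95% critical value 2.306 taken from a standard $t$ table. The function names, parameter values, and number of replications are assumptions for illustration.

```python
import math
import random

T_CRIT_8DF = 2.306  # two-sided 95% critical value of t(8), from a standard t table


def studentized_slope(xs, ys, beta1):
    """Return (b1 - beta1) / S_hat(b1), with S_hat(b1) = sqrt(MSE / sum((X_i - Xbar)^2))."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    mse = sse / (n - 2)
    return (b1 - beta1) / math.sqrt(mse / sxx)


def fraction_inside(reps=4000, seed=0):
    """Fraction of simulated statistics with absolute value at most T_CRIT_8DF."""
    rng = random.Random(seed)
    xs = [float(i) for i in range(1, 11)]    # n = 10 fixed predictor values (assumed)
    beta0, beta1, sigma = 9.0, 2.0, 2.0      # assumed true parameters
    inside = 0
    for _ in range(reps):
        ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
        if abs(studentized_slope(xs, ys, beta1)) <= T_CRIT_8DF:
            inside += 1
    return inside / reps


frac_inside = fraction_inside()
print("fraction inside the central 95% region:", frac_inside)
```

If the statistic really follows $t(8)$, the printed fraction should be close to 0.95, up to Monte Carlo error.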

32 Confidence Intervals and Hypothesis Tests
Now that we know the sampling distribution of the studentized statistic $(b_1 - \beta_1)/\hat{S}(b_1)$ ($t$ with $n-2$ degrees of freedom), we can construct confidence intervals and hypothesis tests easily.
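
A minimal sketch of how such an interval and test might be computed follows. It forms the interval $b_1 \pm t_{crit}\, \hat{S}(b_1)$ and the statistic $b_1 / \hat{S}(b_1)$ for $H_0: \beta_1 = 0$. The function `slope_inference` is hypothetical, the data are synthetic, and the critical value 2.306 (two-sided 95%, 8 degrees of freedom) is read from a standard $t$ table because the Python standard library has no $t$ quantile function.

```python
import math
import random


def slope_inference(xs, ys, t_crit):
    """Return (b1, se, t_stat, (lower, upper)) for the slope, testing H0: beta1 = 0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2) / sxx)      # S_hat(b1) = sqrt(MSE / Sxx)
    t_stat = b1 / se                         # studentized statistic under beta1 = 0
    return b1, se, t_stat, (b1 - t_crit * se, b1 + t_crit * se)


# Synthetic data from y = 2x + 9 plus N(0, 2^2) noise (assumed values).
rng = random.Random(0)
xs = [float(i) for i in range(1, 11)]
ys = [9.0 + 2.0 * x + rng.gauss(0.0, 2.0) for x in xs]

b1, se, t_stat, (lower, upper) = slope_inference(xs, ys, t_crit=2.306)
print(f"b1 = {b1:.3f}, se = {se:.3f}, t = {t_stat:.2f}")
print(f"95% CI for beta1: ({lower:.3f}, {upper:.3f})")
print("reject H0: beta1 = 0 at the 5% level" if abs(t_stat) > 2.306 else "do not reject H0")
```

Rejecting $H_0$ when $|t| > t_{crit}$ is equivalent to the confidence interval excluding zero, since both compare $|b_1|$ with $t_{crit}\, \hat{S}(b_1)$.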

Bias Variance Trade-off

The mean squared error of an estimator,
$$\text{MSE}(\hat{\theta}) = E([\hat{\theta} - \theta]^2),$$
can be re-expressed as
$$\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + B(\hat{\theta})^2$$
MSE = VAR + BIAS$^2$
Proof: MSE(θ̂) = E((θ̂ − θ)²) = E(([θ̂ − E(θ̂)]