The Statistical Property of Ordinary Least Squares


1 The Statistical Property of Ordinary Least Squares

The linear equation on which we apply OLS is
$$y_t = X_t\beta + u_t.$$
As we have derived, the OLS estimator is
$$\hat{\beta} = (X^T X)^{-1}X^T y.$$
Substituting the linear model into $y$, we get
$$\hat{\beta} = (X^T X)^{-1}X^T(X\beta + u) = \beta + (X^T X)^{-1}X^T u.$$
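
As a numerical aside (not part of the original notes), the following Python sketch computes the OLS formula above on simulated data; all variable names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # regressors incl. constant
beta_true = np.array([1.0, 2.0, -0.5])
u = rng.normal(size=n)
y = X @ beta_true + u

# OLS estimator beta_hat = (X'X)^{-1} X'y, computed via a linear solve
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```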

2 Thus, the statistical properties of OLS are essentially the statistical properties of $(X^T X)^{-1}X^T u$.

Definition: The OLS estimator is said to be unbiased if
$$E[\hat{\beta}] = \beta.$$
This means that, given the way the data are generated, before you start the estimation procedure you would expect the OLS estimate not to deviate from the true value on average. Because
$$\hat{\beta} - \beta = (X^T X)^{-1}X^T u,$$
OLS is unbiased if
$$E\left[(X^T X)^{-1}X^T u\right] = 0.$$

3 Now, let us use the Law of Iterated Expectations. That is,
$$E\left[(X^T X)^{-1}X^T u\right] = E\left[E\left[(X^T X)^{-1}X^T u \mid X\right]\right] = E\left[(X^T X)^{-1}X^T E[u\mid X]\right].$$
Therefore, the OLS estimator is unbiased if
$$E[u\mid X] = 0.$$
In this case, $E[\hat{\beta}\mid X] = \beta$ as well. This assumption states that the regressors $X$ are exogenous with respect to the error term. But this can be a strong assumption. In time series, it means that the mean of the current $u_t$, conditional on the values of $X_s$ in all periods (past, present, and future), has to be zero. Roughly, the current $u_t$ has to be orthogonal (or uncorrelated) to $X_s$ of all periods.

4 In some cases, we can make the assumption weaker, namely
$$E[u_t\mid X_t] = 0.$$
If the data are a time series, i.e. the sample is observed over time and $t$ indexes time periods, then this assumption is said to state that the regressors $X_t$ are predetermined with respect to the error term.

5 An example where the OLS estimator is biased:
$$y_t = \beta_1 + \beta_2 y_{t-1} + u_t, \qquad u_t \sim \mathrm{IID}(0, \sigma^2).$$
Then, using the FWL theorem, we first demean $y_t$ and $y_{t-1}$ to derive the OLS coefficient. Writing $y_{-1}$ for the vector of lagged values,
$$\hat{\beta}_2 = \left[y_{-1}^T M_\iota y_{-1}\right]^{-1}y_{-1}^T M_\iota y = \left[y_{-1}^T M_\iota y_{-1}\right]^{-1}y_{-1}^T M_\iota(y_{-1}\beta_2 + u) = \beta_2 + \left[y_{-1}^T M_\iota y_{-1}\right]^{-1}y_{-1}^T M_\iota u.$$
Notice that $\beta_2$ is a scalar, so $\beta_2 y_{-1} = y_{-1}\beta_2$. Because $E[u\mid M_\iota y_{-1}] = E[u\mid y_{-1} - \bar{y}_{-1}\iota] \neq 0$ in general, the OLS estimator is not unbiased.
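
A small Monte Carlo sketch of this bias (not in the original notes): with a short sample, the OLS estimate of the autoregressive coefficient lies on average below the true value. All parameter values below are illustrative.

```python
import numpy as np

# OLS in y_t = b1 + b2*y_{t-1} + u_t is biased in finite samples even with IID errors.
rng = np.random.default_rng(1)
b1, b2, n, reps = 0.0, 0.5, 30, 20000
est = np.empty(reps)
for r in range(reps):
    y = np.zeros(n)
    u = rng.normal(size=n)
    for t in range(1, n):
        y[t] = b1 + b2 * y[t - 1] + u[t]
    Y, Ylag = y[1:], y[:-1]
    X = np.column_stack([np.ones(n - 1), Ylag])
    est[r] = np.linalg.solve(X.T @ X, X.T @ Y)[1]
print(est.mean())   # noticeably below 0.5 for small n (downward bias)
```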

6 Consistency of the Least Squares Estimator

In this chapter, we will argue that as the sample size increases, the least squares estimator $\hat{\beta}$ converges in probability to the true value $\beta$.

Law of Large Numbers. Before showing the consistency of the OLS estimator, we discuss a very important theorem, the Law of Large Numbers. Let $u_t$, $t = 1, \ldots, n$, be random variables that are independently and identically distributed with finite mean $E[u_t] = \mu_u$ and finite variance $\sigma_u^2$. Then, the sample average is
$$\bar{u} = \frac{1}{n}\sum_{t=1}^n u_t,$$

7 which has mean
$$E[\bar{u}] = E\left[\frac{1}{n}\sum_{t=1}^n u_t\right] = \frac{1}{n}\sum_{t=1}^n E[u_t] = \frac{1}{n}\sum_{t=1}^n\mu = \mu.$$
Next, we derive the variance. For $n = 2$,
$$\mathrm{Var}(\bar{u}) = E\left[\left(\frac{u_1 + u_2}{2} - \mu\right)^2\right] = \frac{1}{4}E\left[(u_1 - \mu + u_2 - \mu)^2\right] = \frac{1}{4}E\left[(u_1-\mu)^2 + (u_2-\mu)^2 + 2(u_1-\mu)(u_2-\mu)\right]$$
$$= \frac{1}{4}\left[\mathrm{Var}(u_1) + \mathrm{Var}(u_2) + 2E(u_1-\mu)(u_2-\mu)\right] = \frac{1}{4}\left[\mathrm{Var}(u_1) + \mathrm{Var}(u_2)\right].$$
The last step holds because $u_1, u_2$ are independent:
$$\mathrm{Cov}(u_1, u_2) = E(u_1-\mu)(u_2-\mu) = E(u_1-\mu)\,E(u_2-\mu) = 0.$$

8 Similarly, for general $n$,
$$\mathrm{Var}(\bar{u}) = \mathrm{Var}\left(\frac{1}{n}\sum_{t=1}^n u_t\right) = E\left[\frac{1}{n}\sum_{t=1}^n(u_t - \mu)\right]^2 = \frac{1}{n^2}\left[\sum_{t=1}^n\mathrm{Var}(u_t) + \sum_{i\neq j}\mathrm{Cov}(u_i, u_j)\right] = \frac{1}{n^2}\sum_{t=1}^n\sigma_u^2 = \frac{\sigma_u^2}{n}.$$

9 The mean of the sample average is the true value:
$$E[\bar{u}] = \mu.$$
The variance of the sample average converges to zero as the sample size goes to infinity:
$$\mathrm{Var}(\bar{u}) \to 0 \text{ as } n \to \infty, \qquad \text{or} \qquad \lim_{n\to\infty}\mathrm{Var}(\bar{u}) = 0.$$
This means that as the sample size increases, the sample average is distributed more and more tightly around the true value. Then a deviation of the sample mean from the true value, no matter how small, becomes less and less likely. More formally, for any $\epsilon > 0$,
$$\Pr\left(\left|\bar{u} - \mu\right| > \epsilon\right) \to 0 \text{ as } n \to \infty,$$

10 or
$$\lim_{n\to\infty}\Pr\left(\left|\bar{u} - \mu\right| > \epsilon\right) = 0,$$
or
$$\operatorname{plim}_{n\to\infty}\bar{u} = \mu,$$
and in words, the sample average converges to the true mean in probability.

Theorem ((Weak) Law of Large Numbers): Let $u_t$, $t = 1, \ldots, n$, be independently and identically distributed with finite mean $\mu$ and variance $\sigma_u^2$. Then the sample average $\bar{u}$ converges to the true value $\mu$ in probability.
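
A quick simulation sketch of the law of large numbers (not part of the original notes; the mean, variance, and tolerance below are illustrative): the frequency of large deviations of the sample average from the true mean shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, eps = 1.0, 2.0, 0.1
for n in [10, 100, 1000, 10000]:
    # fraction of replications in which |u_bar - mu| exceeds eps
    deviations = [abs(rng.normal(mu, sigma, size=n).mean() - mu) > eps for _ in range(1000)]
    print(n, np.mean(deviations))
```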

11 Consistency of the OLS Estimator

Suppose, to simplify the discussion, that $X_t$ is random but independent of $u_t$. The OLS estimator is
$$\hat{\beta} = (X^T X)^{-1}X^T y = \beta + (X^T X)^{-1}X^T u.$$
Now, divide both the "denominator" and the "numerator" by the sample size $n$:
$$\hat{\beta} = \beta + \left[\frac{1}{n}X^T X\right]^{-1}\frac{1}{n}X^T u.$$
Now, let us assume that
$$\operatorname{plim}_{n\to\infty}\frac{1}{n}X^T X = S_{X^T X},$$
where $S_{X^T X}$ is invertible.

12 Then, it is known that we can write
$$\operatorname{plim}_{n\to\infty}\hat{\beta} = \beta + \left[\operatorname{plim}_{n\to\infty}\frac{1}{n}X^T X\right]^{-1}\operatorname{plim}_{n\to\infty}\frac{1}{n}X^T u = \beta + \left[S_{X^T X}\right]^{-1}\operatorname{plim}_{n\to\infty}\frac{1}{n}X^T u.$$
What is left for us to derive is
$$\operatorname{plim}_{n\to\infty}\frac{1}{n}X^T u.$$
Now,
$$\frac{1}{n}X^T u = \frac{1}{n}\sum_{t=1}^n x_t u_t.$$
Then $x_t u_t$, $t = 1, \ldots, n$, are independently and identically distributed random variables with finite mean $E[x_t u_t] = 0$ and (we assume) finite variance $\mathrm{Var}(x_t u_t) = \sigma_{xu}^2$. Therefore, the Law of Large Numbers holds and

13
$$\operatorname{plim}_{n\to\infty}\frac{1}{n}\sum_{t=1}^n x_t u_t = \mu_{xu} = 0.$$
Together, we have shown that
$$\operatorname{plim}_{n\to\infty}\hat{\beta} = \beta + \left[\operatorname{plim}_{n\to\infty}\frac{1}{n}X^T X\right]^{-1}\operatorname{plim}_{n\to\infty}\frac{1}{n}X^T u = \beta + \left[S_{X^T X}\right]^{-1}\cdot 0 = \beta.$$
Thus, the OLS estimator converges to the true parameter value in probability as the sample size increases. We also say that the OLS estimator is consistent.
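
A consistency sketch in code (not part of the original notes): with regressors simulated independently of the error, as assumed above, the OLS estimate settles down around the true coefficients as $n$ grows. Values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
beta = np.array([1.0, -2.0])
for n in [50, 500, 5000, 50000]:
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ beta + rng.normal(size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(n, beta_hat)   # approaches [1.0, -2.0] as n increases
```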

14 The (Variance-)Covariance Matrix of the OLS Estimator

The (variance-)covariance matrix of the OLS estimator given $X$ is
$$\mathrm{Var}\left[\hat{\beta}\mid X\right] = E\left[\left(\hat{\beta} - E(\hat{\beta}\mid X)\right)\left(\hat{\beta} - E(\hat{\beta}\mid X)\right)^T\mid X\right],$$
which is a $k \times k$ matrix if $X$ has $k$ variables. The often-reported standard error of the $i$-th OLS parameter estimate is
$$\mathrm{std.error}(\hat{\beta}_i) = \sqrt{\left[\mathrm{Var}(\hat{\beta}\mid X)\right]_{ii}}.$$

15 Derivation of the Covariance Matrix

Remember that the OLS estimator is
$$\hat{\beta} = (X^T X)^{-1}X^T y = \beta + (X^T X)^{-1}X^T u,$$
and because OLS is unbiased, $E[\hat{\beta}] = \beta$. Therefore,
$$\hat{\beta} - E[\hat{\beta}] = (X^T X)^{-1}X^T u,$$
and
$$\left(\hat{\beta} - E[\hat{\beta}]\right)\left(\hat{\beta} - E[\hat{\beta}]\right)^T = \left[(X^T X)^{-1}X^T u\right]\left[(X^T X)^{-1}X^T u\right]^T = (X^T X)^{-1}X^T uu^T X(X^T X)^{-1}.$$

16 Now, assume that the error term of the linear model satisfies
$$\mathrm{Var}(u\mid X) = E[uu^T\mid X] = \sigma^2 I.$$
That is, because $E[u_t\mid X] = 0$, for any observation $t$ and for $s \neq t$,
$$\mathrm{Var}(u_t\mid X) = \left[E[uu^T\mid X]\right]_{tt} = \sigma^2, \qquad \mathrm{Cov}(u_s, u_t\mid X) = \left[E[uu^T\mid X]\right]_{st} = 0.$$
That is, the variance of the error term is the same for every observation, and the error terms of two different observations are uncorrelated.

17 Then,
$$\mathrm{Var}(\hat{\beta}\mid X) = (X^T X)^{-1}X^T E\left(uu^T\mid X\right)X(X^T X)^{-1} = (X^T X)^{-1}X^T\sigma^2 I\,X(X^T X)^{-1} = \sigma^2(X^T X)^{-1}.$$

18 Precision of Least Squares Estimators

The smaller the variance of an OLS parameter estimate $\hat{\beta}$, the higher we say the precision of the estimate is. Therefore, we define the precision matrix as the inverse of the covariance matrix:
$$\mathrm{Prec}(\hat{\beta}) = \mathrm{Var}(\hat{\beta})^{-1} = \sigma^{-2}(X^T X).$$
First, we can see that the smaller the variance of the error term, the smaller the variance of the estimator, i.e. the larger the precision. Secondly, because
$$X^T X = \sum_{t=1}^n x_t^T x_t,$$
usually the larger the sample size, the smaller the variance, i.e. the larger the precision.

19 Now, we look at the variance of a single OLS parameter, $\hat{\beta}_1$. As we have seen from the FWL theorem, if we regress both $y$ and $x_1$ on $X_2$ and take residuals, i.e. premultiply them by
$$M_2 = I - X_2(X_2^T X_2)^{-1}X_2^T,$$
then
$$\hat{\beta}_1 = \left(x_1^T M_2 x_1\right)^{-1}x_1^T M_2 y,$$
and the variance of $\hat{\beta}_1$ is
$$\mathrm{Var}(\hat{\beta}_1) = \sigma^2\left(x_1^T M_2 x_1\right)^{-1} = \frac{\sigma^2}{x_1^T M_2 x_1}.$$
If $x_1$ can be perfectly explained by the rest of $X$, i.e. by $X_2$, then $M_2 x_1 = 0$ and therefore the OLS estimator $\hat{\beta}_1$ is not defined; equivalently, its variance is infinite and its precision zero. This is what is called the multicollinearity problem. The smaller the sum of squares of the residual $M_2 x_1$, the larger the variance, and thus the smaller the precision.

20 Linear Functions of Parameter Estimates

Suppose that the objective of the OLS regression is to obtain parameter estimates so that we can form the predictor $\hat{y}_p$ given the parameter estimate $\hat{\beta}$ and a specific value $x_p$:
$$\hat{y}_p = x_p\hat{\beta}.$$
Notice that because the OLS estimator is unbiased and the mean of the error term is zero, the predictor is also unbiased. That is,
$$E[\hat{y}_p\mid x_p] = x_p E[\hat{\beta}] = x_p\beta = E[x_p\beta + u_p\mid x_p] = E[y_p\mid x_p],$$
and the variance of the predictor is
$$\mathrm{Var}[\hat{y}_p\mid x_p] = E\left[\left(x_p\hat{\beta} - x_p\beta\right)\left(x_p\hat{\beta} - x_p\beta\right)^T\mid x_p\right] = x_p E\left[\left(\hat{\beta} - \beta\right)\left(\hat{\beta} - \beta\right)^T\right]x_p^T = x_p\mathrm{Var}(\hat{\beta})x_p^T.$$

21 The forecast error is
$$y_p - \hat{y}_p = x_p\beta + u_p - x_p\hat{\beta} = u_p + x_p\left(\beta - \hat{\beta}\right).$$
Because the error term $u_p$ and $X$ and $x_p$ are assumed to be uncorrelated, if we also assume that $u_p$ is uncorrelated with $u_t$, $t = 1, \ldots, n$, then $u_p$ and $\hat{\beta}$ are uncorrelated. Therefore,
$$\mathrm{Var}(y_p - \hat{y}_p) = \mathrm{Var}(u_p) + x_p\mathrm{Var}(\hat{\beta})x_p^T - 2\,\mathrm{Cov}(u_p, x_p\hat{\beta}) = \mathrm{Var}(u_p) + x_p\mathrm{Var}(\hat{\beta})x_p^T = \sigma^2 + \sigma^2 x_p(X^T X)^{-1}x_p^T.$$
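
A minimal code sketch of the prediction and forecast-error variance above (not part of the original notes), with $\sigma^2$ treated as known and all values illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta, sigma2 = np.array([2.0, 1.5]), 1.0
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
x_p = np.array([1.0, 0.3])                        # a specific regressor value
y_p_hat = x_p @ beta_hat                          # point prediction x_p * beta_hat
var_forecast = sigma2 * (1.0 + x_p @ XtX_inv @ x_p)  # sigma^2 + sigma^2 x_p (X'X)^{-1} x_p'
print(y_p_hat, var_forecast)
```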

22 In the same way as we have discussed, we can derive the standard error of any linear combination $\omega^T\hat{\beta}$ of the OLS estimates, as follows:
$$\mathrm{Var}(\omega^T\hat{\beta}) = E\left[\left(\omega^T\hat{\beta} - \omega^T\beta\right)\left(\omega^T\hat{\beta} - \omega^T\beta\right)^T\right] = \omega^T E\left[\left(\hat{\beta} - \beta\right)\left(\hat{\beta} - \beta\right)^T\right]\omega = \omega^T\mathrm{Var}(\hat{\beta})\omega = \sigma^2\omega^T(X^T X)^{-1}\omega.$$

23 Efficiency of the OLS Estimator

In this section, we will conclude that OLS is the most efficient estimator among all linear unbiased estimators; that is, roughly speaking, OLS has the smallest variance. But how do we define one variance covariance matrix $A$ to be smaller than another variance covariance matrix $B$? A convenient definition is the following: a symmetric matrix $A$ is smaller than another symmetric matrix $B$ if, for any vector $\omega$,
$$\omega^T(B - A)\omega \geq 0.$$
This is equivalent to saying that the symmetric matrix $B - A$ is positive semidefinite. A symmetric matrix $A$ is positive definite if for any nonzero vector $\omega$,
$$\omega^T A\omega = \sum_{i=1}^k\sum_{j=1}^k\omega_i\omega_j A_{ij} > 0.$$

24 Notice that a variance covariance matrix is always symmetric and positive semidefinite. Suppose $y_t$ is a vector of random variables. Then $(y_t - E[y_t])(y_t - E[y_t])^T$ is positive semidefinite, because for any vector $\omega$,
$$\omega^T(y_t - E[y_t])(y_t - E[y_t])^T\omega = \left[\omega^T(y_t - E[y_t])\right]^2 \geq 0.$$
Taking expectations,
$$E\left[\omega^T(y_t - E[y_t])(y_t - E[y_t])^T\omega\right] = \omega^T\mathrm{Var}(y_t)\omega \geq 0.$$

25 Gauss-Markov Theorem

Assume that $E[u\mid X] = 0$ and $E[uu^T\mid X] = \sigma^2 I$ in the linear regression model. Then the OLS estimator is more efficient than any other linear unbiased estimator $\tilde{\beta}$, i.e.
$$\mathrm{Var}(\tilde{\beta}) - \mathrm{Var}(\hat{\beta})$$
is a positive semidefinite matrix.

26 Proof. Consider an arbitrary linear estimator,
$$\tilde{\beta} = Ay.$$
Now, write
$$\tilde{\beta} = \hat{\beta} + (\tilde{\beta} - \hat{\beta}).$$
Then,
$$\mathrm{Var}(\tilde{\beta}\mid X) = \mathrm{Var}(\hat{\beta}\mid X) + \mathrm{Var}(\tilde{\beta} - \hat{\beta}\mid X) + 2\,\mathrm{Cov}(\hat{\beta}, \tilde{\beta} - \hat{\beta}\mid X).$$
What we will show below is that indeed
$$\mathrm{Cov}(\hat{\beta}, \tilde{\beta} - \hat{\beta}\mid X) = 0,$$
and therefore
$$\mathrm{Var}(\tilde{\beta}\mid X) = \mathrm{Var}(\hat{\beta}\mid X) + \mathrm{Var}(\tilde{\beta} - \hat{\beta}\mid X) \geq \mathrm{Var}(\hat{\beta}\mid X).$$

27 The OLS estimator is one such linear estimator, with
$$A = (X^T X)^{-1}X^T.$$
Because $Ay = A(X\beta + u)$,
$$E[Ay\mid X] = AX\beta + AE[u\mid X] = AX\beta.$$

28 In order for unbiasedness, $E[Ay\mid X] = \beta$, to hold for any true value $\beta$,
$$AX = I$$
has to hold. Since $(X^T X)^{-1}X^T X = I$ as well,
$$\left[A - (X^T X)^{-1}X^T\right]X = 0.$$
Now,
$$\tilde{\beta} - \hat{\beta} - E\left[\tilde{\beta} - \hat{\beta}\mid X\right] = A(X\beta + u) - \beta - (X^T X)^{-1}X^T u = \left[A - (X^T X)^{-1}X^T\right]u.$$

29 Because it is assumed that $\mathrm{Var}(u\mid X) = \sigma^2 I$,
$$\mathrm{Cov}\left(\tilde{\beta} - \hat{\beta},\,\hat{\beta}\mid X\right) = E\left(\left[A - (X^T X)^{-1}X^T\right]uu^T X(X^T X)^{-1}\mid X\right) = \left[A - (X^T X)^{-1}X^T\right]\sigma^2 I\,X(X^T X)^{-1}$$
$$= \sigma^2\left[A - (X^T X)^{-1}X^T\right]X(X^T X)^{-1} = 0.$$

30 Residuals and Error Terms

Now, consider the residuals of the OLS regression:
$$\hat{u} = y - X\hat{\beta} = \left(I - X(X^T X)^{-1}X^T\right)y = M_X y = M_X(X\beta + u) = M_X u.$$
Then, because $E[M_X u\mid X] = 0$,
$$\mathrm{Var}(\hat{u}\mid X) = E\left[\hat{u}\hat{u}^T\mid X\right] = M_X E\left[uu^T\mid X\right]M_X = M_X\,\sigma^2 I\,M_X = \sigma^2 M_X,$$
which is different from
$$\mathrm{Var}(u) = \sigma^2 I.$$
In fact,
$$\mathrm{Var}(u) - \mathrm{Var}(\hat{u}) = \sigma^2\left[I - M_X\right] = \sigma^2 P_X.$$

31 Notice that for any vector $y$,
$$y^T P_X y = y^T P_X P_X y = (X\hat{\beta})^T(X\hat{\beta}) \geq 0.$$
So $P_X$ is positive semidefinite. Therefore, in the matrix sense, $\mathrm{Var}(\hat{u})$ is smaller than $\mathrm{Var}(u)$. That is, OLS overfits the data, i.e. it makes the residuals have smaller variance than the error terms. As we have seen, the variance matrix of the OLS estimator is
$$\mathrm{Var}(\hat{\beta}) = \sigma^2(X^T X)^{-1}.$$
We need to derive an estimate of $\sigma^2 = \mathrm{Var}(u_t)$. A potential estimator is the sample variance of the residuals,
$$\frac{1}{n}\sum_{t=1}^n\left[\hat{u}_t - \bar{\hat{u}}\right]^2.$$
As long as a constant term $\iota$ is included in $X$, which is usually the case,
$$\bar{\hat{u}} = \frac{1}{n}\iota^T\hat{u} = 0.$$

32 Therefore,
$$\mathrm{Var}(\hat{u}_t) = \left[\sigma^2 M_X\right]_{tt} = \left[\mathrm{diag}\left(\sigma^2 M_X\right)\right]_t,$$
where $\mathrm{diag}(A)$ is the vector containing the diagonal elements of the $n \times n$ matrix $A$. Hence,
$$\sum_{t=1}^n\mathrm{Var}(\hat{u}_t) = \sum_{t=1}^n\left[\sigma^2 M_X\right]_{tt} = \mathrm{trace}\left(\sigma^2 M_X\right),$$
where $\mathrm{trace}(A)$ is the sum of all the diagonal elements, i.e. $\mathrm{trace}(A) = \sum_{i=1}^n A_{ii}$.

33 Notice that
$$\mathrm{trace}(A + B) = \mathrm{trace}(A) + \mathrm{trace}(B).$$
This is because $[A + B]_{ii} = A_{ii} + B_{ii}$. Also,
$$\mathrm{trace}(AB) = \mathrm{trace}(BA).$$
This is because
$$\mathrm{trace}(AB) = \sum_{i=1}^n[AB]_{ii} = \sum_{i=1}^n\sum_{j=1}^n A_{ij}B_{ji} = \sum_{j=1}^n\sum_{i=1}^n B_{ji}A_{ij} = \sum_{j=1}^n[BA]_{jj} = \mathrm{trace}(BA).$$

34 Therefore,
$$\mathrm{trace}(\sigma^2 M_X) = \mathrm{trace}\left(\sigma^2\left[I - X(X^T X)^{-1}X^T\right]\right) = \sigma^2\,\mathrm{trace}(I) - \sigma^2\,\mathrm{trace}\left(X(X^T X)^{-1}X^T\right)$$
$$= \sigma^2 n - \sigma^2\,\mathrm{trace}\left((X^T X)^{-1}X^T X\right) = \sigma^2 n - \sigma^2 k = \sigma^2(n - k).$$
Therefore,
$$E\left[\sum_{t=1}^n\hat{u}_t^2\right] = \sum_{t=1}^n\mathrm{Var}(\hat{u}_t) = \sigma^2(n - k) < \sigma^2 n = \sum_{t=1}^n\mathrm{Var}(u_t).$$

35 So, the unbiased estimate of the variance $\sigma^2$ is
$$s^2 = \frac{\mathrm{SSR}}{n - k} = \frac{1}{n - k}\sum_{t=1}^n\hat{u}_t^2.$$
Together, the estimate of the variance matrix of the OLS coefficients is
$$\widehat{\mathrm{Var}}(\hat{\beta}) = s^2(X^T X)^{-1}.$$
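
A short code sketch (not part of the original notes) of $s^2$ and the estimated covariance matrix and standard errors, on simulated data with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - k)            # s^2 = SSR / (n - k)
cov_beta = s2 * XtX_inv                 # estimated Var(beta_hat) = s^2 (X'X)^{-1}
std_err = np.sqrt(np.diag(cov_beta))    # reported standard errors
print(beta_hat, std_err)
```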

36 Misspecification of Linear Regression Models

In most situations, we do not know a priori what the true regression model is, i.e. which variables should belong in the regression equation.

Overspecification. Suppose we put more variables on the RHS than needed:
$$y = X\beta + Z\gamma + u, \qquad u \sim \mathrm{IID}(0, \sigma^2 I),$$
where $E(u\mid [X, Z]) = 0$ but $Z$ is redundant, i.e. $\gamma = 0$. Then,
$$\hat{\beta} = \left(X^T M_Z X\right)^{-1}X^T M_Z y = \left(X^T M_Z X\right)^{-1}X^T M_Z(X\beta + u) = \beta + \left(X^T M_Z X\right)^{-1}X^T M_Z u.$$

37 Because $E[u\mid X, Z] = 0$,
$$E\left[\left(X^T M_Z X\right)^{-1}X^T M_Z u\mid X, Z\right] = 0.$$
Therefore,
$$E\left[\hat{\beta}\mid X, Z\right] = \beta,$$
and thus the coefficient $\hat{\beta}$ is unbiased.

38 Variance:
$$\mathrm{Var}\left(\hat{\beta}\mid X, Z\right) = \sigma^2\left(X^T M_Z X\right)^{-1}.$$
Now, we know that
$$X^T M_Z X = X^T X - X^T P_Z X.$$
Hence,
$$X^T X - X^T M_Z X = X^T P_Z X,$$
which is positive semidefinite. So $X^T M_Z X \leq X^T X$ in the matrix sense, and therefore
$$\sigma^2\left(X^T X\right)^{-1} \leq \sigma^2\left(X^T M_Z X\right)^{-1}.$$
That is, if we include unnecessary variables, then as long as the assumption $E[u\mid X, Z] = 0$ is satisfied we keep unbiasedness, but we get an estimator with larger variance, i.e. we lose efficiency.

39 Underspecification

What if the true specification is
$$y = X\beta + Z\gamma + u$$
but we leave out $Z$? As we have discussed, the OLS estimate is
$$\hat{\beta} = (X^T X)^{-1}X^T(X\beta + Z\gamma + u) = \beta + (X^T X)^{-1}X^T(Z\gamma + u).$$
Given the assumption $E[u\mid X, Z] = 0$,
$$E\left[\hat{\beta}\mid X, Z\right] = \beta + (X^T X)^{-1}X^T Z\gamma.$$
As long as $X^T Z \neq 0$ and $\gamma \neq 0$, we have omitted variable bias.

40 Now,
$$\hat{\beta} - \beta = (X^T X)^{-1}X^T Z\gamma + (X^T X)^{-1}X^T u,$$
and
$$(\hat{\beta} - \beta)(\hat{\beta} - \beta)^T = (X^T X)^{-1}X^T Z\gamma\gamma^T Z^T X(X^T X)^{-1} + (X^T X)^{-1}X^T uu^T X(X^T X)^{-1}$$
$$+ (X^T X)^{-1}X^T Z\gamma u^T X(X^T X)^{-1} + (X^T X)^{-1}X^T u\gamma^T Z^T X(X^T X)^{-1}.$$
Because $E\left[uu^T\mid X, Z\right] = \sigma^2 I$ and $E[u\mid X, Z] = 0$, the last two (cross) terms have conditional expectation zero.

41 It is not clear which OLS estimator has less variance. If $\gamma$ is small, then it could be that omitting an unimportant variable results in bias but improves the MSE. But with omitted variables, the variance covariance matrix gives the wrong message about the accuracy and reliability of the OLS estimate. With larger sample size, the bias dominates, and thus the problem with underspecification becomes more severe. If we take conditional expectations given $X, Z$, we obtain
$$\mathrm{MSE}(\hat{\beta}_o) = E\left[(\hat{\beta}_o - \beta)(\hat{\beta}_o - \beta)^T\mid X, Z\right] = (X^T X)^{-1}X^T Z\gamma\gamma^T Z^T X(X^T X)^{-1} + \sigma^2(X^T X)^{-1}.$$
Mean squared error measures the variation of an estimator around the true value. What if you include the variables $Z$ in the regression? Then $\hat{\beta}$ is unbiased, and thus
$$\mathrm{MSE}(\hat{\beta}) = \mathrm{Var}(\hat{\beta}) = \sigma^2\left(X^T M_Z X\right)^{-1}.$$
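
A Monte Carlo sketch of this trade-off (not part of the original notes): omitting a relevant regressor that is correlated with $X$ biases the short regression, while including it keeps the estimate unbiased at the cost of extra variance. All parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
beta, gamma, n, reps = 1.0, 0.5, 100, 5000
short, long_ = np.empty(reps), np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)
    z = 0.7 * x + rng.normal(size=n)          # Z correlated with X
    y = beta * x + gamma * z + rng.normal(size=n)
    short[r] = (x @ y) / (x @ x)              # omits Z: biased
    Xf = np.column_stack([x, z])
    long_[r] = np.linalg.solve(Xf.T @ Xf, Xf.T @ y)[0]   # includes Z: unbiased
print(short.mean(), long_.mean())             # roughly beta + bias vs. roughly beta
```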

42 Measures of Goodness of Fit

The simple $R^2$ is
$$R^2 = 1 - \frac{\mathrm{SSR}}{\mathrm{TSS}} = 1 - \frac{\sum_{t=1}^n\hat{u}_t^2}{\sum_{t=1}^n(y_t - \bar{y})^2}.$$
If we look at the numerator: the more independent variables we have, the smaller the SSR will be:
$$\sum_{t=1}^n\hat{u}_{k,t}^2 = \min_{(\beta_1,\ldots,\beta_k)}\sum_{t=1}^n\left(y_t - \sum_{j=1}^k x_{jt}\beta_j\right)^2 \geq \min_{(\beta_1,\ldots,\beta_{k+1})}\sum_{t=1}^n\left(y_t - \sum_{j=1}^{k+1}x_{jt}\beta_j\right)^2 = \sum_{t=1}^n\hat{u}_{k+1,t}^2.$$

43 The more independent variables you include (higher $k$), the smaller the RSS (residual sum of squares) becomes. If $k = n$, then the RSS becomes zero. So you can increase $R^2$ simply by putting more and more variables on the right-hand side. Hence, $R^2$ is not a good indication of the appropriateness of the linear model. One needs to find a measure of goodness of fit that penalizes having many independent variables.

44 For the numerator: instead of SSR, use the unbiased estimator of $\mathrm{Var}(u_t) = \sigma^2$, namely $s^2$. For the denominator: instead of TSS, use the unbiased estimator of $\mathrm{Var}(y_t)$:
$$\bar{R}^2 = 1 - \frac{\frac{1}{n-k}\sum_{t=1}^n\hat{u}_t^2}{\frac{1}{n-1}\sum_{t=1}^n(y_t - \bar{y})^2} = 1 - \frac{(n-1)\sum_{t=1}^n\hat{u}_t^2}{(n-k)\sum_{t=1}^n(y_t - \bar{y})^2}.$$
This is called the adjusted $R^2$. Given the same SSR, having more independent variables (higher $k$) reduces the adjusted $R^2$, so it puts a penalty on a high number of regressors. Notice that for very poorly fitting models, the adjusted $R^2$ can be negative.
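
A brief code sketch (not part of the original notes) computing the plain and adjusted $R^2$ from the residuals of an OLS fit on simulated data:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 60, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.8, 0.0, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
ssr = resid @ resid
tss = ((y - y.mean()) ** 2).sum()
r2 = 1.0 - ssr / tss                                   # simple R^2
adj_r2 = 1.0 - (ssr / (n - k)) / (tss / (n - 1))       # adjusted R^2
print(r2, adj_r2)
```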

45 Hypothesis Testing in Linear Regression Models

Suppose you have the following linear model:
$$\log(\mathrm{wage}) = \beta_0 + \beta_1\,\mathrm{educ} + \beta_2\,\mathrm{exp} + \beta_3\,\mathrm{ten} + u,$$
where educ is education, exp is experience, and ten is tenure on the job. Suppose you obtained a regression output reporting, for each variable (constant, education, experience, tenure), the coefficient estimate and its standard error, together with the sample size (526) and the $R^2$.

46 Suppose you want to know whether the return to education is zero or not. That is, you want to test the hypothesis that the return to education is 0, when the estimated return and its standard error are as reported above. If the estimated value is very different from zero, then you should reject. But what is the proper way to measure whether the estimate is very different from zero or not? The variance of the OLS estimator becomes important here as well. If the OLS coefficient is accurately estimated, then even if it is close to zero, one should reject the hypothesis of zero returns to education. In this case the standard error is 0.007, which is fairly accurate.

47 A potential candidate for a test statistic for the hypothesis $\beta_j = \beta^0$ would be
$$\frac{\hat{\beta}_j - \beta^0}{\mathrm{st.dev}(\hat{\beta}_j)}.$$
If, relative to the accuracy of the OLS estimate, $\hat{\beta}_j$ is too far away from the hypothesized value $\beta^0$, then we reject the hypothesis. Suppose the OLS estimator is normally distributed with mean $\beta^0$,
$$\hat{\beta}_j \sim N(\beta^0, \sigma_{\beta_j}^2).$$
Then,
$$z = \frac{\hat{\beta}_j - \beta^0}{\mathrm{Var}(\hat{\beta}_j)^{1/2}} \sim N(0, 1).$$
Then, one can set up a rejection region with $R_\beta \geq 0$ such that we reject if
$$\left|\frac{\hat{\beta}_j - \beta^0}{\mathrm{Var}(\hat{\beta}_j)^{1/2}}\right| \geq R_\beta.$$

48 Then, we can choose $R_\beta$ such that
$$P(|z| \geq R_\beta) = 0.05.$$
Then, even if the hypothesis is true, i.e. $\beta = \beta^0$, the hypothesis will be rejected, i.e.
$$\left|\frac{\hat{\beta}_j - \beta^0}{\mathrm{Var}(\hat{\beta}_j)^{1/2}}\right| \geq R_\beta,$$
with probability 0.05. In this case, the probability of a Type I error (rejecting a true hypothesis) is 0.05. In other words, this test has a significance level of 0.05; that is, the size of the test is 0.05.

49 Type II error: Suppose the null hypothesis $\beta = \beta^0$ is not true. Then mistakenly failing to reject the false null hypothesis is a Type II error, whose probability equals 1 minus the power of the test.

$R_\beta$ is called the critical value and is often denoted $c_\alpha$, where $\alpha$ is the significance level of the test. For example, if $z \sim N(0, 1)$, the critical value $c_\alpha$ for $\alpha = 0.05$ is
$$c_{0.05} = 1.96,$$
i.e.
$$P(z \leq -1.96 \text{ or } z \geq 1.96) = 0.05, \qquad \text{or} \qquad P(-1.96 \leq z \leq 1.96) = 1 - 0.05 = 0.95.$$
Notice that because this is a two-tailed test,
$$P(z \leq -1.96) = 0.025, \qquad P(z \geq 1.96) = 0.025.$$
That is, if $\Phi(\cdot)$ is the (cumulative) distribution function of $N(0, 1)$,
$$\Phi(c_\alpha) = 1 - \alpha/2.$$

50 (Slide content not captured in the transcription.)

51 P-values

Suppose in the above example that
$$z = \frac{\hat{\beta}_j - \beta^0}{\mathrm{Var}(\hat{\beta}_j)^{1/2}} = 1.96.$$
Then the P-value is 0.05; that is, you barely fail to reject the hypothesis at significance level 0.05. If the P-value is 0.05, can you reject the hypothesis at a significance level of 0.1? The critical value for 0.1 is $c_{0.1} = 1.645$. Then, because $1.96 > 1.645$, the hypothesis is rejected. What about a significance level of 0.025? $c_{0.025} = 2.24 > 1.96$, hence the hypothesis cannot be rejected.

P-value (marginal significance): the greatest significance level for which the hypothesis cannot be rejected.

52 In general, the P-value for a two-tailed test with statistic $\hat{z}$ is
$$\Pr\left(|z| \geq |\hat{z}|\right).$$
Then, if the distribution of $z$ is standard normal,
$$p(\hat{z}) = 2\left(1 - \Phi(|\hat{z}|)\right), \qquad \text{or} \qquad \Phi(|\hat{z}|) = 1 - p(\hat{z})/2.$$
P-values preserve all the information from the estimation.
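
As a small illustration (not part of the original notes), the two-tailed p-value formula above in code, using the standard normal CDF from scipy:

```python
from scipy.stats import norm

# p = 2 * (1 - Phi(|z|)) for a two-tailed test
z_hat = 1.96
p_value = 2.0 * (1.0 - norm.cdf(abs(z_hat)))
print(p_value)   # approximately 0.05
```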

53 Normal Distribution

The density function of a standard normal random variable $u$ (with mean 0 and variance 1) is
$$f(u) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{u^2}{2}\right).$$
Consider now $x$, a normal random variable with mean $\mu$ and variance $\sigma^2$. Then,
$$u = \frac{x - \mu}{\sigma}.$$
Therefore, the density of $x$, $g(x)$, satisfies
$$g(x)\,dx = f(u)\left|\frac{du}{dx}\right|dx = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)\frac{1}{\sigma}\,dx = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)dx.$$

54 Joint Distribution

The joint distribution of independent standard normal random variables $x_1 \sim N(0, 1)$, $x_2 \sim N(0, 1)$ is
$$f(x) = f(x_1)f(x_2) = \frac{1}{(\sqrt{2\pi})^2}\exp\left(-\frac{x_1^2 + x_2^2}{2}\right) = \frac{1}{\det(I)^{1/2}(\sqrt{2\pi})^2}\exp\left(-x^T x/2\right),$$
where
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.$$

55 Now, consider a vector $y$ which is normally distributed with mean $\mu$ and variance matrix $\Sigma$. Let $A$ be a matrix that satisfies $AA^T = \Sigma$. Then $y$ can be expressed as
$$y = \mu + Ax,$$
because then
$$E(y) = \mu, \qquad \mathrm{Var}(y) = E\left([y - \mu][y - \mu]^T\right) = A I A^T = \Sigma.$$
Also,
$$x = A^{-1}(y - \mu), \qquad dx = \left|\det(A^{-1})\right|dy = \det(\Sigma)^{-1/2}dy.$$

56 Together, we obtain the joint normal density of $y$:
$$g(y)\,dy = f\left(A^{-1}(y - \mu)\right)\left|\frac{dx}{dy}\right|dy = \frac{1}{\det(\Sigma)^{1/2}(\sqrt{2\pi})^2}\exp\left(-\frac{1}{2}(y - \mu)^T\Sigma^{-1}(y - \mu)\right)dy.$$

Independence. Suppose $y_1, y_2$ are jointly normally distributed but not correlated, i.e. $\Sigma_{12} = \Sigma_{21} = 0$. Then $y_1, y_2$ are independent. The opposite is also true. This is because in this case
$$(y - \mu)^T\Sigma^{-1}(y - \mu) = \frac{(y_1 - \mu_1)^2}{\Sigma_{11}} + \frac{(y_2 - \mu_2)^2}{\Sigma_{22}}$$
and $\det(\Sigma) = \Sigma_{11}\Sigma_{22}$.

57 Therefore,
$$g(y) = \frac{1}{\sqrt{\Sigma_{11}}\sqrt{2\pi}}\exp\left(-\frac{(y_1 - \mu_1)^2}{2\Sigma_{11}}\right)\cdot\frac{1}{\sqrt{\Sigma_{22}}\sqrt{2\pi}}\exp\left(-\frac{(y_2 - \mu_2)^2}{2\Sigma_{22}}\right) = g_1(y_1)\,g_2(y_2).$$
Therefore, $y_1$ and $y_2$ are independent of each other.

We next show that the sum of two normal random variables is also normal. Consider the sum of two independent standard normal random variables, $y = x_1 + x_2$. Then, we can express $x_2 = y - x_1$. Therefore,

58
$$f(x)\,dx = g(x_1, y)\left|\det\left(\frac{\partial(x_1, x_2)}{\partial(x_1, y)}\right)\right|d(x_1, y) = \frac{1}{(\sqrt{2\pi})^2}\exp\left(-\frac{x_1^2 + (y - x_1)^2}{2}\right)d(x_1, y)$$
$$= \frac{1}{(\sqrt{2\pi})^2}\exp\left(-\frac{2(x_1 - \tfrac{y}{2})^2 + \tfrac{y^2}{2}}{2}\right)d(x_1, y) = \frac{1}{\sqrt{2\pi\cdot 2}}\exp\left(-\frac{y^2}{4}\right)\cdot\frac{1}{\sqrt{2\pi\cdot\tfrac{1}{2}}}\exp\left(-\frac{(x_1 - \tfrac{y}{2})^2}{2\cdot\tfrac{1}{2}}\right)d(x_1, y).$$
Hence we can see that, conditional on $y$, $x_1$ has mean $y/2$ and variance $1/2$, and integrating over $x_1$ we get the distribution of $y$ with the functional form
$$\frac{1}{\sqrt{2\pi\cdot 2}}\exp\left(-\frac{y^2}{4}\right);$$
that is, $y$ is normally distributed with mean 0 and variance 2.

59 Chi-Squared Distribution

Suppose $z$ is a vector of $m$ independently and identically distributed standard normal random variables. Then,
$$y = z^T z = \sum_{t=1}^m z_t^2$$
is chi-squared distributed with $m$ degrees of freedom, i.e. $y \sim \chi^2(m)$. Because of this, we can see that if $y_1 \sim \chi^2(m_1)$ and $y_2 \sim \chi^2(m_2)$, and $y_1$ and $y_2$ are independently distributed, then $y_1 + y_2$ is just like a sum of squares of $m_1 + m_2$ independently distributed standard normal random variables, so
$$y_1 + y_2 \sim \chi^2(m_1 + m_2).$$

60 Quadratic Forms and the Chi-Squared Distribution

1. If the $m$-vector $x \sim N(0, \Omega)$, then $x^T\Omega^{-1}x \sim \chi^2(m)$.

Let $A$ be such that $AA^T = \Omega$. Then,
$$\mathrm{Var}(A^{-1}x) = E\left[A^{-1}xx^T(A^{-1})^T\right] = A^{-1}\Omega(A^{-1})^T = A^{-1}AA^T(A^{-1})^T = I_m.$$
Therefore, $A^{-1}x \sim N(0, I_m)$, and therefore
$$\left(A^{-1}x\right)^T A^{-1}x = x^T\Omega^{-1}x \sim \chi^2(m).$$

61 2. If $P$ is a projection matrix with rank $r$ and $z \sim N(0, I_n)$, then $z^T P z \sim \chi^2(r)$.

Because $P$ is a projection matrix, there exists an $n \times r$ matrix $X$ such that
$$P = X(X^T X)^{-1}X^T.$$
Now, $X^T z$ is an $r \times 1$ vector, and
$$X^T z \sim N(0, X^T X).$$
Therefore, from result 1,
$$z^T X(X^T X)^{-1}X^T z \sim \chi^2(r).$$

62 Student's t Distribution

If $z \sim N(0, 1)$ and $y \sim \chi^2(m)$, and $z$ and $y$ are independent, then
$$t = \frac{z}{(y/m)^{1/2}}$$
has a Student's t distribution with $m$ degrees of freedom.

F Distribution

If $y_1 \sim \chi^2(m_1)$ and $y_2 \sim \chi^2(m_2)$ and they are independent, then
$$F = \frac{y_1/m_1}{y_2/m_2}$$
has an F distribution $F(m_1, m_2)$ with degrees of freedom $m_1, m_2$.

63 Exact Tests in the Classical Normal Linear Model

$$y = X\beta + u.$$
Additional assumption: $u$ is statistically independent of $X$ and normally distributed, $u \sim N(0, \sigma^2 I)$.

Test of a Single Restriction

$$y = X_1\beta_1 + X_2\beta_2 + u, \qquad u \sim N(0, \sigma^2 I).$$
Test of the hypothesis $\beta_2 = \beta_2^0$. OLS estimation (by FWL):
$$M_{X_1}y = M_{X_1}X_2\beta_2 + M_{X_1}u,$$
$$\hat{\beta}_2 = \left(X_2^T M_{X_1}X_2\right)^{-1}X_2^T M_{X_1}y, \qquad \mathrm{Var}(\hat{\beta}_2) = \sigma^2\left(X_2^T M_{X_1}X_2\right)^{-1}.$$

64 Then, if the null hypothesis is true, the true parameter is $\beta_2 = \beta_2^0$, and the OLS coefficient is normally distributed with mean $\beta_2^0$ and variance
$$\mathrm{Var}(\hat{\beta}_2) = \sigma^2\left(X_2^T M_{X_1}X_2\right)^{-1}.$$
Then, we know that
$$\frac{\hat{\beta}_2 - E[\hat{\beta}_2]}{\mathrm{Var}(\hat{\beta}_2)^{1/2}} \sim N(0, 1).$$
Therefore,
$$\frac{\hat{\beta}_2 - \beta_2^0}{\sigma\left(X_2^T M_{X_1}X_2\right)^{-1/2}} \sim N(0, 1).$$

65 The problem is that we do not know $\sigma^2$. To deal with this, we use $s^2$, the unbiased estimate of $\sigma^2$, i.e.
$$s^2 = \frac{1}{n-k}\sum_{t=1}^n\hat{u}_t^2 = \frac{y^T M_X y}{n-k}.$$
But the problem is that if you just substitute $s^2$ in the place where $\sigma^2$ used to be, then the resulting distribution is no longer normal, because $s^2$ is not fixed; it is random. Next, we will deal with this additional randomness.

66 Now, given $X$ being an $n \times k$ matrix of rank $k$, consider an $n \times (n-k)$ matrix $Z$ of rank $n-k$ such that
$$X^T Z = 0.$$
Then, because $[X, Z]$ is an $n \times n$ matrix with full rank,
$$y = [X, Z]\begin{bmatrix}\hat{\beta} \\ \hat{\gamma}\end{bmatrix} = X\hat{\beta} + Z\hat{\gamma}$$
has a solution. Notice that, because of orthogonality,
$$X^T Z\hat{\gamma} = 0.$$

67 Therefore, $Z\hat{\gamma}$ is the residual $\hat{u}$ of the OLS regression of $y$ on $X$, i.e.
$$\hat{u} = Z\hat{\gamma} = Z(Z^T Z)^{-1}Z^T y = Z(Z^T Z)^{-1}Z^T(X\beta + u) = Z(Z^T Z)^{-1}Z^T u.$$
Then, because $u/\sigma \sim N(0, I)$,
$$\frac{\mathrm{SSR}}{\sigma^2} = \frac{u^T Z(Z^T Z)^{-1}Z^T u}{\sigma^2} \sim \chi^2(n-k).$$
Then, because
$$z = \frac{\hat{\beta}_2 - \beta_2^0}{\sigma\left(X_2^T M_{X_1}X_2\right)^{-1/2}} \sim N(0, 1), \qquad \frac{\mathrm{SSR}}{\sigma^2} \sim \chi^2(n-k),$$
and if $z$ and SSR are independent, then
$$\frac{z}{\sqrt{\dfrac{\mathrm{SSR}}{\sigma^2(n-k)}}} \sim t(n-k).$$

68 We next show that $z$ and SSR are independent. First, we show that $\hat{\beta}$ and $\hat{u}$ are uncorrelated. This is because
$$\hat{u}\left(\hat{\beta} - \beta\right)^T = \left(I - X(X^T X)^{-1}X^T\right)uu^T X(X^T X)^{-1}.$$
Hence,
$$E\left[\hat{u}\left(\hat{\beta} - \beta\right)^T\mid X\right] = \sigma^2\left(I - X(X^T X)^{-1}X^T\right)X(X^T X)^{-1} = \sigma^2\left(X(X^T X)^{-1} - X(X^T X)^{-1}\right) = 0.$$
Because both $\hat{\beta}$ and $\hat{u}$ are normally distributed, and they are uncorrelated with each other, they are independently distributed; hence $z$ (a function of $\hat{\beta}_2$) and SSR (a function of $\hat{u}$) are independent.

69 Therefore,
$$\frac{\hat{\beta}_2 - \beta_2^0}{s\left(X_2^T M_{X_1}X_2\right)^{-1/2}} = \frac{\hat{\beta}_2 - \beta_2^0}{\left(s^2\left[X_2^T M_{X_1}X_2\right]^{-1}\right)^{1/2}} \sim t(n-k),$$
where $s^2\left[X_2^T M_{X_1}X_2\right]^{-1}$ is the estimated variance of $\hat{\beta}_2$.
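
A short code sketch of this t test (not part of the original notes): t statistics for the null that each coefficient is zero, using $s^2$-based standard errors, with $t(n-k)$ p-values. Data and values are illustrative.

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(8)
n, k = 120, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([0.5, 1.0, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - k)
se = np.sqrt(np.diag(s2 * XtX_inv))
t_stat = beta_hat / se                                   # H0: beta_j = 0
p_val = 2.0 * t_dist.sf(np.abs(t_stat), df=n - k)        # two-tailed p-values
print(t_stat, p_val)
```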

70 Tests of Several Restrictions

Suppose that $X_1$ is an $n \times k_1$ matrix and $X_2$ is an $n \times k_2$ matrix, $\beta_1$ is a $k_1 \times 1$ vector and $\beta_2$ is a $k_2 \times 1$ vector with $k_2 > 1$. If we want to test $\beta_2 = 0$, then the above t-test does not work. So, instead, we use the F-test based on the F distribution. The null hypothesis is
$$H_0:\quad y = X_1\beta_1 + u, \qquad u \sim N(0, \sigma^2 I),$$
and the alternative hypothesis is
$$H_1:\quad y = X_1\beta_1 + X_2\beta_2 + u, \qquad u \sim N(0, \sigma^2 I).$$

71 Instead, we use the chi-squared distributions of the residuals. For the residuals if the null hypothesis were true,
$$\frac{\hat{u}_r^T\hat{u}_r}{\sigma^2} \sim \chi^2(n - k_1).$$
For the residuals under the alternative hypothesis,
$$\frac{\hat{u}_u^T\hat{u}_u}{\sigma^2} \sim \chi^2(n - k_1 - k_2).$$
Because of this, we consider using an F-test, which involves two chi-squared distributed quantities. But for the F-test, those two quantities need to be independent, and $\hat{u}_r$ and $\hat{u}_u$ are not independent of each other; thus neither are $\hat{u}_r^T\hat{u}_r$ and $\hat{u}_u^T\hat{u}_u$. So we cannot use the above sums of squared residuals directly.

72 But we will show below that $\hat{u}_u$ and $\hat{u}_r - \hat{u}_u$ are actually uncorrelated, and thus independent, because they are normally distributed. Therefore, we can do the F-test using
$$\frac{\hat{u}_u^T\hat{u}_u}{\sigma^2} \sim \chi^2(n - k_1 - k_2)$$
and
$$\frac{(\hat{u}_r - \hat{u}_u)^T(\hat{u}_r - \hat{u}_u)}{\sigma^2} \sim \chi^2(k_2),$$
and
$$F = \frac{(\hat{u}_r - \hat{u}_u)^T(\hat{u}_r - \hat{u}_u)/k_2}{\hat{u}_u^T\hat{u}_u/(n - k_1 - k_2)} \sim F(k_2, n - k_1 - k_2).$$

73 Then, the restricted sum of squared residuals (RSSR), i.e. the SSR when $\beta_2 = 0$ is imposed, is
$$\mathrm{RSSR} = y^T M_{X_1}y,$$
and without the restriction,
$$\hat{u}_r = M_{X_1}y = M_{X_1}X_2\hat{\beta}_2 + M_{X_1}\hat{u}_u = M_{X_1}X_2\hat{\beta}_2 + \hat{u}_u.$$
Next, we show that $M_{X_1}X_2\hat{\beta}_2$ and $\hat{u}_u$ are uncorrelated:
$$E\left[M_{X_1}X_2\left(\hat{\beta}_2 - \beta_2\right)\hat{u}_u^T\right] = E\left[M_{X_1}X_2\left(X_2^T M_{X_1}X_2\right)^{-1}X_2^T M_{X_1}uu^T M_X\right] = \sigma^2 M_{X_1}X_2\left(X_2^T M_{X_1}X_2\right)^{-1}X_2^T M_{X_1}M_X = 0.$$

74 Therefore, $\hat{u}_r - \hat{u}_u = M_{X_1}X_2\hat{\beta}_2$ and $\hat{u}_u$ are uncorrelated, and thus independent of each other. Furthermore,
$$\hat{u}_r - \hat{u}_u = M_{X_1}X_2\left(X_2^T M_{X_1}X_2\right)^{-1}X_2^T M_{X_1}y = M_{X_1}X_2\beta_2 + P_{M_{X_1}X_2}u.$$
Therefore, given the hypothesis $\beta_2 = 0$, because $P_{M_{X_1}X_2}$ is a projection matrix (of rank $k_2$),
$$\frac{(\hat{u}_r - \hat{u}_u)^T(\hat{u}_r - \hat{u}_u)}{\sigma^2} = \frac{u^T P_{M_{X_1}X_2}u}{\sigma^2} \sim \chi^2(k_2),$$
and finally,
$$(\hat{u}_r - \hat{u}_u)^T\hat{u}_u = \hat{\beta}_2^T X_2^T M_{X_1}\hat{u}_u = 0.$$
Therefore,
$$(\hat{u}_r - \hat{u}_u)^T(\hat{u}_r - \hat{u}_u) = (\hat{u}_r - \hat{u}_u)^T\hat{u}_r = \hat{u}_r^T\hat{u}_r - \hat{u}_u^T\left(\hat{u}_u + (\hat{u}_r - \hat{u}_u)\right) = \hat{u}_r^T\hat{u}_r - \hat{u}_u^T\hat{u}_u.$$

75 Together, we have shown that
$$F = \frac{\left(\hat{u}_r^T\hat{u}_r - \hat{u}_u^T\hat{u}_u\right)/k_2}{\hat{u}_u^T\hat{u}_u/(n - k_1 - k_2)} = \frac{(\mathrm{RSSR} - \mathrm{USSR})/k_2}{\mathrm{USSR}/(n - k_1 - k_2)} \sim F(k_2, n - k_1 - k_2).$$
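
A code sketch of this F test (not part of the original notes): compute the restricted and unrestricted sums of squared residuals on simulated data and form the statistic above. Values are illustrative.

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(9)
n, k1, k2 = 100, 2, 2
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, k2))
y = X1 @ np.array([1.0, 0.5]) + X2 @ np.array([0.3, 0.0]) + rng.normal(size=n)

def ssr(Xmat, yvec):
    b = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ yvec)
    e = yvec - Xmat @ b
    return e @ e

rssr = ssr(X1, y)                          # restricted: beta_2 = 0 imposed
ussr = ssr(np.column_stack([X1, X2]), y)   # unrestricted
F = ((rssr - ussr) / k2) / (ussr / (n - k1 - k2))
p_val = f_dist.sf(F, k2, n - k1 - k2)
print(F, p_val)
```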

76 Tests of General Linear Restrictions

All linear restrictions on the parameters can be expressed as
$$R\beta = 0,$$
where $\beta$ is a $k \times 1$ vector, $R$ is an $r \times k$ matrix of rank $r$ (consisting of $r$ linearly independent rows), and $r$ is the number of restrictions. For example, the restriction $\beta = 0$ is
$$R\beta = 0 \quad \text{with} \quad R = I_k.$$
The restriction $\beta_1 = 0, \ldots, \beta_{k_1} = 0$ can be expressed similarly with
$$R = [I_{k_1}, 0].$$

77 Then, as we have seen before, if we assume that $u \sim N(0, \sigma^2 I)$, then
$$\hat{\beta} \sim N\left(\beta, \sigma^2(X^T X)^{-1}\right).$$
Therefore,
$$E\left[R\hat{\beta}\mid X\right] = R\beta, \qquad \mathrm{Var}\left(R\hat{\beta}\mid X\right) = R\,\mathrm{Var}(\hat{\beta})\,R^T = \sigma^2 R(X^T X)^{-1}R^T.$$
Therefore, as we have seen before,
$$\frac{\left(R\hat{\beta} - R\beta\right)^T\left[R(X^T X)^{-1}R^T\right]^{-1}\left(R\hat{\beta} - R\beta\right)}{\sigma^2} \sim \chi^2(r).$$
The remaining thing to do is to substitute $s^2$ for $\sigma^2$, and then the statistic will be F-distributed.

78 As before, we need to show that $s^2$ and $R\hat{\beta}$ are independent of each other. We first show that $\hat{u}$ and $\hat{\beta}$ are independent. They are both normally distributed, and
$$E\left[\hat{u}\left(\hat{\beta} - \beta\right)^T\mid X\right] = \sigma^2\left(I - X(X^T X)^{-1}X^T\right)X(X^T X)^{-1} = 0.$$
Since they are both normally distributed and uncorrelated, they are independent. Therefore, $s^2$ and $\hat{\beta}$ are also independent, and thus $s^2$ and $R\hat{\beta}$ are independent. Therefore,
$$\frac{\left(R\hat{\beta} - R\beta\right)^T\left[R(X^T X)^{-1}R^T\right]^{-1}\left(R\hat{\beta} - R\beta\right)/r}{s^2} \sim F(r, n - k).$$
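
A final code sketch (not part of the original notes): the F statistic above for a set of general linear restrictions $R\beta = 0$, on simulated data with an illustrative restriction matrix.

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(10)
n, k = 150, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
resid = y - X @ b
s2 = resid @ resid / (n - k)

R = np.array([[0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)   # restriction: third and fourth coefficients are zero
r = R.shape[0]
Rb = R @ b
# F = (R b)' [R (X'X)^{-1} R']^{-1} (R b) / (r s^2)
F = Rb @ np.linalg.solve(R @ XtX_inv @ R.T, Rb) / (r * s2)
print(F, f_dist.sf(F, r, n - k))
```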
