Standard Linear Regression Model (SLM)


Ordinary Least Squares Estimator (OLSE)

Nathan Smooha

Abstract. Estimation of the Standard Linear Model by the Ordinary Least Squares Estimator: model specification, objective function, finite-sample and asymptotic properties of the OLSE, measurement of goodness of fit, tests of linear hypotheses, restricted LSE.

Standard Linear Regression Model (SLM)

Linear Regression Model

The data in economics usually cannot be generated by experiments, so both the dependent and independent variables have to be treated as random variables, variables whose values are subject to chance. A model is a set of restrictions on the joint distribution of the dependent and independent variables. That is, a model is a set of joint distributions satisfying a set of assumptions. A linear regression model specifies the dependent variable (regressand) y_t as the sum of a linear function of the K observable explanatory variables (regressors) x_tk, k = 1, 2, ..., K, and an unobservable error term u_t:

    y_t = X_t β + u_t,   t = 1, 2, ..., N    (1)

where the subscript t indicates the t-th observation, X_t is a K-dimensional row vector, and β is a K-dimensional column vector of unknown coefficients. In matrix form, for all N observations,

    y = Xβ + u

where the dimensions are y : N × 1, X : N × K, β : K × 1, and u : N × 1.

Ordinary Least Squares (OLS) Estimation Method

The OLS estimation method does not require information about the statistical properties of the variables in the model; it only needs a specified linear (or nonlinear) regression equation. The properties of the OLS estimator, of course, depend on the statistical properties of the variables. Although we do not observe the error term, we can calculate the value implied by a hypothetical value β̃ of β as y_t − X_t β̃; this is called the residual for observation t. From the residuals, form the sum of squared residuals (SSR):

    SSR(β̃) ≡ Σ_{t=1}^{N} (y_t − X_t β̃)² = (y − Xβ̃)'(y − Xβ̃)    (2)

This is also called the error sum of squares (ESS) or the residual sum of squares (RSS). It is a function of β̃ because the residual depends on it. The OLS estimate, β̂, of β is the β̃ that minimizes this function:

    β̂ ≡ argmin_{β̃} SSR(β̃)    (3)

As an example, if K = 2 and the first column of X is all 1's, then we want to find β̂_1 and β̂_2 such that the estimated regression line, ŷ_t = β̂_1 + x_t β̂_2, is close to the data points. Closeness is measured by the SSR. So the least squares method of estimation of the unknown coefficients β_k finds the vector β̂ that minimizes the sum of squared residuals. Since it depends on the sample (y, X), the OLS estimate β̂ is in general different from the true value β. By having squared residuals in the objective function, this method imposes a heavy penalty on large residuals; the OLS estimate is chosen to prevent large residuals for a few observations at the expense of tolerating relatively small residuals for many other observations.

A sure-fire way of solving the minimization problem is to derive the first-order conditions by setting the partial derivatives equal to zero.¹ Expanding the objective function,

    SSR(β̃) = (y − Xβ̃)'(y − Xβ̃)
            = y'y − β̃'X'y − y'Xβ̃ + β̃'X'Xβ̃
            = y'y − 2y'Xβ̃ + β̃'X'Xβ̃   (since β̃'X'y is a scalar and equals its transpose y'Xβ̃)
            ≡ y'y − 2a'β̃ + β̃'Aβ̃     (with a ≡ X'y and A ≡ X'X)

Recalling from matrix algebra that

    ∂(a'β̃)/∂β̃ = a   and   ∂(β̃'Aβ̃)/∂β̃ = 2Aβ̃   for A symmetric,

the K-dimensional column vector of partial derivatives is

    ∂SSR(β̃)/∂β̃ = −2a + 2Aβ̃

The first-order conditions are obtained by setting this partial derivative equal to zero. Recalling that a ≡ X'y and A ≡ X'X, we can write the first-order conditions as

    X'Xβ̂ = X'y    (4)

Here, we have replaced β̃ by β̂ because the OLS estimate β̂ is the β̃ that satisfies the first-order conditions. These K equations are called the normal equations.

The OLS predictor, or the vector of fitted values, of the dependent variable is defined as

    ŷ = Xβ̂    (5)

The vector of residuals evaluated at β̃ = β̂,

    û ≡ y − Xβ̂ = y − ŷ    (6)

is called the vector of OLS residuals. Its t-th element is û_t ≡ y_t − X_t β̂.

To be sure, the first-order conditions are just a necessary condition for minimization, so we have to check the second-order condition to make sure that β̂ achieves the minimum:

    ∂²SSR(β̃)/∂β̃∂β̃' = 2X'X > 0

The second-order condition is satisfied since X'X is positive definite.

¹ If h : R^K → R is a scalar-valued function of a K-dimensional column vector x, the derivative of h with respect to x is a K-dimensional column vector whose k-th element is ∂h(x)/∂x_k, where x_k is the k-th element of x. This K-dimensional column vector is called the gradient. Here, x is β̃ and the function h(x) is SSR(β̃).
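As an illustration of the normal equations (4), the following sketch (a hypothetical numpy example on simulated data, not part of the original notes) solves X'Xβ̂ = X'y directly and cross-checks the answer against a library least-squares routine.

```python
import numpy as np

# Hypothetical simulated sample (not from the notes): N observations, K regressors
rng = np.random.default_rng(0)
N, K = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])  # first column is a constant
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.8, size=N)

# Solve the K normal equations X'X beta_hat = X'y (equation (4))
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values and OLS residuals (equations (5) and (6))
y_hat = X @ beta_hat
u_hat = y - y_hat
ssr = u_hat @ u_hat

# Cross-check against numpy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq, ssr)
```

Solving the normal equations with a linear solver rather than forming (X'X)⁻¹ explicitly is the usual numerically stable choice.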

Finite-Sample Properties of the OLSE

Since the estimator β̂ depends on the random sample (y, X), it is also random, and thus it has statistical properties. Note that the characteristics of the distribution of the estimator derived here are valid for any given sample size N. The OLSE is the best linear unbiased estimator (BLUE) under certain assumptions. We present the minimum set of assumptions needed for each property of the OLSE.

Assumption 1: rank(X) = K and rank(X, y) = K + 1

The assumption rank(X) = K implies that N ≥ K and that the columns of X (the observation vectors of each regressor) are linearly independent, i.e., none of the columns can be expressed as a linear combination of the other columns. This is called the assumption of full column rank. The regressors are said to be (perfectly) multicollinear if the assumption is not satisfied. The assumption rank(X, y) = K + 1 implies that there is no exact solution for the unknown coefficient β in the linear equation y = Xβ. If this rank were K, then y would be linearly dependent on the columns of X and we could find a nonzero vector β such that

    y = Σ_{k=1}^{K} β_k X_k

where X_k = (x_1k, ..., x_Nk)' denotes the k-th column of X.

Under these assumptions, the OLSE has the following properties:

(a) β̂ is unique

This holds because X'X is positive definite and nonsingular due to the assumption of full column rank. The normal equations can be solved uniquely for β̂ by premultiplying both sides by (X'X)⁻¹:

    β̂ = (X'X)⁻¹X'y    (7)

Viewed as a function of the sample (y, X), this is often called the OLS estimator. For any given sample (y, X), the value of this function is the OLS estimate. Note that β̂_k is an estimate of the unknown population parameter β_k. The predicted value of y can now be written as

    ŷ = Xβ̂ = X(X'X)⁻¹X'y = Py

where P ≡ X(X'X)⁻¹X' is N × N. The projection matrix P projects y onto the column space of X to obtain ŷ. P is symmetric and idempotent, and PX = X. If rank(X) < K, then X'X is positive semi-definite and singular, so β̂ is not unique because an infinite number of solutions exist.

The sampling error is defined as β̂ − β. It can be related to u as follows:

    β̂ − β = (X'X)⁻¹X'y − β             (by (7))
          = (X'X)⁻¹X'(Xβ + u) − β      (since y = Xβ + u)
          = β + (X'X)⁻¹X'u − β
          = (X'X)⁻¹X'u
          = Au                          (where A ≡ (X'X)⁻¹X')    (8)

(b) β̂ is a linear estimator

β̂ is a linear function of y, i.e., β̂ is linear in y:

    β̂_k = b_k1 y_1 + ... + b_kN y_N,   k = 1, ..., K

Example. For K = 2,

    (β̂_1, β̂_2)' = (X'X)⁻¹X'y

where the rows of (X'X)⁻¹X' = [b_kt] give

    β̂_1 = b_11 y_1 + b_12 y_2 + ... + b_1N y_N
    β̂_2 = b_21 y_1 + b_22 y_2 + ... + b_2N y_N

(c) The OLS regression residual, û

Note that since u is a random vector, û should not be considered an estimate of u. Recall that the vector of OLS regression residuals is

    û ≡ y − Xβ̂ = y − ŷ = [I_N − X(X'X)⁻¹X']y = My

where M ≡ I_N − X(X'X)⁻¹X'. The annihilator matrix M is symmetric and idempotent with rank N − K, and MX = 0.

(c.1) û = My = M(Xβ + u) = MXβ + Mu = Mu

(c.2) X'û = X'Mu = (MX)'u = 0. Thus, û is orthogonal to the columns of X, i.e.,

    X'û = (Σ_{t=1}^{N} x_t1 û_t, ..., Σ_{t=1}^{N} x_tK û_t)' = 0,   that is, X_k'û = Σ_{t=1}^{N} x_tk û_t = 0 for each k

This shows that the normal equations can be interpreted as the sample analogue of the orthogonality conditions E(X_t'u_t) = 0, which will be covered in the next subsection.

(c.3) Let X have a column of a constant, so x_t1 = 1 for t = 1, ..., N. From the result in (c.2),

    Σ_{t=1}^{N} x_t1 û_t = 0  ⇒  Σ_{t=1}^{N} û_t = 0  ⇒  N⁻¹ Σ_{t=1}^{N} û_t = ū = 0

As a result, since y = Xβ̂ + û, averaging over the observations gives ȳ = X̄β̂ + ū = X̄β̂.

To sum up, if X has a column of a constant, then Σ_{t=1}^{N} û_t = 0 and ȳ = X̄β̂, where ȳ and X̄ are the sample means of y and X. That is, the regression line passes through the point of sample means. If X does not have a column of a constant, the regression line does not in general pass through the point of sample means.
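These algebraic facts are easy to verify numerically. The sketch below (a hypothetical numpy example on simulated data, not from the notes) checks that P is idempotent with PX = X, that MX = 0, that X'û = 0, and that with a constant regressor the residuals sum to zero and the fit passes through the sample means.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 60, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([0.5, 1.5, -1.0]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
P = X @ XtX_inv @ X.T               # projection matrix
M = np.eye(N) - P                   # annihilator matrix
beta_hat = XtX_inv @ X.T @ y
u_hat = M @ y                       # OLS residuals, u_hat = My

print(np.allclose(P @ P, P))        # P is idempotent
print(np.allclose(P @ X, X))        # PX = X
print(np.allclose(M @ X, 0))        # MX = 0
print(np.allclose(X.T @ u_hat, 0))  # normal equations: X' u_hat = 0
# With a constant regressor, residuals sum to zero and ybar = Xbar beta_hat
print(np.isclose(u_hat.sum(), 0.0))
print(np.isclose(y.mean(), X.mean(axis=0) @ beta_hat))
```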

Assumption 2: E(u|X) = 0 (Strict Exogeneity)

This assumption implies that the regressors are strictly exogenous: the expectation (mean) is conditional on the regressors for all observations, i.e.,

    E(u_t|X_1, ..., X_N) = 0   (t = 1, ..., N)    (9)

To state the assumption differently, take, for any given observation t, the joint distribution of the NK + 1 random variables, f(u_t, X_1, ..., X_N), and consider the conditional distribution f(u_t|X_1, ..., X_N). The conditional mean E(u_t|X_1, ..., X_N) is, in general, a nonlinear function of (X_1, ..., X_N). The strict exogeneity assumption says that this function is a constant value of zero. It also implies that E(y|X) = Xβ. Note that if X is a fixed variate, then X is strongly exogenous. Weak exogeneity of X requires only E(u_t|X_t) = 0 for all t.

Assuming this constant to be zero is not restrictive if the regressors include a constant, because the equation can be rewritten so that the conditional mean of the error is zero. To see this, suppose that E(u_t|X) = µ and x_t1 = 1. The equation can be written as

    y_t = β_1 + β_2 x_t2 + ... + β_K x_tK + u_t = (β_1 + µ) + β_2 x_t2 + ... + β_K x_tK + (u_t − µ)

If we redefine β_1 to be β_1 + µ and u_t to be u_t − µ, the conditional mean of the new error term is zero. In virtually all applications, the regressors include a constant term.

Implications of Strict Exogeneity

The unconditional mean of the error term is zero:

    E(u_t) = 0   (t = 1, ..., N)    (10)

This is because, by the Law of Total Expectations,² E[E(u_t|X)] = E(u_t).

Under strict exogeneity, the regressors are orthogonal to the error term for all observations, i.e.,

    E(x_sk u_t) = 0   (s, t = 1, ..., N; k = 1, ..., K),   or   E(X_s'u_t) = (E(x_s1 u_t), ..., E(x_sK u_t))' = 0   (K × 1)   for all t, s    (11)

Proof. Since x_sk is an element of X, strict exogeneity implies E(u_t|x_sk) = E[E(u_t|X)|x_sk] = 0 by the Law of Iterated Expectations.³ It follows from this that

    E(x_sk u_t) = E[E(x_sk u_t|x_sk)]   (by the Law of Total Expectations)
                = E[x_sk E(u_t|x_sk)]   (by the linearity of conditional expectations)
                = 0

Strict exogeneity requires that the regressors be orthogonal not only to the error term from the same observation (i.e., E(x_tk u_t) = 0 for all k), but also to the error terms from the other observations (i.e., E(x_sk u_t) = 0 for all k and for s ≠ t). Because the mean of the error term is zero, the orthogonality conditions are equivalent to zero-correlation conditions. This is because

    cov(u_t, x_sk) = E(x_sk u_t) − E(x_sk)E(u_t)   (by the definition of covariance)
                   = E(x_sk u_t)                   (by (10))
                   = 0                             (by the orthogonality conditions (11))

In particular, for t = s, cov(x_tk, u_t) = 0. Therefore, strict exogeneity implies the requirement that the regressors be contemporaneously uncorrelated with the error term.

(d) β̂ is unbiased: E(β̂|X) = β for all β

Proof.

    E(β̂|X) = E((X'X)⁻¹X'y|X)                   (by (7))
            = E((X'X)⁻¹X'(Xβ + u)|X)            (since y = Xβ + u)
            = E(β + (X'X)⁻¹X'u|X)
            = E(β|X) + E((X'X)⁻¹X'u|X)          (by the linearity of conditional expectations)
            = β + (X'X)⁻¹X'E(u|X)               (since β is a constant and (X'X)⁻¹X' is a function of X)
            = β                                 (by (9))

² The Law of Total Expectations states that E[E(y|x)] = E(y).
³ The Law of Iterated Expectations states that E[E(y|x, z)|x] = E(y|x).
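A small Monte Carlo experiment illustrates the unbiasedness result. The sketch below (hypothetical simulated design, not from the notes) holds X fixed across replications, as in the conditional argument above, and shows the average of β̂ over many draws of u settling near the true β.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, reps = 40, 3, 5000
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])  # fixed across replications
beta_true = np.array([1.0, -2.0, 0.5])
A = np.linalg.inv(X.T @ X) @ X.T                                # A = (X'X)^(-1) X'

draws = np.empty((reps, K))
for r in range(reps):
    u = rng.normal(scale=1.5, size=N)   # errors with E(u|X) = 0
    y = X @ beta_true + u
    draws[r] = A @ y                    # beta_hat = A y

print(draws.mean(axis=0))               # close to beta_true, illustrating E(beta_hat | X) = beta
```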

The OLS estimator β̂ is a function of the sample (y, X). Since (y, X) are random, so is β̂. Now imagine that we fix X at some given value, calculate β̂ for all samples corresponding to all possible realizations of y, and take the average of β̂. This average is the (population) conditional mean E(β̂|X). The result above says that this average equals the true value β. It should be emphasized that the strict exogeneity assumption is critical for proving unbiasedness; anything short of strict exogeneity will not do.

There is another notion of unbiasedness that is weaker than the unbiasedness above. The result above implies that E(β̂) = E[E(β̂|X)] = β, by the Law of Total Expectations. This says that if we calculated β̂ for all possible different samples, differing not only in y but also in X, the average would be the true value. Finally, β̂ minimizes the mean squared error E[(β̂ − β)'(β̂ − β)] among linear unbiased estimators of β (a consequence of the Gauss–Markov theorem below).

(e) Linear parametric functions

Let R be a J × K matrix of nonzero constants. An unbiased estimator of the linearly estimable linear parametric functions Rβ is given by Rβ̂.

(f) û is unbiased

The residual û is an unbiased estimator of the error term u in the sense that E(û − u|X) = 0.

Assumption 3: E(uu'|X) = σ²I_N (Spherical error covariance)

This assumption implies that the error terms u_t are homoskedastic (have the same variance) and uncorrelated with each other conditional on X:

    E(u_t²|X) = σ² > 0    (t = 1, 2, ..., N)    (12)
    E(u_t u_s|X) = 0      (t, s = 1, 2, ..., N; t ≠ s)    (13)

The homoskedasticity assumption (12) says that the conditional second moment, which in general is a nonlinear function of X, is a constant. Thanks to strict exogeneity, this condition can be stated equivalently in more familiar terms. Consider the conditional variance var(u_t|X). It equals the same constant because

    var(u_t|X) = E(u_t²|X) − [E(u_t|X)]²   (by the definition of conditional variance)
               = E(u_t²|X)                 (by strict exogeneity)

Similarly, (13) is equivalent to the requirement that

    cov(u_t, u_s|X) = 0   (t, s = 1, 2, ..., N; t ≠ s)

That is, in the joint distribution of (u_t, u_s) conditional on X, the covariance is zero. In the context of time-series models, (13) states that there is no serial correlation in the error term. The discussion above shows that the assumption can also be written as

    cov(u|X) = σ²I_N

With the third assumption added to the previous two, we have the following key results for the OLSE.

(g) The covariance matrix of β̂

    V ≡ cov(β̂|X) = σ²(X'X)⁻¹

Proof.

    cov(β̂|X) = cov(β̂ − β|X)   (since β is not random)
              = cov(Au|X)       (by (8))
              = A cov(u|X) A'   (since A is a function of X)
              = A(σ²I_N)A'      (since cov(u|X) = σ²I_N)
              = σ²AA'
              = σ²(X'X)⁻¹       (since AA' = (X'X)⁻¹X'X(X'X)⁻¹ = (X'X)⁻¹)

So cov(Rβ̂|X) = R cov(β̂|X) R' = σ²R(X'X)⁻¹R' = RVR' for a matrix R of constants.

(h) Gauss–Markov Theorem

The OLSE is efficient in the class of linear unbiased estimators. That is, for any unbiased estimator β̃ that is linear in y, cov(β̃|X) ≥ cov(β̂|X) in the matrix sense, i.e., cov(β̃|X) − cov(β̂|X) is positive semi-definite. So

    a'[cov(β̃|X) − cov(β̂|X)]a ≥ 0   or   a'cov(β̃|X)a ≥ a'cov(β̂|X)a

for any K-dimensional vector a. In particular, consider a special vector whose elements are all 0 except for the k-th element, which is 1. For this particular a, the quadratic form above picks up the (k, k) element, which is var(β̃_k|X), where β̃_k is the k-th element of β̃. Thus, the matrix inequality above implies

    var(β̃_k|X) ≥ var(β̂_k|X)   (k = 1, 2, ..., K)    (14)

That is, for any regression coefficient, the variance of the OLS estimator is no larger than that of any other linear unbiased estimator. (14), along with the unbiasedness of β̂, implies

    var(β̃) ≥ var(β̂)   (where β̃ is any linear unbiased estimator)

The Gauss–Markov theorem says that the OLSE is efficient in the sense that its conditional covariance matrix cov(β̂|X) is smallest among linear unbiased estimators. For this reason, the OLSE is called the Best Linear Unbiased Estimator (BLUE).

Proof. Since β̃ is linear in y, it can be written as β̃ = Cy for some matrix C, which is possibly a function of X. Let D ≡ C − A, or C = D + A, where A = (X'X)⁻¹X'. Then

    β̃ = (D + A)y = Dy + Ay = D(Xβ + u) + β̂   (since y = Xβ + u and β̂ = Ay)
      = DXβ + Du + β̂

Taking conditional expectations on both sides, we obtain

    E(β̃|X) = E(DXβ + Du + β̂|X)
            = DXβ + D E(u|X) + β   (since β̂ is unbiased)
            = DXβ + β              (since E(u|X) = 0)

β̃ is unbiased if and only if DXβ = 0. For this to be true for any given β, it is necessary that DX = 0. So β̃ = Du + β̂, and

    β̃ − β = Du + (β̂ − β) = (D + A)u   (by (8))

Now,

    cov(β̃|X) = cov(β̃ − β|X)
              = cov((D + A)u|X)
              = (D + A) cov(u|X) (D + A)'   (since D and A are functions of X)
              = σ²(D + A)(D + A)'           (since cov(u|X) = σ²I_N)
              = σ²(DD' + DA' + AD' + AA')
              = σ²(DD' + (X'X)⁻¹)           (since DA' = DX(X'X)⁻¹ = 0 because DX = 0, and AA' = (X'X)⁻¹)

Thus,

    cov(β̃|X) = σ²[DD' + (X'X)⁻¹] ≥ σ²(X'X)⁻¹ = cov(β̂|X)   (since DD' is positive semi-definite)

(i) BLUE of linear parametric functions

The unique BLUE of the linearly estimable linear parametric functions Rβ is given by Rβ̂.

Proof. Let β̃ be as defined above. Then

    cov(Rβ̃) − cov(Rβ̂) = σ²R[DD' + (X'X)⁻¹]R' − σ²R(X'X)⁻¹R'
                        = σ²R[DD' + (X'X)⁻¹ − (X'X)⁻¹]R'
                        = σ²R DD' R' ≥ 0   (since DD' is positive semi-definite)
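To see the Gauss–Markov inequality in numbers, the hypothetical sketch below (simulated design, not from the notes) compares the exact conditional covariance σ²(X'X)⁻¹ of the OLSE with that of another linear unbiased estimator: the OLS fit computed from only the first half of the sample, written as Cy with zero weights on the remaining observations, so that CX = I and unbiasedness holds. The difference of the covariance matrices is positive semi-definite, and each coefficient variance is at least as large, as in (14).

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 80, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
sigma2 = 1.0

# Competing linear unbiased estimator: OLS on the first half of the sample only
X1 = X[: N // 2]
C = np.hstack([np.linalg.inv(X1.T @ X1) @ X1.T, np.zeros((K, N - N // 2))])
print(np.allclose(C @ X, np.eye(K)))                # CX = I, so Cy is unbiased for beta

cov_ols = sigma2 * np.linalg.inv(X.T @ X)           # sigma^2 (X'X)^(-1)
cov_alt = sigma2 * C @ C.T                          # sigma^2 C C'
diff = cov_alt - cov_ols
print(np.linalg.eigvalsh(diff).min() >= -1e-10)     # difference is positive semi-definite
print(np.diag(cov_alt) >= np.diag(cov_ols))         # element-wise variance comparison, as in (14)
```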

(j) Conditional covariance of û

cov(û|X) = σ²M, which is singular. Thus, û_t and û_s are correlated (cov(û_t, û_s|X) ≠ 0) even if we assume that the true error terms are uncorrelated.

Proof.

    cov(û|X) = cov(Mu|X)       (since û = Mu)
             = M cov(u|X) M'
             = σ²MM'            (since cov(u|X) = σ²I_N)
             = σ²M              (since M' = M and MM = M)

(k) Estimator of σ²

Let the LSE of σ² be

    σ̂² = SSR/(N − K) = û'û/(N − K)    (15)

The intuitive reason to divide the SSR by N − K rather than by N is that K parameters (β) have to be estimated before obtaining the residual vector û used to calculate σ̂². More specifically, û has to satisfy the K normal equations (4), which limits the variability of the residual. σ̂² is an unbiased estimator of σ².

Proof. Since σ̂² = û'û/(N − K), the proof amounts to showing that E(û'û|X) = (N − K)σ².

    E(û'û|X) = E[(Mu)'(Mu)|X]                 (since û = Mu)
             = E[u'Mu|X]                       (since M' = M and MM = M)
             = E[tr(u'Mu)|X]                   (since u'Mu is a scalar)
             = E[tr(Muu')|X]                   (since tr(ABC) = tr(CAB))
             = tr[E(Muu'|X)]                   (since trace is a linear operator)
             = tr[M E(uu'|X)]                  (since M is a function of X)
             = σ² tr(M)                        (since E(uu'|X) = σ²I_N)
             = σ² tr(I_N − X(X'X)⁻¹X')         (since M = I_N − X(X'X)⁻¹X')
             = σ² [tr(I_N) − tr(X(X'X)⁻¹X')]   (since tr(A + B) = tr(A) + tr(B))
             = σ² [tr(I_N) − tr((X'X)⁻¹X'X)]   (since tr(ABC) = tr(CAB))
             = σ² [tr(I_N) − tr(I_K)]          (since X'X is K × K)
             = σ²(N − K)

Thus, the unbiased estimator of cov(β̂|X) is V̂ ≡ σ̂²(X'X)⁻¹, and the unbiased estimator of cov(Rβ̂|X) is RV̂R' = σ̂²R(X'X)⁻¹R'. The square root of σ̂², namely σ̂, is called the standard error of the regression (SER) or standard error of the equation (SEE). It is an estimate of the standard deviation of the error term.
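In practice σ² is replaced by σ̂² = û'û/(N − K), which also delivers the estimated covariance matrix V̂ and the coefficient standard errors. A hypothetical sketch (simulated data, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat

sigma2_hat = (u_hat @ u_hat) / (N - K)   # unbiased estimator of sigma^2, equation (15)
V_hat = sigma2_hat * XtX_inv             # estimated cov(beta_hat | X)
se = np.sqrt(np.diag(V_hat))             # standard errors of each coefficient
ser = np.sqrt(sigma2_hat)                # standard error of the regression (SER)
print(beta_hat, se, ser)
```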

Assumption 4: (u|X) is distributed as a multivariate normal N(0, σ²I)

Definition. An n-dimensional random vector y is distributed as a multivariate normal, y ~ N(µ, Σ), with mean vector µ and nonsingular (positive definite) variance-covariance matrix Σ, if the joint density function is given by

    f(y) = (2π)^{−n/2} |Σ|^{−1/2} exp{ −(1/2)(y − µ)'Σ⁻¹(y − µ) }

Theorem 1. Let y be an n-dimensional normal variate: y ~ N(µ, Σ). Then

(a) (Ay − b) ~ N(Aµ − b, AΣA'), where A is an m × n matrix of constants with rank(A) = m ≤ n, and b is an m-dimensional vector of constants.

(b) Let P'P = Σ⁻¹. Such a nonsingular matrix P exists for a positive definite matrix Σ. Then P(y − µ) is distributed as a standard multivariate normal: N(0, I).

Now, under Assumption 4, we can specify the distributions of β̂ and σ̂². We need to know the distribution of u because it drives the statistical properties of the estimators. When we know the distribution of u, we know the distributions of β̂ and σ̂². We will then be able to formulate statistical tests, or inferences, about hypotheses. Assumption 4 implies that the distribution of y conditional on X is a multivariate normal N(Xβ, σ²I). This is because a linear transformation of a multivariate normal random vector is still a multivariate normal random vector, by Theorem 1.

(l) Distribution of β̂ and û

    β̂ ~ N(β, V)   and   û ~ N(0, σ²M)

We have already derived the mean and covariance matrix of both random vectors. Since β̂ is a linear function of y, i.e., β̂ = (X'X)⁻¹X'y, and û is a linear function of u, i.e., û = Mu, the results of Theorem 1 apply, so both random vectors are multivariate normal.

(m) Distribution of σ̂²

Theorem 2. Let y be an n-dimensional normal variate: y ~ N(µ, Σ). Then

(a) (y − µ)'Σ⁻¹(y − µ) is distributed as χ²(n), a central chi-square random variable with n degrees of freedom. Note, from (b) in the previous theorem, z = P(y − µ) ~ N(0, I); that is, the z_i are i.i.d. standard normal random variables. The quadratic form can be written as

    (y − µ)'Σ⁻¹(y − µ) = (y − µ)'P'P(y − µ) = z'z = Σ_{i=1}^{n} z_i²

which is a sum of squared independent standard normal random variables. Thus, we can define a χ² random variable with n degrees of freedom as the sum of n squared independent standard normal random variables.

(b) y'Σ⁻¹y is distributed as χ²(n; δ), a noncentral chi-square distribution with n degrees of freedom and noncentrality parameter δ, where δ² = µ'Σ⁻¹µ.

(c) (y − b)'Σ⁻¹(y − b) is distributed as χ²(n; δ), where δ² = (µ − b)'Σ⁻¹(µ − b).

(d) If Σ = σ²I, then y'Ay/σ² is distributed as χ²(m; δ) if and only if A is an idempotent matrix of rank m, where the noncentrality parameter satisfies δ² = µ'Aµ/σ².

(e) Let y be an n-dimensional (not necessarily normal) random vector with E(y) = µ and cov(y) = Σ. Then E(y'Ay) = tr(AΣ) + µ'Aµ.

Let

    W ≡ (N − K)σ̂²/σ² = û'û/σ² = u'Mu/σ²

Since u ~ N(0, σ²I), M is idempotent, and rank(M) = N − K, then by part (d) of Theorem 2,

    W = u'Mu/σ² ~ χ²(N − K)

(n) β̂ is independent of σ̂²

Theorem 3. Let y ~ N(µ, Σ). Then

(a) A linear function By and a quadratic function y'Ay are independent if BΣA = 0, where A is a symmetric matrix.

(b) Two quadratic functions y'Ay and y'By are independent if AΣB = 0.

Example. β̂ = β + (X'X)⁻¹X'u and β̂ − β = (X'X)⁻¹X'u; σ̂² = (N − K)⁻¹û'û = (N − K)⁻¹u'Mu. Let B = (X'X)⁻¹X', A = M, Σ = σ²I. Then

    BΣA = σ²(X'X)⁻¹X'M = σ²(X'X)⁻¹(MX)' = 0

By part (a) of Theorem 3, β̂ is independent of σ̂².

Distributions Used in Statistical Inference

Theorem 4. Let Y ~ N(µ, 1), Z ~ χ²(n; δ), and W ~ χ²(m), all mutually independent. Then

(a) Y/√(W/m) ~ t(m; µ), a noncentral t distribution

(b) (Z/n)/(W/m) ~ F(n, m; δ), a noncentral F distribution

Let R be a J × K matrix of known constants with rank J ≤ K, and let q be a J × 1 vector of known constants. Then the results in (l)-(n) imply:

(o) Rβ̂ ~ N(Rβ, RVR')

(p) Z ≡ (Rβ̂ − Rβ)'(RVR')⁻¹(Rβ̂ − Rβ) ~ χ²(J)

(q) Z ≡ (Rβ̂ − q)'(RVR')⁻¹(Rβ̂ − q) ~ χ²(J; δ)

(r) Central F distribution

Deriving the F test:

    F ≡ (Z/J)/(W/(N − K))
      = [(Rβ̂ − Rβ)'(RVR')⁻¹(Rβ̂ − Rβ)/J] / [((N − K)σ̂²/σ²)/(N − K)]
      = σ²(Rβ̂ − Rβ)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − Rβ) / (Jσ²σ̂²)

so

    F = (1/J)(Rβ̂ − Rβ)'(RV̂R')⁻¹(Rβ̂ − Rβ) ~ F(J, N − K)    (16)

(s) Noncentral F distribution

    F ≡ (Z/J)/(W/(N − K)) = (1/J)(Rβ̂ − q)'(RV̂R')⁻¹(Rβ̂ − q) ~ F(J, N − K; δ)    (17)
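Result (17) becomes a computable test statistic once σ² is replaced by σ̂², and under H_0: Rβ = q it has the central F(J, N − K) distribution of (16). The hypothetical sketch below (simulated data and illustrative restrictions, not from the notes) computes the statistic and its p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
N, K = 120, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 2.0, 2.0]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat
sigma2_hat = u_hat @ u_hat / (N - K)
V_hat = sigma2_hat * XtX_inv

# Hypothetical restrictions (J = 2): beta_2 - beta_3 = 0 and beta_1 = 1
R = np.array([[0.0, 1.0, -1.0],
              [1.0, 0.0,  0.0]])
q = np.array([0.0, 1.0])
J = R.shape[0]

diff = R @ beta_hat - q
F = diff @ np.linalg.inv(R @ V_hat @ R.T) @ diff / J   # statistic from (16)/(17) with sigma^2 replaced by its estimate
p_value = stats.f.sf(F, J, N - K)                      # central F(J, N - K) under H0
print(F, p_value)
```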

(t) Central and noncentral t distributions

When the number of restrictions is 1, we need to consider the following t-statistic. Let Z_1 ~ N(µ, 1) and Z_2 ~ χ²(m). If Z_1 and Z_2 are independent, then

    t = Z_1/√(Z_2/m) ~ t(m; µ)

where µ is the noncentrality parameter, which can be positive, zero, or negative. If µ = 0, it is a central t distribution.

If J = 1 and Rβ̂ is a scalar, its true variance and estimated variance are also scalars:

    t = [(Rβ̂ − Rβ)/sd(Rβ̂)] / √(W/(N − K)) = [(Rβ̂ − Rβ)/sd(Rβ̂)] / √(σ̂²/σ²) = (Rβ̂ − Rβ)/est.sd(Rβ̂) ~ t(N − K)    (18)

    t = (Rβ̂ − q)/est.sd(Rβ̂) ~ t(N − K; Rβ − q)    (19)

Remark: It is easy to verify the following relationship between the t distribution and the F distribution: t(N − K)² ~ F(1, N − K).

Example. Consider a row vector R = (0, 0, ..., 0, 1, 0, ..., 0) which has 1 in the k-th position. Then Rβ̂ = β̂_k, and RV̂R' = v̂_kk is the k-th diagonal element of V̂, which is the estimate of the variance of β̂_k. Therefore, we can write

    t = (β̂_k − β_k)/√v̂_kk ~ t(N − K)
    t = (β̂_k − q)/√v̂_kk ~ t(N − K; β_k − q)

For another example, consider a row vector R = (a, b, 0, ..., 0), where a and b are constants. Then Rβ̂ = aβ̂_1 + bβ̂_2, and RV̂R' = a²v̂_11 + b²v̂_22 + 2ab v̂_12, which is the estimated variance of Rβ̂, and hence

    t = [(aβ̂_1 + bβ̂_2) − (aβ_1 + bβ_2)] / est.sd(aβ̂_1 + bβ̂_2) ~ t(N − K)
    t = [(aβ̂_1 + bβ̂_2) − q] / est.sd(aβ̂_1 + bβ̂_2) ~ t(N − K; aβ_1 + bβ_2 − q)
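For a single coefficient, the t form in (18)-(19) is the familiar coefficient-over-standard-error ratio. A hypothetical sketch (simulated data, not from the notes), testing H_0: β_k = 0:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
N, K = 80, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([0.5, 1.0, 0.0]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat
V_hat = (u_hat @ u_hat / (N - K)) * XtX_inv

k = 2                                          # test H0: beta_k = 0 for the third coefficient
t_stat = beta_hat[k] / np.sqrt(V_hat[k, k])    # (beta_hat_k - q) / est.sd(beta_hat_k) with q = 0
p_value = 2 * stats.t.sf(abs(t_stat), N - K)   # two-sided p-value from t(N - K)
print(t_stat, p_value, t_stat**2)              # t^2 equals the F(1, N - K) statistic for the same restriction
```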

The Classical Regression Model for Random Samples

The sample (y, X) is a random sample if {y_t, X_t} is i.i.d. (independent and identically distributed) across observations. Since u_t is a function of (y_t, X_t) by Assumption 1 and since (y_t, X_t) is independent of (y_s, X_s) for t ≠ s, (u_t, X_t) is independent of X_s for t ≠ s. Thus,

    E(u_t|X) = E(u_t|X_t)
    E(u_t²|X) = E(u_t²|X_t)
    E(u_t u_s|X) = E(u_t|X_t)E(u_s|X_s)   (for t ≠ s)    (20)

Therefore, Assumptions (A2) and (A3) reduce to

    (A2'): E(u_t|X_t) = 0          (t = 1, 2, ..., N)    (21)
    (A3'): E(u_t²|X_t) = σ² > 0    (t = 1, 2, ..., N)    (22)

The implication of the identical distribution aspect of a random sample is that the joint distribution of (u_t, X_t) does not depend on t. So the unconditional second moment E(u_t²) is constant across t (this is referred to as unconditional homoskedasticity), and the functional form of the conditional second moment E(u_t²|X_t) is the same across t. However, (A3'), that the value of the conditional second moment is the same across t, does not follow. Therefore, (A3') remains restrictive for the case of a random sample; without it, the conditional second moment E(u_t²|X_t) can differ across t through its possible dependence on X_t. To emphasize the distinction, the restrictions on the conditional second moments are referred to as conditional homoskedasticity.

Measurement of Goodness of Fit

How well do the regressors explain the dependent variable? How do we measure the goodness of the regression? A measure of the goodness of fit is the squared correlation coefficient between the dependent variable y_t and the least squares predictor ŷ_t = X_t β̂:

    R² ≡ [corr(y_t, ŷ_t)]² = [cov(y_t, ŷ_t)]² / [var(y_t) var(ŷ_t)]
       = [(y − ȳ)'(ŷ − ŷ̄)/N]² / {[(y − ȳ)'(y − ȳ)/N][(ŷ − ŷ̄)'(ŷ − ŷ̄)/N]}    (23)

where ȳ and ŷ̄ are the vectors of sample means of y_t and ŷ_t, respectively. Since R² is a squared correlation coefficient, it takes a value between 0 and 1 (0 ≤ R² ≤ 1), and its value is independent of the measurement unit. A high R² is interpreted as a good fit: the regressors explain the variation of the dependent variable well. The expression can be simplified further by using the following results.

Lemma 1. When a constant is one of the regressors,

(a) ȳ = ŷ̄
(b) û'(ŷ − ȳ) = 0
(c) (y − ȳ)'(ŷ − ȳ) = (ŷ − ȳ)'(ŷ − ȳ)
(d) (y − ȳ)'(y − ȳ) = (ŷ − ȳ)'(ŷ − ȳ) + û'û

Proof. Let a constant be one of the regressors. Then

(a) ȳ = X̄β̂ = N⁻¹ Σ_{t=1}^{N} X_t β̂ = N⁻¹ Σ_{t=1}^{N} ŷ_t = ŷ̄

(b) û'(ŷ − ȳ) = û'ŷ − û'ȳ = (Mu)'(Xβ̂) − ȳ Σ_{t=1}^{N} û_t = 0 − ȳ·0 = 0

(c) (y − ȳ)'(ŷ − ȳ) = ([ŷ − ȳ] + û)'(ŷ − ȳ) = (ŷ − ȳ)'(ŷ − ȳ) + û'(ŷ − ȳ) = (ŷ − ȳ)'(ŷ − ȳ)

(d) (y − ȳ)'(y − ȳ) = ([ŷ − ȳ] + û)'([ŷ − ȳ] + û) = (ŷ − ȳ)'(ŷ − ȳ) + û'û

Now, we substitute these results into the equation for R²:

    R² = [(y − ȳ)'(ŷ − ŷ̄)/N]² / {[(y − ȳ)'(y − ȳ)/N][(ŷ − ŷ̄)'(ŷ − ŷ̄)/N]}
       = [(y − ȳ)'(ŷ − ȳ)]² / {[(y − ȳ)'(y − ȳ)][(ŷ − ȳ)'(ŷ − ȳ)]}    (since ȳ = ŷ̄)
       = [(ŷ − ȳ)'(ŷ − ȳ)]² / {[(y − ȳ)'(y − ȳ)][(ŷ − ȳ)'(ŷ − ȳ)]}    (since (y − ȳ)'(ŷ − ȳ) = (ŷ − ȳ)'(ŷ − ȳ))
       = (ŷ − ȳ)'(ŷ − ȳ) / (y − ȳ)'(y − ȳ)
       = [(y − ȳ)'(y − ȳ) − û'û] / (y − ȳ)'(y − ȳ)                     (since (ŷ − ȳ)'(ŷ − ȳ) = (y − ȳ)'(y − ȳ) − û'û)
       = 1 − û'û / (y − ȳ)'(y − ȳ)    (24)
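The two expressions for R² in (23) and (24) agree when a constant is included, which the following hypothetical sketch (simulated data, not from the notes) verifies; it also computes the adjusted R̄² defined in (26) below.

```python
import numpy as np

rng = np.random.default_rng(7)
N, K = 70, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
u_hat = y - y_hat

tss = np.sum((y - y.mean()) ** 2)                    # total sum of squares
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2           # squared correlation, expression (23)
r2_ssr = 1.0 - (u_hat @ u_hat) / tss                 # expression (24)
r2_adj = 1.0 - (N - 1) / (N - K) * (1.0 - r2_ssr)    # adjusted R-squared, expression (26) below
print(np.isclose(r2_corr, r2_ssr), r2_ssr, r2_adj)
```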

Equation (24) is the expression more commonly used to define R², which is called the coefficient of (multiple) determination. The denominator term is called the total sum of squares (TSS) and is a measure of the variation of the dependent variable. The numerator in the first expression, (ŷ − ȳ)'(ŷ − ȳ), is called the explained sum of squares or regression sum of squares (RSS), and the SSR, û'û, is also called the unexplained sum of squares or error sum of squares (ESS):

    R² = RSS/TSS = 1 − ESS/TSS    (25)

If the estimation equation does not include an intercept term and you compute R² by the second expression, then the value of R² can be negative. Without an intercept, the OLS residuals no longer sum to zero, so most of the properties in the Lemma above fail to hold. We can see this from the following derivation:

    ESS = û'û = (y − ŷ)'(y − ŷ) = [(y − ȳ) + (ȳ − ŷ)]'[(y − ȳ) + (ȳ − ŷ)]
        = (y − ȳ)'(y − ȳ) + 2(y − ȳ)'(ȳ − ŷ) + (ȳ − ŷ)'(ȳ − ŷ)

This implies that

    R² = 1 − ESS/TSS = −[2(y − ȳ)'(ȳ − ŷ) + (ȳ − ŷ)'(ȳ − ŷ)] / (y − ȳ)'(y − ȳ)

which need not be nonnegative, since the cross-product term does not vanish when the residuals do not sum to zero.

Another serious drawback of using R² as a measure of goodness of fit is that it is a nondecreasing, and typically an increasing, function of the number of regressors. To explain, TSS = RSS + ESS, so minimizing the SSR, or ESS, is equivalent to maximizing R². When an additional variable is included, R² improves if the coefficient on the variable is nonzero, because SSR_r ≥ SSR_u. Thus, one can increase the value of R² simply by throwing more variables into the regression equation, even if they do not belong in the equation in theory. To avoid this problem in assessing the goodness of the regression equation, the adjusted R² (adjusted for the degrees of freedom) is often used:

    R̄² = 1 − [û'û/(N − K)] / [(y − ȳ)'(y − ȳ)/(N − 1)] = 1 − [(N − 1)/(N − K)](1 − R²);   −(K − 1)/(N − K) ≤ R̄² ≤ 1    (26)

The Best Linear Unbiased Predictor

Consider a linear regression model y_1 = X_1β + u_1 and its OLS estimator β̂, which is the BLUE under Assumptions (A1)-(A3). Suppose we wish to predict the value of the dependent variable at new values of the regressors, X_2, when the regression relationship has not changed, i.e., y_2 = X_2β + u_2, and u = (u_1', u_2')', where u_1 : N_1 × 1 and u_2 : N_2 × 1, satisfies Assumptions (A1)-(A3):

    E(uu'|X) = [ E(u_1u_1'|X)   E(u_1u_2'|X) ]   =   [ σ²I_{N_1}      0      ]   =   σ²I_{N_1+N_2}
               [ E(u_2u_1'|X)   E(u_2u_2'|X) ]       [     0      σ²I_{N_2} ]

Note that E(u_1u_2'|X) = E(u_2u_1'|X)' = 0.

To elaborate, we have data on the new values of the regressors, X_2, but we do not have data on y_2. However, we still wish to predict the values of y_2. To do this, we first obtain β̂ from the linear regression model y_1 = X_1β + u_1. Then we use this estimate β̂ and the new values of the regressors, X_2, to predict y_2, i.e., ŷ_2 = X_2β̂.

Theorem 5. The best linear unbiased predictor (BLUP) of y_2 for a given X_2 is given by ŷ_2 = X_2β̂.

Proof.

(i) Linearity. ŷ_2 is a linear function of y_1 because β̂ is a linear function of y_1.

(ii) Unbiasedness. The prediction error û_2 = y_2 − ŷ_2 has zero mean because

    E(û_2|X) = E[X_2(β − β̂) + u_2|X] = X_2(β − β) = 0

(iii) Best. The prediction error has the smallest covariance matrix. To show this,

    cov(û_2|X) = cov(y_2 − ŷ_2|X) = cov(X_2(β − β̂) + u_2|X)
               = σ²I_{N_2} + X_2 cov(β̂|X) X_2'    (since u_2 is uncorrelated with X_2(β − β̂))
               = σ²I_{N_2} + σ²X_2(X_1'X_1)⁻¹X_2'

Let another linear predictor be ỹ_2 = Ay_1, with prediction error ũ_2 = y_2 − ỹ_2. We need to choose A such that the predictor is unbiased, E(ũ_2) = 0, and then we need to show that cov(ũ_2) − cov(û_2) is positive semi-definite.

    E(ũ_2|X) = E(y_2 − ỹ_2|X) = E(X_2β + u_2 − Ay_1|X) = E[X_2β + u_2 − A(X_1β + u_1)|X]
             = E(X_2β + u_2 − AX_1β − Au_1|X) = 0   if X_2 = AX_1

    cov(ũ_2|X) = cov(u_2 − Au_1|X) = σ²I_{N_2} + σ²AA'

    cov(ũ_2|X) − cov(û_2|X) = σ²I_{N_2} + σ²AA' − [σ²I_{N_2} + σ²X_2(X_1'X_1)⁻¹X_2']
                            = σ²[AA' − X_2(X_1'X_1)⁻¹X_2']
                            = σ²[AA' − AX_1(X_1'X_1)⁻¹X_1'A']    (since X_2 = AX_1)
                            = σ²A[I − X_1(X_1'X_1)⁻¹X_1']A'
                            = σ²AM_1A' = σ²(AM_1)(AM_1)' ≥ 0

Statistical Inference

A major part of statistical inference is the test of hypothesis. A test of hypothesis consists of four steps:

1. specify the null hypothesis (H_0) and the alternative hypothesis (H_1)
2. choose a test statistic
3. choose a rejection region for a given level of significance
4. conclude

Each hypothesis specifies a parameter space, Θ_0 and Θ_1, respectively. If Θ_0 is a subset of Θ_1, then the hypothesis is called nested; otherwise, it is called nonnested. A nested null (or maintained) hypothesis H_0: h(θ) = 0 against the alternative hypothesis H_1: h(θ) ≠ 0 restricts the parameter space. To test H_0 against H_1, we need a test statistic that is computable from the sample (i.e., it must not involve any unknown parameters) and a decision rule of when to reject or not reject H_0.

The decision rule is expressed in terms of the rejection region (or critical region), such that H_0 is rejected if the test statistic is in the critical region. Since the test statistic is a random variable, there is a positive probability that we will reject a true H_0 for any reasonable choice of the critical region. Rejection of a true H_0 is called a Type I error, and the probability of a Type I error is called the size of a test, or the significance level of a test, commonly denoted by α. The conventional choice of the significance level is 0.05 or 0.1. We may also fail to reject a false H_0, which is called a Type II error. The probability of rejecting a false H_0 is called the power of a test, which depends on the alternative hypothesis.

                      H_0 is correct        H_0 is false
    reject H_0        Type I error          correct decision
    accept H_0        correct decision      Type II error

    significance level: α = Prob(reject a true H_0) = Prob(Type I error)
    power of test: π = Prob(reject a false H_0) = 1 − Prob(not reject a false H_0) = 1 − Prob(Type II error)

Since the significance level is the probability that the test statistic is in the rejection region when the null hypothesis is correct, we need to know the distribution function of the test statistic under H_0. Similarly, the power of the test requires knowledge of the distribution function of the test statistic under H_1. Since the distribution function of the test statistic depends on the true value of the parameters, there are many values of the power when the alternative hypothesis is a composite hypothesis. Since the power of the test requires knowledge of the true population parameter β, we are usually unable to compute it.

An ideal test (meaning an ideal choice of test statistic and critical region) is one that has zero probability of Type I and Type II errors. But such a test does not exist. Typically, choosing a smaller size of a test decreases the power of the test. A test is called the most powerful test if it has greater power than any other test of the same size, and the uniformly most powerful test if it has greater power than any other test of the same size for all possible values of the parameters. A test is called an unbiased test if its power is at least as large as its size. A test is called a consistent test if its power converges to 1 as the sample size grows to infinity.

Finite Sample Tests of Nested Linear Hypotheses in a SLM

Linear regression model: y = Xβ + u,  u|X ~ N(0, σ²I)
Number of observations: N
Unrestricted OLS estimator: min_β SSR_u = (y − Xβ)'(y − Xβ)

Properties of the unrestricted OLSE:

    β̂ = (X'X)⁻¹X'y ~ N(β, V),   V = σ²(X'X)⁻¹
    σ̂² = SSR_u/(N − K),   W_u ≡ (N − K)σ̂²/σ² ~ χ²(N − K)
    β̂ is independent of σ̂²
    Rβ̂ ~ N(Rβ, S),   S ≡ RVR' = σ²R(X'X)⁻¹R'
    Z ≡ (Rβ̂ − q)'S⁻¹(Rβ̂ − q) ~ χ²(J; δ)
    F ≡ (Z/J)/(W_u/(N − K)) = (1/J)(Rβ̂ − q)'Ŝ⁻¹(Rβ̂ − q) ~ F(J, N − K; δ)
    δ = [(Rβ − q)'S⁻¹(Rβ − q)]^{1/2}

where Ŝ ≡ RV̂R' = σ̂²R(X'X)⁻¹R'.

Restricted OLS estimator: min_β SSR_r = (y − Xβ)'(y − Xβ) subject to Rβ = q, where R (J × K) and q (J × 1) are known matrices and rank(R) = J < K.

Properties of the restricted OLSE:

    β̃ = β̂ − VR'S⁻¹(Rβ̂ − q) ~ N(β − VR'S⁻¹(Rβ − q), V − VR'S⁻¹RV)
    λ̃ = σ²S⁻¹(Rβ̂ − q) ~ N(σ²S⁻¹(Rβ − q), σ⁴S⁻¹)
    σ̃² = SSR_r/[N − (K − J)] = SSR_r/(N − K + J)
    W_r ≡ SSR_r/σ² ~ χ²(N − K + J; δ)
    β̃, λ̃, σ̃² are mutually independent
    SSR_r = SSR_u + σ²(Rβ̂ − q)'S⁻¹(Rβ̂ − q)

Nested Linear Hypotheses

    H_0: Rβ = q,   H_1: Rβ ≠ q (or: H_0 is false)

where R is a J × K matrix of known constants and q is a J × 1 vector of known constants. The restriction matrix R has full row rank, which is less than K: rank(R) = J < K.

Intuition suggests that we may compare the unrestricted estimate β̂ and the restricted estimate β̃ to judge whether the restriction under H_0 is acceptable. These two estimators will differ with probability 1 due to the random variation of the sample. But what should we expect about the magnitude of the difference: a big difference or a small one? It is easy to imagine that the difference will be relatively small (relative to their variances) if the restriction under H_0 is correct, so that it is not binding in the minimization of the SSR. On the other hand, the difference is expected to be relatively large if the null hypothesis is incorrect and the null restriction is a binding constraint in the optimization. It is then reasonable to consider the difference β̂ − β̃ as a test statistic and construct a rejection region. Of course, we have to decide how to measure the difference, because it is a K-dimensional vector. As we will discuss in more detail later, the Durbin-Wu-Hausman test can be interpreted as a test based on this type of reasoning, though it is used for a different type of hypotheses.

There is no reason why we should limit the test to a comparison of the estimators themselves. It is possible to compare the estimated values of a certain function of the parameters. The four tests presented below compare different functions of the estimators. We will call them the likelihood ratio (LR) type test, the Wald type test, the Lagrange multiplier (LM) type test, and the efficient score type test. The reason for the word "type" is that the ideas of the finite sample tests discussed below are based on the ideas of the classical test procedures, but they are not the true LR, Wald, LM, and score tests, which are asymptotic tests and use the asymptotic χ² distribution of the test statistics.

Likelihood Ratio (LR) Type Test

Since the least squares method minimizes the SSR, we may compare the values of the objective function (SSR). We know that SSR_r ≥ SSR_u. Their difference is expected to be large if the restrictions under H_0 are not true and hence are binding. Thus, it is intuitively reasonable to use the difference SSR_r − SSR_u as the test statistic and to reject the null hypothesis if SSR_r − SSR_u is large.

Once the test statistic is chosen, we need to determine the decision rule of when to reject H_0 in favor of H_1. As alluded to already, we will reject H_0 if SSR_r − SSR_u is sufficiently large. Otherwise, the test concludes that the data do not provide enough evidence to reject H_0.
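The LR-type test below needs SSR_r from the restricted fit. The hypothetical sketch that follows (simulated data and an illustrative restriction, not from the notes) computes β̃ and λ̃ from the closed forms listed above and checks both that Rβ̃ = q holds exactly and that SSR_r = SSR_u + σ²(Rβ̂ − q)'S⁻¹(Rβ̂ − q), noting that σ²S⁻¹ = [R(X'X)⁻¹R']⁻¹ so the identity is free of σ².

```python
import numpy as np

rng = np.random.default_rng(8)
N, K = 90, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
ssr_u = np.sum((y - X @ beta_hat) ** 2)

# Hypothetical restriction H0: beta_2 = beta_3 (J = 1)
R = np.array([[0.0, 1.0, -1.0]])
q = np.array([0.0])

G = R @ XtX_inv @ R.T                           # R (X'X)^(-1) R' = S / sigma^2
lam = np.linalg.solve(G, R @ beta_hat - q)      # lambda_tilde = [R (X'X)^(-1) R']^(-1) (R beta_hat - q)
beta_r = beta_hat - XtX_inv @ R.T @ lam         # restricted estimator beta_tilde
ssr_r = np.sum((y - X @ beta_r) ** 2)

print(np.allclose(R @ beta_r, q))                            # restriction holds exactly
print(np.isclose(ssr_r, ssr_u + (R @ beta_hat - q) @ lam))   # SSR_r = SSR_u + (R beta_hat - q)' G^(-1) (R beta_hat - q)
```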

How big a difference is sufficiently large to reject the null hypothesis? This requires a choice of the critical (rejection) region of the test. Since we reject the null hypothesis when SSR_r − SSR_u is large, the proper critical region is the set of samples R_c ≡ {SSR_r − SSR_u > c}, where c is a pre-chosen value. The choice of the critical value depends on how much Type I error we are willing to tolerate. A larger critical value rejects the null hypothesis less frequently, so we are more likely to commit a Type II error. The choice of the critical value is governed by the choice of significance level (size of the test). That is, for a pre-chosen significance level α, the critical value is determined by

    P(SSR_r − SSR_u > c | H_0) = α

which requires knowledge of the distribution function of the test statistic under the null hypothesis.

Remark. A small α leads to a higher critical value. So we reject H_0 less frequently, we are more likely to commit a Type II error, and the power of the test decreases.

The p-value is defined as

    p = Prob(this sample | H_0 is true)

It tells us how likely it is to obtain the sample we got if the null hypothesis is true. It is the probability of obtaining a test statistic at least as extreme as the one actually observed, assuming that the null hypothesis is true. It can also be interpreted as the chance of a Type I error if we reject H_0. If the p-value is less than α, then we reject H_0, indicating that the observed result would be highly unlikely under the null hypothesis.

We have shown that

    Z ≡ (SSR_r − SSR_u)/σ² = (Rβ̂ − q)'S⁻¹(Rβ̂ − q) ~ χ²(J; δ)

Recall that S = σ²R(X'X)⁻¹R' involves the unknown parameter σ². Hence Z is not a statistic (i.e., it is not computable from the sample). We may simply replace σ² with its estimator σ̂², but then Z no longer has the χ² distribution. To see this, note that replacing σ² with its estimator σ̂² is equivalent to taking the ratio of two independent χ² random variables, Z and W_u:

    F ≡ (Z/J)/(W_u/(N − K)) = (1/J)(Rβ̂ − q)'Ŝ⁻¹(Rβ̂ − q) ~ F(J, N − K; δ)

which can also be expressed as

    F ≡ (Z/J)/(W_u/(N − K)) = [(SSR_r − SSR_u)/J] / [SSR_u/(N − K)] = [(N − K)/J] · (SSR_r − SSR_u)/SSR_u ~ F(J, N − K; δ)

This is a statistic that has a central F distribution under H_0 and a noncentral F distribution under H_1. The null hypothesis is rejected if F is greater than the critical value F_α of significance level α: P(F > F_α | H_0) = α. To compute this test statistic, we need both the unrestricted estimation (for SSR_u) and the restricted estimation (for SSR_r).
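The two expressions for F above, the quadratic-form version and the SSR-difference version, give the same number, which the following hypothetical sketch (continuing the same kind of simulated setup) confirms while also reporting the p-value from the central F(J, N − K) distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
N, K = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 2.0, 2.0]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
ssr_u = np.sum((y - X @ beta_hat) ** 2)

R = np.array([[0.0, 1.0, -1.0]])     # hypothetical H0: beta_2 = beta_3
q = np.array([0.0])
J = R.shape[0]

# Restricted fit
G = R @ XtX_inv @ R.T
lam = np.linalg.solve(G, R @ beta_hat - q)
beta_r = beta_hat - XtX_inv @ R.T @ lam
ssr_r = np.sum((y - X @ beta_r) ** 2)

# LR-type form: based on the change in the objective function
F_lr = (N - K) / J * (ssr_r - ssr_u) / ssr_u

# Quadratic-form version: based on R beta_hat - q and the estimated covariance
sigma2_hat = ssr_u / (N - K)
diff = R @ beta_hat - q
F_quad = diff @ np.linalg.inv(sigma2_hat * G) @ diff / J

print(np.isclose(F_lr, F_quad), stats.f.sf(F_lr, J, N - K))
```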

Wald Type Test

We know that the restricted estimator satisfies the restriction under H_0, i.e., Rβ̃ = q. The Wald type test asks whether the unrestricted estimator also satisfies the restriction equation reasonably well. If the null hypothesis is correct, so that the restricted and unrestricted estimators are close enough, we would expect Rβ̂ to be close to its hypothesized value q. If they are not close enough, we reject the null hypothesis. We may also interpret this test as a comparison of Rβ̂ − q with Rβ̃ − q, where the latter is a zero vector.

Unlike the scalar values of the SSRs in the LR type test, Rβ̂ is in general a J × 1 vector. Therefore, we need to choose a measure of the distance of Rβ̂ from q. One might just consider the squared Euclidean norm, ‖Rβ̂ − q‖² = (Rβ̂ − q)'(Rβ̂ − q). However, this does not take into account the differences in the variances of the elements, so we need to standardize (Rβ̂ − q) in a certain way. Let P'P = S⁻¹. Then P(Rβ̂ − q) ~ N(P(Rβ − q), I). We now take the squared Euclidean norm of this standardized vector:

    Z = ‖P(Rβ̂ − q)‖² = (Rβ̂ − q)'S⁻¹(Rβ̂ − q) ~ χ²(J; δ)

As in the LR type test, we take the ratio of independent χ² random variables to eliminate the unknown σ² in S. The test statistic is then exactly the same as the LR type test statistic:

    F ≡ (1/J)(Rβ̂ − q)'Ŝ⁻¹(Rβ̂ − q) ~ F(J, N − K; δ)

Under the null hypothesis, this F statistic has a central F(J, N − K) distribution. The null hypothesis is rejected if F is greater than the critical value F_α of significance level α: P(F > F_α | H_0) = α. To compute this test statistic, we need only the unrestricted estimators.

Lagrange Multiplier (LM) Type Test

As the name indicates, this test is based on the estimator λ̃ of the Lagrange multiplier λ. If the restriction of H_0 is true, it will not be a binding constraint, and hence the restricted estimator λ̃ (the estimator of the shadow price of the restrictions) will be close to zero. Thus, we reject the null hypothesis if λ̃ is sufficiently different from a zero vector. As in the Wald type test, we need a measure of the distance between λ̃ and a zero vector because λ̃ is in general a vector. Following the same procedure as in the Wald type test, we note

    Z_r ≡ λ̃'[σ⁴S⁻¹]⁻¹λ̃ = σ⁻⁴λ̃'Sλ̃ = σ⁻²λ̃'[R(X'X)⁻¹R']λ̃ ~ χ²(J; δ)

and we take the ratio of independent χ² random variables to eliminate the unknown σ²:

    F_r ≡ (Z_r/J)/(W_r/(N − K + J)) = (1/(Jσ̃²)) λ̃'[R(X'X)⁻¹R']λ̃ ~ F(J, N − K + J)   under H_0

Note that this test statistic requires only the restricted estimators, and its degrees of freedom differ from those of the LR and Wald type tests. Under the null hypothesis, it has a central F distribution because both the numerator and the denominator are distributed as central χ² under H_0. However, under the alternative hypothesis (Rβ ≠ q), both Z_r and W_r are noncentral χ² random variables with the same noncentrality parameter δ. Therefore, F_r has a doubly noncentral F distribution under H_1. This will affect the power of the test. The critical value for the rejection region is still found from the null central F distribution.

We have shown earlier that the LR type test and the Wald type test are identical. To see the relationship of the LM type test with the other tests, note the solution λ̃ = σ²S⁻¹(Rβ̂ − q). Since S⁻¹ has full column rank, λ̃ ≠ 0 is equivalent to Rβ̂ − q ≠ 0, which is what the Wald type test checks. Therefore, both tests are checking the same quantity. This becomes clear if λ̃ = σ²S⁻¹(Rβ̂ − q) is substituted into the Z_r equation: the resulting expression is Z_r = (Rβ̂ − q)'S⁻¹(Rβ̂ − q), which is the same as Z. Thus, the numerator terms of F and F_r are the same; the difference is the denominator. If we use the unrestricted estimator σ̂² instead of the restricted estimator σ̃², that is, if we use W_u/(N − K) instead of W_r/(N − K + J) in the ratio F_r, then we obtain a test statistic that is the same as the Wald type test.
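The LM-type statistic uses only the restricted fit. Continuing the same kind of hypothetical setup (simulated data, illustrative restriction), the sketch below forms F_r from λ̃ and the restricted variance estimate σ̃² = SSR_r/(N − K + J).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
N, K = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 2.0, 2.0]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

R = np.array([[0.0, 1.0, -1.0]])    # hypothetical H0: beta_2 = beta_3
q = np.array([0.0])
J = R.shape[0]

G = R @ XtX_inv @ R.T                           # R (X'X)^(-1) R'
lam = np.linalg.solve(G, R @ beta_hat - q)      # restricted Lagrange multiplier lambda_tilde
beta_r = beta_hat - XtX_inv @ R.T @ lam
ssr_r = np.sum((y - X @ beta_r) ** 2)
sigma2_r = ssr_r / (N - K + J)                  # restricted variance estimate sigma_tilde^2

F_lm = (lam @ G @ lam) / (J * sigma2_r)         # LM-type statistic F_r
print(F_lm, stats.f.sf(F_lm, J, N - K + J))     # p-value from the null central F(J, N - K + J)
```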

Efficient Score Type Test

The unrestricted estimator is the solution of the first-order condition

    ∂SSR(β)/∂β |_{β=β̂} = −2X'y + 2X'Xβ̂ = 0   ⇒   g(β̂) ≡ −X'y + X'Xβ̂ = 0

If the restrictions are correct, then the restricted estimator should satisfy this equation closely, so that

    g(β̃) = −X'y + X'Xβ̃ ≈ 0

The score type test thus asks whether the restricted estimator satisfies the first-order condition of the unrestricted least squares estimation. The test rejects the null hypothesis if g(β̃) is far from a zero vector. This test can also be considered a comparison of the restricted gradient g(β̃) with the unrestricted gradient g(β̂), which is of course zero.

The score type test is equivalent to the LM type test. Consider the first-order condition of the restricted least squares:

    −X'y + X'Xβ̃ + R'λ̃ = 0   ⇒   g(β̃) + R'λ̃ = 0

Since R' has full column rank by our specification of the null hypothesis, λ̃ = 0 if and only if g(β̃) = 0. Therefore, the LM type test of asking whether λ̃ is close to zero is equivalent to asking whether g(β̃) is close to zero. Substituting R'λ̃ = −g(β̃) into Z_r gives

    Z_r = σ⁻²λ̃'[R(X'X)⁻¹R']λ̃ = σ⁻²(R'λ̃)'(X'X)⁻¹(R'λ̃) = σ⁻²g(β̃)'(X'X)⁻¹g(β̃)

and

    F_r = (1/(Jσ̃²)) g(β̃)'(X'X)⁻¹g(β̃) ~ F(J, N − K + J)   under H_0

Case of Single Restriction: t test

If J = 1 and Rβ̂ is a scalar, then under H_0 and under H_1, respectively,

    t = [(Rβ̂ − Rβ)/sd(Rβ̂)] / √(W/(N − K)) = [(Rβ̂ − Rβ)/sd(Rβ̂)] / √(σ̂²/σ²) = (Rβ̂ − Rβ)/est.sd(Rβ̂) ~ t(N − K)

    t = (Rβ̂ − q)/est.sd(Rβ̂) ~ t(N − K; Rβ − q)

H_0 is rejected when |t| > t_α, where the critical value is determined by P(|t| > t_α | H_0) = α.
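For a single restriction, the hypothetical sketch below (simulated data, illustrative restriction, not from the notes) forms the t statistic for H_0: Rβ = q and confirms that its square equals the corresponding F(1, N − K) statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
N, K = 60, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 2.0, 2.0]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat
sigma2_hat = u_hat @ u_hat / (N - K)
V_hat = sigma2_hat * XtX_inv

R = np.array([0.0, 1.0, -1.0])      # hypothetical single restriction H0: beta_2 - beta_3 = 0
q = 0.0

est_sd = np.sqrt(R @ V_hat @ R)     # estimated standard deviation of R beta_hat
t_stat = (R @ beta_hat - q) / est_sd
p_value = 2 * stats.t.sf(abs(t_stat), N - K)
F_stat = (R @ beta_hat - q) ** 2 / (R @ V_hat @ R)
print(t_stat, p_value, np.isclose(t_stat ** 2, F_stat))
```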


More information

Regression and Statistical Inference

Regression and Statistical Inference Regression and Statistical Inference Walid Mnif wmnif@uwo.ca Department of Applied Mathematics The University of Western Ontario, London, Canada 1 Elements of Probability 2 Elements of Probability CDF&PDF

More information

Linear Regression. Junhui Qian. October 27, 2014

Linear Regression. Junhui Qian. October 27, 2014 Linear Regression Junhui Qian October 27, 2014 Outline The Model Estimation Ordinary Least Square Method of Moments Maximum Likelihood Estimation Properties of OLS Estimator Unbiasedness Consistency Efficiency

More information

Ma 3/103: Lecture 25 Linear Regression II: Hypothesis Testing and ANOVA

Ma 3/103: Lecture 25 Linear Regression II: Hypothesis Testing and ANOVA Ma 3/103: Lecture 25 Linear Regression II: Hypothesis Testing and ANOVA March 6, 2017 KC Border Linear Regression II March 6, 2017 1 / 44 1 OLS estimator 2 Restricted regression 3 Errors in variables 4

More information

8. Hypothesis Testing

8. Hypothesis Testing FE661 - Statistical Methods for Financial Engineering 8. Hypothesis Testing Jitkomut Songsiri introduction Wald test likelihood-based tests significance test for linear regression 8-1 Introduction elements

More information

Basic Distributional Assumptions of the Linear Model: 1. The errors are unbiased: E[ε] = The errors are uncorrelated with common variance:

Basic Distributional Assumptions of the Linear Model: 1. The errors are unbiased: E[ε] = The errors are uncorrelated with common variance: 8. PROPERTIES OF LEAST SQUARES ESTIMATES 1 Basic Distributional Assumptions of the Linear Model: 1. The errors are unbiased: E[ε] = 0. 2. The errors are uncorrelated with common variance: These assumptions

More information

Regression #5: Confidence Intervals and Hypothesis Testing (Part 1)

Regression #5: Confidence Intervals and Hypothesis Testing (Part 1) Regression #5: Confidence Intervals and Hypothesis Testing (Part 1) Econ 671 Purdue University Justin L. Tobias (Purdue) Regression #5 1 / 24 Introduction What is a confidence interval? To fix ideas, suppose

More information

Motivation for multiple regression

Motivation for multiple regression Motivation for multiple regression 1. Simple regression puts all factors other than X in u, and treats them as unobserved. Effectively the simple regression does not account for other factors. 2. The slope

More information

3. Linear Regression With a Single Regressor

3. Linear Regression With a Single Regressor 3. Linear Regression With a Single Regressor Econometrics: (I) Application of statistical methods in empirical research Testing economic theory with real-world data (data analysis) 56 Econometrics: (II)

More information

STAT 100C: Linear models

STAT 100C: Linear models STAT 100C: Linear models Arash A. Amini April 27, 2018 1 / 1 Table of Contents 2 / 1 Linear Algebra Review Read 3.1 and 3.2 from text. 1. Fundamental subspace (rank-nullity, etc.) Im(X ) = ker(x T ) R

More information

L2: Two-variable regression model

L2: Two-variable regression model L2: Two-variable regression model Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: September 4, 2014 What we have learned last time...

More information

Linear Regression. y» F; Ey = + x Vary = ¾ 2. ) y = + x + u. Eu = 0 Varu = ¾ 2 Exu = 0:

Linear Regression. y» F; Ey = + x Vary = ¾ 2. ) y = + x + u. Eu = 0 Varu = ¾ 2 Exu = 0: Linear Regression 1 Single Explanatory Variable Assume (y is not necessarily normal) where Examples: y» F; Ey = + x Vary = ¾ 2 ) y = + x + u Eu = 0 Varu = ¾ 2 Exu = 0: 1. School performance as a function

More information

FIRST MIDTERM EXAM ECON 7801 SPRING 2001

FIRST MIDTERM EXAM ECON 7801 SPRING 2001 FIRST MIDTERM EXAM ECON 780 SPRING 200 ECONOMICS DEPARTMENT, UNIVERSITY OF UTAH Problem 2 points Let y be a n-vector (It may be a vector of observations of a random variable y, but it does not matter how

More information

Heteroskedasticity and Autocorrelation

Heteroskedasticity and Autocorrelation Lesson 7 Heteroskedasticity and Autocorrelation Pilar González and Susan Orbe Dpt. Applied Economics III (Econometrics and Statistics) Pilar González and Susan Orbe OCW 2014 Lesson 7. Heteroskedasticity

More information

Practical Econometrics. for. Finance and Economics. (Econometrics 2)

Practical Econometrics. for. Finance and Economics. (Econometrics 2) Practical Econometrics for Finance and Economics (Econometrics 2) Seppo Pynnönen and Bernd Pape Department of Mathematics and Statistics, University of Vaasa 1. Introduction 1.1 Econometrics Econometrics

More information

Chapter 2: simple regression model

Chapter 2: simple regression model Chapter 2: simple regression model Goal: understand how to estimate and more importantly interpret the simple regression Reading: chapter 2 of the textbook Advice: this chapter is foundation of econometrics.

More information

Ch4. Distribution of Quadratic Forms in y

Ch4. Distribution of Quadratic Forms in y ST4233, Linear Models, Semester 1 2008-2009 Ch4. Distribution of Quadratic Forms in y 1 Definition Definition 1.1 If A is a symmetric matrix and y is a vector, the product y Ay = i a ii y 2 i + i j a ij

More information

Xβ is a linear combination of the columns of X: Copyright c 2010 Dan Nettleton (Iowa State University) Statistics / 25 X =

Xβ is a linear combination of the columns of X: Copyright c 2010 Dan Nettleton (Iowa State University) Statistics / 25 X = The Gauss-Markov Linear Model y Xβ + ɛ y is an n random vector of responses X is an n p matrix of constants with columns corresponding to explanatory variables X is sometimes referred to as the design

More information

Multiple Regression Analysis

Multiple Regression Analysis Multiple Regression Analysis y = β 0 + β 1 x 1 + β 2 x 2 +... β k x k + u 2. Inference 0 Assumptions of the Classical Linear Model (CLM)! So far, we know: 1. The mean and variance of the OLS estimators

More information

14 Multiple Linear Regression

14 Multiple Linear Regression B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 14 Multiple Linear Regression 14.1 The multiple linear regression model In simple linear regression, the response variable y is expressed in

More information

Properties of the least squares estimates

Properties of the least squares estimates Properties of the least squares estimates 2019-01-18 Warmup Let a and b be scalar constants, and X be a scalar random variable. Fill in the blanks E ax + b) = Var ax + b) = Goal Recall that the least squares

More information

ECON The Simple Regression Model

ECON The Simple Regression Model ECON 351 - The Simple Regression Model Maggie Jones 1 / 41 The Simple Regression Model Our starting point will be the simple regression model where we look at the relationship between two variables In

More information

Spatial Regression. 9. Specification Tests (1) Luc Anselin. Copyright 2017 by Luc Anselin, All Rights Reserved

Spatial Regression. 9. Specification Tests (1) Luc Anselin.   Copyright 2017 by Luc Anselin, All Rights Reserved Spatial Regression 9. Specification Tests (1) Luc Anselin http://spatial.uchicago.edu 1 basic concepts types of tests Moran s I classic ML-based tests LM tests 2 Basic Concepts 3 The Logic of Specification

More information

Econ 510 B. Brown Spring 2014 Final Exam Answers

Econ 510 B. Brown Spring 2014 Final Exam Answers Econ 510 B. Brown Spring 2014 Final Exam Answers Answer five of the following questions. You must answer question 7. The question are weighted equally. You have 2.5 hours. You may use a calculator. Brevity

More information

Chapter 4: Constrained estimators and tests in the multiple linear regression model (Part III)

Chapter 4: Constrained estimators and tests in the multiple linear regression model (Part III) Chapter 4: Constrained estimators and tests in the multiple linear regression model (Part III) Florian Pelgrin HEC September-December 2010 Florian Pelgrin (HEC) Constrained estimators September-December

More information

Econometrics Summary Algebraic and Statistical Preliminaries

Econometrics Summary Algebraic and Statistical Preliminaries Econometrics Summary Algebraic and Statistical Preliminaries Elasticity: The point elasticity of Y with respect to L is given by α = ( Y/ L)/(Y/L). The arc elasticity is given by ( Y/ L)/(Y/L), when L

More information

General Linear Model: Statistical Inference

General Linear Model: Statistical Inference Chapter 6 General Linear Model: Statistical Inference 6.1 Introduction So far we have discussed formulation of linear models (Chapter 1), estimability of parameters in a linear model (Chapter 4), least

More information

Interpreting Regression Results

Interpreting Regression Results Interpreting Regression Results Carlo Favero Favero () Interpreting Regression Results 1 / 42 Interpreting Regression Results Interpreting regression results is not a simple exercise. We propose to split

More information

Large Sample Properties of Estimators in the Classical Linear Regression Model

Large Sample Properties of Estimators in the Classical Linear Regression Model Large Sample Properties of Estimators in the Classical Linear Regression Model 7 October 004 A. Statement of the classical linear regression model The classical linear regression model can be written in

More information

Summer School in Statistics for Astronomers V June 1 - June 6, Regression. Mosuk Chow Statistics Department Penn State University.

Summer School in Statistics for Astronomers V June 1 - June 6, Regression. Mosuk Chow Statistics Department Penn State University. Summer School in Statistics for Astronomers V June 1 - June 6, 2009 Regression Mosuk Chow Statistics Department Penn State University. Adapted from notes prepared by RL Karandikar Mean and variance Recall

More information

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that Linear Regression For (X, Y ) a pair of random variables with values in R p R we assume that E(Y X) = β 0 + with β R p+1. p X j β j = (1, X T )β j=1 This model of the conditional expectation is linear

More information

Ch 2: Simple Linear Regression

Ch 2: Simple Linear Regression Ch 2: Simple Linear Regression 1. Simple Linear Regression Model A simple regression model with a single regressor x is y = β 0 + β 1 x + ɛ, where we assume that the error ɛ is independent random component

More information

1. The OLS Estimator. 1.1 Population model and notation

1. The OLS Estimator. 1.1 Population model and notation 1. The OLS Estimator OLS stands for Ordinary Least Squares. There are 6 assumptions ordinarily made, and the method of fitting a line through data is by least-squares. OLS is a common estimation methodology

More information

Econometrics of Panel Data

Econometrics of Panel Data Econometrics of Panel Data Jakub Mućk Meeting # 3 Jakub Mućk Econometrics of Panel Data Meeting # 3 1 / 21 Outline 1 Fixed or Random Hausman Test 2 Between Estimator 3 Coefficient of determination (R 2

More information

INTRODUCTORY ECONOMETRICS

INTRODUCTORY ECONOMETRICS INTRODUCTORY ECONOMETRICS Lesson 2b Dr Javier Fernández etpfemaj@ehu.es Dpt. of Econometrics & Statistics UPV EHU c J Fernández (EA3-UPV/EHU), February 21, 2009 Introductory Econometrics - p. 1/192 GLRM:

More information

Econometrics II - EXAM Answer each question in separate sheets in three hours

Econometrics II - EXAM Answer each question in separate sheets in three hours Econometrics II - EXAM Answer each question in separate sheets in three hours. Let u and u be jointly Gaussian and independent of z in all the equations. a Investigate the identification of the following

More information

Lecture 07 Hypothesis Testing with Multivariate Regression

Lecture 07 Hypothesis Testing with Multivariate Regression Lecture 07 Hypothesis Testing with Multivariate Regression 23 September 2015 Taylor B. Arnold Yale Statistics STAT 312/612 Goals for today 1. Review of assumptions and properties of linear model 2. The

More information

Intermediate Econometrics

Intermediate Econometrics Intermediate Econometrics Heteroskedasticity Text: Wooldridge, 8 July 17, 2011 Heteroskedasticity Assumption of homoskedasticity, Var(u i x i1,..., x ik ) = E(u 2 i x i1,..., x ik ) = σ 2. That is, the

More information

3 Multiple Linear Regression

3 Multiple Linear Regression 3 Multiple Linear Regression 3.1 The Model Essentially, all models are wrong, but some are useful. Quote by George E.P. Box. Models are supposed to be exact descriptions of the population, but that is

More information

Answers to Problem Set #4

Answers to Problem Set #4 Answers to Problem Set #4 Problems. Suppose that, from a sample of 63 observations, the least squares estimates and the corresponding estimated variance covariance matrix are given by: bβ bβ 2 bβ 3 = 2

More information

Estimation of the Response Mean. Copyright c 2012 Dan Nettleton (Iowa State University) Statistics / 27

Estimation of the Response Mean. Copyright c 2012 Dan Nettleton (Iowa State University) Statistics / 27 Estimation of the Response Mean Copyright c 202 Dan Nettleton (Iowa State University) Statistics 5 / 27 The Gauss-Markov Linear Model y = Xβ + ɛ y is an n random vector of responses. X is an n p matrix

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Simple linear regression tries to fit a simple line between two variables Y and X. If X is linearly related to Y this explains some of the variability in Y. In most cases, there

More information

Panel Data Models. James L. Powell Department of Economics University of California, Berkeley

Panel Data Models. James L. Powell Department of Economics University of California, Berkeley Panel Data Models James L. Powell Department of Economics University of California, Berkeley Overview Like Zellner s seemingly unrelated regression models, the dependent and explanatory variables for panel

More information

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data July 2012 Bangkok, Thailand Cosimo Beverelli (World Trade Organization) 1 Content a) Classical regression model b)

More information

Instrumental Variables

Instrumental Variables Università di Pavia 2010 Instrumental Variables Eduardo Rossi Exogeneity Exogeneity Assumption: the explanatory variables which form the columns of X are exogenous. It implies that any randomness in the

More information

1 Appendix A: Matrix Algebra

1 Appendix A: Matrix Algebra Appendix A: Matrix Algebra. Definitions Matrix A =[ ]=[A] Symmetric matrix: = for all and Diagonal matrix: 6=0if = but =0if 6= Scalar matrix: the diagonal matrix of = Identity matrix: the scalar matrix

More information

STAT 540: Data Analysis and Regression

STAT 540: Data Analysis and Regression STAT 540: Data Analysis and Regression Wen Zhou http://www.stat.colostate.edu/~riczw/ Email: riczw@stat.colostate.edu Department of Statistics Colorado State University Fall 205 W. Zhou (Colorado State

More information

Econometrics. Week 4. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague

Econometrics. Week 4. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Econometrics Week 4 Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Fall 2012 1 / 23 Recommended Reading For the today Serial correlation and heteroskedasticity in

More information

STAT 135 Lab 13 (Review) Linear Regression, Multivariate Random Variables, Prediction, Logistic Regression and the δ-method.

STAT 135 Lab 13 (Review) Linear Regression, Multivariate Random Variables, Prediction, Logistic Regression and the δ-method. STAT 135 Lab 13 (Review) Linear Regression, Multivariate Random Variables, Prediction, Logistic Regression and the δ-method. Rebecca Barter May 5, 2015 Linear Regression Review Linear Regression Review

More information

Intermediate Econometrics

Intermediate Econometrics Intermediate Econometrics Markus Haas LMU München Summer term 2011 15. Mai 2011 The Simple Linear Regression Model Considering variables x and y in a specific population (e.g., years of education and wage

More information

WLS and BLUE (prelude to BLUP) Prediction

WLS and BLUE (prelude to BLUP) Prediction WLS and BLUE (prelude to BLUP) Prediction Rasmus Waagepetersen Department of Mathematics Aalborg University Denmark April 21, 2018 Suppose that Y has mean X β and known covariance matrix V (but Y need

More information

Ch 3: Multiple Linear Regression

Ch 3: Multiple Linear Regression Ch 3: Multiple Linear Regression 1. Multiple Linear Regression Model Multiple regression model has more than one regressor. For example, we have one response variable and two regressor variables: 1. delivery

More information

MLES & Multivariate Normal Theory

MLES & Multivariate Normal Theory Merlise Clyde September 6, 2016 Outline Expectations of Quadratic Forms Distribution Linear Transformations Distribution of estimates under normality Properties of MLE s Recap Ŷ = ˆµ is an unbiased estimate

More information

LECTURE 5 HYPOTHESIS TESTING

LECTURE 5 HYPOTHESIS TESTING October 25, 2016 LECTURE 5 HYPOTHESIS TESTING Basic concepts In this lecture we continue to discuss the normal classical linear regression defined by Assumptions A1-A5. Let θ Θ R d be a parameter of interest.

More information

Introductory Econometrics

Introductory Econometrics Based on the textbook by Wooldridge: : A Modern Approach Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna December 11, 2012 Outline Heteroskedasticity

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information