Linear Model Under General Variance

We have a sample of T random variables y_1, y_2, …, y_T satisfying the linear model Y = Xβ + e, where Y = (y_1, …, y_T)' is a (T × 1) vector of random variables, X is a (T × K) matrix of explanatory variables, β is a (K × 1) vector of parameters, and e = (e_1, …, e_T)' is a (T × 1) vector of error terms.

Under the classical linear model we have assumed:
Assumption A3: the e_t's are independently distributed.
Assumption A4: V(e) = σ² I_T, implying that V(e_t) = σ² for t = 1, …, T (homoscedasticity).

Let's relax assumptions A3 and A4 and consider the more general case where V(e) = σ²ψ, where σ² is a positive scalar and ψ is a (T × T) symmetric, positive-definite matrix. Under this structure:
- The variance of e is proportional to the matrix ψ.
- The variance of e_t can vary across observations (heteroscedasticity).
- The covariance between e_t and e_t*, t ≠ t*, can be non-zero (e.g., autocorrelation).

A Reformulation of the Standard Model

Consider the model Y = Xβ + e, where V(e) = σ²ψ (model M). The (T × T) matrix ψ being symmetric and positive-definite, it is non-singular and can be written as ψ⁻¹ = P'P,
where P is a (T × T) non-singular matrix. This is called the Cholesky decomposition of ψ⁻¹. It follows that ψ = (P'P)⁻¹ = P⁻¹(P')⁻¹, or P ψ P' = I_T.

Let X* = PX, Y* = PY, and e* = Pe. Premultiplying model M by P gives PY = PXβ + Pe, or Y* = X*β + e* (model M*).

With P being a non-singular matrix, models M and M* are informationally equivalent. While V(e) = σ²ψ does not satisfy A3-A4 when ψ ≠ I_T, we have

V(e*) = V(Pe) = P V(e) P' = P σ²ψ P' = σ² P ψ P' = σ² P P⁻¹(P')⁻¹ P' (since ψ = P⁻¹(P')⁻¹) = σ² I_T,

which satisfies assumptions A3-A4. While model M does not satisfy conditions A3-A4, model M* does. This implies that all the results obtained for the classical linear model apply to model M*.

Estimation of β Under the General Variance Structure

Consider the error sum of squares for model M*:

S = e*'e* = (Pe)'(Pe) = e'P'Pe = e'ψ⁻¹e, since P'P = ψ⁻¹.

The term S = e'ψ⁻¹e is called the weighted error sum of squares, where the weights involve the inverse of the ψ matrix (which is proportional to the variance of e). The value of β that minimizes S is β_g, the weighted least squares or generalized least squares estimator of β. β_g is simply the least squares estimator under model M*:

β_g = (X*'X*)⁻¹X*'Y*, since β_g is the least squares estimator of β in M*,
= [(PX)'PX]⁻¹(PX)'PY
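The transformation from M to M* can be checked numerically. The following is a minimal numpy sketch, assuming a hypothetical AR(1)-style correlation matrix for ψ (an illustrative choice, not from the text); it builds P from the Cholesky factor of ψ⁻¹ and verifies the key identity P ψ P' = I_T.

```python
import numpy as np

T = 5
rho = 0.6

# Hypothetical AR(1)-type correlation matrix for psi (illustrative only)
idx = np.arange(T)
psi = rho ** np.abs(np.subtract.outer(idx, idx))

# psi^{-1} = P'P: np.linalg.cholesky returns lower-triangular L with L L' = psi^{-1},
# so setting P = L' gives P'P = L L' = psi^{-1}
psi_inv = np.linalg.inv(psi)
L = np.linalg.cholesky(psi_inv)
P = L.T

# Key identity: P psi P' = I_T, so e* = P e has variance sigma^2 I_T
check = P @ psi @ P.T
```

Any symmetric positive-definite ψ admits such a factorization, which is what makes the reformulation available in general.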
= [X'P'PX]⁻¹X'P'PY
= [X'ψ⁻¹X]⁻¹X'ψ⁻¹Y, since P'P = ψ⁻¹.

The generalized least squares estimator of β is β_g = [X'ψ⁻¹X]⁻¹X'ψ⁻¹Y.

Properties of β_g

Since β_g is simply the least squares estimator of β in model M*, and model M* satisfies all assumptions of the classical linear model, all of the results obtained for the classical linear model apply to model M*:
- β_g is also the maximum likelihood estimator of β in model M* when e is normally distributed.
- β_g is an unbiased estimator of β: E(β_g) = β.
- V(β_g) = σ²[X*'X*]⁻¹ = σ²[(PX)'(PX)]⁻¹ = σ²[X'P'PX]⁻¹ or, given P'P = ψ⁻¹, V(β_g) = σ²[X'ψ⁻¹X]⁻¹.
- β_g is the best linear unbiased estimator (BLUE) of β, implying that it is efficient in finite samples (its variance is smallest among all linear unbiased estimators).
- In large samples, β_g is a consistent estimator of β, an asymptotically efficient estimator of β, and asymptotically normal, with T^(1/2)(β_g − β) →d N(0, T σ²(X'ψ⁻¹X)⁻¹), or β_g ~ N(β, σ²(X'ψ⁻¹X)⁻¹) approximately as T → ∞.

A Comparison of β_s and β_g

We have β_s = (X'X)⁻¹X'Y as the least squares estimator of β, and β_g = [X'ψ⁻¹X]⁻¹X'ψ⁻¹Y as the generalized least squares estimator of β. In general, β_s ≠ β_g whenever the matrix ψ is not proportional to the identity matrix I_T. Then, if ψ ≠ I_T, on what basis can we choose between these two estimators in the estimation of β in model M?
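The derivation shows that GLS on model M equals ordinary least squares on the transformed model M*. A minimal sketch, assuming a hypothetical diagonal (heteroscedastic) ψ, checks this equivalence numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 50, 3

# Simulated data with a hypothetical known diagonal psi = diag(w) (illustrative)
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
beta = np.array([1.0, 2.0, -0.5])
w = rng.uniform(0.5, 2.0, size=T)
e = rng.normal(size=T) * np.sqrt(w)
Y = X @ beta + e

# Route 1: direct GLS formula b_g = (X' psi^{-1} X)^{-1} X' psi^{-1} Y
psi_inv = np.diag(1.0 / w)
b_g = np.linalg.solve(X.T @ psi_inv @ X, X.T @ psi_inv @ Y)

# Route 2: OLS on the transformed model M*, with P'P = psi^{-1}
P = np.diag(1.0 / np.sqrt(w))
b_star, *_ = np.linalg.lstsq(P @ X, P @ Y, rcond=None)
```

With diagonal ψ, the transformation P simply rescales each observation by the inverse of its error standard deviation, which is why this special case is called weighted least squares.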
Bias of β_s Under the General Variance Structure

We have E(β_s) = E[(X'X)⁻¹X'Y] = (X'X)⁻¹X'E(Y) = (X'X)⁻¹X'E[Xβ + e] = (X'X)⁻¹X'Xβ + (X'X)⁻¹X'E(e) = β, since E(e) = 0. Thus, the least squares estimator β_s is an unbiased estimator of β under model M.

Variance of β_s Under the General Variance Structure

Given E(β_s) = β under the general variance model M:

V(β_s) = E[(β_s − β)(β_s − β)'].
β_s − β = (X'X)⁻¹X'Y − β = (X'X)⁻¹X'(Xβ + e) − β = (X'X)⁻¹X'e.
Thus V(β_s) = E[(X'X)⁻¹X'ee'X(X'X)⁻¹] = (X'X)⁻¹X'E(ee')X(X'X)⁻¹.
Given E(ee') = V(e) = σ²ψ,
V(β_s) = σ²(X'X)⁻¹X'ψX(X'X)⁻¹.

In summary:
- When ψ = I_T, then β_g = β_s = (X'X)⁻¹X'Y, and V(β_g) = V(β_s) = σ²(X'ψ⁻¹X)⁻¹ = σ²(X'X)⁻¹X'ψX(X'X)⁻¹ = σ²(X'X)⁻¹.
- When ψ ≠ I_T under model M, we have β_g = (X'ψ⁻¹X)⁻¹X'ψ⁻¹Y ≠ β_s, and V(β_g) = σ²(X'ψ⁻¹X)⁻¹ ≠ V(β_s).

Efficiency of β_s Under the General Variance Structure

Applying the Gauss-Markov theorem to model M* implies:
- β_g = (X'ψ⁻¹X)⁻¹X'ψ⁻¹Y is the best linear unbiased estimator (BLUE) of β in M.
- β_s = (X'X)⁻¹X'Y is another linear unbiased estimator, with V(β_g) = σ²(X'ψ⁻¹X)⁻¹ ≤ V(β_s).

In general, β_s is an inefficient estimator of β in model M (its variance is large compared to the variance of β_g).
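The Gauss-Markov ranking V(β_g) ≤ V(β_s) means that V(β_s) − V(β_g) is a positive semi-definite matrix. A minimal sketch, again assuming a hypothetical diagonal ψ, computes both variance matrices and checks this ordering:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 40
X = np.column_stack([np.ones(T), rng.normal(size=T)])

# Hypothetical heteroscedastic psi (illustrative), with sigma^2 normalized to 1
w = rng.uniform(0.2, 3.0, size=T)
psi = np.diag(w)
psi_inv = np.diag(1.0 / w)
sigma2 = 1.0

# V(b_s) = sigma^2 (X'X)^{-1} X' psi X (X'X)^{-1}   (sandwich form)
XtX_inv = np.linalg.inv(X.T @ X)
V_bs = sigma2 * XtX_inv @ X.T @ psi @ X @ XtX_inv

# V(b_g) = sigma^2 (X' psi^{-1} X)^{-1}
V_bg = sigma2 * np.linalg.inv(X.T @ psi_inv @ X)

# Gauss-Markov: every eigenvalue of V(b_s) - V(b_g) is non-negative
gap_eigvals = np.linalg.eigvalsh(V_bs - V_bg)
```

Note that the sandwich form of V(β_s) collapses to the familiar σ²(X'X)⁻¹ only when ψ = I_T.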
Consistency of β_s Under the General Variance Structure

Since β_s is an unbiased estimator of β, it is also asymptotically unbiased. Assume that, as T → ∞, (X'X/T) and (X'ψX/T) each converge to a finite, non-singular matrix. This implies that

V(β_s) = σ²(X'X)⁻¹X'ψX(X'X)⁻¹ = (1/T) σ²(X'X/T)⁻¹(X'ψX/T)(X'X/T)⁻¹ → 0 as T → ∞.

Together with asymptotic unbiasedness, this implies that β_s is a consistent estimator of β in model M.

Estimation of σ² Under the General Variance Structure

When ψ ≠ I_T, model M does not satisfy the conditions of the classical linear model. This implies that

σ²_l = (Y − Xβ_s)'(Y − Xβ_s)/T, and
σ²_u = (Y − Xβ_s)'(Y − Xβ_s)/(T − K)

are in general biased and inconsistent estimators of σ².

With models M and M* being informationally equivalent and model M* satisfying the conditions of the classical linear model, we can apply the results obtained for the classical linear model to M*:

σ²_gl = (Y* − X*β_g)'(Y* − X*β_g)/T is a biased but consistent estimator of σ², where β_g = (X*'X*)⁻¹X*'Y* = (X'ψ⁻¹X)⁻¹X'ψ⁻¹Y. This implies

σ²_gl = (PY − PXβ_g)'(PY − PXβ_g)/T = (Y − Xβ_g)'P'P(Y − Xβ_g)/T.

Since P'P = ψ⁻¹, σ²_gl = (Y − Xβ_g)'ψ⁻¹(Y − Xβ_g)/T and is a biased but consistent estimator of σ².

Results of the classical linear model applied to M* give σ²_gu = (Y* − X*β_g)'(Y* − X*β_g)/(T − K) as an unbiased and consistent estimator of σ², with

σ²_gu = (PY − PXβ_g)'(PY − PXβ_g)/(T − K) = (Y − Xβ_g)'P'P(Y − Xβ_g)/(T − K).
Since P'P = ψ⁻¹,

σ²_gu = (Y − Xβ_g)'ψ⁻¹(Y − Xβ_g)/(T − K)

is an unbiased and consistent estimator of σ².

When ψ ≠ I_T, in general σ²_gl ≠ σ²_l and σ²_gu ≠ σ²_u, with only the generalized least squares estimators being consistent estimators of σ².

Prediction Under the General Variance Structure

Let the sample information based on T observations be Y = Xβ + e, where Y is (T × 1), X is (T × K), and e is (T × 1), with e ~ (0, σ²ψ). Consider a prediction scenario where the intent is to anticipate a new and unknown Y_0 given known explanatory variables X_0, where Y_0 is generated by Y_0 = X_0 β + e_0, with Y_0 being (T_0 × 1), X_0 being (T_0 × K), and e_0 = (Y_0 − X_0 β) being (T_0 × 1), where e_0 ~ (0, σ²ψ_0) and Cov(e, e_0) = σ²C, C being a (T × T_0) matrix. Note that when C ≠ 0, this allows for non-zero covariance between the error term of the sample and the error term of the prediction.

The variance of (e, e_0) is

V(e, e_0) = σ² [ ψ    C   ]
               [ C'   ψ_0 ],

where the partitioned matrix is symmetric and positive-definite (and thus non-singular).

An Alternative Formulation

Consider the Cholesky decomposition P of the inverse of this partitioned matrix:

P'P = [ ψ    C   ]⁻¹
      [ C'   ψ_0 ],

where P = [ P_1  0; P_2  P_3 ] is a ((T + T_0) × (T + T_0)) non-singular, block lower-triangular matrix. This implies that the partitioned variance matrix equals P⁻¹(P')⁻¹, or

[ P_1   0  ] [ ψ    C   ] [ P_1'  P_2' ]   [ I_T   0     ]
[ P_2  P_3 ] [ C'   ψ_0 ] [ 0     P_3' ] = [ 0     I_T0  ].

It follows from the lower-left block that (P_2 ψ + P_3 C') P_1' = 0, or P_2 ψ = −P_3 C', or P_3⁻¹ P_2 = −C'ψ⁻¹.

Consider model Q:

[ Y   ]   [ X   ]       [ e   ]
[ Y_0 ] = [ X_0 ] β  +  [ e_0 ].
Premultiplying model Q by P results in model Q*:

[ Y*   ]   [ X*   ]       [ e*   ]
[ Y_0* ] = [ X_0* ] β  +  [ e_0* ],

where

[ Y*   ]   [ P_1   0  ] [ Y   ]     [ X*   ]   [ P_1   0  ] [ X   ]     [ e*   ]   [ P_1   0  ] [ e   ]
[ Y_0* ] = [ P_2  P_3 ] [ Y_0 ],    [ X_0* ] = [ P_2  P_3 ] [ X_0 ],    [ e_0* ] = [ P_2  P_3 ] [ e_0 ].

Note that, since the matrix P is non-singular, models Q and Q* are informationally equivalent.

Predicting Y_0 Under the General Variance Structure

V(e*, e_0*) = P V(e, e_0) P' = σ² I_(T+T_0), since P was chosen so that pre- and postmultiplying the partitioned variance matrix by P and P' yields the identity. Thus Q* satisfies all the assumptions of the traditional linear regression model. It follows that X_0* β_g is the best linear unbiased predictor of Y_0*, where E(Y_0* − X_0* β_g) = 0 and X_0* β_g has the smallest variance among linear unbiased predictors of Y_0*.

From the definitions above, Y_0* = P_2 Y + P_3 Y_0 and X_0* = P_2 X + P_3 X_0, so that Y_0 = P_3⁻¹ Y_0* − P_3⁻¹ P_2 Y. Replacing Y_0* by its best predictor X_0* β_g gives the predictor

P_3⁻¹ [X_0* β_g] − P_3⁻¹ P_2 Y = P_3⁻¹ [(P_2 X + P_3 X_0) β_g] − P_3⁻¹ P_2 Y = X_0 β_g − P_3⁻¹ P_2 [Y − Xβ_g].

Since −P_3⁻¹ P_2 = C'ψ⁻¹, this implies that the best linear unbiased predictor of Y_0 is

X_0 β_g + C'ψ⁻¹ [Y − Xβ_g].

This predictor is unbiased in the sense that E[Y_0 − (X_0 β_g + C'ψ⁻¹[Y − Xβ_g])] = 0. And it is best in the sense that it has the smallest possible variance among all linear unbiased predictors of Y_0. The predictor of Y_0 reduces to X_0 β_g if C = 0, but differs from X_0 β_g if C ≠ 0.
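The best linear unbiased predictor above can be sketched numerically. The example assumes a hypothetical AR(1) covariance across the T sample periods and T_0 prediction periods (an illustrative choice that makes C ≠ 0, so the residual-correction term matters):

```python
import numpy as np

rng = np.random.default_rng(3)
T, T0 = 30, 2
n = T + T0

# Hypothetical AR(1) error covariance over all T + T0 periods (illustrative)
rho = 0.7
idx = np.arange(n)
big_psi = rho ** np.abs(np.subtract.outer(idx, idx))
psi = big_psi[:T, :T]      # V(e) / sigma^2
C = big_psi[:T, T:]        # Cov(e, e0) / sigma^2, a (T x T0) matrix
                           # (psi_0 would be big_psi[T:, T:])

Xall = np.column_stack([np.ones(n), rng.normal(size=n)])
X, X0 = Xall[:T], Xall[T:]
beta = np.array([0.5, 1.5])
eall = np.linalg.cholesky(big_psi) @ rng.normal(size=n)
Yall = Xall @ beta + eall
Y = Yall[:T]               # observed sample; Yall[T:] is the unknown Y0

# GLS estimator, then the BLUP: X0 b_g + C' psi^{-1} (Y - X b_g)
psi_inv = np.linalg.inv(psi)
b_g = np.linalg.solve(X.T @ psi_inv @ X, X.T @ psi_inv @ Y)
Y0_hat = X0 @ b_g + C.T @ psi_inv @ (Y - X @ b_g)
```

The correction term C'ψ⁻¹(Y − Xβ_g) exploits the correlation between sample and prediction errors; setting C = 0 reduces the predictor to X_0 β_g.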
The prediction error ε is ε = Y_0 − (X_0 β_g + C'ψ⁻¹[Y − Xβ_g]). The variance of the prediction error is

V(ε) = V(Y_0 − X_0 β_g − C'ψ⁻¹[Y − Xβ_g])
= V[P_3⁻¹(Y_0* − X_0* β_g)], since −P_3⁻¹P_2 = C'ψ⁻¹,
= P_3⁻¹ V(Y_0* − X_0* β_g) (P_3')⁻¹
= σ² P_3⁻¹ [I_T0 + X_0*(X*'X*)⁻¹X_0*'] (P_3')⁻¹, using results from the classical linear model applied to model Q*,
= σ² [P_3⁻¹(P_3')⁻¹ + P_3⁻¹X_0*(X*'X*)⁻¹X_0*'(P_3')⁻¹]
= σ² [ψ_0 − C'ψ⁻¹C + (P_3⁻¹P_2 X + X_0)(X*'X*)⁻¹(X'P_2'(P_3')⁻¹ + X_0')] (proving this step is a little tedious)
= σ² [ψ_0 − C'ψ⁻¹C + (X_0 − C'ψ⁻¹X)(X'ψ⁻¹X)⁻¹(X_0' − X'ψ⁻¹C)].

Note that the variance of the prediction error satisfies

V(ε) = σ²[ψ_0 + X_0(X'ψ⁻¹X)⁻¹X_0'] if C = 0, and
V(ε) ≠ σ²[ψ_0 + X_0(X'ψ⁻¹X)⁻¹X_0'] if C ≠ 0.

Hypothesis Testing Under the General Variance Structure

Assume we have the model Y = Xβ + e, where e ~ (0, σ²ψ). Consider the hypothesis consisting of J linear restrictions on β:

Null hypothesis H_0: Rβ = r
Alternative hypothesis H_1: Rβ ≠ r,

where R is a known (J × K) matrix of rank J, and r is a known (J × 1) vector.

With the (T × T) matrix ψ known, the unrestricted generalized least squares estimator of β is β_g = [X'ψ⁻¹X]⁻¹X'ψ⁻¹Y. We have shown that β_g is an unbiased, consistent, and efficient estimator of β. An unbiased estimator of σ² is σ²_gu = (Y − Xβ_g)'ψ⁻¹(Y − Xβ_g)/(T − K), and an unbiased estimator of the variance of β_g is V^e(β_g) = σ²_gu [X'ψ⁻¹X]⁻¹.
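The estimators σ²_gu and V^e(β_g) from the last paragraph can be sketched as follows, again assuming a hypothetical known diagonal ψ:

```python
import numpy as np

rng = np.random.default_rng(4)
T, K = 60, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
beta = np.array([1.0, -2.0, 0.5])

# Hypothetical known heteroscedastic psi = diag(w) (illustrative)
w = rng.uniform(0.5, 2.0, size=T)
psi_inv = np.diag(1.0 / w)
Y = X @ beta + rng.normal(size=T) * np.sqrt(w)

# GLS estimate, weighted residuals, sigma^2_gu, and the estimated V(b_g)
XtPX = X.T @ psi_inv @ X
b_g = np.linalg.solve(XtPX, X.T @ psi_inv @ Y)
u = Y - X @ b_g
sigma2_gu = (u @ psi_inv @ u) / (T - K)          # unbiased estimator of sigma^2
V_bg_hat = sigma2_gu * np.linalg.inv(XtPX)       # estimated V(b_g)
```

The square roots of the diagonal of the estimated V(β_g) are the standard errors used in the tests that follow.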
Under null hypothesis H_0, the restricted generalized least squares estimator of β is

β_gr = β_g + C_g R'[R C_g R']⁻¹[r − Rβ_g], where C_g = [X'ψ⁻¹X]⁻¹

(C_g should not be confused with the covariance matrix C of the prediction section). Applying the results of the classical linear model to model M* gives the test statistic

λ = (WSSE_R − WSSE_U)/(J σ²_gu)
= (β_g − β_gr)' X'ψ⁻¹X (β_g − β_gr)/(J σ²_gu)
= (Rβ_g − r)' [R(X'ψ⁻¹X)⁻¹R']⁻¹ (Rβ_g − r)/(J σ²_gu),

where WSSE_R = (Y − Xβ_gr)'ψ⁻¹(Y − Xβ_gr) is the weighted restricted error sum of squares and WSSE_U = (Y − Xβ_g)'ψ⁻¹(Y − Xβ_g) is the weighted unrestricted error sum of squares.

Under H_0 and assuming that e ~ N(0, σ²ψ), the test statistic λ is distributed as F(J, T−K). With normality, the following test procedure can be undertaken:
- Choose the significance level α = P(type-I error).
- Find λ_c satisfying α = P(F(J, T−K) ≥ λ_c).
- Reject H_0 if λ > λ_c; accept H_0 if λ ≤ λ_c.

If J = 1, consider using λ^(1/2):

t = (Rβ_g − r)/[σ_gu (R(X'ψ⁻¹X)⁻¹R')^(1/2)], with t ~ t(T−K) under H_0,

which implies the following equivalent test procedure:
- Choose the significance level α = P(type-I error).
- Find t_c satisfying α/2 = P(t(T−K) ≥ t_c).
- Reject H_0 if |t| > t_c; accept H_0 if |t| ≤ t_c.

Estimation of β, σ², and ψ When ψ Is Not Known

We have discussed the estimation of β and σ², assuming the (T × T) matrix ψ is known. We proposed the unbiased estimators β_g = [X'ψ⁻¹X]⁻¹X'ψ⁻¹Y and σ²_gu = (Y − Xβ_g)'ψ⁻¹(Y − Xβ_g)/(T − K). Note that both estimators depend on ψ. This is fine if ψ is known. However, it creates a problem if ψ is not known to the investigator. In that case, our proposed estimators are not empirically tractable (since they depend on the unknown ψ).
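The F test of H_0: Rβ = r described above can be sketched as follows. The design, ψ, and restrictions are illustrative, and scipy is assumed to be available for the F critical value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
T, K, J = 80, 3, 2
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
beta = np.array([1.0, 0.0, 0.0])          # true beta satisfies H0 below

# Hypothetical known heteroscedastic psi = diag(w) (illustrative)
w = rng.uniform(0.5, 2.0, size=T)
psi_inv = np.diag(1.0 / w)
Y = X @ beta + rng.normal(size=T) * np.sqrt(w)

# GLS estimate and sigma^2_gu
XtPX_inv = np.linalg.inv(X.T @ psi_inv @ X)
b_g = XtPX_inv @ X.T @ psi_inv @ Y
u = Y - X @ b_g
sigma2_gu = (u @ psi_inv @ u) / (T - K)

# H0: beta_2 = beta_3 = 0, i.e. R beta = r with J = 2 restrictions
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
r = np.zeros(J)
d = R @ b_g - r
lam = d @ np.linalg.solve(R @ XtPX_inv @ R.T, d) / (J * sigma2_gu)

lam_c = stats.f.ppf(0.95, J, T - K)       # critical value at alpha = 0.05
reject_H0 = lam > lam_c
```

Since λ is a quadratic form in a positive-definite matrix, it is always non-negative; under H_0 it exceeds λ_c only with probability α.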
Let's consider the case where the (T × T) matrix ψ is unknown and needs to be estimated. Thus, given a sample Y, we look for estimators β^e for β, (σ²)^e for σ², and ψ^e for ψ. A simple and intuitive way to proceed is first to choose some estimator ψ^e for ψ, and then to substitute it into our proposed estimators to obtain

β^e = [X'(ψ^e)⁻¹X]⁻¹X'(ψ^e)⁻¹Y as an estimator of β, and
(σ²)^e = (Y − Xβ^e)'(ψ^e)⁻¹(Y − Xβ^e)/(T − K) as an estimator of σ².

This is the essence of the estimation method discussed below.

This simple approach raises some difficult questions in evaluating the statistical properties of the estimators. The reason is that, since β^e now depends explicitly on ψ^e, the estimators β^e and ψ^e are necessarily correlated random variables. Note that this differs significantly from the classical linear model, where β_s and σ²_u conveniently happened to be uncorrelated. This means that the small-sample properties of the estimator can be complex and difficult to establish. However, large-sample properties of the estimator remain available. Being easier to evaluate, we will rely extensively on such asymptotic properties.

Some Key Asymptotic Results Under the General Variance Structure

Assume that plim[(X'ψ⁻¹X)/T] is a (K × K) finite, non-singular matrix. Let ψ^e be a consistent estimator of ψ. Then

plim[(X'(ψ^e)⁻¹X)/T] = plim[(X'ψ⁻¹X)/T], and
plim[(X'(ψ^e)⁻¹e)/T^(1/2)] = plim[(X'ψ⁻¹e)/T^(1/2)].

β_fg = [X'(ψ^e)⁻¹X]⁻¹X'(ψ^e)⁻¹Y is called the feasible generalized least squares estimator of β. When ψ^e is a consistent estimator of ψ, it can be shown that β_fg = [X'(ψ^e)⁻¹X]⁻¹X'(ψ^e)⁻¹Y has the same asymptotic distribution as β_g = [X'ψ⁻¹X]⁻¹X'ψ⁻¹Y.
This is an important result since we already know the asymptotic properties of β_g. It implies the following asymptotic properties for the feasible generalized least squares estimator β_fg. When ψ^e is a consistent estimator of ψ, the estimator β_fg of β is:
- asymptotically unbiased
- consistent
- asymptotically efficient
- asymptotically normal, with T^(1/2)(β_fg − β) →d N(0, σ²[(X'ψ⁻¹X)/T]⁻¹), or β_fg ~ N(β, σ²[X'ψ⁻¹X]⁻¹) approximately as T → ∞.

A Proposed Estimation Procedure With a General Variance Structure

We propose the following three-step estimation procedure:
1. Obtain the least squares estimator β_s = (X'X)⁻¹X'Y, a consistent estimator of β. From these estimates, generate e_s = Y − Xβ_s as a consistent estimator of e.
2. Use e_s to obtain consistent estimators ψ^e of ψ and (σ²)^e of σ².
3. Obtain the feasible generalized least squares estimator β_fg = [X'(ψ^e)⁻¹X]⁻¹X'(ψ^e)⁻¹Y.

This estimator β_fg of β is consistent, asymptotically efficient, and satisfies β_fg ~ N(β, σ²[X'ψ⁻¹X]⁻¹) approximately as T → ∞. It follows that (σ²)^e[X'(ψ^e)⁻¹X]⁻¹ is a consistent estimator of V(β_fg), which can be used to conduct asymptotic tests about β (e.g., using a Wald test).

The above procedure is written in a very general form. How it gets implemented typically depends on the model specification for ψ. For that reason, we proceed with an analysis of more specific models.
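As one concrete illustration, the three-step procedure can be sketched for a simple two-regime heteroscedastic specification. The regime structure and variances below are hypothetical choices for the sketch, not a specification from the text:

```python
import numpy as np

rng = np.random.default_rng(6)
T = 200
X = np.column_stack([np.ones(T), rng.normal(size=T)])
beta = np.array([1.0, 2.0])

# Simulated two-regime heteroscedasticity: psi = diag(w), regime membership known
group = np.repeat([0, 1], T // 2)
true_w = np.where(group == 0, 1.0, 4.0)
Y = X @ beta + rng.normal(size=T) * np.sqrt(true_w)

# Step 1: OLS gives consistent beta_s and residuals e_s
b_s, *_ = np.linalg.lstsq(X, Y, rcond=None)
e_s = Y - X @ b_s

# Step 2: estimate psi consistently from residual variances within each regime
w_hat = np.where(group == 0, np.var(e_s[group == 0]), np.var(e_s[group == 1]))
psi_inv_hat = np.diag(1.0 / w_hat)

# Step 3: feasible GLS using the estimated psi
b_fg = np.linalg.solve(X.T @ psi_inv_hat @ X, X.T @ psi_inv_hat @ Y)
```

Here ψ is identified only up to the scale absorbed by σ², which is harmless since β_fg is invariant to rescaling ψ^e. Other specifications for ψ (e.g., autocorrelation) change step 2 but not the overall logic.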