Econometrics — Estimation and hypothesis testing in the uni-equational linear regression model: cross-section data. Luca Fanelli, University of Bologna, luca.fanelli@unibo.it
Estimation and hypothesis testing in the uni-equational regression model
- Model
- Compact representation
- OLS
- GLS
- ML
- Constrained estimation
- Testing linear hypotheses
Estimation: cross-section data. Uni-equational linear regression model:
$y_i = x_i'\beta + u_i$, $i = 1, 2, \dots, n$
classical: $Cov(u_i, u_j) := 0$ $\forall i \neq j$, $E(u_i^2) := \sigma_u^2$ $\forall i$
generalized: $Cov(u_i, u_j) := \sigma_{i,j}$ (zero or not), $E(u_i^2) := \sigma_i^2$
We know that the generalized model has to be regarded as `virtual' unless further information is used ($\sigma_{i,j}$ and $\sigma_i^2$ are generally unknown).
Compact matrix representation:
$y_{n\times 1} = X_{n\times k}\,\beta_{k\times 1} + u_{n\times 1}$, $\quad E(u) = 0_{n\times 1}$, $\quad E(uu') := \Sigma_{n\times n}$.
Why do we use this representation? Because it is particularly useful for deriving the estimators of interest!
Write $y_i = x_i'\beta + u_i$, $i = 1, 2, \dots, n$, equation by equation:
$y_1 = x_1'\beta + u_1$ ($i := 1$)
$y_2 = x_2'\beta + u_2$ ($i := 2$)
$\vdots$
$y_n = x_n'\beta + u_n$ ($i := n$).
Compactly:
$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix}\beta + \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}$, i.e. $y_{n\times 1} = X_{n\times k}\,\beta_{k\times 1} + u_{n\times 1}$.
What about the matrix $E(uu') := \Sigma_{n\times n}$?
$E(uu') := E\!\left[\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}(u_1, u_2, \dots, u_n)\right] = E\begin{pmatrix} u_1^2 & u_1 u_2 & \cdots & u_1 u_n \\ u_2 u_1 & u_2^2 & & \vdots \\ \vdots & & \ddots & \\ u_n u_1 & u_n u_2 & \cdots & u_n^2 \end{pmatrix}$
$= \begin{pmatrix} E(u_1^2) & E(u_1u_2) & \cdots & E(u_1u_n) \\ E(u_2u_1) & E(u_2^2) & & \vdots \\ \vdots & & \ddots & \\ E(u_nu_1) & E(u_nu_2) & \cdots & E(u_n^2) \end{pmatrix} = \begin{pmatrix} E(u_1^2) & Cov(u_1,u_2) & \cdots & Cov(u_1,u_n) \\ Cov(u_2,u_1) & E(u_2^2) & & \vdots \\ \vdots & & \ddots & \\ Cov(u_n,u_1) & Cov(u_n,u_2) & \cdots & E(u_n^2) \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \sigma_{1,2} & \cdots & \sigma_{1,n} \\ \sigma_{2,1} & \sigma_2^2 & & \vdots \\ \vdots & & \ddots & \\ \sigma_{n,1} & \sigma_{n,2} & \cdots & \sigma_n^2 \end{pmatrix}$
It is clear that $\Sigma := \sigma_u^2 I_n$ in the classical model and $\Sigma \neq \sigma_u^2 I_n$ in the generalized model.
To sum up, the compact representation of the uni-equational linear regression model based on cross-section data is:
$y_{n\times 1} = X_{n\times k}\,\beta_{k\times 1} + u_{n\times 1}$
where $E(u) = 0_{n\times 1}$, $E(uu') := \Sigma_{n\times n}$, and
$\Sigma := \sigma_u^2 I_n$ (classical model), $\quad \Sigma \neq \sigma_u^2 I_n$ (generalized model).
The representations $y_i = x_i'\beta + u_i$, $i = 1, 2, \dots, n$, and $y_{n\times 1} = X_{n\times k}\,\beta_{k\times 1} + u_{n\times 1}$ are interchangeable. The link between the two follows immediately from:
$X'X \equiv \sum_{i=1}^n x_i x_i'$, $\qquad X'y \equiv \sum_{i=1}^n x_i y_i$.
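The equivalence between the summation form and the compact matrix form can be checked numerically; a minimal NumPy sketch with simulated data (all names and dimensions are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = rng.normal(size=(n, k))   # rows of X are the vectors x_i'
y = rng.normal(size=n)

# Summation form: sum_i x_i x_i' and sum_i x_i y_i
XtX_sum = sum(np.outer(X[i], X[i]) for i in range(n))
Xty_sum = sum(X[i] * y[i] for i in range(n))

# Compact matrix form: X'X and X'y
XtX = X.T @ X
Xty = X.T @ y
```

Both pairs coincide up to floating-point rounding, which is exactly the interchangeability the slide states.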
We have introduced the classical uni-equational regression model in the following way:
$y_i = x_i'\beta + u_i$, $i = 1, 2, \dots, n$
$E(u_i \mid x_i) = 0 \Rightarrow E(u_i) = 0$
Hp: $E(u_i^2 \mid x_i) = \sigma_u^2$
Hp: $Cov(u_i, u_j) = 0$, $\forall i \neq j$.
Using the compact representation, we can say even more about the features of this model.
In particular, given the conditional nature of the model (we are conditioning with respect to the regressors), we will be writing
$y_{n\times 1} = X_{n\times k}\,\beta_{k\times 1} + u_{n\times 1}$
$E(u \mid X) = 0_{n\times 1} \Rightarrow E(u) = 0_{n\times 1}$
$E(uu' \mid X) := \Sigma_{n\times n} \Rightarrow E(uu') = \Sigma_{n\times n}$
so that $u$ and $X$ can be thought of as stochastically independent, in the sense that $E(u \mid X) = E(u)$ and $E(uu' \mid X) = E(uu')$.
It is useful to focus on the meaning of the $n\times 1$ vector $E(u \mid X)$. From the previous set of slides we know:
$E(u \mid X) := E\!\left[\begin{pmatrix} u_1 \\ \vdots \\ u_n \end{pmatrix} \Big|\, X\right] := \begin{pmatrix} E(u_1 \mid X) \\ \vdots \\ E(u_n \mid X) \end{pmatrix}$
Now consider, e.g.,
$E(u_1 \mid X) \equiv E(u_1 \mid \text{all stochastic variables within } X)$,
$E(u_2 \mid X) \equiv E(u_2 \mid \text{all stochastic variables within } X)$, ...,
$E(u_n \mid X) \equiv E(u_n \mid \text{all stochastic variables within } X)$.
Thus $E(u_n \mid X) = 0$ means that, conditional on the knowledge of all the regressors $x_1$ ($i := 1$), $x_2$ ($i := 2$), ..., $x_n$ ($i := n$) that enter the matrix $X$, the expected value of $u_n$ is zero, as is the unconditional expectation: $E(u_n \mid X) = E(u_n) = 0$. The same holds for $E(u_{n-1} \mid X)$, $E(u_{n-2} \mid X)$, ..., $E(u_1 \mid X)$, and allows us to write $E(u \mid X) = 0_{n\times 1} \Rightarrow E_X\!\left(E(u \mid X)\right) = 0_{n\times 1}$.
The condition $E(u \mid X) = 0_{n\times 1}$ is usually assumed with cross-section data and is stronger than it apparently seems. Why? Example on the blackboard.
$E(u \mid X) = 0_{n\times 1} \Rightarrow E(u) = 0_{n\times 1}$ (law of iterated expectations). The reverse is not true: $E(u) = 0_{n\times 1}$ does not imply $E(u \mid X) = 0_{n\times 1}$!
For regression models based on time series data, where lags of $y$ appear among the regressors, the condition $E(u \mid X) = 0_{n\times 1}$ does not hold (although $E(u) = 0_{n\times 1}$ still does). Example on the blackboard based on:
$c_i = \beta_0 + \beta_1 c_{i-1} + \beta_2 z_i + u_i$, $i = 1, \dots, n$.
Classical model with cross-section data: OLS estimation.
$y_{n\times 1} = X_{n\times k}\,\beta_{k\times 1} + u_{n\times 1}$, $\quad E(u \mid X) = 0_{n\times 1}$, $\quad E(uu' \mid X) := \sigma_u^2 I_n$.
Vector of $k+1$ unknown parameters: $\theta = (\beta', \sigma_u^2)'$.
Objective function:
$Q(\beta) = \frac{1}{\sigma_u^2}\sum_{i=1}^n (y_i - x_i'\beta)^2 \equiv \frac{1}{\sigma_u^2}(y - X\beta)'(y - X\beta).$
Given
$Q(\beta) = \frac{1}{\sigma_u^2}\sum_{i=1}^n (y_i - x_i'\beta)^2 \equiv \frac{1}{\sigma_u^2}(y - X\beta)'(y - X\beta),$
the OLS estimator of $\beta$ is obtained by solving the problem $\min_\beta Q(\beta)$, i.e. the OLS estimator of $\beta$ is the vector that solves:
$\hat\beta_{OLS} = \arg\min_\beta Q(\beta).$
First-order conditions:
$\frac{\partial Q(\beta, \sigma_u^2)}{\partial\beta} = \begin{pmatrix} \partial Q(\beta, \sigma_u^2)/\partial\beta_1 \\ \partial Q(\beta, \sigma_u^2)/\partial\beta_2 \\ \vdots \\ \partial Q(\beta, \sigma_u^2)/\partial\beta_k \end{pmatrix} = 0_{k\times 1} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$
The needed derivative rules are in Appendix 3.A.3 (Ch. 3) of the textbook. Exercise at home.
The solution of this problem leads us to
$\hat\beta_{OLS} := \left(\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\sum_{i=1}^n x_i y_i\right) \equiv (X'X)^{-1}X'y.$
The condition $rank(X) = k \Rightarrow X'X$ non-singular is crucial to obtain a valid (unique) OLS estimator.
The estimator of $\sigma_u^2$ is obtained indirectly:
$\hat\sigma_u^2 = \frac{1}{n-k}\sum_{i=1}^n \hat u_i^2 \equiv \frac{1}{n-k}\hat u'\hat u$
where $\hat u_i = y_i - x_i'\hat\beta_{OLS}$, $i = 1, \dots, n$, or, alternatively, $\hat u = (y - X\hat\beta_{OLS})$.
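The two formulas above translate directly into code; a minimal sketch on simulated data (the true coefficients and sample size are invented for the illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 0.5, -2.0])
y = X @ beta_true + rng.normal(size=n)

# OLS: beta_hat = (X'X)^{-1} X'y  (np.linalg.solve is preferred to an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Indirect estimator of sigma_u^2 with the degrees-of-freedom correction 1/(n-k)
u_hat = y - X @ beta_hat
sigma2_hat = (u_hat @ u_hat) / (n - k)
```

Note that `beta_hat` satisfies the first-order conditions $X'\hat u = 0_{k\times 1}$ by construction.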
Are the estimators of $\beta$ and $\sigma_u^2$ correct (unbiased)? $E(\hat\beta_{OLS}) = \beta$, $E(\hat\sigma_u^2) = \sigma_u^2$? (If yes, under which conditions?)
Consider that
$\hat\beta_{OLS} := (X'X)^{-1}X'[X\beta + u] = \beta + (X'X)^{-1}X'u$
and $E(\hat\beta_{OLS}) := E_X\!\left[E(\hat\beta_{OLS} \mid X)\right]$. Likewise, $E(\hat\sigma_u^2) := E_X\!\left[E(\hat\sigma_u^2 \mid X)\right]$.
Estimator of $\beta$:
$E(\hat\beta_{OLS} \mid X) = \beta + (X'X)^{-1}X'E(u \mid X).$
Hence, if $E(u \mid X) = 0_{n\times 1}$, one has
$E(\hat\beta_{OLS} \mid X) = \beta \;\Rightarrow\; E(\hat\beta_{OLS}) := E_X\!\left[E(\hat\beta_{OLS} \mid X)\right] = E_X(\beta) = \beta;$
the OLS estimator is correct. Note that if $E(u \mid X) \neq 0_{n\times 1}$, the estimator is no longer correct (this happens in the regression model with time series data in which the regressors include lags of $y$).
Estimator of $\sigma_u^2$: to check whether the estimator $\hat\sigma_u^2$ is correct, we start from some computations and considerations:
$\hat u = (y - X\hat\beta_{OLS}) = (y - X(X'X)^{-1}X'y) = (I_n - X(X'X)^{-1}X')(X\beta + u) = X\beta + u - X\beta - X(X'X)^{-1}X'u = (I_n - X(X'X)^{-1}X')u.$
The matrix $(I_n - X(X'X)^{-1}X')$ will be indicated with the symbol $M_{XX}$.
$M_{XX} := (I_n - X(X'X)^{-1}X')$ is a `special' (idempotent) matrix:
- it is symmetric: $M_{XX}' = M_{XX}$;
- it is such that $M_{XX}M_{XX} = M_{XX}$;
- $rank(M_{XX}) = tr(M_{XX}) = n - k$.
(Properties of the trace operator on the blackboard.)
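The three properties of $M_{XX}$ can be verified numerically; a short sketch with an arbitrary simulated design matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 30, 4
X = rng.normal(size=(n, k))

# M_XX = I_n - X(X'X)^{-1}X'
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

sym_ok  = np.allclose(M, M.T)          # symmetry
idem_ok = np.allclose(M @ M, M)        # idempotency
trace   = np.trace(M)                  # equals n - k
rank    = np.linalg.matrix_rank(M)     # equals n - k
```

With $n = 30$ and $k = 4$, both the trace and the rank come out as $26 = n - k$.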
We have proved that $\hat u := (y - X\hat\beta_{OLS}) = M_{XX}u$. Then
$E(\hat\sigma_u^2 \mid X) = E\!\left(\frac{1}{n-k}\hat u'\hat u \,\Big|\, X\right) = \frac{1}{n-k}E(u'M_{XX}u \mid X) = \frac{1}{n-k}E\!\left(tr(u'M_{XX}u) \mid X\right) = \frac{1}{n-k}E\!\left(tr(M_{XX}uu') \mid X\right) = \frac{1}{n-k}tr\!\left[M_{XX}E(uu' \mid X)\right] = \frac{1}{n-k}tr\!\left(M_{XX}\,\sigma_u^2 I_n\right) = \frac{\sigma_u^2}{n-k}tr(M_{XX}) = \frac{\sigma_u^2}{n-k}(n-k) = \sigma_u^2.$
We have proved that
$E(\hat\sigma_u^2 \mid X) = \sigma_u^2 \;\Rightarrow\; E(\hat\sigma_u^2) := E_X\!\left[E(\hat\sigma_u^2 \mid X)\right] = E_X(\sigma_u^2) = \sigma_u^2;$
the OLS estimator of $\sigma_u^2$ is correct.
Covariance matrix of $\hat\beta_{OLS}$:
$Var(\hat\beta_{OLS} \mid X) = Var\!\left(\beta + (X'X)^{-1}X'u \mid X\right) = Var\!\left((X'X)^{-1}X'u \mid X\right) := (X'X)^{-1}X'\,\sigma_u^2 I_n\,X(X'X)^{-1} = \sigma_u^2(X'X)^{-1}.$
Recall section. Let $v$ be a $p\times 1$ stochastic vector with $Var(v) := V$ (symmetric positive definite) and $A$ an $m\times p$ non-stochastic matrix. Then (sandwich rule)
$Var(Av) = A_{m\times p}\,V_{p\times p}\,A'_{p\times m}.$
Assume now that $A$ is also stochastic, i.e. its elements are random variables. Then the quantity $Var(Av \mid A)$ means that we can treat $A$ as if it were a non-stochastic matrix! Hence
$Var(Av \mid A) := A\,Var(v \mid A)\,A'.$
End of recall section.
Note that
$Var(\hat\beta_{OLS}) := E_X\!\left[Var(\hat\beta_{OLS} \mid X)\right] = E_X\!\left[\sigma_u^2(X'X)^{-1}\right] = \sigma_u^2\,E\!\left[(X'X)^{-1}\right] = \sigma_u^2\,E\!\left[\left(\textstyle\sum_{i=1}^n x_i x_i'\right)^{-1}\right].$
To sum up: the OLS estimator of $\beta$ in the linear classical model based on cross-section data is correct, $E(\hat\beta_{OLS}) = \beta$, and has (conditional) covariance matrix $Var(\hat\beta_{OLS} \mid X) := \sigma_u^2(X'X)^{-1}$. The OLS estimator of $\sigma_u^2$ in the linear classical model based on cross-section data is correct, $E(\hat\sigma_u^2) = \sigma_u^2$. Usually we are not interested in the variance of the estimator $\hat\sigma_u^2$, but in principle we can also derive it.
A famous theorem (the Gauss–Markov Theorem) applies to the case in which $X$ does not contain stochastic variables. It says that, given the linear model
$y_{n\times 1} = X_{n\times k}\,\beta_{k\times 1} + u_{n\times 1}$, $\quad E(u) = 0_{n\times 1}$, $\quad E(uu') := \sigma_u^2 I_n$,
the OLS estimator is, in the class of linear correct (unbiased) estimators, the one whose covariance matrix $Var(\hat\beta_{OLS}) := \sigma_u^2(X'X)^{-1}$ is the `most efficient' (minimum covariance matrix).
In our case, $X$ is stochastic (we condition the model with respect to the regressors). A version of the Gauss–Markov theorem where $X$ is stochastic and all variables are conditioned with respect to the elements of $X$ still applies. This means that any estimator different from the OLS estimator has (conditional) covariance matrix `larger' than $Var(\hat\beta_{OLS} \mid X) := \sigma_u^2(X'X)^{-1}$. This is why the OLS estimator applied to cross-section data (under homoskedasticity) is often called BLUE (Best Linear Unbiased Estimator).
Recall section. Let $\Sigma_1$ and $\Sigma_2$ be two symmetric positive definite matrices. We say that $\Sigma_1$ `is larger' than $\Sigma_2$ if the matrix $\Sigma_1 - \Sigma_2$ is positive semidefinite (this means that the eigenvalues of $\Sigma_1 - \Sigma_2$ are either positive or zero). What are eigenvalues? Given a $q\times q$ matrix $A$, the eigenvalues are the solutions $\lambda$ (which can be complex numbers) to the problem $\det(\lambda I_q - A) = 0$. If the matrix $A$ is symmetric, the eigenvalues are real numbers. If $A$ is symmetric and positive definite, the eigenvalues are real and positive. End of recall section.
Why do we care about the covariance matrix of the estimator $\hat\beta_{OLS}$? Because it is fundamental to make inference! Imagine that
$\hat\beta_{OLS} \mid X \sim N\!\left(\beta,\, \sigma_u^2(X'X)^{-1}\right)$
$\hat G \mid X \sim \chi^2(n-k)$, $\quad \hat G := (n-k)\,\frac{\hat\sigma_u^2}{\sigma_u^2}.$
These distributions hold if $u \mid X \sim N(0, \sigma_u^2 I_n)$ (irrespective of whether $n$ is small or large). It is possible to show that the random vector $\hat\beta_{OLS} \mid X$ is independent of the random variable $\hat G \mid X$.
Observe that given $\hat\beta_{OLS} \mid X \sim N\!\left(\beta,\, \sigma_u^2(X'X)^{-1}\right)$, then
$\frac{\hat\beta_j - \beta_j}{\sigma_u\,c_{jj}^{1/2}} \,\Big|\, X \sim N(0, 1), \quad j = 1, \dots, k,$
where $\hat\beta_j$ is the $j$-th element of $\hat\beta$, $\beta_j$ is the $j$-th element of $\beta$, and $c_{jj}$ is the $j$-th element on the main diagonal of the matrix $(X'X)^{-1}$.
It is then easy to construct a statistical test for $H_0: \beta_j := 0$ vs $H_1: \beta_j \neq 0$. Recall that, according to the theory of distributions, if $Z \sim N(0, 1)$, $G \sim \chi^2(q)$, and $Z$ and $G$ are independent, then
$\frac{Z}{\left(G/q\right)^{1/2}} \sim t(q).$
Consider $H_0: \beta_j := 0$ vs $H_1: \beta_j \neq 0$ and the test statistic (t-ratio):
$t := \frac{\hat\beta_j - 0}{\hat\sigma_u\,c_{jj}^{1/2}} = \frac{\hat\beta_j}{s.e.(\hat\beta_j)}.$
Then
$t := \frac{\hat\beta_j - 0}{s.e.(\hat\beta_j)} = \frac{\hat\beta_j - 0}{\hat\sigma_u\,c_{jj}^{1/2}} = \frac{\hat\beta_j - 0}{\sigma_u\,c_{jj}^{1/2}}\cdot\frac{\sigma_u}{\hat\sigma_u} = \frac{\dfrac{\hat\beta_j - 0}{\sigma_u\,c_{jj}^{1/2}}}{\left(\dfrac{\hat\sigma_u^2}{\sigma_u^2}\right)^{1/2}} = \frac{\dfrac{\hat\beta_j - 0}{\sigma_u\,c_{jj}^{1/2}}}{\left(\dfrac{\hat G}{n-k}\right)^{1/2}}.$
Accordingly,
$t \mid X = \frac{\dfrac{\hat\beta_j - 0}{\sigma_u\,c_{jj}^{1/2}}}{\left(\dfrac{\hat G}{n-k}\right)^{1/2}} \,\Big|\, X \sim \frac{N(0,1)}{\left(\dfrac{\chi^2(n-k)}{n-k}\right)^{1/2}} \sim t(n-k).$
Thus, if the disturbances are Gaussian, we can test the significance of each regression coefficient by running t-tests. These tests are $t(n-k)$ distributed. Observe that for $n$ large, $t(n-k) \approx N(0, 1)$.
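The t-ratios and their $t(n-k)$ p-values can be computed directly; a sketch on simulated data (the coefficient values are invented, with the third one truly zero so that $H_0$ holds for it):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, k = 120, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.3, 1.5, 0.0])      # third coefficient truly zero
y = X @ beta_true + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat
sigma_hat = np.sqrt(u_hat @ u_hat / (n - k))

# t-ratio for H0: beta_j = 0, with s.e.(beta_j) = sigma_hat * sqrt(c_jj)
se = sigma_hat * np.sqrt(np.diag(XtX_inv))
t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - k)   # two-sided t(n-k) p-values
```

With this design the second coefficient (true value 1.5) is sharply significant, while the p-value of the third one is typically large.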
OLS with `robust' standard errors. The standard error $s.e.(\hat\beta_j) := \hat\sigma_u\,c_{jj}^{1/2}$ provides a measure of the variability of the estimator $\hat\beta_j$. This measure, however, has been obtained under the implicit assumption of homoskedasticity (indeed, to be fully efficient OLS requires homoskedasticity!).
Imagine that the `true model' is
$y_{n\times 1} = X_{n\times k}\,\beta_{k\times 1} + u_{n\times 1}$, $\quad E(u \mid X) = 0_{n\times 1}$,
$E(uu' \mid X) := \Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{pmatrix},$
which means that we have heteroskedasticity! Assume that an econometrician erroneously believes that there is homoskedasticity and applies the OLS estimator. In this case, the OLS estimator is still correct (proof as an exercise!) but will no longer be efficient! Is it possible to improve its efficiency?
Then
$Var(\hat\beta_{OLS} \mid X) = Var\!\left(\beta + (X'X)^{-1}X'u \mid X\right) = Var\!\left((X'X)^{-1}X'u \mid X\right) := (X'X)^{-1}X'\Sigma X(X'X)^{-1},$
therefore
$(X'X)^{-1}X'\Sigma X(X'X)^{-1} \neq \sigma_u^2(X'X)^{-1}.$
The matrix $(X'X)^{-1}X'\Sigma X(X'X)^{-1}$ is the `correct' covariance matrix of the OLS estimator in the presence of heteroskedasticity.
Let us focus in detail on the structure of the matrix $X'\Sigma X$. Since $\Sigma$ is diagonal, it is seen that
$X'\Sigma X := \sum_{i=1}^n \sigma_i^2\,x_i x_i'.$
The US econometrician White has shown that, given the residuals $\hat u_i := y_i - x_i'\hat\beta_{OLS}$, $i = 1, 2, \dots, n$, the quantity $\frac{1}{n}\sum_{i=1}^n \hat u_i^2\,x_i x_i'$ is a consistent estimator of $\frac{1}{n}X'\Sigma X$. In other words, for large $n$,
$\frac{1}{n}\sum_{i=1}^n \hat u_i^2\,x_i x_i' \to_p \frac{1}{n}\sum_{i=1}^n \sigma_i^2\,x_i x_i'.$
The consequence of this result is the following: when the econometrician suspects that homoskedasticity might not hold in his/her sample, he/she may take the standard errors not from the covariance matrix $\hat\sigma_u^2(X'X)^{-1}$ but rather from the covariance matrix
$(X'X)^{-1}\left(\sum_{i=1}^n \hat u_i^2\,x_i x_i'\right)(X'X)^{-1} = \left(\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\sum_{i=1}^n \hat u_i^2\,x_i x_i'\right)\left(\sum_{i=1}^n x_i x_i'\right)^{-1}.$
These standard errors are `robust' in the sense that they take the heteroskedasticity into account. The robust standard errors are also known as `White standard errors'. Every econometric package complements OLS estimation with these standard errors.
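The White sandwich formula is easy to compute by hand; a sketch with an artificially heteroskedastic sample (the variance function $\exp(0.5 x_i)$ is invented for the illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
u = rng.normal(size=n) * np.exp(0.5 * x)   # heteroskedastic disturbances
y = X @ np.array([1.0, 2.0]) + u

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat

# White sandwich: (X'X)^{-1} (sum_i uhat_i^2 x_i x_i') (X'X)^{-1}
meat = (X * u_hat[:, None] ** 2).T @ X
V_white = XtX_inv @ meat @ XtX_inv
se_white = np.sqrt(np.diag(V_white))

# Conventional (homoskedasticity-based) standard errors, for comparison
sigma2_hat = u_hat @ u_hat / (n - 2)
se_homo = np.sqrt(np.diag(sigma2_hat * XtX_inv))
```

In heteroskedastic samples like this one, `se_white` and `se_homo` differ, and only the former is valid.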
Generalized model: GLS estimation.
$y_{n\times 1} = X_{n\times k}\,\beta_{k\times 1} + u_{n\times 1}$, $\quad E(u \mid X) = 0_{n\times 1}$, $\quad E(uu' \mid X) = \Sigma \neq \sigma_u^2 I_n$.
Vector of unknown parameters: $\theta = (\beta', vech(\Sigma)')'$. It is impossible to estimate $\theta$ on the basis of $n$ observations. We need some assumptions. Hp: $\Sigma$ is known.
Objective function:
$Q(\beta) = (y - X\beta)'\Sigma^{-1}(y - X\beta) \equiv \sum_{i=1}^n\sum_{j=1}^n \sigma^{ij}(y_i - x_i'\beta)(y_j - x_j'\beta)$, $\quad \sigma^{ij}$ elements of $\Sigma^{-1}$.
The GLS estimator of $\beta$ (given $\Sigma$) is obtained by solving the problem $\min_\beta Q(\beta)$, i.e. the GLS estimator of $\beta$ is the vector that solves:
$\hat\beta_{GLS} = \arg\min_\beta Q(\beta).$
The solution of this problem leads us to
$\hat\beta_{GLS} := \left(X'\Sigma^{-1}X\right)^{-1}X'\Sigma^{-1}y.$
The condition $rank(X) = k \Rightarrow X'\Sigma^{-1}X$ non-singular is crucial to obtain a valid (unique) GLS estimator. It can be noticed that the GLS estimator of $\beta$ is `virtual' in the sense that it requires the knowledge of $\Sigma$, otherwise it cannot be computed.
$\hat\beta_{GLS} = \beta + \left(X'\Sigma^{-1}X\right)^{-1}X'\Sigma^{-1}u$
$E(\hat\beta_{GLS} \mid X) = \beta + \left(X'\Sigma^{-1}X\right)^{-1}X'\Sigma^{-1}E(u \mid X) = \beta.$
$Var(\hat\beta_{GLS} \mid X) = Var\!\left(\beta + \left(X'\Sigma^{-1}X\right)^{-1}X'\Sigma^{-1}u \mid X\right) = Var\!\left(\left(X'\Sigma^{-1}X\right)^{-1}X'\Sigma^{-1}u \mid X\right) = \left(X'\Sigma^{-1}X\right)^{-1}.$
The GLS estimator is correct under the assumption $E(u \mid X) = 0_{n\times 1}$ and has conditional covariance matrix $Var(\hat\beta_{GLS} \mid X) = \left(X'\Sigma^{-1}X\right)^{-1}$.
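With $\Sigma$ known (here a diagonal matrix of invented variances), the GLS formulas can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma2_i = 0.5 + rng.uniform(size=n)       # known heteroskedastic variances
u = rng.normal(size=n) * np.sqrt(sigma2_i)
y = X @ np.array([1.0, -1.0]) + u

Sigma_inv = np.diag(1.0 / sigma2_i)        # Sigma^{-1} for a diagonal Sigma
# GLS: beta_hat = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y
A = X.T @ Sigma_inv @ X
beta_gls = np.linalg.solve(A, X.T @ Sigma_inv @ y)
V_gls = np.linalg.inv(A)                   # conditional covariance (X' Sigma^{-1} X)^{-1}
```

In practice one would exploit the diagonality of $\Sigma$ and never build the full $n\times n$ matrix; the dense form is kept only to mirror the slide's formula.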
A famous theorem (the Aitken Theorem) applies to the case in which $X$ does not contain stochastic variables. It says that, given the generalized linear model
$y_{n\times 1} = X_{n\times k}\,\beta_{k\times 1} + u_{n\times 1}$, $\quad E(u) = 0_{n\times 1}$, $\quad E(uu') := \Sigma \neq \sigma_u^2 I_n$,
the GLS estimator is, in the class of linear correct (unbiased) estimators, the one whose covariance matrix $Var(\hat\beta_{GLS}) := (X'\Sigma^{-1}X)^{-1}$ is the `most efficient' (minimum covariance matrix).
In our case, $X$ is stochastic (we condition the model with respect to the regressors). A version of the Aitken theorem where $X$ is stochastic and all variables are conditioned with respect to the elements of $X$ still applies. This means that any estimator different from the GLS estimator has (conditional) covariance matrix `larger' than $Var(\hat\beta_{GLS} \mid X) := (X'\Sigma^{-1}X)^{-1}$.
Feasible GLS estimators. There are cases in which the GLS is `feasible', i.e. it can be calculated by using the information in the data. Imagine the model is:
$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i$, $i = 1, \dots, n$
$E(u_i \mid x_i) = 0 \Rightarrow E(u_i) = 0$
Hp: $E(u_i^2 \mid x_i) = \sigma_i^2 = h\,(x_{1,i})^2$, $h > 0$
Hp: $Cov(u_i, u_j) = 0$, $\forall i \neq j$.
We can transform the model by dividing both sides by $\sigma_i$, obtaining
$\frac{y_i}{\sigma_i} = \frac{1}{\sigma_i}\beta_0 + \frac{1}{\sigma_i}\beta_1 x_{1,i} + \frac{1}{\sigma_i}\beta_2 x_{2,i} + \frac{1}{\sigma_i}u_i.$
which is equivalent to
$\frac{y_i}{(h)^{1/2}x_{1,i}} = \frac{1}{(h)^{1/2}x_{1,i}}\beta_0 + \frac{1}{(h)^{1/2}}\beta_1 + \frac{1}{(h)^{1/2}x_{1,i}}\beta_2 x_{2,i} + \frac{u_i}{(h)^{1/2}x_{1,i}}$
which, after multiplying both sides by $(h)^{1/2}$, is equivalent to
$\frac{y_i}{x_{1,i}} = \beta_0\frac{1}{x_{1,i}} + \beta_1 + \beta_2\frac{x_{2,i}}{x_{1,i}} + \frac{u_i}{x_{1,i}}$, i.e. $y_i^* = \beta_1 + \beta_0 x_{1,i}^* + \beta_2 x_{2,i}^* + u_i^*$
where $y_i^* := \frac{y_i}{x_{1,i}}$, $x_{1,i}^* := \frac{1}{x_{1,i}}$, $x_{2,i}^* := \frac{x_{2,i}}{x_{1,i}}$, $u_i^* := \frac{u_i}{x_{1,i}}$. Now,
$E\!\left((u_i^*)^2 \mid x_i\right) = E\!\left(\left(\frac{u_i}{x_{1,i}}\right)^2 \Big|\, x_i\right) = \frac{1}{x_{1,i}^2}E(u_i^2 \mid x_i) = \frac{1}{x_{1,i}^2}\,h\,(x_{1,i})^2 = h \quad \forall i.$
The transformed model $y_i^* = \beta_1 + \beta_0 x_{1,i}^* + \beta_2 x_{2,i}^* + u_i^*$ is homoskedastic because $E\!\left((u_i^*)^2 \mid x_i\right) := h = const$ $\forall i$; OLS can be applied to estimate $\beta_1$, $\beta_0$ and $\beta_2$ efficiently. This transformation is equivalent to the following idea:
Hp: $\sigma_i^2 := h\,(x_{1,i})^2$, $h > 0$
$\Rightarrow \Sigma := h\begin{pmatrix} x_{1,1}^2 & 0 & \cdots & 0 \\ 0 & x_{1,2}^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & x_{1,n}^2 \end{pmatrix} := hV$
$\hat\beta_{GLS} := \left(X'(hV)^{-1}X\right)^{-1}X'(hV)^{-1}y = \left(X'V^{-1}X\right)^{-1}X'V^{-1}y;$
it is feasible because the matrix $V$ is known!
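The transformation above can be sketched in code: divide every term by $x_{1,i}$ and run OLS on the transformed variables (the data-generating values of $h$ and of the coefficients are invented, and $x_{1,i}$ is kept positive so the division is safe):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 400
x1 = 1.0 + rng.uniform(size=n)             # kept strictly positive
x2 = rng.normal(size=n)
h = 0.7
u = rng.normal(size=n) * np.sqrt(h) * x1   # Var(u_i | x_i) = h * x1_i^2
b0, b1, b2 = 1.0, 2.0, -0.5
y = b0 + b1 * x1 + b2 * x2 + u

# Transformed (homoskedastic) model: y* = b1 + b0 * (1/x1) + b2 * (x2/x1) + u*
ys = y / x1
Xs = np.column_stack([1.0 / x1, np.ones(n), x2 / x1])   # columns match (b0, b1, b2)
beta_wls = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
```

This is exactly the feasible GLS with $V = \mathrm{diag}(x_{1,1}^2, \dots, x_{1,n}^2)$: weighting by $1/x_{1,i}$ restores homoskedasticity, so plain OLS on the starred variables is efficient.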
Maximum Likelihood (ML) estimation. Before discussing the ML estimation of the linear regression model, we overview this crucial estimation method.
Statistical model = {stochastic distribution, sampling scheme} $\Rightarrow$ $f(\text{data}; \theta) = L(\theta)$, the likelihood function. The $g\times 1$ vector of unknown parameters $\theta$ belongs to the (open) space $\Theta$, and the `true' value of $\theta$, $\theta_0$, is an interior point of $\Theta$.
The ML method is parametric because it requires the knowledge of the stochastic distribution of the variables, in addition to the sampling scheme.
Model (classical, to simplify): $y_i = x_i'\beta + u_i$. Our parameters are $\theta := (\beta', \sigma_u^2)'$.
Key hypothesis (normality): $y_i \mid x_i \sim N(x_i'\beta, \sigma_u^2)$, $i = 1, \dots, n$. Recall that $E(y_i \mid x_i) = x_i'\beta$; this implies $u_i \mid x_i \sim N(0, \sigma_u^2)$, $i = 1, \dots, n$.
Thus, for each fixed $j \neq i$,
$\begin{pmatrix} u_i \\ u_j \end{pmatrix} \Big|\, x_i, x_j \sim N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\, \begin{pmatrix} \sigma_u^2 & 0 \\ 0 & \sigma_u^2 \end{pmatrix}\right).$
More generally, we can write
$u \mid X := \begin{pmatrix} u_1 \mid X \\ \vdots \\ u_n \mid X \end{pmatrix} \sim N\!\left(0_{n\times 1},\, \sigma_u^2 I_n\right)$
so that $f(u \mid X; \theta) := \prod_{i=1}^n f(u_i \mid X; \theta)$, where
$f(u_i \mid X; \theta) := \frac{1}{\sigma_u(2\pi)^{1/2}}\,e^{-\frac{1}{2\sigma_u^2}u_i^2} = \frac{1}{\sigma_u(2\pi)^{1/2}}\,e^{-\frac{1}{2\sigma_u^2}(y_i - x_i'\beta)^2}.$
Then $L(\theta) = f(\text{data}; \theta) = \prod_{i=1}^n f(u_i \mid X; \theta)$. It will be convenient to focus on the log-likelihood function:
$\log L(\theta) = \sum_{i=1}^n \log f(u_i \mid X; \theta) = C - \frac{n}{2}\log\sigma_u^2 - \frac{1}{2\sigma_u^2}\sum_{i=1}^n(y_i - x_i'\beta)^2 \equiv C - \frac{n}{2}\log\sigma_u^2 - \frac{1}{2\sigma_u^2}(y - X\beta)'(y - X\beta).$
The ML estimator of $\theta$ is obtained by solving the problem $\max_\theta \log L(\theta)$, i.e.
$\hat\theta_{ML} = \arg\max_\theta \log L(\theta).$
As is known, in order to obtain $\hat\theta_{ML}$ it is necessary to solve the first-order conditions:
$s_n(\theta) = \frac{\partial\log L(\theta)}{\partial\theta} = \begin{pmatrix} \partial\log L(\theta)/\partial\theta_1 \\ \partial\log L(\theta)/\partial\theta_2 \\ \vdots \\ \partial\log L(\theta)/\partial\theta_g \end{pmatrix} = 0_{g\times 1},$
which means that $\hat\theta_{ML}$ is such that $s_n(\hat\theta_{ML}) = 0_{g\times 1}$. The vector $s_n(\theta)$ is known as the score (gradient) of the likelihood function. It is not always possible to solve the first-order conditions analytically; in many circumstances (e.g. nonlinear restrictions) numerical optimization procedures are required.
Note that
$s_n(\theta) = \frac{\partial\log L(\theta)}{\partial\theta} = \frac{\partial\sum_{i=1}^n \log f(u_i \mid x_i; \theta)}{\partial\theta} = \sum_{i=1}^n \frac{\partial\log f(u_i \mid x_i; \theta)}{\partial\theta} = \sum_{i=1}^n s_i(\theta).$
Each component $s_i(\theta)$ of the score depends on the data, hence it is a random variable! To be sure that $\hat\theta_{ML}$ is a maximum (and not a minimum), it is further necessary that the Hessian matrix
$H_n(\theta) = \frac{\partial}{\partial\theta'}s_n(\theta) = \frac{\partial^2\log L(\theta)}{\partial\theta\,\partial\theta'} = \begin{pmatrix} \frac{\partial^2\log L(\theta)}{\partial\theta_1^2} & \frac{\partial^2\log L(\theta)}{\partial\theta_1\partial\theta_2} & \cdots & \frac{\partial^2\log L(\theta)}{\partial\theta_1\partial\theta_g} \\ \vdots & & \ddots & \vdots \\ \frac{\partial^2\log L(\theta)}{\partial\theta_g\partial\theta_1} & \frac{\partial^2\log L(\theta)}{\partial\theta_g\partial\theta_2} & \cdots & \frac{\partial^2\log L(\theta)}{\partial\theta_g^2} \end{pmatrix}_{g\times g}$
be negative (semi)definite at the point $\hat\theta_{ML}$, i.e. $H_n(\hat\theta_{ML}) \leq 0$.
We define the (Fisher) Information Matrix as the quantity
$I_n(\theta) = -E\!\left(\frac{\partial^2\log L(\theta)}{\partial\theta\,\partial\theta'}\right) = -E(H_n(\theta)),$
and it can be shown that, under a set of regularity conditions including the hypothesis of correct specification of the model, one has the equivalence:
$I_n(\theta) = -E\!\left(\frac{\partial^2\log L(\theta)}{\partial\theta\,\partial\theta'}\right) = E\!\left(s_n(\theta)s_n(\theta)'\right) = E\!\left[\left(\frac{\partial\log L(\theta)}{\partial\theta}\right)\left(\frac{\partial\log L(\theta)}{\partial\theta}\right)'\right].$
The Asymptotic Information Matrix is given by the quantity
$I_\infty(\theta) = \lim_{n\to\infty}\frac{1}{n}I_n(\theta) = \lim_{n\to\infty}\frac{1}{n}E(-H_n(\theta)).$
The inverse of the matrix $I_\infty(\theta)$ represents a lower bound for any estimator of $\theta$: this means that any estimator of $\theta$ will have (asymptotic) covariance matrix `greater' than or at most equal to $I_\infty^{-1}(\theta)$.
A crucial requirement for ML estimation is that
$\log L(\theta_0) > \log L(\theta)$ for each $\theta \in \Theta\setminus\{\theta_0\}$,
a condition that ensures the existence of a global maximum. However, situations of the type
$\log L(\theta_0) > \log L(\theta)$ for $\theta \in N_0$,
where $N_0$ is a neighborhood of $\theta_0$, are also potentially fine (local maximum).
Properties of the ML estimator. Crucial assumption: the model is correctly specified (the underlying statistical model is correct).
1. In general, $E(\hat\theta_{ML}) \neq \theta_0$, namely the ML estimator is not correct (unbiased)!
2. Under general conditions the ML estimator is consistent: $\hat\theta_{ML} \to_p \theta_0$.
3. Under general conditions the ML estimator is asymptotically Gaussian:
$n^{1/2}\left(\hat\theta_{ML} - \theta_0\right) \to_D N(0_{g\times 1}, V_{\hat\theta}).$
This property suggests that one can do standard inference in large samples!
4. Under general conditions the ML estimator is asymptotically efficient; in particular,
$V_{\hat\theta} = [I_\infty(\theta)]^{-1}.$
Properties 2–4 are asymptotic properties and make the ML estimator `optimal'.
$\max_\theta \log L(\theta) \;\Rightarrow\; \max_{\beta, \sigma_u^2}\left(C - \frac{n}{2}\log\sigma_u^2 - \frac{1}{2\sigma_u^2}(y - X\beta)'(y - X\beta)\right)$
$\begin{pmatrix} \frac{\partial\log L(\theta)}{\partial\beta} \\ \frac{\partial\log L(\theta)}{\partial\sigma_u^2} \end{pmatrix} = \begin{pmatrix} 0_{k\times 1} \\ 0 \end{pmatrix}.$
The needed derivative rules are in Appendix 3.A.3 (Ch. 3) of the textbook.
The solution of this problem leads us to
$\hat\beta_{ML} := \left(\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\sum_{i=1}^n x_i y_i\right) \equiv (X'X)^{-1}X'y$
$\hat\sigma_u^2 = \frac{1}{n}\sum_{i=1}^n \hat u_i^2 \equiv \frac{1}{n}\hat u'\hat u$
where $\hat u_i = y_i - x_i'\hat\beta_{ML}$, $i = 1, \dots, n$, or, alternatively, $\hat u = (y - X\hat\beta_{ML})$. Recall that, by construction, $s_n(\hat\beta_{ML}, \hat\sigma_u^2) = 0_{(k+1)\times 1}$ and $H_n(\hat\beta_{ML}, \hat\sigma_u^2)$ is negative semidefinite.
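The closed-form ML solution can be cross-checked against a numerical maximization of the Gaussian log-likelihood; a sketch on simulated data (the reparametrization $\sigma_u^2 = e^{\log\sigma_u^2}$ is just a device to keep the variance positive during the optimization):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n, k = 150, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.8]) + rng.normal(size=n)

def neg_loglik(theta):
    # theta = (beta_1, ..., beta_k, log sigma_u^2); constant C is dropped
    beta, log_s2 = theta[:k], theta[k]
    s2 = np.exp(log_s2)
    e = y - X @ beta
    return 0.5 * n * np.log(s2) + (e @ e) / (2.0 * s2)

res = minimize(neg_loglik, x0=np.zeros(k + 1), method="BFGS")
beta_ml, s2_ml = res.x[:k], np.exp(res.x[k])

# Closed form: beta_ML = OLS formula; sigma2_ML = u'u / n (no df correction)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
s2_closed = ((y - X @ beta_ols) @ (y - X @ beta_ols)) / n
```

The numerical optimizer recovers the closed-form solution: $\hat\beta_{ML}$ coincides with OLS, while $\hat\sigma_u^2$ divides by $n$ rather than $n-k$.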
Constrained estimation. Given $y = X\beta + u$ with $u \mid X \sim N(0_{n\times 1}, \sigma_u^2 I_n)$, general (linear) restrictions on $\beta$ can be represented either in implicit form or in explicit form.
Implicit form:
$R_{q\times k}\,\beta_{k\times 1} := r_{q\times 1}$
or, alternatively,
$R_{q\times k}\,\beta_{k\times 1} - r_{q\times 1} := 0_{q\times 1}$
where $q$ is the number of restrictions.
In $R_{q\times k}\,\beta_{k\times 1} - r_{q\times 1} := 0_{q\times 1}$, $R$ is a known $q\times k$ matrix whose rows select the elements of $\beta$ that must be restricted; $r$ is a known $q\times 1$ vector. EXAMPLES: BLACKBOARD!
Explicit form:
$\beta_{k\times 1} := H_{k\times a}\,\varphi_{a\times 1} + h_{k\times 1}$, where $a = (k - q)$;
$H$ is a known $k\times a$ selection matrix; $\varphi$ is the $a\times 1$ vector that contains the elements of $\beta$ which are not subject to restrictions (free parameters); $h$ is a known $k\times 1$ selection vector. EXAMPLES: BLACKBOARD!
One can write the linear constraints on $\beta$ either in implicit or in explicit form. The two methods are alternative but equivalent. For certain problems the implicit-form representation is convenient; for others, the explicit-form representation is. It is therefore clear that there must exist a connection between the matrices $R$ and $H$ and the vectors $r$ and $h$.
In particular, take the explicit form $\beta := H\varphi + h$ and multiply both sides by $R$, obtaining $R\beta := RH\varphi + Rh$. Then, since the right-hand side of the expression above must be equal to $r$, it must hold that
$RH := 0_{q\times a}$, $\qquad Rh := r.$
This means that the (rows of the) matrix $R$ lie in the null column space of the matrix $H$ ($sp(R') \subseteq sp(H_\perp)$).
Recall: let $A$ be an $n\times m$ matrix, $n > m$, with (column) rank $m$; $A = [a_1 : a_2 : \cdots : a_m]$.
$sp(A) = \{\text{all linear combinations of } a_1, a_2, \dots, a_m\} \subseteq R^n$
$sp(A_\perp) = \{v : v'a_i = 0 \text{ for each } i = 1, \dots, m\} \subseteq R^n$;
all the vectors in $sp(A_\perp)$ are said to lie in the null column space of the matrix $A$. Let $A_\perp$ be an $n\times(n-m)$ full column rank matrix that satisfies
$A_\perp'A = 0_{(n-m)\times m}$, $\qquad A'A_\perp = 0_{m\times(n-m)}.$
$A_\perp$ is called the orthogonal complement of $A$. The $n\times n$ matrix $[A : A_\perp]$ has rank $n$ and forms a basis of $R^n$.
Constrained estimation means that one estimates the model under $H_0$, i.e. imposing the restrictions implied by the null hypothesis. Suppose that $H_0: R\beta := r$ ($H_0: \beta := H\varphi + h$) has been accepted by the data. Then we wish to re-estimate the null model, i.e. the econometric model that embodies the linear restrictions. In this case, using the restrictions in explicit form is convenient.
Indeed, the null model can be written as $y = X[H\varphi + h] + u$ and re-arranged as
$y^*_{n\times 1} = X^*_{n\times a}\,\varphi_{a\times 1} + u_{n\times 1}$
where $y^* := y - Xh$ and $X^* := XH$. We obtain a transformed linear regression model whose parameters are no longer in $\theta = (\beta', \sigma_u^2)'$ but in $\theta^* = (\varphi', \sigma_u^2)'$. Recall that $\varphi$ is the $a\times 1$ vector, $a = k - q$, that contains the unrestricted parameters, i.e. those parameters which are not affected by $H_0$.
We know very well how to estimate $\varphi$ and $\sigma_u^2$ and their properties:
$\hat\varphi = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* = (H'X'XH)^{-1}H'X'(y - Xh)$
where $\hat\varphi = \hat\varphi_{CML}$ or $\hat\varphi = \hat\varphi_{CLS}$;
$\tilde\sigma_u^2 = \frac{1}{n-a}(y^* - X^*\hat\varphi)'(y^* - X^*\hat\varphi).$
The constrained ML or LS estimators of $\beta$ are obtained indirectly as
$\tilde\beta = \hat\beta_{CML} = H\hat\varphi_{CML} + h$, $\qquad \tilde\beta = \hat\beta_{CLS} = H\hat\varphi_{CLS} + h$,
and inherit the same properties as $\hat\varphi_{CML}$.
Obviously, the (conditional) covariance matrix of the restricted estimator $\tilde\beta$ will be
$Var(\tilde\beta \mid X) := Var(H\hat\varphi_{CLS} + h \mid X) = H\,Var(\hat\varphi_{CLS} \mid X)\,H' = \sigma_u^2 H(X^{*\prime}X^*)^{-1}H' = \sigma_u^2 H\left(H'(X'X)H\right)^{-1}H'.$
It can be proved that
$\sigma_u^2 H\left(H'(X'X)H\right)^{-1}H' \leq \sigma_u^2(X'X)^{-1}$
regardless of whether $H_0$ is true or not! This means that imposing restrictions on $\beta$ has the effect of reducing the variability of the estimator (and thus the standard errors of the single coefficients). Of course, this operation is `correct' only when $H_0$ is supported by the data.
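The explicit-form machinery can be sketched for a simple restriction; here $k = 3$ with the single restriction $\beta_3 = 0$ (so $q = 1$, $a = 2$), using invented data and selection matrices:

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)   # H0: beta_3 = 0 is true

# Explicit form of H0: beta = H phi + h, with q = 1 restriction, a = k - q = 2
H = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
h = np.zeros(k)

# Transformed model: y* = X* phi + u, with y* = y - X h and X* = X H
ys, Xs = y - X @ h, X @ H
phi_hat = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
beta_tilde = H @ phi_hat + h            # constrained estimator of beta

# Corresponding implicit form: R beta = r with R = [0 0 1], r = 0
R, r = np.array([[0.0, 0.0, 1.0]]), np.zeros(1)
```

By construction $RH = 0_{q\times a}$, $Rh = r$, and the constrained estimator satisfies the restriction exactly: its third element is identically zero.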
In `traditional' textbook econometrics the derivation of the constrained estimator is usually obtained by exploiting the restrictions in implicit form, solving either
$\min_\beta\;Q(\beta)$ subject to $R\beta - r = 0$
or
$\max_\theta\;\log L(\theta)$ subject to $R\beta - r := 0.$
In order to solve these problems it is necessary to apply a technique based on Lagrange multipliers. For instance, in the case of ML estimation, the problem above amounts to solving $\max_{\theta, \lambda}\mathcal{L}(\theta, \lambda)$, where
$\mathcal{L}(\theta, \lambda) = \log L(\theta) + \lambda'_{1\times q}(R\beta - r)_{q\times 1}$
is the Lagrangean function and $\lambda$ is the vector of Lagrange multipliers. Each element of $\lambda$ gives a weight to the corresponding linear restriction. By using the restrictions in explicit form we stick to the standard framework without the need of constrained estimation techniques. More specifically, we have turned a restricted estimation problem into an unrestricted estimation problem ($\varphi$ is estimated unrestrictedly!).
Testing problem. Given $y_i = x_i'\beta + u_i$, $i = 1, \dots, n$, our objective is testing
$H_0: R\beta := r$ ($H_0: \beta := H\varphi + h$) vs $H_1: R\beta \neq r$ ($H_1: \beta \neq H\varphi + h$).
A testing problem requires a probabilistic decision rule that allows the researcher to establish whether the evidence provided by the data is closer to $H_0$ or to $H_1$.
This decision is based on the following ingredients:
(a) A pre-fixed nominal type I error (or significance level) $\alpha = \Pr(\text{rejecting } H_0 \mid H_0)$. The cases we typically address are such that $\alpha$ is actually an asymptotic error, in the sense that $\Pr(\text{reject } H_0 \mid H_0)$ can be evaluated asymptotically (for large $n$);
(b) A test statistic $\hat G_n = G_n(\hat\theta)$, which is a function of $\hat\theta$ and whose distribution under $H_0$ is known, at least asymptotically (later on we will see that $\hat G_n = G_n(\hat\theta)$ usually belongs to one of three families);
(c) A decision rule of the type: reject $H_0$ if $G_n(\hat\theta) > cv_{(1-\alpha)}$, where $cv_{(1-\alpha)}$ is the $100(1-\alpha)$ percentile taken from the (asymptotic) distribution of $\hat G_n = G_n(\hat\theta)$ under $H_0$;
(d) To understand whether the test does a good job, one should be able to evaluate the function
$\pi_\infty(\theta) = \lim_{n\to\infty}\Pr(\text{reject } H_0 \mid H_1) = \Pr\!\left(G_\infty(\hat\theta) \geq cv_{(1-\alpha)} \mid H_1\right),$
known as the asymptotic power function. A `desired' test should be such that, once one fixes the type I error $\alpha$ (usually 5% or 10%), the test is consistent, meaning that
$\pi_\infty(\theta) = \lim_{n\to\infty}\Pr(\text{reject } H_0 \mid H_1) = \Pr\!\left(G_\infty(\hat\theta) \geq cv_{(1-\alpha)} \mid H_1\right) \to 1.$
When estimation is performed with ML, we can classify the test statistics $\hat G_n = G_n(\hat\theta)$ into three families, known as Wald, Lagrange Multiplier (LM) and Likelihood Ratio (LR) tests. However, the existence of these three families is not solely confined to ML estimation but can be extended to other classes of estimators as well.
General philosophy:
Wald: these tests are based on the idea of checking whether the unrestricted (ML) estimator of $\theta$ (i.e. the estimator obtained without imposing any constraint) is `close in a statistical sense' to $H_0$, namely whether
$R\hat\beta - r \approx 0_{q\times 1},$
which is what one would expect if $H_0$ is true. Computing a Wald-type test requires estimating the regression model one time, without any restriction.
LM: these tests are based on the estimation of the null model (i.e. the regression model under $H_0$), hence on the constrained estimator of $\theta$. Let $\hat\theta$ be the unconstrained estimator of $\theta$ (obtained under $H_1$) and $\tilde\theta$ its constrained counterpart (obtained under $H_0$). The idea is that, if $H_0$ is true, the score of the unrestricted model, evaluated at the constrained estimator (recall that $s_n(\hat\theta) = 0_{k\times 1}$), should be `close to zero in a statistical sense', i.e.
$s_n(\tilde\theta) \approx 0_{k\times 1}.$
Computing an LM test requires estimating the regression model one time, under the restrictions implied by $H_0$ (however, it also requires that we know the structure of the score of the unrestricted model!).
LR: these tests are based on the idea of estimating the linear regression model both under $H_0$ ($\tilde\theta$) and without restrictions ($\hat\theta$); estimation is carried out two times. If $H_0$ is true, the distance between the likelihood functions of the null and unrestricted models should not be too large (note that one always has $\log L(\tilde\theta) \leq \log L(\hat\theta)$).
Recall: let $v$ be a $p\times 1$ stochastic vector such that $v \sim N(0_{p\times 1}, V)$, where $V$ is a $p\times p$ covariance matrix (hence symmetric positive definite). Then
$v'V^{-1}v \sim \chi^2(p).$
The same result holds if `$\sim$' is replaced with `$\to_D$'. Note also that $\|v\| = (v'v)^{1/2}$ is the Euclidean norm of the vector $v$ (a measure of its length in the space $R^p$, i.e. a measure of the distance of the vector $v$ from the vector $0_{p\times 1}$). One can generalize this measure by defining the norm $\|v\|_A = (v'Av)^{1/2}$, where $A$ is a symmetric positive definite matrix; this norm measures the distance of $v$ from $0_{p\times 1}$ `weighted' by the elements of the matrix $A$. This means that the random variable (quadratic form) $v'V^{-1}v$ measures a weighted distance of $v$ from $0_{p\times 1}$.
Wald test. Let $\hat\beta$ be either the OLS or the ML estimator of $\beta$ in the classical regression model with cross-section data, under the standard assumptions. We start from the (Gaussian) assumption $u \mid X \sim N(0_{n\times 1}, \sigma_u^2 I_n)$, which implies that
$\hat\beta \mid X \sim N\!\left(\beta,\, \sigma_u^2(X'X)^{-1}\right).$
From the properties of the multivariate Gaussian distribution we obtain
$R(\hat\beta - \beta)_{q\times 1} \mid X \sim N\!\left(0_{q\times 1},\, \sigma_u^2 R(X'X)^{-1}R'\right).$
The result above implies that, under $H_0: R\beta := r$,
$(R\hat\beta - r)_{q\times 1} \mid X \sim N\!\left(0_{q\times 1},\, \sigma_u^2 R(X'X)^{-1}R'\right).$
Interpretation: the left-hand side can be regarded as a measure of the distance of the $q\times 1$ (stochastic) vector $R\hat\beta$ from the known vector $r$ (or the distance of $R\hat\beta - r$ from the vector $0_{q\times 1}$); moreover, we have a symmetric positive definite matrix ($\sigma_u^2 R(X'X)^{-1}R'$) with respect to which we can weight this distance.
Imagine first that $\sigma_u^2$ is known (unrealistic). We define the test statistic (a quadratic form in a Gaussian vector)
$W_n = (R\hat\beta - r)'\left\{\sigma_u^2 R(X'X)^{-1}R'\right\}^{-1}(R\hat\beta - r).$
From the properties of the multivariate Gaussian distribution: $W_n \mid X \sim \chi^2(q)$. Thus, fixed $\alpha$, one rejects $H_0$ if $W_n \geq cv_{(1-\alpha)}$, where $cv_{(1-\alpha)}$ is the $100(1-\alpha)$ percentile taken from the $\chi^2(q)$ distribution ($\Pr[\chi^2(q) > cv_{(1-\alpha)}] = \alpha$). One accepts $H_0$ otherwise.
Since $\sigma_u^2$ is generally unknown, we replace it with the estimator $\hat\sigma_u^2$. Recall that, under our assumptions,
$\hat G \mid X \sim \chi^2(n-k)$, $\quad \hat G := (n-k)\,\frac{\hat\sigma_u^2}{\sigma_u^2}.$
Recall also that, from the theory of distributions, if $G_1 \sim \chi^2(q_1)$, $G_2 \sim \chi^2(q_2)$, and $G_1$ and $G_2$ are independent, then
$\frac{G_1/q_1}{G_2/q_2} \sim F(q_1, q_2).$
Now define the test statistic
$W_n = \frac{(R\hat\beta - r)'\left\{\hat\sigma_u^2 R(X'X)^{-1}R'\right\}^{-1}(R\hat\beta - r)}{q} = \frac{(R\hat\beta - r)'\left\{R(X'X)^{-1}R'\right\}^{-1}(R\hat\beta - r)}{q\,\hat\sigma_u^2} = \frac{(R\hat\beta - r)'\left\{R(X'X)^{-1}R'\right\}^{-1}(R\hat\beta - r)\big/(q\,\sigma_u^2)}{\hat\sigma_u^2/\sigma_u^2} = \frac{(R\hat\beta - r)'\left\{R(X'X)^{-1}R'\right\}^{-1}(R\hat\beta - r)\big/(q\,\sigma_u^2)}{\hat G/(n-k)}.$
It can be proved that the two quadratic forms in the numerator and in the denominator are independent.
Accordingly,
$W_n \mid X = \frac{(R\hat\beta - r)'\left\{R(X'X)^{-1}R'\right\}^{-1}(R\hat\beta - r)\big/(q\,\sigma_u^2)}{\hat G/(n-k)} \,\Big|\, X \sim \frac{\chi^2(q)/q}{\chi^2(n-k)/(n-k)} \sim F(q, n-k).$
Thus, fixed $\alpha$, one rejects $H_0$ if $W_n \geq cv_{(1-\alpha)}$, where $cv_{(1-\alpha)}$ is the $100(1-\alpha)$ percentile taken from the $F(q, n-k)$ distribution ($\Pr[F(q, n-k) > cv_{(1-\alpha)}] = \alpha$). One accepts $H_0$ otherwise. General linear hypotheses on $\beta$ can be tested by using the F-distribution. This is true if $u \mid X \sim N(0_{n\times 1}, \sigma_u^2 I_n)$, irrespective of whether $n$ is small or large. Note that $F(q, n-k) \to \chi^2(q)/q$ for $n \to \infty$.
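The F-form of the Wald statistic can be sketched end-to-end; here the joint hypothesis $\beta_2 = \beta_3 = 0$ ($q = 2$) is tested on simulated data for which $H_0$ is true:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, k, q = 100, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.0, 0.0]) + rng.normal(size=n)   # H0: beta_2 = beta_3 = 0 true

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat
s2_hat = u_hat @ u_hat / (n - k)                          # sigma_u^2 hat, df-corrected

# H0: R beta = r, with R selecting the last two coefficients
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
r = np.zeros(q)

d = R @ beta_hat - r
W = d @ np.linalg.solve(s2_hat * (R @ XtX_inv @ R.T), d) / q   # F-form Wald statistic
p_value = stats.f.sf(W, q, n - k)                              # F(q, n-k) tail probability
```

One rejects $H_0$ when `p_value` falls below the chosen $\alpha$; with this data-generating process the test typically does not reject.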
LM and LR tests will be reviewed when we deal with the regression model based on time series data.