ECONOMETRICS

Bruce E. Hansen

© 2000, 2001, 2002, 2003, 2004
University of Wisconsin

Revised: January 2004
Comments Welcome

This manuscript may be printed and reproduced for individual or instructional use, but may not be printed for commercial purposes.

Contents

1 Ordinary Least Squares: Framework; Estimation; Efficiency; Model in Matrix Notation; Residual Regression; Consistency; Asymptotic Normality; Covariance Matrix Estimation; Functions of Parameters; t tests; Confidence Intervals; Wald Tests; F Tests

2 Regression Models: Regression; Bias and Variance of OLS Estimator; Multicollinearity; Forecast Intervals; NonLinearity in Regressors; NonLinear Least Squares; Normal Regression Model; Least Absolute Deviations

3 Model Selection: Omitted Variables; Irrelevant Variables; Model Selection; Testing for Omitted NonLinearity; log(Y) versus Y as Dependent Variable

4 Generalized Least Squares: GLS and the Gauss-Markov Theorem; Skedastic Regression; Estimation of Skedastic Regression; Testing for Heteroskedasticity; Feasible GLS Estimation; Covariance Matrix Estimation; Commentary: FGLS versus OLS

5 Generalized Method of Moments: Overidentified Linear Model; GMM Estimator; Distribution of GMM Estimator; Estimation of the Efficient Weight Matrix; GMM: The General Case; Over-Identification Test; Hypothesis Testing: The Distance Statistic; Conditional Moment Restrictions

6 Empirical Likelihood: Non-Parametric Likelihood; Asymptotic Distribution of EL Estimator; Overidentifying Restrictions Testing; Numerical Computation (Derivatives, Inner Loop, Outer Loop)

7 Endogeneity: Instrumental Variables; Reduced Form; Identification; Estimation; Special Cases: IV and 2SLS; Bekker Asymptotics; Identification Failure

8 The Bootstrap: Monte Carlo Simulation; An Example; The Empirical Distribution Function; Definition of the Bootstrap; Bootstrap Estimation of Bias and Variance; Percentile Intervals; Percentile-t Equal-Tailed Interval; Symmetric Percentile-t Intervals; Asymptotic Expansions; One-Sided Tests; Symmetric Two-Sided Tests; Percentile Confidence Intervals; Bootstrap Methods for Regression Models; Bootstrap GMM Inference

9 Univariate Time Series: Stationarity and Ergodicity; Autoregressions; Stationarity of AR(1) Process; Lag Operator; Stationarity of AR(k); Estimation; Asymptotic Distribution; Bootstrap for Autoregressions; Trend Stationarity; Testing for Omitted Serial Correlation; Model Selection; Autoregressive Unit Roots

10 Multivariate Time Series: Vector Autoregressions (VARs); Estimation; Restricted VARs; Single Equation from a VAR; Testing for Omitted Serial Correlation; Selection of Lag Length in a VAR; Granger Causality; Cointegration; Cointegrated VARs

11 Limited Dependent Variables: Binary Choice; Count Data; Censored Data; Sample Selection

12 Panel Data: Individual-Effects Model; Fixed Effects; Dynamic Panel Regression

13 Nonparametrics: Kernel Density Estimation; Asymptotic MSE for Kernel Estimates

Appendix A: Mathematical Formula

Appendix B: Matrix Algebra: Terminology; Matrix Multiplication; Trace, Inverse, Determinant; Eigenvalues; Idempotent and Projection Matrices; Kronecker Products and the Vec Operator; Matrix Calculus

Appendix C: Probability: Foundations; Random Variables; Expectation; Common Distributions; Multivariate Random Variables; Conditional Distributions and Expectation; Transformations; Normal and Related Distributions; Maximum Likelihood

Appendix D: Asymptotic Theory: Inequalities; Convergence in Probability; Almost Sure Convergence; Convergence in Distribution; Asymptotic Transformations

Appendix E: Numerical Optimization: Grid Search; Gradient Methods; Derivative-Free Methods

Chapter 1

Ordinary Least Squares

1.1 Framework

An econometrician has observational data
$$\{(y_1, x_1), (y_2, x_2), \ldots, (y_i, x_i), \ldots, (y_n, x_n)\} = \{(y_i, x_i) : i = 1, \ldots, n\}$$
where each pair $(y_i, x_i) \in \mathbb{R} \times \mathbb{R}^k$ is an observation on an individual (e.g., household or firm). We call these observations the sample.

Notice that the observations are paired $(y_i, x_i)$. We call $y_i$ the dependent variable and $x_i$ the regressor vector.

For convenience, the vector $x_i$ is typically presumed to include a constant. That is, one element (typically written as the first) equals 1. We can write the $k \times 1$ regressor $x_i$ as
$$x_i = \begin{pmatrix} x_{1i} \\ x_{2i} \\ \vdots \\ x_{ki} \end{pmatrix} = \begin{pmatrix} 1 \\ x_{2i} \\ \vdots \\ x_{ki} \end{pmatrix}.$$

If the data is cross-sectional (each observation is a different individual) it is often reasonable to assume the observations are mutually independent. If the data is randomly gathered, it is reasonable to model each observation as a random draw from the same probability distribution. Thus the data are independent and identically distributed, or iid. We call this a random sample. Sometimes the label iid is misconstrued. It means that the pair $(y_i, x_i)$ is independent of the pair $(y_j, x_j)$ for $i \neq j$. It is not a statement about the relationship between $y_i$ and $x_i$.

The random variables $(y_i, x_i)$ have a distribution $F$ which we call the population. This "population" is infinitely large. Sometimes this is a source of confusion, but it is merely an abstraction. This distribution is unknown, and the goal of statistical inference is to learn about features of $F$ from the sample.

It is unimportant whether the observations $y_i$ and $x_i$ come from continuous or discrete distributions. For example, many regressors in econometric practice are binary, taking on only the values 0 and 1, and are typically called dummy variables.

A linear regression model for $y_i$ given $x_i$ takes the form
$$y_i = \beta_1 + x_{2i}\beta_2 + \cdots + x_{ki}\beta_k + e_i, \qquad i = 1, \ldots, n \qquad (1.1)$$
where $\beta_1$ through $\beta_k$ are parameters and $e_i$ is the error. The parameter vector is written as
$$\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}.$$
We can then write (1.1) more compactly as
$$y_i = x_i'\beta + e_i, \qquad i = 1, \ldots, n. \qquad (1.2)$$

The model is incomplete without a description of the error $e_i$. It should have mean zero and a finite variance, and be uncorrelated with the regressors. We state the needed conditions here.

Assumption 1.1.1
1. $E(e_i) = 0$
2. $E(x_i e_i) = 0$
3. $\sigma^2 = E e_i^2 < \infty$
4. $E x_i'x_i < \infty$
5. $Q = E x_i x_i' > 0$

Assumptions 1.1.1.3 and 1.1.1.4 are made to guarantee that all variables in the model have a finite variance. This is necessary to ensure that $E(x_i e_i)$ is well defined. Indeed, by the Cauchy-Schwarz inequality,
$$E|x_i e_i| \le \left(E|x_i|^2\right)^{1/2}\left(E|e_i|^2\right)^{1/2} < \infty$$
under these assumptions.

We can use Assumption 1.1.1 to derive a moment representation for the parameter vector $\beta$. Take equation (1.2) and pre-multiply by $x_i$:
$$x_i y_i = x_i x_i'\beta + x_i e_i.$$

Now take expectations:
$$E(x_i y_i) = E(x_i x_i')\beta + E(x_i e_i) = E(x_i x_i')\beta,$$
where the second equality uses Assumption 1.1.1.2. Since $E(x_i x_i')$ is invertible by Assumption 1.1.1.5, we can solve for $\beta$:
$$\beta = \left(E(x_i x_i')\right)^{-1} E(x_i y_i). \qquad (1.3)$$
Thus the parameter $\beta$ is an explicit function of population second moments of $(y_i, x_i)$.

In fact, this derivation shows that if $\beta$ is defined by (1.3), then Assumption 1.1.1.2 must hold true by construction. In this sense, Assumption 1.1.1.2 is very weak. However, it is important not to misinterpret this statement. In many economic models, the parameter $\beta$ may be defined within the model, rather than by construction as in (1.3). In this case (1.3) may not hold. These structural models require alternative estimation methods, and are discussed in Chapter 5.

To emphasize this distinction, we may describe the model of this section as a linear projection model rather than a linear regression model. This is an accurate label, as equation (1.3) shows that $e_i$ is explicitly defined as a projection error. However, conventional econometric practice labels (1.2) as a linear regression model, so we will adhere to this convention.

1.2 Estimation

Equation (1.3) writes the regression parameter $\beta$ as an explicit function of the population moments $E(x_i y_i)$ and $E(x_i x_i')$. Their moment estimators are the sample moments
$$\hat{E}(x_i y_i) = \frac{1}{n}\sum_{i=1}^n x_i y_i, \qquad \hat{E}(x_i x_i') = \frac{1}{n}\sum_{i=1}^n x_i x_i'.$$
It follows that the moment estimator of $\beta$ is (1.3) with the population moments replaced by the sample moments:
$$\hat{\beta} = \left(\frac{1}{n}\sum_{i=1}^n x_i x_i'\right)^{-1}\frac{1}{n}\sum_{i=1}^n x_i y_i = \left(\sum_{i=1}^n x_i x_i'\right)^{-1}\sum_{i=1}^n x_i y_i. \qquad (1.4)$$

Another way to derive $\hat{\beta}$ is as follows. Observe that Assumption 1.1.1.2 can be written in the parametric form
$$E\left(x_i\left(y_i - x_i'\beta\right)\right) = 0. \qquad (1.5)$$

The function $E(x_i(y_i - x_i'\beta))$ can be estimated by
$$\hat{E}\left(x_i\left(y_i - x_i'\beta\right)\right) = \frac{1}{n}\sum_{i=1}^n x_i\left(y_i - x_i'\beta\right)$$
and $\hat{\beta}$ is the value which sets this equal to zero:
$$0 = \frac{1}{n}\sum_{i=1}^n x_i\left(y_i - x_i'\hat{\beta}\right) = \frac{1}{n}\sum_{i=1}^n x_i y_i - \frac{1}{n}\sum_{i=1}^n x_i x_i'\hat{\beta} \qquad (1.6)$$
whose solution is (1.4).

There is another classic motivation for the estimator (1.4). Define the sum-of-squared-errors (SSE) function
$$S_n(\beta) = \sum_{i=1}^n \left(y_i - x_i'\beta\right)^2.$$
The Ordinary Least Squares (OLS) estimator is the value of $\beta$ which minimizes $S_n(\beta)$. Observe that we can write the latter as
$$S_n(\beta) = \sum_{i=1}^n y_i^2 - 2\beta'\sum_{i=1}^n x_i y_i + \beta'\sum_{i=1}^n x_i x_i'\beta.$$
Vector calculus (see the matrix calculus section of Appendix B) gives the first-order conditions for minimization:
$$\frac{\partial}{\partial\beta}S_n(\hat{\beta}) = -2\sum_{i=1}^n x_i y_i + 2\sum_{i=1}^n x_i x_i'\hat{\beta} = 0,$$
whose solution is (1.4). Following convention, we will call $\hat{\beta}$ the OLS estimator of $\beta$.

As a by-product of OLS estimation, we define the predicted value
$$\hat{y}_i = x_i'\hat{\beta}$$
and the residual
$$\hat{e}_i = y_i - \hat{y}_i = y_i - x_i'\hat{\beta}.$$
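To make the algebra concrete, here is a minimal numpy sketch of the estimator (1.4) with its predicted values and residuals, using simulated data (the data-generating values below are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3

# Regressor matrix with a constant in the first column, as in Section 1.1.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 0.5, -0.25])        # illustrative values
y = X @ beta_true + rng.normal(size=n)         # e_i ~ N(0,1) for the simulation

# OLS estimator (1.4): beta_hat = (X'X)^{-1} X'y.
# lstsq is the numerically preferred way to solve the normal equations.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta_hat      # predicted values
e_hat = y - y_hat         # residuals
```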

Note that $y_i = \hat{y}_i + \hat{e}_i$. It is important to understand the distinction between the error $e_i$ and the residual $\hat{e}_i$. The error is unobservable, while the residual is a by-product of estimation. These two variables are frequently mislabeled, which can cause confusion.

Equation (1.6) implies that
$$\frac{1}{n}\sum_{i=1}^n x_i\hat{e}_i = 0.$$
Thus the sample correlation between the regressors and the residual is zero. Furthermore, since $x_i$ (typically) contains a constant, one implication is that
$$\frac{1}{n}\sum_{i=1}^n \hat{e}_i = 0.$$
Thus the residuals have a sample mean of zero. These are algebraic results, and hold true for all linear regression estimates.

The error variance $\sigma^2$ is also a parameter of interest. A method of moments estimator for it is the sample average
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \hat{e}_i^2.$$
The error variance $\sigma^2$ measures the variation in the "unexplained" part of the regression. A measure of the explained variation relative to the total variation is the coefficient of determination or R-squared,
$$R^2 = 1 - \frac{\hat{\sigma}^2}{\hat{\sigma}_y^2}$$
where
$$\hat{\sigma}_y^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2$$
is the sample variance of $y_i$. The $R^2$ is frequently mislabeled as a measure of "fit". It is an inappropriate label, as the value of $R^2$ does not aid in the interpretation of parameter estimates or test statistics.
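Continuing the sketch above (again illustrative, not the text's own code), the variance estimator and $R^2$ follow directly:

```python
sigma2_hat = np.mean(e_hat**2)           # method of moments estimator of sigma^2
sigma2_y = np.mean((y - y.mean())**2)    # sample variance of y_i
r_squared = 1 - sigma2_hat / sigma2_y
```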

1.3 Efficiency

Is the OLS estimator efficient, in the sense of achieving the smallest possible mean-squared error among feasible estimators? The answer was affirmatively provided by Chamberlain (1987).

Suppose that the joint distribution of $(y_i, x_i)$ is discrete. That is, for finite $r$,
$$P\left(y_i = \tau_j, \; x_i = \xi_j\right) = \pi_j, \qquad j = 1, \ldots, r$$
for some constants $\tau_j$ and $\pi_j$ and vectors $\xi_j$. Assume that the $\tau_j$ and $\xi_j$ are known, but the $\pi_j$ are unknown. (We know the values $y_i$ and $x_i$ can take, but we don't know the probabilities.)

In this discrete setting, the moment condition (1.5) can be rewritten as
$$\sum_{j=1}^r \pi_j \xi_j\left(\tau_j - \xi_j'\beta\right) = 0. \qquad (1.7)$$
By the implicit function theorem, $\beta$ is a function of $(\pi_1, \ldots, \pi_r)$.

As the data are multinomial, the maximum likelihood estimator (MLE) is
$$\hat{\pi}_j = \frac{1}{n}\sum_{i=1}^n 1\left(y_i = \tau_j\right)1\left(x_i = \xi_j\right)$$
for $j = 1, \ldots, r$, where $1(\cdot)$ is the indicator function. That is, $\hat{\pi}_j$ is the percentage of the observations which fall in each category. The MLE $\hat{\beta}_{mle}$ for $\beta$ is then the function of $(\hat{\pi}_1, \ldots, \hat{\pi}_r)$ which satisfies the analog of (1.7) with the $\pi_j$ replaced by the $\hat{\pi}_j$:
$$\sum_{j=1}^r \hat{\pi}_j \xi_j\left(\tau_j - \xi_j'\hat{\beta}_{mle}\right) = 0.$$
Substituting in the expressions for $\hat{\pi}_j$,
$$0 = \sum_{j=1}^r \left(\frac{1}{n}\sum_{i=1}^n 1\left(y_i = \tau_j\right)1\left(x_i = \xi_j\right)\right)\xi_j\left(\tau_j - \xi_j'\hat{\beta}_{mle}\right) = \frac{1}{n}\sum_{i=1}^n x_i\left(y_i - x_i'\hat{\beta}_{mle}\right).$$
But this is the same expression as (1.6), which means that $\hat{\beta}_{mle} = \hat{\beta}_{ols}$. In other words, if the data have a discrete distribution, the maximum likelihood estimator is simply the OLS estimator. Since this is a regular parametric model, the MLE is asymptotically efficient, and thus so is the OLS estimator.

Chamberlain (1987) extends this argument to the case of continuously-distributed data. He observes that the above argument holds for all multinomial distributions, and any continuous distribution can be arbitrarily well approximated by a multinomial distribution. He proves that generically the OLS estimator is asymptotically efficient for the class of regression models satisfying Assumption 1.1.1.

1.4 Model in Matrix Notation

For some purposes, including computation, it is convenient to write the model and statistics in matrix notation. We define
$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad X = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix}, \qquad e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}.$$
Observe that $Y$ and $e$ are $n \times 1$ vectors, and $X$ is an $n \times k$ matrix.

The linear regression model (1.2) is a system of $n$ equations, one for each observation. We can stack these $n$ equations together as
$$y_1 = x_1'\beta + e_1, \quad y_2 = x_2'\beta + e_2, \quad \ldots, \quad y_n = x_n'\beta + e_n,$$
or equivalently
$$Y = X\beta + e.$$

Sample sums can also be written in matrix notation. For example,
$$\sum_{i=1}^n x_i x_i' = X'X, \qquad \sum_{i=1}^n x_i y_i = X'Y.$$
Thus the estimator (1.4), residual vector, and sample error variance can be written as
$$\hat{\beta} = \left(X'X\right)^{-1}X'Y, \qquad \hat{e} = Y - X\hat{\beta}, \qquad \hat{\sigma}^2 = n^{-1}\hat{e}'\hat{e}.$$

Define the projection matrices
$$P = X\left(X'X\right)^{-1}X', \qquad M = I_n - P.$$

Then
$$\hat{Y} = X\hat{\beta} = X\left(X'X\right)^{-1}X'Y = PY$$
and
$$\hat{e} = Y - X\hat{\beta} = Y - PY = (I_n - P)Y = MY. \qquad (1.8)$$
Another way of writing this is
$$Y = (P + M)Y = PY + MY = \hat{Y} + \hat{e}.$$
This decomposition is orthogonal, that is,
$$\hat{Y}'\hat{e} = (PY)'(MY) = Y'PMY = 0.$$

1.5 Residual Regression

Partition
$$X = \begin{bmatrix} X_1 & X_2 \end{bmatrix} \qquad \text{and} \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}.$$
Then the regression model can be rewritten as
$$Y = X_1\beta_1 + X_2\beta_2 + e. \qquad (1.9)$$
Observe that the OLS estimator of $\beta = (\beta_1', \beta_2')'$ can be obtained by regression of $Y$ on $X = [X_1 \; X_2]$. OLS estimation can be written as
$$Y = X_1\hat{\beta}_1 + X_2\hat{\beta}_2 + \hat{e}. \qquad (1.10)$$
Using the partitioned matrix inversion formula (5.1),
$$\begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}^{-1} = \begin{pmatrix} \left(X_1'M_2X_1\right)^{-1} & -\left(X_1'M_2X_1\right)^{-1}X_1'X_2\left(X_2'X_2\right)^{-1} \\ -\left(X_2'M_1X_2\right)^{-1}X_2'X_1\left(X_1'X_1\right)^{-1} & \left(X_2'M_1X_2\right)^{-1} \end{pmatrix}$$
where
$$M_1 = I_n - X_1\left(X_1'X_1\right)^{-1}X_1', \qquad M_2 = I_n - X_2\left(X_2'X_2\right)^{-1}X_2'. \qquad (1.11)$$

Thus
$$\begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix} = \begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}^{-1}\begin{pmatrix} X_1'Y \\ X_2'Y \end{pmatrix} = \begin{pmatrix} \left(X_1'M_2X_1\right)^{-1}X_1'M_2Y \\ \left(X_2'M_1X_2\right)^{-1}X_2'M_1Y \end{pmatrix} = \begin{pmatrix} \left(\tilde{X}_1'\tilde{X}_1\right)^{-1}\tilde{X}_1'\tilde{Y}_1 \\ \left(\tilde{X}_2'\tilde{X}_2\right)^{-1}\tilde{X}_2'\tilde{Y}_2 \end{pmatrix} \qquad (1.12)$$
where
$$\tilde{X}_1 = M_2X_1, \quad \tilde{Y}_1 = M_2Y, \quad \tilde{X}_2 = M_1X_2, \quad \tilde{Y}_2 = M_1Y.$$
The variables $\tilde{X}_1$ and $\tilde{Y}_1$ are least-squares residuals from the regression of $X_1$ and $Y$, respectively, on the matrix $X_2$ only. Similarly, the variables $\tilde{X}_2$ and $\tilde{Y}_2$ are least-squares residuals from the regression of $X_2$ and $Y$ on the matrix $X_1$ only.

Formula (1.12) shows that the subvector $\hat{\beta}_1$ of the OLS estimator $\hat{\beta}$ can be calculated by the OLS regression of $\tilde{Y}_1$ on $\tilde{X}_1$, and similarly $\hat{\beta}_2$ can be calculated by the OLS regression of $\tilde{Y}_2$ on $\tilde{X}_2$. This technique is called residual regression.

Furthermore, recalling the definition $M = I_n - X(X'X)^{-1}X'$, observe that $X_2'M = 0$ and hence
$$M_2M = \left(I_n - X_2\left(X_2'X_2\right)^{-1}X_2'\right)M = M.$$
Then using (1.8), we find
$$M_2\hat{e} = M_2MY = MY = \hat{e}.$$
Premultiplying (1.10) by $M_2$, we obtain
$$\tilde{Y}_1 = \tilde{X}_1\hat{\beta}_1 + \hat{e}.$$
Since $\hat{\beta}_1$ is precisely the OLS coefficient from a regression of $\tilde{Y}_1$ on $\tilde{X}_1$, this shows that the residual from this regression is $\hat{e}$, numerically the same residual as from the joint regression (1.10). We have proven the following theorem.

Theorem 1.5.1 (Frisch-Waugh-Lovell). In the model (1.9), the OLS estimator of $\beta_1$ and the OLS residuals $\hat{e}$ may be equivalently computed by either the OLS regression (1.10) or via the following algorithm:

1. Regress $Y$ on $X_2$, obtain residuals $\tilde{Y}_1$;

2. Regress $X_1$ on $X_2$, obtain residuals $\tilde{X}_1$;

3. Regress $\tilde{Y}_1$ on $\tilde{X}_1$, obtain OLS estimates $\hat{\beta}_1$ and residuals $\hat{e}$.

In some contexts, the FWL theorem can be used to speed computation, but in most cases there is little computational advantage to using the two-step algorithm. Rather, the theorem's primary use is theoretical.

A common application of the FWL theorem, which you may have seen in an introductory econometrics course, is the demeaning formula for regression. Partition $X = [X_1 \; X_2]$ where $X_1 = \iota$ is an $n \times 1$ vector of ones, and $X_2$ is the matrix of observed regressors. In this case,
$$M_1 = I - \iota\left(\iota'\iota\right)^{-1}\iota'.$$
Observe that
$$\tilde{X}_2 = M_1X_2 = X_2 - \bar{X}_2 \qquad \text{and} \qquad \tilde{Y} = M_1Y = Y - \bar{Y},$$
which are "demeaned". The FWL theorem says that $\hat{\beta}_2$ is the OLS estimate from a regression of $\tilde{Y}$ on $\tilde{X}_2$, or $y_i - \bar{y}$ on $x_{2i} - \bar{x}_2$:
$$\hat{\beta}_2 = \left(\sum_{i=1}^n \left(x_{2i} - \bar{x}_2\right)\left(x_{2i} - \bar{x}_2\right)'\right)^{-1}\left(\sum_{i=1}^n \left(x_{2i} - \bar{x}_2\right)\left(y_i - \bar{y}\right)\right).$$
Thus the OLS estimator for the slope coefficients is a regression with demeaned data.
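As a quick numerical check, the following sketch (reusing the simulated X and y from above; purely illustrative) verifies that the demeaned regression reproduces the slope coefficients from the full regression:

```python
# Split X into the constant (first column) and the observed regressors.
X2 = X[:, 1:]

# Demeaned data, i.e., M1*X2 and M1*Y with X1 = a column of ones.
X2_t = X2 - X2.mean(axis=0)
y_t = y - y.mean()

# FWL: slopes from the demeaned regression equal the slope subvector of beta_hat.
beta2_fwl, *_ = np.linalg.lstsq(X2_t, y_t, rcond=None)
assert np.allclose(beta2_fwl, beta_hat[1:])
```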

1.6 Consistency

The OLS estimator $\hat{\beta}$ is a statistic, and thus has a statistical distribution. In general, this distribution is unknown. Asymptotic (large sample) methods approximate sampling distributions based on the limiting experiment that the sample size $n$ tends to infinity. A preliminary step in this approach is the demonstration that estimators are consistent: that they converge in probability to the true parameters as the sample size gets large.

The following decomposition is quite useful:
$$\hat{\beta} = \left(\sum_{i=1}^n x_i x_i'\right)^{-1}\sum_{i=1}^n x_i y_i = \left(\sum_{i=1}^n x_i x_i'\right)^{-1}\sum_{i=1}^n x_i\left(x_i'\beta + e_i\right) = \beta + \left(\sum_{i=1}^n x_i x_i'\right)^{-1}\sum_{i=1}^n x_i e_i. \qquad (1.13)$$
This shows that after centering, the distribution of $\hat{\beta}$ is determined by the joint distribution of $(x_i, e_i)$ only.

We can now deduce the consistency of $\hat{\beta}$. First, Assumption 1.1.1 and the WLLN (Section 17.2) imply that
$$\frac{1}{n}\sum_{i=1}^n x_i x_i' \to_p E\left(x_i x_i'\right) = Q \qquad (1.14)$$
and
$$\frac{1}{n}\sum_{i=1}^n x_i e_i \to_p E\left(x_i e_i\right) = 0. \qquad (1.15)$$
Using (1.13), we can write
$$\hat{\beta} = \beta + \left(\frac{1}{n}\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^n x_i e_i\right) = \beta + g\left(\frac{1}{n}\sum_{i=1}^n x_i x_i', \; \frac{1}{n}\sum_{i=1}^n x_i e_i\right)$$
where $g(A, b) = A^{-1}b$ is a continuous function of $A$ and $b$ at all values of the arguments such that $A^{-1}$ exists. Now by (1.14) and (1.15),
$$\left(\frac{1}{n}\sum_{i=1}^n x_i x_i', \; \frac{1}{n}\sum_{i=1}^n x_i e_i\right) \to_p (Q, 0).$$

Assumption 1.1.1.5 implies that $Q^{-1}$ exists and thus $g(\cdot, \cdot)$ is continuous at $(Q, 0)$. Hence by the continuous mapping theorem (CMT) (Section 17.5),
$$g\left(\frac{1}{n}\sum_{i=1}^n x_i x_i', \; \frac{1}{n}\sum_{i=1}^n x_i e_i\right) \to_p g(Q, 0) = Q^{-1}0 = 0,$$
so
$$\hat{\beta} = \beta + g\left(\frac{1}{n}\sum_{i=1}^n x_i x_i', \; \frac{1}{n}\sum_{i=1}^n x_i e_i\right) \to_p \beta + 0 = \beta.$$

Theorem 1.6.1 Under Assumption 1.1.1, as $n \to \infty$, $\hat{\beta} \to_p \beta$.

In Section 1.2 we also defined the sample error variance $\hat{\sigma}^2$. We now demonstrate its consistency for $\sigma^2$. Using (1.8),
$$n\hat{\sigma}^2 = \hat{e}'\hat{e} = e'MMe = e'Me = e'e - e'Pe. \qquad (1.16)$$
An application of the WLLN yields
$$\frac{1}{n}\sum_{i=1}^n e_i^2 \to_p Ee_i^2 = \sigma^2$$
as $n \to \infty$, so combined with (1.14) and (1.15),
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n e_i^2 - \left(\frac{1}{n}\sum_{i=1}^n e_i x_i'\right)\left(\frac{1}{n}\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^n x_i e_i\right) \to_p \sigma^2 - 0'Q^{-1}0 = \sigma^2, \qquad (1.17)$$
so $\hat{\sigma}^2$ is consistent for $\sigma^2$.

1.7 Asymptotic Normality

We now establish the asymptotic distribution of $\hat{\beta}$ after normalization. We need a strengthening of the moment conditions.

Assumption 1.7.1 In addition to Assumption 1.1.1, $Ee_i^4 < \infty$ and $E|x_i|^4 < \infty$.

Now define $\Omega = E\left(x_i x_i' e_i^2\right)$. Assumption 1.7.1 guarantees that the elements of $\Omega$ are finite. To see this, by the Cauchy-Schwarz inequality and Assumption 1.7.1,
$$E\left|x_i x_i' e_i^2\right| \le \left(E\left|x_i x_i'\right|^2\right)^{1/2}\left(Ee_i^4\right)^{1/2} = \left(E|x_i|^4\right)^{1/2}\left(Ee_i^4\right)^{1/2} < \infty. \qquad (1.18)$$

Thus $x_i e_i$ is iid with mean zero and covariance matrix $\Omega$. By the central limit theorem (Section 17.4),
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n x_i e_i \to_d N(0, \Omega). \qquad (1.19)$$
Then using (1.13), (1.14), and (1.19),
$$\sqrt{n}\left(\hat{\beta} - \beta\right) = \left(\frac{1}{n}\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^n x_i e_i\right) \to_d Q^{-1}N(0, \Omega) = N\left(0, Q^{-1}\Omega Q^{-1}\right).$$

Theorem 1.7.1 Under Assumption 1.7.1, as $n \to \infty$,
$$\sqrt{n}\left(\hat{\beta} - \beta\right) \to_d N(0, V)$$
where $V = Q^{-1}\Omega Q^{-1}$.

As $V$ is the variance of the asymptotic distribution of $\sqrt{n}(\hat{\beta} - \beta)$, $V$ is often referred to as the asymptotic covariance matrix of $\hat{\beta}$. The form $V = Q^{-1}\Omega Q^{-1}$ is called a sandwich form.

It may be insightful to examine a special case where $\Omega$ and $V$ simplify:

Homoskedastic Projection Error: $\mathrm{Cov}\left(x_i x_i', e_i^2\right) = 0$.

This condition holds, for example, when $x_i$ and $e_i$ are independent, but independence is not a necessary condition. We should not expect the condition to hold generically, but when it does, the asymptotic variance formulas simplify. If it is true, then
$$\Omega = E\left(x_i x_i'\right)E\left(e_i^2\right) = Q\sigma^2 \qquad (1.20)$$
and
$$V = Q^{-1}\Omega Q^{-1} = Q^{-1}\sigma^2 \equiv V^0. \qquad (1.21)$$
In (1.21) we define $V^0 = Q^{-1}\sigma^2$, as this matrix is defined even if the homoskedasticity condition is false, although in that case $V^0$ does not equal $V$. We call $V^0$ the homoskedastic covariance matrix.

1.8 Covariance Matrix Estimation

The homoskedastic covariance matrix $V^0 = Q^{-1}\sigma^2$ can be estimated by
$$\hat{V}^0 = \hat{Q}^{-1}\hat{\sigma}^2 \qquad (1.22)$$
where
$$\hat{Q} = \frac{1}{n}\sum_{i=1}^n x_i x_i' = \frac{1}{n}X'X$$
is the method of moments estimator for $Q$. Since $\hat{Q} \to_p Q$ and $\hat{\sigma}^2 \to_p \sigma^2$ (see (1.14) and (1.17)), it is clear that $\hat{V}^0 \to_p V^0$.

To estimate $V = Q^{-1}\Omega Q^{-1}$, we need an estimate of $\Omega = E(x_i x_i' e_i^2)$. Its method of moments estimator is
$$\hat{\Omega} = \frac{1}{n}\sum_{i=1}^n x_i x_i'\hat{e}_i^2$$
where $\hat{e}_i$ are the OLS residuals. A useful computational formula is to define $\hat{u}_i = x_i\hat{e}_i$ and the $n \times k$ matrix
$$\hat{u} = \begin{pmatrix} \hat{u}_1' \\ \hat{u}_2' \\ \vdots \\ \hat{u}_n' \end{pmatrix}.$$
Then
$$\hat{\Omega} = \frac{1}{n}\hat{u}'\hat{u}, \qquad \hat{V} = n\left(X'X\right)^{-1}\hat{u}'\hat{u}\left(X'X\right)^{-1}.$$
This estimator was introduced to the econometrics literature by White (1980).

The estimator $\hat{V}^0$ was the dominant covariance estimator used before 1980, and was still the standard choice in the 1980s. From my reading of the literature, the White estimator $\hat{V}$ started to come into common use in the early 1990s, and by the late 1990s was quite commonly used, especially by younger researchers. When reading and reporting applied work, it is important to pay attention to the distinction between $\hat{V}^0$ and $\hat{V}$, as it is not always clear which has been used. When $\hat{V}$ is used rather than the traditional choice $\hat{V}^0$, many authors will state that "their standard errors have been corrected for heteroskedasticity", or that they use a "heteroskedasticity-robust covariance matrix estimator", or that they use the "White formula", the "Eicker-White formula", the "Huber formula", the "Huber-White formula" or the "GMM covariance matrix". In most cases, these all mean the same thing.
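Continuing the numpy sketch from before (illustrative only), both covariance estimators and the resulting standard errors can be computed as follows; this is one way to code White's formula, not the text's own program:

```python
XtX_inv = np.linalg.inv(X.T @ X)

# Homoskedastic estimator: V0_hat = Q_hat^{-1} * sigma2_hat with Q_hat = X'X/n.
V0_hat = n * XtX_inv * sigma2_hat

# White (1980) heteroskedasticity-robust estimator:
# Omega_hat = u'u/n with u_i = x_i * e_hat_i, and V_hat = n (X'X)^{-1} u'u (X'X)^{-1}.
u = X * e_hat[:, None]
V_hat = n * XtX_inv @ (u.T @ u) @ XtX_inv

# Standard errors: square roots of the diagonal of n^{-1} V_hat (see Section 1.8).
se_robust = np.sqrt(np.diag(V_hat / n))
se_homosk = np.sqrt(np.diag(V0_hat / n))
```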

We now show $\hat{\Omega} \to_p \Omega$, from which it follows that $\hat{V} \to_p V$ as $n \to \infty$. Expanding the quadratic,
$$\hat{e}_i^2 = \left(y_i - x_i'\hat{\beta}\right)^2 = \left(e_i - x_i'\left(\hat{\beta} - \beta\right)\right)^2 = e_i^2 - 2\left(\hat{\beta} - \beta\right)'x_i e_i + \left(\hat{\beta} - \beta\right)'x_i x_i'\left(\hat{\beta} - \beta\right).$$
Hence
$$\hat{\Omega} = \frac{1}{n}\sum_{i=1}^n x_i x_i'\hat{e}_i^2 = \frac{1}{n}\sum_{i=1}^n x_i x_i'e_i^2 - \frac{2}{n}\sum_{i=1}^n x_i x_i'\left(\hat{\beta} - \beta\right)'x_i e_i + \frac{1}{n}\sum_{i=1}^n x_i x_i'\left(\hat{\beta} - \beta\right)'x_i x_i'\left(\hat{\beta} - \beta\right). \qquad (1.23)$$
We now examine each sum on the right-hand side of (1.23) in turn. First, (1.18) and the WLLN show that
$$\frac{1}{n}\sum_{i=1}^n x_i x_i'e_i^2 \to_p E\left(x_i x_i'e_i^2\right) = \Omega.$$
Second, by Holder's inequality (Section 17.1),
$$E\left(|x_i|^3|e_i|\right) \le \left(E|x_i|^4\right)^{3/4}\left(Ee_i^4\right)^{1/4} < \infty,$$
so by the WLLN
$$\frac{1}{n}\sum_{i=1}^n |x_i|^3|e_i| \to_p E\left(|x_i|^3|e_i|\right),$$
and thus, since $\hat{\beta} - \beta \to_p 0$,
$$\left|\frac{2}{n}\sum_{i=1}^n x_i x_i'\left(\hat{\beta} - \beta\right)'x_i e_i\right| \le 2\left|\hat{\beta} - \beta\right|\frac{1}{n}\sum_{i=1}^n |x_i|^3|e_i| \to_p 0.$$
Third, by the WLLN
$$\frac{1}{n}\sum_{i=1}^n |x_i|^4 \to_p E|x_i|^4,$$
so
$$\left|\frac{1}{n}\sum_{i=1}^n x_i x_i'\left(\hat{\beta} - \beta\right)'x_i x_i'\left(\hat{\beta} - \beta\right)\right| \le \left|\hat{\beta} - \beta\right|^2\frac{1}{n}\sum_{i=1}^n |x_i|^4 \to_p 0.$$
Together, these establish consistency.

Theorem 1.8.1 As $n \to \infty$, $\hat{\Omega} \to_p \Omega$.

The variance estimator $\hat{V}$ is an estimate of the variance of the asymptotic distribution of $\hat{\beta}$. A more easily interpretable measure of spread is its square root, the standard deviation. This motivates the definition of a standard error.

Definition 1.8.1 A standard error $s(\hat{\beta})$ for an estimator $\hat{\beta}$ is an estimate of the standard deviation of the distribution of $\hat{\beta}$.

When $\theta$ is scalar, and $\hat{V}$ is an estimator of the variance of $\sqrt{n}(\hat{\theta} - \theta)$, we set $s(\hat{\theta}) = n^{-1/2}\sqrt{\hat{V}}$. When $\beta$ is a vector, we focus on individual elements of $\beta$ one at a time, viz., $\beta_j$ for $j = 1, \ldots, k$. Thus
$$s(\hat{\beta}_j) = n^{-1/2}\sqrt{\hat{V}_{jj}}.$$
Generically, standard errors are not unique, as there may be more than one estimator of the variance of the estimator. It is therefore important to understand what formula and method is used by an author when studying their work. It is also important to understand that a particular standard error may be relevant under one set of model assumptions, but not under another set of assumptions, just as any other estimator.

From a computational standpoint, the standard method to calculate the standard errors is to first calculate $n^{-1}\hat{V}$, then take the diagonal elements, and then the square roots.

1.9 Functions of Parameters

Sometimes we are interested in some function of the parameter vector. Let $h : \mathbb{R}^k \to \mathbb{R}^q$, and
$$\theta = h(\beta).$$
We will assume from now on that $h(\beta)$ is continuously differentiable at the true value of $\beta$. The estimate of $\theta$ is
$$\hat{\theta} = h(\hat{\beta}).$$
What is an appropriate standard error for $\hat{\theta}$? By a first-order Taylor series approximation,
$$h(\hat{\beta}) \simeq h(\beta) + H_\beta'\left(\hat{\beta} - \beta\right)$$
where
$$H_\beta = \frac{\partial}{\partial\beta}h(\beta) \qquad (k \times q).$$
Thus
$$\sqrt{n}\left(\hat{\theta} - \theta\right) = \sqrt{n}\left(h(\hat{\beta}) - h(\beta)\right) \simeq H_\beta'\sqrt{n}\left(\hat{\beta} - \beta\right) \to_d H_\beta'N(0, V) = N(0, V_\theta) \qquad (1.24)$$
where $V_\theta = H_\beta'VH_\beta$.

If $\hat{V}$ is the estimated covariance matrix for $\hat{\beta}$, then the natural estimate for the variance of $\hat{\theta}$ is
$$\hat{V}_\theta = \hat{H}_\beta'\hat{V}\hat{H}_\beta$$
where
$$\hat{H}_\beta = \frac{\partial}{\partial\beta}h(\hat{\beta}).$$
In many cases, the function $h(\beta)$ is linear:
$$h(\beta) = R'\beta$$
for some $k \times q$ matrix $R$. In this case, $H_\beta = R$ and $\hat{H}_\beta = R$, so $\hat{V}_\theta = R'\hat{V}R$.

For example, if $R$ is a "selector matrix"
$$R = \begin{pmatrix} I \\ 0 \end{pmatrix},$$
so that if $\beta = (\beta_1, \beta_2)$, then $\theta = R'\beta = \beta_1$ and
$$\hat{V}_\theta = \begin{pmatrix} I & 0 \end{pmatrix}\hat{V}\begin{pmatrix} I \\ 0 \end{pmatrix} = \hat{V}_{11},$$
the upper-left block of $\hat{V}$.

When $q = 1$ (so $h(\beta)$ is real-valued), the standard error for $\hat{\theta}$ is the square root of $n^{-1}\hat{V}_\theta$, that is,
$$s(\hat{\theta}) = n^{-1/2}\sqrt{\hat{H}_\beta'\hat{V}\hat{H}_\beta}.$$
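As an illustration, here is a sketch of a delta-method standard error for a nonlinear scalar function of the coefficients, say θ = β₂/β₃ (a hypothetical ratio chosen only for the example), using the robust V̂ computed above:

```python
# theta = h(beta) = beta[1] / beta[2]; H is its gradient (k x 1).
theta_hat = beta_hat[1] / beta_hat[2]
H = np.zeros(k)
H[1] = 1.0 / beta_hat[2]
H[2] = -beta_hat[1] / beta_hat[2] ** 2

# Delta method: s(theta_hat) = sqrt(H' (V_hat/n) H).
se_theta = np.sqrt(H @ (V_hat / n) @ H)
```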

1.10 t tests

Let $\theta = h(\beta) : \mathbb{R}^k \to \mathbb{R}$ be any parameter of interest, $\hat{\theta}$ its estimate and $s(\hat{\theta})$ its asymptotic standard error. Consider the studentized statistic
$$t_n(\theta) = \frac{\hat{\theta} - \theta}{s(\hat{\theta})}. \qquad (1.25)$$
It is easy to calculate that this statistic has the asymptotic distribution
$$t_n(\theta) = \frac{\hat{\theta} - \theta}{s(\hat{\theta})} = \frac{\sqrt{n}\left(\hat{\theta} - \theta\right)}{\sqrt{\hat{V}_\theta}} \to_d \frac{N(0, V_\theta)}{\sqrt{V_\theta}} = N(0, 1),$$
the standard normal. This distribution is known. Since this distribution does not depend on the parameters, we say that $t_n(\theta)$ is asymptotically pivotal. In special cases (such as the normal regression model, see Section 2.7), the statistic $t_n$ has an exact t distribution, and is therefore exactly free of unknowns. In this case, we say that $t_n$ is an exactly pivotal statistic. In general, however, pivotal statistics are unavailable and so we must rely on asymptotically pivotal statistics.

A simple null and composite alternative hypothesis takes the form
$$H_0 : \theta = \theta_0$$
$$H_1 : \theta \neq \theta_0$$
where $\theta_0$ is some pre-specified value, and $\theta = h(\beta)$ is some function of the parameter vector. (For example, $\theta$ could be a single element of $\beta$.)

The standard test for $H_0$ against $H_1$ is the t-statistic (or studentized statistic)
$$t_n = t_n(\theta_0) = \frac{\hat{\theta} - \theta_0}{s(\hat{\theta})}.$$
Under $H_0$, $t_n \to_d N(0, 1)$. Let $z_{\alpha/2}$ be the upper $\alpha/2$ quantile of the standard normal distribution. That is, if $Z \sim N(0, 1)$, then $P(Z > z_{\alpha/2}) = \alpha/2$ and $P(|Z| > z_{\alpha/2}) = \alpha$. For example, $z_{.025} = 1.96$ and $z_{.05} = 1.645$. A test of asymptotic significance $\alpha$ rejects $H_0$ if $|t_n| > z_{\alpha/2}$. Otherwise the test does not reject, or "accepts" $H_0$. This is because
$$P(\text{reject } H_0 \mid H_0) = P\left(|t_n| > z_{\alpha/2} \mid \theta = \theta_0\right) \to P\left(|Z| > z_{\alpha/2}\right) = \alpha.$$
The rejection/acceptance dichotomy is associated with the Neyman-Pearson approach to hypothesis testing.

An alternative approach, associated with Fisher, is to report an asymptotic p-value. The asymptotic p-value for the above statistic is constructed as follows. Define the tail probability, or asymptotic p-value function
$$p(t) = P(|Z| > |t|) = 2(1 - \Phi(|t|)).$$
Then the asymptotic p-value of the statistic $t_n$ is
$$p_n = p(t_n).$$
If the p-value $p_n$ is small (close to zero) then the evidence against $H_0$ is strong. In a sense, p-values and hypothesis tests are equivalent since $p_n < \alpha$ if and only if $|t_n| > z_{\alpha/2}$; thus an equivalent statement of a Neyman-Pearson test is to reject at the $\alpha\%$ level if and only if $p_n < \alpha$. The p-value is more general, however, in that the reader is allowed to pick the level of significance $\alpha$, in contrast to Neyman-Pearson rejection/acceptance reporting, where the researcher picks the level.

Another helpful observation is that the p-value function is simply a unit-free transformation of the test statistic. That is, under $H_0$, $p_n \to_d U[0, 1]$, so the "unusualness" of the test statistic can be compared to the easy-to-understand uniform distribution, regardless of the complication of the distribution of the original test statistic. To see this fact, note that the asymptotic distribution of $|t_n|$ is $F(x) = 1 - p(x)$. Thus
$$P(1 - p_n \le u) = P(1 - p(t_n) \le u) = P(F(t_n) \le u) = P\left(|t_n| \le F^{-1}(u)\right) \to F\left(F^{-1}(u)\right) = u,$$
establishing that $1 - p_n \to_d U[0, 1]$, from which it follows that $p_n \to_d U[0, 1]$.

It may be helpful to note that in the GAUSS language, the function $p(t)$ may be computed by the expression p = 2*cdfnc(t).

1.11 Confidence Intervals

A confidence interval $C_n$ is an interval estimate of $\theta$, and is a function of the data and hence is random. It is designed to cover $\theta$ with high probability. Either $\theta \in C_n$ or $\theta \notin C_n$. The coverage probability is $P(\theta \in C_n)$.

We typically cannot calculate the exact coverage probability $P(\theta \in C_n)$. However we often can calculate $\lim_{n\to\infty} P(\theta \in C_n)$. We call this the asymptotic coverage probability. We say that $C_n$ has asymptotic $(1 - \alpha)\%$ coverage for $\theta$ if $P(\theta \in C_n) \to 1 - \alpha$ as $n \to \infty$.

A good method for construction of a confidence interval is the collection of parameter values which are not rejected by an appropriate statistical test. We recall the t-statistic (1.25) and its test: reject $H_0 : \theta = \theta_0$ if $|t_n(\theta_0)| > z_{\alpha/2}$, where $z_{\alpha/2}$ again is the upper $\alpha/2$ quantile of the standard normal distribution. Our confidence interval is then
$$C_n = \left\{\theta : |t_n(\theta)| \le z_{\alpha/2}\right\} = \left\{\theta : -z_{\alpha/2} \le \frac{\hat{\theta} - \theta}{s(\hat{\theta})} \le z_{\alpha/2}\right\} = \left[\hat{\theta} - z_{\alpha/2}s(\hat{\theta}), \; \hat{\theta} + z_{\alpha/2}s(\hat{\theta})\right]. \qquad (1.26)$$
While there is no hard-and-fast guideline for choosing the coverage probability $1 - \alpha$, the most common professional choice is 95%, or $\alpha = .05$. This corresponds to selecting the confidence interval $[\hat{\theta} \pm 1.96 s(\hat{\theta})] \approx [\hat{\theta} \pm 2 s(\hat{\theta})]$. Thus values of $\theta$ within two standard errors of the estimated $\hat{\theta}$ are considered "reasonable" candidates for the true value $\theta$, and values of $\theta$ outside two standard errors of the estimated $\hat{\theta}$ are considered unlikely or unreasonable candidates for the true value.

The interval has been constructed so that as $n \to \infty$,
$$P(\theta \in C_n) = P\left(|t_n(\theta)| \le z_{\alpha/2}\right) \to P\left(|Z| \le z_{\alpha/2}\right) = 1 - \alpha,$$
and $C_n$ is an asymptotic $(1 - \alpha)\%$ confidence interval.
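In Python (with scipy standing in for the GAUSS calls mentioned above; an illustrative sketch), the t-statistic, p-value, and 95% confidence interval for a single coefficient can be computed as:

```python
from scipy.stats import norm

j = 1                                        # test H0: beta_j = 0 for the second coefficient
t_stat = beta_hat[j] / se_robust[j]
p_value = 2 * (1 - norm.cdf(abs(t_stat)))    # scipy analogue of GAUSS 2*cdfnc(t)

z = norm.ppf(0.975)                          # z_{alpha/2} = 1.96 for alpha = .05
ci = (beta_hat[j] - z * se_robust[j], beta_hat[j] + z * se_robust[j])
```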

1.12 Wald Tests

Sometimes $\theta = h(\beta)$ is a $q \times 1$ vector, and it is desired to test the joint restrictions simultaneously. In this case the t-statistic approach does not work. We have the null and alternative
$$H_0 : \theta = \theta_0$$
$$H_1 : \theta \neq \theta_0.$$
The natural estimate of $\theta$ is $\hat{\theta} = h(\hat{\beta})$, which has asymptotic covariance matrix estimate
$$\hat{V}_\theta = \hat{H}_\beta'\hat{V}\hat{H}_\beta$$
where
$$\hat{H}_\beta = \frac{\partial}{\partial\beta}h(\hat{\beta}).$$
The Wald statistic for $H_0$ against $H_1$ is
$$W_n = n\left(\hat{\theta} - \theta_0\right)'\hat{V}_\theta^{-1}\left(\hat{\theta} - \theta_0\right) = n\left(h(\hat{\beta}) - \theta_0\right)'\left(\hat{H}_\beta'\hat{V}\hat{H}_\beta\right)^{-1}\left(h(\hat{\beta}) - \theta_0\right).$$
When $h$ is a linear function of $\beta$, $h(\beta) = R'\beta$, the Wald statistic takes the form
$$W_n = n\left(R'\hat{\beta} - \theta_0\right)'\left(R'\hat{V}R\right)^{-1}\left(R'\hat{\beta} - \theta_0\right).$$

The delta method (1.24) showed that $\sqrt{n}(\hat{\theta} - \theta) \to_d Z \sim N(0, V_\theta)$, and Theorem 1.8.1 showed that $\hat{V} \to_p V$. Furthermore, $H_\beta(\beta)$ is a continuous function of $\beta$, so by the continuous mapping theorem, $H_\beta(\hat{\beta}) \to_p H_\beta$. Thus $\hat{V}_\theta = \hat{H}_\beta'\hat{V}\hat{H}_\beta \to_p H_\beta'VH_\beta = V_\theta > 0$ if $H_\beta$ has full rank $q$. Hence
$$W_n = n\left(\hat{\theta} - \theta_0\right)'\hat{V}_\theta^{-1}\left(\hat{\theta} - \theta_0\right) \to_d Z'V_\theta^{-1}Z = \chi_q^2$$
by Theorem 16.8.1. We have established:

Theorem 1.12.1 Under $H_0$ and Assumption 1.7.1, if $\mathrm{rank}(H_\beta) = q$, then $W_n \to_d \chi_q^2$, a chi-square random variable with $q$ degrees of freedom.

An asymptotic Wald test rejects $H_0$ in favor of $H_1$ if $W_n$ exceeds $\chi_q^2(\alpha)$, the upper-$\alpha$ quantile of the $\chi_q^2$ distribution. For example, $\chi_1^2(.05) = 3.84 = z_{.025}^2$. The Wald test fails to reject if $W_n$ is less than $\chi_q^2(\alpha)$. The asymptotic p-value for $W_n$ is $p_n = p(W_n)$, where $p(x) = P(\chi_q^2 \ge x)$ is the tail probability function of the $\chi_q^2$ distribution. As before, the test rejects at the $\alpha\%$ level if $p_n < \alpha$, and $p_n$ is asymptotically $U[0, 1]$ under $H_0$. In addition, it may be helpful to note that in the GAUSS language, the function $p(t)$ may be computed by the expression p = cdfchic(t).
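A sketch of the linear-restriction Wald test in numpy/scipy, testing the hypothetical joint restriction β₂ = β₃ = 0 in the simulated model (illustrative only):

```python
from scipy.stats import chi2

# R selects the last two coefficients: theta = R'beta = (beta_2, beta_3)
# in the text's 1-based notation.
R = np.zeros((k, 2))
R[1, 0] = R[2, 1] = 1.0
q = 2

theta0 = np.zeros(q)
theta_est = R.T @ beta_hat
V_theta = R.T @ V_hat @ R

# W_n = n (theta_hat - theta0)' V_theta^{-1} (theta_hat - theta0).
W = n * (theta_est - theta0) @ np.linalg.solve(V_theta, theta_est - theta0)
p_wald = chi2.sf(W, df=q)                    # analogue of GAUSS cdfchic
```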

1.13 F Tests

Take the linear model
$$Y = X_1\beta_1 + X_2\beta_2 + e$$
where $X_1$ is $n \times k_1$, $X_2$ is $n \times k_2$, and $k = k_1 + k_2$. The null hypothesis is $H_0 : \beta_2 = 0$. In this case, $\theta = \beta_2$, and there are $q = k_2$ restrictions. Also $h(\beta) = R'\beta$ is linear with
$$R = \begin{pmatrix} 0 \\ I \end{pmatrix}$$
a selector matrix. We know that the Wald statistic takes the form
$$W_n = n\hat{\theta}'\hat{V}_\theta^{-1}\hat{\theta} = n\hat{\beta}_2'\left(R'\hat{V}R\right)^{-1}\hat{\beta}_2.$$
What we will show in this section is that if $\hat{V}$ is replaced with $\hat{V}^0 = \hat{\sigma}^2\left(n^{-1}X'X\right)^{-1}$, the covariance matrix estimator valid under homoskedasticity, then the Wald statistic can be written in the form
$$W_n = n\left(\frac{\tilde{\sigma}^2 - \hat{\sigma}^2}{\hat{\sigma}^2}\right) \qquad (1.27)$$
where
$$\tilde{\sigma}^2 = \frac{1}{n}\tilde{e}'\tilde{e}, \qquad \tilde{e} = Y - X_1\tilde{\beta}_1, \qquad \tilde{\beta}_1 = \left(X_1'X_1\right)^{-1}X_1'Y$$
are from OLS of $Y$ on $X_1$, and
$$\hat{\sigma}^2 = \frac{1}{n}\hat{e}'\hat{e}, \qquad \hat{e} = Y - X\hat{\beta}, \qquad \hat{\beta} = \left(X'X\right)^{-1}X'Y$$
are from OLS of $Y$ on $X = (X_1, X_2)$.

The elegant feature about (1.27) is that it is directly computable from the standard output from two simple OLS regressions, as the sum of squared errors is a typical output from statistical packages. This statistic is typically reported as an "F-statistic", which is defined as
$$F = \frac{n - k}{n k_2}W_n = \frac{\left(\tilde{\sigma}^2 - \hat{\sigma}^2\right)/k_2}{\hat{\sigma}^2/(n - k)}.$$
While it should be emphasized that equality (1.27) only holds if $\hat{V}^0 = \hat{\sigma}^2(n^{-1}X'X)^{-1}$ is used, this formula still often finds good use in reading applied papers.

We now derive expression (1.27). First, note that using (1.11),
$$R'\hat{V}^0R = n\hat{\sigma}^2 R'\begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}^{-1}R = \hat{\sigma}^2\left(\frac{1}{n}X_2'M_1X_2\right)^{-1}$$

where $M_1 = I - X_1\left(X_1'X_1\right)^{-1}X_1'$. Thus
$$W_n = n\hat{\beta}_2'\left(R'\hat{V}^0R\right)^{-1}\hat{\beta}_2 = \frac{\hat{\beta}_2'\left(X_2'M_1X_2\right)\hat{\beta}_2}{\hat{\sigma}^2}.$$
To simplify this expression further, note that if we regress $Y$ on $X_1$ alone, the residual is $\tilde{e} = M_1Y$. Now consider the residual regression of $\tilde{e}$ on $\tilde{X}_2 = M_1X_2$. By the FWL theorem, $\tilde{e} = \tilde{X}_2\hat{\beta}_2 + \hat{e}$ and $\tilde{X}_2'\hat{e} = 0$. Thus
$$\tilde{e}'\tilde{e} = \left(\tilde{X}_2\hat{\beta}_2 + \hat{e}\right)'\left(\tilde{X}_2\hat{\beta}_2 + \hat{e}\right) = \hat{\beta}_2'\tilde{X}_2'\tilde{X}_2\hat{\beta}_2 + \hat{e}'\hat{e} = \hat{\beta}_2'X_2'M_1X_2\hat{\beta}_2 + \hat{e}'\hat{e},$$
or alternatively,
$$\hat{\beta}_2'X_2'M_1X_2\hat{\beta}_2 = \tilde{e}'\tilde{e} - \hat{e}'\hat{e}.$$
Also, since $\hat{\sigma}^2 = n^{-1}\hat{e}'\hat{e}$, we conclude that
$$W_n = n\left(\frac{\tilde{e}'\tilde{e} - \hat{e}'\hat{e}}{\hat{e}'\hat{e}}\right) = n\left(\frac{\tilde{\sigma}^2 - \hat{\sigma}^2}{\hat{\sigma}^2}\right),$$
as claimed.

In many statistical packages, when an OLS regression is reported, an "F statistic" is reported. This is
$$F = \frac{\left(\tilde{\sigma}_y^2 - \hat{\sigma}^2\right)/(k - 1)}{\hat{\sigma}^2/(n - k)}$$
where
$$\tilde{\sigma}_y^2 = \frac{1}{n}(y - \bar{y})'(y - \bar{y})$$
is the sample variance of $y_i$, equivalently the residual variance from an intercept-only model. This special F statistic is testing the hypothesis that all slope coefficients (other than the intercept) are zero. This was a popular statistic in the early days of econometric reporting, when sample sizes were very small and researchers wanted to know if there was "any explanatory power" to their regression. This is rarely an issue today, as sample sizes are typically sufficiently large that this F statistic is highly "significant". Certainly, there are special cases where this F statistic is useful, but these cases are atypical.
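A sketch of the homoskedastic F statistic computed from two OLS fits, following (1.27) (again reusing the simulated data; the partition of the regressors, intercept versus slopes, is illustrative):

```python
from scipy.stats import f as f_dist

X1 = X[:, :1]                            # restricted model: intercept only
k2 = X.shape[1] - X1.shape[1]            # number of restrictions

# Restricted residual variance (unrestricted sigma2_hat was computed above).
e_tilde = y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
sigma2_tilde = np.mean(e_tilde**2)

W = n * (sigma2_tilde - sigma2_hat) / sigma2_hat
F = ((sigma2_tilde - sigma2_hat) / k2) / (sigma2_hat / (n - k))
p_f = f_dist.sf(F, k2, n - k)
```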

Chapter 2

Regression Models

2.1 Regression

In regression, we want to find the central tendency of the conditional distribution of $y_i$ given $x_i$. A standard measure of central tendency is the mean. The conditional analog is the conditional mean
$$m(x) = E(y_i \mid x_i = x).$$
In general, $m(x)$ can take any form.

The regression error $e_i$ is defined to be the difference between $y_i$ and its conditional mean:
$$e_i = y_i - m(x_i).$$
By construction, this yields the formula
$$y_i = m(x_i) + e_i. \qquad (2.1)$$
It is worth emphasizing that no assumptions have been used to develop (2.1), other than that $(y_i, x_i)$ have a joint distribution and $E|y_i| < \infty$.

Proposition 2.1.1 Properties of the regression error $e_i$:
1. $E(e_i \mid x_i) = 0$.
2. $E(e_i) = 0$.
3. $E(h(x_i)e_i) = 0$ for any function $h(\cdot)$.
4. $E(x_ie_i) = 0$.

Proof:

1. By the definition of $e_i$ and the linearity of conditional expectations,
$$E(e_i \mid x_i) = E\left((y_i - m(x_i)) \mid x_i\right) = E(y_i \mid x_i) - E(m(x_i) \mid x_i) = m(x_i) - m(x_i) = 0.$$

2. By the law of iterated expectations (Theorem 16.7) and the first result,
$$E(e_i) = E\left(E(e_i \mid x_i)\right) = E(0) = 0.$$

3. By a similar argument, and using the conditioning theorem (Theorem 16.9),
$$E(h(x_i)e_i) = E\left(E(h(x_i)e_i \mid x_i)\right) = E\left(h(x_i)E(e_i \mid x_i)\right) = E(h(x_i) \cdot 0) = 0.$$

4. Follows from the third result setting $h(x_i) = x_i$.

Equation (2.1) plus Proposition 2.1.1.1 are often stated jointly as the regression framework:
$$y_i = m(x_i) + e_i \qquad (2.2)$$
$$E(e_i \mid x_i) = 0.$$
It is important to understand that this is a framework, not a model, because no restrictions have been placed on the joint distribution of the data. These equations hold true by definition.

A regression model imposes further restrictions on the joint distribution; most typically, restrictions on the permissible class of regression functions $m(x)$. The most common choice is the linear regression model. It specifies that $m(x)$ is a linear function of $x$:
$$y_i = x_i'\beta + e_i$$
$$E(e_i \mid x_i) = 0.$$
Since this linear equation is a special case of the general conditional mean equation, this is a substantive restriction which may or may not be true in a specific application.

The fact that Proposition 2.1.1.4 is the same as Assumption 1.1.1.2 means that the linear regression model is a special case of the least-squares projection model of Chapter 1. Another way of saying this is that the conditional mean assumption $E(e_i \mid x_i) = 0$ is stronger than the uncorrelatedness assumption $E(x_ie_i) = 0$.

It is also useful to define the conditional variance of $y_i$ given $x_i = x$:
$$\mathrm{Var}(y_i \mid x_i = x) = E\left(e_i^2 \mid x_i = x\right) = \sigma^2(x).$$
Generally, this is a function of $x$. Just as the conditional mean function may take any form, so may the conditional variance function (other than the restriction that it is non-negative). Given the random variable $x_i$, the conditional variance is $\sigma_i^2 = \sigma^2(x_i)$. In the general case where $\sigma^2(x)$ is not necessarily a constant function, so that $\sigma_i^2$ may differ across $i$, we say that the error $e_i$ is heteroskedastic. When $\sigma^2(x)$ is a constant, so that
$$E\left(e_i^2 \mid x_i\right) = \sigma^2, \qquad (2.3)$$
we say that the error $e_i$ is homoskedastic.

The model
$$y_i = x_i'\beta + e_i$$
$$E(e_i \mid x_i) = 0$$
$$E\left(e_i^2 \mid x_i\right) = \sigma^2$$
is called the homoskedastic linear regression model. In this case, by the law of iterated expectations
$$E\left(x_ix_i'e_i^2\right) = E\left(x_ix_i'E\left(e_i^2 \mid x_i\right)\right) = Q\sigma^2,$$
which is (1.20). Thus the homoskedastic linear regression model is a special case of the homoskedastic projection error model.

2.2 Bias and Variance of OLS Estimator

The conditional mean assumption allows us to calculate the small sample conditional mean and variance of the OLS estimator.

To examine the bias of $\hat{\beta}$, use (1.3), the conditioning theorem, and the independence of the observations:
$$E\left(\hat{\beta} - \beta \mid X\right) = E\left(\left(\sum_{i=1}^n x_ix_i'\right)^{-1}\sum_{i=1}^n x_ie_i \;\Big|\; X\right) = \left(\sum_{i=1}^n x_ix_i'\right)^{-1}\sum_{i=1}^n x_iE(e_i \mid X) = \left(\sum_{i=1}^n x_ix_i'\right)^{-1}\sum_{i=1}^n x_iE(e_i \mid x_i) = 0.$$
Thus the OLS estimator $\hat{\beta}$ is unbiased for $\beta$.

To examine its covariance matrix, for a random vector $Y$ we define
$$\mathrm{Var}(Y) = E\left((Y - EY)(Y - EY)'\right) = EYY' - (EY)(EY)'.$$
Then by independence of the observations,
$$\mathrm{Var}\left(\sum_{i=1}^n x_ie_i \;\Big|\; X\right) = \sum_{i=1}^n \mathrm{Var}\left(x_ie_i \mid X\right) = \sum_{i=1}^n x_ix_i'\sigma_i^2,$$
and we find
$$\mathrm{Var}\left(\hat{\beta} \mid X\right) = \left(\sum_{i=1}^n x_ix_i'\right)^{-1}\left(\sum_{i=1}^n x_ix_i'\sigma_i^2\right)\left(\sum_{i=1}^n x_ix_i'\right)^{-1}.$$
In the special case of the linear homoskedastic regression model, $\sigma_i^2 = \sigma^2$ and the covariance matrix simplifies to
$$\mathrm{Var}\left(\hat{\beta} \mid X\right) = \left(\sum_{i=1}^n x_ix_i'\right)^{-1}\sigma^2.$$

Recall the method of moments estimator $\hat{\sigma}^2$ for $\sigma^2$. We now calculate its finite sample bias in the context of the homoskedastic linear regression model.

Using (1.16) and linear algebra manipulations,
$$E\left(n\hat{\sigma}^2 \mid X\right) = E\left(e'Me \mid X\right) = E\left(\mathrm{tr}\left(e'Me\right) \mid X\right) = E\left(\mathrm{tr}\left(Mee'\right) \mid X\right) = \mathrm{tr}\left(ME\left(ee' \mid X\right)\right) = \mathrm{tr}\left(M\sigma^2\right) = \sigma^2(n - k),$$
the final equality by (5.4). We have found that under these assumptions
$$E\hat{\sigma}^2 = \frac{n - k}{n}\sigma^2,$$
so $\hat{\sigma}^2$ is biased towards zero. Since the bias is proportional to $\sigma^2$, it is common to define the bias-corrected estimator
$$s^2 = \frac{1}{n - k}\sum_{i=1}^n \hat{e}_i^2$$
so that $Es^2 = \sigma^2$ is unbiased. It is important to remember, however, that this estimator is only unbiased in the special case of the homoskedastic linear regression model. It is not unbiased in the absence of homoskedasticity, or in the projection model.

2.3 Multicollinearity

If $\mathrm{rank}(X'X) < k$, then $\hat{\beta}$ is not defined. This can be called strict multicollinearity. This happens when the columns of $X$ are linearly dependent, i.e., there is some $\alpha \neq 0$ such that $X\alpha = 0$. Most commonly, this arises when sets of regressors are included which are identically related. For example, if $X$ includes both the logs of two prices and the log of the relative price: $\log(p_1)$, $\log(p_2)$ and $\log(p_1/p_2)$. When this happens, the applied researcher quickly discovers the error as the statistical software will be unable to construct $(X'X)^{-1}$. Since the error is discovered quickly, this is rarely a problem for applied econometric practice.

The more relevant issue is near multicollinearity, which is often called "multicollinearity" for brevity. This is the situation when the $X'X$ matrix is near singular, when the columns of $X$ are close to linearly dependent. This definition is not precise, because we have not said what it means for a matrix to be "near singular". This is one difficulty with the definition and interpretation of multicollinearity.

One implication of near singularity of matrices is that the numerical reliability of the calculations is reduced. It is possible that the reported calculations will be in error due to floating-point calculation difficulties.

More relevantly, in practice an implication of near multicollinearity is that the estimation precision of individual coefficients will be poor. We can see this most simply in a model with two regressors and no intercept:
$$y_i = x_{1i}\beta_1 + x_{2i}\beta_2 + e_i,$$
where $e_i$ is independent of $x_{1i}$ and $x_{2i}$, $Ee_i^2 = 1$, and
$$E\begin{pmatrix} x_{1i}^2 & x_{1i}x_{2i} \\ x_{1i}x_{2i} & x_{2i}^2 \end{pmatrix} = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$
In this case the asymptotic covariance matrix $V$ is
$$V = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}^{-1} = \frac{1}{1 - \rho^2}\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}.$$
The correlation $\rho$ indexes collinearity, since as $\rho$ approaches 1 the matrix becomes singular. We can see the effect of collinearity on precision by examining the asymptotic variance of either coefficient estimate, which is $\left(1 - \rho^2\right)^{-1}$. As $\rho$ approaches 1, the variance rises quickly to infinity. Thus the more "collinear" are the regressors, the worse the precision of the individual coefficient estimates.

Basically, what is happening is that when the regressors are highly dependent, it is statistically difficult to disentangle the impact of $\beta_1$ from that of $\beta_2$. The precision of individual estimates is reduced.

Is there a simple solution? Basically, no. Fortunately, multicollinearity does not lead to errors in inference. The asymptotic distribution theory is still valid. Regression estimates are asymptotically normal, and estimated standard errors are consistent for the asymptotic variance. So reported confidence intervals are not inherently misleading. They will be large, correctly indicating the inherent uncertainty about the true parameter value.

2.4 Forecast Intervals

In the linear regression model, $m(x) = E(y_i \mid x_i = x) = x'\beta$. In some cases, we want to estimate $m(x)$ at a particular point $x$. Notice that this is a (linear) function of $\beta$. Letting $h(\beta) = x'\beta$ and $\theta = h(\beta)$, we see that $\hat{m}(x) = \hat{\theta} = x'\hat{\beta}$ and $H_\beta = x$, so $s(\hat{\theta}) = \sqrt{n^{-1}x'\hat{V}x}$. Thus an asymptotic 95% confidence interval for $m(x)$ is
$$\left[x'\hat{\beta} \pm 2\sqrt{n^{-1}x'\hat{V}x}\right].$$
It is interesting to observe that if this is viewed as a function of $x$, the width of the confidence set is dependent on $x$.

For a given value of $x_i = x$, we may want to forecast (guess) $y_i$ out-of-sample. A reasonable guess is the conditional mean $m(x)$, and indeed this is the mean-square-minimizing decision rule. Thus a point forecast is $\hat{m}(x) = x'\hat{\beta}$, the estimated conditional mean, as discussed above. We would also like a measure of uncertainty for the forecast.

The forecast error is $\hat{e}_i = y_i - \hat{m}(x) = e_i - x'\left(\hat{\beta} - \beta\right)$. As the out-of-sample error $e_i$ is independent of the in-sample estimate $\hat{\beta}$, this has variance
$$E\hat{e}_i^2 = E\left(e_i^2 \mid x_i = x\right) + x'E\left(\hat{\beta} - \beta\right)\left(\hat{\beta} - \beta\right)'x = \sigma^2(x) + n^{-1}x'Vx.$$
Assuming $E\left(e_i^2 \mid x_i\right) = \sigma^2$, the natural estimate of this variance is $\hat{\sigma}^2 + n^{-1}x'\hat{V}x$, so a standard error for the forecast is $\sqrt{\hat{\sigma}^2 + n^{-1}x'\hat{V}x}$. Notice that this is different from the standard error for the conditional mean.

It would appear natural to conclude that an asymptotic 95% forecast interval for $y_i$ is
$$\left[x'\hat{\beta} \pm 2\sqrt{\hat{\sigma}^2 + n^{-1}x'\hat{V}x}\right],$$
but this turns out to be incorrect. In general, the validity of an asymptotic confidence interval is based on the asymptotic normality of the studentized ratio. In the present case, this would require the asymptotic normality of the ratio
$$\frac{e_i - x'\left(\hat{\beta} - \beta\right)}{\sqrt{\hat{\sigma}^2 + n^{-1}x'\hat{V}x}}.$$
But no such asymptotic approximation can be made. The only special exception is the case where $e_i$ has the exact distribution $N(0, \sigma^2)$, which is generally invalid.

To get an accurate forecast interval, we need to estimate the conditional distribution of $e_i$ given $x_i = x$, which is a much more difficult task. Given the difficulty, most applied forecasters focus on the simple and unjustified interval
$$\left[x'\hat{\beta} \pm 2\sqrt{\hat{\sigma}^2 + n^{-1}x'\hat{V}x}\right].$$
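A sketch of the point forecast, its standard error, and the simple (and, as just noted, not formally justified) interval at a new regressor value x0 (hypothetical), reusing the earlier simulated objects:

```python
x0 = np.array([1.0, 0.5, -1.0])          # hypothetical out-of-sample regressor value

m_hat = x0 @ beta_hat                    # point forecast x0'beta_hat
se_mean = np.sqrt(x0 @ (V_hat / n) @ x0)                   # s.e. of the conditional mean
se_forecast = np.sqrt(sigma2_hat + x0 @ (V_hat / n) @ x0)  # forecast s.e.

forecast_interval = (m_hat - 2 * se_forecast, m_hat + 2 * se_forecast)
```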

2.5 NonLinearity in Regressors

In the regression setting we are interested in $E(y_i \mid x_i = x) = m(x)$, which need not be a linear function of $x$, and its precise form may be unknown. A common approach is to employ a polynomial approximation. Consider the case of $x_i \in \mathbb{R}$. Then a $k$'th order polynomial model is
$$y_i = \beta_0 + x_i\beta_1 + x_i^2\beta_2 + \cdots + x_i^k\beta_k + e_i.$$
Letting $\beta = (\beta_0, \beta_1, \ldots, \beta_k)$ and $z_i = (1, x_i, x_i^2, \ldots, x_i^k)$, this is the linear regression $y_i = z_i'\beta + e_i$.

Now suppose that $x \in \mathbb{R}^2$. A simple quadratic approximation is
$$y_i = \beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + x_{1i}^2\beta_3 + x_{2i}^2\beta_4 + x_{1i}x_{2i}\beta_5 + e_i.$$
As the dimensionality of $x$ increases, such approximations can become quite non-parsimonious! In practice, therefore, most applications do not appear to use more than quadratic terms. Some applications add cubics without interactions:
$$y_i = \beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + x_{1i}^2\beta_3 + x_{2i}^2\beta_4 + x_{1i}^3\beta_5 + x_{2i}^3\beta_6 + x_{1i}x_{2i}\beta_7 + e_i.$$
Non-linear approximations can also be made using alternative basis functions, such as Fourier series (sines and cosines), splines, neural nets, or wavelets.

Since these non-linear models are linear in the parameters, they can be estimated by OLS, and inference is conventional. However, the model is non-linear in the regressors, so interpretation must take this into account. For example, in the cubic model given above, the slope with respect to $x_{1i}$ is
$$\frac{\partial}{\partial x_{1i}}E(y_i \mid x_i) = \beta_1 + 2x_{1i}\beta_3 + 3x_{1i}^2\beta_5 + x_{2i}\beta_7,$$
which is a function of $x_{1i}$ and $x_{2i}$, making reporting of the "slope" difficult. In many applications, it will be important to report the slopes for different values of the regressors, carefully chosen to illustrate the point of interest. In other applications, an average slope may be sufficient. There are two obvious candidates: the derivative evaluated at the sample averages,
$$\frac{\partial}{\partial x_1}E(y_i \mid x_i)\Big|_{x_i = \bar{x}} = \beta_1 + 2\bar{x}_1\beta_3 + 3\bar{x}_1^2\beta_5 + \bar{x}_2\beta_7,$$
and the average derivative,
$$\frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial x_{1i}}E(y_i \mid x_i) = \beta_1 + \frac{2}{n}\sum_{i=1}^n x_{1i}\beta_3 + \frac{3}{n}\sum_{i=1}^n x_{1i}^2\beta_5 + \frac{1}{n}\sum_{i=1}^n x_{2i}\beta_7.$$
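A sketch of estimating such a specification by OLS, here a quadratic in two regressors built by hand (the data-generating values are illustrative):

```python
# Quadratic approximation in (x1, x2): columns 1, x1, x2, x1^2, x2^2, x1*x2.
x1, x2 = rng.normal(size=n), rng.normal(size=n)
Z = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2, x1 * x2])
y2 = 1 + x1 - 0.5 * x2 + 0.3 * x1 * x2 + rng.normal(size=n)  # illustrative DGP

gamma_hat, *_ = np.linalg.lstsq(Z, y2, rcond=None)

# Average slope with respect to x1 (for a quadratic this equals the slope
# evaluated at the sample means): gamma_1 + 2*gamma_3*mean(x1) + gamma_5*mean(x2).
avg_slope_x1 = gamma_hat[1] + 2 * gamma_hat[3] * x1.mean() + gamma_hat[5] * x2.mean()
```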

2.6 NonLinear Least Squares

We say that the regression function $m(x, \theta) = E(y_i \mid x_i = x)$ is nonlinear in the parameters if it cannot be written as $m(x, \theta) = z(x)'\theta$ for some function $z(x)$. Examples of nonlinear regression functions include
$$m(x, \theta) = \theta_1 + \theta_2\frac{x}{1 + \theta_3 x}$$
$$m(x, \theta) = \theta_1 + \theta_2 x^{\theta_3}$$
$$m(x, \theta) = \theta_1 + \theta_2\exp(\theta_3 x)$$
$$m(x, \theta) = G(x'\theta), \quad G \text{ known}$$
$$m(x, \theta) = \theta_1 + \theta_2 x_1 + \left(\theta_3 + \theta_4 x_1\right)\Phi\left(\frac{x_2 - \theta_5}{\theta_6}\right)$$
$$m(x, \theta) = \theta_1 + \theta_2 x + \theta_4\left(x - \theta_3\right)1\left(x > \theta_3\right)$$
$$m(x, \theta) = \left(\theta_1 + \theta_2 x_1\right)1\left(x_2 < \theta_3\right) + \left(\theta_4 + \theta_5 x_1\right)1\left(x_2 > \theta_3\right)$$
In the first five examples, $m(x, \theta)$ is (generically) differentiable in the parameters $\theta$. In the final two examples, $m$ is not differentiable with respect to $\theta_3$, which alters some of the analysis. When it exists, let
$$m_\theta(x, \theta) = \frac{\partial}{\partial\theta}m(x, \theta).$$

Nonlinear regression is frequently adopted because the functional form $m(x, \theta)$ is suggested by an economic model. In other cases, it is adopted as a flexible approximation to an unknown regression function.

The least squares estimator $\hat{\theta}$ minimizes the sum-of-squared-errors function
$$S_n(\theta) = \sum_{i=1}^n \left(y_i - m(x_i, \theta)\right)^2.$$
When the regression function is nonlinear, we call this the nonlinear least squares (NLLS) estimator. The NLLS residuals are $\hat{e}_i = y_i - m(x_i, \hat{\theta})$.

One motivation for the choice of NLLS as the estimation method is that the parameter $\theta$ is the solution to the population problem
$$\min_\theta E\left(y_i - m(x_i, \theta)\right)^2.$$
Since the sum-of-squared-errors function $S_n(\theta)$ is not quadratic, $\hat{\theta}$ must be found by numerical methods. See Appendix E. When $m(x, \theta)$ is differentiable, the FOC for minimization are
$$0 = \sum_{i=1}^n m_\theta(x_i, \hat{\theta})\hat{e}_i. \qquad (2.4)$$

Theorem 2.6.1 If the model is identified and $m(x, \theta)$ is differentiable with respect to $\theta$,
$$\sqrt{n}\left(\hat{\theta} - \theta_0\right) \to_d N(0, V)$$
$$V = \left(Em_{\theta i}m_{\theta i}'\right)^{-1}\left(Em_{\theta i}m_{\theta i}'e_i^2\right)\left(Em_{\theta i}m_{\theta i}'\right)^{-1}$$
where $m_{\theta i} = m_\theta(x_i, \theta_0)$.
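As an illustration, a sketch of NLLS for the exponential specification m(x, θ) = θ₁ + θ₂ exp(θ₃x), using scipy's general-purpose least-squares optimizer (one numerical approach among several, not the text's own code; the simulated values and starting point are illustrative):

```python
from scipy.optimize import least_squares

# Simulated data from the exponential regression (illustrative values).
x = rng.uniform(0, 2, size=n)
theta_true = np.array([1.0, 0.5, -1.0])
y3 = theta_true[0] + theta_true[1] * np.exp(theta_true[2] * x) + 0.1 * rng.normal(size=n)

def residuals(theta):
    # e_i(theta) = y_i - m(x_i, theta); least_squares minimizes the sum of squares.
    return y3 - (theta[0] + theta[1] * np.exp(theta[2] * x))

fit = least_squares(residuals, x0=np.array([0.0, 1.0, 0.0]))
theta_nlls = fit.x
```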


More information

Threshold Autoregressions and NonLinear Autoregressions

Threshold Autoregressions and NonLinear Autoregressions Threshold Autoregressions and NonLinear Autoregressions Original Presentation: Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Threshold Regression 1 / 47 Threshold Models

More information

Interpreting Regression Results

Interpreting Regression Results Interpreting Regression Results Carlo Favero Favero () Interpreting Regression Results 1 / 42 Interpreting Regression Results Interpreting regression results is not a simple exercise. We propose to split

More information

STAT 100C: Linear models

STAT 100C: Linear models STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 56 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix

More information

New Developments in Econometrics Lecture 16: Quantile Estimation

New Developments in Econometrics Lecture 16: Quantile Estimation New Developments in Econometrics Lecture 16: Quantile Estimation Jeff Wooldridge Cemmap Lectures, UCL, June 2009 1. Review of Means, Medians, and Quantiles 2. Some Useful Asymptotic Results 3. Quantile

More information

Advanced Econometrics I

Advanced Econometrics I Lecture Notes Autumn 2010 Dr. Getinet Haile, University of Mannheim 1. Introduction Introduction & CLRM, Autumn Term 2010 1 What is econometrics? Econometrics = economic statistics economic theory mathematics

More information

Wooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model

Wooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model Wooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model Most of this course will be concerned with use of a regression model: a structure in which one or more explanatory

More information

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018 Econometrics I KS Module 2: Multivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: April 16, 2018 Alexander Ahammer (JKU) Module 2: Multivariate

More information

Inference about Clustering and Parametric. Assumptions in Covariance Matrix Estimation

Inference about Clustering and Parametric. Assumptions in Covariance Matrix Estimation Inference about Clustering and Parametric Assumptions in Covariance Matrix Estimation Mikko Packalen y Tony Wirjanto z 26 November 2010 Abstract Selecting an estimator for the variance covariance matrix

More information

Essential of Simple regression

Essential of Simple regression Essential of Simple regression We use simple regression when we are interested in the relationship between two variables (e.g., x is class size, and y is student s GPA). For simplicity we assume the relationship

More information

Averaging Estimators for Regressions with a Possible Structural Break

Averaging Estimators for Regressions with a Possible Structural Break Averaging Estimators for Regressions with a Possible Structural Break Bruce E. Hansen University of Wisconsin y www.ssc.wisc.edu/~bhansen September 2007 Preliminary Abstract This paper investigates selection

More information

Time Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY

Time Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY Time Series Analysis James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY PREFACE xiii 1 Difference Equations 1.1. First-Order Difference Equations 1 1.2. pth-order Difference Equations 7

More information

Econometria. Estimation and hypotheses testing in the uni-equational linear regression model: cross-section data. Luca Fanelli. University of Bologna

Econometria. Estimation and hypotheses testing in the uni-equational linear regression model: cross-section data. Luca Fanelli. University of Bologna Econometria Estimation and hypotheses testing in the uni-equational linear regression model: cross-section data Luca Fanelli University of Bologna luca.fanelli@unibo.it Estimation and hypotheses testing

More information

Review of Econometrics

Review of Econometrics Review of Econometrics Zheng Tian June 5th, 2017 1 The Essence of the OLS Estimation Multiple regression model involves the models as follows Y i = β 0 + β 1 X 1i + β 2 X 2i + + β k X ki + u i, i = 1,...,

More information

ECON 4230 Intermediate Econometric Theory Exam

ECON 4230 Intermediate Econometric Theory Exam ECON 4230 Intermediate Econometric Theory Exam Multiple Choice (20 pts). Circle the best answer. 1. The Classical assumption of mean zero errors is satisfied if the regression model a) is linear in the

More information

Simple Linear Regression Model & Introduction to. OLS Estimation

Simple Linear Regression Model & Introduction to. OLS Estimation Inside ECOOMICS Introduction to Econometrics Simple Linear Regression Model & Introduction to Introduction OLS Estimation We are interested in a model that explains a variable y in terms of other variables

More information

Econometrics. Week 4. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague

Econometrics. Week 4. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Econometrics Week 4 Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Fall 2012 1 / 23 Recommended Reading For the today Serial correlation and heteroskedasticity in

More information

Multiple Regression Analysis

Multiple Regression Analysis Multiple Regression Analysis y = 0 + 1 x 1 + x +... k x k + u 6. Heteroskedasticity What is Heteroskedasticity?! Recall the assumption of homoskedasticity implied that conditional on the explanatory variables,

More information

Large Sample Properties of Estimators in the Classical Linear Regression Model

Large Sample Properties of Estimators in the Classical Linear Regression Model Large Sample Properties of Estimators in the Classical Linear Regression Model 7 October 004 A. Statement of the classical linear regression model The classical linear regression model can be written in

More information

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals (SW Chapter 5) Outline. The standard error of ˆ. Hypothesis tests concerning β 3. Confidence intervals for β 4. Regression

More information

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012 Problem Set #6: OLS Economics 835: Econometrics Fall 202 A preliminary result Suppose we have a random sample of size n on the scalar random variables (x, y) with finite means, variances, and covariance.

More information

Introduction to Econometrics. Heteroskedasticity

Introduction to Econometrics. Heteroskedasticity Introduction to Econometrics Introduction Heteroskedasticity When the variance of the errors changes across segments of the population, where the segments are determined by different values for the explanatory

More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

Models, Testing, and Correction of Heteroskedasticity. James L. Powell Department of Economics University of California, Berkeley

Models, Testing, and Correction of Heteroskedasticity. James L. Powell Department of Economics University of California, Berkeley Models, Testing, and Correction of Heteroskedasticity James L. Powell Department of Economics University of California, Berkeley Aitken s GLS and Weighted LS The Generalized Classical Regression Model

More information

Multivariate Time Series Analysis and Its Applications [Tsay (2005), chapter 8]

Multivariate Time Series Analysis and Its Applications [Tsay (2005), chapter 8] 1 Multivariate Time Series Analysis and Its Applications [Tsay (2005), chapter 8] Insights: Price movements in one market can spread easily and instantly to another market [economic globalization and internet

More information

Time Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY

Time Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY Time Series Analysis James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY & Contents PREFACE xiii 1 1.1. 1.2. Difference Equations First-Order Difference Equations 1 /?th-order Difference

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Structural Breaks October 29-31, / 91. Bruce E.

Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Structural Breaks October 29-31, / 91. Bruce E. Forecasting Lecture 3 Structural Breaks Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Structural Breaks October 29-31, 2013 1 / 91 Bruce E. Hansen Organization Detection

More information

Panel Data Models. James L. Powell Department of Economics University of California, Berkeley

Panel Data Models. James L. Powell Department of Economics University of California, Berkeley Panel Data Models James L. Powell Department of Economics University of California, Berkeley Overview Like Zellner s seemingly unrelated regression models, the dependent and explanatory variables for panel

More information

Economic modelling and forecasting

Economic modelling and forecasting Economic modelling and forecasting 2-6 February 2015 Bank of England he generalised method of moments Ole Rummel Adviser, CCBS at the Bank of England ole.rummel@bankofengland.co.uk Outline Classical estimation

More information

Introduction to Eco n o m et rics

Introduction to Eco n o m et rics 2008 AGI-Information Management Consultants May be used for personal purporses only or by libraries associated to dandelon.com network. Introduction to Eco n o m et rics Third Edition G.S. Maddala Formerly

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Environmental Econometrics

Environmental Econometrics Environmental Econometrics Syngjoo Choi Fall 2008 Environmental Econometrics (GR03) Fall 2008 1 / 37 Syllabus I This is an introductory econometrics course which assumes no prior knowledge on econometrics;

More information

1. The Multivariate Classical Linear Regression Model

1. The Multivariate Classical Linear Regression Model Business School, Brunel University MSc. EC550/5509 Modelling Financial Decisions and Markets/Introduction to Quantitative Methods Prof. Menelaos Karanasos (Room SS69, Tel. 08956584) Lecture Notes 5. The

More information

Reduced rank regression in cointegrated models

Reduced rank regression in cointegrated models Journal of Econometrics 06 (2002) 203 26 www.elsevier.com/locate/econbase Reduced rank regression in cointegrated models.w. Anderson Department of Statistics, Stanford University, Stanford, CA 94305-4065,

More information

388 Index Differencing test ,232 Distributed lags , 147 arithmetic lag.

388 Index Differencing test ,232 Distributed lags , 147 arithmetic lag. INDEX Aggregation... 104 Almon lag... 135-140,149 AR(1) process... 114-130,240,246,324-325,366,370,374 ARCH... 376-379 ARlMA... 365 Asymptotically unbiased... 13,50 Autocorrelation... 113-130, 142-150,324-325,365-369

More information

Chapter 1. GMM: Basic Concepts

Chapter 1. GMM: Basic Concepts Chapter 1. GMM: Basic Concepts Contents 1 Motivating Examples 1 1.1 Instrumental variable estimator....................... 1 1.2 Estimating parameters in monetary policy rules.............. 2 1.3 Estimating

More information

Econometrics II. Nonstandard Standard Error Issues: A Guide for the. Practitioner

Econometrics II. Nonstandard Standard Error Issues: A Guide for the. Practitioner Econometrics II Nonstandard Standard Error Issues: A Guide for the Practitioner Måns Söderbom 10 May 2011 Department of Economics, University of Gothenburg. Email: mans.soderbom@economics.gu.se. Web: www.economics.gu.se/soderbom,

More information

Linear Regression with Multiple Regressors

Linear Regression with Multiple Regressors Linear Regression with Multiple Regressors (SW Chapter 6) Outline 1. Omitted variable bias 2. Causality and regression analysis 3. Multiple regression and OLS 4. Measures of fit 5. Sampling distribution

More information

Economics 583: Econometric Theory I A Primer on Asymptotics

Economics 583: Econometric Theory I A Primer on Asymptotics Economics 583: Econometric Theory I A Primer on Asymptotics Eric Zivot January 14, 2013 The two main concepts in asymptotic theory that we will use are Consistency Asymptotic Normality Intuition consistency:

More information

Multivariate Time Series: VAR(p) Processes and Models

Multivariate Time Series: VAR(p) Processes and Models Multivariate Time Series: VAR(p) Processes and Models A VAR(p) model, for p > 0 is X t = φ 0 + Φ 1 X t 1 + + Φ p X t p + A t, where X t, φ 0, and X t i are k-vectors, Φ 1,..., Φ p are k k matrices, with

More information

ECONOMETRICS HONOR S EXAM REVIEW SESSION

ECONOMETRICS HONOR S EXAM REVIEW SESSION ECONOMETRICS HONOR S EXAM REVIEW SESSION Eunice Han ehan@fas.harvard.edu March 26 th, 2013 Harvard University Information 2 Exam: April 3 rd 3-6pm @ Emerson 105 Bring a calculator and extra pens. Notes

More information

Linear Model Under General Variance

Linear Model Under General Variance Linear Model Under General Variance We have a sample of T random variables y 1, y 2,, y T, satisfying the linear model Y = X β + e, where Y = (y 1,, y T )' is a (T 1) vector of random variables, X = (T

More information

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Spring 2013 Instructor: Victor Aguirregabiria

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Spring 2013 Instructor: Victor Aguirregabiria ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Spring 2013 Instructor: Victor Aguirregabiria SOLUTION TO FINAL EXAM Friday, April 12, 2013. From 9:00-12:00 (3 hours) INSTRUCTIONS:

More information

Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) 1.1 The Formal Denition of a Vector Space

Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) 1.1 The Formal Denition of a Vector Space Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) Contents 1 Vector Spaces 1 1.1 The Formal Denition of a Vector Space.................................. 1 1.2 Subspaces...................................................

More information

Econ 423 Lecture Notes: Additional Topics in Time Series 1

Econ 423 Lecture Notes: Additional Topics in Time Series 1 Econ 423 Lecture Notes: Additional Topics in Time Series 1 John C. Chao April 25, 2017 1 These notes are based in large part on Chapter 16 of Stock and Watson (2011). They are for instructional purposes

More information

What s New in Econometrics? Lecture 14 Quantile Methods

What s New in Econometrics? Lecture 14 Quantile Methods What s New in Econometrics? Lecture 14 Quantile Methods Jeff Wooldridge NBER Summer Institute, 2007 1. Reminders About Means, Medians, and Quantiles 2. Some Useful Asymptotic Results 3. Quantile Regression

More information

Econometrics Multiple Regression Analysis: Heteroskedasticity

Econometrics Multiple Regression Analysis: Heteroskedasticity Econometrics Multiple Regression Analysis: João Valle e Azevedo Faculdade de Economia Universidade Nova de Lisboa Spring Semester João Valle e Azevedo (FEUNL) Econometrics Lisbon, April 2011 1 / 19 Properties

More information

Multiple Regression Model: I

Multiple Regression Model: I Multiple Regression Model: I Suppose the data are generated according to y i 1 x i1 2 x i2 K x ik u i i 1...n Define y 1 x 11 x 1K 1 u 1 y y n X x n1 x nk K u u n So y n, X nxk, K, u n Rks: In many applications,

More information

WISE International Masters

WISE International Masters WISE International Masters ECONOMETRICS Instructor: Brett Graham INSTRUCTIONS TO STUDENTS 1 The time allowed for this examination paper is 2 hours. 2 This examination paper contains 32 questions. You are

More information

ECON3327: Financial Econometrics, Spring 2016

ECON3327: Financial Econometrics, Spring 2016 ECON3327: Financial Econometrics, Spring 2016 Wooldridge, Introductory Econometrics (5th ed, 2012) Chapter 11: OLS with time series data Stationary and weakly dependent time series The notion of a stationary

More information

Greene, Econometric Analysis (7th ed, 2012)

Greene, Econometric Analysis (7th ed, 2012) EC771: Econometrics, Spring 2012 Greene, Econometric Analysis (7th ed, 2012) Chapters 2 3: Classical Linear Regression The classical linear regression model is the single most useful tool in econometrics.

More information

Multivariate Regression: Part I

Multivariate Regression: Part I Topic 1 Multivariate Regression: Part I ARE/ECN 240 A Graduate Econometrics Professor: Òscar Jordà Outline of this topic Statement of the objective: we want to explain the behavior of one variable as a

More information

Chapter 2. Dynamic panel data models

Chapter 2. Dynamic panel data models Chapter 2. Dynamic panel data models School of Economics and Management - University of Geneva Christophe Hurlin, Université of Orléans University of Orléans April 2018 C. Hurlin (University of Orléans)

More information

An estimate of the long-run covariance matrix, Ω, is necessary to calculate asymptotic

An estimate of the long-run covariance matrix, Ω, is necessary to calculate asymptotic Chapter 6 ESTIMATION OF THE LONG-RUN COVARIANCE MATRIX An estimate of the long-run covariance matrix, Ω, is necessary to calculate asymptotic standard errors for the OLS and linear IV estimators presented

More information

Specification Test for Instrumental Variables Regression with Many Instruments

Specification Test for Instrumental Variables Regression with Many Instruments Specification Test for Instrumental Variables Regression with Many Instruments Yoonseok Lee and Ryo Okui April 009 Preliminary; comments are welcome Abstract This paper considers specification testing

More information

Econ 510 B. Brown Spring 2014 Final Exam Answers

Econ 510 B. Brown Spring 2014 Final Exam Answers Econ 510 B. Brown Spring 2014 Final Exam Answers Answer five of the following questions. You must answer question 7. The question are weighted equally. You have 2.5 hours. You may use a calculator. Brevity

More information

GMM-based inference in the AR(1) panel data model for parameter values where local identi cation fails

GMM-based inference in the AR(1) panel data model for parameter values where local identi cation fails GMM-based inference in the AR() panel data model for parameter values where local identi cation fails Edith Madsen entre for Applied Microeconometrics (AM) Department of Economics, University of openhagen,

More information

Lecture: Simultaneous Equation Model (Wooldridge s Book Chapter 16)

Lecture: Simultaneous Equation Model (Wooldridge s Book Chapter 16) Lecture: Simultaneous Equation Model (Wooldridge s Book Chapter 16) 1 2 Model Consider a system of two regressions y 1 = β 1 y 2 + u 1 (1) y 2 = β 2 y 1 + u 2 (2) This is a simultaneous equation model

More information

Heteroskedasticity ECONOMETRICS (ECON 360) BEN VAN KAMMEN, PHD

Heteroskedasticity ECONOMETRICS (ECON 360) BEN VAN KAMMEN, PHD Heteroskedasticity ECONOMETRICS (ECON 360) BEN VAN KAMMEN, PHD Introduction For pedagogical reasons, OLS is presented initially under strong simplifying assumptions. One of these is homoskedastic errors,

More information