Almost Unbiased Estimation in Simultaneous Equations Models with Strong and / or Weak Instruments


Almost Unbiased Estimation in Simultaneous Equations Models with Strong and/or Weak Instruments

Emma M. Iglesias and Garry D.A. Phillips
Michigan State University / Cardiff University

This version: March 2008

Abstract

We propose two simple bias reduction procedures that apply to estimators in a general static simultaneous equation model. Standard jackknife estimators, as applied to 2SLS, may not reduce the bias of the exogenous variable coefficient estimators, since the estimator biases are not monotonically non-increasing with sample size (a necessary condition for successful bias reduction), and they have moments only up to the order of overidentification. Our proposed approaches have neither of these drawbacks. In the first procedure, both endogenous and exogenous variable parameter estimators are unbiased to order T^{-2}, and when it is implemented for k-class estimators with k < 1, the higher order moments exist. An alternative second approach is based on taking linear combinations of k-class estimators with k < 1. In general, this yields estimators which are unbiased to order T^{-1} and which possess higher moments. We also prove theoretically that the combined k-class estimator produces a smaller mean squared error than 2SLS when the degree of overidentification of the system is larger than 7. Moreover, the combined k-class estimators remain unbiased to order T^{-1} even if there are redundant variables (including weak instruments) in any part of the simultaneous equation system, and we can allow for any number of endogenous variables. The performance of the two procedures is compared with 2SLS in a number of Monte Carlo experiments using a simple two equation model.
We are grateful for helpful comments from Bruce Hansen, Tim Vogelsang, Jeffrey Wooldridge and seminar participants at Harvard / MIT, Michigan State University, University of Montreal, Purdue and at SMU (Singapore); and also from participants at the 2006 Far Eastern Meeting of the Econometric Society, the 2006 European Meeting of the Econometric Society and the 2006 Midwest Econometrics Group. Financial support from the MSU Intramural Research Grants Program is gratefully acknowledged. Corresponding author: Department of Economics, Michigan State University, 11 Marshall-Adams Hall, East Lansing, MI 48824-138, USA. E-mail: iglesia5@msu.edu.

Keywords: Combined k-class estimators; Bias correction; Weak instruments; Endogenous and exogenous parameter estimators.

JEL classification: C12; C13; C30; C51.

1 Introduction

Instrumental variable (IV) estimators (see e.g. Hausman (1983), Staiger and Stock (1997), Phillips (2000) and Hahn and Hausman (2002a, 2002b, 2003)) are known to have very poor finite sample properties, both when the size of the sample is very small and/or when the instruments are weak (see e.g. Nelson and Startz (1990a, 1990b) and Staiger and Stock (1997)). Andrews and Stock (2006) recently provided an extensive and updated review of the literature on simultaneous equations systems and weak instruments. If the problem is simply a small sample, then traditional higher order strong-IV asymptotics may provide an appropriate means of analysis; however, this is not enough in the weak instruments context. Therefore, in the weak instruments literature, several different types of asymptotics have been proposed to analyze the performance of instrumental variable estimators: for example, weak IV asymptotics, many IV asymptotics and many weak IV asymptotics; see Andrews and Stock (2006). In relation to bias, Richardson and Wu (1971) found the exact bias of the Two Stage Least Squares (2SLS) estimator, while Sawa (1972) gave the exact bias for the general k-class estimator. More recently, Chao and Swanson (2007) derived the asymptotic bias of the 2SLS estimator using the weak IV asymptotics of Staiger and Stock (1997), allowing for possibly non-normal errors and stochastic instruments. All these results were derived in the context of an equation containing two endogenous variables. The exact density of instrumental variable estimators in the general case is given in Phillips (2000). However, the resulting moment expressions are complex and we shall not consider them in this paper.
In this paper we are interested in constructing two very simple bias correction procedures that can work in a range of cases, including when we have weak instruments. Recently Davidson and MacKinnon (2006), in a comprehensive Monte Carlo study of a two endogenous variable equation, found little evidence to support the use of the Jackknife Instrumental Variable estimator introduced by Phillips and Hale (1977), and later used by Angrist, Imbens and Krueger (1999) and Blomquist and Dahlberg (1999), especially in the presence of weak instruments. In addition, they concluded that there is no clear advantage in using Limited Information Maximum Likelihood (LIML) over Two Stage Least Squares (2SLS) in practice, particularly in the context of weak instruments, while estimators which possess moments tend to perform better than those which do not. So there remain questions about the finite sample properties of the standard simultaneous equation estimators, especially in the weak instrument context.

For our analysis, we deal with a broad setting where we allow for a general number of endogenous and exogenous variables in the system. We first develop a bias reduction approach that applies to both endogenous and exogenous variable parameter estimators. We use both large-T asymptotic analysis and exact finite sample results to propose a procedure for reducing the bias of k-class estimators that works particularly well when the system includes weak instruments since, in this case, any efficiency loss is minimized. We examine the bias of 2SLS and also of other k-class estimators, and we show that the bias to order T^{-2} is eliminated when k_21, the number of exogenous variables excluded from the equation of interest, is chosen so that the parameters are notionally overidentified of order one. This is achieved through the introduction of redundant regressors. While the resulting bias corrected estimators are more dispersed, the increase in dispersion is smaller the weaker the instruments available in the system. We show in our simulations for sample size T = 100 that the bias correction is so impressive that there are some indications that the bias may be of even smaller order. The above procedure, as applied to 2SLS, has the main disadvantage that whereas the estimator has a finite bias, the variance does not exist. We therefore consider other k-class estimators where we can continue to get a reduction of the bias by using our approach, and where higher order moments will exist. In particular, we consider the estimators where k = 1 − 1/T^3 and k = 1 − 1/T. The former is especially interesting since its behavior is very close to that of 2SLS yet higher moments are defined.
Note that LIML is known to be median unbiased up to order T^{-1} (see Rothenberg (1983)), while with our bias correction mechanism we can obtain estimators that are unbiased up to order T^{-1}, and sometimes up to order T^{-2}, even when we have redundant variables in any part of the system. Doran and Schmidt (2006) have recently proposed the use of principal components to reduce the number of instruments in the system. We propose here an alternative procedure where we choose the number of instruments such that the system is overidentified of order one. A second procedure that we analyze is based on a linear combination of k-class estimators that is unbiased to order T^{-1}, has higher moments, and generally improves on MSE compared to 2SLS when there is a moderate or larger number of (weak) instruments in the system. In particular, we show that, generally, the MSE tends to be smaller for systems that are overidentified of order higher than 7. The bias correction procedures proposed in this paper may have several advantages over other bias reduction techniques. First, the standard delete-one jackknife estimator referred to as JN2SLS, recently proposed by Hahn, Hausman and Kuersteiner (2004), does generally reduce the bias of the endogenous variable coefficient estimator, but it does not necessarily reduce the bias of non-redundant

exogenous variable coefficient estimators since, as Ip and Phillips (1998) show, the exogenous coefficient bias is not monotonically non-increasing with sample size. Our simple bias correction will work with non-redundant included exogenous variables. Second, our approaches have an advantage over the Chao and Swanson (2007) bias correction, which does not work well for small k_21 values. In fact, Chao and Swanson (2007) show in their paper that their procedure does not work very well for k_21 = 6, while in our simulations we get a very satisfactory removal of the bias even when k_21 = 4. Hansen, Hausman and Newey (2007) point out, on checking five years of the microeconometric papers published in the Journal of Political Economy, the Quarterly Journal of Economics and the American Economic Review, that 50 percent had at least one overidentifying restriction, 25 percent had at least three and 10 percent had 7 or more. This justifies the need to have procedures available that can work well in practice, at different orders of overidentification and with a range of instruments. When there is only one overidentifying restriction, the frequently used 2SLS estimator does not have a finite variance, and this applies in 50% of the papers analyzed above. However, our procedures based on k-class estimators for which k < 1 lead to estimators that have higher order moments. Indeed, we show in our simulations that both our procedures work well for the case of one overidentifying restriction and, more generally, for both small and large numbers of instruments. Third, following Hansen, Hausman and Newey (2007), we have checked the microeconometric papers published in 2004 and 2005 in the Journal of Political Economy, Quarterly Journal of Economics and American Economic Review, and 40 per cent of them use more than one included endogenous variable in their simultaneous equation models. Moreover, 2SLS is the most commonly chosen estimation procedure.
Therefore, this supports the need to develop attractive estimation methods that can deal with more than one included endogenous variable in the system. Note that, in the current econometrics literature, some of the procedures that have been developed only allow for one endogenous variable in the system (see e.g. Donald and Newey (2001)) and use 2SLS as the main estimation procedure. We propose in this paper an alternative estimation procedure that allows for any number of endogenous variables in the system. Fourth, our procedures are very easy to implement and do not use a traditional bias correction approach whereby the estimated bias is removed by subtraction from the uncorrected estimate. At the same time, the bias correction works even if we introduce irrelevant variables in any equation of the system. Finally, we are able to prove theoretically that both of our approaches produce estimators that are unbiased up to order T^{-1}, and also that the approach that combines k-class estimators produces a smaller MSE than 2SLS when the system is overidentified of order larger than 7. A final remark is that there is a growing econometric literature on nearly-weak identification (see e.g. Caner (2007) and Antoine and Renault (2007)), where standard strong asymptotics remains valid. Our bias correction mechanisms will work for this case also.

The structure of the paper is as follows. Section 2 presents the model and a summary of large-T approximations which provide a theoretical underpinning for the bias correction. We also discuss some of the exact finite sample theory results for 2SLS which provide further theoretical support, and we examine the weak instrument model of Staiger and Stock (1997) and Chao and Swanson (2007). In the setting of Staiger and Stock (1997) only weak instruments are available, while in the strong asymptotic setting we have a mixture of weak and strong instruments. In Section 3 we discuss in detail one possible procedure for bias correction based on introducing redundant exogenous variables, and we consider how the asymptotic covariance matrix changes with the choice made. This provides a criterion for selecting such variables optimally. In Section 4 we present the main theoretical results. We develop a new estimator based on a linear combination of k-class estimators with k < 1 which has higher moments and which generally has a smaller mean squared error than 2SLS when L, the order of overidentification, exceeds 7. We also explore the approximate variance of the new combined estimator and show that the Nagar (1959) bias approximation holds for k-class estimators under conditional heteroskedasticity, thus enabling our bias reduction to carry over to this case too. In Section 5 we present the results of simulation experiments which indicate the usefulness of our proposed procedures. Finally, Section 6 concludes. The proofs are collected in Appendices 1 and 2.

2 The Simultaneous Equation Model

We consider a simultaneous equation model containing G equations given by

(1) By_t + Γx_t = u_t, t = 1, 2, ..., T,

in which y_t is a G × 1 vector of endogenous variables, x_t is a K × 1 vector of strongly exogenous variables and u_t is a G × 1 vector of structural disturbances with G × G positive definite covariance matrix Σ. The matrices of structural coefficients, B and Γ, are, respectively, G × G and G × K.
It is assumed that B is non-singular, so that the reduced form equations corresponding to (1) are

y_t = −B^{-1}Γx_t + B^{-1}u_t = Πx_t + v_t,

where Π = −B^{-1}Γ is a G × K matrix of reduced form coefficients and v_t is a G × 1 vector of reduced form disturbances with a G × G positive definite covariance matrix Ω. With T observations we may write the system as

(2) YB′ + XΓ′ = U.
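The reduced-form relation can be checked numerically. The sketch below is our own illustration (the coefficient values are arbitrary): it builds a toy two-equation system By_t + Γx_t = u_t and verifies that Π = −B^{-1}Γ reproduces y_t.

```python
import numpy as np

# Toy 2-equation system By_t + Gamma x_t = u_t; check the reduced form Pi = -B^{-1} Gamma.
B = np.array([[1.0, -0.5],
              [-0.3, 1.0]])           # structural coefficients on y_t (non-singular)
Gamma = np.array([[0.8, 0.0, 0.2],
                  [0.0, 1.1, -0.4]])  # structural coefficients on x_t
Pi = -np.linalg.solve(B, Gamma)       # reduced-form coefficient matrix

rng = np.random.default_rng(2)
x = rng.standard_normal(3)            # one draw of the exogenous variables
u = rng.standard_normal(2)            # one draw of the structural disturbances
y = np.linalg.solve(B, u - Gamma @ x) # solve By = -Gamma x + u for y_t
print(y - (Pi @ x + np.linalg.solve(B, u)))  # reduced-form residual: ~0
```

Solving the structural form and evaluating the reduced form give the same y_t, as the printed residual shows.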

Here, Y is a T × G matrix of observations on endogenous variables, X is a T × K matrix of observations on the strongly exogenous variables and U is a T × G matrix of structural disturbances. The first equation of the system will be written as

(3) y_1 = Y_2 β + X_1 γ + u_1,

where y_1 and Y_2 are, respectively, a T × 1 vector and a T × g_1 matrix of observations on g_1 + 1 endogenous variables, X_1 is a T × k_1 matrix of observations on k_1 exogenous variables, β and γ are, respectively, g_1 × 1 and k_1 × 1 vectors of unknown parameters, and u_1 is a T × 1 vector of disturbances with covariance matrix E(u_1 u_1′) = σ_11 I_T, where I_T denotes the identity matrix of dimension T. We shall find it convenient to rewrite (3) as

(4) y_1 = Y_2 β + X_1 γ + u_1 = Z_1 α + u_1,

where α = (β′, γ′)′ and Z_1 = (Y_2 : X_1). The reduced form of the system includes Y_1 = XΠ_1 + V_1, in which Y_1 = (y_1 : Y_2), X = (X_1 : X_2) is a T × K matrix of observations on K exogenous variables with an associated K × (g_1 + 1) matrix of reduced form parameters given by Π_1 = (π_1 : Π_2), while V_1 = (v_1 : V_2) is a T × (g_1 + 1) matrix of reduced form disturbances. The transpose of each row of V_1 has zero mean vector and (g_1 + 1) × (g_1 + 1) positive definite covariance matrix Ω_1 = (ω_ij). It is further assumed that:

Assumption (i): The T × K matrix X is strongly exogenous and of rank K, with limit matrix lim_{T→∞} T^{-1} X′X = Σ_XX, which is K × K positive definite.

Assumption (ii): Equation (3) is overidentified, so that K > g_1 + k_1, i.e. the number of excluded variables exceeds the number required for the equation to be just identified. In cases where second moments are analyzed we shall assume that K exceeds g_1 + k_1 by at least two.

These overidentifying restrictions are sufficient to ensure that the Nagar expansion is valid in the case considered by Nagar and that the estimator moments exist: see Sargan (1974).
Nagar, in deriving the moment approximations to be discussed in the next section, assumed that the structural disturbances were normally, independently and identically distributed. Subsequently it has proved possible to obtain his bias approximation under much weaker conditions. In fact, Phillips (2000, 2007) showed that normality is not required and that a sufficient condition for the Nagar bias approximation to hold is that the disturbances in the system be Gauss–Markov. However, to derive the higher order bias and second moment approximations, normality is required.

2.1 Large-T approximations in the simultaneous equation model

In his seminal paper, Nagar (1959) presented approximations for the first and second moments of the k-class of estimators where k = 1 + θ/T, with θ non-stochastic and any real number. Notice that (1 − k) is of order T^{-1}. The main results are given by the following:

1. The bias of the k-class estimator for α in (4) is given by

(5) E(α̂_k − α) = [L − θ − 1] Qq + o(T^{-1}).

2. The second moment matrix of the k-class estimator for α in (4) is given by

E[(α̂_k − α)(α̂_k − α)′] = σ² Q[I + A] + o(T^{-2}),

where

A = [−(2L − 3) tr(C_1 Q) + tr(C_2 Q)] I + {(θ − L + 2)² + 2(θ + 1)} C_1 Q + (θ − L + 2) C_2 Q.

To interpret the above approximations we define L = k_21 − g_1 and Y_2 = Ȳ_2 + V_2 with Ȳ_2 = XΠ_2,

Q = [Ȳ_2′Ȳ_2, Ȳ_2′X_1; X_1′Ȳ_2, X_1′X_1]^{-1},

and V_2 = u_1 π′ + W, where u_1 and W are independent, so that

(1/T) E(V_z′u_1) = [σ²π; 0] = q and C = (1/T) E(V_z′V_z) = C_1 + C_2,

where C_1 = [σ²ππ′, 0; 0, 0] and C_2 = [(1/T)E(W′W), 0; 0, 0]. In Appendix 2 we shall also introduce the T × T idempotent matrix M = I − X(X′X)^{-1}X′ and the T × (g_1 + k_1) matrix V_z = (V_2 : 0).

The approximations for the 2SLS estimator are found by setting θ = 0 in the first expression above, so that, for example, the 2SLS bias approximation is given by

E(α̂ − α) = (L − 1) Qq + o(1/T).

The 2SLS bias approximation was extended by Mikhail (1972) to

E(α̂ − α) = (L − 1)[I + tr(QC) I − (L − 2) QC] Qq + o(1/T²).
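The bias approximation (5) can be illustrated by simulation. The following sketch is our own construction (the sample size, instrument strength and error correlation are arbitrary choices): it implements the k-class estimator and compares the empirical bias of 2SLS (θ = 0) with Nagar's choice k = 1 + (L − 1)/T, for which the order-T^{-1} term [L − θ − 1]Qq vanishes.

```python
import numpy as np

def k_class(y1, Z, X, k):
    """k-class estimator for y1 = Z a + u1, with Z = (Y2 : X1) and X all exogenous:
    a_hat = (k Z'P_X Z + (1-k) Z'Z)^{-1} (k Z'P_X y1 + (1-k) Z'y1)."""
    PxZ = X @ np.linalg.lstsq(X, Z, rcond=None)[0]   # P_X Z without forming P_X
    Pxy = X @ np.linalg.lstsq(X, y1, rcond=None)[0]  # P_X y1
    A = k * (Z.T @ PxZ) + (1.0 - k) * (Z.T @ Z)
    b = k * (Z.T @ Pxy) + (1.0 - k) * (Z.T @ y1)
    return np.linalg.solve(A, b)

# One endogenous regressor (g1 = 1), no included exogenous (k1 = 0), k21 instruments.
rng = np.random.default_rng(0)
T, k21, beta, n_rep = 100, 8, 1.0, 3000
L = k21 - 1
bias_2sls = bias_nagar = 0.0
for _ in range(n_rep):
    X2 = rng.standard_normal((T, k21))           # excluded exogenous (instruments)
    u = rng.standard_normal(T)                   # structural disturbance
    v = 0.8 * u + 0.6 * rng.standard_normal(T)   # reduced-form disturbance, corr. with u
    y2 = X2 @ np.full(k21, 0.3) + v              # first stage
    y1 = beta * y2 + u                           # structural equation
    Z = y2[:, None]
    bias_2sls += (k_class(y1, Z, X2, 1.0)[0] - beta) / n_rep
    bias_nagar += (k_class(y1, Z, X2, 1.0 + (L - 1.0) / T)[0] - beta) / n_rep
print(f"2SLS bias ~ {bias_2sls:+.4f}, Nagar-k bias ~ {bias_nagar:+.4f}")
```

In runs of this kind the 2SLS bias is clearly nonzero while the Nagar-k bias is much smaller, in line with (5).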

Notice that this bias approximation contains the term (L − 1)Qq which, as we have seen, is the approximation of order 1/T, whereas the remaining term, (L − 1)[tr(QC)I − (L − 2)QC]Qq, is of order 1/T². This higher order approximation is of importance for this paper. Note that the 1/T² term includes a component (L − 1)(L − 2)QCQq which may be relatively large when L is large, a fact that will be commented on again later. It is also of particular interest that the approximate bias is zero to order 1/T² when L − 1 = 0, i.e. when K − (g_1 + k_1) = 1. We also denote k_21 = K − k_1.

It is well known that in simultaneous equation estimation there is an existence-of-moments problem: the LIML estimator has no moments of any order in the classical framework, and neither does any k-class estimator with k > 1. On the other hand, the 2SLS estimator (k = 1) has moments up to the order of overidentification (L), while for k < 1, moments may be shown to exist up to order (T − g_1 − k_1 + 2); see, in particular, Kinal (1980). Little use has been made of k-class estimators for which 0 < k < 1, despite the fact that for them there is no existence-of-moments problem. However, we shall see in the next section that there are situations, in particular when bias correction is required, where such estimators can be most helpful. Before proceeding further, we shall examine a proposal by Donald and Newey (2001), who present, for the case of 2 included endogenous variables, an estimator whose objective is to minimize the mean squared error.
To set the Donald and Newey (2001) estimator within the context of the k-class of estimators, we note that such estimators have traditionally been presented in the form

(β̂′, γ̂′)′ = [Y_2′Y_2 − k V̂_2′V̂_2, Y_2′X_1; X_1′Y_2, X_1′X_1]^{-1} [(Y_2 − k V̂_2)′y_1; X_1′y_1]
= (Z′P_X Z + (1 − k) Z′(I − P_X)Z)^{-1} (Z′y_1 − k Z′(I − P_X)y_1)
= (k Z′P_X Z + (1 − k) Z′Z)^{-1} (k Z′P_X y_1 + (1 − k) Z′y_1),

on putting Z = (Y_2 : X_1) and V̂_2 = (I − P_X)Y_2. Dividing inside the inverse and in the second factor by (1 − k), the estimator becomes

(β̂′, γ̂′)′ = ((k/(1 − k)) Z′P_X Z + Z′Z)^{-1} ((k/(1 − k)) Z′P_X y_1 + Z′y_1).

The Donald and Newey (2001, page 1164) B2SLS estimator is given as

δ̂ = (Z′P_X Z − Λ̂ Z′Z)^{-1} (Z′P_X y_1 − Λ̂ Z′y_1),

which is clearly a k-class estimator with k/(1 − k) = −1/Λ̂. Solving for k yields (1 − k) + kΛ̂ = 0, or 1 = k(1 − Λ̂), which gives k = 1/(1 − Λ̂). Now Donald and Newey (2001) state that Λ̂ = (K − d_1 − 2)/T (where d_1 = k_1), so that

k = 1/(1 − (K − d_1 − 2)/T) = 1 + [(K − d_1 − 2)/T]/[1 − (K − d_1 − 2)/T] > 1.

Nagar (1959) proposed an unbiased estimator for which k = 1 + (L − 1)/T, which is not the same as the Donald–Newey estimator. To order T^{-1} the Donald and Newey (2001) estimator has k = 1 + (K − d_1 − 2)/T, which coincides with Nagar (1959) when there is only one endogenous regressor but not otherwise. Therefore, the unbiasedness of the Donald and Newey (2001) estimator in the context of more than one endogenous variable in the system is not a trivial extension. Moreover, the Donald and Newey (2001) estimator is momentless in the case of one endogenous regressor (since it then coincides with Nagar (1959)). One of the main novelties of our paper is that we present an estimator that is unbiased to order T^{-1} in the setting of any number of endogenous variables and that allows for the existence of any number of irrelevant variables in the system. Moreover, our estimator has moments when dealing with the k-class where k < 1.

2.2 Exact finite sample theory

Exact finite sample theory in the classical simultaneous equation model has a long history. Many of the developments are reviewed in the recent book by Ullah (2004). Much of this work was focused on a structural equation in which there were just two endogenous variables. Consider the equation

y_1 = β y_2 + X_1 γ + u_1,

which includes as regressors one endogenous variable and a set of exogenous variables, and which is a special case of (3). Assuming that the equation parameters are overidentified at least of order two, Richardson and Wu (1971) provided expressions for the exact bias and mean squared error of both the 2SLS and OLS estimators.
In particular, the bias of the 2SLS estimator of β was shown to be

(6) E(β̂ − β) = ((σ_12 − σ_22 β)/σ_22) e^{−μ′μ/2} 1F1(k_21/2 − 1; k_21/2; μ′μ/2),

where 1F1 is the standard confluent hypergeometric function (see Slater (1960)), σ_22 is the variance of the disturbance in the second reduced form equation, and σ_12 is the covariance between the two reduced form disturbances. Also, k_21 is the number of exogenous variables excluded from the equation which appear elsewhere in the system, and μ′μ is the concentration parameter, which is a monotonically non-decreasing function of the sample size. In this framework Owen (1976) noted that the bias and mean squared error of 2SLS are monotonically non-increasing functions of the sample size, a result that has implications for successful jackknife estimation. Later Ip and Phillips (1998) extended Owen's analysis to show that the result does not carry over to the 2SLS estimator of the exogenous coefficients.
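The exact-bias expression (6) is straightforward to evaluate numerically. The sketch below is our own illustration: it sets aside the scale factor (σ_12 − σ_22 β)/σ_22 and computes only the dimensionless factor, using `scipy.special.hyp1f1` for 1F1.

```python
import numpy as np
from scipy.special import hyp1f1

def bias_factor(mu2, k21):
    """Dimensionless factor e^{-mu'mu/2} 1F1(k21/2 - 1; k21/2; mu'mu/2) in (6)."""
    return np.exp(-mu2 / 2.0) * hyp1f1(k21 / 2.0 - 1.0, k21 / 2.0, mu2 / 2.0)

# The factor shrinks as the concentration parameter mu'mu grows (stronger
# instruments / larger T) and, given mu'mu, is smallest at k21 = 2.
for mu2 in (1.0, 10.0, 100.0):
    print(mu2, [round(bias_factor(mu2, k21), 4) for k21 in (2, 4, 8)])
```

For k_21 = 2 the 1F1 term is identically one, so the factor reduces to e^{−μ′μ/2}.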

Hale, Mariano and Ramage (1980) considered different types of misspecification in the context of (3), where the equation is part of a general system. Specifically, they were concerned with the misspecification that arises when exogenous variables are incorrectly included/omitted in parts of the system. Suppose that the misspecification consists of including redundant variables in the first equation, where we can also allow other equations of the system to be misspecified by the addition of redundant variables. Under this type of misspecification, the exact bias of the general k-class estimator β̂_k is given in Hale, Mariano and Ramage (1980) by

E(β̂_k − β) = (ρσ/σ_2) [1 − δ_1 G(k, δ_1, 0; m + 1, n)],

where, in their notation, δ_1 is the concentration parameter, σ² is the variance of the true u_1, σ_2² = var(y_2), ρ is the correlation coefficient between the true u_1 and y_2, m = (T − k_1)/2, n = (T − K)/2, m − n = (K − k_1)/2 = k_21/2, and G(k, δ_1, 0; m + 1, n) is equal to

e^{−δ_1} Σ_{s=0}^{∞} k^s (n)_s/(m)_{s+1} 1F1(m, m + 1 + s; δ_1).

For k = 1, the 2SLS case, the bias equals e^{−δ_1}/(m − n) 1F1(m − n, m + 1 − n; δ_1), with 1F1 being the standard hypergeometric function, and so the bias has the same structure as when there is no misspecification. Hale, Mariano and Ramage (1980) draw two main conclusions:

1. Adding redundant variables to the first equation will decrease δ_1, the concentration parameter, and m = (T − k_1)/2, relative to the correctly specified case. The direction of this effect on bias involves a trade-off: the bias is decreased because m decreases and is increased because δ_1 decreases.

2. The mean squared error seems to increase with this type of misspecification, mainly because of the decrease in δ_1.

In relation to the above, we should note that when the redundant variable is a weak instrument, the effect on the concentration parameter is likely to be particularly small. As a result the mean squared error may not increase and could well decrease.
As we shall see below, this observation has important implications for the development of bias corrected estimators. Finally, we note from the exact bias expression that, conditional on the concentration parameter, the bias is minimized for k_21 = 2, which is the case where the order of overidentification is equal to one for the first equation of a two equation model. Hansen, Hausman and Newey (2007) recently proposed the use of the Fuller (1977) estimator with Bekker (1994) standard errors in order to improve estimation and testing in the case of

many instruments. However, their procedure can still produce very large biases, and our procedure can complement theirs since, in our case, we obtain a nearly unbiased estimator. Moreover, since we deal with $k$-class estimators where $k<1$, our estimator has higher moments.

2.3 The weak instrument case

In the previous sections we were dealing with a system in which enough strong instruments were available to identify the structural parameters, together with some or many redundant variables that may exist in the structural equation of interest (weak instruments). So, in the reduced form, we have a mixture of weak and strong instruments. In this case standard Nagar expansions are valid, and in this paper we find a way to reduce the bias even if we have many redundant variables in the system. A different setting is the one proposed in Staiger and Stock (1997), where only weak instruments are available (the locally unidentified case) in the reduced form equation to be used as instruments (excluding the exogenous variables which appear in the first equation); no strong instrument is available. Chao and Swanson (2007) provide the asymptotic bias and MSE results in the same framework of weak-instrument asymptotics. Suppose a system of the type
$$
y_1=\beta y_2+X\gamma+u,\qquad
y_2=Z\Pi+X\Phi+v,
$$
where $y_1$ and $y_2$ are $T\times 1$ vectors of observations on the two endogenous variables, $X$ is a $T\times k_1$ matrix of observations on the $k_1$ exogenous variables included in the structural equation, and $Z$ is a $T\times k_2$ matrix of instruments which contains observations on $k_2$ exogenous variables.
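The dependence of the asymptotic bias in this setting on $k_{21}$ can be checked numerically. The sketch below is illustrative only: the factor $e^{-\bar\mu'\bar\mu/2}\,{}_1F_1(k_{21}/2-1;k_{21}/2;\bar\mu'\bar\mu/2)$ is taken from the asymptotic bias expression given in the next display, and the numerical values are assumptions.

```python
# Sketch: the k21-dependence of the weak-instrument asymptotic bias factor
# exp(-z) * 1F1(k21/2 - 1; k21/2; z), with z = mu'mu / 2.  Values of z and
# k21 below are illustrative assumptions.
import numpy as np
from scipy.special import hyp1f1

def abs_bias_factor(k21, z):
    return np.exp(-z) * hyp1f1(k21 / 2.0 - 1.0, k21 / 2.0, z)

z = 1.5  # a fixed value of mu'mu / 2
vals = {k: abs_bias_factor(k, z) for k in (2, 4, 6, 8)}
# For a given concentration, the factor is smallest at k21 = 2, the
# overidentified-of-order-one case emphasized in the text.
assert vals[2] < vals[4] < vals[6] < vals[8]
```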
The asymptotic bias of the IV estimator is given by
$$
\hat B_{IV}\left(\bar\mu'\bar\mu,k_{21}\right)=\sigma_{uu}^{1/2}\sigma_{vv}^{-1/2}\,\rho\,e^{-\bar\mu'\bar\mu/2}\,{}_1F_1\!\left(\frac{k_{21}}{2}-1;\frac{k_{21}}{2};\frac{\bar\mu'\bar\mu}{2}\right),
$$
where $\hat B_{IV}(\bar\mu'\bar\mu,k_{21})=\lim_{T\to\infty}E\left(\hat\beta_{IV,T}-\beta\right)$ is the asymptotic bias function of the IV estimator, $\Pi=\Pi_T=C/\sqrt{T}$ where $C$ is a fixed $k_2\times 1$ vector, $(u'u/T,u'v/T,v'v/T)\to_p(\sigma_{uu},\sigma_{uv},\sigma_{vv})$, $\bar Z_t=(X_t',Z_t')'$, $Q=E\left(\bar Z_t\bar Z_t'\right)$, $\rho=\sigma_{uv}\sigma_{uu}^{-1/2}\sigma_{vv}^{-1/2}$, and $k_{21}$ is the number of variables excluded from the first equation. Since an IV estimator of $\beta$ may be chosen which does not make use of all available instruments, define $P_X=X(X'X)^{-1}X'$ and
$$
\hat\beta_{IV}=\left(y_2'(P_H-P_X)y_2\right)^{-1}\left(y_2'(P_H-P_X)y_1\right),
$$
where $H=(Z_1,X)$ is a $T\times(k_{21}+k_1)$ matrix of instruments, $Z_1$ is a $T\times k_{21}$ submatrix of $Z$ formed by column selection, $Z=(Z_1,Z_2)$, and $\bar\mu'\bar\mu=\sigma_{vv}^{-1}C'\left(\Omega_{22}-\Omega_{21}\Omega_{11}^{-1}\Omega_{12}\right)C$, the $\Omega_{ij}$ denoting the corresponding blocks of $Q$. Comparing this result with (6), we find that the asymptotic bias in this weak instrument model takes the same form as the exact bias in the strong instrument case. Furthermore, the above result

goes through without an assumption of normality and in the presence of stochastic instruments. Note also that the framework of Staiger and Stock (1997) and of Chao and Swanson (2007) allows for weak instruments in equation 2 but not in equation 1. Chao and Swanson (2007) assume that $k_{21}\ge 4$ to ensure that the result encompasses the Gaussian case. However, following Ullah (2004, page 196), we know that
$$
{}_1F_1\!\left(\frac{k_{21}}{2}-1;\frac{k_{21}}{2};\frac{\bar\mu'\bar\mu}{2}\right)
=1+\frac{\frac{k_{21}}{2}-1}{\frac{k_{21}}{2}}\,\frac{\bar\mu'\bar\mu}{2}
+\frac{1}{2!}\,\frac{\left(\frac{k_{21}}{2}-1\right)\frac{k_{21}}{2}}{\frac{k_{21}}{2}\left(\frac{k_{21}}{2}+1\right)}\left(\frac{\bar\mu'\bar\mu}{2}\right)^2+\dots,
$$
which leads to the asymptotic bias being minimized, for a given value of $\bar\mu'\bar\mu$, when $k_{21}=2$. We can also provide an expansion up to $O(T^{-2})$ within the weak-IV asymptotics approach by using the bias expressions of Chao and Swanson (2007). The results of the previous theorems stay the same, and again the bias is minimized for $k_{21}=2$, i.e. the equation is overidentified of order one.

3 Bias Correction

When the matrix $X_1$ in (4) is augmented by a set of $k_1^*$ redundant exogenous variables $X^*$, which appear elsewhere in the system, so that $X_1^*=(X_1:X^*)$, the equation to be estimated by 2SLS is now
$$
(7)\qquad y_1=Y_2\beta+X_1\gamma+X^*\delta+u_1,
$$
where $\delta$ is zero but is estimated as part of the coefficient vector $\alpha^*=(\beta',\gamma',\delta')'$. The bias in estimating $\alpha^*$ takes the same form as (5) except that, in the definition of $Q$, $X_1$ is replaced by $X_1^*$. If $k_1^*$ is chosen so that $K-(g_1+k_1+k_1^*)=1$, then the bias of the resulting 2SLS estimator $\hat\alpha^*$ is zero to order $1/T^2$. It follows that $\hat\alpha^*$ is unbiased to order $1/T^2$, and the result holds whether or not there are weak instruments in the system, as will be discussed below. Of course, the introduction of the redundant variables changes the variance too. In fact, the asymptotic covariance matrix of $\hat\alpha$ is given by $\lim_{T\to\infty}\sigma^2 TQ$, whereas the asymptotic covariance matrix of $\hat\alpha^*$ is given by $\lim_{T\to\infty}\sigma^2 TQ^*$, where $Q^*$ is obtained by replacing $X_1$ with $X_1^*$ in $Q$. Hence, the estimator $\hat\alpha^*$ will not be asymptotically efficient.
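A small Monte Carlo sketch of this construction may help fix ideas. Every design value below (sample size, instrument count, coefficients, error correlation) is an illustrative assumption rather than a design from the paper: ordinary 2SLS is compared with 2SLS on the augmented equation, in which redundant columns of $Z$ reduce the notional order of overidentification to one.

```python
# Monte Carlo sketch of redundant-variable 2SLS: add K - 2 columns of Z to
# the structural equation so that the notional order of overidentification
# is one.  All design values below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
T, K, beta = 100, 12, 0.5

def tsls(y1, y2, inst, included):
    # 2SLS: project y2 on the full instrument set, then regress y1 on the
    # fitted value and the included exogenous regressors; return beta-hat.
    y2_hat = inst @ np.linalg.lstsq(inst, y2, rcond=None)[0]
    R = np.column_stack([y2_hat, included])
    return np.linalg.lstsq(R, y1, rcond=None)[0][0]

b_plain, b_aug = [], []
for _ in range(2000):
    Z = rng.standard_normal((T, K))
    u = rng.standard_normal(T)
    v = 0.9 * u + np.sqrt(1 - 0.81) * rng.standard_normal(T)
    x1 = np.ones((T, 1))                  # included exogenous (intercept)
    y2 = Z @ np.full(K, 0.3) + v
    y1 = beta * y2 + x1[:, 0] + u
    inst = np.column_stack([x1, Z])       # full instrument set
    b_plain.append(tsls(y1, y2, inst, x1))
    # augment the equation with K - 2 redundant columns of Z
    b_aug.append(tsls(y1, y2, inst, np.column_stack([x1, Z[:, :K - 2]])))

bias_plain = np.mean(b_plain) - beta
bias_aug = np.mean(b_aug) - beta
# the redundant-variable estimator should be much closer to unbiased
assert abs(bias_aug) < abs(bias_plain)
```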
However, if the redundant variables are weak instruments, the increase in the small sample variance will be small and the MSE may be reduced. We can find a second order approximation to the variance of $\hat\alpha^*$ from a simple extension of the earlier results of Nagar (1959) given above. The new estimator that we obtain for the $\beta$ parameter vector is denoted $\hat\beta^*$, and we name it the redundant-variable 2SLS estimator when it is used in the context of 2SLS. Although the proposed procedure is based on large sample asymptotic analysis, the case is strengthened by the exact finite sample results in Section 4, since we have observed there that setting

$k_{21}=2$ minimizes the exact bias for a given value of the concentration parameter, although it does not completely eliminate it. Of course, as we have noted, the introduction of the redundant exogenous variables alters the value of the concentration parameter, so we cannot be certain that the bias is reduced. However, all our results indicate that the bias reduction will be successful. Effectively, we are using in our analysis an optimal number of redundant variables for inclusion in the first equation so as to improve on the finite sample properties of the 2SLS/IV estimator of $\beta$. In doing so, we know that this type of misspecification has the drawback of decreasing the concentration parameter $\delta_1$, so we want to choose the redundant variables from among the weaker instruments. In discussions of weak instruments it is usual to refer to instruments that make only a small contribution to the concentration parameter (see, e.g., Stock, Wright and Yogo (2002) and Davidson and MacKinnon (2006)). This adds support for what we do and guides our selection of variables when choosing the redundant set.

We have seen in Section 2 that the limiting bias in the weak instrument model of Staiger and Stock (1997) and of Chao and Swanson (2007) takes the same form as the exact bias in finite samples. Insofar as we can minimize the exact bias by setting $k_{21}=2$, it follows that we also minimize the limiting bias in the weak instrument case. As a result, we can reasonably expect our bias reduction procedure to work even in models where the weak instrument problem is severe.

We have earlier noted that the existence of moments depends on the "notional" order of overidentification. To see why this is so, consider a simple system of two endogenous variables as given in (7), with reduced form equation
$$
(8)\qquad y_2=Z\pi_2+v_2;
$$
then the 2SLS estimator of $\beta$ has the form $(y_2'\bar Py_2)^{-1}(y_2'\bar Py_1)$, where
$$
\bar P=Z(Z'Z)^{-1}Z'-Z_1(Z_1'Z_1)^{-1}Z_1'
$$
and $\operatorname{rank}(\bar P)=p=\operatorname{rank}(Z)-\operatorname{rank}(Z_1)$.
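Because the matrix in the 2SLS form just given is a difference of two projection matrices, the stated rank identity is easy to verify numerically; the dimensions below are arbitrary illustrative choices.

```python
# Check that rank(P) = rank(Z) - rank(Z1) when P = P_Z - P_Z1 and Z1 is a
# column subset of Z (dimensions chosen arbitrarily for illustration).
import numpy as np

rng = np.random.default_rng(1)
T, kZ, kZ1 = 50, 8, 3
Z = rng.standard_normal((T, kZ))
Z1 = Z[:, :kZ1]

def proj(A):
    # orthogonal projection onto the column space of A
    return A @ np.linalg.solve(A.T @ A, A.T)

P = proj(Z) - proj(Z1)
p = np.linalg.matrix_rank(P)
assert p == np.linalg.matrix_rank(Z) - np.linalg.matrix_rank(Z1)
```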
To examine the existence of moments, we can look at the canonical case where the covariance matrix of $(u_1,v_2)$ is the identity and $\beta=\gamma=\delta=0$, $\pi_2=0$. Then, conditional on $y_2$, the distribution of the 2SLS estimator is $N\left(0,(y_2'\bar Py_2)^{-1}\right)$, so the conditional moments all exist and the odd ones vanish, while the unconditional even moment of order $2r$ exists only if $E\left(y_2'\bar Py_2\right)^{-r}$ exists. But $y_2'\bar Py_2\sim\chi_p^2$, so this is the case only for $p>2r$, or $2r\le p-1$. In the non-canonical case the odd-order moments do not vanish, but the argument is the same. This provides the intuition for why, by adding redundant exogenous variables to the system, in this case thereby altering $p$, we alter the notional order of identification; and it is this which determines the existence of moments.¹

Note that in our bias results $\alpha$ corresponds to the full vector of parameters, so that the bias is zero to order $1/T^2$ for the exogenous coefficient estimators too. This is important because not all

¹ We thank Grant Hillier for providing this simple proof.

bias corrected estimators proposed in the literature have this quality. For example, the standard delete-one jackknife in Hahn, Hausman and Kuersteiner (2004) is shown to perform quite well, and it has moments whenever 2SLS has moments. However, the case analyzed does not include any exogenous variables in the equation that is estimated, so no results are given for the jackknife applied to exogenous coefficient estimators. In fact, a necessary condition for the jackknife to reduce bias is that the estimator has a bias that is monotonically non-increasing in the sample size. As noted previously, the bias of the 2SLS estimator of the endogenous variable coefficient is monotonically non-increasing, and so in this case the jackknife satisfies the necessary condition for bias reduction to be successful. However, Ip and Phillips (1998) show that the same does not apply to the exogenous coefficient bias. In fact, even the mean squared error of the 2SLS estimator in this case is not monotonically non-increasing in the sample size, which is a disquieting limitation of the 2SLS method.

As noted in Section 2, the Nagar bias approximation is valid under conditional heteroskedasticity, so we can prove that the bias disappears up to $O(T^{-1})$ even when conditional heteroskedasticity is present. The essential requirement is that the disturbances be Gauss-Markov, which covers a wide range of autoregressive conditional heteroskedastic (ARCH/GARCH) processes (see, e.g., Bollerslev (1986)).

The bias corrected estimator proposed here is essentially a 2SLS estimator of an overspecified equation. It will have moments up to the notional order of overidentification, as can be seen from Mariano (1972), so that whereas the first moment exists, the second and all other higher moments do not. If it is preferred to use a bias corrected estimator which has higher moments, however, one possibility is to use an alternative member of the $k$-class.
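As a concrete sketch of such $k$-class alternatives, the function below implements the textbook $k$-class formula, with $\hat V_2$ the first-stage residuals; the data-generating values are illustrative assumptions. Setting $k=1$ recovers 2SLS exactly, and members with $k$ slightly below one remain numerically close to it.

```python
# Sketch of the k-class estimator for y1 = Y2*beta + X1*gamma + u, with
# full instrument set X = (X1 : Z); k = 1 reproduces 2SLS.  Data values
# are illustrative assumptions.
import numpy as np

def k_class(y1, Y2, X1, X, k):
    PX = X @ np.linalg.solve(X.T @ X, X.T)
    V2 = Y2 - PX @ Y2                       # first-stage residuals
    A = np.block([[Y2.T @ Y2 - k * (V2.T @ V2), Y2.T @ X1],
                  [X1.T @ Y2, X1.T @ X1]])
    b = np.concatenate([(Y2 - k * V2).T @ y1, X1.T @ y1])
    return np.linalg.solve(A, b)            # (beta, gamma')'

rng = np.random.default_rng(2)
T = 200
X1 = np.column_stack([np.ones(T), rng.standard_normal(T)])
Z = rng.standard_normal((T, 4))
X = np.column_stack([X1, Z])
u = rng.standard_normal(T)
Y2 = (Z @ np.array([1.0, 0.5, 0.5, 0.25]) + 0.8 * u
      + rng.standard_normal(T)).reshape(-1, 1)
y1 = 0.5 * Y2[:, 0] + X1 @ np.array([1.0, -1.0]) + u

# k = 1 matches 2SLS computed by explicit two-stage least squares
Y2_hat = X @ np.linalg.lstsq(X, Y2, rcond=None)[0]
two_stage = np.linalg.lstsq(np.column_stack([Y2_hat, X1]), y1, rcond=None)[0]
assert np.allclose(k_class(y1, Y2, X1, X, k=1.0), two_stage)
# k slightly below one (e.g. 1 - 1/T**3) stays numerically close to 2SLS
assert np.allclose(k_class(y1, Y2, X1, X, k=1 - T**-3), two_stage, atol=1e-4)
```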
Consider the $k$-class estimator for which $\theta=1$, i.e. $k=1-1/T$. From Nagar's bias approximation given in (5) we can see that the bias to order $1/T$ is $(L+\theta-1)Qq=LQq$, which will disappear for $L=0$, and this can be achieved through the introduction of redundant exogenous variables. Consequently, a bias corrected estimator which is unbiased to order $1/T$ and which has higher order moments is readily available. It is apparent that the estimator for which $k=1-1/T$ is asymptotically efficient and will have properties close to those of 2SLS. However, this estimator is not unbiased to order $1/T^2$. We provide in the next section simulation results showing that this alternative $k$-class estimator may have a smaller mean squared error than 2SLS, and we show the performance of our bias correction mechanism both when it is applied to 2SLS and when it is applied to the alternative $k$-class estimator.

Careful examination of the analysis employed by Nagar and Mikhail reveals that if $k$ is chosen to be $1-1/T^3$, then the bias approximations to order $T^{-1}$ and $T^{-2}$ are the same as those for 2SLS, i.e. when $k=1$. Consequently, when redundant variables are added to reduce the notional order of

overidentification to one ($L=1$), the estimator with $k=1-1/T^3$ will be unbiased to order $T^{-2}$ and its moments will exist. The estimator is, essentially, 2SLS with moments and is therefore to be preferred to 2SLS. One of the key issues in bias correcting through the introduction of redundant variables concerns the choice of the redundant variable set. We now consider this.

3.1 The Redundant Variable Estimator Covariance Matrix

Asymptotically efficient $k$-class estimators have an asymptotic covariance matrix given by
$$
\sigma^2\operatorname{plim}T\begin{bmatrix}\Pi_2'X'X\Pi_2&\Pi_2'X'X_1\\X_1'X\Pi_2&X_1'X_1\end{bmatrix}^{-1},
$$
and the 2SLS small sample variances are typically estimated from the matrix
$$
\hat\sigma^2\begin{bmatrix}\hat\Pi_2'X'X\hat\Pi_2&\hat\Pi_2'X'X_1\\X_1'X\hat\Pi_2&X_1'X_1\end{bmatrix}^{-1}.
$$
Consider the asymptotic covariance matrix (ignoring the constant) given by
$$
V_1=\begin{bmatrix}\Pi_2'X'X\Pi_2&\Pi_2'X'X_1\\X_1'X\Pi_2&X_1'X_1\end{bmatrix}^{-1}.
$$
Suppose that the relevant equation has a redundant variable set $X^*$, contained in the matrix $X$, where $X_1^*=(X_1:X^*)$; then the asymptotic covariance matrix is proportional to
$$
V_2=\begin{bmatrix}\Pi_2'X'X\Pi_2&\Pi_2'X'X_1^*\\(X_1^*)'X\Pi_2&(X_1^*)'X_1^*\end{bmatrix}^{-1}
=\begin{bmatrix}\Pi_2'X'X\Pi_2&\Pi_2'X'X_1&\Pi_2'X'X^*\\X_1'X\Pi_2&X_1'X_1&X_1'X^*\\(X^*)'X\Pi_2&(X^*)'X_1&(X^*)'X^*\end{bmatrix}^{-1}.
$$
Using the partitioned inverse theorem, we may write the part of the inverse of $V_2$ that refers to the original coefficients as the inverse of
$$
\begin{bmatrix}\Pi_2'X'X\Pi_2&\Pi_2'X'X_1\\X_1'X\Pi_2&X_1'X_1\end{bmatrix}
-\begin{bmatrix}\Pi_2'X'X^*\\X_1'X^*\end{bmatrix}\left[(X^*)'X^*\right]^{-1}\begin{bmatrix}(X^*)'X\Pi_2&(X^*)'X_1\end{bmatrix},
$$
which reduces to
$$
V_3=\begin{bmatrix}\Pi_2'X'M_{X^*}X\Pi_2&\Pi_2'X'M_{X^*}X_1\\X_1'M_{X^*}X\Pi_2&X_1'M_{X^*}X_1\end{bmatrix}^{-1},
$$

where $M_{X^*}=I-X^*\left[(X^*)'X^*\right]^{-1}(X^*)'$.

It follows that we may use this result in developing criteria for choosing the redundant variable set. We should choose $X^*$ to minimize the difference, in some sense, between the two matrices. One approach is to minimize the difference between the generalized variances; in effect this means choosing $X^*$ to minimize the determinant of $V_3$. A simpler approach is to minimize the difference between the traces of the two matrices. This amounts to choosing $X^*$ so that the sum of the variances increases the least. It is clearly simple to implement, since it implies that we merely compare the main diagonal terms of the estimated covariance matrices.

In the simple model given by (19), in which there are no exogenous variables in the equation of interest, we have
$$
V_3=\left(\pi_2'X'M_{X^*}X\pi_2\right)^{-1}.
$$
If we partition the $X$ matrix as $X=(\bar X:X^*)$ and the $\pi_2$ vector as $\pi_2=(\pi_{21}':\pi_{22}')'$, so that $X\pi_2=\bar X\pi_{21}+X^*\pi_{22}$, then we may write
$$
V_3=\left(\pi_{21}'\bar X'M_{X^*}\bar X\pi_{21}\right)^{-1}.
$$
We recognize this as being proportional to the inverse of the concentration parameter. Hence, we see that in the simple model our criterion involves minimizing the reduction in the concentration parameter. Again, the practical implementation requires that the unknown reduced form parameter vector be replaced with estimates. Hence, $X^*$ is chosen to minimize the asymptotic variance of the endogenous coefficient estimator.

We consider in what follows an alternative approach to bias correction by combining $k$-class estimators.

4 A new estimator as a linear combination of k-class estimators with k<1

In what follows we develop a new estimator as a linear combination of $k$-class estimators with $k<1$. We shall denote the estimator for $k=1-1/T^3$ as $\hat\beta_{k1}$ and the estimator for which $k=1-1/T$ as $\hat\beta_{k2}$. The 2SLS estimator is $\hat\beta_1$. Another alternative that we propose is the following.
According to (5), the bias of 2SLS up to $T^{-1}$ (or that of $\hat\beta_{k1}$) takes the form
$$
\operatorname{Bias}\left(\hat\beta_1\right)=(L-1)Qq,
$$
and the bias for $\hat\beta_{k2}$ is given by
$$
\operatorname{Bias}\left(\hat\beta_{k2}\right)=LQq.
$$

So we can write
$$
\operatorname{Bias}\left(2\hat\beta_1-\hat\beta_{k2}\right)=2(L-1)Qq-LQq=(L-2)Qq,
$$
$$
\operatorname{Bias}\left(3\hat\beta_1-2\hat\beta_{k2}\right)=3(L-1)Qq-2LQq=(L-3)Qq,
$$
$$
\operatorname{Bias}\left(4\hat\beta_1-3\hat\beta_{k2}\right)=4(L-1)Qq-3LQq=(L-4)Qq,
$$
and, more generally,
$$
\operatorname{Bias}\left(p\hat\beta_1-(p-1)\hat\beta_{k2}\right)=(L-p)Qq
$$
for any $p$, where all biases are approximations to order $T^{-1}$. Notice that in the general case the estimator is $\hat\beta_1+(p-1)(\hat\beta_1-\hat\beta_{k2})$, where the second term acts as a bias correction. However, the estimator has all necessary moments if 2SLS is replaced with $\hat\beta_{k1}$. We have noted that 2SLS and $\hat\beta_{k1}$ have the same second order bias, but this is not true of the estimator $\hat\beta_{k2}$. While the linear combination removes the first order bias, some second order bias remains; hence the combined estimator is not unbiased to order $T^{-2}$.

The next theorem shows how the combined estimator may have smaller mean squared error (MSE) for systems that have a degree of overidentification larger than 7. This means that when we have a reasonable number of instruments (upwards of 7), our combined estimator will not only be unbiased to order $1/T$, but will also improve on 2SLS in terms of MSE. Moreover, note that the combined estimator is always unbiased to order $1/T$ even if we have added irrelevant variables in any part of the system (as the result of Hale, Mariano and Ramage (1980) points out). This also includes the case where we have many weak instruments in the system. Replacing the 2SLS estimator with $\hat\beta_{k1}$, the combined estimator, which is unbiased to order $1/T$, is given by $\hat\beta_{k3}$, where
$$
\hat\beta_{k3}=\hat\beta_{k1}+(p-1)\left(\hat\beta_{k1}-\hat\beta_{k2}\right).
$$
We now state the following theorem.

Theorem 1 In the model of (1) and under the assumptions in Section 2, the difference to order $T^{-2}$ between the mean squared error of the combined estimator $\hat\beta_{k3}$ and that of the $k$-class estimator for which $k=1-1/T^3$, $\hat\beta_{k1}$, is given by
$$
E\left(\hat\beta_{k3}-\beta\right)\left(\hat\beta_{k3}-\beta\right)'-E\left(\hat\beta_{k1}-\beta\right)\left(\hat\beta_{k1}-\beta\right)'
=-(L-1)\left[(L-7)\sigma^2QC_1Q-2\left(\sigma^2\operatorname{tr}(QC_1)Q-\sigma^2QC_1Q\right)-2\sigma^2QC_2Q\right]+o\left(T^{-2}\right).
$$

Proof. Given in Appendix 1.

While the precise value of $L$ at which the MSE of the combined estimator becomes less than that of the $k$-class estimator $\hat\beta_{k1}$, and hence that of 2SLS, depends upon the relationship between $QC_1Q$ and $QC_2Q$, a value of $L$ greater than 7 will often be sufficient, and we propose $L>7$ as the rule determining when to use the combined estimator. Of course, when $L=1$ the mean squared error of 2SLS does not exist and the combined estimator is preferred; in this case $\hat\beta_{k3}=\hat\beta_{k1}$, so the combined estimator reduces to the $k$-class estimator for which $k=1-1/T^3$. Note that the Chao and Swanson (2007) bias correction procedure does not work very well when the number of excluded regressors is less than 11 (see Chao and Swanson (2007), Table 2); in this case we have a very simple procedure that is unbiased up to $1/T$ and that improves on 2SLS in terms of MSE as $L$ increases above 7. This means that our combined estimator will work particularly well both with many (weak) instruments and with a moderate number of (weak) instruments. The result is the same regardless of whether our instruments are weak or strong, and the unbiasedness result holds under possible misspecification in the form of irrelevant variables included in other equations. Even when $L\le 7$, the combined estimator is still unbiased. Note also that Chao and Swanson (2007) require a correctly specified system, while our combined estimator will be unbiased up to $1/T$ even if we have irrelevant variables in any part of the system. As Hansen, Hausman and Newey (2007) point out, there are many papers that need in practice to use a moderate or large number of instruments; Angrist and Krueger (1991) is an example where a large number of instruments is used. As a final remark, note that in the case $L=0$ the combined estimator reduces to the $k$-class estimator with $k=1-1/T$. This has moments and clearly dominates 2SLS, or the $k=1-1/T^3$ estimator, on an MSE criterion.
Indeed, this is quite obvious, since the variance of the $k$-class estimator will decline as $k$ is reduced, so the estimator with $k=1-1/T$ will have a smaller variance while at the same time having zero bias to order $1/T$.

4.1 Asymptotic variance

In Theorem 1 we have shown that, if we denote $E(\hat\beta_{k1}-\beta)(\hat\beta_{k1}-\beta)'=\operatorname{MSE}(\hat\beta_{k1})$ and $E(\hat\beta_{k3}-\beta)(\hat\beta_{k3}-\beta)'=\operatorname{MSE}(\hat\beta_{k3})$, then, defining $\Delta$ as in (25a),
$$
(9)\qquad \operatorname{MSE}\left(\hat\beta_{k1}\right)+\Delta=\operatorname{MSE}\left(\hat\beta_{k3}\right).
$$
Also, $\hat\beta_{k3}$ is unbiased to order $T^{-1}$, so that
$$
(10)\qquad \operatorname{MSE}\left(\hat\beta_{k3}\right)=\operatorname{var}\left(\hat\beta_{k3}\right)+o\left(T^{-2}\right).
$$

Now,
$$
\operatorname{MSE}\left(\hat\beta_{k1}\right)=\operatorname{var}\left(\hat\beta_{k1}\right)+\operatorname{bias}\left(\hat\beta_{k1}\right)\operatorname{bias}\left(\hat\beta_{k1}\right)'=\operatorname{var}\left(\hat\beta_{k1}\right)+(L-1)^2Qqq'Q
$$
and
$$
(11)\qquad \operatorname{MSE}\left(\hat\beta_{k1}\right)=\operatorname{var}\left(\hat\beta_{k1}\right)+(L-1)^2\sigma^2QC_1Q.
$$
From (9), (10) and (11),
$$
\operatorname{var}\left(\hat\beta_{k3}\right)=\operatorname{var}\left(\hat\beta_{k1}\right)+(L-1)^2\sigma^2QC_1Q+\Delta.
$$
It follows that
$$
\operatorname{var}\left(\hat\beta_{k3}\right)
=\operatorname{var}\left(\hat\beta_{k1}\right)-(L-1)\left[\left((L-7)-(L-1)\right)\sigma^2QC_1Q-2\left(\sigma^2\operatorname{tr}(QC_1)Q-\sigma^2QC_1Q\right)-2\sigma^2QC_2Q\right]+o\left(T^{-2}\right)
=\operatorname{var}\left(\hat\beta_{k1}\right)+(L-1)\left[6\sigma^2QC_1Q+2\sigma^2\left(\operatorname{tr}(QC_1)Q-QC_1Q\right)+2\sigma^2QC_2Q\right]+o\left(T^{-2}\right).
$$
It is clear that for large $L$ there will be a large difference in the variances, and we should sensibly adjust for this in any inference situation. Notice that the difference in the variances takes a rather simple form; also, it depends directly on $(L-1)$ and not on $(L-1)^2$.

There is no theoretical justification available in the literature for using the general $k$-class estimators for bias reduction in the presence of ARCH/GARCH disturbances. Phillips (2000) covers the 2SLS case but not the $k$-class. We now show in the next section that the result holds for the $k$-class too.

4.2 Bias Approximation and k-class Estimation

We consider the estimation of the equation given in (1) by the method of 2SLS. It is well known that the 2SLS estimator can be written in the form
$$
(12)\qquad \hat\alpha=\begin{pmatrix}\hat\beta\\\hat\gamma\end{pmatrix}
=\begin{bmatrix}\hat\Pi_2'X'X\hat\Pi_2&\hat\Pi_2'X'X_1\\X_1'X\hat\Pi_2&X_1'X_1\end{bmatrix}^{-1}
\begin{bmatrix}\hat\Pi_2'X'X\hat\pi_1\\X_1'X\hat\pi_1\end{bmatrix},
$$
where $\hat\Pi_2=(X'X)^{-1}X'Y_2$ and $\hat\pi_1=(X'X)^{-1}X'y_1$. This representation of 2SLS was considered, for example, in Harvey and Phillips (1980). It is apparent that, conditional on the exogenous variables, the 2SLS estimators are functions of the matrix $\hat\Pi_1=(\hat\pi_1:\hat\Pi_2)$; hence we may write $\hat\alpha=f(\operatorname{vec}\hat\Pi_1)$. As shown in Phillips (2000), the unknown parameter vector can be written as $\alpha=f(\operatorname{vec}\Pi_1)$, so that the estimation error is $f(\operatorname{vec}\hat\Pi_1)-f(\operatorname{vec}\Pi_1)$. A Taylor expansion about the point $\operatorname{vec}\Pi_1$ may then be employed directly to find a counterpart of the Nagar expansion. In fact, Phillips (2000) considered the general element of the estimation error
$$
\hat\alpha_i-\alpha_i=e_i'(\hat\alpha-\alpha)=f_i(\operatorname{vec}\hat\Pi_1)-f_i(\operatorname{vec}\Pi_1),
$$

$i=1,2,\dots,g+k$, where $e_i$ is a $1\times(g+k)$ unit vector. The bias approximation for the general case was then found using the expansion
$$
(13)\qquad f_i(\operatorname{vec}\hat\Pi_1)=f_i(\operatorname{vec}\Pi_1)+\left(\operatorname{vec}(\hat\Pi_1-\Pi_1)\right)'f_i^{(1)}
+\tfrac12\left(\operatorname{vec}(\hat\Pi_1-\Pi_1)\right)'f_i^{(2)}\left(\operatorname{vec}(\hat\Pi_1-\Pi_1)\right)
+\tfrac{1}{3!}\sum_{r=1}^{K}\sum_{s=1}^{g+1}(\hat\pi_{rs}-\pi_{rs})\left(\operatorname{vec}(\hat\Pi_1-\Pi_1)\right)'f_{i,rs}^{(3)}\left(\operatorname{vec}(\hat\Pi_1-\Pi_1)\right)+o_p\left(T^{-3/2}\right),
$$
where $f_i^{(1)}=\partial f_i/\partial\operatorname{vec}\hat\Pi_1$ is a $K(g+1)\times 1$ vector of first-order partial derivatives, $f_i^{(2)}=\partial^2 f_i/\partial\operatorname{vec}\hat\Pi_1\,\partial(\operatorname{vec}\hat\Pi_1)'$ is a $K(g+1)\times K(g+1)$ matrix of second-order partial derivatives, and $f_{i,rs}^{(3)}=\partial f_i^{(2)}/\partial\hat\pi_{rs}$, $r=1,\dots,K$, $s=1,\dots,g+1$, is a $K(g+1)\times K(g+1)$ matrix of third-order partial derivatives. All derivatives are evaluated at $\operatorname{vec}\Pi_1$. The bias approximation to order $T^{-1}$ is then obtained by taking expectations of the first two terms of the stochastic expansion to yield
$$
E(\hat\alpha_i-\alpha_i)=\tfrac12\operatorname{tr}\left[f_i^{(2)}\left(I\otimes(X'X)^{-1}X'\right)\Omega_1^{vec}\left(I\otimes X(X'X)^{-1}\right)\right]+o\left(T^{-1}\right),
$$
where $\Omega_1^{vec}=E\left(\operatorname{vec}V_1(\operatorname{vec}V_1)'\right)=\Omega_1\otimes I_T$ reflects the fact that the reduced form disturbances satisfy the Gauss-Markov conditions. When the partial derivatives $f_i^{(2)}$ are introduced, the bias approximation is given in the following theorem.

Theorem 2 Let $\hat\alpha_i$ be the $i$-th component of the $(g+k)$-component 2SLS estimator $\hat\alpha$ given in (4); then, under the assumptions in Section 2, the bias to order $T^{-1}$ is given by
$$
(14)\qquad E(\hat\alpha_i-\alpha_i)=\operatorname{tr}\left[\left(\bar HQe_i\bar\beta'\otimes(P_X-P_{\bar Z})\right)\Omega_1^{vec}\right]
-\operatorname{tr}\left[\bar I\left(\bar ZQe_i\bar\beta'\otimes\bar H'Q\bar Z'\right)\Omega_1^{vec}\right]+o\left(T^{-1}\right),
$$
where $\bar I$ is a $T(g+1)\times T(g+1)$ commutation matrix, partitioned into submatrices of order $T\times(g+1)$ such that the $(p,q)$-th submatrix has unity in its $(q,p)$-th position and zeroes elsewhere, $p=1,2,\dots,g+1$, $q=1,2,\dots,T$ (see Magnus and Neudecker (1979)), $\bar H=\begin{bmatrix}I_{g+1}\\0\end{bmatrix}$ is a $(g+k)\times(g+1)$ selection matrix, $\bar Z=(X\Pi_2:X_1)$, $Q=(\bar Z'\bar Z)^{-1}$, $P_X=X(X'X)^{-1}X'$, $P_{\bar Z}=\bar Z(\bar Z'\bar Z)^{-1}\bar Z'$, and $\bar\beta=(-1,\beta')'$.

Proof. See Phillips (2000, page 352).
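Representation (12), on which the Taylor expansion above is built, is easy to verify numerically; the small simultaneous-equations design below is an illustrative assumption.

```python
# Numerical check of representation (12): 2SLS written as a function of
# the reduced-form estimates Pi2-hat and pi1-hat (illustrative data).
import numpy as np

rng = np.random.default_rng(3)
T = 150
X1 = np.column_stack([np.ones(T), rng.standard_normal(T)])
Z = rng.standard_normal((T, 3))
X = np.column_stack([X1, Z])           # full exogenous/instrument matrix
u = rng.standard_normal(T)
Y2 = (Z @ np.array([1.0, 0.7, 0.4]) + 0.6 * u
      + rng.standard_normal(T)).reshape(-1, 1)
y1 = 0.5 * Y2[:, 0] + X1 @ np.array([2.0, -1.0]) + u

XtX = X.T @ X
Pi2 = np.linalg.solve(XtX, X.T @ Y2)   # (X'X)^{-1} X'Y2
pi1 = np.linalg.solve(XtX, X.T @ y1)   # (X'X)^{-1} X'y1

# representation (12)
A = np.block([[Pi2.T @ XtX @ Pi2, Pi2.T @ X.T @ X1],
              [X1.T @ X @ Pi2, X1.T @ X1]])
b = np.concatenate([Pi2.T @ XtX @ pi1, X1.T @ X @ pi1])
alpha_12 = np.linalg.solve(A, b)

# conventional two-stage computation
Y2_hat = X @ Pi2
alpha_ts = np.linalg.lstsq(np.column_stack([Y2_hat, X1]), y1, rcond=None)[0]
assert np.allclose(alpha_12, alpha_ts)
```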
This theorem can be used to find the approximate biases for each component of the 2SLS vector $\hat\alpha$ under different assumptions concerning the generation of the disturbances. From an examination of the bias approximation it is apparent that, as the assumptions change, any change in the bias is brought about by changes in $\Omega_1^{vec}$. The Nagar approximation is found when $\Omega_1^{vec}=\Omega_1\otimes I_T$. However, the approximation is unchanged under changes in the disturbances which do not result in a change

in $\Omega_1^{vec}$. From this we can deduce that the Nagar approximation remains valid as long as the reduced form disturbances, and hence the structural disturbances, obey the Gauss-Markov assumptions. Thus neither normality nor independence is required.

To show that this result holds as well for the general consistent $k$-class of estimators, with $k$ non-stochastic, we shall need to modify the above approach. Consider the $k$-class estimator given by
$$
(15)\qquad \hat\alpha_k=\begin{pmatrix}\hat\beta_k\\\hat\gamma_k\end{pmatrix}
=\begin{bmatrix}Y_2'Y_2-k\hat V_2'\hat V_2&Y_2'X_1\\X_1'Y_2&X_1'X_1\end{bmatrix}^{-1}
\begin{bmatrix}(Y_2-k\hat V_2)'y_1\\X_1'y_1\end{bmatrix}
=\begin{bmatrix}\hat\Pi_2'X'X\hat\Pi_2+(1-k)\hat V_2'\hat V_2&\hat\Pi_2'X'X_1\\X_1'X\hat\Pi_2&X_1'X_1\end{bmatrix}^{-1}
\begin{bmatrix}\hat\Pi_2'X'X\hat\pi_1+(1-k)\hat V_2'\hat v_1\\X_1'X\hat\pi_1\end{bmatrix}.
$$
Here it is clear that, conditional on the exogenous variables, the $k$-class estimators are functions not only of $\hat\Pi_1=(\hat\pi_1:\hat\Pi_2)$ but also of $\hat V_2'\hat V_2$ and $\hat V_2'\hat v_1$. However, by suitably manipulating the estimator it is possible to express it in the same form as (14), so that no new analysis, beyond that set out in Phillips (2000), is required to find the bias approximation. To see this, let $W=(X:\tilde X)$ be a $T\times T$ matrix of rank $T$ obtained by augmenting the $X$ matrix with $T-K$ linearly independent columns $\tilde X$. Then it is possible to write
$$
Y_1=(y_1:Y_2)=W(\pi_1^*:\Pi_2^*),
$$
where
$$
(\pi_1^*:\Pi_2^*)=(W'W)^{-1}W'(y_1:Y_2)=W^{-1}(y_1:Y_2).
$$
Let
$$
W^*=\left[I-c(I-P_X)\right]W,
$$
where $c=1+\sqrt{1-k}$ and $k$ is defined in (5). The corresponding $k$-class estimator may then be written as
$$
(16)\qquad \hat\alpha_k=\begin{pmatrix}\hat\beta_k\\\hat\gamma_k\end{pmatrix}
=\begin{bmatrix}\Pi_2^{*\prime}(W^*)'W^*\Pi_2^*&\Pi_2^{*\prime}(W^*)'W_1^*\\(W_1^*)'W^*\Pi_2^*&(W_1^*)'W_1^*\end{bmatrix}^{-1}
\begin{bmatrix}\Pi_2^{*\prime}(W^*)'W^*\pi_1^*\\(W_1^*)'W^*\pi_1^*\end{bmatrix},
$$
where $W_1^*$ is a $T\times k_1$ matrix formed by the first $k_1$ columns of $W^*$; thus $W_1^*=X_1$. Now, putting $\alpha_k=(\beta_k',\gamma_k')'$, we may write that, conditional on $W$,
$$
(17)\qquad \alpha_k=f(\operatorname{vec}\Pi_1^*),
$$