Subsampling Tests of Parameter Hypotheses and Overidentifying Restrictions with Possible Failure of Identification


Patrik Guggenberger, Department of Economics, U.C.L.A. Michael Wolf, Department of Economics and Business, Universitat Pompeu Fabra. April 2004; revised November 2004.

Abstract

We introduce a general testing procedure in models with possible identification failure that has exact asymptotic rejection probability under the null hypothesis. The procedure is widely applicable and in this paper we apply it to tests of arbitrary linear parameter hypotheses and to tests of overidentifying restrictions in time series models given by unconditional moment conditions. The main idea is to subsample classical tests, like for example the Wald or the J test. More precisely, instead of using these tests with critical values based on asymptotic theory, we compute data dependent critical values based on the subsampling technique. We show that the resulting tests have exact asymptotic rejection probabilities under the null hypothesis independent of identification failure. Furthermore, the subsampling tests of parameter hypotheses are shown to be consistent against fixed alternatives and to have the same local power as the original tests under full identification. The subsampling test of overidentifying restrictions is shown to be consistent against model misspecification. An algorithm is provided that automates the block size choice needed to implement the subsampling testing procedure. A Monte Carlo study shows that the tests have reliable size properties and often outperform other robust tests in terms of power.

JEL Classification: C12, C15, C52.

Keywords: Hypothesis Testing, Nonlinear Moment Conditions, Overidentifying Restrictions, Partial Identification, Subsampling, Weak Identification.

Corresponding author: Patrik Guggenberger, Bunche Hall 8385, Department of Economics, U.C.L.A., Box , Los Angeles, CA. guggenbe@econ.ucla.edu, michael.wolf@upf.edu.
Research supported by the Spanish Ministry of Science and Technology and FEDER, grant BMF , and by the Barcelona Economics Program of CREA. We are grateful for very helpful comments by Don Andrews, In Choi, Jinyong Hahn, Marcelo Moreira, Jim Stock, and participants at the 2004 NBER/NSF Time Series Conference at SMU and seminar participants at UC Davis.

1 Introduction

Since Phillips's (1989) seminal paper on the consequences of identification failure for the distribution of point estimators and test statistics, a vast literature on partially or weakly identified models has developed. 1 A major finding of this literature is that in models with possible identification failure, point estimates can be severely biased and classical tests of parameter hypotheses or overidentifying restrictions can be extremely size distorted in finite samples. In response to the unreliability of classical tests of parameter hypotheses in models with identification failure, such as Wald or likelihood ratio tests, several new tests for simple full vector parameter hypotheses have recently been introduced whose rejection probabilities under the null hypothesis are (asymptotically) unaffected by identification failure. 2 However, to the best of our knowledge, no test of overidentifying restrictions that is consistent against model misspecification and robust to identification failure has been introduced in the literature. Furthermore, generalizations in the literature of the above mentioned tests of simple full vector parameter hypotheses to more general parameter hypotheses either require additional assumptions or are only conservative. For example, Kleibergen's (2001, 2004) test can be used to test simple subvector hypotheses under the additional assumption that the parameters not under test are strongly identified. Dufour (1997) suggests a projection based testing procedure for general parameter hypotheses that works without additional assumptions but leads to conservative tests. Furthermore, in general the projection idea is computationally cumbersome. 3 In this paper, we address the need for robust tests of general linear hypotheses and overidentifying restrictions. We introduce a general testing procedure in models with possible identification failure that has exact asymptotic rejection probability under the null hypothesis.
We then apply the procedure to tests of arbitrary linear parameter hypotheses and to tests of overidentifying restrictions in time series models given by unconditional moment restrictions. The main idea of our procedure is to apply

1 See, among others, Nelson and Startz (1990), Choi and Phillips (1992), Dufour (1997), Staiger and Stock (1997), Stock and Wright (2000), and Forchini and Hillier (2003); for recent reviews of the weak identification literature see Stock et al. (2002) and Startz et al. (2004). A recent paper by Chao and Swanson (2003) brings together the many instruments (Bekker, 1994) and weak instruments literatures. 2 Besides the early contribution of Anderson and Rubin (1949), see among others Stock and Wright (2000), Kleibergen (2001, 2002), Caner (2003), Guggenberger (2003), Moreira (2003), Otsu (2003), Dufour and Taamouti (2004a, 2004b), and Guggenberger and Smith (2004). Andrews et al. (2004) investigate robust hypothesis testing with optimal power properties when the instrumental variables might be weak. 3 One exception is the Anderson and Rubin (1949) statistic for scalar linear hypotheses, where a closed form solution is available, see Dufour and Taamouti (2004a, 2004b).

subsampling to classical tests. More precisely, instead of using the tests with critical values based on asymptotic theory, we compute data dependent critical values based on the subsampling technique. The test statistic under consideration is evaluated on all (overlapping) blocks of the observed data sequence, where the common block size is small compared to the sample size. The critical value is then obtained as an empirical quantile of the resulting block (or subsample) test statistics. We first introduce a general definition of identification failure that brings together Phillips's (1989) notion of partial identification with Stock and Wright's (2000) notion of weak identification. We then apply the subsampling method to the Wald and the J test (Hansen (1982), also see Newey (1985)) and show that the resulting tests have exact asymptotic rejection probabilities under the null hypothesis independent of identification failure. Furthermore, we show that the subsampling version of the Wald test is consistent against fixed alternatives under full identification and has the same local power as the Wald test. The subsampling version of the J test is shown to be consistent against model misspecification. Our analysis is done in time series models given by nonlinear moment conditions. Throughout the paper, we use the linear single equation instrumental variables model as an illustrative example of the general time series model. Our parameter tests can be applied to general linear hypotheses without additional identification assumptions. In particular, unlike Kleibergen (2001, 2004), no additional identification assumptions are required for subvector testing. Also, in a linear single equation instrumental variables model we can test simultaneous hypotheses on the coefficients of the exogenous and endogenous variables. A further advantage of the subsampling approach considered here is its robustness with respect to the model assumptions.
For example, we show that the sizes of the subsampling tests are not affected (asymptotically) by instrument exclusion in the reduced form of a linear single equation instrumental variables model. This last advantage also holds true for the Anderson and Rubin (1949) statistic (in the case of a simple full vector hypothesis) but not for the tests by Kleibergen (2002, 2004) or Moreira (2003); see Dufour and Taamouti (2004b). We assess the finite sample performance of several parameter subvector tests in a Monte Carlo study using Dufour and Taamouti's (2004b) linear design with two endogenous variables on the right side of the structural equation. We find that their projected Anderson and Rubin test is dominated in terms of power by Kleibergen's (2003) test across every single scenario. In all scenarios where the parameter not under test is only weakly identified, our subsampled Wald test is the clear winner among the three statistics and the power gains can be dramatic in these cases. If this parameter is strongly identified, then Kleibergen's (2003) test typically has slightly better power properties than our test. In an additional Monte Carlo experiment we

assess the power loss of the subsampling procedure in a scenario where subsampling does not enjoy a comparative advantage, namely when testing a simple full vector hypothesis in a linear i.i.d. model. We find that even in this disadvantageous setup, subsampling still performs competitively but is often outperformed in terms of power by Moreira (2003). Lastly, we conduct an experiment to assess the size properties of tests of overidentifying restrictions. Again, our Monte Carlo results are consistent with our theory: while the classical J test oftentimes severely overrejects, we find that its subsampling version has generally very reliable size properties. Besides all the advantages of the subsampling technique mentioned above, there are also general drawbacks. Firstly, compared to tests that are given in closed form, a relative disadvantage of the subsampling approach is its computational burden. However, this disadvantage is shared with other popular resampling methods, such as the bootstrap and the jackknife. Secondly, an application of the subsampling method requires the choice of a block size $b$, which can be considered a model parameter. To overcome that problem, we provide a data dependent method to automate this choice. Thirdly, under full identification, weak regularity conditions, and one sided alternatives, the error in rejection probability under the null for tests based on subsampling is typically of order $O_p(b^{-1/2})$ compared to the faster $O_p(n^{-1/2})$ of standard approaches, where $b$ and $n$ denote the block and sample size, respectively, see Politis et al. (1999, chapter 10.1). This slower rate of convergence under full identification is the price that has to be paid for making the procedure robust to identification failure. Lastly, in our specific application, the finite sample power function of a subsampling version of a test is oftentimes below the one of the original test in strongly identified situations.
However, compared to other tests that are robust to identification failure, our Monte Carlo study indicates that oftentimes there can be tremendous power gains of the subsampling approach. In the econometrics literature, subsampling has by now been suggested in a variety of situations for the construction of confidence intervals or hypothesis tests where it is at least questionable whether the bootstrap would work. Some recent examples include Romano and Wolf (2001), who use subsampling to construct confidence intervals for the autoregressive coefficient in an AR(1) model with a possible unit root. Andrews (2003) introduces a subsampling like testing method for structural instability of short duration. Choi (2004) uses subsampling for tests of linear parameter constraints in a vector autoregression with potential unit roots, and Gonzalo and Wolf (2005) suggest subsampling for the construction of confidence intervals for the threshold parameter in threshold autoregressive models with potentially discontinuous autoregressive function. Related to our paper is Hall and Horowitz (1996), who suggest the bootstrap for

improved critical values for certain GMM tests. However, their paper is not concerned with the possibility of identification failure. Kleibergen (2003) derives higher order expansions of various statistics that are robust to weak instruments and suggests the bootstrap to further improve on the size properties of tests based on these statistics. He also provides insight as to why the bootstrap is not expected to improve on the size distortion of classical tests, like a Wald test, under identification failure. In an i.i.d. linear model with one endogenous right hand side variable, Moreira et al. (2004) go one step further by providing a formal proof of the validity of Edgeworth expansions for the score and conditional likelihood ratio statistics when instruments may be weak. These statistics are known to be robust to weak instruments. They show the validity of the bootstrap for the score test and the validity of the conditional bootstrap for various conditional tests. On the other hand, our paper shows that in general time series moment condition models subsampling fixes the size distortion of classical tests of general hypotheses that are not robust to weak identification, like a Wald test. The remainder of the paper is organized as follows. In Section 2, the model is introduced, the testing problems are described, and a general definition of identification failure is provided. In order to be self contained, in Section 3 we first briefly review the basic theory of subsampling for time series data. We then derive the asymptotic distribution of some classical test statistics under the general asymptotic framework of identification failure to show that the tests are generally size distorted under identification failure. We then apply subsampling to those tests in subsections 3.2 (overidentifying restrictions) and 3.3 (general linear parameter hypotheses) to cure the problem of size distortion.
In Section 4 we provide a data driven choice of the block size needed to implement the subsampling procedure. Some further robustness properties of the subsampling method are discussed in Section 5. Section 6 describes the simulation results. All proofs are relegated to Appendix (C) while Appendices (A) and (B) contain some additional discussion of our assumption on identification failure and contiguity, respectively. The following notation and terminology is used in the paper. The symbols $\to_d$, $\to_p$, and $\Rightarrow$ denote convergence in distribution, convergence in probability, and weak convergence of empirical processes, respectively. For the latter, see Andrews (1994) for a definition. For "with probability 1" we write w.p.1 and a.s. stands for "almost surely." By $C^i(A,B)$ we denote the set of functions $f: A \to B$ that are $i$ times continuously differentiable. If $B = \mathbb{R}$, the set of real numbers, we simply write $C^i(A)$ for $C^i(A,B)$. By id we denote the identity map and by $O(i)$ the group of orthogonal $i \times i$ matrices. By $e_j \in \mathbb{R}^p$ we denote the $p$-vector $(0,\ldots,1,\ldots,0)'$ with 1 appearing at position $j$. For a matrix $M$, $M > 0$ means that $M$ is positive definite

and $[M]_{i,j}$ denotes the element of $M$ in row $i$ and column $j$. By $I_i$ we denote the $i$-dimensional identity matrix. Furthermore, $\mathrm{vec}(M)$ stands for the column vectorization of the $k \times i$ matrix $M$, that is, if $M = (m_1,\ldots,m_i)$ then $\mathrm{vec}(M) = (m_1',\ldots,m_i')'$. By $P_M$ we denote the orthogonal projection onto the range space of $M$. Finally, $\|M\|$ equals the square root of the largest eigenvalue of $M'M$ and $\otimes$ denotes the Kronecker product.

2 The Model, Tests, and Identification Failure

2.1 The Model

We consider models specified by a finite number of unconditional moment restrictions. Let $\{z_i : i = 1,\ldots,n\}$ be $\mathbb{R}^l$-valued data and, for each $n \in \mathbb{N}$, $g_n : \mathcal{G} \times \Theta \to \mathbb{R}^k$, where $\mathcal{G} \subset \mathbb{R}^l$ and $\Theta \subset \mathbb{R}^p$ denotes the parameter space. The model has a true parameter $\theta_0$ for which the moment condition

$$Eg_n(z_i, \theta_0) = 0 \quad (2.1)$$

is satisfied for all $i = 1,\ldots,n$. For $g_n(z_i, \theta)$ we usually simply write $g_i(\theta)$. For example, moment conditions may result from conditional moment restrictions. Assume $E[h(Y_i, \theta_0) \mid \mathcal{F}_i] = 0$, where $h : \mathcal{H} \times \Theta \to \mathbb{R}^{k_1}$, $\mathcal{H} \subset \mathbb{R}^{k_2}$, and $\mathcal{F}_i$ is the information set at time $i$. Let $Z_i$ be a $k_3$-dimensional vector of instruments contained in $\mathcal{F}_i$. If $g_i(\theta) := h(Y_i, \theta) \otimes Z_i$, then $Eg_i(\theta_0) = 0$ follows by taking iterated expectations. In (2.1), $k = k_1 k_3$ and $l = k_2 + k_3$. A second important example of model (2.1) is given by the following:

Example 2.1 (I.i.d. linear instrumental variable (IV) model): Consider the linear model with i.i.d. observations given by the structural equation and the reduced form for $Y$

$$y = Y\beta_0 + X\gamma_0 + u, \quad (2.2)$$
$$Y = Z\Pi + X\Phi + V, \quad (2.3)$$

where $y, u \in \mathbb{R}^n$, $Y, V \in \mathbb{R}^{n \times v_1}$, $X \in \mathbb{R}^{n \times v_2}$, $Z \in \mathbb{R}^{n \times j}$, $\Phi \in \mathbb{R}^{v_2 \times v_1}$, and $\Pi \in \mathbb{R}^{j \times v_1}$. Let $p := v_1 + v_2$, $k := j + v_2$, $\theta := (\beta', \gamma')'$, and $\theta_0 := (\beta_0', \gamma_0')'$. The matrix $Y$ contains the endogenous and the matrix $X$ the exogenous variables. The variables $Z$ constitute a set of instruments for the endogenous variables $Y$. For the model to be identified it is necessary that $j \geq v_1$. Denote by $Y_i, V_i, Z_i, \ldots$ ($i = 1,\ldots,n$) the $i$-th row of the matrix $Y, V, Z, \ldots$ written as a column vector, and similarly for analogous

expressions. Assume $E(Z_i', X_i')'u_i = 0$ and $E(Z_i', X_i')'V_i' = 0$. The first condition implies that $Eg_i(\theta_0) = 0$, where for each $i = 1,\ldots,n$

$$g_i(\theta) := (Z_i', X_i')'(y_i - Y_i'\beta - X_i'\gamma).$$

Note that in this example $g_i(\theta)$ depends on $n$ if the reduced form coefficient matrix $\Pi$ is modeled to depend on $n$, see Staiger and Stock (1997).

2.2 Hypothesis Tests

Interest focuses on two separate testing problems in a context that allows for identification failure: (i) testing hypotheses involving the unknown parameter vector $\theta_0$; (ii) testing the overidentifying restrictions assumption $Eg_n(z_i, \theta_0) = 0$ for some $\theta_0 \in \Theta$ in (2.1), when the model is overidentified, that is when $k > p$. More precisely, the testing problems are

(i) $H_0: R\theta_0 = q$ versus $H_1: R\theta_0 \neq q$, (2.4)

(ii) $H_0: \exists\, \theta \in \Theta$ with $Eg_i(\theta) = 0$ versus $H_1: Eg_i(\theta) \neq 0$ for all $\theta \in \Theta$, (2.5)

where in (2.4) $R \in \mathbb{R}^{r \times p}$ is a matrix of maximal rank $r$, for some $r$ satisfying $1 \leq r \leq p$, and $q \in \mathbb{R}^r$ is an arbitrary vector. 4 For testing problem (ii) to make sense, one has to impose a stationarity assumption on the distribution of $z_i$, which we do below. Problem (i) with $r < p$ contains as a particular subcase simple subvector tests, in which case the rows of $R$ are a subset of the rows of $I_p$. Subvector testing in the context of weak identification has attracted a lot of attention in the recent literature, see, for example, Kleibergen (2001, 2004), Dufour and Taamouti (2004a, 2004b), Guggenberger and Smith (2004), and Startz et al. (2004). Note also that we allow for null hypotheses in (i) that, in the case of the linear model (2.2), may involve both the unknown parameters of the exogenous and endogenous variables. Many tests in the literature are designed for the linear model where the included exogenous variables have been projected out in a first step, thereby ruling out a test of a hypothesis that involves both parameters of the exogenous and endogenous variables, see for example Kleibergen's (2002, 2004) test.
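The moment function of Example 2.1 is easy to illustrate numerically. The sketch below is ours, not the authors' code; all concrete choices (two instruments, one endogenous and one exogenous regressor, the error correlation `rho`, and the instrument strength `pi_scale`) are hypothetical values for the demonstration, with a small `pi_scale` mimicking weak identification.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_iv(n=200, pi_scale=0.1, beta0=1.0, gamma0=0.5, rho=0.9):
    """Draw from the linear IV model (2.2)-(2.3) with v1 = 1 endogenous
    regressor, v2 = 1 exogenous regressor, and j = 2 instruments.
    A small pi_scale makes Pi close to zero (weak instruments)."""
    Z = rng.standard_normal((n, 2))
    X = rng.standard_normal((n, 1))
    u = rng.standard_normal(n)
    # structural and reduced-form errors are correlated, so Y is endogenous
    V = rho * u + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    Pi = pi_scale * np.ones((2, 1))
    Phi = np.ones((1, 1))
    Y = Z @ Pi + X @ Phi + V[:, None]           # reduced form (2.3)
    y = Y[:, 0] * beta0 + X[:, 0] * gamma0 + u  # structural equation (2.2)
    return y, Y, X, Z

def g(theta, y, Y, X, Z):
    """Moment function g_i(theta) = (Z_i', X_i')'(y_i - Y_i'beta - X_i'gamma),
    returned as an n x k array with k = j + v2 = 3."""
    resid = y - Y[:, 0] * theta[0] - X[:, 0] * theta[1]
    W = np.hstack([Z, X])   # stacked instruments (Z_i', X_i')'
    return W * resid[:, None]

y, Y, X, Z = simulate_iv()
ghat = g(np.array([1.0, 0.5]), y, Y, X, Z).mean(axis=0)  # near 0 at theta0
```

At the true parameter the sample moment vector is an average of mean-zero terms, so each of its $k = 3$ entries is of order $n^{-1/2}$.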
2.3 Identification Failure

As is now widely documented, classical tests of the hypotheses in (2.4) and (2.5), such as the Wald, likelihood ratio, and J test (Hansen, 1982), can suffer from severe

4 While in this paper we only deal with two sided alternatives, our approach can also be applied to one sided alternatives of the form $H_1: R\theta_0 < q$ or $H_1: R\theta_0 > q$, if there is only one restriction under test, that is $r = 1$. Furthermore, using more complicated assumptions in the theorems below, our approach could even be adapted to nonlinear parameter hypotheses.

size distortion in situations where the model is not identified or close to being not identified. In model (2.1) identification failure means that besides $\theta_0$ there are other $\theta \in \Theta$ that satisfy the moment condition. The abstract meaning of weak identification is that there are other $\theta \in \Theta$ that satisfy the moment condition in the limit as $n \to \infty$. The classical identification condition, the so called rank condition of identification, states that the matrix $(\partial Eg_i/\partial \theta')(\theta_0)$ has full column rank $p$. In the linear model, violation of the rank condition immediately implies that the model is not identified. Much of the literature on weak identification has focused on the particular case where the parameter vector $\theta_0$ has a decomposition $\theta_0 = (\theta_{01}', \theta_{02}')'$ into some weakly, $\theta_{01}$, and some strongly identified components, $\theta_{02}$. Namely, the original definition of weak identification introduced in Stock and Wright (2000) for nonlinear models focuses on this case. 5 Define $\hat g(\theta) := n^{-1}\sum_{i=1}^{n} g_i(\theta)$. As discussed in Appendix (A), Assumption C, applied to the linear model, implies that $(\partial E\hat g/\partial \theta') = (0, M)$, where $M$ is a matrix of maximal rank. On the other hand, Phillips (1989) and Choi and Phillips (1992) allow for a linear model with general failure of the rank condition in what they call partial identification. In their model, $(\partial E\hat g/\partial \theta')$ can be of non maximal rank without being of the particular form $(0, M)$. We now introduce a general version of identification failure in nonlinear models that brings together this partially identified model and Stock and Wright's (2000) weakly identified model. We show in the next section that the subsampling tests are robust against this general version of identification failure. A more detailed discussion of Assumption ID is relegated to Appendix (A).

Assumption ID: There exists a coordinate change 6 $T \in O(p)$ such that $T(\bar\Theta) = \Theta$, where $\bar\Theta$ is a compact product set $\bar\Theta = \bar\Theta_1 \times \bar\Theta_2 \subset \mathbb{R}^{p_1 + p_2} = \mathbb{R}^p$, and functions $\bar m_{1n}$,

5 Assumption C, Stock and Wright (2000, p.
1061): Decompose $\theta = (\theta_1', \theta_2')'$, $\theta_0 = (\theta_{01}', \theta_{02}')'$ and $\Theta = \Theta_1 \times \Theta_2$. (i) $E\hat g(\theta) = n^{-1/2} m_{1n}(\theta) + m_2(\theta_2)$, where $m_{1n}, m_1 \in C^0(\Theta, \mathbb{R}^k)$ and $m_2 \in C^0(\Theta_2, \mathbb{R}^k)$, such that $m_{1n}(\theta) \to m_1(\theta)$ uniformly on $\Theta$, $m_1(\theta_0) = 0$, and $m_2(\theta_2) = 0$ if and only if $\theta_2 = \theta_{02}$. (ii) $m_2 \in C^1(N, \mathbb{R}^k)$ for a neighborhood $N \subset \Theta_2$ of $\theta_{02}$ and $(\partial m_2/\partial \theta_2')(\theta_{02})$ has full column rank. In the linear model with no included exogenous variables, Assumption C boils down to a decomposition for $\Pi$ into $\Pi_n = (n^{-1/2}\Pi_A, \Pi_B)$, where $\Pi_A$ and $\Pi_B$ are fixed matrices with $p_1$ and $p_2$ columns and $\Pi_B$ has full column rank, see Stock and Wright (2000, Section 3). 6 For notational convenience we denote by $T$ both the linear map $T: \mathbb{R}^p \to \mathbb{R}^p$ and the uniquely defined matrix in $\mathbb{R}^{p \times p}$ that defines this map. Assumption ID could be generalized to allow for possibly nonlinear coordinate changes $T$.

$\bar m_1 : \bar\Theta \to \mathbb{R}^k$, and $\bar m_2 : \bar\Theta_2 \to \mathbb{R}^k$ such that for $\bar\theta := (\bar\theta_1, \bar\theta_2) := T^{-1}(\theta)$, $\bar\theta_0 := (\bar\theta_{01}, \bar\theta_{02}) := T^{-1}(\theta_0)$, and $\bar{\hat g}(\cdot) := \hat g(T(\cdot)): \bar\Theta \to \mathbb{R}^k$

(i) $\bar m_1 \in C^0(\bar\Theta, \mathbb{R}^k)$, $\bar m_2 \in C^0(\bar\Theta_2, \mathbb{R}^k) \cap C^1(N, \mathbb{R}^k)$ for a neighborhood $N$ of $\bar\theta_{02}$,

(ii) $E\bar{\hat g}(\bar\theta) = n^{-1/2}\bar m_{1n}(\bar\theta) + \bar m_2(\bar\theta_2)$, $\bar m_{1n}(\bar\theta) \to \bar m_1(\bar\theta)$ uniformly on $\bar\Theta$,

(iii) $\bar m_1(\bar\theta_0) = 0$, $\bar m_2(\bar\theta_2) = 0$ if and only if $\bar\theta_2 = \bar\theta_{02}$, and $\bar M_2(\bar\theta_{02})$ has full column rank, where $\bar M_2(\bar\theta_2) := (\partial \bar m_2/\partial \bar\theta_2')(\bar\theta_2) \in \mathbb{R}^{k \times p_2}$.

Assumption ID contains as a subcase the case of a fully identified model ($T =$ id and $p_1 = 0$) and the case of a totally unidentified model ($T =$ id, $p_1 = p$, and $\bar m_{1n} \equiv 0$). $T$ is a change of the coordinate system such that in the new coordinate system the identified components $\bar\theta_{02}$ of the parameter vector $\bar\theta_0$ are singled out. ID essentially boils down to Assumption C in Stock and Wright (2000) if we set $T =$ id. If $T =$ id then, 7 by ID (ii)-(iii), the first components $\theta_{01}$ of $\theta_0$ are only weakly identified. Clearly, no information on $\theta_{01}$ can be gained from the term $m_2$. Therefore, all the identifying information on $\theta_{01}$ from the condition $E\hat g(\theta) = 0$ has to come from the term $n^{-1/2} m_{1n}(\theta)$. But this term vanishes with increasing sample size. ID is more general than Assumption C in Stock and Wright (2000) because unlike C it comprises the partially identified model of Phillips (1989). It is more general than the latter because it allows for nonlinear moment conditions and weak identification. For every finite sample size $n$, the model may be fully identified through the term $n^{-1/2} m_{1n}(\theta)$. But the information contained in $n^{-1/2} m_{1n}(\theta)$ fades away with $n$ going to infinity, leading to a partially identified model asymptotically.
3 Subsampling Tests Under Weak Identification

The main reason for the size distortion of classical tests (such as Wald, likelihood ratio, and J) under identification failure is that parameter estimates of $\theta_0$ have a non normal asymptotic distribution, which implies that the tests are no longer asymptotically $\chi^2$ under weak identification, see Theorems 3.2(ii) and 3.3 below. Subsampling can cure the problem of size distortion. Instead of critical values based on asymptotic theory, it uses data dependent critical values obtained as follows. The test statistic

7 Whenever $T =$ id, the new coordinates are the same as the original ones and therefore, throughout the paper, we leave out the bars in the notation in this case.

under consideration is evaluated on all (overlapping) blocks of the observed data sequence, where the common block size $b$ is small compared to the sample size. The critical value is then obtained as an empirical quantile of the resulting subsample test statistics. In this section we describe in more detail how to use subsampling to construct tests that have exact (asymptotic) rejection probabilities under the null hypothesis, both under full identification and identification failure. For a general reference on subsampling see Politis et al. (1999). Our approach is to present a high level theorem and then verify/illustrate it in the particular settings we are interested in. One observes a stretch of vector valued data $z_1,\ldots,z_n$. Denote the unknown probability mechanism generating the data by $P$. It is assumed that $P$ belongs to a certain class of mechanisms $\mathbf{P}$. The null hypothesis $H_0$ asserts $P \in \mathbf{P}_0$ and the alternative hypothesis $H_1$ asserts $P \in \mathbf{P}_1$, where $\mathbf{P}_0, \mathbf{P}_1 \subset \mathbf{P}$, $\mathbf{P}_0 \cap \mathbf{P}_1 = \emptyset$, and $\mathbf{P}_0 \cup \mathbf{P}_1 = \mathbf{P}$. The goal is to construct a test with exact asymptotic rejection probability under the null hypothesis based on a given test statistic $D_n = D_n(z_1,\ldots,z_n)$. Let $C_n(P)$ denote the sampling distribution of $D_n$ under $P$, that is,

$$C_n(x, P) := \mathrm{Prob}_P\{D_n(z_1,\ldots,z_n) \leq x\}.$$

It will be assumed that under the null hypothesis $C_n(P)$ converges in distribution to a continuous limit law $C(P)$. The $1-\alpha$ quantile of this limit law is denoted by $c(1-\alpha, P) := \inf\{x : C(x, P) \geq 1-\alpha\}$. To describe the subsampling test construction, denote by $Q_1,\ldots,Q_N$ the $N := n - b + 1$ blocks of size $b$ of the observed data stretch $\{z_1,\ldots,z_n\}$; that is, $Q_a := \{z_a,\ldots,z_{a+b-1}\}$ for $a = 1,\ldots,N$. The model parameter $b$ is called the block size. We will discuss its choice in Section 4. Let $D_{b,a}$ be equal to the statistic $D_b$ evaluated at the block $Q_a$. The sampling distribution of $D_n$ is then approximated by 8

$$\hat C_{n,b}(x) := N^{-1} \sum_{a=1}^{N} 1\{D_{b,a} \leq x\}.$$
The data dependent critical value of the subsampling test is obtained as the $1-\alpha$ quantile of $\hat C_{n,b}$, that is,

$$\hat c_{n,b}(1-\alpha) := \inf\{x : \hat C_{n,b}(x) \geq 1-\alpha\}$$

8 In the special case of i.i.d. data, one could theoretically use all $\binom{n}{b}$ blocks of size $b$ rather than only the $N$ blocks used in the general time series context. Computationally, however, it is generally not feasible to use all $\binom{n}{b}$ blocks.
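In code, the construction of $\hat C_{n,b}$ and its quantile amounts to evaluating the statistic on every overlapping block and taking an empirical order statistic. A minimal sketch (the function name and the example statistic are ours, not from the paper):

```python
import numpy as np

def subsampling_critical_value(data, stat, b, alpha=0.05):
    """hat-c_{n,b}(1-alpha): the 1-alpha empirical quantile of the
    statistic evaluated on the N = n - b + 1 overlapping blocks
    Q_a = {z_a, ..., z_{a+b-1}} of block size b."""
    n = len(data)
    N = n - b + 1
    stats = np.sort([stat(data[a:a + b]) for a in range(N)])
    # inf{x : hat-C_{n,b}(x) >= 1 - alpha} is the order statistic below
    k = int(np.ceil((1.0 - alpha) * N)) - 1
    return stats[k]

# example: D_b = sqrt(b) * |block mean| on white-noise data
rng = np.random.default_rng(0)
z = rng.standard_normal(400)
cv = subsampling_critical_value(
    z, lambda x: np.sqrt(len(x)) * abs(x.mean()), b=20)
```

For this toy statistic the block values are roughly $|N(0,1)|$ draws, so the computed critical value is close to the asymptotic one.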

and the test arrives at the following decision:

Reject $H_0$ at nominal level $\alpha$ if and only if $D_n > \hat c_{n,b}(1-\alpha)$. (3.6)

If our only concern was to construct a test with correct null rejection probability, it could be achieved trivially: generate a uniform (0,1) variable and reject the null hypothesis if the outcome is smaller than $\alpha$. But, obviously, we also want to achieve power when the model is identified. To formally establish power, we make the further assumption that the test statistic can be written as

$$D_n(z_1,\ldots,z_n) = n^{\beta} d_n(z_1,\ldots,z_n) \text{ for some } \beta > 0, \quad (3.7)$$

where

$$d_n(z_1,\ldots,z_n) \to_p d(P) \text{ satisfying } d(P) = 0 \text{ if } P \in \mathbf{P}_0 \text{ and } d(P) > 0 \text{ if } P \in \mathbf{P}_1. \quad (3.8)$$

We then have the following theorem.

Theorem 3.1 Assume the sequence $\{z_i\}$ is strictly stationary and strongly mixing 9 and that the block size satisfies $b/n \to 0$ and $b \to \infty$ as $n \to \infty$. (i) Assume that for $P \in \mathbf{P}_0$, $C_n(P)$ converges weakly to a continuous limit law $C(P)$ whose cumulative distribution function is $C(\cdot, P)$ and whose $1-\alpha$ quantile is $c(1-\alpha, P)$. Then, if $P \in \mathbf{P}_0$, $\hat c_{n,b}(1-\alpha) \to_p c(1-\alpha, P)$ as $n \to \infty$ and $\mathrm{Prob}_P\{D_n > \hat c_{n,b}(1-\alpha)\} \to \alpha$. (ii) If (3.7) and (3.8) hold and $P \in \mathbf{P}_1$, then as $n \to \infty$, $\mathrm{Prob}_P\{D_n > \hat c_{n,b}(1-\alpha)\} \to 1$. (iii) Suppose $P_n$ is a sequence of alternatives such that, for some $P \in \mathbf{P}_0$, $\{P_n^{[n]}\}$ is contiguous to $\{P^{[n]}\}$. 10 Here, $P_n^{[n]}$ denotes the law of the finite segment $\{z_1,\ldots,z_n\}$ when the law of the infinite sequence $\{\ldots,z_{-1},z_0,z_1,\ldots\}$ is given by $P_n$. The meaning of $\{P^{[n]}\}$ is analogous. Then, $\hat c_{n,b}(1-\alpha) \to c(1-\alpha, P)$ in $P_n^{[n]}$-probability. Hence, if $D_n$ converges in distribution to $D$ under $P_n$ and $C(\cdot, P)$ is continuous at $c(1-\alpha, P)$, then as $n \to \infty$

$$\mathrm{Prob}_{P_n^{[n]}}\{D_n > \hat c_{n,b}(1-\alpha)\} \to \mathrm{Prob}\{D > c(1-\alpha, P)\}.$$

9 Alternatively, strongly mixing is sometimes called $\alpha$-mixing, see Politis et al. (1999, p. 315) for a definition. 10 In Appendix (B) we provide some background information on contiguity.

The theorem shows that the subsampling approach is consistent and has exact asymptotic rejection probability under the null. The interpretation of part (iii) is the following. Suppose that instead of using the subsampling construction, one could use the oracle test that rejects when $D_n > c_n(1-\alpha, P)$, where $c_n(1-\alpha, P)$ is the exact $1-\alpha$ quantile of the true sampling distribution $C_n(\cdot, P)$, where $P \in \mathbf{P}_0$. Of course, this test is not available in general because $P$ is unknown and so is $c_n(1-\alpha, P)$. Then, the limiting power of the subsampling test against a sequence of contiguous alternatives $\{P_n\}$ to $P$ with $P \in \mathbf{P}_0$ is the same as the limiting power of this fictitious oracle test against the same sequence of alternatives. Hence, to the order considered, there is no loss in efficiency in terms of power.

3.1 Classical Test Statistics

We next introduce subsampling based testing procedures for the testing problems (2.4) and (2.5) that, unlike various classical tests, have exact (asymptotic) rejection probabilities under the null, independent of possible identification failure. The Anderson and Rubin (1949) statistic, recently reinvestigated by Dufour and Taamouti (2004b), has an F distribution in the linear model under normality and a simple null hypothesis $H_0: \theta_0 = q$ in (2.4), and therefore, under normality, leads to a test with exact null rejection probability independent of identification failure. However, for tests of more general hypotheses, the (projected) Anderson and Rubin test is only conservative, even asymptotically. Other recent tests, for example Kleibergen's (2001, 2004) test, are not available for tests of general linear hypotheses. They can be generalized, however, to tests of simple subvector hypotheses with exact asymptotic null rejection probabilities, but require the additional assumption that the parameters not under test are strongly identified.
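As a concrete toy instance of the decision rule (3.6) and the power conditions (3.7)-(3.8) (our own illustration, not one of the paper's GMM applications): take i.i.d. data, $H_0: Ez_i = 0$, and the statistic $D_n = n\,\bar z_n^2$, which satisfies (3.7)-(3.8) with $\beta = 1$ and $d_n = \bar z_n^2$. The subsampling test then rejects with probability close to $\alpha$ under the null and with probability tending to 1 under a fixed alternative.

```python
import numpy as np

def subsampling_test(data, stat, b, alpha=0.05):
    """Decision rule (3.6): reject H0 iff D_n > hat-c_{n,b}(1-alpha),
    where the critical value is the 1-alpha empirical quantile of the
    statistic over all N = n - b + 1 overlapping blocks of size b."""
    n = len(data)
    D_n = stat(data)
    subs = np.sort([stat(data[a:a + b]) for a in range(n - b + 1)])
    k = int(np.ceil((1.0 - alpha) * len(subs))) - 1
    return bool(D_n > subs[k])

stat = lambda x: len(x) * x.mean() ** 2   # D_n = n * (sample mean)^2
rng = np.random.default_rng(1)
z0 = rng.standard_normal(500)             # H0 true: rejects with prob ~ alpha
z1 = z0 + 1.0                             # fixed alternative with d(P) = 1
rejected = subsampling_test(z1, stat, b=25)
```

Under the alternative, $D_n$ grows like $n$ while the block statistics only grow like $b$, which is exactly the mechanism behind Theorem 3.1(ii).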
In contrast, our testing approach, based on subsampling classical statistics, like for example the Wald, LR, and J statistic, has exact (asymptotic) null rejection probabilities without further assumptions and is applicable to general linear hypotheses and overidentifying restrictions, respectively. In this subsection, we introduce the test statistics, focusing on the J and the Wald statistic. 11 As in Stock and Wright (2000), we focus on a GMM setup. Let

$$S_n(\theta) := n \left\| A_n(e_n(\theta))^{1/2} \hat g(\theta) \right\|^2$$

11 A similar analysis can be done for the LR test. We focus on the Wald statistic here because it does not involve the restricted estimator of $\theta_0$ under the null hypothesis, which simplifies the exposition. We experimented with a subsampled version of the LR statistic in our simulations but did not find any systematic advantage over the Wald approach for linear hypotheses tests. To test overidentifying restrictions, other test statistics besides the J test could be considered. See, for example, Imbens (1997), Kitamura and Stutzer (1997), or Imbens et al. (1998), who investigate several Lagrange multiplier and criterion function tests based on generalized empirical likelihood methods.

be the GMM criterion function that is pinned down by some data-dependent weighting matrix $A_n(e_n(\theta))^{1/2} \in \mathbb{R}^{k \times k}$ for a (possibly stochastic) function $e_n(\cdot): \Theta \to \Theta$. More precisely, we allow for three different cases, namely one-step, two-step, and continuous updating (CU) GMM; see Hansen et al. (1996) for the latter. For one-step GMM, $A_n(e_n(\theta))$ is typically chosen to be $I_k$ or some other fixed positive definite nonstochastic matrix. Furthermore,

$e_n(\theta) := \begin{cases} \widetilde{e}_n & \text{for two-step GMM} \\ \theta & \text{for CU GMM,} \end{cases}$   (3.9)

for some preliminary estimator $\widetilde{e}_n$ of $\theta_0$. Therefore, for two-step GMM, $e_n(\cdot)$ does not depend on $\theta$, and for CU, $e_n(\cdot)$ is the nonstochastic identity map id. Define the GMM estimator as a sequence of random variables $\widehat{\theta}_n$ satisfying

$\widehat{\theta}_n \in \Theta$ and $S_n(\widehat{\theta}_n) \leq \inf_{\theta \in \Theta} S_n(\theta) + o_p(1)$.   (3.10)

We usually write $\widehat{\theta}$ for $\widehat{\theta}_n$. Let

$\Psi_n(\theta) := n^{1/2}(\widehat{g}(\theta) - E\widehat{g}(\theta))$, $\quad \Omega(\theta, \theta^+) := \lim_{n \to \infty} E\Psi_n(\theta)\Psi_n(\theta^+)'$, and $\Omega(\theta) := \Omega(\theta, \theta) \in \mathbb{R}^{k \times k}$.

As before in Assumption ID, a bar denotes expressions in the new coordinates. For example, we write $\overline{\Psi}_n(\cdot) := \Psi_n(T(\cdot))$, $\overline{\Psi}(\cdot) := \Psi(T(\cdot))$, $\overline{\Omega}(\cdot, \cdot) := \Omega(T(\cdot), T(\cdot))$, $\overline{A}(\cdot) := A(T(\cdot))$, and $\overline{A}_n(\cdot) := A_n(T(\cdot))$ for functions, and $\overline{e}_n(\theta) := T^{-1}(e_n(\theta))$ for vectors, and similarly for other expressions. Note that by writing functions in the new variables, for example $\overline{\Psi}(\overline{\theta})$ instead of $\Psi(\theta)$, we do not change the value of the function, that is, $\overline{\Psi}(\overline{\theta}) = \Psi(\theta)$; what we achieve by using the new coordinates is to single out identified from unidentified components in the parameter vector $\theta_0$. For testing problem (2.4) we now define the classical Wald statistic $W_n$ based on the GMM estimator, and for problem (2.5) we define the J statistic $J_n$ (Hansen (1982)) as the GMM criterion function evaluated at the GMM estimator.
More precisely,

$W_n := n\, (R\widehat{\theta}_n - q)'\, [R \widehat{B}_n^{-1} \widehat{\Omega}_n \widehat{B}_n^{-1} R']^{-1} (R\widehat{\theta}_n - q)$,   (3.11)

$J_n := S_n(\widehat{\theta}_n)$,   (3.12)

where

$\widehat{G}_n := n^{-1} \sum_{i=1}^n (\partial g_i/\partial \theta')(\widehat{\theta}_n) \in \mathbb{R}^{k \times p}$,   (3.13)

$\widehat{B}_n := \widehat{G}_n'\, A_n(e_n(\widehat{\theta}_n))\, \widehat{G}_n \in \mathbb{R}^{p \times p}$, $\quad \widehat{\Omega}_n := \widehat{G}_n'\, A_n(e_n(\widehat{\theta}_n))\, K_n(\widehat{\theta}_n)\, A_n(e_n(\widehat{\theta}_n))\, \widehat{G}_n \in \mathbb{R}^{p \times p}$,   (3.14)

and $K_n(\cdot)$ is an $\mathbb{R}^{k \times k}$-valued (stochastic) function on $\Theta$, with $K_n(\widehat{\theta}_n)$ an estimator of the long-run covariance matrix $\Omega(\theta_0)$. For example, in an i.i.d. model a natural choice would be $K_n(\theta) := n^{-1} \sum_{i=1}^n g_i(\theta) g_i(\theta)' \in \mathbb{R}^{k \times k}$, whereas in a time series model one would typically use some version of a heteroskedasticity and autocorrelation consistent (HAC) estimator; see Andrews (1991).

From now on, we distinguish the following two polar opposite cases of identification.

Full identification: Assume ID with $p_1 = 0$ and $T = \mathrm{id}$.

Identification failure: Assume ID with $p_1 > 0$ and $m_{1n} \equiv 0$.

In the next two subsections, we show that the classical J test of overidentifying restrictions and the Wald test of parameter hypotheses are generally size distorted under Assumption ID. In contrast, we establish that the subsampling versions of the J and Wald tests have (asymptotically) exact rejection probabilities under the null hypothesis, both under full identification and under identification failure. Extrapolating from these polar opposite cases of identification and identification failure, we interpret this as evidence that the tests based on subsampling continue to have this property in the intermediate case of weak identification (where $m_{1n} \neq 0$) as defined in Assumption ID.

3.2 Testing Overidentifying Restrictions

In this subsection, we first derive the asymptotic distribution of the classical J statistic under Assumption ID and conclude that the J test is potentially size distorted under identification failure. We then use this asymptotic result to show that the subsampling version of the J test has exact (asymptotic) rejection probability under the null. To derive the asymptotic distribution of the J statistic under Assumption ID, we first need that of the estimator $\widehat{\theta}_n$. We essentially make the same high-level assumptions as Stock and Wright (2000; see their Assumptions B and D).

Assumption PE (parameter estimates):^{12} Assume ID.
Suppose there exists a neighborhood $U_2 \subset \Theta_2$ of $\overline{\theta}_{02}$ such that, for $\Theta_{12} := \Theta_1 \times U_2$,

(i) $\overline{\Psi}_n \Rightarrow \overline{\Psi}$, where $\overline{\Psi}$ is a Gaussian stochastic process on $\Theta_{12}$ with mean zero, covariance function $E\overline{\Psi}(\overline{\theta})\overline{\Psi}(\overline{\theta}^+)' = \overline{\Omega}(\overline{\theta}, \overline{\theta}^+)$ for $\overline{\theta}, \overline{\theta}^+ \in \Theta_{12}$, and sample paths that are continuous w.p.1, and $\sup_{\overline{\theta} \in \Theta} \| n^{-1/2}\, \overline{\Psi}_n(\overline{\theta}) \| \to_p 0$;

Footnote 12: Weak convergence here is defined with respect to the sup norm on function spaces and the Euclidean norm on $\mathbb{R}^k$. Also, note that Assumption PE could alternatively be stated in original coordinates.

(ii) for some $\overline{A}(\cdot) \in C^0(\Theta, \mathbb{R}^{k \times k})$, $\sup_{\overline{\theta} \in \Theta} \| \overline{A}_n(\overline{\theta}) - \overline{A}(\overline{\theta}) \| \to_p 0$, $\overline{A}(\overline{\theta}) > 0$, and $\overline{A}_n(\overline{\theta}) > 0$ for all $\overline{\theta} \in \Theta$ w.p.1;

(iii) $\overline{e}_n(\cdot) \Rightarrow \overline{e}(\cdot)$ jointly with the statement in (i).^{13}

Assumption PE states that, after a coordinate change, Assumptions B and D and the assumptions made in Theorem 1 of Stock and Wright (2000) hold. Our assumption is slightly weaker because in PE(i) we do not require that convergence holds on the whole parameter space $\Theta$ but only on $\Theta_{12}$. For the J statistic we now have the following theorem.

Theorem 3.2 Suppose Assumption PE holds. Let $\widehat{\overline{\theta}} = (\widehat{\overline{\theta}}_1, \widehat{\overline{\theta}}_2) := T^{-1}(\widehat{\theta})$ and assume that $\overline{S}$ in (7.23) in the Appendix satisfies the unique minimum^{14} condition in (7.24). Then,

(i) (Asymptotic distribution of parameter estimates) $(\widehat{\overline{\theta}}_1, n^{1/2}(\widehat{\overline{\theta}}_2 - \overline{\theta}_{02})) \to_d \overline{\theta}^* := (\overline{\theta}_1^*, \overline{\theta}_2^*)$, where the nonstandard limit $\overline{\theta}^*$ is defined in (7.25) and (7.26), and

(ii) (Asymptotic distribution of the J statistic) assuming $k > p$, $J_n \to_d J := \overline{S}(\overline{\theta}_1^*, \overline{\theta}_2^*)$.

Part (i) shows that some components of the estimator in the new coordinates, $\widehat{\overline{\theta}}_2$, are root-$n$ consistent for $\overline{\theta}_{02}$ yet are not asymptotically normally distributed, due to the inconsistent estimation of the remaining components $\overline{\theta}_{01}$ by $\widehat{\overline{\theta}}_1$. Under full identification ($T = \mathrm{id}$ and $p_1 = 0$) and assuming that $e_n \to_p \theta_0$ in the two-step GMM case, equation (7.26) shows that $n^{1/2}(\widehat{\theta} - \theta_0) \to_d \theta^*$, which is distributed as $N(0, (M_2' A M_2)^{-1} (M_2' A\, \Omega(\theta_0)\, A M_2)(M_2' A M_2)^{-1})$, where $M_2 := M_2(\theta_0)$ and $A := A(\theta_0)$. Choi and Phillips (1992) and Stock and Wright (2000, Theorem 1(ii)) derive the limit distribution of the parameter estimates in the linear model under partial identification and in the nonlinear model under ID with $T = \mathrm{id}$, respectively.

Footnote 13: By definition, $e_n(\theta) = \widetilde{e}_n$ for two-step GMM and $e_n(\theta) = \theta$ for CU GMM; therefore, for two-step GMM, PE(iii) means $\overline{e}_n \to_d \overline{e}$ for some random variable $\overline{e}$, while for CU, PE(iii) boils down to the trivially satisfied condition $\overline{e}_n(\cdot) = \overline{e}(\cdot) := \mathrm{id}(\cdot)$.
Footnote 14: The unique minimum condition is used in the proof when we apply a lemma in van der Vaart and Wellner (1996), as in Stock and Wright's (2000) proof of their Theorem 1(ii).

Part (ii) corresponds to Corollaries 4(i) and 4(j) in Stock and Wright (2000), where the asymptotic distribution of the J statistic is derived under their Assumption C. Part (ii) shows that in general the J statistic has a nonstandard asymptotic distribution, while under full identification and $A = \Omega(\theta_0)^{-1}$ we obtain the well-known result that $J_n \to_d \chi^2(k-p)$. Therefore, in general, under identification failure the J test does not have correct rejection probability under the null if inference is based on $\chi^2$ critical values. As we show now, subsampling overcomes that problem.

To formally establish power, we have to make the following assumption under the alternative $H_1$.

Assumption MM (misspecified model):

(i) the parameter space $\Theta$ is compact;

(ii) $Eg_i(\cdot) \in C^0(\Theta, \mathbb{R}^k)$ and $\sup_{\theta \in \Theta} \| \widehat{g}(\theta) - Eg_i(\theta) \| \to_p 0$;

(iii) there exists a nonstochastic function $A(\cdot) \in C^0(\Theta, \mathbb{R}^{k \times k})$ such that $\sup_{\theta \in \Theta} \| A_n(\theta) - A(\theta) \| \to_p 0$ and $A(\theta) > 0$ for $\theta \in \Theta$ w.p.1;

(iv) for $e_n(\theta)$ defined in (3.9) we have $e_n(\theta) \to_p e(\theta)$, where $e(\theta)$ is nonstochastic;^{15}

(v) $\widetilde{\theta} := \arg\min_{\theta \in \Theta} \| A(e(\theta))^{1/2}\, Eg_i(\theta) \|$ exists and is unique.

Given the previous theorem, the next statement is a corollary of Theorem 3.1. The test is of $H_0$: $Eg_i(\theta) = 0$ for some $\theta \in \Theta$ versus $H_1$: $Eg_i(\theta) \neq 0$ for all $\theta \in \Theta$.

Corollary 3.1 Suppose $k > p$ and that the sequence $\{z_i\}$ is strictly stationary and strongly mixing. Assume $b/n \to 0$ and $b \to \infty$ as $n \to \infty$. Let $D_n = J_n$ of (3.12) and define the subsampling test by (3.6).

(i) Under $H_0$, assume PE and that $J$ in Theorem 3.2 is continuously distributed. Then the rejection probability of the subsampling test converges to $\alpha$ as $n \to \infty$, both under full identification and under identification failure.

(ii) Under $H_1$ and Assumption MM, the rejection probability converges to 1 as $n \to \infty$.

The corollary shows that the subsampling test of overidentifying restrictions is consistent against model misspecification.
It also shows that the test has asymptotically exact rejection probabilities under the null hypothesis, both under full identification and under identification failure. The test therefore improves on the classical J test^{15}

Footnote 15: In other words, for two-step GMM we assume that the preliminary estimator $\widetilde{e}_n$ converges in probability to an element $e \in \Theta$.

or the tests of overidentifying restrictions suggested in Imbens et al. (1998), which are all size distorted under identification failure.

3.3 Testing Parameter Hypotheses

In this subsection, we derive the asymptotic distribution of the Wald statistic under Assumption ID and conclude that the Wald test is size distorted under identification failure. We then use this asymptotic result to show that the subsampling version of the Wald test has exact (asymptotic) rejection probability under the null.

To derive the asymptotic distribution of the Wald statistic, we need the following additional assumption besides Assumption PE. If they exist, denote by $(\partial g_i/\partial \theta_1')(\theta) \in \mathbb{R}^{k \times p_1}$ and $(\partial g_i/\partial \theta_2')(\theta) \in \mathbb{R}^{k \times p_2}$ the partial derivatives of $g_i$ with respect to the first $p_1$ and last $p_2$ components of $\theta$, respectively, where we use the notation of Assumption ID. Define

$N := \mathrm{diag}(n_{jj}) \in \mathbb{R}^{p \times p}$,   (3.15)

where $n_{jj} = n^{-1/2}$ if $j \leq p_1$ and $n_{jj} = 1$ otherwise, for $j = 1, \ldots, p$.

Assumption WS (Wald statistic): Assume ID and suppose there exists a neighborhood $U_2 \subset \Theta_2$ of $\overline{\theta}_{02}$ such that, for $\Theta_{12} := \Theta_1 \times U_2$,

(i) $\overline{\Phi}_n(\overline{\theta}) := [(n^{-1/2} \sum_{i=1}^n \mathrm{vec}\,(\partial \overline{g}_i/\partial \overline{\theta}_1')(\overline{\theta}))', \overline{\Psi}_n(\overline{\theta})']' \Rightarrow \overline{\Phi}(\overline{\theta})$ holds jointly with PE(iii), where $\overline{\Phi}$ is a $k(p_1+1)$-dimensional Gaussian stochastic process on $\Theta_{12}$ with sample paths that are continuous w.p.1, a certain (possibly nonzero) mean function, and covariance function $E\overline{\Phi}(\overline{\theta})\overline{\Phi}(\overline{\theta}^+)'$ for $\overline{\theta}, \overline{\theta}^+ \in \Theta_{12}$;

(ii) $\sup_{\overline{\theta} = (\overline{\theta}_1, \overline{\theta}_2) \in \Theta_{12}} \| n^{-1} \sum_{i=1}^n (\partial \overline{g}_i/\partial \overline{\theta}_2')(\overline{\theta}) - \overline{M}_2(\overline{\theta}_2) \| \to_p 0$ and $\overline{M}_2(\overline{\theta}_2)$ has maximal column rank for all $\overline{\theta} \in \Theta_{12}$;

(iii) by (i), (ii), and Theorem 3.2, $(n^{-1} \sum_{i=1}^n (\partial g_i/\partial \theta')(\widehat{\theta}))\, N \in \mathbb{R}^{k \times p}$ converges in distribution to a random variable with realizations in $\mathbb{R}^{k \times p}$. Assume the realizations have full column rank a.s.;

(iv) there exists a nonstochastic function $\Lambda: T(\Theta_{12}) \to \mathbb{R}^{k \times k}$ such that

$\sup_{\theta \in T(\Theta_{12})} \| K_n(\theta) - \Lambda(\theta) \| \to_p 0$   (3.16)

for $K_n(\theta)$ defined in (3.14); $\Lambda(\theta)$ has full rank for all $\theta \in T(\Theta_{12})$.

We now discuss Assumption WS. WS(i) generalizes PE(i) by including a portion of the first derivative matrix in the functional central limit theorem (FCLT). Joint CLTs for $g_i$ and (portions of) its derivative matrix have also been assumed by Kleibergen (2001, Assumption 1) and Guggenberger and Smith (2004, Assumption M$_\theta$(vii)). However, instead of an FCLT, these papers only require a joint CLT at $\theta_0$. We require an FCLT because, instead of evaluating our test statistic at a fixed hypothesized parameter vector, our test statistic is evaluated at an estimated parameter vector. As shown in Theorem 3.2, this estimator is in general not consistent. Note that we do not have to subtract off the mean in the FCLT from the derivative component; under weak technical conditions that allow the interchange of differentiation and integration, ID(ii) implies that $n^{-1/2} \sum_{i=1}^n E(\partial \overline{g}_i/\partial \overline{\theta}_1')(\overline{\theta}) \to \overline{M}_1(\overline{\theta})$, where $\overline{M}_1(\overline{\theta}) \in \mathbb{R}^{k \times p_1}$ denotes the derivative of $\overline{m}_1(\overline{\theta})$ with respect to the first $p_1$ coordinates. Then the mean function of $\overline{\Phi}(\overline{\theta})$ equals $[(\mathrm{vec}\, \overline{M}_1(\overline{\theta}))', 0']'$.

Assumptions WS(ii) and (iv) state uniform laws of large numbers. In WS(ii), the series converges to $\overline{M}_2(\overline{\theta}_2)$, which assumes that one can interchange the order of integration and differentiation. We make this assumption to economize on notation, but everything that follows would go through if convergence were instead to a different full rank nonstochastic function, $\overline{G}_2(\overline{\theta}_2)$ say, instead of $\overline{M}_2(\overline{\theta}_2)$. On the other hand, note that in (iv) we do not require that $\Lambda(\theta_0)$ be the long-run covariance matrix $\Omega(\theta_0)$ of $g_i(\theta_0)$. Our theory goes through in the general time series context, even if a simple sample average $K_n(\theta) = n^{-1} \sum_{i=1}^n g_i(\theta) g_i(\theta)'$ is used, as long as $K_n(\theta)$ converges uniformly to a full rank nonstochastic matrix.
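As an illustration of the two choices of $K_n$ just discussed, here is a minimal Python sketch (the names and interface are ours, not the paper's): the i.i.d. sample outer product $n^{-1}\sum_i g_i g_i'$, and a Bartlett-kernel HAC variant in the style of Newey–West for the time series case.

```python
import numpy as np

def k_n(g):
    """Sample second-moment matrix K_n = n^{-1} sum_i g_i g_i', where g
    is the (n, k) array of moment evaluations g_i(theta) -- the natural
    i.i.d. choice mentioned in the text."""
    return g.T @ g / g.shape[0]

def k_n_hac(g, lag):
    """Bartlett-kernel HAC variant for dependent data: adds weighted
    sample autocovariances up to `lag` (an illustrative fixed choice;
    the paper points to Andrews (1991) for data-driven HAC estimation)."""
    n = g.shape[0]
    s = g.T @ g / n
    for j in range(1, lag + 1):
        w = 1.0 - j / (lag + 1.0)          # Bartlett weights
        gamma_j = g[j:].T @ g[:-j] / n     # j-th sample autocovariance
        s += w * (gamma_j + gamma_j.T)
    return s
```

With `lag = 0` the HAC variant reduces to the i.i.d. outer product, consistent with the remark that a simple sample average suffices as long as it converges uniformly to a full rank matrix.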
Example 2.1 (cont.): In the linear model, the upper left $kp_1$-dimensional square submatrix of the covariance function in WS(i) and $\overline{M}_2(\cdot)$ from Assumptions WS(i) and (ii) do not depend on the argument $\overline{\theta}$. This implies an easy sufficient condition for WS(iii), as stated in the next lemma. Furthermore, WS(i)–(ii) hold automatically.

Lemma 3.1 In the linear model of Example 2.1, assume i.i.d. data, $E[(Z_i', X_i')'(u_i, v_i')] = 0$, $E\|(Z_i', X_i')'(Z_i', X_i', u_i, v_i')\|^2 < \infty$, and set $K_n(\theta) := n^{-1} \sum_{i=1}^n g_i(\theta) g_i(\theta)'$. Then, under Assumption ID, it follows that WS(i)–(ii) hold. If, in addition, the upper left $kp_1$-dimensional square submatrix of the covariance function in WS(i) is positive definite, then WS(iii) holds. Finally, WS(iv) holds if $\lim_{n \to \infty} Eg_i(\theta)g_i(\theta)'$ is positive definite for all $\theta \in \Theta$.

Besides mild additional assumptions, the lemma states the main conditions needed for the subsampling approach to work when applied to the Wald test and parameter hypotheses in the linear model; see Corollary 3.2 below. We can now formulate the following theorem, which derives the asymptotic distribution of the Wald statistic under ID.

Theorem 3.3 (Asymptotic distribution of the Wald statistic) Assume the assumptions of Theorem 3.2 and Assumption WS hold. Then, under the null hypothesis $R\theta_0 = q$, we have $W_n \to_d W$, where the limit $W$ is defined in (7.29) in the Appendix.

Theorem 3.3 generalizes an analogous result about the Wald statistic in Staiger and Stock (1997, Theorem 1(c)) from the linear model with only weakly identified parameters to the GMM setup under ID. Phillips (1989) and Choi and Phillips (1992) derive the asymptotic distribution of the Wald statistic that tests hypotheses on the coefficients of either the exogenous or the endogenous regressors in the linear model under partial identification. For example, they show that in the totally unidentified case the Wald statistic converges to a random variable that can be written as a continuous function of random variables that are distributed as noncentral Wishart and multivariate t (Phillips (1989, Theorem 2.8)). Theorem 3.3 shows that the Wald statistic has a nonstandard asymptotic distribution under identification failure. On the other hand, under full identification and assuming that $\Lambda(\theta_0) = \Omega(\theta_0)$, the proof of the theorem contains the well-known result that the Wald statistic is asymptotically distributed as $\chi^2(r)$. A test based on the Wald statistic using critical $\chi^2$ values is therefore likely to be size distorted when identification fails. On the other hand, as we show now, the subsampling test has rejection probabilities under the null that are asymptotically exact even under identification failure.
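A Python sketch of the Wald statistic (3.11), with the sandwich variance $\widehat{B}_n^{-1}\widehat{\Omega}_n\widehat{B}_n^{-1}$ passed in as a single estimated variance matrix `v_hat` (the interface is ours, for illustration only):

```python
import numpy as np

def wald_statistic(theta_hat, v_hat, r_mat, q, n):
    """W_n = n (R theta_hat - q)' [R V R']^{-1} (R theta_hat - q),
    where V stands for the estimated asymptotic variance
    B_n^{-1} Omega_n B_n^{-1} of the GMM estimator."""
    d = r_mat @ theta_hat - q
    return float(n * d @ np.linalg.solve(r_mat @ v_hat @ r_mat.T, d))
```

Under full identification with $\Lambda(\theta_0) = \Omega(\theta_0)$ this statistic would be compared with $\chi^2_{1-\alpha}(r)$; the subsampling test instead compares it with the empirical quantile of subsample Wald statistics, which is what delivers exact asymptotic null rejection probabilities under identification failure.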
What is crucial (and sufficient under very mild additional assumptions) for the subsampling approach to have exact (asymptotic) rejection probabilities under the null is that the test statistics we apply subsampling to converge to an asymptotic distribution independent of the particular assumption in ID; see part (i) of Corollaries 3.2 and 3.1. Given the previous theorem, the following statement is a corollary of Theorem 3.1. The hypothesis under test is $H_0: R\theta_0 = q$ versus the two-sided alternative $H_1: R\theta_0 \neq q$.

Corollary 3.2 Assume PE, WS, and that $W$ in Theorem 3.3 is continuously distributed. Suppose the sequence $\{z_i\}$ is both strictly stationary and strongly mixing. Assume $b/n \to 0$ and $b \to \infty$ as $n \to \infty$. Let $D_n = W_n$ of (3.11) and define the subsampling test by (3.6). Then

(i) Under $H_0$, the rejection probability converges to $\alpha$ as $n \to \infty$, both under full identification and under identification failure.

(ii) Under $H_1$, the rejection probability converges to 1 as $n \to \infty$ under full identification.

(iii) Consider a sequence of contiguous alternatives under full identification. Then the limiting rejection probability of the subsampling test (3.6) is equal to that of the Wald test.

The corollary shows that the subsampling test of parameter hypotheses is consistent against fixed alternatives under full identification and has asymptotically exact rejection probabilities under the null hypothesis, both under full identification and under identification failure. Furthermore, it has the same limiting power against contiguous alternatives under full identification as the original Wald test. As a special case of this last statement, consider again Example 2.1. Assume a parametric distribution for $z_i$ indexed by $\theta$, $\{P_\theta : \theta \in \Theta\}$, that is differentiable in quadratic mean around a particular parameter $\theta_0$ which satisfies $R\theta_0 = q$ (see Appendix B). Denote by $\chi^2_{1-\alpha}(r)$ the $1-\alpha$ quantile of a $\chi^2$ distribution with $r$ degrees of freedom, and let $W$ be a random variable that follows a noncentral $\chi^2(r, \delta)$ distribution for some noncentrality parameter $\delta$. Furthermore, assume the data are generated according to a Pitman drift $\theta_n = \theta_0 + h/\sqrt{n}$ for some $h \in \mathbb{R}^p$. Assuming various regularity conditions given in Newey and West (1987, Theorem 2), $\Lambda(\theta_0) = \Omega(\theta_0) = A(\theta_0)^{-1}$, and $e_n \to_p \theta_0$ in the two-step GMM case, the corresponding limiting power of both the classical Wald and the subsampling test is given by $P\{W > \chi^2_{1-\alpha}(r)\}$, where

$\delta := h' R' [R \{M_2(\theta_0)' \Omega(\theta_0)^{-1} M_2(\theta_0)\}^{-1} R']^{-1} R h.$

4 Choice of the Block Size

An application of the subsampling method requires a choice of the block size $b$. Unfortunately, the asymptotic requirements $b/n \to 0$ and $b \to \infty$ as $n \to \infty$ offer little practical guidance. We propose to select $b$ by a calibration method, an idea dating back to Loh (1987).
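The calibration idea can be sketched in Python as follows (a Monte Carlo sketch under stated assumptions: `simulate` draws data from an approximating null model, `run_test` is the subsampling test at a given block size, and all names are illustrative rather than the authors' algorithm): estimate the actual size for each candidate $b$ and keep the block size whose estimated size is closest to the nominal level $\alpha$.

```python
import numpy as np

def calibrate_block_size(simulate, run_test, n, block_grid,
                         alpha=0.05, n_mc=200, seed=0):
    """Pick the block size b whose Monte Carlo estimate of the actual
    size h(b) = lambda is closest to the nominal level alpha.
    `simulate(rng, n)` draws one data set of length n under the
    (approximating) null model; `run_test(data, b, alpha)` returns
    True when the subsampling test rejects."""
    rng = np.random.default_rng(seed)
    best_b, best_gap = None, np.inf
    for b in block_grid:
        size_hat = np.mean([run_test(simulate(rng, n), b, alpha)
                            for _ in range(n_mc)])
        if abs(size_hat - alpha) < best_gap:
            best_b, best_gap = b, abs(size_hat - alpha)
    return best_b
```

The function `h(b)` introduced below is exactly what `size_hat` estimates for each candidate block size.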
It is our goal to construct a test with nominal size $\alpha$. Generally, however, this can be achieved exactly only as the sample size tends to infinity. The actual size in finite samples, denoted by $\lambda$, typically differs from $\alpha$. The crux of the calibration method is to adjust the block size $b$ in a manner such that the actual size $\lambda$ will hopefully be close to the nominal size $\alpha$. To this end, consider the calibration function $h(b) = \lambda$. This function maps the block size $b$ onto the actual size of the test, given

1 Introduction. Conditional Inference with a Functional Nuisance Parameter. By Isaiah Andrews 1 and Anna Mikusheva 2 Abstract 1 Conditional Inference with a Functional Nuisance Parameter By Isaiah Andrews 1 and Anna Mikusheva 2 Abstract his paper shows that the problem of testing hypotheses in moment condition models without

More information

Comments on Weak instrument robust tests in GMM and the new Keynesian Phillips curve by F. Kleibergen and S. Mavroeidis

Comments on Weak instrument robust tests in GMM and the new Keynesian Phillips curve by F. Kleibergen and S. Mavroeidis Comments on Weak instrument robust tests in GMM and the new Keynesian Phillips curve by F. Kleibergen and S. Mavroeidis Jean-Marie Dufour First version: August 2008 Revised: September 2008 This version:

More information

A Robust Test for Weak Instruments in Stata

A Robust Test for Weak Instruments in Stata A Robust Test for Weak Instruments in Stata José Luis Montiel Olea, Carolin Pflueger, and Su Wang 1 First draft: July 2013 This draft: November 2013 Abstract We introduce and describe a Stata routine ivrobust

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

A New Paradigm: A Joint Test of Structural and Correlation Parameters in Instrumental Variables Regression When Perfect Exogeneity is Violated

A New Paradigm: A Joint Test of Structural and Correlation Parameters in Instrumental Variables Regression When Perfect Exogeneity is Violated A New Paradigm: A Joint Test of Structural and Correlation Parameters in Instrumental Variables Regression When Perfect Exogeneity is Violated By Mehmet Caner and Melinda Sandler Morrill September 22,

More information

Volume 30, Issue 1. Measuring the Intertemporal Elasticity of Substitution for Consumption: Some Evidence from Japan

Volume 30, Issue 1. Measuring the Intertemporal Elasticity of Substitution for Consumption: Some Evidence from Japan Volume 30, Issue 1 Measuring the Intertemporal Elasticity of Substitution for Consumption: Some Evidence from Japan Akihiko Noda Graduate School of Business and Commerce, Keio University Shunsuke Sugiyama

More information

The properties of L p -GMM estimators

The properties of L p -GMM estimators The properties of L p -GMM estimators Robert de Jong and Chirok Han Michigan State University February 2000 Abstract This paper considers Generalized Method of Moment-type estimators for which a criterion

More information

Econ 423 Lecture Notes: Additional Topics in Time Series 1

Econ 423 Lecture Notes: Additional Topics in Time Series 1 Econ 423 Lecture Notes: Additional Topics in Time Series 1 John C. Chao April 25, 2017 1 These notes are based in large part on Chapter 16 of Stock and Watson (2011). They are for instructional purposes

More information

The generalized method of moments

The generalized method of moments Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna February 2008 Based on the book Generalized Method of Moments by Alastair R. Hall (2005), Oxford

More information

VALIDITY OF SUBSAMPLING AND PLUG-IN ASYMPTOTIC INFERENCE FOR PARAMETERS DEFINED BY MOMENT INEQUALITIES

VALIDITY OF SUBSAMPLING AND PLUG-IN ASYMPTOTIC INFERENCE FOR PARAMETERS DEFINED BY MOMENT INEQUALITIES Econometric Theory, 2009, Page 1 of 41. Printed in the United States of America. doi:10.1017/s0266466608090257 VALIDITY OF SUBSAMPLING AND PLUG-IN ASYMPTOTIC INFERENCE FOR PARAMETERS DEFINED BY MOMENT

More information

On Variance Estimation for 2SLS When Instruments Identify Different LATEs

On Variance Estimation for 2SLS When Instruments Identify Different LATEs On Variance Estimation for 2SLS When Instruments Identify Different LATEs Seojeong Lee June 30, 2014 Abstract Under treatment effect heterogeneity, an instrument identifies the instrumentspecific local

More information

Chapter 1. GMM: Basic Concepts

Chapter 1. GMM: Basic Concepts Chapter 1. GMM: Basic Concepts Contents 1 Motivating Examples 1 1.1 Instrumental variable estimator....................... 1 1.2 Estimating parameters in monetary policy rules.............. 2 1.3 Estimating

More information

Inference for Subsets of Parameters in Partially Identified Models

Inference for Subsets of Parameters in Partially Identified Models Inference for Subsets of Parameters in Partially Identified Models Kyoo il Kim University of Minnesota June 2009, updated November 2010 Preliminary and Incomplete Abstract We propose a confidence set for

More information

Testing an Autoregressive Structure in Binary Time Series Models

Testing an Autoregressive Structure in Binary Time Series Models ömmföäflsäafaäsflassflassflas ffffffffffffffffffffffffffffffffffff Discussion Papers Testing an Autoregressive Structure in Binary Time Series Models Henri Nyberg University of Helsinki and HECER Discussion

More information

University of Pavia. M Estimators. Eduardo Rossi

University of Pavia. M Estimators. Eduardo Rossi University of Pavia M Estimators Eduardo Rossi Criterion Function A basic unifying notion is that most econometric estimators are defined as the minimizers of certain functions constructed from the sample

More information

Weak Identification in Maximum Likelihood: A Question of Information

Weak Identification in Maximum Likelihood: A Question of Information Weak Identification in Maximum Likelihood: A Question of Information By Isaiah Andrews and Anna Mikusheva Weak identification commonly refers to the failure of classical asymptotics to provide a good approximation

More information

Discussion of Bootstrap prediction intervals for linear, nonlinear, and nonparametric autoregressions, by Li Pan and Dimitris Politis

Discussion of Bootstrap prediction intervals for linear, nonlinear, and nonparametric autoregressions, by Li Pan and Dimitris Politis Discussion of Bootstrap prediction intervals for linear, nonlinear, and nonparametric autoregressions, by Li Pan and Dimitris Politis Sílvia Gonçalves and Benoit Perron Département de sciences économiques,

More information

Subsets Tests in GMM without assuming identi cation

Subsets Tests in GMM without assuming identi cation Subsets Tests in GMM without assuming identi cation Frank Kleibergen Brown University and University of Amsterdam Sophocles Mavroeidis y Brown University sophocles_mavroeidis@brown.edu Preliminary: please

More information

Testing Endogeneity with Possibly Invalid Instruments and High Dimensional Covariates

Testing Endogeneity with Possibly Invalid Instruments and High Dimensional Covariates Testing Endogeneity with Possibly Invalid Instruments and High Dimensional Covariates Zijian Guo, Hyunseung Kang, T. Tony Cai and Dylan S. Small Abstract The Durbin-Wu-Hausman (DWH) test is a commonly

More information

Slide Set 14 Inference Basded on the GMM Estimator. Econometrics Master in Economics and Finance (MEF) Università degli Studi di Napoli Federico II

Slide Set 14 Inference Basded on the GMM Estimator. Econometrics Master in Economics and Finance (MEF) Università degli Studi di Napoli Federico II Slide Set 14 Inference Basded on the GMM Estimator Pietro Coretto pcoretto@unisa.it Econometrics Master in Economics and Finance (MEF) Università degli Studi di Napoli Federico II Version: Saturday 9 th

More information

Uniformity and the delta method

Uniformity and the delta method Uniformity and the delta method Maximilian Kasy January, 208 Abstract When are asymptotic approximations using the delta-method uniformly valid? We provide sufficient conditions as well as closely related

More information

THE LIMIT OF FINITE-SAMPLE SIZE AND A PROBLEM WITH SUBSAMPLING. Donald W. K. Andrews and Patrik Guggenberger. March 2007

THE LIMIT OF FINITE-SAMPLE SIZE AND A PROBLEM WITH SUBSAMPLING. Donald W. K. Andrews and Patrik Guggenberger. March 2007 THE LIMIT OF FINITE-SAMPLE SIZE AND A PROBLEM WITH SUBSAMPLING By Donald W. K. Andrews and Patrik Guggenberger March 2007 COWLES FOUNDATION DISCUSSION PAPER NO. 1605 COWLES FOUNDATION FOR RESEARCH IN ECONOMICS

More information

Generalized Method of Moments Estimation

Generalized Method of Moments Estimation Generalized Method of Moments Estimation Lars Peter Hansen March 0, 2007 Introduction Generalized methods of moments (GMM) refers to a class of estimators which are constructed from exploiting the sample

More information

A more powerful subvector Anderson Rubin test in linear instrumental variable regression

A more powerful subvector Anderson Rubin test in linear instrumental variable regression A more powerful subvector Anderson Rubin test in linear instrumental variable regression Patrik Guggenberger Department of Economics Pennsylvania State University Frank Kleibergen Department of Quantitative

More information

Multivariate Time Series: VAR(p) Processes and Models

Multivariate Time Series: VAR(p) Processes and Models Multivariate Time Series: VAR(p) Processes and Models A VAR(p) model, for p > 0 is X t = φ 0 + Φ 1 X t 1 + + Φ p X t p + A t, where X t, φ 0, and X t i are k-vectors, Φ 1,..., Φ p are k k matrices, with

More information

Closest Moment Estimation under General Conditions

Closest Moment Estimation under General Conditions Closest Moment Estimation under General Conditions Chirok Han Victoria University of Wellington New Zealand Robert de Jong Ohio State University U.S.A October, 2003 Abstract This paper considers Closest

More information

Finite Sample Performance of A Minimum Distance Estimator Under Weak Instruments

Finite Sample Performance of A Minimum Distance Estimator Under Weak Instruments Finite Sample Performance of A Minimum Distance Estimator Under Weak Instruments Tak Wai Chau February 20, 2014 Abstract This paper investigates the nite sample performance of a minimum distance estimator

More information

A better way to bootstrap pairs

A better way to bootstrap pairs A better way to bootstrap pairs Emmanuel Flachaire GREQAM - Université de la Méditerranée CORE - Université Catholique de Louvain April 999 Abstract In this paper we are interested in heteroskedastic regression

More information

GMM-based inference in the AR(1) panel data model for parameter values where local identi cation fails

GMM-based inference in the AR(1) panel data model for parameter values where local identi cation fails GMM-based inference in the AR() panel data model for parameter values where local identi cation fails Edith Madsen entre for Applied Microeconometrics (AM) Department of Economics, University of openhagen,

More information

Weak Instruments in IV Regression: Theory and Practice

Weak Instruments in IV Regression: Theory and Practice Weak Instruments in IV Regression: Theory and Practice Isaiah Andrews, James Stock, and Liyang Sun November 20, 2018 Abstract When instruments are weakly correlated with endogenous regressors, conventional

More information

The Bootstrap: Theory and Applications. Biing-Shen Kuo National Chengchi University

The Bootstrap: Theory and Applications. Biing-Shen Kuo National Chengchi University The Bootstrap: Theory and Applications Biing-Shen Kuo National Chengchi University Motivation: Poor Asymptotic Approximation Most of statistical inference relies on asymptotic theory. Motivation: Poor

More information

This chapter reviews properties of regression estimators and test statistics based on

This chapter reviews properties of regression estimators and test statistics based on Chapter 12 COINTEGRATING AND SPURIOUS REGRESSIONS This chapter reviews properties of regression estimators and test statistics based on the estimators when the regressors and regressant are difference

More information

Darmstadt Discussion Papers in Economics

Darmstadt Discussion Papers in Economics Darmstadt Discussion Papers in Economics The Effect of Linear Time Trends on Cointegration Testing in Single Equations Uwe Hassler Nr. 111 Arbeitspapiere des Instituts für Volkswirtschaftslehre Technische

More information

GENERIC RESULTS FOR ESTABLISHING THE ASYMPTOTIC SIZE OF CONFIDENCE SETS AND TESTS. Donald W.K. Andrews, Xu Cheng and Patrik Guggenberger.

GENERIC RESULTS FOR ESTABLISHING THE ASYMPTOTIC SIZE OF CONFIDENCE SETS AND TESTS. Donald W.K. Andrews, Xu Cheng and Patrik Guggenberger. GENERIC RESULTS FOR ESTABLISHING THE ASYMPTOTIC SIZE OF CONFIDENCE SETS AND TESTS By Donald W.K. Andrews, Xu Cheng and Patrik Guggenberger August 2011 COWLES FOUNDATION DISCUSSION PAPER NO. 1813 COWLES

More information

ESTIMATION AND INFERENCE WITH WEAK, SEMI-STRONG, AND STRONG IDENTIFICATION. Donald W. K. Andrews and Xu Cheng. October 2010 Revised July 2011

ESTIMATION AND INFERENCE WITH WEAK, SEMI-STRONG, AND STRONG IDENTIFICATION. Donald W. K. Andrews and Xu Cheng. October 2010 Revised July 2011 ESTIMATION AND INFERENCE WITH WEAK, SEMI-STRONG, AND STRONG IDENTIFICATION By Donald W. K. Andrews and Xu Cheng October 1 Revised July 11 COWLES FOUNDATION DISCUSSION PAPER NO. 1773R COWLES FOUNDATION

More information

1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation

1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation 1 Outline. 1. Motivation 2. SUR model 3. Simultaneous equations 4. Estimation 2 Motivation. In this chapter, we will study simultaneous systems of econometric equations. Systems of simultaneous equations

More information

Robust Con dence Intervals in Nonlinear Regression under Weak Identi cation

Robust Con dence Intervals in Nonlinear Regression under Weak Identi cation Robust Con dence Intervals in Nonlinear Regression under Weak Identi cation Xu Cheng y Department of Economics Yale University First Draft: August, 27 This Version: December 28 Abstract In this paper,

More information

Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator

Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator by Emmanuel Flachaire Eurequa, University Paris I Panthéon-Sorbonne December 2001 Abstract Recent results of Cribari-Neto and Zarkos

More information

A Goodness-of-fit Test for Copulas

A Goodness-of-fit Test for Copulas A Goodness-of-fit Test for Copulas Artem Prokhorov August 2008 Abstract A new goodness-of-fit test for copulas is proposed. It is based on restrictions on certain elements of the information matrix and

More information

What s New in Econometrics. Lecture 15

What s New in Econometrics. Lecture 15 What s New in Econometrics Lecture 15 Generalized Method of Moments and Empirical Likelihood Guido Imbens NBER Summer Institute, 2007 Outline 1. Introduction 2. Generalized Method of Moments Estimation

More information

1 Estimation of Persistent Dynamic Panel Data. Motivation

1 Estimation of Persistent Dynamic Panel Data. Motivation 1 Estimation of Persistent Dynamic Panel Data. Motivation Consider the following Dynamic Panel Data (DPD) model y it = y it 1 ρ + x it β + µ i + v it (1.1) with i = {1, 2,..., N} denoting the individual

More information

Asymptotic Distributions of Instrumental Variables Statistics with Many Instruments

Asymptotic Distributions of Instrumental Variables Statistics with Many Instruments CHAPTER 6 Asymptotic Distributions of Instrumental Variables Statistics with Many Instruments James H. Stock and Motohiro Yogo ABSTRACT This paper extends Staiger and Stock s (1997) weak instrument asymptotic

More information

Applications of Subsampling, Hybrid, and Size-Correction Methods

Applications of Subsampling, Hybrid, and Size-Correction Methods Applications of Subsampling, Hybrid, and Size-Correction Methods Donald W. K. Andrews Cowles Foundation for Research in Economics Yale University Patrik Guggenberger Department of Economics UCLA November

More information

SIMILAR-ON-THE-BOUNDARY TESTS FOR MOMENT INEQUALITIES EXIST, BUT HAVE POOR POWER. Donald W. K. Andrews. August 2011

SIMILAR-ON-THE-BOUNDARY TESTS FOR MOMENT INEQUALITIES EXIST, BUT HAVE POOR POWER. Donald W. K. Andrews. August 2011 SIMILAR-ON-THE-BOUNDARY TESTS FOR MOMENT INEQUALITIES EXIST, BUT HAVE POOR POWER By Donald W. K. Andrews August 2011 COWLES FOUNDATION DISCUSSION PAPER NO. 1815 COWLES FOUNDATION FOR RESEARCH IN ECONOMICS

More information

GMM and SMM. 1. Hansen, L Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, 50, p

GMM and SMM. 1. Hansen, L Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, 50, p GMM and SMM Some useful references: 1. Hansen, L. 1982. Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, 50, p. 1029-54. 2. Lee, B.S. and B. Ingram. 1991 Simulation estimation

More information

Necessary and sufficient conditions for nonlinear parametric function identification

Necessary and sufficient conditions for nonlinear parametric function identification Necessary and sufficient conditions for nonlinear parametric function identification Jean-Marie Dufour 1 Xin Liang 2 1 McGill University 2 McGill University CFE-ERCIM 2013 London December 16, 2013 1 /45

More information

INFERENCE ON SETS IN FINANCE

INFERENCE ON SETS IN FINANCE INFERENCE ON SETS IN FINANCE VICTOR CHERNOZHUKOV EMRE KOCATULUM KONRAD MENZEL Abstract. In this paper we introduce various set inference problems as they appear in finance and propose practical and powerful

More information

Econometrics II - EXAM Answer each question in separate sheets in three hours

Econometrics II - EXAM Answer each question in separate sheets in three hours Econometrics II - EXAM Answer each question in separate sheets in three hours. Let u and u be jointly Gaussian and independent of z in all the equations. a Investigate the identification of the following

More information

Missing dependent variables in panel data models

Missing dependent variables in panel data models Missing dependent variables in panel data models Jason Abrevaya Abstract This paper considers estimation of a fixed-effects model in which the dependent variable may be missing. For cross-sectional units

More information

Exogeneity tests and weak identification

Exogeneity tests and weak identification Cireq, Cirano, Départ. Sc. Economiques Université de Montréal Jean-Marie Dufour Cireq, Cirano, William Dow Professor of Economics Department of Economics Mcgill University June 20, 2008 Main Contributions

More information

Estimation of Dynamic Regression Models

Estimation of Dynamic Regression Models University of Pavia 2007 Estimation of Dynamic Regression Models Eduardo Rossi University of Pavia Factorization of the density DGP: D t (x t χ t 1, d t ; Ψ) x t represent all the variables in the economy.

More information

University of California San Diego and Stanford University and

University of California San Diego and Stanford University and First International Workshop on Functional and Operatorial Statistics. Toulouse, June 19-21, 2008 K-sample Subsampling Dimitris N. olitis andjoseph.romano University of California San Diego and Stanford

More information

DSGE Methods. Estimation of DSGE models: GMM and Indirect Inference. Willi Mutschler, M.Sc.

DSGE Methods. Estimation of DSGE models: GMM and Indirect Inference. Willi Mutschler, M.Sc. DSGE Methods Estimation of DSGE models: GMM and Indirect Inference Willi Mutschler, M.Sc. Institute of Econometrics and Economic Statistics University of Münster willi.mutschler@wiwi.uni-muenster.de Summer

More information

Appendix A: The time series behavior of employment growth

Appendix A: The time series behavior of employment growth Unpublished appendices from The Relationship between Firm Size and Firm Growth in the U.S. Manufacturing Sector Bronwyn H. Hall Journal of Industrial Economics 35 (June 987): 583-606. Appendix A: The time

More information

Asymptotics for Nonlinear GMM

Asymptotics for Nonlinear GMM Asymptotics for Nonlinear GMM Eric Zivot February 13, 2013 Asymptotic Properties of Nonlinear GMM Under standard regularity conditions (to be discussed later), it can be shown that where ˆθ(Ŵ) θ 0 ³ˆθ(Ŵ)

More information

IV Quantile Regression for Group-level Treatments, with an Application to the Distributional Effects of Trade

IV Quantile Regression for Group-level Treatments, with an Application to the Distributional Effects of Trade IV Quantile Regression for Group-level Treatments, with an Application to the Distributional Effects of Trade Denis Chetverikov Brad Larsen Christopher Palmer UCLA, Stanford and NBER, UC Berkeley September

More information

Long-Run Covariability

Long-Run Covariability Long-Run Covariability Ulrich K. Müller and Mark W. Watson Princeton University October 2016 Motivation Study the long-run covariability/relationship between economic variables great ratios, long-run Phillips

More information

Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and the Bootstrap

Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and the Bootstrap University of Zurich Department of Economics Working Paper Series ISSN 1664-7041 (print) ISSN 1664-705X (online) Working Paper No. 254 Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and

More information