An Averaging GMM Estimator Robust to Misspecification


An Averaging GMM Estimator Robust to Misspecification

Xu Cheng, Zhipeng Liao, Ruoyao Shi

This Version: January 2018

Abstract: This paper studies the averaging GMM estimator that combines a conservative GMM estimator based on valid moment conditions and an aggressive GMM estimator based on both valid and possibly misspecified moment conditions, where the weight is the sample analog of an infeasible optimal weight. We establish asymptotic theory on uniform approximation of the upper and lower bounds of the finite-sample truncated risk difference between any two estimators, which is used to compare the averaging GMM estimator and the conservative GMM estimator. Under some sufficient conditions, we show that the asymptotic lower bound of the truncated risk difference between the averaging estimator and the conservative estimator is strictly less than zero, while the asymptotic upper bound is zero uniformly over any degree of misspecification. Extending seminal results on the James-Stein estimator, this uniform dominance is established in non-Gaussian semiparametric nonlinear models. The simulation results support our theoretical findings.

Keywords: Asymptotic Risk, Finite-Sample Risk, Generalized Shrinkage Estimator, GMM, Misspecification, Model Averaging, Non-Standard Estimator, Uniform Approximation

We thank Benedikt Pötscher for the insightful comments that greatly improved the quality of the paper. The paper also benefited from discussions with Donald Andrews, Denis Chetverikov, Patrik Guggenberger, Jinyong Hahn, Jia Li, Rosa Matzkin, Hyungsik Roger Moon, Ulrich Mueller, Joris Pinkse, Frank Schorfheide, Shuyang Sheng, and participants at various seminars and conferences.

Author affiliations: Department of Economics, University of Pennsylvania, 3718 Locust Walk, Philadelphia, PA 19104, USA, xucheng@upenn.edu; Department of Economics, UCLA, 8379 Bunche Hall, Mail Stop 147703, Los Angeles, CA, zhipeng.liao@econ.ucla.edu; Department of Economics, UC Riverside, Sproul Hall, Riverside, CA, ruoyao.shi@ucr.edu.

1 Introduction

We are interested in estimating some finite dimensional parameter $\theta_F \in \Theta \subset R^{d_\theta}$ which is uniquely identified by the moment restrictions
$$E_F[g_1(W, \theta_F)] = 0_{r_1} \quad (1.1)$$
for some known vector function $g_1(\cdot,\cdot): \mathcal{W}\times\Theta \to R^{r_1}$, where $\Theta$ is a compact subset of $R^{d_\theta}$, $W$ is a random vector with support $\mathcal{W}$ and joint distribution $F$, and $E_F[\cdot]$ denotes the expectation operator under $F$. Suppose we have i.i.d. data $\{W_i\}_{i=1}^n$, where $W_i$ has distribution $F$ for any $i = 1, \ldots, n$. Let $\bar g_1(\theta) \equiv n^{-1}\sum_{i=1}^n g_1(W_i, \theta)$. One efficient GMM estimator for $\theta_F$ is
$$\hat\theta_1 = \arg\min_{\theta\in\Theta}\ \bar g_1(\theta)'\,(\tilde\Omega_1)^{-1}\,\bar g_1(\theta), \quad (1.2)$$
where $\tilde\Omega_1 = n^{-1}\sum_{i=1}^n g_1(W_i,\tilde\theta_1)g_1(W_i,\tilde\theta_1)' - \bar g_1(\tilde\theta_1)\bar g_1(\tilde\theta_1)'$ is the efficient weighting matrix based on some preliminary consistent estimator $\tilde\theta_1$. In a linear instrumental variable (IV) example, $Y = X'\theta_F + U$, where the IV $Z_1 \in R^{r_1}$ satisfies $E_F[Z_1 U] = 0_{r_1}$. The moments in (1.1) hold with $g_1(W,\theta) = Z_1(Y - X'\theta)$, and $\theta_F$ is uniquely identified if $E_F[Z_1 X']$ has full column rank. Under certain regularity conditions, it is well known that $\hat\theta_1$ is consistent and achieves the lowest asymptotic variance among GMM estimators based on the moments in (1.1); see Hansen (1982).

If one has additional moments
$$E_F[g^*(W, \theta_F)] = 0_{r^*} \quad (1.3)$$
for some known function $g^*(\cdot,\cdot): \mathcal{W}\times\Theta \to R^{r^*}$, imposing them together with (1.1) can further reduce the asymptotic variance of the GMM estimator. However, if these additional moments are misspecified in the sense that $E_F[g^*(W, \theta_F)] \neq 0_{r^*}$, imposing (1.3) may result in inconsistent estimation.

The choice of moment conditions is routinely faced by empirical researchers. Take the linear IV model for example. One typically starts with a large number of candidate IVs but only has confidence that a small number of them are valid, denoted by $Z_1$. The rest of them, denoted by $Z^*$, are valid only under certain economic hypotheses that are yet to be tested. In this example, $g^*(W,\theta) = Z^*(Y - X'\theta)$. In contrast to the conservative estimator $\hat\theta_1$, an aggressive estimator $\hat\theta_2$ always imposes (1.3) regardless of its validity. Let $g_2(W_i,\theta) \equiv (g_1(W_i,\theta)', g^*(W_i,\theta)')'$ for any $i = 1,\ldots,n$, and $\bar g_2(\theta) \equiv n^{-1}\sum_{i=1}^n g_2(W_i,\theta)$.

Footnote 1: The main theory of the paper can be easily extended to time series models with dependent data, as long as the preliminary results in Lemma A.1 hold.
Footnote 2: For example, $\tilde\theta_1$ could be the GMM estimator similar to $\hat\theta_1$ but with an identity weighting matrix; see (A.5) in the Appendix.
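For concreteness, the following sketch implements the two-step efficient GMM estimator (1.2) for the linear IV example, where the minimizer has a closed form. It is a minimal illustration in Python/NumPy under the assumption that `y`, `X`, `Z` are data arrays; it is not the authors' code.

```python
import numpy as np

def efficient_gmm_linear_iv(y, X, Z):
    """Two-step efficient GMM for the moments g(W, theta) = Z(Y - X'theta).

    Step 1 uses the identity weighting matrix to obtain a preliminary
    estimate; step 2 uses the inverse of the centered variance estimate
    of the moments, as in (1.2). A sketch, assuming y is (n,), X is
    (n, d) and Z is (n, r) with r >= d."""
    n = len(y)
    ZX, Zy = Z.T @ X / n, Z.T @ y / n
    # Step 1: preliminary estimator with identity weighting matrix.
    theta_tilde = np.linalg.solve(ZX.T @ ZX, ZX.T @ Zy)
    # Centered variance estimate of the moments at theta_tilde.
    g = Z * (y - X @ theta_tilde)[:, None]          # n x r moment contributions
    Omega = g.T @ g / n - np.outer(g.mean(0), g.mean(0))
    W = np.linalg.inv(Omega)
    # Step 2: efficient GMM estimator (closed form in the linear model).
    theta_hat = np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)
    return theta_hat, Omega
```

The aggressive estimator $\hat\theta_2$ in (1.4) below can be obtained from the same function with $Z$ replaced by the stacked instruments $(Z_1', Z^{*\prime})'$.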

Figure 1. Finite Sample (n = 500) MSEs of the Pre-test and the Averaging GMM Estimators. Note: "Pre-test(0.1)" refers to the pre-test GMM estimator based on the J-test statistic $n\bar g_2(\hat\theta_2)'(\tilde\Omega_2)^{-1}\bar g_2(\hat\theta_2)$ with nominal size 0.1. "Emp-opt" refers to the averaging GMM estimator defined with the weight in equation (4.7) of the paper. In this simulation, we set $\delta_F = n^{-1/2}\,r\,c$, where $c$ is an $r^* \times 1$ real vector and $r$ is a scalar. At each $r$, we consider 127 different values of $c$ and report the largest finite sample MSEs of the estimators. Details of the simulation design for Figure 1 are provided in Section 7.

The aggressive estimator $\hat\theta_2$ takes the form
$$\hat\theta_2 = \arg\min_{\theta\in\Theta}\ \bar g_2(\theta)'\,(\tilde\Omega_2)^{-1}\,\bar g_2(\theta), \quad (1.4)$$
where $\tilde\Omega_2$ is constructed in the same way as $\tilde\Omega_1$ except that $g_1(W_i,\theta)$ is replaced by $g_2(W_i,\theta)$.

Because imposing $E_F[g^*(W,\theta_F)] = 0_{r^*}$ is a double-edged sword, a data-dependent decision usually is made to choose between $\hat\theta_1$ and $\hat\theta_2$. To study such a decision and the subsequent estimator, let
$$\delta_F \equiv E_F[g^*(W, \theta_F)] \in R^{r^*}. \quad (1.5)$$
The pre-testing approach tests the null hypothesis $H_0: \delta_F = 0_{r^*}$ and constructs an estimator
$$\hat\theta_{pre} = 1\{T_n > c_\alpha\}\hat\theta_1 + 1\{T_n \le c_\alpha\}\hat\theta_2 \quad (1.6)$$
for some test statistic $T_n$ with critical value $c_\alpha$ at the significance level $\alpha$. One popular test is the J-test (see Hansen, 1982), for which $c_\alpha$ is the $1-\alpha$ quantile of the chi-squared distribution with $r_2 - d_\theta$ degrees of freedom, where $r_2 = r_1 + r^*$.

Footnote 3: See the first line of equations (C.3) in the Supplemental Appendix for the definition of $\tilde\Omega_2$. In particular, $\tilde\Omega_2$ is constructed using $\tilde\theta_1$, the preliminary consistent estimator based on the valid moment conditions in (1.1) only.
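As an illustration, the pre-test estimator (1.6) with the J-test takes only a few lines. This is a sketch assuming NumPy/SciPy and that the moment quantities have already been computed at $\hat\theta_2$; it is not the authors' implementation.

```python
import numpy as np
from scipy.stats import chi2

def pretest_gmm(theta1, theta2, gbar2, Omega2_tilde, n, d_theta, alpha=0.1):
    """Pre-test GMM estimator (1.6) based on the J-test.

    gbar2 is the sample average of the stacked moments at theta2 and
    Omega2_tilde the weighting-matrix estimate; the J statistic is
    compared with the 1 - alpha chi-squared quantile with r2 - d_theta
    degrees of freedom."""
    r2 = len(gbar2)
    J = n * gbar2 @ np.linalg.solve(Omega2_tilde, gbar2)
    c_alpha = chi2.ppf(1 - alpha, df=r2 - d_theta)
    return theta1 if J > c_alpha else theta2
```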

Because the power of this test against any fixed alternative is 1, $\hat\theta_{pre}$ equals $\hat\theta_1$ with probability 1 asymptotically ($n \to \infty$) for any fixed misspecified model where $\delta_F \neq 0_{r^*}$. Thus, it seems that $\hat\theta_{pre}$ is immune to moment misspecification. However, in practice we care about the finite-sample mean squared error (MSE) of $\hat\theta_{pre}$, and this standard pointwise asymptotic analysis ($\delta_F$ is fixed and $n \to \infty$) provides a poor approximation to the former. To compare $\hat\theta_{pre}$ and $\hat\theta_1$, the dashed line in Figure 1 plots the maximum finite-sample ($n = 500$) MSE of $\hat\theta_{pre}$, while the MSE of $\hat\theta_1$ is normalized to be 1. For some values of $\delta_F$, the MSE of $\hat\theta_{pre}$ is larger than that of $\hat\theta_1$, sometimes by 50%. Note that the pre-test estimator exhibits multiple peaks because the simulation design allows for multiple potentially misspecified moments and the horizontal axis only shows the main component $r$, which determines the norm of the vector that measures the degree of misspecification.

The goal of this paper is twofold. First, we propose a data-dependent averaging of $\hat\theta_1$ and $\hat\theta_2$ that takes the form
$$\hat\theta_{eo} = (1 - \tilde\omega_{eo})\hat\theta_1 + \tilde\omega_{eo}\hat\theta_2, \quad (1.7)$$
where $\tilde\omega_{eo} \in [0, 1]$ is a data-dependent weight specified in (4.7) below. The subscript in $\tilde\omega_{eo}$ is short for "empirical optimal" because this weight is an empirical analog of an infeasible optimal weight $\omega^*_F$ defined in (4.3) below. We plot the finite-sample MSE of this averaging estimator as the solid line in Figure 1. This averaging estimator is robust to misspecification in the sense that the solid line is below 1 for all values of $\delta_F$, in contrast to the bump in the dashed line that represents the pre-test estimator. Second, we develop a uniform asymptotic theory to justify the finite-sample robustness of this averaging estimator. We show that this averaging estimator dominates the conservative estimator uniformly over a large class of models with different degrees of misspecification. The uniform dominance is established under the truncated weighted loss function defined in (3.11) below. Truncation at a large number is necessary for the asymptotic analysis. The standard asymptotic theory is pointwise and fails to reveal the fragile nature of the pre-test estimator; a stronger uniform notion of robustness is crucial for this model. Furthermore, we quantify the upper and lower bounds of the asymptotic risk differences between the averaging estimator and the conservative estimator.

Footnote 4: The poor approximation of the pointwise asymptotics to the finite sample properties of the pre-test estimator has been noted in Shibata (1986), Pötscher (1991), Kabaila (1995, 1998) and Leeb and Pötscher (2005, 2008), among others.
Footnote 5: Truncation at a large number is needed for the asymptotic analysis of the risk of a general estimator without imposing stringent conditions such as uniform integrability.
Footnote 6: The lower and upper bounds of the asymptotic risk difference are defined in (3.12) below.

The rest of the paper is organized as follows. Section 2 discusses the literature related to our paper. Section 3 defines the parameter space over which the uniform result is established and defines uniform dominance. Section 4 introduces the averaging weight. Section 5 provides an analytical representation of the bounds of the asymptotic risk differences and applies it to show that the averaging GMM estimator uniformly dominates the conservative estimator. Different from the global uniform dominance in Section 5, Section 6 studies local uniform dominance over a shrinking parameter space. Section 7 investigates the finite sample performance of our averaging estimator using Monte Carlo simulations. Section 8 concludes. Proofs of the main results of the paper are given in the Appendix. Analysis of the pre-test estimator, extra simulation studies and proofs of some auxiliary results are included in the Supplemental Appendix of the paper.

Notation. For any real matrix $A$, we use $\|A\|$ to denote the Frobenius norm of $A$. If $A$ is a real square matrix, we use $\mathrm{tr}(A)$ to denote the trace of $A$. If $A$ is a real symmetric matrix, $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote the smallest and largest eigenvalues of $A$, respectively. For any positive integers $d_1$ and $d_2$, $I_{d_1}$ and $0_{d_1\times d_2}$ stand for the $d_1 \times d_1$ identity matrix and the $d_1 \times d_2$ zero matrix, respectively. Let $\mathrm{vec}(\cdot)$ denote the vectorization of a matrix and $\mathrm{vech}(\cdot)$ the half vectorization of a symmetric matrix. Let $R = (-\infty, +\infty)$, $R_+ = [0, +\infty)$, $R_\infty = R \cup \{\pm\infty\}$ and $R_{+,\infty} = R_+ \cup \{+\infty\}$. For any positive integer $d$ and any set $S$, $S^d$ denotes the Cartesian product of $d$ copies of $S$: $S_1 \times \cdots \times S_d$ with $S_j = S$ for $j = 1, \ldots, d$. For any set $S$, $\mathrm{int}(S)$ denotes the interior of $S$. We use $\{n\}$ to denote the set of natural numbers and $\{p_n\} = \{p_n : n \ge 1\}$ to denote a subsequence of $\{n\}$. For any (possibly random) positive sequences $\{a_n\}_{n=1}^\infty$ and $\{b_n\}_{n=1}^\infty$, $a_n = O_p(b_n)$ means that $\lim_{c\to\infty}\limsup_{n\to\infty}\Pr(a_n/b_n > c) = 0$; $a_n = o_p(b_n)$ means that for all $\varepsilon > 0$, $\lim_{n\to\infty}\Pr(a_n/b_n > \varepsilon) = 0$. Let "$\to_p$" and "$\to_D$" stand for convergence in probability and convergence in distribution, respectively. The notation $a \equiv b$ means that $a$ is defined as $b$.

2 Related Literature

In this section, we discuss the related literature. Our uniform dominance result is related to Stein's phenomenon (Stein, 1956) in parametric models. The James-Stein (JS) estimator (James and Stein, 1961) is shown to dominate the maximum likelihood estimator in exact normal sampling. Hansen (2016) considers local asymptotic analysis of JS type averaging estimators in general parametric models and substantially extends their application in econometrics. Many other frequentist model averaging estimators are studied in the literature, including Buckland, Burnham, and Augustin (1997), Hjort and Claeskens (2003, 2006), Hansen (2007, 2015, 2017), Claeskens and Carroll (2007), Hansen and Racine (2012), Cheng and Hansen (2014), Lu and Su (2015), and DiTraglia (2016), to name only a few.

Our paper is closely related to DiTraglia (2016) and Hansen (2017). Hansen (2017) proposes an averaging estimator that combines the ordinary least squares (OLS) estimator and the two-stage least squares (2SLS) estimator in linear IV models. Under some sufficient conditions, he shows that the averaging estimator has smaller asymptotic risk than the OLS estimator under any given sequence of $n^{-1/2}$ locally misspecified data generating processes (DGPs). DiTraglia (2016) also studies the averaging GMM estimator under given sequences of $n^{-1/2}$ locally misspecified DGPs, including the averaging of OLS and IV estimators. His averaging weight is based on the focused moment selection criterion with a targeted parameter. The simulation results in that paper show that this averaging estimator does not uniformly dominate the conservative estimator.

Compared to these results, our paper makes the following contributions. First, we provide a uniform asymptotic framework, which is different from pointwise asymptotic analysis for fixed models or local asymptotic analysis along some drifting models (e.g., the given $n^{-1/2}$ locally misspecified DGPs in DiTraglia (2016) and Hansen (2017)). For this purpose, this paper defines a large set of drifting models and shows that a uniform result on the risk of this shrinkage estimator requires the study of all of them. In particular, this large set of models not only includes the crucial $n^{-1/2}$ locally misspecified models considered by DiTraglia (2016) and Hansen (2017), but also many more severely misspecified models. This uniform analysis is similar to those in Andrews and Guggenberger (2010) and Andrews, Cheng, and Guggenberger (2011) on uniform size control for inference in non-standard problems, but the present paper is about estimation rather than inference and focuses on a misspecification issue that is not studied in those papers. Second, the present paper extends the study of Stein's phenomenon to non-Gaussian semiparametric nonlinear GMM models. The weight in our averaging estimator is close to those studied in Hjort and Claeskens (2003) and Liu (2015) in the sense that we first consider the optimal weight and then obtain its empirical analog.

The uniform dominance property of the averaging estimator does not contradict the unbounded risk property of the post-model-selection estimator found in Yang (2005) and Leeb and Pötscher (2008). Measured by the MSE, the post-model-selection estimator based on a consistent model selection procedure usually does better than the unrestricted estimator in part of the parameter space and worse than the latter in other parts of the parameter space. One standard example is the Hodges estimator, whose scaled maximal MSE diverges to infinity with the growth of the sample size (see, e.g., Lehmann and Casella, 1998). Similar unbounded risk results for other post-model-selection estimators are established in Yang (2005) and Leeb and Pötscher (2008). The post-model-selection estimator has unbounded (scaled) maximal MSE because it is based on a non-smooth transition rule between the restricted and unrestricted estimators and a consistent model selection procedure is employed in the transition rule.

However, the averaging estimator proposed in this paper is based on a smooth combination of the restricted and unrestricted estimators, and no model selection procedure is used in the smooth combination. Hence our averaging estimator is essentially different from the post-model-selection estimator.

There is a large literature studying the validity of GMM moment conditions. Many methods can be applied to detect validity, including over-identification tests (see, e.g., Sargan, 1958; Hansen, 1982; and Eichenbaum, Hansen and Singleton, 1988), information criteria (see, e.g., Andrews, 1999; Andrews and Lu, 2001; Hong, Preston and Shum, 2003), and penalized estimation methods (see, e.g., Liao, 2013 and Cheng and Liao, 2014). Recently, misspecified moments and their consequences have been considered by Ashley (2009), Berkowitz, Caner, and Fang (2012), Conley, Hansen, and Rossi (2012), Doko Tchatoka and Dufour (2008, 2014), Guggenberger (2012), Nevo and Rosen (2012), Kolesár, Chetty, Friedman, Glaeser, and Imbens (2014), Small (2007), and Small, Cai, Zhang, and Kang (2015), among others. Moon and Schorfheide (2009) explore over-identifying moment inequalities to reduce the MSE. This paper contributes to this literature by providing new uniform results for potentially misspecified semiparametric models.

There is also a large literature studying adaptive estimation in the nonparametric regression model using model averaging; see Yang (2001, 2003, 2004), Leung and Barron (2006), and the references therein. Since the unknown function can be written as a linear combination of (possibly infinitely but countably many) basis functions, the nonparametric model may be well approximated by parametric regression models in finite samples. These papers show that averaging estimators which combine OLS estimators from different parametric models with data-dependent weights may achieve the optimal convergence rate up to some logarithmic factor (or the oracle inequalities). Our paper is different from these papers since the parameter of interest is a finite dimensional real vector, not an unknown function, and the bias and variance trade-off is due to the possibly misspecified moment conditions. Moreover, there is a benchmark estimator in our paper, i.e., the conservative GMM estimator, whose asymptotic properties are well known, and many model selection estimators proposed in the literature do not uniformly improve upon this benchmark estimator in terms of risk over different degrees of moment misspecification. Our goal is to propose an averaging estimator with smaller risk than the conservative estimator uniformly.

Footnote 7: The post-model-selection estimator based on a conservative model selection procedure (e.g., a hypothesis test with fixed critical value or the Akaike information criterion) may not have unbounded (scaled) maximal MSE. However, its asymptotic maximal MSE is not guaranteed to be less than or equal to that of the benchmark estimator (e.g., the conservative GMM estimator in the framework of this paper). The pre-test estimator in our simulation studies is a good example, since it is based on the J-test with nominal size 0.1.

3 Parameter Space and Uniform Dominance

Let $g_{2,j}(w,\theta)$ ($j = 1,\ldots,r_2$) denote the $j$-th component function of $g_2(w,\theta)$. We assume that $g_{2,j}(w,\theta)$ for $j = 1,\ldots,r_2$ is twice continuously differentiable with respect to $\theta$ for any $w \in \mathcal{W}$. The first and second order derivatives of $g_2(w,\theta)$ with respect to $\theta$ are denoted by
$$g_{2,\theta}(w,\theta) \equiv \begin{pmatrix} \partial g_{2,1}(w,\theta)/\partial\theta' \\ \vdots \\ \partial g_{2,r_2}(w,\theta)/\partial\theta' \end{pmatrix} \quad \text{and} \quad g_{2,\theta\theta}(w,\theta) \equiv \begin{pmatrix} \partial^2 g_{2,1}(w,\theta)/\partial\theta\partial\theta' \\ \vdots \\ \partial^2 g_{2,r_2}(w,\theta)/\partial\theta\partial\theta' \end{pmatrix}, \quad (3.1)$$
respectively. Let $\mathcal{F}$ be a set of distribution functions of $W$. For $k = 1$ and $2$, define the expectation of the moment functions, the Jacobian matrix and the variance-covariance matrix as
$$M_{k,F} \equiv E_F[g_k(W,\theta_F)], \quad G_{k,F} \equiv E_F[g_{k,\theta}(W,\theta_F)], \quad \text{and} \quad \Omega_{k,F} \equiv E_F[g_k(W,\theta_F)g_k(W,\theta_F)'] - M_{k,F}M_{k,F}' \quad (3.2)$$
for any $F \in \mathcal{F}$, respectively. The moments above exist by Assumption 3.2 below. Let
$$Q_F(\theta) \equiv E_F[g_2(W,\theta)]'\,\Omega_{2,F}^{-1}\,E_F[g_2(W,\theta)] \quad (3.3)$$
for any $\theta \in \Theta$, which denotes the population criterion of the GMM estimation in (1.4). For any $\theta_0 \in \Theta$, define $B^c_\varepsilon(\theta_0) = \{\theta \in \Theta : \|\theta - \theta_0\| \ge \varepsilon\}$. We consider the risk difference between two estimators uniformly over $F \in \mathcal{F}$ satisfying the assumptions below.

Assumption 3.1 The following conditions hold:
(i) for any $F \in \mathcal{F}$, $E_F[g_1(W,\theta_F)] = 0_{r_1}$ for some $\theta_F \in \mathrm{int}(\Theta)$;
(ii) for any $\varepsilon > 0$, $\inf_{F\in\mathcal{F}}\inf_{\theta\in B^c_\varepsilon(\theta_F)}\|E_F[g_1(W,\theta)]\| > 0$;
(iii) for any $F \in \mathcal{F}$, there is $\theta^*_F \in \mathrm{int}(\Theta)$ such that $\inf_{F\in\mathcal{F}}\inf_{\theta\in B^c_\varepsilon(\theta^*_F)}[Q_F(\theta) - Q_F(\theta^*_F)] > 0$ for any $\varepsilon > 0$;
(iv) $\inf_{\{F\in\mathcal{F}:\ \|\delta_F\| > \bar\delta\}} \|G_{2,F}'\Omega_{2,F}^{-1}\delta_{2,F}\|/\|\delta_{2,F}\| > 0$, where $\delta_{2,F} = (0_{r_1}', \delta_F')'$ and $\bar\delta > 0$ is a fixed constant;
(v) $0_{r^*} \in \mathrm{int}(\Delta)$, where $\Delta = \{\delta_F : F \in \mathcal{F}\}$.

Footnote 8: By definition, $g_{1,\theta}(w,\theta)$ and $g_{1,\theta\theta}(w,\theta)$ are the leading $r_1 \times d_\theta$ and $(r_1 d_\theta) \times d_\theta$ submatrices of $g_{2,\theta}(w,\theta)$ and $g_{2,\theta\theta}(w,\theta)$, respectively.

Assumptions 3.1(i)-(ii) require that the true unknown parameter $\theta_F$ is uniquely identified by the moment conditions $E_F[g_1(W,\theta_F)] = 0_{r_1}$ for any DGP $F \in \mathcal{F}$. Assumption 3.1(iii) implies that for any $F \in \mathcal{F}$, a pseudo-true value $\theta^*_F$ is identified as the unique minimizer of the population GMM criterion $Q_F(\theta)$ under possible misspecification. Assumption 3.1(iv) requires that $\delta_{2,F}$ is not orthogonal to $\Omega_{2,F}^{-1}G_{2,F}$, which rules out the special case that $\theta_F$ may be consistently estimable even with severely misspecified moment conditions. Assumption 3.1(v) implies that the set of distribution functions $\mathcal{F}$ is rich in the sense that it includes distributions under which the extra moment conditions are correctly specified. Uniform dominance is of interest only if we allow for different degrees of misspecification in the parameter space. If we only allowed for correctly specified models or severely misspecified models, the desired dominance results would hold trivially following a pointwise analysis. Assumption 3.1(v) ensures that the extra moment conditions can be correctly specified, locally misspecified or severely misspecified.

Assumption 3.2 The following conditions hold:
(i) for $j = 1,\ldots,r_2$, $g_{2,j}(w,\theta)$ is twice continuously differentiable with respect to $\theta$ for any $w \in \mathcal{W}$;
(ii) $\sup_{F\in\mathcal{F}} E_F\big[\sup_{\theta\in\Theta}\big(\|g_2(W,\theta)\|^{2+\gamma} + \|g_{2,\theta}(W,\theta)\|^{2+\gamma} + \|g_{2,\theta\theta}(W,\theta)\|^{2+\gamma}\big)\big] < \infty$ for some $\gamma > 0$;
(iii) $\inf_{F\in\mathcal{F}}\lambda_{\min}(\Omega_{2,F}) > 0$;
(iv) $\inf_{F\in\mathcal{F}}\lambda_{\min}(G_{1,F}'G_{1,F}) > 0$.

Assumption 3.2(i) requires that the moment functions are smooth. Assumption 3.2(ii) imposes $2+\gamma$ finite moment conditions on the GMM moment functions and their first and second derivatives. Assumptions 3.2(iii) and 3.2(iv) are important sufficient conditions for the local identification of the unknown parameter in GMM with valid moment conditions.

The next assumption is on the nuisance parameters of the DGP $F \in \mathcal{F}$. Write
$$v_F = (\mathrm{vec}(G_{2,F})', \mathrm{vech}(\Omega_{2,F})', \delta_F')' \quad (3.4)$$
for any $F \in \mathcal{F}$. It is clear that $v_F$ includes the Jacobian matrix, the variance-covariance matrix, and the measure of misspecification of the moment conditions $E_F[g^*(W,\theta_F)] = 0_{r^*}$. Let
$$v_F^0 = (\mathrm{vec}(G_{2,F})', \mathrm{vech}(\Omega_{2,F})')' \quad (3.5)$$
for any $F \in \mathcal{F}$.

Assumption 3.3 The following conditions hold:
(i) for any $F \in \mathcal{F}$ with $\delta_F = 0_{r^*}$, there exists a constant $\varepsilon_F > 0$ such that for any $\tilde\delta \in R^{r^*}$ with $\|\tilde\delta\| < \varepsilon_F$, there is $\tilde F \in \mathcal{F}$ with $\delta_{\tilde F} = \tilde\delta$ and $\|v_{\tilde F}^0 - v_F^0\| \le C\|\tilde\delta\|^\kappa$ for some $\kappa > 0$;

(ii) the set $\{v_F : F \in \mathcal{F}\}$ is closed.

Assumption 3.3(i) requires that for any $F \in \mathcal{F}$ such that $E_F[g_2(W,\theta_F)] = 0_{r_2}$ is valid, there are many DGPs $\tilde F \in \mathcal{F}$ that are close to $F$. Here the closeness of any two DGPs $F$ and $\tilde F$ is measured by the distance between $v_F^0$ and $v_{\tilde F}^0$. Assumptions 3.3(i) and (ii) are useful to derive the exact expression of the asymptotic risk of the GMM estimator.

Example 3.1 (Linear IV Model) We study a linear IV model and provide a set of low-level conditions that imply Assumptions 3.1, 3.2 and 3.3. The parameters of interest are the coefficients of the endogenous regressors $X$ in
$$Y = X'\theta + U, \quad (3.6)$$
with some valid IVs $Z_1 \in R^{r_1}$ and some potentially misspecified IVs $Z^* \in R^{r^*}$ such that
$$E_{F_0}[U] = 0, \quad E_{F_0}[Z_1 U] = 0_{r_1}, \quad \text{and} \quad (3.7)$$
$$Z^* = \delta U + V, \quad \text{with } E_{F_0}[V] = 0_{r^*} \text{ and } E_{F_0}[VU] = 0_{r^*}, \quad (3.8)$$
where $F_0$ denotes the joint distribution of $(X', Z_1', V', U)$. In the reduced-form equation (3.8), $\delta$ is an $r^* \times 1$ real vector which characterizes the degree of misspecification. Let $\mathcal{F}_0$ denote a class of distributions containing $F_0$, and let $\Theta$ and $\Delta$ denote the parameter spaces of $\theta$ and $\delta$, respectively. The joint distribution of $W = (Y, Z_1', Z^{*\prime}, X')'$ is denoted by $F$, which is determined by $\theta$, $\delta$ and $F_0$ through the linear equations in (3.6) and (3.8). For ease of discussion, we further assume that the random vector $(X', Z_1', V', U)'$ follows the normal distribution with mean $\mu$ and variance-covariance matrix $\Sigma$. Under the normality assumption, each distribution $F_0$ corresponds to a pair of $\mu$ and $\Sigma$. For notational simplicity, in Lemma 3.1 below, for any finite dimensional random vectors $a_1$ and $a_2$, let $\mu_{a_j} = E_{F_0}[a_j]$ for $j = 1, 2$, $\Sigma_{a_1 a_2} = E_{F_0}[a_1 a_2']$, and $\Omega_{a_1 a_2} = E_{F_0}[a_1 a_2'] - \mu_{a_1}\mu_{a_2}'$.

Lemma 3.1 Let $\mathcal{F}_0$ denote the set of normal distributions which satisfy:
(i) $\mu_u = 0$, $\Sigma_{z_1 u} = 0_{r_1}$ and $\Sigma_{vu} = 0_{r^*}$;
(ii) $\inf_{F_0\in\mathcal{F}_0}\lambda_{\min}(\Sigma_{x z_1}\Sigma_{z_1 z_1}^{-1}\Sigma_{z_1 x}) > 0$, $\sup_{F_0\in\mathcal{F}_0}\|\mu\|^2 < \infty$ and $0 < \inf_{F_0\in\mathcal{F}_0}\lambda_{\min}(\Sigma) \le \sup_{F_0\in\mathcal{F}_0}\lambda_{\max}(\Sigma) < \infty$;
(iii) $\inf_{F_0\in\mathcal{F}_0}\inf_{\{\delta:\ \|\delta\|\ge\varepsilon_0\}}\|\delta\|^{-1}\,\|(\Sigma_{x z_1}\Sigma_{z_1 z_1}^{-1}\Sigma_{z_1 v} - \Sigma_{x v})\delta - \Sigma_{x u}\| > 0$ for some $\varepsilon_0 > 0$ that is small enough;

Footnote 9: In Section B of the Supplemental Appendix, we provide sufficient low-level conditions for Assumptions 3.1, 3.2 and 3.3 in the linear IV model without the normality assumption.
Footnote 10: The constant $\varepsilon_0$ depends on the infimum and supremum in Condition (ii) and is given in (A.3) in the Appendix.

(iv) $\theta \in \mathrm{int}(\Theta)$ and $\Theta$ is compact and large enough such that the pseudo-true value $\theta^*_F \in \mathrm{int}(\Theta)$;
(v) $\Delta = [c_{1,\delta}, C_{1,\delta}] \times \cdots \times [c_{r^*,\delta}, C_{r^*,\delta}]$, where $\{c_{j,\delta}, C_{j,\delta}\}_{j=1}^{r^*}$ is a set of finite constants with $c_{j,\delta} < 0 < C_{j,\delta}$ for $j = 1,\ldots,r^*$.
Then Assumptions 3.1, 3.2 and 3.3 hold.

Condition (i) lists the moment conditions in (3.7) and (3.8). The inequality in Condition (ii) rules out DGPs under which $\lambda_{\min}(\Sigma_{x z_1}\Sigma_{z_1 z_1}^{-1}\Sigma_{z_1 x})$ may be close to zero and (part of) the unknown parameter $\theta$ is weakly identified. Condition (ii) also requires that the mean of the random vector $(X', Z_1', V', U)'$ is uniformly bounded and that the eigenvalues of its variance-covariance matrix are uniformly finite and bounded away from 0. Condition (iii) requires that the projection residual of the vector $\Sigma_{xu}$ on the subspace spanned by the matrix $\Sigma_{x z_1}\Sigma_{z_1 z_1}^{-1}\Sigma_{z_1 v} - \Sigma_{x v}$ is bounded away from zero. It is a sufficient condition for Assumption 3.1(iv), which ensures that the aggressive estimator is inconsistent under severe misspecification. Condition (iv) is needed to derive the limit of the aggressive estimator under misspecification. The compactness assumption on $\Theta$ is not needed for the linear IV model; however, it is useful to verify Assumptions 3.1, 3.2 and 3.3, which do not assume any special structure on the model. Condition (v) specifies that the parameter space of $\delta$ is a product space. Lemma 3.1 provides simple conditions on $\Theta$, $\Delta$ and $\mathcal{F}_0$ on which the uniformity results are subsequently established.

Now we return to the general setup. For a generic estimator $\hat\theta$ of $\theta_F$, consider a weighted quadratic loss function
$$\ell(\hat\theta, \theta) = n(\hat\theta - \theta)'\Upsilon(\hat\theta - \theta), \quad (3.9)$$
where $\Upsilon$ is a $d_\theta \times d_\theta$ pre-determined positive semi-definite matrix. For example, if $\Upsilon = I_{d_\theta}$, then $E_F[\ell(\hat\theta, \theta_F)]$ is the MSE of $\hat\theta$. If $\Upsilon = (\Sigma_{1,F} - \Sigma_{2,F})^{-1}$, where $\Sigma_{k,F}$ ($k = 1, 2$) is defined in (4.4), the weighting matrix rescales $\hat\theta$ by the scale of the variance reduction due to the additional moments. If $\Upsilon = E_F[X_i X_i']$ for regressors $X_i$, then $E_F[\ell(\hat\theta, \theta_F)]$ is the MSE of $X_i'\hat\theta$, an estimator of $X_i'\theta_F$. Below we compare the averaging estimator $\hat\theta_{eo}$ and the conservative estimator $\hat\theta_1$.

Footnote 11: Specific restrictions on $\Theta$ which ensure that $\theta^*_F \in \mathrm{int}(\Theta)$ are given in (B.8) and Assumption B.1(vi) in the Supplemental Appendix.
Footnote 12: Similar results are established in Section B of the Supplemental Appendix for the linear IV model when the normality assumption on $(X', Z_1', V', U)'$ is relaxed. Section B of the Supplemental Appendix also provides the proof of Lemma 3.1 with and without the normality assumption.

We are interested in the bounds of the truncated finite sample risk difference
$$\underline{RD}_n(\hat\theta_{eo}, \hat\theta_1; \zeta) \equiv \inf_{F\in\mathcal{F}} E_F[\ell_\zeta(\hat\theta_{eo}, \theta_F) - \ell_\zeta(\hat\theta_1, \theta_F)] \quad \text{and} \quad \overline{RD}_n(\hat\theta_{eo}, \hat\theta_1; \zeta) \equiv \sup_{F\in\mathcal{F}} E_F[\ell_\zeta(\hat\theta_{eo}, \theta_F) - \ell_\zeta(\hat\theta_1, \theta_F)], \quad (3.10)$$
where
$$\ell_\zeta(\hat\theta, \theta_F) \equiv \min\{\ell(\hat\theta, \theta_F), \zeta\} \quad (3.11)$$
denotes the truncated loss function with an arbitrarily large trimming parameter $\zeta$. The truncated loss function is employed to facilitate the asymptotic analysis of the bounds of the risk difference. The finite-sample bounds in (3.10) are approximated by
$$\underline{\mathrm{AsyRD}}(\hat\theta_{eo}, \hat\theta_1) \equiv \liminf_{\zeta\to\infty}\liminf_{n\to\infty}\underline{RD}_n(\hat\theta_{eo}, \hat\theta_1; \zeta) \quad \text{and} \quad \overline{\mathrm{AsyRD}}(\hat\theta_{eo}, \hat\theta_1) \equiv \limsup_{\zeta\to\infty}\limsup_{n\to\infty}\overline{RD}_n(\hat\theta_{eo}, \hat\theta_1; \zeta), \quad (3.12)$$
which are called the lower and upper bounds of the asymptotic risk difference, respectively, in this paper. The averaging estimator $\hat\theta_{eo}$ asymptotically uniformly dominates the conservative estimator $\hat\theta_1$ if
$$\underline{\mathrm{AsyRD}}(\hat\theta_{eo}, \hat\theta_1) < 0 \quad \text{and} \quad \overline{\mathrm{AsyRD}}(\hat\theta_{eo}, \hat\theta_1) \le 0. \quad (3.13)$$

The bounds of the asymptotic risk difference build uniformity over $F \in \mathcal{F}$ into the definition by taking $\inf_{F\in\mathcal{F}}$ and $\sup_{F\in\mathcal{F}}$ before $\liminf_{n\to\infty}$ and $\limsup_{n\to\infty}$, respectively. Uniformity is crucial for the asymptotic results to give a good approximation to their finite-sample counterparts. These uniform bounds are different from pointwise results, which are obtained under a fixed DGP. The sequence of DGPs $\{F_n\}$ along which the supremum or the infimum is approached often varies with the sample size. Therefore, to determine the bounds of the asymptotic risk difference, one has to derive the asymptotic distributions of these estimators under various sequences $\{F_n\}$. Under $\{F_n\}$, the observations $\{W_{n,i}\}_{i=1}^n$ form a triangular array. For notational simplicity, $W_{n,i}$ is abbreviated to $W_i$ throughout the paper. To study the bounds of the asymptotic risk difference, we consider sequences of DGPs $\{F_n\}$ such that $\delta_{F_n}$ satisfies
$$\text{(i) } n^{1/2}\delta_{F_n} \to d \in R^{r^*} \quad \text{or} \quad \text{(ii) } \|n^{1/2}\delta_{F_n}\| \to \infty. \quad (3.14)$$
Case (i) models mild misspecification, where $\delta_{F_n}$ is a $n^{-1/2}$-local deviation from $0_{r^*}$. Case (ii) includes severe misspecification, where $\|\delta_{F_n}\|$ is bounded away from 0, as well as the intermediate case in which $\delta_{F_n} \to 0$ and $\|n^{1/2}\delta_{F_n}\| \to \infty$. To obtain a uniform approximation, all of these sequences are necessary. Once we study the bounds of the asymptotic risk difference along each of these sequences, we show that we can glue them together to obtain the bounds of the asymptotic risk difference.

Footnote 13: In the rest of the paper, we use $\{F_n\}$ to denote $\{F_n \in \mathcal{F} : n = 1, 2, \ldots\}$.
Footnote 14: Since $F_n \in \mathcal{F}$, by Assumption 3.2(ii), the sequence $\delta_{F_n}$ in (3.14) satisfies $\|\delta_{F_n}\| \le C$ for any $n$.
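The truncated risk difference in (3.10)-(3.11) is straightforward to approximate by Monte Carlo for a fixed DGP. A minimal sketch follows, in which `draw_estimators` is a hypothetical helper that simulates one sample from $F$ and returns the pair $(\hat\theta_{eo}, \hat\theta_1)$; the bounds in (3.10) would then require searching over DGPs.

```python
import numpy as np

def truncated_loss(theta_hat, theta0, Upsilon, n, zeta):
    """Truncated weighted quadratic loss from (3.9) and (3.11)."""
    dev = theta_hat - theta0
    return min(n * dev @ Upsilon @ dev, zeta)

def mc_risk_difference(draw_estimators, theta0, Upsilon, n, zeta, reps=10000):
    """Monte Carlo estimate of E_F[l_zeta(theta_eo) - l_zeta(theta_1)]
    under one DGP F. A sketch, not the authors' code."""
    diffs = np.empty(reps)
    for b in range(reps):
        theta_eo, theta_1 = draw_estimators()
        diffs[b] = (truncated_loss(theta_eo, theta0, Upsilon, n, zeta)
                    - truncated_loss(theta_1, theta0, Upsilon, n, zeta))
    return diffs.mean()
```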

4 Averaging Weight

We start by deriving the joint asymptotic distribution of $\hat\theta_1$ and $\hat\theta_2$ under different degrees of misspecification. We consider sequences of DGPs $\{F_n\}$ in $\mathcal{F}$ such that (i) $n^{1/2}\delta_{F_n} \to d \in R^{r^*}$ or $\|n^{1/2}\delta_{F_n}\| \to \infty$; and (ii) $G_{2,F_n}$, $\Omega_{2,F_n}$ and $M_{2,F_n}$ converge to $G_{2,F}$, $\Omega_{2,F}$ and $M_{2,F}$ for some $F \in \mathcal{F}$. For $k = 1, 2$ and any $F \in \mathcal{F}$, define
$$\Gamma_{k,F} \equiv (G_{k,F}'\Omega_{k,F}^{-1}G_{k,F})^{-1}G_{k,F}'\Omega_{k,F}^{-1}. \quad (4.1)$$
Let $Z_{2,F}$ denote a zero mean normal random vector with variance-covariance matrix $\Omega_{2,F}$, and let $Z_{1,F}$ denote its first $r_1$ components.

Lemma 4.1 Suppose Assumptions 3.1 and 3.2 hold. Consider any sequence of DGPs $\{F_n\}$ such that $v_{F_n} \to v_F$ for some $F \in \mathcal{F}$, and $n^{1/2}\delta_{F_n} \to d$ for $d \in R^{r^*}_\infty$.
(a) If $d \in R^{r^*}$, then
$$\begin{pmatrix} n^{1/2}(\hat\theta_1 - \theta_{F_n}) \\ n^{1/2}(\hat\theta_2 - \theta_{F_n}) \end{pmatrix} \to_D \begin{pmatrix} \xi_{1,F} \\ \xi_{2,F} \end{pmatrix} \equiv \begin{pmatrix} -\Gamma_{1,F}Z_{1,F} \\ -\Gamma_{2,F}(Z_{2,F} + d^*) \end{pmatrix},$$
where $d^* = (0_{r_1}', d')'$.
(b) If $\|d\| = \infty$, then $n^{1/2}(\hat\theta_1 - \theta_{F_n}) \to_D \xi_{1,F}$ and $\|n^{1/2}(\hat\theta_2 - \theta_{F_n})\| \to_p \infty$.

Given the joint asymptotic distribution of $\hat\theta_1$ and $\hat\theta_2$, it is straightforward to study $\bar\theta(\omega) = (1-\omega)\hat\theta_1 + \omega\hat\theta_2$ if $\omega$ is deterministic. Following Lemma 4.1(a),
$$n^{1/2}(\bar\theta(\omega) - \theta_{F_n}) \to_D \xi_F(\omega) \equiv (1-\omega)\xi_{1,F} + \omega\xi_{2,F} \quad (4.2)$$
for $n^{1/2}\delta_{F_n} \to d$, where $d \in R^{r^*}$.

Footnote 15: The requirement on the convergence of $G_{2,F_n}$, $\Omega_{2,F_n}$ and $M_{2,F_n}$ is not restrictive. Lemma A.7 in Appendix A.1 shows that the sequences $G_{2,F_n}$, $\Omega_{2,F_n}$ and $M_{2,F_n}$ have subsequences that converge to $G_{2,F}$, $\Omega_{2,F}$ and $M_{2,F}$, respectively, for some $F \in \mathcal{F}$. The general result on the lower and upper bounds of the asymptotic risk difference, Lemma A.14 in Appendix A.2, only requires consideration of subsequences $\{F_{p_n}\}$ such that $G_{2,F_{p_n}}$, $\Omega_{2,F_{p_n}}$ and $M_{2,F_{p_n}}$ are convergent, where $\{p_n\}$ is a subsequence of $\{n\}$. The asymptotic properties of the GMM estimators established in this section under the full sequence of DGPs $\{F_n\}$ hold trivially for its subsequences.

In Section C of the Supplemental Appendix, a simple calculation shows that the asymptotic risk of $\bar\theta(\omega)$ is minimized at the infeasible optimal weight
$$\omega^*_F \equiv \frac{\mathrm{tr}(\Upsilon(\Sigma_{1,F} - \Sigma_{2,F}))}{d^{*\prime}(\Gamma_{2,F} - \tilde\Gamma_{1,F})'\Upsilon(\Gamma_{2,F} - \tilde\Gamma_{1,F})d^* + \mathrm{tr}(\Upsilon(\Sigma_{1,F} - \Sigma_{2,F}))}, \quad (4.3)$$
where $\Upsilon$ is the matrix specified in the loss function,
$$\Sigma_{k,F} \equiv (G_{k,F}'\Omega_{k,F}^{-1}G_{k,F})^{-1} \text{ for } k = 1, 2, \quad \text{and} \quad \tilde\Gamma_{1,F} \equiv [\Gamma_{1,F},\ 0_{d_\theta\times r^*}]. \quad (4.4)$$
To gain some intuition, consider the case where $\Upsilon = I_{d_\theta}$. In this case, the infeasible optimal weight $\omega^*_F$, at which the MSE of $\bar\theta(\omega)$ is minimized, yields the ideal bias and variance trade-off. However, the bias depends on $d$, which cannot be consistently estimated. Hence, $\omega^*_F$ cannot be consistently estimated. Our solution to this problem follows the popular approach in the literature, which replaces $d$ by an estimator whose asymptotic distribution is centered at $d$; see Liu (2015) and Charkhi, Claeskens, and Hansen (2016) for similar estimators in least squares and maximum likelihood estimation problems, respectively. Moreover, we show that the resulting averaging estimator reduces the MSE for any value of $d$.

The empirical analog of $\omega^*_F$ is constructed as follows. First, for $k = 1$ and $2$, replace $\Sigma_{k,F}$ by its consistent estimator $\hat\Sigma_k \equiv (\hat G_k'\hat\Omega_k^{-1}\hat G_k)^{-1}$, where
$$\hat G_k \equiv n^{-1}\sum_{i=1}^n g_{k,\theta}(W_i, \hat\theta_1) \quad \text{and} \quad \hat\Omega_k \equiv n^{-1}\sum_{i=1}^n g_k(W_i, \hat\theta_1)g_k(W_i, \hat\theta_1)' - \bar g_k(\hat\theta_1)\bar g_k(\hat\theta_1)'. \quad (4.5)$$
Note that $\hat G_k$ and $\hat\Omega_k$ are based on the conservative GMM estimator $\hat\theta_1$. Hence they are consistent regardless of the degree of misspecification of the moment conditions in (1.3). Second, replace $(\Gamma_{2,F} - \tilde\Gamma_{1,F})d^*$ by its asymptotically unbiased estimator $n^{1/2}(\hat\theta_2 - \hat\theta_1)$, because
$$n^{1/2}(\hat\theta_2 - \hat\theta_1) \to_D -(\Gamma_{2,F} - \tilde\Gamma_{1,F})(Z_{2,F} + d^*) \quad (4.6)$$
for $d^* = (0_{r_1}', d')'$ and $d \in R^{r^*}$, following Lemma 4.1(a). Then the empirical optimal weight takes the form
$$\tilde\omega_{eo} \equiv \frac{\mathrm{tr}(\Upsilon(\hat\Sigma_1 - \hat\Sigma_2))}{n(\hat\theta_2 - \hat\theta_1)'\Upsilon(\hat\theta_2 - \hat\theta_1) + \mathrm{tr}(\Upsilon(\hat\Sigma_1 - \hat\Sigma_2))}, \quad (4.7)$$
and the averaging GMM estimator takes the form
$$\hat\theta_{eo} = (1 - \tilde\omega_{eo})\hat\theta_1 + \tilde\omega_{eo}\hat\theta_2. \quad (4.8)$$

Footnote 16: The consistency of $\hat\Sigma_k$ is proved in the proof of Lemma 4.2.
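To make the origin of (4.3) transparent, the following display sketches the calculation from Section C of the Supplemental Appendix, using the identities in Lemma A.9 below, under which $E[\xi_{1,F}'\Upsilon\xi_{1,F}] = \mathrm{tr}(\Upsilon\Sigma_{1,F})$, $E[\xi_{1,F}'\Upsilon\xi_{2,F}] = \mathrm{tr}(\Upsilon\Sigma_{2,F})$ and $E[\xi_{2,F}'\Upsilon\xi_{2,F}] = \mathrm{tr}(\Upsilon\Sigma_{2,F}) + d^{*\prime}\Gamma_{2,F}'\Upsilon\Gamma_{2,F}d^*$.

```latex
% Asymptotic risk of \bar\theta(\omega) as a quadratic function of \omega:
\begin{align*}
R(\omega) &\equiv E\big[\xi_F(\omega)'\,\Upsilon\,\xi_F(\omega)\big] \\
 &= \mathrm{tr}(\Upsilon\Sigma_{1,F})
    - 2\omega\,\mathrm{tr}\big(\Upsilon(\Sigma_{1,F}-\Sigma_{2,F})\big)
    + \omega^2\big[\mathrm{tr}\big(\Upsilon(\Sigma_{1,F}-\Sigma_{2,F})\big)
    + d^{*\prime}\Gamma_{2,F}'\Upsilon\Gamma_{2,F}\,d^*\big].
\end{align*}
% Setting dR(\omega)/d\omega = 0 yields the infeasible optimal weight
% \omega^*_F in (4.3); since \tilde\Gamma_{1,F}d^* = 0 by Lemma A.9(a),
% the quadratic form in d^* written with \Gamma_{2,F} coincides with the
% one written with \Gamma_{2,F}-\tilde\Gamma_{1,F}.
```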

Next we consider the asymptotic distribution of $\hat\theta_{eo}$ under different degrees of misspecification.

Lemma 4.2 Suppose that Assumptions 3.1, 3.2 and 3.3 hold. Consider $\{F_n\}$ such that $v_{F_n} \to v_F$ for some $F \in \mathcal{F}$, and $n^{1/2}\delta_{F_n} \to d$ for $d \in R^{r^*}_\infty$.
(a) If $d \in R^{r^*}$, then
$$\tilde\omega_{eo} \to_D \bar\omega_F \equiv \frac{\mathrm{tr}(\Upsilon(\Sigma_{1,F} - \Sigma_{2,F}))}{(Z_{2,F}+d^*)'(\Gamma_{2,F}-\tilde\Gamma_{1,F})'\Upsilon(\Gamma_{2,F}-\tilde\Gamma_{1,F})(Z_{2,F}+d^*) + \mathrm{tr}(\Upsilon(\Sigma_{1,F} - \Sigma_{2,F}))}$$
and $n^{1/2}(\hat\theta_{eo} - \theta_{F_n}) \to_D \bar\xi_F \equiv (1 - \bar\omega_F)\xi_{1,F} + \bar\omega_F\xi_{2,F}$.
(b) If $\|d\| = \infty$, then $\tilde\omega_{eo} \to_p 0$ and $n^{1/2}(\hat\theta_{eo} - \theta_{F_n}) \to_D \xi_{1,F}$.

To study the bounds of the asymptotic risk difference between $\hat\theta_{eo}$ and $\hat\theta_1$, it is important to take into account the data-dependent nature of $\tilde\omega_{eo}$. Unlike for $\hat\theta_1$ and $\hat\theta_2$, the randomness in $\tilde\omega_{eo}$ is non-negligible in the mild misspecification case (a) of Lemma 4.2. In consequence, $\hat\theta_{eo}$ does not achieve the same bounds of the asymptotic risk difference as the ideal averaging estimator $(1-\omega^*_F)\hat\theta_1 + \omega^*_F\hat\theta_2$ does. Nevertheless, below we show that $\hat\theta_{eo}$ is insured against potentially misspecified moments because it uniformly dominates $\hat\theta_1$.
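Given the two GMM estimates and the plug-in variance estimates from (4.5), the weight (4.7) and the estimator (4.8) take one line each. A minimal sketch in Python:

```python
import numpy as np

def averaging_gmm(theta1, theta2, Sigma1_hat, Sigma2_hat, Upsilon, n):
    """Empirical optimal weight (4.7) and averaging estimator (4.8).

    Sigma1_hat and Sigma2_hat are computed at the conservative
    estimator theta1, so they remain consistent under any degree of
    misspecification of the additional moments."""
    num = np.trace(Upsilon @ (Sigma1_hat - Sigma2_hat))
    diff = theta2 - theta1
    w_eo = num / (n * diff @ Upsilon @ diff + num)
    theta_eo = (1.0 - w_eo) * theta1 + w_eo * theta2
    return theta_eo, w_eo
```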

5 Bounds of the Asymptotic Risk Difference under Misspecification

In this section, we study the bounds of the asymptotic risk difference defined in (3.12). Note that the asymptotic distributions of $\hat\theta_1$ and $\hat\theta_{eo}$ in Lemmas 4.1 and 4.2 only depend on $d$, $G_{2,F}$ and $\Omega_{2,F}$. For notational convenience, define
$$h_{F,d} = (d', \mathrm{vec}(G_{2,F})', \mathrm{vech}(\Omega_{2,F})')' \quad (5.1)$$
for any $F \in \mathcal{F}$ and any $d \in R^{r^*}_\infty$. For the mild misspecification case, define the parameter space of $h_{F,d}$ as
$$H = \{h_{F,d} : d \in R^{r^*} \text{ and } F \in \mathcal{F} \text{ with } \delta_F = 0_{r^*}\}, \quad (5.2)$$
where $\delta_F$ is defined in (1.5) for a given $F$.

Theorem 5.1 Suppose that Assumptions 3.1, 3.2 and 3.3 hold. The bounds of the asymptotic risk difference satisfy
$$\overline{\mathrm{AsyRD}}(\hat\theta_{eo}, \hat\theta_1) = \max\Big\{\sup_{h\in H}[g(h)],\ 0\Big\} \quad \text{and} \quad \underline{\mathrm{AsyRD}}(\hat\theta_{eo}, \hat\theta_1) = \min\Big\{\inf_{h\in H}[g(h)],\ 0\Big\},$$
where $g(h) \equiv E[\bar\xi_F'\Upsilon\bar\xi_F - \xi_{1,F}'\Upsilon\xi_{1,F}]$ and the expectation is taken under the joint normal distribution of $Z_{2,F}$.

The upper (or lower) bound of the asymptotic risk difference is determined by the maximum of $\sup_{h\in H}[g(h)]$ and zero (or the minimum of $\inf_{h\in H}[g(h)]$ and zero), where $\sup_{h\in H}[g(h)]$ (or $\inf_{h\in H}[g(h)]$) is related to the mildly misspecified DGPs and the zero component is associated with the severely misspecified DGPs. Since the GMM averaging estimator has the same asymptotic distribution as the conservative GMM estimator $\hat\theta_1$ under the severely misspecified DGPs, their asymptotic risk difference is zero in that case.

To show that $\hat\theta_{eo}$ uniformly dominates $\hat\theta_1$, Theorem 5.1 implies that it is sufficient to show that $\inf_{h\in H}[g(h)] < 0$ and $\sup_{h\in H}[g(h)] \le 0$. We can investigate $\inf_{h\in H} g(h)$ and $\sup_{h\in H} g(h)$ by simulating $g(h)$. In practice, we replace $G_{2,F}$ and $\Omega_{2,F}$ by their consistent estimators and plot $g(h)$ as a function of $d$. Even if the uniform dominance condition does not hold, $\min\{\inf_{h\in H}[g(h)], 0\}$ and $\max\{\sup_{h\in H}[g(h)], 0\}$ quantify the most- and least-favorable scenarios for the averaging estimator.
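In practice, the check suggested above can be carried out by simulating $g(h)$ on a grid of $d$ values with $G_{2,F}$ and $\Omega_{2,F}$ replaced by estimates. A sketch under the notation of Lemmas 4.1-4.2, assuming the first $r_1$ rows of `G2` and the upper-left $r_1 \times r_1$ block of `Omega2` correspond to the valid moments:

```python
import numpy as np

def g_of_h(d, G2, Omega2, Upsilon, r1, reps=100_000, seed=0):
    """Monte Carlo evaluation of g(h) = E[xi_bar' Upsilon xi_bar
    - xi_1' Upsilon xi_1] from Theorem 5.1 at a given d."""
    rng = np.random.default_rng(seed)
    G1, O1 = G2[:r1, :], Omega2[:r1, :r1]
    Sig1 = np.linalg.inv(G1.T @ np.linalg.solve(O1, G1))
    Sig2 = np.linalg.inv(G2.T @ np.linalg.solve(Omega2, G2))
    Gam1 = Sig1 @ G1.T @ np.linalg.inv(O1)     # Gamma_1,F in (4.1)
    Gam2 = Sig2 @ G2.T @ np.linalg.inv(Omega2) # Gamma_2,F in (4.1)
    d_star = np.concatenate([np.zeros(r1), d])
    tr = np.trace(Upsilon @ (Sig1 - Sig2))
    Z = rng.multivariate_normal(np.zeros(Omega2.shape[0]), Omega2, size=reps)
    vals = np.empty(reps)
    for b in range(reps):
        xi1 = -Gam1 @ Z[b, :r1]
        xi2 = -Gam2 @ (Z[b] + d_star)
        # xi2 - xi1 = -(Gamma_2 - Gamma_1-tilde)(Z + d*), so it can be
        # used directly in the quadratic form of the random weight.
        w = tr / ((xi2 - xi1) @ Upsilon @ (xi2 - xi1) + tr)
        xi_bar = (1 - w) * xi1 + w * xi2
        vals[b] = xi_bar @ Upsilon @ xi_bar - xi1 @ Upsilon @ xi1
    return vals.mean()
```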

Theorem 5.2 Let $A_F \equiv \Upsilon(\Sigma_{1,F} - \Sigma_{2,F})$ for any $F \in \mathcal{F}$. Suppose that Assumptions 3.1, 3.2 and 3.3 hold. If $\mathrm{tr}(A_F) > 0$ and $\mathrm{tr}(A_F) \ge 4\lambda_{\max}(A_F)$ for any $F \in \mathcal{F}$ with $\delta_F = 0_{r^*}$, we have
$$\underline{\mathrm{AsyRD}}(\hat\theta_{eo}, \hat\theta_1) < 0 \quad \text{and} \quad \overline{\mathrm{AsyRD}}(\hat\theta_{eo}, \hat\theta_1) = 0.$$
Thus, $\hat\theta_{eo}$ uniformly dominates $\hat\theta_1$.

Theorem 5.2 indicates that: (i) there exist $\varepsilon_1 < 0$ and some finite integer $n_{\varepsilon_1}$ such that the minimum risk difference between $\hat\theta_{eo}$ and $\hat\theta_1$ is less than $\varepsilon_1$ for any $n$ larger than $n_{\varepsilon_1}$; (ii) for any $\varepsilon_2 > 0$, there exists a finite integer $n_{\varepsilon_2}$ such that the maximum risk difference between $\hat\theta_{eo}$ and $\hat\theta_1$ is less than $\varepsilon_2$ for any $n$ larger than $n_{\varepsilon_2}$. Pre-test estimators fail to satisfy both properties (i) and (ii) at the same time. Take the pre-test estimator based on the J-test for example, and consider three scenarios: (a) the critical value is fixed for any sample size; (b) the critical value diverges to infinity; and (c) the critical value converges to zero. In the pointwise asymptotic framework, the J-test based on the critical values in (a), (b) and (c) leads to inconsistent (but conservative) model selection, consistent model selection, and no model selection, respectively. The pre-test estimator based on the J-test violates property (ii) in scenarios (a) and (b), and violates property (i) in scenario (c).

Different from the finite-sample results for the JS estimator established in the Gaussian location model, our comparison of the two estimators $\hat\theta_{eo}$ and $\hat\theta_1$ is based on the asymptotic bounds of the risk differences. For a given sample size $n$, we do not provide results on this asymptotic approximation error, and therefore our results do not state how the finite-sample upper bound $\overline{RD}_n(\hat\theta_{eo}, \hat\theta_1; \zeta)$ approaches zero as $n \to \infty$ and then $\zeta \to \infty$ (e.g., from above or from below). For the Gaussian location model, the asymptotic uniform dominance here is weaker than the classical finite-sample results established for the JS estimator. However, these asymptotic results apply to general nonlinear econometric models with non-normal random variables.

To shed light on the sufficient conditions in Theorem 5.2, consider a scenario similar to that of the JS estimator: $\Sigma_{1,F} = \sigma^2_{1,F}I_{d_\theta}$, $\Sigma_{2,F} = \sigma^2_{2,F}I_{d_\theta}$, and $\Upsilon = I_{d_\theta}$. In this case, the sufficient conditions become $\sigma^2_{1,F} > \sigma^2_{2,F}$ and $d_\theta \ge 4$. The first condition, $\mathrm{tr}(A_F) > 0$, which reduces to $\sigma^2_{1,F} > \sigma^2_{2,F}$, requires that the additional moments $E_F[g^*(W_i, \theta_F)] = 0_{r^*}$ are non-redundant in the sense that they lead to a more efficient estimator of $\theta_F$. The second condition, $\mathrm{tr}(A_F) \ge 4\lambda_{\max}(A_F)$, which reduces to $d_\theta \ge 4$, requires that we are interested in the total risk of several parameters rather than that of a single one. In the more general case where $\Sigma_{1,F}$ and $\Sigma_{2,F}$ are not proportional to the identity matrix, the sufficient conditions reduce to $\Sigma_{1,F} > \Sigma_{2,F}$ and $d_\theta \ge 4$ under the choice $\Upsilon = (\Sigma_{1,F} - \Sigma_{2,F})^{-1}$, which rescales $\hat\theta$ by the variance reduction $\Sigma_{1,F} - \Sigma_{2,F}$. In a simple linear IV model (Example 3.1) where $Z_{1,i}$ is independent of $Z^*_i$ and the regression error $U_i$ is homoskedastic conditional on the IVs, $\Sigma_{1,F} > \Sigma_{2,F}$ requires that $E_F[Z^*_i X_i']$ and $E_F[Z_{1,i}Z_{1,i}']$ both have full rank. Note that these conditions are sufficient but not necessary. If these sufficient conditions do not hold, we can still simulate the bounds in Theorem 5.1 to check the uniform dominance condition. In fact, the simulation studies in the next section show that in many cases $\hat\theta_{eo}$ has a smaller finite-sample risk than $\hat\theta_1$ even if these sufficient conditions are violated. Nevertheless, these analytical sufficient conditions can be checked easily before the simulation-based methods are adopted.

Footnote 17: See Section D in the Supplemental Appendix for the definition and analysis of this estimator.
Footnote 18: In Section F of the Supplemental Appendix, we show that the averaging GMM estimator has similar finite sample dominance results in the Gaussian location model.
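Checking the analytical sufficient conditions amounts to two matrix computations; a minimal sketch:

```python
import numpy as np

def theorem_5_2_conditions(Sigma1, Sigma2, Upsilon):
    """Check tr(A_F) > 0 and tr(A_F) >= 4 * lambda_max(A_F) for
    A_F = Upsilon (Sigma_1,F - Sigma_2,F). A_F need not be symmetric,
    so the real part of the largest eigenvalue is used."""
    A = Upsilon @ (Sigma1 - Sigma2)
    trA = np.trace(A)
    lam_max = np.max(np.linalg.eigvals(A).real)
    return trA > 0 and trA >= 4 * lam_max
```

With the choice $\Upsilon = (\Sigma_{1,F} - \Sigma_{2,F})^{-1}$ discussed above, $A_F = I_{d_\theta}$ and the check reduces to $d_\theta \ge 4$.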

6 Local Uniform Dominance

In this section, we provide a local uniform dominance result that strengthens $\overline{\mathrm{AsyRD}}(\hat\theta_{eo}, \hat\theta_1) = 0$ in Theorem 5.2 to
$$\overline{\mathrm{AsyRD}}(\hat\theta_{eo}, \hat\theta_1) < 0, \quad (6.1)$$
at the cost of comparing $\hat\theta_{eo}$ and $\hat\theta_1$ over a smaller parameter space that only allows for local misspecification of order $n^{-1/2}$. The result is uniform over this shrinking parameter space. Showing $\overline{\mathrm{AsyRD}}(\hat\theta_{eo}, \hat\theta_1) < 0$ is desirable because it implies that the finite sample maximum risk difference between $\hat\theta_{eo}$ and $\hat\theta_1$ is negative and bounded away from zero for all large $n$, whereas $\overline{\mathrm{AsyRD}}(\hat\theta_{eo}, \hat\theta_1) = 0$ allows this finite sample maximum risk difference to be a shrinking but positive value for large $n$. A smaller parameter space is necessary for this stronger result because $\hat\theta_{eo}$ cannot strictly improve upon $\hat\theta_1$ when the additional moments are severely misspecified and thus do not provide any useful information.

Assumption 6.1 For each $n$, let $\mathcal{F}_n$ denote a set of DGPs. The following conditions hold:
(i) for any $n$ and any $F \in \mathcal{F}_n$, there is $\theta_F \in \mathrm{int}(\Theta)$ such that $E_F[g_1(W,\theta_F)] = 0_{r_1}$;
(ii) for any $\varepsilon > 0$, $\inf_n\inf_{F\in\mathcal{F}_n}\inf_{\theta\in B^c_\varepsilon(\theta_F)}\|E_F[g_1(W,\theta)]\| > 0$;
(iii) for any $n$ and any $F \in \mathcal{F}_n$, there is $d_F \in R^{r^*}$ such that $E_F[g^*(W,\theta_F)] = n^{-1/2}d_F$;
(iv) Assumption 3.2 holds for $\mathcal{F}_n$ uniformly over $n$;
(v) $\{v_F : F \in \mathcal{F}_n \text{ for some } n\}$ is closed;
(vi) $\|d_F\| \le C_L$ for some fixed constant $C_L$.

Assumptions 6.1(i) and (ii) are similar to Assumptions 3.1(i)-(ii); they ensure the unique identification of $\theta_F$ for any $F \in \mathcal{F}_n$ and any $n$. Assumption 6.1(iii) implies that the extra moment conditions are mildly misspecified, which together with Assumptions 6.1(i)-(ii) and Assumption 3.2 ensures that the aggressive GMM estimator $\hat\theta_2$ has a normal asymptotic distribution, as shown in Lemma 4.1(a). As a result, Assumptions 3.1(iii)-(iv) are not needed here. Assumption 3.2 contains some regularity conditions for establishing the asymptotic properties of the GMM estimator, and it is maintained in Assumption 6.1(iv). Assumption 6.1(v) is a reduced version of Assumption 3.3(i). Assumption 6.1(vi) is an important condition for showing the local uniform dominance result.

To introduce the local uniform dominance result, we define
$$H_{C_L} = \{h_{F,d} : d \in R^{r^*} \text{ with } \|d\| \le C_L \text{ and } F \in \mathcal{F}_n \text{ for some } n\}. \quad (6.2)$$

In the local misspecification framework, the set of DGPs $\mathcal{F}_n$ may change with the sample size $n$. The upper bound of the finite sample risk difference between $\hat\theta_{eo}$ and $\hat\theta_1$ should therefore be defined as
$$\overline{RD}_n(\hat\theta_{eo}, \hat\theta_1; \zeta) = \sup_{F\in\mathcal{F}_n} E_F[\ell_\zeta(\hat\theta_{eo}, \theta_F) - \ell_\zeta(\hat\theta_1, \theta_F)], \quad (6.3)$$
which is approximated by
$$\overline{\mathrm{AsyRD}}(\hat\theta_{eo}, \hat\theta_1) = \limsup_{\zeta\to\infty}\limsup_{n\to\infty}\sup_{F\in\mathcal{F}_n} E_F[\ell_\zeta(\hat\theta_{eo}, \theta_F) - \ell_\zeta(\hat\theta_1, \theta_F)]. \quad (6.4)$$
To show the local uniform dominance result, it is sufficient to study the upper bound of the risk difference $\overline{\mathrm{AsyRD}}(\hat\theta_{eo}, \hat\theta_1)$.

Lemma 6.1 Suppose that Assumption 6.1 holds. The upper bound of the asymptotic risk difference satisfies
$$\overline{\mathrm{AsyRD}}(\hat\theta_{eo}, \hat\theta_1) \le \sup_{h\in H_{C_L}}[g(h)], \quad (6.5)$$
where $g(h) = E[\bar\xi_F'\Upsilon\bar\xi_F - \xi_{1,F}'\Upsilon\xi_{1,F}]$ is defined in Theorem 5.1.

Lemma 6.1 provides an upper bound for the maximum risk difference between $\hat\theta_{eo}$ and $\hat\theta_1$. The criterion function $g(h)$ in (6.5) is the same as its counterpart in Theorem 5.1. To show the local uniform dominance result in (6.1), it is sufficient to show that $\sup_{h\in H_{C_L}}[g(h)]$ is bounded away from zero. This is proved in the following theorem.

Theorem 6.1 Suppose that Assumption 6.1 holds. If $\mathrm{tr}(A_F) > 0$ and $\mathrm{tr}(A_F) \ge 4\lambda_{\max}(A_F)$ for any $F \in \mathcal{F}_n$ and any $n$, we have $\sup_{h\in H_{C_L}}[g(h)] < 0$ for any finite constant $C_L$.

Combining the results in Lemma 6.1 and Theorem 6.1, we immediately obtain (6.1). The sufficient conditions ensuring that the upper bound $\sup_{h\in H_{C_L}}[g(h)]$ is bounded away from zero are the same as those in Theorem 5.2.

7 Simulation Studies

In this section, we investigate the finite sample performance of our averaging GMM estimator in linear IV models. In addition to the empirical optimal weight $\tilde\omega_{eo}$, we consider another averaging estimator based on a James-Stein (JS) type of weight.

First consider the positive part of the JS weight:
$$\omega_{JS} = \left(\frac{\mathrm{tr}(\hat A) - 2\lambda_{\max}(\hat A)}{n(\hat\theta_2 - \hat\theta_1)'\Upsilon(\hat\theta_2 - \hat\theta_1)}\right)_+, \quad (7.1)$$
where $(x)_+ = \max\{0, x\}$ and $\hat A$ is the estimator of $A_F$ using $\hat\Sigma_k$ for $k = 1, 2$. The alternative averaging estimator uses the restricted JS weight
$$\omega_{R,JS} = \min\{\omega_{JS}, 1\}. \quad (7.2)$$
By construction, $\omega_{JS} \ge 0$ and $\omega_{R,JS} \in [0, 1]$.

We compare the finite-sample MSEs of our proposed averaging estimator with the empirical optimal weight, the JS type averaging estimator with the restricted weight in (7.2), the conservative GMM estimator $\hat\theta_1$, and the pre-test GMM estimator based on the J-test. The finite-sample MSE of the conservative GMM estimator is normalized to be 1.

Footnote 19: This formula is a GMM analog of the generalized James-Stein type shrinkage estimator in Hansen (2016) for parametric models. The shrinkage scalar is set to $\mathrm{tr}(\hat A) - 2\lambda_{\max}(\hat A)$ in a fashion similar to the original James-Stein estimator.
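A sketch of the two JS weights as reconstructed in (7.1)-(7.2); the exact placement of the positive-part and min operators follows the statement that $\omega_{JS} \ge 0$ and $\omega_{R,JS} \in [0,1]$, and should be read as an assumption of this sketch:

```python
import numpy as np

def js_weights(theta1, theta2, A_hat, Upsilon, n):
    """James-Stein type weight (7.1) and restricted weight (7.2),
    with A_hat an estimate of A_F = Upsilon(Sigma_1,F - Sigma_2,F)."""
    diff = theta2 - theta1
    tau = np.trace(A_hat) - 2 * np.max(np.linalg.eigvals(A_hat).real)
    w_js = max(0.0, tau / (n * diff @ Upsilon @ diff))   # (x)_+ = max{0, x}
    w_rjs = min(w_js, 1.0)                               # restrict to [0, 1]
    return w_js, w_rjs
```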

We consider a linear regression model with i.i.d. observed data
$$W_i = (Y_i, X_{1,i}, \ldots, X_{6,i}, Z_{1,i}, \ldots, Z_{12,i}, Z^*_{1,i}, \ldots, Z^*_{6,i})' \text{ for } i = 1, \ldots, n, \quad (7.3)$$
where $Y$ is the dependent variable, $(X_1, \ldots, X_6)$ are 6 endogenous regressors, $(Z_1, \ldots, Z_{12})$ are 12 valid IVs, and $(Z^*_1, \ldots, Z^*_6)$ are 6 invalid IVs. The data are generated as follows. The regression model is
$$Y = \sum_{j=1}^{6}\theta_j X_j + u, \quad (7.4)$$
where the $X_j$ are generated by
$$X_j = (Z_j + Z_{j+6})\pi_j + Z_{j+12}\phi_j + \varepsilon_j \text{ for } j = 1, \ldots, 6. \quad (7.5)$$
We first draw $(Z_1, \ldots, Z_{18}, \varepsilon_1, \ldots, \varepsilon_6, u_0)'$ from a normal distribution with mean zero and variance-covariance matrix $\mathrm{diag}(I_{18\times 18}, \Sigma_{7\times 7})$, where
$$\Sigma_{7\times 7} = \begin{pmatrix} I_{6\times 6} & 0.25\times 1_6 \\ 0.25\times 1_6' & 1 \end{pmatrix}. \quad (7.6)$$
To show the performance of the estimators under non-Gaussian errors, we draw $\eta$ from an exponential distribution with mean 1, independent of $(Z_1, \ldots, Z_{18}, \varepsilon_1, \ldots, \varepsilon_6, u_0)$. The residual term $u$ in (7.4) is generated by
$$u = (u_0 + \eta)/2, \quad (7.7)$$
where $\eta$ is demeaned to ensure that the mean of $u$ is zero. The invalid IVs are generated by
$$Z^*_j = Z_{j+12} + n^{-1/2}d_j u \text{ for } j = 1, \ldots, 6. \quad (7.8)$$
We set $(\pi_1, \ldots, \pi_6) = 2.5\times 1_6'$ and $(\phi_1, \ldots, \phi_6) = 0.5\times 1_6'$. In the main regression equation (7.4), all regressors are endogenous because
$$E[X_j u] = E[\varepsilon_j u_0/2] = 0.125 \text{ for } j = 1, \ldots, 6. \quad (7.9)$$
From the expression for $Z^*_j$ above, we see that increasing the magnitude of $d_j$ enlarges the correlation coefficient between $Z^*_j$ and $u$ and hence the endogeneity of $Z^*_j$. Given the sample size $n$, we consider different DGPs of the simulated data $\{W_i : i = 1, \ldots, n\}$ by changing the values of the location parameters $(d_1, \ldots, d_6)$. We consider the following parametrization:
$$(d_1, \ldots, d_6) = r\,(c_1, \ldots, c_6), \quad (7.10)$$
where $r$ is a scalar that takes values on the grid points between 0 and 25 with grid length 0.5, and $(c_1, \ldots, c_6)$ is parametrized in two different ways. In the first, we set $c_j = 0$ or $1$ for $j = 1, \ldots, 6$ and rule out the case where $c_j = 0$ for all $j$ (since this is the same as the case which sets $r = 0$). In the second, we consider the polar transformation and set
$$c_1 = \sin(\vartheta_1)\sin(\vartheta_2)\sin(\vartheta_3)\sin(\vartheta_4)\sin(\vartheta_5), \quad c_2 = \cos(\vartheta_1)\sin(\vartheta_2)\sin(\vartheta_3)\sin(\vartheta_4)\sin(\vartheta_5),$$
$$c_3 = \cos(\vartheta_2)\sin(\vartheta_3)\sin(\vartheta_4)\sin(\vartheta_5), \quad c_4 = \cos(\vartheta_3)\sin(\vartheta_4)\sin(\vartheta_5), \quad c_5 = \cos(\vartheta_4)\sin(\vartheta_5), \quad c_6 = \cos(\vartheta_5), \quad (7.11)$$
where $\vartheta_1 \in \{\pi/4, 3\pi/4, 5\pi/4, 7\pi/4\}$ and $\vartheta_j \in \{\pi/4, 3\pi/4\}$ for $j = 2, \ldots, 5$. Therefore, there are 127 different values of $(c_1, \ldots, c_6)$, which together with 51 different values of $r$ produces 6477 different DGPs in the simulation studies. For each DGP, we consider sample sizes $n = 50$, $100$, $250$ and $500$, and use 25,000 simulation repetitions.
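The design (7.3)-(7.8) can be replicated directly. A sketch follows; the true coefficient vector $(\theta_1, \ldots, \theta_6)$ is not restated in this excerpt, so it is set to a vector of ones purely for illustration.

```python
import numpy as np

def simulate_dgp(n, d, rng):
    """One sample from the Monte Carlo design (7.3)-(7.8); d is the
    6-vector of location parameters from (7.10)."""
    Sigma77 = np.eye(7)
    Sigma77[:6, 6] = Sigma77[6, :6] = 0.25      # E[eps_j u0] = 0.25, as in (7.6)
    cov = np.block([[np.eye(18), np.zeros((18, 7))],
                    [np.zeros((7, 18)), Sigma77]])
    draw = rng.multivariate_normal(np.zeros(25), cov, size=n)
    Z, eps, u0 = draw[:, :18], draw[:, 18:24], draw[:, 24]
    eta = rng.exponential(1.0, size=n) - 1.0    # demeaned exponential
    u = (u0 + eta) / 2.0                        # (7.7)
    pi, phi = 2.5, 0.5
    X = (Z[:, :6] + Z[:, 6:12]) * pi + Z[:, 12:18] * phi + eps   # (7.5)
    theta = np.ones(6)                          # illustrative true value
    y = X @ theta + u                           # (7.4)
    Z_star = Z[:, 12:18] + (d / np.sqrt(n)) * u[:, None]         # (7.8)
    return y, X, Z[:, :12], Z_star
```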

Figure 2. Finite Sample MSEs of the Pre-test and Averaging GMM Estimators. Note: "Pre-test(0.1)" refers to the pre-test GMM estimator based on the J-test with nominal size 0.1; "Emp-opt" refers to the averaging GMM estimator based on the empirical optimal weight; "ReSt-JS" refers to the averaging estimator based on the restricted James-Stein weight. For each estimator, the upper envelope of the shaded area is the maximum finite sample MSE among the 127 DGPs described in Section 7, and the lower envelope is the minimum. Both maximum and minimum are functions of $r$.

Given the sample size and the value of $r$, we report the minimum and maximum of the 127 values of the finite sample MSEs for each estimator. Therefore, given each sample size, we report a shaded area for each estimator, where the upper envelope represents the estimator's maximum finite sample MSE and the lower envelope its minimum finite sample MSE. For each estimator, both the upper envelope and the lower envelope are functions of $r$. The MSEs of various estimators of the parameters in (7.4) are shown in Figure 2. The two panels at the top of Figure 2 present the MSEs of the estimators for sample sizes $n = 50$ and $100$,

Footnote 20: In this simulation study, we also considered the truncated risk function with a large truncation $\zeta$. The simulation results are identical to what we get without truncation. These extra simulation results are available upon request.
Footnote 21: The finite sample biases and variances of the GMM estimators are reported in Subsection E.1 of the Supplemental Appendix.

respectively, while the two panels at the bottom present their MSEs for $n = 250$ and $500$, respectively. Our findings in the simulation studies are summarized as follows. First, the averaging GMM estimator $\hat\theta_{eo}$ has smaller MSE than $\hat\theta_1$ uniformly over $d$ for all sample sizes considered, which is predicted by our theory because the key sufficient condition is satisfied in this model. Second, the pre-test GMM estimator does not dominate the conservative GMM estimator $\hat\theta_1$ once the sample size becomes slightly large (e.g., $n \ge 100$). For example, when $n = 500$ and the location parameter $r$ is close to zero, the pre-test GMM estimator has relative MSE as low as 0.35; however, its relative MSE is above 1 when $r$ is between 5 and 20. Third, comparing the two averaging estimators, we find that the restricted JS estimator does not reduce the MSE as much as the averaging estimator based on $\tilde\omega_{eo}$ across the different sample sizes and DGPs considered in this simulation study.

8 Conclusion

This paper studies an averaging GMM estimator that combines the conservative estimator and the aggressive estimator with a data-dependent weight. The averaging weight is the sample analog of an optimal non-random weight. We provide a sufficient class of drifting DGPs under which the pointwise asymptotic results combine to yield uniform approximations to the finite-sample risk difference between two estimators. Using this asymptotic approximation, we show that the proposed averaging GMM estimator uniformly dominates the conservative GMM estimator.

Inference based on the averaging estimator is an interesting and challenging problem. As pointed out in Pötscher (2006), the finite sample density of the averaging estimator cannot be consistently estimated, which implies that directly applying an estimator of the finite-sample density may not yield uniformly valid inference. In addition to uniform validity, a desirable confidence set should have smaller volume than that obtained from the conservative moments alone. We leave the inference issue to future investigation.

Footnote 22: It is easy to show that in this simulation design, when $\delta_F = 0_{r^*}$ we have $\mathrm{tr}(A_F) = 4$ and $\mathrm{tr}(A_F) - 4\lambda_{\max}(A_F) = 4/3$.

A Appendix

For Lemma 3.1, define
$$c_{\delta,0} \equiv \inf_{F_0\in\mathcal{F}_0}\min\{\lambda_{\min}(\Sigma_{xz_1}\Sigma_{z_1z_1}^{-1}\Sigma_{z_1x}),\ \lambda_{\min}(\Sigma)\}, \quad C_{\delta,0} \equiv \sup_{F_0\in\mathcal{F}_0}\max\{\|\mu\|^2,\ \lambda_{\max}(\Sigma)\}, \quad C_\Theta \equiv \sup_{\theta\in\Theta}\|\theta\|^2. \quad (A.1)$$
Let
$$C_{\delta,W} \equiv 2(d_\theta + r_2 + 1)C_{\delta,0}, \quad c_{\delta,1} \equiv \min\{1, c_{\delta,0}^2\} \quad \text{and} \quad C_{\delta,1} \equiv C_{\delta,W}^2(2 + C_\Theta^{1/2})^2. \quad (A.2)$$
Then, in Lemma 3.1(iii), the constant $\varepsilon_0$ is given by
$$\varepsilon_0 = c_{\delta,1}C_{\delta,0}^{-1}C_{\delta,1}^{-1}, \quad (A.3)$$
i.e., we require the condition to hold on a set of $\delta$ bounded away from $0_{r^*}$ by $\varepsilon_0$. The details of the proofs are given in Section B of the Supplemental Appendix.

A.1 Proof of the Results in Section 4

Let $\nu_n(g_2(W,\theta)) = n^{-1/2}\sum_{i=1}^n(g_2(W_i,\theta) - E_{F_n}[g_2(W_i,\theta)])$. In the rest of the Appendix, we use $C$ to denote a generic fixed positive finite constant which does not depend on any $F \in \mathcal{F}$ or $n$.

Lemma A.1 Suppose that Assumption 3.2(ii) holds and $\Theta$ is compact. Then we have:
(i) $\sup_{\theta\in\Theta}\|\bar g_2(\theta) - E_{F_n}[g_2(W_i,\theta)]\| = o_p(1)$;
(ii) $\sup_{\theta\in\Theta}\|n^{-1}\sum_{i=1}^n g_2(W_i,\theta)g_2(W_i,\theta)' - E_{F_n}[g_2(W_i,\theta)g_2(W_i,\theta)']\| = o_p(1)$;
(iii) $\sup_{\theta\in\Theta}\|n^{-1}\sum_{i=1}^n g_{2,\theta}(W_i,\theta) - E_{F_n}[g_{2,\theta}(W_i,\theta)]\| = o_p(1)$;
(iv) $\nu_n(g_2(W,\theta))$ is stochastically equicontinuous over $\theta \in \Theta$;
(v) $\Omega_{2,F_n}^{-1/2}\nu_n(g_2(W,\theta_{F_n})) \to_D N(0_{r_2}, I_{r_2})$.

Proof of Lemma A.1. See Lemmas .3-.5 of Andrews and Cheng (2013).

Define $M_{k,F}(\theta) = E_F[g_k(W,\theta)]$, $G_{k,F}(\theta) = E_F[g_{k,\theta}(W,\theta)]$ and $\Omega_{k,F}(\theta) = \mathrm{Var}_F[g_k(W,\theta)]$ for any $F \in \mathcal{F}$, any $\theta \in \Theta$ and $k = 1, 2$. The next lemma shows that $M_{2,F}(\theta)$, $G_{2,F}(\theta)$ and $\Omega_{2,F}(\theta)$ are Lipschitz continuous uniformly over $F \in \mathcal{F}$.

Lemma A.2 Under Assumptions 3.2(i)-(ii), for any $F \in \mathcal{F}$ and any $\theta_1, \theta_2 \in \Theta$, we have:
(i) $\|M_{2,F}(\theta_1) - M_{2,F}(\theta_2)\| \le C\|\theta_1 - \theta_2\|$;

25 (ii) kg 2;F ( ) G 2;F ( 2 )k C k 2 k; (iii) k 2;F ( ) 2;F ( 2 )k C k 2 k. Proof of Lemma A.2 is included in Section C of the Supplemental Appendix. Lemma A.3 Suppose that Assumptions 3..(i)-(ii) and 3.2.(i)-(ii) hold. Then for any sequence of DGPs ff n g, we have e Fn = o p () and 2 = 2;Fn + o p (); (A.4) where e is a preliminary estimator dened as e = arg min 2 g () g () (A.5) and 2 is dened in (C.3) of the Supplemental Appendix. Proof of Lemma A.3 is included in Section C of the Supplemental Appendix. Lemma A.4 Suppose that Assumptions 3..(i)-(ii) and 3.2 hold. Then for any sequence of DGPs ff n g, we have where ;F n n (g (W; Fn )) n =2 ( b Fn ) = ;Fn n (g (W; Fn )) + o p (); (A.6) G ;F n ;F n G ;Fn G ;Fn ;F n = O p (). Proof of Lemma A.4 is included in Section C of the Supplemental Appendix. Lemma A.5 Suppose that Assumptions 3..(iii) and 3.2.(i)-(iii) hold. Then for any sequence of DGPs ff n g, we have b 2 F n = o p (): (A.7) Proof of Lemma A.5 is included in Section C of the Supplemental Appendix. Lemma A.6 Suppose that Assumptions 3..(i)-(ii) and 3.2.(i)-(iii) hold. Consider any sequence of DGPs ff n g such that Fn = o(). Then we have b 2 Fn = o p (): (A.8) If we further have Assumption 3.2.(iv), then n =2 ( b o 2 Fn ) = ( 2;Fn + o p ()) n n (g 2 (W; Fn )) + n =2 2;Fn + o p (); (A.9) where 2;F n = G 2;F n 2;F n G 2;Fn G 2;Fn 2;F n and 2;Fn = ( r ; F n ). 25

The proof of Lemma A.6 is included in Section C of the Supplemental Appendix.

Lemma A.7 Under Assumptions 3.2(ii) and 3.3(ii), for any sequence of DGPs $\{F_{p_n}\}$ with $F_{p_n} \in \mathcal{F}$, where $\{p_n\}$ is a subsequence of $\{n\}$, there is a subsequence $\{p^*_n\}$ of $\{p_n\}$ such that $v_{F_{p^*_n}} \to v_{F^*}$ as $p^*_n \to \infty$, where $F^* \in \mathcal{F}$.

Proof of Lemma A.7. Recall that $\mathcal{V} = \{v_F : F \in \mathcal{F}\}$. By Assumptions 3.2(ii) and 3.3(ii), $\mathcal{V}$ is compact. Hence any sequence $\{v_{F_{p_n}}\}$ in $\mathcal{V}$ has a convergent subsequence $\{v_{F_{p^*_n}}\}$ such that $v_{F_{p^*_n}} \to v_{F^*}$ as $p^*_n \to \infty$, where $F^* \in \mathcal{F}$.

Lemma A.8 Suppose that Assumptions 3.1(i)-(ii) and 3.2 hold. Consider any sequence of DGPs $\{F_n\}$ such that $v_{F_n} \to v_F$ for some $F \in \mathcal{F}$, and $n^{1/2}\delta_{F_n} \to d$ for $d \in R^{r^*}$. Then
$$\begin{pmatrix} n^{1/2}(\hat\theta_1 - \theta_{F_n}) \\ n^{1/2}(\hat\theta_2 - \theta_{F_n}) \end{pmatrix} \to_D \begin{pmatrix} -\Gamma_{1,F}Z_{1,F} \\ -\Gamma_{2,F}(Z_{2,F} + d^*) \end{pmatrix},$$
where $d^* = (0_{r_1}', d')'$.

Proof of Lemma A.8. In the proof, we use
$$G_{2,F_n} \to G_{2,F} \quad \text{and} \quad \Omega_{2,F_n} \to \Omega_{2,F} \quad (A.10)$$
for some $F \in \mathcal{F}$, which is assumed in the lemma. Under Assumptions 3.1(i)-(ii) and 3.2, for the sequence of DGPs $\{F_n\}$ considered in the lemma, we can apply Lemma A.4 and Lemma A.6 to deduce
$$\begin{pmatrix} n^{1/2}(\hat\theta_1 - \theta_{F_n}) \\ n^{1/2}(\hat\theta_2 - \theta_{F_n}) \end{pmatrix} = \begin{pmatrix} -\Gamma_{1,F_n}\nu_n(g_1(W,\theta_{F_n})) \\ -(\Gamma_{2,F_n} + o_p(1))\{\nu_n(g_2(W,\theta_{F_n})) + n^{1/2}\delta_{2,F_n}\} \end{pmatrix} + o_p(1), \quad (A.11)$$
where $\delta_{2,F_n} = (0_{r_1}', \delta_{F_n}')'$. By (A.10) and Assumption 3.2, we have
$$\Gamma_{1,F_n} = \Gamma_{1,F} + o(1) \quad \text{and} \quad \Gamma_{2,F_n} = \Gamma_{2,F} + o(1), \quad (A.12)$$
where $\Gamma_{k,F} = (G_{k,F}'\Omega_{k,F}^{-1}G_{k,F})^{-1}G_{k,F}'\Omega_{k,F}^{-1}$ for $k = 1, 2$. Collecting the results in Lemma A.1(v), (A.11) and (A.12), and then applying the continuous mapping theorem (CMT), we obtain
$$\begin{pmatrix} n^{1/2}(\hat\theta_1 - \theta_{F_n}) \\ n^{1/2}(\hat\theta_2 - \theta_{F_n}) \end{pmatrix} \to_D -\begin{pmatrix} \tilde\Gamma_{1,F} \\ \Gamma_{2,F} \end{pmatrix}(Z_{2,F} + d^*), \quad (A.13)$$

where $Z_{2,F} \sim N(0_{r_2}, \Omega_{2,F})$, $\tilde\Gamma_{1,F} = [\Gamma_{1,F},\ 0_{d_\theta\times r^*}]$ and $d^* = (0_{r_1}', d')'$. The claimed result follows from (A.13) and the definitions of $\xi_{1,F}$ and $Z_{1,F}$.

Proof of Lemma 4.1. The claimed result in part (a) has been proved in Lemma A.8. We next consider the case that $n^{1/2}\delta_{F_n} \to d$ with $\|d\| = \infty$. Note that the results in (A.6) and (A.12) do not depend on whether $\|d\| < \infty$ or $\|d\| = \infty$. Using (A.6), (A.12), Lemma A.1(v) and the CMT, we have
$$n^{1/2}(\hat\theta_1 - \theta_{F_n}) \to_D -\Gamma_{1,F}Z_{1,F}. \quad (A.14)$$
To study the properties of $\hat\theta_2$, we have to consider two separate scenarios: (1) $\delta_{F_n} = o(1)$; and (2) $\|\delta_{F_n}\| > c_0$ for some $c_0 > 0$. In scenario (1), Assumption 3.2, Lemma A.1(v) and Lemma A.6 imply that
$$n^{1/2}(\hat\theta_2 - \theta_{F_n}) = -(\Gamma_{2,F_n} + o_p(1))n^{1/2}\delta_{2,F_n} + O_p(1). \quad (A.15)$$
By Assumption 3.1(iv) and $\|n^{1/2}\delta_{F_n}\| \to \infty$,
$$n\,\delta_{2,F_n}'\Gamma_{2,F_n}'\Gamma_{2,F_n}\delta_{2,F_n} \ge C^2\, n\,\delta_{F_n}'\delta_{F_n} \to \infty, \quad (A.16)$$
which together with (A.15) implies that $\|n^{1/2}(\hat\theta_2 - \theta_{F_n})\| \to_p \infty$. Finally, we consider scenario (2), where $\|\delta_{F_n}\| > c_0$. By Assumption 3.1(iv),
$$\|G_{2,F_n}'\Omega_{2,F_n}^{-1}\delta_{2,F_n}\| \ge C\|\delta_{F_n}\| > c_0 C \quad (A.17)$$
for any $n$. As $\theta^*_{F_n}$ is the minimizer of $Q_{F_n}(\theta)$, it satisfies the first order condition
$$0_{d_\theta} = G_{2,F_n}(\theta^*_{F_n})'\Omega_{2,F_n}^{-1}M_{2,F_n}(\theta^*_{F_n}), \quad (A.18)$$
which implies that
$$G_{2,F_n}'\Omega_{2,F_n}^{-1}\delta_{2,F_n} = G_{2,F_n}(\theta_{F_n})'\Omega_{2,F_n}^{-1}M_{2,F_n}(\theta_{F_n}) - G_{2,F_n}(\theta^*_{F_n})'\Omega_{2,F_n}^{-1}M_{2,F_n}(\theta^*_{F_n})$$
$$= [G_{2,F_n}(\theta_{F_n}) - G_{2,F_n}(\theta^*_{F_n})]'\Omega_{2,F_n}^{-1}M_{2,F_n}(\theta_{F_n}) + G_{2,F_n}(\theta^*_{F_n})'\Omega_{2,F_n}^{-1}[M_{2,F_n}(\theta_{F_n}) - M_{2,F_n}(\theta^*_{F_n})]. \quad (A.19)$$
By Lemma A.2, the Cauchy-Schwarz inequality and Assumptions 3.2(ii)-(iii), we have
$$\big\|[G_{2,F_n}(\theta_{F_n}) - G_{2,F_n}(\theta^*_{F_n})]'\Omega_{2,F_n}^{-1}M_{2,F_n}(\theta_{F_n})\big\| \le C\|\theta_{F_n} - \theta^*_{F_n}\|, \quad (A.20)$$
where $C$ is a fixed constant. Similarly, we have
$$\big\|G_{2,F_n}(\theta^*_{F_n})'\Omega_{2,F_n}^{-1}[M_{2,F_n}(\theta_{F_n}) - M_{2,F_n}(\theta^*_{F_n})]\big\| \le C\|\theta_{F_n} - \theta^*_{F_n}\|. \quad (A.21)$$
Combining the results in (A.19), (A.20) and (A.21), and using the triangle inequality, we have
$$\|\theta_{F_n} - \theta^*_{F_n}\| \ge c_0 C_1 \quad (A.22)$$
for some fixed constant $C_1 > 0$. Using $\hat\theta_2 = \theta^*_{F_n} + o_p(1)$ (which is proved in Lemma A.5) and the triangle inequality, we obtain
$$\|\hat\theta_2 - \theta_{F_n}\| \ge \|\theta^*_{F_n} - \theta_{F_n}\| - \|\hat\theta_2 - \theta^*_{F_n}\| = \|\theta^*_{F_n} - \theta_{F_n}\|(1 + o_p(1)), \quad (A.23)$$
which together with (A.22) implies that $n^{1/2}\|\hat\theta_2 - \theta_{F_n}\| \to_p \infty$. This finishes the proof.

Lemma A.9 (a) $\tilde\Gamma_{1,F}d^* = 0_{d_\theta}$; (b) $\tilde\Gamma_{1,F}\Omega_{2,F}\tilde\Gamma_{1,F}' = \Sigma_{1,F}$; (c) $\tilde\Gamma_{1,F}\Omega_{2,F}\Gamma_{2,F}' = \Sigma_{2,F}$; (d) $\Gamma_{2,F}\Omega_{2,F}\Gamma_{2,F}' = \Sigma_{2,F}$.

The proof of Lemma A.9 is included in Section C of the Supplemental Appendix.

A.2 Proof of the Results in Section 5

We first present some generic results on the bounds of the asymptotic risk difference between two estimators under some high-level conditions. Then we apply these generic results to the two specific estimators considered in this paper: $\hat\theta_{eo}$ and $\hat\theta_1$. The proof uses the subsequence techniques employed to derive the asymptotic size of a test in Andrews, Cheng, and Guggenberger (2011), but we adapt the proof and notation to the current setup and extend the results from tests to estimators.

Recall that $h_{F,d} = (d', \mathrm{vec}(G_{2,F})', \mathrm{vech}(\Omega_{2,F})')'$ and $v_F = (\mathrm{vec}(G_{2,F})', \mathrm{vech}(\Omega_{2,F})', \delta_F')'$ for any $F \in \mathcal{F}$ and any $d \in R^{r^*}_\infty$. We have defined
$$H = \{h_{F,d} : d \in R^{r^*} \text{ and } F \in \mathcal{F} \text{ with } \delta_F = 0_{r^*}\}, \quad (A.24)$$
where $\delta_F$ is defined in (1.5) for a given $F$. Define
$$H_\infty = \{h_{F,d} : d \in R^{r^*}_\infty \text{ with } \|d\| = \infty \text{ and } F \in \mathcal{F}\}. \quad (A.25)$$
Let $d_h = r^* + d_\theta r_2 + (r_2 + 1)r_2/2$. It is clear that $h_{F,d}$ is a $d_h$-dimensional vector.

28 where C is a xed constant. Similarly, we have G 2;Fn ( F n ) 2;F M2;Fn n ( Fn ) M 2;Fn ( F n ) M 2;Fn ( Fn ) M 2;Fn ( F n ) 2;F n G 2;Fn F n ) C Fn F n : (A.2) Combining the results in (A.9), (A.2) and (A.2), and using the triangle inequality, we have Fn F n c C (A.22) for some xed constant C. Using b 2 = F n +o p () (which is proved in Lemma A.5) and the triangle inequality, we obtain b 2 Fn jj b 2 F n jj F n Fn = F n Fn ( + o p ()); (A.23) which together with (A.22) implies that n =2 jj b 2 Fn jj! p. This nishes the proof. Lemma A.9 (a) ;F d = d ; (b) ;F 2;F ;F = ;F ; (c) ;F 2;F 2;F = 2;F ; (d) 2;F 2;F 2;F = 2;F. Proof of Lemma A.9 is included in Section C of the Supplemental Appendix. A.2 Proof of the Results in Section 5 We rst present some generic results on the bounds of asymptotic risk dierence between two estimators under some high-level conditions. Then we apply these generic results to the two specic estimators we consider in this paper: b eo and b. The proof uses the subsequence techniques used to show the asymptotic size of a test in Andrews, Cheng, and Guggenberger (2) but we adapt the proof and notations to the current setup and extend results from test to estimators. Recall that h F;d = (d ; vec(g 2;F ) ; vech( 2;F ) ) and v F = (vec(g 2;F ) ; vech( 2;F ) ) for any F 2 F and any d 2 R r. We have dened H = fh F;d : d 2 R r and F 2 F with F = r g (A.24) where F is dened by (.5) for a given F. Dene H = fh F;d : d 2 R r with jjdjj = and F 2 Fg. (A.25) Let d h = r + d r 2 + (r 2 + )r 2 =2. It is clear that h F;d is a d h -dimensional vector. 28

29 Condition A. (i) For any sequence of DGPs ff pn g with F pn 2 F where fp n g is a subsequence of fng, there exists a subsequence fp ng of fp n g and some F 2 F such that v Fp! v n F as p n! ; (ii) M ;F () = r has a unique solution at F 2 for any F 2 F; (iii) M 2;F () is uniform equicontinuous over F 2 F; (iv) for any subsequence fp n g of fng, if (p n ) =2 Fpn! d for d 2 R r and v Fpn! v F, then lim E F n! pn [`( b ; Fpn )] = R (h F;d ) and lim E F n! pn [`( e ; Fpn )] = R e (h F;d ) where R (h F;d ) and e R (h F;d ) are some non-negative functions that are bounded from above by for any F 2 F and any d 2 R r ; (v) for any F 2 F with F = r, there exists a constant " F > such that for any e 2 R r jj e jj < " F, there is e F 2 F with ef = e and jjv F v ef jj Cjj e jj for some > ; (vi) for any h F;d 2 H and h F; e d 2 H, we have with R (h F;d ) = R (h F; e d ) and e R (h F;d ) = e R (h F; e d ) for any >. Condition A..(i) requires that for any sequence of fv Fpn g, it has a convergent subsequence fv Fp g with limit being v n F for some F 2 F. This condition is veried under Assumptions 3.2.(ii) and 3.3.(ii) in Lemma A.7. Condition A..(ii) is the unique identication condition of F which holds under Assumptions 3..(i)-(ii). Condition A..(iii) holds under Assumption 3.2.(ii) by Lemma A.2. Condition A..(iv) is a key assumption to derive an explicit upper bound of asymptotic risk. This condition can be veried by using Lemma 4. as we shall show in the proof of Theorem 5.. Condition A..(v) enables us to show that the upper bound we derived for the asymptotic risk is also a lower bound. This condition is assumed in Assumption 3.3.(i). Condition A..(vi), in our context, requires that the asymptotic (truncated) risk of b (or e ) under the subsequences of DGPs ff pn g satisfying the restrictions in Condition A..(iv) are identical whenever (p n ) =2 Fpn! d with jjdjj =. Condition A. is veried in the proof of Theorem 5. below. Lemma A. Under Conditions A..(i) - A..(iv), we have AsyR ( b ) max ( sup R (h); sup R (h) h2h h2h ) ; (A.26) where AsyR ( b ) lim sup sup F 2F E F [`( b ; F )]. n! 29

30 Proof of Lemma A.. Let ff n g be a sequence such that lim supe Fn [`( b ; Fn )] = lim sup sup E F [`( b ; F )] AsyR ( b ): n! n! F 2F (A.27) Such a sequence always exists by the denition of supremum. The sequence fe Fn [`( b ; Fn )]: n g may not converge. However, by the denition of limsup, there exists a subsequence of fng, say fp n g, such that fe Fpn [`( b ; Fpn )]: n g converges and lim E F n! pn [`( b ; Fpn )] = AsyR ( b ): (A.28) Below we show that for any subsequence fp n g of fng such that fe Fpn [`( b ; Fpn )]: n g is convergent, there exists a subsequence fp ng of fp n g such that lim E F n! p [`( b ; Fp )] = R n n (h) for some h 2 H or H: (A.29) Because lim n! E Fp n [`( b ; Fp n )] = lim n! E Fpn [`( b ; Fpn )], which combined with (A.28) and (A.29) implies that AsyR ( b ) = R (h) for some h 2 H or H : The desired result in (A.26) follows immediately by (A.3). (A.3) To show that there exists a subsequence fp ng of fp n g such that (A.29) holds, it suces to show that for any sequence ff n g and any subsequence fp n g of fng, there exists a subsequence fp ng of fp n g for which we have (p n) =2 Fp n! d for d 2 R r and v Fp n! v F (A.3) for some F 2 F. If (A.3) holds, then we can use Condition A..(iv) to deduce that lim E F n! p [`( b ; Fp )] = R n n (h F;d ) (A.32) for the sequence of DGPs ff p n g that satises (A.3). As d 2 R r, we have either kdk < or kdk =. In the rst case, kdk < together with (p n) =2 Fp n! d and Fp n! F (which is implied by v Fp! v n F ) implies that F = r, which implies that h F;d 2 H by the denition of H. In the second case, h F;d 2 H by the denition of H. We have proved that h F;d in (A.32) belongs either to H or H which together with (A.32) proves (A.29). Finally, we show that for any sequence ff n g and any subsequence fp n g of fng, there ex- 3

31 ists a subsequence fp ng of fp n g for which (A.3) holds. Let pn;j denote the j-th component of pn and p ;n = p n for any n. For j =, either (a) lim sup n! jp =2 j;n p j;n ;jj < ; or (b) lim sup n! jp =2 j;n p j;n ;jj =. If (a) holds, then for some subsequence fp j+;n g of fp j;n g, p =2 j+;n p j+;n ;j! d j for some d j 2 R. If (b) holds, then for some subsequence fp j+;n g of fp j;n g, p =2 j+;n p j+;n ;j! or. As r is a xed positive integer, we can apply the same arguments successively for j = ; :::; r to obtain a subsequence fp r ;ng of fp n g such that (p r ;n) =2 pr ;n! d 2 R r. By Condition A..(i), we know that there exists a subsequence fp ng of fp r ;ng such that v p n! v F for some F 2 F, which nishes the proof of (A.3). Lemma A. Suppose that Condition A..(v) holds. Then (i) for any h F;d 2 H, there exists a sequence of DGPs ff n g with F n 2 F such that n =2 Fn! d, G 2;Fn! G 2;F and 2;Fn! 2;F ; (A.33) (ii) for any h F;d 2 H, there exists a sequence of DGPs ff n g with F n 2 F such that jjn =2 Fn jj!, G 2;Fn! G 2;F ; 2;Fn! 2;F and Fn! F : (A.34) Proof of Lemma A.. (i) By the denition of H, we have F = r for any F such that h F;d 2 H. Let N "F be the smallest n such that kdk n =2 < " F. By Condition A..(v), for any n N "F we can nd a DGP F n such that Fn = n =2 d and kv Fn v F k n =2 Cjjdjj : (A.35) For any n < N "F such that kdk n =2 " F, we let F n = F. The desired properties in (A.33) holds under the constructed sequence of DGPs ff n g by (A.35), because C is a xed constant and >. (ii) For any h F;d 2 H, we have either F = r or jj F jj >. We rst consider the case that F = r. Let r denote the r vector of ones. Let N "F be the smallest n such that n =4 (r ) =2 < " F. By Condition A..(v), for any n N "F we can nd a DGP F n such that Fn = n =4 r and v Fpn v F Cn =4 (r ) =2 : (A.36) For any n < N "F such that n =4 (r ) =2 " F, we let F n = F. The desired properties in (A.34) holds under the constructed sequence of DGPs ff n g by (A.36), because C is a xed constant and >. When jj F jj >, we dene a trivial sequence of DGPs ff n g as F n = F for any n. It is clear that (A.34) holds trivially in this case. 3

32 Lemma A.2 Under Condition A., we have AsyR ( b ) = max ( sup R (h); sup R (h) h2h h2h ) : (A.37) Proof of Lemma A.2. show that In view of the upper bound in (A.26) in Lemma A., it is sucient to AsyR ( b ) max ( sup R (h); sup R (h) h2h h2h ) : (A.38) First, we note that for any h d;f = (d ; vec(g 2;F ) ; vech( 2;F ) ) 2 H, there exists a sequence ff n 2 F : n g such that n =2 Fn! d 2 R r and v Fn! v F (A.39) by Lemma A..(i). The sequence E Fn [`( b ; Fn )] may not be convergent, but there exists a subsequence fp n g of n such that E Fpn [`( b ; Fpn )] is convergent and lim E F n! pn [`( b ; Fpn )] = lim supe Fn [`( b ; Fn )]: n! (A.4) As fp n g is a subsequence of fng, by (A.39) (p n ) =2 Fpn! d 2 R r and v Fpn! v F : (A.4) By Condition A..(iv), we have that lim E F n! pn [`( b ; Fpn )] = R (h F;d ); (A.42) which combined with (A.4) and the denition of AsyR ( b ) gives AsyR ( b ) = lim sup n! sup F 2F E F [`( b ; F )] lim supe Fn [`( b ; Fn )] = R (h F;d ): n! (A.43) Second, consider any h d;f = (d ; vec(g 2;F ) ; vech( 2;F ) ) 2 H. By Lemma A..(ii), there exists a sequence of DGPs ff n g such that jjn =2 Fn jj! and v Fn! v F : (A.44) Using the same arguments in proving (A.4) to (A.42), we can show that for some subsequence fp n g of fng; jp =2 n Fpn jj! and v pn! v F (A.45) 32

33 and lim supe Fn [`( b ; Fn )] = lim E F n! n! pn [`( b ; Fpn )] = R (h F;d ): (A.46) for jjdjj = by Conditions A..(vi). By the denition of AsyR ( b ) and (A.46), AsyR ( b ) = lim sup n! sup F 2F E F [`( b ; F )] lim supe Fn [`( b ; Fn )] = R (h F;d ): n! (A.47) Combining the results in (A.43) and (A.47), we immediately get (A.37). Lemma A.3 Under Conditions A..(i) - A..(iv), the upper and lower bounds of the asymptotic risk dierence between b and e satisfy AsyRD( b ; e ) AsyRD( b ; e ) lim max! min lim! ( sup h2h inf h2h h R (h) h R (h) i h R e (h) ; sup R (h) R e (h)i )! ; (A.48) h2h i R e (h) ; inf h2h h R (h) e R (h)i ; (A.49) where 8 h n oi er (h) E[min < E min F F ; ; kdk < ;F ;F ; ] and R (h) : E min ;F ;F ; ; kdk = for any h 2 H [ H. Proof of Lemma A.3. Dene R (H; H ) max ( sup h2h R (H; H) min inf h2h h R (h) h R (h) i h R e (h) ; sup R (h) R e (h)i ) ; (A.5) h2h i R e (h) ; inf h2h h R (h) e R (h)i : (A.5) By the denition of AsyRD( b ; e ), to show (A.48) it is sucient to show that for any > lim sup n! sup E F [`( b ; F ) `( e ; F )] R (H; H); (A.52) F 2F which can be proved using the same arguments in the proof of Lemma A. (but replacing `( b ; F ) and R (h) by `( b ; F ) `( e ; F ) and R (h) e R (h) respectively). Similarly by the denition of AsyRD( b ; e ), for (A.49) it is sucient to show that for any > lim inf n! inf F 2F E F [`( b ; F ) `( e ; F )] R (H; H ); (A.53) 33

34 which can be proved using the same arguments in the proof of Lemma A. (but replacing lim sup n, sup F 2F, `( b ; F ) and R (h) by lim inf n, inf F 2F, `( b ; F ) `( e ; F ) and R (h) e R (h) respectively). Lemma A.4 Under Condition A., the upper and lower bounds of the asymptotic risk dierence between b and e have the following representations: AsyRD( b ; e ) = AsyRD( b ; e ) = lim max! min lim! ( sup h2h inf h2h h R (h) h R (h) i h R e (h) ; sup R (h) R e (h)i )! ; (A.54) h2h i R e (h) ; inf h2h h R (h) e R (h)i : (A.55) Proof of Lemma A.4. By Lemma A.3, it is sucient to show that lim sup n! sup E F [`( b ; F ) `( e ; F )] R (H; H); (A.56) F 2F lim inf n! inf F 2F E F [`( b ; F ) `( e ; F )] R (H; H ); (A.57) for any >. (A.56) can be proved using the same arguments in the proof of Lemma A.2 by replacing `( b ; F ) and R (h) by `( b ; F ) `( e ; F ) and R (h) e R (h) respectively. Similarly, (A.57) can be proved using the same arguments in the proof of Lemma A.2 by replacing lim sup n, sup F 2F, `( b ; F ) and R (h) by lim inf n, inf F 2F, `( b ; F ) `( e ; F ) and R (h) e R (h) respectively. Lemma A.5 Under Assumptions 3.2.(ii) and 3.2.(iv), we have sup E[( ;F ;F ) 2 ] C and sup E[( F F ) 2 ] C: h2h h2h (A.58) Lemma A.6 and 3.2.(iv), we have where sup h2h [jg(h)j] C. h Let g (h) E minf F F ; g min ;F ;F ; i. Under Assumptions 3.2.(ii) lim sup [jg (h) g(h)j] = (A.59)! h2h Proof of Theorem 5.. The proof consists of two steps. The rst step is to apply Lemma A.4 to show (A.6) and (A.6) below, and the second step is to apply Lemma A.6 to show (A.75) and (A.76) below. 34

35 In the rst step, we apply Lemma A.4 with b = b eo and e = b to show that AsyRD( b eo ; b ) = AsyRD( b eo ; b ) = lim max! lim min! sup h2h [g (h)] ; inf h2h [g (h)] ; and (A.6) : (A.6) To prove (A.6) and (A.6), we now verify Condition A. under Assumptions Condition A..(i) is veried by Lemma A.7 under Assumptions 3.2.(ii) and 3.3.(ii). Condition A..(ii) is implied by Assumptions 3..(i) and 3..(ii). Condition A..(iii) is implied by Assumptions 3.2.(i)- (ii) as a result of Lemma A.2. Condition A..(v) is assumed in Assumption 3.3.(ii). We next verify Conditions A..(iv) and A..(vi). Consider any sequence of DGPs ff pn g with (p n ) =2 Fpn! d for d 2 R r and v Fpn! v F (A.62) for some F 2 F, where fp n g is a subsequence of fng. First, we consider the case that d 2 R r. By Lemma 4..(a) and 4.2.(a), (p n ) =2 ( b Fpn )! D ;F and (p n ) =2 ( b eo Fpn )! D F (A.63) which combined with the continuous mapping theorem implies that `( b ; Fpn )! D ;F ;F and `( b eo ; Fpn )! D F F : (A.64) Since is positive semi-denite, ;F ;F and F F are both non-negative. The function f (x) = min fx; g is a bounded continuous function for x. By (A.64) and the Portmanteau Lemma (see Lemma 2.2 in van der Vaart (998)), E Fpn [`( b h i eo ; Fpn )]! E minf F F ; g and E Fpn [`( b ; Fpn )]! E minf ;F ;F ; g : Second, we consider the case that kdk =. Then under Lemma 4..(b) and 4.2.(b), (A.65) (p n ) =2 ( b Fpn )! D ;F and (p n ) =2 ( b eo Fpn )! D ;F : (A.66) 35

36 Using the same arguments in showing (A.65), we get E Fpn [`( b eo ; Fpn )]! E minf ;F ;F ; g and E Fpn [`( b ; Fpn )]! E minf ;F ;F ; g : Dene 8 er (h F;d ) = E minf ;F ;F ; g < E[minf F F ; g]; kdk < and R (h F;d ) = : E minf ;F ;F ; g ; kdk = : (A.67) (A.68) Collecting the results in (A.65) and (A.67), we deduce that under the sequence of DGPs ff pn g satisfying (A.62), E Fpn [`( b eo ; Fpn )]! R (h F;d ) and E Fpn [`( b ; Fpn )]! e R (h F;d ); (A.69) where R (h F;d ) and R e (h F;d ) are non-negative and bounded from above by for any d 2 R r and any F 2 F. This veries Condition A..(iv). By denition, R e (h F;d ) in (A.68) does not depend on d for any F. Moreover, for any d and d e with jjdjj = and jjdjj e =, by the denition of R (h F;d ) in (A.69), R (h F;d ) = E minf ;F ;F ; g = R (h F; e d ): (A.7) Hence, Condition A..(vi) is also veried. We next apply Lemma A.4 to get (A.6) and (A.6) above. By (A.68), R (h) e R (h) = E[minf F F ; g] E[minf ;F ;F ; g] for any h 2 H (A.7) and R (h) e R (h) = E minf ;F ;F ; g E minf ;F ;F ; g = for any h 2 H : (A.72) By Lemma A.4, (A.7) and (A.72), we have AsyRD( b eo ; b ) = lim max! = lim! max ( sup h2h sup E h2h h R (h) i h R e (h) ; sup R (h) h2h h minf F F ; g i ) R e (h) i minf ;F ;F ; g ; (A.73) 36

37 and AsyRD( b eo ; b ) = lim min inf! = lim! min h2h h R (h) h inf E minf F F ; g h2h i h R e (h) ; inf R (h) h2h minf ;F ;F ; g i R e (h) i ; ; (A.74) which proves (A.6) and (A.6). In the second step, we show that lim max! lim min! sup h2h [g (h)] ; inf h2h [g (h)] ; = max sup h2h [g(h)] ; ; and (A.75) = min inf [g(h)] ; : (A.76) h2h By Lemma A.6, lim sup [g (h)] = sup [g(h)] and! h2h h2h lim inf [g (h)] = inf [g(h)] ;! h2h h2h (A.77) where sup h2h [g(h)] and inf h2h [g(h)] are nite real numbers. Let f(x) = max(x; ) and f(x) = min(x; ). It is clear that f(x) and f(x) are continuos function on R. The asserted results in (A.75) and (A.76) follow by (A.77), and the continuity of f(x) and f(x). Proof of Theorem 5.2. For any F 2 F, dene D F = ( 2;F ;F ) ;F : (A.78) Recall that we have dened A F = ( ;F 2;F ) and B F = ( 2;F ;F ) ( 2;F ;F ) (A.79) in Theorem 5.2 and (??) respectively. By the denition of F, E[ F F ] = tr( ;F ) + 2tr(A F )J ;F + tr(a F ) 2 J 2;F (A.8) where " Zd;2;F J ;F = E D # F Z d;2;f Zd;2;F B F Z d;2;f + tr(a F ) " Zd;2;F and J 2;F = E B # F Z d;2;f (Zd;2;F B F Z d;2;f + tr(a F )) 2 : (A.8) 37

38 We provide a upper bound for J ;F dened in (A.8). Dene a function (x) x x B F x + tr(a F ) for any x 2 Rr 2 : (A.82) Its = x B F x + tr(a F ) I r 2 2B F (x B F x + tr(a F )) 2 xx : (A.83) Then J ;F = E [(Z d;2;f ) D F Z d;2;f ]. Note that D F Z d;2;f = D F Z 2;F by construction because the last r columns of ;F are zeros. Applying Lemma A.9 yields tr (D F 2;F ) = tr ( 2;F ;F ) ;F 2;F = tr( ;F 2;F 2;F ;F 2;F ;F ) = tr( ( 2;F ;F )) = tr(a F ): (A.84) By Lemma of Hansen (26), which is a matrix version of the Stein's Lemma (Stein, 98), J ;F = E (Z d;2;f ) D F Z d;2;f = E tr D F 2;F : Plugging (A.82)-(A.84) into (A.85), we have J ;F 2 " # 3 tr (D F 2;F ) 6 tr B F Z d;2;f Zd;2;F D F 2;F 7 = E Zd;2;F B 2E 4 F Z d;2;f + tr(a F ) 2 5 Zd;2;F B F Z d;2;f + tr(a F ) 2 3 " # tr(a F ) 6 Zd;2;F = E Zd;2;F B + 2E 4 D F 2;F B F Z d;2;f 7 F Z d;2;f + tr(a F ) 2 5 (A.86) Zd;2;F B F Z d;2;f + tr(a F ) where the second equality is by (A.84). By denition and Lemma A.9 Z d;2;f D F 2;F B F Z d;2;f = Z d;2;f ( 2;F = Z d;2;f ( 2;F ;F ) ;F 2;F ( 2;F ;F ) ( 2;F ;F )Z d;2;f ;F ) ( ;F 2;F )( 2;F ;F )Z d;2;f max ( =2 ( ;F 2;F ) =2 )(Z d;2;f ( 2;F ;F ) ( 2;F ;F )Z d;2;f ) = max (A F )Z d;2;f B F Z d;2;f ; (A.87) where the last equality is by max ( =2 ( ;F 2;F ) =2 ) = max (( ;F 2;F )). Combining the 38

39 results in (A.86) and (A.87), we get 2 " # tr(a F ) 6 J ;F E Zd;2;F B + 2E 4 max(a F )Zd;2;F B F Z d;2;f F Z d;2;f + tr(a F ) Zd;2;F B F Z d;2;f + tr(a F ) " # tr(a F ) =E Zd;2;F B F Z d;2;f + tr(a F ) 2h i 3 6 Zd;2;F B F Z d;2;f + tr(a) max (A F ) tr(a F ) max (A F ) 7 + 2E Zd;2;F B F Z d;2;f + tr(a F ) 2 " # 2 =E max (A F ) tr(a F ) 6 2 Zd;2;F B E max (A F )tr(a F ) 4 F Z d;2;f + tr(a F ) Zd;2;F B F Z d;2;f + tr(a F ) : (A.88) Next, note that 2 6 Zd;2;F J 2;F =E 4 B F Z d;2;f 7 Zd;2;F B F Z d;2;f + tr(a F ) = E 4 Z d;2;f B F Z d;2;f + tr(a F ) tr(a F ) 7 Zd;2;F B F Z d;2;f + tr(a F ) " # 6 tr(a F ) 7 =E Zd;2;F B E 4 F Z d;2;f + tr(a F ) Zd;2;F B F Z d;2;f + tr(a F ) 2 5 : (A.89) Combining (A.8), (A.88), (A.89) and the denition of g(h) (in Theorem 5.), we obtain that g(h d;f )=2tr(A F )J ;F + tr(a F ) 2 J 2;F " # B 2 2tr(A F ) max (A F ) tr(a F Zd;2;F B F Z d;2;f + tr(a F ) " # +tr(a) 2 Zd;2;F B F Z d;2;f + tr(a F ) =E " # tr(a F ) (4 max (A F ) tr(a F )) Zd;2;F B F Z d;2;f + tr(a F ) tr(A F ) E max (A F ) 7C 4 Zd;2;F B F Z d;2;f + tr(a F ) 2 5A tr(a F ) 7C E 4 Zd;2;F B F Z d;2;f + tr(a F ) 2 5A E 4 tr(a F ) 2 (4 max (A F ) + tr(a F )) 7 Zd;2;F B F Z d;2;f + tr(a F ) 2 5 : (A.9) For all G 2 and 2 such that h = (d; vec(g 2 ) ; vech( 2 ) ) 2 H, we have G 2 = G 2;F and 2 = 39

40 2;F for some F 2 F by the denition of H. If tr(a F ) >, then max (A F ) > and thus the second term in the right-hand side of the last equality of (A.9) will be negative. If in addition tr(a F ) 4 max (A F ), then the rst term in the right-hand side of the last equality of (A.9) will be non-negative. As a result, when tr(a F ) > and 4 max (A F ) tr(a F ) for 8F 2 F, we have sup h2h [g(h)] <. This combined with Theorem 5. implies the results of this theorem. A.3 Proof of the Results in Section 6 Lemma A.7 Suppose that Assumption 6. holds. Consider ff n g such that v Fn v F 2 and n =2 Fn! d with kdk <. We have! v F for some h i where g(h F;d ) = E F F ;F ;F. lim lim E F! n! n [`( b eo ; Fn ) `( b ; Fn )] = g(h F;d ) Proof of Lemma A.7. For the sequence of DGPs ff n g considered in the lemma, by Assumptions 6..(i), 6..(ii) and 6..(iv), we can use Lemma A.8 to deduce n=2 ( b Fn ) n =2 ( b 2 Fn ) A! ;F 2;F A (A.9) where d = ( r ; d ). In the proof of Lemma 4.2, we have show that b k = 2;F + o p () and b G k = G k;f + o p () (A.92) under v Fn! v F, Assumptions 6..(i), 6..(ii) and 6..(iv). By (A.92), Assumption 6..(iv) and the Slutsky Theorem, b ;F and b 2;F are consistent estimators of ;F and 2;F respectively. By the consistency of b ;F and b 2;F, the weak convergence in (A.9) and the CMT, we deduce that n =2 ( b eo Fn )! D F : (A.93) Collecting the results in (A.9) and (A.93), and then applying the CMT and the Portmanteau Lemma, we get lim E F n! n [`( b eo ; Fn ) `( b ; Fn )]! g (h) (A.94) h where g (h) = E minf F F ; g min i ;F ;F ;. The asserted result follows by Lemma A.6 and (A.94). 4

41 Proof of Lemma 6.. Let ff n g be a sequence such that lim supe Fn [`( b eo ; Fn ) n! `( b ; Fn )] = lim sup sup E F [`( b eo ; F ) `( b ; F )] : (A.95) n! F 2F n Such a sequence always exists by the denition of supremum. The sequence fe Fn [`( b eo ; Fn ) `( b ; Fn )]: n g may not converge. However, by the denition of limsup, there exists a subsequence of fng, say fp n g, such that fe Fpn [`( b eo ; Fpn ) `( b ; Fpn )]: n g converges and lim E F n! pn [`( b eo ; Fpn ) `( b ; Fpn )] = lim sup sup E F [`( b eo ; F ) `( b ; F )] : (A.96) n! F 2F n Below we show that for any subsequence fp n g of fng such that fe Fpn [`( b eo ; Fpn ) `( b ; Fpn )]: n g is convergent, there exists a subsequence fp ng of fp n g such that lim n! E F p n [`( b eo ; Fp n ) `( b ; Fp n )] = g (h) for some h 2 H CL (A.97) Because lim n! E Fp n [`( b eo ; Fp n ) `( b ; Fp n )] = lim n! E Fpn [`( b eo ; Fpn ) `( b ; Fpn )], which combined with (A.96) and (A.97) implies that lim sup sup E F [`( b eo ; F ) `( b ; F )] = g (h) for some h 2 H CL : (A.98) n! F 2F n The desired result in (6.5) follows immediately by (A.98). To show that there exists a subsequence fp ng of fp n g such that (A.97) holds, it suces to show that for any sequence ff n g and any subsequence fp n g of fng, there exists a subsequence fp ng of fp n g for which we have (p n) =2 Fp n! d for kdk C L and v Fp n! v F (A.99) for some F 2 F. By (A.99), we can use Lemma A.7 to deduce that lim E F n! p [`( b n eo ; Fp ) `( b n ; Fp )] = g n (h F;d ) (A.) for the sequence of DGPs ff p n g satises (A.99). Moreover, we have h F;d 2 H CL by the denition of H CL, which together with (A.) proves (A.97). Finally, we show that for any sequence ff n g and any subsequence fp n g of fng, there exists a subsequence fp ng of fp n g for which (A.99) holds. Let pn;j denote the j-th component of pn p ;n = p n for any n. For j =, we have jp =2 j;n p j;n ;jj C L for any n by Assumption 6..(vi). and 4

42 Hence there is some subsequence fp j+;n g of fp j;n g, p =2 j+;n p j+;n ;j! d j for some jd j j C L. As r is a xed positive integer, we can apply the same arguments successively for j = ; : : : ; r to obtain a subsequence fp r ;ng of fp n g such that (p r ;n) =2 pr ;n! d with jd j j C L for j = ; : : : ; r. Since (p r ;n) =2 pr ;n CL for any n by Assumption 6..(vi), we have kdk C L. By Assumptions 3.2.(ii) and 6..(v), is a compact set. Hence, there is a subsequence fp ng of fp r ;ng such that v Fp n! v F, which nishes the proof of (A.99). Proof of Theorem 6.. By (A.9) in the proof of Theorem 5.2 g(h d;f )E " # tr(a F ) (4 max (A F ) tr(a F )) Zd;2;F B F Z d;2;f + tr(a F ) 2 6 E 4 tr(a F ) 2 (4 max (A F ) + tr(a F )) 7 Zd;2;F B F Z d;2;f + tr(a F ) 2 5 : (A.) 3 By Jensen's inequality, " # E Zd;2;F B F Z d;2;f + tr(a F ) tr(d B F d) + 2tr(A F ) ; (A.2) and similarly E 4 Zd;2;F B F Z d;2;f + tr(a F ) jtr(d B F d) + 2tr(A F )j 2 : (A.3) (A.2) and (A.3), combined with tr(a F ) 4 max (A F ) and max (A F ) >, imply that g(h) tr(a F ) (4 max (A F ) tr(a F )) tr(d B F d) + 2tr(A F ) tr(a F ) (4 max (A F ) tr(a F )) C 2 L max(b F ) + 2tr(A F ) tr(a F ) 2 (4 max (A F ) + tr(a F )) jtr(d B F d) + 2tr(A F )j 2 tr(a F ) 2 (4 max (A F ) + tr(a F )) C 2 L max(b F ) + 2tr(A F ) 2 : (A.4) Since is positive semi-denite, by denition of B F and Assumption 6..(iv), max (B F ) <. Moreover, by Assumptions 6..(iv) and 6..(vi), tr(a F ) < and C L <. Therefore (A.4) immediately implies that g(h)< for any h 2 H CL. 42

43 References [] Andrews, D. W. K. (999): \Consistent Moment Selection Procedures for Generalized Method of Moments Estimation," Econometrica, 67(3), [2] Andrews, D. W. K. and X. Cheng (23): \Maximum Likelihood Estimation and Uniform Inference with Sporadic Identication Failure," Journal of Econometrics, 73(), [3] Andrews, D. W. K., X. Cheng and P. Guggenberger (2): \Generic Results for Establishing the Asymptotic Size of Condence Sets and Tests," Cowles Foundation Discussion Paper, No.83. [4] Andrews, D. W. K. and P. Guggenberger (26): \On the Asymptotic Maximal Risk of Estimators and the Hausman Pre-Test Estimator," unpublished manuscript. [5] Andrews, D. W. K. and B. Lu (2): \Consistent Model and Moment Selection Procedures for GMM Estimation with Application to Dynamic Panel Data Models," Journal of Econometrics, (), [6] Ashley, R. (29): \Assessing the Credibility of Instrumental Variables Inference with Imperfect Instruments via Sensitivity Analysis," Journal of Applied Econometrics, 24, 325{337. [7] Berkowitz, D., M. Caner and Y. Fang (22): \The Validity of Instruments Revisited," Journal of Econometrics, 66(2), [8] Buckland, S. T., K. P. Burnham, and N. H. Augustin (997): \Model Selection: An Integral Part of Inference," Biometrics, 53, [9] Charkhi, A., G. Claeskens and Hansen, B. (26): "Minimum Mean Squared Error Model Averaging in Likelihood Models," Statistica Sinica, Vol. 26, [] Cheng, X. and B. Hansen (25): \Forecasting with Factor-Augmented Regression: A Frequentist Model Averaging Approach," Journal of Econometrics, 86(2), [] Cheng, X. and Z. Liao (25): \Select the Valid and Relevant Moments: An Information-Based LASSO for GMM with Many Moments," Journal of Econometrics, 86(2), [2] Claeskens, G. and R. J. Carroll (27): \An Asymptotic Theory for Model Selection Inference in General Semiparametric Problems," Biometrika, 94, [3] Conley T. G., C. B. Hansen, and P. E. Rossi (22): \Plausibly Exogenous," Review of Economics and Statistics, 94(), [4] DiTraglia, F. (26): \Using Invalid Instruments on Purpose: Focused Moment Selection and Averaging for GMM," Journal of Econometrics, 95(2), [5] Doko Tchatoka, F. and J.-M. Dufour (28): \Instrument Endogeneity and Identicationrobust Tests: Some Analytical Results," Journal of Statistical Planning and Inference, 38(9), 2649{

44 [6] Doko Tchatoka, F. and J.-M. Dufour (24): \Identication-robust Inference for Endogeneity Parameters in Linear Structural Models," Econometrics Journal, 7, 65{87. [7] Eichenbaum, M., L. Hansen and K. Singleton (988): \A Time Series Analysis of Representative Agent Models of Consumption and Leisure Choice under Uncertainty," Quarterly Journal of Economics, 3(), [8] Guggenberger, P. (22): \On the Asymptotic Size Distortion of Tests when Instruments Locally Violate the Exogeneity Assumption," Econometric Theory, 28(2), [9] Hansen, B. E. (27): \Least Squares Model Averaging," Econometrica, 75, [2] Hansen, B. E. (25): \Shrinkage Eciency Bounds," Econometric Theory, 3, [2] Hansen, B. E. (26): \Ecient Shrinkage in Parametric Models," Journal of Econometrics, 9(), 5{32. [22] Hansen, B. E. (27): \A Stein-Like 2SLS Estimator," Econometric Reviews, 36, [23] Hansen, B.E., and Racine, J. (22): \Jackknife Model Averaging," Journal of Econometrics, 67, 38{46. [24] Hansen, L. P. (982): \Large Sample Properties of Generalized Method of Moments Estimators," Econometrica, 5, [25] Heckman, J. J. (976): \Estimation of A Human Capital Production Function Embedded in A Life-cycle Model of Labor Supply," In Household Production and Consumption, NBER, [26] Hong, H., B. Preston and M. Shum (23): \Generalized Empirical Likelihood-Based Model Selection Criteria for Moment Condition Models," Econometric Theory, 9(6), [27] Hjort, N. L. and G. Claeskens (23): \Frequentist Model Average Estimators," Journal of the American Statistical Association, 98, [28] Hjort, N. L. and G. Claeskens (26): \Focused Information Criteria and Model Averaging for the Cox Hazard Regression Model," Journal of the American Statistical Association,, [29] James W. and C. Stein (96): \Estimation with Quadratic Loss," Proc. Fourth Berkeley Symp. Math. Statist. Probab., vol(), [3] Kabaila, P. (995): "The Eect of Model Selection on Condence Regions and Prediction Regions," Econometric Theory,, 537{549. [3] Kabaila, P. (998): "Valid Condence Intervals in Regression After Variable Selection," Econometric Theory, 4, 463{482. [32] Kang, H., A. Zhang, T. T. Cai and Small, D. S. (26): "Instrumental Variables Estimation with Some Invalid Instruments and Its Application to Mendelian Randomization," Journal of the American Statistical Association, Vol.,

45 [33] Kolesar, M., R. Chetty, J. Friedman, E. Glaeser and G. Imbens (25): \Identication and Inference with Many Invalid Instruments," Journal of Business Economics and Statistics, 33(4) [34] Leeb, H. and B. M. Potscher (25): \Model Selection and Inference: Facts and Fiction," Econometric Theory, 2, 2{59. [35] Leeb, H. and B. M. Potscher (28): \Sparse Estimators and the Oracle Property, or the Return of the Hodge's Estimator," Journal of Econometrics, 42(), 2-2. [36] Lehmann, E. L. and G. Casella (998): Theory of Point Estimation, Springer-Verlag New York, Inc. [37] Leung, G. and A. Barron (26): \Information Theory and Mixing Least-square Regressions," IEEE Transactions on Information Theory, 52, 3396{34. [38] Liao, Z. (23): \Adaptive GMM Shrinkage Estimation with Consistent Moment Selection," Econometric Theory, 29, {48. [39] Liu, C-A. (25): \Distribution Theory of the Least Squares Averaging Estimator," Journal of Econometrics, 86(), 42{59. [4] Lu, X. and L. Su (25): \Jackknife Model Averaging for Quantile Regressions," Journal of Econometrics, forthcoming. [4] Moon. H. R. and F. Schorfheide (29): \Estimation with Overidentifying Inequality Moment Conditions," Journal of Econometrics, 53(2), 36{54. [42] Nevo, A. and A. Rosen (22): \Identication with Imperfect Instruments," Review of Economics and Statistics, 93(3), [43] Potscher, B. M. (99): "Eects of Model Selection on Inference," Econometric Theory, 7 63{85. [44] Potscher, B. M. (26): \The Distribution of Model Averaging Estimators and an Impossibility Result Regarding Its Estimation," in: H.-C. Ho, C.-K. Ing and T.-L. Lai (eds.), Time Series and Related Topics: In Memory of Ching-Zong Wei. IMS Lecture Notes and Monograph Series, Vol. 52., 26, [45] Sargan, J. (958): \The estimation of Economic Relationships Using Instrumental Variables," Econometrica, 26(3), [46] Shibata, R. (986): "Consistency of Model Selection and Parameter Estimation," Journal of Applied Probability, special volume 23A, 27{4. [47] Small, D. S. (27): \Sensitivity Analysis for Instrumental Variables Regression with Overidentifying Restrictions," Journal of the American Statistical Association, 2(479), 49{58. 45

46 [48] Small, D. S., Cai, T. T., Zhang, A. and Kang, H. (25): \Instrumental Variables Estimation with Some Invalid Instruments and Its Application to Mendelian Randomization," Journal of the American Statistical Association, forthcoming. [49] Stein, C. M. (956): \Inadmissibility of the Usual Estimator for the Mean of A Multivariate Normal Distribution," Proc. Third Berkeley Symp. Math. Statist. Probab., vol(), [5] Van der Vaart, A. W. (998): Asymptotic Statistics, Cambridge University Press. [5] Yang, Y. (2): "Combining Dierent Regression Procedures for Adaptive Regression," Journal of Multivariate Analysis, 74, 35{6. [52] Yang, Y. (23): "Regression with Multiple Candidate Models: Selecting or Mixing?" Statistica Sinica, 3, 783{89. [53] Yang, Y. (24): "Combining Forecasting Procedures: Some Theoretical Results," Econometric Theory, 2, 76{222. [54] Yang, Y. (25): \Can the Strengths of AIC and BIC be Shared?{A Conict Between Model Identication and Regression Estimation," Biometrika, 92, 937{95. 46

47 Supplemental Appendix of \An Averaging GMM Estimator Robust to Misspecication" Xu Cheng, Zhipeng Liao, Ruoyao Shi In this supplemental appendix, we present supporting materials for Cheng, Liao and Shi (27) (cited as CLS hereafter in this Appendix): Section B provides primitive conditions for Assumptions 3., 3.2 and 3.3 and the proof of Lemma 3. of CLS. Section C provides the proof of (4.3) in Section 4 and the proof of some Lemmas in Appendix A. of CLS. Section D studies the bounds of asymptotic risk dierence of the pre-test GMM estimator presented in Figure 2 of CLS. Section E includes extra simulation studies. Section F presents the uniform dominance result in a Gaussian location model. B Primitive Conditions for Assumptions 3., 3.2 and 3.3 and Proof of Lemma 3. of CLS In this section, we provide primitive conditions for Assumption Assumptions 3., 3.2 and 3.3 in the linear IV model presented in Example 3. of CLS. We rst provide a set of sucient conditions without imposing the normal distribution assumption on (X ; Z ; V ; U) in Lemma B.. Then, we impose the normal assumptions and show that these conditions can be simplied to those in Lemma 3. of CLS under normality. For ease of notations, we dene z vu 2 E F [Z V U 2 ], z z u 2 E F [Z Z U 2 ] and vvu 2 E F [V V U 2 ]. The Jacobian matrices are G ;F = E F [Z X ] and G 2;F E F [Z X ] E F [Z X ] A : (B.) Let Z 2 = (Z ; Z ). The variance-covariance matrix of the moment conditions is 2;F = E F [Z 2 Z 2(Y X ) 2 ] E F [(Y X )Z 2 ]E F [(Y X )Z 2]: (B.2) By denition, ;F is the leading r r submatrix of 2;F. Let F denote the joint distribution of W = (Y; Z ; Z ; X ) induced by, and F. By denition, we can write F = uu, G 2;F z x ux vx A, 2;F z z u 2 2;r;F 2;r;F 2;rr;F A (B.3) 47

48 where 2;r;F = z u 3 + z vu 2 = 2;r;F, and 2;rr;F = u 2 u 2 + u 3 v + vu 3 + vvu 2: (B.4) Therefore, the parameter v F dened in (3.4) depends on F through F and, and its dependence on F is through v ;F, where v ;F u 2 u 2; uu; vec( z x) ; vec( ux ) ; vec( vx ) ; vech( z z u 2) ; vec( z u 3) ; vec( z vu 2) ; vec( u 3 v) ; vech( vvu 2) A : (B.5) Dene 2;max maxfsup max ( 2;F ); sup max (G 2;F G 2;F )g; F 2F F 2F 2;min minf inf min( 2;F ); inf min(g 2;F G 2;F )g; F 2F F 2F C W sup F 2F E F [jj(x ; Z ; V ; U)jj 2 ] and C sup 2 k k 2 : (B.6) In the proof of Lemma B. below, we show that 2;max < (see (B.4) and (B.8)). Moreover, we have 2;min >, C W < and C < by Assumptions B..(iii), B..(ii) and B..(vii) respectively. Dene B c 2 f 2 R r : kk 2;min g: (B.7) Let be a non-empty set in R d. Dene 2;max C =2 B f 2 R d : k k 4 2;min 3 2;maxC C 2 W for any 2 g: (B.8) Let fc j; ; C j; g r j= be a set of nite constants. We next provide the low-level sucient conditions for Assumptions 3., 3.2 and 3.3. Assumption B. The following conditions hold: (i) E F [V ] =, E F [U] =, E F [Z U] = r and E F [V U] = r for any F 2 F ; (ii) sup E F [jjxjj 4+ + jjz jj 4+ + jjv jj 4+ + U 6 ] < for some > ; F 2F (iii) inf E F 2F F [U 2 ] >, inf min( xz F 2F z x) > and inf min( 2;F ) > ; F 2F (iv) inf inf kk jj( xz F 2F 2B c z z u 2 z vu 2 xv) + xz z z u 2 z u 3 xujj > ; 2 (v) the set fv ;F : F 2 F g is closed; (vi) 2, B int() and is compact; (vii) = [c ; ; C ; ] [c r ;; C r ;] where c j; < < C j; for j = ; : : : ; r. Lemma B. Suppose that fw i g n i= are i.i.d. and generated by the linear model (3.6) and (3.8) in CLS. Then under Assumption B., F satises Assumptions 3., 3.2 and

49 For the linear IV model, Lemma B. provides simple conditions on, and F on which uniformity results are subsequently established. Proof of Lemma B.. By Assumption B..(i) and the denition of G ;F, E F [g (W; )] = E F [Z (U X ( ))] = G ;F ( ); (B.9) which together with Assumption B..(iii) implies that F = and hence E F [g (W; F )] = r. Also F 2 int() holds by F = and Assumption B..(vi). This veries Assumption 3..(i). By (B.9) for any 2 with jj F jj " and any F 2 F ke F [g (W; )]k =2 min (G ;F G ;F ) k F k " =2 min (G ;F G ;F ) (B.) which combined with Assumption B..(iii) and G ;F = xz ;F implies that inf F 2F inf jje 2B"( c F [g (W; )] jj > : F ) (B.) This veries Assumption 3..(ii). Next, we show Assumption 3..(iii). Let Z 2 (Z ; Z ). By the Lyapunov inequality, Assumptions B..(i)-(ii) and B..(vii), sup E F [jjz 2 jj 2 ] F 2F sup E F [jjz jj 2 ] + 2 sup E F [jjv jj 2 ] F 2F F 2F +2 sup k k 2 sup E F [U 2 ] < : 2 F 2F (B.2) By (B.2), the Holder inequality, the Lyapunov inequality and Assumption B..(ii), sup kg 2;F k = sup F 2F F 2F EF [Z 2 X ] sup F 2F (E F [jjz 2 jj 2 ]) =2 sup F 2F (E F [jjxjj 2 ]) =2 < ; (B.3) which together with the denition of G 2;F and the Cauchy-Schwarz inequality implies that sup G 2;F G 2;F < : F 2F (B.4) Similarly by the Cauchy-Schwarz inequality, the Lyapunov inequality, Assumptions B..(ii) and B..(vii), we have sup E F [jjz 2 jj 4 ] = sup E F [(jjz jj 2 + jjz jj 2 ) 2 ] F 2F F 2F 2 sup E F [jjz jj 4 ] + 2 sup E F [jjz jj 4 ] F 2F F 2F 2 sup F 2F E F [jjz jj 4 ] + 8 sup F 2F E F [jjv jj 4 ] +8 sup k k 4 sup E F [U 4 ] < : 2 F 2F (B.5) 49

50 By (B.2), (B.5), Assumption B..(ii), the Lyapunov inequality and the Holder inequality, we have sup EF [Z 2 Z2(Y F 2F X ) 2 ] sup E F [jjz 2 jj 2 (Y X ) 2 ] F 2F sup(e F [jjz 2 jj 4 ]) =2 sup (E F [U 4 ]) =2 < ; F 2F F 2F (B.6) and sup EF [(Y F 2F X )Z 2 ] sup (E F [jjz 2 jj 2 ]) =2 sup (E F [U 2 ]) =2 < : F 2F F 2F (B.7) By the denition of 2;F, the triangle inequality, the Cauchy-Schwarz inequality and the results in (B.6) and (B.7), sup k 2;F k < : (B.8) F 2F We then show that F 2 int(). By the triangle inequality, the Cauchy-Schwarz inequality and the Holder inequality, kg 2;F k k xz k + k k k xu k + k xv k (E F [jjxjj 2 ]) =2 (E F [jjz jj 2 ]) =2 + k k (E F [jjxjj 2 ]) =2 (E F [U 2 ]) =2 +(E F [jjxjj 2 ]) =2 (E F [jjv jj 2 ]) =2 C W (2 + C =2 ), (B.9) for any F 2 F, where C W < by Assumptions B..(ii) and (vii). Since G 2;F = (G ;F ; G r ;F ) where G r ;F = E F [UX ] E F [V X ], we have G 2;F G 2;F = G ;F G ;F + G r ;F G r ;F ; (B.2) which implies that for any F 2 F, min (G 2;F G 2;F ) min (G ;F G ;F ): (B.2) To show Assumption 3..(iii), we write Q F () = E F [Z 2 (Y X )] 2;F E F [Z 2 (Y X )] = G 2;F 2;F G 2;F + 2 G 2;F 2;F C F + CF 2;F C F ; (B.22) where C F = E F [Z 2 Y ]. Since G 2;F 2;F G 2;F is non-singular by (B.8), (B.2) and Assumption 5

51 B..(iii), Q F () is minimized at F = (G 2;F 2;F G 2;F ) G 2;F 2;F C F for any F 2 F. Therefore, k F k 2 = (G 2;F 2;F G 2;F ) G 2;F 2;F E F [Z 2 U] 2 max( 2;F ) 2 min (G 2;F G 2;F ) E F UZ 2 2;F G 2;F G 2;F 2;F E F [Z 2 U] 2 2 max( 2;F ) max (G 2;F G 2;F ) 2 uu 2 min ( 2;F ) 2 min (G 2;F G 2;F ) 4 2;min 3 2;maxC C 2 W k k 2 (B.23) for any F 2 F. By Assumption B..(vi), F 2 int(). Moreover for any 2 with jj F jj ", Q F () Q F ( F ) min (G 2;F 2;F G 2;F ) k F k 2 " 2 min (G 2;F 2;F G 2;F ) " 2 max( 2;F ) min (G 2;F G 2;F ); (B.24) which together with (B.8), (B.2) and Assumption B..(iii) implies that inf F 2F inf [Q 2B" c ( F ) F () Q F ( F )] > : (B.25) This veries Assumption 3..(iii). Next, we verify Assumption 3..(iv). Let (22) 2;F = ( 2;rr;F 2;r;F z z u 2 2;r;F ), where 2;r;F and 2;rr;F are dened in (B.4). Then G 2;F 2;F 2;F = ( xz ; xv + xu z z u 2 2;r;F I r A (22) 2;F uu = uu [( xz z z u 2 z u 3 xu) + xz z z u 2 z vu 2 xv] (22) 2;F = uu (22) 2;F ( xz z z u 2 z u 3 xu) + uu ( xz z z u 2 z vu 2 xv) (22) 2;F ; (B.26) by the formula of the inverse of partitioned matrix. For any 2 with k k >, we have ( (22) 2;F )2 ( (22) 2;F ) 2 ( min ( (22) 2;F ))2 ( max ( (22) 2;F ))2 2 2;min C 2 2;max (B.27) and (22) 2;F = (22) 2;F ( ( (22) 2;F )2 ) =2 ( ( (22) k k 2;F )2 ) =2 2;max (22) 2;F ( ( (22) (B.28) 2;F )2 ) =2 5

52 where the last inequality in (B.27) and the inequality in (B.28) are due to min ( (22) 2;F ) min( 2;F ) = 2;max and max ( (22) 2;F ) max( 2;F ) = 2;min : Therefore, for any F 2 F with 2;F = uu ( r ; ) and k k >, G 2;F 2;F 2;F k 2;F k = (22) 2;F ( xz z z u 2 z vu 2 xv) (22) k k +( xz z z u 2 z u 3 xu) (22) 2;F ( xz 2;max ( ( (22) 2;F )2 ) =2 +( xz ( = 2;max jj e xz z z u 2 z vu 2 xv) e jj +( xz z z u 2 z u 3 xu) 2;F (22) 2;F z z u 2 z vu 2 xv) (22) 2;F (22) 2;F z z u 2 z u 3 xu) (B.29) where e (22) 2;F = (22) 2;F and the inequality is by (B.28). By (B.28) and the denition of B c 2, e 2 B c 2. Therefore, (B.29) implies that G 2;F 2;F 2;F k 2;F k 2;max inf kk 2B c 2 ( xz z z u 2 z vu 2 xv) +( xz z z u 2 z u 3 xu) : (B.3) Collecting the results in (B.8) and (B.3) and then applying Assumption B..(iv), we get inf ff 2F: k F k>g G 2;F 2;F 2;F k 2;F k > (B.3) which shows Assumption 3..(iv) with =. Assumption 3..(v) is implied by Assumption B..(vii). This nishes the verication of Assumption 3.. To verify Assumption 3.2, note that g 2 (W; ) = Z 2 (U X ( )), g 2; (W; ) = Z 2 X and g 2; (W; ) = (r2 d )d. Therefore, Assumption 3.2.(i) holds automatically. Moreover Assumption 3.2.(ii) is implied by Assumption B..(ii) and the assumption that is bounded. Assumptions 3.2(iii)-(iv) follow from Assumption B..(iii). We next verify Assumption 3.3. By denition, v F = vec(g 2;F ) ; vech( 2;F ) ; F. (B.32) Let = fv ;F : F 2 F g. From the expressions in (B.3), we see that = fv F : F 2 Fg is the 52

53 image of under a continuous mapping. By Assumption B..(ii) and the Holder inequality, is bounded which together with Assumption B..(v) implies that is compact. Since is also a compact set by Assumption B..(vii), we know that is compact. Therefore, is compact and hence closed. This veries Assumption 3.3.(ii). Let " F = uu c where c = min fmin jr jc j; j; min jr jc j; jg. Below we show that for any e 2 R r with jj e jj " F, there is F e 2 F such that e ef = e, jjg 2; e F G 2;F jj C jj e F jj =4 and jj 2; e F 2;F jj C 2 jj e jj =4 (B.33) for some xed constants C and C 2. This veries Assumption 3.3.(i) with = =4. First if e = r, then we set F e to be F which is induced by, and F with = r. By denition G 2; F e = G 2;F, 2; F e = 2;F and e ef = F = uu = = e which implies that (B.33) holds. Second consider any e 2 R r with < jj e jj < " F. Dene e = e uu. Since jj e jj < " F and " F = uu c, jj e jj = jj e uu jj = jj e jjuu < c ; (B.34) which combined with the denition of implies that e 2. Let e F be the joint distribution induced by e, and F. By the denition of F, we have e F 2 F. Moreover, e ef = e uu = e (B.35) which veries the equality in (B.33). By denition, G 2; F e E F [Z X ] e E F [UX ] E F [V X ] A and G 2;F E F [Z X ] E F [V X ] A (B.36) which together with the Cauchy-Schwarz inequality and the Holder inequality implies that jjg 2; e F G 2;F jj = jj e E F [UX ]jj jj e jj( uu E F [jjxjj 2 ]) =2 = jj e jj 3=4 =4 uu (E F [jjxjj 2 ]) =2 jj e uu jj =4 : (B.37) By Assumption B..(ii), sup E F [jjxjj 2 ] < and F 2F sup uu < F 2F (B.38) which together with (B.34), (B.37) and the denition of e implies that jjg 2; e F G 2;F jj C jj e jj =4 ; (B.39) where C = c 3=4 sup F 2F (E F [jjxjj2 ]) =2 sup F 2F =4 uu is nite. 53

54 To show the last inequality in (B.33), note that by denition ef = = F and hence E ef [Z Z (Y X ef ) 2 ] = E F [Z Z U 2 ] = E F [Z Z (Y X F ) 2 ]: (B.4) Under e F, E ef [Z Z (Y X ef ) 2 ] = E F [Z (U e + V ) U 2 ] = E F [U 3 Z ] e + E F [U 2 Z V ]; (B.4) and Under F, E ef [Z Z (Y X ef ) 2 ] = E F [(U e + V )(U e + V ) U 2 ] = E F [U 4 ] e e + e E F [U 3 V ] + E F [U 3 V ] e + E F [U 2 V V ]: (B.42) E F [Z Z (Y X F ) 2 ] = E F [U 2 Z V ] and E F [Z Z (Y X F ) 2 ] = E F [U 2 V V ]: (B.43) Collecting the results in (B.4), (B.4), (B.42) and (B.43), and applying the triangle inequality, we get E ef [Z 2 Z2(Y X ef ) 2 ] E F [Z 2 Z2(Y X F ) 2 ] E F [U 3 Z ] e + E F [U 4 ] e e + e E F [U 3 V ] + E F [U 3 V ] e : (B.44) By Assumption B..(ii) and the Lyapunov inequality sup E F [juj 5 ] <, F 2F sup E F [jjz jj 4 ] < and F 2F sup E F [jjv jj 4 ] < : F 2F (B.45) By the Holder inequality, EF [U 3 Z ] (EF [jujjjz jj 2 ]E F [juj 5 ]) =2 (E F [juj 5 ]) =2 (E F [jjz jj 4 ]) =4 (E F [U 2 ]) =4 = =4 uu (E F [juj 5 ]) =2 (E F [jjz jj 4 ]) =4 : (B.46) Similarly, we can show that E F [U 3 V ] =4 uu (E F [juj 5 ]) =2 (E F [jjv jj 4 ]) =4 (B.47) 54

55 and E F [U 4 ] (E F [U 2 ]E F [U 6 ]) =2 = =4 uu sup (E F [U 6 ]) =2 : F 2F (B.48) Let C 2; = sup F 2F f(e F [juj5 ]) =2 [(E F [jjz jj 4 ]) =4 +(E F [jjv jj 4 ]) =4 ]+(E F [U 6 ]) =2 g. Combining the results in (B.44), (B.46), (B.47) and (B.48), and applying the Cauchy-Schwarz inequality, we get E ef [Z 2 Z 2(Y X ef ) 2 ] E F [Z 2 Z 2(Y X F ) 2 ] 3C 2; =4 uu jj e jj + C 2; =4 uu jj e jj 2 = (3C 2; jj e jj 3=4 + C 2; jj e jj 7=4 ) =4 uu jj e jj =4 C 2; jj e jj =4 (B.49) where C 2; = C 2; (3c 3=4 + c7=4 ), the second inequality is by (B.34) and the denition of e. By (B.45), Assumption B..(ii) and the denition of c, C 2; < : (B.5) Next note that E ef [Z 2 (Y X ef )] E F [Z U] e uu A and E F [Z 2 (Y X F )] E F [Z U] r A (B.5) which implies that E ef [Z 2 (Y X ef )]E ef [Z2 (Y X ef )] E F [Z 2 (Y X F )]E F [Z2 (Y X F )] r r uu;f E F [Z U] e A e E F [Z U] e uu e 2 uu EF uu [Z U] e + uu e E F [Z U] + 2 uujj e jj 2 2 uu jj e jj ke F [Z U]k + 2 uujj e jj 2 (2 5=4 uu jj e jj 3=4 (E F [jjz jj 2 ]) =2 + 7=4 uu jj e jj 7=4 )jj e jj =4 C 2;2 jj e jj =4 (B.52) where C 2;2 = sup F 2F f25=4 uu (E F [jjz jj 2 ]) =2 c 3=4 + 7=4 uu c 7=4 g, the second inequality is by the Cauchy-Schwarz inequality, the third inequality is by the Holder inequality. By Assumption B..(ii) and the denition of c, C 2;2 < : (B.53) By the denition of 2;F in (B.2), we can use the triangle inequality and the results in (B.49) and (B.52) to deduce that 2; ef 2;F C 2 jj e jj =4 (B.54) 55

56 where C 2 = C 2; + C 2;2 and C 2 < by (B.5) and (B.53), which proves the second inequality in (B.33). This veries Assumption 3.3.(i) with = =4. Proof of Lemma 3.. Next, we apply Lemma B. to prove Lemma 3. in the paper. For convenience, the conditions of Lemma 3. are stated here. The proof veries the conditions of Lemma B. with the following conditions in a Gaussian model. Let F denote the set of normal distributions which satises: (i) u =, z u = r and vu = r ; (ii) inf F 2F min( xz z x) >, sup F 2F jjjj2 < and < inf F 2F min ( ) sup F 2F max ( ) < ; (iii) inf F 2F inf fkk"g kk jj( xz z z z v xv) xujj > for some " > that is small enough (where " is given in (A.3) in the Appendix of CLS); (iv) 2 int() and is compact and large enough such that the pseudo-true value (F ) 2 int(); (v) = [c ; ; C ; ] [c r ;; C r ;] where fc j; ; C j; g r j= is a set of nite constants with c j; < < C j; for j = ; : : : ; r. Specically, we assume that condition (ii) of Lemma 3. holds with some constants c and C such that c min ( xz z x), jjjj 2 C and c min ( ) max ( ) C ; condition (iii) of Lemma 3. holds with inf 2B c " kk jj( xz z z z v xv) xujj c (B.55) for some positive constant c and B c " f 2 R r : kk c ; C ;C g; (B.56) where C ;W 2(d + r 2 + )C, c ; minf; c 2 g and C ; C 2 ;W (2 + C =2 )2 (B.57) and C sup 2 k k 2. Assumption B..(i) holds under Condition (i) of Lemma 3.. Since (X ; Z ; V ; U) is a normal random vector, Assumption B..(ii) holds by kk 2 C and max ( ) C. By min ( ) c and u =, we have E F [U 2 ] c for any F 2 F and hence inf F 2F E F [U 2 ] >. Let F denote the distribution of W induced by F with mean and variance-covariance matrix. By denition, G ;F = E F [Z X ] = z x. Therefore, inf F 2F min (G ;F G ;F ) c > (B.58) holds by min ( xz z x) c > for any F 2 F. Since z u = r and vu = r for any F 2 F, U is independent with respect to (Z ; V ) under the normal assumption. Therefore, by 56

57 Condition (i) of Lemma 3. 2;F uu z z uu z v A uu z v 2 2 uu + uu vv = z z z v A + z z vz vv v v A r r r r r r 2 2 uu A(B.59) which implies that min ( 2;F ) 2 min ( ) where F is the distribution of W induced by F with mean and variance-covariance matrix. Since min ( ) c >, we have inf F 2F min ( 2;F ) c 2 > : (B.6) This nishes the proof of Assumption B..(iii). By (B.59), Conditions (ii) and (v) of Lemma 3. sup max ( 2;F ) 2 max( ) + max ( ) kk max( )C 2C( 2 + C ): F 2F (B.6) By (B.9) in the proof of Lemma B., which implies that kg 2;F k 2C (d + r 2 + )(2 + C =2 ) sup max (G 2;F G 2;F ) 4C(d 2 + r 2 + ) 2 (2 + C =2 )2 : (B.62) F 2F By (B.58) and (B.6), min inf min( 2;F ); inf min(g 2;F G 2;F ) minf; c 2 g: F 2F F 2F (B.63) By (B.6) and (B.62), max sup max ( 2;F ); sup max (G 2;F G 2;F ) 4C(d 2 + r 2 + ) 2 (2 + C =2 )2 : (B.64) F 2F F 2F From (B.63), (B.64), the denitions of c ;, C ; and B c N;, we have Bc 2 B c N; where Bc 2 is dened in (B.7). Moreover, by u =, the normal assumption and the independence between U and (Z ; V ), we have z z u 2 = uu z z, z vu 2 = uu z v and z u 3 = r, which implies that jj( xz z z u 2 z vu 2 xv) + xz z z u 2 z u 3 xujj = jj( xz z z z v xv) xujj: (B.65) 57

58 Assumption B..(iv) follows by B c 2 BN; c, (B.65) and Condition (iii) of the lemma. We next show that Assumption B..(v) holds. Dene v ;F uu; vec( xz ) ; vec( xu ) ; vec( xv ) ; vec( z v) ; vech( z z ) ; vech( vv ) A : Under Condition (i) of Lemma 3. and the normal assumption, u 2 u 2 = 32 uu, z u 3 = r, vu 3 = r, z z u 2 = uu z z, z vu 2 = uu z v and vvu 2 = uu vv. Therefore to verify Assumption B..(v), it is sucient to show that the set fv ;F : F 2 F g is compact because the set fv ;F : F 2 F g is the image of the set fv ;F : F 2 F g under a continuous mapping. Let f( n ; n)g n be a convergent sequence where ( n ; n) satises Conditions (i)-(iii) of Lemma 3. for any n. Let e and e denote the limits of n and n under the Euclidean norm respectively. We rst show that Conditions (i)-(iii) of Lemma 3. hold for ( ; e e ). Since u;n =, z u;n = r and vu;n = r for any n, we have e u =, e z u = r and e vu = r which shows that ( ; e e ) satises Condition (i) of Lemma 3.. Since n! e and k n k 2 C for any n, we have jjjj e 2 C. By the convergence of ( n ; n), xz ;n! e xz. Since the roots of a polynomial continuously depends on its coecients, we have min ( xz ;n xz ;n)! min ( e xz e xz ), min ( n)! min ( e ) and max ( n)! max ( e ) which together with the assumption that xz ;n and n satisfy Condition (ii) of Lemma 3. implies that c min ( e e xz xz ) and c min ( e ) max ( e ) C : This shows that Condition (ii) of Lemma 3. holds for ( ; e e ). For any 2 BN; c, by the triangle inequality, the Cauchy-Schwarz inequality and kk c 2 C 2 C ( + C ) 2, kk jj( e e e xz z z z e v xv ) e xu jj kk jj( xz;n z z ;n z v;n xv;n) xu;njj jj e e e xz z z z v xz ;n z z ;n z v;njj jj e xv xv;njj 2CC 2 ( + C )c 2 jj xu;n e xu jj which together with the convergence of ( n ; n) and Conditions (ii)-(iii) of Lemma 3. implies that kk jj( e e e xz z z z e v xv ) e xu jj c jj e e e xz z z z v xz ;n z z ;n z v;njj jj e xv xv;njj 2CC 2 ( + C )c 2 jj xu;n e xu jj for any n. Let n go to innity, we get kk jj( e xz e z z e z v e xv ) e xu jj c 58

59 for any 2 B". c This shows that Condition (iii) of Lemma 3. also holds for ( ; e e ). Hence the set of (; ) which satises Conditions (i)-(iii) of Lemma 3. is closed. By Conditions (i)-(ii) of the Lemma, we know that this set is compact because it is also bounded. Let F denote the normal distribution with mean and variance-covariance matrix. Then v ;F is the image of (; ) under a continuous mapping, which implies that fv ;F : F 2 F g is compact. Therefore the set fv ;F : F 2 F g is compact and hence closed. This proves Assumption B..(v). Assumption B..(vi) is used to show that F 2 int() and F 2 int() for any F 2 F. By F = and Condition (iv) of Lemma 3., we have F 2 int() and F 2 int(). Finally, Assumption B..(vii) is the same as Condition (v) of Lemma 3.. C Proof of Some Auxiliary Results in Sections 4, 5 and 6 of CLS Proof of Lemma A.2. (i) Let g 2;j (w; ) denote the j-th (j = ; : : : ; r 2 ) component of g 2 (w; ). By the mean value expansion, g 2;j (w; ) g 2;j (w; 2 ) = g 2;j; (w; e ;2 )( 2 ) (C.) for any j = ; : : : ; r 2, where e ;2 is some vector between and 2. By (C.) and the Cauchy-Schwarz inequality je F [g 2;j (w; ) g 2;j (w; 2 )]j E F sup kg 2; (W; )k k 2 k ; (C.2) 2 for any j = ; : : : ; r 2. By (C.2), we deduce that km 2;F ( ) M 2;F ( 2 )k p r 2 E F sup kg 2; (W; )k k 2 k 2 p C M; r2 k 2 k (C.3) for any F 2 F, where C M; sup F 2F E F [sup 2 kg 2; (W; )k] and C M; < by Assumption 3.2.(ii). This immediately proves the claim in (i). The claim in (ii) follows by similar argument and its proof is omitted. (iii) By the mean value expansion, = g 2;j (w; )g 2;j2 (w; ) g 2;j (w; 2 )g 2;j2 (w; 2 ) h g 2;j ;(w; e ;2 )g 2;j2 (w; e ;2 ) + g 2;j (w; e ;2 )g 2;j2 ;(w; e i ;2 ) ( 2 ) (C.4) for any j ; j 2 = ; : : : ; r 2, where e ;2 is some vector between and 2 and may take dierent values 59

60 from the e ;2 in (C.). By (C.4), the triangle inequality and the Cauchy-Schwarz inequality je F [g 2;j (w; )g 2;j2 (w; ) g 2;j (w; 2 )g 2;j2 (w; 2 )]j 2E F sup kg 2 (W; )k kg 2; (W; )k k 2 k 2 E F sup (kg 2 (W; )k 2 + kg 2; (W; )k 2 ) k 2 k (C.5) 2 for any j ; j 2 = ; : : : ; r 2, where the second inequality is by the simple inequality that jabj (a 2 + b 2 )=2. By (C.5) EF g2 (W; )g 2 (W; ) g 2 (W; 2 )g 2 (W; 2 ) k 2 k r 2 E F sup (kg 2 (W; )k 2 + kg 2; (W; )k 2 ) 2 r 2 C M;2 k 2 k (C.6) i for any F 2 F, where C M;2 sup F 2F E F hsup 2 (kg 2 (W; )k 2 + kg 2; (W; )k 2 ) and C M;2 < by Assumption 3.2.(ii). Using the triangle inequality, and the inequality in (C.2), we deduce that je F [g 2;j (w; )]E F [g 2;j2 (w; )] je F [g 2;j (w; ) g 2;j (w; 2 )]E F [g 2;j2 (w; )]j E F [g 2;j (w; 2 )]E F [g 2;j2 (w; 2 )]j + je F [g 2;j (w; 2 )]E F [g 2;j2 (w; 2 ) g 2;j2 (w; )]j 2E F sup kg 2 (W; )k E F sup kg 2; (W; )k k 2 k (C.7) 2 2 for any j ; j 2 = ; : : : ; r 2. By (C.7) EF [g 2 (w; )]E F [g 2 (w; ) ] E F [g 2 (w; 2 )]E F [g 2 (w; 2 ) ] r2 C M;3 k 2 k (C.8) for any F 2 F, where C M;3 2 sup F 2F E F [sup 2 kg 2 (W; )k] E F [sup 2 kg 2; (W; )k] and C M;3 < by Assumption 3.2.(ii). By the denition of 2;F (), the triangle inequality and the results in (C.6) and (C.8) k 2;F ( ) 2;F ( 2 )k r 2 (C M;2 + C M;3 ) k 2 k ; (C.9) which immediately proves the claim in (iii). Proof of Lemma A.3. By Lemma A..(i), g 2 () = M 2;Fn () + " n nx g 2 (W i ; ) i= M 2;Fn () # = M 2;Fn () + o p (); (C.) 6

61 uniformly over 2. As g (W; ) is a subvector of g 2 (W; ), by (C.) and Assumption 3.2.(ii), g () g () = M ;Fn () M ;Fn () + o p () (C.) uniformly over 2. By Assumptions 3..(i)-(ii) and F n 2 F, M ;Fn () M ;Fn () is uniquely minimized at Fn, which together with the uniform convergence in (C.) implies that e Fn! p : (C.2) To show the consistency of 2, note that nx 2 = n g 2 (W i ; e )g 2 (W i ; e ) g 2 ( e )g 2 ( e ) i= = E Fn [g 2 (W; e )g 2 (W; e ) ] M 2;Fn ( e ) M 2;Fn ( e ) + o p () = 2;Fn ( e ) + o p () = 2;Fn + o p (); (C.3) where the rst equality is by the denition of 2, the second equality holds by (C.), Lemma A..(ii) and Assumption 3.2.(ii), the third equality follows from the denition of 2;Fn (), and the last equality holds by Lemma A.2.(iii) and (C.2). This shows the consistency of 2. In the rest of the Supplemental Appendix, we use C denote a generic xed positive nite constant whose value does not depend on F or n. Proof of Lemma A.4. As g () is a subvector of g 2 (), and ;n is a submatrix of 2;n, using (C.), (C.3) and Assumptions 3.2.(ii)-(iii), we have g () ( ) g () = M ;Fn () ;F n M ;Fn () + o p (); (C.4) uniformly over. By Assumptions 3.2.(ii)-(iii), C min ( ;F n ) max ( ;F n ) C (C.5) which together with Assumptions 3..(i)-(ii) implies that M ;Fn () ;F n M ;Fn () is uniquely minimized at Fn. By the standard arguments for the consistency of an extremum estimator, we have b Fn = o p (): (C.6) Using (C.6), Lemma A..(iv) and Assumption 3.2.(ii), we have g ( b h ) = g ( Fn ) + M ;Fn ( b i ) M ;Fn ( Fn ) + o p (n =2 ) = g ( Fn ) + [G ;Fn ( Fn ) + o p ()] ( b Fn ) + o p (n =2 ): (C.7) 6

62 Similarly, n nx g ; (W i ; b ) = G ;Fn ( b ) + o p () = G ;Fn + o p (); i= (C.8) where the rst equality follows from Lemma A..(iii) and the second equality follows by (C.6) and Lemma A.2.(ii). From the rst order condition for the GMM estimator b, we deduce that = " n nx g ; (W i ; b )# ( ) g ( b ) i= h = (G ;F n ;F n + o p ()) g ( Fn ) + (G ;Fn + o p ())( b i Fn ) + o p (n =2 ) (C.9) where the second equality follows from Assumptions 3.2.(ii)-(iii), (C.3), (C.7) and (C.8). By (C.9), E Fn [g (W; Fn )] = and Assumption 3.2, n =2 ( b Fn ) = ( ;Fn + o p ()) n (g (W; Fn )) + o p (): (C.2) By Assumptions 3.2 and Lemma A..(v), with (C.2) implies that ;F n = O() and n (g (W; Fn )) = o p (), which together n =2 ( b Fn ) = ;Fn n (g (W; Fn )) + O p (); where ;F n n (g (W; Fn )) = O p (). This nishes the proof. Proof of Lemma A.5. By (C.), (C.3) and Assumptions 3.2.(ii)-(iii), we have g 2 () ( 2 ) g 2 () = M 2;Fn () 2;F n M 2;Fn () + o p () = Q Fn () + o p () (C.2) uniformly over. By Assumption 3..(iii), Q Fn () is uniquely minimized at F n. The consistency result b 2 F n! p follows from standard arguments for the consistency of an extremum estimator. Proof of Lemma A.6. By the denition of b 2, g 2 ( b 2 ) ( 2 ) g 2 ( b 2 ) g 2 ( Fn ) ( 2 ) g 2 ( Fn ); (C.22) which implies that jjg 2 ( b 2 )jj 2 max ( 2 ) min ( 2) kg 2 ( Fn )k 2 : (C.23) By (C.3) and Assumptions 3.2.(ii)-(iii), C min ( 2 ) max ( 2 ) C (C.24) 62

63 with probability approaching. By Lemma A..(i), M ;Fn ( Fn ) = r and Fn = o(), kg 2 ( Fn )k 2 = o p () (C.25) which combined with (C.23) and (C.24) implies that jjg 2 ( b 2 )jj = o p (): (C.26) Moreover, by (C.26), Lemma A..(i) and the triangle inequality, jjm 2;Fn ( b 2 )jj jjg 2 ( b 2 ) M 2;Fn ( b 2 )jj + jjg 2 ( b 2 )jj = o p () (C.27) which immediately implies that jjm ;Fn ( b 2 )jj = o p (): (C.28) The rst result in Lemma A.6 follows by (C.28) and the unique identication of Fn maintained by Assumptions 3..(i)-(ii). Using b 2 Fn = o p (), Lemma A..(iv) and Assumption 3.2.(ii), we have g 2 ( b h 2 ) = g 2 ( Fn ) + M 2;Fn ( b i 2 ) M 2;Fn ( Fn ) + o p (n =2 ) = g 2 ( Fn ) + [G 2;Fn ( Fn ) + o p ()] ( b 2 Fn ) + o p (n =2 ): (C.29) Similarly, n nx g 2; (W i ; b 2 ) = G 2;Fn ( b 2 ) + o p () = G 2;Fn ( Fn ) + o p (); i= (C.3) where the rst equality follows from Lemma A..(iii) and the second equality follows by b 2 Fn = o p () and Lemma A.2.(ii). From the rst order condition for the GMM estimator b 2, we deduce that = " n nx g 2; (W i ; b 2 )# ( 2 ) g 2 ( b 2 ) i= h = (G 2;F n 2;F n + o p ()) g 2 ( Fn ) + (G 2;Fn + o p ())( b i 2 Fn ) + o p (n =2 ) (C.3) where the second equality follows from Assumptions 3.2.(ii)-(iii), (C.3), (C.29) and (C.3). By (C.3) and Assumption 3.2, n =2 ( b n o 2 Fn ) = ( 2;Fn + o p ()) n (g 2 (W; Fn )) + n =2 E Fn [g 2 (W; Fn )] + o p (); (C.32) where 2;F n = G 2;F n 2;F n G 2;Fn G 2;Fn 2;F n. Proof for the claim in equation (4.3). 63

64 Consider the case n =2 Fn! d 2 R r. By Lemma 4., h i n =2 b (!) Fn = n =2 ( b h Fn ) +! n =2 ( b 2 Fn ) n =2 ( b i Fn )! D ;F Z d;2;f +!( 2;F ;F )Z d;2;f ; (C.33) where Z d;2;f has the same distribution as Z 2;F + d. This implies that `( b h i h i (!)) = n bn (!) Fn bn (!) Fn! D F (!) (C.34) where F (!) = Zd;2;F ;F ;F Z d;2;f + 2!Zd;2;F ( 2;F +! 2 Z d;2;f ( 2;F ;F ) ( 2;F ;F )Z d;2;f : ;F ) ;F Z d;2;f Now we consider E[ F (!)] using the equalities in Lemma A.9 below. First, E[Z d;2;f ;F ;F Z d;2;f ] = tr( ;F ) (C.35) because Z d;2;f = ;F Z ;F and ;F E[Z ;F Z ;F ] ;F = ;F by denition. Second, E Z d;2;f ( 2;F ;F ) ;F Z d;2;f = tr( ;F E Z d;2;f Zd;2;F ( 2;F ;F ) ) = tr( ;F d d + 2;F ( 2;F ;F ) ) = tr(( 2;F ;F )); (C.36) where the last equality holds by Lemma A.9. Third, E Z d;2;f ( 2;F ;F ) ( 2;F ;F )Z d;2;f = tr(( 2;F ;F ) d d + 2;F ( 2;F ;F ) ) = d 2;F 2;F d tr(( 2;F ;F )) (C.37) by Lemma A.9. Combining the results in (C.35)-(C.37), we obtain E[ F (!)] = tr( ;F ) 2!tr ( ( ;F 2;F )) +! 2 d 2;F 2;F d + tr ( ( ;F 2;F )). (C.38) Note that d 2;F 2;F d = d ( 2;F ;F ) ( 2;F ;F )d because ;F d = d. It is clear that the optimal weight! F in (4.3) minimizes the quadratic function of! in (C.38). 64

Proof of Lemma A.9. By construction, $\bar\Gamma_{1,F}\tilde d = 0_{d_\theta}$. For ease of notation, we write $\Omega_{2,F}$ and $G_{2,F}$ as

$\Omega_{2,F} = \begin{pmatrix} \Omega_{1,F} & \Omega_{1r_*,F} \\ \Omega_{r_*1,F} & \Omega_{r_*,F} \end{pmatrix}$ and $G_{2,F} = \begin{pmatrix} G_{1,F} \\ G_{r_*,F} \end{pmatrix}$.  (C.39)

To prove part (b), we have

$\bar\Gamma_{1,F}\Omega_{2,F}\bar\Gamma_{1,F}' = [\Gamma_{1,F},\, 0_{d_\theta\times r_*}]\begin{pmatrix} \Omega_{1,F} & \Omega_{1r_*,F} \\ \Omega_{r_*1,F} & \Omega_{r_*,F} \end{pmatrix}[\Gamma_{1,F},\, 0_{d_\theta\times r_*}]' = \Gamma_{1,F}\Omega_{1,F}\Gamma_{1,F}' = (G_{1,F}'\Omega_{1,F}^{-1}G_{1,F})^{-1} = \Sigma_{1,F}$.  (C.40)

To show part (c), note

$\bar\Gamma_{1,F}\Omega_{2,F}\Gamma_{2,F}' = [\Gamma_{1,F},\, 0_{d_\theta\times r_*}]\,\Omega_{2,F}\Omega_{2,F}^{-1}G_{2,F}(G_{2,F}'\Omega_{2,F}^{-1}G_{2,F})^{-1} = \Gamma_{1,F}G_{1,F}(G_{2,F}'\Omega_{2,F}^{-1}G_{2,F})^{-1} = (G_{2,F}'\Omega_{2,F}^{-1}G_{2,F})^{-1} = \Sigma_{2,F}$  (C.41)

because $\Gamma_{1,F}G_{1,F} = I_{d_\theta\times d_\theta}$. Part (d) follows from the definition of $\Gamma_{2,F}$.

Proof of Lemma 4.2. We first prove the consistency of $\hat\Omega_k$, $\hat G_k$ and $\hat\Sigma_k$ for $k = 1, 2$. By Lemma 4.1, we have $\hat\theta_1 = \theta_{F_n} + o_p(1)$. Using the same arguments in showing (C.13), we can show that

$\hat\Omega_2 = \Omega_{2,F_n} + o_p(1) = \Omega_{2,F} + o_p(1)$,  (C.42)

where the second equality is by the assumption of the lemma that $v_{F_n}\to v_F$ for some $F\in\mathcal F$. As $\hat\Omega_1$ is a submatrix of $\hat\Omega_2$, by (C.42) we have

$\hat\Omega_1 = \Omega_{1,F_n} + o_p(1) = \Omega_{1,F} + o_p(1)$.  (C.43)

By the consistency of $\hat\theta_1$ and the same arguments used to show (C.30), we have

$n^{-1}\sum_{i=1}^{n} g_{2,\theta}(W_i,\hat\theta_1) = G_{2,F_n}(\theta_{F_n}) + o_p(1) = G_{2,F} + o_p(1)$,  (C.44)

where the second equality is by (A.1), which is assumed in the lemma. As $n^{-1}\sum_{i=1}^{n} g_{1,\theta}(W_i,\hat\theta_1)$ is a submatrix of $n^{-1}\sum_{i=1}^{n} g_{2,\theta}(W_i,\hat\theta_1)$, by (C.44) we have

$n^{-1}\sum_{i=1}^{n} g_{1,\theta}(W_i,\hat\theta_1) = G_{1,F_n}(\theta_{F_n}) + o_p(1) = G_{1,F} + o_p(1)$.  (C.45)

From Assumption 3.2, (C.42), (C.43), (C.44) and (C.45), we see that $\hat\Omega_k$ and $\hat G_k$ are consistent

estimators of $\Omega_{k,F}$ and $G_{k,F}$ respectively for $k = 1, 2$. By the Slutsky theorem and Assumption 3.2, we know that $\hat\Sigma_k$ is a consistent estimator of $\Sigma_{k,F}$ for $k = 1, 2$.

In the case where $n^{1/2}\delta_{F_n}\to d\in R^{r_*}$, the desired result follows from Lemma 4.1, the consistency of $\hat\Sigma_1$ and $\hat\Sigma_2$, and the CMT. In the case where $\|n^{1/2}\delta_{F_n}\|\to\infty$, $\tilde\omega_{eo}\to_p 0$ because $n^{1/2}\|\hat\theta_2 - \hat\theta_1\|\to_p\infty$, and

$n^{1/2}(\hat\theta_{eo} - \theta_{F_n}) = n^{1/2}(\hat\theta_1 - \theta_{F_n}) + \tilde\omega_{eo}\,n^{1/2}(\hat\theta_2 - \hat\theta_1) = n^{1/2}(\hat\theta_1 - \theta_{F_n}) + \frac{n^{1/2}(\hat\theta_2 - \hat\theta_1)\,\mathrm{tr}(\Upsilon(\hat\Sigma_1 - \hat\Sigma_2))}{n(\hat\theta_2 - \hat\theta_1)'\Upsilon(\hat\theta_2 - \hat\theta_1) + \mathrm{tr}(\Upsilon(\hat\Sigma_1 - \hat\Sigma_2))} \to_D \xi_{1,F}$  (C.46)

by Lemma 4.1.
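For reference, the empirical weight appearing in (C.46) is simple to compute once the two GMM estimates and the $\hat\Sigma_k$ matrices are in hand. The helper below is a hedged Python sketch of ours (the function name and its inputs are hypothetical, not the paper's code):

import numpy as np

def averaging_gmm(theta1, theta2, Sigma1_hat, Sigma2_hat, n, Ups=None):
    # theta_eo = theta1 + w * (theta2 - theta1), with w the weight in (C.46)
    Ups = np.eye(len(theta1)) if Ups is None else Ups
    num = np.trace(Ups @ (Sigma1_hat - Sigma2_hat))
    diff = theta2 - theta1
    w = num / (n * diff @ Ups @ diff + num)
    return theta1 + w * diff, w

The proof's two regimes are visible directly in this formula: when $n\|\hat\theta_2 - \hat\theta_1\|^2$ diverges, the denominator explodes, so the weight tends to zero and the averaging estimator collapses to the conservative one, matching the limit $\xi_{1,F}$ in (C.46).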

67 show the second inequality of this lemma, it is sucient to prove that sup E[(! 2 F Zd;2;F B F Z d;2;f ) 2 ] C: h2h (C.52) Recall that we have dened A F = ( ;F 2;F ) in Theorem 5.2. By the denition,! 2 F Z d;2;f B F Z d;2;f = (tr(a F )) 2 Z d;2;f B F Z d;2;f (Z d;2;f B F Z d;2;f + tr(a F )) 2 tr(a F ) Zd;2;F = tr(a F ) B F Z d;2;f Zd;2;F B F Z d;2;f + tr(a F ) Zd;2;F B F Z d;2;f + tr(a F ) : (C.53) By Lemma 2. in Cheng and Liao (25), tr(a F ) for any F 2 F. Zd;2;F B F Z d;2;f implies that This together with tr(a F ) Z d;2;f B F Z d;2;f + tr(a F ) and Z d;2;f B F Z d;2;f Z d;2;f B F Z d;2;f + tr(a F ) : (C.54) By (C.54) and tr(a F ),! 2 F Zd;2;F B F Z d;2;f tr(a F ) = tr( ;F ) tr( 2;F ); (C.55) where the equality is by A F = ( ;F 2;F ). By (C.55) and the simple inequality (a + b) 2 2(a 2 + b 2 ), E[(! 2 F Zd;2;F B F Z d;2;f ) 2 ] 2(tr( ;F )) 2 + 2(tr( 2;F )) 2 : (C.56) By Assumptions 3.2.(ii) and 3.2.(iv), min (G k;f k;f G k;f ) min ( k;f ) min(g k;f G k;f ) = min (G k;f G k;f )= max ( k;f ) C (C.57) for any F 2 F and for k = ; 2. By (C.57) and the denition of k;f (k = ; 2), max ( k;f ) = min (G k;f k;f G k;f ) C (C.58) for any F 2 F. As and k;f are positive denite symmetric matrix, by the standard trace inequality (tr(ab) tr(a) max (B) for Hermitian matrices A and B ), tr( k;f ) tr() max ( k;f ) C for k = ; 2; (C.59) for any F 2 F. Collecting the results in (C.56) and (C.59), we immediately get (C.52). This nishes the proof. 67

Proof of Lemma A.16. First note that

$\min\{x, \zeta\} - x = (\zeta - x)\,1\{x > \zeta\}$.  (C.60)

Recall that $g(h) \equiv E[\ell_F - \ell_{1,F}]$ and $g_\zeta(h) \equiv E[\min\{\ell_F, \zeta\} - \min\{\ell_{1,F}, \zeta\}]$. Hence we have

$\sup_{h\in H} E\big|\min\{\ell_F, \zeta\} - \ell_F\big| \le \sup_{h\in H} E\big[\ell_F\,1\{\ell_F > \zeta\}\big] \le \zeta^{-1}\sup_{h\in H} E\big[\ell_F^2\big] \le 2C\zeta^{-1}$,  (C.61)

where the first inequality is by (C.60) and Jensen's inequality, the second inequality is by the Markov inequality (on the event $\{\ell_F > \zeta\}$ we have $1 \le \ell_F/\zeta$) and the monotonicity of expectation, and the last inequality is by Lemma A.15. Using the same arguments, we can show that

$\sup_{h\in H} E\big|\min\{\ell_{1,F}, \zeta\} - \ell_{1,F}\big| \le 2C\zeta^{-1}$.  (C.62)

Collecting the results in (C.61) and (C.62), and applying the triangle inequality, we deduce that

$\sup_{h\in H}\big|g_\zeta(h) - g(h)\big| \le 4C\zeta^{-1}$.  (C.63)

The claimed result of this lemma follows by (C.63) as $C$ is a fixed constant. By the triangle inequality, Jensen's inequality and Lemma A.15,

$\sup_{h\in H}|g(h)| = \sup_{h\in H}\big|E[\ell_F - \ell_{1,F}]\big| \le \sup_{h\in H} E[\ell_F] + \sup_{h\in H} E[\ell_{1,F}] \le C$,

which finishes the proof of the lemma.

D Asymptotic Risk of the Pre-test GMM Estimator

In this section, we establish results similar to Theorem 5.1 for the pre-test GMM estimator based on the J-test statistic. The pre-test estimator is defined as

$\hat\theta_{pre} = 1\{J_n > c_\alpha\}\,\hat\theta_1 + 1\{J_n \le c_\alpha\}\,\hat\theta_2$,  (D.1)

where $J_n = n\,\bar g_2(\hat\theta_2)'(\hat\Omega_2)^{-1}\bar g_2(\hat\theta_2)$ and $c_\alpha$ is the $(1-\alpha)$th quantile of the chi-squared distribution with $r_2 - d_\theta$ degrees of freedom.
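A minimal sketch of the rule in (D.1), assuming the two estimates and the moment vector at $\hat\theta_2$ are already available (the function name and its inputs are our own, not the paper's code); the critical value is the $(1-\alpha)$ chi-squared quantile with $r_2 - d_\theta$ degrees of freedom.

import numpy as np
from scipy.stats import chi2

def pretest_gmm(theta1, theta2, gbar2, Omega2_hat, n, d_theta, alpha=0.05):
    # J_n = n * gbar2' Omega2_hat^{-1} gbar2, as in (D.1)
    Jn = n * gbar2 @ np.linalg.solve(Omega2_hat, gbar2)
    c = chi2.ppf(1 - alpha, df=len(gbar2) - d_theta)   # c_alpha
    return (theta1 if Jn > c else theta2), Jn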

Theorem D.1 Suppose that the assumptions of Theorem 5.1 hold. The bounds of the asymptotic risk difference satisfy

$\underline{\mathrm{AsyRD}}(\hat\theta_{pre}, \hat\theta_1) = \min\big\{\inf_{h\in H}[g_p(h)],\, 0\big\}$ and $\overline{\mathrm{AsyRD}}(\hat\theta_{pre}, \hat\theta_1) = \max\big\{\sup_{h\in H}[g_p(h)],\, 0\big\}$,

where $g_p(h) \equiv E[\xi_{p,F}'\Upsilon\xi_{p,F} - \xi_{1,F}'\Upsilon\xi_{1,F}]$ and $\xi_{p,F}$ is defined in (D.3) below.

Proof of Theorem D.1. The two equalities and inequalities in the theorem follow by the same arguments in the proof of Theorem 5.1, with Lemma 4.2 for $\hat\theta_{eo}$ replaced by Lemma D.1 for $\hat\theta_{pre}$, Lemma A.15 replaced by Lemma D.2, and Lemma A.16 replaced by Lemma D.3. Its proof is hence omitted. By definition,

$E[\xi_{p,F}'\Upsilon\xi_{p,F}] = E[Z_{d,2,F}'\bar\Gamma_{1,F}'\Upsilon\bar\Gamma_{1,F}Z_{d,2,F}] + 2E[\omega_{p,F}\,Z_{d,2,F}'(\Gamma_{2,F} - \bar\Gamma_{1,F})'\Upsilon\bar\Gamma_{1,F}Z_{d,2,F}] + E[\omega_{p,F}^2\,Z_{d,2,F}'(\Gamma_{2,F} - \bar\Gamma_{1,F})'\Upsilon(\Gamma_{2,F} - \bar\Gamma_{1,F})Z_{d,2,F}]$
$= \mathrm{tr}(\Upsilon\Sigma_{1,F}) + 2E[\omega_{p,F}\,Z_{d,2,F}'(\Gamma_{2,F} - \bar\Gamma_{1,F})'\Upsilon\bar\Gamma_{1,F}Z_{d,2,F}] + E[\omega_{p,F}^2\,Z_{d,2,F}'(\Gamma_{2,F} - \bar\Gamma_{1,F})'\Upsilon(\Gamma_{2,F} - \bar\Gamma_{1,F})Z_{d,2,F}]$.  (D.2)

The asymptotic risk of the pre-test estimator $\hat\theta_{pre}$ in Figure 2 is simulated based on the formula in (D.2).

The following lemma provides the asymptotic distribution of the pre-test GMM estimator under various sequences of DGPs, which is used to show Theorem D.1.

Lemma D.1 Suppose that the assumptions of Theorem 5.1 hold. Consider $\{F_n\}$ such that $v_{F_n}\to v_F$ for some $F\in\mathcal F$.

(a) If $n^{1/2}\delta_{F_n}\to d$ for some $d\in R^{r_*}$, then

$J_n \to_D J_\infty(h_{d,F}) \equiv (Z_{2,F} + \tilde d)'L_F(Z_{2,F} + \tilde d)$,

where $L_F \equiv \Omega_{2,F}^{-1} - \Omega_{2,F}^{-1}G_{2,F}(G_{2,F}'\Omega_{2,F}^{-1}G_{2,F})^{-1}G_{2,F}'\Omega_{2,F}^{-1}$ and $\tilde d = (0_{r_1}', d')'$, and

$n^{1/2}(\hat\theta_{pre} - \theta_{F_n}) \to_D \xi_{p,F} \equiv (1 - \omega_{p,F})\,\xi_{1,F} + \omega_{p,F}\,\xi_{2,F}$,  (D.3)

where $\omega_{p,F} = 1\{J_\infty(h_{d,F}) \le c_\alpha\}$.

(b) If $\|n^{1/2}\delta_{F_n}\|\to\infty$, then $\omega_{p,F}\to_p 0$ and $n^{1/2}(\hat\theta_{pre} - \theta_{F_n})\to_D \xi_{1,F}$.
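The risk formula (D.2) lends itself to direct simulation. The sketch below is our illustration, assuming the inputs ($G_2$, $\Omega_2$, $\bar\Gamma_1$, $\Upsilon$, $\tilde d$) are built as in the earlier trace-identity sketch: it draws $Z_{d,2,F}$, forms $\omega_{p,F} = 1\{Z_{d,2,F}'L_F Z_{d,2,F}\le c_\alpha\}$ and averages the loss of $\xi_{p,F}$.

import numpy as np
from scipy.stats import chi2

def pretest_asy_risk(G2, Omega2, Gam1b, Ups, dt, alpha=0.05, reps=200000, seed=2):
    rng = np.random.default_rng(seed)
    r2, d_theta = G2.shape
    Oi = np.linalg.inv(Omega2)
    Gam2 = np.linalg.solve(G2.T @ Oi @ G2, G2.T @ Oi)
    L = Oi - Oi @ G2 @ Gam2                              # L_F from Lemma D.1
    c = chi2.ppf(1 - alpha, df=r2 - d_theta)
    Zd = rng.multivariate_normal(dt, Omega2, size=reps)  # draws of Z_{d,2,F}
    w = ((Zd * (Zd @ L)).sum(1) <= c)[:, None]           # omega_{p,F} per draw
    xi = Zd @ Gam1b.T + w * (Zd @ (Gam2 - Gam1b).T)      # xi_{p,F} in (D.3)
    return (xi * (xi @ Ups)).sum(1).mean()               # E[xi' Upsilon xi]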

Proof of Lemma D.1. (a) By Assumption 3.2(ii), (C.29) and (C.32),

$\bar g_2(\hat\theta_2) = \bar g_2(\theta_{F_n}) + [G_{2,F_n}(\theta_{F_n}) + o_p(1)](\hat\theta_2 - \theta_{F_n}) + o_p(n^{-1/2}) = \bar g_2(\theta_{F_n}) - G_{2,F_n}\Gamma_{2,F_n}\bar g_2(\theta_{F_n}) + o_p(n^{-1/2}) = (I_{r_2} - G_{2,F_n}\Gamma_{2,F_n})\bar g_2(\theta_{F_n}) + o_p(n^{-1/2})$,  (D.4)

which implies that

$J_n = n\,\bar g_2(\theta_{F_n})'L_{F_n}\bar g_2(\theta_{F_n}) + o_p(1)$,  (D.5)

where $L_{F_n} \equiv \Omega_{2,F_n}^{-1} - \Omega_{2,F_n}^{-1}G_{2,F_n}(G_{2,F_n}'\Omega_{2,F_n}^{-1}G_{2,F_n})^{-1}G_{2,F_n}'\Omega_{2,F_n}^{-1}$. By $n^{1/2}\delta_{F_n}\to d$ and Lemma A.1(v),

$n^{1/2}\Omega_{2,F_n}^{-1/2}\bar g_2(\theta_{F_n}) = \Omega_{2,F_n}^{-1/2}\nu_n(g_2(W,\theta_{F_n})) + \Omega_{2,F_n}^{-1/2}n^{1/2}\tilde\delta_{F_n} \to_D Z + \Omega_{2,F}^{-1/2}\tilde d$,  (D.6)

where $\tilde\delta_{F_n} \equiv (0_{r_1}', \delta_{F_n}')'$, $\tilde d = (0_{r_1}', d')'$ and $Z$ is an $r_2\times 1$ standard normal random vector. By $v_{F_n}\to v_F$, (D.5), (D.6) and the CMT,

$J_n \to_D (Z_{2,F} + \tilde d)'L_F(Z_{2,F} + \tilde d)$.  (D.7)

Recall that Lemma 4.1(a) implies that

$n^{1/2}(\hat\theta_1 - \theta_{F_n}) \to_D \xi_{1,F}$ and $n^{1/2}(\hat\theta_2 - \theta_{F_n}) \to_D \xi_{2,F}$,  (D.8)

which together with (D.7) and the CMT implies that

$n^{1/2}(\hat\theta_{pre} - \theta_{F_n}) = 1\{J_n > c_\alpha\}\,n^{1/2}(\hat\theta_1 - \theta_{F_n}) + 1\{J_n \le c_\alpha\}\,n^{1/2}(\hat\theta_2 - \theta_{F_n}) \to_D (1 - \omega_{p,F})\,\xi_{1,F} + \omega_{p,F}\,\xi_{2,F}$,  (D.9)

which finishes the proof of the claim in (a).
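When $d = 0_{r_*}$, the limit in (D.7) is the usual chi-squared null distribution of the J statistic with $r_2 - d_\theta$ degrees of freedom, since $\Omega_{2,F}^{1/2}L_F\Omega_{2,F}^{1/2}$ is idempotent with rank $r_2 - d_\theta$. A quick numeric check of ours (the toy $G_2$ and $\Omega_2$ are assumed inputs):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
r2, dth = 5, 2
A = rng.normal(size=(r2, r2))
Omega2 = A @ A.T + r2 * np.eye(r2)
G2 = rng.normal(size=(r2, dth))
Oi = np.linalg.inv(Omega2)
L = Oi - Oi @ G2 @ np.linalg.solve(G2.T @ Oi @ G2, G2.T @ Oi)
Z = rng.multivariate_normal(np.zeros(r2), Omega2, size=200000)
J = (Z * (Z @ L)).sum(1)                     # draws from the limit in (D.7), d = 0
print(np.quantile(J, [0.50, 0.90, 0.95]))    # close to the chi2(3) quantiles below
print(chi2.ppf([0.50, 0.90, 0.95], df=r2 - dth))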

(b) There are two cases to consider: (i) $\|\delta_{F_n}\| > C^{-1}$; and (ii) $\|\delta_{F_n}\|\to 0$. We first consider case (i). As $\bar g_1(\hat\theta_2)$ is a subvector of $\bar g_2(\hat\theta_2)$,

$J_n = n\,\bar g_2(\hat\theta_2)'(\hat\Omega_2)^{-1}\bar g_2(\hat\theta_2) \ge n\,\lambda_{\max}^{-1}(\hat\Omega_2)\,\bar g_2(\hat\theta_2)'\bar g_2(\hat\theta_2) \ge n\,\lambda_{\max}^{-1}(\hat\Omega_2)\,\bar g_1(\hat\theta_2)'\bar g_1(\hat\theta_2)$.  (D.10)

By (A.22) and (A.23) in the Appendix of CLS,

$\|\hat\theta_2 - \theta_{F_n}\| \ge C^{-1}$ with probability approaching 1,  (D.11)

which together with Assumption 3.1(ii) and Lemma A.1(i) implies that

$\|\bar g_1(\hat\theta_2)\| = \|M_{1,F_n}(\hat\theta_2)\| + o_p(1) \ge C^{-1}$  (D.12)

with probability approaching 1. By (C.42) and Assumption 3.2(ii), we have

$\lambda_{\max}(\hat\Omega_2) \le C$ with probability approaching 1.  (D.13)

Combining the results in (D.10), (D.12) and (D.13), we deduce that

$J_n \ge C^{-1}n$ with probability approaching 1,  (D.14)

which immediately implies that

$\omega_{p,F} = 1\{J_n \le c_\alpha\} = 0$  (D.15)

with probability approaching 1, as $c_\alpha$ is a fixed constant. By Lemma 4.1(b), (D.15) and the assumption that $\Theta$ is bounded, we have

$n^{1/2}(\hat\theta_{pre} - \theta_{F_n}) = 1\{J_n > c_\alpha\}\,n^{1/2}(\hat\theta_1 - \theta_{F_n}) + 1\{J_n \le c_\alpha\}\,n^{1/2}(\hat\theta_2 - \theta_{F_n}) = 1\{J_n > c_\alpha\}\,n^{1/2}(\hat\theta_1 - \theta_{F_n}) + o_p(1) \to_D \xi_{1,F}$,  (D.16)

where the convergence in distribution is by the CMT.

We next consider the case that $\|\delta_{F_n}\|\to 0$ and $\|n^{1/2}\delta_{F_n}\|\to\infty$. In the proof of Lemma 4.1, we have shown that $\hat\theta_2 - \theta_{F_n} = o_p(1)$, and that (D.4) and (D.5) hold in this case. It is clear that

$n^{1/2}\bar g_2(\theta_{F_n}) = \nu_n(g_2(W,\theta_{F_n})) + n^{1/2}\tilde\delta_{F_n}$,  (D.17)

which implies that

$n\,\bar g_2(\theta_{F_n})'L_{F_n}\bar g_2(\theta_{F_n}) = [\nu_n(g_2(W,\theta_{F_n}))]'L_{F_n}[\nu_n(g_2(W,\theta_{F_n}))] + 2n^{1/2}\tilde\delta_{F_n}'L_{F_n}[\nu_n(g_2(W,\theta_{F_n}))] + n\,\tilde\delta_{F_n}'L_{F_n}\tilde\delta_{F_n}$.  (D.18)

By Lemma A.1(v) and Assumptions 3.2(ii)-(iii),

$[\nu_n(g_2(W,\theta_{F_n}))]'L_{F_n}[\nu_n(g_2(W,\theta_{F_n}))] = O_p(1)$.  (D.19)

In order to bound the third term in (D.18) from below, we shall show that for any $\tilde d = (0_{r_1}', d')'$ with $d\in R^{r_*}$ and $\|d\| = 1$,

$\tilde d'L_{F_n}\tilde d \ge C^{-1}$.  (D.20)

By definition, $L_{F_n}$ has $d_\theta$ zero eigenvalues and $r_2 - d_\theta$ eigenvalues equal to one. The matrix $G_{2,F_n}$ contains the eigenvectors associated with the zero eigenvalues of $L_{F_n}$, because

$L_{F_n}G_{2,F_n} = 0_{r_2\times d_\theta}$ and $\lambda_{\min}(G_{2,F_n}'G_{2,F_n}) \ge C^{-1}$.  (D.21)

Let $G_{\perp,F_n}$ denote the orthogonal complement of $G_{2,F_n}$ with $G_{\perp,F_n}'G_{\perp,F_n} = I_{r_2 - d_\theta}$. Then we can write

$\begin{pmatrix} 0_{r_1} \\ d \end{pmatrix} = \begin{pmatrix} G_{1,F_n} \\ G_{r_*,F_n} \end{pmatrix} a_1 + \begin{pmatrix} G_{1,\perp,F_n} \\ G_{r_*,\perp,F_n} \end{pmatrix} a_2$  (D.22)

for some constant vectors $a_1\in R^{d_\theta}$ and $a_2\in R^{r_2 - d_\theta}$. As $\lambda_{\min}(G_{1,F_n}'G_{1,F_n}) \ge C^{-1}$ by Assumption 3.2, we have

$a_1 = -(G_{1,F_n}'G_{1,F_n})^{-1}G_{1,F_n}'G_{1,\perp,F_n}a_2$  (D.23)

and

$\big(G_{r_*,\perp,F_n} - G_{r_*,F_n}(G_{1,F_n}'G_{1,F_n})^{-1}G_{1,F_n}'G_{1,\perp,F_n}\big)a_2 = d$.  (D.24)

Let $H_{F_n} \equiv G_{r_*,\perp,F_n} - G_{r_*,F_n}(G_{1,F_n}'G_{1,F_n})^{-1}G_{1,F_n}'G_{1,\perp,F_n}$. By $\lambda_{\min}(G_{1,F_n}'G_{1,F_n}) \ge C^{-1}$, Assumption 3.2(ii), (D.24) and the Cauchy-Schwarz inequality,

$\|d\|^2 = a_2'H_{F_n}'H_{F_n}a_2 \le C\|a_2\|^2$,  (D.25)

which together with $\|d\| = 1$ implies that

$\|a_2\|^2 \ge C^{-1}$.  (D.26)

Using (D.21), (D.22) and (D.26), we deduce that

$\tilde d'L_{F_n}\tilde d = (G_{2,F_n}a_1 + G_{\perp,F_n}a_2)'L_{F_n}(G_{2,F_n}a_1 + G_{\perp,F_n}a_2) = a_2'G_{\perp,F_n}'L_{F_n}G_{\perp,F_n}a_2 = \|a_2\|^2 \ge C^{-1}$,  (D.27)

which proves (D.20). By (D.20),

$n\,\tilde\delta_{F_n}'L_{F_n}\tilde\delta_{F_n} \ge C^{-1}n\|\delta_{F_n}\|^2$,  (D.28)

which together with $n\|\delta_{F_n}\|^2\to\infty$ implies that

$n\,\tilde\delta_{F_n}'L_{F_n}\tilde\delta_{F_n} \to\infty$.  (D.29)

Collecting the results in (D.18), (D.19) and (D.29), and by the Cauchy-Schwarz inequality, we deduce that $n\,\bar g_2(\theta_{F_n})'L_{F_n}\bar g_2(\theta_{F_n})\to_p\infty$, which together with (D.5) implies that

$J_n \to_p \infty$.  (D.30)

Using the same arguments in showing (D.16), we deduce that

$n^{1/2}(\hat\theta_{pre} - \theta_{F_n}) \to_D \xi_{1,F}$.  (D.31)

This finishes the proof.

Lemma D.2 Under Assumption 3.2, we have

$\sup_{h\in H} E\big[(\xi_{p,F}'\Upsilon\xi_{p,F} - \xi_{1,F}'\Upsilon\xi_{1,F})^2\big] \le C$.  (D.32)

Proof of Lemma D.2. By the same arguments in showing (C.51), we have

$(\ell_{p,F} - \ell_{1,F})^2 \le 8(Z_{1,F}'\Gamma_{1,F}'\Upsilon\Gamma_{1,F}Z_{1,F})^2 + 8(\omega_{p,F}^2 Z_{d,2,F}'B_F Z_{d,2,F})^2$,  (D.33)

where $\ell_{p,F} \equiv \xi_{p,F}'\Upsilon\xi_{p,F}$. By the first inequality in (A.58) in the Appendix of CLS, we have $\sup_{h\in H} E[(\xi_{1,F}'\Upsilon\xi_{1,F})^2] \le C$. Hence by (D.33), to show the inequality in (D.32), it is sufficient to prove that

$\sup_{h\in H} E\big[(\omega_{p,F}^2 Z_{d,2,F}'B_F Z_{d,2,F})^2\big] \le C$.  (D.34)

By definition,

$\omega_{p,F} = 1\{J_\infty(h_{d,F}) \le c_\alpha\} = 1\{Z_{d,2,F}'L_F Z_{d,2,F} \le c_\alpha\}$.  (D.35)

By the simple inequality $(a+b)^2 \ge a^2/2 - 2b^2$,

$(z + \tilde d)'L_F(z + \tilde d) \ge \tilde d'L_F\tilde d/2 - 2z'L_Fz$  (D.36)

for any $z\in R^{r_2}$, which together with Assumption 3.2 and (D.20) implies that

$(z + \tilde d)'L_F(z + \tilde d) \ge \|d\|^2/C - 2z'L_Fz \ge \|d\|^2/C - C\|z\|^2$.  (D.37)

Under Assumption 3.2, $\|B_F\| \le C$ for any $F\in\mathcal F$, which together with the simple inequality $(a+b)^2 \le 2(a^2+b^2)$ implies that

$(z + \tilde d)'B_F(z + \tilde d) \le 2C(\|d\|^2 + \|z\|^2)$  (D.38)

for any $z\in R^{r_2}$. Collecting the results in (D.36) and (D.38), we get

$1\{(z + \tilde d)'L_F(z + \tilde d) \le c_\alpha\}\,(z + \tilde d)'B_F(z + \tilde d) \le 2C\,1\{\|d\|^2 \le c_\alpha C + C^2\|z\|^2\}\,(\|d\|^2 + \|z\|^2) \le 2C\big(c_\alpha C + (C^2 + 1)\|z\|^2\big)$  (D.39)

which implies that

$\sup_{h\in H} E\big[(\omega_{p,F}^2 Z_{d,2,F}'B_F Z_{d,2,F})^2\big] \le 4C^2 E\big[(c_\alpha C + (C^2 + 1)Z_{2,F}'Z_{2,F})^2\big] \le C\big(c_\alpha^2 + E[|Z_{2,F}'Z_{2,F}|^2]\big) \le C\big(c_\alpha^2 + \lambda_{\max}^2(\Omega_{2,F})\,r_2^2\big)$.  (D.40)

This finishes the proof.

Lemma D.3 Let $g_{p,\zeta}(h) \equiv E\big[\min\{\ell_{p,F}, \zeta\} - \min\{\ell_{1,F}, \zeta\}\big]$. Under Assumption 3.2, we have

$\lim_{\zeta\to\infty}\sup_{h\in H}\big|g_{p,\zeta}(h) - g_p(h)\big| = 0$,  (D.41)

where $\sup_{h\in H}|g_p(h)| \le C$.

Proof of Lemma D.3. The proof follows the same arguments as the proof of Lemma A.16, with the second inequality in (A.58) in the Appendix of CLS replaced by (D.32).

E Extra Simulation Studies

E.1 Finite sample bias and variance of the GMM estimators in Section 7 of CLS

In this subsection, we report the finite sample biases and variances of the pre-test and averaging GMM estimators in the simulation design in Section 7 of CLS. Suppose $\hat\theta_n$ is a generic estimator of $\theta$ with sample size $n$. The mean square error (MSE) of the estimator $\hat\theta_n$ is

$MSE_n = E\|\hat\theta_n - \theta\|^2$,

which is a measure of the distance between the estimator and the true value. The bias and variance of $\hat\theta_n$ are defined as

$B_n = \|E[\hat\theta_n] - \theta\|^2$ and $Var_n = E\|\hat\theta_n - E[\hat\theta_n]\|^2$.

Then it is easy to see that $MSE_n = B_n + Var_n$.

The MSEs of the pre-test and averaging GMM estimators are reported in Figure 2 of CLS. The finite sample biases $B_n$ and variances $Var_n$ of these estimators (together with the conservative GMM estimator $\hat\theta_{1,n}$) are reported in Figure E.1 and Figure E.2 respectively. In Figure E.1 and Figure E.2, the blue area, the light blue area and the red line represent the pre-test GMM estimator, the averaging GMM estimator and the conservative GMM estimator respectively.
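The decomposition $MSE_n = B_n + Var_n$ holds exactly when all three quantities are computed from the same Monte Carlo replications, as the short Python sketch below illustrates (the synthetic draws of $\hat\theta_n$ are our own stand-in for simulation output):

import numpy as np

rng = np.random.default_rng(5)
theta = np.array([1.0, -0.5])
est = theta + 0.05 + 0.3 * rng.normal(size=(10000, 2))   # replications of theta_hat_n
mse = ((est - theta) ** 2).sum(axis=1).mean()            # MSE_n
bias = ((est.mean(axis=0) - theta) ** 2).sum()           # B_n
var = ((est - est.mean(axis=0)) ** 2).sum(axis=1).mean() # Var_n
print(mse, bias + var)                                   # identical up to rounding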

For each shaded area, the upper and lower envelopes are the maximum and minimum bias/variance of the GMM estimator respectively. For different values of $r$, the maximum and minimum bias (or variance) of the conservative GMM estimator are very close to each other. Therefore, we only present its minimum bias (and variance) in Figure E.1 (and Figure E.2 respectively).

From Figure E.1, we see that the pre-test GMM estimator and the averaging GMM estimator have larger finite sample biases than the conservative GMM estimator. As the sample size increases, the finite sample biases of the averaging and the conservative GMM estimators go to zero fast. However, the finite sample bias of the pre-test GMM estimator goes to zero very slowly, which leads to a larger finite sample MSE than the conservative GMM estimator, as we can see from Figure 2 in Section 7 of CLS.

From Figure E.2, we see that the pre-test GMM estimator has smaller finite sample variance than the conservative GMM estimator when the sample size is small (e.g., n = 50, 100), but it may have slightly larger variance in some areas of $r$ when the sample size is slightly larger (e.g., n = 250, 500). In contrast, the averaging GMM estimator has smaller finite sample variances than the conservative GMM estimator for all sample sizes.

Figure E.1. Finite Sample Biases of the Pre-test and Averaging GMM Estimators

Figure E.2. Finite Sample Variances of the Pre-test and Averaging GMM Estimators

E.2 Simulation under the normal distribution

In this subsection, we report the simulation results on the finite sample properties of the pre-test and averaging GMM estimators when the residual term $u$ in the structural equation (7.4) of CLS follows the normal distribution, that is, (7.7) in CLS is replaced by $u = \tilde u$.

In this simulation study, we consider $r$ to be a scalar that takes values on the grid points between 0 and 2 with grid length 0.05. The rest of the simulation design is the same as what is used in Section 7 of CLS. The simulation results are reported in Figures E.3-E.5. For different values of $r$, the maximum and minimum bias (or variance) of the conservative GMM estimator are very close to each other. Therefore, we only present its minimum bias (and variance) in Figure E.4 (and Figure E.5 respectively). As we can see from Figures E.3-E.5, the simulation results are quite similar to what we get in the simulation design of Section 7 of CLS.

Figure E.3. Finite Sample MSEs of the Pre-test and Averaging GMM Estimators

Figure E.4. Finite Sample Biases of the Pre-test and Averaging GMM Estimators

Figure E.5. Finite Sample Variances of the Pre-test and Averaging GMM Estimators

E.3 Simulation under the Student-t distribution

In this subsection, we report the simulation results on the finite sample properties of the pre-test and averaging GMM estimators when the residual term $u$ in the structural equation (7.4) of CLS is a Student-t random variable with 2 degrees of freedom. We draw the latent random variables $(Z_1, \ldots, Z_8, \varepsilon_1, \ldots, \varepsilon_6, \tilde u)$ from the Student-t distribution with 2 degrees of freedom and correlation matrix $\mathrm{diag}(I_{8\times 8}, \Sigma_{7\times 7})$ (where $\Sigma_{7\times 7}$ is defined in (7.6) of CLS), and then replace (7.7) in CLS by $u = \tilde u$.[23]

In this simulation study, we consider $r$ to be a scalar that takes values on the grid points between 0 and 2 with grid length 0.05. The rest of the simulation design is the same as what is used in Section 7 of CLS. The finite sample MSEs, biases and variances of the GMM estimators are reported in Figures E.6-E.8.

[23] We use Matlab to perform all the simulation studies in this paper. Let $C \equiv \mathrm{diag}(I_{8\times 8}, \Sigma_{7\times 7})$; the Matlab function mvtrnd(C, 2, n) draws $n$ such Student-t random vectors independently. Each vector is constructed in the following way. First, a random vector, say $(N_1, \ldots, N_{15})$, is simulated from the joint normal distribution with mean zero and variance-covariance matrix $C$. Then a vector of chi-square random variables with 2 degrees of freedom, say $(\chi_1^2(2), \ldots, \chi_{15}^2(2))$, is simulated independently. The chi-square random variables are independent of each other. The Student-t random vector is defined as $(t_1(2), \ldots, t_{15}(2)) = \big(N_1/(\chi_1^2(2)/2)^{1/2}, \ldots, N_{15}/(\chi_{15}^2(2)/2)^{1/2}\big)$.
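For readers working outside Matlab, the recipe in footnote 23 is straightforward to replicate. The NumPy sketch below is our translation of that recipe, assuming the correlation matrix C has been built as described there:

import numpy as np

def draw_correlated_t2(C, n, seed=6):
    # t_i = N_i / sqrt(chi2_i(2)/2), with N ~ N(0, C) and iid chi-squares
    rng = np.random.default_rng(seed)
    k = C.shape[0]
    N = rng.multivariate_normal(np.zeros(k), C, size=n)
    chi = rng.chisquare(df=2, size=(n, k))   # one chi-square per coordinate
    return N / np.sqrt(chi / 2.0)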

For different values of $r$, the maximum and minimum MSE of the restricted JS estimator in this simulation study are very close to each other. Therefore, we only present its minimum MSE in Figure E.6. For different values of $r$, the maximum and minimum bias (or variance) of the conservative GMM estimator in this simulation study are different from each other. Therefore, we present both its minimum and maximum bias (and minimum and maximum variance) in Figure E.7 (and Figure E.8 respectively).

It is interesting to see that in this simulation design, both the pre-test GMM estimator and the averaging GMM estimator have smaller finite sample MSEs than the conservative GMM estimator. The main reason for this phenomenon is that the residual term $u$ in the structural equation is Student-t distributed with 2 degrees of freedom, which implies that $u$ has infinite variance and hence the conservative GMM estimator has a large variance in finite samples. This can be clearly seen from Figure E.8. When the extra IVs $Z_j$ ($j = 1, \ldots, 6$) are used in the GMM estimation, the finite sample variance of the GMM estimator is greatly reduced. Therefore, the finite sample biases of the pre-test estimator and the averaging GMM estimator introduced by the extra IVs $Z_j$ ($j = 1, \ldots, 6$) are more than offset by the reduced finite sample variances, which enables both estimators to have smaller finite sample MSEs.

Figure E.6. Finite Sample MSEs of the Pre-test and Averaging GMM Estimators
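The infinite-variance point is easy to see numerically: for t(2) draws, the sample variance does not settle to a limit as the sample grows, unlike in the normal design of Section E.2. A tiny illustration of ours:

import numpy as np

rng = np.random.default_rng(7)
u = rng.standard_t(df=2, size=10_000_000)
for m in [10**4, 10**5, 10**6, 10**7]:
    print(m, u[:m].var())                    # no convergence as m grows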

Figure E.7. Finite Sample Biases of the Pre-test and Averaging GMM Estimators

Figure E.8. Finite Sample Variances of the Pre-test and Averaging GMM Estimators

F Illustration in the Gaussian Location Model

This section shows that in a Gaussian location model, the averaging GMM estimator dominates the conservative GMM estimator in finite samples, i.e., it exhibits the James-Stein phenomenon. Suppose that we have one observation $(X', Y')'$ from the normal distribution

$\begin{pmatrix} X \\ Y \end{pmatrix} \sim N\left(\begin{pmatrix} \theta \\ \theta + d \end{pmatrix},\ I_{2k}\right)$,  (F.1)

where $\theta$ and $d$ are $k\times 1$ vectors and $I_{2k}$ is a $2k\times 2k$ identity matrix. We are interested in estimating $\theta$. Let $\Upsilon$ be the $k\times k$ identity matrix. The conservative GMM estimator $\hat\theta_1 = X$ has risk $\mathrm{tr}(I_k) = k$. On the other hand, the aggressive GMM estimator is $\hat\theta_2 = (X + Y)/2$, which has risk $k/2 + \|d\|^2/4$. The empirical optimal weight defined in (4.7) becomes

$\tilde\omega_{eo} = \frac{2k}{2k + (Y - X)'(Y - X)}$.  (F.2)
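The dominance claim can be previewed by direct simulation of (F.1)-(F.2); $k$, $d$ and the replication count below are illustrative choices of ours.

import numpy as np

rng = np.random.default_rng(8)
k, reps = 4, 200000
d, theta = np.full(k, 0.5), np.zeros(k)
X = theta + rng.normal(size=(reps, k))
Y = theta + d + rng.normal(size=(reps, k))
t1, t2 = X, (X + Y) / 2.0                             # conservative, aggressive
w = 2 * k / (2 * k + ((Y - X) ** 2).sum(1))           # empirical weight (F.2)
t_eo = t1 + w[:, None] * (t2 - t1)                    # averaging estimator
for name, t in [("theta_1", t1), ("theta_2", t2), ("averaging", t_eo)]:
    print(name, ((t - theta) ** 2).sum(1).mean())
# risks: theta_1 is k = 4, theta_2 is k/2 + ||d||^2/4 = 2.25,
# and the averaging risk should come in below k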
