Irregular Identification, Support Conditions, and Inverse Weight Estimation


Shakeeb Khan and Elie Tamer

This version: Jan. 30, 2009

ABSTRACT. This paper points out a subtle, sometimes delicate property of estimators based on weighted moment condition models: there is an intimate link between identification and estimability. If it is necessary for (point) identification that the weights take arbitrarily large values, then the parameter of interest, though point identified, cannot be estimated at the regular (parametric) rate. This rate depends on relative tail conditions and can be as slow as $n^{1/4}$. Practically speaking, this nonstandard rate of convergence can lead to numerical instability and/or large standard errors. We examine two examples where a parameter is defined by a weighted moment condition with the above property: 1) the binary response model under a mean restriction introduced by Lewbel (1997) and further generalized to cover endogeneity and selection, where the estimator in this class of models is weighted by the density of a special regressor; and 2) the treatment effect model under exogenous selection, where the resulting estimator of the average treatment effect (ATE) is weighted by a variant of the propensity score. Without support conditions (which can be strong), these models, like well-known identified at infinity models, lead to estimators that converge at slower than the parametric rate, since essentially, to ensure point identification, one requires some variables to take values on sets with arbitrarily small probabilities, or thin sets. For the two models above, we derive rates of convergence under different tail conditions and also propose rate adaptive inference procedures, analogous to Andrews and Schafgans (1998).

JEL Classification: C14, C25, C13.

Key Words: Irregular and robust identification, inverse weighting, rates of convergence.

We thank a Co-editor, 3 referees, J. Powell and seminar participants at Brown, UC-Berkeley, U Chicago, Duke, Harvard/MIT, Indiana, Simon Fraser, Stanford, UBC, University of Toronto, UW-Madison, UWO, and Yale, as well as conference participants at the 2007 NASM at Duke University and 2008 NAWM in New Orleans, for helpful comments. Department of Economics, Duke University, 213 Social Sciences Building, Durham, NC. Phone: (919). Department of Economics, Northwestern University, 2001 Sheridan Road, Evanston, IL. Phone: (847). Support from the National Science Foundation is gratefully acknowledged.

1 Introduction

The identification problem in econometrics is a first step that one must explore and carefully analyze before any meaningful inferences can be reached. Identification clarifies what can be learned about a parameter of interest in a given model. There is a class of models in econometrics, mainly arising in limited dependent variable models, that attain identification by requiring that covariates take support in regions with arbitrarily small probability mass. These identification strategies sometimes lead to estimators that are weighted by a density or a conditional probability, and these weights take arbitrarily large values on these regions of small mass. For example, an estimator that is weighted by the inverse of the density of a continuous random variable effectively only uses observations for which that density is small. Taking values on these thin sets is essential (often necessary) for point identification in these models. We explore the relationship between the inverse weighting method in these models and the rates of convergence of the resulting estimators, and show that under general conditions it is often not possible to attain the regular parametric rate (square root of the sample size) with these models. Furthermore, these rates of convergence depend on unknown relative tail behavior of observed and unobserved random variables, making inference more complicated than in standard problems.

We label these identification strategies as irregular in the sense that estimators based on them do not converge at the parametric (square root of the sample size) rate. Consequently we argue that these strategies belong to the class of identified at infinity approaches (see Chamberlain (1986) and Heckman (1990a)). Note that it is part of econometrics folklore to equate the parameter being identified at infinity with slow rates of convergence, as was also established in Chamberlain (1986) for some particular models. See also Andrews and Schafgans (1998). We show that the actual rates of convergence for estimators based on these identification strategies vary substantially with tail conditions on the underlying variables. The rate of convergence is an explicit function of the relative tail behavior. This suggests the need for a rate adaptive inference procedure for this class of models, and this paper aims to address this by proposing and establishing the asymptotic properties of studentized estimators.

The results in this paper are connected to two models that we examine in detail. In these models, the parameter of interest can be written as an expectation of functions of observed random variables. First, we consider the binary choice model under an exclusion restriction and mean restrictions. This model was introduced to the literature by Lewbel (1997), Lewbel (1998) and Lewbel (2000). There, Lewbel demonstrates that this binary model is point identified with only a mean restriction on the unobservables, by requiring the presence of a special regressor that is conditionally independent of the error.

Under these conditions, Lewbel provides a density weighted estimator for the finite dimensional parameter (including the constant term). This estimator contains a random denominator that can take arbitrarily small values in regions that are necessary for point identification. We show that the parameters in a simple version of that model are irregularly identified unless special relative tail behavior conditions are imposed. Two such conditions are: a support condition such that the propensity score hits its limiting values with positive measure, in which case root-n estimation can be attained; or infinite moment restrictions on the regressors. As we show below, the optimal rate of convergence in these cases depends on the tail behavior of the special regressor relative to the error distribution.

Second, we consider the important treatment effects model under binary treatment and exogenous selection. This is an important model that is widely used in economics and statistics to estimate the average treatment effect (ATE), or the treatment on the treated (ATOT) parameter, in program evaluations. For an exhaustive review of this literature, see Imbens (2004). Hahn (1998), in important work, derived the efficiency bound for this model and provided an estimator that reaches that bound (see also Hirano, Imbens, and Ridder (2003)). In the case where the covariates have relatively large support, the propensity score gets arbitrarily close to zero or one (as opposed to being bounded away from zero and one).⁴ We show that the average treatment effect in this case can be irregularly identified (the semiparametric efficiency bound can be infinite), resulting in non-regular rates of convergence for the ATE in general: the optimal rates will depend on the relative tail thickness of the error in the treatment equation and the covariates. For empirical users, this results in instability of the estimator in cases where there is limited overlap, and care should be taken in making inference on these parameters. Busso, DiNardo, and McCrary (2008), in an important recent paper, highlight this instability in extensive Monte Carlo simulations where the ATE and ATOT estimators appear to exhibit bias at moderate sample sizes. See also Frolich (2004). Also, in Appendix A.1, we present impossibility theorems, analogous to those proposed in Chamberlain (1986) for sample selection and binary choice models, which show that generally no regular estimator (see Newey (1990)) exists for these models.

The next section formally defines the concepts of irregular and robust identification and relates them to the two models we study in detail in sections 3 and 4. Specifically, section 3 considers the binary choice model under mean restrictions, and section 4 considers the average treatment effect.

Footnote 4: The identification strategy for these models is based on writing the estimand as a weighted sum where the weights can get arbitrarily small. This intuition applies equally to a matching strategy, since a matching estimator can be written as a weighted estimator where weights can get small.

In each of these sections we also derive optimal rates of convergence for estimators of the parameters of interest that are sample analogs of the moment conditions used to identify these parameters, and propose rate adaptive inference procedures. Section 5 concludes by summarizing and suggesting areas for future research.

2 Irregular Identification and Inverse Weighting

The notion of regularity of statistical models is linked to efficient estimation of parameters; see, e.g., Stein (1956). The concept of irregular identification has been related to consistent estimation of the parameters of interest at the parametric rate. An important early reference to irregular identification in the literature appears in Chamberlain (1986), dealing with efficiency bounds for sample selection models.⁵ There, Chamberlain showed that even though a slope parameter in the selection equation is identified, it is not possible to estimate it at the parametric root-n rate (under the assumptions he maintains in the paper). Chamberlain added that point identification in this case relies on the behavior of the regression function at infinity. In essence, since the disturbance terms are allowed to take support on the real line, one requires that the regression function take large values so that in the limit the propensity score is equal to one (or is arbitrarily close to one). The identified set in the model shrinks to a point when the propensity score hits one (or no selection). Heckman (1990b) highlighted the importance of identified at infinity parameters in various selection models and linked this type of identification essentially to the propensity score. Andrews and Schafgans (1998) used an estimator proposed by Heckman (1990a) and confirmed slow rates of convergence of the estimator for the constant, and also showed that this rate depends on the thickness of the tails of the regression function distribution relative to that of the error term. This rate can be as slow as cube root and can be arbitrarily close to root n.

The class of models we consider in this paper are ones where the parameter of interest can be written as a closed form solution to a weighted moment condition. The weight in these models can take arbitrarily small values, which causes point identification to be fragile. This will mean that the rate of convergence is generally slower than the regular rate. The exact rate is determined by relative tail conditions on observed and unobserved variables. In some models, not only is point identification delicate, but the model will not have any information about the true parameter, i.e., the identified set is the whole parameter space. This leads to what we call non-robust irregular identification, in the sense that one either learns the parameter precisely, or the model provides no information about the parameter of interest.

Footnote 5: See also Newey (1990) for other impossibility theorems and notions of regular and irregular estimators.

We view this lack of robustness as an added drawback of inference in this class of models.

Heuristically, we consider models where the parameter of interest can be written as an expectation of some function of observed variables,

$$\theta_0 = E[g(y_i, x_i)] \qquad (2.1)$$

where $\theta_0$ is a finite dimensional parameter of interest and $(y_i, x_i)$ is a finite dimensional vector of observed random variables. In this paper we are particularly interested in the class of models above where

$$E[\|g(y_i, x_i)\|^2] = \infty \qquad (2.2)$$

where $\|\cdot\|$ denotes the Euclidean norm. As this paper will show, many recently proposed identification strategies based on inverse weighting fall into this class, as the inverse weight function can become arbitrarily large when the conditioning variables take the limiting values of their support. As we will show in this paper, identification of $\theta_0$ in these cases is irregular as defined below:

Definition 2.1 (Irregular Identification of a parameter vector) We say that a parameter vector is irregularly point identified if it cannot be estimated regularly⁶ at the parametric rate.

Furthermore, in this paper we will distinguish between two classes of irregularly identified models, robust and non-robust, defined as follows:

Definition 2.2 (Robust Irregular Identification of a parameter vector) We say that a parameter vector is robustly irregularly point identified if it is irregularly identified as defined above, yet the function $g(\cdot)$ in (2.1) is bounded on the support of its arguments.

We note that this definition is a special case of the classical definition of robustness (see, e.g., Hampel (1986)), where we focus attention on irregularly identified models. The rest of this paper will explore the statistical properties of estimators of models based on (2.1). As we will prove, these estimators generally converge at a rate slower than the standard parametric rate, and their optimal rate depends on relative tail conditions on the random variable $x_i$ and the unobserved components also governing the behavior of $y_i$. This general setting will encompass many models in econometrics, but for illustrative purposes, our discussion will focus on two important models in the recent literature that are based on weighted moment conditions: the binary response model under mean restrictions, and the average treatment effect estimator under conditional independence.

Footnote 6: See, e.g., Newey (1990) for the conditions for an estimator to be regular.

2.1 Preview of Approach

The key problem with an empirical analog estimator of $\theta_0$,

$$\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} g(y_i, x_i) \qquad (2.3)$$

is that although it is consistent (assume $\theta_0 = 0$), it has the unattractive property that $E[\|g(y_i, x_i)\|^2] = \infty$, which does not allow us to use a standard CLT to conclude that $\hat{\theta}$ is root-n consistent and asymptotically normal. One approach is to introduce a trimming sequence $\tau_{ni} = \tau(y_i, x_i, n) > 0$ a.s., with $\tau_{ni} \to 1$, such that

$$E[\tau_{ni}^2 \|g(y_i, x_i)\|^2] \equiv \gamma_n < \infty \qquad (2.4)$$

But note that since $\tau_{ni}$ is converging to 1, $\gamma_n$ will generally diverge to infinity as the sample size increases. Thus we can explore the asymptotic properties of the trimmed estimator:

$$\hat{\theta}_n = \frac{1}{n}\sum_{i=1}^{n} \tau_{ni}\, g(y_i, x_i) \qquad (2.5)$$

This trimming, which disregards observations in the sample where the value of the denominator is small, will introduce a bias which has to be taken into account. So, ultimately, for the binary response model and the ATE model we consider, we address the following questions.

1. What is the optimal rate of convergence for $\hat{\theta}_n$?

2. What is the limiting distribution of $\hat{\theta}_n$?

3. How does one conduct inference for $\theta_0$ given that the rate of convergence of its estimator will generally be unknown?

Addressing the first question is completely analogous to deriving optimal rates for nonparametric estimation. That is, the variance and bias are inversely related in the sense that here $\tau_{ni}$ controls the variance, but at the expense of inducing bias. Therefore, as in nonparametric estimation, we can solve for an optimal sequence $\tau_{ni}^*$ that balances the bias and the variance.

To address the second question, we can apply the Lindeberg theorem to the centered trimmed estimator, which should hold under a version of a Lindeberg condition. Finally, to address the third question, our proposed approach to conducting inference based on $\hat{\theta}_n$ will be to studentize the estimator. As we will show, under our conditions the t-statistic will be rate adaptive in the sense that it will converge in distribution to a normal regardless of the rate of convergence of $\hat{\theta}$. We next apply this framework to specific models, beginning with the binary choice model. Next, we examine the average treatment effect model under conditional independence.

3 Binary Choice with Mean Restrictions

In the standard binary response model

$$y = 1[x'\beta + \epsilon \geq 0]$$

Manski (1988) showed that a conditional mean restriction does not identify the parameter $\beta$. In fact, he showed that the mean independence assumption is not strong enough in binary response models to even bound $\beta$. Hence, we modify this model by adding more assumptions to ensure point identification. The ensuing model is a simplified version of the one introduced by Lewbel (1997):

$$y_i = 1[\alpha + v_i - \epsilon_i \geq 0] \qquad (3.1)$$

where $v_i$ is a scalar random variable that is independent of $\epsilon_i$, both $\epsilon_i$ and $v_i$ have support on the real line, and $E[\epsilon_i] = 0$. We observe both $y$ and $v$, and the object of interest is the parameter $\alpha$. The location restriction on $\epsilon$ point identifies $\alpha$. We start with the following:

$$P(y_i = 1 \mid v_i) = F_\epsilon(\alpha + v_i) \qquad (3.2)$$

where $F_\epsilon(\cdot)$ is the cdf of $\epsilon$, which we assume is a strictly increasing function. Lewbel (1997) derived the following relation:

$$\alpha = E\left[(y_i - I[v_i > 0])\,\frac{1}{f(v_i)}\right] \equiv E\big[(y_i - I[v_i > 0])\,w(v_i)\big] \qquad (3.3)$$

with the weight function $w(v_i) = \frac{1}{f(v_i)}$, where $f(\cdot)$ here denotes the density of $v_i$. This type of identification is sensitive to conditions imposed on the support of $v$ and $\epsilon$. For example, identification of $\alpha$ is lost when one excludes sets of $v$ of arbitrarily small probability when $v$ is in the tails.

The model in this case will not contain any information about $\alpha$ (trivial bounds). This is a case of non-robust or thin-set identification. To see this, note that if we restrict the support of $v$ to lie in the set $[-K, K]$ for some $K > 0$, $\alpha$ will not be point identified:

$$\alpha = -\Bigg[\underbrace{\int_{-\infty}^{-K} v\, f_\epsilon(\alpha + v)\,dv}_{(1)} + \underbrace{\int_{-K}^{K} v\, \frac{\partial}{\partial v} P(y = 1 \mid v)\,dv}_{(2)} + \underbrace{\int_{K}^{\infty} v\, f_\epsilon(\alpha + v)\,dv}_{(3)}\Bigg]$$

We see that only (2) is identified, while (1) and (3) are not since, from (3.2), we can only learn $F_\epsilon$ on $[\alpha - K, \alpha + K]$. So, no matter how large $K$ is, the model is only set identified. In addition, we see that one can easily choose the unidentified portion of $f_\epsilon(\cdot)$ (parts (1) and (3) above) in such a way that the model with these (strange) support conditions provides no information about $\alpha$. This was shown for the Lewbel model by Magnac and Maurin (2005) (we view this property as similar to the non-robustness of the sample mean).

To guarantee point identification when $\epsilon$ has support on the real line $(-\infty, +\infty)$, $v$ is also required to take arbitrarily large values. This means that the density of $v_i$ vanishes at these values, implying that there is a set of arbitrarily small measure, in this case $|v| > M$ for an arbitrarily large constant $M$, where the weight function in (3.3) becomes unbounded. This suggests that identification of $\alpha$ based on (3.3) is irregular in the sense that an analog estimator will not generally converge at the parametric rate.⁷ ⁸ We confirm this later⁹ by showing how the optimal rates of the analog estimator can vary widely with tail behavior conditions on observed and unobserved variables. This nonuniform behavior of the analog estimator makes conducting inference difficult, and motivates our proposed rate adaptive inference procedure, which we explain in the following section.

Footnote 7: This is a property shared by the estimator in Heckman (1990b) and Andrews and Schafgans (1998) for an estimator of the intercept term in a sample selection model. As is the case in that setting, the slower rate attained does not necessarily imply inefficiency of the analog estimator. In fact the Appendix derives a result analogous to an impossibility theorem in Chamberlain (1986).

Footnote 8: In cases where $\epsilon$ has bounded support, it is possible that $\alpha$ is identified regularly, if the support of $v$ is large relative to that of $\epsilon$. One can check this easily since for those large values of $v$ the choice probability hits 0 or 1.

Footnote 9: We also derive the efficiency bounds for this model in the Appendix.
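To fix ideas, the following small simulation (a sketch that is not part of the paper's analysis; it assumes standard logistic $\epsilon_i$ and $v_i$, sets $\alpha = 0.5$, and uses the true density of $v$, i.e., the infeasible weight) illustrates how the moment in (3.3) recovers $\alpha$ while its summands carry arbitrarily large inverse-density weights in the tails, and how the trimming indicator $I[|v_i| \leq \gamma_{2n}]$ introduced in the next section keeps those weights under control.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 100_000, 0.5

# Simplified Lewbel design (3.1): y = 1[alpha + v - eps >= 0], eps and v standard logistic.
v = rng.logistic(size=n)
eps = rng.logistic(size=n)
y = (alpha + v - eps >= 0).astype(float)

# Infeasible weight 1/f(v) using the true logistic density of v (in practice f is estimated).
f_v = np.exp(-v) / (1.0 + np.exp(-v)) ** 2
g = (y - (v > 0)) / f_v                    # summand of the weighted moment (3.3)

print("untrimmed estimate of alpha:", g.mean())
print("sample second moment of summand:", np.mean(g ** 2))  # grows with n: E[g^2] is infinite here

# Trimmed version, anticipating (3.4): drop observations with |v| > gamma_2n.
for gamma in (2.0, 4.0, 8.0):
    keep = np.abs(v) <= gamma
    print(f"gamma_2n = {gamma}: trimmed estimate = {np.mean(g * keep):.3f}, "
          f"largest retained weight = {np.max(1.0 / f_v[keep]):.1f}")
```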

3.1 Rate Adaptive Inference in the Binary Choice Model

We propose the use of rate-adaptive inference procedures to estimate the binary response model above. This is similar to what Andrews and Schafgans (1998) did for the selection model. Let $\hat{\alpha}_n$ denote the trimmed variant of the estimator proposed in Lewbel (1997) for the intercept term:

$$\hat{\alpha}_n = \frac{1}{n}\sum_{i=1}^{n} \frac{y_i - I[v_i > 0]}{\hat{f}(v_i)}\, I[|v_i| \leq \gamma_{2n}] \qquad (3.4)$$

where $\hat{f}$ is a kernel estimator of the density of $v$ (details of the kernel estimation procedure can be found in the Appendix). The estimator includes the additional trimming term $I[|v_i| \leq \gamma_{2n}]$, where $\gamma_{2n}$ is a deterministic sequence of numbers that satisfies $\lim_{n} \gamma_{2n} = \infty$ and $\lim_{n} \gamma_{2n}/n = 0$. Effectively, this extra term helps govern tail behavior. Trimming this way suffices (under the assumptions in Lewbel (1997)) to deal with the denominator problem associated with the density function getting small, which, under the stated conditions in Lewbel (1997), only happens in the tails of $v_i$. Let $\hat{S}_n$ denote a trimmed estimator of what would be the asymptotic variance if conditions were such that the asymptotic variance were finite:

$$\hat{S}_n = \frac{1}{n}\sum_{i=1}^{n} \frac{\hat{p}(v_i)(1 - \hat{p}(v_i))}{\hat{f}(v_i)^2}\, I[|v_i| \leq \gamma_{2n}] \qquad (3.5)$$

where $\hat{p}(\cdot)$ denotes a kernel estimator of the choice probability function (or the propensity score). Our main result in this section is that the studentized estimator converges to a standard normal distribution. We consider this result to be rate-adaptive in the sense that the limiting distribution holds regardless of the rate of convergence of the un-studentized estimator. We state this formally as a theorem, which first requires the following definitions. Define

$$v(\gamma_{2n}) = E\left[\frac{p(v_i)(1 - p(v_i))}{f(v_i)^2}\, I[|v_i| \leq \gamma_{2n}]\right] \qquad (3.6)$$

and

$$b(\gamma_{2n}) = \alpha_0 - E\left[\frac{y_i - I[v_i > 0]}{f(v_i)}\, I[|v_i| \leq \gamma_{2n}]\right] \qquad (3.7)$$

$$X_{ni} = \frac{y_i - p(v_i)}{f(v_i)}\, I[|v_i| \leq \gamma_{2n}] \qquad (3.8)$$

The first piece, $v(\gamma_{2n})$, is the variance of the infeasible estimator where in (3.4) the true $f(\cdot)$ is used. This term is of the same order as the variance of the feasible estimator (see the Appendix).

Similarly, the term in (3.7) is the bias term for the infeasible estimator. The form of $X_{ni}$ is the linear representation of $\hat{\alpha}_n$, which involves first linearizing the estimator around $\hat{f}$ and then taking the usual U-statistic projections. See Section A.4 of the Appendix for details on this.

Theorem 3.1 Suppose (i) $\gamma_{2n} \to \infty$, (ii) for all $\epsilon > 0$,

$$\lim_{n \to \infty} \frac{1}{v(\gamma_{2n})}\, E\Big[X_{ni}^2\, I\big[|X_{ni}| > \epsilon \sqrt{n\, v(\gamma_{2n})}\big]\Big] = 0 \qquad (3.9)$$

and (iii) $\sqrt{n / v(\gamma_{2n})}\; b(\gamma_{2n}) \to 0$. Then

$$\sqrt{n}\, \hat{S}_n^{-1/2}(\hat{\alpha}_n - \alpha_0) \rightarrow_d N(0, 1) \qquad (3.10)$$

The proof is left to the Appendix and here we only comment on the conditions. Condition (ii) is a Lindeberg condition that allows us to apply a central limit theorem. Condition (iii), on the other hand, is needed to ensure that the bias vanishes with the sample size. Other conditions on the kernel and the other smoothing parameters are given in Section A.3 of the Appendix. This result is analogous to the results in Andrews and Schafgans (1998), who considered asymptotic properties of the identification at infinity estimator proposed in Heckman (1990a). Note that the rate of convergence of this estimator depends on the behavior of $\hat{S}_n$. For example, in cases where $\hat{S}_n$ converges in probability to a non-random matrix, the estimator above will converge at the regular root-n rate. However, in some interesting cases that we go over below, the variance of the estimator goes to infinity, which results in slower than regular rates. Finally, for a model with regressors $x$, the estimator for the vector of parameters (which includes both slope and intercept) can be obtained in closed form (see Lewbel (1997)):

$$\hat{\beta} = \left\{\frac{1}{n}\sum_{i=1}^{n} x_i x_i'\right\}^{-1} \frac{1}{n}\sum_{i=1}^{n} x_i\, \frac{y_i - 1[v_i \geq 0]}{\hat{f}(v_i \mid x_i)}\, I[|v_i| \leq \gamma_{2n}]$$

and so the studentized estimator is similar to the one above.

3.2 Relative Tail Behavior and Rates of Convergence in Special Cases

In this section, we investigate further the rate of convergence of the estimator in the previous section. This rate of convergence depends on the relative tail behavior of the density of $v$ versus the tails of the propensity score, or the rate at which the propensity score approaches 0 and 1.

In generic cases, the rate of convergence for the estimator of the intercept term in (3.1) is slower than the parametric rate. The result holds for any model where the tail of the special regressor $v$ is as thin as or thinner than the tail of the error term. We also point out that there are cases (when the moment conditions in the above theorem are violated) where the rate of convergence reaches the regular parametric rate. As we show below, when $v_i$ is Cauchy, the estimator of $\alpha$ converges at the root-n rate.¹⁰ Here, we focus on the rate of convergence for the infeasible estimator. This rate of convergence is of the same order as that of the feasible estimator (see the Appendix for more on this). Consider the infeasible version of the estimator of $\alpha$ proposed by Lewbel (1997):

$$\hat{\alpha} = \frac{1}{n}\sum_{i=1}^{n} \frac{y_i - I[v_i > 0]}{f(v_i)}\, I[|v_i| \leq \gamma_{2n}] \qquad (3.11)$$

Next we define $\bar{\alpha}_n = E[\hat{\alpha}]$. In what follows we will establish a rate of convergence and limiting distribution for $\hat{\alpha} - \bar{\alpha}_n$. To do so we first define the sequence of constants $v(\gamma_{2n}) = \mathrm{Var}\left(\frac{y_i - I[v_i > 0]}{f(v_i)}\, I[|v_i| \leq \gamma_{2n}]\right)$, and in what follows we let $h_n = v(\gamma_{2n})^{-1}$. As alluded to in our previous discussion, we will generally have $\lim_{n} v(\gamma_{2n}) = \infty$, which will result in a slower rate of convergence. To derive the distribution of the above, we will apply the Lindeberg condition to the following triangular array:

$$\sqrt{n h_n}\,(\hat{\alpha} - \bar{\alpha}_n) = \sqrt{\frac{h_n}{n}} \sum_{i=1}^{n} \left(\frac{y_i - I[v_i > 0]}{f(v_i)}\, I[|v_i| \leq \gamma_{2n}] - \bar{\alpha}_n\right) \qquad (3.12)$$

Before claiming asymptotic normality of the above term we verify that the Lindeberg condition for the array is indeed satisfied. Specifically, we need to establish the limit of

$$h_n\, E\left[\left(\frac{y_i - I[v_i > 0]}{f(v_i)}\, I[|v_i| \leq \gamma_{2n}] - \bar{\alpha}_n\right)^2 I\left[\left|\frac{y_i - I[v_i > 0]}{f(v_i)}\, I[|v_i| \leq \gamma_{2n}] - \bar{\alpha}_n\right| > \epsilon\sqrt{\frac{n}{h_n}}\right]\right] \qquad (3.13)$$

for $\epsilon > 0$. The above limit is indeed zero since $h_n E\left[\left(\frac{y_i - I[v_i > 0]}{f(v_i)}\, I[|v_i| \leq \gamma_{2n}] - \bar{\alpha}_n\right)^2\right] = 1$ and $n/h_n \to \infty$. Therefore, we can conclude that

$$\sqrt{n h_n}\,(\hat{\alpha} - \bar{\alpha}_n) \rightarrow_d N(0, 1) \qquad (3.14)$$

Footnote 10: The notion of attaining a faster rate of convergence when moments are infinite is not new. For example, it is well known that when regressors have a Cauchy distribution in the basic linear model, OLS is superconsistent.
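The pieces above can be combined into a feasible version of the trimmed estimator together with its studentized standard error. The sketch below is illustrative only: the leave-one-out Gaussian kernels and the rule-of-thumb bandwidth are assumptions of the sketch, not the kernel conditions stated in the paper's Appendix, and the pairwise implementation is $O(n^2)$, intended for modest sample sizes.

```python
import numpy as np

def lewbel_trimmed(y, v, gamma_2n, bandwidth=None):
    """Trimmed density-weighted intercept estimator, eq. (3.4), with the variance
    estimate of eq. (3.5). Gaussian leave-one-out kernels and the rule-of-thumb
    bandwidth are assumptions of this sketch."""
    n = len(v)
    h = bandwidth if bandwidth is not None else 1.06 * v.std() * n ** (-0.2)

    u = (v[:, None] - v[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    np.fill_diagonal(K, 0.0)
    f_hat = K.sum(axis=1) / ((n - 1) * h)                    # kernel density of v
    p_hat = (K * y[None, :]).sum(axis=1) / K.sum(axis=1)     # kernel propensity p(v)

    trim = (np.abs(v) <= gamma_2n).astype(float)
    alpha_hat = np.mean((y - (v > 0)) / f_hat * trim)            # eq. (3.4)
    S_hat = np.mean(p_hat * (1.0 - p_hat) / f_hat ** 2 * trim)   # eq. (3.5)
    se = np.sqrt(S_hat / n)   # Theorem 3.1: (alpha_hat - alpha_0)/se is approx. N(0,1)
    return alpha_hat, se
```

A confidence interval of the form $\hat{\alpha}_n \pm 1.96\,\mathrm{se}$ is then asymptotically valid whether or not $\hat{S}_n$ diverges, which is the rate-adaptive feature of Theorem 3.1.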

So we have established that the rate of convergence of the centered estimator is governed by the limiting behavior of $h_n$. As we will show below, in most cases $h_n \to 0$, resulting in a slower rate of convergence, as is to be expected given our efficiency bound calculations in A.1. Of course, the above calculation only derives rates for the centered estimator, and not the estimator itself. To complete the distribution theory we need to derive the rate of convergence of the bias term $\sqrt{n h_n}\,(\bar{\alpha}_n - \alpha_0)$, where $\alpha_0 = E\left[\frac{y_i - I[v_i > 0]}{f(v_i)}\right]$, so we have:

$$b_n \equiv \bar{\alpha}_n - \alpha_0 = -\int_{|v| > \gamma_{2n}} \big(p(v) - 1[v > 0]\big)\,dv \qquad (3.15)$$

where $p(v) = P(y_i = 1 \mid v_i = v)$ is the propensity score. Clearly we have $\lim_n b_n = 0$ if we maintain that $\lim_n \gamma_{2n} = \infty$. There we can see that the limiting distribution of the estimator can be characterized by two components: the variance, which depends on the limiting ratio of the propensity score divided by the regressor density, and the bias, which depends on the rate at which the propensity score converges to 1.

Our results are somewhat abstract as stated since we have not yet stated conditions on $\gamma_{2n}$ except that it increases to infinity. Nor have we stated what the sequences $h_n$ and $b_n$ look like as functions of $\gamma_{2n}$. We show now how they relate to tail behavior assumptions on the regressor distribution and the latent error term. First we calculate the form of the variance term $v(\gamma_{2n})$ as a function of $\gamma_{2n}$. Note we can express $v(\gamma_{2n})$ as the following integral:

$$v(\gamma_{2n}) = \int_{-\gamma_{2n}}^{\gamma_{2n}} \frac{p(v)(1 - p(v))}{f(v)}\,dv + \mathrm{Var}\left(E\left[\frac{y_i - I[v_i > 0]}{f(v_i)}\, I[|v_i| \leq \gamma_{2n}] \,\Big|\, v_i\right] - \bar{\alpha}_n\right) \qquad (3.16)$$

We will focus on the first term because we will now show the second term is negligible when compared to the first. The second term in the variance is of the form:

$$\mathrm{Var}\left(E\left[\frac{y_i - I[v_i > 0]}{f(v_i)}\, I[|v_i| \leq \gamma_{2n}] \,\Big|\, v_i\right] - \bar{\alpha}_n\right) \qquad (3.17)$$

The above variance is of the form:

$$E\big[(\bar{\alpha}_n(v_i) - \bar{\alpha}_n)^2\big] \qquad (3.18)$$

where $\bar{\alpha}_n(v_i)$ denotes the conditional expectation in (3.17).

Note that since $\bar{\alpha}_n$ converges to $\alpha_0$ and $E[\bar{\alpha}_n(v_i)] = \bar{\alpha}_n$, the only term in the above expression that may diverge is

$$E[\bar{\alpha}_n(v_i)^2] \qquad (3.19)$$

This term can be expressed as the integral:

$$\int_{-\gamma_{2n}}^{\gamma_{2n}} \frac{(1 - p(v))^2}{f(v)}\,dv \qquad (3.20)$$

which will generally be dominated by the first piece of the variance term in (3.16), which recall was of the form:

$$\int_{-\gamma_{2n}}^{\gamma_{2n}} \frac{p(v)(1 - p(v))}{f(v)}\,dv \qquad (3.21)$$

So generally speaking, as far as deriving the optimal rate of convergence for the estimator is concerned, we can ignore this component of the variance. We can show that the bias term is

$$b_n = E[\hat{\alpha}] - \alpha = \int_{-\gamma_{2n}}^{\gamma_{2n}} \{p(v + \alpha) - 1[v \geq 0]\}\,dv - \alpha \qquad (3.22)$$

This bias term behaves asymptotically like:

$$b_n \asymp \gamma_{2n}\,(1 - p(\gamma_{2n})) \qquad (3.23)$$

Note that the bias term does not directly depend on the density of $v_i$. With these general results, we now calculate the rate of convergence for some special cases corresponding to various tail behavior conditions on the regressors and the error terms.

- [Logit Errors/Logit Regressors] Here we assume the latent error term and the regressors both have a standard logistic distribution. We consider this to be the main example, as the results that arise here (a slower than root-n rate) will generally also arise anytime we have distributions whose tails decline exponentially, such as the normal, logistic and Laplace distributions. From our calculations in the previous section we can see that

$$v(\gamma_{2n}) = \gamma_{2n} \quad \text{and} \quad b_n = \ln\frac{1 + e^{\gamma_{2n} + \alpha}}{1 + e^{-\gamma_{2n} + \alpha}} - \ln\big(e^{\gamma_{2n} + \alpha}\big) \qquad (3.24)$$

These behave asymptotically like

$$v(\gamma_{2n}) = \gamma_{2n} \quad \text{and} \quad b_n = \gamma_{2n}\,\frac{\exp(-\gamma_{2n})}{1 + \exp(-\gamma_{2n})} \qquad (3.25)$$

So to minimize mean squared error we set $\frac{\gamma_{2n}}{n} = \frac{\gamma_{2n}^2 \exp(-2\gamma_{2n})}{(1 + \exp(-\gamma_{2n}))^2}$ to get $\gamma_{2n} = O(\log n^{1/2})$, resulting in the rate of convergence of the estimator:

$$\sqrt{\frac{n}{\log n^{1/2}}}\,(\hat{\alpha} - \alpha_0) = O_p(1) \qquad (3.26)$$

Furthermore, from the results in the previous section, the estimator will be asymptotically normally distributed, and have a limiting bias.

- [Normal Errors/Normal Regressors] Here we assume that the latent error term and the regressors both have a standard normal distribution. To calculate $v(\gamma_{2n})$ in this case, we can use the asymptotic properties of the Mills ratio (see, e.g., Gordon (1941, AMS, 12)), $r(\cdot)$:

$$r(t) \sim \frac{1}{t} \quad \text{as } t \to \infty \qquad (3.27)$$

to approximate $v(\gamma_{2n}) \asymp \log(\gamma_{2n})$. For this case we have $b_n \approx \gamma_{2n} \exp(-\gamma_{2n}^2/2)$. So to minimize MSE, we set $\gamma_{2n} = \sqrt{\log n}$ and the estimator is $O_p\!\left(\sqrt{\frac{\log(\log n)}{n}}\right)$.

- [Logit Errors/Normal Regressors] Here we assume the latent error term has a logistic distribution but the regressors have a normal distribution. As we show here, the rate of convergence for the estimator will be very slow. The variance is of the form:

$$v(\gamma_{2n}) = \int_{-\gamma_{2n}}^{\gamma_{2n}} \exp(v^2/2)\,dv \qquad (3.28)$$

which can be approximated as $v(\gamma_{2n}) \approx \exp(\gamma_{2n}^2/2)\,\gamma_{2n}^{-1}$; see, e.g., Weisstein (1998).¹¹ The bias is of the form:

$$b_n \approx \gamma_{2n} \exp(-\gamma_{2n}) \qquad (3.29)$$

So the MSE minimizing sequence is of the form $\gamma_{2n} = \sqrt{\log n}$, resulting in the rate of convergence $O_p\big(n^{-1/4}\sqrt[4]{\log n}\big)$.

Footnote 11: Erfi. From MathWorld—A Wolfram Web Resource.
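As a numerical cross-check on the logit/logit calculation (again a sketch, not part of the paper: it evaluates the exact variance and bias integrals of this section by quadrature for an assumed $\alpha = 0.5$ and balances them), the balancing value of $\gamma_{2n}$ can be compared with $\tfrac{1}{2}\log n$ and the implied standard deviation inspected:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

alpha = 0.5
Lam = lambda t: 1.0 / (1.0 + np.exp(-t))            # logistic cdf, so p(v) = Lam(alpha + v)
f = lambda v: np.exp(-v) / (1.0 + np.exp(-v)) ** 2  # logistic density of v

def v_gamma(g):   # variance term (3.21): integral of p(1-p)/f over [-g, g]
    integrand = lambda v: Lam(alpha + v) * (1 - Lam(alpha + v)) / f(v)
    return quad(integrand, -g, g)[0]

def bias(g):      # bias term (3.15): tail integral of (p(v) - 1[v>0]) outside [-g, g]
    upper = quad(lambda v: 1 - Lam(alpha + v), g, np.inf)[0]
    lower = quad(lambda v: Lam(alpha + v), -np.inf, -g)[0]
    return upper - lower

for n in (10**3, 10**5, 10**7):
    g_star = brentq(lambda g: v_gamma(g) / n - bias(g) ** 2, 0.1, 60.0)
    print(f"n={n:>8}: balancing gamma = {g_star:6.2f}, 0.5*log(n) = {0.5*np.log(n):6.2f}, "
          f"implied sd = {np.sqrt(v_gamma(g_star)/n):.5f}")
```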

- [Logit Errors/Cauchy Regressors] Here we assume the latent error term has a standard logistic distribution but the regressor has a standard Cauchy distribution. From our calculations in the previous section we can see that

$$v(\gamma_{2n}) = \gamma_{2n}\,\frac{\exp(-\gamma_{2n})}{(1 + \exp(-\gamma_{2n}))^2}\,(1 + \gamma_{2n}^2) \qquad (3.30)$$

and

$$b_n = \gamma_{2n}\,\frac{\exp(-\gamma_{2n})}{1 + \exp(-\gamma_{2n})} \qquad (3.31)$$

We note that in this situation $v(\gamma_{2n})$ remains bounded as $\gamma_{2n} \to \infty$, so the variance of the estimator is $O(\frac{1}{n})$. Therefore we can let $\gamma_{2n}$ increase to infinity as quickly as desired to ensure $b_n = o(n^{-1/2})$. Therefore in this case we can conclude that the estimator is root-n consistent and asymptotically normal with no asymptotic bias.

- [Probit Errors/Cauchy Regressors] Here we assume the latent error term has a standard normal distribution but the regressor has a standard Cauchy distribution. From our calculations in the previous section we can see that

$$v(\gamma_{2n}) \approx \gamma_{2n}\,\exp(-\gamma_{2n}^2/2)\,\gamma_{2n}^{-1}\,(1 + \gamma_{2n}^2) \qquad (3.32)$$

where the above approximation is based on the asymptotic series expansion of the erf. Furthermore,

$$b_n = \gamma_{2n}\,\exp(-\gamma_{2n}^2/2)\,\gamma_{2n}^{-1} \qquad (3.33)$$

We can see immediately that no matter how quickly $\gamma_{2n} \to \infty$, $v(\gamma_{2n}) = O(1)$, so we can set $\gamma_{2n} = O(n^2)$ so that

$$\sqrt{n}\,(\hat{\alpha} - \alpha_0) = O_p(1) \qquad (3.34)$$

and furthermore the estimator will be asymptotically normal and asymptotically unbiased.

- [Differing Support Conditions: Regular Rate] Let the support of the latent error term be strictly smaller than the support of the regression function. For $\gamma_{2n}$ sufficiently large, $p(\gamma_{2n})$ takes the value 1, so in this case we need not use the above approximation and $\lim_{\gamma_{2n} \to \infty} v(\gamma_{2n})$ is finite. This implies we can let $\gamma_{2n}$ increase as quickly as possible to remove the bias, so the estimator is root-n consistent and asymptotically normal with no limiting bias. This is a case where, for example, $\alpha + v$ in (3.1) has strictly larger support than $\epsilon$.
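The contrast between the exponential-tail and Cauchy-tail cases can also be seen directly from the variance integral (3.21). The sketch below (illustrative only, with logit errors and an assumed $\alpha = 0.5$) evaluates $v(\gamma)$ for a logistic versus a Cauchy special regressor: the former grows without bound in $\gamma$, while the latter settles down, which is exactly the boundary between the slower-than-root-n and root-n cases above.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import cauchy, logistic

alpha = 0.5
p = lambda v: logistic.cdf(alpha + v)                  # logit errors in both designs
var_term = lambda v, f: p(v) * (1 - p(v)) / f(v)       # integrand of v(gamma), eq. (3.21)

for gamma in (5, 10, 20, 40):
    v_logistic = quad(lambda v: var_term(v, logistic.pdf), -gamma, gamma)[0]
    v_cauchy   = quad(lambda v: var_term(v, cauchy.pdf),   -gamma, gamma)[0]
    print(f"gamma = {gamma:>2}: v(gamma) with logistic v = {v_logistic:8.2f}, "
          f"with Cauchy v = {v_cauchy:8.2f}")
# The logistic-regressor column keeps growing (slower-than-root-n case),
# while the Cauchy-regressor column stabilizes, consistent with root-n estimation.
```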

As the results in this section illustrate, the rates of convergence for the analog inverse weight estimators can vary widely with tail behavior conditions on both observed and unobserved random variables, and the optimal rate rarely coincides with the parametric rate. Hence, semiparametric efficiency bounds might not be as useful for this class of models. Overall, this confirms the results in Lewbel (1997) about the relationship between the thickness of the tails and the rates of convergence. So, to summarize our results, the optimal rate is directly tied to the relative tail behavior of the special regressor and the error term. Rates will generally be slower than the parametric rate, which can only be attained with infinite moments on the regressor, or strong support conditions. The latter ensure that the propensity score attains its extreme values with positive probability.

4 Treatment Effects Model under Exogenous Selection

This section studies another example of a parameter that is written as a weighted moment condition and where regular rates of convergence require that support conditions be met, essentially guaranteeing that the denominator is bounded away from zero. We show that the average treatment effect under exogenous selection cannot generally be estimated at the regular rate unless the propensity score is bounded away from 0 and 1, and so estimation of this parameter can lead to unstable estimates if these support (or overlap) conditions are not met.

A central problem in evaluation studies is that the potential outcomes that program participants would have received in the absence of the program are not observed. Letting $d_i$ denote a binary variable taking the value 1 if treatment was given to agent $i$, and 0 otherwise, and letting $y_{0i}, y_{1i}$ denote potential outcome variables, we refer to $y_{1i} - y_{0i}$ as the treatment effect for the $i$th individual. A parameter of interest for identification and estimation is the average treatment effect, defined as:

$$\beta = E[y_{1i} - y_{0i}] \qquad (4.1)$$

One identification strategy for $\beta$ was proposed in Rosenbaum and Rubin (1983), who imposed the following:

Assumption 1 (ATE under Conditional Independence) Let the following hold:

(i) There exists an observed variable $x_i$ such that $d_i \perp (y_{0i}, y_{1i}) \mid x_i$;

(ii) $0 < P(d_i = 1 \mid x_i) < 1$ for all $x_i$.

See also Hirano, Imbens, and Ridder (2003). The above assumption can be used to identify $\beta$ as

$$\beta = E_X\big[E[y_i \mid d_i = 1, x_i] - E[y_i \mid d_i = 0, x_i]\big] \qquad (4.2)$$

or

$$\beta = E_X\big[E[y_i \mid d_i = 1, p(x_i)] - E[y_i \mid d_i = 0, p(x_i)]\big] \qquad (4.3)$$

where $p(x_i) = P(d_i = 1 \mid x_i)$. The above parameter can be written as:

$$\beta = E\left[\frac{y_i\,(d_i - p(x_i))}{p(x_i)(1 - p(x_i))}\right] \qquad (4.4)$$

The above parameter $\beta$ is a weighted moment condition where the denominator gets small if the propensity score approaches 0 or 1. Also, identification is lost when we remove any region in the support of $x_i$ (so fixed trimming will not identify $\beta$ above). Without any further restrictions, there can be regions in the support of $x$ where $\frac{1}{p(x)(1 - p(x))}$ becomes arbitrarily large, suggesting an analog estimator may not converge at the parametric rate. Whether or not it can is a delicate question, and depends on the support conditions and on the propensity score. Hahn (1998), in very important work, derived efficiency bounds for the model based on Assumption 1. These bounds can be infinite, and in fact there are many situations where this can happen. For example, in the homoskedastic setting with one continuous regressor, anytime the distribution of the regressor is the same as that of the error term in the treatment equation, so that (ii) is satisfied, the variance bound is infinite. Thus we see that, as was the case in the binary choice model, rates of convergence will depend on the relative tail behaviors of regressors and error terms. Consequently, inference can be more complicated, and one might want to supplement standard efficiency bound calculations with more refined rate calculations. For inference, we again propose a rate adaptive procedure based on a studentized estimator.

4.1 Rate Adaptive Inference in the Treatment Effects Model

We derive the asymptotic distribution of the feasible ATE estimator when the propensity score $p(x)$ is replaced by a nonparametric estimate $\hat{p}(x)$, which is kernel based. The feasible estimator is

$$\hat{\beta}_n = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{d_i y_i}{\hat{p}(x_i)} - \frac{(1 - d_i) y_i}{1 - \hat{p}(x_i)}\right) I[|x_i| \leq \gamma_{2n}] \qquad (4.5)$$

Moreover, let

$$X_{in} = \left[\left(\frac{d_i y_i}{p(x_i)} - \frac{(1 - d_i) y_i}{1 - p(x_i)}\right) - \alpha_0 - \left(\frac{\mu_1(x_i)}{p(x_i)} + \frac{\mu_0(x_i)}{1 - p(x_i)}\right)(d_i - p(x_i))\right] 1[|x_i| \leq \gamma_{2n}]$$

This represents the first term in the linear representation (see the Appendix for more on this). As in the previous section, let $\hat{S}_n$ denote the trimmed estimator of its asymptotic variance, which is the variance of the above linear term and can be written as (assuming a homoskedastic model with unit variances)

$$\hat{S}_n = \frac{1}{n}\sum_{i=1}^{n} \left[\Big(\hat{E}[y_i \mid d_i = 1, x_i] - \hat{E}[y_i \mid d_i = 0, x_i] - \hat{\alpha}_n\Big)^2 + \frac{1}{\hat{p}(x_i)(1 - \hat{p}(x_i))}\right] 1[|x_i| \leq \gamma_{2n}]$$

Again, we provide below the main result, which is the asymptotic distribution of the studentized estimator. First, we make the following definitions:

$$\beta_0 = E\left[\frac{d_i y_i}{p(x_i)} - \frac{(1 - d_i) y_i}{1 - p(x_i)}\right]; \quad \beta_n = E[\hat{\beta}_n]; \quad v(\gamma_{2n}) = E\left[\frac{1}{p(x_i)(1 - p(x_i))}\, I[|x_i| \leq \gamma_{2n}]\right]; \quad b_n = \beta_n - \beta_0 \qquad (4.6)$$

We next state the main theorem of this section.

Theorem 4.1 Suppose (i) $\gamma_{2n} \to \infty$, (ii) for all $\epsilon > 0$,

$$\lim_{n \to \infty} \frac{1}{v(\gamma_{2n})}\, E\Big[X_{in}^2\, I\big[|X_{in}| > \epsilon \sqrt{n\, v(\gamma_{2n})}\big]\Big] = 0 \qquad (4.7)$$

and (iii) $\sqrt{n / v(\gamma_{2n})}\; b(\gamma_{2n}) \to 0$. Then

$$\sqrt{n}\, \hat{S}_n^{-1/2}(\hat{\beta}_n - \beta_0) \rightarrow_d N(0, 1) \qquad (4.8)$$
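A feasible version of (4.5) together with the studentized standard error built from $\hat{S}_n$ can be sketched as follows. The Nadaraya-Watson propensity score and conditional means, the rule-of-thumb bandwidth, and the clipping of $\hat{p}$ away from 0 and 1 are assumptions of this illustration (the paper's kernel conditions are in its Appendix), and the variance formula uses the homoskedastic unit-variance simplification stated above.

```python
import numpy as np

def nw(x_eval, x, z, h):
    """Nadaraya-Watson estimate of E[z | x = x_eval] with a Gaussian kernel (illustrative)."""
    w = np.exp(-0.5 * ((x_eval[:, None] - x[None, :]) / h) ** 2)
    return (w * z[None, :]).sum(axis=1) / w.sum(axis=1)

def trimmed_ipw_ate(y, d, x, gamma_2n, h=None):
    """Trimmed inverse-propensity-weighted ATE, eq. (4.5), with the variance estimate
    S_hat of Section 4.1 (homoskedastic, unit-variance case)."""
    n = len(y)
    h = h if h is not None else 1.06 * x.std() * n ** (-0.2)

    p_hat = nw(x, x, d.astype(float), h).clip(1e-3, 1 - 1e-3)  # guard against exact 0/1 (not in the paper)
    trim = (np.abs(x) <= gamma_2n).astype(float)

    beta_hat = np.mean((d * y / p_hat - (1 - d) * y / (1 - p_hat)) * trim)   # eq. (4.5)

    mu1 = nw(x, x[d == 1], y[d == 1], h)   # E[y | d = 1, x]
    mu0 = nw(x, x[d == 0], y[d == 0], h)   # E[y | d = 0, x]
    S_hat = np.mean(((mu1 - mu0 - beta_hat) ** 2 + 1.0 / (p_hat * (1 - p_hat))) * trim)

    se = np.sqrt(S_hat / n)   # Theorem 4.1: (beta_hat - beta_0)/se is approx. N(0,1)
    return beta_hat, se
```

By Theorem 4.1, $(\hat{\beta}_n - \beta_0)/\mathrm{se}$ is approximately standard normal, so confidence intervals can be formed without knowing the rate at which $\hat{S}_n$ may diverge.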

4.2 Average Treatment Effect Estimation Rates

Here we conduct the same rate of convergence exercises for estimating the average treatment effect as the ones we computed for the binary response model in the previous section. As a reminder, we do not compute what the variance or the bias are exactly; rather, we are interested in the rate at which these need to behave as the sample size increases. We will explore the asymptotic properties of the following estimator:

$$\hat{\alpha} = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{d_i y_i}{p(x_i)} - \frac{(1 - d_i) y_i}{1 - p(x_i)}\right) I[|x_i| \leq \gamma_{2n}] \qquad (4.9)$$

We note that this estimator is infeasible, this time because we are assuming the propensity score is known. As in the binary choice model, this will not affect our main conclusions (of course its asymptotic variance will be different from that of the feasible estimator, but it will have the same order). Also, as in the binary choice setting, we assume that it suffices to trim on regressor values, and that the denominator only vanishes in the tails of the regressor. Furthermore, to clarify our arguments we will assume here that the counterfactual outcomes are homoskedastic with a variance of 1. Define $h_n = v(\gamma_{2n})^{-1}$. Carrying out the same arguments as before, we explore the asymptotic properties of $v(\gamma_{2n})$ and $b_n$ as $\gamma_{2n} \to \infty$. Generally speaking, anytime $v(\gamma_{2n}) \to \infty$, the fastest rate for the estimator will be slower than root-n. We can express $v(\gamma_{2n})$ as the following integral:

$$v(\gamma_{2n}) \approx \int_{-\gamma_{2n}}^{\gamma_{2n}} \frac{f(x)}{p(x)(1 - p(x))}\,dx \qquad (4.10)$$

which, using the same expansion as before, will behave asymptotically like

$$v(\gamma_{2n}) \approx \frac{\gamma_{2n}\, f(\gamma_{2n})}{p(\gamma_{2n})(1 - p(\gamma_{2n}))} \qquad (4.11)$$

For the problem at hand we can write $b_n$ approximately as the following integral:

$$b_n = -\left(\int_{-\infty}^{-\gamma_{2n}} m(x) f(x)\,dx + \int_{\gamma_{2n}}^{\infty} m(x) f(x)\,dx\right) \qquad (4.12)$$

where $m(x)$ is the conditional ATE. We now consider some particular examples corresponding to tail behavior of the error term in the treatment equation and the regressor.

Logit Errors/Regressors/Bounded CATE. Here we assume the latent error term and the regressors both have a standard logistic distribution, and to simplify calculations we will assume the conditional average treatment effect is bounded on the regressor support.

We consider this to be a main example, as the results that arise here (a slower than root-n rate) will generally also arise anytime we have distributions whose tails decline exponentially, such as the normal, logistic and Laplace distributions. From our calculations in the previous section we can see that

$$v(\gamma_{2n}) = \gamma_{2n} \qquad (4.13)$$

and

$$b_n \approx \gamma_{2n}\,\frac{\exp(-\gamma_{2n})}{(1 + \exp(-\gamma_{2n}))^2} \qquad (4.14)$$

Clearly $v(\gamma_{2n}) \to \infty$, resulting in a slower rate of convergence. To get the optimal rate we solve for the $\gamma_{2n}$ that sets $v(\gamma_{2n})/n = b_n^2$. We see this matches up when $\gamma_{2n} = \frac{1}{2}\log n$, resulting in the rate of convergence of the estimator:

$$\sqrt{\frac{n}{\log n^{1/2}}}\,(\hat{\alpha} - \alpha_0) = O_p(1) \qquad (4.15)$$

Furthermore, from the results in the previous section, the estimator will be asymptotically normally distributed, and have a limiting bias. We notice this is the exact same result attained for the binary choice model considered previously. As mentioned, similar slower than parametric rates will be attained when the error distribution and regressor have similar tail behavior.

Normal Errors/Logistic Regressors/Bounded CATE. Here we assume the latent error term has a standard normal distribution and the regressors have a standard logistic distribution, and to simplify calculations we will assume the conditional average treatment effect is bounded on the regressor support. In this situation we have:

$$v(\gamma_{2n}) \approx \int_{-\gamma_{2n}}^{\gamma_{2n}} \frac{\exp(-x)}{1 - \Phi(x)}\,dx \qquad (4.16)$$

which, by multiplying and dividing the fraction inside the integral by $\phi(x)$, the standard normal p.d.f., and using the asymptotic approximation to the inverse Mills ratio, we get

$$v(\gamma_{2n}) \approx \int_{-\gamma_{2n}}^{\gamma_{2n}} x \exp(x^2/2 - x)\,dx \approx \exp(\gamma_{2n}) \qquad (4.17)$$

In this setting the bias is approximately of the form

$$b_n = \int_{-\infty}^{-\gamma_{2n}} f(x)\,dx + \int_{\gamma_{2n}}^{\infty} f(x)\,dx \approx \exp(-\gamma_{2n}) \qquad (4.18)$$

so the MSE minimizing value of $\gamma_{2n}$ here is

$$\gamma_{2n} = \log n^{1/3} \qquad (4.19)$$

so the estimator is $O_p(n^{-1/3})$.

Differing Support Conditions/Bounded CATE. Here we assume the support of the latent error term in the treatment equation is strictly larger than the support of the regressor and the support of the regression function in the treatment equation. For $\gamma_{2n}$ sufficiently large, $p(\gamma_{2n})$ remains bounded away from 0 and 1, so $\lim_{\gamma_{2n} \to \infty} v(\gamma_{2n})$ is finite. We can let $\gamma_{2n}$ increase as quickly as possible to remove the bias, resulting in a root-n consistent estimator that is asymptotically normal, completely analogous to the result attained before under differing supports in the binary choice model.

Cauchy Errors/Logistic Regressors/Bounded CATE. Here we impose heavy tails on the treatment effect error term, assuming it has a Cauchy distribution, but we impose exponential tails on the regressor distribution, assuming here that it has a logistic distribution, though similar results would hold if we assumed normality. We have

$$v(\gamma_{2n}) \approx \frac{\gamma_{2n}\,\exp(-\gamma_{2n})}{\frac{1}{4} - \left(\frac{\arctan(\gamma_{2n})}{\pi}\right)^2} \qquad (4.20)$$

where, after using L'Hôpital's rule, we can conclude that

$$v(\gamma_{2n}) \approx \gamma_{2n}\,\exp(-\gamma_{2n})\,(1 + \gamma_{2n}^2) \qquad (4.21)$$

which remains bounded as $\gamma_{2n} \to \infty$. Therefore we can let $\gamma_{2n}$ increase arbitrarily quickly in $n$, say $\gamma_{2n} = O(n^2)$, in which case the estimator will be root-n consistent and asymptotically normal with no asymptotic bias.

5 Conclusion

This paper points out a subtle property of estimators based on weighted moment conditions. In some of these models, there is an intimate link between identification and estimability, in that if it is necessary for (point) identification that the weights take arbitrarily large values, then the parameter of interest, though point identified, cannot be estimated at the regular rate.

This rate depends on relative tail conditions and can be as slow as $n^{1/4}$. Practically speaking, this nonstandard rate of convergence can lead to numerical instability and/or large standard errors. See, for example, the recent work of Busso, DiNardo, and McCrary (2008) for numerical evidence on this based on the ATE parameter. We provide in this paper a studentized approach to inference in which there is no need to know explicitly the rate of convergence of the estimator.

There are a few remaining issues that need to be addressed in future work. One is the choice of the trimming parameter, $\gamma_{2n}$, in small samples. We provide bounds on the rate of divergence to infinity of this parameter in the paper. One approach to picking it is similar to what is typically done for kernel density estimators: pick the parameter that minimizes the first term of an expansion of the mean squared error. For example, in the ATE case, $\gamma_{2n}$ can be chosen to minimize the sum of an empirical version of (4.10) and the square of an empirical version of (4.12); a sketch of one such rule is given after this section. In addition, in this paper we saw examples of non-robust point identification in which the model has no information on the parameter if a set with arbitrarily small probability is taken out of the support. This notion of identification deserves further study as well.
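The trimming-parameter rule suggested above can be made concrete in several ways; the sketch below is one illustrative reading for the ATE case, choosing $\gamma_{2n}$ on a grid to minimize an empirical analog of $(4.10)/n$ plus the square of an empirical analog of (4.12). The inputs `p_hat` and `m_hat` are pre-estimated propensity scores and conditional treatment effects (for example from the kernel sketch in Section 4.1), and all implementation choices here are assumptions of the sketch rather than a prescription of the paper.

```python
import numpy as np

def select_gamma_ate(x, p_hat, m_hat, n_grid=50):
    """Choose the trimming parameter gamma_2n for the ATE estimator by minimizing
    an empirical analog of v(gamma)/n + b(gamma)^2, with v(gamma) read from (4.10)
    and b(gamma) from (4.12). p_hat are estimated propensity scores and m_hat are
    estimated conditional treatment effects; the grid is an illustrative choice."""
    n = len(x)
    grid = np.quantile(np.abs(x), np.linspace(0.5, 1.0, n_grid))
    objective = []
    for g in grid:
        inside = np.abs(x) <= g
        v_hat = np.mean(np.where(inside, 1.0 / (p_hat * (1.0 - p_hat)), 0.0))  # empirical (4.10)
        b_hat = -np.mean(np.where(inside, 0.0, m_hat))                         # empirical (4.12)
        objective.append(v_hat / n + b_hat ** 2)
    return grid[int(np.argmin(objective))]
```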

References

Andrews, D. W. K., and M. Schafgans (1998): "Semiparametric Estimation of the Intercept of a Sample Selection Model," Review of Economic Studies, 65.

Busso, M., J. DiNardo, and J. McCrary (2008): "Finite Sample Properties of Semiparametric Estimators of Average Treatment Effects," Working Paper, UC Berkeley.

Chamberlain, G. (1986): "Asymptotic Efficiency in Semi-parametric Models with Censoring," Journal of Econometrics, 32(2).

Collomb, G., and W. Hardle (1986): "Strong Uniform Convergence Rates in Robust Nonparametric Time Series Analysis and Prediction: Kernel Regression Estimation from Dependent Observations," Stochastic Processes and their Applications, 23.

Frolich, M. (2004): "Finite-Sample Properties of Propensity-Score Matching and Weighting Estimators," Review of Economics and Statistics, 86(1).

Hahn, J. (1998): "On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects," Econometrica, 66(2).

Hampel, F. R., E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel (1986): Robust Statistics: The Approach Based on Influence Functions. Wiley-Interscience.

Heckman, J. (1990a): "Varieties of Selection Bias," American Economic Review, 80(2).

Heckman, J. (1990b): "Varieties of Selection Bias," American Economic Review, 80.

Hirano, K., G. W. Imbens, and G. Ridder (2003): "Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score," Econometrica, 71(4).

Imbens, G. W. (2004): "Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review," Review of Economics and Statistics, 86(1).

Lewbel, A. (1997): "Semiparametric Estimation of Location and Other Discrete Choice Moments," Econometric Theory, 13(1).

Lewbel, A. (1998): "Semiparametric Latent Variable Model Estimation with Endogenous or Mismeasured Regressors," Econometrica, 66(1).

Lewbel, A. (2000): "Semiparametric Qualitative Response Model Estimation with Unknown Heteroscedasticity or Instrumental Variables," Journal of Econometrics, 97.

Magnac, T., and E. Maurin (2005): "Identification and Information in Monotone Binary Models," Journal of Econometrics, forthcoming.

Manski, C. F. (1988): "Identification of Binary Response Models," Journal of the American Statistical Association, 83.

Newey, W. (1990): "Semiparametric Efficiency Bounds," Journal of Applied Econometrics, 5(2).

Newey, W., and D. McFadden (1994): "Large Sample Estimation and Hypothesis Testing," in Handbook of Econometrics, Vol. 4, ed. by R. Engle and D. McFadden. North Holland.

Powell, J., J. Stock, and T. Stoker (1989): "Semiparametric Estimation of Index Coefficients," Econometrica.

Rosenbaum, P., and D. Rubin (1983): "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70.

Stein, C. (1956): "Efficient Nonparametric Testing and Estimation," Proc. Third Berkeley Symp. on Math. Stat. and Prob., 1.

Stoker, T. (1991): "Equivalence of Direct, Indirect, and Slope Estimators for Average Derivatives," Nonparametric and Semiparametric Methods.

Weisstein, E. (1998): The CRC Concise Encyclopedia of Mathematics. CRC Press.

A Appendix

In Section A.1, we provide impossibility results for the binary choice model above. In Sections A.2-A.6, we provide proofs for the theorems in Section 3, for the binary choice model. Section A.7 outlines the proof of the linear representation for the ATE model.

A.1 Infinite Bounds

As alluded to earlier in the paper, the slower than parametric rates of the inverse weight estimators do not necessarily imply a form of inefficiency of the procedure. It may be the case that the parametric rate is unattainable by any procedure. In this section we confirm this by establishing an impossibility theorem, analogous to those established in Chamberlain (1986). We show that efficiency bounds are infinite for a variant of model (3.1) above. Now, introducing regressors $x_i$, we alter the assumption so that $\epsilon \mid x, v \stackrel{d}{=} \epsilon \mid x$ and that $E[\epsilon \mid x] = 0$ (here, $\stackrel{d}{=}$ means "has the same distribution as"). We also impose other conditions such that $\epsilon \mid x$ and $v \mid x$ have support on the real line for all $x$. The model we consider now is

$$y = 1[\alpha + x\beta + v - \epsilon \geq 0] \qquad (A.1)$$

The following theorem, proven below, states that there does not exist any regular root-n estimator for $\alpha$ and $\beta$ in this model.

Theorem A.1 In the binary choice model (A.1) with exclusion restriction and unbounded support for the error terms, and where $\epsilon \mid x, v \stackrel{d}{=} \epsilon \mid x$ and $E[\epsilon \mid x] = 0$, if the second moments of $x, v$ are finite, then the semiparametric efficiency bound for $\beta$ is not finite.

Remark A.1 The proof of the above result is based on the assumption that the second moments of the regressors are finite. This type of assumption was also made in Chamberlain (1986) in establishing his impossibility result for the maximum score model.

A.2 Proof of Theorem A.1

To prove this theorem, we follow the approach taken in Chamberlain (1986). Specifically, we look for a subpath around which the variance of the parametric submodel is unbounded. We first define the function:

$$g_0(t, x) = P(\epsilon_i \leq t \mid x_i = x)$$


More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

Identification and Estimation of Partially Linear Censored Regression Models with Unknown Heteroscedasticity

Identification and Estimation of Partially Linear Censored Regression Models with Unknown Heteroscedasticity Identification and Estimation of Partially Linear Censored Regression Models with Unknown Heteroscedasticity Zhengyu Zhang School of Economics Shanghai University of Finance and Economics zy.zhang@mail.shufe.edu.cn

More information

Estimating Semi-parametric Panel Multinomial Choice Models

Estimating Semi-parametric Panel Multinomial Choice Models Estimating Semi-parametric Panel Multinomial Choice Models Xiaoxia Shi, Matthew Shum, Wei Song UW-Madison, Caltech, UW-Madison September 15, 2016 1 / 31 Introduction We consider the panel multinomial choice

More information

New Developments in Econometrics Lecture 16: Quantile Estimation

New Developments in Econometrics Lecture 16: Quantile Estimation New Developments in Econometrics Lecture 16: Quantile Estimation Jeff Wooldridge Cemmap Lectures, UCL, June 2009 1. Review of Means, Medians, and Quantiles 2. Some Useful Asymptotic Results 3. Quantile

More information

The relationship between treatment parameters within a latent variable framework

The relationship between treatment parameters within a latent variable framework Economics Letters 66 (2000) 33 39 www.elsevier.com/ locate/ econbase The relationship between treatment parameters within a latent variable framework James J. Heckman *,1, Edward J. Vytlacil 2 Department

More information

What s New in Econometrics? Lecture 14 Quantile Methods

What s New in Econometrics? Lecture 14 Quantile Methods What s New in Econometrics? Lecture 14 Quantile Methods Jeff Wooldridge NBER Summer Institute, 2007 1. Reminders About Means, Medians, and Quantiles 2. Some Useful Asymptotic Results 3. Quantile Regression

More information

NBER WORKING PAPER SERIES A NOTE ON ADAPTING PROPENSITY SCORE MATCHING AND SELECTION MODELS TO CHOICE BASED SAMPLES. James J. Heckman Petra E.

NBER WORKING PAPER SERIES A NOTE ON ADAPTING PROPENSITY SCORE MATCHING AND SELECTION MODELS TO CHOICE BASED SAMPLES. James J. Heckman Petra E. NBER WORKING PAPER SERIES A NOTE ON ADAPTING PROPENSITY SCORE MATCHING AND SELECTION MODELS TO CHOICE BASED SAMPLES James J. Heckman Petra E. Todd Working Paper 15179 http://www.nber.org/papers/w15179

More information

Quantile methods. Class Notes Manuel Arellano December 1, Let F (r) =Pr(Y r). Forτ (0, 1), theτth population quantile of Y is defined to be

Quantile methods. Class Notes Manuel Arellano December 1, Let F (r) =Pr(Y r). Forτ (0, 1), theτth population quantile of Y is defined to be Quantile methods Class Notes Manuel Arellano December 1, 2009 1 Unconditional quantiles Let F (r) =Pr(Y r). Forτ (0, 1), theτth population quantile of Y is defined to be Q τ (Y ) q τ F 1 (τ) =inf{r : F

More information

Instrumental Variables

Instrumental Variables Instrumental Variables Kosuke Imai Harvard University STAT186/GOV2002 CAUSAL INFERENCE Fall 2018 Kosuke Imai (Harvard) Noncompliance in Experiments Stat186/Gov2002 Fall 2018 1 / 18 Instrumental Variables

More information

SIMULATION-BASED SENSITIVITY ANALYSIS FOR MATCHING ESTIMATORS

SIMULATION-BASED SENSITIVITY ANALYSIS FOR MATCHING ESTIMATORS SIMULATION-BASED SENSITIVITY ANALYSIS FOR MATCHING ESTIMATORS TOMMASO NANNICINI universidad carlos iii de madrid UK Stata Users Group Meeting London, September 10, 2007 CONTENT Presentation of a Stata

More information

Tobit and Interval Censored Regression Model

Tobit and Interval Censored Regression Model Global Journal of Pure and Applied Mathematics. ISSN 0973-768 Volume 2, Number (206), pp. 98-994 Research India Publications http://www.ripublication.com Tobit and Interval Censored Regression Model Raidani

More information

Estimation of the Conditional Variance in Paired Experiments

Estimation of the Conditional Variance in Paired Experiments Estimation of the Conditional Variance in Paired Experiments Alberto Abadie & Guido W. Imbens Harvard University and BER June 008 Abstract In paired randomized experiments units are grouped in pairs, often

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

ECON 3150/4150, Spring term Lecture 6

ECON 3150/4150, Spring term Lecture 6 ECON 3150/4150, Spring term 2013. Lecture 6 Review of theoretical statistics for econometric modelling (II) Ragnar Nymoen University of Oslo 31 January 2013 1 / 25 References to Lecture 3 and 6 Lecture

More information

New Developments in Econometrics Lecture 11: Difference-in-Differences Estimation

New Developments in Econometrics Lecture 11: Difference-in-Differences Estimation New Developments in Econometrics Lecture 11: Difference-in-Differences Estimation Jeff Wooldridge Cemmap Lectures, UCL, June 2009 1. The Basic Methodology 2. How Should We View Uncertainty in DD Settings?

More information

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case Arthur Lewbel Boston College December 2016 Abstract Lewbel (2012) provides an estimator

More information

ESTIMATION OF NONPARAMETRIC MODELS WITH SIMULTANEITY

ESTIMATION OF NONPARAMETRIC MODELS WITH SIMULTANEITY ESTIMATION OF NONPARAMETRIC MODELS WITH SIMULTANEITY Rosa L. Matzkin Department of Economics University of California, Los Angeles First version: May 200 This version: August 204 Abstract We introduce

More information

Semiparametric Estimation of a Sample Selection Model in the Presence of Endogeneity

Semiparametric Estimation of a Sample Selection Model in the Presence of Endogeneity Semiparametric Estimation of a Sample Selection Model in the Presence of Endogeneity Jörg Schwiebert Abstract In this paper, we derive a semiparametric estimation procedure for the sample selection model

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Panel Data Seminar. Discrete Response Models. Crest-Insee. 11 April 2008

Panel Data Seminar. Discrete Response Models. Crest-Insee. 11 April 2008 Panel Data Seminar Discrete Response Models Romain Aeberhardt Laurent Davezies Crest-Insee 11 April 2008 Aeberhardt and Davezies (Crest-Insee) Panel Data Seminar 11 April 2008 1 / 29 Contents Overview

More information

finite-sample optimal estimation and inference on average treatment effects under unconfoundedness

finite-sample optimal estimation and inference on average treatment effects under unconfoundedness finite-sample optimal estimation and inference on average treatment effects under unconfoundedness Timothy Armstrong (Yale University) Michal Kolesár (Princeton University) September 2017 Introduction

More information

A Note on Adapting Propensity Score Matching and Selection Models to Choice Based Samples

A Note on Adapting Propensity Score Matching and Selection Models to Choice Based Samples DISCUSSION PAPER SERIES IZA DP No. 4304 A Note on Adapting Propensity Score Matching and Selection Models to Choice Based Samples James J. Heckman Petra E. Todd July 2009 Forschungsinstitut zur Zukunft

More information

Economics 583: Econometric Theory I A Primer on Asymptotics

Economics 583: Econometric Theory I A Primer on Asymptotics Economics 583: Econometric Theory I A Primer on Asymptotics Eric Zivot January 14, 2013 The two main concepts in asymptotic theory that we will use are Consistency Asymptotic Normality Intuition consistency:

More information

Information Structure and Statistical Information in Discrete Response Models

Information Structure and Statistical Information in Discrete Response Models Information Structure and Statistical Information in Discrete Response Models Shakeeb Khan a and Denis Nekipelov b This version: January 2012 Abstract Strategic interaction parameters characterize the

More information

Nonparametric Econometrics

Nonparametric Econometrics Applied Microeconometrics with Stata Nonparametric Econometrics Spring Term 2011 1 / 37 Contents Introduction The histogram estimator The kernel density estimator Nonparametric regression estimators Semi-

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Statistical Analysis of Randomized Experiments with Nonignorable Missing Binary Outcomes

Statistical Analysis of Randomized Experiments with Nonignorable Missing Binary Outcomes Statistical Analysis of Randomized Experiments with Nonignorable Missing Binary Outcomes Kosuke Imai Department of Politics Princeton University July 31 2007 Kosuke Imai (Princeton University) Nonignorable

More information

Econ 2148, fall 2017 Instrumental variables I, origins and binary treatment case

Econ 2148, fall 2017 Instrumental variables I, origins and binary treatment case Econ 2148, fall 2017 Instrumental variables I, origins and binary treatment case Maximilian Kasy Department of Economics, Harvard University 1 / 40 Agenda instrumental variables part I Origins of instrumental

More information

Lecture 8. Roy Model, IV with essential heterogeneity, MTE

Lecture 8. Roy Model, IV with essential heterogeneity, MTE Lecture 8. Roy Model, IV with essential heterogeneity, MTE Economics 2123 George Washington University Instructor: Prof. Ben Williams Heterogeneity When we talk about heterogeneity, usually we mean heterogeneity

More information

Quantile Regression for Panel Data Models with Fixed Effects and Small T : Identification and Estimation

Quantile Regression for Panel Data Models with Fixed Effects and Small T : Identification and Estimation Quantile Regression for Panel Data Models with Fixed Effects and Small T : Identification and Estimation Maria Ponomareva University of Western Ontario May 8, 2011 Abstract This paper proposes a moments-based

More information

A simple alternative to the linear probability model for binary choice models with endogenous regressors

A simple alternative to the linear probability model for binary choice models with endogenous regressors A simple alternative to the linear probability model for binary choice models with endogenous regressors Christopher F Baum, Yingying Dong, Arthur Lewbel, Tao Yang Boston College/DIW Berlin, U.Cal Irvine,

More information

Econometrics Summary Algebraic and Statistical Preliminaries

Econometrics Summary Algebraic and Statistical Preliminaries Econometrics Summary Algebraic and Statistical Preliminaries Elasticity: The point elasticity of Y with respect to L is given by α = ( Y/ L)/(Y/L). The arc elasticity is given by ( Y/ L)/(Y/L), when L

More information

Identification in Discrete Choice Models with Fixed Effects

Identification in Discrete Choice Models with Fixed Effects Identification in Discrete Choice Models with Fixed Effects Edward G. Johnson August 2004 Abstract This paper deals with issues of identification in parametric discrete choice panel data models with fixed

More information

Increasing the Power of Specification Tests. November 18, 2018

Increasing the Power of Specification Tests. November 18, 2018 Increasing the Power of Specification Tests T W J A. H U A MIT November 18, 2018 A. This paper shows how to increase the power of Hausman s (1978) specification test as well as the difference test in a

More information

Revisiting the Synthetic Control Estimator

Revisiting the Synthetic Control Estimator MPRA Munich Personal RePEc Archive Revisiting the Synthetic Control Estimator Bruno Ferman and Cristine Pinto Sao Paulo School of Economics - FGV 23 November 2016 Online at https://mpra.ub.uni-muenchen.de/77930/

More information

Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity

Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity Songnian Chen a, Xun Lu a, Xianbo Zhou b and Yahong Zhou c a Department of Economics, Hong Kong University

More information

Calibration Estimation of Semiparametric Copula Models with Data Missing at Random

Calibration Estimation of Semiparametric Copula Models with Data Missing at Random Calibration Estimation of Semiparametric Copula Models with Data Missing at Random Shigeyuki Hamori 1 Kaiji Motegi 1 Zheng Zhang 2 1 Kobe University 2 Renmin University of China Econometrics Workshop UNC

More information

Cross-fitting and fast remainder rates for semiparametric estimation

Cross-fitting and fast remainder rates for semiparametric estimation Cross-fitting and fast remainder rates for semiparametric estimation Whitney K. Newey James M. Robins The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper CWP41/17 Cross-Fitting

More information

IDENTIFICATION OF TREATMENT EFFECTS WITH SELECTIVE PARTICIPATION IN A RANDOMIZED TRIAL

IDENTIFICATION OF TREATMENT EFFECTS WITH SELECTIVE PARTICIPATION IN A RANDOMIZED TRIAL IDENTIFICATION OF TREATMENT EFFECTS WITH SELECTIVE PARTICIPATION IN A RANDOMIZED TRIAL BRENDAN KLINE AND ELIE TAMER Abstract. Randomized trials (RTs) are used to learn about treatment effects. This paper

More information

Chapter 2: Resampling Maarten Jansen

Chapter 2: Resampling Maarten Jansen Chapter 2: Resampling Maarten Jansen Randomization tests Randomized experiment random assignment of sample subjects to groups Example: medical experiment with control group n 1 subjects for true medicine,

More information

Location Properties of Point Estimators in Linear Instrumental Variables and Related Models

Location Properties of Point Estimators in Linear Instrumental Variables and Related Models Location Properties of Point Estimators in Linear Instrumental Variables and Related Models Keisuke Hirano Department of Economics University of Arizona hirano@u.arizona.edu Jack R. Porter Department of

More information

APEC 8212: Econometric Analysis II

APEC 8212: Econometric Analysis II APEC 8212: Econometric Analysis II Instructor: Paul Glewwe Spring, 2014 Office: 337a Ruttan Hall (formerly Classroom Office Building) Phone: 612-625-0225 E-Mail: pglewwe@umn.edu Class Website: http://faculty.apec.umn.edu/pglewwe/apec8212.html

More information

Discussion of Bootstrap prediction intervals for linear, nonlinear, and nonparametric autoregressions, by Li Pan and Dimitris Politis

Discussion of Bootstrap prediction intervals for linear, nonlinear, and nonparametric autoregressions, by Li Pan and Dimitris Politis Discussion of Bootstrap prediction intervals for linear, nonlinear, and nonparametric autoregressions, by Li Pan and Dimitris Politis Sílvia Gonçalves and Benoit Perron Département de sciences économiques,

More information

Truncated Regression Model and Nonparametric Estimation for Gifted and Talented Education Program

Truncated Regression Model and Nonparametric Estimation for Gifted and Talented Education Program Global Journal of Pure and Applied Mathematics. ISSN 0973-768 Volume 2, Number (206), pp. 995-002 Research India Publications http://www.ripublication.com Truncated Regression Model and Nonparametric Estimation

More information

Ninth ARTNeT Capacity Building Workshop for Trade Research "Trade Flows and Trade Policy Analysis"

Ninth ARTNeT Capacity Building Workshop for Trade Research Trade Flows and Trade Policy Analysis Ninth ARTNeT Capacity Building Workshop for Trade Research "Trade Flows and Trade Policy Analysis" June 2013 Bangkok, Thailand Cosimo Beverelli and Rainer Lanz (World Trade Organization) 1 Selected econometric

More information

Optimal bandwidth selection for the fuzzy regression discontinuity estimator

Optimal bandwidth selection for the fuzzy regression discontinuity estimator Optimal bandwidth selection for the fuzzy regression discontinuity estimator Yoichi Arai Hidehiko Ichimura The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper CWP49/5 Optimal

More information

Non-linear panel data modeling

Non-linear panel data modeling Non-linear panel data modeling Laura Magazzini University of Verona laura.magazzini@univr.it http://dse.univr.it/magazzini May 2010 Laura Magazzini (@univr.it) Non-linear panel data modeling May 2010 1

More information

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case Arthur Lewbel Boston College Original December 2016, revised July 2017 Abstract Lewbel (2012)

More information

Ultra High Dimensional Variable Selection with Endogenous Variables

Ultra High Dimensional Variable Selection with Endogenous Variables 1 / 39 Ultra High Dimensional Variable Selection with Endogenous Variables Yuan Liao Princeton University Joint work with Jianqing Fan Job Market Talk January, 2012 2 / 39 Outline 1 Examples of Ultra High

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

Implementing Matching Estimators for. Average Treatment Effects in STATA

Implementing Matching Estimators for. Average Treatment Effects in STATA Implementing Matching Estimators for Average Treatment Effects in STATA Guido W. Imbens - Harvard University West Coast Stata Users Group meeting, Los Angeles October 26th, 2007 General Motivation Estimation

More information

Econ 2148, fall 2017 Instrumental variables II, continuous treatment

Econ 2148, fall 2017 Instrumental variables II, continuous treatment Econ 2148, fall 2017 Instrumental variables II, continuous treatment Maximilian Kasy Department of Economics, Harvard University 1 / 35 Recall instrumental variables part I Origins of instrumental variables:

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Week 9 The Central Limit Theorem and Estimation Concepts

Week 9 The Central Limit Theorem and Estimation Concepts Week 9 and Estimation Concepts Week 9 and Estimation Concepts Week 9 Objectives 1 The Law of Large Numbers and the concept of consistency of averages are introduced. The condition of existence of the population

More information

A Local Generalized Method of Moments Estimator

A Local Generalized Method of Moments Estimator A Local Generalized Method of Moments Estimator Arthur Lewbel Boston College June 2006 Abstract A local Generalized Method of Moments Estimator is proposed for nonparametrically estimating unknown functions

More information

Controlling for overlap in matching

Controlling for overlap in matching Working Papers No. 10/2013 (95) PAWEŁ STRAWIŃSKI Controlling for overlap in matching Warsaw 2013 Controlling for overlap in matching PAWEŁ STRAWIŃSKI Faculty of Economic Sciences, University of Warsaw

More information

ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS

ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS Olivier Scaillet a * This draft: July 2016. Abstract This note shows that adding monotonicity or convexity

More information

The Identification Zoo - Meanings of Identification in Econometrics: PART 3

The Identification Zoo - Meanings of Identification in Econometrics: PART 3 The Identification Zoo - Meanings of Identification in Econometrics: PART 3 Arthur Lewbel Boston College original 2015, heavily revised 2018 Lewbel (Boston College) Identification Zoo 2018 1 / 85 The Identification

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Locally Robust Semiparametric Estimation

Locally Robust Semiparametric Estimation Locally Robust Semiparametric Estimation Victor Chernozhukov Juan Carlos Escanciano Hidehiko Ichimura Whitney K. Newey The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper

More information

Regression Discontinuity Designs

Regression Discontinuity Designs Regression Discontinuity Designs Kosuke Imai Harvard University STAT186/GOV2002 CAUSAL INFERENCE Fall 2018 Kosuke Imai (Harvard) Regression Discontinuity Design Stat186/Gov2002 Fall 2018 1 / 1 Observational

More information

A dynamic model for binary panel data with unobserved heterogeneity admitting a n-consistent conditional estimator

A dynamic model for binary panel data with unobserved heterogeneity admitting a n-consistent conditional estimator A dynamic model for binary panel data with unobserved heterogeneity admitting a n-consistent conditional estimator Francesco Bartolucci and Valentina Nigro Abstract A model for binary panel data is introduced

More information

A Simple Estimator for Binary Choice Models With Endogenous Regressors

A Simple Estimator for Binary Choice Models With Endogenous Regressors A Simple Estimator for Binary Choice Models With Endogenous Regressors Yingying Dong and Arthur Lewbel University of California Irvine and Boston College Revised June 2012 Abstract This paper provides

More information

Gov 2002: 4. Observational Studies and Confounding

Gov 2002: 4. Observational Studies and Confounding Gov 2002: 4. Observational Studies and Confounding Matthew Blackwell September 10, 2015 Where are we? Where are we going? Last two weeks: randomized experiments. From here on: observational studies. What

More information

The Identification Zoo - Meanings of Identification in Econometrics: PART 2

The Identification Zoo - Meanings of Identification in Econometrics: PART 2 The Identification Zoo - Meanings of Identification in Econometrics: PART 2 Arthur Lewbel Boston College heavily revised 2017 Lewbel (Boston College) Identification Zoo 2017 1 / 80 The Identification Zoo

More information

Statistical Properties of Numerical Derivatives

Statistical Properties of Numerical Derivatives Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November 2010 1 / 63 Motivation Introduction Many models have objective

More information

Imbens/Wooldridge, Lecture Notes 1, Summer 07 1

Imbens/Wooldridge, Lecture Notes 1, Summer 07 1 Imbens/Wooldridge, Lecture Notes 1, Summer 07 1 What s New in Econometrics NBER, Summer 2007 Lecture 1, Monday, July 30th, 9.00-10.30am Estimation of Average Treatment Effects Under Unconfoundedness 1.

More information

Regression and Statistical Inference

Regression and Statistical Inference Regression and Statistical Inference Walid Mnif wmnif@uwo.ca Department of Applied Mathematics The University of Western Ontario, London, Canada 1 Elements of Probability 2 Elements of Probability CDF&PDF

More information

Why experimenters should not randomize, and what they should do instead

Why experimenters should not randomize, and what they should do instead Why experimenters should not randomize, and what they should do instead Maximilian Kasy Department of Economics, Harvard University Maximilian Kasy (Harvard) Experimental design 1 / 42 project STAR Introduction

More information

When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Panel Data?

When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Panel Data? When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Panel Data? Kosuke Imai Department of Politics Center for Statistics and Machine Learning Princeton University Joint

More information

Least Squares Estimation of a Panel Data Model with Multifactor Error Structure and Endogenous Covariates

Least Squares Estimation of a Panel Data Model with Multifactor Error Structure and Endogenous Covariates Least Squares Estimation of a Panel Data Model with Multifactor Error Structure and Endogenous Covariates Matthew Harding and Carlos Lamarche January 12, 2011 Abstract We propose a method for estimating

More information

Least Squares Estimation-Finite-Sample Properties

Least Squares Estimation-Finite-Sample Properties Least Squares Estimation-Finite-Sample Properties Ping Yu School of Economics and Finance The University of Hong Kong Ping Yu (HKU) Finite-Sample 1 / 29 Terminology and Assumptions 1 Terminology and Assumptions

More information