Large Sample Properties of Matching Estimators for Average Treatment Effects


Revision in progress. December 7, 2004

Large Sample Properties of Matching Estimators for Average Treatment Effects

Alberto Abadie, Harvard University and NBER
Guido Imbens, UC Berkeley and NBER

First version: July 2001. This version: December 2004.

We wish to thank Don Andrews, Joshua Angrist, Gary Chamberlain, Geert Dhaene, Jinyong Hahn, James Heckman, Keisuke Hirano, Hidehiko Ichimura, Whitney Newey, Jack Porter, James Powell, Geert Ridder, Paul Rosenbaum, Ed Vytlacil, a co-editor and two anonymous referees, and seminar participants at various universities for comments, and Don Rubin for many discussions on these topics. Financial support for this research was generously provided through NSF grants SES (Abadie), and SBR and SES (Imbens).

Large Sample Properties of Matching Estimators for Average Treatment Effects

Alberto Abadie and Guido W. Imbens

Abstract

Matching estimators for average treatment effects are widely used in evaluation research despite the fact that their large sample properties have not been established in many cases. The absence of formal results in this area may be partly due to the fact that standard asymptotic expansions do not apply to simple matching estimators, which are highly nonsmooth functionals of the data. In this article, we develop new methods to analyze the properties of matching estimators and establish a number of new results. First, we show that matching estimators are not N^{1/2}-consistent in general, and we describe conditions under which matching estimators do attain N^{1/2}-consistency. Second, we show that even in cases where matching estimators are N^{1/2}-consistent, simple matching estimators with a fixed number of matches do not attain the semiparametric efficiency bound. Third, we provide a new estimator for the variance that does not require consistent nonparametric estimation of unknown functions.

Keywords: matching estimators, average treatment effects, unconfoundedness, selection on observables, potential outcomes

Alberto Abadie, John F. Kennedy School of Government, Harvard University, 79 John F. Kennedy Street, Cambridge, MA 02138, and NBER. alberto_abadie@harvard.edu

Guido Imbens, Department of Economics, and Department of Agricultural and Resource Economics, University of California at Berkeley, 661 Evans Hall #3880, Berkeley, CA, and NBER. imbens@econ.berkeley.edu

1. Introduction

Estimation of average treatment effects is an important goal of much evaluation research, both in academic studies and in government-sponsored evaluations of social programs. Often, analyses are based on the assumptions that (i) assignment to treatment is unconfounded or exogenous, that is, based on observable pretreatment variables only, and (ii) there is sufficient overlap in the distributions of the pretreatment variables. Methods for estimating average treatment effects in parametric settings under these assumptions have a long history.¹ Recently, a number of nonparametric implementations of this idea have been proposed. Hahn (1998) calculates the efficiency bound and proposes an asymptotically efficient estimator based on nonparametric series estimation. Heckman, Ichimura, and Todd (1998) and Heckman, Ichimura, Smith, and Todd (1998) focus on the average effect on the treated and consider estimators based on local linear kernel regression methods. Hirano, Imbens, and Ridder (2003) propose an estimator that weights the units by the inverse of their assignment probabilities, and show that nonparametric series estimation of this conditional probability, labeled the propensity score by Rosenbaum and Rubin (1983), leads to an efficient estimator.

Empirical researchers, however, often use simple matching procedures to estimate average treatment effects when assignment to treatment is believed to be unconfounded. Much like nearest neighbor estimators, these procedures match each treated unit to a fixed number of untreated units with similar values for the pretreatment variables. The average effect of the treatment is then estimated by averaging within-match differences in the outcome variable between the treated and the untreated units (see, e.g., Rosenbaum, 1995; Dehejia and Wahba, 1999). Matching estimators have great intuitive appeal and are widely used in practice. However, their formal large sample properties have not been established. Part of the reason may be that simple matching estimators are highly non-smooth functionals of the distribution of the data, not amenable to standard asymptotic methods for smooth functionals.

In this article, we study the large sample properties of matching estimators of average treatment effects and establish a number of new results. Our results show that some of the formal large sample properties of simple matching estimators are not very attractive. First, we show that when matching is done in multiple dimensions, matching estimators include a conditional bias term whose stochastic order increases with the number of continuous matching variables. We show that the order of this conditional bias term may be greater than N^{-1/2}.

1 See for example Cochran and Rubin (1973), Rubin (1977), Barnow, Cain, and Goldberger (1980), Rosenbaum and Rubin (1983), Heckman and Robb (1984), and Rosenbaum (1995).

As a result, matching estimators are in general not N^{1/2}-consistent. Second, even if the dimension of the covariates is low enough for the conditional bias term to vanish asymptotically, we show that the simple matching estimator with a fixed number of matches does not achieve the semiparametric efficiency bound as calculated by Hahn (1998). However, for the case when only a single continuous covariate is used to match, we show that the efficiency loss can be made arbitrarily close to zero by allowing a sufficiently large number of matches.

Despite these poor formal properties, matching estimators do have some attractive features that may account for their popularity. In particular, matching estimators are extremely easy to implement, and they do not require consistent nonparametric estimation of unknown functions. In this paper we also propose a consistent estimator for the variances of matching estimators that does not require consistent nonparametric estimation of unknown functions.

In the next section we introduce the notation and define the estimators. In Section 3 we discuss the large sample properties of simple matching estimators. In Section 4 we propose an estimator for the large sample variance of matching estimators. Section 5 concludes. The appendix contains proofs.

2. Notation and Basic Ideas

2.1. Notation

We are interested in estimating the average effect of a binary treatment on some outcome. For unit i, with i = 1, ..., N, let Y_i(0) and Y_i(1) denote the two potential outcomes given the control treatment and given the active treatment, respectively. The variable W_i, with W_i ∈ {0, 1}, indicates the treatment received. For unit i, we observe W_i and the outcome for this treatment,

Y_i = Y_i(0) if W_i = 0, and Y_i = Y_i(1) if W_i = 1,

as well as a vector of pretreatment variables or covariates, X_i. Following the literature, our main focus is on the population average treatment effect, τ = E[Y_i(1) − Y_i(0)], and the average effect for the treated, τ_t = E[Y_i(1) − Y_i(0) | W_i = 1]. See Rubin (1977), Heckman and Robb (1984), and Imbens (2003) for some discussion of these estimands.

We assume that assignment to treatment is unconfounded (Rosenbaum and Rubin, 1983), and that the probability of assignment is bounded away from zero and one.

Assumption 1: Let X be a random vector of dimension k of continuous covariates distributed on R^k, with compact and convex support X, and with a version of the density bounded, and bounded away from zero, on its support.

Assumption 2: For almost every x ∈ X,
(i) (unconfoundedness) W is independent of (Y(0), Y(1)) conditional on X = x;
(ii) (overlap) η < Pr(W = 1 | X = x) < 1 − η, for some η > 0.

The dimension of X, denoted by k, will be seen to play an important role in the properties of matching estimators. We assume that all covariates have continuous distributions.² The combination of the two conditions in Assumption 2 is referred to as strong ignorability (Rosenbaum and Rubin, 1983). These conditions are strong, and in many cases may not be satisfied. Compactness and convexity of the support of the covariates are convenient regularity conditions. Heckman, Ichimura and Todd (1998) point out that for identification of the average treatment effect, τ, Assumption 2(i) can be weakened to mean independence (E[Y(w) | W, X] = E[Y(w) | X] for w = 0, 1). For simplicity, we assume full independence, although for most of the results mean independence is sufficient. When the parameter of interest is the average effect for the treated, τ_t, the first part of Assumption 2 can be relaxed to require only that Y(0) is independent of W conditional on X. Also, when the parameter of interest is τ_t, the second part of Assumption 2 can be relaxed so that the support of X for the treated, X_1, need only be a subset of the support of X for the untreated, X_0.

Assumption 2′: For almost every x ∈ X_1,
(i) W is independent of Y(0) conditional on X = x;
(ii) Pr(W = 1 | X = x) < 1 − η, for some η > 0.

Under Assumption 2(i), the average treatment effect for the subpopulation with X = x is equal to

τ(x) = E[Y(1) − Y(0) | X = x] = E[Y | W = 1, X = x] − E[Y | W = 0, X = x].   (1)

2 Discrete covariates with a finite number of support points can be easily dealt with by analyzing estimation of average treatment effects within subsamples defined by their values. The number of such covariates does not affect the asymptotic properties of the estimators. In small samples, however, matches along discrete covariates may not be exact, so discrete covariates may create the same type of biases as continuous covariates.

Under Assumption 2(ii), the difference on the right hand side of equation (1) is identified for almost all x in X. Therefore, the average effect of the treatment can be recovered by averaging E[Y | W = 1, X = x] − E[Y | W = 0, X = x] over the distribution of X:

τ = E[τ(X)] = E[ E[Y | W = 1, X] − E[Y | W = 0, X] ].

Under Assumption 2′(i), the average treatment effect for the subpopulation with X = x and W = 1 is equal to

τ_t(x) = E[Y(1) − Y(0) | W = 1, X = x] = E[Y | W = 1, X = x] − E[Y | W = 0, X = x].   (2)

Under Assumption 2′(ii), the difference on the right hand side of equation (2) is identified for all x in X_1. Therefore, the average effect of the treatment on the treated can be recovered by averaging E[Y | W = 1, X = x] − E[Y | W = 0, X = x] over the distribution of X conditional on W = 1:

τ_t = E[τ_t(X) | W = 1] = E[ E[Y | W = 1, X] − E[Y | W = 0, X] | W = 1 ].

Next, we introduce some additional notation. For x ∈ X and w ∈ {0, 1}, let µ(x, w) = E[Y | X = x, W = w], µ_w(x) = E[Y(w) | X = x], σ²(x, w) = V(Y | X = x, W = w), σ²_w(x) = V(Y(w) | X = x), and ε_i = Y_i − µ_{W_i}(X_i). Under Assumption 2, µ(x, w) = µ_w(x) and σ²(x, w) = σ²_w(x). Let f_w(x) be the conditional density of X given W = w, and let e(x) = Pr(W = 1 | X = x) be the propensity score (Rosenbaum and Rubin, 1983).

In part of our analysis, we adopt the following assumption.

Assumption 3: {Y_i, W_i, X_i}, for i = 1, ..., N, are independent draws from the distribution of (Y, W, X).

In some cases, however, treated and untreated units are sampled separately, and their proportions in the sample may not reflect their proportions in the population. Therefore, we relax Assumption 3 so that sampling is random only conditional on W_i. As we will show later, relaxing Assumption 3 is particularly useful when the parameter of interest is the average treatment effect on the treated. The numbers of control and treated units are N_0 and N_1 respectively, with N = N_0 + N_1. For reasons that will become clear later, to estimate average effects on the treated we will require that the size of the control group is at least of the same order of magnitude as the size of the treatment group.

Assumption 3′: Conditional on W_i = w, the sample consists of independent draws from the distribution of (Y, X) given W = w, for w = 0, 1. For some r ≥ 1, N_1^r / N_0 → θ, with 0 < θ < ∞.

In this paper we focus on matching with replacement, allowing each unit to be used as a match more than once. For x ∈ X, let ||x|| = (x'x)^{1/2} be the standard Euclidean vector norm.³ Let j_m(i) be the index j that solves W_j = 1 − W_i and

Σ_{l: W_l = 1 − W_i} 1{ ||X_l − X_i|| ≤ ||X_j − X_i|| } = m,

where 1{·} is the indicator function, equal to one if the expression in brackets is true and zero otherwise. In other words, j_m(i) is the index of the unit that is the m-th closest to unit i in terms of the covariate values, among the units with the treatment opposite to that of unit i. In particular, j_1(i), which will sometimes be denoted by j(i), is the nearest match for unit i. For notational simplicity, and because we only consider continuous covariates, we ignore the possibility of ties, which happen with probability zero. Let J_M(i) denote the set of indices for the first M matches for unit i: J_M(i) = {j_1(i), ..., j_M(i)}.⁴ Finally, let K_M(i) denote the number of times unit i is used as a match, given that M matches per unit are done:

K_M(i) = Σ_{l=1}^N 1{ i ∈ J_M(l) }.

The distribution of K_M(i) will play an important role in the variance of the estimators. In many analyses of matching methods (e.g., Rosenbaum, 1995), matching is carried out without replacement, so that every unit is used as a match at most once and K_M(i) ≤ 1. In this article, however, we focus on matching with replacement, allowing each unit to be used as a match more than once. Matching with replacement produces matches of higher quality than matching without replacement, by increasing the set of possible matches.⁵ In addition, matching with replacement has the advantage that it allows us to consider estimators that match all units, treated as well as controls, so that the estimand is identical to the population average treatment effect.

3 Alternative norms of the form ||x||_V = (x'Vx)^{1/2}, for some positive definite symmetric matrix V, are also covered by the results below, because ||x||_V = ((Px)'(Px))^{1/2} for P such that P'P = V.
4 For this definition to make sense, we assume that N_0 ≥ M and N_1 ≥ M. We keep this assumption implicit throughout.
5 As we show below, inexact matches generate bias in matching estimators. Therefore, expanding the set of possible matches will tend to produce smaller biases.

2.2. Estimators

The unit-level treatment effect is τ_i = Y_i(1) − Y_i(0). For the units in the sample, only one of the potential outcomes, Y_i(0) and Y_i(1), is observed; the other is unobserved or missing. All estimators for the average treatment effects that we consider impute the expected potential outcomes in some way. The first estimator, the simple matching estimator, uses the following estimates of the expected potential outcomes:

Ŷ_i(0) = Y_i if W_i = 0, and (1/M) Σ_{j ∈ J_M(i)} Y_j if W_i = 1;
Ŷ_i(1) = (1/M) Σ_{j ∈ J_M(i)} Y_j if W_i = 0, and Y_i if W_i = 1;

leading to the following estimator for the average treatment effect:

τ̂^{sm} = (1/N) Σ_{i=1}^N ( Ŷ_i(1) − Ŷ_i(0) ) = (1/N) Σ_{i=1}^N (2W_i − 1) (1 + K_M(i)/M) Y_i.   (3)

The simple matching estimator can easily be modified to estimate the average treatment effect on the treated:

τ̂^{sm,t} = (1/N_1) Σ_{W_i = 1} ( Y_i − Ŷ_i(0) ) = (1/N_1) Σ_{i=1}^N ( W_i − (1 − W_i) K_M(i)/M ) Y_i.   (4)

It is useful to compare matching estimators to covariance-adjustment, or regression imputation, estimators. Let µ̂_w(X_i) be a consistent estimator of µ_w(X_i). Let

Ȳ_i(0) = Y_i if W_i = 0, and µ̂_0(X_i) if W_i = 1;
Ȳ_i(1) = µ̂_1(X_i) if W_i = 0, and Y_i if W_i = 1.   (5)

The regression imputation estimators of τ and τ_t are

τ̂^{reg} = (1/N) Σ_{i=1}^N ( Ȳ_i(1) − Ȳ_i(0) )   and   τ̂^{reg,t} = (1/N_1) Σ_{W_i = 1} ( Y_i − Ȳ_i(0) ).   (6)

In our discussion we classify as regression imputation estimators those for which µ̂_w(x) is a consistent estimator of µ_w(x). The estimators proposed by Hahn (1998) and some of those proposed by Heckman, Ichimura, and Todd (1997) and Heckman, Ichimura, Smith, and Todd (1998) fall into this category. If µ_w(X_i) is estimated using a nearest neighbor estimator with a fixed number of neighbors, then the regression imputation estimator is identical to the matching estimator with the same number of matches. The two estimators differ in the way they change with the sample size. We classify as matching estimators those estimators which use a finite and fixed number of matches. Interpreting matching estimators in this way may provide some intuition for some of the subsequent results.
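To make the definitions in equations (3) and (4) concrete, the following sketch computes the simple matching estimators directly from the imputed potential outcomes. It is an addition to this transcription, not part of the original paper: the function name, the use of NumPy, and the brute-force nearest-neighbor search are choices made here for illustration only.

```python
# Minimal sketch of the simple matching estimator with replacement (eqs. (3)-(4)).
# Illustrative only; not the authors' implementation.
import numpy as np

def simple_matching(Y, W, X, M=1):
    """Match each unit to its M nearest neighbors (Euclidean distance on X)
    in the opposite treatment group; return (tau_sm, tau_sm_t, K)."""
    Y = np.asarray(Y, dtype=float)
    W = np.asarray(W, dtype=int)
    X = np.asarray(X, dtype=float).reshape(len(Y), -1)
    N = len(Y)
    Y0_hat, Y1_hat = Y.copy(), Y.copy()   # imputed potential outcomes
    K = np.zeros(N)                       # K_M(i): times unit i is used as a match
    for i in range(N):
        opp = np.where(W != W[i])[0]                 # opposite treatment group
        d = np.linalg.norm(X[opp] - X[i], axis=1)
        matches = opp[np.argsort(d)[:M]]             # M closest opposite-group units
        if W[i] == 1:
            Y0_hat[i] = Y[matches].mean()            # impute Y_i(0)
        else:
            Y1_hat[i] = Y[matches].mean()            # impute Y_i(1)
        K[matches] += 1                              # accumulate K_M for the matched units
    tau_sm = np.mean(Y1_hat - Y0_hat)                # ATE estimator, eq. (3)
    tau_sm_t = np.mean((Y - Y0_hat)[W == 1])         # ATT estimator, eq. (4)
    return tau_sm, tau_sm_t, K
```

The loop also makes explicit how K_M(i) arises from the two-sided matching used for the average treatment effect: a unit's weight in the second expression of (3) grows with the number of times it is used as a match.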

In nonparametric regression methods one typically chooses smoothing parameters to balance the bias and variance of the estimated regression function. For example, in kernel regression a smaller bandwidth leads to lower bias but higher variance. A nearest neighbor estimator with a single neighbor is at the extreme end of this trade-off: the bias is minimized within the class of nearest neighbor estimators, but the variance no longer vanishes with the sample size. Nevertheless, as we shall show, matching estimators of average treatment effects are consistent under weak regularity conditions. The variance of matching estimators, however, is still relatively high and, as a result, matching with a fixed number of matches does not lead to an efficient estimator.

The first goal of our paper is to derive the properties of the simple matching estimator in large samples, that is, as N increases, for fixed M. The motivation for our fixed-M asymptotics is to provide an approximation to the sampling distribution of matching estimators with a small number of matches, because matching estimators with a small number of matches have been widely used in practice. The properties of interest include bias and variance. Of particular interest is the dependence of these results on the dimension of the covariates. A second goal is to provide methods for conducting inference through estimation of the large sample variance of the matching estimator.

3. Simple Matching Estimators

In this section we investigate the properties of the simple matching estimator, τ̂^{sm}, defined in (3). We can write the difference between the matching estimator, τ̂^{sm}, and the population average treatment effect τ as

τ̂^{sm} − τ = ( τ̄(X) − τ ) + E^{sm} + B^{sm},   (7)

where τ̄(X) is the average conditional treatment effect,

τ̄(X) = (1/N) Σ_{i=1}^N ( µ_1(X_i) − µ_0(X_i) ),   (8)

E^{sm} is a weighted average of the residuals,

E^{sm} = (1/N) Σ_{i=1}^N E^{sm}_i = (1/N) Σ_{i=1}^N (2W_i − 1) (1 + K_M(i)/M) ε_i,   (9)

and B^{sm} is the conditional bias relative to τ̄(X):

B^{sm} = (1/N) Σ_{i=1}^N B^{sm}_i = (1/N) Σ_{i=1}^N (2W_i − 1) (1/M) Σ_{m=1}^M ( µ_{1−W_i}(X_i) − µ_{1−W_i}(X_{j_m(i)}) ).   (10)

The first two terms on the right hand side of equation (7), (τ̄(X) − τ) and E^{sm}, have zero mean. They will be shown to be N^{1/2}-consistent and asymptotically normal. The first term depends only on the covariates, and its variance is V^{τ(X)}/N, where V^{τ(X)} = E[(τ(X) − τ)²] is the variance of the conditional average treatment effect τ(X). Conditional on X and W (the matrix and the vector with i-th row equal to X_i' and W_i, respectively), the variance of τ̂^{sm} is equal to the conditional variance of the second term, E^{sm}. We will analyze the variances of these two terms in Section 3.2. We will refer to the third term on the right hand side of equation (7), B^{sm}, as the conditional bias, and to Bias^{sm} = E[B^{sm}] as the unconditional bias. If matching is exact, X_i = X_{j_m(i)} for all i, and the conditional bias is equal to zero. In general it is not, and its properties will be analyzed in Section 3.1.

Similarly, we can write the estimator for the average effect for the treated, (4), as

τ̂^{sm,t} − τ_t = ( τ̄(X)_t − τ_t ) + E^{sm,t} + B^{sm,t},   (11)

where

τ̄(X)_t = (1/N_1) Σ_{i=1}^N W_i ( µ_1(X_i) − µ_0(X_i) )

is the average conditional treatment effect for the treated,

E^{sm,t} = (1/N_1) Σ_{i=1}^N E^{sm,t}_i = (1/N_1) Σ_{i=1}^N ( W_i − (1 − W_i) K_M(i)/M ) ε_i

is the contribution of the residuals, and

B^{sm,t} = (1/N_1) Σ_{i=1}^N B^{sm,t}_i = (1/N_1) Σ_{i=1}^N W_i (1/M) Σ_{m=1}^M ( µ_0(X_i) − µ_0(X_{j_m(i)}) )

is the conditional bias.

3.1. Bias

Here we investigate the stochastic order of the conditional bias and of its counterpart for the average treatment effect for the treated. The conditional bias consists of terms of the form µ_1(X_{j_m(i)}) − µ_1(X_i) or µ_0(X_i) − µ_0(X_{j_m(i)}). To investigate the nature of these terms, expand the difference µ_0(X_{j_m(i)}) − µ_0(X_i) around X_i:

µ_0(X_{j_m(i)}) − µ_0(X_i) = (X_{j_m(i)} − X_i)' (∂µ_0/∂x)(X_i) + (1/2) (X_{j_m(i)} − X_i)' (∂²µ_0/∂x∂x')(X_i) (X_{j_m(i)} − X_i) + O( ||X_{j_m(i)} − X_i||³ ).

In order to study the components of the bias, it is therefore useful to analyze the distribution of the matching discrepancy X_{j_m(i)} − X_i.

First, let us analyze the matching discrepancy at a general level. Fix the covariate value at X = z, and suppose we have a random sample X_1, ..., X_N with density f(x) and distribution function F(x) over the support X, which is bounded. Now, consider the closest match to z in the sample. Let j_1 = argmin_{j=1,...,N} ||X_j − z||, and let U_1 = X_{j_1} − z be the matching discrepancy. We are interested in the distribution of the difference U_1, which is a vector. More generally, we are interested in the distribution of the m-th closest matching discrepancy, U_m = X_{j_m} − z, where X_{j_m} is the m-th closest match to z from the random sample of size N.

The following lemma describes some key asymptotic properties of the matching discrepancy at interior points of the support of X.

Lemma 1: (Matching Discrepancy: Asymptotic Properties) Suppose that f(z) > 0 and that f is differentiable in a neighborhood of z. Let V_m = N^{1/k} U_m and let f_{V_m} be the density of V_m. Then, as N → ∞,

lim_{N→∞} f_{V_m}(v) = ( f(z) / (m−1)! ) ( ||v||^k f(z) π^{k/2} / Γ(1 + k/2) )^{m−1} exp( − ||v||^k f(z) π^{k/2} / Γ(1 + k/2) ),

where Γ(y) = ∫_0^∞ e^{−t} t^{y−1} dt, for y > 0, is Euler's Gamma function, so that U_m = O_p(N^{−1/k}). Moreover, the first three moments of U_m are

E[U_m] = ( Γ(m + 2/k) / (m−1)! ) ( f(z) π^{k/2} / Γ(1 + k/2) )^{−2/k} (1/(k f(z))) (∂f/∂x)(z) (1/N)^{2/k} + o( N^{−2/k} ),

E[U_m U_m'] = ( Γ(m + 2/k) / (m−1)! ) ( f(z) π^{k/2} / Γ(1 + k/2) )^{−2/k} (1/k) (1/N)^{2/k} I_k + o( N^{−2/k} ),

and E[ ||U_m||³ ] = O( N^{−3/k} ), where I_k is the identity matrix of size k. (All proofs are given in the appendix.)

This lemma shows that the order of the matching discrepancy increases with the number of continuous covariates. The lemma also shows that the first term in the stochastic expansion of N^{1/k} U_m has a rotation-invariant distribution with respect to the origin.
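To make the N^{-1/k} rate in Lemma 1 concrete, the small simulation below checks how the mean nearest-match discrepancy shrinks when the pool of potential matches is quadrupled. It is an addition to this transcription, not part of the paper; the uniform design, the interior evaluation point z, and the sample sizes are choices made here.

```python
# Simulation consistent with Lemma 1: E||U_1|| shrinks roughly like N^(-1/k)
# at an interior point z of the support, here the center of [0,1]^k.
import numpy as np

rng = np.random.default_rng(0)

def mean_nearest_discrepancy(N, k, reps=1000):
    z = np.full(k, 0.5)                      # interior evaluation point
    dists = np.empty(reps)
    for r in range(reps):
        X = rng.uniform(size=(N, k))         # N potential matches, uniform on [0,1]^k
        dists[r] = np.min(np.linalg.norm(X - z, axis=1))
    return dists.mean()

for k in (1, 2, 4):
    m1 = mean_nearest_discrepancy(1000, k)
    m4 = mean_nearest_discrepancy(4000, k)
    # If E||U_1|| ~ C * N^(-1/k), quadrupling N scales the mean by 4^(-1/k).
    print(f"k={k}: observed ratio {m4 / m1:.3f}, predicted {4 ** (-1 / k):.3f}")
```

With k = 1 the discrepancy falls by roughly a factor of four when N is quadrupled, while with k = 4 it falls by only about 30 percent; this slow shrinkage in higher dimensions is the mechanism behind the bias results that follow.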

The following lemma shows that, for all points in the support, including boundary points, the normalized moments of the matching discrepancies U_m are bounded.

Lemma 2: (Matching Discrepancy: Uniformly Bounded Moments) If Assumption 1 holds, then all the moments of N^{1/k} ||U_m|| are uniformly bounded in N and in z ∈ X.

These results allow us to bound the stochastic order of the conditional bias.

Theorem 1: (Order of the Conditional Bias for the Average Treatment Effect) Under Assumptions 1, 2, and 3: (i) if µ_0(x) and µ_1(x) are Lipschitz on X, then B^{sm} = O_p(N^{−1/k}); and (ii) the order of Bias^{sm} = E[B^{sm}] is not in general lower than N^{−2/k}.

Consider the implications of this theorem for the asymptotic properties of the simple matching estimator. First notice that, under regularity conditions, τ̄(X) − τ = O_p(N^{−1/2}), with a normal limiting distribution, by a standard central limit theorem. Also, it will be shown later that, under regularity conditions, E^{sm} = O_p(N^{−1/2}), again with a normal limiting distribution. However, the result of the theorem implies that B^{sm} is not O_p(N^{−1/2}) in general. In particular, if k is large enough, the asymptotic distribution of N^{1/2}(τ̂^{sm} − τ) is dominated by the bias term and the simple matching estimator is not N^{1/2}-consistent. However, if only one of the covariates is continuously distributed, then k = 1 and B^{sm} = O_p(N^{−1}), so N^{1/2}(τ̂^{sm} − τ) will be asymptotically normal. A similar result holds for the average treatment effect on the treated.

Theorem 2: (Order of the Conditional Bias for the Average Treatment Effect on the Treated) Under Assumptions 1, 2′, and 3′: (i) if µ_0(x) is Lipschitz on X_0, then B^{sm,t} = O_p(N_1^{−r/k}); and (ii) if X_1 is a compact subset of the interior of X_0, µ_0(x) has bounded third derivatives in the interior of X_0, and f_0(x) is differentiable in the interior of X_0 with bounded derivatives, then

Bias^{sm,t} = E[B^{sm,t}] = − (θ^{2/k} / N_1^{2r/k}) ( π^{k/2} / Γ(1 + k/2) )^{−2/k} (1/(kM)) Σ_{m=1}^M ( Γ(m + 2/k) / (m−1)! )
× ∫ f_0(x)^{−2/k} { (1/f_0(x)) (∂f_0/∂x(x))' (∂µ_0/∂x(x)) + (1/2) tr( ∂²µ_0/∂x∂x'(x) ) } f_1(x) dx + o( N_1^{−2r/k} ).

This case is particularly relevant, because matching estimators have often been used to estimate the average effect for the treated in settings in which a large number of controls are sampled separately. Generally, in those cases, the conditional bias term has been ignored in the asymptotic approximation to standard errors and confidence intervals. That is justified if N_0 is of sufficiently high order relative to N_1 or, to be precise, if r > k/2. In that case it follows that B^{sm,t} = o_p(N_1^{−1/2}), and the bias term will be dominated in the large sample distribution by the two other terms, (τ̄(X)_t − τ_t) and E^{sm,t}, both of which are O_p(N_1^{−1/2}).

In part (ii) of Theorem 2, we show that a general expression for the bias Bias^{sm,t} can be calculated if X_1 is compact and X_1 ⊂ int X_0, so that the bias is not affected by the geometric characteristics of the boundary of X_0. Under these conditions, the bias of the matching estimator is at most of order N_0^{−2/k}. This bias is further reduced when µ_0 is constant, or when µ_0 is linear and f_0 is constant, among other cases. Note, however, that randomizing the treatment (f_0 = f_1) does not reduce the order of Bias^{sm,t}.

3.2. Variance

In this section we investigate the variance of the simple matching estimator, τ̂^{sm}. We focus on the first two terms of the simple matching estimator, (8) and (9), ignoring for the moment the conditional bias term. Conditional on X and W (the matrix and the vector with i-th row equal to X_i' and W_i, respectively), the variance of τ̂^{sm} is

V( τ̂^{sm} | X, W ) = (1/N²) Σ_{i=1}^N ( 1 + K_M(i)/M )² σ²(X_i, W_i).   (12)

For τ̂^{sm,t} we obtain

V( τ̂^{sm,t} | X, W ) = (1/N_1²) Σ_{i=1}^N ( W_i − (1 − W_i) K_M(i)/M )² σ²(X_i, W_i).   (13)

Let V^E = N V(τ̂^{sm} | X, W) and V^{E,t} = N_1 V(τ̂^{sm,t} | X, W) be the corresponding normalized variances. Ignoring the conditional bias term, B^{sm}, the conditional expectation of τ̂^{sm} is τ̄(X). The variance of this conditional mean is therefore V^{τ(X)}/N, where V^{τ(X)} = E[(τ(X) − τ)²]. Hence the marginal variance of τ̂^{sm}, ignoring the conditional bias term, is V(τ̂^{sm}) = ( E[V^E] + V^{τ(X)} ) / N. For the estimator of the average effect on the treated the marginal variance is, again ignoring the conditional bias term, V(τ̂^{sm,t}) = ( E[V^{E,t}] + V^{τ(X),t} ) / N_1, where V^{τ(X),t} = E[ (τ_t(X) − τ_t)² | W = 1 ].

The following lemma shows that the expectation of the normalized variance is finite. The key is that K_M(i), the number of times that unit i is used as a match, is O_p(1) with finite moments.⁶

Lemma 3: (i) Suppose Assumptions 1-3 hold. Then K_M(i) = O_p(1) and its moments are bounded uniformly in N. (ii) If, in addition, σ²(x, w) is Lipschitz in X for w = 0, 1, then E[V^E + V^{τ(X)}] = O(1). (iii) Suppose Assumptions 1, 2′, and 3′ hold. Then (N_0/N_1) E[ K_M(i)^q | W_i = 0 ] is uniformly bounded in N for all q. (iv) If, in addition, σ²(x, w) is Lipschitz in X for w = 0, 1, then E[V^{E,t} + V^{τ(X),t}] = O(1).

6 Notice that, for i = 1, ..., N, the K_M(i) are exchangeable random variables, and therefore have identical marginal distributions.

3.3. Consistency and Asymptotic Normality

In this section we show that the simple matching estimator is consistent for the average treatment effect and that, without the conditional bias term, it is N^{1/2}-consistent and asymptotically normal. The next assumption contains a set of weak smoothness restrictions on the conditional distribution of Y given X. Notice that it does not require the existence of higher order derivatives.

Assumption 4: (i) µ(x, w) and σ²(x, w) are Lipschitz in X for w = 0, 1; (ii) the fourth moments of the conditional distribution of Y given W = w and X = x exist and are uniformly bounded; and (iii) σ²(x, w) is bounded away from zero.

Theorem 3: (Consistency of the Simple Matching Estimator) (i) Suppose Assumptions 1-3 and 4(i) hold. Then τ̂^{sm} − τ →p 0. (ii) Suppose Assumptions 1, 2′, 3′, and 4(i) hold. Then τ̂^{sm,t} − τ_t →p 0.

Notice that the consistency result holds regardless of the dimension of the covariates. Next, we state the formal result for asymptotic normality. The first result gives asymptotic normality for the estimators τ̂^{sm} and τ̂^{sm,t} after subtracting the bias term.

Theorem 4: (Asymptotic Normality for the Simple Matching Estimator) (i) Suppose Assumptions 1-3 and 4 hold. Then

( V^E + V^{τ(X)} )^{−1/2} N^{1/2} ( τ̂^{sm} − B^{sm} − τ ) →d N(0, 1).

(ii) Suppose Assumptions 1, 2′, 3′, and 4 hold. Then

( V^{E,t} + V^{τ(X),t} )^{−1/2} N_1^{1/2} ( τ̂^{sm,t} − B^{sm,t} − τ_t ) →d N(0, 1).

Although one generally does not know the conditional bias term, this result is useful for two reasons. First, in some cases the bias term can be ignored because it is of sufficiently low order. Second, as we show in Abadie and Imbens (2003), under some conditions an estimate of the bias term can be used in the statement of Theorem 4 without changing the result. In the scalar covariate case, or when only the treated are matched and the size of the control group is of sufficient order of magnitude, there is no need to remove the bias.

Corollary 1: (Asymptotic Normality for the Simple Matching Estimator with Vanishing Bias) (i) Suppose Assumptions 1-3 and 4 hold, and k = 1. Then

( V^E + V^{τ(X)} )^{−1/2} N^{1/2} ( τ̂^{sm} − τ ) →d N(0, 1).

(ii) Suppose Assumptions 1, 2′, 3′, and 4 hold, and r > k/2. Then

( V^{E,t} + V^{τ(X),t} )^{−1/2} N_1^{1/2} ( τ̂^{sm,t} − τ_t ) →d N(0, 1).

3.4. Efficiency

The asymptotic efficiency of the estimators considered here depends on the limit of E[V^E], which in turn depends on the limiting distribution of K_M(i). It is difficult to work out the limiting distribution of this variable for the general case.⁷ Here we investigate the form of the variance for the special case with a scalar covariate (k = 1) and a general M.

Theorem 5: Suppose k = 1. If Assumptions 1 to 4 hold, and f_0 and f_1 are continuous on int X, then

V( τ̂^{sm} ) = (1/N) E[ σ²_1(X)/e(X) + σ²_0(X)/(1 − e(X)) ] + (1/N) V^{τ(X)}
+ (1/(2MN)) E[ (1/e(X) − e(X)) σ²_1(X) + (1/(1 − e(X)) − (1 − e(X))) σ²_0(X) ] + o(N^{−1}).

Note that with k = 1 we can ignore the conditional bias term, B^{sm}. The semiparametric efficiency bound for this problem is, as established by Hahn (1998),

V^{eff} = E[ σ²_1(X)/e(X) + σ²_0(X)/(1 − e(X)) ] + V^{τ(X)}.

The limiting variance of the matching estimator is in general larger. Relative to the efficiency bound, the asymptotic efficiency loss can be written as

lim_{N→∞} ( N V(τ̂^{sm}) − V^{eff} ) / V^{eff} = (1/(2M)) E[ (1/e(X) − e(X)) σ²_1(X) + (1/(1 − e(X)) − (1 − e(X))) σ²_0(X) ] / V^{eff} ≤ 1/(2M).

The asymptotic efficiency loss disappears quickly if the number of matches is large enough, and the efficiency loss from using only a few matches is very small. For example, the asymptotic variance with a single match is less than 50% higher than the asymptotic variance of the efficient estimator, and with five matches the asymptotic variance is less than 10% higher.

7 The key is the second moment of the volume of the catchment area A_M(i), defined as the subset of X such that each observation j with W_j = 1 − W_i and X_j ∈ A_M(i) is matched to i. In the single match case with M = 1, these objects are studied in stochastic geometry, where they are known as Poisson-Voronoi tessellations (Okabe, Boots, Sugihara, and Chiu, 2000). The variance of the volume of such objects under uniform f_0(x) and f_1(x), normalized by the mean, has been worked out numerically for the one, two, and three dimensional cases.
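The 50% and 10% figures quoted above can be checked directly from the two displays in this subsection. The short derivation below is added here for the reader's convenience and is not reproduced from the paper; it only uses the excess-variance term of Theorem 5 and the bound V^{eff}.

```latex
% Excess of N V(\hat{\tau}^{sm}) over the efficiency bound V^{eff}, from Theorem 5 (k = 1):
\[
\frac{1}{2M}\,
E\!\left[\Bigl(\tfrac{1}{e(X)}-e(X)\Bigr)\sigma_1^2(X)
      +\Bigl(\tfrac{1}{1-e(X)}-(1-e(X))\Bigr)\sigma_0^2(X)\right]
\;\le\;
\frac{1}{2M}\,
E\!\left[\frac{\sigma_1^2(X)}{e(X)}+\frac{\sigma_0^2(X)}{1-e(X)}\right]
\;\le\;
\frac{V^{eff}}{2M},
\]
% so the relative efficiency loss is at most 1/(2M): at most 50% with a single
% match (M = 1) and at most 10% with M = 5 matches.
```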

4. Estimating the Variance

Corollary 1 uses the square roots of V^E + V^{τ(X)} and V^{E,t} + V^{τ(X),t} as normalizing factors to attain a limiting normal distribution for matching estimators. In this section, we show how to estimate these terms.

4.1. Estimating the Conditional Variance

Estimating the conditional variance, V^E = (1/N) Σ_{i=1}^N (1 + K_M(i)/M)² σ²_{W_i}(X_i), is complicated by the fact that it involves the conditional outcome variances, σ²_w(x). We propose an estimator of the conditional variance of the simple matching estimator that does not require consistent nonparametric estimation of σ²_w(x). Our method uses a matching estimator for σ²_w(x) where, instead of the original matching of treated to control units, we now match treated units to treated units and control units to control units. Let l_j(i) be the j-th closest unit to unit i among the units with the same value of the treatment. Then, we estimate the conditional variance as

σ̂²_{W_i}(X_i) = ( J/(J + 1) ) ( Y_i − (1/J) Σ_{j=1}^J Y_{l_j(i)} )².   (14)

Notice that if all matches are perfect, so that X_{l_j(i)} = X_i for all j = 1, ..., J, then E[ σ̂²_{W_i}(X_i) | X_i = x, W_i = w ] = σ²_w(x). In practice, if the covariates are continuous it will not be possible to find perfect matches, so σ̂²_{W_i}(X_i) will be only asymptotically unbiased. Because σ̂²_{W_i}(X_i) is an average of a fixed number of observations, this estimator is not consistent for σ²_{W_i}(X_i). However, the next theorem shows that averages of the σ̂²_{W_i}(X_i) are consistent for V^E and V^{E,t}.

Theorem 6: Let σ̂²_{W_i}(X_i) be as in equation (14). Let

V̂^E = (1/N) Σ_{i=1}^N ( 1 + K_M(i)/M )² σ̂²_{W_i}(X_i)   and   V̂^{E,t} = (1/N_1) Σ_{i=1}^N ( W_i − (1 − W_i) K_M(i)/M )² σ̂²_{W_i}(X_i).

If Assumptions 1 to 4 hold, then V̂^E − V^E = o_p(1). If Assumptions 1, 2′, 3′, and 4 hold, then V̂^{E,t} − V^{E,t} = o_p(1).
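The estimator in equation (14) and the average in Theorem 6 are easy to compute once K_M(i) is available from the matching step. The sketch below is an addition to this transcription (names, the use of NumPy, and the brute-force search are choices made here); it mirrors those two displays for the average treatment effect.

```python
# Sketch of the within-group variance estimator (eq. (14)) and of V^E from Theorem 6.
# `K` is the vector of K_M(i) from the two-sided matching step (e.g., as returned by
# simple_matching above); J is the number of same-treatment-group matches.
import numpy as np

def conditional_variance_terms(Y, W, X, K, M=1, J=1):
    Y = np.asarray(Y, dtype=float)
    W = np.asarray(W, dtype=int)
    X = np.asarray(X, dtype=float).reshape(len(Y), -1)
    N = len(Y)
    sigma2 = np.empty(N)
    for i in range(N):
        same = np.where(W == W[i])[0]
        same = same[same != i]                     # same treatment group, excluding i
        d = np.linalg.norm(X[same] - X[i], axis=1)
        nbrs = same[np.argsort(d)[:J]]             # J closest same-group units
        # eq. (14): sigma2_hat_{W_i}(X_i) = J/(J+1) * (Y_i - mean of the J neighbors)^2
        sigma2[i] = (J / (J + 1)) * (Y[i] - Y[nbrs].mean()) ** 2
    # Theorem 6: V^E_hat = (1/N) * sum_i (1 + K_M(i)/M)^2 * sigma2_hat_i
    V_E_hat = np.mean((1 + K / M) ** 2 * sigma2)
    return sigma2, V_E_hat
```

Combined with the correction terms of Theorem 7 in the next subsection, these quantities deliver standard errors of the form (V̂/N)^{1/2} without any consistent nonparametric estimation of σ²_w(x).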

4.2. Estimating the Marginal Variance

Here we develop consistent estimators for V = V^E + V^{τ(X)} and V^t = V^{E,t} + V^{τ(X),t}. These estimators are based on the same matching approach to estimating the conditional error variance σ²_w(x) as in the previous subsection. In addition, these estimators exploit the fact that

E[ ( Ŷ_i(1) − Ŷ_i(0) − τ )² ] ≈ V^{τ(X)} + E[ ( ε_i + (1/M) Σ_{m=1}^M ε_{j_m(i)} )² ].

The average of the left hand side can be estimated by Σ_{i=1}^N ( Ŷ_i(1) − Ŷ_i(0) − τ̂^{sm} )² / N. The average of the second term on the right hand side can be estimated using the fact that

(1/N) Σ_{i=1}^N E[ ( ε_i + (1/M) Σ_{m=1}^M ε_{j_m(i)} )² | X, W ] = (1/N) Σ_{i=1}^N ( 1 + K_M(i)/M² ) σ²_{W_i}(X_i),

which can then be combined to estimate V^{τ(X)}. This in turn can be combined with the previously defined estimator of V^E to obtain an estimator of V.

Theorem 7: Let σ̂²_{W_i}(X_i) be as in equation (14). Let

V̂ = (1/N) Σ_{i=1}^N ( Ŷ_i(1) − Ŷ_i(0) − τ̂^{sm} )² + (1/N) Σ_{i=1}^N [ ( K_M(i)/M )² + ( (2M − 1)/M ) ( K_M(i)/M ) ] σ̂²_{W_i}(X_i),

and

V̂^t = (1/N_1) Σ_{W_i = 1} ( Y_i − Ŷ_i(0) − τ̂^{sm,t} )² + (1/N_1) Σ_{i=1}^N ( K_M(i) ( K_M(i) − 1 ) / M² ) (1 − W_i) σ̂²_{W_i}(X_i).

If Assumptions 1 to 4 hold, then V̂ − V = o_p(1). If Assumptions 1, 2′, 3′, and 4 hold, then V̂^t − V^t = o_p(1).

5. Conclusion

In this paper we derive large sample properties of simple matching estimators that are widely used in applied evaluation research. The formal large sample properties turn out to be surprisingly poor. We show that simple matching estimators include a conditional bias term which does not disappear in large samples under the standard N^{1/2} normalization. Therefore, standard matching estimators are not N^{1/2}-consistent in general. We derive the asymptotic distribution of matching estimators for the cases where the conditional bias can be ignored, and show that matching estimators with a fixed number of matches are not efficient.

18 Appendix Before proving Lemma, we collect some results on integration using polar coordinates that will be useful. See for example Strooc 999. Let = {ω R : ω = } be the unit -sphere, and λ S be its surface measure. Then, the area and volume of the unit -sphere are λ S dω = S π/ and r λ S dω dr = π/ Γ/ Γ/ = π / Γ + /, respectively. In addition, ω λ S dω =, and ωω λ S dω = λ S dω π / I = Γ + / I, where I is the -dimensional identity matrix. For any non-negative measurable function g on R, gx dx = r grω λ S dω dr. R We will also use the following result on Laplace approximation of integrals. Lemma A.: Let ar and br be two real functions, ar is continuous in a neighborhood of zero and br has continuous first derivative in a neighborhood of zero. Let b =, br > for r >, and that for every r > the infimum of br over r r is positive. Suppose that there exist positive real numbers a, b, α, β such that lim r arr α = a, lim brr β db = b, and lim r r dr rr β = b β. Suppose also that ar exp br dr < for all sufficiently large. Then, for α a ar exp brdr = Γ β + o. α/β α/β βb α/β Proof: It follows from Theorem 7. in Olver 997, page 8. Proof of Lemma : First consider the conditional probability of unit i being the m-th closest match to z, given X i = x: Prj m = i X i = x = Pr X z > x z m Pr X z x z m. m Because the marginal probability of unit i being the m-th closest match to z is Prj m = i = /, and because the density of X i is fx, then the distribution of X i conditional on it being the m-th closest match is: f Xi j m=ix = fx Prj m = i X i = x = fx Pr X z x z m Pr X z x z m, m and this is also the distribution of X jm. ow transform to the matching discrepancy U m = X jm z to get f Um u = fz + u Pr X z u m Pr X z u m. A. m 6

19 Transform to V m = / U m with Jacobian to obtain: f Vm v = f z + v Pr m / f z + = m m ote that Pr X z v / is v / / r fz + rωλ S dω dr, X z v / v Pr X z v / / m Pr X z v + o Pr / m X z v / where = {ω R : ω = } is the unit -sphere, and λ S is its surface measure. The derivative w.r.t. is v f z + v ω λ / S dω. Therefore, Pr X z v / lim / In addition, it is easy to chec that m = m m! + o. = v fz λ S dω. Therefore, lim f V m v = fz v fz m λ S dω exp v fz λ S dω. m! The previous equation shows that the density of V m converges pointwise to a non-negative function which is rotation invariant with respect to the origin. As a result, the matching discrepancy U m is O p / and the limiting distribution of / U m is rotation invariant with respect to the origin. This finishes the proof of the first result. ext, given f Um u in A., EU m = m A m, where A m = u fz + u Pr X z u m Pr X z u m du. R Boundedness of X implies that A m converges uniformly. It is easy to relax the bounded support condition here. We maintain it because it is used elsewhere in the article. Changing variables to polar coordinates gives: A m = r rω fz + rωλ S dω Pr X z r m Pr X z r m dr Then rewriting the probability Pr X z r as fx{ x z r}dx = fz + v{ v r}dv = R R 7 r s fz + sωλ S dω ds m.

20 and substituting this into the expression for A m gives: where A m = br = log r rω fz + rωλ S dω r r m s fz + sωλ S dω ds r m s fz + sωλ S dω ds dr = s fz + sωλ S dω ds, e br ar dr, and ar = r ω fz + rωλ S dω r m s fz + sωλ S dω ds r s fz + sωλ S dω m. ds That is, ar = qrpr, qr = r cr, and pr = gr m, where ω fz + rωλ S dω S cr = r, gr = s fz + sωλ S dω ds r s fz + sωλ S dω ds. s fz + sωλ S dω ds First notice that br is continuous in a neighborhood of zero and b =. By Theorem 6. in Rudin 976, s fz + sωλ S dω is continuous, and db dr r = r fz + rωλ S dω, s fz + sωλ S dω ds r which is also continuous. Using L Hospital s rule: db lim r brr = lim r r dr r = fz λ S dω. Similarly, cr is continuous in a neighborhood of zero, c =, and dc f lim r crr = lim r = r dr x z ωω λ S dω = f x z λ S dω. Therefore, dc lim r qrr + = lim r dr r = f x z λ S dω. Similar calculations yield dg lim r grr = lim r r dr r = fz λ S dω. 8 r

21 Therefore lim r prr m = ow, it is clear that lim r arr m+ = = fz m fz λ S dω. lim r qrr + prr m lim r m f λ S dω x z λ S dω = Therefore, the conditions of Lemma A. hold for α = m +, β = m a = fz f λ S dω fz x z, and b = fz λ S dω. Applying Lemma A., we get m + a A m = Γ b m+/ + o m+/ m+/ / m + π / = Γ fz Γ df + fz dx z + o m+/ Therefore, / m + π / EU m ] = Γ fz m! Γ df + fz dx z m fz f λ S dω fz x z. m+/ + o / which finishes the proof for the second result of the lemma. The results for EU m U m] and E U m 3 ] follow from similar arguments. Proof of Lemma : The proof consists of showing that the density of V m = / U m, denoted by f Vm v, is bounded by f Vm v which does not depend on or z, followed by a proof that v L f Vm vdv < for any L >. It is enough to show the result for > m the bounded support condition implies uniformly bounded moments of V m over z X for any given, and in particular for = m. Recall from the proof of Lemma that f Vm v = f z + v Pr X z v m Pr X z v m. m / / / Define f = inf x X fx and f = sup x X fx. By assumption, f > and f is finite. Let ū be the diameter of X ū = sup x,y X x y which is finite because X is bounded by assumption. Consider all the balls Bx, ū with centers x X and radius ū. Let c be the infimum over x X of the fraction that the intersection with X represents in volume of the balls. otice that, because X has dimension, then < c < ; and that, because X is convex, this proportion can only increase for a smaller radius. Let z X and v / ū. /., Pr X z v / = = v / v / f v / cf v / r fz + rωλ S dωdr r fz + rω{fz + rω > }λ S dωdr r {fz + rω > }λ S dωdr r 9 λ S dωdr = c v f π / Γ + /.

22 Similarly, Pr X z v v / f π / Γ + /. Hence, using the fact that for positive a, loga a and thus for all < b < and > m we have b/ m exp b m/ exp b/m +. In addition, m m m!. It follows that f Vm v f Vm v = f m! exp v c v m + f π/ π f / m =C v m exp C v, Γ/ Γ/ with C and C positive. This inequality holds trivially for v > / ū. This establishes an exponential bound that does not depend on or z. Hence for all and z, v L f Vm vdv is finite and thus all moments of / U m are uniformly bounded in and z. Proof of Theorem i: Let the unit-level matching discrepancy U m,i = X i X jmi. Define the unit-level conditional bias from the m-th match as B m,i = W i µ X i µ X jmi W i µ X i µ X jmi = W i µ X i µ X i + U m,i W i µ X i µ X i + U m,i. By the Lipschitz assumption on µ and µ, we obtain B m,i C U m,i, for some positive constant C. The bias term is B sm = m= B m,i. Using the Cauchy-Schwarz Inequality and Lemma : ] E / B sm ] C / E U,i = C / E + / W i= C E / E W ] E / U,i W,..., W, X i / U,i W,..., W, X i ] ] / + / for some positive constant, C. Using Chernoff s Inequality, it can be seen that any moment of / or / is uniformly bounded in with w for w =,. The result of the theorem follows now from arov s Inequality. This proves part i of the theorem. We defer the proof of Theorem ii until after the proof of Theorem ii, because the former will follow directly from the latter. Proof of Theorem : The proof of first part of theorem is very similar to the proof of Theorem, and therefore is omitted. To prove the second part of Theorem, the following auxiliary lemma will be useful. ],

23 Lemma A.: Let X be distributed with density f on some compact set of dimension : X R. Let Z be a compact set of dimension which is a subset of int X. Suppose that f is bounded and bounded away from zero on X, < f fx f < for all x X. Suppose also that f is differentiable in the interior of X with bounded derivatives X, sup x int X fx/ X <. Then / EU m ] is bounded by a constant uniformly over z Z and > m. Proof of Lemma A.: Fix z Z. From the proof of Lemma, we now that: E ] U m = ufz + u Pr X z u m Pr X z u m du m = r rωf z + rω λ m S dω Pr S X z r m Pr X z r m dr. Let rz = sup{r > z + rω X, for all ω }. Given the conditions of the lemma, there exists r such that rz r > for all z Z. Let where A m = ϕr = Then, rz m E U m ] = A m + A m. ϕr dr, and A m = rz ϕr dr, r rωf z + rω λ S dω Pr X z r m Pr X z r m. Consider the change of variable t = r /, then: A m = / rz / Pr m X z t / t ωf m Pr z + t ω λ / S dω X z t m dt. / Let f/ X = sup x int X f/ Xx. By the ean Value Theorem, for t t rz / f z + t ω fz / = t / f ω X z + t / ω t / f X. As a result, for t rz /, we obtain: ωf z + t ω λ / S dω S = ω f z + t ω fz λ / S dω t f / X λ S dω. ote also that for t rz / we have: Pr X z t / = t / f t / s fz + sωλ S dω ds s ds λ S dω = t f λ S dω.

24 Similarly: Pr X z t / Therefore A m It is easy to see that: / m m rz / t f λ S dω. m t + f m X λ S dω t m t m f λ S dω f λ S dω dt. m!. It is also easily seen that for < b < and > m, b/ m exp b/m +. Therefore, for z Z and > m, we obtain: A m < / m! t+ f X λ S dω t t m exp m + f λ S dω f λ S dω dt C, / for some positive constant, C. For A m, notice that, for m: A m Pr X z rz m m r rωfz + rωλ S dω Pr X z r m dr < Pr X z rz m m m! m r rωfz + rωλ S dω Pr X z r m dr rz m Pr X z rz m ū/m!, where ū is the diameter of X, that is, ū = sup x,y X x y <. The last inequality holds because the last term on the left hand side is equal to the expectation of U m when = m which is bounded by ū. Consequently, we obtain: m / A m < ū/m! /+m f rz λ S dω, z Z. To simplify notation, let b = f λ S dω/. Then: / A m < u/m! /+m b rz m u/m! /+m b r m, where < b r <. The right hand side of last equation is maximal at: / + m = ln b r.

25 Therefore, / A m < u/m! / + m /+m ln b r /+m b r /+m ln b r m = / + m /+m u/m! e /+m = C ln b r /+m <. b r m ote that this bound does not depend on or z. As a result, we obtain / E Um ] < C + C <, for all > m and z Z. The proof of the second part of Theorem is as follows. Bias sm,t = EB sm,t ] = E W i µ X i µ X ] jmi = m= E µ X i µ X jmi Wi = ]. m= Applying a second order Taylor expansion, we obtain: µ X jmi µ X i = µ x X i U m,i + tr µ x x X i U m,i U m,i + O U m,i 3. Therefore, because the trace is a linear operator: E µ X jmi µ X i X i = z, W i = ] = µ x z EU m,i X i = z, W i = ] + tr µ x x z EU m,iu m,i X i = z, W i = ] + O E U m,i 3 Xi = z, W i = ]. Lemma implies that the norm of / EU m,i U m,i X i = z, W i = ] and / E U m,i 3 X i = z, W i = ] are uniformly bounded over z X and. Lemma A. implies the same result for / EU m,i X i = z, W i = ]. As a result, / Eµ X jmi µ X i X i = z, W i = ] is uniformly bounded over z X and. Applying Lebesgue s Dominated Convergence Theorem along with Lemma, we obtain: / E µ X jmi µ X i Wi = ] m + = Γ π / f x Γ + / m! / { f f x x x µ x x + tr ow, the result follows easily from the conditions of the theorem. } µ x x x f x dx + o. Proof of Theorem ii: Consider the special case where µ x is flat over X and µ x is flat in a neighborhood of the boundary, B. Then, matching the control units does not create bias. atching the treated units creates a bias that is similar to the formula in Theorem ii, but with r =, θ = p/ p, and the integral taen over X B c. Proof of Lemma 3: Define f = inf x,w f w x and f = sup x,w f w x, with f > and f finite. Let ū = sup x,y X x y. Consider all the balls Bx, u with centers x X and radius u. Let cu < cu < 3

26 be the infimum over x X of the proportion that the intersection with X represents in volume of the balls. ote that, because X is convex, this proportion nonincreasing in u, so let c = cū, and cu c for u ū. The proof consists of three parts. First we derive an exponential bound for the probability that the distance to a match, X jmi X i exceeds some value. Second, we use this to obtain an exponential bound on the volume of the catchment area, A i, defined as the subset of X such that i is matched to each observation, j, with W j = W i and X j A i. That is, if W j = W i and X j A i, then i J j. { A i = x } l W l =W i { X l x X i x }. Third, we use the exponential bound on the volume of the catchment area to derive an exponential bound on the probability of a large K i, which will be used to bound the moments of K i. For the first part we bound the probability of the distance to a match. Let x X and u < / W i ū. Then, Pr Similarly X j X i > u / W W i,..., W, W j = W i, X i = x / u W i = r f Wi x + rωλ S dωdr cf Pr u / W i r λ S dωdr = cfu W i π / /Γ + /. X j X i u / W W i,..., W, W j = W i, X i = x fu W i π / /Γ + /. otice also that Pr X j X i > u / W W i,..., W, X i = x, j J i Pr X j X i > u / W W i,..., W, X i = x, j = j i Wi = m m= Pr X j X i > u / Wi m W W i,..., W, W j = W i, X i = x Pr X j X i u / W i W,..., W, W j = W i, X i = x m. In addition, Wi m Pr X j X i u / m W W i,..., W, W j = W i, X i = x m! u π f / m. Γ + / Therefore, Pr X j X i > u / W W i,..., W, X i = x, j J i m= m! u f π / Γ + / m u cf π / Γ + / Wi Wi m. 4

27 Then, for some constant C >, Pr X j X i > u / W W i,..., W, X i = x, j J i C max{, u } u π / cf Γ + / m= Wi Wi m C max{, u } exp u + cf π /. Γ + / otice that this bound also holds for u / W i ū, because in that case the probability that X jmi X i > u / W i is zero. ext, we consider for unit i, the volume B i of the catchment area A i, defined as: B i = dx. A i Conditional on W,..., W, i J j, X i = x, and A i, the distribution of X j is proportional to f Wi x {x A i}. otice that a ball with radius b/ / /π / /Γ + / / has volume b/. Therefore for X i in A i and B i b, we obtain b/ / Pr X j X i > π / /Γ + / / W,..., W, X i = x, A i, i J j f f. The last inequality does not depend on A m i given B i b. Therefore, b/ / Pr X j X i > π / /Γ + / / W,..., W, X i = x, i J j, B i b f f. As a result, if b/ / Pr X j X i > π / /Γ + / / W,..., W, X i = x, i J j δ f f, then it must be the case that PrB i b W,..., W, X i = x, i J j δ. In fact, the inequality in equation A. has been established above for b = u π /, and δ = f Wi Γ + / f C max{, u } exp u + cf π /. Γ + / Let t = u π / /Γ + /, then Pr Wi B i t W,..., W, X i = x, i J j C max{, C 3 t } exp C 4 t, for some positive constants, C, C 3, and C 4. This establishes an uniform exponential bound, so all the moments of Wi B i exist conditional on W,..., W, X i = x, i J j uniformly in. For the third part of the proof, consider the distribution of K i, the number of times unit i is used as a match. Let P i be the probability that an observation with the opposite treatment is matched to observation i: P i = f Wi xdx f B i. A i ote that for n, E Wi P i n X i = x, W,..., W ] E Wi P i n X i = x, W,..., W, i J j] A. f n E Wi B i n X i = x, W,..., W, i J j]. 5

28 As a result, E Wi P i n X i = x, W,..., W ] is uniformly bounded. Conditional on P i, and on X i = x, W,..., W, the distribution of K i is binomial with parameters Wi and P i. Therefore, conditional on P i, and X i = x, W,..., W, the q-th moment of K i is EK q i P i, X i = x, W,..., W ] = q n= Sq, n Wi! P i n Wi n! q Sq, n Wi P i n, n= where Sq, n are Stirling numbers of the second ind and q see, e.g., Johnson, Kotz and Kemp, 99. Then, because Sq, = for q, EK q i X i = x, W,..., W ] C q Sq, n n= Wi for some positive constant, C. Using Chernoff s bound for binomial tails, it can be easily seen that E Wi / Wi n X i = x, W i ] = E Wi / Wi n W i ] is uniformly bounded in, for all n, so the result of the first part of the lemma follows. ext, consider part ii of Lemma 3. Because the variance σ x, w is Lipschitz on a bounded set, it is therefore bounded by some constant, σ = sup w,x σ x, w. As a result, E + K / σ x, w] is bounded by σ E + K / ], which is uniformly bounded in by the result in the first part of the lemma. Hence EV E ] = O. ext, consider part iii of Lemma 3. Using the same argument as for EK q i], we obtain Therefore, EK q i W i = ] q Sq, n n= EK q i W i = ] q Sq, n n= Wi n, n E P i n W i = ]. n E P i n W i = ], which is uniformly bounded because r. For part iv notice that ] EV E,t ] = E W i σ K i X i, W i + E W i σ X i, W i ] K ] = Eσ i X i, W i W i = ] + E σ X i, W i W i =. Therefore, EV E,t ] is uniformly bounded. Proof of Theorem 3: We only prove the first part of the theorem. The second part follows the same argument. We can write ˆτ sm τ = τx τ + Esm + Bsm. We consider each of the three terms separately. First, by assumptions and 4i, µ w x is bounded over x X and w =,. Hence µ X µ X τ has mean zero and p finite variance. Therefore, by a standard law of large numbers τx τ. Second, by Theorem, B sm = O p / = o p. Finally, because Eε i X, W] =, Eε i X, W] σ and Eε i ε j X, W] = i j, we obtain E ] E sm = E + K ] i ε i = E + K i σ X i, W i ] = O, 6

29 where the last equality comes from Lemma 3. By arov s inequality E sm = O p / = o p. Proof of Theorem 4: We only prove the first assertion in the theorem as the second follows the same argument. We can write ˆτ sm B sm τ = τx τ + E sm. First, consider the contribution of τx τ. By a standard central limit theorem τx τ d, V τx. Second, consider the contribution of E sm/ V E = Esm,i / V E. Conditional on W and X the unit-level terms E,i sm are independent with zero means and non-identical distributions. The conditional variance of E,i sm is + K i/ σ X i, W i. We will use a Lindeberg-Feller central limit theorem for E sm / V E. For a given X, W, the Lindeberg-Feller condition requires that V E ] E E,i sm { E,i sm η LF V E } X, W, for all η LF >. To prove that A.4 condition holds, notice that by Hölder s and arov s inequalities we have ] E E,i sm { E,i sm η LF V E } X, W E E,i sm 4 X, W ] / ] / E { E,i sm η LF V E } X, W E E,i sm 4 X, W ] / Pr E,i sm η LF V E X, W E E,i sm 4 X, W ] / E E sm,i X, W ] ηlf V E. Let s = sup w,x σ wx <, s = inf w,x σ x, w >, and K = sup w,x E ε 4 i X i = x, W i = w ] <. otice that V E s. Therefore, V E V E ] E E,i sm { E,i sm η LF V E } X, W s K / η LF s + K i/ 4 E ε 4 i X, W ] / + K i/ σ X i, W i ηlf V E + K i/ 4. Because E + K i/ 4 ] is uniformly bounded, by arov s Inequality, the last term in parentheses is bounded in probability. Hence, the Lindeberg-Feller condition is satisfied for almost all X and W. As a result, / Esm,i + K i/ σ X i, W i / = / E sm V E d,. Finally, E sm/ V E and τx τ are asymptotically independent the central limit theorem for E sm / V E holds conditional on X and W. Thus, the fact that both converge to standard normal distributions, boundedness of V E and V τx, and boundedness away from zero of V E imply that V E + V τx / / τ sm Bsm τ converges to a standard normal distribution. 7 A.3 A.4


ESTIMATING AVERAGE TREATMENT EFFECTS: REGRESSION DISCONTINUITY DESIGNS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics ESTIMATING AVERAGE TREATMENT EFFECTS: REGRESSION DISCONTINUITY DESIGNS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009 1. Introduction 2. The Sharp RD Design 3.

More information

ESTIMATION OF TREATMENT EFFECTS VIA MATCHING

ESTIMATION OF TREATMENT EFFECTS VIA MATCHING ESTIMATION OF TREATMENT EFFECTS VIA MATCHING AAEC 56 INSTRUCTOR: KLAUS MOELTNER Textbooks: R scripts: Wooldridge (00), Ch.; Greene (0), Ch.9; Angrist and Pischke (00), Ch. 3 mod5s3 General Approach The

More information

NBER WORKING PAPER SERIES A NOTE ON ADAPTING PROPENSITY SCORE MATCHING AND SELECTION MODELS TO CHOICE BASED SAMPLES. James J. Heckman Petra E.

NBER WORKING PAPER SERIES A NOTE ON ADAPTING PROPENSITY SCORE MATCHING AND SELECTION MODELS TO CHOICE BASED SAMPLES. James J. Heckman Petra E. NBER WORKING PAPER SERIES A NOTE ON ADAPTING PROPENSITY SCORE MATCHING AND SELECTION MODELS TO CHOICE BASED SAMPLES James J. Heckman Petra E. Todd Working Paper 15179 http://www.nber.org/papers/w15179

More information

Efficient Semiparametric Estimation of Quantile Treatment Effects

Efficient Semiparametric Estimation of Quantile Treatment Effects fficient Semiparametric stimation of Quantile Treatment ffects Sergio Firpo U Berkeley - Department of conomics This Draft: January, 2003 Abstract This paper presents calculations of semiparametric efficiency

More information

Classical regularity conditions

Classical regularity conditions Chapter 3 Classical regularity conditions Preliminary draft. Please do not distribute. The results from classical asymptotic theory typically require assumptions of pointwise differentiability of a criterion

More information

An introduction to Mathematical Theory of Control

An introduction to Mathematical Theory of Control An introduction to Mathematical Theory of Control Vasile Staicu University of Aveiro UNICA, May 2018 Vasile Staicu (University of Aveiro) An introduction to Mathematical Theory of Control UNICA, May 2018

More information

Imbens/Wooldridge, IRP Lecture Notes 2, August 08 1

Imbens/Wooldridge, IRP Lecture Notes 2, August 08 1 Imbens/Wooldridge, IRP Lecture Notes 2, August 08 IRP Lectures Madison, WI, August 2008 Lecture 2, Monday, Aug 4th, 0.00-.00am Estimation of Average Treatment Effects Under Unconfoundedness, Part II. Introduction

More information

Fall, 2007 Nonlinear Econometrics. Theory: Consistency for Extremum Estimators. Modeling: Probit, Logit, and Other Links.

Fall, 2007 Nonlinear Econometrics. Theory: Consistency for Extremum Estimators. Modeling: Probit, Logit, and Other Links. 14.385 Fall, 2007 Nonlinear Econometrics Lecture 2. Theory: Consistency for Extremum Estimators Modeling: Probit, Logit, and Other Links. 1 Example: Binary Choice Models. The latent outcome is defined

More information

Identi cation of Positive Treatment E ects in. Randomized Experiments with Non-Compliance

Identi cation of Positive Treatment E ects in. Randomized Experiments with Non-Compliance Identi cation of Positive Treatment E ects in Randomized Experiments with Non-Compliance Aleksey Tetenov y February 18, 2012 Abstract I derive sharp nonparametric lower bounds on some parameters of the

More information

Quantile methods. Class Notes Manuel Arellano December 1, Let F (r) =Pr(Y r). Forτ (0, 1), theτth population quantile of Y is defined to be

Quantile methods. Class Notes Manuel Arellano December 1, Let F (r) =Pr(Y r). Forτ (0, 1), theτth population quantile of Y is defined to be Quantile methods Class Notes Manuel Arellano December 1, 2009 1 Unconditional quantiles Let F (r) =Pr(Y r). Forτ (0, 1), theτth population quantile of Y is defined to be Q τ (Y ) q τ F 1 (τ) =inf{r : F

More information

Cross-fitting and fast remainder rates for semiparametric estimation

Cross-fitting and fast remainder rates for semiparametric estimation Cross-fitting and fast remainder rates for semiparametric estimation Whitney K. Newey James M. Robins The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper CWP41/17 Cross-Fitting

More information

Weak Stochastic Increasingness, Rank Exchangeability, and Partial Identification of The Distribution of Treatment Effects

Weak Stochastic Increasingness, Rank Exchangeability, and Partial Identification of The Distribution of Treatment Effects Weak Stochastic Increasingness, Rank Exchangeability, and Partial Identification of The Distribution of Treatment Effects Brigham R. Frandsen Lars J. Lefgren December 16, 2015 Abstract This article develops

More information

for all subintervals I J. If the same is true for the dyadic subintervals I D J only, we will write ϕ BMO d (J). In fact, the following is true

for all subintervals I J. If the same is true for the dyadic subintervals I D J only, we will write ϕ BMO d (J). In fact, the following is true 3 ohn Nirenberg inequality, Part I A function ϕ L () belongs to the space BMO() if sup ϕ(s) ϕ I I I < for all subintervals I If the same is true for the dyadic subintervals I D only, we will write ϕ BMO

More information

2. Function spaces and approximation

2. Function spaces and approximation 2.1 2. Function spaces and approximation 2.1. The space of test functions. Notation and prerequisites are collected in Appendix A. Let Ω be an open subset of R n. The space C0 (Ω), consisting of the C

More information

Principles Underlying Evaluation Estimators

Principles Underlying Evaluation Estimators The Principles Underlying Evaluation Estimators James J. University of Chicago Econ 350, Winter 2019 The Basic Principles Underlying the Identification of the Main Econometric Evaluation Estimators Two

More information

Specification Test for Instrumental Variables Regression with Many Instruments

Specification Test for Instrumental Variables Regression with Many Instruments Specification Test for Instrumental Variables Regression with Many Instruments Yoonseok Lee and Ryo Okui April 009 Preliminary; comments are welcome Abstract This paper considers specification testing

More information

finite-sample optimal estimation and inference on average treatment effects under unconfoundedness

finite-sample optimal estimation and inference on average treatment effects under unconfoundedness finite-sample optimal estimation and inference on average treatment effects under unconfoundedness Timothy Armstrong (Yale University) Michal Kolesár (Princeton University) September 2017 Introduction

More information

Optimal bandwidth selection for the fuzzy regression discontinuity estimator

Optimal bandwidth selection for the fuzzy regression discontinuity estimator Optimal bandwidth selection for the fuzzy regression discontinuity estimator Yoichi Arai Hidehiko Ichimura The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper CWP49/5 Optimal

More information

Section 10: Inverse propensity score weighting (IPSW)

Section 10: Inverse propensity score weighting (IPSW) Section 10: Inverse propensity score weighting (IPSW) Fall 2014 1/23 Inverse Propensity Score Weighting (IPSW) Until now we discussed matching on the P-score, a different approach is to re-weight the observations

More information

Missing dependent variables in panel data models

Missing dependent variables in panel data models Missing dependent variables in panel data models Jason Abrevaya Abstract This paper considers estimation of a fixed-effects model in which the dependent variable may be missing. For cross-sectional units

More information

Moving the Goalposts: Addressing Limited Overlap in Estimation of Average Treatment Effects by Changing the Estimand

Moving the Goalposts: Addressing Limited Overlap in Estimation of Average Treatment Effects by Changing the Estimand Moving the Goalposts: Addressing Limited Overlap in Estimation of Average Treatment Effects by Changing the Estimand Richard K. Crump V. Joseph Hotz Guido W. Imbens Oscar Mitnik First Draft: July 2004

More information

41903: Introduction to Nonparametrics

41903: Introduction to Nonparametrics 41903: Notes 5 Introduction Nonparametrics fundamentally about fitting flexible models: want model that is flexible enough to accommodate important patterns but not so flexible it overspecializes to specific

More information

NBER WORKING PAPER SERIES FINITE POPULATION CAUSAL STANDARD ERRORS. Alberto Abadie Susan Athey Guido W. Imbens Jeffrey M.

NBER WORKING PAPER SERIES FINITE POPULATION CAUSAL STANDARD ERRORS. Alberto Abadie Susan Athey Guido W. Imbens Jeffrey M. NBER WORKING PAPER SERIES FINITE POPULATION CAUSAL STANDARD ERRORS Alberto Abadie Susan Athey Guido W. Imbens Jeffrey. Wooldridge Working Paper 20325 http://www.nber.org/papers/w20325 NATIONAL BUREAU OF

More information

Hartogs Theorem: separate analyticity implies joint Paul Garrett garrett/

Hartogs Theorem: separate analyticity implies joint Paul Garrett  garrett/ (February 9, 25) Hartogs Theorem: separate analyticity implies joint Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ (The present proof of this old result roughly follows the proof

More information

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation:

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: Oct. 1 The Dirichlet s P rinciple In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: 1. Dirichlet s Principle. u = in, u = g on. ( 1 ) If we multiply

More information

Closest Moment Estimation under General Conditions

Closest Moment Estimation under General Conditions Closest Moment Estimation under General Conditions Chirok Han and Robert de Jong January 28, 2002 Abstract This paper considers Closest Moment (CM) estimation with a general distance function, and avoids

More information

A Note on Adapting Propensity Score Matching and Selection Models to Choice Based Samples

A Note on Adapting Propensity Score Matching and Selection Models to Choice Based Samples DISCUSSION PAPER SERIES IZA DP No. 4304 A Note on Adapting Propensity Score Matching and Selection Models to Choice Based Samples James J. Heckman Petra E. Todd July 2009 Forschungsinstitut zur Zukunft

More information

Pseudo-Poincaré Inequalities and Applications to Sobolev Inequalities

Pseudo-Poincaré Inequalities and Applications to Sobolev Inequalities Pseudo-Poincaré Inequalities and Applications to Sobolev Inequalities Laurent Saloff-Coste Abstract Most smoothing procedures are via averaging. Pseudo-Poincaré inequalities give a basic L p -norm control

More information

arxiv: v2 [math.st] 14 Oct 2017

arxiv: v2 [math.st] 14 Oct 2017 Meta-learners for Estimating Heterogeneous Treatment Effects using Machine Learning Sören Künzel 1, Jasjeet Sekhon 1,2, Peter Bickel 1, and Bin Yu 1,3 arxiv:1706.03461v2 [math.st] 14 Oct 2017 1 Department

More information

Mean-squared-error Calculations for Average Treatment Effects

Mean-squared-error Calculations for Average Treatment Effects Mean-squared-error Calculations for Average Treatment Effects Guido W. Imbens UC Berkeley and BER Whitney ewey MIT Geert Ridder USC First Version: December 2003, Current Version: April 9, 2005 Abstract

More information

A Bootstrap Test for Conditional Symmetry

A Bootstrap Test for Conditional Symmetry ANNALS OF ECONOMICS AND FINANCE 6, 51 61 005) A Bootstrap Test for Conditional Symmetry Liangjun Su Guanghua School of Management, Peking University E-mail: lsu@gsm.pku.edu.cn and Sainan Jin Guanghua School

More information

Inference for Average Treatment Effects

Inference for Average Treatment Effects Inference for Average Treatment Effects Kosuke Imai Harvard University STAT186/GOV2002 CAUSAL INFERENCE Fall 2018 Kosuke Imai (Harvard) Average Treatment Effects Stat186/Gov2002 Fall 2018 1 / 15 Social

More information

Random Feature Maps for Dot Product Kernels Supplementary Material

Random Feature Maps for Dot Product Kernels Supplementary Material Random Feature Maps for Dot Product Kernels Supplementary Material Purushottam Kar and Harish Karnick Indian Institute of Technology Kanpur, INDIA {purushot,hk}@cse.iitk.ac.in Abstract This document contains

More information

ECO Class 6 Nonparametric Econometrics

ECO Class 6 Nonparametric Econometrics ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................

More information

Economics 583: Econometric Theory I A Primer on Asymptotics

Economics 583: Econometric Theory I A Primer on Asymptotics Economics 583: Econometric Theory I A Primer on Asymptotics Eric Zivot January 14, 2013 The two main concepts in asymptotic theory that we will use are Consistency Asymptotic Normality Intuition consistency:

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

The properties of L p -GMM estimators

The properties of L p -GMM estimators The properties of L p -GMM estimators Robert de Jong and Chirok Han Michigan State University February 2000 Abstract This paper considers Generalized Method of Moment-type estimators for which a criterion

More information

Imbens, Lecture Notes 1, Unconfounded Treatment Assignment, IEN, Miami, Oct 10 1

Imbens, Lecture Notes 1, Unconfounded Treatment Assignment, IEN, Miami, Oct 10 1 Imbens, Lecture Notes 1, Unconfounded Treatment Assignment, IEN, Miami, Oct 10 1 Lectures on Evaluation Methods Guido Imbens Impact Evaluation Network October 2010, Miami Methods for Estimating Treatment

More information

Problem List MATH 5143 Fall, 2013

Problem List MATH 5143 Fall, 2013 Problem List MATH 5143 Fall, 2013 On any problem you may use the result of any previous problem (even if you were not able to do it) and any information given in class up to the moment the problem was

More information

Computation Of Asymptotic Distribution. For Semiparametric GMM Estimators. Hidehiko Ichimura. Graduate School of Public Policy

Computation Of Asymptotic Distribution. For Semiparametric GMM Estimators. Hidehiko Ichimura. Graduate School of Public Policy Computation Of Asymptotic Distribution For Semiparametric GMM Estimators Hidehiko Ichimura Graduate School of Public Policy and Graduate School of Economics University of Tokyo A Conference in honor of

More information

Combining multiple observational data sources to estimate causal eects

Combining multiple observational data sources to estimate causal eects Department of Statistics, North Carolina State University Combining multiple observational data sources to estimate causal eects Shu Yang* syang24@ncsuedu Joint work with Peng Ding UC Berkeley May 23,

More information

Clustering as a Design Problem

Clustering as a Design Problem Clustering as a Design Problem Alberto Abadie, Susan Athey, Guido Imbens, & Jeffrey Wooldridge Harvard-MIT Econometrics Seminar Cambridge, February 4, 2016 Adjusting standard errors for clustering is common

More information

WHEN TO CONTROL FOR COVARIATES? PANEL ASYMPTOTICS FOR ESTIMATES OF TREATMENT EFFECTS

WHEN TO CONTROL FOR COVARIATES? PANEL ASYMPTOTICS FOR ESTIMATES OF TREATMENT EFFECTS WHEN TO CONTROL FOR COVARIATES? PANEL ASYPTOTICS FOR ESTIATES OF TREATENT EFFECTS Joshua Angrist Jinyong Hahn* Abstract The problem of when to control for continuous or highdimensional discrete covariate

More information

The Asymptotic Variance of Semi-parametric Estimators with Generated Regressors

The Asymptotic Variance of Semi-parametric Estimators with Generated Regressors The Asymptotic Variance of Semi-parametric stimators with Generated Regressors Jinyong Hahn Department of conomics, UCLA Geert Ridder Department of conomics, USC October 7, 00 Abstract We study the asymptotic

More information

u( x) = g( y) ds y ( 1 ) U solves u = 0 in U; u = 0 on U. ( 3)

u( x) = g( y) ds y ( 1 ) U solves u = 0 in U; u = 0 on U. ( 3) M ath 5 2 7 Fall 2 0 0 9 L ecture 4 ( S ep. 6, 2 0 0 9 ) Properties and Estimates of Laplace s and Poisson s Equations In our last lecture we derived the formulas for the solutions of Poisson s equation

More information

Gov 2002: 4. Observational Studies and Confounding

Gov 2002: 4. Observational Studies and Confounding Gov 2002: 4. Observational Studies and Confounding Matthew Blackwell September 10, 2015 Where are we? Where are we going? Last two weeks: randomized experiments. From here on: observational studies. What

More information

Asymptotic Statistics-III. Changliang Zou

Asymptotic Statistics-III. Changliang Zou Asymptotic Statistics-III Changliang Zou The multivariate central limit theorem Theorem (Multivariate CLT for iid case) Let X i be iid random p-vectors with mean µ and and covariance matrix Σ. Then n (

More information

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1 Chapter 2 Probability measures 1. Existence Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension to the generated σ-field Proof of Theorem 2.1. Let F 0 be

More information

A Local Generalized Method of Moments Estimator

A Local Generalized Method of Moments Estimator A Local Generalized Method of Moments Estimator Arthur Lewbel Boston College June 2006 Abstract A local Generalized Method of Moments Estimator is proposed for nonparametrically estimating unknown functions

More information

δ -method and M-estimation

δ -method and M-estimation Econ 2110, fall 2016, Part IVb Asymptotic Theory: δ -method and M-estimation Maximilian Kasy Department of Economics, Harvard University 1 / 40 Example Suppose we estimate the average effect of class size

More information

Continuity of convex functions in normed spaces

Continuity of convex functions in normed spaces Continuity of convex functions in normed spaces In this chapter, we consider continuity properties of real-valued convex functions defined on open convex sets in normed spaces. Recall that every infinitedimensional

More information

Lecture 11 Roy model, MTE, PRTE

Lecture 11 Roy model, MTE, PRTE Lecture 11 Roy model, MTE, PRTE Economics 2123 George Washington University Instructor: Prof. Ben Williams Roy Model Motivation The standard textbook example of simultaneity is a supply and demand system

More information

MULTIVARIATE PROBABILITY DISTRIBUTIONS

MULTIVARIATE PROBABILITY DISTRIBUTIONS MULTIVARIATE PROBABILITY DISTRIBUTIONS. PRELIMINARIES.. Example. Consider an experiment that consists of tossing a die and a coin at the same time. We can consider a number of random variables defined

More information

Product measure and Fubini s theorem

Product measure and Fubini s theorem Chapter 7 Product measure and Fubini s theorem This is based on [Billingsley, Section 18]. 1. Product spaces Suppose (Ω 1, F 1 ) and (Ω 2, F 2 ) are two probability spaces. In a product space Ω = Ω 1 Ω

More information

Density estimators for the convolution of discrete and continuous random variables

Density estimators for the convolution of discrete and continuous random variables Density estimators for the convolution of discrete and continuous random variables Ursula U Müller Texas A&M University Anton Schick Binghamton University Wolfgang Wefelmeyer Universität zu Köln Abstract

More information

Notes on causal effects

Notes on causal effects Notes on causal effects Johan A. Elkink March 4, 2013 1 Decomposing bias terms Deriving Eq. 2.12 in Morgan and Winship (2007: 46): { Y1i if T Potential outcome = i = 1 Y 0i if T i = 0 Using shortcut E

More information

Math The Laplacian. 1 Green s Identities, Fundamental Solution

Math The Laplacian. 1 Green s Identities, Fundamental Solution Math. 209 The Laplacian Green s Identities, Fundamental Solution Let be a bounded open set in R n, n 2, with smooth boundary. The fact that the boundary is smooth means that at each point x the external

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Consistent Tests for Conditional Treatment Effects

Consistent Tests for Conditional Treatment Effects Consistent Tests for Conditional Treatment Effects Yu-Chin Hsu Department of Economics University of Missouri at Columbia Preliminary: please do not cite or quote without permission.) This version: May

More information

Density estimation Nonparametric conditional mean estimation Semiparametric conditional mean estimation. Nonparametrics. Gabriel Montes-Rojas

Density estimation Nonparametric conditional mean estimation Semiparametric conditional mean estimation. Nonparametrics. Gabriel Montes-Rojas 0 0 5 Motivation: Regression discontinuity (Angrist&Pischke) Outcome.5 1 1.5 A. Linear E[Y 0i X i] 0.2.4.6.8 1 X Outcome.5 1 1.5 B. Nonlinear E[Y 0i X i] i 0.2.4.6.8 1 X utcome.5 1 1.5 C. Nonlinearity

More information

Optimal global rates of convergence for interpolation problems with random design

Optimal global rates of convergence for interpolation problems with random design Optimal global rates of convergence for interpolation problems with random design Michael Kohler 1 and Adam Krzyżak 2, 1 Fachbereich Mathematik, Technische Universität Darmstadt, Schlossgartenstr. 7, 64289

More information

S chauder Theory. x 2. = log( x 1 + x 2 ) + 1 ( x 1 + x 2 ) 2. ( 5) x 1 + x 2 x 1 + x 2. 2 = 2 x 1. x 1 x 2. 1 x 1.

S chauder Theory. x 2. = log( x 1 + x 2 ) + 1 ( x 1 + x 2 ) 2. ( 5) x 1 + x 2 x 1 + x 2. 2 = 2 x 1. x 1 x 2. 1 x 1. Sep. 1 9 Intuitively, the solution u to the Poisson equation S chauder Theory u = f 1 should have better regularity than the right hand side f. In particular one expects u to be twice more differentiable

More information

LECTURE 2 LINEAR REGRESSION MODEL AND OLS

LECTURE 2 LINEAR REGRESSION MODEL AND OLS SEPTEMBER 29, 2014 LECTURE 2 LINEAR REGRESSION MODEL AND OLS Definitions A common question in econometrics is to study the effect of one group of variables X i, usually called the regressors, on another

More information

Michael Lechner Causal Analysis RDD 2014 page 1. Lecture 7. The Regression Discontinuity Design. RDD fuzzy and sharp

Michael Lechner Causal Analysis RDD 2014 page 1. Lecture 7. The Regression Discontinuity Design. RDD fuzzy and sharp page 1 Lecture 7 The Regression Discontinuity Design fuzzy and sharp page 2 Regression Discontinuity Design () Introduction (1) The design is a quasi-experimental design with the defining characteristic

More information

Balancing Covariates via Propensity Score Weighting

Balancing Covariates via Propensity Score Weighting Balancing Covariates via Propensity Score Weighting Kari Lock Morgan Department of Statistics Penn State University klm47@psu.edu Stochastic Modeling and Computational Statistics Seminar October 17, 2014

More information

MATH COURSE NOTES - CLASS MEETING # Introduction to PDEs, Fall 2011 Professor: Jared Speck

MATH COURSE NOTES - CLASS MEETING # Introduction to PDEs, Fall 2011 Professor: Jared Speck MATH 8.52 COURSE NOTES - CLASS MEETING # 6 8.52 Introduction to PDEs, Fall 20 Professor: Jared Speck Class Meeting # 6: Laplace s and Poisson s Equations We will now study the Laplace and Poisson equations

More information

MATH COURSE NOTES - CLASS MEETING # Introduction to PDEs, Spring 2018 Professor: Jared Speck

MATH COURSE NOTES - CLASS MEETING # Introduction to PDEs, Spring 2018 Professor: Jared Speck MATH 8.52 COURSE NOTES - CLASS MEETING # 6 8.52 Introduction to PDEs, Spring 208 Professor: Jared Speck Class Meeting # 6: Laplace s and Poisson s Equations We will now study the Laplace and Poisson equations

More information

Statistics for scientists and engineers

Statistics for scientists and engineers Statistics for scientists and engineers February 0, 006 Contents Introduction. Motivation - why study statistics?................................... Examples..................................................3

More information

Gov 2002: 5. Matching

Gov 2002: 5. Matching Gov 2002: 5. Matching Matthew Blackwell October 1, 2015 Where are we? Where are we going? Discussed randomized experiments, started talking about observational data. Last week: no unmeasured confounders

More information

Calibration Estimation of Semiparametric Copula Models with Data Missing at Random

Calibration Estimation of Semiparametric Copula Models with Data Missing at Random Calibration Estimation of Semiparametric Copula Models with Data Missing at Random Shigeyuki Hamori 1 Kaiji Motegi 1 Zheng Zhang 2 1 Kobe University 2 Renmin University of China Institute of Statistics

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Finite Sample Properties of Semiparametric Estimators of Average Treatment Effects

Finite Sample Properties of Semiparametric Estimators of Average Treatment Effects Finite Sample Properties of Semiparametric Estimators of Average Treatment Effects Matias Busso IDB, IZA John DiNardo University of Michigan and NBER Justin McCrary University of Californa, Berkeley and

More information

Truncated Regression Model and Nonparametric Estimation for Gifted and Talented Education Program

Truncated Regression Model and Nonparametric Estimation for Gifted and Talented Education Program Global Journal of Pure and Applied Mathematics. ISSN 0973-768 Volume 2, Number (206), pp. 995-002 Research India Publications http://www.ripublication.com Truncated Regression Model and Nonparametric Estimation

More information

Convergence rates in weighted L 1 spaces of kernel density estimators for linear processes

Convergence rates in weighted L 1 spaces of kernel density estimators for linear processes Alea 4, 117 129 (2008) Convergence rates in weighted L 1 spaces of kernel density estimators for linear processes Anton Schick and Wolfgang Wefelmeyer Anton Schick, Department of Mathematical Sciences,

More information

Econ 2148, fall 2017 Instrumental variables II, continuous treatment

Econ 2148, fall 2017 Instrumental variables II, continuous treatment Econ 2148, fall 2017 Instrumental variables II, continuous treatment Maximilian Kasy Department of Economics, Harvard University 1 / 35 Recall instrumental variables part I Origins of instrumental variables:

More information