Mean-squared-error Calculations for Average Treatment Effects

Size: px

Start display at page:

Download "Mean-squared-error Calculations for Average Treatment Effects"

Amberlynn Barnett
5 years ago
Views:

1 Mean-squared-error Calculations for Average Treatment Effects Guido W. Imbens UC Berkeley and BER Whitney ewey MIT Geert Ridder USC First Version: December 2003, Current Version: April 9, 2005 Abstract This paper develops JEL Classification: C4, C20. Keywords: onparametric Estimation

2 Introduction Recently a number of estimators have been proposed for average treatment effects under the assumption of unconfoundedness or selection on observables. Many of these estimators require nonparametric estimation of an unknown function, either the regression function or the propensity score. Typically results are presented concerning the rates at which the smoothing parameters go to their limiting values, without specific recommendations regarding their values Hahn, 998; Hirano, Imbens and Ridder, 2000; Heckman, Ichimura and Todd, 998; Rotnitzky and Robins, 995; Robins, Rotnitzky and Zhao, 995)). In this paper we make two contributions. First, we propose a new estimator. Our estimator is a modification of an estimator introduced in an influential paper by Hahn 998). Like Hahn, our estimator relies on consistent estimation of the two regression functions followed by averaging their difference over the empirical distribution of the covariates. Our estimator differs from Hahn s in that it directly estimates these two regression functions whereas Hahn first estimates the propensity score and the two conditional epectations of the product of the outcome and the indicators for being in the control and treatment group and then combines these to get estimates of the two regression functions. Thus our estimator completely avoids the need to estimate the propensity score. Our second and most important contribution is that we are eplicit about the choice of smoothing parameters. In our series estimation setting the smoothing parameter is the number of terms in the series. We provide a criterion for choosing this number of terms and show its asymptotic optimality in terms of epectedmean-squared-error. This criterion is related to, but differ from from the standard one for nonparametric estimation as in Li 987) and Andrews 99) in that it focuses eplicitly on optimal estimation of the average treatment effect rather than on optimal estimation of the entire unknown function. In the net section we discuss the basic set up and introduce the new estimator. In Section 3 we analyze the asymptotic properties of this estimator. In Section 4 we propose a method for choosing the number of terms in the series. 2 The Basic Framework The basic framework is standard in this literature e.g Rosenbaum and Rubin, 983; Hahn, 998; Heckman, Ichimura and Todd, 998; Hirano, Imbens and Ridder, 2003) We have a random sample of size from a large population. For each unit i in the sample, let W i indicate whether the treatment of interest was received, with W i = if unit i receives the treatment of interest, and W i = 0 if unit i receives the control treatment. Using the potential outcome notation popularized by Rubin 974), let Y i 0) denote the outcome for unit i under control and Y i ) Independently Chen, Hong, and Tarozzi 2004) have established the efficiency of their CEP-GMM estimator that is similar to our new estimator. []

3 the outcome under treatment. We observe W i and Y i, where Y i Y i W i ) = W i Y i ) W i ) Y i 0). In addition, we observe a vector of pre-treatment variables, or covariates, denoted by X i. We shall focus on the population average treatment effect: τ E[Y ) Y 0)]. Similar results can be obtained for the average effect for the treated: τ t E[Y ) Y 0) W = ]. The central problem of evaluation research e.g., Holland, 986) is that for unit i we observe Y i 0) or Y i ), but never both. Without further restrictions, the treatment effects are not consistently estimable. To solve the identification problem, we maintain throughout the paper the unconfoundedness assumption Rubin, 978; Rosenbaum and Rubin, 983), which asserts that conditional on the pre-treatment variables, the treatment indicator is independent of the potential outcomes. This assumption is closely related to selection on observables assumptions e.g., Barnow, Cain and Goldberger, 980; Heckman and Robb, 984). Formally: Assumption 2. Unconfoundedness) W Y 0), Y )) X. 2.) Let the propensity score be the probability of selection into the treatment group: e) PrW = X = ) = E[W X = ], 2.2) Assumption 2.2 Overlap) The propensity score is bounded away from zero and one. Define the average effect conditional on pre-treatment variables: τ) E[Y ) Y 0) X = ] ote that τ) is estimable under the unconfoundedness assumption, because E[Y ) Y 0) X = ] = E[Y ) W =, X = ] E[Y 0) W = 0, X = ] = E[Y W =, X = ] E[Y W = 0, X = ]. The population average treatment effect can be obtained by averaging the τ) over the distribution of X: τ = E[τX)], and therefore the average treatment effect is identified. [2]

4 3 Efficient Estimation In this section we review two efficient estimators previously proposed in the literature. We then discuss two new estimators, which will facilitate the mean-squared error calculations. 3. The Hahn Estimator Hahn 998) studies the same model as in the current paper. He calculates the efficiency bound, and proposes an efficient estimator. His estimator also imputes the potential outcomes given covariates, followed by averaging the difference in the estimated regression functions. The difference with our estimator is that Hahn first estimates nonparametrically the three conditional epectations E[Y W X], E[Y W ) X], and ex) = E[W X]. and then uses these conditional epectations to estimate the two regression function µ 0 ) = E[Y 0) X = ] and µ ) = E[Y ) X = ] as ˆµ 0 ) = Ê[Y W ) X = ], and ˆµ ) = êx) Ê[Y W X = ]. êx) The average treatment effect is then estimated as ˆτ h = ˆµ X i ) ˆµ 0 X i )). Hahn shows that under regularity conditions this estimator is consistent, asymptotically normally distributed, and that it reaches the semiparametric efficiency bound. 3.2 The Hirano-Imbens-Ridder Estimator Hirano, Imbens and Ridder 2003) also study the same set up. They propose using a weighting estimator with the weights based on the estimated propensity score: ˆτ hir, = Wi Y i êx i ) W ) i. êx i ) They show that under regularity conditions, and with ê) a nonparametric estimator for the propensity score, this estimator is consistent, asymptotically normally distributed and efficienty. It will be useful to consider a slight modification of this estimator. Consider the weights for the treated observations, /êx i ). Summing up over all treated observations and dividing by we get i W i/êx i ))/. This is not necessarily equal to one. We may therefore wish to modify the weights to ensure they add up to one for the treated and control units. This leads to the following estimator: ˆτ hir,2 = Wi Y i êx i ) W i êx i ) ) / Wi êx i ) W ) i. êx i ) [3]

5 3.3 A ew Estimator The new estimator relies on estimating the unknown regression functions µ ) and µ 0 ) through nonparametric regression of Y on X for the two subpopulations indeed by the treatment status. It is a modification of Hahn s estimator. Like Hahn s estimator it estimates the average treatment effect as ˆτ inr = ˆµ X i ) ˆµ 0 X i )). The difference with Hahn s estimator is that the two regression functions are estimated directly without first estimating the propensity score and the two conditional epectations E[Y W X] and E[Y W ) X]. This implies that we only need to estimate two regression functions nonparametrically rather than three and this will make the optimal choice of smoothing parameters easier. ote that this estimator still requires more unknown regression functions to be estimated than the HIR estimator which requires only estimation of the propensity score, but since the new estimator relies on estimation of different functions, it is not clear which is to be preferred in practice. To provide additional insight into the structure of the problem, we introduce one final estimator, which combines features of the Hirano-Imbens-Ridder estimators and the new estimator: ˆτ mod = Wi ˆµ X i ) êx i ) W i) ˆµ 0 X i ) êx i ) ) / Wi êx i ) W ) i. êx i ) In order to implement these estimators we need estimators for the two regression functions. Following ewey 995), we use series estimators for the two regression functions µ w ), with K terms. As the basis we use power series. Let λ = λ,..., λ k ) be a multi-inde of dimension k, that is, a k-dimensional vector of non-negative integers, with λ = k λ i, and let λ = λ... λ k k. Consider a series {λr)} r= containing all distinct such vectors such that λr) is nondecreasing. Let p r ) = λr), where p K ) = p ),..., p K )). The nonparametric series estimator of the regression function µ w ), given K terms in the series, is given by: ˆµ w ) = p K ) p K X i )p K X i ) p K X i )Y i, W i =w W i =w where A denotes a generalized inverse of A. Given the two estimated regression functions we estimate the average treatment effect as ˆτ = ) ˆµ X i ) ˆµ 0 X i ). For the propensity score we use the series logit estimator Hirano, Imbens and Rubin, 2003) Let Lz) = epz)/ epz)) be the logistic cdf. The series logit estimator of the population [4]

6 propensity score e) is ê M ) = Lp M ) ˆπ M ), for simplicity we use the same series p M ), although this is not essential) where for ˆπ M = arg ma π L,Mπ), 3.3) L,M π) = Wi ln Lp M X i ) π) W i ) ln Lp M X i ) π)) ). 3.4) 3.4 First Order Equivalence of Estimators We make the following assumptions. Assumption 3. Distribution of Covariates) X X R d, where X is the Cartesian product of intervals [ jl, ju ], j =,..., d, with jl < ju. The density of X is bounded away from zero on X. Assumption 3.2 Propensity Score) i) The propensity score is bounded away from zero and one. ii) The propensity score is s times continuously differentiable. Assumption 3.3 Conditional Outcome Distributions) i) The two regression functions µ w ) are t times continuously differentiable. ii) the conditional variance of Y i w) given X i = is bounded by σw. 2 Assumption 3.4 Rates for Series Estimators) i) K = ν, with < ν <, ii) L = ξ, with < ξ <. The properties of the estimators will follow from the following lemma: Theorem 3. Asymptotic Equivalence of ˆβ inr, ˆβ mod and ˆβ hir ) Suppose Assumptions hold. Then i), ˆτinr ˆτ mod ) = o p ), ii), ˆτmod ˆτ hir, ) = o p ). iii), ˆτhir, ˆτ hir,2 ) = o p ). [5]

7 and iv), ˆτh ˆτ hir, ) = o p ). Proof: See Appendi. 4 A feasible MSE criterion All three estimators contain a function or functions that are estimated nonparametrically. For the Hahn estimator we need nonparametric estimates of the propensity score and the conditional epectation of the product of the outcome and the treatment indicator and of the outcome and the control indicator given the covariates. For the HIR weighting estimator an estimator of the propensity score is needed, and for the imputation estimator introduced in section 3.3 we need estimators for the conditional means in the treatment and control populations. Both Hahn and HIR use series estimators for either the propensity score or the conditional epectations. That leaves the question how to select the order of the series. For a meaningful comparison of the performance of these asymptotically efficient estimators such a selection rule is essential. Despite its practical importance there has been little work on the selection of the nonparametric estimators. The only paper that we are aware of is Ichimura and Linton 2003) who consider bandwidth selection if the propensity score in the weighting estimator is estimated by a kernel nonparametric regression estimator. The current practice in propensity score matching, which is a nonparametric estimator that is different from the estimators considered in section 3, is that the propensity score is usually selected using the balancing score property W X ex) In practice this is implemented by stratifying the sample on the propensity score and testing whether the means of the covariates are the same for treatment and controls see e.g. Dehejia and Wahba 2000)). This method of selecting the model for the propensity score focuses eclusively on the bias in the estimation of the treatment effect. This could lead to overspecification of the propensity score and inflation of the variance of the estimator of the treatment effect. We consider both the bias and the variance associated with a choice of the nonparametric function in the treatment effect estimator. Initially we consider the missing data problem in which we observe X for all observations, but Y only if an observation indicator D =. If D = 0 we observe X but not Y. We refer to this type of data as one-sample data. Alternatively, we could have two samples. The first sample is from the marginal population distribution of X. The second sample is a random sample from the subpopulation Y, X D =. This type of data is referred as the two-sample data. We develop the theory for two-sample data, but it turns out that the results are identical for the one-sample case. [6]

8 Throughout we assume that the Y is Missing At Random MAR), i.e. Y D X The notation in this section will differ from that used in section 3 to reflect that we consider the missing data instead of the treatment effect problem. In particular, we use p) for the propensity score, and r kk ) for the orthonormal polynomials that are used in the estimation of µ). 4. The MSE and its estimator As in Li 987) we consider a population in which the joint distribution of Y, X is such that Y = µx) U with EU X) = 0 and VarU X) = σ 2. Andrews 99) has generalized Li s results to the heteroskedastic case and we could do the same. To concentrate on essentials first we maintain the assumption that U is homoskedastic. Initially we assume that we have two independent samples. Sample is a sample of size from the marginal distribution of X. We denote this sample by X i, i =,...,. Sample 2 is a random sample of size from the distribution of Y, X D =. We denote this sample by Y i, X i, i =,...,. The imputation estimator for the parameter of interest µ Y = EY ) is with ˆτ IR = ˆµ Y K = ˆµ K X i ) ˆµ K ) = r K ) R KR K ) R Ky The subscript K is the number of base functions used in the estimation of µ), and we use the notation y = Y. Y µ = µx ). µx ) u = U. U and r K ) = r K ). r KK ) R K = r K X ). r K X ) R K = r K X ). r K X ) The parameter of interest is µ Y = EY ). The MSE is obtained from ˆµY K µ Y ) = ˆµ K X [ i ) E Y X ˆµ K X ]) i ) [ E Y X ˆµ K X ] i ) µ X ) i ) [7]

9 µ X ) i ) µ Y 4.5) We can treat X i and X i as constants. Therefore, it is reasonable to redefine the parameter of interest as µ Y = µ X i ) If we do so, the final term in the comparison can be omitted, and we need only consider the first two terms. This parameter is the missing data analogue of the sample treatment effect of Imbens 2003). We now consider the first two terms separately. The first corresponds to the variance and the second to the bias term in the MSE. The first term can be written as V = ˆµ K X [ i ) E Y X ˆµ K X ]) i ) = ι RK R KR K ) R Ku If we treat the covariates as constants, it is easily seen that the epected value of the crossproduct of the bias and variance term is 0. Hence we can deal with the bias and variance terms separately. The variance term can be epressed as EV 2 ) = σ 2 2 Because f) = ι RK R KR K ) R K ι = σ 2 ι RK ) R ) K R K ι RK p f D = ) 4.6) p) with p = PrD = ), the variance term can be computed from the observations in the subpopulation X D = if we use the weights p px), so that with EV 2 ) = p 2 σ2 a M K a a = px ). px ) M K = R K R KR K ) R K The bias term is [ E Y X ˆµ K X ] i ) µ X ) i ) = r K X i ) R KR K ) R Kµ µ X ) i ) [8] )

10 Again we average over the distribution of X D = using the weights term is written as B = p rk X i ) R px i ) KR K ) R Kµ µx i ) ) = pa A K µ p px). Hence, the bias with A K = I M K. ote that a A K µ is the covariance of the residuals of the regressions of µ on R K and a on R K, respectively. By Lorentz 986), Theorem 8, p. 90, we have that a A K µ = O K sµ/d sp/d) so that the squared bias term is of order O K 2sµ/d 2sp/d) with s µ the number of continuous derivatives of µ), s p the number of continuous derivatives of p), and d the dimension of X. If we add the variance and squared bias terms we obtain the population MSE E K) = p2 σ 2 a M K a a A K µ) 2) 4.7) To use the MSE we need to estimate the bias term. This can be done in several ways 2. Let e K be the vector of residuals of the regression of y on R K, i.e. e K = A K y. Using so that Hence, a e K = a A K µ a A K u a e K ) 2 = a A K µ) 2 a A K uu A K a 2a A K µa A K u E [ a e K ) 2] = a A K µ) 2 σ 2 a A K a so that the bias term is estimated by ote that and a e K ) 2 σ 2 a A K a a A K µ) 2 = O 2 K 2sµ/d 2sp/d ) σ 2 a A K a = OK 2sp/d ) Upon substitution of the estimate we obtain the estimated MSE C K) = p2 2σ 2 a M K a σ 2 a a a e K ) 2) 4.8) 2 ote that a A K ˆµ K = 0, so that this obvious estimator cannot be used. [9]

11 Both the population MSE and its estimator are proportional to p 2. Hence the value of K that minimizes the estimated) MSE is independent of the fraction of observations for which we observe Y. In the sequel we omit p 2 and redefine E K) = σ 2 a M K a a A K µ) 2) C K) = 2σ 2 a M K a σ 2 a a a e K ) 2) 4.2 The population MSE If, the variance term converges to [ σ 2 E r K X) ] Σ K [r E kk X) ] with Σ K the probability limit of R K R K. The variance term simplifies if we choose the base functions r kk as orthonormal polynomials with respect to the density of X D =, so that E [r jk X)r jk X) D = ] = 0, j k E [ r kk X) 2 D = ] = For this choice the variance term converges to [ σ 2 E r K X) ] [ E r K X) ] = σ 2 K k= [ E r kk X) ]) 2 Because the constant function is one of the orthonormal polynomials, we have that for all non-constant base functions E [r kk X)] = 0 Using 4.6) [ E r kk X) ] [ ] p = E px) r kkx) which is proportional to the covariance of px) and r kkx). If the inverse of the probability that Y is observed is smooth in the sense that it can be well approimated by a linear combination of a relatively small number of base functions, then the variance will not increase much if K eceeds the number of base functions needed. As an eample take p) = d c γ K 0 r K0 ) [0]

12 With bounded support of X we can always find a c so that p) is nonnegative. For this choice of p) we have [ ] p E px) r kkx) = pγ k E [ r kk X) 2] = pγ k if the base functions are orthonormal. Hence for large, the variance term is proportional to p 2 K k= γ2 k. Hence the variance term increases with K for K K 0, and is constant for K > K 0. In this eample the bias decreases with K and is 0 for K K 0 for all. Because the bias term increases with we find that the value of K that minimizes the MSE is bounded by K 0 if is large. In this eample a is orthogonal to the residuals e K0, and a A K0 a = 0, so that the estimate of the bias is 0 as is the bias) for K K 0. As we will see in section 4.4 these same results apply approimately) if p) is not eactly a linear combination of base functions. Unless p) is very unsmooth, only a few base functions are needed to minimize the MSE. It is instructive to compare the population MSE with Li s 986) average squared error criterion. We obtain intuitive results if we assume that K0 p) = K 0 γ k r k ) µ) = δ k r k ) k= k= with r k ) orthogonal polynomials with respect to the density of X D =. Li s average squared error criterion is L K) = ˆµ K X i ) µx i )) 2 i.e. it is average squared deviation in the sample in which we observe both Y and X. The epected value of the variance term is σ 2 r K X i ) R KR K ) r K X i ) = σ 2 K The bias term is equal to µ µ µ R K R KR K ) R Kµ which converges to K 0 k=k δ2 k. Hence Li s criterion is E [L K)] = σ 2 K K 0 k=k Under these assumptions the bias in our criterion is p a µ ) a R K R KR K ) R Kµ δ 2 k 4.9) []

13 We have a µ = K0 ) K0 ) γ k r k X i ) δ l r l X i ) = k= l= K 0 γ k δ k r k X i ) 2 k= K 0 γ k δ l r k X i )r l X i ) k l Because the r k are orthonormal, the second term on the right hand side converges to 0, and the first converges to K 0 k= γ kδ k. Finally, because a R K converges to γ K I K and R K µ to I K δ K, the second term in the bias converges to K k= γ kδ k. Combining the results we find p 2 σ 2 K k= γ 2 k K0 k=k ) 2 γ k δ k 4.0) These two criteria are clearly different. For instance, if we take γ k = γ, then for fied, E [L K )] E [L K)] if and only if t 2 K with t k = δ k σ the asymptotic t-ratio for δ K. For our criterion we find that it decreases if and only if t K TK2 2 T K2 or t K TK2 2 T K2 with T K2 = K 0 k=k2 t k. 4.3 Optimality of minimizer of estimated MSE Define ˆK as ˆK = argmin K K C K) with K an inde set that grows with at rate κ. We will show that ˆK is optimal in the sense that Li 987)) E ˆK) inf K K E K) p 4.) This does not imply that the difference between ˆK and the minimizer of E K) converges to 0. We make the following assumptions Assumption 4. The smallest eigenvalue of E [r K X)r K X) D = ] is bounded from 0 for all K. [2]

14 Assumption 4.2 E [U m ] < for some integer m. Assumption 4.3 κmsp/d ) inf E K) m 2 K K with d the dimension of X. Theorem 4. If assumptions hold, then E ˆK) inf K K E K) Proof See Appendi. p 4.4 Simulation results The finite sample performance of our estimator is investigated in a number of sampling eperiments. The population model is with Y = µx) U µ) = 5 8 k= k X a scalar variable and U standard normal. We choose a simple logit model for p) p) = e.53 e.53 As base functions we choose the Legendre polynomials. These polynomials are defined on [, ] and are orthogonal with weight function equal to on this interval. For that reason we choose the marginal distribution of X such that X D = is a uniform distribution on [, ], i.e. the distribution of X has density f) = p 2 p) with p = 6 6 e.5 e 4.5 ) Of course, in practice it may not be feasible to choose polynomials that are orthogonal with a weight function that is equal to the distribution X D =. Table gives some statistics for the data generating process. The fraction with a missing Y is.429. The mean of both X and Y is larger in the subpopulation with observed Y. [3]

15 Table : Statistics sampling eperiment Mean Std dev Y X Y D = D.57 Table 2: Optimality of minimizer estimated GMM Fraction ˆK = K E ˆK) E K) = = = We consider three sample sizes = 00, 000, The number of repetitions is 000. The results are in Table 2 and the figures. The second column is in line with the asymptotic result in section 4.3. That result does not imply that the minimizer of the population and estimated population MSE are closer if the sample size increases. However, this is what we find in the eperiments. [4]

16 Figure : Population and estimated MSE, = 00 In the figures -3 we report the population MSE and its estimator. ote that with the simple logit model the MSE is flat after the inclusion of a few basis functions. To avoid scaling problems we omit some population and estimated MSE values for small K. [5]

17 Figure 2: Population and estimated MSE, = 000 [6]

18 Figure 3: Population and estimated MSE, = 0000 [7]

19 5 Appendi For matrices A we use the norm A = tra A)) /2. Lemma A. Properties of orm) For conformable matrices A and B, i), AB A B, ii), A BA B A A. If B is positive semi-definite and symmetric, then, for λ ma B) equal to the maimum eigenvalue of B, iii), tra BA) A 2 λ ma B), iv), AB A λ ma B), v), BA A λ ma B). Proof: Let A be of dimension K L, and B of dimension L M. First we prove i): AB is of dimension K M, with its k, m) element equal to L l= a klb lm. Hence the m, m) element of the M M matri B A AB is equal to K L ) 2, k= l= a klb lm and thus AB 2 = trb A AB) = M m= k= K L ) 2 a kl b lm l= M K L m= k= l= a 2 kl l= L b 2 lm = A 2 B 2, L ) 2 where the last inequality follows from the Cauchy-Schwartz inequality which implies that l= a klb lm L L l= a2 kl l= b2 lm. et, consider ii): A BA 2 = tra B AA BA) = trb AA BAA ) = trd E), where D = AA B and E = BAA. Let δ = trd E)/trE E), and F = D δe, so that trf E) = 0. Then trd E) = trδe F ) E) = δtre E), and trd D) = δ 2 tre E) trf F ), so that Thus trd E) 2 = δ 2 tre E) 2 δ 2 tre E) trf F )) tre E) = trd D) tre E). A BA 2 = trb AA BAA ) trb AA AA B)) /2 traa B BAA )) /2 = AA B BAA. By part i) of the lemma, AA B AA B, and BAA B AA, so that the conclusion follows. et, consider iii). Because B is symmetric L L, and positive definite, we have B = SΛS, where S S = I L, and Λ is diagonal with the eigenvalues of B on its diagonal, so that λ ma B) = λ ma Λ). Let C = A S, with c ij the i, j) element of A S. With A of dimension L K, C is of dimension K L. Then tra BA) = tra SΛS A) = trλs AA S) = trλc C) = L K λ j j= c 2 ij λ ma B) L j= K c 2 ij = λ ma B) trc C) = λ ma B) trs AA S) = λ ma B) traa SS ) = λ ma B) traa ) = λ ma B) tra A) = λ ma B) A 2. [8]

20 et, consider iv). Again write B = SΛS, with Λ diagonal and S S = I L. Then: AB 2 = trb A AB) = tra ABB ) = tra ASΛS SΛS ) = tra ASΛ 2 S ) λ 2 mab) trs A AS) = λ 2 mab) tra ASS ) = λ 2 mab) tra A) = λ 2 mab) A 2. Finally, v) can be proven the same way. The data consist of two random samples X i, Y i ), i =,...,, and Z i ), i =,..., M. Let µ) = E[Y X = ] be the conditional epectation of the outcome given the covariates. We use a series estimator for the regression function µ). Let K denote the number of terms in the series. As the basis we use power series. Let λ = λ,..., λ d ) be a multi-inde of dimension d, that is, a d-dimensional vector of non-negative integers, with λ = d k= λ k, and let λ = λ... λ d d. Consider a series {λr)} r= containing all distinct vectors such that λr) is nondecreasing. Let p r ) = λr), where P r ) = p ),..., p r )). Assumption A. X X, where X is the Cartesian product of intervals [ jl, ju ], j =,..., d, with jl < ju. The density of X is bounded away from zero on X. Assumption A.2 Z X, and the density of Z is bounded away from zero on X. Given Assumption A. the epectation Ω K = E[P K X)P K X) ] is nonsingular for all K. Hence we can construct a sequence R K ) = Ω /2 K P K) with E[R K X)R K X) ] = I K. Let R kk ) be the kth element of the vector R K ). It will be convenient to work with this sequence of basis function R K ). Define ζk) = sup R K ). X Lemma A.2 ewey, 994) ζk) = OK). Proof: see other notes) Let R K be the K matri with ith row equal to R K X i), and let ˆΩ K, = R K R K/. Lemma A.3 ewey, 997) ˆΩ K, I K = O p ζk)k /2 /2) Proof: We will show that E [ R KR K / I K 2] ζk) 2 K/, A.) so that the result follows by Markov s inequality. E [ R KR K / I K 2] = E [ tr R KR K R KR K / 2 2R KR )] K / I K = tr E [ R KR K R KR K / 2] 2E [R KR ) K /] I K = tr E [ R KR K R KR K / 2]) K [9]

21 K K = E [ R kk X i )R lk X i )R lk X j )R kk X j )/ 2] K. k= l= j= The terms with i j and k l have epectation zero. The terms with k = l and i j have epectation E[R kk X i ) 2 R kk X j ) 2 ] =. There are K ) of those terms, each divided by 2, leading to a total of K )/ = K K/. Hence K K E [ R kk X i )R lk X i )R lk X j )R kk X j )/ 2] K k= l= j= K k= l= = 2 K E [ R kk X i )R lk X i )R lk X i )R kk X i )/ 2] [ K E R kk X i ) 2 k= K l= R lk X i ) 2 ]. A.2) By definition, ζk) = sup R K ) = sup k R2 kk ))/2. Hence it follows that k R2 kk ) ζ2 K), and thus A.2) can be bounded by 2 [ K ] ζ 2 K) E R lk X i ) 2 = Kζ 2 K)/. l= This shows that A.) holds, and thus finishes the proof. Let U i = Y i µx i ). Let U, Y, and X be the vectors and d matri with ith row equal to U i, Y i, and X i. Furthermore, let be an indicator for the event λ min ˆΩ K, ) > /2. By Lemma A.3, p it follows that. For a symmetric matri A, we can write A as F ΛF, where F is symmetric and Λ diagonal. Let Λ /2 be the diagonal matri with each diagonal element equal to the reciprocal of the square root of the corresponding diagonal element of Λ, and let A /2 = F Λ /2, so that A /2 is symmetric and satisfies A /2 A /2 = A. Assumption A.3 sup VarY X) σ 2. X Lemma A.4 i), and ii), ˆΩ ˆΩ /2 K, R KU/ K, R KU/ Proof: First we prove i). [ E ˆΩ /2 K, R KU/ = O p K /2 /2 ), = O p K /2 /2 ), ] 2 X [ ] = E U R ˆΩ K K, R KU/ 2 X [20]

22 = E [ U R K R KR K ) R KU/ X ] = E [ tr U R K R KR K ) R KU ) X ] / [ = E tr R K R KR K ) R KUU ) ] X / ) = tr R K R KR K ) R KE [UU X] / σ 2 tr R K R KR K ) R K) / = σ 2 K/ σ 2 K/. Then by the Markov inequality ˆΩ /2 K, R K U/ = O pk /2 / /2 ). et, consider part ii). Using Lemma A.v), ˆΩ λ ma ˆΩ /2 K, ) ˆΩ /2 K, R KU/. K, R KU/ Since λ ma ˆΩ /2 K, ) = O p), the conclusion follows. This conditional epectation µ) is estimated by the series estimator ˆµ K ) = R K )ˆγ K with ˆγ K the least squares estimator. Formally R K R K may be singular, although Lemma A.3 shows that this happens with probability going to zero. To deal with this case we define ˆγ K as 0 K if λ min R K ˆγ K = R K/) /2, ) R KX i )R K X i) R A.3) KX i )Y i otherwise, where 0 K is a K-dimensional vector of zeros, and λ min A) is the minimum eigenvalue of the matri A. Define γk to be the pseudo true value defined as γk = arg min E [µx) R K X) γ) 2 ] W =, A.4) γ with the corresponding pseudo true value of the regression function denoted by µ K ) = R K) γ K. First we state some of the properties of the estimator for the regression function. In order to do so it is useful to first give some approimation properties for µ). Assumption A.4 µ) is s times continuously differentiable on X. Lemma A.5 Lorentz) Suppose Assumptions A.-A.2 hold. Then there is a sequence γk 0 sup µ) R K ) γk 0 = O K s/d). such that Proof: see notes) For the sequence γk 0 in Lemma A.5, define the corresponding sequence of regression functions, µ 0 K ) = R K) γk 0. [2]

23 Lemma A.6 Convergence Rate for Regression Function Estimators) Suppose Assumptions A.-A.2 hold. Then i): ii): iii): γ K γk 0 = O ζk)k s/d), A.5) sup µ K) µ 0 K) = Oζ 2 K)K s/d ). A.6) ˆγ K γ K = O p K /2 /2 K s/d ), A.7) iv): ˆγK γk 0 = Op K /2 /2 K s/d), A.8) v): sup ˆµ K ) µ K) = O p ζk)k /2 /2 ζk)k s/d ), A.9) and vi): sup ˆµ K ) µ) = O p ζk)k /2 /2 ζ 2 K)K s/d ), A.0) Proof: First, consider i): γ K = E[R K X)R KX)]) E[R K X)Y ] = E[R K X)R KX)]) E[R K X)µX)] = E[R K X)µX)], where we use both [R K X)U] = 0 and E[R K X)R K X) ] = I K. Also, γk 0 = E[R KX)R K X)γ0 K ], so that γ K γ 0 K = E[R K X)µX)] E[R K X)R K X) γ 0 K] = E[R K X)µX) R K X) γ 0 K)] sup et, consider ii): sup R K ) sup µx) R K X) γk 0 CζK)K s/d. µ K) µ 0 K) = sup R K ) γk γk) 0 sup R K ) γk γk 0 Cζ 2 K)K s/d, using the result in Lemma A.6i). et, consider iii): ˆγ K γ K = ˆγ K γ K ) ˆγ K γ K, [22]

24 since Pr = ). The second term is ) ˆγ K γ K = ) γ K = o p ). The first term is ˆγ K γk = ˆΩ K, R KY/) ˆΩ K, R KR K γk/) = ˆΩ K, R KU/) ˆΩ K, R KµX) R K γk)/) ˆΩ K, R KU/) ˆΩ K, R KµX) R K γ K)/). By Lemma A.4ii) the first term is O p K /2 /2 ). For the second term, we have, using Lemma A.v), and using the fact that if is equal to, then λ ma ˆΩ /2 K, ) 2, ˆΩ K, R KµX) R K γk)/) λ ma ˆΩ /2 /2 K, ) ˆΩ K, R KµX) R K γk)/) 2 2 µx) R Kγ K) R K ) /2 /2 /2 ˆΩ K, ˆΩ K, R KµX) R K γk) = 2 µx) R KγK) R K R KR K ) R KµX) R K γk) 2 ) /2 µx) R KγK) µx) R K γk) A.) where we use the fact that because R K R K R K) R K is a projection matri, it follows that I R K R K R K) R K is positive semi-definite. Since by the definition of γ K [ ] [ ] E µx) R KγK) µx) R K γk) E µx) R KγK) 0 µx) R K γk) 0 sup µ) R K ) γk 0 2 CK 2s/d, it follows by the Markov inequality that A.) is O p K s/d ). Hence ˆγ K γ K = O pk s/d ) o p ) O p K /2 /2 ) = O p K s/d K /2 /2 ). et, consider iv). We use the same argument as for iii): ˆγ K γk 0 = ˆΩ K, R KY/) ˆΩ K, R KR K γk/) 0 ˆΩ K, R KU/) ˆΩ K, R KµX) R Kγ 0 K)/). The first term is again O p K /2 /2 ). For the second term we now have: ˆΩ K, R KµX) R K γk)/) 0 λ ma ˆΩ /2 /2 K, ) ˆΩ K, R KµX) R K γk)/) 0 2 sup µ) R K ) γk 0 = OK s/d ). et, consider v): sup ˆµ K ) µ K) = sup R K ) ˆγ K γk) sup R K ) ˆγ K γk = O p ζk)k /2 /2 ζk)k s/d ). [23] ) /2

25 Finally, consider vi). sup ˆµ K ) µ) sup ˆµ K ) µ K) sup µ K) µ 0 K) sup µ 0 K) µ) = O p ζk)k /2 /2 ζk)k s/d ζ 2 K)K s/d K s/d ) = O p ζk)k /2 /2 ζ 2 K)K s/d ). The new imputation estimator developed in this paper is ˆβ inr = ˆµ K X i ) = R K X i ) ˆγ K A.2) It is useful to consider in this discussion three additional estimators that are based on the propensity score. In all cases we estimate the propensity score using a logistic series estimator, as suggested in Hirano, Imbens and Ridder 2003). Here we briefly summarize the relevant results. Let Lz) = epz)/ epz)) be the logistic cdf and L z) = Lz) Lz)). The series logit estimator of the population propensity score e ) is ê K ) = LR K ) ˆπ K ), for simplicity we use the same series R K ), although this is not essential) where for ˆπ K = arg ma π L π), A.3) L π) = W i ln LR K X i ) π) W i ) ln LR K X i ) π))). A.4) For and K fied we have ˆπ K p π K, with π K the pseudo true value: π K = arg ma π E [e X) ln LR K X) π) e X)) ln LR K X) π))]. A.5) We also define the pseudo true propensity score: e K ) = LR K) π K ). Assumption A.5 e) is s times continuously differentiable on X. Lemma A.7 Suppose Assumptions A., A.3 hold. Then there is a sequence πl 0 sup e) LR L ) πk) 0 = O L s/d). such that Assumption A.6 inf e) > 0 and sup e) <. Lemma A.8 Convergence Rate for Propensity Score Estimators) Suppose Assumptions??, A., and A.4 hold. Then i): ˆπ L π L = O p L /2 /2 ), A.6) ii): π L πl 0 = O L s/2d)), A.7) iii): sup ê L ) e L) = O p ζl)l /2 /2). A.8) [24]

26 iv): v): vi), sup e L ) e 0 L) = O ζl)l s/2d)). A.9) sup e 0 L) e) = O p L s/d). A.20) ˆπ L π 0 L = O p L /2 /2 L s/2d) ), A.2) vii), and viii), sup ê L ) e 0 L) = O p ζl)l /2 /2 L s2d) )), A.22) sup ê L ) e) = O p ζl)l /2 /2 L s2d) )). A.23) Proof: See Hirano, Imbens and Ridder 2003). The second estimator is a modified imputation estimator that only averages over the observations with W i = : ˆβ mod = W i ˆµ K X i ). A.24) ê L X i ) The third estimator is the weighting estimator proposed by Hirano, Imbens and Ridder 2003): ˆβ hir, = W i Y i ê L X i ). A.25) Hirano, Imbens and Ridder 2003) show that ˆβ hir, is consistent, asymptotically normal and efficient. The fourth estimator is a modified version of the Hirano-Imbens-Ridder estimator where the weights are normalized to add up to unity: ˆβ hir,2 = / W i Y i W j ê L X i ) ê j= L X j ). The properties of ˆβ inr will follow from the following lemma: Proof of Theorem 3.: First we prove i). We have ˆβinr ˆβ mod = ˆµ K X i ) W i ˆµ K X i ) ê L X i ) = = ˆµK X i ) ê L X i ) ê L X i ) ˆµK X i ) ê L X i ) ê L X i ) W ) i ˆµ K X i ) ê L X i ) W i ˆµ K X i ) ê L X i ) A.26) A.27) [25]

27 ˆµ KX i ) ê L X i ) ˆµ KX i ) ê L X i ) ˆµ KX i ) W i e L X i ) e L X i ) e 0 L X ˆµ KX i ) W i i) µ0 K X i) ê L X i ) µ0 K X i) µ0 K X i) µ0 K X i) µ0 K X i) W i µ0 K X i) ê L X i ) µ0 K X i) µ0 K X i) µ0 K X i) µ0 K X i) W i µx i) ê L X i ) ˆµK X = i ) ê L X i ) ê L X i ) ˆµ KX i ) ê L X i ) ˆµ KX i ) ˆµ KX i ) µx i) ê L X i ) ˆµ KX i ) ˆµ KX i ) ˆµ0 K X i) ˆµ KX i ) W i µ0 K X i) W i µx i) W i ˆµ KX i ) ê L X i ) µ0 K X i) ê L X i ) µ0 K X i) e L X i ) µ KX i ) ˆµ KX i ) ˆµ KX i ) ˆµ0 K X i) ˆµ KX i ) W i µ0 K X i) W i µx i) W i W i ˆµ K X i ) ê 0 L X i) ˆµ KX i ) ˆµ KX i ) ˆµ KX i ) ) ˆµ KX i ) W i µ0 K X i) µ0 K X i) µ0 K X i) ˆµ KX i ) W i ˆµ KX i ) µ KX i ) W i µ0 K X i) ˆµ KX i ) W i ˆµ KX i ) W i µ0 K X i) W i µ0 K X i) W i µ0 K X i) ê 0 L X i) e 0 L X µx i) ê L X i ) µ0 K X i) W i i) e 0 L X µx i) W i i) µx i) ê L X i ) µx ) i) W i ˆµK X i ) ê L X i ) ˆµ0 K X ) i) e 0 L X ê L X i ) W i ) i) ˆµ K X i ) µ 0 K X i) ê L X i ) e 0 e L X i ) LX i ) ˆµ K X i ) µ 0 K X i) e 0 L X e 0 LX i ) ) i) ˆµK X i ) µ 0 K X i) ˆµ KX i ) µ 0 K X ) i) W i ) e L X i ) A.28) A.29) A.30) A.3) [26]

28 ˆµ K X i ) µ 0 K X i) W i ) µk X i ) µx ) i) ê L X i ) W i ) µx i ) ê LX i ) W i ). A.32) A.33) A.34) We will deal with equations A.28)-A.34) separately. First consider A.28). This itself will be broken up into seven parts. RK X i ) ˆγ K LR L X i ) ˆπ L ) R KX i ) ˆγ ) K LR L X i ) πl 0 ) ê L X i ) W i ) = = ˆµK X i ) ê L X i ) ˆµ ) KX i ) ) µ K X i ) ê L X i ) ) ˆµ K X i ) µ K X i )) ê L X i ) e L X i ) ê L X i ) W i ) ê L X i ) W i ) ê L X i ) W i ) ê L X i ) W i ) ê L X i ) e L X i )) ê L X i ) W i ) )) êl X i ) e L X i ) µ K X i ) e 2 L X i) ê L X i ) e L X i ) µk X i ) e 2 L X i) µx ) i) e 2 X i ) µx i ) ) e 2 X i ) ê L X i ) e L X i ) )R LX i )ˆπ L πl) 0 µx i ) ) e 2 X i ) µx i ) ) e 2 X i ) µx i ) ) e 2 X i ) R LX i )ˆπ L πl) 0 ê L X i ) e L X i )) R LX i )ˆπ L πl) 0 e L X i ) ) R LX i )ˆπ L πl) 0 W i ) A.35) A.36) A.37) ê L X i ) W i ) A.38) A.39) A.40) A.4) First consider A.35). Since inf e) > c for some c > 0, it follows that for large enough, inf e 0 L ) > c/2, and therefore for large enough we have inf e L ) > c/4. Thus for large enough, with arbitrarily high probability, inf ê L ) > c/8. Thus, since sup ê L ) e 0 L ) = O pζl)l s/2d) ζl)l /2 /2 ), it follows that ê L X i ) e L X i ) = O pζl)l s/2d) ζl)l /2 /2 ). [27]

29 Thus ) ˆµ K X i ) µ K X i )) ê L X i ) e L X i ) /2 sup ˆµ K ) µ K ) sup ê L X i ) W i ) ê L X i ) e L X i ) 2 = O p /2 ζk)k s/d ζk)k /2 /2 ) ζl)l s/2d) ζl)l /2 /2 ). et, consider A.36). ) µ K X i ) ê L X i ) êlx ) i ) e L X i ) e L X i ) e 2 L X ê L X i ) W i ) i) = = /2 sup µ K X i ) ) el X i ) ê L X i ) êlx ) i ) e L X i ) e L X i )ê L X i ) e 2 L X i) µ K X i ) e L X i ) e LX i ) ê L X i )) ê L X i ) ê L ) e 0 L) sup = O p /2 ζk)k s/d ζk)k /2 / /2 ) 2 ). ê L X i ) W i ) ) ê L X i ) W i ) e L X i ) ê L X i ) e L X i ) 2 et, consider A.37). µk X i ) e 2 L X i) µx ) i) e 2 ê L X i ) e L X i )) ê L X i ) W i ) X i ) /2 sup µk X i ) e 2 L X i) µx ) i) e 2 L X ê L X i ) e L X i )) ê L X i ) W i ) i) µxi ) e 2 L X i) µx ) i) e 2 ê L X i ) e L X i )) ê L X i ) W i ) X i ) µ K X i ) e 2 L X i) µx i) e 2 L X i) sup ê L ) e L ) 2 µx i ) e 2 L X i) µx i) e 2 X i ) sup ê L ) e L ) 2 /2 sup = O p /2 ζ 2 K)K s/d ζl)l /2 /2 ζl)l s/2d) )) O p /2 ζl)l s/2d) ζl)l /2 /2 ζl)l s/2d) )). et, consider A.38). Using a mean value theorem, we can write ê L X i ) e 0 LX i ) = LR L X i )ˆπ L ) LR L X i )π 0 L) = LR L X i ) π L ) LR L X i ) π L ))R L X i ) ˆπ L π 0 L), [28]

30 for some intermediate value π L. Since π L is between ˆπ L and πl 0, it follows that and Hence = π L π 0 L = LO p L /2 /2 ), sup LR L ) π L ) LR L )πl) 0 = O p ζl)l /2 /2 ). /2 sup Since µx i ) e 2 X i ) ê L X i ) e L X i ) ))R L X i )ˆπ L πl) 0 ) ê L X i ) W i ) µx i ) e 2 X i ) LR K X i ) π L ) LR K X i ) π L ))) ))R L X i )ˆπ L πl) 0 ) ê L X i ) W i ) µx i ) e 2 X i ) sup LR K X i ) π L ) LR K X i ) π L ))) )) sup R L ) ˆπ L πl 2 0 = O p /2 ζl)l /2 /2 ζl)ζl)l /2 /2 ). et, consider A.39). µx i ) ) e 2 R L X i )ˆπ L π 0 X i ) L) ê L X i ) e L X i )) /2 sup µ)e) e)) e 2 ) sup R K ) ˆπ L πl 0 sup ê L ) e 0 L) = O p /2 ζl)l /2 /2 L s/2d) )ζl)l /2 /2 ζl)l s/2d) )). et, consider A.40). µx i ) ) e 2 R L X i )ˆπ L π 0 X i ) L) e L X i ) ) /2 sup µ)e) e)) e 2 ) sup = O p /2 ζl)l /2 /2 L s/2d) )L s/d ). R K ) ˆπ L πl 0 sup e 0 L) e) et, consider A.4). µx i ) ) e 2 R L X i )ˆπ L π 0 X i ) L) W i ) E ˆπ L πl 0 µx i ) ) e 2 X i ) µx i ) ) e 2 X i ) 2 R L X i ) W i ) [29] R L X i ) W i ).

31 sup µ)e) e)) e 2 E[R L X i ) R L X i )] = OL), ) it follows that A.4) is O p LL /2 /2 L s/2d) ). Epression A.29) is bounded by ˆγ K γ K sup R K ) /2 sup e L ) sup ê L ) e L ) = O p /2 ζk)) O p ζk)) /2 O p /2 ξ 2 L)) = O p /2 ξ 2 K)ξ 2 L)). ote that inf e L ) > sup e)/2 for L large enough, so that sup /e L )) = O).) Epression A.30) is bounded by ˆγ K γ K sup R K ) /2 sup e L ) sup e L ) e) = O p /2 ζk)) O p ζk)) /2 O p L t/d ) = O p L t/d ξ 2 K)). Epression A.3) is bounded by ˆγ K γ K sup R K ) /2 sup e L ) e) = O p /2 ζk)) O p ζk)) /2 O p L t/d ) = O p L t/d ξ 2 K)). Epression A.32) is bounded by R ˆγ K γ K K X i ) ex i ) W i ) The first factor is O p /2 ζk)). Because E R K X i ) 2 W i ) C ξ 2 K), it follows that the second factor is O p ζk)). Thus A.32) is O p /2 ξ 2 K)). et, consider A.33): µk X i ) e L X i ) µx ) i) êx i ) W i ) µ = K X i ) µx i ) e L X i ) êx i ) W i ) e L X i ) µk X i ) µx i ) e L X i ) e 2 µ ) KX i ) µx i ) e L X i ) êx i ) W i ) X i ) e L X i ) µ K X i ) µx i ) e L X i ) e 2 êx i ) W i ) X i ) sup e L ) e)) µ K X i ) µx i ) e L X i ) [30] êx i ) W i )

32 µ K X i ) µx i ) e 2 X i ) µx i ) µx i ) e L X i ) e 2 X i ) = O p L t/d ) O p K s/d ) O p L t/d ). êx i ) W i ) êx i ) W i ) so that epression A.33) is O p L t/d K s/d ). et, consider A.34). Because e) is bounded away from 0 on X, X is a compact subset of R d, and µ) and e) are s and t times continuously differentiable, it follows that µx) ex) is mins, t) times continuously differentiable, has a finite second moment, and the projection of this random variable on R L X) eists. Define [ µx) ) ] 2 δ L = arg min E δ ex) R LX) δ A.42) By Lorentz ), the uniform approimation error is sup µ) e) R ) L) δ L = O L mins,t) d. A.43) X Hence, µx i ) ê LX i ) W i ) = ) µxi ) R LX i ) δ L ê L X i ) W i ) ) µxi ) R LX i ) δ L ê L X i ) W i ), A.44) R L X i ) δ L ê L X i ) W i ) A.45) because the second term vanishes as a result of the first order conditions for the series logit estimator which imply that i R LX i )êx i ) W i ) = 0. The last epression, A.45) can be bounded by sup µ) e) R L) δ L êx i ) W i 2 sup µ) e) R L) δ L = O X X L mins,t) d because of A.43). This proves that A.34) is O L mins,t)/d). Thus combined with ), this finishes the proof of the first assertion in Theorem 3.. et, consider part ii) of Theorem 3.. By the triangle inequality we have ), ˆβmod ˆβ hir ) = W i Y i µx i )) W i Y i ˆµ K X i )) ê L X i ) W i µx i ) ˆµ K X i )) ê L X i ) ) A.46) A.47) [3]

33 W i Y i µ K X i )) ) ) W i Y i µ K X i )) ê L X i ). A.48) A.49) First consider A.46). Because e) is s times continuously differentiable and bounded away from zero, it follows that there is a δ K such that for some finite C we can approimate /e) by R K )δ, with sup e) R K)δ K CK s/d. By the first order conditions for ˆγ K it follows that W i Y i ˆµ K X i ))R K X i ) = 0. Hence W i Y i ˆµ K X i )) ) W i Y i ˆµ K X i )) ex i ) R KX i )δ K W i Y i ˆµ K X i )) R K X i )δ K ) W i Y i ˆµ K X i )) ex i ) R KX i )δ K ) W i Y i µx i )) ex i ) R KX i )δ K ) W i µx i ) ˆµ K X i )) R KX i )δ K /2 /2 sup W i Y i µx i )) sup µ) ˆµ K ) sup e) R K)δ K e) R K)δ K = O p /2 K s/d ) O p /2 ζk)k /2 / /2 ζ 2 K)K s/d )K s/d ) = O p /2 K s/d ). et, consider A.47). W i µx i ) ˆµ K X i )) ê L X i ) ) sup µ) ˆµ K ) sup ê L ) e) [32]

34 = O p /2 ζk)k /2 / /2 ζ 2 K)K s/d )ζl)l /2 /2 L s2d) ))). et, consider A.48). W i Y i µx i )) ) ) W i Y i µx i )) = O p /2 L s/d ). sup Finally, consider A.49). ) W i Y i µx i )) ê L X i ) = W i Y i µx i )) êlx i ) ê L X i ) êl X i ) e 0 L W i Y i µx i )) X i) ê L X i )e 0 L X êlx i ) e 0 L X ) i) i) êl X i ) e 0 L W i Y i µx i )) X i) W i Y i µx i )) ex i) ex i) R L X i )ˆπ L πl) 0. First, consider A.50). êl X i ) e 0 L W i Y i µx i )) X i) ê L X i )e 0 L X êlx i ) e 0 L X ) i) i) A.50) R L X i )ˆπ L π 0 L)) A.5) A.52) ) /2 W i Y i µx i ) sup êl ) e 0 L) sup e ) L ) ê L) e 0 L ) e0 L ) = O p /2 ζl)l /2 /2 L s/2d) ))ζl)l /2 /2 L s/2d) )). et, consider A.5). ote that ê L ) e 0 L) = LR L )ˆπ L ) LR L )π 0 L) = LR L ) π L ) LR L ) π L ))R L )ˆπ L π 0 L), with π in between ˆπ and π 0 L. Hence π π0 L = O pl /2 /2 L s/2d) ), and sup LR L ) π L ) e) = O p ζl)l /2 /2 L s/2d) )). Thus sup ê L ) e 0 L) e) e))r L )ˆπ L π 0 L) = O p ζl)l /2 /2 L s/2d) )) sup R L ) ˆπ L πl) 0 = O p ζ 2 L)L /2 /2 L s/2d) )L /2 /2 L s/2d) )). [33]

35 Then we have êl X i ) e 0 L W i Y i µx i )) X i) êl X i ) e 0 L W i Y i µx i )) X i) ex i) R L X i )ˆπ L π 0 L)) ex i) R L X i )ˆπ L π 0 L)) êl X i ) e 0 L W i Y i µx i )) X i) e 0 L X êlx i ) e 0 L X ) i) i). The first term is O p ζ 2 L)L /2 /2 L s/2d) )L /2 /2 L s/2d) )). The second term is bounded by W i Y i µx i )) sup êl X i ) e 0 LX i ) sup = O p /2 ζl)l /2 / /2 L s/2d) )ζl)l /2 / /2 L s/2d) )) Finally, consider A.52). W i Y i µx i )) ex i) R L X i )ˆπ L π 0 L). ˆπ L πl 0 W i Y i µx i )) ex i) R L X i ). The first factor is O p ζl)l /2 /2 L s/2d) ). The epectation of the square of the second factor is [ E = E [ W i Y i µx i )) ex i) T 2 Y µx)) 2 ex) ex) ] R L X i ). ) 2 R L X) R L X)] ) 2 e) sup σy 2 ) sup E[R L X) R L X)] = OL). e) Hence A.52) is O p LζL)L /2 /2 L s/2d) ). This finishes the proof of part ii) of the Theorem. Finally, consider part iii) of the Theorem. First we prove /2 / ) W i = o p ). A.53) ê L X i ) / /2 W i ) ê L X i ) = W i ê L X i ) W i ê L X i ) ê L X i ) A.54) [34]

36 ê L X i )) êx i ) ) W i )) ) ) W i )) ê L X i ). A.55) A.56) A.57) We will deal with A.54)-A.57) one by one. First consider A.54). There is a sequence of δ L such that for some finite C we have sup e) R L) δ L CL s/d. Then W i ê L X i ) W i ê L X i )) R L X i ) δ W i ê L X i )) R L X i ) δ L ). Because of the first order conditions for ˆπ L the first term vanishes. Then The second term is bounded by C /2 sup R L) δ L e) = O /2 L s/d ) = o p ). et, consider A.55). ê L X i )) êx i ) ) /2 sup e) ê L ) sup e) ê L ) = O p /2 ζl)l /2 /2 L s/d )ζl)l /2 /2 L s/d )) = o p ). et, consider A.56). W i )) ) 2 sup e 0 L ) e) = O p /2 L s/d ) = o p ). Finally, consider A.57). ) W i )) ê L X i ) = W i )) e0 L X i) ê L X i ) ê L X i ) [35]

37 W i ) e 0 LX i ) ê L X i ) ) e 0 W i ) L X i ) ê L X i ) e 2 X i ) W i ) ex i) ê L X i ) e 2 X i ) ex i) R L X i ) ˆπ L πl) 0 ) R L X i ) ˆπ L π 0 L)) The first of these terms is of order O p /2 ζ 2 L)L /2 /2 L s/2d) ) 2 ) = o p ). The second term is of order5 O P ) = o p ). For the third term, note that E W i ) ex 2 i) R L X i ) so that [ ] ex) 3 = E R L X) R L X) CL ex) W i ) ex i) R L X i ) ˆπ L πl) 0 CL /2 ˆπ L π 0 L = O p L /2 L /2 /2 L s/2d) )) = o p ). This finishes the proof of A.53). et, define C = W i Y i ê L X i ), so that C = o p ), and ˆβ hir, ˆβ hir,2 = Hence ˆβhir, ˆβ hir,2 W i Y i ê L X i ) C ). W i Y i ê L X i ) C C = O p) o p ) = o p ). Lemma A.9 If then E K) C K) p sup 0 K K E K) E ˆK) p inf K K E K) [36]

38 Proof: Define K = argmin K K E K) For large enough we have that for any δ, η > 0 ) Pr C K) E K) < δ > η sup K K Hence if is sufficiently large then with probability of at least η δ δ C K) δ)e K) C ˆK) δ)e K) E ˆK) E K) and the conclusion follows because δ, η >) are arbitrary. Proof of Theorem 4.: We have C K) = E K) 2 ε A K aa A K µ a A K εε A K a σ 2 a A K a) We need to show sup K K u A K aa A K µ E K) p 0 A.58) and sup K K a A K uu A K a σ 2 a A K a E K) p 0 A.59) I consider the first statement. We have for δ > 0 a A K µ u A K µ Pr sup K K E K) a A K µ me u A K a m ) δ m m E K) m K K By Whittle s theorem We have Also E u A K a m ) Ca A K a) m 2 a A K a CK 2sp/d a A K µ) 2 E K) Combining this we find ) > δ a A K µ me u A K a m ) δ m m E K) m K K K K Pr C K msp/d E K K K) m 2 [37] a A K µ u ) A K µ > δ E K)

39 so that we require that compare with Li) 0 K msp/d E K K K) m 2 We have K msp/d E K K K) m 2 inf K K E K) m 2 K K K msp/d If we take the dimension of K as κ, this holds if compare with Li) κmsp/d ) Pr sup K K inf E K) m 2 A.60) K K so that optimal MSE cannot decrease too fast with. We now consider A.59). We have u A K aa A K u σ 2 a A K a ) K K Pr K K K K so that we require or E K) > δ u A K aa A K u σ 2 a A K a ) > δ E K) E u A K aa A K u σ 2 a A K a m ) δ m m E K) m Ctr AK aa A K ) 2) m 2 δ m m E K) m K 2msp/d E K K K) 0 m κ2msp/d ) which is implied by A.60). inf E K) m K K K K Ca A K a)m δ m m E K) m REFERECES Andrews, D. 99): Asymptotic Optimality of Generalized C L, Cross-validation, and Generalized Cross-validation in Regression with Heteroskedastic Errors, Journal of Econometrics, 47, Barnow, B.S., G.G. Cain and A.S. Goldberger 980), Issues in the Analysis of Selectivity Bias, in Evaluation Studies, vol. 5, ed. by E. Stromsdorfer and G. Farkas. San Francisco: Sage. Chen, X., Hong, H., and Tarozzi, A., 2004), Semiparametric Efficiency in GMM Models of onclassical Measurement Error, Missing Data and Treatment Effects. Working Paper. [38]

40 Hahn, J., 998), On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects, Econometrica 66 2), Heckman, J., and R. Robb, 984), Alternative Methods for Evaluating the Impact of Interventions, in Heckman and Singer eds.), Longitudinal Analysis of Labor Market Data, Cambridge, Cambridge University Press. Heckman, J., H. Ichimura, and P. Todd, 998), Matching as an Econometric Evaluation Estimator, Review of Economic Studies 65, Heckman, J., H. Ichimura, J. Smith, and P. Todd, 998), Characterizing Selection Bias Using Eperimental Data, Econometrica 66, Hirano, K., G. Imbens, and G. Ridder 2000), Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score. BER Working Paper. Holland, P. 986), Statistics and Causal Inference, Journal of the American Statistical Association, 8, Ichimura, H., and O. Linton, 200), Asymptotic Epansions for some Semiparametric Program Evaluation Estimators. Institute for Fiscal Studies, cemmap working paper cwp04/0. Li, K. 987), Asymptotic Optimality for C p, C L, Cross-validation and Generalized Cross-validation: Discrete Inde Set, Annals of Statistics, 53), ewey, W., 994), The Asymptotic Variance of Semiparametric Estimators, Econometrica 62, ewey, W. 995), Convergence Rates for Series Estimators, in Advances in Econometrics and Quantitative Economics: Essays in Honor of Professor C. R. Rao, Maddala, Phillips, and Srinivasan eds.), Cambridge, Basil Blackwell. ewey, W., and D. McFadden 994), Large Sample Estimation, in Handbook of Econometrics, Vol. 4, Engle and McFadden eds.), orth Holland. Robins, J., A. Rotnitzky, and L. Zhao 995), Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data, Journal of the American Statistical Association, ), Rosenbaum, P., and D. Rubin, 983a), The Central Role of the Propensity Score in Observational Studies for Causal Effects, Biometrika, 70, Rotnitzky, A., and J. Robins 995), Semiparametric Regression Estimation in the Presence of Dependent Censoring, Biometrika 82 4), Rubin, D. 974), Estimating Causal Effects of Treatments in Randomized and on-randomized Studies, Journal of Educational Psychology, 66, Rubin, D. B., 978), Bayesian inference for causal effects: The Role of Randomization, Annals of Statistics, 6: [39]

Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score 1

Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score 1 Keisuke Hirano University of Miami 2 Guido W. Imbens University of California at Berkeley 3 and BER Geert Ridder