Matching using Semiparametric Propensity Scores

Size: px

Start display at page:

Download "Matching using Semiparametric Propensity Scores"

Reynold Fitzgerald
5 years ago
Views:

1 Matching using Semiparametric Propensity Scores Gregory Kordas Department of Economics University of Pennsylvania Steven F. Lehrer SPS and Department of Economics Queen s University lehrers@post.queensu.ca October 2004 First Version: December 2002 Abstract This paper considers the application of semiparametric methods to estimate propensity scores or probabilities of program participation, which are in central in program evaluation methods. To evaluate the practical bene ts we use data from the NSW experiment, CPS and PSID. We compare treatment e ect and evaluation bias estimates using propensity scores estimated from parametric logit, semiparametric single index and semiparametric binary quantile regression models. We nd that the quantile regression model exhibits the smallest average absolute bias error. Our results suggest that it is important to account for very general forms of heterogeneity in (semiparametric) estimation of the propensity score. JEL codes: C14, C81, C99, H53, I38 *We are grateful to Mianna Plesca and seminar participants at the 2004 NASM of the Econometric Society meetings, 2003 CEA meetings, 2003 ihea World Congress, Concordia University, Florida State University, Lehigh University, McGill University, Queens University, Simon Fraser University, SUNY Albany, UNC- Greensboro, University of South Florida and the Wharton School, University of Pennsylvania for comments and suggestions which have helped to improve this paper. We are responsible for all errors. 1

2 1 Introduction An increasing body of evidence has found that there is signi cant diversity and heterogeneity in response to a given policy. Heckman (2001) argues that this has profound consequences for economic theory and for economic practice. In particular, accounting for heterogeneity may improve the performance of non-experimental estimators. In this paper, we introduce and evaluate the performance of a semiparametric propensity score matching estimation strategy that explicitly accounts for heterogeneity in response across observed covariates along the conditional willingness to participate in the treatment intervention distribution. Matching estimators evaluate the e ects of a treatment intervention by comparing outcomes such as wages, employment, fertility or mortality for treated persons to those of similar persons in a comparison group. The use of the propensity score as a basis for matching treated and untreated individuals (and thus for evaluating the magnitude of treatment e ects) is becoming increasingly common in clinical medicine, demographic and economic research. The propensity score is de ned as the conditional probability of being treated given the individual s covariates and requires the assumption of selection on observables. 1 Existing studies use parametric estimators of binary response models, such as the probit and logit which imposes strong distributional assumptions on the underlying data. In particular, the dangers of misspeci cation may be severe if the error terms are not independent and identically distributed from their known parametric distributions. 2 Kordas (2004) outlines the bene ts of using Manski s (1975, 1985) binary regression quantiles to provide consistent estimates of the conditional probability at di erent points of the distribution. This estimator avoids the distributional restrictions embedded in the parametric approach and has the advantage that it is robust and can accommodate heteroskedasticity of very general form. This property is extremely valuable in our setting as the estimator can accommodate problems of heterogeneity, self-selection and misclassi cation. 2

3 Todd (1999) presents the only other study that we are aware of that considers matching using semiparametrically estimated propensity scores. She considers matching using the estimated probabilities from both the semiparametric least squares estimator of Ichimura (1993) and the quasi maximum likelihood estimator of Klein and Spady (1993). 3 Her Monte Carlo study demonstrates that the gains from semiparametric procedures relative to parametric alternatives are greatest when either the systematic component of the model is misspeci ed or when the error distribution is highly asymmetric. Our approach o ers several additional bene ts for empirical researchers. First, this estimator does not require the researcher to select higher order or interaction terms to ensure balancing of covariates across the treatment and control groups. While recent work in economics (Dehejia and Wahba, 2002) has proposed the use of balancing tests to determine if additional higher order or interaction terms should be included in the estimates of the propensity scores but does not provide guidance on precisely which of these terms should be included. 4 Second, quantile treatment e ects are simple to calculate permitting an examination of the average treatment e ect on the treated at di erent points along the probability of participation distribution. 2 Framework Let D i = 1 indicate if person i received treatment and D i = 0 if not. Let Y 1i and Y 0i are the outcome of interest if person i received treatment or did not respectively. Evaluators are often interested in the average e ect of the treatment on the treated (AT T D=1 (X)), which can be de ned conditional on some characteristics X as AT T D=1 (X) = E(Y 1 Y 0 jx; D = 1) = E(Y 1 jx; D = 1) E(Y 0 jx; D = 1). In the absence of a randomized control group, matching methods allow for the construction of a comparison group for the treated under the assumption that conditional on observed covariates assignment to treatment is random. In many empirical 3

4 applications, the dimension of these observed characteristics is high and Rosenbaum and Rubin (1983) suggest using a scalar measure P (X); where P (X) = P r(d i = 1jX): (1) is the propensity score. 5 Implementation involves two steps. First, equation 1 is estimated to calculate the propensity score. Second, a matching algorithm is used to construct the matched comparison group for the treated. Algorithms di er in the weights they place on individuals in the comparison group. 2.1 Econometric Methods Let D denote the latent propensity to participate index. D measures the (indirect) latent utility di erential between receiving and not receiving the treatment. We consider three estimation approaches, a parametric logit, the semiparametric single-index model of Klein and Spady (1993), and Manski s (1975, 1985) binary regression quantiles. The logit and single-index estimators maximize the familiar log-likelihood of the binary response model, `() = n 1 nx D i log(p i ) + (1 D i ) log(1 P i ); (2) i=1 di ering from each other only in the speci cation of P i Pr(Y i = 1jX 0 i). The homoskedastic logit model is given by Pi L = L(Xi=) 0 1 (3) 1 + exp( Xi 0 =): In this model, the covariates exert a pure location e ect on the latent utility di erential. To make the model more exible researchers often include higher order and interaction terms which are selected based on balancing tests results (i.e. Hotelling T 2 tests for di erences in means) which determine whether a covariate adds information on the selection process conditional on the propensity score. 4

5 The semiparametric procedures avoid the distributional and other restrictions embedded in the parametric speci cation of Pr(D i = 1jX i ): First, we consider the semiparametric singleindex model of Klein and Spady (1993). To de ne this estimator, let K() be a kernel function (in this paper we use a Gaussian kernel function), h n be a bandwidth parameter that converges to zero as n becomes large, and de ne the leave-i-out estimate of Pr(Y = 1jX 0 i) by ~P i = X X 0 i D j K j6=i h n Xj 0, X X 0 i K To avoid anomalous behavior at the tails, let " be a small positive number and de ne j6=i h n X 0 j : P SI i = maxf"; minf1 "; ~ P i gg: (4) Klein and Spady (1993) have shown that, under some additional regularity conditions, if nh 6 n! 1 and nh 8 n! 0 as n! 1, the estimator converges at the parametric root-n rate to an asymptotically normal random variable, and achieves the semiparametric e ciency bound. Our second semiparametric estimation strategy is Manski s (1975, 1985) binary regression quantiles. We assume the following linear quantile speci cation of D i Q D i (qjx i ) = X 0 i(q); q 2 (0; 1): (5) where (q) is the coe cient vector of the q-th conditional quantile. Using the equivariance property of quantile functions with respect to monotonic transformations, we can write the conditional quantile function of D i = 1fDi 0g as (See Kordas (2004) equation 6) Q Di (qjx i ) Q 1fD i 0g(qjX i ) = 1fQ D i (qjx i ) 0g = 1fXi(q) 0 0g: (6) This estimator is the binary response analogue to the linear quantile regression estimator introduced by Koenker and Bassett (1978) and o ers a robust and e cient semiparametric alternative to commonly used parametric models. From an empirical point of view, their main 5

6 advantage is their ability to model very general forms of population heterogeneity by allowing the coe cient vector to vary across the conditional quantiles of the dependent variable. Estimates of the scaled coe cients (q) such that jj(q) = 1jj, are obtained by solving the quantile regression problem ( (q) = argmin a:jjajj=1 S N (a) = N 1 N X i=1 q (D i 1fX 0 ia 0g) ) ; (7) where q (u) = (q 1fu < 0g) u, and S N () is the score function. The score function is a multimodal step function so optimization is performed using the simulated annealing algorithm of Go e et al. (1994). The discontinuities of the score function also a ect the asymptotic behavior of the estimators that have been shown to converge at the slow N 1=3 rate to a nongaussian random variable (Kim and Pollard, 1990). To overcome these problems Horowitz (1992) smoothed the median score function and derived a smoothed median estimator that is asymptotically normally distributed. Kordas (2004) extended these results to show joint asymptotic normality of families of smoothed binary quantile estimates and showed how these smoothed estimates may be optimally combined for e cient estimation. Note the single index and binary regression quantile models are not nested. Since our focus is on estimating propensity scores only unsmoothed estimates will be computed here. The probabilities are computed by noting that the quantile regression model in (6) implies that if an individual s q-th conditional quantile Xi(q) 0 is (approximately) equal to zero, his conditional probability of receiving treatment is (approximately) equal to 1 q, i.e., Pr(D i = 1jX 0 (q) = 0) = 1 q: (8) Given estimates of (q) over a grid = fq 1 ; q 2 ; ; q M jq 1 < q 2 <; < q M g of quantiles, Kordas (2004) shows how this equation may be used to derive semiparametric interval probability estimates. Let ^q i = argmin q2 fq : Xi(q) 0 0g (9) 6

7 be the smallest quantile in the grid for which i s index function is positive. Then an interval estimate of the conditional probability P i;1jxi P r(d i = 1jX i ) is given by ^P i;1jxi () = [1 ^q i ; 1 ^q 1;i ); (10) where ^q 1;i denotes the quantile immediately preceding ^q i. In our application, = f0:05; 0:10; ; 0:95g, so, for example, if i s quantile indices are negative for quantiles below 0:70 and are positive for quantiles 0:70 and above, ^q i = 0:70; and ^P i;1jxi () = [0:30; 0:35). 2.2 Matching Algorithm Since the estimated choice probabilities from binary regression quantiles are discrete (interval probabilities) the average treatment e ect on the treated (AT T D=1 (X)) is calculated using strati cation matching. At each probability interval, we compute the di erence in average outcomes of treated and controls, providing an estimate of the quantile treatment e ect (AT T D=1 (X) q ), q = 1; 2; :::; Q AT T D=1 (X) q = P i2l q Y 1i N 1 q P j2l q Y 0j N 0 q where N 1 q and N 0 q number of treated and untreated individuals at quantile q respectively. The average treatment e ect on the treated is computed using a weighted (by the number of treated) average of these quantile treatment e ects as AT T D=1 (X) = QX AT T D=1 (X) q q=1 (11) P i2n 1 q D i Pi2N 1 D i (12) where Q is the total number of quantiles estimated and N 1 is the total number of treated individuals that are matched. Assuming independence of outcomes across units, the variance of AT T D=1 (X) is given by V ar(at T D=1 (X)) = 1 N 1 (V ar(y 1i ) + 7 QX q=1 Nq 1 N q 1 V ar(y N 1 Nq 0 0j ) ) (13)

8 Bootstrapped standard errors could also be calculated and are presented as we are matching on the estimated and not the actual propensity score. 6 Since the estimated participation probabilities from the logit and Klein and Spady (1993) models are continuous, we consider a larger variety of matching algorithms that are described in Smith and Todd (2004). 3 Returns to the NSW Job Training Program 3.1 Data Following LaLonde (1986), numerous studies have examined whether econometric (non-experimental) estimators recover impacts on post-intervention outcomes that are similar to those produced from a randomized experiment such as the National Supported Work Demonstration program (NSW). 7 Experimental treated units are combined with non-experimental comparison units drawn from two national survey datasets; CPS and PSID. Treatment e ect estimates obtained using econometric estimators are then compared with the benchmark results from the experiment. This exercise presents challenges for non-experimental estimator since there are substantial di erences in demographic and economic characteristics between individuals in the CPS, PSID and the experimental samples. Studies evaluating propensity score matching with this data initially found that these methods were able to replicate experimental treatment e ects (Dehejia and Wahba (1999)). More recent evidence calls these ndings in question (Smith and Todd (2004)) and indicate that accounting for permanent unobserved heterogeneity does lower the estimated bias with propensity score matching estimators. We follow Smith and Todd (2004) by considering three alternative experimental samples from the NSW data (LaLonde s full sample, the Dehejia and Wahba extract, an extract containing subjects assigned in the rst four months of the program) in addition to the survey data. 8 8

9 3.2 Propensity Score Estimates In practice, researchers focus on the quality of matches obtained from propensity score estimation and speci cations are selected with the best balance between matched individuals. In our application, we consider two alternative speci cations for the binary response model i) based on Dehejia and Wahba (1999,2002) that includes higher order and interaction terms to satisfy balancing tests and ii) omitting higher order and interaction terms since binary regression quantiles have the desirable property of being robust to heteroskedasticity of general forms. Note, a slightly di erent set of higher order and interaction terms are used for speci cations with the CPS and PSID samples. 9 This empirical approach used in the propensity score matching literature to create speci cation one di ers dramatically from traditional diagnostics used to assess binary response models which are considered with parameter estimates that are used in calculating propensity scores. Likelihood ratio tests between the logit and heteroskedastic logit strongly reject the null hypothesis of a homoskedastic residual. 10 Thus, the parametric homoskedastic binary response models are misspeci ed. A graphical examination of the normalized quantile, logit and single index model coe cient estimates across quantiles indicate disagreements between binary regression quantile estimates and the other approaches only appear at the higher quantiles as the parametric and single index models under predict the probability of participation. The general pattern of over and under prediction by the logit models could also be used to provide further evidence of the restrictiveness of the parametric model which tend to extrapolate the behavior of individuals near the mean to individuals that belong in the tails of the propensity to participate distribution. In general, for treatment e ects and evaluation bias estimates the results are similar from propensity scores estimated using the Klein and Spady (1993) and logit model so we will focus our discussion on binary regression quantiles. 11 9

10 3.3 Treatment E ects Table 1 presents estimates for both speci cations of the causal e ect of the NSW Work Demonstration on earnings in calender year 1978 based on strati cation matching with binary regression quantile propensity scores. The rows di er solely in the number of bins that are employed excluding the lowest probability bin in the top panel. While the experimental impact is captured within a 95% bootstrapped con dence interval, the estimates are extremely accurate for each experimental sample matched with the PSID (even numbered columns). Notice that the results with twenty bins are practically identical between speci cations 1 and 2. Further, the results do not appear to be sensitive to the number of bins used to stratify the sample match. The results improve with fewer bins for some subsamples but weaken for others such as column 2 and 6 whose estimated magnitude decreases by approximately 67% when only ve bins are used. Table 2 presents Hotelling T 2 tests for di erences in means (i.e. balance) within each quintile probability interval. Each entry lists the number of covariates which failed the test at the 5% level. 12 Notice that there are signi cant failures at the lowest probability interval capturing the dissimilarities between the experimental and CPS non-experimental samples. The bottom panel of Table 1 demonstrates a dramatic reduction in the estimated magnitude of the treatment e ect for the columns matched with CPS sample when the lowest probability interval bin is included. 13 In Figure 1, we graph quantile treatment e ects for the sample that corresponds to speci cation 2 and column 6 of Table 1. With balanced covariates the quantile treatment e ects have a clear interpretation. Notice that the training program had a negative impact for those subjects in the middle quantiles who tend to be either blacks or Hispanics that had low earnings in 1974 but high earnings in 1975 if the control group was drawn from the PSID. Similarly, the largest gains from NSW were achieved at the extreme quantiles containing individuals with low 10

11 earnings in Evaluation Bias Estimates An alternative approach to evaluate non-experimental estimates is to match the randomized out control group with the non-experimental samples. As neither group has received the intervention the di erence in earnings between matched individuals from each experimental control group and non-experimental sample should be zero. Any deviation is evaluation bias. Table 3 presents direct estimates of the bias using strati cation matching with binary regression quantile propensity scores. Notice that with the exception of column 4 of speci cation one, the bias is of the order of a few hundred dollars and is less than 15% of the experimental treatment impact in columns 2, 3, 5 and 6 respectively. Unlike the treatment e ect estimates, including the lowest probability quintile has little e ect on the bias since fewer individuals are assigned to this probability interval. The evaluation bias increases by approximately $200 in column 1-3 when the higher order and interaction terms are omitted from the estimating equation. 6 continues to exhibit low bias whereas column 5 s bias is also reduced in absolute value. Surprisingly we nd a high degree of bias in the Dehejia and Wahba samples, columns which exhibit low bias with parametric propensity scores. To uncover an explanation as to why the evaluation bias calculated using semiparametric propensity scores exceeded the estimate obtained using parametric propensity scores in column 4 of Table 4 we conducted a more detailed examination of how the estimated bias di ers across quantiles. Figure 2 presents a graph of the quantile bias e ects at each interval for both parametric and semiparametric propensity scores for speci cation 1. Notice that in almost all quantiles the semiparametric procedure exhibits lower bias. The results in Table 3 (and ) present a number of treated individuals weighted average of these quintile biases and suggest 11

12 that the lower bias for the Dehejia and Wahba subsample is based in part on having the larger biases across quantiles cancel out. To provide additional guidance for empirical researchers on the performance of propensity score matching algorithms we compare the average absolute bias error of our matching algorithm with a variety of di erent matching algorithms based on parametric propensity scores. 14 For each matched outcome we calculate the absolute bias error Bias error = jy 1i ^E(Y0i jp (X i ); D i = 0)j where ^E(Y 0i jp (X i ); D i = 0) is calculated by the algorithm under investigation. The average absolute bias error is calculate by dividing the sum of these bias errors by the number of individuals in the treatment group who were successfully matched. Table 4 reports summary statistics on the average absolute bias error. 15 Notice that with one exception, the smallest average absolute bias error is attained using strati cation matching with propensity scores calculated by binary regression quantiles. In general, bias error estimates obtained by strati cation matching procedures are smaller than the nonparametric and distance metric algorithms. In general when using parametric propensity scores algorithms that use a larger distance produce smooth results; whereas narrow intervals produce larger bias errors on average. In part, this occurs since fewer individuals have matches as the distance shrinks. The results from speci cation 1 nd that Kernel and local linear matching estimator exhibit signi cantly less bias error than nearest neighbor or caliper matching algorithms. Overall, it appears that using 20 bins produces estimates with the smallest mean squared error. The results suggest that adding the lowest probability quantile to the strati cation matching algorithm increases bias up to an average of $500 and $670 per treated participant for speci cation 1 and 2 respectively. The increased average size of the bias error from parametric procedures ranges from slightly more than $55.00 to approximately $5200 for speci cation 1. As a percentage of the estimated 12

13 treatment impact this range is equivalent 6.2% to 586.9%. For speci cation 2, strati cation matching using parametric propensity scores does exhibit smaller bias error for column Of the remaining columns, the size of the average bias error ranges from $91 to $6250 or 10.3% to 706% of the experimental treatment impact per matched treated individual. While the semiparametric procedure yielded the smallest average absolute bias error in 11 of the 12 columns in Table 4, the number is still large relative to the experimental impact. This casts doubt as to whether all observables were included in the estimation of the propensity score and is a potential cause for concern for empirical researchers interested in using these methods. After all, the matched individuals using one nearest neighbor matching intuitively should be most alike, yet the average absolute bias error is extremely large. Strati cation matching with parametric and binary regression quantile propensity scores yield similar average absolute bias error but wildly di erent treatment e ects. In general if one excludes the lowest probability bin ( %) the procedures rarely placed individuals within the same interval. This is demonstrated by examining the scarcity of individuals lying on the prime diagonal of Table 5 and the large number of individuals residing in the o diagonal elements. This table presents information on the horizontal rows of which bin the semiparametric procedure assigns and the columns provide the bins that the parametric procedure assigns. Notice that ignoring the lowest probability quantile, approximately 30% of all the observations fall in the same probability bin for the two methods. For all 12 subsamples the similarities range between 22%-43%. The improved performance of strati cation matching relative to other algorithms with parametric propensity scores is due in part to the balancing tests which occur over large intervals. Replicating these tests in smaller bins (1%) indicate substantial failures with each sample. In contrast, using smaller bins constructed by binary regression quantiles we found little evidence of increased failures in balancing tests. 13

14 Finally, we compared binary regression quantiles (Table 4) to the Klein and Spady (1993) estimator. From a heterogeneity perspective the major distinction between these estimators is that the former allows for very general forms while the latter is more limited only allowing heterogeneity through the index. The results using speci cation two are provided in Appendix Table 4. Notice again that the strati cation matching estimates with binary regression quantiles are substantially lower in absolute bias error for nearly all columns. 4 Conclusions In situations with non-experimental data matching methods provide a means to estimate program impacts when the variables determining assignment to treatment are observed and the support of treatment and comparison groups overlap. Since the use of the propensity score as a basis for estimating treatment e ects is becoming increasingly common in research in a variety of disciplines researchers should test for possible model misspeci cation and if present, consider the methods described above to improve inference and reduce concerns regarding the speci c higher order and interaction terms to include when estimating the treatment program participation equation. 14

15 Notes 1 The assumption of selection on observables requires that conditioning on the observed variables the assignment to treatment is random. Propensity score matching (Rosenbaum and Rubin (1983)) reduce the dimensionality of having to match participants and non participants on the set of conditioning variables (X) by matching solely on the basis estimated propensity scores (P (X)). 2 For example, Horowitz (1993) demonstrates that misspeci cation is likely to be severe under heteroskedasticity and bimodality. 3 These methods estimate a conditional mean and overcome the distributional restrictions embedded in the parametric approach but allows for only limited forms of heterogeneity. Note, we also consider the Klein and Spady (1993) estimator. 4 See Dehejia (2004) and Smith and Todd (2004) for a demonstration of the di culties in choosing the appropriate speci cation of higher order and interaction terms. 5 This estimator assumes i) E(Y 0 jp (X); D = 1) = E(Y 0 jp (X); D = 1) ( 0 < Pr(D = 1jX) < 1) and ii). 6 Notice that if a quantile contains numerous treated units and few controls it will increase the variance AT T D=1 (X). Quantiles with few treated and many controls work in an opposite manner but receive little weight in the calculation of AT T D=1 (X). 7 Our results uses the same data as in LaLonde (1986), Heckman and Hotz (1989), Dehejia and Wahba (1999, 2002, 2004), Abadie and Imbens (2002) and Smith and Todd (2002, 2004) among others, making it easier to draw comparisions. 8 See Appendix Table 1 or Smith and Todd (2002) for further information on these datasets. 9 The selection of variables to include in the estimation of the propensity score is very important since even small changes in the estimated probabilities can dramatically a ect the magnitude of treatment e ects in the matching stage and cause a substantial di erence in the 15

16 amount of bias present in the matching estimator. See Heckman, Ichimura, Smith and Todd (1998), Smith and Todd (2002) or Dehejia and Wahba (2004) for a discussion. Finally it is worth noting that while Dehejia and Wahba (1999, 2002) did not nd evidence that the treatment e ect estimated was sensitive to the inclusion of these terms, they stress the importance of variable selection to ensure that the balancing hypothesis is satis ed. 10 This assumption is rejected below the 5% level for all columns and both speci cations with the exception of column 5 in speci cation 1 which is rejected at the 15% level. 11 For several samples, we experimented with di erent bandwiths with Klein and Spady (1993) and the results were robust. The full set of results are available from the authors by request. 12 The results do not change signi cantly if we report the 10% or 20% level. 13 Note these individuals are generally not included in the parametric matching procedures due to trimming conditions as in Dehejia and Wahba (1999). See Appendix Table 2 for estimates corresponding to logit propensity scores 14 We also considered Abadie and Imbens (2004) matching procedure with homoskedastic and heteroskedastic weighting matrices and either one or four individuals matched. The results indicated larger absolute bias error than local linear matching with a bandwidth of For the parametric propensity score we match on the estimated propensity score with the exception of kernel and local linear matching estimators where we match on the odds ratio due to the choice based nature of the sample. Heckman and Todd (1995) demonstrate that matching methods can be applied with the odds ratio to gain consistent estimates when the sample is choice based. Note failure to account for choice based samples should not a ect nearest neighbor or strati cation point estimates. Finally, following Smith and Todd (2004) we investigated the sensitivity of our results to alternative seeds for several of these algorithms. There were no major shifts in the magnitude of the absolute bias error. 16 This column requires further study as it exhibited a signi cant balancing test failures. 16

17 References [1] Abadie, A. and G. Imbens, Large Sample Properties of Matching Estimators for Average Treatment E ects, mimeo, University of California at Berkeley (2004). [2] Dehejia, R., Practical Propensity Score Matching: A Reply to Smith and Todd, forthcoming in Journal of Econometrics, (2004). [3] Dehejia, R., and S. Wahba, Causal E ects in Non-Experimental Studies: Re-Evaluating the Evaluation of Training Programs, Journal of the American Statistical Association 94:448 (1999), [4] Dehejia, R., and S. Wahba, Propensity Score Matching Methods for Non-Experimental Causal Studies, Review of Economics and Statistics 84:1 (2002), [5] Go e, W., G. D. Ferrier and J. Rogers, Global Optimization of Statistical Functions with Simulated Annealing, Journal of Econometrics 60:1/2 (1994), [6] Heckman, J. J., Heterogeneity, Microdata and the Evaluation of Public Policy: Nobel Lecture, Journal of Political Economy 109:4 (2001), [7] Heckman, J. J., H. Ichimura, J. Smith and P. Todd, Characterizing Selection Bias using Experimental Data, Econometrica 66:5 (1998), [8] Heckman, J. J., H. Ichimura and P. Todd, Matching as an Econometric Evaluation Estimator, Review of Economic Studies 65:2 (1998), [9] Heckman, J. J., H. Ichimura and P. Todd, Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Program, Review of Economic Studies 64:4 (1997), [10] Heckman, J. J. and P. Todd, Adapting Propensity Score Matching and Selection Models to Choice-based Samples, mimeo, University of Chicago (1995). [11] Heckman, J. J., and J. V. Hotz, Choosing Among Alternative Non-Experimental Methods for Estimating the Impact of Social Programs: The Case of Manpower Training, Journal of the American Statistical Association 84:408 (1989), [12] Horowitz, J., A Smoothed Maximum Score Estimator for the Binary Response Model, Econometrica, 60:3 (1992),

18 [13] Horowitz, J., Semiparametric and Nonparametric Estimation of Quantal Response Models, in Handbook of Statistics, Vol 11 (1993), Maddala, GS and Vinod, HD (eds.), North- Holland. [14] Ichimura, H., Semiparametric Least Squares (SLS) and Weighted SLS Estimation of Single-Index Models, Journal of Econometrics, 58:1-2 (1993), [15] Kim, J. and D. Pollard, Cube Root Asymptotics, Annals of Statistics 18 (1990), [16] Klein, R. and R. Spady, An E cient Semiparametric Estimator for Discrete Choice Models, Econometrica 61:2 (1993), [17] Koenker, R. and G. Bassett, Jr., Regression Quantiles, Econometrica 46:1 (1978), [18] Kordas, G., Smoothed Binary Regression Quantiles, forthcoming in Journal of Applied Econometrics (2004). [19] LaLonde, R., Evaluating the Econometric Evaluations of Training Programs, American Economic Review 76:4 (1986), [20] Manski, C. F., Semiparametric Analysis of Discrete Response: Asymptotic Properties of the Maximum Score Estimator. Journal of Econometrics 27: (1975), [21] Manski, C. F., Maximum Score Estimation of the Stochastic Utility Model of Choice, Journal of Econometrics 3: (1975), [22] Rosenbaum, P. and D. B. Rubin, The Central Role of the Propensity Score in Observational Studies for Causal E ects, Biometrika 70:1 (1983), [23] Smith, J., and P. Todd, Rejoinder to Does Matching Overcome LaLonde s Critique of Non-Experimental Estimators forthcoming in Journal of Econometrics (2004). [24] Smith, J., and P. Todd, Does Matching Overcome LaLonde s Critique of Non- Experimental Estimators, forthcoming in Journal of Econometrics (2002). [25] Todd, P., Local Linear Approaches to Program Evaluation using a Semiparametric Propensity Score, mimeo, University of Pennsylvania (2002). 18

19 Treatment Effect in $ Probability 6 DW Specification Figure 1: Quantile Treatment E ects 19

20 Logit Bias BRQ Bias Probability 4 Figure 2: Estimated Bias Parametric and Semiparametric Estimates 3 of Table 4 20

21 Table 1: Treatment Effects Estimates with Binary Regression Quantile Propensity Scores using Stratification Matching SPECIFICATION ONE SPECIFICATION TWO Experiment Impact 20 Bins (618.77) (900.59) (678.11) (675.37) (1047.0) (1624.8) (587.43) (874.25) (831.12) (705.95) (1196.2) (1619.5) 10 Bins (508.81) (886.37) (935.18) (684.84) (1242.6) (1329.7) (584.42) (1234.5) (866.29) (674.73) (1202.6) (1326.6) 5 Bins (667.10) (945.53) (961.73) (669.76) (1903.0) (1933.2) (593.42) (1102.3) (873.32) (668.59) (1226.4) (1143.1) Including Lowest Probability Bin 20 Bins (574.16) (841.31) (833.20) (681.35) (1127.3) (1635.9) (535.94) (874.25) (736.18) (710.99) (1057.5) (1509.2) 10 Bins (649.83) (837.60) (804.42) (669.84) (1092.5) (1337.1) (550.35) (1284.1) (779.58) (702.81) (1068.0) (1291.6) 5 Bins (622.46) (854.67) (891.22) (735.58) (1075.1) (1903.0) (510.77) (1036.5) (820.30) (700.04) (1042.2) (1153.3) Note: Bootstrapped standard errors in parentheses Bootstrap replications. 21

22 Table 2: Balancing Test Results SPECIFICATION ONE SPECIFICATION TWO Quantile Note: Number of unbalanced covariates at the 5% level reported. 22

23 Table 3: Evaluation Bias Estimates with Binary Regression Quantile Propensity Scores using Stratification Matching SPECIFICATION ONE SPECIFICATION TWO Experiment Impact 20 Bins (480.81) (810.83) (741.64) (746.29) (461.54) (641.74) (640.18) (704.56) (680.63) (972.87) (1196.2) (1619.5) 10 Bins (594.49) (853.37) (771.63) (812.25) (447.11) (23.12) (640.09) (755.05) (716.03) (995.20) (1202.6) (1326.6) 5 Bins (556.55) (861.99) (781.37) (649.94) (413.25) (1124.9) (652.39) (590.43) (805.45) (777.46) (1226.4) (1143.1) Including Lowest Probability Bin 20 Bins (482.99) (872.13) (701.25) (767.14) (817.92) (1000.8) (458.99) (666.71) (589.98) (716.22) (641.02) ( ) 10 Bins (560.22) (833.26) (624.82) (831.63) (713.60) (911.09) (430.82) (1153.4) (635.01) (719.66) (690.56) (981.67) 5 Bins (556.55) (845.89) (678.53) (698.07) (669.55) (976.36) (423.29) (1101.5) (621.08) (636.01) (725.08) (866.48) Note: Bootstrapped standard errors in parentheses Bootstrap replications. 23

24 Table 4: Average Absolute Bias Error Matching Algoritm SPECIFICATION ONE SPECIFICATION TWO Nearest Neighbor 1 W/O Common Support (5746.5) (8011.1) (5375.8) Nearest Neighbor W/O Common Support (5840.8) (8011.1) (5426.0) Nearest Neighbor 1 W Common Support (5779.5) (8011.1) (5450.1) Nearest Neighbor W. Common Support (5820.5) (8011.1) (5587.4) Kernel (Bandwidth ) (3738.3) (4086.8) (3878.1) Kernel (Bandwidth ) (3839.0) (4335.5) (3962.5) Local Linear (Bandwidth 0.04) (3345.0) (3749.6) (3524.6) Local Linear (Bandwidth 0.01) (3379.1) (4105.4) (3498.1) Caliper (0.01) (5758.3) (8138.6) (5365.1) Caliper (0.001) (5981.8) (6132.1) (5057.3) Caliper (0.0001) (5766.4) (6749.7) (5375.1) Stratification 20 Bins Logit (1879.6) (1884.1) (1825.0) Stratification 20 Bins BRQ (2192.5) (2312.4) (1833.8) Stratification 10 Bins BRQ (2364.5) (2833.4) (2412.1) Stratification 20 Bins BRQ No Zeros (1116.5) (2100.5) (2306.5) Stratification 20 Bins Logit No Zeros (1218.8) (2352.0) (1164.9) Note: Standard deviation in parentheses (5190.2) (5190.2) (5190.2) (5190.2) (4204.6) (4339.8) (3636.9) (3895.5) (5098.1) (4838.2) (5403.5) (2961.3) (1997.1) (2844.3) (1549.5) (2430.7) (6444.2) (6488.0) (6502.3) (6564.0) (4378.1) (4671.2) (3857.7) (3883.0) (6492.5) (6621.5) (6546.1) (2382.5) (2032.3) (2137.9) (1468.4) (2052.7) (6005.6) (6005.6) (6005.6) (6005.6) (4588.3) (4720.5) (4038.7) (4356.2) (5879.6) (6788.6) (8353.7) (3140.3) (2063.6) (2757.8) (1917.5) (2219.0) (5719.1) (5740.2) (5785.3) (5769.2) (4405.8) (4572.7) (3459.7) (3464.6) (5844.4) (5791.0) (6073.7) (1879.6) (2633.4) (2626.1) (1878.5) (1218.8) (6199.8) (6199.8) (6199.8) (6199.8) (5955.1) (8027.5) (3891.8) (4100.5) (6428.5) (7189.0) (10196.) (2890.4) (1968.3) (2577.7) (1557.5) (2352.0) (5848.0) (5963.3) (5834.5) (5978.8) (4509.0) (4720.5) (3546.9) (3613.5) (5760.1) (6387.0) (5185.5) (1825.0) (1611.9) (2043.8) (1210.4) (1164.9) (6518.4) (6518.4) (6518.4) (6518.4) (5487.0) (5287.2) (3908.4) (3917.0) (6612.4) (8489.1) (14083.) (2961.3) (2174.3) (2954.8) (1281.4) (2430.7) (6322.6) (6309.0) (6476.5) (6564.0) (4940.4) (5198.4) (3790.4) (3883.0) (6499.1) (6218.1) (6395.2) (2382.5) (2099.7) (2954.8) (1437.1) (2052.7) (5847.6) (5847.6) (5847.6) (5847.6) (5164.6) (5337.7) (4223.0) (4247.1) (6022.4) (5162.5) (5506.3) (3140.3) (3444.3) (3624.5) (1578.6) (2219.0) 24

25 Table 5: Number of Individuals Assigned to a Bin by Parametric and Semiparametric Estimates Using PSID and Early Random Assignment Experimental Sample via Specification 2 Logit Bins -> BRQ Bins Total [0-0.05%) [ %) [ %) [ %) [ %) [ %) [ %) [ %) [ %) [ %) [ %) [ %) [ %) [ %) [ %) [ %) [ %) [ %) [ %) [ %) Total

26 Appendix Table 1: Summary Statistics Sample LaLonde Treated LaLonde Controls DW Treated DW Controls Early Assignment Treated Early Assignment Control Sample Size Age (6.686) (6.590) (7.155) (7.058) (6.251) (7.108) (11.045) (10.441) Years of Education ) (1.619) (2.011) (1.614) (1.643) (1.572) (2.871) (3.082) Hispanic Black Married Dropout Zero Earnings in Zero Earnings in Real Earnings in ( ) ( ) Real Earnings in 1975 ( ) ( ) Real Earnings in 1978 ( ) ( ) Note: Standard Deviation in Parentheses ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) CPS ( ) ( ) ( ) PSID ( ) ( ) ( ) 26

27 Appendix Table 2: Treatment Effect Estimates with Parametric Propensity Scores using Stratification Matching SPECIFICATION ONE SPECIFICATION TWO Experiment Impact 20 Bins Excluding (603.86) (788.77) (860.30) (1157.9) (1108.3) (1265.9) (556.89) (706.37) (794.66) (911.70) (1063.0) (1303.5) Bins no exclusion (596.06) (759.50) (816.72) (1094.7) (959.90) (1351.0) (544.38) (707.88) (776.22) (947.75) (1060.0) (1237.7) 20 Bins DW exclusion (582.53) (800.09) (800.63) (1060.4) (1032.5) (1125.5) (489.33) (688.45) (731.83) (920.17) (938.87) Note: Bootstrapped standard errors in parentheses Bootstrap replications. DW exclusion drops all individuals in the treatment group with estimated propensity scores above the maximum propensity score in the control group and drops all control individuals whose estimated propensity score is less than the minimum propensity score of the treatment group (1245.2) 27

28 Appendix Table 3: Evaluation Bias Estimates with Parametric Propensity Scores using Stratification Matching SPECIFICATION ONE SPECIFICATION TWO Experiment Impact 20 Bins (530.93) (696.50) (673.98) (779.96) (769.63) (911.30) (427.43) (594.36) (547.29) (760.47) (634.42) (1108.8) 10 Bins (539.09) (783.31) (675.83) (926.07) (849.69) (952.01) (437.63) (634.63) (566.49) (1056.9) (641.74) (876.66) Including Lowest Probability Bin 20 Bins (517.42) (695.99) (644.94) (770.67) (737.83) (947.28) (399.59) (600.09) (528.85) (805.11) (630.75) (1097.8) 10 Bins (491.63) (741.56) (643.09) (911.10) (762.06) (863.60) (398.61) (650.42) (511.59) (1057.2) (627.09) (934.91) Note: Bootstrapped standard errors in parentheses Bootstrap replications. 28

29 Table 4: Average Absolute Bias Error using Klein Spady (1993) Propensity Scores. Matching Algoritm SPECIFICATION ONE SPECIFICATION TWO Nearest Neighbor 1 W/O Common Support (6193.4) (5394.6) (5478.5) Nearest Neighbor W/O Common Support (3667.6) (4005.2) (3769.7) Nearest Neighbor 1 W Common Support (6193.4) (5394.6) (5478.5) Nearest Neighbor W. Common Support (3667.6) (4005.2) (3769.7) Kernel (Bandwidth ) (3757.7) (3821.2) (3784.4) Kernel (Bandwidth ) (3807.3) (3940.8) (4162.6) Local Linear (Bandwidth 0.04) (3361.2) (3443.1) (3484.8) Local Linear (Bandwidth 0.01) (3401.0) (3615.3) (3468.0) Caliper (0.01) (6261.6) (6748.7) (5212.1) Caliper (0.001) (6396.3) (6420.4) (5941.0) Caliper (0.0001) (7520.5) (6439.2) (5921.7) Stratification 20 Bins KS (2569.6) (2109.9) (2310.8) Stratification 20 Bins BRQ (2192.5) (2312.4) (1833.8) Stratification 10 Bins BRQ (2364.5) (2833.4) (2412.1) Stratification 20 Bins BRQ No Zeros (1116.5) (2100.5) (2306.5) Stratification 20 Bins KS No Zeros (1541.6) (1623.0) (1575.8) Note: Standard deviation in parentheses (5732.3) (4070.6) (5732.3) (4070.6) (4070.6) (4606.9) (4030.1) (4037.1) (5532.0) (5244.3) (6777.3) (2714.3) (1997.1) (2844.3) (1549.5) (2155.5) (5745.2) (4037.6) (5745.2) (4037.6) (4275.5) (4265.6) (3730.9) (3804.2) (5884.8) (5551.1) (5892.1) (2592.6) (2032.3) (2137.9) (1468.4) (1665.9) (4360.3) (5530.1) (4360.3) (5530.1) (5163.0) (4817.3) (3996.1) (4125.3) (5978.1) (5952.7) (7393.1) (5567.7) (2063.6) (2757.8) (1917.5) (4660.3) (8108.0) (6630.9) (8108.0) (3573.7) (3749.6) (3568.3) (3390.6) (3414.4) (6439.0) (6573.9) (7070.8) (2053.8) (2633.4) (2626.1) (1878.5) (1341.8) (8108.0) (6630.9) (6519.6) (4489.5) (4240.0) (4579.4) (4164.5) (6567.4) (8119.1) (7070.8) (12640.) (2728.4) (1968.3) (2577.7) (1557.5) (2136.3) (5848.0) (5963.3) (5562.5) (3550.9) (3773.9) (3787.2) (3385.1) (3298.2) (5726.5) (6630.4) (5913.0) (2244.7) (1611.9) (2043.8) (1210.4) (1543.9) (5405.9) (4313.1) (5852.4) (4581.9) (4899.4) (4599.3) (4278.7) (4215.7) (5674.4) (6698.5) (8341.5) (2811.4) (2174.3) (2954.8) (1281.4) (1685.3) (5585.2) (4190.9) (5604.8) (4192.8) (4400.9) (4338.8) (3781.0) (3845.8) (5729.9) (5101.5) (5487.2) (3408.2) (2099.7) (2954.8) (1437.1) (3214.4) (6501.4) (4200.4) (6501.4) (4767.8) (4835.2) (4767.8) (4017.3) (4074.9) (6495.2) (7685.5) (11077.) (3674.8) (3444.3) (3624.5) (1578.6) (2420.8) 29

Selection on Observables: Propensity Score Matching.

Selection on Observables: Propensity Score Matching. Department of Economics and Management Irene Brunetti ireneb@ec.unipi.it 24/10/2017 I. Brunetti Labour Economics in an European Perspective 24/10/2017