
1 COMPARING TREATMENTS ACROSS LABOR MARKETS: AN ASSESSMENT OF NONEXPERIMENTAL MULTIPLE-TREATMENT STRATEGIES Carlos A. Flores and Oscar A. Mitnik* Abstract We study the effectiveness of nonexperimental strategies in adjusting for comparison group differences when using data from several programs, each implemented at a different location, to compare their effect if implemented at alternative locations. First, we adjust for individual characteristics differences simultaneously across all groups using unconfoundedness-based and conditional difference-in-difference methods for multiple treatments. Second, we adjust for differences in local economic conditions and stress their role after program participation. Our results show that it is critical to have sufficient overlap across locations in both dimensions and illustrate the difficulty of adjusting for local economic conditions that differ greatly across locations. I. Introduction WE consider the problem of using data from several programs, each implemented at a different location, to compare what their effect would be if they were implemented at a specific location. One example would be a local government considering the implementation of one of several possible job training programs. This is a problem many policymakers are likely to face given the heterogeneity of job training programs worldwide and the scope that many local governments have in selecting their programs. In this case, the policymaker may want to perform an experiment in her locality and randomly assign individuals to all the possible programs to assess the effectiveness of the programs and choose the best one for her locality. This would require implementing all the candidate programs in her locality. Thus, the most likely scenario is that the policymaker would have to rely on nonexperimental methods and use data on implementations of the programs at various locations and times. In this paper, we assess the likely effectiveness of nonexperimental strategies for multiple treatments to inform a policymaker about the effects of programs implemented in different locations if they were implemented at her location. Received for publication September 21, Revision accepted for publication June 20, * Flores: California Polytechnic State University at San Luis Obispo; Mitnik: Federal Deposit Insurance Corporation and IZA. Detailed comments from two anonymous referees and an editor greatly improved the paper and are gratefully acknowledged. We are also grateful for comments by Chris Bollinger, Matias Cattaneo, Alfonso Flores- Lagunes, Laura Giuliano, Guido Imbens, Kyoo il Kim, Carlos Lamarche, Phil Robins, Jeff Smith, and seminar participants at the 2009 AEA meetings, 2009 SOLE meetings, Second Joint IZA/IFAU Conference on Labor Market Policy Evaluation, 2009 Latin American Econometric Society Meeting, Third Meeting of the Impact Evaluation Network, Second UCLA Economics Alumni Conference, Abt Associates, Federal Deposit Insurance Corporation, Institute for Defense Analyses, Interamerican Development Bank, Queen Mary University of London, Tulane University, U.S. Census Bureau, University of Freiburg, University of Kentucky, University of Miami, University of Michigan, and VU University Amsterdam. Bryan Mueller and Yongmin Zang provided excellent research assistance. This is a heavily revised version of IZA discussion paper 4451, Evaluating Nonexperimental Estimators for Multiple Treatments: Evidence from Experimental Data. The usual disclaimer applies. 
A supplemental appendix is available online at journals.org/doi/suppl/ /rest_a_ This paper is closely related to the literature analyzing whether observational comparison groups from a different location from the treated observations can be used to estimate average treatment effects by employing nonexperimental identification strategies (Friedlander & Robins, 1995; Heckman, Ichimura, & Todd, 1997; Heckman et al., 1998; Michalopoulos, Bloom, & Hill, 2004). Our work is also closely related to Hotz, Imbens, and Mortimer (2005), who analyze the problem of predicting the average effect of a new training program using data on previous program implementations at different locations. Our paper differs from the rest of the literature, which focuses on pairwise comparisons between groups in different locations. Instead, we focus on comparing all groups simultaneously, which requires the use of nonexperimental methods for multiple treatments. This is an important distinction in practice. For instance, suppose that the policymaker above was given the average effects of each of k > 2 programs estimated from the binary-treatment versions of the nonexperimental strategies we study in this paper. If so, each estimate would give the average effect for a particular subpopulation: individuals in the common support region of the corresponding pairwise comparison (Heckman et al., 1997; Dehejia & Wahba, 1999, 2002). Thus, the policymaker would be unable to compare the average effects across the different programs simultaneously because the effects are likely to apply to different subpopulations. We employ data from the National Evaluation of Welfareto-Work Strategies (NEWWS) study, which combines highquality survey and administrative data. The NEWWS evaluated a set of social experiments conducted in the United States in the 1990s in which individuals in several locations were randomly assigned to a control group or to different training programs. These data allow us to restrict the analysis to a relatively well-defined target population: welfare recipients in the different locations. This is consistent with our example above, where we would expect the policymaker to be interested in comparing programs that applied to relatively similar target populations. Even in cases like this, differences across locations in the average outcomes of individuals who participated in the site-specific programs can arise because of (at least) three important factors (Hotz et al., 2005). First, the effectiveness of the programs can differ (this is the part that the policymaker in the example above would like to know). Second, the distribution of individual characteristics across sites may be different (for example, individuals may be more educated, on average, in some locations). Third, the labor market conditions may differ across locations (for example, the unemployment rate may be higher at some sites). In this paper, we focus on the control groups in the different locations and use the insight that since the control individuals were excluded from all training services, the control groups The Review of Economics and Statistics, December 2013, 95(5): by the President and Fellows of Harvard College and the Massachusetts Institute of Technology

should be comparable after adjusting for differences in individual characteristics and labor market conditions. If by using nonexperimental strategies to adjust for those differences we are able to simultaneously equalize average outcomes among the control groups in the different locations, we would expect the average effects resulting from the implementation of those strategies on the treated observations to reflect actual program heterogeneity across the sites (Heckman et al., 1997; Heckman et al., 1998; Hotz et al., 2005).¹

The first source of heterogeneity across sites that we consider is the distribution of individual characteristics. We consider unconfoundedness-based and conditional difference-in-differences strategies for multiple treatments to adjust simultaneously across all sites for this source of heterogeneity. This is related to the literature on program evaluation using nonexperimental methods, most of which focuses on the binary treatment case (see Heckman, Lalonde, & Smith, 1999; Imbens & Wooldridge, 2009). More recently, however, there has been growing interest in methods to evaluate multiple treatments (Lechner, 2001, 2002a, 2002b; Frölich, 2004; Plesca & Smith, 2007) and treatments that are multivalued or continuous (Imbens, 2000; Imai & van Dyk, 2004; Hirano & Imbens, 2004; Abadie, 2005; Cattaneo, 2010); we employ some of these methods in this paper. In addition, we study the use of the generalized propensity score (GPS), defined as the probability of receiving a particular treatment conditional on covariates, in assessing the comparability of individuals among the different comparison groups, and we propose a strategy to determine the common support region that is less stringent than those previously used in the multiple treatment literature (Lechner, 2002a, 2002b; Frölich, Heshmati, & Lechner, 2004). We also consider several GPS specifications and GPS-based estimators, and we analyze the performance of the nonexperimental strategies considered when controlling for alternative sets of covariates.

The second source of heterogeneity across the comparison groups that we consider is local economic conditions (LEC). The literature on program evaluation methods stresses the importance of comparing treatment and nonexperimental comparison groups from the same local labor market (Friedlander & Robins, 1995; Heckman et al., 1997, 1998). In our setting, the analyst likely has information about specific programs only when they were implemented in different local labor markets. Thus, we consider the challenging problem of explicitly adjusting for differences in LEC across our comparison groups. The literature on how to adjust for these differences is scarce compared to that on how to adjust for individual characteristics, as we discuss later. We stress the importance of adjusting for differences in LEC after program participation, as they can be a significant source of heterogeneity in average outcomes across sites.

¹ Another potential source of heterogeneity across control groups is substitution by their members into alternative training programs (Heckman & Smith, 2000; Heckman et al., 2000; Greenberg & Robins, 2011). Our analysis leads us to believe that it is unlikely that this potential source of heterogeneity drives our results, as we discuss later.
We also highlight the importance of analyzing the overlap in LEC across locations, as well as the advantages of separating such analysis from that of the overlap in individual characteristics.

Our results suggest that the strategies studied are valuable econometric tools when evaluating the likely effectiveness of programs implemented at different locations if they were implemented at another location. The keys are adjusting for a rich set of individual characteristics (including labor market histories) and having sufficient overlap across locations in both the individual and local labor market characteristics. More generally, our results also illustrate that in the multiple treatment setting, there may be treatment groups for which it is extremely difficult to find valid comparison groups and that the strategies we consider perform poorly in such cases. Importantly, the overlap analyses (both in individual characteristics and LEC) are critical for identifying those noncomparable groups. In addition, our results show that when the differences in postrandomization LEC are large across sites, the methods we consider to adjust for them substantially reduce this source of heterogeneity, although they are not able to eliminate it. These results highlight the importance of adjusting for posttreatment LEC.

The paper is organized as follows. Section II presents our general setup and the econometric methods we employ. Section III describes the data. Section IV presents the results, and section V provides our conclusions. An online appendix provides further details on the econometric methods and data. It also presents supplemental estimation results, including those from several robustness checks briefly mentioned throughout the paper and those complementing the analysis in section IV.

II. Methodology

A. General Framework

We use data from a series of randomized experiments conducted in several locations. In each location, individuals were randomly assigned either to participate in a site-specific training program or to a control group that was excluded from all training services. We study whether the nonexperimental strategies considered in this paper can simultaneously equalize average outcomes for control individuals across all the sites by adjusting for differences in the distribution of individual characteristics and local economic conditions.

Each unit i in our sample, i = 1, 2, ..., N, comes from one of k possible sites. Let D_i ∈ {1, 2, ..., k} denote the location of individual i. For our purposes, it is convenient to write the potential outcomes of interest as Y_i(0, d), where d denotes the site and 0 emphasizes the fact that we focus exclusively on the control groups. Thus, Y_i(0, d) is the outcome (for example, employment status) that individual i would obtain if that person were a control in site d. The data we observe for each unit are (Y_i, D_i, X_i), where X_i is a set of pretreatment covariates and Y_i = Y_i(0, D_i).

Let S be a subset of 𝒳, the support of X. Our parameters of interest in this paper are

    β_d ≡ β_d(S) = E[Y(0, d) | X ∈ S],   for d = 1, 2, ..., k.   (1)

Region S can include all or part of the support of X and is determined by the overlap region, which we discuss below. To simplify the notation, we omit the conditioning on X ∈ S unless necessary for clarity. The object in equation (1) gives the average potential outcome under the control treatment at location d for someone with X ∈ S randomly selected from any of the k sites. This is different from E[Y(0, d) | D = d, X ∈ S], which gives the average control outcome at site d for someone with X ∈ S randomly selected from site d. While this last quantity is identified from the data because of random treatment assignment within site d, β_d is not. Rather than focusing on equation (1), we could have focused on E[Y(0, d) | D = f, X ∈ S_f] for d = 1, ..., k, which gives the average outcome for the individuals in a particular site f ∈ {1, ..., k} within the overlap region S_f for site f. By focusing on the average over all k sites in equation (1), we avoid having our results depend on the selection of site f as a reference point. However, the estimation problem becomes more challenging because we need to find comparable individuals in each of the sites for every individual in the k sites (as opposed to finding comparable individuals in each site for only those in site f).

We assess the performance of the strategies presented below in several ways. First, given estimates β̂_1, ..., β̂_k of the corresponding parameters in equation (1), we test the joint hypothesis

    β_1 = β_2 = ... = β_k.   (2)

One drawback of this approach is that we could fail to reject the null hypothesis in equation (2) only because the variance of the estimators is high rather than because all the estimated βs are sufficiently close to each other. Hence, we focus our analysis mostly on an overall measure of distance among the estimated means. Letting μ̂ = (1/k) Σ_{d=1}^{k} β̂_d, we define the root mean square distance as

    rmsd = (1/μ̂) √[ (1/k) Σ_{d=1}^{k} (β̂_d − μ̂)² ],   (3)

where we divide by μ̂ to facilitate its interpretation and the comparison across outcomes. Due to pure sample variation, we would never expect the rmsd to equal 0, even if D_i were randomly assigned. In section IV, we present benchmark values of the rmsd as a reference point for reasonable values of the rmsd. These benchmarks give the value of the rmsd that would be achieved by an experiment in a setting where β_1 = β_2 = ... = β_k holds.²

² In addition to the rmsd in equation (3), we considered two other measures: the mean absolute distance and the maximum pairwise distance among all the estimated means. All three measures give similar results.
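As a concrete illustration of the assessment measure in equation (3), the following minimal Python sketch computes the rmsd from a vector of estimated site means; the input values shown are purely illustrative, not estimates from the paper.

```python
import numpy as np

def rmsd(beta_hat):
    """Root mean square distance among estimated site means,
    scaled by their grand mean, as in equation (3)."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    mu_hat = beta_hat.mean()                                # grand mean over the k sites
    return np.sqrt(np.mean((beta_hat - mu_hat) ** 2)) / mu_hat

# Illustrative values only:
print(rmsd([0.62, 0.58, 0.65, 0.60, 0.55]))
```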
In the following two sections, we discuss the strategies we employ to adjust for differences across the sites in the distribution of individual characteristics and local economic conditions.

B. Adjusting for Individual Characteristics

Nonexperimental strategies. We consider two identification strategies to adjust for differences in individual characteristics across sites. Let 1(A) be the indicator function, which equals 1 if event A is true and 0 otherwise. The first strategy is based on the following assumption:

Assumption 1 (Unconfounded site): 1(D_i = d) ⊥ Y_i(0, d) | X_i, for all d ∈ {1, 2, ..., k}.

This assumption is an extension to the multiple-site setting of the one in Hotz, Imbens, and Mortimer (2005), and in the program evaluation context, it is called weak unconfoundedness (Imbens, 2000). The problem when estimating β_d is that we observe Y_i(0, d) only for individuals with 1(D_i = d) = 1, while it is missing for those with 1(D_i = d) = 0. Assumption 1 implies that conditional on X_i, the two groups are comparable, which means that we can use the individuals with 1(D_i = d) = 1 to learn about β_d. A key feature of assumption 1 is that all that matters is whether unit i is in site d; hence, the actual site of an individual not in d is not relevant for learning about β_d.

In addition to assumption 1, we impose an overlap assumption that guarantees that in infinite samples we are able to find comparable individuals in terms of the covariates for all units with D_i ≠ d in the subsample with D_i = d, for every site d.

Assumption 2 (Simultaneous strict overlap): 0 < ξ < Pr(D = d | X = x), for all d and x ∈ 𝒳.

Assumption 2 is stronger than that of the binary treatment case, as it requires that for every individual in the k sites, we find comparable individuals in terms of covariates in each of the k sites. Intuitively, while we want to learn about the potential outcomes in all sites for every individual, we observe only one of those k potential outcomes, so the fundamental problem of causal inference (Holland, 1986) is worsened. When the strict overlap assumption is violated, semiparametric estimators of β_d can fail to be √N-consistent, which can lead to large standard errors and numerical instability (Busso, DiNardo, & McCrary, 2009a, 2009b; Khan & Tamer, 2010). In practice, assumption 2 may hold only for some of the sites or regions in the support of X, in which case one may have to restrict the analysis to those sites or regions with common support.

The second identification strategy we consider to adjust for differences in individual characteristics across locations is a multiple-treatment version of the conditional difference-in-differences (DID) strategy (Heckman et al., 1997, 1998; Smith & Todd, 2005). In our setting, the main identification assumption of this strategy is

    1(D_i = d) ⊥ {Y_{i,t}(0, d) − Y_{i,t′}(0, d)} | X_i,   for all d ∈ {1, 2, ..., k},   (4)

where t and t′ are time periods after and before random assignment, respectively. This assumption is similar to assumption 1 but employs the difference of the potential outcomes over time. This identification strategy allows for systematic differences in time-invariant factors between the individuals in site d and those not in site d for every site d. Thus, DID methods can also help to adjust for time-invariant heterogeneity in local economic conditions across sites (Heckman et al., 1997, 1998; Smith & Todd, 2005).

Estimators. We consider several estimators to implement the strategies discussed above. For simplicity, we present the estimators in the context of the unconfoundedness strategy. The DID estimators simply replace the outcome in levels with the outcome in differences. Under assumption 1 we can identify β_d (omitting for brevity the conditioning on X ∈ S) as

    β_d = E_X[E[Y(0, d) | X = x]]
        = E_X[E[Y(0, d) | 1(D = d) = 1, X = x]]
        = E_X[E[Y | D = d, X = x]],   (5)

where, following similar steps to those used in a binary treatment setting (Imbens, 2004), the first equality uses iterated expectations and the second uses assumption 1. This result suggests estimating β_d using a partial mean, which is an average of a regression function over some of its regressors while holding others fixed (Newey, 1994). To estimate β_d, we first estimate the expectation of Y conditional on D and X, and then we average this regression function over the covariates holding D fixed at d. The first estimator we consider uses a linear regression of Y on the site dummies and X to model the conditional expectation E[Y_i | D_i = d, X_i = x]. We refer to it as the partial mean linear X estimator. We also consider a more flexible model of that expectation by including polynomials of the continuous covariates and various interactions; we refer to the corresponding estimator as the partial mean flexible X estimator.

The rest of the estimators we consider are based on the generalized propensity score (GPS). Its binary treatment version, the propensity score, is widely used in the program evaluation literature (Heckman et al., 1999; Imbens & Wooldridge, 2009). In our setting, the GPS is defined as the probability of belonging to a particular site conditional on the covariates (Imbens, 2000):

    r(d, x) = Pr(D = d | X = x).   (6)

We let R_i = r(D_i, X_i) be the probability that individual i belongs to the location she actually belongs to, and let R_i^d = r(d, X_i) be the probability that i belongs to a particular location d conditional on her covariates. Clearly, R_i^d = R_i for those units with D_i = d. We consider GPS-based estimators of β_d based on partial means and inverse probability weighting.

Imbens (2000) shows that the GPS can be used to estimate β_d within a partial mean framework by first estimating the conditional expectation of Y as a function of D and R = r(D, X), and then by averaging this conditional expectation over R^d = r(d, X) while holding D fixed at d. We consider two estimators based on this approach. The first follows Hirano and Imbens (2004) and estimates E[Y_i | D_i = d, R_i = r] by employing a linear regression of the outcome on the site dummies and the interactions of the site dummies with R_i and R_i². The second estimator employs a local polynomial regression of order 1 to estimate that regression function. We refer to these two estimators of β_d as the GPS-based parametric and nonparametric partial mean estimators, respectively.
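To fix ideas, here is a minimal sketch of the partial mean "linear X" idea described above: regress the outcome on site dummies and covariates, then average the fitted regression function over all units' covariates while holding the site fixed at d. The numpy array names (y, d, x) are hypothetical, and the implementation is only illustrative of the approach, not the authors' code.

```python
import numpy as np

def partial_mean_linear(y, d, x, site):
    """Sketch of the partial mean 'linear X' estimator of beta_d:
    fit a linear regression of y on site dummies and covariates x,
    then average the fitted values over all units with D held at `site`."""
    sites = np.unique(d)
    dummies = (d[:, None] == sites[None, :]).astype(float)   # one indicator per site
    design = np.hstack([dummies, x])                          # dummies absorb the intercept
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    counterfactual = np.zeros_like(dummies)                   # assign everyone to `site`,
    counterfactual[:, list(sites).index(site)] = 1.0          # keeping their own covariates
    return float(np.mean(np.hstack([counterfactual, x]) @ coef))
```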
Finally, we also employ the GPS in a weighting scheme. Intuitively, reweighting observations by the inverse of R_i creates a sample in which the covariates are balanced across all locations. The first inverse probability weighting (IPW) estimator of β_d that we consider equals the coefficient on 1(D_i = d) from a weighted linear regression of Y_i on all the site indicator variables, with weights equal to w_i = 1/R_i. We also consider an estimator that combines IPW with linear regression by adding covariates to this regression. We refer to it as the IPW with covariates estimator.
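The sketch below illustrates the GPS estimation and the simple IPW estimator just described, using a multinomial logit for the GPS. The use of scikit-learn, the variable names, and the omission of the paper's 52 covariates are simplifying assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_site_means(y, d, x):
    """Sketch of the simple IPW estimator: estimate the GPS with a multinomial
    logit, weight each unit by 1/R_i (the probability of her own site), and take
    weighted outcome means by site (equivalent to the coefficients from a weighted
    regression of y on the full set of site dummies)."""
    mnl = LogisticRegression(max_iter=5000).fit(x, d)          # multinomial logit GPS
    gps = mnl.predict_proba(x)                                  # columns: R_i^d, d in classes_
    own_col = np.searchsorted(mnl.classes_, d)                  # column of each unit's own site
    w = 1.0 / gps[np.arange(len(d)), own_col]                   # w_i = 1 / R_i
    return {s: np.average(y[d == s], weights=w[d == s]) for s in mnl.classes_}
```

Weighting each site's units by 1/R_i reweights that control group toward the covariate distribution of the pooled sample, which is what makes the weighted site means comparable.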

Determination of the overlap region. A key ingredient in the implementation and performance of the estimators previously discussed is the identification of regions of the data in which there is overlap in the covariate distributions across the different comparison groups. The usual approach in the binary treatment literature is to focus on the overlap or common support region by dropping those individuals whose propensity score does not overlap with the propensity score of those in the other treatment arm (Heckman et al., 1997, 1998; Dehejia & Wahba, 1999, 2002). We propose a GPS-based trimming procedure for determining the overlap region in the multiple treatment setting that is motivated by a procedure commonly used when the treatment is binary (Dehejia & Wahba, 1999, 2002).

We determine the overlap region in two steps. First, we define the overlap region with respect to a particular site d, Overlap_d, as the subsample of individuals whose conditional probability of being in site d (R_i^d) is greater than a cutoff value, given by the larger of the qth quantiles of the R_i^d distribution in two groups: those individuals who are in site d and those who are not. This rule ensures that within the subsample Overlap_d, for every individual with D_i ≠ d, we can find comparable individuals with D_i = d, which implies that we can estimate the mean of Y(0, d) for the individuals in Overlap_d under assumption 1. Second, we define the overlap region as the subsample given by individuals who are simultaneously in the overlap regions for all the different sites. This step guarantees that we estimate the mean of Y(0, d) for d = 1, 2, ..., k for the set of individuals who are simultaneously comparable in all the sites. More formally, let

    Overlap_d = { i : R_i^d ≥ max{ r^d_{q,{j: D_j ≠ d}}, r^d_{q,{j: D_j = d}} } },   (7)

where r^d_{q,{j ∈ A}} denotes the qth quantile of the distribution of R^d over individuals in subsample A. We define the overlap region as³

    Overlap = ∩_{d=1}^{k} Overlap_d.   (8)

Lechner (2002a, 2002b) and Frölich et al. (2004) use a rule akin to equation (8) but define Overlap_d = { i : R_i^d ∈ [ max_{c=1,...,k} { min_{j: D_j = c} R_j^d }, min_{c=1,...,k} { max_{j: D_j = c} R_j^d } ] }. Our rule is less stringent in two important ways. First, as implied by assumption 1, it does not require the comparison of R_i^d among all k locations, only between the groups of individuals with D_i = d and D_i ≠ d. Second, it identifies individuals outside the overlap region based only on the lower tail of the distributions of R_i^d. This is because we impose the overlap condition Overlap_d for all sites, so it is not necessary to look at the upper tail of the GPS distributions.⁴

³ Alternatively, the overlap region could be defined as S = ∩_{d=1}^{k} { x ∈ 𝒳 : R^d(x) ≥ max{ r^d_{q,{j: D_j ≠ d}}, r^d_{q,{j: D_j = d}} } }.

⁴ For example, when estimating the average treatment effect in the binary treatment setting, both tails of the distribution of the propensity score are considered. This is equivalent to considering the two lower tails of the distributions of Pr(D = 1 | X) and Pr(D = 0 | X) because Pr(D = 1 | X) + Pr(D = 0 | X) = 1.
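The following sketch implements the trimming rule in equations (7) and (8), assuming gps is an N × k numpy array of estimated probabilities R_i^d (columns ordered as in sites) and d is the vector of site labels. The default trimming quantile shown is arbitrary; this is an illustration of the rule, not the authors' code.

```python
import numpy as np

def overlap_region(gps, d, sites, q=0.01):
    """Sketch of the GPS-based trimming rule in equations (7)-(8): keep unit i
    only if, for every site s, R_i^s is at least the larger of the q-th quantiles
    of R^s among units in s and among units not in s."""
    keep = np.ones(len(d), dtype=bool)
    for col, s in enumerate(sites):
        in_s = (d == s)
        cutoff = max(np.quantile(gps[in_s, col], q),    # q-th quantile, units in s
                     np.quantile(gps[~in_s, col], q))   # q-th quantile, units not in s
        keep &= gps[:, col] >= cutoff                    # membership in Overlap_s
    return keep                                          # intersection over all sites
```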
C. Adjusting for Local Economic Conditions

Even after adjusting for differences in individual characteristics, there could still be differences in the average outcomes of the comparison groups across sites because of differences in LEC. The main difficulty in adjusting for differences in LEC is that these variables take on the same value within each location or within each cohort in a location. When each comparison group belongs to a different location, as in our setting, the variation of the LEC variables can be limited, and they are likely to fail the overlap condition. In such cases, the literature provides little guidance on how to adjust for differences in LEC across comparison groups.

In the context of evaluating training programs, a strategy followed in practice to control for differences in LEC across comparison groups is to include LEC measures or geographic indicators in the propensity score (Roselius, 1996; Heckman et al., 1998; Dyke et al., 2006; Mueser, Troske, & Gorislavsky, 2007; Dolton & Smith, 2011). Using regression-adjustment methods, Hotz, Imbens, and Klerman (2006) and Flores-Lagunes, Gonzalez, and Neumann (2010) include as regressors measures of posttreatment LEC. Most of these papers differ from ours because we do not have a large time series or cross-sectional variation in LEC. In a setting more similar to ours, Hotz et al. (2005) include two measures of prerandomization LEC in the regression step of the bias-corrected matching estimator in Abadie and Imbens (2011).⁵ We pay particular attention to adjusting for differences across sites in LEC after randomization, which can be a significant source of heterogeneity in average outcomes. For example, Hotz et al. (2006) and Flores-Lagunes et al. (2010) find that it is very important to adjust for posttreatment LEC.

⁵ Dehejia (2003) uses an approach based on hierarchical models to deal with site effects when evaluating multisite training programs, although he does not directly address LEC differences. Galdo (2008) considers a setting where there is lack of overlap due to site and time effects and proposes a procedure that removes those two fixed effects before implementing a matching estimator. As Dehejia (2003) pointed out, in a framework that aims at predicting the effects of programs if implemented in different sites, it is problematic to model site effects as fixed effects.

The approach we use to adjust for postrandomization LEC consists of modeling the effect of the prerandomization LEC on prerandomization values of the outcomes, and then using the estimated parameters to adjust the outcomes of interest for the effect of postrandomization LEC. We implement our approach in two ways depending on whether we use the outcome in levels or in differences. Assume that the potential outcomes Y_i(0, d) are separable functions of the individual characteristics and the LEC, and for the sake of presentation, ignore the presence of the covariates X_i. For the outcome in levels, we use the prerandomization outcome and LEC to run the regression

    Y_{i,t′} = Σ_{d=1}^{k} [1(D_i = d) λ_{d,t′}] + L_{i,t′} δ + u_{i,t′},   (9)

where t′ denotes a time period before random assignment, L_{i,t′} is a vector of LEC at time t′, and u_{i,t′} is an error term. The availability of different cohorts within each site in our data provides us with sufficient variation in the LEC to identify the parameter vector δ. We use the estimated coefficient δ̂ to adjust the outcome of interest (Y_{i,t}) as Ỹ_{i,t} = Y_{i,t} − L_{i,t} δ̂, and apply the estimators discussed in section IIB on the LEC-adjusted outcome Ỹ_{i,t}. For the outcome in differences, we follow a similar approach but employ in the regression the differences of the outcomes and LEC in two periods prior to randomization. In both cases (for the outcome in levels and in differences), we normalize the adjustments to have mean zero in order to allow for easier comparisons between the unadjusted and adjusted outcomes. Further details are provided in the online appendix.

Clearly, the flexibility of the specific model employed to adjust for differences across locations in LEC depends heavily on the available data and the amount of variation in the LEC. For instance, one may be able to estimate more flexible

6 1696 THE REVIEW OF ECONOMICS AND STATISTICS models if there is a large number of locations or cohorts in each location. In this paper, we are faced with the challenging (but not uncommon) problem that there is limited variation in the LEC measures (we have thirteen cohorts total), so it is difficult to use a more flexible model than the one we use. An attractive feature of our approach is that it is not estimator specific, meaning that once we adjust the outcomes for differences in postrandomization LEC, we can employ any of the estimators from section IIB to adjust for differences in individual characteristics. Another potentially attractive feature is that the effects of the LEC on the outcome variables are estimated using the outcomes prior to random assignment, so part of the analysis (for example, modeling the effect of the LEC on outcomes) can be performed without resorting to the outcomes after randomization. Finally, note that while we do not explicitly adjust for prerandomization LEC in our preferred specification, we include a rich set of welfare and labor market histories as covariates, which can proxy for those variables (Hotz et al., 2005). In section IV, we also present results from models that include measures of both pre- and postrandomization LEC in the GPS estimation. III. Data The data we use come from the National Evaluation of Welfare-to-Work Strategies (NEWWS), a multiyear study conducted in seven U.S. cities to compare the effects of different approaches for helping welfare recipients (mostly single mothers) improve their labor market outcomes and leave public assistance. The individuals in each location were randomly assigned to different training programs or to a control group that was denied access to the training services offered by the program for a preset embargo period. The year of randomization differed across sites, with the earliest randomization taking place in 1991 and the latest in We rely on the public use version of the NEWWS data, which is a combination of high-quality survey and administrative data. (For further details on the NEWWS study, see Hamilton et al., 2001.) We restrict the analysis to female control individuals in the five sites for which two years of prerandomization individual labor market histories are available and in which individuals were randomized after welfare eligibility had been determined. The five sites we use are Atlanta, Georgia; Detroit, Michigan; Grand Rapids, Michigan; Portland, Oregon; and Riverside, California (we exclude Columbus, Ohio, and Oklahoma City, Oklahoma). The final sample size in our analysis is 9,351 women. 6 The outcome we use is an indicator variable equal to 1 if the individual was ever employed during the two years (quarters 2 to 9) following randomization and 0 otherwise. 7 We focus on an outcome that is measured only up to two years after randomization because, in most sites, we cannot identify which control individuals were embargoed from receiving program services after year 2. We define the outcome in differences as the original outcome minus an indicator equal to 1 if the individual was ever employed during the two years prior to randomization. Our data contain information on demographic and family characteristics, education, housing type and stability, prerandomization welfare and food stamps use (seven quarters), and prerandomization earnings and employment histories (eight quarters). 
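As a concrete illustration of the outcome definitions above, the sketch below builds the levels and differenced outcomes from quarterly employment indicators. The column names (emp_q2 ... emp_q9 for post-randomization quarters and emp_pre_q1 ... emp_pre_q8 for pre-randomization quarters) are hypothetical and are not the NEWWS variable names.

```python
import pandas as pd

def build_outcomes(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: levels outcome = ever employed in quarters 2-9 after random
    assignment; differenced outcome = levels outcome minus an indicator of
    any employment in the eight quarters before random assignment."""
    post = [f"emp_q{q}" for q in range(2, 10)]      # hypothetical post-RA 0/1 columns
    pre = [f"emp_pre_q{q}" for q in range(1, 9)]    # hypothetical pre-RA 0/1 columns
    out = pd.DataFrame(index=df.index)
    out["ever_emp_post"] = df[post].max(axis=1)     # 1 if employed in any of Q2-Q9
    out["ever_emp_diff"] = out["ever_emp_post"] - df[pre].max(axis=1)
    return out
```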
The first five columns of panel A in table 1 show the averages of the outcomes and selected covariates for each site. As expected, there are important and usually large differences in the means of all the variables across sites (for example, percentage of blacks). We employ three measures of LEC for the different cohorts within each site at the metropolitan statistical area (MSA) level: employment-to-population ratio, average real earnings, and unemployment rate. The bottom of panel A presents the average of these variables during the calendar year of random assignment, as well as their two-year growth rates in the two years before and after random assignment. The differences in LEC across all sites clearly suggest that we are working with five distinct local labor markets, especially with respect to Riverside. IV. Results A. GPS Estimation and Covariate Balancing We estimate the GPS using a multinomial logit model (MNL) that includes 52 individual-level covariates (estimation results are provided in the online appendix). In section IVD, we consider alternative estimators of the GPS. An important property of the GPS is that it balances the covariates between the individuals in a particular site and those not in that site (Imbens, 1999). In the binary treatment setting, this property is commonly used to gauge the adequacy of the propensity score specification (Dehejia & Wahba, 1999; Smith & Todd, 2005; Lee, 2013). We follow two strategies to examine the balance of each covariate across the different sites after adjusting for the GPS. The first strategy tests whether there is joint equality of means across all five sites after each observation is weighted by 1/ R i. The second strategy consists of a series of pairwise comparisons of the mean in each site versus the mean in the (pooled) remaining sites, adjusting for the GPS within a blocking or stratification framework in the spirit of Dehejia and Wahba (1999, 2002) and Hirano and Imbens (2004). Both strategies are implemented after imposing the overlap rule, equation (8). 6 Michalopoulos et al. (2004) use the NEWWS data to perform binary comparisons between control groups in different locations. See the online appendix for a discussion of the differences between their sample and ours. 7 Using one-year employment indicators instead does not change our conclusions substantively. We focus on an employment indicator because the public use data available to us contain only the year of random assignment and nominal earnings measures. Thus, we could create only rough measures of real earnings. By focusing on employment, we also avoid having to deal with issues related to cost-of-living differences across sites.

Table 1. Descriptive Statistics, Covariates Balancing, and Overlap: NEWWS Data

A. Descriptive statistics (selected variables). For each variable, the table reports the raw means by site (ATL, DET, GRP, POR, RIV), the p-value from a Wald test of joint equality of means across sites (raw and IPWᵇ), and the root mean square distance of the site means (raw and IPWᵇ). Numeric entries are not reproduced here. The variables are:
  Outcomes: ever employed in 2 years after RA; ever employed in 2 years after RA (in differences)ᵃ.
  Demographic and family characteristics: black; age-group indicators (including 40+ years old); became mother as a teenager; never married; any child 0–5 years old; any child 6–12 years old.
  Education characteristics: highest grade completed indicators (up to grade 12 or higher); highest degree = high school or GED.
  Welfare use history: on welfare for less than 2 years; for 2–5 years; for 5–10 years; received welfare in Q1 before RA; received welfare in Q7 before RA.
  Food stamps use history: received food stamps in Q1, Q6, and Q7 before RA.
  Employment history: employed in Q1 before RA; employed in Q8 before RA; employed at RA (self-reported); ever worked full time 6 or more months at same job.
  Earnings history (real $/1,000): earnings in Q1 before RA; earnings in Q8 before RA; any earnings in the year before RA (self-reported).
  Local economic conditions: employment-to-population ratio, average total earnings ($1,000), and unemployment rate in the year of RA; growth rates of each over the 2 years before RA and the 2 years after RA (Δ logs).

B. Covariates balancing (number of individual covariates for which the p-value is below 0.05; total number of covariates = 52)
  Joint equality of means across all sites: raw means before overlap, 52; GPS-based inverse probability weightingᶜ, 6.
  Difference of means, each site versus other sites pooled: raw means before overlap and blocking on the GPSᶜ, reported by site (ATL, DET, GRP, POR, RIV).

C. Overlap
                                          ATL     DET     GRP     POR     RIV    Total
  Number of observations                 1,372   2,037   1,374   1,740   2,828   9,351
  Observations after imposing overlap    1,299   1,933     969   1,263   1,437   6,901
  Percentage dropped due to overlap       5.3%    5.1%   29.5%   27.4%   49.2%   26.2%

RA = random assignment. ᵃThe lagged outcome was standardized to be mean zero, so the grand mean is the same for the levels and differenced outcomes. ᵇThe IPW joint equality of means and IPW rmsd are calculated from GPS-based IPW means (after imposing overlap). ᶜTests calculated after imposing overlap.
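As a pointer to how the balance diagnostics summarized in table 1 can be computed, here is a minimal sketch that applies the rmsd of equation (3) to one covariate's site means, with and without the 1/R_i weights. The array names are hypothetical, and the paper's actual test statistics (Wald tests, blocking on the GPS) are not reproduced here.

```python
import numpy as np

def covariate_rmsd(xcol, d, weights=None):
    """Sketch of a balance diagnostic: rmsd of a covariate's (optionally
    1/R_i-weighted) site means, scaled by their grand mean as in equation (3)."""
    if weights is None:
        weights = np.ones(len(xcol))
    sites = np.unique(d)
    means = np.array([np.average(xcol[d == s], weights=weights[d == s]) for s in sites])
    mu = means.mean()
    return np.sqrt(np.mean((means - mu) ** 2)) / mu

# Hypothetical usage: raw vs GPS-weighted balance for one covariate
# covariate_rmsd(x[:, 0], d)                       # raw means
# covariate_rmsd(x[:, 0], d, weights=1.0 / r_i)    # after IPW weighting
```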

8 1698 THE REVIEW OF ECONOMICS AND STATISTICS An important issue when implementing this rule is selecting the quantile q that determines the amount of trimming. Even in the binary treatment literature, there is no consensus about how to select the amount of trimming (Imbens & Wooldridge, 2009). Based on attaining good balancing under both strategies, we set q = For brevity, we focus on the first strategy. The last four columns of panel A in table 1 show the p-value from a joint equality test and the root mean square distance (rmsd) as given in equation (3) for the means of the variables in each site before and after weighting by the GPS. The complete variable-by-variable results are presented in the online appendix. Weighting by the GPS brings the means of all of the covariates much closer together (except for the percentage of individuals with a high school diploma or GED, which becomes slightly less balanced). For example, the rmsd for the percentage of blacks is reduced from 0.64 to 0.06 after weighting by the GPS. Despite the significant improvement in balancing, the joint equality of means test is still rejected for some of the covariates. Panel B in table 1 shows summary results for the covariate balancing analysis. Weighting by the GPS greatly improves the balancing among the five sites, as the number of unbalanced covariates at the 5% significance level drops from 52 to 6. The second strategy used to analyze covariate balancing, blocking on the GPS, yields similar results. Overall, we conclude that the covariates in the raw data are highly unbalanced across sites and that the estimated GPS does a reasonably good job in improving their balance across all five sites. B. Overlap Analysis We first analyze the overlap in individual characteristics across sites and then analyze the overlap in LEC. Panel C in table 1 presents the percentage of individuals dropped from each site after imposing overlap and the overall percentage of observations dropped (26 percent). In Riverside, almost half of the observations are dropped, implying that many of the individuals in Riverside are not comparable to individuals in at least one of the other sites. The sites with the fewest number of observations dropped are Atlanta and Detroit (about 5% in each). In figure 1, we examine more closely the overlap quality in individual characteristics. Each panel in the figure shows, for a given site d, the number of observations with D i = d (left bars) and D i = d (right bars) whose GPS R d i falls within given percentile intervals of the support of R d. The percentile intervals, for each site d, are constructed using the distribution of R d only for individuals in site d before imposing overlap. The outside wide (inside thin) bars display the number of individuals before (after) imposing overlap. As discussed in section IIB, we are interested only in the lower tail of these distributions because, for estimation of β d, we need to find 8 We also examined covariate balancing properties for values of q equal to 0, 0.001, 0.002, 0.003, 0.004, and Our main results presented in section IVC are in general robust to the choice of q. for every individual with D i = d comparable individuals in terms of R d in the D i = d group, and not vice versa. However, note that the intersection in the overlap rule in equation (8) results in individuals being dropped from all regions of the GPS distributions in figure 1. 
This illustrates a key difference with the binary treatment case, where usually individuals are dropped only from the tails. Figure 1 makes clear that the GPS distributions of the D i = d and D i = d groups differ greatly in some sites, which is consistent with the large differences in individual characteristics documented in table 1. Figure 1 also shows that imposing overlap significantly decreases the number of individuals at the lower tail of the distribution of R d for the D i = d group in most sites. The biggest changes, mostly arising from the large number of Riverside individuals dropped, are in Atlanta and Detroit. For example, the bottom 1% of the individuals in Atlanta (about fourteen observations) are comparable to 3,245 individuals (about 41% of those not in Atlanta) before imposing overlap and to 938 individuals after imposing overlap (about 17% of those not in Atlanta). Despite the improvement in overlap after imposing equation (8), the overlap quality remains relatively poor in some regions of the distributions, as many observations not in site d are comparable to few observations in site d in those regions. This poor overlap quality can lead to high variance of the GPS-based semiparametric estimators, just as in the binary treatment setting (Black & Smith, 2004; Busso et al., 2009b; Imbens & Wooldridge, 2009; Khan & Tamer, 2010). In general, the overlap quality in individual characteristics between units with D i = d and D i = d in each site is similar to that from previous studies in a binary treatment setting (for example, Dehejia & Wahba, 1999; Black & Smith, 2004; Smith & Todd, 2005). We now examine the overlap in pre- and postrandomization LEC across the different cohorts in our sites. Figure 2 presents plots of the LEC faced by each of the cohorts in our analysis before (top figures) and after (bottom figures) random assignment. It is clear from the figure that all the cohorts in Riverside experienced, especially after randomization, LEC extremely different from the other sites with respect to the employment-to-population ratio and the unemployment rate. Moreover, the comparison of the top and bottom panels shows that the behavior over time of the LEC in Riverside is also different, with earlier cohorts experiencing a deterioration of the LEC. As for the rest of the sites, although their LEC are not the same, they are roughly similar. Figure 2 highlights that it is almost impossible to find individuals in other sites who faced postrandomization LEC similar to those faced by the individuals in Riverside. Thus, the adjustment for LEC in Riverside heavily depends on how well the parametric model used to adjust for LEC is able to extrapolate to these regions of the data. C. Main Results Table 2 presents the results for the assessment measures of the different estimators used to implement the

9 COMPARING TREATMENTS ACROSS LABOR MARKETS 1699 Figure 1. Overlap Quality Based on GPS Distribution Outside wide bars and inside thin bars display the number of individuals whose GPS fall in a percentile interval, before and after imposing overlap, respectively. Percentile intervals in each site are constructed using the GPS distribution, before imposing overlap, for individuals in that site only. Right bars correspond to individuals in each particular site and left bars correspond to individuals not in that site. unconfoundedness (first three columns) and conditional DID (last three columns) strategies. Panel A shows the results when the outcome is not adjusted for differences in postrandomization LEC, and panel B shows the results when adjusting for such differences following the procedure discussed in section IIC. The first and second columns of the table show, respectively, the p-value from the joint equality test in equation (2) and the rmsd in equation (3) with its corresponding 95% confidence interval based on 1,000 bootstrap replications. To have a reference level for the reduction in the rmsd for the different estimators, the third column displays the rmsd relative to that from the raw mean estimator using the outcome in levels before imposing overlap and not adjusting for LEC. In addition, to have a benchmark against which to compare our results, the last row in each of the panels gives the value of the rmsd that would be achieved by an experiment in a setting where β 1 =... = β k holds. In particular, we perform a placebo experiment in which we assign placebo sites randomly to the individuals in the data set (keeping the number of observations in each site the same as in our sample) and let β d be the average outcome in each placebo site. These placebo values are also calculated using the outcome in differences and with and without adjusting for the LEC. We implement all the GPS-based estimators only within the overlap region and the non-gps-based estimators before and after imposing overlap. 9 The rmsd of the raw mean in levels (0.129) is about six times what one would expect to see in an experiment (0.021). Both strategies used to adjust for individual characteristics do very little in reducing the rmsd, regardless of the estimator used. This illustrates that in some instances, DID estimators are unable to adjust for differences across sites. In our case, there seem to be important differences in time-variant 9 For the partial mean linear X and IPW with covariates estimators, we include all 52 individual covariates used in the GPS estimation. For the partial mean flexible X estimator, we add 50 higher-order terms and interactions. We implement the GPS-based nonparametric partial mean estimator using an Epanechnikov kernel and selecting the bandwidth using the plugin procedure proposed by Fan and Gijbels (1996). For the GPS-based parametric partial mean estimator, we also explored using a cubic and a quartic polynomial in R i, which made the results very close to those from the nonparametric partial mean estimator. Finally, the results from the regressions in the first step of the procedure used to adjust for LEC are shown in the online appendix.


More information

Finite Sample Properties of Semiparametric Estimators of Average Treatment Effects

Finite Sample Properties of Semiparametric Estimators of Average Treatment Effects Finite Sample Properties of Semiparametric Estimators of Average Treatment Effects Matias Busso IDB, IZA John DiNardo University of Michigan and NBER Justin McCrary University of Californa, Berkeley and

More information

Bounds on Average and Quantile Treatment Effects of Job Corps Training on Wages*

Bounds on Average and Quantile Treatment Effects of Job Corps Training on Wages* Bounds on Average and Quantile Treatment Effects of Job Corps Training on Wages* German Blanco Department of Economics, State University of New York at Binghamton gblanco1@binghamton.edu Carlos A. Flores

More information

Applied Microeconometrics (L5): Panel Data-Basics

Applied Microeconometrics (L5): Panel Data-Basics Applied Microeconometrics (L5): Panel Data-Basics Nicholas Giannakopoulos University of Patras Department of Economics ngias@upatras.gr November 10, 2015 Nicholas Giannakopoulos (UPatras) MSc Applied Economics

More information

Econometric Analysis of Cross Section and Panel Data

Econometric Analysis of Cross Section and Panel Data Econometric Analysis of Cross Section and Panel Data Jeffrey M. Wooldridge / The MIT Press Cambridge, Massachusetts London, England Contents Preface Acknowledgments xvii xxiii I INTRODUCTION AND BACKGROUND

More information

Econ 673: Microeconometrics Chapter 12: Estimating Treatment Effects. The Problem

Econ 673: Microeconometrics Chapter 12: Estimating Treatment Effects. The Problem Econ 673: Microeconometrics Chapter 12: Estimating Treatment Effects The Problem Analysts are frequently interested in measuring the impact of a treatment on individual behavior; e.g., the impact of job

More information

Using Matching, Instrumental Variables and Control Functions to Estimate Economic Choice Models

Using Matching, Instrumental Variables and Control Functions to Estimate Economic Choice Models Using Matching, Instrumental Variables and Control Functions to Estimate Economic Choice Models James J. Heckman and Salvador Navarro The University of Chicago Review of Economics and Statistics 86(1)

More information

Controlling for overlap in matching

Controlling for overlap in matching Working Papers No. 10/2013 (95) PAWEŁ STRAWIŃSKI Controlling for overlap in matching Warsaw 2013 Controlling for overlap in matching PAWEŁ STRAWIŃSKI Faculty of Economic Sciences, University of Warsaw

More information

Econometrics of causal inference. Throughout, we consider the simplest case of a linear outcome equation, and homogeneous

Econometrics of causal inference. Throughout, we consider the simplest case of a linear outcome equation, and homogeneous Econometrics of causal inference Throughout, we consider the simplest case of a linear outcome equation, and homogeneous effects: y = βx + ɛ (1) where y is some outcome, x is an explanatory variable, and

More information

The Evaluation of Social Programs: Some Practical Advice

The Evaluation of Social Programs: Some Practical Advice The Evaluation of Social Programs: Some Practical Advice Guido W. Imbens - Harvard University 2nd IZA/IFAU Conference on Labor Market Policy Evaluation Bonn, October 11th, 2008 Background Reading Guido

More information

Nonparametric Tests for Treatment Effect. Heterogeneity

Nonparametric Tests for Treatment Effect. Heterogeneity forthcoming in The Review of Economics and Statistics Nonparametric Tests for Treatment Effect Heterogeneity Richard K. Crump V. Joseph Hotz Guido W. Imbens Oscar A. Mitnik First Draft: November 2005 Final

More information

A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods. Jeff Wooldridge IRP Lectures, UW Madison, August 2008

A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods. Jeff Wooldridge IRP Lectures, UW Madison, August 2008 A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. Linear-in-Parameters Models: IV versus Control Functions 2. Correlated

More information

Comparative Advantage and Schooling

Comparative Advantage and Schooling Comparative Advantage and Schooling Pedro Carneiro University College London, Institute for Fiscal Studies and IZA Sokbae Lee University College London and Institute for Fiscal Studies June 7, 2004 Abstract

More information

Principles Underlying Evaluation Estimators

Principles Underlying Evaluation Estimators The Principles Underlying Evaluation Estimators James J. University of Chicago Econ 350, Winter 2019 The Basic Principles Underlying the Identification of the Main Econometric Evaluation Estimators Two

More information

Imbens, Lecture Notes 1, Unconfounded Treatment Assignment, IEN, Miami, Oct 10 1

Imbens, Lecture Notes 1, Unconfounded Treatment Assignment, IEN, Miami, Oct 10 1 Imbens, Lecture Notes 1, Unconfounded Treatment Assignment, IEN, Miami, Oct 10 1 Lectures on Evaluation Methods Guido Imbens Impact Evaluation Network October 2010, Miami Methods for Estimating Treatment

More information

Imbens/Wooldridge, Lecture Notes 1, Summer 07 1

Imbens/Wooldridge, Lecture Notes 1, Summer 07 1 Imbens/Wooldridge, Lecture Notes 1, Summer 07 1 What s New in Econometrics NBER, Summer 2007 Lecture 1, Monday, July 30th, 9.00-10.30am Estimation of Average Treatment Effects Under Unconfoundedness 1.

More information

Partial Identification of Average Treatment Effects in Program Evaluation: Theory and Applications

Partial Identification of Average Treatment Effects in Program Evaluation: Theory and Applications University of Miami Scholarly Repository Open Access Dissertations Electronic Theses and Dissertations 2013-07-11 Partial Identification of Average Treatment Effects in Program Evaluation: Theory and Applications

More information

Gov 2002: 4. Observational Studies and Confounding

Gov 2002: 4. Observational Studies and Confounding Gov 2002: 4. Observational Studies and Confounding Matthew Blackwell September 10, 2015 Where are we? Where are we going? Last two weeks: randomized experiments. From here on: observational studies. What

More information

Difference-in-Differences Estimation

Difference-in-Differences Estimation Difference-in-Differences Estimation Jeff Wooldridge Michigan State University Programme Evaluation for Policy Analysis Institute for Fiscal Studies June 2012 1. The Basic Methodology 2. How Should We

More information

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i,

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i, A Course in Applied Econometrics Lecture 18: Missing Data Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. When Can Missing Data be Ignored? 2. Inverse Probability Weighting 3. Imputation 4. Heckman-Type

More information

NONPARAMETRIC ESTIMATION OF AVERAGE TREATMENT EFFECTS UNDER EXOGENEITY: A REVIEW*

NONPARAMETRIC ESTIMATION OF AVERAGE TREATMENT EFFECTS UNDER EXOGENEITY: A REVIEW* OPARAMETRIC ESTIMATIO OF AVERAGE TREATMET EFFECTS UDER EXOGEEITY: A REVIEW* Guido W. Imbens Abstract Recently there has been a surge in econometric work focusing on estimating average treatment effects

More information

Propensity Score Weighting with Multilevel Data

Propensity Score Weighting with Multilevel Data Propensity Score Weighting with Multilevel Data Fan Li Department of Statistical Science Duke University October 25, 2012 Joint work with Alan Zaslavsky and Mary Beth Landrum Introduction In comparative

More information

When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data?

When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data? When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data? Kosuke Imai Princeton University Asian Political Methodology Conference University of Sydney Joint

More information

Difference-in-Differences Methods

Difference-in-Differences Methods Difference-in-Differences Methods Teppei Yamamoto Keio University Introduction to Causal Inference Spring 2016 1 Introduction: A Motivating Example 2 Identification 3 Estimation and Inference 4 Diagnostics

More information

Estimation of Treatment Effects under Essential Heterogeneity

Estimation of Treatment Effects under Essential Heterogeneity Estimation of Treatment Effects under Essential Heterogeneity James Heckman University of Chicago and American Bar Foundation Sergio Urzua University of Chicago Edward Vytlacil Columbia University March

More information

Propensity Score Matching and Variations on the Balancing Test

Propensity Score Matching and Variations on the Balancing Test Propensity Score Matching and Variations on the Balancing Test Wang-Sheng Lee* Melbourne Institute of Applied Economic and Social Research The University of Melbourne March 10, 2006 Abstract This paper

More information

leebounds: Lee s (2009) treatment effects bounds for non-random sample selection for Stata

leebounds: Lee s (2009) treatment effects bounds for non-random sample selection for Stata leebounds: Lee s (2009) treatment effects bounds for non-random sample selection for Stata Harald Tauchmann (RWI & CINCH) Rheinisch-Westfälisches Institut für Wirtschaftsforschung (RWI) & CINCH Health

More information

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case Arthur Lewbel Boston College December 2016 Abstract Lewbel (2012) provides an estimator

More information

Chilean and High School Dropout Calculations to Testing the Correlated Random Coefficient Model

Chilean and High School Dropout Calculations to Testing the Correlated Random Coefficient Model Chilean and High School Dropout Calculations to Testing the Correlated Random Coefficient Model James J. Heckman University of Chicago, University College Dublin Cowles Foundation, Yale University and

More information

By Marcel Voia. February Abstract

By Marcel Voia. February Abstract Nonlinear DID estimation of the treatment effect when the outcome variable is the employment/unemployment duration By Marcel Voia February 2005 Abstract This paper uses an econometric framework introduced

More information

Dealing with Limited Overlap in Estimation of Average Treatment Effects

Dealing with Limited Overlap in Estimation of Average Treatment Effects Dealing with Limited Overlap in Estimation of Average Treatment Effects The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation

More information

CEPA Working Paper No

CEPA Working Paper No CEPA Working Paper No. 15-06 Identification based on Difference-in-Differences Approaches with Multiple Treatments AUTHORS Hans Fricke Stanford University ABSTRACT This paper discusses identification based

More information

Identi cation of Positive Treatment E ects in. Randomized Experiments with Non-Compliance

Identi cation of Positive Treatment E ects in. Randomized Experiments with Non-Compliance Identi cation of Positive Treatment E ects in Randomized Experiments with Non-Compliance Aleksey Tetenov y February 18, 2012 Abstract I derive sharp nonparametric lower bounds on some parameters of the

More information

Sensitivity of Propensity Score Methods to the Specifications

Sensitivity of Propensity Score Methods to the Specifications DISCUSSION PAPER SERIES IZA DP No. 1873 Sensitivity of Propensity Score Methods to the Specifications Zhong Zhao December 2005 Forschungsinstitut zur Zukunft der Arbeit Institute for the Study of Labor

More information

Michael Lechner Causal Analysis RDD 2014 page 1. Lecture 7. The Regression Discontinuity Design. RDD fuzzy and sharp

Michael Lechner Causal Analysis RDD 2014 page 1. Lecture 7. The Regression Discontinuity Design. RDD fuzzy and sharp page 1 Lecture 7 The Regression Discontinuity Design fuzzy and sharp page 2 Regression Discontinuity Design () Introduction (1) The design is a quasi-experimental design with the defining characteristic

More information

The Balance-Sample Size Frontier in Matching Methods for Causal Inference: Supplementary Appendix

The Balance-Sample Size Frontier in Matching Methods for Causal Inference: Supplementary Appendix The Balance-Sample Size Frontier in Matching Methods for Causal Inference: Supplementary Appendix Gary King Christopher Lucas Richard Nielsen March 22, 2016 Abstract This is a supplementary appendix to

More information

Matching using Semiparametric Propensity Scores

Matching using Semiparametric Propensity Scores Matching using Semiparametric Propensity Scores Gregory Kordas Department of Economics University of Pennsylvania kordas@ssc.upenn.edu Steven F. Lehrer SPS and Department of Economics Queen s University

More information

ESTIMATION OF TREATMENT EFFECTS VIA MATCHING

ESTIMATION OF TREATMENT EFFECTS VIA MATCHING ESTIMATION OF TREATMENT EFFECTS VIA MATCHING AAEC 56 INSTRUCTOR: KLAUS MOELTNER Textbooks: R scripts: Wooldridge (00), Ch.; Greene (0), Ch.9; Angrist and Pischke (00), Ch. 3 mod5s3 General Approach The

More information

A Measure of Robustness to Misspecification

A Measure of Robustness to Misspecification A Measure of Robustness to Misspecification Susan Athey Guido W. Imbens December 2014 Graduate School of Business, Stanford University, and NBER. Electronic correspondence: athey@stanford.edu. Graduate

More information

Causal Inference in Observational Studies with Non-Binary Treatments. David A. van Dyk

Causal Inference in Observational Studies with Non-Binary Treatments. David A. van Dyk Causal Inference in Observational Studies with Non-Binary reatments Statistics Section, Imperial College London Joint work with Shandong Zhao and Kosuke Imai Cass Business School, October 2013 Outline

More information

Guanglei Hong. University of Chicago. Presentation at the 2012 Atlantic Causal Inference Conference

Guanglei Hong. University of Chicago. Presentation at the 2012 Atlantic Causal Inference Conference Guanglei Hong University of Chicago Presentation at the 2012 Atlantic Causal Inference Conference 2 Philosophical Question: To be or not to be if treated ; or if untreated Statistics appreciates uncertainties

More information

Going Beyond LATE: Bounding Average Treatment Effects of Job Corps Training

Going Beyond LATE: Bounding Average Treatment Effects of Job Corps Training DISCUSSION PAPER SERIES IZA DP No. 9511 Going Beyond LATE: Bounding Average Treatment Effects of Job Corps Training Xuan Chen Carlos A. Flores Alfonso Flores-Lagunes November 2015 Forschungsinstitut zur

More information

Parametric and Non-Parametric Weighting Methods for Mediation Analysis: An Application to the National Evaluation of Welfare-to-Work Strategies

Parametric and Non-Parametric Weighting Methods for Mediation Analysis: An Application to the National Evaluation of Welfare-to-Work Strategies Parametric and Non-Parametric Weighting Methods for Mediation Analysis: An Application to the National Evaluation of Welfare-to-Work Strategies Guanglei Hong, Jonah Deutsch, Heather Hill University of

More information

The problem of causality in microeconometrics.

The problem of causality in microeconometrics. The problem of causality in microeconometrics. Andrea Ichino University of Bologna and Cepr June 11, 2007 Contents 1 The Problem of Causality 1 1.1 A formal framework to think about causality....................................

More information

Nonparametric Tests for Treatment Effect Heterogeneity

Nonparametric Tests for Treatment Effect Heterogeneity Nonparametric Tests for Treatment Effect Heterogeneity The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters. Citation Published Version

More information

Causal Inference with General Treatment Regimes: Generalizing the Propensity Score

Causal Inference with General Treatment Regimes: Generalizing the Propensity Score Causal Inference with General Treatment Regimes: Generalizing the Propensity Score David van Dyk Department of Statistics, University of California, Irvine vandyk@stat.harvard.edu Joint work with Kosuke

More information

Bounds on Average and Quantile Treatment E ects of Job Corps Training on Participants Wages

Bounds on Average and Quantile Treatment E ects of Job Corps Training on Participants Wages Bounds on Average and Quantile Treatment E ects of Job Corps Training on Participants Wages German Blanco Food and Resource Economics Department, University of Florida gblancol@u.edu Carlos A. Flores Department

More information

An Alternative Assumption to Identify LATE in Regression Discontinuity Design

An Alternative Assumption to Identify LATE in Regression Discontinuity Design An Alternative Assumption to Identify LATE in Regression Discontinuity Design Yingying Dong University of California Irvine May 2014 Abstract One key assumption Imbens and Angrist (1994) use to identify

More information

A nonparametric test for path dependence in discrete panel data

A nonparametric test for path dependence in discrete panel data A nonparametric test for path dependence in discrete panel data Maximilian Kasy Department of Economics, University of California - Los Angeles, 8283 Bunche Hall, Mail Stop: 147703, Los Angeles, CA 90095,

More information

A Test for Rank Similarity and Partial Identification of the Distribution of Treatment Effects Preliminary and incomplete

A Test for Rank Similarity and Partial Identification of the Distribution of Treatment Effects Preliminary and incomplete A Test for Rank Similarity and Partial Identification of the Distribution of Treatment Effects Preliminary and incomplete Brigham R. Frandsen Lars J. Lefgren April 30, 2015 Abstract We introduce a test

More information

Econometrics of Policy Evaluation (Geneva summer school)

Econometrics of Policy Evaluation (Geneva summer school) Michael Lechner, Slide 1 Econometrics of Policy Evaluation (Geneva summer school) Michael Lechner Swiss Institute for Empirical Economic Research (SEW) University of St. Gallen Switzerland June 2016 Overview

More information

Introduction to causal identification. Nidhiya Menon IGC Summer School, New Delhi, July 2015

Introduction to causal identification. Nidhiya Menon IGC Summer School, New Delhi, July 2015 Introduction to causal identification Nidhiya Menon IGC Summer School, New Delhi, July 2015 Outline 1. Micro-empirical methods 2. Rubin causal model 3. More on Instrumental Variables (IV) Estimating causal

More information

Least Squares Estimation of a Panel Data Model with Multifactor Error Structure and Endogenous Covariates

Least Squares Estimation of a Panel Data Model with Multifactor Error Structure and Endogenous Covariates Least Squares Estimation of a Panel Data Model with Multifactor Error Structure and Endogenous Covariates Matthew Harding and Carlos Lamarche January 12, 2011 Abstract We propose a method for estimating

More information

Regression Discontinuity Design

Regression Discontinuity Design Chapter 11 Regression Discontinuity Design 11.1 Introduction The idea in Regression Discontinuity Design (RDD) is to estimate a treatment effect where the treatment is determined by whether as observed

More information

THE DESIGN (VERSUS THE ANALYSIS) OF EVALUATIONS FROM OBSERVATIONAL STUDIES: PARALLELS WITH THE DESIGN OF RANDOMIZED EXPERIMENTS DONALD B.

THE DESIGN (VERSUS THE ANALYSIS) OF EVALUATIONS FROM OBSERVATIONAL STUDIES: PARALLELS WITH THE DESIGN OF RANDOMIZED EXPERIMENTS DONALD B. THE DESIGN (VERSUS THE ANALYSIS) OF EVALUATIONS FROM OBSERVATIONAL STUDIES: PARALLELS WITH THE DESIGN OF RANDOMIZED EXPERIMENTS DONALD B. RUBIN My perspective on inference for causal effects: In randomized

More information

INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION. 1. Introduction

INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION. 1. Introduction INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION VICTOR CHERNOZHUKOV CHRISTIAN HANSEN MICHAEL JANSSON Abstract. We consider asymptotic and finite-sample confidence bounds in instrumental

More information

Covariate selection and propensity score specification in causal inference

Covariate selection and propensity score specification in causal inference Covariate selection and propensity score specification in causal inference Ingeborg Waernbaum Doctoral Dissertation Department of Statistics Umeå University SE-901 87 Umeå, Sweden Copyright c 2008 by Ingeborg

More information

Groupe de lecture. Instrumental Variables Estimates of the Effect of Subsidized Training on the Quantiles of Trainee Earnings. Abadie, Angrist, Imbens

Groupe de lecture. Instrumental Variables Estimates of the Effect of Subsidized Training on the Quantiles of Trainee Earnings. Abadie, Angrist, Imbens Groupe de lecture Instrumental Variables Estimates of the Effect of Subsidized Training on the Quantiles of Trainee Earnings Abadie, Angrist, Imbens Econometrica (2002) 02 décembre 2010 Objectives Using

More information

Propensity Score Matching and Variations on the Balancing Test

Propensity Score Matching and Variations on the Balancing Test Propensity Score Matching and Variations on the Balancing Test Wang-Sheng Lee* Melbourne Institute of Applied Economic and Social Research The University of Melbourne First version: November 3, 2005 This

More information

SIMULATION-BASED SENSITIVITY ANALYSIS FOR MATCHING ESTIMATORS

SIMULATION-BASED SENSITIVITY ANALYSIS FOR MATCHING ESTIMATORS SIMULATION-BASED SENSITIVITY ANALYSIS FOR MATCHING ESTIMATORS TOMMASO NANNICINI universidad carlos iii de madrid UK Stata Users Group Meeting London, September 10, 2007 CONTENT Presentation of a Stata

More information

Covariate Balancing Propensity Score for General Treatment Regimes

Covariate Balancing Propensity Score for General Treatment Regimes Covariate Balancing Propensity Score for General Treatment Regimes Kosuke Imai Princeton University October 14, 2014 Talk at the Department of Psychiatry, Columbia University Joint work with Christian

More information

Independent and conditionally independent counterfactual distributions

Independent and conditionally independent counterfactual distributions Independent and conditionally independent counterfactual distributions Marcin Wolski European Investment Bank M.Wolski@eib.org Society for Nonlinear Dynamics and Econometrics Tokyo March 19, 2018 Views

More information

Regression Discontinuity

Regression Discontinuity Regression Discontinuity Christopher Taber Department of Economics University of Wisconsin-Madison October 16, 2018 I will describe the basic ideas of RD, but ignore many of the details Good references

More information

Impact Evaluation Technical Workshop:

Impact Evaluation Technical Workshop: Impact Evaluation Technical Workshop: Asian Development Bank Sept 1 3, 2014 Manila, Philippines Session 19(b) Quantile Treatment Effects I. Quantile Treatment Effects Most of the evaluation literature

More information

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case Arthur Lewbel Boston College Original December 2016, revised July 2017 Abstract Lewbel (2012)

More information

Instrumental Variables in Models with Multiple Outcomes: The General Unordered Case

Instrumental Variables in Models with Multiple Outcomes: The General Unordered Case DISCUSSION PAPER SERIES IZA DP No. 3565 Instrumental Variables in Models with Multiple Outcomes: The General Unordered Case James J. Heckman Sergio Urzua Edward Vytlacil June 2008 Forschungsinstitut zur

More information

Trends in the Relative Distribution of Wages by Gender and Cohorts in Brazil ( )

Trends in the Relative Distribution of Wages by Gender and Cohorts in Brazil ( ) Trends in the Relative Distribution of Wages by Gender and Cohorts in Brazil (1981-2005) Ana Maria Hermeto Camilo de Oliveira Affiliation: CEDEPLAR/UFMG Address: Av. Antônio Carlos, 6627 FACE/UFMG Belo

More information

Lecture 8. Roy Model, IV with essential heterogeneity, MTE

Lecture 8. Roy Model, IV with essential heterogeneity, MTE Lecture 8. Roy Model, IV with essential heterogeneity, MTE Economics 2123 George Washington University Instructor: Prof. Ben Williams Heterogeneity When we talk about heterogeneity, usually we mean heterogeneity

More information

Parametric Identification of Multiplicative Exponential Heteroskedasticity

Parametric Identification of Multiplicative Exponential Heteroskedasticity Parametric Identification of Multiplicative Exponential Heteroskedasticity Alyssa Carlson Department of Economics, Michigan State University East Lansing, MI 48824-1038, United States Dated: October 5,

More information

Robust Confidence Intervals for Average Treatment Effects under Limited Overlap

Robust Confidence Intervals for Average Treatment Effects under Limited Overlap DISCUSSION PAPER SERIES IZA DP No. 8758 Robust Confidence Intervals for Average Treatment Effects under Limited Overlap Christoph Rothe January 2015 Forschungsinstitut zur Zukunft der Arbeit Institute

More information

Why high-order polynomials should not be used in regression discontinuity designs

Why high-order polynomials should not be used in regression discontinuity designs Why high-order polynomials should not be used in regression discontinuity designs Andrew Gelman Guido Imbens 6 Jul 217 Abstract It is common in regression discontinuity analysis to control for third, fourth,

More information

Estimating and Using Propensity Score in Presence of Missing Background Data. An Application to Assess the Impact of Childbearing on Wellbeing

Estimating and Using Propensity Score in Presence of Missing Background Data. An Application to Assess the Impact of Childbearing on Wellbeing Estimating and Using Propensity Score in Presence of Missing Background Data. An Application to Assess the Impact of Childbearing on Wellbeing Alessandra Mattei Dipartimento di Statistica G. Parenti Università

More information