Econ 673: Microeconometrics
Chapter 12: Estimating Treatment Effects

The Problem

Analysts are frequently interested in measuring the impact of a treatment on individual behavior; e.g., the impact of
- job training programs on income
- 401(k)s on household savings
- teenage pregnancy on high school drop-out or college graduation rates
- environmental regulations on pollution levels

Randomized experiments are typically not an option for cost and/or ethical reasons.

Comparisons of treatment and nontreatment outcomes in a nonexperimental setting are contaminated by the treatment selection process.

The Problem (cont'd)

Lalonde (1986, AER) used data from an actual experiment (the National Supported Work Demonstration Experiment) to study the performance of non-experimental estimators:
- simple regression adjustments
- difference-in-differences
- two-step Heckman adjustment

He found that
- alternative estimators produced very different estimates
- most deviated substantially from the experimental benchmarks

In recent years there has been a boom in the development of alternative non-experimental estimators.

Alternative Solutions
- Matching
- Instrumental Variables
- Control Functions

The Literature - Theory

- *Wooldridge, J. M. (2002), Econometric Analysis of Cross Section and Panel Data, Cambridge: The MIT Press, Ch. 18.
- Heckman, J., and S. Navarro-Lozano (2004), Using Matching, Instrumental Variables, and Control Functions to Estimate Economic Choice Models, The Review of Economics and Statistics 86(1): 30-57.
- Rosenbaum, P., and D. Rubin (1983), The Central Role of the Propensity Score in Observational Studies for Causal Effects, Biometrika 70(1): 41-55.
- Dehejia, R. H., and S. Wahba (2002), Propensity Score-Matching Methods for Nonexperimental Causal Studies, The Review of Economics and Statistics 84(1): 151-161.
- Heckman, J., H. Ichimura, J. Smith, and P. Todd (1998), Characterizing Selection Bias Using Experimental Data, Econometrica 66(5): 1017-1098.
- Heckman, J., H. Ichimura, and P. Todd (1997), Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme, Review of Economic Studies 64: 605-654.
- Heckman, J., H. Ichimura, and P. Todd (1998), Matching as an Econometric Evaluation Estimator, Review of Economic Studies 65: 261-294.
- *Smith, J., and P. Todd (2005), Does Matching Overcome LaLonde's Critique of Nonexperimental Estimators? Journal of Econometrics 125(1-2): 305-53.
- Abadie, A., and G. Imbens (2004), Large Sample Properties of Matching Estimators for Average Treatment Effects, working paper, January.

The Literature - Applications

- Benjamin, D. (2003), Does 401(k) Eligibility Increase Saving? Evidence from Propensity Score Subclassification, Journal of Public Economics 87: 1259-1290.
- Jalan, J., and M. Ravallion (2003), Does Piped Water Reduce Diarrhea for Children in Rural India? Journal of Econometrics 112: 153-173.
- Jalan, J., and M. Ravallion (2003), Estimating the Benefit Incidence of an Antipoverty Program by Propensity-Score Matching, Journal of Business and Economic Statistics 21(1): 19-30.
- Levine, D., and G. Painter (2003), The Schooling Costs of Teenage Out-of-Wedlock Childbearing: Analysis with a Within-School Propensity-Score-Matching Estimator, The Review of Economics and Statistics 85(4): 884-900.
- *List, J., D. Millimet, P. Fredriksson, and W. McHone (2003), Effects of Environmental Regulations on Manufacturing Plant Births: Evidence from a Propensity Score Matching Estimator, The Review of Economics and Statistics 85(4): 944-52.
- Park, A., S. Wang, and G. Wu (2002), Regional Poverty Targeting in China, Journal of Public Economics 86: 123-153.

Notation

The choice of the treatment is assumed to be determined in the fashion of a standard RUM model, with

\[ V = \mu_V(Z, U_V), \qquad D = 1(V > 0) \]

where
- $Z$ denotes factors observed by the analyst
- $U_V$ denotes factors unobserved by the analyst, but known to the decision maker

Potential Outcomes

Let $Y_1$ and $Y_0$ denote the outcome with and without the treatment, where

\[ Y_1 = \mu_1(X, U_1), \quad D = 1 \]
\[ Y_0 = \mu_0(X, U_0), \quad D = 0 \]

The individual level treatment effect is given by

\[ \Delta = Y_1 - Y_0 \]

Additively separable specifications are often considered, with

\[ V = \mu_V(Z) + U_V, \qquad E(U_V) = 0 \]
\[ Y_1 = \mu_1(X) + U_1, \qquad E(U_1) = 0 \]
\[ Y_0 = \mu_0(X) + U_0, \qquad E(U_0) = 0 \]

Parameters of Interest

Three different treatment effects are typically of interest:

1. The average treatment effect
\[ ATE: \; E(Y_1 - Y_0 \mid X) \]
2. The treatment on the treated
\[ TT: \; E(Y_1 - Y_0 \mid X, D = 1) \]
3. The marginal treatment effect
\[ MTE: \; E(Y_1 - Y_0 \mid X, Z, U_V = u_V) \]

The Selection Problem in a Regression Context

The fundamental problem is that each individual is only observed in one state of the world; i.e., we only observe

\[ Y = D Y_1 + (1 - D) Y_0 = D[\mu_1(X) + U_1] + (1 - D)[\mu_0(X) + U_0] = \mu_0(X) + D[\mu_1(X) - \mu_0(X)] + \varepsilon \]

where $\mu_1(X) - \mu_0(X) = ATE(X)$ and

\[ \varepsilon \equiv D U_1 + (1 - D) U_0 . \]

Unfortunately, unless the treatment assignment is randomized,

\[ E(\varepsilon \mid X, D) \neq 0 . \]
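The selection problem can be illustrated with a small simulation. The sketch below is not from the slides: the outcome means, the 0.8 loading that ties the selection unobservable to $U_0$, and the function name `naive_vs_ate` are all assumptions chosen for illustration. It compares the naive treated-minus-untreated difference in means with the average treatment effect, once under self-selection and once under randomized assignment.

```python
import random
from statistics import mean


def naive_vs_ate(n=20000, randomized=False, seed=0):
    """Return (naive difference in means, sample average treatment effect)."""
    rng = random.Random(seed)
    y_treated, y_untreated, effects = [], [], []
    for _ in range(n):
        u0 = rng.gauss(0.0, 1.0)              # unobservable in the untreated outcome
        u1 = rng.gauss(0.0, 1.0)              # unobservable in the treated outcome
        u_v = 0.8 * u0 + rng.gauss(0.0, 1.0)  # selection unobservable, correlated with u0
        y0 = 1.0 + u0                         # assumed mu_0 = 1 (no covariates)
        y1 = 2.0 + u1                         # assumed mu_1 = 2, so the true ATE is 1
        if randomized:
            d = rng.random() < 0.5            # coin-flip assignment
        else:
            d = u_v > 0                       # self-selection: D = 1(V > 0)
        effects.append(y1 - y0)
        # Only one potential outcome is observed per individual.
        if d:
            y_treated.append(y1)
        else:
            y_untreated.append(y0)
    return mean(y_treated) - mean(y_untreated), mean(effects)


naive_sel, ate_sel = naive_vs_ate(randomized=False)
naive_rnd, ate_rnd = naive_vs_ate(randomized=True)
print(f"self-selected: naive = {naive_sel:.2f}, ATE = {ate_sel:.2f}")
print(f"randomized:    naive = {naive_rnd:.2f}, ATE = {ate_rnd:.2f}")
```

Under self-selection the untreated group is drawn disproportionately from individuals with low $U_0$, so the naive comparison overstates the effect; under randomization the two numbers should roughly agree.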

The Biases

From the samples, we can compute

\[ E(Y_1 \mid X, Z, D = 1) = E(Y \mid X, Z, D = 1) \]
\[ E(Y_0 \mid X, Z, D = 0) = E(Y \mid X, Z, D = 0) \]

Integrating out $Z$ yields

\[ E(Y_1 \mid X, D = 1) \quad \text{and} \quad E(Y_0 \mid X, D = 0) \]

The resulting bias from comparing the $(D = 1)$ and $(D = 0)$ means:

\[ \text{Bias}_{TT} = [E(Y_1 \mid X, D = 1) - E(Y_0 \mid X, D = 0)] - E(Y_1 - Y_0 \mid X, D = 1) = E(Y_0 \mid X, D = 1) - E(Y_0 \mid X, D = 0) \]

The Biases (cont'd)

For ATE:

\[ \text{Bias}_{ATE} = [E(Y_1 \mid X, D = 1) - E(Y_0 \mid X, D = 0)] - E(Y_1 - Y_0 \mid X) = [E(Y_1 \mid X, D = 1) - E(Y_1 \mid X)] - [E(Y_0 \mid X, D = 0) - E(Y_0 \mid X)] \]

For MTE:

\[ \text{Bias}_{MTE} = [E(Y_1 \mid X, Z, D = 1) - E(Y_0 \mid X, Z, D = 0)] - E(Y_1 - Y_0 \mid X, Z, U_V = u_V) = [E(Y_1 \mid X, Z, D = 1) - E(Y_1 \mid X, Z, U_V = u_V)] - [E(Y_0 \mid X, Z, D = 0) - E(Y_0 \mid X, Z, U_V = u_V)] \]

Ignorability of Treatment

Matching methods are based on the ignorability of treatment assumption introduced by Rosenbaum and Rubin (1983).

Assumption ATE.1: Conditional on $W = (X, Z)$, $D$ and $(Y_0, Y_1)$ are independent:

\[ (Y_0, Y_1) \perp D \mid W \]

A less restrictive version that sometimes suffices is

Assumption ATE.1':

\[ E(Y_0 \mid W, D) = E(Y_0 \mid W) \quad \text{and} \quad E(Y_1 \mid W, D) = E(Y_1 \mid W) \]

This is often referred to as selection on observables.

Ignorability of Treatment (cont'd)

The key to the benefit of ignorability is that it suggests that, even though $(Y_0, Y_1)$ and $D$ might be correlated, once we control for $W$ they are uncorrelated:

\[ E(Y_1 \mid W, D = 0) = E(Y_1 \mid W, D = 1) = E(Y_1 \mid W) \]
\[ E(Y_0 \mid W, D = 1) = E(Y_0 \mid W, D = 0) = E(Y_0 \mid W) \]

By conditioning on $W$, we can construct the missing counterfactuals.

Note: If we are interested in TT, then we only need the weaker assumption that

\[ Y_0 \perp D \mid W \]
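To make the role of the assumption concrete, the short derivation below (a worked step added for illustration, using only the notation already defined) shows why the observable contrast between treated and untreated means at a given $W$ recovers the treatment effect once ATE.1' holds.

```latex
\begin{aligned}
E(Y \mid W, D = 1) - E(Y \mid W, D = 0)
  &= E(Y_1 \mid W, D = 1) - E(Y_0 \mid W, D = 0) \\
  &= E(Y_1 \mid W) - E(Y_0 \mid W) \qquad \text{(by ATE.1')} \\
  &= E(Y_1 - Y_0 \mid W)
\end{aligned}
```

The first line uses $Y = D Y_1 + (1 - D) Y_0$; averaging the last line over the distribution of $W$ gives the unconditional average treatment effect.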

Making Use of Ignorability

There are several ways in which we can use the ignorability assumption.

1. Since we have a random sample on $(Y, D, W)$, we can estimate (even nonparametrically):

\[ r_1(W) \equiv E(Y \mid W, D = 1) \]
\[ r_0(W) \equiv E(Y \mid W, D = 0) \]

Given consistent estimators of these functions, a consistent estimator of ATE is

\[ \widehat{ATE} = N^{-1} \sum_{i=1}^{N} [\hat{r}_1(W_i) - \hat{r}_0(W_i)] \]

Making Use of Ignorability (cont'd)

Similarly,

\[ \widehat{TT} = \left( \sum_{i=1}^{N} D_i \right)^{-1} \sum_{i=1}^{N} D_i [\hat{r}_1(W_i) - \hat{r}_0(W_i)] \]

2. Alternatively, if $W$ can take on a finite number of values, i.e., $W \in \{w_1, \ldots, w_M\}$, then we can compute

\[ \tau_{jm} = E(Y_j \mid W = w_m, D = j) \]
\[ \widehat{ATE} = \sum_{m=1}^{M} s_m [\hat{\tau}_{1m} - \hat{\tau}_{0m}] \]

where $s_m$ denotes the population proportion of type $m$.
- a form of matching
- difficult if $M$ is large
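The finite-$W$ case can be sketched in a few lines. The code below is an illustrative sketch, not the course's implementation: the function names, the three-cell design, and the propensities are assumptions, and the sample cell shares stand in for $s_m$. It averages within-cell mean differences with full-sample cell shares for ATE and with treated-sample cell shares for TT.

```python
import random
from collections import defaultdict


def cell_mean_effects(data):
    """data: iterable of (y, d, w) with d in {0, 1} and discrete w.

    Returns (ate_hat, tt_hat) built from within-cell mean differences.
    """
    total = defaultdict(lambda: [0.0, 0.0])  # w -> [sum of y for d=0, for d=1]
    count = defaultdict(lambda: [0, 0])      # w -> [n with d=0, n with d=1]
    for y, d, w in data:
        total[w][d] += y
        count[w][d] += 1
    n = sum(n0 + n1 for n0, n1 in count.values())
    n_treated = sum(n1 for _, n1 in count.values())
    ate_hat = tt_hat = 0.0
    for w, (n0, n1) in count.items():
        if n0 == 0 or n1 == 0:
            raise ValueError(f"no overlap in cell w={w}")
        diff = total[w][1] / n1 - total[w][0] / n0  # tau_hat_1m - tau_hat_0m
        ate_hat += (n0 + n1) / n * diff             # weight: sample share of cell m
        tt_hat += n1 / n_treated * diff             # weight: share of cell m among treated
    return ate_hat, tt_hat


def simulate(n=30000, seed=1):
    """Hypothetical design: ATE = 1.5 and TT = 1.7 by construction."""
    rng = random.Random(seed)
    p = {0: 0.2, 1: 0.5, 2: 0.8}                    # assumed Pr(D = 1 | W)
    data = []
    for _ in range(n):
        w = rng.choice([0, 1, 2])
        d = int(rng.random() < p[w])
        y0 = w + rng.gauss(0.0, 1.0)
        y1 = y0 + 1.0 + 0.5 * w                     # effect rises with w
        data.append((y1 if d else y0, d, w))
    return data


ate_hat, tt_hat = cell_mean_effects(simulate())
print(f"ATE estimate: {ate_hat:.2f}, TT estimate: {tt_hat:.2f}")
```

Because treatment is more likely in cells where the effect is larger, TT exceeds ATE in this design. The `ValueError` marks the practical difficulty noted above: with many cells, some will contain no treated or no untreated observations.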

Using the Propensity Score

The ignorability assumption is less useful if $W$ is of high dimensionality.

Rosenbaum and Rubin (1983, Theorem 3) reduce the dimensionality problem using the propensity score. Suppose

\[ (Y_0, Y_1) \perp D \mid W, \quad \text{and} \quad 0 < p(W) < 1 \]

where

\[ p(W) = \Pr(D = 1 \mid W) \]

Rosenbaum and Rubin (1983, Theorem 3) show that

\[ (Y_0, Y_1) \perp D \mid p(W) \quad \text{and} \quad 0 < \Pr(D = 1 \mid p(W)) < 1 \]

This is referred to as strong ignorability of treatment.

Using the Propensity Score (cont'd)

Again, we can now construct the counterfactuals of interest:

\[ E(Y_1 \mid p(W), D = 0) = E(Y_1 \mid p(W), D = 1) = E(Y_1 \mid p(W)) \]
\[ E(Y_0 \mid p(W), D = 1) = E(Y_0 \mid p(W), D = 0) = E(Y_0 \mid p(W)) \]

Note, however, that we are ruling out the $p(W) = 1$ and $p(W) = 0$ cases:
- we want a good model of $p(W)$, but not too good

Using the Propensity Score (cont'd)

Strong ignorability implies that

\[ ATE = E\left\{ \frac{[D - p(W)]\, Y}{p(W)[1 - p(W)]} \right\} \]
\[ TT = E\left\{ \frac{[D - p(W)]\, Y}{1 - p(W)} \right\} \Big/ \Pr(D = 1) \]

Given a consistent estimator of $p(W)$, we then have

\[ \widehat{ATE} = N^{-1} \sum_{i=1}^{N} \frac{[D_i - \hat{p}(W_i)]\, Y_i}{\hat{p}(W_i)[1 - \hat{p}(W_i)]} \]
\[ \widehat{TT} = \left( N^{-1} \sum_{i=1}^{N} D_i \right)^{-1} N^{-1} \sum_{i=1}^{N} \frac{[D_i - \hat{p}(W_i)]\, Y_i}{1 - \hat{p}(W_i)} \]

Propensity Score Matching Estimators

PSM estimators take the form:

\[ \hat{\tau} = \frac{1}{n_1} \sum_{i \in I_1 \cap S_P} [Y_{1i} - \hat{Y}_{0i}] \]

with

\[ \hat{Y}_{0i} = \sum_{j \in I_0} \hat{W}(i, j)\, Y_{0j} \]

where
- $I_1$ denotes the set of treatment observations
- $n_1$ denotes the number of treatment observations
- $I_0$ denotes the set of comparison observations
- $S_P$ denotes the region of common support
- $\hat{W}(i, j)$ are weights that depend upon the distance between the propensity scores for $i$ and $j$
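The propensity score weighting estimators of ATE and TT can be sketched as follows. This is a minimal illustration under stated assumptions: $W$ is discrete so that $\hat{p}(W)$ can be estimated by within-cell treated shares (a stand-in for a probit or logit fit), and the function names and the simulated design are hypothetical.

```python
import random
from collections import Counter


def ipw_ate(y, d, p):
    """N^{-1} * sum of [D_i - p_i] Y_i / (p_i [1 - p_i])."""
    n = len(y)
    return sum((di - pi) * yi / (pi * (1.0 - pi)) for yi, di, pi in zip(y, d, p)) / n


def ipw_tt(y, d, p):
    """(N^{-1} sum D_i)^{-1} * N^{-1} * sum of [D_i - p_i] Y_i / [1 - p_i]."""
    n = len(y)
    share_treated = sum(d) / n
    weighted = sum((di - pi) * yi / (1.0 - pi) for yi, di, pi in zip(y, d, p)) / n
    return weighted / share_treated


def frequency_pscore(d, w):
    """Estimate p(W) by the treated share within each cell of a discrete W."""
    n_cell, n_treated = Counter(w), Counter()
    for di, wi in zip(d, w):
        n_treated[wi] += di
    return [n_treated[wi] / n_cell[wi] for wi in w]


# Hypothetical design: ATE = 1.5 and TT = 1.7 by construction.
rng = random.Random(2)
true_p = {0: 0.2, 1: 0.5, 2: 0.8}          # assumed Pr(D = 1 | W)
y, d, w = [], [], []
for _ in range(30000):
    wi = rng.choice([0, 1, 2])
    di = int(rng.random() < true_p[wi])
    y0 = wi + rng.gauss(0.0, 1.0)
    y1 = y0 + 1.0 + 0.5 * wi               # effect rises with w
    y.append(y1 if di else y0)
    d.append(di)
    w.append(wi)

p_hat = frequency_pscore(d, w)
ate_hat, tt_hat = ipw_ate(y, d, p_hat), ipw_tt(y, d, p_hat)
print(f"IPW ATE: {ate_hat:.2f}, IPW TT: {tt_hat:.2f}")
```

Both estimators divide by $\hat{p}$ or $1 - \hat{p}$, so estimated scores close to 0 or 1 produce very large weights; this is one practical reason the overlap condition $0 < p(W) < 1$ matters.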

The Choice of Weights

Nearest neighbor matching

\[ \hat{W}(i, j) = \begin{cases} 1 & j = \arg\min_{k \in I_0} |\hat{P}_i - \hat{P}_k| \\ 0 & \text{otherwise} \end{cases} \]

- frequently used because of ease of implementation
- a single alternative individual serves as the counterfactual for the treated individual

Nearest k neighbors matching
- trades off reduced variance (more information is used to construct the counterfactual) against increased bias (on average, poorer fits)

The Choice of Weights (cont'd)

Caliper matching

\[ \hat{W}(i, j) = \begin{cases} \dfrac{1}{n_i} & |\hat{P}_i - \hat{P}_j| < c \\ 0 & \text{otherwise} \end{cases} \]

- $n_i$ denotes the number of caliper matches for $i$
- Note: Treated individuals for whom no matches can be found are excluded from the analysis

Stratification matching

\[ \hat{W}(i, j) = \begin{cases} \dfrac{1}{n_i} & \hat{P}_j \in T_i \\ 0 & \text{otherwise} \end{cases} \]

- $T_i$ denotes the propensity score stratum for $i$
- $n_i$ denotes the number of strata matches for $i$
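The nearest neighbor and caliper weights, together with the matching estimator they feed into, can be sketched as below. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, the propensity scores are taken as given, the data are made up, and the average is taken over treated units that actually receive a match (one reading of $n_1$ when unmatched units are dropped).

```python
def nearest_neighbor_weights(p_i, p_controls):
    """Weight 1 on the single comparison unit with the closest score, 0 elsewhere."""
    best = min(range(len(p_controls)), key=lambda k: abs(p_i - p_controls[k]))
    return [1.0 if j == best else 0.0 for j in range(len(p_controls))]


def caliper_weights(p_i, p_controls, c):
    """Equal weights 1/n_i on all comparison units with |p_i - p_j| < c."""
    inside = [abs(p_i - p_j) < c for p_j in p_controls]
    n_i = sum(inside)
    return [1.0 / n_i if hit else 0.0 for hit in inside]


def psm_tt(treated, controls, weight_fn):
    """treated, controls: lists of (propensity score, outcome) pairs.

    Averages Y_1i - sum_j W(i, j) Y_0j over treated units that receive a match.
    Returns (estimate, number of matched treated units).
    """
    p_controls = [p for p, _ in controls]
    gaps = []
    for p_i, y_1i in treated:
        weights = weight_fn(p_i, p_controls)
        if sum(weights) == 0.0:
            continue  # no match found: drop this treated unit
        y_0i_hat = sum(w_ij * y_0j for w_ij, (_, y_0j) in zip(weights, controls))
        gaps.append(y_1i - y_0i_hat)
    return sum(gaps) / len(gaps), len(gaps)


# Made-up scores and outcomes, purely for illustration.
treated = [(0.60, 10.0), (0.30, 7.0), (0.05, 5.0)]
controls = [(0.58, 8.0), (0.35, 6.0), (0.90, 12.0)]

nn_est, nn_n = psm_tt(treated, controls, nearest_neighbor_weights)
cal_est, cal_n = psm_tt(treated, controls,
                        lambda p_i, pc: caliper_weights(p_i, pc, c=0.26))
print(f"nearest neighbor: {nn_est:.3f} using {nn_n} treated units")
print(f"caliper (c=0.26): {cal_est:.3f} using {cal_n} treated units")
```

In this toy example nearest neighbor matching keeps the treated unit with score 0.05 even though its closest comparison unit is 0.30 away, while the caliper rule drops it; the two estimates differ for exactly that reason.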

Matching Decisions (cont'd)

Kernel matching (e.g., Heckman, Ichimura, and Todd; 1997, 1998)

\[ \hat{W}(i, j) = \frac{G\left( \dfrac{\hat{P}_j - \hat{P}_i}{a_n} \right)}{\sum_{k \in I_0} G\left( \dfrac{\hat{P}_k - \hat{P}_i}{a_n} \right)} \]

- $G(\cdot)$ is a kernel function; e.g., the biweight kernel $G(s) = \frac{15}{16}(s^2 - 1)^2$ for $|s| \le 1$ and 0 otherwise
- $a_n$ is a bandwidth parameter

Local linear matching: Fan (1992)

Other Matching Decisions

- matching with or without replacement
  - again, the tradeoff here is between bias and variance
- trimming the support region
  - focus the analysis on that region such that
\[ \Pr(\hat{p}(W) > 0) > 0, \qquad \Pr(1 - \hat{p}(W) > 0) > 0 \]
  - nonparametric density estimators can be used for $p(W)$
  - typically, stricter requirements are placed on the support, with
\[ \Pr(\hat{p}(W) > 0) > c, \qquad \Pr(1 - \hat{p}(W) > 0) > c \]
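The kernel weights can be computed directly from the formula. The sketch below assumes the biweight kernel with support $|s| \le 1$, takes the propensity scores as given, and uses made-up scores and a bandwidth of 0.2; the function names are hypothetical.

```python
def biweight(s):
    """Biweight (quartic) kernel: (15/16)(s^2 - 1)^2 on |s| <= 1, zero outside."""
    return 15.0 / 16.0 * (s * s - 1.0) ** 2 if abs(s) <= 1.0 else 0.0


def kernel_weights(p_i, p_controls, bandwidth):
    """W(i, j) = G((p_j - p_i) / a) / sum_k G((p_k - p_i) / a).

    Returns all zeros when no comparison unit lies within one bandwidth of p_i.
    """
    raw = [biweight((p_j - p_i) / bandwidth) for p_j in p_controls]
    total = sum(raw)
    return [g / total for g in raw] if total > 0.0 else raw


# Made-up scores: one treated unit at 0.60 and four comparison units.
weights = kernel_weights(0.60, [0.58, 0.35, 0.90, 0.70], bandwidth=0.2)
print([round(w_ij, 3) for w_ij in weights])
```

Comparison units more than one bandwidth away receive zero weight, and the remaining weights sum to one with closer units weighted more heavily. A smaller bandwidth uses fewer, closer matches (less bias, more variance), mirroring the tradeoff noted for nearest k neighbors matching.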

Other Matching Decisions (cont'd)

- difference-in-difference matching
  - uses time series differencing to eliminate unobserved, temporally invariant effects
  - requires before- and after-treatment observations for both treated and untreated individuals
- conditional matching (e.g., common region, school, etc.)
- the choice of the comparison sample. Heckman et al. (1997, 1998) argue for the following criteria:
  - same data source
  - individuals reside in the same market
  - data contain a rich set of variables affecting outcomes and treatment group

Example #1: Heckman, Ichimura, and Todd (1997) (HIT7)

Use data from the National Job Training Partnership Act (JTPA) Experiment, including
- randomized-out controls
- an eligible nonparticipants comparison group

In this paper, the authors
- decompose the bias differences in earnings
- test the assumptions underlying matching, rejecting most of them
- evaluate the performance of different matching routines
- emphasize the importance of a good comparison group

Decomposing Evaluation Bias in TT

The bias in PSME can be decomposed as follows:

\[ B = \int_{S_1} E(Y_0 \mid X, D = 1)\, f(X \mid D = 1)\, dX - \int_{S_0} E(Y_0 \mid X, D = 0)\, f(X \mid D = 0)\, dX = B_1 + B_2 + B_3 \]

where

\[ B_1 = \int_{S_1 \setminus S_{10}} E(Y_0 \mid X, D = 1)\, f(X \mid D = 1)\, dX - \int_{S_0 \setminus S_{10}} E(Y_0 \mid X, D = 0)\, f(X \mid D = 0)\, dX \]

is the bias due to nonoverlapping support,

Decomposing Evaluation Bias in TT (cont'd)

\[ B_2 = \int_{S_{10}} E(Y_0 \mid X, D = 0)\, \{ f(X \mid D = 1) - f(X \mid D = 0) \}\, dX \]

is the bias due to differing distributions in $X$, and

\[ B_3 = \int_{S_{10}} \{ E(Y_0 \mid X, D = 1) - E(Y_0 \mid X, D = 0) \}\, f(X \mid D = 1)\, dX \]

is the bias due to selection on unobservables.

PSME attempts to address $B_1$ and $B_2$, but assumes away $B_3$.

[Exhibit: Overlap - Adult Males (HIT7)]

[Exhibit: Overlap - Male Youths (HIT7)]

[Exhibit: Overlap - Male Youths (HIT7)]

[Exhibit: Decomposition of Bias (HIT7)]

[Exhibit: Testing Key Assumptions (HIT7)]

[Exhibit: Testing Key Assumptions, cont'd (HIT7)]

[Exhibit: Testing Key Assumptions, cont'd (HIT7). Annotations: max correct predictions; increasing number of explanatory variables in the p(W) model]

Example #2: Dehejia and Wahba (2002) REStat

- Use data on the National Supported Work (NSW) demonstration
  - this is a randomized experiment
- DW compare experimental treatment effect estimates to those obtained using two comparison samples:
  - Panel Study of Income Dynamics (PSID)
  - Current Population Survey (CPS)
- A variety of matching algorithms are considered

Example #3: Smith and Todd (2005)

Repeat the exercise in DW, but
- investigate alternative sample definitions
- estimate bias by using PSMEs on NSW randomized controls
- add difference-in-difference matching

General conclusions:
- PSME are not a silver bullet for nonexperimental situations
- The performance of PSME in DW is not generalizable, varying by sample definition
- Difference-in-difference matching performed substantially better than cross-sectional matching alone
- Details of the matching procedure generally had little impact, including
  - type of matching (nearest neighbor, local linear, etc.)
  - propensity score estimation procedure

Example #4: List, Millimet, Fredriksson, and McHone (2003) REStat

- Treatment: Nonattainment designation
- Outcome of interest: County level dirty plant births in New York
- 176 treatment observations
- Caliper matching
  - conditional matching considered for
    - within region and year
    - within year
  - matches are obtained for 8 to 81 of the treatment observations (depending on the use of conditional matches)
- Difference-in-difference estimates using clean plant births as control

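A Simple Experiment

Let

\[ Y_{1i} = 2 + 2X_1 + X_2 + 2X_3 + \varepsilon_{1i} \]
\[ Y_{0i} = 1 + X_1 + 2X_2 + X_3 + \varepsilon_{0i} \]
\[ Y_{Di} = 4 + X_1 + X_2 + X_3 + X_4 + \varepsilon_{Di} \]

with

\[ (\varepsilon_{1i}, \varepsilon_{0i}, \varepsilon_{Di}) \sim N(0, I_3), \qquad X \sim N(1, \Sigma) \]
\[ \Sigma = \sigma_D^2 \begin{pmatrix} 1 & \rho & \rho & \rho \\ \rho & 1 & \rho & \rho \\ \rho & \rho & 1 & \rho \\ \rho & \rho & \rho & 1 \end{pmatrix} \]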
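A design of this kind can be simulated to see how large the naive treatment-nontreatment comparison bias is. The sketch below rests on several assumptions that are not legible in the source and should be treated as such: treatment is assigned as $D_i = 1(Y_{Di} > 0)$, the intercept of the treatment index is passed as a parameter and set to $-4$ here (so that roughly half the sample is treated), and $\rho = 0.5$ and $\sigma_D = 1$ are arbitrary illustrative values. It does not implement the PSME itself.

```python
import math
import random
from statistics import mean


def draw_x(rng, sigma_d, rho):
    """Four covariates with mean 1, variance sigma_d^2, common correlation rho >= 0."""
    common = rng.gauss(0.0, 1.0)  # shared factor that induces the equicorrelation
    return [1.0 + sigma_d * (math.sqrt(rho) * common
                             + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0))
            for _ in range(4)]


def one_sample(n, sigma_d, rho, intercept, rng):
    """Return (naive treated-minus-untreated mean, mean of Y1 - Y0 among the treated)."""
    y_treated, y_untreated, gains_treated = [], [], []
    for _ in range(n):
        x1, x2, x3, x4 = draw_x(rng, sigma_d, rho)
        y1 = 2 + 2 * x1 + x2 + 2 * x3 + rng.gauss(0.0, 1.0)
        y0 = 1 + x1 + 2 * x2 + x3 + rng.gauss(0.0, 1.0)
        index = intercept + x1 + x2 + x3 + x4 + rng.gauss(0.0, 1.0)
        if index > 0:                       # assumed rule: treated when the index is positive
            y_treated.append(y1)
            gains_treated.append(y1 - y0)   # observable only because this is a simulation
        else:
            y_untreated.append(y0)
    return mean(y_treated) - mean(y_untreated), mean(gains_treated)


rng = random.Random(3)
naive, tt = one_sample(n=20000, sigma_d=1.0, rho=0.5, intercept=-4.0, rng=rng)
print(f"naive difference: {naive:.2f}, TT in sample: {tt:.2f}, bias: {naive - tt:.2f}")
```

Because the covariates that raise the treatment index also raise $Y_0$, treated individuals would have had higher outcomes even without treatment, and the naive comparison overstates TT. Repeating the draw over many samples and values of `sigma_d` would give a root mean squared error for the naive comparison.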

[Figure: RMSE Using Full Set of Conditioning Variables. Horizontal axis: sigmad; series: Treatment-NonTreatment, PSME]

[Figure: RMSE Omitting X_1. Horizontal axis: sigmad; series: Treatment-NonTreatment, PSME]

[Figure: RMSE Omitting X_2. Horizontal axis: sigmad; series: Treatment-NonTreatment, PSME]

Other Issues

- Post-matching verification that treatment and matched group characteristics are similar
- Limited common support
- Standard errors: practitioners frequently ignore uncertainty in the matching process
- Large sample properties: Abadie and Imbens (2004) working paper
- Multiple treatments
- Lechner, M. (2002), Program Heterogeneity and Propensity Score Matching: An Application to the Evaluation of Active Labor Market Policies, Review of Economics and Statistics 84(2): 205-220.
- Bellio, R., and E. Gori (2003), Impact Evaluation of Job Training Programmes: Selection Bias in Multilevel Models, Journal of Applied Statistics 30(8): 893-907.