Empirical Methods in Applied Microeconomics


Jörn-Steffen Pischke, LSE, November

Nonlinearity and Heterogeneity

We have so far concentrated on the estimation of treatment effects when the treatment effect is a constant, i.e.

$$E[y_{1i} \mid d_i = 1] - E[y_{0i} \mid d_i = 1] = E[y_{1i} - y_{0i}] = \delta.$$

However, in reality treatment effects may be heterogeneous, so that each individual has their own $\delta_i$. In this case, the different averages, such as $E[y_{1i} - y_{0i}]$, the population average treatment effect, $E[y_{1i} \mid d_i = 1] - E[y_{0i} \mid d_i = 1]$, the treatment effect on the treated, and $E[y_{1i} \mid d_i = 0] - E[y_{0i} \mid d_i = 0]$, the treatment effect on the untreated, may differ from each other. Moreover, with continuous treatments, as in the returns to schooling example, the relationship between $y_i$ (earnings) and $s_i$ (the number of years of schooling) may be nonlinear, i.e. the return to the 10th year of schooling may be different from the return to the 16th year of schooling.

Nevertheless, simple linear regression and 2SLS still provide important tools, and possibly the most important tools, to analyze the data and summarize the results. First, it is important to note that OLS provides the best linear approximation to the population conditional expectation function $E[y_i \mid x_i]$. Hence, linear regression always has a legitimacy, independent of the true functional form of $E[y_i \mid x_i]$. The same is not true of other estimators, like GLS, which rely specifically on the linearity of the regression function. Of course, whether regression estimates something causal depends on how we feel about the population conditional expectation function: if $E[y_i \mid x_i]$ represents a causal relationship, then regression helps us estimate that relationship.
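The best-linear-approximation property can be illustrated numerically. The following is a minimal sketch, not part of the original notes: the data-generating process, the function name `micro_and_grouped_fits`, and all parameter values are hypothetical. It fits OLS to simulated micro data with a concave conditional expectation function, and then fits a weighted regression of the cell means (the sample conditional expectation function) on the regressor, weighting by cell size. The two sets of coefficients should coincide.

```python
import numpy as np

def micro_and_grouped_fits(seed=0, n=20_000):
    """Return (micro_coefs, grouped_coefs), each as [intercept, slope]."""
    rng = np.random.default_rng(seed)
    # Hypothetical data: discrete "schooling" with a concave (non-linear) CEF.
    s = rng.integers(8, 21, size=n).astype(float)   # values 8, ..., 20
    y = 5.0 + 2.0 * np.sqrt(s) + rng.normal(size=n)

    # (a) OLS of y on a constant and s, using the micro data.
    X = np.column_stack([np.ones(n), s])
    micro, *_ = np.linalg.lstsq(X, y, rcond=None)

    # (b) Weighted least squares of the cell means of y on s,
    #     weighting each cell by its number of observations.
    values, counts = np.unique(s, return_counts=True)
    cell_means = np.array([y[s == v].mean() for v in values])
    w = np.sqrt(counts)  # scaling rows by sqrt(count) weights squared residuals by count
    Xg = np.column_stack([np.ones(len(values)), values])
    grouped, *_ = np.linalg.lstsq(Xg * w[:, None], cell_means * w, rcond=None)
    return micro, grouped

if __name__ == "__main__":
    micro, grouped = micro_and_grouped_fits()
    print("micro-data OLS:      ", micro)
    print("weighted fit to CEF: ", grouped)
```

Because the two fits solve the same normal equations, the agreement is exact up to floating-point error, whatever the shape of the conditional expectation function.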

If the population conditional expectation function is non-linear, regression will provide a weighted average derivative (see more on this below). Much recent econometrics has focused on using non-linear models and calculating average derivatives along the non-linear function. This often requires a substantial technical apparatus. OLS gives us such an average derivative right off the bat, is easy to implement, and doesn't involve any researcher choices as to estimator specifics, kernels, bandwidths, etc. It is hence also easy to replicate.

One case where the population conditional expectation function is clearly non-linear is that of limited dependent variable models, like the binary choice model or regression models for censored variables. Many empirical researchers feel that these cases call for non-linear models like logit or probit for the binary choice model, or tobit for the censored regression model. However, the underlying regression function and its parameters in the probit or logit models are not of particular interest, compared to derivatives (or marginal effects) or the average effects of switching a dummy regressor off and on, which have a direct interpretation in terms of treatment effects.

When the dependent variable only takes on non-negative values, as in the case of hours of work, many researchers argue we should be interested in the conditional on positive (COP) effects, which are, for example, estimated by the tobit model. However, these effects have no direct causal interpretation. COP effects are estimates of the form

$$E[y_i \mid y_i > 0, d_i = 1] - E[y_i \mid y_i > 0, d_i = 0].$$

Note that this conditions on an outcome: $y_i > 0$, e.g. working positive hours. Hence, this identifies a valid causal effect (in terms of $y_{1i}$ and $y_{0i}$) plus a selection bias. The reason (and algebra) is the same as we saw before when we conditioned on outcome variables on the right hand side of a regression.
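The selection-bias point can be spelled out with a short decomposition. This is a sketch under the added assumption that $d_i$ is randomly assigned, so that conditioning on $d_i$ simply selects the corresponding potential outcome:

```latex
\begin{aligned}
E[y_i \mid y_i > 0, d_i = 1] - E[y_i \mid y_i > 0, d_i = 0]
  &= E[y_{1i} \mid y_{1i} > 0] - E[y_{0i} \mid y_{0i} > 0] \\
  &= \underbrace{E[y_{1i} - y_{0i} \mid y_{1i} > 0]}_{\text{causal effect}}
   + \underbrace{E[y_{0i} \mid y_{1i} > 0] - E[y_{0i} \mid y_{0i} > 0]}_{\text{selection bias}}
\end{aligned}
```

The first term is a causal effect for those with positive outcomes when treated. The second term compares $y_{0i}$ across two different groups (those with $y_{1i} > 0$ and those with $y_{0i} > 0$) and is in general non-zero whenever the treatment changes who has a positive outcome.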
In order to get back to causal effects, we need to combine the COP effect with the participation effect (going from $y_i = 0$ to $y_i > 0$) again. Regression automatically gives us a weighted average of these two effects.

Table in the back shows some estimates from a paper by Angrist and Evans (1998) investigating the effect of the number of children on the employment and hours of mothers. The first line in columns (2) to (4) shows results from regressions of a dummy variable for mothers' employment on whether the family has three or more kids. Column (2) is the OLS estimate, columns (3) and (4) are different average effects from a probit model. They are all identical. Effects on work hours, which can be zero or positive, are in columns (2), (5) and (6). The OLS results are again very similar to average effects from a tobit model. Columns (7) to (10) repeat the same exercise with the number of children, rather than just a dummy variable.

It is sometimes argued that OLS coefficients will recover the average effects from a probit as long as the mean of the dependent variable is close to 0.5, but not if it is further in the tail, where the probit function has more curvature. Panel B in the table investigates this case by focusing on college educated women over the age of 30 with older children, who have a much higher employment rate. However, the OLS and probit average effects are still fairly close in this case, although not as close as in Panel A.

1.1 Controlling for Observables

Matching

When we talked about regression, we relied on the selection on observables or conditional independence assumption (CIA) $E[y_{0i} \mid X_i, d_i] = E[y_{0i} \mid X_i]$. Instead of exploiting this condition using regression, the most obvious thing to do would be to exploit it directly and compute, say, the effect of treatment on the treated $E[y_{1i} - y_{0i} \mid d_i = 1]$ non-parametrically. By the law of iterated expectations and using the CIA,

$$\begin{aligned}
E[y_{1i} - y_{0i} \mid d_i = 1] &= E\{E[y_{1i} - y_{0i} \mid X_i, d_i = 1] \mid d_i = 1\} \\
&= E\{E[y_{1i} \mid X_i, d_i = 1] - E[y_{0i} \mid X_i, d_i = 0] \mid d_i = 1\} \\
&= E\{E[y_i \mid X_i, d_i = 1] - E[y_i \mid X_i, d_i = 0] \mid d_i = 1\} \\
&= E\{\delta_X \mid d_i = 1\},
\end{aligned}$$

where

$$\delta_X = E[y_i \mid X_i, d_i = 1] - E[y_i \mid X_i, d_i = 0].$$

$\delta_X$ is simply the difference in means between treated and untreated observations with covariate value $X_i = X$. With discrete $X$'s the effect of treatment on the treated is

$$\delta_M = E[y_{1i} - y_{0i} \mid d_i = 1] = \sum_X \delta_X \, P(X_i = X \mid d_i = 1),$$

where $P(X_i = X \mid d_i = 1)$ is the histogram of $X_i$ among the treated. The sample analogue to this is the non-parametric matching estimator.

Obviously $\delta_X$ is only defined when there are both treated and untreated observations for a particular covariate value. This means that what we are estimating in practice is actually

$$\delta_M = E[y_{1i} - y_{0i} \mid d_i = 1, \; 0 < P(d_i = 1 \mid X_i = X) < 1],$$

i.e. the probability of treatment in the covariate cell cannot be 0 or 1. This is known as common support, and it highlights that matching is only feasible in the region of common support.

Matching using the Propensity Score

Often there are continuous covariates, or the covariate vector is high dimensional, so that matching on every covariate combination is not feasible. An important result by Rosenbaum and Rubin (1983) implies that it is actually sufficient to match on certain functions of the covariates, and in particular on the propensity score $P(X_i) = P(d_i = 1 \mid X_i = X) = E(d_i \mid X_i = X)$, which is the probability of treatment given $X_i = X$. The propensity score theorem says that if $(y_{1i}, y_{0i}) \perp d_i \mid X_i$ then $(y_{1i}, y_{0i}) \perp d_i \mid P(X_i)$. To demonstrate this we will show that $P[d_i = 1 \mid y_{1i}, y_{0i}, P(X_i)] = P(X_i)$, implying independence of $d_i$ and counterfactual outcomes:

$$\begin{aligned}
P[d_i = 1 \mid y_{1i}, y_{0i}, P(X_i)] &= E[d_i \mid y_{1i}, y_{0i}, P(X_i)] \\
&= E\{E[d_i \mid y_{1i}, y_{0i}, P(X_i), X_i] \mid y_{1i}, y_{0i}, P(X_i)\} \\
&= E\{E[d_i \mid y_{1i}, y_{0i}, X_i] \mid y_{1i}, y_{0i}, P(X_i)\} \\
&= E\{E[d_i \mid X_i] \mid y_{1i}, y_{0i}, P(X_i)\} \quad \text{(by the CIA)} \\
&= E\{P(X_i) \mid y_{1i}, y_{0i}, P(X_i)\} \\
&= P(X_i).
\end{aligned}$$

In practice, we will not know the propensity score. But it can be estimated in a first step by a logit or probit regression on the relevant covariates, and possibly interactions of the covariates. We then proceed using the estimated propensity score $\hat{P}(X_i)$. There are various ways to construct estimators of treatment effects based on matching on the propensity score; see, for example, Imbens (2004) for a survey. One upshot from the literature on matching is that the particular matching estimator used seems to play a relatively small role for matching estimates. What is important is the choice of covariates (the CIA really needs to hold) and imposing common support.
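Returning to the discrete-covariate case, the matching estimator for the effect of treatment on the treated, together with the common-support restriction, can be sketched as follows. This is an illustrative sketch only: the function name `matching_att` and the toy data are hypothetical and not taken from the notes. Cells that contain only treated or only untreated observations are dropped, so the estimate refers to the treated within the common support.

```python
import numpy as np

def matching_att(y, d, x):
    """Exact matching on a discrete covariate x.

    Computes the cell-level differences in means (treated minus untreated)
    and averages them using the distribution of x among the treated,
    restricted to cells with both treated and untreated observations.
    """
    y, d, x = np.asarray(y, float), np.asarray(d), np.asarray(x)
    weighted_sum, n_treated = 0.0, 0
    for value in np.unique(x):
        treated = (x == value) & (d == 1)
        control = (x == value) & (d == 0)
        if treated.sum() == 0 or control.sum() == 0:
            continue  # cell lies outside the common support
        delta_x = y[treated].mean() - y[control].mean()
        weighted_sum += delta_x * treated.sum()
        n_treated += treated.sum()
    if n_treated == 0:
        raise ValueError("no covariate cell has both treated and untreated observations")
    return weighted_sum / n_treated

if __name__ == "__main__":
    # Toy data: the cell x == 2 contains only treated observations and is dropped.
    x = [0, 0, 0, 0, 1, 1, 1, 2, 2]
    d = [1, 0, 0, 1, 1, 1, 0, 1, 1]
    y = [3, 1, 1, 5, 6, 8, 5, 9, 9]
    print(matching_att(y, d, x))
```

In the toy data the cell differences are 3 (for x = 0) and 2 (for x = 1), each with two treated observations, so the estimate is 2.5.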
One useful way to think about propensity score matching estimates is the following weighting approach. It exploits the fact that the CIA implies

$$E\left[\frac{y_i d_i}{P(X_i)}\right] = E[y_{1i}], \qquad E\left[\frac{y_i (1 - d_i)}{1 - P(X_i)}\right] = E[y_{0i}].$$
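These two identities can be checked in a small simulation. The sketch below is illustrative and not from the original notes: the data-generating process and the function name `ipw_check` are hypothetical, and the true propensity score is treated as known rather than estimated. Treatment depends on the covariate only, so the CIA holds by construction.

```python
import numpy as np

def ipw_check(seed=0, n=200_000):
    """Compare inverse-probability-weighted means with the (normally
    unobservable) sample means of the potential outcomes."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 3, size=n)            # hypothetical discrete covariate: 0, 1, 2
    p = np.array([0.2, 0.5, 0.8])[x]          # true propensity score, assumed known
    d = (rng.random(n) < p).astype(float)     # treatment depends on x only (CIA holds)
    y0 = x + rng.normal(size=n)
    y1 = y0 + 1.0 + 0.5 * x                   # heterogeneous treatment effect
    y = d * y1 + (1 - d) * y0                 # observed outcome
    return {
        "ipw_y1": np.mean(y * d / p),
        "mean_y1": y1.mean(),
        "ipw_y0": np.mean(y * (1 - d) / (1 - p)),
        "mean_y0": y0.mean(),
    }

if __name__ == "__main__":
    for name, value in ipw_check().items():
        print(f"{name}: {value:.3f}")
```

The weighted means of the observed outcome should be close to the means of the potential outcomes, even though the raw means of $y_i$ among the treated and untreated are not.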

Therefore, given a scheme for estimating $P(X_i)$, we can construct estimates of the average treatment effect from the sample analog of

$$E[y_{1i} - y_{0i}] = E\left[\frac{y_i d_i}{P(X_i)} - \frac{y_i (1 - d_i)}{1 - P(X_i)}\right] = E\left[\frac{(d_i - P(X_i))\, y_i}{(1 - P(X_i))\, P(X_i)}\right]. \qquad (1)$$

We can similarly calculate the effect of treatment on the treated from the sample analog of

$$E[y_{1i} - y_{0i} \mid d_i = 1] = E\left[\frac{(d_i - P(X_i))\, y_i}{(1 - P(X_i))\, P(d_i = 1)}\right]. \qquad (2)$$

The idea that you can correct for non-random sampling by weighting by the reciprocal of the probability of selection dates back to Horvitz and Thompson (1952). The Horvitz-Thompson version of the propensity-score approach is appealing since the estimator is essentially automated, with no cumbersome matching required.

The Horvitz-Thompson approach also highlights the close link between propensity-score matching and regression. Think of the regression of $y_i$ on $d_i$, controlling for a saturated model for covariates. This estimator can be written

$$\delta_R = \frac{E[(d_i - P(X_i))\, y_i]}{E[P(X_i)(1 - P(X_i))]}. \qquad (3)$$

The two Horvitz-Thompson matching estimators and the regression estimator are all members of the class of weighted average estimators considered by Hirano, Imbens, and Ridder (2003):

$$E\left\{ g(X_i) \left[\frac{y_i d_i}{P(X_i)} - \frac{y_i (1 - d_i)}{1 - P(X_i)}\right] \right\}, \qquad (4)$$

where $g(X_i)$ is a known weighting function. The weighting functions are

$g(X_i) = 1$: average treatment effect
$g(X_i) = \dfrac{P(X_i)}{P(d_i = 1)}$: effect on the treated
$g(X_i) = \dfrac{P(X_i)(1 - P(X_i))}{E[P(X_i)(1 - P(X_i))]}$: regression.

This highlights that regression will not recover the treatment effect on the treated but a differently weighted average treatment effect. The treatment effect on the treated puts high weight on the part of the support of $X_i$ where there are many treated observations (and hence $P(X_i)$ is close to 1). OLS is the efficient estimator if the treatment effect is the same for all covariate values. To achieve this efficiency, OLS puts maximum weight on the observations where $P(X_i)(1 - P(X_i))$ is large, i.e. where $P(X_i)$ is close to 0.5. This high variance area of the data carries the most information on the treatment effect. A similar analogy holds between regression and direct covariate matching.

Regression versus Matching

Given this close analogy between matching and regression, is it worthwhile going down the matching route? Or will simple OLS regression give you a satisfactory answer in many cases, even if you suspect that treatment effects may vary a lot in the population? I believe that regression will often suffice, and it should always be the first line of attack. Proponents of matching (be it directly on the covariates or via the propensity score) have recently stressed that the key advantage of matching is that it forces you to consider the common support problem. Relying on regression may inadvertently lead you to extrapolate a lot from the cells that contain only treated or only control observations.

In order to investigate this it is useful to consider the empirical example that has played a large role in the diffusion of matching methods in econometrics: the evaluation of the National Supported Work (NSW) Demonstration, originally analyzed by Lalonde (1986). Lalonde compared the experimental results from the NSW study to those obtained with alternative methods including regression, using non-experimental control groups drawn from standard data sets like the CPS. Lalonde's conclusions were rather negative, finding that the non-experimental methods yielded variable results which were often not particularly close to the experimental estimates. Dehejia and Wahba (1999) reanalyzed the Lalonde data and found that they could replicate the experimental results well with matching methods using the propensity score.
The NSW Demonstration was a program which provided work experience to individuals with particular social and economic problems, like previous unemployment. It was evaluated using a random assignment experiment. Lalonde (1986) created three potential control groups from the CPS (matched to Social Security earnings records) for the NSW treatment group. The first is the raw CPS sample (CPS-1), which is fairly representative of the population. The other two are subsamples which were selected to mirror the characteristics of NSW enrollees more closely, based, for example, on previous unemployment experience. We will present some results from both the broad (CPS-1) and the narrowest (CPS-3) comparison group. All samples are limited to men and to those observations for whom both 1974 and 1975 (pre-program) earnings are available.

Table (from Dehejia and Wahba, 1999) shows the means of some demographic characteristics and of earnings before the program for the treatment group as well as for the various control groups. The table demonstrates that the NSW program group (and the experimental control group) is younger, less educated, more likely minority, and has much lower earnings than the general population (the CPS-1 sample). The CPS-3 sample matches the treatment group more closely but still shows some differences, particularly in terms of race and pre-program earnings.

Table displays results from various regression estimators of the NSW treatment effect, using annual 1978 earnings as the outcome variable. The estimates using the experimental control group in column 1 are in the order of \$1,600 to \$1,800. As would be expected, these estimates vary little depending on the specification. Column 2 displays results for the CPS-1 sample. The raw difference in earnings between NSW participants and the CPS sample is -\$8,500, indicating that NSW participants earn substantially less than the members of the CPS comparison sample. This simply reflects the large selection bias present in the naive comparison with this sample. Successively including demographics and pre-program earnings narrows the gap, and the treatment effect rises to a positive \$800 in the last row. Results are slightly better in column 3, where we use the CPS-3 control group. The characteristics of this group are much closer to the NSW treatment group, and the raw difference in earnings is only -\$600. The estimate in the last row is close to \$1,400, not far from the experimental treatment effect.

The last two columns of the table repeat the exercise using the method advocated by Imbens (2007).
He suggests starting by estimating the propensity score, and then limiting the sample to one with enough overlap, so that $0.1 < P(X_i) < 0.9$. Once this is done, Imbens (2007) finds little difference in treatment effects from regression or matching estimators using the same NSW data. He concludes that imposing overlap is important, while the particular estimator matters little once this is done. We therefore implement this idea by first selecting the sample, and then running simple regressions within the selected sample again. The same covariates are used in the calculation of the propensity score and in the second step regression.

Estimates are displayed in the final two columns of table 2. With just demographics or just 1975 earnings the Imbens style estimates differ little from the regression estimates in the earlier columns. However, once these covariates are combined, the treatment effect estimates in the CPS-1 sample are somewhat closer to the experimental estimates than the pure regression estimates (final two rows). This is not true for the CPS-3 sample. This indicates that prior sample selection is not really necessary once we use the propensity score method to select the sample, and, in fact, it may be detrimental.

The results also highlight that the $0.1 < P(X_i) < 0.9$ rule may result in an empty sample. This happens in the case where we use only the 1975 earnings for the CPS-1 sample. Since this indicates that the set of covariates is not powerful enough to create sufficient variation in the propensity score, it is a useful warning sign about the covariates. Table 3.32 also displays the means in the CPS-1 sample after pre-selecting on the propensity score using the full set of covariates. The comparison with the experimental group demonstrates that imposing common support does a good job in balancing most covariates.

What to take away from this exercise? First and foremost, it is impressive what a good job regression does in the CPS-1 sample with a fairly limited number of covariates (and without any non-linearities or interaction terms other than age squared). Clearly this sample is extremely different from the treatment group, so the potential for extrapolating from outside the common support seems great in this example. Nevertheless, the estimate in the last row closes 90% of the gap in the raw data. Preselecting the sample by emulating the program admission criteria (as in the case of the CPS-3 sample) yields even better estimates, as good as the propensity score method. This preselection seems like a sensible route since there is little a priori reason to start with the CPS-1 sample.
The second finding is that the choice of covariates is more important than the choice of estimation method, and than the choice of the original control sample once enough covariates are being used (compare differences across rows versus differences across columns).

Finally, the propensity score method is able to improve on regression with the CPS-1 sample. The estimate of \$1,400 may be significantly different from \$800 for policy purposes (although the difference is not huge compared to the standard errors). For example, these two estimates might yield different conclusions from a cost-benefit calculation for the program. This means that matching methods may have a role in a carefully designed study of treatment effects. Nevertheless, this should not distract from the fact that regression should play an important role in evaluating programs like the NSW: it is simple to implement and hence to replicate, it is transparent, and it squarely puts the focus on the key issue of covariate selection, rather than other estimation issues. Hence, regression should be the starting point of any analysis invoking the CIA, and often may be the final word.

1.2 IV

In analogy to the previous discussion we can ask what IV estimates when the treatment effect is heterogeneous. In order to gain insight on this, consider the simplest case of a binary instrument and a binary endogenous regressor. Return to the IV assumptions we discussed earlier:

Assumption 1 (Random assignment) $z$ is as good as randomly assigned.

Assumption 2 (Exclusion) $Y(z, D) = Y(z', D) \quad \forall z, z', D$.

Assumption 3 (First stage) $E(D(1) - D(0)) \neq 0$.

To these assumptions add:

Assumption 4 (Monotonicity) $D_i(1) - D_i(0) \geq 0 \quad \forall i$.

The fourth assumption says that we need the instrument to act similarly on all observations: if it raises the probability of treatment for some individuals, it can't lower it for others (alternatively we could have $D_i(1) - D_i(0) \leq 0$). With these assumptions we get the LATE theorem (Imbens and Angrist, 1994):

$$\frac{E(y_i \mid z_i = 1) - E(y_i \mid z_i = 0)}{E(d_i \mid z_i = 1) - E(d_i \mid z_i = 0)} = E[y_{1i} - y_{0i} \mid d_i(1) > d_i(0)].$$

The LATE theorem says that IV estimates the average effect of the treatment among the subpopulation of individuals for whom $d_i(1) > d_i(0)$. Because the inequality is strict, this subpopulation consists of those whose treatment status is changed by the instrument. Hence, it is a local average treatment effect, and it depends on the particular instrument being used. With heterogeneous treatment effects different instruments will in general identify different LATEs. One implication of this is that there is no overidentification test: the over-id test in essence tells us whether the LATEs for different instruments are the same, so it becomes a test of homogeneity.

A quick proof of the LATE theorem is as follows, writing $d_{1i}$ for $d_i(1)$ and $d_{0i}$ for $d_i(0)$. The exclusion restriction lets us write $E(y_i \mid z_i = 1) = E[y_{0i} + (y_{1i} - y_{0i}) d_i \mid z_i = 1]$, which equals $E[y_{0i} + (y_{1i} - y_{0i}) d_{1i}]$ by random assignment. Similarly $E(y_i \mid z_i = 0) = E[y_{0i} + (y_{1i} - y_{0i}) d_{0i}]$. Hence

$$\begin{aligned}
E(y_i \mid z_i = 1) - E(y_i \mid z_i = 0) &= E[y_{0i} + (y_{1i} - y_{0i}) d_{1i}] - E[y_{0i} + (y_{1i} - y_{0i}) d_{0i}] \\
&= E[(y_{1i} - y_{0i})(d_{1i} - d_{0i})].
\end{aligned}$$

By monotonicity $d_{1i} - d_{0i}$ equals either zero or one, so that

$$E[(y_{1i} - y_{0i})(d_{1i} - d_{0i})] = E[y_{1i} - y_{0i} \mid d_{1i} > d_{0i}] \, P(d_{1i} > d_{0i}).$$

A similar argument shows that $E(d_i \mid z_i = 1) - E(d_i \mid z_i = 0) = P(d_{1i} > d_{0i})$, which completes the proof.

In order to understand this result, it is useful to consider four subpopulations:

                 d_{1i} = 0      d_{1i} = 1
d_{0i} = 0       never-taker     complier
d_{0i} = 1       defier          always-taker

Never-takers are never treated, regardless of the value of the instrument. Always-takers always take the treatment, also regardless of the instrument. Since the treatment status of never-takers and always-takers doesn't change, they do not contribute to the estimate of the treatment effect (because of the exclusion restriction). Hence $E(y_i \mid z_i = 1) - E(y_i \mid z_i = 0)$ is only affected by the compliers and defiers, the groups which have their treatment status changed by the instrument. Notice that compliers have their treatment status switched on, and defiers have their treatment status switched off, as the instrument changes from 0 to 1. The estimate $E(y_i \mid z_i = 1) - E(y_i \mid z_i = 0)$ will therefore combine the treatment effects of the two groups, with the defiers entering with the opposite sign. Since we don't know the relative size of these (unobserved) subpopulations, it would be difficult to interpret this combination. The monotonicity assumption rules out the defiers. Hence, the effect $E(y_i \mid z_i = 1) - E(y_i \mid z_i = 0)$ is just due to the compliers, i.e. the group with $d_{1i} > d_{0i}$. Hence, IV estimates the treatment effect for those who comply with the instrument.

1.3 Internal versus external validity

The key goal of evaluation research is to estimate the causal effect of a treatment, and to enhance our understanding of the treatment in question. The first goal in analyzing an evaluation question is always to avoid selection bias. To put it differently, the first goal is the internal validity of the estimates.

Internal validity means that we actually get an estimate of the causal effect of the treatment under study. Random assignment experiments tend to score well on internal validity. The Tennessee STAR experiment is likely to deliver internally valid estimates of the effect of sending children to classrooms of size 15 compared to 22, for very young children in Tennessee in the mid 1980s. Of course, this question is not of as much interest as the broader question: do smaller classes raise student achievement? For example, are the STAR estimates valid for the neighboring state of Kentucky, or for countries other than the US? Do they still hold in 2005? Are the effects the same in 8th grade? Are the effects the same for reducing class size from 37 to 30 students? These are questions of external validity. We like to generalize and extrapolate from what we have learned in one setting. Whether we can do that is a question of the external validity of the estimates.

Random assignment experiments are designed to overcome threats to internal validity. The external validity of an experiment may or may not be good, depending on how special the setting is. In medical drug trials, external validity may not be a big issue. If I try a cancer drug on a large enough group of cancer patients, the results are likely to hold for similar cancer patients elsewhere as well. In social experiments this may be less likely. If the effect of class size is non-linear, the STAR results may not hold for larger class sizes. If the class size effect is heterogeneous, the STAR results may not hold in states with a very different make-up of the student population. If class size matters because students in big classes are more likely to be disruptive, and whether a student is disruptive depends on their age, then the STAR experiment may not be informative for older students, etc.
